Changes

Jump to navigation Jump to search
no edit summary
[[Category:Internal]]{{Project|Has project output=|Has sponsor=McNair Center|Has title=Patent Data Processing - SQL Steps|Has owner=Shelby Bice,|Has project status=Subsume|Has keywords=Tool}}Return to [[Internal Classification::Legacy| Patent Data (Wiki Page)]].
== Objective ==
The schema for the assignees table in '''patentdata''' database is:
Column | Type | Modifiers -------------+-------------------+----------- patent | integer | asgtype | integer | assignee | character varying | city | character varying | state | character varying | country | character varying | nationality | character varying | residence | character varying | asgseq | integer |
The schema for the assignees table in patent_2015 is :
Column | Type | Modifiers
- --------------+---------+-----------
lastname | text |
firstname | text |
'''Overlapping Columns'''
patent_2015 | patentdata --------------+-------------- orgname | assignee city | city country | country patentnumber | patent state | state
These columns will have entries for most rows in the table, because they exist in both tables. The rest of the columns will be populated based on which table the row is coming from.
'''Final Schema'''
 
Table "public.assignees"
Column | Type | Modifiers ---------------+-------------------+----------- lastname | character varying | firstname | character varying | address | character varying | postcode | character varying | orgname | character varying | city | character varying | country | character varying | patentnumber | integer | state | character varying | patentcountry | character varying | nationality2 | character varying | residence | character varying | asgseq | integer | asgtype | integer |
'''Non-overlapping Columns'''
'''Patent_2015 Schema:'''
Column | Type | Modifiers
Additionally, three columns - nber, uspc, uspc_sub have been added from the historicalpatentdata, a table built from data downloaded from the USPTO Bulk Data Storage. The join was executed on the patent number.
 
Note : The addition, deletion of columns as through separate [[Patent Data Cleanup - June 2016 |scripts]], therefore the scripts below will be slightly discrepant.
==== Index and Key Creation ====
INSERT INTO patents_merged
(
SELECT patent, kind, gdate, 'NULL', 'NULL', NULL, NULL, 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', -1, 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', claims, apptype, appnum, gyear, appdate, appyear FROM patents ); -- RESULT : INSERT 0 3984771
patent_2015:
COPY SCRIPTS:
patentdata:
\COPY patents_merged TO '/tmp/merged_patents_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV; --COPY 3984771
patent_2015:
\COPY patents_merged TO '/tmp/merged_patents_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV; --COPY 1646225
PATENTS TABLE
\COPY patents FROM '/tmp/merged_patents_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV; -- RESULT : COPY 3984771 \COPY patents FROM '/tmp/merged_patents_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV; -- RESULT : COPY 1646225
====TESTING ====
select count(*) FROM (SELECT DISTINCT patentnumber FROM patents) AS t; --RESULT: 5411151 EXPECTED: 5426566
We found some copies of a few rows, where both the patent_2015 and patentdata
SELECT COUNT(*), * FROM patents GROUP BY patentnumber,
kind,
grantdate,
appdate,
appyear
HAVING COUNT(*) > 1;
SELECT patentnumber, count(*) FROM patents GROUP BY patentnumber HAVING count(*)>1; --7640598
SELECT * FROM patents op WHERE op.patentnumber IN
(
SELECT ip.patentnumber
HAVING COUNT(*)>1
)
ORDER BY op.patentnumber;
( SELECT * INTO patentsCleaned FROM patents op WHERE op.patentnumber IN
(
SELECT ip.patentnumber
HAVING COUNT(*)=1
)
ORDER BY op.patentnumber ) --SELECT 5191306
INSERT INTO patentsCleaned( SELECT * FROM patents op WHERE op.patentnumber IN
(
SELECT ip.patentnumber
HAVING COUNT(*)>1
)
AND op.applicationnumber NOT LIKE 'NULL' ORDER BY op.patentnumber );
--219845
====TESTING:==== allpatent=# select count(*) from patentsCleaned; count --------- 5411151 (1 row)
allpatent=# select count(*), patentnumber FROM patentsCleaned group by patentnumber having count(*) > 1; count | patentnumber -------+-------------- (0 rows)
== Citations==
 
In the citations table, we needed to define another function that would convert a textual patent number into a number (big int, since the patents number were exceeding the range of regular integers.)
To Extract Patents with Numbers Only and to Ignore Other RegExes
CREATE OR REPLACE FUNCTION cleanpatno (text) RETURNS bigint AS $$
if ($_[0]) {
my $var=$_[0];
$$ LANGUAGE plperl;
patentdata:
Column | Type | Modifiers
-------------+-------------------+-----------
patent | integer |
cit_date | date |
cit_name | character varying |
cit_kind | character varying |
cit_country | character varying |
citation | integer |
category | character varying |
citseq | integer |
SELECT patent as citingpatentnumber, citation AS citedpatentnumber
INTO citations_merged
FROM citations;
--SELECT 38452957
'''patentdata schema:'''  Column | Type | Modifiers ------------+-------------------+----------- patent | integer | cit_date | date | cit_name | character varying | cit_kind | character varying | cit_country | character varying | citation | integer | category | character varying | citseq | integer |   SELECT patent as citingpatentnumber, citation AS citedpatentnumber INTO citations_merged FROM citations; --SELECT 38452957 '''patent_2015schema:''' Column | Type | Modifiers ---------------------+---------+----------- citingpatentnumber | integer | citingpatentcountry | text | citedpatentnumber | text | citedpatentcountry | text |  SELECT CAST(citingpatentnumber AS bigint), CAST(cleanpatno( citedpatentnumber) AS bigint) as citedpatentnumber INTO citations_merged FROM citations; -- RESULT : SELECT 59227881
SELECT CAST(citingpatentnumber AS bigint), CAST(cleanpatno( citedpatentnumber) AS bigint) as citedpatentnumberINTO citations_mergedFROM citations;-- RESULT : SELECT 59227881'''Overlapping Columns'''
FINAL TABLE patent_2015 | patentdata | ---------------------+---------------+ citingpatentnumber | patent | citedpatentnumber | citation | ''' Combined Schema:'''
CREATE TABLE citations Column | Type | Modifiers(--------------------+--------+----------- citingpatentnumber | bigint,| citedpatentnumber | bigint);|
patentdata:
\COPY citations_merged TO '/tmp/merged_citations_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV; --COPY 38452957
patent_2015:
\COPY citations_merged TO '/tmp/merged_citations_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV; -- RESULT : COPY 59227881
allpatent:
\COPY citations FROM '/tmp/merged_citations_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV; --RESULT : COPY 59227881
\COPY citations FROM '/tmp/merged_citations_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV; --RESULT: COPY 38452957
== USPTO Consolidated Patent Data ==
ScriptsThe USPTO has a repository of patent data on their Bulk Data Storage system. We have this data downloaded and loaded into a table on the patent database. Here are the steps followed:* Download file from the BDS system - we have access to CSV files.* Create table with required specs* use the \COPY command to copy the data from the file into the table.
/* creating patent data tables from : https://bulkdataScript follows.uspto.gov/data2/patent/maintenancefee/*/
CREATE TABLE PatentMaintenanceFee(patentnumber varchar,applicationnumber int,smallentity varchar,filingdate date,grantissuedate date,maintenancefeedate date,maintenancefeecode varchar);'''Script:'''
\COPY PatentMaintenanceFee FROM ' /bulk* creating patent data tables from : https:/USPTO_Consolidated/MaintFeeEvents_20160613bulkdata.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;-- RESULT : COPY 14042059uspto.gov/data2/patent/maintenancefee/*/
/* creating tables for historical patent data - USPTO */ CREATE TABLE PatentMaintenanceFee( patentnumber varchar, applicationnumber int, smallentity varchar, filingdate date, grantissuedate date, maintenancefeedate date, maintenancefeecode varchar );
CREATE TABLE HistoricalPatentData(applicationid int,pubno varchar,patentnumber varchar,NBER int,USPC varchar,USPC_sub varchar,applicationdate date,prioritydate date,pubdate date,displaydate date,disptype varchar,exp_dt date,exp_dt_max date,pta int \COPY PatentMaintenanceFee FROM '/bulk/USPTO_Consolidated/MaintFeeEvents_20160613.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;); -- RESULT : COPY 14042059
\COPY historicalpatentdata FROM ' /bulk* creating tables for historical patent data - USPTO */USPTO_Consolidated/HistoricalFiles/historical_masterfile.csv' DELIMITER AS ',' HEADER NULL AS '' CSV;
CREATE TABLE HistoricalPatentData( applicationid int, pubno varchar, patentnumber varchar, NBER int, USPC varchar, USPC_sub varchar, applicationdate date, prioritydate date, pubdate date, displaydate date, disptype varchar, exp_dt date, exp_dt_max date, pta int );  \COPY historicalpatentdata FROM '/bulk/USPTO_Consolidated/HistoricalFiles/historical_masterfile.csv' DELIMITER AS ',' HEADER NULL AS '' CSV;  --COPY 11191813 [[Category:Internal]][[Internal Classification::Legacy| ]][[Category:Patent]]

Navigation menu