
{{Project|Has project output=Data,Tool,How-to|Has image= |Has title= USPTO Assignees Bulk Data Processing|Has owner=|Has start date=|Has deadline=|Has keywords=Data |Has sponsor=McNair Center|Has notes=|Has project status=Subsume|Is dependent on=|Does subsume=}}
We would like to download and absorb data from this location on the USPTO website into our tables. The objective is to determine whether this dataset is better than the current version of our patent data (a combination of the data in the patent_2015 and patentdata databases; see [[Patent Data]]).
<section begin=bulk />The USPTO provides bulk data recording patent transactions, applications, properties, reassignments, and history to the general public as XML files. These files have been downloaded, and the data has been compiled into tables using PostgreSQL. The objective of processing the bulk data is to enhance the McNair Center's historical datasets ([[Patent Data Processing - SQL Steps|patent_2015 and patentdata]]) and track the entirety of US patent activity, specifically concerning utility patents.<section end=bulk />

== Steps Followed to Extract the USPTO Assignees Data ==
===Extracting Data from XML Files ===
== Scripts for processing data ==
The programs/scripts (see details below) are located on our [[Software Repository|Bonobo Git Server]]:
repository: Patent_Data_Parser
branch: next
directory: /uspto_assignees_xml_parser
file: USPTO_Assignee_Download.pl
 
The XML files are available at https://bulkdata.uspto.gov/data2/patent/assignment/
The downloader script used to download the XML files is essentially the same, with minor changes, as the one used to download the USPTO patent data.
That is, the current version of the downloader script downloads all files from the base URL above: https://bulkdata.uspto.gov/data2/patent/assignment/
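The downloader's core task, collecting every file link from the bulk-data index page and fetching each one, can be sketched as follows. This is a minimal Python sketch for illustration only: the actual script is Perl, and the `.zip` filename pattern is an assumption about the index page's layout, not a documented API.

```python
import re
import urllib.request

BASE_URL = "https://bulkdata.uspto.gov/data2/patent/assignment/"

def extract_zip_links(html):
    """Return the .zip filenames referenced in an index page.

    The assignment bulk files are published as zip archives; this
    regex simply collects every href ending in .zip (an assumption
    about the page layout).
    """
    return re.findall(r'href="([^"]+\.zip)"', html)

def download_all(dest_dir="."):
    """Fetch the index page and download every listed zip file."""
    with urllib.request.urlopen(BASE_URL) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    for name in extract_zip_links(html):
        urllib.request.urlretrieve(BASE_URL + name, f"{dest_dir}/{name}")
```

`extract_zip_links` is kept separate from the network call so the link-harvesting logic can be tested against a saved copy of the index page.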
=== Parsing the XML files ===
==== NAME ====
uspto_assignees_XML_parser.plx - Parses XML files and populates a database. Specifically, parses every file in a directory according to a schema (see above), then populates the corresponding columns of a database on the RDP.
==== SYNOPSIS ====
==== USAGE & FEATURES ====
 
'''Arguments'''
The full path to the directory is provided as a command-line argument. The directory should contain the XML files to parse and no other files.
This path should be specified in Windows format (with '\') and NOT Unix format.
 
'''Features and Effects'''
As each XML file is parsed, a database on localhost (the RDP) is populated. If an error occurs at any point (for example, a particular XML file is invalid, or a psql statement cannot be executed), the program aborts with a message.
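The parse-and-abort-on-error behavior described above can be sketched like this. It is a Python sketch of the control flow only: the real script is Perl, `insert_record` is a hypothetical stub standing in for the psql INSERT, and the `patent-assignment` element name is an assumption about the schema.

```python
import sys
import xml.etree.ElementTree as ET
from pathlib import Path

def insert_record(elem):
    """Stub for the database load of one record.

    In the real script this issues an INSERT against the local
    PostgreSQL database; here it does nothing.
    """
    pass

def parse_directory(xml_dir):
    """Parse every XML file in xml_dir, aborting on the first error."""
    for path in sorted(Path(xml_dir).iterdir()):
        try:
            root = ET.parse(path).getroot()
        except ET.ParseError as err:
            # Abort with a message, as the real parser does on bad XML.
            sys.exit(f"Aborting: invalid XML in {path}: {err}")
        for record in root.iter("patent-assignment"):
            insert_record(record)
```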
 
We choose to populate a local database because remote connections are too slow. The database is eventually moved to the database server manually.
 
==== TESTS ====
The first version does the job as expected. It was used to populate the assignees database by parsing the XML files from the USPTO (see above).
We parsed all XML files dated through 7/4/2016.
 
==== TO DO ====
*Add more command line options to improve usability.
*Improve portability to allow Unix/Linux pathnames. This is straightforward to do with Perl modules File::Basename and File::Spec.
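The portability fix amounts to building and splitting paths with a platform-neutral API instead of hard-coding '\'. In Perl that role is played by File::Spec and File::Basename; the same idea in Python (shown here purely as an illustrative sketch) uses pathlib's pure-path classes:

```python
from pathlib import PurePosixPath, PureWindowsPath

def path_parts(path_str):
    """Split a path given in either Windows or Unix format.

    Detecting the separator and choosing the matching pure-path class
    lets the parser accept its directory argument in either format
    (a sketch of the TO-DO item, not the script's current behavior).
    """
    if "\\" in path_str:
        return PureWindowsPath(path_str).parts
    return PurePosixPath(path_str).parts
```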
