Shoeb Mohammed (Work Log)

2016-07-18T20:00:06Z

ShoebMohammed:

[[Category:Work Log]]
[[Shoeb Mohammed]] [[Work Logs]] [[Shoeb Mohammed (Work Log) | (Work Log)]]

*07/15/2016
**Started work log and research plan.
**Review and profile the Matcher perl code.
**Read tutorials on using Selenium with perl. Below links are good to get started
https://www.usenix.org/system/files/login/articles/login_apr15_09_blank-edelman.pdf
https://www.usenix.org/system/files/login/articles/blank-edelman_0.pdf

*07/18/2016
**Completed listing page on the wiki for the software repository.
**Finished installing linux for code development.

Shoeb Mohammed (Work Log)

2016-07-18T19:36:50Z

ShoebMohammed:

2016-07-15T16:11:53Z

ShoebMohammed:

[[Category: McNair Admin]]
==Dylan Dickens==
{{:Dylan Dickens (Work Log)}}

==Jake Silberman==
{{:Jake Silberman (Work Log)}}

==Marcela Interiano==
{{:Marcela Interiano (Work Log)}}

==Veeral Shah==
{{:Veeral Shah (Work Log)}}

==Ariel Sun==
{{:Ariel Sun (Work Log)}}

==Gunny Liu==
{{:Gunny Liu (Work Log)}}

[[admin_classification::General Information| ]]

==Ben Baldazo==

{{:Ben Baldazo (Work Log)}}

==Shoeb Mohammed==
{{:Shoeb Mohammed (Work Log)}}

2016-07-05T20:37:21Z

ShoebMohammed: /* Scripts for processing data */

== USPTO Assignees Data ==

We would like to download and absorb data from this location on the USPTO website into our tables. The objective is to determine whether this dataset is better than the current version of our patent data (a combination of the data in the patent_2015 and patentdata databases.

== Steps Followed to Extract the Data ==

===Extracting Data from XML Files ===

All the historical USPTO data is available as XML files. Here is the tree structure for the XML files:

<patent-assignment>
+<assignment-record>
+<patent-assignors>
+<patent-assignees>
+<patent-properties>
</patent-assignment>

Each of the above internal nodes is mandatory, and is a logical grouping of information fields. Each node has a corresponding table created with more or less the same fields as the XML elements.

Corresponding tables are:
*assignment-records : assignment
*patent-assignors : assignors
*patent-assignees : assignees
*patent-properties : properties

Additionally, for each file that is downloaded, there are some associated specs. All of these are stored in the PatentAssignment table. Here is the data model diagram.

==== Assignment Records ====

The fields in the assignment record are:
* last_update_date
* purge_indicator
* recorded_date
* correspondent_name
* correspondent_address_1
* correspondent_address_2
* correspondent_address_3
* correspondent_address_4
* conveyance_text

Here is the corresponding XML that we are mapping:

-<assignment-record>
<reel-no>27132</reel-no>
<frame-no>841</frame-no>
-<last-update-date>
<date>20160122</date>
</last-update-date>
<purge-indicator>N</purge-indicator>
-<recorded-date>
<date>20111027</date>
</recorded-date>
<page-count>2</page-count>
-<correspondent>
<name>DOUGLAS B. MCKNIGHT</name>
<address-1>595 MINER ROAD</address-1>
<address-2>INTELLECTUAL PROPERTY & STANDARDS</address-2>
<address-3>CLEVELAND, OH 44143</address-3>
</correspondent>
<conveyance-text>ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS).</conveyance-text>
</assignment-record>

==== Assignors ====

Here are the columns in the assignors table:
* reel_no
* frame_no
* assignor_name
* execution_date

The corresponding XML node is :

-<patent-assignors>
-<patent-assignor>
<name>WALKER, MATTHEW J.</name>
-<execution-date>
<date>20090512</date>
</execution-date>
</patent-assignor>
-<patent-assignor>
<name>OLSZEWSKI, MARK E.</name>
-<execution-date>
<date>20090512</date>
</execution-date>
</patent-assignor>
</patent-assignors>

==== Assignees ====

Here are the columns in the assignees table:

* reel_no
* frame_no
* assignee_name
* assignee_address_1
* assignee_address_2
* assignee_city
* assignee_state
* assignee_country
* assignee_postcode

The corresponding XML nodes are:

-<patent-assignees>
-<patent-assignee>
<name>KONINKLIJKE PHILIPS ELECTRONICS N V</name>
<address-1>GROENEWOUDSEWEG 1</address-1>
<city>EINDHOVEN</city>
<country-name>NETHERLANDS</country-name>
<postcode>5621 BA</postcode>
</patent-assignee>
</patent-assignees>

==== Patent Properties ====

Here are the columns in the properties table:

* reel_no
* frame_no
* documentid
* country
* kind
* filingdate
* invention_title

The corresponding XML segment would be:

-<patent-properties>
-<patent-property>
-<document-id>
<country>US</country>
<doc-number>14143589</doc-number>
<kind>X0</kind>
<date>20131230</date>
</document-id>
-<document-id>
<country>US</country>
<doc-number>20140260305</doc-number>
<kind>A1</kind>
<date>20140918</date>
</document-id>
<invention-title lang="en">LEAN AZIMUTHAL FLAME COMBUSTOR</invention-title>
</patent-property>
</patent-properties>

Patent properties have a many-to-one relationship : one patent can have more than one properties.
Note: We are not sure what documents with kind 'X0' say

==== Patent Assignment ====

Every XML file download has some fields associated with it, in addition to a number of patent assignment nodes.

Here are the columns in the table:

* reel_no
* frame_no
* action_key_code
* USPTO_Transaction_Date
* USPTO_Date_Produced
* version

Here is what the XML in a downloaded file looks like:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-assignments>
-<us-patent-assignments date-produced="20131101" dtd-version="1.0">
<action-key-code>DA</action-key-code>
-<transaction-date>
<date>20160122</date>
</transaction-date>
-<patent-assignments>
+<patent-assignment>
+<patent-assignment>
+<patent-assignment>
+<patent-assignment>
+<patent-assignment>
+<patent-assignment>
.
.
.
</patent-assignments>
</us-patent-assignments>

====DTD====
Here is the DTD specified by the USPTO, which specifies optional fields and :

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE us-patent-assignments [<!ELEMENT us-patent-assignments (action-key-code, transaction-date, patent-assignments)>
<!ATTLIST us-patent-assignments dtd-version CDATA #IMPLIED
date-produced CDATA #IMPLIED>
<!ELEMENT action-key-code (#PCDATA)>
<!ELEMENT transaction-date (date)>
<!ELEMENT patent-assignments (data-available-code | patent-assignment+)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT data-available-code (#PCDATA)>
<!ELEMENT patent-assignment (assignment-record, patent-assignors, patent-assignees, patent-properties)>
<!ELEMENT assignment-record (reel-no, frame-no, last-update-date, purge-indicator, recorded-date, page-count?, correspondent, conveyance-text)>
<!ELEMENT patent-assignors (patent-assignor+)>
<!ELEMENT patent-assignees (patent-assignee+)>
<!ELEMENT patent-properties (patent-property+)>
<!ELEMENT reel-no (#PCDATA)>
<!ELEMENT frame-no (#PCDATA)>
<!ELEMENT last-update-date (date)>
<!ELEMENT purge-indicator (#PCDATA)>
<!ELEMENT recorded-date (date)>
<!ELEMENT page-count (#PCDATA)>
<!ELEMENT correspondent (name, address-1?, address-2?, address-3?, address-4?)>
<!ELEMENT conveyance-text (#PCDATA)>
<!ELEMENT patent-assignor (name, execution-date?, date-acknowledged?)>
<!ELEMENT patent-assignee (name, address-1?, address-2?, city?, state?, country-name?, postcode?)>
<!ELEMENT patent-property (document-id*, invention-title?)>
<!ELEMENT name (#PCDATA)>
<!ATTLIST name name-type (natural | legal) #IMPLIED>
<!ELEMENT address-1 (#PCDATA)>
<!ELEMENT address-2 (#PCDATA)>
<!ELEMENT address-3 (#PCDATA)>
<!ELEMENT address-4 (#PCDATA)>
<!ELEMENT execution-date (date)>
<!ELEMENT date-acknowledged (date)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT country-name (#PCDATA)>
<!ELEMENT postcode (#PCDATA)>
<!ELEMENT document-id (country, doc-number, kind?, name?, date?)>
<!ELEMENT invention-title (#PCDATA | b | i | u | sup | sub)*>
<!ATTLIST invention-title id ID #IMPLIED
lang CDATA #REQUIRED>
<!ELEMENT country (#PCDATA)>
<!ELEMENT doc-number (#PCDATA)>
<!ELEMENT kind (#PCDATA)>

<!ELEMENT b (#PCDATA | i | u | smallcaps)*>

<!ELEMENT i (#PCDATA | b | u | smallcaps)*>

<!ELEMENT u (#PCDATA | b | i | smallcaps)*>
<!ATTLIST u style (single | double | dash | dots ) 'single' >

<!ELEMENT sup (#PCDATA | b | u | i)*>

<!ELEMENT sub (#PCDATA | b | u | i)*>

<!ELEMENT smallcaps (#PCDATA | b | u | i)*>
]>

===Inserting Extracted Data into Tables ===

===Clean Up ===

== Scripts for processing data ==
The programs/scripts are located at our [[Software Repository]].
repository: Patent_Data_Parser
branch: next
directory: /uspto_assignees_xml_parser

=== Downloading raw bulk data from USPTO ===
repository: Patent_Data_Parser
branch: next
directory: /uspto_assignees_xml_parser
file: USPTO_Assignee_Download.pl

The XML files are available at https://bulkdata.uspto.gov/data2/patent/assignment/

The down-loader script used to download XML files is essentially same, with minor changes, as the one used for downloading USPTO patent-data.
That is, the current version of down-loader script downloads all files from the base URL (see above).

=== Parsing the XML files ===
repository: Patent_Data_Parser
branch: next
directory: /uspto_assignees_xml_parser
file: uspto_assignees_XML_parser.plx

==== NAME ====

uspto_assignees_XML_parser.plx - Parses XML files and populates a database.

Specifically, parses every file in a directory according to a schema (see above).
Then populates a database on the RDP.

==== SYNOPSIS ====

perl uspto_assignees_XML_parser.plx /path/to/directory_containing_XML_files

==== USAGE & FEATURES ====

'''Arguments'''
The full path to directory is provided as a command line argument. It should contain the XML files to parse and no other file.
This path should be specified in Windows format (with '\') and NOT unix format.

'''Features and Effects'''
As each XML file is parsed, a database on local host (RDP) is populated. If at any point there is an error, for example a particular
XML file is bad/invalid or the psql statement cannot be executed, the program aborts with a message.

We choose to populate local database because remote connections are too slow. The database is eventually moved to DataBase server manually.

==== TESTS ====
The first version does the job as expected. It was used to populate the assignees database by parsing XML files from USPTO(see above).
We parsed all XML files dated till 7/4/2016.

==== TO DO ====
*Add more command line options to improve usability.
*Improve portability to allow Unix/Linux pathnames. This is straightforward to do with Perl modules File::Basename and File::Spec.