Difference between revisions of "Patent Data Extraction Scripts (Tool)"

From edegan.com

Revision as of 18:10, 7 June 2016

Utility patent grants fields

Patent

  • patent number
  • kind: http://www.uspto.gov/patents-application-process/patent-search/authority-files/uspto-kind-codes
  • grantdate

For version 4.5:

<publication-reference>
 <document-id>
  <country>US</country>
  <doc-number>08925112</doc-number>
  <kind>B2</kind>
  <date>20150106</date>
 </document-id>
</publication-reference>
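
As a rough illustration, the patent-level fields can be read from the sample above with Python's standard xml.etree.ElementTree. This is only a sketch: the function name and the dictionary keys are hypothetical, and it is not how the Perl scripts described later do it.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<publication-reference>
 <document-id>
  <country>US</country>
  <doc-number>08925112</doc-number>
  <kind>B2</kind>
  <date>20150106</date>
 </document-id>
</publication-reference>"""


def parse_publication_reference(elem):
    """Return the patent-level fields of a <publication-reference> element."""
    doc = elem.find("document-id")
    return {
        "patentnumber": doc.findtext("doc-number"),  # kept as text, leading zero included
        "country": doc.findtext("country"),
        "kind": doc.findtext("kind"),
        "grantdate": doc.findtext("date"),  # raw YYYYMMDD string
    }


patent = parse_publication_reference(ET.fromstring(SAMPLE))
print(patent)
```

Keeping the number as text preserves the leading zero, which matters if the number is later used as part of a key.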
  • type
  • applicationnumber
  • filingdate
<application-reference appl-type="utility">
 <document-id>
  <country>US</country>
  <doc-number>13824291</doc-number>
  <date>20110929</date>
 </document-id>
</application-reference>
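
A possible way to pull type, applicationnumber and filingdate out of the sample above is sketched below, assuming Python and its standard library. The function name and dictionary keys are hypothetical. The type comes from the appl-type attribute rather than from a child element.

```python
import datetime
import xml.etree.ElementTree as ET

SAMPLE = """<application-reference appl-type="utility">
 <document-id>
  <country>US</country>
  <doc-number>13824291</doc-number>
  <date>20110929</date>
 </document-id>
</application-reference>"""


def parse_application_reference(elem):
    """Return type, application number and filing date for one application."""
    doc = elem.find("document-id")
    # Dates in this sample are YYYYMMDD; convert to a real date object.
    filed = datetime.datetime.strptime(doc.findtext("date"), "%Y%m%d").date()
    return {
        "type": elem.get("appl-type"),
        "applicationnumber": doc.findtext("doc-number"),
        "filingdate": filed,
    }


application = parse_application_reference(ET.fromstring(SAMPLE))
print(application)
```

Strict date parsing like this would reject values such as 19140700, which appears in the cited-patent sample further down, so citation dates probably need to stay as text or be handled separately.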


For priority claims, if there is more than one, we want the one with sequence 01

  • prioritydate
  • prioritycountry (should use ISO country codes - may need a lookup table)
  • prioritypatentnumber
  • find 4.3 file with priority claim
<priority-claims>
 <priority-claim sequence="01" kind="national">
  <country>GB</country>
  <doc-number>1016384.8</doc-number>
  <date>20100930</date>
 </priority-claim>
</priority-claims>
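
The "sequence 01" rule could be sketched as follows, assuming Python's xml.etree.ElementTree. The function name and dictionary keys are hypothetical, and the ISO country lookup mentioned above is not attempted here.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<priority-claims>
 <priority-claim sequence="01" kind="national">
  <country>GB</country>
  <doc-number>1016384.8</doc-number>
  <date>20100930</date>
 </priority-claim>
</priority-claims>"""


def first_priority_claim(claims):
    """Return the priority claim with sequence="01", or None if there is none."""
    for claim in claims.findall("priority-claim"):
        if claim.get("sequence") == "01":
            return {
                "prioritydate": claim.findtext("date"),
                "prioritycountry": claim.findtext("country"),
                "prioritypatentnumber": claim.findtext("doc-number"),
            }
    return None


priority = first_priority_claim(ET.fromstring(SAMPLE))
print(priority)
```

Matching on the sequence attribute, rather than taking the first child, avoids relying on the order of the claims in the file.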

Classification IPC - we only need the first one: http://www.wipo.int/export/sites/www/classifications/ipc/en/guide/guide_ipc.pdf

<classifications-ipcr>
 <classification-ipcr>
  <ipc-version-indicator>
   <date>20060101</date>
  </ipc-version-indicator>
  <classification-level>A</classification-level>
  
B
<class>64</class> <subclass>G</subclass> <main-group>6</main-group> <subgroup>00</subgroup> <symbol-position>F</symbol-position> <classification-value>I</classification-value> ... </classification-ipcr> ... </classifications-ipcr>
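
Taking only the first IPC classification might look like the sketch below, assuming Python's xml.etree.ElementTree. The function name is hypothetical, and the sample is trimmed to the fields we care about, with the section letter assumed to sit in a section element.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<classifications-ipcr>
 <classification-ipcr>
  <section>B</section>
  <class>64</class>
  <subclass>G</subclass>
  <main-group>6</main-group>
  <subgroup>00</subgroup>
 </classification-ipcr>
 <classification-ipcr>
  <section>A</section>
  <class>41</class>
  <subclass>D</subclass>
  <main-group>13</main-group>
  <subgroup>00</subgroup>
 </classification-ipcr>
</classifications-ipcr>"""

IPC_FIELDS = ("section", "class", "subclass", "main-group", "subgroup")


def first_ipc(classifications):
    """Return the fields of the first <classification-ipcr>, or None."""
    first = classifications.find("classification-ipcr")  # find() returns the first match
    if first is None:
        return None
    return {tag: first.findtext(tag) for tag in IPC_FIELDS}


ipc = first_ipc(ET.fromstring(SAMPLE))
print(ipc)
```

The second classification in the sample is made up purely to show that it is ignored.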

Classification CPC - we only need the main one

CPC is a classification scheme set up by the USPTO and the European Patent Office (EPO). The first classification codes were rolled out on November 9, 2012.[1] Full implementation of the CPC classification system occurred in January 2015, coinciding with version 4.5 of the USPTO patent bulk data.[2]

  • Section, Class, Subclass
  • Main Group, Subgroup
  • v 4.2, 4.3, 4.4 do not have this
<classifications-cpc>
 <main-cpc>
  <classification-cpc>
    <cpc-version-indicator>
      <date>20130101</date>
    </cpc-version-indicator>
    
B
<class>64</class> <subclass>D</subclass> <main-group>10</main-group> <subgroup>00</subgroup> <symbol-position>F</symbol-position> <classification-value>I</classification-value> ... </classification-cpc> </main-cpc> </classifications-cpc>
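
Because the older versions do not carry CPC data, an extractor has to cope with the element being absent. A minimal sketch, assuming Python's xml.etree.ElementTree; the record wrapper element, the function name and the "symbol" key are hypothetical.

```python
import xml.etree.ElementTree as ET

WITH_CPC = """<record>
 <classifications-cpc>
  <main-cpc>
   <classification-cpc>
    <section>B</section>
    <class>64</class>
    <subclass>D</subclass>
    <main-group>10</main-group>
    <subgroup>00</subgroup>
   </classification-cpc>
  </main-cpc>
 </classifications-cpc>
</record>"""

WITHOUT_CPC = "<record/>"  # stands in for a v 4.2 to 4.4 record


def main_cpc(record):
    """Return the main CPC classification, or None when the record has none."""
    node = record.find("classifications-cpc/main-cpc/classification-cpc")
    if node is None:
        return None
    parts = {tag: node.findtext(tag)
             for tag in ("section", "class", "subclass", "main-group", "subgroup")}
    # Conventional printed form, e.g. "B64D 10/00".
    parts["symbol"] = (parts["section"] + parts["class"] + parts["subclass"]
                       + " " + parts["main-group"] + "/" + parts["subgroup"])
    return parts


cpc = main_cpc(ET.fromstring(WITH_CPC))
no_cpc = main_cpc(ET.fromstring(WITHOUT_CPC))
print(cpc, no_cpc)
```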

Classification National: Note that the one below comes out to 2/2.11 (http://www.google.com/patents/US8925112#classifications)

  • Country
  • Class

THIS IS NOT UNIQUE. What classifications are we searching for?

<classification-national>
 <country>US</country>
  <main-classification>2 211</main-classification>
</classification-national>

Title of the patent:

<invention-title id="d2e61">Aircrew ensembles</invention-title>

Number of Claims:

<number-of-claims>12</number-of-claims>

Primary examiner:

  • FirstName, LastName, Department
<examiners>
 <primary-examiner>
  <last-name>Patel</last-name>
  <first-name>Tejash</first-name>
  <department>3765</department>
 </primary-examiner>
...
</examiners>

PCT/Regional Patent Number:

  • PCTNumber (just the doc number - if it starts with PCT set a flag)
  • not present in every v 4.5 file
  • not in v 4.2, 4.3, 4.4
  • not all patents may be filed under PCT; we need code that searches all files for the keyword
<pct-or-regional-filing-data>
 <document-id>
  <country>WO</country>
  <doc-number>PCT/EP2011/067014</doc-number>
  <kind>00</kind>
  <date>20110929</date>
 </document-id>
...
</pct-or-regional-filing-data>
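
The PCT flag described above could be sketched like this, assuming Python's xml.etree.ElementTree. The record wrapper element, the function name and the dictionary keys are hypothetical. The function returns None when the element is missing, which covers both the older versions and patents without a PCT or regional filing.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<record>
 <pct-or-regional-filing-data>
  <document-id>
   <country>WO</country>
   <doc-number>PCT/EP2011/067014</doc-number>
   <kind>00</kind>
   <date>20110929</date>
  </document-id>
 </pct-or-regional-filing-data>
</record>"""


def parse_pct(record):
    """Return the PCT/regional doc number and a PCT flag, or None if absent."""
    # iter() searches the whole subtree, so the element's exact position
    # inside the record does not need to be known.
    node = next(record.iter("pct-or-regional-filing-data"), None)
    if node is None:
        return None
    number = node.findtext("document-id/doc-number", default="")
    return {"pctnumber": number, "ispct": number.startswith("PCT")}


pct = parse_pct(ET.fromstring(SAMPLE))
print(pct)
```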

Citations

Patent Citations (we need all of them):

  • CitingPatentNumber (from the patent)
  • CitingPatentCountry (from the patent)
<publication-reference>
 <document-id>
  <country>US</country>
  <doc-number>08925112</doc-number>
  <kind>B2</kind>
  <date>20150106</date>
 </document-id>
</publication-reference>
  • CitedPatentNumber
  • CitedPatentCountry
  • V 4.2 does not have <us-references-cited>
<us-references-cited>
 <us-citation>
  <patcit num="00001">
   <document-id>
    <country>US</country>
    <doc-number>1105569</doc-number>
    <kind>A</kind>
    <name>Lacrotte</name>
    <date>19140700</date>
   </document-id>
  </patcit>
  <category>cited by examiner</category>
  <classification-national>
   <country>US</country>
   <main-classification>2 214</main-classification>
  </classification-national>
 </us-citation>
...
</us-references-cited>
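
Collecting every patent citation as a citing/cited pair might be sketched as below, assuming Python's xml.etree.ElementTree. The function name and the row keys are hypothetical. The citing number and country are passed in because they come from the patent's own publication-reference.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<us-references-cited>
 <us-citation>
  <patcit num="00001">
   <document-id>
    <country>US</country>
    <doc-number>1105569</doc-number>
    <kind>A</kind>
    <name>Lacrotte</name>
    <date>19140700</date>
   </document-id>
  </patcit>
  <category>cited by examiner</category>
 </us-citation>
</us-references-cited>"""


def patent_citations(refs, citing_number, citing_country):
    """Return one row per cited patent found under <us-references-cited>."""
    rows = []
    for doc in refs.findall("us-citation/patcit/document-id"):
        rows.append({
            "citingpatentnumber": citing_number,
            "citingpatentcountry": citing_country,
            "citedpatentnumber": doc.findtext("doc-number"),
            "citedpatentcountry": doc.findtext("country"),
        })
    return rows


citations = patent_citations(ET.fromstring(SAMPLE), "08925112", "US")
print(citations)
```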

For non-patent references, we are just going to count them:

  • NoNonPatRefs
<us-references-cited>
...
 <us-citation>
  <nplcit num="00020">
   <othercit>
    European Search Report dated Jan. 20, 2011 as received in European Patent Application No. GB1016384.8.
   </othercit>
  </nplcit>
  <category>cited by applicant</category>
 </us-citation>
</us-references-cited>
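
Counting the non-patent references comes down to counting nplcit elements. A minimal sketch, assuming Python's xml.etree.ElementTree; the function name is hypothetical and the sample is shortened to one patent citation and one non-patent citation.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<us-references-cited>
 <us-citation>
  <patcit num="00001">
   <document-id>
    <country>US</country>
    <doc-number>1105569</doc-number>
   </document-id>
  </patcit>
  <category>cited by examiner</category>
 </us-citation>
 <us-citation>
  <nplcit num="00020">
   <othercit>
    European Search Report dated Jan. 20, 2011 as received in European Patent Application No. GB1016384.8.
   </othercit>
  </nplcit>
  <category>cited by applicant</category>
 </us-citation>
</us-references-cited>"""


def count_non_patent_refs(refs):
    """Return NoNonPatRefs: the number of <nplcit> citations."""
    return len(refs.findall("us-citation/nplcit"))


no_non_pat_refs = count_non_patent_refs(ET.fromstring(SAMPLE))
print(no_non_pat_refs)
```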

Inventors

  • For v 4.3, 4.4, 4.5
  • PatentNumber (and country) to build a key
  • We need a "standard" name and address object for each inventor


<us-parties>
 <us-applicants>
...
 </us-applicants>
 <inventors>
  <inventor sequence="001" designation="us-only">
   <addressbook>
    <last-name>Oliver</last-name>
    <first-name>Paul</first-name>
    <address>
     <city>Rhyl</city>
     <country>GB</country>
    </address>
   </addressbook>
  </inventor>
...
 </inventors>
...
</us-parties>


  • For v 4.2
<parties>
 <applicants>
  <applicant sequence="001" app-type="applicant-inventor" designation="us-only">
   <addressbook>
    <last-name>Kamath</last-name>
    <first-name>Sandeep</first-name>
    <address>
     <city>Bangalore</city>
     <country>IN</country>
    </address>
   </addressbook>
   <nationality>
    <country>omitted</country>
   </nationality>
   <residence>
    <country>IN</country>
   </residence>
  </applicant>
 ...
 </applicants>
 ...
</parties>
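
One way to get a "standard" name and address object out of both layouts is sketched below, assuming Python and its standard library. The Party class and the function names are hypothetical. Treating app-type="applicant-inventor" as the marker for inventors in v 4.2 is an assumption based on the sample above.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional

NEW_LAYOUT = """<us-parties>
 <inventors>
  <inventor sequence="001" designation="us-only">
   <addressbook>
    <last-name>Oliver</last-name>
    <first-name>Paul</first-name>
    <address><city>Rhyl</city><country>GB</country></address>
   </addressbook>
  </inventor>
 </inventors>
</us-parties>"""

OLD_LAYOUT = """<parties>
 <applicants>
  <applicant sequence="001" app-type="applicant-inventor" designation="us-only">
   <addressbook>
    <last-name>Kamath</last-name>
    <first-name>Sandeep</first-name>
    <address><city>Bangalore</city><country>IN</country></address>
   </addressbook>
  </applicant>
 </applicants>
</parties>"""


@dataclass
class Party:
    """Hypothetical "standard" name and address object."""
    first_name: Optional[str]
    last_name: Optional[str]
    city: Optional[str]
    country: Optional[str]


def party_from(entry):
    """Build a Party from an element that contains an <addressbook>."""
    book = entry.find("addressbook")
    return Party(
        first_name=book.findtext("first-name"),
        last_name=book.findtext("last-name"),
        city=book.findtext("address/city"),
        country=book.findtext("address/country"),
    )


def inventors(parties):
    """Return inventors from either <us-parties> or the v 4.2 <parties>."""
    if parties.tag == "us-parties":
        entries = parties.findall("inventors/inventor")
    else:
        # Assumption: in v 4.2 the inventors are the applicant-inventor applicants.
        entries = [a for a in parties.findall("applicants/applicant")
                   if a.get("app-type") == "applicant-inventor"]
    return [party_from(entry) for entry in entries]


new_style = inventors(ET.fromstring(NEW_LAYOUT))
old_style = inventors(ET.fromstring(OLD_LAYOUT))
print(new_style, old_style)
```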

Assignees

  • PatentNumber (and country) to build a key
  • We need a "standard" name and address object for each assignee
<assignees>
 <assignee>
  <addressbook>
   <orgname>Survitec Group Limited</orgname>
   <role>03</role>
   <address>
    <city>Merseyside</city>
    <country>GB</country>
   </address>
  </addressbook>
 </assignee>
</assignees>
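
A sketch of building assignee rows keyed by the patent, assuming Python's xml.etree.ElementTree. The function name, the row keys and the key format (country followed by patent number) are all hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<assignees>
 <assignee>
  <addressbook>
   <orgname>Survitec Group Limited</orgname>
   <role>03</role>
   <address>
    <city>Merseyside</city>
    <country>GB</country>
   </address>
  </addressbook>
 </assignee>
</assignees>"""


def assignee_rows(assignees, patent_country, patent_number):
    """Return one row per assignee, keyed by country plus patent number."""
    rows = []
    for book in assignees.findall("assignee/addressbook"):
        rows.append({
            "patentkey": patent_country + patent_number,  # hypothetical key format
            "orgname": book.findtext("orgname"),
            "role": book.findtext("role"),
            "city": book.findtext("address/city"),
            "country": book.findtext("address/country"),
        })
    return rows


assignees = assignee_rows(ET.fromstring(SAMPLE), "US", "08925112")
print(assignees)
```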


Other things we might want

  • Abstract
  • Claims (other than their count)

Things we don't need

General:

Classification related:

  • Level - This appears to be either core or advanced. Not sure it matters.
  • SymbolPosition, ClassificationValue - we likely don't need them
  • Classification status and data source - no idea what these do

About the scripts

The scripts to process the Patent Data are all located under /bulk/Software/Scripts/PatentData/ ("E:\Software\Scripts\PatentData\")

There are currently 5 .pm files available: PatentApplication.pm, Inventor.pm, Claim.pm, Addressbook.pm, and Loader.pm.

Each of the first 4 represents an object type. The last one is a helper object that, given a schema file, extracts the wanted fields as a Perl object. Future work to support more schema files should be done in this file.

Example Usage:

perl PatentParser.pl -file=ipa150319.xml

This parses the XML file ipa150319.xml, extracts each patent (in this case, each patent application) into a temporary XML file, and then uses a Loader object with a specified schema file (in this case "us-patent-application-v44-2014-04-03.dtd") to extract each of the 4 object types from the patents. If an error occurs while parsing a file, that file is moved to a directory called "failed_files". A file that fails parsing is most likely not a utility patent.
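
The split-then-parse flow can be illustrated with the Python sketch below. It is not the Perl implementation: the function names and temporary file names are hypothetical, and it assumes that each document in the bulk file starts with its own XML declaration line.

```python
import os
import shutil
import tempfile
import xml.etree.ElementTree as ET


def split_documents(text):
    """Split a concatenated bulk file into one string per XML document."""
    docs, current = [], []
    for line in text.splitlines(keepends=True):
        # Assumption: every document begins with its own "<?xml" line.
        if line.startswith("<?xml") and current:
            docs.append("".join(current))
            current = []
        current.append(line)
    if current:
        docs.append("".join(current))
    return docs


def process(text, workdir):
    """Write each document to a temporary file, parse it, set failures aside."""
    failed_dir = os.path.join(workdir, "failed_files")
    os.makedirs(failed_dir, exist_ok=True)
    parsed, failed = [], []
    for i, doc in enumerate(split_documents(text)):
        path = os.path.join(workdir, "doc_%05d.xml" % i)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(doc)
        try:
            parsed.append(ET.parse(path).getroot().tag)
        except ET.ParseError:
            shutil.move(path, failed_dir)
            failed.append(os.path.basename(path))
    return parsed, failed


# Tiny made-up input: one well-formed document and one broken document.
BULK = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<us-patent-application><a/></us-patent-application>\n"
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<us-patent-application><broken>\n"
)

with tempfile.TemporaryDirectory() as tmp:
    parsed, failed = process(BULK, tmp)
print(parsed, failed)
```

Real bulk files also carry DOCTYPE declarations and named entities, which this sketch does not attempt to handle.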

About the Harvard Dataverse

The patents from 1975-2010, loaded as .sqlite3 and csv files, can be found at

Harvard Dataverse

I have also downloaded all of them onto the database server; they can be found with

cd /bulk/patent