|Does subsume=Redesigning Patent Database, Patent Assignment Data Restructure,
}}
 
This project is a continuation of [[Redesigning Patent Database]] that aims to write faster, more centralized code to deal with the USPTO data. With an end-to-end pipeline, we can reproduce or update the data without worrying about unintentional side effects or missing data.
 
== Directory Layout ==
 
All of the information for this project is located at <code>E:\McNair\Projects\SimplerPatentData</code>
 
There are three interesting directories:
 
* <code>zipfiles/</code> holds the USPTO bulk data, unmodified and validated to have the correct file size.
* <code>extracts/</code> holds a strict subset of the information stored in <code>zipfiles/</code>. It is the result of running a bulk 7-Zip job on that directory to unzip everything into a flat directory structure. Note that these files carry the USPTO modification time, since that metadata is stored in the zip files.
* <code>src/</code> is the main code repository for the Java project.
 
In addition, there are three interesting files in the base directory:
 
* <code>extracts.7z</code> is an archived version of the <code>extracts/</code> directory for backup and transfer reasons.
 
<nowiki>Name: extracts.7z
Size: 55847284301 bytes (53260 MB)
SHA256: C653E5B736530711DB2212191853EAABBF36CF48820915F8B57DB54E1990BDC0</nowiki>
 
* <code>hashes.tsv</code> is a tab-separated values file with the SHA-256 hashes of the files as downloaded from the USPTO.
* <code>index.tsv</code> is a tab-separated values file with each file's URL, modification datetime, and expected file size in bytes.
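
The SHA-256 values above can be checked after a transfer with a few lines of Java. The following is a minimal sketch, not code from <code>src/</code>: the class name <code>ChecksumSketch</code> and its methods are hypothetical, and because the column layout of <code>hashes.tsv</code> is not documented here, the expected hash is passed in as a plain string rather than parsed from that file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration; not part of the project's src/ tree.
public class ChecksumSketch {

    /** Streams the file through SHA-256 and returns the digest as uppercase hex. */
    public static String sha256Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] buffer = new byte[1 << 16]; // 64 KiB chunks, so large archives are never fully in memory
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02X", b));
        }
        return hex.toString();
    }

    /** Compares the file's digest with an expected hex string, ignoring case and surrounding whitespace. */
    public static boolean matches(Path file, String expectedHex)
            throws IOException, NoSuchAlgorithmException {
        return sha256Hex(file).equalsIgnoreCase(expectedHex.trim());
    }

    // Usage (assumed invocation): java ChecksumSketch <file> <expected-sha256>
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java ChecksumSketch <file> <expected-sha256>");
            return;
        }
        Path file = Paths.get(args[0]);
        System.out.println(matches(file, args[1]) ? "OK" : "MISMATCH");
    }
}
```

Run against <code>extracts.7z</code> with the SHA256 value listed above as the second argument, this should print <code>OK</code> if the archive is intact. The same method could be applied to each file in <code>zipfiles/</code> once the layout of <code>hashes.tsv</code> is known.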
 
=== Input Files ===
 
The input files are all of the text-only Red Book files for granted patents from 1976 to 2016, inclusive. To find a specific year's XML file, look in
 
<code>E:\McNair\Projects\SimplerPatentData\extracts</code>
 
== Schema Reconciliation ==
 
TODO
 
=== Processing ===
 
TODO
 
=== Attributes ===
 
 
== Related Projects ==
