Reproducible Patent Data

Revision as of 17:33, 25 May 2017

|Does subsume=Redesigning Patent Database, Patent Assignment Data Restructure,

}}

This project is a continuation of [[Redesigning Patent Database]] that aims to write faster, more centralized code for handling the USPTO data. With an end-to-end pipeline, we can reproduce or update the data without worrying about unintended side effects or missing data.

== Directory Layout ==

All of the files for this project are located at <code>E:\McNair\Projects\SimplerPatentData</code>.

There are three directories of interest:

* <code>zipfiles/</code> holds the USPTO bulk data, unmodified and validated to have the correct file size.

* <code>extracts/</code> holds a strict subset of the information stored in <code>zipfiles/</code>. It is the result of running a bulk 7-Zip job on that directory to unzip everything into a flat directory structure. Note that these files carry the USPTO modified-by time, since that metadata is stored in the zip files.

* <code>src/</code> is the main code repository for the Java project.
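
The file-size validation mentioned for <code>zipfiles/</code> could be sketched roughly as follows. This is only an illustration, not the project's actual code: the class name <code>SizeCheck</code>, the method <code>mismatches</code>, and the idea of passing the expected sizes in as a map (presumably loaded from <code>index.tsv</code>) are all assumptions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper; not part of the project's src/ directory.
class SizeCheck {
    /**
     * Returns the file names that are missing from dir or whose size on disk
     * differs from the expected size in bytes.
     */
    static List<String> mismatches(Path dir, Map<String, Long> expected) throws IOException {
        List<String> bad = new ArrayList<>();
        for (Map.Entry<String, Long> entry : expected.entrySet()) {
            Path file = dir.resolve(entry.getKey());
            // A missing file counts as a mismatch, as does any size difference.
            if (!Files.isRegularFile(file) || Files.size(file) != entry.getValue()) {
                bad.add(entry.getKey());
            }
        }
        return bad;
    }
}
```

An empty result would mean every listed file is present with the expected size. A size check only catches truncated or incomplete downloads; it does not detect corrupted content of the same length.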

In addition, there are three files of interest in the base directory:

* <code>extracts.7z</code> is an archived version of the <code>extracts/</code> directory for backup and transfer reasons.

<nowiki>Name: extracts.7z

Size: 55847284301 bytes (53260 MB)

SHA256: C653E5B736530711DB2212191853EAABBF36CF48820915F8B57DB54E1990BDC0</nowiki>

* <code>hashes.tsv</code> is a tab-separated values (TSV) file with SHA-256 hashes of the files as downloaded from the USPTO.

* <code>index.tsv</code> is a tab-separated values (TSV) file with the URL, modified-by datetime, and expected file size in bytes for each file.
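
As a minimal sketch of how the recorded SHA-256 hashes might be checked against the downloaded files, the Java below hashes each file and compares the result with the stored digest. The column layout of <code>hashes.tsv</code> is not documented here, so the code assumes a file name in the first column and a hex digest in the second. The class and method names (<code>HashCheck</code>, <code>sha256Hex</code>, <code>failures</code>) are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of the project's src/ directory.
class HashCheck {
    /** Streams the file through SHA-256 and returns the digest as uppercase hex. */
    static String sha256Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[1 << 16];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02X", b));
        }
        return hex.toString();
    }

    /**
     * Reads a TSV assumed to hold "fileName<TAB>hexDigest" per line and returns
     * the names of files in dir that are missing or whose digest differs.
     */
    static List<String> failures(Path tsv, Path dir) throws IOException, NoSuchAlgorithmException {
        List<String> bad = new ArrayList<>();
        for (String line : Files.readAllLines(tsv, StandardCharsets.UTF_8)) {
            String[] cols = line.split("\t");
            if (cols.length < 2) {
                continue; // skip blank or malformed lines
            }
            Path file = dir.resolve(cols[0]);
            if (!Files.isRegularFile(file) || !sha256Hex(file).equalsIgnoreCase(cols[1].trim())) {
                bad.add(cols[0]);
            }
        }
        return bad;
    }
}
```

The same <code>sha256Hex</code> call could be pointed at <code>extracts.7z</code> to compare against the digest listed above. Streaming in fixed-size chunks avoids loading an archive of that size into memory.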

=== Input Files ===

The input is all of the text-only Red Book files for granted patents from 1976 to 2016, inclusive. To find a specific year's XML file, look in

<code>E:\McNair\Projects\SimplerPatentData\extracts</code>

== Schema Reconciliation ==

TODO

=== Processing ===

TODO

=== Attributes ===

== Related Projects ==

OliverC
