
}}
This project is a continuation of [[Redesigning Patent Database]] that aims to write faster, more centralized code for handling data from the United States Patent and Trademark Office (USPTO). An end-to-end pipeline lets us reproduce or update the data without worrying about unintentional side effects or missing data. Currently, the pipeline handles bulk downloading from the USPTO; streaming file splitting (splitting large concatenated files into their component parts in memory); and parsing of XML to Java objects, APS to Java Maps, and maintenance fee data to Java objects.
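The streaming split described above can be illustrated with a minimal sketch. This is not the project's actual splitter: the class name <code>StreamingSplitter</code>, the method <code>split</code>, and the rule that every component document begins with a line starting with <code><?xml</code> are all assumptions made for illustration. The idea is to read the concatenated file line by line and hand each complete document to a consumer as soon as the next one begins, so only one document is held in memory at a time.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class StreamingSplitter {
    // Assumption: each component document starts with an XML declaration line.
    static final String DOC_START = "<?xml";

    /**
     * Reads the input line by line and passes each complete document to the
     * sink as soon as the start of the next document is seen.
     * Returns the number of documents emitted.
     */
    static int split(Reader in, Consumer<String> sink) {
        StringBuilder current = new StringBuilder();
        int count = 0;
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // A new declaration closes the document collected so far.
                if (line.startsWith(DOC_START) && current.length() > 0) {
                    sink.accept(current.toString());
                    count++;
                    current.setLength(0);
                }
                current.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Emit the final document, which has no following declaration.
        if (current.length() > 0) {
            sink.accept(current.toString());
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Two tiny hypothetical documents concatenated into one input.
        String concatenated =
            "<?xml version=\"1.0\"?>\n<a>1</a>\n"
          + "<?xml version=\"1.0\"?>\n<b>2</b>\n";
        List<String> docs = new ArrayList<>();
        int n = split(new StringReader(concatenated), docs::add);
        System.out.println(n + " documents");
    }
}
```

In a real run the sink would pass each document straight to the parser instead of collecting it in a list, which keeps memory use bounded by the largest single document.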
== Progress ==
# <del>Splitter</del> ''done''
# <del>Parser</del> ''done''
# Create tooling for minions
# Set up PostgreSQL JDBC
# Create naive schema based on previous approaches
# Create new data structures
# Database Insert (modify <code>models/</code> files with some mapping to database fields)
# Data Cleanup (reference [[Patent_Assignment_Data_Restructure|Marcela and Sonia's work]])
# Set up a pipeline script to complete all of these steps in series
# Data Source Merger (currently ''only USPTO granted, maintfee, and assignment'' data; not USPTO applications, Harvard Dataverse, or Lex Machina)
== Directory Layout ==
All of the information for this project is located at <code>E:\McNair\Projects\SimplerPatentData</code>.
There are four interesting directories:
* <code>data/downloads/</code> is USPTO bulkdata, unmodified straight from the scraper
|January 1976 to December 2001
|APS
|Only syntax (syntactic parsing but little semantic knowledge)
|-
|<del>January 2001 to December 2001</del>
|January 2002 to December 2004
|XML Version 2.5
|Only syntax
|-
|January 2005 to December 2005
