
5,997 bytes added , 12:06, 6 October 2020


{{Project|Has project output=Data,Tool,Content,How-to,Guide|Has sponsor=McNair Center

|Has Image=Uspto web logo.jpg

|Has title=Reproducible Patent Data

}}

<onlyinclude>The [[Reproducible Patent Data]] project is a continuation of the [[Redesigning Patent Database]] project. It aims to write faster, more centralized code to deal with data from the United States Patent and Trademark Office (USPTO). By having an end-to-end pipeline we can easily reproduce or update data without worrying about unintentional side effects or missing data. See also the [[Patent Data]] umbrella project.</onlyinclude>

== Quickstart ==

To get up and running with the code, do the following:

# Clone the git project (link at end of page) to your user directory
# Launch IntelliJ with >= Java 8 and Maven configured (the default version installed on the RDP is set up to do this)
# Open the project in IntelliJ
# Create an empty database (see [[#Database]])
# Run the table creation scripts in <code>src/db/schemas/</code> in your new database
# Modify the constant <code>DATABASE_NAME</code> in <code>E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\postgres\DatabaseHelper.java</code>
# Run the Driver scripts in IntelliJ with the correct value for <code>DATA_DIRECTORY</code> (or run <code>RunInitialImport.java</code>, which will do all of the data directories for that patent item type)
# [Take a really, really long lunch...in total should take no more than five hours to load data on RDP]
# Run the scripts in <code>src/db/constraints</code> to check data assumptions
# That's it!

===Troubleshooting===

If you're new to IntelliJ (and even if you're not) you might run into problems with importing the project.

'''Setting Up Project as a Maven project'''

It should be clear if the project is not set up as a Maven project: when you right click on RunInitialImport.java, for example, you won't see an option with a green triangle next to it that says "Run 'RunInitialImport.java'", and the green triangle in the top toolbar will be grayed out.
If the project is not set up as a Maven project, you will not be able to run any of the code. To set up the project as a Maven project, when you import the project, follow the instructions at the following [https://www.jetbrains.com/help/idea/maven.html#maven_import_project_start link].

* Note that when you click "Import Project", you should select the "Simpler Patent Data" folder, not the "src" folder within Simpler Patent Data; otherwise you won't get the pom.xml file that you need to let IntelliJ know that this is a Maven project.
* On the second window (there will be several options with check boxes next to them), make sure "import Maven projects automatically" is selected.
* On the "Please select project SDK" window, make sure it says "1.8" in the "Name" slot.
* On the next window, enter a name for the project and enter a folder location.

This should ensure that the project is set up as a Maven project.

'''Setting Up Your Data Source'''

If you run into a message across the top that says something along the lines of "Configure Data Source", then you have not connected IntelliJ to a database. You will not be able to run the code located under src/db until you configure one. Start by clicking on the link in the message, or if it doesn't appear, follow the instructions [https://www.jetbrains.com/help/idea/connecting-to-a-database.html here] to open up the "Data Sources and Drivers" pop-up to add a PostgreSQL database.
When you get to the dialogue asking about the host, database, user, and password, do the following to connect to the database on the RDP:

* host: localhost
* database: whatever the constant DATABASE_NAME is set to (the default is patentsj)
* user: postgres
* password: tabspaceenter

Make sure to test the connection by clicking "Test Connection". Now you should be able to run the scripts under src/db.

If you're seeing issues such as "column [something] of relation [something] doesn't exist" but you've run the schema scripts, you probably have a different database name under the data source than the one the constant DATABASE_NAME is set to. To change this, right click on the data source in the Database tab and select "Properties".
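When the data source and the code seem to disagree about the database, a quick check from Java can confirm where a connection actually lands. The following is only a minimal sketch: the class <code>DataSourceCheck</code> is hypothetical and not part of the repository, its <code>DATABASE_NAME</code> merely mirrors the default mentioned above, and port 5432 (PostgreSQL's default) is an assumption about the RDP setup. It needs the PostgreSQL JDBC driver on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper, not part of the project: reports which database
// a JDBC connection actually ends up in.
final class DataSourceCheck {
    // Assumption: mirrors the default value of DATABASE_NAME in DatabaseHelper.java.
    static final String DATABASE_NAME = "patentsj";

    // Builds a PostgreSQL JDBC URL; 5432 is PostgreSQL's default port (assumed here).
    static String jdbcUrl(String host, String database) {
        return "jdbc:postgresql://" + host + ":5432/" + database;
    }

    public static void main(String[] args) throws SQLException {
        // Pass the password as the first argument rather than hard-coding it.
        String password = args.length > 0 ? args[0] : "";
        try (Connection conn = DriverManager.getConnection(
                 jdbcUrl("localhost", DATABASE_NAME), "postgres", password);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT current_database()")) {
            rs.next();
            System.out.println("Connected to: " + rs.getString(1));
        }
    }
}
```

If the printed name differs from the database shown in the data source's "Properties" window, the schema scripts were probably run against a different database than the one the Java code writes to.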

== Directory Layout ==

'''To find granted patent data''', look in

<code>E:\McNair\Projects\SimplerPatentData\data\extracts\granted\</code>

'''To find application data''' from 2001 to 2016, inclusive, look in

<code>E:\McNair\Projects\SimplerPatentData\data\extracts\applications\</code>

'''To find assignment data''', look in

'''To find maintenance fee data''', look in

<code>E:\McNair\Projects\SimplerPatentData\data\downloads\maintenance\</code>

=== Where is the Code? ===

The development environment is Java 8 JDK, IntelliJ Ultimate IDE, Maven build tools, and git VCS.

The git repository can be found at http://rdp.mcnaircenter.org/codebase/Repository/ReproduciblePatent

==== Prior Art ====

The code can also be run via the standard <code>javac</code> and <code>java</code> commands but since this project has a complicated structure you end up having to run commands like

<code>"C:\Program Files\Java\jdk1.8.0_131\bin\java" "-javaagent:C:\Users\OliverC\IntelliJ IDEA 2017.1.3\lib\idea_rt.jar=62364:C:\Users\OliverC\IntelliJ IDEA 2017.1.3\bin" -Dfile.encoding=UTF-8 -classpath "C:\Program Files\Java\jdk1.8.0_131\jre\lib\charsets.jar;C:\Program Files\Java\jdk1.8.0_131\jre\lib\deploy.jar;C:\Program Files\Java\jdk1.8.0_131\jre\lib\ext\access-bridge-64.jar;C:\Program Files\Java\jdk1.8.0_131\jre\lib\ext\cldrdata.jar;C:\Program Files\Java\jdk1.8.0_131\jre\lib\ext\dnsns.jar;C:\Program Files\Java\jdk1.8.0_131\jre\lib\ext\jaccess.jar;C:\Program Files\Java\jdk1.8.0_131\jre\lib\ext\jfxrt.jar;C:\Program Files\Java\jdk1.8.0_131\jre\lib\ext\localedata.jar;C:\Program Files\Java\jdk1.8.0_131\jre\lib\ext\nashorn.jar;C:\Program Files\Java\jdk1.8.0_131\jre\lib\ext\sunec.jar;C:\Program Files\Java\jdk1.8.0_131\jre\lib\ext\sunjce_provider.jar;C:\Program Files\Java\jdk1.8.0_131\jre\lib\ext\sunmscapi.jar;C:\Program Files\Java\jdk1.8.0_131\jre\lib\ext\sunpkcs11.jar;C:\Program Files\Java\jdk1.8.0_131\jre\lib\ext\zipfs.jar;C:\Program Files\Java\jdk1.8.0_131\jre\lib\javaws.jar;C:\Program Files\Java\jdk1.8.0_131\jre\lib\jce.jar;C:\Program Files\Java\jdk1.8.0_131\jre\lib\jfr.jar;C:\Program Files\Java\jdk1.8.0_131\jre\lib\jfxswt.jar;C:\Program Files\Java\jdk1.8.0_131\jre\lib\jsse.jar;C:\Program Files\Java\jdk1.8.0_131\jre\lib\management-agent.jar;C:\Program Files\Java\jdk1.8.0_131\jre\lib\plugin.jar;C:\Program Files\Java\jdk1.8.0_131\jre\lib\resources.jar;C:\Program Files\Java\jdk1.8.0_131\jre\lib\rt.jar;E:\McNair\Projects\SimplerPatentData\target\classes;C:\Users\OliverC\.m2\repository\com\mashape\unirest\unirest-java\1.4.9\unirest-java-1.4.9.jar;C:\Users\OliverC\.m2\repository\org\apache\httpcomponents\httpclient\4.5.2\httpclient-4.5.2.jar;C:\Users\OliverC\.m2\repository\org\apache\httpcomponents\httpcore\4.4.4\httpcore-4.4.4.jar;C:\Users\OliverC\.m2\repository\commons-logging\commons-logging\1.2\commons-logging-1.2.jar;C:\Users\OliverC\.m2\repository\org\apache\httpcomponents\httpasyncclient\4.1.1\httpasyncclient-4.1.1.jar;C:\Users\OliverC\.m2\repository\org\apache\httpcomponents\httpcore-nio\4.4.4\httpcore-nio-4.4.4.jar;C:\Users\OliverC\.m2\repository\org\apache\httpcomponents\httpmime\4.5.2\httpmime-4.5.2.jar;C:\Users\OliverC\.m2\repository\org\json\json\20160212\json-20160212.jar;C:\Users\OliverC\.m2\repository\com\google\guava\guava\21.0\guava-21.0.jar;C:\Users\OliverC\.m2\repository\org\jsoup\jsoup\1.10.2\jsoup-1.10.2.jar;C:\Users\OliverC\.m2\repository\commons-codec\commons-codec\1.10\commons-codec-1.10.jar;C:\Users\OliverC\.m2\repository\org\jetbrains\annotations\15.0\annotations-15.0.jar;C:\Users\OliverC\.m2\repository\org\apache\commons\commons-lang3\3.5\commons-lang3-3.5.jar;C:\Users\OliverC\.m2\repository\org\postgresql\postgresql\42.1.1\postgresql-42.1.1.jar" org.bakerinstitute.mcnair.uspto_assignments.XmlDriver</code>

to include all of the runtime dependencies and it's just not worth it.

== Schema Reconciliation ==

For the work by Joe, see the [[Patent Schema Reconciliation]] page.

=== Patents (Granted) ===

See <code>E:\McNair\Projects\SimplerPatentData\data\examples\granted</code> for extracted examples of what specific data is available for a sample of the data.

{| class="wikitable"

|+Granted Patent Data Formats

|-

|-

|January 1976 to December 2001

|APS

|<code>data/extracts/granted/vintage</code>
|style="background: green; color: white;" | Yes
|✓
|~
|~
|~

|-

|<del>January 2001 to December 2001</del>

|<del>SGML</del>

|Ignored; use concurrently recorded APS data

|No
|N/A
|N/A
|N/A
|N/A

|-

|January 2002 to December 2004

|XML Version 2.5

|<code>data/extracts/granted/blunderyears</code>
|style="background: green; color: white;" | Yes
|✓
|~
|~
|~

|-

|January 2005 to December 2005

|XML Version 4.0 ICE

|<code>data/extracts/granted/modern</code>
|style="background: green; color: white;" | Yes
|✓
|~
|~
|~

|-

|January 2006 to December 2006

|XML Version 4.1 ICE

|<code>data/extracts/granted/modern</code>
|style="background: green; color: white;" | Yes
|✓
|~
|~
|~

|-

|January 2007 to December 2012

|XML Version 4.2 ICE

|<code>data/extracts/granted/modern</code>
|style="background: green; color: white;" | Yes
|✓
|~
|~
|~

|-

|January 2013 to September 24, 2013

|XML Version 4.3 ICE

|<code>data/extracts/granted/modern</code>

|style="background: green; color: white;" | Yes

|✓

|~
|~
|~

|-

|October 8, 2013 to December 2014

|XML Version 4.4 ICE

|<code>data/extracts/granted/modern</code>

|style="background: green; color: white;" | Yes

|✓

|~
|~
|~

|-

|January 2015 to December 2016

|XML Version 4.5 ICE

|<code>data/extracts/granted/modern</code>

|style="background: green; color: white;" | Yes

|✓

|~
|~
|~

|}
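The date ranges in the table can be turned into a simple lookup from grant year to extract directory. The sketch below is illustrative only: <code>GrantedExtractLocator</code> and <code>directoryForYear</code> are hypothetical names that do not come from the repository, and the mapping works at year granularity, which is enough because every format from 2005 onward shares the <code>modern</code> directory.

```java
import java.util.Optional;

// Hypothetical helper, not part of the project: maps a grant year to the
// extract directory listed in the table of granted patent data formats.
final class GrantedExtractLocator {

    // Returns the relative extract directory for a grant year, or empty
    // when the year falls outside the 1976 to 2016 range the table covers.
    static Optional<String> directoryForYear(int year) {
        if (year >= 1976 && year <= 2001) {
            // APS; the 2001 SGML files are ignored in favour of APS.
            return Optional.of("data/extracts/granted/vintage");
        }
        if (year >= 2002 && year <= 2004) {
            // XML Version 2.5
            return Optional.of("data/extracts/granted/blunderyears");
        }
        if (year >= 2005 && year <= 2016) {
            // XML Versions 4.0 through 4.5 ICE
            return Optional.of("data/extracts/granted/modern");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        for (int year : new int[] {1975, 1999, 2003, 2013, 2017}) {
            System.out.println(year + " -> "
                + directoryForYear(year).orElse("(not covered)"));
        }
    }
}
```

Returning an empty <code>Optional</code> for uncovered years keeps a caller from silently reading the wrong directory.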

=== APS Rosetta Stone ===

The Advanced Patent System (APS) is a fixed-width text format used to store historical patent grant data. The documentation for this sucks; there are pages missing at random. Luckily, we only care about the content contained here: [[File:PatentFullTextAPSDoc_GreenBook_pgs13-22.pdf]].

It's worth mentioning that the APS contains an advanced text markup system for chemical formulae, basic text markup, tables, etc. that can lead to seemingly garbled text that is perfectly well-formed.

==== APS Gotchas ====

* PATN.WKU is the granted patent number. It is 7 digits while the spec promises 6 digits. The rightmost digit is a check digit modulus 11. See [[File:Aps-wku-modulus11.pdf]] for the words from the horse's mouth.
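In code, the gotcha above amounts to splitting off the rightmost digit before treating the value as a patent number. The following is a rough sketch only: <code>WkuSketch</code> and its methods are hypothetical and not taken from the repository, and it deliberately does not verify the check digit, since the modulus 11 weighting is defined in the PDF linked above and is not reproduced on this page.

```java
// Hypothetical helper, not part of the project: separates the check digit
// from a PATN.WKU value as described in the gotcha above.
final class WkuSketch {

    // Rejects values that cannot contain both a number and a check digit.
    private static String validated(String wku) {
        String value = wku.trim();
        if (value.length() < 2 || !Character.isDigit(value.charAt(value.length() - 1))) {
            throw new IllegalArgumentException("Unexpected WKU value: " + wku);
        }
        return value;
    }

    // Everything except the rightmost character is the patent number.
    static String patentNumber(String wku) {
        String value = validated(wku);
        return value.substring(0, value.length() - 1);
    }

    // The rightmost digit is the check digit (not verified here).
    static int checkDigit(String wku) {
        String value = validated(wku);
        return Character.getNumericValue(value.charAt(value.length() - 1));
    }

    public static void main(String[] args) {
        String wku = "1234567"; // made-up value, not a real patent
        System.out.println(patentNumber(wku) + " / " + checkDigit(wku));
    }
}
```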

=== Patents (Applications) ===

{| class="wikitable"

|+Patent Application Data Formats

|-

! scope="col" | Dates Used !! scope="col" | Format !! scope="col" | Location !! scope="col" | Supported by Parser?

|-

|March 15, 2001 to December 2001

|XML Version 1.5

|<code>data/extracts/applications/vintage</code>

|style="background: yellow;" | Yes, for basic information, inventors, and correspondents

|-

|January 2002 to December 2004

|XML Version 1.6

|<code>data/extracts/applications/vintage</code>