|Has title=Redesigning Patent Database
|Has owner=Shelby Bice
|Has keywords=Database, Patent
|Has notes= |Is dependent on= |Depends upon it=|Has project status= Active
Documentation on the process and eventual designs for the new patent database. Not to be confused with "Patent Data Restructure", which deals with condensing and cleaning USPTO assignee data.
Deliverables by end of semester: design and build a new patent database with core tables from the existing data, add tables from the Patent Data Restructure and Small Inventors Project, and write thorough documentation on the schema of the new patent database, including instructions on what documentation should be written whenever the schema is altered.
TODO on 4/13/2017
Keep working on updating patent data
* How to parse patent data
* How to parse assignment data
* Also update documentation as I go
== '''Redesigning Patent Database''' ==
*[[Patent Data Restructure]]
*[[Small Inventors Project]] - uses Fee Status and Citations
*[[Medical Centers and Grants]] - uses patent assignees, specifically their zipcodes and organizations
'''Previous documentation on the patent database:'''
'''As of 3/21/2017, the most up-to-date database containing patent data is "patent", not "allpatent" or "allpatent_clone", and "patent" is the database that the other patent data redesign project, Restructuring Patent Data (link above), is working with. '''
[http://mcnair.bakerinstitute.org/wiki/Patent_Data_(Wiki_Page) Patent Data] - overview of what the data is and where it came from; probably the starting point for changing documentation
[http://mcnair.bakerinstitute.org/wiki/Patent Patent Database] - overview of the schema of the database (specifically the database "patent", which includes data from the Harvard Dataverse (originally stored in patentdata) and the USPTO (patent_2015))
[http://mcnair.bakerinstitute.org/wiki/USPTOAssigneesData USPTO Assignees Database] - enhances assignee info in the patent database; also being redesigned
[http://mcnair.bakerinstitute.org/wiki/Patent_Data_Issues Problems with Patent Database] - lists issues with the current schema
[http://mcnair.bakerinstitute.org/wiki/Data_Model Previous ER Diagram] - does not match the schema described in [http://mcnair.bakerinstitute.org/wiki/Patent Patent Database] and contains an outdated list of what we want to pull from the XML files
[http://mcnair.bakerinstitute.org/wiki/Patent_Data_Processing_-_SQL_Steps Processing Patent Data] - states that allpatent is the newest database and an amalgamation of patentdata and patent_2015
== Description ==
== Development ==
The design will be built upon a relational database model. I will be referencing this article on database design as I develop the design (http://en.tekstenuitleg.net/articles/software/database-design-tutorial/one-to-many.html), and I will create an ER diagram using [https://erdplus.com/#/standalone ERDPlus] or [https://creately.com/app/?tempID=hqdgwjki1&login_type=demo# Creately].
== Current Design and Scripts Documentation ==
The following pages are relevant to how previous databases are built/how to build tables in the database:
[http://mcnair.bakerinstitute.org/wiki/Harvard_Dataverse_Data Harvard Dataverse Data] - explains how to make tables from Harvard Dataverse data, where to find scripts, etc.
[http://mcnair.bakerinstitute.org/wiki/USPTO_Bulk_Data_Processing USPTO Data] - explains how to make tables from USPTO data, where to find scripts, etc., specifically for assignment data
[http://mcnair.bakerinstitute.org/wiki/Patent_Data_Extraction_Scripts_(Tool) Patent Data Extraction] - explains the locations of the XML files and lists (at the bottom) where the Perl scripts can be found
[http://mcnair.bakerinstitute.org/wiki/Patent_Data_Cleanup_-_June_2016 Patent Data Cleanup] - explains the changes that were made to clean up problems in the database allpatent as a result of merging the Harvard Dataverse data and the USPTO data
[http://mcnair.bakerinstitute.org/wiki/Patent_Data_Processing_-_SQL_Steps Patent Data Processing - SQL Steps] - explains the SQL needed to merge the two existing databases, one containing the Harvard Dataverse data and one containing the USPTO data
Here are the instructions I'm developing for downloading, parsing, and hopefully adding new data to the database, since the existing documentation is very sparse (these can also be found under McNair/Projects/Redesigning Patent Database/Instructions on how to download patent data from USPTO bulk data).
===How to run Perl Scripts to extract Patent Data===
1) Map the network drive to your computer (instructions can be found under Help for New Staff; scroll down to "Working with the infrastructure" and click on the link to "How to Map the network drive").

2) There is no need to install Perl; it is already installed for you to use. Just open a command line and type perl scriptname.pl. (A guide on installing and running Perl programs lives at https://www.thoughtco.com/how-to-install-and-run-perl-2641103, but you should not need it.)

3) Now it's time to download the data so it can be parsed.
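For example, to check that Perl is available and then run a script from the command line (scriptname.pl is a placeholder; the real script names are given below):
<pre>
perl -v
perl scriptname.pl
</pre>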
====For the USPTO Assignment Data====
There appears to be a script under McNair/usptoAssignment called USPTO_Assignee_Download, which lets a user pass it a URL and then downloads all the zip files available at that URL. It places the downloaded zip files in "E:/McNair/usptoAssigneeData/name", where "name" is the name of the file. If you want to check which files have already been processed, look in "McNair/usptoAssigneeData/Finished" to see the finished zip files.
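I have not read through USPTO_Assignee_Download itself, so the following is only a minimal sketch of that style of downloader in Perl; the zip-matching pattern, the output directory handling, and the assumption that the zip files sit directly under the index URL are mine, not the script's actual code:
<pre>
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get getstore);

# Hypothetical sketch: fetch an index page and download every .zip file it links to.
my $base_url = $ARGV[0] or die "Usage: perl download_zips.pl <url>\n";
my $out_dir  = 'E:/McNair/usptoAssigneeData';

my $index = get($base_url) or die "Could not fetch $base_url\n";

# Pull out every href that ends in .zip.
my @zips = $index =~ /href="([^"]+\.zip)"/gi;

foreach my $zip (@zips) {
    my ($name) = $zip =~ m{([^/]+\.zip)$};      # file name only
    my $status = getstore("$base_url/$name", "$out_dir/$name");
    print "$name -> $status\n";                 # HTTP 200 means the download succeeded
}
</pre>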
====For the Main Patent Data====
The equivalent script for the patent data is called "USPTO_Parser" and can be found under
*You may need to fix the base URL in the script; the USPTO has changed it before. It is currently: https://bulkdata.uspto.gov/data2/patent/grant/redbook
*Instead of taking a URL, as USPTO_Assignee_Download does, it takes two arguments, year1 and year2, which represent the range of years of data you wish to download (for example, 2015 to 2016). A hypothetical invocation is sketched after this list.
*The Perl script places the downloaded zip files into "E:/McNair/PatentData/name", where "name" is the name of the zip file.
*The folder "Processed" under McNair/PatentData appears to hold all the unzipped zip files that have been downloaded and processed already, organized by year. So if you are curious whether some files have already been processed, you could look there.
Note that the zip files should appear briefly (sequentially) in E:/McNair/Software/Scripts/Patent before disappearing and reappearing unzipped in E:/McNair/PatentData.
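For example, a hypothetical invocation that downloads the 2015 and 2016 grant data (the ".pl" extension and the working directory are assumptions; the note above only says the zip files pass through E:/McNair/Software/Scripts/Patent):
<pre>
E:
cd \McNair\Software\Scripts\Patent
perl USPTO_Parser.pl 2015 2016
</pre>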
3b) Now we need to split the files into individual, valid xml files. To do this:
Move the files to be split into E:/McNair/PatentData/Queue
Run the command:
Each file will then be blown out into a directory of xml files in E:/McNair/PatentData/Processed
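The exact splitter command is not recorded above, so as a rough illustration only (the real script may work completely differently), splitting a concatenated USPTO bulk file into individual XML documents usually amounts to starting a new output file at each <?xml declaration; a minimal Perl sketch, with paths following the Queue/Processed layout described above:
<pre>
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sketch: split one concatenated bulk file into individual xml files.
my $in_file = $ARGV[0] or die "Usage: perl split_bulk_xml.pl <bulk file>\n";
my $out_dir = 'E:/McNair/PatentData/Processed';

open my $in, '<', $in_file or die "Cannot open $in_file: $!\n";

my $count = 0;
my $out;
while (my $line = <$in>) {
    if ($line =~ /^\s*<\?xml /) {        # each patent document starts with its own declaration
        close $out if $out;
        $count++;
        open $out, '>', "$out_dir/doc_$count.xml" or die "Cannot write doc_$count.xml: $!\n";
    }
    print {$out} $line if $out;
}
close $out if $out;
close $in;
print "Wrote $count xml documents to $out_dir\n";
</pre>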
4) The next step would be to parse the actual files.
For the patent data files, based on the existing documentation, it looks like PatentParser, found in McNair/Software/Scripts/Patent, has to be run on each xml file that was downloaded and unzipped during the previous step. (For future updates to the Perl files, we should update this script so that it can be run on a directory of files, like the parser for the USPTO assignment data; a sketch of such a wrapper follows.)
*It stores the parsed output in a text file called "Results.txt" (which I assume will have to be deleted afterward).
*This script utilizes the Claim.pm, Inventor.pm, PatentApplication.pm, and Loader.pm modules.
*It no longer uses the AddressBook.pm module.
*If we have a Perl module for getting the inventor, why do we not have an inventors table in the database? THIS IS A GOOD QUESTION!
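A minimal sketch of that kind of directory wrapper (the script name "PatentParser.pl" and the assumption that it takes a single xml file as its argument come from the notes above; the wrapper itself is hypothetical):
<pre>
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrapper: run PatentParser over every xml file in one Processed/year directory.
my $dir = $ARGV[0] or die "Usage: perl parse_directory.pl E:/McNair/PatentData/Processed/<year>\n";

opendir my $dh, $dir or die "Cannot open $dir: $!\n";
my @xml_files = grep { /\.xml$/i } readdir $dh;
closedir $dh;

foreach my $file (@xml_files) {
    # Assumes PatentParser takes one xml file per run, as described above.
    system('perl', 'PatentParser.pl', "$dir/$file") == 0
        or warn "PatentParser failed on $file\n";
}
print "Parsed ", scalar(@xml_files), " files from $dir\n";
</pre>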
For the USPTO Assignment Data, the parsing file is called USPTO_Assignee_XML_parser. It takes the path to the files that need to be parsed (an example, I think, is "E:/PatentData/Processed/year", where "year" is the name of the folder containing the xml files to be parsed). It iterates through all the files in the "year" directory that you passed, and it loads the information directly into the database while it parses each file.
5) This parser will open an ODBC (or similar) connection to a database on the RDP's installation of Postgres. It will then put the data directly into this database. Once complete, we manually move the tables to the dbase server's database (i.e., patent).
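I have not confirmed exactly how the parser connects, but in Perl the usual way to load rows directly into a local Postgres instance is DBI with DBD::Pg. A minimal sketch; the database name, table, columns, and credentials here are placeholders, not the real schema:
<pre>
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Hypothetical sketch of loading parsed rows into the local (RDP) Postgres instance.
my $dbh = DBI->connect('dbi:Pg:dbname=patent_staging;host=localhost;port=5432',
                       'postgres', 'password',
                       { AutoCommit => 0, RaiseError => 1 });

my $sth = $dbh->prepare(
    'INSERT INTO assignment_staging (reel_no, frame_no, assignee_name) VALUES (?, ?, ?)'
);

# In the real parser these values would come from the XML being parsed.
$sth->execute('12345', '0001', 'Example Assignee, Inc.');

$dbh->commit;        # commit once the file has been fully parsed
$dbh->disconnect;
</pre>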
== Specifications of USPTO Data To Extract ==
Go to https://bulkdata.uspto.gov/ to get bulk data from the USPTO.
For assignment data, we pull from https://bulkdata.uspto.gov/data2/patent/assignment/. The description of all the elements in a USPTO assignment XML file is given by a DTD; the DTD for the assignment data can be found on the bulk drive under McNair/Projects/Redesigning Patent Database/USPTO Assignment DTD
I'm currently looking at the DTDs for USPTO patent data (the 2005-and-later DTDs have to be opened with Microsoft Visual Studio) to ascertain whether there are any fields we're not currently pulling from the bulk data that we should be. I am using the following link to figure out how to read a DTD: http://www.ldodds.com/delta/dtd_guide.html. There are only DTDs for 2005 and later, but there is a very long PDF that appears to detail the format of patent files pre-2005. It has been saved under McNair/Projects/Redesigning Patent Database/
'''Pre-2005'''
* Some tables that will later be deleted were included on the spreadsheet because there are currently tables being built to replace them
* May try to just move all the (twenty-something) "pto-" tables that have been created due to the "Restructuring Patent Data" project from "patent" to the new database
* Will work on understanding the SQL for filling the new database from these links next week:
http://mcnair.bakerinstitute.org/wiki/Patent_Data_Processing_-_SQL_Steps and http://mcnair.bakerinstitute.org/wiki/Patent_Data_Cleanup_-_June_2016
'''4/4/2017''' - Found all the pages on extracting data and making tables and databases
* Do not need to pull the Harvard Dataverse data again; it's saved in CSV files on the bulk drive
* Started looking through DTDs for USPTO patent and assignment data to determine if there is extra information that we should extract from USPTO data.