Changes

Redesigning Patent Database (view source)

Revision as of 09:52, 24 May 2017

693 bytes removed , 09:52, 24 May 2017

use proper form for internal links to wiki

==Related Projects==

*[[Patent Assignment Data Restructure]]

*[[Small Inventors Project]] - uses Fee Status and Citations

*[[Medical Centers and Grants]] - uses patent assignees, specifically their zipcodes and organizations

'''As of 3/21/2017 the most up-to-date database containing patent data is "patent", not "allpatent" or "allpatent_clone", and "patent" is the database that the other patent data redesign project, Restructuring Patent Data (link above), is working with. The database "allpatent" has since been removed, but it can be restored if it is needed.'''

[[Patent Data]] - overview of what the data is and where it came from; probably the starting point for changing documentation

[[Patent|Patent Database]] - overview of the schema of the database (specifically, the database "patent", which includes data from Harvard Dataverse (originally stored in patentdata) and USPTO (patent_2015))

[[USPTOAssigneesData|USPTO Assignees Database]] - enhances assignee info in the patent database, also being redesigned

[[Patent_Data_Issues|Problems with Patent Database]] - lists issues with the current schema

[[Data_Model|Previous ER Diagram]] - does not match up with the schema described in [[Patent|Patent Database]] and contains an outdated list of what we want to pull from XML files

[[Patent_Data_Processing_-_SQL_Steps|Processing Patent Data]] - states that allpatent is the newest database and an amalgamation of patentdata and patent_2015

== Description ==

The following pages are relevant to how previous databases were built and how to build tables in the database:

[[Harvard_Dataverse_Data]] - explains how to make tables from Harvard Dataverse data, where to find scripts, etc.

[[USPTO_Bulk_Data_Processing|USPTO Data]] - explains how to make tables from USPTO data, where to find scripts, etc., specifically for assignment data.

[[Patent_Data_Extraction_Scripts_(Tool)|Patent Data Extraction]] - explains locations of XML files and lists (at the bottom) where the Perl scripts can be found

[[Patent_Data_Cleanup_(June_2016)|Patent Data Cleanup]] - explains changes that were made to clean up problems in the database allpatent as a result of merging the Harvard Dataverse data and the USPTO data

[[Patent_Data_Processing_-_SQL_Steps|Patent Data Processing - SQL Steps]] - explains the SQL needed to merge two existing databases, one that contained the Harvard Dataverse data and one that contained the USPTO data

Here are the instructions I'm developing for downloading, parsing, and hopefully adding new data to the database, since the documentation is very sparse (they can also be found under McNair/Projects/Redesigning Patent Database/Instructions on how to download patent data form USPTO bulk data).

Existing documentation that seems relevant to cleaning/moving the USPTO Assignee data over from the RDP database:

*[[USPTO_Bulk_Data_Processing]]
*[[PTO_Tables]]
*[[USPTOAssigneesData]]

The xml_parser2.plx creates four tables: Assignment, Assignees, Assignors, and Properties. These appear to correspond to ptoassignment, ptoassignee, ptoassignor, and ptoproperty, respectively in the "patent" database. There is another pto table called "ptopatentfile" that has the following schema, but I cannot find out how it is populated; the xml_parser2.plx does not create this table.
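The correspondence described above can be written down as a small lookup, which also makes the open question about "ptopatentfile" explicit. This is only a sketch: the mapping is the apparent one stated in the text, not something confirmed against the parser, and the helper name <code>unexplained_tables</code> is made up for illustration.

```python
# Apparent mapping from xml_parser2.plx output tables to tables in the
# "patent" database, as described in the text (not verified against the script).
PARSER_TO_PTO = {
    "Assignment": "ptoassignment",
    "Assignees": "ptoassignee",
    "Assignors": "ptoassignor",
    "Properties": "ptoproperty",
}


def unexplained_tables(pto_tables):
    """Return the pto tables that no parser output table accounts for."""
    produced = set(PARSER_TO_PTO.values())
    return sorted(t for t in pto_tables if t not in produced)


# The five pto tables named in the text.
known_pto_tables = [
    "ptoassignment",
    "ptoassignee",
    "ptoassignor",
    "ptoproperty",
    "ptopatentfile",
]

print(unexplained_tables(known_pto_tables))  # ['ptopatentfile']
```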

Where "MaintFeeEvents_20170410-wHeaders" is the name of the file with the added headers at the top. This script will put the normalized (cleaned) file in MaintFeeEvents_20170410-wHeaders-normal.txt (it basically appends "-normal" to whatever file name you pass it).
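To predict where the normalized file will land, the naming rule can be sketched as below. This is an assumption based only on the description above (append "-normal" and write a .txt file); the function name <code>normalized_name</code> is hypothetical and the real script may treat extensions differently.

```python
from pathlib import Path


def normalized_name(input_name: str) -> str:
    """Guess the output file name: the input name plus "-normal", as a .txt file.

    Assumption: an existing ".txt" extension is kept at the end rather than
    being doubled; any other name is used whole.
    """
    path = Path(input_name)
    base = path.stem if path.suffix == ".txt" else path.name
    return str(path.with_name(base + "-normal.txt"))


print(normalized_name("MaintFeeEvents_20170410-wHeaders"))
```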

To then make a table out of the normalized text file, use the SQL detailed on [[Patent_Expiration_Rules]].

This will create entirely new tables from the maintenance fee data. To avoid repeating data, we will most likely just replace the existing tables in the database with the new tables.
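The "build new, then replace" idea can be sketched as follows. This is not the SQL from [[Patent_Expiration_Rules]]; it is a stand-in using Python's built-in sqlite3 module with an in-memory database, and it assumes the normalized file is tab-delimited with a header row. The table name <code>maintfee_events</code>, the column names, and the sample rows are all made up for illustration.

```python
import csv
import io
import sqlite3


def replace_table(conn, table, header, rows):
    """Load rows into a fresh "<table>_new", then swap it in for the old table."""
    cols = ", ".join(f'"{c}" TEXT' for c in header)
    marks = ", ".join("?" for _ in header)
    with conn:  # commit when the block finishes without error
        conn.execute(f'DROP TABLE IF EXISTS "{table}_new"')
        conn.execute(f'CREATE TABLE "{table}_new" ({cols})')
        conn.executemany(f'INSERT INTO "{table}_new" VALUES ({marks})', rows)
        conn.execute(f'DROP TABLE IF EXISTS "{table}"')
        conn.execute(f'ALTER TABLE "{table}_new" RENAME TO "{table}"')


def load_normalized(conn, table, text, delimiter="\t"):
    """Parse delimited text whose first row is the header and replace the table."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    replace_table(conn, table, header, list(reader))


conn = sqlite3.connect(":memory:")
# Hypothetical sample content; the real file has different columns and values.
sample = "patent\tevent_code\n0000001\tCODE_A\n0000002\tCODE_B\n"

load_normalized(conn, "maintfee_events", sample)
load_normalized(conn, "maintfee_events", sample)  # reloading replaces, it does not append

print(conn.execute('SELECT COUNT(*) FROM "maintfee_events"').fetchone()[0])  # 2
```

Replacing the whole table, rather than appending, is what keeps a reload from duplicating rows that were already present.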

* Some tables that will later be deleted were included on the spreadsheet because there are currently tables being built to replace them

* May try to just move all the (twenty-something) "pto-" tables that have been created due to the "Restructuring Patent Data" project from "patent" to the new database

* Will work on understanding the SQL for filling the new database from these links next week: [[Patent_Data_Processing_-_SQL_Steps]] and [[Patent_Data_Cleanup_(June_2016)]]

'''4/4/2017''' - Found all the pages on extracting data and making tables and databases

* Do not need to pull Harvard Dataverse data again, it's saved in CSV files on the bulk drive

* Started looking through DTDs for USPTO patent and assignment data to determine if there is extra information that we should extract from USPTO data.

OliverC

