Redesigning Patent Database
Redesigning Patent Database | |
---|---|
Project Information | |
Project Title | Redesigning Patent Database |
Owner | Shelby Bice |
Start Date | 201701 |
Deadline | 201705 |
Keywords | Database, Patent |
Primary Billing | |
Notes | |
Has project status | Active |
Copyright © 2016 edegan.com. All Rights Reserved. |
Documentation on the process and eventual designs for the new patent database.
Deliverables by end of semester: Design and build new patent database with core tables from data, add tables from Patent Data Restructure and Small Inventors Project, write thorough documentation on schema of new patent database and instructions on what documentation should be written when it is altered
TODO on 3/21/2017 and 3/23/2017
- Determine what tables in database need to always be there (patents, inventors, etc.) and which tables were added for separate projects
- Start documentation for the "core" tables of the databases that should not be deleted.
- Make ER Diagram for "core" tables in database
- Start documentation for how one should go about changing/altering the "core" tables
- Determine if any current tables in the database "patent" that I do not have currently listed as "core" tables should be added as core tables by determining which project(s) they're related to:
- For example, I know what I should include lexjudge in the new schema, but now they are several other "lex-" tables. Do these needs to be considered "core" tables?
- Evaluate perl scripts for pulling data for new database
Future:
- Start documentation for how new tables should be added to database
Contents
Redesigning Patent Database
- Design a better representation for database
- Fix scripts if necessary
- Start moving data into new database by querying existing databases (using SQL)
- Use scripts to query new data
- Test database
- Remove extraneous information from database (copies, patents that we're not interested in, etc.)
Documentation I need to include:
- Schema of new database (with justification of design), would like to include a visual representation
- SQL commands that were used to fill database with explanation of what they do
- Clear instructions on where to find scripts in bulk drive and an explanation of what each script does
- Visual representation of example table entries that isn't just copied and pasted from a CSV file
Related Projects
- Patent Data Restructure
- Small Inventors Project - uses Fee Status and Citations
- Medical Centers and Grants - uses patent assignees, specifically their zipcodes and organizations
Documentation Relevant to Current Patent Database
Previous documentation on the patent database:
As of 3/21/2017 the most up-to-date database containing patent data is "patent" not "allpatent" or "allpatent_clone" and "patent" is the database that the the other patent data redesign project, Restructuring Patent Data (link above) is working with.
Patent Data - overview of what the data is and where it came from
Patent Database - overview of schema of database (specifically, the database allpatent, which includes data from Harvard dataverse (originally stored in patentdata) and USPTO (patent_2015)
USPTO Assignees Database - enhances assignee info in patent database, also being redesigned
Problems with Patent Database - lists issues with current schema
Previous ER Diagram - does not match up with schema described in Patent Database and contains outdated list of what we want to pull from XML files
Processing Patent Data - states that allpatent is the newest database and an amalgamation of patentdata or patent_2015
Description
The purpose of this project is to create a new, redesigned database to hold all of the patent information that the McNair Center has accumulated and document the process so that the design can be easily understood and replicated or edited as needed.
This database will include design patents, utility patents, and reissues.
Development
Design will be built upon a relational database model. I will be referencing this article on database design as I develop the design (http://en.tekstenuitleg.net/articles/software/database-design-tutorial/one-to-many.html), and I will be creating an ER diagram using ERDPlus or Creately.
Current Design and Scripts information
The scripts for querying data for the patent database exist in McNair/software/scripts/patent. If the design for the schema of the database tables are altered too much, may have to write new scripts.
Test Plan
Log:
2/16/2017 - Talked over project with Ed, began reading existing wiki pages related to patent data and databases
2/21/2017 - Brushed up on SQL, Entity - Relationship model of designing databases
- In the documentation, I want to briefly explain what the entity-relationship model is before including
the diagram so that readers have a little bit of background
- Found a tool for creating a visual representation called ERDPlus.com - create a standalone instead of an account, can download
Learning commands from Patent Data - SQL Steps
- copy command is PostgreSQL that copies a SQL table to a text file
- DELIMITER set what will separate columns in text file
- HEADER specifies that there will be a header in the text file with the names of the columns
- Definitely need to include more detail about what these do in the documentation
- insert into command inserts a new entry into the table
2/23/2017 - Read great database design article, dug through some more wiki articles, started reviewing Perl
- What client do we use to interact with the current patent database?
- Will need to determine all the fields that need to be included in the database before finishing the design and ER diagram, will need Ed's input
3/2/2017 - Started compiling a list of what fields to include and how they would be related.
- Created an Excel spreadsheet that records the each table, their current attributes in the existing patentdata table, what I think the attributes should be in the new table, their relationship to a patent (i.e. one-to-many, many-to-many, etc.), their primary key, questions I have relating to the table, future steps for cleaning up the data in the table (i.e., once all the data has been move to the new database, removing patents that are not US-based), and current problems that have been recorded with the existing table for that information (if an existing table exists)
- Once Excel spreadsheet is completed (and questions in the Questions column are answered or removed from the spreadsheet entirely) I will look into trying to embed it on my "Redesigning Patent Database" wiki page so that future users can sort of follow my thought process. I will also create separate wiki pages to explain each table once the new database is created
- instructions for adding a table to a wiki page https://www.mediawiki.org/wiki/Help:Tables
- Making spreadsheet led me to realize there is some data that is repeated (filedate in fee table when it is also located in patent table, and the fee table includes the patent
3/7/2017 - Continued working on spreadsheet, update project page with relevant links, developed following plan for documentation
- Most important thing for new documentation in my opinion is a clear flow, i.e., one central page on the database, which contains links to the following:
- Where data is (patent data) where it came from (Harvard Dataverse, etc), the scripts/SQL used to obtain the data, and where original xml files can be located
- Next, there should be an ER diagram that shows the general design, followed by short descriptions of each table is, what it's related to, and what it's use is
- Each table should contain a link to a separate wiki page on the table which will explain all the columns, where the data came from, provide a visual example of data in the table, list issues, and also contain a log of what has been done to the table over time
- there should also be instructions on the page for how, if someone is going to change the database (add a table, alter a column, etc.) which documentation they should update (the wikipage for the table, for example)
- Links to USPTO Assignee Database with an explanation of how it is joined with the Patent Database
- Lastly, everything should be dated. I know that each wiki-page lists when it was last edited, but the main page, next to table descriptions, should have the date the table was originally added. The ER diagram should be updated as needed and also dated so that readers know if it's the most current representation of the database.
3/9/2017 - Finished up first draft of spreadsheet, found ER diagram tools
- Old ER diagram (see Documentation Relevant to Current Databases) is difficult to read - it appears to show tables that don't exist and data that doesn't exist in the current schema
- psql allpatent does not work. psql allpatent_clone does for some reason
3/21/2017 - Working on determining "core" tablse that will always be in database and understanding work on assignees info
- http://postgresonline.com/special_feature.php?sf_name=postgresql83_psql_cheatsheet&outputformat=html
- helpful commands for interacting with psql
- Notes on determining "core" tables for database are saved under Projects/Redesigning Patent Database/Determining which tables should be core tables