<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>http://www.edegan.com/mediawiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=ShelbyBice</id>
	<title>edegan.com - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="http://www.edegan.com/mediawiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=ShelbyBice"/>
	<link rel="alternate" type="text/html" href="http://www.edegan.com/wiki/Special:Contributions/ShelbyBice"/>
	<updated>2026-06-09T23:26:32Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.34.2</generator>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=22348</id>
		<title>Redesign Assignment and Patent Database</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=22348"/>
		<updated>2017-12-08T20:42:23Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Redesign Assignment and Patent Database&lt;br /&gt;
|Has owner=Shelby Bice,&lt;br /&gt;
|Has start date=9/2017&lt;br /&gt;
|Has keywords=patent&lt;br /&gt;
|Is dependent on=Reproducible Patent Data,&lt;br /&gt;
}}&lt;br /&gt;
'''Final Notes''' This is more or less the finalized design for the patent database. Oliver Chang's code (see Reproducible Patent Data, which is his project page) largely fits the design, though there are some differences, including differences in variable names. In the future the code should be altered so that the variable names match and each table listed here exists in the database. The tables for the extra variables that exist for reissue, design, and plant patents have been added, and the instructions for adding tables can be found on Reproducible Patent Data. These three tables should fit the schema seen here, but as stated previously the schema in the code DOES NOT fit the schema here exactly and should be altered in the future to fit this schema. &lt;br /&gt;
&lt;br /&gt;
'''For Oliver:''' Unfortunately I did not get to finish making the schema in your code fit the schema that is outlined here. For whoever works on this next, whether that be you or another intern, please note that the variables in the schema in the code do not match up exactly with the schema outlined here, except for perhaps the Reissues, Plants, and Designs tables in the Patent database (called patentsj in the code, I believe, but you can create a database with any name of course).&lt;br /&gt;
&lt;br /&gt;
This is an extension of the work I did last semester under &amp;quot;Redesigning Patent Database&amp;quot;. Instead of simply reconfiguring the existing database, this project encompasses a full redesign and creation of a new Patent database and a new Assignment database that will be joined together.&lt;br /&gt;
&lt;br /&gt;
Reference for definitions of fields in the two databases: https://www.oecd.org/sti/sci-tech/37569498.pdf&lt;br /&gt;
&lt;br /&gt;
Paper on how to extract data from XML files, parse it, and put it into a database: https://funginstitute.berkeley.edu/wp-content/uploads/2014/06/patentprocessor.pdf&lt;br /&gt;
&lt;br /&gt;
'''Particularly useful, because it provides solutions for cleaning location and firm data'''&lt;br /&gt;
&lt;br /&gt;
==ER Diagram for Assignment and Patent Databases==&lt;br /&gt;
&lt;br /&gt;
Note: this matches the design below but may not match the exact database (nor the code that currently creates the schema of the database). In the future these should be made to sync up.&lt;br /&gt;
&lt;br /&gt;
[[File:Erdplus-diagram (3).png]]&lt;br /&gt;
&lt;br /&gt;
Attributes for each table are listed below with descriptions - because of how many attributes there are, I decided the ER diagram would be better suited as an overview of the tables rather than trying to show all the attributes on the diagram.&lt;br /&gt;
&lt;br /&gt;
If you would like to edit the ER diagram, go to https://erdplus.com/ and click &amp;quot;Open Diagram File&amp;quot; under the &amp;quot;Diagram&amp;quot; dropdown, then navigate to E:/McNair/Project/Redesigning Patent Database/New Patent Database Project/ERDiagramforPatentandAssignmentDatabases.erdplus. You should then be able to edit the file.&lt;br /&gt;
&lt;br /&gt;
==Assignment Database Structure==&lt;br /&gt;
&lt;br /&gt;
After reading through the paper &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot; (https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf) and taking into account our needs for the Assignment database (the ability to connect the two databases together, the ability to trace conveyance of a patent over time), I decided on the following structure for the Assignment database.&lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table in the database is rf_id, which stands for reel frame id. It is unique for each entry in the main table, assignment, and is included in every table. &lt;br /&gt;
&lt;br /&gt;
Conveyance is the only table that cannot be fully constructed as data is inserted into the database. Some of the fields depend on other tables being constructed, so it may either be partially populated while the other tables are being populated or be the last table to be populated. &lt;br /&gt;
&lt;br /&gt;
Full list of tables: assignee, assignor, document_info, assignment, conveyance&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: store all the assignees from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the recipient of a patent in an entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id, unless it is possible to have multiple assignees, in which case rf_id and assignee_name&lt;br /&gt;
&lt;br /&gt;
All columns (variables) in the table: &lt;br /&gt;
&lt;br /&gt;
*rf_id (bigint) reel frame number &lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* address_line_1 (varchar(255)) first line of address of assignee&lt;br /&gt;
* address_line_2 (varchar(255)) second line of address of assignee&lt;br /&gt;
* city (varchar(255)) city of assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* country (varchar(255)) country of assignee&lt;br /&gt;
* postal_code (varchar(255)) post code of the assignee&lt;br /&gt;
&lt;br /&gt;
===ASSIGNOR===&lt;br /&gt;
	&lt;br /&gt;
Purpose: store all the assignors from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the one assigning/granting the license for the entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id and assignor_name&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
&lt;br /&gt;
===CONVEYANCE===&lt;br /&gt;
Purpose: represent the conveyance of a particular patent over time&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual transaction for a patent (not unlike assignment, except this table has information about the assignment in relation to other assignments for the patent)&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id &lt;br /&gt;
&lt;br /&gt;
All columns (variables): Two possibilities, depending on whether we want data to be repeated or not.&lt;br /&gt;
	&lt;br /&gt;
Not-repeated (with exception of primary/foreign key): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* next_transaction (bigint) rf_id of next transaction that occurred after this transaction (traced by assignee/assignor)&lt;br /&gt;
* num_conveyance (int) number in order of conveyance by date&lt;br /&gt;
* acknowledgment_date (date) date that the assignor acknowledged the transaction&lt;br /&gt;
* execution_date (date) date that the transaction took place&lt;br /&gt;
* conveyance_type (varchar(255)) the USPTO paper determined a couple of different conveyance types and also outlined how to determine based on the conveyance text in the xml file which conveyance type is relevant to the transaction. The conveyance types are:&lt;br /&gt;
** assignment&lt;br /&gt;
** merger&lt;br /&gt;
** change of name&lt;br /&gt;
** government interest&lt;br /&gt;
** agreement&lt;br /&gt;
** security agreement&lt;br /&gt;
** release&lt;br /&gt;
&lt;br /&gt;
	&lt;br /&gt;
Repeated, also include: &lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
* patent_number (varchar(255)) identifying number for the patent&lt;br /&gt;
&lt;br /&gt;
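As a rough illustration of how num_conveyance and next_transaction could be filled in once the other tables are populated, here is a minimal Python sketch. It makes two simplifying assumptions that are not part of the design above: each rf_id covers a single patent, and the next transaction is simply the next one by execution_date rather than being traced by assignee/assignor. The function name build_conveyance and the sample rows are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict
from datetime import date


def build_conveyance(rows):
    """rows: iterable of (rf_id, patent_number, execution_date) tuples.

    Returns {rf_id: (num_conveyance, next_transaction)}; next_transaction is
    None for the most recent transaction of a patent.
    Assumption: ordering by date stands in for tracing assignee/assignor.
    """
    by_patent = defaultdict(list)
    for rf_id, patent_number, execution_date in rows:
        by_patent[patent_number].append((execution_date, rf_id))

    result = {}
    for transactions in by_patent.values():
        transactions.sort()  # oldest first; ties are broken by rf_id
        for position, (_, rf_id) in enumerate(transactions):
            is_last = position == len(transactions) - 1
            next_rf_id = None if is_last else transactions[position + 1][1]
            result[rf_id] = (position + 1, next_rf_id)
    return result


# Hypothetical sample data: three transactions for one patent, one for another.
sample = [
    (300, "7000001", date(2015, 6, 1)),
    (100, "7000001", date(2010, 1, 15)),
    (200, "7000001", date(2012, 3, 9)),
    (400, "D500001", date(2011, 7, 4)),
]
conveyance = build_conveyance(sample)
```
&lt;br /&gt;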
===DOCUMENT_INFO===&lt;br /&gt;
Purpose: store extra information relevant to the patents represented in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the documentation and the patent for a particular entry in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables) when created: &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* app_num (varchar(255)) application document USPTO number &lt;br /&gt;
* app_date (date) date that the application was filed&lt;br /&gt;
* app_country (varchar(255)) country in which it was filed&lt;br /&gt;
* pgpub_num (varchar(255)) pre-grant publication document USPTO number&lt;br /&gt;
* pgpub_date (date) date that pre-grant publication was released&lt;br /&gt;
* pgpub_country (varchar(255)) country in which the patent is published before being granted a patent number&lt;br /&gt;
* patent_number (varchar(255)) granted patent document USPTO number&lt;br /&gt;
* grant_date (date) date that the patent is officially published&lt;br /&gt;
* grant_country (varchar(255)) country in which the patent is published after being granted a patent number&lt;br /&gt;
* invention_title (varchar(255)) title of the invention&lt;br /&gt;
* language (varchar(255)) language of invention title – could be potentially useful and interesting to investigate &lt;br /&gt;
* reel_num (bigint) number of the reel the assignment was stored on&lt;br /&gt;
* frame_num (bigint) number of the frame on the reel the assignment was stored on&lt;br /&gt;
&lt;br /&gt;
===ASSIGNMENT===&lt;br /&gt;
Purpose: represent assignment transactions – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent assignment transaction&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignment_id (bigint) US patent assignment id &lt;br /&gt;
* correspondent_name (varchar(255)) name of correspondent for the transaction, could be a lawyer&lt;br /&gt;
* correspondent_address1 (varchar(255)) first line of address for correspondent&lt;br /&gt;
* correspondent_address2 (varchar(255)) second line of address for correspondent&lt;br /&gt;
* correspondent_address3 (varchar(255)) third line of address for correspondent&lt;br /&gt;
* correspondent_address4 (varchar(255)) fourth line of address for correspondent&lt;br /&gt;
* record_date (date) date recorded with USPTO&lt;br /&gt;
* last_update_date (date) date that information for this assignment was last updated with the USPTO &lt;br /&gt;
* page_cnt (bigint) page count of assignment record&lt;br /&gt;
&lt;br /&gt;
More extensive notes exist under E:/McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper&lt;br /&gt;
&lt;br /&gt;
==Patent Database Structure==&lt;br /&gt;
&lt;br /&gt;
Most of this design is based off what I investigated last semester and therefore is subject to change, since the design for the Assignment database is somewhat new. &lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table is patent_number, which uniquely identifies every patent in the PATENT table and is included in every table.&lt;br /&gt;
&lt;br /&gt;
Full list of tables: patent, assignee, citation, fee, claims, histpatent, lawyers, inventors&lt;br /&gt;
&lt;br /&gt;
Update: These tables, as of October 26, are being updated and changed as I determine what data can actually be retrieved from the XML files based on the xpaths that we have determined (you can see these xpaths on the &amp;quot;Equivalent XPath and APS Queries&amp;quot; project page).&lt;br /&gt;
&lt;br /&gt;
===PATENT===&lt;br /&gt;
Purpose: represent patents – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* patent_type (varchar(255)) the type of patent, like utility&lt;br /&gt;
* patent_kind (varchar(2)) the kind of patent, a letter and a number&lt;br /&gt;
* title (varchar(255)) the title of the patent&lt;br /&gt;
* grantdate (date) date that patent was officially granted including the year&lt;br /&gt;
* prioritydate (date) date used to establish novelty of an invention with regard to other inventions (source: http://www.bios.net/daisy/patentlens/2343.html)&lt;br /&gt;
* prioritycountry (varchar(255)) country where patent is first filed&lt;br /&gt;
* patent_country (varchar(255)) country in which this patent was published&lt;br /&gt;
* prioritypatentnumber (varchar(255)) placeholder patent number if the patent is published before it is fully approved - also referred to as priority claims number&lt;br /&gt;
* cpcsubclass (varchar(255)) cooperative patent classification subclass&lt;br /&gt;
* cpcsubgroup (varchar(255)) cooperative patent classification subgroup&lt;br /&gt;
* cpcmaingroup (varchar(255)) cooperative patent classification main group&lt;br /&gt;
* cpctotal (varchar(255)) concatenated version of cpcmaingroup, cpcsubgroup, cpcsubclass&lt;br /&gt;
* pctpatentnumber (varchar(255)) international patent number (according to Patent Cooperation Treaty) also known as PCT Document Number&lt;br /&gt;
* appnum(varchar(255)) application number, probably also the filing number - the format is a two digit series code followed by a six digit serial number (source: https://www.uspto.gov/patents-application-process/filing-online/info-application-number)&lt;br /&gt;
* appdate (date) date that the application was filed, including year - probably also the filing date&lt;br /&gt;
* ipcrsubclass (varchar(255)) International patent classification subclass&lt;br /&gt;
* ipcrmaingroup (varchar(255)) International patent classification main group&lt;br /&gt;
* ipcrsubgroup (varchar(255)) International patent classification subgroup&lt;br /&gt;
* ipcrtotal (varchar(255)) concatenated version of ipcrsubgroup, ipcrsubclass, ipcrmaingroup &lt;br /&gt;
* national_classification (varchar(255)) - probably uspc, which is United States Patent Classification&lt;br /&gt;
* national_classification_country (varchar(255)) - should be United States&lt;br /&gt;
* num_claims (int) number of claims (things that the inventor wishes to protect with this patent)&lt;br /&gt;
&lt;br /&gt;
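The appnum description above (a two digit series code followed by a six digit serial number) could be checked when loading data. The following is a minimal sketch, not part of the existing code: the helper name split_appnum is hypothetical, and accepting an optional slash between the two parts is an assumption about how the value might be written.&lt;br /&gt;
&lt;br /&gt;
```python
import re

# Two-digit series code, optional slash (an assumption), six-digit serial number.
APPNUM_PATTERN = re.compile(r"([0-9]{2})/?([0-9]{6})")


def split_appnum(appnum):
    """Return (series_code, serial_number), or None if the format differs."""
    match = APPNUM_PATTERN.fullmatch(appnum.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)
```
&lt;br /&gt;
Values that do not fit the pattern return None, so they can be logged and inspected instead of being inserted silently.&lt;br /&gt;
&lt;br /&gt;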
Discrepancies: &lt;br /&gt;
These fields were previously a part of the table, but we do not currently have xpaths for them:&lt;br /&gt;
* app_year - the year the application was filed&lt;br /&gt;
* grant_year - the year the patent was granted&lt;br /&gt;
* uspc - United States Patent Classification (might be the same as national_classification)&lt;br /&gt;
* uspcsub - United States Patent subgroup&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: Represent the person the patent was assigned to at the time of publication, according to the patent information (the information in the Assignment database is more complete) &lt;br /&gt;
&lt;br /&gt;
Each entry represents: the assignee for a patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number, first_name, last_name (in case it is possible for there to be multiple assignees for an individual patent)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar &lt;br /&gt;
* first_name (varchar(255)) first name of assignee&lt;br /&gt;
* last_name (varchar(255)) last name of assignee&lt;br /&gt;
* address (varchar(255)) address of assignee&lt;br /&gt;
* postcode (varchar(255)) postcode of assignee&lt;br /&gt;
* orgname (varchar(255)) name of the organization owning the patent (if applicable - organization may go by multiple orgnames) &lt;br /&gt;
* city (varchar(255)) city of the assignee&lt;br /&gt;
* country (varchar(255)) country of the assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* residence (varchar(255)) mailing address of assignee if they receive mail at a different address than the one listed above&lt;br /&gt;
* size_of_firm (varchar(255)) size of the organization (large, small, or micro), which relates to how much they pay in fees for the patent&lt;br /&gt;
&lt;br /&gt;
===CITATION===&lt;br /&gt;
Purpose: Represent all the patents in PATENT that another patent in PATENT cites&lt;br /&gt;
&lt;br /&gt;
Each entry represents: one citation that a patent in PATENT makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number AND cited_patent_number (because a patent might cite multiple patents, so both variables are needed for the primary key)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* cited_patent_number (varchar(255)) unique identifier from the USPTO for the patent being cited&lt;br /&gt;
* description (varchar(255)) description of the citation&lt;br /&gt;
* citation_date (date) unclear what this date is supposed to represent from the DTD, but according to the paper that is the second link in the top section, it is the &amp;quot;date of cited document&amp;quot;&lt;br /&gt;
* citation_kind (varchar(255)) type of cited document&lt;br /&gt;
* citation_country (varchar(255)) country of origin for the cited patent &lt;br /&gt;
* citation_name (varchar(255)) appears to be the names of those to whom the cited patent is assigned&lt;br /&gt;
&lt;br /&gt;
Note: There are other fields available for this table, such as document number, but this seems like it might be a repeat of the document number for the citing patent&lt;br /&gt;
&lt;br /&gt;
===FEE===&lt;br /&gt;
Purpose: Represent information about the fees paid for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the fees paid on the patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar  &lt;br /&gt;
* four (boolean) payment of maintenance fee 4th year&lt;br /&gt;
* eight (boolean) payment of maintenance fee 8th year&lt;br /&gt;
* twelve (boolean) payment of maintenance fee 12th year&lt;br /&gt;
* fee_date (date) date that fee is paid on&lt;br /&gt;
* fee_code (varchar(255)) code for type of fee to pay&lt;br /&gt;
&lt;br /&gt;
===CLAIMS===&lt;br /&gt;
&lt;br /&gt;
Purpose: Represent information about the claims for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a claim that a patent makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number and claim_id&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* claim_id (varchar(255)) id number for the claim&lt;br /&gt;
* language (varchar(255)) language of claim&lt;br /&gt;
* claim_type (varchar(255)) type of claim&lt;br /&gt;
* status (varchar(255)) status of the claim&lt;br /&gt;
* claim_text (text) the text of the claim&lt;br /&gt;
&lt;br /&gt;
===HISTPATENT===&lt;br /&gt;
Purpose: Represent historical data about the patent pertaining to dates and publication&lt;br /&gt;
&lt;br /&gt;
Each entry represents: sequence of historical data about an individual patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* pubno (varchar(255)) patent publication number&lt;br /&gt;
* pubdate (date) date patent was issued&lt;br /&gt;
* dispdate (date) disposal date for the application&lt;br /&gt;
* distype (varchar(3)) application status of the patent - ABN for abandoned, ISS for issued, or PEN for pending&lt;br /&gt;
* exp_dt (date) expiration date of patent&lt;br /&gt;
* pta (int) length of patent term adjustment (which extends the amount of time the patent is allowed to be in force) in number of days&lt;br /&gt;
&lt;br /&gt;
===INVENTORS===&lt;br /&gt;
Purpose: Represent the inventor for each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about an inventor for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (and name, if there can be multiple inventors for a single patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identify the inventor?&lt;br /&gt;
* name (varchar(255)) concatenated string of the first and last name of the inventor&lt;br /&gt;
* orgname (varchar(255)) name of the organization the inventor works for&lt;br /&gt;
* state (varchar(255)) - state in which the inventor resided&lt;br /&gt;
* address (varchar(255)) - address at which the inventor resided&lt;br /&gt;
* country (varchar(255)) country in which the inventor resided&lt;br /&gt;
* city (varchar(255)) city in which the inventor resided&lt;br /&gt;
&lt;br /&gt;
===LAWYERS===&lt;br /&gt;
Purpose: Represent the lawyers who worked on each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a lawyer for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (there doesn't appear to be more than one lawyer entry for each patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identify the lawyer?&lt;br /&gt;
* org_name (varchar(255)) the name of the firm that did arbitration for the patent&lt;br /&gt;
&lt;br /&gt;
===Unique Attributes Tables===&lt;br /&gt;
&lt;br /&gt;
There are several attributes that have been identified as unique, applying to only one kind of patent. Adding all these fields to the main patent table would introduce a lot of NULL entries, because three out of the four kinds of patent will not have any particular attribute. Therefore, I have created three new tables to store these unique attributes - one for each type of patent with unique attributes. &lt;br /&gt;
&lt;br /&gt;
For each table, the primary key is the patent_number, which connects a row to a row in the main patent table. &lt;br /&gt;
&lt;br /&gt;
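To illustrate why a separate table avoids storing NULLs, here is a minimal sketch using Python's built-in sqlite3 module. The in-memory database, the reduced column lists, and the sample rows are assumptions for illustration only; the real tables have the columns listed on this page and may live in a different database engine.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patent (
        patent_number VARCHAR(255) PRIMARY KEY,
        patent_type   VARCHAR(255)
    );
    -- Only plant patents get a row here, so no NULL latin_name is stored.
    CREATE TABLE plant (
        patent_no  VARCHAR(255) PRIMARY KEY REFERENCES patent (patent_number),
        latin_name VARCHAR(255)
    );
    INSERT INTO patent VALUES ('7000001', 'utility'), ('PP12345', 'plant');
    INSERT INTO plant VALUES ('PP12345', 'Rosa hybrida');
""")

# A LEFT JOIN brings the plant-only attribute back next to every patent row;
# the NULL for the utility patent exists only in the query result.
rows = conn.execute("""
    SELECT p.patent_number, p.patent_type, pl.latin_name
    FROM patent AS p
    LEFT JOIN plant AS pl ON pl.patent_no = p.patent_number
    ORDER BY p.patent_number
""").fetchall()
```
&lt;br /&gt;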
TODO: Figure out best way to generate these tables, finish description of attributes for each table.&lt;br /&gt;
&lt;br /&gt;
* Note that I did include some unique attributes for Utility patents as seen below in the &amp;quot;Finding New Fields Unique to Plant, Reissue, and/or Design Patents&amp;quot; section, but for now (11/3/2017) I'm not going to create a separate table for those fields. Most of them seem to be repeated in Reissue, and the other fields, while interesting, are perhaps not very useful.&lt;br /&gt;
&lt;br /&gt;
====PLANT====&lt;br /&gt;
=====Fields=====&lt;br /&gt;
* patent_no (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* latin_name (varchar(255)) latin name for the plant&lt;br /&gt;
* us_botanical_variety(varchar(255)) denotes what variety of plant it is - for example, a rose has several different varieties&lt;br /&gt;
&lt;br /&gt;
====REISSUE====&lt;br /&gt;
&lt;br /&gt;
In the following table, you'll notice lots of fields related to three different kinds of &amp;quot;documents&amp;quot; - a &amp;quot;parent document&amp;quot;, a &amp;quot;child document&amp;quot;, and a &amp;quot;parent grant document&amp;quot;. It is not immediately clear what these three documents represent for a reissue patent. After some research, I think I have determined what these &amp;quot;documents&amp;quot; are, but please know that I do not have definitive proof, as there is little available information about these &amp;quot;documents&amp;quot; with regard to a reissue patent.&lt;br /&gt;
&lt;br /&gt;
It is possible that the &amp;quot;parent document&amp;quot; is the parent patent application - that is, the first patent application filed in regards to an invention. That would explain why reissue patents also have several fields related to a &amp;quot;child document&amp;quot; - a &amp;quot;child document&amp;quot; could be a child application, which is filed while a parent application is still pending. So in this case, a &amp;quot;child document&amp;quot; of a reissue patent would be an application regarding the same patent that was filed while the reissue application was still pending. A &amp;quot;child document&amp;quot; is either a continuation, divisional, or continuation-in-part application ([http://www.patenttrademarkblog.com/parent-and-child-patent-applications/ source]).&lt;br /&gt;
&lt;br /&gt;
Based on this logic, I think that the difference between a &amp;quot;parent document&amp;quot; and a &amp;quot;parent grant document&amp;quot; depends on whether the parent patent application has been granted. If the parent patent application is still pending (meaning the patent has not yet been granted) then I believe the reissue patent will store information about the parent patent under &amp;quot;parent document&amp;quot;. However, if the parent patent application has been granted, then the information will be stored under &amp;quot;parent grant document&amp;quot;. This seems like the most logical explanation, especially considering that the path to any field related to &amp;quot;parent grant document&amp;quot; contains &amp;quot;parent document&amp;quot;, as in the example below:&lt;br /&gt;
&lt;br /&gt;
 us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
&lt;br /&gt;
Therefore, a particular reissue patent will probably only have the fields filled for either the &amp;quot;parent document&amp;quot; or &amp;quot;parent grant document&amp;quot; if I am right about what they represent.&lt;br /&gt;
&lt;br /&gt;
=====Fields=====&lt;br /&gt;
&lt;br /&gt;
* patent_no (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* parent_doc_status (varchar(255)) status of the parent application. Probably related to whether the patent application is pending or not&lt;br /&gt;
* parent_doc_number (int) probably application number for the parent document&lt;br /&gt;
* parent_doc_id (varchar(255)) it is unclear how this is different from parent document number. In the xpaths, the path will be something like ./parent_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent document&amp;quot; is - or may refer to its purpose (i.e. reissue, continuation, continuation-in-part, etc.)&lt;br /&gt;
* parent_doc_country (varchar(255)) the country of origin of the parent document&lt;br /&gt;
* parent_doc_date (date) probably the date the parent document was filed&lt;br /&gt;
* parent_grant_doc_number (int) probably application number for the parent grant document&lt;br /&gt;
* parent_grant_doc_id (varchar(255)) it is unclear how this is different from parent grant document number. In the xpaths, the path will be something like ./parent_doc/parent-grant-doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_grant_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent grant document&amp;quot; is, or may refer to its purpose (i.e. reissue, continuation, continuation-in-part, etc.)&lt;br /&gt;
* parent_grant_doc_country (varchar(255)) the country of origin of the parent grant document&lt;br /&gt;
* parent_grant_doc_date (date) probably the date the parent grant document was filed&lt;br /&gt;
* child_doc_number (varchar(255)) probably the application number of the child application&lt;br /&gt;
* child_doc_id (varchar(255)) it is unclear how this is different from child document number. In the xpaths, the path will be something like ./child_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* child_doc_country (varchar(255)) country of origin of the child application&lt;br /&gt;
&lt;br /&gt;
====DESIGN====&lt;br /&gt;
=====Fields=====&lt;br /&gt;
* patent_no (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* length_of_grant (int) length of grant, most likely in years&lt;br /&gt;
* hague_registration_date (date) filing date of international patent application&lt;br /&gt;
* hague_filing_date (date) not necessarily the same as the filing date of the international patent application, this is the date that the International Bureau receives all necessary elements for the international patent application&lt;br /&gt;
* hague_registration_pub_date (date) date that the International Bureau publishes the international patent application&lt;br /&gt;
* hague_international_registration_number (varchar(255)) international registration number&lt;br /&gt;
* edition (varchar(255)) possibly the edition of the Classification Locarno which determined main_classification&lt;br /&gt;
* main_classification (varchar(255)) classification for what type of design the patent is for&lt;br /&gt;
&lt;br /&gt;
==Connecting Patent database and Assignment database==&lt;br /&gt;
The answer to connecting the Patent database to the Assignment database lies in using the information in the table DOCUMENT_INFO to connect each assignment in the Assignment database ASSIGNMENT table to a patent_number from the Patent database PATENT table. On further investigation into &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot;, the field they called &amp;quot;grant_doc_num&amp;quot; (patent_number above) in DOCUMENT_INFO is the patent number (the description of DOCUMENT_INFO above has been altered to reflect this).&lt;br /&gt;
&lt;br /&gt;
The paper mentions that there will be errors in the patent number stored in DOCUMENT_INFO under grant_doc_num, so a separate table called DOCUMENT_INFO_ADMIN was constructed to determine how prevalent errors were. They queried the patent number for each appno_doc_num from administrative data. The &amp;quot;appno_doc_num&amp;quot; is the Application Document USPTO number from DOCUMENT_INFO. On further investigation into the paper, it became clear that &amp;quot;administrative data&amp;quot; refers to internal USPTO data that was available to the authors of the paper, but that we might not have. However, in 99% of cases the grant_doc_num and the queried patent number based on the appno_doc_num were the same, so we can probably rely on grant_doc_num.&lt;br /&gt;
&lt;br /&gt;
Therefore, since every table in the Patent database will have the patent number stored, to connect any table in Assignment to the Patent database, it would first be joined with DOCUMENT_INFO on rf_id and then joined with the appropriate table in Patent on patent number.&lt;br /&gt;
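&lt;br /&gt;
The two-step join described above can be sketched as follows. This is only a minimal illustration using an in-memory SQLite database; the simplified tables, the column name patent_number, and the sample rows are assumptions made for the example, not the real schema or data.&lt;br /&gt;

```python
import sqlite3

# In-memory stand-in for the two databases; table and column names are
# simplified assumptions based on the design on this page.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE assignee (rf_id INTEGER, assignee_name TEXT);
CREATE TABLE document_info (rf_id INTEGER, patent_number TEXT);
CREATE TABLE patent (patent_number TEXT PRIMARY KEY, title TEXT);
INSERT INTO assignee VALUES (1001, 'Example Holdings LLC');
INSERT INTO document_info VALUES (1001, '7000001');
INSERT INTO patent VALUES ('7000001', 'Example widget');
""")

# Step 1: join the Assignment table to DOCUMENT_INFO on rf_id.
# Step 2: join DOCUMENT_INFO to the Patent table on patent number.
rows = conn.execute("""
SELECT a.assignee_name, d.patent_number, p.title
FROM assignee AS a
JOIN document_info AS d ON d.rf_id = a.rf_id
JOIN patent AS p ON p.patent_number = d.patent_number
""").fetchall()
print(rows)
```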
&lt;br /&gt;
Also in the paper, they mentioned the Assignments on the Web for Patents (AOTW-P), a searchable database of individual USPTO assignment records keyed on reel-frame identification, patent number, and assignor or assignee name (https://assignment.uspto.gov/patent/index.html#/patent/search). Obviously it would not be possible to individually use this tool to query all the patent numbers, but if it were possible to write a script to query each patent number using the rf_id and parse the response, this could be useful to check the patent numbers, though it might not be any more accurate than what will already be in DOCUMENT_INFO.&lt;br /&gt;
&lt;br /&gt;
==Finding New Fields Unique to Plant, Reissue, and/or Design Patents==&lt;br /&gt;
&lt;br /&gt;
Based on Oliver's script, which searched all xpaths and compared which were unique to particular types, we see that the following attributes are unique to each type of patent other than utility patents. These attributes vary by XML version, which changed over time. Therefore, the lists below are a superset of the attributes that are unique to each of the patent types listed below across all XML versions.&lt;br /&gt;
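&lt;br /&gt;
The comparison idea can be illustrated with plain set operations. This is a rough sketch and not Oliver's actual script; the function name unique_paths and the example paths are hypothetical.&lt;br /&gt;

```python
# Hypothetical input: the set of xpaths observed for each patent type.
# These example paths are illustrative; this is not Oliver's actual script.
paths_by_type = {
    "utility": {
        "us-patent-grant/claims",
        "us-patent-grant/abstract",
    },
    "plant": {
        "us-patent-grant/claims",
        "us-patent-grant/us-bibliographic-data-grant/us-botanic/latin-name",
    },
    "design": {
        "us-patent-grant/claims",
        "us-patent-grant/us-bibliographic-data-grant/us-term-of-grant/length-of-grant",
    },
}

def unique_paths(paths_by_type):
    """Return, for each type, the xpaths that no other type uses."""
    result = {}
    for patent_type, paths in paths_by_type.items():
        others = set()
        for other_type, other_paths in paths_by_type.items():
            if other_type != patent_type:
                others |= other_paths
        result[patent_type] = paths - others
    return result

unique = unique_paths(paths_by_type)
print(sorted(unique["plant"]))
```

Running the same comparison once per XML version and taking the union of the results per type would give a superset like the lists below.&lt;br /&gt;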
&lt;br /&gt;
When I am done finding the supersets, we will determine how to integrate these attributes into the design for the patent table.&lt;br /&gt;
&lt;br /&gt;
'''There appears to be a fifth patent type in the results from Oliver's script, and there are attributes that are unique to Utility patents, though they do not appear particularly useful'''&lt;br /&gt;
&lt;br /&gt;
===Plant Patents===&lt;br /&gt;
* US Claim Statement (XML45) - unclear what this represents&lt;br /&gt;
* Latin Name (XML44)&lt;br /&gt;
* US Botanical Variety (XML44)&lt;br /&gt;
&lt;br /&gt;
===Reissue Patents===&lt;br /&gt;
* Parent Status (XML45)&lt;br /&gt;
* Parent Document (XML44) - seems to be either a normal document or a parent grant document (or both); both contain the following fields&lt;br /&gt;
** Date&lt;br /&gt;
** Document number&lt;br /&gt;
** Country &lt;br /&gt;
** Document ID&lt;br /&gt;
** Kind (XML42)&lt;br /&gt;
* Child Document (XML44)&lt;br /&gt;
** Document ID&lt;br /&gt;
** Document Number&lt;br /&gt;
** Country (XML43)&lt;br /&gt;
&lt;br /&gt;
Note about fields related to the parent document and/or parent grant document:&lt;br /&gt;
The structure of the above fields related to a reissue patent is very odd and varies by XML version. For example, for most of the XML versions, parent grant document falls under parent document and has its own document number, date, country, document ID, kind, etc. The parent doc field falls under relation, which falls under reissue. &lt;br /&gt;
&lt;br /&gt;
However, in XML41, parent document falls under relation and a new category - US Reexamination Reissue Merger. The information pertaining to the child document also falls under both relation and US Reexamination Reissue Merger. It's possible that during the time period this XML version represents, there were two types of reissue patents - a reexamination related to a merger and a standard reissue. With discrepancies like these, we'll have to determine in the end whether we want to stick with the fields as they are designed above (in which case the information about a parent doc or parent grant document is stored the same whether it is a reissue or a Reexamination Reissue), or whether these two classifications are different enough to warrant duplicate fields in the table.&lt;br /&gt;
&lt;br /&gt;
===Design Patents===&lt;br /&gt;
* length of grant (XML45)&lt;br /&gt;
* Hague Agreement Data (XML45) - allows people to file design patents in 66 countries with one application&lt;br /&gt;
** International Registration Date&lt;br /&gt;
** International Filing Date&lt;br /&gt;
** International Registration Publication Date&lt;br /&gt;
** International Registration Number&lt;br /&gt;
* Classification Locarno (XML40)&lt;br /&gt;
** Edition&lt;br /&gt;
** Main Classification&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please note that the lists above are meant to represent unique fields that should be added to the patent database. This does not mean that every XML version contains all these fields, or that every version contains the same path to these fields. That will have to be determined separately. The XML number next to some of the items is meant to represent the latest XML version where the field was seen.&lt;br /&gt;
&lt;br /&gt;
Though I was mainly looking at unique fields in plant, reissue, and design patents, I noticed the following fields on unique paths for utility patents in XML41:&lt;br /&gt;
&lt;br /&gt;
* US Related Documents&lt;br /&gt;
** Substitution&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Parent Document&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Country&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
***** Parent Status&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Country&lt;br /&gt;
** Continuation/Continuation in Part&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Doc&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
** Division&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Kind&lt;br /&gt;
&lt;br /&gt;
This appears to resemble the information about reissue patents. Additionally, a couple of fields appear under the information about bibliographic data that might be useful:&lt;br /&gt;
* US Deceased Inventor (XML41)&lt;br /&gt;
** Post Code&lt;br /&gt;
** Address&lt;br /&gt;
* US Provisional Application Status (XML41)&lt;br /&gt;
&lt;br /&gt;
==Paths for the New Fields Related to Plant, Reissue, and/or Design Patents==&lt;br /&gt;
&lt;br /&gt;
===Plant Patents===&lt;br /&gt;
XML 4.4, 4.3, 4.1, and 4.0&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-botanic&lt;br /&gt;
fields: latin-name, variety&lt;br /&gt;
&lt;br /&gt;
XML 4.5&lt;br /&gt;
parent node: us-patent-grant/us-claim-statement/&lt;br /&gt;
field: i&lt;br /&gt;
&lt;br /&gt;
XML 4.2 &lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-botanic&lt;br /&gt;
fields: latin-name, variety&lt;br /&gt;
&lt;br /&gt;
parent node: us-patent-grant/us-claim-statement/&lt;br /&gt;
field: i &lt;br /&gt;
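&lt;br /&gt;
As a hedged sketch of how the plant paths above could be read with Python's standard xml.etree.ElementTree module: the record below is a tiny made-up stand-in built in memory, and the function name plant_fields and the output keys are hypothetical.&lt;br /&gt;

```python
import xml.etree.ElementTree as ET

# Build a tiny, hypothetical plant patent record in memory (real grant
# files would be parsed from disk; the values here are made up).
root = ET.Element("us-patent-grant")
biblio = ET.SubElement(root, "us-bibliographic-data-grant")
botanic = ET.SubElement(biblio, "us-botanic")
ET.SubElement(botanic, "latin-name").text = "Rosa hybrida"
ET.SubElement(botanic, "variety").text = "Example Variety"

def plant_fields(grant):
    """Read the plant fields relative to the us-patent-grant root element."""
    base = "us-bibliographic-data-grant/us-botanic/"
    return {
        "latin_name": grant.findtext(base + "latin-name"),
        "variety": grant.findtext(base + "variety"),
        # Listed above for XML 4.5 and 4.2 only; None when the node is absent.
        "claim_statement": grant.findtext("us-claim-statement/i"),
    }

fields = plant_fields(root)
print(fields)
```

Because findtext returns None for a missing node, one function can cover versions that carry only some of the fields.&lt;br /&gt;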
&lt;br /&gt;
===Reissue Patents===&lt;br /&gt;
&lt;br /&gt;
XML 4.5&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/parent-doc/&lt;br /&gt;
fields: parent-status&lt;br /&gt;
&lt;br /&gt;
XML 4.4&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/continuing-reissue/relation/&lt;br /&gt;
fields: parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
	parent-doc/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/country&lt;br /&gt;
	child-doc/document-id/country&lt;br /&gt;
	child-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/doc-number&lt;br /&gt;
	parent-doc/document-id/date&lt;br /&gt;
(everything in XML 4.3 except parent-status)	&lt;br /&gt;
&lt;br /&gt;
XML 4.3&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/&lt;br /&gt;
fields: child-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
	parent-doc/parent-status&lt;br /&gt;
	parent-doc/document-id/date&lt;br /&gt;
	parent-doc/document-id/country&lt;br /&gt;
	parent-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/country&lt;br /&gt;
	child-doc/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/doc-number	&lt;br /&gt;
	&lt;br /&gt;
XML 4.1&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/&lt;br /&gt;
fields:&lt;br /&gt;
	child-doc/document-id/doc-number&lt;br /&gt;
	child-doc/document-id/country&lt;br /&gt;
	parent-doc/parent-status&lt;br /&gt;
	parent-doc/document-id/kind&lt;br /&gt;
	parent-doc/document-id/country&lt;br /&gt;
	parent-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/document-id/date&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/kind&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/doc-number&lt;br /&gt;
&lt;br /&gt;
other parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/us-reexamination-reissue-merger/relation/&lt;br /&gt;
fields: &lt;br /&gt;
	child-doc/document-id/doc-number&lt;br /&gt;
	child-doc/document-id/country&lt;br /&gt;
	parent-doc/document-id/date&lt;br /&gt;
	parent-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/kind&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/doc-number&lt;br /&gt;
&lt;br /&gt;
XML 4.0 and XML 4.2&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/&lt;br /&gt;
fields: parent-doc/document-id/kind&lt;br /&gt;
	parent-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/kind&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/country&lt;br /&gt;
	child-doc/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/doc-number&lt;br /&gt;
	child-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
	parent-doc/parent-status&lt;br /&gt;
	parent-doc/document-id/country&lt;br /&gt;
	parent-doc/document-id/date&lt;br /&gt;
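&lt;br /&gt;
Since the parent node differs between versions (reissue, continuing-reissue, us-reexamination-reissue-merger), one possible approach is to try each candidate parent in turn. The sketch below assumes that approach; the function name reissue_parent_numbers and the sample record are hypothetical.&lt;br /&gt;

```python
import xml.etree.ElementTree as ET

# Candidate parent nodes under us-related-documents, taken from the lists above.
RELATION_PARENTS = ["reissue", "continuing-reissue", "us-reexamination-reissue-merger"]

def reissue_parent_numbers(grant):
    """Collect (parent node, parent doc-number) pairs, whichever layout is used."""
    found = []
    for parent in RELATION_PARENTS:
        prefix = "us-bibliographic-data-grant/us-related-documents/" + parent + "/relation/"
        number = grant.findtext(prefix + "parent-doc/document-id/doc-number")
        if number is not None:
            found.append((parent, number))
    return found

# Hypothetical XML 4.4 style record built in memory; the number is made up.
root = ET.Element("us-patent-grant")
node = root
for tag in ["us-bibliographic-data-grant", "us-related-documents", "continuing-reissue",
            "relation", "parent-doc", "document-id", "doc-number"]:
    node = ET.SubElement(node, tag)
node.text = "12345678"

print(reissue_parent_numbers(root))
```

Keeping the parent node name in the result preserves the distinction between a standard reissue and a reexamination reissue merger, in case the two end up needing separate fields.&lt;br /&gt;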
&lt;br /&gt;
===Design Patents===&lt;br /&gt;
XML 4.5&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/&lt;br /&gt;
fields: hague-agreement-data/international-registration-date/date&lt;br /&gt;
	hague-agreement-data/international-registration-publication-date/date&lt;br /&gt;
	us-term-of-grant/length-of-grant&lt;br /&gt;
	hague-agreement-data/international-registration-number&lt;br /&gt;
	hague-agreement-data/international-filing-date/date&lt;br /&gt;
&lt;br /&gt;
XML 4.1, 4.3, and 4.4&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-term-of-grant/&lt;br /&gt;
fields: length-of-grant&lt;br /&gt;
&lt;br /&gt;
XML 4.0&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/&lt;br /&gt;
fields: us-term-of-grant/length-of-grant&lt;br /&gt;
	classification-locarno/edition&lt;br /&gt;
	classification-locarno/main-classification&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=22347</id>
		<title>Redesign Assignment and Patent Database</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=22347"/>
		<updated>2017-12-08T20:36:15Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* ER Diagram for Assignment and Patent Databases */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Redesign Assignment and Patent Database&lt;br /&gt;
|Has owner=Shelby Bice,&lt;br /&gt;
|Has start date=9/2017&lt;br /&gt;
|Has keywords=patent&lt;br /&gt;
|Is dependent on=Reproducible Patent Data,&lt;br /&gt;
}}&lt;br /&gt;
'''Final Notes''' This is more or less the finalized design for the patent database. Oliver Chang's code (see Reproducible Patent Data, which is his project page) more or less fits the design, though there are some differences, including differences in variable names. In the future the code should be altered so that the names of variables match up and each table that is listed here exists in the database. The tables for the extra variables that exist for reissue, design, and plant patents have been added, and the instructions for adding tables can be found on Reproducible Patent Data. &lt;br /&gt;
&lt;br /&gt;
'''For Oliver:''' Unfortunately I did not get to finish making the schema in your code fit the schema that is outlined here. For whoever works on this next, whether that be you or another intern, please note that the variables in the schema in the code do not match up exactly with the schema outlined here, except for perhaps the Reissues, Plants, and Designs tables in the Patent database (called patentsj in the code, I believe, but you can create a database with any name of course).&lt;br /&gt;
&lt;br /&gt;
This is an extension of the work I did last semester under &amp;quot;Redesigning Patent Database&amp;quot;. Instead of simply reconfiguring the existing database, this project encompasses a full redesign and creation of a new Patent database and a new Assignment database that will be joined together.&lt;br /&gt;
&lt;br /&gt;
Reference for definitions of fields in the two databases: https://www.oecd.org/sti/sci-tech/37569498.pdf&lt;br /&gt;
&lt;br /&gt;
Paper on how to extract data from XML files, parse it, and put it into a database: https://funginstitute.berkeley.edu/wp-content/uploads/2014/06/patentprocessor.pdf&lt;br /&gt;
&lt;br /&gt;
'''Particularly useful, because it provides solutions for cleaning location and firm data'''&lt;br /&gt;
&lt;br /&gt;
==ER Diagram for Assignment and Patent Databases==&lt;br /&gt;
&lt;br /&gt;
Note: this matches the design below but may not match the exact database (nor the code that currently creates the schema of the database). In the future these should be made to sync up.&lt;br /&gt;
&lt;br /&gt;
[[File:Erdplus-diagram (3).png]]&lt;br /&gt;
&lt;br /&gt;
Attributes for each table are listed below with descriptions - because of how many attributes there are, I decided the ER diagram would be better suited as an overview of the tables rather than trying to show all the attributes on the diagram.&lt;br /&gt;
&lt;br /&gt;
If you would like to edit the ER diagram, go to https://erdplus.com/ and click &amp;quot;Open Diagram File&amp;quot; under the &amp;quot;Diagram&amp;quot; dropdown, then navigate to E:/McNair/Project/Redesigning Patent Database/New Patent Database Project/ERDiagramforPatentandAssignmentDatabases.erdplus. You should be able to edit the file then.&lt;br /&gt;
&lt;br /&gt;
==Assignment Database Structure==&lt;br /&gt;
&lt;br /&gt;
After reading through the paper &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot; (https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf) and taking into account our needs for the Assignment database (the ability to connect the two databases together, the ability to trace conveyance of a patent over time), I decided on the following structure for the Assignment database.&lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table in the database is rf_id, which stands for reel frame id. It is unique for each entry in the main table, assignment, and is included in every table. &lt;br /&gt;
&lt;br /&gt;
Conveyance is the only table that cannot be fully constructed as data is inserted into the database. Some of the fields depend on other tables being constructed, so it may either be partially populated while the other tables are being populated or be the last table to be populated. &lt;br /&gt;
&lt;br /&gt;
Full list of tables: assignee, assignor, document_info, assignment, conveyance&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: store all the assignees from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the recipient of a patent in an entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id, unless it is possible to have multiple assignees, then rf_id and assignee name&lt;br /&gt;
&lt;br /&gt;
All columns (variables) in the table: &lt;br /&gt;
&lt;br /&gt;
* rf_id (bigint) reel frame number &lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* address_line_1 (varchar(255)) first line of address of assignee&lt;br /&gt;
* address_line_2 (varchar(255)) second line of address of assignee&lt;br /&gt;
* city (varchar(255)) city of assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* country (varchar(255)) country of assignee&lt;br /&gt;
* postal_code (varchar(255)) post code of the assignee&lt;br /&gt;
&lt;br /&gt;
===ASSIGNOR===&lt;br /&gt;
	&lt;br /&gt;
Purpose: store all the assignors from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the one assigning/granting the license for the entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id and assignor_name&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
&lt;br /&gt;
===CONVEYANCE===&lt;br /&gt;
Purpose: represent the conveyance of a particular patent over time&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual transaction for a patent (not unlike assignment, except this table has information about the assignment in relation to other assignments for the patent)&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id &lt;br /&gt;
&lt;br /&gt;
All columns (variables): Two possibilities, depending on whether we want data to be repeated or not.&lt;br /&gt;
	&lt;br /&gt;
Not-repeated (with exception of primary/foreign key): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* next_transaction (bigint) rf_id of next transaction that occurred after this transaction (traced by assignee/assignor)&lt;br /&gt;
* num_conveyance (int) number in order of conveyance by date&lt;br /&gt;
* acknowledgment_date (date) date that the assignor acknowledged the transaction&lt;br /&gt;
* execution_date (date) date that the transaction took place&lt;br /&gt;
* conveyance_type (varchar(255)) the USPTO paper determined a couple of different conveyance types and also outlined how to determine based on the conveyance text in the xml file which conveyance type is relevant to the transaction. The conveyance types are:&lt;br /&gt;
** assignment&lt;br /&gt;
** merger&lt;br /&gt;
** change of name&lt;br /&gt;
** government interest&lt;br /&gt;
** agreement&lt;br /&gt;
** security agreement&lt;br /&gt;
** release&lt;br /&gt;
&lt;br /&gt;
	&lt;br /&gt;
Repeated, also include: &lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
* patent_number (varchar(255)) identifying number for the patent&lt;br /&gt;
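&lt;br /&gt;
One way num_conveyance and next_transaction might be derived is sketched below. It makes a simplifying assumption: transactions are ordered purely by execution date within each patent, rather than traced through assignee/assignor names. The function name order_conveyances and the sample transactions are hypothetical.&lt;br /&gt;

```python
from datetime import date

# Hypothetical transactions: (rf_id, patent_number, execution_date).
transactions = [
    (300, "7000001", date(2015, 6, 1)),
    (100, "7000001", date(2010, 1, 15)),
    (200, "7000001", date(2012, 3, 9)),
    (400, "7000002", date(2011, 7, 4)),
]

def order_conveyances(transactions):
    """Return {rf_id: (num_conveyance, next_transaction)}, ordered by date per patent."""
    by_patent = {}
    for rf_id, patent_number, executed in transactions:
        by_patent.setdefault(patent_number, []).append((executed, rf_id))
    result = {}
    for chain in by_patent.values():
        chain.sort()  # earliest execution date first
        for position, (executed, rf_id) in enumerate(chain):
            is_last = position == len(chain) - 1
            next_rf_id = None if is_last else chain[position + 1][1]
            result[rf_id] = (position + 1, next_rf_id)
    return result

conveyances = order_conveyances(transactions)
print(conveyances[100])
```

This also shows why CONVEYANCE has to be populated late: both derived fields need every transaction for a patent to be present first.&lt;br /&gt;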
&lt;br /&gt;
===DOCUMENT_INFO===&lt;br /&gt;
Purpose: store extra information relevant to the patents represented in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the documentation and the patent for a particular entry in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables) when created: &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* app_num (varchar(255)) application document USPTO number &lt;br /&gt;
* app_date (date) date that the application was filed&lt;br /&gt;
* app_country (varchar(255)) country in which it was filed&lt;br /&gt;
* pgpub_num (varchar(255)) pre-grant publication document USPTO number&lt;br /&gt;
* pgpub_date (date) date that pre-grant publication was released&lt;br /&gt;
* pgpub_country (varchar(255)) country in which the patent is published before being granted a patent number&lt;br /&gt;
* patent_number (varchar(255)) granted patent document USPTO number&lt;br /&gt;
* grant_date (date) date that the patent is officially published&lt;br /&gt;
* grant_country (varchar(255)) country in which the patent is published after being granted a patent number&lt;br /&gt;
* invention_title (varchar(255)) title of the invention&lt;br /&gt;
* language (varchar(255)) language of invention title – could be potentially useful and interesting to investigate &lt;br /&gt;
* reel_num (bigint) number of the reel the assignment was stored on&lt;br /&gt;
* frame_num (bigint) number of the frame on the reel the assignment was stored on&lt;br /&gt;
&lt;br /&gt;
===ASSIGNMENT===&lt;br /&gt;
Purpose: represent assignment transactions – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent assignment transaction&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignment_id (bigint) US patent assignment id &lt;br /&gt;
* correspondent_name (varchar(255)) name of correspondent for the transaction, could be a lawyer&lt;br /&gt;
* correspondent_address1 (varchar(255)) first line of address for correspondent&lt;br /&gt;
* correspondent_address2 (varchar(255)) second line of address for correspondent&lt;br /&gt;
* correspondent_address3 (varchar(255)) third line of address for correspondent&lt;br /&gt;
* correspondent_address4 (varchar(255)) fourth line of address for correspondent&lt;br /&gt;
* record_date (date) date recorded with USPTO&lt;br /&gt;
* last_update_date (date) date that information for this assignment was last updated with the USPTO &lt;br /&gt;
* page_cnt (bigint) page count of assignment record&lt;br /&gt;
&lt;br /&gt;
More extensive notes exist under E:/McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper&lt;br /&gt;
&lt;br /&gt;
==Patent Database Structure==&lt;br /&gt;
&lt;br /&gt;
Most of this design is based off what I investigated last semester and therefore is subject to change, since the design for the Assignment database is somewhat new. &lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table is patent_number, which uniquely identifies every patent in the PATENT table and is included in every table.&lt;br /&gt;
&lt;br /&gt;
Full list of tables: patent, assignee, citation, fee, claims, histpatent, lawyers, inventors&lt;br /&gt;
&lt;br /&gt;
Update: These tables, as of October 26, are being updated and changed as I determine what data can actually be retrieved from the XML files based on the xpaths that we have determined (you can see these xpaths on the &amp;quot;Equivalent XPath and APS Queries&amp;quot; project page).&lt;br /&gt;
&lt;br /&gt;
===PATENT===&lt;br /&gt;
Purpose: represent patents – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* patent_type (varchar(255)) the type of patent, like utility&lt;br /&gt;
* patent_kind (varchar(2)) the kind of patent, a letter and a number&lt;br /&gt;
* title (varchar(255)) the title of the patent&lt;br /&gt;
* grantdate (date) date that patent was officially granted including the year&lt;br /&gt;
* prioritydate (date) date used to establish novelty of an invention with regard to other inventions (source: http://www.bios.net/daisy/patentlens/2343.html)&lt;br /&gt;
* prioritycountry (varchar(255)) country where patent is first filed&lt;br /&gt;
* patent_country (varchar(255)) country in which this patent was published&lt;br /&gt;
* prioritypatentnumber (varchar(255)) placeholder patent number if the patent is published before it is fully approved - also referred to as priority claims number&lt;br /&gt;
* cpcsubclass (varchar(255)) cooperative patent classification subclass&lt;br /&gt;
* cpcsubgroup (varchar(255)) cooperative patent classification subgroup&lt;br /&gt;
* cpcmaingroup (varchar(255)) cooperative patent classification main group&lt;br /&gt;
* cpctotal (varchar(255)) concatenated version of cpcmaingroup, cpcsubgroup, cpcsubclass&lt;br /&gt;
* pctpatentnumber (varchar(255)) international patent number (according to Patent Cooperation Treaty) also known as PCT Document Number&lt;br /&gt;
* appnum (varchar(255)) application number, probably also the filing number - the format is a two digit series code followed by a six digit serial number (source: https://www.uspto.gov/patents-application-process/filing-online/info-application-number)&lt;br /&gt;
* appdate (date) date that the application was filed, including year - probably also the filing date&lt;br /&gt;
* ipcrsubclass (varchar(255)) International patent classification subclass&lt;br /&gt;
* ipcrmaingroup (varchar(255)) International patent classification main group&lt;br /&gt;
* ipcrsubgroup (varchar(255)) International patent classification subgroup&lt;br /&gt;
* ipcrtotal (varchar(255)) concatenated version of ipcrsubgroup, ipcrsubclass, ipcrmaingroup &lt;br /&gt;
* national_classification (varchar(255)) - probably uspc, which is United States Patent Classification&lt;br /&gt;
* national_classification_country (varchar(255)) - should be United States&lt;br /&gt;
* num_claims (int) number of claims (things that the inventor wishes to protect with this patent)&lt;br /&gt;
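&lt;br /&gt;
The appnum format noted in the list above (a two digit series code followed by a six digit serial number) could be checked with a small helper like this one. It is only a sketch: the function name split_appnum is hypothetical, and it assumes that separators such as a slash or comma may appear in the raw value and can be dropped.&lt;br /&gt;

```python
import re

def split_appnum(appnum):
    """Split an application number into (series_code, serial_number).

    Assumes the two-digit series code plus six-digit serial number format;
    any non-digit separators are dropped first. Returns None when the value
    does not contain exactly eight digits.
    """
    digits = re.sub(r"\D", "", appnum)
    if len(digits) != 8:
        return None
    return digits[:2], digits[2:]

print(split_appnum("12/345,678"))
print(split_appnum("1234"))
```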
&lt;br /&gt;
Discrepancies: &lt;br /&gt;
These fields were previously a part of the table, but we do not currently have xpaths for them:&lt;br /&gt;
* app_year - the year the application was filed&lt;br /&gt;
* grant_year - the year the patent was granted&lt;br /&gt;
* uspc - United States Patent Classification (might be the same as national_classification)&lt;br /&gt;
* uspcsub - United States Patent subgroup&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: Represent the person or organization the patent was assigned to at publication according to the patent information (the information in the Assignment database is more complete) &lt;br /&gt;
&lt;br /&gt;
Each entry represents: the assignee for a patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number, first_name, last_name (in case it is possible for there to be multiple assignees for an individual patent)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar &lt;br /&gt;
* first_name (varchar(255)) first name of assignee&lt;br /&gt;
* last_name (varchar(255)) last name of assignee&lt;br /&gt;
* address (varchar(255)) address of assignee&lt;br /&gt;
* postcode (varchar(255)) postcode of assignee&lt;br /&gt;
* orgname (varchar(255)) name of the organization owning the patent (if applicable - organization may go by multiple orgnames) &lt;br /&gt;
* city (varchar(255)) city of the assignee&lt;br /&gt;
* country (varchar(255)) country of the assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* residence (varchar(255)) mailing address of assignee if they receive mail at a different address than the one listed above&lt;br /&gt;
* size_of_firm (varchar(255)) size of the organization (large, small, or micro), which relates to the fee they pay for the patent&lt;br /&gt;
&lt;br /&gt;
===CITATION===&lt;br /&gt;
Purpose: Represent all the patents in PATENT that another patent in PATENT cites&lt;br /&gt;
&lt;br /&gt;
Each entry represents: one citation that a patent in PATENT makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number AND cited_patent_number (because a patent might cite multiple patents, so both variables are needed for the primary key)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* cited_patent_number (varchar(255)) unique identifier from the USPTO for the patent being cited&lt;br /&gt;
* description (varchar(255)) description of the citation&lt;br /&gt;
* citation_date (date) unclear what this date is supposed to represent from the DTD, but according to the paper that is the second link in the top section, it is the &amp;quot;date of cited document&amp;quot;&lt;br /&gt;
* citation_kind (varchar(255)) type of cited document&lt;br /&gt;
* citation_country (varchar(255)) country of origin for the cited patent &lt;br /&gt;
* citation_name (varchar(255)) appears to be the names of those to whom the cited patent is assigned&lt;br /&gt;
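&lt;br /&gt;
The composite primary key described for this table can be illustrated with SQLite. This is a sketch only: the column list is trimmed, the types are SQLite's rather than the varchar/date types listed above, and the patent numbers are made up.&lt;br /&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Composite primary key: one row per (citing patent, cited patent) pair.
conn.execute("""
CREATE TABLE citation (
    patent_number TEXT NOT NULL,
    cited_patent_number TEXT NOT NULL,
    citation_date TEXT,
    PRIMARY KEY (patent_number, cited_patent_number)
)
""")
conn.execute("INSERT INTO citation VALUES ('7000001', '6000001', '2001-01-02')")
conn.execute("INSERT INTO citation VALUES ('7000001', '6000002', '2002-03-04')")

# Repeating the same pair violates the key and is rejected.
try:
    conn.execute("INSERT INTO citation VALUES ('7000001', '6000001', '2001-01-02')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM citation").fetchone()[0]
print(duplicate_rejected, row_count)
```

The same patent can appear many times as the citing patent, which is why patent_number alone would not work as the key here.&lt;br /&gt;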
&lt;br /&gt;
Note: There are other fields available for this table, such as document number, but this seems like it might be a repeat of the document number for the citing patent&lt;br /&gt;
&lt;br /&gt;
===FEE===&lt;br /&gt;
Purpose: Represent information about the fees paid for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the fees paid on the patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar  &lt;br /&gt;
* four (boolean) payment of maintenance fee 4th year&lt;br /&gt;
* eight (boolean) payment of maintenance fee 8th year&lt;br /&gt;
* twelve (boolean) payment of maintenance fee 12th year&lt;br /&gt;
* fee_date (date) date that fee is paid on&lt;br /&gt;
* fee_code (varchar(255)) code for type of fee to pay&lt;br /&gt;
&lt;br /&gt;
===CLAIMS===&lt;br /&gt;
&lt;br /&gt;
Purpose: Represent information about the claims for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a claim that a patent makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number and claim_id&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* claim_id (varchar(255)) id number for the claim&lt;br /&gt;
* language (varchar(255)) language of claim&lt;br /&gt;
* claim_type (varchar(255)) type of claim&lt;br /&gt;
* status (varchar(255)) status of the claim&lt;br /&gt;
* claim_text (text) the text of the claim&lt;br /&gt;
&lt;br /&gt;
===HISTPATENT===&lt;br /&gt;
Purpose: Represent historical data about the patent pertaining to dates and publication&lt;br /&gt;
&lt;br /&gt;
Each entry represents: sequence of historical data about an individual patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* pubno (varchar(255)) patent publication number&lt;br /&gt;
* pubdate (date) date patent was issued&lt;br /&gt;
* dispdate (date) disposal date for the application&lt;br /&gt;
* distype (varchar(3)) application status of the patent - ABN for abandoned, ISS for issued, or PEN for pending&lt;br /&gt;
* exp_dt (date) expiration date of patent&lt;br /&gt;
* pta (int) length of patent term adjustment (which extends the amount of time the patent is allowed to be in force) in number of days&lt;br /&gt;
&lt;br /&gt;
===INVENTORS===&lt;br /&gt;
Purpose: Represent the inventors for each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about an inventor for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (and name, if there can be multiple inventors for a single patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identify the inventor?&lt;br /&gt;
* name (varchar(255)) concatenated string of the first and last name of the inventor&lt;br /&gt;
* orgname (varchar(255)) name of the organization the inventor works for&lt;br /&gt;
* state (varchar(255)) - state in which the inventor resided&lt;br /&gt;
* address (varchar(255)) - address at which the inventor resided&lt;br /&gt;
* country (varchar(255)) country in which the inventor resided&lt;br /&gt;
* city (varchar(255)) city in which the inventor resided&lt;br /&gt;
&lt;br /&gt;
===LAWYERS===&lt;br /&gt;
Purpose: Represent the lawyers who worked on each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a lawyer for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (there doesn't appear to be more than one lawyer entry for each patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO; can contain letters, hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identifies the lawyer?&lt;br /&gt;
* org_name (varchar(255)) the name of the firm that did arbitration for the patent&lt;br /&gt;
&lt;br /&gt;
===Unique Attributes Tables===&lt;br /&gt;
&lt;br /&gt;
There are several attributes that have been identified as unique, applying to only one kind of patent. Adding all these fields to the main patent table would introduce a lot of NULL entries, because three out of four patent types will not have any particular attribute. Therefore, I have created three new tables to store these unique attributes - one for each type of patent with unique attributes. &lt;br /&gt;
&lt;br /&gt;
For each table, the primary key is the patent_number, which connects a row to a row in the main patent table. &lt;br /&gt;
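&lt;br /&gt;
As a minimal sketch of this layout (not the project's actual schema code), the Python snippet below uses the standard-library sqlite3 module as a stand-in for the real database and builds a PATENT table plus a PLANT table keyed on patent_number. The sample rows, the trimmed column lists, and the choice of SQLite are assumptions for illustration only; the PLANT field list calls the key patent_no, so the column name used here is an assumption as well.&lt;br /&gt;

```python
import sqlite3

# In-memory stand-in database; the real project database is not SQLite.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patent (patent_number TEXT PRIMARY KEY, patent_type TEXT);
    CREATE TABLE plant (
        patent_number TEXT PRIMARY KEY REFERENCES patent (patent_number),
        latin_name TEXT,
        us_botanical_variety TEXT
    );
    -- Hypothetical sample rows, not real patents.
    INSERT INTO patent VALUES ('PP00001', 'plant');
    INSERT INTO patent VALUES ('1234567', 'utility');
    INSERT INTO plant VALUES ('PP00001', 'Rosa hybrida', 'Hypothetical variety');
""")

# LEFT JOIN keeps every patent; only plant patents get non-NULL plant columns.
rows = conn.execute("""
    SELECT p.patent_number, p.patent_type, pl.latin_name
    FROM patent AS p
    LEFT JOIN plant AS pl ON pl.patent_number = p.patent_number
    ORDER BY p.patent_number
""").fetchall()
```

With this layout the NULLs appear only in query results for patents of other types and are never stored in the main patent table itself.&lt;br /&gt;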
&lt;br /&gt;
TODO: Figure out best way to generate these tables, finish description of attributes for each table.&lt;br /&gt;
&lt;br /&gt;
* Note that I did include some unique attributes for Utility patents as seen below in the &amp;quot;Finding New Fields Unique to Plant, Reissue, and/or Design Patents&amp;quot; section, but for now (11/3/2017) I'm not going to create a separate table for those fields. Most of them seem to be repeated in Reissue, and the other fields, while interesting, are perhaps not very useful.&lt;br /&gt;
&lt;br /&gt;
====PLANT====&lt;br /&gt;
=====Fields=====&lt;br /&gt;
* patent_no (varchar(255)) unique identifier for the patent from the USPTO; can contain letters, hence varchar&lt;br /&gt;
* latin_name (varchar(255)) Latin name for the plant&lt;br /&gt;
* us_botanical_variety (varchar(255)) denotes what variety of plant it is - for example, a rose has several different varieties&lt;br /&gt;
&lt;br /&gt;
====REISSUE====&lt;br /&gt;
&lt;br /&gt;
In the following table, you'll notice lots of fields related to three different kinds of &amp;quot;documents&amp;quot; - a &amp;quot;parent document&amp;quot;, a &amp;quot;child document&amp;quot;, and a &amp;quot;parent grant document&amp;quot;. It is not immediately clear what these three documents represent for a reissue patent. After some research, I think I have determined what these &amp;quot;documents&amp;quot; are, but please know that I do not have definitive proof, as there is little available information about these &amp;quot;documents&amp;quot; with regard to a reissue patent.&lt;br /&gt;
&lt;br /&gt;
It is possible that the &amp;quot;parent document&amp;quot; is the parent patent application - that is, the first patent application filed for an invention. That would explain why reissue patents also have several fields related to a &amp;quot;child document&amp;quot; - a &amp;quot;child document&amp;quot; could be a child application, which is filed while a parent application is still pending. So in this case, a &amp;quot;child document&amp;quot; of a reissue patent would be an application regarding the same patent that was filed while the reissue application was still pending. A &amp;quot;child document&amp;quot; is either a continuation, divisional, or continuation-in-part application ([http://www.patenttrademarkblog.com/parent-and-child-patent-applications/ source]).&lt;br /&gt;
&lt;br /&gt;
Based on this logic, I think that the difference between a &amp;quot;parent document&amp;quot; and a &amp;quot;parent grant document&amp;quot; depends on whether the parent patent application has been granted. If the parent patent application is still pending (meaning the patent has not yet been granted), then I believe the reissue patent will store information about the parent patent under &amp;quot;parent document&amp;quot;. However, if the parent patent application has been granted, then the information will be stored under &amp;quot;parent grant document&amp;quot;. This seems like the most logical explanation, especially considering that the path to any field related to &amp;quot;parent grant document&amp;quot; contains &amp;quot;parent document&amp;quot;, as in the example below:&lt;br /&gt;
&lt;br /&gt;
 us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
&lt;br /&gt;
Therefore, a particular reissue patent will probably only have the fields filled for either the &amp;quot;parent document&amp;quot; or &amp;quot;parent grant document&amp;quot; if I am right about what they represent.&lt;br /&gt;
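&lt;br /&gt;
To make the two locations concrete, here is a small sketch using Python's standard xml.etree.ElementTree module. It builds a tiny hypothetical record element by element and reads the date from the parent grant document path shown above, falling back to the plain parent document path when the first one is absent. The sample values, the constant names, and both functions are made up for illustration; this is not Oliver's parser, and it only reflects my guess about what the two documents mean.&lt;br /&gt;

```python
import xml.etree.ElementTree as ET

# Paths are relative to the us-patent-grant root element.
RELATION = "us-bibliographic-data-grant/us-related-documents/reissue/relation/"
PARENT_DATE = RELATION + "parent-doc/document-id/date"
PARENT_GRANT_DATE = RELATION + "parent-doc/parent-grant-document/document-id/date"


def build_record(path, text):
    """Build a hypothetical us-patent-grant element holding one value at path."""
    root = ET.Element("us-patent-grant")
    node = root
    for tag in path.split("/"):
        node = ET.SubElement(node, tag)
    node.text = text
    return root


def parent_date(root):
    """Prefer the parent grant document date, else the parent document date."""
    granted = root.findtext(PARENT_GRANT_DATE)
    if granted is not None:
        return ("parent grant document", granted)
    return ("parent document", root.findtext(PARENT_DATE))


granted_case = parent_date(build_record(PARENT_GRANT_DATE, "20100105"))
pending_case = parent_date(build_record(PARENT_DATE, "20090320"))
```

Running a check like this over real reissue records would also show whether both sets of fields are ever filled at once, which would contradict the guess above.&lt;br /&gt;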
&lt;br /&gt;
=====Fields=====&lt;br /&gt;
&lt;br /&gt;
* patent_no (varchar(255)) unique identifier for the patent from the USPTO; can contain letters, hence varchar&lt;br /&gt;
* parent_doc_status (varchar(255)) status of the parent application. Probably related to whether the patent application is pending or not&lt;br /&gt;
* parent_doc_number (int) probably application number for the parent document&lt;br /&gt;
* parent_doc_id (varchar(255)) it is unclear how this is different from parent document number. In the xpaths, the path will be something like ./parent_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent document&amp;quot; is - or may refer to its purpose (i.e. reissue, continuation, continuation-in-part, etc.)&lt;br /&gt;
* parent_doc_country (varchar(255)) the country of origin of the parent document&lt;br /&gt;
* parent_doc_date (date) probably the date the parent document was filed&lt;br /&gt;
* parent_grant_doc_number (int) probably application number for the parent grant document&lt;br /&gt;
* parent_grant_doc_id (varchar(255)) it is unclear how this is different from parent grant document number. In the xpaths, the path will be something like ./parent_doc/parent-grant-doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_grant_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent grant document&amp;quot; is, or may refer to its purpose (i.e. reissue, continuation, continuation-in-part, etc.)&lt;br /&gt;
* parent_grant_doc_country (varchar(255)) the country of origin of the parent grant document&lt;br /&gt;
* parent_grant_doc_date (date) probably the date the parent grant document was filed&lt;br /&gt;
* child_doc_number (varchar(255)) probably the application number of the child application&lt;br /&gt;
* child_doc_id (varchar(255)) it is unclear how this is different from child document number. In the xpaths, the path will be something like ./child_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* child_doc_country (varchar(255)) country of origin of the child application&lt;br /&gt;
&lt;br /&gt;
====DESIGN====&lt;br /&gt;
=====Fields=====&lt;br /&gt;
* patent_no (varchar(255)) unique identifier for the patent from the USPTO; can contain letters, hence varchar&lt;br /&gt;
* length_of_grant (int) length of grant, most likely in years&lt;br /&gt;
* hague_registration_date (date) filing date of international patent application&lt;br /&gt;
* hague_filing_date (date) not necessarily the same as the filing date of the international patent application, this is the date that the International Bureau receives all necessary elements for the international patent application&lt;br /&gt;
* hague_registration_pub_date (date) date that the International Bureau publishes the international patent application&lt;br /&gt;
* hague_international_registration_number (varchar(255)) international registration number&lt;br /&gt;
* edition (varchar(255)) possibly the edition of the Classification Locarno which determined main_classification&lt;br /&gt;
* main_classification (varchar(255)) classification for what type of design the patent is for&lt;br /&gt;
&lt;br /&gt;
==Connecting Patent database and Assignment database==&lt;br /&gt;
The Patent database can be connected to the Assignment database by using the information in the table DOCUMENT_INFO to find, for each assignment in the Assignment database ASSIGNMENT table, a patent_id from the Patent database PATENT table. On further investigation into &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot;, the field they called &amp;quot;grant_doc_num&amp;quot; (grant_num above) in DOCUMENT_INFO is the patent number (the description of DOCUMENT_INFO above has been altered to reflect this).&lt;br /&gt;
&lt;br /&gt;
The paper mentions that there will be errors in the patent number stored in DOCUMENT_INFO under grant_doc_num, so a separate table called DOCUMENT_INFO_ADMIN was constructed to determine how prevalent errors were. They queried the patent number for the appno_doc_num from administrative data. The &amp;quot;appno_doc_num&amp;quot; is the Application Document USPTO number from DOCUMENT_INFO. On further investigation into the paper, it became clear that &amp;quot;administrative data&amp;quot; refers to internal USPTO data that was available to the authors of the paper, but that we might not have. However, in 99% of cases the grant_doc_num and the queried patent number based on the appno_doc_num were the same, so we can probably rely on grant_doc_num.&lt;br /&gt;
&lt;br /&gt;
Therefore, since every table in the Patent database will have the patent number stored, to connect any table in Assignment to the Patent database, it would first be joined with DOCUMENT_INFO on rf_id and then joined with the appropriate table in Patent on patent number.&lt;br /&gt;
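&lt;br /&gt;
The two-step join described above can be sketched as follows. This uses Python's standard-library sqlite3 module with an in-memory database purely as a stand-in; keeping both sets of tables in one database, the trimmed column lists, the sample rows, and the use of patent_number as the DOCUMENT_INFO column name are all assumptions for illustration.&lt;br /&gt;

```python
import sqlite3

# In-memory stand-in; table names follow the design on this page.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE assignment (rf_id INTEGER PRIMARY KEY, record_date TEXT);
    CREATE TABLE document_info (rf_id INTEGER, patent_number TEXT);
    CREATE TABLE patent (patent_number TEXT PRIMARY KEY, title TEXT);
    -- Hypothetical sample rows.
    INSERT INTO assignment VALUES (1001, '2015-03-02');
    INSERT INTO document_info VALUES (1001, 'D123456');
    INSERT INTO patent VALUES ('D123456', 'Hypothetical design');
""")


def assignments_with_patents(connection):
    """Join ASSIGNMENT to PATENT through DOCUMENT_INFO (rf_id, then patent number)."""
    query = """
        SELECT a.rf_id, a.record_date, p.patent_number, p.title
        FROM assignment AS a
        JOIN document_info AS d ON d.rf_id = a.rf_id
        JOIN patent AS p ON p.patent_number = d.patent_number
        ORDER BY a.rf_id
    """
    return connection.execute(query).fetchall()


rows = assignments_with_patents(conn)
```

Any other Assignment table (ASSIGNEE, ASSIGNOR, CONVEYANCE) could take the place of ASSIGNMENT in the first step, since each one carries rf_id.&lt;br /&gt;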
&lt;br /&gt;
Also in the paper, they mentioned the Assignments on the Web for Patents (AOTW-P), a searchable database of individual USPTO assignment records keyed on reel-frame identification, patent number, and assignor or assignee name (https://assignment.uspto.gov/patent/index.html#/patent/search). Obviously it would not be possible to use this tool by hand to query all the patent numbers, but if it were possible to write a script to query each patent number using the rf_id and parse the response, this could be useful for checking the patent numbers. It might not be any more accurate than what will already be in DOCUMENT_INFO, though.&lt;br /&gt;
&lt;br /&gt;
==Finding New Fields Unique to Plant, Reissue, and/or Design Patents==&lt;br /&gt;
&lt;br /&gt;
Based on Oliver's script, which searched all xpaths and compared which were unique to particular types, we see that the following attributes are unique to each type of patent other than utility patents. These attributes vary by XML version, which changed over time. Therefore, the lists below are a superset of the attributes that are unique to each of the patent types listed below across all XML versions.&lt;br /&gt;
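&lt;br /&gt;
The idea of unique paths per type and a superset across versions can be sketched with plain Python sets. This is an independent illustration, not Oliver's script: the function names are hypothetical and the sample paths are shortened, made-up stand-ins for the real xpaths.&lt;br /&gt;

```python
def unique_paths_by_type(paths_by_type):
    """For each patent type, keep the xpaths that no other type uses."""
    unique = {}
    for patent_type, paths in paths_by_type.items():
        others = set()
        for other_type, other_paths in paths_by_type.items():
            if other_type != patent_type:
                others |= set(other_paths)
        unique[patent_type] = set(paths) - others
    return unique


def superset_across_versions(unique_by_version):
    """Union the per-version unique sets into one superset per patent type."""
    superset = {}
    for unique in unique_by_version.values():
        for patent_type, paths in unique.items():
            superset.setdefault(patent_type, set()).update(paths)
    return superset


# Hypothetical, shortened path names for two XML versions.
v44 = {"plant": {"us-botanic/latin-name", "invention-title"},
       "utility": {"invention-title"}}
v45 = {"plant": {"us-claim-statement/i", "invention-title"},
       "utility": {"invention-title"}}
per_version = {"4.4": unique_paths_by_type(v44), "4.5": unique_paths_by_type(v45)}
superset = superset_across_versions(per_version)
```

A path counts as unique here only within its own XML version, so a path that is unique to plant patents in one version still lands in the plant superset even if another version drops it.&lt;br /&gt;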
&lt;br /&gt;
When I am done finding the supersets, we will determine how to integrate these attributes into the design for the patent table.&lt;br /&gt;
&lt;br /&gt;
'''There appears to be a fifth patent type in the results from Oliver's script, and there are attributes that are unique to Utility patents, though they do not appear particularly useful'''&lt;br /&gt;
&lt;br /&gt;
===Plant Patents===&lt;br /&gt;
* US Claim Statement (XML45) - unclear what this represents&lt;br /&gt;
* Latin Name (XML44)&lt;br /&gt;
* US Botanical Variety (XML44)&lt;br /&gt;
&lt;br /&gt;
===Reissue Patents===&lt;br /&gt;
* Parent Status (XML45)&lt;br /&gt;
* Parent Document (XML44) - seems to be either a normal document or a parent grant document (or both), both contain the following fields&lt;br /&gt;
** Date&lt;br /&gt;
** Document number&lt;br /&gt;
** Country &lt;br /&gt;
** Document ID&lt;br /&gt;
** Kind (XML42)&lt;br /&gt;
* Child Document (XML44)&lt;br /&gt;
** Document ID&lt;br /&gt;
** Document Number&lt;br /&gt;
** Country (XML43)&lt;br /&gt;
&lt;br /&gt;
Note about fields related to the parent document and/or parent grant document:&lt;br /&gt;
The structure of the above fields related to a reissue patent is very odd and varies by XML version. For example, for most of the XML versions, parent grant document falls under parent document and has its own document number, date, country, document ID, kind, etc. The parent doc field falls under relation, which falls under reissue. &lt;br /&gt;
&lt;br /&gt;
However, in XML41, parent document falls under relation and a new category - US Reexamination Reissue Merger. The information pertaining to the child document also falls under both relation and US Reexamination Reissue Merger. It's possible that during the time period this XML version represents, there were two types of reissue patents - a reexamination related to a merger and a standard reissue. With discrepancies like these, we'll have to determine in the end whether we want to stick with the fields as they are designed above (in which case the information about a parent doc or parent grant document is stored the same way whether it is a reissue or a Reexamination Reissue), or whether these two classifications are different enough to warrant duplicate fields in the table.&lt;br /&gt;
&lt;br /&gt;
===Design Patents===&lt;br /&gt;
* length of grant (XML45)&lt;br /&gt;
* Hague Agreement Data (XML45) - allows people to file design patents in 66 countries with one application&lt;br /&gt;
** International Registration Date&lt;br /&gt;
** International Filing Date&lt;br /&gt;
** International Registration Publication Date&lt;br /&gt;
** International Registration Number&lt;br /&gt;
* Classification Locarno (XML40)&lt;br /&gt;
** Edition&lt;br /&gt;
** Main Classification&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please note that the lists above are meant to represent unique types that should be added to the patent database. This does not mean that every XML version contains all these fields, or that every version contains the same path to these fields. That will have to be determined separately. The XML number next to some of the items is meant to represent the latest XML version where the field was seen.&lt;br /&gt;
&lt;br /&gt;
Though I was mainly looking at unique fields in plant, reissue, and design patents, I noticed the following fields on unique paths for utility patents in XML41:&lt;br /&gt;
&lt;br /&gt;
* US Related Documents&lt;br /&gt;
** Substitution&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Parent Document&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Country&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
***** Parent Status&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Country&lt;br /&gt;
** Continuation/Continuation in Part&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Doc&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
** Division&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Kind&lt;br /&gt;
&lt;br /&gt;
This appears to resemble the information about reissue patents. Additionally, a couple of fields appear under the information about bibliographic data that might be useful:&lt;br /&gt;
* US Deceased Inventor (XML41)&lt;br /&gt;
** Post Code&lt;br /&gt;
** Address&lt;br /&gt;
* US Provisional Application Status (XML41)&lt;br /&gt;
&lt;br /&gt;
==Paths for the New Fields Related to Plant, Reissue, and/or Design Patents==&lt;br /&gt;
&lt;br /&gt;
===Plant Patents===&lt;br /&gt;
XML 4.4, 4.3, 4.1, and 4.0&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-botanic&lt;br /&gt;
fields: latin-name, variety&lt;br /&gt;
&lt;br /&gt;
XML 4.5&lt;br /&gt;
parent node: us-patent-grant/us-claim-statement/&lt;br /&gt;
field: i&lt;br /&gt;
&lt;br /&gt;
XML 4.2 &lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-botanic&lt;br /&gt;
fields: latin-name, variety&lt;br /&gt;
&lt;br /&gt;
parent node: us-patent-grant/us-claim-statement/&lt;br /&gt;
field: i &lt;br /&gt;
&lt;br /&gt;
===Reissue Patents===&lt;br /&gt;
&lt;br /&gt;
XML 4.5&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/parent-doc/&lt;br /&gt;
fields: parent-status&lt;br /&gt;
&lt;br /&gt;
XML 4.4&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/continuing-reissue/relation/&lt;br /&gt;
fields: parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
	parent-doc/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/country&lt;br /&gt;
	child-doc/document-id/country&lt;br /&gt;
	child-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/doc-number&lt;br /&gt;
	parent-doc/document-id/date&lt;br /&gt;
(everything in XML 4.3 except parent-status)	&lt;br /&gt;
&lt;br /&gt;
XML 4.3&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/&lt;br /&gt;
fields: child-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
	parent-doc/parent-status&lt;br /&gt;
	parent-doc/document-id/date&lt;br /&gt;
	parent-doc/document-id/country&lt;br /&gt;
	parent-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/country&lt;br /&gt;
	child-doc/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/doc-number	&lt;br /&gt;
	&lt;br /&gt;
XML 4.1&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/&lt;br /&gt;
fields:&lt;br /&gt;
	child-doc/document-id/doc-number&lt;br /&gt;
	child-doc/document-id/country&lt;br /&gt;
	parent-doc/parent-status&lt;br /&gt;
	parent-doc/document-id/kind&lt;br /&gt;
	parent-doc/document-id/country&lt;br /&gt;
	parent-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/document-id/date&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/kind&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/doc-number&lt;br /&gt;
&lt;br /&gt;
other parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/us-reexamination-reissue-merger/relation/&lt;br /&gt;
fields: &lt;br /&gt;
	child-doc/document-id/doc-number&lt;br /&gt;
	child-doc/document-id/country&lt;br /&gt;
	parent-doc/document-id/date&lt;br /&gt;
	parent-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/kind&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/doc-number&lt;br /&gt;
&lt;br /&gt;
XML 4.0 and XML 4.2&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/&lt;br /&gt;
fields: parent-doc/document-id/kind&lt;br /&gt;
	parent-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/kind&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/country&lt;br /&gt;
	child-doc/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/doc-number&lt;br /&gt;
	child-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
	parent-doc/parent-status&lt;br /&gt;
	parent-doc/document-id/country&lt;br /&gt;
	parent-doc/document-id/date&lt;br /&gt;
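&lt;br /&gt;
Since the parent node changes between XML versions, one possible way to handle it in a parser is a lookup table from version to parent node. The sketch below copies the reissue parent nodes from the listing above into a Python dictionary; the constant and function names are hypothetical, and it does not check whether a given field actually exists in a given version.&lt;br /&gt;

```python
PREFIX = "us-patent-grant/us-bibliographic-data-grant/us-related-documents/"

# Parent nodes for reissue relations per XML version, copied from the listing above.
REISSUE_PARENT_NODES = {
    "4.0": ["reissue/relation/"],
    "4.1": ["reissue/relation/", "us-reexamination-reissue-merger/relation/"],
    "4.2": ["reissue/relation/"],
    "4.3": ["reissue/relation/"],
    "4.4": ["continuing-reissue/relation/"],
    "4.5": ["reissue/relation/"],
}


def candidate_paths(version, field):
    """Return every full path where a relation field may appear for this version."""
    return [PREFIX + node + field for node in REISSUE_PARENT_NODES[version]]


paths_41 = candidate_paths("4.1", "parent-doc/document-id/date")
paths_44 = candidate_paths("4.4", "child-doc/document-id/doc-number")
```

For XML 4.1 this yields two candidate paths per field, which matches the two parent nodes listed for that version.&lt;br /&gt;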
&lt;br /&gt;
===Design Patents===&lt;br /&gt;
XML 4.5&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/&lt;br /&gt;
fields: hague-agreement-data/international-registration-date/date&lt;br /&gt;
	hague-agreement-data/international-registration-publication-date/date&lt;br /&gt;
	us-term-of-grant/length-of-grant&lt;br /&gt;
	hague-agreement-data/international-registration-number&lt;br /&gt;
	hague-agreement-data/international-filing-date/date&lt;br /&gt;
&lt;br /&gt;
XML 4.1, 4.3, and 4.4&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-term-of-grant/&lt;br /&gt;
fields: length-of-grant&lt;br /&gt;
&lt;br /&gt;
XML 4.0&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/&lt;br /&gt;
fields: us-term-of-grant/length-of-grant&lt;br /&gt;
	classification-locarno/edition&lt;br /&gt;
	classification-locarno/main-classification&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=22346</id>
		<title>Redesign Assignment and Patent Database</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=22346"/>
		<updated>2017-12-08T20:29:14Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Redesign Assignment and Patent Database&lt;br /&gt;
|Has owner=Shelby Bice,&lt;br /&gt;
|Has start date=9/2017&lt;br /&gt;
|Has keywords=patent&lt;br /&gt;
|Is dependent on=Reproducible Patent Data,&lt;br /&gt;
}}&lt;br /&gt;
'''Final Notes''' This is more or less the finalized design for the patent database. Oliver Chang's code (see Reproducible Patent Data, which is his project page) more or less fits the design, though there are some differences, so expect differences in variable names. In the future the code should be altered so that the names of variables match up and each table that is listed here exists in the database. The tables for the extra variables that exist for reissue, design, and plant patents have been added, and the instructions for adding tables can be found on Reproducible Patent Data. &lt;br /&gt;
&lt;br /&gt;
'''For Oliver:''' Unfortunately I did not get to finish making the schema in your code fit the schema that is outlined here. For whoever works on this next, whether that be you or another intern, please note that the variables in the schema in the code do not match up exactly with the schema outlined here, except for perhaps the Reissues, Plants, and Designs tables in the Patent database (called patentsj in the code, I believe, but you can create a database with any name of course).&lt;br /&gt;
&lt;br /&gt;
This is an extension of the work I did last semester under &amp;quot;Redesigning Patent Database&amp;quot;. Instead of simply reconfiguring the existing database, this project encompasses a full redesign and creation of a new Patent database and a new Assignment database that will be joined together.&lt;br /&gt;
&lt;br /&gt;
Reference for definitions of fields in the two databases: https://www.oecd.org/sti/sci-tech/37569498.pdf&lt;br /&gt;
&lt;br /&gt;
Paper on how to extract data from XML files, parse it, and put it into a database: https://funginstitute.berkeley.edu/wp-content/uploads/2014/06/patentprocessor.pdf&lt;br /&gt;
&lt;br /&gt;
'''Particularly useful, because it provides solutions for cleaning location and firm data'''&lt;br /&gt;
&lt;br /&gt;
==ER Diagram for Assignment and Patent Databases==&lt;br /&gt;
&lt;br /&gt;
Note: this matches the design below but may not match the exact database (nor the code that currently creates the schema of the database). In the future these should be made to sync up.&lt;br /&gt;
&lt;br /&gt;
[[File:erdplus-diagram(3).png]]&lt;br /&gt;
&lt;br /&gt;
Attributes for each table are listed below with descriptions - because of how many attributes there are, I decided the ER diagram would be better suited as an overview of the tables rather than trying to show all the attributes on the diagram.&lt;br /&gt;
&lt;br /&gt;
==Assignment Database Structure==&lt;br /&gt;
&lt;br /&gt;
After reading through the paper &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot; (https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf) and taking into account our needs for the Assignment database (the ability to connect the two databases together, the ability to trace conveyance of a patent over time), I decided on the following structure for the Assignment database.&lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table in the database is rf_id, which stands for reel frame id. It is unique for each entry in the main table, assignment, and is included in every table. &lt;br /&gt;
&lt;br /&gt;
Conveyance is the only table that cannot be fully constructed as data is inserted into the database. Some of the fields depend on other tables being constructed, so it may either be partially populated while the other tables are being populated or be the last table to be populated. &lt;br /&gt;
&lt;br /&gt;
Full list of tables: assignee, assignor, document_info, assignment, conveyance&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: store all the assignees from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the recipient of a patent in an entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id, unless it is possible to have multiple assignees, then rf_id and assignee name&lt;br /&gt;
&lt;br /&gt;
All columns (variables) in the table: &lt;br /&gt;
&lt;br /&gt;
* rf_id (bigint) reel frame number &lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* address_line_1 (varchar(255)) first line of address of assignee&lt;br /&gt;
* address_line_2 (varchar(255)) second line of address of assignee&lt;br /&gt;
* city (varchar(255)) city of assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* country (varchar(255)) country of assignee&lt;br /&gt;
* postal_code (varchar(255)) post code of the assignee&lt;br /&gt;
&lt;br /&gt;
===ASSIGNOR===&lt;br /&gt;
	&lt;br /&gt;
Purpose: store all the assignors from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the one assigning/granting the license for the entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id and assignor_name&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
&lt;br /&gt;
===CONVEYANCE===&lt;br /&gt;
Purpose: represent the conveyance of a particular patent over time&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual transaction for a patent (not unlike assignment, except this table has information about the assignment in relation to other assignments for the patent)&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id &lt;br /&gt;
&lt;br /&gt;
All columns (variables): Two possibilities, depending on whether we want data to be repeated or not.&lt;br /&gt;
	&lt;br /&gt;
Not-repeated (with exception of primary/foreign key): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* next_transaction (bigint) rf_id of next transaction that occurred after this transaction (traced by assignee/assignor)&lt;br /&gt;
* num_conveyance (int) number in order of conveyance by date&lt;br /&gt;
* acknowledgment_date (date) date that the assignor acknowledged the transaction&lt;br /&gt;
* execution_date (date) date that the transaction took place&lt;br /&gt;
* conveyance_type (varchar(255)) the USPTO paper determined a couple of different conveyance types and also outlined how to determine based on the conveyance text in the xml file which conveyance type is relevant to the transaction. The conveyance types are:&lt;br /&gt;
** assignment&lt;br /&gt;
** merger&lt;br /&gt;
** change of name&lt;br /&gt;
** government interest&lt;br /&gt;
** agreement&lt;br /&gt;
** security agreement&lt;br /&gt;
** release&lt;br /&gt;
&lt;br /&gt;
	&lt;br /&gt;
Repeated, also include: &lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
* patent_number (varchar(255)) identifying number for the patent&lt;br /&gt;
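&lt;br /&gt;
Because num_conveyance and next_transaction depend on the other transactions for the same patent, they have to be filled in after those rows exist. Below is a minimal sketch in Python of one way to derive both fields for a single patent. It orders transactions purely by execution date, which is a simplification of tracing by assignee/assignor, and the function name and sample rf_id values are hypothetical.&lt;br /&gt;

```python
from datetime import date


def order_conveyances(transactions):
    """Number one patent's transactions by execution date and link each to the next.

    transactions: list of (rf_id, execution_date) pairs for a single patent.
    Returns a list of dicts with rf_id, num_conveyance, and next_transaction.
    """
    ordered = sorted(transactions, key=lambda pair: pair[1])
    result = []
    for index, (rf_id, _) in enumerate(ordered):
        is_last = index == len(ordered) - 1
        result.append({
            "rf_id": rf_id,
            "num_conveyance": index + 1,
            # The last transaction in the chain has no successor.
            "next_transaction": None if is_last else ordered[index + 1][0],
        })
    return result


chain = order_conveyances([
    (3003, date(2016, 7, 1)),
    (1001, date(2012, 1, 15)),
    (2002, date(2014, 5, 9)),
])
```

Transactions sharing an execution date keep their input order here, so a real implementation would need a tie-breaking rule.&lt;br /&gt;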
&lt;br /&gt;
===DOCUMENT_INFO===&lt;br /&gt;
Purpose: store extra information relevant to the patents represented in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the documentation and the patent for a particular entry in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables) when created: &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* app_num (varchar(255)) application document USPTO number &lt;br /&gt;
* app_date (date) date that the application was filed&lt;br /&gt;
* app_country (varchar(255)) country in which it was filed&lt;br /&gt;
* pgpub_num (varchar(255)) pre-grant publication document USPTO number&lt;br /&gt;
* pgpub_date (date) date that pre-grant publication was released&lt;br /&gt;
* pgpub_country (varchar(255)) country in which the patent is published before being granted a patent number&lt;br /&gt;
* patent_number (varchar(255)) granted patent document USPTO number&lt;br /&gt;
* grant_date (date) date that the patent is officially published&lt;br /&gt;
* grant_country (varchar(255)) country in which the patent is published after being granted a patent number&lt;br /&gt;
* invention_title (varchar(255)) title of the invention&lt;br /&gt;
* language (varchar(255)) language of invention title – could be potentially useful and interesting to investigate &lt;br /&gt;
* reel_num (bigint) number of the reel the assignment was stored on&lt;br /&gt;
* frame_num (bigint) number of the frame on the reel the assignment was stored on&lt;br /&gt;
&lt;br /&gt;
===ASSIGNMENT===&lt;br /&gt;
Purpose: represent assignment transactions – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent assignment transaction&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignment_id (bigint) US patent assignment id &lt;br /&gt;
* correspondent_name (varchar(255)) name of correspondent for the transaction, could be a lawyer&lt;br /&gt;
* correspondent_address1 (varchar(255)) first line of address for correspondent&lt;br /&gt;
* correspondent_address2 (varchar(255)) second line of address for correspondent&lt;br /&gt;
* correspondent_address3 (varchar(255)) third line of address for correspondent&lt;br /&gt;
* correspondent_address4 (varchar(255)) fourth line of address for correspondent&lt;br /&gt;
* record_date (date) date recorded with USPTO&lt;br /&gt;
* last_update_date (date) date that information for this assignment was last updated with the USPTO &lt;br /&gt;
* page_cnt (bigint) page count of assignment record&lt;br /&gt;
&lt;br /&gt;
More extensive notes exist under :E/McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper&lt;br /&gt;
&lt;br /&gt;
==Patent Database Structure==&lt;br /&gt;
&lt;br /&gt;
Most of this design is based off what I investigated last semester and therefore is subject to change, since the design for the Assignment database is somewhat new. &lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table is patent_number, which uniquely identifies every patent in the PATENT table and is included in every table.&lt;br /&gt;
&lt;br /&gt;
Full list of tables: patent, assignee, citation, fee, claims, histpatent, lawyers, inventors&lt;br /&gt;
&lt;br /&gt;
Update: These tables, as of October 26, are being updated and changed as I determine what data can actually be retrieved from the XML files based on the xpaths that we have determined (you can see these xpaths on the &amp;quot;Equivalent XPath and APS Queries&amp;quot; project page).&lt;br /&gt;
&lt;br /&gt;
===PATENT===&lt;br /&gt;
Purpose: represent patents – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* patent_type (varchar(255)) the type of patent, like utility&lt;br /&gt;
* patent_kind (varchar(2)) the kind of patent, a letter and a number&lt;br /&gt;
* title (varchar(255)) the title of the patent&lt;br /&gt;
* grantdate (date) date that patent was officially granted including the year&lt;br /&gt;
* prioritydate (date) date used to establish novelty of an invention in regards to other inventions (source: http://www.bios.net/daisy/patentlens/2343.html)&lt;br /&gt;
* prioritycountry (varchar(255)) country where patent is first filed&lt;br /&gt;
* patent_country (varchar(255)) country in which this patent was published&lt;br /&gt;
* prioritypatentnumber (varchar(255)) placeholder patent number if the patent is published before it is fully approved - also referred to as priority claims number&lt;br /&gt;
* cpcsubclass (varchar(255)) cooperative patent classification subclass&lt;br /&gt;
* cpcsubgroup (varchar(255)) cooperative patent classification subgroup&lt;br /&gt;
* cpcmaingroup (varchar(255)) cooperative patent classification main group&lt;br /&gt;
* cpctotal (varchar(255)) concatenated version of cpcmaingroup, cpcsubgroup, cpcsubclass&lt;br /&gt;
* pctpatentnumber (varchar(255)) international patent number (according to Patent Cooperation Treaty) also known as PCT Document Number&lt;br /&gt;
* appnum (varchar(255)) application number, probably also the filing number - the format is a two digit series code followed by a six digit serial number (source: https://www.uspto.gov/patents-application-process/filing-online/info-application-number)&lt;br /&gt;
* appdate (date) date that the application was filed, including year - probably also the filing date&lt;br /&gt;
* ipcrsubclass (varchar(255)) International patent classification subclass&lt;br /&gt;
* ipcrmaingroup (varchar(255)) International patent classification main group&lt;br /&gt;
* ipcrsubgroup (varchar(255)) International patent classification subgroup&lt;br /&gt;
* ipcrtotal (varchar(255)) concatenated version of ipcrsubgroup, ipcrsubclass, ipcrmaingroup &lt;br /&gt;
* national_classification (varchar(255)) - probably uspc, which is United States Patent Classification&lt;br /&gt;
* national_classification_country (varchar(255)) - should be United States&lt;br /&gt;
* num_claims (int) number of claims (things that the inventor wishes to protect with this patent)&lt;br /&gt;
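&lt;br /&gt;
As a small illustration of the appnum format described above (a two digit series code followed by a six digit serial number), the sketch below splits an application number into those two parts. The function name split_application_number and the sample values are hypothetical, and the assumption that punctuation such as a slash or comma can simply be dropped has not been checked against the XML data.&lt;br /&gt;
&lt;br /&gt;
```python
import re


def split_application_number(appnum):
    """Split an application number into (series_code, serial_number).

    Assumes the value holds exactly eight digits once punctuation is removed;
    returns None for anything else rather than guessing.
    """
    digits = re.sub(r"\D", "", appnum)
    if len(digits) != 8:
        return None
    return digits[:2], digits[2:]


# Made-up application numbers, purely for illustration.
print(split_application_number("15/123,456"))
print(split_application_number("15123456"))
print(split_application_number("123"))
```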
&lt;br /&gt;
Discrepancies: &lt;br /&gt;
These fields were previously a part of the table, but we do not currently have xpaths for them:&lt;br /&gt;
app_year - the year the application was filed&lt;br /&gt;
grant_year - the year the patent was granted&lt;br /&gt;
uspc - United States Patent Classification &lt;br /&gt;
(might be the same as national_classification)&lt;br /&gt;
uspcsub - United States Patent subgroup&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: Represent the person the patent was currently assigned to at publication according to the patent information (the information in the Assignment database is more complete) &lt;br /&gt;
&lt;br /&gt;
Each entry represents: the assignee for a patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number, first_name, last_name (in case it is possible for there to be multiple assignees for an individual patent)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar &lt;br /&gt;
* first_name (varchar(255)) first name of assignee&lt;br /&gt;
* last_name (varchar(255)) last name of assignee&lt;br /&gt;
* address (varchar(255)) address of assignee&lt;br /&gt;
* postcode (varchar(255)) postcode of assignee&lt;br /&gt;
* orgname (varchar(255)) name of the organization owning the patent (if applicable - organization may go by multiple orgnames) &lt;br /&gt;
* city (varchar(255)) city of the assignee&lt;br /&gt;
* country (varchar(255)) country of the assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* residence (varchar(255)) mailing address of assignee if they receive mail at a different address than the one listed above&lt;br /&gt;
* size_of_firm (varchar(255)) size of the organization, large, small, or micro - relates to how much the fee they pay for the patent is&lt;br /&gt;
&lt;br /&gt;
===CITATION===&lt;br /&gt;
Purpose: Represent all the patents in PATENT that another patent in PATENT cites&lt;br /&gt;
&lt;br /&gt;
Each entry represents: one citation that a patent in PATENT makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number AND cited_patent_number (because a patent might cite multiple patents, so both variables are needed for the primary key)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* cited_patent_number (varchar(255)) unique identifier from the USPTO for the patent being cited&lt;br /&gt;
* description (varchar(255)) description of the citation&lt;br /&gt;
* citation_date (date) unclear what this date is supposed to represent from the DTD, but according to the paper that is the second link in the top section, it is the &amp;quot;date of cited document&amp;quot;&lt;br /&gt;
* citation_kind (varchar(255)) type of cited document&lt;br /&gt;
* citation_country (varchar(255)) country of origin for the cited patent &lt;br /&gt;
* citation_name (varchar(255)) appears to be the names of those to whom the cited patent is assigned&lt;br /&gt;
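&lt;br /&gt;
The composite primary key described above can be sketched as follows, assuming SQLite as a stand-in engine and showing only a few of the columns. The helper add_citation and the patent numbers are made up for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE citation (
    patent_number       VARCHAR(255) NOT NULL,
    cited_patent_number VARCHAR(255) NOT NULL,
    citation_date       DATE,
    citation_kind       VARCHAR(255),
    PRIMARY KEY (patent_number, cited_patent_number)
)
""")


def add_citation(conn, citing, cited):
    """Insert one citation; return False if that exact pair already exists."""
    try:
        conn.execute(
            "INSERT INTO citation (patent_number, cited_patent_number) "
            "VALUES (?, ?)",
            (citing, cited),
        )
        return True
    except sqlite3.IntegrityError:
        return False


# Made-up patent numbers, purely for illustration.
results = [
    add_citation(conn, "9000001", "8000001"),
    add_citation(conn, "9000001", "8000002"),  # same citing patent, new cited patent
    add_citation(conn, "9000001", "8000001"),  # exact repeat of the first pair
]
print(results)
```
&lt;br /&gt;
One patent can cite many patents because only the pair has to be unique; the exact repeat in the last call is the only insert that is rejected.&lt;br /&gt;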
&lt;br /&gt;
Note: There are other fields available for this table, such as document number, but this seems like it might be a repeat of the document number for the citing patent&lt;br /&gt;
&lt;br /&gt;
===FEE===&lt;br /&gt;
Purpose: Represent information about the fees paid for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the fees paid on the patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar  &lt;br /&gt;
* four (boolean) payment of maintenance fee 4th year&lt;br /&gt;
* eight (boolean) payment of maintenance fee 8th year&lt;br /&gt;
* twelve (boolean) payment of maintenance fee 12th year&lt;br /&gt;
* fee_date (date) date that fee is paid on&lt;br /&gt;
* fee_code (varchar(255)) code for type of fee to pay&lt;br /&gt;
&lt;br /&gt;
===CLAIMS===&lt;br /&gt;
&lt;br /&gt;
Purpose: Represent information about the claims for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a claim that a patent makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number and claim_id&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* claim_id (varchar(255)) id number for the claim&lt;br /&gt;
* language (varchar(255)) language of claim&lt;br /&gt;
* claim_type (varchar(255)) type of claim&lt;br /&gt;
* status (varchar(255)) status of the claim&lt;br /&gt;
* claim_text (text) the text of the claim&lt;br /&gt;
&lt;br /&gt;
===HISTPATENT===&lt;br /&gt;
Purpose: Represent historical data about the patent pertaining to dates and publication&lt;br /&gt;
&lt;br /&gt;
Each entry represents: sequence of historical data about an individual patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* pubno (varchar(255)) patent publication number&lt;br /&gt;
* pubdate (date) date patent was issued&lt;br /&gt;
* dispdate (date) disposal date for the application&lt;br /&gt;
* distype (varchar(3)) application status of the patent - ABN for abandoned, ISS for issued, or PEN for pending&lt;br /&gt;
* exp_dt (date) expiration date of patent&lt;br /&gt;
* pta (int) length of patent term adjustment (which extends the amount of time the patent is allowed to be in force) in number of days&lt;br /&gt;
&lt;br /&gt;
===INVENTORS===&lt;br /&gt;
Purpose: Represent the inventor for each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about an inventor for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (and name, if there can be multiple inventors for a single patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identify the inventor?&lt;br /&gt;
* name (varchar(255)) concatenated string of the first and last name of the inventor&lt;br /&gt;
* orgname (varchar(255)) name of the organization the inventor works for&lt;br /&gt;
* state (varchar(255)) - state in which the inventor resided&lt;br /&gt;
* address (varchar(255)) - address at which the inventor resided&lt;br /&gt;
* country (varchar(255)) country in which the inventor resided&lt;br /&gt;
* city (varchar(255)) city in which the inventor resided&lt;br /&gt;
&lt;br /&gt;
===LAWYERS===&lt;br /&gt;
Purpose: Represent the lawyers who worked on each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a lawyer for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (there doesn't appear to be more than one lawyer entry for each patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identifies the lawyer?&lt;br /&gt;
* org_name (varchar(255)) the name of the firm that did arbitration for the patent&lt;br /&gt;
&lt;br /&gt;
===Unique Attributes Tables===&lt;br /&gt;
&lt;br /&gt;
There are several attributes that have been identified as unique, applying to only one kind of patent. Adding all these fields to the main patent table would introduce a lot of NULL entries, because three out of the four patent types will not have any particular attribute. Therefore, I have created three new tables to store these unique attributes - one for each type of patent with unique attributes. &lt;br /&gt;
&lt;br /&gt;
For each table, the primary key is the patent_number, which connects a row to a row in the main patent table. &lt;br /&gt;
&lt;br /&gt;
TODO: Figure out best way to generate these tables, finish description of attributes for each table.&lt;br /&gt;
&lt;br /&gt;
* Note that I did include some unique attributes for Utility patents as seen below in the &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot; section, but for now (11/3/2017) I'm not going to create a separate table for those fields. Most of them seem to be repeated in Reissue, and the other fields, while interesting, are perhaps not very useful.&lt;br /&gt;
&lt;br /&gt;
====PLANT====&lt;br /&gt;
=====Fields=====&lt;br /&gt;
* patent_no (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* latin_name (varchar(255)) Latin name for the plant&lt;br /&gt;
* us_botanical_variety (varchar(255)) denotes what variety of plant it is - for example, a rose has several different varieties&lt;br /&gt;
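&lt;br /&gt;
A minimal sketch of how a unique-attributes table connects back to the main patent table, assuming SQLite as a stand-in engine and a cut-down PATENT table. The rows (including the plant name) are made up for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patent (
    patent_number VARCHAR(255) PRIMARY KEY,
    patent_type   VARCHAR(255),
    title         VARCHAR(255)
);
CREATE TABLE plant (
    patent_no            VARCHAR(255) PRIMARY KEY REFERENCES patent (patent_number),
    latin_name           VARCHAR(255),
    us_botanical_variety VARCHAR(255)
);
""")

# Made-up rows, purely for illustration.
conn.executemany(
    "INSERT INTO patent VALUES (?, ?, ?)",
    [
        ("PP00001", "plant", "Example rose"),
        ("0000001", "utility", "Example widget"),
    ],
)
conn.execute(
    "INSERT INTO plant VALUES (?, ?, ?)",
    ("PP00001", "Rosa exemplum", "Example variety"),
)

# A LEFT JOIN keeps every patent; plant columns are NULL only in the query
# result for non-plant patents, not stored in the PATENT table itself.
rows = conn.execute("""
    SELECT p.patent_number, pl.latin_name
    FROM patent AS p
    LEFT JOIN plant AS pl ON pl.patent_no = p.patent_number
    ORDER BY p.patent_number
""").fetchall()
print(rows)
```
&lt;br /&gt;
Only plant patents take up a row in PLANT, so the main table does not need mostly empty latin_name and us_botanical_variety columns.&lt;br /&gt;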
&lt;br /&gt;
====REISSUE====&lt;br /&gt;
&lt;br /&gt;
In the following table, you'll notice lots of fields related to three different kinds of &amp;quot;documents&amp;quot; - a &amp;quot;parent document&amp;quot;, a &amp;quot;child document&amp;quot;, and a &amp;quot;parent grant document&amp;quot;. It is not immediately clear what these three documents represent for a reissue patent. After some research, I think I have determined what these &amp;quot;documents&amp;quot; are, but please know that I do not have definitive proof as there is little available information about these &amp;quot;documents&amp;quot; in regards to a reissue patent.&lt;br /&gt;
&lt;br /&gt;
It is possible that the &amp;quot;parent document&amp;quot; is the parent patent application - that is, the first patent application filed in regards to an invention. That would explain why reissue patents also have several fields related to a &amp;quot;child document&amp;quot; - a &amp;quot;child document&amp;quot; could be a child application, which is filed while a parent application is still pending. So in this case, a &amp;quot;child document&amp;quot; of a reissue patent would be an application regarding the same patent that was filed while the reissue application was still pending. A &amp;quot;child document&amp;quot; is either a continuation, divisional, or continuation-in-part application ([http://www.patenttrademarkblog.com/parent-and-child-patent-applications/ source]).&lt;br /&gt;
&lt;br /&gt;
Based on this logic, I think that the difference between a &amp;quot;parent document&amp;quot; and a &amp;quot;parent grant document&amp;quot; depends on whether the parent patent application has been granted. If the parent patent application is still pending (meaning the patent has not yet been granted) then I believe the reissue patent will store information about the parent patent under &amp;quot;parent document&amp;quot;. However, if the parent patent application has been granted, then the information will be stored under &amp;quot;parent grant document&amp;quot;. This seems like the most logical explanation, especially considering that the path to any field related to &amp;quot;parent grant document&amp;quot; contains &amp;quot;parent document&amp;quot;, as in the example below:&lt;br /&gt;
&lt;br /&gt;
 us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
&lt;br /&gt;
Therefore, a particular reissue patent will probably only have the fields filled for either the &amp;quot;parent document&amp;quot; or &amp;quot;parent grant document&amp;quot; if I am right about what they represent.&lt;br /&gt;
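&lt;br /&gt;
To make the nesting concrete, the sketch below reads both the parent document and the parent grant document fields from one reissue record using Python's xml.etree.ElementTree. The tree is built by hand with made-up values (it is not real USPTO data), and the function names are hypothetical; whether real records fill one or both sets of fields is exactly the open question described above.&lt;br /&gt;
&lt;br /&gt;
```python
import xml.etree.ElementTree as ET

# Path from the us-patent-grant root down to the relation node, taken from
# the example path above.
RELATION_PATH = "us-bibliographic-data-grant/us-related-documents/reissue/relation"


def build_sample_grant():
    """Build a tiny, made-up us-patent-grant tree (not real USPTO data)."""
    root = ET.Element("us-patent-grant")
    node = root
    for tag in RELATION_PATH.split("/") + ["parent-doc"]:
        node = ET.SubElement(node, tag)
    parent_doc = node
    doc_id = ET.SubElement(parent_doc, "document-id")
    ET.SubElement(doc_id, "doc-number").text = "00000001"
    ET.SubElement(doc_id, "date").text = "20000101"
    grant = ET.SubElement(parent_doc, "parent-grant-document")
    grant_id = ET.SubElement(grant, "document-id")
    ET.SubElement(grant_id, "doc-number").text = "0000002"
    ET.SubElement(grant_id, "date").text = "20020101"
    return root


def reissue_parent_fields(root):
    """Return parent and parent grant fields; a missing node yields None."""
    base = RELATION_PATH + "/parent-doc"
    grant = base + "/parent-grant-document"
    return {
        "parent_doc_number": root.findtext(base + "/document-id/doc-number"),
        "parent_doc_date": root.findtext(base + "/document-id/date"),
        "parent_grant_doc_number": root.findtext(grant + "/document-id/doc-number"),
        "parent_grant_doc_date": root.findtext(grant + "/document-id/date"),
    }


fields = reissue_parent_fields(build_sample_grant())
print(fields)
```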
&lt;br /&gt;
=====Fields=====&lt;br /&gt;
&lt;br /&gt;
* patent_no (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* parent_doc_status (varchar(255)) status of the parent application. Probably related to whether the patent application is pending or not&lt;br /&gt;
* parent_doc_number (int) probably application number for the parent document&lt;br /&gt;
* parent_doc_id (varchar(255)) it is unclear how this is different from parent document number. In the xpaths, the path will be something like ./parent_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent document&amp;quot; is - or may refer to its purpose (i.e. reissue, continuation, continuation-in-part, etc.)&lt;br /&gt;
* parent_doc_country (varchar(255)) the country of origin of the parent document&lt;br /&gt;
* parent_doc_date (date) probably the date the parent document was filed&lt;br /&gt;
* parent_grant_doc_number (int) probably application number for the parent grant document&lt;br /&gt;
* parent_grant_doc_id (varchar(255)) it is unclear how this is different from parent grant document number. In the xpaths, the path will be something like ./parent_doc/parent-grant-doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_grant_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent grant document&amp;quot; is, or may refer to its purpose (i.e. reissue, continuation, continuation-in-part, etc.)&lt;br /&gt;
* parent_grant_doc_country (varchar(255)) the country of origin of the parent grant document&lt;br /&gt;
* parent_grant_doc_date (date) probably the date the parent grant document was filed&lt;br /&gt;
* child_doc_number (varchar(255)) probably the application number of the child application&lt;br /&gt;
* child_doc_id (varchar(255)) it is unclear how this is different from child document number. In the xpaths, the path will be something like ./child_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* child_doc_country (varchar(255)) country of origin of the child application&lt;br /&gt;
&lt;br /&gt;
====DESIGN====&lt;br /&gt;
=====Fields=====&lt;br /&gt;
* patent_no (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* length_of_grant (int) length of grant, most likely in years&lt;br /&gt;
* hague_registration_date (date) filing date of international patent application&lt;br /&gt;
* hague_filing_date (date) not necessarily the same as the filing date of the international patent application, this is the date that the International Bureau receives all necessary elements for the international patent application&lt;br /&gt;
* hague_registration_pub_date (date) date that the International Bureau publishes the international patent application&lt;br /&gt;
* hague_international_registration_number (varchar(255)) international registration number&lt;br /&gt;
* edition (varchar(255)) possibly the edition of the Classification Locarno which determined main_classification&lt;br /&gt;
* main_classification (varchar(255)) classification for what type of design the patent is for&lt;br /&gt;
&lt;br /&gt;
==Connecting Patent database and Assignment database==&lt;br /&gt;
The answer to connecting the Patent database to the Assignment database lies somewhere in using the information in the table DOCUMENT_INFO to connect to a patent_id from the Patent database PATENT table for each assignment in the Assignment database ASSIGNMENT table. On further investigation into &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot;, the field they called &amp;quot;grant_doc_num&amp;quot; (grant_num above) in DOCUMENT_INFO is the patent number (the description of DOCUMENT_INFO above has been altered to reflect this).&lt;br /&gt;
&lt;br /&gt;
The paper mentions that there will be errors in the patent number stored in DOCUMENT_INFO under grant_doc_num, so a separate table called DOCUMENT_INFO_ADMIN was constructed to determine how prevalent errors were. They queried the patent number for the appno_doc_num from administrative data. The &amp;quot;appno_doc_num&amp;quot; is the Application Document USPTO number from DOCUMENT_INFO. On further investigation into the paper, it became clear that &amp;quot;administrative data&amp;quot; refers to internal USPTO data that was available to the authors of the paper, but that we might not have. However, in 99% of cases the grant_doc_num and the queried patent number based on the appno_doc_num were the same, so we can probably rely on grant_doc_num.&lt;br /&gt;
&lt;br /&gt;
Therefore, since every table in the Patent database will have the patent number stored, to connect any table in Assignment to the Patent database, it would first be joined with DOCUMENT_INFO on rf_id and then joined with the appropriate table in Patent on patent number.&lt;br /&gt;
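&lt;br /&gt;
The two-step join described above can be sketched as follows, assuming SQLite as a stand-in engine and cut-down versions of the three tables. The rows are made up for illustration, and DOCUMENT_INFO uses the column name patent_number from the table description earlier on this page.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE assignment (rf_id BIGINT PRIMARY KEY, record_date DATE);
CREATE TABLE document_info (rf_id BIGINT, patent_number VARCHAR(255));
CREATE TABLE patent (patent_number VARCHAR(255) PRIMARY KEY, title VARCHAR(255));
""")

# Made-up rows, purely for illustration. The second assignment points at a
# patent number that is not present in PATENT.
conn.executemany(
    "INSERT INTO assignment VALUES (?, ?)",
    [(100, "2017-01-01"), (200, "2017-02-02")],
)
conn.executemany(
    "INSERT INTO document_info VALUES (?, ?)",
    [(100, "9000001"), (200, "9000002")],
)
conn.execute("INSERT INTO patent VALUES (?, ?)", ("9000001", "Example widget"))

# First join ASSIGNMENT to DOCUMENT_INFO on rf_id, then DOCUMENT_INFO to
# PATENT on the patent number.
JOIN_SQL = """
SELECT a.rf_id, a.record_date, p.patent_number, p.title
FROM assignment AS a
JOIN document_info AS d ON d.rf_id = a.rf_id
JOIN patent AS p ON p.patent_number = d.patent_number
ORDER BY a.rf_id
"""
rows = conn.execute(JOIN_SQL).fetchall()
print(rows)
```
&lt;br /&gt;
With an inner join, an assignment whose patent number has no match in PATENT simply drops out of the result, so a count of dropped rows would be one rough way to see how often the stored patent numbers fail to line up.&lt;br /&gt;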
&lt;br /&gt;
Also in the paper, they mentioned the Assignments on the Web for Patents (AOTW-P), a searchable database of individual USPTO assignment records keyed on reel-frame identification, patent number, and assignor or assignee name (https://assignment.uspto.gov/patent/index.html#/patent/search). Obviously it would not be possible to individually use this tool to query all the patent numbers, but if it would be possible to write a script to somehow query each patent number using the rf_id and parse the response, this could potentially be useful to check the patent numbers, but might not be any more accurate than what will already be in DOCUMENT_INFO.&lt;br /&gt;
&lt;br /&gt;
==Finding New Fields Unique to Plant, Reissue, and/or Design Patents==&lt;br /&gt;
&lt;br /&gt;
Based on Oliver's script, which searched all xpaths and compared which were unique to particular types, we see that the following attributes are unique to each type of patent other than utility patents. These attributes vary by the XML version, which changed over time. Therefore, the lists below are a superset of the attributes that are unique to each of the patent types listed below across all XML versions.&lt;br /&gt;
&lt;br /&gt;
When I am done finding the supersets, we will determine how to integrate these attributes into the design for the patent table.&lt;br /&gt;
&lt;br /&gt;
'''There appears to be a fifth patent type in the results from Oliver's script, and there are attributes that are unique to Utility patents, though they do not appear particularly useful'''&lt;br /&gt;
&lt;br /&gt;
===Plant Patents===&lt;br /&gt;
* US Claim Statement (XML45) - unclear what this represents&lt;br /&gt;
* Latin Name (XML44)&lt;br /&gt;
* US Botanical Variety (XML44)&lt;br /&gt;
&lt;br /&gt;
===Reissue Patents===&lt;br /&gt;
* Parent Status (XML45)&lt;br /&gt;
* Parent Document (XML44) - seems to be either a normal document or a parent grant document (or both), both contain the following fields&lt;br /&gt;
** Date&lt;br /&gt;
** Document number&lt;br /&gt;
** Country &lt;br /&gt;
** Document ID&lt;br /&gt;
** Kind (XML42)&lt;br /&gt;
* Child Document (XML44)&lt;br /&gt;
** Document ID&lt;br /&gt;
** Document Number&lt;br /&gt;
** Country (XML43)&lt;br /&gt;
&lt;br /&gt;
Note about fields related to the parent document and/or parent grant document:&lt;br /&gt;
The structure of the above fields related to a reissue patent is very odd and varies by XML version. For example, for most of the XML versions, parent grant document falls under parent document and has its own document number, date, country, document ID, kind, etc. The parent doc field falls under relation, which falls under reissue. &lt;br /&gt;
&lt;br /&gt;
However, in XML41, parent document falls under relation and a new category - US Reexamination Reissue Merger. The information pertaining to the child document also falls under both relation and US Reexamination Reissue Merger. It's possible that during the time period this XML version represents, there were two types of reissue patents - some reexamination related to a merger and a standard reissue. With discrepancies like these, we'll have to determine in the end whether we want to stick with the fields as they are designed above (in which case the information about a parent doc or parent grant document is stored the same whether it is a reissue or a Reexamination Reissue), or if these two classifications are different enough to warrant duplicate fields in the table.&lt;br /&gt;
&lt;br /&gt;
===Design Patents===&lt;br /&gt;
* length of grant (XML45)&lt;br /&gt;
* Hague Agreement Data (XML45) - allows people to file design patents in 66 countries with one application&lt;br /&gt;
** International Registration Date&lt;br /&gt;
** International Filing Date&lt;br /&gt;
** International Registration Publication Date&lt;br /&gt;
** International Registration Number&lt;br /&gt;
* Classification Locarno (XML40)&lt;br /&gt;
** Edition&lt;br /&gt;
** Main Classification&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please note that the lists above are meant to represent unique types that should be added to the patent database. This does not mean that every XML version contains all these fields, or that every version contains the same path to these fields. That will have to be determined separately. The XML number next to some of the items is meant to represent the latest XML version where the field was seen.&lt;br /&gt;
&lt;br /&gt;
Though I was mainly looking at unique fields in plant, reissue, and design patents, I noticed the following fields on unique paths for utility patents in XML41:&lt;br /&gt;
&lt;br /&gt;
* US Related Documents&lt;br /&gt;
** Substitution&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Parent Document&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Country&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
***** Parent Status&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Country&lt;br /&gt;
** Continuation/Continuation in Part&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Doc&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
** Division&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Kind&lt;br /&gt;
&lt;br /&gt;
This appears to resemble the information about reissue patents. Additionally, a couple of fields appear under the information about bibliographic data that might be useful:&lt;br /&gt;
* US Deceased Inventor (XML41)&lt;br /&gt;
** Post Code&lt;br /&gt;
** Address&lt;br /&gt;
* US Provisional Application Status (XML41)&lt;br /&gt;
&lt;br /&gt;
==Paths for the New Fields Related to Plant, Reissue, and/or Design Patents==&lt;br /&gt;
&lt;br /&gt;
===Plant Patents===&lt;br /&gt;
XML 4.4, 4.3, 4.1, and 4.0&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-botanic&lt;br /&gt;
fields: latin-name, variety&lt;br /&gt;
&lt;br /&gt;
XML 4.5&lt;br /&gt;
parent node: us-patent-grant/us-claim-statement/&lt;br /&gt;
field: i&lt;br /&gt;
&lt;br /&gt;
XML 4.2 &lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-botanic&lt;br /&gt;
fields: latin-name, variety&lt;br /&gt;
&lt;br /&gt;
parent node: us-patent-grant/us-claim-statement/&lt;br /&gt;
field: i &lt;br /&gt;
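&lt;br /&gt;
One way to keep track of the per-version plant paths listed above is a simple lookup table, sketched below. The names PLANT_PATHS and plant_paths are hypothetical, and the version strings are just the labels used on this page.&lt;br /&gt;
&lt;br /&gt;
```python
BOTANIC = "us-patent-grant/us-bibliographic-data-grant/us-botanic"
CLAIM_STATEMENT = "us-patent-grant/us-claim-statement"

BOTANIC_FIELDS = [BOTANIC + "/latin-name", BOTANIC + "/variety"]
CLAIM_FIELDS = [CLAIM_STATEMENT + "/i"]

# Paths unique to plant patents, keyed by XML version as listed above.
PLANT_PATHS = {
    "4.0": BOTANIC_FIELDS,
    "4.1": BOTANIC_FIELDS,
    "4.2": BOTANIC_FIELDS + CLAIM_FIELDS,
    "4.3": BOTANIC_FIELDS,
    "4.4": BOTANIC_FIELDS,
    "4.5": CLAIM_FIELDS,
}


def plant_paths(version):
    """Return the plant-specific paths for a version, or [] if unknown."""
    return list(PLANT_PATHS.get(version, []))


for version in sorted(PLANT_PATHS):
    print(version, plant_paths(version))
```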
&lt;br /&gt;
===Reissue Patents===&lt;br /&gt;
&lt;br /&gt;
XML 4.5&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/parent-doc/&lt;br /&gt;
fields: parent-status&lt;br /&gt;
&lt;br /&gt;
XML 4.4&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/continuing-reissue/relation/&lt;br /&gt;
fields: parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
	parent-doc/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/country&lt;br /&gt;
	child-doc/document-id/country&lt;br /&gt;
	child-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/doc-number&lt;br /&gt;
	parent-doc/document-id/date&lt;br /&gt;
(everything in XML 4.3 except parent-status)	&lt;br /&gt;
&lt;br /&gt;
XML 4.3&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/&lt;br /&gt;
fields: child-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
	parent-doc/parent-status&lt;br /&gt;
	parent-doc/document-id/date&lt;br /&gt;
	parent-doc/document-id/country&lt;br /&gt;
	parent-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/country&lt;br /&gt;
	child-doc/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/doc-number	&lt;br /&gt;
	&lt;br /&gt;
XML 4.1&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/&lt;br /&gt;
fields:&lt;br /&gt;
	child-doc/document-id/doc-number&lt;br /&gt;
	child-doc/document-id/country&lt;br /&gt;
	parent-doc/parent-status&lt;br /&gt;
	parent-doc/document-id/kind&lt;br /&gt;
	parent-doc/document-id/country&lt;br /&gt;
	parent-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/document-id/date&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/kind&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/doc-number&lt;br /&gt;
&lt;br /&gt;
other parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/us-reexamination-reissue-merger/relation/&lt;br /&gt;
fields: &lt;br /&gt;
	child-doc/document-id/doc-number&lt;br /&gt;
	child-doc/document-id/country&lt;br /&gt;
	parent-doc/document-id/date&lt;br /&gt;
	parent-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/kind&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/doc-number&lt;br /&gt;
&lt;br /&gt;
XML 4.0 and XML 4.2&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/&lt;br /&gt;
fields: parent-doc/document-id/kind&lt;br /&gt;
	parent-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/kind&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/country&lt;br /&gt;
	child-doc/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/doc-number&lt;br /&gt;
	child-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
	parent-doc/parent-status&lt;br /&gt;
	parent-doc/document-id/country&lt;br /&gt;
	parent-doc/document-id/date&lt;br /&gt;
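&lt;br /&gt;
Since the relation node sits under a different parent depending on the XML version (reissue, continuing-reissue, or us-reexamination-reissue-merger in the lists above), a parser could try each candidate in turn. The sketch below does that with Python's xml.etree.ElementTree on hand-built, made-up trees; find_relation and build_grant are hypothetical names, and the order of the candidates is an arbitrary choice.&lt;br /&gt;
&lt;br /&gt;
```python
import xml.etree.ElementTree as ET

RELATED = "us-bibliographic-data-grant/us-related-documents"

# Parents of the relation node seen in the per-version lists above.
RELATION_PARENTS = ["reissue", "continuing-reissue", "us-reexamination-reissue-merger"]


def find_relation(root):
    """Return (parent_tag, relation_element) for the first candidate found."""
    for parent in RELATION_PARENTS:
        relation = root.find(RELATED + "/" + parent + "/relation")
        if relation is not None:
            return parent, relation
    return None, None


def build_grant(parent_tag):
    """Build a made-up us-patent-grant tree with one empty relation node."""
    root = ET.Element("us-patent-grant")
    node = root
    for tag in RELATED.split("/") + [parent_tag, "relation"]:
        node = ET.SubElement(node, tag)
    return root


for tag in RELATION_PARENTS:
    found_under, relation = find_relation(build_grant(tag))
    print(tag, found_under, relation.tag)
```
&lt;br /&gt;
Returning the parent tag alongside the node keeps the option open of storing reissue and reexamination reissue merger records differently later on.&lt;br /&gt;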
&lt;br /&gt;
===Design Patents===&lt;br /&gt;
XML 4.5&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/&lt;br /&gt;
fields: hague-agreement-data/international-registration-date/date&lt;br /&gt;
	hague-agreement-data/international-registration-publication-date/date&lt;br /&gt;
	us-term-of-grant/length-of-grant&lt;br /&gt;
	hague-agreement-data/international-registration-number&lt;br /&gt;
	hague-agreement-data/international-filing-date/date&lt;br /&gt;
&lt;br /&gt;
XML 4.1, 4.3, and 4.4&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-term-of-grant/&lt;br /&gt;
fields: length-of-grant&lt;br /&gt;
&lt;br /&gt;
XML 4.0&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/&lt;br /&gt;
fields: us-term-of-grant/length-of-grant&lt;br /&gt;
	classification-locarno/edition&lt;br /&gt;
	classification-locarno/main-classification&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=22345</id>
		<title>Redesign Assignment and Patent Database</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=22345"/>
		<updated>2017-12-08T20:20:56Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* ER Diagram for Assignment and Patent Databases */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Redesign Assignment and Patent Database&lt;br /&gt;
|Has owner=Shelby Bice,&lt;br /&gt;
|Has start date=9/2017&lt;br /&gt;
|Has keywords=patent&lt;br /&gt;
|Is dependent on=Reproducible Patent Data,&lt;br /&gt;
}}&lt;br /&gt;
'''Final Notes''' This is more or less the finalized design for the patent database. Oliver Chang's code (see Reproducible Patent Data, which is his project page) largely fits the design, though there are some differences, particularly in variable names. In the future the code should be altered so that the variable names match and each table listed here exists in the database. The tables for the extra variables that exist for reissue, design, and plant patents have been added, and the instructions for adding tables can be found on Reproducible Patent Data. &lt;br /&gt;
&lt;br /&gt;
This is an extension of the work I did last semester under &amp;quot;Redesigning Patent Database&amp;quot;. Instead of simply reconfiguring the existing database, this project encompasses a full redesign and the creation of a new Patent database and a new Assignment database that will be joined together.&lt;br /&gt;
&lt;br /&gt;
Reference for definitions of fields in the two databases: https://www.oecd.org/sti/sci-tech/37569498.pdf&lt;br /&gt;
&lt;br /&gt;
Paper on how to extract patent data from XML files, parse it, and put it into a database: https://funginstitute.berkeley.edu/wp-content/uploads/2014/06/patentprocessor.pdf&lt;br /&gt;
&lt;br /&gt;
'''Particularly useful, because it provides solutions for cleaning location and firm data'''&lt;br /&gt;
&lt;br /&gt;
==ER Diagram for Assignment and Patent Databases==&lt;br /&gt;
&lt;br /&gt;
Note: this matches the design below but may not match the exact database (nor the code that currently creates the schema of the database). In the future these should be made to sync up.&lt;br /&gt;
&lt;br /&gt;
[[File:erdplus-diagram(3).png]]&lt;br /&gt;
&lt;br /&gt;
Attributes for each table are listed below with descriptions - because of how many attributes there are, I decided the ER diagram would be better suited as an overview of the tables rather than trying to show all the attributes on the diagram.&lt;br /&gt;
&lt;br /&gt;
==Assignment Database Structure==&lt;br /&gt;
&lt;br /&gt;
After reading through the paper &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot; (https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf) and taking into account our needs for the Assignment database (the ability to connect the two databases together and the ability to trace conveyance of a patent over time), I decided on the following structure for the Assignment database.&lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table in the database is rf_id, which stands for reel frame id. It is unique for each entry in the main table, ASSIGNMENT, and is included in every table. &lt;br /&gt;
&lt;br /&gt;
Conveyance is the only table that cannot be fully constructed as data is inserted into the database. Some of its fields depend on other tables being constructed, so it may either be partially populated while the other tables are being populated or be the last table to be populated. &lt;br /&gt;
&lt;br /&gt;
Full list of tables: assignee, assignor, document_info, assignment, conveyance&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: store all the assignees from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the recipient of a patent in an entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id, unless it is possible to have multiple assignees, then rf_id and assignee name&lt;br /&gt;
&lt;br /&gt;
All columns (variables) in the table: &lt;br /&gt;
&lt;br /&gt;
* rf_id (bigint) reel frame number &lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* address_line_1 (varchar(255)) first line of address of assignee&lt;br /&gt;
* address_line_2 (varchar(255)) second line of address of assignee&lt;br /&gt;
* city (varchar(255)) city of assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* country (varchar(255)) country of assignee&lt;br /&gt;
* postal_code (varchar(255)) post code of the assignee&lt;br /&gt;
&lt;br /&gt;
===ASSIGNOR===&lt;br /&gt;
	&lt;br /&gt;
Purpose: store all the assignors from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the one assigning/granting the license for the entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id and assignor_name&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
&lt;br /&gt;
===CONVEYANCE===&lt;br /&gt;
Purpose: represent the conveyance of a particular patent over time&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual transaction for a patent (not unlike assignment, except this table has information about the assignment in relation to other assignments for the patent)&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id &lt;br /&gt;
&lt;br /&gt;
All columns (variables): Two possibilities, depending on whether we want data to be repeated or not.&lt;br /&gt;
	&lt;br /&gt;
Not-repeated (with exception of primary/foreign key): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* next_transaction (bigint) rf_id of next transaction that occurred after this transaction (traced by assignee/assignor)&lt;br /&gt;
* num_conveyance (int) number in order of conveyance by date&lt;br /&gt;
* acknowledgment_date (date) date that the assignor acknowledged the transaction&lt;br /&gt;
* execution_date (date) date that the transaction took place&lt;br /&gt;
* conveyance_type (varchar(255)) the USPTO paper defines several conveyance types and outlines how to determine, based on the conveyance text in the XML file, which type applies to the transaction. The conveyance types are:&lt;br /&gt;
** assignment&lt;br /&gt;
** merger&lt;br /&gt;
** change of name&lt;br /&gt;
** government interest&lt;br /&gt;
** agreement&lt;br /&gt;
** security agreement&lt;br /&gt;
** release&lt;br /&gt;
&lt;br /&gt;
	&lt;br /&gt;
Repeated, also include: &lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
* patent_number (varchar(255)) identifying number for the patent&lt;br /&gt;
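&lt;br /&gt;
Because num_conveyance and next_transaction depend on every transaction for a patent being known, they can only be filled in after the other tables are loaded. The following is a minimal sketch of that last step, assuming each rf_id covers a single patent and that ordering by execution_date is an acceptable stand-in for tracing the assignee/assignor chain. The sample rows and the function name build_conveyance are hypothetical.&lt;br /&gt;

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample rows: (rf_id, patent_number, execution_date).
transactions = [
    (300, "7000001", date(2012, 5, 1)),
    (100, "7000001", date(2009, 1, 15)),
    (200, "7000001", date(2010, 8, 30)),
    (400, "7000002", date(2011, 3, 3)),
]


def build_conveyance(rows):
    """Return {rf_id: (num_conveyance, next_transaction)} for every row."""
    by_patent = defaultdict(list)
    for rf_id, patent_number, execution_date in rows:
        by_patent[patent_number].append((execution_date, rf_id))
    result = {}
    for chain in by_patent.values():
        chain.sort()  # oldest transaction first
        for position, (_, rf_id) in enumerate(chain):
            is_last = position == len(chain) - 1
            # The last transaction in a chain has no successor.
            next_rf_id = None if is_last else chain[position + 1][1]
            result[rf_id] = (position + 1, next_rf_id)
    return result


conveyance = build_conveyance(transactions)
```

Here conveyance maps each rf_id to its position in the patent's chain and to the rf_id of the following transaction (None for the most recent one).&lt;br /&gt;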
&lt;br /&gt;
===DOCUMENT_INFO===&lt;br /&gt;
Purpose: store extra information relevant to the patents represented in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the documentation and the patent for a particular entry in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables) when created: &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* app_num (varchar(255)) application document USPTO number &lt;br /&gt;
* app_date (date) date that the application was filed&lt;br /&gt;
* app_country (varchar(255)) country in which it was filed&lt;br /&gt;
* pgpub_num (varchar(255)) pre-grant publication document USPTO number&lt;br /&gt;
* pgpub_date (date) date that pre-grant publication was released&lt;br /&gt;
* pgpub_country (varchar(255)) country in which the patent is published before being granted a patent number&lt;br /&gt;
* patent_number (varchar(255)) granted patent document USPTO number&lt;br /&gt;
* grant_date (date) date that the patent is officially published&lt;br /&gt;
* grant_country (varchar(255)) country in which the patent is published after being granted a patent number&lt;br /&gt;
* invention_title (varchar(255)) title of the invention&lt;br /&gt;
* language (varchar(255)) language of invention title – could be potentially useful and interesting to investigate &lt;br /&gt;
* reel_num (bigint) number of the reel the assignment was stored on&lt;br /&gt;
* frame_num (bigint) number of the frame on the reel the assignment was stored on&lt;br /&gt;
&lt;br /&gt;
===ASSIGNMENT===&lt;br /&gt;
Purpose: represent assignment transactions – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent assignment transaction&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignment_id (bigint) US patent assignment id &lt;br /&gt;
* correspondent_name (varchar(255)) name of correspondent for the transaction, could be a lawyer&lt;br /&gt;
* correspondent_address1 (varchar(255)) first line of address for correspondent&lt;br /&gt;
* correspondent_address2 (varchar(255)) second line of address for correspondent&lt;br /&gt;
* correspondent_address3 (varchar(255)) third line of address for correspondent&lt;br /&gt;
* correspondent_address4 (varchar(255)) fourth line of address for correspondent&lt;br /&gt;
* record_date (date) date recorded with USPTO&lt;br /&gt;
* last_update_date (date) date that information for this assignment was last updated with the USPTO &lt;br /&gt;
* page_cnt (bigint) page count of assignment record&lt;br /&gt;
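&lt;br /&gt;
As an illustration of how rf_id ties the tables together, here is a minimal sketch that creates ASSIGNMENT and ASSIGNOR in an in-memory SQLite database. SQLite and the sample values are assumptions made for the example (the project database may use a different engine), and only a subset of the ASSIGNMENT columns is shown.&lt;br /&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE assignment (
        rf_id BIGINT PRIMARY KEY,
        assignment_id BIGINT,
        correspondent_name VARCHAR(255),
        record_date DATE,
        last_update_date DATE,
        page_cnt BIGINT
    );
    CREATE TABLE assignor (
        rf_id BIGINT REFERENCES assignment (rf_id),
        assignor_name VARCHAR(255),
        PRIMARY KEY (rf_id, assignor_name)
    );
    """
)
# One assignment (hypothetical values) with two assignors.
conn.execute("INSERT INTO assignment (rf_id, page_cnt) VALUES (100, 3)")
conn.execute("INSERT INTO assignor VALUES (100, 'Inventor A')")
conn.execute("INSERT INTO assignor VALUES (100, 'Inventor B')")
assignor_count = conn.execute(
    "SELECT COUNT(*) FROM assignor WHERE rf_id = 100"
).fetchone()[0]
```

The composite key (rf_id, assignor_name) is what allows several assignors to share one rf_id, as described for ASSIGNOR above.&lt;br /&gt;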
&lt;br /&gt;
More extensive notes exist under :E/McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper&lt;br /&gt;
&lt;br /&gt;
==Patent Database Structure==&lt;br /&gt;
&lt;br /&gt;
Most of this design is based off what I investigated last semester and therefore is subject to change, since the design for the Assignment database is somewhat new. &lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table is patent_number, which uniquely identifies every patent in the PATENT table and is included in every table.&lt;br /&gt;
&lt;br /&gt;
Full list of tables: patent, assignee, citation, fee, claims, histpatent, lawyers, inventors&lt;br /&gt;
&lt;br /&gt;
Update: These tables, as of October 26, are being updated and changed as I determine what data can actually be retrieved from the XML files based on the xpaths that we have determined (you can see these xpaths on the &amp;quot;Equivalent XPath and APS Queries&amp;quot; project page).&lt;br /&gt;
&lt;br /&gt;
===PATENT===&lt;br /&gt;
Purpose: represent patents – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* patent_type (varchar(255)) the type of patent, like utility&lt;br /&gt;
* patent_kind (varchar(2)) the kind of patent, a letter and a number&lt;br /&gt;
* title (varchar(255)) the title of the patent&lt;br /&gt;
* grantdate (date) date that patent was officially granted including the year&lt;br /&gt;
* prioritydate (date) date used to establish novelty of an invention in regards to other inventions (source: http://www.bios.net/daisy/patentlens/2343.html)&lt;br /&gt;
* prioritycountry (varchar(255)) country where patent is first filed&lt;br /&gt;
* patent_country (varchar(255)) country in which this patent was published&lt;br /&gt;
* prioritypatentnumber (varchar(255)) placeholder patent number if the patent is published before it is fully approved - also referred to as priority claims number&lt;br /&gt;
* cpcsubclass (varchar(255)) cooperative patent classification subclass&lt;br /&gt;
* cpcsubgroup (varchar(255)) cooperative patent classification subgroup&lt;br /&gt;
* cpcmaingroup (varchar(255)) cooperative patent classification main group&lt;br /&gt;
* cpctotal (varchar(255)) concatenated version of cpcmaingroup, cpcsubgroup, cpcsubclass&lt;br /&gt;
* pctpatentnumber (varchar(255)) international patent number (according to Patent Cooperation Treaty) also known as PCT Document Number&lt;br /&gt;
* appnum (varchar(255)) application number, probably also the filing number - the format is a two digit series code followed by a six digit serial number (source: https://www.uspto.gov/patents-application-process/filing-online/info-application-number)&lt;br /&gt;
* appdate (date) date that the application was filed, including year - probably also the filing date&lt;br /&gt;
* ipcrsubclass (varchar(255)) International patent classification subclass&lt;br /&gt;
* ipcrmaingroup (varchar(255)) International patent classification main group&lt;br /&gt;
* ipcrsubgroup (varchar(255)) International patent classification subgroup&lt;br /&gt;
* ipcrtotal (varchar(255)) concatenated version of ipcrsubgroup, ipcrsubclass, ipcrmaingroup &lt;br /&gt;
* national_classification (varchar(255)) - probably uspc, which is United States Patent Classification&lt;br /&gt;
* national_classification_country (varchar(255)) - should be United States&lt;br /&gt;
* num_claims (int) number of claims (things that the inventor wishes to protect with this patent)&lt;br /&gt;
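&lt;br /&gt;
The appnum format described above (a two digit series code followed by a six digit serial number) can be checked and split with a small helper. This is only a sketch: the function name split_app_num is hypothetical, and stripping &amp;quot;/&amp;quot; and &amp;quot;,&amp;quot; assumes the number may arrive in its printed form.&lt;br /&gt;

```python
def split_app_num(app_num):
    """Split an eight-digit application number into (series, serial)."""
    # Assumption: separators from the printed form may be present.
    digits = app_num.replace("/", "").replace(",", "")
    if len(digits) != 8 or not digits.isdigit():
        raise ValueError("expected 2-digit series code plus 6-digit serial")
    return digits[:2], digits[2:]


series_code, serial_number = split_app_num("15/123,456")
```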
&lt;br /&gt;
Discrepancies: &lt;br /&gt;
These fields were previously part of the table, but we do not currently have xpaths for them:&lt;br /&gt;
* app_year - the year the application was filed&lt;br /&gt;
* grant_year - the year the patent was granted&lt;br /&gt;
* uspc - United States Patent Classification (might be the same as national_classification)&lt;br /&gt;
* uspcsub - United States Patent Classification subgroup&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: Represent the person the patent was currently assigned to at publication according to the patent information (the information in the Assignment database is more complete) &lt;br /&gt;
&lt;br /&gt;
Each entry represents: the assignee for a patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number, first_name, last_name (in case it is possible for there to be multiple assignees for an individual patent)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar &lt;br /&gt;
* first_name (varchar(255)) first name of assignee&lt;br /&gt;
* last_name (varchar(255)) last name of assignee&lt;br /&gt;
* address (varchar(255)) address of assignee&lt;br /&gt;
* postcode (varchar(255)) postcode of assignee&lt;br /&gt;
* orgname (varchar(255)) name of the organization owning the patent (if applicable - organization may go by multiple orgnames) &lt;br /&gt;
* city (varchar(255)) city of the assignee&lt;br /&gt;
* country (varchar(255)) country of the assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* residence (varchar(255)) mailing address of assignee if they receive mail at a different address than the one listed above&lt;br /&gt;
* size_of_firm (varchar(255)) size of the organization, large, small, or micro - relates to how much the fee they pay for the patent is&lt;br /&gt;
&lt;br /&gt;
===CITATION===&lt;br /&gt;
Purpose: Represent all the patents in PATENT that another patent in PATENT cites&lt;br /&gt;
&lt;br /&gt;
Each entry represents: one citation that a patent in PATENT makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number AND cited_patent_number (because a patent might cite multiple patents, so both variables are needed for the primary key)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* cited_patent_number (varchar(255)) unique identifier from the USPTO for the patent being cited&lt;br /&gt;
* description (varchar(255)) description of the citation&lt;br /&gt;
* citation_date (date) unclear what this date is supposed to represent from the DTD, but according to the paper that is the second link in the top section, it is the &amp;quot;date of cited document&amp;quot;&lt;br /&gt;
* citation_kind (varchar(255)) type of cited document&lt;br /&gt;
* citation_country (varchar(255)) country of origin for the cited patent &lt;br /&gt;
* citation_name (varchar(255)) appears to be the names of those to whom the cited patent is assigned&lt;br /&gt;
&lt;br /&gt;
Note: There are other fields available for this table, such as document number, but this seems like it might be a repeat of the document number for the citing patent&lt;br /&gt;
&lt;br /&gt;
===FEE===&lt;br /&gt;
Purpose: Represent information about the fees paid for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the fees paid on the patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar  &lt;br /&gt;
* four (boolean) payment of maintenance fee 4th year&lt;br /&gt;
* eight (boolean) payment of maintenance fee 8th year&lt;br /&gt;
* twelve (boolean) payment of maintenance fee 12th year&lt;br /&gt;
* fee_date (date) date that fee is paid on&lt;br /&gt;
* fee_code (varchar(255)) code for type of fee to pay&lt;br /&gt;
&lt;br /&gt;
===CLAIMS===&lt;br /&gt;
&lt;br /&gt;
Purpose: Represent information about the claims for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a claim that a patent makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number and claim_id&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* claim_id (varchar(255)) id number for the claim&lt;br /&gt;
* language (varchar(255)) language of claim&lt;br /&gt;
* claim_type (varchar(255)) type of claim&lt;br /&gt;
* status (varchar(255)) status of the claim&lt;br /&gt;
* claim_text (text) the text of the claim&lt;br /&gt;
&lt;br /&gt;
===HISTPATENT===&lt;br /&gt;
Purpose: Represent historical data about the patent pertaining to dates and publication&lt;br /&gt;
&lt;br /&gt;
Each entry represents: sequence of historical data about an individual patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* pubno (varchar(255)) patent publication number&lt;br /&gt;
* pubdate (date) date patent was issued&lt;br /&gt;
* dispdate (date) disposal date for the application&lt;br /&gt;
* distype (varchar(3)) application status of the patent - ABN for abandoned, ISS for issued, or PEN for pending&lt;br /&gt;
* exp_dt (date) expiration date of patent&lt;br /&gt;
* pta (int) length of patent term adjustment (which extends the amount of time the patent is allowed to be in force) in number of days&lt;br /&gt;
&lt;br /&gt;
===INVENTORS===&lt;br /&gt;
Purpose: Represent the inventors for each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about an inventor for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (and name, if there can be multiple inventors for a single patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identifies the inventor?&lt;br /&gt;
* name (varchar(255)) concatenated string of the first and last name of the inventor&lt;br /&gt;
* orgname (varchar(255)) name of the organization the inventor works for&lt;br /&gt;
* state (varchar(255)) - state in which the inventor resided&lt;br /&gt;
* address (varchar(255)) - address at which the inventor resided&lt;br /&gt;
* country (varchar(255)) country in which the inventor resided&lt;br /&gt;
* city (varchar(255)) city in which the inventor resided&lt;br /&gt;
&lt;br /&gt;
===LAWYERS===&lt;br /&gt;
Purpose: Represent the lawyers who worked on each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a lawyer for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (there doesn't appear to be more than one lawyer entry for each patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identifies the lawyer?&lt;br /&gt;
* org_name (varchar(255)) the name of the firm that did arbitration for the patent&lt;br /&gt;
&lt;br /&gt;
===Unique Attributes Tables===&lt;br /&gt;
&lt;br /&gt;
There are several attributes that have been identified as unique, applying to only one kind of patent. Adding all these fields to the main patent table would introduce a lot of NULL entries, because three out of four patent types will not have any particular attribute. Therefore, I have created three new tables to store these unique attributes - one for each type of patent with unique attributes. &lt;br /&gt;
&lt;br /&gt;
For each table, the primary key is the patent_number, which connects a row to a row in the main patent table. &lt;br /&gt;
&lt;br /&gt;
TODO: Figure out best way to generate these tables, finish description of attributes for each table.&lt;br /&gt;
&lt;br /&gt;
* Note that I did include some unique attributes for Utility patents as seen below in the &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot; section, but for now (11/3/2017) I'm not going to create a separate table for those fields. Most of them seem to be repeated in Reissue, and the other fields, while interesting, are perhaps not very useful.&lt;br /&gt;
&lt;br /&gt;
====PLANT====&lt;br /&gt;
=====Fields=====&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* latin_name (varchar(255)) latin name for the plant&lt;br /&gt;
* us_botanical_variety(varchar(255)) denotes what variety of plant it is - for example, a rose has several different varieties&lt;br /&gt;
&lt;br /&gt;
====REISSUE====&lt;br /&gt;
&lt;br /&gt;
In the following table, you'll notice lots of fields related to three different kinds of &amp;quot;documents&amp;quot; - a &amp;quot;parent document&amp;quot;, a &amp;quot;child document&amp;quot;, and a &amp;quot;parent grant document&amp;quot;. It is not immediately clear what these three documents represent for a reissue patent. After some research, I think I have determined what these &amp;quot;documents&amp;quot; are, but please know that I do not have definitive proof as there is little available information about these &amp;quot;documents&amp;quot; in regards to a reissue patent.&lt;br /&gt;
&lt;br /&gt;
It is possible that the &amp;quot;parent document&amp;quot; is the parent patent application - that is, the first patent application filed in regards to an invention. That would explain why reissue patents also have several fields related to a &amp;quot;child document&amp;quot; - a &amp;quot;child document&amp;quot; could be a child application, which is filed while a parent application is still pending. So in this case, a &amp;quot;child document&amp;quot; of a reissue patent would be an application regarding the same patent that was filed while the reissue application was still pending. A &amp;quot;child document&amp;quot; is either a continuation, divisional, or continuation-in-part application ([http://www.patenttrademarkblog.com/parent-and-child-patent-applications/ source]).&lt;br /&gt;
&lt;br /&gt;
Based on this logic, I think that the difference between a &amp;quot;parent document&amp;quot; and a &amp;quot;parent grant document&amp;quot; depends on whether the parent patent application has been granted. If the parent patent application is still pending (meaning the patent has not yet been granted) then I believe the reissue patent will store information about the parent patent under &amp;quot;parent document&amp;quot;. However, if the parent patent application has been granted, then the information will be stored under &amp;quot;parent grant document&amp;quot;. This seems like the most logical explanation, especially considering that the path to any field related to &amp;quot;parent grant document&amp;quot; contains &amp;quot;parent document&amp;quot;, as in the example below:&lt;br /&gt;
&lt;br /&gt;
 us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
&lt;br /&gt;
Therefore, a particular reissue patent will probably only have the fields filled for either the &amp;quot;parent document&amp;quot; or &amp;quot;parent grant document&amp;quot; if I am right about what they represent.&lt;br /&gt;
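&lt;br /&gt;
To show how the two paths behave when only one of them is populated, here is a minimal sketch using Python's xml.etree.ElementTree. The record is built by hand and is hypothetical (real grant files carry many more elements), and the helper add_path exists only for this example.&lt;br /&gt;

```python
import xml.etree.ElementTree as ET


def add_path(root, path, text):
    """Create the nested elements named in path under root and set the text."""
    node = root
    for tag in path.split("/"):
        child = node.find(tag)
        if child is None:
            child = ET.SubElement(node, tag)
        node = child
    node.text = text


RELATION = "us-bibliographic-data-grant/us-related-documents/reissue/relation/"
PARENT_DATE = RELATION + "parent-doc/document-id/date"
GRANT_DATE = RELATION + "parent-doc/parent-grant-document/document-id/date"

# Hypothetical reissue record whose parent application was already granted.
grant = ET.Element("us-patent-grant")
add_path(grant, GRANT_DATE, "20050607")

# Paths are relative to the us-patent-grant root element.
parent_doc_date = grant.findtext(PARENT_DATE)
parent_grant_doc_date = grant.findtext(GRANT_DATE)
```

In this record parent_doc_date comes back as None while parent_grant_doc_date holds the date, which is the pattern the REISSUE table would see if the interpretation above is correct.&lt;br /&gt;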
&lt;br /&gt;
=====Fields=====&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* parent_doc_status (varchar(255)) status of the parent application. Probably related to whether the patent application is pending or not&lt;br /&gt;
* parent_doc_number (int) probably application number for the parent document&lt;br /&gt;
* parent_doc_id (varchar(255)) it is unclear how this is different from parent document number. In the xpaths, the path will be something like ./parent-doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent document&amp;quot; is - or may refer to its purpose (i.e. reissue, continuation, continuation-in-part, etc.)&lt;br /&gt;
* parent_doc_country (varchar(255)) the country of origin of the parent document&lt;br /&gt;
* parent_doc_date (date) probably the date the parent document was filed&lt;br /&gt;
* parent_grant_doc_number (int) probably application number for the parent grant document&lt;br /&gt;
* parent_grant_doc_id (varchar(255)) it is unclear how this is different from parent grant document number. In the xpaths, the path will be something like ./parent-doc/parent-grant-document/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_grant_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent grant document&amp;quot; is, or may refer to its purpose (i.e. reissue, continuation, continuation-in-part, etc.)&lt;br /&gt;
* parent_grant_doc_country (varchar(255)) the country of origin of the parent grant document&lt;br /&gt;
* parent_grant_doc_date (date) probably the date the parent grant document was filed&lt;br /&gt;
* child_doc_number (varchar(255)) probably the application number of the child application&lt;br /&gt;
* child_doc_id (varchar(255)) it is unclear how this is different from child document number. In the xpaths, the path will be something like ./child-doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* child_doc_country (varchar(255)) country of origin of the child application&lt;br /&gt;
&lt;br /&gt;
====DESIGN====&lt;br /&gt;
=====Fields=====&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* length_of_grant (int) length of grant, most likely in years&lt;br /&gt;
* hague_registration_date (date) filing date of international patent application&lt;br /&gt;
* hague_filing_date (date) not necessarily the same as the filing date of the international patent application, this is the date that the International Bureau receives all necessary elements for the international patent application&lt;br /&gt;
* hague_registration_pub_date (date) date that the International Bureau publishes the international patent application&lt;br /&gt;
* hague_international_registration_number (varchar(255)) international registration number&lt;br /&gt;
* edition (varchar(255)) possibly the edition of the Classification Locarno which determined main_classification&lt;br /&gt;
* main_classification (varchar(255)) classification for what type of design the patent is for&lt;br /&gt;
&lt;br /&gt;
==Connecting Patent database and Assignment database==&lt;br /&gt;
The answer to connecting the Patent database to the Assignment database lies in using the information in the table DOCUMENT_INFO to connect each assignment in the Assignment database ASSIGNMENT table to a patent_number from the Patent database PATENT table. On further investigation into &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot;, the field they called &amp;quot;grant_doc_num&amp;quot; (patent_number above) in DOCUMENT_INFO is the patent number (the description of DOCUMENT_INFO above has been altered to reflect this).&lt;br /&gt;
&lt;br /&gt;
The paper mentions that there will be errors in the patent number stored in DOCUMENT_INFO under grant_doc_num, so a separate table called DOCUMENT_INFO_ADMIN was constructed to determine how prevalent errors were. They queried the patent number for the appno_doc_num from administrative data. The &amp;quot;appno_doc_num&amp;quot; is the Application Document USPTO number from DOCUMENT_INFO. On further investigation into the paper, it became clear that &amp;quot;administrative data&amp;quot; refers to internal USPTO data that was available to the authors of the paper, but that we might not have. However, in 99% of cases the grant_doc_num and the queried patent number based on the appno_doc_num were the same, so we can probably rely on grant_doc_num.&lt;br /&gt;
&lt;br /&gt;
Therefore, since every table in the Patent database will have the patent number stored, to connect any table in Assignment to the Patent database, it would first be joined with DOCUMENT_INFO on rf_id and then joined with the appropriate table in Patent on patent number.&lt;br /&gt;
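&lt;br /&gt;
The two-step join described above can be sketched as follows. SQLite, the reduced column lists, and the sample rows are assumptions made for the example. Here assignee is the Assignment database table, and patent stands in for any Patent database table that carries patent_number.&lt;br /&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE assignee (rf_id BIGINT, assignee_name VARCHAR(255));
    CREATE TABLE document_info (rf_id BIGINT, patent_number VARCHAR(255));
    CREATE TABLE patent (patent_number VARCHAR(255), title VARCHAR(255));
    INSERT INTO assignee VALUES (100, 'Example Corp');
    INSERT INTO document_info VALUES (100, '7000001');
    INSERT INTO patent VALUES ('7000001', 'Example widget');
    """
)
# Step 1: Assignment table to DOCUMENT_INFO on rf_id.
# Step 2: DOCUMENT_INFO to the Patent table on patent_number.
rows = conn.execute(
    """
    SELECT a.assignee_name, p.patent_number, p.title
    FROM assignee AS a
    JOIN document_info AS d ON d.rf_id = a.rf_id
    JOIN patent AS p ON p.patent_number = d.patent_number
    """
).fetchall()
```

An inner join silently drops assignments whose patent_number has no match in PATENT. If those rows matter (for example, to measure how often the recorded patent number is wrong), a LEFT JOIN on the second step would keep them.&lt;br /&gt;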
&lt;br /&gt;
Also in the paper, they mentioned the Assignments on the Web for Patents (AOTW-P), a searchable database of individual USPTO assignment records keyed on reel-frame identification, patent number, and assignor or assignee name (https://assignment.uspto.gov/patent/index.html#/patent/search). It would not be practical to use this tool by hand to query all the patent numbers, but if it were possible to write a script that queries each patent number using the rf_id and parses the response, this could be useful for checking the patent numbers. It might not be any more accurate than what will already be in DOCUMENT_INFO.&lt;br /&gt;
&lt;br /&gt;
==Finding New Fields Unique to Plant, Reissue, and/or Design Patents==&lt;br /&gt;
&lt;br /&gt;
Based on Oliver's script, which searched all xpaths and compared which were unique to particular types, we see that the following attributes are unique to each type of patent other than utility patents. These attributes vary by the XML version, which changed over time. Therefore, the lists below are a superset of the attributes that are unique to each of the patent types listed below across all XML versions.&lt;br /&gt;
&lt;br /&gt;
When I am done finding the supersets, we will determine how to integrate these attributes into the design for the patent table.&lt;br /&gt;
&lt;br /&gt;
'''There appears to be a fifth patent type in the results from Oliver's script, and there are attributes that are unique to Utility patents, though they do not appear particularly useful'''&lt;br /&gt;
&lt;br /&gt;
===Plant Patents===&lt;br /&gt;
* US Claim Statement (XML45) - unclear what this represents&lt;br /&gt;
* Latin Name (XML44)&lt;br /&gt;
* US Botanical Variety (XML44)&lt;br /&gt;
&lt;br /&gt;
===Reissue Patents===&lt;br /&gt;
* Parent Status (XML45)&lt;br /&gt;
* Parent Document (XML44) - seems to be either a normal document or a parent grant document (or both); both contain the following fields&lt;br /&gt;
** Date&lt;br /&gt;
** Document number&lt;br /&gt;
** Country &lt;br /&gt;
** Document ID&lt;br /&gt;
** Kind (XML42)&lt;br /&gt;
* Child Document (XML44)&lt;br /&gt;
** Document ID&lt;br /&gt;
** Document Number&lt;br /&gt;
** Country (XML43)&lt;br /&gt;
&lt;br /&gt;
Note about fields related to the parent document and/or parent grant document:&lt;br /&gt;
The structure of the above fields related to a reissue patent is very odd and varies by XML version. For example, for most of the XML versions, parent grant document falls under parent document and has its own document number, date, country, document ID, kind, etc. The parent doc field falls under relation, which falls under reissue. &lt;br /&gt;
&lt;br /&gt;
However, in XML41, parent document falls under relation and a new category - US Reexamination Reissue Merger. The information pertaining to the child document also falls under both relation and US Reexamination Reissue Merger. It's possible that during the time period this XML version represents, there were two types of reissue patents - a reexamination related to a merger and a standard reissue. With discrepancies like these, we'll have to determine in the end whether we want to stick with the fields as they are designed above (in which case the information about a parent doc or parent grant document is stored the same way whether it is a reissue or a Reexamination Reissue), or whether these two classifications are different enough to warrant duplicate fields in the table.&lt;br /&gt;
&lt;br /&gt;
===Design Patents===&lt;br /&gt;
* length of grant (XML45)&lt;br /&gt;
* Hague Agreement Data (XML45) - allows people to file design patents in 66 countries with one application&lt;br /&gt;
** International Registration Date&lt;br /&gt;
** International Filing Date&lt;br /&gt;
** International Registration Publication Date&lt;br /&gt;
** International Registration Number&lt;br /&gt;
* Classification Locarno (XML40)&lt;br /&gt;
** Edition&lt;br /&gt;
** Main Classification&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please note that the lists above are meant to represent unique types that should be added to the patent database. This does not mean that every XML version contains all these fields, or that every version contains the same path to these fields. That will have to be determined separately. The XML number next to some of the items represents the latest XML version where the field was seen.&lt;br /&gt;
&lt;br /&gt;
Though I was mainly looking at unique fields in plant, reissue, and design patents, I noticed the following fields on unique paths for utility patents in XML41:&lt;br /&gt;
&lt;br /&gt;
* US Related Documents&lt;br /&gt;
** Substitution&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Parent Document&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Country&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
***** Parent Status&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Country&lt;br /&gt;
** Continuation/Continuation in Part&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Doc&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
** Division&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Kind&lt;br /&gt;
&lt;br /&gt;
This appears to resemble the information about reissue patents. Additionally, a couple of fields appear under the information about bibliographic data that might be useful:&lt;br /&gt;
* US Deceased Inventor (XML41)&lt;br /&gt;
** Post Code&lt;br /&gt;
** Address&lt;br /&gt;
* US Provisional Application Status (XML41)&lt;br /&gt;
&lt;br /&gt;
==Paths for the New Fields Related to Plant, Reissue, and/or Design Patents==&lt;br /&gt;
&lt;br /&gt;
===Plant Patents===&lt;br /&gt;
XML 4.4, 4.3, 4.1, and 4.0&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-botanic&lt;br /&gt;
fields: latin-name, variety&lt;br /&gt;
&lt;br /&gt;
XML 4.5&lt;br /&gt;
parent node: us-patent-grant/us-claim-statement/&lt;br /&gt;
field: i&lt;br /&gt;
&lt;br /&gt;
XML 4.2 &lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-botanic&lt;br /&gt;
fields: latin-name, variety&lt;br /&gt;
&lt;br /&gt;
parent node: us-patent-grant/us-claim-statement/&lt;br /&gt;
field: i &lt;br /&gt;
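&lt;br /&gt;
Because the paths differ by XML version, one way to handle them is a per-version lookup table. The sketch below uses Python's standard xml.etree.ElementTree and is only an illustration, not Oliver's parser: the names PLANT_FIELD_PATHS, extract_plant_fields, and build_sample_grant are hypothetical, and the sample values are made up. The paths themselves are copied from the lists above and are relative to the us-patent-grant root.&lt;br /&gt;
&lt;br /&gt;
```python
import xml.etree.ElementTree as ET

# Paths from the notes above, relative to the us-patent-grant root element.
BOTANIC = [
    "us-bibliographic-data-grant/us-botanic/latin-name",
    "us-bibliographic-data-grant/us-botanic/variety",
]
CLAIM_STATEMENT = ["us-claim-statement/i"]

# Hypothetical lookup table: XML version to the plant-specific paths seen in it.
PLANT_FIELD_PATHS = {
    "4.0": BOTANIC,
    "4.1": BOTANIC,
    "4.2": BOTANIC + CLAIM_STATEMENT,
    "4.3": BOTANIC,
    "4.4": BOTANIC,
    "4.5": CLAIM_STATEMENT,
}


def extract_plant_fields(root, version):
    """Return the plant-specific fields found under root for one XML version."""
    fields = {}
    for path in PLANT_FIELD_PATHS[version]:
        node = root.find(path)
        if node is not None:
            # Key each value by the last path segment, e.g. latin-name.
            fields[path.rsplit("/", 1)[-1]] = node.text
    return fields


def build_sample_grant():
    """Build a tiny, made-up grant document with only the botanic fields."""
    root = ET.Element("us-patent-grant")
    bib = ET.SubElement(root, "us-bibliographic-data-grant")
    botanic = ET.SubElement(bib, "us-botanic")
    ET.SubElement(botanic, "latin-name").text = "Rosa hybrida"
    ET.SubElement(botanic, "variety").text = "Example Variety"
    return root


sample = build_sample_grant()
plant_fields = extract_plant_fields(sample, "4.4")
```
&lt;br /&gt;
Paths that are missing from a document are skipped rather than raising an error, so the same function can be run against files from any of the listed versions.&lt;br /&gt;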
&lt;br /&gt;
===Reissue Patents===&lt;br /&gt;
&lt;br /&gt;
XML 4.5&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/parent-doc/&lt;br /&gt;
fields: parent-status&lt;br /&gt;
&lt;br /&gt;
XML 4.4&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/continuing-reissue/relation/&lt;br /&gt;
fields: parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
	parent-doc/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/country&lt;br /&gt;
	child-doc/document-id/country&lt;br /&gt;
	child-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/doc-number&lt;br /&gt;
	parent-doc/document-id/date&lt;br /&gt;
(everything in XML 4.3 except parent-status)	&lt;br /&gt;
&lt;br /&gt;
XML 4.3&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/&lt;br /&gt;
fields: child-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
	parent-doc/parent-status&lt;br /&gt;
	parent-doc/document-id/date&lt;br /&gt;
	parent-doc/document-id/country&lt;br /&gt;
	parent-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/country&lt;br /&gt;
	child-doc/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/doc-number	&lt;br /&gt;
	&lt;br /&gt;
XML 4.1&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/&lt;br /&gt;
fields:&lt;br /&gt;
	child-doc/document-id/doc-number&lt;br /&gt;
	child-doc/document-id/country&lt;br /&gt;
	parent-doc/parent-status&lt;br /&gt;
	parent-doc/document-id/kind&lt;br /&gt;
	parent-doc/document-id/country&lt;br /&gt;
	parent-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/document-id/date&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/kind&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/doc-number&lt;br /&gt;
&lt;br /&gt;
other parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/us-reexamination-reissue-merger/relation/&lt;br /&gt;
fields: &lt;br /&gt;
	child-doc/document-id/doc-number&lt;br /&gt;
	child-doc/document-id/country&lt;br /&gt;
	parent-doc/document-id/date&lt;br /&gt;
	parent-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/kind&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/doc-number&lt;br /&gt;
&lt;br /&gt;
XML 4.0 and XML 4.2&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/&lt;br /&gt;
fields: parent-doc/document-id/kind&lt;br /&gt;
	parent-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/kind&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/country&lt;br /&gt;
	child-doc/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/doc-number&lt;br /&gt;
	child-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
	parent-doc/parent-status&lt;br /&gt;
	parent-doc/document-id/country&lt;br /&gt;
	parent-doc/document-id/date&lt;br /&gt;
&lt;br /&gt;
===Design Patents===&lt;br /&gt;
XML 4.5&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/&lt;br /&gt;
fields: hague-agreement-data/international-registration-date/date&lt;br /&gt;
	hague-agreement-data/international-registration-publication-date/date&lt;br /&gt;
	us-term-of-grant/length-of-grant&lt;br /&gt;
	hague-agreement-data/international-registration-number&lt;br /&gt;
	hague-agreement-data/international-filing-date/date&lt;br /&gt;
&lt;br /&gt;
XML 4.1, 4.3, and 4.4&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-term-of-grant/&lt;br /&gt;
fields: length-of-grant&lt;br /&gt;
&lt;br /&gt;
XML 4.0&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/&lt;br /&gt;
fields: us-term-of-grant/length-of-grant&lt;br /&gt;
	classification-locarno/edition&lt;br /&gt;
	classification-locarno/main-classification&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=File:Erdplus-diagram_(3).png&amp;diff=22344</id>
		<title>File:Erdplus-diagram (3).png</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=File:Erdplus-diagram_(3).png&amp;diff=22344"/>
		<updated>2017-12-08T20:13:57Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shelby_Bice_(Work_Log)&amp;diff=22343</id>
		<title>Shelby Bice (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shelby_Bice_(Work_Log)&amp;diff=22343"/>
		<updated>2017-12-08T19:04:36Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* Fall 2017 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;===Fall 2017===&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt; &lt;br /&gt;
&lt;br /&gt;
[[Shelby Bice]] [[Work Logs]] [[Shelby Bice (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
2017-12-08 1:00 pm - 2:45 pm - updated the ER Diagram on my project page to include the three tables for reissue, plant, and design patents respectively. Finished typing up the status of the project as I am leaving it with notes to Oliver and Ed&lt;br /&gt;
&lt;br /&gt;
2017-12-07 3:15 PM - 4:45 PM  (came in late due to finals) - finished debugging additions to Oliver's code for the tables that are related to design, reissue, and plant patents, added a troubleshooting section to Oliver's page with instructions on how to deal with issues importing the project. &lt;br /&gt;
&lt;br /&gt;
2017-12-04 2:45 pm - 4:00 pm - continued debugging and started typing up troubleshooting tips for the next person who alters the patent code&lt;br /&gt;
&lt;br /&gt;
2017-12-01 3:15 pm - 5:00 pm - ran code (and ran into errors), which I have been working on fixing. If I don't finish today, I'll continue doing so on Monday. I plan to write up some of the mistakes I made in adding the tables to the database and add them to Oliver's page, with his permission, so that people in the future who are not familiar with the code (like I wasn't) hopefully won't fall into the same pitfalls. The main things are 1) how to set up a Maven project (if IntelliJ doesn't automatically set it up for you when you open/import the project), and 2) how to set up the data source so you can run SQL scripts and actually load data into the database on the RDP.&lt;br /&gt;
&lt;br /&gt;
2017-11-30 1:55 pm - 3:55 pm - continued altering code. Wrote a table-creation script in SQL for the design, reissue, and plant patent tables, and went through the checklist on Oliver's Reproducible Patent Data page to make sure I had done everything needed to create these new tables. Will definitely run code tomorrow and will type up the exact process I went through to create new tables. &lt;br /&gt;
&lt;br /&gt;
2017-11-17 2:15 pm - 5:00 pm - continued altering code to include the special fields for plant, reissue, and design patents. Their models and the changes to XmlParser relevant to those models should be done. I will go through the rest of the code and check where else I need to make alterations next time I am in. &lt;br /&gt;
&lt;br /&gt;
2017-11-16 8:45 am - 10:30 am - continued altering code to include the special fields for plant, reissue, and design patents&lt;br /&gt;
&lt;br /&gt;
2017-11-14 8:30 - 10:30 am - met with Oliver to go over his code, began working on scripts and finding paths for the fields unique to design, reissue, and plant patents.&lt;br /&gt;
&lt;br /&gt;
2017-11-10: 2:00 - 5:00 pm - updated format of worklog, finished researching fields of the new design, utility, and reissue tables, and documenting what each table contains&lt;br /&gt;
&lt;br /&gt;
2017-11-03: 2:30 - 3:30 pm, 3:45 pm - 5:00 pm (I went to get a snack because I was starving!) - Finished up finding unique fields, added them to database design, and researched what the fields represented - again, this is taking a while since there is little to no documentation online about what fields on these XML files actually represent.&lt;br /&gt;
&lt;br /&gt;
2017-11-02: 2:15 pm - 4:00 pm - made a main data page and continued going through the xpaths Oliver found. Notified Michelle so she can continue adding patent pages&lt;br /&gt;
&lt;br /&gt;
2017-10-31: 9:00 am - 10:30 am - started trying to determine from Oliver's research what paths are unique to utility, reissue, and plant patents that we have not yet accounted for in our design of the table&lt;br /&gt;
&lt;br /&gt;
2017-10-27: 2:00 pm - 3:30 pm - Kept updating tables. For some of the new values I've been adding to tables (I didn't realize we had XPaths for some fields) I'm struggling to find out exactly what the field means (for example, the date on a citation - is this the date the citedpatent was granted? or is the date that the citingpatent cited the citedpatent?) so the process is slow going. I want to get the documentation right though so in the future someone can look up and see exactly what each field in the table represents and not have to guess&lt;br /&gt;
&lt;br /&gt;
2017-10-26: 2:00 pm - 4:00 pm - worked on updating the designs for the patent databases tables based on what was discussed on Friday (specifically adding inventors and lawyers tables, altering fields on patent tables) - will continue this on Friday&lt;br /&gt;
&lt;br /&gt;
2017-10-20: - 2:00 pm - 5:00 pm - met with Ed and Oliver about patent database design&lt;br /&gt;
&lt;br /&gt;
2017-10-19: 2:00 pm - 4:00 pm - finished up ER diagram and description of all tables for patent database, started reading papers concerning creating an inventor's database (looks like other research groups have merged the USPTO data with the Harvard Dataverse data in order to create an inventors table)&lt;br /&gt;
&lt;br /&gt;
2017-10-17: 8:45 am - 10:30 am - continued trying to solve twitter issues, worked on ER diagram, skimmed through a paper I found (linked at the top of the page for Redesign Assignment and Patent Database) to see how they cleaned the data, since I assume that will be the next step to research after finishing the ER diagrams&lt;br /&gt;
&lt;br /&gt;
2017-10-12: 2:30 pm - 5:00 pm - wrote a blog post on the Grace Hopper Celebration and attempted to solve the Twitter issue&lt;br /&gt;
&lt;br /&gt;
2017-10-03: 8:45 am - 10:30 am - finished adding descriptions to the fields for each table in the patent database, started work on an ER diagram&lt;br /&gt;
&lt;br /&gt;
2017-09-29: 2:00 pm - 5:00 pm - finished adding the design for the Patent database to the project page, added descriptions of the fields for each table to the project page including the datatype that I think the field will be when it's loaded into the database&lt;br /&gt;
&lt;br /&gt;
2017-09-28: 9:00 am - 10:30 am - continued going over and documenting last semester's patent database design and adding the details to the Redesign Assignment and Patent Database project page. Additionally, I began trying to determine how to match up the information in the Document_Info table in Assignment with a patent_id in the Patent table in Patent.&lt;br /&gt;
&lt;br /&gt;
Tomorrow I will finish adding the design for the Patent database to the project page, add descriptions of the fields for each table to the project page, and start working on ER diagrams for the two databases.&lt;br /&gt;
&lt;br /&gt;
Links for creating ER diagrams: https://erdplus.com/#/standalone or https://creately.com/app/?tempID=hqdgwjki1&amp;amp;login_type=demo#&lt;br /&gt;
&lt;br /&gt;
2017-09-26: 8:45 am - 10:00 am - continued working on design of Assignment database by checking my design against the work done last semester on the assignment data restructure to make sure I didn't miss anything major. Began going over my patent database design from last semester to tweak it. Will need to sync up with Joe Reilly to see if there are any new fields that we are pulling from the data. Additionally, I made a new project page called Redesign Assignment and Patent Database that encompasses the new design for the Assignment database and Patent database redesign and moved some of the notes from McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper to the project page.&lt;br /&gt;
&lt;br /&gt;
The main takeaway from looking over Patent Assignment Data Restructure is that, after assembling the table according to my design (which doesn't seem to have any contradictions with the Patent Assignment Data Restructure), there will be multiple steps for cleaning the data, specifically the fields relating to location and address in the assignment table. While the Patent Assignment Data Restructure mentions connecting to the Patent database, it is not clear from the page what field would be used to connect to the Patent database.&lt;br /&gt;
&lt;br /&gt;
2017-09-23: 2:00 pm - 4:00 pm - continued working on design of Assignment database and how it will connect to Patent database by writing out what will be in each table in Assignment and questions about different possible structures of tables that we will have to address before finalizing the design - the notes can be found under McNair/Projects/Redesigning Patent Database/New Patent Database Project as Notes on USPTO Assignment Data Paper. Questions are highlighted in yellow throughout the document[[Category:Work Log]]&lt;br /&gt;
&lt;br /&gt;
2017-09-22: 8:30 am - 10:30 am - continued looking at paper on USPTO assignment data and adding to the notes on what the design of that database should look like, specifically on what I need for different tables and what I don't know yet about the design. Had to set up connection to RDP again due to technical issues. &lt;br /&gt;
&lt;br /&gt;
2017-09-15: 2:00 pm - 5:00 pm - introduced to new patent database project, reviewed and took notes on USPTO Assignment data (notes can be found under McNair/Projects/Redesigning Patent Database/New Patent Database Project as Notes on USPTO Assignment Data Paper)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Spring 2017===&lt;br /&gt;
&lt;br /&gt;
2017-04-27:  1:00 pm - 3:00 pm - worked on documentation more, tried to figure out how to clean citation data&lt;br /&gt;
&lt;br /&gt;
2017-04-25:  10:00 am - 12:00 pm - worked on documentation, tried to determine how to clean up the USPTO Assignee Data&lt;br /&gt;
&lt;br /&gt;
2017-04-20:  10:00 am - 11:30 pm - wrote copy statements for copying data from RDP to database, continued working on documentation.&lt;br /&gt;
&lt;br /&gt;
2017-04-18:  9:45 am - 12:30 pm - Cleaned up documentation more, kept working through the process of parsing the data&lt;br /&gt;
&lt;br /&gt;
2017-04-13: 9:30 am - 12:00 pm - Continued working on trying to update patent data through 2016, specifically parsing the data, worked with Ed to update perl scripts&lt;br /&gt;
&lt;br /&gt;
2017-04-11: 9:15 am - 12:00 pm - Worked on trying to update patent data through 2016&lt;br /&gt;
&lt;br /&gt;
2017-04-06: 9:30 am - 11:30 pm - Kept looking through DTDs, kept updating documentation&lt;br /&gt;
&lt;br /&gt;
2017-04-04: 9:00 am - 11:45 pm - Worked on updating documentation, found documentation on pulling data/making tables and databases, started looking through DTDs to find extra fields to pull&lt;br /&gt;
&lt;br /&gt;
2017-03-23: 9:15 am - 12:00 pm - Narrowed down core tables and fields&lt;br /&gt;
&lt;br /&gt;
2017-03-22: 5:30 pm to 6:30 pm - Patent Data meeting&lt;br /&gt;
&lt;br /&gt;
2017-03-21: 9:30 am - 12:00 pm - Worked on determining &amp;quot;core&amp;quot; tables for new patent database&lt;br /&gt;
&lt;br /&gt;
2017-03-09: 9:00 am - 12:00 pm Finished first draft of spreadsheet describing the current schema (and possible changes) to the Patent database&lt;br /&gt;
&lt;br /&gt;
2017-03-07: 9:30 am - 12:00 pm Continued working on spreadsheet, added relevant page links to project page, took notes on what I want documentation to look like in the future&lt;br /&gt;
&lt;br /&gt;
2017-03-02: 9:15 am - 12:15 pm Started excel spreadsheet to document current schema design and improvements to be made, updated project pages&lt;br /&gt;
&lt;br /&gt;
2017-02-23: 9:30 am - 12:00 pm Reviewed Perl, read about database design, set up project page for redesigning database, started documenting process&lt;br /&gt;
&lt;br /&gt;
2017-02-21: 9:00 am - 12:00 pm Set up work page, reviewed SQL, researched designing database, continued going through wiki&lt;br /&gt;
&lt;br /&gt;
2017-02-16: 10:30 am - 12:00 pm Researched past work on databases, discussed project with Ed&lt;br /&gt;
&lt;br /&gt;
2017-02-14: 10:00 am - 12:00 pm Set up personal wiki, set up work log&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shelby_Bice_(Work_Log)&amp;diff=22332</id>
		<title>Shelby Bice (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shelby_Bice_(Work_Log)&amp;diff=22332"/>
		<updated>2017-12-07T22:33:55Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* Fall 2017 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;===Fall 2017===&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt; &lt;br /&gt;
&lt;br /&gt;
[[Shelby Bice]] [[Work Logs]] [[Shelby Bice (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
2017-12-07 3:15 PM - 4:45 PM  (came in late due to finals) - finished debugging additions to Oliver's code for the tables that are related to design, reissue, and plant patents, added a troubleshooting section to Oliver's page with instructions on how to deal with issues importing the project. &lt;br /&gt;
&lt;br /&gt;
2017-12-04 2:45 pm - 4:00 pm - continued debugging and started typing up troubleshooting tips for the next person who alters the patent code&lt;br /&gt;
&lt;br /&gt;
2017-12-01 3:15 pm - 5:00 pm - ran code (and ran into errors), which I have been working on fixing. If I don't finish today, I'll continue doing so on Monday. I plan to write up some of the mistakes I made in adding the tables to the database and add them to Oliver's page, with his permission, so that people in the future who are not familiar with the code (like I wasn't) hopefully won't fall into the same pitfalls. The main things are 1) how to set up a Maven project (if IntelliJ doesn't automatically set it up for you when you open/import the project), and 2) how to set up the data source so you can run SQL scripts and actually load data into the database on the RDP.&lt;br /&gt;
&lt;br /&gt;
2017-11-30 1:55 pm - 3:55 pm - continued altering code. Wrote a table-creation script in SQL for the design, reissue, and plant patent tables, and went through the checklist on Oliver's Reproducible Patent Data page to make sure I had done everything needed to create these new tables. Will definitely run code tomorrow and will type up the exact process I went through to create new tables. &lt;br /&gt;
&lt;br /&gt;
2017-11-17 2:15 pm - 5:00 pm - continued altering code to include the special fields for plant, reissue, and design patents. Their models and the changes to XmlParser relevant to those models should be done. I will go through the rest of the code and check where else I need to make alterations next time I am in. &lt;br /&gt;
&lt;br /&gt;
2017-11-16 8:45 am - 10:30 am - continued altering code to include the special fields for plant, reissue, and design patents&lt;br /&gt;
&lt;br /&gt;
2017-11-14 8:30 - 10:30 am - met with Oliver to go over his code, began working on scripts and finding paths for the fields unique to design, reissue, and plant patents.&lt;br /&gt;
&lt;br /&gt;
2017-11-10: 2:00 - 5:00 pm - updated format of worklog, finished researching fields of the new design, utility, and reissue tables, and documenting what each table contains&lt;br /&gt;
&lt;br /&gt;
2017-11-03: 2:30 - 3:30 pm, 3:45 pm - 5:00 pm (I went to get a snack because I was starving!) - Finished up finding unique fields, added them to database design, and researched what the fields represented - again, this is taking a while since there is little to no documentation online about what fields on these XML files actually represent.&lt;br /&gt;
&lt;br /&gt;
2017-11-02: 2:15 pm - 4:00 pm - made a main data page and continued going through the xpaths Oliver found. Notified Michelle so she can continue adding patent pages&lt;br /&gt;
&lt;br /&gt;
2017-10-31: 9:00 am - 10:30 am - started trying to determine from Oliver's research what paths are unique to utility, reissue, and plant patents that we have not yet accounted for in our design of the table&lt;br /&gt;
&lt;br /&gt;
2017-10-27: 2:00 pm - 3:30 pm - Kept updating tables. For some of the new values I've been adding to tables (I didn't realize we had XPaths for some fields) I'm struggling to find out exactly what the field means (for example, the date on a citation - is this the date the citedpatent was granted? or is the date that the citingpatent cited the citedpatent?) so the process is slow going. I want to get the documentation right though so in the future someone can look up and see exactly what each field in the table represents and not have to guess&lt;br /&gt;
&lt;br /&gt;
2017-10-26: 2:00 pm - 4:00 pm - worked on updating the designs for the patent databases tables based on what was discussed on Friday (specifically adding inventors and lawyers tables, altering fields on patent tables) - will continue this on Friday&lt;br /&gt;
&lt;br /&gt;
2017-10-20: - 2:00 pm - 5:00 pm - met with Ed and Oliver about patent database design&lt;br /&gt;
&lt;br /&gt;
2017-10-19: 2:00 pm - 4:00 pm - finished up ER diagram and description of all tables for patent database, started reading papers concerning creating an inventor's database (looks like other research groups have merged the USPTO data with the Harvard Dataverse data in order to create an inventors table)&lt;br /&gt;
&lt;br /&gt;
2017-10-17: 8:45 am - 10:30 am - continued trying to solve twitter issues, worked on ER diagram, skimmed through a paper I found (linked at the top of the page for Redesign Assignment and Patent Database) to see how they cleaned the data, since I assume that will be the next step to research after finishing the ER diagrams&lt;br /&gt;
&lt;br /&gt;
2017-10-12: 2:30 pm - 5:00 pm - wrote a blog post on the Grace Hopper Celebration and attempted to solve the Twitter issue&lt;br /&gt;
&lt;br /&gt;
2017-10-03: 8:45 am - 10:30 am - finished adding descriptions to the fields for each table in the patent database, started work on an ER diagram&lt;br /&gt;
&lt;br /&gt;
2017-09-29: 2:00 pm - 5:00 pm - finished adding the design for the Patent database to the project page, added descriptions of the fields for each table to the project page including the datatype that I think the field will be when it's loaded into the database&lt;br /&gt;
&lt;br /&gt;
2017-09-28: 9:00 am - 10:30 am - continued going over and documenting last semester's patent database design and adding the details to the Redesign Assignment and Patent Database project page. Additionally, I began trying to determine how to match up the information in the Document_Info table in Assignment with a patent_id in the Patent table in Patent.&lt;br /&gt;
&lt;br /&gt;
Tomorrow I will finish adding the design for the Patent database to the project page, add descriptions of the fields for each table to the project page, and start working on ER diagrams for the two databases.&lt;br /&gt;
&lt;br /&gt;
Links for creating ER diagrams: https://erdplus.com/#/standalone or https://creately.com/app/?tempID=hqdgwjki1&amp;amp;login_type=demo#&lt;br /&gt;
&lt;br /&gt;
2017-09-26: 8:45 am - 10:00 am - continued working on design of Assignment database by checking my design against the work done last semester on the assignment data restructure to make sure I didn't miss anything major. Began going over my patent database design from last semester to tweak it. Will need to sync up with Joe Reilly to see if there are any new fields that we are pulling from the data. Additionally, I made a new project page called Redesign Assignment and Patent Database that encompasses the new design for the Assignment database and Patent database redesign and moved some of the notes from McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper to the project page.&lt;br /&gt;
&lt;br /&gt;
The main takeaway from looking over Patent Assignment Data Restructure is that, after assembling the table according to my design (which doesn't seem to have any contradictions with the Patent Assignment Data Restructure), there will be multiple steps for cleaning the data, specifically the fields relating to location and address in the assignment table. While the Patent Assignment Data Restructure mentions connecting to the Patent database, it is not clear from the page what field would be used to connect to the Patent database.&lt;br /&gt;
&lt;br /&gt;
2017-09-23: 2:00 pm - 4:00 pm - continued working on design of Assignment database and how it will connect to Patent database by writing out what will be in each table in Assignment and questions about different possible structures of tables that we will have to address before finalizing the design - the notes can be found under McNair/Projects/Redesigning Patent Database/New Patent Database Project as Notes on USPTO Assignment Data Paper. Questions are highlighted in yellow throughout the document[[Category:Work Log]]&lt;br /&gt;
&lt;br /&gt;
2017-09-22: 8:30 am - 10:30 am - continued looking at paper on USPTO assignment data and adding to the notes on what the design of that database should look like, specifically on what I need for different tables and what I don't know yet about the design. Had to set up connection to RDP again due to technical issues. &lt;br /&gt;
&lt;br /&gt;
2017-09-15: 2:00 pm - 5:00 pm - introduced to new patent database project, reviewed and took notes on USPTO Assignment data (notes can be found under McNair/Projects/Redesigning Patent Database/New Patent Database Project as Notes on USPTO Assignment Data Paper)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Spring 2017===&lt;br /&gt;
&lt;br /&gt;
2017-04-27:  1:00 pm - 3:00 pm - worked on documentation more, tried to figure out how to clean citation data&lt;br /&gt;
&lt;br /&gt;
2017-04-25:  10:00 am - 12:00 pm - worked on documentation, tried to determine how to clean up the USPTO Assignee Data&lt;br /&gt;
&lt;br /&gt;
2017-04-20:  10:00 am - 11:30 pm - wrote copy statements for copying data from RDP to database, continued working on documentation.&lt;br /&gt;
&lt;br /&gt;
2017-04-18:  9:45 am - 12:30 pm - Cleaned up documentation more, kept working through the process of parsing the data&lt;br /&gt;
&lt;br /&gt;
2017-04-13: 9:30 am - 12:00 pm - Continued working on trying to update patent data through 2016, specifically parsing the data, worked with Ed to update perl scripts&lt;br /&gt;
&lt;br /&gt;
2017-04-11: 9:15 am - 12:00 pm - Worked on trying to update patent data through 2016&lt;br /&gt;
&lt;br /&gt;
2017-04-06: 9:30 am - 11:30 pm - Kept looking through DTDs, kept updating documentation&lt;br /&gt;
&lt;br /&gt;
2017-04-04: 9:00 am - 11:45 pm - Worked on updating documentation, found documentation on pulling data/making tables and databases, started looking through DTDs to find extra fields to pull&lt;br /&gt;
&lt;br /&gt;
2017-03-23: 9:15 am - 12:00 pm - Narrowed down core tables and fields&lt;br /&gt;
&lt;br /&gt;
2017-03-22: 5:30 pm to 6:30 pm - Patent Data meeting&lt;br /&gt;
&lt;br /&gt;
2017-03-21: 9:30 am - 12:00 pm - Worked on determining &amp;quot;core&amp;quot; tables for new patent database&lt;br /&gt;
&lt;br /&gt;
2017-03-09: 9:00 am - 12:00 pm Finished first draft of spreadsheet describing the current schema (and possible changes) to the Patent database&lt;br /&gt;
&lt;br /&gt;
2017-03-07: 9:30 am - 12:00 pm Continued working on spreadsheet, added relevant page links to project page, took notes on what I want documentation to look like in the future&lt;br /&gt;
&lt;br /&gt;
2017-03-02: 9:15 am - 12:15 pm Started excel spreadsheet to document current schema design and improvements to be made, updated project pages&lt;br /&gt;
&lt;br /&gt;
2017-02-23: 9:30 am - 12:00 pm Reviewed Perl, read about database design, set up project page for redesigning database, started documenting process&lt;br /&gt;
&lt;br /&gt;
2017-02-21: 9:00 am - 12:00 pm Set up work page, reviewed SQL, researched designing database, continued going through wiki&lt;br /&gt;
&lt;br /&gt;
2017-02-16: 10:30 am - 12:00 pm Researched past work on databases, discussed project with Ed&lt;br /&gt;
&lt;br /&gt;
2017-02-14: 10:00 am - 12:00 pm Set up personal wiki, set up work log&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shelby_Bice_(Work_Log)&amp;diff=22331</id>
		<title>Shelby Bice (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shelby_Bice_(Work_Log)&amp;diff=22331"/>
		<updated>2017-12-07T22:33:28Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* Fall 2017 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;===Fall 2017===&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt; &lt;br /&gt;
&lt;br /&gt;
[[Shelby Bice]] [[Work Logs]] [[Shelby Bice (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
2017-12-07 3:15 PM - 4:45 PM  (came in late due to finals) - finished debugging additions to Oliver's code for the tables that are related to design, reissue, and plant patents, added a troubleshooting section to Oliver's page with instructions on how to deal with issues importing the project. &lt;br /&gt;
&lt;br /&gt;
2017-12-04 2:45 pm - 4:00 pm - continued debugging and started typing up troubleshooting tips for the next person who alters the patent code&lt;br /&gt;
&lt;br /&gt;
2017-12-01 3:15 pm - 5:00 pm - ran code (and ran into errors), which I have been working on fixing. If I don't finish today, I'll continue doing so on Monday. I plan to write up some of the mistakes I made in adding the tables to the database and add them to Oliver's page, with his permission, so that people in the future who are not familiar with the code (like I wasn't) hopefully won't fall into the same pitfalls. The main things are 1) how to set up a Maven project (if IntelliJ doesn't automatically set it up for you when you open/import the project), and 2) how to set up the data source so you can run SQL scripts and actually load data into the database on the RDP.&lt;br /&gt;
&lt;br /&gt;
2017-11-30 1:55 pm - 3:55 pm - continued altering code. Wrote a table-creation script in SQL for the design, reissue, and plant patent tables, and went through the checklist on Oliver's Reproducible Patent Data page to make sure I had done everything needed to create these new tables. Will definitely run code tomorrow and will type up the exact process I went through to create new tables. &lt;br /&gt;
&lt;br /&gt;
2017-11-17 2:15 pm - 5:00 pm - continued altering code to include the special fields for plant, reissue, and design patents. Their models and the changes to XmlParser relevant to those models should be done. I will go through the rest of the code and check where else I need to make alterations next time I am in. &lt;br /&gt;
&lt;br /&gt;
2017-11-16 8:45 am - 10:30 am - continued altering code to include the special fields for plant, reissue, and design patents&lt;br /&gt;
&lt;br /&gt;
2017-11-14 8:30 - 10:30 am - met with Oliver to go over his code, began working on scripts and finding paths for the fields unique to design, reissue, and plant patents.&lt;br /&gt;
&lt;br /&gt;
2017-11-10: 2:00 - 5:00 pm - updated format of worklog, finished researching fields of the new design, utility, and reissue tables, and documenting what each table contains&lt;br /&gt;
&lt;br /&gt;
2017-11-03: 2:30 - 3:30 pm, 3:45 pm - 5:00 pm (I went to get a snack because I was starving!) - Finished up finding unique fields, added them to database design, and researched what the fields represented - again, this is taking a while since there is little to no documentation online about what fields on these XML files actually represent.&lt;br /&gt;
&lt;br /&gt;
2017-11-02: 2:15 pm - 4:00 pm - made a main data page and continued going through the xpaths Oliver found. Notified Michelle so she can continue adding patent pages&lt;br /&gt;
&lt;br /&gt;
2017-10-31: 9:00 am - 10:30 am - started trying to determine from Oliver's research what paths are unique to utility, reissue, and plant patents that we have not yet accounted for in our design of the table&lt;br /&gt;
&lt;br /&gt;
2017-10-27: 2:00 pm - 3:30 pm - Kept updating tables. For some of the new values I've been adding to tables (I didn't realize we had XPaths for some fields) I'm struggling to find out exactly what the field means (for example, the date on a citation - is this the date the citedpatent was granted? or is the date that the citingpatent cited the citedpatent?) so the process is slow going. I want to get the documentation right though so in the future someone can look up and see exactly what each field in the table represents and not have to guess&lt;br /&gt;
&lt;br /&gt;
2017-10-26: 2:00 pm - 4:00 pm - worked on updating the designs for the patent databases tables based on what was discussed on Friday (specifically adding inventors and lawyers tables, altering fields on patent tables) - will continue this on Friday&lt;br /&gt;
&lt;br /&gt;
2017-10-20: - 2:00 pm - 5:00 pm - met with Ed and Oliver about patent database design&lt;br /&gt;
&lt;br /&gt;
2017-10-19: 2:00 pm - 4:00 pm - finished up ER diagram and description of all tables for patent database, started reading papers concerning creating an inventor's database (looks like other research groups have merged the USPTO data with the Harvard Dataverse data in order to create an inventors table)&lt;br /&gt;
&lt;br /&gt;
2017-10-17: 8:45 am - 10:30 am - continued trying to solve twitter issues, worked on ER diagram, skimmed through a paper I found (linked at the top of the page for Redesign Assignment and Patent Database) to see how they cleaned the data, since I assume that will be the next step to research after finishing the ER diagrams&lt;br /&gt;
&lt;br /&gt;
2017-10-12: 2:30 pm - 5:00 pm - wrote blog post on Grace Hopper Celebration and attempted to solve Twitter issue&lt;br /&gt;
&lt;br /&gt;
2017-10-03: 8:45 am - 10:30 am - finished adding descriptions to the fields for each table in the patent database, started work on an ER diagram&lt;br /&gt;
&lt;br /&gt;
2017-09-29: 2:00 pm - 5:00 pm - finished adding the design for the Patent database to the project page, added descriptions of the fields for each table to the project page including the datatype that I think the field will be when it's loaded into the database&lt;br /&gt;
&lt;br /&gt;
2017-09-28: 9:00 am - 10:30 am - continued going over and documenting last semester's patent database design and adding the details to the Redesign Assignment and Patent Database project page. Additionally, I began trying to determine how to match the information in the Document_Info table in Assignment with a patent_id in the Patent table in Patent.&lt;br /&gt;
&lt;br /&gt;
Tomorrow I will finish adding the design for the Patent database to the project page, add descriptions of the fields for each table to the project page, and start working on ER diagrams for the two databases.&lt;br /&gt;
&lt;br /&gt;
Links for creating ER diagrams: https://erdplus.com/#/standalone or https://creately.com/app/?tempID=hqdgwjki1&amp;amp;login_type=demo#&lt;br /&gt;
&lt;br /&gt;
2017-09-26: 8:45 am - 10:00 am - continued working on design of Assignment database by checking my design against the work done last semester on the assignment data restructure to make sure I didn't miss anything major. Began going over my patent database design from last semester to tweak it. Will need to sync up with Joe Reilly to see if there are any new fields that we are pulling from the data. Additionally, I made a new project page called Redesign Assignment and Patent Database that encompasses the new design for the Assignment database and Patent database redesign and moved some of the notes from McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper to the project page.&lt;br /&gt;
&lt;br /&gt;
The main takeaway from looking over Patent Assignment Data Restructure is that, after assembling the table according to my design (which doesn't seem to have any contradictions with the Patent Assignment Data Restructure), there will be multiple steps for cleaning the data, specifically the fields relating to location and address in the assignment table. While the Patent Assignment Data Restructure mentions connecting to the Patent database, it is not clear from the page what field would be used to connect to the Patent database.&lt;br /&gt;
&lt;br /&gt;
2017-09-23: 2:00 pm - 4:00 pm - continued working on design of Assignment database and how it will connect to Patent database by writing out what will be in each table in Assignment and questions about different possible structures of tables that we will have to address before finalizing the design - the notes can be found under McNair/Projects/Redesigning Patent Database/New Patent Database Project as Notes on USPTO Assignment Data Paper. Questions are highlighted in yellow throughout the document[[Category:Work Log]]&lt;br /&gt;
&lt;br /&gt;
2017-09-22: 8:30 am - 10:30 am - continued looking at paper on USPTO assignment data and adding to the notes on what the design of that database should look like, specifically on what I need for different tables and what I don't know yet about the design. Had to set up connection to RDP again due to technical issues. &lt;br /&gt;
&lt;br /&gt;
2017-09-15: 2:00 pm - 5:00 pm - introduced to new patent database project, reviewed and took notes on USPTO Assignment data (notes can be found under McNair/Projects/Redesigning Patent Database/New Patent Database Project as Notes on USPTO Assignment Data Paper)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Spring 2017===&lt;br /&gt;
&lt;br /&gt;
2017-04-27:  1:00 pm - 3:00 pm - worked on documentation more, tried to figure out how to clean citation data&lt;br /&gt;
&lt;br /&gt;
2017-04-25:  10:00 am - 12:00 pm - worked on documentation, tried to determine how to clean up the USPTO Assignee Data&lt;br /&gt;
&lt;br /&gt;
2017-04-20:  10:00 am - 11:30 pm - wrote copy statements for copying data from RDP to database, continued working on documentation.&lt;br /&gt;
&lt;br /&gt;
2017-04-18:  9:45 am - 12:30 pm - Cleaned up documentation more, kept working through the process of parsing the data&lt;br /&gt;
&lt;br /&gt;
2017-04-13: 9:30 am - 12:00 pm - Continued working on trying to update patent data through 2016, specifically parsing the data, worked with Ed to update perl scripts&lt;br /&gt;
&lt;br /&gt;
2017-04-11: 9:15 am - 12:00 pm - Worked on trying to update patent data through 2016&lt;br /&gt;
&lt;br /&gt;
2017-04-06: 9:30 am - 11:30 pm - Kept looking through DTDs, kept updating documentation&lt;br /&gt;
&lt;br /&gt;
2017-04-04: 9:00 am - 11:45 pm - Worked on updating documentation, found documentation on pulling data/making tables and databases, started looking through DTDs to find extra fields to pull&lt;br /&gt;
&lt;br /&gt;
2017-03-23: 9:15 am - 12:00 pm - Narrowed down core tables and fields&lt;br /&gt;
&lt;br /&gt;
2017-03-22: 5:30 pm to 6:30 pm - Patent Data meeting&lt;br /&gt;
&lt;br /&gt;
2017-03-21: 9:30 am - 12:00 pm - Worked on determining &amp;quot;core&amp;quot; tables for new patent database&lt;br /&gt;
&lt;br /&gt;
2017-03-09: 9:00 am - 12:00 pm Finished first draft of spreadsheet describing the current schema (and possible changes) to the Patent database&lt;br /&gt;
&lt;br /&gt;
2017-03-07: 9:30 am - 12:00 pm Continued working on spreadsheet, added relevant page links to project page, took notes on what I want documentation to look like in the future&lt;br /&gt;
&lt;br /&gt;
2017-03-02: 9:15 am - 12:15 pm Started excel spreadsheet to document current schema design and improvements to be made, updated project pages&lt;br /&gt;
&lt;br /&gt;
2017-02-23: 9:30 am - 12:00 pm Reviewed Perl, read about database design, set up project page for redesigning database, started documenting process&lt;br /&gt;
&lt;br /&gt;
2017-02-21: 9:00 am - 12:00 pm Set up work page, reviewed SQL, researched designing database, continued going through wiki&lt;br /&gt;
&lt;br /&gt;
2017-02-16: 10:30 am - 12:00 pm Researched past work on databases, discussed project with Ed&lt;br /&gt;
&lt;br /&gt;
2017-02-14: 10:00 am - 12:00 pm Set up personal wiki, set up work log&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=22329</id>
		<title>Redesign Assignment and Patent Database</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=22329"/>
		<updated>2017-12-07T22:09:51Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Redesign Assignment and Patent Database&lt;br /&gt;
|Has owner=Shelby Bice,&lt;br /&gt;
|Has start date=9/2017&lt;br /&gt;
|Has keywords=patent&lt;br /&gt;
|Is dependent on=Reproducible Patent Data,&lt;br /&gt;
}}&lt;br /&gt;
'''Final Notes''' This is more or less the finalized design for the patent database. Oliver Chang's code (see Reproducible Patent Data, which is his project page) more or less fits the design, though there are some differences, including in variable names. In the future the code should be altered so that the variable names match and each table listed here exists in the database. The tables for the extra variables that exist for reissue, design, and plant patents have been added, and the instructions for adding tables can be found on Reproducible Patent Data. &lt;br /&gt;
&lt;br /&gt;
This is an extension of the work I did last semester under &amp;quot;Redesigning Patent Database&amp;quot;. Instead of simply reconfiguring the existing database, this project encompasses a full redesign and creation of a new Patent database and a new Assignment database that will be joined together.&lt;br /&gt;
&lt;br /&gt;
Reference for definitions of fields in the two databases: https://www.oecd.org/sti/sci-tech/37569498.pdf&lt;br /&gt;
&lt;br /&gt;
Paper on how to extract patent data from XML files, parse it, and put it into a database: https://funginstitute.berkeley.edu/wp-content/uploads/2014/06/patentprocessor.pdf&lt;br /&gt;
&lt;br /&gt;
'''Particularly useful, because it provides solutions for cleaning location and firm data'''&lt;br /&gt;
&lt;br /&gt;
==ER Diagram for Assignment and Patent Databases==&lt;br /&gt;
&lt;br /&gt;
Note: this is currently out of sync with the current description below for the patent database. It will be updated once I have finalized how we will organize unique attributes for plant, design, and reissue patents.&lt;br /&gt;
&lt;br /&gt;
[[File:PatentAndAssignmentER2.png]]&lt;br /&gt;
&lt;br /&gt;
Attributes for each table are listed below with descriptions - because of how many attributes there are, I decided the ER diagram would be better suited as an overview of the tables rather than trying to show all the attributes on the diagram.&lt;br /&gt;
&lt;br /&gt;
==Assignment Database Structure==&lt;br /&gt;
&lt;br /&gt;
After reading through the paper &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot; (https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf) and taking into account our needs for the Assignment database (the ability to connect the two databases together and the ability to trace conveyance of a patent over time), I decided on the following structure for the Assignment database.&lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table in the database is rf_id, which stands for reel frame id. It is unique for each entry in the main table, assignment, and is included in every table. &lt;br /&gt;
&lt;br /&gt;
Conveyance is the only table that cannot be fully constructed as data is inserted into the database. Some of its fields depend on other tables being constructed, so it will either be partially populated while the other tables are being populated or be the last table to be populated. &lt;br /&gt;
&lt;br /&gt;
Full list of tables: assignee, assignor, document_info, assignment, conveyance&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: store all the assignees from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the recipient of a patent in an entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id, unless it is possible to have multiple assignees, then rf_id and assignee name&lt;br /&gt;
&lt;br /&gt;
All columns (variables) in the table: &lt;br /&gt;
&lt;br /&gt;
* rf_id (bigint) reel frame number &lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* address_line_1 (varchar(255)) first line of address of assignee&lt;br /&gt;
* address_line_2 (varchar(255)) second line of address of assignee&lt;br /&gt;
* city (varchar(255)) city of assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* country (varchar(255)) country of assignee&lt;br /&gt;
* postal_code (varchar(255)) post code of the assignee&lt;br /&gt;
&lt;br /&gt;
===ASSIGNOR===&lt;br /&gt;
	&lt;br /&gt;
Purpose: store all the assignors from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the one assigning/granting the license for the entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id and assignor_name&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
&lt;br /&gt;
===CONVEYANCE===&lt;br /&gt;
Purpose: represent the conveyance of a particular patent over time&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual transaction for a patent (not unlike assignment, except this table has information about the assignment in relation to other assignments for the patent)&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id &lt;br /&gt;
&lt;br /&gt;
All columns (variables): Two possibilities, depending on whether we want data to be repeated or not.&lt;br /&gt;
	&lt;br /&gt;
Not-repeated (with exception of primary/foreign key): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* next_transaction (bigint) rf_id of next transaction that occurred after this transaction (traced by assignee/assignor)&lt;br /&gt;
* num_conveyance (int) number in order of conveyance by date&lt;br /&gt;
* acknowledgment_date (date) date that the assignor acknowledged the transaction&lt;br /&gt;
* execution_date (date) date that the transaction took place&lt;br /&gt;
* conveyance_type (varchar(255)) the USPTO paper determined a couple of different conveyance types and also outlined how to determine based on the conveyance text in the xml file which conveyance type is relevant to the transaction. The conveyance types are:&lt;br /&gt;
** assignment&lt;br /&gt;
** merger&lt;br /&gt;
** change of name&lt;br /&gt;
** government interest&lt;br /&gt;
** agreement&lt;br /&gt;
** security agreement&lt;br /&gt;
** release&lt;br /&gt;
&lt;br /&gt;
	&lt;br /&gt;
Repeated, also include: &lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
* patent_number (varchar(255)) identifying number for the patent&lt;br /&gt;
&lt;br /&gt;
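A minimal Python sketch of how num_conveyance and next_transaction could be derived once the other tables are loaded. It assumes each transaction covers a single patent and orders transactions by execution_date, which is a simplification of the assignee/assignor tracing mentioned above. The function name order_conveyances and the sample values are hypothetical.&lt;br /&gt;
&lt;br /&gt;

```python
from datetime import date


def order_conveyances(transactions):
    """Return one row per transaction with num_conveyance and next_transaction.

    transactions: iterable of (rf_id, patent_number, execution_date) tuples.
    Assumption: ordering by execution_date stands in for tracing the chain
    of assignees and assignors.
    """
    by_patent = {}
    for rf_id, patent_number, execution_date in transactions:
        by_patent.setdefault(patent_number, []).append((execution_date, rf_id))

    rows = []
    for patent_number, items in by_patent.items():
        items.sort()  # oldest execution_date first; ties broken by rf_id
        for position, (execution_date, rf_id) in enumerate(items):
            is_last = position + 1 == len(items)
            rows.append({
                "rf_id": rf_id,
                "patent_number": patent_number,
                "num_conveyance": position + 1,
                "next_transaction": None if is_last else items[position + 1][1],
            })
    return rows


# Hypothetical sample data: two transactions for one patent, one for another.
sample = [
    (300, "7000001", date(2012, 5, 1)),
    (100, "7000001", date(2010, 1, 15)),
    (200, "7000002", date(2011, 3, 9)),
]
for row in order_conveyances(sample):
    print(row)
```

&lt;br /&gt;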
===DOCUMENT_INFO===&lt;br /&gt;
Purpose: store extra information relevant to the patents represented in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the documentation and the patent for a particular entry in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables) when created: &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* app_num (varchar(255)) application document USPTO number &lt;br /&gt;
* app_date (date) date that the application was filed&lt;br /&gt;
* app_country (varchar(255)) country in which it was filed&lt;br /&gt;
* pgpub_num (varchar(255)) pre-grant publication document USPTO number&lt;br /&gt;
* pgpub_date (date) date that pre-grant publication was released&lt;br /&gt;
* pgpub_country (varchar(255)) country in which the patent is published before being granted a patent number&lt;br /&gt;
* patent_number (varchar(255)) granted patent document USPTO number&lt;br /&gt;
* grant_date (date) date that the patent is officially published&lt;br /&gt;
* grant_country (varchar(255)) country in which the patent is published after being granted a patent number&lt;br /&gt;
* invention_title (varchar(255)) title of the invention&lt;br /&gt;
* language (varchar(255)) language of invention title – could be potentially useful and interesting to investigate &lt;br /&gt;
* reel_num (bigint) number of the reel the assignment was stored on&lt;br /&gt;
* frame_num (bigint) number of the frame on the reel the assignment was stored on&lt;br /&gt;
&lt;br /&gt;
===ASSIGNMENT===&lt;br /&gt;
Purpose: represent assignment transactions – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent assignment transaction&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignment_id (bigint) US patent assignment id &lt;br /&gt;
* correspondent_name (varchar(255)) name of correspondent for the transaction, could be a lawyer&lt;br /&gt;
* correspondent_address1 (varchar(255)) first line of address for correspondent&lt;br /&gt;
* correspondent_address2 (varchar(255)) second line of address for correspondent&lt;br /&gt;
* correspondent_address3 (varchar(255)) third line of address for correspondent&lt;br /&gt;
* correspondent_address4 (varchar(255)) fourth line of address for correspondent&lt;br /&gt;
* record_date (date) date recorded with USPTO&lt;br /&gt;
* last_update_date (date) date that information for this assignment was last updated with the USPTO &lt;br /&gt;
* page_cnt (bigint) page count of assignment record&lt;br /&gt;
&lt;br /&gt;
More extensive notes exist under :E/McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper&lt;br /&gt;
&lt;br /&gt;
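A minimal sketch of how rf_id ties the tables together, using Python's built-in sqlite3 module as a stand-in for the real database. Only two of the tables and a few of their columns are shown, the SQLite column types replace the bigint/varchar types listed above, and the helper names build_assignment_sketch and assignors_for are hypothetical.&lt;br /&gt;
&lt;br /&gt;

```python
import sqlite3


def build_assignment_sketch():
    """Create an in-memory database with cut-down assignment and assignor tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE assignment (
            rf_id INTEGER PRIMARY KEY,
            record_date TEXT
        );
        CREATE TABLE assignor (
            rf_id INTEGER NOT NULL REFERENCES assignment (rf_id),
            assignor_name TEXT NOT NULL,
            PRIMARY KEY (rf_id, assignor_name)
        );
    """)
    return conn


def assignors_for(conn, rf_id):
    """Return the assignor names recorded for one assignment, sorted by name."""
    cursor = conn.execute(
        "SELECT assignor_name FROM assignor WHERE rf_id = ? ORDER BY assignor_name",
        (rf_id,),
    )
    return [row[0] for row in cursor]
```

The composite primary key on assignor lets one assignment have several assignors while still rejecting an exact duplicate row.&lt;br /&gt;
&lt;br /&gt;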
==Patent Database Structure==&lt;br /&gt;
&lt;br /&gt;
Most of this design is based off what I investigated last semester and therefore is subject to change, since the design for the Assignment database is somewhat new. &lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table is patent_number, which uniquely identifies every patent in the PATENT table and is included in every table.&lt;br /&gt;
&lt;br /&gt;
Full list of tables: patent, assignee, citation, fee, claims, histpatent, lawyers, inventors&lt;br /&gt;
&lt;br /&gt;
Update: These tables, as of October 26, are being updated and changed as I determine what data can actually be retrieved from the XML files based on the xpaths that we have determined (you can see these xpaths on the &amp;quot;Equivalent XPath and APS Queries&amp;quot; project page).&lt;br /&gt;
&lt;br /&gt;
===PATENT===&lt;br /&gt;
Purpose: represent patents – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* patent_type (varchar(255)) the type of patent, like utility&lt;br /&gt;
* patent_kind (varchar(2)) the kind of patent, a letter and a number&lt;br /&gt;
* title (varchar(255)) the title of the patent&lt;br /&gt;
* grantdate (date) date that patent was officially granted including the year&lt;br /&gt;
* prioritydate (date) date used to establish novelty of an invention in regards to other inventions (source: http://www.bios.net/daisy/patentlens/2343.html)&lt;br /&gt;
* prioritycountry (varchar(255)) country where patent is first filed&lt;br /&gt;
* patent_country (varchar(255)) country in which this patent was published&lt;br /&gt;
* prioritypatentnumber (varchar(255)) placeholder patent number if the patent is published before it is fully approved - also referred to as priority claims number&lt;br /&gt;
* cpcsubclass (varchar(255)) cooperative patent classification subclass&lt;br /&gt;
* cpcsubgroup (varchar(255)) cooperative patent classification subgroup&lt;br /&gt;
* cpcmaingroup (varchar(255)) cooperative patent classification main group&lt;br /&gt;
* cpctotal (varchar(255)) concatenated version of cpcmaingroup, cpcsubgroup, cpcsubclass&lt;br /&gt;
* pctpatentnumber (varchar(255)) international patent number (according to Patent Cooperation Treaty) also known as PCT Document Number&lt;br /&gt;
* appnum (varchar(255)) application number, probably also the filing number - the format is a two digit series code followed by a six digit serial number (source: https://www.uspto.gov/patents-application-process/filing-online/info-application-number)&lt;br /&gt;
* appdate (date) date that the application was filed, including year - probably also the filing date&lt;br /&gt;
* ipcrsubclass (varchar(255)) International patent classification subclass&lt;br /&gt;
* ipcrmaingroup (varchar(255)) International patent classification main group&lt;br /&gt;
* ipcrsubgroup (varchar(255)) International patent classification subgroup&lt;br /&gt;
* ipcrtotal (varchar(255)) concatenated version of ipcrsubgroup, ipcrsubclass, ipcrmaingroup &lt;br /&gt;
* national_classification (varchar(255)) - probably uspc, which is United States Patent Classification&lt;br /&gt;
* national_classification_country (varchar(255)) - should be United States&lt;br /&gt;
* num_claims (int) number of claims (things that the inventor wishes to protect with this patent)&lt;br /&gt;
&lt;br /&gt;
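A small Python sketch of checking the appnum format noted in the column list above (a two digit series code followed by a six digit serial number). It assumes punctuation such as a slash or comma may appear in the stored value and can be ignored. The function name split_appnum is hypothetical.&lt;br /&gt;
&lt;br /&gt;

```python
import re


def split_appnum(appnum):
    """Split an application number into (series_code, serial_number).

    Returns None when the value does not contain exactly eight digits.
    """
    digits = re.sub(r"\D", "", appnum)  # drop slashes, commas, and spaces
    if len(digits) != 8:
        return None
    return digits[:2], digits[2:]


print(split_appnum("15/123,456"))
print(split_appnum("12345"))
```

&lt;br /&gt;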
Discrepancies: &lt;br /&gt;
These fields were previously a part of the table, but we do not currently have xpaths for them:&lt;br /&gt;
* app_year - the year the application was filed&lt;br /&gt;
* grant_year - the year the patent was granted&lt;br /&gt;
* uspc - United States Patent Classification (might be the same as national_classification)&lt;br /&gt;
* uspcsub - United States Patent subgroup&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: Represent the person to whom the patent was assigned at publication according to the patent information (the information in the Assignment database is more complete) &lt;br /&gt;
&lt;br /&gt;
Each entry represents: the assignee for a patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number, first_name, last_name (in case it is possible for there to be multiple assignees for an individual patent)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar &lt;br /&gt;
* first_name (varchar(255)) first name of assignee&lt;br /&gt;
* last_name (varchar(255)) last name of assignee&lt;br /&gt;
* address (varchar(255)) address of assignee&lt;br /&gt;
* postcode (varchar(255)) postcode of assignee&lt;br /&gt;
* orgname (varchar(255)) name of the organization owning the patent (if applicable - organization may go by multiple orgnames) &lt;br /&gt;
* city (varchar(255)) city of the assignee&lt;br /&gt;
* country (varchar(255)) country of the assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* residence (varchar(255)) mailing address of assignee if they receive mail at a different address than the one listed above&lt;br /&gt;
* size_of_firm (varchar(255)) size of the organization (large, small, or micro) - determines the size of the fee they pay for the patent&lt;br /&gt;
&lt;br /&gt;
===CITATION===&lt;br /&gt;
Purpose: Represent all the patents in PATENT that another patent in PATENT cites&lt;br /&gt;
&lt;br /&gt;
Each entry represents: one citation that a patent in PATENT makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number AND cited_patent_number (because a patent might cite multiple patents, so both variables are needed for the primary key)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* cited_patent_number (varchar(255)) unique identifier from the USPTO for the patent being cited&lt;br /&gt;
* description (varchar(255)) description of the citation&lt;br /&gt;
* citation_date (date) unclear what this date is supposed to represent from the DTD, but according to the paper that is the second link in the top section, it is the &amp;quot;date of cited document&amp;quot;&lt;br /&gt;
* citation_kind (varchar(255)) type of cited document&lt;br /&gt;
* citation_country (varchar(255)) country of origin for the cited patent &lt;br /&gt;
* citation_name (varchar(255)) appears to be the names of those to whom the cited patent is assigned&lt;br /&gt;
&lt;br /&gt;
Note: There are other fields available for this table, such as document number, but this seems like it might be a repeat of the document number for the citing patent&lt;br /&gt;
&lt;br /&gt;
===FEE===&lt;br /&gt;
Purpose: Represent information about the fees paid for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the fees paid on the patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar  &lt;br /&gt;
* four (boolean) payment of maintenance fee 4th year&lt;br /&gt;
* eight (boolean) payment of maintenance fee 8th year&lt;br /&gt;
* twelve (boolean) payment of maintenance fee 12th year&lt;br /&gt;
* fee_date (date) date on which the fee was paid&lt;br /&gt;
* fee_code (varchar(255)) code identifying the type of fee&lt;br /&gt;
&lt;br /&gt;
===CLAIMS===&lt;br /&gt;
&lt;br /&gt;
Purpose: Represent information about the claims for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a claim that a patent makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number and claim_id&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* claim_id (varchar(255)) id number for the claim&lt;br /&gt;
* language (varchar(255)) language of claim&lt;br /&gt;
* claim_type (varchar(255)) type of claim&lt;br /&gt;
* status (varchar(255)) status of the claim&lt;br /&gt;
* claim_text (text) the text of the claim&lt;br /&gt;
&lt;br /&gt;
===HISTPATENT===&lt;br /&gt;
Purpose: Represent historical data about the patent pertaining to dates and publication&lt;br /&gt;
&lt;br /&gt;
Each entry represents: sequence of historical data about an individual patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* pubno (varchar(255)) patent publication number&lt;br /&gt;
* pubdate (date) date patent was issued&lt;br /&gt;
* dispdate (date) disposal date for the application&lt;br /&gt;
* distype (varchar(3)) application status of the patent - ABN for abandoned, ISS for issued, or PEN for pending&lt;br /&gt;
* exp_dt (date) expiration date of patent&lt;br /&gt;
* pta (int) length of patent term adjustment (which extends the amount of time the patent is allowed to be in force) in number of days&lt;br /&gt;
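&lt;br /&gt;
Because distype only takes three values, one option is to enforce them with a CHECK constraint. The sketch below is an assumption about how that could look, not a description of the existing schema scripts; it uses in-memory SQLite in place of PostgreSQL, a reduced column list, and made-up patent numbers.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

# In-memory SQLite stands in for the project's PostgreSQL database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE histpatent (
        patent_number VARCHAR(255) PRIMARY KEY,
        distype       VARCHAR(3) CHECK (distype IN ('ABN', 'ISS', 'PEN')),
        pta           INTEGER
    )
    """
)


def insert_status(patent_number, distype):
    """Insert a row; return False when the status code violates the constraint."""
    try:
        conn.execute(
            "INSERT INTO histpatent (patent_number, distype) VALUES (?, ?)",
            (patent_number, distype),
        )
        return True
    except sqlite3.IntegrityError:
        return False


accepted = insert_status("1000001", "ISS")
rejected = insert_status("1000002", "XYZ")  # not one of the three known codes
```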
&lt;br /&gt;
===INVENTORS===&lt;br /&gt;
Purpose: Represent the inventors for each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about an inventor for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (and name, if there can be multiple inventors for a single patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identify the inventor?&lt;br /&gt;
* name (varchar(255)) concatenated string of the first and last name of the inventor&lt;br /&gt;
* organme (varchar(255)) name of the organization the inventor works for&lt;br /&gt;
* state (varchar(255)) state in which the inventor resided&lt;br /&gt;
* address (varchar(255)) address at which the inventor resided&lt;br /&gt;
* country (varchar(255)) country in which the inventor resided&lt;br /&gt;
* city (varchar(255)) city in which the inventor resided&lt;br /&gt;
&lt;br /&gt;
===LAWYERS===&lt;br /&gt;
Purpose: Represent the lawyers who worked on each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a lawyer for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (there doesn't appear to be more than one lawyer entry for each patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identify the lawyer?&lt;br /&gt;
* org_name (varchar(255)) the name of the firm that did arbitration for the patent&lt;br /&gt;
&lt;br /&gt;
===Unique Attributes Tables===&lt;br /&gt;
&lt;br /&gt;
There are several attributes that apply to only one kind of patent. Adding all of these fields to the main patent table would introduce a lot of NULL entries, because any given attribute would be empty for three of the four patent types. Instead, I have created three new tables to store these unique attributes - one for each type of patent that has them.&lt;br /&gt;
&lt;br /&gt;
For each table, the primary key is the patent_number, which connects a row to a row in the main patent table. &lt;br /&gt;
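&lt;br /&gt;
A minimal sketch of this layout, using the PLANT table as the example. In-memory SQLite stands in for PostgreSQL, the patent_type column and all row values are hypothetical, and the PATENT table is reduced to its key.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

# In-memory SQLite stands in for the project's PostgreSQL database (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE patent (
        patent_number VARCHAR(255) PRIMARY KEY,
        patent_type   VARCHAR(255)   -- hypothetical column, for illustration
    );
    CREATE TABLE plant (
        patent_number        VARCHAR(255) PRIMARY KEY REFERENCES patent (patent_number),
        latin_name           VARCHAR(255),
        us_botanical_variety VARCHAR(255)
    );
    INSERT INTO patent VALUES ('PP00001', 'plant'), ('1000001', 'utility');
    INSERT INTO plant VALUES ('PP00001', 'Rosa hybrida', 'Example variety');
    """
)

# Plant attributes are joined in only when needed; other patents simply have no row.
rows = conn.execute(
    """
    SELECT p.patent_number, pl.latin_name
    FROM patent AS p
    LEFT JOIN plant AS pl ON pl.patent_number = p.patent_number
    ORDER BY p.patent_number
    """
).fetchall()
plant_row_count = conn.execute("SELECT COUNT(*) FROM plant").fetchone()[0]
```
&lt;br /&gt;
The utility patent occupies no space in plant at all; the NULL for its latin_name appears only in the result of the LEFT JOIN.&lt;br /&gt;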
&lt;br /&gt;
TODO: Figure out best way to generate these tables, finish description of attributes for each table.&lt;br /&gt;
&lt;br /&gt;
* Note that I did include some unique attributes for Utility patents as seen below in the &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot; section, but for now (11/3/2017) I'm not going to create a separate table for those fields. Most of them seem to be repeated in Reissue, and the other fields, while interesting, are perhaps not very useful.&lt;br /&gt;
&lt;br /&gt;
====PLANT====&lt;br /&gt;
=====Fields=====&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* latin_name (varchar(255)) Latin name for the plant&lt;br /&gt;
* us_botanical_variety (varchar(255)) denotes what variety of plant it is - for example, a rose has several different varieties&lt;br /&gt;
&lt;br /&gt;
====REISSUE====&lt;br /&gt;
&lt;br /&gt;
In the following table, you'll notice lots of fields related to three different kinds of &amp;quot;documents&amp;quot; - a &amp;quot;parent document&amp;quot;, a &amp;quot;child document&amp;quot;, and a &amp;quot;parent grant document&amp;quot;. It is not immediately clear what these three documents represent for a reissue patent. After some research, I think I have determined what these &amp;quot;documents&amp;quot; are, but please know that I do not have definitive proof, as there is little available information about these &amp;quot;documents&amp;quot; with regard to a reissue patent.&lt;br /&gt;
&lt;br /&gt;
It is possible that the &amp;quot;parent document&amp;quot; is the parent patent application - that is, the first patent application filed for an invention. That would explain why reissue patents also have several fields related to a &amp;quot;child document&amp;quot; - a &amp;quot;child document&amp;quot; could be a child application, which is filed while a parent application is still pending. So in this case, a &amp;quot;child document&amp;quot; of a reissue patent would be an application regarding the same patent that was filed while the reissue application was still pending. A &amp;quot;child document&amp;quot; is either a continuation, divisional, or continuation-in-part application ([http://www.patenttrademarkblog.com/parent-and-child-patent-applications/ source]).&lt;br /&gt;
&lt;br /&gt;
Based on this logic, I think that the difference between a &amp;quot;parent document&amp;quot; and a &amp;quot;parent grant document&amp;quot; depends on whether the parent patent application has been granted. If the parent patent application is still pending (meaning the patent has not yet been granted), then I believe the reissue patent will store information about the parent patent under &amp;quot;parent document&amp;quot;. However, if the parent patent application has been granted, then the information will be stored under &amp;quot;parent grant document&amp;quot;. This seems like the most logical explanation, especially considering that the path to any field related to &amp;quot;parent grant document&amp;quot; contains &amp;quot;parent document&amp;quot;, as in the example below:&lt;br /&gt;
&lt;br /&gt;
 us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
&lt;br /&gt;
Therefore, a particular reissue patent will probably only have the fields filled for either the &amp;quot;parent document&amp;quot; or &amp;quot;parent grant document&amp;quot; if I am right about what they represent.&lt;br /&gt;
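&lt;br /&gt;
To make the nesting concrete, here is a small sketch that reads the two kinds of parent fields separately with Python's xml.etree.ElementTree. The record is invented and built in code; only the element names come from the path quoted above, and the helper names are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
import xml.etree.ElementTree as ET

# Path from the us-patent-grant root down to the relation node, as quoted above.
RELATION = "us-bibliographic-data-grant/us-related-documents/reissue/relation"


def build_path(root, path, text=None):
    """Create (or reuse) the chain of child elements named in path, then set the leaf text."""
    node = root
    for tag in path.split("/"):
        child = node.find(tag)
        if child is None:
            child = ET.SubElement(node, tag)
        node = child
    node.text = text
    return node


# Invented reissue record: numbers and date are placeholders.
grant = ET.Element("us-patent-grant")
build_path(grant, RELATION + "/parent-doc/document-id/doc-number", "09123456")
build_path(grant, RELATION + "/parent-doc/parent-grant-document/document-id/doc-number", "6000001")
build_path(grant, RELATION + "/parent-doc/parent-grant-document/document-id/date", "19991214")


def parent_fields(grant_root):
    """Read the parent document and parent grant document fields separately."""
    parent = grant_root.find(RELATION + "/parent-doc")
    return {
        "parent_doc_number": parent.findtext("document-id/doc-number"),
        "parent_doc_date": parent.findtext("document-id/date"),
        "parent_grant_doc_number": parent.findtext("parent-grant-document/document-id/doc-number"),
        "parent_grant_doc_date": parent.findtext("parent-grant-document/document-id/date"),
    }


fields = parent_fields(grant)
```
&lt;br /&gt;
Fields that are absent from a record come back as None, which makes it easy to count how often each group is actually filled and so test the hypothesis above against real data.&lt;br /&gt;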
&lt;br /&gt;
=====Fields=====&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* parent_doc_status (varchar(255)) status of the parent application. Probably related to whether the patent application is pending or not&lt;br /&gt;
* parent_doc_number (int) probably application number for the parent document&lt;br /&gt;
* parent_doc_id (varchar(255)) it is unclear how this is different from parent document number. In the xpaths, the path will be something like ./parent_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent document&amp;quot; is - or may refer to its purpose (i.e. reissue, continuation, continuation-in-part, etc.)&lt;br /&gt;
* parent_doc_country (varchar(255)) the country of origin of the parent document&lt;br /&gt;
* parent_doc_date (date) probably the date the parent document was filed&lt;br /&gt;
* parent_grant_doc_number (int) probably application number for the parent grant document&lt;br /&gt;
* parent_grant_doc_id (varchar(255)) it is unclear how this is different from parent grant document number. In the xpaths, the path will be something like ./parent_doc/parent-grant-doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_grant_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent grant document&amp;quot; is, or may refer to its purpose (i.e. reissue, continuation, continuation-in-part, etc.)&lt;br /&gt;
* parent_grant_doc_country (varchar(255)) the country of origin of the parent grant document&lt;br /&gt;
* parent_grant_doc_date (date) probably the date the parent grant document was filed&lt;br /&gt;
* child_doc_number (varchar(255)) probably the application number of the child application&lt;br /&gt;
* child_doc_id (varchar(255)) it is unclear how this is different from child document number. In the xpaths, the path will be something like ./child_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* child_doc_country (varchar(255)) country of origin of the child application&lt;br /&gt;
&lt;br /&gt;
====DESIGN====&lt;br /&gt;
=====Fields=====&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* length_of_grant (int) length of grant, most likely in years&lt;br /&gt;
* hague_registration_date (date) filing date of international patent application&lt;br /&gt;
* hague_filing_date (date) not necessarily the same as the filing date of the international patent application, this is the date that the International Bureau receives all necessary elements for the international patent application&lt;br /&gt;
* hague_registration_pub_date (date) date that the International Bureau publishes the international patent application&lt;br /&gt;
* hague_international_registration_number (varchar(255)) international registration number&lt;br /&gt;
* edition (varchar(255)) possibly the edition of the Classification Locarno which determined main_classification&lt;br /&gt;
* main_classification (varchar(255)) classification for what type of design the patent is for&lt;br /&gt;
&lt;br /&gt;
==Connecting Patent database and Assignment database==&lt;br /&gt;
The key to connecting the Patent database to the Assignment database is to use the information in the table DOCUMENT_INFO to find, for each assignment in the Assignment database ASSIGNMENT table, the matching patent number in the Patent database PATENT table. On further investigation into &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot;, the field they called &amp;quot;grant_doc_num&amp;quot; (grant_num above) in DOCUMENT_INFO is the patent number (the description of DOCUMENT_INFO above has been altered to reflect this).&lt;br /&gt;
&lt;br /&gt;
The paper mentions that there will be errors in the patent number stored in DOCUMENT_INFO under grant_doc_num, so a separate table called DOCUMENT_INFO_ADMIN was constructed to determine how prevalent the errors were. They queried the patent number for each appno_doc_num from administrative data. The &amp;quot;appno_doc_num&amp;quot; is the Application Document USPTO number from DOCUMENT_INFO. On further investigation into the paper, it became clear that &amp;quot;administrative data&amp;quot; refers to internal USPTO data that was available to the authors of the paper, but that we might not have. However, in 99% of cases the grant_doc_num and the queried patent number based on the appno_doc_num were the same, so we can probably rely on grant_doc_num.&lt;br /&gt;
&lt;br /&gt;
Therefore, since every table in the Patent database will have the patent number stored, to connect any table in Assignment to the Patent database, it would first be joined with DOCUMENT_INFO on rf_id and then joined with the appropriate table in Patent on patent number.&lt;br /&gt;
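&lt;br /&gt;
The two-step join described above might look like the sketch below. It runs against in-memory SQLite rather than the real PostgreSQL databases; the convey_text and title columns, the column types, and all row values are hypothetical, and the DOCUMENT_INFO column is called grant_doc_num here even though the design above names it grant_num.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

# In-memory SQLite stands in for the real databases (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE assignment (rf_id INTEGER PRIMARY KEY, convey_text VARCHAR(255));
    CREATE TABLE document_info (rf_id INTEGER, grant_doc_num VARCHAR(255));
    CREATE TABLE patent (patent_number VARCHAR(255) PRIMARY KEY, title VARCHAR(255));
    INSERT INTO assignment VALUES (1, 'ASSIGNMENT OF ASSIGNORS INTEREST');
    INSERT INTO document_info VALUES (1, '1000001');
    INSERT INTO patent VALUES ('1000001', 'Example widget');
    """
)

# Step 1: assignment to document_info on rf_id.
# Step 2: document_info to patent on the patent number.
linked = conn.execute(
    """
    SELECT a.rf_id, p.patent_number, p.title
    FROM assignment AS a
    JOIN document_info AS d ON d.rf_id = a.rf_id
    JOIN patent AS p ON p.patent_number = d.grant_doc_num
    """
).fetchall()
```
&lt;br /&gt;
Any other Assignment table that carries rf_id can be substituted for assignment in the first join, and any Patent table that carries the patent number can be substituted for patent in the second.&lt;br /&gt;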
&lt;br /&gt;
The paper also mentions Assignments on the Web for Patents (AOTW-P), a searchable database of individual USPTO assignment records keyed on reel-frame identification, patent number, and assignor or assignee name (https://assignment.uspto.gov/patent/index.html#/patent/search). It would not be practical to use this tool by hand to look up every patent number, but if it were possible to write a script that queries each patent number using the rf_id and parses the response, this could be a useful check on the patent numbers. It might not be any more accurate than what is already in DOCUMENT_INFO.&lt;br /&gt;
&lt;br /&gt;
==Finding New Fields Unique to Plant, Reissue, and/or Design Patents==&lt;br /&gt;
&lt;br /&gt;
Based on Oliver's script, which searched all xpaths and compared which were unique to particular types, we see that the following attributes are unique to each type of patent other than utility patents. These attributes vary by the XML version, which changed over time. Therefore, the lists below are a superset of the attributes that are unique to each of the patent types listed below across all XML versions.&lt;br /&gt;
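&lt;br /&gt;
The comparison can be thought of as a set difference per patent type. The sketch below is not Oliver's script, only an illustration of the idea; the function name is hypothetical and the observed paths are a tiny hand-picked sample.&lt;br /&gt;
&lt;br /&gt;
```python
def unique_paths(paths_by_type):
    """Map each patent type to the xpaths that no other type uses."""
    unique = {}
    for patent_type, paths in paths_by_type.items():
        others = set()
        for other_type, other_paths in paths_by_type.items():
            if other_type != patent_type:
                others |= set(other_paths)
        unique[patent_type] = set(paths) - others
    return unique


# Hand-picked sample paths, for illustration only.
BIB = "us-patent-grant/us-bibliographic-data-grant"
SHARED = BIB + "/invention-title"
observed = {
    "utility": {SHARED},
    "plant": {SHARED, BIB + "/us-botanic/latin-name"},
    "design": {SHARED, BIB + "/classification-locarno/edition"},
}
result = unique_paths(observed)
```
&lt;br /&gt;
Running this once per XML version and taking the union of the results for each type gives the kind of superset listed below.&lt;br /&gt;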
&lt;br /&gt;
When I am done finding the supersets, we will determine how to integrate these attributes into the design for the patent table.&lt;br /&gt;
&lt;br /&gt;
'''There appears to be a fifth patent type in the results from Oliver's script, and there are attributes that are unique to Utility patents, though they do not appear particularly useful'''&lt;br /&gt;
&lt;br /&gt;
===Plant Patents===&lt;br /&gt;
* US Claim Statement (XML45) - unclear what this represents&lt;br /&gt;
* Latin Name (XML44)&lt;br /&gt;
* US Botanical Variety (XML44)&lt;br /&gt;
&lt;br /&gt;
===Reissue Patents===&lt;br /&gt;
* Parent Status (XML45)&lt;br /&gt;
* Parent Document (XML44) - seems to be either a normal document or a parent grant document (or both), both contain the following fields&lt;br /&gt;
** Date&lt;br /&gt;
** Document number&lt;br /&gt;
** Country &lt;br /&gt;
** Document ID&lt;br /&gt;
** Kind (XML42)&lt;br /&gt;
* Child Document (XML44)&lt;br /&gt;
** Document ID&lt;br /&gt;
** Document Number&lt;br /&gt;
** Country (XML43)&lt;br /&gt;
&lt;br /&gt;
Note about fields related to the parent document and/or parent grant document:&lt;br /&gt;
The structure of the above fields related to a reissue patent is very odd and varies by XML version. For example, for most of the XML versions, parent grant document falls under parent document and has its own document number, date, country, document ID, kind, etc. The parent doc field falls under relation, which falls under reissue.&lt;br /&gt;
&lt;br /&gt;
However, in XML41, the relation node that contains parent document appears under a new category, US Reexamination Reissue Merger, as well as under reissue. The information pertaining to the child document also falls under relation in both places. It's possible that during the time period this XML version represents, there were two types of reissue patents - a reexamination-reissue merger and a standard reissue. With discrepancies like these, we'll have to determine whether we want to stick with the fields as they are designed above (in which case the information about a parent doc or parent grant document is stored the same way whether it is a reissue or a Reexamination Reissue Merger), or whether these two classifications are different enough to warrant duplicate fields in the table.&lt;br /&gt;
&lt;br /&gt;
===Design Patents===&lt;br /&gt;
* length of grant (XML45)&lt;br /&gt;
* Hague Agreement Data (XML45) - allows people to file design patents in 66 countries with one application&lt;br /&gt;
** International Registration Date&lt;br /&gt;
** International Filing Date&lt;br /&gt;
** International Registration Publication Date&lt;br /&gt;
** International Registration Number&lt;br /&gt;
* Classification Locarno (XML40)&lt;br /&gt;
** Edition&lt;br /&gt;
** Main Classification&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please note that the lists above are meant to represent unique fields that should be added to the patent database. This does not mean that every XML version contains all these fields, or that every version contains the same path to these fields. That will have to be determined separately. The XML number next to some of the items is meant to represent the latest XML version where the field was seen.&lt;br /&gt;
&lt;br /&gt;
Though I was mainly looking at unique fields in plant, reissue, and design patents, I noticed the following fields on unique paths for utility patents in XML41:&lt;br /&gt;
&lt;br /&gt;
* US Related Documents&lt;br /&gt;
** Substitution&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Parent Document&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Country&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
***** Parent Status&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Country&lt;br /&gt;
** Continuation/Continuation in Part&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Doc&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
** Division&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Kind&lt;br /&gt;
&lt;br /&gt;
This appears to resemble the information about reissue patents. Additionally, a couple of fields appear under the information about bibliographic data that might be useful:&lt;br /&gt;
* US Deceased Inventor (XML41)&lt;br /&gt;
** Post Code&lt;br /&gt;
** Address&lt;br /&gt;
* US Provisional Application Status (XML41)&lt;br /&gt;
&lt;br /&gt;
==Paths for the New Fields Related to Plant, Reissue, and/or Design Patents==&lt;br /&gt;
&lt;br /&gt;
===Plant Patents===&lt;br /&gt;
XML 4.4, 4.3, 4.1, and 4.0&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-botanic&lt;br /&gt;
fields: latin-name, variety&lt;br /&gt;
&lt;br /&gt;
XML 4.5&lt;br /&gt;
parent node: us-patent-grant/us-claim-statement/&lt;br /&gt;
field: i&lt;br /&gt;
&lt;br /&gt;
XML 4.2 &lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-botanic&lt;br /&gt;
fields: latin-name, variety&lt;br /&gt;
&lt;br /&gt;
parent node: us-patent-grant/us-claim-statement/&lt;br /&gt;
field: i &lt;br /&gt;
&lt;br /&gt;
===Reissue Patents===&lt;br /&gt;
&lt;br /&gt;
XML 4.5&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/parent-doc/&lt;br /&gt;
fields: parent-status&lt;br /&gt;
&lt;br /&gt;
XML 4.4&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/continuing-reissue/relation/&lt;br /&gt;
fields: parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
	parent-doc/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/country&lt;br /&gt;
	child-doc/document-id/country&lt;br /&gt;
	child-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/doc-number&lt;br /&gt;
	parent-doc/document-id/date&lt;br /&gt;
(everything in XML 4.3 except parent-status)	&lt;br /&gt;
&lt;br /&gt;
XML 4.3&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/&lt;br /&gt;
fields: child-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
	parent-doc/parent-status&lt;br /&gt;
	parent-doc/document-id/date&lt;br /&gt;
	parent-doc/document-id/country&lt;br /&gt;
	parent-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/country&lt;br /&gt;
	child-doc/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/doc-number	&lt;br /&gt;
	&lt;br /&gt;
XML 4.1&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/&lt;br /&gt;
fields:&lt;br /&gt;
	child-doc/document-id/doc-number&lt;br /&gt;
	child-doc/document-id/country&lt;br /&gt;
	parent-doc/parent-status&lt;br /&gt;
	parent-doc/document-id/kind&lt;br /&gt;
	parent-doc/document-id/country&lt;br /&gt;
	parent-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/document-id/date&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/kind&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/doc-number&lt;br /&gt;
&lt;br /&gt;
other parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/us-reexamination-reissue-merger/relation/&lt;br /&gt;
fields: &lt;br /&gt;
	child-doc/document-id/doc-number&lt;br /&gt;
	child-doc/document-id/country&lt;br /&gt;
	parent-doc/document-id/date&lt;br /&gt;
	parent-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/kind&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/doc-number&lt;br /&gt;
&lt;br /&gt;
XML 4.0 and XML 4.2&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/&lt;br /&gt;
fields: parent-doc/document-id/kind&lt;br /&gt;
	parent-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/kind&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/country&lt;br /&gt;
	child-doc/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/doc-number&lt;br /&gt;
	child-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
	parent-doc/parent-status&lt;br /&gt;
	parent-doc/document-id/country&lt;br /&gt;
	parent-doc/document-id/date&lt;br /&gt;
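&lt;br /&gt;
Since the relation node sits under a different parent depending on the XML version, a parser can try each known parent node in turn. The sketch below assumes that approach; the function names are hypothetical, the sample record is invented, and the parent nodes are the three listed above (relative to the us-patent-grant root).&lt;br /&gt;
&lt;br /&gt;
```python
import xml.etree.ElementTree as ET

RELATED = "us-bibliographic-data-grant/us-related-documents"
# Parent nodes taken from the per-version lists above, tried in order.
RELATION_PARENTS = (
    RELATED + "/reissue/relation",
    RELATED + "/continuing-reissue/relation",
    RELATED + "/us-reexamination-reissue-merger/relation",
)


def find_relation(grant_root):
    """Return the first relation element found under any known parent node."""
    for path in RELATION_PARENTS:
        relation = grant_root.find(path)
        if relation is not None:
            return relation
    return None


def child_doc_number(grant_root):
    """Read child-doc/document-id/doc-number wherever the relation node lives."""
    relation = find_relation(grant_root)
    if relation is None:
        return None
    return relation.findtext("child-doc/document-id/doc-number")


def add_path(root, path, text):
    """Test helper: create the chain of elements in path and set the leaf text."""
    node = root
    for tag in path.split("/"):
        found = node.find(tag)
        node = found if found is not None else ET.SubElement(node, tag)
    node.text = text


# Invented record that uses the XML 4.4 parent node.
sample = ET.Element("us-patent-grant")
add_path(sample, RELATED + "/continuing-reissue/relation/child-doc/document-id/doc-number", "10123456")
number = child_doc_number(sample)
missing = child_doc_number(ET.Element("us-patent-grant"))
```
&lt;br /&gt;
This stores the reissue and reexamination-reissue-merger cases the same way; if the two turn out to need separate fields, find_relation would have to report which parent node matched.&lt;br /&gt;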
&lt;br /&gt;
===Design Patents===&lt;br /&gt;
XML 4.5&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/&lt;br /&gt;
fields: hague-agreement-data/international-registration-date/date&lt;br /&gt;
	hague-agreement-data/international-registration-publication-date/date&lt;br /&gt;
	us-term-of-grant/length-of-grant&lt;br /&gt;
	hague-agreement-data/international-registration-number&lt;br /&gt;
	hague-agreement-data/international-filing-date/date&lt;br /&gt;
&lt;br /&gt;
XML 4.1, 4.3, and 4.4&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-term-of-grant/&lt;br /&gt;
fields: length-of-grant&lt;br /&gt;
&lt;br /&gt;
XML 4.0&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/&lt;br /&gt;
fields: us-term-of-grant/length-of-grant&lt;br /&gt;
	classification-locarno/edition&lt;br /&gt;
	classification-locarno/main-classification&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Reproducible_Patent_Data&amp;diff=22328</id>
		<title>Reproducible Patent Data</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Reproducible_Patent_Data&amp;diff=22328"/>
		<updated>2017-12-07T21:53:58Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* Troubleshooting */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has Image=Uspto web logo.jpg&lt;br /&gt;
|Has title=Reproducible Patent Data&lt;br /&gt;
|Has owner=Oliver Chang&lt;br /&gt;
|Has start date=May 17&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Does subsume=Redesigning Patent Database, Patent Assignment Data Restructure,&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
A continuation of [[Redesigning Patent Database]] that aims to write faster, more centralized code to deal with data from the United States Patent and Trademark Office (USPTO). By having an end-to-end pipeline we can easily reproduce or update data without worrying about unintentional side effects or missing data.&lt;br /&gt;
&lt;br /&gt;
== Quickstart ==&lt;br /&gt;
&lt;br /&gt;
To get up and running with the code, do the following:&lt;br /&gt;
&lt;br /&gt;
# Clone the git project (link at end of page) to your user directory&lt;br /&gt;
# Launch IntelliJ with &amp;gt;= Java 8 and Maven configured (the default version installed on the RDP is set up to do this)&lt;br /&gt;
# Open project in IntelliJ&lt;br /&gt;
# Create an empty database (see [[#Database]])&lt;br /&gt;
# Run the table creation scripts in &amp;lt;code&amp;gt;src/db/schemas/&amp;lt;/code&amp;gt; in your new database&lt;br /&gt;
# Modify the constant &amp;lt;code&amp;gt;DATABASE_NAME&amp;lt;/code&amp;gt; in &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\postgres\DatabaseHelper.java&amp;lt;/code&amp;gt;&lt;br /&gt;
# Run the Driver scripts in IntelliJ with the correct value for &amp;lt;code&amp;gt;DATA_DIRECTORY&amp;lt;/code&amp;gt; (or run &amp;lt;code&amp;gt;RunInitialImport.java&amp;lt;/code&amp;gt; which will do all of the data directories for that patent item type)&lt;br /&gt;
# [Take a really, really long lunch...in total should take no more than five hours to load data on RDP]&lt;br /&gt;
# Run scripts in &amp;lt;code&amp;gt;src/db/constraints&amp;lt;/code&amp;gt; to check data assumptions&lt;br /&gt;
# That's it!&lt;br /&gt;
&lt;br /&gt;
===Troubleshooting===&lt;br /&gt;
&lt;br /&gt;
If you're new to IntelliJ (and even if you're not) you might run into problems with importing the project. &lt;br /&gt;
&lt;br /&gt;
'''Setting Up Project as a Maven project'''&lt;br /&gt;
It should be clear if the project is not set up as a Maven project - when you right click on RunInitialImport.java, for example, you won't see an option with a green triangle next to it that says &amp;quot;Run 'RunInitialImport.java'&amp;quot;, and the green triangle in the top toolbar will be grayed out. If the project is not set up as a Maven project, you will not be able to run any of the code. To set up the project as a Maven project, when you import the project, follow the instructions at the following [https://www.jetbrains.com/help/idea/maven.html#maven_import_project_start link].&lt;br /&gt;
&lt;br /&gt;
* Note that when you click &amp;quot;Import Project&amp;quot;, you should select the &amp;quot;Simpler Patent Data&amp;quot; folder, not the &amp;quot;src&amp;quot; folder within Simpler Patent Data, otherwise you won't get the pom.xml file that you need to let IntelliJ know that this is a Maven project.&lt;br /&gt;
* On the second window (there will be several options with check boxes next to them) make sure &amp;quot;import Maven projects automatically&amp;quot; is selected&lt;br /&gt;
* On the &amp;quot;Please select project SDK&amp;quot; window, make sure it says &amp;quot;1.8&amp;quot; in the &amp;quot;Name&amp;quot; slot. &lt;br /&gt;
* On the next window, enter a name for the project and enter a folder location.&lt;br /&gt;
&lt;br /&gt;
This should ensure that the project is set up as a Maven project. &lt;br /&gt;
&lt;br /&gt;
'''Setting Up Your Data Source'''&lt;br /&gt;
&lt;br /&gt;
If you run into a message across the top that says something along the lines of &amp;quot;Configure Data Source&amp;quot;, then you have not connected IntelliJ to a database. You will not be able to run the code located under src/db until you configure one. Start by clicking on the link in the message, or if it doesn't appear, follow the instructions [https://www.jetbrains.com/help/idea/connecting-to-a-database.html here] to open up the &amp;quot;Data Sources and Drivers&amp;quot; pop-up and add a PostgreSQL database. When you get to the dialogue asking about the host, database, user, and password, enter the following to connect to the database on the RDP:&lt;br /&gt;
&lt;br /&gt;
 host: localhost&lt;br /&gt;
 database: whatever the constant DATABASE_NAME is set to - the default is patentsj&lt;br /&gt;
 user: postgres&lt;br /&gt;
 password: tabspaceenter&lt;br /&gt;
&lt;br /&gt;
Make sure to test the connection by clicking &amp;quot;Test Connection&amp;quot;. Now you should be able to run the scripts under src/db.&lt;br /&gt;
&lt;br /&gt;
If you're seeing issues such as &amp;quot;column [something] of relation [something] doesn't exist&amp;quot; but you've run the schema scripts, you probably have a different database name under the data source than the one the constant DATABASE_NAME is set to. To change this, right click on the data source in the Database tab and select &amp;quot;Properties&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
== Directory Layout ==&lt;br /&gt;
&lt;br /&gt;
=== Where is the Data? ===&lt;br /&gt;
&lt;br /&gt;
==== Directories ====&lt;br /&gt;
&lt;br /&gt;
All of the information for this project is located at &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
There are several interesting directories:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;data/downloads/&amp;lt;/code&amp;gt; is USPTO bulkdata, unmodified straight from the scraper&lt;br /&gt;
* &amp;lt;code&amp;gt;data/extracts/&amp;lt;/code&amp;gt; is a directory of a strict subset of the information stored in &amp;lt;code&amp;gt;data/downloads/&amp;lt;/code&amp;gt;. It is the result of running a bulk 7-zip job on that directory to get everything unzipped in a flat data structure. Note that these files have the USPTO modified-by time since that metadata is stored in the zipfiles. To extract files in this nice format, select all of the zipfiles and set up an extraction job like the one in this [[media:7zip-params.png|screenshot]]&lt;br /&gt;
* &amp;lt;code&amp;gt;data/backups/&amp;lt;/code&amp;gt; is a 7zip'd backup of the corresponding directory in extracts&lt;br /&gt;
* &amp;lt;code&amp;gt;src/&amp;lt;/code&amp;gt; is the main code repository for the java project&lt;br /&gt;
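&lt;br /&gt;
For anyone without 7-zip at hand, the flat extraction can be approximated with Python's standard zipfile module. This is a sketch of an alternative, not the process that produced data/extracts/; the function name and the sample archive are hypothetical, and files that share a name in different folders would overwrite each other.&lt;br /&gt;
&lt;br /&gt;
```python
import os
import shutil
import tempfile
import time
import zipfile


def extract_flat(zip_path, dest_dir):
    """Extract every file in zip_path directly into dest_dir, ignoring subfolders."""
    os.makedirs(dest_dir, exist_ok=True)
    written = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            target = os.path.join(dest_dir, os.path.basename(info.filename))
            with archive.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            # Restore the modified time stored in the archive entry.
            stamp = time.mktime(info.date_time + (0, 0, -1))
            os.utime(target, (stamp, stamp))
            written.append(target)
    return written


# Demonstration on a throwaway archive with a nested entry.
work = tempfile.mkdtemp()
archive_path = os.path.join(work, "sample.zip")
with zipfile.ZipFile(archive_path, "w") as archive:
    archive.writestr("nested/folder/ipg_sample.xml", "placeholder")
extracted = extract_flat(archive_path, os.path.join(work, "extracts"))
names = [os.path.basename(path) for path in extracted]
```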
&lt;br /&gt;
==== Input Files ====&lt;br /&gt;
&lt;br /&gt;
The input files are all of the text-only Red Book files for '''granted patents''' from 1976 to 2016, inclusive. To find a specific year's XML file, look in&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\extracts\granted\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''To find application data''' from 2001 to 2016, inclusive, look in&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\extracts\applications\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''To find assignment data''', look in &lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\extracts\granted\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''To find maintenance fee data''', look in &lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\downloads\maintenance\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Where is the Code? ===&lt;br /&gt;
&lt;br /&gt;
The code has the same parent directory as the data, so it is at &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src&amp;lt;/code&amp;gt;. You might notice a lot of single-entry directories; this is an idiomatic Java pattern that is used for package separation. If using IntelliJ or some other IDE, these directories are a bit less annoying.&lt;br /&gt;
&lt;br /&gt;
The development environment is Java 8 JDK, IntelliJ Ultimate IDE, Maven build tools, and git VCS.&lt;br /&gt;
&lt;br /&gt;
The git repository can be found at http://rdp.mcnaircenter.org/codebase/Repository/ReproduciblePatent&lt;br /&gt;
&lt;br /&gt;
==== Prior Art ====&lt;br /&gt;
&lt;br /&gt;
This tool is not so concerned with adding new functionality; rather, it aims to take a bunch of spread out Perl scripts and create a faster system that is easier to work with. As such, its functionality is largely stolen from those scripts:&lt;br /&gt;
&lt;br /&gt;
* Downloader: &amp;lt;code&amp;gt;E:\McNair\Software\Scripts\Patent\USPTO_Parser.pl&amp;lt;/code&amp;gt;&lt;br /&gt;
* XML Splitter: &amp;lt;code&amp;gt;E:\McNair\PatentData\splitter.pl&amp;lt;/code&amp;gt;&lt;br /&gt;
* XML Parsing: &amp;lt;code&amp;gt;E:\McNair\PatentData\Processed\xmlparser_4.5_4.4_4.3.pl&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;E:\McNair\PatentData\Processed\*.pm&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In addition, I used several non-standard Java libraries listed below:&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/mashape/unirest-java/ Unirest] for easy HTTP requests (MIT License)&lt;br /&gt;
* [https://github.com/google/guava Google Guava] for immutable collections and Stream utilities (Apache v2.0 License)&lt;br /&gt;
* [https://github.com/jhy/jsoup/ jsoup] for HTML parsing (MIT License)&lt;br /&gt;
* [http://commons.apache.org/proper/commons-codec/ Apache Commons Codec] (Apache v2.0 License)&lt;br /&gt;
* [http://commons.apache.org/proper/commons-lang/ Apache Commons Lang v3] (Apache v2.0 License)&lt;br /&gt;
* [https://mvnrepository.com/artifact/org.jetbrains/annotations/15.0 Jetbrains Annotations] for enhanced null checks (Apache v2.0 License)&lt;br /&gt;
* [http://search.maven.org/#artifactdetails%7Corg.postgresql%7Cpostgresql%7C42.1.1.jre7%7Cbundle PostgreSQL JDBC] (BSD 3-clause per https://github.com/pgjdbc/pgjdbc-jre7/blob/master/LICENSE)&lt;br /&gt;
&lt;br /&gt;
If using Maven, these dependencies are listed in the project's pom.xml and should be set up automatically.&lt;br /&gt;
&lt;br /&gt;
==== Using Code ====&lt;br /&gt;
&lt;br /&gt;
Any file with a line that says &amp;lt;code&amp;gt;public static void main(String[] args) {&amp;lt;/code&amp;gt; can be run as a standalone file. The easiest way to do this is to load the project and then the file in IntelliJ and click the little green play arrow next to this bit of code.&lt;br /&gt;
&lt;br /&gt;
The code can also be run via the standard &amp;lt;code&amp;gt;javac&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;java&amp;lt;/code&amp;gt; commands but since this project has a complicated structure you end up having to run commands like &lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;quot;C:\Program Files\Java\jdk1.8.0_131\bin\java&amp;quot; &amp;quot;-javaagent:C:\Users\OliverC\IntelliJ IDEA 2017.1.3\lib\idea_rt.jar=62364:C:\Users\OliverC\IntelliJ IDEA 2017.1.3\bin&amp;quot; -Dfile.encoding=UTF-8 -classpath &amp;quot;[...contents truncated...];C:\Users\OliverC\.m2\repository\org\postgresql\postgresql\42.1.1\postgresql-42.1.1.jar&amp;quot; org.bakerinstitute.mcnair.uspto_assignments.XmlDriver&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
to include all of the runtime dependencies and it's just not worth it.&lt;br /&gt;
&lt;br /&gt;
==== Altering Code ====&lt;br /&gt;
&lt;br /&gt;
* Use the IntelliJ command Reformat Code (found in the menus at &amp;lt;code&amp;gt;Code &amp;gt; Reformat Code&amp;lt;/code&amp;gt;)&lt;br /&gt;
* Use the Optimize Imports command found under the same menu&lt;br /&gt;
* Use spaces for indentation&lt;br /&gt;
* Loosely try to keep lines below 120 characters&lt;br /&gt;
* Commit changes to the Git remote repository &amp;quot;bonobo&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== Schema Reconciliation ==&lt;br /&gt;
&lt;br /&gt;
For the work by Joe, see the [[Patent Schema Reconciliation]] page &lt;br /&gt;
&lt;br /&gt;
=== Patents (Granted) ===&lt;br /&gt;
&lt;br /&gt;
See &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\examples\granted&amp;lt;/code&amp;gt; for extracted examples showing which fields are available in a sample of the data.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Granted Patent Data Formats&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; | Dates Used !! scope=&amp;quot;col&amp;quot; | Format !! scope=&amp;quot;col&amp;quot; | Location !! scope=&amp;quot;col&amp;quot; | Supported? !! scope=&amp;quot;col&amp;quot; | Utility !! scope=&amp;quot;col&amp;quot; | Reissue !! scope=&amp;quot;col&amp;quot; | Design !! scope=&amp;quot;col&amp;quot; | Plant&lt;br /&gt;
|-&lt;br /&gt;
|January 1976 to December 2001&lt;br /&gt;
|APS&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/vintage&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;del&amp;gt;January 2001 to December 2001&amp;lt;/del&amp;gt;&lt;br /&gt;
|&amp;lt;del&amp;gt;SGML&amp;lt;/del&amp;gt;&lt;br /&gt;
|Ignored; use concurrently recorded APS data&lt;br /&gt;
|No&lt;br /&gt;
|N/A&lt;br /&gt;
|N/A&lt;br /&gt;
|N/A&lt;br /&gt;
|N/A&lt;br /&gt;
|-&lt;br /&gt;
|January 2002 to December 2004&lt;br /&gt;
|XML Version 2.5&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/blunderyears&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2005 to December 2005&lt;br /&gt;
|XML Version 4.0 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2006 to December 2006&lt;br /&gt;
|XML Version 4.1 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2007 to December 2012&lt;br /&gt;
|XML Version 4.2 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2013 to September 24, 2013&lt;br /&gt;
|XML Version 4.3 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|October 8, 2013 to December 2014&lt;br /&gt;
|XML Version 4.4 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2015 to December 2016&lt;br /&gt;
|XML Version 4.5 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== APS Rosetta Stone ===&lt;br /&gt;
&lt;br /&gt;
The Automated Patent System (APS) is a fixed-width text format used to store historical patent grant data. The documentation for it is poor, with pages missing at random. Luckily, we only care about the content contained here: [[File:PatentFullTextAPSDoc_GreenBook_pgs13-22.pdf]].&lt;br /&gt;
&lt;br /&gt;
It's worth mentioning that the APS contains an advanced text markup system for chemical formulae, basic text markup, tables, etc. that can lead to seemingly garbled text that is perfectly well-formed.&lt;br /&gt;
&lt;br /&gt;
==== APS Gotchas ====&lt;br /&gt;
&lt;br /&gt;
* PATN.WKU is the granted patent number. It is 7 digits long, although the spec promises 6 digits; the rightmost digit is a modulus-11 check digit. See [[File:Aps-wku-modulus11.pdf]] for the official explanation.&lt;br /&gt;
&lt;br /&gt;
=== Patents (Applications) ===&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Patent Application Data Formats&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; | Dates Used !! scope=&amp;quot;col&amp;quot; | Format !! scope=&amp;quot;col&amp;quot; | Location !! scope=&amp;quot;col&amp;quot; | Supported by Parser?&lt;br /&gt;
|-&lt;br /&gt;
|March 15, 2001 to December 2001&lt;br /&gt;
|XML Version 1.5&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/vintage&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Yes, for basic information, inventors, and correspondents&lt;br /&gt;
|-&lt;br /&gt;
|January 2002 to December 2004&lt;br /&gt;
|XML Version 1.6&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/vintage&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2005 to December 2005&lt;br /&gt;
|XML Version 4.0 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2006 to December 2006&lt;br /&gt;
|XML Version 4.1 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2007 to December 2012&lt;br /&gt;
|XML Version 4.2 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2013 to December 2014&lt;br /&gt;
|XML Version 4.3 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2015 to ''Present''&lt;br /&gt;
|XML Version 4.4 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Database ==&lt;br /&gt;
&lt;br /&gt;
Because there isn't a compelling reason not to, I used the existing PostgreSQL infrastructure on the RDP.&lt;br /&gt;
The &amp;quot;Java Way&amp;quot; of interacting with databases is the Java Database Connectivity API (JDBC), an implementation-agnostic API for interacting with databases.&lt;br /&gt;
This project uses the stock [https://jdbc.postgresql.org/ Postgres JDBC] driver, version 42.1.1.&lt;br /&gt;
&lt;br /&gt;
=== Create an empty database on RDP ===&lt;br /&gt;
&lt;br /&gt;
To create an empty database, run this command: &amp;lt;code&amp;gt;$ createdb --username=postgres database-name-goes-here # password is tabspaceenter&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Abstraction Layer ===&lt;br /&gt;
&lt;br /&gt;
Since writing raw SQL is a bit cumbersome and error-prone, I have added some abstraction layers that make it much easier to add bulk data quickly. Using the Postgres JDBC driver's &amp;lt;code&amp;gt;CopyManager&amp;lt;/code&amp;gt; class, we buffer the data for SQL COPY commands in memory (as much as possible) and then flush the buffered rows to the database. To understand how the abstraction layers work, see the code in &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\postgres&amp;lt;/code&amp;gt;. See &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\models\GrantedPatent.java&amp;lt;/code&amp;gt; for '''an example of how to extend''' the abstraction layer to deal with more complex scenarios.&lt;br /&gt;
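&lt;br /&gt;
As a rough illustration of the buffering idea (not the project's actual classes), the sketch below accumulates rows in PostgreSQL COPY text format in memory. The class name &amp;lt;code&amp;gt;CopyBufferSketch&amp;lt;/code&amp;gt; and its methods are hypothetical. In real code the flushed text would presumably be wrapped in a Reader and handed to the driver's &amp;lt;code&amp;gt;CopyManager.copyIn(String, Reader)&amp;lt;/code&amp;gt;; that call is left out here so the example runs without a database.&lt;br /&gt;
&lt;br /&gt;
```java
public class CopyBufferSketch {
    // Hypothetical helper: buffers rows in COPY text format (tab-separated, newline-terminated).
    private final StringBuilder buffer = new StringBuilder();
    private int rowCount = 0;

    // Escape the characters that are special in COPY text format.
    static String escape(String value) {
        if (value == null) {
            return "\\N"; // COPY text format marker for NULL
        }
        return value.replace("\\", "\\\\")
                .replace("\t", "\\t")
                .replace("\n", "\\n")
                .replace("\r", "\\r");
    }

    // Append one row to the in-memory buffer.
    void addRow(String... columns) {
        String[] escaped = new String[columns.length];
        int i = 0;
        for (String column : columns) {
            escaped[i++] = escape(column);
        }
        buffer.append(String.join("\t", escaped)).append('\n');
        rowCount++;
    }

    // Return the buffered payload and reset the buffer.
    // Real code would pass the payload to CopyManager.copyIn instead of returning it.
    String flush() {
        String payload = buffer.toString();
        buffer.setLength(0);
        rowCount = 0;
        return payload;
    }

    int size() {
        return rowCount;
    }

    public static void main(String[] args) {
        CopyBufferSketch sketch = new CopyBufferSketch();
        sketch.addRow("1234567", "Title with\ttab");
        sketch.addRow("7654321", null);
        System.out.print(sketch.flush());
    }
}
```
&lt;br /&gt;
Buffering many rows and sending them in a single COPY avoids the per-statement overhead of individual INSERTs, which is likely why the abstraction layer takes this approach.&lt;br /&gt;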
&lt;br /&gt;
=== New Table Checklist ===&lt;br /&gt;
&lt;br /&gt;
* Create schema DDL SQL code for the new table in &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\db&amp;lt;/code&amp;gt;&lt;br /&gt;
* Run the schema creation&lt;br /&gt;
* Create an enum with the same names for attributes as in the DDL (case-insensitive! prefer all-caps screaming snake case)&lt;br /&gt;
* Create a class which subclasses &amp;lt;code&amp;gt;AbstractInsertableData&amp;lt;/code&amp;gt;&lt;br /&gt;
* Inside that class, create a static class which subclasses &amp;lt;code&amp;gt;AbstractTableMetadata&amp;lt;/code&amp;gt; and has the proper values for getTableName(), getStringColumns(), getIntColumns()&lt;br /&gt;
* (Optional) Implement builder pattern&lt;br /&gt;
* (Optional) Create a custom database helper for complex extras (see PatentApplication and GrantedPatent for examples)&lt;br /&gt;
* Write the data to the table (see DatabaseHelper for the pattern I use)&lt;br /&gt;
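&lt;br /&gt;
The enum step of the checklist can be sketched as follows. This is only an illustration under stated assumptions: the table &amp;lt;code&amp;gt;example_fees&amp;lt;/code&amp;gt;, the enum &amp;lt;code&amp;gt;ExampleFeeColumn&amp;lt;/code&amp;gt;, and the method &amp;lt;code&amp;gt;copyStatement&amp;lt;/code&amp;gt; are made up for this example, and the real &amp;lt;code&amp;gt;AbstractInsertableData&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;AbstractTableMetadata&amp;lt;/code&amp;gt; classes are not reproduced here. It shows why the enum names only need to match the DDL case-insensitively: PostgreSQL folds unquoted identifiers to lower case.&lt;br /&gt;
&lt;br /&gt;
```java
import java.util.Locale;

public class ColumnEnumSketch {
    // Hypothetical columns for an imaginary example_fees table; not taken from the project.
    enum ExampleFeeColumn {
        PATENT_NUMBER, FEE_CODE, EVENT_DATE;

        // Enum constants are upper case; lower-case them to match unquoted names in the DDL.
        String columnName() {
            return name().toLowerCase(Locale.ROOT);
        }
    }

    // Build a COPY statement that lists every column in declaration order.
    static String copyStatement(String tableName) {
        StringBuilder columns = new StringBuilder();
        for (ExampleFeeColumn column : ExampleFeeColumn.values()) {
            if (columns.length() != 0) {
                columns.append(", ");
            }
            columns.append(column.columnName());
        }
        return "COPY " + tableName + " (" + columns + ") FROM STDIN";
    }

    public static void main(String[] args) {
        System.out.println(copyStatement("example_fees"));
    }
}
```
&lt;br /&gt;
Deriving the column list from the enum keeps the Java side and the DDL in step: adding a column means adding one enum constant rather than editing several SQL strings.&lt;br /&gt;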
&lt;br /&gt;
== Address Data ==&lt;br /&gt;
&lt;br /&gt;
To get the most granular address data (street level, or at least postcode level) about who owns patents, the path is not straightforward because of the complicated mapping of ownership to a granted patent.&lt;br /&gt;
This is the final part of this project that I am working on and it is all at the level of SQL.&lt;br /&gt;
See &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\db\joins&amp;lt;/code&amp;gt; for my attempts to create a clean mapping.&lt;br /&gt;
Optimistically speaking, the data generated here should be a superset of the data present in the Patent Assignment Data Restructure project.&lt;br /&gt;
&lt;br /&gt;
Note that as of the beginning of August 2017, this part '''has not been completed.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Intuition ===&lt;br /&gt;
&lt;br /&gt;
Use &amp;lt;code&amp;gt;assignments_longform.last_update_date&amp;lt;/code&amp;gt; to find the current/latest (or first/earliest) date of assignment. Then match with &amp;lt;code&amp;gt;properties.docid&amp;lt;/code&amp;gt; on &amp;lt;code&amp;gt;reelno, frameno&amp;lt;/code&amp;gt; to find the patent application ID. With this mapping to granted patents, we can discover the details of the original granted patent. And with the right date, reelno, and frameno, we can match to the &amp;lt;code&amp;gt;assignees&amp;lt;/code&amp;gt; table and get fine-grained addresses.&lt;br /&gt;
&lt;br /&gt;
== Related Pages ==&lt;br /&gt;
&lt;br /&gt;
* [[Redesign_Assignment_and_Patent_Database|Redesign Assignment and Patent Database, Fall 2017 by Shelby]]&lt;br /&gt;
* [[Equivalent_XPath_and_APS_Queries|Equivalent XPath and APS Queries, Summer 2017 by Oliver &amp;amp; Joe]]&lt;br /&gt;
* [[US_Address_Verification|US Address Verification, Summer 2017 based on tables from Assignment Data Restructure]]&lt;br /&gt;
* [[Patent_Assignment_Data_Restructure|Assignment Data Restructure, Spring 2017 by Marcela and Sonia]]&lt;br /&gt;
* [[Redesigning_Patent_Database|Redesigning Patent Database, Spring 2017 by Shelby]]&lt;br /&gt;
* [[Patent_Data_Cleanup_(June_2016)|Patent Data Cleanup, June 2016 by Marcela]]&lt;br /&gt;
* [[Patent_Data|Patent Data, Spring 2016 by Marcela]] &lt;br /&gt;
* [[Lex_Machina|Lex Machina]]&lt;br /&gt;
* [[USPTO_Patent_Litigation_Data|USPTO Patent Litigation Research Dataset by Ed]]&lt;br /&gt;
* [[Patent_Litigation_and_Review|Patent Litigation and Review by Marcela]]&lt;br /&gt;
* [[Patent|Existing Database Schema]]&lt;br /&gt;
* [[Oliver_Chang_(Work_Log)|My Work Log]]&lt;br /&gt;
&lt;br /&gt;
== External Links ==&lt;br /&gt;
&lt;br /&gt;
* Understanding Assignment Data: [https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf USPTO Documentation on their cleanup of this data]&lt;br /&gt;
* [https://bulkdata.uspto.gov/data2/patent/grant/redbook/fulltext/1976/PatentFullTextAPSDoc_GreenBook.pdf USPTO Green Book (APS) Documentation]&lt;br /&gt;
* [https://bulkdata.uspto.gov/ USPTO Bulk Data Storage System (BDSS)]&lt;br /&gt;
* [https://en.wikipedia.org/wiki/Builder_pattern Builder Pattern in Object-Oriented Programming]&lt;br /&gt;
* [http://rdp.mcnaircenter.org/codebase/Repository/ReproduciblePatent Git Repository]&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Reproducible_Patent_Data&amp;diff=22327</id>
		<title>Reproducible Patent Data</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Reproducible_Patent_Data&amp;diff=22327"/>
		<updated>2017-12-07T21:50:48Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* Troubleshooting */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has Image=Uspto web logo.jpg&lt;br /&gt;
|Has title=Reproducible Patent Data&lt;br /&gt;
|Has owner=Oliver Chang&lt;br /&gt;
|Has start date=May 17&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Does subsume=Redesigning Patent Database, Patent Assignment Data Restructure,&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
A continuation of [[Redesigning Patent Database]] that aims to write faster, more centralized code to deal with data from the United States Patent and Trademark Office (USPTO). By having an end-to-end pipeline we can easily reproduce or update data without worrying about unintentional side effects or missing data.&lt;br /&gt;
&lt;br /&gt;
== Quickstart ==&lt;br /&gt;
&lt;br /&gt;
To get up and running with the code, do the following:&lt;br /&gt;
&lt;br /&gt;
# Clone the git project (link at end of page) to your user directory&lt;br /&gt;
# Launch IntelliJ with &amp;gt;= Java 8 and Maven configured (the default version installed on the RDP is set up to do this)&lt;br /&gt;
# Open project in IntelliJ&lt;br /&gt;
# Create an empty database (see [[#Database]])&lt;br /&gt;
# Run the table creation scripts in &amp;lt;code&amp;gt;src/db/schemas/&amp;lt;/code&amp;gt; in your new database&lt;br /&gt;
# Modify the constant &amp;lt;code&amp;gt;DATABASE_NAME&amp;lt;/code&amp;gt; in &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\postgres\DatabaseHelper.java&amp;lt;/code&amp;gt;&lt;br /&gt;
# Run the Driver scripts in IntelliJ with the correct value for &amp;lt;code&amp;gt;DATA_DIRECTORY&amp;lt;/code&amp;gt; (or run &amp;lt;code&amp;gt;RunInitialImport.java&amp;lt;/code&amp;gt;, which processes all of the data directories for that patent item type)&lt;br /&gt;
# [Take a really, really long lunch: loading the data on the RDP should take no more than five hours in total]&lt;br /&gt;
# Run scripts in &amp;lt;code&amp;gt;src/db/constraints&amp;lt;/code&amp;gt; to check data assumptions&lt;br /&gt;
# That's it!&lt;br /&gt;
&lt;br /&gt;
===Troubleshooting===&lt;br /&gt;
&lt;br /&gt;
If you're new to IntelliJ (and even if you're not) you might run into problems with importing the project. &lt;br /&gt;
&lt;br /&gt;
'''Setting Up Project as a Maven project'''&lt;br /&gt;
It should be clear if the project is not set up as a Maven project: when you right click on RunInitialImport.java, for example, you won't see an option with a green triangle next to it that says &amp;quot;Run 'RunInitialImport.java'&amp;quot;, and the green triangle in the top toolbar will be grayed out. If the project is not set up as a Maven project, you will not be able to run any of the code. To set up the project as a Maven project, when you import the project, follow the instructions at the following [https://www.jetbrains.com/help/idea/maven.html#maven_import_project_start link].&lt;br /&gt;
&lt;br /&gt;
* Note that when you click &amp;quot;Import Project&amp;quot;, you should select the &amp;quot;Simpler Patent Data&amp;quot; folder, not the &amp;quot;src&amp;quot; folder within Simpler Patent Data, otherwise you won't get the pom.xml file that you need to let IntelliJ know that this is a Maven project.&lt;br /&gt;
* On the second window (there will be several options with check boxes next to them), make sure &amp;quot;Import Maven projects automatically&amp;quot; is selected.&lt;br /&gt;
* On the &amp;quot;Please select project SDK&amp;quot; window, make sure it says &amp;quot;1.8&amp;quot; in the &amp;quot;Name&amp;quot; slot. &lt;br /&gt;
* On the next window, enter a name for the project and enter a folder location.&lt;br /&gt;
&lt;br /&gt;
This should ensure that the project is set up as a Maven project. &lt;br /&gt;
&lt;br /&gt;
'''Setting Up Your Data Source'''&lt;br /&gt;
&lt;br /&gt;
If you run into a message across the top that says something along the lines of &amp;quot;Configure Data Source&amp;quot;, then you have not connected IntelliJ to a database. You will not be able to run the code located under src/db until you configure one. Start by clicking on the link in the message, or if it doesn't appear, follow the instructions [https://www.jetbrains.com/help/idea/connecting-to-a-database.html here] to open up the &amp;quot;Data Sources and Drivers&amp;quot; pop-up and add a PostgreSQL database. When you get to the dialog asking about the host, database, user, and password, enter the following to connect to the database on the RDP:&lt;br /&gt;
&lt;br /&gt;
 host: localhost&lt;br /&gt;
 database: whatever the constant DATABASE_NAME is set to - the default is patentsj&lt;br /&gt;
 user: postgres&lt;br /&gt;
 password: tabspaceenter&lt;br /&gt;
&lt;br /&gt;
Make sure to test the connection by clicking &amp;quot;Test Connection&amp;quot;. Now you should be able to run the scripts under src/db.&lt;br /&gt;
&lt;br /&gt;
If you're seeing issues such as &amp;quot;column [something] of relation [something] doesn't exist&amp;quot; even though you've run the schema scripts, you probably have a different database name configured in the data source than the one the constant DATABASE_NAME is set to.&lt;br /&gt;
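&lt;br /&gt;
One way to spot this kind of mismatch is to print the JDBC URL built from the same database name the import code uses and compare it with the URL shown in the IntelliJ data source. The sketch below is illustrative only: the class &amp;lt;code&amp;gt;DataSourceSketch&amp;lt;/code&amp;gt; and its methods are hypothetical, its &amp;lt;code&amp;gt;DATABASE_NAME&amp;lt;/code&amp;gt; is a stand-in for the real constant, and it only constructs the URL and credentials without opening a connection.&lt;br /&gt;
&lt;br /&gt;
```java
import java.util.Properties;

public class DataSourceSketch {
    // Stand-in for the project's DATABASE_NAME constant; patentsj is the default named above.
    static final String DATABASE_NAME = "patentsj";

    // Build a PostgreSQL JDBC URL for the given host and database (driver default port).
    static String jdbcUrl(String host, String database) {
        return "jdbc:postgresql://" + host + "/" + database;
    }

    // Collect the credentials in the form the JDBC DriverManager accepts.
    static Properties credentials(String user, String password) {
        Properties props = new Properties();
        props.setProperty("user", user);
        props.setProperty("password", password);
        return props;
    }

    public static void main(String[] args) {
        // Real code would pass both values to java.sql.DriverManager.getConnection(url, props).
        System.out.println(jdbcUrl("localhost", DATABASE_NAME));
    }
}
```
&lt;br /&gt;
If the printed URL names a different database than the data source does, the schema scripts and the import code are talking to different databases.&lt;br /&gt;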
&lt;br /&gt;
== Directory Layout ==&lt;br /&gt;
&lt;br /&gt;
=== Where is the Data? ===&lt;br /&gt;
&lt;br /&gt;
==== Directories ====&lt;br /&gt;
&lt;br /&gt;
All of the information for this project is located at &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
There are several interesting directories:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;data/downloads/&amp;lt;/code&amp;gt; is USPTO bulkdata, unmodified straight from the scraper&lt;br /&gt;
* &amp;lt;code&amp;gt;data/extracts/&amp;lt;/code&amp;gt; holds a strict subset of the information stored in &amp;lt;code&amp;gt;data/downloads/&amp;lt;/code&amp;gt;. It is the result of running a bulk 7-Zip job on that directory to unzip everything into a flat directory structure. Note that these files keep the USPTO modification time, since that metadata is stored in the zipfiles. To extract files in this format, select all of the zipfiles and set up an extraction job like the one in this [[media:7zip-params.png|screenshot]]&lt;br /&gt;
* &amp;lt;code&amp;gt;data/backups/&amp;lt;/code&amp;gt; is a 7zip'd backup of the corresponding directory in extracts&lt;br /&gt;
* &amp;lt;code&amp;gt;src/&amp;lt;/code&amp;gt; is the main code repository for the Java project&lt;br /&gt;
&lt;br /&gt;
==== Input Files ====&lt;br /&gt;
&lt;br /&gt;
The inputs are all of the text-only Red Book files for '''granted patents''' from 1976 to 2016, inclusive. A specific year's file can be found in&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\extracts\granted\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''To find application data''' from 2001 to 2016, inclusive, look in&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\extracts\applications\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''To find assignment data''', look in &lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\extracts\granted\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''To find maintenance fee data''', look in &lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\downloads\maintenance\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Where is the Code? ===&lt;br /&gt;
&lt;br /&gt;
The code has the same parent directory as the data, so it is at &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src&amp;lt;/code&amp;gt;. You might notice a lot of single-entry directories; this is an idiomatic Java pattern that is used for package separation. If using IntelliJ or some other IDE, these directories are a bit less annoying.&lt;br /&gt;
&lt;br /&gt;
The development environment is Java 8 JDK, IntelliJ Ultimate IDE, Maven build tools, and git VCS.&lt;br /&gt;
&lt;br /&gt;
The git repository can be found at http://rdp.mcnaircenter.org/codebase/Repository/ReproduciblePatent&lt;br /&gt;
&lt;br /&gt;
==== Prior Art ====&lt;br /&gt;
&lt;br /&gt;
This tool is not so concerned with adding new functionality; rather, it aims to take a bunch of spread out Perl scripts and create a faster system that is easier to work with. As such, its functionality is largely stolen from those scripts:&lt;br /&gt;
&lt;br /&gt;
* Downloader: &amp;lt;code&amp;gt;E:\McNair\Software\Scripts\Patent\USPTO_Parser.pl&amp;lt;/code&amp;gt;&lt;br /&gt;
* XML Splitter: &amp;lt;code&amp;gt;E:\McNair\PatentData\splitter.pl&amp;lt;/code&amp;gt;&lt;br /&gt;
* XML Parsing: &amp;lt;code&amp;gt;E:\McNair\PatentData\Processed\xmlparser_4.5_4.4_4.3.pl&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;E:\McNair\PatentData\Processed\*.pm&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In addition, I used several non-standard Java libraries listed below:&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/mashape/unirest-java/ Unirest] for easy HTTP requests (MIT License)&lt;br /&gt;
* [https://github.com/google/guava Google Guava] for immutable collections and Stream utilities (Apache v2.0 License)&lt;br /&gt;
* [https://github.com/jhy/jsoup/ jsoup] for HTML parsing (MIT License)&lt;br /&gt;
* [http://commons.apache.org/proper/commons-codec/ Apache Commons Codec] (Apache v2.0 License)&lt;br /&gt;
* [http://commons.apache.org/proper/commons-lang/ Apache Commons Lang v3] (Apache v2.0 License)&lt;br /&gt;
* [https://mvnrepository.com/artifact/org.jetbrains/annotations/15.0 Jetbrains Annotations] for enhanced null checks (Apache v2.0 License)&lt;br /&gt;
* [http://search.maven.org/#artifactdetails%7Corg.postgresql%7Cpostgresql%7C42.1.1.jre7%7Cbundle PostgreSQL JDBC] (BSD 3-clause per https://github.com/pgjdbc/pgjdbc-jre7/blob/master/LICENSE)&lt;br /&gt;
&lt;br /&gt;
If using Maven, these dependencies are listed in the project's pom.xml and should be set up automatically.&lt;br /&gt;
&lt;br /&gt;
==== Using Code ====&lt;br /&gt;
&lt;br /&gt;
Any file with a line that says &amp;lt;code&amp;gt;public static void main(String[] args) {&amp;lt;/code&amp;gt; can be run as a standalone file. The easiest way to do this is to load the project and then the file in IntelliJ and click the little green play arrow next to this bit of code.&lt;br /&gt;
&lt;br /&gt;
The code can also be run via the standard &amp;lt;code&amp;gt;javac&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;java&amp;lt;/code&amp;gt; commands but since this project has a complicated structure you end up having to run commands like &lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;quot;C:\Program Files\Java\jdk1.8.0_131\bin\java&amp;quot; &amp;quot;-javaagent:C:\Users\OliverC\IntelliJ IDEA 2017.1.3\lib\idea_rt.jar=62364:C:\Users\OliverC\IntelliJ IDEA 2017.1.3\bin&amp;quot; -Dfile.encoding=UTF-8 -classpath &amp;quot;[...contents truncated...];C:\Users\OliverC\.m2\repository\org\postgresql\postgresql\42.1.1\postgresql-42.1.1.jar&amp;quot; org.bakerinstitute.mcnair.uspto_assignments.XmlDriver&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
to include all of the runtime dependencies and it's just not worth it.&lt;br /&gt;
&lt;br /&gt;
==== Altering Code ====&lt;br /&gt;
&lt;br /&gt;
* Use the IntelliJ command Reformat Code (found in the menus at &amp;lt;code&amp;gt;Code &amp;gt; Reformat Code&amp;lt;/code&amp;gt;)&lt;br /&gt;
* Use the Optimize Imports command found under the same menu&lt;br /&gt;
* Use spaces for indentation&lt;br /&gt;
* Loosely try to keep lines below 120 characters&lt;br /&gt;
* Commit changes to the Git remote repository &amp;quot;bonobo&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== Schema Reconciliation ==&lt;br /&gt;
&lt;br /&gt;
For the work by Joe, see the [[Patent Schema Reconciliation]] page &lt;br /&gt;
&lt;br /&gt;
=== Patents (Granted) ===&lt;br /&gt;
&lt;br /&gt;
See &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\examples\granted&amp;lt;/code&amp;gt; for extracted examples showing which fields are available in a sample of the data.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Granted Patent Data Formats&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; | Dates Used !! scope=&amp;quot;col&amp;quot; | Format !! scope=&amp;quot;col&amp;quot; | Location !! scope=&amp;quot;col&amp;quot; | Supported? !! scope=&amp;quot;col&amp;quot; | Utility !! scope=&amp;quot;col&amp;quot; | Reissue !! scope=&amp;quot;col&amp;quot; | Design !! scope=&amp;quot;col&amp;quot; | Plant&lt;br /&gt;
|-&lt;br /&gt;
|January 1976 to December 2001&lt;br /&gt;
|APS&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/vintage&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;del&amp;gt;January 2001 to December 2001&amp;lt;/del&amp;gt;&lt;br /&gt;
|&amp;lt;del&amp;gt;SGML&amp;lt;/del&amp;gt;&lt;br /&gt;
|Ignored; use concurrently recorded APS data&lt;br /&gt;
|No&lt;br /&gt;
|N/A&lt;br /&gt;
|N/A&lt;br /&gt;
|N/A&lt;br /&gt;
|N/A&lt;br /&gt;
|-&lt;br /&gt;
|January 2002 to December 2004&lt;br /&gt;
|XML Version 2.5&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/blunderyears&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2005 to December 2005&lt;br /&gt;
|XML Version 4.0 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2006 to December 2006&lt;br /&gt;
|XML Version 4.1 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2007 to December 2012&lt;br /&gt;
|XML Version 4.2 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2013 to September 24, 2013&lt;br /&gt;
|XML Version 4.3 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|October 8, 2013 to December 2014&lt;br /&gt;
|XML Version 4.4 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2015 to December 2016&lt;br /&gt;
|XML Version 4.5 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== APS Rosetta Stone ===&lt;br /&gt;
&lt;br /&gt;
The Advanced Patent System (APS) is a fixed-width text format used to store historical patent grant data. The documentation for this sucks; there are pages missing at random. Luckily, we only care about the content contained here: [[File:PatentFullTextAPSDoc_GreenBook_pgs13-22.pdf]].&lt;br /&gt;
&lt;br /&gt;
It's worth mentioning that the APS contains an advanced text markup system for chemical formulae, basic text markup, tables, etc. that can lead to seemingly garbled text that is perfectly well-formed.&lt;br /&gt;
&lt;br /&gt;
==== APS Gotchas ====&lt;br /&gt;
&lt;br /&gt;
* PATN.WKU is the granted patent number. It is 7 digits while the spec promises 6 digits. The rightmost digit is a check digit modulus 11. See [[File:Aps-wku-modulus11.pdf]] for the words from the horse's mouth.&lt;br /&gt;
&lt;br /&gt;
=== Patents (Applications) ===&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Patent Application Data Formats&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; | Dates Used !! scope=&amp;quot;col&amp;quot; | Format !! scope=&amp;quot;col&amp;quot; | Location !! scope=&amp;quot;col&amp;quot; | Supported by Parser?&lt;br /&gt;
|-&lt;br /&gt;
|March 15, 2001 to December 2001&lt;br /&gt;
|XML Version 1.5&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/vintage&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Yes, for basic information, inventors, and correspondents&lt;br /&gt;
|-&lt;br /&gt;
|January 2002 to December 2004&lt;br /&gt;
|XML Version 1.6&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/vintage&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2005 to December 2005&lt;br /&gt;
|XML Version 4.0 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2006 to December 2006&lt;br /&gt;
|XML Version 4.1 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2007 to December 2012&lt;br /&gt;
|XML Version 4.2 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2013 to December 2014&lt;br /&gt;
|XML Version 4.3 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2015 to ''Present''&lt;br /&gt;
|XML Version 4.4 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Database ==&lt;br /&gt;
&lt;br /&gt;
Because there isn't a compelling reason not to, I used the existing PostgreSQL infrastructure on the RDP.&lt;br /&gt;
The &amp;quot;Java Way&amp;quot; of interacting with databases is the Java Database Connectivity API (JDBC), an implementation-agnostic API for interacting with databases.&lt;br /&gt;
This project uses the stock [https://jdbc.postgresql.org/ Postgres JDBC] driver, version 42.1.1.&lt;br /&gt;
&lt;br /&gt;
=== Create an empty database on RDP ===&lt;br /&gt;
&lt;br /&gt;
To create an empty database, run this command: &amp;lt;code&amp;gt;$ createdb --username=postgres database-name-goes-here # password is tabspaceenter&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Abstraction Layer ===&lt;br /&gt;
&lt;br /&gt;
Since writing raw SQL is a bit cumbersome and error-prone, I have added some abstraction layers that make it much easier to quickly add bulk data. By using Postgres's &amp;lt;code&amp;gt;CopyManager&amp;lt;/code&amp;gt; class, we buffer SQL copy commands in memory (as many as possible) and then flush these rows. To understand how the abstraction layers work, see the code in &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\postgres&amp;lt;/code&amp;gt;. See &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\models\GrantedPatent.java&amp;lt;/code&amp;gt; for '''an example of how to extend''' the abstraction layer to deal with more complex scenarios.&lt;br /&gt;
&lt;br /&gt;
=== New Table Checklist ===&lt;br /&gt;
&lt;br /&gt;
* Create schema DDL SQL code for the new table in &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\db&amp;lt;/code&amp;gt;&lt;br /&gt;
* Run the schema creation&lt;br /&gt;
* Create an enum with the same names for attributes as in the DDL (case-insensitive! prefer all-caps screaming snake case)&lt;br /&gt;
* Create a class which subclasses &amp;lt;code&amp;gt;AbstractInsertableData&amp;lt;/code&amp;gt;&lt;br /&gt;
* Inside that class, create a static class which subclasses &amp;lt;code&amp;gt;AbstractTableMetadata&amp;lt;/code&amp;gt; and has the proper values for getTableName(), getStringColumns(), getIntColumns()&lt;br /&gt;
* (Optional) Implement builder pattern&lt;br /&gt;
* (Optional) Create a custom &amp;lt;code&amp;gt;DatabaseHelper&amp;lt;/code&amp;gt; for complex extras (see PatentApplication and GrantedPatent for examples)&lt;br /&gt;
* Write the data to the table (see DatabaseHelper for the pattern I use)&lt;br /&gt;
&lt;br /&gt;
== Address Data ==&lt;br /&gt;
&lt;br /&gt;
To get the most granular address data (street level, or at least postcode level) about who owns patents, the path is not so straightforward because of the complicated mapping of ownership to a granted patent.&lt;br /&gt;
This is the final part of this project that I am working on and it is all at the level of SQL.&lt;br /&gt;
See &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\db\joins&amp;lt;/code&amp;gt; for my attempts to create a clean mapping.&lt;br /&gt;
Optimistically speaking, the data generated here should be a superset of the data present in the Patent Assignment Data Restructure project.&lt;br /&gt;
&lt;br /&gt;
Note that as of the beginning of August 2017, this part '''has not been completed.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Intuition ===&lt;br /&gt;
&lt;br /&gt;
Use &amp;lt;code&amp;gt;assignments_longform.last_update_date&amp;lt;/code&amp;gt; to find the current/latest (or first/earliest) date of assignment. Then match with &amp;lt;code&amp;gt;properties.docid&amp;lt;/code&amp;gt; on &amp;lt;code&amp;gt;reelno, frameno&amp;lt;/code&amp;gt; to find the patent application id. With this mapping to granted patents, we can discover the details of the original granted patent. And with the right date, reelno, and frameno, we can match to the &amp;lt;code&amp;gt;assignees&amp;lt;/code&amp;gt; table and get fine-grained addresses.&lt;br /&gt;
&lt;br /&gt;
== Related Pages ==&lt;br /&gt;
&lt;br /&gt;
* [[Redesign_Assignment_and_Patent_Database|Redesign Assignment and Patent Database, Fall 2017 by Shelby]]&lt;br /&gt;
* [[Equivalent_XPath_and_APS_Queries|Equivalent XPath and APS Queries, Summer 2017 by Oliver &amp;amp; Joe]]&lt;br /&gt;
* [[US_Address_Verification|US Address Verification, Summer 2017 based on tables from Assignment Data Restructure]]&lt;br /&gt;
* [[Patent_Assignment_Data_Restructure|Assignment Data Restructure, Spring 2017 by Marcela and Sonia]]&lt;br /&gt;
* [[Redesigning_Patent_Database|Redesigning Patent Database, Spring 2017 by Shelby]]&lt;br /&gt;
* [[Patent_Data_Cleanup_(June_2016)|Patent Data Cleanup, June 2016 by Marcela]]&lt;br /&gt;
* [[Patent_Data|Patent Data, Spring 2016 by Marcela]] &lt;br /&gt;
* [[Lex_Machina|Lex Machina]]&lt;br /&gt;
* [[USPTO_Patent_Litigation_Data|USPTO Patent Litigation Research Dataset by Ed]]&lt;br /&gt;
* [[Patent_Litigation_and_Review|Patent Litigation and Review by Marcela]]&lt;br /&gt;
* [[Patent|Existing Database Schema]]&lt;br /&gt;
* [[Oliver_Chang_(Work_Log)|My Work Log]]&lt;br /&gt;
&lt;br /&gt;
== External Links ==&lt;br /&gt;
&lt;br /&gt;
* Understanding Assignment Data: [https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf USPTO Documentation on their cleanup of this data]&lt;br /&gt;
* [https://bulkdata.uspto.gov/data2/patent/grant/redbook/fulltext/1976/PatentFullTextAPSDoc_GreenBook.pdf USPTO Green Book (APS) Documentation]&lt;br /&gt;
* [https://bulkdata.uspto.gov/ USPTO Bulk Data Storage System (BDSS)]&lt;br /&gt;
* [https://en.wikipedia.org/wiki/Builder_pattern Builder Pattern in Object-Oriented Programming]&lt;br /&gt;
* [http://rdp.mcnaircenter.org/codebase/Repository/ReproduciblePatent Git Repository]&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Reproducible_Patent_Data&amp;diff=22326</id>
		<title>Reproducible Patent Data</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Reproducible_Patent_Data&amp;diff=22326"/>
		<updated>2017-12-07T21:44:51Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* Troubleshooting */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has Image=Uspto web logo.jpg&lt;br /&gt;
|Has title=Reproducible Patent Data&lt;br /&gt;
|Has owner=Oliver Chang&lt;br /&gt;
|Has start date=May 17&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Does subsume=Redesigning Patent Database, Patent Assignment Data Restructure,&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
A continuation of [[Redesigning Patent Database]] that aims to write faster, more centralized code to deal with data from the United States Patent and Trademark Office (USPTO). By having an end-to-end pipeline we can easily reproduce or update data without worrying about unintentional side effects or missing data.&lt;br /&gt;
&lt;br /&gt;
== Quickstart ==&lt;br /&gt;
&lt;br /&gt;
To get up and running with the code, do the following:&lt;br /&gt;
&lt;br /&gt;
# Clone the git project (link at end of page) to your user directory&lt;br /&gt;
# Launch IntelliJ with &amp;gt;= Java 8 and Maven configured (the default version installed on the RDP is set up to do this)&lt;br /&gt;
# Open project in IntelliJ&lt;br /&gt;
# Create an empty database (see [[#Database]])&lt;br /&gt;
# Run the table creation scripts in &amp;lt;code&amp;gt;src/db/schemas/&amp;lt;/code&amp;gt; in your new database&lt;br /&gt;
# Modify the constant &amp;lt;code&amp;gt;DATABASE_NAME&amp;lt;/code&amp;gt; in &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\postgres\DatabaseHelper.java&amp;lt;/code&amp;gt;&lt;br /&gt;
# Run the Driver scripts in IntelliJ with the correct value for &amp;lt;code&amp;gt;DATA_DIRECTORY&amp;lt;/code&amp;gt; (or run &amp;lt;code&amp;gt;RunInitialImport.java&amp;lt;/code&amp;gt; which will do all of the data directories for that patent item type)&lt;br /&gt;
# [Take a really, really long lunch...in total should take no more than five hours to load data on RDP]&lt;br /&gt;
# Run scripts in &amp;lt;code&amp;gt;src/db/constraints&amp;lt;/code&amp;gt; to check data assumptions&lt;br /&gt;
# That's it!&lt;br /&gt;
&lt;br /&gt;
===Troubleshooting===&lt;br /&gt;
&lt;br /&gt;
If you're new to IntelliJ (and even if you're not) you might run into problems with importing the project. &lt;br /&gt;
&lt;br /&gt;
'''Setting Up Project as a Maven project'''&lt;br /&gt;
It should be clear if the project is not set up as a Maven project: when you right-click on RunInitialImport.java, for example, you won't see an option with a green triangle next to it that says &amp;quot;Run 'RunInitialImport.java'&amp;quot;, and the green triangle in the top toolbar will be grayed out. If the project is not set up as a Maven project, you will not be able to run any of the code. To set it up as a Maven project, follow [https://www.jetbrains.com/help/idea/maven.html#maven_import_project_start these instructions] when you import the project.&lt;br /&gt;
&lt;br /&gt;
* Note that when you click &amp;quot;Import Project&amp;quot;, you should select the &amp;quot;Simpler Patent Data&amp;quot; folder, not the &amp;quot;src&amp;quot; folder within Simpler Patent Data, otherwise you won't get the pom.xml file that you need to let IntelliJ know that this is a Maven project.&lt;br /&gt;
* On the second window (there will be several options with check boxes next to them) make sure &amp;quot;import Maven projects automatically&amp;quot; is selected&lt;br /&gt;
* On the &amp;quot;Please select project SDK&amp;quot; window, make sure it says &amp;quot;1.8&amp;quot; in the &amp;quot;Name&amp;quot; slot. &lt;br /&gt;
* On the next window, enter a name for the project and enter a folder location.&lt;br /&gt;
&lt;br /&gt;
This should ensure that the project is set up as a Maven project. &lt;br /&gt;
&lt;br /&gt;
'''Setting Up Your Data Source'''&lt;br /&gt;
&lt;br /&gt;
If you run into a message across the top that says something along the lines of &amp;quot;Configure Data Source&amp;quot;, then you have not connected IntelliJ to a database. You will not be able to run the code located under src/db until you configure one. Start by clicking on the link in the message, or if it doesn't appear, follow the instructions  [https://www.jetbrains.com/help/idea/connecting-to-a-database.html here] to open up the &amp;quot;Data Sources and Drivers&amp;quot; pop-up to add a PostgreSQL database. When you get to the dialogue asking about the host, database, user, and password, do the following to connect to the database on the RDP:&lt;br /&gt;
&lt;br /&gt;
 host: localhost&lt;br /&gt;
 database: whatever the constant DATABASE_NAME is set to - the default is patentsj&lt;br /&gt;
 user: postgres&lt;br /&gt;
 password: tabspaceenter&lt;br /&gt;
&lt;br /&gt;
Make sure to test the connection by clicking &amp;quot;Test Connection&amp;quot;. Now you should be able to run the scripts under src/db.&lt;br /&gt;
&lt;br /&gt;
== Directory Layout ==&lt;br /&gt;
&lt;br /&gt;
=== Where is the Data? ===&lt;br /&gt;
&lt;br /&gt;
==== Directories ====&lt;br /&gt;
&lt;br /&gt;
All of the information for this project is located at &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
There are several interesting directories:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;data/downloads/&amp;lt;/code&amp;gt; is USPTO bulkdata, unmodified straight from the scraper&lt;br /&gt;
* &amp;lt;code&amp;gt;data/extracts/&amp;lt;/code&amp;gt; is a directory of a strict subset of the information stored in &amp;lt;code&amp;gt;data/downloads/&amp;lt;/code&amp;gt;. It is the result of running a bulk 7-zip job on that directory to get everything unzipped in a flat data structure. Note that these files have the USPTO modified-by time since that metadata is stored in the zipfiles. To extract files in this nice format, select all of the zipfiles and set up an extraction job as in this [[media:7zip-params.png|screenshot]]&lt;br /&gt;
* &amp;lt;code&amp;gt;data/backups/&amp;lt;/code&amp;gt; is a 7zip'd backup of the corresponding directory in extracts&lt;br /&gt;
* &amp;lt;code&amp;gt;src/&amp;lt;/code&amp;gt; is the main code repository for the java project&lt;br /&gt;
&lt;br /&gt;
==== Input Files ====&lt;br /&gt;
&lt;br /&gt;
All of the text-only Red Book files for '''granted patents''' from 1976 to 2016, inclusive, are available. To find a specific year's file, look in&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\extracts\granted\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''To find application data''' from 2001 to 2016, inclusive, look in&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\extracts\applications\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''To find assignment data''', look in &lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\extracts\granted\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''To find maintenance fee data''', look in &lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\downloads\maintenance\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Where is the Code? ===&lt;br /&gt;
&lt;br /&gt;
The code has the same parent directory as the data, so it is at &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src&amp;lt;/code&amp;gt;. You might notice a lot of single-entry directories; this is an idiomatic Java pattern that is used for package separation. If using IntelliJ or some other IDE, these directories are a bit less annoying.&lt;br /&gt;
&lt;br /&gt;
The development environment is Java 8 JDK, IntelliJ Ultimate IDE, Maven build tools, and git VCS.&lt;br /&gt;
&lt;br /&gt;
The git repository can be found at http://rdp.mcnaircenter.org/codebase/Repository/ReproduciblePatent&lt;br /&gt;
&lt;br /&gt;
==== Prior Art ====&lt;br /&gt;
&lt;br /&gt;
This tool is not so concerned with adding new functionality; rather, it aims to take a bunch of spread out Perl scripts and create a faster system that is easier to work with. As such, its functionality is largely stolen from those scripts:&lt;br /&gt;
&lt;br /&gt;
* Downloader: &amp;lt;code&amp;gt;E:\McNair\Software\Scripts\Patent\USPTO_Parser.pl&amp;lt;/code&amp;gt;&lt;br /&gt;
* XML Splitter: &amp;lt;code&amp;gt;E:\McNair\PatentData\splitter.pl&amp;lt;/code&amp;gt;&lt;br /&gt;
* XML Parsing: &amp;lt;code&amp;gt;E:\McNair\PatentData\Processed\xmlparser_4.5_4.4_4.3.pl&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;E:\McNair\PatentData\Processed\*.pm&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In addition, I used several non-standard Java libraries listed below:&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/mashape/unirest-java/ Unirest] for easy HTTP requests (MIT License)&lt;br /&gt;
* [https://github.com/google/guava Google Guava] for immutable collections and Stream utilities (Apache v2.0 License)&lt;br /&gt;
* [https://github.com/jhy/jsoup/ jsoup] for HTML parsing (MIT License)&lt;br /&gt;
* [http://commons.apache.org/proper/commons-codec/ Apache Commons Codec] (Apache v2.0 License)&lt;br /&gt;
* [http://commons.apache.org/proper/commons-lang/ Apache Commons Lang v3] (Apache v2.0 License)&lt;br /&gt;
* [https://mvnrepository.com/artifact/org.jetbrains/annotations/15.0 Jetbrains Annotations] for enhanced null checks (Apache v2.0 License)&lt;br /&gt;
* [http://search.maven.org/#artifactdetails%7Corg.postgresql%7Cpostgresql%7C42.1.1.jre7%7Cbundle PostgreSQL JDBC] (BSD 3-clause per https://github.com/pgjdbc/pgjdbc-jre7/blob/master/LICENSE)&lt;br /&gt;
&lt;br /&gt;
If using Maven, these dependencies are listed in the project's pom.xml and should be set up automatically.&lt;br /&gt;
&lt;br /&gt;
==== Using Code ====&lt;br /&gt;
&lt;br /&gt;
Any file with a line that says &amp;lt;code&amp;gt;public static void main(String[] args) {&amp;lt;/code&amp;gt; can be run as a standalone file. The easiest way to do this is to load the project and then the file in IntelliJ and click the little green play arrow next to this bit of code.&lt;br /&gt;
&lt;br /&gt;
The code can also be run via the standard &amp;lt;code&amp;gt;javac&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;java&amp;lt;/code&amp;gt; commands but since this project has a complicated structure you end up having to run commands like &lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;quot;C:\Program Files\Java\jdk1.8.0_131\bin\java&amp;quot; &amp;quot;-javaagent:C:\Users\OliverC\IntelliJ IDEA 2017.1.3\lib\idea_rt.jar=62364:C:\Users\OliverC\IntelliJ IDEA 2017.1.3\bin&amp;quot; -Dfile.encoding=UTF-8 -classpath &amp;quot;[...contents truncated...];C:\Users\OliverC\.m2\repository\org\postgresql\postgresql\42.1.1\postgresql-42.1.1.jar&amp;quot; org.bakerinstitute.mcnair.uspto_assignments.XmlDriver&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
to include all of the runtime dependencies and it's just not worth it.&lt;br /&gt;
&lt;br /&gt;
==== Altering Code ====&lt;br /&gt;
&lt;br /&gt;
* Use the IntelliJ command Reformat Code (found in the menus at &amp;lt;code&amp;gt;Code &amp;gt; Reformat Code&amp;lt;/code&amp;gt;)&lt;br /&gt;
* Use the optimize imports function found under the same menu&lt;br /&gt;
* Use spaces for indentation&lt;br /&gt;
* Loosely try to keep lines below 120 characters&lt;br /&gt;
* Commit changes to the Git remote repository &amp;quot;bonobo&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== Schema Reconciliation ==&lt;br /&gt;
&lt;br /&gt;
For the work by Joe, see the [[Patent Schema Reconciliation]] page &lt;br /&gt;
&lt;br /&gt;
=== Patents (Granted) ===&lt;br /&gt;
&lt;br /&gt;
See &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\examples\granted&amp;lt;/code&amp;gt; for extracted examples of what specific data is available for a sample of the data.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Granted Patent Data Formats&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; | Dates Used !! scope=&amp;quot;col&amp;quot; | Format !! scope=&amp;quot;col&amp;quot; | Location !! scope=&amp;quot;col&amp;quot; | Supported? !! scope=&amp;quot;col&amp;quot; | Utility !! scope=&amp;quot;col&amp;quot; | Reissue !! scope=&amp;quot;col&amp;quot; | Design !! scope=&amp;quot;col&amp;quot; | Plant&lt;br /&gt;
|-&lt;br /&gt;
|January 1976 to December 2001&lt;br /&gt;
|APS&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/vintage&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;del&amp;gt;January 2001 to December 2001&amp;lt;/del&amp;gt;&lt;br /&gt;
|&amp;lt;del&amp;gt;SGML&amp;lt;/del&amp;gt;&lt;br /&gt;
|Ignored; use concurrently recorded APS data&lt;br /&gt;
|No&lt;br /&gt;
|N/A&lt;br /&gt;
|N/A&lt;br /&gt;
|N/A&lt;br /&gt;
|N/A&lt;br /&gt;
|-&lt;br /&gt;
|January 2002 to December 2004&lt;br /&gt;
|XML Version 2.5&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/blunderyears&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2005 to December 2005&lt;br /&gt;
|XML Version 4.0 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2006 to December 2006&lt;br /&gt;
|XML Version 4.1 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2007 to December 2012&lt;br /&gt;
|XML Version 4.2 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2013 to September 24, 2013&lt;br /&gt;
|XML Version 4.3 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|October 8, 2013 to December 2014&lt;br /&gt;
|XML Version 4.4 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2015 to December 2016&lt;br /&gt;
|XML Version 4.5 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|}&lt;br /&gt;
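&lt;br /&gt;
The table above can be read as a lookup from grant year to data format and extract directory. A minimal Python sketch of that lookup follows; the function name and the collapsed label &amp;lt;code&amp;gt;XML Version 4.x ICE&amp;lt;/code&amp;gt; are illustrative assumptions and are not part of the Java code base.&lt;br /&gt;
&lt;br /&gt;
```python
# Year ranges, formats, and directories copied from the table above.
# The 4.0 to 4.5 ICE rows share one directory, so they are collapsed here
# under a made-up label ("XML Version 4.x ICE").
GRANTED_FORMATS = [
    (range(1976, 2002), 'APS', 'data/extracts/granted/vintage'),
    (range(2002, 2005), 'XML Version 2.5', 'data/extracts/granted/blunderyears'),
    (range(2005, 2017), 'XML Version 4.x ICE', 'data/extracts/granted/modern'),
]


def granted_format_for_year(year):
    """Return (format, directory) for a grant year, per the table above."""
    for years, fmt, location in GRANTED_FORMATS:
        if year in years:
            return fmt, location
    raise ValueError('no granted-patent format listed for year %d' % year)


print(granted_format_for_year(2003))
```
&lt;br /&gt;
Years outside 1976 to 2016 raise an error rather than guessing, since the table does not cover them.&lt;br /&gt;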
&lt;br /&gt;
=== APS Rosetta Stone ===&lt;br /&gt;
&lt;br /&gt;
The Advanced Patent System (APS) is a fixed-width text format used to store historical patent grant data. The documentation for this sucks; there are pages missing at random. Luckily, we only care about the content contained here: [[File:PatentFullTextAPSDoc_GreenBook_pgs13-22.pdf]].&lt;br /&gt;
&lt;br /&gt;
It's worth mentioning that the APS contains an advanced text markup system for chemical formulae, basic text markup, tables, etc. that can lead to seemingly garbled text that is perfectly well-formed.&lt;br /&gt;
&lt;br /&gt;
==== APS Gotchas ====&lt;br /&gt;
&lt;br /&gt;
* PATN.WKU is the granted patent number. It is 7 digits while the spec promises 6 digits. The rightmost digit is a check digit modulus 11. See [[File:Aps-wku-modulus11.pdf]] for the words from the horse's mouth.&lt;br /&gt;
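&lt;br /&gt;
Assuming the WKU value is the patent number followed by a single trailing check digit, as described above, the check digit can be split off before the number is stored or compared. The Python sketch below only splits the value; it does not verify the check digit, because the modulus 11 rule itself is not reproduced on this page. The function name and the sample value are made up for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
def split_wku(wku):
    """Split a PATN.WKU value into (patent number, check digit).

    Assumption: the rightmost character is the check digit and everything
    before it is the patent number. No checksum validation is attempted.
    """
    value = wku.strip()
    return value[:-1], value[-1:]


# Hypothetical 7-digit sample value, not taken from real data.
number, check_digit = split_wku('4000001')
print(number, check_digit)
```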
&lt;br /&gt;
=== Patents (Applications) ===&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Patent Application Data Formats&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; | Dates Used !! scope=&amp;quot;col&amp;quot; | Format !! scope=&amp;quot;col&amp;quot; | Location !! scope=&amp;quot;col&amp;quot; | Supported by Parser?&lt;br /&gt;
|-&lt;br /&gt;
|March 15, 2001 to December 2001&lt;br /&gt;
|XML Version 1.5&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/vintage&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Yes, for basic information, inventors, and correspondents&lt;br /&gt;
|-&lt;br /&gt;
|January 2002 to December 2004&lt;br /&gt;
|XML Version 1.6&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/vintage&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2005 to December 2005&lt;br /&gt;
|XML Version 4.0 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2006 to December 2006&lt;br /&gt;
|XML Version 4.1 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2007 to December 2012&lt;br /&gt;
|XML Version 4.2 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2013 to December 2014&lt;br /&gt;
|XML Version 4.3 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2015 to ''Present''&lt;br /&gt;
|XML Version 4.4 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Database ==&lt;br /&gt;
&lt;br /&gt;
Because there isn't a compelling reason not to, I used the existing PostgreSQL infrastructure on the RDP.&lt;br /&gt;
The &amp;quot;Java Way&amp;quot; of interacting with databases is the Java Database Connectivity API (JDBC), an implementation-agnostic API for interacting with databases.&lt;br /&gt;
This project uses the stock [https://jdbc.postgresql.org/ Postgres JDBC] driver, version 42.1.1.&lt;br /&gt;
&lt;br /&gt;
=== Create an empty database on RDP ===&lt;br /&gt;
&lt;br /&gt;
To create an empty database, run this command: &amp;lt;code&amp;gt;$ createdb --username=postgres database-name-goes-here # password is tabspaceenter&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Abstraction Layer ===&lt;br /&gt;
&lt;br /&gt;
Since writing raw SQL is a bit cumbersome and error-prone, I have added some abstraction layers that make it much easier to quickly add bulk data. By using Postgres's &amp;lt;code&amp;gt;CopyManager&amp;lt;/code&amp;gt; class, we buffer SQL copy commands in memory (as many as possible) and then flush these rows. To understand how the abstraction layers work, see the code in &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\postgres&amp;lt;/code&amp;gt;. See &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\models\GrantedPatent.java&amp;lt;/code&amp;gt; for '''an example of how to extend''' the abstraction layer to deal with more complex scenarios.&lt;br /&gt;
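&lt;br /&gt;
The buffering idea can be sketched independently of the Java classes. The Python class below is a hypothetical illustration, not the project's code: it accumulates rows in PostgreSQL's COPY text format (tab-separated fields, \N for NULL, backslash escapes) and hands the whole batch to a callback once a row limit is reached. In a real pipeline the callback would be whatever issues the COPY ... FROM STDIN call; the class name, method names, and the fixed row limit are all assumptions.&lt;br /&gt;
&lt;br /&gt;
```python
import io


class CopyBuffer:
    """Accumulate rows in PostgreSQL COPY text format until flushed."""

    def __init__(self, flush_target, max_rows=1000):
        self.flush_target = flush_target  # callable that receives the buffered text
        self.max_rows = max_rows          # assumed limit; the real code may differ
        self.buffer = io.StringIO()
        self.row_count = 0

    @staticmethod
    def encode_field(value):
        """Encode one field: NULL becomes \\N, special characters are escaped."""
        if value is None:
            return '\\N'
        text = str(value).replace('\\', '\\\\')
        return text.replace('\t', '\\t').replace('\n', '\\n').replace('\r', '\\r')

    def add_row(self, values):
        """Append one row and flush automatically when the limit is reached."""
        self.buffer.write('\t'.join(self.encode_field(v) for v in values))
        self.buffer.write('\n')
        self.row_count += 1
        if self.row_count == self.max_rows:
            self.flush()

    def flush(self):
        """Hand any buffered rows to the callback and start a fresh buffer."""
        if self.row_count == 0:
            return
        self.flush_target(self.buffer.getvalue())
        self.buffer = io.StringIO()
        self.row_count = 0


batches = []
buf = CopyBuffer(batches.append, max_rows=2)
buf.add_row(['7654321', None])
buf.add_row(['title with\ttab', 1999])
buf.flush()  # nothing left to send; the limit already triggered a flush
print(len(batches))
```
&lt;br /&gt;
Sending one large batch per round trip is what makes COPY faster than row-by-row INSERT statements; the final explicit flush matters whenever the last batch is smaller than the limit.&lt;br /&gt;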
&lt;br /&gt;
=== New Table Checklist ===&lt;br /&gt;
&lt;br /&gt;
* Create schema DDL SQL code for the new table in &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\db&amp;lt;/code&amp;gt;&lt;br /&gt;
* Run the schema creation&lt;br /&gt;
* Create an enum with the same names for attributes as in the DDL (case-insensitive! prefer all-caps screaming snake case)&lt;br /&gt;
* Create a class which subclasses &amp;lt;code&amp;gt;AbstractInsertableData&amp;lt;/code&amp;gt;&lt;br /&gt;
* Inside that class, create a static class which subclasses &amp;lt;code&amp;gt;AbstractTableMetadata&amp;lt;/code&amp;gt; and has the proper values for getTableName(), getStringColumns(), getIntColumns()&lt;br /&gt;
* (Optional) Implement builder pattern&lt;br /&gt;
* (Optional) Create a custom &amp;lt;code&amp;gt;DatabaseHelper&amp;lt;/code&amp;gt; for complex extras (see PatentApplication and GrantedPatent for examples)&lt;br /&gt;
* Write the data to the table (see DatabaseHelper for the pattern I use)&lt;br /&gt;
&lt;br /&gt;
== Address Data ==&lt;br /&gt;
&lt;br /&gt;
To get the most granular address data (street level, or at least postcode level) about who owns patents, the path is not so straightforward because of the complicated mapping of ownership to a granted patent.&lt;br /&gt;
This is the final part of this project that I am working on and it is all at the level of SQL.&lt;br /&gt;
See &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\db\joins&amp;lt;/code&amp;gt; for my attempts to create a clean mapping.&lt;br /&gt;
Optimistically speaking, the data generated here should be a superset of the data present in the Patent Assignment Data Restructure project.&lt;br /&gt;
&lt;br /&gt;
Note that as of the beginning of August 2017, this part '''has not been completed.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Intuition ===&lt;br /&gt;
&lt;br /&gt;
Use &amp;lt;code&amp;gt;assignments_longform.last_update_date&amp;lt;/code&amp;gt; to find the current/latest (or first/earliest) date of assignment. Then match with &amp;lt;code&amp;gt;properties.docid&amp;lt;/code&amp;gt; on &amp;lt;code&amp;gt;reelno, frameno&amp;lt;/code&amp;gt; to find the patent application id. With this mapping to granted patents, we can discover the details of the original granted patent. And with the right date, reelno, and frameno, we can match to the &amp;lt;code&amp;gt;assignees&amp;lt;/code&amp;gt; table and get fine-grained addresses.&lt;br /&gt;
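&lt;br /&gt;
The join described above can be tried out on a toy dataset. The Python sketch below uses an in-memory SQLite database purely for illustration; the real tables live in PostgreSQL. Only &amp;lt;code&amp;gt;reelno&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;frameno&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;docid&amp;lt;/code&amp;gt;, and &amp;lt;code&amp;gt;last_update_date&amp;lt;/code&amp;gt; come from the text; the &amp;lt;code&amp;gt;name&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;address&amp;lt;/code&amp;gt; columns, the sample rows, and the function name are assumptions.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

# Hypothetical miniature of the three tables. Column types and the
# name/address columns are made up; the sample rows are not real data.
SCHEMA_AND_SAMPLE_ROWS = """
CREATE TABLE assignments_longform (reelno INTEGER, frameno INTEGER, last_update_date TEXT);
CREATE TABLE properties (reelno INTEGER, frameno INTEGER, docid TEXT);
CREATE TABLE assignees (reelno INTEGER, frameno INTEGER, name TEXT, address TEXT);
INSERT INTO assignments_longform VALUES (100, 1, '2001-05-01'), (200, 7, '2010-03-15');
INSERT INTO properties VALUES (100, 1, 'APP-1'), (200, 7, 'APP-1');
INSERT INTO assignees VALUES (100, 1, 'Old Owner Inc', '1 First St'), (200, 7, 'New Owner LLC', '9 Last Ave');
"""

# Step 1 (dated): attach each assignment date to the document it covers.
# Step 2 (latest): keep the most recent date per document.
# Step 3: use that assignment's reelno and frameno to look up the assignee.
LATEST_ADDRESS_QUERY = """
WITH dated AS (
    SELECT p.docid, a.reelno, a.frameno, a.last_update_date
    FROM assignments_longform AS a
    JOIN properties AS p ON p.reelno = a.reelno AND p.frameno = a.frameno
),
latest AS (
    SELECT docid, MAX(last_update_date) AS last_update_date
    FROM dated
    GROUP BY docid
)
SELECT d.docid, e.name, e.address
FROM dated AS d
JOIN latest AS l ON l.docid = d.docid AND l.last_update_date = d.last_update_date
JOIN assignees AS e ON e.reelno = d.reelno AND e.frameno = d.frameno
ORDER BY d.docid
"""


def latest_assignee_addresses(conn):
    """Return (docid, assignee name, address) for each document's latest assignment."""
    return conn.execute(LATEST_ADDRESS_QUERY).fetchall()


conn = sqlite3.connect(':memory:')
conn.executescript(SCHEMA_AND_SAMPLE_ROWS)
print(latest_assignee_addresses(conn))
```
&lt;br /&gt;
Swapping MAX for MIN in the second step would select the first/earliest assignment instead.&lt;br /&gt;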
&lt;br /&gt;
== Related Pages ==&lt;br /&gt;
&lt;br /&gt;
* [[Redesign_Assignment_and_Patent_Database|Redesign Assignment and Patent Database, Fall 2017 by Shelby]]&lt;br /&gt;
* [[Equivalent_XPath_and_APS_Queries|Equivalent XPath and APS Queries, Summer 2017 by Oliver &amp;amp; Joe]]&lt;br /&gt;
* [[US_Address_Verification|US Address Verification, Summer 2017 based on tables from Assignment Data Restructure]]&lt;br /&gt;
* [[Patent_Assignment_Data_Restructure|Assignment Data Restructure, Spring 2017 by Marcela and Sonia]]&lt;br /&gt;
* [[Redesigning_Patent_Database|Redesigning Patent Database, Spring 2017 by Shelby]]&lt;br /&gt;
* [[Patent_Data_Cleanup_(June_2016)|Patent Data Cleanup, June 2016 by Marcela]]&lt;br /&gt;
* [[Patent_Data|Patent Data, Spring 2016 by Marcela]] &lt;br /&gt;
* [[Lex_Machina|Lex Machina]]&lt;br /&gt;
* [[USPTO_Patent_Litigation_Data|USPTO Patent Litigation Research Dataset by Ed]]&lt;br /&gt;
* [[Patent_Litigation_and_Review|Patent Litigation and Review by Marcela]]&lt;br /&gt;
* [[Patent|Existing Database Schema]]&lt;br /&gt;
* [[Oliver_Chang_(Work_Log)|My Work Log]]&lt;br /&gt;
&lt;br /&gt;
== External Links ==&lt;br /&gt;
&lt;br /&gt;
* Understanding Assignment Data: [https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf USPTO Documentation on their cleanup of this data]&lt;br /&gt;
* [https://bulkdata.uspto.gov/data2/patent/grant/redbook/fulltext/1976/PatentFullTextAPSDoc_GreenBook.pdf USPTO Green Book (APS) Documentation]&lt;br /&gt;
* [https://bulkdata.uspto.gov/ USPTO Bulk Data Storage System (BDSS)]&lt;br /&gt;
* [https://en.wikipedia.org/wiki/Builder_pattern Builder Pattern in Object-Oriented Programming]&lt;br /&gt;
* [http://rdp.mcnaircenter.org/codebase/Repository/ReproduciblePatent Git Repository]&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Reproducible_Patent_Data&amp;diff=22325</id>
		<title>Reproducible Patent Data</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Reproducible_Patent_Data&amp;diff=22325"/>
		<updated>2017-12-07T21:44:07Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* Troubleshooting */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has Image=Uspto web logo.jpg&lt;br /&gt;
|Has title=Reproducible Patent Data&lt;br /&gt;
|Has owner=Oliver Chang&lt;br /&gt;
|Has start date=May 17&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Does subsume=Redesigning Patent Database, Patent Assignment Data Restructure,&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
A continuation of [[Redesigning Patent Database]] that aims to write faster, more centralized code to deal with data from the United States Patent and Trademark Office (USPTO). By having an end-to-end pipeline we can easily reproduce or update data without worrying about unintentional side effects or missing data.&lt;br /&gt;
&lt;br /&gt;
== Quickstart ==&lt;br /&gt;
&lt;br /&gt;
To get up and running with the code, do the following:&lt;br /&gt;
&lt;br /&gt;
# Clone the git project (link at end of page) to your user directory&lt;br /&gt;
# Launch IntelliJ with &amp;gt;= Java 8 and Maven configured (the default version installed on the RDP is set up to do this)&lt;br /&gt;
# Open project in IntelliJ&lt;br /&gt;
# Create an empty database (see [[#Database]])&lt;br /&gt;
# Run the table creation scripts in &amp;lt;code&amp;gt;src/db/schemas/&amp;lt;/code&amp;gt; in your new database&lt;br /&gt;
# Modify the constant &amp;lt;code&amp;gt;DATABASE_NAME&amp;lt;/code&amp;gt; in &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\postgres\DatabaseHelper.java&amp;lt;/code&amp;gt;&lt;br /&gt;
# Run the Driver scripts in IntelliJ with the correct value for &amp;lt;code&amp;gt;DATA_DIRECTORY&amp;lt;/code&amp;gt; (or run &amp;lt;code&amp;gt;RunInitialImport.java&amp;lt;/code&amp;gt; which will do all of the data directories for that patent item type)&lt;br /&gt;
# [Take a really, really long lunch...in total should take no more than five hours to load data on RDP]&lt;br /&gt;
# Run scripts in &amp;lt;code&amp;gt;src/db/constraints&amp;lt;/code&amp;gt; to check data assumptions&lt;br /&gt;
# That's it!&lt;br /&gt;
&lt;br /&gt;
===Troubleshooting===&lt;br /&gt;
&lt;br /&gt;
If you're new to IntelliJ (and even if you're not) you might run into problems with importing the project. &lt;br /&gt;
&lt;br /&gt;
'''Setting Up Project as a Maven project'''&lt;br /&gt;
It should be clear if the project is not set up as a Maven project - when you right click on RunInitialImport.java, for example, you won't see an option with a green triangle next to it that says &amp;quot;Run 'RunInitialImport.java'&amp;quot;, and the green triangle in the top toolbar will be grayed out. If the project is not set up as a Maven project, you will not be able to run any of the code. To set up the project as a Maven project, when you import the project, follow the instructions at the following [https://www.jetbrains.com/help/idea/maven.html#maven_import_project_start link].&lt;br /&gt;
&lt;br /&gt;
* Note that when you click &amp;quot;Import Project&amp;quot;, you should select the &amp;quot;Simpler Patent Data&amp;quot; folder, not the &amp;quot;src&amp;quot; folder within Simpler Patent Data, otherwise you won't get the pom.xml file that you need to let IntelliJ know that this is a Maven project.&lt;br /&gt;
* On the second window (there will be several options with check boxes next to them) make sure &amp;quot;import Maven projects automatically&amp;quot; is selected&lt;br /&gt;
* On the &amp;quot;Please select project SDK&amp;quot; window, make sure it says &amp;quot;1.8&amp;quot; in the &amp;quot;Name&amp;quot; slot. &lt;br /&gt;
* On the next window, enter a name for the project and enter a folder location.&lt;br /&gt;
&lt;br /&gt;
This should ensure that the project is set up as a Maven project. &lt;br /&gt;
&lt;br /&gt;
'''Setting Up Your Data Source'''&lt;br /&gt;
&lt;br /&gt;
If you run into a message across the top that says something along the lines of &amp;quot;Configure Data Source&amp;quot;, then you have not connected IntelliJ to a database. You will not be able to run the code located under src/db until you configure one. Start by clicking on the link in the message, or if it doesn't appear, follow the instructions  [https://www.jetbrains.com/help/idea/connecting-to-a-database.html here] to open up the &amp;quot;Data Sources and Drivers&amp;quot; pop-up to add a PostgreSQL database. When you get to the dialogue asking about the host, database, user, and password, do the following to connect to the database on the RDP:&lt;br /&gt;
&lt;br /&gt;
 host: localhost&lt;br /&gt;
&lt;br /&gt;
 database: whatever the constant DATABASE_NAME is set to - the default is patentsj&lt;br /&gt;
&lt;br /&gt;
 user: postgres&lt;br /&gt;
&lt;br /&gt;
 password: tabspaceenter&lt;br /&gt;
&lt;br /&gt;
Make sure to test the connection by clicking &amp;quot;Test Connection&amp;quot;. Now you should be able to run the scripts under src/db.&lt;br /&gt;
&lt;br /&gt;
== Directory Layout ==&lt;br /&gt;
&lt;br /&gt;
=== Where is the Data? ===&lt;br /&gt;
&lt;br /&gt;
==== Directories ====&lt;br /&gt;
&lt;br /&gt;
All of the information for this project is located at &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
There are several interesting directories:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;data/downloads/&amp;lt;/code&amp;gt; is USPTO bulkdata, unmodified straight from the scraper&lt;br /&gt;
* &amp;lt;code&amp;gt;data/extracts/&amp;lt;/code&amp;gt; is a directory of a strict subset of the information stored in &amp;lt;code&amp;gt;data/downloads/&amp;lt;/code&amp;gt;. It is the result of running a bulk 7-zip job on that directory to get everything unzipped in a flat data structure. Note that these files have the USPTO modified-by time since that metadata is stored in the zipfiles. To extract files in this nice format, select all of the zipfiles and set up an extraction job like the one in this [[media:7zip-params.png|screenshot]]&lt;br /&gt;
* &amp;lt;code&amp;gt;data/backups/&amp;lt;/code&amp;gt; is a 7zip'd backup of the corresponding directory in extracts&lt;br /&gt;
* &amp;lt;code&amp;gt;src/&amp;lt;/code&amp;gt; is the main code repository for the java project&lt;br /&gt;
&lt;br /&gt;
==== Input Files ====&lt;br /&gt;
&lt;br /&gt;
The input files include all of the text-only Red Book files for '''granted patents''' from 1976 to 2016, inclusive. To find a specific year's file, look in&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\extracts\granted\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''To find application data''' from 2001 to 2016, inclusive, look in&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\extracts\applications\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''To find assignment data''', look in &lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\extracts\granted\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''To find maintenance fee data''', look in &lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\downloads\maintenance\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Where is the Code? ===&lt;br /&gt;
&lt;br /&gt;
The code has the same parent directory as the data, so it is at &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src&amp;lt;/code&amp;gt;. You might notice a lot of single-entry directories; this is an idiomatic Java pattern that is used for package separation. If using IntelliJ or some other IDE, these directories are a bit less annoying.&lt;br /&gt;
&lt;br /&gt;
The development environment is Java 8 JDK, IntelliJ Ultimate IDE, Maven build tools, and git VCS.&lt;br /&gt;
&lt;br /&gt;
The git repository can be found at http://rdp.mcnaircenter.org/codebase/Repository/ReproduciblePatent&lt;br /&gt;
&lt;br /&gt;
==== Prior Art ====&lt;br /&gt;
&lt;br /&gt;
This tool is not so concerned with adding new functionality; rather, it aims to take a bunch of spread out Perl scripts and create a faster system that is easier to work with. As such, its functionality is largely stolen from those scripts:&lt;br /&gt;
&lt;br /&gt;
* Downloader: &amp;lt;code&amp;gt;E:\McNair\Software\Scripts\Patent\USPTO_Parser.pl&amp;lt;/code&amp;gt;&lt;br /&gt;
* XML Splitter: &amp;lt;code&amp;gt;E:\McNair\PatentData\splitter.pl&amp;lt;/code&amp;gt;&lt;br /&gt;
* XML Parsing: &amp;lt;code&amp;gt;E:\McNair\PatentData\Processed\xmlparser_4.5_4.4_4.3.pl&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;E:\McNair\PatentData\Processed\*.pm&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In addition, I used several non-standard Java libraries listed below:&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/mashape/unirest-java/ Unirest] for easy HTTP requests (MIT License)&lt;br /&gt;
* [https://github.com/google/guava Google Guava] for immutable collections and Stream utilities (Apache v2.0 License)&lt;br /&gt;
* [https://github.com/jhy/jsoup/ jsoup] for HTML parsing (MIT License)&lt;br /&gt;
* [http://commons.apache.org/proper/commons-codec/ Apache Commons Codec] (Apache v2.0 License)&lt;br /&gt;
* [http://commons.apache.org/proper/commons-lang/ Apache Commons Lang v3] (Apache v2.0 License)&lt;br /&gt;
* [https://mvnrepository.com/artifact/org.jetbrains/annotations/15.0 Jetbrains Annotations] for enhanced null checks (Apache v2.0 License)&lt;br /&gt;
* [http://search.maven.org/#artifactdetails%7Corg.postgresql%7Cpostgresql%7C42.1.1.jre7%7Cbundle PostgreSQL JDBC] (BSD 3-clause per https://github.com/pgjdbc/pgjdbc-jre7/blob/master/LICENSE)&lt;br /&gt;
&lt;br /&gt;
If using Maven, these dependencies are listed and should automatically be set up.&lt;br /&gt;
&lt;br /&gt;
==== Using Code ====&lt;br /&gt;
&lt;br /&gt;
Any file with a line that says &amp;lt;code&amp;gt;public static void main(String[] args) {&amp;lt;/code&amp;gt; can be run as a standalone file. The easiest way to do this is to load the project and then the file in IntelliJ and click the little green play arrow next to this bit of code.&lt;br /&gt;
&lt;br /&gt;
The code can also be run via the standard &amp;lt;code&amp;gt;javac&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;java&amp;lt;/code&amp;gt; commands but since this project has a complicated structure you end up having to run commands like &lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;quot;C:\Program Files\Java\jdk1.8.0_131\bin\java&amp;quot; &amp;quot;-javaagent:C:\Users\OliverC\IntelliJ IDEA 2017.1.3\lib\idea_rt.jar=62364:C:\Users\OliverC\IntelliJ IDEA 2017.1.3\bin&amp;quot; -Dfile.encoding=UTF-8 -classpath &amp;quot;[...contents truncated...];C:\Users\OliverC\.m2\repository\org\postgresql\postgresql\42.1.1\postgresql-42.1.1.jar&amp;quot; org.bakerinstitute.mcnair.uspto_assignments.XmlDriver&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
to include all of the runtime dependencies and it's just not worth it.&lt;br /&gt;
&lt;br /&gt;
==== Altering Code ====&lt;br /&gt;
&lt;br /&gt;
* Use the IntelliJ command Reformat Code (found in the menus at &amp;lt;code&amp;gt;Code &amp;gt; Reformat Code&amp;lt;/code&amp;gt;)&lt;br /&gt;
* Use the optimize imports function found under the same menu&lt;br /&gt;
* Use spaces for indentation&lt;br /&gt;
* Loosely try to keep lines below 120 characters&lt;br /&gt;
* Commit changes to the Git remote repository &amp;quot;bonobo&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== Schema Reconciliation ==&lt;br /&gt;
&lt;br /&gt;
For the work by Joe, see the [[Patent Schema Reconciliation]] page &lt;br /&gt;
&lt;br /&gt;
=== Patents (Granted) ===&lt;br /&gt;
&lt;br /&gt;
See &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\examples\granted&amp;lt;/code&amp;gt; for extracted examples of what specific data is available for a sample of the data.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Granted Patent Data Formats&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; | Dates Used !! scope=&amp;quot;col&amp;quot; | Format !! scope=&amp;quot;col&amp;quot; | Location !! scope=&amp;quot;col&amp;quot; | Supported? !! scope=&amp;quot;col&amp;quot; | Utility !! scope=&amp;quot;col&amp;quot; | Reissue !! scope=&amp;quot;col&amp;quot; | Design !! scope=&amp;quot;col&amp;quot; | Plant&lt;br /&gt;
|-&lt;br /&gt;
|January 1976 to December 2001&lt;br /&gt;
|APS&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/vintage&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;del&amp;gt;January 2001 to December 2001&amp;lt;/del&amp;gt;&lt;br /&gt;
|&amp;lt;del&amp;gt;SGML&amp;lt;/del&amp;gt;&lt;br /&gt;
|Ignored; use concurrently recorded APS data&lt;br /&gt;
|No&lt;br /&gt;
|N/A&lt;br /&gt;
|N/A&lt;br /&gt;
|N/A&lt;br /&gt;
|N/A&lt;br /&gt;
|-&lt;br /&gt;
|January 2002 to December 2004&lt;br /&gt;
|XML Version 2.5&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/blunderyears&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2005 to December 2005&lt;br /&gt;
|XML Version 4.0 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2006 to December 2006&lt;br /&gt;
|XML Version 4.1 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2007 to December 2012&lt;br /&gt;
|XML Version 4.2 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2013 to September 24, 2013&lt;br /&gt;
|XML Version 4.3 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|October 8, 2013 to December 2014&lt;br /&gt;
|XML Version 4.4 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2015 to December 2016&lt;br /&gt;
|XML Version 4.5 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== APS Rosetta Stone ===&lt;br /&gt;
&lt;br /&gt;
The Automated Patent System (APS) is a fixed-width text format used to store historical patent grant data. The documentation for this sucks; there are pages missing at random. Luckily, we only care about the content contained here: [[File:PatentFullTextAPSDoc_GreenBook_pgs13-22.pdf]].&lt;br /&gt;
&lt;br /&gt;
It's worth mentioning that the APS contains an advanced text markup system for chemical formulae, basic text markup, tables, etc. that can lead to seemingly garbled text that is perfectly well-formed.&lt;br /&gt;
&lt;br /&gt;
==== APS Gotchas ====&lt;br /&gt;
&lt;br /&gt;
* PATN.WKU is the granted patent number. It is 7 digits while the spec promises 6 digits. The rightmost digit is a check digit modulus 11. See [[File:Aps-wku-modulus11.pdf]] for the words from the horse's mouth.&lt;br /&gt;
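&lt;br /&gt;
A minimal sketch of handling this gotcha, assuming the rightmost character of the WKU value is always the check digit as described above. The class name &amp;lt;code&amp;gt;WkuParts&amp;lt;/code&amp;gt; and the sample value are hypothetical and do not come from the project code. The check digit is only separated from the number here, not verified against the modulus 11 rule.&lt;br /&gt;
&lt;br /&gt;
```java
// Hypothetical helper (not from the project repository): splits a PATN.WKU
// value into the patent number and its trailing check digit.
final class WkuParts {
    final String patentNumber; // every character except the rightmost one
    final char checkDigit;     // the rightmost character (the check digit)

    private WkuParts(String patentNumber, char checkDigit) {
        this.patentNumber = patentNumber;
        this.checkDigit = checkDigit;
    }

    // Assumption: after trimming, the value is alphanumeric and has at
    // least two characters. The check digit is NOT validated here.
    static WkuParts split(String rawWku) {
        String wku = rawWku.trim();
        if (!wku.matches("[0-9A-Za-z]{2,}")) {
            throw new IllegalArgumentException("Unexpected WKU value: " + rawWku);
        }
        int last = wku.length() - 1;
        return new WkuParts(wku.substring(0, last), wku.charAt(last));
    }

    public static void main(String[] args) {
        // "1234567" is a made-up value used only for illustration.
        WkuParts parts = WkuParts.split("1234567");
        System.out.println(parts.patentNumber + " / " + parts.checkDigit);
    }
}
```
&lt;br /&gt;
Keeping the check digit as a separate field means the stored patent number can be compared directly with numbers from the XML-era files.&lt;br /&gt;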
&lt;br /&gt;
=== Patents (Applications) ===&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Patent Application Data Formats&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; | Dates Used !! scope=&amp;quot;col&amp;quot; | Format !! scope=&amp;quot;col&amp;quot; | Location !! scope=&amp;quot;col&amp;quot; | Supported by Parser?&lt;br /&gt;
|-&lt;br /&gt;
|March 15, 2001 to December 2001&lt;br /&gt;
|XML Version 1.5&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/vintage&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Yes, for basic information, inventors, and correspondents&lt;br /&gt;
|-&lt;br /&gt;
|January 2002 to December 2004&lt;br /&gt;
|XML Version 1.6&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/vintage&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2005 to December 2005&lt;br /&gt;
|XML Version 4.0 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2006 to December 2006&lt;br /&gt;
|XML Version 4.1 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2007 to December 2012&lt;br /&gt;
|XML Version 4.2 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2013 to December 2014&lt;br /&gt;
|XML Version 4.3 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2015 to ''Present''&lt;br /&gt;
|XML Version 4.4 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Database ==&lt;br /&gt;
&lt;br /&gt;
Because there isn't a compelling reason not to, I used the existing PostgreSQL infrastructure on the RDP.&lt;br /&gt;
The &amp;quot;Java Way&amp;quot; of interacting with databases is the Java Database Connectivity API (JDBC), an implementation-agnostic API for interacting with databases.&lt;br /&gt;
This project uses the stock [https://jdbc.postgresql.org/ Postgres JDBC], version 42.1.1&lt;br /&gt;
&lt;br /&gt;
=== Create an empty database on RDP ===&lt;br /&gt;
&lt;br /&gt;
To create an empty database, run this command: &amp;lt;code&amp;gt;$ createdb --username=postgres database-name-goes-here # password is tabspaceenter&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Abstraction Layer ===&lt;br /&gt;
&lt;br /&gt;
Since writing raw SQL is a bit cumbersome and error-prone, I have added some abstraction layers that make it much easier to quickly add bulk data. By using Postgres's &amp;lt;code&amp;gt;CopyManager&amp;lt;/code&amp;gt; class, we buffer SQL copy commands in memory (as many as possible) and then flush these rows. To understand how the abstraction layers work, see the code in &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\postgres&amp;lt;/code&amp;gt;. See &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\models\GrantedPatent.java&amp;lt;/code&amp;gt; for '''an example of how to extend''' the abstraction layer to deal with more complex scenarios.&lt;br /&gt;
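&lt;br /&gt;
The buffering idea can be sketched roughly as follows. This is an illustration only, not the project's actual classes: &amp;lt;code&amp;gt;CopyBuffer&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;Sink&amp;lt;/code&amp;gt;, and the fixed row limit are assumptions made for the example. Rows are collected in PostgreSQL's text COPY format (tab-separated, backslash-escaped) and handed to a sink in one batch.&lt;br /&gt;
&lt;br /&gt;
```java
// Illustrative sketch only; the real abstraction layer lives in the
// org.bakerinstitute.mcnair.postgres package and may look quite different.
final class CopyBuffer {
    // Hypothetical sink. A real one could pass the text to the pgJDBC call
    // new CopyManager((BaseConnection) connection).copyIn(copySql, reader)
    // and wrap the checked exceptions that call can throw.
    interface Sink {
        void write(String copyText);
    }

    private final StringBuilder pending = new StringBuilder();
    private final int maxRows;
    private final Sink sink;
    private int rowCount = 0;

    CopyBuffer(int maxRows, Sink sink) {
        this.maxRows = Math.max(1, maxRows); // always flush eventually
        this.sink = sink;
    }

    // Appends one row in COPY text format: tab-separated, newline-terminated.
    void addRow(String... values) {
        String separator = "";
        for (String value : values) {
            pending.append(separator);
            pending.append(value == null ? "\\N" : escape(value));
            separator = "\t";
        }
        pending.append('\n');
        rowCount++;
        if (rowCount == maxRows) {
            flush();
        }
    }

    // Sends all buffered rows to the sink and empties the buffer.
    void flush() {
        if (rowCount == 0) {
            return;
        }
        sink.write(pending.toString());
        pending.setLength(0);
        rowCount = 0;
    }

    // Escapes the characters that are special in COPY text format.
    static String escape(String value) {
        return value.replace("\\", "\\\\")
                .replace("\t", "\\t")
                .replace("\n", "\\n")
                .replace("\r", "\\r");
    }
}
```
&lt;br /&gt;
Call &amp;lt;code&amp;gt;flush()&amp;lt;/code&amp;gt; once more after the last row so that a partially filled buffer is not lost.&lt;br /&gt;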
&lt;br /&gt;
=== New Table Checklist ===&lt;br /&gt;
&lt;br /&gt;
* Create schema DDL SQL code for the new table in &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\db&amp;lt;/code&amp;gt;&lt;br /&gt;
* Run the schema creation&lt;br /&gt;
* Create an enum with the same names for attributes as in the DDL (case-insensitive! prefer all-caps screaming snake case)&lt;br /&gt;
* Create a class which subclasses &amp;lt;code&amp;gt;AbstractInsertableData&amp;lt;/code&amp;gt;&lt;br /&gt;
* Inside that class, create a static class which subclasses &amp;lt;code&amp;gt;AbstractTableMetadata&amp;lt;/code&amp;gt; and has the proper values for getTableName(), getStringColumns(), getIntColumns()&lt;br /&gt;
* (Optional) Implement builder pattern&lt;br /&gt;
* (Optional) Create a custom databasehelper for complex extras (see PatentApplication and GrantedPatent for examples)&lt;br /&gt;
* Write the data to the table (see DatabaseHelper for the pattern I use)&lt;br /&gt;
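&lt;br /&gt;
As a sketch of the enum step in the checklist above, assuming an imaginary table with three columns: the names &amp;lt;code&amp;gt;ExampleColumn&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;columnName()&amp;lt;/code&amp;gt;, and &amp;lt;code&amp;gt;columnList()&amp;lt;/code&amp;gt; are hypothetical, and the project's own enums and the &amp;lt;code&amp;gt;AbstractTableMetadata&amp;lt;/code&amp;gt; methods may be shaped differently.&lt;br /&gt;
&lt;br /&gt;
```java
import java.util.Locale;

// Hypothetical example for an imaginary table whose DDL declares the columns
// patent_number, grant_date, and title. Not taken from the repository.
enum ExampleColumn {
    PATENT_NUMBER,
    GRANT_DATE,
    TITLE;

    // Lower-cased constant name, matching the column name used in the DDL.
    String columnName() {
        return name().toLowerCase(Locale.ROOT);
    }

    // Builds a comma-separated column list, e.g. for a COPY statement.
    static String columnList() {
        StringBuilder list = new StringBuilder();
        for (ExampleColumn column : values()) {
            if (list.length() != 0) {
                list.append(", ");
            }
            list.append(column.columnName());
        }
        return list.toString();
    }
}
```
&lt;br /&gt;
Deriving the column names from the enum keeps the Java side and the DDL in step: renaming a column means changing one constant and the schema file.&lt;br /&gt;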
&lt;br /&gt;
== Address Data ==&lt;br /&gt;
&lt;br /&gt;
To get the most granular address data (street level, or at least postcode level) about who owns patents, the path is not so straightforward because of the complicated mapping of ownership to a granted patent.&lt;br /&gt;
This is the final part of this project that I am working on and it is all at the level of SQL.&lt;br /&gt;
See &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\db\joins&amp;lt;/code&amp;gt; for my attempts to create a clean mapping.&lt;br /&gt;
Optimistically speaking, the data generated here should be a superset of the data present in the Patent Assignment Data Restructure project.&lt;br /&gt;
&lt;br /&gt;
Note that as of the beginning of August 2017, this part '''has not been completed.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Intuition ===&lt;br /&gt;
&lt;br /&gt;
Use &amp;lt;code&amp;gt;assignments_longform.last_update_date&amp;lt;/code&amp;gt; to find current/latest (or first/earliest) date of assignment. Then match with &amp;lt;code&amp;gt;properties.docid&amp;lt;/code&amp;gt; on &amp;lt;code&amp;gt;reelno, frameno&amp;lt;/code&amp;gt; to find patent application id. With this mapping to granted patents, we can discover the details of the original granted patent. And with the right date and reelno and frameno, we can match to the &amp;lt;code&amp;gt;assignees&amp;lt;/code&amp;gt; table and get fine granularity addresses.&lt;br /&gt;
&lt;br /&gt;
== Related Pages ==&lt;br /&gt;
&lt;br /&gt;
* [[Redesign_Assignment_and_Patent_Database|Redesign Assignment and Patent Database, Fall 2017 by Shelby]]&lt;br /&gt;
* [[Equivalent_XPath_and_APS_Queries|Equivalent XPath and APS Queries, Summer 2017 by Oliver &amp;amp; Joe]]&lt;br /&gt;
* [[US_Address_Verification|US Address Verification, Summer 2017 based on tables from Assignment Data Restructure]]&lt;br /&gt;
* [[Patent_Assignment_Data_Restructure|Assignment Data Restructure, Spring 2017 by Marcela and Sonia]]&lt;br /&gt;
* [[Redesigning_Patent_Database|Redesigning Patent Database, Spring 2017 by Shelby]]&lt;br /&gt;
* [[Patent_Data_Cleanup_(June_2016)|Patent Data Cleanup, June 2016 by Marcela]]&lt;br /&gt;
* [[Patent_Data|Patent Data, Spring 2016 by Marcela]] &lt;br /&gt;
* [[Lex_Machina|Lex Machina]]&lt;br /&gt;
* [[USPTO_Patent_Litigation_Data|USPTO Patent Litigation Research Dataset by Ed]]&lt;br /&gt;
* [[Patent_Litigation_and_Review|Patent Litigation and Review by Marcela]]&lt;br /&gt;
* [[Patent|Existing Database Schema]]&lt;br /&gt;
* [[Oliver_Chang_(Work_Log)|My Work Log]]&lt;br /&gt;
&lt;br /&gt;
== External Links ==&lt;br /&gt;
&lt;br /&gt;
* Understanding Assignment Data: [https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf USPTO Documentation on their cleanup of this data]&lt;br /&gt;
* [https://bulkdata.uspto.gov/data2/patent/grant/redbook/fulltext/1976/PatentFullTextAPSDoc_GreenBook.pdf USPTO Green Book (APS) Documentation]&lt;br /&gt;
* [https://bulkdata.uspto.gov/ USPTO Bulk Data Storage System (BDSS)]&lt;br /&gt;
* [https://en.wikipedia.org/wiki/Builder_pattern Builder Pattern in Object-Oriented Programming]&lt;br /&gt;
* [http://rdp.mcnaircenter.org/codebase/Repository/ReproduciblePatent Git Repository]&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Reproducible_Patent_Data&amp;diff=22324</id>
		<title>Reproducible Patent Data</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Reproducible_Patent_Data&amp;diff=22324"/>
		<updated>2017-12-07T21:43:27Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* Troubleshooting */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has Image=Uspto web logo.jpg&lt;br /&gt;
|Has title=Reproducible Patent Data&lt;br /&gt;
|Has owner=Oliver Chang&lt;br /&gt;
|Has start date=May 17&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Does subsume=Redesigning Patent Database, Patent Assignment Data Restructure,&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
A continuation of [[Redesigning Patent Database]] that aims to write faster, more centralized code to deal with data from the United States Patent and Trademark Office (USPTO). By having an end-to-end pipeline we can easily reproduce or update data without worrying about unintentional side effects or missing data.&lt;br /&gt;
&lt;br /&gt;
== Quickstart ==&lt;br /&gt;
&lt;br /&gt;
To get up and running with the code, do the following:&lt;br /&gt;
&lt;br /&gt;
# Clone the git project (link at end of page) to your user directory&lt;br /&gt;
# Launch IntelliJ with &amp;gt;= Java 8 and Maven configured (the default version installed on the RDP is set up to do this)&lt;br /&gt;
# Open project in IntelliJ&lt;br /&gt;
# Create an empty database (see [[#Database]])&lt;br /&gt;
# Run the table creation scripts in &amp;lt;code&amp;gt;src/db/schemas/&amp;lt;/code&amp;gt; in your new database&lt;br /&gt;
# Modify the constant &amp;lt;code&amp;gt;DATABASE_NAME&amp;lt;/code&amp;gt; in &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\postgres\DatabaseHelper.java&amp;lt;/code&amp;gt;&lt;br /&gt;
# Run the Driver scripts in IntelliJ with the correct value for &amp;lt;code&amp;gt;DATA_DIRECTORY&amp;lt;/code&amp;gt; (or run &amp;lt;code&amp;gt;RunInitialImport.java&amp;lt;/code&amp;gt; which will do all of the data directories for that patent item type)&lt;br /&gt;
# [Take a really, really long lunch...in total should take no more than five hours to load data on RDP]&lt;br /&gt;
# Run scripts in &amp;lt;code&amp;gt;src/db/constraints&amp;lt;/code&amp;gt; to check data assumptions&lt;br /&gt;
# That's it!&lt;br /&gt;
&lt;br /&gt;
===Troubleshooting===&lt;br /&gt;
&lt;br /&gt;
If you're new to IntelliJ (and even if you're not) you might run into problems with importing the project. &lt;br /&gt;
&lt;br /&gt;
'''Setting Up Project as a Maven project'''&lt;br /&gt;
It should be clear if the project is not set up as a Maven project - when you right click on RunInitialImport.java, for example, you won't see an option with a green triangle next to it that says &amp;quot;Run 'RunInitialImport.java'&amp;quot;, and the green triangle in the top toolbar will be grayed out. If the project is not set up as a Maven project, you will not be able to run any of the code. To set up the project as a Maven project, when you import the project, follow the instructions at the following [https://www.jetbrains.com/help/idea/maven.html#maven_import_project_start link].&lt;br /&gt;
&lt;br /&gt;
* Note that when you click &amp;quot;Import Project&amp;quot;, you should select the &amp;quot;Simpler Patent Data&amp;quot; folder, not the &amp;quot;src&amp;quot; folder within Simpler Patent Data, otherwise you won't get the pom.xml file that you need to let IntelliJ know that this is a Maven project.&lt;br /&gt;
* On the second window (there will be several options with check boxes next to them) make sure &amp;quot;import Maven projects automatically&amp;quot; is selected&lt;br /&gt;
* On the &amp;quot;Please select project SDK&amp;quot; window, make sure it says &amp;quot;1.8&amp;quot; in the &amp;quot;Name&amp;quot; slot. &lt;br /&gt;
* On the next window, enter a name for the project and enter a folder location.&lt;br /&gt;
&lt;br /&gt;
This should ensure that the project is set up as a Maven project. &lt;br /&gt;
&lt;br /&gt;
'''Setting Up Your Data Source'''&lt;br /&gt;
&lt;br /&gt;
If you run into a message across the top that says something along the lines of &amp;quot;Configure Data Source&amp;quot;, then you have not connected IntelliJ to a database. You will not be able to run the code located under src/db until you configure one. Start by clicking on the link in the message, or if it doesn't appear, follow the instructions  [https://www.jetbrains.com/help/idea/connecting-to-a-database.html here] to open up the &amp;quot;Data Sources and Drivers&amp;quot; pop-up to add a PostgreSQL database. When you get to the dialogue asking about the host, database, user, and password, do the following to connect to the database on the RDP:&lt;br /&gt;
&lt;br /&gt;
host: localhost&lt;br /&gt;
&lt;br /&gt;
database: whatever the constant DATABASE_NAME is set to - the default is patentsj&lt;br /&gt;
&lt;br /&gt;
user: postgres&lt;br /&gt;
&lt;br /&gt;
password: tabspaceenter&lt;br /&gt;
&lt;br /&gt;
Make sure to test the connection by clicking &amp;quot;Test Connection&amp;quot;. Now you should be able to run the scripts under src/db.&lt;br /&gt;
&lt;br /&gt;
== Directory Layout ==&lt;br /&gt;
&lt;br /&gt;
=== Where is the Data? ===&lt;br /&gt;
&lt;br /&gt;
==== Directories ====&lt;br /&gt;
&lt;br /&gt;
All of the information for this project is located at &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
There are several interesting directories:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;data/downloads/&amp;lt;/code&amp;gt; is USPTO bulkdata, unmodified straight from the scraper&lt;br /&gt;
* &amp;lt;code&amp;gt;data/extracts/&amp;lt;/code&amp;gt; is a directory of a strict subset of the information stored in &amp;lt;code&amp;gt;data/downloads/&amp;lt;/code&amp;gt;. It is the result of running a bulk 7-zip job on that directory to get everything unzipped in a flat data structure. Note that these files have the USPTO modified-by time since that metadata is stored in the zipfiles. To extract files in this nice format, select all of the zipfiles and set up an extraction job like the one in this [[media:7zip-params.png|screenshot]]&lt;br /&gt;
* &amp;lt;code&amp;gt;data/backups/&amp;lt;/code&amp;gt; is a 7zip'd backup of the corresponding directory in extracts&lt;br /&gt;
* &amp;lt;code&amp;gt;src/&amp;lt;/code&amp;gt; is the main code repository for the java project&lt;br /&gt;
&lt;br /&gt;
==== Input Files ====&lt;br /&gt;
&lt;br /&gt;
The input consists of all of the text-only Red Book files for '''granted patents''' from 1976 to 2016, inclusive. To find a specific year's file, look in&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\extracts\granted\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''To find application data''' from 2001 to 2016, inclusive, look in&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\extracts\applications\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''To find assignment data''', look in &lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\extracts\granted\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''To find maintenance fee data''', look in &lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\downloads\maintenance\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Where is the Code? ===&lt;br /&gt;
&lt;br /&gt;
The code has the same parent directory as the data, so it is at &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src&amp;lt;/code&amp;gt;. You might notice a lot of single-entry directories; this is an idiomatic Java pattern that is used for package separation. If using IntelliJ or some other IDE, these directories are a bit less annoying.&lt;br /&gt;
&lt;br /&gt;
The development environment is Java 8 JDK, IntelliJ Ultimate IDE, Maven build tools, and git VCS.&lt;br /&gt;
&lt;br /&gt;
The git repository can be found at http://rdp.mcnaircenter.org/codebase/Repository/ReproduciblePatent&lt;br /&gt;
&lt;br /&gt;
==== Prior Art ====&lt;br /&gt;
&lt;br /&gt;
This tool is not so concerned with adding new functionality; rather, it aims to take a bunch of spread out Perl scripts and create a faster system that is easier to work with. As such, its functionality is largely stolen from those scripts:&lt;br /&gt;
&lt;br /&gt;
* Downloader: &amp;lt;code&amp;gt;E:\McNair\Software\Scripts\Patent\USPTO_Parser.pl&amp;lt;/code&amp;gt;&lt;br /&gt;
* XML Splitter: &amp;lt;code&amp;gt;E:\McNair\PatentData\splitter.pl&amp;lt;/code&amp;gt;&lt;br /&gt;
* XML Parsing: &amp;lt;code&amp;gt;E:\McNair\PatentData\Processed\xmlparser_4.5_4.4_4.3.pl&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;E:\McNair\PatentData\Processed\*.pm&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In addition, I used several non-standard Java libraries listed below:&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/mashape/unirest-java/ Unirest] for easy HTTP requests (MIT License)&lt;br /&gt;
* [https://github.com/google/guava Google Guava] for immutable collections and Stream utilities (Apache v2.0 License)&lt;br /&gt;
* [https://github.com/jhy/jsoup/ jsoup] for HTML parsing (MIT License)&lt;br /&gt;
* [http://commons.apache.org/proper/commons-codec/ Apache Commons Codec] (Apache v2.0 License)&lt;br /&gt;
* [http://commons.apache.org/proper/commons-lang/ Apache Commons Lang v3] (Apache v2.0 License)&lt;br /&gt;
* [https://mvnrepository.com/artifact/org.jetbrains/annotations/15.0 Jetbrains Annotations] for enhanced null checks (Apache v2.0 License)&lt;br /&gt;
* [http://search.maven.org/#artifactdetails%7Corg.postgresql%7Cpostgresql%7C42.1.1.jre7%7Cbundle PostgreSQL JDBC] (BSD 3-clause per https://github.com/pgjdbc/pgjdbc-jre7/blob/master/LICENSE)&lt;br /&gt;
&lt;br /&gt;
If you use Maven, these dependencies are listed in the project's pom.xml and should be set up automatically.&lt;br /&gt;
&lt;br /&gt;
==== Using Code ====&lt;br /&gt;
&lt;br /&gt;
Any file with a line that says &amp;lt;code&amp;gt;public static void main(String[] args) {&amp;lt;/code&amp;gt; can be run as a standalone file. The easiest way to do this is to load the project and then the file in IntelliJ and click the little green play arrow next to this bit of code.&lt;br /&gt;
&lt;br /&gt;
The code can also be run via the standard &amp;lt;code&amp;gt;javac&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;java&amp;lt;/code&amp;gt; commands but since this project has a complicated structure you end up having to run commands like &lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;quot;C:\Program Files\Java\jdk1.8.0_131\bin\java&amp;quot; &amp;quot;-javaagent:C:\Users\OliverC\IntelliJ IDEA 2017.1.3\lib\idea_rt.jar=62364:C:\Users\OliverC\IntelliJ IDEA 2017.1.3\bin&amp;quot; -Dfile.encoding=UTF-8 -classpath &amp;quot;[...contents truncated...];C:\Users\OliverC\.m2\repository\org\postgresql\postgresql\42.1.1\postgresql-42.1.1.jar&amp;quot; org.bakerinstitute.mcnair.uspto_assignments.XmlDriver&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
to include all of the runtime dependencies and it's just not worth it.&lt;br /&gt;
&lt;br /&gt;
==== Altering Code ====&lt;br /&gt;
&lt;br /&gt;
* Use the IntelliJ command Reformat Code (found in the menus at &amp;lt;code&amp;gt;Code &amp;gt; Reformat Code&amp;lt;/code&amp;gt;)&lt;br /&gt;
* Use the optimize imports function found under the same menu&lt;br /&gt;
* Use spaces for indentation&lt;br /&gt;
* Loosely try to keep lines below 120 characters&lt;br /&gt;
* Commit changes to the Git remote repository &amp;quot;bonobo&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== Schema Reconciliation ==&lt;br /&gt;
&lt;br /&gt;
For the work by Joe, see the [[Patent Schema Reconciliation]] page &lt;br /&gt;
&lt;br /&gt;
=== Patents (Granted) ===&lt;br /&gt;
&lt;br /&gt;
See &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\examples\granted&amp;lt;/code&amp;gt; for extracted examples of what specific data is available for a sample of the data.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Granted Patent Data Formats&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; | Dates Used !! scope=&amp;quot;col&amp;quot; | Format !! scope=&amp;quot;col&amp;quot; | Location !! scope=&amp;quot;col&amp;quot; | Supported? !! scope=&amp;quot;col&amp;quot; | Utility !! scope=&amp;quot;col&amp;quot; | Reissue !! scope=&amp;quot;col&amp;quot; | Design !! scope=&amp;quot;col&amp;quot; | Plant&lt;br /&gt;
|-&lt;br /&gt;
|January 1976 to December 2001&lt;br /&gt;
|APS&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/vintage&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;del&amp;gt;January 2001 to December 2001&amp;lt;/del&amp;gt;&lt;br /&gt;
|&amp;lt;del&amp;gt;SGML&amp;lt;/del&amp;gt;&lt;br /&gt;
|Ignored; use concurrently recorded APS data&lt;br /&gt;
|No&lt;br /&gt;
|N/A&lt;br /&gt;
|N/A&lt;br /&gt;
|N/A&lt;br /&gt;
|N/A&lt;br /&gt;
|-&lt;br /&gt;
|January 2002 to December 2004&lt;br /&gt;
|XML Version 2.5&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/blunderyears&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2005 to December 2005&lt;br /&gt;
|XML Version 4.0 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2006 to December 2006&lt;br /&gt;
|XML Version 4.1 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2007 to December 2012&lt;br /&gt;
|XML Version 4.2 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2013 to September 24, 2013&lt;br /&gt;
|XML Version 4.3 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|October 8, 2013 to December 2014&lt;br /&gt;
|XML Version 4.4 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2015 to December 2016&lt;br /&gt;
|XML Version 4.5 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== APS Rosetta Stone ===&lt;br /&gt;
&lt;br /&gt;
The Advanced Patent System (APS) is a fixed-width text format used to store historical patent grant data. The documentation for this sucks; there are pages missing at random. Luckily, we only care about the content contained here: [[File:PatentFullTextAPSDoc_GreenBook_pgs13-22.pdf]].&lt;br /&gt;
&lt;br /&gt;
It's worth mentioning that the APS contains an advanced text markup system for chemical formulae, basic text markup, tables, etc. that can lead to seemingly garbled text that is perfectly well-formed.&lt;br /&gt;
&lt;br /&gt;
==== APS Gotchas ====&lt;br /&gt;
&lt;br /&gt;
* PATN.WKU is the granted patent number. It is 7 digits while the spec promises 6 digits. The rightmost digit is a check digit modulus 11. See [[File:Aps-wku-modulus11.pdf]] for the words from the horse's mouth.&lt;br /&gt;
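&lt;br /&gt;
As a minimal sketch of handling this, assuming the check digit only needs to be discarded rather than verified (the modulus 11 weighting is described in the linked PDF and is not reproduced here), the patent number can be recovered by dropping the rightmost digit. The class and method names are hypothetical.&lt;br /&gt;

```java
/** Hypothetical helper for illustration; not part of the project's code. */
final class WkuSketch {
    /**
     * Returns the WKU value without its rightmost character, which is the check digit.
     * The check digit is not validated here.
     */
    static String stripCheckDigit(String wku) {
        String trimmed = wku.trim();
        if (trimmed.isEmpty() || !Character.isDigit(trimmed.charAt(trimmed.length() - 1))) {
            throw new IllegalArgumentException("expected a value ending in a check digit: " + wku);
        }
        return trimmed.substring(0, trimmed.length() - 1);
    }

    public static void main(String[] args) {
        // "1234567" is a made-up value, not a real patent number.
        System.out.println(stripCheckDigit("1234567"));
    }
}
```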
&lt;br /&gt;
=== Patents (Applications) ===&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Patent Application Data Formats&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; | Dates Used !! scope=&amp;quot;col&amp;quot; | Format !! scope=&amp;quot;col&amp;quot; | Location !! scope=&amp;quot;col&amp;quot; | Supported by Parser?&lt;br /&gt;
|-&lt;br /&gt;
|March 15, 2001 to December 2001&lt;br /&gt;
|XML Version 1.5&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/vintage&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Yes, for basic information, inventors, and correspondents&lt;br /&gt;
|-&lt;br /&gt;
|January 2002 to December 2004&lt;br /&gt;
|XML Version 1.6&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/vintage&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2005 to December 2005&lt;br /&gt;
|XML Version 4.0 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2006 to December 2006&lt;br /&gt;
|XML Version 4.1 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2007 to December 2012&lt;br /&gt;
|XML Version 4.2 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2013 to December 2014&lt;br /&gt;
|XML Version 4.3 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2015 to ''Present''&lt;br /&gt;
|XML Version 4.4 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Database ==&lt;br /&gt;
&lt;br /&gt;
Because there isn't a compelling reason not to, I used the existing PostgreSQL infrastructure on the RDP.&lt;br /&gt;
The &amp;quot;Java Way&amp;quot; of interacting with databases is the Java Database Connectivity API (JDBC), an implementation-agnostic API for interacting with databases.&lt;br /&gt;
This project uses the stock [https://jdbc.postgresql.org/ Postgres JDBC] driver, version 42.1.1.&lt;br /&gt;
&lt;br /&gt;
=== Create an empty database on RDP ===&lt;br /&gt;
&lt;br /&gt;
To create an empty database, run this command: &amp;lt;code&amp;gt;$ createdb --username=postgres database-name-goes-here # password is tabspaceenter&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Abstraction Layer ===&lt;br /&gt;
&lt;br /&gt;
Since writing raw SQL is a bit cumbersome and error-prone, I have added some abstraction layers that make it much easier to quickly add bulk data. By using Postgres's &amp;lt;code&amp;gt;CopyManager&amp;lt;/code&amp;gt; class, we buffer SQL copy commands in memory (as many as possible) and then flush these rows. To understand how the abstraction layers work, see the code in &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\postgres&amp;lt;/code&amp;gt;. See &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\models\GrantedPatent.java&amp;lt;/code&amp;gt; for '''an example of how to extend''' the abstraction layer to deal with more complex scenarios.&lt;br /&gt;
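&lt;br /&gt;
The buffer-then-flush idea can be sketched as follows. This is an illustration only and not the project's implementation: the class, the &amp;lt;code&amp;gt;Sink&amp;lt;/code&amp;gt; interface, and the row limit are all hypothetical, and the sink stands in for the call into the PostgreSQL driver. Rows are written in the tab-separated text format that PostgreSQL's COPY command accepts.&lt;br /&gt;

```java
/** Hypothetical sketch of buffering COPY rows in memory; the names are not the project's. */
final class CopyBufferSketch {
    /** Receives one flushed batch of rows in COPY text format. */
    interface Sink {
        void accept(String payload);
    }

    private final StringBuilder buffer = new StringBuilder();
    private final int maxRows;
    private final Sink sink;
    private int bufferedRows = 0;

    CopyBufferSketch(int maxRows, Sink sink) {
        if (Integer.signum(maxRows) != 1) {
            throw new IllegalArgumentException("maxRows must be positive");
        }
        this.maxRows = maxRows;
        this.sink = sink;
    }

    /** Escapes one value for the COPY text format; null becomes \N. */
    static String escape(String value) {
        if (value == null) {
            return "\\N";
        }
        return value.replace("\\", "\\\\")
                .replace("\t", "\\t")
                .replace("\n", "\\n")
                .replace("\r", "\\r");
    }

    /** Appends one tab-separated row and flushes once maxRows rows are buffered. */
    void addRow(String... columns) {
        String[] escaped = new String[columns.length];
        int i = 0;
        for (String column : columns) {
            escaped[i++] = escape(column);
        }
        buffer.append(String.join("\t", escaped)).append('\n');
        bufferedRows++;
        if (bufferedRows == maxRows) {
            flush();
        }
    }

    /** Hands the buffered rows to the sink and clears the buffer; does nothing when empty. */
    void flush() {
        if (bufferedRows == 0) {
            return;
        }
        sink.accept(buffer.toString());
        buffer.setLength(0);
        bufferedRows = 0;
    }

    public static void main(String[] args) {
        CopyBufferSketch rows = new CopyBufferSketch(2, new Sink() {
            @Override
            public void accept(String payload) {
                // With the PostgreSQL JDBC driver, the batch could be sent here, for example
                // via CopyManager.copyIn(copySql, new StringReader(payload)).
                System.out.print(payload);
            }
        });
        rows.addRow("1234567", "Example title"); // made-up values
        rows.addRow("7654321", null);            // second row reaches the limit and flushes
        rows.flush();                            // nothing left to send in this case
    }
}
```

Batching this way keeps the number of round trips to the database low, at the cost of holding the pending rows in memory until the next flush.&lt;br /&gt;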
&lt;br /&gt;
=== New Table Checklist ===&lt;br /&gt;
&lt;br /&gt;
* Create schema DDL SQL code for the new table in &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\db&amp;lt;/code&amp;gt;&lt;br /&gt;
* Run the schema creation&lt;br /&gt;
* Create an enum with the same names for attributes as in the DDL (case-insensitive! prefer all-caps screaming snake case)&lt;br /&gt;
* Create a class which subclasses &amp;lt;code&amp;gt;AbstractInsertableData&amp;lt;/code&amp;gt;&lt;br /&gt;
* Inside that class, create a static class which subclasses &amp;lt;code&amp;gt;AbstractTableMetadata&amp;lt;/code&amp;gt; and has the proper values for getTableName(), getStringColumns(), getIntColumns()&lt;br /&gt;
* (Optional) Implement builder pattern&lt;br /&gt;
* (Optional) Create a custom databasehelper for complex extras (see PatentApplication and GrantedPatent for examples)&lt;br /&gt;
* Write the data to the table (see DatabaseHelper for the pattern I use)&lt;br /&gt;
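&lt;br /&gt;
The enum and metadata items of the checklist might look roughly like the sketch below. Everything in it is an assumption made for illustration: the table and column names are invented, and &amp;lt;code&amp;gt;TableMetadataStandIn&amp;lt;/code&amp;gt; is a stand-in for &amp;lt;code&amp;gt;AbstractTableMetadata&amp;lt;/code&amp;gt;, whose real signatures may differ. Check the actual classes in the repository before copying anything.&lt;br /&gt;

```java
import java.util.Locale;

/** Hypothetical sketch of the enum and metadata steps; none of these names are the project's. */
final class NewTableSketch {
    /** Enum constants mirror the DDL column names, in screaming snake case. */
    enum ExampleColumn {
        PATENT_NUMBER, TITLE, CLAIM_COUNT
    }

    /** Stand-in for AbstractTableMetadata; the real class's signatures may differ. */
    abstract static class TableMetadataStandIn {
        abstract String getTableName();

        abstract ExampleColumn[] getStringColumns();

        abstract ExampleColumn[] getIntColumns();

        /** Lists string columns, then int columns, in the lower-case form used by the DDL. */
        String columnList() {
            StringBuilder sb = new StringBuilder();
            for (ExampleColumn column : getStringColumns()) {
                append(sb, column);
            }
            for (ExampleColumn column : getIntColumns()) {
                append(sb, column);
            }
            return sb.toString();
        }

        private static void append(StringBuilder sb, ExampleColumn column) {
            if (sb.length() != 0) {
                sb.append(", ");
            }
            sb.append(column.name().toLowerCase(Locale.ROOT));
        }
    }

    /** Metadata for an invented table named example_table. */
    static final class ExampleTableMetadata extends TableMetadataStandIn {
        @Override
        String getTableName() {
            return "example_table";
        }

        @Override
        ExampleColumn[] getStringColumns() {
            return new ExampleColumn[] {ExampleColumn.PATENT_NUMBER, ExampleColumn.TITLE};
        }

        @Override
        ExampleColumn[] getIntColumns() {
            return new ExampleColumn[] {ExampleColumn.CLAIM_COUNT};
        }
    }

    public static void main(String[] args) {
        TableMetadataStandIn metadata = new ExampleTableMetadata();
        System.out.println(metadata.getTableName() + " (" + metadata.columnList() + ")");
    }
}
```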
&lt;br /&gt;
== Address Data ==&lt;br /&gt;
&lt;br /&gt;
Getting the most granular address data (street level, or at least postcode level) about who owns patents is not straightforward because of the complicated mapping of ownership to a granted patent.&lt;br /&gt;
This is the final part of this project that I am working on and it is all at the level of SQL.&lt;br /&gt;
See &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\db\joins&amp;lt;/code&amp;gt; for my attempts to create a clean mapping.&lt;br /&gt;
Optimistically speaking, the data generated here should be a superset of the data present in the Patent Assignment Data Restructure project.&lt;br /&gt;
&lt;br /&gt;
Note that as of the beginning of August 2017, this part '''has not been completed.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Intuition ===&lt;br /&gt;
&lt;br /&gt;
Use &amp;lt;code&amp;gt;assignments_longform.last_update_date&amp;lt;/code&amp;gt; to find current/latest (or first/earliest) date of assignment. Then match with &amp;lt;code&amp;gt;properties.docid&amp;lt;/code&amp;gt; on &amp;lt;code&amp;gt;reelno, frameno&amp;lt;/code&amp;gt; to find patent application id. With this mapping to granted patents, we can discover the details of the original granted patent. And with the right date and reelno and frameno, we can match to the &amp;lt;code&amp;gt;assignees&amp;lt;/code&amp;gt; table and get fine granularity addresses.&lt;br /&gt;
&lt;br /&gt;
== Related Pages ==&lt;br /&gt;
&lt;br /&gt;
* [[Redesign_Assignment_and_Patent_Database|Redesign Assignment and Patent Database, Fall 2017 by Shelby]]&lt;br /&gt;
* [[Equivalent_XPath_and_APS_Queries|Equivalent XPath and APS Queries, Summer 2017 by Oliver &amp;amp; Joe]]&lt;br /&gt;
* [[US_Address_Verification|US Address Verification, Summer 2017 based on tables from Assignment Data Restructure]]&lt;br /&gt;
* [[Patent_Assignment_Data_Restructure|Assignment Data Restructure, Spring 2017 by Marcela and Sonia]]&lt;br /&gt;
* [[Redesigning_Patent_Database|Redesigning Patent Database, Spring 2017 by Shelby]]&lt;br /&gt;
* [[Patent_Data_Cleanup_(June_2016)|Patent Data Cleanup, June 2016 by Marcela]]&lt;br /&gt;
* [[Patent_Data|Patent Data, Spring 2016 by Marcela]] &lt;br /&gt;
* [[Lex_Machina|Lex Machina]]&lt;br /&gt;
* [[USPTO_Patent_Litigation_Data|USPTO Patent Litigation Research Dataset by Ed]]&lt;br /&gt;
* [[Patent_Litigation_and_Review|Patent Litigation and Review by Marcela]]&lt;br /&gt;
* [[Patent|Existing Database Schema]]&lt;br /&gt;
* [[Oliver_Chang_(Work_Log)|My Work Log]]&lt;br /&gt;
&lt;br /&gt;
== External Links ==&lt;br /&gt;
&lt;br /&gt;
* Understanding Assignment Data: [https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf USPTO Documentation on their cleanup of this data]&lt;br /&gt;
* [https://bulkdata.uspto.gov/data2/patent/grant/redbook/fulltext/1976/PatentFullTextAPSDoc_GreenBook.pdf USPTO Green Book (APS) Documentation]&lt;br /&gt;
* [https://bulkdata.uspto.gov/ USPTO Bulk Data Storage System (BDSS)]&lt;br /&gt;
* [https://en.wikipedia.org/wiki/Builder_pattern Builder Pattern in Object-Oriented Programming]&lt;br /&gt;
* [http://rdp.mcnaircenter.org/codebase/Repository/ReproduciblePatent Git Repository]&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Reproducible_Patent_Data&amp;diff=22323</id>
		<title>Reproducible Patent Data</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Reproducible_Patent_Data&amp;diff=22323"/>
		<updated>2017-12-07T21:42:54Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* Troubleshooting */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has Image=Uspto web logo.jpg&lt;br /&gt;
|Has title=Reproducible Patent Data&lt;br /&gt;
|Has owner=Oliver Chang&lt;br /&gt;
|Has start date=May 17&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Does subsume=Redesigning Patent Database, Patent Assignment Data Restructure,&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
A continuation of [[Redesigning Patent Database]] that aims to write faster, more centralized code to deal with data from the United States Patent and Trademark Office (USPTO). By having an end-to-end pipeline we can easily reproduce or update data without worrying about unintentional side effects or missing data.&lt;br /&gt;
&lt;br /&gt;
== Quickstart ==&lt;br /&gt;
&lt;br /&gt;
To get up and running with the code, do the following:&lt;br /&gt;
&lt;br /&gt;
# Clone the git project (link at end of page) to your user directory&lt;br /&gt;
# Launch IntelliJ with Java 8 or later and Maven configured (the default version installed on the RDP is set up to do this)&lt;br /&gt;
# Open project in IntelliJ&lt;br /&gt;
# Create an empty database (see [[#Database]])&lt;br /&gt;
# Run the table creation scripts in &amp;lt;code&amp;gt;src/db/schemas/&amp;lt;/code&amp;gt; in your new database&lt;br /&gt;
# Modify the constant &amp;lt;code&amp;gt;DATABASE_NAME&amp;lt;/code&amp;gt; in &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\postgres\DatabaseHelper.java&amp;lt;/code&amp;gt;&lt;br /&gt;
# Run the Driver scripts in IntelliJ with the correct value for &amp;lt;code&amp;gt;DATA_DIRECTORY&amp;lt;/code&amp;gt; (or run &amp;lt;code&amp;gt;RunInitialImport.java&amp;lt;/code&amp;gt; which will do all of the data directories for that patent item type)&lt;br /&gt;
# [Take a really, really long lunch...in total should take no more than five hours to load data on RDP]&lt;br /&gt;
# Run scripts in &amp;lt;code&amp;gt;src/db/constraints&amp;lt;/code&amp;gt; to check data assumptions&lt;br /&gt;
# That's it!&lt;br /&gt;
&lt;br /&gt;
===Troubleshooting===&lt;br /&gt;
&lt;br /&gt;
If you're new to IntelliJ (and even if you're not) you might run into problems with importing the project. &lt;br /&gt;
&lt;br /&gt;
'''Setting Up Project as a Maven project'''&lt;br /&gt;
It should be clear if the project is not set up as a Maven project: when you right-click on RunInitialImport.java, for example, you won't see an option with a green triangle next to it that says &amp;quot;Run 'RunInitialImport.java'&amp;quot;, and the green triangle in the top toolbar will be grayed out. If the project is not set up as a Maven project, you will not be able to run any of the code. To set up the project as a Maven project, when you import the project, follow the instructions at the following [https://www.jetbrains.com/help/idea/maven.html#maven_import_project_start link].&lt;br /&gt;
&lt;br /&gt;
* Note that when you click &amp;quot;Import Project&amp;quot;, you should select the &amp;quot;Simpler Patent Data&amp;quot; folder, not the &amp;quot;src&amp;quot; folder within Simpler Patent Data, otherwise you won't get the pom.xml file that you need to let IntelliJ know that this is a Maven project.&lt;br /&gt;
* On the second window (there will be several options with check boxes next to them) make sure &amp;quot;import Maven projects automatically&amp;quot; is selected&lt;br /&gt;
* On the &amp;quot;Please select project SDK&amp;quot; window, make sure it says &amp;quot;1.8&amp;quot; in the &amp;quot;Name&amp;quot; slot. &lt;br /&gt;
* On the next window, enter a name for the project and enter a folder location.&lt;br /&gt;
&lt;br /&gt;
This should ensure that the project is set up as a Maven project. &lt;br /&gt;
&lt;br /&gt;
'''Setting Up Your Data Source'''&lt;br /&gt;
&lt;br /&gt;
If a message across the top says something along the lines of &amp;quot;Configure Data Source&amp;quot;, then you have not connected IntelliJ to a database. You will not be able to run the code located under src/db until you configure one. Start by clicking on the link in the message, or if it doesn't appear, follow the instructions [https://www.jetbrains.com/help/idea/connecting-to-a-database.html here] to open the &amp;quot;Data Sources and Drivers&amp;quot; pop-up and add a PostgreSQL database. When you get to the dialog asking for the host, database, user, and password, enter the following to connect to the database on the RDP:&lt;br /&gt;
&lt;br /&gt;
host: localhost&lt;br /&gt;
database: whatever the constant DATABASE_NAME is set to - the default is patentsj&lt;br /&gt;
user: postgres&lt;br /&gt;
password: tabspaceenter&lt;br /&gt;
&lt;br /&gt;
Make sure to test the connection by clicking &amp;quot;Test Connection&amp;quot;. Now you should be able to run the scripts under src/db.&lt;br /&gt;
&lt;br /&gt;
== Directory Layout ==&lt;br /&gt;
&lt;br /&gt;
=== Where is the Data? ===&lt;br /&gt;
&lt;br /&gt;
==== Directories ====&lt;br /&gt;
&lt;br /&gt;
All of the information for this project is located at &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
There are several interesting directories:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;data/downloads/&amp;lt;/code&amp;gt; is USPTO bulkdata, unmodified straight from the scraper&lt;br /&gt;
* &amp;lt;code&amp;gt;data/extracts/&amp;lt;/code&amp;gt; is a directory of a strict subset of the information stored in &amp;lt;code&amp;gt;data/downloads/&amp;lt;/code&amp;gt;. It is the result of running a bulk 7-zip job on that directory to get everything unzipped in a flat data structure. Note that these files have the USPTO modified-by time since that metadata is stored in the zipfiles. To extract files in this format, select all of the zipfiles and set up an extraction job as in this [[media:7zip-params.png|screenshot]]&lt;br /&gt;
* &amp;lt;code&amp;gt;data/backups/&amp;lt;/code&amp;gt; is a 7zip'd backup of the corresponding directory in extracts&lt;br /&gt;
* &amp;lt;code&amp;gt;src/&amp;lt;/code&amp;gt; is the main code repository for the java project&lt;br /&gt;
&lt;br /&gt;
==== Input Files ====&lt;br /&gt;
&lt;br /&gt;
The input consists of all of the text-only Red Book files for '''granted patents''' from 1976 to 2016, inclusive. To find a specific year's file, look in&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\extracts\granted\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''To find application data''' from 2001 to 2016, inclusive, look in&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\extracts\applications\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''To find assignment data''', look in &lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\extracts\granted\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''To find maintenance fee data''', look in &lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\downloads\maintenance\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Where is the Code? ===&lt;br /&gt;
&lt;br /&gt;
The code has the same parent directory as the data, so it is at &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src&amp;lt;/code&amp;gt;. You might notice a lot of single-entry directories; this is an idiomatic Java pattern that is used for package separation. If using IntelliJ or some other IDE, these directories are a bit less annoying.&lt;br /&gt;
&lt;br /&gt;
The development environment is Java 8 JDK, IntelliJ Ultimate IDE, Maven build tools, and git VCS.&lt;br /&gt;
&lt;br /&gt;
The git repository can be found at http://rdp.mcnaircenter.org/codebase/Repository/ReproduciblePatent&lt;br /&gt;
&lt;br /&gt;
==== Prior Art ====&lt;br /&gt;
&lt;br /&gt;
This tool is not so concerned with adding new functionality; rather, it aims to take a bunch of spread out Perl scripts and create a faster system that is easier to work with. As such, its functionality is largely stolen from those scripts:&lt;br /&gt;
&lt;br /&gt;
* Downloader: &amp;lt;code&amp;gt;E:\McNair\Software\Scripts\Patent\USPTO_Parser.pl&amp;lt;/code&amp;gt;&lt;br /&gt;
* XML Splitter: &amp;lt;code&amp;gt;E:\McNair\PatentData\splitter.pl&amp;lt;/code&amp;gt;&lt;br /&gt;
* XML Parsing: &amp;lt;code&amp;gt;E:\McNair\PatentData\Processed\xmlparser_4.5_4.4_4.3.pl&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;E:\McNair\PatentData\Processed\*.pm&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In addition, I used several non-standard Java libraries listed below:&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/mashape/unirest-java/ Unirest] for easy HTTP requests (MIT License)&lt;br /&gt;
* [https://github.com/google/guava Google Guava] for immutable collections and Stream utilities (Apache v2.0 License)&lt;br /&gt;
* [https://github.com/jhy/jsoup/ jsoup] for HTML parsing (MIT License)&lt;br /&gt;
* [http://commons.apache.org/proper/commons-codec/ Apache Commons Codec] (Apache v2.0 License)&lt;br /&gt;
* [http://commons.apache.org/proper/commons-lang/ Apache Commons Lang v3] (Apache v2.0 License)&lt;br /&gt;
* [https://mvnrepository.com/artifact/org.jetbrains/annotations/15.0 Jetbrains Annotations] for enhanced null checks (Apache v2.0 License)&lt;br /&gt;
* [http://search.maven.org/#artifactdetails%7Corg.postgresql%7Cpostgresql%7C42.1.1.jre7%7Cbundle PostgreSQL JDBC] (BSD 3-clause per https://github.com/pgjdbc/pgjdbc-jre7/blob/master/LICENSE)&lt;br /&gt;
&lt;br /&gt;
If you use Maven, these dependencies are listed in the project's pom.xml and should be set up automatically.&lt;br /&gt;
&lt;br /&gt;
==== Using Code ====&lt;br /&gt;
&lt;br /&gt;
Any file with a line that says &amp;lt;code&amp;gt;public static void main(String[] args) {&amp;lt;/code&amp;gt; can be run as a standalone file. The easiest way to do this is to load the project and then the file in IntelliJ and click the little green play arrow next to this bit of code.&lt;br /&gt;
&lt;br /&gt;
The code can also be run via the standard &amp;lt;code&amp;gt;javac&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;java&amp;lt;/code&amp;gt; commands but since this project has a complicated structure you end up having to run commands like &lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;quot;C:\Program Files\Java\jdk1.8.0_131\bin\java&amp;quot; &amp;quot;-javaagent:C:\Users\OliverC\IntelliJ IDEA 2017.1.3\lib\idea_rt.jar=62364:C:\Users\OliverC\IntelliJ IDEA 2017.1.3\bin&amp;quot; -Dfile.encoding=UTF-8 -classpath &amp;quot;[...contents truncated...];C:\Users\OliverC\.m2\repository\org\postgresql\postgresql\42.1.1\postgresql-42.1.1.jar&amp;quot; org.bakerinstitute.mcnair.uspto_assignments.XmlDriver&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
to include all of the runtime dependencies and it's just not worth it.&lt;br /&gt;
&lt;br /&gt;
==== Altering Code ====&lt;br /&gt;
&lt;br /&gt;
* Use the IntelliJ command Reformat Code (found in the menus at &amp;lt;code&amp;gt;Code &amp;gt; Reformat Code&amp;lt;/code&amp;gt;)&lt;br /&gt;
* Use the optimize imports function found under the same menu&lt;br /&gt;
* Use spaces for indentation&lt;br /&gt;
* Loosely try to keep lines below 120 characters&lt;br /&gt;
* Commit changes to the Git remote repository &amp;quot;bonobo&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== Schema Reconciliation ==&lt;br /&gt;
&lt;br /&gt;
For the work by Joe, see the [[Patent Schema Reconciliation]] page &lt;br /&gt;
&lt;br /&gt;
=== Patents (Granted) ===&lt;br /&gt;
&lt;br /&gt;
See &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\examples\granted&amp;lt;/code&amp;gt; for extracted examples of what specific data is available for a sample of the data.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Granted Patent Data Formats&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; | Dates Used !! scope=&amp;quot;col&amp;quot; | Format !! scope=&amp;quot;col&amp;quot; | Location !! scope=&amp;quot;col&amp;quot; | Supported? !! scope=&amp;quot;col&amp;quot; | Utility !! scope=&amp;quot;col&amp;quot; | Reissue !! scope=&amp;quot;col&amp;quot; | Design !! scope=&amp;quot;col&amp;quot; | Plant&lt;br /&gt;
|-&lt;br /&gt;
|January 1976 to December 2001&lt;br /&gt;
|APS&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/vintage&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;del&amp;gt;January 2001 to December 2001&amp;lt;/del&amp;gt;&lt;br /&gt;
|&amp;lt;del&amp;gt;SGML&amp;lt;/del&amp;gt;&lt;br /&gt;
|Ignored; use concurrently recorded APS data&lt;br /&gt;
|No&lt;br /&gt;
|N/A&lt;br /&gt;
|N/A&lt;br /&gt;
|N/A&lt;br /&gt;
|N/A&lt;br /&gt;
|-&lt;br /&gt;
|January 2002 to December 2004&lt;br /&gt;
|XML Version 2.5&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/blunderyears&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2005 to December 2005&lt;br /&gt;
|XML Version 4.0 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2006 to December 2006&lt;br /&gt;
|XML Version 4.1 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2007 to December 2012&lt;br /&gt;
|XML Version 4.2 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2013 to September 24, 2013&lt;br /&gt;
|XML Version 4.3 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|October 8, 2013 to December 2014&lt;br /&gt;
|XML Version 4.4 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2015 to December 2016&lt;br /&gt;
|XML Version 4.5 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== APS Rosetta Stone ===&lt;br /&gt;
&lt;br /&gt;
The Advanced Patent System (APS) is a fixed-width text format used to store historical patent grant data. The documentation for this sucks; there are pages missing at random. Luckily, we only care about the content contained here: [[File:PatentFullTextAPSDoc_GreenBook_pgs13-22.pdf]].&lt;br /&gt;
&lt;br /&gt;
It's worth mentioning that the APS contains an advanced text markup system for chemical formulae, basic text markup, tables, etc. that can lead to seemingly garbled text that is perfectly well-formed.&lt;br /&gt;
&lt;br /&gt;
==== APS Gotchas ====&lt;br /&gt;
&lt;br /&gt;
* PATN.WKU is the granted patent number. It is 7 digits while the spec promises 6 digits. The rightmost digit is a check digit modulus 11. See [[File:Aps-wku-modulus11.pdf]] for the words from the horse's mouth.&lt;br /&gt;
&lt;br /&gt;
=== Patents (Applications) ===&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Patent Application Data Formats&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; | Dates Used !! scope=&amp;quot;col&amp;quot; | Format !! scope=&amp;quot;col&amp;quot; | Location !! scope=&amp;quot;col&amp;quot; | Supported by Parser?&lt;br /&gt;
|-&lt;br /&gt;
|March 15, 2001 to December 2001&lt;br /&gt;
|XML Version 1.5&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/vintage&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Yes, for basic information, inventors, and correspondents&lt;br /&gt;
|-&lt;br /&gt;
|January 2002 to December 2004&lt;br /&gt;
|XML Version 1.6&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/vintage&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2005 to December 2005&lt;br /&gt;
|XML Version 4.0 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2006 to December 2006&lt;br /&gt;
|XML Version 4.1 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2007 to December 2012&lt;br /&gt;
|XML Version 4.2 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2013 to December 2014&lt;br /&gt;
|XML Version 4.3 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2015 to ''Present''&lt;br /&gt;
|XML Version 4.4 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Database ==&lt;br /&gt;
&lt;br /&gt;
Because there isn't a compelling reason not to, I used the existing PostgreSQL infrastructure on the RDP.&lt;br /&gt;
The &amp;quot;Java Way&amp;quot; of interacting with databases is the Java Database Connectivity API (JDBC), an implementation-agnostic API for interacting with databases.&lt;br /&gt;
This project uses the stock [https://jdbc.postgresql.org/ Postgres JDBC] driver, version 42.1.1.&lt;br /&gt;
&lt;br /&gt;
=== Create an empty database on RDP ===&lt;br /&gt;
&lt;br /&gt;
To create an empty database, run this command: &amp;lt;code&amp;gt;$ createdb --username=postgres database-name-goes-here # password is tabspaceenter&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Abstraction Layer ===&lt;br /&gt;
&lt;br /&gt;
Since writing raw SQL is a bit cumbersome and error-prone, I have added some abstraction layers that make it much easier to quickly add bulk data. By using Postgres's &amp;lt;code&amp;gt;CopyManager&amp;lt;/code&amp;gt; class, we buffer SQL copy commands in memory (as many as possible) and then flush these rows. To understand how the abstraction layers work, see the code in &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\postgres&amp;lt;/code&amp;gt;. See &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\models\GrantedPatent.java&amp;lt;/code&amp;gt; for '''an example of how to extend''' the abstraction layer to deal with more complex scenarios.&lt;br /&gt;
&lt;br /&gt;
=== New Table Checklist ===&lt;br /&gt;
&lt;br /&gt;
* Create schema DDL SQL code for the new table in &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\db&amp;lt;/code&amp;gt;&lt;br /&gt;
* Run the schema creation&lt;br /&gt;
* Create an enum with the same names for attributes as in the DDL (case-insensitive! prefer all-caps screaming snake case)&lt;br /&gt;
* Create a class which subclasses &amp;lt;code&amp;gt;AbstractInsertableData&amp;lt;/code&amp;gt;&lt;br /&gt;
* Inside that class, create a static class which subclasses &amp;lt;code&amp;gt;AbstractTableMetadata&amp;lt;/code&amp;gt; and has the proper values for getTableName(), getStringColumns(), getIntColumns()&lt;br /&gt;
* (Optional) Implement builder pattern&lt;br /&gt;
* (Optional) Create a custom databasehelper for complex extras (see PatentApplication and GrantedPatent for examples)&lt;br /&gt;
* Write the data to the table (see DatabaseHelper for the pattern I use)&lt;br /&gt;
&lt;br /&gt;
== Address Data ==&lt;br /&gt;
&lt;br /&gt;
To get the most granular address data (street level, or at least postcode level) about who owns patents, the path is not so straightforward because of the complicated mapping of ownership to a granted patent.&lt;br /&gt;
This is the final part of this project that I am working on and it is all at the level of SQL.&lt;br /&gt;
See &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\db\joins&amp;lt;/code&amp;gt; for my attempts to create a clean mapping.&lt;br /&gt;
Optimistically speaking, the data generated here should be a superset of the data present in the Patent Assignment Data Restructure project.&lt;br /&gt;
&lt;br /&gt;
Note that as of the beginning of August 2017, this part '''has not been completed.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Intuition ===&lt;br /&gt;
&lt;br /&gt;
Use &amp;lt;code&amp;gt;assignments_longform.last_update_date&amp;lt;/code&amp;gt; to find current/latest (or first/earliest) date of assignment. Then match with &amp;lt;code&amp;gt;properties.docid&amp;lt;/code&amp;gt; on &amp;lt;code&amp;gt;reelno, frameno&amp;lt;/code&amp;gt; to find patent application id. With this mapping to granted patents, we can discover the details of the original granted patent. And with the right date and reelno and frameno, we can match to the &amp;lt;code&amp;gt;assignees&amp;lt;/code&amp;gt; table and get fine granularity addresses.&lt;br /&gt;
&lt;br /&gt;
== Related Pages ==&lt;br /&gt;
&lt;br /&gt;
* [[Redesign_Assignment_and_Patent_Database|Redesign Assignment and Patent Database, Fall 2017 by Shelby]]&lt;br /&gt;
* [[Equivalent_XPath_and_APS_Queries|Equivalent XPath and APS Queries, Summer 2017 by Oliver &amp;amp; Joe]]&lt;br /&gt;
* [[US_Address_Verification|US Address Verification, Summer 2017 based on tables from Assignment Data Restructure]]&lt;br /&gt;
* [[Patent_Assignment_Data_Restructure|Assignment Data Restructure, Spring 2017 by Marcela and Sonia]]&lt;br /&gt;
* [[Redesigning_Patent_Database|Redesigning Patent Database, Spring 2017 by Shelby]]&lt;br /&gt;
* [[Patent_Data_Cleanup_(June_2016)|Patent Data Cleanup, June 2016 by Marcela]]&lt;br /&gt;
* [[Patent_Data|Patent Data, Spring 2016 by Marcela]] &lt;br /&gt;
* [[Lex_Machina|Lex Machina]]&lt;br /&gt;
* [[USPTO_Patent_Litigation_Data|USPTO Patent Litigation Research Dataset by Ed]]&lt;br /&gt;
* [[Patent_Litigation_and_Review|Patent Litigation and Review by Marcela]]&lt;br /&gt;
* [[Patent|Existing Database Schema]]&lt;br /&gt;
* [[Oliver_Chang_(Work_Log)|My Work Log]]&lt;br /&gt;
&lt;br /&gt;
== External Links ==&lt;br /&gt;
&lt;br /&gt;
* Understanding Assignment Data: [https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf USPTO Documentation on their cleanup of this data]&lt;br /&gt;
* [https://bulkdata.uspto.gov/data2/patent/grant/redbook/fulltext/1976/PatentFullTextAPSDoc_GreenBook.pdf USPTO Green Book (APS) Documentation]&lt;br /&gt;
* [https://bulkdata.uspto.gov/ USPTO Bulk Data Storage System (BDSS)]&lt;br /&gt;
* [https://en.wikipedia.org/wiki/Builder_pattern Builder Pattern in Object-Oriented Programming]&lt;br /&gt;
* [http://rdp.mcnaircenter.org/codebase/Repository/ReproduciblePatent Git Repository]&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Reproducible_Patent_Data&amp;diff=22322</id>
		<title>Reproducible Patent Data</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Reproducible_Patent_Data&amp;diff=22322"/>
		<updated>2017-12-07T21:27:41Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* Troubleshooting */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has Image=Uspto web logo.jpg&lt;br /&gt;
|Has title=Reproducible Patent Data&lt;br /&gt;
|Has owner=Oliver Chang&lt;br /&gt;
|Has start date=May 17&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Does subsume=Redesigning Patent Database, Patent Assignment Data Restructure,&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
A continuation of [[Redesigning Patent Database]] that aims to write faster, more centralized code to deal with data from the United States Patent and Trademark Office (USPTO). By having an end-to-end pipeline we can easily reproduce or update data without worrying about unintentional side effects or missing data.&lt;br /&gt;
&lt;br /&gt;
== Quickstart ==&lt;br /&gt;
&lt;br /&gt;
To get up and running with the code, do the following:&lt;br /&gt;
&lt;br /&gt;
# Clone the git project (link at end of page) to your user directory&lt;br /&gt;
# Launch IntelliJ with &amp;gt;= Java 8 and Maven configured (default version installed on the RDP is set up to do this)&lt;br /&gt;
# Open project in IntelliJ&lt;br /&gt;
# Create an empty database (see [[#Database]])&lt;br /&gt;
# Run the table creation scripts in &amp;lt;code&amp;gt;src/db/schemas/&amp;lt;/code&amp;gt; in your new database&lt;br /&gt;
# Modify the constant &amp;lt;code&amp;gt;DATABASE_NAME&amp;lt;/code&amp;gt; in &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\postgres\DatabaseHelper.java&amp;lt;/code&amp;gt;&lt;br /&gt;
# Run the Driver scripts in IntelliJ with the correct value for &amp;lt;code&amp;gt;DATA_DIRECTORY&amp;lt;/code&amp;gt; (or run &amp;lt;code&amp;gt;RunInitialImport.java&amp;lt;/code&amp;gt; which will do all of the data directories for that patent item type)&lt;br /&gt;
# [Take a really, really long lunch...in total should take no more than five hours to load data on RDP]&lt;br /&gt;
# Run scripts in &amp;lt;code&amp;gt;src/db/constraints&amp;lt;/code&amp;gt; to check data assumptions&lt;br /&gt;
# That's it!&lt;br /&gt;
&lt;br /&gt;
===Troubleshooting===&lt;br /&gt;
&lt;br /&gt;
If you're new to IntelliJ (and even if you're not) you might run into problems with importing the project. &lt;br /&gt;
&lt;br /&gt;
'''Setting Up Project as a Maven project'''&lt;br /&gt;
&lt;br /&gt;
It should be clear if the project is not set up as a Maven project: when you right-click on RunInitialImport.java, for example, you won't see an option with a green triangle next to it that says &amp;quot;Run 'RunInitialImport.java'&amp;quot;, and the green triangle in the top toolbar will be grayed out. If the project is not set up as a Maven project, you will not be able to run any of the code. To set it up as a Maven project when you import it, follow the instructions at the following [https://www.jetbrains.com/help/idea/maven.html#maven_import_project_start link].&lt;br /&gt;
&lt;br /&gt;
== Directory Layout ==&lt;br /&gt;
&lt;br /&gt;
=== Where is the Data? ===&lt;br /&gt;
&lt;br /&gt;
==== Directories ====&lt;br /&gt;
&lt;br /&gt;
All of the information for this project is located at &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
There are several interesting directories:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;data/downloads/&amp;lt;/code&amp;gt; is USPTO bulkdata, unmodified straight from the scraper&lt;br /&gt;
* &amp;lt;code&amp;gt;data/extracts/&amp;lt;/code&amp;gt; is a directory of a strict subset of the information stored in &amp;lt;code&amp;gt;data/downloads/&amp;lt;/code&amp;gt;. It is the result of running a bulk 7-zip job on that directory to get everything unzipped in a flat data structure. Note that these files have the USPTO modified-by time since that metadata is stored in the zipfiles. To extract files in this nice format, select all of the zipfiles and set up an extraction job like in this [[media:7zip-params.png|screenshot]]&lt;br /&gt;
* &amp;lt;code&amp;gt;data/backups/&amp;lt;/code&amp;gt; is a 7zip'd backup of the corresponding directory in extracts&lt;br /&gt;
* &amp;lt;code&amp;gt;src/&amp;lt;/code&amp;gt; is the main code repository for the java project&lt;br /&gt;
&lt;br /&gt;
==== Input Files ====&lt;br /&gt;
&lt;br /&gt;
All of the text-only Red Book files for '''granted patents''' from 1976 to 2016, inclusive, are available. To find a specific year's file, look in&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\extracts\granted\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''To find application data''' from 2001 to 2016, inclusive, look in&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\extracts\applications\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''To find assignment data''', look in &lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\extracts\granted\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''To find maintenance fee data''', look in &lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\downloads\maintenance\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Where is the Code? ===&lt;br /&gt;
&lt;br /&gt;
The code has the same parent directory as the data, so it is at &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src&amp;lt;/code&amp;gt;. You might notice a lot of single-entry directories; this is an idiomatic Java pattern that is used for package separation. If using IntelliJ or some other IDE, these directories are a bit less annoying.&lt;br /&gt;
&lt;br /&gt;
The development environment is Java 8 JDK, IntelliJ Ultimate IDE, Maven build tools, and git VCS.&lt;br /&gt;
&lt;br /&gt;
The git repository can be found at http://rdp.mcnaircenter.org/codebase/Repository/ReproduciblePatent&lt;br /&gt;
&lt;br /&gt;
==== Prior Art ====&lt;br /&gt;
&lt;br /&gt;
This tool is not so concerned with adding new functionality; rather, it aims to take a bunch of spread out Perl scripts and create a faster system that is easier to work with. As such, its functionality is largely stolen from those scripts:&lt;br /&gt;
&lt;br /&gt;
* Downloader: &amp;lt;code&amp;gt;E:\McNair\Software\Scripts\Patent\USPTO_Parser.pl&amp;lt;/code&amp;gt;&lt;br /&gt;
* XML Splitter: &amp;lt;code&amp;gt;E:\McNair\PatentData\splitter.pl&amp;lt;/code&amp;gt;&lt;br /&gt;
* XML Parsing: &amp;lt;code&amp;gt;E:\McNair\PatentData\Processed\xmlparser_4.5_4.4_4.3.pl&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;E:\McNair\PatentData\Processed\*.pm&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In addition, I used several non-standard Java libraries listed below:&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/mashape/unirest-java/ Unirest] for easy HTTP requests (MIT License)&lt;br /&gt;
* [https://github.com/google/guava Google Guava] for immutable collections and Stream utilities (Apache v2.0 License)&lt;br /&gt;
* [https://github.com/jhy/jsoup/ jsoup] for HTML parsing (MIT License)&lt;br /&gt;
* [http://commons.apache.org/proper/commons-codec/ Apache Commons Codec] (Apache v2.0 License)&lt;br /&gt;
* [http://commons.apache.org/proper/commons-lang/ Apache Commons Lang v3] (Apache v2.0 License)&lt;br /&gt;
* [https://mvnrepository.com/artifact/org.jetbrains/annotations/15.0 Jetbrains Annotations] for enhanced null checks (Apache v2.0 License)&lt;br /&gt;
* [http://search.maven.org/#artifactdetails%7Corg.postgresql%7Cpostgresql%7C42.1.1.jre7%7Cbundle PostgreSQL JDBC] (BSD 3-clause per https://github.com/pgjdbc/pgjdbc-jre7/blob/master/LICENSE)&lt;br /&gt;
&lt;br /&gt;
If using Maven, these dependencies are already listed and should be set up automatically.&lt;br /&gt;
&lt;br /&gt;
==== Using Code ====&lt;br /&gt;
&lt;br /&gt;
Any file with a line that says &amp;lt;code&amp;gt;public static void main(String[] args) {&amp;lt;/code&amp;gt; can be run as a standalone file. The easiest way to do this is to load the project and then the file in IntelliJ and click the little green play arrow next to this bit of code.&lt;br /&gt;
&lt;br /&gt;
The code can also be run via the standard &amp;lt;code&amp;gt;javac&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;java&amp;lt;/code&amp;gt; commands but since this project has a complicated structure you end up having to run commands like &lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;quot;C:\Program Files\Java\jdk1.8.0_131\bin\java&amp;quot; &amp;quot;-javaagent:C:\Users\OliverC\IntelliJ IDEA 2017.1.3\lib\idea_rt.jar=62364:C:\Users\OliverC\IntelliJ IDEA 2017.1.3\bin&amp;quot; -Dfile.encoding=UTF-8 -classpath &amp;quot;[...contents truncated...];C:\Users\OliverC\.m2\repository\org\postgresql\postgresql\42.1.1\postgresql-42.1.1.jar&amp;quot; org.bakerinstitute.mcnair.uspto_assignments.XmlDriver&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
to include all of the runtime dependencies and it's just not worth it.&lt;br /&gt;
&lt;br /&gt;
==== Altering Code ====&lt;br /&gt;
&lt;br /&gt;
* Use the IntelliJ command Reformat Code (found in the menus at &amp;lt;code&amp;gt;Code &amp;gt; Reformat Code&amp;lt;/code&amp;gt;)&lt;br /&gt;
* Use the optimize imports function found under the same menu&lt;br /&gt;
* Use spaces for indentation&lt;br /&gt;
* Loosely try to keep lines below 120 characters&lt;br /&gt;
* Commit changes to the Git remote repository &amp;quot;bonobo&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== Schema Reconciliation ==&lt;br /&gt;
&lt;br /&gt;
For the work by Joe, see the [[Patent Schema Reconciliation]] page &lt;br /&gt;
&lt;br /&gt;
=== Patents (Granted) ===&lt;br /&gt;
&lt;br /&gt;
See &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\examples\granted&amp;lt;/code&amp;gt; for extracted examples of what specific data is available for a sample of the data.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Granted Patent Data Formats&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; | Dates Used !! scope=&amp;quot;col&amp;quot; | Format !! scope=&amp;quot;col&amp;quot; | Location !! scope=&amp;quot;col&amp;quot; | Supported? !! scope=&amp;quot;col&amp;quot; | Utility !! scope=&amp;quot;col&amp;quot; | Reissue !! scope=&amp;quot;col&amp;quot; | Design !! scope=&amp;quot;col&amp;quot; | Plant&lt;br /&gt;
|-&lt;br /&gt;
|January 1976 to December 2001&lt;br /&gt;
|APS&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/vintage&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;del&amp;gt;January 2001 to December 2001&amp;lt;/del&amp;gt;&lt;br /&gt;
|&amp;lt;del&amp;gt;SGML&amp;lt;/del&amp;gt;&lt;br /&gt;
|Ignored; use concurrently recorded APS data&lt;br /&gt;
|No&lt;br /&gt;
|N/A&lt;br /&gt;
|N/A&lt;br /&gt;
|N/A&lt;br /&gt;
|N/A&lt;br /&gt;
|-&lt;br /&gt;
|January 2002 to December 2004&lt;br /&gt;
|XML Version 2.5&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/blunderyears&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2005 to December 2005&lt;br /&gt;
|XML Version 4.0 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2006 to December 2006&lt;br /&gt;
|XML Version 4.1 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2007 to December 2012&lt;br /&gt;
|XML Version 4.2 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2013 to September 24, 2013&lt;br /&gt;
|XML Version 4.3 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|October 8, 2013 to December 2014&lt;br /&gt;
|XML Version 4.4 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2015 to December 2016&lt;br /&gt;
|XML Version 4.5 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|}&lt;br /&gt;
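&lt;br /&gt;
As a rough illustration of the table above, the following minimal sketch maps a grant date to the extract directory that holds it. The class &amp;lt;code&amp;gt;GrantedFormatLocator&amp;lt;/code&amp;gt; and its method are hypothetical and are not part of the project; only the year ranges and directory names come from the table, and the sketch works at year granularity only.&lt;br /&gt;

```java
import java.time.LocalDate;

// Hypothetical helper (not part of the project): maps a grant date to the
// extract directory listed in the Granted Patent Data Formats table.
final class GrantedFormatLocator {

    static String directoryFor(LocalDate grantDate) {
        int year = grantDate.getYear();
        if (year > 2016) {
            throw new IllegalArgumentException("No extracts listed after 2016: " + grantDate);
        }
        if (year >= 2005) {
            return "data/extracts/granted/modern";        // XML 4.0 to 4.5 ICE
        }
        if (year >= 2002) {
            return "data/extracts/granted/blunderyears";  // XML 2.5
        }
        if (year >= 1976) {
            return "data/extracts/granted/vintage";       // APS
        }
        throw new IllegalArgumentException("No extracts listed before 1976: " + grantDate);
    }

    public static void main(String[] args) {
        System.out.println(directoryFor(LocalDate.of(1999, 6, 1)));
        System.out.println(directoryFor(LocalDate.of(2003, 6, 1)));
        System.out.println(directoryFor(LocalDate.of(2014, 6, 1)));
    }
}
```

Dates in 2001 resolve to the APS directory, which matches the decision to ignore the concurrently published SGML data.&lt;br /&gt;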
&lt;br /&gt;
=== APS Rosetta Stone ===&lt;br /&gt;
&lt;br /&gt;
The Advanced Patent System (APS) is a fixed-width text format used to store historical patent grant data. The documentation for this sucks; there are pages missing at random. Luckily, we only care about the content contained here: [[File:PatentFullTextAPSDoc_GreenBook_pgs13-22.pdf]].&lt;br /&gt;
&lt;br /&gt;
It's worth mentioning that the APS contains an advanced text markup system for chemical formulae, basic text markup, tables, etc. that can lead to seemingly garbled text that is perfectly well-formed.&lt;br /&gt;
&lt;br /&gt;
==== APS Gotchas ====&lt;br /&gt;
&lt;br /&gt;
* PATN.WKU is the granted patent number. It is 7 digits while the spec promises 6 digits. The rightmost digit is a check digit modulus 11. See [[File:Aps-wku-modulus11.pdf]] for the words from the horse's mouth.&lt;br /&gt;
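&lt;br /&gt;
A minimal sketch of handling this gotcha, assuming only what is stated above (the rightmost character of the WKU value is the check digit). The class &amp;lt;code&amp;gt;WkuField&amp;lt;/code&amp;gt; is hypothetical and not part of the project, the sample value is made up, and the sketch does not attempt to verify the modulus 11 check digit; see the linked PDF for that rule.&lt;br /&gt;

```java
// Hypothetical helper (not part of the project): separates the trailing
// check digit from a PATN.WKU value. It does not validate the check digit.
final class WkuField {

    // Returns the WKU value without its rightmost character.
    static String withoutCheckDigit(String wku) {
        String trimmed = wku.trim();
        if (trimmed.length() > 1) {
            return trimmed.substring(0, trimmed.length() - 1);
        }
        throw new IllegalArgumentException("WKU value too short: " + wku);
    }

    // Returns the rightmost character, which the page describes as the check digit.
    static char checkDigit(String wku) {
        String trimmed = wku.trim();
        if (trimmed.isEmpty()) {
            throw new IllegalArgumentException("WKU value is empty");
        }
        return trimmed.charAt(trimmed.length() - 1);
    }

    public static void main(String[] args) {
        String sample = "1234567"; // made-up value, not a real patent number
        System.out.println(withoutCheckDigit(sample) + " / " + checkDigit(sample));
    }
}
```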
&lt;br /&gt;
=== Patents (Applications) ===&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Patent Application Data Formats&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; | Dates Used !! scope=&amp;quot;col&amp;quot; | Format !! scope=&amp;quot;col&amp;quot; | Location !! scope=&amp;quot;col&amp;quot; | Supported by Parser?&lt;br /&gt;
|-&lt;br /&gt;
|March 15, 2001 to December 2001&lt;br /&gt;
|XML Version 1.5&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/vintage&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Yes, for basic information, inventors, and correspondents&lt;br /&gt;
|-&lt;br /&gt;
|January 2002 to December 2004&lt;br /&gt;
|XML Version 1.6&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/vintage&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2005 to December 2005&lt;br /&gt;
|XML Version 4.0 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2006 to December 2006&lt;br /&gt;
|XML Version 4.1 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2007 to December 2012&lt;br /&gt;
|XML Version 4.2 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2013 to December 2014&lt;br /&gt;
|XML Version 4.3 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2015 to ''Present''&lt;br /&gt;
|XML Version 4.4 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Database ==&lt;br /&gt;
&lt;br /&gt;
Because there isn't a compelling reason not to, I used the existing PostgreSQL infrastructure on the RDP.&lt;br /&gt;
The &amp;quot;Java Way&amp;quot; of interacting with databases is the Java Database Connectivity API (JDBC), an implementation-agnostic API for interacting with databases.&lt;br /&gt;
This project uses the stock [https://jdbc.postgresql.org/ Postgres JDBC] driver, version 42.1.1.&lt;br /&gt;
&lt;br /&gt;
=== Create an empty database on RDP ===&lt;br /&gt;
&lt;br /&gt;
To create an empty database, run this command: &amp;lt;code&amp;gt;$ createdb --username=postgres database-name-goes-here # password is tabspaceenter&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Abstraction Layer ===&lt;br /&gt;
&lt;br /&gt;
Since writing raw SQL is a bit cumbersome and error-prone, I have added some abstraction layers that make it much easier to quickly add bulk data. By using Postgres's &amp;lt;code&amp;gt;CopyManager&amp;lt;/code&amp;gt; class, we buffer SQL copy commands in memory (as many as possible) and then flush these rows. To understand how the abstraction layers work, see the code in &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\postgres&amp;lt;/code&amp;gt;. See &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\models\GrantedPatent.java&amp;lt;/code&amp;gt; for '''an example of how to extend''' the abstraction layer to deal with more complex scenarios.&lt;br /&gt;
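&lt;br /&gt;
To give a feel for the buffering idea described above, here is a minimal sketch that accumulates rows in PostgreSQL's COPY text format and hands them back in one piece. The class &amp;lt;code&amp;gt;CopyBuffer&amp;lt;/code&amp;gt; and its methods are hypothetical and are not the project's actual abstraction layer; the table name in the comment is also made up. The sketch deliberately stops short of talking to a database so that it runs without the JDBC driver.&lt;br /&gt;

```java
// Hypothetical sketch of the buffering idea; names are illustrative and do
// not mirror the project's real classes.
final class CopyBuffer {
    private final StringBuilder rows = new StringBuilder();
    private int rowCount = 0;

    // Escapes one value for PostgreSQL's COPY text format.
    static String escape(String value) {
        if (value == null) {
            return "\\N"; // COPY's default NULL marker
        }
        return value.replace("\\", "\\\\") // backslash first, so later escapes stay intact
                    .replace("\t", "\\t")
                    .replace("\n", "\\n")
                    .replace("\r", "\\r");
    }

    // Appends one tab-delimited, newline-terminated row to the buffer.
    void addRow(String... values) {
        for (int i = 0; i != values.length; i++) {
            if (i > 0) {
                rows.append('\t');
            }
            rows.append(escape(values[i]));
        }
        rows.append('\n');
        rowCount++;
    }

    int size() {
        return rowCount;
    }

    // Returns the buffered rows and clears the buffer. With the PostgreSQL JDBC
    // driver on the classpath, the returned text could be passed to
    // CopyManager.copyIn("COPY some_table FROM STDIN", new StringReader(text)).
    String drain() {
        String text = rows.toString();
        rows.setLength(0);
        rowCount = 0;
        return text;
    }
}
```

A caller would add rows until some threshold is reached, then drain the buffer and send the text to the database in a single copy operation, which avoids one round trip per row.&lt;br /&gt;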
&lt;br /&gt;
=== New Table Checklist ===&lt;br /&gt;
&lt;br /&gt;
* Create schema DDL SQL code for the new table in &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\db&amp;lt;/code&amp;gt;&lt;br /&gt;
* Run the schema creation&lt;br /&gt;
* Create an enum with the same names for attributes as in the DDL (case-insensitive! prefer all-caps screaming snake case)&lt;br /&gt;
* Create a class which subclasses &amp;lt;code&amp;gt;AbstractInsertableData&amp;lt;/code&amp;gt;&lt;br /&gt;
* Inside that class, create a static class which subclasses &amp;lt;code&amp;gt;AbstractTableMetadata&amp;lt;/code&amp;gt; and has the proper values for getTableName(), getStringColumns(), getIntColumns()&lt;br /&gt;
* (Optional) Implement builder pattern&lt;br /&gt;
* (Optional) Create a custom databasehelper for complex extras (see PatentApplication and GrantedPatent for examples)&lt;br /&gt;
* Write the data to the table (see DatabaseHelper for the pattern I use)&lt;br /&gt;
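&lt;br /&gt;
The checklist above might look roughly like the following for a made-up table. This is only a sketch: the two abstract classes here are stand-ins declared so the example compiles on its own, and their return types are assumptions, since the real signatures of &amp;lt;code&amp;gt;AbstractInsertableData&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;AbstractTableMetadata&amp;lt;/code&amp;gt; are not shown on this page. The table &amp;lt;code&amp;gt;example_fees&amp;lt;/code&amp;gt;, its columns, and the method &amp;lt;code&amp;gt;getTableMetadata()&amp;lt;/code&amp;gt; are hypothetical.&lt;br /&gt;

```java
import java.util.Locale;

// Stand-ins for the project's base classes; the real signatures are not
// shown on this page, so the return types below are assumptions.
abstract class AbstractTableMetadata {
    abstract String getTableName();
    abstract String[] getStringColumns();
    abstract String[] getIntColumns();
}

abstract class AbstractInsertableData {
    abstract AbstractTableMetadata getTableMetadata(); // hypothetical accessor
}

// Hypothetical new table "example_fees" with one text and one integer column.
final class ExampleFee extends AbstractInsertableData {

    // Enum names mirror the DDL column names in screaming snake case.
    enum Column { PATENT_NUMBER, FEE_YEAR }

    // Static nested class describing the table, as the checklist suggests.
    static final class Metadata extends AbstractTableMetadata {
        @Override
        String getTableName() {
            return "example_fees";
        }

        @Override
        String[] getStringColumns() {
            return new String[] { Column.PATENT_NUMBER.name().toLowerCase(Locale.ROOT) };
        }

        @Override
        String[] getIntColumns() {
            return new String[] { Column.FEE_YEAR.name().toLowerCase(Locale.ROOT) };
        }
    }

    @Override
    AbstractTableMetadata getTableMetadata() {
        return new Metadata();
    }
}
```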
&lt;br /&gt;
== Address Data ==&lt;br /&gt;
&lt;br /&gt;
To get the most granular address data (street level, or at least postcode level) about who owns patents, the path is not so straightforward because of the complicated mapping of ownership to a granted patent.&lt;br /&gt;
This is the final part of this project that I am working on and it is all at the level of SQL.&lt;br /&gt;
See &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\db\joins&amp;lt;/code&amp;gt; for my attempts to create a clean mapping.&lt;br /&gt;
Optimistically speaking, the data generated here should be a superset of the data present in the Patent Assignment Data Restructure project.&lt;br /&gt;
&lt;br /&gt;
Note that as of the beginning of August 2017, this part '''has not been completed.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Intuition ===&lt;br /&gt;
&lt;br /&gt;
Use &amp;lt;code&amp;gt;assignments_longform.last_update_date&amp;lt;/code&amp;gt; to find current/latest (or first/earliest) date of assignment. Then match with &amp;lt;code&amp;gt;properties.docid&amp;lt;/code&amp;gt; on &amp;lt;code&amp;gt;reelno, frameno&amp;lt;/code&amp;gt; to find patent application id. With this mapping to granted patents, we can discover the details of the original granted patent. And with the right date and reelno and frameno, we can match to the &amp;lt;code&amp;gt;assignees&amp;lt;/code&amp;gt; table and get fine granularity addresses.&lt;br /&gt;
&lt;br /&gt;
== Related Pages ==&lt;br /&gt;
&lt;br /&gt;
* [[Redesign_Assignment_and_Patent_Database|Redesign Assignment and Patent Database, Fall 2017 by Shelby]]&lt;br /&gt;
* [[Equivalent_XPath_and_APS_Queries|Equivalent XPath and APS Queries, Summer 2017 by Oliver &amp;amp; Joe]]&lt;br /&gt;
* [[US_Address_Verification|US Address Verification, Summer 2017 based on tables from Assignment Data Restructure]]&lt;br /&gt;
* [[Patent_Assignment_Data_Restructure|Assignment Data Restructure, Spring 2017 by Marcela and Sonia]]&lt;br /&gt;
* [[Redesigning_Patent_Database|Redesigning Patent Database, Spring 2017 by Shelby]]&lt;br /&gt;
* [[Patent_Data_Cleanup_(June_2016)|Patent Data Cleanup, June 2016 by Marcela]]&lt;br /&gt;
* [[Patent_Data|Patent Data, Spring 2016 by Marcela]] &lt;br /&gt;
* [[Lex_Machina|Lex Machina]]&lt;br /&gt;
* [[USPTO_Patent_Litigation_Data|USPTO Patent Litigation Research Dataset by Ed]]&lt;br /&gt;
* [[Patent_Litigation_and_Review|Patent Litigation and Review by Marcela]]&lt;br /&gt;
* [[Patent|Existing Database Schema]]&lt;br /&gt;
* [[Oliver_Chang_(Work_Log)|My Work Log]]&lt;br /&gt;
&lt;br /&gt;
== External Links ==&lt;br /&gt;
&lt;br /&gt;
* Understanding Assignment Data: [https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf USPTO Documentation on their cleanup of this data]&lt;br /&gt;
* [https://bulkdata.uspto.gov/data2/patent/grant/redbook/fulltext/1976/PatentFullTextAPSDoc_GreenBook.pdf USPTO Green Book (APS) Documentation]&lt;br /&gt;
* [https://bulkdata.uspto.gov/ USPTO Bulk Data Storage System (BDSS)]&lt;br /&gt;
* [https://en.wikipedia.org/wiki/Builder_pattern Builder Pattern in Object-Oriented Programming]&lt;br /&gt;
* [http://rdp.mcnaircenter.org/codebase/Repository/ReproduciblePatent Git Repository]&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Reproducible_Patent_Data&amp;diff=22321</id>
		<title>Reproducible Patent Data</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Reproducible_Patent_Data&amp;diff=22321"/>
		<updated>2017-12-07T21:26:33Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* Troubleshooting */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has Image=Uspto web logo.jpg&lt;br /&gt;
|Has title=Reproducible Patent Data&lt;br /&gt;
|Has owner=Oliver Chang&lt;br /&gt;
|Has start date=May 17&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Does subsume=Redesigning Patent Database, Patent Assignment Data Restructure,&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
A continuation of [[Redesigning Patent Database]] that aims to write faster, more centralized code to deal with data from the United States Patent and Trademark Office (USPTO). By having an end-to-end pipeline we can easily reproduce or update data without worrying about unintentional side effects or missing data.&lt;br /&gt;
&lt;br /&gt;
== Quickstart ==&lt;br /&gt;
&lt;br /&gt;
To get up and running with the code, do the following:&lt;br /&gt;
&lt;br /&gt;
# Clone the git project (link at end of page) to your user directory&lt;br /&gt;
# Launch IntelliJ with &amp;gt;= Java 8 and Maven configured (the default version installed on the RDP is set up to do this)&lt;br /&gt;
# Open project in IntelliJ&lt;br /&gt;
# Create an empty database (see [[#Database]])&lt;br /&gt;
# Run the table creation scripts in &amp;lt;code&amp;gt;src/db/schemas/&amp;lt;/code&amp;gt; in your new database&lt;br /&gt;
# Modify the constant &amp;lt;code&amp;gt;DATABASE_NAME&amp;lt;/code&amp;gt; in &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\postgres\DatabaseHelper.java&amp;lt;/code&amp;gt;&lt;br /&gt;
# Run the Driver scripts in IntelliJ with the correct value for &amp;lt;code&amp;gt;DATA_DIRECTORY&amp;lt;/code&amp;gt; (or run &amp;lt;code&amp;gt;RunInitialImport.java&amp;lt;/code&amp;gt; which will do all of the data directories for that patent item type)&lt;br /&gt;
# [Take a really, really long lunch...in total should take no more than five hours to load data on RDP]&lt;br /&gt;
# Run scripts in &amp;lt;code&amp;gt;src/db/constraints&amp;lt;/code&amp;gt; to check data assumptions&lt;br /&gt;
# That's it!&lt;br /&gt;
&lt;br /&gt;
===Troubleshooting===&lt;br /&gt;
&lt;br /&gt;
If you're new to IntelliJ (and even if you're not) you might run into problems with importing the project. &lt;br /&gt;
&lt;br /&gt;
'''Setting Up Project as a Maven project'''&lt;br /&gt;
It should be clear if the project is not set up as a Maven project: when you right-click on RunInitialImport.java, for example, you won't see an option with a green triangle next to it that says &amp;quot;Run 'RunInitialImport.java'&amp;quot;, and the green triangle in the top toolbar will be grayed out. If the project is not set up as a Maven project, you will not be able to run any of the code. To set it up as a Maven project when you import it, follow the [https://www.jetbrains.com/help/idea/maven.html#maven_import_project_start IntelliJ Maven import instructions].&lt;br /&gt;
&lt;br /&gt;
== Directory Layout ==&lt;br /&gt;
&lt;br /&gt;
=== Where is the Data? ===&lt;br /&gt;
&lt;br /&gt;
==== Directories ====&lt;br /&gt;
&lt;br /&gt;
All of the information for this project is located at &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
There are several interesting directories:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;data/downloads/&amp;lt;/code&amp;gt; is USPTO bulkdata, unmodified straight from the scraper&lt;br /&gt;
* &amp;lt;code&amp;gt;data/extracts/&amp;lt;/code&amp;gt; contains a strict subset of the information stored in &amp;lt;code&amp;gt;data/downloads/&amp;lt;/code&amp;gt;. It is the result of running a bulk 7-zip job on that directory to get everything unzipped into a flat directory structure. Note that these files carry the USPTO modification time, since that metadata is stored in the zipfiles. To extract files in this format, select all of the zipfiles and set up an extraction job like the one in this [[media:7zip-params.png|screenshot]]&lt;br /&gt;
* &amp;lt;code&amp;gt;data/backups/&amp;lt;/code&amp;gt; is a 7zip'd backup of the corresponding directory in extracts&lt;br /&gt;
* &amp;lt;code&amp;gt;src/&amp;lt;/code&amp;gt; is the main code repository for the java project&lt;br /&gt;
&lt;br /&gt;
==== Input Files ====&lt;br /&gt;
&lt;br /&gt;
The input files are all of the text-only Red Book files for '''granted patents''' from 1976 to 2016, inclusive. To find a specific year's file, look in&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\extracts\granted\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''To find application data''' from 2001 to 2016, inclusive, look in&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\extracts\applications\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''To find assignment data''', look in &lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\extracts\granted\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''To find maintenance fee data''', look in &lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\downloads\maintenance\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Where is the Code? ===&lt;br /&gt;
&lt;br /&gt;
The code has the same parent directory as the data, so it is at &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src&amp;lt;/code&amp;gt;. You might notice a lot of single-entry directories; this is an idiomatic Java pattern that is used for package separation. If using IntelliJ or some other IDE, these directories are a bit less annoying.&lt;br /&gt;
&lt;br /&gt;
The development environment is Java 8 JDK, IntelliJ Ultimate IDE, Maven build tools, and git VCS.&lt;br /&gt;
&lt;br /&gt;
The git repository can be found at http://rdp.mcnaircenter.org/codebase/Repository/ReproduciblePatent&lt;br /&gt;
&lt;br /&gt;
==== Prior Art ====&lt;br /&gt;
&lt;br /&gt;
This tool is not so concerned with adding new functionality; rather, it aims to take a bunch of spread out Perl scripts and create a faster system that is easier to work with. As such, its functionality is largely stolen from those scripts:&lt;br /&gt;
&lt;br /&gt;
* Downloader: &amp;lt;code&amp;gt;E:\McNair\Software\Scripts\Patent\USPTO_Parser.pl&amp;lt;/code&amp;gt;&lt;br /&gt;
* XML Splitter: &amp;lt;code&amp;gt;E:\McNair\PatentData\splitter.pl&amp;lt;/code&amp;gt;&lt;br /&gt;
* XML Parsing: &amp;lt;code&amp;gt;E:\McNair\PatentData\Processed\xmlparser_4.5_4.4_4.3.pl&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;E:\McNair\PatentData\Processed\*.pm&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In addition, I used several non-standard Java libraries listed below:&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/mashape/unirest-java/ Unirest] for easy HTTP requests (MIT License)&lt;br /&gt;
* [https://github.com/google/guava Google Guava] for immutable collections and Stream utilities (Apache v2.0 License)&lt;br /&gt;
* [https://github.com/jhy/jsoup/ jsoup] for HTML parsing (MIT License)&lt;br /&gt;
* [http://commons.apache.org/proper/commons-codec/ Apache Commons Codec] (Apache v2.0 License)&lt;br /&gt;
* [http://commons.apache.org/proper/commons-lang/ Apache Commons Lang v3] (Apache v2.0 License)&lt;br /&gt;
* [https://mvnrepository.com/artifact/org.jetbrains/annotations/15.0 Jetbrains Annotations] for enhanced null checks (Apache v2.0 License)&lt;br /&gt;
* [http://search.maven.org/#artifactdetails%7Corg.postgresql%7Cpostgresql%7C42.1.1.jre7%7Cbundle PostgreSQL JDBC] (BSD 3-clause per https://github.com/pgjdbc/pgjdbc-jre7/blob/master/LICENSE)&lt;br /&gt;
&lt;br /&gt;
If using Maven, these dependencies are declared in the project and should be set up automatically.&lt;br /&gt;
&lt;br /&gt;
==== Using Code ====&lt;br /&gt;
&lt;br /&gt;
Any file with a line that says &amp;lt;code&amp;gt;public static void main(String[] args) {&amp;lt;/code&amp;gt; can be run as a standalone file. The easiest way to do this is to load the project and then the file in IntelliJ and click the little green play arrow next to this bit of code.&lt;br /&gt;
&lt;br /&gt;
The code can also be run via the standard &amp;lt;code&amp;gt;javac&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;java&amp;lt;/code&amp;gt; commands but since this project has a complicated structure you end up having to run commands like &lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;quot;C:\Program Files\Java\jdk1.8.0_131\bin\java&amp;quot; &amp;quot;-javaagent:C:\Users\OliverC\IntelliJ IDEA 2017.1.3\lib\idea_rt.jar=62364:C:\Users\OliverC\IntelliJ IDEA 2017.1.3\bin&amp;quot; -Dfile.encoding=UTF-8 -classpath &amp;quot;[...contents truncated...];C:\Users\OliverC\.m2\repository\org\postgresql\postgresql\42.1.1\postgresql-42.1.1.jar&amp;quot; org.bakerinstitute.mcnair.uspto_assignments.XmlDriver&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
to include all of the runtime dependencies and it's just not worth it.&lt;br /&gt;
&lt;br /&gt;
==== Altering Code ====&lt;br /&gt;
&lt;br /&gt;
* Use the IntelliJ command Reformat Code (found in the menus at &amp;lt;code&amp;gt;Code &amp;gt; Reformat Code&amp;lt;/code&amp;gt;)&lt;br /&gt;
* Use the optimize imports function found under the same menu&lt;br /&gt;
* Use spaces for indentation&lt;br /&gt;
* Loosely try to keep lines below 120 characters&lt;br /&gt;
* Commit changes to the Git remote repository &amp;quot;bonobo&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== Schema Reconciliation ==&lt;br /&gt;
&lt;br /&gt;
For the work by Joe, see the [[Patent Schema Reconciliation]] page &lt;br /&gt;
&lt;br /&gt;
=== Patents (Granted) ===&lt;br /&gt;
&lt;br /&gt;
See &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\examples\granted&amp;lt;/code&amp;gt; for extracted examples showing what data is available in a sample of the records.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Granted Patent Data Formats&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; | Dates Used !! scope=&amp;quot;col&amp;quot; | Format !! scope=&amp;quot;col&amp;quot; | Location !! scope=&amp;quot;col&amp;quot; | Supported? !! scope=&amp;quot;col&amp;quot; | Utility !! scope=&amp;quot;col&amp;quot; | Reissue !! scope=&amp;quot;col&amp;quot; | Design !! scope=&amp;quot;col&amp;quot; | Plant&lt;br /&gt;
|-&lt;br /&gt;
|January 1976 to December 2001&lt;br /&gt;
|APS&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/vintage&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;del&amp;gt;January 2001 to December 2001&amp;lt;/del&amp;gt;&lt;br /&gt;
|&amp;lt;del&amp;gt;SGML&amp;lt;/del&amp;gt;&lt;br /&gt;
|Ignored; use concurrently recorded APS data&lt;br /&gt;
|No&lt;br /&gt;
|N/A&lt;br /&gt;
|N/A&lt;br /&gt;
|N/A&lt;br /&gt;
|N/A&lt;br /&gt;
|-&lt;br /&gt;
|January 2002 to December 2004&lt;br /&gt;
|XML Version 2.5&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/blunderyears&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2005 to December 2005&lt;br /&gt;
|XML Version 4.0 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2006 to December 2006&lt;br /&gt;
|XML Version 4.1 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2007 to December 2012&lt;br /&gt;
|XML Version 4.2 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2013 to September 24, 2013&lt;br /&gt;
|XML Version 4.3 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|October 8, 2013 to December 2014&lt;br /&gt;
|XML Version 4.4 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2015 to December 2016&lt;br /&gt;
|XML Version 4.5 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== APS Rosetta Stone ===&lt;br /&gt;
&lt;br /&gt;
The Automated Patent System (APS) is a fixed-width text format used to store historical patent grant data. The documentation for this sucks; there are pages missing at random. Luckily, we only care about the content contained here: [[File:PatentFullTextAPSDoc_GreenBook_pgs13-22.pdf]].&lt;br /&gt;
&lt;br /&gt;
It's worth mentioning that the APS contains an advanced text markup system for chemical formulae, basic text markup, tables, etc. that can lead to seemingly garbled text that is perfectly well-formed.&lt;br /&gt;
&lt;br /&gt;
==== APS Gotchas ====&lt;br /&gt;
&lt;br /&gt;
* PATN.WKU is the granted patent number. It is 7 digits while the spec promises 6 digits. The rightmost digit is a check digit modulus 11. See [[File:Aps-wku-modulus11.pdf]] for the words from the horse's mouth.&lt;br /&gt;
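&lt;br /&gt;
As a minimal sketch of handling the gotcha above, the following separates a WKU value into the patent number and its trailing check digit. It assumes only what the bullet states, namely that the rightmost digit is the check digit. The class and method names are hypothetical and not taken from the repository, and the sketch does not verify the check digit, because the weighting scheme is defined in the linked PDF rather than on this page.&lt;br /&gt;
&lt;br /&gt;
```java
// Sketch only: WkuSketch and its methods are hypothetical names, not project code.
class WkuSketch {

    /** Returns the patent number portion of a PATN.WKU value (everything but the last character). */
    static String patentNumber(String wku) {
        String trimmed = requireUsable(wku);
        return trimmed.substring(0, trimmed.length() - 1);
    }

    /** Returns the trailing character, assumed here to be the modulus 11 check digit. */
    static char checkDigit(String wku) {
        String trimmed = requireUsable(wku);
        return trimmed.charAt(trimmed.length() - 1);
    }

    /** Trims padding and rejects values too short to hold a number plus a check digit. */
    private static String requireUsable(String wku) {
        String trimmed = wku.trim();
        if (2 > trimmed.length()) {
            throw new IllegalArgumentException("WKU too short: " + wku);
        }
        return trimmed;
    }

    public static void main(String[] args) {
        String wku = "1234567"; // made-up value for illustration, not a real record
        System.out.println(patentNumber(wku) + " / " + checkDigit(wku));
    }
}
```
&lt;br /&gt;
Keeping the check digit as a separate value, rather than discarding it, leaves room to validate it later against the rule in the PDF.&lt;br /&gt;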
&lt;br /&gt;
=== Patents (Applications) ===&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Patent Application Data Formats&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; | Dates Used !! scope=&amp;quot;col&amp;quot; | Format !! scope=&amp;quot;col&amp;quot; | Location !! scope=&amp;quot;col&amp;quot; | Supported by Parser?&lt;br /&gt;
|-&lt;br /&gt;
|March 15, 2001 to December 2001&lt;br /&gt;
|XML Version 1.5&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/vintage&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Yes, for basic information, inventors, and correspondents&lt;br /&gt;
|-&lt;br /&gt;
|January 2002 to December 2004&lt;br /&gt;
|XML Version 1.6&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/vintage&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2005 to December 2005&lt;br /&gt;
|XML Version 4.0 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2006 to December 2006&lt;br /&gt;
|XML Version 4.1 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2007 to December 2012&lt;br /&gt;
|XML Version 4.2 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2013 to December 2014&lt;br /&gt;
|XML Version 4.3 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2015 to ''Present''&lt;br /&gt;
|XML Version 4.4 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Database ==&lt;br /&gt;
&lt;br /&gt;
Because there isn't a compelling reason not to, I used the existing PostgreSQL infrastructure on the RDP.&lt;br /&gt;
The &amp;quot;Java Way&amp;quot; of interacting with databases is the Java Database Connectivity API (JDBC), an implementation-agnostic API for interacting with databases.&lt;br /&gt;
This project uses the stock [https://jdbc.postgresql.org/ Postgres JDBC], version 42.1.1&lt;br /&gt;
&lt;br /&gt;
=== Create an empty database on RDP ===&lt;br /&gt;
&lt;br /&gt;
To create an empty database, run this command: &amp;lt;code&amp;gt;$ createdb --username=postgres database-name-goes-here # password is tabspaceenter&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Abstraction Layer ===&lt;br /&gt;
&lt;br /&gt;
Since writing raw SQL is a bit cumbersome and error-prone, I have added some abstraction layers that make it much easier to quickly add bulk data. Using the Postgres JDBC driver's &amp;lt;code&amp;gt;CopyManager&amp;lt;/code&amp;gt; class, we buffer rows for a SQL &amp;lt;code&amp;gt;COPY&amp;lt;/code&amp;gt; command in memory (as many as possible) and then flush them to the database in one batch. To understand how the abstraction layers work, see the code in &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\postgres&amp;lt;/code&amp;gt;. See &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\models\GrantedPatent.java&amp;lt;/code&amp;gt; for '''an example of how to extend''' the abstraction layer to deal with more complex scenarios.&lt;br /&gt;
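&lt;br /&gt;
To illustrate the buffer-then-flush idea, here is a minimal, self-contained sketch that is not the project's actual abstraction layer. It collects rows in PostgreSQL's COPY text format (tab-separated columns, a backslash-N marker for NULL) and hands them to a sink once a row limit is reached. The names &amp;lt;code&amp;gt;CopyBufferSketch&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;RowSink&amp;lt;/code&amp;gt; are hypothetical; in real code the sink would presumably wrap a call such as &amp;lt;code&amp;gt;CopyManager.copyIn(String, Reader)&amp;lt;/code&amp;gt; from the PostgreSQL JDBC driver.&lt;br /&gt;
&lt;br /&gt;
```java
// Sketch only: CopyBufferSketch and RowSink are hypothetical names, not classes from the repository.
class CopyBufferSketch {

    /** Receives each flushed chunk; real code would stream it to a COPY ... FROM STDIN command. */
    interface RowSink {
        void accept(String chunk);
    }

    private final StringBuilder buffer = new StringBuilder();
    private final RowSink sink;
    private final int maxRows;
    private int rows = 0;

    CopyBufferSketch(RowSink sink, int maxRows) {
        this.sink = sink;
        this.maxRows = maxRows;
    }

    /** Appends one row in COPY text format and flushes once maxRows rows are buffered. */
    void addRow(String... columns) {
        for (int i = 0; i != columns.length; i++) {
            if (i != 0) {
                buffer.append('\t');
            }
            // A null column becomes the COPY text-format NULL marker.
            buffer.append(columns[i] == null ? "\\N" : escape(columns[i]));
        }
        buffer.append('\n');
        rows++;
        if (rows >= maxRows) {
            flush();
        }
    }

    /** Hands all buffered rows to the sink and clears the buffer; does nothing when empty. */
    void flush() {
        if (rows == 0) {
            return;
        }
        sink.accept(buffer.toString());
        buffer.setLength(0);
        rows = 0;
    }

    /** Escapes the characters that are special in COPY text format. */
    static String escape(String value) {
        return value.replace("\\", "\\\\")
                .replace("\t", "\\t")
                .replace("\n", "\\n")
                .replace("\r", "\\r");
    }

    public static void main(String[] args) {
        // The row values below are made up for illustration.
        CopyBufferSketch copy = new CopyBufferSketch(chunk -> System.out.print(chunk), 2);
        copy.addRow("1234567", "Example title");
        copy.addRow("7654321", null);          // second row reaches the limit and triggers a flush
        copy.addRow("1111111", "tab\there");
        copy.flush();                          // flush the remaining row
    }
}
```
&lt;br /&gt;
Batching rows this way trades memory for fewer round trips to the database, which is the reason to buffer as many rows as memory allows.&lt;br /&gt;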
&lt;br /&gt;
=== New Table Checklist ===&lt;br /&gt;
&lt;br /&gt;
* Create schema DDL SQL code for the new table in &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\db&amp;lt;/code&amp;gt;&lt;br /&gt;
* Run the schema creation&lt;br /&gt;
* Create an enum with the same names for attributes as in the DDL (case-insensitive! prefer all-caps screaming snake case)&lt;br /&gt;
* Create a class which subclasses &amp;lt;code&amp;gt;AbstractInsertableData&amp;lt;/code&amp;gt;&lt;br /&gt;
* Inside that class, create a static class which subclasses &amp;lt;code&amp;gt;AbstractTableMetadata&amp;lt;/code&amp;gt; and has the proper values for getTableName(), getStringColumns(), getIntColumns()&lt;br /&gt;
* (Optional) Implement builder pattern&lt;br /&gt;
* (Optional) Create a custom databasehelper for complex extras (see PatentApplication and GrantedPatent for examples)&lt;br /&gt;
* Write the data to the table (see DatabaseHelper for the pattern I use)&lt;br /&gt;
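&lt;br /&gt;
The shape of the enum and class steps in the checklist can be sketched as follows. This is an illustration under stated assumptions, not the repository's code: the two abstract classes are stand-ins written only so that the example compiles, and the real &amp;lt;code&amp;gt;AbstractInsertableData&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;AbstractTableMetadata&amp;lt;/code&amp;gt; may have different signatures. The table and column names are made up.&lt;br /&gt;
&lt;br /&gt;
```java
// Stand-in for the project's AbstractTableMetadata; the real signatures may differ.
abstract class TableMetadataStandIn {
    abstract String getTableName();
    abstract String[] getStringColumns();
    abstract String[] getIntColumns();
}

// Stand-in for the project's AbstractInsertableData; the real signatures may differ.
abstract class InsertableDataStandIn {
    abstract TableMetadataStandIn getMetadata();
}

// Hypothetical new table, following the checklist: an enum, a subclass, and a nested metadata class.
class ExampleTableSketch extends InsertableDataStandIn {

    /** Attribute names matching the DDL column names, in screaming snake case. */
    enum Column { PATENT_NUMBER, EVENT_CODE, FEE_AMOUNT }

    /** Static nested class describing the table and the type of each column. */
    static class Metadata extends TableMetadataStandIn {
        String getTableName() {
            return "example_table_sketch"; // made-up table name
        }

        String[] getStringColumns() {
            return new String[] {Column.PATENT_NUMBER.name(), Column.EVENT_CODE.name()};
        }

        String[] getIntColumns() {
            return new String[] {Column.FEE_AMOUNT.name()};
        }
    }

    TableMetadataStandIn getMetadata() {
        return new Metadata();
    }

    public static void main(String[] args) {
        TableMetadataStandIn meta = new ExampleTableSketch().getMetadata();
        System.out.println(meta.getTableName() + " has "
                + (meta.getStringColumns().length + meta.getIntColumns().length) + " columns");
    }
}
```
&lt;br /&gt;
Deriving the column names from the enum keeps the Java side and the DDL in step, since a renamed column only has to change in one place.&lt;br /&gt;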
&lt;br /&gt;
== Address Data ==&lt;br /&gt;
&lt;br /&gt;
To get the most granular address data (street level, or at least postcode level) about who owns patents, the path is not so straightforward because of the complicated mapping of ownership to a granted patent.&lt;br /&gt;
This is the final part of this project that I am working on and it is all at the level of SQL.&lt;br /&gt;
See &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\db\joins&amp;lt;/code&amp;gt; for my attempts to create a clean mapping.&lt;br /&gt;
Optimistically speaking, the data generated here should be a superset of the data present in the Patent Assignment Data Restructure project.&lt;br /&gt;
&lt;br /&gt;
Note that as of the beginning of August 2017, this part '''has not been completed.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Intuition ===&lt;br /&gt;
&lt;br /&gt;
Use &amp;lt;code&amp;gt;assignments_longform.last_update_date&amp;lt;/code&amp;gt; to find the latest (or earliest) date of assignment. Then match with &amp;lt;code&amp;gt;properties.docid&amp;lt;/code&amp;gt; on &amp;lt;code&amp;gt;reelno, frameno&amp;lt;/code&amp;gt; to find the patent application ID. With this mapping to granted patents, we can look up the details of the original granted patent. With the right date, reelno, and frameno, we can then match to the &amp;lt;code&amp;gt;assignees&amp;lt;/code&amp;gt; table and get fine-grained addresses.&lt;br /&gt;
&lt;br /&gt;
== Related Pages ==&lt;br /&gt;
&lt;br /&gt;
* [[Redesign_Assignment_and_Patent_Database|Redesign Assignment and Patent Database, Fall 2017 by Shelby]]&lt;br /&gt;
* [[Equivalent_XPath_and_APS_Queries|Equivalent XPath and APS Queries, Summer 2017 by Oliver &amp;amp; Joe]]&lt;br /&gt;
* [[US_Address_Verification|US Address Verification, Summer 2017 based on tables from Assignment Data Restructure]]&lt;br /&gt;
* [[Patent_Assignment_Data_Restructure|Assignment Data Restructure, Spring 2017 by Marcela and Sonia]]&lt;br /&gt;
* [[Redesigning_Patent_Database|Redesigning Patent Database, Spring 2017 by Shelby]]&lt;br /&gt;
* [[Patent_Data_Cleanup_(June_2016)|Patent Data Cleanup, June 2016 by Marcela]]&lt;br /&gt;
* [[Patent_Data|Patent Data, Spring 2016 by Marcela]] &lt;br /&gt;
* [[Lex_Machina|Lex Machina]]&lt;br /&gt;
* [[USPTO_Patent_Litigation_Data|USPTO Patent Litigation Research Dataset by Ed]]&lt;br /&gt;
* [[Patent_Litigation_and_Review|Patent Litigation and Review by Marcela]]&lt;br /&gt;
* [[Patent|Existing Database Schema]]&lt;br /&gt;
* [[Oliver_Chang_(Work_Log)|My Work Log]]&lt;br /&gt;
&lt;br /&gt;
== External Links ==&lt;br /&gt;
&lt;br /&gt;
* Understanding Assignment Data: [https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf USPTO Documentation on their cleanup of this data]&lt;br /&gt;
* [https://bulkdata.uspto.gov/data2/patent/grant/redbook/fulltext/1976/PatentFullTextAPSDoc_GreenBook.pdf USPTO Green Book (APS) Documentation]&lt;br /&gt;
* [https://bulkdata.uspto.gov/ USPTO Bulk Data Storage System (BDSS)]&lt;br /&gt;
* [https://en.wikipedia.org/wiki/Builder_pattern Builder Pattern in Object-Oriented Programming]&lt;br /&gt;
* [http://rdp.mcnaircenter.org/codebase/Repository/ReproduciblePatent Git Repository]&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Reproducible_Patent_Data&amp;diff=22320</id>
		<title>Reproducible Patent Data</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Reproducible_Patent_Data&amp;diff=22320"/>
		<updated>2017-12-07T21:25:47Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* Troubleshooting */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has Image=Uspto web logo.jpg&lt;br /&gt;
|Has title=Reproducible Patent Data&lt;br /&gt;
|Has owner=Oliver Chang&lt;br /&gt;
|Has start date=May 17&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Does subsume=Redesigning Patent Database, Patent Assignment Data Restructure,&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
A continuation of [[Redesigning Patent Database]] that aims to write faster, more centralized code to deal with data from the United States Patent and Trademark Office (USPTO). By having an end-to-end pipeline we can easily reproduce or update data without worrying about unintentional side effects or missing data.&lt;br /&gt;
&lt;br /&gt;
== Quickstart ==&lt;br /&gt;
&lt;br /&gt;
To get up and running with the code, do the following:&lt;br /&gt;
&lt;br /&gt;
# Clone the git project (link at end of page) to your user directory&lt;br /&gt;
# Launch IntelliJ with &amp;gt;= Java 8 and Maven configured (the default version installed on the RDP is set up to do this)&lt;br /&gt;
# Open project in IntelliJ&lt;br /&gt;
# Create an empty database (see [[#Database]])&lt;br /&gt;
# Run the table creation scripts in &amp;lt;code&amp;gt;src/db/schemas/&amp;lt;/code&amp;gt; in your new database&lt;br /&gt;
# Modify the constant &amp;lt;code&amp;gt;DATABASE_NAME&amp;lt;/code&amp;gt; in &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\postgres\DatabaseHelper.java&amp;lt;/code&amp;gt;&lt;br /&gt;
# Run the Driver scripts in IntelliJ with the correct value for &amp;lt;code&amp;gt;DATA_DIRECTORY&amp;lt;/code&amp;gt; (or run &amp;lt;code&amp;gt;RunInitialImport.java&amp;lt;/code&amp;gt; which will do all of the data directories for that patent item type)&lt;br /&gt;
# [Take a really, really long lunch...in total should take no more than five hours to load data on RDP]&lt;br /&gt;
# Run scripts in &amp;lt;code&amp;gt;src/db/constraints&amp;lt;/code&amp;gt; to check data assumptions&lt;br /&gt;
# That's it!&lt;br /&gt;
&lt;br /&gt;
===Troubleshooting===&lt;br /&gt;
&lt;br /&gt;
If you're new to IntelliJ (and even if you're not) you might run into problems with importing the project. &lt;br /&gt;
&lt;br /&gt;
'''Setting Up Project as a Maven project'''&lt;br /&gt;
It should be clear if the project is not set up as a Maven project: when you right-click on RunInitialImport.java, for example, you won't see an option with a green triangle next to it that says &amp;quot;Run 'RunInitialImport.java'&amp;quot;, and the green triangle in the top toolbar will be grayed out. If the project is not set up as a Maven project, you will not be able to run any of the code. To set it up as a Maven project when you import it, follow the [https://www.jetbrains.com/help/idea/maven.html#maven_import_project_start IntelliJ Maven import instructions].&lt;br /&gt;
&lt;br /&gt;
== Directory Layout ==&lt;br /&gt;
&lt;br /&gt;
=== Where is the Data? ===&lt;br /&gt;
&lt;br /&gt;
==== Directories ====&lt;br /&gt;
&lt;br /&gt;
All of the information for this project is located at &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
There are several interesting directories:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;data/downloads/&amp;lt;/code&amp;gt; is USPTO bulkdata, unmodified straight from the scraper&lt;br /&gt;
* &amp;lt;code&amp;gt;data/extracts/&amp;lt;/code&amp;gt; contains a strict subset of the information stored in &amp;lt;code&amp;gt;data/downloads/&amp;lt;/code&amp;gt;. It is the result of running a bulk 7-zip job on that directory to get everything unzipped into a flat directory structure. Note that these files carry the USPTO modification time, since that metadata is stored in the zipfiles. To extract files in this format, select all of the zipfiles and set up an extraction job like the one in this [[media:7zip-params.png|screenshot]]&lt;br /&gt;
* &amp;lt;code&amp;gt;data/backups/&amp;lt;/code&amp;gt; is a 7zip'd backup of the corresponding directory in extracts&lt;br /&gt;
* &amp;lt;code&amp;gt;src/&amp;lt;/code&amp;gt; is the main code repository for the java project&lt;br /&gt;
&lt;br /&gt;
==== Input Files ====&lt;br /&gt;
&lt;br /&gt;
The input files are all of the text-only Red Book files for '''granted patents''' from 1976 to 2016, inclusive. To find a specific year's file, look in&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\extracts\granted\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''To find application data''' from 2001 to 2016, inclusive, look in&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\extracts\applications\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''To find assignment data''', look in &lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\extracts\granted\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''To find maintenance fee data''', look in &lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\downloads\maintenance\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Where is the Code? ===&lt;br /&gt;
&lt;br /&gt;
The code has the same parent directory as the data, so it is at &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src&amp;lt;/code&amp;gt;. You might notice a lot of single-entry directories; this is an idiomatic Java pattern that is used for package separation. If using IntelliJ or some other IDE, these directories are a bit less annoying.&lt;br /&gt;
&lt;br /&gt;
The development environment is Java 8 JDK, IntelliJ Ultimate IDE, Maven build tools, and git VCS.&lt;br /&gt;
&lt;br /&gt;
The git repository can be found at http://rdp.mcnaircenter.org/codebase/Repository/ReproduciblePatent&lt;br /&gt;
&lt;br /&gt;
==== Prior Art ====&lt;br /&gt;
&lt;br /&gt;
This tool is not so concerned with adding new functionality; rather, it aims to take a bunch of spread out Perl scripts and create a faster system that is easier to work with. As such, its functionality is largely stolen from those scripts:&lt;br /&gt;
&lt;br /&gt;
* Downloader: &amp;lt;code&amp;gt;E:\McNair\Software\Scripts\Patent\USPTO_Parser.pl&amp;lt;/code&amp;gt;&lt;br /&gt;
* XML Splitter: &amp;lt;code&amp;gt;E:\McNair\PatentData\splitter.pl&amp;lt;/code&amp;gt;&lt;br /&gt;
* XML Parsing: &amp;lt;code&amp;gt;E:\McNair\PatentData\Processed\xmlparser_4.5_4.4_4.3.pl&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;E:\McNair\PatentData\Processed\*.pm&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In addition, I used several non-standard Java libraries listed below:&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/mashape/unirest-java/ Unirest] for easy HTTP requests (MIT License)&lt;br /&gt;
* [https://github.com/google/guava Google Guava] for immutable collections and Stream utilities (Apache v2.0 License)&lt;br /&gt;
* [https://github.com/jhy/jsoup/ jsoup] for HTML parsing (MIT License)&lt;br /&gt;
* [http://commons.apache.org/proper/commons-codec/ Apache Commons Codec] (Apache v2.0 License)&lt;br /&gt;
* [http://commons.apache.org/proper/commons-lang/ Apache Commons Lang v3] (Apache v2.0 License)&lt;br /&gt;
* [https://mvnrepository.com/artifact/org.jetbrains/annotations/15.0 Jetbrains Annotations] for enhanced null checks (Apache v2.0 License)&lt;br /&gt;
* [http://search.maven.org/#artifactdetails%7Corg.postgresql%7Cpostgresql%7C42.1.1.jre7%7Cbundle PostgreSQL JDBC] (BSD 3-clause per https://github.com/pgjdbc/pgjdbc-jre7/blob/master/LICENSE)&lt;br /&gt;
&lt;br /&gt;
If using Maven, these dependencies are declared in the project and should be set up automatically.&lt;br /&gt;
&lt;br /&gt;
==== Using Code ====&lt;br /&gt;
&lt;br /&gt;
Any file with a line that says &amp;lt;code&amp;gt;public static void main(String[] args) {&amp;lt;/code&amp;gt; can be run as a standalone file. The easiest way to do this is to load the project and then the file in IntelliJ and click the little green play arrow next to this bit of code.&lt;br /&gt;
&lt;br /&gt;
The code can also be run via the standard &amp;lt;code&amp;gt;javac&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;java&amp;lt;/code&amp;gt; commands but since this project has a complicated structure you end up having to run commands like &lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;quot;C:\Program Files\Java\jdk1.8.0_131\bin\java&amp;quot; &amp;quot;-javaagent:C:\Users\OliverC\IntelliJ IDEA 2017.1.3\lib\idea_rt.jar=62364:C:\Users\OliverC\IntelliJ IDEA 2017.1.3\bin&amp;quot; -Dfile.encoding=UTF-8 -classpath &amp;quot;[...contents truncated...];C:\Users\OliverC\.m2\repository\org\postgresql\postgresql\42.1.1\postgresql-42.1.1.jar&amp;quot; org.bakerinstitute.mcnair.uspto_assignments.XmlDriver&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
to include all of the runtime dependencies and it's just not worth it.&lt;br /&gt;
&lt;br /&gt;
==== Altering Code ====&lt;br /&gt;
&lt;br /&gt;
* Use the IntelliJ command Reformat Code (found in the menus at &amp;lt;code&amp;gt;Code &amp;gt; Reformat Code&amp;lt;/code&amp;gt;)&lt;br /&gt;
* Use the optimize imports function found under the same menu&lt;br /&gt;
* Use spaces for indentation&lt;br /&gt;
* Loosely try to keep lines below 120 characters&lt;br /&gt;
* Commit changes to the Git remote repository &amp;quot;bonobo&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== Schema Reconciliation ==&lt;br /&gt;
&lt;br /&gt;
For the work by Joe, see the [[Patent Schema Reconciliation]] page &lt;br /&gt;
&lt;br /&gt;
=== Patents (Granted) ===&lt;br /&gt;
&lt;br /&gt;
See &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\examples\granted&amp;lt;/code&amp;gt; for extracted examples of what specific data is available for a sample of the data.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Granted Patent Data Formats&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; | Dates Used !! scope=&amp;quot;col&amp;quot; | Format !! scope=&amp;quot;col&amp;quot; | Location !! scope=&amp;quot;col&amp;quot; | Supported? !! scope=&amp;quot;col&amp;quot; | Utility !! scope=&amp;quot;col&amp;quot; | Reissue !! scope=&amp;quot;col&amp;quot; | Design !! scope=&amp;quot;col&amp;quot; | Plant&lt;br /&gt;
|-&lt;br /&gt;
|January 1976 to December 2001&lt;br /&gt;
|APS&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/vintage&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;del&amp;gt;January 2001 to December 2001&amp;lt;/del&amp;gt;&lt;br /&gt;
|&amp;lt;del&amp;gt;SGML&amp;lt;/del&amp;gt;&lt;br /&gt;
|Ignored; use concurrently recorded APS data&lt;br /&gt;
|No&lt;br /&gt;
|N/A&lt;br /&gt;
|N/A&lt;br /&gt;
|N/A&lt;br /&gt;
|N/A&lt;br /&gt;
|-&lt;br /&gt;
|January 2002 to December 2004&lt;br /&gt;
|XML Version 2.5&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/blunderyears&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2005 to December 2005&lt;br /&gt;
|XML Version 4.0 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2006 to December 2006&lt;br /&gt;
|XML Version 4.1 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2007 to December 2012&lt;br /&gt;
|XML Version 4.2 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2013 to September 24, 2013&lt;br /&gt;
|XML Version 4.3 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|October 8, 2013 to December 2014&lt;br /&gt;
|XML Version 4.4 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2015 to December 2016&lt;br /&gt;
|XML Version 4.5 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== APS Rosetta Stone ===&lt;br /&gt;
&lt;br /&gt;
The Automated Patent System (APS) is a fixed-width text format used to store historical patent grant data. The documentation for this sucks; there are pages missing at random. Luckily, we only care about the content contained here: [[File:PatentFullTextAPSDoc_GreenBook_pgs13-22.pdf]].&lt;br /&gt;
&lt;br /&gt;
It's worth mentioning that the APS contains an advanced text markup system for chemical formulae, basic text markup, tables, etc. that can lead to seemingly garbled text that is perfectly well-formed.&lt;br /&gt;
&lt;br /&gt;
==== APS Gotchas ====&lt;br /&gt;
&lt;br /&gt;
* PATN.WKU is the granted patent number. It is 7 digits while the spec promises 6 digits. The rightmost digit is a check digit modulus 11. See [[File:Aps-wku-modulus11.pdf]] for the words from the horse's mouth.&lt;br /&gt;
&lt;br /&gt;
=== Patents (Applications) ===&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Patent Application Data Formats&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; | Dates Used !! scope=&amp;quot;col&amp;quot; | Format !! scope=&amp;quot;col&amp;quot; | Location !! scope=&amp;quot;col&amp;quot; | Supported by Parser?&lt;br /&gt;
|-&lt;br /&gt;
|March 15, 2001 to December 2001&lt;br /&gt;
|XML Version 1.5&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/vintage&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Yes, for basic information, inventors, and correspondents&lt;br /&gt;
|-&lt;br /&gt;
|January 2002 to December 2004&lt;br /&gt;
|XML Version 1.6&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/vintage&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2005 to December 2005&lt;br /&gt;
|XML Version 4.0 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2006 to December 2006&lt;br /&gt;
|XML Version 4.1 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2007 to December 2012&lt;br /&gt;
|XML Version 4.2 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2013 to December 2014&lt;br /&gt;
|XML Version 4.3 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2015 to ''Present''&lt;br /&gt;
|XML Version 4.4 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Database ==&lt;br /&gt;
&lt;br /&gt;
Because there isn't a compelling reason not to, I used the existing PostgreSQL infrastructure on the RDP.&lt;br /&gt;
The &amp;quot;Java Way&amp;quot; of interacting with databases is the Java Database Connectivity API (JDBC), an implementation-agnostic API for interacting with databases.&lt;br /&gt;
This project uses the stock [https://jdbc.postgresql.org/ Postgres JDBC] driver, version 42.1.1.&lt;br /&gt;
&lt;br /&gt;
=== Create an empty database on RDP ===&lt;br /&gt;
&lt;br /&gt;
To create an empty database, run this command: &amp;lt;code&amp;gt;$ createdb --username=postgres database-name-goes-here # password is tabspaceenter&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Abstraction Layer ===&lt;br /&gt;
&lt;br /&gt;
Since writing raw SQL is a bit cumbersome and error-prone, I have added some abstraction layers that make it much easier to quickly add bulk data. By using Postgres's &amp;lt;code&amp;gt;CopyManager&amp;lt;/code&amp;gt; class, we buffer SQL copy commands in memory (as many as possible) and then flush these rows. To understand how the abstraction layers work, see the code in &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\postgres&amp;lt;/code&amp;gt;. See &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\models\GrantedPatent.java&amp;lt;/code&amp;gt; for '''an example of how to extend''' the abstraction layer to deal with more complex scenarios.&lt;br /&gt;
&lt;br /&gt;
=== New Table Checklist ===&lt;br /&gt;
&lt;br /&gt;
* Create schema DDL SQL code for the new table in &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\db&amp;lt;/code&amp;gt;&lt;br /&gt;
* Run the schema creation&lt;br /&gt;
* Create an enum with the same names for attributes as in the DDL (case-insensitive! prefer all-caps screaming snake case)&lt;br /&gt;
* Create a class which subclasses &amp;lt;code&amp;gt;AbstractInsertableData&amp;lt;/code&amp;gt;&lt;br /&gt;
* Inside that class, create a static class which subclasses &amp;lt;code&amp;gt;AbstractTableMetadata&amp;lt;/code&amp;gt; and has the proper values for getTableName(), getStringColumns(), getIntColumns()&lt;br /&gt;
* (Optional) Implement builder pattern&lt;br /&gt;
* (Optional) Create a custom databasehelper for complex extras (see PatentApplication and GrantedPatent for examples)&lt;br /&gt;
* Write the data to the table (see DatabaseHelper for the pattern I use)&lt;br /&gt;
&lt;br /&gt;
== Address Data ==&lt;br /&gt;
&lt;br /&gt;
Getting the most granular address data (street level, or at least postcode level) about who owns patents is not straightforward because of the complicated mapping of ownership to a granted patent.&lt;br /&gt;
This is the final part of this project that I am working on and it is all at the level of SQL.&lt;br /&gt;
See &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\db\joins&amp;lt;/code&amp;gt; for my attempts to create a clean mapping.&lt;br /&gt;
Optimistically speaking, the data generated here should be a superset of the data present in the Patent Assignment Data Restructure project.&lt;br /&gt;
&lt;br /&gt;
Note that as of the beginning of August 2017, this part '''has not been completed.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Intuition ===&lt;br /&gt;
&lt;br /&gt;
Use &amp;lt;code&amp;gt;assignments_longform.last_update_date&amp;lt;/code&amp;gt; to find current/latest (or first/earliest) date of assignment. Then match with &amp;lt;code&amp;gt;properties.docid&amp;lt;/code&amp;gt; on &amp;lt;code&amp;gt;reelno, frameno&amp;lt;/code&amp;gt; to find patent application id. With this mapping to granted patents, we can discover the details of the original granted patent. And with the right date and reelno and frameno, we can match to the &amp;lt;code&amp;gt;assignees&amp;lt;/code&amp;gt; table and get fine granularity addresses.&lt;br /&gt;
&lt;br /&gt;
== Related Pages ==&lt;br /&gt;
&lt;br /&gt;
* [[Redesign_Assignment_and_Patent_Database|Redesign Assignment and Patent Database, Fall 2017 by Shelby]]&lt;br /&gt;
* [[Equivalent_XPath_and_APS_Queries|Equivalent XPath and APS Queries, Summer 2017 by Oliver &amp;amp; Joe]]&lt;br /&gt;
* [[US_Address_Verification|US Address Verification, Summer 2017 based on tables from Assignment Data Restructure]]&lt;br /&gt;
* [[Patent_Assignment_Data_Restructure|Assignment Data Restructure, Spring 2017 by Marcela and Sonia]]&lt;br /&gt;
* [[Redesigning_Patent_Database|Redesigning Patent Database, Spring 2017 by Shelby]]&lt;br /&gt;
* [[Patent_Data_Cleanup_(June_2016)|Patent Data Cleanup, June 2016 by Marcela]]&lt;br /&gt;
* [[Patent_Data|Patent Data, Spring 2016 by Marcela]] &lt;br /&gt;
* [[Lex_Machina|Lex Machina]]&lt;br /&gt;
* [[USPTO_Patent_Litigation_Data|USPTO Patent Litigation Research Dataset by Ed]]&lt;br /&gt;
* [[Patent_Litigation_and_Review|Patent Litigation and Review by Marcela]]&lt;br /&gt;
* [[Patent|Existing Database Schema]]&lt;br /&gt;
* [[Oliver_Chang_(Work_Log)|My Work Log]]&lt;br /&gt;
&lt;br /&gt;
== External Links ==&lt;br /&gt;
&lt;br /&gt;
* Understanding Assignment Data: [https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf USPTO Documentation on their cleanup of this data]&lt;br /&gt;
* [https://bulkdata.uspto.gov/data2/patent/grant/redbook/fulltext/1976/PatentFullTextAPSDoc_GreenBook.pdf USPTO Green Book (APS) Documentation]&lt;br /&gt;
* [https://bulkdata.uspto.gov/ USPTO Bulk Data Storage System (BDSS)]&lt;br /&gt;
* [https://en.wikipedia.org/wiki/Builder_pattern Builder Pattern in Object-Oriented Programming]&lt;br /&gt;
* [http://rdp.mcnaircenter.org/codebase/Repository/ReproduciblePatent Git Repository]&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Reproducible_Patent_Data&amp;diff=22319</id>
		<title>Reproducible Patent Data</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Reproducible_Patent_Data&amp;diff=22319"/>
		<updated>2017-12-07T21:25:01Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* Quickstart */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has Image=Uspto web logo.jpg&lt;br /&gt;
|Has title=Reproducible Patent Data&lt;br /&gt;
|Has owner=Oliver Chang&lt;br /&gt;
|Has start date=May 17&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Does subsume=Redesigning Patent Database, Patent Assignment Data Restructure,&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
A continuation of [[Redesigning Patent Database]] that aims to write faster, more centralized code to deal with data from the United States Patent and Trademark Office (USPTO). By having an end-to-end pipeline we can easily reproduce or update data without worrying about unintentional side effects or missing data.&lt;br /&gt;
&lt;br /&gt;
== Quickstart ==&lt;br /&gt;
&lt;br /&gt;
To get up and running with the code, do the following:&lt;br /&gt;
&lt;br /&gt;
# Clone the git project (link at end of page) to your user directory&lt;br /&gt;
# Launch IntelliJ with &amp;gt;= Java 8 and Maven configured (the default version installed on the RDP is set up to do this)&lt;br /&gt;
# Open project in IntelliJ&lt;br /&gt;
# Create an empty database (see [[#Database]])&lt;br /&gt;
# Run the table creation scripts in &amp;lt;code&amp;gt;src/db/schemas/&amp;lt;/code&amp;gt; in your new database&lt;br /&gt;
# Modify the constant &amp;lt;code&amp;gt;DATABASE_NAME&amp;lt;/code&amp;gt; in &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\postgres\DatabaseHelper.java&amp;lt;/code&amp;gt;&lt;br /&gt;
# Run the Driver scripts in IntelliJ with the correct value for &amp;lt;code&amp;gt;DATA_DIRECTORY&amp;lt;/code&amp;gt; (or run &amp;lt;code&amp;gt;RunInitialImport.java&amp;lt;/code&amp;gt; which will do all of the data directories for that patent item type)&lt;br /&gt;
# [Take a really, really long lunch...in total should take no more than five hours to load data on RDP]&lt;br /&gt;
# Run scripts in &amp;lt;code&amp;gt;src/db/constraints&amp;lt;/code&amp;gt; to check data assumptions&lt;br /&gt;
# That's it!&lt;br /&gt;
&lt;br /&gt;
===Troubleshooting===&lt;br /&gt;
&lt;br /&gt;
If you're new to IntelliJ (and even if you're not) you might run into problems with importing the project. &lt;br /&gt;
&lt;br /&gt;
'''Setting Up Project as a Maven project'''&lt;br /&gt;
It should be clear if the project is not set up as a Maven project: when you right-click on RunInitialImport.java, for example, you won't see an option with a green triangle next to it that says &amp;quot;Run 'RunInitialImport.java'&amp;quot;, and the green triangle in the top toolbar will be grayed out. If the project is not set up as a Maven project, you will not be able to run any of the code. To set up the project as a Maven project when you import it, follow the instructions at https://www.jetbrains.com/help/idea/maven.html#maven_import_project_start&lt;br /&gt;
&lt;br /&gt;
== Directory Layout ==&lt;br /&gt;
&lt;br /&gt;
=== Where is the Data? ===&lt;br /&gt;
&lt;br /&gt;
==== Directories ====&lt;br /&gt;
&lt;br /&gt;
All of the information for this project is located at &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
There are several interesting directories:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;data/downloads/&amp;lt;/code&amp;gt; is USPTO bulkdata, unmodified straight from the scraper&lt;br /&gt;
* &amp;lt;code&amp;gt;data/extracts/&amp;lt;/code&amp;gt; is a directory of a strict subset of the information stored in &amp;lt;code&amp;gt;data/downloads/&amp;lt;/code&amp;gt;. It is the result of running a bulk 7-zip job on that directory to get everything unzipped in a flat data structure. Note that these files have the USPTO modified-by time since that metadata is stored in the zipfiles. To extract files in this nice format, select all of the zipfiles and set up an extraction job like in this [[media:7zip-params.png|screenshot]]&lt;br /&gt;
* &amp;lt;code&amp;gt;data/backups/&amp;lt;/code&amp;gt; is a 7zip'd backup of the corresponding directory in extracts&lt;br /&gt;
* &amp;lt;code&amp;gt;src/&amp;lt;/code&amp;gt; is the main code repository for the java project&lt;br /&gt;
&lt;br /&gt;
==== Input Files ====&lt;br /&gt;
&lt;br /&gt;
This project uses all of the text-only Red Book files for '''granted patents''' from 1976 to 2016, inclusive. To find a specific year's data file, look in&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\extracts\granted\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''To find application data''' from 2001 to 2016, inclusive, look in&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\extracts\applications\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''To find assignment data''', look in &lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\extracts\granted\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''To find maintenance fee data''', look in &lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\downloads\maintenance\&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Where is the Code? ===&lt;br /&gt;
&lt;br /&gt;
The code has the same parent directory as the data, so it is at &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src&amp;lt;/code&amp;gt;. You might notice a lot of single-entry directories; this is an idiomatic Java pattern that is used for package separation. If using IntelliJ or some other IDE, these directories are a bit less annoying.&lt;br /&gt;
&lt;br /&gt;
The development environment is Java 8 JDK, IntelliJ Ultimate IDE, Maven build tools, and git VCS.&lt;br /&gt;
&lt;br /&gt;
The git repository can be found at http://rdp.mcnaircenter.org/codebase/Repository/ReproduciblePatent&lt;br /&gt;
&lt;br /&gt;
==== Prior Art ====&lt;br /&gt;
&lt;br /&gt;
This tool is not so concerned with adding new functionality; rather, it aims to take a bunch of spread out Perl scripts and create a faster system that is easier to work with. As such, its functionality is largely stolen from those scripts:&lt;br /&gt;
&lt;br /&gt;
* Downloader: &amp;lt;code&amp;gt;E:\McNair\Software\Scripts\Patent\USPTO_Parser.pl&amp;lt;/code&amp;gt;&lt;br /&gt;
* XML Splitter: &amp;lt;code&amp;gt;E:\McNair\PatentData\splitter.pl&amp;lt;/code&amp;gt;&lt;br /&gt;
* XML Parsing: &amp;lt;code&amp;gt;E:\McNair\PatentData\Processed\xmlparser_4.5_4.4_4.3.pl&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;E:\McNair\PatentData\Processed\*.pm&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In addition, I used several non-standard Java libraries listed below:&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/mashape/unirest-java/ Unirest] for easy HTTP requests (MIT License)&lt;br /&gt;
* [https://github.com/google/guava Google Guava] for immutable collections and Stream utilities (Apache v2.0 License)&lt;br /&gt;
* [https://github.com/jhy/jsoup/ jsoup] for HTML parsing (MIT License)&lt;br /&gt;
* [http://commons.apache.org/proper/commons-codec/ Apache Commons Codec] (Apache v2.0 License)&lt;br /&gt;
* [http://commons.apache.org/proper/commons-lang/ Apache Commons Lang v3] (Apache v2.0 License)&lt;br /&gt;
* [https://mvnrepository.com/artifact/org.jetbrains/annotations/15.0 Jetbrains Annotations] for enhanced null checks (Apache v2.0 License)&lt;br /&gt;
* [http://search.maven.org/#artifactdetails%7Corg.postgresql%7Cpostgresql%7C42.1.1.jre7%7Cbundle PostgreSQL JDBC] (BSD 3-clause per https://github.com/pgjdbc/pgjdbc-jre7/blob/master/LICENSE)&lt;br /&gt;
&lt;br /&gt;
If using Maven, these dependencies are already listed and should be set up automatically.&lt;br /&gt;
&lt;br /&gt;
==== Using Code ====&lt;br /&gt;
&lt;br /&gt;
Any file with a line that says &amp;lt;code&amp;gt;public static void main(String[] args) {&amp;lt;/code&amp;gt; can be run as a standalone file. The easiest way to do this is to load the project and then the file in IntelliJ and click the little green play arrow next to this bit of code.&lt;br /&gt;
&lt;br /&gt;
The code can also be run via the standard &amp;lt;code&amp;gt;javac&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;java&amp;lt;/code&amp;gt; commands but since this project has a complicated structure you end up having to run commands like &lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;&amp;quot;C:\Program Files\Java\jdk1.8.0_131\bin\java&amp;quot; &amp;quot;-javaagent:C:\Users\OliverC\IntelliJ IDEA 2017.1.3\lib\idea_rt.jar=62364:C:\Users\OliverC\IntelliJ IDEA 2017.1.3\bin&amp;quot; -Dfile.encoding=UTF-8 -classpath &amp;quot;[...contents truncated...];C:\Users\OliverC\.m2\repository\org\postgresql\postgresql\42.1.1\postgresql-42.1.1.jar&amp;quot; org.bakerinstitute.mcnair.uspto_assignments.XmlDriver&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
to include all of the runtime dependencies and it's just not worth it.&lt;br /&gt;
&lt;br /&gt;
==== Altering Code ====&lt;br /&gt;
&lt;br /&gt;
* Use the IntelliJ command Reformat Code (found in the menus at &amp;lt;code&amp;gt;Code &amp;gt; Reformat Code&amp;lt;/code&amp;gt;)&lt;br /&gt;
* Use the optimize imports function found under the same menu&lt;br /&gt;
* Use spaces for indentation&lt;br /&gt;
* Loosely try to keep lines below 120 characters&lt;br /&gt;
* Commit changes to the Git remote repository &amp;quot;bonobo&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== Schema Reconciliation ==&lt;br /&gt;
&lt;br /&gt;
For the work by Joe, see the [[Patent Schema Reconciliation]] page &lt;br /&gt;
&lt;br /&gt;
=== Patents (Granted) ===&lt;br /&gt;
&lt;br /&gt;
See &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\data\examples\granted&amp;lt;/code&amp;gt; for extracted examples of what specific data is available for a sample of the data.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Granted Patent Data Formats&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; | Dates Used !! scope=&amp;quot;col&amp;quot; | Format !! scope=&amp;quot;col&amp;quot; | Location !! scope=&amp;quot;col&amp;quot; | Supported? !! scope=&amp;quot;col&amp;quot; | Utility !! scope=&amp;quot;col&amp;quot; | Reissue !! scope=&amp;quot;col&amp;quot; | Design !! scope=&amp;quot;col&amp;quot; | Plant&lt;br /&gt;
|-&lt;br /&gt;
|January 1976 to December 2001&lt;br /&gt;
|APS&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/vintage&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|&amp;lt;del&amp;gt;January 2001 to December 2001&amp;lt;/del&amp;gt;&lt;br /&gt;
|&amp;lt;del&amp;gt;SGML&amp;lt;/del&amp;gt;&lt;br /&gt;
|Ignored; use concurrently recorded APS data&lt;br /&gt;
|No&lt;br /&gt;
|N/A&lt;br /&gt;
|N/A&lt;br /&gt;
|N/A&lt;br /&gt;
|N/A&lt;br /&gt;
|-&lt;br /&gt;
|January 2002 to December 2004&lt;br /&gt;
|XML Version 2.5&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/blunderyears&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2005 to December 2005&lt;br /&gt;
|XML Version 4.0 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2006 to December 2006&lt;br /&gt;
|XML Version 4.1 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2007 to December 2012&lt;br /&gt;
|XML Version 4.2 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2013 to September 24, 2013&lt;br /&gt;
|XML Version 4.3 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|October 8, 2013 to December 2014&lt;br /&gt;
|XML Version 4.4 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|-&lt;br /&gt;
|January 2015 to December 2016&lt;br /&gt;
|XML Version 4.5 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/granted/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: green; color: white;&amp;quot; | Yes&lt;br /&gt;
|✓&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|~&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== APS Rosetta Stone ===&lt;br /&gt;
&lt;br /&gt;
The Automated Patent System (APS) is a fixed-width text format used to store historical patent grant data. The documentation for this sucks; there are pages missing at random. Luckily, we only care about the content contained here: [[File:PatentFullTextAPSDoc_GreenBook_pgs13-22.pdf]].&lt;br /&gt;
&lt;br /&gt;
It's worth mentioning that the APS contains an advanced text markup system for chemical formulae, basic text markup, tables, etc. that can lead to seemingly garbled text that is perfectly well-formed.&lt;br /&gt;
&lt;br /&gt;
==== APS Gotchas ====&lt;br /&gt;
&lt;br /&gt;
* PATN.WKU is the granted patent number. It is 7 digits while the spec promises 6 digits. The rightmost digit is a check digit modulus 11. See [[File:Aps-wku-modulus11.pdf]] for the words from the horse's mouth.&lt;br /&gt;
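&lt;br /&gt;
As a minimal sketch of handling this gotcha, the Java snippet below drops the rightmost character of a WKU value to recover the patent number. The class and method names are hypothetical (they are not taken from the project's code), the sample value is made up, and the check digit is simply discarded rather than verified.&lt;br /&gt;
&lt;br /&gt;
```java
class WkuSketch {
    // Hypothetical helper: returns the WKU value without its rightmost
    // character, which the APS documentation describes as a check digit.
    static String stripCheckDigit(String wku) {
        String trimmed = wku.trim();
        if (trimmed.isEmpty()) {
            throw new IllegalArgumentException("WKU value is empty");
        }
        return trimmed.substring(0, trimmed.length() - 1);
    }

    public static void main(String[] args) {
        // Made-up sample value, not a real patent number.
        System.out.println(stripCheckDigit("1234567"));
    }
}
```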
&lt;br /&gt;
=== Patents (Applications) ===&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+Patent Application Data Formats&lt;br /&gt;
|-&lt;br /&gt;
! scope=&amp;quot;col&amp;quot; | Dates Used !! scope=&amp;quot;col&amp;quot; | Format !! scope=&amp;quot;col&amp;quot; | Location !! scope=&amp;quot;col&amp;quot; | Supported by Parser?&lt;br /&gt;
|-&lt;br /&gt;
|March 15, 2001 to December 2001&lt;br /&gt;
|XML Version 1.5&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/vintage&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Yes, for basic information, inventors, and correspondents&lt;br /&gt;
|-&lt;br /&gt;
|January 2002 to December 2004&lt;br /&gt;
|XML Version 1.6&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/vintage&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2005 to December 2005&lt;br /&gt;
|XML Version 4.0 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2006 to December 2006&lt;br /&gt;
|XML Version 4.1 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2007 to December 2012&lt;br /&gt;
|XML Version 4.2 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2013 to December 2014&lt;br /&gt;
|XML Version 4.3 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|-&lt;br /&gt;
|January 2015 to ''Present''&lt;br /&gt;
|XML Version 4.4 ICE&lt;br /&gt;
|&amp;lt;code&amp;gt;data/extracts/applications/modern&amp;lt;/code&amp;gt;&lt;br /&gt;
|style=&amp;quot;background: yellow;&amp;quot; | Ditto&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Database ==&lt;br /&gt;
&lt;br /&gt;
Because there isn't a compelling reason not to, I used the existing PostgreSQL infrastructure on the RDP.&lt;br /&gt;
The &amp;quot;Java Way&amp;quot; of interacting with databases is the Java Database Connectivity API (JDBC), an implementation-agnostic API for interacting with databases.&lt;br /&gt;
This project uses the stock [https://jdbc.postgresql.org/ Postgres JDBC] driver, version 42.1.1.&lt;br /&gt;
&lt;br /&gt;
=== Create an empty database on RDP ===&lt;br /&gt;
&lt;br /&gt;
To create an empty database, run this command: &amp;lt;code&amp;gt;$ createdb --username=postgres database-name-goes-here # password is tabspaceenter&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Abstraction Layer ===&lt;br /&gt;
&lt;br /&gt;
Since writing raw SQL is a bit cumbersome and error-prone, I have added some abstraction layers that make it much easier to quickly add bulk data. By using Postgres's &amp;lt;code&amp;gt;CopyManager&amp;lt;/code&amp;gt; class, we buffer SQL copy commands in memory (as many as possible) and then flush these rows. To understand how the abstraction layers work, see the code in &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\postgres&amp;lt;/code&amp;gt;. See &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\models\GrantedPatent.java&amp;lt;/code&amp;gt; for '''an example of how to extend''' the abstraction layer to deal with more complex scenarios.&lt;br /&gt;
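&lt;br /&gt;
The buffering idea can be sketched without a database connection. The class below is a hypothetical stand-in, not the project's actual abstraction layer: it accumulates rows in PostgreSQL's tab-delimited COPY text format and hands the whole batch over once an assumed row limit is reached. Here the flushed text is only stored so that the example runs on its own.&lt;br /&gt;
&lt;br /&gt;
```java
class CopyBufferSketch {
    private final StringBuilder buffer = new StringBuilder();
    private final int maxRows;      // assumed flush threshold, in rows
    private int rowCount = 0;
    private int flushCount = 0;
    private String lastFlushed = "";

    CopyBufferSketch(int maxRows) {
        this.maxRows = maxRows;
    }

    // Escapes one value for COPY text format; null becomes \N.
    static String escape(String value) {
        if (value == null) {
            return "\\N";
        }
        return value.replace("\\", "\\\\")
                .replace("\t", "\\t")
                .replace("\n", "\\n")
                .replace("\r", "\\r");
    }

    // Appends one row and flushes automatically when the buffer is full.
    void addRow(String... columns) {
        String separator = "";
        for (String column : columns) {
            buffer.append(separator).append(escape(column));
            separator = "\t";
        }
        buffer.append('\n');
        rowCount++;
        if (rowCount == maxRows) {
            flush();
        }
    }

    // Hands the buffered rows over and resets the buffer.
    void flush() {
        if (rowCount == 0) {
            return;
        }
        // Real code would send buffer.toString() to the database here.
        lastFlushed = buffer.toString();
        flushCount++;
        buffer.setLength(0);
        rowCount = 0;
    }

    int getFlushCount() {
        return flushCount;
    }

    String getLastFlushed() {
        return lastFlushed;
    }

    public static void main(String[] args) {
        CopyBufferSketch sketch = new CopyBufferSketch(2);
        sketch.addRow("123456", "A made-up title");
        sketch.addRow("654321", null);
        System.out.println("flushes: " + sketch.getFlushCount());
        System.out.print(sketch.getLastFlushed());
    }
}
```
&lt;br /&gt;
In real code, the flushed text would presumably be wrapped in a &amp;lt;code&amp;gt;StringReader&amp;lt;/code&amp;gt; and passed to pgJDBC's &amp;lt;code&amp;gt;CopyManager.copyIn(String, Reader)&amp;lt;/code&amp;gt;; flushing in large batches keeps the number of round trips to the database low.&lt;br /&gt;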
&lt;br /&gt;
=== New Table Checklist ===&lt;br /&gt;
&lt;br /&gt;
* Create schema DDL SQL code for the new table in &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\db&amp;lt;/code&amp;gt;&lt;br /&gt;
* Run the schema creation&lt;br /&gt;
* Create an enum with the same names for attributes as in the DDL (case-insensitive! prefer all-caps screaming snake case)&lt;br /&gt;
* Create a class which subclasses &amp;lt;code&amp;gt;AbstractInsertableData&amp;lt;/code&amp;gt;&lt;br /&gt;
* Inside that class, create a static class which subclasses &amp;lt;code&amp;gt;AbstractTableMetadata&amp;lt;/code&amp;gt; and has the proper values for getTableName(), getStringColumns(), getIntColumns()&lt;br /&gt;
* (Optional) Implement builder pattern&lt;br /&gt;
* (Optional) Create a custom databasehelper for complex extras (see PatentApplication and GrantedPatent for examples)&lt;br /&gt;
* Write the data to the table (see DatabaseHelper for the pattern I use)&lt;br /&gt;
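&lt;br /&gt;
The enum and metadata steps of the checklist might look roughly like the sketch below. Everything in it is an assumption made for illustration: &amp;lt;code&amp;gt;TableMetadataSketch&amp;lt;/code&amp;gt; is a simplified stand-in for the project's &amp;lt;code&amp;gt;AbstractTableMetadata&amp;lt;/code&amp;gt; (whose real signatures may differ), the table and column names are made up, and the metadata class is shown at top level rather than nested inside an &amp;lt;code&amp;gt;AbstractInsertableData&amp;lt;/code&amp;gt; subclass.&lt;br /&gt;
&lt;br /&gt;
```java
// Simplified stand-in for the project's AbstractTableMetadata; the real
// class and its method signatures may differ.
abstract class TableMetadataSketch {
    abstract String getTableName();

    abstract String[] getStringColumns();

    abstract String[] getIntColumns();

    // Builds a COPY statement listing string columns first, then int columns.
    String copyStatement() {
        StringBuilder sql = new StringBuilder("COPY " + getTableName() + " (");
        String separator = "";
        for (String column : getStringColumns()) {
            sql.append(separator).append(column);
            separator = ", ";
        }
        for (String column : getIntColumns()) {
            sql.append(separator).append(column);
            separator = ", ";
        }
        return sql.append(") FROM STDIN").toString();
    }
}

// Hypothetical table: the enum constants mirror the DDL column names.
enum ExampleColumn { PATENT_NUMBER, TITLE, CLAIM_COUNT }

class ExampleTableMetadata extends TableMetadataSketch {
    String getTableName() {
        return "example_table";
    }

    String[] getStringColumns() {
        return new String[] {ExampleColumn.PATENT_NUMBER.name(), ExampleColumn.TITLE.name()};
    }

    String[] getIntColumns() {
        return new String[] {ExampleColumn.CLAIM_COUNT.name()};
    }

    public static void main(String[] args) {
        System.out.println(new ExampleTableMetadata().copyStatement());
    }
}
```
&lt;br /&gt;
Deriving the column lists from the enum keeps the Java names and the DDL names in one place; because PostgreSQL folds unquoted identifiers to lower case, the all-caps enum names still match lower-case column names in the schema.&lt;br /&gt;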
&lt;br /&gt;
== Address Data ==&lt;br /&gt;
&lt;br /&gt;
Getting the most granular address data (street level, or at least postcode level) about who owns patents is not straightforward because of the complicated mapping of ownership to a granted patent.&lt;br /&gt;
This is the final part of this project that I am working on and it is all at the level of SQL.&lt;br /&gt;
See &amp;lt;code&amp;gt;E:\McNair\Projects\SimplerPatentData\src\db\joins&amp;lt;/code&amp;gt; for my attempts to create a clean mapping.&lt;br /&gt;
Optimistically speaking, the data generated here should be a superset of the data present in the Patent Assignment Data Restructure project.&lt;br /&gt;
&lt;br /&gt;
Note that as of the beginning of August 2017, this part '''has not been completed.'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Intuition ===&lt;br /&gt;
&lt;br /&gt;
Use &amp;lt;code&amp;gt;assignments_longform.last_update_date&amp;lt;/code&amp;gt; to find current/latest (or first/earliest) date of assignment. Then match with &amp;lt;code&amp;gt;properties.docid&amp;lt;/code&amp;gt; on &amp;lt;code&amp;gt;reelno, frameno&amp;lt;/code&amp;gt; to find patent application id. With this mapping to granted patents, we can discover the details of the original granted patent. And with the right date and reelno and frameno, we can match to the &amp;lt;code&amp;gt;assignees&amp;lt;/code&amp;gt; table and get fine granularity addresses.&lt;br /&gt;
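&lt;br /&gt;
The matching logic described above can be sketched in memory as follows. This is only an illustration of the intended join, not the SQL in the joins directory: the row classes and the address field are hypothetical, dates are assumed to be ISO yyyy-MM-dd strings (so they sort correctly as text), and only reelno, frameno, last_update_date, and docid are names taken from the description.&lt;br /&gt;

```java
// Hypothetical in-memory rows standing in for the three tables.
class AssignmentRow {
    final int reelno;
    final int frameno;
    final String lastUpdateDate; // assumed ISO yyyy-MM-dd

    AssignmentRow(int reelno, int frameno, String lastUpdateDate) {
        this.reelno = reelno;
        this.frameno = frameno;
        this.lastUpdateDate = lastUpdateDate;
    }
}

class PropertyRow {
    final int reelno;
    final int frameno;
    final String docid;

    PropertyRow(int reelno, int frameno, String docid) {
        this.reelno = reelno;
        this.frameno = frameno;
        this.docid = docid;
    }
}

class AssigneeRow {
    final int reelno;
    final int frameno;
    final String address; // hypothetical column name

    AssigneeRow(int reelno, int frameno, String address) {
        this.reelno = reelno;
        this.frameno = frameno;
        this.address = address;
    }
}

class AddressLookup {
    // Returns the assignee address on the most recently updated assignment
    // recorded for the given docid, or null if nothing matches.
    static String latestAddress(String docid, AssignmentRow[] assignments,
                                PropertyRow[] properties, AssigneeRow[] assignees) {
        AssignmentRow latest = null;
        for (PropertyRow p : properties) {
            if (!p.docid.equals(docid)) {
                continue;
            }
            // Match assignments to the property on (reelno, frameno).
            for (AssignmentRow a : assignments) {
                if (a.reelno != p.reelno || a.frameno != p.frameno) {
                    continue;
                }
                if (latest == null
                        || a.lastUpdateDate.compareTo(latest.lastUpdateDate) > 0) {
                    latest = a;
                }
            }
        }
        if (latest == null) {
            return null;
        }
        // Match the chosen assignment to its assignee on (reelno, frameno).
        for (AssigneeRow g : assignees) {
            if (g.reelno == latest.reelno) {
                if (g.frameno == latest.frameno) {
                    return g.address;
                }
            }
        }
        return null;
    }
}
```

Flipping the comparison to keep the smallest date instead would give the first/earliest assignment rather than the current one.&lt;br /&gt;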
&lt;br /&gt;
== Related Pages ==&lt;br /&gt;
&lt;br /&gt;
* [[Redesign_Assignment_and_Patent_Database|Redesign Assignment and Patent Database, Fall 2017 by Shelby]]&lt;br /&gt;
* [[Equivalent_XPath_and_APS_Queries|Equivalent XPath and APS Queries, Summer 2017 by Oliver &amp;amp; Joe]]&lt;br /&gt;
* [[US_Address_Verification|US Address Verification, Summer 2017 based on tables from Assignment Data Restructure]]&lt;br /&gt;
* [[Patent_Assignment_Data_Restructure|Assignment Data Restructure, Spring 2017 by Marcela and Sonia]]&lt;br /&gt;
* [[Redesigning_Patent_Database|Redesigning Patent Database, Spring 2017 by Shelby]]&lt;br /&gt;
* [[Patent_Data_Cleanup_(June_2016)|Patent Data Cleanup, June 2016 by Marcela]]&lt;br /&gt;
* [[Patent_Data|Patent Data, Spring 2016 by Marcela]] &lt;br /&gt;
* [[Lex_Machina|Lex Machina]]&lt;br /&gt;
* [[USPTO_Patent_Litigation_Data|USPTO Patent Litigation Research Dataset by Ed]]&lt;br /&gt;
* [[Patent_Litigation_and_Review|Patent Litigation and Review by Marcela]]&lt;br /&gt;
* [[Patent|Existing Database Schema]]&lt;br /&gt;
* [[Oliver_Chang_(Work_Log)|My Work Log]]&lt;br /&gt;
&lt;br /&gt;
== External Links ==&lt;br /&gt;
&lt;br /&gt;
* Understanding Assignment Data: [https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf USPTO Documentation on their cleanup of this data]&lt;br /&gt;
* [https://bulkdata.uspto.gov/data2/patent/grant/redbook/fulltext/1976/PatentFullTextAPSDoc_GreenBook.pdf USPTO Green Book (APS) Documentation]&lt;br /&gt;
* [https://bulkdata.uspto.gov/ USPTO Bulk Data Storage System (BDSS)]&lt;br /&gt;
* [https://en.wikipedia.org/wiki/Builder_pattern Builder Pattern in Object-Oriented Programming]&lt;br /&gt;
* [http://rdp.mcnaircenter.org/codebase/Repository/ReproduciblePatent Git Repository]&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shelby_Bice_(Work_Log)&amp;diff=22276</id>
		<title>Shelby Bice (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shelby_Bice_(Work_Log)&amp;diff=22276"/>
		<updated>2017-12-04T21:56:39Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* Fall 2017 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;===Fall 2017===&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt; &lt;br /&gt;
&lt;br /&gt;
[[Shelby Bice]] [[Work Logs]] [[Shelby Bice (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
2017-12-04 2:45 pm - 4:00 pm - continued debugging and started typing up troubleshooting tips for the next person who alters the patent code&lt;br /&gt;
&lt;br /&gt;
2017-12-01 3:15 pm - 5:00 pm - ran code (and ran into errors) which I have been working on fixing. If I don't finish today, I'll continue doing so on Monday. I plan to write up some of the mistakes I made in adding the tables to the database and add them to Oliver's page, with his permission, so that people in the future who are not familiar with the code (like I wasn't) hopefully won't fall into the same pitfalls. The main things are 1) how to set up a Maven project (if IntelliJ doesn't automatically set it up for you when you open/import the project), and 2) how to set up the data source so you can run SQL scripts and actually load data into the database on the RDP.&lt;br /&gt;
&lt;br /&gt;
2017-11-30-17 1:55 pm - 3:55 pm - continued altering code. Wrote creation tables script in SQL for creating tables for the design, reissue, and plant patents, and went through the checklist to make sure I had done everything to create these new tables based on Oliver's Reproducible Patent Data page. Will definitely run code tomorrow and will type up the exact process I went through to create new tables. &lt;br /&gt;
&lt;br /&gt;
2017-11-17 2:15 pm - 5:00 pm - continued altering code to include the special fields for plant, reissue, and design patent. Their models and the changes to XmlParser relevant to those models should be done. I will go through the rest of the code and check where else I need to make alterations next time I am in. &lt;br /&gt;
&lt;br /&gt;
2017-11-16 8:45 am - 10:30 am - continued altering code to include the special fields for plant, reissue, and design patents&lt;br /&gt;
&lt;br /&gt;
2017-11-14 8:30 - 10:30 am - met with Oliver to go over his code, began working on scripts and finding paths for the fields unique to design, reissue, and plant patents.&lt;br /&gt;
&lt;br /&gt;
2017-11-10: 2:00 - 5:00 pm - updated format of worklog, finished researching fields of the new design, utility, and reissue tables, and documenting what each table contains&lt;br /&gt;
&lt;br /&gt;
2017-11-03: 2:30 - 3:30 pm, 3:45 pm - 5:00 pm (I went to get a snack because I was starving!) - Finished up finding unique fields, added them to database design, and researched what the fields represented - again, this is taking a while since there is little to no documentation online about what fields on these XML files actually represent.&lt;br /&gt;
&lt;br /&gt;
2017-11-02: 2:15 pm - 4:00 pm - made a main data page and continued going through the xpaths Oliver found. Notified Michelle so she can continue adding patent pages&lt;br /&gt;
&lt;br /&gt;
2017-10-31: 9:00 am - 10:30 am - started trying to determine from Oliver's research what paths are unique to utility, reissue, and plant patents that we have not yet accounted for in our design of the table&lt;br /&gt;
&lt;br /&gt;
2017-10-27: 2:00 pm - 3:30 pm - Kept updating tables. For some of the new values I've been adding to tables (I didn't realize we had XPaths for some fields) I'm struggling to find out exactly what the field means (for example, the date on a citation - is this the date the citedpatent was granted? or is the date that the citingpatent cited the citedpatent?) so the process is slow going. I want to get the documentation right though so in the future someone can look up and see exactly what each field in the table represents and not have to guess&lt;br /&gt;
&lt;br /&gt;
2017-10-26: 2:00 pm - 4:00 pm - worked on updating the designs for the patent databases tables based on what was discussed on Friday (specifically adding inventors and lawyers tables, altering fields on patent tables) - will continue this on Friday&lt;br /&gt;
&lt;br /&gt;
2017-10-20: - 2:00 pm - 5:00 pm - met with Ed and Oliver about patent database design&lt;br /&gt;
&lt;br /&gt;
2017-10-19: 2:00 pm - 4:00 pm - finished up ER diagram and description of all tables for patent database, started reading papers concerning creating an inventor's database (looks like other research groups have merged the USPTO data with the Harvard Dataverse data in order to create an inventors table)&lt;br /&gt;
&lt;br /&gt;
2017-10-17: 8:45 am - 10:30 am - continued trying to solve twitter issues, worked on ER diagram, skimmed through a paper I found (linked at the top of the page for Redesign Assignment and Patent Database) to see how they cleaned the data, since I assume that will be the next step to research after finishing the ER diagrams&lt;br /&gt;
&lt;br /&gt;
2017-10-12: 2:30 pm - 5:00 pm - wrote blog post on Grace Hopper Celebration and attempted to solve Twitter issue&lt;br /&gt;
&lt;br /&gt;
2017-10-03: 8:45 am - 10:30 am - finished adding descriptions to the fields for each table in the patent database, started work on an ER diagram&lt;br /&gt;
&lt;br /&gt;
2017-09-29: 2:00 pm - 5:00 pm - finished adding the design for the Patent database to the project page, added descriptions of the fields for each table to the project page including the datatype that I think the field will be when it's loaded into the database&lt;br /&gt;
&lt;br /&gt;
2017-09-28: 9:00 am - 10:30 am - continued going over and documenting last semester's patent database design and adding the details to the Redesign Assignment and Patent Database project page. Additionally, I began trying to determine how to match up the information in the Document_Info table in Assignment to match up with a patent_id in the Patent table in Patent.&lt;br /&gt;
&lt;br /&gt;
Tomorrow I will finish adding the design for the Patent database to the project page, add descriptions of the fields for each table to the project page, and start working on ER diagrams for the two databases.&lt;br /&gt;
&lt;br /&gt;
Links for creating ER diagrams: https://erdplus.com/#/standalone or https://creately.com/app/?tempID=hqdgwjki1&amp;amp;login_type=demo#&lt;br /&gt;
&lt;br /&gt;
2017-09-26: 8:45 am - 10:00 am - continued working on design of Assignment database by checking my design against the work done last semester on the assignment data restructure to make sure I didn't miss anything major. Began going over my patent database design from last semester to tweak it. Will need to sync up with Joe Reilly to see if there are any new fields that we are pulling from the data. Additionally, I made a new project page called Redesign Assignment and Patent Database that encompasses the new design for the Assignment database and Patent database redesign and moved some of the notes from McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper to the project page.&lt;br /&gt;
&lt;br /&gt;
The main takeaway from looking over Patent Assignment Data Restructure is that, after assembling the table according to my design (which doesn't seem to have any contradictions with the Patent Assignment Data Restructure), there will be multiple steps for cleaning the data, specifically the fields relating to location and address in the assignment table. While the Patent Assignment Data Restructure mentions connecting to the Patent database, it is not clear from the page what field would be used to connect to the Patent database.&lt;br /&gt;
&lt;br /&gt;
2017-09-23: 2:00 pm - 4:00 pm - continued working on design of Assignment database and how it will connect to Patent database by writing out what will be in each table in Assignment and questions about different possible structures of tables that we will have to address before finalizing the design - the notes can be found under McNair/Projects/Redesigning Patent Database/New Patent Database Project as Notes on USPTO Assignment Data Paper. Questions are highlighted in yellow throughout the document[[Category:Work Log]]&lt;br /&gt;
&lt;br /&gt;
2017-09-22: 8:30 am - 10:30 am - continued looking at paper on USPTO assignment data and adding to the notes on what the design of that database should look like, specifically on what I need for different tables and what I don't know yet about the design. Had to set up connection to RDP again due to technical issues. &lt;br /&gt;
&lt;br /&gt;
2017-09-15: 2:00 pm - 5:00 pm - introduced to new patent database project, reviewed and took notes on USPTO Assignment data (notes can be found under McNair/Projects/Redesigning Patent Database/New Patent Database Project as Notes on USPTO Assignment Data Paper)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Spring 2017===&lt;br /&gt;
&lt;br /&gt;
2017-04-27:  1:00 pm - 3:00 pm - worked on documentation more, tried to figure out how to clean citation data&lt;br /&gt;
&lt;br /&gt;
2017-04-25:  10:00 am - 12:00 pm - worked on documentation, tried to determine how to clean up the USPTO Assignee Data&lt;br /&gt;
&lt;br /&gt;
2017-04-20:  10:00 am - 11:30 pm - wrote copy statements for copying data from RDP to database, continued working on documentation.&lt;br /&gt;
&lt;br /&gt;
2017-04-18:  9:45 am - 12:30 pm - Cleaned up documentation more, kept working through the process of parsing the data&lt;br /&gt;
&lt;br /&gt;
2017-04-13: 9:30 am - 12:00 pm - Continued working on trying to update patent data through 2016, specifically parsing the data, worked with Ed to update perl scripts&lt;br /&gt;
&lt;br /&gt;
2017-04-11: 9:15 am - 12:00 pm - Worked on trying to update patent data through 2016&lt;br /&gt;
&lt;br /&gt;
2017-04-06: 9:30 am - 11:30 pm - Kept looking through DTDs, kept updating documentation&lt;br /&gt;
&lt;br /&gt;
2017-04-04: 9:00 am - 11:45 pm - Worked on updating documentation, found documentation on pulling data/making tables and databases, started looking through DTDs to find extra fields to pull&lt;br /&gt;
&lt;br /&gt;
2017-03-23: 9:15 am - 12:00 pm - Narrowed down core tables and fields&lt;br /&gt;
&lt;br /&gt;
2017-03-22: 5:30 pm to 6:30 pm - Patent Data meeting&lt;br /&gt;
&lt;br /&gt;
2017-03-21: 9:30 am - 12:00 pm - Worked on determining &amp;quot;core&amp;quot; tables for new patent database&lt;br /&gt;
&lt;br /&gt;
2017-03-09: 9:00 am - 12:00 pm Finished first draft of spreadsheet describing the current schema (and possible changes) to the Patent database&lt;br /&gt;
&lt;br /&gt;
2017-03-07: 9:30 am - 12:00 pm Continued working on spreadsheet, added relevant page links to project page, took notes on what I want documentation to look like in the future&lt;br /&gt;
&lt;br /&gt;
2017-03-02: 9:15 am - 12:15 pm Started excel spreadsheet to document current schema design and improvements to be made, updated project pages&lt;br /&gt;
&lt;br /&gt;
2017-02-23: 9:30 am - 12:00 pm Reviewed Perl, read about database design, set up project page for redesigning database, started documenting process&lt;br /&gt;
&lt;br /&gt;
2017-02-21: 9:00 am - 12:00 pm Set up work page, reviewed SQL, researched designing database, continued going through wiki&lt;br /&gt;
&lt;br /&gt;
2017-02-16: 10:30 am - 12:00 pm Researched past work on databases, discussed project with Ed&lt;br /&gt;
&lt;br /&gt;
2017-02-14: 10:00 am - 12:00 pm Set up personal wiki, set up work log&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shelby_Bice_(Work_Log)&amp;diff=22267</id>
		<title>Shelby Bice (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shelby_Bice_(Work_Log)&amp;diff=22267"/>
		<updated>2017-12-01T22:48:44Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* Fall 2017 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;===Fall 2017===&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt; &lt;br /&gt;
&lt;br /&gt;
[[Shelby Bice]] [[Work Logs]] [[Shelby Bice (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
2017-12-01 3:15 pm - 5:00 pm - ran code (and ran into errors) which I have been working on fixing. If I don't finish today, I'll continue doing so on Monday. I plan to write up some of the mistakes I made in adding the tables to the database and add them to Oliver's page, with his permission, so that people in the future who are not familiar with the code (like I wasn't) hopefully won't fall into the same pitfalls. The main things are 1) how to set up a Maven project (if IntelliJ doesn't automatically set it up for you when you open/import the project), and 2) how to set up the data source so you can run SQL scripts and actually load data into the database on the RDP.&lt;br /&gt;
&lt;br /&gt;
2017-11-30-17 1:55 pm - 3:55 pm - continued altering code. Wrote creation tables script in SQL for creating tables for the design, reissue, and plant patents, and went through the checklist to make sure I had done everything to create these new tables based on Oliver's Reproducible Patent Data page. Will definitely run code tomorrow and will type up the exact process I went through to create new tables. &lt;br /&gt;
&lt;br /&gt;
2017-11-17 2:15 pm - 5:00 pm - continued altering code to include the special fields for plant, reissue, and design patent. Their models and the changes to XmlParser relevant to those models should be done. I will go through the rest of the code and check where else I need to make alterations next time I am in. &lt;br /&gt;
&lt;br /&gt;
2017-11-16 8:45 am - 10:30 am - continued altering code to include the special fields for plant, reissue, and design patents&lt;br /&gt;
&lt;br /&gt;
2017-11-14 8:30 - 10:30 am - met with Oliver to go over his code, began working on scripts and finding paths for the fields unique to design, reissue, and plant patents.&lt;br /&gt;
&lt;br /&gt;
2017-11-10: 2:00 - 5:00 pm - updated format of worklog, finished researching fields of the new design, utility, and reissue tables, and documenting what each table contains&lt;br /&gt;
&lt;br /&gt;
2017-11-03: 2:30 - 3:30 pm, 3:45 pm - 5:00 pm (I went to get a snack because I was starving!) - Finished up finding unique fields, added them to database design, and researched what the fields represented - again, this is taking a while since there is little to no documentation online about what fields on these XML files actually represent.&lt;br /&gt;
&lt;br /&gt;
2017-11-02: 2:15 pm - 4:00 pm - made a main data page and continued going through the xpaths Oliver found. Notified Michelle so she can continue adding patent pages&lt;br /&gt;
&lt;br /&gt;
2017-10-31: 9:00 am - 10:30 am - started trying to determine from Oliver's research what paths are unique to utility, reissue, and plant patents that we have not yet accounted for in our design of the table&lt;br /&gt;
&lt;br /&gt;
2017-10-27: 2:00 pm - 3:30 pm - Kept updating tables. For some of the new values I've been adding to tables (I didn't realize we had XPaths for some fields) I'm struggling to find out exactly what the field means (for example, the date on a citation - is this the date the citedpatent was granted? or is the date that the citingpatent cited the citedpatent?) so the process is slow going. I want to get the documentation right though so in the future someone can look up and see exactly what each field in the table represents and not have to guess&lt;br /&gt;
&lt;br /&gt;
2017-10-26: 2:00 pm - 4:00 pm - worked on updating the designs for the patent databases tables based on what was discussed on Friday (specifically adding inventors and lawyers tables, altering fields on patent tables) - will continue this on Friday&lt;br /&gt;
&lt;br /&gt;
2017-10-20: - 2:00 pm - 5:00 pm - met with Ed and Oliver about patent database design&lt;br /&gt;
&lt;br /&gt;
2017-10-19: 2:00 pm - 4:00 pm - finished up ER diagram and description of all tables for patent database, started reading papers concerning creating an inventor's database (looks like other research groups have merged the USPTO data with the Harvard Dataverse data in order to create an inventors table)&lt;br /&gt;
&lt;br /&gt;
2017-10-17: 8:45 am - 10:30 am - continued trying to solve twitter issues, worked on ER diagram, skimmed through a paper I found (linked at the top of the page for Redesign Assignment and Patent Database) to see how they cleaned the data, since I assume that will be the next step to research after finishing the ER diagrams&lt;br /&gt;
&lt;br /&gt;
2017-10-12: 2:30 pm - 5:00 pm - wrote blog post on Grace Hopper Celebration and attempted to solve Twitter issue&lt;br /&gt;
&lt;br /&gt;
2017-10-03: 8:45 am - 10:30 am - finished adding descriptions to the fields for each table in the patent database, started work on an ER diagram&lt;br /&gt;
&lt;br /&gt;
2017-09-29: 2:00 pm - 5:00 pm - finished adding the design for the Patent database to the project page, added descriptions of the fields for each table to the project page including the datatype that I think the field will be when it's loaded into the database&lt;br /&gt;
&lt;br /&gt;
2017-09-28: 9:00 am - 10:30 am - continued going over and documenting last semester's patent database design and adding the details to the Redesign Assignment and Patent Database project page. Additionally, I began trying to determine how to match up the information in the Document_Info table in Assignment to match up with a patent_id in the Patent table in Patent.&lt;br /&gt;
&lt;br /&gt;
Tomorrow I will finish adding the design for the Patent database to the project page, add descriptions of the fields for each table to the project page, and start working on ER diagrams for the two databases.&lt;br /&gt;
&lt;br /&gt;
Links for creating ER diagrams: https://erdplus.com/#/standalone or https://creately.com/app/?tempID=hqdgwjki1&amp;amp;login_type=demo#&lt;br /&gt;
&lt;br /&gt;
2017-09-26: 8:45 am - 10:00 am - continued working on design of Assignment database by checking my design against the work done last semester on the assignment data restructure to make sure I didn't miss anything major. Began going over my patent database design from last semester to tweak it. Will need to sync up with Joe Reilly to see if there are any new fields that we are pulling from the data. Additionally, I made a new project page called Redesign Assignment and Patent Database that encompasses the new design for the Assignment database and Patent database redesign and moved some of the notes from McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper to the project page.&lt;br /&gt;
&lt;br /&gt;
The main takeaway from looking over Patent Assignment Data Restructure is that, after assembling the table according to my design (which doesn't seem to have any contradictions with the Patent Assignment Data Restructure), there will be multiple steps for cleaning the data, specifically the fields relating to location and address in the assignment table. While the Patent Assignment Data Restructure mentions connecting to the Patent database, it is not clear from the page what field would be used to connect to the Patent database.&lt;br /&gt;
&lt;br /&gt;
2017-09-23: 2:00 pm - 4:00 pm - continued working on design of Assignment database and how it will connect to Patent database by writing out what will be in each table in Assignment and questions about different possible structures of tables that we will have to address before finalizing the design - the notes can be found under McNair/Projects/Redesigning Patent Database/New Patent Database Project as Notes on USPTO Assignment Data Paper. Questions are highlighted in yellow throughout the document[[Category:Work Log]]&lt;br /&gt;
&lt;br /&gt;
2017-09-22: 8:30 am - 10:30 am - continued looking at paper on USPTO assignment data and adding to the notes on what the design of that database should look like, specifically on what I need for different tables and what I don't know yet about the design. Had to set up connection to RDP again due to technical issues. &lt;br /&gt;
&lt;br /&gt;
2017-09-15: 2:00 pm - 5:00 pm - introduced to new patent database project, reviewed and took notes on USPTO Assignment data (notes can be found under McNair/Projects/Redesigning Patent Database/New Patent Database Project as Notes on USPTO Assignment Data Paper)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Spring 2017===&lt;br /&gt;
&lt;br /&gt;
2017-04-27:  1:00 pm - 3:00 pm - worked on documentation more, tried to figure out how to clean citation data&lt;br /&gt;
&lt;br /&gt;
2017-04-25:  10:00 am - 12:00 pm - worked on documentation, tried to determine how to clean up the USPTO Assignee Data&lt;br /&gt;
&lt;br /&gt;
2017-04-20:  10:00 am - 11:30 pm - wrote copy statements for copying data from RDP to database, continued working on documentation.&lt;br /&gt;
&lt;br /&gt;
2017-04-18:  9:45 am - 12:30 pm - Cleaned up documentation more, kept working through the process of parsing the data&lt;br /&gt;
&lt;br /&gt;
2017-04-13: 9:30 am - 12:00 pm - Continued working on trying to update patent data through 2016, specifically parsing the data, worked with Ed to update perl scripts&lt;br /&gt;
&lt;br /&gt;
2017-04-11: 9:15 am - 12:00 pm - Worked on trying to update patent data through 2016&lt;br /&gt;
&lt;br /&gt;
2017-04-06: 9:30 am - 11:30 pm - Kept looking through DTDs, kept updating documentation&lt;br /&gt;
&lt;br /&gt;
2017-04-04: 9:00 am - 11:45 pm - Worked on updating documentation, found documentation on pulling data/making tables and databases, started looking through DTDs to find extra fields to pull&lt;br /&gt;
&lt;br /&gt;
2017-03-23: 9:15 am - 12:00 pm - Narrowed down core tables and fields&lt;br /&gt;
&lt;br /&gt;
2017-03-22: 5:30 pm to 6:30 pm - Patent Data meeting&lt;br /&gt;
&lt;br /&gt;
2017-03-21: 9:30 am - 12:00 pm - Worked on determining &amp;quot;core&amp;quot; tables for new patent database&lt;br /&gt;
&lt;br /&gt;
2017-03-09: 9:00 am - 12:00 pm Finished first draft of spreadsheet describing the current schema (and possible changes) to the Patent database&lt;br /&gt;
&lt;br /&gt;
2017-03-07: 9:30 am - 12:00 pm Continued working on spreadsheet, added relevant page links to project page, took notes on what I want documentation to look like in the future&lt;br /&gt;
&lt;br /&gt;
2017-03-02: 9:15 am - 12:15 pm Started excel spreadsheet to document current schema design and improvements to be made, updated project pages&lt;br /&gt;
&lt;br /&gt;
2017-02-23: 9:30 am - 12:00 pm Reviewed Perl, read about database design, set up project page for redesigning database, started documenting process&lt;br /&gt;
&lt;br /&gt;
2017-02-21: 9:00 am - 12:00 pm Set up work page, reviewed SQL, researched designing database, continued going through wiki&lt;br /&gt;
&lt;br /&gt;
2017-02-16: 10:30 am - 12:00 pm Researched past work on databases, discussed project with Ed&lt;br /&gt;
&lt;br /&gt;
2017-02-14: 10:00 am - 12:00 pm Set up personal wiki, set up work log&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shelby_Bice_(Work_Log)&amp;diff=22265</id>
		<title>Shelby Bice (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shelby_Bice_(Work_Log)&amp;diff=22265"/>
		<updated>2017-12-01T22:41:01Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* Fall 2017 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;===Fall 2017===&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt; &lt;br /&gt;
&lt;br /&gt;
[[Shelby Bice]] [[Work Logs]] [[Shelby Bice (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
2017-12-01 3:15 pm - 5:00 pm - ran code (and ran into errors) which I have been working on fixing. If I don't finish today, I'll continue doing so on Monday. I plan to write up some of the mistakes I made in adding the tables to the database and add them to Oliver's page, with his permission so that people in the future who are not familiar with the code (like I wasn't) hopefully won't fall into the same pitfalls. &lt;br /&gt;
&lt;br /&gt;
2017-11-30-17 1:55 pm - 3:55 pm - continued altering code. Wrote creation tables script in SQL for creating tables for the design, reissue, and plant patents, and went through the checklist to make sure I had done everything to create these new tables based on Oliver's Reproducible Patent Data page. Will definitely run code tomorrow and will type up the exact process I went through to create new tables. &lt;br /&gt;
&lt;br /&gt;
2017-11-17 2:15 pm - 5:00 pm - continued altering code to include the special fields for plant, reissue, and design patent. Their models and the changes to XmlParser relevant to those models should be done. I will go through the rest of the code and check where else I need to make alterations next time I am in. &lt;br /&gt;
&lt;br /&gt;
2017-11-16 8:45 am - 10:30 am - continued altering code to include the special fields for plant, reissue, and design patents&lt;br /&gt;
&lt;br /&gt;
2017-11-14 8:30 - 10:30 am - met with Oliver to go over his code, began working on scripts and finding paths for the fields unique to design, reissue, and plant patents.&lt;br /&gt;
&lt;br /&gt;
2017-11-10: 2:00 - 5:00 pm - updated format of worklog, finished researching fields of the new design, utility, and reissue tables, and documenting what each table contains&lt;br /&gt;
&lt;br /&gt;
2017-11-03: 2:30 - 3:30 pm, 3:45 pm - 5:00 pm (I went to get a snack because I was starving!) - Finished up finding unique fields, added them to database design, and researched what the fields represented - again, this is taking a while since there is little to no documentation online about what fields on these XML files actually represent.&lt;br /&gt;
&lt;br /&gt;
2017-11-02: 2:15 pm - 4:00 pm - made a main data page and continued going through the xpaths Oliver found. Notified Michelle so she can continue adding patent pages&lt;br /&gt;
&lt;br /&gt;
2017-10-31: 9:00 am - 10:30 am - started trying to determine from Oliver's research what paths are unique to utility, reissue, and plant patents that we have not yet accounted for in our design of the table&lt;br /&gt;
&lt;br /&gt;
2017-10-27: 2:00 pm - 3:30 pm - Kept updating tables. For some of the new values I've been adding to tables (I didn't realize we had XPaths for some fields) I'm struggling to find out exactly what the field means (for example, the date on a citation - is this the date the citedpatent was granted? or is the date that the citingpatent cited the citedpatent?) so the process is slow going. I want to get the documentation right though so in the future someone can look up and see exactly what each field in the table represents and not have to guess&lt;br /&gt;
&lt;br /&gt;
2017-10-26: 2:00 pm - 4:00 pm - worked on updating the designs for the patent databases tables based on what was discussed on Friday (specifically adding inventors and lawyers tables, altering fields on patent tables) - will continue this on Friday&lt;br /&gt;
&lt;br /&gt;
2017-10-20: - 2:00 pm - 5:00 pm - met with Ed and Oliver about patent database design&lt;br /&gt;
&lt;br /&gt;
2017-10-19: 2:00 pm - 4:00 pm - finished up ER diagram and description of all tables for patent database, started reading papers concerning creating an inventor's database (looks like other research groups have merged the USPTO data with the Harvard Dataverse data in order to create an inventors table)&lt;br /&gt;
&lt;br /&gt;
2017-10-17: 8:45 am - 10:30 am - continued trying to solve twitter issues, worked on ER diagram, skimmed through a paper I found (linked at the top of the page for Redesign Assignment and Patent Database) to see how they cleaned the data, since I assume that will be the next step to research after finishing the ER diagrams&lt;br /&gt;
&lt;br /&gt;
2017-10-12: 2:30 pm - 5:00 pm - wrote blog post on Grace Hopper Celebration and attempted to solve Twitter issue&lt;br /&gt;
&lt;br /&gt;
2017-10-03: 8:45 am - 10:30 am - finished adding descriptions to the fields for each table in the patent database, started work on an ER diagram&lt;br /&gt;
&lt;br /&gt;
2017-09-29: 2:00 pm - 5:00 pm - finished adding the design for the Patent database to the project page, added descriptions of the fields for each table to the project page including the datatype that I think the field will be when it's loaded into the database&lt;br /&gt;
&lt;br /&gt;
2017-09-28: 9:00 am - 10:30 am - continued going over and documenting last semester's patent database design and adding the details to the Redesign Assignment and Patent Database project page. Additionally, I began trying to determine how to match the information in the Document_Info table in Assignment with a patent_id in the Patent table in Patent.&lt;br /&gt;
&lt;br /&gt;
Tomorrow I will finish adding the design for the Patent database to the project page, add descriptions of the fields for each table to the project page, and start working on ER diagrams for the two databases.&lt;br /&gt;
&lt;br /&gt;
Links for creating ER diagrams: https://erdplus.com/#/standalone or https://creately.com/app/?tempID=hqdgwjki1&amp;amp;login_type=demo#&lt;br /&gt;
&lt;br /&gt;
2017-09-26: 8:45 am - 10:00 am - continued working on the design of the Assignment database by checking my design against the work done last semester on the assignment data restructure to make sure I didn't miss anything major. Began going over my patent database design from last semester to tweak it. Will need to sync up with Joe Reilly to see if there are any new fields that we are pulling from the data. Additionally, I made a new project page called Redesign Assignment and Patent Database that encompasses the new design for the Assignment database and Patent database redesign and moved some of the notes from McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper to the project page.&lt;br /&gt;
&lt;br /&gt;
The main takeaway from looking over Patent Assignment Data Restructure is that, after assembling the table according to my design (which doesn't seem to contradict the Patent Assignment Data Restructure), there will be multiple steps for cleaning the data, specifically the fields relating to location and address in the assignment table. While the Patent Assignment Data Restructure mentions connecting to the Patent database, it is not clear from the page what field would be used to make that connection.&lt;br /&gt;
&lt;br /&gt;
2017-09-23: 2:00 pm - 4:00 pm - continued working on design of Assignment database and how it will connect to Patent database by writing out what will be in each table in Assignment and questions about different possible structures of tables that we will have to address before finalizing the design - the notes can be found under McNair/Projects/Redesigning Patent Database/New Patent Database Project as Notes on USPTO Assignment Data Paper. Questions are highlighted in yellow throughout the document[[Category:Work Log]]&lt;br /&gt;
&lt;br /&gt;
2017-09-22: 8:30 am - 10:30 am - continued looking at paper on USPTO assignment data and adding to the notes on what the design of that database should look like, specifically on what I need for different tables and what I don't know yet about the design. Had to set up connection to RDP again due to technical issues. &lt;br /&gt;
&lt;br /&gt;
2017-09-15: 2:00 pm - 5:00 pm - introduced to the new patent database project; reviewed and took notes on USPTO Assignment data (notes can be found under McNair/Projects/Redesigning Patent Database/New Patent Database Project as Notes on USPTO Assignment Data Paper)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Spring 2017===&lt;br /&gt;
&lt;br /&gt;
2017-04-27:  1:00 pm - 3:00 pm - worked on documentation more, tried to figure out how to clean citation data&lt;br /&gt;
&lt;br /&gt;
2017-04-25:  10:00 am - 12:00 pm - worked on documentation, tried to determine how to clean up the USPTO Assignee Data&lt;br /&gt;
&lt;br /&gt;
2017-04-20:  10:00 am - 11:30 pm - wrote copy statements for copying data from RDP to database, continued working on documentation.&lt;br /&gt;
&lt;br /&gt;
2017-04-18:  9:45 am - 12:30 pm - Cleaned up documentation more, kept working through the process of parsing the data&lt;br /&gt;
&lt;br /&gt;
2017-04-13: 9:30 am - 12:00 pm - Continued working on trying to update patent data through 2016, specifically parsing the data, worked with Ed to update perl scripts&lt;br /&gt;
&lt;br /&gt;
2017-04-11: 9:15 am - 12:00 pm - Worked on trying to update patent data through 2016&lt;br /&gt;
&lt;br /&gt;
2017-04-06: 9:30 am - 11:30 pm - Kept looking through DTDs, kept updating documentation&lt;br /&gt;
&lt;br /&gt;
2017-04-04: 9:00 am - 11:45 pm - Worked on updating documentation, found documentation on pulling data/making tables and databases, started looking through DTDs to find extra fields to pull&lt;br /&gt;
&lt;br /&gt;
2017-03-23: 9:15 am - 12:00 pm - Narrowed down core tables and fields&lt;br /&gt;
&lt;br /&gt;
2017-03-22: 5:30 pm to 6:30 pm - Patent Data meeting&lt;br /&gt;
&lt;br /&gt;
2017-03-21: 9:30 am - 12:00 pm - Worked on determining &amp;quot;core&amp;quot; tables for new patent database&lt;br /&gt;
&lt;br /&gt;
2017-03-09: 9:00 am - 12:00 pm Finished first draft of spreadsheet describing the current schema (and possible changes) to the Patent database&lt;br /&gt;
&lt;br /&gt;
2017-03-07: 9:30 am - 12:00 pm Continued working on spreadsheet, added relevant page links to project page, took notes on what I want documentation to look like in the future&lt;br /&gt;
&lt;br /&gt;
2017-03-02: 9:15 am - 12:15 pm Started excel spreadsheet to document current schema design and improvements to be made, updated project pages&lt;br /&gt;
&lt;br /&gt;
2017-02-23: 9:30 am - 12:00 pm Reviewed Perl, read about database design, set up project page for redesigning database, started documenting process&lt;br /&gt;
&lt;br /&gt;
2017-02-21: 9:00 am - 12:00 pm Set up work page, reviewed SQL, researched designing database, continued going through wiki&lt;br /&gt;
&lt;br /&gt;
2017-02-16: 10:30 am - 12:00 pm Researched past work on databases, discussed project with Ed&lt;br /&gt;
&lt;br /&gt;
2017-02-14: 10:00 am - 12:00 pm Set up personal wiki, set up work log&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shelby_Bice_(Work_Log)&amp;diff=22263</id>
		<title>Shelby Bice (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shelby_Bice_(Work_Log)&amp;diff=22263"/>
		<updated>2017-12-01T22:30:53Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* Fall 2017 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;===Fall 2017===&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt; &lt;br /&gt;
&lt;br /&gt;
[[Shelby Bice]] [[Work Logs]] [[Shelby Bice (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
2017-12-01 3:15 pm - 5:00 pm - ran code (and ran into errors) which I have been working on fixing. If I don't finish today, I'll continue doing so on Monday. &lt;br /&gt;
&lt;br /&gt;
2017-11-30 1:55 pm - 3:55 pm - continued altering code. Wrote a table-creation script in SQL for the design, reissue, and plant patent tables, and went through the checklist on Oliver's Reproducible Patent Data page to make sure I had done everything needed to create these new tables. Will definitely run the code tomorrow and will type up the exact process I went through to create the new tables. &lt;br /&gt;
&lt;br /&gt;
2017-11-17 2:15 pm - 5:00 pm - continued altering code to include the special fields for plant, reissue, and design patents. Their models and the changes to XmlParser relevant to those models should be done. Next time I am in, I will go through the rest of the code and check where else I need to make alterations. &lt;br /&gt;
&lt;br /&gt;
2017-11-16 8:45 am - 10:30 am - continued altering code to include the special fields for plant, reissue, and design patents&lt;br /&gt;
&lt;br /&gt;
2017-11-14 8:30 - 10:30 am - met with Oliver to go over his code, began working on scripts and finding paths for the fields unique to design, reissue, and plant patents.&lt;br /&gt;
&lt;br /&gt;
2017-11-10: 2:00 - 5:00 pm - updated format of worklog, finished researching fields of the new design, utility, and reissue tables, and documenting what each table contains&lt;br /&gt;
&lt;br /&gt;
2017-11-03: 2:30 - 3:30 pm, 3:45 pm - 5:00 pm (I went to get a snack because I was starving!) - Finished up finding unique fields, added them to database design, and researched what the fields represented - again, this is taking a while since there is little to no documentation online about what fields on these XML files actually represent.&lt;br /&gt;
&lt;br /&gt;
2017-11-02: 2:15 pm - 4:00 pm - made a main data page and continued going through the xpaths Oliver found. Notified Michelle so she can continue adding patent pages&lt;br /&gt;
&lt;br /&gt;
2017-10-31: 9:00 am - 10:30 am - started trying to determine from Oliver's research what paths are unique to utility, reissue, and plant patents that we have not yet accounted for in our design of the table&lt;br /&gt;
&lt;br /&gt;
2017-10-27: 2:00 pm - 3:30 pm - Kept updating tables. For some of the new values I've been adding to tables (I didn't realize we had XPaths for some fields), I'm struggling to find out exactly what the field means (for example, the date on a citation: is this the date the citedpatent was granted, or is it the date that the citingpatent cited the citedpatent?), so the process is slow going. I want to get the documentation right, though, so that in the future someone can look up exactly what each field in the table represents and not have to guess.&lt;br /&gt;
&lt;br /&gt;
2017-10-26: 2:00 pm - 4:00 pm - worked on updating the designs for the patent databases tables based on what was discussed on Friday (specifically adding inventors and lawyers tables, altering fields on patent tables) - will continue this on Friday&lt;br /&gt;
&lt;br /&gt;
2017-10-20: 2:00 pm - 5:00 pm - met with Ed and Oliver about patent database design&lt;br /&gt;
&lt;br /&gt;
2017-10-19: 2:00 pm - 4:00 pm - finished up ER diagram and description of all tables for patent database, started reading papers concerning creating an inventor's database (looks like other research groups have merged the USPTO data with the Harvard Dataverse data in order to create an inventors table)&lt;br /&gt;
&lt;br /&gt;
2017-10-17: 8:45 am - 10:30 am - continued trying to solve twitter issues, worked on ER diagram, skimmed through a paper I found (linked at the top of the page for Redesign Assignment and Patent Database) to see how they cleaned the data, since I assume that will be the next step to research after finishing the ER diagrams&lt;br /&gt;
&lt;br /&gt;
2017-10-12: 2:30 pm - 5:00 pm - wrote a blog post on the Grace Hopper Celebration and attempted to solve the Twitter issue&lt;br /&gt;
&lt;br /&gt;
2017-10-03: 8:45 am - 10:30 am - finished adding descriptions to the fields for each table in the patent database, started work on an ER diagram&lt;br /&gt;
&lt;br /&gt;
2017-09-29: 2:00 pm - 5:00 pm - finished adding the design for the Patent database to the project page, added descriptions of the fields for each table to the project page including the datatype that I think the field will be when it's loaded into the database&lt;br /&gt;
&lt;br /&gt;
2017-09-28: 9:00 am - 10:30 am - continued going over and documenting last semester's patent database design and adding the details to the Redesign Assignment and Patent Database project page. Additionally, I began trying to determine how to match the information in the Document_Info table in Assignment with a patent_id in the Patent table in Patent.&lt;br /&gt;
&lt;br /&gt;
Tomorrow I will finish adding the design for the Patent database to the project page, add descriptions of the fields for each table to the project page, and start working on ER diagrams for the two databases.&lt;br /&gt;
&lt;br /&gt;
Links for creating ER diagrams: https://erdplus.com/#/standalone or https://creately.com/app/?tempID=hqdgwjki1&amp;amp;login_type=demo#&lt;br /&gt;
&lt;br /&gt;
2017-09-26: 8:45 am - 10:00 am - continued working on the design of the Assignment database by checking my design against the work done last semester on the assignment data restructure to make sure I didn't miss anything major. Began going over my patent database design from last semester to tweak it. Will need to sync up with Joe Reilly to see if there are any new fields that we are pulling from the data. Additionally, I made a new project page called Redesign Assignment and Patent Database that encompasses the new design for the Assignment database and Patent database redesign and moved some of the notes from McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper to the project page.&lt;br /&gt;
&lt;br /&gt;
The main takeaway from looking over Patent Assignment Data Restructure is that, after assembling the table according to my design (which doesn't seem to contradict the Patent Assignment Data Restructure), there will be multiple steps for cleaning the data, specifically the fields relating to location and address in the assignment table. While the Patent Assignment Data Restructure mentions connecting to the Patent database, it is not clear from the page what field would be used to make that connection.&lt;br /&gt;
&lt;br /&gt;
2017-09-23: 2:00 pm - 4:00 pm - continued working on design of Assignment database and how it will connect to Patent database by writing out what will be in each table in Assignment and questions about different possible structures of tables that we will have to address before finalizing the design - the notes can be found under McNair/Projects/Redesigning Patent Database/New Patent Database Project as Notes on USPTO Assignment Data Paper. Questions are highlighted in yellow throughout the document[[Category:Work Log]]&lt;br /&gt;
&lt;br /&gt;
2017-09-22: 8:30 am - 10:30 am - continued looking at paper on USPTO assignment data and adding to the notes on what the design of that database should look like, specifically on what I need for different tables and what I don't know yet about the design. Had to set up connection to RDP again due to technical issues. &lt;br /&gt;
&lt;br /&gt;
2017-09-15: 2:00 pm - 5:00 pm - introduced to the new patent database project; reviewed and took notes on USPTO Assignment data (notes can be found under McNair/Projects/Redesigning Patent Database/New Patent Database Project as Notes on USPTO Assignment Data Paper)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Spring 2017===&lt;br /&gt;
&lt;br /&gt;
2017-04-27:  1:00 pm - 3:00 pm - worked on documentation more, tried to figure out how to clean citation data&lt;br /&gt;
&lt;br /&gt;
2017-04-25:  10:00 am - 12:00 pm - worked on documentation, tried to determine how to clean up the USPTO Assignee Data&lt;br /&gt;
&lt;br /&gt;
2017-04-20:  10:00 am - 11:30 pm - wrote copy statements for copying data from RDP to database, continued working on documentation.&lt;br /&gt;
&lt;br /&gt;
2017-04-18:  9:45 am - 12:30 pm - Cleaned up documentation more, kept working through the process of parsing the data&lt;br /&gt;
&lt;br /&gt;
2017-04-13: 9:30 am - 12:00 pm - Continued working on trying to update patent data through 2016, specifically parsing the data, worked with Ed to update perl scripts&lt;br /&gt;
&lt;br /&gt;
2017-04-11: 9:15 am - 12:00 pm - Worked on trying to update patent data through 2016&lt;br /&gt;
&lt;br /&gt;
2017-04-06: 9:30 am - 11:30 pm - Kept looking through DTDs, kept updating documentation&lt;br /&gt;
&lt;br /&gt;
2017-04-04: 9:00 am - 11:45 pm - Worked on updating documentation, found documentation on pulling data/making tables and databases, started looking through DTDs to find extra fields to pull&lt;br /&gt;
&lt;br /&gt;
2017-03-23: 9:15 am - 12:00 pm - Narrowed down core tables and fields&lt;br /&gt;
&lt;br /&gt;
2017-03-22: 5:30 pm to 6:30 pm - Patent Data meeting&lt;br /&gt;
&lt;br /&gt;
2017-03-21: 9:30 am - 12:00 pm - Worked on determining &amp;quot;core&amp;quot; tables for new patent database&lt;br /&gt;
&lt;br /&gt;
2017-03-09: 9:00 am - 12:00 pm Finished first draft of spreadsheet describing the current schema (and possible changes) to the Patent database&lt;br /&gt;
&lt;br /&gt;
2017-03-07: 9:30 am - 12:00 pm Continued working on spreadsheet, added relevant page links to project page, took notes on what I want documentation to look like in the future&lt;br /&gt;
&lt;br /&gt;
2017-03-02: 9:15 am - 12:15 pm Started excel spreadsheet to document current schema design and improvements to be made, updated project pages&lt;br /&gt;
&lt;br /&gt;
2017-02-23: 9:30 am - 12:00 pm Reviewed Perl, read about database design, set up project page for redesigning database, started documenting process&lt;br /&gt;
&lt;br /&gt;
2017-02-21: 9:00 am - 12:00 pm Set up work page, reviewed SQL, researched designing database, continued going through wiki&lt;br /&gt;
&lt;br /&gt;
2017-02-16: 10:30 am - 12:00 pm Researched past work on databases, discussed project with Ed&lt;br /&gt;
&lt;br /&gt;
2017-02-14: 10:00 am - 12:00 pm Set up personal wiki, set up work log&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shelby_Bice_(Work_Log)&amp;diff=22249</id>
		<title>Shelby Bice (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shelby_Bice_(Work_Log)&amp;diff=22249"/>
		<updated>2017-11-30T21:52:18Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* Fall 2017 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;===Fall 2017===&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt; &lt;br /&gt;
&lt;br /&gt;
[[Shelby Bice]] [[Work Logs]] [[Shelby Bice (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
2017-11-30 1:55 pm - 3:55 pm - continued altering code. Wrote a table-creation script in SQL for the design, reissue, and plant patent tables, and went through the checklist on Oliver's Reproducible Patent Data page to make sure I had done everything needed to create these new tables. Will definitely run the code tomorrow and will type up the exact process I went through to create the new tables. &lt;br /&gt;
&lt;br /&gt;
2017-11-17 2:15 pm - 5:00 pm - continued altering code to include the special fields for plant, reissue, and design patents. Their models and the changes to XmlParser relevant to those models should be done. Next time I am in, I will go through the rest of the code and check where else I need to make alterations. &lt;br /&gt;
&lt;br /&gt;
2017-11-16 8:45 am - 10:30 am - continued altering code to include the special fields for plant, reissue, and design patents&lt;br /&gt;
&lt;br /&gt;
2017-11-14 8:30 - 10:30 am - met with Oliver to go over his code, began working on scripts and finding paths for the fields unique to design, reissue, and plant patents.&lt;br /&gt;
&lt;br /&gt;
2017-11-10: 2:00 - 5:00 pm - updated format of worklog, finished researching fields of the new design, utility, and reissue tables, and documenting what each table contains&lt;br /&gt;
&lt;br /&gt;
2017-11-03: 2:30 - 3:30 pm, 3:45 pm - 5:00 pm (I went to get a snack because I was starving!) - Finished up finding unique fields, added them to database design, and researched what the fields represented - again, this is taking a while since there is little to no documentation online about what fields on these XML files actually represent.&lt;br /&gt;
&lt;br /&gt;
2017-11-02: 2:15 pm - 4:00 pm - made a main data page and continued going through the xpaths Oliver found. Notified Michelle so she can continue adding patent pages&lt;br /&gt;
&lt;br /&gt;
2017-10-31: 9:00 am - 10:30 am - started trying to determine from Oliver's research what paths are unique to utility, reissue, and plant patents that we have not yet accounted for in our design of the table&lt;br /&gt;
&lt;br /&gt;
2017-10-27: 2:00 pm - 3:30 pm - Kept updating tables. For some of the new values I've been adding to tables (I didn't realize we had XPaths for some fields), I'm struggling to find out exactly what the field means (for example, the date on a citation: is this the date the citedpatent was granted, or is it the date that the citingpatent cited the citedpatent?), so the process is slow going. I want to get the documentation right, though, so that in the future someone can look up exactly what each field in the table represents and not have to guess.&lt;br /&gt;
&lt;br /&gt;
2017-10-26: 2:00 pm - 4:00 pm - worked on updating the designs for the patent databases tables based on what was discussed on Friday (specifically adding inventors and lawyers tables, altering fields on patent tables) - will continue this on Friday&lt;br /&gt;
&lt;br /&gt;
2017-10-20: 2:00 pm - 5:00 pm - met with Ed and Oliver about patent database design&lt;br /&gt;
&lt;br /&gt;
2017-10-19: 2:00 pm - 4:00 pm - finished up ER diagram and description of all tables for patent database, started reading papers concerning creating an inventor's database (looks like other research groups have merged the USPTO data with the Harvard Dataverse data in order to create an inventors table)&lt;br /&gt;
&lt;br /&gt;
2017-10-17: 8:45 am - 10:30 am - continued trying to solve twitter issues, worked on ER diagram, skimmed through a paper I found (linked at the top of the page for Redesign Assignment and Patent Database) to see how they cleaned the data, since I assume that will be the next step to research after finishing the ER diagrams&lt;br /&gt;
&lt;br /&gt;
2017-10-12: 2:30 pm - 5:00 pm - wrote a blog post on the Grace Hopper Celebration and attempted to solve the Twitter issue&lt;br /&gt;
&lt;br /&gt;
2017-10-03: 8:45 am - 10:30 am - finished adding descriptions to the fields for each table in the patent database, started work on an ER diagram&lt;br /&gt;
&lt;br /&gt;
2017-09-29: 2:00 pm - 5:00 pm - finished adding the design for the Patent database to the project page, added descriptions of the fields for each table to the project page including the datatype that I think the field will be when it's loaded into the database&lt;br /&gt;
&lt;br /&gt;
2017-09-28: 9:00 am - 10:30 am - continued going over and documenting last semester's patent database design and adding the details to the Redesign Assignment and Patent Database project page. Additionally, I began trying to determine how to match the information in the Document_Info table in Assignment with a patent_id in the Patent table in Patent.&lt;br /&gt;
&lt;br /&gt;
Tomorrow I will finish adding the design for the Patent database to the project page, add descriptions of the fields for each table to the project page, and start working on ER diagrams for the two databases.&lt;br /&gt;
&lt;br /&gt;
Links for creating ER diagrams: https://erdplus.com/#/standalone or https://creately.com/app/?tempID=hqdgwjki1&amp;amp;login_type=demo#&lt;br /&gt;
&lt;br /&gt;
2017-09-26: 8:45 am - 10:00 am - continued working on the design of the Assignment database by checking my design against the work done last semester on the assignment data restructure to make sure I didn't miss anything major. Began going over my patent database design from last semester to tweak it. Will need to sync up with Joe Reilly to see if there are any new fields that we are pulling from the data. Additionally, I made a new project page called Redesign Assignment and Patent Database that encompasses the new design for the Assignment database and Patent database redesign and moved some of the notes from McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper to the project page.&lt;br /&gt;
&lt;br /&gt;
The main takeaway from looking over Patent Assignment Data Restructure is that, after assembling the table according to my design (which doesn't seem to contradict the Patent Assignment Data Restructure), there will be multiple steps for cleaning the data, specifically the fields relating to location and address in the assignment table. While the Patent Assignment Data Restructure mentions connecting to the Patent database, it is not clear from the page what field would be used to make that connection.&lt;br /&gt;
&lt;br /&gt;
2017-09-23: 2:00 pm - 4:00 pm - continued working on design of Assignment database and how it will connect to Patent database by writing out what will be in each table in Assignment and questions about different possible structures of tables that we will have to address before finalizing the design - the notes can be found under McNair/Projects/Redesigning Patent Database/New Patent Database Project as Notes on USPTO Assignment Data Paper. Questions are highlighted in yellow throughout the document[[Category:Work Log]]&lt;br /&gt;
&lt;br /&gt;
2017-09-22: 8:30 am - 10:30 am - continued looking at paper on USPTO assignment data and adding to the notes on what the design of that database should look like, specifically on what I need for different tables and what I don't know yet about the design. Had to set up connection to RDP again due to technical issues. &lt;br /&gt;
&lt;br /&gt;
2017-09-15: 2:00 pm - 5:00 pm - introduced to the new patent database project; reviewed and took notes on USPTO Assignment data (notes can be found under McNair/Projects/Redesigning Patent Database/New Patent Database Project as Notes on USPTO Assignment Data Paper)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Spring 2017===&lt;br /&gt;
&lt;br /&gt;
2017-04-27:  1:00 pm - 3:00 pm - worked on documentation more, tried to figure out how to clean citation data&lt;br /&gt;
&lt;br /&gt;
2017-04-25:  10:00 am - 12:00 pm - worked on documentation, tried to determine how to clean up the USPTO Assignee Data&lt;br /&gt;
&lt;br /&gt;
2017-04-20:  10:00 am - 11:30 pm - wrote copy statements for copying data from RDP to database, continued working on documentation.&lt;br /&gt;
&lt;br /&gt;
2017-04-18:  9:45 am - 12:30 pm - Cleaned up documentation more, kept working through the process of parsing the data&lt;br /&gt;
&lt;br /&gt;
2017-04-13: 9:30 am - 12:00 pm - Continued working on trying to update patent data through 2016, specifically parsing the data, worked with Ed to update perl scripts&lt;br /&gt;
&lt;br /&gt;
2017-04-11: 9:15 am - 12:00 pm - Worked on trying to update patent data through 2016&lt;br /&gt;
&lt;br /&gt;
2017-04-06: 9:30 am - 11:30 pm - Kept looking through DTDs, kept updating documentation&lt;br /&gt;
&lt;br /&gt;
2017-04-04: 9:00 am - 11:45 pm - Worked on updating documentation, found documentation on pulling data/making tables and databases, started looking through DTDs to find extra fields to pull&lt;br /&gt;
&lt;br /&gt;
2017-03-23: 9:15 am - 12:00 pm - Narrowed down core tables and fields&lt;br /&gt;
&lt;br /&gt;
2017-03-22: 5:30 pm to 6:30 pm - Patent Data meeting&lt;br /&gt;
&lt;br /&gt;
2017-03-21: 9:30 am - 12:00 pm - Worked on determining &amp;quot;core&amp;quot; tables for new patent database&lt;br /&gt;
&lt;br /&gt;
2017-03-09: 9:00 am - 12:00 pm Finished first draft of spreadsheet describing the current schema (and possible changes) to the Patent database&lt;br /&gt;
&lt;br /&gt;
2017-03-07: 9:30 am - 12:00 pm Continued working on spreadsheet, added relevant page links to project page, took notes on what I want documentation to look like in the future&lt;br /&gt;
&lt;br /&gt;
2017-03-02: 9:15 am - 12:15 pm Started excel spreadsheet to document current schema design and improvements to be made, updated project pages&lt;br /&gt;
&lt;br /&gt;
2017-02-23: 9:30 am - 12:00 pm Reviewed Perl, read about database design, set up project page for redesigning database, started documenting process&lt;br /&gt;
&lt;br /&gt;
2017-02-21: 9:00 am - 12:00 pm Set up work page, reviewed SQL, researched designing database, continued going through wiki&lt;br /&gt;
&lt;br /&gt;
2017-02-16: 10:30 am - 12:00 pm Researched past work on databases, discussed project with Ed&lt;br /&gt;
&lt;br /&gt;
2017-02-14: 10:00 am - 12:00 pm Set up personal wiki, set up work log&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shelby_Bice_(Work_Log)&amp;diff=22247</id>
		<title>Shelby Bice (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shelby_Bice_(Work_Log)&amp;diff=22247"/>
		<updated>2017-11-30T20:32:01Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* Fall 2017 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;===Fall 2017===&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt; &lt;br /&gt;
&lt;br /&gt;
[[Shelby Bice]] [[Work Logs]] [[Shelby Bice (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
2017-11-30 1:55 pm - 3:55 pm - continued altering code. Wrote a table-creation script in SQL for the design, reissue, and plant patent tables. Will be testing to see if it runs.&lt;br /&gt;
&lt;br /&gt;
2017-11-17 2:15 pm - 5:00 pm - continued altering code to include the special fields for plant, reissue, and design patents. Their models and the changes to XmlParser relevant to those models should be done. Next time I am in, I will go through the rest of the code and check where else I need to make alterations. &lt;br /&gt;
&lt;br /&gt;
2017-11-16 8:45 am - 10:30 am - continued altering code to include the special fields for plant, reissue, and design patents&lt;br /&gt;
&lt;br /&gt;
2017-11-14 8:30 - 10:30 am - met with Oliver to go over his code, began working on scripts and finding paths for the fields unique to design, reissue, and plant patents.&lt;br /&gt;
&lt;br /&gt;
2017-11-10: 2:00 - 5:00 pm - updated format of worklog, finished researching fields of the new design, utility, and reissue tables, and documenting what each table contains&lt;br /&gt;
&lt;br /&gt;
2017-11-03: 2:30 - 3:30 pm, 3:45 pm - 5:00 pm (I went to get a snack because I was starving!) - Finished up finding unique fields, added them to database design, and researched what the fields represented - again, this is taking a while since there is little to no documentation online about what fields on these XML files actually represent.&lt;br /&gt;
&lt;br /&gt;
2017-11-02: 2:15 pm - 4:00 pm - made a main data page and continued going through the xpaths Oliver found. Notified Michelle so she can continue adding patent pages&lt;br /&gt;
&lt;br /&gt;
2017-10-31: 9:00 am - 10:30 am - started trying to determine from Oliver's research what paths are unique to utility, reissue, and plant patents that we have not yet accounted for in our design of the table&lt;br /&gt;
&lt;br /&gt;
2017-10-27: 2:00 pm - 3:30 pm - Kept updating tables. For some of the new values I've been adding to tables (I didn't realize we had XPaths for some fields), I'm struggling to find out exactly what the field means (for example, the date on a citation: is this the date the citedpatent was granted, or is it the date that the citingpatent cited the citedpatent?), so the process is slow going. I want to get the documentation right, though, so that in the future someone can look up exactly what each field in the table represents and not have to guess.&lt;br /&gt;
&lt;br /&gt;
2017-10-26: 2:00 pm - 4:00 pm - worked on updating the designs for the patent databases tables based on what was discussed on Friday (specifically adding inventors and lawyers tables, altering fields on patent tables) - will continue this on Friday&lt;br /&gt;
&lt;br /&gt;
2017-10-20: 2:00 pm - 5:00 pm - met with Ed and Oliver about patent database design&lt;br /&gt;
&lt;br /&gt;
2017-10-19: 2:00 pm - 4:00 pm - finished up ER diagram and description of all tables for patent database, started reading papers concerning creating an inventor's database (looks like other research groups have merged the USPTO data with the Harvard Dataverse data in order to create an inventors table)&lt;br /&gt;
&lt;br /&gt;
2017-10-17: 8:45 am - 10:30 am - continued trying to solve twitter issues, worked on ER diagram, skimmed through a paper I found (linked at the top of the page for Redesign Assignment and Patent Database) to see how they cleaned the data, since I assume that will be the next step to research after finishing the ER diagrams&lt;br /&gt;
&lt;br /&gt;
2017-10-12: 2:30 pm - 5:00 pm write blog post on Grace Hopper Celebration and attempted to solve Twitter issue&lt;br /&gt;
&lt;br /&gt;
2017-10-03: 8:45 am - 10:30 am - finished adding descriptions to the fields for each table in the patent database, started work on an ER diagram&lt;br /&gt;
&lt;br /&gt;
2017-09-29: 2:00 pm - 5:00 pm - finished adding the design for the Patent database to the project page, added descriptions of the fields for each table to the project page including the datatype that I think the field will be when it's loaded into the database&lt;br /&gt;
&lt;br /&gt;
2017-09-28: 9:00 am - 10:30 am - continued going over and documenting last semester's patent database design and adding the details to the Redesign Assignment and Patent Database project page. Additionally, I began trying to determine how to match up the information in the Document_Info table in Assignment to match up with a patent_id in the Patent table in Patent.&lt;br /&gt;
&lt;br /&gt;
Tomorrow I will finish adding the design for the Patent database to the project page, add descriptions of the fields for each table to the project page, and start working on ER diagrams for the two databases.&lt;br /&gt;
&lt;br /&gt;
Links for creating ER diagrams: https://erdplus.com/#/standalone or https://creately.com/app/?tempID=hqdgwjki1&amp;amp;login_type=demo#&lt;br /&gt;
&lt;br /&gt;
2017-09-26: 8:45 am - 10:00 am - continued working on design of Assignment database by checking my design against the work done last semester on the assignment data restructure to make sure I didn't miss anything major. Began going over my patent database design from last semester to tweak it. Will need to sync up with Joe Reilly to see if there are any new fields that we are pulling from the data. Additionally, I made a new project page called Redesign Assignment and Patent Database that encompasses the new design for the Assignment database and Patent database redesign and moved some of the notes from McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper to the project page.&lt;br /&gt;
&lt;br /&gt;
The main takeaway from looking over Patent Assignment Data Restructure is that, after assembling the table according to my design (which doesn't seem to have any contradictions with the Patent Assignment Data Restructure), there will be multiple steps for cleaning the data, specifically the fields relating to location and address in the assignment table. While the Patent Assignment Data Restructure mentions connecting to the Patent database, it is not clear from the page what field would be used to connect to the Patent database.&lt;br /&gt;
&lt;br /&gt;
2017-09-23: 2:00 pm - 4:00 pm - continued working on design of Assignment database and how it will connect to Patent database by writing out what will be in each table in Assignment and questions about different possible structures of tables that we will have to address before finalizing the design - the notes can be found under McNair/Projects/Redesigning Patent Database/New Patent Database Project as Notes on USPTO Assignment Data Paper. Questions are highlighted in yellow throughout the document[[Category:Work Log]]&lt;br /&gt;
&lt;br /&gt;
2017-09-22: 8:30 am - 10:30 am - continued looking at paper on USPTO assignment data and adding to the notes on what the design of that database should look like, specifically on what I need for different tables and what I don't know yet about the design. Had to set up connection to RDP again due to technical issues. &lt;br /&gt;
&lt;br /&gt;
2017-09-15: 2:00 pm - 5:00 pm - introduced to new patent database project, reviewed and took notes on USPTO Assignment data (notes can be found under McNair/Projects/Redesigning Patent Database/New Patent Database Project as Notes on USPTO Assignment Data Paper)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Spring 2017===&lt;br /&gt;
&lt;br /&gt;
2017-04-27:  1:00 pm - 3:00 pm - worked on documentation more, tried to figure out how to clean citation data&lt;br /&gt;
&lt;br /&gt;
2017-04-25:  10:00 am - 12:00 pm - worked on documentation, tried to determine how to clean up the USPTO Assignee Data&lt;br /&gt;
&lt;br /&gt;
2017-04-20:  10:00 am - 11:30 pm - wrote copy statements for copying data from RDP to database, continued working on documentation.&lt;br /&gt;
&lt;br /&gt;
2017-04-18:  9:45 am - 12:30 pm - Cleaned up documentation more, kept working through the process of parsing the data&lt;br /&gt;
&lt;br /&gt;
2017-04-13: 9:30 am - 12:00 pm - Continued working on trying to update patent data through 2016, specifically parsing the data, worked with Ed to update perl scripts&lt;br /&gt;
&lt;br /&gt;
2017-04-11: 9:15 am - 12:00 pm - Worked on trying to update patent data through 2016&lt;br /&gt;
&lt;br /&gt;
2017-04-06: 9:30 am - 11:30 pm - Kept looking through DTDs, kept updating documentation&lt;br /&gt;
&lt;br /&gt;
2017-04-04: 9:00 am - 11:45 pm - Worked on updating documentation, found documentation on pulling data/making tables and databases, started looking through DTDs to find extra fields to pull&lt;br /&gt;
&lt;br /&gt;
2017-03-23: 9:15 am - 12:00 pm - Narrowed down core tables and fields&lt;br /&gt;
&lt;br /&gt;
2017-03-22: 5:30 pm to 6:30 pm - Patent Data meeting&lt;br /&gt;
&lt;br /&gt;
2017-03-21: 9:30 am - 12:00 pm - Worked on determining &amp;quot;core&amp;quot; tables for new patent database&lt;br /&gt;
&lt;br /&gt;
2017-03-09: 9:00 am - 12:00 pm Finished first draft of spreadsheet describing the current schema (and possible changes) to the Patent database&lt;br /&gt;
&lt;br /&gt;
2017-03-07: 9:30 am - 12:00 pm Continued working on spreadsheet, added relevant page links to project page, took notes on what I want documentation to look like in the future&lt;br /&gt;
&lt;br /&gt;
2017-03-02: 9:15 am - 12:15 pm Started excel spreadsheet to document current schema design and improvements to be made, updated project pages&lt;br /&gt;
&lt;br /&gt;
2017-02-23: 9:30 am - 12:00 pm Reviewed Perl, read about database design, set up project page for redesigning database, started documenting process&lt;br /&gt;
&lt;br /&gt;
2017-02-21: 9:00 am - 12:00 pm Set up work page, reviewed SQL, researched designing database, continued going through wiki&lt;br /&gt;
&lt;br /&gt;
2017-02-16: 10:30 am - 12:00 pm Researched past work on databases, discussed project with Ed&lt;br /&gt;
&lt;br /&gt;
2017-02-14: 10:00 am - 12:00 pm Set up personal wiki, set up work log&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shelby_Bice_(Work_Log)&amp;diff=22241</id>
		<title>Shelby Bice (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shelby_Bice_(Work_Log)&amp;diff=22241"/>
		<updated>2017-11-30T20:03:49Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* Fall 2017 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;===Fall 2017===&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt; &lt;br /&gt;
&lt;br /&gt;
[[Shelby Bice]] [[Work Logs]] [[Shelby Bice (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
2017-11-30 1:55 pm - 3:55 pm - continued altering code. Will be testing to see if it runs.&lt;br /&gt;
&lt;br /&gt;
2017-11-17 2:15 pm - 5:00 pm - continued altering code to include the special fields for plant, reissue, and design patent. Their models and the changes to XmlParser relevant to those models should be done. I will go through the rest of the code and check where else I need to make alterations next time I am in. &lt;br /&gt;
&lt;br /&gt;
2017-11-16 8:45 am - 10:30 am - continued altering code to include the special fields for plant, reissue, and design patents&lt;br /&gt;
&lt;br /&gt;
2017-11-14 8:30 - 10:30 am - met with Oliver to go over his code, began working on scripts and finding paths for the fields unique to design, reissue, and plant patents.&lt;br /&gt;
&lt;br /&gt;
2017-11-10: 2:00 - 5:00 pm - updated format of worklog, finished researching fields of the new design, utility, and reissue tables, and documenting what each table contains&lt;br /&gt;
&lt;br /&gt;
2017-11-03: 2:30 - 3:30 pm, 3:45 pm - 5:00 pm (I went to get a snack because I was starving!) - Finished up finding unique fields, added them to database design, and researched what the fields represented - again, this is taking a while since there is little to no documentation online about what fields on these XML files actually represent.&lt;br /&gt;
&lt;br /&gt;
2017-11-02: 2:15 pm - 4:00 pm - made a main data page and continued going through the xpaths Oliver found. Notified Michelle so she can continue adding patent pages&lt;br /&gt;
&lt;br /&gt;
2017-10-31: 9:00 am - 10:30 am - started trying to determine from Oliver's research what paths are unique to utility, reissue, and plant patents that we have not yet accounted for in our design of the table&lt;br /&gt;
&lt;br /&gt;
2017-10-27: 2:00 pm - 3:30 pm - Kept updating tables. For some of the new values I've been adding to tables (I didn't realize we had XPaths for some fields) I'm struggling to find out exactly what the field means (for example, the date on a citation - is this the date the citedpatent was granted? or is the date that the citingpatent cited the citedpatent?) so the process is slow going. I want to get the documentation right though so in the future someone can look up and see exactly what each field in the table represents and not have to guess&lt;br /&gt;
&lt;br /&gt;
2017-10-26: 2:00 pm - 4:00 pm - worked on updating the designs for the patent databases tables based on what was discussed on Friday (specifically adding inventors and lawyers tables, altering fields on patent tables) - will continue this on Friday&lt;br /&gt;
&lt;br /&gt;
2017-10-20: - 2:00 pm - 5:00 pm - met with Ed and Oliver about patent database design&lt;br /&gt;
&lt;br /&gt;
2017-10-19: 2:00 pm - 4:00 pm - finished up ER diagram and description of all tables for patent database, started reading papers concerning creating an inventor's database (looks like other research groups have merged the USPTO data with the Harvard Dataverse data in order to create an inventors table)&lt;br /&gt;
&lt;br /&gt;
2017-10-17: 8:45 am - 10:30 am - continued trying to solve twitter issues, worked on ER diagram, skimmed through a paper I found (linked at the top of the page for Redesign Assignment and Patent Database) to see how they cleaned the data, since I assume that will be the next step to research after finishing the ER diagrams&lt;br /&gt;
&lt;br /&gt;
2017-10-12: 2:30 pm - 5:00 pm write blog post on Grace Hopper Celebration and attempted to solve Twitter issue&lt;br /&gt;
&lt;br /&gt;
2017-10-03: 8:45 am - 10:30 am - finished adding descriptions to the fields for each table in the patent database, started work on an ER diagram&lt;br /&gt;
&lt;br /&gt;
2017-09-29: 2:00 pm - 5:00 pm - finished adding the design for the Patent database to the project page, added descriptions of the fields for each table to the project page including the datatype that I think the field will be when it's loaded into the database&lt;br /&gt;
&lt;br /&gt;
2017-09-28: 9:00 am - 10:30 am - continued going over and documenting last semester's patent database design and adding the details to the Redesign Assignment and Patent Database project page. Additionally, I began trying to determine how to match up the information in the Document_Info table in Assignment to match up with a patent_id in the Patent table in Patent.&lt;br /&gt;
&lt;br /&gt;
Tomorrow I will finish adding the design for the Patent database to the project page, add descriptions of the fields for each table to the project page, and start working on ER diagrams for the two databases.&lt;br /&gt;
&lt;br /&gt;
Links for creating ER diagrams: https://erdplus.com/#/standalone or https://creately.com/app/?tempID=hqdgwjki1&amp;amp;login_type=demo#&lt;br /&gt;
&lt;br /&gt;
2017-09-26: 8:45 am - 10:00 am - continued working on design of Assignment database by checking my design against the work done last semester on the assignment data restructure to make sure I didn't miss anything major. Began going over my patent database design from last semester to tweak it. Will need to sync up with Joe Reilly to see if there are any new fields that we are pulling from the data. Additionally, I made a new project page called Redesign Assignment and Patent Database that encompasses the new design for the Assignment database and Patent database redesign and moved some of the notes from McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper to the project page.&lt;br /&gt;
&lt;br /&gt;
The main takeaway from looking over Patent Assignment Data Restructure is that, after assembling the table according to my design (which doesn't seem to have any contradictions with the Patent Assignment Data Restructure), there will be multiple steps for cleaning the data, specifically the fields relating to location and address in the assignment table. While the Patent Assignment Data Restructure mentions connecting to the Patent database, it is not clear from the page what field would be used to connect to the Patent database.&lt;br /&gt;
&lt;br /&gt;
2017-09-23: 2:00 pm - 4:00 pm - continued working on design of Assignment database and how it will connect to Patent database by writing out what will be in each table in Assignment and questions about different possible structures of tables that we will have to address before finalizing the design - the notes can be found under McNair/Projects/Redesigning Patent Database/New Patent Database Project as Notes on USPTO Assignment Data Paper. Questions are highlighted in yellow throughout the document[[Category:Work Log]]&lt;br /&gt;
&lt;br /&gt;
2017-09-22: 8:30 am - 10:30 am - continued looking at paper on USPTO assignment data and adding to the notes on what the design of that database should look like, specifically on what I need for different tables and what I don't know yet about the design. Had to set up connection to RDP again due to technical issues. &lt;br /&gt;
&lt;br /&gt;
2017-09-15: 2:00 pm - 5:00 pm - introduced to new patent database project, reviewed and took notes on USPTO Assignment data (notes can be found under McNair/Projects/Redesigning Patent Database/New Patent Database Project as Notes on USPTO Assignment Data Paper)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Spring 2017===&lt;br /&gt;
&lt;br /&gt;
2017-04-27:  1:00 pm - 3:00 pm - worked on documentation more, tried to figure out how to clean citation data&lt;br /&gt;
&lt;br /&gt;
2017-04-25:  10:00 am - 12:00 pm - worked on documentation, tried to determine how to clean up the USPTO Assignee Data&lt;br /&gt;
&lt;br /&gt;
2017-04-20:  10:00 am - 11:30 pm - wrote copy statements for copying data from RDP to database, continued working on documentation.&lt;br /&gt;
&lt;br /&gt;
2017-04-18:  9:45 am - 12:30 pm - Cleaned up documentation more, kept working through the process of parsing the data&lt;br /&gt;
&lt;br /&gt;
2017-04-13: 9:30 am - 12:00 pm - Continued working on trying to update patent data through 2016, specifically parsing the data, worked with Ed to update perl scripts&lt;br /&gt;
&lt;br /&gt;
2017-04-11: 9:15 am - 12:00 pm - Worked on trying to update patent data through 2016&lt;br /&gt;
&lt;br /&gt;
2017-04-06: 9:30 am - 11:30 pm - Kept looking through DTDs, kept updating documentation&lt;br /&gt;
&lt;br /&gt;
2017-04-04: 9:00 am - 11:45 pm - Worked on updating documentation, found documentation on pulling data/making tables and databases, started looking through DTDs to find extra fields to pull&lt;br /&gt;
&lt;br /&gt;
2017-03-23: 9:15 am - 12:00 pm - Narrowed down core tables and fields&lt;br /&gt;
&lt;br /&gt;
2017-03-22: 5:30 pm to 6:30 pm - Patent Data meeting&lt;br /&gt;
&lt;br /&gt;
2017-03-21: 9:30 am - 12:00 pm - Worked on determining &amp;quot;core&amp;quot; tables for new patent database&lt;br /&gt;
&lt;br /&gt;
2017-03-09: 9:00 am - 12:00 pm Finished first draft of spreadsheet describing the current schema (and possible changes) to the Patent database&lt;br /&gt;
&lt;br /&gt;
2017-03-07: 9:30 am - 12:00 pm Continued working on spreadsheet, added relevant page links to project page, took notes on what I want documentation to look like in the future&lt;br /&gt;
&lt;br /&gt;
2017-03-02: 9:15 am - 12:15 pm Started excel spreadsheet to document current schema design and improvements to be made, updated project pages&lt;br /&gt;
&lt;br /&gt;
2017-02-23: 9:30 am - 12:00 pm Reviewed Perl, read about database design, set up project page for redesigning database, started documenting process&lt;br /&gt;
&lt;br /&gt;
2017-02-21: 9:00 am - 12:00 pm Set up work page, reviewed SQL, researched designing database, continued going through wiki&lt;br /&gt;
&lt;br /&gt;
2017-02-16: 10:30 am - 12:00 pm Researched past work on databases, discussed project with Ed&lt;br /&gt;
&lt;br /&gt;
2017-02-14: 10:00 am - 12:00 pm Set up personal wiki, set up work log&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=21977</id>
		<title>Redesign Assignment and Patent Database</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=21977"/>
		<updated>2017-11-17T22:56:50Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* Finding New Fields Unique to Plant, Reissue, and/or Design Patents */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Redesign Assignment and Patent Database&lt;br /&gt;
|Has owner=Shelby Bice,&lt;br /&gt;
|Has start date=9/2017&lt;br /&gt;
|Has keywords=patent&lt;br /&gt;
|Is dependent on=Reproducible Patent Data,&lt;br /&gt;
}}&lt;br /&gt;
'''FOR ED:''' I finished going through the unique paths (you can see the unique fields I identified at the bottom of this page, under &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot;). Then, under my design for the Patent Database, there's a new section called &amp;quot;Unique Attributes Tables&amp;quot; where I've begun detailing a design for three new tables that would include information that is unique to Design, Reissue, and Plant patents. Here are some questions I have:&lt;br /&gt;
&lt;br /&gt;
'''1.''' Do you like the design for the new tables? How do you think we should populate these tables? I can think of one way (putting all the information in the main patent tables and then moving it out to Plant, Reissue, or Design as needed) but I was wondering if there might be an easier way. &lt;br /&gt;
&lt;br /&gt;
'''2.''' There were some unique attributes to utility patents that I also included in &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot;. They only appear to be in two of the oldest XML versions (4.1 and 4.0) and I'm not sure the information is particularly useful - but please take a look at the field names and let me know if you disagree and think they are useful and should be included in a table in the patent database.&lt;br /&gt;
&lt;br /&gt;
'''3.''' Do you know what some of these fields represent? For example, I've been trying to research online what a &amp;quot;Parent Document&amp;quot; and/or &amp;quot;Parent Grant Document&amp;quot; might represent for a reissue patent, and I've found a couple possible options. I will keep researching, but if you know what they represent or one of my descriptions for a field is wrong, please let me know.&lt;br /&gt;
&lt;br /&gt;
This is an extension of the work I did last semester under &amp;quot;Redesigning Patent Database&amp;quot;. Instead of simply reconfiguring the existing database, this project encompasses a full redesign and creation of a new Patent database and a new Assignment database that will be joined together.&lt;br /&gt;
&lt;br /&gt;
Reference for definitions of fields in the two databases: https://www.oecd.org/sti/sci-tech/37569498.pdf&lt;br /&gt;
&lt;br /&gt;
Paper on how to extract data from XML files, parse it, and put it into a database: https://funginstitute.berkeley.edu/wp-content/uploads/2014/06/patentprocessor.pdf&lt;br /&gt;
&lt;br /&gt;
'''Particularly useful, because it provides solutions for cleaning location and firm data'''&lt;br /&gt;
&lt;br /&gt;
==ER Diagram for Assignment and Patent Databases==&lt;br /&gt;
&lt;br /&gt;
Note: this is currently out of sync with the current description below for the patent database. It will be updated once I have finalized how we will organize unique attributes for plant, design, and reissue patents.&lt;br /&gt;
&lt;br /&gt;
[[File:PatentAndAssignmentER2.png]]&lt;br /&gt;
&lt;br /&gt;
Attributes for each table are listed below with descriptions - because of how many attributes there are, I decided the ER diagram would be better suited as an overview of the tables rather than trying to show all the attributes on the diagram.&lt;br /&gt;
&lt;br /&gt;
==Assignment Database Structure==&lt;br /&gt;
&lt;br /&gt;
After reading through the paper &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot; (https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf) and taking into account our needs for the Assignment database (the ability to connect the two databases together, the ability to trace conveyance of a patent over time) I decided on the following structure for the Assignment database.&lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table in the database is rf_id, which stands for reel frame id. It is unique for each entry in the main table, assignment, and is included in every table. &lt;br /&gt;
&lt;br /&gt;
Conveyance is the only table that cannot be fully constructed as data is inserted into the database. Some of the fields depend on other tables being constructed, so it may either be partially populated while the other tables are being populated or be the last table to be populated. &lt;br /&gt;
&lt;br /&gt;
Full list of tables: assignee, assignor, document_info, assignment, conveyance&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: store all the assignees from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the recipient of a patent in an entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id, unless it is possible to have multiple assignees, then rf_id and assignee name&lt;br /&gt;
&lt;br /&gt;
All columns (variables) in the table: &lt;br /&gt;
&lt;br /&gt;
* rf_id (bigint) reel frame number &lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* address_line_1 (varchar(255)) first line of address of assignee&lt;br /&gt;
* address_line_2 (varchar(255)) second line of address of assignee&lt;br /&gt;
* city (varchar(255)) city of assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* country (varchar(255)) country of assignee&lt;br /&gt;
* postal_code (varchar(255)) post code of the assignee&lt;br /&gt;
&lt;br /&gt;
===ASSIGNOR===&lt;br /&gt;
	&lt;br /&gt;
Purpose: store all the assignors from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the one assigning/granting the license for the entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id and assignor_name&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
&lt;br /&gt;
===CONVEYANCE===&lt;br /&gt;
Purpose: represent the conveyance of a particular patent over time&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual transaction for a patent (not unlike assignment, except this table has information about the assignment in relation to other assignments for the patent)&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id &lt;br /&gt;
&lt;br /&gt;
All columns (variables): Two possibilities, depending on whether we want data to be repeated or not.&lt;br /&gt;
	&lt;br /&gt;
Not-repeated (with exception of primary/foreign key): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* next_transaction (bigint) rf_id of next transaction that occurred after this transaction (traced by assignee/assignor)&lt;br /&gt;
* num_conveyance (int) number in order of conveyance by date&lt;br /&gt;
* acknowledgment_date (date) date that the assignor acknowledged the transaction&lt;br /&gt;
* execution_date (date) date that the transaction took place&lt;br /&gt;
* conveyance_type (varchar(255)) the USPTO paper determined a couple of different conveyance types and also outlined how to determine based on the conveyance text in the xml file which conveyance type is relevant to the transaction. The conveyance types are:&lt;br /&gt;
** assignment&lt;br /&gt;
** merger&lt;br /&gt;
** change of name&lt;br /&gt;
** government interest&lt;br /&gt;
** agreement&lt;br /&gt;
** security agreement&lt;br /&gt;
** release&lt;br /&gt;
&lt;br /&gt;
	&lt;br /&gt;
Repeated, also include: &lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
* patent_number (varchar(255)) identifying number for the patent&lt;br /&gt;
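One way num_conveyance and next_transaction could be filled in is sketched below in Python. It assumes each transaction is available as an (rf_id, patent_number, execution_date) tuple with ISO-formatted dates and that each rf_id appears only once; the function name and input shape are hypothetical, and ordering purely by execution_date is a simplification of tracing by assignee/assignor.&lt;br /&gt;

```python
from collections import defaultdict

def order_conveyances(transactions):
    """Return {rf_id: (num_conveyance, next_transaction)}.

    transactions: iterable of (rf_id, patent_number, execution_date)
    tuples, with execution_date as an ISO 'YYYY-MM-DD' string.
    Assumption: each rf_id appears in the input only once.
    """
    by_patent = defaultdict(list)
    for rf_id, patent_number, execution_date in transactions:
        by_patent[patent_number].append((execution_date, rf_id))

    result = {}
    for rows in by_patent.values():
        rows.sort()  # ISO dates sort chronologically as strings
        for position, (_, rf_id) in enumerate(rows):
            is_last = position == len(rows) - 1
            next_rf_id = None if is_last else rows[position + 1][1]
            result[rf_id] = (position + 1, next_rf_id)
    return result

# Hypothetical sample data, not taken from the USPTO files
sample = [
    (300, "7000001", "2015-06-01"),
    (100, "7000001", "2010-02-10"),
    (200, "7000002", "2012-09-30"),
]
ordered = order_conveyances(sample)
print(ordered[100])  # prints (1, 300)
```

Because this step needs every transaction for a patent to be present, it fits the note above that CONVEYANCE may have to be the last table populated.&lt;br /&gt;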
&lt;br /&gt;
===DOCUMENT_INFO===&lt;br /&gt;
Purpose: store extra information relevant to the patents represented in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the documentation and the patent for a particular entry in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables) when created: &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* app_num (varchar(255)) application document USPTO number &lt;br /&gt;
* app_date (date) date that the application was filed&lt;br /&gt;
* app_country (varchar(255)) country in which it was filed&lt;br /&gt;
* pgpub_num (varchar(255)) pre-grant publication document USPTO number&lt;br /&gt;
* pgpub_date (date) date that pre-grant publication was released&lt;br /&gt;
* pgpub_country (varchar(255)) country in which the patent is published before being granted a patent number&lt;br /&gt;
* patent_number (varchar(255)) granted patent document USPTO number&lt;br /&gt;
* grant_date (date) date that the patent is officially published&lt;br /&gt;
* grant_country (varchar(255)) country in which the patent is published after being granted a patent number&lt;br /&gt;
* invention_title (varchar(255)) title of the invention&lt;br /&gt;
* language (varchar(255)) language of invention title – could be potentially useful and interesting to investigate &lt;br /&gt;
* reel_num (bigint) number of the reel the assignment was stored on&lt;br /&gt;
* frame_num (bigint) number of the frame on the reel the assignment was stored on&lt;br /&gt;
&lt;br /&gt;
===ASSIGNMENT===&lt;br /&gt;
Purpose: represent assignment transactions – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent assignment transaction&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignment_id (bigint) US patent assignment id &lt;br /&gt;
* correspondent_name (varchar(255)) name of correspondent for the transaction, could be a lawyer&lt;br /&gt;
* correspondent_address1 (varchar(255)) first line of address for correspondent&lt;br /&gt;
* correspondent_address2 (varchar(255)) second line of address for correspondent&lt;br /&gt;
* correspondent_address3 (varchar(255)) third line of address for correspondent&lt;br /&gt;
* correspondent_address4 (varchar(255)) fourth line of address for correspondent&lt;br /&gt;
* record_date (date) date recorded with USPTO&lt;br /&gt;
* last_update_date (date) date that information for this assignment was last updated with the USPTO &lt;br /&gt;
* page_cnt (bigint) page count of assignment record&lt;br /&gt;
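As a minimal sketch of how rf_id ties the tables together, the Python below builds cut-down versions of ASSIGNMENT and ASSIGNOR in an in-memory SQLite database and joins them on rf_id. The use of SQLite, the reduced column lists, the helper names, and the sample rows are all assumptions made only for illustration; the real database may use a different engine and would carry the full column lists above.&lt;br /&gt;

```python
import sqlite3

def build_assignment_sketch():
    """Create cut-down ASSIGNMENT and ASSIGNOR tables keyed on rf_id."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        "CREATE TABLE assignment ("
        " rf_id INTEGER PRIMARY KEY,"
        " record_date TEXT,"
        " page_cnt INTEGER)"
    )
    # Composite primary key: one assignment can have several assignors
    conn.execute(
        "CREATE TABLE assignor ("
        " rf_id INTEGER NOT NULL REFERENCES assignment(rf_id),"
        " assignor_name TEXT NOT NULL,"
        " PRIMARY KEY (rf_id, assignor_name))"
    )
    return conn

def assignors_for(conn, rf_id):
    """Return the assignor names recorded for one assignment."""
    rows = conn.execute(
        "SELECT r.assignor_name FROM assignment a"
        " JOIN assignor r ON r.rf_id = a.rf_id"
        " WHERE a.rf_id = ? ORDER BY r.assignor_name",
        (rf_id,),
    ).fetchall()
    return [name for (name,) in rows]

# Hypothetical sample rows
conn = build_assignment_sketch()
conn.execute("INSERT INTO assignment VALUES (12340567, '2017-01-15', 3)")
conn.executemany(
    "INSERT INTO assignor VALUES (?, ?)",
    [(12340567, "Jane Doe"), (12340567, "Acme Corp")],
)
print(assignors_for(conn, 12340567))  # prints ['Acme Corp', 'Jane Doe']
```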
&lt;br /&gt;
More extensive notes exist under :E/McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper&lt;br /&gt;
&lt;br /&gt;
==Patent Database Structure==&lt;br /&gt;
&lt;br /&gt;
Most of this design is based off what I investigated last semester and therefore is subject to change, since the design for the Assignment database is somewhat new. &lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table is patent_number, which uniquely identifies every patent in the PATENT table and is included in every table.&lt;br /&gt;
&lt;br /&gt;
Full list of tables: patent, assignee, citation, fee, claims, histpatent, lawyers, inventors&lt;br /&gt;
&lt;br /&gt;
Update: These tables, as of October 26, are being updated and changed as I determine what data can actually be retrieved from the XML files based on the xpaths that we have determined (you can see these xpaths on the &amp;quot;Equivalent XPath and APS Queries&amp;quot; project page).&lt;br /&gt;
&lt;br /&gt;
===PATENT===&lt;br /&gt;
Purpose: represent patents – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* patent_type (varchar(255)) the type of patent, like utility&lt;br /&gt;
* patent_kind (varchar(2)) the kind of patent, a letter and a number&lt;br /&gt;
* title (varchar(255)) the title of the patent&lt;br /&gt;
* grantdate (date) date that patent was officially granted including the year&lt;br /&gt;
* prioritydate (date) date used to establish novelty of an invention in regards to other inventions (source: http://www.bios.net/daisy/patentlens/2343.html)&lt;br /&gt;
* prioritycountry (varchar(255)) country where patent is first filed&lt;br /&gt;
* patent_country (varchar(255)) country in which this patent was published&lt;br /&gt;
* prioritypatentnumber (varchar(255)) placeholder patent number if the patent is published before it is fully approved - also referred to as priority claims number&lt;br /&gt;
* cpcsubclass (varchar(255)) cooperative patent classification subclass&lt;br /&gt;
* cpcsubgroup (varchar(255)) cooperative patent classification subgroup&lt;br /&gt;
* cpcmaingroup (varchar(255)) cooperative patent classification main group&lt;br /&gt;
* cpctotal (varchar(255)) concatenated version of cpcmaingroup, cpcsubgroup, cpcsubclass&lt;br /&gt;
* pctpatentnumber (varchar(255)) international patent number (according to Patent Cooperation Treaty) also known as PCT Document Number&lt;br /&gt;
* appnum(varchar(255)) application number, probably also the filing number - the format is a two digit series code followed by a six digit serial number (source: https://www.uspto.gov/patents-application-process/filing-online/info-application-number)&lt;br /&gt;
* appdate (date) date that the application was filed, including year - probably also the filing date&lt;br /&gt;
* ipcrsubclass (varchar(255)) International patent classification subclass&lt;br /&gt;
* ipcrmaingroup (varchar(255)) International patent classification main group&lt;br /&gt;
* ipcrsubgroup (varchar(255)) International patent classification subgroup&lt;br /&gt;
* ipcrtotal (varchar(255)) concatenated version of ipcrsubgroup, ipcrsubclass, ipcrmaingroup&lt;br /&gt;
* national_classification (varchar(255)) - probably uspc, which is United States Patent Classification&lt;br /&gt;
* national_classification_country (varchar(255)) - should be United States&lt;br /&gt;
* num_claims (int) number of claims (things that the inventor wishes to protect with this patent)&lt;br /&gt;
&lt;br /&gt;
Discrepancies: &lt;br /&gt;
These fields were previously a part of the table, but we do not currently have xpaths for them:&lt;br /&gt;
* app_year - the year the application was filed&lt;br /&gt;
* grant_year - the year the patent was granted&lt;br /&gt;
* uspc - United States Patent Classification (might be the same as national_classification)&lt;br /&gt;
* uspcsub - United States Patent Classification subgroup&lt;br /&gt;
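&lt;br /&gt;
The two year fields may not need xpaths of their own if appdate and grantdate are populated, since the year can be read off the date. A minimal Python sketch, assuming both dates have already been parsed into date objects (the function name derive_years and the sample dates are hypothetical, not part of the existing code):&lt;br /&gt;
&lt;br /&gt;
```python
from datetime import date


def derive_years(appdate, grantdate):
    """Return (app_year, grant_year) taken from the two date columns."""
    # Either date may be missing, so fall back to None rather than failing.
    app_year = appdate.year if appdate is not None else None
    grant_year = grantdate.year if grantdate is not None else None
    return app_year, grant_year


# Hypothetical dates, for illustration only.
example = derive_years(date(2015, 3, 9), date(2017, 12, 5))
```
&lt;br /&gt;
The uspc and uspcsub fields cannot be derived this way and would still need a source in the XML.&lt;br /&gt;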
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: Represent the person to whom the patent was assigned at publication, according to the patent information (the information in the Assignment database is more complete)&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the assignee for a patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number, first_name, last_name (in case it is possible for there to be multiple assignees for an individual patent)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* first_name (varchar(255)) first name of assignee&lt;br /&gt;
* last_name (varchar(255)) last name of assignee&lt;br /&gt;
* address (varchar(255)) address of assignee&lt;br /&gt;
* postcode (varchar(255)) postcode of assignee&lt;br /&gt;
* orgname (varchar(255)) name of the organization owning the patent (if applicable - organization may go by multiple orgnames) &lt;br /&gt;
* city (varchar(255)) city of the assignee&lt;br /&gt;
* country (varchar(255)) country of the assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* residence (varchar(255)) mailing address of assignee if they receive mail at a different address than the one listed above&lt;br /&gt;
* size_of_firm (varchar(255)) size of the organization (large, small, or micro), which relates to the fee they pay for the patent&lt;br /&gt;
&lt;br /&gt;
===CITATION===&lt;br /&gt;
Purpose: Represent all the patents in PATENT that another patent in PATENT cites&lt;br /&gt;
&lt;br /&gt;
Each entry represents: one citation that a patent in PATENT makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number AND cited_patent_number (because a patent might cite multiple patents, so both variables are needed for the primary key)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* cited_patent_number (varchar(255)) unique identifier from the USPTO for the patent being cited&lt;br /&gt;
* description (varchar(255)) description of the citation&lt;br /&gt;
* citation_date (date) unclear what this date is supposed to represent from the DTD, but according to the paper that is the second link in the top section, it is the &amp;quot;date of cited document&amp;quot;&lt;br /&gt;
* citation_kind (varchar(255)) type of cited document&lt;br /&gt;
* citation_country (varchar(255)) country of origin for the cited patent&lt;br /&gt;
* citation_name (varchar(255)) appears to be the names of those to whom the cited patent is assigned&lt;br /&gt;
&lt;br /&gt;
Note: There are other fields available for this table, such as document number, but this seems like it might be a repeat of the document number for the citing patent&lt;br /&gt;
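&lt;br /&gt;
The composite primary key described above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the real database may use a different engine, only three of the columns are shown, and the sample patent numbers are made up.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE citation (
        patent_number       VARCHAR(255) NOT NULL,
        cited_patent_number VARCHAR(255) NOT NULL,
        description         VARCHAR(255),
        PRIMARY KEY (patent_number, cited_patent_number)
    )
    """
)

# One citing patent may cite several patents, so patent_number repeats.
rows = [("P1", "C1", None), ("P1", "C2", None)]
conn.executemany("INSERT INTO citation VALUES (?, ?, ?)", rows)

# The same (citing, cited) pair a second time violates the composite key.
try:
    conn.execute("INSERT INTO citation VALUES (?, ?, ?)", ("P1", "C1", None))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

citation_count = conn.execute("SELECT COUNT(*) FROM citation").fetchone()[0]
```
&lt;br /&gt;
Neither column is unique on its own here, which is why both are needed in the key.&lt;br /&gt;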
&lt;br /&gt;
===FEE===&lt;br /&gt;
Purpose: Represent information about the fees paid for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the fees paid on the patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* four (boolean) payment of maintenance fee 4th year&lt;br /&gt;
* eight (boolean) payment of maintenance fee 8th year&lt;br /&gt;
* twelve (boolean) payment of maintenance fee 12th year&lt;br /&gt;
* fee_date (date) date that fee is paid on&lt;br /&gt;
* fee_code (varchar(255)) code for type of fee to pay&lt;br /&gt;
&lt;br /&gt;
===CLAIMS===&lt;br /&gt;
&lt;br /&gt;
Purpose: Represent information about the claims for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a claim that a patent makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number and claim_id&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* claim_id (varchar(255)) id number for the claim&lt;br /&gt;
* language (varchar(255)) language of claim&lt;br /&gt;
* claim_type (varchar(255)) type of claim&lt;br /&gt;
* status (varchar(255)) status of the claim&lt;br /&gt;
* claim_text (text) the text of the claim&lt;br /&gt;
&lt;br /&gt;
===HISTPATENT===&lt;br /&gt;
Purpose: Represent historical data about the patent pertaining to dates and publication&lt;br /&gt;
&lt;br /&gt;
Each entry represents: sequence of historical data about an individual patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* pubno (varchar(255)) patent publication number&lt;br /&gt;
* pubdate (date) date patent was issued&lt;br /&gt;
* dispdate (date) disposal date for the application&lt;br /&gt;
* distype (varchar(3)) application status of the patent - ABN for abandoned, ISS for issued, or PEN for pending&lt;br /&gt;
* exp_dt (date) expiration date of patent&lt;br /&gt;
* pta (int) length of patent term adjustment (which extends the amount of time the patent is allowed to be in force) in number of days&lt;br /&gt;
&lt;br /&gt;
===INVENTORS===&lt;br /&gt;
Purpose: Represent the inventor for each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about an inventor for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (and name, if there can be multiple inventors for a single patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identify the inventor?&lt;br /&gt;
* name (varchar(255)) concatenated string of the first and last name of the inventor&lt;br /&gt;
* orgname (varchar(255)) name of the organization the inventor works for&lt;br /&gt;
* state (varchar(255)) - state in which the inventor resided&lt;br /&gt;
* address (varchar(255)) - address at which the inventor resided&lt;br /&gt;
* country (varchar(255)) country in which the inventor resided&lt;br /&gt;
* city (varchar(255)) city in which the inventor resided&lt;br /&gt;
&lt;br /&gt;
===LAWYERS===&lt;br /&gt;
Purpose: Represent the lawyers who worked on each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a lawyer for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (there doesn't appear to be more than one lawyer entry for each patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identify the lawyer?&lt;br /&gt;
* org_name (varchar(255)) the name of the firm that did arbitration for the patent&lt;br /&gt;
&lt;br /&gt;
===Unique Attributes Tables===&lt;br /&gt;
&lt;br /&gt;
Several attributes have been identified that are unique, applying to only one kind of patent. Adding all these fields to the main patent table would introduce a lot of NULL entries, because three out of four patent types will not have any particular attribute. Therefore, I have created three new tables to store these unique attributes - one for each type of patent with unique attributes.&lt;br /&gt;
&lt;br /&gt;
For each table, the primary key is the patent_number, which connects a row to a row in the main patent table. &lt;br /&gt;
&lt;br /&gt;
TODO: Figure out best way to generate these tables, finish description of attributes for each table.&lt;br /&gt;
&lt;br /&gt;
* Note that I did include some unique attributes for Utility patents, as seen below in the &amp;quot;Finding New Fields Unique to Plant, Reissue, and/or Design Patents&amp;quot; section, but for now (11/3/2017) I'm not going to create a separate table for those fields. Most of them seem to be repeated in Reissue, and the other fields, while interesting, are perhaps not very useful.&lt;br /&gt;
&lt;br /&gt;
====PLANT====&lt;br /&gt;
=====Fields=====&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* latin_name (varchar(255)) Latin name for the plant&lt;br /&gt;
* us_botanical_variety (varchar(255)) denotes what variety of plant it is - for example, a rose has several different varieties&lt;br /&gt;
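&lt;br /&gt;
As a rough illustration of how a unique attributes table keeps NULLs out of the main table, the sketch below builds a cut-down PATENT table and the PLANT table in an in-memory SQLite database and joins them on patent_number. The engine, the reduced column list, and the sample rows are assumptions made for the example.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE patent (
        patent_number VARCHAR(255) PRIMARY KEY,
        patent_type   VARCHAR(255)
    );
    CREATE TABLE plant (
        patent_number        VARCHAR(255) PRIMARY KEY
                             REFERENCES patent (patent_number),
        latin_name           VARCHAR(255),
        us_botanical_variety VARCHAR(255)
    );
    """
)

# Made-up rows: only the plant patent gets a row in the plant table.
conn.executemany(
    "INSERT INTO patent VALUES (?, ?)",
    [("PP00001", "plant"), ("0000002", "utility")],
)
conn.execute(
    "INSERT INTO plant VALUES (?, ?, ?)",
    ("PP00001", "Rosa hybrida", "Example Variety"),
)

# Joining on patent_number recovers the plant-specific fields.
plant_rows = conn.execute(
    """
    SELECT p.patent_number, p.patent_type, pl.latin_name
    FROM patent AS p
    JOIN plant AS pl ON pl.patent_number = p.patent_number
    """
).fetchall()
```
&lt;br /&gt;
The utility patent never appears in the plant table, so no NULL latin_name or us_botanical_variety is stored for it.&lt;br /&gt;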
&lt;br /&gt;
====REISSUE====&lt;br /&gt;
&lt;br /&gt;
In the following table, you'll notice lots of fields related to three different kinds of &amp;quot;documents&amp;quot; - a &amp;quot;parent document&amp;quot;, a &amp;quot;child document&amp;quot;, and a &amp;quot;parent grant document&amp;quot;. It is not immediately clear what these three documents represent for a reissue patent. After some research, I think I have determined what these &amp;quot;documents&amp;quot; are, but please know that I do not have definitive proof, as there is little available information about these &amp;quot;documents&amp;quot; with regard to a reissue patent.&lt;br /&gt;
&lt;br /&gt;
It is possible that the &amp;quot;parent document&amp;quot; is the parent patent application - that is, the first patent application filed with regard to an invention. That would explain why reissue patents also have several fields related to a &amp;quot;child document&amp;quot; - a &amp;quot;child document&amp;quot; could be a child application, which is filed while a parent application is still pending. So in this case, a &amp;quot;child document&amp;quot; of a reissue patent would be an application regarding the same patent that was filed while the reissue application was still pending. A &amp;quot;child document&amp;quot; is either a continuation, divisional, or continuation-in-part application ([http://www.patenttrademarkblog.com/parent-and-child-patent-applications/ source]).&lt;br /&gt;
&lt;br /&gt;
Based on this logic, I think that the difference between a &amp;quot;parent document&amp;quot; and a &amp;quot;parent grant document&amp;quot; depends on whether the parent patent application has been granted. If the parent patent application is still pending (meaning the patent has not yet been granted), then I believe the reissue patent will store information about the parent patent under &amp;quot;parent document&amp;quot;. However, if the parent patent application has been granted, then the information will be stored under &amp;quot;parent grant document&amp;quot;. This seems like the most logical explanation, especially considering that the path to any field related to &amp;quot;parent grant document&amp;quot; contains &amp;quot;parent document&amp;quot;, as in the example below:&lt;br /&gt;
&lt;br /&gt;
 us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
&lt;br /&gt;
Therefore, a particular reissue patent will probably only have the fields filled for either the &amp;quot;parent document&amp;quot; or &amp;quot;parent grant document&amp;quot; if I am right about what they represent.&lt;br /&gt;
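&lt;br /&gt;
One way to test this guess against real files would be to read both sets of fields for each reissue patent and see which are filled. The Python sketch below does that with xml.etree.ElementTree on a tiny hand-built element tree. The tree and its values are invented for illustration, and the helpers build and text_at are hypothetical names, not part of Oliver's code.&lt;br /&gt;
&lt;br /&gt;
```python
import xml.etree.ElementTree as ET

# Path from the us-patent-grant root down to the relation node.
RELATION = "us-bibliographic-data-grant/us-related-documents/reissue/relation"


def build(root, path, text=None):
    """Create any missing elements along a slash-separated path."""
    node = root
    for tag in path.split("/"):
        child = node.find(tag)
        if child is None:
            child = ET.SubElement(node, tag)
        node = child
    node.text = text
    return node


def text_at(node, path):
    """Return the text at path below node, or None if the path is absent."""
    found = node.find(path)
    return found.text if found is not None else None


# Invented sample standing in for one parsed reissue patent.
root = ET.Element("us-patent-grant")
build(root, RELATION + "/parent-doc/document-id/doc-number", "11111111")
build(root, RELATION + "/parent-doc/parent-grant-document/document-id/doc-number", "2222222")
build(root, RELATION + "/parent-doc/parent-grant-document/document-id/date", "20100105")

parent_doc = root.find(RELATION + "/parent-doc")
reissue_row = {
    "parent_doc_number": text_at(parent_doc, "document-id/doc-number"),
    "parent_doc_date": text_at(parent_doc, "document-id/date"),
    "parent_grant_doc_number": text_at(parent_doc, "parent-grant-document/document-id/doc-number"),
    "parent_grant_doc_date": text_at(parent_doc, "parent-grant-document/document-id/date"),
}
```
&lt;br /&gt;
Counting how often both groups are filled for the same patent in the real data would show whether the either/or assumption holds.&lt;br /&gt;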
&lt;br /&gt;
=====Fields=====&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* parent_doc_status (varchar(255)) status of the parent application. Probably related to whether the patent application is pending or not&lt;br /&gt;
* parent_doc_number (int) probably application number for the parent document&lt;br /&gt;
* parent_doc_id (varchar(255)) it is unclear how this is different from parent document number. In the xpaths, the path will be something like ./parent_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent document&amp;quot; is - or may refer to its purpose (i.e. reissue, continuation, continuation-in-part, etc.)&lt;br /&gt;
* parent_doc_country (varchar(255)) the country of origin of the parent document&lt;br /&gt;
* parent_doc_date (date) probably the date the parent document was filed&lt;br /&gt;
* parent_grant_doc_number (int) probably application number for the parent grant document&lt;br /&gt;
* parent_grant_doc_id (varchar(255)) it is unclear how this is different from parent grant document number. In the xpaths, the path will be something like ./parent_doc/parent-grant-doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_grant_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent grant document&amp;quot; is, or may refer to its purpose (i.e. reissue, continuation, continuation-in-part, etc.)&lt;br /&gt;
* parent_grant_doc_country (varchar(255)) the country of origin of the parent grant document&lt;br /&gt;
* parent_grant_doc_date (date) probably the date the parent grant document was filed&lt;br /&gt;
* child_doc_number (varchar(255)) probably the application number of the child application&lt;br /&gt;
* child_doc_id (varchar(255)) it is unclear how this is different from child document number. In the xpaths, the path will be something like ./child_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* child_doc_country (varchar(255)) country of origin of the child application&lt;br /&gt;
&lt;br /&gt;
====DESIGN====&lt;br /&gt;
=====Fields=====&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* length_of_grant (int) length of grant, most likely in years&lt;br /&gt;
* hague_registration_date (date) filing date of international patent application&lt;br /&gt;
* hague_filing_date (date) not necessarily the same as the filing date of the international patent application, this is the date that the International Bureau receives all necessary elements for the international patent application&lt;br /&gt;
* hague_registration_pub_date (date) date that the International Bureau publishes the international patent application&lt;br /&gt;
* hague_international_registration_number (varchar(255)) international registration number&lt;br /&gt;
* edition (varchar(255)) possibly the edition of the Classification Locarno which determined main_classification&lt;br /&gt;
* main_classification (varchar(255)) classification for what type of design the patent is for&lt;br /&gt;
&lt;br /&gt;
==Connecting Patent database and Assignment database==&lt;br /&gt;
The answer to connecting the Patent database to the Assignment database lies somewhere in using the information in the table DOCUMENT_INFO to connect to a patent_id from the Patent database PATENT table for each assignment in the Assignment database ASSIGNMENT table. On further investigation into &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot;, the field they called &amp;quot;grant_doc_num&amp;quot; (grant_num above) in DOCUMENT_INFO is the patent number (the description of DOCUMENT_INFO above has been altered to reflect this).&lt;br /&gt;
&lt;br /&gt;
The paper mentions that there will be errors in the patent number stored in DOCUMENT_INFO under grant_doc_num, so a separate table called DOCUMENT_INFO_ADMIN was constructed to determine how prevalent the errors were. They queried the patent number for each appno_doc_num from administrative data. The &amp;quot;appno_doc_num&amp;quot; is the Application Document USPTO number from DOCUMENT_INFO. On further investigation into the paper, it became clear that &amp;quot;administrative data&amp;quot; refers to internal USPTO data that was available to the authors of the paper, but that we might not have. However, in 99% of cases the grant_doc_num and the queried patent number based on the appno_doc_num were the same, so we can probably rely on grant_doc_num.&lt;br /&gt;
&lt;br /&gt;
Therefore, since every table in the Patent database will have the patent number stored, to connect any table in Assignment to the Patent database, it would first be joined with DOCUMENT_INFO on rf_id and then joined with the appropriate table in Patent on patent number.&lt;br /&gt;
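&lt;br /&gt;
The two-step join described above might look like the following, sketched with Python's sqlite3 module on cut-down tables. The engine, the reduced column lists, and the sample row are assumptions. The column name grant_doc_num follows the paper, and the design above may call it grant_num instead.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE assignment (rf_id BIGINT PRIMARY KEY);
    CREATE TABLE document_info (rf_id BIGINT, grant_doc_num VARCHAR(255));
    CREATE TABLE patent (
        patent_number VARCHAR(255) PRIMARY KEY,
        title         VARCHAR(255)
    );
    """
)

# One made-up assignment that points at one made-up patent.
conn.execute("INSERT INTO assignment VALUES (?)", (100,))
conn.execute("INSERT INTO document_info VALUES (?, ?)", (100, "0000001"))
conn.execute("INSERT INTO patent VALUES (?, ?)", ("0000001", "Example title"))

# Step 1: assignment to document_info on rf_id.
# Step 2: document_info to patent on the patent number.
linked = conn.execute(
    """
    SELECT a.rf_id, p.patent_number, p.title
    FROM assignment AS a
    JOIN document_info AS d ON d.rf_id = a.rf_id
    JOIN patent AS p ON p.patent_number = d.grant_doc_num
    """
).fetchall()
```
&lt;br /&gt;
Any other Assignment table could take the place of assignment in the first join, since every table in that database carries rf_id.&lt;br /&gt;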
&lt;br /&gt;
Also in the paper, they mentioned the Assignments on the Web for Patents (AOTW-P), a searchable database of individual USPTO assignment records keyed on reel-frame identification, patent number, and assignor or assignee name (https://assignment.uspto.gov/patent/index.html#/patent/search). Obviously it would not be possible to individually use this tool to query all the patent numbers, but if it were possible to write a script that queries each patent number using the rf_id and parses the response, this could be useful for checking the patent numbers, though it might not be any more accurate than what will already be in DOCUMENT_INFO.&lt;br /&gt;
&lt;br /&gt;
==Finding New Fields Unique to Plant, Reissue, and/or Design Patents==&lt;br /&gt;
&lt;br /&gt;
Based on Oliver's script, which searched all xpaths and compared which were unique to particular types, we see that the following attributes are unique to each type of patent other than utility patents. These attributes vary by the XML version, which changed over time. Therefore, the lists below are a superset of the attributes that are unique to each of the patent types listed below across all XML versions.&lt;br /&gt;
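&lt;br /&gt;
The comparison can be pictured as a set difference per XML version, followed by a union across versions. The Python sketch below is not Oliver's script, only a guess at the idea. The function names and the sample paths are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def unique_paths_by_type(paths_by_type):
    """Map each patent type to the xpaths that no other type uses."""
    unique = {}
    for patent_type, paths in paths_by_type.items():
        others = set()
        for other_type, other_paths in paths_by_type.items():
            if other_type != patent_type:
                others |= set(other_paths)
        unique[patent_type] = set(paths) - others
    return unique


def superset_across_versions(unique_by_version):
    """Union the per-version unique paths for each patent type."""
    merged = {}
    for unique in unique_by_version.values():
        for patent_type, paths in unique.items():
            merged.setdefault(patent_type, set()).update(paths)
    return merged


# Made-up observations: xpaths seen per patent type in two XML versions.
observed = {
    "4.4": {
        "plant": {"us-botanic/latin-name", "invention-title"},
        "utility": {"invention-title"},
    },
    "4.5": {
        "plant": {"us-claim-statement", "invention-title"},
        "utility": {"invention-title"},
    },
}
per_version = {v: unique_paths_by_type(p) for v, p in observed.items()}
plant_superset = superset_across_versions(per_version)["plant"]
```
&lt;br /&gt;
A path that is unique in one version but shared in another still ends up in the superset under this approach, which is worth keeping in mind when reading the lists below.&lt;br /&gt;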
&lt;br /&gt;
When I am done finding the supersets, we will determine how to integrate these attributes into the design for the patent table.&lt;br /&gt;
&lt;br /&gt;
'''There appears to be a fifth patent type in the results from Oliver's script, and there are attributes that are unique to Utility patents, though they do not appear particularly useful'''&lt;br /&gt;
&lt;br /&gt;
===Plant Patents===&lt;br /&gt;
* US Claim Statement (XML45) - unclear what this represents&lt;br /&gt;
* Latin Name (XML44)&lt;br /&gt;
* US Botanical Variety (XML44)&lt;br /&gt;
&lt;br /&gt;
===Reissue Patents===&lt;br /&gt;
* Parent Status (XML45)&lt;br /&gt;
* Parent Document (XML44) - seems to be either a normal document or a parent grant document (or both), both contain the following fields&lt;br /&gt;
** Date&lt;br /&gt;
** Document number&lt;br /&gt;
** Country &lt;br /&gt;
** Document ID&lt;br /&gt;
** Kind (XML42)&lt;br /&gt;
* Child Document (XML44)&lt;br /&gt;
** Document ID&lt;br /&gt;
** Document Number&lt;br /&gt;
** Country (XML43)&lt;br /&gt;
&lt;br /&gt;
Note about fields related to the parent document and/or parent grant document:&lt;br /&gt;
The structure of the above fields related to a reissue patent is very odd and varies by XML version. For example, for most of the XML versions, parent grant document falls under parent document and has its own document number, date, country, document ID, kind, etc. The parent doc field falls under relation, which falls under reissue.&lt;br /&gt;
&lt;br /&gt;
However, in XML41, parent document falls under relation and a new category - US Reexamination Reissue Merger. The information pertaining to the child document also falls under both relation and US Reexamination Reissue Merger. It's possible that during the time period this XML version represents, there were two types of reissue patents - a reexamination related to a merger and a standard reissue. With discrepancies like these, we'll have to determine in the end whether we want to stick with the fields as they are designed above (in which case the information about a parent document or parent grant document is stored the same way whether it is a reissue or a Reexamination Reissue), or whether these two classifications are different enough to warrant duplicate fields in the table.&lt;br /&gt;
&lt;br /&gt;
===Design Patents===&lt;br /&gt;
* length of grant (XML45)&lt;br /&gt;
* Hague Agreement Data (XML45) - allows people to file design patents in 66 countries with one application&lt;br /&gt;
** International Registration Date&lt;br /&gt;
** International Filing Date&lt;br /&gt;
** International Registration Publication Date&lt;br /&gt;
** International Registration Number&lt;br /&gt;
* Classification Locarno (XML40)&lt;br /&gt;
** Edition&lt;br /&gt;
** Main Classification&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please note that the lists above are meant to represent unique types that should be added to the patent database. This does not mean that every XML version contains all these fields, or that every version contains the same path to these fields. That will have to be determined separately. The XML number next to some of the items is meant to represent the latest XML version where the field was seen.&lt;br /&gt;
&lt;br /&gt;
Though I was mainly looking at unique fields in plant, reissue, and design patents, I noticed the following fields on unique paths for utility patents in XML41:&lt;br /&gt;
&lt;br /&gt;
* US Related Documents&lt;br /&gt;
** Substitution&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Parent Document&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Country&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
***** Parent Status&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Country&lt;br /&gt;
** Continuation/Continuation in Part&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Doc&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
** Division&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Kind&lt;br /&gt;
&lt;br /&gt;
This appears to resemble the information about reissue patents. Additionally, a couple of fields appear under the information about bibliographic data that might be useful:&lt;br /&gt;
* US Deceased Inventor (XML41)&lt;br /&gt;
** Post Code&lt;br /&gt;
** Address&lt;br /&gt;
* US Provisional Application Status (XML41)&lt;br /&gt;
&lt;br /&gt;
==Paths for the New Fields Related to Plant, Reissue, and/or Design Patents==&lt;br /&gt;
&lt;br /&gt;
===Plant Patents===&lt;br /&gt;
XML 4.4, 4.3, 4.1, and 4.0&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-botanic&lt;br /&gt;
fields: latin-name, variety&lt;br /&gt;
&lt;br /&gt;
XML 4.5&lt;br /&gt;
parent node: us-patent-grant/us-claim-statement/&lt;br /&gt;
field: i&lt;br /&gt;
&lt;br /&gt;
XML 4.2 &lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-botanic&lt;br /&gt;
fields: latin-name, variety&lt;br /&gt;
&lt;br /&gt;
parent node: us-patent-grant/us-claim-statement/&lt;br /&gt;
field: i &lt;br /&gt;
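&lt;br /&gt;
Because the plant fields sit under different parent nodes depending on the XML version, a parser can simply try both locations and keep whatever it finds. The Python sketch below uses xml.etree.ElementTree on a hand-built element. The function name plant_fields and the sample values are hypothetical, and the mapping of latin-name and variety onto latin_name and us_botanical_variety follows the PLANT table design above.&lt;br /&gt;
&lt;br /&gt;
```python
import xml.etree.ElementTree as ET

# Paths are relative to the us-patent-grant root element.
BOTANIC = "us-bibliographic-data-grant/us-botanic"
CLAIM_STATEMENT = "us-claim-statement"


def plant_fields(grant):
    """Collect plant-specific values from a us-patent-grant element."""
    botanic = grant.find(BOTANIC)
    statement = grant.find(CLAIM_STATEMENT)
    row = {"latin_name": None, "us_botanical_variety": None, "claim_statement": None}
    if botanic is not None:
        row["latin_name"] = botanic.findtext("latin-name")
        row["us_botanical_variety"] = botanic.findtext("variety")
    if statement is not None:
        # Join the text of any nested elements (such as i) into one string.
        row["claim_statement"] = "".join(statement.itertext()).strip()
    return row


# Invented sample shaped like the XML 4.4 layout (no claim statement).
grant = ET.Element("us-patent-grant")
biblio = ET.SubElement(grant, "us-bibliographic-data-grant")
botanic = ET.SubElement(biblio, "us-botanic")
ET.SubElement(botanic, "latin-name").text = "Rosa hybrida"
ET.SubElement(botanic, "variety").text = "Example Variety"

plant_row = plant_fields(grant)
```
&lt;br /&gt;
Missing nodes come back as None, so the same function can be run over every XML version without branching on the version number.&lt;br /&gt;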
&lt;br /&gt;
===Reissue Patents===&lt;br /&gt;
&lt;br /&gt;
XML 4.5&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/parent-doc/&lt;br /&gt;
fields: parent-status&lt;br /&gt;
&lt;br /&gt;
XML 4.4&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/continuing-reissue/relation/&lt;br /&gt;
fields: parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
	parent-doc/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/country&lt;br /&gt;
	child-doc/document-id/country&lt;br /&gt;
	child-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/doc-number&lt;br /&gt;
	parent-doc/document-id/date&lt;br /&gt;
(everything in XML 4.3 except parent-status)	&lt;br /&gt;
&lt;br /&gt;
XML 4.3&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/&lt;br /&gt;
fields: child-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
	parent-doc/parent-status&lt;br /&gt;
	parent-doc/document-id/date&lt;br /&gt;
	parent-doc/document-id/country&lt;br /&gt;
	parent-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/country&lt;br /&gt;
	child-doc/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/doc-number	&lt;br /&gt;
	&lt;br /&gt;
XML 4.1&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/&lt;br /&gt;
fields:&lt;br /&gt;
	child-doc/document-id/doc-number&lt;br /&gt;
	child-doc/document-id/country&lt;br /&gt;
	parent-doc/parent-status&lt;br /&gt;
	parent-doc/document-id/kind&lt;br /&gt;
	parent-doc/document-id/country&lt;br /&gt;
	parent-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/document-id/date&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/kind&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/doc-number&lt;br /&gt;
&lt;br /&gt;
other parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/us-reexamination-reissue-merger/relation/&lt;br /&gt;
fields: &lt;br /&gt;
	child-doc/document-id/doc-number&lt;br /&gt;
	child-doc/document-id/country&lt;br /&gt;
	parent-doc/document-id/date&lt;br /&gt;
	parent-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/kind&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/doc-number&lt;br /&gt;
&lt;br /&gt;
XML 4.0 and XML 4.2&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/&lt;br /&gt;
fields: parent-doc/document-id/kind&lt;br /&gt;
	parent-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/kind&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/country&lt;br /&gt;
	child-doc/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/doc-number&lt;br /&gt;
	child-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
	parent-doc/parent-status&lt;br /&gt;
	parent-doc/document-id/country&lt;br /&gt;
	parent-doc/document-id/date&lt;br /&gt;
&lt;br /&gt;
===Design Patents===&lt;br /&gt;
XML 4.5&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/&lt;br /&gt;
fields: hague-agreement-data/international-registration-date/date&lt;br /&gt;
	hague-agreement-data/international-registration-publication-date/date&lt;br /&gt;
	us-term-of-grant/length-of-grant&lt;br /&gt;
	hague-agreement-data/international-registration-number&lt;br /&gt;
	hague-agreement-data/international-filing-date/date&lt;br /&gt;
&lt;br /&gt;
XML 4.1, 4.3, and 4.4&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-term-of-grant/&lt;br /&gt;
fields: length-of-grant&lt;br /&gt;
&lt;br /&gt;
XML 4.0&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/&lt;br /&gt;
fields: us-term-of-grant/length-of-grant&lt;br /&gt;
	classification-locarno/edition&lt;br /&gt;
	classification-locarno/main-classification&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=21976</id>
		<title>Redesign Assignment and Patent Database</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=21976"/>
		<updated>2017-11-17T22:56:10Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* Finding New Paths Unique to Plant, Reissue, and/or Design Patents */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Redesign Assignment and Patent Database&lt;br /&gt;
|Has owner=Shelby Bice,&lt;br /&gt;
|Has start date=9/2017&lt;br /&gt;
|Has keywords=patent&lt;br /&gt;
|Is dependent on=Reproducible Patent Data,&lt;br /&gt;
}}&lt;br /&gt;
'''FOR ED:''' I finished going through the unique paths (you can see the unique fields I identified at the bottom of this page, under &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot;). Then, under my design for the Patent Database, there's a new section called &amp;quot;Unique Attributes Tables&amp;quot; where I've begun detailing a design for three new tables that would include information that is unique to Design, Reissue, and Plant patents. Here are some questions I have:&lt;br /&gt;
&lt;br /&gt;
'''1.''' Do you like the design for the new tables? How do you think we should populate these tables? I can think of one way (putting all the information in the main patent tables and then moving it out to Plant, Reissue, or Design as needed) but I was wondering if there might be an easier way.&lt;br /&gt;
&lt;br /&gt;
'''2.''' There were some unique attributes to utility patents that I also included in &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot;. They only appear to be in two of the oldest XML versions (4.1 and 4.0) and I'm not sure the information is particularly useful - but please take a look at the field names and let me know if you disagree and think they are useful and should be included in a table in the patent database.&lt;br /&gt;
&lt;br /&gt;
'''3.''' Do you know what some of these fields represent? For example, I've been trying to research online what a &amp;quot;Parent Document&amp;quot; and/or &amp;quot;Parent Grant Document&amp;quot; might represent for a reissue patent, and I've found a couple possible options. I will keep researching, but if you know what they represent or one of my descriptions for a field is wrong, please let me know.&lt;br /&gt;
&lt;br /&gt;
This is an extension of the work I did last semester under &amp;quot;Redesigning Patent Database&amp;quot;. Instead of simply reconfiguring the existing database, this project encompasses a full redesign and creation of a new Patent database and a new Assignment database that will be joined together.&lt;br /&gt;
&lt;br /&gt;
Reference for definitions of fields in the two databases: https://www.oecd.org/sti/sci-tech/37569498.pdf&lt;br /&gt;
&lt;br /&gt;
Paper on how to extract data from XML files, parse it, and put it into a database: https://funginstitute.berkeley.edu/wp-content/uploads/2014/06/patentprocessor.pdf&lt;br /&gt;
&lt;br /&gt;
'''Particularly useful, because it provides solutions for cleaning location and firm data'''&lt;br /&gt;
&lt;br /&gt;
==ER Diagram for Assignment and Patent Databases==&lt;br /&gt;
&lt;br /&gt;
Note: this is currently out of sync with the current description below for the patent database. It will be updated once I have finalized how we will organize unique attributes for plant, design, and reissue patents.&lt;br /&gt;
&lt;br /&gt;
[[File:PatentAndAssignmentER2.png]]&lt;br /&gt;
&lt;br /&gt;
Attributes for each table are listed below with descriptions - because of how many attributes there are, I decided the ER diagram would be better suited as an overview of the tables rather than trying to show all the attributes on the diagram.&lt;br /&gt;
&lt;br /&gt;
==Assignment Database Structure==&lt;br /&gt;
&lt;br /&gt;
After reading through the paper &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot; (https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf) and taking into account our needs for the Assignment database (the ability to connect the two databases together, the ability to trace conveyance of a patent over time), I decided on the following structure for the Assignment database.&lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table in the database is rf_id, which stands for reel frame id. It is unique for each entry in the main table, assignment, and is included in every table.&lt;br /&gt;
&lt;br /&gt;
Conveyance is the only table that cannot be fully constructed as data is inserted into the database. Some of its fields depend on other tables being constructed, so it will either be partially populated while the other tables are being populated or be the last table to be populated.&lt;br /&gt;
&lt;br /&gt;
Full list of tables: assignee, assignor, document_info, assignment, conveyance&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: store all the assignees from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the recipient of a patent in an entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id, unless it is possible to have multiple assignees, then rf_id and assignee name&lt;br /&gt;
&lt;br /&gt;
All columns (variables) in the table: &lt;br /&gt;
&lt;br /&gt;
*rf_id (bigint) reel frame number &lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* address_line_1 (varchar(255)) first line of address of assignee&lt;br /&gt;
* address_line_2 (varchar(255)) second line of address of assignee&lt;br /&gt;
* city (varchar(255)) city of assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* country (varchar(255)) country of assignee&lt;br /&gt;
* postal_code (varchar(255)) post code of the assignee&lt;br /&gt;
&lt;br /&gt;
===ASSIGNOR===&lt;br /&gt;
	&lt;br /&gt;
Purpose: store all the assignors from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the one assigning/granting the license for the entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id and assignor_name&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
&lt;br /&gt;
===CONVEYANCE===&lt;br /&gt;
Purpose: represent the conveyance of a particular patent over time&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual transaction for a patent (not unlike assignment, except this table has information about the assignment in relation to other assignments for the patent)&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id &lt;br /&gt;
&lt;br /&gt;
All columns (variables): Two possibilities, depending on whether we want data to be repeated or not.&lt;br /&gt;
	&lt;br /&gt;
Not-repeated (with exception of primary/foreign key): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* next_transaction (bigint) rf_id of next transaction that occurred after this transaction (traced by assignee/assignor)&lt;br /&gt;
* num_conveyance (int) number in order of conveyance by date&lt;br /&gt;
* acknowledgment_date (date) date that the assignor acknowledged the transaction&lt;br /&gt;
* execution_date (date) date that the transaction took place&lt;br /&gt;
* conveyance_type (varchar(255)) the USPTO paper determined a couple of different conveyance types and also outlined how to determine based on the conveyance text in the xml file which conveyance type is relevant to the transaction. The conveyance types are:&lt;br /&gt;
** assignment&lt;br /&gt;
** merger&lt;br /&gt;
** change of name&lt;br /&gt;
** government interest&lt;br /&gt;
** agreement&lt;br /&gt;
** security agreement&lt;br /&gt;
** release&lt;br /&gt;
&lt;br /&gt;
	&lt;br /&gt;
Repeated, also include: &lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
* patent_number (varchar(255)) identifying number for the patent&lt;br /&gt;
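&lt;br /&gt;
A minimal sketch of how num_conveyance and next_transaction might be filled in once the other tables are populated. It is only an illustration under simplifying assumptions: each rf_id is taken to cover a single patent, and the chain is ordered by execution_date rather than traced through assignee/assignor names. The function name build_conveyance and the sample values are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict
from datetime import date


def build_conveyance(transactions):
    """Derive num_conveyance and next_transaction per rf_id.

    transactions: iterable of (rf_id, patent_number, execution_date).
    Assumption: one patent per rf_id, chain ordered by execution_date.
    """
    by_patent = defaultdict(list)
    for rf_id, patent_number, execution_date in transactions:
        by_patent[patent_number].append((execution_date, rf_id))

    rows = {}
    for chain in by_patent.values():
        chain.sort()  # oldest transaction first
        for position, (_, rf_id) in enumerate(chain):
            is_last = position == len(chain) - 1
            next_rf_id = None if is_last else chain[position + 1][1]
            rows[rf_id] = {
                "num_conveyance": position + 1,
                "next_transaction": next_rf_id,
            }
    return rows


# Hypothetical sample data: three transactions on one patent, one on another.
sample = [
    (30, "7000001", date(2012, 5, 1)),
    (10, "7000001", date(2005, 3, 2)),
    (20, "7000001", date(2009, 1, 15)),
    (40, "D500000", date(2010, 1, 1)),
]
conveyance_rows = build_conveyance(sample)
```
&lt;br /&gt;
Because the function needs every transaction for a patent before it can number them, it matches the note above that CONVEYANCE is populated last.&lt;br /&gt;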
&lt;br /&gt;
===DOCUMENT_INFO===&lt;br /&gt;
Purpose: store extra information relevant to the patents represented in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the documentation and the patent for a particular entry in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables) when created: &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* app_num (varchar(255)) application document USPTO number &lt;br /&gt;
* app_date (date) date that the application was filed&lt;br /&gt;
* app_country (varchar(255)) country in which it was filed&lt;br /&gt;
* pgpub_num (varchar(255)) pre-grant publication document USPTO number&lt;br /&gt;
* pgpub_date (date) date that pre-grant publication was released&lt;br /&gt;
* pgpub_country (varchar(255)) country in which the patent is published before being granted a patent number&lt;br /&gt;
* patent_number (varchar(255)) granted patent document USPTO number&lt;br /&gt;
* grant_date (date) date that the patent is officially published&lt;br /&gt;
* grant_country (varchar(255)) country in which the patent is published after being granted a patent number&lt;br /&gt;
* invention_title (varchar(255)) title of the invention&lt;br /&gt;
* language (varchar(255)) language of invention title – could be potentially useful and interesting to investigate&lt;br /&gt;
* reel_num (bigint) number of the reel the assignment was stored on&lt;br /&gt;
* frame_num (bigint) number of the frame on the reel the assignment was stored on&lt;br /&gt;
&lt;br /&gt;
===ASSIGNMENT===&lt;br /&gt;
Purpose: represent assignment transactions – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent assignment transaction&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignment_id (bigint) US patent assignment id &lt;br /&gt;
* correspondent_name (varchar(255)) name of correspondent for the transaction, could be a lawyer&lt;br /&gt;
* correspondent_address1 (varchar(255)) first line of address for correspondent&lt;br /&gt;
* correspondent_address2 (varchar(255)) second line of address for correspondent&lt;br /&gt;
* correspondent_address3 (varchar(255)) third line of address for correspondent&lt;br /&gt;
* correspondent_address4 (varchar(255)) fourth line of address for correspondent&lt;br /&gt;
* record_date (date) date recorded with USPTO&lt;br /&gt;
* last_update_date (date) date that information for this assignment was last updated with the USPTO&lt;br /&gt;
* page_cnt (bigint) page count of assignment record&lt;br /&gt;
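&lt;br /&gt;
A minimal sketch of how rf_id ties the tables together, using Python's built-in sqlite3 module with an in-memory database. Only two of the tables and a few of their columns are shown; the DDL, the helper name build_assignment_db, and the sample rows are assumptions for illustration, not the project's actual code.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3


def build_assignment_db():
    """Create a cut-down ASSIGNMENT and ASSIGNOR pair keyed on rf_id."""
    conn = sqlite3.connect(":memory:")
    # Ask SQLite to enforce the foreign key reference on this connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE assignment (
            rf_id BIGINT PRIMARY KEY,
            record_date DATE,
            page_cnt BIGINT
        );
        CREATE TABLE assignor (
            rf_id BIGINT REFERENCES assignment(rf_id),
            assignor_name VARCHAR(255),
            PRIMARY KEY (rf_id, assignor_name)
        );
    """)
    # Hypothetical rows: one assignment with two assignors.
    conn.execute("INSERT INTO assignment VALUES (1, '2017-01-05', 3)")
    conn.executemany(
        "INSERT INTO assignor VALUES (?, ?)",
        [(1, "Alice Smith"), (1, "Bob Jones")],
    )
    conn.commit()
    return conn


assignment_conn = build_assignment_db()
assignor_count = assignment_conn.execute(
    "SELECT COUNT(*) FROM assignor WHERE rf_id = 1"
).fetchone()[0]
```
&lt;br /&gt;
The composite key on ASSIGNOR allows several assignors per rf_id while still rejecting an exact duplicate row.&lt;br /&gt;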
&lt;br /&gt;
More extensive notes exist under :E/McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper&lt;br /&gt;
&lt;br /&gt;
==Patent Database Structure==&lt;br /&gt;
&lt;br /&gt;
Most of this design is based off what I investigated last semester and therefore is subject to change, since the design for the Assignment database is somewhat new. &lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table is patent_number, which uniquely identifies every patent in the PATENT table and is included in every table.&lt;br /&gt;
&lt;br /&gt;
Full list of tables: patent, assignee, citation, fee, claims, histpatent, lawyers, inventors&lt;br /&gt;
&lt;br /&gt;
Update: These tables, as of October 26, are being updated and changed as I determine what data can actually be retrieved from the XML files based on the xpaths that we have determined (you can see these xpaths on the &amp;quot;Equivalent XPath and APS Queries&amp;quot; project page).&lt;br /&gt;
&lt;br /&gt;
===PATENT===&lt;br /&gt;
Purpose: represent patents – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* patent_type (varchar(255)) the type of patent, like utility&lt;br /&gt;
* patent_kind (varchar(2)) the kind of patent, a letter and a number&lt;br /&gt;
* title (varchar(255)) the title of the patent&lt;br /&gt;
* grantdate (date) date that patent was officially granted including the year&lt;br /&gt;
* prioritydate (date) date used to establish novelty of an invention in regard to other inventions (source: http://www.bios.net/daisy/patentlens/2343.html)&lt;br /&gt;
* prioritycountry (varchar(255)) country where patent is first filed&lt;br /&gt;
* patent_country (varchar(255)) country in which this patent was published&lt;br /&gt;
* prioritypatentnumber (varchar(255)) placeholder patent number if the patent is published before it is fully approved - also referred to as priority claims number&lt;br /&gt;
* cpcsubclass (varchar(255)) cooperative patent classification subclass&lt;br /&gt;
* cpcsubgroup (varchar(255)) cooperative patent classification subgroup&lt;br /&gt;
* cpcmaingroup (varchar(255)) cooperative patent classification main group&lt;br /&gt;
* cpctotal (varchar(255)) concatenated version of cpcmaingroup, cpcsubgroup, cpcsubclass&lt;br /&gt;
* pctpatentnumber (varchar(255)) international patent number (according to Patent Cooperation Treaty) also known as PCT Document Number&lt;br /&gt;
* appnum(varchar(255)) application number, probably also the filing number - the format is a two digit series code followed by a six digit serial number (source: https://www.uspto.gov/patents-application-process/filing-online/info-application-number)&lt;br /&gt;
* appdate (date) date that the application was filed, including year - probably also the filing date&lt;br /&gt;
* ipcrsubclass (varchar(255)) International patent classification subclass&lt;br /&gt;
* ipcrmaingroup (varchar(255)) International patent classification main group&lt;br /&gt;
* ipcrsubgroup (varchar(255)) International patent classification subgroup&lt;br /&gt;
* ipcrtotal (varchar(255)) concatenated version of ipcrsubgroup, ipcrsubclass, ipcrmaingroup&lt;br /&gt;
* national_classification (varchar(255)) - probably uspc, which is United States Patent Classification&lt;br /&gt;
* national_classification_country (varchar(255)) - should be United States&lt;br /&gt;
* num_claims (int) number of claims (things that the inventor wishes to protect with this patent)&lt;br /&gt;
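&lt;br /&gt;
As an illustration of the appnum format described above (a two digit series code followed by a six digit serial number), here is a small sketch that splits an application number into its two parts. The function name split_appnum and the optional slash separator are assumptions; the real XML values may be formatted differently.&lt;br /&gt;
&lt;br /&gt;
```python
import re

# Two digit series code, optional slash, six digit serial number.
APPNUM_PATTERN = re.compile(r"(\d{2})/?(\d{6})")


def split_appnum(appnum):
    """Return (series_code, serial_number), or None if the format differs."""
    match = APPNUM_PATTERN.fullmatch(appnum.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)


with_slash = split_appnum("12/345678")
without_slash = split_appnum("12345678")
malformed = split_appnum("1234")
```
&lt;br /&gt;
Returning None for anything that does not match makes it easy to count how many records deviate from the documented format before deciding on a column type.&lt;br /&gt;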
&lt;br /&gt;
Discrepancies:&lt;br /&gt;
These fields were previously a part of the table, but we do not currently have xpaths for them:&lt;br /&gt;
* app_year - the year the application was filed&lt;br /&gt;
* grant_year - the year the patent was granted&lt;br /&gt;
* uspc - United States Patent Classification (might be the same as national_classification)&lt;br /&gt;
* uspcsub - United States Patent subgroup&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: Represent the person or organization to which the patent was assigned at publication, according to the patent information (the information in the Assignment database is more complete)&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the assignee for a patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number, first_name, last_name (in case it is possible for there to be multiple assignees for an individual patent)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* first_name (varchar(255)) first name of assignee&lt;br /&gt;
* last_name (varchar(255)) last name of assignee&lt;br /&gt;
* address (varchar(255)) address of assignee&lt;br /&gt;
* postcode (varchar(255)) postcode of assignee&lt;br /&gt;
* orgname (varchar(255)) name of the organization owning the patent (if applicable - organization may go by multiple orgnames) &lt;br /&gt;
* city (varchar(255)) city of the assignee&lt;br /&gt;
* country (varchar(255)) country of the assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* residence (varchar(255)) mailing address of assignee if they receive mail at a different address than the one listed above&lt;br /&gt;
* size_of_firm (varchar(255)) size of the organization (large, small, or micro) - determines how large a fee they pay for the patent&lt;br /&gt;
&lt;br /&gt;
===CITATION===&lt;br /&gt;
Purpose: Represent all the patents in PATENT that another patent in PATENT cites&lt;br /&gt;
&lt;br /&gt;
Each entry represents: one citation that a patent in PATENT makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number AND cited_patent_number (because a patent might cite multiple patents, so both variables are needed for the primary key)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* cited_patent_number (varchar(255)) unique identifier from the USPTO for the patent being cited&lt;br /&gt;
* description (varchar(255)) description of the citation&lt;br /&gt;
* citation_date (date) unclear what this date is supposed to represent from the DTD, but according to the paper that is the second link in the top section, it is the &amp;quot;date of cited document&amp;quot;&lt;br /&gt;
* citation_kind (varchar(255)) type of cited document&lt;br /&gt;
* citation_country (varchar(255)) country of origin for the cited patent&lt;br /&gt;
* citation_name (varchar(255)) appears to be the names of those to whom the cited patent is assigned&lt;br /&gt;
&lt;br /&gt;
Note: There are other fields available for this table, such as document number, but this seems like it might be a repeat of the document number for the citing patent&lt;br /&gt;
&lt;br /&gt;
===FEE===&lt;br /&gt;
Purpose: Represent information about the fees paid for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the fees paid on the patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* four (boolean) payment of maintenance fee 4th year&lt;br /&gt;
* eight (boolean) payment of maintenance fee 8th year&lt;br /&gt;
* twelve (boolean) payment of maintenance fee 12th year&lt;br /&gt;
* fee_date (date) date that fee is paid on&lt;br /&gt;
* fee_code (varchar(255)) code for type of fee to pay&lt;br /&gt;
&lt;br /&gt;
===CLAIMS===&lt;br /&gt;
&lt;br /&gt;
Purpose: Represent information about the claims for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a claim that a patent makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number and claim_id&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* claim_id (varchar(255)) id number for the claim&lt;br /&gt;
* language (varchar(255)) language of claim&lt;br /&gt;
* claim_type (varchar(255)) type of claim&lt;br /&gt;
* status (varchar(255)) status of the claim&lt;br /&gt;
* claim_text (text) the text of the claim&lt;br /&gt;
&lt;br /&gt;
===HISTPATENT===&lt;br /&gt;
Purpose: Represent historical data about the patent pertaining to dates and publication&lt;br /&gt;
&lt;br /&gt;
Each entry represents: sequence of historical data about an individual patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* pubno (varchar(255)) patent publication number&lt;br /&gt;
* pubdate (date) date patent was issued&lt;br /&gt;
* dispdate (date) disposal date for the application&lt;br /&gt;
* distype (varchar(3)) disposal type, i.e. the application status of the patent - ABN for abandoned, ISS for issued, or PEN for pending&lt;br /&gt;
* exp_dt (date) expiration date of patent&lt;br /&gt;
* pta (int) length of patent term adjustment (which extends the amount of time the patent is allowed to be in force) in number of days&lt;br /&gt;
&lt;br /&gt;
===INVENTORS===&lt;br /&gt;
Purpose: Represent the inventors for each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about an inventor for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (and name, if there can be multiple inventors for a single patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identify the inventor?&lt;br /&gt;
* name (varchar(255)) concatenated string of the first and last name of the inventor&lt;br /&gt;
* orgname (varchar(255)) name of the organization the inventor works for&lt;br /&gt;
* state (varchar(255)) - state in which the inventor resided&lt;br /&gt;
* address (varchar(255)) - address at which the inventor resided&lt;br /&gt;
* country (varchar(255)) country in which the inventor resided&lt;br /&gt;
* city (varchar(255)) city in which the inventor resided&lt;br /&gt;
&lt;br /&gt;
===LAWYERS===&lt;br /&gt;
Purpose: Represent the lawyers who worked on each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a lawyer for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (there doesn't appear to be more than one lawyer entry for each patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identify the lawyer?&lt;br /&gt;
* org_name (varchar(255)) the name of the firm that did arbitration for the patent&lt;br /&gt;
&lt;br /&gt;
===Unique Attributes Tables===&lt;br /&gt;
&lt;br /&gt;
Several attributes have been identified that are unique, applying to only one kind of patent. Adding all these fields to the main patent table would introduce a lot of NULL entries, because three out of four patent types will not have any particular attribute. Therefore, I have created three new tables to store these unique attributes - one for each type of patent with unique attributes.&lt;br /&gt;
&lt;br /&gt;
For each table, the primary key is the patent_number, which connects a row to a row in the main patent table. &lt;br /&gt;
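&lt;br /&gt;
A minimal sketch of this layout, again using Python's built-in sqlite3 module. It keeps the plant-only fields in their own table and pulls them in with a LEFT JOIN, so a utility patent simply has no matching row instead of a set of NULL columns stored in PATENT. The DDL and the sample rows are assumptions for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patent (
        patent_number VARCHAR(255) PRIMARY KEY,
        patent_type VARCHAR(255)
    );
    CREATE TABLE plant (
        patent_number VARCHAR(255) PRIMARY KEY
            REFERENCES patent(patent_number),
        latin_name VARCHAR(255),
        us_botanical_variety VARCHAR(255)
    );
    -- Hypothetical sample rows
    INSERT INTO patent VALUES ('PP00001', 'plant'), ('7000001', 'utility');
    INSERT INTO plant VALUES ('PP00001', 'Rosa hybrida', 'Example Variety');
""")

# Every patent appears once; plant columns are NULL for non-plant patents.
patents_with_plant_fields = conn.execute("""
    SELECT p.patent_number, p.patent_type, pl.latin_name
    FROM patent AS p
    LEFT JOIN plant AS pl ON pl.patent_number = p.patent_number
    ORDER BY p.patent_number
""").fetchall()
```
&lt;br /&gt;
The REISSUE and DESIGN tables would follow the same pattern, each keyed on patent_number.&lt;br /&gt;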
&lt;br /&gt;
TODO: Figure out best way to generate these tables, finish description of attributes for each table.&lt;br /&gt;
&lt;br /&gt;
* Note that I did include some unique attributes for Utility patents, as seen below in the &amp;quot;Finding New Fields Unique to Plant, Reissue, and/or Design Patents&amp;quot; section, but for now (11/3/2017) I'm not going to create a separate table for those fields. Most of them seem to be repeated in Reissue, and the other fields, while interesting, are perhaps not very useful.&lt;br /&gt;
&lt;br /&gt;
====PLANT====&lt;br /&gt;
=====Fields=====&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* latin_name (varchar(255)) Latin name for the plant&lt;br /&gt;
* us_botanical_variety (varchar(255)) denotes what variety of plant it is - for example, a rose has several different varieties&lt;br /&gt;
&lt;br /&gt;
====REISSUE====&lt;br /&gt;
&lt;br /&gt;
In the following table, you'll notice lots of fields related to three different kinds of &amp;quot;documents&amp;quot; - a &amp;quot;parent document&amp;quot;, a &amp;quot;child document&amp;quot;, and a &amp;quot;parent grant document&amp;quot;. It is not immediately clear what these three documents represent for a reissue patent. After some research, I think I have determined what these &amp;quot;documents&amp;quot; are, but please know that I do not have definitive proof, as there is little available information about these &amp;quot;documents&amp;quot; in regard to a reissue patent.&lt;br /&gt;
&lt;br /&gt;
It is possible that the &amp;quot;parent document&amp;quot; is the parent patent application - that is, the first patent application filed in regard to an invention. That would explain why reissue patents also have several fields related to a &amp;quot;child document&amp;quot; - a &amp;quot;child document&amp;quot; could be a child application, which is filed while a parent application is still pending. So in this case, a &amp;quot;child document&amp;quot; of a reissue patent would be an application regarding the same patent that was filed while the reissue application was still pending. A &amp;quot;child document&amp;quot; is either a continuation, divisional, or continuation-in-part application ([http://www.patenttrademarkblog.com/parent-and-child-patent-applications/ source]).&lt;br /&gt;
&lt;br /&gt;
Based on this logic, I think that the difference between a &amp;quot;parent document&amp;quot; and a &amp;quot;parent grant document&amp;quot; depends on whether the parent patent application has been granted. If the parent patent application is still pending (meaning the patent has not yet been granted), then I believe the reissue patent will store information about the parent patent under &amp;quot;parent document&amp;quot;. However, if the parent patent application has been granted, then the information will be stored under &amp;quot;parent grant document&amp;quot;. This seems like the most logical explanation, especially considering that the path to any field related to &amp;quot;parent grant document&amp;quot; contains &amp;quot;parent document&amp;quot;, as in the example below:&lt;br /&gt;
&lt;br /&gt;
 us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
&lt;br /&gt;
Therefore, a particular reissue patent will probably only have the fields filled for either the &amp;quot;parent document&amp;quot; or &amp;quot;parent grant document&amp;quot; if I am right about what they represent.&lt;br /&gt;
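&lt;br /&gt;
A sketch of how the example path above could be read with Python's built-in xml.etree.ElementTree. The sample tree is built programmatically from the path itself, so it is a hypothetical stand-in for a real USPTO grant file, and the function names are assumptions. Since us-patent-grant is the root element, the path passed to findtext starts one level below it.&lt;br /&gt;
&lt;br /&gt;
```python
import xml.etree.ElementTree as ET

# Path from the example above, relative to the us-patent-grant root.
PARENT_GRANT_DATE_PATH = (
    "us-bibliographic-data-grant/us-related-documents/reissue/relation/"
    "parent-doc/parent-grant-document/document-id/date"
)


def build_sample_grant(date_text):
    """Build a tiny hypothetical grant tree containing only that path."""
    root = ET.Element("us-patent-grant")
    node = root
    for tag in PARENT_GRANT_DATE_PATH.split("/"):
        node = ET.SubElement(node, tag)
    node.text = date_text
    return root


def parent_grant_date(root):
    """Return the parent grant document date, or None if it is absent."""
    return root.findtext(PARENT_GRANT_DATE_PATH)


found_date = parent_grant_date(build_sample_grant("20050614"))
missing_date = parent_grant_date(ET.Element("us-patent-grant"))
```
&lt;br /&gt;
A None result would correspond to a reissue patent whose parent is stored only under &amp;quot;parent document&amp;quot;, if the interpretation above is right.&lt;br /&gt;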
&lt;br /&gt;
=====Fields=====&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* parent_doc_status (varchar(255)) status of the parent application. Probably related to whether the patent application is pending or not&lt;br /&gt;
* parent_doc_number (int) probably application number for the parent document&lt;br /&gt;
* parent_doc_id (varchar(255)) it is unclear how this is different from parent document number. In the xpaths, the path will be something like ./parent_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent document&amp;quot; is - or may refer to its purpose (i.e. reissue, continuation, continuation-in-part, etc.)&lt;br /&gt;
* parent_doc_country (varchar(255)) the country of origin of the parent document&lt;br /&gt;
* parent_doc_date (date) probably the date the parent document was filed&lt;br /&gt;
* parent_grant_doc_number (int) probably application number for the parent grant document&lt;br /&gt;
* parent_grant_doc_id (varchar(255)) it is unclear how this is different from parent grant document number. In the xpaths, the path will be something like ./parent_doc/parent-grant-doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_grant_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent grant document&amp;quot; is, or may refer to its purpose (i.e. reissue, continuation, continuation-in-part, etc.)&lt;br /&gt;
* parent_grant_doc_country (varchar(255)) the country of origin of the parent grant document&lt;br /&gt;
* parent_grant_doc_date (date) probably the date the parent grant document was filed&lt;br /&gt;
* child_doc_number (varchar(255)) probably the application number of the child application&lt;br /&gt;
* child_doc_id (varchar(255)) it is unclear how this is different from child document number. In the xpaths, the path will be something like ./child_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* child_doc_country (varchar(255)) country of origin of the child application&lt;br /&gt;
&lt;br /&gt;
====DESIGN====&lt;br /&gt;
=====Fields=====&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* length_of_grant (int) length of grant, most likely in years&lt;br /&gt;
* hague_registration_date (date) filing date of international patent application&lt;br /&gt;
* hague_filing_date (date) not necessarily the same as the filing date of the international patent application, this is the date that the International Bureau receives all necessary elements for the international patent application&lt;br /&gt;
* hague_registration_pub_date (date) date that the International Bureau publishes the international patent application&lt;br /&gt;
* hague_international_registration_number (varchar(255)) international registration number&lt;br /&gt;
* edition (varchar(255)) possibly the edition of the Classification Locarno which determined main_classification&lt;br /&gt;
* main_classification (varchar(255)) classification for what type of design the patent is for&lt;br /&gt;
&lt;br /&gt;
==Connecting Patent database and Assignment database==&lt;br /&gt;
The answer to connecting the Patent database to the Assignment database lies in using the information in the table DOCUMENT_INFO to find, for each assignment in the Assignment database ASSIGNMENT table, the matching patent_number in the Patent database PATENT table. On further investigation into &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot;, the field the authors call &amp;quot;grant_doc_num&amp;quot; (patent_number above) in DOCUMENT_INFO is the patent number (the description of DOCUMENT_INFO above has been altered to reflect this).&lt;br /&gt;
&lt;br /&gt;
The paper mentions that there will be errors in the patent number stored in DOCUMENT_INFO under grant_doc_num, so the authors constructed a separate table called DOCUMENT_INFO_ADMIN to determine how prevalent the errors were. They used the appno_doc_num to query the patent number from administrative data. The &amp;quot;appno_doc_num&amp;quot; is the Application Document USPTO number from DOCUMENT_INFO. On further investigation into the paper, it became clear that &amp;quot;administrative data&amp;quot; refers to internal USPTO data that was available to the authors of the paper, but that we might not have. However, in 99% of cases the grant_doc_num and the queried patent number based on the appno_doc_num were the same, so we can probably rely on grant_doc_num.&lt;br /&gt;
&lt;br /&gt;
Therefore, since every table in the Patent database will have the patent number stored, to connect any table in Assignment to the Patent database, it would first be joined with DOCUMENT_INFO on rf_id and then joined with the appropriate table in Patent on patent number.&lt;br /&gt;
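&lt;br /&gt;
The two-step join described above can be sketched as follows, using Python's built-in sqlite3 module. For simplicity the sketch places cut-down versions of ASSIGNOR, DOCUMENT_INFO, and PATENT in one in-memory database; in practice they would live in two separate databases. The DDL and the sample rows are assumptions for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE assignor (
        rf_id BIGINT,
        assignor_name VARCHAR(255)
    );
    CREATE TABLE document_info (
        rf_id BIGINT PRIMARY KEY,
        patent_number VARCHAR(255)
    );
    CREATE TABLE patent (
        patent_number VARCHAR(255) PRIMARY KEY,
        title VARCHAR(255)
    );
    -- Hypothetical sample rows
    INSERT INTO assignor VALUES (1, 'Alice Smith');
    INSERT INTO document_info VALUES (1, '7000001');
    INSERT INTO patent VALUES ('7000001', 'Example widget');
""")

# Step 1: Assignment table to DOCUMENT_INFO on rf_id.
# Step 2: DOCUMENT_INFO to the Patent table on patent_number.
joined_rows = conn.execute("""
    SELECT a.assignor_name, d.patent_number, p.title
    FROM assignor AS a
    JOIN document_info AS d ON d.rf_id = a.rf_id
    JOIN patent AS p ON p.patent_number = d.patent_number
""").fetchall()
```
&lt;br /&gt;
An inner join silently drops assignments whose patent number in DOCUMENT_INFO has no match in PATENT, so counting the rows lost at step 2 would be one way to gauge how often the stored patent number is wrong.&lt;br /&gt;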
&lt;br /&gt;
Also in the paper, they mentioned the Assignments on the Web for Patents (AOTW-P), a searchable database of individual USPTO assignment records keyed on reel-frame identification, patent number, and assignor or assignee name (https://assignment.uspto.gov/patent/index.html#/patent/search). Obviously it would not be possible to use this tool by hand to query all the patent numbers, but if it were possible to write a script to query each patent number using the rf_id and parse the response, this could be useful for checking the patent numbers, though it might not be any more accurate than what will already be in DOCUMENT_INFO.&lt;br /&gt;
&lt;br /&gt;
==Finding New Fields Unique to Plant, Reissue, and/or Design Patents==&lt;br /&gt;
&lt;br /&gt;
Based on Oliver's script, which searched all xpaths and compared which were unique to particular types, we see that the following attributes are unique to each type of patent other than utility patents. These attributes vary by XML version, which changed over time. Therefore, the lists below are a superset of the attributes that are unique to each of the patent types listed below across all XML versions.&lt;br /&gt;
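&lt;br /&gt;
The comparison can be sketched with plain Python sets. This is not Oliver's actual script, only an illustration of the idea: for each XML version, keep the paths that appear for one patent type and no other, then take the union across versions to get the superset. The function names and the sample paths are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def unique_paths_by_type(paths_by_type):
    """Map each patent type to the paths seen for that type only."""
    unique = {}
    for patent_type, paths in paths_by_type.items():
        others = set()
        for other_type, other_paths in paths_by_type.items():
            if other_type != patent_type:
                others |= set(other_paths)
        unique[patent_type] = set(paths) - others
    return unique


def superset_across_versions(unique_by_version, patent_type):
    """Union of the unique paths for one patent type over all XML versions."""
    combined = set()
    for unique in unique_by_version.values():
        combined |= unique.get(patent_type, set())
    return combined


# Hypothetical, heavily shortened path lists for two XML versions.
xml44 = {
    "plant": {"title", "latin-name", "us-botanical-variety"},
    "utility": {"title", "claims"},
}
xml45 = {
    "plant": {"title", "us-claim-statement"},
    "utility": {"title", "claims"},
}
unique_by_version = {
    "XML44": unique_paths_by_type(xml44),
    "XML45": unique_paths_by_type(xml45),
}
plant_superset = superset_across_versions(unique_by_version, "plant")
```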
&lt;br /&gt;
When I am done finding the supersets, we will determine how to integrate these attributes into the design for the patent table.&lt;br /&gt;
&lt;br /&gt;
'''There appears to be a fifth patent type in the results from Oliver's script, and there are attributes that are unique to Utility patents, though they do not appear particularly useful'''&lt;br /&gt;
&lt;br /&gt;
===Plant Patents===&lt;br /&gt;
* US Claim Statement (XML45) - unclear what this represents&lt;br /&gt;
* Latin Name (XML44)&lt;br /&gt;
* US Botanical Variety (XML44)&lt;br /&gt;
&lt;br /&gt;
===Reissue Patents===&lt;br /&gt;
* Parent Status (XML45)&lt;br /&gt;
* Parent Document (XML44) - seems to be either a normal document or a parent grant document (or both), both contain the following fields&lt;br /&gt;
** Date&lt;br /&gt;
** Document number&lt;br /&gt;
** Country &lt;br /&gt;
** Document ID&lt;br /&gt;
** Kind (XML42)&lt;br /&gt;
* Child Document (XML44)&lt;br /&gt;
** Document ID&lt;br /&gt;
** Document Number&lt;br /&gt;
** Country (XML43)&lt;br /&gt;
&lt;br /&gt;
Note about fields related to the parent document and/or parent grant document:&lt;br /&gt;
The structure of the above fields related to a reissue patent is very odd and varies by XML version. For example, for most of the XML versions, parent grant document falls under parent document and has its own document number, date, country, document ID, kind, etc. The parent doc field falls under relation, which falls under reissue.&lt;br /&gt;
&lt;br /&gt;
However, in XML41, parent document falls under relation and a new category - US Reexamination Reissue Merger. The information pertaining to the child document also falls under both relation and US Reexamination Reissue Merger. It's possible that during the time period this XML version represents, there were two types of reissue patents - a reexamination related to a merger and a standard reissue. With discrepancies like these, we'll have to determine in the end whether we want to stick with the fields as they are designed above (in which case the information about a parent doc or parent grant document is stored the same way whether it is a reissue or a Reexamination Reissue), or whether these two classifications are different enough to warrant duplicate fields in the table.&lt;br /&gt;
&lt;br /&gt;
===Design Patents===&lt;br /&gt;
* length of grant (XML45)&lt;br /&gt;
* Hague Agreement Data (XML45) - allows people to file design patents in 66 countries with one application&lt;br /&gt;
** International Registration Date&lt;br /&gt;
** International Filing Date&lt;br /&gt;
** International Registration Publication Date&lt;br /&gt;
** International Registration Number&lt;br /&gt;
* Classification Locarno (XML40)&lt;br /&gt;
** Edition&lt;br /&gt;
** Main Classification&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please note that the lists above are meant to represent unique types that should be added to the patent database. This does not mean that every XML version contains all these fields, or that every version contains the same path to these fields. That will have to be determined separately. The XML number next to some of the items represents the latest XML version where the field was seen.&lt;br /&gt;
&lt;br /&gt;
Though I was mainly looking at unique fields in plant, reissue, and design patents, I noticed the following fields on unique paths for utility patents in XML41:&lt;br /&gt;
&lt;br /&gt;
* US Related Documents&lt;br /&gt;
** Substitution&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Parent Document&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Country&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
***** Parent Status&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Country&lt;br /&gt;
** Continuation/Continuation in Part&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Doc&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
** Division&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Kind&lt;br /&gt;
&lt;br /&gt;
This appears to resemble the information about reissue patents. Additionally, a couple of fields appear under the information about bibliographic data that might be useful:&lt;br /&gt;
* US Deceased Inventor (XML41)&lt;br /&gt;
** Post Code&lt;br /&gt;
** Address&lt;br /&gt;
* US Provisional Application Status (XML41)&lt;br /&gt;
&lt;br /&gt;
==Paths for the New Fields Related to Plant, Reissue, and/or Design Patents==&lt;br /&gt;
&lt;br /&gt;
===Plant Patents===&lt;br /&gt;
XML 4.4, 4.3, 4.1, and 4.0&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-botanic&lt;br /&gt;
fields: latin-name, variety&lt;br /&gt;
&lt;br /&gt;
XML 4.5&lt;br /&gt;
parent node: us-patent-grant/us-claim-statement/&lt;br /&gt;
field: i&lt;br /&gt;
&lt;br /&gt;
XML 4.2 &lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-botanic&lt;br /&gt;
fields: latin-name, variety&lt;br /&gt;
&lt;br /&gt;
parent node: us-patent-grant/us-claim-statement/&lt;br /&gt;
field: i &lt;br /&gt;
&lt;br /&gt;
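Because the plant fields move between versions, the paths above could be kept in a lookup table keyed by XML version. The following is only a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the names PLANT_PATHS and plant_fields and the sample values are hypothetical and are not taken from Oliver's code.&lt;br /&gt;

```python
import xml.etree.ElementTree as ET

# Paths relative to the us-patent-grant root, copied from the lists above.
BOTANIC = "us-bibliographic-data-grant/us-botanic"
BOTANIC_PATHS = [BOTANIC + "/latin-name", BOTANIC + "/variety"]
CLAIM = "us-claim-statement/i"

# Hypothetical lookup table: XML version mapped to the paths worth trying.
PLANT_PATHS = {
    "4.0": BOTANIC_PATHS,
    "4.1": BOTANIC_PATHS,
    "4.2": BOTANIC_PATHS + [CLAIM],
    "4.3": BOTANIC_PATHS,
    "4.4": BOTANIC_PATHS,
    "4.5": [CLAIM],
}


def plant_fields(root, version):
    """Return a dict of path to text for each plant path found in the grant."""
    found = {}
    for path in PLANT_PATHS[version]:
        node = root.find(path)
        if node is not None and node.text:
            found[path] = node.text.strip()
    return found


# Build a tiny made-up XML 4.4 grant in memory for illustration.
root = ET.Element("us-patent-grant")
biblio = ET.SubElement(root, "us-bibliographic-data-grant")
botanic = ET.SubElement(biblio, "us-botanic")
ET.SubElement(botanic, "latin-name").text = "Rosa hybrida"
ET.SubElement(botanic, "variety").text = "Example"

print(plant_fields(root, "4.4"))
```

Keeping the paths in one table means a new XML version only needs a new dictionary entry rather than a new parsing branch.&lt;br /&gt;
&lt;br /&gt;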
===Reissue Patents===&lt;br /&gt;
&lt;br /&gt;
XML 4.5&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/parent-doc/&lt;br /&gt;
fields: parent-status&lt;br /&gt;
&lt;br /&gt;
XML 4.4&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/continuing-reissue/relation/&lt;br /&gt;
fields: parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
	parent-doc/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/country&lt;br /&gt;
	child-doc/document-id/country&lt;br /&gt;
	child-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/doc-number&lt;br /&gt;
	parent-doc/document-id/date&lt;br /&gt;
(everything in XML 4.3 except parent-status)	&lt;br /&gt;
&lt;br /&gt;
XML 4.3&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/&lt;br /&gt;
fields: child-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
	parent-doc/parent-status&lt;br /&gt;
	parent-doc/document-id/date&lt;br /&gt;
	parent-doc/document-id/country&lt;br /&gt;
	parent-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/country&lt;br /&gt;
	child-doc/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/doc-number	&lt;br /&gt;
	&lt;br /&gt;
XML 4.1&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/&lt;br /&gt;
fields:&lt;br /&gt;
	child-doc/document-id/doc-number&lt;br /&gt;
	child-doc/document-id/country&lt;br /&gt;
	parent-doc/parent-status&lt;br /&gt;
	parent-doc/document-id/kind&lt;br /&gt;
	parent-doc/document-id/country&lt;br /&gt;
	parent-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/document-id/date&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/kind&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/doc-number&lt;br /&gt;
&lt;br /&gt;
other parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/us-reexamination-reissue-merger/relation/&lt;br /&gt;
fields: &lt;br /&gt;
	child-doc/document-id/doc-number&lt;br /&gt;
	child-doc/document-id/country&lt;br /&gt;
	parent-doc/document-id/date&lt;br /&gt;
	parent-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/kind&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/doc-number&lt;br /&gt;
&lt;br /&gt;
XML 4.0 and XML 4.2&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/&lt;br /&gt;
fields: parent-doc/document-id/kind&lt;br /&gt;
	parent-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/kind&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/country&lt;br /&gt;
	child-doc/document-id/country&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/doc-number&lt;br /&gt;
	child-doc/document-id/doc-number&lt;br /&gt;
	parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
	parent-doc/parent-status&lt;br /&gt;
	parent-doc/document-id/country&lt;br /&gt;
	parent-doc/document-id/date&lt;br /&gt;
&lt;br /&gt;
===Design Patents===&lt;br /&gt;
XML 4.5&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/&lt;br /&gt;
fields: hague-agreement-data/international-registration-date/date&lt;br /&gt;
	hague-agreement-data/international-registration-publication-date/date&lt;br /&gt;
	us-term-of-grant/length-of-grant&lt;br /&gt;
	hague-agreement-data/international-registration-number&lt;br /&gt;
	hague-agreement-data/international-filing-date/date&lt;br /&gt;
&lt;br /&gt;
XML 4.1, 4.3, and 4.4&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/us-term-of-grant/&lt;br /&gt;
fields: length-of-grant&lt;br /&gt;
&lt;br /&gt;
XML 4.0&lt;br /&gt;
parent node: us-patent-grant/us-bibliographic-data-grant/&lt;br /&gt;
fields: us-term-of-grant/length-of-grant&lt;br /&gt;
	classification-locarno/edition&lt;br /&gt;
	classification-locarno/main-classification&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shelby_Bice_(Work_Log)&amp;diff=21974</id>
		<title>Shelby Bice (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shelby_Bice_(Work_Log)&amp;diff=21974"/>
		<updated>2017-11-17T22:49:09Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* Fall 2017 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;===Fall 2017===&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt; &lt;br /&gt;
&lt;br /&gt;
[[Shelby Bice]] [[Work Logs]] [[Shelby Bice (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
2017-11-17 2:15 pm - 5:00 pm - continued altering code to include the special fields for plant, reissue, and design patents. Their models and the changes to XmlParser relevant to those models should be done. I will go through the rest of the code and check where else I need to make alterations next time I am in. &lt;br /&gt;
&lt;br /&gt;
2017-11-16 8:45 am - 10:30 am - continued altering code to include the special fields for plant, reissue, and design patents&lt;br /&gt;
&lt;br /&gt;
2017-11-14 8:30 - 10:30 am - met with Oliver to go over his code, began working on scripts and finding paths for the fields unique to design, reissue, and plant patents.&lt;br /&gt;
&lt;br /&gt;
2017-11-10: 2:00 - 5:00 pm - updated format of worklog, finished researching fields of the new design, utility, and reissue tables, and documenting what each table contains&lt;br /&gt;
&lt;br /&gt;
2017-11-03: 2:30 - 3:30 pm, 3:45 pm - 5:00 pm (I went to get a snack because I was starving!) - Finished up finding unique fields, added them to database design, and researched what the fields represented - again, this is taking a while since there is little to no documentation online about what fields on these XML files actually represent.&lt;br /&gt;
&lt;br /&gt;
2017-11-02: 2:15 pm - 4:00 pm - made a main data page and continued going through the xpaths Oliver found. Notified Michelle so she can continue adding patent pages&lt;br /&gt;
&lt;br /&gt;
2017-10-31: 9:00 am - 10:30 am - started trying to determine from Oliver's research what paths are unique to utility, reissue, and plant patents that we have not yet accounted for in our design of the table&lt;br /&gt;
&lt;br /&gt;
2017-10-27: 2:00 pm - 3:30 pm - Kept updating tables. For some of the new values I've been adding to tables (I didn't realize we had XPaths for some fields) I'm struggling to find out exactly what the field means (for example, the date on a citation - is this the date the citedpatent was granted? or is the date that the citingpatent cited the citedpatent?) so the process is slow going. I want to get the documentation right though so in the future someone can look up and see exactly what each field in the table represents and not have to guess&lt;br /&gt;
&lt;br /&gt;
2017-10-26: 2:00 pm - 4:00 pm - worked on updating the designs for the patent databases tables based on what was discussed on Friday (specifically adding inventors and lawyers tables, altering fields on patent tables) - will continue this on Friday&lt;br /&gt;
&lt;br /&gt;
2017-10-20: - 2:00 pm - 5:00 pm - met with Ed and Oliver about patent database design&lt;br /&gt;
&lt;br /&gt;
2017-10-19: 2:00 pm - 4:00 pm - finished up ER diagram and description of all tables for patent database, started reading papers concerning creating an inventor's database (looks like other research groups have merged the USPTO data with the Harvard Dataverse data in order to create an inventors table)&lt;br /&gt;
&lt;br /&gt;
2017-10-17: 8:45 am - 10:30 am - continued trying to solve twitter issues, worked on ER diagram, skimmed through a paper I found (linked at the top of the page for Redesign Assignment and Patent Database) to see how they cleaned the data, since I assume that will be the next step to research after finishing the ER diagrams&lt;br /&gt;
&lt;br /&gt;
2017-10-12: 2:30 pm - 5:00 pm - wrote blog post on Grace Hopper Celebration and attempted to solve Twitter issue&lt;br /&gt;
&lt;br /&gt;
2017-10-03: 8:45 am - 10:30 am - finished adding descriptions to the fields for each table in the patent database, started work on an ER diagram&lt;br /&gt;
&lt;br /&gt;
2017-09-29: 2:00 pm - 5:00 pm - finished adding the design for the Patent database to the project page, added descriptions of the fields for each table to the project page including the datatype that I think the field will be when it's loaded into the database&lt;br /&gt;
&lt;br /&gt;
2017-09-28: 9:00 am - 10:30 am - continued going over and documenting last semester's patent database design and adding the details to the Redesign Assignment and Patent Database project page. Additionally, I began trying to determine how to match up the information in the Document_Info table in Assignment to match up with a patent_id in the Patent table in Patent.&lt;br /&gt;
&lt;br /&gt;
Tomorrow I will finish adding the design for the Patent database to the project page, add descriptions of the fields for each table to the project page, and start working on ER diagrams for the two databases.&lt;br /&gt;
&lt;br /&gt;
Links for creating ER diagrams: https://erdplus.com/#/standalone or https://creately.com/app/?tempID=hqdgwjki1&amp;amp;login_type=demo#&lt;br /&gt;
&lt;br /&gt;
2017-09-26: 8:45 am - 10:00 am - continued working on the design of the Assignment database by checking my design against the work done last semester on the assignment data restructure to make sure I didn't miss anything major. Began going over my patent database design from last semester to tweak it. Will need to sync up with Joe Reilly to see if there are any new fields that we are pulling from the data. Additionally, I made a new project page called Redesign Assignment and Patent Database that encompasses the new design for the Assignment database and Patent database redesign and moved some of the notes from McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper to the project page.&lt;br /&gt;
&lt;br /&gt;
The main takeaway from looking over Patent Assignment Data Restructure is that, after assembling the table according to my design (which doesn't seem to have any contradictions with the Patent Assignment Data Restructure), there will be multiple steps for cleaning the data, specifically the fields relating to location and address in the assignment table. While the Patent Assignment Data Restructure mentions connecting to the Patent database, it is not clear from the page what field would be used to connect to the Patent database.&lt;br /&gt;
&lt;br /&gt;
2017-09-23: 2:00 pm - 4:00 pm - continued working on design of Assignment database and how it will connect to Patent database by writing out what will be in each table in Assignment and questions about different possible structures of tables that we will have to address before finalizing the design - the notes can be found under McNair/Projects/Redesigning Patent Database/New Patent Database Project as Notes on USPTO Assignment Data Paper. Questions are highlighted in yellow throughout the document[[Category:Work Log]]&lt;br /&gt;
&lt;br /&gt;
2017-09-22: 8:30 am - 10:30 am - continued looking at paper on USPTO assignment data and adding to the notes on what the design of that database should look like, specifically on what I need for different tables and what I don't know yet about the design. Had to set up connection to RDP again due to technical issues. &lt;br /&gt;
&lt;br /&gt;
2017-09-15: 2:00 pm - 5:00 pm - introduced to new patent database project, reviewed and took notes on USPTO Assignment data (notes can be found under McNair/Projects/Redesigning Patent Database/New Patent Database Project as Notes on USPTO Assignment Data Paper)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Spring 2017===&lt;br /&gt;
&lt;br /&gt;
2017-04-27:  1:00 pm - 3:00 pm - worked on documentation more, tried to figure out how to clean citation data&lt;br /&gt;
&lt;br /&gt;
2017-04-25:  10:00 am - 12:00 pm - worked on documentation, tried to determine how to clean up the USPTO Assignee Data&lt;br /&gt;
&lt;br /&gt;
2017-04-20:  10:00 am - 11:30 pm - wrote copy statements for copying data from RDP to database, continued working on documentation.&lt;br /&gt;
&lt;br /&gt;
2017-04-18:  9:45 am - 12:30 pm - Cleaned up documentation more, kept working through the process of parsing the data&lt;br /&gt;
&lt;br /&gt;
2017-04-13: 9:30 am - 12:00 pm - Continued working on trying to update patent data through 2016, specifically parsing the data, worked with Ed to update perl scripts&lt;br /&gt;
&lt;br /&gt;
2017-04-11: 9:15 am - 12:00 pm - Worked on trying to update patent data through 2016&lt;br /&gt;
&lt;br /&gt;
2017-04-06: 9:30 am - 11:30 pm - Kept looking through DTDs, kept updating documentation&lt;br /&gt;
&lt;br /&gt;
2017-04-04: 9:00 am - 11:45 pm - Worked on updating documentation, found documentation on pulling data/making tables and databases, started looking through DTDs to find extra fields to pull&lt;br /&gt;
&lt;br /&gt;
2017-03-23: 9:15 am - 12:00 pm - Narrowed down core tables and fields&lt;br /&gt;
&lt;br /&gt;
2017-03-22: 5:30 pm to 6:30 pm - Patent Data meeting&lt;br /&gt;
&lt;br /&gt;
2017-03-21: 9:30 am - 12:00 pm - Worked on determining &amp;quot;core&amp;quot; tables for new patent database&lt;br /&gt;
&lt;br /&gt;
2017-03-09: 9:00 am - 12:00 pm Finished first draft of spreadsheet describing the current schema (and possible changes) to the Patent database&lt;br /&gt;
&lt;br /&gt;
2017-03-07: 9:30 am - 12:00 pm Continued working on spreadsheet, added relevant page links to project page, took notes on what I want documentation to look like in the future&lt;br /&gt;
&lt;br /&gt;
2017-03-02: 9:15 am - 12:15 pm Started excel spreadsheet to document current schema design and improvements to be made, updated project pages&lt;br /&gt;
&lt;br /&gt;
2017-02-23: 9:30 am - 12:00 pm Reviewed Perl, read about database design, set up project page for redesigning database, started documenting process&lt;br /&gt;
&lt;br /&gt;
2017-02-21: 9:00 am - 12:00 pm Set up work page, reviewed SQL, researched designing database, continued going through wiki&lt;br /&gt;
&lt;br /&gt;
2017-02-16: 10:30 am - 12:00 pm Researched past work on databases, discussed project with Ed&lt;br /&gt;
&lt;br /&gt;
2017-02-14: 10:00 am - 12:00 pm Set up personal wiki, set up work log&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=21972</id>
		<title>Redesign Assignment and Patent Database</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=21972"/>
		<updated>2017-11-17T22:39:01Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* Fields */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Redesign Assignment and Patent Database&lt;br /&gt;
|Has owner=Shelby Bice,&lt;br /&gt;
|Has start date=9/2017&lt;br /&gt;
|Has keywords=patent&lt;br /&gt;
|Is dependent on=Reproducible Patent Data,&lt;br /&gt;
}}&lt;br /&gt;
'''FOR ED:''' I finished going through the unique paths (you can see the unique fields I identified at the bottom of this page, under &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot;). Then, under my design for the Patent Database, there's a new section called &amp;quot;Unique Attributes Tables&amp;quot; where I've begun detailing a design for three new tables that would include information that is unique to Design, Reissue, and Plant patents. Here are some questions I have:&lt;br /&gt;
&lt;br /&gt;
'''1.''' Do you like the design for the new tables? How do you think we should populate these tables? I can think of one way (putting all the information in the main patent tables and then moving it out to Plant, Reissue, or Design as needed) but I was wondering if there might be an easier way. &lt;br /&gt;
&lt;br /&gt;
'''2.''' There were some unique attributes to utility patents that I also included in &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot;. They only appear to be in two of the oldest XML versions (4.1 and 4.0) and I'm not sure the information is particularly useful - but please take a look at the field names and let me know if you disagree and think they are useful and should be included in a table in the patent database.&lt;br /&gt;
&lt;br /&gt;
'''3.''' Do you know what some of these fields represent? For example, I've been trying to research online what a &amp;quot;Parent Document&amp;quot; and/or &amp;quot;Parent Grant Document&amp;quot; might represent for a reissue patent, and I've found a couple possible options. I will keep researching, but if you know what they represent or one of my descriptions for a field is wrong, please let me know.&lt;br /&gt;
&lt;br /&gt;
This is an extension of the work I did last semester under &amp;quot;Redesigning Patent Database&amp;quot;. Instead of simply reconfiguring the existing database, this project encompasses a full redesign and the creation of a new Patent database and a new Assignment database that will be joined together.&lt;br /&gt;
&lt;br /&gt;
Reference for definitions of fields in the two databases: https://www.oecd.org/sti/sci-tech/37569498.pdf&lt;br /&gt;
&lt;br /&gt;
Paper on how to extract patent data from XML files, parse it, and put it into a database: https://funginstitute.berkeley.edu/wp-content/uploads/2014/06/patentprocessor.pdf&lt;br /&gt;
&lt;br /&gt;
'''Particularly useful, because it provides solutions for cleaning location and firm data'''&lt;br /&gt;
&lt;br /&gt;
==ER Diagram for Assignment and Patent Databases==&lt;br /&gt;
&lt;br /&gt;
Note: this is currently out of sync with the current description below for the patent database. It will be updated once I have finalized how we will organize unique attributes for plant, design, and reissue patents.&lt;br /&gt;
&lt;br /&gt;
[[File:PatentAndAssignmentER2.png]]&lt;br /&gt;
&lt;br /&gt;
Attributes for each table are listed below with descriptions - because of how many attributes there are, I decided the ER diagram would be better suited as an overview of the tables rather than trying to show all the attributes on the diagram.&lt;br /&gt;
&lt;br /&gt;
==Assignment Database Structure==&lt;br /&gt;
&lt;br /&gt;
After reading through the paper &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot; (https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf) and taking into account our needs for the Assignment database (the ability to connect the two databases together, the ability to trace conveyance of a patent over time), I decided on the following structure for the Assignment database.&lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table in the database is rf_id, which stands for reel frame id. It is unique for each entry in the main table, assignment, and is included in every table. &lt;br /&gt;
&lt;br /&gt;
Conveyance is the only table that cannot be fully constructed as data is inserted into the database. Some of the fields depend on other tables being constructed, so it may either be partially populated while the other tables are being populated or be the last table to be populated. &lt;br /&gt;
&lt;br /&gt;
Full list of tables: assignee, assignor, document_info, assignment, conveyance&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: store all the assignees from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the recipient of a patent in an entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id, unless it is possible to have multiple assignees, then rf_id and assignee name&lt;br /&gt;
&lt;br /&gt;
All columns (variables) in the table: &lt;br /&gt;
&lt;br /&gt;
* rf_id (bigint) reel frame number &lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* address_line_1 (varchar(255)) first line of address of assignee&lt;br /&gt;
* address_line_2 (varchar(255)) second line of address of assignee&lt;br /&gt;
* city (varchar(255)) city of assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* country (varchar(255)) country of assignee&lt;br /&gt;
* postal_code (varchar(255)) post code of the assignee&lt;br /&gt;
&lt;br /&gt;
===ASSIGNOR===&lt;br /&gt;
	&lt;br /&gt;
Purpose: store all the assignors from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the one assigning/granting the license for the entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id and assignor_name&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
&lt;br /&gt;
===CONVEYANCE===&lt;br /&gt;
Purpose: represent the conveyance of a particular patent over time&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual transaction for a patent (not unlike assignment, except this table has information about the assignment in relation to other assignments for the patent)&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id &lt;br /&gt;
&lt;br /&gt;
All columns (variables): Two possibilities, depending on whether we want data to be repeated or not.&lt;br /&gt;
	&lt;br /&gt;
Not-repeated (with exception of primary/foreign key): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* next_transaction (bigint) rf_id of next transaction that occurred after this transaction (traced by assignee/assignor)&lt;br /&gt;
* num_conveyance (int) number in order of conveyance by date&lt;br /&gt;
* acknowledgment_date (date) date that the assignor acknowledged the transaction&lt;br /&gt;
* execution_date (date) date that the transaction took place&lt;br /&gt;
* conveyance_type (varchar(255)) the USPTO paper determined a couple of different conveyance types and also outlined how to determine based on the conveyance text in the xml file which conveyance type is relevant to the transaction. The conveyance types are:&lt;br /&gt;
** assignment&lt;br /&gt;
** merger&lt;br /&gt;
** change of name&lt;br /&gt;
** government interest&lt;br /&gt;
** agreement&lt;br /&gt;
** security agreement&lt;br /&gt;
** release&lt;br /&gt;
&lt;br /&gt;
	&lt;br /&gt;
Repeated, also include: &lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
* patent_number (varchar(255)) identifying number for the patent&lt;br /&gt;
&lt;br /&gt;
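The num_conveyance and next_transaction columns have to be derived once all of a patent's transactions are known. The following is a rough sketch in Python, assuming the transactions for a single patent can simply be ordered by execution_date; the design above says next_transaction is traced by assignee/assignor, which this simplification does not attempt. The function name build_conveyance and all sample values are made up for illustration.&lt;br /&gt;

```python
from datetime import date

# Hypothetical rows for one patent: (rf_id, patent_number, execution_date).
rows = [
    (300, "7000001", date(2012, 5, 1)),
    (100, "7000001", date(2009, 1, 15)),
    (200, "7000001", date(2010, 8, 30)),
]


def build_conveyance(rows):
    """Order one patent's transactions by date and link each to the next."""
    ordered = sorted(rows, key=lambda r: r[2])
    # rf_id of the transaction that follows each one; None for the last.
    following = [r[0] for r in ordered[1:]] + [None]
    return [
        {
            "rf_id": rf_id,
            "num_conveyance": position,
            "next_transaction": next_rf_id,
            "execution_date": executed,
        }
        for position, ((rf_id, _patent, executed), next_rf_id)
        in enumerate(zip(ordered, following), start=1)
    ]


conveyance = build_conveyance(rows)
for entry in conveyance:
    print(entry["rf_id"], entry["num_conveyance"], entry["next_transaction"])
```

This is why CONVEYANCE is populated last: the ordering only makes sense after every assignment for the patent has been loaded.&lt;br /&gt;
&lt;br /&gt;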
===DOCUMENT_INFO===&lt;br /&gt;
Purpose: store extra information relevant to the patents represented in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the documentation and the patent for a particular entry in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables) when created: &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* app_num (varchar(255)) application document USPTO number &lt;br /&gt;
* app_date (date) date that the application was filed&lt;br /&gt;
* app_country (varchar(255)) country in which it was filed&lt;br /&gt;
* pgpub_num (varchar(255)) pre-grant publication document USPTO number&lt;br /&gt;
* pgpub_date (date) date that pre-grant publication was released&lt;br /&gt;
* pgpub_country (varchar(255)) country in which the patent is published before being granted a patent number&lt;br /&gt;
* patent_number (varchar(255)) granted patent document USPTO number&lt;br /&gt;
* grant_date (date) date that the patent is officially published&lt;br /&gt;
* grant_country (varchar(255)) country in which the patent is published after being granted a patent number&lt;br /&gt;
* invention_title (varchar(255)) title of the invention&lt;br /&gt;
* language (varchar(255)) language of invention title – could be potentially useful and interesting to investigate &lt;br /&gt;
* reel_num (bigint) number of the reel the assignment was stored on&lt;br /&gt;
* frame_num (bigint) number of the frame on the reel the assignment was stored on&lt;br /&gt;
&lt;br /&gt;
===ASSIGNMENT===&lt;br /&gt;
Purpose: represent assignment transactions – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent assignment transaction&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignment_id (bigint) US patent assignment id &lt;br /&gt;
* correspondent_name (varchar(255)) name of correspondent for the transaction, could be a lawyer&lt;br /&gt;
* correspondent_address1 (varchar(255)) first line of address for correspondent&lt;br /&gt;
* correspondent_address2 (varchar(255)) second line of address for correspondent&lt;br /&gt;
* correspondent_address3 (varchar(255)) third line of address for correspondent&lt;br /&gt;
* correspondent_address4 (varchar(255)) fourth line of address for correspondent&lt;br /&gt;
* record_date (date) date recorded with USPTO&lt;br /&gt;
* last_update_date (date) date that information for this assignment was last updated with the USPTO &lt;br /&gt;
* page_cnt (bigint) page count of assignment record&lt;br /&gt;
&lt;br /&gt;
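To illustrate how rf_id ties the tables together, here is a minimal sketch using Python's built-in sqlite3 module. It is an assumption for illustration only: the real database engine may differ, SQLite's INTEGER and TEXT stand in for the bigint, varchar, and date types listed above, only a few columns are included, and all row values are invented.&lt;br /&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Cut-down versions of two tables from the design, keyed on rf_id.
conn.executescript("""
CREATE TABLE assignment (
    rf_id INTEGER PRIMARY KEY,
    record_date TEXT,
    page_cnt INTEGER
);
CREATE TABLE assignor (
    rf_id INTEGER REFERENCES assignment (rf_id),
    assignor_name TEXT,
    PRIMARY KEY (rf_id, assignor_name)
);
""")

# One made-up assignment with two assignors.
conn.execute("INSERT INTO assignment VALUES (12345678, '2015-03-02', 4)")
conn.executemany(
    "INSERT INTO assignor VALUES (?, ?)",
    [(12345678, "DOE, JANE"), (12345678, "ROE, RICHARD")],
)

# Join the two tables on the shared rf_id key.
rows = conn.execute(
    "SELECT a.rf_id, a.record_date, r.assignor_name "
    "FROM assignment a JOIN assignor r ON r.rf_id = a.rf_id "
    "ORDER BY r.assignor_name"
).fetchall()
print(rows)
```

The composite primary key on assignor mirrors the design above, where one assignment can have several assignors.&lt;br /&gt;
&lt;br /&gt;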
More extensive notes exist under :E/McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper&lt;br /&gt;
&lt;br /&gt;
==Patent Database Structure==&lt;br /&gt;
&lt;br /&gt;
Most of this design is based off what I investigated last semester and therefore is subject to change, since the design for the Assignment database is somewhat new. &lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table is patent_number, which uniquely identifies every patent in the PATENT table and is included in every table.&lt;br /&gt;
&lt;br /&gt;
Full list of tables: patent, assignee, citation, fee, claims, histpatent, lawyers, inventors&lt;br /&gt;
&lt;br /&gt;
Update: These tables, as of October 26, are being updated and changed as I determine what data can actually be retrieved from the XML files based on the xpaths that we have determined (you can see these xpaths on the &amp;quot;Equivalent XPath and APS Queries&amp;quot; project page).&lt;br /&gt;
&lt;br /&gt;
===PATENT===&lt;br /&gt;
Purpose: represent patents – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* patent_type (varchar(255)) the type of patent, like utility&lt;br /&gt;
* patent_kind (varchar(2)) the kind of patent, a letter and a number&lt;br /&gt;
* title (varchar(255)) the title of the patent&lt;br /&gt;
* grantdate (date) date that patent was officially granted including the year&lt;br /&gt;
* prioritydate (date) date used to establish the novelty of an invention in regard to other inventions (source: http://www.bios.net/daisy/patentlens/2343.html)&lt;br /&gt;
* prioritycountry (varchar(255)) country where patent is first filed&lt;br /&gt;
* patent_country (varchar(255)) country in which this patent was published&lt;br /&gt;
* prioritypatentnumber (varchar(255)) placeholder patent number if the patent is published before it is fully approved - also referred to as priority claims number&lt;br /&gt;
* cpcsubclass (varchar(255)) cooperative patent classification subclass&lt;br /&gt;
* cpcsubgroup (varchar(255)) cooperative patent classification subgroup&lt;br /&gt;
* cpcmaingroup (varchar(255)) cooperative patent classification main group&lt;br /&gt;
* cpctotal (varchar(255)) concatenated version of cpcmaingroup, cpcsubgroup, cpcsubclass&lt;br /&gt;
* pctpatentnumber (varchar(255)) international patent number (according to Patent Cooperation Treaty) also known as PCT Document Number&lt;br /&gt;
* appnum(varchar(255)) application number, probably also the filing number - the format is a two digit series code followed by a six digit serial number (source: https://www.uspto.gov/patents-application-process/filing-online/info-application-number)&lt;br /&gt;
* appdate (date) date that the application was filed, including year - probably also the filing date&lt;br /&gt;
* ipcrsubclass (varchar(255)) International patent classification subclass&lt;br /&gt;
* ipcrmaingroup (varchar(255)) International patent classification main group&lt;br /&gt;
* ipcrsubgroup (varchar(255)) International patent classification subgroup&lt;br /&gt;
* ipcrtotal (varchar(255)) concatenated version of ipcrsubgroup, ipcrsubclass, ipcrmaingroup&lt;br /&gt;
* national_classification (varchar(255)) - probably uspc, which is United States Patent Classification&lt;br /&gt;
* national_classification_country (varchar(255)) - should be United States&lt;br /&gt;
* num_claims (int) number of claims (things that the inventor wishes to protect with this patent)&lt;br /&gt;
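&lt;br /&gt;
The appnum layout described above (a two-digit series code followed by a six-digit serial number) can be checked when rows are loaded. The following is only a minimal sketch: the helper name split_appnum is hypothetical, and it assumes that any punctuation in the raw value can simply be dropped.&lt;br /&gt;
&lt;br /&gt;
```python
import re


def split_appnum(appnum):
    """Split an application number into (series_code, serial_number).

    Assumes the two-digit series code plus six-digit serial number layout
    described above. Punctuation such as "/" or "," is dropped first.
    Returns None when the value does not contain exactly eight digits.
    """
    digits = re.sub(r"\D", "", appnum)
    if len(digits) != 8:
        return None
    return digits[:2], digits[2:]


# Made-up application number, for illustration only.
print(split_appnum("15/123,456"))
```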
&lt;br /&gt;
Discrepancies: &lt;br /&gt;
These fields were previously part of the table, but we do not currently have xpaths for them:&lt;br /&gt;
* app_year - the year the application was filed&lt;br /&gt;
* grant_year - the year the patent was granted&lt;br /&gt;
* uspc - United States Patent Classification (might be the same as national_classification)&lt;br /&gt;
* uspcsub - United States Patent Classification subgroup&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: Represent the person or organization to which the patent was assigned at publication, according to the patent information (the information in the Assignment database is more complete) &lt;br /&gt;
&lt;br /&gt;
Each entry represents: the assignee for a patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number, first_name, last_name (in case it is possible for there to be multiple assignees for an individual patent)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO; it can contain letters, hence varchar&lt;br /&gt;
* first_name (varchar(255)) first name of assignee&lt;br /&gt;
* last_name (varchar(255)) last name of assignee&lt;br /&gt;
* address (varchar(255)) address of assignee&lt;br /&gt;
* postcode (varchar(255)) postcode of assignee&lt;br /&gt;
* orgname (varchar(255)) name of the organization owning the patent (if applicable - organization may go by multiple orgnames) &lt;br /&gt;
* city (varchar(255)) city of the assignee&lt;br /&gt;
* country (varchar(255)) country of the assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* residence (varchar(255)) mailing address of assignee if they receive mail at a different address than the one listed above&lt;br /&gt;
* size_of_firm (varchar(255)) size of the organization (large, small, or micro), which determines how much they pay in fees for the patent&lt;br /&gt;
&lt;br /&gt;
===CITATION===&lt;br /&gt;
Purpose: Represent all the patents in PATENT that another patent in PATENT cites&lt;br /&gt;
&lt;br /&gt;
Each entry represents: one citation that a patent in PATENT makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number AND cited_patent_number (because a patent might cite multiple patents, so both variables are needed for the primary key)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO; it can contain letters, hence varchar&lt;br /&gt;
* cited_patent_number (varchar(255)) unique identifier from the USPTO for the patent being cited&lt;br /&gt;
* description (varchar(255)) description of the citation&lt;br /&gt;
* citation_date (date) unclear what this date is supposed to represent from the DTD, but according to the paper that is the second link in the top section, it is the &amp;quot;date of cited document&amp;quot;&lt;br /&gt;
* citation_kind (varchar(255)) type of cited document&lt;br /&gt;
* citation_country (varchar(255)) country of origin for the cited patent &lt;br /&gt;
* citation_name (varchar(255)) appears to be the name(s) of those to whom the cited patent is assigned&lt;br /&gt;
&lt;br /&gt;
Note: There are other fields available for this table, such as document number, but this seems like it might be a repeat of the document number for the citing patent&lt;br /&gt;
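&lt;br /&gt;
The composite primary key described for CITATION can be illustrated with a small in-memory SQLite table. This is a sketch under assumptions, not the project's actual DDL: only three of the columns are included, the add_citation helper is hypothetical, and the patent numbers are made up.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE citation (
        patent_number       VARCHAR(255) NOT NULL,
        cited_patent_number VARCHAR(255) NOT NULL,
        citation_date       DATE,
        PRIMARY KEY (patent_number, cited_patent_number)
    )
    """
)


def add_citation(patent_number, cited_patent_number, citation_date=None):
    """Insert one citation; return False if the same pair is already stored."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO citation VALUES (?, ?, ?)",
                (patent_number, cited_patent_number, citation_date),
            )
        return True
    except sqlite3.IntegrityError:
        return False


# Patent numbers below are made up for illustration.
first = add_citation("9000001", "8000001")
second = add_citation("9000001", "8000002")     # same citing patent, different cited patent
duplicate = add_citation("9000001", "8000001")  # same pair again, rejected by the key
```

One citing patent can appear in many rows, but each (patent_number, cited_patent_number) pair can be stored only once.&lt;br /&gt;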
&lt;br /&gt;
===FEE===&lt;br /&gt;
Purpose: Represent information about the fees paid for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the fees paid on the patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO; it can contain letters, hence varchar&lt;br /&gt;
* four (boolean) payment of maintenance fee 4th year&lt;br /&gt;
* eight (boolean) payment of maintenance fee 8th year&lt;br /&gt;
* twelve (boolean) payment of maintenance fee 12th year&lt;br /&gt;
* fee_date (date) date that fee is paid on&lt;br /&gt;
* fee_code (varchar(255)) code for type of fee to pay&lt;br /&gt;
&lt;br /&gt;
===CLAIMS===&lt;br /&gt;
&lt;br /&gt;
Purpose: Represent information about the claims for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a claim that a patent makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number and claim_id&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO; it can contain letters, hence varchar&lt;br /&gt;
* claim_id (varchar(255)) id number for the claim&lt;br /&gt;
* language (varchar(255)) language of claim&lt;br /&gt;
* claim_type (varchar(255)) type of claim&lt;br /&gt;
* status (varchar(255)) status of the claim&lt;br /&gt;
* claim_text (text) the text of the claim&lt;br /&gt;
&lt;br /&gt;
===HISTPATENT===&lt;br /&gt;
Purpose: Represent historical data about the patent pertaining to dates and publication&lt;br /&gt;
&lt;br /&gt;
Each entry represents: sequence of historical data about an individual patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO; it can contain letters, hence varchar&lt;br /&gt;
* pubno (varchar(255)) patent publication number&lt;br /&gt;
* pubdate (date) date patent was issued&lt;br /&gt;
* dispdate (date) disposal date for the application&lt;br /&gt;
* distype (varchar(3)) application status of the patent - ABN for abandoned, ISS for issued, or PEN for pending&lt;br /&gt;
* exp_dt (date) expiration date of patent&lt;br /&gt;
* pta (int) length of patent term adjustment (which extends the amount of time the patent is allowed to be in force) in number of days&lt;br /&gt;
&lt;br /&gt;
===INVENTORS===&lt;br /&gt;
Purpose: Represent the inventors of each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about an inventor for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (and name, if there can be multiple inventors for a single patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO; it can contain letters, hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identifies the inventor?&lt;br /&gt;
* name (varchar(255)) concatenated string of the first and last name of the inventor&lt;br /&gt;
* orgname (varchar(255)) name of the organization the inventor works for&lt;br /&gt;
* state (varchar(255)) - state in which the inventor resided&lt;br /&gt;
* address (varchar(255)) - address at which the inventor resided&lt;br /&gt;
* country (varchar(255)) country in which the inventor resided&lt;br /&gt;
* city (varchar(255)) city in which the inventor resided&lt;br /&gt;
&lt;br /&gt;
===LAWYERS===&lt;br /&gt;
Purpose: Represent the lawyers who worked on each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a lawyer for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (there doesn't appear to be more than one lawyer entry for each patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO; it can contain letters, hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identifies the lawyer?&lt;br /&gt;
* org_name (varchar(255)) the name of the firm that did arbitration for the patent&lt;br /&gt;
&lt;br /&gt;
===Unique Attributes Tables===&lt;br /&gt;
&lt;br /&gt;
Several attributes have been identified that are unique, applying to only one kind of patent. Adding all these fields to the main patent table would introduce a lot of NULL entries, because three out of the four patent types will not have any particular attribute. Therefore, I have created three new tables to store these unique attributes - one for each type of patent with unique attributes. &lt;br /&gt;
&lt;br /&gt;
For each table, the primary key is the patent_number, which connects a row to a row in the main patent table. &lt;br /&gt;
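&lt;br /&gt;
As a rough illustration of how a unique attributes table connects back to the main patent table, the sketch below builds a reduced PATENT table and the PLANT table in an in-memory SQLite database and joins them on patent_number. The column subset, the sample rows, and the use of SQLite are assumptions made for the example only.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE patent (
        patent_number VARCHAR(255) PRIMARY KEY,
        patent_type   VARCHAR(255),
        title         VARCHAR(255)
    );
    CREATE TABLE plant (
        patent_number        VARCHAR(255) PRIMARY KEY REFERENCES patent (patent_number),
        latin_name           VARCHAR(255),
        us_botanical_variety VARCHAR(255)
    );
    """
)

# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO patent VALUES (?, ?, ?)",
    [("PP00001", "plant", "Rose plant"), ("9000001", "utility", "Widget")],
)
conn.execute(
    "INSERT INTO plant VALUES (?, ?, ?)",
    ("PP00001", "Rosa hybrida", "Example variety"),
)

# LEFT JOIN keeps every patent; plant-only columns are NULL for other types.
rows = conn.execute(
    """
    SELECT p.patent_number, p.patent_type, pl.latin_name
    FROM patent AS p
    LEFT JOIN plant AS pl ON pl.patent_number = p.patent_number
    ORDER BY p.patent_number
    """
).fetchall()
print(rows)
```

With this layout the NULL values appear only in query results that ask for plant attributes, not in the stored PATENT rows.&lt;br /&gt;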
&lt;br /&gt;
TODO: Figure out best way to generate these tables, finish description of attributes for each table.&lt;br /&gt;
&lt;br /&gt;
* Note that I did include some unique attributes for Utility patents as seen below in the &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot; section, but for now (11/3/2017) I'm not going to create a separate table for those fields. Most of them seem to be repeated in Reissue, and the other fields, while interesting, are perhaps not very useful.&lt;br /&gt;
&lt;br /&gt;
====PLANT====&lt;br /&gt;
=====Fields=====&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO; it can contain letters, hence varchar&lt;br /&gt;
* latin_name (varchar(255)) Latin name for the plant&lt;br /&gt;
* us_botanical_variety (varchar(255)) denotes what variety of plant it is - for example, a rose has several different varieties&lt;br /&gt;
&lt;br /&gt;
====REISSUE====&lt;br /&gt;
&lt;br /&gt;
In the following table, you'll notice lots of fields related to three different kinds of &amp;quot;documents&amp;quot; - a &amp;quot;parent document&amp;quot;, a &amp;quot;child document&amp;quot;, and a &amp;quot;parent grant document&amp;quot;. It is not immediately clear what these three documents represent for a reissue patent. After some research, I think I have determined what these &amp;quot;documents&amp;quot; are, but please know that I do not have definitive proof, as there is little available information about these &amp;quot;documents&amp;quot; with regard to a reissue patent.&lt;br /&gt;
&lt;br /&gt;
It is possible that the &amp;quot;parent document&amp;quot; is the parent patent application - that is, the first patent application filed for an invention. That would explain why reissue patents also have several fields related to a &amp;quot;child document&amp;quot; - a &amp;quot;child document&amp;quot; could be a child application, which is filed while a parent application is still pending. So in this case, a &amp;quot;child document&amp;quot; of a reissue patent would be an application regarding the same patent that was filed while the reissue application was still pending. A &amp;quot;child document&amp;quot; is either a continuation, divisional, or continuation-in-part application ([http://www.patenttrademarkblog.com/parent-and-child-patent-applications/ source]).&lt;br /&gt;
&lt;br /&gt;
Based on this logic, I think that the difference between a &amp;quot;parent document&amp;quot; and a &amp;quot;parent grant document&amp;quot; depends on whether the parent patent application has been granted. If the parent patent application is still pending (meaning the patent has not yet been granted) then I believe the reissue patent will store information about the parent patent under &amp;quot;parent document&amp;quot;. However, if the parent patent application has been granted, then the information will be stored under &amp;quot;parent grant document&amp;quot;. This seems like the most logical explanation, especially considering that the path to any field related to &amp;quot;parent grant document&amp;quot; contains &amp;quot;parent document&amp;quot;, as in the example below:&lt;br /&gt;
&lt;br /&gt;
 us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
&lt;br /&gt;
Therefore, a particular reissue patent will probably only have the fields filled for either the &amp;quot;parent document&amp;quot; or &amp;quot;parent grant document&amp;quot; if I am right about what they represent.&lt;br /&gt;
&lt;br /&gt;
=====Fields=====&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO; it can contain letters, hence varchar&lt;br /&gt;
* parent_doc_status (varchar(255)) status of the parent application. Probably related to whether the patent application is pending or not&lt;br /&gt;
* parent_doc_number (int) probably application number for the parent document&lt;br /&gt;
* parent_doc_id (varchar(255)) it is unclear how this is different from parent document number. In the xpaths, the path will be something like ./parent_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent document&amp;quot; is - or may refer to its purpose (i.e. reissue, continuation, continuation-in-part, etc.)&lt;br /&gt;
* parent_doc_country (varchar(255)) the country of origin of the parent document&lt;br /&gt;
* parent_doc_date (date) probably the date the parent document was filed&lt;br /&gt;
* parent_grant_doc_number (int) probably application number for the parent grant document&lt;br /&gt;
* parent_grant_doc_id (varchar(255)) it is unclear how this is different from parent grant document number. In the xpaths, the path will be something like ./parent_doc/parent-grant-doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_grant_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent grant document&amp;quot; is, or may refer to its purpose (i.e. reissue, continuation, continuation-in-part, etc.)&lt;br /&gt;
* parent_grant_doc_country (varchar(255)) the country of origin of the parent grant document&lt;br /&gt;
* parent_grant_doc_date (date) probably the date the parent grant document was filed&lt;br /&gt;
* child_doc_number (varchar(255)) probably the application number of the child application&lt;br /&gt;
* child_doc_id (varchar(255)) it is unclear how this is different from child document number. In the xpaths, the path will be something like ./child_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* child_doc_country (varchar(255)) country of origin of the child application&lt;br /&gt;
&lt;br /&gt;
====DESIGN====&lt;br /&gt;
=====Fields=====&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO; it can contain letters, hence varchar&lt;br /&gt;
* length_of_grant (int) length of grant, most likely in years&lt;br /&gt;
* hague_registration_date (date) filing date of international patent application&lt;br /&gt;
* hague_filing_date (date) not necessarily the same as the filing date of the international patent application, this is the date that the International Bureau receives all necessary elements for the international patent application&lt;br /&gt;
* hague_registration_pub_date (date) date that the International Bureau publishes the international patent application&lt;br /&gt;
* hague_international_registration_number (varchar(255)) international registration number&lt;br /&gt;
* edition (varchar(255)) possibly the edition of the Classification Locarno which determined main_classification&lt;br /&gt;
* main_classification (varchar(255)) classification for what type of design the patent is for&lt;br /&gt;
&lt;br /&gt;
==Connecting Patent database and Assignment database==&lt;br /&gt;
The answer to connecting the Patent database to the Assignment database lies somewhere in using the information in the table DOCUMENT_INFO to connect to a patent_id from the Patent database PATENT table for each assignment in the Assignment database ASSIGNMENT table. On further investigation into &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot;, the field they called &amp;quot;grant_doc_num&amp;quot; (grant_num above) in DOCUMENT_INFO is the patent number (the description of DOCUMENT_INFO above has been altered to reflect this).&lt;br /&gt;
&lt;br /&gt;
The paper mentions that there will be errors in the patent number stored in DOCUMENT_INFO under grant_doc_num, so a separate table called DOCUMENT_INFO_ADMIN was constructed to determine how prevalent the errors were. They queried the patent number for each appno_doc_num from administrative data. The &amp;quot;appno_doc_num&amp;quot; is the Application Document USPTO number from DOCUMENT_INFO. On further investigation into the paper, it became clear that &amp;quot;administrative data&amp;quot; refers to internal USPTO data that was available to the authors of the paper, but that we might not have. However, in 99% of cases the grant_doc_num and the queried patent number based on the appno_doc_num were the same, so we can probably rely on grant_doc_num.&lt;br /&gt;
&lt;br /&gt;
Therefore, since every table in the Patent database will have the patent number stored, to connect any table in Assignment to the Patent database, it would first be joined with DOCUMENT_INFO on rf_id and then joined with the appropriate table in Patent on patent number.&lt;br /&gt;
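&lt;br /&gt;
The two-step join described above can be sketched against an in-memory SQLite database. The table and column names follow this page, but the reduced column sets, the sample rows, and the assumption that the patent number column is called patent_number in both DOCUMENT_INFO and PATENT are for illustration only.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE assignment (rf_id BIGINT PRIMARY KEY, record_date DATE);
    CREATE TABLE document_info (rf_id BIGINT PRIMARY KEY, patent_number VARCHAR(255));
    CREATE TABLE patent (patent_number VARCHAR(255) PRIMARY KEY, title VARCHAR(255));
    """
)

# Made-up rows: two assignment transactions recorded for the same patent.
conn.executemany(
    "INSERT INTO assignment VALUES (?, ?)",
    [(100001, "2016-01-15"), (100002, "2017-03-02")],
)
conn.executemany(
    "INSERT INTO document_info VALUES (?, ?)",
    [(100001, "9000001"), (100002, "9000001")],
)
conn.execute("INSERT INTO patent VALUES (?, ?)", ("9000001", "Widget"))

# Step 1: join ASSIGNMENT to DOCUMENT_INFO on rf_id.
# Step 2: join DOCUMENT_INFO to PATENT on the patent number.
rows = conn.execute(
    """
    SELECT a.rf_id, a.record_date, p.patent_number, p.title
    FROM assignment AS a
    JOIN document_info AS d ON d.rf_id = a.rf_id
    JOIN patent AS p ON p.patent_number = d.patent_number
    ORDER BY a.record_date
    """
).fetchall()
print(rows)
```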
&lt;br /&gt;
Also in the paper, they mentioned the Assignments on the Web for Patents (AOTW-P), a searchable database of individual USPTO assignment records keyed on reel-frame identification, patent number, and assignor or assignee name (https://assignment.uspto.gov/patent/index.html#/patent/search). Obviously it would not be possible to use this tool by hand to query all the patent numbers. If it were possible to write a script that queries each patent number using the rf_id and parses the response, this could be useful for checking the patent numbers, but the results might not be any more accurate than what will already be in DOCUMENT_INFO.&lt;br /&gt;
&lt;br /&gt;
==Finding New Paths Unique to Plant, Reissue, and/or Design Patents==&lt;br /&gt;
&lt;br /&gt;
Based on Oliver's script, which searched all xpaths and compared which were unique to particular types, we see that the following attributes are unique to each type of patent other than utility patents. These attributes vary by the XML version, which changed over time. Therefore, the lists below are a superset of the attributes that are unique to each of the patent types listed below across all XML versions.&lt;br /&gt;
&lt;br /&gt;
When I am done finding the supersets, we will determine how to integrate these attributes into the design for the patent table.&lt;br /&gt;
&lt;br /&gt;
'''There appears to be a fifth patent type in the results from Oliver's script, and there are attributes that are unique to Utility patents, though they do not appear particularly useful'''&lt;br /&gt;
&lt;br /&gt;
===Plant Patents===&lt;br /&gt;
* US Claim Statement (XML45) - unclear what this represents&lt;br /&gt;
* Latin Name (XML44)&lt;br /&gt;
* US Botanical Variety (XML44)&lt;br /&gt;
&lt;br /&gt;
===Reissue Patents===&lt;br /&gt;
* Parent Status (XML45)&lt;br /&gt;
* Parent Document (XML44) - seems to be either a normal document or a parent grant document (or both), both contain the following fields&lt;br /&gt;
** Date&lt;br /&gt;
** Document number&lt;br /&gt;
** Country &lt;br /&gt;
** Document ID&lt;br /&gt;
** Kind (XML42)&lt;br /&gt;
* Child Document (XML44)&lt;br /&gt;
** Document ID&lt;br /&gt;
** Document Number&lt;br /&gt;
** Country (XML43)&lt;br /&gt;
&lt;br /&gt;
Note about fields related to the parent document and/or parent grant document:&lt;br /&gt;
The structure of the above fields related to a reissue patent is very odd and varies by XML version. For example, for most of the XML versions, parent grant document falls under parent document and has its own document number, date, country, document ID, kind, etc. The parent doc field falls under relation, which falls under reissue. &lt;br /&gt;
&lt;br /&gt;
However, in XML41, parent document falls under relation and a new category - US Reexamination Reissue Merger. The information pertaining to the child document also falls under both relation and US Reexamination Reissue Merger. It's possible that during the time period this XML version represents, there were two types of reissue patents - a reexamination related to a merger and a standard reissue. With discrepancies like these, we'll have to determine in the end whether we want to stick with the fields as they are designed above (in which case the information about a parent doc or parent grant document is stored the same way whether it is a reissue or a Reexamination Reissue), or whether these two classifications are different enough to warrant duplicate fields in the table.&lt;br /&gt;
&lt;br /&gt;
===Design Patents===&lt;br /&gt;
* length of grant (XML45)&lt;br /&gt;
* Hague Agreement Data (XML45) - allows people to file design patents in 66 countries with one application&lt;br /&gt;
** International Registration Date&lt;br /&gt;
** International Filing Date&lt;br /&gt;
** International Registration Publication Date&lt;br /&gt;
** International Registration Number&lt;br /&gt;
* Classification Locarno (XML40)&lt;br /&gt;
** Edition&lt;br /&gt;
** Main Classification&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please note that the lists above are meant to represent unique types that should be added to the patent database. This does not mean that every XML version contains all these fields, or that every version contains the same path to these fields. That will have to be determined separately. The XML number next to some of the items is meant to represent the latest XML version where the field was seen.&lt;br /&gt;
&lt;br /&gt;
Though I was mainly looking at unique fields in plant, reissue, and design patents, I noticed the following fields on unique paths for utility patents in XML41:&lt;br /&gt;
&lt;br /&gt;
* US Related Documents&lt;br /&gt;
** Substitution&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Parent Document&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Country&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
***** Parent Status&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Country&lt;br /&gt;
** Continuation/Continuation in Part&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Doc&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
** Division&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Kind&lt;br /&gt;
&lt;br /&gt;
This appears to resemble the information about reissue patents. Additionally, a couple of fields appear under the information about bibliographic data that might be useful:&lt;br /&gt;
* US Deceased Inventor (XML41)&lt;br /&gt;
** Post Code&lt;br /&gt;
** Address&lt;br /&gt;
* US Provisional Application Status (XML41)&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=21971</id>
		<title>Redesign Assignment and Patent Database</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=21971"/>
		<updated>2017-11-17T22:37:20Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* Fields */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Redesign Assignment and Patent Database&lt;br /&gt;
|Has owner=Shelby Bice,&lt;br /&gt;
|Has start date=9/2017&lt;br /&gt;
|Has keywords=patent&lt;br /&gt;
|Is dependent on=Reproducible Patent Data,&lt;br /&gt;
}}&lt;br /&gt;
'''FOR ED:''' I finished going through the unique paths (you can see the unique fields I identified at the bottom of this page, under &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot;). Then, under my design for the Patent Database, there's a new section called &amp;quot;Unique Attributes Tables&amp;quot; where I've begun detailing a design for three new tables that would include information that is unique to Design, Reissue, and Plant patents. Here are some questions I have:&lt;br /&gt;
&lt;br /&gt;
'''1.''' Do you like the design for the new tables? How do you think we should populate these tables? I can think of one way (putting all the information in the main patent tables and then moving it out to Plant, Reissue, or Design as needed) but I was wondering if there might be an easier way. &lt;br /&gt;
&lt;br /&gt;
'''2.''' There were some unique attributes to utility patents that I also included in &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot;. They only appear to be in two of the oldest XML versions (4.1 and 4.0) and I'm not sure the information is particularly useful - but please take a look at the field names and let me know if you disagree and think they are useful and should be included in a table in the patent database.&lt;br /&gt;
&lt;br /&gt;
'''3.''' Do you know what some of these fields represent? For example, I've been trying to research online what a &amp;quot;Parent Document&amp;quot; and/or &amp;quot;Parent Grant Document&amp;quot; might represent for a reissue patent, and I've found a couple possible options. I will keep researching, but if you know what they represent or one of my descriptions for a field is wrong, please let me know.&lt;br /&gt;
&lt;br /&gt;
This is an extension of the work I did last semester under &amp;quot;Redesigning Patent Database&amp;quot;. Instead of simply reconfiguring the existing database, this project encompasses a full redesign and creation of a new Patent database and a new Assignment database that will be joined together.&lt;br /&gt;
&lt;br /&gt;
Reference for definitions of fields in the two databases: https://www.oecd.org/sti/sci-tech/37569498.pdf&lt;br /&gt;
&lt;br /&gt;
Paper on how to extract data from XML files, parse it, and put it into a database: https://funginstitute.berkeley.edu/wp-content/uploads/2014/06/patentprocessor.pdf&lt;br /&gt;
&lt;br /&gt;
'''Particularly useful, because it provides solutions for cleaning location and firm data'''&lt;br /&gt;
&lt;br /&gt;
==ER Diagram for Assignment and Patent Databases==&lt;br /&gt;
&lt;br /&gt;
Note: this is currently out of sync with the current description below for the patent database. It will be updated once I have finalized how we will organize unique attributes for plant, design, and reissue patents.&lt;br /&gt;
&lt;br /&gt;
[[File:PatentAndAssignmentER2.png]]&lt;br /&gt;
&lt;br /&gt;
Attributes for each table are listed below with descriptions - because of how many attributes there are, I decided the ER diagram would be better suited as an overview of the tables rather than trying to show all the attributes on the diagram.&lt;br /&gt;
&lt;br /&gt;
==Assignment Database Structure==&lt;br /&gt;
&lt;br /&gt;
After reading through the paper &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot; (https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf) and taking into account our needs for the Assignment database (the ability to connect the two databases together, the ability to trace conveyance of a patent over time) I decided on the following structure for the Assignment database.&lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table in the database is rf_id, which stands for reel frame id. It is unique for each entry in the main table, assignment, and is included in every table. &lt;br /&gt;
&lt;br /&gt;
Conveyance is the only table that cannot be fully constructed as data is inserted into the database. Some of the fields depend on other tables being constructed, so it will either be partially populated while the other tables are being populated or be the last table to be populated. &lt;br /&gt;
&lt;br /&gt;
Full list of tables: assignee, assignor, document_info, assignment, conveyance&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: store all the assignees from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the recipient of a patent in an entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id, unless it is possible to have multiple assignees, then rf_id and assignee name&lt;br /&gt;
&lt;br /&gt;
All columns (variables) in the table: &lt;br /&gt;
&lt;br /&gt;
* rf_id (bigint) reel frame number &lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* address_line_1 (varchar(255)) first line of address of assignee&lt;br /&gt;
* address_line_2 (varchar(255)) second line of address of assignee&lt;br /&gt;
* city (varchar(255)) city of assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* country (varchar(255)) country of assignee&lt;br /&gt;
* postal_code (varchar(255)) post code of the assignee&lt;br /&gt;
&lt;br /&gt;
===ASSIGNOR===&lt;br /&gt;
	&lt;br /&gt;
Purpose: store all the assignors from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the one assigning/granting the license for the entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id and assignor_name&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
&lt;br /&gt;
===CONVEYANCE===&lt;br /&gt;
Purpose: represent the conveyance of a particular patent over time&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual transaction for a patent (not unlike assignment, except this table has information about the assignment in relation to other assignments for the patent)&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id &lt;br /&gt;
&lt;br /&gt;
All columns (variables): Two possibilities, depending on whether we want data to be repeated or not.&lt;br /&gt;
	&lt;br /&gt;
Not-repeated (with exception of primary/foreign key): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* next_transaction (bigint) rf_id of next transaction that occurred after this transaction (traced by assignee/assignor)&lt;br /&gt;
* num_conveyance (int) number in order of conveyance by date&lt;br /&gt;
* acknowledgment_date (date) date that the assignor acknowledged the transaction&lt;br /&gt;
* execution_date (date) date that the transaction took place&lt;br /&gt;
* conveyance_type (varchar(255)) the USPTO paper determined a couple of different conveyance types and also outlined how to determine based on the conveyance text in the xml file which conveyance type is relevant to the transaction. The conveyance types are:&lt;br /&gt;
** assignment&lt;br /&gt;
** merger&lt;br /&gt;
** change of name&lt;br /&gt;
** government interest&lt;br /&gt;
** agreement&lt;br /&gt;
** security agreement&lt;br /&gt;
** release&lt;br /&gt;
&lt;br /&gt;
	&lt;br /&gt;
Repeated, also include: &lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
* patent_number (varchar(255)) identifying number for the patent&lt;br /&gt;
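&lt;br /&gt;
Because num_conveyance and next_transaction depend on the other transactions for the same patent, they have to be derived after the raw rows are loaded. The sketch below shows one possible way to do that by sorting each patent's transactions by execution date. It is an assumption, not the method from the USPTO paper: the function name is hypothetical, it treats each rf_id as covering a single patent, and it orders by date rather than tracing assignee and assignor names.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict


def order_conveyances(transactions):
    """Return {rf_id: (num_conveyance, next_transaction)}.

    transactions: iterable of (rf_id, patent_number, execution_date) tuples,
    with dates as ISO "YYYY-MM-DD" strings so that they sort chronologically.
    next_transaction is None for the latest transaction of a patent.
    """
    by_patent = defaultdict(list)
    for rf_id, patent_number, execution_date in transactions:
        by_patent[patent_number].append((execution_date, rf_id))

    result = {}
    for chain in by_patent.values():
        chain.sort()  # earliest execution date first
        for position, (_, rf_id) in enumerate(chain):
            is_last = position == len(chain) - 1
            next_rf_id = None if is_last else chain[position + 1][1]
            result[rf_id] = (position + 1, next_rf_id)
    return result


# Made-up transactions for illustration only.
sample = [
    (100002, "9000001", "2017-03-02"),
    (100001, "9000001", "2016-01-15"),
    (100003, "9000002", "2015-06-30"),
]
ordered = order_conveyances(sample)
print(ordered)
```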
&lt;br /&gt;
===DOCUMENT_INFO===&lt;br /&gt;
Purpose: store extra information relevant to the patents represented in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the documentation and the patent for a particular entry in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables) when created: &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* app_num (varchar(255)) application document USPTO number &lt;br /&gt;
* app_date (date) date that the application was filed&lt;br /&gt;
* app_country (varchar(255)) country in which it was filed&lt;br /&gt;
* pgpub_num (varchar(255)) pre-grant publication document USPTO number&lt;br /&gt;
* pgpub_date (date) date that pre-grant publication was released&lt;br /&gt;
* pgpub_country (varchar(255)) country in which the patent is published before being granted a patent number&lt;br /&gt;
* patent_number (varchar(255)) granted patent document USPTO number&lt;br /&gt;
* grant_date (date) date that the patent is officially published&lt;br /&gt;
* grant_country (varchar(255)) country in which the patent is published after being granted a patent number&lt;br /&gt;
* invention_title (varchar(255)) title of the invention&lt;br /&gt;
* language (varchar(255)) language of invention title – could be potentially useful and interesting to investigate &lt;br /&gt;
* reel_num (bigint) number of the reel the assignment was stored on&lt;br /&gt;
* frame_num (bigint) number of the frame on the reel the assignment was stored on&lt;br /&gt;
&lt;br /&gt;
===ASSIGNMENT===&lt;br /&gt;
Purpose: represent assignment transactions – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent assignment transaction&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignment_id (bigint) US patent assignment id &lt;br /&gt;
* correspondent_name (varchar(255)) name of correspondent for the transaction, could be a lawyer&lt;br /&gt;
* correspondent_address1 (varchar(255)) first line of address for correspondent&lt;br /&gt;
* correspondent_address2 (varchar(255)) second line of address for correspondent&lt;br /&gt;
* correspondent_address3 (varchar(255)) third line of address for correspondent&lt;br /&gt;
* correspondent_address4 (varchar(255)) fourth line of address for correspondent&lt;br /&gt;
* record_date (date) date recorded with USPTO&lt;br /&gt;
* last_update_date (date) date that information for this assignment was last updated with the USPTO &lt;br /&gt;
* page_cnt (bigint) page count of assignment record&lt;br /&gt;
&lt;br /&gt;
More extensive notes exist under :E/McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper&lt;br /&gt;
&lt;br /&gt;
==Patent Database Structure==&lt;br /&gt;
&lt;br /&gt;
Most of this design is based off what I investigated last semester and therefore is subject to change, since the design for the Assignment database is somewhat new. &lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table is patent_number, which uniquely identifies every patent in the PATENT table and is included in every table.&lt;br /&gt;
&lt;br /&gt;
Full list of tables: patent, assignee, citation, fee, claims, histpatent, lawyers, inventors&lt;br /&gt;
&lt;br /&gt;
Update: These tables, as of October 26, are being updated and changed as I determine what data can actually be retrieved from the XML files based on the xpaths that we have determined (you can see these xpaths on the &amp;quot;Equivalent XPath and APS Queries&amp;quot; project page).&lt;br /&gt;
&lt;br /&gt;
===PATENT===&lt;br /&gt;
Purpose: represent patents – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* patent_type (varchar(255)) the type of patent, like utility&lt;br /&gt;
* patent_kind (varchar(2)) the kind of patent, a letter and a number&lt;br /&gt;
* title (varchar(255)) the title of the patent&lt;br /&gt;
* grantdate (date) date that patent was officially granted including the year&lt;br /&gt;
* prioritydate (date) date used to establish novelty of an invention in regards to other inventions (source: http://www.bios.net/daisy/patentlens/2343.html)&lt;br /&gt;
* prioritycountry (varchar(255)) country where patent is first filed&lt;br /&gt;
* patent_country (varchar(255)) country in which this patent was published&lt;br /&gt;
* prioritypatentnumber (varchar(255)) placeholder patent number if the patent is published before it is fully approved - also referred to as priority claims number&lt;br /&gt;
* cpcsubclass (varchar(255)) cooperative patent classification subclass&lt;br /&gt;
* cpcsubgroup (varchar(255)) cooperative patent classification subgroup&lt;br /&gt;
* cpcmaingroup (varchar(255)) cooperative patent classification main group&lt;br /&gt;
* cpctotal (varchar(255)) concatenated version of cpcmaingroup, cpcsubgroup, cpcsubclass&lt;br /&gt;
* pctpatentnumber (varchar(255)) international patent number (according to Patent Cooperation Treaty) also known as PCT Document Number&lt;br /&gt;
* appnum (varchar(255)) application number, probably also the filing number - the format is a two digit series code followed by a six digit serial number (source: https://www.uspto.gov/patents-application-process/filing-online/info-application-number)&lt;br /&gt;
* appdate (date) date that the application was filed, including year - probably also the filing date&lt;br /&gt;
* ipcrsubclass (varchar(255)) International patent classification subclass&lt;br /&gt;
* ipcrmaingroup (varchar(255)) International patent classification main group&lt;br /&gt;
* ipcrsubgroup (varchar(255)) International patent classification subgroup&lt;br /&gt;
* ipcrtotal (varchar(255)) concatenated version of ipcrsubgroup, ipcrsubclass, ipcrmaingroup &lt;br /&gt;
* national_classification (varchar(255)) - probably uspc, which is United States Patent Classification&lt;br /&gt;
* national_classification_country (varchar(255)) - should be United States&lt;br /&gt;
* num_claims (int) number of claims (things that the inventor wishes to protect with this patent)&lt;br /&gt;
&lt;br /&gt;
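To illustrate the appnum format described above (a two digit series code followed by a six digit serial number), here is a small Python sketch. The function name split_app_num and the sample values are hypothetical, and real data may contain separators or other formats that this does not handle.&lt;br /&gt;
&lt;br /&gt;
```python
import re


def split_app_num(app_num):
    # Split an eight digit application number into (series_code, serial_number).
    # Returns None when the value is not exactly eight digits.
    match = re.fullmatch(r'(\d{2})(\d{6})', app_num.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)


print(split_app_num('12345678'))
```
&lt;br /&gt;
Keeping appnum as varchar rather than a number preserves any leading zero in the series code.&lt;br /&gt;
&lt;br /&gt;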
Discrepancies: &lt;br /&gt;
These fields were previously a part of the table, but we do not currently have xpaths for them:&lt;br /&gt;
* app_year - the year the application was filed&lt;br /&gt;
* grant_year - the year the patent was granted&lt;br /&gt;
* uspc - United States Patent Classification (might be the same as national_classification)&lt;br /&gt;
* uspcsub - United States Patent subgroup&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: Represent the assignee of the patent at publication according to the patent information (the information in the Assignment database is more complete) &lt;br /&gt;
&lt;br /&gt;
Each entry represents: the assignee for a patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number, first_name, last_name (in case it is possible for there to be multiple assignees for an individual patent)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar &lt;br /&gt;
* first_name (varchar(255)) first name of assignee&lt;br /&gt;
* last_name (varchar(255)) last name of assignee&lt;br /&gt;
* address (varchar(255)) address of assignee&lt;br /&gt;
* postcode (varchar(255)) postcode of assignee&lt;br /&gt;
* orgname (varchar(255)) name of the organization owning the patent (if applicable - organization may go by multiple orgnames) &lt;br /&gt;
* city (varchar(255)) city of the assignee&lt;br /&gt;
* country (varchar(255)) country of the assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* residence (varchar(255)) mailing address of assignee if they receive mail at a different address than the one listed above&lt;br /&gt;
* size_of_firm (varchar(255)) size of the organization (large, small, or micro), which relates to how much they pay in patent fees&lt;br /&gt;
&lt;br /&gt;
===CITATION===&lt;br /&gt;
Purpose: Represent all the patents in PATENT that another patent in PATENT cites&lt;br /&gt;
&lt;br /&gt;
Each entry represents: one citation that a patent in PATENT makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number AND cited_patent_number (because a patent might cite multiple patents, so both variables are needed for the primary key)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* cited_patent_number (varchar(255)) unique identifier from the USPTO for the patent being cited&lt;br /&gt;
* description (varchar(255)) description of the citation&lt;br /&gt;
* citation_date (date) unclear what this date is supposed to represent from the DTD, but according to the paper that is the second link in the top section, it is the &amp;quot;date of cited document&amp;quot;&lt;br /&gt;
* citation_kind (varchar(255)) type of cited document&lt;br /&gt;
* citation_country (varchar(255)) country of origin for the cited patent &lt;br /&gt;
* citation_name (varchar(255)) appears to be the names of those to whom the cited patent is assigned&lt;br /&gt;
&lt;br /&gt;
Note: There are other fields available for this table, such as document number, but this seems like it might be a repeat of the document number for the citing patent&lt;br /&gt;
&lt;br /&gt;
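The composite primary key described above can be sketched as follows, using an in-memory SQLite database from Python. Only a few of the columns are included, the sample patent numbers are made up, and SQLite is an assumption standing in for the real database.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE citation ('
    ' patent_number TEXT NOT NULL,'
    ' cited_patent_number TEXT NOT NULL,'
    ' citation_date TEXT,'
    ' citation_kind TEXT,'
    ' PRIMARY KEY (patent_number, cited_patent_number))'
)

# One citing patent may cite several patents (sample numbers are made up).
rows = [
    ('7000001', '6000001', '1999-12-14', 'A'),
    ('7000001', '6000002', '1999-12-21', 'A'),
]
conn.executemany('INSERT INTO citation VALUES (?, ?, ?, ?)', rows)

# Repeating the same (citing, cited) pair violates the composite key.
try:
    conn.execute('INSERT INTO citation VALUES (?, ?, ?, ?)', rows[0])
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

count = conn.execute('SELECT COUNT(*) FROM citation').fetchone()[0]
print(count, duplicate_rejected)
```
&lt;br /&gt;
Both citations from the same citing patent are accepted, while the repeated pair is rejected, which is the behavior the two-column key is meant to give.&lt;br /&gt;
&lt;br /&gt;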
===FEE===&lt;br /&gt;
Purpose: Represent information about the fees paid for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the fees paid on the patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* four (boolean) payment of maintenance fee 4th year&lt;br /&gt;
* eight (boolean) payment of maintenance fee 8th year&lt;br /&gt;
* twelve (boolean) payment of maintenance fee 12th year&lt;br /&gt;
* fee_date (date) date that fee is paid on&lt;br /&gt;
* fee_code (varchar(255)) code for type of fee to pay&lt;br /&gt;
&lt;br /&gt;
===CLAIMS===&lt;br /&gt;
&lt;br /&gt;
Purpose: Represent information about the claims for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a claim that a patent makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number and claim_id&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* claim_id (varchar(255)) id number for the claim&lt;br /&gt;
* language (varchar(255)) language of claim&lt;br /&gt;
* claim_type (varchar(255)) type of claim&lt;br /&gt;
* status (varchar(255)) status of the claim&lt;br /&gt;
* claim_text (text) the text of the claim&lt;br /&gt;
&lt;br /&gt;
===HISTPATENT===&lt;br /&gt;
Purpose: Represent historical data about the patent pertaining to dates and publication&lt;br /&gt;
&lt;br /&gt;
Each entry represents: sequence of historical data about an individual patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* pubno (varchar(255)) patent publication number&lt;br /&gt;
* pubdate (date) date patent was issued&lt;br /&gt;
* dispdate (date) disposal date for the application&lt;br /&gt;
* distype (varchar(3)) application status of the patent - ABN for abandoned, ISS for issued, or PEN for pending&lt;br /&gt;
* exp_dt (date) expiration date of patent&lt;br /&gt;
* pta (int) length of patent term adjustment (which extends the amount of time the patent is allowed to be in force) in number of days&lt;br /&gt;
&lt;br /&gt;
===INVENTORS===&lt;br /&gt;
Purpose: Represent the inventors for each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about an inventor for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (and name, if there can be multiple inventors for a single patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identify the inventor?&lt;br /&gt;
* name (varchar(255)) concatenated string of the first and last name of the inventor&lt;br /&gt;
* orgname (varchar(255)) name of the organization the inventor works for&lt;br /&gt;
* state (varchar(255)) - state in which the inventor resided&lt;br /&gt;
* address (varchar(255)) - address at which the inventor resided&lt;br /&gt;
* country (varchar(255)) country in which the inventor resided&lt;br /&gt;
* city (varchar(255)) city in which the inventor resided&lt;br /&gt;
&lt;br /&gt;
===LAWYERS===&lt;br /&gt;
Purpose: Represent the lawyers who worked on each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a lawyer for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (there doesn't appear to be more than one lawyer entry for each patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identify the lawyer?&lt;br /&gt;
* org_name (varchar(255)) the name of the firm that did arbitration for the patent&lt;br /&gt;
&lt;br /&gt;
===Unique Attributes Tables===&lt;br /&gt;
&lt;br /&gt;
There are several attributes that have been identified as unique, applying to only one kind of patent. Adding all these fields to the main patent table would introduce a lot of NULL entries, because patents of three out of the four types will not have any particular attribute. Therefore, I have created three new tables to store these unique attributes - one for each type of patent with unique attributes. &lt;br /&gt;
&lt;br /&gt;
For each table, the primary key is the patent_number, which connects a row to a row in the main patent table. &lt;br /&gt;
&lt;br /&gt;
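A minimal sketch of this layout, assuming an in-memory SQLite database and using the PLANT table as the example: the plant-only columns live in their own table keyed on patent_number, and a LEFT JOIN brings them back when needed. The sample patent numbers and plant names are made up for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript(
    'CREATE TABLE patent ('
    ' patent_number TEXT PRIMARY KEY,'
    ' patent_type TEXT);'
    'CREATE TABLE plant ('
    ' patent_number TEXT PRIMARY KEY REFERENCES patent (patent_number),'
    ' latin_name TEXT,'
    ' us_botanical_variety TEXT);'
)
conn.executemany(
    'INSERT INTO patent VALUES (?, ?)',
    [('PP00001', 'plant'), ('7000001', 'utility')],
)
conn.execute(
    'INSERT INTO plant VALUES (?, ?, ?)',
    ('PP00001', 'Rosa hybrida', 'Hybrid tea rose'),
)

# LEFT JOIN keeps every patent. The plant columns come back as NULL (None)
# for non-plant patents, but no NULLs are stored in the patent table itself.
rows = conn.execute(
    'SELECT p.patent_number, p.patent_type, pl.latin_name'
    ' FROM patent AS p'
    ' LEFT JOIN plant AS pl ON pl.patent_number = p.patent_number'
    ' ORDER BY p.patent_number'
).fetchall()
print(rows)
```
&lt;br /&gt;
The same pattern would apply to the reissue and design tables, each keyed on patent_number.&lt;br /&gt;
&lt;br /&gt;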
TODO: Figure out best way to generate these tables, finish description of attributes for each table.&lt;br /&gt;
&lt;br /&gt;
* Note that I did include some unique attributes for Utility patents as seen below in the &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot; section, but for now (11/3/2017) I'm not going to create a separate table for those fields. Most of them seem to be repeated in Reissue, and the other fields, while interesting, are perhaps not very useful.&lt;br /&gt;
&lt;br /&gt;
====PLANT====&lt;br /&gt;
=====Fields=====&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* latin_name (varchar(255)) latin name for the plant&lt;br /&gt;
* us_botanical_variety (varchar(255)) denotes what variety of plant it is - for example, a rose has several different varieties&lt;br /&gt;
&lt;br /&gt;
====REISSUE====&lt;br /&gt;
&lt;br /&gt;
In the following table, you'll notice lots of fields related to three different kinds of &amp;quot;documents&amp;quot; - a &amp;quot;parent document&amp;quot;, a &amp;quot;child document&amp;quot;, and a &amp;quot;parent grant document&amp;quot;. It is not immediately clear what these three documents represent for a reissue patent. After some research, I think I have determined what these &amp;quot;documents&amp;quot; are, but please know that I do not have definitive proof as there is little available information about these &amp;quot;documents&amp;quot; in regards to a reissue patent.&lt;br /&gt;
&lt;br /&gt;
It is possible that the &amp;quot;parent document&amp;quot; is the parent patent application - that is, the first patent application filed in regards to an invention. That would explain why reissue patents also have several fields related to a &amp;quot;child document&amp;quot; - a &amp;quot;child document&amp;quot; could be a child application, which is filed while a parent application is still pending. So in this case, a &amp;quot;child document&amp;quot; of a reissue patent would be an application regarding the same patent that was filed while the reissue application was still pending. A &amp;quot;child document&amp;quot; is either a continuation, divisional, or continuation-in-part application ([http://www.patenttrademarkblog.com/parent-and-child-patent-applications/ source]).&lt;br /&gt;
&lt;br /&gt;
Based on this logic, I think that the difference between a &amp;quot;parent document&amp;quot; and a &amp;quot;parent grant document&amp;quot; depends on whether the parent patent application has been granted. If the parent patent application is still pending (meaning the patent has not yet been granted) then I believe the reissue patent will store information about the parent patent under &amp;quot;parent document&amp;quot;. However, if the parent patent application has been granted, then the information will be stored under &amp;quot;parent grant document&amp;quot;. This seems like the most logical explanation, especially considering that the path to any field related to &amp;quot;parent grant document&amp;quot; contains &amp;quot;parent document&amp;quot;, as in the example below:&lt;br /&gt;
&lt;br /&gt;
 us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
&lt;br /&gt;
Therefore, a particular reissue patent will probably only have the fields filled for either the &amp;quot;parent document&amp;quot; or &amp;quot;parent grant document&amp;quot; if I am right about what they represent.&lt;br /&gt;
&lt;br /&gt;
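As a rough sketch of how the two date fields could be pulled from a grant record, the Python snippet below uses xml.etree.ElementTree with paths taken relative to the us-patent-grant root. The parent document path is inferred from the fields described here and is an assumption, as are the helper names and the sample date; the tiny record is built in code instead of being read from a real USPTO file.&lt;br /&gt;
&lt;br /&gt;
```python
import xml.etree.ElementTree as ET

RELATION = 'us-bibliographic-data-grant/us-related-documents/reissue/relation'
# Assumed path for the plain parent document date.
PARENT_DATE = RELATION + '/parent-doc/document-id/date'
# Path from the example above, relative to the us-patent-grant root.
PARENT_GRANT_DATE = RELATION + '/parent-doc/parent-grant-document/document-id/date'


def add_path(root, path, text):
    # Create (or reuse) the nested elements named in path and set the leaf text.
    node = root
    for tag in path.split('/'):
        child = node.find(tag)
        if child is None:
            child = ET.SubElement(node, tag)
        node = child
    node.text = text


def parent_dates(root):
    # Return (parent_doc_date, parent_grant_doc_date); missing fields are None.
    return root.findtext(PARENT_DATE), root.findtext(PARENT_GRANT_DATE)


# Made-up reissue record that only carries the parent grant document date.
root = ET.Element('us-patent-grant')
add_path(root, PARENT_GRANT_DATE, '20050412')
print(parent_dates(root))
```
&lt;br /&gt;
For this made-up record the parent document date comes back as None and only the parent grant document date is filled, which is the situation described above if my reading of the two documents is right.&lt;br /&gt;
&lt;br /&gt;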
=====Fields=====&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* parent_doc_status (varchar(255)) status of the parent application. Probably related to whether the patent application is pending or not&lt;br /&gt;
* parent_doc_number (int) probably application number for the parent document&lt;br /&gt;
* parent_doc_id (varchar(255)) it is unclear how this is different from parent document number. In the xpaths, the path will be something like ./parent_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent document&amp;quot; is - or may refer to its purpose (i.e. reissue, continuation, continuation-in-part, etc.)&lt;br /&gt;
* parent_doc_country (varchar(255)) the country of origin of the parent document&lt;br /&gt;
* parent_doc_date (date) probably the date the parent document was filed&lt;br /&gt;
* parent_grant_doc_number (int) probably application number for the parent grant document&lt;br /&gt;
* parent_grant_doc_id (varchar(255)) it is unclear how this is different from parent grant document number. In the xpaths, the path will be something like ./parent_doc/parent-grant-doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_grant_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent grant document&amp;quot; is, or may refer to its purpose (i.e. reissue, continuation, continuation-in-part, etc.)&lt;br /&gt;
* parent_grant_doc_country (varchar(255)) the country of origin of the parent grant document&lt;br /&gt;
* parent_grant_doc_date (date) probably the date the parent grant document was filed&lt;br /&gt;
* child_doc_number (varchar(255)) probably the application number of the child application&lt;br /&gt;
* child_doc_id (varchar(255)) it is unclear how this is different from child document number. In the xpaths, the path will be something like ./child_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* child_doc_country (varchar(255)) country of origin of the child application&lt;br /&gt;
&lt;br /&gt;
====DESIGN====&lt;br /&gt;
=====Fields=====&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* length_of_grant (int) length of grant, most likely in years&lt;br /&gt;
* hague_registration_date (date) filing date of international patent application&lt;br /&gt;
* hague_filing_date (date) not necessarily the same as the filing date of the international patent application, this is the date that the International Bureau receives all necessary elements for the international patent application&lt;br /&gt;
* hague_registration_pub_date (date) date that the International Bureau publishes the international patent application&lt;br /&gt;
* hague_international_registration_number (varchar(255)) international registration number&lt;br /&gt;
* edition (varchar(255)) possibly the edition of the Classification Locarno which determined main_classification&lt;br /&gt;
* main_classification (varchar(255)) classification for what type of design the patent is for&lt;br /&gt;
&lt;br /&gt;
==Connecting Patent database and Assignment database==&lt;br /&gt;
The answer to connecting the Patent database to the Assignment database lies in using the information in the table DOCUMENT_INFO to connect each assignment in the Assignment database ASSIGNMENT table to a patent_number in the Patent database PATENT table. On further investigation into &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot;, the field they called &amp;quot;grant_doc_num&amp;quot; (patent_number above) in DOCUMENT_INFO is the patent number (the description of DOCUMENT_INFO above has been altered to reflect this).&lt;br /&gt;
&lt;br /&gt;
The paper mentions that there will be errors in the patent number stored in DOCUMENT_INFO under grant_doc_num, so a separate table called DOCUMENT_INFO_ADMIN was constructed to determine how prevalent errors were. They queried the patent number for the appno_doc_num from administrative data. The &amp;quot;appno_doc_num&amp;quot; is the Application Document USPTO number from DOCUMENT_INFO. On further investigation into the paper, it became clear that &amp;quot;administrative data&amp;quot; refers to internal USPTO data that was available to the authors of the paper, but that we might not have. However, in 99% of cases the grant_doc_num and the queried patent number based on the appno_doc_num were the same, so we can probably rely on grant_doc_num.&lt;br /&gt;
&lt;br /&gt;
Therefore, since every table in the Patent database will have the patent number stored, to connect any table in Assignment to the Patent database, it would first be joined with DOCUMENT_INFO on rf_id and then joined with the appropriate table in Patent on patent number.&lt;br /&gt;
&lt;br /&gt;
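The two-step join described above can be sketched as follows. This assumes, for illustration only, that the Assignment and Patent tables are reachable from one connection (here they are simply tables in one in-memory SQLite database), that each table is cut down to a couple of columns, and that the sample values are made up.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript(
    'CREATE TABLE assignment ('
    ' rf_id INTEGER PRIMARY KEY,'
    ' record_date TEXT);'
    'CREATE TABLE document_info ('
    ' rf_id INTEGER PRIMARY KEY REFERENCES assignment (rf_id),'
    ' patent_number TEXT);'
    'CREATE TABLE patent ('
    ' patent_number TEXT PRIMARY KEY,'
    ' title TEXT);'
)
conn.execute('INSERT INTO assignment VALUES (?, ?)', (1001, '2010-05-04'))
conn.execute('INSERT INTO document_info VALUES (?, ?)', (1001, '7000001'))
conn.execute('INSERT INTO patent VALUES (?, ?)', ('7000001', 'Example widget'))

# Step 1: ASSIGNMENT to DOCUMENT_INFO on rf_id.
# Step 2: DOCUMENT_INFO to PATENT on patent_number.
rows = conn.execute(
    'SELECT a.rf_id, a.record_date, p.patent_number, p.title'
    ' FROM assignment AS a'
    ' JOIN document_info AS d ON d.rf_id = a.rf_id'
    ' JOIN patent AS p ON p.patent_number = d.patent_number'
).fetchall()
print(rows)
```
&lt;br /&gt;
Any other Assignment table keyed on rf_id could take the place of assignment in the first join, and any Patent table keyed on patent_number could take the place of patent in the second.&lt;br /&gt;
&lt;br /&gt;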
Also in the paper, they mentioned the Assignments on the Web for Patents (AOTW-P), a searchable database of individual USPTO assignment records keyed on reel-frame identification, patent number, and assignor or assignee name (https://assignment.uspto.gov/patent/index.html#/patent/search). Obviously it would not be practical to use this tool by hand to query all the patent numbers, but if it were possible to write a script to query each patent number using the rf_id and parse the response, this could be useful for checking the patent numbers, though it might not be any more accurate than what will already be in DOCUMENT_INFO.&lt;br /&gt;
&lt;br /&gt;
==Finding New Paths Unique to Plant, Reissue, and/or Design Patents==&lt;br /&gt;
&lt;br /&gt;
Based on Oliver's script which searched all xpaths and compared which were unique to particular types, we see that the following attributes are unique to each type of patent other than utility patents. These attributes vary by the XML version, which changed over time. Therefore, the lists below are a superset of the attributes that are unique to each of the patent types listed below across all XML versions.&lt;br /&gt;
&lt;br /&gt;
When I am done finding the supersets, we will determine how to integrate these attributes into the design for the patent table.&lt;br /&gt;
&lt;br /&gt;
'''There appears to be a fifth patent type in the results from Oliver's script, and there are attributes that are unique to Utility patents, though they do not appear particularly useful'''&lt;br /&gt;
&lt;br /&gt;
===Plant Patents===&lt;br /&gt;
* US Claim Statement (XML45) - unclear what this represents&lt;br /&gt;
* Latin Name (XML44)&lt;br /&gt;
* US Botanical Variety (XML44)&lt;br /&gt;
&lt;br /&gt;
===Reissue Patents===&lt;br /&gt;
* Parent Status (XML45)&lt;br /&gt;
* Parent Document (XML44) - seems to be either a normal document or a parent grant document (or both), both contain the following fields&lt;br /&gt;
** Date&lt;br /&gt;
** Document number&lt;br /&gt;
** Country &lt;br /&gt;
** Document ID&lt;br /&gt;
** Kind (XML42)&lt;br /&gt;
* Child Document (XML44)&lt;br /&gt;
** Document ID&lt;br /&gt;
** Document Number&lt;br /&gt;
** Country (XML43)&lt;br /&gt;
&lt;br /&gt;
Note about fields related to the parent document and/or parent grant document:&lt;br /&gt;
The structure of the above fields related to a reissue patent is very odd and varies by XML version. For example, for most of the XML versions, parent grant document falls under parent document and has its own document number, date, country, document ID, kind, etc. The parent doc field falls under relation, which falls under reissue. &lt;br /&gt;
&lt;br /&gt;
However, in XML41, parent document falls under relation and a new category - US Reexamination Reissue Merger. The information pertaining to the child document also falls under both relation and US Reexamination Reissue Merger. It's possible that during the time period this XML version represents, there were two types of reissue patents - a reexamination related to a merger and a standard reissue. With discrepancies like these, we'll have to determine in the end whether we want to stick with the fields as they are designed above (in which case the information about a parent doc or parent grant document is stored the same way whether it is a reissue or a Reexamination Reissue), or whether these two classifications are different enough to warrant duplicate fields in the table.&lt;br /&gt;
&lt;br /&gt;
===Design Patents===&lt;br /&gt;
* length of grant (XML45)&lt;br /&gt;
* Hague Agreement Data (XML45) - allows people to file design patents in 66 countries with one application&lt;br /&gt;
** International Registration Date&lt;br /&gt;
** International Filing Date&lt;br /&gt;
** International Registration Publication Date&lt;br /&gt;
** International Registration Number&lt;br /&gt;
* Classification Locarno (XML40)&lt;br /&gt;
** Edition&lt;br /&gt;
** Main Classification&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please note that the lists above are meant to represent unique types that should be added to the patent database. This does not mean that every XML version contains all these fields, or that every version contains the same path to these fields. That will have to be determined separately. The XML number next to some  of the items is meant to represent the latest XML version where the field was seen.&lt;br /&gt;
&lt;br /&gt;
Though I was mainly looking at unique fields in plant, reissue, and design patents, I noticed the following fields on unique paths for utility patents in XML41:&lt;br /&gt;
&lt;br /&gt;
* US Related Documents&lt;br /&gt;
** Substitution&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Parent Document&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Country&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
***** Parent Status&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Country&lt;br /&gt;
** Continuation/Continuation in Part&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Doc&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
** Division&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Kind&lt;br /&gt;
&lt;br /&gt;
This appears to resemble the information about reissue patents. Additionally, a couple of fields appear under the information about bibliographic data that might be useful:&lt;br /&gt;
* US Deceased Inventor (XML41)&lt;br /&gt;
** Post Code&lt;br /&gt;
** Address&lt;br /&gt;
* US Provisional Application Status (XML41)&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shelby_Bice_(Work_Log)&amp;diff=21924</id>
		<title>Shelby Bice (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shelby_Bice_(Work_Log)&amp;diff=21924"/>
		<updated>2017-11-16T14:44:55Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* Fall 2017 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;===Fall 2017===&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt; &lt;br /&gt;
&lt;br /&gt;
[[Shelby Bice]] [[Work Logs]] [[Shelby Bice (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
2017-11-16 8:45 am - 10:30 am - continued altering code to include the special fields for plant, reissue, and design patents&lt;br /&gt;
&lt;br /&gt;
2017-11-14 8:30 - 10:30 am - met with Oliver to go over his code, began working on scripts and finding paths for the fields unique to design, reissue, and plant patents.&lt;br /&gt;
&lt;br /&gt;
2017-11-10: 2:00 - 5:00 pm - updated format of worklog, finished researching fields of the new design, utility, and reissue tables, and documenting what each table contains&lt;br /&gt;
&lt;br /&gt;
2017-11-03: 2:30 - 3:30 pm, 3:45 pm - 5:00 pm (I went to get a snack because I was starving!) - Finished up finding unique fields, added them to database design, and researched what the fields represented - again, this is taking a while since there is little to no documentation online about what fields on these XML files actually represent.&lt;br /&gt;
&lt;br /&gt;
2017-11-02: 2:15 pm - 4:00 pm - made a main data page and continued going through the xpaths Oliver found. Notified Michelle so she can continue adding patent pages&lt;br /&gt;
&lt;br /&gt;
2017-10-31: 9:00 am - 10:30 am - started trying to determine from Oliver's research what paths are unique to utility, reissue, and plant patents that we have not yet accounted for in our design of the table&lt;br /&gt;
&lt;br /&gt;
2017-10-27: 2:00 pm - 3:30 pm - Kept updating tables. For some of the new values I've been adding to tables (I didn't realize we had XPaths for some fields) I'm struggling to find out exactly what the field means (for example, the date on a citation - is this the date the citedpatent was granted? or is the date that the citingpatent cited the citedpatent?) so the process is slow going. I want to get the documentation right though so in the future someone can look up and see exactly what each field in the table represents and not have to guess&lt;br /&gt;
&lt;br /&gt;
2017-10-26: 2:00 pm - 4:00 pm - worked on updating the designs for the patent databases tables based on what was discussed on Friday (specifically adding inventors and lawyers tables, altering fields on patent tables) - will continue this on Friday&lt;br /&gt;
&lt;br /&gt;
2017-10-20: - 2:00 pm - 5:00 pm - met with Ed and Oliver about patent database design&lt;br /&gt;
&lt;br /&gt;
2017-10-19: 2:00 pm - 4:00 pm - finished up ER diagram and description of all tables for patent database, started reading papers concerning creating an inventor's database (looks like other research groups have merged the USPTO data with the Harvard Dataverse data in order to create an inventors table)&lt;br /&gt;
&lt;br /&gt;
2017-10-17: 8:45 am - 10:30 am - continued trying to solve Twitter issues, worked on ER diagram, skimmed through a paper I found (linked at the top of the page for Redesign Assignment and Patent Database) to see how they cleaned the data, since I assume that will be the next step to research after finishing the ER diagrams&lt;br /&gt;
&lt;br /&gt;
2017-10-12: 2:30 pm - 5:00 pm - wrote blog post on Grace Hopper Celebration and attempted to solve Twitter issue&lt;br /&gt;
&lt;br /&gt;
2017-10-03: 8:45 am - 10:30 am - finished adding descriptions to the fields for each table in the patent database, started work on an ER diagram&lt;br /&gt;
&lt;br /&gt;
2017-09-29: 2:00 pm - 5:00 pm - finished adding the design for the Patent database to the project page, added descriptions of the fields for each table to the project page including the datatype that I think the field will be when it's loaded into the database&lt;br /&gt;
&lt;br /&gt;
2017-09-28: 9:00 am - 10:30 am - continued going over and documenting last semester's patent database design and adding the details to the Redesign Assignment and Patent Database project page. Additionally, I began trying to determine how to match the information in the Document_Info table in Assignment with a patent_id in the Patent table in Patent.&lt;br /&gt;
&lt;br /&gt;
Tomorrow I will finish adding the design for the Patent database to the project page, add descriptions of the fields for each table to the project page, and start working on ER diagrams for the two databases.&lt;br /&gt;
&lt;br /&gt;
Links for creating ER diagrams: https://erdplus.com/#/standalone or https://creately.com/app/?tempID=hqdgwjki1&amp;amp;login_type=demo#&lt;br /&gt;
&lt;br /&gt;
2017-09-26: 8:45 am - 10:00 am - continued working on the design of the Assignment database by checking my design against the work done last semester on the assignment data restructure to make sure I didn't miss anything major. Began going over my patent database design from last semester to tweak it. Will need to sync up with Joe Reilly to see if there are any new fields that we are pulling from the data. Additionally, I made a new project page called Redesign Assignment and Patent Database that encompasses the new design for the Assignment database and Patent database redesign and moved some of the notes from McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper to the project page.&lt;br /&gt;
&lt;br /&gt;
The main takeaway from looking over Patent Assignment Data Restructure is that, after assembling the table according to my design (which doesn't seem to have any contradictions with the Patent Assignment Data Restructure), there will be multiple steps for cleaning the data, specifically the fields relating to location and address in the assignment table. While the Patent Assignment Data Restructure mentions connecting to the Patent database, it is not clear from the page what field would be used to connect to the Patent database.&lt;br /&gt;
&lt;br /&gt;
2017-09-23: 2:00 pm - 4:00 pm - continued working on design of Assignment database and how it will connect to Patent database by writing out what will be in each table in Assignment and questions about different possible structures of tables that we will have to address before finalizing the design - the notes can be found under McNair/Projects/Redesigning Patent Database/New Patent Database Project as Notes on USPTO Assignment Data Paper. Questions are highlighted in yellow throughout the document[[Category:Work Log]]&lt;br /&gt;
&lt;br /&gt;
2017-09-22: 8:30 am - 10:30 am - continued looking at paper on USPTO assignment data and adding to the notes on what the design of that database should look like, specifically on what I need for different tables and what I don't know yet about the design. Had to set up connection to RDP again due to technical issues. &lt;br /&gt;
&lt;br /&gt;
2017-09-15: 2:00 pm - 5:00 pm - introduced to new patent database project, reviewed and took notes on USPTO Assignment data (notes can be found under McNair/Projects/Redesigning Patent Database/New Patent Database Project as Notes on USPTO Assignment Data Paper)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Spring 2017===&lt;br /&gt;
&lt;br /&gt;
2017-04-27:  1:00 pm - 3:00 pm - worked on documentation more, tried to figure out how to clean citation data&lt;br /&gt;
&lt;br /&gt;
2017-04-25:  10:00 am - 12:00 pm - worked on documentation, tried to determine how to clean up the USPTO Assignee Data&lt;br /&gt;
&lt;br /&gt;
2017-04-20:  10:00 am - 11:30 pm - wrote copy statements for copying data from RDP to database, continued working on documentation.&lt;br /&gt;
&lt;br /&gt;
2017-04-18:  9:45 am - 12:30 pm - Cleaned up documentation more, kept working through the process of parsing the data&lt;br /&gt;
&lt;br /&gt;
2017-04-13: 9:30 am - 12:00 pm - Continued working on trying to update patent data through 2016, specifically parsing the data, worked with Ed to update Perl scripts&lt;br /&gt;
&lt;br /&gt;
2017-04-11: 9:15 am - 12:00 pm - Worked on trying to update patent data through 2016&lt;br /&gt;
&lt;br /&gt;
2017-04-06: 9:30 am - 11:30 pm - Kept looking through DTDs, kept updating documentation&lt;br /&gt;
&lt;br /&gt;
2017-04-04: 9:00 am - 11:45 pm - Worked on updating documentation, found documentation on pulling data/making tables and databases, started looking through DTDs to find extra fields to pull&lt;br /&gt;
&lt;br /&gt;
2017-03-23: 9:15 am - 12:00 pm - Narrowed down core tables and fields&lt;br /&gt;
&lt;br /&gt;
2017-03-22: 5:30 pm to 6:30 pm - Patent Data meeting&lt;br /&gt;
&lt;br /&gt;
2017-03-21: 9:30 am - 12:00 pm - Worked on determining &amp;quot;core&amp;quot; tables for new patent database&lt;br /&gt;
&lt;br /&gt;
2017-03-09: 9:00 am - 12:00 pm Finished first draft of spreadsheet describing the current schema (and possible changes) to the Patent database&lt;br /&gt;
&lt;br /&gt;
2017-03-07: 9:30 am - 12:00 pm Continued working on spreadsheet, added relevant page links to project page, took notes on what I want documentation to look like in the future&lt;br /&gt;
&lt;br /&gt;
2017-03-02: 9:15 am - 12:15 pm Started Excel spreadsheet to document current schema design and improvements to be made, updated project pages&lt;br /&gt;
&lt;br /&gt;
2017-02-23: 9:30 am - 12:00 pm Reviewed Perl, read about database design, set up project page for redesigning database, started documenting process&lt;br /&gt;
&lt;br /&gt;
2017-02-21: 9:00 am - 12:00 pm Set up work page, reviewed SQL, researched designing database, continued going through wiki&lt;br /&gt;
&lt;br /&gt;
2017-02-16: 10:30 am - 12:00 pm Researched past work on databases, discussed project with Ed&lt;br /&gt;
&lt;br /&gt;
2017-02-14: 10:00 am - 12:00 pm Set up personal wiki, set up work log&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shelby_Bice_(Work_Log)&amp;diff=21827</id>
		<title>Shelby Bice (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shelby_Bice_(Work_Log)&amp;diff=21827"/>
		<updated>2017-11-14T16:28:48Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* Fall 2017 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;===Fall 2017===&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt; &lt;br /&gt;
&lt;br /&gt;
[[Shelby Bice]] [[Work Logs]] [[Shelby Bice (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
2017-11-14: 8:30 - 10:30 am - met with Oliver to go over his code, began working on scripts and finding paths for the fields unique to design, reissue, and plant patents.&lt;br /&gt;
&lt;br /&gt;
2017-11-10: 2:00 - 5:00 pm - updated format of worklog, finished researching fields of the new design, utility, and reissue tables, and documenting what each table contains&lt;br /&gt;
&lt;br /&gt;
2017-11-03: 2:30 - 3:30 pm, 3:45 pm - 5:00 pm (I went to get a snack because I was starving!) - Finished up finding unique fields, added them to database design, and researched what the fields represented - again, this is taking a while since there is little to no documentation online about what fields on these XML files actually represent.&lt;br /&gt;
&lt;br /&gt;
2017-11-02: 2:15 pm - 4:00 pm - made a main data page and continued going through the xpaths Oliver found. Notified Michelle so she can continue adding patent pages&lt;br /&gt;
&lt;br /&gt;
2017-10-31: 9:00 am - 10:30 am - started trying to determine from Oliver's research what paths are unique to utility, reissue, and plant patents that we have not yet accounted for in our design of the table&lt;br /&gt;
&lt;br /&gt;
2017-10-27: 2:00 pm - 3:30 pm - Kept updating tables. For some of the new values I've been adding to tables (I didn't realize we had XPaths for some fields) I'm struggling to find out exactly what the field means (for example, the date on a citation - is this the date the citedpatent was granted? or is the date that the citingpatent cited the citedpatent?) so the process is slow going. I want to get the documentation right though so in the future someone can look up and see exactly what each field in the table represents and not have to guess&lt;br /&gt;
&lt;br /&gt;
2017-10-26: 2:00 pm - 4:00 pm - worked on updating the designs for the patent databases tables based on what was discussed on Friday (specifically adding inventors and lawyers tables, altering fields on patent tables) - will continue this on Friday&lt;br /&gt;
&lt;br /&gt;
2017-10-20: 2:00 pm - 5:00 pm - met with Ed and Oliver about patent database design&lt;br /&gt;
&lt;br /&gt;
2017-10-19: 2:00 pm - 4:00 pm - finished up ER diagram and description of all tables for patent database, started reading papers concerning creating an inventor's database (looks like other research groups have merged the USPTO data with the Harvard Dataverse data in order to create an inventors table)&lt;br /&gt;
&lt;br /&gt;
2017-10-17: 8:45 am - 10:30 am - continued trying to solve Twitter issues, worked on ER diagram, skimmed through a paper I found (linked at the top of the page for Redesign Assignment and Patent Database) to see how they cleaned the data, since I assume that will be the next step to research after finishing the ER diagrams&lt;br /&gt;
&lt;br /&gt;
2017-10-12: 2:30 pm - 5:00 pm - wrote blog post on Grace Hopper Celebration and attempted to solve Twitter issue&lt;br /&gt;
&lt;br /&gt;
2017-10-03: 8:45 am - 10:30 am - finished adding descriptions to the fields for each table in the patent database, started work on an ER diagram&lt;br /&gt;
&lt;br /&gt;
2017-09-29: 2:00 pm - 5:00 pm - finished adding the design for the Patent database to the project page, added descriptions of the fields for each table to the project page including the datatype that I think the field will be when it's loaded into the database&lt;br /&gt;
&lt;br /&gt;
2017-09-28: 9:00 am - 10:30 am - continued going over and documenting last semester's patent database design and adding the details to the Redesign Assignment and Patent Database project page. Additionally, I began trying to determine how to match the information in the Document_Info table in Assignment with a patent_id in the Patent table in Patent.&lt;br /&gt;
&lt;br /&gt;
Tomorrow I will finish adding the design for the Patent database to the project page, add descriptions of the fields for each table to the project page, and start working on ER diagrams for the two databases.&lt;br /&gt;
&lt;br /&gt;
Links for creating ER diagrams: https://erdplus.com/#/standalone or https://creately.com/app/?tempID=hqdgwjki1&amp;amp;login_type=demo#&lt;br /&gt;
&lt;br /&gt;
2017-09-26: 8:45 am - 10:00 am - continued working on the design of the Assignment database by checking my design against the work done last semester on the assignment data restructure to make sure I didn't miss anything major. Began going over my patent database design from last semester to tweak it. Will need to sync up with Joe Reilly to see if there are any new fields that we are pulling from the data. Additionally, I made a new project page called Redesign Assignment and Patent Database that encompasses the new design for the Assignment database and Patent database redesign and moved some of the notes from McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper to the project page.&lt;br /&gt;
&lt;br /&gt;
The main takeaway from looking over Patent Assignment Data Restructure is that, after assembling the table according to my design (which doesn't seem to have any contradictions with the Patent Assignment Data Restructure), there will be multiple steps for cleaning the data, specifically the fields relating to location and address in the assignment table. While the Patent Assignment Data Restructure mentions connecting to the Patent database, it is not clear from the page what field would be used to connect to the Patent database.&lt;br /&gt;
&lt;br /&gt;
2017-09-23: 2:00 pm - 4:00 pm - continued working on design of Assignment database and how it will connect to Patent database by writing out what will be in each table in Assignment and questions about different possible structures of tables that we will have to address before finalizing the design - the notes can be found under McNair/Projects/Redesigning Patent Database/New Patent Database Project as Notes on USPTO Assignment Data Paper. Questions are highlighted in yellow throughout the document[[Category:Work Log]]&lt;br /&gt;
&lt;br /&gt;
2017-09-22: 8:30 am - 10:30 am - continued looking at paper on USPTO assignment data and adding to the notes on what the design of that database should look like, specifically on what I need for different tables and what I don't know yet about the design. Had to set up connection to RDP again due to technical issues. &lt;br /&gt;
&lt;br /&gt;
2017-09-15: 2:00 pm - 5:00 pm - introduced to new patent database project, reviewed and took notes on USPTO Assignment data (notes can be found under McNair/Projects/Redesigning Patent Database/New Patent Database Project as Notes on USPTO Assignment Data Paper)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Spring 2017===&lt;br /&gt;
&lt;br /&gt;
2017-04-27:  1:00 pm - 3:00 pm - worked on documentation more, tried to figure out how to clean citation data&lt;br /&gt;
&lt;br /&gt;
2017-04-25:  10:00 am - 12:00 pm - worked on documentation, tried to determine how to clean up the USPTO Assignee Data&lt;br /&gt;
&lt;br /&gt;
2017-04-20:  10:00 am - 11:30 pm - wrote copy statements for copying data from RDP to database, continued working on documentation.&lt;br /&gt;
&lt;br /&gt;
2017-04-18:  9:45 am - 12:30 pm - Cleaned up documentation more, kept working through the process of parsing the data&lt;br /&gt;
&lt;br /&gt;
2017-04-13: 9:30 am - 12:00 pm - Continued working on trying to update patent data through 2016, specifically parsing the data, worked with Ed to update Perl scripts&lt;br /&gt;
&lt;br /&gt;
2017-04-11: 9:15 am - 12:00 pm - Worked on trying to update patent data through 2016&lt;br /&gt;
&lt;br /&gt;
2017-04-06: 9:30 am - 11:30 pm - Kept looking through DTDs, kept updating documentation&lt;br /&gt;
&lt;br /&gt;
2017-04-04: 9:00 am - 11:45 pm - Worked on updating documentation, found documentation on pulling data/making tables and databases, started looking through DTDs to find extra fields to pull&lt;br /&gt;
&lt;br /&gt;
2017-03-23: 9:15 am - 12:00 pm - Narrowed down core tables and fields&lt;br /&gt;
&lt;br /&gt;
2017-03-22: 5:30 pm to 6:30 pm - Patent Data meeting&lt;br /&gt;
&lt;br /&gt;
2017-03-21: 9:30 am - 12:00 pm - Worked on determining &amp;quot;core&amp;quot; tables for new patent database&lt;br /&gt;
&lt;br /&gt;
2017-03-09: 9:00 am - 12:00 pm Finished first draft of spreadsheet describing the current schema (and possible changes) to the Patent database&lt;br /&gt;
&lt;br /&gt;
2017-03-07: 9:30 am - 12:00 pm Continued working on spreadsheet, added relevant page links to project page, took notes on what I want documentation to look like in the future&lt;br /&gt;
&lt;br /&gt;
2017-03-02: 9:15 am - 12:15 pm Started Excel spreadsheet to document current schema design and improvements to be made, updated project pages&lt;br /&gt;
&lt;br /&gt;
2017-02-23: 9:30 am - 12:00 pm Reviewed Perl, read about database design, set up project page for redesigning database, started documenting process&lt;br /&gt;
&lt;br /&gt;
2017-02-21: 9:00 am - 12:00 pm Set up work page, reviewed SQL, researched designing database, continued going through wiki&lt;br /&gt;
&lt;br /&gt;
2017-02-16: 10:30 am - 12:00 pm Researched past work on databases, discussed project with Ed&lt;br /&gt;
&lt;br /&gt;
2017-02-14: 10:00 am - 12:00 pm Set up personal wiki, set up work log&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=21816</id>
		<title>Redesign Assignment and Patent Database</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=21816"/>
		<updated>2017-11-14T14:37:47Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* PLANT */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Redesign Assignment and Patent Database&lt;br /&gt;
|Has owner=Shelby Bice,&lt;br /&gt;
|Has start date=9/2017&lt;br /&gt;
|Has keywords=patent&lt;br /&gt;
|Is dependent on=Reproducible Patent Data,&lt;br /&gt;
}}&lt;br /&gt;
'''FOR ED:''' I finished going through the unique paths (you can see the unique fields I identified at the bottom of this page, under &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot;). Then, under my design for the Patent Database, there's a new section called &amp;quot;Unique Attributes Tables&amp;quot; where I've begun detailing a design for three new tables that would hold information that is unique to Design, Reissue, and Plant patents. Here are some questions I have:&lt;br /&gt;
&lt;br /&gt;
'''1.''' Do you like the design for the new tables? How do you think we should populate these tables? I can think of one way (putting all the information in the main patent tables and then moving it out to Plant, Reissue, or Design as needed) but I was wondering if there might be an easier way. &lt;br /&gt;
&lt;br /&gt;
'''2.''' There were some unique attributes to utility patents that I also included in &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot;. They only appear to be in two of the oldest XML versions (4.1 and 4.0) and I'm not sure the information is particularly useful - but please take a look at the field names and let me know if you disagree and think they are useful and should be included in a table in the patent database.&lt;br /&gt;
&lt;br /&gt;
'''3.''' Do you know what some of these fields represent? For example, I've been trying to research online what a &amp;quot;Parent Document&amp;quot; and/or &amp;quot;Parent Grant Document&amp;quot; might represent for a reissue patent, and I've found a couple possible options. I will keep researching, but if you know what they represent or one of my descriptions for a field is wrong, please let me know.&lt;br /&gt;
&lt;br /&gt;
This is an extension of the work I did last semester under &amp;quot;Redesigning Patent Database&amp;quot;. Instead of simply reconfiguring the existing database, this project encompasses a full redesign and creation of a new Patent database and a new Assignment database that will be joined together.&lt;br /&gt;
&lt;br /&gt;
Reference for definitions of fields in the two databases: https://www.oecd.org/sti/sci-tech/37569498.pdf&lt;br /&gt;
&lt;br /&gt;
Paper on how to extract data from XML files, parse it, and put it into a database: https://funginstitute.berkeley.edu/wp-content/uploads/2014/06/patentprocessor.pdf&lt;br /&gt;
&lt;br /&gt;
'''Particularly useful, because it provides solutions for cleaning location and firm data'''&lt;br /&gt;
&lt;br /&gt;
==ER Diagram for Assignment and Patent Databases==&lt;br /&gt;
&lt;br /&gt;
Note: this is currently out of sync with the current description below for the patent database. It will be updated once I have finalized how we will organize unique attributes for plant, design, and reissue patents.&lt;br /&gt;
&lt;br /&gt;
[[File:PatentAndAssignmentER2.png]]&lt;br /&gt;
&lt;br /&gt;
Attributes for each table are listed below with descriptions - because of how many attributes there are, I decided the ER diagram would be better suited as an overview of the tables rather than trying to show all the attributes on the diagram.&lt;br /&gt;
&lt;br /&gt;
==Assignment Database Structure==&lt;br /&gt;
&lt;br /&gt;
After reading through the paper &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot; (https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf) and taking into account our needs for the Assignment database (the ability to connect the two databases together, the ability to trace conveyance of a patent over time), I decided on the following structure for the Assignment database.&lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table in the database is rf_id, which stands for reel frame id. It is unique for each entry in the main table, assignment, and is included in every table.&lt;br /&gt;
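&lt;br /&gt;
A minimal sketch of how rf_id could tie the tables together, using Python's built-in sqlite3 module with an in-memory database. Only two of the tables and a couple of their columns are shown, the sample rows are invented, and the engine and column types are assumptions for illustration rather than the ones the real database uses.&lt;br /&gt;

```python
import sqlite3

# In-memory database for illustration only; the sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Central table: one row per assignment transaction, keyed by rf_id.
conn.execute(
    "CREATE TABLE assignment ("
    " rf_id INTEGER PRIMARY KEY,"
    " record_date TEXT)"
)

# Child table: rf_id plus assignor_name form the primary key,
# and rf_id points back to the assignment table.
conn.execute(
    "CREATE TABLE assignor ("
    " rf_id INTEGER NOT NULL REFERENCES assignment(rf_id),"
    " assignor_name TEXT NOT NULL,"
    " PRIMARY KEY (rf_id, assignor_name))"
)

conn.execute("INSERT INTO assignment VALUES (1001, '2017-09-15')")
conn.executemany(
    "INSERT INTO assignor VALUES (?, ?)",
    [(1001, "Example Inventor A"), (1001, "Example Inventor B")],
)

# Joining on rf_id recovers every assignor for a given transaction.
rows = conn.execute(
    "SELECT a.rf_id, a.record_date, r.assignor_name"
    " FROM assignment a JOIN assignor r ON r.rf_id = a.rf_id"
    " ORDER BY r.assignor_name"
).fetchall()
print(rows)
```

With foreign key enforcement switched on, an assignor row whose rf_id has no match in assignment is rejected, which is the behavior the design relies on when every table carries rf_id.&lt;br /&gt;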
&lt;br /&gt;
Conveyance is the only table that cannot be fully constructed as data is inserted into the database. Some of the fields depend on other tables being constructed, so it may either be partially populated while the other tables are being populated or be the last table to be populated.&lt;br /&gt;
&lt;br /&gt;
Full list of tables: assignee, assignor, document_info, assignment, conveyance&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: store all the assignees from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the recipient of a patent in an entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id, unless it is possible to have multiple assignees, then rf_id and assignee name&lt;br /&gt;
&lt;br /&gt;
All columns (variables) in the table: &lt;br /&gt;
&lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* address_line_1 (varchar(255)) first line of address of assignee&lt;br /&gt;
* address_line_2 (varchar(255)) second line of address of assignee&lt;br /&gt;
* city (varchar(255)) city of assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* country (varchar(255)) country of assignee&lt;br /&gt;
* postal_code (varchar(255)) post code of the assignee&lt;br /&gt;
&lt;br /&gt;
===ASSIGNOR===&lt;br /&gt;
	&lt;br /&gt;
Purpose: store all the assignors from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the one assigning/granting the license for the entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id and assignor_name&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
&lt;br /&gt;
===CONVEYANCE===&lt;br /&gt;
Purpose: represent the conveyance of a particular patent over time&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual transaction for a patent (not unlike assignment, except this table has information about the assignment in relation to other assignments for the patent)&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id &lt;br /&gt;
&lt;br /&gt;
All columns (variables): Two possibilities, depending on whether we want data to be repeated or not.&lt;br /&gt;
	&lt;br /&gt;
Not-repeated (with exception of primary/foreign key): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* next_transaction (bigint) rf_id of next transaction that occurred after this transaction (traced by assignee/assignor)&lt;br /&gt;
* num_conveyance (int) number in order of conveyance by date&lt;br /&gt;
* acknowledgment_date (date) date that the assignor acknowledged the transaction&lt;br /&gt;
* execution_date (date) date that the transaction took place&lt;br /&gt;
* conveyance_type (varchar(255)) the USPTO paper determined a couple of different conveyance types and also outlined how to determine based on the conveyance text in the xml file which conveyance type is relevant to the transaction. The conveyance types are:&lt;br /&gt;
** assignment&lt;br /&gt;
** merger&lt;br /&gt;
** change of name&lt;br /&gt;
** government interest&lt;br /&gt;
** agreement&lt;br /&gt;
** security agreement&lt;br /&gt;
** release&lt;br /&gt;
&lt;br /&gt;
	&lt;br /&gt;
Repeated, also include: &lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
* patent_number (varchar(255)) identifying number for the patent&lt;br /&gt;
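&lt;br /&gt;
A rough sketch of how num_conveyance and next_transaction could be filled in once all transactions for a patent are known. It assumes the transactions are simply ordered by execution_date; the design above traces next_transaction by assignee/assignor, which would need extra matching logic that is not shown. The function name order_conveyances and the sample rf_id values are hypothetical.&lt;br /&gt;

```python
from datetime import date


def order_conveyances(transactions):
    """Return conveyance rows for one patent, ordered by execution date.

    transactions: list of (rf_id, execution_date) pairs for a single patent.
    """
    ordered = sorted(transactions, key=lambda t: t[1])
    rows = []
    for position, (rf_id, execution_date) in enumerate(ordered):
        is_last = position == len(ordered) - 1
        # The last transaction in the chain has no successor.
        next_rf_id = None if is_last else ordered[position + 1][0]
        rows.append(
            {
                "rf_id": rf_id,
                "num_conveyance": position + 1,
                "execution_date": execution_date,
                "next_transaction": next_rf_id,
            }
        )
    return rows


# Invented sample data: three transactions for one hypothetical patent.
sample = [
    (3003, date(2015, 6, 1)),
    (1001, date(2010, 1, 15)),
    (2002, date(2012, 3, 30)),
]
conveyances = order_conveyances(sample)
for row in conveyances:
    print(row)
```

Because each row needs to know about the other transactions for the same patent, this step can only run after those transactions are loaded, which is why CONVEYANCE may be the last table to be populated.&lt;br /&gt;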
&lt;br /&gt;
===DOCUMENT_INFO===&lt;br /&gt;
Purpose: store extra information relevant to the patents represented in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the documentation and the patent for a particular entry in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables) when created: &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* app_num (varchar(255)) application document USPTO number &lt;br /&gt;
* app_date (date) date that the application was filed&lt;br /&gt;
* app_country (varchar(255)) country in which it was filed&lt;br /&gt;
* pgpub_num (varchar(255)) pre-grant publication document USPTO number&lt;br /&gt;
* pgpub_date (date) date that pre-grant publication was released&lt;br /&gt;
* pgpub_country (varchar(255)) country in which the patent is published before being granted a patent number&lt;br /&gt;
* patent_number (varchar(255)) granted patent document USPTO number&lt;br /&gt;
* grant_date (date) date that the patent is officially published&lt;br /&gt;
* grant_country (varchar(255)) country in which the patent is published after being granted a patent number&lt;br /&gt;
* invention_title (varchar(255)) title of the invention&lt;br /&gt;
* language (varchar(255)) language of the invention title – could be potentially useful and interesting to investigate &lt;br /&gt;
* reel_num (bigint) number of the reel the assignment was stored on&lt;br /&gt;
* frame_num (bigint) number of the frame on the reel the assignment was stored on&lt;br /&gt;
&lt;br /&gt;
===ASSIGNMENT===&lt;br /&gt;
Purpose: represent assignment transactions – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent assignment transaction&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignment_id (bigint) US patent assignment id &lt;br /&gt;
* correspondent_name (varchar(255)) name of correspondent for the transaction, could be a lawyer&lt;br /&gt;
* correspondent_address1 (varchar(255)) first line of address for correspondent&lt;br /&gt;
* correspondent_address2 (varchar(255)) second line of address for correspondent&lt;br /&gt;
* correspondent_address3 (varchar(255)) third line of address for correspondent&lt;br /&gt;
* correspondent_address4 (varchar(255)) fourth line of address for correspondent&lt;br /&gt;
* record_date (date) date recorded with USPTO&lt;br /&gt;
* last_update_date (date) date that information for this assignment was last updated with the USPTO&lt;br /&gt;
* page_cnt (bigint) page count of assignment record&lt;br /&gt;
&lt;br /&gt;
More extensive notes exist under :E/McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper&lt;br /&gt;
&lt;br /&gt;
==Patent Database Structure==&lt;br /&gt;
&lt;br /&gt;
Most of this design is based off what I investigated last semester and therefore is subject to change, since the design for the Assignment database is somewhat new. &lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table is patent_number, which uniquely identifies every patent in the PATENT table and is included in every table.&lt;br /&gt;
&lt;br /&gt;
Full list of tables: patent, assignee, citation, fee, claims, histpatent, lawyers, inventors&lt;br /&gt;
&lt;br /&gt;
Update: These tables, as of October 26, are being updated and changed as I determine what data can actually be retrieved from the XML files based on the xpaths that we have determined (you can see these xpaths on the &amp;quot;Equivalent XPath and APS Queries&amp;quot; project page).&lt;br /&gt;
&lt;br /&gt;
===PATENT===&lt;br /&gt;
Purpose: represent patents – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* patent_type (varchar(255)) the type of patent, like utility&lt;br /&gt;
* patent_kind (varchar(2)) the kind of patent, a letter and a number&lt;br /&gt;
* title (varchar(255)) the title of the patent&lt;br /&gt;
* grantdate (date) date that patent was officially granted including the year&lt;br /&gt;
* prioritydate (date) date used to establish novelty of an invention with regard to other inventions (source: http://www.bios.net/daisy/patentlens/2343.html)&lt;br /&gt;
* prioritycountry (varchar(255)) country where patent is first filed&lt;br /&gt;
* patent_country (varchar(255)) country in which this patent was published&lt;br /&gt;
* prioritypatentnumber (varchar(255)) placeholder patent number if the patent is published before it is fully approved - also referred to as priority claims number&lt;br /&gt;
* cpcsubclass (varchar(255)) cooperative patent classification subclass&lt;br /&gt;
* cpcsubgroup (varchar(255)) cooperative patent classification subgroup&lt;br /&gt;
* cpcmaingroup (varchar(255)) cooperative patent classification main group&lt;br /&gt;
* cpctotal (varchar(255)) concatenated version of cpcmaingroup, cpcsubgroup, cpcsubclass&lt;br /&gt;
* pctpatentnumber (varchar(255)) international patent number (according to Patent Cooperation Treaty) also known as PCT Document Number&lt;br /&gt;
* appnum (varchar(255)) application number, probably also the filing number - the format is a two digit series code followed by a six digit serial number (source: https://www.uspto.gov/patents-application-process/filing-online/info-application-number)&lt;br /&gt;
* appdate (date) date that the application was filed, including year - probably also the filing date&lt;br /&gt;
* ipcrsubclass (varchar(255)) International patent classification subclass&lt;br /&gt;
* ipcrmaingroup (varchar(255)) International patent classification main group&lt;br /&gt;
* ipcrsubgroup (varchar(255)) International patent classification subgroup&lt;br /&gt;
* ipcrtotal (varchar(255)) concatenated version of ipcrsubgroup, ipcrsubclass, ipcrmaingroup&lt;br /&gt;
* national_classification (varchar(255)) - probably uspc, which is United States Patent Classification&lt;br /&gt;
* national_classification_country (varchar(255)) - should be United States&lt;br /&gt;
* num_claims (int) number of claims (things that the inventor wishes to protect with this patent)&lt;br /&gt;
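&lt;br /&gt;
As a small illustration of the appnum format described above (a two digit series code followed by a six digit serial number), the sketch below splits an application number into its two parts. The helper name split_appnum is hypothetical, and stripping slashes and commas before matching is an assumption about how the raw values might be punctuated, not something confirmed from the XML files.&lt;br /&gt;

```python
import re

# Two digit series code followed by a six digit serial number.
APPNUM_PATTERN = re.compile(r"^(\d{2})(\d{6})$")


def split_appnum(appnum):
    """Split an application number into (series_code, serial_number).

    Returns None when the value does not fit the eight digit format.
    """
    # Assumption: punctuation such as "/" or "," may appear in raw values.
    cleaned = re.sub(r"[^0-9]", "", appnum)
    match = APPNUM_PATTERN.match(cleaned)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(split_appnum("15/123,456"))
print(split_appnum("12345"))
```

Values that do not fit the format come back as None, so they can be flagged for manual review instead of being loaded silently.&lt;br /&gt;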
&lt;br /&gt;
Discrepancies:&lt;br /&gt;
These fields were previously a part of the table, but we do not currently have xpaths for them:&lt;br /&gt;
* app_year - the year the application was filed&lt;br /&gt;
* grant_year - the year the patent was granted&lt;br /&gt;
* uspc - United States Patent Classification (might be the same as national_classification)&lt;br /&gt;
* uspcsub - United States Patent subgroup&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: Represent the person the patent was assigned to at the time of publication according to the patent information (the information in the Assignment database is more complete) &lt;br /&gt;
&lt;br /&gt;
Each entry represents: the assignee for a patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number, first_name, last_name (in case it is possible for there to be multiple assignees for an individual patent)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* first_name (varchar(255)) first name of assignee&lt;br /&gt;
* last_name (varchar(255)) last name of assignee&lt;br /&gt;
* address (varchar(255)) address of assignee&lt;br /&gt;
* postcode (varchar(255)) postcode of assignee&lt;br /&gt;
* orgname (varchar(255)) name of the organization owning the patent (if applicable - organization may go by multiple orgnames) &lt;br /&gt;
* city (varchar(255)) city of the assignee&lt;br /&gt;
* country (varchar(255)) country of the assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* residence (varchar(255)) mailing address of assignee if they receive mail at a different address than the one listed above&lt;br /&gt;
* size_of_firm (varchar(255)) size of the organization, large, small, or micro - relates to how much the fee they pay for the patent is&lt;br /&gt;
&lt;br /&gt;
===CITATION===&lt;br /&gt;
Purpose: Represent all the patents in PATENT that another patent in PATENT cites&lt;br /&gt;
&lt;br /&gt;
Each entry represents: one citation that a patent in PATENT makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number AND cited_patent_number (because a patent might cite multiple patents, so both variables are needed for the primary key)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* cited_patent_number (varchar(255)) unique identifier from the USPTO for the patent being cited&lt;br /&gt;
* description (varchar(255)) description of the citation&lt;br /&gt;
* citation_date (date) unclear what this date is supposed to represent from the DTD, but according to the paper that is the second link in the top section, it is the &amp;quot;date of cited document&amp;quot;&lt;br /&gt;
* citation_kind (varchar(255)) type of cited document&lt;br /&gt;
* citation_country (varchar(255)) country of origin for the cited patent&lt;br /&gt;
* citation_name (varchar(255)) appears to be the name(s) of those to whom the cited patent is assigned&lt;br /&gt;
&lt;br /&gt;
Note: There are other fields available for this table, such as document number, but this seems like it might be a repeat of the document number for the citing patent&lt;br /&gt;
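&lt;br /&gt;
The composite primary key described above can be illustrated with an in-memory SQLite table. This is a sketch under stated assumptions, not the project's actual schema code: only three of the columns are included, the patent numbers used with it are made up, and the function names are hypothetical.&lt;br /&gt;

```python
import sqlite3


def build_citation_table():
    """Create an in-memory CITATION table keyed on (patent_number, cited_patent_number)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE citation (
            patent_number       VARCHAR(255) NOT NULL,
            cited_patent_number VARCHAR(255) NOT NULL,
            citation_date       DATE,
            PRIMARY KEY (patent_number, cited_patent_number)
        )
        """
    )
    return conn


def add_citation(conn, citing, cited):
    """Insert one citation; return False if the (citing, cited) pair already exists."""
    try:
        conn.execute(
            "INSERT INTO citation (patent_number, cited_patent_number) VALUES (?, ?)",
            (citing, cited),
        )
        return True
    except sqlite3.IntegrityError:
        # The composite primary key rejects a repeated (citing, cited) pair.
        return False
```

One patent can cite many patents, because only the pair of values has to be unique; a repeated pair is rejected.&lt;br /&gt;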
&lt;br /&gt;
===FEE===&lt;br /&gt;
Purpose: Represent information about the fees paid for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the fees paid on the patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* four (boolean) payment of maintenance fee 4th year&lt;br /&gt;
* eight (boolean) payment of maintenance fee 8th year&lt;br /&gt;
* twelve (boolean) payment of maintenance fee 12th year&lt;br /&gt;
* fee_date (date) date that fee is paid on&lt;br /&gt;
* fee_code (varchar(255)) code for type of fee to pay&lt;br /&gt;
&lt;br /&gt;
===CLAIMS===&lt;br /&gt;
&lt;br /&gt;
Purpose: Represent information about the claims for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a claim that a patent makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number and claim_id&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* claim_id (varchar(255)) id number for the claim&lt;br /&gt;
* language (varchar(255)) language of claim&lt;br /&gt;
* claim_type (varchar(255)) type of claim&lt;br /&gt;
* status (varchar(255)) status of the claim&lt;br /&gt;
* claim_text (text) the text of the claim&lt;br /&gt;
&lt;br /&gt;
===HISTPATENT===&lt;br /&gt;
Purpose: Represent historical data about the patent pertaining to dates and publication&lt;br /&gt;
&lt;br /&gt;
Each entry represents: sequence of historical data about an individual patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* pubno (varchar(255)) patent publication number&lt;br /&gt;
* pubdate (date) date patent was issued&lt;br /&gt;
* dispdate (date) disposal date for the application&lt;br /&gt;
* distype (varchar(3)) application status of the patent - ABN for abandoned, ISS for issued, or PEN for pending&lt;br /&gt;
* exp_dt (date) expiration date of patent&lt;br /&gt;
* pta (int) length of patent term adjustment (which extends the amount of time the patent is allowed to be in force) in number of days&lt;br /&gt;
&lt;br /&gt;
===INVENTORS===&lt;br /&gt;
Purpose: Represent the inventor for each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about an inventor for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (and name, if there can be multiple inventors for a single patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identify the inventor?&lt;br /&gt;
* name (varchar(255)) concatenated string of the first and last name of the inventor&lt;br /&gt;
* orgname (varchar(255)) name of the organization the inventor works for&lt;br /&gt;
* state (varchar(255)) - state in which the inventor resided&lt;br /&gt;
* address (varchar(255)) - address at which the inventor resided&lt;br /&gt;
* country (varchar(255)) country in which the inventor resided&lt;br /&gt;
* city (varchar(255)) city in which the inventor resided&lt;br /&gt;
&lt;br /&gt;
===LAWYERS===&lt;br /&gt;
Purpose: Represent the lawyers who worked on each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a lawyer for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (there doesn't appear to be more than one lawyer entry for each patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identify the lawyer?&lt;br /&gt;
* org_name (varchar(255)) the name of the firm that did arbitration for the patent&lt;br /&gt;
&lt;br /&gt;
===Unique Attributes Tables===&lt;br /&gt;
&lt;br /&gt;
There are several attributes that have been identified as unique, applying to only one kind of patent. Adding all these fields to the main patent table would introduce a lot of NULL entries, because three out of four patent types will not have any particular attribute. Therefore, I have created three new tables to store these unique attributes - one for each type of patent with unique attributes. &lt;br /&gt;
&lt;br /&gt;
For each table, the primary key is the patent_number, which connects a row to a row in the main patent table. &lt;br /&gt;
&lt;br /&gt;
TODO: Figure out best way to generate these tables, finish description of attributes for each table.&lt;br /&gt;
&lt;br /&gt;
* Note that I did include some unique attributes for Utility patents as seen below in the &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot; section, but for now (11/3/2017) I'm not going to create a separate table for those fields. Most of them seem to be repeated in Reissue, and the other fields, while interesting, are perhaps not very useful.&lt;br /&gt;
&lt;br /&gt;
====PLANT====&lt;br /&gt;
=====Fields=====&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* latin_name (varchar(255)) latin name for the plant&lt;br /&gt;
* us_botanical_variety(varchar(255)) denotes what variety of plant it is - for example, a rose has several different varieties&lt;br /&gt;
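&lt;br /&gt;
To show how a unique attributes table connects to the main patent table through patent_number, here is a minimal SQLite sketch. The PATENT table is reduced to its key, the patent numbers and plant values are made up, and the function names are hypothetical.&lt;br /&gt;

```python
import sqlite3


def build_plant_demo():
    """Create PATENT and PLANT tables in memory with two made-up patents."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE patent (patent_number TEXT PRIMARY KEY);
        CREATE TABLE plant (
            patent_number        TEXT PRIMARY KEY REFERENCES patent (patent_number),
            latin_name           TEXT,
            us_botanical_variety TEXT
        );
        INSERT INTO patent VALUES ('PP0001');
        INSERT INTO patent VALUES ('0000002');
        INSERT INTO plant VALUES ('PP0001', 'Rosa hybrida', 'made-up variety');
        """
    )
    return conn


def plant_details(conn, patent_number):
    """Return (latin_name, us_botanical_variety), or None for non-plant patents."""
    return conn.execute(
        "SELECT latin_name, us_botanical_variety FROM plant WHERE patent_number = ?",
        (patent_number,),
    ).fetchone()
```

Only plant patents get a row in PLANT, so the main patent table does not need nullable plant columns.&lt;br /&gt;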
&lt;br /&gt;
====REISSUE====&lt;br /&gt;
&lt;br /&gt;
In the following table, you'll notice lots of fields related to three different kinds of &amp;quot;documents&amp;quot; - a &amp;quot;parent document&amp;quot;, a &amp;quot;child document&amp;quot;, and a &amp;quot;parent grant document&amp;quot;. It is not immediately clear what these three documents represent for a reissue patent. After some research, I think I have determined what these &amp;quot;documents&amp;quot; are, but please know that I do not have definitive proof, as there is little available information about these &amp;quot;documents&amp;quot; in regard to a reissue patent.&lt;br /&gt;
&lt;br /&gt;
It is possible that the &amp;quot;parent document&amp;quot; is the parent patent application - that is, the first patent application filed in regard to an invention. That would explain why reissue patents also have several fields related to a &amp;quot;child document&amp;quot; - a &amp;quot;child document&amp;quot; could be a child application, which is filed while a parent application is still pending. So in this case, a &amp;quot;child document&amp;quot; of a reissue patent would be an application regarding the same patent that was filed while the reissue application was still pending. A &amp;quot;child document&amp;quot; is either a continuation, divisional, or continuation-in-part application ([http://www.patenttrademarkblog.com/parent-and-child-patent-applications/ source]).&lt;br /&gt;
&lt;br /&gt;
Based on this logic, I think that the difference between a &amp;quot;parent document&amp;quot; and a &amp;quot;parent grant document&amp;quot; depends on whether the parent patent application has been granted. If the parent patent application is still pending (meaning the patent has not yet been granted) then I believe the reissue patent will store information about the parent patent under &amp;quot;parent document&amp;quot;. However, if the parent patent application has been granted, then the information will be stored under &amp;quot;parent grant document&amp;quot;. This seems like the most logical explanation, especially considering that the path to any field related to &amp;quot;parent grant document&amp;quot; contains &amp;quot;parent document&amp;quot;, as in the example below:&lt;br /&gt;
&lt;br /&gt;
 us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
&lt;br /&gt;
Therefore, a particular reissue patent will probably only have the fields filled for either the &amp;quot;parent document&amp;quot; or &amp;quot;parent grant document&amp;quot; if I am right about what they represent.&lt;br /&gt;
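&lt;br /&gt;
The example path above can be read with Python's xml.etree.ElementTree. The sketch below builds a tiny, made-up us-patent-grant element that contains only that nested path and then looks the date up again. The function names and the date value are assumptions for illustration, and real grant files contain far more structure.&lt;br /&gt;

```python
import xml.etree.ElementTree as ET

# Path from the example above, relative to the us-patent-grant root element.
PARENT_GRANT_DATE_PATH = (
    "us-bibliographic-data-grant/us-related-documents/reissue/relation/"
    "parent-doc/parent-grant-document/document-id/date"
)


def build_sample_grant(date_text):
    """Build a made-up us-patent-grant element holding only the nested date path."""
    root = ET.Element("us-patent-grant")
    node = root
    for tag in PARENT_GRANT_DATE_PATH.split("/"):
        node = ET.SubElement(node, tag)
    node.text = date_text
    return root


def parent_grant_date(grant):
    """Return the parent grant document date text, or None if the path is absent."""
    return grant.findtext(PARENT_GRANT_DATE_PATH)
```

A reissue patent without a parent grant document simply yields None, which fits the expectation that only one of the two sets of fields will be filled.&lt;br /&gt;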
&lt;br /&gt;
=====Fields=====&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* parent_doc_status (varchar(255)) status of the parent application. Probably related to whether the patent application is pending or not&lt;br /&gt;
* parent_doc_number (int) probably application number for the parent document&lt;br /&gt;
* parent_doc_id (varchar(255)) it is unclear how this is different from parent document number. In the xpaths, the path will be something like ./parent_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent document&amp;quot; is - or may refer to its purpose (i.e. reissue, continuation, continuation-in-part, etc.)&lt;br /&gt;
* parent_doc_country (varchar(255)) the country of origin of the parent document&lt;br /&gt;
* parent_doc_date (date) probably the date the parent document was filed&lt;br /&gt;
* parent_grant_doc_number (int) probably application number for the parent grant document&lt;br /&gt;
* parent_grant_doc_id (varchar(255)) it is unclear how this is different from parent grant document number. In the xpaths, the path will be something like ./parent_doc/parent-grant-doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_grant_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent grant document&amp;quot; is, or may refer to its purpose (i.e. reissue, continuation, continuation-in-part, etc.)&lt;br /&gt;
* parent_grant_doc_country (varchar(255)) the country of origin of the parent grant document&lt;br /&gt;
* parent_grant_doc_date (date) probably the date the parent grant document was filed&lt;br /&gt;
* child_doc_number (varchar(255)) probably the application number of the child application&lt;br /&gt;
* child_doc_id (varchar(255)) it is unclear how this is different from child document number. In the xpaths, the path will be something like ./child_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* child_doc_country (varchar(255)) country of origin of the child application&lt;br /&gt;
&lt;br /&gt;
====DESIGN====&lt;br /&gt;
=====Fields=====&lt;br /&gt;
* length_of_grant (int) length of grant, most likely in years&lt;br /&gt;
* hague_registration_date (date) filing date of international patent application&lt;br /&gt;
* hague_filing_date (date) not necessarily the same as the filing date of the international patent application, this is the date that the International Bureau receives all necessary elements for the international patent application&lt;br /&gt;
* hague_registration_pub_date (date) date that the International Bureau publishes the international patent application&lt;br /&gt;
* hague_international_registration_number (varchar(255)) international registration number&lt;br /&gt;
* edition (varchar(255)) possibly the edition of the Classification Locarno which determined main_classification&lt;br /&gt;
* main_classification (varchar(255)) classification for what type of design the patent is for&lt;br /&gt;
&lt;br /&gt;
==Connecting Patent database and Assignment database==&lt;br /&gt;
The answer to connecting the Patent database to the Assignment database appears to lie in using the information in the table DOCUMENT_INFO to connect to a patent_id from the Patent database PATENT table for each assignment in the Assignment database ASSIGNMENT table. On further investigation into &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot;, the field they called &amp;quot;grant_doc_num&amp;quot; (grant_num above) in DOCUMENT_INFO is the patent number (the description of DOCUMENT_INFO above has been altered to reflect this).&lt;br /&gt;
&lt;br /&gt;
The paper mentions that there will be errors in the patent number stored in DOCUMENT_INFO under grant_doc_num, so a separate table called DOCUMENT_INFO_ADMIN was constructed to determine how prevalent errors were. They queried the patent number for the appno_doc_num from administrative data. The &amp;quot;appno_doc_num&amp;quot; is the Application Document USPTO number from DOCUMENT_INFO. On further investigation into the paper, it became clear that &amp;quot;administrative data&amp;quot; refers to internal USPTO data that was available to the authors of the paper, but that we might not have. However, in 99% of cases the grant_doc_num and the queried patent number based on the appno_doc_num were the same, so we can probably rely on grant_doc_num.&lt;br /&gt;
&lt;br /&gt;
Therefore, since every table in the Patent database will have the patent number stored, to connect any table in Assignment to the Patent database, it would first be joined with DOCUMENT_INFO on rf_id and then joined with the appropriate table in Patent on patent number.&lt;br /&gt;
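&lt;br /&gt;
The two-step join described above can be sketched with in-memory SQLite tables. The column names rf_id, grant_doc_num, patent_number, and num_claims come from this page; the note column, the row values, and the function names are made up for illustration, and the real tables have many more columns.&lt;br /&gt;

```python
import sqlite3

# Join ASSIGNMENT to DOCUMENT_INFO on rf_id, then to PATENT on the patent number.
JOIN_SQL = """
    SELECT a.rf_id, p.patent_number, p.num_claims
    FROM assignment AS a
    JOIN document_info AS d ON d.rf_id = a.rf_id
    JOIN patent AS p ON p.patent_number = d.grant_doc_num
"""


def build_demo_db():
    """Create minimal ASSIGNMENT, DOCUMENT_INFO and PATENT tables with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE assignment (rf_id TEXT PRIMARY KEY, note TEXT);
        CREATE TABLE document_info (rf_id TEXT, grant_doc_num TEXT);
        CREATE TABLE patent (patent_number TEXT PRIMARY KEY, num_claims INTEGER);
        INSERT INTO assignment VALUES ('rf-1', 'made-up assignment');
        INSERT INTO assignment VALUES ('rf-2', 'assignment with no matching patent');
        INSERT INTO document_info VALUES ('rf-1', 'P100');
        INSERT INTO document_info VALUES ('rf-2', 'P999');
        INSERT INTO patent VALUES ('P100', 12);
        """
    )
    return conn


def assignments_with_patents(conn):
    """Return (rf_id, patent_number, num_claims) for assignments that match a patent."""
    return conn.execute(JOIN_SQL).fetchall()
```

Because this is an inner join, assignments whose grant_doc_num has no match in PATENT drop out of the result; a LEFT JOIN would keep them, which may be useful for measuring how many patent numbers fail to match.&lt;br /&gt;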
&lt;br /&gt;
Also in the paper, they mentioned the Assignments on the Web for Patents (AOTW-P), a searchable database of individual USPTO assignment records keyed on reel-frame identification, patent number, and assignor or assignee name (https://assignment.uspto.gov/patent/index.html#/patent/search). Obviously it would not be possible to use this tool by hand to query all the patent numbers. If it were possible to write a script that queries each patent number using the rf_id and parses the response, this could potentially be useful to check the patent numbers, but it might not be any more accurate than what will already be in DOCUMENT_INFO.&lt;br /&gt;
&lt;br /&gt;
==Finding New Paths Unique to Plant, Reissue, and/or Design Patents==&lt;br /&gt;
&lt;br /&gt;
Based on Oliver's script, which searched all xpaths and compared which were unique to particular types, we see that the following attributes are unique to each type of patent other than utility patents. These attributes vary by the XML version, which changed over time. Therefore, the lists below are a superset of the attributes that are unique to each of the patent types listed below across all XML versions.&lt;br /&gt;
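&lt;br /&gt;
The comparison described above can be sketched with set operations. This is not Oliver's script, only an illustration of the idea: the function names are hypothetical, and the input is assumed to be a mapping from patent type to the set of xpaths seen for that type in one XML version.&lt;br /&gt;

```python
def unique_paths_by_type(paths_by_type):
    """For one XML version, keep only the xpaths that appear in a single patent type."""
    unique = {}
    for patent_type, paths in paths_by_type.items():
        others = set()
        for other_type, other_paths in paths_by_type.items():
            if other_type != patent_type:
                others |= set(other_paths)
        unique[patent_type] = set(paths) - others
    return unique


def superset_across_versions(unique_by_version):
    """Union the per-version unique xpaths into one superset per patent type."""
    combined = {}
    for unique in unique_by_version:
        for patent_type, paths in unique.items():
            combined.setdefault(patent_type, set()).update(paths)
    return combined
```

A path that is unique to a type in one XML version is kept in the superset even if other versions never contain it.&lt;br /&gt;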
&lt;br /&gt;
When I am done finding the supersets, we will determine how to integrate these attributes into the design for the patent table.&lt;br /&gt;
&lt;br /&gt;
'''There appears to be a fifth patent type in the results from Oliver's script, and there are attributes that are unique to Utility patents, though they do not appear particularly useful'''&lt;br /&gt;
&lt;br /&gt;
===Plant Patents===&lt;br /&gt;
* US Claim Statement (XML45) - unclear what this represents&lt;br /&gt;
* Latin Name (XML44)&lt;br /&gt;
* US Botanical Variety (XML44)&lt;br /&gt;
&lt;br /&gt;
===Reissue Patents===&lt;br /&gt;
* Parent Status (XML45)&lt;br /&gt;
* Parent Document (XML44) - seems to be either a normal document or a parent grant document (or both), both contain the following fields&lt;br /&gt;
** Date&lt;br /&gt;
** Document number&lt;br /&gt;
** Country &lt;br /&gt;
** Document ID&lt;br /&gt;
** Kind (XML42)&lt;br /&gt;
* Child Document (XML44)&lt;br /&gt;
** Document ID&lt;br /&gt;
** Document Number&lt;br /&gt;
** Country (XML43)&lt;br /&gt;
&lt;br /&gt;
Note about fields related to the parent document and/or parent grant document:&lt;br /&gt;
The structure of the above fields related to a reissue patent is very odd and varies by XML version. For example, for most of the XML versions, parent grant document falls under parent document and has its own document number, date, country, document ID, kind, etc. The parent doc field falls under relation, which falls under reissue. &lt;br /&gt;
&lt;br /&gt;
However, in XML41, parent document falls under relation and a new category - US Reexamination Reissue Merger. The information pertaining to the child document also falls under both relation and US Reexamination Reissue Merger. It's possible that during the time period this XML version represents, there were two types of reissue patents - some reexamination related to a merger and a standard reissue. With discrepancies like these, we'll have to determine in the end whether we want to stick with the fields as they are designed above (in which case the information about a parent doc or parent grant document is stored the same way whether it is a reissue or a Reexamination Reissue), or whether these two classifications are different enough to warrant duplicate fields in the table.&lt;br /&gt;
&lt;br /&gt;
===Design Patents===&lt;br /&gt;
* length of grant (XML45)&lt;br /&gt;
* Hague Agreement Data (XML45) - allows people to file design patents in 66 countries with one application&lt;br /&gt;
** International Registration Date&lt;br /&gt;
** International Filing Date&lt;br /&gt;
** International Registration Publication Date&lt;br /&gt;
** International Registration Number&lt;br /&gt;
* Classification Locarno (XML40)&lt;br /&gt;
** Edition&lt;br /&gt;
** Main Classification&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please note that the lists above are meant to represent unique types that should be added to the patent database. This does not mean that every XML version contains all these fields, or that every version contains the same path to these fields. That will have to be determined separately. The XML number next to some of the items is meant to represent the latest XML version where the field was seen.&lt;br /&gt;
&lt;br /&gt;
Though I was mainly looking at unique fields in plant, reissue, and design patents, I noticed the following fields on unique paths for utility patents in XML41:&lt;br /&gt;
&lt;br /&gt;
* US Related Documents&lt;br /&gt;
** Substitution&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Parent Document&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Country&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
***** Parent Status&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Country&lt;br /&gt;
** Continuation/Continuation in Part&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Doc&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
** Division&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Kind&lt;br /&gt;
&lt;br /&gt;
This appears to resemble the information about reissue patents. Additionally, a couple of fields appear under the information about bibliographic data that might be useful:&lt;br /&gt;
* US Deceased Inventor (XML41)&lt;br /&gt;
** Post Code&lt;br /&gt;
** Address&lt;br /&gt;
* US Provisional Application Status (XML41)&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shelby_Bice_(Work_Log)&amp;diff=21791</id>
		<title>Shelby Bice (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shelby_Bice_(Work_Log)&amp;diff=21791"/>
		<updated>2017-11-10T22:54:40Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* Fall 2017 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;===Fall 2017===&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt; &lt;br /&gt;
&lt;br /&gt;
[[Shelby Bice]] [[Work Logs]] [[Shelby Bice (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
2017-11-10: 2:00 - 5:00 pm - updated format of worklog, finished researching fields of the new design, utility, and reissue tables, and documenting what each table contains&lt;br /&gt;
&lt;br /&gt;
2017-11-3: 2:30 - 3:30 pm, 3:45 pm - 5:00 pm (I went to get a snack because I was starving!) - Finished up finding unique fields, added them to database design, and researched what the fields represented - again, this is taking a while since there is little to no documentation online about what fields on these XML files actually represent.&lt;br /&gt;
&lt;br /&gt;
2017-11-2: 2:15 pm - 4:00 pm - made a main data page and continued going through the xpaths Oliver found. Notified Michelle so she can continue adding patent pages&lt;br /&gt;
&lt;br /&gt;
2017-10-31: 9:00 am - 10:30 am - started trying to determine from Oliver's research what paths are unique to utility, reissue, and plant patents that we have not yet accounted for in our design of the table&lt;br /&gt;
&lt;br /&gt;
2017-10-27: 2:00 pm - 3:30 pm - Kept updating tables. For some of the new values I've been adding to tables (I didn't realize we had XPaths for some fields) I'm struggling to find out exactly what the field means (for example, the date on a citation - is this the date the citedpatent was granted? or is the date that the citingpatent cited the citedpatent?) so the process is slow going. I want to get the documentation right though so in the future someone can look up and see exactly what each field in the table represents and not have to guess&lt;br /&gt;
&lt;br /&gt;
2017-10-26: 2:00 pm - 4:00 pm - worked on updating the designs for the patent databases tables based on what was discussed on Friday (specifically adding inventors and lawyers tables, altering fields on patent tables) - will continue this on Friday&lt;br /&gt;
&lt;br /&gt;
2017-10-20: - 2:00 pm - 5:00 pm - met with Ed and Oliver about patent database design&lt;br /&gt;
&lt;br /&gt;
2017-10-19: 2:00 pm - 4:00 pm - finished up ER diagram and description of all tables for patent database, started reading papers concerning creating an inventor's database (looks like other research groups have merged the USPTO data with the Harvard Dataverse data in order to create an inventors table)&lt;br /&gt;
&lt;br /&gt;
2017-10-17: 8:45 am - 10:30 am - continued trying to solve twitter issues, worked on ER diagram, skimmed through a paper I found (linked at the top of the page for Redesign Assignment and Patent Database) to see how they cleaned the data, since I assume that will be the next step to research after finishing the ER diagrams&lt;br /&gt;
&lt;br /&gt;
2017-10-12: 2:30 pm - 5:00 pm - wrote blog post on Grace Hopper Celebration and attempted to solve Twitter issue&lt;br /&gt;
&lt;br /&gt;
2017-10-03: 8:45 am - 10:30 am - finished adding descriptions to the fields for each table in the patent database, started work on an ER diagram&lt;br /&gt;
&lt;br /&gt;
2017-9-29: 2:00 pm - 5:00 pm - finished adding the design for the Patent database to the project page, added descriptions of the fields for each table to the project page including the datatype that I think the field will be when it's loaded into the database&lt;br /&gt;
&lt;br /&gt;
2017-9-28: 9:00 am - 10:30 am - continued going over and documenting last semester's patent database design and adding the details to the Redesign Assignment and Patent Database project page. Additionally, I began trying to determine how to match up the information in the Document_Info table in Assignment to match up with a patent_id in the Patent table in Patent.&lt;br /&gt;
&lt;br /&gt;
Tomorrow I will finish adding the design for the Patent database to the project page, add descriptions of the fields for each table to the project page, and start working on ER diagrams for the two databases.&lt;br /&gt;
&lt;br /&gt;
Links for creating ER diagrams: https://erdplus.com/#/standalone or https://creately.com/app/?tempID=hqdgwjki1&amp;amp;login_type=demo#&lt;br /&gt;
&lt;br /&gt;
2017-9-26: 8:45 am - 10:00 am - continued working on design of Assignment database by checking my design against the work done last semester on the assignment data restructure to make sure I didn't miss anything major. Began going over my patent database design from last semester to tweak it. Will need to sync up with Joe Reilly to see if there are any new fields that we are pulling from the data. Additionally, I made a new project page called Redesign Assignment and Patent Database that encompasses the new design for the Assignment database and Patent database redesign and moved some of the notes from McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper to the project page.&lt;br /&gt;
&lt;br /&gt;
The main takeaway from looking over Patent Assignment Data Restructure is that, after assembling the table according to my design (which doesn't seem to have any contradictions with the Patent Assignment Data Restructure), there will be multiple steps for cleaning the data, specifically the fields relating to location and address in the assignment table. While the Patent Assignment Data Restructure mentions connecting to the Patent database, it is not clear from the page what field would be used to connect to the Patent database.&lt;br /&gt;
&lt;br /&gt;
2017-9-23: 2:00 pm - 4:00 pm - continued working on design of Assignment database and how it will connect to Patent database by writing out what will be in each table in Assignment and questions about different possible structures of tables that we will have to address before finalizing the design - the notes can be found under McNair/Projects/Redesigning Patent Database/New Patent Database Project as Notes on USPTO Assignment Data Paper. Questions are highlighted in yellow throughout the document[[Category:Work Log]]&lt;br /&gt;
&lt;br /&gt;
2017-9-22: 8:30 am - 10:30 am - continued looking at paper on USPTO assignment data and adding to the notes on what the design of that database should look like, specifically on what I need for different tables and what I don't know yet about the design. Had to set up connection to RDP again due to technical issues. &lt;br /&gt;
&lt;br /&gt;
2017-9-15: 2:00 pm - 5:00 pm - introduced to new patent database project, reviewed and took notes on USPTO Assignment data (notes can be found under McNair/Projects/Redesigning Patent Database/New Patent Database Project as Notes on USPTO Assignment Data Paper)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Spring 2017===&lt;br /&gt;
&lt;br /&gt;
2017-4-27:  1:00 pm - 3:00 pm - worked on documentation more, tried to figure out how to clean citation data&lt;br /&gt;
&lt;br /&gt;
2017-4-25:  10:00 am - 12:00 pm - worked on documentation, tried to determine how to clean up the USPTO Assignee Data&lt;br /&gt;
&lt;br /&gt;
2017-4-20:  10:00 am - 11:30 pm - wrote copy statements for copying data from RDP to database, continued working on documentation.&lt;br /&gt;
&lt;br /&gt;
2017-4-18:  9:45 am - 12:30 pm - Cleaned up documentation more, kept working through the process of parsing the data&lt;br /&gt;
&lt;br /&gt;
2017-4-13: 9:30 am - 12:00 pm - Continued working on trying to update patent data through 2016, specifically parsing the data, worked with Ed to update perl scripts&lt;br /&gt;
&lt;br /&gt;
2017-4-11: 9:15 am - 12:00 pm - Worked on trying to update patent data through 2016&lt;br /&gt;
&lt;br /&gt;
2017-4-6: 9:30 am - 11:30 pm - Kept looking through DTDs, kept updating documentation&lt;br /&gt;
&lt;br /&gt;
2017-4-4: 9:00 am - 11:45 pm - Worked on updating documentation, found documentation on pulling data/making tables and databases, started looking through DTDs to find extra fields to pull&lt;br /&gt;
&lt;br /&gt;
2017-3-23: 9:15 am - 12:00 pm - Narrowed down core tables and fields&lt;br /&gt;
&lt;br /&gt;
2017-3-22: 5:30 pm to 6:30 pm - Patent Data meeting&lt;br /&gt;
&lt;br /&gt;
2017-3-21: 9:30 am - 12:00 pm - Worked on determining &amp;quot;core&amp;quot; tables for new patent database&lt;br /&gt;
&lt;br /&gt;
2017-3-9: 9:00 am - 12:00 pm Finished first draft of spreadsheet describing the current schema (and possible changes) to the Patent database&lt;br /&gt;
&lt;br /&gt;
2017-3-7: 9:30 am - 12:00 pm Continued working on spreadsheet, added relevant page links to project page, took notes on what I want documentation to look like in the future&lt;br /&gt;
&lt;br /&gt;
2017-3-2: 9:15 am - 12:15 pm Started excel spreadsheet to document current schema design and improvements to be made, updated project pages&lt;br /&gt;
&lt;br /&gt;
2017-2-23: 9:30 am - 12:00 pm Reviewed Perl, read about database design, set up project page for redesigning database, started documenting process&lt;br /&gt;
&lt;br /&gt;
2017-2-21: 9:00 am - 12:00 pm Set up work page, reviewed SQL, researched designing database, continued going through wiki&lt;br /&gt;
&lt;br /&gt;
2017-2-16: 10:30 am - 12:00 pm Researched past work on databases, discussed project with Ed&lt;br /&gt;
&lt;br /&gt;
2017-2-14: 10:00 am - 12:00 pm Set up personal wiki, set up work log&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=21790</id>
		<title>Redesign Assignment and Patent Database</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=21790"/>
		<updated>2017-11-10T22:53:19Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Redesign Assignment and Patent Database&lt;br /&gt;
|Has owner=Shelby Bice,&lt;br /&gt;
|Has start date=9/2017&lt;br /&gt;
|Has keywords=patent&lt;br /&gt;
|Is dependent on=Reproducible Patent Data,&lt;br /&gt;
}}&lt;br /&gt;
'''FOR ED:''' I finished going through the unique paths (you can see the unique fields I identified at the bottom of this page, under &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot;). Then, under my design for the Patent Database, there's a new section called &amp;quot;Unique Attributes Tables&amp;quot; where I've begun detailing a design for three new tables that would include information that is unique to Design, Reissue, and Plant patents. Here are some questions I have:&lt;br /&gt;
&lt;br /&gt;
'''1.''' Do you like the design for the new tables? How do you think we should populate these tables? I can think of one way (putting all the information in the main patent tables and then moving it out to Plant, Reissue, or Design as needed) but I was wondering if there might be an easier way.&lt;br /&gt;
&lt;br /&gt;
'''2.''' There were some unique attributes to utility patents that I also included in &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot;. They only appear to be in two of the oldest XML versions (4.1 and 4.0) and I'm not sure the information is particularly useful - but please take a look at the field names and let me know if you disagree and think they are useful and should be included in a table in the patent database.&lt;br /&gt;
&lt;br /&gt;
'''3.''' Do you know what some of these fields represent? For example, I've been trying to research online what a &amp;quot;Parent Document&amp;quot; and/or &amp;quot;Parent Grant Document&amp;quot; might represent for a reissue patent, and I've found a couple possible options. I will keep researching, but if you know what they represent or one of my descriptions for a field is wrong, please let me know.&lt;br /&gt;
&lt;br /&gt;
This is an extension of the work I did last semester under &amp;quot;Redesigning Patent Database&amp;quot;. Instead of simply reconfiguring the existing database, this project encompasses a full redesign and creation of a new Patent database and a new Assignment database that will be joined together.&lt;br /&gt;
&lt;br /&gt;
Reference for definitions of fields in the two databases: https://www.oecd.org/sti/sci-tech/37569498.pdf&lt;br /&gt;
&lt;br /&gt;
Paper on how to extract data from XML files, parse it, and put it into a database: https://funginstitute.berkeley.edu/wp-content/uploads/2014/06/patentprocessor.pdf&lt;br /&gt;
&lt;br /&gt;
'''Particularly useful, because it provides solutions for cleaning location and firm data'''&lt;br /&gt;
&lt;br /&gt;
==ER Diagram for Assignment and Patent Databases==&lt;br /&gt;
&lt;br /&gt;
Note: this is currently out of sync with the current description below for the patent database. It will be updated once I have finalized how we will organize unique attributes for plant, design, and reissue patents.&lt;br /&gt;
&lt;br /&gt;
[[File:PatentAndAssignmentER2.png]]&lt;br /&gt;
&lt;br /&gt;
Attributes for each table are listed below with descriptions - because of how many attributes there are, I decided the ER diagram would be better suited as an overview of the tables rather than trying to show all the attributes on the diagram.&lt;br /&gt;
&lt;br /&gt;
==Assignment Database Structure==&lt;br /&gt;
&lt;br /&gt;
After reading through the paper &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot; (https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf) and taking into account our needs for the Assignment database (the ability to connect the two databases together, the ability to trace conveyance of a patent over time), I decided on the following structure for the Assignment database.&lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table in the database is rf_id, which stands for reel frame id. It is unique for each entry in the main table, ASSIGNMENT, and is included in every table.&lt;br /&gt;
&lt;br /&gt;
Conveyance is the only table that cannot be fully constructed as data is inserted into the database. Some of the fields depend on other tables being constructed, so it will either be partially populated while the other tables are being populated or be the last table to be populated.&lt;br /&gt;
&lt;br /&gt;
Full list of tables: assignee, assignor, document_info, assignment, conveyance&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: store all the assignees from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the recipient of a patent in an entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id, unless it is possible to have multiple assignees, then rf_id and assignee name&lt;br /&gt;
&lt;br /&gt;
All columns (variables) in the table: &lt;br /&gt;
&lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* address_line_1 (varchar(255)) first line of address of assignee&lt;br /&gt;
* address_line_2 (varchar(255)) second line of address of assignee&lt;br /&gt;
* city (varchar(255)) city of assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* country (varchar(255)) country of assignee&lt;br /&gt;
* postal_code (varchar(255)) post code of the assignee&lt;br /&gt;
&lt;br /&gt;
===ASSIGNOR===&lt;br /&gt;
	&lt;br /&gt;
Purpose: store all the assignors from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the one assigning/granting the license for the entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id and assignor_name&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
&lt;br /&gt;
===CONVEYANCE===&lt;br /&gt;
Purpose: represent the conveyance of a particular patent over time&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual transaction for a patent (not unlike assignment, except this table has information about the assignment in relation to other assignments for the patent)&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id &lt;br /&gt;
&lt;br /&gt;
All columns (variables): Two possibilities, depending on whether we want data to be repeated or not.&lt;br /&gt;
	&lt;br /&gt;
Not-repeated (with exception of primary/foreign key): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* next_transaction (bigint) rf_id of next transaction that occurred after this transaction (traced by assignee/assignor)&lt;br /&gt;
* num_conveyance (int) number in order of conveyance by date&lt;br /&gt;
* acknowledgment_date (date) date that the assignor acknowledged the transaction&lt;br /&gt;
* execution_date (date) date that the transaction took place&lt;br /&gt;
* conveyance_type (varchar(255)) the USPTO paper determined a couple of different conveyance types and also outlined how to determine based on the conveyance text in the xml file which conveyance type is relevant to the transaction. The conveyance types are:&lt;br /&gt;
** assignment&lt;br /&gt;
** merger&lt;br /&gt;
** change of name&lt;br /&gt;
** government interest&lt;br /&gt;
** agreement&lt;br /&gt;
** security agreement&lt;br /&gt;
** release&lt;br /&gt;
&lt;br /&gt;
	&lt;br /&gt;
Repeated, also include: &lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
* patent_number (varchar(255)) identifying number for the patent&lt;br /&gt;
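&lt;br /&gt;
A minimal Python sketch of how num_conveyance and next_transaction could be filled in once the other tables are populated. It assumes that each rf_id covers a single patent and that ordering by execution_date is enough; tracing by assignee/assignor names is not attempted. The function name and the sample rows are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict
from datetime import date


def build_conveyance_chain(transactions):
    """Return {rf_id: (num_conveyance, next_transaction)}.

    transactions: iterable of (rf_id, patent_number, execution_date) tuples.
    Assumption: transactions are ordered by execution_date within a patent.
    """
    by_patent = defaultdict(list)
    for rf_id, patent_number, execution_date in transactions:
        by_patent[patent_number].append((execution_date, rf_id))

    chain = {}
    for rows in by_patent.values():
        rows.sort()  # earliest execution_date first, ties broken by rf_id
        for position, (_, rf_id) in enumerate(rows):
            is_last = position == len(rows) - 1
            next_rf_id = None if is_last else rows[position + 1][1]
            chain[rf_id] = (position + 1, next_rf_id)
    return chain


# Hypothetical sample data, not taken from the USPTO files.
sample = [
    (300, 'US1234567', date(2012, 5, 1)),
    (100, 'US1234567', date(2005, 3, 9)),
    (200, 'US1234567', date(2008, 7, 21)),
    (400, 'US7654321', date(2010, 1, 4)),
]
chain = build_conveyance_chain(sample)
```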
&lt;br /&gt;
===DOCUMENT_INFO===&lt;br /&gt;
Purpose: store extra information relevant to the patents represented in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the documentation and the patent for a particular entry in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables) when created: &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* app_num (varchar(255)) application document USPTO number &lt;br /&gt;
* app_date (date) date that the application was filed&lt;br /&gt;
* app_country (varchar(255)) country in which it was filed&lt;br /&gt;
* pgpub_num (varchar(255)) pre-grant publication document USPTO number&lt;br /&gt;
* pgpub_date (date) date that pre-grant publication was released&lt;br /&gt;
* pgpub_country (varchar(255)) country in which the patent is published before being granted a patent number&lt;br /&gt;
* patent_number (varchar(255)) granted patent document USPTO number&lt;br /&gt;
* grant_date (date) date that the patent is officially published&lt;br /&gt;
* grant_country (varchar(255)) country in which the patent is published after being granted a patent number&lt;br /&gt;
* invention_title (varchar(255)) title of the invention&lt;br /&gt;
* language (varchar(255)) language of invention title – could be potentially useful and interesting to investigate&lt;br /&gt;
* reel_num (bigint) number of the reel the assignment was stored on&lt;br /&gt;
* frame_num (bigint) number of the frame on the reel the assignment was stored on&lt;br /&gt;
&lt;br /&gt;
===ASSIGNMENT===&lt;br /&gt;
Purpose: represent assignment transactions – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent assignment transaction&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignment_id (bigint) US patent assignment id &lt;br /&gt;
* correspondent_name (varchar(255)) name of correspondent for the transaction, could be a lawyer&lt;br /&gt;
* correspondent_address1 (varchar(255)) first line of address for correspondent&lt;br /&gt;
* correspondent_address2 (varchar(255)) second line of address for correspondent&lt;br /&gt;
* correspondent_address3 (varchar(255)) third line of address for correspondent&lt;br /&gt;
* correspondent_address4 (varchar(255)) fourth line of address for correspondent&lt;br /&gt;
* record_date (date) date recorded with USPTO&lt;br /&gt;
* last_update_date (date) date that information for this assignment was last updated with the USPTO&lt;br /&gt;
* page_cnt (bigint) page count of assignment record&lt;br /&gt;
&lt;br /&gt;
More extensive notes exist under :E/McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper&lt;br /&gt;
&lt;br /&gt;
==Patent Database Structure==&lt;br /&gt;
&lt;br /&gt;
Most of this design is based off what I investigated last semester and therefore is subject to change, since the design for the Assignment database is somewhat new. &lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table is patent_number, which uniquely identifies every patent in the PATENT table and is included in every table.&lt;br /&gt;
&lt;br /&gt;
Full list of tables: patent, assignee, citation, fee, claims, histpatent, lawyers, inventors&lt;br /&gt;
&lt;br /&gt;
Update: These tables, as of October 26, are being updated and changed as I determine what data can actually be retrieved from the XML files based on the xpaths that we have determined (you can see these xpaths on the &amp;quot;Equivalent XPath and APS Queries&amp;quot; project page).&lt;br /&gt;
&lt;br /&gt;
===PATENT===&lt;br /&gt;
Purpose: represent patents – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* patent_type (varchar(255)) the type of patent, like utility&lt;br /&gt;
* patent_kind (varchar(2)) the kind of patent, a letter and a number&lt;br /&gt;
* title (varchar(255)) the title of the patent&lt;br /&gt;
* grantdate (date) date that patent was officially granted including the year&lt;br /&gt;
* prioritydate (date) date used to establish novelty of an invention with regard to other inventions (source: http://www.bios.net/daisy/patentlens/2343.html)&lt;br /&gt;
* prioritycountry (varchar(255)) country where patent is first filed&lt;br /&gt;
* patent_country (varchar(255)) country in which this patent was published&lt;br /&gt;
* prioritypatentnumber (varchar(255)) placeholder patent number if the patent is published before it is fully approved - also referred to as priority claims number&lt;br /&gt;
* cpcsubclass (varchar(255)) cooperative patent classification subclass&lt;br /&gt;
* cpcsubgroup (varchar(255)) cooperative patent classification subgroup&lt;br /&gt;
* cpcmaingroup (varchar(255)) cooperative patent classification main group&lt;br /&gt;
* cpctotal (varchar(255)) concatenated version of cpcmaingroup, cpcsubgroup, cpcsubclass&lt;br /&gt;
* pctpatentnumber (varchar(255)) international patent number (according to Patent Cooperation Treaty) also known as PCT Document Number&lt;br /&gt;
* appnum (varchar(255)) application number, probably also the filing number - the format is a two digit series code followed by a six digit serial number (source: https://www.uspto.gov/patents-application-process/filing-online/info-application-number)&lt;br /&gt;
* appdate (date) date that the application was filed, including year - probably also the filing date&lt;br /&gt;
* ipcrsubclass (varchar(255)) International patent classification subclass&lt;br /&gt;
* ipcrmaingroup (varchar(255)) International patent classification main group&lt;br /&gt;
* ipcrsubgroup (varchar(255)) International patent classification subgroup&lt;br /&gt;
* ipcrtotal (varchar(255)) concatenated version of ipcrsubgroup, ipcrsubclass, ipcrmaingroup&lt;br /&gt;
* national_classification (varchar(255)) - probably uspc, which is United States Patent Classification&lt;br /&gt;
* national_classification_country (varchar(255)) - should be United States&lt;br /&gt;
* num_claims (int) number of claims (things that the inventor wishes to protect with this patent)&lt;br /&gt;
&lt;br /&gt;
Discrepancies:&lt;br /&gt;
These fields were previously a part of the table, but we do not currently have xpaths for them:&lt;br /&gt;
* app_year - the year the application was filed&lt;br /&gt;
* grant_year - the year the patent was granted&lt;br /&gt;
* uspc - United States Patent Classification (might be the same as national_classification)&lt;br /&gt;
* uspcsub - United States Patent subgroup&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: Represent the person the patent was currently assigned to at publication according to the patent information (the information in the Assignment database is more complete) &lt;br /&gt;
&lt;br /&gt;
Each entry represents: the assignee for a patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number, first_name, last_name (in case it is possible for there to be multiple assignees for an individual patent)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* first_name (varchar(255)) first name of assignee&lt;br /&gt;
* last_name (varchar(255)) last name of assignee&lt;br /&gt;
* address (varchar(255)) address of assignee&lt;br /&gt;
* postcode (varchar(255)) postcode of assignee&lt;br /&gt;
* orgname (varchar(255)) name of the organization owning the patent (if applicable - organization may go by multiple orgnames) &lt;br /&gt;
* city (varchar(255)) city of the assignee&lt;br /&gt;
* country (varchar(255)) country of the assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* residence (varchar(255)) mailing address of assignee if they receive mail at a different address than the one listed above&lt;br /&gt;
* size_of_firm (varchar(255)) size of the organization (large, small, or micro) - relates to the size of the fee they pay for the patent&lt;br /&gt;
&lt;br /&gt;
===CITATION===&lt;br /&gt;
Purpose: Represent all the patents in PATENT that another patent in PATENT cites&lt;br /&gt;
&lt;br /&gt;
Each entry represents: one citation that a patent in PATENT makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number AND cited_patent_number (because a patent might cite multiple patents, so both variables are needed for the primary key)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* cited_patent_number (varchar(255)) unique identifier from the USPTO for the patent being cited&lt;br /&gt;
* description (varchar(255)) description of the citation&lt;br /&gt;
* citation_date (date) unclear what this date is supposed to represent from the DTD, but according to the paper that is the second link in the top section, it is the &amp;quot;date of cited document&amp;quot;&lt;br /&gt;
* citation_kind (varchar(255)) type of cited document&lt;br /&gt;
* citation_country (varchar(255)) country of origin for the cited patent&lt;br /&gt;
* citation_name (varchar(255)) appears to be the names of those to whom the cited patent is assigned&lt;br /&gt;
&lt;br /&gt;
Note: There are other fields available for this table, such as document number, but this seems like it might be a repeat of the document number for the citing patent&lt;br /&gt;
&lt;br /&gt;
===FEE===&lt;br /&gt;
Purpose: Represent information about the fees paid for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the fees paid on the patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* four (boolean) payment of maintenance fee 4th year&lt;br /&gt;
* eight (boolean) payment of maintenance fee 8th year&lt;br /&gt;
* twelve (boolean) payment of maintenance fee 12th year&lt;br /&gt;
* fee_date (date) date that fee is paid on&lt;br /&gt;
* fee_code (varchar(255)) code for type of fee to pay&lt;br /&gt;
&lt;br /&gt;
===CLAIMS===&lt;br /&gt;
&lt;br /&gt;
Purpose: Represent information about the claims for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a claim that a patent makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number and claim_id&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* claim_id (varchar(255)) id number for the claim&lt;br /&gt;
* language (varchar(255)) language of claim&lt;br /&gt;
* claim_type (varchar(255)) type of claim&lt;br /&gt;
* status (varchar(255)) status of the claim&lt;br /&gt;
* claim_text (text) the text of the claim&lt;br /&gt;
&lt;br /&gt;
===HISTPATENT===&lt;br /&gt;
Purpose: Represent historical data about the patent pertaining to dates and publication&lt;br /&gt;
&lt;br /&gt;
Each entry represents: sequence of historical data about an individual patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* pubno (varchar(255)) patent publication number&lt;br /&gt;
* pubdate (date) date patent was issued&lt;br /&gt;
* dispdate (date) disposal date for the application&lt;br /&gt;
* distype (varchar(3)) application status of the patent - ABN for abandoned, ISS for issued, or PEN for pending&lt;br /&gt;
* exp_dt (date) expiration date of patent&lt;br /&gt;
* pta (int) length of patent term adjustment (which extends the amount of time the patent is allowed to be in force) in number of days&lt;br /&gt;
&lt;br /&gt;
===INVENTORS===&lt;br /&gt;
Purpose: Represent the inventors for each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about an inventor for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (and name, if there can be multiple inventors for a single patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identify the inventor?&lt;br /&gt;
* name (varchar(255)) concatenated string of the first and last name of the inventor&lt;br /&gt;
* orgname (varchar(255)) name of the organization the inventor works for&lt;br /&gt;
* state (varchar(255)) state in which the inventor resided&lt;br /&gt;
* address (varchar(255)) address at which the inventor resided&lt;br /&gt;
* country (varchar(255)) country in which the inventor resided&lt;br /&gt;
* city (varchar(255)) city in which the inventor resided&lt;br /&gt;
&lt;br /&gt;
===LAWYERS===&lt;br /&gt;
Purpose: Represent the lawyers who worked on each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a lawyer for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (there doesn't appear to be more than one lawyer entry for each patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identify the lawyer?&lt;br /&gt;
* org_name (varchar(255)) the name of the firm that did arbitration for the patent&lt;br /&gt;
&lt;br /&gt;
===Unique Attributes Tables===&lt;br /&gt;
&lt;br /&gt;
There are several attributes that have been identified and are unique, applying to only one kind of patent. Adding all these fields to the main patent table would introduce a lot of NULL entries, because three out of the four patent types will not have any particular attribute. Instead, I have created three new tables to store these unique attributes - one for each type of patent with unique attributes.&lt;br /&gt;
&lt;br /&gt;
For each table, the primary key is the patent_number, which connects a row to a row in the main patent table. &lt;br /&gt;
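&lt;br /&gt;
One possible way to populate such a table, sketched in Python with an in-memory SQLite database: route the type-specific fields to the side table at insert time, based on patent_type, instead of moving them out of the main table afterwards. This is only an illustration under assumptions; the parsed records, the reduced column lists, and the use of SQLite are hypothetical and do not reflect the existing loading code.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript('''
    CREATE TABLE patent (
        patent_number VARCHAR(255) PRIMARY KEY,
        patent_type   VARCHAR(255)
    );
    CREATE TABLE plant (
        patent_number VARCHAR(255) PRIMARY KEY REFERENCES patent (patent_number),
        latin_name    VARCHAR(255)
    );
''')

# Hypothetical parsed records: (patent_number, patent_type, type-specific fields)
parsed = [
    ('PP00001', 'plant', {'latin_name': 'Rosa hybrida'}),
    ('0000002', 'utility', {}),
]
for patent_number, patent_type, extra in parsed:
    conn.execute('INSERT INTO patent VALUES (?, ?)', (patent_number, patent_type))
    # Only plant patents get a row in the side table, so no NULL padding is needed.
    if patent_type == 'plant':
        conn.execute('INSERT INTO plant VALUES (?, ?)',
                     (patent_number, extra.get('latin_name')))

# Joining on patent_number recovers the plant-specific attributes.
plant_rows = conn.execute(
    'SELECT p.patent_number, p.patent_type, pl.latin_name '
    'FROM patent AS p JOIN plant AS pl USING (patent_number)'
).fetchall()
```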
&lt;br /&gt;
TODO: Figure out best way to generate these tables, finish description of attributes for each table.&lt;br /&gt;
&lt;br /&gt;
* Note that I did include some unique attributes for Utility patents as seen below in the &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot; section, but for now (11/3/2017) I'm not going to create a separate table for those fields. Most of them seem to be repeated in Reissue, and the other fields, while interesting, are perhaps not very useful.&lt;br /&gt;
&lt;br /&gt;
====PLANT====&lt;br /&gt;
=====Fields=====&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* latin_name (varchar(255)) Latin name for the plant&lt;br /&gt;
* us_botanical_variety (varchar(255)) denotes what variety of plant it is - for example, a rose has several different varieties&lt;br /&gt;
&lt;br /&gt;
====REISSUE====&lt;br /&gt;
&lt;br /&gt;
In the following table, you'll notice lots of fields related to three different kinds of &amp;quot;documents&amp;quot; - a &amp;quot;parent document&amp;quot;, a &amp;quot;child document&amp;quot;, and a &amp;quot;parent grant document&amp;quot;. It is not immediately clear what these three documents represent for a reissue patent. After some research, I think I have determined what these &amp;quot;documents&amp;quot; are, but please know that I do not have definitive proof as there is little available information about these &amp;quot;documents&amp;quot; with regard to a reissue patent.&lt;br /&gt;
&lt;br /&gt;
It is possible that the &amp;quot;parent document&amp;quot; is the parent patent application - that is, the first patent application filed with regard to an invention. That would explain why reissue patents also have several fields related to a &amp;quot;child document&amp;quot; - a &amp;quot;child document&amp;quot; could be a child application, which is filed while a parent application is still pending. So in this case, a &amp;quot;child document&amp;quot; of a reissue patent would be an application regarding the same patent that was filed while the reissue application was still pending. A &amp;quot;child document&amp;quot; is either a continuation, divisional, or continuation-in-part application ([http://www.patenttrademarkblog.com/parent-and-child-patent-applications/ source]).&lt;br /&gt;
&lt;br /&gt;
Based on this logic, I think that the difference between a &amp;quot;parent document&amp;quot; and a &amp;quot;parent grant document&amp;quot; depends on whether the parent patent application has been granted. If the parent patent application is still pending (meaning the patent has not yet been granted) then I believe the reissue patent will store information about the parent patent under &amp;quot;parent document&amp;quot;. However, if the parent patent application has been granted, then the information will be stored under &amp;quot;parent grant document&amp;quot;. This seems like the most logical explanation, especially considering that the path to any field related to &amp;quot;parent grant document&amp;quot; contains &amp;quot;parent document&amp;quot;, as in the example below:&lt;br /&gt;
&lt;br /&gt;
 us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
&lt;br /&gt;
Therefore, a particular reissue patent will probably only have the fields filled for either the &amp;quot;parent document&amp;quot; or &amp;quot;parent grant document&amp;quot; if I am right about what they represent.&lt;br /&gt;
&lt;br /&gt;
=====Fields=====&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* parent_doc_status (varchar(255)) status of the parent application. Probably related to whether the patent application is pending or not&lt;br /&gt;
* parent_doc_number (int) probably application number for the parent document&lt;br /&gt;
* parent_doc_id (varchar(255)) it is unclear how this is different from parent document number. In the xpaths, the path will be something like ./parent_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent document&amp;quot; is - or may refer to its purpose (i.e. reissue, continuation, continuation-in-part, etc.)&lt;br /&gt;
* parent_doc_country (varchar(255)) the country of origin of the parent document&lt;br /&gt;
* parent_doc_date (date) probably the date the parent document was filed&lt;br /&gt;
* parent_grant_doc_number (int) probably application number for the parent grant document&lt;br /&gt;
* parent_grant_doc_id (varchar(255)) it is unclear how this is different from parent grant document number. In the xpaths, the path will be something like ./parent_doc/parent-grant-doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_grant_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent grant document&amp;quot; is, or may refer to its purpose (i.e. reissue, continuation, continuation-in-part, etc.)&lt;br /&gt;
* parent_grant_doc_country (varchar(255)) the country of origin of the parent grant document&lt;br /&gt;
* parent_grant_doc_date (date) probably the date the parent grant document was filed&lt;br /&gt;
* child_doc_number (varchar(255)) probably the application number of the child application&lt;br /&gt;
* child_doc_id (varchar(255)) it is unclear how this is different from child document number. In the xpaths, the path will be something like ./child_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* child_doc_country (varchar(255)) country of origin of the child application&lt;br /&gt;
&lt;br /&gt;
====DESIGN====&lt;br /&gt;
=====Fields=====&lt;br /&gt;
* length_of_grant (int) length of grant, most likely in years&lt;br /&gt;
* hague_registration_date (date) filing date of international patent application&lt;br /&gt;
* hague_filing_date (date) not necessarily the same as the filing date of the international patent application, this is the date that the International Bureau receives all necessary elements for the international patent application&lt;br /&gt;
* hague_registration_pub_date (date) date that the International Bureau publishes the international patent application&lt;br /&gt;
* hague_international_registration_number (varchar(255)) international registration number&lt;br /&gt;
* edition (varchar(255)) possibly the edition of the Classification Locarno which determined main_classification&lt;br /&gt;
* main_classification (varchar(255)) classification for what type of design the patent is for&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Connecting Patent database and Assignment database==&lt;br /&gt;
The answer to connecting the Patent database to the Assignment database lies in using the information in the table DOCUMENT_INFO to connect each assignment in the Assignment database ASSIGNMENT table to a patent_number from the Patent database PATENT table. On further investigation into &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot;, the field they called &amp;quot;grant_doc_num&amp;quot; (patent_number above) in DOCUMENT_INFO is the patent number (the description of DOCUMENT_INFO above has been altered to reflect this).&lt;br /&gt;
&lt;br /&gt;
The paper mentions that there will be errors in the patent number stored in DOCUMENT_INFO under grant_doc_num, so a separate table called DOCUMENT_INFO_ADMIN was constructed to determine how prevalent errors were. They queried the patent number for the appno_doc_num from administrative data. The &amp;quot;appno_doc_num&amp;quot; is the Application Document USPTO number from DOCUMENT_INFO. On further investigation into the paper, it became clear that &amp;quot;administrative data&amp;quot; refers to internal USPTO data that was available to the authors of the paper, but that we might not have. However, in 99% of cases the grant_doc_num and the queried patent number based on the appno_doc_num were the same, so we can probably rely on grant_doc_num.&lt;br /&gt;
&lt;br /&gt;
Therefore, since every table in the Patent database will have the patent number stored, to connect any table in Assignment to the Patent database, it would first be joined with DOCUMENT_INFO on rf_id and then joined with the appropriate table in Patent on patent number.&lt;br /&gt;
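&lt;br /&gt;
The two-step join described above can be sketched in Python with an in-memory SQLite database. This is an illustration only: it assumes both sets of tables are reachable from one connection, uses reduced column lists, and the inserted rows are made-up sample values.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript('''
    CREATE TABLE assignor (rf_id BIGINT, assignor_name VARCHAR(255),
                           PRIMARY KEY (rf_id, assignor_name));
    CREATE TABLE document_info (rf_id BIGINT PRIMARY KEY,
                                patent_number VARCHAR(255));
    CREATE TABLE patent (patent_number VARCHAR(255) PRIMARY KEY,
                         title VARCHAR(255));
    -- Made-up sample rows for illustration.
    INSERT INTO assignor VALUES (12345678, 'EXAMPLE ASSIGNOR INC');
    INSERT INTO document_info VALUES (12345678, '7000001');
    INSERT INTO patent VALUES ('7000001', 'Example widget');
''')

# Step 1: join the Assignment table with DOCUMENT_INFO on rf_id.
# Step 2: join the result with the Patent table on patent_number.
rows = conn.execute('''
    SELECT a.assignor_name, d.patent_number, p.title
    FROM assignor AS a
    JOIN document_info AS d ON d.rf_id = a.rf_id
    JOIN patent AS p ON p.patent_number = d.patent_number
''').fetchall()
```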
&lt;br /&gt;
Also in the paper, they mentioned the Assignments on the Web for Patents (AOTW-P), a searchable database of individual USPTO assignment records keyed on reel-frame identification, patent number, and assignor or assignee name (https://assignment.uspto.gov/patent/index.html#/patent/search). Obviously it would not be possible to individually use this tool to query all the patent numbers, but if it were possible to write a script to query each patent number using the rf_id and parse the response, this could be useful for checking the patent numbers, though it might not be any more accurate than what will already be in DOCUMENT_INFO.&lt;br /&gt;
&lt;br /&gt;
==Finding New Paths Unique to Plant, Reissue, and/or Design Patents==&lt;br /&gt;
&lt;br /&gt;
Based on Oliver's script, which searched all xpaths and compared which were unique to particular types, we see that the following attributes are unique to each type of patent other than utility patents. These attributes vary by XML version, which changed over time. Therefore, the lists below are a superset of the attributes that are unique to each of the patent types listed below across all XML versions.&lt;br /&gt;
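&lt;br /&gt;
The idea of comparing paths across types and then taking a superset across XML versions can be sketched as below. This is not Oliver's script; it is a small stand-in that assumes the xpaths have already been collected into sets, and the path strings are made-up placeholders rather than real USPTO paths.&lt;br /&gt;

```python
# Hypothetical input: the xpaths observed for each patent type in each XML
# version. The path strings are placeholders, not real USPTO paths.
paths_by_version = {
    "XML44": {
        "utility": {"root/shared", "root/utility-only"},
        "plant": {"root/shared", "root/plant-a"},
    },
    "XML45": {
        "utility": {"root/shared"},
        "plant": {"root/shared", "root/plant-b"},
    },
}


def unique_paths(paths_by_type):
    """Return, for each type, the paths that no other type contains."""
    unique = {}
    for patent_type, paths in paths_by_type.items():
        others = set()
        for other_type, other_paths in paths_by_type.items():
            if other_type != patent_type:
                others.update(other_paths)
        unique[patent_type] = paths - others
    return unique


def unique_superset(paths_by_version):
    """Union the per-version unique paths into one superset per type."""
    superset = {}
    for paths_by_type in paths_by_version.values():
        for patent_type, paths in unique_paths(paths_by_type).items():
            superset.setdefault(patent_type, set()).update(paths)
    return superset


superset = unique_superset(paths_by_version)
print(sorted(superset["plant"]))
```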
&lt;br /&gt;
When I am done finding the supersets, we will determine how to integrate these attributes into the design for the patent table.&lt;br /&gt;
&lt;br /&gt;
'''There appears to be a fifth patent type in the results from Oliver's script, and there are attributes that are unique to Utility patents, though they do not appear particularly useful'''&lt;br /&gt;
&lt;br /&gt;
===Plant Patents===&lt;br /&gt;
* US Claim Statement (XML45) - unclear what this represents&lt;br /&gt;
* Latin Name (XML44)&lt;br /&gt;
* US Botanical Variety (XML44)&lt;br /&gt;
&lt;br /&gt;
===Reissue Patents===&lt;br /&gt;
* Parent Status (XML45)&lt;br /&gt;
* Parent Document (XML44) - seems to be either a normal document or a parent grant document (or both), both contain the following fields&lt;br /&gt;
** Date&lt;br /&gt;
** Document number&lt;br /&gt;
** Country &lt;br /&gt;
** Document ID&lt;br /&gt;
** Kind (XML42)&lt;br /&gt;
* Child Document (XML44)&lt;br /&gt;
** Document ID&lt;br /&gt;
** Document Number&lt;br /&gt;
** Country (XML43)&lt;br /&gt;
&lt;br /&gt;
Note about fields related to the parent document and/or parent grant document:&lt;br /&gt;
The structure of the above fields related to a reissue patent is very odd and varies by XML version. For example, for most of the XML versions, parent grant document falls under parent document and has its own document number, date, country, document ID, kind, etc. The parent doc field falls under relation, which falls under reissue. &lt;br /&gt;
&lt;br /&gt;
However, in XML41, parent document falls under relation and a new category - US Reexamination Reissue Merger. The information pertaining to the child document also falls under both relation and US Reexamination Reissue Merger. It's possible that during the time period this XML version represents, there were two types of reissue patents - a reexamination related to a merger and a standard reissue. With discrepancies like these, we'll have to determine in the end whether we want to stick with the fields as they are designed above (in which case the information about a parent doc or parent grant document is stored the same way whether it is a reissue or a Reexamination Reissue), or whether these two classifications are different enough to warrant duplicate fields in the table.&lt;br /&gt;
&lt;br /&gt;
===Design Patents===&lt;br /&gt;
* length of grant (XML45)&lt;br /&gt;
* Hague Agreement Data (XML45) - allows people to file design patents in 66 countries with one application&lt;br /&gt;
** International Registration Date&lt;br /&gt;
** International Filing Date&lt;br /&gt;
** International Registration Publication Date&lt;br /&gt;
** International Registration Number&lt;br /&gt;
* Classification Locarno (XML40)&lt;br /&gt;
** Edition&lt;br /&gt;
** Main Classification&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please note that the lists above are meant to represent unique types that should be added to the patent database. This does not mean that every XML version contains all these fields, or that every version contains the same path to these fields. That will have to be determined separately. The XML number next to some of the items represents the latest XML version where the field was seen.&lt;br /&gt;
&lt;br /&gt;
Though I was mainly looking at unique fields in plant, reissue, and design patents, I noticed the following fields on unique paths for utility patents in XML41:&lt;br /&gt;
&lt;br /&gt;
* US Related Documents&lt;br /&gt;
** Substitution&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Parent Document&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Country&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
***** Parent Status&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Country&lt;br /&gt;
** Continuation/Continuation in Part&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Doc&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
** Division&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Kind&lt;br /&gt;
&lt;br /&gt;
This appears to resemble the information about reissue patents. Additionally, a couple of fields appear under the information about bibliographic data that might be useful:&lt;br /&gt;
* US Deceased Inventor (XML41)&lt;br /&gt;
** Post Code&lt;br /&gt;
** Address&lt;br /&gt;
* US Provisional Application Status (XML41)&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=21783</id>
		<title>Redesign Assignment and Patent Database</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=21783"/>
		<updated>2017-11-10T22:26:06Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* REISSUE */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Redesign Assignment and Patent Database&lt;br /&gt;
|Has owner=Shelby Bice,&lt;br /&gt;
|Has start date=9/2017&lt;br /&gt;
|Has keywords=patent&lt;br /&gt;
|Is dependent on=Reproducible Patent Data,&lt;br /&gt;
}}&lt;br /&gt;
'''FOR ED:''' I finished going through the unique paths (you can see the unique fields I identified at the bottom of this page, under &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot;). Then, under my design for the Patent Database, there's a new section called &amp;quot;Unique Attributes Tables&amp;quot; where I've begun detailing a design for three new tables that would include information that is unique to Design, Reissue, and Plant patents. Here are some questions I have:&lt;br /&gt;
&lt;br /&gt;
'''1.''' Do you like the design for the new tables? How do you think we should populate these tables? I can think of one way (putting all the information in the main patent tables and then moving it out to Plant, Reissue, or Design as needed) but I was wondering if there might be an easier way. &lt;br /&gt;
&lt;br /&gt;
'''2.''' There were some unique attributes to utility patents that I also included in &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot;. They only appear to be in two of the oldest XML versions (4.1 and 4.0) and I'm not sure the information is particularly useful - but please take a look at the field names and let me know if you disagree and think they are useful and should be included in a table in the patent database.&lt;br /&gt;
&lt;br /&gt;
'''3.''' Do you know what some of these fields represent? For example, I've been trying to research online what a &amp;quot;Parent Document&amp;quot; and/or &amp;quot;Parent Grant Document&amp;quot; might represent for a reissue patent, and I've found a couple possible options. I will keep researching, but if you know what they represent or one of my descriptions for a field is wrong, please let me know.&lt;br /&gt;
&lt;br /&gt;
This is an extension of the work I did last semester under &amp;quot;Redesigning Patent Database&amp;quot;. Instead of simply reconfiguring the existing database, this project encompasses a full redesign and creation of a new Patent database and a new Assignment database that will be joined together.&lt;br /&gt;
&lt;br /&gt;
Reference for definitions of fields in the two databases: https://www.oecd.org/sti/sci-tech/37569498.pdf&lt;br /&gt;
&lt;br /&gt;
Paper on how to extract data from XML files, parse it, and put it into a database: https://funginstitute.berkeley.edu/wp-content/uploads/2014/06/patentprocessor.pdf&lt;br /&gt;
&lt;br /&gt;
'''Particularly useful, because it provides solutions for cleaning location and firm data'''&lt;br /&gt;
&lt;br /&gt;
==ER Diagram for Assignment and Patent Databases==&lt;br /&gt;
&lt;br /&gt;
Note: this is currently out of sync with the current description below for the patent database. It will be updated once I have finalized how we will organize unique attributes for plant, design, and reissue patents.&lt;br /&gt;
&lt;br /&gt;
[[File:PatentAndAssignmentER2.png]]&lt;br /&gt;
&lt;br /&gt;
Attributes for each table are listed below with descriptions - because of how many attributes there are, I decided the ER diagram would be better suited as an overview of the tables rather than trying to show all the attributes on the diagram.&lt;br /&gt;
&lt;br /&gt;
==Assignment Database Structure==&lt;br /&gt;
&lt;br /&gt;
After reading through the paper &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot; (https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf) and taking into account our needs for the Assignment database (the ability to connect the two databases together, the ability to trace conveyance of a patent over time), I decided on the following structure for the Assignment database.&lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table in the database is rf_id, which stands for reel frame id. It is unique for each entry in the main table, ASSIGNMENT, and is included in every table. &lt;br /&gt;
&lt;br /&gt;
Conveyance is the only table that cannot be fully constructed as data is inserted into the database. Some of the fields depend on other tables being constructed, so it may either be partially populated while the other tables are being populated or be the last table to be populated. &lt;br /&gt;
&lt;br /&gt;
Full list of tables: assignee, assignor, document_info, assignment, conveyance&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: store all the assignees from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the recipient of a patent in an entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id, unless it is possible to have multiple assignees, in which case rf_id and assignee_name&lt;br /&gt;
&lt;br /&gt;
All columns (variables) in the table: &lt;br /&gt;
&lt;br /&gt;
* rf_id (bigint) reel frame number &lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* address_line_1 (varchar(255)) first line of address of assignee&lt;br /&gt;
* address_line_2 (varchar(255)) second line of address of assignee&lt;br /&gt;
* city (varchar(255)) city of assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* country (varchar(255)) country of assignee&lt;br /&gt;
* postal_code (varchar(255)) post code of the assignee&lt;br /&gt;
&lt;br /&gt;
===ASSIGNOR===&lt;br /&gt;
	&lt;br /&gt;
Purpose: store all the assignors from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the one assigning/granting the license for the entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id and assignor_name&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
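&lt;br /&gt;
As a rough sketch of how rf_id ties ASSIGNOR back to ASSIGNMENT, the snippet below declares a composite primary key and a foreign key using Python's built-in sqlite3 module. SQLite column types stand in for bigint and varchar(255), and the sample names are invented; this is an illustration of the key design, not the project's actual DDL.&lt;br /&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
conn.executescript("""
CREATE TABLE assignment (rf_id INTEGER PRIMARY KEY);
CREATE TABLE assignor (
    rf_id INTEGER NOT NULL REFERENCES assignment (rf_id),
    assignor_name TEXT NOT NULL,
    PRIMARY KEY (rf_id, assignor_name)
);
INSERT INTO assignment VALUES (100);
INSERT INTO assignor VALUES (100, 'Alice Example'), (100, 'Bob Example');
""")


def try_insert(rf_id, name):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        with conn:
            conn.execute("INSERT INTO assignor VALUES (?, ?)", (rf_id, name))
        return True
    except sqlite3.IntegrityError:
        return False


duplicate_ok = try_insert(100, 'Alice Example')  # same (rf_id, name) twice
orphan_ok = try_insert(999, 'Carol Example')     # no matching assignment row
assignor_count = conn.execute("SELECT COUNT(*) FROM assignor").fetchone()[0]
print(duplicate_ok, orphan_ok, assignor_count)
```

The composite key allows several assignors per assignment while still rejecting exact duplicates and rows whose rf_id does not exist in the main table.&lt;br /&gt;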
&lt;br /&gt;
===CONVEYANCE===&lt;br /&gt;
Purpose: represent the conveyance of a particular patent over time&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual transaction for a patent (not unlike assignment, except this table has information about the assignment in relation to other assignments for the patent)&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id &lt;br /&gt;
&lt;br /&gt;
All columns (variables): Two possibilities, depending on whether we want data to be repeated or not.&lt;br /&gt;
	&lt;br /&gt;
Not-repeated (with exception of primary/foreign key): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* next_transaction (bigint) rf_id of next transaction that occurred after this transaction (traced by assignee/assignor)&lt;br /&gt;
* num_conveyance (int) number in order of conveyance by date&lt;br /&gt;
* acknowledgment_date (date) date that the assignor acknowledged the transaction&lt;br /&gt;
* execution_date (date) date that the transaction took place&lt;br /&gt;
* conveyance_type (varchar(255)) the USPTO paper determined a couple of different conveyance types and also outlined how to determine based on the conveyance text in the xml file which conveyance type is relevant to the transaction. The conveyance types are:&lt;br /&gt;
** assignment&lt;br /&gt;
** merger&lt;br /&gt;
** change of name&lt;br /&gt;
** government interest&lt;br /&gt;
** agreement&lt;br /&gt;
** security agreement&lt;br /&gt;
** release&lt;br /&gt;
&lt;br /&gt;
	&lt;br /&gt;
Repeated, also include: &lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
* patent_number (varchar(255)) identifying number for the patent&lt;br /&gt;
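&lt;br /&gt;
One possible way to fill num_conveyance and next_transaction is sketched below. It makes a simplifying assumption that is not in the design above: transactions are ordered purely by execution_date within each patent, rather than traced by assignee/assignor. The sample rows are invented.&lt;br /&gt;

```python
from collections import defaultdict
from datetime import date

# Hypothetical transactions: (rf_id, patent_number, execution_date)
transactions = [
    (300, "7000001", date(2016, 5, 1)),
    (100, "7000001", date(2012, 1, 10)),
    (200, "7000001", date(2014, 8, 20)),
    (400, "7000002", date(2015, 2, 2)),
]


def build_conveyance(transactions):
    """Number each patent's transactions by date and link each to the next."""
    by_patent = defaultdict(list)
    for rf_id, patent_number, execution_date in transactions:
        by_patent[patent_number].append((execution_date, rf_id))

    conveyance = {}
    for items in by_patent.values():
        items.sort()  # oldest execution_date first
        for position, (execution_date, rf_id) in enumerate(items):
            if position + 1 == len(items):
                next_rf_id = None  # last known transaction for this patent
            else:
                next_rf_id = items[position + 1][1]
            conveyance[rf_id] = {
                "num_conveyance": position + 1,
                "next_transaction": next_rf_id,
            }
    return conveyance


conveyance = build_conveyance(transactions)
print(conveyance[100])
```

Because the numbering depends on all transactions for a patent being present, this step fits the note above that CONVEYANCE may have to be populated last.&lt;br /&gt;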
&lt;br /&gt;
===DOCUMENT_INFO===&lt;br /&gt;
Purpose: store extra information relevant to the patents represented in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the documentation and the patent for a particular entry in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables) when created: &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* app_num (varchar(255)) application document USPTO number &lt;br /&gt;
* app_date (date) date that the application was filed&lt;br /&gt;
* app_country (varchar(255)) country in which it was filed&lt;br /&gt;
* pgpub_num (varchar(255)) pre-grant publication document USPTO number&lt;br /&gt;
* pgpub_date (date) date that pre-grant publication was released&lt;br /&gt;
* pgpub_country (varchar(255)) country in which the patent is published before being granted a patent number&lt;br /&gt;
* patent_number (varchar(255)) granted patent document USPTO number&lt;br /&gt;
* grant_date (date) date that the patent is officially published&lt;br /&gt;
* grant_country (varchar(255)) country in which the patent is published after being granted a patent number&lt;br /&gt;
* invention_title (varchar(255)) title of the invention&lt;br /&gt;
* language (varchar(255)) language of invention title – could be potentially useful and interesting to investigate &lt;br /&gt;
* reel_num (bigint) number of the reel the assignment was stored on&lt;br /&gt;
* frame_num (bigint) number of the frame on the reel the assignment was stored on&lt;br /&gt;
&lt;br /&gt;
===ASSIGNMENT===&lt;br /&gt;
Purpose: represent assignment transactions – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent assignment transaction&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignment_id (bigint) US patent assignment id &lt;br /&gt;
* correspondent_name (varchar(255)) name of correspondent for the transaction, could be a lawyer&lt;br /&gt;
* correspondent_address1 (varchar(255)) first line of address for correspondent&lt;br /&gt;
* correspondent_address2 (varchar(255)) second line of address for correspondent&lt;br /&gt;
* correspondent_address3 (varchar(255)) third line of address for correspondent&lt;br /&gt;
* correspondent_address4 (varchar(255)) fourth line of address for correspondent&lt;br /&gt;
* record_date (date) date recorded with USPTO&lt;br /&gt;
* last_update_date (date) date that information for this assignment was last updated with the USPTO &lt;br /&gt;
* page_cnt (bigint) page count of assignment record&lt;br /&gt;
&lt;br /&gt;
More extensive notes exist under :E/McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper&lt;br /&gt;
&lt;br /&gt;
==Patent Database Structure==&lt;br /&gt;
&lt;br /&gt;
Most of this design is based off what I investigated last semester and therefore is subject to change, since the design for the Assignment database is somewhat new. &lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table is patent_number, which uniquely identifies every patent in the PATENT table and is included in every table.&lt;br /&gt;
&lt;br /&gt;
Full list of tables: patent, assignee, citation, fee, claims, histpatent, lawyers, inventors&lt;br /&gt;
&lt;br /&gt;
Update: These tables, as of October 26, are being updated and changed as I determine what data can actually be retrieved from the XML files based on the xpaths that we have determined (you can see these xpaths on the &amp;quot;Equivalent XPath and APS Queries&amp;quot; project page).&lt;br /&gt;
&lt;br /&gt;
===PATENT===&lt;br /&gt;
Purpose: represent patents – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* patent_type (varchar(255)) the type of patent, like utility&lt;br /&gt;
* patent_kind (varchar(2)) the kind of patent, a letter and a number&lt;br /&gt;
* title (varchar(255)) the title of the patent&lt;br /&gt;
* grantdate (date) date that patent was officially granted including the year&lt;br /&gt;
* prioritydate (date) date used to establish novelty of an invention with regard to other inventions (source: http://www.bios.net/daisy/patentlens/2343.html)&lt;br /&gt;
* prioritycountry (varchar(255)) country where patent is first filed&lt;br /&gt;
* patent_country (varchar(255)) country in which this patent was published&lt;br /&gt;
* prioritypatentnumber (varchar(255)) placeholder patent number if the patent is published before it is fully approved - also referred to as priority claims number&lt;br /&gt;
* cpcsubclass (varchar(255)) cooperative patent classification subclass&lt;br /&gt;
* cpcsubgroup (varchar(255)) cooperative patent classification subgroup&lt;br /&gt;
* cpcmaingroup (varchar(255)) cooperative patent classification main group&lt;br /&gt;
* cpctotal (varchar(255)) concatenated version of cpcmaingroup, cpcsubgroup, cpcsubclass&lt;br /&gt;
* pctpatentnumber (varchar(255)) international patent number (according to Patent Cooperation Treaty) also known as PCT Document Number&lt;br /&gt;
* appnum (varchar(255)) application number, probably also the filing number - the format is a two digit series code followed by a six digit serial number (source: https://www.uspto.gov/patents-application-process/filing-online/info-application-number)&lt;br /&gt;
* appdate (date) date that the application was filed, including year - probably also the filing date&lt;br /&gt;
* ipcrsubclass (varchar(255)) International patent classification subclass&lt;br /&gt;
* ipcrmaingroup (varchar(255)) International patent classification main group&lt;br /&gt;
* ipcrsubgroup (varchar(255)) International patent classification subgroup&lt;br /&gt;
* ipcrtotal (varchar(255)) concatenated version of ipcrsubgroup, ipcrsubclass, ipcrmaingroup &lt;br /&gt;
* national_classification (varchar(255)) - probably uspc, which is United States Patent Classification&lt;br /&gt;
* national_classification_country (varchar(255)) - should be United States&lt;br /&gt;
* num_claims (int) number of claims (things that the inventor wishes to protect with this patent)&lt;br /&gt;
&lt;br /&gt;
Discrepancies: &lt;br /&gt;
These fields were previously a part of the table, but we do not currently have xpaths for them:&lt;br /&gt;
* app_year - the year the application was filed&lt;br /&gt;
* grant_year - the year the patent was granted&lt;br /&gt;
* uspc - United States Patent Classification (might be the same as national_classification)&lt;br /&gt;
* uspcsub - United States Patent subgroup&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: Represent the person the patent was currently assigned to at publication according to the patent information (the information in the Assignment database is more complete) &lt;br /&gt;
&lt;br /&gt;
Each entry represents: the assignee for a patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number, first_name, last_name (in case it is possible for there to be multiple assignees for an individual patent)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar &lt;br /&gt;
* first_name (varchar(255)) first name of assignee&lt;br /&gt;
* last_name (varchar(255)) last name of assignee&lt;br /&gt;
* address (varchar(255)) address of assignee&lt;br /&gt;
* postcode (varchar(255)) postcode of assignee&lt;br /&gt;
* orgname (varchar(255)) name of the organization owning the patent (if applicable - organization may go by multiple orgnames) &lt;br /&gt;
* city (varchar(255)) city of the assignee&lt;br /&gt;
* country (varchar(255)) country of the assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* residence (varchar(255)) mailing address of assignee if they receive mail at a different address than the one listed above&lt;br /&gt;
* size_of_firm (varchar(255)) size of the organization (large, small, or micro) - determines how much they pay in fees for the patent&lt;br /&gt;
&lt;br /&gt;
===CITATION===&lt;br /&gt;
Purpose: Represent all the patents in PATENT that another patent in PATENT cites&lt;br /&gt;
&lt;br /&gt;
Each entry represents: one citation that a patent in PATENT makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number AND cited_patent_number (because a patent might cite multiple patents, so both variables are needed for the primary key)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* cited_patent_number (varchar(255)) unique identifier from the USPTO for the patent being cited&lt;br /&gt;
* description (varchar(255)) description of the citation&lt;br /&gt;
* citation_date (date) unclear what this date is supposed to represent from the DTD, but according to the paper that is the second link in the top section, it is the &amp;quot;date of cited document&amp;quot;&lt;br /&gt;
* citation_kind (varchar(255)) type of cited document&lt;br /&gt;
* citation_country (varchar(255)) country of origin for the cited patent &lt;br /&gt;
* citation_name (varchar(255)) appears to be the names of those to whom the cited patent is assigned&lt;br /&gt;
&lt;br /&gt;
Note: There are other fields available for this table, such as document number, but this seems like it might be a repeat of the document number for the citing patent&lt;br /&gt;
&lt;br /&gt;
===FEE===&lt;br /&gt;
Purpose: Represent information about the fees paid for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the fees paid on the patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar  &lt;br /&gt;
* four (boolean) payment of maintenance fee 4th year&lt;br /&gt;
* eight (boolean) payment of maintenance fee 8th year&lt;br /&gt;
* twelve (boolean) payment of maintenance fee 12th year&lt;br /&gt;
* fee_date (date) date that fee is paid on&lt;br /&gt;
* fee_code (varchar(255)) code for type of fee to pay&lt;br /&gt;
&lt;br /&gt;
===CLAIMS===&lt;br /&gt;
&lt;br /&gt;
Purpose: Represent information about the claims for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a claim that a patent makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number and claim_id&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* claim_id (varchar(255)) id number for the claim&lt;br /&gt;
* language (varchar(255)) language of claim&lt;br /&gt;
* claim_type (varchar(255)) type of claim&lt;br /&gt;
* status (varchar(255)) status of the claim&lt;br /&gt;
* claim_text (text) the text of the claim&lt;br /&gt;
&lt;br /&gt;
===HISTPATENT===&lt;br /&gt;
Purpose: Represent historical data about the patent pertaining to dates and publication&lt;br /&gt;
&lt;br /&gt;
Each entry represents: sequence of historical data about an individual patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* pubno (varchar(255)) patent publication number&lt;br /&gt;
* pubdate (date) date patent was issued&lt;br /&gt;
* dispdate (date) disposal date for the application&lt;br /&gt;
* distype (varchar(3)) application status of the patent - ABN for abandoned, ISS for issued, or PEN for pending&lt;br /&gt;
* exp_dt (date) expiration date of patent&lt;br /&gt;
* pta (int) length of patent term adjustment (which extends the amount of time the patent is allowed to be in force) in number of days&lt;br /&gt;
&lt;br /&gt;
===INVENTORS===&lt;br /&gt;
Purpose: Represent the inventor for each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about an inventor for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (and name, if there can be multiple inventors for a single patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identify the inventor?&lt;br /&gt;
* name (varchar(255)) concatenated string of the first and last name of the inventor&lt;br /&gt;
* orgname (varchar(255)) name of the organization the inventor works for&lt;br /&gt;
* state (varchar(255)) - state in which the inventor resided&lt;br /&gt;
* address (varchar(255)) - address at which the inventor resided&lt;br /&gt;
* country (varchar(255)) country in which the inventor resided&lt;br /&gt;
* city (varchar(255)) city in which the inventor resided&lt;br /&gt;
&lt;br /&gt;
===LAWYERS===&lt;br /&gt;
Purpose: Represent the lawyers who worked on each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a lawyer for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (there doesn't appear to be more than one lawyer entry for each patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identify the lawyer?&lt;br /&gt;
* org_name (varchar(255)) the name of the firm that did arbitration for the patent&lt;br /&gt;
&lt;br /&gt;
===Unique Attributes Tables===&lt;br /&gt;
&lt;br /&gt;
Several attributes have been identified that are unique, applying to only one kind of patent. Adding all these fields to the main patent table would introduce a lot of NULL entries, because three out of the four patent types will not have any particular attribute. Instead, I have created three new tables to store these unique attributes - one for each type of patent with unique attributes. &lt;br /&gt;
&lt;br /&gt;
For each table, the primary key is the patent_number, which connects a row to a row in the main patent table. &lt;br /&gt;
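&lt;br /&gt;
The relationship between the main patent table and one of the unique-attribute tables can be sketched as below, again with Python's built-in sqlite3 module. The table name plant, the cut-down columns, and the sample rows are assumptions made for illustration only.&lt;br /&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patent (patent_number TEXT PRIMARY KEY, patent_type TEXT);
-- Hypothetical side table: only plant patents get a row here.
CREATE TABLE plant (
    patent_number TEXT PRIMARY KEY REFERENCES patent (patent_number),
    latin_name TEXT
);
INSERT INTO patent VALUES ('7000001', 'utility'), ('PP00001', 'plant');
INSERT INTO plant VALUES ('PP00001', 'Rosa hybrida');
""")

# A LEFT JOIN recovers the wide view on demand; NULLs appear only in the
# query result, not in stored rows of the main table.
rows = conn.execute("""
SELECT p.patent_number, p.patent_type, pl.latin_name
FROM patent AS p
LEFT JOIN plant AS pl ON pl.patent_number = p.patent_number
ORDER BY p.patent_number
""").fetchall()

for row in rows:
    print(row)
```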
&lt;br /&gt;
TODO: Figure out best way to generate these tables, finish description of attributes for each table.&lt;br /&gt;
&lt;br /&gt;
* Note that I did include some unique attributes for Utility patents as seen below in the &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot; section, but for now (11/3/2017) I'm not going to create a separate table for those fields. Most of them seem to be repeated in Reissue, and the other fields, while interesting, are perhaps not very useful.&lt;br /&gt;
&lt;br /&gt;
====DESIGN====&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* latin_name (varchar(255)) latin name for the plant&lt;br /&gt;
* us_botanical_variety(varchar(255)) denotes what variety of plant it is - for example, a rose has several different varieties&lt;br /&gt;
&lt;br /&gt;
====REISSUE====&lt;br /&gt;
&lt;br /&gt;
In the following table, you'll notice lots of fields related to three different kinds of &amp;quot;documents&amp;quot; - a &amp;quot;parent document&amp;quot;, a &amp;quot;child document&amp;quot;, and a &amp;quot;parent grant document&amp;quot;. It is not immediately clear what these three documents represent for a reissue patent. After some research, I think I have determined what these &amp;quot;documents&amp;quot; are, but please know that I do not have definitive proof, as there is little available information about these &amp;quot;documents&amp;quot; with regard to a reissue patent.&lt;br /&gt;
&lt;br /&gt;
It is possible that the &amp;quot;parent document&amp;quot; is the parent patent application - that is, the first patent application filed for an invention. That would explain why reissue patents also have several fields related to a &amp;quot;child document&amp;quot; - a &amp;quot;child document&amp;quot; could be a child application, which is filed while a parent application is still pending. So in this case, a &amp;quot;child document&amp;quot; of a reissue patent would be an application regarding the same patent that was filed while the reissue application was still pending. A &amp;quot;child document&amp;quot; is either a continuation, divisional, or continuation-in-part application ([http://www.patenttrademarkblog.com/parent-and-child-patent-applications/ source]).&lt;br /&gt;
&lt;br /&gt;
Based on this logic, I think that the difference between a &amp;quot;parent document&amp;quot; and a &amp;quot;parent grant document&amp;quot; depends on whether the parent patent application has been granted. If the parent patent application is still pending (meaning the patent has not yet been granted) then I believe the reissue patent will store information about the parent patent under &amp;quot;parent document&amp;quot;. However, if the parent patent application has been granted, then the information will be stored under &amp;quot;parent grant document&amp;quot;. This seems like the most logical explanation, especially considering that the path to any field related to &amp;quot;parent grant document&amp;quot; contains &amp;quot;parent document&amp;quot;, as in the example below:&lt;br /&gt;
&lt;br /&gt;
 us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
&lt;br /&gt;
Therefore, a particular reissue patent will probably only have the fields filled for either the &amp;quot;parent document&amp;quot; or &amp;quot;parent grant document&amp;quot; if I am right about what they represent.&lt;br /&gt;
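&lt;br /&gt;
To illustrate how the two sets of fields would be read, the sketch below builds a tiny element tree that follows the example path above and looks up both the parent grant document date and the plain parent document date with Python's xml.etree.ElementTree. The sample tree and the date value are invented, and the plain parent document path is an assumption formed by dropping parent-grant-document from the example path.&lt;br /&gt;

```python
import xml.etree.ElementTree as ET

RELATION = "us-bibliographic-data-grant/us-related-documents/reissue/relation"
# Path taken from the example above, relative to the us-patent-grant root.
PARENT_GRANT_DATE = RELATION + "/parent-doc/parent-grant-document/document-id/date"
# Assumed path for the plain parent document (hypothetical).
PARENT_DOC_DATE = RELATION + "/parent-doc/document-id/date"


def build_sample():
    """Build a minimal tree holding only a parent grant document date."""
    root = ET.Element("us-patent-grant")
    node = root
    for tag in PARENT_GRANT_DATE.split("/"):
        node = ET.SubElement(node, tag)
    node.text = "20100105"  # made-up value
    return root


root = build_sample()
parent_grant_doc_date = root.findtext(PARENT_GRANT_DATE)
parent_doc_date = root.findtext(PARENT_DOC_DATE)  # None when the path is absent
print(parent_grant_doc_date, parent_doc_date)
```

A path that is absent simply yields None, which would become NULL in the corresponding REISSUE column.&lt;br /&gt;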
&lt;br /&gt;
=====Fields=====&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* parent_doc_status (varchar(255)) status of the parent application. Probably related to whether the patent application is pending or not&lt;br /&gt;
* parent_doc_number (int) probably application number for the parent document&lt;br /&gt;
* parent_doc_id (varchar(255)) it is unclear how this is different from parent document number. In the xpaths, the path will be something like ./parent_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent document&amp;quot; is - or may refer to its purpose (i.e. reissue, continuation, continuation-in-part, etc.)&lt;br /&gt;
* parent_doc_country (varchar(255)) the country of origin of the parent document&lt;br /&gt;
* parent_doc_date (date) probably the date the parent document was filed&lt;br /&gt;
* parent_grant_doc_number (int) probably application number for the parent grant document&lt;br /&gt;
* parent_grant_doc_id (varchar(255)) it is unclear how this is different from parent grant document number. In the xpaths, the path will be something like ./parent_doc/parent-grant-doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_grant_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent grant document&amp;quot; is, or may refer to its purpose (i.e. reissue, continuation, continuation-in-part, etc.)&lt;br /&gt;
* parent_grant_doc_country (varchar(255)) the country of origin of the parent grant document&lt;br /&gt;
* parent_grant_doc_date (date) probably the date the parent grant document was filed&lt;br /&gt;
* child_doc_number (varchar(255)) probably the application number of the child application&lt;br /&gt;
* child_doc_id (varchar(255)) it is unclear how this is different from child document number. In the xpaths, the path will be something like ./child_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* child_doc_country (varchar(255)) country of origin of the child application&lt;br /&gt;
&lt;br /&gt;
====PLANT====&lt;br /&gt;
&lt;br /&gt;
==Connecting Patent database and Assignment database==&lt;br /&gt;
The answer to connecting the Patent database to the Assignment database lies in using the information in the table DOCUMENT_INFO to connect each assignment in the Assignment database ASSIGNMENT table to a patent number in the Patent database PATENT table. On further investigation into &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot;, the field they called &amp;quot;grant_doc_num&amp;quot; (grant_num above) in DOCUMENT_INFO is the patent number (the description of DOCUMENT_INFO above has been altered to reflect this).&lt;br /&gt;
&lt;br /&gt;
The paper mentions that there will be errors in the patent number stored in DOCUMENT_INFO under grant_doc_num, so a separate table called DOCUMENT_INFO_ADMIN was constructed to determine how prevalent the errors were. They queried the patent number for each appno_doc_num from administrative data. The &amp;quot;appno_doc_num&amp;quot; is the Application Document USPTO number from DOCUMENT_INFO. On further investigation into the paper, it became clear that &amp;quot;administrative data&amp;quot; refers to internal USPTO data that was available to the authors of the paper, but that we might not have. However, in 99% of cases the grant_doc_num and the patent number queried from the appno_doc_num were the same, so we can probably rely on grant_doc_num.&lt;br /&gt;
&lt;br /&gt;
Therefore, since every table in the Patent database will have the patent number stored, to connect any table in Assignment to the Patent database, it would first be joined with DOCUMENT_INFO on rf_id and then joined with the appropriate table in Patent on patent number.&lt;br /&gt;
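&lt;br /&gt;
The two-step join described above can be sketched as follows. This is only a minimal illustration using an in-memory SQLite database: the trimmed-down table layouts, the sample rows, and the assumption that DOCUMENT_INFO stores the patent number in a column named patent_number are all hypothetical and not taken from the real databases.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

# In-memory database with hypothetical, heavily trimmed versions of the tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE assignment (rf_id INTEGER PRIMARY KEY, record_date TEXT);
    CREATE TABLE document_info (rf_id INTEGER, patent_number TEXT);
    CREATE TABLE patent (patent_number TEXT PRIMARY KEY, title TEXT);
    INSERT INTO assignment VALUES (100, '2017-01-05');
    INSERT INTO document_info VALUES (100, 'RE12345');
    INSERT INTO patent VALUES ('RE12345', 'Example title');
""")

# Step 1: join the Assignment-side table to DOCUMENT_INFO on rf_id.
# Step 2: join DOCUMENT_INFO to the Patent-side table on patent number.
rows = conn.execute("""
    SELECT a.rf_id, d.patent_number, p.title
    FROM assignment AS a
    JOIN document_info AS d ON d.rf_id = a.rf_id
    JOIN patent AS p ON p.patent_number = d.patent_number
""").fetchall()
print(rows)
```
&lt;br /&gt;
The same pattern should apply to any other Assignment table, since each one carries rf_id, and to any other Patent table, since each one carries the patent number.&lt;br /&gt;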
&lt;br /&gt;
Also in the paper, they mentioned the Assignments on the Web for Patents (AOTW-P), a searchable database of individual USPTO assignment records keyed on reel-frame identification, patent number, and assignor or assignee name (https://assignment.uspto.gov/patent/index.html#/patent/search). Obviously it would not be feasible to look up every patent number by hand with this tool, but if it were possible to write a script that queries each patent number using the rf_id and parses the response, this could be useful for checking the patent numbers, though it might not be any more accurate than what will already be in DOCUMENT_INFO.&lt;br /&gt;
&lt;br /&gt;
==Finding New Paths Unique to Plant, Reissue, and/or Design Patents==&lt;br /&gt;
&lt;br /&gt;
Based on Oliver's script, which searched all xpaths and compared which were unique to particular types, we see that the following attributes are unique to each type of patent other than utility patents. These attributes vary by XML version, which changed over time. Therefore, the lists below are a superset of the attributes that are unique to each of the patent types listed below across all XML versions.&lt;br /&gt;
&lt;br /&gt;
When I am done finding the supersets, we will determine how to integrate these attributes into the design for the patent table.&lt;br /&gt;
&lt;br /&gt;
'''There appears to be a fifth patent type in the results from Oliver's script, and there are attributes that are unique to Utility patents, though they do not appear particularly useful'''&lt;br /&gt;
&lt;br /&gt;
===Plant Patents===&lt;br /&gt;
* US Claim Statement (XML45) - unclear what this represents&lt;br /&gt;
* Latin Name (XML44)&lt;br /&gt;
* US Botanical Variety (XML44)&lt;br /&gt;
&lt;br /&gt;
===Reissue Patents===&lt;br /&gt;
* Parent Status (XML45)&lt;br /&gt;
* Parent Document (XML44) - seems to be either a normal document or a parent grant document (or both), both contain the following fields&lt;br /&gt;
** Date&lt;br /&gt;
** Document number&lt;br /&gt;
** Country &lt;br /&gt;
** Document ID&lt;br /&gt;
** Kind (XML42)&lt;br /&gt;
* Child Document (XML44)&lt;br /&gt;
** Document ID&lt;br /&gt;
** Document Number&lt;br /&gt;
** Country (XML43)&lt;br /&gt;
&lt;br /&gt;
Note about fields related to the parent document and/or parent grant document:&lt;br /&gt;
The structure of the above fields related to a reissue patent is very odd and varies by XML version. For example, for most of the XML versions, parent grant document falls under parent document and has its own document number, date, country, document ID, kind, etc. The parent doc field falls under relation, which falls under reissue. &lt;br /&gt;
&lt;br /&gt;
However, in XML41, parent document falls under relation and a new category - US Reexamination Reissue Merger. The information pertaining to the child document also falls under both relation and US Reexamination Reissue Merger. It's possible that during the time period this XML version represents, there were two types of reissue patents - some reexamination related to a merger and a standard reissue. With discrepancies like these, we'll have to determine in the end whether we want to stick with the fields as they are designed above (in which case the information about a parent doc or parent grant document is stored the same way whether it is a reissue or a Reexamination Reissue), or whether these two classifications are different enough to warrant duplicate fields in the table.&lt;br /&gt;
&lt;br /&gt;
===Design Patents===&lt;br /&gt;
* length of grant (XML45)&lt;br /&gt;
* Hague Agreement Data (XML45) - allows people to file design patents in 66 countries with one application&lt;br /&gt;
** International Registration Date&lt;br /&gt;
** International Filing Date&lt;br /&gt;
** International Registration Publication Date&lt;br /&gt;
** International Registration Number&lt;br /&gt;
* Classification Locarno (XML40)&lt;br /&gt;
** Edition&lt;br /&gt;
** Main Classification&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please note that the lists above are meant to represent unique fields that should be added to the patent database. This does not mean that every XML version contains all these fields, or that every version contains the same path to these fields. That will have to be determined separately. The XML number next to some of the items is meant to represent the latest XML version where the field was seen.&lt;br /&gt;
&lt;br /&gt;
Though I was mainly looking at unique fields in plant, reissue, and design patents, I noticed the following fields on unique paths for utility patents in XML41:&lt;br /&gt;
&lt;br /&gt;
* US Related Documents&lt;br /&gt;
** Substitution&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Parent Document&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Country&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
***** Parent Status&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Country&lt;br /&gt;
** Continuation/Continuation in Part&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Doc&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
** Division&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Kind&lt;br /&gt;
&lt;br /&gt;
This appears to resemble the information about reissue patents. Additionally, a couple of fields appear under the information about bibliographic data that might be useful:&lt;br /&gt;
* US Deceased Inventor (XML41)&lt;br /&gt;
** Post Code&lt;br /&gt;
** Address&lt;br /&gt;
* US Provisional Application Status (XML41)&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=21782</id>
		<title>Redesign Assignment and Patent Database</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=21782"/>
		<updated>2017-11-10T22:23:27Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Redesign Assignment and Patent Database&lt;br /&gt;
|Has owner=Shelby Bice,&lt;br /&gt;
|Has start date=9/2017&lt;br /&gt;
|Has keywords=patent&lt;br /&gt;
|Is dependent on=Reproducible Patent Data,&lt;br /&gt;
}}&lt;br /&gt;
'''FOR ED:''' I finished going through the unique paths (you can see the unique fields I identified at the bottom of this page, under &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot;). Then, under my design for the Patent Database, there's a new section called &amp;quot;Unique Attributes Tables&amp;quot; where I've begun detailing a design for three new tables that would include information that is unique to Design, Reissue, and Plant patents. Here are some questions I have:&lt;br /&gt;
&lt;br /&gt;
'''1.''' Do you like the design for the new tables? How do you think we should populate these tables? I can think of one way (putting all the information in the main patent tables and then moving it out to Plant, Reissue, or Design as needed) but I was wondering if there might be an easier way. &lt;br /&gt;
&lt;br /&gt;
'''2.''' There were some unique attributes to utility patents that I also included in &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot;. They only appear to be in two of the oldest XML versions (4.1 and 4.0) and I'm not sure the information is particularly useful - but please take a look at the field names and let me know if you disagree and think they are useful and should be included in a table in the patent database.&lt;br /&gt;
&lt;br /&gt;
'''3.''' Do you know what some of these fields represent? For example, I've been trying to research online what a &amp;quot;Parent Document&amp;quot; and/or &amp;quot;Parent Grant Document&amp;quot; might represent for a reissue patent, and I've found a couple possible options. I will keep researching, but if you know what they represent or one of my descriptions for a field is wrong, please let me know.&lt;br /&gt;
&lt;br /&gt;
This is an extension of the work I did last semester under &amp;quot;Redesigning Patent Database&amp;quot;. Instead of simply reconfiguring the existing database, this project encompasses a full redesign and creation of a new Patent database and a new Assignment database that will be joined together.&lt;br /&gt;
&lt;br /&gt;
Reference for definitions of fields in the two databases: https://www.oecd.org/sti/sci-tech/37569498.pdf&lt;br /&gt;
&lt;br /&gt;
Paper on how to extract data from XML files, parse it, and put it into a database: https://funginstitute.berkeley.edu/wp-content/uploads/2014/06/patentprocessor.pdf&lt;br /&gt;
&lt;br /&gt;
'''Particularly useful, because it provides solutions for cleaning location and firm data'''&lt;br /&gt;
&lt;br /&gt;
==ER Diagram for Assignment and Patent Databases==&lt;br /&gt;
&lt;br /&gt;
Note: this is currently out of sync with the current description below for the patent database. It will be updated once I have finalized how we will organize unique attributes for plant, design, and reissue patents.&lt;br /&gt;
&lt;br /&gt;
[[File:PatentAndAssignmentER2.png]]&lt;br /&gt;
&lt;br /&gt;
Attributes for each table are listed below with descriptions - because of how many attributes there are, I decided the ER diagram would be better suited as an overview of the tables rather than trying to show all the attributes on the diagram.&lt;br /&gt;
&lt;br /&gt;
==Assignment Database Structure==&lt;br /&gt;
&lt;br /&gt;
After reading through the paper &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot; (https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf) and taking into account our needs for the Assignment database (the ability to connect the two databases together, the ability to trace conveyance of a patent over time), I decided on the following structure for the Assignment database.&lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table in the database is rf_id, which stands for reel frame id. It is unique for each entry in the main table, ASSIGNMENT, and is included in every table. &lt;br /&gt;
&lt;br /&gt;
Conveyance is the only table that cannot be fully constructed as data is inserted into the database. Some of the fields depend on other tables being constructed, so it may either be partially populated while the other tables are being populated or be the last table to be populated. &lt;br /&gt;
&lt;br /&gt;
Full list of tables: assignee, assignor, document_info, assignment, conveyance&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: store all the assignees from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the recipient of a patent in an entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id, unless it is possible to have multiple assignees, then rf_id and assignee name&lt;br /&gt;
&lt;br /&gt;
All columns (variables) in the table: &lt;br /&gt;
&lt;br /&gt;
* rf_id (bigint) reel frame number &lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* address_line_1 (varchar(255)) first line of address of assignee&lt;br /&gt;
* address_line_2 (varchar(255)) second line of address of assignee&lt;br /&gt;
* city (varchar(255)) city of assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* country (varchar(255)) country of assignee&lt;br /&gt;
* postal_code (varchar(255)) post code of the assignee&lt;br /&gt;
&lt;br /&gt;
===ASSIGNOR===&lt;br /&gt;
	&lt;br /&gt;
Purpose: store all the assignors from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the one assigning/granting the license for the entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id and assignor_name&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
&lt;br /&gt;
===CONVEYANCE===&lt;br /&gt;
Purpose: represent the conveyance of a particular patent over time&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual transaction for a patent (not unlike assignment, except this table has information about the assignment in relation to other assignments for the patent)&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id &lt;br /&gt;
&lt;br /&gt;
All columns (variables): Two possibilities, depending on whether we want data to be repeated or not.&lt;br /&gt;
	&lt;br /&gt;
Not-repeated (with exception of primary/foreign key): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* next_transaction (bigint) rf_id of next transaction that occurred after this transaction (traced by assignee/assignor)&lt;br /&gt;
* num_conveyance (int) number in order of conveyance by date&lt;br /&gt;
* acknowledgment_date (date) date that the assignor acknowledged the transaction&lt;br /&gt;
* execution_date (date) date that the transaction took place&lt;br /&gt;
* conveyance_type (varchar(255)) the USPTO paper determined a couple of different conveyance types and also outlined how to determine based on the conveyance text in the xml file which conveyance type is relevant to the transaction. The conveyance types are:&lt;br /&gt;
** assignment&lt;br /&gt;
** merger&lt;br /&gt;
** change of name&lt;br /&gt;
** government interest&lt;br /&gt;
** agreement&lt;br /&gt;
** security agreement&lt;br /&gt;
** release&lt;br /&gt;
&lt;br /&gt;
	&lt;br /&gt;
Repeated, also include: &lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
* patent_number (varchar(255)) identifying number for the patent&lt;br /&gt;
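&lt;br /&gt;
Because num_conveyance and next_transaction depend on the other transactions for the same patent, they can only be filled in once those transactions are known. The sketch below shows one possible way to derive them. It is a simplification that orders transactions purely by execution_date rather than tracing assignee/assignor chains, and the sample rows and the variable names (transactions, by_patent, conveyance) are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict

# Hypothetical rows: (rf_id, patent_number, execution_date as an ISO string).
transactions = [
    (300, "7000001", "2012-06-01"),
    (100, "7000001", "2009-03-15"),
    (200, "7000001", "2010-11-30"),
    (400, "7000002", "2011-01-20"),
]

# Group the transactions by patent so each patent gets its own ordering.
by_patent = defaultdict(list)
for rf_id, patent_number, execution_date in transactions:
    by_patent[patent_number].append((execution_date, rf_id))

conveyance = {}
for patent_number, entries in by_patent.items():
    entries.sort()  # ISO-formatted dates sort chronologically as strings
    for position, (execution_date, rf_id) in enumerate(entries):
        has_next = position + 1 != len(entries)
        next_rf_id = entries[position + 1][1] if has_next else None
        conveyance[rf_id] = {
            "num_conveyance": position + 1,  # 1-based order by date
            "next_transaction": next_rf_id,  # None for the latest transaction
        }

print(conveyance[100])
```
&lt;br /&gt;
In this sketch the latest transaction for a patent has no next_transaction, which would correspond to a NULL in the table.&lt;br /&gt;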
&lt;br /&gt;
===DOCUMENT_INFO===&lt;br /&gt;
Purpose: store extra information relevant to the patents represented in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the documentation and the patent for a particular entry in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables) when created: &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* app_num (varchar(255)) application document USPTO number &lt;br /&gt;
* app_date (date) date that the application was filed&lt;br /&gt;
* app_country (varchar(255)) country in which it was filed&lt;br /&gt;
* pgpub_num (varchar(255)) pre-grant publication document USPTO number&lt;br /&gt;
* pgpub_date (date) date that pre-grant publication was released&lt;br /&gt;
* pgpub_country (varchar(255)) country in which the patent is published before being granted a patent number&lt;br /&gt;
* patent_number (varchar(255)) granted patent document USPTO number&lt;br /&gt;
* grant_date (date) date that the patent is officially published&lt;br /&gt;
* grant_country (varchar(255)) country in which the patent is published after being granted a patent number&lt;br /&gt;
* invention_title (varchar(255)) title of the invention&lt;br /&gt;
* language (varchar(255)) language of invention title – could be potentially useful and interesting to investigate &lt;br /&gt;
* reel_num (bigint) number of the reel the assignment was stored on&lt;br /&gt;
* frame_num (bigint) number of the frame on the reel the assignment was stored on&lt;br /&gt;
&lt;br /&gt;
===ASSIGNMENT===&lt;br /&gt;
Purpose: represent assignment transactions – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent assignment transaction&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignment_id (bigint) US patent assignment id &lt;br /&gt;
* correspondent_name (varchar(255)) name of correspondent for the transaction, could be a lawyer&lt;br /&gt;
* correspondent_address1 (varchar(255)) first line of address for correspondent&lt;br /&gt;
* correspondent_address2 (varchar(255)) second line of address for correspondent&lt;br /&gt;
* correspondent_address3 (varchar(255)) third line of address for correspondent&lt;br /&gt;
* correspondent_address4 (varchar(255)) fourth line of address for correspondent&lt;br /&gt;
* record_date (date) date recorded with USPTO&lt;br /&gt;
* last_update_date (date) date that information for this assignment was last updated with the USPTO &lt;br /&gt;
* page_cnt (bigint) page count of assignment record&lt;br /&gt;
&lt;br /&gt;
More extensive notes exist under :E/McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper&lt;br /&gt;
&lt;br /&gt;
==Patent Database Structure==&lt;br /&gt;
&lt;br /&gt;
Most of this design is based off what I investigated last semester and therefore is subject to change, since the design for the Assignment database is somewhat new. &lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table is patent_number, which uniquely identifies every patent in the PATENT table and is included in every table.&lt;br /&gt;
&lt;br /&gt;
Full list of tables: patent, assignee, citation, fee, claims, histpatent, lawyers, inventors&lt;br /&gt;
&lt;br /&gt;
Update: These tables, as of October 26, are being updated and changed as I determine what data can actually be retrieved from the XML files based on the xpaths that we have determined (you can see these xpaths on the &amp;quot;Equivalent XPath and APS Queries&amp;quot; project page).&lt;br /&gt;
&lt;br /&gt;
===PATENT===&lt;br /&gt;
Purpose: represent patents – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters, hence varchar&lt;br /&gt;
* patent_type (varchar(255)) the type of patent, like utility&lt;br /&gt;
* patent_kind (varchar(2)) the kind of patent, a letter and a number&lt;br /&gt;
* title (varchar(255)) the title of the patent&lt;br /&gt;
* grantdate (date) date that patent was officially granted including the year&lt;br /&gt;
* prioritydate (date) date used to establish novelty of an invention in regards to other inventions (source: http://www.bios.net/daisy/patentlens/2343.html)&lt;br /&gt;
* prioritycountry (varchar(255)) country where patent is first filed&lt;br /&gt;
* patent_country (varchar(255)) country in which this patent was published&lt;br /&gt;
* prioritypatentnumber (varchar(255)) placeholder patent number if the patent is published before it is fully approved - also referred to as priority claims number&lt;br /&gt;
* cpcsubclass (varchar(255)) cooperative patent classification subclass&lt;br /&gt;
* cpcsubgroup (varchar(255)) cooperative patent classification subgroup&lt;br /&gt;
* cpcmaingroup (varchar(255)) cooperative patent classification main group&lt;br /&gt;
* cpctotal (varchar(255)) concatenated version of cpcmaingroup, cpcsubgroup, cpcsubclass&lt;br /&gt;
* pctpatentnumber (varchar(255)) international patent number (according to Patent Cooperation Treaty) also known as PCT Document Number&lt;br /&gt;
* appnum(varchar(255)) application number, probably also the filing number - the format is a two digit series code followed by a six digit serial number (source: https://www.uspto.gov/patents-application-process/filing-online/info-application-number)&lt;br /&gt;
* appdate (date) date that the application was filed, including year - probably also the filing date&lt;br /&gt;
* ipcrsubclass (varchar(255)) International patent classification subclass&lt;br /&gt;
* ipcrmaingroup (varchar(255)) International patent classification main group&lt;br /&gt;
* ipcrsubgroup (varchar(255)) International patent classification subgroup&lt;br /&gt;
* ipcrtotal (varchar(255)) concatenated version of ipcrsubgroup, ipcrsubclass, ipcrmaingroup &lt;br /&gt;
* national_classification (varchar(255)) - probably uspc, which is United States Patent Classification&lt;br /&gt;
* national_classification_country (varchar(255)) - should be United States&lt;br /&gt;
* num_claims (int) number of claims (things that the inventor wishes to protect with this patent)&lt;br /&gt;
&lt;br /&gt;
Discrepancies: &lt;br /&gt;
These fields were previously a part of the table, but we do not currently have xpaths for them:&lt;br /&gt;
* app_year - the year the application was filed&lt;br /&gt;
* grant_year - the year the patent was granted&lt;br /&gt;
* uspc - United States Patent Classification (might be the same as national_classification)&lt;br /&gt;
* uspcsub - United States Patent subgroup&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: Represent the person the patent was currently assigned to at publication according to the patent information (the information in the Assignment database is more complete) &lt;br /&gt;
&lt;br /&gt;
Each entry represents: the assignee for a patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number, first_name, last_name (in case it is possible for there to be multiple assignees for an individual patent)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters, hence varchar &lt;br /&gt;
* first_name (varchar(255)) first name of assignee&lt;br /&gt;
* last_name (varchar(255)) last name of assignee&lt;br /&gt;
* address (varchar(255)) address of assignee&lt;br /&gt;
* postcode (varchar(255)) postcode of assignee&lt;br /&gt;
* orgname (varchar(255)) name of the organization owning the patent (if applicable - organization may go by multiple orgnames) &lt;br /&gt;
* city (varchar(255)) city of the assignee&lt;br /&gt;
* country (varchar(255)) country of the assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* residence (varchar(255)) mailing address of assignee if they receive mail at a different address than the one listed above&lt;br /&gt;
* size_of_firm (varchar(255)) size of the organization, large, small, or micro - relates to how much the fee they pay for the patent is&lt;br /&gt;
&lt;br /&gt;
===CITATION===&lt;br /&gt;
Purpose: Represent all the patents in PATENT that another patent in PATENT cites&lt;br /&gt;
&lt;br /&gt;
Each entry represents: one citation that a patent in PATENT makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number AND cited_patent_number (because a patent might cite multiple patents, so both variables are needed for the primary key)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters, hence varchar&lt;br /&gt;
* cited_patent_number (varchar(255)) unique identifier from the USPTO for the patent being cited&lt;br /&gt;
* description (varchar(255)) description of the citation&lt;br /&gt;
* citation_date (date) unclear what this date is supposed to represent from the DTD, but according to the paper that is the second link in the top section, it is the &amp;quot;date of cited document&amp;quot;&lt;br /&gt;
* citation_kind (varchar(255)) type of cited document&lt;br /&gt;
* citation_country (varchar(255)) country of origin for the cited patent &lt;br /&gt;
* citation_name (varchar(255)) appears to be the names of those to whom the cited patent is assigned&lt;br /&gt;
&lt;br /&gt;
Note: There are other fields available for this table, such as document number, but this seems like it might be a repeat of the document number for the citing patent&lt;br /&gt;
&lt;br /&gt;
===FEE===&lt;br /&gt;
Purpose: Represent information about the fees paid for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the fees paid on the patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters, hence varchar  &lt;br /&gt;
* four (boolean) payment of maintenance fee 4th year&lt;br /&gt;
* eight (boolean) payment of maintenance fee 8th year&lt;br /&gt;
* twelve (boolean) payment of maintenance fee 12th year&lt;br /&gt;
* fee_date (date) date that fee is paid on&lt;br /&gt;
* fee_code (varchar(255)) code for type of fee to pay&lt;br /&gt;
&lt;br /&gt;
===CLAIMS===&lt;br /&gt;
&lt;br /&gt;
Purpose: Represent information about the claims for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a claim that a patent makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number and claim_id&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters, hence varchar&lt;br /&gt;
* claim_id (varchar(255)) id number for the claim&lt;br /&gt;
* language (varchar(255)) language of claim&lt;br /&gt;
* claim_type (varchar(255)) type of claim&lt;br /&gt;
* status (varchar(255)) status of the claim&lt;br /&gt;
* claim_text (text) the text of the claim&lt;br /&gt;
&lt;br /&gt;
===HISTPATENT===&lt;br /&gt;
Purpose: Represent historical data about the patent pertaining to dates and publication&lt;br /&gt;
&lt;br /&gt;
Each entry represents: sequence of historical data about an individual patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters, hence varchar&lt;br /&gt;
* pubno (varchar(255)) patent publication number&lt;br /&gt;
* pubdate (date) date patent was issued&lt;br /&gt;
* dispdate (date) disposal date for the application&lt;br /&gt;
* distype (varchar(3)) application status of the patent - ABN for abandoned, ISS for issued, or PEN for pending&lt;br /&gt;
* exp_dt (date) expiration date of patent&lt;br /&gt;
* pta (int) length of patent term adjustment (which extends the amount of time the patent is allowed to be in force) in number of days&lt;br /&gt;
&lt;br /&gt;
===INVENTORS===&lt;br /&gt;
Purpose: Represent the inventors for each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about an inventor for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (and name, if there can be multiple inventors for a single patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters, hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identifies the inventor?&lt;br /&gt;
* name (varchar(255)) concatenated string of the first and last name of the inventor&lt;br /&gt;
* orgname (varchar(255)) name of the organization the inventor works for&lt;br /&gt;
* state (varchar(255)) - state in which the inventor resided&lt;br /&gt;
* address (varchar(255)) - address at which the inventor resided&lt;br /&gt;
* country (varchar(255)) country in which the inventor resided&lt;br /&gt;
* city (varchar(255)) city in which the inventor resided&lt;br /&gt;
&lt;br /&gt;
===LAWYERS===&lt;br /&gt;
Purpose: Represent the lawyers who worked on each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a lawyer for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (there doesn't appear to be more than one lawyer entry for each patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters, hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identifies the lawyer?&lt;br /&gt;
* org_name (varchar(255)) the name of the firm that did arbitration for the patent&lt;br /&gt;
&lt;br /&gt;
===Unique Attributes Tables===&lt;br /&gt;
&lt;br /&gt;
There are several attributes that have been identified as unique, applying to only one kind of patent. Adding all these fields to the main patent table would introduce a lot of NULL entries, because three out of four patent types will not have any particular attribute. Therefore, I have created three new tables to store these unique attributes - one for each type of patent with unique attributes. &lt;br /&gt;
&lt;br /&gt;
For each table, the primary key is the patent_number, which connects a row to a row in the main patent table. &lt;br /&gt;
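&lt;br /&gt;
A minimal sketch of this layout, using Python's built-in sqlite3 module. The table and column definitions are trimmed-down, hypothetical versions of the ones described on this page, and the rows are made up for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

# Hypothetical, trimmed-down versions of the main patent table and one
# unique-attributes table (plant-specific fields).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patent (
        patent_number VARCHAR(255) PRIMARY KEY,
        patent_type   VARCHAR(255)
    );
    CREATE TABLE plant (
        patent_number        VARCHAR(255) PRIMARY KEY
            REFERENCES patent (patent_number),
        latin_name           VARCHAR(255),
        us_botanical_variety VARCHAR(255)
    );
""")

# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO patent VALUES (?, ?)",
    [("PP00001", "plant"), ("1234567", "utility")],
)
conn.execute(
    "INSERT INTO plant VALUES (?, ?, ?)",
    ("PP00001", "Rosa hybrida", "Example Variety"),
)

# A LEFT JOIN keeps every patent; only plant patents carry the extra fields,
# so the main table never needs NULL-filled plant columns.
rows = conn.execute("""
    SELECT p.patent_number, p.patent_type, pl.latin_name
    FROM patent AS p
    LEFT JOIN plant AS pl ON pl.patent_number = p.patent_number
    ORDER BY p.patent_number
""").fetchall()
print(rows)
```
&lt;br /&gt;
The NULLs only appear in the joined result for non-plant patents, not in storage, which is the point of splitting the unique attributes out.&lt;br /&gt;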
&lt;br /&gt;
TODO: Figure out best way to generate these tables, finish description of attributes for each table.&lt;br /&gt;
&lt;br /&gt;
* Note that I did include some unique attributes for Utility patents as seen below in the &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot; section, but for now (11/3/2017) I'm not going to create a separate table for those fields. Most of them seem to be repeated in Reissue, and the other fields, while interesting, are perhaps not very useful.&lt;br /&gt;
&lt;br /&gt;
====DESIGN====&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO; can contain letters, hence varchar&lt;br /&gt;
* latin_name (varchar(255)) Latin name for the plant&lt;br /&gt;
* us_botanical_variety (varchar(255)) denotes what variety of plant it is - for example, a rose has several different varieties&lt;br /&gt;
&lt;br /&gt;
====REISSUE====&lt;br /&gt;
&lt;br /&gt;
In the following table, you'll notice lots of fields related to three different kinds of &amp;quot;documents&amp;quot; - a &amp;quot;parent document&amp;quot;, a &amp;quot;child document&amp;quot;, and a &amp;quot;parent grant document&amp;quot;. It is not immediately clear what these three documents represent for a reissue patent. After some research, I think I have determined what these &amp;quot;documents&amp;quot; are, but please know that I do not have definitive proof, as there is little available information about these &amp;quot;documents&amp;quot; with regard to a reissue patent.&lt;br /&gt;
&lt;br /&gt;
It is possible that the &amp;quot;parent document&amp;quot; is the parent patent application - that is, the first patent application filed in regard to an invention. That would explain why reissue patents also have several fields related to a &amp;quot;child document&amp;quot; - a &amp;quot;child document&amp;quot; could be a child application, which is filed while a parent application is still pending. So in this case, a &amp;quot;child document&amp;quot; of a reissue patent would be an application regarding the same patent that was filed while the reissue application was still pending. A &amp;quot;child document&amp;quot; is either a continuation, divisional, or continuation-in-part application ([http://www.patenttrademarkblog.com/parent-and-child-patent-applications/ source]).&lt;br /&gt;
&lt;br /&gt;
Based on this logic, I think that the difference between a &amp;quot;parent document&amp;quot; and a &amp;quot;parent grant document&amp;quot; depends on whether the parent patent application has been granted. If the parent patent application is still pending (meaning the patent has not yet been granted), then I believe the reissue patent will store information about the parent patent under &amp;quot;parent document&amp;quot;. However, if the parent patent application has been granted, then the information will be stored under &amp;quot;parent grant document&amp;quot;. This seems like the most logical explanation, especially considering that the path to any field related to &amp;quot;parent grant document&amp;quot; contains &amp;quot;parent document&amp;quot;, as in the example below:&lt;br /&gt;
&lt;br /&gt;
&amp;gt; us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
&lt;br /&gt;
=====Fields=====&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO; can contain letters, hence varchar&lt;br /&gt;
* parent_doc_status (varchar(255)) status of the parent application. Probably related to whether the patent application is pending or not&lt;br /&gt;
* parent_doc_number (int) probably application number for the parent document&lt;br /&gt;
* parent_doc_id (varchar(255)) it is unclear how this is different from parent document number. In the xpaths, the path will be something like ./parent_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent document&amp;quot; is - or may refer to its purpose (i.e. reissue, continuation, continuation-in-part, etc.)&lt;br /&gt;
* parent_doc_country (varchar(255)) the country of origin of the parent document&lt;br /&gt;
* parent_doc_date (date) probably the date the parent document was filed&lt;br /&gt;
* parent_grant_doc_number (int) probably application number for the parent grant document&lt;br /&gt;
* parent_grant_doc_id (varchar(255)) it is unclear how this is different from parent grant document number. In the xpaths, the path will be something like ./parent_doc/parent-grant-doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_grant_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent grant document&amp;quot; is, or may refer to its purpose (i.e. reissue, continuation, continuation-in-part, etc.)&lt;br /&gt;
* parent_grant_doc_country (varchar(255)) the country of origin of the parent grant document&lt;br /&gt;
* parent_grant_doc_date (date) probably the date the parent grant document was filed&lt;br /&gt;
* child_doc_number (varchar(255)) probably the application number of the child application&lt;br /&gt;
* child_doc_id (varchar(255)) it is unclear how this is different from child document number. In the xpaths, the path will be something like ./child_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* child_doc_country (varchar(255)) country of origin of the child application&lt;br /&gt;
&lt;br /&gt;
====PLANT====&lt;br /&gt;
&lt;br /&gt;
==Connecting Patent database and Assignment database==&lt;br /&gt;
The answer to connecting the Patent database to the Assignment database lies in using the information in the table DOCUMENT_INFO to connect each assignment in the Assignment database ASSIGNMENT table to a patent number from the Patent database PATENT table. On further investigation into &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot;, the field they called &amp;quot;grant_doc_num&amp;quot; (grant_num above) in DOCUMENT_INFO is the patent number (the description of DOCUMENT_INFO above has been altered to reflect this).&lt;br /&gt;
&lt;br /&gt;
The paper mentions that there will be errors in the patent numbers stored in DOCUMENT_INFO under grant_doc_num, so a separate table called DOCUMENT_INFO_ADMIN was constructed to determine how prevalent the errors were. The authors queried administrative data for the patent number corresponding to each appno_doc_num. The &amp;quot;appno_doc_num&amp;quot; is the Application Document USPTO number from DOCUMENT_INFO. On further investigation into the paper, it became clear that &amp;quot;administrative data&amp;quot; refers to internal USPTO data that was available to the authors of the paper, but that we might not have. However, in 99% of cases the grant_doc_num and the patent number queried from the appno_doc_num were the same, so we can probably rely on grant_doc_num.&lt;br /&gt;
&lt;br /&gt;
Therefore, since every table in the Patent database will have the patent number stored, to connect any table in Assignment to the Patent database, it would first be joined with DOCUMENT_INFO on rf_id and then joined with the appropriate table in Patent on patent number.&lt;br /&gt;
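&lt;br /&gt;
The two-step join described above can be sketched as follows, again with Python's built-in sqlite3 module. This is only an illustration: the three tables are reduced to a few columns, the patent number column in DOCUMENT_INFO is assumed to be called patent_number, and the rows are made up.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assignment database (reduced, hypothetical column sets)
    CREATE TABLE assignment (rf_id BIGINT PRIMARY KEY, record_date DATE);
    CREATE TABLE document_info (rf_id BIGINT PRIMARY KEY,
                                patent_number VARCHAR(255));
    -- Patent database
    CREATE TABLE patent (patent_number VARCHAR(255) PRIMARY KEY,
                         title VARCHAR(255));
""")

# Made-up rows for illustration only.
conn.execute("INSERT INTO assignment VALUES (1, '2001-02-03')")
conn.execute("INSERT INTO document_info VALUES (1, '1234567')")
conn.execute("INSERT INTO patent VALUES ('1234567', 'Example title')")

# Step 1: join the Assignment table with DOCUMENT_INFO on rf_id.
# Step 2: join the result with the Patent table on patent number.
rows = conn.execute("""
    SELECT a.rf_id, d.patent_number, p.title
    FROM assignment AS a
    JOIN document_info AS d ON d.rf_id = a.rf_id
    JOIN patent AS p ON p.patent_number = d.patent_number
""").fetchall()
print(rows)
```
&lt;br /&gt;
Any other Assignment table (assignee, assignor, conveyance) could take the place of assignment in the first join, since they all carry rf_id.&lt;br /&gt;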
&lt;br /&gt;
The paper also mentions Assignments on the Web for Patents (AOTW-P), a searchable database of individual USPTO assignment records keyed on reel-frame identification, patent number, and assignor or assignee name (https://assignment.uspto.gov/patent/index.html#/patent/search). Obviously it would not be possible to query all the patent numbers by hand with this tool. If it were possible to write a script that queries each patent number using the rf_id and parses the response, this could be useful for checking the patent numbers, but it might not be any more accurate than what will already be in DOCUMENT_INFO.&lt;br /&gt;
&lt;br /&gt;
==Finding New Paths Unique to Plant, Reissue, and/or Design Patents==&lt;br /&gt;
&lt;br /&gt;
Based on Oliver's script, which searched all xpaths and compared which were unique to particular types, we see that the following attributes are unique to each type of patent other than utility patents. These attributes vary by XML version, which changed over time. Therefore, the lists below are a superset of the attributes that are unique to each of the patent types listed below across all XML versions.&lt;br /&gt;
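&lt;br /&gt;
The comparison described above amounts to a set difference per patent type, followed by a union across XML versions. The sketch below is not Oliver's script, just an illustration of the idea; the xpaths in xpaths_by_type are made up and both function names are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical input: the set of xpaths observed for each patent type
# in one XML version. These paths are made up for illustration.
xpaths_by_type = {
    "utility": {"us-patent-grant/claims", "us-patent-grant/abstract"},
    "plant": {"us-patent-grant/claims", "us-patent-grant/latin-name"},
    "design": {"us-patent-grant/claims", "us-patent-grant/length-of-grant"},
}


def unique_xpaths(by_type):
    """Return, for each type, the xpaths that no other type contains."""
    unique = {}
    for patent_type, paths in by_type.items():
        others = set()
        for other_type, other_paths in by_type.items():
            if other_type != patent_type:
                others |= other_paths
        unique[patent_type] = paths - others
    return unique


def superset_across_versions(unique_by_version):
    """Union the per-version unique xpaths into one superset per type."""
    merged = {}
    for unique in unique_by_version:
        for patent_type, paths in unique.items():
            merged.setdefault(patent_type, set()).update(paths)
    return merged


result = unique_xpaths(xpaths_by_type)
print(result)
```
&lt;br /&gt;
Running unique_xpaths once per XML version and passing the list of results to superset_across_versions would give one superset per patent type.&lt;br /&gt;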
&lt;br /&gt;
When I am done finding the supersets, we will determine how to integrate these attributes into the design for the patent table.&lt;br /&gt;
&lt;br /&gt;
'''There appears to be a fifth patent type in the results from Oliver's script, and there are attributes that are unique to Utility patents, though they do not appear particularly useful'''&lt;br /&gt;
&lt;br /&gt;
===Plant Patents===&lt;br /&gt;
* US Claim Statement (XML45) - unclear what this represents&lt;br /&gt;
* Latin Name (XML44)&lt;br /&gt;
* US Botanical Variety (XML44)&lt;br /&gt;
&lt;br /&gt;
===Reissue Patents===&lt;br /&gt;
* Parent Status (XML45)&lt;br /&gt;
* Parent Document (XML44) - seems to be either a normal document or a parent grant document (or both), both contain the following fields&lt;br /&gt;
** Date&lt;br /&gt;
** Document number&lt;br /&gt;
** Country &lt;br /&gt;
** Document ID&lt;br /&gt;
** Kind (XML42)&lt;br /&gt;
* Child Document (XML44)&lt;br /&gt;
** Document ID&lt;br /&gt;
** Document Number&lt;br /&gt;
** Country (XML43)&lt;br /&gt;
&lt;br /&gt;
Note about fields related to the parent document and/or parent grant document:&lt;br /&gt;
The structure of the above fields related to a reissue patent is very odd and varies by XML version. For example, for most of the XML versions, parent grant document falls under parent document and has its own document number, date, country, document ID, kind, etc. The parent doc field falls under relation, which falls under reissue. &lt;br /&gt;
&lt;br /&gt;
However, in XML41, parent document falls under relation and a new category - US Reexamination Reissue Merger. The information pertaining to the child document also falls under both relation and US Reexamination Reissue Merger. It's possible that during the time period this XML version represents, there were two types of reissue patents - a reexamination related to a merger and a standard reissue. With discrepancies like these, we'll have to determine in the end whether we want to stick with the fields as they are designed above (in which case the information about a parent doc or parent grant document is stored the same way whether it is a reissue or a Reexamination Reissue), or whether these two classifications are different enough to warrant duplicate fields in the table.&lt;br /&gt;
&lt;br /&gt;
===Design Patents===&lt;br /&gt;
* length of grant (XML45)&lt;br /&gt;
* Hague Agreement Data (XML45) - allows people to file design patents in 66 countries with one application&lt;br /&gt;
** International Registration Date&lt;br /&gt;
** International Filing Date&lt;br /&gt;
** International Registration Publication Date&lt;br /&gt;
** International Registration Number&lt;br /&gt;
* Classification Locarno (XML40)&lt;br /&gt;
** Edition&lt;br /&gt;
** Main Classification&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please note that the lists above are meant to represent unique fields that should be added to the patent database. This does not mean that every XML version contains all these fields, or that every version contains the same path to these fields. That will have to be determined separately. The XML number next to some of the items is meant to represent the latest XML version where the field was seen.&lt;br /&gt;
&lt;br /&gt;
Though I was mainly looking at unique fields in plant, reissue, and design patents, I noticed the following fields on unique paths for utility patents in XML41:&lt;br /&gt;
&lt;br /&gt;
* US Related Documents&lt;br /&gt;
** Substitution&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Parent Document&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Country&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
***** Parent Status&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Country&lt;br /&gt;
** Continuation/Continuation in Part&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Doc&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
** Division&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Kind&lt;br /&gt;
&lt;br /&gt;
This appears to resemble the information about reissue patents. Additionally, a couple of fields appear under the information about bibliographic data that might be useful:&lt;br /&gt;
* US Deceased Inventor (XML41)&lt;br /&gt;
** Post Code&lt;br /&gt;
** Address&lt;br /&gt;
* US Provisional Application Status (XML41)&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=21780</id>
		<title>Redesign Assignment and Patent Database</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=21780"/>
		<updated>2017-11-10T22:06:22Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* Fields */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Redesign Assignment and Patent Database&lt;br /&gt;
|Has owner=Shelby Bice,&lt;br /&gt;
|Has start date=9/2017&lt;br /&gt;
|Has keywords=patent&lt;br /&gt;
|Is dependent on=Reproducible Patent Data,&lt;br /&gt;
}}&lt;br /&gt;
'''FOR ED:''' I finished going through the unique paths (you can see the unique fields I identified at the bottom of this page, under &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot;). Then, under my design for the Patent Database, there's a new section called &amp;quot;Unique Attributes Tables&amp;quot; where I've begun detailing a design for three new tables that would include information that is unique to Design, Reissue, and Plant patents. Here are some questions I have:&lt;br /&gt;
&lt;br /&gt;
'''1.''' Do you like the design for the new tables? How do you think we should populate these tables? I can think of one way (putting all the information in the main patent table and then moving it out to Plant, Reissue, or Design as needed) but I was wondering if there might be an easier way. &lt;br /&gt;
&lt;br /&gt;
'''2.''' There were some unique attributes to utility patents that I also included in &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot;. They only appear to be in two of the oldest XML versions (4.1 and 4.0) and I'm not sure the information is particularly useful - but please take a look at the field names and let me know if you disagree and think they are useful and should be included in a table in the patent database.&lt;br /&gt;
&lt;br /&gt;
'''3.''' Do you know what some of these fields represent? For example, I've been trying to research online what a &amp;quot;Parent Document&amp;quot; and/or &amp;quot;Parent Grant Document&amp;quot; might represent for a reissue patent, and I've found a couple possible options. I will keep researching, but if you know what they represent or one of my descriptions for a field is wrong, please let me know.&lt;br /&gt;
&lt;br /&gt;
This is an extension of the work I did last semester under &amp;quot;Redesigning Patent Database&amp;quot;. Instead of simply reconfiguring the existing database, this project encompasses a full redesign and creation of a new Patent database and a new Assignment database that will be joined together.&lt;br /&gt;
&lt;br /&gt;
Reference for definitions of fields in the two databases: https://www.oecd.org/sti/sci-tech/37569498.pdf&lt;br /&gt;
&lt;br /&gt;
Paper on how to extract patent data from XML files, parse it, and put it into a database: https://funginstitute.berkeley.edu/wp-content/uploads/2014/06/patentprocessor.pdf&lt;br /&gt;
&lt;br /&gt;
'''Particularly useful, because it provides solutions for cleaning location and firm data'''&lt;br /&gt;
&lt;br /&gt;
==ER Diagram for Assignment and Patent Databases==&lt;br /&gt;
&lt;br /&gt;
Note: this is currently out of sync with the current description below for the patent database. It will be updated once I have finalized how we will organize unique attributes for plant, design, and reissue patents.&lt;br /&gt;
&lt;br /&gt;
[[File:PatentAndAssignmentER2.png]]&lt;br /&gt;
&lt;br /&gt;
Attributes for each table are listed below with descriptions - because of how many attributes there are, I decided the ER diagram would be better suited as an overview of the tables rather than trying to show all the attributes on the diagram.&lt;br /&gt;
&lt;br /&gt;
==Assignment Database Structure==&lt;br /&gt;
&lt;br /&gt;
After reading through the paper &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot; (https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf) and taking into account our needs for the Assignment database (the ability to connect the two databases together, the ability to trace conveyance of a patent over time), I decided on the following structure for the Assignment database.&lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table in the database is rf_id, which stands for reel frame id. It is unique for each entry in the main table, ASSIGNMENT, and is included in every table. &lt;br /&gt;
&lt;br /&gt;
Conveyance is the only table that cannot be fully constructed as data is inserted into the database. Some of its fields depend on other tables being constructed, so it may either be partially populated while the other tables are being populated or be the last table to be populated. &lt;br /&gt;
&lt;br /&gt;
Full list of tables: assignee, assignor, document_info, assignment, conveyance&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: store all the assignees from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the recipient of a patent in an entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id, unless it is possible to have multiple assignees, then rf_id and assignee name&lt;br /&gt;
&lt;br /&gt;
All columns (variables) in the table: &lt;br /&gt;
&lt;br /&gt;
* rf_id (bigint) reel frame number &lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* address_line_1 (varchar(255)) first line of address of assignee&lt;br /&gt;
* address_line_2 (varchar(255)) second line of address of assignee&lt;br /&gt;
* city (varchar(255)) city of assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* country (varchar(255)) country of assignee&lt;br /&gt;
* postal_code (varchar(255)) post code of the assignee&lt;br /&gt;
&lt;br /&gt;
===ASSIGNOR===&lt;br /&gt;
	&lt;br /&gt;
Purpose: store all the assignors from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the one assigning/granting the license for the entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id and assignor_name&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
&lt;br /&gt;
===CONVEYANCE===&lt;br /&gt;
Purpose: represent the conveyance of a particular patent over time&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual transaction for a patent (not unlike assignment, except this table has information about the assignment in relation to other assignments for the patent)&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id &lt;br /&gt;
&lt;br /&gt;
All columns (variables): Two possibilities, depending on whether we want data to be repeated or not.&lt;br /&gt;
	&lt;br /&gt;
Not-repeated (with exception of primary/foreign key): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* next_transaction (bigint) rf_id of next transaction that occurred after this transaction (traced by assignee/assignor)&lt;br /&gt;
* num_conveyance (int) number in order of conveyance by date&lt;br /&gt;
* acknowledgment_date (date) date that the assignor acknowledged the transaction&lt;br /&gt;
* execution_date (date) date that the transaction took place&lt;br /&gt;
* conveyance_type (varchar(255)) the USPTO paper determined a couple of different conveyance types and also outlined how to determine based on the conveyance text in the xml file which conveyance type is relevant to the transaction. The conveyance types are:&lt;br /&gt;
** assignment&lt;br /&gt;
** merger&lt;br /&gt;
** change of name&lt;br /&gt;
** government interest&lt;br /&gt;
** agreement&lt;br /&gt;
** security agreement&lt;br /&gt;
** release&lt;br /&gt;
&lt;br /&gt;
	&lt;br /&gt;
Repeated, also include: &lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
* patent_number (varchar(255)) identifying number for the patent&lt;br /&gt;
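&lt;br /&gt;
The num_conveyance and next_transaction fields above could be filled in roughly as follows once the other tables are populated. This is a simplified sketch under stated assumptions: it orders transactions for each patent by execution_date only (it does not trace the assignee/assignor chain), it assumes one patent per rf_id, and the rows and the function name order_conveyances are made up.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict

# Made-up transactions: (rf_id, patent_number, execution_date as ISO string).
transactions = [
    (300, "1234567", "2005-06-01"),
    (100, "1234567", "2001-02-03"),
    (200, "7654321", "2003-04-05"),
]


def order_conveyances(rows):
    """Map each rf_id to (num_conveyance, next_transaction) for its patent."""
    by_patent = defaultdict(list)
    for rf_id, patent_number, execution_date in rows:
        by_patent[patent_number].append((execution_date, rf_id))

    result = {}
    for chain in by_patent.values():
        chain.sort()  # ISO date strings sort chronologically
        for position, (_, rf_id) in enumerate(chain):
            is_last = position == len(chain) - 1
            next_rf_id = None if is_last else chain[position + 1][1]
            # num_conveyance is 1-based; next_transaction is None at the end.
            result[rf_id] = (position + 1, next_rf_id)
    return result


conveyance = order_conveyances(transactions)
print(conveyance)
```
&lt;br /&gt;
Because the ordering needs every transaction for a patent to be present, this matches the note above that Conveyance may be the last table to be populated.&lt;br /&gt;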
&lt;br /&gt;
===DOCUMENT_INFO===&lt;br /&gt;
Purpose: store extra information relevant to the patents represented in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the documentation and the patent for a particular entry in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables) when created: &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* app_num (varchar(255)) application document USPTO number &lt;br /&gt;
* app_date (date) date that the application was filed&lt;br /&gt;
* app_country (varchar(255)) country in which it was filed&lt;br /&gt;
* pgpub_num (varchar(255)) pre-grant publication document USPTO number&lt;br /&gt;
* pgpub_date (date) date that pre-grant publication was released&lt;br /&gt;
* pgpub_country (varchar(255)) country in which the patent is published before being granted a patent number&lt;br /&gt;
* patent_number (varchar(255)) granted patent document USPTO number&lt;br /&gt;
* grant_date (date) date that the patent is officially published&lt;br /&gt;
* grant_country (varchar(255)) country in which the patent is published after being granted a patent number&lt;br /&gt;
* invention_title (varchar(255)) title of the invention&lt;br /&gt;
* language (varchar(255)) language of invention title – could be potentially useful and interesting to investigate &lt;br /&gt;
* reel_num (bigint) number of the reel the assignment was stored on&lt;br /&gt;
* frame_num (bigint) number of the frame on the reel the assignment was stored on&lt;br /&gt;
&lt;br /&gt;
===ASSIGNMENT===&lt;br /&gt;
Purpose: represent assignment transactions – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent assignment transaction&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignment_id (bigint) US patent assignment id &lt;br /&gt;
* correspondent_name (varchar(255)) name of correspondent for the transaction, could be a lawyer&lt;br /&gt;
* correspondent_address1 (varchar(255)) first line of address for correspondent&lt;br /&gt;
* correspondent_address2 (varchar(255)) second line of address for correspondent&lt;br /&gt;
* correspondent_address3 (varchar(255)) third line of address for correspondent&lt;br /&gt;
* correspondent_address4 (varchar(255)) fourth line of address for correspondent&lt;br /&gt;
* record_date (date) date recorded with USPTO&lt;br /&gt;
* last_update_date (date) date that information for this assignment was last updated with the USPTO &lt;br /&gt;
* page_cnt (bigint) page count of assignment record&lt;br /&gt;
&lt;br /&gt;
More extensive notes exist under :E/McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper&lt;br /&gt;
&lt;br /&gt;
==Patent Database Structure==&lt;br /&gt;
&lt;br /&gt;
Most of this design is based off what I investigated last semester and therefore is subject to change, since the design for the Assignment database is somewhat new. &lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table is patent_number, which uniquely identifies every patent in the PATENT table and is included in every table.&lt;br /&gt;
&lt;br /&gt;
Full list of tables: patent, assignee, citation, fee, claims, histpatent, lawyers, inventors&lt;br /&gt;
&lt;br /&gt;
Update: These tables, as of October 26, are being updated and changed as I determine what data can actually be retrieved from the XML files based on the xpaths that we have determined (you can see these xpaths on the &amp;quot;Equivalent XPath and APS Queries&amp;quot; project page).&lt;br /&gt;
&lt;br /&gt;
===PATENT===&lt;br /&gt;
Purpose: represent patents – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO; can contain letters, hence varchar&lt;br /&gt;
* patent_type (varchar(255)) the type of patent, like utility&lt;br /&gt;
* patent_kind (varchar(2)) the kind of patent, a letter and a number&lt;br /&gt;
* title (varchar(255)) the title of the patent&lt;br /&gt;
* grantdate (date) date that patent was officially granted including the year&lt;br /&gt;
* prioritydate (date) date used to establish the novelty of an invention relative to other inventions (source: http://www.bios.net/daisy/patentlens/2343.html)&lt;br /&gt;
* prioritycountry (varchar(255)) country where patent is first filed&lt;br /&gt;
* patent_country (varchar(255)) country in which this patent was published&lt;br /&gt;
* prioritypatentnumber (varchar(255)) placeholder patent number if the patent is published before it is fully approved - also referred to as priority claims number&lt;br /&gt;
* cpcsubclass (varchar(255)) cooperative patent classification subclass&lt;br /&gt;
* cpcsubgroup (varchar(255)) cooperative patent classification subgroup&lt;br /&gt;
* cpcmaingroup (varchar(255)) cooperative patent classification main group&lt;br /&gt;
* cpctotal (varchar(255)) concatenated version of cpcmaingroup, cpcsubgroup, cpcsubclass&lt;br /&gt;
* pctpatentnumber (varchar(255)) international patent number (according to Patent Cooperation Treaty) also known as PCT Document Number&lt;br /&gt;
* appnum (varchar(255)) application number, probably also the filing number - the format is a two digit series code followed by a six digit serial number (source: https://www.uspto.gov/patents-application-process/filing-online/info-application-number)&lt;br /&gt;
* appdate (date) date that the application was filed, including year - probably also the filing date&lt;br /&gt;
* ipcrsubclass (varchar(255)) International patent classification subclass&lt;br /&gt;
* ipcrmaingroup (varchar(255)) International patent classification main group&lt;br /&gt;
* ipcrsubgroup (varchar(255)) International patent classification subgroup&lt;br /&gt;
* ipcrtotal (varchar(255)) concatenated version of ipcrsubgroup, ipcrsubclass, ipcrmaingroup &lt;br /&gt;
* national_classification (varchar(255)) - probably uspc, which is United States Patent Classification&lt;br /&gt;
* national_classification_country (varchar(255)) - should be United States&lt;br /&gt;
* num_claims (int) number of claims (things that the inventor wishes to protect with this patent)&lt;br /&gt;
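&lt;br /&gt;
As an illustration of the appnum format described above (a two digit series code followed by a six digit serial number), a value could be split like this. This is a minimal sketch: the function name split_appnum is hypothetical, and accepting an optional slash between the two parts is an assumption about how the value might be written, not something confirmed from the XML files.&lt;br /&gt;
&lt;br /&gt;
```python
import re

# Two-digit series code, optional slash (an assumption), six-digit serial.
APPNUM_PATTERN = re.compile(r"(\d{2})/?(\d{6})")


def split_appnum(appnum):
    """Return (series_code, serial_number), or None if the format differs."""
    match = APPNUM_PATTERN.fullmatch(appnum.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)


print(split_appnum("08/123456"))
print(split_appnum("08123456"))
print(split_appnum("D123"))
```
&lt;br /&gt;
Values that do not fit the pattern return None rather than raising, so they can be counted and inspected separately.&lt;br /&gt;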
&lt;br /&gt;
Discrepancies: &lt;br /&gt;
These fields were previously a part of the table, but we do not currently have xpaths for them:&lt;br /&gt;
* app_year - the year the application was filed&lt;br /&gt;
* grant_year - the year the patent was granted&lt;br /&gt;
* uspc - United States Patent Classification (might be the same as national_classification)&lt;br /&gt;
* uspcsub - United States Patent subgroup&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: Represent the person the patent was currently assigned to at publication according to the patent information (the information in the Assignment database is more complete) &lt;br /&gt;
&lt;br /&gt;
Each entry represents: the assignee for a patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number, first_name, last_name (in case it is possible for there to be multiple assignees for an individual patent)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO; can contain letters, hence varchar &lt;br /&gt;
* first_name (varchar(255)) first name of assignee&lt;br /&gt;
* last_name (varchar(255)) last name of assignee&lt;br /&gt;
* address (varchar(255)) address of assignee&lt;br /&gt;
* postcode (varchar(255)) postcode of assignee&lt;br /&gt;
* orgname (varchar(255)) name of the organization owning the patent (if applicable - organization may go by multiple orgnames) &lt;br /&gt;
* city (varchar(255)) city of the assignee&lt;br /&gt;
* country (varchar(255)) country of the assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* residence (varchar(255)) mailing address of assignee if they receive mail at a different address than the one listed above&lt;br /&gt;
* size_of_firm (varchar(255)) size of the organization (large, small, or micro), which determines how large a fee they pay for the patent&lt;br /&gt;
&lt;br /&gt;
===CITATION===&lt;br /&gt;
Purpose: Represent all the patents in PATENT that another patent in PATENT cites&lt;br /&gt;
&lt;br /&gt;
Each entry represents: one citation that a patent in PATENT makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number AND cited_patent_number (because a patent might cite multiple patents, so both variables are needed for the primary key)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO; can contain letters, hence varchar&lt;br /&gt;
* cited_patent_number (varchar(255)) unique identifier from the USPTO for the patent being cited&lt;br /&gt;
* description (varchar(255)) description of the citation&lt;br /&gt;
* citation_date (date) unclear what this date is supposed to represent from the DTD, but according to the paper that is the second link in the top section, it is the &amp;quot;date of cited document&amp;quot;&lt;br /&gt;
* citation_kind (varchar(255)) type of cited document&lt;br /&gt;
* citation_country (varchar(255)) country of origin for the cited patent &lt;br /&gt;
* citation_name (varchar(255)) appears to be the names of those to whom the cited patent is assigned&lt;br /&gt;
&lt;br /&gt;
Note: There are other fields available for this table, such as document number, but this seems like it might be a repeat of the document number for the citing patent&lt;br /&gt;
&lt;br /&gt;
===FEE===&lt;br /&gt;
Purpose: Represent information about the fees paid for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the fees paid on the patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO; can contain letters, hence varchar&lt;br /&gt;
* four (boolean) payment of maintenance fee 4th year&lt;br /&gt;
* eight (boolean) payment of maintenance fee 8th year&lt;br /&gt;
* twelve (boolean) payment of maintenance fee 12th year&lt;br /&gt;
* fee_date (date) date on which the fee is paid&lt;br /&gt;
* fee_code (varchar(255)) code for type of fee to pay&lt;br /&gt;
&lt;br /&gt;
===CLAIMS===&lt;br /&gt;
&lt;br /&gt;
Purpose: Represent information about the claims for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a claim that a patent makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number and claim_id&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* claim_id (varchar(255)) id number for the claim&lt;br /&gt;
* language (varchar(255)) language of claim&lt;br /&gt;
* claim_type (varchar(255)) type of claim&lt;br /&gt;
* status (varchar(255)) status of the claim&lt;br /&gt;
* claim_text (text) the text of the claim&lt;br /&gt;
&lt;br /&gt;
===HISTPATENT===&lt;br /&gt;
Purpose: Represent historical data about the patent pertaining to dates and publication&lt;br /&gt;
&lt;br /&gt;
Each entry represents: sequence of historical data about an individual patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* pubno (varchar(255)) patent publication number&lt;br /&gt;
* pubdate (date) date patent was issued&lt;br /&gt;
* dispdate (date) disposal date for the application&lt;br /&gt;
* distype (varchar(3)) application status of the patent - ABN for abandoned, ISS for issued, or PEN for pending&lt;br /&gt;
* exp_dt (date) expiration date of patent&lt;br /&gt;
* pta (int) length of patent term adjustment (which extends the amount of time the patent is allowed to be in force) in number of days&lt;br /&gt;
&lt;br /&gt;
===INVENTORS===&lt;br /&gt;
Purpose: Represent the inventors for each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about an inventor for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (and name, if there can be multiple inventors for a single patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identify the inventor?&lt;br /&gt;
* name (varchar(255)) concatenated string of the first and last name of the inventor&lt;br /&gt;
* orgname (varchar(255)) name of the organization the inventor works for&lt;br /&gt;
* state (varchar(255)) state in which the inventor resided&lt;br /&gt;
* address (varchar(255)) address at which the inventor resided&lt;br /&gt;
* country (varchar(255)) country in which the inventor resided&lt;br /&gt;
* city (varchar(255)) city in which the inventor resided&lt;br /&gt;
&lt;br /&gt;
===LAWYERS===&lt;br /&gt;
Purpose: Represent the lawyers who worked on each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a lawyer for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (there doesn't appear to be more than one lawyer entry for each patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identifies the lawyer?&lt;br /&gt;
* org_name (varchar(255)) the name of the firm that did arbitration for the patent&lt;br /&gt;
&lt;br /&gt;
===Unique Attributes Tables===&lt;br /&gt;
&lt;br /&gt;
Several attributes have been identified that are unique, applying to only one kind of patent. Adding all of these fields to the main patent table would introduce a lot of NULL entries, because three out of the four patent types will not have any particular attribute. Therefore, I have created three new tables to store these unique attributes - one for each type of patent with unique attributes.&lt;br /&gt;
&lt;br /&gt;
For each table, the primary key is the patent_number, which connects a row to a row in the main patent table. &lt;br /&gt;
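&lt;br /&gt;
As a rough sketch of this arrangement, the example below creates a trimmed-down PATENT table and a table for the plant-specific attributes (latin_name and us_botanical_variety), keyed on patent_number. It uses Python's built-in sqlite3 module; the choice of SQLite, the table name plant, the foreign key constraint, and the sample rows are assumptions made for illustration and are not taken from the existing code.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE patent (
        patent_number VARCHAR(255) PRIMARY KEY,
        patent_type   VARCHAR(255)
    );
    CREATE TABLE plant (
        patent_number        VARCHAR(255) PRIMARY KEY
                             REFERENCES patent (patent_number),
        latin_name           VARCHAR(255),
        us_botanical_variety VARCHAR(255)
    );
    """
)

# Hypothetical sample rows: one utility patent and one plant patent.
conn.executemany(
    "INSERT INTO patent VALUES (?, ?)",
    [("7000001", "utility"), ("PP00001", "plant")],
)
conn.execute(
    "INSERT INTO plant VALUES (?, ?, ?)",
    ("PP00001", "Rosa hybrida", "hybrid tea rose"),
)

# A LEFT JOIN keeps every patent; only plant patents get a latin_name.
rows = conn.execute(
    """
    SELECT p.patent_number, p.patent_type, pl.latin_name
    FROM patent AS p
    LEFT JOIN plant AS pl ON pl.patent_number = p.patent_number
    ORDER BY p.patent_number
    """
).fetchall()
print(rows)
```
&lt;br /&gt;
With this layout the NULL values only appear in query results that ask for them, rather than being stored in every row of the main patent table.&lt;br /&gt;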
&lt;br /&gt;
TODO: Figure out best way to generate these tables, finish description of attributes for each table.&lt;br /&gt;
&lt;br /&gt;
* Note that I did include some unique attributes for Utility patents as seen below in the &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot; section, but for now (11/3/2017) I'm not going to create a separate table for those fields. Most of them seem to be repeated in Reissue, and the other fields, while interesting, are perhaps not very useful.&lt;br /&gt;
&lt;br /&gt;
====PLANT====&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* latin_name (varchar(255)) Latin name for the plant&lt;br /&gt;
* us_botanical_variety (varchar(255)) denotes what variety of plant it is - for example, a rose has several different varieties&lt;br /&gt;
&lt;br /&gt;
====REISSUE====&lt;br /&gt;
&lt;br /&gt;
In the following table, you'll notice lots of fields related to three different kinds of &amp;quot;documents&amp;quot; - a &amp;quot;parent document&amp;quot;, a &amp;quot;child document&amp;quot;, and a &amp;quot;parent grant document&amp;quot;. It is not immediately clear what these three documents represent for a reissue patent. After some research, I think I have determined what these &amp;quot;documents&amp;quot; are, but please know that I do not have definitive proof, as there is little available information about these &amp;quot;documents&amp;quot; in regard to a reissue patent.&lt;br /&gt;
&lt;br /&gt;
It is possible that the &amp;quot;parent document&amp;quot; is the parent patent application - that is, the first patent application filed in regard to an invention. That would explain why reissue patents also have several fields related to a &amp;quot;child document&amp;quot; - a &amp;quot;child document&amp;quot; could be a child application, which is filed while a parent application is still pending. So in this case, a &amp;quot;child document&amp;quot; of a reissue patent would be an application regarding the same patent that was filed while the reissue application was still pending. A &amp;quot;child document&amp;quot; is either a continuation, divisional, or continuation-in-part application ([http://www.patenttrademarkblog.com/parent-and-child-patent-applications/ source]).&lt;br /&gt;
&lt;br /&gt;
Based on this logic, I think that the difference between a &amp;quot;parent document&amp;quot; and a &amp;quot;parent grant document&amp;quot; depends on whether the parent patent application has been granted. If the parent patent application is still pending (meaning the patent has not yet been granted), then I believe the reissue patent will store information about the parent patent under &amp;quot;parent document&amp;quot;. However, if the parent patent application has been granted, then the information will be stored under &amp;quot;parent grant document&amp;quot;. This seems like the most logical explanation, especially considering that the path to any field related to &amp;quot;parent grant document&amp;quot; contains &amp;quot;parent document&amp;quot;, as in the example below:&lt;br /&gt;
&lt;br /&gt;
&amp;gt; us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
&lt;br /&gt;
=====Fields=====&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* parent_doc_status (varchar(255)) status of the parent application. Probably related to whether the patent application is pending or not&lt;br /&gt;
* parent_doc_number (int) probably application number for the parent document&lt;br /&gt;
* parent_doc_id (varchar(255)) it is unclear how this is different from parent document number. In the xpaths, the path will be something like ./parent_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent document&amp;quot; is - or may refer to its purpose (i.e. reissue, continuation, continuation-in-part, etc.)&lt;br /&gt;
* parent_doc_country (varchar(255)) the country of origin of the parent document&lt;br /&gt;
* parent_doc_date (date) probably the date the parent document was filed&lt;br /&gt;
* parent_grant_doc_number (int) probably application number for the parent grant document&lt;br /&gt;
* parent_grant_doc_id (varchar(255)) it is unclear how this is different from parent grant document number. In the xpaths, the path will be something like ./parent_doc/parent-grant-doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_grant_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent grant document&amp;quot; is, or may refer to its purpose (i.e. reissue, continuation, continuation-in-part, etc.)&lt;br /&gt;
* parent_grant_doc_country (varchar(255)) the country of origin of the parent grant document&lt;br /&gt;
* parent_grant_doc_date (date) probably the date the parent grant document was filed&lt;br /&gt;
* child_doc_number&lt;br /&gt;
&lt;br /&gt;
====DESIGN====&lt;br /&gt;
&lt;br /&gt;
==Connecting Patent database and Assignment database==&lt;br /&gt;
The answer to connecting the Patent database to the Assignment database lies in using the information in the table DOCUMENT_INFO to connect each assignment in the Assignment database ASSIGNMENT table to a patent_id from the Patent database PATENT table. On further investigation into &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot;, the field they called &amp;quot;grant_doc_num&amp;quot; (grant_num above) in DOCUMENT_INFO is the patent number (the description of DOCUMENT_INFO above has been altered to reflect this).&lt;br /&gt;
&lt;br /&gt;
The paper mentions that there will be errors in the patent number stored in DOCUMENT_INFO under grant_doc_num, so a separate table called DOCUMENT_INFO_ADMIN was constructed to determine how prevalent the errors were. The authors queried the patent number for each appno_doc_num from administrative data. The &amp;quot;appno_doc_num&amp;quot; is the Application Document USPTO number from DOCUMENT_INFO. On further investigation into the paper, it became clear that &amp;quot;administrative data&amp;quot; refers to internal USPTO data that was available to the authors of the paper, but that we might not have. However, in 99% of cases the grant_doc_num and the queried patent number based on the appno_doc_num were the same, so we can probably rely on grant_doc_num.&lt;br /&gt;
&lt;br /&gt;
Therefore, since every table in the Patent database will have the patent number stored, to connect any table in Assignment to the Patent database, it would first be joined with DOCUMENT_INFO on rf_id and then joined with the appropriate table in Patent on patent number.&lt;br /&gt;
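&lt;br /&gt;
The two-step join described above might look like the sketch below. It uses Python's built-in sqlite3 module with heavily trimmed tables; the choice of SQLite, the use of grant_num as the column name in DOCUMENT_INFO, and all sample rows are assumptions made for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE assignment (rf_id BIGINT PRIMARY KEY, record_date DATE);
    CREATE TABLE document_info (rf_id BIGINT PRIMARY KEY, grant_num VARCHAR(255));
    CREATE TABLE patent (patent_number VARCHAR(255) PRIMARY KEY, title VARCHAR(255));
    """
)

# Hypothetical data: two assignments that both concern the same patent.
conn.executemany(
    "INSERT INTO assignment VALUES (?, ?)",
    [(100, "2010-01-15"), (200, "2015-06-01")],
)
conn.executemany(
    "INSERT INTO document_info VALUES (?, ?)",
    [(100, "7000001"), (200, "7000001")],
)
conn.execute("INSERT INTO patent VALUES (?, ?)", ("7000001", "Example widget"))

# Step 1: join the Assignment table to DOCUMENT_INFO on rf_id.
# Step 2: join DOCUMENT_INFO to the Patent table on the patent number.
joined = conn.execute(
    """
    SELECT a.rf_id, a.record_date, p.patent_number, p.title
    FROM assignment AS a
    JOIN document_info AS d ON d.rf_id = a.rf_id
    JOIN patent AS p ON p.patent_number = d.grant_num
    ORDER BY a.rf_id
    """
).fetchall()
print(joined)
```
&lt;br /&gt;
Because these are inner joins, any assignment whose recorded patent number is wrong or missing from PATENT silently drops out of the result, so counting unmatched rows would be one way to gauge the error rate mentioned in the paper.&lt;br /&gt;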
&lt;br /&gt;
The paper also mentions Assignments on the Web for Patents (AOTW-P), a searchable database of individual USPTO assignment records keyed on reel-frame identification, patent number, and assignor or assignee name (https://assignment.uspto.gov/patent/index.html#/patent/search). It would not be practical to query every patent number by hand with this tool. If it is possible to write a script that queries each patent number using the rf_id and parses the response, that could be a useful check on the patent numbers, though the results might not be any more accurate than what will already be in DOCUMENT_INFO.&lt;br /&gt;
&lt;br /&gt;
==Finding New Paths Unique to Plant, Reissue, and/or Design Patents==&lt;br /&gt;
&lt;br /&gt;
Based on Oliver's script, which searched all xpaths and compared which were unique to particular types, we see that the following attributes are unique to each type of patent other than utility patents. These attributes vary by XML version, which changed over time. Therefore, the lists below are a superset of the attributes that are unique to each of the patent types listed below across all XML versions.&lt;br /&gt;
&lt;br /&gt;
When I am done finding the supersets, we will determine how to integrate these attributes into the design for the patent table.&lt;br /&gt;
&lt;br /&gt;
'''There appears to be a fifth patent type in the results from Oliver's script, and there are attributes that are unique to Utility patents, though they do not appear particularly useful'''&lt;br /&gt;
&lt;br /&gt;
===Plant Patents===&lt;br /&gt;
* US Claim Statement (XML45) - unclear what this represents&lt;br /&gt;
* Latin Name (XML44)&lt;br /&gt;
* US Botanical Variety (XML44)&lt;br /&gt;
&lt;br /&gt;
===Reissue Patents===&lt;br /&gt;
* Parent Status (XML45)&lt;br /&gt;
* Parent Document (XML44) - seems to be either a normal document or a parent grant document (or both), both contain the following fields&lt;br /&gt;
** Date&lt;br /&gt;
** Document number&lt;br /&gt;
** Country &lt;br /&gt;
** Document ID&lt;br /&gt;
** Kind (XML42)&lt;br /&gt;
* Child Document (XML44)&lt;br /&gt;
** Document ID&lt;br /&gt;
** Document Number&lt;br /&gt;
** Country (XML43)&lt;br /&gt;
* Continuing Reissue (XML44)&lt;br /&gt;
** Document ID&lt;br /&gt;
** Relation&lt;br /&gt;
** Document Number&lt;br /&gt;
&lt;br /&gt;
Note about fields related to the parent document and/or parent grant document:&lt;br /&gt;
The structure of the above fields related to a reissue patent is very odd and varies by XML version. For example, for most of the XML versions, parent grant document falls under parent document and has its own document number, date, country, document ID, kind, etc. The parent doc field falls under relation, which falls under reissue.&lt;br /&gt;
&lt;br /&gt;
However, in XML41, parent document falls under relation and a new category - US Reexamination Reissue Merger. The information pertaining to the child document also falls under both relation and US Reexamination Reissue Merger. It's possible that during the time period this XML version represents, there were two types of reissue patents - a reexamination related to a merger and a standard reissue. With discrepancies like these, we'll have to determine in the end whether we want to stick with the fields as they are designed above (in which case the information about a parent doc or parent grant document is stored the same way whether it is a reissue or a Reexamination Reissue), or whether these two classifications are different enough to warrant duplicate fields in the table.&lt;br /&gt;
&lt;br /&gt;
===Design Patents===&lt;br /&gt;
* length of grant (XML45)&lt;br /&gt;
* Hague Agreement Data (XML45) - allows people to file design patents in 66 countries with one application&lt;br /&gt;
** International Registration Date&lt;br /&gt;
** International Filing Date&lt;br /&gt;
** International Registration Publication Date&lt;br /&gt;
** International Registration Number&lt;br /&gt;
* Classification Locarno (XML40)&lt;br /&gt;
** Edition&lt;br /&gt;
** Main Classification&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please note that the lists above are meant to represent unique fields that should be added to the patent database. This does not mean that every XML version contains all these fields, or that every version contains the same path to these fields. That will have to be determined separately. The XML number next to some of the items is meant to represent the latest XML version where the field was seen.&lt;br /&gt;
&lt;br /&gt;
Though I was mainly looking at unique fields in plant, reissue, and design patents, I noticed the following fields on unique paths for utility patents in XML41:&lt;br /&gt;
&lt;br /&gt;
* US Related Documents&lt;br /&gt;
** Substitution&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Parent Document&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Country&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
***** Parent Status&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Country&lt;br /&gt;
** Continuation/Continuation in Part&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Doc&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
** Division&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Kind&lt;br /&gt;
&lt;br /&gt;
This appears to resemble the information about reissue patents. Additionally, a couple of fields appear under the information about bibliographic data that might be useful:&lt;br /&gt;
* US Deceased Inventor (XML41)&lt;br /&gt;
** Post Code&lt;br /&gt;
** Address&lt;br /&gt;
* US Provisional Application Status (XML41)&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=21778</id>
		<title>Redesign Assignment and Patent Database</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=21778"/>
		<updated>2017-11-10T21:58:01Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* REISSUE */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Redesign Assignment and Patent Database&lt;br /&gt;
|Has owner=Shelby Bice,&lt;br /&gt;
|Has start date=9/2017&lt;br /&gt;
|Has keywords=patent&lt;br /&gt;
|Is dependent on=Reproducible Patent Data,&lt;br /&gt;
}}&lt;br /&gt;
'''FOR ED:''' I finished going through the unique paths (you can see the unique fields I identified at the bottom of this page, under &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot;). Then, under my design for the Patent Database, there's a new section called &amp;quot;Unique Attributes Tables&amp;quot; where I've begun detailing a design for three new tables that would include information that is unique to Design, Reissue, and Plant patents. Here are some questions I have:&lt;br /&gt;
&lt;br /&gt;
'''1.''' Do you like the design for the new tables? How do you think we should populate these tables? I can think of one way (putting all the information in the main patent tables and then moving it out to Plant, Reissue, or Design as needed) but I was wondering if there might be an easier way.&lt;br /&gt;
&lt;br /&gt;
'''2.''' There were some unique attributes to utility patents that I also included in &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot;. They only appear to be in two of the oldest XML versions (4.1 and 4.0) and I'm not sure the information is particularly useful - but please take a look at the field names and let me know if you disagree and think they are useful and should be included in a table in the patent database.&lt;br /&gt;
&lt;br /&gt;
'''3.''' Do you know what some of these fields represent? For example, I've been trying to research online what a &amp;quot;Parent Document&amp;quot; and/or &amp;quot;Parent Grant Document&amp;quot; might represent for a reissue patent, and I've found a couple possible options. I will keep researching, but if you know what they represent or one of my descriptions for a field is wrong, please let me know.&lt;br /&gt;
&lt;br /&gt;
This is an extension of the work I did last semester under &amp;quot;Redesigning Patent Database&amp;quot;. Instead of simply reconfiguring the existing database, this project encompasses a full redesign and creation of a new Patent database and a new Assignment database that will be joined together.&lt;br /&gt;
&lt;br /&gt;
Reference for definitions of fields in the two databases: https://www.oecd.org/sti/sci-tech/37569498.pdf&lt;br /&gt;
&lt;br /&gt;
Paper on how to extract data from XML files, parse it, and put it into a database: https://funginstitute.berkeley.edu/wp-content/uploads/2014/06/patentprocessor.pdf&lt;br /&gt;
&lt;br /&gt;
'''Particularly useful, because it provides solutions for cleaning location and firm data'''&lt;br /&gt;
&lt;br /&gt;
==ER Diagram for Assignment and Patent Databases==&lt;br /&gt;
&lt;br /&gt;
Note: this is currently out of sync with the current description below for the patent database. It will be updated once I have finalized how we will organize unique attributes for plant, design, and reissue patents.&lt;br /&gt;
&lt;br /&gt;
[[File:PatentAndAssignmentER2.png]]&lt;br /&gt;
&lt;br /&gt;
Attributes for each table are listed below with descriptions - because of how many attributes there are, I decided the ER diagram would be better suited as an overview of the tables rather than trying to show all the attributes on the diagram.&lt;br /&gt;
&lt;br /&gt;
==Assignment Database Structure==&lt;br /&gt;
&lt;br /&gt;
After reading through the paper &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot; (https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf) and taking into account our needs for the Assignment database (the ability to connect the two databases together, the ability to trace conveyance of a patent over time), I decided on the following structure for the Assignment database.&lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table in the database is rf_id, which stands for reel frame id. It is unique for each entry in the main table, ASSIGNMENT, and is included in every table.&lt;br /&gt;
&lt;br /&gt;
Conveyance is the only table that cannot be fully constructed as data is inserted into the database. Some of the fields depend on other tables being constructed, so it may either be partially populated while the other tables are being populated or be the last table to be populated.&lt;br /&gt;
&lt;br /&gt;
Full list of tables: assignee, assignor, document_info, assignment, conveyance&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: store all the assignees from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the recipient of a patent in an entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id, unless it is possible to have multiple assignees, then rf_id and assignee name&lt;br /&gt;
&lt;br /&gt;
All columns (variables) in the table: &lt;br /&gt;
&lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* address_line_1 (varchar(255)) first line of address of assignee&lt;br /&gt;
* address_line_2 (varchar(255)) second line of address of assignee&lt;br /&gt;
* city (varchar(255)) city of assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* country (varchar(255)) country of assignee&lt;br /&gt;
* postal_code (varchar(255)) post code of the assignee&lt;br /&gt;
&lt;br /&gt;
===ASSIGNOR===&lt;br /&gt;
	&lt;br /&gt;
Purpose: store all the assignors from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the one assigning/granting the license for the entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id and assignor_name&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
&lt;br /&gt;
===CONVEYANCE===&lt;br /&gt;
Purpose: represent the conveyance of a particular patent over time&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual transaction for a patent (not unlike assignment, except this table has information about the assignment in relation to other assignments for the patent)&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id &lt;br /&gt;
&lt;br /&gt;
All columns (variables): Two possibilities, depending on whether we want data to be repeated or not.&lt;br /&gt;
	&lt;br /&gt;
Not-repeated (with exception of primary/foreign key): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* next_transaction (bigint) rf_id of next transaction that occurred after this transaction (traced by assignee/assignor)&lt;br /&gt;
* num_conveyance (int) number in order of conveyance by date&lt;br /&gt;
* acknowledgment_date (date) date that the assignor acknowledged the transaction&lt;br /&gt;
* execution_date (date) date that the transaction took place&lt;br /&gt;
* conveyance_type (varchar(255)) the USPTO paper determined a couple of different conveyance types and also outlined how to determine based on the conveyance text in the xml file which conveyance type is relevant to the transaction. The conveyance types are:&lt;br /&gt;
** assignment&lt;br /&gt;
** merger&lt;br /&gt;
** change of name&lt;br /&gt;
** government interest&lt;br /&gt;
** agreement&lt;br /&gt;
** security agreement&lt;br /&gt;
** release&lt;br /&gt;
&lt;br /&gt;
	&lt;br /&gt;
Repeated, also include: &lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
* patent_number (varchar(255)) identifying number for the patent&lt;br /&gt;
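&lt;br /&gt;
One possible way to fill in num_conveyance and next_transaction once the other tables exist is sketched below in plain Python. It assumes that ordering a patent's transactions by execution_date is an acceptable stand-in for tracing the chain by assignee and assignor, which is a simplification; the input tuples and rf_id values are made up for the example.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict
from datetime import date

# Hypothetical input: (rf_id, patent_number, execution_date) per transaction.
transactions = [
    (300, "7000001", date(2015, 6, 1)),
    (100, "7000001", date(2010, 1, 15)),
    (200, "7000002", date(2012, 3, 9)),
]

# Group the transactions by patent so each chain is ordered separately.
by_patent = defaultdict(list)
for rf_id, patent_number, execution_date in transactions:
    by_patent[patent_number].append((execution_date, rf_id))

conveyance = {}
for patent_number, items in by_patent.items():
    items.sort()  # oldest execution_date first
    for position, (execution_date, rf_id) in enumerate(items):
        is_last = position + 1 == len(items)
        conveyance[rf_id] = {
            # 1-based position of this transaction in the patent's history
            "num_conveyance": position + 1,
            # rf_id of the following transaction, or None for the latest one
            "next_transaction": None if is_last else items[position + 1][1],
        }

print(conveyance)
```
&lt;br /&gt;
Since this needs every transaction for a patent to be present before the order is known, it fits the earlier remark that CONVEYANCE may have to be the last table populated.&lt;br /&gt;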
&lt;br /&gt;
===DOCUMENT_INFO===&lt;br /&gt;
Purpose: store extra information relevant to the patents represented in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the documentation and the patent for a particular entry in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables) when created: &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* app_num (varchar(255)) application document USPTO number &lt;br /&gt;
* app_date (date) date that the application was filed&lt;br /&gt;
* app_country (varchar(255)) country in which it was filed&lt;br /&gt;
* pgpub_num (varchar(255)) pre-grant publication document USPTO number&lt;br /&gt;
* pgpub_date (date) date that pre-grant publication was released&lt;br /&gt;
* pgpub_country (varchar(255)) country in which the patent is published before being granted a patent number&lt;br /&gt;
* patent_number (varchar(255)) granted patent document USPTO number&lt;br /&gt;
* grant_date (date) date that the patent is officially published&lt;br /&gt;
* grant_country (varchar(255)) country in which the patent is published after being granted a patent number&lt;br /&gt;
* invention_title (varchar(255)) title of the invention&lt;br /&gt;
* language (varchar(255)) language of invention title – could be potentially useful and interesting to investigate&lt;br /&gt;
* reel_num (bigint) number of the reel the assignment was stored on&lt;br /&gt;
* frame_num (bigint) number of the frame on the reel the assignment was stored on&lt;br /&gt;
&lt;br /&gt;
===ASSIGNMENT===&lt;br /&gt;
Purpose: represent assignment transactions – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent assignment transaction&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignment_id (bigint) US patent assignment id &lt;br /&gt;
* correspondent_name (varchar(255)) name of correspondent for the transaction, could be a lawyer&lt;br /&gt;
* correspondent_address1 (varchar(255)) first line of address for correspondent&lt;br /&gt;
* correspondent_address2 (varchar(255)) second line of address for correspondent&lt;br /&gt;
* correspondent_address3 (varchar(255)) third line of address for correspondent&lt;br /&gt;
* correspondent_address4 (varchar(255)) fourth line of address for correspondent&lt;br /&gt;
* record_date (date) date recorded with USPTO&lt;br /&gt;
* last_update_date (date) date that information for this assignment was last updated with the USPTO&lt;br /&gt;
* page_cnt (bigint) page count of assignment record&lt;br /&gt;
&lt;br /&gt;
More extensive notes exist under :E/McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper&lt;br /&gt;
&lt;br /&gt;
==Patent Database Structure==&lt;br /&gt;
&lt;br /&gt;
Most of this design is based off what I investigated last semester and therefore is subject to change, since the design for the Assignment database is somewhat new. &lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table is patent_number, which uniquely identifies every patent in the PATENT table and is included in every table.&lt;br /&gt;
&lt;br /&gt;
Full list of tables: patent, assignee, citation, fee, claims, histpatent, lawyers, inventors&lt;br /&gt;
&lt;br /&gt;
Update: These tables, as of October 26, are being updated and changed as I determine what data can actually be retrieved from the XML files based on the xpaths that we have determined (you can see these xpaths on the &amp;quot;Equivalent XPath and APS Queries&amp;quot; project page).&lt;br /&gt;
&lt;br /&gt;
===PATENT===&lt;br /&gt;
Purpose: represent patents – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* patent_type (varchar(255)) the type of patent, like utility&lt;br /&gt;
* patent_kind (varchar(2)) the kind of patent, a letter and a number&lt;br /&gt;
* title (varchar(255)) the title of the patent&lt;br /&gt;
* grantdate (date) date that patent was officially granted including the year&lt;br /&gt;
* prioritydate (date) date used to establish novelty of an invention in regard to other inventions (source: http://www.bios.net/daisy/patentlens/2343.html)&lt;br /&gt;
* prioritycountry (varchar(255)) country where patent is first filed&lt;br /&gt;
* patent_country (varchar(255)) country in which this patent was published&lt;br /&gt;
* prioritypatentnumber (varchar(255)) placeholder patent number if the patent is published before it is fully approved - also referred to as priority claims number&lt;br /&gt;
* cpcsubclass (varchar(255)) cooperative patent classification subclass&lt;br /&gt;
* cpcsubgroup (varchar(255)) cooperative patent classification subgroup&lt;br /&gt;
* cpcmaingroup (varchar(255)) cooperative patent classification main group&lt;br /&gt;
* cpctotal (varchar(255)) concatenated version of cpcmaingroup, cpcsubgroup, cpcsubclass&lt;br /&gt;
* pctpatentnumber (varchar(255)) international patent number (according to Patent Cooperation Treaty) also known as PCT Document Number&lt;br /&gt;
* appnum (varchar(255)) application number, probably also the filing number - the format is a two digit series code followed by a six digit serial number (source: https://www.uspto.gov/patents-application-process/filing-online/info-application-number)&lt;br /&gt;
* appdate (date) date that the application was filed, including year - probably also the filing date&lt;br /&gt;
* ipcrsubclass (varchar(255)) International patent classification subclass&lt;br /&gt;
* ipcrmaingroup (varchar(255)) International patent classification main group&lt;br /&gt;
* ipcrsubgroup (varchar(255)) International patent classification subgroup&lt;br /&gt;
* ipcrtotal (varchar(255)) concatenated version of ipcrsubgroup, ipcrsubclass, ipcrmaingroup&lt;br /&gt;
* national_classification (varchar(255)) probably uspc, which is United States Patent Classification&lt;br /&gt;
* national_classification_country (varchar(255)) should be United States&lt;br /&gt;
* num_claims (int) number of claims (things that the inventor wishes to protect with this patent)&lt;br /&gt;
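&lt;br /&gt;
The application number format noted for appnum above (a two digit series code followed by a six digit serial number) can be split with a small helper like the one below. The function name split_application_number and the decision to ignore punctuation such as slashes and commas are assumptions made for this sketch; the sample number is made up.&lt;br /&gt;
&lt;br /&gt;
```python
import re


def split_application_number(appnum):
    """Split an application number into (series_code, serial_number).

    Hypothetical helper: strips any non-digit characters, then expects
    exactly eight digits (two for the series code, six for the serial).
    """
    digits = re.sub(r"\D", "", appnum)
    if len(digits) != 8:
        raise ValueError(f"expected 8 digits, got {len(digits)}: {appnum!r}")
    return digits[:2], digits[2:]


series_code, serial_number = split_application_number("12/345,678")
print(series_code, serial_number)
```
&lt;br /&gt;
A check like this could flag appnum values that do not fit the expected format before they are loaded into the PATENT table.&lt;br /&gt;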
&lt;br /&gt;
Discrepancies: &lt;br /&gt;
These fields were previously a part of the table, but we do not currently have xpaths for them:&lt;br /&gt;
app_year - the year the application was filed&lt;br /&gt;
grant_year - the year the patent was granted&lt;br /&gt;
uspc - United States Patent Classification &lt;br /&gt;
(might be the same as national_classification)&lt;br /&gt;
uspcsub - United States Patent subgroup&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: Represent the person to whom the patent was assigned at publication, according to the patent information (the information in the Assignment database is more complete)&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the assignee for a patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number, first_name, last_name (in case it is possible for there to be multiple assignees for an individual patent)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* first_name (varchar(255)) first name of assignee&lt;br /&gt;
* last_name (varchar(255)) last name of assignee&lt;br /&gt;
* address (varchar(255)) address of assignee&lt;br /&gt;
* postcode (varchar(255)) postcode of assignee&lt;br /&gt;
* orgname (varchar(255)) name of the organization owning the patent (if applicable - organization may go by multiple orgnames) &lt;br /&gt;
* city (varchar(255)) city of the assignee&lt;br /&gt;
* country (varchar(255)) country of the assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* residence (varchar(255)) mailing address of assignee if they receive mail at a different address than the one listed above&lt;br /&gt;
* size_of_firm (varchar(255)) size of the organization (large, small, or micro) - determines how much they pay in fees for the patent&lt;br /&gt;
&lt;br /&gt;
===CITATION===&lt;br /&gt;
Purpose: Represent all the patents in PATENT that another patent in PATENT cites&lt;br /&gt;
&lt;br /&gt;
Each entry represents: one citation that a patent in PATENT makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number AND cited_patent_number (because a patent might cite multiple patents, so both variables are needed for the primary key)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* cited_patent_number (varchar(255)) unique identifier from the USPTO for the patent being cited&lt;br /&gt;
* description (varchar(255)) description of the citation&lt;br /&gt;
* citation_date (date) unclear what this date is supposed to represent from the DTD, but according to the paper that is the second link in the top section, it is the &amp;quot;date of cited document&amp;quot;&lt;br /&gt;
* citation_kind (varchar(255)) type of cited document&lt;br /&gt;
* citation_country (varchar(255)) country of origin for the cited patent&lt;br /&gt;
* citation_name (varchar(255)) appears to be the names of those to whom the cited patent is assigned&lt;br /&gt;
&lt;br /&gt;
Note: There are other fields available for this table, such as document number, but this seems like it might be a repeat of the document number for the citing patent&lt;br /&gt;
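&lt;br /&gt;
The composite primary key described above can be sketched with Python's built-in sqlite3 module. This is only an illustration, not Oliver's code: the table is trimmed to the two key columns and the patent numbers are made-up sample values.&lt;br /&gt;

```python
import sqlite3

# In-memory database used purely for illustration.
conn = sqlite3.connect(':memory:')
conn.execute('''
    CREATE TABLE citation (
        patent_number       VARCHAR(255),
        cited_patent_number VARCHAR(255),
        PRIMARY KEY (patent_number, cited_patent_number)
    )
''')

# One citing patent may cite several patents (made-up numbers).
conn.execute('INSERT INTO citation VALUES (?, ?)', ('9000001', '8000001'))
conn.execute('INSERT INTO citation VALUES (?, ?)', ('9000001', '8000002'))

# Repeating the same (citing, cited) pair violates the composite key.
try:
    conn.execute('INSERT INTO citation VALUES (?, ?)', ('9000001', '8000001'))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

count = conn.execute('SELECT COUNT(*) FROM citation').fetchone()[0]
print(duplicate_rejected, count)
```

With patent_number alone as the key, the second insert would already fail, which is why both columns are needed.&lt;br /&gt;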
&lt;br /&gt;
===FEE===&lt;br /&gt;
Purpose: Represent information about the fees paid for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the fees paid on the patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* four (boolean) payment of maintenance fee 4th year&lt;br /&gt;
* eight (boolean) payment of maintenance fee 8th year&lt;br /&gt;
* twelve (boolean) payment of maintenance fee 12th year&lt;br /&gt;
* fee_date (date) date that fee is paid on&lt;br /&gt;
* fee_code (varchar(255)) code for type of fee to pay&lt;br /&gt;
&lt;br /&gt;
===CLAIMS===&lt;br /&gt;
&lt;br /&gt;
Purpose: Represent information about the claims for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a claim that a patent makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number and claim_id&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* claim_id (varchar(255)) id number for the claim&lt;br /&gt;
* language (varchar(255)) language of claim&lt;br /&gt;
* claim_type (varchar(255)) type of claim&lt;br /&gt;
* status (varchar(255)) status of the claim&lt;br /&gt;
* claim_text (text) the text of the claim&lt;br /&gt;
&lt;br /&gt;
===HISTPATENT===&lt;br /&gt;
Purpose: Represent historical data about the patent pertaining to dates and publication&lt;br /&gt;
&lt;br /&gt;
Each entry represents: sequence of historical data about an individual patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* pubno (varchar(255)) patent publication number&lt;br /&gt;
* pubdate (date) date patent was issued&lt;br /&gt;
* dispdate (date) disposal date for the application&lt;br /&gt;
* distype (varchar(3)) application status of the patent - ABN for abandoned, ISS for issued, or PEN for pending&lt;br /&gt;
* exp_dt (date) expiration date of patent&lt;br /&gt;
* pta (int) length of patent term adjustment (which extends the amount of time the patent is allowed to be in force) in number of days&lt;br /&gt;
&lt;br /&gt;
===INVENTORS===&lt;br /&gt;
Purpose: Represent the inventors for each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about an inventor for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (and name, if there can be multiple inventors for a single patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identifies the inventor?&lt;br /&gt;
* name (varchar(255)) concatenated string of the first and last name of the inventor&lt;br /&gt;
* orgname (varchar(255)) name of the organization the inventor works for&lt;br /&gt;
* state (varchar(255)) state in which the inventor resided&lt;br /&gt;
* address (varchar(255)) address at which the inventor resided&lt;br /&gt;
* country (varchar(255)) country in which the inventor resided&lt;br /&gt;
* city (varchar(255)) city in which the inventor resided&lt;br /&gt;
&lt;br /&gt;
===LAWYERS===&lt;br /&gt;
Purpose: Represent the lawyers who worked on each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a lawyer for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (there doesn't appear to be more than one lawyer entry for each patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identifies the lawyer?&lt;br /&gt;
* org_name (varchar(255)) the name of the firm that did arbitration for the patent&lt;br /&gt;
&lt;br /&gt;
===Unique Attributes Tables===&lt;br /&gt;
&lt;br /&gt;
Several attributes have been identified that are unique, applying to only one kind of patent. Adding all of these fields to the main patent table would introduce a lot of NULL entries, because three out of the four patent types will not have any particular attribute. Therefore, I have created three new tables to store these unique attributes - one for each type of patent with unique attributes.&lt;br /&gt;
&lt;br /&gt;
For each table, the primary key is the patent_number, which connects a row to a row in the main patent table. &lt;br /&gt;
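&lt;br /&gt;
A minimal sketch of this layout, using Python's built-in sqlite3 module and the two botanical fields as the example. The table name plant, the trimmed column lists, and the sample rows are assumptions made for illustration; they are not taken from the existing code.&lt;br /&gt;

```python
import sqlite3

# Hypothetical, trimmed-down tables; column lists are shortened for illustration.
conn = sqlite3.connect(':memory:')
conn.executescript('''
CREATE TABLE patent (
    patent_number VARCHAR(255) PRIMARY KEY,
    patent_type   VARCHAR(255)
);
CREATE TABLE plant (
    patent_number        VARCHAR(255) PRIMARY KEY REFERENCES patent(patent_number),
    latin_name           VARCHAR(255),
    us_botanical_variety VARCHAR(255)
);
''')

# Made-up sample rows: one plant patent and one utility patent.
conn.executemany('INSERT INTO patent VALUES (?, ?)',
                 [('PP00001', 'plant'), ('1234567', 'utility')])
conn.execute('INSERT INTO plant VALUES (?, ?, ?)',
             ('PP00001', 'Rosa hybrida', 'Hybrid tea rose'))

# A LEFT JOIN keeps every patent; only plant patents get non-NULL plant fields.
rows = conn.execute('''
    SELECT p.patent_number, p.patent_type, pl.latin_name
    FROM patent AS p
    LEFT JOIN plant AS pl ON pl.patent_number = p.patent_number
    ORDER BY p.patent_number
''').fetchall()
print(rows)
```

The NULLs only appear in the query result when a join asks for them, rather than being stored in every row of the main patent table.&lt;br /&gt;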
&lt;br /&gt;
TODO: Figure out best way to generate these tables, finish description of attributes for each table.&lt;br /&gt;
&lt;br /&gt;
* Note that I did include some unique attributes for Utility patents as seen below in the &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot; section, but for now (11/3/2017) I'm not going to create a separate table for those fields. Most of them seem to be repeated in Reissue, and the other fields, while interesting, are perhaps not very useful.&lt;br /&gt;
&lt;br /&gt;
====DESIGN====&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* latin_name (varchar(255)) latin name for the plant&lt;br /&gt;
* us_botanical_variety (varchar(255)) denotes what variety of plant it is - for example, a rose has several different varieties&lt;br /&gt;
&lt;br /&gt;
====REISSUE====&lt;br /&gt;
&lt;br /&gt;
In the following table, you'll notice lots of fields related to three different kinds of &amp;quot;documents&amp;quot; - a &amp;quot;parent document&amp;quot;, a &amp;quot;child document&amp;quot;, and a &amp;quot;parent grant document&amp;quot;. It is not immediately clear what these three documents represent for a reissue patent. After some research, I think I have determined what these &amp;quot;documents&amp;quot; are, but please know that I do not have definitive proof, as there is little available information about these &amp;quot;documents&amp;quot; with regard to a reissue patent.&lt;br /&gt;
&lt;br /&gt;
It is possible that the &amp;quot;parent document&amp;quot; is the parent patent application - that is, the first patent application filed for an invention. That would explain why reissue patents also have several fields related to a &amp;quot;child document&amp;quot; - a &amp;quot;child document&amp;quot; could be a child application, which is filed while a parent application is still pending. So in this case, a &amp;quot;child document&amp;quot; of a reissue patent would be an application regarding the same patent that was filed while the reissue application was still pending. A &amp;quot;child document&amp;quot; is either a continuation, divisional, or continuation-in-part application ([http://www.patenttrademarkblog.com/parent-and-child-patent-applications/ source]).&lt;br /&gt;
&lt;br /&gt;
Based on this logic, I think that the difference between a &amp;quot;parent document&amp;quot; and a &amp;quot;parent grant document&amp;quot; depends on whether the parent patent application has been granted. If the parent patent application is still pending (meaning the patent has not yet been granted), then I believe the reissue patent will store information about the parent patent under &amp;quot;parent document&amp;quot;. However, if the parent patent application has been granted, then the information will be stored under &amp;quot;parent grant document&amp;quot;. This seems like the most logical explanation, especially considering that the path to any field related to &amp;quot;parent grant document&amp;quot; contains &amp;quot;parent document&amp;quot;, as in the example below:&lt;br /&gt;
&lt;br /&gt;
&amp;gt; us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
&lt;br /&gt;
=====Fields=====&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* parent_doc_status (varchar(255)) status of the parent document. It is unclear whether this refers to a parent reissue application (if there happen to be multiple applications relating to the reissue of this patent) or the patent that is being reissued. This goes for all fields beginning &amp;quot;parent_doc&amp;quot;&lt;br /&gt;
* parent_doc_number (int) number for the parent document&lt;br /&gt;
* parent_doc_id (varchar(255)) it is unclear how this is different from parent document number. In the xpaths, the path will be something like ./parent_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent document&amp;quot; is - I will need to look in an actual XML file to see what this field looks like&lt;br /&gt;
* parent_doc_country (varchar(255)) the country of the parent document&lt;br /&gt;
* parent_doc_date (date) probably the date the parent document was published&lt;br /&gt;
 &lt;br /&gt;
In addition to parent document, there is a subcategory called parent_grant_document that has the same fields (minus status). This is probably a patent that has been granted, given that the field names include grant, but I am not sure. Again, I am not sure what all of these fields represent.&lt;br /&gt;
&lt;br /&gt;
* parent_grant_doc_number (int) number for the parent grant document&lt;br /&gt;
* parent_grant_doc_id (varchar(255)) it is unclear how this is different from parent grant document number. In the xpaths, the path will be something like ./parent_doc/parent-grant-doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_grant_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent grant document&amp;quot; is - I will need to look in an actual XML file to see what this field looks like&lt;br /&gt;
* parent_grant_doc_country (varchar(255)) the country of the parent grant document&lt;br /&gt;
* parent_grant_doc_date (date) probably the date the parent grant document was published&lt;br /&gt;
&lt;br /&gt;
====PLANT====&lt;br /&gt;
&lt;br /&gt;
==Connecting Patent database and Assignment database==&lt;br /&gt;
The answer to connecting the Patent database to the Assignment database lies somewhere in using the information in the table DOCUMENT_INFO to connect to a patent_id from the Patent database PATENT table for each assignment in the Assignment database ASSIGNMENT table. On further investigation into &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot;, the field they called &amp;quot;grant_doc_num&amp;quot; (grant_num above) in DOCUMENT_INFO is the patent number (the description of DOCUMENT_INFO above has been altered to reflect this).&lt;br /&gt;
&lt;br /&gt;
The paper mentions that there will be errors in the patent number stored in DOCUMENT_INFO under grant_doc_num, so a separate table called DOCUMENT_INFO_ADMIN was constructed to determine how prevalent errors were. They queried the patent number for each appno_doc_num from administrative data. The &amp;quot;appno_doc_num&amp;quot; is the Application Document USPTO number from DOCUMENT_INFO. On further investigation into the paper, it became clear that &amp;quot;administrative data&amp;quot; refers to internal USPTO data that was available to the authors of the paper, but that we might not have. However, in 99% of cases the grant_doc_num and the queried patent number based on the appno_doc_num were the same, so we can probably rely on grant_doc_num.&lt;br /&gt;
&lt;br /&gt;
Therefore, since every table in the Patent database will have the patent number stored, to connect any table in Assignment to the Patent database, it would first be joined with DOCUMENT_INFO on rf_id and then joined with the appropriate table in Patent on patent number.&lt;br /&gt;
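&lt;br /&gt;
The two-step join described above might look like the following sketch, written with Python's built-in sqlite3 module. The tables are cut down to the columns needed for the join, the function name patents_for_assignee is hypothetical, and all rows are made-up sample data.&lt;br /&gt;

```python
import sqlite3

# Trimmed-down versions of one Assignment table, DOCUMENT_INFO, and PATENT.
conn = sqlite3.connect(':memory:')
conn.executescript('''
CREATE TABLE assignee (rf_id BIGINT, assignee_name VARCHAR(255));
CREATE TABLE document_info (rf_id BIGINT, grant_doc_num VARCHAR(255));
CREATE TABLE patent (patent_number VARCHAR(255) PRIMARY KEY,
                     patent_type VARCHAR(255));
''')

# Made-up sample rows.
conn.executemany('INSERT INTO assignee VALUES (?, ?)',
                 [(1001, 'Example Corp'), (1002, 'Example Corp'), (1003, 'Other LLC')])
conn.executemany('INSERT INTO document_info VALUES (?, ?)',
                 [(1001, '1234567'), (1002, '7654321'), (1003, '1234567')])
conn.executemany('INSERT INTO patent VALUES (?, ?)',
                 [('1234567', 'utility'), ('7654321', 'design')])


def patents_for_assignee(conn, name):
    '''Join assignee to DOCUMENT_INFO on rf_id, then to PATENT on patent number.'''
    return conn.execute('''
        SELECT a.rf_id, d.grant_doc_num, p.patent_type
        FROM assignee AS a
        JOIN document_info AS d ON d.rf_id = a.rf_id
        JOIN patent AS p ON p.patent_number = d.grant_doc_num
        WHERE a.assignee_name = ?
        ORDER BY a.rf_id
    ''', (name,)).fetchall()


result = patents_for_assignee(conn, 'Example Corp')
print(result)
```

Because the join uses inner joins, any assignment whose grant_doc_num is wrong or missing simply drops out of the result, so the roughly 1% of mismatches mentioned above would show up as missing rows rather than errors.&lt;br /&gt;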
&lt;br /&gt;
Also in the paper, they mentioned the Assignments on the Web for Patents (AOTW-P), a searchable database of individual USPTO assignment records keyed on reel-frame identification, patent number, and assignor or assignee name (https://assignment.uspto.gov/patent/index.html#/patent/search). Obviously it would not be possible to query all the patent numbers by hand with this tool, but if it were possible to write a script that queries each patent number using the rf_id and parses the response, it could be useful for checking the patent numbers. It might not be any more accurate than what will already be in DOCUMENT_INFO, though.&lt;br /&gt;
&lt;br /&gt;
==Finding New Paths Unique to Plant, Reissue, and/or Design Patents==&lt;br /&gt;
&lt;br /&gt;
Based on Oliver's script, which searched all xpaths and compared which were unique to particular types, we see that the following attributes are unique to each type of patent other than utility patents. These attributes vary by the XML version, which changed over time. Therefore, the lists below are a superset of the attributes that are unique to each of the patent types listed below across all XML versions.&lt;br /&gt;
&lt;br /&gt;
When I am done finding the supersets, we will determine how to integrate these attributes into the design for the patent table.&lt;br /&gt;
&lt;br /&gt;
'''There appears to be a fifth patent type in the results from Oliver's script, and there are attributes that are unique to Utility patents, though they do not appear particularly useful'''&lt;br /&gt;
&lt;br /&gt;
===Plant Patents===&lt;br /&gt;
* US Claim Statement (XML45) - unclear what this represents&lt;br /&gt;
* Latin Name (XML44)&lt;br /&gt;
* US Botanical Variety (XML44)&lt;br /&gt;
&lt;br /&gt;
===Reissue Patents===&lt;br /&gt;
* Parent Status (XML45)&lt;br /&gt;
* Parent Document (XML44) - seems to be either a normal document or a parent grant document (or both), both contain the following fields&lt;br /&gt;
** Date&lt;br /&gt;
** Document number&lt;br /&gt;
** Country &lt;br /&gt;
** Document ID&lt;br /&gt;
** Kind (XML42)&lt;br /&gt;
* Child Document (XML44)&lt;br /&gt;
** Document ID&lt;br /&gt;
** Document Number&lt;br /&gt;
** Country (XML43)&lt;br /&gt;
* Continuing Reissue (XML44)&lt;br /&gt;
** Document ID&lt;br /&gt;
** Relation&lt;br /&gt;
** Document Number&lt;br /&gt;
&lt;br /&gt;
Note about fields related to the parent document and/or parent grant document:&lt;br /&gt;
The structure of the above fields related to a reissue patent is very odd and varies by XML version. For example, for most of the XML versions, parent grant document falls under parent document and has its own document number, date, country, document ID, kind, etc. The parent doc field falls under relation, which falls under reissue.&lt;br /&gt;
&lt;br /&gt;
However, in XML41, parent document falls under relation and a new category - US Reexamination Reissue Merger. The information pertaining to the child document also falls under both relation and US Reexamination Reissue Merger. It's possible that during the time period this XML version represents, there were two types of reissue patents - a reexamination related to a merger and a standard reissue. With discrepancies like these, we'll have to determine in the end whether we want to stick with the fields as they are designed above (in which case the information about a parent doc or parent grant document is stored the same way whether it is a reissue or a Reexamination Reissue), or whether these two classifications are different enough to warrant duplicate fields in the table.&lt;br /&gt;
&lt;br /&gt;
===Design Patents===&lt;br /&gt;
* length of grant (XML45)&lt;br /&gt;
* Hague Agreement Data (XML45) - allows people to file design patents in 66 countries with one application&lt;br /&gt;
** International Registration Date&lt;br /&gt;
** International Filing Date&lt;br /&gt;
** International Registration Publication Date&lt;br /&gt;
** International Registration Number&lt;br /&gt;
* Classification Locarno (XML40)&lt;br /&gt;
** Edition&lt;br /&gt;
** Main Classification&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please note that the lists above are meant to represent unique types that should be added to the patent database. This does not mean that every XML version contains all these fields, or that every version contains the same path to these fields. That will have to be determined separately. The XML number next to some of the items is meant to represent the latest XML version where the field was seen.&lt;br /&gt;
&lt;br /&gt;
Though I was mainly looking at unique fields in plant, reissue, and design patents, I noticed the following fields on unique paths for utility patents in XML41:&lt;br /&gt;
&lt;br /&gt;
* US Related Documents&lt;br /&gt;
** Substitution&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Parent Document&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Country&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
***** Parent Status&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Country&lt;br /&gt;
** Continuation/Continuation in Part&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Doc&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
** Division&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Kind&lt;br /&gt;
&lt;br /&gt;
This appears to resemble the information about reissue patents. Additionally, a couple of fields appear under the information about bibliographic data that might be useful:&lt;br /&gt;
* US Deceased Inventor (XML41)&lt;br /&gt;
** Post Code&lt;br /&gt;
** Address&lt;br /&gt;
* US Provisional Application Status (XML41)&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=21777</id>
		<title>Redesign Assignment and Patent Database</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=21777"/>
		<updated>2017-11-10T21:56:49Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* REISSUE */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Redesign Assignment and Patent Database&lt;br /&gt;
|Has owner=Shelby Bice,&lt;br /&gt;
|Has start date=9/2017&lt;br /&gt;
|Has keywords=patent&lt;br /&gt;
|Is dependent on=Reproducible Patent Data,&lt;br /&gt;
}}&lt;br /&gt;
'''FOR ED:''' I finished going through the unique paths (you can see the unique fields I identified at the bottom of this page, under &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot;). Then, under my design for the Patent Database, there's a new section called &amp;quot;Unique Attributes Tables&amp;quot; where I've begun detailing a design for three new tables that would include information that is unique to Design, Reissue, and Plant patents. Here are some questions I have:&lt;br /&gt;
&lt;br /&gt;
'''1.''' Do you like the design for the new tables? How do you think we should populate these tables? I can think of one way (putting all the information in the main patent table and then moving it out to Plant, Reissue, or Design as needed) but I was wondering if there might be an easier way.&lt;br /&gt;
&lt;br /&gt;
'''2.''' There were some unique attributes to utility patents that I also included in &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot;. They only appear to be in two of the oldest XML versions (4.1 and 4.0) and I'm not sure the information is particularly useful - but please take a look at the field names and let me know if you disagree and think they are useful and should be included in a table in the patent database.&lt;br /&gt;
&lt;br /&gt;
'''3.''' Do you know what some of these fields represent? For example, I've been trying to research online what a &amp;quot;Parent Document&amp;quot; and/or &amp;quot;Parent Grant Document&amp;quot; might represent for a reissue patent, and I've found a couple possible options. I will keep researching, but if you know what they represent or one of my descriptions for a field is wrong, please let me know.&lt;br /&gt;
&lt;br /&gt;
This is an extension of the work I did last semester under &amp;quot;Redesigning Patent Database&amp;quot;. Instead of simply reconfiguring the existing database, this project encompasses a full redesign and creation of a new Patent database and a new Assignment database that will be joined together.&lt;br /&gt;
&lt;br /&gt;
Reference for definitions of fields in the two databases: https://www.oecd.org/sti/sci-tech/37569498.pdf&lt;br /&gt;
&lt;br /&gt;
Paper on how to extract data from XML files, parse it, and put it into a database: https://funginstitute.berkeley.edu/wp-content/uploads/2014/06/patentprocessor.pdf&lt;br /&gt;
&lt;br /&gt;
'''Particularly useful, because it provides solutions for cleaning location and firm data'''&lt;br /&gt;
&lt;br /&gt;
==ER Diagram for Assignment and Patent Databases==&lt;br /&gt;
&lt;br /&gt;
Note: this is currently out of sync with the current description below for the patent database. It will be updated once I have finalized how we will organize unique attributes for plant, design, and reissue patents.&lt;br /&gt;
&lt;br /&gt;
[[File:PatentAndAssignmentER2.png]]&lt;br /&gt;
&lt;br /&gt;
Attributes for each table are listed below with descriptions - because of how many attributes there are, I decided the ER diagram would be better suited as an overview of the tables rather than trying to show all the attributes on the diagram.&lt;br /&gt;
&lt;br /&gt;
==Assignment Database Structure==&lt;br /&gt;
&lt;br /&gt;
After reading through the paper &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot; (https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf) and taking into account our needs for the Assignment database (the ability to connect the two databases together, the ability to trace conveyance of a patent over time), I decided on the following structure for the Assignment database.&lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table in the database is rf_id, which stands for reel frame id. It is unique for each entry in the main table, ASSIGNMENT, and is included in every table.&lt;br /&gt;
&lt;br /&gt;
Conveyance is the only table that cannot be fully constructed as data is inserted into the database. Some of the fields depend on other tables being constructed, so it may either be partially populated while the other tables are being populated or be the last table to be populated.&lt;br /&gt;
&lt;br /&gt;
Full list of tables: assignee, assignor, document_info, assignment, conveyance&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: store all the assignees from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the recipient of a patent in an entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id, unless it is possible to have multiple assignees, then rf_id and assignee name&lt;br /&gt;
&lt;br /&gt;
All columns (variables) in the table: &lt;br /&gt;
&lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* address_line_1 (varchar(255)) first line of address of assignee&lt;br /&gt;
* address_line_2 (varchar(255)) second line of address of assignee&lt;br /&gt;
* city (varchar(255)) city of assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* country (varchar(255)) country of assignee&lt;br /&gt;
* postal_code (varchar(255)) post code of the assignee&lt;br /&gt;
&lt;br /&gt;
===ASSIGNOR===&lt;br /&gt;
	&lt;br /&gt;
Purpose: store all the assignors from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the one assigning/granting the license for the entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id and assignor_name&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
&lt;br /&gt;
===CONVEYANCE===&lt;br /&gt;
Purpose: represent the conveyance of a particular patent over time&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual transaction for a patent (not unlike assignment, except this table has information about the assignment in relation to other assignments for the patent)&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id &lt;br /&gt;
&lt;br /&gt;
All columns (variables): Two possibilities, depending on whether we want data to be repeated or not.&lt;br /&gt;
	&lt;br /&gt;
Not-repeated (with exception of primary/foreign key): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* next_transaction (bigint) rf_id of the next transaction that occurred after this transaction (traced by assignee/assignor)&lt;br /&gt;
* num_conveyance (int) number in order of conveyance by date&lt;br /&gt;
* acknowledgment_date (date) date that the assignor acknowledged the transaction&lt;br /&gt;
* execution_date (date) date that the transaction took place&lt;br /&gt;
* conveyance_type (varchar(255)) the USPTO paper determined a couple of different conveyance types and also outlined how to determine based on the conveyance text in the xml file which conveyance type is relevant to the transaction. The conveyance types are:&lt;br /&gt;
** assignment&lt;br /&gt;
** merger&lt;br /&gt;
** change of name&lt;br /&gt;
** government interest&lt;br /&gt;
** agreement&lt;br /&gt;
** security agreement&lt;br /&gt;
** release&lt;br /&gt;
&lt;br /&gt;
	&lt;br /&gt;
Repeated, also include: &lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
* patent_number (varchar(255)) identifying number for the patent&lt;br /&gt;
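&lt;br /&gt;
One possible way to fill in num_conveyance and next_transaction once the other tables exist is sketched below in Python. It is a simplification and not the method from the USPTO paper: it orders transactions for each patent by execution_date instead of tracing assignee/assignor names, and it assumes each rf_id covers a single patent. The function name order_conveyances and the sample records are hypothetical.&lt;br /&gt;

```python
from collections import defaultdict
from datetime import date


def order_conveyances(records):
    '''Number each patent's transactions by date and link each to the next one.

    records: iterable of dicts with keys rf_id, patent_number, execution_date.
    Returns a dict keyed by rf_id (assumes one patent per rf_id).
    '''
    by_patent = defaultdict(list)
    for rec in records:
        by_patent[rec['patent_number']].append(rec)

    result = {}
    for recs in by_patent.values():
        # Sort by date; rf_id breaks ties so the order is deterministic.
        recs.sort(key=lambda r: (r['execution_date'], r['rf_id']))
        for i, rec in enumerate(recs):
            is_last = (i + 1 == len(recs))
            result[rec['rf_id']] = {
                'num_conveyance': i + 1,
                'next_transaction': None if is_last else recs[i + 1]['rf_id'],
            }
    return result


# Made-up sample records.
sample = [
    {'rf_id': 3, 'patent_number': 'A', 'execution_date': date(2015, 1, 1)},
    {'rf_id': 1, 'patent_number': 'A', 'execution_date': date(2010, 6, 1)},
    {'rf_id': 2, 'patent_number': 'B', 'execution_date': date(2012, 3, 1)},
]
ordered = order_conveyances(sample)
print(ordered)
```

This needs every transaction for a patent to be loaded first, which matches the note above that CONVEYANCE cannot be fully built while data is still being inserted.&lt;br /&gt;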
&lt;br /&gt;
===DOCUMENT_INFO===&lt;br /&gt;
Purpose: store extra information relevant to the patents represented in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the documentation and the patent for a particular entry in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables) when created: &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* app_num (varchar(255)) application document USPTO number &lt;br /&gt;
* app_date (date) date that the application was filed&lt;br /&gt;
* app_country (varchar(255)) country in which it was filed&lt;br /&gt;
* pgpub_num (varchar(255)) pre-grant publication document USPTO number&lt;br /&gt;
* pgpub_date (date) date that pre-grant publication was released&lt;br /&gt;
* pgpub_country (varchar(255)) country in which the patent is published before being granted a patent number&lt;br /&gt;
* patent_number (varchar(255)) granted patent document USPTO number&lt;br /&gt;
* grant_date (date) date that the patent is officially published&lt;br /&gt;
* grant_country (varchar(255)) country in which the patent is published after being granted a patent number&lt;br /&gt;
* invention_title (varchar(255)) title of the invention&lt;br /&gt;
* language (varchar(255)) language of invention title – could be potentially useful and interesting to investigate&lt;br /&gt;
* reel_num (bigint) number of the reel the assignment was stored on&lt;br /&gt;
* frame_num (bigint) number of the frame on the reel the assignment was stored on&lt;br /&gt;
&lt;br /&gt;
===ASSIGNMENT===&lt;br /&gt;
Purpose: represent assignment transactions – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent assignment transaction&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignment_id (bigint) US patent assignment id &lt;br /&gt;
* correspondent_name (varchar(255)) name of correspondent for the transaction, could be a lawyer&lt;br /&gt;
* correspondent_address1 (varchar(255)) first line of address for correspondent&lt;br /&gt;
* correspondent_address2 (varchar(255)) second line of address for correspondent&lt;br /&gt;
* correspondent_address3 (varchar(255)) third line of address for correspondent&lt;br /&gt;
* correspondent_address4 (varchar(255)) fourth line of address for correspondent&lt;br /&gt;
* record_date (date) date recorded with USPTO&lt;br /&gt;
* last_update_date (date) date that information for this assignment was last updated with the USPTO&lt;br /&gt;
* page_cnt (bigint) page count of assignment record&lt;br /&gt;
&lt;br /&gt;
More extensive notes exist under :E/McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper&lt;br /&gt;
&lt;br /&gt;
==Patent Database Structure==&lt;br /&gt;
&lt;br /&gt;
Most of this design is based off what I investigated last semester and therefore is subject to change, since the design for the Assignment database is somewhat new. &lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table is patent_number, which uniquely identifies every patent in the PATENT table and is included in every table.&lt;br /&gt;
&lt;br /&gt;
Full list of tables: patent, assignee, citation, fee, claims, histpatent, lawyers, inventors&lt;br /&gt;
&lt;br /&gt;
Update: These tables, as of October 26, are being updated and changed as I determine what data can actually be retrieved from the XML files based on the xpaths that we have determined (you can see these xpaths on the &amp;quot;Equivalent XPath and APS Queries&amp;quot; project page).&lt;br /&gt;
&lt;br /&gt;
===PATENT===&lt;br /&gt;
Purpose: represent patents – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* patent_type (varchar(255)) the type of patent, like utility&lt;br /&gt;
* patent_kind (varchar(2)) the kind of patent, a letter and a number&lt;br /&gt;
* title (varchar(255)) the title of the patent&lt;br /&gt;
* grantdate (date) date that patent was officially granted including the year&lt;br /&gt;
* prioritydate (date) date used to establish novelty of an invention with regard to other inventions (source: http://www.bios.net/daisy/patentlens/2343.html)&lt;br /&gt;
* prioritycountry (varchar(255)) country where patent is first filed&lt;br /&gt;
* patent_country (varchar(255)) country in which this patent was published&lt;br /&gt;
* prioritypatentnumber (varchar(255)) placeholder patent number if the patent is published before it is fully approved - also referred to as priority claims number&lt;br /&gt;
* cpcsubclass (varchar(255)) cooperative patent classification subclass&lt;br /&gt;
* cpcsubgroup (varchar(255)) cooperative patent classification subgroup&lt;br /&gt;
* cpcmaingroup (varchar(255)) cooperative patent classification main group&lt;br /&gt;
* cpctotal (varchar(255)) concatenated version of cpcmaingroup, cpcsubgroup, cpcsubclass&lt;br /&gt;
* pctpatentnumber (varchar(255)) international patent number (according to Patent Cooperation Treaty) also known as PCT Document Number&lt;br /&gt;
* appnum (varchar(255)) application number, probably also the filing number - the format is a two-digit series code followed by a six-digit serial number (source: https://www.uspto.gov/patents-application-process/filing-online/info-application-number)&lt;br /&gt;
* appdate (date) date that the application was filed, including year - probably also the filing date&lt;br /&gt;
* ipcrsubclass (varchar(255)) International patent classification subclass&lt;br /&gt;
* ipcrmaingroup (varchar(255)) International patent classification main group&lt;br /&gt;
* ipcrsubgroup (varchar(255)) International patent classification subgroup&lt;br /&gt;
* ipcrtotal (varchar(255)) concatenated version of ipcrsubgroup, ipcrsubclass, ipcrmaingroup&lt;br /&gt;
* national_classification (varchar(255)) probably uspc, which is United States Patent Classification&lt;br /&gt;
* national_classification_country (varchar(255)) should be United States&lt;br /&gt;
* num_claims (int) number of claims (things that the inventor wishes to protect with this patent)&lt;br /&gt;
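&lt;br /&gt;
The appnum format described in the list above (a two-digit series code followed by a six-digit serial number) can be split into its two parts with a short helper. This is only a sketch: the function name split_appnum is hypothetical, and stripping slashes and commas before splitting is an assumption about how the numbers may appear in the raw data.&lt;br /&gt;
&lt;br /&gt;
```python
import re


def split_appnum(appnum):
    """Split an application number into (series_code, serial_number).

    Assumes a two-digit series code followed by a six-digit serial
    number, written with or without punctuation (e.g. "13/123,456").
    Returns None when the input does not contain exactly eight digits.
    """
    digits = re.sub(r"\D", "", appnum)  # drop slashes, commas, spaces
    if len(digits) != 8:
        return None
    return digits[:2], digits[2:]
```
&lt;br /&gt;
Values that do not fit the expected format come back as None, so they can be logged and inspected rather than silently loaded.&lt;br /&gt;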
&lt;br /&gt;
Discrepancies: &lt;br /&gt;
These fields were previously part of the table, but we do not currently have xpaths for them:&lt;br /&gt;
* app_year - the year the application was filed&lt;br /&gt;
* grant_year - the year the patent was granted&lt;br /&gt;
* uspc - United States Patent Classification (might be the same as national_classification)&lt;br /&gt;
* uspcsub - United States Patent Classification subgroup&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: Represent the person or organization to whom the patent was assigned at publication, according to the patent information (the information in the Assignment database is more complete)&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the assignee for a patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number, first_name, last_name (in case it is possible for there to be multiple assignees for an individual patent)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* first_name (varchar(255)) first name of assignee&lt;br /&gt;
* last_name (varchar(255)) last name of assignee&lt;br /&gt;
* address (varchar(255)) address of assignee&lt;br /&gt;
* postcode (varchar(255)) postcode of assignee&lt;br /&gt;
* orgname (varchar(255)) name of the organization owning the patent (if applicable - organization may go by multiple orgnames) &lt;br /&gt;
* city (varchar(255)) city of the assignee&lt;br /&gt;
* country (varchar(255)) country of the assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* residence (varchar(255)) mailing address of assignee if they receive mail at a different address than the one listed above&lt;br /&gt;
* size_of_firm (varchar(255)) size of the organization (large, small, or micro) - relates to the size of the fee they pay for the patent&lt;br /&gt;
&lt;br /&gt;
===CITATION===&lt;br /&gt;
Purpose: Represent all the patents in PATENT that another patent in PATENT cites&lt;br /&gt;
&lt;br /&gt;
Each entry represents: one citation that a patent in PATENT makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number AND cited_patent_number (because a patent might cite multiple patents, so both variables are needed for the primary key)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* cited_patent_number (varchar(255)) unique identifier from the USPTO for the patent being cited&lt;br /&gt;
* description (varchar(255)) description of the citation&lt;br /&gt;
* citation_date (date) unclear what this date is supposed to represent from the DTD, but according to the paper that is the second link in the top section, it is the &amp;quot;date of cited document&amp;quot;&lt;br /&gt;
* citation_kind (varchar(255)) type of cited document&lt;br /&gt;
* citation_country (varchar(255)) country of origin for the cited patent&lt;br /&gt;
* citation_name (varchar(255)) appears to be the names of those to whom the cited patent is assigned&lt;br /&gt;
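&lt;br /&gt;
The composite primary key described above can be tried out with a small throwaway database. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for the real database server; the helper names and the sample patent numbers are made up, and only the two key columns of CITATION are included.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3


def build_citation_demo():
    """Create an in-memory database with a PATENT and a CITATION table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.execute("CREATE TABLE patent (patent_number VARCHAR(255) PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE citation ("
        " patent_number VARCHAR(255) NOT NULL REFERENCES patent (patent_number),"
        " cited_patent_number VARCHAR(255) NOT NULL,"
        " PRIMARY KEY (patent_number, cited_patent_number))"
    )
    conn.execute("INSERT INTO patent VALUES (?)", ("9000001",))  # made-up number
    return conn


def add_citation(conn, citing, cited):
    """Insert one citation; return False if the database rejects the row."""
    try:
        conn.execute("INSERT INTO citation VALUES (?, ?)", (citing, cited))
    except sqlite3.IntegrityError:
        return False
    return True
```
&lt;br /&gt;
With this key, one patent can cite many patents, but the same citing/cited pair cannot be stored twice, and a citation whose citing patent is missing from PATENT is rejected.&lt;br /&gt;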
&lt;br /&gt;
Note: There are other fields available for this table, such as document number, but this seems like it might be a repeat of the document number for the citing patent&lt;br /&gt;
&lt;br /&gt;
===FEE===&lt;br /&gt;
Purpose: Represent information about the fees paid for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the fees paid on the patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* four (boolean) payment of maintenance fee 4th year&lt;br /&gt;
* eight (boolean) payment of maintenance fee 8th year&lt;br /&gt;
* twelve (boolean) payment of maintenance fee 12th year&lt;br /&gt;
* fee_date (date) date that fee is paid on&lt;br /&gt;
* fee_code (varchar(255)) code for type of fee to pay&lt;br /&gt;
&lt;br /&gt;
===CLAIMS===&lt;br /&gt;
&lt;br /&gt;
Purpose: Represent information about the claims for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a claim that a patent makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number and claim_id&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* claim_id (varchar(255)) id number for the claim&lt;br /&gt;
* language (varchar(255)) language of claim&lt;br /&gt;
* claim_type (varchar(255)) type of claim&lt;br /&gt;
* status (varchar(255)) status of the claim&lt;br /&gt;
* claim_text (text) the text of the claim&lt;br /&gt;
&lt;br /&gt;
===HISTPATENT===&lt;br /&gt;
Purpose: Represent historical data about the patent pertaining to dates and publication&lt;br /&gt;
&lt;br /&gt;
Each entry represents: sequence of historical data about an individual patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* pubno (varchar(255)) patent publication number&lt;br /&gt;
* pubdate (date) date patent was issued&lt;br /&gt;
* dispdate (date) disposal date for the application&lt;br /&gt;
* distype (varchar(3)) application status of the patent - ABN for abandoned, ISS for issued, or PEN for pending&lt;br /&gt;
* exp_dt (date) expiration date of patent&lt;br /&gt;
* pta (int) length of patent term adjustment (which extends the amount of time the patent is allowed to be in force) in number of days&lt;br /&gt;
&lt;br /&gt;
===INVENTORS===&lt;br /&gt;
Purpose: Represent the inventors for each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about an inventor for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (and name, if there can be multiple inventors for a single patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identify the inventor?&lt;br /&gt;
* name (varchar(255)) concatenated string of the first and last name of the inventor&lt;br /&gt;
* orgname (varchar(255)) name of the organization the inventor works for&lt;br /&gt;
* state (varchar(255)) state in which the inventor resided&lt;br /&gt;
* address (varchar(255)) address at which the inventor resided&lt;br /&gt;
* country (varchar(255)) country in which the inventor resided&lt;br /&gt;
* city (varchar(255)) city in which the inventor resided&lt;br /&gt;
&lt;br /&gt;
===LAWYERS===&lt;br /&gt;
Purpose: Represent the lawyers who worked on each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a lawyer for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (there doesn't appear to be more than one lawyer entry for each patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identify the lawyer?&lt;br /&gt;
* org_name (varchar(255)) the name of the firm that did arbitration for the patent&lt;br /&gt;
&lt;br /&gt;
===Unique Attributes Tables===&lt;br /&gt;
&lt;br /&gt;
There are several attributes that have been identified as unique, applying to only one kind of patent. Adding all these fields to the main patent table would introduce a lot of NULL entries, because three out of four kinds of patent will not have any particular attribute. Instead, I have created three new tables to store these unique attributes - one for each type of patent with unique attributes.&lt;br /&gt;
&lt;br /&gt;
For each table, the primary key is the patent_number, which connects a row to a row in the main patent table. &lt;br /&gt;
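&lt;br /&gt;
The effect of keeping type-specific attributes in their own table can be sketched as follows. This uses Python's built-in sqlite3 module as a stand-in for the real database; the plant table here holds only latin_name, and the helper names, patent numbers, and plant name are invented for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3


def build_plant_demo():
    """Create a tiny in-memory database with a patent and a plant table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        "CREATE TABLE patent ("
        " patent_number VARCHAR(255) PRIMARY KEY,"
        " patent_type VARCHAR(255))"
    )
    # patent_number is both the primary key and the link back to patent.
    conn.execute(
        "CREATE TABLE plant ("
        " patent_number VARCHAR(255) PRIMARY KEY"
        " REFERENCES patent (patent_number),"
        " latin_name VARCHAR(255))"
    )
    conn.executemany(
        "INSERT INTO patent VALUES (?, ?)",
        [("PP00001", "plant"), ("9000001", "utility")],  # made-up rows
    )
    conn.execute("INSERT INTO plant VALUES (?, ?)", ("PP00001", "Rosa hybrida"))
    return conn


def patents_with_latin_name(conn):
    """LEFT JOIN keeps every patent; non-plant rows get None for latin_name."""
    query = (
        "SELECT p.patent_number, p.patent_type, pl.latin_name"
        " FROM patent AS p"
        " LEFT JOIN plant AS pl ON pl.patent_number = p.patent_number"
        " ORDER BY p.patent_number"
    )
    return conn.execute(query).fetchall()
```
&lt;br /&gt;
No NULL is stored for the utility patent; the missing value only appears in the result of the LEFT JOIN, which is the point of splitting the attributes out.&lt;br /&gt;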
&lt;br /&gt;
TODO: Figure out best way to generate these tables, finish description of attributes for each table.&lt;br /&gt;
&lt;br /&gt;
* Note that I did include some unique attributes for Utility patents as seen below in the &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot; section, but for now (11/3/2017) I'm not going to create a separate table for those fields. Most of them seem to be repeated in Reissue, and the other fields, while interesting, are perhaps not very useful.&lt;br /&gt;
&lt;br /&gt;
====DESIGN====&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* latin_name (varchar(255)) latin name for the plant&lt;br /&gt;
* us_botanical_variety(varchar(255)) denotes what variety of plant it is - for example, a rose has several different varieties&lt;br /&gt;
&lt;br /&gt;
====REISSUE====&lt;br /&gt;
&lt;br /&gt;
In the following table, you'll notice lots of fields related to three different kinds of &amp;quot;documents&amp;quot; - a &amp;quot;parent document&amp;quot;, a &amp;quot;child document&amp;quot;, and a &amp;quot;parent grant document&amp;quot;. It is not immediately clear what these three documents represent for a reissue patent. After some research, I think I have determined what these &amp;quot;documents&amp;quot; are, but please know that I do not have definitive proof, as there is little available information about these &amp;quot;documents&amp;quot; with regard to a reissue patent.&lt;br /&gt;
&lt;br /&gt;
It is possible that the &amp;quot;parent document&amp;quot; is the parent patent application - that is, the first patent application filed for an invention. That would explain why reissue patents also have several fields related to a &amp;quot;child document&amp;quot; - a &amp;quot;child document&amp;quot; could be a child application, which is filed while a parent application is still pending. So in this case, a &amp;quot;child document&amp;quot; of a reissue patent would be an application regarding the same patent that was filed while the reissue application was still pending. A &amp;quot;child document&amp;quot; is either a continuation, divisional, or continuation-in-part application ([http://www.patenttrademarkblog.com/parent-and-child-patent-applications/ source]).&lt;br /&gt;
&lt;br /&gt;
Based on this logic, I think that the difference between a &amp;quot;parent document&amp;quot; and a &amp;quot;parent grant document&amp;quot; depends on whether the parent patent application has been granted. If the parent patent application is still pending (meaning the patent has not yet been granted) then I believe the reissue patent will store information about the parent patent under &amp;quot;parent document&amp;quot;. However, if the parent patent application has been granted, then the information will be stored under &amp;quot;parent grant document&amp;quot;. This seems like the most logical explanation, especially considering that the path to any field related to &amp;quot;parent grant document&amp;quot; contains &amp;quot;parent document&amp;quot;, as with the example below:&lt;br /&gt;
&lt;br /&gt;
`us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/parent-doc/parent-grant-document/document-id/date`&lt;br /&gt;
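&lt;br /&gt;
A path like the one above can be evaluated with Python's xml.etree.ElementTree, which accepts slash-separated child paths. The sketch below is illustrative only: the sample tree is made up (it is built from the path itself rather than read from a USPTO file), and the helper names build_sample_grant and parent_grant_date are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
import xml.etree.ElementTree as ET

# Path relative to the us-patent-grant root element.
PARENT_GRANT_DATE_PATH = (
    "us-bibliographic-data-grant/us-related-documents/reissue/relation/"
    "parent-doc/parent-grant-document/document-id/date"
)


def build_sample_grant(date_text):
    """Build a made-up, minimal us-patent-grant tree for illustration."""
    root = ET.Element("us-patent-grant")
    node = root
    for tag in PARENT_GRANT_DATE_PATH.split("/"):
        node = ET.SubElement(node, tag)  # nest each tag under the previous one
    node.text = date_text
    return root


def parent_grant_date(root):
    """Return the parent grant document date, or None if the path is absent."""
    return root.findtext(PARENT_GRANT_DATE_PATH)
```
&lt;br /&gt;
For a real file, the root would come from ET.parse on a single-patent XML document instead of build_sample_grant.&lt;br /&gt;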
&lt;br /&gt;
=====Fields=====&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* parent_doc_status (varchar(255)) status of the parent document. It is unclear whether this refers to a parent reissue application (if there happen to be multiple applications relating to the reissue of this patent) or the patent that is being reissued. This goes for all fields beginning with &amp;quot;parent_doc&amp;quot;&lt;br /&gt;
* parent_doc_number (int) number for the parent document&lt;br /&gt;
* parent_doc_id (varchar(255)) it is unclear how this is different from parent document number. In the xpaths, the path will be something like ./parent_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent document&amp;quot; is - I will need to look in an actual XML file to see what this field looks like&lt;br /&gt;
* parent_doc_country (varchar(255)) the country of the parent document&lt;br /&gt;
* parent_doc_date (date) probably the date the parent document was published&lt;br /&gt;
 &lt;br /&gt;
In addition to parent document, there is a subcategory that has the same fields (minus status) called parent_grant_document. This is probably a patent that has been granted, given that the field names include grant, but I am not sure. Again, I am not sure what all of these fields represent.&lt;br /&gt;
&lt;br /&gt;
* parent_grant_doc_number (int) number for the parent grant document&lt;br /&gt;
* parent_grant_doc_id (varchar(255)) it is unclear how this is different from parent grant document number. In the xpaths, the path will be something like ./parent_doc/parent-grant-doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_grant_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent grant document&amp;quot; is - I will need to look in an actual XML file to see what this field looks like&lt;br /&gt;
* parent_grant_doc_country (varchar(255)) the country of the parent grant document&lt;br /&gt;
* parent_grant_doc_date (date) probably the date the parent grant document was published&lt;br /&gt;
&lt;br /&gt;
====PLANT====&lt;br /&gt;
&lt;br /&gt;
==Connecting Patent database and Assignment database==&lt;br /&gt;
The answer to connecting the Patent database to the Assignment database lies in using the information in the table DOCUMENT_INFO to connect each assignment in the Assignment database ASSIGNMENT table to a patent_number in the Patent database PATENT table. On further investigation into &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot;, the field they called &amp;quot;grant_doc_num&amp;quot; (patent_number above, previously grant_num) in DOCUMENT_INFO is the patent number (the description of DOCUMENT_INFO above has been altered to reflect this).&lt;br /&gt;
&lt;br /&gt;
The paper mentions that there will be errors in the patent number stored in DOCUMENT_INFO under grant_doc_num, so a separate table called DOCUMENT_INFO_ADMIN was constructed to determine how prevalent the errors were. They queried the patent number for each appno_doc_num from administrative data. The &amp;quot;appno_doc_num&amp;quot; is the Application Document USPTO number from DOCUMENT_INFO. On further investigation into the paper, it became clear that &amp;quot;administrative data&amp;quot; refers to internal USPTO data that was available to the authors of the paper, but that we might not have. However, in 99% of cases the grant_doc_num and the queried patent number based on the appno_doc_num were the same, so we can probably rely on grant_doc_num.&lt;br /&gt;
&lt;br /&gt;
Therefore, since every table in the Patent database will have the patent number stored, to connect any table in Assignment to the Patent database, it would first be joined with DOCUMENT_INFO on rf_id and then joined with the appropriate table in Patent on patent number.&lt;br /&gt;
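&lt;br /&gt;
The two-step join described above can be sketched against a throwaway database. Python's built-in sqlite3 module stands in for the real database here, each table is cut down to a couple of columns, and the helper names and sample rows are made up.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

# ASSIGNMENT joins DOCUMENT_INFO on rf_id, then PATENT on patent_number.
JOIN_QUERY = (
    "SELECT a.rf_id, a.record_date, p.patent_number, p.title"
    " FROM assignment AS a"
    " JOIN document_info AS d ON d.rf_id = a.rf_id"
    " JOIN patent AS p ON p.patent_number = d.patent_number"
    " ORDER BY a.rf_id"
)


def build_join_demo():
    """Create cut-down versions of the three tables with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE assignment (rf_id BIGINT PRIMARY KEY, record_date DATE)"
    )
    conn.execute(
        "CREATE TABLE document_info ("
        " rf_id BIGINT PRIMARY KEY, patent_number VARCHAR(255))"
    )
    conn.execute(
        "CREATE TABLE patent ("
        " patent_number VARCHAR(255) PRIMARY KEY, title VARCHAR(255))"
    )
    conn.executemany(
        "INSERT INTO assignment VALUES (?, ?)",
        [(1, "2017-01-05"), (2, "2017-02-10")],
    )
    conn.executemany(
        "INSERT INTO document_info VALUES (?, ?)",
        [(1, "9000001"), (2, "RE00002")],
    )
    conn.execute("INSERT INTO patent VALUES (?, ?)", ("9000001", "Widget"))
    return conn


def assignments_with_patents(conn):
    """Return assignments whose patent number is found in PATENT."""
    return conn.execute(JOIN_QUERY).fetchall()
```
&lt;br /&gt;
In this made-up data, the assignment with rf_id 2 points at a patent number that is not in PATENT, so the inner join drops it. Counting such dropped rows would be one way to measure how often the patent numbers in DOCUMENT_INFO fail to match.&lt;br /&gt;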
&lt;br /&gt;
Also in the paper, they mention Assignments on the Web for Patents (AOTW-P), a searchable database of individual USPTO assignment records keyed on reel-frame identification, patent number, and assignor or assignee name (https://assignment.uspto.gov/patent/index.html#/patent/search). Obviously it would not be possible to query all the patent numbers by hand with this tool, but a script that queries each patent number using the rf_id and parses the response could be useful for checking the patent numbers. It might not be any more accurate than what will already be in DOCUMENT_INFO, though.&lt;br /&gt;
&lt;br /&gt;
==Finding New Paths Unique to Plant, Reissue, and/or Design Patents==&lt;br /&gt;
&lt;br /&gt;
Based on Oliver's script, which searched all xpaths and compared which were unique to particular types, we see that the following attributes are unique to each type of patent other than utility patents. These attributes vary by the XML version, which changed over time. Therefore, the lists below are a superset of the attributes that are unique to each of the patent types listed below across all XML versions.&lt;br /&gt;
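&lt;br /&gt;
Building such a superset amounts to taking a union over XML versions. The sketch below is not Oliver's script; it assumes the per-version results are already available as a dictionary from version label to a set of attribute names, and the function names and the sample input are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def attribute_superset(unique_by_version):
    """Union the attributes seen in any XML version for one patent type."""
    superset = set()
    for attributes in unique_by_version.values():
        superset.update(attributes)
    return superset


def latest_version_seen(unique_by_version):
    """Map each attribute to the latest XML version in which it appears.

    Relies on version labels such as "XML40" to "XML45" sorting
    correctly as plain strings.
    """
    latest = {}
    for version in sorted(unique_by_version):
        for attribute in unique_by_version[version]:
            latest[attribute] = version  # later versions overwrite earlier ones
    return latest


# Made-up input for illustration only.
EXAMPLE_PLANT = {
    "XML43": {"Latin Name"},
    "XML44": {"Latin Name", "US Botanical Variety"},
    "XML45": {"US Claim Statement"},
}
```
&lt;br /&gt;
Calling attribute_superset(EXAMPLE_PLANT) gives the three plant attributes, and latest_version_seen(EXAMPLE_PLANT) gives the version label to write next to each one.&lt;br /&gt;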
&lt;br /&gt;
When I am done finding the supersets, we will determine how to integrate these attributes into the design for the patent table.&lt;br /&gt;
&lt;br /&gt;
'''There appears to be a fifth patent type in the results from Oliver's script, and there are attributes that are unique to Utility patents, though they do not appear particularly useful'''&lt;br /&gt;
&lt;br /&gt;
===Plant Patents===&lt;br /&gt;
* US Claim Statement (XML45) - unclear what this represents&lt;br /&gt;
* Latin Name (XML44)&lt;br /&gt;
* US Botanical Variety (XML44)&lt;br /&gt;
&lt;br /&gt;
===Reissue Patents===&lt;br /&gt;
* Parent Status (XML45)&lt;br /&gt;
* Parent Document (XML44) - seems to be either a normal document or a parent grant document (or both), both contain the following fields&lt;br /&gt;
** Date&lt;br /&gt;
** Document number&lt;br /&gt;
** Country &lt;br /&gt;
** Document ID&lt;br /&gt;
** Kind (XML42)&lt;br /&gt;
* Child Document (XML44)&lt;br /&gt;
** Document ID&lt;br /&gt;
** Document Number&lt;br /&gt;
** Country (XML43)&lt;br /&gt;
* Continuing Reissue (XML44)&lt;br /&gt;
** Document ID&lt;br /&gt;
** Relation&lt;br /&gt;
** Document Number&lt;br /&gt;
&lt;br /&gt;
Note about fields related to the parent document and/or parent grant document:&lt;br /&gt;
The structure of the above fields related to a reissue patent is very odd and varies by XML version. For example, for most of the XML versions, parent grant document falls under parent document and has its own document number, date, country, document ID, kind, etc. The parent doc field falls under relation, which falls under reissue.&lt;br /&gt;
&lt;br /&gt;
However, in XML41, parent document falls under relation and a new category - US Reexamination Reissue Merger. The information pertaining to the child document also falls under both relation and US Reexamination Reissue Merger. It's possible that during the time period this XML version represents, there were two types of reissue patents - a reexamination related to a merger and a standard reissue. With discrepancies like these, we'll have to determine in the end whether we want to stick with the fields as they are designed above (in which case the information about a parent doc or parent grant document is stored the same way whether it is a reissue or a Reexamination Reissue), or whether these two classifications are different enough to warrant duplicate fields in the table.&lt;br /&gt;
&lt;br /&gt;
===Design Patents===&lt;br /&gt;
* length of grant (XML45)&lt;br /&gt;
* Hague Agreement Data (XML45) - allows people to file design patents in 66 countries with one application&lt;br /&gt;
** International Registration Date&lt;br /&gt;
** International Filing Date&lt;br /&gt;
** International Registration Publication Date&lt;br /&gt;
** International Registration Number&lt;br /&gt;
* Classification Locarno (XML40)&lt;br /&gt;
** Edition&lt;br /&gt;
** Main Classification&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please note that the lists above are meant to represent unique types that should be added to the patent database. This does not mean that every XML version contains all these fields, or that every version contains the same path to these fields. That will have to be determined separately. The XML number next to some  of the items is meant to represent the latest XML version where the field was seen.&lt;br /&gt;
&lt;br /&gt;
Though I was mainly looking at unique fields in plant, reissue, and design patents, I noticed the following fields on unique paths for utility patents in XML41:&lt;br /&gt;
&lt;br /&gt;
* US Related Documents&lt;br /&gt;
** Substitution&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Parent Document&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Country&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
***** Parent Status&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Country&lt;br /&gt;
** Continuation/Continuation in Part&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Doc&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
** Division&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Kind&lt;br /&gt;
&lt;br /&gt;
This appears to resemble the information about reissue patents. Additionally, a couple of fields appear under the information about bibliographic data that might be useful:&lt;br /&gt;
* US Deceased Inventor (XML41)&lt;br /&gt;
** Post Code&lt;br /&gt;
** Address&lt;br /&gt;
* US Provisional Application Status (XML41)&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=21776</id>
		<title>Redesign Assignment and Patent Database</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=21776"/>
		<updated>2017-11-10T21:55:37Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* REISSUE */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Redesign Assignment and Patent Database&lt;br /&gt;
|Has owner=Shelby Bice,&lt;br /&gt;
|Has start date=9/2017&lt;br /&gt;
|Has keywords=patent&lt;br /&gt;
|Is dependent on=Reproducible Patent Data,&lt;br /&gt;
}}&lt;br /&gt;
'''FOR ED:''' I finished going through the unique paths (you can see the unique fields I identified at the bottom of this page, under &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot;). Then, under my design for the Patent Database, there's a new section called &amp;quot;Unique Attributes Tables&amp;quot; where I've begun detailing a design for three new tables that would include information that is unique to Design, Reissue, and Plant patents. Here are some questions I have:&lt;br /&gt;
&lt;br /&gt;
'''1.''' Do you like the design for the new tables? How do you think we should populate these tables? I can think of one way (putting all the information in the main patent table and then moving it out to Plant, Reissue, or Design as needed) but I was wondering if there might be an easier way.&lt;br /&gt;
&lt;br /&gt;
'''2.''' There were some unique attributes to utility patents that I also included in &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot;. They only appear to be in two of the oldest XML versions (4.1 and 4.0) and I'm not sure the information is particularly useful - but please take a look at the field names and let me know if you disagree and think they are useful and should be included in a table in the patent database.&lt;br /&gt;
&lt;br /&gt;
'''3.''' Do you know what some of these fields represent? For example, I've been trying to research online what a &amp;quot;Parent Document&amp;quot; and/or &amp;quot;Parent Grant Document&amp;quot; might represent for a reissue patent, and I've found a couple possible options. I will keep researching, but if you know what they represent or one of my descriptions for a field is wrong, please let me know.&lt;br /&gt;
&lt;br /&gt;
This is an extension of the work I did last semester under &amp;quot;Redesigning Patent Database&amp;quot;. Instead of simply reconfiguring the existing database, this project encompasses a full redesign and the creation of a new Patent database and a new Assignment database that will be joined together.&lt;br /&gt;
&lt;br /&gt;
Reference for definitions of fields in the two databases: https://www.oecd.org/sti/sci-tech/37569498.pdf&lt;br /&gt;
&lt;br /&gt;
Paper on how to extract data from XML files, parse it, and put it into a database: https://funginstitute.berkeley.edu/wp-content/uploads/2014/06/patentprocessor.pdf&lt;br /&gt;
&lt;br /&gt;
'''Particularly useful, because it provides solutions for cleaning location and firm data'''&lt;br /&gt;
&lt;br /&gt;
==ER Diagram for Assignment and Patent Databases==&lt;br /&gt;
&lt;br /&gt;
Note: this is currently out of sync with the current description below for the patent database. It will be updated once I have finalized how we will organize unique attributes for plant, design, and reissue patents.&lt;br /&gt;
&lt;br /&gt;
[[File:PatentAndAssignmentER2.png]]&lt;br /&gt;
&lt;br /&gt;
Attributes for each table are listed below with descriptions - because of how many attributes there are, I decided the ER diagram would be better suited as an overview of the tables rather than trying to show all the attributes on the diagram.&lt;br /&gt;
&lt;br /&gt;
==Assignment Database Structure==&lt;br /&gt;
&lt;br /&gt;
After reading through the paper &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot; (https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf) and taking into account our needs for the Assignment database (the ability to connect the two databases together, the ability to trace conveyance of a patent over time), I decided on the following structure for the Assignment database.&lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table in the database is rf_id, which stands for reel frame id. It is unique for each entry in the main table, assignment, and is included in every table.&lt;br /&gt;
&lt;br /&gt;
Conveyance is the only table that cannot be fully constructed as data is inserted into the database. Some of the fields depend on other tables being constructed, so it may either be partially populated while the other tables are being populated or be the last table to be populated.&lt;br /&gt;
&lt;br /&gt;
Full list of tables: assignee, assignor, document_info, assignment, conveyance&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: store all the assignees from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the recipient of a patent in an entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id, unless it is possible to have multiple assignees, then rf_id and assignee name&lt;br /&gt;
&lt;br /&gt;
All columns (variables) in the table: &lt;br /&gt;
&lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* address_line_1 (varchar(255)) first line of address of assignee&lt;br /&gt;
* address_line_2 (varchar(255)) second line of address of assignee&lt;br /&gt;
* city (varchar(255)) city of assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* country (varchar(255)) country of assignee&lt;br /&gt;
* postal_code (varchar(255)) post code of the assignee&lt;br /&gt;
&lt;br /&gt;
===ASSIGNOR===&lt;br /&gt;
	&lt;br /&gt;
Purpose: store all the assignors from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the party assigning the patent or granting the license for the entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id and assignor_name&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
&lt;br /&gt;
===CONVEYANCE===&lt;br /&gt;
Purpose: represent the conveyance of a particular patent over time&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual transaction for a patent (not unlike ASSIGNMENT, except that this table has information about the assignment in relation to other assignments for the patent)&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id &lt;br /&gt;
&lt;br /&gt;
All columns (variables): Two possibilities, depending on whether we want data to be repeated or not.&lt;br /&gt;
	&lt;br /&gt;
Not-repeated (with exception of primary/foreign key): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* next_transaction (bigint) rf_id of the next transaction that occurred after this transaction (traced by assignee/assignor)&lt;br /&gt;
* num_conveyance (int) number in order of conveyance by date&lt;br /&gt;
* acknowledgment_date (date) date that the assignor acknowledged the transaction&lt;br /&gt;
* execution_date (date) date that the transaction took place&lt;br /&gt;
* conveyance_type (varchar(255)) the USPTO paper defines several conveyance types and outlines how to determine, from the conveyance text in the XML file, which type applies to a transaction. The conveyance types are:&lt;br /&gt;
** assignment&lt;br /&gt;
** merger&lt;br /&gt;
** change of name&lt;br /&gt;
** government interest&lt;br /&gt;
** agreement&lt;br /&gt;
** security agreement&lt;br /&gt;
** release&lt;br /&gt;
&lt;br /&gt;
	&lt;br /&gt;
Repeated, also include: &lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
* patent_number (varchar(255)) identifying number for the patent&lt;br /&gt;
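&lt;br /&gt;
The conveyance_type, num_conveyance, and next_transaction fields described above could be filled in along the following lines. This is only a minimal Python sketch under stated assumptions: the keyword list in KEYWORD_RULES is hypothetical and is not the rule set from the USPTO paper, the function names and sample records are made up for illustration, and transactions are ordered purely by execution date rather than traced by assignee/assignor.&lt;br /&gt;

```python
from datetime import date

# Hypothetical keyword rules, checked in order. The rules in the USPTO paper
# are more detailed. "release" is checked first so that a release of a
# security interest is not labelled as a security agreement.
KEYWORD_RULES = [
    ("release", "release"),
    ("security", "security agreement"),
    ("merger", "merger"),
    ("change of name", "change of name"),
    ("government", "government interest"),
    ("assignment", "assignment"),
    ("agreement", "agreement"),
]


def classify_conveyance(conveyance_text):
    """Return the first conveyance type whose keyword appears in the text."""
    text = conveyance_text.lower()
    for keyword, conveyance_type in KEYWORD_RULES:
        if keyword in text:
            return conveyance_type
    return None  # no rule matched; conveyance_type would stay NULL


def number_conveyances(transactions):
    """Given (rf_id, execution_date) pairs for ONE patent, return rows of
    (rf_id, num_conveyance, next_transaction) ordered by execution date."""
    ordered = sorted(transactions, key=lambda t: t[1])
    rows = []
    for i, (rf_id, _) in enumerate(ordered):
        is_last = i == len(ordered) - 1
        next_rf_id = None if is_last else ordered[i + 1][0]
        rows.append((rf_id, i + 1, next_rf_id))
    return rows


# Made-up sample: three transactions for one patent, given out of order.
sample = [(300, date(2015, 6, 1)), (100, date(2010, 1, 5)), (200, date(2012, 3, 9))]
ordered_rows = number_conveyances(sample)
```

Because next_transaction depends on every transaction for the patent being known, a pass like number_conveyances would have to run after the other tables are loaded, which matches the note above that CONVEYANCE is populated last.&lt;br /&gt;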
&lt;br /&gt;
===DOCUMENT_INFO===&lt;br /&gt;
Purpose: store extra information relevant to the patents represented in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the documentation and the patent for a particular entry in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables) when created: &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* app_num (varchar(255)) application document USPTO number &lt;br /&gt;
* app_date (date) date that the application was filed&lt;br /&gt;
* app_country (varchar(255)) country in which it was filed&lt;br /&gt;
* pgpub_num (varchar(255)) pre-grant publication document USPTO number&lt;br /&gt;
* pgpub_date (date) date that pre-grant publication was released&lt;br /&gt;
* pgpub_country (varchar(255)) country in which the patent is published before being granted a patent number&lt;br /&gt;
* patent_number (varchar(255)) granted patent document USPTO number&lt;br /&gt;
* grant_date (date) date that the patent is officially published&lt;br /&gt;
* grant_country (varchar(255)) country in which the patent is published after being granted a patent number&lt;br /&gt;
* invention_title (varchar(255)) title of the invention&lt;br /&gt;
* language (varchar(255)) language of the invention title – could potentially be useful and interesting to investigate&lt;br /&gt;
* reel_num (bigint) number of the reel the assignment was stored on&lt;br /&gt;
* frame_num (bigint) number of the frame on the reel the assignment was stored on&lt;br /&gt;
&lt;br /&gt;
===ASSIGNMENT===&lt;br /&gt;
Purpose: represent assignment transactions – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent assignment transaction&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignment_id (bigint) US patent assignment id &lt;br /&gt;
* correspondent_name (varchar(255)) name of correspondent for the transaction, could be a lawyer&lt;br /&gt;
* correspondent_address1 (varchar(255)) first line of address for correspondent&lt;br /&gt;
* correspondent_address2 (varchar(255)) second line of address for correspondent&lt;br /&gt;
* correspondent_address3 (varchar(255)) third line of address for correspondent&lt;br /&gt;
* correspondent_address4 (varchar(255)) fourth line of address for correspondent&lt;br /&gt;
* record_date (date) date recorded with the USPTO&lt;br /&gt;
* last_update_date (date) date that information for this assignment was last updated with the USPTO&lt;br /&gt;
* page_cnt (bigint) page count of the assignment record&lt;br /&gt;
&lt;br /&gt;
More extensive notes exist under :E/McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper&lt;br /&gt;
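&lt;br /&gt;
As a rough illustration of the structure above, the following Python sketch creates a cut-down version of three of the Assignment tables in an in-memory SQLite database and shows rf_id acting as the shared key. SQLite, the reduced column lists, and the sample values are all assumptions made for the example; the project database may use a different engine and would carry the full column lists.&lt;br /&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite does not enforce foreign keys unless this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript(
    """
    CREATE TABLE assignment (
        rf_id       BIGINT PRIMARY KEY,   -- reel frame id, unique per assignment
        record_date DATE,
        page_cnt    BIGINT
    );
    CREATE TABLE assignor (
        rf_id         BIGINT NOT NULL REFERENCES assignment (rf_id),
        assignor_name VARCHAR(255) NOT NULL,
        PRIMARY KEY (rf_id, assignor_name)  -- several assignors per assignment
    );
    CREATE TABLE document_info (
        rf_id         BIGINT PRIMARY KEY REFERENCES assignment (rf_id),
        app_num       VARCHAR(255),
        patent_number VARCHAR(255)
    );
    """
)

# Made-up sample values, for illustration only.
conn.execute("INSERT INTO assignment VALUES (1234500678, '2015-06-01', 3)")
conn.executemany(
    "INSERT INTO assignor VALUES (?, ?)",
    [(1234500678, "DOE, JANE"), (1234500678, "ROE, RICHARD")],
)

try:
    # rf_id 999 has no row in assignment, so the foreign key rejects it.
    conn.execute("INSERT INTO assignor VALUES (999, 'NOBODY')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

assignor_count = conn.execute("SELECT COUNT(*) FROM assignor").fetchone()[0]
```

The composite primary key on assignor mirrors the (rf_id, assignor_name) key described above, and the rejected insert shows why ASSIGNMENT has to be loaded before the tables that reference it.&lt;br /&gt;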
&lt;br /&gt;
==Patent Database Structure==&lt;br /&gt;
&lt;br /&gt;
Most of this design is based off what I investigated last semester and therefore is subject to change, since the design for the Assignment database is somewhat new. &lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table is patent_number, which uniquely identifies every patent in the PATENT table and is included in every table.&lt;br /&gt;
&lt;br /&gt;
Full list of tables: patent, assignee, citation, fee, claims, histpatent, lawyers, inventors&lt;br /&gt;
&lt;br /&gt;
Update: These tables, as of October 26, are being updated and changed as I determine what data can actually be retrieved from the XML files based on the xpaths that we have determined (you can see these xpaths on the &amp;quot;Equivalent XPath and APS Queries&amp;quot; project page).&lt;br /&gt;
&lt;br /&gt;
===PATENT===&lt;br /&gt;
Purpose: represent patents – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO; can contain letters, hence varchar&lt;br /&gt;
* patent_type (varchar(255)) the type of patent, like utility&lt;br /&gt;
* patent_kind (varchar(2)) the kind of patent, a letter and a number&lt;br /&gt;
* title (varchar(255)) the title of the patent&lt;br /&gt;
* grantdate (date) date that patent was officially granted including the year&lt;br /&gt;
* prioritydate (date) date used to establish the novelty of an invention relative to other inventions (source: http://www.bios.net/daisy/patentlens/2343.html)&lt;br /&gt;
* prioritycountry (varchar(255)) country where patent is first filed&lt;br /&gt;
* patent_country (varchar(255)) country in which this patent was published&lt;br /&gt;
* prioritypatentnumber (varchar(255)) placeholder patent number if the patent is published before it is fully approved - also referred to as priority claims number&lt;br /&gt;
* cpcsubclass (varchar(255)) cooperative patent classification subclass&lt;br /&gt;
* cpcsubgroup (varchar(255)) cooperative patent classification subgroup&lt;br /&gt;
* cpcmaingroup (varchar(255)) cooperative patent classification main group&lt;br /&gt;
* cpctotal (varchar(255)) concatenated version of cpcmaingroup, cpcsubgroup, cpcsubclass&lt;br /&gt;
* pctpatentnumber (varchar(255)) international patent number (according to Patent Cooperation Treaty) also known as PCT Document Number&lt;br /&gt;
* appnum (varchar(255)) application number, probably also the filing number - the format is a two-digit series code followed by a six-digit serial number (source: https://www.uspto.gov/patents-application-process/filing-online/info-application-number)&lt;br /&gt;
* appdate (date) date that the application was filed, including year - probably also the filing date&lt;br /&gt;
* ipcrsubclass (varchar(255)) International patent classification subclass&lt;br /&gt;
* ipcrmaingroup (varchar(255)) International patent classification main group&lt;br /&gt;
* ipcrsubgroup (varchar(255)) International patent classification subgroup&lt;br /&gt;
* ipcrtotal (varchar(255)) concatenated version of ipcrsubgroup, ipcrsubclass, ipcrmaingroup&lt;br /&gt;
* national_classification (varchar(255)) - probably uspc, which is United States Patent Classification&lt;br /&gt;
* national_classification_country (varchar(255)) - should be United States&lt;br /&gt;
* num_claims (int) number of claims (things that the inventor wishes to protect with this patent)&lt;br /&gt;
&lt;br /&gt;
Discrepancies: &lt;br /&gt;
These fields were previously a part of the table, but we do not currently have xpaths for them:&lt;br /&gt;
* app_year - the year the application was filed&lt;br /&gt;
* grant_year - the year the patent was granted&lt;br /&gt;
* uspc - United States Patent Classification (might be the same as national_classification)&lt;br /&gt;
* uspcsub - United States Patent subgroup&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: Represent the party to whom the patent was assigned at publication, according to the patent itself (the information in the Assignment database is more complete)&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the assignee for a patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number, first_name, last_name (in case it is possible for there to be multiple assignees for an individual patent)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO; can contain letters, hence varchar&lt;br /&gt;
* first_name (varchar(255)) first name of assignee&lt;br /&gt;
* last_name (varchar(255)) last name of assignee&lt;br /&gt;
* address (varchar(255)) address of assignee&lt;br /&gt;
* postcode (varchar(255)) postcode of assignee&lt;br /&gt;
* orgname (varchar(255)) name of the organization owning the patent (if applicable - organization may go by multiple orgnames) &lt;br /&gt;
* city (varchar(255)) city of the assignee&lt;br /&gt;
* country (varchar(255)) country of the assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* residence (varchar(255)) mailing address of assignee if they receive mail at a different address than the one listed above&lt;br /&gt;
* size_of_firm (varchar(255)) size of the organization (large, small, or micro) - relates to how much it pays in patent fees&lt;br /&gt;
&lt;br /&gt;
===CITATION===&lt;br /&gt;
Purpose: Represent all the patents in PATENT that another patent in PATENT cites&lt;br /&gt;
&lt;br /&gt;
Each entry represents: one citation that a patent in PATENT makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number AND cited_patent_number (because a patent might cite multiple patents, so both variables are needed for the primary key)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO; can contain letters, hence varchar&lt;br /&gt;
* cited_patent_number (varchar(255)) unique identifier from the USPTO for the patent being cited&lt;br /&gt;
* description (varchar(255)) description of the citation&lt;br /&gt;
* citation_date (date) unclear what this date is supposed to represent from the DTD, but according to the paper that is the second link in the top section, it is the &amp;quot;date of cited document&amp;quot;&lt;br /&gt;
* citation_kind (varchar(255)) type of cited document&lt;br /&gt;
* citation_country (varchar(255)) country of origin for the cited patent&lt;br /&gt;
* citation_name (varchar(255)) appears to be the names of the parties to whom the cited patent is assigned&lt;br /&gt;
&lt;br /&gt;
Note: There are other fields available for this table, such as document number, but this seems like it might be a repeat of the document number for the citing patent&lt;br /&gt;
&lt;br /&gt;
===FEE===&lt;br /&gt;
Purpose: Represent information about the fees paid for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the fees paid on the patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO; can contain letters, hence varchar&lt;br /&gt;
* four (boolean) payment of maintenance fee 4th year&lt;br /&gt;
* eight (boolean) payment of maintenance fee 8th year&lt;br /&gt;
* twelve (boolean) payment of maintenance fee 12th year&lt;br /&gt;
* fee_date (date) date on which the fee was paid&lt;br /&gt;
* fee_code (varchar(255)) code for type of fee to pay&lt;br /&gt;
&lt;br /&gt;
===CLAIMS===&lt;br /&gt;
&lt;br /&gt;
Purpose: Represent information about the claims for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a claim that a patent makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number and claim_id&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO; can contain letters, hence varchar&lt;br /&gt;
* claim_id (varchar(255)) id number for the claim&lt;br /&gt;
* language (varchar(255)) language of claim&lt;br /&gt;
* claim_type (varchar(255)) type of claim&lt;br /&gt;
* status (varchar(255)) status of the claim&lt;br /&gt;
* claim_text (text) the text of the claim&lt;br /&gt;
&lt;br /&gt;
===HISTPATENT===&lt;br /&gt;
Purpose: Represent historical data about the patent pertaining to dates and publication&lt;br /&gt;
&lt;br /&gt;
Each entry represents: sequence of historical data about an individual patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO; can contain letters, hence varchar&lt;br /&gt;
* pubno (varchar(255)) patent publication number&lt;br /&gt;
* pubdate (date) date patent was issued&lt;br /&gt;
* dispdate (date) disposal date for the application&lt;br /&gt;
* distype (varchar(3)) application status of the patent - ABN for abandoned, ISS for issued, or PEN for pending&lt;br /&gt;
* exp_dt (date) expiration date of patent&lt;br /&gt;
* pta (int) length of patent term adjustment (which extends the amount of time the patent is allowed to be in force) in number of days&lt;br /&gt;
&lt;br /&gt;
===INVENTORS===&lt;br /&gt;
Purpose: Represent the inventors for each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about an inventor for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (and name, if there can be multiple inventors for a single patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO; can contain letters, hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identify the inventor?&lt;br /&gt;
* name (varchar(255)) concatenated string of the first and last name of the inventor&lt;br /&gt;
* orgname (varchar(255)) name of the organization the inventor works for&lt;br /&gt;
* state (varchar(255)) - state in which the inventor resided&lt;br /&gt;
* address (varchar(255)) - address at which the inventor resided&lt;br /&gt;
* country (varchar(255)) country in which the inventor resided&lt;br /&gt;
* city (varchar(255)) city in which the inventor resided&lt;br /&gt;
&lt;br /&gt;
===LAWYERS===&lt;br /&gt;
Purpose: Represent the lawyers who worked on each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a lawyer for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (there doesn't appear to be more than one lawyer entry for each patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO; can contain letters, hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identify the lawyer?&lt;br /&gt;
* org_name (varchar(255)) the name of the firm that did arbitration for the patent&lt;br /&gt;
&lt;br /&gt;
===Unique Attributes Tables===&lt;br /&gt;
&lt;br /&gt;
Several attributes have been identified that apply to only one kind of patent. Adding all of these fields to the main patent table would introduce a lot of NULL entries, because three out of the four patent types will not have any particular attribute. Therefore, I have created three new tables to store these unique attributes - one for each type of patent with unique attributes.&lt;br /&gt;
&lt;br /&gt;
For each table, the primary key is the patent_number, which connects a row to a row in the main patent table. &lt;br /&gt;
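&lt;br /&gt;
The idea of a separate table per patent type can be sketched as follows, using the plant-specific attributes latin_name and us_botanical_variety as the example. This is a minimal Python illustration only: SQLite, the two-column patent table, and the sample patent numbers and values are assumptions, not the project's actual schema or data.&lt;br /&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE patent (
        patent_number VARCHAR(255) PRIMARY KEY,
        patent_type   VARCHAR(255)
    );
    -- One row per plant patent only. Other patent types have no row here,
    -- so the main table needs no mostly-NULL plant columns.
    CREATE TABLE plant (
        patent_number        VARCHAR(255) PRIMARY KEY REFERENCES patent (patent_number),
        latin_name           VARCHAR(255),
        us_botanical_variety VARCHAR(255)
    );
    INSERT INTO patent VALUES ('PP00001', 'plant'), ('0000002', 'utility');
    INSERT INTO plant VALUES ('PP00001', 'Rosa', 'Example Variety');
    """
)

# A LEFT JOIN keeps every patent and fills the plant columns with NULL
# (None in Python) for patents that have no row in the plant table.
rows = conn.execute(
    """
    SELECT p.patent_number, p.patent_type, pl.latin_name
    FROM patent AS p
    LEFT JOIN plant AS pl ON pl.patent_number = p.patent_number
    ORDER BY p.patent_number
    """
).fetchall()
```

With this layout the NULLs only appear at query time, when a type-specific table is joined back to the main patent table, rather than being stored for every patent.&lt;br /&gt;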
&lt;br /&gt;
TODO: Figure out best way to generate these tables, finish description of attributes for each table.&lt;br /&gt;
&lt;br /&gt;
* Note that I did include some unique attributes for Utility patents as seen below in the &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot; section, but for now (11/3/2017) I'm not going to create a separate table for those fields. Most of them seem to be repeated in Reissue, and the other fields, while interesting, are perhaps not very useful.&lt;br /&gt;
&lt;br /&gt;
====PLANT====&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO; can contain letters, hence varchar&lt;br /&gt;
* latin_name (varchar(255)) Latin name for the plant&lt;br /&gt;
* us_botanical_variety (varchar(255)) denotes what variety of plant it is - for example, a rose has several different varieties&lt;br /&gt;
&lt;br /&gt;
====REISSUE====&lt;br /&gt;
&lt;br /&gt;
In the following table, you'll notice many fields related to three different kinds of &amp;quot;documents&amp;quot; - a &amp;quot;parent document&amp;quot;, a &amp;quot;child document&amp;quot;, and a &amp;quot;parent grant document&amp;quot;. It is not immediately clear what these three documents represent for a reissue patent. After some research, I think I have determined what these &amp;quot;documents&amp;quot; are, but please note that I do not have definitive proof, as there is little available information about these &amp;quot;documents&amp;quot; with regard to reissue patents.&lt;br /&gt;
&lt;br /&gt;
It is possible that the &amp;quot;parent document&amp;quot; is the parent patent application - that is, the first patent application filed for an invention. That would explain why reissue patents also have several fields related to a &amp;quot;child document&amp;quot; - a &amp;quot;child document&amp;quot; could be a child application, which is filed while a parent application is still pending. So in this case, a &amp;quot;child document&amp;quot; of a reissue patent would be an application regarding the same patent that was filed while the reissue application was still pending. A &amp;quot;child document&amp;quot; is either a continuation, divisional, or continuation-in-part application ([http://www.patenttrademarkblog.com/parent-and-child-patent-applications/ source]).&lt;br /&gt;
&lt;br /&gt;
Based on this logic, I think that the difference between a &amp;quot;parent document&amp;quot; and a &amp;quot;parent grant document&amp;quot; depends on whether the parent patent application has been granted. If the parent patent application is still pending (meaning the patent has not yet been granted), then I believe the reissue patent will store information about the parent patent under &amp;quot;parent document&amp;quot;. However, if the parent patent application has been granted, then the information will be stored under &amp;quot;parent grant document&amp;quot;. This seems like the most logical explanation, especially considering that the path to any field related to &amp;quot;parent grant document&amp;quot; contains &amp;quot;parent document&amp;quot;, as in the example below:&lt;br /&gt;
&lt;br /&gt;
us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/parent-doc/parent-grant-document/document-id/date&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO; can contain letters, hence varchar&lt;br /&gt;
* parent_doc_status (varchar(255)) status of the parent document. It is unclear whether this refers to a parent reissue application (if there happen to be multiple applications relating to the reissue of this patent) or to the patent that is being reissued. This goes for all fields beginning &amp;quot;parent_doc&amp;quot;&lt;br /&gt;
* parent_doc_number (int) number for the parent document&lt;br /&gt;
* parent_doc_id (varchar(255)) it is unclear how this is different from parent document number. In the xpaths, the path will be something like ./parent_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent document&amp;quot; is - I will need to look in an actual XML file to see what this field looks like&lt;br /&gt;
* parent_doc_country (varchar(255)) the country of the parent document&lt;br /&gt;
* parent_doc_date (date) probably the date the parent document was published&lt;br /&gt;
&lt;br /&gt;
In addition to parent document, there is a subcategory with the same fields (minus status) called parent_grant_document. This is probably a patent that has been granted, given that the field name includes &amp;quot;grant&amp;quot;, but I am not sure. Again, I am not sure what all of these fields represent.&lt;br /&gt;
&lt;br /&gt;
* parent_grant_doc_number (int) number for the parent grant document&lt;br /&gt;
* parent_grant_doc_id (varchar(255)) it is unclear how this is different from parent grant document number. In the xpaths, the path will be something like ./parent_doc/parent-grant-doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_grant_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent grant document&amp;quot; is - I will need to look in an actual XML file to see what this field looks like&lt;br /&gt;
* parent_grant_doc_country (varchar(255)) the country of the parent grant document&lt;br /&gt;
* parent_grant_doc_date (date) probably the date the parent grant document was published&lt;br /&gt;
&lt;br /&gt;
====DESIGN====&lt;br /&gt;
&lt;br /&gt;
==Connecting Patent database and Assignment database==&lt;br /&gt;
The answer to connecting the Patent database to the Assignment database lies in using the information in the DOCUMENT_INFO table to find, for each assignment in the Assignment database ASSIGNMENT table, the matching patent number in the Patent database PATENT table. On further investigation into &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot;, the field the authors call &amp;quot;grant_doc_num&amp;quot; (patent_number above) in DOCUMENT_INFO is the patent number (the description of DOCUMENT_INFO above has been altered to reflect this).&lt;br /&gt;
&lt;br /&gt;
The paper mentions that there will be errors in the patent number stored in DOCUMENT_INFO under grant_doc_num, so the authors constructed a separate table called DOCUMENT_INFO_ADMIN to determine how prevalent the errors were. They looked up the patent number for each appno_doc_num in administrative data. The &amp;quot;appno_doc_num&amp;quot; is the Application Document USPTO number from DOCUMENT_INFO (app_num above). On further investigation into the paper, it became clear that &amp;quot;administrative data&amp;quot; refers to internal USPTO data that was available to the authors of the paper but that we might not have. However, in 99% of cases the grant_doc_num and the patent number queried using the appno_doc_num were the same, so we can probably rely on grant_doc_num.&lt;br /&gt;
&lt;br /&gt;
Therefore, since every table in the Patent database will have the patent number stored, to connect any table in Assignment to the Patent database, it would first be joined with DOCUMENT_INFO on rf_id and then joined with the appropriate table in Patent on patent number.&lt;br /&gt;
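&lt;br /&gt;
The two-step join described above can be sketched as follows. This is a minimal Python illustration, assuming SQLite and assuming that both databases live in one file so that they can be joined directly; the cut-down tables and the sample rows are made up for the example.&lt;br /&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE assignee (rf_id BIGINT, assignee_name VARCHAR(255));
    CREATE TABLE document_info (rf_id BIGINT, patent_number VARCHAR(255));
    CREATE TABLE patent (patent_number VARCHAR(255) PRIMARY KEY, title VARCHAR(255));

    INSERT INTO assignee VALUES (1, 'EXAMPLE CORP'), (2, 'SAMPLE LLC');
    INSERT INTO document_info VALUES (1, '0000001'), (2, '0000002');
    INSERT INTO patent VALUES ('0000001', 'Widget'), ('0000002', 'Gadget');
    """
)

# Step 1: join the Assignment-side table to DOCUMENT_INFO on rf_id.
# Step 2: join DOCUMENT_INFO to the Patent-side table on patent_number.
rows = conn.execute(
    """
    SELECT a.assignee_name, d.patent_number, p.title
    FROM assignee AS a
    JOIN document_info AS d ON d.rf_id = a.rf_id
    JOIN patent AS p ON p.patent_number = d.patent_number
    ORDER BY a.rf_id
    """
).fetchall()
```

Any other Assignment table could take the place of assignee in this query, since every one of them carries rf_id. Note that both designs contain a table named assignee, so the two would need distinct names or separate schemas if they were stored together.&lt;br /&gt;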
&lt;br /&gt;
Also in the paper, they mentioned the Assignments on the Web for Patents (AOTW-P), a searchable database of individual USPTO assignment records keyed on reel-frame identification, patent number, and assignor or assignee name (https://assignment.uspto.gov/patent/index.html#/patent/search). It would not be practical to query every patent number through this tool by hand, but if it were possible to write a script that queries each record by rf_id and parses the response, it could be used to check the patent numbers. It might not be any more accurate than what will already be in DOCUMENT_INFO, however.&lt;br /&gt;
&lt;br /&gt;
==Finding New Paths Unique to Plant, Reissue, and/or Design Patents==&lt;br /&gt;
&lt;br /&gt;
Based on Oliver's script, which searched all xpaths and compared which were unique to particular types, we see that the following attributes are unique to each type of patent other than utility patents. These attributes vary by XML version, which changed over time. Therefore, the lists below are a superset of the attributes that are unique to each of the patent types listed below across all XML versions.&lt;br /&gt;
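&lt;br /&gt;
The comparison and the superset described above amount to a set difference followed by a union. The following Python sketch shows that logic only; it is not Oliver's script, and the function names, the shortened path strings, and the sample data are all hypothetical.&lt;br /&gt;

```python
def unique_paths_by_type(paths_by_type):
    """Map each patent type to the xpaths that no other type uses."""
    unique = {}
    for patent_type, paths in paths_by_type.items():
        others = set()
        for other_type, other_paths in paths_by_type.items():
            if other_type != patent_type:
                others |= other_paths
        unique[patent_type] = paths - others
    return unique


def superset_across_versions(unique_by_version):
    """Union the per-version unique paths for each patent type."""
    superset = {}
    for unique in unique_by_version.values():
        for patent_type, paths in unique.items():
            superset.setdefault(patent_type, set()).update(paths)
    return superset


# Made-up, shortened paths for two XML versions.
xml44 = {
    "utility": {"title", "claims"},
    "plant": {"title", "latin-name", "us-botanical-variety"},
}
xml45 = {
    "utility": {"title", "claims"},
    "plant": {"title", "us-claim-statement"},
}
result = superset_across_versions(
    {"XML44": unique_paths_by_type(xml44), "XML45": unique_paths_by_type(xml45)}
)
```

Taking the union across versions is what makes the lists below a superset: a field counts as unique to a type if it is unique in any single XML version, even when other versions lack it.&lt;br /&gt;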
&lt;br /&gt;
When I am done finding the supersets, we will determine how to integrate these attributes into the design for the patent table.&lt;br /&gt;
&lt;br /&gt;
'''There appears to be a fifth patent type in the results from Oliver's script, and there are attributes that are unique to Utility patents, though they do not appear particularly useful'''&lt;br /&gt;
&lt;br /&gt;
===Plant Patents===&lt;br /&gt;
* US Claim Statement (XML45) - unclear what this represents&lt;br /&gt;
* Latin Name (XML44)&lt;br /&gt;
* US Botanical Variety (XML44)&lt;br /&gt;
&lt;br /&gt;
===Reissue Patents===&lt;br /&gt;
* Parent Status (XML45)&lt;br /&gt;
* Parent Document (XML44) - seems to be either a normal document or a parent grant document (or both), both contain the following fields&lt;br /&gt;
** Date&lt;br /&gt;
** Document number&lt;br /&gt;
** Country &lt;br /&gt;
** Document ID&lt;br /&gt;
** Kind (XML42)&lt;br /&gt;
* Child Document (XML44)&lt;br /&gt;
** Document ID&lt;br /&gt;
** Document Number&lt;br /&gt;
** Country (XML43)&lt;br /&gt;
* Continuing Reissue (XML44)&lt;br /&gt;
** Document ID&lt;br /&gt;
** Relation&lt;br /&gt;
** Document Number&lt;br /&gt;
&lt;br /&gt;
Note about fields related to the parent document and/or parent grant document:&lt;br /&gt;
The structure of the above fields related to a reissue patent is very odd and varies by XML version. For example, in most of the XML versions, parent grant document falls under parent document and has its own document number, date, country, document ID, kind, etc. The parent doc field falls under relation, which falls under reissue.&lt;br /&gt;
&lt;br /&gt;
However, in XML41, parent document falls under relation and a new category - US Reexamination Reissue Merger. The information pertaining to the child document also falls under both relation and US Reexamination Reissue Merger. It's possible that during the time period this XML version represents, there were two types of reissue patents - a reexamination related to a merger and a standard reissue. With discrepancies like these, we'll have to determine in the end whether we want to stick with the fields as they are designed above (in which case the information about a parent doc or parent grant document is stored the same way whether it is a reissue or a Reexamination Reissue), or whether these two classifications are different enough to warrant duplicate fields in the table.&lt;br /&gt;
&lt;br /&gt;
===Design Patents===&lt;br /&gt;
* length of grant (XML45)&lt;br /&gt;
* Hague Agreement Data (XML45) - allows people to file design patents in 66 countries with one application&lt;br /&gt;
** International Registration Date&lt;br /&gt;
** International Filing Date&lt;br /&gt;
** International Registration Publication Date&lt;br /&gt;
** International Registration Number&lt;br /&gt;
* Classification Locarno (XML40)&lt;br /&gt;
** Edition&lt;br /&gt;
** Main Classification&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please note that the lists above are meant to represent fields unique to particular patent types that should be added to the patent database. This does not mean that every XML version contains all these fields, or that every version contains the same path to these fields. That will have to be determined separately. The XML number next to some of the items is meant to represent the latest XML version where the field was seen.&lt;br /&gt;
&lt;br /&gt;
Though I was mainly looking at unique fields in plant, reissue, and design patents, I noticed the following fields on paths unique to utility patents in XML41:&lt;br /&gt;
&lt;br /&gt;
* US Related Documents&lt;br /&gt;
** Substitution&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Parent Document&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Country&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
***** Parent Status&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Country&lt;br /&gt;
** Continuation/Continuation in Part&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Doc&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
** Division&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Kind&lt;br /&gt;
&lt;br /&gt;
This appears to resemble the information about reissue patents. Additionally, a couple of fields appear under the information about bibliographic data that might be useful:&lt;br /&gt;
* US Deceased Inventor (XML41)&lt;br /&gt;
** Post Code&lt;br /&gt;
** Address&lt;br /&gt;
* US Provisional Application Status (XML41)&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=21775</id>
		<title>Redesign Assignment and Patent Database</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=21775"/>
		<updated>2017-11-10T21:54:50Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* REISSUE */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Redesign Assignment and Patent Database&lt;br /&gt;
|Has owner=Shelby Bice,&lt;br /&gt;
|Has start date=9/2017&lt;br /&gt;
|Has keywords=patent&lt;br /&gt;
|Is dependent on=Reproducible Patent Data,&lt;br /&gt;
}}&lt;br /&gt;
'''FOR ED:''' I finished going through the unique paths (you can see the unique fields I identified at the bottom of this page, under &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot;). Then, under my design for the Patent Database, there's a new section called &amp;quot;Unique Attributes Tables&amp;quot; where I've begun detailing a design for three new tables that would include information that is unique to Design, Reissue, and Plant patents. Here are some questions I have:&lt;br /&gt;
&lt;br /&gt;
'''1.''' Do you like the design for the new tables? How do you think we should populate these tables? I can think of one way (putting all the information in the main patent tables and then moving it out to Plant, Reissue, or Design as needed) but I was wondering if there might be an easier way.&lt;br /&gt;
&lt;br /&gt;
'''2.''' There were some unique attributes to utility patents that I also included in &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot;. They only appear to be in two of the oldest XML versions (4.1 and 4.0) and I'm not sure the information is particularly useful - but please take a look at the field names and let me know if you disagree and think they are useful and should be included in a table in the patent database.&lt;br /&gt;
&lt;br /&gt;
'''3.''' Do you know what some of these fields represent? For example, I've been trying to research online what a &amp;quot;Parent Document&amp;quot; and/or &amp;quot;Parent Grant Document&amp;quot; might represent for a reissue patent, and I've found a couple possible options. I will keep researching, but if you know what they represent or one of my descriptions for a field is wrong, please let me know.&lt;br /&gt;
&lt;br /&gt;
This is an extension of the work I did last semester under &quot;Redesigning Patent Database&quot;. Instead of simply reconfiguring the existing database, this project encompasses a full redesign and the creation of a new Patent database and a new Assignment database that will be joined together.&lt;br /&gt;
&lt;br /&gt;
Reference for definitions of fields in the two databases: https://www.oecd.org/sti/sci-tech/37569498.pdf&lt;br /&gt;
&lt;br /&gt;
Paper on how to extract patent data from XML files, parse it, and put it into a database: https://funginstitute.berkeley.edu/wp-content/uploads/2014/06/patentprocessor.pdf&lt;br /&gt;
&lt;br /&gt;
'''Particularly useful, because it provides solutions for cleaning location and firm data'''&lt;br /&gt;
&lt;br /&gt;
==ER Diagram for Assignment and Patent Databases==&lt;br /&gt;
&lt;br /&gt;
Note: this is currently out of sync with the current description below for the patent database. It will be updated once I have finalized how we will organize unique attributes for plant, design, and reissue patents.&lt;br /&gt;
&lt;br /&gt;
[[File:PatentAndAssignmentER2.png]]&lt;br /&gt;
&lt;br /&gt;
Attributes for each table are listed below with descriptions - because of how many attributes there are, I decided the ER diagram would be better suited as an overview of the tables rather than trying to show all the attributes on the diagram.&lt;br /&gt;
&lt;br /&gt;
==Assignment Database Structure==&lt;br /&gt;
&lt;br /&gt;
After reading through the paper &quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&quot; (https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf) and taking into account our needs for the Assignment database (the ability to connect the two databases together, the ability to trace conveyance of a patent over time), I decided on the following structure for the Assignment database.&lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table in the database is rf_id, which stands for reel frame id. It is unique for each entry in the main table, ASSIGNMENT, and is included in every table.&lt;br /&gt;
&lt;br /&gt;
Conveyance is the only table that cannot be fully constructed as data is inserted into the database. Some of the fields depend on other tables being constructed, so it will either be partially populated while the other tables are being populated or be the last table to be populated.&lt;br /&gt;
&lt;br /&gt;
Full list of tables: assignee, assignor, document_info, assignment, conveyance&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: store all the assignees from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the recipient of a patent in an entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id, unless it is possible to have multiple assignees, in which case rf_id and assignee_name&lt;br /&gt;
&lt;br /&gt;
All columns (variables) in the table: &lt;br /&gt;
&lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* address_line_1 (varchar(255)) first line of address of assignee&lt;br /&gt;
* address_line_2 (varchar(255)) second line of address of assignee&lt;br /&gt;
* city (varchar(255)) city of assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* country (varchar(255)) country of assignee&lt;br /&gt;
* postal_code (varchar(255)) post code of the assignee&lt;br /&gt;
&lt;br /&gt;
===ASSIGNOR===&lt;br /&gt;
	&lt;br /&gt;
Purpose: store all the assignors from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the one assigning/granting the license for the entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id and assignor_name&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
&lt;br /&gt;
===CONVEYANCE===&lt;br /&gt;
Purpose: represent the conveyance of a particular patent over time&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual transaction for a patent (not unlike assignment, except this table has information about the assignment in relation to other assignments for the patent)&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id &lt;br /&gt;
&lt;br /&gt;
All columns (variables): Two possibilities, depending on whether we want data to be repeated or not.&lt;br /&gt;
	&lt;br /&gt;
Not-repeated (with exception of primary/foreign key): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* next_transaction (bigint) rf_id of the next transaction that occurred after this transaction (traced by assignee/assignor)&lt;br /&gt;
* num_conveyance (int) number in order of conveyance by date&lt;br /&gt;
* acknowledgment_date (date) date that the assignor acknowledged the transaction&lt;br /&gt;
* execution_date (date) date that the transaction took place&lt;br /&gt;
* conveyance_type (varchar(255)) the USPTO paper determined a couple of different conveyance types and also outlined how to determine based on the conveyance text in the xml file which conveyance type is relevant to the transaction. The conveyance types are:&lt;br /&gt;
** assignment&lt;br /&gt;
** merger&lt;br /&gt;
** change of name&lt;br /&gt;
** government interest&lt;br /&gt;
** agreement&lt;br /&gt;
** security agreement&lt;br /&gt;
** release&lt;br /&gt;
&lt;br /&gt;
	&lt;br /&gt;
Repeated, also include: &lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
* patent_number (varchar(255)) identifying number for the patent&lt;br /&gt;
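&lt;br /&gt;
As a rough illustration of how num_conveyance and next_transaction might be filled in once the other tables are populated, here is a minimal Python sketch. It is not existing project code. It assumes each rf_id covers a single patent and orders transactions purely by execution_date, which is a simplification of tracing by assignee/assignor; the function name order_conveyances and the sample data are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict
from datetime import date


def order_conveyances(transactions):
    """Return {rf_id: (num_conveyance, next_transaction)}.

    transactions: iterable of (rf_id, patent_number, execution_date) tuples.
    Assumption: each rf_id refers to exactly one patent.
    """
    by_patent = defaultdict(list)
    for rf_id, patent_number, execution_date in transactions:
        by_patent[patent_number].append((execution_date, rf_id))

    result = {}
    for entries in by_patent.values():
        entries.sort()  # chronological order; ties are broken by rf_id
        for position, (_, rf_id) in enumerate(entries):
            is_last = position == len(entries) - 1
            next_rf_id = None if is_last else entries[position + 1][1]
            result[rf_id] = (position + 1, next_rf_id)
    return result


# Hypothetical sample data: two transactions for one patent, one for another.
sample = [
    (1002, "7000001", date(2016, 7, 19)),
    (1001, "7000001", date(2015, 3, 2)),
    (1003, "D500002", date(2017, 1, 5)),
]
ordered = order_conveyances(sample)
```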
&lt;br /&gt;
===DOCUMENT_INFO===&lt;br /&gt;
Purpose: store extra information relevant to the patents represented in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the documentation and the patent for a particular entry in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables) when created: &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* app_num (varchar(255)) application document USPTO number &lt;br /&gt;
* app_date (date) date that the application was filed&lt;br /&gt;
* app_country (varchar(255)) country in which it was filed&lt;br /&gt;
* pgpub_num (varchar(255)) pre-grant publication document USPTO number&lt;br /&gt;
* pgpub_date (date) date that pre-grant publication was released&lt;br /&gt;
* pgpub_country (varchar(255)) country in which the patent is published before being granted a patent number&lt;br /&gt;
* patent_number (varchar(255)) granted patent document USPTO number&lt;br /&gt;
* grant_date (date) date that the patent is officially published&lt;br /&gt;
* grant_country (varchar(255)) country in which the patent is published after being granted a patent number&lt;br /&gt;
* invention_title (varchar(255)) title of the invention&lt;br /&gt;
* language (varchar(255)) language of invention title – could be potentially useful and interesting to investigate&lt;br /&gt;
* reel_num (bigint) number of the reel the assignment was stored on&lt;br /&gt;
* frame_num (bigint) number of the frame on the reel the assignment was stored on&lt;br /&gt;
&lt;br /&gt;
===ASSIGNMENT===&lt;br /&gt;
Purpose: represent assignment transactions – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent assignment transaction&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignment_id (bigint) US patent assignment id &lt;br /&gt;
* correspondent_name (varchar(255)) name of correspondent for the transaction, could be a lawyer&lt;br /&gt;
* correspondent_address1 (varchar(255)) first line of address for correspondent&lt;br /&gt;
* correspondent_address2 (varchar(255)) second line of address for correspondent&lt;br /&gt;
* correspondent_address3 (varchar(255)) third line of address for correspondent&lt;br /&gt;
* correspondent_address4 (varchar(255)) fourth line of address for correspondent&lt;br /&gt;
* record_date (date) date recorded with USPTO&lt;br /&gt;
* last_update_date (date) date that information for this assignment was last updated with the USPTO&lt;br /&gt;
* page_cnt (bigint) page count of assignment record&lt;br /&gt;
&lt;br /&gt;
More extensive notes exist under :E/McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper&lt;br /&gt;
&lt;br /&gt;
==Patent Database Structure==&lt;br /&gt;
&lt;br /&gt;
Most of this design is based off what I investigated last semester and therefore is subject to change, since the design for the Assignment database is somewhat new. &lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table is patent_number, which uniquely identifies every patent in the PATENT table and is included in every table.&lt;br /&gt;
&lt;br /&gt;
Full list of tables: patent, assignee, citation, fee, claims, histpatent, lawyers, inventors&lt;br /&gt;
&lt;br /&gt;
Update: These tables, as of October 26, are being updated and changed as I determine what data can actually be retrieved from the XML files based on the xpaths that we have determined (you can see these xpaths on the &quot;Equivalent XPath and APS Queries&quot; project page).&lt;br /&gt;
&lt;br /&gt;
===PATENT===&lt;br /&gt;
Purpose: represent patents – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* patent_type (varchar(255)) the type of patent, like utility&lt;br /&gt;
* patent_kind (varchar(2)) the kind of patent, a letter and a number&lt;br /&gt;
* title (varchar(255)) the title of the patent&lt;br /&gt;
* grantdate (date) date that patent was officially granted including the year&lt;br /&gt;
* prioritydate (date) date used to establish novelty of an invention with regard to other inventions (source: http://www.bios.net/daisy/patentlens/2343.html)&lt;br /&gt;
* prioritycountry (varchar(255)) country where patent is first filed&lt;br /&gt;
* patent_country (varchar(255)) country in which this patent was published&lt;br /&gt;
* prioritypatentnumber (varchar(255)) placeholder patent number if the patent is published before it is fully approved - also referred to as priority claims number&lt;br /&gt;
* cpcsubclass (varchar(255)) cooperative patent classification subclass&lt;br /&gt;
* cpcsubgroup (varchar(255)) cooperative patent classification subgroup&lt;br /&gt;
* cpcmaingroup (varchar(255)) cooperative patent classification main group&lt;br /&gt;
* cpctotal (varchar(255)) concatenated version of cpcmaingroup, cpcsubgroup, cpcsubclass&lt;br /&gt;
* pctpatentnumber (varchar(255)) international patent number (according to Patent Cooperation Treaty) also known as PCT Document Number&lt;br /&gt;
* appnum (varchar(255)) application number, probably also the filing number - the format is a two digit series code followed by a six digit serial number (source: https://www.uspto.gov/patents-application-process/filing-online/info-application-number)&lt;br /&gt;
* appdate (date) date that the application was filed, including year - probably also the filing date&lt;br /&gt;
* ipcrsubclass (varchar(255)) International patent classification subclass&lt;br /&gt;
* ipcrmaingroup (varchar(255)) International patent classification main group&lt;br /&gt;
* ipcrsubgroup (varchar(255)) International patent classification subgroup&lt;br /&gt;
* ipcrtotal (varchar(255)) concatenated version of ipcrsubgroup, ipcrsubclass, ipcrmaingroup&lt;br /&gt;
* national_classification (varchar(255)) - probably uspc, which is United States Patent Classification&lt;br /&gt;
* national_classification_country (varchar(255)) - should be United States&lt;br /&gt;
* num_claims (int) number of claims (things that the inventor wishes to protect with this patent)&lt;br /&gt;
&lt;br /&gt;
Discrepancies:&lt;br /&gt;
These fields were previously a part of the table, but we do not currently have xpaths for them:&lt;br /&gt;
* app_year - the year the application was filed&lt;br /&gt;
* grant_year - the year the patent was granted&lt;br /&gt;
* uspc - United States Patent Classification (might be the same as national_classification)&lt;br /&gt;
* uspcsub - United States Patent Classification subgroup&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: Represent the person the patent was assigned to at publication according to the patent information (the information in the Assignment database is more complete)&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the assignee for a patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number, first_name, last_name (in case it is possible for there to be multiple assignees for an individual patent)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* first_name (varchar(255)) first name of assignee&lt;br /&gt;
* last_name (varchar(255)) last name of assignee&lt;br /&gt;
* address (varchar(255)) address of assignee&lt;br /&gt;
* postcode (varchar(255)) postcode of assignee&lt;br /&gt;
* orgname (varchar(255)) name of the organization owning the patent (if applicable - organization may go by multiple orgnames) &lt;br /&gt;
* city (varchar(255)) city of the assignee&lt;br /&gt;
* country (varchar(255)) country of the assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* residence (varchar(255)) mailing address of assignee if they receive mail at a different address than the one listed above&lt;br /&gt;
* size_of_firm (varchar(255)) size of the organization (large, small, or micro), which determines how large a fee they pay for the patent&lt;br /&gt;
&lt;br /&gt;
===CITATION===&lt;br /&gt;
Purpose: Represent all the patents in PATENT that another patent in PATENT cites&lt;br /&gt;
&lt;br /&gt;
Each entry represents: one citation that a patent in PATENT makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number AND cited_patent_number (because a patent might cite multiple patents, so both variables are needed for the primary key)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* cited_patent_number (varchar(255)) unique identifier from the USPTO for the patent being cited&lt;br /&gt;
* description (varchar(255)) description of the citation&lt;br /&gt;
* citation_date (date) unclear what this date is supposed to represent from the DTD, but according to the paper that is the second link in the top section, it is the &amp;quot;date of cited document&amp;quot;&lt;br /&gt;
* citation_kind (varchar(255)) type of cited document&lt;br /&gt;
* citation_country (varchar(255)) country of origin for the cited patent&lt;br /&gt;
* citation_name (varchar(255)) appears to be the names of those to whom the cited patent is assigned&lt;br /&gt;
&lt;br /&gt;
Note: There are other fields available for this table, such as document number, but this seems like it might be a repeat of the document number for the citing patent&lt;br /&gt;
&lt;br /&gt;
===FEE===&lt;br /&gt;
Purpose: Represent information about the fees paid for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the fees paid on the patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* four (boolean) payment of maintenance fee 4th year&lt;br /&gt;
* eight (boolean) payment of maintenance fee 8th year&lt;br /&gt;
* twelve (boolean) payment of maintenance fee 12th year&lt;br /&gt;
* fee_date (date) date that fee is paid on&lt;br /&gt;
* fee_code (varchar(255)) code for type of fee to pay&lt;br /&gt;
&lt;br /&gt;
===CLAIMS===&lt;br /&gt;
&lt;br /&gt;
Purpose: Represent information about the claims for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a claim that a patent makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number and claim_id&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* claim_id (varchar(255)) id number for the claim&lt;br /&gt;
* language (varchar(255)) language of claim&lt;br /&gt;
* claim_type (varchar(255)) type of claim&lt;br /&gt;
* status (varchar(255)) status of the claim&lt;br /&gt;
* claim_text (text) the text of the claim&lt;br /&gt;
&lt;br /&gt;
===HISTPATENT===&lt;br /&gt;
Purpose: Represent historical data about the patent pertaining to dates and publication&lt;br /&gt;
&lt;br /&gt;
Each entry represents: sequence of historical data about an individual patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* pubno (varchar(255)) patent publication number&lt;br /&gt;
* pubdate (date) date patent was issued&lt;br /&gt;
* dispdate (date) disposal date for the application&lt;br /&gt;
* distype (varchar(3)) application status of the patent - ABN for abandoned, ISS for issued, or PEN for pending&lt;br /&gt;
* exp_dt (date) expiration date of patent&lt;br /&gt;
* pta (int) length of patent term adjustment (which extends the amount of time the patent is allowed to be in force) in number of days&lt;br /&gt;
&lt;br /&gt;
===INVENTORS===&lt;br /&gt;
Purpose: Represent the inventors for each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about an inventor for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (and name, if there can be multiple inventors for a single patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identify the inventor?&lt;br /&gt;
* name (varchar(255)) concatenated string of the first and last name of the inventor&lt;br /&gt;
* orgname (varchar(255)) name of the organization the inventor works for&lt;br /&gt;
* state (varchar(255)) - state in which the inventor resided&lt;br /&gt;
* address (varchar(255)) - address at which the inventor resided&lt;br /&gt;
* country (varchar(255)) country in which the inventor resided&lt;br /&gt;
* city (varchar(255)) city in which the inventor resided&lt;br /&gt;
&lt;br /&gt;
===LAWYERS===&lt;br /&gt;
Purpose: Represent the lawyers who worked on each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a lawyer for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (there doesn't appear to be more than one lawyer entry for each patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identifies the lawyer?&lt;br /&gt;
* org_name (varchar(255)) the name of the firm that did arbitration for the patent&lt;br /&gt;
&lt;br /&gt;
===Unique Attributes Tables===&lt;br /&gt;
&lt;br /&gt;
Several attributes have been identified that are unique, applying to only one kind of patent. Adding all these fields to the main patent table would introduce a lot of NULL entries, because three out of the four patent types will not have any particular attribute. Therefore, I have created three new tables to store these unique attributes - one for each type of patent with unique attributes.&lt;br /&gt;
&lt;br /&gt;
For each table, the primary key is the patent_number, which connects a row to a row in the main patent table. &lt;br /&gt;
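&lt;br /&gt;
To make the linking concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is only an illustration under assumptions, not the project's schema code: the tables are cut down to a few columns, the column types are SQLite stand-ins for the varchar types above, and the patent numbers are made up.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

# In-memory database with hypothetical, cut-down tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patent (
        patent_number TEXT PRIMARY KEY,
        patent_type   TEXT
    );
    -- One narrow table per patent type; only two illustrative columns shown.
    CREATE TABLE reissue (
        patent_number      TEXT PRIMARY KEY REFERENCES patent (patent_number),
        parent_doc_number  TEXT,
        parent_doc_country TEXT
    );
    INSERT INTO patent VALUES ('RE40001', 'reissue'), ('7000001', 'utility');
    INSERT INTO reissue VALUES ('RE40001', '6000001', 'US');
""")

# LEFT JOIN keeps every patent; reissue-only columns come back as None
# for patents of any other type, so the main table stays free of NULLs.
rows = conn.execute("""
    SELECT p.patent_number, p.patent_type, r.parent_doc_number
    FROM patent AS p
    LEFT JOIN reissue AS r ON r.patent_number = p.patent_number
    ORDER BY p.patent_number
""").fetchall()
```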
&lt;br /&gt;
TODO: Figure out best way to generate these tables, finish description of attributes for each table.&lt;br /&gt;
&lt;br /&gt;
* Note that I did include some unique attributes for Utility patents as seen below in the &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot; section, but for now (11/3/2017) I'm not going to create a separate table for those fields. Most of them seem to be repeated in Reissue, and the other fields, while interesting, are perhaps not very useful.&lt;br /&gt;
&lt;br /&gt;
====DESIGN====&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* latin_name (varchar(255)) Latin name for the plant&lt;br /&gt;
* us_botanical_variety (varchar(255)) denotes what variety of plant it is - for example, a rose has several different varieties&lt;br /&gt;
&lt;br /&gt;
====REISSUE====&lt;br /&gt;
&lt;br /&gt;
In the following table, you'll notice lots of fields related to three different kinds of &quot;documents&quot; - a &quot;parent document&quot;, a &quot;child document&quot;, and a &quot;parent grant document&quot;. It is not immediately clear what these three documents represent for a reissue patent. After some research, I think I have determined what these &quot;documents&quot; are, but please know that I do not have definitive proof, as there is little available information about these &quot;documents&quot; with regard to a reissue patent.&lt;br /&gt;
&lt;br /&gt;
It is possible that the &quot;parent document&quot; is the parent patent application - that is, the first patent application filed for an invention. That would explain why reissue patents also have several fields related to a &quot;child document&quot; - a &quot;child document&quot; could be a child application, which is filed while a parent application is still pending. So in this case, a &quot;child document&quot; of a reissue patent would be an application regarding the same patent that was filed while the reissue application was still pending. A &quot;child document&quot; is either a continuation, divisional, or continuation-in-part application ([http://www.patenttrademarkblog.com/parent-and-child-patent-applications/ source]).&lt;br /&gt;
&lt;br /&gt;
Based on this logic, I think that the difference between a &quot;parent document&quot; and a &quot;parent grant document&quot; depends on whether the parent patent application has been granted. If the parent patent application is still pending (meaning the patent has not yet been granted) then I believe the reissue patent will store information about the parent patent under &quot;parent document&quot;. However, if the parent patent application has been granted, then the information will be stored under &quot;parent grant document&quot;. This seems like the most logical explanation, especially considering that the path to any field related to &quot;parent grant document&quot; contains &quot;parent document&quot; as with the example below:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* parent_doc_status (varchar(255)) status of the parent document. It is unclear whether this refers to a parent reissue application (if there happen to be multiple applications relating to the reissue of this patent) or the patent that is being reissued. This goes for all fields beginning &quot;parent_doc&quot;&lt;br /&gt;
* parent_doc_number (int) number for the parent document&lt;br /&gt;
* parent_doc_id (varchar(255)) it is unclear how this is different from parent document number. In the xpaths, the path will be something like ./parent_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_doc_kind (varchar(255)) this may actually denote what the &quot;parent document&quot; is - I will need to look in an actual XML file to see what this field looks like&lt;br /&gt;
* parent_doc_country (varchar(255)) the country of the parent document&lt;br /&gt;
* parent_doc_date (date) probably the date the parent document was published&lt;br /&gt;
 &lt;br /&gt;
In addition to parent document, there is a subcategory that has the same fields (minus status) called parent_grant_document. This is probably a patent that has been granted, given that the field names include &quot;grant&quot;, but I am not sure. Again, I am not sure what all of these fields represent.&lt;br /&gt;
&lt;br /&gt;
* parent_grant_doc_number (int) number for the parent grant document&lt;br /&gt;
* parent_grant_doc_id (varchar(255)) it is unclear how this is different from parent grant document number. In the xpaths, the path will be something like ./parent_doc/parent-grant-doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_grant_doc_kind (varchar(255)) this may actually denote what the &quot;parent grant document&quot; is - I will need to look in an actual XML file to see what this field looks like&lt;br /&gt;
* parent_grant_doc_country (varchar(255)) the country of the parent grant document&lt;br /&gt;
* parent_grant_doc_date (date) probably the date the parent grant document was published&lt;br /&gt;
&lt;br /&gt;
====PLANT====&lt;br /&gt;
&lt;br /&gt;
==Connecting Patent database and Assignment database==&lt;br /&gt;
The answer to connecting the Patent database to the Assignment database lies somewhere in using the information in the table DOCUMENT_INFO to connect to a patent_number from the Patent database PATENT table for each assignment in the Assignment database ASSIGNMENT table. On further investigation into &quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&quot;, the field they called &quot;grant_doc_num&quot; (patent_number above) in DOCUMENT_INFO is the patent number (the description of DOCUMENT_INFO above has been altered to reflect this).&lt;br /&gt;
&lt;br /&gt;
The paper mentions that there will be errors in the patent number stored in DOCUMENT_INFO under grant_doc_num, so a separate table called DOCUMENT_INFO_ADMIN was constructed to determine how prevalent errors were. They queried the patent number for each appno_doc_num from administrative data. The &quot;appno_doc_num&quot; is the Application Document USPTO number from DOCUMENT_INFO. On further investigation into the paper, it became clear that &quot;administrative data&quot; refers to internal USPTO data that was available to the authors of the paper, but that we might not have. However, in 99% of cases the grant_doc_num and the queried patent number based on the appno_doc_num were the same, so we can probably rely on grant_doc_num.&lt;br /&gt;
&lt;br /&gt;
Therefore, since every table in the Patent database will have the patent number stored, to connect any table in Assignment to the Patent database, it would first be joined with DOCUMENT_INFO on rf_id and then joined with the appropriate table in Patent on patent number.&lt;br /&gt;
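&lt;br /&gt;
The two-step join described above could look roughly like the following Python sketch, which uses the built-in sqlite3 module. The tables are hypothetical, cut-down versions of ASSIGNMENT, DOCUMENT_INFO, and PATENT with made-up rows, and the helper name assignments_with_titles is an assumption for illustration only.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

# Hypothetical, cut-down versions of the tables described on this page.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE assignment (rf_id INTEGER PRIMARY KEY, record_date TEXT);
    CREATE TABLE document_info (rf_id INTEGER PRIMARY KEY, patent_number TEXT);
    CREATE TABLE patent (patent_number TEXT PRIMARY KEY, title TEXT);

    INSERT INTO assignment VALUES (1001, '2015-03-02'), (1002, '2016-07-19');
    INSERT INTO document_info VALUES (1001, '7000001'), (1002, 'D500002');
    INSERT INTO patent VALUES ('7000001', 'Example utility patent'),
                              ('D500002', 'Example design patent');
""")


def assignments_with_titles(connection):
    """Join ASSIGNMENT to PATENT by going through DOCUMENT_INFO."""
    query = """
        SELECT a.rf_id, a.record_date, p.patent_number, p.title
        FROM assignment AS a
        JOIN document_info AS d ON d.rf_id = a.rf_id
        JOIN patent AS p ON p.patent_number = d.patent_number
        ORDER BY a.rf_id
    """
    return connection.execute(query).fetchall()


rows = assignments_with_titles(conn)
```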
&lt;br /&gt;
Also in the paper, they mentioned the Assignments on the Web for Patents (AOTW-P), a searchable database of individual USPTO assignment records keyed on reel-frame identification, patent number, and assignor or assignee name (https://assignment.uspto.gov/patent/index.html#/patent/search). Obviously it would not be possible to use this tool by hand to query all the patent numbers. If it were possible to write a script that queries each patent number using the rf_id and parses the response, this could be useful for checking the patent numbers, but it might not be any more accurate than what will already be in DOCUMENT_INFO.&lt;br /&gt;
&lt;br /&gt;
==Finding New Paths Unique to Plant, Reissue, and/or Design Patents==&lt;br /&gt;
&lt;br /&gt;
Based on Oliver's script, which searched all xpaths and compared which were unique to particular types, we see that the following attributes are unique to each type of patent other than utility patents. These attributes vary by the XML version, which changed over time. Therefore, the lists below are a superset of the attributes that are unique to each of the patent types listed below across all XML versions.&lt;br /&gt;
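&lt;br /&gt;
The comparison itself amounts to a set difference per patent type. The sketch below is not Oliver's script; it is a minimal Python illustration that assumes the xpaths have already been collected into one set per patent type. The function name unique_paths_by_type and the sample paths are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def unique_paths_by_type(paths_by_type):
    """Map each patent type to the paths that no other type uses."""
    unique = {}
    for patent_type, paths in paths_by_type.items():
        others = set()
        for other_type, other_paths in paths_by_type.items():
            if other_type != patent_type:
                others |= other_paths
        unique[patent_type] = paths - others
    return unique


# Hypothetical paths, not taken from any real XML version.
sample_paths = {
    "utility": {"biblio/title", "biblio/claims"},
    "plant": {"biblio/title", "biblio/latin-name"},
    "design": {"biblio/title", "biblio/length-of-grant"},
}
unique = unique_paths_by_type(sample_paths)
```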
&lt;br /&gt;
When I am done finding the supersets, we will determine how to integrate these attributes into the design for the patent table.&lt;br /&gt;
&lt;br /&gt;
'''There appears to be a fifth patent type in the results from Oliver's script, and there are attributes that are unique to Utility patents, though they do not appear particularly useful'''&lt;br /&gt;
&lt;br /&gt;
===Plant Patents===&lt;br /&gt;
* US Claim Statement (XML45) - unclear what this represents&lt;br /&gt;
* Latin Name (XML44)&lt;br /&gt;
* US Botanical Variety (XML44)&lt;br /&gt;
&lt;br /&gt;
===Reissue Patents===&lt;br /&gt;
* Parent Status (XML45)&lt;br /&gt;
* Parent Document (XML44) - seems to be either a normal document or a parent grant document (or both), both contain the following fields&lt;br /&gt;
** Date&lt;br /&gt;
** Document number&lt;br /&gt;
** Country &lt;br /&gt;
** Document ID&lt;br /&gt;
** Kind (XML42)&lt;br /&gt;
* Child Document (XML44)&lt;br /&gt;
** Document ID&lt;br /&gt;
** Document Number&lt;br /&gt;
** Country (XML43)&lt;br /&gt;
* Continuing Reissue (XML44)&lt;br /&gt;
** Document ID&lt;br /&gt;
** Relation&lt;br /&gt;
** Document Number&lt;br /&gt;
&lt;br /&gt;
Note about fields related to the parent document and/or parent grant document:&lt;br /&gt;
The structure of the above fields related to a reissue patent is very odd and varies by XML version. For example, for most of the XML versions, parent grant document falls under parent document and has its own document number, date, country, document ID, kind, etc. The parent doc field falls under relation, which falls under reissue.&lt;br /&gt;
&lt;br /&gt;
However, in XML41, parent document falls under relation and a new category - US Reexamination Reissue Merger. The information pertaining to the child document also falls under both relation and US Reexamination Reissue Merger. It's possible that during the time period this XML version represents, there were two types of reissue patents - some reexamination related to a merger and a standard reissue. With discrepancies like these, we'll have to determine in the end whether we want to stick with the fields as they are designed above (in which case the information about a parent doc or parent grant document is stored the same way whether it is a reissue or a Reexamination Reissue), or whether these two classifications are different enough to warrant duplicate fields in the table.&lt;br /&gt;
&lt;br /&gt;
===Design Patents===&lt;br /&gt;
* length of grant (XML45)&lt;br /&gt;
* Hague Agreement Data (XML45) - allows people to file design patents in 66 countries with one application&lt;br /&gt;
** International Registration Date&lt;br /&gt;
** International Filing Data&lt;br /&gt;
** International Registration Publication Date&lt;br /&gt;
** International Registration Number&lt;br /&gt;
* Classification Locarno (XML40)&lt;br /&gt;
** Edition&lt;br /&gt;
** Main Classification&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please note that the lists above are meant to represent unique types that should be added to the patent database. This does not mean that every XML version contains all these fields, or that every version contains the same path to these fields. That will have to be determined separately. The XML number next to some of the items is meant to represent the latest XML version where the field was seen.&lt;br /&gt;
&lt;br /&gt;
Though I was mainly looking at unique fields in plant, reissue, and design patents, I noticed the following fields on unique paths for utility patents in XML41:&lt;br /&gt;
&lt;br /&gt;
* US Related Documents&lt;br /&gt;
** Substitution&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Parent Document&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Country&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
***** Parent Status&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Country&lt;br /&gt;
** Continuation/Continuation in Part&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Doc&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
** Division&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Kind&lt;br /&gt;
&lt;br /&gt;
This appears to resemble the information about reissue patents. Additionally, a couple of fields appear under the information about bibliographic data that might be useful:&lt;br /&gt;
* US Deceased Inventor (XML41)&lt;br /&gt;
** Post Code&lt;br /&gt;
** Address&lt;br /&gt;
* US Provisional Application Status (XML41)&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=21769</id>
		<title>Redesign Assignment and Patent Database</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=21769"/>
		<updated>2017-11-10T21:40:23Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* REISSUE */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Redesign Assignment and Patent Database&lt;br /&gt;
|Has owner=Shelby Bice,&lt;br /&gt;
|Has start date=9/2017&lt;br /&gt;
|Has keywords=patent&lt;br /&gt;
|Is dependent on=Reproducible Patent Data,&lt;br /&gt;
}}&lt;br /&gt;
'''FOR ED:''' I finished going through the unique paths (you can see the unique fields I identified at the bottom of this page, under &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot;). Then, under my design for the Patent Database, there's a new section called &amp;quot;Unique Attributes Tables&amp;quot; where I've begun detailing a design for three new tables that would include information that is unique to Design, Reissue, and Plant patents. Here are some questions I have:&lt;br /&gt;
&lt;br /&gt;
'''1.''' Do you like the design for the new tables? How do you think we should populate these tables? I can think of one way (putting all the information in the main patent tables and then moving it out to Plant, Reissue, or Design as needed) but I was wondering if there might be an easier way. &lt;br /&gt;
&lt;br /&gt;
'''2.''' There were some unique attributes to utility patents that I also included in &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot;. They only appear to be in two of the oldest XML versions (4.1 and 4.0) and I'm not sure the information is particularly useful - but please take a look at the field names and let me know if you disagree and think they are useful and should be included in a table in the patent database.&lt;br /&gt;
&lt;br /&gt;
'''3.''' Do you know what some of these fields represent? For example, I've been trying to research online what a &amp;quot;Parent Document&amp;quot; and/or &amp;quot;Parent Grant Document&amp;quot; might represent for a reissue patent, and I've found a couple possible options. I will keep researching, but if you know what they represent or one of my descriptions for a field is wrong, please let me know.&lt;br /&gt;
&lt;br /&gt;
This is an extension of the work I did last semester under &amp;quot;Redesigning Patent Database&amp;quot;. Instead of simply reconfiguring the existing database, this project encompasses a full redesign and creation of a new Patent database and a new Assignment database that will be joined together.&lt;br /&gt;
&lt;br /&gt;
Reference for definitions of fields in the two databases: https://www.oecd.org/sti/sci-tech/37569498.pdf&lt;br /&gt;
&lt;br /&gt;
Paper on how to extract patent data from XML files, parse it, and put it into a database: https://funginstitute.berkeley.edu/wp-content/uploads/2014/06/patentprocessor.pdf&lt;br /&gt;
&lt;br /&gt;
'''Particularly useful, because it provides solutions for cleaning location and firm data'''&lt;br /&gt;
&lt;br /&gt;
==ER Diagram for Assignment and Patent Databases==&lt;br /&gt;
&lt;br /&gt;
Note: this is currently out of sync with the current description below for the patent database. It will be updated once I have finalized how we will organize unique attributes for plant, design, and reissue patents.&lt;br /&gt;
&lt;br /&gt;
[[File:PatentAndAssignmentER2.png]]&lt;br /&gt;
&lt;br /&gt;
Attributes for each table are listed below with descriptions - because of how many attributes there are, I decided the ER diagram would be better suited as an overview of the tables rather than trying to show all the attributes on the diagram.&lt;br /&gt;
&lt;br /&gt;
==Assignment Database Structure==&lt;br /&gt;
&lt;br /&gt;
After reading through the paper &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot; (https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf) and taking into account our needs for the Assignment database (the ability to connect the two databases together, the ability to trace conveyance of a patent over time), I decided on the following structure for the Assignment database.&lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table in the database is rf_id, which stands for reel frame id. It is unique for each entry in the main table, assignment, and is included in every table. &lt;br /&gt;
&lt;br /&gt;
Conveyance is the only table that cannot be fully constructed as data is inserted into the database. Some of the fields depend on other tables being constructed, so it may either be partially populated while the other tables are being populated or be the last table to be populated. &lt;br /&gt;
&lt;br /&gt;
Full list of tables: assignee, assignor, document_info, assignment, conveyance&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: store all the assignees from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the recipient of a patent in an entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id, unless it is possible to have multiple assignees, then rf_id and assignee name&lt;br /&gt;
&lt;br /&gt;
All columns (variables) in the table: &lt;br /&gt;
&lt;br /&gt;
* rf_id (bigint) reel frame number &lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* address_line_1 (varchar(255)) first line of address of assignee&lt;br /&gt;
* address_line_2 (varchar(255)) second line of address of assignee&lt;br /&gt;
* city (varchar(255)) city of assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* country (varchar(255)) country of assignee&lt;br /&gt;
* postal_code (varchar(255)) post code of the assignee&lt;br /&gt;
&lt;br /&gt;
===ASSIGNOR===&lt;br /&gt;
	&lt;br /&gt;
Purpose: store all the assignors from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the one assigning/granting the license for the entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id and assignor_name&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
&lt;br /&gt;
===CONVEYANCE===&lt;br /&gt;
Purpose: represent the conveyance of a particular patent over time&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual transaction for a patent (not unlike assignment, except this table has information about the assignment in relation to other assignments for the patent)&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id &lt;br /&gt;
&lt;br /&gt;
All columns (variables): Two possibilities, depending on whether we want data to be repeated or not.&lt;br /&gt;
	&lt;br /&gt;
Not-repeated (with exception of primary/foreign key): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* next_transaction (bigint) rf_id of next transaction that occurred after this transaction (traced by assignee/assignor)&lt;br /&gt;
* num_conveyance (int) number in order of conveyance by date&lt;br /&gt;
* acknowledgment_date (date) date that the assignor acknowledged the transaction&lt;br /&gt;
* execution_date (date) date that the transaction took place&lt;br /&gt;
* conveyance_type (varchar(255)) the USPTO paper determined a couple of different conveyance types and also outlined how to determine based on the conveyance text in the xml file which conveyance type is relevant to the transaction. The conveyance types are:&lt;br /&gt;
** assignment&lt;br /&gt;
** merger&lt;br /&gt;
** change of name&lt;br /&gt;
** government interest&lt;br /&gt;
** agreement&lt;br /&gt;
** security agreement&lt;br /&gt;
** release&lt;br /&gt;
&lt;br /&gt;
	&lt;br /&gt;
Repeated, also include: &lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
* patent_number (varchar(255)) identifying number for the patent&lt;br /&gt;
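&lt;br /&gt;
A rough sketch of how num_conveyance and next_transaction might be filled in once the other tables are populated. It assumes that ordering by execution_date is a good enough stand-in for tracing by assignee/assignor, which has not been verified; the function name and the sample rows are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict


def number_conveyances(transactions):
    """Map rf_id to (num_conveyance, next_transaction).

    transactions: iterable of (rf_id, patent_number, execution_date) tuples,
    with dates given as ISO strings so that they sort chronologically.
    """
    by_patent = defaultdict(list)
    for rf_id, patent_number, execution_date in transactions:
        by_patent[patent_number].append((execution_date, rf_id))

    result = {}
    for items in by_patent.values():
        items.sort()  # oldest transaction first
        for position, (execution_date, rf_id) in enumerate(items):
            if position + 1 == len(items):
                next_rf_id = None  # last known transaction for this patent
            else:
                next_rf_id = items[position + 1][1]
            result[rf_id] = (position + 1, next_rf_id)
    return result


# Made-up rows for illustration only.
sample = [
    (3, "A", "2015-01-01"),
    (1, "A", "2010-05-05"),
    (2, "B", "2012-02-02"),
]
conveyance = number_conveyances(sample)
print(conveyance)
```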
&lt;br /&gt;
===DOCUMENT_INFO===&lt;br /&gt;
Purpose: store extra information relevant to the patents represented in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the documentation and the patent for a particular entry in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables) when created: &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* app_num (varchar(255)) application document USPTO number &lt;br /&gt;
* app_date (date) date that the application was filed&lt;br /&gt;
* app_country (varchar(255)) country in which it was filed&lt;br /&gt;
* pgpub_num (varchar(255)) pre-grant publication document USPTO number&lt;br /&gt;
* pgpub_date (date) date that pre-grant publication was released&lt;br /&gt;
* pgpub_country (varchar(255)) country in which the patent is published before being granted a patent number&lt;br /&gt;
* patent_number (varchar(255)) granted patent document USPTO number&lt;br /&gt;
* grant_date (date) date that the patent is officially published&lt;br /&gt;
* grant_country (varchar(255)) country in which the patent is published after being granted a patent number&lt;br /&gt;
* invention_title (varchar(255)) title of the invention&lt;br /&gt;
* language (varchar(255)) language of invention title – could be potentially useful and interesting to investigate &lt;br /&gt;
* reel_num (bigint) number of the reel the assignment was stored on&lt;br /&gt;
* frame_num (bigint) number of the frame on the reel the assignment was stored on&lt;br /&gt;
&lt;br /&gt;
===ASSIGNMENT===&lt;br /&gt;
Purpose: represent assignment transactions – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent assignment transaction&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignment_id (bigint) US patent assignment id &lt;br /&gt;
* correspondent_name (varchar(255)) name of correspondent for the transaction, could be a lawyer&lt;br /&gt;
* correspondent_address1 (varchar(255)) first line of address for correspondent&lt;br /&gt;
* correspondent_address2 (varchar(255)) second line of address for correspondent&lt;br /&gt;
* correspondent_address3 (varchar(255)) third line of address for correspondent&lt;br /&gt;
* correspondent_address4 (varchar(255)) fourth line of address for correspondent&lt;br /&gt;
* record_date (date) date recorded with USPTO&lt;br /&gt;
* last_update_date (date) date that information for this assignment was last updated with the USPTO &lt;br /&gt;
* page_cnt (bigint) page count of assignment record&lt;br /&gt;
&lt;br /&gt;
More extensive notes exist under :E/McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper&lt;br /&gt;
&lt;br /&gt;
==Patent Database Structure==&lt;br /&gt;
&lt;br /&gt;
Most of this design is based off what I investigated last semester and therefore is subject to change, since the design for the Assignment database is somewhat new. &lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table is patent_number, which uniquely identifies every patent in the PATENT table and is included in every table.&lt;br /&gt;
&lt;br /&gt;
Full list of tables: patent, assignee, citation, fee, claims, histpatent, lawyers, inventors&lt;br /&gt;
&lt;br /&gt;
Update: These tables, as of October 26, are being updated and changed as I determine what data can actually be retrieved from the XML files based on the xpaths that we have determined (you can see these xpaths on the &amp;quot;Equivalent XPath and APS Queries&amp;quot; project page).&lt;br /&gt;
&lt;br /&gt;
===PATENT===&lt;br /&gt;
Purpose: represent patents – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO office, can contain letters hence varchar&lt;br /&gt;
* patent_type (varchar(255)) the type of patent, like utility&lt;br /&gt;
* patent_kind (varchar(2)) the kind of patent, a letter and a number&lt;br /&gt;
* title (varchar(255)) the title of the patent&lt;br /&gt;
* grantdate (date) date that patent was officially granted including the year&lt;br /&gt;
* prioritydate (date) date used to establish novelty of an invention in regards to other inventions (source: http://www.bios.net/daisy/patentlens/2343.html)&lt;br /&gt;
* prioritycountry (varchar(255)) country where patent is first filed&lt;br /&gt;
* patent_country (varchar(255)) country in which this patent was published&lt;br /&gt;
* prioritypatentnumber (varchar(255)) placeholder patent number if the patent is published before it is fully approved - also referred to as priority claims number&lt;br /&gt;
* cpcsubclass (varchar(255)) cooperative patent classification subclass&lt;br /&gt;
* cpcsubgroup (varchar(255)) cooperative patent classification subgroup&lt;br /&gt;
* cpcmaingroup (varchar(255)) cooperative patent classification main group&lt;br /&gt;
* cpctotal (varchar(255)) concatenated version of cpcmaingroup, cpcsubgroup, cpcsubclass&lt;br /&gt;
* pctpatentnumber (varchar(255)) international patent number (according to Patent Cooperation Treaty) also known as PCT Document Number&lt;br /&gt;
* appnum(varchar(255)) application number, probably also the filing number - the format is a two digit series code followed by a six digit serial number (source: https://www.uspto.gov/patents-application-process/filing-online/info-application-number)&lt;br /&gt;
* appdate (date) date that the application was filed, including year - probably also the filing date&lt;br /&gt;
* ipcrsubclass (varchar(255)) International patent classification subclass&lt;br /&gt;
* ipcrmaingroup (varchar(255)) International patent classification main group&lt;br /&gt;
* ipcrsubgroup (varchar(255)) International patent classification subgroup&lt;br /&gt;
* ipcrtotal (varchar(255)) concatenated version of ipcrsubgroup, ipcrsubclass, ipcrmaingroup &lt;br /&gt;
* national_classification (varchar(255)) - probably uspc, which is United States Patent Classification&lt;br /&gt;
* national_classification_country (varchar(255)) - should be United States&lt;br /&gt;
* num_claims (int) number of claims (things that the inventor wishes to protect with this patent)&lt;br /&gt;
&lt;br /&gt;
Discrepancies: &lt;br /&gt;
These fields were previously a part of the table, but we do not currently have xpaths for them:&lt;br /&gt;
* app_year - the year the application was filed&lt;br /&gt;
* grant_year - the year the patent was granted&lt;br /&gt;
* uspc - United States Patent Classification (might be the same as national_classification)&lt;br /&gt;
* uspcsub - United States Patent subgroup&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: Represent the person the patent was currently assigned to at publication according to the patent information (the information in the Assignment database is more complete) &lt;br /&gt;
&lt;br /&gt;
Each entry represents: the assignee for a patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number, first_name, last_name (in case it is possible for there to be multiple assignees for an individual patent)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO office, can contain letters hence varchar &lt;br /&gt;
* first_name (varchar(255)) first name of assignee&lt;br /&gt;
* last_name (varchar(255)) last name of assignee&lt;br /&gt;
* address (varchar(255)) address of assignee&lt;br /&gt;
* postcode (varchar(255)) postcode of assignee&lt;br /&gt;
* orgname (varchar(255)) name of the organization owning the patent (if applicable - organization may go by multiple orgnames) &lt;br /&gt;
* city (varchar(255)) city of the assignee&lt;br /&gt;
* country (varchar(255)) country of the assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* residence (varchar(255)) mailing address of assignee if they receive mail at a different address than the one listed above&lt;br /&gt;
* size_of_firm (varchar(255)) size of the organization (large, small, or micro), which determines how much they pay in patent fees&lt;br /&gt;
&lt;br /&gt;
===CITATION===&lt;br /&gt;
Purpose: Represent all the patents in PATENT that another patent in PATENT cites&lt;br /&gt;
&lt;br /&gt;
Each entry represents: one citation that a patent in PATENT makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number AND cited_patent_number (because a patent might cite multiple patents, so both variables are needed for the primary key)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO office, can contain letters hence varchar&lt;br /&gt;
* cited_patent_number (varchar(255)) unique identifier from the USPTO office for the patent being cited&lt;br /&gt;
* description (varchar(255)) description of the citation&lt;br /&gt;
* citation_date (date) unclear what this date is supposed to represent from the DTD, but according to the paper that is the second link in the top section, it is the &amp;quot;date of cited document&amp;quot;&lt;br /&gt;
* citation_kind (varchar(255)) type of cited document&lt;br /&gt;
* citation_country (varchar(255)) country of origin for the cited patent &lt;br /&gt;
* citation_name (varchar(255)) appears to be the names of those to whom the cited patent is assigned&lt;br /&gt;
&lt;br /&gt;
Note: There are other fields available for this table, such as document number, but this seems like it might be a repeat of the document number for the citing patent&lt;br /&gt;
&lt;br /&gt;
===FEE===&lt;br /&gt;
Purpose: Represent information about the fees paid for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the fees paid on the patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO office, can contain letters hence varchar  &lt;br /&gt;
* four (boolean) payment of maintenance fee 4th year&lt;br /&gt;
* eight (boolean) payment of maintenance fee 8th year&lt;br /&gt;
* twelve (boolean) payment of maintenance fee 12th year&lt;br /&gt;
* fee_date (date) date that fee is paid on&lt;br /&gt;
* fee_code (varchar(255)) code for type of fee to pay&lt;br /&gt;
&lt;br /&gt;
===CLAIMS===&lt;br /&gt;
&lt;br /&gt;
Purpose: Represent information about the claims for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a claim that a patent makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number and claim_id&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO office, can contain letters hence varchar&lt;br /&gt;
* claim_id (varchar(255)) id number for the claim&lt;br /&gt;
* language (varchar(255)) language of claim&lt;br /&gt;
* claim_type (varchar(255)) type of claim&lt;br /&gt;
* status (varchar(255)) status of the claim&lt;br /&gt;
* claim_text (text) the text of the claim&lt;br /&gt;
&lt;br /&gt;
===HISTPATENT===&lt;br /&gt;
Purpose: Represent historical data about the patent pertaining to dates and publication&lt;br /&gt;
&lt;br /&gt;
Each entry represents: sequence of historical data about an individual patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO office, can contain letters hence varchar&lt;br /&gt;
* pubno (varchar(255)) patent publication number&lt;br /&gt;
* pubdate (date) date patent was issued&lt;br /&gt;
* dispdate (date) disposal date for the application&lt;br /&gt;
* distype (varchar(3)) application status of the patent - ABN for abandoned, ISS for issued, or PEN for pending&lt;br /&gt;
* exp_dt (date) expiration date of patent&lt;br /&gt;
* pta (int) length of patent term adjustment (which extends the amount of time the patent is allowed to be in force) in number of days&lt;br /&gt;
&lt;br /&gt;
===INVENTORS===&lt;br /&gt;
Purpose: Represent the inventor for each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about an inventor for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (and name, if there can be multiple inventors for a single patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO office, can contain letters hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identify the inventor?&lt;br /&gt;
* name (varchar(255)) concatenated string of the first and last name of the inventor&lt;br /&gt;
* orgname (varchar(255)) name of the organization the inventor works for&lt;br /&gt;
* state (varchar(255)) - state in which the inventor resided&lt;br /&gt;
* address (varchar(255)) - address at which the inventor resided&lt;br /&gt;
* country (varchar(255)) country in which the inventor resided&lt;br /&gt;
* city (varchar(255)) city in which the inventor resided&lt;br /&gt;
&lt;br /&gt;
===LAWYERS===&lt;br /&gt;
Purpose: Represent the lawyers who worked on each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a lawyer for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (there doesn't appear to be more than one lawyer entry for each patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO office, can contain letters hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identifies the lawyer?&lt;br /&gt;
* org_name (varchar(255)) the name of the firm that did arbitration for the patent&lt;br /&gt;
&lt;br /&gt;
===Unique Attributes Tables===&lt;br /&gt;
&lt;br /&gt;
Several attributes have been identified that are unique, applying to only one kind of patent. Adding all these fields to the main patent table would introduce a lot of NULL entries, because three out of four patent types will not have any particular attribute. Therefore, I have created three new tables to store these unique attributes - one for each type of patent with unique attributes. &lt;br /&gt;
&lt;br /&gt;
For each table, the primary key is the patent_number, which connects a row to a row in the main patent table. &lt;br /&gt;
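&lt;br /&gt;
A minimal sketch of the idea, using Python's sqlite3 module with an in-memory database. It assumes a table named plant holding the plant-only fields listed under Plant Patents at the bottom of this page; the table name, column types, and rows are made up for illustration and are not the final design.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

# In-memory database; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patent (patent_number TEXT PRIMARY KEY, patent_type TEXT);
CREATE TABLE plant (
    patent_number TEXT PRIMARY KEY REFERENCES patent (patent_number),
    latin_name TEXT,
    us_botanical_variety TEXT
);
INSERT INTO patent VALUES ('PP00001', 'plant');
INSERT INTO patent VALUES ('0000002', 'utility');
INSERT INTO plant VALUES ('PP00001', 'Rosa hybrida', 'Example variety');
""")

# Only plant patents have a row in plant, so the main table carries
# no NULL-filled plant columns for the other patent types.
plant_rows = conn.execute("""
SELECT p.patent_number, p.patent_type, t.latin_name
FROM patent AS p
JOIN plant AS t ON p.patent_number = t.patent_number
""").fetchall()
print(plant_rows)
```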
&lt;br /&gt;
TODO: Figure out best way to generate these tables, finish description of attributes for each table.&lt;br /&gt;
&lt;br /&gt;
* Note that I did include some unique attributes for Utility patents as seen below in the &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot; section, but for now (11/3/2017) I'm not going to create a separate table for those fields. Most of them seem to be repeated in Reissue, and the other fields, while interesting, are perhaps not very useful.&lt;br /&gt;
&lt;br /&gt;
====DESIGN====&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO office, can contain letters hence varchar&lt;br /&gt;
* latin_name (varchar(255)) latin name for the plant&lt;br /&gt;
* us_botanical_variety(varchar(255)) denotes what variety of plant it is - for example, a rose has several different varieties&lt;br /&gt;
&lt;br /&gt;
====REISSUE====&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO office, can contain letters hence varchar&lt;br /&gt;
* parent_doc_status (varchar(255)) status of the parent document. It is unclear whether this refers to a parent reissue application (if there happen to be multiple applications relating to the reissue of this patent) or the patent that is being reissued. This goes for all fields beginning &amp;quot;parent_doc&amp;quot;&lt;br /&gt;
* parent_doc_number (int) number for the parent document&lt;br /&gt;
* parent_doc_id (varchar(255)) it is unclear how this is different from parent document number. In the xpaths, the path will be something like ./parent_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent document&amp;quot; is - I will need to look in an actual XML file to see what this field looks like&lt;br /&gt;
* parent_doc_country (varchar(255)) the country of the parent document&lt;br /&gt;
* parent_doc_date (date) probably the date the parent document was published&lt;br /&gt;
 &lt;br /&gt;
In addition to parent document, there is a subcategory that has the same fields (minus status) called parent_grant_document. This is probably a patent that has been granted, given that the field name includes grant, but I am not sure. Again, I am not sure what all of these fields represent.&lt;br /&gt;
&lt;br /&gt;
* parent_grant_doc_number (int) number for the parent grant document&lt;br /&gt;
* parent_grant_doc_id (varchar(255)) it is unclear how this is different from parent grant document number. In the xpaths, the path will be something like ./parent_doc/parent-grant-doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_grant_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent grant document&amp;quot; is - I will need to look in an actual XML file to see what this field looks like&lt;br /&gt;
* parent_grant_doc_country (varchar(255)) the country of the parent grant document&lt;br /&gt;
* parent_grant_doc_date (date) probably the date the parent grant document was published&lt;br /&gt;
&lt;br /&gt;
It is possible that the parent document is the parent application - that is, the first patent application filed in regards to an invention. That would explain why Reissue also has several fields related to a child document - a child document could be a child application, which is filed while a parent application is still pending. So in this case, a child document of a reissue patent would be an application regarding the same patent that was filed while the reissue application was still pending. A child document is either a continuation, divisional, or continuation-in-part application ([http://www.patenttrademarkblog.com/parent-and-child-patent-applications/ source]).&lt;br /&gt;
&lt;br /&gt;
====PLANT====&lt;br /&gt;
&lt;br /&gt;
==Connecting Patent database and Assignment database==&lt;br /&gt;
The answer to connecting the Patent database to the Assignment database lies somewhere in using the information in the table DOCUMENT_INFO to connect to a patent_number from the Patent database PATENT table for each assignment in the Assignment database ASSIGNMENT table. On further investigation into &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot;, the field they called &amp;quot;grant_doc_num&amp;quot; (patent_number above) in DOCUMENT_INFO is the patent number (the description of DOCUMENT_INFO above has been altered to reflect this).&lt;br /&gt;
&lt;br /&gt;
The paper mentions that there will be errors in the patent number stored in DOCUMENT_INFO under grant_doc_num, so a separate table called DOCUMENT_INFO_ADMIN was constructed to determine how prevalent errors were. They queried the patent number for the appno_doc_num from administrative data. The &amp;quot;appno_doc_num&amp;quot; is the Application Document USPTO number from DOCUMENT_INFO. On further investigation into the paper, it became clear that &amp;quot;administrative data&amp;quot; refers to internal USPTO data that was available to the authors of the paper, but that we might not have. However, in 99% of cases the grant_doc_num and the queried patent number based on the appno_doc_num were the same, so we can probably rely on grant_doc_num.&lt;br /&gt;
&lt;br /&gt;
Therefore, since every table in the Patent database will have the patent number stored, to connect any table in Assignment to the Patent database, it would first be joined with DOCUMENT_INFO on rf_id and then joined with the appropriate table in Patent on patent number.&lt;br /&gt;
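&lt;br /&gt;
A minimal sketch of that two-step join, using Python's sqlite3 module with an in-memory database. Only a few of the columns described above are included, the column types are simplified, and the rows are made up for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

# In-memory database with invented rows; column names follow the design above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE assignee (rf_id INTEGER, assignee_name TEXT);
CREATE TABLE document_info (rf_id INTEGER, patent_number TEXT);
CREATE TABLE patent (patent_number TEXT PRIMARY KEY, title TEXT);
INSERT INTO assignee VALUES (1001, 'Example Corp');
INSERT INTO document_info VALUES (1001, 'D0000001');
INSERT INTO patent VALUES ('D0000001', 'Example chair design');
""")

# Step 1: Assignment table to DOCUMENT_INFO on rf_id.
# Step 2: DOCUMENT_INFO to the Patent database on patent_number.
JOIN_SQL = """
SELECT a.assignee_name, p.patent_number, p.title
FROM assignee AS a
JOIN document_info AS d ON a.rf_id = d.rf_id
JOIN patent AS p ON d.patent_number = p.patent_number
"""
rows = conn.execute(JOIN_SQL).fetchall()
print(rows)
```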
&lt;br /&gt;
Also in the paper, they mentioned the Assignments on the Web for Patents (AOTW-P), a searchable database of individual USPTO assignment records keyed on reel-frame identification, patent number, and assignor or assignee name (https://assignment.uspto.gov/patent/index.html#/patent/search). Obviously it would not be possible to use this tool by hand to query all the patent numbers, but if it were possible to write a script to query each patent number using the rf_id and parse the response, this could potentially be useful to check the patent numbers. It might not be any more accurate than what will already be in DOCUMENT_INFO.&lt;br /&gt;
&lt;br /&gt;
==Finding New Paths Unique to Plant, Reissue, and/or Design Patents==&lt;br /&gt;
&lt;br /&gt;
Based on Oliver's script, which searched all xpaths and compared which were unique to particular types, we see that the following attributes are unique to each type of patent other than utility patents. These attributes vary by the XML version, which changed over time. Therefore, the lists below are a superset of the attributes that are unique to each of the patent types listed below across all XML versions.&lt;br /&gt;
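&lt;br /&gt;
The comparison can be illustrated with plain set operations. This is not Oliver's actual script, only a sketch of the idea; the function name and the example xpaths are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def unique_paths_by_type(paths_by_type):
    """Return, for each patent type, the xpaths seen only for that type."""
    unique = {}
    for patent_type, paths in paths_by_type.items():
        # Collect every xpath that appears under any other patent type.
        others = set()
        for other_type, other_paths in paths_by_type.items():
            if other_type != patent_type:
                others |= set(other_paths)
        unique[patent_type] = set(paths) - others
    return unique


# Made-up xpaths for illustration only.
example = {
    "utility": {"./doc/title", "./doc/claims"},
    "plant": {"./doc/title", "./doc/latin-name"},
    "reissue": {"./doc/title", "./doc/parent-doc"},
}
unique_paths = unique_paths_by_type(example)
print(unique_paths["plant"])
```

Running this once per XML version and taking the union of the per-type results would give a superset like the lists below.&lt;br /&gt;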
&lt;br /&gt;
When I am done finding the supersets, we will determine how to integrate these attributes into the design for the patent table.&lt;br /&gt;
&lt;br /&gt;
'''There appears to be a fifth patent type in the results from Oliver's script, and there are attributes that are unique to Utility patents, though they do not appear particularly useful'''&lt;br /&gt;
&lt;br /&gt;
===Plant Patents===&lt;br /&gt;
* US Claim Statement (XML45) - unclear what this represents&lt;br /&gt;
* Latin Name (XML44)&lt;br /&gt;
* US Botanical Variety (XML44)&lt;br /&gt;
&lt;br /&gt;
===Reissue Patents===&lt;br /&gt;
* Parent Status (XML45)&lt;br /&gt;
* Parent Document (XML44) - seems to be either a normal document or a parent grant document (or both), both contain the following fields&lt;br /&gt;
** Date&lt;br /&gt;
** Document number&lt;br /&gt;
** Country &lt;br /&gt;
** Document ID&lt;br /&gt;
** Kind (XML42)&lt;br /&gt;
* Child Document (XML44)&lt;br /&gt;
** Document ID&lt;br /&gt;
** Document Number&lt;br /&gt;
** Country (XML43)&lt;br /&gt;
* Continuing Reissue (XML44)&lt;br /&gt;
** Document ID&lt;br /&gt;
** Relation&lt;br /&gt;
** Document Number&lt;br /&gt;
&lt;br /&gt;
Note about fields related to the parent document and/or parent grant document:&lt;br /&gt;
The structure of the above fields related to a reissue patent is very odd and varies by XML version. For example, for most of the XML versions, parent grant document falls under parent document and has its own document number, date, country, document ID, kind, etc. The parent doc field falls under relation, which falls under reissue. &lt;br /&gt;
&lt;br /&gt;
However, in XML41, parent document falls under relation and a new category - US Reexamination Reissue Merger. The information pertaining to the child document also falls under both relation and US Reexamination Reissue Merger. It's possible that during the time period this XML version represents, there were two types of reissue patents - a reexamination related to a merger and a standard reissue. With discrepancies like these, we'll have to determine in the end whether we want to stick with the fields as they are designed above (in which case the information about a parent document or parent grant document is stored the same way whether it is a reissue or a Reexamination Reissue), or whether these two classifications are different enough to warrant duplicate fields in the table.&lt;br /&gt;
&lt;br /&gt;
===Design Patents===&lt;br /&gt;
* length of grant (XML45)&lt;br /&gt;
* Hague Agreement Data (XML45) - allows people to file design patents in 66 countries with one application&lt;br /&gt;
** International Registration Date&lt;br /&gt;
** International Filing Date&lt;br /&gt;
** International Registration Publication Date&lt;br /&gt;
** International Registration Number&lt;br /&gt;
* Classification Locarno (XML40)&lt;br /&gt;
** Edition&lt;br /&gt;
** Main Classification&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please note that the lists above are meant to represent unique fields that should be added to the patent database. This does not mean that every XML version contains all these fields, or that every version contains the same path to these fields. That will have to be determined separately. The XML number next to some of the items is meant to represent the latest XML version where the field was seen.&lt;br /&gt;
&lt;br /&gt;
Though I was mainly looking at unique fields in plant, reissue, and design patents, I noticed the following fields on unique paths for utility patents in XML41:&lt;br /&gt;
&lt;br /&gt;
* US Related Documents&lt;br /&gt;
** Substitution&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Parent Document&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Country&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
***** Parent Status&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Country&lt;br /&gt;
** Continuation/Continuation in Part&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Doc&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
** Division&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Kind&lt;br /&gt;
&lt;br /&gt;
This appears to resemble the information about reissue patents. Additionally, a couple of fields appear under the information about bibliographic data that might be useful:&lt;br /&gt;
* US Deceased Inventor (XML41)&lt;br /&gt;
** Post Code&lt;br /&gt;
** Address&lt;br /&gt;
* US Provisional Application Status (XML41)&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=21768</id>
		<title>Redesign Assignment and Patent Database</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=21768"/>
		<updated>2017-11-10T21:33:59Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* REISSUE */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Redesign Assignment and Patent Database&lt;br /&gt;
|Has owner=Shelby Bice,&lt;br /&gt;
|Has start date=9/2017&lt;br /&gt;
|Has keywords=patent&lt;br /&gt;
|Is dependent on=Reproducible Patent Data,&lt;br /&gt;
}}&lt;br /&gt;
'''FOR ED:''' I finished going through the unique paths (you can see the unique fields I identified at the bottom of this page, under &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot;). Then, under my design for the Patent Database, there's a new section called &amp;quot;Unique Attributes Tables&amp;quot; where I've begun detailing a design for three new tables that would include information that is unique to Design, Reissue, and Plant patents. Here are some questions I have:&lt;br /&gt;
&lt;br /&gt;
'''1.''' Do you like the design for the new tables? How do you think we should populate these tables? I can think of one way (putting all the information in the main patent tables and then moving it out to Plant, Reissue, or Design as needed) but I was wondering if there might be an easier way. &lt;br /&gt;
&lt;br /&gt;
'''2.''' There were some unique attributes to utility patents that I also included in &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot;. They only appear to be in two of the oldest XML versions (4.1 and 4.0) and I'm not sure the information is particularly useful - but please take a look at the field names and let me know if you disagree and think they are useful and should be included in a table in the patent database.&lt;br /&gt;
&lt;br /&gt;
'''3.''' Do you know what some of these fields represent? For example, I've been trying to research online what a &amp;quot;Parent Document&amp;quot; and/or &amp;quot;Parent Grant Document&amp;quot; might represent for a reissue patent, and I've found a couple possible options. I will keep researching, but if you know what they represent or one of my descriptions for a field is wrong, please let me know.&lt;br /&gt;
&lt;br /&gt;
This is an extension of the work I did last semester under &amp;quot;Redesigning Patent Database&amp;quot;. Instead of simply reconfiguring the existing database, this project encompasses a full redesign and creation of a new Patent database and a new Assignment database that will be joined together.&lt;br /&gt;
&lt;br /&gt;
Reference for definitions of fields in the two databases: https://www.oecd.org/sti/sci-tech/37569498.pdf&lt;br /&gt;
&lt;br /&gt;
Paper on how to extract data from XML files, parse it, and put it into a database: https://funginstitute.berkeley.edu/wp-content/uploads/2014/06/patentprocessor.pdf&lt;br /&gt;
&lt;br /&gt;
'''Particularly useful, because it provides solutions for cleaning location and firm data'''&lt;br /&gt;
&lt;br /&gt;
==ER Diagram for Assignment and Patent Databases==&lt;br /&gt;
&lt;br /&gt;
Note: this is currently out of sync with the current description below for the patent database. It will be updated once I have finalized how we will organize unique attributes for plant, design, and reissue patents.&lt;br /&gt;
&lt;br /&gt;
[[File:PatentAndAssignmentER2.png]]&lt;br /&gt;
&lt;br /&gt;
Attributes for each table are listed below with descriptions - because of how many attributes there are, I decided the ER diagram would be better suited as an overview of the tables rather than trying to show all the attributes on the diagram.&lt;br /&gt;
&lt;br /&gt;
==Assignment Database Structure==&lt;br /&gt;
&lt;br /&gt;
After reading through the paper &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot; (https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf) and taking into account our needs for the Assignment database (the ability to connect the two databases together, the ability to trace conveyance of a patent over time), I decided on the following structure for the Assignment database.&lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table in the database is rf_id, which stands for reel frame id. It is unique for each entry in the main table, ASSIGNMENT, and is included in every table. &lt;br /&gt;
&lt;br /&gt;
Conveyance is the only table that cannot be fully constructed as data is inserted into the database. Some of the fields depend on other tables being constructed, so it may either be partially populated while the other tables are being populated or be the last table to be populated. &lt;br /&gt;
&lt;br /&gt;
Full list of tables: assignee, assignor, document_info, assignment, conveyance&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: store all the assignees from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the recipient of a patent in an entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id, unless it is possible to have multiple assignees, then rf_id and assignee name&lt;br /&gt;
&lt;br /&gt;
All columns (variables) in the table: &lt;br /&gt;
&lt;br /&gt;
*rf_id (bigint) reel frame number &lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* address_line_1 (varchar(255)) first line of address of assignee&lt;br /&gt;
* address_line_2 (varchar(255)) second line of address of assignee&lt;br /&gt;
* city (varchar(255)) city of assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* country (varchar(255)) country of assignee&lt;br /&gt;
* postal_code (varchar(255)) post code of the assignee&lt;br /&gt;
&lt;br /&gt;
===ASSIGNOR===&lt;br /&gt;
	&lt;br /&gt;
Purpose: store all the assignors from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the one assigning/granting the license for the entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id and assignor_name&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
&lt;br /&gt;
===CONVEYANCE===&lt;br /&gt;
Purpose: represent the conveyance of a particular patent over time&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual transaction for a patent (not unlike assignment, except this table has information about the assignment in relation to other assignments for the patent)&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id &lt;br /&gt;
&lt;br /&gt;
All columns (variables): Two possibilities, depending on whether we want data to be repeated or not.&lt;br /&gt;
	&lt;br /&gt;
Not-repeated (with exception of primary/foreign key): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* next_transaction (bigint) rf_id of next transaction that occurred after this transaction (traced by assignee/assignor)&lt;br /&gt;
* num_conveyance (int) number in order of conveyance by date&lt;br /&gt;
* acknowledgment_date (date) date that the assignor acknowledged the transaction&lt;br /&gt;
* execution_date (date) date that the transaction took place&lt;br /&gt;
* conveyance_type (varchar(255)) the USPTO paper determined a couple of different conveyance types and also outlined how to determine based on the conveyance text in the xml file which conveyance type is relevant to the transaction. The conveyance types are:&lt;br /&gt;
** assignment&lt;br /&gt;
** merger&lt;br /&gt;
** change of name&lt;br /&gt;
** government interest&lt;br /&gt;
** agreement&lt;br /&gt;
** security agreement&lt;br /&gt;
** release&lt;br /&gt;
&lt;br /&gt;
	&lt;br /&gt;
Repeated, also include: &lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
* patent_number (varchar(255)) identifying number for the patent&lt;br /&gt;
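&lt;br /&gt;
The derived CONVEYANCE fields could be filled in once the other tables are loaded. The following is a minimal Python sketch, not the project's actual code: the sample records and the build_conveyance helper are hypothetical, it assumes one patent per rf_id, and it orders transactions by execution_date only rather than tracing assignee/assignor names.&lt;br /&gt;

```python
from collections import defaultdict

# Hypothetical sample records: (rf_id, patent_number, execution_date as ISO text).
records = [
    (30, "US0000001", "2012-05-01"),
    (10, "US0000001", "2009-01-15"),
    (20, "US0000002", "2010-07-04"),
]


def build_conveyance(records):
    """Return {rf_id: {"num_conveyance": int, "next_transaction": rf_id or None}}."""
    by_patent = defaultdict(list)
    for rf_id, patent_number, execution_date in records:
        by_patent[patent_number].append((execution_date, rf_id))

    rows = {}
    for items in by_patent.values():
        # ISO dates sort correctly as plain strings.
        items.sort()
        for position, (execution_date, rf_id) in enumerate(items):
            has_next = position + 1 != len(items)
            next_rf_id = items[position + 1][1] if has_next else None
            rows[rf_id] = {
                "num_conveyance": position + 1,
                "next_transaction": next_rf_id,
            }
    return rows


conveyance = build_conveyance(records)
print(conveyance[10])
```

Because both fields depend on every transaction for the patent being known, this step would run after ASSIGNMENT and DOCUMENT_INFO are populated.&lt;br /&gt;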
&lt;br /&gt;
===DOCUMENT_INFO===&lt;br /&gt;
Purpose: store extra information relevant to the patents represented in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the documentation and the patent for a particular entry in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables) when created: &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* app_num (varchar(255)) application document USPTO number &lt;br /&gt;
* app_date (date) date that the application was filed&lt;br /&gt;
* app_country (varchar(255)) country in which it was filed&lt;br /&gt;
* pgpub_num (varchar(255)) pre-grant publication document USPTO number&lt;br /&gt;
* pgpub_date (date) date that pre-grant publication was released&lt;br /&gt;
* pgpub_country (varchar(255)) country in which the patent is published before being granted a patent number&lt;br /&gt;
* patent_number (varchar(255)) granted patent document USPTO number&lt;br /&gt;
* grant_date (date) date that the patent is officially published&lt;br /&gt;
* grant_country (varchar(255)) country in which the patent is published after being granted a patent number&lt;br /&gt;
* invention_title (varchar(255)) title of the invention&lt;br /&gt;
* language (varchar(255)) language of invention title – could be potentially useful and interesting to investigate &lt;br /&gt;
* reel_num (bigint) number of the reel the assignment was stored on&lt;br /&gt;
* frame_num (bigint) number of the frame on the reel the assignment was stored on&lt;br /&gt;
&lt;br /&gt;
===ASSIGNMENT===&lt;br /&gt;
Purpose: represent assignment transactions – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent assignment transaction&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignment_id (bigint) US patent assignment id &lt;br /&gt;
* correspondent_name (varchar(255)) name of correspondent for the transaction, could be a lawyer&lt;br /&gt;
* correspondent_address1 (varchar(255)) first line of address for correspondent&lt;br /&gt;
* correspondent_address2 (varchar(255)) second line of address for correspondent&lt;br /&gt;
* correspondent_address3 (varchar(255)) third line of address for correspondent&lt;br /&gt;
* correspondent_address4 (varchar(255)) fourth line of address for correspondent&lt;br /&gt;
* record_date (date) date recorded with USPTO&lt;br /&gt;
* last_update_date (date) date that information for this assignment was last updated with the USPTO &lt;br /&gt;
* page_cnt (bigint) page count of assignment record&lt;br /&gt;
&lt;br /&gt;
More extensive notes exist under :E/McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper&lt;br /&gt;
&lt;br /&gt;
==Patent Database Structure==&lt;br /&gt;
&lt;br /&gt;
Most of this design is based off what I investigated last semester and therefore is subject to change, since the design for the Assignment database is somewhat new. &lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table is patent_number, which uniquely identifies every patent in the PATENT table and is included in every table.&lt;br /&gt;
&lt;br /&gt;
Full list of tables: patent, assignee, citation, fee, claims, histpatent, lawyers, inventors&lt;br /&gt;
&lt;br /&gt;
Update: These tables, as of October 26, are being updated and changed as I determine what data can actually be retrieved from the XML files based on the xpaths that we have determined (you can see these xpaths on the &amp;quot;Equivalent XPath and APS Queries&amp;quot; project page).&lt;br /&gt;
&lt;br /&gt;
===PATENT===&lt;br /&gt;
Purpose: represent patents – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* patent_type (varchar(255)) the type of patent, like utility&lt;br /&gt;
* patent_kind (varchar(2)) the kind of patent, a letter and a number&lt;br /&gt;
* title (varchar(255)) the title of the patent&lt;br /&gt;
* grantdate (date) date that patent was officially granted including the year&lt;br /&gt;
* prioritydate (date) date used to establish novelty of an invention in regards to other inventions (source: http://www.bios.net/daisy/patentlens/2343.html)&lt;br /&gt;
* prioritycountry (varchar(255)) country where patent is first filed&lt;br /&gt;
* patent_country (varchar(255)) country in which this patent was published&lt;br /&gt;
* prioritypatentnumber (varchar(255)) placeholder patent number if the patent is published before it is fully approved - also referred to as priority claims number&lt;br /&gt;
* cpcsubclass (varchar(255)) cooperative patent classification subclass&lt;br /&gt;
* cpcsubgroup (varchar(255)) cooperative patent classification subgroup&lt;br /&gt;
* cpcmaingroup (varchar(255)) cooperative patent classification main group&lt;br /&gt;
* cpctotal (varchar(255)) concatenated version of cpcmaingroup, cpcsubgroup, cpcsubclass&lt;br /&gt;
* pctpatentnumber (varchar(255)) international patent number (according to Patent Cooperation Treaty) also known as PCT Document Number&lt;br /&gt;
* appnum(varchar(255)) application number, probably also the filing number - the format is a two digit series code followed by a six digit serial number (source: https://www.uspto.gov/patents-application-process/filing-online/info-application-number)&lt;br /&gt;
* appdate (date) date that the application was filed, including year - probably also the filing date&lt;br /&gt;
* ipcrsubclass (varchar(255)) International patent classification subclass&lt;br /&gt;
* ipcrmaingroup (varchar(255)) International patent classification main group&lt;br /&gt;
* ipcrsubgroup (varchar(255)) International patent classification subgroup&lt;br /&gt;
* ipcrtotal (varchar(255)) concatenated version of ipcrsubgroup, ipcrsubclass, ipcrmaingroup &lt;br /&gt;
* national_classification (varchar(255)) - probably uspc, which is United States Patent Classification&lt;br /&gt;
* national_classification_country (varchar(255)) - should be United States&lt;br /&gt;
* num_claims (int) number of claims (things that the inventor wishes to protect with this patent)&lt;br /&gt;
&lt;br /&gt;
Discrepancies: &lt;br /&gt;
These fields were previously a part of the table, but we do not currently have xpaths for them:&lt;br /&gt;
* app_year - the year the application was filed&lt;br /&gt;
* grant_year - the year the patent was granted&lt;br /&gt;
* uspc - United States Patent Classification (might be the same as national_classification)&lt;br /&gt;
* uspcsub - United States Patent subgroup&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: Represent the person the patent was currently assigned to at publication according to the patent information (the information in the Assignment database is more complete) &lt;br /&gt;
&lt;br /&gt;
Each entry represents: the assignee for a patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number, first_name, last_name (in case it is possible for there to be multiple assignees for an individual patent)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar &lt;br /&gt;
* first_name (varchar(255)) first name of assignee&lt;br /&gt;
* last_name (varchar(255)) last name of assignee&lt;br /&gt;
* address (varchar(255)) address of assignee&lt;br /&gt;
* postcode (varchar(255)) postcode of assignee&lt;br /&gt;
* orgname (varchar(255)) name of the organization owning the patent (if applicable - organization may go by multiple orgnames) &lt;br /&gt;
* city (varchar(255)) city of the assignee&lt;br /&gt;
* country (varchar(255)) country of the assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* residence (varchar(255)) mailing address of assignee if they receive mail at a different address than the one listed above&lt;br /&gt;
* size_of_firm (varchar(255)) size of the organization (large, small, or micro) - determines the size of the fee they pay for the patent&lt;br /&gt;
&lt;br /&gt;
===CITATION===&lt;br /&gt;
Purpose: Represent all the patents in PATENT that another patent in PATENT cites&lt;br /&gt;
&lt;br /&gt;
Each entry represents: one citation that a patent in PATENT makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number AND cited_patent_number (because a patent might cite multiple patents, so both variables are needed for the primary key)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* cited_patent_number (varchar(255)) unique identifier from the USPTO for the patent being cited&lt;br /&gt;
* description (varchar(255)) description of the citation&lt;br /&gt;
* citation_date (date) unclear what this date is supposed to represent from the DTD, but according to the paper that is the second link in the top section, it is the &amp;quot;date of cited document&amp;quot;&lt;br /&gt;
* citation_kind (varchar(255)) type of cited document&lt;br /&gt;
* citation_country (varchar(255)) country of origin for the cited patent &lt;br /&gt;
* citation_name (varchar(255)) appears to be the names of those to whom the cited patent is assigned&lt;br /&gt;
&lt;br /&gt;
Note: There are other fields available for this table, such as document number, but this seems like it might be a repeat of the document number for the citing patent&lt;br /&gt;
&lt;br /&gt;
===FEE===&lt;br /&gt;
Purpose: Represent information about the fees paid for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the fees paid on the patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar  &lt;br /&gt;
* four (boolean) payment of maintenance fee 4th year&lt;br /&gt;
* eight (boolean) payment of maintenance fee 8th year&lt;br /&gt;
* twelve (boolean) payment of maintenance fee 12th year&lt;br /&gt;
* fee_date (date) date that fee is paid on&lt;br /&gt;
* fee_code (varchar(255)) code for type of fee to pay&lt;br /&gt;
&lt;br /&gt;
===CLAIMS===&lt;br /&gt;
&lt;br /&gt;
Purpose: Represent information about the claims for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a claim that a patent makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number and claim_id&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* claim_id (varchar(255)) id number for the claim&lt;br /&gt;
* language (varchar(255)) language of claim&lt;br /&gt;
* claim_type (varchar(255)) type of claim&lt;br /&gt;
* status (varchar(255)) status of the claim&lt;br /&gt;
* claim_text (text) the text of the claim&lt;br /&gt;
&lt;br /&gt;
===HISTPATENT===&lt;br /&gt;
Purpose: Represent historical data about the patent pertaining to dates and publication&lt;br /&gt;
&lt;br /&gt;
Each entry represents: sequence of historical data about an individual patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* pubno (varchar(255)) patent publication number&lt;br /&gt;
* pubdate (date) date patent was issued&lt;br /&gt;
* dispdate (date) disposal date for the application&lt;br /&gt;
* distype (varchar(3)) application status of the patent - ABN for abandoned, ISS for issued, or PEN for pending&lt;br /&gt;
* exp_dt (date) expiration date of patent&lt;br /&gt;
* pta (int) length of patent term adjustment (which extends the amount of time the patent is allowed to be in force) in number of days&lt;br /&gt;
&lt;br /&gt;
===INVENTORS===&lt;br /&gt;
Purpose: Represent the inventors for each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about an inventor for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (and name, if there can be multiple inventors for a single patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identify the inventor?&lt;br /&gt;
* name (varchar(255)) concatenated string of the first and last name of the inventor&lt;br /&gt;
* orgname (varchar(255)) name of the organization the inventor works for&lt;br /&gt;
* state (varchar(255)) - state in which the inventor resided&lt;br /&gt;
* address (varchar(255)) - address at which the inventor resided&lt;br /&gt;
* country (varchar(255)) country in which the inventor resided&lt;br /&gt;
* city (varchar(255)) city in which the inventor resided&lt;br /&gt;
&lt;br /&gt;
===LAWYERS===&lt;br /&gt;
Purpose: Represent the lawyers who worked on each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a lawyer for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (there doesn't appear to be more than one lawyer entry for each patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identifies the lawyer?&lt;br /&gt;
* org_name (varchar(255)) the name of the firm that did arbitration for the patent&lt;br /&gt;
&lt;br /&gt;
===Unique Attributes Tables===&lt;br /&gt;
&lt;br /&gt;
Several attributes have been identified that are unique, applying to only one kind of patent. Adding all these fields to the main patent table would introduce a lot of NULL entries, because three out of four patent types will not have any particular attribute. Therefore, I have created three new tables to store these unique attributes - one for each type of patent with unique attributes. &lt;br /&gt;
&lt;br /&gt;
For each table, the primary key is the patent_number, which connects a row to a row in the main patent table. &lt;br /&gt;
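&lt;br /&gt;
To illustrate how a per-type table keeps NULL columns out of the main patent table, here is a minimal Python sketch using an in-memory SQLite database. It is only an assumption-laden example: the sample rows are invented, the tables are trimmed to a few columns, and the real database may use a different engine and types.&lt;br /&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patent (
    patent_number TEXT PRIMARY KEY,
    patent_type   TEXT
);
-- Only reissue patents get a row here, so PATENT needs no reissue columns.
CREATE TABLE reissue (
    patent_number     TEXT PRIMARY KEY REFERENCES patent(patent_number),
    parent_doc_number INTEGER
);
INSERT INTO patent VALUES ('US0000001', 'utility');
INSERT INTO patent VALUES ('RE00001', 'reissue');
INSERT INTO reissue VALUES ('RE00001', 1234567);
""")

# A LEFT JOIN keeps every patent; non-reissue patents simply have no match.
rows = conn.execute("""
SELECT p.patent_number, p.patent_type, r.parent_doc_number
FROM patent AS p
LEFT JOIN reissue AS r ON r.patent_number = p.patent_number
ORDER BY p.patent_number
""").fetchall()
print(rows)
```

The same pattern would apply to the design and plant tables, each keyed on patent_number.&lt;br /&gt;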
&lt;br /&gt;
TODO: Figure out best way to generate these tables, finish description of attributes for each table.&lt;br /&gt;
&lt;br /&gt;
* Note that I did include some unique attributes for Utility patents as seen below in the &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot; section, but for now (11/3/2017) I'm not going to create a separate table for those fields. Most of them seem to be repeated in Reissue, and the other fields, while interesting, are perhaps not very useful.&lt;br /&gt;
&lt;br /&gt;
====DESIGN====&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* latin_name (varchar(255)) latin name for the plant&lt;br /&gt;
* us_botanical_variety(varchar(255)) denotes what variety of plant it is - for example, a rose has several different varieties&lt;br /&gt;
&lt;br /&gt;
====REISSUE====&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* parent_doc_status (varchar(255)) status of the parent document. It is unclear whether this refers to a parent reissue application (if there happen to be multiple applications relating to the reissue of this patent) or the patent that is being reissued. This goes for all fields beginning with &amp;quot;parent_doc&amp;quot;&lt;br /&gt;
* parent_doc_number (int) number for the parent document&lt;br /&gt;
* parent_doc_id (varchar(255)) it is unclear how this is different from parent document number. In the xpaths, the path will be something like ./parent_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent document&amp;quot; is - I will need to look in an actual XML file to see what this field looks like&lt;br /&gt;
* parent_doc_country (varchar(255)) the country of the parent document&lt;br /&gt;
* parent_doc_date (date) probably the date the parent document was published&lt;br /&gt;
 &lt;br /&gt;
In addition to parent document, there is a subcategory that has the same fields (minus status) called parent_grant_document. This is probably a patent that has been granted, given that the field name includes grant, but I am not sure. Again, I am not sure what all of these fields represent.&lt;br /&gt;
&lt;br /&gt;
* parent_grant_doc_number (int) number for the parent grant document&lt;br /&gt;
* parent_grant_doc_id (varchar(255)) it is unclear how this is different from parent grant document number. In the xpaths, the path will be something like ./parent_doc/parent-grant-doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_grant_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent grant document&amp;quot; is - I will need to look in an actual XML file to see what this field looks like&lt;br /&gt;
* parent_grant_doc_country (varchar(255)) the country of the parent grant document&lt;br /&gt;
* parent_grant_doc_date (date) probably the date the parent grant document was published&lt;br /&gt;
&lt;br /&gt;
It is possible that the parent document is the parent application - that is, the first patent application filed in regards to an invention. That would explain why Reissue also has several fields related to a child document - a child document could be a child application, which is filed while a parent application is still pending. So in this case, a child document of a reissue patent would be an application regarding the same patent that was filed while the reissue application was still pending. A child document is either a continuation, divisional, or continuation-in-part ([http://www.patenttrademarkblog.com/parent-and-child-patent-applications/ source]).&lt;br /&gt;
&lt;br /&gt;
====PLANT====&lt;br /&gt;
&lt;br /&gt;
==Connecting Patent database and Assignment database==&lt;br /&gt;
The answer to connecting the Patent database to the Assignment database lies in using the information in the table DOCUMENT_INFO to connect each assignment in the Assignment database ASSIGNMENT table to a patent_number from the Patent database PATENT table. On further investigation into &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot;, the field they called &amp;quot;grant_doc_num&amp;quot; (patent_number above) in DOCUMENT_INFO is the patent number (the description of DOCUMENT_INFO above has been altered to reflect this).&lt;br /&gt;
&lt;br /&gt;
The paper mentions that there will be errors in the patent number stored in DOCUMENT_INFO under grant_doc_num, so the authors constructed a separate table called DOCUMENT_INFO_ADMIN to determine how prevalent the errors were. They queried the patent number for each appno_doc_num from administrative data. The &amp;quot;appno_doc_num&amp;quot; is the Application Document USPTO number from DOCUMENT_INFO. On further investigation into the paper, it became clear that &amp;quot;administrative data&amp;quot; refers to internal USPTO data that was available to the authors of the paper, but that we might not have. However, in 99% of cases the grant_doc_num and the patent number queried using the appno_doc_num were the same, so we can probably rely on grant_doc_num.&lt;br /&gt;
&lt;br /&gt;
Therefore, since every table in the Patent database will have the patent number stored, to connect any table in Assignment to the Patent database, it would first be joined with DOCUMENT_INFO on rf_id and then joined with the appropriate table in Patent on patent number.&lt;br /&gt;
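&lt;br /&gt;
The two-step join described above can be sketched as follows. This is only a minimal illustration using an in-memory SQLite database: the patent table's patent_number and title columns and all of the sample rows are assumptions made up for the example, and the real tables will have more columns and possibly different names.&lt;br /&gt;

```python
import sqlite3

# Hypothetical miniature versions of the tables. Only rf_id and
# grant_doc_num come from the design above; patent_number, title,
# and every sample value are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE assignee (rf_id INTEGER, assignee_name TEXT);
    CREATE TABLE document_info (rf_id INTEGER, grant_doc_num TEXT);
    CREATE TABLE patent (patent_number TEXT, title TEXT);

    INSERT INTO assignee VALUES (1001, 'Example Corp');
    INSERT INTO document_info VALUES (1001, '7000001');
    INSERT INTO patent VALUES ('7000001', 'Example widget');
""")

# Step 1: join the Assignment table to DOCUMENT_INFO on rf_id.
# Step 2: join DOCUMENT_INFO to the Patent table on the patent number.
rows = conn.execute("""
    SELECT a.assignee_name, d.grant_doc_num, p.title
    FROM assignee AS a
    JOIN document_info AS d ON d.rf_id = a.rf_id
    JOIN patent AS p ON p.patent_number = d.grant_doc_num
""").fetchall()

print(rows)
```

Any other Assignment table (assignor, conveyance, and so on) could take the place of assignee in the first join, since they all carry rf_id.&lt;br /&gt;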
&lt;br /&gt;
The paper also mentions Assignments on the Web for Patents (AOTW-P), a searchable database of individual USPTO assignment records keyed on reel-frame identification, patent number, and assignor or assignee name (https://assignment.uspto.gov/patent/index.html#/patent/search). It would not be feasible to query every patent number by hand with this tool, but if a script could query each record using the rf_id and parse the response, it could be used to check the patent numbers. The results might not be any more accurate than what will already be in DOCUMENT_INFO, though.&lt;br /&gt;
&lt;br /&gt;
==Finding New Paths Unique to Plant, Reissue, and/or Design Patents==&lt;br /&gt;
&lt;br /&gt;
Based on Oliver's script, which searched all xpaths and compared which were unique to particular types, we see that the following attributes are unique to each type of patent other than utility patents. These attributes vary by XML version, which changed over time. Therefore, the lists below are a superset of the attributes that are unique to each of the patent types listed below across all XML versions.&lt;br /&gt;
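&lt;br /&gt;
As a rough illustration of the comparison (this is not Oliver's actual script, and the xpaths and function names below are made up), the idea is to find, within each XML version, the paths that appear for one patent type and no other, and then take the union of those results across versions to get the superset:&lt;br /&gt;

```python
# Hypothetical input: for each XML version, the set of xpaths seen for
# each patent type. The paths are invented purely for illustration.
observed = {
    "XML44": {
        "utility": {"./title"},
        "plant": {"./title", "./plant/latin-name"},
    },
    "XML45": {
        "utility": {"./title"},
        "plant": {"./title", "./plant/claim-statement"},
    },
}


def unique_to_type(paths_by_type, patent_type):
    """Return the xpaths seen for patent_type and for no other type."""
    others = set()
    for other_type, paths in paths_by_type.items():
        if other_type != patent_type:
            others |= paths
    return paths_by_type.get(patent_type, set()) - others


def superset_across_versions(observed, patent_type):
    """Union of the per-version unique xpaths for one patent type."""
    result = set()
    for paths_by_type in observed.values():
        result |= unique_to_type(paths_by_type, patent_type)
    return result


plant_paths = superset_across_versions(observed, "plant")
utility_paths = superset_across_versions(observed, "utility")
print(sorted(plant_paths))
```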
&lt;br /&gt;
When I am done finding the supersets, we will determine how to integrate these attributes into the design for the patent table.&lt;br /&gt;
&lt;br /&gt;
'''There appears to be a fifth patent type in the results from Oliver's script, and there are attributes that are unique to Utility patents, though they do not appear particularly useful'''&lt;br /&gt;
&lt;br /&gt;
===Plant Patents===&lt;br /&gt;
* US Claim Statement (XML45) - unclear what this represents&lt;br /&gt;
* Latin Name (XML44)&lt;br /&gt;
* US Botanical Variety (XML44)&lt;br /&gt;
&lt;br /&gt;
===Reissue Patents===&lt;br /&gt;
* Parent Status (XML45)&lt;br /&gt;
* Parent Document (XML44) - seems to be either a normal document or a parent grant document (or both), both contain the following fields&lt;br /&gt;
** Date&lt;br /&gt;
** Document number&lt;br /&gt;
** Country &lt;br /&gt;
** Document ID&lt;br /&gt;
** Kind (XML42)&lt;br /&gt;
* Child Document (XML44)&lt;br /&gt;
** Document ID&lt;br /&gt;
** Document Number&lt;br /&gt;
** Country (XML43)&lt;br /&gt;
* Continuing Reissue (XML44)&lt;br /&gt;
** Document ID&lt;br /&gt;
** Relation&lt;br /&gt;
** Document Number&lt;br /&gt;
&lt;br /&gt;
Note about fields related to the parent document and/or parent grant document:&lt;br /&gt;
The structure of the above fields related to a reissue patent is very odd and varies by XML version. For example, for most of the XML versions, parent grant document falls under parent document and has its own document number, date, country, document ID, kind, etc. The parent doc field falls under relation, which falls under reissue. &lt;br /&gt;
&lt;br /&gt;
However, in XML41, parent document falls under relation and a new category - US Reexamination Reissue Merger. The information pertaining to the child document also falls under both relation and US Reexamination Reissue Merger. It's possible that during the time period this XML version represents, there were two types of reissue patents - a reexamination related to a merger and a standard reissue. With discrepancies like these, we'll have to determine in the end whether we want to stick with the fields as they are designed above (in which case the information about a parent doc or parent grant document is stored the same way whether it is a reissue or a Reexamination Reissue), or whether these two classifications are different enough to warrant duplicate fields in the table.&lt;br /&gt;
&lt;br /&gt;
===Design Patents===&lt;br /&gt;
* length of grant (XML45)&lt;br /&gt;
* Hague Agreement Data (XML45) - allows people to file design patents in 66 countries with one application&lt;br /&gt;
** International Registration Date&lt;br /&gt;
** International Filing Date&lt;br /&gt;
** International Registration Publication Date&lt;br /&gt;
** International Registration Number&lt;br /&gt;
* Classification Locarno (XML40)&lt;br /&gt;
** Edition&lt;br /&gt;
** Main Classification&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please note that the lists above are meant to represent unique fields that should be added to the patent database. This does not mean that every XML version contains all these fields, or that every version contains the same path to these fields. That will have to be determined separately. The XML number next to some of the items is meant to represent the latest XML version where the field was seen.&lt;br /&gt;
&lt;br /&gt;
Though I was mainly looking at unique fields in plant, reissue, and design patents, I noticed the following fields on unique paths for utility patents in XML41:&lt;br /&gt;
&lt;br /&gt;
* US Related Documents&lt;br /&gt;
** Substitution&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Parent Document&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Country&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
***** Parent Status&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Country&lt;br /&gt;
** Continuation/Continuation in Part&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Doc&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
** Division&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Kind&lt;br /&gt;
&lt;br /&gt;
This appears to resemble the information about reissue patents. Additionally, a couple of fields appear under the information about bibliographic data that might be useful:&lt;br /&gt;
* US Deceased Inventor (XML41)&lt;br /&gt;
** Post Code&lt;br /&gt;
** Address&lt;br /&gt;
* US Provisional Application Status (XML41)&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shelby_Bice_(Work_Log)&amp;diff=21762</id>
		<title>Shelby Bice (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shelby_Bice_(Work_Log)&amp;diff=21762"/>
		<updated>2017-11-10T20:57:37Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* Spring 2017 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;===Fall 2017===&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt; &lt;br /&gt;
&lt;br /&gt;
[[Shelby Bice]] [[Work Logs]] [[Shelby Bice (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
2017-11-3: 2:30 - 3:30 pm, 3:45 pm - 5:00 pm (I went to get a snack because I was starving!) - Finished up finding unique fields, added them to database design, and researched what the fields represented - again, this is taking a while since there is little to no documentation online about what fields on these XML files actually represent.&lt;br /&gt;
&lt;br /&gt;
2017-11-2: 2:15 pm - 4:00 pm - made a main data page and continued going through the xpaths Oliver found. Notified Michelle so she can continue adding patent pages&lt;br /&gt;
&lt;br /&gt;
2017-10-31: 9:00 am - 10:30 am - started trying to determine from Oliver's research what paths are unique to utility, reissue, and plant patents that we have not yet accounted for in our design of the table&lt;br /&gt;
&lt;br /&gt;
2017-10-27: 2:00 pm - 3:30 pm - Kept updating tables. For some of the new values I've been adding to tables (I didn't realize we had XPaths for some fields) I'm struggling to find out exactly what the field means (for example, the date on a citation - is this the date the citedpatent was granted? or is the date that the citingpatent cited the citedpatent?) so the process is slow going. I want to get the documentation right though so in the future someone can look up and see exactly what each field in the table represents and not have to guess&lt;br /&gt;
&lt;br /&gt;
2017-10-26: 2:00 pm - 4:00 pm - worked on updating the designs for the patent databases tables based on what was discussed on Friday (specifically adding inventors and lawyers tables, altering fields on patent tables) - will continue this on Friday&lt;br /&gt;
&lt;br /&gt;
2017-10-20: 2:00 pm - 5:00 pm - met with Ed and Oliver about patent database design&lt;br /&gt;
&lt;br /&gt;
2017-10-19: 2:00 pm - 4:00 pm - finished up ER diagram and description of all tables for patent database, started reading papers concerning creating an inventor's database (looks like other research groups have merged the USPTO data with the Harvard Dataverse data in order to create an inventors table)&lt;br /&gt;
&lt;br /&gt;
2017-10-17: 8:45 am - 10:30 am - continued trying to solve twitter issues, worked on ER diagram, skimmed through a paper I found (linked at the top of the page for Redesign Assignment and Patent Database) to see how they cleaned the data, since I assume that will be the next step to research after finishing the ER diagrams&lt;br /&gt;
&lt;br /&gt;
2017-10-12: 2:30 pm - 5:00 pm - wrote blog post on Grace Hopper Celebration and attempted to solve Twitter issue&lt;br /&gt;
&lt;br /&gt;
2017-10-03: 8:45 am - 10:30 am - finished adding descriptions to the fields for each table in the patent database, started work on an ER diagram&lt;br /&gt;
&lt;br /&gt;
2017-9-29: 2:00 pm - 5:00 pm - finished adding the design for the Patent database to the project page, added descriptions of the fields for each table to the project page including the datatype that I think the field will be when it's loaded into the database&lt;br /&gt;
&lt;br /&gt;
2017-9-28: 9:00 am - 10:30 am - continued going over and documenting last semester's patent database design and adding the details to the Redesign Assignment and Patent Database project page. Additionally, I began trying to determine how to match up the information in the Document_Info table in Assignment with a patent_id in the Patent table in Patent.&lt;br /&gt;
&lt;br /&gt;
Tomorrow I will finish adding the design for the Patent database to the project page, add descriptions of the fields for each table to the project page, and start working on ER diagrams for the two databases.&lt;br /&gt;
&lt;br /&gt;
Links for creating ER diagrams: https://erdplus.com/#/standalone or https://creately.com/app/?tempID=hqdgwjki1&amp;amp;login_type=demo#&lt;br /&gt;
&lt;br /&gt;
2017-9-26: 8:45 am - 10:00 am - continued working on design of Assignment database by checking my design against the work done last semester on the assignment data restructure to make sure I didn't miss anything major. Began going over my patent database design from last semester to tweak it. Will need to sync up with Joe Reilly to see if there are any new fields that we are pulling from the data. Additionally, I made a new project page called Redesign Assignment and Patent Database that encompasses the new design for the Assignment database and Patent database redesign and moved some of the notes from McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper to the project page.&lt;br /&gt;
&lt;br /&gt;
The main takeaway from looking over Patent Assignment Data Restructure is that, after assembling the table according to my design (which doesn't seem to have any contradictions with the Patent Assignment Data Restructure), there will be multiple steps for cleaning the data, specifically the fields relating to location and address in the assignment table. While the Patent Assignment Data Restructure mentions connecting to the Patent database, it is not clear from the page what field would be used to connect to the Patent database.&lt;br /&gt;
&lt;br /&gt;
2017-9-23: 2:00 pm - 4:00 pm - continued working on design of Assignment database and how it will connect to Patent database by writing out what will be in each table in Assignment and questions about different possible structures of tables that we will have to address before finalizing the design - the notes can be found under McNair/Projects/Redesigning Patent Database/New Patent Database Project as Notes on USPTO Assignment Data Paper. Questions are highlighted in yellow throughout the document[[Category:Work Log]]&lt;br /&gt;
&lt;br /&gt;
2017-9-22: 8:30 am - 10:30 am - continued looking at paper on USPTO assignment data and adding to the notes on what the design of that database should look like, specifically on what I need for different tables and what I don't know yet about the design. Had to set up connection to RDP again due to technical issues. &lt;br /&gt;
&lt;br /&gt;
2017-9-15: 2:00 pm - 5:00 pm - introduced to new patent database project, reviewed and took notes on USPTO Assignment data (notes can be found under McNair/Projects/Redesigning Patent Database/New Patent Database Project as Notes on USPTO Assignment Data Paper)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt; &lt;br /&gt;
&lt;br /&gt;
===Spring 2017===&lt;br /&gt;
&lt;br /&gt;
2017-4-27:  1:00 pm - 3:00 pm - worked on documentation more, tried to figure out how to clean citation data&lt;br /&gt;
&lt;br /&gt;
2017-4-25:  10:00 am - 12:00 pm - worked on documentation, tried to determine how to clean up the USPTO Assignee Data&lt;br /&gt;
&lt;br /&gt;
2017-4-20:  10:00 am - 11:30 pm - wrote copy statements for copying data from RDP to database, continued working on documentation.&lt;br /&gt;
&lt;br /&gt;
2017-4-18:  9:45 am - 12:30 pm - Cleaned up documentation more, kept working through the process of parsing the data&lt;br /&gt;
&lt;br /&gt;
2017-4-13: 9:30 am - 12:00 pm - Continued working on trying to update patent data through 2016, specifically parsing the data, worked with Ed to update perl scripts&lt;br /&gt;
&lt;br /&gt;
2017-4-11: 9:15 am - 12:00 pm - Worked on trying to update patent data through 2016&lt;br /&gt;
&lt;br /&gt;
2017-4-6: 9:30 am - 11:30 pm - Kept looking through DTDs, kept updating documentation&lt;br /&gt;
&lt;br /&gt;
2017-4-4: 9:00 am - 11:45 pm - Worked on updating documentation, found documentation on pulling data/making tables and databases, started looking through DTDs to find extra fields to pull&lt;br /&gt;
&lt;br /&gt;
2017-3-23: 9:15 am - 12:00 pm - Narrowed down core tables and fields&lt;br /&gt;
&lt;br /&gt;
2017-3-22: 5:30 pm to 6:30 pm - Patent Data meeting&lt;br /&gt;
&lt;br /&gt;
2017-3-21: 9:30 am - 12:00 pm - Worked on determining &amp;quot;core&amp;quot; tables for new patent database&lt;br /&gt;
&lt;br /&gt;
2017-3-9: 9:00 am - 12:00 pm Finished first draft of spreadsheet describing the current schema (and possible changes) to the Patent database&lt;br /&gt;
&lt;br /&gt;
2017-3-7: 9:30 am - 12:00 pm Continued working on spreadsheet, added relevant page links to project page, took notes on what I want documentation to look like in the future&lt;br /&gt;
&lt;br /&gt;
2017-3-2: 9:15 am - 12:15 pm Started excel spreadsheet to document current schema design and improvements to be made, updated project pages&lt;br /&gt;
&lt;br /&gt;
2017-2-23: 9:30 am - 12:00 pm Reviewed Perl, read about database design, set up project page for redesigning database, started documenting process&lt;br /&gt;
&lt;br /&gt;
2017-2-21: 9:00 am - 12:00 pm Set up work page, reviewed SQL, researched designing database, continued going through wiki&lt;br /&gt;
&lt;br /&gt;
2017-2-16: 10:30 am - 12:00 pm Researched past work on databases, discussed project with Ed&lt;br /&gt;
&lt;br /&gt;
2017-2-14: 10:00 am - 12:00 pm Set up personal wiki, set up work log&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shelby_Bice_(Work_Log)&amp;diff=21758</id>
		<title>Shelby Bice (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shelby_Bice_(Work_Log)&amp;diff=21758"/>
		<updated>2017-11-10T20:45:18Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;===Fall 2017===&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt; &lt;br /&gt;
&lt;br /&gt;
[[Shelby Bice]] [[Work Logs]] [[Shelby Bice (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
2017-11-3: 2:30 - 3:30 pm, 3:45 pm - 5:00 pm (I went to get a snack because I was starving!) - Finished up finding unique fields, added them to database design, and researched what the fields represented - again, this is taking a while since there is little to no documentation online about what fields on these XML files actually represent.&lt;br /&gt;
&lt;br /&gt;
2017-11-2: 2:15 pm - 4:00 pm - made a main data page and continued going through the xpaths Oliver found. Notified Michelle so she can continue adding patent pages&lt;br /&gt;
&lt;br /&gt;
2017-10-31: 9:00 am - 10:30 am - started trying to determine from Oliver's research what paths are unique to utility, reissue, and plant patents that we have not yet accounted for in our design of the table&lt;br /&gt;
&lt;br /&gt;
2017-10-27: 2:00 pm - 3:30 pm - Kept updating tables. For some of the new values I've been adding to tables (I didn't realize we had XPaths for some fields) I'm struggling to find out exactly what the field means (for example, the date on a citation - is this the date the citedpatent was granted? or is the date that the citingpatent cited the citedpatent?) so the process is slow going. I want to get the documentation right though so in the future someone can look up and see exactly what each field in the table represents and not have to guess&lt;br /&gt;
&lt;br /&gt;
2017-10-26: 2:00 pm - 4:00 pm - worked on updating the designs for the patent databases tables based on what was discussed on Friday (specifically adding inventors and lawyers tables, altering fields on patent tables) - will continue this on Friday&lt;br /&gt;
&lt;br /&gt;
2017-10-20: 2:00 pm - 5:00 pm - met with Ed and Oliver about patent database design&lt;br /&gt;
&lt;br /&gt;
2017-10-19: 2:00 pm - 4:00 pm - finished up ER diagram and description of all tables for patent database, started reading papers concerning creating an inventor's database (looks like other research groups have merged the USPTO data with the Harvard Dataverse data in order to create an inventors table)&lt;br /&gt;
&lt;br /&gt;
2017-10-17: 8:45 am - 10:30 am - continued trying to solve twitter issues, worked on ER diagram, skimmed through a paper I found (linked at the top of the page for Redesign Assignment and Patent Database) to see how they cleaned the data, since I assume that will be the next step to research after finishing the ER diagrams&lt;br /&gt;
&lt;br /&gt;
2017-10-12: 2:30 pm - 5:00 pm - wrote blog post on Grace Hopper Celebration and attempted to solve Twitter issue&lt;br /&gt;
&lt;br /&gt;
2017-10-03: 8:45 am - 10:30 am - finished adding descriptions to the fields for each table in the patent database, started work on an ER diagram&lt;br /&gt;
&lt;br /&gt;
2017-9-29: 2:00 pm - 5:00 pm - finished adding the design for the Patent database to the project page, added descriptions of the fields for each table to the project page including the datatype that I think the field will be when it's loaded into the database&lt;br /&gt;
&lt;br /&gt;
2017-9-28: 9:00 am - 10:30 am - continued going over and documenting last semester's patent database design and adding the details to the Redesign Assignment and Patent Database project page. Additionally, I began trying to determine how to match up the information in the Document_Info table in Assignment with a patent_id in the Patent table in Patent.&lt;br /&gt;
&lt;br /&gt;
Tomorrow I will finish adding the design for the Patent database to the project page, add descriptions of the fields for each table to the project page, and start working on ER diagrams for the two databases.&lt;br /&gt;
&lt;br /&gt;
Links for creating ER diagrams: https://erdplus.com/#/standalone or https://creately.com/app/?tempID=hqdgwjki1&amp;amp;login_type=demo#&lt;br /&gt;
&lt;br /&gt;
2017-9-26: 8:45 am - 10:00 am - continued working on design of Assignment database by checking my design against the work done last semester on the assignment data restructure to make sure I didn't miss anything major. Began going over my patent database design from last semester to tweak it. Will need to sync up with Joe Reilly to see if there are any new fields that we are pulling from the data. Additionally, I made a new project page called Redesign Assignment and Patent Database that encompasses the new design for the Assignment database and Patent database redesign and moved some of the notes from McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper to the project page.&lt;br /&gt;
&lt;br /&gt;
The main takeaway from looking over Patent Assignment Data Restructure is that, after assembling the table according to my design (which doesn't seem to have any contradictions with the Patent Assignment Data Restructure), there will be multiple steps for cleaning the data, specifically the fields relating to location and address in the assignment table. While the Patent Assignment Data Restructure mentions connecting to the Patent database, it is not clear from the page what field would be used to connect to the Patent database.&lt;br /&gt;
&lt;br /&gt;
2017-9-23: 2:00 pm - 4:00 pm - continued working on design of Assignment database and how it will connect to Patent database by writing out what will be in each table in Assignment and questions about different possible structures of tables that we will have to address before finalizing the design - the notes can be found under McNair/Projects/Redesigning Patent Database/New Patent Database Project as Notes on USPTO Assignment Data Paper. Questions are highlighted in yellow throughout the document[[Category:Work Log]]&lt;br /&gt;
&lt;br /&gt;
2017-9-22: 8:30 am - 10:30 am - continued looking at paper on USPTO assignment data and adding to the notes on what the design of that database should look like, specifically on what I need for different tables and what I don't know yet about the design. Had to set up connection to RDP again due to technical issues. &lt;br /&gt;
&lt;br /&gt;
2017-9-15: 2:00 pm - 5:00 pm - introduced to new patent database project, reviewed and took notes on USPTO Assignment data (notes can be found under McNair/Projects/Redesigning Patent Database/New Patent Database Project as Notes on USPTO Assignment Data Paper)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt; &lt;br /&gt;
&lt;br /&gt;
===Spring 2017===&lt;br /&gt;
&lt;br /&gt;
2/14/2017 10:00 am - 12:00 pm Set up personal wiki, set up work log&lt;br /&gt;
&lt;br /&gt;
2/16/2017 10:30 am - 12:00 pm Researched past work on databases, discussed project with Ed&lt;br /&gt;
&lt;br /&gt;
2/21/2017 9:00 am - 12:00 pm Set up work page, reviewed SQL, researched designing database, continued going through wiki&lt;br /&gt;
&lt;br /&gt;
2/23/2017 9:30 am - 12:00 pm Reviewed Perl, read about database design, set up project page for redesigning database, started documenting process&lt;br /&gt;
&lt;br /&gt;
3/2/2017 9:15 am - 12:15 pm Started excel spreadsheet to document current schema design and improvements to be made, updated project pages&lt;br /&gt;
&lt;br /&gt;
3/7/2017 9:30 am - 12:00 pm Continued working on spreadsheet, added relevant page links to project page, took notes on what I want documentation to look like in the future&lt;br /&gt;
&lt;br /&gt;
3/9/2017 9:00 am - 12:00 pm Finished first draft of spreadsheet describing the current schema (and possible changes) to the Patent database&lt;br /&gt;
&lt;br /&gt;
3/21/2017 9:30 am - 12:00 pm - Worked on determining &amp;quot;core&amp;quot; tables for new patent database&lt;br /&gt;
&lt;br /&gt;
3/22/2017 5:30 pm to 6:30 pm - Patent Data meeting&lt;br /&gt;
&lt;br /&gt;
3/23/2017 9:15 am - 12:00 pm - Narrowed down core tables and fields&lt;br /&gt;
&lt;br /&gt;
4/4/2017 9:00 am - 11:45 pm - Worked on updating documentation, found documentation on pulling data/making tables and databases, started looking through DTDs to find extra fields to pull&lt;br /&gt;
&lt;br /&gt;
4/6/2017 9:30 am - 11:30 pm - Kept looking through DTDs, kept updating documentation&lt;br /&gt;
&lt;br /&gt;
4/11/2017 9:15 am - 12:00 pm - Worked on trying to update patent data through 2016&lt;br /&gt;
&lt;br /&gt;
4/13/2017 9:30 am - 12:00 pm - Continued working on trying to update patent data through 2016, specifically parsing the data, worked with Ed to update perl scripts&lt;br /&gt;
&lt;br /&gt;
4/18/2017  9:45 am - 12:30 pm - Cleaned up documentation more, kept working through the process of parsing the data&lt;br /&gt;
&lt;br /&gt;
4/20/2017  10:00 am - 11:30 pm - wrote copy statements for copying data from RDP to database, continued working on documentation.&lt;br /&gt;
&lt;br /&gt;
4/25/2017  10:00 am - 12:00 pm - worked on documentation, tried to determine how to clean up the USPTO Assignee Data&lt;br /&gt;
&lt;br /&gt;
4/27/2017  1:00 pm - 3:00 pm - worked on documentation more, tried to figure out how to clean citation data&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=21604</id>
		<title>Redesign Assignment and Patent Database</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=21604"/>
		<updated>2017-11-03T21:58:45Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* Unique Attributes Tables */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Redesign Assignment and Patent Database&lt;br /&gt;
|Has owner=Shelby Bice,&lt;br /&gt;
|Has start date=9/2017&lt;br /&gt;
|Has keywords=patent&lt;br /&gt;
|Is dependent on=Reproducible Patent Data,&lt;br /&gt;
}}&lt;br /&gt;
'''FOR ED:''' I finished going through the unique paths (you can see the unique fields I identified at the bottom of this page, under &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot;). Then, under my design for the Patent Database, there's a new section called &amp;quot;Unique Attributes Tables&amp;quot; where I've begun detailing a design for three new tables that would include information that is unique to Design, Reissue, and Plant patents. Here are some questions I have:&lt;br /&gt;
&lt;br /&gt;
'''1.''' Do you like the design for the new tables? How do you think we should populate these tables? I can think of one way (putting all the information in the main patent tables and then moving it out to Plant, Reissue, or Design as needed) but I was wondering if there might be an easier way. &lt;br /&gt;
&lt;br /&gt;
'''2.''' There were some unique attributes to utility patents that I also included in &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot;. They only appear to be in two of the oldest XML versions (4.1 and 4.0) and I'm not sure the information is particularly useful - but please take a look at the field names and let me know if you disagree and think they are useful and should be included in a table in the patent database.&lt;br /&gt;
&lt;br /&gt;
'''3.''' Do you know what some of these fields represent? For example, I've been trying to research online what a &amp;quot;Parent Document&amp;quot; and/or &amp;quot;Parent Grant Document&amp;quot; might represent for a reissue patent, and I've found a couple possible options. I will keep researching, but if you know what they represent or one of my descriptions for a field is wrong, please let me know.&lt;br /&gt;
&lt;br /&gt;
This is an extension of the work I did last semester under &amp;quot;Redesigning Patent Database&amp;quot;. Instead of simply reconfiguring the existing database, this project encompasses a full redesign and creation of a new Patent database and a new Assignment database that will be joined together.&lt;br /&gt;
&lt;br /&gt;
Reference for definitions of fields in the two databases: https://www.oecd.org/sti/sci-tech/37569498.pdf&lt;br /&gt;
&lt;br /&gt;
Paper on how to extract data from XML files, parse it, and put it into a database: https://funginstitute.berkeley.edu/wp-content/uploads/2014/06/patentprocessor.pdf&lt;br /&gt;
&lt;br /&gt;
'''Particularly useful, because it provides solutions for cleaning location and firm data'''&lt;br /&gt;
&lt;br /&gt;
==ER Diagram for Assignment and Patent Databases==&lt;br /&gt;
&lt;br /&gt;
Note: this is currently out of sync with the current description below for the patent database. It will be updated once I have finalized how we will organize unique attributes for plant, design, and reissue patents.&lt;br /&gt;
&lt;br /&gt;
[[File:PatentAndAssignmentER2.png]]&lt;br /&gt;
&lt;br /&gt;
Attributes for each table are listed below with descriptions - because of how many attributes there are, I decided the ER diagram would be better suited as an overview of the tables rather than trying to show all the attributes on the diagram.&lt;br /&gt;
&lt;br /&gt;
==Assignment Database Structure==&lt;br /&gt;
&lt;br /&gt;
After reading through the paper &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot; (https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf) and taking into account our needs for the Assignment database (the ability to connect the two databases together, the ability to trace conveyance of a patent over time), I decided on the following structure for the Assignment database.&lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table in the database is rf_id, which stands for reel frame id. It is unique for each entry in the main table, assignment, and is included in every table. &lt;br /&gt;
&lt;br /&gt;
Conveyance is the only table that cannot be fully constructed as data is inserted into the database. Some of the fields depend on other tables being constructed, so it may either be partially populated while the other tables are being populated or be the last table to be populated. &lt;br /&gt;
&lt;br /&gt;
Full list of tables: assignee, assignor, document_info, assignment, conveyance&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: store all the assignees from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the recipient of a patent in an entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id, unless it is possible to have multiple assignees, then rf_id and assignee name&lt;br /&gt;
&lt;br /&gt;
All columns (variables) in the table: &lt;br /&gt;
&lt;br /&gt;
* rf_id (bigint) reel frame number &lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* address_line_1 (varchar(255)) first line of address of assignee&lt;br /&gt;
* address_line_2 (varchar(255)) second line of address of assignee&lt;br /&gt;
* city (varchar(255)) city of assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* country (varchar(255)) country of assignee&lt;br /&gt;
* postal_code (varchar(255)) post code of the assignee&lt;br /&gt;
&lt;br /&gt;
===ASSIGNOR===&lt;br /&gt;
	&lt;br /&gt;
Purpose: store all the assignors from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the party assigning the patent or granting the license for the entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id and assignor_name&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
&lt;br /&gt;
===CONVEYANCE===&lt;br /&gt;
Purpose: represent the conveyance of a particular patent over time&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual transaction for a patent (not unlike ASSIGNMENT, except that this table has information about the assignment in relation to other assignments for the patent)&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id &lt;br /&gt;
&lt;br /&gt;
All columns (variables): Two possibilities, depending on whether we want data to be repeated or not.&lt;br /&gt;
	&lt;br /&gt;
Not-repeated (with exception of primary/foreign key): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* next_transaction (bigint) rf_id of the next transaction that occurred after this transaction (traced by assignee/assignor)&lt;br /&gt;
* num_conveyance (int) number in order of conveyance by date&lt;br /&gt;
* acknowledgment_date (date) date that the assignor acknowledged the transaction&lt;br /&gt;
* execution_date (date) date that the transaction took place&lt;br /&gt;
* conveyance_type (varchar(255)) type of conveyance. The USPTO paper defines several conveyance types and outlines how to determine, from the conveyance text in the XML file, which type applies to the transaction. The conveyance types are:&lt;br /&gt;
** assignment&lt;br /&gt;
** merger&lt;br /&gt;
** change of name&lt;br /&gt;
** government interest&lt;br /&gt;
** agreement&lt;br /&gt;
** security agreement&lt;br /&gt;
** release&lt;br /&gt;
&lt;br /&gt;
	&lt;br /&gt;
Repeated, also include: &lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
* patent_number (varchar(255)) identifying number for the patent&lt;br /&gt;
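&lt;br /&gt;
As a rough illustration of why CONVEYANCE has to be filled in after the other tables, the sketch below derives num_conveyance and next_transaction once the rf_id, patent number, and execution date of every transaction are known. It is only a sketch under assumptions: the function name chain_conveyances and the input tuples are hypothetical, results are keyed by (rf_id, patent_number) in case one assignment covers several patents, and next_transaction is taken to be the next transaction by execution date rather than being traced through assignee/assignor names.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict
from datetime import date


def chain_conveyances(rows):
    """Order the transactions of each patent by execution date.

    rows: iterable of (rf_id, patent_number, execution_date) tuples
          (hypothetical input shape; every execution_date is assumed present).
    Returns {(rf_id, patent_number): (num_conveyance, next_transaction)}.
    """
    by_patent = defaultdict(list)
    for rf_id, patent_number, execution_date in rows:
        by_patent[patent_number].append((execution_date, rf_id))

    result = {}
    for patent_number, transactions in by_patent.items():
        transactions.sort()  # earliest execution date first
        for position, (_, rf_id) in enumerate(transactions):
            is_last = position == len(transactions) - 1
            next_rf_id = None if is_last else transactions[position + 1][1]
            # num_conveyance is 1-based; the last transaction has no successor
            result[(rf_id, patent_number)] = (position + 1, next_rf_id)
    return result
```
&lt;br /&gt;
Because the function needs the patent number for every rf_id, it can only run after DOCUMENT_INFO has been populated, which matches the note above that CONVEYANCE may be the last table to be populated.&lt;br /&gt;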
&lt;br /&gt;
===DOCUMENT_INFO===&lt;br /&gt;
Purpose: store extra information relevant to the patents represented in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the documentation and the patent for a particular entry in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables) when created: &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* app_num (varchar(255)) application document USPTO number &lt;br /&gt;
* app_date (date) date that the application was filed&lt;br /&gt;
* app_country (varchar(255)) country in which it was filed&lt;br /&gt;
* pgpub_num (varchar(255)) pre-grant publication document USPTO number&lt;br /&gt;
* pgpub_date (date) date that pre-grant publication was released&lt;br /&gt;
* pgpub_country (varchar(255)) country in which the patent is published before being granted a patent number&lt;br /&gt;
* patent_number (varchar(255)) granted patent document USPTO number&lt;br /&gt;
* grant_date (date) date that the patent is officially published&lt;br /&gt;
* grant_country (varchar(255)) country in which the patent is published after being granted a patent number&lt;br /&gt;
* invention_title (varchar(255)) title of the invention&lt;br /&gt;
* language (varchar(255)) language of the invention title – could potentially be useful and interesting to investigate&lt;br /&gt;
* reel_num (bigint) number of the reel the assignment was stored on&lt;br /&gt;
* frame_num (bigint) number of the frame on the reel the assignment was stored on&lt;br /&gt;
&lt;br /&gt;
===ASSIGNMENT===&lt;br /&gt;
Purpose: represent assignment transactions – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent assignment transaction&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignment_id (bigint) US patent assignment id &lt;br /&gt;
* correspondent_name (varchar(255)) name of correspondent for the transaction, could be a lawyer&lt;br /&gt;
* correspondent_address1 (varchar(255)) first line of address for correspondent&lt;br /&gt;
* correspondent_address2 (varchar(255)) second line of address for correspondent&lt;br /&gt;
* correspondent_address3 (varchar(255)) third line of address for correspondent&lt;br /&gt;
* correspondent_address4 (varchar(255)) fourth line of address for correspondent&lt;br /&gt;
* record_date (date) date recorded with the USPTO&lt;br /&gt;
* last_update_date (date) date that information for this assignment was last updated with the USPTO&lt;br /&gt;
* page_cnt (bigint) page count of the assignment record&lt;br /&gt;
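&lt;br /&gt;
The key structure described for the Assignment tables can be sketched with Python's built-in sqlite3 module. This is a minimal sketch, not the project's actual schema code: only a few columns are shown, the function name build_assignment_db is hypothetical, and SQLite is used here purely because it needs no server. It assumes the composite keys (rf_id plus name) are used for ASSIGNEE and ASSIGNOR.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

# Trimmed-down versions of three Assignment tables (most columns omitted).
SCHEMA = """
CREATE TABLE assignment (
    rf_id BIGINT PRIMARY KEY,
    record_date DATE
);
CREATE TABLE assignor (
    rf_id BIGINT REFERENCES assignment (rf_id),
    assignor_name VARCHAR(255),
    PRIMARY KEY (rf_id, assignor_name)
);
CREATE TABLE assignee (
    rf_id BIGINT REFERENCES assignment (rf_id),
    assignee_name VARCHAR(255),
    PRIMARY KEY (rf_id, assignee_name)
);
"""


def build_assignment_db():
    """Create the sketch schema in an in-memory SQLite database."""
    connection = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    connection.execute("PRAGMA foreign_keys = ON")
    connection.executescript(SCHEMA)
    return connection
```
&lt;br /&gt;
With this layout, two different assignees can share one rf_id, while a repeated (rf_id, assignee_name) pair or an rf_id that is missing from ASSIGNMENT is rejected by the database.&lt;br /&gt;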
&lt;br /&gt;
More extensive notes exist under :E/McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper&lt;br /&gt;
&lt;br /&gt;
==Patent Database Structure==&lt;br /&gt;
&lt;br /&gt;
Most of this design is based off what I investigated last semester and therefore is subject to change, since the design for the Assignment database is somewhat new. &lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table is patent_number, which uniquely identifies every patent in the PATENT table and is included in every table.&lt;br /&gt;
&lt;br /&gt;
Full list of tables: patent, assignee, citation, fee, claims, histpatent, lawyers, inventors&lt;br /&gt;
&lt;br /&gt;
Update: As of October 26, these tables are being updated and changed as I determine what data can actually be retrieved from the XML files based on the xpaths that we have determined (you can see these xpaths on the &amp;quot;Equivalent XPath and APS Queries&amp;quot; project page).&lt;br /&gt;
&lt;br /&gt;
===PATENT===&lt;br /&gt;
Purpose: represent patents – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* patent_type (varchar(255)) the type of patent, like utility&lt;br /&gt;
* patent_kind (varchar(2)) the kind of patent, a letter and a number&lt;br /&gt;
* title (varchar(255)) the title of the patent&lt;br /&gt;
* grantdate (date) date that patent was officially granted including the year&lt;br /&gt;
* prioritydate (date) date used to establish the novelty of an invention relative to other inventions (source: http://www.bios.net/daisy/patentlens/2343.html)&lt;br /&gt;
* prioritycountry (varchar(255)) country where patent is first filed&lt;br /&gt;
* patent_country (varchar(255)) country in which this patent was published&lt;br /&gt;
* prioritypatentnumber (varchar(255)) placeholder patent number if the patent is published before it is fully approved - also referred to as priority claims number&lt;br /&gt;
* cpcsubclass (varchar(255)) cooperative patent classification subclass&lt;br /&gt;
* cpcsubgroup (varchar(255)) cooperative patent classification subgroup&lt;br /&gt;
* cpcmaingroup (varchar(255)) cooperative patent classification main group&lt;br /&gt;
* cpctotal (varchar(255)) concatenated version of cpcmaingroup, cpcsubgroup, cpcsubclass&lt;br /&gt;
* pctpatentnumber (varchar(255)) international patent number (according to Patent Cooperation Treaty) also known as PCT Document Number&lt;br /&gt;
* appnum (varchar(255)) application number, probably also the filing number - the format is a two digit series code followed by a six digit serial number (source: https://www.uspto.gov/patents-application-process/filing-online/info-application-number)&lt;br /&gt;
* appdate (date) date that the application was filed, including year - probably also the filing date&lt;br /&gt;
* ipcrsubclass (varchar(255)) International patent classification subclass&lt;br /&gt;
* ipcrmaingroup (varchar(255)) International patent classification main group&lt;br /&gt;
* ipcrsubgroup (varchar(255)) International patent classification subgroup&lt;br /&gt;
* ipcrtotal (varchar(255)) concatenated version of ipcrsubgroup, ipcrsubclass, ipcrmaingroup&lt;br /&gt;
* national_classification (varchar(255)) - probably uspc, which is United States Patent Classification&lt;br /&gt;
* national_classification_country (varchar(255)) - should be United States&lt;br /&gt;
* num_claims (int) number of claims (things that the inventor wishes to protect with this patent)&lt;br /&gt;
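&lt;br /&gt;
The appnum format noted above (a two digit series code followed by a six digit serial number) can be checked when loading data. The sketch below is illustrative only: the names APPNUM_PATTERN and split_appnum are hypothetical, and accepting an optional slash between the two parts is an assumption about how the numbers might be written, not something confirmed from the XML files.&lt;br /&gt;
&lt;br /&gt;
```python
import re

# Two digit series code, optional slash (an assumption), six digit serial number.
APPNUM_PATTERN = re.compile(r"^(\d{2})/?(\d{6})$")


def split_appnum(appnum):
    """Return (series_code, serial_number), or None if the format does not match."""
    match = APPNUM_PATTERN.match(appnum.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)
```
&lt;br /&gt;
Values that return None could be logged for manual review rather than silently inserted into the table.&lt;br /&gt;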
&lt;br /&gt;
Discrepancies: &lt;br /&gt;
These fields were previously a part of the table, but we do not currently have xpaths for them:&lt;br /&gt;
* app_year - the year the application was filed&lt;br /&gt;
* grant_year - the year the patent was granted&lt;br /&gt;
* uspc - United States Patent Classification (might be the same as national_classification)&lt;br /&gt;
* uspcsub - United States Patent Classification subgroup&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: Represent the person or organization to which the patent was assigned at publication, according to the patent information (the information in the Assignment database is more complete)&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the assignee for a patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number, first_name, last_name (in case it is possible for there to be multiple assignees for an individual patent)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* first_name (varchar(255)) first name of assignee&lt;br /&gt;
* last_name (varchar(255)) last name of assignee&lt;br /&gt;
* address (varchar(255)) address of assignee&lt;br /&gt;
* postcode (varchar(255)) postcode of assignee&lt;br /&gt;
* orgname (varchar(255)) name of the organization owning the patent (if applicable - organization may go by multiple orgnames) &lt;br /&gt;
* city (varchar(255)) city of the assignee&lt;br /&gt;
* country (varchar(255)) country of the assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* residence (varchar(255)) mailing address of assignee if they receive mail at a different address than the one listed above&lt;br /&gt;
* size_of_firm (varchar(255)) size of the organization (large, small, or micro), which relates to how large a fee they pay for the patent&lt;br /&gt;
&lt;br /&gt;
===CITATION===&lt;br /&gt;
Purpose: Represent all the patents in PATENT that another patent in PATENT cites&lt;br /&gt;
&lt;br /&gt;
Each entry represents: one citation that a patent in PATENT makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number AND cited_patent_number (because a patent might cite multiple patents, so both variables are needed for the primary key)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* cited_patent_number (varchar(255)) unique identifier from the USPTO for the patent being cited&lt;br /&gt;
* description (varchar(255)) description of the citation&lt;br /&gt;
* citation_date (date) unclear what this date is supposed to represent from the DTD, but according to the paper that is the second link in the top section, it is the &amp;quot;date of cited document&amp;quot;&lt;br /&gt;
* citation_kind (varchar(255)) type of cited document&lt;br /&gt;
* citation_country (varchar(255)) country of origin for the cited patent&lt;br /&gt;
* citation_name (varchar(255)) appears to be the names of those to whom the cited patent is assigned&lt;br /&gt;
&lt;br /&gt;
Note: There are other fields available for this table, such as document number, but this seems like it might be a repeat of the document number for the citing patent&lt;br /&gt;
&lt;br /&gt;
===FEE===&lt;br /&gt;
Purpose: Represent information about the fees paid for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the fees paid on the patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* four (boolean) payment of maintenance fee 4th year&lt;br /&gt;
* eight (boolean) payment of maintenance fee 8th year&lt;br /&gt;
* twelve (boolean) payment of maintenance fee 12th year&lt;br /&gt;
* fee_date (date) date that fee is paid on&lt;br /&gt;
* fee_code (varchar(255)) code for type of fee to pay&lt;br /&gt;
&lt;br /&gt;
===CLAIMS===&lt;br /&gt;
&lt;br /&gt;
Purpose: Represent information about the claims for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a claim that a patent makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number and claim_id&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* claim_id (varchar(255)) id number for the claim&lt;br /&gt;
* language (varchar(255)) language of claim&lt;br /&gt;
* claim_type (varchar(255)) type of claim&lt;br /&gt;
* status (varchar(255)) status of the claim&lt;br /&gt;
* claim_text (text) the text of the claim&lt;br /&gt;
&lt;br /&gt;
===HISTPATENT===&lt;br /&gt;
Purpose: Represent historical data about the patent pertaining to dates and publication&lt;br /&gt;
&lt;br /&gt;
Each entry represents: sequence of historical data about an individual patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* pubno (varchar(255)) patent publication number&lt;br /&gt;
* pubdate (date) date patent was issued&lt;br /&gt;
* dispdate (date) disposal date for the application&lt;br /&gt;
* distype (varchar(3)) application status of the patent - ABN for abandoned, ISS for issued, or PEN for pending&lt;br /&gt;
* exp_dt (date) expiration date of patent&lt;br /&gt;
* pta (int) length of patent term adjustment (which extends the amount of time the patent is allowed to be in force) in number of days&lt;br /&gt;
&lt;br /&gt;
===INVENTORS===&lt;br /&gt;
Purpose: Represent the inventors for each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about an inventor for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (and name, if there can be multiple inventors for a single patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identify the inventor?&lt;br /&gt;
* name (varchar(255)) concatenated string of the first and last name of the inventor&lt;br /&gt;
* orgname (varchar(255)) name of the organization the inventor works for&lt;br /&gt;
* state (varchar(255)) - state in which the inventor resided&lt;br /&gt;
* address (varchar(255)) - address at which the inventor resided&lt;br /&gt;
* country (varchar(255)) country in which the inventor resided&lt;br /&gt;
* city (varchar(255)) city in which the inventor resided&lt;br /&gt;
&lt;br /&gt;
===LAWYERS===&lt;br /&gt;
Purpose: Represent the lawyers who worked on each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a lawyer for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (there doesn't appear to be more than one lawyer entry for each patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identify the lawyer?&lt;br /&gt;
* org_name (varchar(255)) the name of the firm that did arbitration for the patent&lt;br /&gt;
&lt;br /&gt;
===Unique Attributes Tables===&lt;br /&gt;
&lt;br /&gt;
There are several attributes that have been identified as unique, applying to only one kind of patent. Adding all these fields to the main patent table would introduce a lot of NULL entries, because three out of four patent types will not have any particular attribute. Instead, I have created three new tables to store these unique attributes - one for each type of patent with unique attributes.&lt;br /&gt;
&lt;br /&gt;
For each table, the primary key is the patent_number, which connects a row to a row in the main patent table. &lt;br /&gt;
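&lt;br /&gt;
To illustrate how a side table keyed on patent_number keeps the NULL-heavy columns out of the main table, here is a minimal sketch using Python's sqlite3 module. It is an assumption-laden example rather than the project's code: only two columns per table are shown, the function names are hypothetical, and REISSUE is used as the example with a single parent_doc_country column.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3


def build_patent_db():
    """Create a toy PATENT table and one unique-attributes table (REISSUE)."""
    connection = sqlite3.connect(":memory:")
    connection.executescript("""
        CREATE TABLE patent (
            patent_number VARCHAR(255) PRIMARY KEY,
            patent_type VARCHAR(255)
        );
        CREATE TABLE reissue (
            patent_number VARCHAR(255) PRIMARY KEY
                REFERENCES patent (patent_number),
            parent_doc_country VARCHAR(255)
        );
    """)
    return connection


def patents_with_reissue_fields(connection):
    """List every patent; reissue-only fields are None for other patent types."""
    query = """
        SELECT p.patent_number, p.patent_type, r.parent_doc_country
        FROM patent AS p
        LEFT JOIN reissue AS r ON r.patent_number = p.patent_number
        ORDER BY p.patent_number
    """
    return connection.execute(query).fetchall()
```
&lt;br /&gt;
The LEFT JOIN recreates the wide view only when it is needed, so the reissue-only columns are stored once per reissue patent instead of as NULLs on every row of the main table.&lt;br /&gt;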
&lt;br /&gt;
TODO: Figure out best way to generate these tables, finish description of attributes for each table.&lt;br /&gt;
&lt;br /&gt;
* Note that I did include some unique attributes for Utility patents as seen below in the &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot; section, but for now (11/3/2017) I'm not going to create a separate table for those fields. Most of them seem to be repeated in Reissue, and the other fields, while interesting, are perhaps not very useful.&lt;br /&gt;
&lt;br /&gt;
====DESIGN====&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* latin_name (varchar(255)) latin name for the plant&lt;br /&gt;
* us_botanical_variety(varchar(255)) denotes what variety of plant it is - for example, a rose has several different varieties&lt;br /&gt;
&lt;br /&gt;
====REISSUE====&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* parent_doc_status (varchar(255)) status of the parent document. It is unclear whether this refers to a parent reissue application (if there happen to be multiple applications relating to the reissue of this patent) or to the patent that is being reissued. This goes for all fields beginning with &amp;quot;parent_doc&amp;quot;&lt;br /&gt;
* parent_doc_number (int) number for the parent document&lt;br /&gt;
* parent_doc_id (varchar(255)) it is unclear how this is different from parent document number. In the xpaths, the path will be something like ./parent_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent document&amp;quot; is - I will need to look in an actual XML file to see what this field looks like&lt;br /&gt;
* parent_doc_country (varchar(255)) the country of the parent document&lt;br /&gt;
* parent_doc_date (date) probably the date the parent document was published&lt;br /&gt;
 &lt;br /&gt;
In addition to parent document, there is a subcategory called parent_grant_document that has the same fields (minus status). This is probably a patent that has been granted, given that the field names include &amp;quot;grant&amp;quot;, but I am not sure. Again, I am not sure what all of these fields represent.&lt;br /&gt;
&lt;br /&gt;
* parent_grant_doc_number (int) number for the parent grant document&lt;br /&gt;
* parent_grant_doc_id (varchar(255)) it is unclear how this is different from parent grant document number. In the xpaths, the path will be something like ./parent_doc/parent-grant-doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_grant_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent grant document&amp;quot; is - I will need to look in an actual XML file to see what this field looks like&lt;br /&gt;
* parent_grant_doc_country (varchar(255)) the country of the parent grant document&lt;br /&gt;
* parent_grant_doc_date (date) probably the date the parent grant document was published&lt;br /&gt;
&lt;br /&gt;
====PLANT====&lt;br /&gt;
&lt;br /&gt;
==Connecting Patent database and Assignment database==&lt;br /&gt;
The answer to connecting the Patent database to the Assignment database lies in using the information in the DOCUMENT_INFO table to connect each assignment in the Assignment database's ASSIGNMENT table to a patent_number from the Patent database's PATENT table. On further investigation into &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot;, the field the authors call &amp;quot;grant_doc_num&amp;quot; (patent_number above) in DOCUMENT_INFO is the patent number (the description of DOCUMENT_INFO above has been altered to reflect this).&lt;br /&gt;
&lt;br /&gt;
The paper mentions that there will be errors in the patent number stored in DOCUMENT_INFO under grant_doc_num, so the authors constructed a separate table called DOCUMENT_INFO_ADMIN to determine how prevalent the errors were. They queried the patent number for each appno_doc_num from administrative data. The &amp;quot;appno_doc_num&amp;quot; is the Application Document USPTO number from DOCUMENT_INFO. On further investigation into the paper, it became clear that &amp;quot;administrative data&amp;quot; refers to internal USPTO data that was available to the authors of the paper but that we might not have. However, in 99% of cases the grant_doc_num and the patent number queried from the appno_doc_num were the same, so we can probably rely on grant_doc_num.&lt;br /&gt;
&lt;br /&gt;
Therefore, since every table in the Patent database will have the patent number stored, to connect any table in Assignment to the Patent database, it would first be joined with DOCUMENT_INFO on rf_id and then joined with the appropriate table in Patent on patent number.&lt;br /&gt;
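&lt;br /&gt;
The two-step join described above can be sketched as follows with Python's sqlite3 module. This is only a sketch under assumptions: both databases are represented as tables reachable from one connection, only a few columns are included, ASSIGNOR stands in for &amp;quot;any table in Assignment&amp;quot;, and the function names build_demo_db and assignors_for_patent are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3


def build_demo_db():
    """Create toy versions of PATENT, DOCUMENT_INFO, and ASSIGNOR."""
    connection = sqlite3.connect(":memory:")
    connection.executescript("""
        CREATE TABLE patent (
            patent_number VARCHAR(255) PRIMARY KEY,
            title VARCHAR(255)
        );
        CREATE TABLE document_info (
            rf_id BIGINT PRIMARY KEY,
            patent_number VARCHAR(255)
        );
        CREATE TABLE assignor (
            rf_id BIGINT,
            assignor_name VARCHAR(255),
            PRIMARY KEY (rf_id, assignor_name)
        );
    """)
    return connection


def assignors_for_patent(connection, patent_number):
    """Join ASSIGNOR to DOCUMENT_INFO on rf_id, then to PATENT on patent_number."""
    query = """
        SELECT a.rf_id, a.assignor_name, p.title
        FROM assignor AS a
        JOIN document_info AS d ON d.rf_id = a.rf_id
        JOIN patent AS p ON p.patent_number = d.patent_number
        WHERE p.patent_number = ?
        ORDER BY a.rf_id, a.assignor_name
    """
    return connection.execute(query, (patent_number,)).fetchall()
```
&lt;br /&gt;
An assignment whose patent number in DOCUMENT_INFO has no match in PATENT simply drops out of the inner join, which is one place where the errors in grant_doc_num mentioned above would show up.&lt;br /&gt;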
&lt;br /&gt;
The paper also mentions Assignments on the Web for Patents (AOTW-P), a searchable database of individual USPTO assignment records keyed on reel-frame identification, patent number, and assignor or assignee name (https://assignment.uspto.gov/patent/index.html#/patent/search). It would not be practical to query every patent number through this tool by hand, but if a script could query each record by rf_id and parse the response, it could be used to check the patent numbers. The results might not be any more accurate than what will already be in DOCUMENT_INFO, however.&lt;br /&gt;
&lt;br /&gt;
==Finding New Paths Unique to Plant, Reissue, and/or Design Patents==&lt;br /&gt;
&lt;br /&gt;
Based on Oliver's script, which searched all xpaths and compared which were unique to particular types, we see that the following attributes are unique to each type of patent other than utility patents. These attributes vary by XML version, which changed over time. Therefore, each list below is a superset of the attributes that are unique to that patent type across all XML versions.&lt;br /&gt;
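&lt;br /&gt;
The idea of a superset of type-unique paths can be expressed with set operations. The sketch below is not Oliver's script; it is a hypothetical illustration that assumes the xpaths have already been collected into a dictionary per XML version and per patent type, and the function name unique_paths_by_type is made up for this example.&lt;br /&gt;
&lt;br /&gt;
```python
def unique_paths_by_type(paths_by_version):
    """Union, across XML versions, of the xpaths unique to each patent type.

    paths_by_version: {xml_version: {patent_type: set_of_xpaths}}
                      (hypothetical input shape).
    Returns {patent_type: set_of_xpaths}.
    """
    superset = {}
    for paths_by_type in paths_by_version.values():
        for patent_type, paths in paths_by_type.items():
            # Collect every path seen for the other types in this version.
            others = set()
            for other_type, other_paths in paths_by_type.items():
                if other_type != patent_type:
                    others |= other_paths
            # Keep only the paths no other type has, and add them to the running union.
            superset.setdefault(patent_type, set()).update(paths - others)
    return superset
```
&lt;br /&gt;
Note that uniqueness is judged within each XML version before taking the union, so a path counts as unique to a type if it was unique in at least one version.&lt;br /&gt;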
&lt;br /&gt;
When I am done finding the supersets, we will determine how to integrate these attributes into the design for the patent table.&lt;br /&gt;
&lt;br /&gt;
'''There appears to be a fifth patent type in the results from Oliver's script, and there are attributes that are unique to Utility patents, though they do not appear particularly useful'''&lt;br /&gt;
&lt;br /&gt;
===Plant Patents===&lt;br /&gt;
* US Claim Statement (XML45) - unclear what this represents&lt;br /&gt;
* Latin Name (XML44)&lt;br /&gt;
* US Botanical Variety (XML44)&lt;br /&gt;
&lt;br /&gt;
===Reissue Patents===&lt;br /&gt;
* Parent Status (XML45)&lt;br /&gt;
* Parent Document (XML44) - seems to be either a normal document or a parent grant document (or both), both contain the following fields&lt;br /&gt;
** Date&lt;br /&gt;
** Document number&lt;br /&gt;
** Country &lt;br /&gt;
** Document ID&lt;br /&gt;
** Kind (XML42)&lt;br /&gt;
* Child Document (XML44)&lt;br /&gt;
** Document ID&lt;br /&gt;
** Document Number&lt;br /&gt;
** Country (XML43)&lt;br /&gt;
* Continuing Reissue (XML44)&lt;br /&gt;
** Document ID&lt;br /&gt;
** Relation&lt;br /&gt;
** Document Number&lt;br /&gt;
&lt;br /&gt;
Note about fields related to the parent document and/or parent grant document:&lt;br /&gt;
The structure of the above fields related to a reissue patent is very odd and varies by XML version. For example, for most of the XML versions, parent grant document falls under parent document and has its own document number, date, country, document ID, kind, etc. The parent doc field falls under relation, which falls under reissue.&lt;br /&gt;
&lt;br /&gt;
However, in XML41, parent document falls under relation and a new category - US Reexamination Reissue Merger. The information pertaining to the child document also falls under both relation and US Reexamination Reissue Merger. It's possible that during the time period this XML version represents, there were two types of reissue patents - a reexamination related to a merger and a standard reissue. With discrepancies like these, we'll have to determine in the end whether we want to stick with the fields as they are designed above (in which case the information about a parent doc or parent grant document is stored the same way whether it is a reissue or a Reexamination Reissue), or whether these two classifications are different enough to warrant duplicate fields in the table.&lt;br /&gt;
&lt;br /&gt;
===Design Patents===&lt;br /&gt;
* length of grant (XML45)&lt;br /&gt;
* Hague Agreement Data (XML45) - allows people to file design patents in 66 countries with one application&lt;br /&gt;
** International Registration Date&lt;br /&gt;
** International Filing Data&lt;br /&gt;
** International Registration Publication Date&lt;br /&gt;
** International Registration Number&lt;br /&gt;
* Classification Locarno (XML40)&lt;br /&gt;
** Edition&lt;br /&gt;
** Main Classification&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please note that the lists above are meant to represent unique fields that should be added to the patent database. This does not mean that every XML version contains all these fields, or that every version contains the same path to these fields. That will have to be determined separately. The XML number next to some of the items is meant to represent the latest XML version where the field was seen.&lt;br /&gt;
&lt;br /&gt;
Though I was mainly looking at unique fields in plant, reissue, and design patents, I noticed the following fields on unique paths for utility patents in XML41:&lt;br /&gt;
&lt;br /&gt;
* US Related Documents&lt;br /&gt;
** Substitution&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Parent Document&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Country&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
***** Parent Status&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Country&lt;br /&gt;
** Continuation/Continuation in Part&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Doc&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
** Division&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Kind&lt;br /&gt;
&lt;br /&gt;
This appears to resemble the information about reissue patents. Additionally, a couple of fields appear under the information about bibliographic data that might be useful:&lt;br /&gt;
* US Deceased Inventor (XML41)&lt;br /&gt;
** Post Code&lt;br /&gt;
** Address&lt;br /&gt;
* US Provisional Application Status (XML41)&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=21603</id>
		<title>Redesign Assignment and Patent Database</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=21603"/>
		<updated>2017-11-03T21:58:11Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* REISSUE */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Redesign Assignment and Patent Database&lt;br /&gt;
|Has owner=Shelby Bice,&lt;br /&gt;
|Has start date=9/2017&lt;br /&gt;
|Has keywords=patent&lt;br /&gt;
|Is dependent on=Reproducible Patent Data,&lt;br /&gt;
}}&lt;br /&gt;
'''FOR ED:''' I finished going through the unique paths (you can see the unique fields I identified at the bottom of this page, under &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot;). Then, under my design for the Patent Database, there's a new section called &amp;quot;Unique Attributes Tables&amp;quot; where I've begun detailing a design for three new tables that would include information that is unique to Design, Reissue, and Plant patents. Here are some questions I have:&lt;br /&gt;
&lt;br /&gt;
'''1.''' Do you like the design for the new tables? How do you think we should populate these tables? I can think of one way (putting all the information in the main patent tables and then moving it out to Plant, Reissue, or Design as needed) but I was wondering if there might be an easier way.&lt;br /&gt;
&lt;br /&gt;
'''2.''' There were some unique attributes to utility patents that I also included in &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot;. They only appear to be in two of the oldest XML versions (4.1 and 4.0) and I'm not sure the information is particularly useful - but please take a look at the field names and let me know if you disagree and think they are useful and should be included in a table in the patent database.&lt;br /&gt;
&lt;br /&gt;
'''3.''' Do you know what some of these fields represent? For example, I've been trying to research online what a &amp;quot;Parent Document&amp;quot; and/or &amp;quot;Parent Grant Document&amp;quot; might represent for a reissue patent, and I've found a couple possible options. I will keep researching, but if you know what they represent or one of my descriptions for a field is wrong, please let me know.&lt;br /&gt;
&lt;br /&gt;
This is an extension of the work I did last semester under &amp;quot;Redesigning Patent Database&amp;quot;. Instead of simply reconfiguring the existing database, this project encompasses a full redesign and the creation of a new Patent database and a new Assignment database that will be joined together.&lt;br /&gt;
&lt;br /&gt;
Reference for definitions of fields in the two databases: https://www.oecd.org/sti/sci-tech/37569498.pdf&lt;br /&gt;
&lt;br /&gt;
Paper on how to extract data from XML files, parse it, and put it into a database: https://funginstitute.berkeley.edu/wp-content/uploads/2014/06/patentprocessor.pdf&lt;br /&gt;
&lt;br /&gt;
'''Particularly useful, because it provides solutions for cleaning location and firm data'''&lt;br /&gt;
&lt;br /&gt;
==ER Diagram for Assignment and Patent Databases==&lt;br /&gt;
&lt;br /&gt;
Note: this is currently out of sync with the current description below for the patent database. It will be updated once I have finalized how we will organize unique attributes for plant, design, and reissue patents.&lt;br /&gt;
&lt;br /&gt;
[[File:PatentAndAssignmentER2.png]]&lt;br /&gt;
&lt;br /&gt;
Attributes for each table are listed below with descriptions - because of how many attributes there are, I decided the ER diagram would be better suited as an overview of the tables rather than trying to show all the attributes on the diagram.&lt;br /&gt;
&lt;br /&gt;
==Assignment Database Structure==&lt;br /&gt;
&lt;br /&gt;
After reading through the paper &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot; (https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf) and taking into account our needs for the Assignment database (the ability to connect the two databases together and the ability to trace conveyance of a patent over time), I decided on the following structure for the Assignment database.&lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table in the database is rf_id, which stands for reel frame id. It is unique for each entry in the main table, ASSIGNMENT, and is included in every table.&lt;br /&gt;
&lt;br /&gt;
Conveyance is the only table that cannot be fully constructed as data is inserted into the database. Some of its fields depend on other tables being constructed, so it will either be partially populated while the other tables are being populated or be the last table to be populated.&lt;br /&gt;
&lt;br /&gt;
Full list of tables: assignee, assignor, document_info, assignment, conveyance&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: store all the assignees from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the recipient of a patent in an entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id, or rf_id and assignee_name if an assignment can have multiple assignees&lt;br /&gt;
&lt;br /&gt;
All columns (variables) in the table: &lt;br /&gt;
&lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* address_line_1 (varchar(255)) first line of address of assignee&lt;br /&gt;
* address_line_2 (varchar(255)) second line of address of assignee&lt;br /&gt;
* city (varchar(255)) city of assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* country (varchar(255)) country of assignee&lt;br /&gt;
* postal_code (varchar(255)) post code of the assignee&lt;br /&gt;
&lt;br /&gt;
===ASSIGNOR===&lt;br /&gt;
	&lt;br /&gt;
Purpose: store all the assignors from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the one assigning/granting the license for the entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id and assignor_name&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
&lt;br /&gt;
===CONVEYANCE===&lt;br /&gt;
Purpose: represent the conveyance of a particular patent over time&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual transaction for a patent (not unlike assignment, except this table has information about the assignment in relation to other assignments for the patent)&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id &lt;br /&gt;
&lt;br /&gt;
All columns (variables): Two possibilities, depending on whether we want data to be repeated or not.&lt;br /&gt;
	&lt;br /&gt;
Not-repeated (with exception of primary/foreign key): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* next_transaction (bigint) rf_id of next transaction that occurred after this transaction (traced by assignee/assignor)&lt;br /&gt;
* num_conveyance (int) number in order of conveyance by date&lt;br /&gt;
* acknowledgment_date (date) date that the assignor acknowledged the transaction&lt;br /&gt;
* execution_date (date) date that the transaction took place&lt;br /&gt;
* conveyance_type (varchar(255)) the USPTO paper defines several conveyance types and outlines how to determine, based on the conveyance text in the XML file, which conveyance type applies to the transaction. The conveyance types are:&lt;br /&gt;
** assignment&lt;br /&gt;
** merger&lt;br /&gt;
** change of name&lt;br /&gt;
** government interest&lt;br /&gt;
** agreement&lt;br /&gt;
** security agreement&lt;br /&gt;
** release&lt;br /&gt;
&lt;br /&gt;
	&lt;br /&gt;
Repeated, also include: &lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
* patent_number (varchar(255)) identifying number for the patent&lt;br /&gt;
&lt;br /&gt;
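A minimal Python sketch of how num_conveyance and next_transaction could be filled in once the other tables are loaded. It simply orders the transactions for each patent by execution_date; tracing by assignee/assignor, as described above, is not attempted here. The sample data, the function name build_conveyance_rows, and the assumption that each rf_id covers a single patent are all made up for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict
from datetime import date

# Made-up transactions: (rf_id, patent_number, execution_date).
transactions = [
    (300, "7000001", date(2012, 6, 1)),
    (100, "7000001", date(2008, 3, 15)),
    (200, "7000001", date(2010, 1, 20)),
    (400, "7000002", date(2011, 9, 9)),
]


def build_conveyance_rows(rows):
    """Return {rf_id: (num_conveyance, next_transaction)}, ordered by date per patent."""
    by_patent = defaultdict(list)
    for rf_id, patent_number, execution_date in rows:
        by_patent[patent_number].append((execution_date, rf_id))

    result = {}
    for chain in by_patent.values():
        chain.sort()  # chronological order within one patent
        for position, (_, rf_id) in enumerate(chain):
            is_last = position == len(chain) - 1
            next_rf_id = None if is_last else chain[position + 1][1]
            result[rf_id] = (position + 1, next_rf_id)
    return result


conveyance = build_conveyance_rows(transactions)
print(conveyance[100])  # (1, 200)
```
&lt;br /&gt;
Because every value depends on the full set of transactions for a patent, this step would have to run after ASSIGNMENT and DOCUMENT_INFO are populated, which matches the note above that CONVEYANCE is populated last.&lt;br /&gt;
&lt;br /&gt;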
===DOCUMENT_INFO===&lt;br /&gt;
Purpose: store extra information relevant to the patents represented in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the documentation and the patent for a particular entry in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables) when created: &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* app_num (varchar(255)) application document USPTO number &lt;br /&gt;
* app_date (date) date that the application was filed&lt;br /&gt;
* app_country (varchar(255)) country in which it was filed&lt;br /&gt;
* pgpub_num (varchar(255)) pre-grant publication document USPTO number&lt;br /&gt;
* pgpub_date (date) date that pre-grant publication was released&lt;br /&gt;
* pgpub_country (varchar(255)) country in which the patent is published before being granted a patent number&lt;br /&gt;
* patent_number (varchar(255)) granted patent document USPTO number&lt;br /&gt;
* grant_date (date) date that the patent is officially published&lt;br /&gt;
* grant_country (varchar(255)) country in which the patent is published after being granted a patent number&lt;br /&gt;
* invention_title (varchar(255)) title of the invention&lt;br /&gt;
* language (varchar(255)) language of invention title – could be potentially useful and interesting to investigate &lt;br /&gt;
* reel_num (bigint) number of the reel the assignment was stored on&lt;br /&gt;
* frame_num (bigint) number of the frame on the reel the assignment was stored on&lt;br /&gt;
&lt;br /&gt;
===ASSIGNMENT===&lt;br /&gt;
Purpose: represent assignment transactions – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent assignment transaction&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignment_id (bigint) US patent assignment id &lt;br /&gt;
* correspondent_name (varchar(255)) name of correspondent for the transaction, could be a lawyer&lt;br /&gt;
* correspondent_address1 (varchar(255)) first line of address for correspondent&lt;br /&gt;
* correspondent_address2 (varchar(255)) second line of address for correspondent&lt;br /&gt;
* correspondent_address3 (varchar(255)) third line of address for correspondent&lt;br /&gt;
* correspondent_address4 (varchar(255)) fourth line of address for correspondent&lt;br /&gt;
* record_date (date) date recorded with USPTO&lt;br /&gt;
* last_update_date (date) date that information for this assignment was last updated with the USPTO &lt;br /&gt;
* page_cnt (bigint) page count of assignment record&lt;br /&gt;
&lt;br /&gt;
More extensive notes exist under :E/McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper&lt;br /&gt;
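&lt;br /&gt;
As a rough illustration of how rf_id ties the tables together, the following is a minimal Python sketch using the standard sqlite3 module. It creates only ASSIGNMENT and ASSIGNOR with a subset of the columns listed above; the use of SQLite, the column types, and the sample values are assumptions made for the example, not the project's actual code.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

# In-memory database purely for illustration (assumption: SQLite stands in
# for whatever database server the project actually uses).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE assignment (
    rf_id        INTEGER PRIMARY KEY,   -- reel frame id
    record_date  TEXT,
    page_cnt     INTEGER
);
CREATE TABLE assignor (
    rf_id          INTEGER NOT NULL REFERENCES assignment (rf_id),
    assignor_name  TEXT NOT NULL,
    PRIMARY KEY (rf_id, assignor_name)
);
""")

# Made-up sample rows: one assignment with two assignors.
conn.execute("INSERT INTO assignment VALUES (1000100, '2017-01-05', 3)")
conn.executemany(
    "INSERT INTO assignor VALUES (?, ?)",
    [(1000100, "DOE, JANE"), (1000100, "ROE, RICHARD")],
)

# An assignor row pointing at an rf_id that is not in ASSIGNMENT is rejected.
try:
    conn.execute("INSERT INTO assignor VALUES (9999999, 'NOBODY')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

assignor_count = conn.execute(
    "SELECT COUNT(*) FROM assignor WHERE rf_id = 1000100"
).fetchone()[0]
print(orphan_rejected, assignor_count)
```
&lt;br /&gt;
The composite key on ASSIGNOR mirrors the primary key described above (rf_id and assignor_name), which allows several assignors per assignment while still rejecting exact duplicates.&lt;br /&gt;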
&lt;br /&gt;
==Patent Database Structure==&lt;br /&gt;
&lt;br /&gt;
Most of this design is based off what I investigated last semester and therefore is subject to change, since the design for the Assignment database is somewhat new. &lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table is patent_number, which uniquely identifies every patent in the PATENT table and is included in every table.&lt;br /&gt;
&lt;br /&gt;
Full list of tables: patent, assignee, citation, fee, claims, histpatent, lawyers, inventors&lt;br /&gt;
&lt;br /&gt;
Update: These tables, as of October 26, are being updated and changed as I determine what data can actually be retrieved from the XML files based on the xpaths that we have determined (you can see these xpaths on the &amp;quot;Equivalent XPath and APS Queries&amp;quot; project page).&lt;br /&gt;
&lt;br /&gt;
===PATENT===&lt;br /&gt;
Purpose: represent patents – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* patent_type (varchar(255)) the type of patent, like utility&lt;br /&gt;
* patent_kind (varchar(2)) the kind of patent, a letter and a number&lt;br /&gt;
* title (varchar(255)) the title of the patent&lt;br /&gt;
* grantdate (date) date that patent was officially granted including the year&lt;br /&gt;
* prioritydate (date) date used to establish the novelty of an invention relative to other inventions (source: http://www.bios.net/daisy/patentlens/2343.html)&lt;br /&gt;
* prioritycountry (varchar(255)) country where patent is first filed&lt;br /&gt;
* patent_country (varchar(255)) country in which this patent was published&lt;br /&gt;
* prioritypatentnumber (varchar(255)) placeholder patent number if the patent is published before it is fully approved - also referred to as priority claims number&lt;br /&gt;
* cpcsubclass (varchar(255)) cooperative patent classification subclass&lt;br /&gt;
* cpcsubgroup (varchar(255)) cooperative patent classification subgroup&lt;br /&gt;
* cpcmaingroup (varchar(255)) cooperative patent classification main group&lt;br /&gt;
* cpctotal (varchar(255)) concatenated version of cpcmaingroup, cpcsubgroup, cpcsubclass&lt;br /&gt;
* pctpatentnumber (varchar(255)) international patent number (according to Patent Cooperation Treaty) also known as PCT Document Number&lt;br /&gt;
* appnum (varchar(255)) application number, probably also the filing number - the format is a two digit series code followed by a six digit serial number (source: https://www.uspto.gov/patents-application-process/filing-online/info-application-number)&lt;br /&gt;
* appdate (date) date that the application was filed, including year - probably also the filing date&lt;br /&gt;
* ipcrsubclass (varchar(255)) International patent classification subclass&lt;br /&gt;
* ipcrmaingroup (varchar(255)) International patent classification main group&lt;br /&gt;
* ipcrsubgroup (varchar(255)) International patent classification subgroup&lt;br /&gt;
* ipcrtotal (varchar(255)) concatenated version of ipcrsubgroup, ipcrsubclass, ipcrmaingroup &lt;br /&gt;
* national_classification (varchar(255)) - probably uspc, which is United States Patent Classification&lt;br /&gt;
* national_classification_country (varchar(255)) - should be United States&lt;br /&gt;
* num_claims (int) number of claims (things that the inventor wishes to protect with this patent)&lt;br /&gt;
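&lt;br /&gt;
The appnum format described above (a two digit series code followed by a six digit serial number) can be checked with a small helper. This is only a sketch: the function name split_appnum is hypothetical, and tolerating a slash and a comma (as in the printed form 13/456,789) is an assumption about how the value might appear, not something confirmed from the XML files.&lt;br /&gt;
&lt;br /&gt;
```python
import re

# Two-digit series code followed by a six-digit serial number, with an
# optional slash and comma as sometimes used in the printed form.
_APPNUM = re.compile(r"^(\d{2})/?(\d{3}),?(\d{3})$")


def split_appnum(appnum):
    """Return (series_code, serial_number), or None if the format does not match."""
    match = _APPNUM.match(appnum.strip())
    if match is None:
        return None
    return match.group(1), match.group(2) + match.group(3)


print(split_appnum("13/456,789"))
```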
&lt;br /&gt;
Discrepancies: &lt;br /&gt;
These fields were previously a part of the table, but we do not currently have xpaths for them:&lt;br /&gt;
* app_year - the year the application was filed&lt;br /&gt;
* grant_year - the year the patent was granted&lt;br /&gt;
* uspc - United States Patent Classification (might be the same as national_classification)&lt;br /&gt;
* uspcsub - United States Patent subgroup&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: Represent the person the patent was assigned to at publication according to the patent information (the information in the Assignment database is more complete)&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the assignee for a patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number, first_name, last_name (in case it is possible for there to be multiple assignees for an individual patent)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* first_name (varchar(255)) first name of assignee&lt;br /&gt;
* last_name (varchar(255)) last name of assignee&lt;br /&gt;
* address (varchar(255)) address of assignee&lt;br /&gt;
* postcode (varchar(255)) postcode of assignee&lt;br /&gt;
* orgname (varchar(255)) name of the organization owning the patent (if applicable - organization may go by multiple orgnames) &lt;br /&gt;
* city (varchar(255)) city of the assignee&lt;br /&gt;
* country (varchar(255)) country of the assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* residence (varchar(255)) mailing address of assignee if they receive mail at a different address than the one listed above&lt;br /&gt;
* size_of_firm (varchar(255)) size of the organization (large, small, or micro), which determines how large a fee they pay for the patent&lt;br /&gt;
&lt;br /&gt;
===CITATION===&lt;br /&gt;
Purpose: Represent all the patents in PATENT that another patent in PATENT cites&lt;br /&gt;
&lt;br /&gt;
Each entry represents: one citation that a patent in PATENT makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number AND cited_patent_number (because a patent might cite multiple patents, so both variables are needed for the primary key)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* cited_patent_number (varchar(255)) unique identifier from the USPTO for the patent being cited&lt;br /&gt;
* description (varchar(255)) description of the citation&lt;br /&gt;
* citation_date (date) unclear what this date is supposed to represent from the DTD, but according to the paper that is the second link in the top section, it is the &amp;quot;date of cited document&amp;quot;&lt;br /&gt;
* citation_kind (varchar(255)) type of cited document&lt;br /&gt;
* citation_country (varchar(255)) country of origin for the cited patent &lt;br /&gt;
* citation_name (varchar(255)) appears to be the names of those to whom the cited patent is assigned&lt;br /&gt;
&lt;br /&gt;
Note: There are other fields available for this table, such as document number, but this seems like it might be a repeat of the document number for the citing patent&lt;br /&gt;
&lt;br /&gt;
===FEE===&lt;br /&gt;
Purpose: Represent information about the fees paid for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the fees paid on the patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* four (boolean) payment of maintenance fee 4th year&lt;br /&gt;
* eight (boolean) payment of maintenance fee 8th year&lt;br /&gt;
* twelve (boolean) payment of maintenance fee 12th year&lt;br /&gt;
* fee_date (date) date that fee is paid on&lt;br /&gt;
* fee_code (varchar(255)) code for type of fee to pay&lt;br /&gt;
&lt;br /&gt;
===CLAIMS===&lt;br /&gt;
&lt;br /&gt;
Purpose: Represent information about the claims for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a claim that a patent makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number and claim_id&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* claim_id (varchar(255)) id number for the claim&lt;br /&gt;
* language (varchar(255)) language of claim&lt;br /&gt;
* claim_type (varchar(255)) type of claim&lt;br /&gt;
* status (varchar(255)) status of the claim&lt;br /&gt;
* claim_text (text) the text of the claim&lt;br /&gt;
&lt;br /&gt;
===HISTPATENT===&lt;br /&gt;
Purpose: Represent historical data about the patent pertaining to dates and publication&lt;br /&gt;
&lt;br /&gt;
Each entry represents: sequence of historical data about an individual patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* pubno (varchar(255)) patent publication number&lt;br /&gt;
* pubdate (date) date patent was issued&lt;br /&gt;
* dispdate (date) disposal date for the application&lt;br /&gt;
* distype (varchar(3)) application status of the patent - ABN for abandoned, ISS for issued, or PEN for pending&lt;br /&gt;
* exp_dt (date) expiration date of patent&lt;br /&gt;
* pta (int) length of patent term adjustment (which extends the amount of time the patent is allowed to be in force) in number of days&lt;br /&gt;
&lt;br /&gt;
===INVENTORS===&lt;br /&gt;
Purpose: Represent the inventors for each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about an inventor for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (and name, if there can be multiple inventors for a single patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identify the inventor?&lt;br /&gt;
* name (varchar(255)) concatenated string of the first and last name of the inventor&lt;br /&gt;
* orgname (varchar(255)) name of the organization the inventor works for&lt;br /&gt;
* state (varchar(255)) - state in which the inventor resided&lt;br /&gt;
* address (varchar(255)) - address at which the inventor resided&lt;br /&gt;
* country (varchar(255)) country in which the inventor resided&lt;br /&gt;
* city (varchar(255)) city in which the inventor resided&lt;br /&gt;
&lt;br /&gt;
===LAWYERS===&lt;br /&gt;
Purpose: Represent the lawyers who worked on each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a lawyer for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (there doesn't appear to be more than one lawyer entry for each patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identify the lawyer?&lt;br /&gt;
* org_name (varchar(255)) the name of the firm that did arbitration for the patent&lt;br /&gt;
&lt;br /&gt;
===Unique Attributes Tables===&lt;br /&gt;
&lt;br /&gt;
There are several attributes that have been identified as unique, applying to only one kind of patent. Adding all these fields to the main patent table would introduce a lot of NULL entries, because three out of the four patent types will not have any particular attribute. Instead, I have created three new tables to store these unique attributes - one for each type of patent with unique attributes.&lt;br /&gt;
&lt;br /&gt;
For each table, the primary key is the patent_number, which connects a row to a row in the main patent table. &lt;br /&gt;
&lt;br /&gt;
TODO: Figure out best way to generate these tables. &lt;br /&gt;
&lt;br /&gt;
* Note that I did include some unique attributes for Utility patents as seen below in the &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot; section, but for now (11/3/2017) I'm not going to create a separate table for those fields. Most of them seem to be repeated in Reissue, and the other fields, while interesting, are perhaps not very useful.&lt;br /&gt;
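&lt;br /&gt;
One possible way to generate these tables, offered only as a sketch: split each parsed patent record into a row for the main patent table and, when the patent type has unique attributes, a second row for the matching table. The mapping UNIQUE_COLUMNS, the function split_record, the lowercase patent_type values, and the sample record are all hypothetical; only the column names come from the lists on this page.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical mapping from patent type to (table name, unique columns).
UNIQUE_COLUMNS = {
    "plant": ("plant", ["latin_name", "us_botanical_variety"]),
    "reissue": ("reissue", ["parent_doc_status", "parent_doc_number"]),
}


def split_record(record):
    """Split one parsed patent into a main-table row and an optional extra row."""
    main_row = dict(record)
    extra = None
    table, own_columns = UNIQUE_COLUMNS.get(record.get("patent_type"), (None, []))
    if table is not None:
        extra_row = {"patent_number": record["patent_number"]}
        for column in own_columns:
            extra_row[column] = main_row.pop(column, None)
        extra = (table, extra_row)
    # Remove any unique columns that do not apply to this patent type, so the
    # main table never carries them.
    for _, other_columns in UNIQUE_COLUMNS.values():
        for column in other_columns:
            main_row.pop(column, None)
    return main_row, extra


record = {
    "patent_number": "PP12345",
    "patent_type": "plant",
    "title": "Rose plant",
    "latin_name": "Rosa hybrida",
    "us_botanical_variety": "Example",
}
main_row, extra = split_record(record)
print(main_row)
print(extra)
```
&lt;br /&gt;
Splitting at parse time avoids first loading everything into the main patent table and moving it out afterwards, but it does require the patent type to be known before the row is written.&lt;br /&gt;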
&lt;br /&gt;
====PLANT====&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* latin_name (varchar(255)) latin name for the plant&lt;br /&gt;
* us_botanical_variety (varchar(255)) denotes what variety of plant it is - for example, a rose has several different varieties&lt;br /&gt;
&lt;br /&gt;
====REISSUE====&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* parent_doc_status (varchar(255)) status of the parent document. It is unclear whether this refers to a parent reissue application (if there happen to be multiple applications relating to the reissue of this patent) or the original patent that is being reissued. This goes for all fields beginning &amp;quot;parent_doc&amp;quot;&lt;br /&gt;
* parent_doc_number (int) number for the parent document&lt;br /&gt;
* parent_doc_id (varchar(255)) it is unclear how this is different from parent document number. In the xpaths, the path will be something like ./parent_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent document&amp;quot; is - I will need to look in an actual XML file to see what this field looks like&lt;br /&gt;
* parent_doc_country (varchar(255)) the country of the parent document&lt;br /&gt;
* parent_doc_date (date) probably the date the parent document was published&lt;br /&gt;
 &lt;br /&gt;
In addition to parent document, there is a subcategory that has the same fields (minus status) called parent_grant_document. This is probably a patent that has been granted, given that the field names include grant, but I am not sure. Again, I am not sure what all of these fields represent.&lt;br /&gt;
&lt;br /&gt;
* parent_grant_doc_number (int) number for the parent grant document&lt;br /&gt;
* parent_grant_doc_id (varchar(255)) it is unclear how this is different from parent grant document number. In the xpaths, the path will be something like ./parent_doc/parent-grant-doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_grant_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent grant document&amp;quot; is - I will need to look in an actual XML file to see what this field looks like&lt;br /&gt;
* parent_grant_doc_country (varchar(255)) the country of the parent grant document&lt;br /&gt;
* parent_grant_doc_date (date) probably the date the parent grant document was published&lt;br /&gt;
&lt;br /&gt;
====DESIGN====&lt;br /&gt;
&lt;br /&gt;
==Connecting Patent database and Assignment database==&lt;br /&gt;
The answer to connecting the Patent database to the Assignment database lies in using the information in the table DOCUMENT_INFO to connect each assignment in the Assignment database ASSIGNMENT table to a patent_number from the Patent database PATENT table. On further investigation into &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot;, the field they called &amp;quot;grant_doc_num&amp;quot; (patent_number above) in DOCUMENT_INFO is the patent number (the description of DOCUMENT_INFO above has been altered to reflect this).&lt;br /&gt;
&lt;br /&gt;
The paper mentions that there will be errors in the patent number stored in DOCUMENT_INFO under grant_doc_num, so a separate table called DOCUMENT_INFO_ADMIN was constructed to determine how prevalent errors were. The authors queried the patent number for each appno_doc_num from administrative data. The &amp;quot;appno_doc_num&amp;quot; is the Application Document USPTO number from DOCUMENT_INFO. On further investigation into the paper, it became clear that &amp;quot;administrative data&amp;quot; refers to internal USPTO data that was available to the authors of the paper, but that we might not have. However, in 99% of cases the grant_doc_num and the queried patent number based on the appno_doc_num were the same, so we can probably rely on grant_doc_num.&lt;br /&gt;
&lt;br /&gt;
Therefore, since every table in the Patent database will have the patent number stored, to connect any table in Assignment to the Patent database, it would first be joined with DOCUMENT_INFO on rf_id and then joined with the appropriate table in Patent on patent number.&lt;br /&gt;
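&lt;br /&gt;
The two-step join described above can be sketched as follows, using Python's standard sqlite3 module. For simplicity the sketch places all three tables in a single in-memory SQLite database and keeps only the columns needed for the join; in practice the two databases are separate, and the sample rows are made up.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE assignee (rf_id INTEGER, assignee_name TEXT);
CREATE TABLE document_info (rf_id INTEGER, patent_number TEXT);
CREATE TABLE patent (patent_number TEXT PRIMARY KEY, title TEXT);
""")

# Made-up sample rows.
conn.execute("INSERT INTO assignee VALUES (1000100, 'ACME CORP')")
conn.execute("INSERT INTO document_info VALUES (1000100, '7000001')")
conn.execute("INSERT INTO patent VALUES ('7000001', 'Widget')")
conn.execute("INSERT INTO patent VALUES ('7000002', 'Unassigned gadget')")

# Assignment-side table -> DOCUMENT_INFO on rf_id -> Patent-side table on
# patent_number.
rows = conn.execute("""
    SELECT a.assignee_name, p.patent_number, p.title
    FROM assignee AS a
    JOIN document_info AS d ON d.rf_id = a.rf_id
    JOIN patent AS p ON p.patent_number = d.patent_number
    ORDER BY a.assignee_name
""").fetchall()
print(rows)
```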
&lt;br /&gt;
Also in the paper, they mentioned the Assignments on the Web for Patents (AOTW-P), a searchable database of individual USPTO assignment records keyed on reel-frame identification, patent number, and assignor or assignee name (https://assignment.uspto.gov/patent/index.html#/patent/search). Obviously it would not be feasible to use this tool by hand to query all the patent numbers, but if it were possible to write a script that queries each patent number using the rf_id and parses the response, this could be useful for checking the patent numbers, though it might not be any more accurate than what will already be in DOCUMENT_INFO.&lt;br /&gt;
&lt;br /&gt;
==Finding New Paths Unique to Plant, Reissue, and/or Design Patents==&lt;br /&gt;
&lt;br /&gt;
Based on Oliver's script, which searched all xpaths and compared which were unique to particular types, we see that the following attributes are unique to each type of patent other than utility patents. These attributes vary by XML version, which changed over time. Therefore, the lists below are a superset of the attributes that are unique to each of the patent types listed below across all XML versions.&lt;br /&gt;
&lt;br /&gt;
When I am done finding the supersets, we will determine how to integrate these attributes into the design for the patent table.&lt;br /&gt;
&lt;br /&gt;
'''There appears to be a fifth patent type in the results from Oliver's script, and there are attributes that are unique to Utility patents, though they do not appear particularly useful'''&lt;br /&gt;
&lt;br /&gt;
===Plant Patents===&lt;br /&gt;
* US Claim Statement (XML45) - unclear what this represents&lt;br /&gt;
* Latin Name (XML44)&lt;br /&gt;
* US Botanical Variety (XML44)&lt;br /&gt;
&lt;br /&gt;
===Reissue Patents===&lt;br /&gt;
* Parent Status (XML45)&lt;br /&gt;
* Parent Document (XML44) - seems to be either a normal document or a parent grant document (or both), both contain the following fields&lt;br /&gt;
** Date&lt;br /&gt;
** Document number&lt;br /&gt;
** Country &lt;br /&gt;
** Document ID&lt;br /&gt;
** Kind (XML42)&lt;br /&gt;
* Child Document (XML44)&lt;br /&gt;
** Document ID&lt;br /&gt;
** Document Number&lt;br /&gt;
** Country (XML43)&lt;br /&gt;
* Continuing Reissue (XML44)&lt;br /&gt;
** Document ID&lt;br /&gt;
** Relation&lt;br /&gt;
** Document Number&lt;br /&gt;
&lt;br /&gt;
Note about fields related to the parent document and/or parent grant document:&lt;br /&gt;
The structure of the above fields related to a reissue patent is very odd and varies by XML version. For example, for most of the XML versions, parent grant document falls under parent document and has its own document number, date, country, document ID, kind, etc. The parent doc field falls under relation, which falls under reissue. &lt;br /&gt;
&lt;br /&gt;
However, in XML41, parent document falls under relation and a new category - US Reexamination Reissue Merger. The information pertaining to the child document also falls under both relation and US Reexamination Reissue Merger. It's possible that during the time period this XML version represents, there were two types of reissue patents - a reexamination related to a merger and a standard reissue. With discrepancies like these, we'll have to determine in the end whether we want to stick with the fields as they are designed above (in which case the information about a parent doc or parent grant document is stored the same way whether it is a reissue or a Reexamination Reissue), or whether these two classifications are different enough to warrant duplicate fields in the table.&lt;br /&gt;
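&lt;br /&gt;
If we do keep a single set of fields, one way to cope with the differing locations is to try a list of candidate paths in order and take the first match. The sketch below uses Python's standard xml.etree.ElementTree. Both paths, the element names, and the helper first_text are illustrative assumptions modelled on the description above; the real paths for each XML version still need to be confirmed against the files.&lt;br /&gt;
&lt;br /&gt;
```python
import xml.etree.ElementTree as ET

# Candidate paths, tried in order. Both are illustrative assumptions.
PARENT_DOC_NUMBER_PATHS = [
    "./relation/parent-doc/document-id/doc-number",
    "./us-reexamination-reissue-merger/relation/parent-doc/document-id/doc-number",
]


def first_text(root, paths):
    """Return the stripped text of the first path that matches, or None."""
    for path in paths:
        node = root.find(path)
        if node is not None and node.text:
            return node.text.strip()
    return None


# Build a tiny stand-in element tree in code instead of parsing a real file.
reissue = ET.Element("reissue")
merger = ET.SubElement(reissue, "us-reexamination-reissue-merger")
relation = ET.SubElement(merger, "relation")
parent_doc = ET.SubElement(relation, "parent-doc")
document_id = ET.SubElement(parent_doc, "document-id")
ET.SubElement(document_id, "doc-number").text = "05123456"

parent_doc_number = first_text(reissue, PARENT_DOC_NUMBER_PATHS)
print(parent_doc_number)
```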
&lt;br /&gt;
===Design Patents===&lt;br /&gt;
* length of grant (XML45)&lt;br /&gt;
* Hague Agreement Data (XML45) - allows people to file design patents in 66 countries with one application&lt;br /&gt;
** International Registration Date&lt;br /&gt;
** International Filing Date&lt;br /&gt;
** International Registration Publication Date&lt;br /&gt;
** International Registration Number&lt;br /&gt;
* Classification Locarno (XML40)&lt;br /&gt;
** Edition&lt;br /&gt;
** Main Classification&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please note that the lists above are meant to represent unique fields that should be added to the patent database. This does not mean that every XML version contains all these fields, or that every version contains the same path to these fields. That will have to be determined separately. The XML number next to some of the items is meant to represent the latest XML version where the field was seen.&lt;br /&gt;
&lt;br /&gt;
Though I was mainly looking at unique fields in plant, reissue, and design patents, I noticed the following fields on unique paths for utility patents in XML41:&lt;br /&gt;
&lt;br /&gt;
* US Related Documents&lt;br /&gt;
** Substitution&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Parent Document&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Country&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
***** Parent Status&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Country&lt;br /&gt;
** Continuation/Continuation in Part&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Doc&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
** Division&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Kind&lt;br /&gt;
&lt;br /&gt;
This appears to resemble the information about reissue patents. Additionally, a couple of fields appear under the information about bibliographic data that might be useful:&lt;br /&gt;
* US Deceased Inventor (XML41)&lt;br /&gt;
** Post Code&lt;br /&gt;
** Address&lt;br /&gt;
* US Provisional Application Status (XML41)&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=21602</id>
		<title>Redesign Assignment and Patent Database</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Redesign_Assignment_and_Patent_Database&amp;diff=21602"/>
		<updated>2017-11-03T21:57:33Z</updated>

		<summary type="html">&lt;p&gt;ShelbyBice: /* REISSUE */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Redesign Assignment and Patent Database&lt;br /&gt;
|Has owner=Shelby Bice,&lt;br /&gt;
|Has start date=9/2017&lt;br /&gt;
|Has keywords=patent&lt;br /&gt;
|Is dependent on=Reproducible Patent Data,&lt;br /&gt;
}}&lt;br /&gt;
'''FOR ED:''' I finished going through the unique paths (you can see the unique fields I identified at the bottom of this page, under &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot;). Then, under my design for the Patent Database, there's a new section called &amp;quot;Unique Attributes Tables&amp;quot; where I've begun detailing a design for three new tables that would include information that is unique to Design, Reissue, and Plant patents. Here are some questions I have:&lt;br /&gt;
&lt;br /&gt;
'''1.''' Do you like the design for the new tables? How do you think we should populate these tables? I can think of one way (putting all the information in the main patent tables and then moving it out to Plant, Reissue, or Design as needed) but I was wondering if there might be an easier way. &lt;br /&gt;
&lt;br /&gt;
'''2.''' There were some unique attributes to utility patents that I also included in &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot;. They only appear to be in two of the oldest XML versions (4.1 and 4.0) and I'm not sure the information is particularly useful - but please take a look at the field names and let me know if you disagree and think they are useful and should be included in a table in the patent database.&lt;br /&gt;
&lt;br /&gt;
'''3.''' Do you know what some of these fields represent? For example, I've been trying to research online what a &amp;quot;Parent Document&amp;quot; and/or &amp;quot;Parent Grant Document&amp;quot; might represent for a reissue patent, and I've found a couple possible options. I will keep researching, but if you know what they represent or one of my descriptions for a field is wrong, please let me know.&lt;br /&gt;
&lt;br /&gt;
This is an extension of the work I did last semester under &amp;quot;Redesigning Patent Database&amp;quot;. Instead of simply reconfiguring the existing database, this project encompasses a full redesign and the creation of a new Patent database and a new Assignment database that will be joined together.&lt;br /&gt;
&lt;br /&gt;
Reference for definitions of fields in the two databases: https://www.oecd.org/sti/sci-tech/37569498.pdf&lt;br /&gt;
&lt;br /&gt;
Paper on how to extract data from XML files, parse it, and put it into a database: https://funginstitute.berkeley.edu/wp-content/uploads/2014/06/patentprocessor.pdf&lt;br /&gt;
&lt;br /&gt;
'''Particularly useful, because it provides solutions for cleaning location and firm data'''&lt;br /&gt;
&lt;br /&gt;
==ER Diagram for Assignment and Patent Databases==&lt;br /&gt;
&lt;br /&gt;
Note: this diagram is currently out of sync with the description of the patent database below. It will be updated once I have finalized how we will organize unique attributes for plant, design, and reissue patents.&lt;br /&gt;
&lt;br /&gt;
[[File:PatentAndAssignmentER2.png]]&lt;br /&gt;
&lt;br /&gt;
Attributes for each table are listed below with descriptions - because of how many attributes there are, I decided the ER diagram would be better suited as an overview of the tables rather than trying to show all the attributes on the diagram.&lt;br /&gt;
&lt;br /&gt;
==Assignment Database Structure==&lt;br /&gt;
&lt;br /&gt;
After reading through the paper &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot; (https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf) and taking into account our needs for the Assignment database (the ability to connect the two databases together and the ability to trace conveyance of a patent over time), I decided on the following structure for the Assignment database.&lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table in the database is rf_id, which stands for reel frame id. It is unique for each entry in the main table, ASSIGNMENT, and is included in every table. &lt;br /&gt;
&lt;br /&gt;
Conveyance is the only table that cannot be fully constructed as data is inserted into the database. Some of its fields depend on other tables being constructed, so it will either be partially populated while the other tables are being populated or be the last table to be populated. &lt;br /&gt;
&lt;br /&gt;
Full list of tables: assignee, assignor, document_info, assignment, conveyance&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: store all the assignees from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the recipient of a patent in an entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id, unless it is possible to have multiple assignees, then rf_id and assignee name&lt;br /&gt;
&lt;br /&gt;
All columns (variables) in the table: &lt;br /&gt;
&lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* address_line_1 (varchar(255)) first line of address of assignee&lt;br /&gt;
* address_line_2 (varchar(255)) second line of address of assignee&lt;br /&gt;
* city (varchar(255)) city of assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* country (varchar(255)) country of assignee&lt;br /&gt;
* postal_code (varchar(255)) postal code of the assignee&lt;br /&gt;
&lt;br /&gt;
===ASSIGNOR===&lt;br /&gt;
	&lt;br /&gt;
Purpose: store all the assignors from all the assignments&lt;br /&gt;
&lt;br /&gt;
Each entry represents: the one assigning/granting the license for the entry in the ASSIGNMENT table&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id and assignor_name&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
&lt;br /&gt;
===CONVEYANCE===&lt;br /&gt;
Purpose: represent the conveyance of a particular patent over time&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual transaction for a patent (not unlike assignment, except this table has information about the assignment in relation to other assignments for the patent)&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id &lt;br /&gt;
&lt;br /&gt;
All columns (variables): Two possibilities, depending on whether we want data to be repeated or not.&lt;br /&gt;
	&lt;br /&gt;
Not-repeated (with exception of primary/foreign key): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* next_transaction (bigint) rf_id of the next transaction that occurred after this transaction (traced by assignee/assignor)&lt;br /&gt;
* num_conveyance (int) number in order of conveyance by date&lt;br /&gt;
* acknowledgment_date (date) date that the assignor acknowledged the transaction&lt;br /&gt;
* execution_date (date) date that the transaction took place&lt;br /&gt;
* conveyance_type (varchar(255)) the USPTO paper determined a couple of different conveyance types and also outlined how to determine based on the conveyance text in the xml file which conveyance type is relevant to the transaction. The conveyance types are:&lt;br /&gt;
** assignment&lt;br /&gt;
** merger&lt;br /&gt;
** change of name&lt;br /&gt;
** government interest&lt;br /&gt;
** agreement&lt;br /&gt;
** security agreement&lt;br /&gt;
** release&lt;br /&gt;
&lt;br /&gt;
	&lt;br /&gt;
Repeated, also include: &lt;br /&gt;
* assignee_name (varchar(255)) name of the assignee&lt;br /&gt;
* assignor_name (varchar(255)) name of the assignor&lt;br /&gt;
* patent_number (varchar(255)) identifying number for the patent&lt;br /&gt;
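&lt;br /&gt;
A minimal Python sketch of how num_conveyance and next_transaction could be filled in once the other tables are loaded. It assumes that ordering a patent's transactions by execution_date (with rf_id as a tie-breaker) is a good enough proxy for the chain of conveyance, and that each rf_id covers a single patent. The function name and the sample rows are hypothetical.&lt;br /&gt;

```python
from collections import defaultdict
from datetime import date


def build_conveyance_chain(transactions):
    """Return {rf_id: (num_conveyance, next_transaction)}.

    transactions is an iterable of (rf_id, patent_number, execution_date).
    Assumption: sorting by execution_date, then rf_id, approximates the
    order in which a patent changed hands.
    """
    by_patent = defaultdict(list)
    for rf_id, patent_number, execution_date in transactions:
        by_patent[patent_number].append((execution_date, rf_id))

    chain = {}
    for rows in by_patent.values():
        rows.sort()
        ordered = [rf_id for _, rf_id in rows]
        # Pair each transaction with the one that follows it (None for the last).
        for position, (rf_id, next_rf_id) in enumerate(
            zip(ordered, ordered[1:] + [None]), start=1
        ):
            chain[rf_id] = (position, next_rf_id)
    return chain


# Hypothetical sample data: two transactions for one patent, one for another.
sample = [
    (200, "7000001", date(2012, 5, 1)),
    (100, "7000001", date(2010, 3, 15)),
    (300, "D500001", date(2011, 1, 1)),
]
chain = build_conveyance_chain(sample)
print(chain[100])  # (1, 200)
```

Tracing by assignee/assignor name, as suggested above, would need name cleaning first; ordering by date is only the simplest starting point.&lt;br /&gt;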
&lt;br /&gt;
===DOCUMENT_INFO===&lt;br /&gt;
Purpose: store extra information relevant to the patents represented in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the documentation and the patent for a particular entry in ASSIGNMENT&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables) when created: &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* app_num (varchar(255)) application document USPTO number &lt;br /&gt;
* app_date (date) date that the application was filed&lt;br /&gt;
* app_country (varchar(255)) country in which it was filed&lt;br /&gt;
* pgpub_num (varchar(255)) pre-grant publication document USPTO number&lt;br /&gt;
* pgpub_date (date) date that pre-grant publication was released&lt;br /&gt;
* pgpub_country (varchar(255)) country in which the patent is published before being granted a patent number&lt;br /&gt;
* patent_number (varchar(255)) granted patent document USPTO number&lt;br /&gt;
* grant_date (date) date that the patent is officially published&lt;br /&gt;
* grant_country (varchar(255)) country in which the patent is published after being granted a patent number&lt;br /&gt;
* invention_title (varchar(255)) title of the invention&lt;br /&gt;
* language (varchar(255)) language of the invention title – could be potentially useful and interesting to investigate &lt;br /&gt;
* reel_num (bigint) number of the reel the assignment was stored on&lt;br /&gt;
* frame_num (bigint) number of the frame on the reel the assignment was stored on&lt;br /&gt;
&lt;br /&gt;
===ASSIGNMENT===&lt;br /&gt;
Purpose: represent assignment transactions – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent assignment transaction&lt;br /&gt;
&lt;br /&gt;
Primary key: rf_id&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* rf_id (bigint) reel frame number&lt;br /&gt;
* assignment_id (bigint) US patent assignment id &lt;br /&gt;
* correspondent_name (varchar(255)) name of correspondent for the transaction, could be a lawyer&lt;br /&gt;
* correspondent_address1 (varchar(255)) first line of address for correspondent&lt;br /&gt;
* correspondent_address2 (varchar(255)) second line of address for correspondent&lt;br /&gt;
* correspondent_address3 (varchar(255)) third line of address for correspondent&lt;br /&gt;
* correspondent_address4 (varchar(255)) fourth line of address for correspondent&lt;br /&gt;
* record_date (date) date recorded with the USPTO&lt;br /&gt;
* last_update_date (date) date that information for this assignment was last updated with the USPTO &lt;br /&gt;
* page_cnt (bigint) page count of the assignment record&lt;br /&gt;
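&lt;br /&gt;
As an illustration of how rf_id ties the tables together, here is a small sketch using Python's built-in sqlite3 module. It is not the project's actual schema code: only a few of the columns listed above are included, SQLite stands in for whatever database server is used, and the sample names are made up.&lt;br /&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE assignment (
        rf_id BIGINT PRIMARY KEY,
        assignment_id BIGINT,
        correspondent_name VARCHAR(255),
        record_date DATE,
        last_update_date DATE,
        page_cnt BIGINT
    );
    CREATE TABLE assignor (
        rf_id BIGINT NOT NULL REFERENCES assignment (rf_id),
        assignor_name VARCHAR(255) NOT NULL,
        PRIMARY KEY (rf_id, assignor_name)
    );
    """
)
conn.execute("INSERT INTO assignment (rf_id, page_cnt) VALUES (1, 3)")
conn.executemany(
    "INSERT INTO assignor VALUES (?, ?)",
    [(1, "Hypothetical Inventor A"), (1, "Hypothetical Inventor B")],
)

# An assignor row whose rf_id is missing from assignment is rejected.
try:
    conn.execute("INSERT INTO assignor VALUES (2, 'Orphan')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

assignor_count = conn.execute("SELECT COUNT(*) FROM assignor").fetchone()[0]
print(assignor_count, orphan_rejected)
```

The composite primary key on assignor allows several assignors per rf_id; the same pattern would apply to assignee if one assignment can have multiple assignees.&lt;br /&gt;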
&lt;br /&gt;
More extensive notes exist under :E/McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper&lt;br /&gt;
&lt;br /&gt;
==Patent Database Structure==&lt;br /&gt;
&lt;br /&gt;
Most of this design is based off what I investigated last semester and therefore is subject to change, since the design for the Assignment database is somewhat new. &lt;br /&gt;
&lt;br /&gt;
The foreign key that connects every table is patent_number, which uniquely identifies every patent in the PATENT table and is included in every table.&lt;br /&gt;
&lt;br /&gt;
Full list of tables: patent, assignee, citation, fee, claims, histpatent, lawyers, inventors&lt;br /&gt;
&lt;br /&gt;
Update: These tables, as of October 26, are being updated and changed as I determine what data can actually be retrieved from the XML files based on the xpaths that we have determined (you can see these xpaths on the &amp;quot;Equivalent XPath and APS Queries&amp;quot; project page).&lt;br /&gt;
&lt;br /&gt;
===PATENT===&lt;br /&gt;
Purpose: represent patents – central table for the database&lt;br /&gt;
&lt;br /&gt;
Each entry represents: an individual patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* patent_type (varchar(255)) the type of patent, like utility&lt;br /&gt;
* patent_kind (varchar(2)) the kind of patent, a letter and a number&lt;br /&gt;
* title (varchar(255)) the title of the patent&lt;br /&gt;
* grantdate (date) date that patent was officially granted including the year&lt;br /&gt;
* prioritydate (date) date used to establish the novelty of an invention relative to other inventions (source: http://www.bios.net/daisy/patentlens/2343.html)&lt;br /&gt;
* prioritycountry (varchar(255)) country where patent is first filed&lt;br /&gt;
* patent_country (varchar(255)) country in which this patent was published&lt;br /&gt;
* prioritypatentnumber (varchar(255)) placeholder patent number if the patent is published before it is fully approved - also referred to as priority claims number&lt;br /&gt;
* cpcsubclass (varchar(255)) cooperative patent classification subclass&lt;br /&gt;
* cpcsubgroup (varchar(255)) cooperative patent classification subgroup&lt;br /&gt;
* cpcmaingroup (varchar(255)) cooperative patent classification main group&lt;br /&gt;
* cpctotal (varchar(255)) concatenated version of cpcmaingroup, cpcsubgroup, cpcsubclass&lt;br /&gt;
* pctpatentnumber (varchar(255)) international patent number (according to Patent Cooperation Treaty) also known as PCT Document Number&lt;br /&gt;
* appnum (varchar(255)) application number, probably also the filing number - the format is a two digit series code followed by a six digit serial number (source: https://www.uspto.gov/patents-application-process/filing-online/info-application-number)&lt;br /&gt;
* appdate (date) date that the application was filed, including year - probably also the filing date&lt;br /&gt;
* ipcrsubclass (varchar(255)) International patent classification subclass&lt;br /&gt;
* ipcrmaingroup (varchar(255)) International patent classification main group&lt;br /&gt;
* ipcrsubgroup (varchar(255)) International patent classification subgroup&lt;br /&gt;
* ipcrtotal (varchar(255)) concatenated version of ipcrsubgroup, ipcrsubclass, ipcrmaingroup &lt;br /&gt;
* national_classification (varchar(255)) - probably uspc, which is United States Patent Classification&lt;br /&gt;
* national_classification_country (varchar(255)) - should be United States&lt;br /&gt;
* num_claims (int) number of claims (things that the inventor wishes to protect with this patent)&lt;br /&gt;
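&lt;br /&gt;
The appnum format described above (a two digit series code followed by a six digit serial number) can be checked while loading. The helper below is only a sketch: the function name is hypothetical, and it assumes that separators such as a slash or comma may appear in the raw value and can be dropped.&lt;br /&gt;

```python
def split_application_number(appnum):
    """Split an application number into (series_code, serial_number).

    Assumption: punctuation such as "/" or "," may be present and is ignored.
    """
    digits = "".join(ch for ch in appnum if ch.isdigit())
    if len(digits) != 8:
        raise ValueError("expected 2 + 6 digits, got: " + repr(appnum))
    return digits[:2], digits[2:]


series_code, serial_number = split_application_number("12/345,678")
print(series_code, serial_number)  # 12 345678
```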
&lt;br /&gt;
Discrepancies: &lt;br /&gt;
These fields were previously a part of the table, but we do not currently have xpaths for them:&lt;br /&gt;
* app_year - the year the application was filed&lt;br /&gt;
* grant_year - the year the patent was granted&lt;br /&gt;
* uspc - United States Patent Classification (might be the same as national_classification)&lt;br /&gt;
* uspcsub - United States Patent Classification subgroup&lt;br /&gt;
&lt;br /&gt;
===ASSIGNEE===&lt;br /&gt;
Purpose: Represent the person or organization to whom the patent was assigned at publication according to the patent information (the information in the Assignment database is more complete) &lt;br /&gt;
&lt;br /&gt;
Each entry represents: the assignee for a patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number, first_name, last_name (in case it is possible for there to be multiple assignees for an individual patent)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* first_name (varchar(255)) first name of assignee&lt;br /&gt;
* last_name (varchar(255)) last name of assignee&lt;br /&gt;
* address (varchar(255)) address of assignee&lt;br /&gt;
* postcode (varchar(255)) postcode of assignee&lt;br /&gt;
* orgname (varchar(255)) name of the organization owning the patent (if applicable - organization may go by multiple orgnames) &lt;br /&gt;
* city (varchar(255)) city of the assignee&lt;br /&gt;
* country (varchar(255)) country of the assignee&lt;br /&gt;
* state (varchar(255)) state of assignee&lt;br /&gt;
* residence (varchar(255)) mailing address of assignee if they receive mail at a different address than the one listed above&lt;br /&gt;
* size_of_firm (varchar(255)) size of the organization, large, small, or micro - relates to how much the fee they pay for the patent is&lt;br /&gt;
&lt;br /&gt;
===CITATION===&lt;br /&gt;
Purpose: Represent all the patents in PATENT that another patent in PATENT cites&lt;br /&gt;
&lt;br /&gt;
Each entry represents: one citation that a patent in PATENT makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number AND cited_patent_number (because a patent might cite multiple patents, so both variables are needed for the primary key)&lt;br /&gt;
&lt;br /&gt;
All columns (variables): &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* cited_patent_number (varchar(255)) unique identifier from the USPTO for the patent being cited&lt;br /&gt;
* description (varchar(255)) description of the citation&lt;br /&gt;
* citation_date (date) unclear what this date is supposed to represent from the DTD, but according to the paper that is the second link in the top section, it is the &amp;quot;date of cited document&amp;quot;&lt;br /&gt;
* citation_kind (varchar(255)) type of cited document&lt;br /&gt;
* citation_country (varchar(255)) country of origin for the cited patent &lt;br /&gt;
* citation_name (varchar(255)) appears to be the names of those to whom the cited patent is assigned&lt;br /&gt;
&lt;br /&gt;
Note: There are other fields available for this table, such as document number, but this seems like it might be a repeat of the document number for the citing patent&lt;br /&gt;
&lt;br /&gt;
===FEE===&lt;br /&gt;
Purpose: Represent information about the fees paid for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about the fees paid on the patent&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* four (boolean) payment of maintenance fee 4th year&lt;br /&gt;
* eight (boolean) payment of maintenance fee 8th year&lt;br /&gt;
* twelve (boolean) payment of maintenance fee 12th year&lt;br /&gt;
* fee_date (date) date that fee is paid on&lt;br /&gt;
* fee_code (varchar(255)) code for type of fee to pay&lt;br /&gt;
&lt;br /&gt;
===CLAIMS===&lt;br /&gt;
&lt;br /&gt;
Purpose: Represent information about the claims for an individual patent&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a claim that a patent makes&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number and claim_id&lt;br /&gt;
&lt;br /&gt;
All columns: &lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* claim_id (varchar(255)) id number for the claim&lt;br /&gt;
* language (varchar(255)) language of claim&lt;br /&gt;
* claim_type (varchar(255)) type of claim&lt;br /&gt;
* status (varchar(255)) status of the claim&lt;br /&gt;
* claim_text (text) the text of the claim&lt;br /&gt;
&lt;br /&gt;
===HISTPATENT===&lt;br /&gt;
Purpose: Represent historical data about the patent pertaining to dates and publication&lt;br /&gt;
&lt;br /&gt;
Each entry represents: sequence of historical data about an individual patent in PATENT&lt;br /&gt;
&lt;br /&gt;
Primary key: patent_number&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* pubno (varchar(255)) patent publication number&lt;br /&gt;
* pubdate (date) date patent was issued&lt;br /&gt;
* dispdate (date) disposal date for the application&lt;br /&gt;
* distype (varchar(3)) application status of the patent - ABN for abandoned, ISS for issued, or PEN for pending&lt;br /&gt;
* exp_dt (date) expiration date of patent&lt;br /&gt;
* pta (int) length of patent term adjustment (which extends the amount of time the patent is allowed to be in force) in number of days&lt;br /&gt;
&lt;br /&gt;
===INVENTORS===&lt;br /&gt;
Purpose: Represent the inventors for each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about an inventor for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (and name, if there can be multiple inventors for a single patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identifies the inventor?&lt;br /&gt;
* name (varchar(255)) concatenated string of the first and last name of the inventor&lt;br /&gt;
* orgname (varchar(255)) name of the organization the inventor works for&lt;br /&gt;
* state (varchar(255)) - state in which the inventor resided&lt;br /&gt;
* address (varchar(255)) - address at which the inventor resided&lt;br /&gt;
* country (varchar(255)) country in which the inventor resided&lt;br /&gt;
* city (varchar(255)) city in which the inventor resided&lt;br /&gt;
&lt;br /&gt;
===LAWYERS===&lt;br /&gt;
Purpose: Represent the lawyers who worked on each of the patents in the PATENT table&lt;br /&gt;
&lt;br /&gt;
Each entry represents: information about a lawyer for a specific patent&lt;br /&gt;
&lt;br /&gt;
Primary Key: patent_number (there doesn't appear to be more than one lawyer entry for each patent)&lt;br /&gt;
&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identifies the lawyer?&lt;br /&gt;
* org_name (varchar(255)) the name of the firm that did arbitration for the patent&lt;br /&gt;
&lt;br /&gt;
===Unique Attributes Tables===&lt;br /&gt;
&lt;br /&gt;
Several attributes have been identified that are unique, applying to only one kind of patent. Adding all of these fields to the main patent table would introduce a lot of NULL entries, because three out of the four patent types will not have any particular attribute. Therefore, I have created three new tables to store these unique attributes - one for each type of patent with unique attributes. &lt;br /&gt;
&lt;br /&gt;
For each table, the primary key is the patent_number, which connects a row to a row in the main patent table. &lt;br /&gt;
&lt;br /&gt;
TODO: Figure out best way to generate these tables. &lt;br /&gt;
&lt;br /&gt;
* Note that I did include some unique attributes for Utility patents as seen below in the &amp;quot;Finding New Paths Unique to Plant, Reissue, and/or Design Patents&amp;quot; section, but for now (11/3/2017) I'm not going to create a separate table for those fields. Most of them seem to be repeated in Reissue, and the other fields, while interesting, are perhaps not very useful.&lt;br /&gt;
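&lt;br /&gt;
One possible way to generate these tables is the approach raised in the questions at the top of the page: load every parsed field into one wide staging table, then copy the type-specific columns out. The sqlite3 sketch below shows this for plant patents only. The table name patent_staging, the patent_type value 'plant', and the sample rows are assumptions made for illustration.&lt;br /&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical staging table holding every parsed field for every patent.
    CREATE TABLE patent_staging (
        patent_number VARCHAR(255) PRIMARY KEY,
        patent_type VARCHAR(255),
        latin_name VARCHAR(255),
        us_botanical_variety VARCHAR(255)
    );
    CREATE TABLE plant (
        patent_number VARCHAR(255) PRIMARY KEY,
        latin_name VARCHAR(255),
        us_botanical_variety VARCHAR(255)
    );
    """
)
conn.executemany(
    "INSERT INTO patent_staging VALUES (?, ?, ?, ?)",
    [
        ("PP00001", "plant", "Rosa hybrida", "Example Variety"),
        ("7000001", "utility", None, None),
    ],
)
# Copy only plant patents, so utility rows never carry NULL plant columns.
conn.execute(
    """
    INSERT INTO plant (patent_number, latin_name, us_botanical_variety)
    SELECT patent_number, latin_name, us_botanical_variety
    FROM patent_staging
    WHERE patent_type = 'plant'
    """
)
plant_rows = conn.execute("SELECT patent_number, latin_name FROM plant").fetchall()
print(plant_rows)
```

The alternative is to have the parser write directly to the type-specific table based on the patent type, which avoids the staging step but puts more logic in the parsing code.&lt;br /&gt;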
&lt;br /&gt;
====PLANT====&lt;br /&gt;
* patent_number (varchar(255)) unique identifier for the patent from the USPTO, can contain letters hence varchar&lt;br /&gt;
* latin_name (varchar(255)) latin name for the plant&lt;br /&gt;
* us_botanical_variety (varchar(255)) denotes what variety of plant it is - for example, a rose has several different varieties&lt;br /&gt;
&lt;br /&gt;
====REISSUE====&lt;br /&gt;
&lt;br /&gt;
* parent_doc_status (varchar(255)) status of the parent document. It is unclear whether this refers to a parent reissue application (if there happen to be multiple applications relating to the reissue of this patent) or to the original patent that is being reissued. This uncertainty applies to all fields beginning with &amp;quot;parent_doc&amp;quot;&lt;br /&gt;
* parent_doc_number (int) number for the parent document&lt;br /&gt;
* parent_doc_id (varchar(255)) it is unclear how this is different from parent document number. In the xpaths, the path will be something like ./parent_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent document&amp;quot; is - I will need to look in an actual XML file to see what this field looks like&lt;br /&gt;
* parent_doc_country (varchar(255)) the country of the parent document&lt;br /&gt;
* parent_doc_date (date) probably the date the parent document was published&lt;br /&gt;
&lt;br /&gt;
In addition to parent document, there is a subcategory that has the same fields (minus status) called parent_grant_document. This is probably a patent that has been granted, given that the field name includes grant, but I am not sure. Again, I am not sure what all of these fields represent.&lt;br /&gt;
&lt;br /&gt;
* parent_grant_doc_number (int) number for the parent grant document&lt;br /&gt;
* parent_grant_doc_id (varchar(255)) it is unclear how this is different from parent grant document number. In the xpaths, the path will be something like ./parent_doc/parent-grant-doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.&lt;br /&gt;
* parent_grant_doc_kind (varchar(255)) this may actually denote what the &amp;quot;parent grant document&amp;quot; is - I will need to look in an actual XML file to see what this field looks like&lt;br /&gt;
* parent_grant_doc_country (varchar(255)) the country of the parent grant document&lt;br /&gt;
* parent_grant_doc_date (date) probably the date the parent grant document was published&lt;br /&gt;
&lt;br /&gt;
====DESIGN====&lt;br /&gt;
&lt;br /&gt;
==Connecting Patent database and Assignment database==&lt;br /&gt;
The answer to connecting the Patent database to the Assignment database lies in using the information in the table DOCUMENT_INFO to connect each assignment in the Assignment database ASSIGNMENT table to a patent_number from the Patent database PATENT table. On further investigation into &amp;quot;The USPTO Patent Assignment Dataset: Descriptions and Analysis&amp;quot;, the field they called &amp;quot;grant_doc_num&amp;quot; (patent_number above) in DOCUMENT_INFO is the patent number (the description of DOCUMENT_INFO above has been altered to reflect this).&lt;br /&gt;
&lt;br /&gt;
The paper mentions that there will be errors in the patent number stored in DOCUMENT_INFO under grant_doc_num, so a separate table called DOCUMENT_INFO_ADMIN was constructed to determine how prevalent the errors were. The authors looked up the patent number for each appno_doc_num in administrative data. The &amp;quot;appno_doc_num&amp;quot; is the Application Document USPTO number from DOCUMENT_INFO (app_num above). On further investigation into the paper, it became clear that &amp;quot;administrative data&amp;quot; refers to internal USPTO data that was available to the authors of the paper, but that we might not have. However, in 99% of cases the grant_doc_num and the queried patent number based on the appno_doc_num were the same, so we can probably rely on grant_doc_num.&lt;br /&gt;
&lt;br /&gt;
Therefore, since every table in the Patent database will have the patent number stored, to connect any table in Assignment to the Patent database, it would first be joined with DOCUMENT_INFO on rf_id and then joined with the appropriate table in Patent on patent number.&lt;br /&gt;
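&lt;br /&gt;
The two-step join described above can be sketched with Python's sqlite3 module. For simplicity the sketch puts cut-down versions of ASSIGNEE, DOCUMENT_INFO, and PATENT in one in-memory database, which is an assumption; the sample values are made up.&lt;br /&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE assignee (rf_id BIGINT, assignee_name VARCHAR(255));
    CREATE TABLE document_info (rf_id BIGINT PRIMARY KEY, patent_number VARCHAR(255));
    CREATE TABLE patent (patent_number VARCHAR(255) PRIMARY KEY, title VARCHAR(255));

    INSERT INTO assignee VALUES (1, 'Hypothetical Corp');
    INSERT INTO document_info VALUES (1, '7000001');
    INSERT INTO patent VALUES ('7000001', 'Example widget');
    """
)
# Step 1: join on rf_id to reach DOCUMENT_INFO.
# Step 2: join on patent_number to reach the Patent database table.
rows = conn.execute(
    """
    SELECT a.assignee_name, p.patent_number, p.title
    FROM assignee AS a
    JOIN document_info AS d ON d.rf_id = a.rf_id
    JOIN patent AS p ON p.patent_number = d.patent_number
    """
).fetchall()
print(rows)
```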
&lt;br /&gt;
Also in the paper, they mentioned the Assignments on the Web for Patents (AOTW-P), a searchable database of individual USPTO assignment records keyed on reel-frame identification, patent number, and assignor or assignee name (https://assignment.uspto.gov/patent/index.html#/patent/search). Obviously it would not be possible to individually use this tool to query all the patent numbers, but if it would be possible to write a script to somehow query each patent number using the rf_id and parse the response, this could potentially be useful to check the patent numbers, but might not be any more accurate than what will already be in DOCUMENT_INFO.&lt;br /&gt;
&lt;br /&gt;
==Finding New Paths Unique to Plant, Reissue, and/or Design Patents==&lt;br /&gt;
&lt;br /&gt;
Based on Oliver's script, which searched all xpaths and compared which were unique to particular types, we see that the following attributes are unique to each type of patent other than utility patents. These attributes vary by XML version, which changed over time. Therefore, the lists below are a superset of the attributes that are unique to each of the patent types listed below across all XML versions.&lt;br /&gt;
&lt;br /&gt;
When I am done finding the supersets, we will determine how to integrate these attributes into the design for the patent table.&lt;br /&gt;
&lt;br /&gt;
'''There appears to be a fifth patent type in the results from Oliver's script, and there are attributes that are unique to Utility patents, though they do not appear particularly useful'''&lt;br /&gt;
&lt;br /&gt;
===Plant Patents===&lt;br /&gt;
* US Claim Statement (XML45) - unclear what this represents&lt;br /&gt;
* Latin Name (XML44)&lt;br /&gt;
* US Botanical Variety (XML44)&lt;br /&gt;
&lt;br /&gt;
===Reissue Patents===&lt;br /&gt;
* Parent Status (XML45)&lt;br /&gt;
* Parent Document (XML44) - seems to be either a normal document or a parent grant document (or both), both contain the following fields&lt;br /&gt;
** Date&lt;br /&gt;
** Document number&lt;br /&gt;
** Country &lt;br /&gt;
** Document ID&lt;br /&gt;
** Kind (XML42)&lt;br /&gt;
* Child Document (XML44)&lt;br /&gt;
** Document ID&lt;br /&gt;
** Document Number&lt;br /&gt;
** Country (XML43)&lt;br /&gt;
* Continuing Reissue (XML44)&lt;br /&gt;
** Document ID&lt;br /&gt;
** Relation&lt;br /&gt;
** Document Number&lt;br /&gt;
&lt;br /&gt;
Note about fields related to the parent document and/or parent grant document:&lt;br /&gt;
The structure of the above fields related to a reissue patent is very odd and varies by XML version. For example, for most of the XML versions, parent grant document falls under parent document and has its own document number, date, country, document ID, kind, etc. The parent doc field falls under relation, which falls under reissue. &lt;br /&gt;
&lt;br /&gt;
However, in XML41, parent document falls under relation and a new category - US Reexamination Reissue Merger. The information pertaining to the child document also falls under both relation and US Reexamination Reissue Merger. It's possible that during the time period this XML version represents, there were two types of reissue patents - a reexamination related to a merger and a standard reissue. With discrepancies like these, we'll have to determine in the end whether we want to stick with the fields as they are designed above (in which case the information about a parent doc or parent grant document is stored the same way whether it is a reissue or a Reexamination Reissue), or whether these two classifications are different enough to warrant duplicate fields in the table.&lt;br /&gt;
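&lt;br /&gt;
Because the location of the parent document fields differs between XML versions, one option is to try a list of candidate paths per field and keep the first match. The sketch below uses Python's xml.etree.ElementTree. All element names are assumptions modeled on the nesting described above rather than copied from a real USPTO file, and the helper first_text is hypothetical.&lt;br /&gt;

```python
import xml.etree.ElementTree as ET

# Build a tiny stand-in for the reissue fragment described above.
# Element names are assumptions modeled on the nesting in the text,
# not copied from a real USPTO file.
reissue = ET.Element("reissue")
relation = ET.SubElement(reissue, "relation")
parent_doc = ET.SubElement(relation, "parent-doc")
doc_id = ET.SubElement(parent_doc, "document-id")
ET.SubElement(doc_id, "country").text = "US"
ET.SubElement(doc_id, "doc-number").text = "12345678"
grant = ET.SubElement(parent_doc, "parent-grant-document")
grant_id = ET.SubElement(grant, "document-id")
ET.SubElement(grant_id, "doc-number").text = "7000001"


def first_text(root, candidate_paths):
    """Return the text at the first path that matches, else None."""
    for path in candidate_paths:
        value = root.findtext(path)
        if value is not None:
            return value
    return None


# Several candidate paths per field allow for layouts that differ by XML version.
parent_doc_number = first_text(
    reissue,
    [
        "./us-reexamination-reissue-merger/relation/parent-doc/document-id/doc-number",
        "./relation/parent-doc/document-id/doc-number",
    ],
)
parent_grant_doc_number = first_text(
    reissue,
    ["./relation/parent-doc/parent-grant-document/document-id/doc-number"],
)
print(parent_doc_number, parent_grant_doc_number)
```

With this approach a reissue and a Reexamination Reissue would land in the same columns; storing them separately would need a second set of fields or an extra column recording which path matched.&lt;br /&gt;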
&lt;br /&gt;
===Design Patents===&lt;br /&gt;
* length of grant (XML45)&lt;br /&gt;
* Hague Agreement Data (XML45) - allows people to file design patents in 66 countries with one application&lt;br /&gt;
** International Registration Date&lt;br /&gt;
** International Filing Date&lt;br /&gt;
** International Registration Publication Date&lt;br /&gt;
** International Registration Number&lt;br /&gt;
* Classification Locarno (XML40)&lt;br /&gt;
** Edition&lt;br /&gt;
** Main Classification&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please note that the lists above are meant to represent unique fields that should be added to the patent database. This does not mean that every XML version contains all these fields, or that every version contains the same path to these fields. That will have to be determined separately. The XML number next to some of the items is meant to represent the latest XML version where the field was seen.&lt;br /&gt;
&lt;br /&gt;
Though I was mainly looking at unique fields in plant, reissue, and design patents, I noticed the following fields on unique paths for utility patents in XML41:&lt;br /&gt;
&lt;br /&gt;
* US Related Documents&lt;br /&gt;
** Substitution&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Parent Document&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Country&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
***** Parent Status&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Document Number&lt;br /&gt;
***** Document ID&lt;br /&gt;
***** Country&lt;br /&gt;
** Continuation/Continuation in Part&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Doc&lt;br /&gt;
***** Date&lt;br /&gt;
***** Kind&lt;br /&gt;
** Division&lt;br /&gt;
*** Relation&lt;br /&gt;
**** Child Document&lt;br /&gt;
***** Kind&lt;br /&gt;
&lt;br /&gt;
This appears to resemble the information about reissue patents. Additionally, a couple of fields appear under the information about bibliographic data that might be useful:&lt;br /&gt;
* US Deceased Inventor (XML41)&lt;br /&gt;
** Post Code&lt;br /&gt;
** Address&lt;br /&gt;
* US Provisional Application Status (XML41)&lt;/div&gt;</summary>
		<author><name>ShelbyBice</name></author>
		
	</entry>
</feed>