edegan.com - User contributions [en]

Redesign Assignment and Patent Database

2017-12-08T20:42:23Z

ShelbyBice:

{{McNair Projects
|Has title=Redesign Assignment and Patent Database
|Has owner=Shelby Bice,
|Has start date=9/2017
|Has keywords=patent
|Is dependent on=Reproducible Patent Data,
}}
'''Final Notes''' This is more or less the finalized design for the patent database. Oliver Chang's code (see Reproducible Patent Data, which is his project page) more or less fits the design, though there are some differences and expect differences in variable names. In the future the code should be altered so that the name of variables match up and each table that is listed here exists in the database. The tables for the extra variables that exist for reissue, design, and plant patents have been added, and the instructions for adding tables can be found on Reproducible Patent Data. These three tables should fit the schema seen here, but as stated previously the schema in the code DOES NOT fit the schema here exactly and should be altered in the future to fit this schema.

'''For Oliver:''' Unfortunately I did not get to finish making the schema in your code fit the schema that is outlined here. For whoever works on this next, whether that be you or another intern, please note that the variable in the schema in the code do not match up exactly with the schema outlined here, except for perhaps the Reissues, Plants, and Designs tables in the Patent database (called patentsj in the code, I believe, but you can create a database with any name of course).

This is an extension of the work I did last semester under "Redesigning Patent Database". Instead of simply reconfiguring the existing database, this project encompasses and full redesign and creation of a new Patent database and a new Assignment database that will be joined together.

Reference for definitions of fields in the two databases: https://www.oecd.org/sti/sci-tech/37569498.pdf

Paper on how to extract paper from XML files, parse it, and put it into a database: https://funginstitute.berkeley.edu/wp-content/uploads/2014/06/patentprocessor.pdf

'''Particularly useful, because it provides solutions for cleaning location and firm data'''

==ER Diagram for Assignment and Patent Databases==

Note: this matches the design below but may not match the exact database (nor the code that creates the schema of the database currently. In the future these should be made to sync up.

[[File:Erdplus-diagram (3).png]]

Attributes for each table are listed below with descriptions - because of how many attributes there are, I decided the ER diagram would be better suited as an overview of the tables rather than trying to show all the attributes on the diagram.

If you would like to edit the diagram, you can find the ER diagram you can go to https://erdplus.com/ and click "Open Diagram File" under the "Diagram" dropdown, then navigate to E:/McNair/Project/Redesigning Patent Database/New Patent Database Project/ERDiagramforPatentandAssignmentDatabases.erdplus. You should be able to edit the file then.

==Assignment Database Structure==

After reading through the paper "The USPTO Patent Assignment Dataset:
Descriptions and Analysis"(https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf) and taking into account are needs for the Assignment database (the ability to connect the two databases together, the ability to trace conveyance of a patent over time) I decided on the following structure for the Assignment database.

The foreign key that connects every table in the database is rf_id, which stands for reel frame id. It is unique for each entry in the main table, assignment and is included in every table.

Conveyance is the only table that cannot be fully constructed as data is inserted into the database. Some of the fields depend on other tables being constructed, so it may be either partially populated while the other tables are being populated or is the last table to be populated.

Full list of tables: assignee, assignor, document_info, assignment, conveyance

===ASSIGNEE===
Purpose: store all the assignees from all the assignments

Each entry represents: the recipient of a patent in an entry in the ASSIGNMENT table

Primary key: rf_id, unless it is possible to have multiple assignees, then rf_id and assignee name

All columns (variables) in the table:

*rf_id (bigint) reel frame number
* assignee_name (varchar(255)) name of the assignee
* address_line_1 (varchar(255)) first line of address of assignee
* address_line_2 (varchar(255)) second line of address of assignee
* city (varchar(255)) city of assignee
* state (varchar(255)) state of assignee
* country var(char(255)) country of assignee
* postal_code (varchar(255)) post code of the assignee

===ASSIGNOR===

Purpose: store all the assignors from all the assignments

Each entry represents: the one assigning/granting the license for the entry in the ASSIGNMENT table

Primary key: rf_id and assignor_name

All columns (variables):
* rf_id (bigint) reel frame number
* assignor_name (varchar(255)) name of the assignor

===CONVEYANCE===
Purpose: represent the conveyance of a particular patent overtime

Each entry represents: an individual transaction for a patent (not unlike assignment, except this table has information about the assignment in relation to other assignments for the patent

Primary key: rf_id

All columns (variables): Two possibilities, depending on whether we want data to be repeated or not.

Not-repeated (with exception of primary/foreign key):
* rf_id (bigint) reel frame number
* next_transaction (bigint) rf_id of next transaction that occurred after this transaction (traced by asignee/assignor)
* num_conveyance (int) number in order of conveyance by date
* acknowledgment_date (date) date that the assignor acknowledged the transaction
* execution_date (date) date that the transaction took place
* conveyance_type (varchar(255)) the USPTO paper determined a couple of different conveyance types and also outlined how to determine based on the conveyance text in the xml file which conveyance type is relevant to the transaction. The conveyance types are:
** assignment
** merger
** change of name
** government interest
** agreement
** security agreement
** release

Repeated, also include:
* assignee_name (varchar(255)) name of the assignee
* assignor_name (varchar(255)) name of the assignor
* patent_number (varchar(255)) identifying number for the patent

===DOCUMENT_INFO===
Purpose: store extra information relevant to the patents represented in ASSIGNMENT

Each entry represents: information about the documentation and the patent for a particular entry in ASSIGNMENT

Primary key: rf_id

All columns (variables) when created:
* rf_id (bigint) reel frame number
* app_num (varchar(255)) application document USPTO number
* app_date (date) date that the application was filed
* app_country (varhar(255)) country in which it was filed
* pgpub_num (varchar(255)) pre-grant publication document USPTO number
* pgpub_date (date) date that pre-grant publication was released
* pgpub_country (varchar(255)) country in which the patent is published before being granted a patent number
* patent_number (varchar(255)) granted patent document USPTO numbe
* grant_date (date) date that the patent is officially published
* grant_country (varchar(255)) country in which the patent is published after being granted a patent number
* invention_title (varchar(255)) title of the invention
* language (varchar(255)) languageof invention title – could be potentially useful and interesting to investigate
* reel_num (bigint) number of the reel the assignment was stored on
* frame_num (bigint) number of the frame on the reel the assignment was stored on

===ASSIGNMENT===
Purpose: represent assignment transactions – central table for the database

Each entry represents: an individual patent assignment transaction

Primary key: rf_id

All columns (variables):
* rf_id (bigint) reel fram number
* assignment_id (bigint) US patent assignment id
* correspondent_name (varchar(255)) name of correspondent for the transaction, could be a lawyer
* correspondent_address1 (varchar(255)) first line of address for correspondent
* correspondent_address2 (varchar(255)) second line of address for correspondent
* correspondent_address3 (varchar(255)) third line of address for correspondent
* correspondent_address4 (varchar(255)) fourth line of address for correspondent
* record_date (date) date recorded with USPTO)
* last_update_date (date) date that information for this assignment was last update with the USPTO
* page_cnt (bigint) page count of assignment record)

More extensive notes exist under :E/McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper

==Patent Database Structure==

Most of this design is based off what I investigated last semester and therefore is subject to change, since the design for the Assignment database is somewhat new.

The foreign key that connects every table is patent_number, which uniquely identifies every patent in the PATENT table and is included in every table.

Full list of tables: patent, assignee, citation, fee, claims, histpatent, lawyers, inventors

Update: These tables, as of October 26, are being updated and changed as I determine what data can actually be retrieved from the XML files based on the xpaths that we have determined (you can see these xpaths on the "Equivalent XPath and APS Queries" project page.

===PATENT===
Purpose: represent patents – central table for the database

Each entry represents: an individual patent

Primary key: patent_number

All columns (variables):
* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* patent_type (varchar(255)) the type of patent, like utility
* patent_kind (varchar(2)) the kind of patent, a letter and a number
* title (varchar(255)) the title of the patent
* grantdate (date) date that patent was officially granted including the year
* prioritydate (date) date used to established novelty of an invention in regards to other inventions (source: http://www.bios.net/daisy/patentlens/2343.html)
* prioritycountry (varchar(255)) country where patent is first filed
* patent_country (varchar(255)) country in which this patent was published
* prioritypatentnumber (varchar(255)) placeholder patent number if the patent is published before it is fully approved - also referred to as priority claims number
* cpcsubclass (varchar(255)) cooperative patent classification subclass
* cpcsubgroup (varchar(255)) cooperative patent classification subgroup
* cpcmaingroup (varchar(255)) cooperative patent classification main group
* cpctotal (varchar(255)) concatenated version of cpcmaingroup, cpcsubgroup, cpcsubclass
* pctpatentnumber (varchar(255)) international patent number (according to Patent Cooperation Treaty) also known as PCT Document Number
* appnum(varchar(255)) application number, probably also the filing number - the format is a two digit series code followed by a six digit serial number (source: https://www.uspto.gov/patents-application-process/filing-online/info-application-number)
* appdate (date) date that the application was filed, including year - probably also the filing date
* ipcrsubclass (varchar(255)) International patent classification subclass
* ipcrmaingroup (varchar(255)) International patent classification main group
* ipcrsubgroup (varchar(255)) International patent classification subgroup
* ipcrtotal (varchar(255)) concatenated version ipcrsubgroup, ipcrsubclass, ipcrmaincroup
* national_classification (varchar(255)) - probably uspc, which is United States Patent Classification
* natioanl_classification_country (varchar(255)) - should be United States
* num_claims (int) number of claims (things that the inventor wishes to protect with this patent)

Discrepancies:
These fields were previously a a part of the table, but we do not currently have xpaths for them:
app_year - the year the application was filed
grant_year - the year the patent was granted
uspc - United States Patent Classification
(might be the same as national_classification)
uspcsub - United States Patent subgroup

===ASSIGNEE===
Purpose: Represent the person the patent was currently assigned to at publication according to the patent information (the information in the Assignment database is more complete)

Each entry represents: the assignee for a patent in PATENT

Primary key: patent_number, first_name, last_name (in case it is possible for there to be multiple assignees for an individual patent)

All columns (variables):
* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
*first_name (varchar(255)) first name of assignee
*last name (varchar(255)) last name of assignee
* address (varchar(255)) address of assignee
* postcode (varchar(255)) postcode of assignee
* orgname (varchar(255)) name of the organization owning the patent (if applicable - organization may go by multiple orgnames)
* city (varchar(255)) city of the assignee
* country (varchar(255)) country of the assignee
* state (varchar(255)) state of assignee
* residence (varchar(255)) mailing address of assignee if they receive mail at a different address than the one listed above
* size_of_firm (varchar(255)) size of the organization, large, small, or micro - relates to how much the fee they pay for the patent is

===CITATION===
Purpose: Represent all the patents in PATENT that another patent in PATENT cites

Each entry represents: one citation that a patent in PATENT makes

Primary key: patent_number AND cited_patent_number (because a patent might cite multiple patents, so both variables are needed for the primary key)

All columns (variables):
* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* cited_patent_number (varchar(255)) unique identifier from the UPSTO office for the patent being cited
* description (varchar(255)) description of the citation
* citation_date (date) unclear what this date is supposed to represent from the DTD, but according to the paper that is the second link in the top section, it is the "date of cited document"
* citation_kind (varchar(255)) type of cited document
* citation_county (varchaar(255)) country of origin for the cited patent
* citation_name (varchar(255)) appears to be the names to whom the cited patent are assigned to

Note: There are other fields available for this table, such as document number, but this seems like it might be a repeat of the document number for the citing patent

===FEE===
Purpose: Represent information about the fees paid for an individual patent

Each entry represents: information about the fees paid on the patent

Primary key: patent_number

All columns:
* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* four (boolean) payment of maintenance fee 4th year
* eight (boolean) payment of maintenance fee 8th year
* twelve (boolean) payment of maintenance fee 12th year
* fee_date (date) date that fee is paid on
* fee_code (varchar(255)) code for type of fee to pay

===CLAIMS===

Purpose: Represent information about the claims for an individual patent

Each entry represents: information about a claim that a patent makes

Primary key: patent_number and claim_id

All columns:
* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* claim_id (varchar(255)) id number for the claim
* language (varchar(255)) language of claim
* claim_type (varchar(255)) type of claim
* status (varchar(255)) status of the claim
* claim_text (text) the text of the claim

===HISTPATENT===
Purpose: Represent historical data about the patent pertaining to dates and publication

Each entry represents: sequence of historical data about an individual patent in PATENT

Primary key: patent_number

* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* pubno (varchar(255)) patent publication number
* pubdate (date) date patent was issued
* dispdate (date) disposal date for the application
* distype (varchar(3)) application status of the patent - ABN for patent term adjustment, ISS for issued, or PEN for pending
* exp_dt (date) expiration date of patent
* pta (int) length of patent term adjustment (which extends the amount of time the patent is allowed to be in force) in number of days

===INVENTORS===
Purpose: Represent the inventor for each of patents in the PATENT table

Each entry represents: information about an inventor for a specific patent

Primary Key: patent_number (and name, if there can be multiple inventors for a single patent)

* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identify the inventor?
* name (varchar(255)) concatenated string of the first and last name of the inventor
* organme (varchar(255)) name of the organization the inventor works for
* state (varchar(255)) - state in which the inventor resided
* address (varchar(255)) - address at which the inventor resided
* country (varchar(255)) country in which the inventor resided
* city (varchar(255)) city in which the inventor resided

===LAWYERS===
Purpose: Represent the lawyers who worked on each of the patents in the PATENT table

Each entry represents: information about a lawyer for a specific patent

Primary Key: patent_number (there doesn't appear to be more than one lawyer entry for each patent)

* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* sequence (varchar(255)) appears to be a sequence of numbers that unique identify the lawyer?
* org_name (varchar(255)) the name of the firm that did arbitration for the patent

===Unique Attributes Tables===

There are several attributes that have been identified and are unique, applying to only one kind of patent. Instead of adding all these fields to the main patent table, which will introduce a lot of NULL entries because three out of four patents will not have any particular attribute. Therefore, I have created three new tables to store these unique attributes - one for each type of patent with unique attributes.

For each table, the primary key is the patent_number, which connects a row to a row in the main patent table.

TODO: Figure out best way to generate these tables, finish description of attributes for each table.

* Note that I did include some unique attributes for Utility patents as seen below in the "Finding New Paths Unique to Plant, Reissue, and/or Design Patents" section, but for now (11/3/2017) I'm not going to create a separate table for those fields. Most of them seem to be repeated in Reissue, and the other fields, while interesting, are perhaps not very useful.

====PLANT====
=====Fields=====
* patent_no (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* latin_name (varchar(255)) latin name for the plant
* us_botanical_variety(varchar(255)) denotes what variety of plant it is - for example, a rose has several different varieties

====REISSUE====

In the following table, you'll notice lots of fields related to three different kinds of "documents" - a "parent document", a "child document", and a "parent grant document". It is not immediately clear what these three documents represent for a reissue patent. After some research, I think I have determined that these "documents" are, but please know that I do not have definitive proof as there is little available information about these "documents" in regards to a reissue patent.

It is possible that the "parent document" is the parent patent application - that is, the first patent application filed in regards to an invention. That would explain why reissue patents also has several fields related to a "child document" - a "child document" could be a child application, which is filed while a parent application is still pending. So in this case, a "child document" of a reissue patent would be an application regarding the same patent that was filed while the reissue application was still pending. A "child document" is either a continuation, disclosure, or continuation-in-part application ([http://www.patenttrademarkblog.com/parent-and-child-patent-applications/ source]).

Based on this logic, I think that the difference between a "parent document" and a "parent grant document" depends on whether the parent patent application has been granted. If the parent patent application is still pending (meaning the patent has not yet been granted yet) then I believe the reissue patent will store information about the parent patent under "parent document". However, if the parent patent application has been granted, then the information will be stored under "parent grant document". This seems like the most logical explanation, especially considered the path to any field related to "parent grant document" contains "parent document" as with the example below:

us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/parent-doc/parent-grant-document/document-id/date

Therefore, a particular reissue patent will probably only have the fields filled for either the "parent document" or "parent grant document" if I am right about what they represent.

=====Fields=====

* patent_no (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* parent_doc_status (varchar(255)) status of the parent application. Probably related to whether the patent application is pending or not
* parent_doc_number (int) probably application number for the parent document
* parent_doc_id (varchar(255)) it is unclear how this is different from parent document number. In the xpaths, the path will be something like ./parent_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.
* parent_doc_kind (varchar(255)) this may actually denote what the "parent document" is - or may refer to it's purpose (i.e. reissue, continuation, continuation-in-part, etc.)
* parent_doc_country (varchar(255)) the country of origin of the parent document
* parent_doc_date (date) probably the date the parent document was filed
* parent_grant_doc_number (int) probably application number for the parent grant document
* parent_grant_doc_id (varchar(255)) it is unclear how this is different from parent grant document number. In the xpaths, the path will be something like ./parent_doc/parent-grant-doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.
* parent_grant_doc_kind (varchar(255)) this may actually denote what the "parent grant document" is, or may refer to it's purpose (i.e. reissue, continuation, continuation-in-part, etc.)
* parent_grant_doc_country (varchar(255)) the country of origin of the parent grant document
* parent_grant_doc_date (date) probably the date the parent grant document was filed
* child_doc_number (varchar(255)) probably the application number of the child application
* child_doc_id (varchar(255)) it is unclear how this is different from child document number. In the xpaths, the path will be something like ./child_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.
* child_doc_country (varchar(255)) country of origin of the child application

====DESIGN====
=====Fields=====
* patent_no (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* length_of_grant (int) length of grant, most likely in years
* hague_registration_date (date) filing date of international patent application
* hague_filing_date (date) not necessarily the same as the filing date of the international patent application, this is the date that the International Bureau receives all necessary elements for the international patent application
* hague_registration_pub_date (date) datethat the International Bureau publishes the international patent application
* hague_international_registration_number (varchar(255)) international registration number
* edition (varchar(255)) possibly the edition of the Classification Locarno which determined main_classification
* main_classification (varchar(255)) classification for what type of design the patent is for

==Connecting Patent database and Assignment database==
The answer to connecting the Patent database to the Assignment database lies somewhere in using the information in the table DOCUMENT_INFO to connect to a patent_id from the Patent database PATENT table for each assignment in the Assignment database ASSIGNMENT table. On further investigation into "The USPTO Patent Assignment Dataset: Descriptions and Analysis, the field they called "grant_doc_num" (grant_num above) in DOCUMENT_INFO is the patent number (the description of DOCUMENT_INFO above has been altered to reflect this.)

The paper mentions that there will errors in patent number stored in DOCUMENT_INFO under grant_doc_num so a separate table called DOCUMENT_INFO_ADMIN was constructed to determine how prevalent errors were. They queried the patent number from for the appno_doc_num from administrative data. The "appno_doc_num" is the Application Document USPTO number from DOCUMENT_INFO. On further investigation into the paper, it became clear that "administrative data" refers to internal USPTO data that was available to the authors of the paper, but that we might not have. However, in 99% of cases the grant_doc_num and the queried patent number based on the appno_doc_num were the same, so we can probably rely on grant_doc.

Therefore, since every table in the Patent database will have the patent number stored, to connect any table in Assignment to the Patent database, it would first be joined with DOCUMENT_INFO on rf_id and then joined with the appropriate table in Patent on patent number.

Also in the paper, they mentioned the Assignments on the Web for Patents (AOTW-P), a searchable database of individual USPTO assignment records keyed on reel-frame identification, patent number, and assignor or assignee name (https://assignment.uspto.gov/patent/index.html#/patent/search). Obviously it would not be possible to individually use this tool to query all the patent numbers, but if it would be possible to write a script to somehow query each patent number using the rf_id and parse the response, this could potentially be useful to check the patent numbers, but might not be any more accurate than what will already be in DOCUMENT_INFO.

==Finding New Fields Unique to Plant, Reissue, and/or Design Patents==

Based on Oliver's script which searched all xpaths and compared which were unique to particular types, we see that the following attributes are unique to each type of patents other than utility patents. These attributes vary by the XML version, which changed over time. Therefore, the lists below are a superset of the attributes that are unique to each of the patents types listed below across all XML versions.

When I am done finding the supersets, we will determine how to integrate these attributes into the design for the patent table.

'''There appears to be a fifth patent type in the results from Oliver's script, and there are attributes that are unique to Utility patents, though they do not appear particularly useful'''

===Plant Patents===
* US Claim Statement (XML45) - unclear what this represents
* Latin Name (XML44)
* US Botanical Variety (XML44)

===Reissue Patents===
* Parent Status (XML45)
* Parent Document (XML44) - seems to be either a normal document or a parent grant document (or both), both contain the following fields
** Date
** Document number
** Country
** Document ID
** Kind (XML42)
* Child Document (XML44)
** Document ID
** Document Number
** Country (XML43)

Note about fields related to the parent document and/or parent grant document:
The structure of the above fields related to a reissue patent is very odd and varies by XML version. For example, for most of the XML version, parent grant document falls under parent document and has it's own document number, date, country, document ID, kind, etc. The parent doc field falls under relation, which falls under reissue.

However, in XML41, parent document falls under relation and a new category - US Reexamination Reissue Merger. The information pertaining to the child documentation also fall under both relation and US Reexamination Reissue Merger. It's possible that during the time period this XML version represents, there were two types of reissue patents - some reexamination related to a merger and a standard reissue. With discrepancies like these, we'll have to determine in the end we want to stick with the fields as they are designed above (in which cases the information about a parent doc or parent grant document is stored the same whether it is a reissue or an Reexamination Reissue, or if these two classifications are different enough to warrant duplicate fields in the table.

===Design Patents===
* length of grant (XML45)
* Hague Agreement Data (XML45) - allows people to file design patents in 66 countries with one application
** International Registration Date
** International Filing Date
** International Registration Publication Date
** International Registration Number
* Classification Locarno (XML40)
** Edition
** Main Classification

Please note that the lists above are meant to represent unique types that should be added to the patent database. This does not mean that every XML version contains all these fields, or that every version contains the same path to these fields. That will have to be determined separately. The XML number next to some of the items is meant to represent the latest XML version where the field was seen.

Thought I was mainly looking at unique fields in plant, reissue, and design patents, I noticed the following field on unique paths for utility patents in XML41

* US Related Documents
** Substitution
*** Relation
**** Parent Document
***** Docuemnt ID
***** Document Number
***** Country
***** Date
***** Kind
***** Parent Status
**** Child Document
***** Document Number
***** Document ID
***** Country
** Continuation/Continuation in Part
*** Relation
**** Child Doc
***** Date
***** Kind
** Division
*** Relation
**** Child Document
***** Kind

This appears to resemble the information about reissue patents. Additionally, a couple of fields appear under the information about bibliographic data that might be useful:
* US Deceased Inventor (XML41)
** Post Code
** Address
* US Provisional Application Status (XML41)

==Paths for the New Fields Related to Plant Reissue, and/or Design Patents==

===Plant Patents===
XML 4.4, 4.3, 4.1, and 4.0
parent node: us-patent-grant/us-bibliographic-data-grant/us-botanic
fields: latin-name, variety

XML 4.5
parent node: us-patent-grant/us-claim-statement/
field: i

XML 4.2
parent node: us-patent-grant/us-bibliographic-data-grant/us-botanic
fields: latin-name, variety

parent node: us-patent-grant/us-claim-statement/
field: i

===Reissue Patents===

XML 4.5
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/parent-doc/
fields: parent-status

XML 4.4
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/continuing-reissue/relation/
fields: parent-doc/parent-grant-document/document-id/date
parent-doc/document-id/country
parent-doc/parent-grant-document/document-id/country
child-doc/document-id/country
child-doc/document-id/doc-number
parent-doc/document-id/doc-number
parent-doc/parent-grant-document/document-id/doc-number
parent-doc/document-id/date
(everything in XML 4.3 except parent-status)

XML 4.3
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/
fields: child-doc/document-id/doc-number
parent-doc/parent-grant-document/document-id/date
parent-doc/parent-status
parent-doc/document-id/date
parent-doc/document-id/country
parent-doc/document-id/doc-number
parent-doc/parent-grant-document/document-id/country
child-doc/document-id/country
parent-doc/parent-grant-document/document-id/doc-number

XML 4.1
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/
fields:
child-doc/document-id/doc-number
child-doc/document-id/country
parent-doc/parent-status
parent-doc/document-id/kind
parent-doc/document-id/country
parent-doc/document-id/doc-number
parent-doc/document-id/date
parent-doc/parent-grant-document/document-id/kind
parent-doc/parent-grant-document/document-id/country
parent-doc/parent-grant-document/document-id/date
parent-doc/parent-grant-document/document-id/doc-number

other parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/us-reexamination-reissue-merger/relation/
fields:
child-doc/document-id/doc-number
child-doc/document-id/country
parent-doc/document-id/date
parent-doc/document-id/doc-number
parent-doc/document-id/country
parent-doc/parent-grant-document/document-id/kind
parent-doc/parent-grant-document/document-id/country
parent-doc/parent-grant-document/document-id/date
parent-doc/parent-grant-document/document-id/doc-number

XML 4.0 and XML 4.2
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/
fields: parent-doc/document-id/kind
parent-doc/document-id/doc-number
parent-doc/parent-grant-document/document-id/kind
parent-doc/parent-grant-document/document-id/country
child-doc/document-id/country
parent-doc/parent-grant-document/document-id/doc-number
child-doc/document-id/doc-number
parent-doc/parent-grant-document/document-id/date
parent-doc/parent-status
parent-doc/document-id/country
parent-doc/document-id/date

===Design Patents===
XML 4.5
parent node: us-patent-grant/us-bibliographic-data-grant/
fields: hague-agreement-data/international-registration-date/date
hague-agreement-data/international-registration-publication-date/date
us-term-of-grant/length-of-grant
hague-agreement-data/international-registration-number
hague-agreement-data/international-filing-date/date

XML 4.1, 4.3, and 4.4
parent node: us-patent-grant/us-bibliographic-data-grant/us-term-of-grant/
fields: length-of-grant

XML 4.0
parent node: us-patent-grant/us-bibliographic-data-grant/
fields: us-term-of-grant/length-of-grant
classification-locarno/edition
classification-locarno/main-classification

Redesign Assignment and Patent Database

2017-12-08T20:36:15Z

ShelbyBice: /* ER Diagram for Assignment and Patent Databases */

{{McNair Projects
|Has title=Redesign Assignment and Patent Database
|Has owner=Shelby Bice,
|Has start date=9/2017
|Has keywords=patent
|Is dependent on=Reproducible Patent Data,
}}
'''Final Notes''' This is more or less the finalized design for the patent database. Oliver Chang's code (see Reproducible Patent Data, which is his project page) more or less fits the design, though there are some differences and expect differences in variable names. In the future the code should be altered so that the name of variables match up and each table that is listed here exists in the database. The tables for the extra variables that exist for reissue, design, and plant patents have been added, and the instructions for adding tables can be found on Reproducible Patent Data.

'''For Oliver:''' Unfortunately I did not get to finish making the schema in your code fit the schema that is outlined here. For whoever works on this next, whether that be you or another intern, please note that the variable in the schema in the code do not match up exactly with the schema outlined here, except for perhaps the Reissues, Plants, and Designs tables in the Patent database (called patentsj in the code, I believe, but you can create a database with any name of course).

This is an extension of the work I did last semester under "Redesigning Patent Database". Instead of simply reconfiguring the existing database, this project encompasses and full redesign and creation of a new Patent database and a new Assignment database that will be joined together.

Reference for definitions of fields in the two databases: https://www.oecd.org/sti/sci-tech/37569498.pdf

Paper on how to extract paper from XML files, parse it, and put it into a database: https://funginstitute.berkeley.edu/wp-content/uploads/2014/06/patentprocessor.pdf

'''Particularly useful, because it provides solutions for cleaning location and firm data'''

==ER Diagram for Assignment and Patent Databases==

Note: this matches the design below but may not match the exact database (nor the code that creates the schema of the database currently. In the future these should be made to sync up.

[[File:Erdplus-diagram (3).png]]

Attributes for each table are listed below with descriptions - because of how many attributes there are, I decided the ER diagram would be better suited as an overview of the tables rather than trying to show all the attributes on the diagram.

If you would like to edit the diagram, you can find the ER diagram you can go to https://erdplus.com/ and click "Open Diagram File" under the "Diagram" dropdown, then navigate to E:/McNair/Project/Redesigning Patent Database/New Patent Database Project/ERDiagramforPatentandAssignmentDatabases.erdplus. You should be able to edit the file then.

==Assignment Database Structure==

After reading through the paper "The USPTO Patent Assignment Dataset:
Descriptions and Analysis"(https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf) and taking into account are needs for the Assignment database (the ability to connect the two databases together, the ability to trace conveyance of a patent over time) I decided on the following structure for the Assignment database.

The foreign key that connects every table in the database is rf_id, which stands for reel frame id. It is unique for each entry in the main table, assignment and is included in every table.

Conveyance is the only table that cannot be fully constructed as data is inserted into the database. Some of the fields depend on other tables being constructed, so it may be either partially populated while the other tables are being populated or is the last table to be populated.

Full list of tables: assignee, assignor, document_info, assignment, conveyance

===ASSIGNEE===
Purpose: store all the assignees from all the assignments

Each entry represents: the recipient of a patent in an entry in the ASSIGNMENT table

Primary key: rf_id, unless it is possible to have multiple assignees, then rf_id and assignee name

All columns (variables) in the table:

*rf_id (bigint) reel frame number
* assignee_name (varchar(255)) name of the assignee
* address_line_1 (varchar(255)) first line of address of assignee
* address_line_2 (varchar(255)) second line of address of assignee
* city (varchar(255)) city of assignee
* state (varchar(255)) state of assignee
* country var(char(255)) country of assignee
* postal_code (varchar(255)) post code of the assignee

===ASSIGNOR===

Purpose: store all the assignors from all the assignments

Each entry represents: the one assigning/granting the license for the entry in the ASSIGNMENT table

Primary key: rf_id and assignor_name

All columns (variables):
* rf_id (bigint) reel frame number
* assignor_name (varchar(255)) name of the assignor

===CONVEYANCE===
Purpose: represent the conveyance of a particular patent overtime

Each entry represents: an individual transaction for a patent (not unlike assignment, except this table has information about the assignment in relation to other assignments for the patent

Primary key: rf_id

All columns (variables): Two possibilities, depending on whether we want data to be repeated or not.

Not-repeated (with exception of primary/foreign key):
* rf_id (bigint) reel frame number
* next_transaction (bigint) rf_id of next transaction that occurred after this transaction (traced by asignee/assignor)
* num_conveyance (int) number in order of conveyance by date
* acknowledgment_date (date) date that the assignor acknowledged the transaction
* execution_date (date) date that the transaction took place
* conveyance_type (varchar(255)) the USPTO paper determined a couple of different conveyance types and also outlined how to determine based on the conveyance text in the xml file which conveyance type is relevant to the transaction. The conveyance types are:
** assignment
** merger
** change of name
** government interest
** agreement
** security agreement
** release

Repeated, also include:
* assignee_name (varchar(255)) name of the assignee
* assignor_name (varchar(255)) name of the assignor
* patent_number (varchar(255)) identifying number for the patent

===DOCUMENT_INFO===
Purpose: store extra information relevant to the patents represented in ASSIGNMENT

Each entry represents: information about the documentation and the patent for a particular entry in ASSIGNMENT

Primary key: rf_id

All columns (variables) when created:
* rf_id (bigint) reel frame number
* app_num (varchar(255)) application document USPTO number
* app_date (date) date that the application was filed
* app_country (varhar(255)) country in which it was filed
* pgpub_num (varchar(255)) pre-grant publication document USPTO number
* pgpub_date (date) date that pre-grant publication was released
* pgpub_country (varchar(255)) country in which the patent is published before being granted a patent number
* patent_number (varchar(255)) granted patent document USPTO numbe
* grant_date (date) date that the patent is officially published
* grant_country (varchar(255)) country in which the patent is published after being granted a patent number
* invention_title (varchar(255)) title of the invention
* language (varchar(255)) languageof invention title – could be potentially useful and interesting to investigate
* reel_num (bigint) number of the reel the assignment was stored on
* frame_num (bigint) number of the frame on the reel the assignment was stored on

===ASSIGNMENT===
Purpose: represent assignment transactions – central table for the database

Each entry represents: an individual patent assignment transaction

Primary key: rf_id

All columns (variables):
* rf_id (bigint) reel fram number
* assignment_id (bigint) US patent assignment id
* correspondent_name (varchar(255)) name of correspondent for the transaction, could be a lawyer
* correspondent_address1 (varchar(255)) first line of address for correspondent
* correspondent_address2 (varchar(255)) second line of address for correspondent
* correspondent_address3 (varchar(255)) third line of address for correspondent
* correspondent_address4 (varchar(255)) fourth line of address for correspondent
* record_date (date) date recorded with USPTO)
* last_update_date (date) date that information for this assignment was last update with the USPTO
* page_cnt (bigint) page count of assignment record)

More extensive notes exist under :E/McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper

==Patent Database Structure==

Most of this design is based off what I investigated last semester and therefore is subject to change, since the design for the Assignment database is somewhat new.

The foreign key that connects every table is patent_number, which uniquely identifies every patent in the PATENT table and is included in every table.

Full list of tables: patent, assignee, citation, fee, claims, histpatent, lawyers, inventors

Update: These tables, as of October 26, are being updated and changed as I determine what data can actually be retrieved from the XML files based on the xpaths that we have determined (you can see these xpaths on the "Equivalent XPath and APS Queries" project page.

===PATENT===
Purpose: represent patents – central table for the database

Each entry represents: an individual patent

Primary key: patent_number

All columns (variables):
* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* patent_type (varchar(255)) the type of patent, like utility
* patent_kind (varchar(2)) the kind of patent, a letter and a number
* title (varchar(255)) the title of the patent
* grantdate (date) date that patent was officially granted including the year
* prioritydate (date) date used to established novelty of an invention in regards to other inventions (source: http://www.bios.net/daisy/patentlens/2343.html)
* prioritycountry (varchar(255)) country where patent is first filed
* patent_country (varchar(255)) country in which this patent was published
* prioritypatentnumber (varchar(255)) placeholder patent number if the patent is published before it is fully approved - also referred to as priority claims number
* cpcsubclass (varchar(255)) cooperative patent classification subclass
* cpcsubgroup (varchar(255)) cooperative patent classification subgroup
* cpcmaingroup (varchar(255)) cooperative patent classification main group
* cpctotal (varchar(255)) concatenated version of cpcmaingroup, cpcsubgroup, cpcsubclass
* pctpatentnumber (varchar(255)) international patent number (according to Patent Cooperation Treaty) also known as PCT Document Number
* appnum(varchar(255)) application number, probably also the filing number - the format is a two digit series code followed by a six digit serial number (source: https://www.uspto.gov/patents-application-process/filing-online/info-application-number)
* appdate (date) date that the application was filed, including year - probably also the filing date
* ipcrsubclass (varchar(255)) International patent classification subclass
* ipcrmaingroup (varchar(255)) International patent classification main group
* ipcrsubgroup (varchar(255)) International patent classification subgroup
* ipcrtotal (varchar(255)) concatenated version ipcrsubgroup, ipcrsubclass, ipcrmaincroup
* national_classification (varchar(255)) - probably uspc, which is United States Patent Classification
* natioanl_classification_country (varchar(255)) - should be United States
* num_claims (int) number of claims (things that the inventor wishes to protect with this patent)

Discrepancies:
These fields were previously a a part of the table, but we do not currently have xpaths for them:
app_year - the year the application was filed
grant_year - the year the patent was granted
uspc - United States Patent Classification
(might be the same as national_classification)
uspcsub - United States Patent subgroup

===ASSIGNEE===
Purpose: Represent the person the patent was currently assigned to at publication according to the patent information (the information in the Assignment database is more complete)

Each entry represents: the assignee for a patent in PATENT

Primary key: patent_number, first_name, last_name (in case it is possible for there to be multiple assignees for an individual patent)

All columns (variables):
* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
*first_name (varchar(255)) first name of assignee
*last name (varchar(255)) last name of assignee
* address (varchar(255)) address of assignee
* postcode (varchar(255)) postcode of assignee
* orgname (varchar(255)) name of the organization owning the patent (if applicable - organization may go by multiple orgnames)
* city (varchar(255)) city of the assignee
* country (varchar(255)) country of the assignee
* state (varchar(255)) state of assignee
* residence (varchar(255)) mailing address of assignee if they receive mail at a different address than the one listed above
* size_of_firm (varchar(255)) size of the organization, large, small, or micro - relates to how much the fee they pay for the patent is

===CITATION===
Purpose: Represent all the patents in PATENT that another patent in PATENT cites

Each entry represents: one citation that a patent in PATENT makes

Primary key: patent_number AND cited_patent_number (because a patent might cite multiple patents, so both variables are needed for the primary key)

All columns (variables):
* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* cited_patent_number (varchar(255)) unique identifier from the UPSTO office for the patent being cited
* description (varchar(255)) description of the citation
* citation_date (date) unclear what this date is supposed to represent from the DTD, but according to the paper that is the second link in the top section, it is the "date of cited document"
* citation_kind (varchar(255)) type of cited document
* citation_county (varchaar(255)) country of origin for the cited patent
* citation_name (varchar(255)) appears to be the names to whom the cited patent are assigned to

Note: There are other fields available for this table, such as document number, but this seems like it might be a repeat of the document number for the citing patent

===FEE===
Purpose: Represent information about the fees paid for an individual patent

Each entry represents: information about the fees paid on the patent

Primary key: patent_number

All columns:
* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* four (boolean) payment of maintenance fee 4th year
* eight (boolean) payment of maintenance fee 8th year
* twelve (boolean) payment of maintenance fee 12th year
* fee_date (date) date that fee is paid on
* fee_code (varchar(255)) code for type of fee to pay

===CLAIMS===

Purpose: Represent information about the claims for an individual patent

Each entry represents: information about a claim that a patent makes

Primary key: patent_number and claim_id

All columns:
* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* claim_id (varchar(255)) id number for the claim
* language (varchar(255)) language of claim
* claim_type (varchar(255)) type of claim
* status (varchar(255)) status of the claim
* claim_text (text) the text of the claim

===HISTPATENT===
Purpose: Represent historical data about the patent pertaining to dates and publication

Each entry represents: sequence of historical data about an individual patent in PATENT

Primary key: patent_number

* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* pubno (varchar(255)) patent publication number
* pubdate (date) date patent was issued
* dispdate (date) disposal date for the application
* distype (varchar(3)) application status of the patent - ABN for patent term adjustment, ISS for issued, or PEN for pending
* exp_dt (date) expiration date of patent
* pta (int) length of patent term adjustment (which extends the amount of time the patent is allowed to be in force) in number of days

===INVENTORS===
Purpose: Represent the inventor for each of patents in the PATENT table

Each entry represents: information about an inventor for a specific patent

Primary Key: patent_number (and name, if there can be multiple inventors for a single patent)

* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identify the inventor?
* name (varchar(255)) concatenated string of the first and last name of the inventor
* organme (varchar(255)) name of the organization the inventor works for
* state (varchar(255)) - state in which the inventor resided
* address (varchar(255)) - address at which the inventor resided
* country (varchar(255)) country in which the inventor resided
* city (varchar(255)) city in which the inventor resided

===LAWYERS===
Purpose: Represent the lawyers who worked on each of the patents in the PATENT table

Each entry represents: information about a lawyer for a specific patent

Primary Key: patent_number (there doesn't appear to be more than one lawyer entry for each patent)

* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* sequence (varchar(255)) appears to be a sequence of numbers that unique identify the lawyer?
* org_name (varchar(255)) the name of the firm that did arbitration for the patent

===Unique Attributes Tables===

There are several attributes that have been identified and are unique, applying to only one kind of patent. Instead of adding all these fields to the main patent table, which will introduce a lot of NULL entries because three out of four patents will not have any particular attribute. Therefore, I have created three new tables to store these unique attributes - one for each type of patent with unique attributes.

For each table, the primary key is the patent_number, which connects a row to a row in the main patent table.

TODO: Figure out best way to generate these tables, finish description of attributes for each table.

* Note that I did include some unique attributes for Utility patents as seen below in the "Finding New Paths Unique to Plant, Reissue, and/or Design Patents" section, but for now (11/3/2017) I'm not going to create a separate table for those fields. Most of them seem to be repeated in Reissue, and the other fields, while interesting, are perhaps not very useful.

====PLANT====
=====Fields=====
* patent_no (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* latin_name (varchar(255)) latin name for the plant
* us_botanical_variety(varchar(255)) denotes what variety of plant it is - for example, a rose has several different varieties

====REISSUE====

In the following table, you'll notice lots of fields related to three different kinds of "documents" - a "parent document", a "child document", and a "parent grant document". It is not immediately clear what these three documents represent for a reissue patent. After some research, I think I have determined that these "documents" are, but please know that I do not have definitive proof as there is little available information about these "documents" in regards to a reissue patent.

It is possible that the "parent document" is the parent patent application - that is, the first patent application filed in regards to an invention. That would explain why reissue patents also has several fields related to a "child document" - a "child document" could be a child application, which is filed while a parent application is still pending. So in this case, a "child document" of a reissue patent would be an application regarding the same patent that was filed while the reissue application was still pending. A "child document" is either a continuation, disclosure, or continuation-in-part application ([http://www.patenttrademarkblog.com/parent-and-child-patent-applications/ source]).

Based on this logic, I think that the difference between a "parent document" and a "parent grant document" depends on whether the parent patent application has been granted. If the parent patent application is still pending (meaning the patent has not yet been granted yet) then I believe the reissue patent will store information about the parent patent under "parent document". However, if the parent patent application has been granted, then the information will be stored under "parent grant document". This seems like the most logical explanation, especially considered the path to any field related to "parent grant document" contains "parent document" as with the example below:

us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/parent-doc/parent-grant-document/document-id/date

Therefore, a particular reissue patent will probably only have the fields filled for either the "parent document" or "parent grant document" if I am right about what they represent.

=====Fields=====

* patent_no (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* parent_doc_status (varchar(255)) status of the parent application. Probably related to whether the patent application is pending or not
* parent_doc_number (int) probably application number for the parent document
* parent_doc_id (varchar(255)) it is unclear how this is different from parent document number. In the xpaths, the path will be something like ./parent_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.
* parent_doc_kind (varchar(255)) this may actually denote what the "parent document" is - or may refer to it's purpose (i.e. reissue, continuation, continuation-in-part, etc.)
* parent_doc_country (varchar(255)) the country of origin of the parent document
* parent_doc_date (date) probably the date the parent document was filed
* parent_grant_doc_number (int) probably application number for the parent grant document
* parent_grant_doc_id (varchar(255)) it is unclear how this is different from parent grant document number. In the xpaths, the path will be something like ./parent_doc/parent-grant-doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.
* parent_grant_doc_kind (varchar(255)) this may actually denote what the "parent grant document" is, or may refer to it's purpose (i.e. reissue, continuation, continuation-in-part, etc.)
* parent_grant_doc_country (varchar(255)) the country of origin of the parent grant document
* parent_grant_doc_date (date) probably the date the parent grant document was filed
* child_doc_number (varchar(255)) probably the application number of the child application
* child_doc_id (varchar(255)) it is unclear how this is different from child document number. In the xpaths, the path will be something like ./child_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.
* child_doc_country (varchar(255)) country of origin of the child application

====DESIGN====
=====Fields=====
* patent_no (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* length_of_grant (int) length of grant, most likely in years
* hague_registration_date (date) filing date of international patent application
* hague_filing_date (date) not necessarily the same as the filing date of the international patent application, this is the date that the International Bureau receives all necessary elements for the international patent application
* hague_registration_pub_date (date) datethat the International Bureau publishes the international patent application
* hague_international_registration_number (varchar(255)) international registration number
* edition (varchar(255)) possibly the edition of the Classification Locarno which determined main_classification
* main_classification (varchar(255)) classification for what type of design the patent is for

==Connecting Patent database and Assignment database==
The answer to connecting the Patent database to the Assignment database lies somewhere in using the information in the table DOCUMENT_INFO to connect to a patent_id from the Patent database PATENT table for each assignment in the Assignment database ASSIGNMENT table. On further investigation into "The USPTO Patent Assignment Dataset: Descriptions and Analysis, the field they called "grant_doc_num" (grant_num above) in DOCUMENT_INFO is the patent number (the description of DOCUMENT_INFO above has been altered to reflect this.)

The paper mentions that there will errors in patent number stored in DOCUMENT_INFO under grant_doc_num so a separate table called DOCUMENT_INFO_ADMIN was constructed to determine how prevalent errors were. They queried the patent number from for the appno_doc_num from administrative data. The "appno_doc_num" is the Application Document USPTO number from DOCUMENT_INFO. On further investigation into the paper, it became clear that "administrative data" refers to internal USPTO data that was available to the authors of the paper, but that we might not have. However, in 99% of cases the grant_doc_num and the queried patent number based on the appno_doc_num were the same, so we can probably rely on grant_doc.

Therefore, since every table in the Patent database will have the patent number stored, to connect any table in Assignment to the Patent database, it would first be joined with DOCUMENT_INFO on rf_id and then joined with the appropriate table in Patent on patent number.

Also in the paper, they mentioned the Assignments on the Web for Patents (AOTW-P), a searchable database of individual USPTO assignment records keyed on reel-frame identification, patent number, and assignor or assignee name (https://assignment.uspto.gov/patent/index.html#/patent/search). Obviously it would not be possible to individually use this tool to query all the patent numbers, but if it would be possible to write a script to somehow query each patent number using the rf_id and parse the response, this could potentially be useful to check the patent numbers, but might not be any more accurate than what will already be in DOCUMENT_INFO.

==Finding New Fields Unique to Plant, Reissue, and/or Design Patents==

Based on Oliver's script which searched all xpaths and compared which were unique to particular types, we see that the following attributes are unique to each type of patents other than utility patents. These attributes vary by the XML version, which changed over time. Therefore, the lists below are a superset of the attributes that are unique to each of the patents types listed below across all XML versions.

When I am done finding the supersets, we will determine how to integrate these attributes into the design for the patent table.

'''There appears to be a fifth patent type in the results from Oliver's script, and there are attributes that are unique to Utility patents, though they do not appear particularly useful'''

===Plant Patents===
* US Claim Statement (XML45) - unclear what this represents
* Latin Name (XML44)
* US Botanical Variety (XML44)

===Reissue Patents===
* Parent Status (XML45)
* Parent Document (XML44) - seems to be either a normal document or a parent grant document (or both), both contain the following fields
** Date
** Document number
** Country
** Document ID
** Kind (XML42)
* Child Document (XML44)
** Document ID
** Document Number
** Country (XML43)

Note about fields related to the parent document and/or parent grant document:
The structure of the above fields related to a reissue patent is very odd and varies by XML version. For example, for most of the XML version, parent grant document falls under parent document and has it's own document number, date, country, document ID, kind, etc. The parent doc field falls under relation, which falls under reissue.

However, in XML41, parent document falls under relation and a new category - US Reexamination Reissue Merger. The information pertaining to the child documentation also fall under both relation and US Reexamination Reissue Merger. It's possible that during the time period this XML version represents, there were two types of reissue patents - some reexamination related to a merger and a standard reissue. With discrepancies like these, we'll have to determine in the end we want to stick with the fields as they are designed above (in which cases the information about a parent doc or parent grant document is stored the same whether it is a reissue or an Reexamination Reissue, or if these two classifications are different enough to warrant duplicate fields in the table.

===Design Patents===
* length of grant (XML45)
* Hague Agreement Data (XML45) - allows people to file design patents in 66 countries with one application
** International Registration Date
** International Filing Date
** International Registration Publication Date
** International Registration Number
* Classification Locarno (XML40)
** Edition
** Main Classification

Please note that the lists above are meant to represent unique types that should be added to the patent database. This does not mean that every XML version contains all these fields, or that every version contains the same path to these fields. That will have to be determined separately. The XML number next to some of the items is meant to represent the latest XML version where the field was seen.

Thought I was mainly looking at unique fields in plant, reissue, and design patents, I noticed the following field on unique paths for utility patents in XML41

* US Related Documents
** Substitution
*** Relation
**** Parent Document
***** Docuemnt ID
***** Document Number
***** Country
***** Date
***** Kind
***** Parent Status
**** Child Document
***** Document Number
***** Document ID
***** Country
** Continuation/Continuation in Part
*** Relation
**** Child Doc
***** Date
***** Kind
** Division
*** Relation
**** Child Document
***** Kind

This appears to resemble the information about reissue patents. Additionally, a couple of fields appear under the information about bibliographic data that might be useful:
* US Deceased Inventor (XML41)
** Post Code
** Address
* US Provisional Application Status (XML41)

==Paths for the New Fields Related to Plant Reissue, and/or Design Patents==

===Plant Patents===
XML 4.4, 4.3, 4.1, and 4.0
parent node: us-patent-grant/us-bibliographic-data-grant/us-botanic
fields: latin-name, variety

XML 4.5
parent node: us-patent-grant/us-claim-statement/
field: i

XML 4.2
parent node: us-patent-grant/us-bibliographic-data-grant/us-botanic
fields: latin-name, variety

parent node: us-patent-grant/us-claim-statement/
field: i

===Reissue Patents===

XML 4.5
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/parent-doc/
fields: parent-status

XML 4.4
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/continuing-reissue/relation/
fields: parent-doc/parent-grant-document/document-id/date
parent-doc/document-id/country
parent-doc/parent-grant-document/document-id/country
child-doc/document-id/country
child-doc/document-id/doc-number
parent-doc/document-id/doc-number
parent-doc/parent-grant-document/document-id/doc-number
parent-doc/document-id/date
(everything in XML 4.3 except parent-status)

XML 4.3
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/
fields: child-doc/document-id/doc-number
parent-doc/parent-grant-document/document-id/date
parent-doc/parent-status
parent-doc/document-id/date
parent-doc/document-id/country
parent-doc/document-id/doc-number
parent-doc/parent-grant-document/document-id/country
child-doc/document-id/country
parent-doc/parent-grant-document/document-id/doc-number

XML 4.1
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/
fields:
child-doc/document-id/doc-number
child-doc/document-id/country
parent-doc/parent-status
parent-doc/document-id/kind
parent-doc/document-id/country
parent-doc/document-id/doc-number
parent-doc/document-id/date
parent-doc/parent-grant-document/document-id/kind
parent-doc/parent-grant-document/document-id/country
parent-doc/parent-grant-document/document-id/date
parent-doc/parent-grant-document/document-id/doc-number

other parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/us-reexamination-reissue-merger/relation/
fields:
child-doc/document-id/doc-number
child-doc/document-id/country
parent-doc/document-id/date
parent-doc/document-id/doc-number
parent-doc/document-id/country
parent-doc/parent-grant-document/document-id/kind
parent-doc/parent-grant-document/document-id/country
parent-doc/parent-grant-document/document-id/date
parent-doc/parent-grant-document/document-id/doc-number

XML 4.0 and XML 4.2
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/
fields: parent-doc/document-id/kind
parent-doc/document-id/doc-number
parent-doc/parent-grant-document/document-id/kind
parent-doc/parent-grant-document/document-id/country
child-doc/document-id/country
parent-doc/parent-grant-document/document-id/doc-number
child-doc/document-id/doc-number
parent-doc/parent-grant-document/document-id/date
parent-doc/parent-status
parent-doc/document-id/country
parent-doc/document-id/date

===Design Patents===
XML 4.5
parent node: us-patent-grant/us-bibliographic-data-grant/
fields: hague-agreement-data/international-registration-date/date
hague-agreement-data/international-registration-publication-date/date
us-term-of-grant/length-of-grant
hague-agreement-data/international-registration-number
hague-agreement-data/international-filing-date/date

XML 4.1, 4.3, and 4.4
parent node: us-patent-grant/us-bibliographic-data-grant/us-term-of-grant/
fields: length-of-grant

XML 4.0
parent node: us-patent-grant/us-bibliographic-data-grant/
fields: us-term-of-grant/length-of-grant
classification-locarno/edition
classification-locarno/main-classification

Redesign Assignment and Patent Database

2017-12-08T20:29:14Z

ShelbyBice:

{{McNair Projects
|Has title=Redesign Assignment and Patent Database
|Has owner=Shelby Bice,
|Has start date=9/2017
|Has keywords=patent
|Is dependent on=Reproducible Patent Data,
}}
'''Final Notes''' This is more or less the finalized design for the patent database. Oliver Chang's code (see Reproducible Patent Data, which is his project page) more or less fits the design, though there are some differences and expect differences in variable names. In the future the code should be altered so that the name of variables match up and each table that is listed here exists in the database. The tables for the extra variables that exist for reissue, design, and plant patents have been added, and the instructions for adding tables can be found on Reproducible Patent Data.

'''For Oliver:''' Unfortunately I did not get to finish making the schema in your code fit the schema that is outlined here. For whoever works on this next, whether that be you or another intern, please note that the variable in the schema in the code do not match up exactly with the schema outlined here, except for perhaps the Reissues, Plants, and Designs tables in the Patent database (called patentsj in the code, I believe, but you can create a database with any name of course).

This is an extension of the work I did last semester under "Redesigning Patent Database". Instead of simply reconfiguring the existing database, this project encompasses and full redesign and creation of a new Patent database and a new Assignment database that will be joined together.

Reference for definitions of fields in the two databases: https://www.oecd.org/sti/sci-tech/37569498.pdf

Paper on how to extract paper from XML files, parse it, and put it into a database: https://funginstitute.berkeley.edu/wp-content/uploads/2014/06/patentprocessor.pdf

'''Particularly useful, because it provides solutions for cleaning location and firm data'''

==ER Diagram for Assignment and Patent Databases==

Note: this matches the design below but may not match the exact database (nor the code that creates the schema of the database currently. In the future these should be made to sync up.

[[File:erdplus-diagram(3).png]]

Attributes for each table are listed below with descriptions - because of how many attributes there are, I decided the ER diagram would be better suited as an overview of the tables rather than trying to show all the attributes on the diagram.

==Assignment Database Structure==

After reading through the paper "The USPTO Patent Assignment Dataset:
Descriptions and Analysis"(https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf) and taking into account are needs for the Assignment database (the ability to connect the two databases together, the ability to trace conveyance of a patent over time) I decided on the following structure for the Assignment database.

The foreign key that connects every table in the database is rf_id, which stands for reel frame id. It is unique for each entry in the main table, assignment and is included in every table.

Conveyance is the only table that cannot be fully constructed as data is inserted into the database. Some of the fields depend on other tables being constructed, so it may be either partially populated while the other tables are being populated or is the last table to be populated.

Full list of tables: assignee, assignor, document_info, assignment, conveyance

===ASSIGNEE===
Purpose: store all the assignees from all the assignments

Each entry represents: the recipient of a patent in an entry in the ASSIGNMENT table

Primary key: rf_id, unless it is possible to have multiple assignees, then rf_id and assignee name

All columns (variables) in the table:

*rf_id (bigint) reel frame number
* assignee_name (varchar(255)) name of the assignee
* address_line_1 (varchar(255)) first line of address of assignee
* address_line_2 (varchar(255)) second line of address of assignee
* city (varchar(255)) city of assignee
* state (varchar(255)) state of assignee
* country var(char(255)) country of assignee
* postal_code (varchar(255)) post code of the assignee

===ASSIGNOR===

Purpose: store all the assignors from all the assignments

Each entry represents: the one assigning/granting the license for the entry in the ASSIGNMENT table

Primary key: rf_id and assignor_name

All columns (variables):
* rf_id (bigint) reel frame number
* assignor_name (varchar(255)) name of the assignor

===CONVEYANCE===
Purpose: represent the conveyance of a particular patent overtime

Each entry represents: an individual transaction for a patent (not unlike assignment, except this table has information about the assignment in relation to other assignments for the patent

Primary key: rf_id

All columns (variables): Two possibilities, depending on whether we want data to be repeated or not.

Not-repeated (with exception of primary/foreign key):
* rf_id (bigint) reel frame number
* next_transaction (bigint) rf_id of next transaction that occurred after this transaction (traced by asignee/assignor)
* num_conveyance (int) number in order of conveyance by date
* acknowledgment_date (date) date that the assignor acknowledged the transaction
* execution_date (date) date that the transaction took place
* conveyance_type (varchar(255)) the USPTO paper determined a couple of different conveyance types and also outlined how to determine based on the conveyance text in the xml file which conveyance type is relevant to the transaction. The conveyance types are:
** assignment
** merger
** change of name
** government interest
** agreement
** security agreement
** release

Repeated, also include:
* assignee_name (varchar(255)) name of the assignee
* assignor_name (varchar(255)) name of the assignor
* patent_number (varchar(255)) identifying number for the patent

===DOCUMENT_INFO===
Purpose: store extra information relevant to the patents represented in ASSIGNMENT

Each entry represents: information about the documentation and the patent for a particular entry in ASSIGNMENT

Primary key: rf_id

All columns (variables) when created:
* rf_id (bigint) reel frame number
* app_num (varchar(255)) application document USPTO number
* app_date (date) date that the application was filed
* app_country (varhar(255)) country in which it was filed
* pgpub_num (varchar(255)) pre-grant publication document USPTO number
* pgpub_date (date) date that pre-grant publication was released
* pgpub_country (varchar(255)) country in which the patent is published before being granted a patent number
* patent_number (varchar(255)) granted patent document USPTO numbe
* grant_date (date) date that the patent is officially published
* grant_country (varchar(255)) country in which the patent is published after being granted a patent number
* invention_title (varchar(255)) title of the invention
* language (varchar(255)) languageof invention title – could be potentially useful and interesting to investigate
* reel_num (bigint) number of the reel the assignment was stored on
* frame_num (bigint) number of the frame on the reel the assignment was stored on

===ASSIGNMENT===
Purpose: represent assignment transactions – central table for the database

Each entry represents: an individual patent assignment transaction

Primary key: rf_id

All columns (variables):
* rf_id (bigint) reel fram number
* assignment_id (bigint) US patent assignment id
* correspondent_name (varchar(255)) name of correspondent for the transaction, could be a lawyer
* correspondent_address1 (varchar(255)) first line of address for correspondent
* correspondent_address2 (varchar(255)) second line of address for correspondent
* correspondent_address3 (varchar(255)) third line of address for correspondent
* correspondent_address4 (varchar(255)) fourth line of address for correspondent
* record_date (date) date recorded with USPTO)
* last_update_date (date) date that information for this assignment was last update with the USPTO
* page_cnt (bigint) page count of assignment record)

More extensive notes exist under :E/McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper

==Patent Database Structure==

Most of this design is based off what I investigated last semester and therefore is subject to change, since the design for the Assignment database is somewhat new.

The foreign key that connects every table is patent_number, which uniquely identifies every patent in the PATENT table and is included in every table.

Full list of tables: patent, assignee, citation, fee, claims, histpatent, lawyers, inventors

Update: These tables, as of October 26, are being updated and changed as I determine what data can actually be retrieved from the XML files based on the xpaths that we have determined (you can see these xpaths on the "Equivalent XPath and APS Queries" project page.

===PATENT===
Purpose: represent patents – central table for the database

Each entry represents: an individual patent

Primary key: patent_number

All columns (variables):
* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* patent_type (varchar(255)) the type of patent, like utility
* patent_kind (varchar(2)) the kind of patent, a letter and a number
* title (varchar(255)) the title of the patent
* grantdate (date) date that patent was officially granted including the year
* prioritydate (date) date used to established novelty of an invention in regards to other inventions (source: http://www.bios.net/daisy/patentlens/2343.html)
* prioritycountry (varchar(255)) country where patent is first filed
* patent_country (varchar(255)) country in which this patent was published
* prioritypatentnumber (varchar(255)) placeholder patent number if the patent is published before it is fully approved - also referred to as priority claims number
* cpcsubclass (varchar(255)) cooperative patent classification subclass
* cpcsubgroup (varchar(255)) cooperative patent classification subgroup
* cpcmaingroup (varchar(255)) cooperative patent classification main group
* cpctotal (varchar(255)) concatenated version of cpcmaingroup, cpcsubgroup, cpcsubclass
* pctpatentnumber (varchar(255)) international patent number (according to Patent Cooperation Treaty) also known as PCT Document Number
* appnum(varchar(255)) application number, probably also the filing number - the format is a two digit series code followed by a six digit serial number (source: https://www.uspto.gov/patents-application-process/filing-online/info-application-number)
* appdate (date) date that the application was filed, including year - probably also the filing date
* ipcrsubclass (varchar(255)) International patent classification subclass
* ipcrmaingroup (varchar(255)) International patent classification main group
* ipcrsubgroup (varchar(255)) International patent classification subgroup
* ipcrtotal (varchar(255)) concatenated version ipcrsubgroup, ipcrsubclass, ipcrmaincroup
* national_classification (varchar(255)) - probably uspc, which is United States Patent Classification
* natioanl_classification_country (varchar(255)) - should be United States
* num_claims (int) number of claims (things that the inventor wishes to protect with this patent)

Discrepancies:
These fields were previously a a part of the table, but we do not currently have xpaths for them:
app_year - the year the application was filed
grant_year - the year the patent was granted
uspc - United States Patent Classification
(might be the same as national_classification)
uspcsub - United States Patent subgroup

===ASSIGNEE===
Purpose: Represent the person the patent was currently assigned to at publication according to the patent information (the information in the Assignment database is more complete)

Each entry represents: the assignee for a patent in PATENT

Primary key: patent_number, first_name, last_name (in case it is possible for there to be multiple assignees for an individual patent)

All columns (variables):
* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
*first_name (varchar(255)) first name of assignee
*last name (varchar(255)) last name of assignee
* address (varchar(255)) address of assignee
* postcode (varchar(255)) postcode of assignee
* orgname (varchar(255)) name of the organization owning the patent (if applicable - organization may go by multiple orgnames)
* city (varchar(255)) city of the assignee
* country (varchar(255)) country of the assignee
* state (varchar(255)) state of assignee
* residence (varchar(255)) mailing address of assignee if they receive mail at a different address than the one listed above
* size_of_firm (varchar(255)) size of the organization, large, small, or micro - relates to how much the fee they pay for the patent is

===CITATION===
Purpose: Represent all the patents in PATENT that another patent in PATENT cites

Each entry represents: one citation that a patent in PATENT makes

Primary key: patent_number AND cited_patent_number (because a patent might cite multiple patents, so both variables are needed for the primary key)

All columns (variables):
* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* cited_patent_number (varchar(255)) unique identifier from the UPSTO office for the patent being cited
* description (varchar(255)) description of the citation
* citation_date (date) unclear what this date is supposed to represent from the DTD, but according to the paper that is the second link in the top section, it is the "date of cited document"
* citation_kind (varchar(255)) type of cited document
* citation_county (varchaar(255)) country of origin for the cited patent
* citation_name (varchar(255)) appears to be the names to whom the cited patent are assigned to

Note: There are other fields available for this table, such as document number, but this seems like it might be a repeat of the document number for the citing patent

===FEE===
Purpose: Represent information about the fees paid for an individual patent

Each entry represents: information about the fees paid on the patent

Primary key: patent_number

All columns:
* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* four (boolean) payment of maintenance fee 4th year
* eight (boolean) payment of maintenance fee 8th year
* twelve (boolean) payment of maintenance fee 12th year
* fee_date (date) date that fee is paid on
* fee_code (varchar(255)) code for type of fee to pay

===CLAIMS===

Purpose: Represent information about the claims for an individual patent

Each entry represents: information about a claim that a patent makes

Primary key: patent_number and claim_id

All columns:
* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* claim_id (varchar(255)) id number for the claim
* language (varchar(255)) language of claim
* claim_type (varchar(255)) type of claim
* status (varchar(255)) status of the claim
* claim_text (text) the text of the claim

===HISTPATENT===
Purpose: Represent historical data about the patent pertaining to dates and publication

Each entry represents: sequence of historical data about an individual patent in PATENT

Primary key: patent_number

* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* pubno (varchar(255)) patent publication number
* pubdate (date) date patent was issued
* dispdate (date) disposal date for the application
* distype (varchar(3)) application status of the patent - ABN for patent term adjustment, ISS for issued, or PEN for pending
* exp_dt (date) expiration date of patent
* pta (int) length of patent term adjustment (which extends the amount of time the patent is allowed to be in force) in number of days

===INVENTORS===
Purpose: Represent the inventor for each of patents in the PATENT table

Each entry represents: information about an inventor for a specific patent

Primary Key: patent_number (and name, if there can be multiple inventors for a single patent)

* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identify the inventor?
* name (varchar(255)) concatenated string of the first and last name of the inventor
* organme (varchar(255)) name of the organization the inventor works for
* state (varchar(255)) - state in which the inventor resided
* address (varchar(255)) - address at which the inventor resided
* country (varchar(255)) country in which the inventor resided
* city (varchar(255)) city in which the inventor resided

===LAWYERS===
Purpose: Represent the lawyers who worked on each of the patents in the PATENT table

Each entry represents: information about a lawyer for a specific patent

Primary Key: patent_number (there doesn't appear to be more than one lawyer entry for each patent)

* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* sequence (varchar(255)) appears to be a sequence of numbers that unique identify the lawyer?
* org_name (varchar(255)) the name of the firm that did arbitration for the patent

===Unique Attributes Tables===

There are several attributes that have been identified and are unique, applying to only one kind of patent. Instead of adding all these fields to the main patent table, which will introduce a lot of NULL entries because three out of four patents will not have any particular attribute. Therefore, I have created three new tables to store these unique attributes - one for each type of patent with unique attributes.

For each table, the primary key is the patent_number, which connects a row to a row in the main patent table.

TODO: Figure out best way to generate these tables, finish description of attributes for each table.

* Note that I did include some unique attributes for Utility patents as seen below in the "Finding New Paths Unique to Plant, Reissue, and/or Design Patents" section, but for now (11/3/2017) I'm not going to create a separate table for those fields. Most of them seem to be repeated in Reissue, and the other fields, while interesting, are perhaps not very useful.

====PLANT====
=====Fields=====
* patent_no (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* latin_name (varchar(255)) latin name for the plant
* us_botanical_variety(varchar(255)) denotes what variety of plant it is - for example, a rose has several different varieties

====REISSUE====

In the following table, you'll notice lots of fields related to three different kinds of "documents" - a "parent document", a "child document", and a "parent grant document". It is not immediately clear what these three documents represent for a reissue patent. After some research, I think I have determined that these "documents" are, but please know that I do not have definitive proof as there is little available information about these "documents" in regards to a reissue patent.

It is possible that the "parent document" is the parent patent application - that is, the first patent application filed in regards to an invention. That would explain why reissue patents also has several fields related to a "child document" - a "child document" could be a child application, which is filed while a parent application is still pending. So in this case, a "child document" of a reissue patent would be an application regarding the same patent that was filed while the reissue application was still pending. A "child document" is either a continuation, disclosure, or continuation-in-part application ([http://www.patenttrademarkblog.com/parent-and-child-patent-applications/ source]).

Based on this logic, I think that the difference between a "parent document" and a "parent grant document" depends on whether the parent patent application has been granted. If the parent patent application is still pending (meaning the patent has not yet been granted yet) then I believe the reissue patent will store information about the parent patent under "parent document". However, if the parent patent application has been granted, then the information will be stored under "parent grant document". This seems like the most logical explanation, especially considered the path to any field related to "parent grant document" contains "parent document" as with the example below:

us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/parent-doc/parent-grant-document/document-id/date

Therefore, a particular reissue patent will probably only have the fields filled for either the "parent document" or "parent grant document" if I am right about what they represent.

=====Fields=====

* patent_no (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* parent_doc_status (varchar(255)) status of the parent application. Probably related to whether the patent application is pending or not
* parent_doc_number (int) probably application number for the parent document
* parent_doc_id (varchar(255)) it is unclear how this is different from parent document number. In the xpaths, the path will be something like ./parent_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.
* parent_doc_kind (varchar(255)) this may actually denote what the "parent document" is - or may refer to it's purpose (i.e. reissue, continuation, continuation-in-part, etc.)
* parent_doc_country (varchar(255)) the country of origin of the parent document
* parent_doc_date (date) probably the date the parent document was filed
* parent_grant_doc_number (int) probably application number for the parent grant document
* parent_grant_doc_id (varchar(255)) it is unclear how this is different from parent grant document number. In the xpaths, the path will be something like ./parent_doc/parent-grant-doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.
* parent_grant_doc_kind (varchar(255)) this may actually denote what the "parent grant document" is, or may refer to it's purpose (i.e. reissue, continuation, continuation-in-part, etc.)
* parent_grant_doc_country (varchar(255)) the country of origin of the parent grant document
* parent_grant_doc_date (date) probably the date the parent grant document was filed
* child_doc_number (varchar(255)) probably the application number of the child application
* child_doc_id (varchar(255)) it is unclear how this is different from child document number. In the xpaths, the path will be something like ./child_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.
* child_doc_country (varchar(255)) country of origin of the child application

====DESIGN====
=====Fields=====
* patent_no (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* length_of_grant (int) length of grant, most likely in years
* hague_registration_date (date) filing date of international patent application
* hague_filing_date (date) not necessarily the same as the filing date of the international patent application, this is the date that the International Bureau receives all necessary elements for the international patent application
* hague_registration_pub_date (date) datethat the International Bureau publishes the international patent application
* hague_international_registration_number (varchar(255)) international registration number
* edition (varchar(255)) possibly the edition of the Classification Locarno which determined main_classification
* main_classification (varchar(255)) classification for what type of design the patent is for

==Connecting Patent database and Assignment database==
The answer to connecting the Patent database to the Assignment database lies somewhere in using the information in the table DOCUMENT_INFO to connect to a patent_id from the Patent database PATENT table for each assignment in the Assignment database ASSIGNMENT table. On further investigation into "The USPTO Patent Assignment Dataset: Descriptions and Analysis, the field they called "grant_doc_num" (grant_num above) in DOCUMENT_INFO is the patent number (the description of DOCUMENT_INFO above has been altered to reflect this.)

The paper mentions that there will errors in patent number stored in DOCUMENT_INFO under grant_doc_num so a separate table called DOCUMENT_INFO_ADMIN was constructed to determine how prevalent errors were. They queried the patent number from for the appno_doc_num from administrative data. The "appno_doc_num" is the Application Document USPTO number from DOCUMENT_INFO. On further investigation into the paper, it became clear that "administrative data" refers to internal USPTO data that was available to the authors of the paper, but that we might not have. However, in 99% of cases the grant_doc_num and the queried patent number based on the appno_doc_num were the same, so we can probably rely on grant_doc.

Therefore, since every table in the Patent database will have the patent number stored, to connect any table in Assignment to the Patent database, it would first be joined with DOCUMENT_INFO on rf_id and then joined with the appropriate table in Patent on patent number.

Also in the paper, they mentioned the Assignments on the Web for Patents (AOTW-P), a searchable database of individual USPTO assignment records keyed on reel-frame identification, patent number, and assignor or assignee name (https://assignment.uspto.gov/patent/index.html#/patent/search). Obviously it would not be possible to individually use this tool to query all the patent numbers, but if it would be possible to write a script to somehow query each patent number using the rf_id and parse the response, this could potentially be useful to check the patent numbers, but might not be any more accurate than what will already be in DOCUMENT_INFO.

==Finding New Fields Unique to Plant, Reissue, and/or Design Patents==

Based on Oliver's script which searched all xpaths and compared which were unique to particular types, we see that the following attributes are unique to each type of patents other than utility patents. These attributes vary by the XML version, which changed over time. Therefore, the lists below are a superset of the attributes that are unique to each of the patents types listed below across all XML versions.

When I am done finding the supersets, we will determine how to integrate these attributes into the design for the patent table.

'''There appears to be a fifth patent type in the results from Oliver's script, and there are attributes that are unique to Utility patents, though they do not appear particularly useful'''

===Plant Patents===
* US Claim Statement (XML45) - unclear what this represents
* Latin Name (XML44)
* US Botanical Variety (XML44)

===Reissue Patents===
* Parent Status (XML45)
* Parent Document (XML44) - seems to be either a normal document or a parent grant document (or both), both contain the following fields
** Date
** Document number
** Country
** Document ID
** Kind (XML42)
* Child Document (XML44)
** Document ID
** Document Number
** Country (XML43)

Note about fields related to the parent document and/or parent grant document:
The structure of the above fields related to a reissue patent is very odd and varies by XML version. For example, for most of the XML version, parent grant document falls under parent document and has it's own document number, date, country, document ID, kind, etc. The parent doc field falls under relation, which falls under reissue.

However, in XML41, parent document falls under relation and a new category - US Reexamination Reissue Merger. The information pertaining to the child documentation also fall under both relation and US Reexamination Reissue Merger. It's possible that during the time period this XML version represents, there were two types of reissue patents - some reexamination related to a merger and a standard reissue. With discrepancies like these, we'll have to determine in the end we want to stick with the fields as they are designed above (in which cases the information about a parent doc or parent grant document is stored the same whether it is a reissue or an Reexamination Reissue, or if these two classifications are different enough to warrant duplicate fields in the table.

===Design Patents===
* length of grant (XML45)
* Hague Agreement Data (XML45) - allows people to file design patents in 66 countries with one application
** International Registration Date
** International Filing Date
** International Registration Publication Date
** International Registration Number
* Classification Locarno (XML40)
** Edition
** Main Classification

Please note that the lists above are meant to represent unique types that should be added to the patent database. This does not mean that every XML version contains all these fields, or that every version contains the same path to these fields. That will have to be determined separately. The XML number next to some of the items is meant to represent the latest XML version where the field was seen.

Thought I was mainly looking at unique fields in plant, reissue, and design patents, I noticed the following field on unique paths for utility patents in XML41

* US Related Documents
** Substitution
*** Relation
**** Parent Document
***** Docuemnt ID
***** Document Number
***** Country
***** Date
***** Kind
***** Parent Status
**** Child Document
***** Document Number
***** Document ID
***** Country
** Continuation/Continuation in Part
*** Relation
**** Child Doc
***** Date
***** Kind
** Division
*** Relation
**** Child Document
***** Kind

This appears to resemble the information about reissue patents. Additionally, a couple of fields appear under the information about bibliographic data that might be useful:
* US Deceased Inventor (XML41)
** Post Code
** Address
* US Provisional Application Status (XML41)

==Paths for the New Fields Related to Plant Reissue, and/or Design Patents==

===Plant Patents===
XML 4.4, 4.3, 4.1, and 4.0
parent node: us-patent-grant/us-bibliographic-data-grant/us-botanic
fields: latin-name, variety

XML 4.5
parent node: us-patent-grant/us-claim-statement/
field: i

XML 4.2
parent node: us-patent-grant/us-bibliographic-data-grant/us-botanic
fields: latin-name, variety

parent node: us-patent-grant/us-claim-statement/
field: i

===Reissue Patents===

XML 4.5
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/parent-doc/
fields: parent-status

XML 4.4
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/continuing-reissue/relation/
fields: parent-doc/parent-grant-document/document-id/date
parent-doc/document-id/country
parent-doc/parent-grant-document/document-id/country
child-doc/document-id/country
child-doc/document-id/doc-number
parent-doc/document-id/doc-number
parent-doc/parent-grant-document/document-id/doc-number
parent-doc/document-id/date
(everything in XML 4.3 except parent-status)

XML 4.3
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/
fields: child-doc/document-id/doc-number
parent-doc/parent-grant-document/document-id/date
parent-doc/parent-status
parent-doc/document-id/date
parent-doc/document-id/country
parent-doc/document-id/doc-number
parent-doc/parent-grant-document/document-id/country
child-doc/document-id/country
parent-doc/parent-grant-document/document-id/doc-number

XML 4.1
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/
fields:
child-doc/document-id/doc-number
child-doc/document-id/country
parent-doc/parent-status
parent-doc/document-id/kind
parent-doc/document-id/country
parent-doc/document-id/doc-number
parent-doc/document-id/date
parent-doc/parent-grant-document/document-id/kind
parent-doc/parent-grant-document/document-id/country
parent-doc/parent-grant-document/document-id/date
parent-doc/parent-grant-document/document-id/doc-number

other parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/us-reexamination-reissue-merger/relation/
fields:
child-doc/document-id/doc-number
child-doc/document-id/country
parent-doc/document-id/date
parent-doc/document-id/doc-number
parent-doc/document-id/country
parent-doc/parent-grant-document/document-id/kind
parent-doc/parent-grant-document/document-id/country
parent-doc/parent-grant-document/document-id/date
parent-doc/parent-grant-document/document-id/doc-number

XML 4.0 and XML 4.2
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/
fields: parent-doc/document-id/kind
parent-doc/document-id/doc-number
parent-doc/parent-grant-document/document-id/kind
parent-doc/parent-grant-document/document-id/country
child-doc/document-id/country
parent-doc/parent-grant-document/document-id/doc-number
child-doc/document-id/doc-number
parent-doc/parent-grant-document/document-id/date
parent-doc/parent-status
parent-doc/document-id/country
parent-doc/document-id/date

===Design Patents===
XML 4.5
parent node: us-patent-grant/us-bibliographic-data-grant/
fields: hague-agreement-data/international-registration-date/date
hague-agreement-data/international-registration-publication-date/date
us-term-of-grant/length-of-grant
hague-agreement-data/international-registration-number
hague-agreement-data/international-filing-date/date

XML 4.1, 4.3, and 4.4
parent node: us-patent-grant/us-bibliographic-data-grant/us-term-of-grant/
fields: length-of-grant

XML 4.0
parent node: us-patent-grant/us-bibliographic-data-grant/
fields: us-term-of-grant/length-of-grant
classification-locarno/edition
classification-locarno/main-classification

Redesign Assignment and Patent Database

2017-12-08T20:20:56Z

ShelbyBice: /* ER Diagram for Assignment and Patent Databases */

{{McNair Projects
|Has title=Redesign Assignment and Patent Database
|Has owner=Shelby Bice,
|Has start date=9/2017
|Has keywords=patent
|Is dependent on=Reproducible Patent Data,
}}
'''Final Notes''' This is more or less the finalized design for the patent database. Oliver Chang's code (see Reproducible Patent Data, which is his project page) more or less fits the design, though there are some differences and expect differences in variable names. In the future the code should be altered so that the name of variables match up and each table that is listed here exists in the database. The tables for the extra variables that exist for reissue, design, and plant patents have been added, and the instructions for adding tables can be found on Reproducible Patent Data.

This is an extension of the work I did last semester under "Redesigning Patent Database". Instead of simply reconfiguring the existing database, this project encompasses and full redesign and creation of a new Patent database and a new Assignment database that will be joined together.

Reference for definitions of fields in the two databases: https://www.oecd.org/sti/sci-tech/37569498.pdf

Paper on how to extract paper from XML files, parse it, and put it into a database: https://funginstitute.berkeley.edu/wp-content/uploads/2014/06/patentprocessor.pdf

'''Particularly useful, because it provides solutions for cleaning location and firm data'''

==ER Diagram for Assignment and Patent Databases==

Note: this matches the design below but may not match the exact database (nor the code that creates the schema of the database currently. In the future these should be made to sync up.

[[File:erdplus-diagram(3).png]]

Attributes for each table are listed below with descriptions - because of how many attributes there are, I decided the ER diagram would be better suited as an overview of the tables rather than trying to show all the attributes on the diagram.

==Assignment Database Structure==

After reading through the paper "The USPTO Patent Assignment Dataset:
Descriptions and Analysis"(https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf) and taking into account are needs for the Assignment database (the ability to connect the two databases together, the ability to trace conveyance of a patent over time) I decided on the following structure for the Assignment database.

The foreign key that connects every table in the database is rf_id, which stands for reel frame id. It is unique for each entry in the main table, assignment and is included in every table.

Conveyance is the only table that cannot be fully constructed as data is inserted into the database. Some of the fields depend on other tables being constructed, so it may be either partially populated while the other tables are being populated or is the last table to be populated.

Full list of tables: assignee, assignor, document_info, assignment, conveyance

===ASSIGNEE===
Purpose: store all the assignees from all the assignments

Each entry represents: the recipient of a patent in an entry in the ASSIGNMENT table

Primary key: rf_id, unless it is possible to have multiple assignees, then rf_id and assignee name

All columns (variables) in the table:

*rf_id (bigint) reel frame number
* assignee_name (varchar(255)) name of the assignee
* address_line_1 (varchar(255)) first line of address of assignee
* address_line_2 (varchar(255)) second line of address of assignee
* city (varchar(255)) city of assignee
* state (varchar(255)) state of assignee
* country var(char(255)) country of assignee
* postal_code (varchar(255)) post code of the assignee

===ASSIGNOR===

Purpose: store all the assignors from all the assignments

Each entry represents: the one assigning/granting the license for the entry in the ASSIGNMENT table

Primary key: rf_id and assignor_name

All columns (variables):
* rf_id (bigint) reel frame number
* assignor_name (varchar(255)) name of the assignor

===CONVEYANCE===
Purpose: represent the conveyance of a particular patent overtime

Each entry represents: an individual transaction for a patent (not unlike assignment, except this table has information about the assignment in relation to other assignments for the patent

Primary key: rf_id

All columns (variables): Two possibilities, depending on whether we want data to be repeated or not.

Not-repeated (with exception of primary/foreign key):
* rf_id (bigint) reel frame number
* next_transaction (bigint) rf_id of next transaction that occurred after this transaction (traced by asignee/assignor)
* num_conveyance (int) number in order of conveyance by date
* acknowledgment_date (date) date that the assignor acknowledged the transaction
* execution_date (date) date that the transaction took place
* conveyance_type (varchar(255)) the USPTO paper determined a couple of different conveyance types and also outlined how to determine based on the conveyance text in the xml file which conveyance type is relevant to the transaction. The conveyance types are:
** assignment
** merger
** change of name
** government interest
** agreement
** security agreement
** release

Repeated, also include:
* assignee_name (varchar(255)) name of the assignee
* assignor_name (varchar(255)) name of the assignor
* patent_number (varchar(255)) identifying number for the patent

===DOCUMENT_INFO===
Purpose: store extra information relevant to the patents represented in ASSIGNMENT

Each entry represents: information about the documentation and the patent for a particular entry in ASSIGNMENT

Primary key: rf_id

All columns (variables) when created:
* rf_id (bigint) reel frame number
* app_num (varchar(255)) application document USPTO number
* app_date (date) date that the application was filed
* app_country (varhar(255)) country in which it was filed
* pgpub_num (varchar(255)) pre-grant publication document USPTO number
* pgpub_date (date) date that pre-grant publication was released
* pgpub_country (varchar(255)) country in which the patent is published before being granted a patent number
* patent_number (varchar(255)) granted patent document USPTO numbe
* grant_date (date) date that the patent is officially published
* grant_country (varchar(255)) country in which the patent is published after being granted a patent number
* invention_title (varchar(255)) title of the invention
* language (varchar(255)) languageof invention title – could be potentially useful and interesting to investigate
* reel_num (bigint) number of the reel the assignment was stored on
* frame_num (bigint) number of the frame on the reel the assignment was stored on

===ASSIGNMENT===
Purpose: represent assignment transactions – central table for the database

Each entry represents: an individual patent assignment transaction

Primary key: rf_id

All columns (variables):
* rf_id (bigint) reel fram number
* assignment_id (bigint) US patent assignment id
* correspondent_name (varchar(255)) name of correspondent for the transaction, could be a lawyer
* correspondent_address1 (varchar(255)) first line of address for correspondent
* correspondent_address2 (varchar(255)) second line of address for correspondent
* correspondent_address3 (varchar(255)) third line of address for correspondent
* correspondent_address4 (varchar(255)) fourth line of address for correspondent
* record_date (date) date recorded with USPTO)
* last_update_date (date) date that information for this assignment was last update with the USPTO
* page_cnt (bigint) page count of assignment record)

More extensive notes exist under :E/McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper

==Patent Database Structure==

Most of this design is based off what I investigated last semester and therefore is subject to change, since the design for the Assignment database is somewhat new.

The foreign key that connects every table is patent_number, which uniquely identifies every patent in the PATENT table and is included in every table.

Full list of tables: patent, assignee, citation, fee, claims, histpatent, lawyers, inventors

Update: These tables, as of October 26, are being updated and changed as I determine what data can actually be retrieved from the XML files based on the xpaths that we have determined (you can see these xpaths on the "Equivalent XPath and APS Queries" project page.

===PATENT===
Purpose: represent patents – central table for the database

Each entry represents: an individual patent

Primary key: patent_number

All columns (variables):
* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* patent_type (varchar(255)) the type of patent, like utility
* patent_kind (varchar(2)) the kind of patent, a letter and a number
* title (varchar(255)) the title of the patent
* grantdate (date) date that patent was officially granted including the year
* prioritydate (date) date used to established novelty of an invention in regards to other inventions (source: http://www.bios.net/daisy/patentlens/2343.html)
* prioritycountry (varchar(255)) country where patent is first filed
* patent_country (varchar(255)) country in which this patent was published
* prioritypatentnumber (varchar(255)) placeholder patent number if the patent is published before it is fully approved - also referred to as priority claims number
* cpcsubclass (varchar(255)) cooperative patent classification subclass
* cpcsubgroup (varchar(255)) cooperative patent classification subgroup
* cpcmaingroup (varchar(255)) cooperative patent classification main group
* cpctotal (varchar(255)) concatenated version of cpcmaingroup, cpcsubgroup, cpcsubclass
* pctpatentnumber (varchar(255)) international patent number (according to Patent Cooperation Treaty) also known as PCT Document Number
* appnum(varchar(255)) application number, probably also the filing number - the format is a two digit series code followed by a six digit serial number (source: https://www.uspto.gov/patents-application-process/filing-online/info-application-number)
* appdate (date) date that the application was filed, including year - probably also the filing date
* ipcrsubclass (varchar(255)) International patent classification subclass
* ipcrmaingroup (varchar(255)) International patent classification main group
* ipcrsubgroup (varchar(255)) International patent classification subgroup
* ipcrtotal (varchar(255)) concatenated version ipcrsubgroup, ipcrsubclass, ipcrmaincroup
* national_classification (varchar(255)) - probably uspc, which is United States Patent Classification
* natioanl_classification_country (varchar(255)) - should be United States
* num_claims (int) number of claims (things that the inventor wishes to protect with this patent)

Discrepancies:
These fields were previously a a part of the table, but we do not currently have xpaths for them:
app_year - the year the application was filed
grant_year - the year the patent was granted
uspc - United States Patent Classification
(might be the same as national_classification)
uspcsub - United States Patent subgroup

===ASSIGNEE===
Purpose: Represent the person the patent was currently assigned to at publication according to the patent information (the information in the Assignment database is more complete)

Each entry represents: the assignee for a patent in PATENT

Primary key: patent_number, first_name, last_name (in case it is possible for there to be multiple assignees for an individual patent)

All columns (variables):
* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
*first_name (varchar(255)) first name of assignee
*last name (varchar(255)) last name of assignee
* address (varchar(255)) address of assignee
* postcode (varchar(255)) postcode of assignee
* orgname (varchar(255)) name of the organization owning the patent (if applicable - organization may go by multiple orgnames)
* city (varchar(255)) city of the assignee
* country (varchar(255)) country of the assignee
* state (varchar(255)) state of assignee
* residence (varchar(255)) mailing address of assignee if they receive mail at a different address than the one listed above
* size_of_firm (varchar(255)) size of the organization, large, small, or micro - relates to how much the fee they pay for the patent is

===CITATION===
Purpose: Represent all the patents in PATENT that another patent in PATENT cites

Each entry represents: one citation that a patent in PATENT makes

Primary key: patent_number AND cited_patent_number (because a patent might cite multiple patents, so both variables are needed for the primary key)

All columns (variables):
* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* cited_patent_number (varchar(255)) unique identifier from the UPSTO office for the patent being cited
* description (varchar(255)) description of the citation
* citation_date (date) unclear what this date is supposed to represent from the DTD, but according to the paper that is the second link in the top section, it is the "date of cited document"
* citation_kind (varchar(255)) type of cited document
* citation_county (varchaar(255)) country of origin for the cited patent
* citation_name (varchar(255)) appears to be the names to whom the cited patent are assigned to

Note: There are other fields available for this table, such as document number, but this seems like it might be a repeat of the document number for the citing patent

===FEE===
Purpose: Represent information about the fees paid for an individual patent

Each entry represents: information about the fees paid on the patent

Primary key: patent_number

All columns:
* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* four (boolean) payment of maintenance fee 4th year
* eight (boolean) payment of maintenance fee 8th year
* twelve (boolean) payment of maintenance fee 12th year
* fee_date (date) date that fee is paid on
* fee_code (varchar(255)) code for type of fee to pay

===CLAIMS===

Purpose: Represent information about the claims for an individual patent

Each entry represents: information about a claim that a patent makes

Primary key: patent_number and claim_id

All columns:
* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* claim_id (varchar(255)) id number for the claim
* language (varchar(255)) language of claim
* claim_type (varchar(255)) type of claim
* status (varchar(255)) status of the claim
* claim_text (text) the text of the claim

===HISTPATENT===
Purpose: Represent historical data about the patent pertaining to dates and publication

Each entry represents: sequence of historical data about an individual patent in PATENT

Primary key: patent_number

* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* pubno (varchar(255)) patent publication number
* pubdate (date) date patent was issued
* dispdate (date) disposal date for the application
* distype (varchar(3)) application status of the patent - ABN for patent term adjustment, ISS for issued, or PEN for pending
* exp_dt (date) expiration date of patent
* pta (int) length of patent term adjustment (which extends the amount of time the patent is allowed to be in force) in number of days

===INVENTORS===
Purpose: Represent the inventor for each of patents in the PATENT table

Each entry represents: information about an inventor for a specific patent

Primary Key: patent_number (and name, if there can be multiple inventors for a single patent)

* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identify the inventor?
* name (varchar(255)) concatenated string of the first and last name of the inventor
* organme (varchar(255)) name of the organization the inventor works for
* state (varchar(255)) - state in which the inventor resided
* address (varchar(255)) - address at which the inventor resided
* country (varchar(255)) country in which the inventor resided
* city (varchar(255)) city in which the inventor resided

===LAWYERS===
Purpose: Represent the lawyers who worked on each of the patents in the PATENT table

Each entry represents: information about a lawyer for a specific patent

Primary Key: patent_number (there doesn't appear to be more than one lawyer entry for each patent)

* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* sequence (varchar(255)) appears to be a sequence of numbers that unique identify the lawyer?
* org_name (varchar(255)) the name of the firm that did arbitration for the patent

===Unique Attributes Tables===

There are several attributes that have been identified and are unique, applying to only one kind of patent. Instead of adding all these fields to the main patent table, which will introduce a lot of NULL entries because three out of four patents will not have any particular attribute. Therefore, I have created three new tables to store these unique attributes - one for each type of patent with unique attributes.

For each table, the primary key is the patent_number, which connects a row to a row in the main patent table.

TODO: Figure out best way to generate these tables, finish description of attributes for each table.

* Note that I did include some unique attributes for Utility patents as seen below in the "Finding New Paths Unique to Plant, Reissue, and/or Design Patents" section, but for now (11/3/2017) I'm not going to create a separate table for those fields. Most of them seem to be repeated in Reissue, and the other fields, while interesting, are perhaps not very useful.

====PLANT====
=====Fields=====
* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* latin_name (varchar(255)) latin name for the plant
* us_botanical_variety(varchar(255)) denotes what variety of plant it is - for example, a rose has several different varieties

====REISSUE====

In the following table, you'll notice lots of fields related to three different kinds of "documents" - a "parent document", a "child document", and a "parent grant document". It is not immediately clear what these three documents represent for a reissue patent. After some research, I think I have determined that these "documents" are, but please know that I do not have definitive proof as there is little available information about these "documents" in regards to a reissue patent.

It is possible that the "parent document" is the parent patent application - that is, the first patent application filed in regards to an invention. That would explain why reissue patents also has several fields related to a "child document" - a "child document" could be a child application, which is filed while a parent application is still pending. So in this case, a "child document" of a reissue patent would be an application regarding the same patent that was filed while the reissue application was still pending. A "child document" is either a continuation, disclosure, or continuation-in-part application ([http://www.patenttrademarkblog.com/parent-and-child-patent-applications/ source]).

Based on this logic, I think that the difference between a "parent document" and a "parent grant document" depends on whether the parent patent application has been granted. If the parent patent application is still pending (meaning the patent has not yet been granted yet) then I believe the reissue patent will store information about the parent patent under "parent document". However, if the parent patent application has been granted, then the information will be stored under "parent grant document". This seems like the most logical explanation, especially considered the path to any field related to "parent grant document" contains "parent document" as with the example below:

us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/parent-doc/parent-grant-document/document-id/date

Therefore, a particular reissue patent will probably only have the fields filled for either the "parent document" or "parent grant document" if I am right about what they represent.

=====Fields=====

* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* parent_doc_status (varchar(255)) status of the parent application. Probably related to whether the patent application is pending or not
* parent_doc_number (int) probably application number for the parent document
* parent_doc_id (varchar(255)) it is unclear how this is different from parent document number. In the xpaths, the path will be something like ./parent_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.
* parent_doc_kind (varchar(255)) this may actually denote what the "parent document" is - or may refer to it's purpose (i.e. reissue, continuation, continuation-in-part, etc.)
* parent_doc_country (varchar(255)) the country of origin of the parent document
* parent_doc_date (date) probably the date the parent document was filed
* parent_grant_doc_number (int) probably application number for the parent grant document
* parent_grant_doc_id (varchar(255)) it is unclear how this is different from parent grant document number. In the xpaths, the path will be something like ./parent_doc/parent-grant-doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.
* parent_grant_doc_kind (varchar(255)) this may actually denote what the "parent grant document" is, or may refer to it's purpose (i.e. reissue, continuation, continuation-in-part, etc.)
* parent_grant_doc_country (varchar(255)) the country of origin of the parent grant document
* parent_grant_doc_date (date) probably the date the parent grant document was filed
* child_doc_number (varchar(255)) probably the application number of the child application
* child_doc_id (varchar(255)) it is unclear how this is different from child document number. In the xpaths, the path will be something like ./child_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.
* child_doc_country (varchar(255)) country of origin of the child application

====DESIGN====
=====Fields=====
* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* length_of_grant (int) length of grant, most likely in years
* hague_registration_date (date) filing date of international patent application
* hague_filing_date (date) not necessarily the same as the filing date of the international patent application, this is the date that the International Bureau receives all necessary elements for the international patent application
* hague_registration_pub_date (date) datethat the International Bureau publishes the international patent application
* hague_international_registration_number (varchar(255)) international registration number
* edition (varchar(255)) possibly the edition of the Classification Locarno which determined main_classification
* main_classification (varchar(255)) classification for what type of design the patent is for

==Connecting Patent database and Assignment database==
The answer to connecting the Patent database to the Assignment database lies somewhere in using the information in the table DOCUMENT_INFO to connect to a patent_id from the Patent database PATENT table for each assignment in the Assignment database ASSIGNMENT table. On further investigation into "The USPTO Patent Assignment Dataset: Descriptions and Analysis, the field they called "grant_doc_num" (grant_num above) in DOCUMENT_INFO is the patent number (the description of DOCUMENT_INFO above has been altered to reflect this.)

The paper mentions that there will errors in patent number stored in DOCUMENT_INFO under grant_doc_num so a separate table called DOCUMENT_INFO_ADMIN was constructed to determine how prevalent errors were. They queried the patent number from for the appno_doc_num from administrative data. The "appno_doc_num" is the Application Document USPTO number from DOCUMENT_INFO. On further investigation into the paper, it became clear that "administrative data" refers to internal USPTO data that was available to the authors of the paper, but that we might not have. However, in 99% of cases the grant_doc_num and the queried patent number based on the appno_doc_num were the same, so we can probably rely on grant_doc.

Therefore, since every table in the Patent database will have the patent number stored, to connect any table in Assignment to the Patent database, it would first be joined with DOCUMENT_INFO on rf_id and then joined with the appropriate table in Patent on patent number.

Also in the paper, they mentioned the Assignments on the Web for Patents (AOTW-P), a searchable database of individual USPTO assignment records keyed on reel-frame identification, patent number, and assignor or assignee name (https://assignment.uspto.gov/patent/index.html#/patent/search). Obviously it would not be possible to individually use this tool to query all the patent numbers, but if it would be possible to write a script to somehow query each patent number using the rf_id and parse the response, this could potentially be useful to check the patent numbers, but might not be any more accurate than what will already be in DOCUMENT_INFO.

==Finding New Fields Unique to Plant, Reissue, and/or Design Patents==

Based on Oliver's script which searched all xpaths and compared which were unique to particular types, we see that the following attributes are unique to each type of patents other than utility patents. These attributes vary by the XML version, which changed over time. Therefore, the lists below are a superset of the attributes that are unique to each of the patents types listed below across all XML versions.

When I am done finding the supersets, we will determine how to integrate these attributes into the design for the patent table.

'''There appears to be a fifth patent type in the results from Oliver's script, and there are attributes that are unique to Utility patents, though they do not appear particularly useful'''

===Plant Patents===
* US Claim Statement (XML45) - unclear what this represents
* Latin Name (XML44)
* US Botanical Variety (XML44)

===Reissue Patents===
* Parent Status (XML45)
* Parent Document (XML44) - seems to be either a normal document or a parent grant document (or both), both contain the following fields
** Date
** Document number
** Country
** Document ID
** Kind (XML42)
* Child Document (XML44)
** Document ID
** Document Number
** Country (XML43)

Note about fields related to the parent document and/or parent grant document:
The structure of the above fields related to a reissue patent is very odd and varies by XML version. For example, for most of the XML version, parent grant document falls under parent document and has it's own document number, date, country, document ID, kind, etc. The parent doc field falls under relation, which falls under reissue.

However, in XML41, parent document falls under relation and a new category - US Reexamination Reissue Merger. The information pertaining to the child documentation also fall under both relation and US Reexamination Reissue Merger. It's possible that during the time period this XML version represents, there were two types of reissue patents - some reexamination related to a merger and a standard reissue. With discrepancies like these, we'll have to determine in the end we want to stick with the fields as they are designed above (in which cases the information about a parent doc or parent grant document is stored the same whether it is a reissue or an Reexamination Reissue, or if these two classifications are different enough to warrant duplicate fields in the table.

===Design Patents===
* length of grant (XML45)
* Hague Agreement Data (XML45) - allows people to file design patents in 66 countries with one application
** International Registration Date
** International Filing Date
** International Registration Publication Date
** International Registration Number
* Classification Locarno (XML40)
** Edition
** Main Classification

Please note that the lists above are meant to represent unique types that should be added to the patent database. This does not mean that every XML version contains all these fields, or that every version contains the same path to these fields. That will have to be determined separately. The XML number next to some of the items is meant to represent the latest XML version where the field was seen.

Thought I was mainly looking at unique fields in plant, reissue, and design patents, I noticed the following field on unique paths for utility patents in XML41

* US Related Documents
** Substitution
*** Relation
**** Parent Document
***** Docuemnt ID
***** Document Number
***** Country
***** Date
***** Kind
***** Parent Status
**** Child Document
***** Document Number
***** Document ID
***** Country
** Continuation/Continuation in Part
*** Relation
**** Child Doc
***** Date
***** Kind
** Division
*** Relation
**** Child Document
***** Kind

This appears to resemble the information about reissue patents. Additionally, a couple of fields appear under the information about bibliographic data that might be useful:
* US Deceased Inventor (XML41)
** Post Code
** Address
* US Provisional Application Status (XML41)

==Paths for the New Fields Related to Plant Reissue, and/or Design Patents==

===Plant Patents===
XML 4.4, 4.3, 4.1, and 4.0
parent node: us-patent-grant/us-bibliographic-data-grant/us-botanic
fields: latin-name, variety

XML 4.5
parent node: us-patent-grant/us-claim-statement/
field: i

XML 4.2
parent node: us-patent-grant/us-bibliographic-data-grant/us-botanic
fields: latin-name, variety

parent node: us-patent-grant/us-claim-statement/
field: i

===Reissue Patents===

XML 4.5
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/parent-doc/
fields: parent-status

XML 4.4
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/continuing-reissue/relation/
fields: parent-doc/parent-grant-document/document-id/date
parent-doc/document-id/country
parent-doc/parent-grant-document/document-id/country
child-doc/document-id/country
child-doc/document-id/doc-number
parent-doc/document-id/doc-number
parent-doc/parent-grant-document/document-id/doc-number
parent-doc/document-id/date
(everything in XML 4.3 except parent-status)

XML 4.3
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/
fields: child-doc/document-id/doc-number
parent-doc/parent-grant-document/document-id/date
parent-doc/parent-status
parent-doc/document-id/date
parent-doc/document-id/country
parent-doc/document-id/doc-number
parent-doc/parent-grant-document/document-id/country
child-doc/document-id/country
parent-doc/parent-grant-document/document-id/doc-number

XML 4.1
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/
fields:
child-doc/document-id/doc-number
child-doc/document-id/country
parent-doc/parent-status
parent-doc/document-id/kind
parent-doc/document-id/country
parent-doc/document-id/doc-number
parent-doc/document-id/date
parent-doc/parent-grant-document/document-id/kind
parent-doc/parent-grant-document/document-id/country
parent-doc/parent-grant-document/document-id/date
parent-doc/parent-grant-document/document-id/doc-number

other parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/us-reexamination-reissue-merger/relation/
fields:
child-doc/document-id/doc-number
child-doc/document-id/country
parent-doc/document-id/date
parent-doc/document-id/doc-number
parent-doc/document-id/country
parent-doc/parent-grant-document/document-id/kind
parent-doc/parent-grant-document/document-id/country
parent-doc/parent-grant-document/document-id/date
parent-doc/parent-grant-document/document-id/doc-number

XML 4.0 and XML 4.2
parent node: us-patent-grant/us-bibliographic-data-grant/us-related-documents/reissue/relation/
fields: parent-doc/document-id/kind
parent-doc/document-id/doc-number
parent-doc/parent-grant-document/document-id/kind
parent-doc/parent-grant-document/document-id/country
child-doc/document-id/country
parent-doc/parent-grant-document/document-id/doc-number
child-doc/document-id/doc-number
parent-doc/parent-grant-document/document-id/date
parent-doc/parent-status
parent-doc/document-id/country
parent-doc/document-id/date

===Design Patents===
XML 4.5
parent node: us-patent-grant/us-bibliographic-data-grant/
fields: hague-agreement-data/international-registration-date/date
hague-agreement-data/international-registration-publication-date/date
us-term-of-grant/length-of-grant
hague-agreement-data/international-registration-number
hague-agreement-data/international-filing-date/date

XML 4.1, 4.3, and 4.4
parent node: us-patent-grant/us-bibliographic-data-grant/us-term-of-grant/
fields: length-of-grant

XML 4.0
parent node: us-patent-grant/us-bibliographic-data-grant/
fields: us-term-of-grant/length-of-grant
classification-locarno/edition
classification-locarno/main-classification

2017-11-03T21:57:33Z

ShelbyBice: /* REISSUE */

{{McNair Projects
|Has title=Redesign Assignment and Patent Database
|Has owner=Shelby Bice,
|Has start date=9/2017
|Has keywords=patent
|Is dependent on=Reproducible Patent Data,
}}
'''FOR ED:''' I finished adding going through the unique paths (you can see the unique fields I identified at the bottom of this page, under "Finding New Paths Unique to Plant, Reissue, and/or Design Patents". Then, under my design for the Patent Database, there's a new section called "Unique Attributes Tables" where I've begun detailing a design for three new tables that would include information that is unique to Design, Reissue, and Plant tables. Here are some questions I have:

'''1.''' Do you like design for the new tables? How do you think we should populate these tables? I can think of one way (putting all the information in the main patent tables and then moving it out to Plant, Reissue, or Design as needed) but I was wondering if there might be an easier way.

'''2.''' There were some unique attributes to utility patents that I also included in "Finding New Paths Unique to Plant, Reissue, and/or Design Patents". They only appear to be in two of the oldest XML versions (4.1 and 4.0) and I'm not sure the information is particularly useful - but please take a look at the field names and let me know if you disagree and think they are useful and should be included in a table in the patent database.

'''3.''' Do you know what some of these fields represent? For example, I've been trying to research online what a "Parent Document" and/or "Parent Grant Document" might represent for a reissue patent, and I've found a couple possible options. I will keep researching, but if you know what they represent or one of my descriptions for a field is wrong, please let me know.

This is an extension of the work I did last semester under "Redesigning Patent Database". Instead of simply reconfiguring the existing database, this project encompasses and full redesign and creation of a new Patent database and a new Assignment database that will be joined together.

Reference for definitions of fields in the two databases: https://www.oecd.org/sti/sci-tech/37569498.pdf

Paper on how to extract paper from XML files, parse it, and put it into a database: https://funginstitute.berkeley.edu/wp-content/uploads/2014/06/patentprocessor.pdf

'''Particularly useful, because it provides solutions for cleaning location and firm data'''

==ER Diagram for Assignment and Patent Databases==

Note: this is currently out of sync with the current description below for the patent database. It will be updated once I have finalized how we will organize unique attributes for plant, design, and reissue patents.

[[File:PatentAndAssignmentER2.png]]

Attributes for each table are listed below with descriptions - because of how many attributes there are, I decided the ER diagram would be better suited as an overview of the tables rather than trying to show all the attributes on the diagram.

==Assignment Database Structure==

After reading through the paper "The USPTO Patent Assignment Dataset:
Descriptions and Analysis"(https://www.uspto.gov/sites/default/files/documents/USPTO_Patents_Assignment_Dataset_WP.pdf) and taking into account are needs for the Assignment database (the ability to connect the two databases together, the ability to trace conveyance of a patent over time) I decided on the following structure for the Assignment database.

The foreign key that connects every table in the database is rf_id, which stands for reel frame id. It is unique for each entry in the main table, assignment and is included in every table.

Conveyance is the only table that cannot be fully constructed as data is inserted into the database. Some of the fields depend on other tables being constructed, so it may be either partially populated while the other tables are being populated or is the last table to be populated.

Full list of tables: assignee, assignor, document_info, assignment, conveyance

===ASSIGNEE===
Purpose: store all the assignees from all the assignments

Each entry represents: the recipient of a patent in an entry in the ASSIGNMENT table

Primary key: rf_id, unless it is possible to have multiple assignees, then rf_id and assignee name

All columns (variables) in the table:

*rf_id (bigint) reel frame number
* assignee_name (varchar(255)) name of the assignee
* address_line_1 (varchar(255)) first line of address of assignee
* address_line_2 (varchar(255)) second line of address of assignee
* city (varchar(255)) city of assignee
* state (varchar(255)) state of assignee
* country var(char(255)) country of assignee
* postal_code (varchar(255)) post code of the assignee

===ASSIGNOR===

Purpose: store all the assignors from all the assignments

Each entry represents: the one assigning/granting the license for the entry in the ASSIGNMENT table

Primary key: rf_id and assignor_name

All columns (variables):
* rf_id (bigint) reel frame number
* assignor_name (varchar(255)) name of the assignor

===CONVEYANCE===
Purpose: represent the conveyance of a particular patent overtime

Each entry represents: an individual transaction for a patent (not unlike assignment, except this table has information about the assignment in relation to other assignments for the patent

Primary key: rf_id

All columns (variables): Two possibilities, depending on whether we want data to be repeated or not.

Not-repeated (with exception of primary/foreign key):
* rf_id (bigint) reel frame number
* next_transaction (bigint) rf_id of next transaction that occurred after this transaction (traced by asignee/assignor)
* num_conveyance (int) number in order of conveyance by date
* acknowledgment_date (date) date that the assignor acknowledged the transaction
* execution_date (date) date that the transaction took place
* conveyance_type (varchar(255)) the USPTO paper determined a couple of different conveyance types and also outlined how to determine based on the conveyance text in the xml file which conveyance type is relevant to the transaction. The conveyance types are:
** assignment
** merger
** change of name
** government interest
** agreement
** security agreement
** release

Repeated, also include:
* assignee_name (varchar(255)) name of the assignee
* assignor_name (varchar(255)) name of the assignor
* patent_number (varchar(255)) identifying number for the patent

===DOCUMENT_INFO===
Purpose: store extra information relevant to the patents represented in ASSIGNMENT

Each entry represents: information about the documentation and the patent for a particular entry in ASSIGNMENT

Primary key: rf_id

All columns (variables) when created:
* rf_id (bigint) reel frame number
* app_num (varchar(255)) application document USPTO number
* app_date (date) date that the application was filed
* app_country (varhar(255)) country in which it was filed
* pgpub_num (varchar(255)) pre-grant publication document USPTO number
* pgpub_date (date) date that pre-grant publication was released
* pgpub_country (varchar(255)) country in which the patent is published before being granted a patent number
* patent_number (varchar(255)) granted patent document USPTO numbe
* grant_date (date) date that the patent is officially published
* grant_country (varchar(255)) country in which the patent is published after being granted a patent number
* invention_title (varchar(255)) title of the invention
* language (varchar(255)) languageof invention title – could be potentially useful and interesting to investigate
* reel_num (bigint) number of the reel the assignment was stored on
* frame_num (bigint) number of the frame on the reel the assignment was stored on

===ASSIGNMENT===
Purpose: represent assignment transactions – central table for the database

Each entry represents: an individual patent assignment transaction

Primary key: rf_id

All columns (variables):
* rf_id (bigint) reel fram number
* assignment_id (bigint) US patent assignment id
* correspondent_name (varchar(255)) name of correspondent for the transaction, could be a lawyer
* correspondent_address1 (varchar(255)) first line of address for correspondent
* correspondent_address2 (varchar(255)) second line of address for correspondent
* correspondent_address3 (varchar(255)) third line of address for correspondent
* correspondent_address4 (varchar(255)) fourth line of address for correspondent
* record_date (date) date recorded with USPTO)
* last_update_date (date) date that information for this assignment was last update with the USPTO
* page_cnt (bigint) page count of assignment record)

More extensive notes exist under :E/McNair/Projects/Redesigning Patent Database/New Patent Database Project/Notes on USPTO Assignment Data Paper

==Patent Database Structure==

Most of this design is based off what I investigated last semester and therefore is subject to change, since the design for the Assignment database is somewhat new.

The foreign key that connects every table is patent_number, which uniquely identifies every patent in the PATENT table and is included in every table.

Full list of tables: patent, assignee, citation, fee, claims, histpatent, lawyers, inventors

Update: These tables, as of October 26, are being updated and changed as I determine what data can actually be retrieved from the XML files based on the xpaths that we have determined (you can see these xpaths on the "Equivalent XPath and APS Queries" project page.

===PATENT===
Purpose: represent patents – central table for the database

Each entry represents: an individual patent

Primary key: patent_number

All columns (variables):
* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* patent_type (varchar(255)) the type of patent, like utility
* patent_kind (varchar(2)) the kind of patent, a letter and a number
* title (varchar(255)) the title of the patent
* grantdate (date) date that patent was officially granted including the year
* prioritydate (date) date used to established novelty of an invention in regards to other inventions (source: http://www.bios.net/daisy/patentlens/2343.html)
* prioritycountry (varchar(255)) country where patent is first filed
* patent_country (varchar(255)) country in which this patent was published
* prioritypatentnumber (varchar(255)) placeholder patent number if the patent is published before it is fully approved - also referred to as priority claims number
* cpcsubclass (varchar(255)) cooperative patent classification subclass
* cpcsubgroup (varchar(255)) cooperative patent classification subgroup
* cpcmaingroup (varchar(255)) cooperative patent classification main group
* cpctotal (varchar(255)) concatenated version of cpcmaingroup, cpcsubgroup, cpcsubclass
* pctpatentnumber (varchar(255)) international patent number (according to Patent Cooperation Treaty) also known as PCT Document Number
* appnum(varchar(255)) application number, probably also the filing number - the format is a two digit series code followed by a six digit serial number (source: https://www.uspto.gov/patents-application-process/filing-online/info-application-number)
* appdate (date) date that the application was filed, including year - probably also the filing date
* ipcrsubclass (varchar(255)) International patent classification subclass
* ipcrmaingroup (varchar(255)) International patent classification main group
* ipcrsubgroup (varchar(255)) International patent classification subgroup
* ipcrtotal (varchar(255)) concatenated version ipcrsubgroup, ipcrsubclass, ipcrmaincroup
* national_classification (varchar(255)) - probably uspc, which is United States Patent Classification
* natioanl_classification_country (varchar(255)) - should be United States
* num_claims (int) number of claims (things that the inventor wishes to protect with this patent)

Discrepancies:
These fields were previously a a part of the table, but we do not currently have xpaths for them:
app_year - the year the application was filed
grant_year - the year the patent was granted
uspc - United States Patent Classification
(might be the same as national_classification)
uspcsub - United States Patent subgroup

===ASSIGNEE===
Purpose: Represent the person the patent was currently assigned to at publication according to the patent information (the information in the Assignment database is more complete)

Each entry represents: the assignee for a patent in PATENT

Primary key: patent_number, first_name, last_name (in case it is possible for there to be multiple assignees for an individual patent)

All columns (variables):
* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
*first_name (varchar(255)) first name of assignee
*last name (varchar(255)) last name of assignee
* address (varchar(255)) address of assignee
* postcode (varchar(255)) postcode of assignee
* orgname (varchar(255)) name of the organization owning the patent (if applicable - organization may go by multiple orgnames)
* city (varchar(255)) city of the assignee
* country (varchar(255)) country of the assignee
* state (varchar(255)) state of assignee
* residence (varchar(255)) mailing address of assignee if they receive mail at a different address than the one listed above
* size_of_firm (varchar(255)) size of the organization, large, small, or micro - relates to how much the fee they pay for the patent is

===CITATION===
Purpose: Represent all the patents in PATENT that another patent in PATENT cites

Each entry represents: one citation that a patent in PATENT makes

Primary key: patent_number AND cited_patent_number (because a patent might cite multiple patents, so both variables are needed for the primary key)

All columns (variables):
* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* cited_patent_number (varchar(255)) unique identifier from the UPSTO office for the patent being cited
* description (varchar(255)) description of the citation
* citation_date (date) unclear what this date is supposed to represent from the DTD, but according to the paper that is the second link in the top section, it is the "date of cited document"
* citation_kind (varchar(255)) type of cited document
* citation_county (varchaar(255)) country of origin for the cited patent
* citation_name (varchar(255)) appears to be the names to whom the cited patent are assigned to

Note: There are other fields available for this table, such as document number, but this seems like it might be a repeat of the document number for the citing patent

===FEE===
Purpose: Represent information about the fees paid for an individual patent

Each entry represents: information about the fees paid on the patent

Primary key: patent_number

All columns:
* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* four (boolean) payment of maintenance fee 4th year
* eight (boolean) payment of maintenance fee 8th year
* twelve (boolean) payment of maintenance fee 12th year
* fee_date (date) date that fee is paid on
* fee_code (varchar(255)) code for type of fee to pay

===CLAIMS===

Purpose: Represent information about the claims for an individual patent

Each entry represents: information about a claim that a patent makes

Primary key: patent_number and claim_id

All columns:
* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* claim_id (varchar(255)) id number for the claim
* language (varchar(255)) language of claim
* claim_type (varchar(255)) type of claim
* status (varchar(255)) status of the claim
* claim_text (text) the text of the claim

===HISTPATENT===
Purpose: Represent historical data about the patent pertaining to dates and publication

Each entry represents: sequence of historical data about an individual patent in PATENT

Primary key: patent_number

* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* pubno (varchar(255)) patent publication number
* pubdate (date) date patent was issued
* dispdate (date) disposal date for the application
* distype (varchar(3)) application status of the patent - ABN for patent term adjustment, ISS for issued, or PEN for pending
* exp_dt (date) expiration date of patent
* pta (int) length of patent term adjustment (which extends the amount of time the patent is allowed to be in force) in number of days

===INVENTORS===
Purpose: Represent the inventor for each of patents in the PATENT table

Each entry represents: information about an inventor for a specific patent

Primary Key: patent_number (and name, if there can be multiple inventors for a single patent)

* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* sequence (varchar(255)) appears to be a sequence of numbers that uniquely identify the inventor?
* name (varchar(255)) concatenated string of the first and last name of the inventor
* organme (varchar(255)) name of the organization the inventor works for
* state (varchar(255)) - state in which the inventor resided
* address (varchar(255)) - address at which the inventor resided
* country (varchar(255)) country in which the inventor resided
* city (varchar(255)) city in which the inventor resided

===LAWYERS===
Purpose: Represent the lawyers who worked on each of the patents in the PATENT table

Each entry represents: information about a lawyer for a specific patent

Primary Key: patent_number (there doesn't appear to be more than one lawyer entry for each patent)

* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* sequence (varchar(255)) appears to be a sequence of numbers that unique identify the lawyer?
* org_name (varchar(255)) the name of the firm that did arbitration for the patent

===Unique Attributes Tables===

There are several attributes that have been identified and are unique, applying to only one kind of patent. Instead of adding all these fields to the main patent table, which will introduce a lot of NULL entries because three out of four patents will not have any particular attribute. Therefore, I have created three new tables to store these unique attributes - one for each type of patent with unique attributes.

For each table, the primary key is the patent_number, which connects a row to a row in the main patent table.

TODO: Figure out best way to generate these tables.

* Note that I did include some unique attributes for Utility patents as seen below in the "Finding New Paths Unique to Plant, Reissue, and/or Design Patents" section, but for now (11/3/2017) I'm not going to create a separate table for those fields. Most of them seem to be repeated in Reissue, and the other fields, while interesting, are perhaps not very useful.

====DESIGN====
* patent_number (varchar(255)) unique identifier for the patent from the UPSTO office, can contain letters hence varchar
* latin_name (varchar(255)) latin name for the plant
* us_botanical_variety(varchar(255)) denotes what variety of plant it is - for example, a rose has several different varieties

====REISSUE====

* parent_doc_status (varchar(255)) status of the parent document. It is unclear whether this refers to a parent reissue application (if there happen to be multiple applications relating to the reissue of this patent) or the patent for whom the reissue is being reissued. This goes for all fields beginning "parent_doc"
* parent_doc_number (int) number for the parent document
* parent_doc_id (varchar(255)) it is unclear how this is different from parent document number. In the xpaths, the path will be something like ./parent_doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.
* parent_doc_kind (varchar(255)) this may actually denote what the "parent document" is - I will need to look in an actually XML file to see what these field looks like
* parent_doc_country (varchar(255)) the country of the parent document
* parent_doc_date (date) probably the date the parent document was published

In addition to parent document, there is a subcategory that has the same fields (minus status) called parent_grant_document. This is probably a patent that has been granted, given that the field names includes grant, but I am not sure. Again, I am not sure of what all of these fields reprent

* parent_grant_doc_number (int) number for the parent grant document
* parent_grant_doc_id (varchar(255)) it is unclear how this is different from parent grant document number. In the xpaths, the path will be something like ./parent_doc/parent-grant-doc/document-id/doc-number, so it's possible it's not actually different from the document number and rather just a broader category.
* parent_grant_doc_kind (varchar(255)) this may actually denote what the "parent grant document" is - I will need to look in an actually XML file to see what these field looks like
* parent_grant_doc_country (varchar(255)) the country of the parent grant document
* parent_grant_doc_date (date) probably the date the parent grant document was published

====PLANT====

==Connecting Patent database and Assignment database==
The answer to connecting the Patent database to the Assignment database lies somewhere in using the information in the table DOCUMENT_INFO to connect to a patent_id from the Patent database PATENT table for each assignment in the Assignment database ASSIGNMENT table. On further investigation into "The USPTO Patent Assignment Dataset: Descriptions and Analysis, the field they called "grant_doc_num" (grant_num above) in DOCUMENT_INFO is the patent number (the description of DOCUMENT_INFO above has been altered to reflect this.)

The paper mentions that there will errors in patent number stored in DOCUMENT_INFO under grant_doc_num so a separate table called DOCUMENT_INFO_ADMIN was constructed to determine how prevalent errors were. They queried the patent number from for the appno_doc_num from administrative data. The "appno_doc_num" is the Application Document USPTO number from DOCUMENT_INFO. On further investigation into the paper, it became clear that "administrative data" refers to internal USPTO data that was available to the authors of the paper, but that we might not have. However, in 99% of cases the grant_doc_num and the queried patent number based on the appno_doc_num were the same, so we can probably rely on grant_doc.

Therefore, since every table in the Patent database will have the patent number stored, to connect any table in Assignment to the Patent database, it would first be joined with DOCUMENT_INFO on rf_id and then joined with the appropriate table in Patent on patent number.

Also in the paper, they mentioned the Assignments on the Web for Patents (AOTW-P), a searchable database of individual USPTO assignment records keyed on reel-frame identification, patent number, and assignor or assignee name (https://assignment.uspto.gov/patent/index.html#/patent/search). Obviously it would not be possible to individually use this tool to query all the patent numbers, but if it would be possible to write a script to somehow query each patent number using the rf_id and parse the response, this could potentially be useful to check the patent numbers, but might not be any more accurate than what will already be in DOCUMENT_INFO.

==Finding New Paths Unique to Plant, Reissue, and/or Design Patents==

Based on Oliver's script which searched all xpaths and compared which were unique to particular types, we see that the following attributes are unique to each type of patents other than utility patents. These attributes vary by the XML version, which changed over time. Therefore, the lists below are a superset of the attributes that are unique to each of the patents types listed below across all XML versions.

When I am done finding the supersets, we will determine how to integrate these attributes into the design for the patent table.

'''There appears to be a fifth patent type in the results from Oliver's script, and there are attributes that are unique to Utility patents, though they do not appear particularly useful'''

===Plant Patents===
* US Claim Statement (XML45) - unclear what this represents
* Latin Name (XML44)
* US Botanical Variety (XML44)

===Reissue Patents===
* Parent Status (XML45)
* Parent Document (XML44) - seems to be either a normal document or a parent grant document (or both), both contain the following fields
** Date
** Document number
** Country
** Document ID
** Kind (XML42)
* Child Document (XML44)
** Document ID
** Document Number
** Country (XML43)
* Continuing Reissue (XML44)
** Document ID
** Relation
** Document Number

Note about fields related to the parent document and/or parent grant document:
The structure of the above fields related to a reissue patent is very odd and varies by XML version. For example, for most of the XML version, parent grant document falls under parent document and has it's own document number, date, country, document ID, kind, etc. The parent doc field falls under relation, which falls under reissue.

However, in XML41, parent document falls under relation and a new category - US Reexamination Reissue Merger. The information pertaining to the child documentation also fall under both relation and US Reexamination Reissue Merger. It's possible that during the time period this XML version represents, there were two types of reissue patents - some reexamination related to a merger and a standard reissue. With discrepancies like these, we'll have to determine in the end we want to stick with the fields as they are designed above (in which cases the information about a parent doc or parent grant document is stored the same whether it is a reissue or an Reexamination Reissue, or if these two classifications are different enough to warrant duplicate fields in the table.

===Design Patents===
* length of grant (XML45)
* Hague Agreement Data (XML45) - allows people to file design patents in 66 countries with one application
** International Registration Date
** International Filing Data
** International Registration Publication Date
** International Registration Number
* Classification Locarno (XML40)
** Edition
** Main Classification

Please note that the lists above are meant to represent unique types that should be added to the patent database. This does not mean that every XML version contains all these fields, or that every version contains the same path to these fields. That will have to be determined separately. The XML number next to some of the items is meant to represent the latest XML version where the field was seen.

Thought I was mainly looking at unique fields in plant, reissue, and design patents, I noticed the following field on unique paths for utility patents in XML41

* US Related Documents
** Substitution
*** Relation
**** Parent Document
***** Docuemnt ID
***** Document Number
***** Country
***** Date
***** Kind
***** Parent Status
**** Child Document
***** Document Number
***** Document ID
***** Country
** Continuation/Continuation in Part
*** Relation
**** Child Doc
***** Date
***** Kind
** Division
*** Relation
**** Child Document
***** Kind

This appears to resemble the information about reissue patents. Additionally, a couple of fields appear under the information about bibliographic data that might be useful:
* US Deceased Inventor (XML41)
** Post Code
** Address
* US Provisional Application Status (XML41)