edegan.com - User contributions [en]

Web Server Documentation

2016-11-07T22:05:46Z

RavaliKruthiventi: /* = Semantic Mediawiki Extensions */

[[Category: McNair Admin]]

=Old Notes (from Alex Jiang)=

== Installing Ubuntu aka Trying RAID 10 (2/15/2016) ==

Some general configuration options:
* hostname: McNairWebServ
* user full name: McNair Center
* username: mcnair
* don't encrypt home directory
* manual partitioning (see below for configuration of RAID)
* no automatic updates
* software: LAMP stack

Sahil and I tried to configure RAID 10 using the software RAID option in the installer, which is documented [https://help.ubuntu.com/community/Installation/SoftwareRAID#Partitioning_the_disk here]. We put a 64 GB swap partition on each of the first two hard drives and gave the rest of the space on those drives to an ext4 partition. On each of the other two drives, we created a single ext4 partition. We set the bootable flag to "on" for all of the ext4 partitions. Then we chose to configure software RAID, created a new MD device, and chose RAID10 with 2 active devices and 2 spare devices. For the active devices, we chose the ext4 partitions on the first two hard drives, and for the spare devices, we chose the ext4 partitions on the other two. The installation then failed because the GRUB boot loader couldn't be installed: on disks with a GUID partition table (GPT), GRUB needs a small dedicated partition (1 MB is enough) for the bootloader.

So we started partitioning from scratch, but with only two hard drives for a RAID1 array. On the first drive, there are three partitions: one 1 MB partition reserved for the bootloader, one 64 GB swap partition, and the rest of the drive as an ext4 partition for the filesystem. On the second drive, there are two partitions: one 1 MB partition reserved for the bootloader and the rest of the drive as an ext4 partition for the filesystem. Then we made two software RAID devices, each with 2 active devices and 0 spare devices. The first RAID device had both of the bootloader partitions as the active devices, and the second RAID device had both of the ext4 filesystem partitions as the active devices. Then we set the first RAID device to "use as ext4" with mount point "/boot" and the second RAID device to "use as ext4" with mount point "/" and continued with the installation. This time, it failed to install the kernel.

I guessed that, because the 1 MB RAID device was made first, the kernel tried to install itself to that device and failed. So I went back to the partitioner, set the first RAID device to "do not use", and tried the installation process again. It warned me a couple of times that the old filesystem would be overwritten, but I continued the installation regardless. But then the GRUB boot loader installation failed, even when we tried installing it to "/dev/md0" or "/dev/md0_raid1" instead of the master boot record (MBR).

== Configuring RAID 1 on Web Server (2/17/2016) ==

The first RAID device (/dev/md0) we set to use as an ext4 filesystem and mounted /boot to it, and the second RAID device (/dev/md127) we set to use as an ext4 filesystem and mounted / to it (we tried this before, but it failed to install the kernel). This time, it failed to install the bootloader, but it never prompted me to choose where to install the bootloader (usually it asks whether you'd like to install the bootloader to the master boot record).

'''Second partitioning attempt:'''

First hard disk (/dev/sda):
* 10 MB partition, use as reserved BIOS boot area, bootable flag off
* 64 GB partition, use as swap space
* rest of the space partition, use as ext4 filesystem, mount point /, bootable flag off

Second hard disk (/dev/sdb):
* 10 MB partition, use as reserved BIOS boot area, bootable flag off
* rest of the space partition, use as ext4 filesystem, mount point /, bootable flag off

Write partition changes to disk and then start configuring software RAID:

* First RAID device (/dev/md0): RAID1, 2 active devices (/dev/sda3 and /dev/sdb2), 0 spare devices
* Second RAID device (/dev/md1): RAID1, 2 active devices (/dev/sda1 and /dev/sdb1), 0 spare devices
* first RAID device partition: use as ext4 filesystem, mount point /
* second RAID device partition: use as ext4 filesystem, mount point /boot, format data on the partition

Failed to install GRUB bootloader on a hard disk (again).

Next attempt:
* First RAID device (/dev/md0): use as ext4 filesystem, mount point /, format data on the partition
* Second RAID device (/dev/md1): erase data on partition, use as "do not use"

Next attempt:
Redo the RAID devices so that the first device (/dev/md0): RAID1, 2 active devices (/dev/sda1 and /dev/sdb1), 0 spare devices, and the second RAID device (/dev/md1): RAID1, 2 active devices (/dev/sda3 and /dev/sdb2), 0 spare devices. Then configure the RAID devices:

* first RAID device partition: use as ext4 filesystem, mount point /boot, format data on the partition
* second RAID device partition: use as ext4 filesystem, mount point /, format data on the partition

New idea: ditch the idea of RAID on the boot partitions (we'll put the bootloader on one of the boot partitions and then we can try to set up RAID once we've got the thing booting into Linux), so leave the partitions as above ("Second Partitioning Attempt"). Only make one software RAID device (/dev/md0): RAID1, 2 active devices (/dev/sda3 and /dev/sdb2), 0 spare devices. Then configure the first RAID device partition: use as ext4 filesystem, mount point /, format data on the partition.

'''Third partitioning attempt:'''

First hard disk (/dev/sda):
* 10 MB partition, use as reserved BIOS boot area, bootable flag off
* 32 GB partition, use as swap space
* rest of the space partition, use as ext4 filesystem, mount point /, bootable flag off

Second hard disk (/dev/sdb):
* 10 MB partition, use as reserved BIOS boot area, bootable flag off
* 32 GB partition, use as swap space
* rest of the space partition, use as ext4 filesystem, mount point /, bootable flag off

One RAID device (/dev/md0): RAID1, 2 active devices (/dev/sda3 and /dev/sdb3), 0 spare devices. set partition: use as ext4 filesystem, mount point /

'''Fourth partitioning attempt:'''

First hard disk (/dev/sda):
* 10 MB partition, use as reserved BIOS boot area, bootable flag off
* 32 GB partition, use as swap space
* rest of the space partition, use as ext4 filesystem, mount point /, bootable flag off

Second hard disk (/dev/sdb):
* 10 MB partition, use as reserved BIOS boot area, bootable flag off
* 32 GB partition, use as swap space
* rest of the space partition, use as ext4 filesystem, mount point /, bootable flag off

First RAID device (/dev/md0): RAID1, 2 active devices (/dev/sda3 and /dev/sdb3), 0 spare devices. set partition: use as ext4 filesystem, mount point /

Second RAID device (/dev/md1): RAID1, 2 active devices (/dev/sda1 and /dev/sdb1), 0 spare devices. set partition: use as ext4 filesystem, mount point /boot

Third RAID device (/dev/md2): RAID0, 2 active devices (/dev/sda2 and /dev/sdb2). set partition: use as swap area

'''Fifth partitioning attempt (made sure all software RAID devices are removed, delete all partitions, create new partition tables):'''

First hard disk (/dev/sda):
* 10 MB partition, use as reserved BIOS boot area, bootable flag off
* 32 GB partition, use as swap space
* rest of the space partition, use as ext4 filesystem, mount point /, bootable flag off

Second hard disk (/dev/sdb):
* 10 MB partition, use as reserved BIOS boot area, bootable flag off
* 32 GB partition, use as swap space
* rest of the space partition, use as ext4 filesystem, mount point /, bootable flag off

First RAID device (/dev/md0): RAID1, 2 active devices (/dev/sda3 and /dev/sdb3), 0 spare devices. set partition: use as ext4 filesystem, mount point /

Install the GRUB bootloader to both /dev/sda and /dev/sdb. It works!
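Once the machine boots, the state of the mirror can be read from /proc/mdstat (or in more detail with <code>sudo mdadm --detail /dev/md0</code>). The function below is a minimal sketch for spotting a degraded array from that file; the name <code>check_mdstat</code> is made up for illustration and was not part of the original setup.

```shell
#!/bin/sh
# check_mdstat [FILE]  (hypothetical helper, not part of the original notes)
# Reads mdstat-formatted text (default: /proc/mdstat) and prints one line per
# array whose member-status field, e.g. [U_], shows a missing member.
# Exit status: 0 if every array is complete, 1 if any array is degraded.
check_mdstat() {
    awk '
        /^md[0-9]+ :/ { name = $1 }       # remember the current array name
        $NF ~ /^\[[U_]+\]$/ {             # status line ends in [UU], [U_], ...
            if ($NF ~ /_/) { print name " degraded " $NF; bad = 1 }
        }
        END { exit bad + 0 }
    ' "${1:-/proc/mdstat}"
}
```

Running <code>check_mdstat</code> with no argument on the server reads the live /proc/mdstat; a healthy two-disk RAID1 shows [UU] and the function prints nothing.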

== Network Configuration (2/22/2016) ==

As with the [[Test_Web_Server_Documentation|test web server]], network configuration can be annoying. First, I had to figure out the right LAN port on the mobo by plugging the RJ45 cable in and waiting for the LED to light up (it took about 5 seconds and a couple of tries). Then I went to the terminal to check on the network interfaces:

$ ifconfig
$ ifconfig -a
$ sudo ifconfig eth0 up
$ cat /etc/network/interfaces

After bringing up the eth0 interface (it's down if it's not listed in the output of ifconfig), I then modified /etc/network/interfaces to set up the eth0 interface:

$ sudo vi /etc/network/interfaces

And added these lines:

auto eth0
iface eth0 inet dhcp
dns-nameservers 8.8.8.8 8.8.4.4

Then I used ifdown/ifup to reconfigure the interface:

$ sudo ifdown eth0
$ sudo ifup eth0

There are a couple of commands and configuration files that you can check to make sure that the network was configured correctly (I compared the output to the corresponding output on the test web server):

$ hostname -I
$ cat /etc/resolv.conf
$ cat /etc/hosts
$ cat /var/lib/dhcp/dhclient.eth0.leases

Then I checked if it was connected to the internet:

$ ping google.com
$ sudo apt-get update

I got a "GPG error: http://security.ubuntu.com trusty-security InRelease: Clearsigned file isn't valid, got 'NODATA' (does the network require authentication?)" message on the apt-get update a couple of times, so I tried sudo ifdown eth0 and sudo ifup eth0 a couple of times. Then I rebooted the machine and tried to update the package manager again, and it still didn't work.

These results seem familiar; I think I had the same error when I tried to connect the test web server to the internet before Ed filed the ticket with the IT help desk, which suggests that we may have given the wrong MAC address or IT messed up the configuration. Still, I checked all of the configuration files. I only noted a couple of differences between the test web server network interface and this web server network interface:

# The IP addresses are different. The test web server has an address that starts with 128, but this webserver has an address that starts with 10. (Ed thinks this is a sign that this webserver's IP address limits it to the Rice network).
# The subnet masks are different. The test web server has a subnet mask that ends in 240, but this webserver has a mask that ends in 0.
# The test webserver has a DNS domain name (i.e. the output of hostname -d) of attlocal.net. This webserver doesn't have one. I tried adding it (by editing /etc/hosts), but that change alone didn't help.
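On the first difference: an address starting with 10. falls in the private range 10.0.0.0/8 (RFC 1918), which is not routable on the public internet, so Ed's reading is plausible. A minimal sketch for classifying an address follows; <code>is_private_ipv4</code> is a made-up helper name, and it does not validate that its argument is a well-formed address.

```shell
#!/bin/sh
# is_private_ipv4 ADDR  (hypothetical helper, not part of the original notes)
# Returns 0 if ADDR starts like an RFC 1918 private address
# (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16), 1 otherwise.
# It only matches prefixes; it does not check that ADDR is well formed.
is_private_ipv4() {
    case $1 in
        10.*|192.168.*|172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, <code>is_private_ipv4 "$(hostname -I | cut -d' ' -f1)" && echo private</code> checks the first address reported for the machine.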

Interesting side note: going into the mobo BIOS menu, under "Server Mgmt" there is a submenu "BMC network configuration" that shows the MAC address for "DM_LAN1" as ending in de, whereas the MAC address for eth0 ends in dc (otherwise, the two MAC addresses are the same). So maybe the mobo is interfering with the MAC address? But changing DM_LAN1's Config Address source from "Previous State" to "DynamicBmcDhcp" doesn't fix the problem (and upon reboot, it switches back to Previous State).

Turns out IT just configured the network IP addresses incorrectly. Ed and I talked to the IT desk on Tuesday and we got new IP addresses.

== Installing Software (2/24/2016) ==

Now that we have an internet connection, we can start getting packages:

$ sudo apt-get update
$ sudo apt-get upgrade

Since I didn't install the SSH server in the beginning, I'll go ahead and install the openssh-server package now:

$ sudo apt-get install openssh-server

Back up the SSH server config file:

$ sudo cp /etc/ssh/sshd_config /etc/ssh/sshd_config.original

== Installing Mediawiki (3/7/2016) ==

As with the [[Test Web Server Documentation#Installing Mediawiki (1/4/16)|test web server]], I followed the steps from [http://www.mediawiki.org/wiki/Manual:Running_MediaWiki_on_Ubuntu this page] on installing Mediawiki.

The stable version of Mediawiki (1.26.2) isn't available through apt-get, so make a download directory and fetch the official tarball:

$ mkdir ~/Downloads
$ cd ~/Downloads
$ wget https://releases.wikimedia.org/mediawiki/1.26/mediawiki-1.26.2.tar.gz
$ tar -xvzf mediawiki-1.26.2.tar.gz

Move the extracted files to /var/lib/mediawiki:

$ sudo mkdir /var/lib/mediawiki
$ sudo mv mediawiki-1.26.2/* /var/lib/mediawiki

Then set up the mediawiki directory:

$ cd /var/www/html
$ sudo ln -s /var/lib/mediawiki mediawiki

Now point a browser to http://[ip_address]/mediawiki/mw-config/index.php and configure the Mediawiki site as follows:

Choose both "your language" and the "wiki language" to be English and continue to the next page. Make sure that all of the environmental checks pass before continuing to the next page. Leave the "database host" as localhost and change "database name" to mcnair. Leave "database table prefix" empty and "database username" as root. Set the "database password" to whatever the password for the MySQL user was set as during installation and then continue to the next page. Check the box for "Use this account for installation" and choose InnoDB for "Storage Engine" and choose Binary for "Database character set" and continue to the next page. Set the name of the wiki as McNair Center and let the project namespace be the same as the wiki name. For the administrator account, set the username, password, and email. Choose to subscribe to the release announcements mailing list if you provide an email, and choose to answer more questions.

Choose "open wiki" for the user rights profile. Choose "no license footer". Uncheck the box for "enable outbound email" and choose which skin you'd like to use. For extensions, leave them all unchecked. Leave "enable file uploads" unchecked. Don't change the Logo URL and don't check "enable Instant Commons". For caching, choose "no caching".

Copy the downloaded LocalSettings.php configuration file onto the webserver in the root directory of the mediawiki installation: /var/lib/mediawiki. Then point a browser to http://[ip_address]/mediawiki and see your new site!

== Short URLs (3/7/2016) ==

Same as for the [[Test Web Server Documentation#Short URLs (1/27/16)|test web server]].

== Labeled Section Transclusion (3/7/2016) ==

Same as for the [[Test Web Server Documentation#Labeled Section Transclusion (1/25/16)|test web server]].

== Responsive Design (3/7/2016) ==

Same as for the [[Test Web Server Documentation#Responsive Design (1/25/16)|test web server]].

== Mediawiki CSS changes (3/9/2016) ==

Started working with Julia on the mediawiki website CSS design (color scheme and typography on [[Website Design]]). Ran into a couple of problems:

* If you upload a file to Slack and want to download it from its URL using wget on the command line, make sure you get a public link from the person who uploaded the file, otherwise the file won't be downloaded. (I was trying to figure out why the McNair logo that Julia sent me on Slack wasn't showing up on the website, but it turns out I just needed a public link to the file, which should look something like https://files.slack.com/files-pri/T0JA2A9Q9-F0RL0G4BZ/mcnair.png?pub_secret=30505f5d02).
* The @font-face rule doesn't seem to work in Common.css... I never got past this problem. I think the .ttf file for the font may have failed to download onto the server properly, but I haven't found a good way to test for that case. Also, I tried using an absolute URL (i.e. http://128.42.44.180/mediawiki/resources/assets/fonts/franklin-gothic-book.ttf) when specifying the @font-face rule, but it doesn't seem to help. Using the Slack file's public URL (i.e. https://files.slack.com/files-pri/T0JA2A9Q9-F0RLDB3G8/download/franklin-gothic-book.ttf?pub_secret=327cdaaeb8) doesn't seem to work either.

Well, I don't really trust the file to download onto the webserver properly from terminal, so I got an SFTP client and used that to copy the .ttf file onto the webserver. Still no dice.

== Setting up users (3/11/2016) ==

First, getting the ImportUsers extension for bulk account creation (using a CSV). Downloading the extension is as follows:

$ cd ~/Downloads
$ wget https://extdist.wmflabs.org/dist/extensions/ImportUsers-REL1_26-0fe9e22.tar.gz
$ tar -xzvf ImportUsers-REL1_26-0fe9e22.tar.gz
$ cd /var/lib/mediawiki/extensions
$ cp -r ~/Downloads/ImportUsers ./ImportUsers

Then edit LocalSettings.php and add this line:

require_once("$IP/extensions/ImportUsers/ImportUsers.php");

Then we just have to make a CSV with columns for username, password, email, real name, and user groups (optional). More info on the [https://www.mediawiki.org/wiki/Extension:ImportUsers extension documentation page].

I made a small CSV to test the ImportUsers extension:

user1,pass1,user1@example.com,Dummy One
user2,pass2,user2@example.com,Dummy Two
user3,pass3,user3@example.com,Dummy Three
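Before uploading a larger file, it may be worth checking that every row has the expected columns. This is a minimal sketch, assuming the column order described above and no quoted fields containing commas; <code>check_users_csv</code> is a made-up name, not something the extension provides.

```shell
#!/bin/sh
# check_users_csv FILE  (hypothetical helper, not part of ImportUsers)
# Prints every row that lacks four comma-separated fields, has an empty
# username or password, or has no "@" in the email column.
# Exit status: 0 if all rows pass, 1 otherwise.
# Assumption: fields are never quoted and never contain commas.
check_users_csv() {
    awk -F, '
        NF < 4 || $1 == "" || $2 == "" || $3 !~ /@/ {
            print "line " NR ": " $0; bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

The three-row test file above passes silently; a truncated row such as <code>user4,pass4</code> is reported with its line number.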

After importing the users, run a maintenance script from the command line to update new user statistics:

$ cd /var/lib/mediawiki/maintenance
$ php initSiteStats.php

But this runs into some errors ([https://www.mediawiki.org/wiki/Manual:Maintenance_scripts this page] suggests setting the MW_INSTALL_PATH environment variable, but I can't find a good way to do that). I looked into the error messages and found [http://stackoverflow.com/questions/21257589/ubuntu-typing-php-in-terminal-shows-a-lot-of-errors this SO post] which seems to cover it. I don't know whether we need SNMP, so I decided to just install it to be safe:

$ sudo apt-get install snmp

And the error messages go away. Alternatively, you can disable the snmp module for PHP with:

$ sudo php5dismod snmp
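On the MW_INSTALL_PATH suggestion: one way to set it is per invocation, as an environment assignment in front of the php command. The wrapper below is a sketch under that assumption; <code>run_maintenance</code>, <code>MW_ROOT</code>, and <code>PHP_BIN</code> are names invented here for illustration. As far as I know, the maintenance scripts find the installation on their own when run from a standard layout, so the variable mostly matters for non-standard ones, and the errors above were unrelated to it.

```shell
#!/bin/sh
# run_maintenance SCRIPT [ARGS...]  (hypothetical wrapper)
# Runs a MediaWiki maintenance script with MW_INSTALL_PATH set for that one
# command only. MW_ROOT and PHP_BIN are illustrative overrides; they default
# to the install location used on this server and to "php".
run_maintenance() {
    mw_root=${MW_ROOT:-/var/lib/mediawiki}
    script=$1
    shift
    MW_INSTALL_PATH="$mw_root" "${PHP_BIN:-php}" "$mw_root/maintenance/$script" "$@"
}
```

With this defined, <code>run_maintenance initSiteStats.php</code> is equivalent to the two commands above, regardless of the current directory.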

We also want to limit account creation to sysops only [https://www.mediawiki.org/wiki/Manual:Preventing_access#Restrict_account_creation as done here]. To do this, edit LocalSettings.php and add these lines:

# Prevent new user registrations except by sysops
$wgGroupPermissions['*']['createaccount'] = false;

== BibTex citations with BibManager (3/11/2016) ==

The [https://www.mediawiki.org/wiki/Extension:BibManager BibManager extension] isn't actively maintained, but it doesn't seem like it needs constant updates to accommodate new features, and it was last updated for Mediawiki version 1.22, which isn't too bad.

Let's test on the test web server first.

== Bibtex citations with Bibtex (3/14/2016) ==

The [https://www.mediawiki.org/wiki/Extension:Bibtex Bibtex extension] doesn't look like it's being actively maintained, but it might work. I'm testing it on the test web server alongside BibManager.

== Ghost vs. WordPress (3/14/2016) ==

So it looks like we may choose Ghost over WordPress. We need something self-hostable, and ideally open-source (and both Ghost and WP satisfy those two conditions). However, I hear Ghost is more lightweight, so if we're not looking for a lot of extra functionality from third-party plugins, Ghost may be the better choice. I'm setting up Ghost on the [[Test Web Server Documentation#Installing Ghost (3/14/2016)|test web server]], so we'll see how it goes...

Turns out Ghost+Apache is kinda difficult (definitely more difficult than WordPress+Apache), so let's just try WordPress.

The [[Test Web Server Documentation#Installing WordPress (3/14/2016)|test web server]] had a pretty easy time installing WordPress alongside the existing mediawiki site, so it seems that we'll use WP for the blog on this web server as well.

== Infoboxes (3/16/2016) ==

I decided to follow the instructions on [http://trog.qgl.org/20140923/setting-up-infobox-templates-in-mediawiki-v1-23/ this post]. Let's see how it goes.

Step 1:

Download and install the [https://www.mediawiki.org/wiki/Extension:Scribunto Scribunto extension].

$ cd ~/Downloads
$ wget https://extdist.wmflabs.org/dist/extensions/Scribunto-REL1_26-9fd4e64.tar.gz
$ tar -xzvf Scribunto-REL1_26-9fd4e64.tar.gz
$ cd /var/lib/mediawiki/extensions
$ cp -r ~/Downloads/Scribunto ./Scribunto

Add these two lines to LocalSettings.php:

require_once("$IP/extensions/Scribunto/Scribunto.php");
$wgScribuntoDefaultEngine = 'luastandalone';

And set execute permissions for Lua binaries in the extension:

$ chmod a+x /var/lib/mediawiki/extensions/Scribunto/engines/LuaStandalone/binaries/lua_5_1_5_linux_64_generic/lua

In addition, use a phpinfo page to check that the PCRE version is at least 8.10 (preferably at least 8.33), that PHP's mbstring extension is enabled, and that PHP's proc_open function is not disabled.
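The same three checks can be roughed out from the command line. This is only a sketch: the function names are made up, <code>sort -V</code> requires GNU coreutils, and command-line PHP may load a different php.ini than Apache does, so the phpinfo page remains the authoritative check for the web server.

```shell
#!/bin/sh
# version_ge A B  (hypothetical helper)
# True when version A >= version B under version ordering (GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# check_scribunto_prereqs  (hypothetical helper)
# Reports the PCRE version, mbstring, and proc_open status of command-line
# PHP. Assumption: the php binary on PATH is the one of interest.
check_scribunto_prereqs() {
    # PCRE_VERSION looks like "8.31 2012-07-06"; keep only the version number.
    pcre=$(php -r 'echo PCRE_VERSION;' | cut -d' ' -f1)
    if version_ge "$pcre" 8.10; then
        echo "PCRE $pcre: ok"
    else
        echo "PCRE $pcre: too old"
    fi
    if php -m | grep -qix 'mbstring'; then
        echo "mbstring: loaded"
    else
        echo "mbstring: missing"
    fi
    # disable_functions is a comma-separated list; strip spaces before matching.
    case ",$(php -r 'echo ini_get("disable_functions");' | tr -d ' ')," in
        *,proc_open,*) echo "proc_open: disabled" ;;
        *) echo "proc_open: available" ;;
    esac
}
```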

Step 2:

Copy Wikipedia's [https://en.wikipedia.org/w/index.php?title=MediaWiki:Common.css&action=edit Common.css] stylesheet into the wiki's Common.css stylesheet.

Step 3:

Export the Infobox template from Wikipedia from the [https://en.wikipedia.org/wiki/Special:Export Special:Export] page. In the "add pages manually" text box, type Template:Infobox and then check all three checkboxes below: "Include only the current revision, not the full history", "Include templates", and "Save as file", then click the Export button and save the XML file.

Step 4:

Import that XML file onto the wiki using the Special:Import page. Choose the "Import to default locations" option.

Step 5:

Test your Infobox template by creating a new page on the mediawiki and using the Infobox template. I used the following code to test:

<nowiki>
{{Infobox
|title = An amazing Infobox
|header1 = It works!
|label2 = Configured by
|data2 = trog
|label3 = Web
|data3 = http://trog.qgl.org/20140923/setting-up-infobox-templates-in-mediawiki-v1-23/
}}</nowiki>

Debugging:

I seem to have the template functionality working, but it's not styled properly. So let's try exporting and importing Wikipedia's Common.css stylesheet instead of just copying and pasting. And let's also try exporting and importing Wikipedia's Common.js script into the wiki.

Wait, I fixed it by just removing the custom CSS code that I had from trying to change the font-face. If those two things conflict, we may have issues down the line...

I also uncovered something about HTMLTidy that may impact how well templates from Wikipedia run on our mediawiki [https://www.mediawiki.org/wiki/Manual:Using_content_from_Wikipedia#HTMLTidy]. It looks like we can either [https://www.mediawiki.org/wiki/Manual:$wgTidyConfig set an option] in LocalSettings.php to enable HTMLTidy or we can [https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Transwiki get the templates from another source].

== Installing WordPress (3/16/2016) ==

Same as the [[Test Web Server Documentation#Installing WordPress (3/14/2016)|test web server]]

== Google Analytics for Mediawiki and WordPress (3/16/2016) ==

There's an [https://www.mediawiki.org/wiki/Extension:Google_Analytics_Integration extension] for Google Analytics integration on Mediawiki, and it seems to have pretty robust support (you can exclude specific pages or categories from analytics, and you can exclude user groups from analytics too).

There's an open-source alternative to google analytics called [http://www.openwebanalytics.com/ Open Web Analytics], and there's [https://www.mediawiki.org/wiki/Extension:Open_Web_Analytics a Mediawiki extension] for that too. Looks like Open Web Analytics has some cool extra features too like click heatmaps...

WordPress appears to have support for both Google Analytics and Open Web Analytics.

After looking around for other open-source alternatives, it appears Piwik is another strong contender. There's a demo of Piwik [http://demo.piwik.org/ here] and a demo of OWA [http://demo.openwebanalytics.com/owa/ here]. There's [https://www.mediawiki.org/wiki/Extension:Piwik_Integration a Mediawiki extension] for Piwik integration, and it seems to be pretty well maintained. WordPress also appears to support Piwik.

== Open-source Analytics Alternatives (3/21/2016) ==

Might as well try to keep everything open-source. I'll try out Open Web Analytics (OWA) on the test web server to play around with the interface.

OWA isn't going to work, as noted on the [[Test Web Server Documentation#Installing Open Web Analytics (3/21/2016)|test web server page]]. So let's try the [https://www.mediawiki.org/wiki/Extension:Piwik_Integration extension] for Piwik too.

So at least Piwik works. But here's the counterargument: in five years, which is more likely to be well-supported and maintained, Piwik or Google Analytics? And with the obvious answer being Google Analytics, we should just use that.

== Back to Google Analytics (3/23/2016) ==

We made a new Google Analytics account! admin@mcnaircenter.org 9million

I'm going to go ahead and test the Google Analytics integration extension on the [[Test Web Server Documentation#Installing Google Analytics (3/23/2016)|test web server]].

== Cargo vs Semantic Mediawiki? (3/25/2016) ==

I recently learned about Cargo, which claims to be a more straightforward version of SMW. See the [https://www.semantic-mediawiki.org/w/images/9/9a/Cargo_and_the_future_of_SMW.pdf slides] of a presentation given at the spring 2015 SMWCon, and the Cargo extension's [https://www.mediawiki.org/wiki/Extension:Cargo/Cargo_and_Semantic_MediaWiki comparison] page. The lead author of the extension, Yaron, is a member of the SMW community, so Cargo is likely pretty legit. Now I'm not sure which is better...

After some more deliberation, I think Cargo wins. Cargo's querying syntax is more like SQL (which is actually useful and pretty easy to learn), and Cargo also doesn't deal with all of the property declarations that Semantic Mediawiki requires. Also, Cargo has native support for JSON exporting, while SMW doesn't (and any extensions that provide such support are pretty stale).

== CSS Design (4/22/2016) ==

Couple of notes on where "obvious" (hint: not so obvious) things are. (Note, all paths that follow are relative to the Mediawiki root directory, which should be in /var/lib/mediawiki).

First, the logo for the page is defined in LocalSettings.php. Look for the $wgLogo variable. I used an FTP client to upload new logos, but you could use a terminal and wget the file if you have it online somewhere.

For changing CSS rules, I just used the Chrome inspector (F12 or right-click and choose "Inspect" from the option menu) to understand which CSS selector rules were being applied and which were being overridden. You can also make small CSS changes in the inspector that are lost upon refreshing the page, but can be useful for experimenting with different colors, positions, etc.

You can use $ grep -r "[words_to_search_for]" on the command line to search for something (a CSS hex color code, a CSS selector, etc.) in all files and directories in the current directory. I usually used this while in the skins/Vector directory to make finding CSS properties easier.
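A slightly narrower variant of that search, as a sketch: it restricts the search to stylesheet files and treats the pattern as a literal string, so hex codes and selectors with dots need no escaping. The name <code>find_style</code> is made up, and <code>--include</code> is a GNU grep option.

```shell
#!/bin/sh
# find_style PATTERN [DIR]  (hypothetical helper)
# Searches *.less and *.css files under DIR (default: current directory)
# for the literal string PATTERN and prints file:line:match for each hit.
find_style() {
    grep -rnF --include='*.less' --include='*.css' -e "$1" "${2:-.}"
}
```

For example, <code>find_style '#ff6600' skins/Vector</code> (with whatever color code you are looking for) lists every stylesheet line containing that color.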

The CSS is actually written in LESS, which is an extension of CSS syntax that allows you to do nested properties, variables, etc. The skins/Vector/variables.less file has all the variables, which are prefixed with an at sign (@) in LESS. WARNING: if you try to use a variable name that hasn't been defined (due to a typo, for example), ALL of the CSS/LESS will stop working. The plus side is that it's obvious that you messed up. The downside is that it may not be obvious where exactly you messed up, so make small changes and refresh the browser view constantly. Other than that, most of the other LESS rules are in the skins/Vector/components folder. The file names are fairly reasonable: common.less defines rules common to the entire page, navigation.less defines the area on the left sidebar, personalMenu.less defines the set of links in the top right corner for the user account, and footer.less defines the footer. There's also another file in skins/Vector that is useful for understanding how everything comes together: VectorTemplate.php, which contains the high-level HTML structure.
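To narrow down where an undefined variable came from, a rough check like the one below may help. It is a heuristic sketch, not a LESS parser: <code>undefined_less_vars</code> is a made-up name, it only sees definitions inside the directory it is given (variables imported from elsewhere will be reported), and mixin parameters or @-words in comments can show up as false positives.

```shell
#!/bin/sh
# undefined_less_vars DIR  (hypothetical helper, heuristic only)
# Lists @names used in *.less files under DIR that are never defined there
# as "@name: value" at the start of a line.
undefined_less_vars() {
    defined=$(mktemp) || return 1
    used=$(mktemp) || return 1
    # Definitions: "@name:" at the start of a line, after optional whitespace.
    grep -rhoE --include='*.less' '^[[:space:]]*@[A-Za-z0-9_-]+[[:space:]]*:' "$1" |
        grep -oE '@[A-Za-z0-9_-]+' | sort -u > "$defined"
    # Uses: every @name, minus common CSS/LESS at-rules and built-ins.
    grep -rhoE --include='*.less' '@[A-Za-z0-9_-]+' "$1" |
        grep -vxE '@((-[a-z]+-)?keyframes|media|import|font-face|charset|supports|namespace|page|arguments|rest|plugin)' |
        sort -u > "$used"
    # Print names that appear only in the "used" list.
    comm -13 "$defined" "$used"
    rm -f "$defined" "$used"
}
```

Running <code>undefined_less_vars skins/Vector</code> after an edit and comparing against the output from before the edit is one way to spot a freshly introduced typo.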

== To-do list ==

* extra namespaces for IntraACL stuff. see [https://www.mediawiki.org/wiki/Manual:Using_custom_namespaces here]
* inconsistent styling: links aren't orange on special pages, fonts and links are the default in the "mobile" view

== In progress ==

* Mediawiki CSS styling - '''custom fonts fixed, need new designs/layouts'''
* analytics - '''getting GA installed for WordPress blogs, need port 21 opened'''

== Potential pitfalls ==

* It looks like the Common.css stylesheet has to be exactly the same as the Wikipedia Common.css stylesheet for the Wikipedia Infobox templates to be styled properly, because I solved the problem of the infoboxes being styled incorrectly by deleting all of the custom CSS that we had written for the mediawiki.

==Installing and configuring the Backup Drive==

=New Notes=

==Mounting the RDP==

apt-get install cifs-utils

mkdir -p /mnt/rdp

mount -t cifs //128.42.44.182/mcnair /mnt/rdp -o user=researcher,domain=ad.mcnaircenter.org

==Mobile Interface==

===Folders===
* The folders with the source code can be found at

/var/lib/mediawiki/extensions/MobileFrontend/minerva.less

===Tips===
* Using a [http://www.mobilephoneemulator.com/ mobile emulator] helps understand what the mobile interface is going to look like before deploying onto Production.

==User Access 6/15/2016 ==
'''Objective'''

Accounts are to be vetted before they are created. We would like a queue of account creation requests that must be approved before the accounts are created, given that we allow users to edit public wiki pages.
*Helpful Material:
** [https://www.mediawiki.org/wiki/Extension:ConfirmAccount Mediawiki Documentation ]
** mcnair@rice.edu -account that will approve account creation.

Steps Followed:
'''Package Installation Steps:'''
* cd extensions/
* wget https://extdist.wmflabs.org/dist/extensions/ConfirmAccount-REL1_26-d6e2f46.tar.gz
* tar -xzf ConfirmAccount-REL1_26-d6e2f46.tar.gz
* sudo pear install mail
* sudo pear install net_smtp
The above steps ensure that the email notification system and the ConfirmAccount package are both set up.

'''Configuring Confirm Accounts php files '''
The following files need to be updated as follows:
*ConfirmAccount.php:
Set the confirmation queues to point to folders that www-data has access to:
// For changing path in accountreqs
$wgConfirmAccountPathAR = $IP . "/images/accountreqs";

// For changing path in accountcreds
$wgConfirmAccountPathAC = $IP . "/images/accountcreds";

*ConfirmAccount.config.php
Change the directories to those defined in ConfirmAccount.php
$wgFileStore['accountreqs']['directory'] = $wgConfirmAccountPathAR;
$wgFileStore['accountcreds']['directory'] = $wgConfirmAccountPathAC;

* LocalSettings.php:

$wgEnableEmail = true;
$wgEmergencyContact = "mcnair@rice.edu";
$wgPasswordSender = "mcnair@rice.edu";
# User Account Confirmation
require_once "$IP/extensions/ConfirmAccount/ConfirmAccount.php";

$wgSMTP = array(
'host' => 'ssl://smtp.mail.rice.edu',
'IDHost' => '128.42.44.22',
'port' => 465,
'username' => 'mcnair@rice.edu',
'password' => '*********',
'auth' => true
);
$wgConfirmAccountContact = 'mcnair@rice.edu';

''' Updating the Wiki'''
* cd /var/lib/mediawiki/maintenance
* php update.php

[[admin_classification::IT Build| ]]

== Mediawiki extensions ==

== Semantic Mediawiki Extensions ==
The SMW extension installation process requires composer.phar to be installed. All further SMW-related installations are done through composer.phar.

==== Installing Mediawiki Composer.phar ====
Here is the Composer installation link: [https://getcomposer.org/doc/00-intro.md#installation-nix]

==== Installing Extension : Semantic Result Formats ====
* Here is the link to the installation process :
* Here is the command to be run in the Mediawiki root folder (/var/lib/mediawiki)
php composer.phar require --update-no-dev mediawiki/semantic-result-formats "2.*"

Web Server Documentation

2016-11-07T21:46:08Z

RavaliKruthiventi: /* New Notes */

[[Category: McNair Admin]]

=Old Notes (from Alex Jiang)=

== Installing Ubuntu aka Trying RAID 10 (2/15/2016) ==

Some general configuration options:
* hostname: McNairWebServ
* user full name: McNair Center
* username: mcnair
* don't encrypt home directory
* manual partitioning (see below for configuration of RAID)
* no automatic updates
* software: LAMP stack

Sahil and I tried to configure RAID 10 using the software RAID option in the installer, which is documented [https://help.ubuntu.com/community/Installation/SoftwareRAID#Partitioning_the_disk here]. We put two 64 GB swap space partitions on the first two hard drives, and created two ext4 partitions that took up the rest of the space on those two drives. For the other two drives, we used a single ext4 partition for each drive. For all of the ext4 partitions, we set the bootable flag to "on." Then we chose to configure the software RAID, created a new MD device, and chose RAID10 with 2 active devices and 2 spare devices. For the active devices, we chose the two ext4 partitions on the first two hard drives, and for the spare devices, we chose the two ext4 partitions on the other two hard drives. But then the installation process fails when the GRUB boot loader can't be installed, because the GUID partition tables (GPT) need a designated, small (1 MB is enough) partition for the GRUB bootloader.

So we started partitioning from scratch, but with only two hard drives for a RAID1 array. In the first drive, there are three partitions: one 1 MB partition reserved for the bootloader, one 64 GB swap partition, and the rest of the drive as an ext4 partition for the filesystem. In the second drive, there are two partitions: one 1 MB partition reserved for the bootloader and the rest of the drive as an ext4 partition for the filesystem. Then we made two software RAID devices, each with 2 active devices and 0 spare devices. The first RAID device had both of the bootloader partitions as the active devices, and the second RAID device had both of the ext4 filesystem partitions as the active devices. Then we set the first RAID device to "use as ext4" with the mount point "/boot" and the second RAID device to "use as ext4" with the mount point "/" and then continued with the installation. This time, it failed to install the kernel.

I guessed that, because the 1 MB RAID device was made first, the kernel tried to install itself to that device and failed. So I went back to the partitioner and set the first RAID device to "do not use" and then tried the installation process again. It prompted me a couple of times, warning me that the old filesystem would be overwritten, but I continued the installation regardless. But then the GRUB boot loader failed, even when we tried installing it to "/dev/md0" or "/dev/md0_raid1" instead of the master boot record (MBR).

== Configuring RAID 1 on Web Server (2/17/2016) ==

The first RAID device (/dev/md0) we set to use as an ext4 filesystem and mounted /boot to it, and the second RAID device (/dev/md127) we set to use as an ext4 filesystem and mounted / to it (we tried this before, but it failed to install the kernel). This time, it failed to install the bootloader, but it never prompted me to choose where to install the bootloader (usually it asks whether you'd like to install the bootloader to the master boot record).

'''Second partitioning attempt:'''

First hard disk (/dev/sda):
* 10 MB partition, use as reserved BIOS boot area, bootable flag off
* 64 GB partition, use as swap space
* rest of the space partition, use as ext4 filesystem, mount point /, bootable flag off

Second hard disk (/dev/sdb):
* 10 MB partition, use as reserved BIOS boot area, bootable flag off
* rest of the space partition, use as ext4 filesystem, mount point /, bootable flag off

Write partition changes to disk and then start configuring software RAID:

* First RAID device (/dev/md0): RAID1, 2 active devices (/dev/sda3 and /dev/sdb2), 0 spare devices
* Second RAID device (/dev/md1): RAID1, 2 active devices (/dev/sda1 and /dev/sdb1), 0 spare devices
* first RAID device partition: use as ext4 filesystem, mount point /
* second RAID device partition: use as ext4 filesystem, mount point /boot, format data on the partition

Failed to install GRUB bootloader on a hard disk (again).

Next attempt:
* First RAID device (/dev/md0): use as ext4 filesystem, mount point /, format data on the partition
* Second RAID device (/dev/md1): erase data on partition, use as "do not use"

Next attempt:
Redo the RAID devices so that the first device (/dev/md0): RAID1, 2 active devices (/dev/sda1 and /dev/sdb1), 0 spare devices, and the second RAID device (/dev/md1): RAID1, 2 active devices (/dev/sda3 and /dev/sdb2), 0 spare devices. Then configure the RAID devices:

* first RAID device partition: use as ext4 filesystem, mount point /boot, format data on the partition
* second RAID device partition: use as ext4 filesystem, mount point /, format data on the partition

New idea: ditch the idea of RAID on the boot partitions (we'll put the bootloader on one of the boot partitions and then we can try to set up RAID once we've got the thing booting into Linux), so leave the partitions as above ("Second Partitioning Attempt"). Only make one software RAID device (/dev/md0): RAID1, 2 active devices (/dev/sda3 and /dev/sdb2), 0 spare devices. Then configure the first RAID device partition: use as ext4 filesystem, mount point /, format data on the partition.

'''Third partitioning attempt:'''

First hard disk (/dev/sda):
* 10 MB partition, use as reserved BIOS boot area, bootable flag off
* 32 GB partition, use as swap space
* rest of the space partition, use as ext4 filesystem, mount point /, bootable flag off

Second hard disk (/dev/sdb):
* 10 MB partition, use as reserved BIOS boot area, bootable flag off
* 32 GB partition, use as swap space
* rest of the space partition, use as ext4 filesystem, mount point /, bootable flag off

One RAID device (/dev/md0): RAID1, 2 active devices (/dev/sda3 and /dev/sdb3), 0 spare devices. set partition: use as ext4 filesystem, mount point /

'''Fourth partitioning attempt:'''

First hard disk (/dev/sda):
* 10 MB partition, use as reserved BIOS boot area, bootable flag off
* 32 GB partition, use as swap space
* rest of the space partition, use as ext4 filesystem, mount point /, bootable flag off

Second hard disk (/dev/sdb):
* 10 MB partition, use as reserved BIOS boot area, bootable flag off
* 32 GB partition, use as swap space
* rest of the space partition, use as ext4 filesystem, mount point /, bootable flag off

First RAID device (/dev/md0): RAID1, 2 active devices (/dev/sda3 and /dev/sdb3), 0 spare devices. set partition: use as ext4 filesystem, mount point /

Second RAID device (/dev/md1): RAID1, 2 active devices (/dev/sda1 and /dev/sdb1), 0 spare devices. set partition: use as ext4 filesystem, mount point /boot

Third RAID device (/dev/md2): RAID0, 2 active devices (/dev/sda2 and /dev/sdb2). set partition: use as swap area

'''Fifth partitioning attempt (made sure all software RAID devices are removed, delete all partitions, create new partition tables):'''

First hard disk (/dev/sda):
* 10 MB partition, use as reserved BIOS boot area, bootable flag off
* 32 GB partition, use as swap space
* rest of the space partition, use as ext4 filesystem, mount point /, bootable flag off

Second hard disk (/dev/sdb):
* 10 MB partition, use as reserved BIOS boot area, bootable flag off
* 32 GB partition, use as swap space
* rest of the space partition, use as ext4 filesystem, mount point /, bootable flag off

First RAID device (/dev/md0): RAID1, 2 active devices (/dev/sda3 and /dev/sdb3), 0 spare devices. set partition: use as ext4 filesystem, mount point /

Install the GRUB bootloader to both /dev/sda and /dev/sdb. It works!
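Once the machine boots, the state of the software RAID can be read from /proc/mdstat, where each array has a status such as [UU] (both mirrors up) or [U_] (one mirror missing). A minimal sketch of flagging degraded arrays; degraded_arrays is a hypothetical helper and the sample text is made up for illustration, not captured from this server.

```shell
# Hypothetical helper: print the name of every md array in an mdstat-style
# file whose status brackets contain "_" (a missing member).
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }
        /\[[U_]+\]/ {
            if (name != "" && $NF ~ /_/) print name
            name = ""
        }
    ' "$1"
}

# Illustrative sample only: md0 is healthy, md1 has lost one mirror.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
Personalities : [raid1]
md0 : active raid1 sdb3[1] sda3[0]
      943586304 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sda1[0]
      9728 blocks super 1.2 [2/1] [U_]

unused devices: <none>
EOF
degraded_arrays "$sample"
rm -f "$sample"
```

On the server the same function can be pointed at the real file with degraded_arrays /proc/mdstat; no output means every array reports all members up.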

== Network Configuration (2/22/2016) ==

As with the [[Test_Web_Server_Documentation|test web server]], network configuration can be annoying. First, I had to figure out the right LAN port on the mobo by plugging the RJ45 cable in and waiting for the LED to light up (it took about 5 seconds and a couple of tries). Then I went to the terminal to check on the network interfaces:

$ ifconfig
$ ifconfig -a
$ sudo ifconfig eth0 up
$ cat /etc/network/interfaces

After bringing up the eth0 interface (it's down if it's not listed in the output of ifconfig), I then modified /etc/network/interfaces to set up the eth0 interface:

$ sudo vi /etc/network/interfaces

And added these lines:

auto eth0
iface eth0 inet dhcp
dns-nameservers 8.8.8.8 8.8.4.4

Then I used ifdown/ifup to reconfigure the interface:

$ sudo ifdown eth0
$ sudo ifup eth0
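Lines starting with auto in /etc/network/interfaces mark the interfaces that are brought up at boot, so it is worth confirming that eth0 is listed there after editing. A small sketch, assuming the file follows the usual layout; auto_interfaces is a hypothetical helper and the demo uses a temporary copy of the stanza above plus an assumed loopback entry.

```shell
# Hypothetical helper: list every interface named on an "auto" line.
auto_interfaces() {
    awk '$1 == "auto" { for (i = 2; i <= NF; i++) print $i }' "$1"
}

# Demo file: the loopback stanza is an assumption, the eth0 lines are the ones added above.
demo_file="$(mktemp)"
cat > "$demo_file" <<'EOF'
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet dhcp
dns-nameservers 8.8.8.8 8.8.4.4
EOF
auto_interfaces "$demo_file"
rm -f "$demo_file"
```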

There are a couple of commands and files that you can check to make sure that the network is configured correctly (I compared them to the corresponding output on the test web server):

$ hostname -I
$ cat /etc/resolv.conf
$ cat /etc/hosts
$ cat /var/lib/dhcp/dhclient.eth0.leases

Then I checked if it was connected to the internet:

$ ping google.com
$ sudo apt-get update

I got a "GPG error: http://security.ubuntu.com trusty-security InRelease: Clearsigned file isn't valid, got 'NODATA' (does the network require authentication?)" message on the apt-get update a couple of times, so I tried sudo ifdown eth0 and sudo ifup eth0 a couple of times. Then I rebooted the machine and tried to update the package manager again, and it still didn't work.

These results seem familiar; I think I had the same error when I tried to connect the test web server to the internet before Ed filed the ticket with the IT help desk, which suggests that we may have given the wrong MAC address or IT messed up the configuration. Still, I checked all of the configuration files. I only noted a couple of differences between the test web server network interface and this web server network interface:

# The IP addresses are different. The test web server has an address that starts with 128, but this webserver has an address that starts with 10. (Ed thinks this is a sign that this webserver's IP address limits it to the Rice network).
# The subnet masks are different. The test web server has a subnet mask that ends in 240, but this webserver has a mask that ends in 0.
# The test webserver has a DNS domain name (i.e. the output of hostname -d) of attlocal.net. This webserver doesn't have one. I tried adding it (by editing /etc/hosts), but that change alone didn't help.
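The first difference is the telling one: addresses starting with 10 fall in one of the private IPv4 ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16), which are not routed on the public internet. A rough sketch of that check; is_private_ipv4 is a hypothetical helper that only looks at the prefix and does not validate the address, and 10.1.2.3 is a made-up example.

```shell
# Hypothetical helper: succeed if the dotted-quad address starts with one of
# the private IPv4 prefixes. No validation of the address format is done.
is_private_ipv4() {
    case "$1" in
        10.*) return 0 ;;
        192.168.*) return 0 ;;
        172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
        *) return 1 ;;
    esac
}

for ip in 10.1.2.3 128.42.44.180; do
    if is_private_ipv4 "$ip"; then
        echo "$ip is a private address"
    else
        echo "$ip is a public address"
    fi
done
```

On the server, the address to test can be taken from the output of hostname -I.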

Interesting side note: going into the mobo BIOS menu, under "Server Mgmt" there is a submenu "BMC network configuration" that shows the MAC address for "DM_LAN1" as ending in de, whereas the MAC address for eth0 ends in dc (otherwise, the two MAC addresses are the same). So maybe the mobo is interfering with the MAC address? But changing DM_LAN1's Config Address source from "Previous State" to "DynamicBmcDhcp" doesn't fix the problem (and upon reboot, it switches back to Previous State).

Turns out IT just configured the network IP addresses incorrectly. Ed and I talked to the IT desk on Tuesday and we got new IP addresses.

== Installing Software (2/24/2016) ==

Now that we have an internet connection, we can start getting packages:

$ sudo apt-get update
$ sudo apt-get upgrade

Since I didn't install the SSH server in the beginning, I'll go ahead and install the openssh-server package now:

$ sudo apt-get install openssh-server

Back up the SSH server config file:

$ sudo cp /etc/ssh/sshd_config /etc/ssh/sshd_config.original

== Installing Mediawiki (3/7/2016) ==

As with the [[Test Web Server Documentation#Installing Mediawiki (1/4/16)|test web server]], I followed the steps from [http://www.mediawiki.org/wiki/Manual:Running_MediaWiki_on_Ubuntu this page] on installing Mediawiki.

Make a directory for the stable version of Mediawiki (1.26.2), which isn't available through apt-get, so we're downloading the official tarball!

$ mkdir ~/Downloads
$ cd ~/Downloads
$ wget https://releases.wikimedia.org/mediawiki/1.26/mediawiki-1.26.2.tar.gz
$ tar -xvzf mediawiki-1.26.2.tar.gz

Move the extracted files to /var/lib/mediawiki:

$ sudo mkdir /var/lib/mediawiki
$ sudo mv mediawiki-1.26.2/* /var/lib/mediawiki

Then set up the mediawiki directory:

$ cd /var/www/html
$ sudo ln -s /var/lib/mediawiki mediawiki

Now point a browser to http://[ip_address]/mediawiki/mw-config/index.php and configure the Mediawiki site as follows:

Choose both "your language" and the "wiki language" to be English and continue to the next page. Make sure that all of the environmental checks pass before continuing to the next page. Leave the "database host" as localhost and change "database name" to mcnair. Leave "database table prefix" empty and "database username" as root. Set the "database password" to whatever the password for the MySQL user was set as during installation and then continue to the next page. Check the box for "Use this account for installation" and choose InnoDB for "Storage Engine" and choose Binary for "Database character set" and continue to the next page. Set the name of the wiki as McNair Center and let the project namespace be the same as the wiki name. For the administrator account, set the username, password, and email. Choose to subscribe to the release announcements mailing list if you provide an email, and choose to answer more questions.

Choose "open wiki" for the user rights profile. Choose "no license footer". Uncheck the box for "enable outbound email" and choose which skin you'd like to use. For extensions, leave them all unchecked. Leave "enable file uploads" unchecked. Don't change the Logo URL and don't check "enable Instant Commons". For caching, choose "no caching".

Copy the downloaded LocalSettings.php configuration file onto the webserver in the root directory of the mediawiki installation: /var/lib/mediawiki. Then point a browser to http://[ip_address]/mediawiki and see your new site!

== Short URLs (3/7/2016) ==

Same as for the [[Test Web Server Documentation#Short URLs (1/27/16)|test web server]].

== Labeled Section Transclusion (3/7/2016) ==

Same as for the [[Test Web Server Documentation#Labeled Section Transclusion (1/25/16)|test web server]].

== Responsive Design (3/7/2016) ==

Same as for the [[Test Web Server Documentation#Responsive Design (1/25/16)|test web server]].

== Mediawiki CSS changes (3/9/2016) ==

Started working with Julia on the mediawiki website CSS design (color scheme and typography on [[Website Design]]). Ran into a couple of problems:

* If you upload a file to Slack and want to download it from its URL using the wget command on command-line, make sure you get a public link from the person who uploaded the file, otherwise the file won't be downloaded. (I was trying to figure out why the McNair logo that Julia sent me on slack wasn't showing up on the website, but it turns out I just needed a public link to the file, which should look something like https://files.slack.com/files-pri/T0JA2A9Q9-F0RL0G4BZ/mcnair.png?pub_secret=30505f5d02).
* The @font-face rule doesn't seem to work in Common.css... I never got past this problem. I think the .ttf file for the font may have failed to download onto the server properly, but I haven't found a good way to test for that case. Also, I tried using an absolute URL (i.e. http://128.42.44.180/mediawiki/resources/assets/fonts/franklin-gothic-book.ttf) when specifying the @font-face rule, but it doesn't seem to help. Using an absolute URL to the slack file public URL (i.e. https://files.slack.com/files-pri/T0JA2A9Q9-F0RLDB3G8/download/franklin-gothic-book.ttf?pub_secret=327cdaaeb8) doesn't seem to work either.

Well, I don't really trust the file to download onto the webserver properly from terminal, so I got an SFTP client and used that to copy the .ttf file onto the webserver. Still no dice.

== Setting up users (3/11/2016) ==

First, getting the ImportUsers extension for bulk account creation (using a CSV). Downloading the extension is as follows:

$ cd ~/Downloads
$ wget https://extdist.wmflabs.org/dist/extensions/ImportUsers-REL1_26-0fe9e22.tar.gz
$ tar -xzvf ImportUsers-REL1_26-0fe9e22.tar.gz
$ cd /var/lib/mediawiki/extensions
$ cp -r ~/Downloads/ImportUsers ./ImportUsers

Then edit LocalSettings.php and add this line:

require_once("$IP/extensions/ImportUsers/ImportUsers.php");

Then we just have to make a CSV with columns for username, password, email, real name, and user groups (optional). More info on the [https://www.mediawiki.org/wiki/Extension:ImportUsers extension documentation page].

I made a small CSV to test the ImportUsers extension:

user1,pass1,user1@example.com,Dummy One
user2,pass2,user2@example.com,Dummy Two
user3,pass3,user3@example.com,Dummy Three
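Before importing a longer list, a quick sanity check of the CSV can catch rows with missing columns. A minimal sketch, assuming plain comma-separated rows with no quoted commas; check_users_csv is a hypothetical helper and is not part of the ImportUsers extension.

```shell
# Hypothetical helper: report rows with fewer than 4 fields or with a third
# field that does not look like an email address. Exits non-zero on problems.
check_users_csv() {
    awk -F',' '
        BEGIN { bad = 0 }
        NF < 4 { printf "line %d: expected at least 4 fields, got %d\n", NR, NF; bad = 1; next }
        $3 !~ /^[^@]+@[^@]+\.[^@]+$/ { printf "line %d: suspicious email %s\n", NR, $3; bad = 1 }
        END { exit bad }
    ' "$1"
}

# Demo with the three test rows above.
demo_csv="$(mktemp)"
cat > "$demo_csv" <<'EOF'
user1,pass1,user1@example.com,Dummy One
user2,pass2,user2@example.com,Dummy Two
user3,pass3,user3@example.com,Dummy Three
EOF
if check_users_csv "$demo_csv"; then
    echo "CSV looks OK"
fi
rm -f "$demo_csv"
```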

After importing the users, run a maintenance script from the command line to update new user statistics:

$ cd /var/lib/mediawiki/maintenance
$ php initSiteStats.php

But this runs into some errors ([https://www.mediawiki.org/wiki/Manual:Maintenance_scripts this page] suggests setting the MW_INSTALL_PATH environment variable, but I can't find a good way to do that). I looked into the error messages and found [http://stackoverflow.com/questions/21257589/ubuntu-typing-php-in-terminal-shows-a-lot-of-errors this SO post] which seems to cover it. I don't know whether we need SNMP, so I decided to just install it to be safe:

$ sudo apt-get install snmp

And the error messages go away. Alternatively, you can disable the snmp module for PHP with:

$ sudo php5dismod snmp

We also want to limit account creation to sysops only [https://www.mediawiki.org/wiki/Manual:Preventing_access#Restrict_account_creation as done here]. To do this, edit LocalSettings.php and add these lines:

# Prevent new user registrations except by sysops
$wgGroupPermissions['*']['createaccount'] = false;

== BibTex citations with BibManager (3/11/2016) ==

The [https://www.mediawiki.org/wiki/Extension:BibManager BibManager extension] isn't actively maintained, but it doesn't seem like it needs to be constantly updated to accommodate new features, and it was last updated for Mediawiki version 1.22, which isn't too bad.

Let's test on the test web server first.

== Bibtex citations with Bibtex (3/14/2016) ==

The [https://www.mediawiki.org/wiki/Extension:Bibtex Bibtex extension] doesn't look like it's being actively maintained, but it might work. I'm testing it on the test web server alongside BibManager.

== Ghost vs. WordPress (3/14/2016) ==

So it looks like we may choose Ghost over WordPress. We need something self-hostable, and ideally open-source (and both Ghost and WP satisfy those two conditions). However, I hear Ghost is more lightweight, so if we're not looking for a lot of extra functionality from third-party plugins, Ghost may be the better choice. I'm setting up Ghost on the [[Test Web Server Documentation#Installing Ghost (3/14/2016)|test web server]], so we'll see how it goes...

Turns out Ghost+apache is kinda difficult (definitely more difficult than WordPress+Apache), so let's just try WordPress.

The [[Test Web Server Documentation#Installing WordPress (3/14/2016)|test web server]] had a pretty easy time installing WordPress alongside the existing mediawiki site, so it seems that we'll use WP for the blog on this web server as well.

== Infoboxes (3/16/2016) ==

I decided to follow the instructions on [http://trog.qgl.org/20140923/setting-up-infobox-templates-in-mediawiki-v1-23/ this post]. Let's see how it goes.

Step 1:

Download and install the [https://www.mediawiki.org/wiki/Extension:Scribunto Scribunto extension].

$ cd ~/Downloads
$ wget https://extdist.wmflabs.org/dist/extensions/Scribunto-REL1_26-9fd4e64.tar.gz
$ tar -xzvf Scribunto-REL1_26-9fd4e64.tar.gz
$ cd /var/lib/mediawiki/extensions
$ cp -r ~/Downloads/Scribunto ./Scribunto

Add these two lines to LocalSettings.php:

require_once("$IP/extensions/Scribunto/Scribunto.php");
$wgScribuntoDefaultEngine = 'luastandalone';

And set execute permissions for Lua binaries in the extension:

$ chmod a+x /var/lib/mediawiki/extensions/Scribunto/engines/LuaStandalone/binaries/lua_5_1_5_linux_64_generic/lua

In addition, use a phpinfo page to check that the PCRE version is at least 8.10 (preferably at least 8.33), that PHP's mbstring extension is enabled, and that PHP's proc_open function is not disabled.
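The version comparison can also be done on the command line. A small sketch using GNU sort -V for the comparison; version_ge is a hypothetical helper, and the version numbers in the loop are examples rather than values read from this server.

```shell
# Hypothetical helper: succeed when version $1 is greater than or equal to $2.
# Relies on the -V (version sort) option of GNU sort.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Compare a few example PCRE versions against the 8.10 minimum.
for v in 8.02 8.10 8.33; do
    if version_ge "$v" 8.10; then
        echo "PCRE $v: new enough for Scribunto"
    else
        echo "PCRE $v: too old for Scribunto"
    fi
done
```

On the server, the installed version could probably be fed in from php -r 'echo PCRE_VERSION;', which prints the version followed by a release date, so only the first word should be passed to version_ge.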

Step 2:

Copy Wikipedia's [https://en.wikipedia.org/w/index.php?title=MediaWiki:Common.css&action=edit Common.css] stylesheet into the wiki's Common.css stylesheet.

Step 3:

Export the Infobox template from Wikipedia from the [https://en.wikipedia.org/wiki/Special:Export Special:Export] page. In the "add pages manually" text box, type Template:Infobox and then check all three checkboxes below: "Include only the current revision, not the full history", "Include templates", and "Save as file", then click the Export button and save the XML file.

Step 4:

Import that XML file onto the wiki using the Special:Import page. Choose the "Import to default locations" option.

Step 5:

Test your Infobox template by creating a new page on the mediawiki and using the Infobox template. I used the following code to test:

<nowiki>
{{Infobox
|title = An amazing Infobox
|header1 = It works!
|label2 = Configured by
|data2 = trog
|label3 = Web
|data3 = http://trog.qgl.org/20140923/setting-up-infobox-templates-in-mediawiki-v1-23/
}}</nowiki>

Debugging:

I seem to have the template functionality working, but it's not styled properly. So let's try exporting and importing Wikipedia's Common.css stylesheet instead of just copying and pasting. And let's also try exporting and importing Wikipedia's Common.js script into the wiki.

Wait, I fixed it by just removing the custom CSS code that I had from trying to change the font-face. If those two things conflict, we may have issues down the line...

I also uncovered something about HTMLTidy that may impact how well templates from Wikipedia run on our mediawiki [https://www.mediawiki.org/wiki/Manual:Using_content_from_Wikipedia#HTMLTidy]. It looks like we can either [https://www.mediawiki.org/wiki/Manual:$wgTidyConfig set an option] in LocalSettings.php to enable HTMLTidy or we can [https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Transwiki get the templates from another source].

== Installing WordPress (3/16/2016) ==

Same as the [[Test Web Server Documentation#Installing WordPress (3/14/2016)|test web server]]

== Google Analytics for Mediawiki and WordPress (3/16/2016) ==

There's an [https://www.mediawiki.org/wiki/Extension:Google_Analytics_Integration extension] for google analytics integration on Mediawiki, and it seems to have pretty robust support (you can exclude specific pages or categories from analytics, and you can exclude user groups from analytics too).

There's an open-source alternative to google analytics called [http://www.openwebanalytics.com/ Open Web Analytics], and there's [https://www.mediawiki.org/wiki/Extension:Open_Web_Analytics a Mediawiki extension] for that too. Looks like Open Web Analytics has some cool extra features too like click heatmaps...

WordPress appears to have support for both Google Analytics and Open Web Analytics.

After looking around for other open-source alternatives, it appears Piwik is another strong contender. There's a demo of Piwik [http://demo.piwik.org/ here] and a demo of OWA [http://demo.openwebanalytics.com/owa/ here]. There's [https://www.mediawiki.org/wiki/Extension:Piwik_Integration a Mediawiki extension] for Piwik integration, and it seems to be pretty well maintained. WordPress appears to support Piwik as well.

== Open-source Analytics Alternatives (3/21/2016) ==

Might as well try to keep everything open-source. I'll try out Open Web Analytics (OWA) on the test web server to play around with the interface.

OWA isn't going to work, as noted on the [[Test Web Server Documentation#Installing Open Web Analytics (3/21/2016)|test web server page]]. So let's try the [https://www.mediawiki.org/wiki/Extension:Piwik_Integration extension] for Piwik too.

So at least Piwik works. But here's the counterargument: in five years, which is more likely to be well-supported and maintained, Piwik or Google Analytics? And with the obvious answer being Google Analytics, we should just use that.

== Back to Google Analytics (3/23/2016) ==

We made a new Google Analytics account! admin@mcnaircenter.org 9million

I'm going to go ahead and test the Google Analytics integration extension on the [[Test Web Server Documentation#Installing Google Analytics (3/23/2016)|test web server]].

== Cargo vs Semantic Mediawiki? (3/25/2016) ==

I recently learned about Cargo, which claims to be a more straightforward version of SMW. See the [https://www.semantic-mediawiki.org/w/images/9/9a/Cargo_and_the_future_of_SMW.pdf slides] of a presentation given at the spring 2015 SMWCon, and the Cargo extension's [https://www.mediawiki.org/wiki/Extension:Cargo/Cargo_and_Semantic_MediaWiki comparison] page. The lead author of the extension, Yaron, is a member of the SMW community, and so Cargo is likely pretty legit. Now I'm not sure which is better...

After some more deliberation, I think Cargo wins. Cargo's querying syntax is more like SQL (which is actually useful and pretty easy to learn), and Cargo also doesn't deal with all of the property declarations that Semantic Mediawiki requires. Also, Cargo has native support for JSON exporting, while SMW doesn't (and any extensions that provide such support are pretty stale).

== CSS Design (4/22/2016) ==

Couple of notes on where "obvious" (hint: not so obvious) things are. (Note, all paths that follow are relative to the Mediawiki root directory, which should be in /var/lib/mediawiki).

First, the logo for the page is defined in LocalSettings.php. Look for the $wgLogo variable. I used a FTP client to upload new logos, but you could use a terminal and wget the file if you have it online somewhere.

For changing CSS rules, I just used the Chrome inspector (F12 or right-click and choose "Inspect" from the option menu) to understand which CSS selector rules were being applied and which were being overridden. You can also make small CSS changes in the inspector that are lost upon refreshing the page, but can be useful for experimenting with different colors, positions, etc.

You can use $ grep -r "[words_to_search_for]" on the command line to search for something (a CSS hex color code, a CSS selector, etc.) in all files and directories in the current directory. I usually used this while in the skins/Vector directory to make finding CSS properties easier.
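A few extra grep options make this search easier to read. The sketch below wraps them in a hypothetical helper, find_in_less, and runs it on a throwaway directory; the file contents and the hex color are invented for the demo and are not taken from the Vector skin.

```shell
# Hypothetical helper: search only .less files, recursively, ignoring case,
# and print file names with line numbers. $1 = text, $2 = directory (default .).
find_in_less() {
    grep -rni --include='*.less' -- "$1" "${2:-.}"
}

# Throwaway demo tree with made-up contents.
demo_dir="$(mktemp -d)"
mkdir -p "$demo_dir/components"
printf '@link-color: #ff6600;\n' > "$demo_dir/variables.less"
printf 'a { color: @link-color; }\n' > "$demo_dir/components/common.less"
printf 'not a stylesheet #ff6600\n' > "$demo_dir/notes.txt"

find_in_less 'ff6600' "$demo_dir"
rm -rf "$demo_dir"
```

The notes.txt file is skipped because of --include, so only the match in variables.less is printed.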

The CSS is actually written in LESS, which is an extension of CSS syntax that allows you to do nested properties, variables, etc. The skins/Vector/variables.less file has all the variables, which are prefixed with an at sign (@) in LESS. WARNING: if you try to use a variable name that hasn't been defined (due to a typo, for example), ALL of the CSS/LESS will stop working. The plus side is that it's obvious that you messed up. The downside is that it may not be obvious where exactly you messed up, so make small changes and refresh the browser view constantly. Other than that, most of the other LESS rules are in the skins/Vector/components folder. The file names are fairly reasonable: common.less defines rules common to the entire page, navigation.less defines the area on the left sidebar, personalMenu.less defines the set of links in the top right corner for the user account, and footer.less defines the footer. There's also another file in skins/Vector that is useful for understanding how everything comes together: VectorTemplate.php, which contains the high level HTML structure.

== To-do list ==

* extra namespaces for IntraACL stuff. see [https://www.mediawiki.org/wiki/Manual:Using_custom_namespaces here]
* inconsistent styling: links aren't orange on special pages, fonts and links are the default in the "mobile" view

== In progress ==

* Mediawiki CSS styling - '''custom fonts fixed, need new designs/layouts'''
* analytics - '''getting GA installed for WordPress blogs, need port 21 opened'''

== Potential pitfalls ==

* It looks like the Common.css stylesheet has to be exactly the same as the Wikipedia Common.css stylesheet for the Wikipedia Infobox templates to be styled properly, because I solved the problem of the infoboxes being styled incorrectly by deleting all of the custom CSS that we had written for the mediawiki.

==Installing and configuring the Backup Drive==

=New Notes=

==Mounting the RDP==

apt-get install cifs-utils

mount -t cifs //128.42.44.182/mcnair /mnt/rdp -o user=researcher,domain=ad.mcnaircenter.org

==Mobile Interface==

===Folders===
* The folders with the source code can be found at

/var/lib/mediawiki/extensions/MobileFrontend/minerva.less

===Tips===
* Using a [http://www.mobilephoneemulator.com/ mobile emulator] helps understand what the mobile interface is going to look like before deploying onto Production.

==User Access 6/15/2016 ==
'''Objective'''

Accounts are to be vetted before they are created. Since we allow users to edit public wiki pages, we would like a queue of account creation requests that must be approved before the accounts are created.
*Helpful Material:
** [https://www.mediawiki.org/wiki/Extension:ConfirmAccount Mediawiki Documentation ]
** mcnair@rice.edu -account that will approve account creation.

Steps Followed:
'''Package Installation Steps:'''
* cd extensions/
* wget https://extdist.wmflabs.org/dist/extensions/ConfirmAccount-REL1_26-d6e2f46.tar.gz
* tar -xzf ConfirmAccount-REL1_26-d6e2f46.tar.gz
* sudo pear install mail
* sudo pear install net_smtp
The above steps ensure that the email notification system and the Confirm Account package are set up.

'''Configuring Confirm Accounts php files '''
The following files need to be updated as follows:
*ConfirmAccount.php:
Set the confirmation queues to point to folders that www-data has access to:
// For changing path in accountreqs
$wgConfirmAccountPathAR = $IP . "/images/accountreqs";

// For changing path in accountcreds
$wgConfirmAccountPathAC = $IP . "/images/accountcreds";

*ConfirmAccount.config.php:
Change the directories to those defined in ConfirmAccount.php:
$wgFileStore['accountreqs']['directory'] : $wgConfirmAccountPathAR,
$wgFileStore['accountcreds']['directory'] : $wgConfirmAccountPathAC,

* LocalSettings.php:

$wgEnableEmail = true;
$wgEmergencyContact = "mcnair@rice.edu";
$wgPasswordSender = "mcnair@rice.edu";
# User Account Confirmation
require_once "$IP/extensions/ConfirmAccount/ConfirmAccount.php";

$wgSMTP = array(
'host' => 'ssl://smtp.mail.rice.edu',
'IDHost' => '128.42.44.22',
'port' => 465,
'username' => 'mcnair@rice.edu',
'password' => '*********',
'auth' => true
);
$wgConfirmAccountContact = 'mcnair@rice.edu';

''' Updating the Wiki'''
* cd /var/lib/mediawiki/maintenance
* php update.php

[[admin_classification::IT Build| ]]

== Mediawiki extensions ==

=== Semantic Mediawiki Extensions ===
Installing the SMW extension requires composer.phar to be installed first. All further SMW add-ons are then installed through composer.phar.

==== Installing Mediawiki Composer.phar ====
Follow the Composer installation instructions: [https://getcomposer.org/doc/00-intro.md#installation-nix]

==== Installing Extension : Semantic Results Formats ====
* Here is the link to the installation process :
* Here is the command to be run in the Mediawiki root folder (/var/lib/mediawiki):
php composer.phar require --update-no-dev mediawiki/semantic-result-formats "2.*"

Wordpress Blog Site (Tool)

2016-10-31T21:03:37Z

RavaliKruthiventi: /* Image Uploads */

Log in to:
http://www.mcnaircenter.org/blog/wp-admin/

==Install FTP server==

Log in and sudo su yourself, then:

apt-get install vsftpd

Man page for the vsftpd.conf file

http://vsftpd.beasts.org/vsftpd_conf.html

Securing the FTP:

https://help.ubuntu.com/lts/serverguide/ftp-server.html

==Configuration==

Edit /etc/vsftpd.conf (note next restart will reflect changes in /etc/init)

#add at end of file:
listen_port=26

'''Generate keys for our website''' with the following command:

openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout /etc/vsftpd.pem -out /etc/vsftpd.pem

Country Name (2 letter code) [AU]:US
State or Province Name (full name) [Some-State]:Texas
Locality Name (eg, city) []:Houston
Organization Name (eg, company) [Internet Widgits Pty Ltd]:McNair Center at Rice University's Baker Institute
Organizational Unit Name (eg, section) []:
Common Name (e.g. server FQDN or YOUR name) []:McNair Center
Email Address []:admin@mcnaircenter.org
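
Because the certificate is created with -days 365, it expires one year after it is generated. The sketch below is one way to see how many days remain. It assumes GNU date is available, and the function name days_until_expiry is hypothetical (it is not part of openssl). It reads the notAfter line that openssl x509 -noout -enddate prints.

```shell
# Hypothetical helper (not part of openssl): read a "notAfter=..." line on
# stdin and print the whole number of days left before that date.
# $1: optional "now" in seconds since the epoch (defaults to the current time).
days_until_expiry() {
    local not_after now expiry
    not_after="$(sed -n 's/^notAfter=//p')"
    now="${1:-$(date +%s)}"
    # Assumption: GNU date, which can parse the date format openssl prints.
    expiry="$(date -d "$not_after" +%s)"
    echo $(( (expiry - now) / 86400 ))
}

# Usage on the server:
#   openssl x509 -in /etc/vsftpd.pem -noout -enddate | days_until_expiry
```

Passing the current time as an optional argument keeps the function easy to try with a fixed date.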

Edit /etc/vsftpd.conf again

#change the lines as follows:
rsa_cert_file=/etc/vsftpd.pem
rsa_private_key_file=/etc/vsftpd.pem
write_enable=YES
chroot_local_user=YES
chroot_list_enable=YES
chroot_list_file=/etc/vsftpd.chroot_list
ssl_enable=YES
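
Before restarting, it can help to confirm that every setting listed above is present and uncommented. This is a minimal sketch: the function name check_vsftpd_conf is hypothetical, and it assumes each setting sits on its own line exactly as written here (no spaces around the equals sign).

```shell
# Hypothetical helper: report any setting from these notes that is missing
# (or still commented out) in the given vsftpd.conf-style file.
# Returns 0 when all settings are present, 1 otherwise.
check_vsftpd_conf() {
    local file="$1" missing=0 setting
    for setting in \
        listen_port=26 \
        rsa_cert_file=/etc/vsftpd.pem \
        rsa_private_key_file=/etc/vsftpd.pem \
        write_enable=YES \
        chroot_local_user=YES \
        chroot_list_enable=YES \
        chroot_list_file=/etc/vsftpd.chroot_list \
        ssl_enable=YES
    do
        # -x matches the whole line, so "#ssl_enable=YES" does not count.
        if ! grep -qxF -- "$setting" "$file"; then
            echo "missing: $setting"
            missing=1
        fi
    done
    return "$missing"
}

# Usage on the server:
#   check_vsftpd_conf /etc/vsftpd.conf && echo "all settings present"
```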

Edit /etc/vsftpd.chroot_list to contain a list of usernames (e.g., ravali)

Restart the server

service vsftpd restart

The FTP server should now be accessible. Beware of local packet shaping. Connect through mcnaircenter.org:26. Otherwise, check that the process is running and listening:
ps -aux
netstat -lnt
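
The listening check can be narrowed to the FTP port. This is a minimal sketch, assuming netstat -lnt prints its usual columns (local address in the fourth field, state in the last). The function name port_is_listening is hypothetical. It reads the netstat output on standard input, so it can be tried without touching the server.

```shell
# Hypothetical helper: succeed if the given TCP port appears as a LISTEN
# socket in `netstat -lnt` output supplied on stdin.
port_is_listening() {
    local port="$1"
    awk -v port="$port" '
        $1 ~ /^tcp/ && $NF == "LISTEN" {
            # The port is the part after the last colon (works for IPv6 too).
            n = split($4, parts, ":")
            if (parts[n] == port) found = 1
        }
        END { exit(found ? 0 : 1) }
    '
}

# Usage on the server:
#   netstat -lnt | port_is_listening 26 && echo "vsftpd is listening on port 26"
```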

Assuming all is good with the FTP server, we now need to update Wordpress.

==Update Wordpress==

First make a copy of the wordpress folder and dbase

cp -R /var/lib/wordpress/ /var/lib/wordpress_bak
mysqldump -u mcnair_wp -p wordpress > backup_12Aug2016.sql
(enter password for dbase found in wp-config.php)
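
The dated file name can be generated rather than typed by hand. This is a sketch that assumes GNU date. The function name backup_name is hypothetical, and the output follows the backup_12Aug2016.sql pattern used above.

```shell
# Hypothetical helper: print a dump file name such as backup_12Aug2016.sql.
# $1: optional date string understood by GNU date (defaults to today).
backup_name() {
    # LC_ALL=C keeps the month abbreviation in English;
    # %-d is a GNU date extension that drops the leading zero from the day.
    LC_ALL=C date -d "${1:-today}" +"backup_%-d%b%Y.sql"
}

# Usage on the server:
#   mysqldump -u mcnair_wp -p wordpress > "$(backup_name)"
```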

Change the permissions on everything in the wordpress folder and make www-data its owner:
chown -R www-data /var/lib/wordpress
chmod -R 755 /var/lib/wordpress
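
To confirm that the two commands took effect, the tree can be searched for anything that still has a different owner or mode. This is only a sketch: the function name list_mismatches is hypothetical, and it relies on the standard find tests -user and -perm.

```shell
# Hypothetical helper: print every path under a directory whose owner or
# permission bits differ from the expected ones. No output means all is well.
# $1: directory, $2: expected owner (name or numeric id), $3: expected mode.
list_mismatches() {
    local dir="$1" owner="$2" mode="$3"
    find "$dir" \( ! -user "$owner" -o ! -perm "$mode" \) -print
}

# Usage on the server:
#   list_mismatches /var/lib/wordpress www-data 755
```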

Browse to 128.42.44.180/blog/wp-admin
Click update now. Enter:

Hostname 128.42.44.180:26
FTP Username ravali (or some other account)
FTP Password
Connection Type FTPS (SSL)

Leave the Akismet plugin
Go to appearance, themes -> add new
Choose Accesspress Lite 2.46.7
Activate
Install all of the recommended plugins that come with the theme

Check the media library works by uploading a file (e.g., GreenRoundLogo.png)

Create a child theme
cd /var/lib/wordpress/wp-content/themes
mkdir accesspress-lite-child
vi accesspress-lite-child/style.css
Add in the template from the parent folder's style.css (just the top of the file)
Update the theme name and text domain to accesspress-lite-child.
vi accesspress-lite-child/functions.php
Add in the section that never changes

<?php
function my_theme_enqueue_styles() {

$parent_style = 'parent-style'; // This is 'twentyfifteen-style' for the Twenty Fifteen theme.

wp_enqueue_style( $parent_style, get_template_directory_uri() . '/style.css' );
wp_enqueue_style( 'child-style',
get_stylesheet_directory_uri() . '/style.css',
array( $parent_style ),
wp_get_theme()->get('Version')
);
}
add_action( 'wp_enqueue_scripts', 'my_theme_enqueue_styles' );
?>
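
For the style.css step above, the header comment can also be generated instead of copied by hand. This is a sketch under assumptions: the function name write_child_header is hypothetical, and the parent folder name accesspress-lite in the usage line is a guess that should be checked against the actual folder in wp-content/themes. WordPress reads the Template line to find the parent theme.

```shell
# Hypothetical helper: print a minimal child-theme style.css header.
# $1: child theme name / text domain, $2: parent theme folder name.
write_child_header() {
    local child="$1" parent="$2"
    cat <<EOF
/*
 Theme Name:   $child
 Template:     $parent
 Text Domain:  $child
*/
EOF
}

# Usage (parent folder name is an assumption):
#   write_child_header accesspress-lite-child accesspress-lite \
#       > accesspress-lite-child/style.css
```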

Check the permissions on the new files:
chown -R www-data /var/lib/wordpress
chmod -R 755 /var/lib/wordpress

Activate the child theme!
Check out what it looks like: www.mcnaircenter.org/blog

==Customize our theme==
=== Middle Section ===
The middle area of the blog's home page has three sections:

==== The Twitter Feed ====
This widget will display the top 5 tweets of the McNair Center's twitter account.

*In the Appearance -> Widgets section, the theme has the middle section sidebar.
*Add the AccessPress-lite Twitter feed widget to the middle section sidebar
*Log into dev.twitter.com with the McNair Center's creds.
*Paste the security keys, consumer keys, etc identifying the McNair Center API into the form of the widget.
*Set/reset the number of blog posts that are required

==== Categories ====
This section uses a built-in Wordpress widget.

==== Custom Widgets ====
Add a custom (text/html) widget from the widgets to put in the 'Contact Us' and social media icons.

== Requirements ==

== Design ==

== Styling ==
=== Header===
=== Sidebar ===
=== Image Uploads ===
*Images uploaded, both attached to posts and unattached, are added to the media library.
*They are categorized in the backend per the month and the year in which they are uploaded.

*Plugins involved:

** Enhanced Media Library
*** This plugin allows us to
**** create new categories
**** assign images to categories
**** filter in the media library section by category

** Pixabay
*** This plugin allows us to
**** find images from Creative Commons
**** add these images for each post - the Pixabay button can be seen next to the Add Media button on the create post screen.

=== Content ===
=== Footer ===
=== Blog Posts ===
====Titles====
==== Author Info ====
== Usability Features ==
===RSS===
===Subscription Rules===

== User Accounts ==

==Useful resources if there are errors==

Wordpress:
*https://codex.wordpress.org/Upgrading_WordPress_Extended#Step_9:_Run_the_WordPress_upgrade_program
*https://wordpress.org/support/topic/wordpress-45-error-after-update
*https://help.webcontrolcenter.com/kb/a992/vsftpd-ftp-server.aspx

FTP Issues:
*https://help.ubuntu.com/lts/serverguide/ftp-server.html
*http://askubuntu.com/questions/666858/vsftpd-service-will-not-start-for-14-04

[[Category: Internal]]
[[Internal Classification: Internal Resources| ]]

NLP (Internal Tool)

2016-10-07T22:10:06Z

RavaliKruthiventi: Protected "NLP (Internal Tool)" ([Edit=Allow only administrators] (indefinite) [Move=Allow only administrators] (indefinite))

==Introduction==
We (the comp sci group) will focus on building a few tools that can help us mine\learn the behaviors of patents. To understand which of the many technology options available to us yields the most useful results, we will implement various approaches for easy-to-compute measures that can be verified by SQL queries or by the judgment of the Econ group.

Depending on which option pans out for us, we will be extending the approaches to larger sets, or to more complex measures.

==Methodologies==
====Method: ====
====Data Set Used ====
====Result====
====Link to Code====
====Pros and Cons====

NLP (Internal Tool)

2016-10-07T21:18:51Z

RavaliKruthiventi: /* Methodologies */

NLP (Internal Tool)

2016-10-07T21:18:16Z

RavaliKruthiventi: Created page with "==Introduction== We (comp sci group) will be focused on building a few tools that can help us mine\learn the behaviors of patents. To understand which of the many technology o..."

==Introduction==
We (the comp sci group) will focus on building a few tools that can help us mine\learn the behaviors of patents. To understand which of the many technology options available to us yields the most useful results, we will implement various approaches for easy-to-compute measures that can be verified by SQL queries or by the judgment of the Econ group.

Depending on which option pans out for us, we will be extending the approaches to larger sets, or to more complex measures.

==Methodologies==
===Method: ===
===Data Set Used ===
===Result===
===Link to Code===
===Pros and Cons===

Work Hours

2016-08-31T15:28:25Z

RavaliKruthiventi:

Please complete your preferred times for the Fall term of 2015 below.

{| class="wikitable sortable" style="border: 1px solid darkgray; bgcolor: #f9f9f9"
| align="center" style="background:#f0f0f0;"|'''Name'''
| align="center" style="background:#f0f0f0;"|'''Mon'''
| align="center" style="background:#f0f0f0;"|'''Tues'''
| align="center" style="background:#f0f0f0;"|'''Wed'''
| align="center" style="background:#f0f0f0;"|'''Thurs'''
| align="center" style="background:#f0f0f0;"|'''Fri'''
|-
| Albert Nabiullin||||||3-4:30||12:30-3||3-4:30
|-
| Amir Kazempour||||||||||
|-
| Ariel Sun||||||11-12, 1:15-2:30||||11-12, 1:15-2:30
|-
| Ben Baldazo||||||||||
|-
| Carlin Cherry||||||3-5:00||2:30-4||
|-
| Dylan Dickens||1-5:00||||||||
|-
| Harsh Upadhyay||3-5:30||3-5:30||3-5:30||3-5:30||3-5:30
|-
| Jake Silberman||||||||||
|-
| James Chen||||||||||
|-
| Julia Wang||||||1-4:30||||1-4:00
|-
| Marcela Interiano||||||||||
|-
| Meghana Gaur||||||||||
|-
| Ramee Saleh||||||3-5:00||3-5:00||3-5:00
|-
| Ravali Kruthiventi||3-6||||3-6||||3-6
|-
| Todd Rachowin||||||||||
|-
| Veeral Shah||12:30-2:30||||||||12:30-2:30
|-
| Will Cleland||||12:30-4||||12:30-4||2-5:00
|}

[[Category: McNair Admin]]
[[admin_classification::Admin| ]]

Wordpress Blog Site (Tool)

2016-08-18T19:38:39Z

RavaliKruthiventi: /* Customize our theme */

Log in to:
http://www.mcnaircenter.org/blog/wp-admin/

==Install FTP server==

Log in and sudo su yourself, then:

apt-get install vsftpd

Man page for the vsftpd.conf file

http://vsftpd.beasts.org/vsftpd_conf.html

Securing the FTP:

https://help.ubuntu.com/lts/serverguide/ftp-server.html

==Configuration==

Edit /etc/vsftpd.conf (note next restart will reflect changes in /etc/init)

#add at end of file:
listen_port=26

'''Generate keys for our website''' with the following command:

openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout /etc/vsftpd.pem -out /etc/vsftpd.pem

Country Name (2 letter code) [AU]:US
State or Province Name (full name) [Some-State]:Texas
Locality Name (eg, city) []:Houston
Organization Name (eg, company) [Internet Widgits Pty Ltd]:McNair Center at Rice University's Baker Institute
Organizational Unit Name (eg, section) []:
Common Name (e.g. server FQDN or YOUR name) []:McNair Center
Email Address []:admin@mcnaircenter.org

Edit /etc/vsftpd.conf again

#change the lines as follows:
rsa_cert_file=/etc/vsftpd.pem
rsa_private_key_file=/etc/vsftpd.pem
write_enable=YES
chroot_local_user=YES
chroot_list_enable=YES
chroot_list_file=/etc/vsftpd.chroot_list
ssl_enable=YES

Edit /etc/vsftpd.chroot_list to contain a list of usernames (e.g., ravali)

Restart the server

service vsftpd restart

The FTP server should now be accessible. Beware of local packet shaping. Connect through mcnaircenter.org:26. Otherwise, check that the process is running and listening:
ps -aux
netstat -lnt

Assuming all is good with the FTP server, we now need to update Wordpress.

==Update Wordpress==

First make a copy of the wordpress folder and dbase

cp -R /var/lib/wordpress/ /var/lib/wordpress_bak
mysqldump -u mcnair_wp -p wordpress > backup_12Aug2016.sql
(enter password for dbase found in wp-config.php)

Change the permissions on everything in the wordpress folder and make www-data its owner:
chown -R www-data /var/lib/wordpress
chmod -R 755 /var/lib/wordpress

Browse to 128.42.44.180/blog/wp-admin
Click update now. Enter:

Hostname 128.42.44.180:26
FTP Username ravali (or some other account)
FTP Password
Connection Type FTPS (SSL)

Leave the Akismet plugin
Go to appearance, themes -> add new
Choose Accesspress Lite 2.46.7
Activate
Install all of the recommended plugins that come with the theme

Check the media library works by uploading a file (e.g., GreenRoundLogo.png)

Create a child theme
cd /var/lib/wordpress/wp-content/themes
mkdir accesspress-lite-child
vi accesspress-lite-child/style.css
Add in the template from the parent folder's style.css (just the top of the file)
Update the theme name and text domain to accesspress-lite-child.
vi accesspress-lite-child/functions.php
Add in the section that never changes

<?php
function my_theme_enqueue_styles() {

$parent_style = 'parent-style'; // This is 'twentyfifteen-style' for the Twenty Fifteen theme.

wp_enqueue_style( $parent_style, get_template_directory_uri() . '/style.css' );
wp_enqueue_style( 'child-style',
get_stylesheet_directory_uri() . '/style.css',
array( $parent_style ),
wp_get_theme()->get('Version')
);
}
add_action( 'wp_enqueue_scripts', 'my_theme_enqueue_styles' );
?>

Check the permissions on the new files:
chown -R www-data /var/lib/wordpress
chmod -R 755 /var/lib/wordpress

Activate the child theme!
Check out what it looks like: www.mcnaircenter.org/blog

==Customize our theme==
=== Middle Section ===
The middle area of the blog's home page has three sections:

==== The Twitter Feed ====
This widget will display the top 5 tweets of the McNair Center's twitter account.

*In the Appearance -> Widgets section, the theme has the middle section sidebar.
*Add the AccessPress-lite Twitter feed widget to the middle section sidebar
*Log into dev.twitter.com with the McNair Center's creds.
*Paste the security keys, consumer keys, etc identifying the McNair Center API into the form of the widget.
*Set/reset the number of blog posts that are required

==== Categories ====
This section uses a built-in Wordpress widget.

==== Custom Widgets ====
Add a custom (text/html) widget from the widgets to put in the 'Contact Us' and social media icons.

== Requirements ==

== Design ==

== Styling ==
=== Header===
=== Sidebar ===
=== Image Uploads ===
=== Content ===
=== Footer ===
=== Blog Posts ===
====Titles====
==== Author Info ====
== Usability Features ==
===RSS===
===Subscription Rules===

== User Accounts ==

==Useful resources if there are errors==

Wordpress:
*https://codex.wordpress.org/Upgrading_WordPress_Extended#Step_9:_Run_the_WordPress_upgrade_program
*https://wordpress.org/support/topic/wordpress-45-error-after-update
*https://help.webcontrolcenter.com/kb/a992/vsftpd-ftp-server.aspx

FTP Issues:
*https://help.ubuntu.com/lts/serverguide/ftp-server.html
*http://askubuntu.com/questions/666858/vsftpd-service-will-not-start-for-14-04

McNair Center Admin

2016-08-11T21:17:16Z

RavaliKruthiventi: /* Twitter */

[[Category: McNair Admin]]
==Director==

Ed Egan: ed.egan@rice.edu, 617 415 8097, Office 230 Baker Hall [[ Image: MinionBob.jpg | 200x200px ]]

==Research Assistants==

===Summer Term 2016 Schedule===

[[:Category:McNair Staff|Full Staff List]]

{| class="wikitable sortable" style="border: 1px solid darkgray; bgcolor: #f9f9f9"
|-
|Name || Year || Rice Email || Phone ||
|-
|[[Veeral Shah]] || Sophomore || vss2@rice.edu || 914-261-1057 ||
|-
|[[Marcela Interiano]] || Senior || emi2@rice.edu || 832-830-6613 ||
|-
|[[Richard Goldman]] || Sophomore|| rag10@rice.edu || 713-689-8371 ||
|-
|[[Ariel Sun]] || Senior || hs28@rice.edu || 832-931-3358 ||
|-
|[[Jake Silberman]] || Junior|| wjs4@rice.edu || 512-590-2062 ||
|-
|[[Dylan Dickens]] || Junior|| dtd4@rice.edu || 832-691-6590 ||
|-
|[[User:GunnyLiu|Gunny Liu]] || Sophomore|| jl134@rice.edu || 346-228-6657 ||
|-
|[[Ravali Kruthiventi|Ravali Kruthiventi]] || Second Year Graduate Student|| sk99@rice.edu || 512-506-1552 ||
|-
|[[McNair Staff:Shoeb Mohammed |Shoeb Mohammed]] || Second Year Graduate Student|| sm55@rice.edu || 979-402-9133 ||

|}

===Email addresses===

Director:
*ed.egan@rice.edu

Center:
*admin@mcnaircenter.org (goes to director's gmail)
*mcnair@rice.edu (goes to a dedicated Rice email box. Log in at https://webmail.mail.rice.edu/)

Researchers (copy and paste for distribution):

vss2@rice.edu, emi2@rice.edu, rag10@rice.edu, hs28@rice.edu, wjs4@rice.edu, dtd4@rice.edu, jl134@rice.edu

==Hiring Instructions==

For those hired for pay:
# Log in to Esther and complete a '''SPAF (Student Personnel Action Form)''' - just search for it and it comes up pre-populated. Print out the completed form, sign it, and return it to Ed or to the finance office downstairs. Don't worry about completing the job title, rate, or other fields. This will be done for you.
# If you don't have an '''I-9 employment verification''' filed with Rice University, you'll need to get one. You'll have one if you've worked for Rice before. Otherwise, '''bring your passport''' (original, not a copy of the page) to the Baker Institute's finance team during regular business hours, and they will help you. If you can't find the finance office, ask Ed to be taken downstairs to see Giovanna or Christine, or ask the receptionist on the 1st floor for directions.
# If you haven't already set yourself up for '''direct deposit''', you can do that through Esther.
# Complete your timesheets (available through Esther) every two weeks. These are now submitted electronically. They are checked by Ed and the Baker administration, so make sure that you complete them correctly.

==Social Science Internship==

Social science students with a declared major, who have not previously taken SOSC421, may take SOSC421 and gain 3 graded general degree credits for an internship with the McNair Center. '''The intern must work at least 10hrs a week for at least 8 weeks, and cannot receive other compensation for their efforts.''' Students selected for 3 credit internships must read the [http://socialsciencesgateway.rice.edu/uploadedFiles/Social_Sciences_Gateway/Internships/4.SOSC%20421-Internship_NEW.doc syllabus document], and complete the [http://goo.gl/forms/owxxVZDlCL online intern agreement]. The program is administrated by the Social Sciences Gateway office. As a part of the program the student must complete two one-page written descriptions of the internship and one self-evaluation. The center's director will provide two additional evaluations, and a grade.

More information is available from:
*https://socialsciencesgateway.rice.edu/Content.aspx?id=2147484424&libID=2147484424

==Login information==

===Twitter===

The Center's Twitter Account:
admin@mcnaircenter.org
Password: 9million

Handle
@bakermcnair

===Local Machines===

Usernames and passwords for local machines are:
.\McNairCenterLocal
9million

===Dropbox===

The center's dropbox account is:
admin@mcnaircenter.org
9million

===WRDS===

mcnair
9Mil2015

===SDC Platinum===

The initials are '''mc'''

===PACER===

ed.egan.mcnair
9.Million
city: Houston
favorite college: Rice

==Hardware and software==

Install TextPad from http://www.textpad.com (it's nag-ware with a free download and install).

==Current projects==
Veeral Shah:

Marcela Interiano:

Richard Goldman:

Ariel Sun:

Jake Silberman:

Dylan Dickens:

Gunny Liu:

==Management==

*[[Meeting Logs]]

[[admin_classification::General Information| ]]

Wordpress Blog Site (Tool)

2016-08-05T16:06:33Z

RavaliKruthiventi: /* Error Logs */

== Requirements ==

== Design ==

== Styling ==
=== Header===
=== Sidebar ===
=== Image Uploads ===
=== Content ===
=== Footer ===
=== Blog Posts ===
====Titles====
==== Author Info ====
== Usability Features ==
===RSS===
===Subscription Rules===

== User Accounts ==

== Error Logs ==

McNair Center Wordpress blog

Setup:
Images:
1. Wordpress currently looks for images in a completely different location than the one it is uploading to.
2. It also has trouble generating the three standard sizes - thumbnails, etc

Permissions:

Styling:
Child themes:
*Usage of child themes when creating custom design with wordpress is recommended.
*Steps to create:
**Create a folder in /var/lib/wordpress/wp-content/themes with the title of your choice
**Copy header.php, style.css and functions.php from the parent theme's folder into the newly created child theme folder
**If the child files are blank, then all the parent theme's corresponding code is preserved.
**Otherwise, any code added to a child theme file overrides the corresponding code in the parent theme.
**The webkit modules that adjust the display for mobile interfaces are best not changed.
*Once the files are created, to the style.css, add the template section enclosed in '/* ' and '*/' from the parent's style.css file.
*Go to the Wordpress dashboard, login as admin, and add the theme to wordpress (button should appear on the UI, along with the child theme) in the themes section

Header
1. Header functions changed:
* The default header that comes with the twentysixteen has the header set within the same margins that govern the body of the blog.
* We want our header to stretch across the UI like a banner.
* To do so,
** I removed the header from the div classes from the header.php file.
** I added some div classes around the header so that we could style

Sidebar
1. Addition of text widgets
* We need some text + image based widgets added to the sidebar.
* These can be added with basic html and css (inline) as a text widget to the sidebar.
* Fonts changed to :
* border width reduced.

Custom menus
Custom menus can be created and registered. Steps:
1.
Footer

Helpful Links:

-------------------------------------------
-- Installing FTPS Server on Web Servers
-------------------------------------------

Objective: Install FTPS server on the web servers on port 26 - test server followed by the production server.

Steps Followed:
*

Helpful links:
*

openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout /etc/vsftpd.pem -out /etc/vsftpd.pem

rsa_cert_file=/etc/vsftpd.pem
rsa_private_key_file=/etc/vsftpd.pem

username: webadmin
password: 9Million!

------- Aug 2nd -------------

''' Man page for the vsftpd.conf file '''
http://vsftpd.beasts.org/vsftpd_conf.html

Securing the FTP:
https://help.ubuntu.com/lts/serverguide/ftp-server.html

'''Customization:'''
''''Change the port:''''
Add line to /etc/vsftpd.conf:
listen_port=26

Restart the server with the command:
sudo service vsftpd restart

Check the installation by checking via a browser, the following address:
http://128.42.44.22:26

''''Add users''''

''''Generate keys for our website''''
Generate the key with the following command:
openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout /etc/vsftpd.pem -out /etc/vsftpd.pem

Add\Update the following lines in the /etc/vsftpd.conf:
rsa_cert_file=/etc/vsftpd.pem
rsa_private_key_file=/etc/vsftpd.pem

'''' Adding Users''''

FTP : Files not accessible:
Add the following to wp-config.php
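if ( is_admin() ) {
// Force WordPress to write files directly instead of going through FTP.
// An anonymous function replaces create_function(), which newer PHP versions no longer provide.
add_filter( 'filesystem_method', function ( $method ) { return 'direct'; } );
// Permissions applied to directories that WordPress creates.
define( 'FS_CHMOD_DIR', 0751 );
}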

--- Back Up:

Folders:
Copy created
Database:
mysqldump -u mcnair_wp -p wordpress > backup_3Aug2016.sql

-- Update:

https://codex.wordpress.org/Upgrading_WordPress_Extended#Step_9:_Run_the_WordPress_upgrade_program

Error Resolution:
https://wordpress.org/support/topic/wordpress-45-error-after-update

''''' in case of errors, try: '''''
https://help.webcontrolcenter.com/kb/a992/vsftpd-ftp-server.aspx

FTP Issues:
https://help.ubuntu.com/lts/serverguide/ftp-server.html

http://askubuntu.com/questions/666858/vsftpd-service-will-not-start-for-14-04

Wordpress Blog Site (Tool)

2016-08-05T16:05:17Z

RavaliKruthiventi: /* Error Logs */

Wordpress Blog Site (Tool)

2016-08-05T16:03:04Z

RavaliKruthiventi: /* Error Logs */

Wordpress Blog Site (Tool)

2016-08-05T16:01:04Z

RavaliKruthiventi:

EdEganDotCom:Users For Recent Activity Email Notification

2016-07-28T21:41:15Z

RavaliKruthiventi:

# <pre>
# --- Email Notify List---
# This is a list of users who want to receive an email when wiki activity
# occurs from new or anonymous users. An email is sent only for an edit by
# a registered or anonymous user if the registered user is less than 4 hours
# old, or if an email has not been sent since the last 4 hours. If you'd like
# to add or remove yourself from this list leave a message on the Talk page.
# Make sure you have a valid email address set in your Preferences.

Dayton

## end of Email Notify List

#</pre>

EdEganDotCom:Users For Recent Activity Email Notification

2016-07-28T21:40:35Z

RavaliKruthiventi:

# <pre>
# --- Email Notify List---
# This is a list of users who want to receive an email when wiki activity
# occurs from new or anonymous users. An email is sent only for an edit by
# a registered or anonymous user if the registered user is less than 4 hours
# old, or if an email has not been sent since the last 4 hours. If you'd like
# to add or remove yourself from this list leave a message on the Talk page.
# Make sure you have a valid email address set in your Preferences.

User1
User2
User3

## end of Email Notify List

#</pre>

EdEganDotCom:Users For Recent Activity Email Notification

2016-07-28T21:39:38Z

RavaliKruthiventi: Created page with "# --- Email Notify List--- # This is a list of users who want to receive an email when wiki activity # occurs from new or anonymous users. An email is sent only for an edit b..."

# --- Email Notify List---
# This is a list of users who want to receive an email when wiki activity
# occurs from new or anonymous users. An email is sent only for an edit by
# a registered or anonymous user if the registered user is less than 4 hours
# old, or if an email has not been sent since the last 4 hours. If you'd like
# to add or remove yourself from this list leave a message on the Talk page.
# Make sure you have a valid email address set in your Preferences.

Dayton

## end of Email Notify List

#

Wordpress Blog Site (Tool)

2016-07-22T15:15:18Z

RavaliKruthiventi: Created page with "== Requirements == == Design == == Styling == === Header=== === Sidebar === === Image Uploads === === Content === === Footer === === Blog Posts === ====Titles==== ==== Auth..."

Ravali Kruthiventi (Work Log)

2016-07-21T16:25:47Z

RavaliKruthiventi:

=== June 2016 ===
06/2016 - Worked on:
*Setting up wiki, including
**coloring the wiki links, headers, etc
**Adding access limitations, confirmation queues - access rules
**adding extensions that everyone needed

*Setting up a database for merging our patent data from the Harvard dataverse and the patent data that the McNair Center collected from 2010 - 2015
*Setting up tables as needed for other projects
*Worked on the test blog site to see how Wordpress can be messed with.

===July 2016 ===

07/2016 - Worked on:
*Setting up the assignees data from the USPTO into a database into tables that Amir and Marcela can use.
*worked on cleaning up the patent data and databases (renaming, typecasting, etc)
*Currently working on the blog.

==== Week 2 ====
07/14/2016 - Worked on the blog header
07/15/2016 - Worked on and created my research plan, added html to header. The blocks are now there, they require styling to deal with mobile interfaces.

==== Week 3 ====
07/18/2016 - Blog
*Tried to add two posts per category - this seems to require the creation of a new widget\layout. Abandoning as it is likely to take too much time.
** Possible future revisit
*Determined how to add custom widgets to the sidebar - was able to add a basic box to the sidebar

07/19/2016 - Blog
*Added custom widgets to the sidebar of the blog
*Changed the fonts, separators, footers.
*Added social media links to the header, rather than the footer (default wordpress location is the footer)

07/20/2016 - Blog
*Had a quick review of the blog, and uncovered some issues that are apparent on other screen resolutions. Worked on spacing issues on the blog. EOD Update - some spacing issues exist. Will be inserting a table into the header to make the spacing better.
*Brainstormed changes to uspto assignees data. Will begin work on removing the duplicates tomorrow.

07/21/2016 - Blog, Assignees Data, LinkedIn Account Creation
*Worked on removing duplicates from the assignees data.
*Working on wrapping issues
*Agenda:
**Fix spacing, insert table into header for lower screen resolutions
**Set up account for Dylan
**Get sample content and uploads from Dylan
**Fix header colors and patterns
**Get logo up
**Set up links to Baker and the McNair Center
**Tie it all up together

Ravali Kruthiventi (Work Log)

2016-07-21T16:19:13Z

RavaliKruthiventi:

06/2016 - Worked on:
*Setting up wiki, including
**coloring the wiki links, headers, etc
**Adding access limitations, confirmation queues - access rules
**adding extensions that everyone needed

*Setting up a database for merging our patent data from the Harvard dataverse and the patent data that the McNair Center collected from 2010 - 2015
*Setting up tables as needed for other projects
*Worked on the test blog site to see how Wordpress can be messed with.

07/2016 - Worked on:
*Setting up the assignees data from the USPTO into a database into tables that Amir and Marcela can use.
*worked on cleaning up the patent data and databases (renaming, typecasting, etc)
*Currently working on the blog.

07/14/2016 - Worked on the blog header
07/15/2016 - Worked on and created my research plan, added html to header. The blocks are now there, they require styling to deal with mobile interfaces.

07/18/2016 - Blog
*Tried to add two posts per category - this seems to require the creation of a new widget\layout. Abandoning as it is likely to take too much time.
** Possible future revisit
*Determined how to add custom widgets to the sidebar - was able to add a basic box to the sidebar

07/19/2016 - Blog
*Added custom widgets to the sidebar of the blog
*Changed the fonts, separators, footers.
*Added social media links to the header, rather than the footer (default wordpress location is the footer)

07/20/2016 - Blog
*Had a quick review of the blog, and uncovered some issues that are apparent on other screen resolutions. Worked on spacing issues on the blog. EOD Update - some spacing issues exist. Will be inserting a table into the header to make the spacing better.
*Brainstormed changes to uspto assignees data. Will begin work on removing the duplicates tomorrow.
07/21/2016 -
*Worked on removing duplicates from the assignees data.
*Working on wrapping issues

Ravali Kruthiventi (Work Log)

2016-07-15T17:47:28Z

RavaliKruthiventi:

Ravali Kruthiventi (Research Plan)

2016-07-15T17:39:27Z

RavaliKruthiventi: /* Project - USPTO Assignees, Patent and Citation Data */

===Project - USPTO Assignees, Patent and Citation Data===
==== Assignees Data ====
*Data source: [[Patent Data Processing - SQL Steps | patent database]] (merged data from patent_2015 and patentdata databases)
**Issues: citations data contains non numeric patent numbers (likely application numbers, etc)
**Solution:
***Segregate into smaller tables so that Amir and Marcela can identify patterns
***link back to appropriate patent numbers from the patent table
**Time to implement: 1 day
**Priority:
**Teams waiting for it:
*** Marcela and Amir
****Project : [[ Patent Data Wiki Page | Patent data analysis ]]
***Jake and James, potentially could need this down the line
****Project :[[Leveraged Buyout Innovation (Academic Paper)| LBO data]]
** Deadline:

*Data Source: [[USPTO Assignees Data | USPTO Bulk Data repository]]
** Issues:
*** The script inserts copies of data into the tables.
*** Analysis required on the data to make sure the data was inserted correctly from the XML files.
*** Analysis is also required to determine whether this data is better than the data we have in the patent database right now.
**** Action owners : Amir and Marcela
** Solution:
*** Amir and Marcela and/or I need to look at the data to determine quality
**** If they find that any of this data is better than the data we currently have, I will have to figure out a way to integrate this data into our data model for patent data.
*** Amir and Marcela and/or I will need to delete the copies
** Time to implement:
** Priority:
** Teams waiting for it:
** Deadline:

===Project - Lex Machina Data===
*Data Source:
** Issues:
** Solution:
** Time to implement:
** Priority:
** Teams waiting for it:
** Deadline:

=== Project - Pattern Recognition on Patent Data through Machine Learning ===

*Data Source: The patent database.
** Plan:
***Technique
****Determine research question to be asked
****Scrub data
****Determine 3-4 mining\machine learning techniques to best extract patterns
****Train the algorithms
****Run the algos on sample dataset
****Determine the algo with best results
****Implement the
** Known Issues:
***Dataset to be cleaned, quality analyzed as specified above.
**Deliverables
***Set of patterns to base further research on
***Research paper (?)
****Documentation - Wiki page
** Time to implement:
** Priority:
** Teams waiting for it: None
** Deadline:

Ravali Kruthiventi (Research Plan)

2016-07-15T17:35:27Z

RavaliKruthiventi: /* Project - Pattern Recognition on Patent Data through Machine Learning */

===Project - USPTO Assignees, Patent and Citation Data===
==== Assignees Data ====
*Data source: [[Patent Data Processing - SQL Steps | patent database]] (merged data from patent_2015 and patentdata databases)
**Issues: citations data contains non numeric patent numbers (likely application numbers, etc)
**Solution:
***Segregate into smaller tables so that Amir and Marcela can identify patterns
***link back to appropriate patent numbers from the patent table
**Time to implement: 1 day
**Priority:
**Teams waiting for it:
*** Marcela and Amir
****Project : [[ Patent Data Wiki Page | Patent data analysis ]]
***Jake and James, potentially could need this down the line
****Project :[[Leveraged Buyout Innovation (Academic Paper)| LBO data]]
** Deadline:
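
One way to approach the "segregate into smaller tables" step is to bucket the citation values by their coarse shape, so that non-numeric patterns become visible. The following is only a minimal sketch: the sample values and the helper names <code>pattern_of</code> and <code>segregate</code> are hypothetical and are not taken from the actual citations table.

```python
import re
from collections import defaultdict

# Hypothetical sample values; the real citation rows will differ.
SAMPLE = ["5123456", "D345678", "RE34567", "PP12345", "08/123456", "H1234", ""]


def pattern_of(value: str) -> str:
    """Collapse a raw patent-number string into a coarse shape.

    Runs of digits become '9' and runs of letters become 'A',
    so 'D345678' maps to 'A9' and '08/123456' maps to '9/9'.
    """
    shape = re.sub(r"[0-9]+", "9", value.strip())
    shape = re.sub(r"[A-Za-z]+", "A", shape)
    return shape or "(empty)"


def segregate(values):
    """Group values by shape; purely numeric values land in the '9' bucket."""
    buckets = defaultdict(list)
    for v in values:
        buckets[pattern_of(v)].append(v)
    return dict(buckets)


groups = segregate(SAMPLE)
for shape, members in sorted(groups.items()):
    print(shape, members)
```

Each bucket could then be written to its own small table for review, with only the purely numeric bucket linked straight back to the patent table.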

*Data Source: [[USPTO Assignees Data | USPTO Bulk Data repository]]
** Issues:
*** The script inserts copies of data into the tables.
*** Analysis required on the data to make sure the data was inserted correctly from the XML files.
*** Analysis is also required to determine whether this data is better than the data we have in the patent database right now.
**** Action owners : Amir and Marcela
** Solution:
*** Amir and Marcela and/or I need to look at the data to determine quality
**** If they find that any of this data is better than the data we currently have, I will have to figure out a way to integrate this data into our data model for patent data.
*** Amir and Marcela and/or I will need to delete the copies

** Time to implement:
** Priority:
** Teams waiting for it:
** Deadline:
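
Deleting the copies that the script inserted could look roughly like the sketch below. It uses an in-memory SQLite database so that it runs on its own; the table <code>ptoassignee_demo</code> and its columns are hypothetical, and SQLite's <code>rowid</code> stands in for whatever row identifier is available in the real PostgreSQL tables (for example the <code>ctid</code> system column).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; the real USPTO assignment tables have more columns.
conn.execute("CREATE TABLE ptoassignee_demo (reelno INTEGER, orgname TEXT)")
conn.executemany(
    "INSERT INTO ptoassignee_demo VALUES (?, ?)",
    [(1, "ACME"), (1, "ACME"), (2, "GLOBEX"), (2, "GLOBEX"), (3, "INITECH")],
)

# Keep the first physical row of every group of identical rows, delete the rest.
conn.execute(
    """
    DELETE FROM ptoassignee_demo
    WHERE rowid NOT IN (
        SELECT MIN(rowid)
        FROM ptoassignee_demo
        GROUP BY reelno, orgname
    )
    """
)

remaining = conn.execute(
    "SELECT reelno, orgname FROM ptoassignee_demo ORDER BY reelno"
).fetchall()
print(remaining)
```

The GROUP BY list has to name every column that defines "the same row"; leaving one out would delete rows that only look like copies.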

===Project - Lex Machina Data===
*Data Source:
** Issues:
** Solution:
** Time to implement:
** Priority:
** Teams waiting for it:
** Deadline:

=== Project - Pattern Recognition on Patent Data through Machine Learning ===

*Data Source: The patent database.
** Plan:
***Technique
****Determine research question to be asked
****Scrub data
****Determine 3-4 mining/machine learning techniques to best extract patterns
****Train the algorithms
****Run the algos on sample dataset
****Determine the algo with best results
****Implement the
** Known Issues:
***Dataset to be cleaned, quality analyzed as specified above.
**Deliverables
***Set of patterns to base further research on
***Research paper (?)
****Documentation - Wiki page
** Time to implement:
** Priority:
** Teams waiting for it: None
** Deadline:

Ravali Kruthiventi (Research Plan)

2016-07-15T17:30:52Z

RavaliKruthiventi: /* Assignees Data */

===Project - USPTO Assignees, Patent and Citation Data===
==== Assignees Data ====
*Data source: [[Patent Data Processing - SQL Steps | patent database]] (merged data from patent_2015 and patentdata databases)
**Issues: citations data contains non numeric patent numbers (likely application numbers, etc)
**Solution:
***Segregate into smaller tables so that Amir and Marcela can identify patterns
***link back to appropriate patent numbers from the patent table
**Time to implement: 1 day
**Priority:
**Teams waiting for it:
*** Marcela and Amir
****Project : [[ Patent Data Wiki Page | Patent data analysis ]]
***Jake and James, potentially could need this down the line
****Project :[[Leveraged Buyout Innovation (Academic Paper)| LBO data]]
** Deadline:

*Data Source: [[USPTO Assignees Data | USPTO Bulk Data repository]]
** Issues:
*** The script inserts copies of data into the tables.
*** Analysis required on the data to make sure the data was inserted correctly from the XML files.
*** Analysis is also required to determine whether this data is better than the data we have in the patent database right now.
**** Action owners : Amir and Marcela
** Solution:
*** Amir and Marcela and/or I need to look at the data to determine quality
**** If they find that any of this data is better than the data we currently have, I will have to figure out a way to integrate this data into our data model for patent data.
*** Amir and Marcela and/or I will need to delete the copies

** Time to implement:
** Priority:
** Teams waiting for it:
** Deadline:

===Project - Lex Machina Data===
*Data Source:
** Issues:
** Solution:
** Time to implement:
** Priority:
** Teams waiting for it:
** Deadline:

=== Project - Pattern Recognition on Patent Data through Machine Learning ===

*Data Source:

** Plan:
***Technique

** Known Issues:
** Solution:
** Time to implement:
** Priority:
** Teams waiting for it:
** Deadline:

Ravali Kruthiventi (Research Plan)

2016-07-15T17:29:08Z

RavaliKruthiventi: /* Assignees Data */

===Project - USPTO Assignees, Patent and Citation Data===
==== Assignees Data ====
*Data source: [[Patent Data Processing - SQL Steps | patent database]] (merged data from patent_2015 and patentdata databases)
**Issues: citations data contains non numeric patent numbers (likely application numbers, etc)
**Solution:
***Segregate into smaller tables so that Amir and Marcela can identify patterns
***link back to appropriate patent numbers from the patent table
**Time to implement: 1 day
**Priority:
**Teams waiting for it:
*** Marcela and Amir
****Project : [[ Patent Data Wiki Page | Patent data analysis ]]
***Jake and James, potentially could need this down the line
****Project :[[Leveraged Buyout Innovation (Academic Paper)| LBO data]]
** Deadline:

*Data Source: [[USPTO Assignees Data | USPTO Bulk Data repository]]
** Issues:
*** The script inserts copies of data into the tables.
*** Analysis required on the data to make sure the data was inserted correctly from the XML files.
*** Analysis is also required to determine whether this data is better than the data we have in the patent database right now.
**** Action owners : Amir and Marcela
** Solution:
*** Amir and Marcela and/or I need to look at the data to determine quality
*** Amir and Marcela and/or I will need to delete the copies
** Time to implement:
** Priority:
** Teams waiting for it:
** Deadline:

===Project - Lex Machina Data===
*Data Source:
** Issues:
** Solution:
** Time to implement:
** Priority:
** Teams waiting for it:
** Deadline:

=== Project - Pattern Recognition on Patent Data through Machine Learning ===

*Data Source:

** Plan:
***Technique

** Known Issues:
** Solution:
** Time to implement:
** Priority:
** Teams waiting for it:
** Deadline:

Ravali Kruthiventi (Research Plan)

2016-07-15T17:26:37Z

RavaliKruthiventi: /* Assignees Data */

===Project - USPTO Assignees, Patent and Citation Data===
==== Assignees Data ====
*Data source: [[Patent Data Processing - SQL Steps | patent database]] (merged data from patent_2015 and patentdata databases)
**Issues: citations data contains non numeric patent numbers (likely application numbers, etc)
**Solution:
***Segregate into smaller tables so that Amir and Marcela can identify patterns
***link back to appropriate patent numbers from the patent table
**Time to implement: 1 day
**Priority:
**Teams waiting for it:
*** Marcela and Amir
****Project : Patent data analysis (?)
***Jake and James, potentially could need this down the line
****Project : LBO data
** Deadline:

*Data Source: [[USPTO Assignees Data | USPTO Bulk Data repository]]
** Issues:
*** The script inserts copies of data into the tables.
*** Analysis required on the data to make sure the data was inserted correctly from the XML files.
*** Analysis is also required to determine whether this data is better than the data we have in the patent database right now.
**** Action owners : Amir and Marcela
** Solution:
*** Amir and Marcela and/or I need to look at the data to determine quality
*** Amir and Marcela and/or I will need to delete the copies
** Time to implement:
** Priority:
** Teams waiting for it:
** Deadline:

===Project - Lex Machina Data===
*Data Source:
** Issues:
** Solution:
** Time to implement:
** Priority:
** Teams waiting for it:
** Deadline:

=== Project - Pattern Recognition on Patent Data through Machine Learning ===

*Data Source:

** Plan:
***Technique

** Known Issues:
** Solution:
** Time to implement:
** Priority:
** Teams waiting for it:
** Deadline:

Ravali Kruthiventi (Research Plan)

2016-07-15T17:24:07Z

RavaliKruthiventi: /* Assignees Data */

Ravali Kruthiventi (Research Plan)

2016-07-15T17:18:16Z

RavaliKruthiventi: /* Assignees Data */

===Project - USPTO Assignees, Patent and Citation Data===
==== Assignees Data ====
*Data source: patent database (merged data from patent_2015 and patentdata databases)
**Issues: citations data contains non numeric patent numbers (likely application numbers, etc)
**Solution:
***Segregate into smaller tables so that Amir and Marcela can identify patterns
***link back to appropriate patent numbers from the patent table
**Time to implement: 1 day
**Priority:
**Teams waiting for it:
*** Marcela and Amir
****Project : Patent data analysis (?)
***Jake and James, potentially could need this down the line
****Project : LBO data
** Deadline:

*Data Source: USPTO Bulk Data repository
** Issues:
*** The script inserts copies of data into the tables.
*** Analysis required on the data to make sure the data was inserted correctly from the XML files.
*** Analysis is also required to determine whether this data is better than the data we have in the patent database right now.
**** Action owners : Amir and Marcela
** Solution:
*** Amir and Marcela and/or I need to look at the data to determine quality
*** Amir and Marcela and/or I will need to delete the copies
** Time to implement:
** Priority:
** Teams waiting for it:
** Deadline:

===Project - Lex Machina Data===
*Data Source:
** Issues:
** Solution:
** Time to implement:
** Priority:
** Teams waiting for it:
** Deadline:

=== Project - Pattern Recognition on Patent Data through Machine Learning ===

*Data Source:

** Plan:
***Technique

** Known Issues:
** Solution:
** Time to implement:
** Priority:
** Teams waiting for it:
** Deadline:

Ravali Kruthiventi (Research Plan)

2016-07-15T17:17:44Z

RavaliKruthiventi: /* Assignees Data */

===Project - USPTO Assignees, Patent and Citation Data===
==== Assignees Data ====
*Data source: patent database (merged data from patent_2015 and patentdata databases)
**Issues: citations data contains non numeric patent numbers (likely application numbers, etc)

**Solution:
***Segregate into smaller tables so that Amir and Marcela can identify patterns
***link back to appropriate patent numbers from the patent table
**Time to implement: 1 day
**Priority:
**Teams waiting for it:
*** Marcela and Amir
****Project : Patent data analysis (?)
***Jake and James, potentially could need this down the line
****Project : LBO data
** Deadline:
*Data Source: USPTO Bulk Data repository
** Issues:
*** The script inserts copies of data into the tables.
*** Analysis required on the data to make sure the data was inserted correctly from the XML files.
*** Analysis is also required to determine whether this data is better than the data we have in the patent database right now.
**** Action owners : Amir and Marcela
** Solution:
*** Amir and Marcela and/or I need to look at the data to determine quality
*** Amir and Marcela and/or I will need to delete the copies
** Time to implement:
** Priority:
** Teams waiting for it:
** Deadline:

===Project - Lex Machina Data===
*Data Source:
** Issues:
** Solution:
** Time to implement:
** Priority:
** Teams waiting for it:
** Deadline:

=== Project - Pattern Recognition on Patent Data through Machine Learning ===

*Data Source:

** Plan:
***Technique

** Known Issues:
** Solution:
** Time to implement:
** Priority:
** Teams waiting for it:
** Deadline:

Ravali Kruthiventi (Research Plan)

2016-07-15T17:15:33Z

RavaliKruthiventi:

===Project - USPTO Assignees, Patent and Citation Data===
==== Assignees Data ====
*Data source: patent database (merged data from patent_2015 and patentdata databases)
**Issues: citations data contains non numeric patent numbers (likely application numbers, etc)

**Solution:
***Segregate into smaller tables so that Amir and Marcela can identify patterns
***link back to appropriate patent numbers from the patent table

**Time to implement: 1 day

**Priority:

**Teams waiting for it:
*** Marcela and Amir
****Project : Patent data analysis (?)
***Jake and James, potentially could need this down the line
****Project : LBO data

** Deadline:

*Data Source: USPTO Bulk Data repository
** Issues:
*** The script inserts copies of data into the tables.
*** Analysis required on the data to make sure the data was inserted correctly from the XML files.
*** Analysis is also required to determine whether this data is better than the data we have in the patent database right now.
**** Action owners : Amir and Marcela

** Solution:
*** Amir and Marcela and/or I need to look at the data to determine quality
*** Amir and Marcela and/or I will need to delete the copies

** Time to implement:
** Priority:
** Teams waiting for it:
** Deadline:

===Project - Lex Machina Data===
*Data Source:
** Issues:
** Solution:
** Time to implement:
** Priority:
** Teams waiting for it:
** Deadline:

=== Project - Pattern Recognition on Patent Data through Machine Learning ===

*Data Source:

** Plan:
***Technique

** Known Issues:
** Solution:
** Time to implement:
** Priority:
** Teams waiting for it:
** Deadline:

Ravali Kruthiventi (Research Plan)

2016-07-15T17:14:26Z

RavaliKruthiventi: /* Project - USPTO Assignees, Patent and Citation Data */

===Project - USPTO Assignees, Patent and Citation Data===
==== Assignees Data ====
*Data source: patent database (merged data from patent_2015 and patentdata databases)
** Issues: citations data contains non numeric patent numbers (likely application numbers, etc)

** Solution:
*** Segregate into smaller tables so that Amir and Marcela can identify patterns
*** link back to appropriate patent numbers from the patent table

** Time to implement: 1 day

** Priority:

** Teams waiting for it:
*** Marcela and Amir
****Project : Patent data analysis (?)
*** Jake and James, potentially could need this down the line
**** Project : LBO data

** Deadline:

*Data Source: USPTO Bulk Data repository
** Issues:
*** The script inserts copies of data into the tables.
*** Analysis required on the data to make sure the data was inserted correctly from the XML files.
*** Analysis is also required to determine whether this data is better than the data we have in the patent database right now.
**** Action owners : Amir and Marcela

** Solution:
*** Amir and Marcela and/or I need to look at the data to determine quality
*** Amir and Marcela and/or I will need to delete the copies

** Time to implement:
** Priority:
** Teams waiting for it:
** Deadline:

===Project - Lex Machina Data===
*Data Source:
** Issues:
** Solution:
** Time to implement:
** Priority:
** Teams waiting for it:
** Deadline:

=== Project - Pattern Recognition on Patent Data through Machine Learning ===

*Data Source:

** Plan:
***Technique

** Known Issues:
** Solution:
** Time to implement:
** Priority:
** Teams waiting for it:
** Deadline:

Ravali Kruthiventi (Research Plan)

2016-07-15T17:03:08Z

RavaliKruthiventi: Created page with " ===Project - USPTO Assignees, Patent and Citation Data=== ==== Assignees Data ==== *Data source: ** Issues: ** Solution: ** Time to implement: ** Priority: ** Teams waiting f..."

===Project - USPTO Assignees, Patent and Citation Data===
==== Assignees Data ====
*Data source:
** Issues:
** Solution:
** Time to implement:
** Priority:
** Teams waiting for it:
** Deadline:

*Data Source:
** Issues:
** Solution:
** Time to implement:
** Priority:
** Teams waiting for it:
** Deadline:

===Project - Lex Machina Data===
*Data Source:
** Issues:
** Solution:
** Time to implement:
** Priority:
** Teams waiting for it:
** Deadline:

=== Project - Pattern Recognition on Patent Data through Machine Learning ===

*Data Source:

** Plan:
***Technique

** Known Issues:
** Solution:
** Time to implement:
** Priority:
** Teams waiting for it:
** Deadline:

Ravali Kruthiventi (Work Log)

2016-07-15T16:55:42Z

RavaliKruthiventi: Created page with "07/14/2016 - Worked on the blog header 07/15/2016 - Worked on and created my research plan"

07/14/2016 - Worked on the blog header
07/15/2016 - Worked on and created my research plan

Patent Data Cleanup (June 2016)

2016-07-07T16:33:22Z

RavaliKruthiventi: /* About this Page */

== About this Page ==

This page contains the script that was used to clean up the patents and assignees tables in allpatent.

Cleaning up includes:
* Cleaning up the 'NULL' string and -1 inserts: when merging the patentdata and patent_2015 databases, I inserted 'NULL' strings in text columns and -1 in integer columns to distinguish the NULLs that came from the vendor from the placeholders I inserted where the two schemas had no overlapping column.
** The 'NULL' strings were replaced with NULL.
** The -1 values were replaced with NULL as well.

* Merging some more columns, and dropping unnecessary columns:
** At the time of merging the tables, some columns, particularly in the patent table, were not merged as they should have been.
** The script that follows merges those columns as well.
NOTE: The patent data page detailing the SQL steps followed to merge the data now has the updated table structures. The script on this page can be used as a reference when trying to debug any (unlikely) merging errors

* Renaming tables and columns
** Table names and column names have been standardized.
** The general rule of thumb is: short column names and singular table names (for example, patent rather than patents).
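
The placeholder cleanup described above follows one repeated pattern: for each column, turn the sentinel back into a real NULL. The sketch below shows that pattern against an in-memory SQLite database so that it runs without a PostgreSQL server. The table <code>patents_demo</code> and the helper <code>clear_sentinel</code> are hypothetical and are not part of the script that was actually run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical miniature of the patents table with both kinds of placeholder.
conn.execute(
    "CREATE TABLE patents_demo (patent INTEGER, prioritycountry TEXT, appnum INTEGER)"
)
conn.executemany(
    "INSERT INTO patents_demo VALUES (?, ?, ?)",
    [(1, "NULL", -1), (2, "JP", 123), (3, "NULL", 456)],
)


def clear_sentinel(conn, table, column, sentinel):
    """Replace one placeholder value with a real NULL; return rows changed.

    Table and column names are interpolated directly, so this helper is
    only suitable for trusted, hard-coded identifiers.
    """
    cur = conn.execute(
        f"UPDATE {table} SET {column} = NULL WHERE {column} = ?", (sentinel,)
    )
    return cur.rowcount


text_fixed = clear_sentinel(conn, "patents_demo", "prioritycountry", "NULL")
int_fixed = clear_sentinel(conn, "patents_demo", "appnum", -1)
rows = conn.execute("SELECT * FROM patents_demo ORDER BY patent").fetchall()
print(text_fixed, int_fixed, rows)
```

The returned row counts play the same role as the <code>UPDATE n</code> lines recorded in the results further down this page: they are a quick check that each statement touched the expected number of rows.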

== Script ==
ALTER TABLE patents
RENAME COLUMN patentnumber TO patent;

ALTER TABLE patents
DROP COLUMN kind,
DROP COLUMN title,
DROP COLUMN ussubclass, -- **
DROP COLUMN maingroup, --
DROP COLUMN subgroup, --
DROP COLUMN cpcsubclass, -- ++
DROP COLUMN cpcmaingroup, -- ++
DROP COLUMN classificationnationalcountry,
DROP COLUMN classificationnationalclass, -- ** (?)
DROP COLUMN primaryexaminerfirstname,
DROP COLUMN primaryexaminerlastname,
DROP COLUMN primaryexaminerdepartment,
DROP COLUMN filename;

UPDATE patents
SET type = '2015'
WHERE type != 'NULL';

-- RESULT : UPDATE 1646225

UPDATE patents
SET type = '2010'
WHERE type = 'NULL';
-- RESULT : UPDATE 3764926

/* Join the historical patent data from the US PTO with the patents table */

ALTER TABLE PATENTS
ADD COLUMN nber INT,
ADD COLUMN uspc varchar,
ADD COLUMN uspc_sub varchar;

UPDATE patents p
SET nber = hp.nber,
uspc = hp.uspc,
uspc_sub = hp.uspc
FROM historicalpatentdata hp
WHERE hp.patentnumber = CAST(p.patent AS varchar);

-- RESULT : UPDATE 5113655

/* Merging some columns - claims and number of claims - column name : claims */
UPDATE patents
SET claims = numberofclaims
WHERE claims = -1;
-- RESULTS : UPDATE 1646225

/* Merging columns - applicationnumber into appnum, filingdate into appdate */
UPDATE patents
SET appnum = CAST (applicationnumber AS INT)
where appnum = -1;
-- RESULT : UPDATE 1646225

UPDATE patents
SET appdate = filingdate
where appdate = '0001-01-01 BC'
OR filingdate is not NULL;

-- RESULT UPDATE 1646225

ALTER TABLE patents
DROP COLUMN apptype;

/* Generating GYear and AppYear from the dates */
UPDATE patents
SET gyear = EXTRACT(year from grantdate)
WHERE gyear = -1
AND grantdate IS NOT NULL;
-- RESULT : UPDATE 1646225

UPDATE patents
SET appyear = EXTRACT(year from appdate)
WHERE appyear = -1
AND appdate is not null;
-- RESULT UPDATE 1646225

/* Test Script */
SELECT patentnumber, ussubclass, maingroup, subgroup, cpcsubclass, cpcmaingroup, cpcsubgroup, classificationnationalcountry, classificationnationalclass FROM Patents LIMIT 100;

patent | integer | not null
grantdate | date |
type | character varying |
applicationnumber | character varying |
filingdate | date |
prioritydate | date |
prioritycountry | character varying |
prioritypatentnumber | character varying |
cpcsubgroup | character varying |
numberofclaims | integer |
pctpatentnumber | character varying |
claims | integer |
appnum | integer |
gyear | integer |
appdate | date |
appyear | integer |
nber | integer |
uspc | character varying |
uspc_sub | character varying

/* Drop the merged columns */

ALTER TABLE patents
DROP COLUMN numberofclaims,
DROP COLUMN filingdate,
DROP COLUMN applicationnumber,
DROP COLUMN type;

UPDATE patents
SET prioritycountry = NULL
WHERE prioritycountry = 'NULL';

UPDATE patents
SET pctpatentnumber = NULL
WHERE pctpatentnumber = 'NULL';

UPDATE patents
SET prioritypatentnumber = NULL
WHERE prioritypatentnumber = 'NULL';

UPDATE patents
SET cpcsubgroup = NULL
WHERE cpcsubgroup = 'NULL';

UPDATE patents
SET appnum = NULL
WHERE appnum = -1;

UPDATE patents
SET gyear = NULL
WHERE gyear = -1;

UPDATE patents
SET appyear = NULL
WHERE appyear = -1;

Results:

allpatent=# ALTER TABLE patents
allpatent-# DROP COLUMN numberofclaims,
allpatent-# DROP COLUMN filingdate,
allpatent-# DROP COLUMN applicationnumber,
allpatent-# DROP COLUMN type;
ALTER TABLE
allpatent=#
allpatent=#
allpatent=# UPDATE patents
allpatent-# SET prioritycountry = NULL
allpatent-# WHERE prioritycountry = 'NULL';
UPDATE 3764926
allpatent=#
allpatent=# UPDATE patents
allpatent-# SET pctpatentnumber = NULL
allpatent-# WHERE pctpatentnumber = 'NULL';
UPDATE 3764926
allpatent=#
allpatent=# UPDATE patents
allpatent-# SET prioritypatentnumber = NULL
allpatent-# WHERE prioritypatentnumber = 'NULL';
UPDATE 3764926
allpatent=#
allpatent=# UPDATE patents
allpatent-# SET cpcsubgroup = NULL
allpatent-# WHERE cpcsubgroup = 'NULL';
UPDATE 3764926
allpatent=#
allpatent=# UPDATE patents
allpatent-# SET appnum = NULL
allpatent-# WHERE appnum = -1;
UPDATE 0
allpatent=#
allpatent=# UPDATE patents
allpatent-# SET gyear = NULL
allpatent-# WHERE gyear = -1;
UPDATE 0
allpatent=#
allpatent=# UPDATE patents
allpatent-# SET appyear = NULL
allpatent-# WHERE appyear = -1;
UPDATE 0
allpatent=#

UPDATE assignees
SET lastname = NULL
WHERE lastname = 'null';

UPDATE assignees
SET firstname = NULL
WHERE firstname = 'null';

UPDATE assignees
SET address = NULL
WHERE address = 'null';

UPDATE assignees
SET postcode = NULL
WHERE postcode = 'null';

UPDATE assignees
SET patentcountry = NULL
WHERE patentcountry = 'null';

UPDATE assignees
SET nationality2 = NULL
WHERE nationality2 = 'null';

UPDATE assignees
SET residence = NULL
WHERE residence = 'null';

UPDATE assignees
SET asgseq = NULL
WHERE asgseq= -1;

UPDATE assignees
SET asgtype = NULL
WHERE asgtype = -1;

RESULTS:

allpatent=# UPDATE assignees
allpatent-# SET lastname = NULL
allpatent-# WHERE lastname = 'null';
UPDATE 3818842
allpatent=#
allpatent=# UPDATE assignees
allpatent-# SET firstname = NULL
allpatent-# WHERE firstname = 'null';
UPDATE 3818842
allpatent=#
allpatent=# UPDATE assignees
allpatent-# SET address = NULL
allpatent-# WHERE address = 'null';
UPDATE 3818842
allpatent=#
allpatent=# UPDATE assignees
allpatent-# SET postcode = NULL
allpatent-# WHERE postcode = 'null';
UPDATE 3818842
allpatent=#
allpatent=# UPDATE assignees
allpatent-# SET patentcountry = NULL
allpatent-# WHERE patentcountry = 'null';
UPDATE 3818842
allpatent=#
allpatent=# UPDATE assignees
allpatent-# SET nationality2 = NULL
allpatent-# WHERE nationality2 = 'null';
UPDATE 1607714
allpatent=#
allpatent=# UPDATE assignees
allpatent-# SET residence = NULL
allpatent-# WHERE residence = 'null';
UPDATE 1607714
allpatent=#
allpatent=# UPDATE assignees
allpatent-# SET asgseq = NULL
allpatent-# WHERE asgseq= -1;
UPDATE 1607714
allpatent=#
allpatent=# UPDATE assignees
allpatent-# SET asgtype = NULL
allpatent-# WHERE asgtype = -1;
UPDATE 1607714
allpatent=#

CREATE DATABASE allpatent_clone WITH TEMPLATE allpatent OWNER dbuser;

== Renaming Tables and Columns ==

To standardize table and column names, and to make them as user-friendly as possible, a few tables and columns have been renamed.
* '''allpatent''' database -> '''patent'''
* assignees -> assignee
* judges -> judge
* citations -> citation
* matchassignees -> MatchOrgNames
* patents -> patent
* assignees -> ptoassignee
* assignments -> ptoassignment
* assignors -> ptoassignor
* patentassignment -> ptopatentfile
* properties -> ptoproperty
* mslfee -> feestatus
* patentmaintenancefee -> fee

Patent Data Cleanup (June 2016)

2016-07-07T15:45:52Z

RavaliKruthiventi:

== About this Page ==

This page contains the script that was used to clean up the patents and assignees tables in allpatent.

== Script ==
ALTER TABLE patents
RENAME COLUMN patentnumber TO patent;

ALTER TABLE patents
DROP COLUMN kind,
DROP COLUMN title,
DROP COLUMN ussubclass, -- **
DROP COLUMN maingroup, --
DROP COLUMN subgroup, --
DROP COLUMN cpcsubclass, -- ++
DROP COLUMN cpcmaingroup, -- ++
DROP COLUMN classificationnationalcountry,
DROP COLUMN classificationnationalclass, -- ** (?)
DROP COLUMN primaryexaminerfirstname,
DROP COLUMN primaryexaminerlastname,
DROP COLUMN primaryexaminerdepartment,
DROP COLUMN filename;

UPDATE patents
SET type = '2015'
WHERE type != 'NULL';

-- RESULT : UPDATE 1646225

UPDATE patents
SET type = '2010'
WHERE type = 'NULL';
-- RESULT : UPDATE 3764926

/* Join the historical patent data from the US PTO with the patents table */

ALTER TABLE PATENTS
ADD COLUMN nber INT,
ADD COLUMN uspc varchar,
ADD COLUMN uspc_sub varchar;

UPDATE patents p
SET nber = hp.nber,
uspc = hp.uspc,
uspc_sub = hp.uspc
FROM historicalpatentdata hp
WHERE hp.patentnumber = CAST(p.patent AS varchar);

-- RESULT : UPDATE 5113655

/* Merging some columns - claims and number of claims - column name : claims */
UPDATE patents
SET claims = numberofclaims
WHERE claims = -1;
-- RESULTS : UPDATE 1646225

/* Merging columns - applicationnumber into appnum, filingdate into appdate */
UPDATE patents
SET appnum = CAST (applicationnumber AS INT)
where appnum = -1;
-- RESULT : UPDATE 1646225

UPDATE patents
SET appdate = filingdate
where appdate = '0001-01-01 BC'
OR filingdate is not NULL;

-- RESULT UPDATE 1646225

ALTER TABLE patents
DROP COLUMN apptype;

/* Generating GYear and AppYear from the dates */
UPDATE patents
SET gyear = EXTRACT(year from grantdate)
WHERE gyear = -1
AND grantdate IS NOT NULL;
-- RESULT : UPDATE 1646225

UPDATE patents
SET appyear = EXTRACT(year from appdate)
WHERE appyear = -1
AND appdate is not null;
-- RESULT UPDATE 1646225

/* Test Script */
SELECT patentnumber, ussubclass, maingroup, subgroup, cpcsubclass, cpcmaingroup, cpcsubgroup, classificationnationalcountry, classificationnationalclass FROM Patents LIMIT 100;

patent | integer | not null
grantdate | date |
type | character varying |
applicationnumber | character varying |
filingdate | date |
prioritydate | date |
prioritycountry | character varying |
prioritypatentnumber | character varying |
cpcsubgroup | character varying |
numberofclaims | integer |
pctpatentnumber | character varying |
claims | integer |
appnum | integer |
gyear | integer |
appdate | date |
appyear | integer |
nber | integer |
uspc | character varying |
uspc_sub | character varying

/* Drop the merged columns */

ALTER TABLE patents
DROP COLUMN numberofclaims,
DROP COLUMN filingdate,
DROP COLUMN applicationnumber,
DROP COLUMN type;

UPDATE patents
SET prioritycountry = NULL
WHERE prioritycountry = 'NULL';

UPDATE patents
SET pctpatentnumber = NULL
WHERE pctpatentnumber = 'NULL';

UPDATE patents
SET prioritypatentnumber = NULL
WHERE prioritypatentnumber = 'NULL';

UPDATE patents
SET cpcsubgroup = NULL
WHERE cpcsubgroup = 'NULL';

UPDATE patents
SET appnum = NULL
WHERE appnum = -1;

UPDATE patents
SET gyear = NULL
WHERE gyear = -1;

UPDATE patents
SET appyear = NULL
WHERE appyear = -1;

Results:

allpatent=# ALTER TABLE patents
allpatent-# DROP COLUMN numberofclaims,
allpatent-# DROP COLUMN filingdate,
allpatent-# DROP COLUMN applicationnumber,
allpatent-# DROP COLUMN type;
ALTER TABLE
allpatent=#
allpatent=#
allpatent=# UPDATE patents
allpatent-# SET prioritycountry = NULL
allpatent-# WHERE prioritycountry = 'NULL';
UPDATE 3764926
allpatent=#
allpatent=# UPDATE patents
allpatent-# SET pctpatentnumber = NULL
allpatent-# WHERE pctpatentnumber = 'NULL';
UPDATE 3764926
allpatent=#
allpatent=# UPDATE patents
allpatent-# SET prioritypatentnumber = NULL
allpatent-# WHERE prioritypatentnumber = 'NULL';
UPDATE 3764926
allpatent=#
allpatent=# UPDATE patents
allpatent-# SET cpcsubgroup = NULL
allpatent-# WHERE cpcsubgroup = 'NULL';
UPDATE 3764926
allpatent=#
allpatent=# UPDATE patents
allpatent-# SET appnum = NULL
allpatent-# WHERE appnum = -1;
UPDATE 0
allpatent=#
allpatent=# UPDATE patents
allpatent-# SET gyear = NULL
allpatent-# WHERE gyear = -1;
UPDATE 0
allpatent=#
allpatent=# UPDATE patents
allpatent-# SET appyear = NULL
allpatent-# WHERE appyear = -1;
UPDATE 0
allpatent=#

UPDATE assignees
SET lastname = NULL
WHERE lastname = 'null';

UPDATE assignees
SET firstname = NULL
WHERE firstname = 'null';

UPDATE assignees
SET address = NULL
WHERE address = 'null';

UPDATE assignees
SET postcode = NULL
WHERE postcode = 'null';

UPDATE assignees
SET patentcountry = NULL
WHERE patentcountry = 'null';

UPDATE assignees
SET nationality2 = NULL
WHERE nationality2 = 'null';

UPDATE assignees
SET residence = NULL
WHERE residence = 'null';

UPDATE assignees
SET asgseq = NULL
WHERE asgseq= -1;

UPDATE assignees
SET asgtype = NULL
WHERE asgtype = -1;

RESULTS:

allpatent=# UPDATE assignees
allpatent-# SET lastname = NULL
allpatent-# WHERE lastname = 'null';
UPDATE 3818842
allpatent=#
allpatent=# UPDATE assignees
allpatent-# SET firstname = NULL
allpatent-# WHERE firstname = 'null';
UPDATE 3818842
allpatent=#
allpatent=# UPDATE assignees
allpatent-# SET address = NULL
allpatent-# WHERE address = 'null';
UPDATE 3818842
allpatent=#
allpatent=# UPDATE assignees
allpatent-# SET postcode = NULL
allpatent-# WHERE postcode = 'null';
UPDATE 3818842
allpatent=#
allpatent=# UPDATE assignees
allpatent-# SET patentcountry = NULL
allpatent-# WHERE patentcountry = 'null';
UPDATE 3818842
allpatent=#
allpatent=# UPDATE assignees
allpatent-# SET nationality2 = NULL
allpatent-# WHERE nationality2 = 'null';
UPDATE 1607714
allpatent=#
allpatent=# UPDATE assignees
allpatent-# SET residence = NULL
allpatent-# WHERE residence = 'null';
UPDATE 1607714
allpatent=#
allpatent=# UPDATE assignees
allpatent-# SET asgseq = NULL
allpatent-# WHERE asgseq= -1;
UPDATE 1607714
allpatent=#
allpatent=# UPDATE assignees
allpatent-# SET asgtype = NULL
allpatent-# WHERE asgtype = -1;
UPDATE 1607714
allpatent=#

CREATE DATABASE allpatent_clone WITH TEMPLATE allpatent OWNER dbuser;

== Renaming Tables and Columns ==

To standardize table and column names, and to make them as user-friendly as possible, a few tables and columns have been renamed.
* '''allpatent''' database -> '''patent'''
* assignees -> assignee
* judges -> judge
* citations -> citation
* matchassignees -> MatchOrgNames
* patents -> patent
* assignees -> ptoassignee
* assignments -> ptoassignment
* assignors -> ptoassignor
* patentassignment -> ptopatentfile
* properties -> ptoproperty
* mslfee -> feestatus
* patentmaintenancefee -> fee

Patent Data Cleanup (June 2016)

2016-07-07T15:41:46Z

RavaliKruthiventi: /* Script */

Patent Data Processing - SQL Steps

2016-07-06T17:51:25Z

RavaliKruthiventi: /* USPTO Consolidated Patent Data */

[[Category:Internal]]
[[Internal Classification::Legacy| ]]

== Objective ==
The McNair Center owns two sets of patent data: one inherited from the Harvard Dataverse, which is stored in the database '''patentdata''', and another generated by crawlers pulling data from the USPTO website, which is stored in the database '''patent_2015'''.

We are now merging and cleaning the two data sets and storing them in a schema that is an amalgamation of the two underlying schemas for the citations, assignees, and patents tables. The destination schema is '''allpatent'''.

== Assignees Data==

The schema for the assignees table in '''patentdata''' database is:

Column | Type | Modifiers
-------------+-------------------+-----------
patent | integer |
asgtype | integer |
assignee | character varying |
city | character varying |
state | character varying |
country | character varying |
nationality | character varying |
residence | character varying |
asgseq | integer |

The schema for the assignees table in patent_2015 is :

Column | Type | Modifiers
--------------+---------+-----------
lastname | text |
firstname | text |
orgname | text |
city | text |
country | text |
patentcountry | text |
patentnumber | integer |
state | text |
address | text |
postcode | text |

To merge both schemas, we have some columns that overlap, and some columns that don't.

'''Overlapping Columns'''

patent_2015 | patentdata
--------------+--------------
orgname | assignee
city | city
country | country
patentnumber | patent
state | state

These columns will have entries for most rows in the table, because they exist in both tables. The rest of the columns will be populated based on which table the row is coming from.

'''Final Schema'''

Table "public.assignees"
Column | Type | Modifiers
---------------+-------------------+-----------
lastname | character varying |
firstname | character varying |
address | character varying |
postcode | character varying |
orgname | character varying |
city | character varying |
country | character varying |
patentnumber | integer |
state | character varying |
patentcountry | character varying |
nationality2 | character varying |
residence | character varying |
asgseq | integer |
asgtype | integer |

'''Non-overlapping Columns'''
These are the columns that belong to only one of the two assignees tables. To help users understand which table a row came from, the following insert rules were followed:

*For columns of type int, the value -1 was inserted.
*For columns of type string (character varying), the string 'null' was inserted.

Therefore, if a row has appropriate values for orgname, state, city, etc., but 'null' values for lastname, firstname, address and postcode, the row came from the patentdata table.
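
The sentinel rule can be sketched on a toy example. This is only an illustration: it uses an in-memory SQLite database in Python as a stand-in for PostgreSQL, a reduced set of columns, made-up rows, and hypothetical source table names (assignees_patentdata, assignees_2015) so that both sources fit in one database.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; source table names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE assignees_patentdata (patent INTEGER, assignee TEXT, city TEXT, asgseq INTEGER);
CREATE TABLE assignees_2015 (lastname TEXT, orgname TEXT, city TEXT, patentnumber INTEGER);
CREATE TABLE assignees_merge (lastname TEXT, orgname TEXT, city TEXT,
                              patentnumber INTEGER, asgseq INTEGER);
-- Made-up sample rows, one per source.
INSERT INTO assignees_patentdata VALUES (4000001, 'Acme Corp', 'Houston', 1);
INSERT INTO assignees_2015 VALUES ('Doe', 'Example LLC', 'Austin', 9000001);
""")

# Rows from patentdata have no lastname, so the string sentinel 'null' is inserted.
conn.execute("""
INSERT INTO assignees_merge
SELECT 'null', assignee, city, patent, asgseq FROM assignees_patentdata
""")

# Rows from patent_2015 have no asgseq, so the integer sentinel -1 is inserted.
conn.execute("""
INSERT INTO assignees_merge
SELECT lastname, orgname, city, patentnumber, -1 FROM assignees_2015
""")

# The sentinels make it possible to tell which source a row came from.
rows = conn.execute("""
SELECT patentnumber,
       CASE WHEN lastname = 'null' THEN 'patentdata' ELSE 'patent_2015' END AS source
FROM assignees_merge
ORDER BY patentnumber
""").fetchall()
print(rows)
```

Sentinel values keep the origin of each row visible, at the cost of mixing placeholder values with real data in the same columns.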

==== Index ====
Since the table is relatively large, and is likely to be searched often, an index has been imposed on the table.

allpatent=# CREATE INDEX ON assignees (orgname);
CREATE INDEX

====Sample insert and copy commands ====
INSERT INTO assignees_merge
(
SELECT
'null',
'null',
'null',
'null',
a.assignee,
a.city,
a.country,
a.patent,
a.state,
'null',
a.nationality,
a.residence,
a.asgseq,
a.asgtype
FROM assignees a
);

INSERT INTO assignees_merge
(
SELECT
assignees.lastname,
assignees.firstname,
assignees.address,
assignees.postcode,
assignees.orgname,
assignees.city,
assignees.country,
assignees.patentnumber,
assignees.state,
assignees.patentcountry,
'null',
'null',
-1,
-1
FROM assignees
);

\COPY assignees_merge TO '/tmp/assignees_merge_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
\COPY assignees FROM '/tmp/assignees_merge_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
--1607724

\COPY assignees_merge TO '/tmp/assignees_merge_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
\COPY assignees FROM '/tmp/assignees_merge_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
--3818842

Note: The assignees table was updated on 6/23 to remove the 'null' strings and the -1 values.
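
The statement used for that update is not recorded here. One possible way to turn the placeholders into real NULLs is NULLIF, sketched below on an in-memory SQLite table in Python with made-up rows and a reduced column set; the actual update may have been done differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE assignees (lastname TEXT, orgname TEXT, asgseq INTEGER)")
conn.executemany(
    "INSERT INTO assignees VALUES (?, ?, ?)",
    [("null", "Acme Corp", 1), ("Doe", "Example LLC", -1)],  # made-up rows
)

# NULLIF(a, b) returns NULL when a equals b, otherwise a.
conn.execute("""
UPDATE assignees
SET lastname = NULLIF(lastname, 'null'),
    asgseq   = NULLIF(asgseq, -1)
""")

cleaned = conn.execute(
    "SELECT lastname, orgname, asgseq FROM assignees ORDER BY orgname"
).fetchall()
print(cleaned)
```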

==Patents ==

'''Patentdata Schema:'''

Column | Type | Modifiers
--------+-------------------+-----------
patent | integer |
kind | character varying |
claims | integer |
apptype | integer |
appnum | integer |
gdate | date |
gyear | integer |
appdate | date |
appyear | integer |

'''Patent_2015 Schema:'''

Column | Type | Modifiers
-------------------------------+---------+-----------
patentnumber | int | not null
kind | varchar |
grantdate | date |
type | varchar |
applicationnumber | varchar |
filingdate | date |
prioritydate | date |
prioritycountry | varchar |
prioritypatentnumber | varchar |
ussubclass | varchar |
maingroup | varchar |
subgroup | varchar |
cpcsubclass | varchar |
cpcmaingroup | varchar |
cpcsubgroup | varchar |
classificationnationalcountry | varchar |
classificationnationalclass | varchar |
title | varchar |
numberofclaims | int |
primaryexaminerfirstname | varchar |
primaryexaminerlastname | varchar |
primaryexaminerdepartment | varchar |
pctpatentnumber | varchar |
filename | varchar |

''' Overlapping Columns '''
patentdata | patent_2015
--------------+-------------
patent | patentnumber
kind | kind
claims | numberofclaims
apptype | type
appnum | applicationnumber
gdate | grantdate
appdate | filingdate

'''Combined Schema:'''

The final schema of the patents table is :

Column | Type | Modifiers
----------------------+-------------------+-----------
patent | integer | not null
grantdate | date |
prioritydate | date |
prioritycountry | character varying |
prioritypatentnumber | character varying |
cpcsubgroup | character varying |
pctpatentnumber | character varying |
claims | integer |
appnum | integer |
gyear | integer |
appdate | date |
appyear | integer |
nber | integer |
uspc | character varying |
uspc_sub | character varying |

From the total list of columns belonging to both the tables (patentdata and patent_2015), a few columns, most of them related to classification of patents, have been dropped since the data in the tables was not clean.

Additionally, three columns - nber, uspc, uspc_sub have been added from the historicalpatentdata, a table built from data downloaded from the USPTO Bulk Data Storage. The join was executed on the patent number.

Note: Columns were added and deleted through separate [[Patent Data Cleanup - June 2016 |scripts]], so the scripts below differ slightly from the final schema shown above.

==== Index and Key Creation ====
Patent numbers are distinct in this table, and are central to the rest of the fields in the table. A primary key can therefore be imposed on the column. Also, since a number of searches are likely to be conducted on this table, an index has been imposed as well.

Code:
ALTER TABLE patents ADD PRIMARY KEY (patentnumber);
-- RESULT : ALTER TABLE
allpatent=# CREATE UNIQUE INDEX patent_idx ON patents (patentnumber);
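
A primary key can only hold when every patent number occurs once. The sketch below illustrates that constraint with an in-memory SQLite table in Python; the patent number and kind code are made up, and SQLite is only a stand-in for the PostgreSQL table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patents (patentnumber INTEGER PRIMARY KEY, kind TEXT)")
conn.execute("INSERT INTO patents VALUES (1000001, 'B2')")  # made-up row

# A second row with the same patent number violates the primary key.
try:
    conn.execute("INSERT INTO patents VALUES (1000001, 'B2')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(duplicate_rejected)
```

This is why duplicate patent numbers have to be resolved before the key is in place.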

====Sample Insert and Copy Statements====
patentdata:
INSERT INTO patents_merged
(
SELECT
patent,
kind,
gdate,
'NULL',
'NULL',
NULL,
NULL,
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
-1,
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
claims,
apptype,
appnum,
gyear,
appdate,
appyear
FROM patents
);
-- RESULT : INSERT 0 3984771

patent_2015:
INSERT INTO patents_merged
(
SELECT
patentnumber,
kind,
grantdate,
type,
applicationnumber,
filingdate,
prioritydate,
prioritycountry,
prioritypatentnumber,
ussubclass,
maingroup,
subgroup,
cpcsubclass,
cpcmaingroup,
cpcsubgroup,
classificationnationalcountry,
classificationnationalclass,
title,
numberofclaims,
primaryexaminerfirstname,
primaryexaminerlastname,
primaryexaminerdepartment,
pctpatentnumber,
filename,
-1,
-1,
-1,
-1,
NULL,
-1
FROM patents
);
-- RESULT : INSERT 0 1646225

COPY SCRIPTS:
patentdata:
\COPY patents_merged TO '/tmp/merged_patents_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
--COPY 3984771

patent_2015:
\COPY patents_merged TO '/tmp/merged_patents_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
--COPY 1646225

PATENTS TABLE
\COPY patents FROM '/tmp/merged_patents_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
-- RESULT : COPY 3984771
\COPY patents FROM '/tmp/merged_patents_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
-- RESULT : COPY 1646225

====TESTING ====
select count(*) FROM (SELECT DISTINCT patentnumber FROM patents) AS t;
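--RESULT: 5411151
--EXPECTED: 5426566

Some patent numbers appear more than once because both patent_2015 and patentdata contributed a row for the same patent. The queries below locate these duplicates and build a deduplicated table, patentsCleaned.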

SELECT COUNT(*), *
FROM patents
GROUP BY
patentnumber,
kind,
grantdate,
type,
applicationnumber,
filingdate,
prioritydate,
prioritycountry,
prioritypatentnumber,
ussubclass,
maingroup,
subgroup,
cpcsubclass,
cpcmaingroup,
cpcsubgroup,
classificationnationalcountry,
classificationnationalclass,
title,
numberofclaims,
primaryexaminerfirstname,
primaryexaminerlastname,
primaryexaminerdepartment,
pctpatentnumber,
filename,
claims,
apptype,
appnum,
gyear,
appdate,
appyear
HAVING COUNT(*) > 1;

SELECT patentnumber, count(*)
FROM patents
GROUP BY patentnumber
HAVING count(*)>1;
--7640598

SELECT *
FROM patents op
WHERE op.patentnumber IN
(
SELECT ip.patentnumber
FROM patents ip
GROUP BY ip.patentnumber
HAVING COUNT(*)>1
)
ORDER BY op.patentnumber;

(
SELECT *
INTO patentsCleaned
FROM patents op
WHERE op.patentnumber IN
(
SELECT ip.patentnumber
FROM patents ip
GROUP BY ip.patentnumber
HAVING COUNT(*)=1
)
ORDER BY op.patentnumber
)
--SELECT 5191306

INSERT INTO patentsCleaned(
SELECT *
FROM patents op
WHERE op.patentnumber IN
(
SELECT ip.patentnumber
FROM patents ip
GROUP BY ip.patentnumber
HAVING COUNT(*)>1
)
AND op.applicationnumber NOT LIKE 'NULL'
ORDER BY op.patentnumber
);

--219845
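
The two steps above (keep patent numbers that occur once, then for duplicated numbers keep only the row whose applicationnumber is not the 'NULL' placeholder) can be sketched on a toy table. This uses in-memory SQLite in Python, only two of the columns, and made-up values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patents (patentnumber INTEGER, applicationnumber TEXT)")
conn.executemany("INSERT INTO patents VALUES (?, ?)", [
    (1, "NULL"),       # only in patentdata (placeholder application number)
    (2, "12000002"),   # only in patent_2015
    (3, "NULL"),       # in both sources: patentdata copy
    (3, "12000003"),   # in both sources: patent_2015 copy
])

# Step 1: keep rows whose patent number occurs exactly once.
conn.execute("""
CREATE TABLE patentsCleaned AS
SELECT * FROM patents
WHERE patentnumber IN (
    SELECT patentnumber FROM patents GROUP BY patentnumber HAVING COUNT(*) = 1
)
""")

# Step 2: for duplicated numbers, keep only the copy with a real application number.
conn.execute("""
INSERT INTO patentsCleaned
SELECT * FROM patents
WHERE patentnumber IN (
    SELECT patentnumber FROM patents GROUP BY patentnumber HAVING COUNT(*) > 1
)
AND applicationnumber NOT LIKE 'NULL'
""")

total = conn.execute("SELECT COUNT(*) FROM patentsCleaned").fetchone()[0]
distinct = conn.execute("SELECT COUNT(DISTINCT patentnumber) FROM patents").fetchone()[0]
print(total, distinct)
```

In the real run the same check holds: 5191306 + 219845 = 5411151, which equals the number of distinct patent numbers counted earlier.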

====TESTING:====
allpatent=# select count(*) from patentsCleaned;
count
---------
5411151
(1 row)

allpatent=# select count(*), patentnumber FROM patentsCleaned group by patentnumber having count(*) > 1;
count | patentnumber
-------+--------------
(0 rows)

== Citations==

In the citations table, we needed to define another function that converts a textual patent number into a number (bigint, because some patent numbers exceed the range of a regular integer).

To extract patent numbers that consist of digits only and return NULL for everything else:
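CREATE OR REPLACE FUNCTION cleanpatno (text) RETURNS bigint AS $$
# Return the input when it consists of digits only; otherwise return undef (SQL NULL).
if ($_[0]) {
my $var=$_[0];
if ($var=~/^\d*$/) {return $var;}
# Non-numeric input, such as a number with a letter prefix, is discarded.
return undef;
}
# Empty or missing input (and the string "0", which Perl treats as false) is also discarded.
return undef;
$$ LANGUAGE plperl;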
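
For readers less familiar with Perl, the behaviour of cleanpatno can be approximated in Python as follows. The function name clean_patno and the sample inputs are illustrative only, and this is a sketch of the same idea rather than a drop-in replacement.

```python
import re
from typing import Optional

_DIGITS_ONLY = re.compile(r"[0-9]+")


def clean_patno(raw: Optional[str]) -> Optional[int]:
    """Return the patent number as an int if raw is all digits, else None."""
    # Perl treats undef, the empty string and "0" as false, so all three give None.
    if not raw or raw == "0":
        return None
    if _DIGITS_ONLY.fullmatch(raw):
        return int(raw)
    return None


print(clean_patno("7640598"))      # digits only: kept
print(clean_patno("D0512345"))     # letter prefix: dropped
print(clean_patno("20050123456"))  # larger than a 32-bit integer, hence bigint
```

Python integers have no fixed width, so the last case needs no special handling here; in PostgreSQL a value above 2147483647 does not fit in an integer column, which is why the function returns bigint.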

'''patentdata schema:'''

Column | Type | Modifiers
------------+-------------------+-----------
patent | integer |
cit_date | date |
cit_name | character varying |
cit_kind | character varying |
cit_country | character varying |
citation | integer |
category | character varying |
citseq | integer |

SELECT patent as citingpatentnumber, citation AS citedpatentnumber
INTO citations_merged
FROM citations;
--SELECT 38452957

'''patent_2015 schema:'''
Column | Type | Modifiers
---------------------+---------+-----------
citingpatentnumber | integer |
citingpatentcountry | text |
citedpatentnumber | text |
citedpatentcountry | text |

SELECT CAST(citingpatentnumber AS bigint), CAST(cleanpatno( citedpatentnumber) AS bigint) as citedpatentnumber
INTO citations_merged
FROM citations;
-- RESULT : SELECT 59227881

'''Overlapping Columns'''

patent_2015 | patentdata |
---------------------+---------------+
citingpatentnumber | patent |
citedpatentnumber | citation |

''' Combined Schema:'''

Column | Type | Modifiers
--------------------+--------+-----------
citingpatentnumber | bigint |
citedpatentnumber | bigint |

Copy Statements:

patentdata:
\COPY citations_merged TO '/tmp/merged_citations_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
--COPY 38452957

patent_2015:
\COPY citations_merged TO '/tmp/merged_citations_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
-- RESULT : COPY 59227881

allpatent:
\COPY citations FROM '/tmp/merged_citations_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
--RESULT : COPY 59227881

\COPY citations FROM '/tmp/merged_citations_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
--RESULT: COPY 38452957

CLONING:
CREATE DATABASE allpatentsProcessed WITH TEMPLATE allpatent OWNER researcher;

== USPTO Consolidated Patent Data ==

The USPTO maintains a repository of patent data on its Bulk Data Storage (BDS) system. We downloaded this data and loaded it into tables in the patent database. The steps followed were:
* Download the files from the BDS system (we have access to CSV files).
* Create a table with the required columns.
* Use the \COPY command to copy the data from the file into the table.

'''Script:'''

/* creating patent data tables from: https://bulkdata.uspto.gov/data2/patent/maintenancefee/ */

CREATE TABLE PatentMaintenanceFee(
patentnumber varchar,
applicationnumber int,
smallentity varchar,
filingdate date,
grantissuedate date,
maintenancefeedate date,
maintenancefeecode varchar
);

\COPY PatentMaintenanceFee FROM '/bulk/USPTO_Consolidated/MaintFeeEvents_20160613.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
-- RESULT : COPY 14042059

/* creating tables for historical patent data - USPTO */

CREATE TABLE HistoricalPatentData(
applicationid int,
pubno varchar,
patentnumber varchar,
NBER int,
USPC varchar,
USPC_sub varchar,
applicationdate date,
prioritydate date,
pubdate date,
displaydate date,
disptype varchar,
exp_dt date,
exp_dt_max date,
pta int
);

\COPY historicalpatentdata FROM '/bulk/USPTO_Consolidated/HistoricalFiles/historical_masterfile.csv' DELIMITER AS ',' HEADER NULL AS '' CSV;

--COPY 11191813
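
The create-table-then-copy pattern used above can be tried on a small scale without a PostgreSQL server. The sketch below uses in-memory SQLite and the csv module in Python; the three columns are a subset of PatentMaintenanceFee and the two data rows are made up. It mimics the HEADER and NULL AS '' options of \COPY by skipping the first line and turning empty fields into NULL.

```python
import csv
import io
import sqlite3

# Made-up sample in the same layout: tab-delimited with a header line.
sample = io.StringIO(
    "patentnumber\tapplicationnumber\tsmallentity\n"
    "4000001\t6000001\tN\n"
    "4000002\t\tY\n"
)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patentmaintenancefee ("
    "patentnumber TEXT, applicationnumber INTEGER, smallentity TEXT)"
)

reader = csv.reader(sample, delimiter="\t")
next(reader)  # skip the header line, like the HEADER option
# Treat empty fields as NULL, like NULL AS ''.
rows = [[field if field != "" else None for field in row] for row in reader]
conn.executemany("INSERT INTO patentmaintenancefee VALUES (?, ?, ?)", rows)

loaded = conn.execute("SELECT COUNT(*) FROM patentmaintenancefee").fetchone()[0]
missing = conn.execute(
    "SELECT COUNT(*) FROM patentmaintenancefee WHERE applicationnumber IS NULL"
).fetchone()[0]
print(loaded, missing)
```

Comparing the loaded row count against the line count of the source file is a quick sanity check, similar to the COPY counts recorded in the comments above.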

Patent Data Processing - SQL Steps

2016-07-06T17:46:24Z

RavaliKruthiventi: /* USPTO Consolidated Patent Data */

[[Category:Internal]]
[[Internal Classification::Legacy| ]]

== Objective ==
The McNair Center owns two sets of patent data. One set is inherited from the Harvard Dataverse and is stored in the database '''patentdata'''. The other is generated by crawlers pulling data from the USPTO website and is stored in the database '''patent_2015'''.

We are now merging and cleaning the two data sets and storing them in a schema that is an amalgamation of the two underlying schemas for the citations, assignees, and patents tables. The destination database is '''allpatent'''.

== Assignees Data==

The schema for the assignees table in '''patentdata''' database is:

Column | Type | Modifiers
-------------+-------------------+-----------
patent | integer |
asgtype | integer |
assignee | character varying |
city | character varying |
state | character varying |
country | character varying |
nationality | character varying |
residence | character varying |
asgseq | integer |

The schema for the assignees table in the '''patent_2015''' database is:

Column | Type | Modifiers
--------------+---------+-----------
lastname | text |
firstname | text |
orgname | text |
city | text |
country | text |
patentcountry | text |
patentnumber | integer |
state | text |
address | text |
postcode | text |

Some columns overlap between the two schemas, and some do not.

'''Overlapping Columns'''

patent_2015 | patentdata
--------------+--------------
orgname | assignee
city | city
country | country
patentnumber | patent
state | state

These columns will have entries for most rows in the table, because they exist in both tables. The rest of the columns will be populated based on which table the row is coming from.

'''Final Schema'''

Table "public.assignees"
Column | Type | Modifiers
---------------+-------------------+-----------
lastname | character varying |
firstname | character varying |
address | character varying |
postcode | character varying |
orgname | character varying |
city | character varying |
country | character varying |
patentnumber | integer |
state | character varying |
patentcountry | character varying |
nationality2 | character varying |
residence | character varying |
asgseq | integer |
asgtype | integer |

'''Non-overlapping Columns'''
These are the columns that belong to only one of the two assignees tables. To help users understand which table a row came from, the following insert rules were followed:

*For columns of type int, the value -1 was inserted.
*For columns of type string (character varying), the string 'null' was inserted.

Therefore, if a row has appropriate values for orgname, state, city, etc., but 'null' values for lastname, firstname, address and postcode, the row came from the patentdata table.

==== Index ====
Since the table is relatively large, and is likely to be searched often, an index has been imposed on the table.

allpatent=# CREATE INDEX ON assignees (orgname);
CREATE INDEX

====Sample insert and copy commands ====
INSERT INTO assignees_merge
(
SELECT
'null',
'null',
'null',
'null',
a.assignee,
a.city,
a.country,
a.patent,
a.state,
'null',
a.nationality,
a.residence,
a.asgseq,
a.asgtype
FROM assignees a
);

INSERT INTO assignees_merge
(
SELECT
assignees.lastname,
assignees.firstname,
assignees.address,
assignees.postcode,
assignees.orgname,
assignees.city,
assignees.country,
assignees.patentnumber,
assignees.state,
assignees.patentcountry,
'null',
'null',
-1,
-1
FROM assignees
);

\COPY assignees_merge TO '/tmp/assignees_merge_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
\COPY assignees FROM '/tmp/assignees_merge_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
--1607724

\COPY assignees_merge TO '/tmp/assignees_merge_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
\COPY assignees FROM '/tmp/assignees_merge_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
--3818842

Note: The assignees table was updated on 6/23 to remove the 'null' strings and the -1 values.

==Patents ==

'''Patentdata Schema:'''

Column | Type | Modifiers
--------+-------------------+-----------
patent | integer |
kind | character varying |
claims | integer |
apptype | integer |
appnum | integer |
gdate | date |
gyear | integer |
appdate | date |
appyear | integer |

'''Patent_2015 Schema:'''

Column | Type | Modifiers
-------------------------------+---------+-----------
patentnumber | int | not null
kind | varchar |
grantdate | date |
type | varchar |
applicationnumber | varchar |
filingdate | date |
prioritydate | date |
prioritycountry | varchar |
prioritypatentnumber | varchar |
ussubclass | varchar |
maingroup | varchar |
subgroup | varchar |
cpcsubclass | varchar |
cpcmaingroup | varchar |
cpcsubgroup | varchar |
classificationnationalcountry | varchar |
classificationnationalclass | varchar |
title | varchar |
numberofclaims | int |
primaryexaminerfirstname | varchar |
primaryexaminerlastname | varchar |
primaryexaminerdepartment | varchar |
pctpatentnumber | varchar |
filename | varchar |

''' Overlapping Columns '''
patentdata | patent_2015
--------------+-------------
patent | patentnumber
kind | kind
claims | numberofclaims
apptype | type
appnum | applicationnumber
gdate | grantdate
appdate | filingdate

'''Combined Schema:'''

The final schema of the patents table is :

Column | Type | Modifiers
----------------------+-------------------+-----------
patent | integer | not null
grantdate | date |
prioritydate | date |
prioritycountry | character varying |
prioritypatentnumber | character varying |
cpcsubgroup | character varying |
pctpatentnumber | character varying |
claims | integer |
appnum | integer |
gyear | integer |
appdate | date |
appyear | integer |
nber | integer |
uspc | character varying |
uspc_sub | character varying |

From the total list of columns belonging to both the tables (patentdata and patent_2015), a few columns, most of them related to classification of patents, have been dropped since the data in the tables was not clean.

Additionally, three columns - nber, uspc, uspc_sub have been added from the historicalpatentdata, a table built from data downloaded from the USPTO Bulk Data Storage. The join was executed on the patent number.

Note: Columns were added and deleted through separate [[Patent Data Cleanup - June 2016 |scripts]], so the scripts below differ slightly from the final schema shown above.

==== Index and Key Creation ====
Patent numbers are distinct in this table, and are central to the rest of the fields in the table. A primary key can therefore be imposed on the column. Also, since a number of searches are likely to be conducted on this table, an index has been imposed as well.

Code:
ALTER TABLE patents ADD PRIMARY KEY (patentnumber);
-- RESULT : ALTER TABLE
allpatent=# CREATE UNIQUE INDEX patent_idx ON patents (patentnumber);

====Sample Insert and Copy Statements====
patentdata:
INSERT INTO patents_merged
(
SELECT
patent,
kind,
gdate,
'NULL',
'NULL',
NULL,
NULL,
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
-1,
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
claims,
apptype,
appnum,
gyear,
appdate,
appyear
FROM patents
);
-- RESULT : INSERT 0 3984771

patent_2015:
INSERT INTO patents_merged
(
SELECT
patentnumber,
kind,
grantdate,
type,
applicationnumber,
filingdate,
prioritydate,
prioritycountry,
prioritypatentnumber,
ussubclass,
maingroup,
subgroup,
cpcsubclass,
cpcmaingroup,
cpcsubgroup,
classificationnationalcountry,
classificationnationalclass,
title,
numberofclaims,
primaryexaminerfirstname,
primaryexaminerlastname,
primaryexaminerdepartment,
pctpatentnumber,
filename,
-1,
-1,
-1,
-1,
NULL,
-1
FROM patents
);
-- RESULT : INSERT 0 1646225

COPY SCRIPTS:
patentdata:
\COPY patents_merged TO '/tmp/merged_patents_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
--COPY 3984771

patent_2015:
\COPY patents_merged TO '/tmp/merged_patents_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
--COPY 1646225

PATENTS TABLE
\COPY patents FROM '/tmp/merged_patents_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
-- RESULT : COPY 3984771
\COPY patents FROM '/tmp/merged_patents_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
-- RESULT : COPY 1646225

====TESTING ====
select count(*) FROM (SELECT DISTINCT patentnumber FROM patents) AS t;
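--RESULT: 5411151
--EXPECTED: 5426566

Some patent numbers appear more than once because both patent_2015 and patentdata contributed a row for the same patent. The queries below locate these duplicates and build a deduplicated table, patentsCleaned.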

SELECT COUNT(*), *
FROM patents
GROUP BY
patentnumber,
kind,
grantdate,
type,
applicationnumber,
filingdate,
prioritydate,
prioritycountry,
prioritypatentnumber,
ussubclass,
maingroup,
subgroup,
cpcsubclass,
cpcmaingroup,
cpcsubgroup,
classificationnationalcountry,
classificationnationalclass,
title,
numberofclaims,
primaryexaminerfirstname,
primaryexaminerlastname,
primaryexaminerdepartment,
pctpatentnumber,
filename,
claims,
apptype,
appnum,
gyear,
appdate,
appyear
HAVING COUNT(*) > 1;

SELECT patentnumber, count(*)
FROM patents
GROUP BY patentnumber
HAVING count(*)>1;
--7640598

SELECT *
FROM patents op
WHERE op.patentnumber IN
(
SELECT ip.patentnumber
FROM patents ip
GROUP BY ip.patentnumber
HAVING COUNT(*)>1
)
ORDER BY op.patentnumber;

(
SELECT *
INTO patentsCleaned
FROM patents op
WHERE op.patentnumber IN
(
SELECT ip.patentnumber
FROM patents ip
GROUP BY ip.patentnumber
HAVING COUNT(*)=1
)
ORDER BY op.patentnumber
)
--SELECT 5191306

INSERT INTO patentsCleaned(
SELECT *
FROM patents op
WHERE op.patentnumber IN
(
SELECT ip.patentnumber
FROM patents ip
GROUP BY ip.patentnumber
HAVING COUNT(*)>1
)
AND op.applicationnumber NOT LIKE 'NULL'
ORDER BY op.patentnumber
);

--219845

====TESTING:====
allpatent=# select count(*) from patentsCleaned;
count
---------
5411151
(1 row)

allpatent=# select count(*), patentnumber FROM patentsCleaned group by patentnumber having count(*) > 1;
count | patentnumber
-------+--------------
(0 rows)

== Citations==

In the citations table, we needed to define another function that converts a textual patent number into a number (bigint, because some patent numbers exceed the range of a regular integer).

To extract patent numbers that consist of digits only and return NULL for everything else:
CREATE OR REPLACE FUNCTION cleanpatno (text) RETURNS bigint AS $$
if ($_[0]) {
my $var=$_[0];
if ($var=~/^\d*$/) {return $var;}
return undef;
}
return undef;
$$ LANGUAGE plperl;

'''patentdata schema:'''

Column | Type | Modifiers
------------+-------------------+-----------
patent | integer |
cit_date | date |
cit_name | character varying |
cit_kind | character varying |
cit_country | character varying |
citation | integer |
category | character varying |
citseq | integer |

SELECT patent as citingpatentnumber, citation AS citedpatentnumber
INTO citations_merged
FROM citations;
--SELECT 38452957

'''patent_2015 schema:'''
Column | Type | Modifiers
---------------------+---------+-----------
citingpatentnumber | integer |
citingpatentcountry | text |
citedpatentnumber | text |
citedpatentcountry | text |

SELECT CAST(citingpatentnumber AS bigint), CAST(cleanpatno( citedpatentnumber) AS bigint) as citedpatentnumber
INTO citations_merged
FROM citations;
-- RESULT : SELECT 59227881

'''Overlapping Columns'''

patent_2015 | patentdata |
---------------------+---------------+
citingpatentnumber | patent |
citedpatentnumber | citation |

''' Combined Schema:'''

Column | Type | Modifiers
--------------------+--------+-----------
citingpatentnumber | bigint |
citedpatentnumber | bigint |

Copy Statements:

patentdata:
\COPY citations_merged TO '/tmp/merged_citations_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
--COPY 38452957

patent_2015:
\COPY citations_merged TO '/tmp/merged_citations_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
-- RESULT : COPY 59227881

allpatent:
\COPY citations FROM '/tmp/merged_citations_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
--RESULT : COPY 59227881

\COPY citations FROM '/tmp/merged_citations_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
--RESULT: COPY 38452957

CLONING:
CREATE DATABASE allpatentsProcessed WITH TEMPLATE allpatent OWNER researcher;

== USPTO Consolidated Patent Data ==

Scripts:

/* creating patent data tables from: https://bulkdata.uspto.gov/data2/patent/maintenancefee/ */

CREATE TABLE PatentMaintenanceFee(
patentnumber varchar,
applicationnumber int,
smallentity varchar,
filingdate date,
grantissuedate date,
maintenancefeedate date,
maintenancefeecode varchar
);

\COPY PatentMaintenanceFee FROM '/bulk/USPTO_Consolidated/MaintFeeEvents_20160613.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
-- RESULT : COPY 14042059

/* creating tables for historical patent data - USPTO */

CREATE TABLE HistoricalPatentData(
applicationid int,
pubno varchar,
patentnumber varchar,
NBER int,
USPC varchar,
USPC_sub varchar,
applicationdate date,
prioritydate date,
pubdate date,
displaydate date,
disptype varchar,
exp_dt date,
exp_dt_max date,
pta int
);

\COPY historicalpatentdata FROM '/bulk/USPTO_Consolidated/HistoricalFiles/historical_masterfile.csv' DELIMITER AS ',' HEADER NULL AS '' CSV;

--COPY 11191813

Patent Data Processing - SQL Steps

2016-07-06T17:36:31Z

RavaliKruthiventi: /* Sample Insert and Copy Statements */

[[Category:Internal]]
[[Internal Classification::Legacy| ]]

== Objective ==
The McNair Center owns two sets of patent data. One set is inherited from the Harvard Dataverse and is stored in the database '''patentdata'''. The other is generated by crawlers pulling data from the USPTO website and is stored in the database '''patent_2015'''.

We are now merging and cleaning the two data sets and storing them in a schema that is an amalgamation of the two underlying schemas for the citations, assignees, and patents tables. The destination database is '''allpatent'''.

== Assignees Data==

The schema for the assignees table in '''patentdata''' database is:

Column | Type | Modifiers
-------------+-------------------+-----------
patent | integer |
asgtype | integer |
assignee | character varying |
city | character varying |
state | character varying |
country | character varying |
nationality | character varying |
residence | character varying |
asgseq | integer |

The schema for the assignees table in the '''patent_2015''' database is:

Column | Type | Modifiers
--------------+---------+-----------
lastname | text |
firstname | text |
orgname | text |
city | text |
country | text |
patentcountry | text |
patentnumber | integer |
state | text |
address | text |
postcode | text |

Some columns overlap between the two schemas, and some do not.

'''Overlapping Columns'''

patent_2015 | patentdata
--------------+--------------
orgname | assignee
city | city
country | country
patentnumber | patent
state | state

These columns will have entries for most rows in the table, because they exist in both tables. The rest of the columns will be populated based on which table the row is coming from.

'''Final Schema'''

Table "public.assignees"
Column | Type | Modifiers
---------------+-------------------+-----------
lastname | character varying |
firstname | character varying |
address | character varying |
postcode | character varying |
orgname | character varying |
city | character varying |
country | character varying |
patentnumber | integer |
state | character varying |
patentcountry | character varying |
nationality2 | character varying |
residence | character varying |
asgseq | integer |
asgtype | integer |

'''Non-overlapping Columns'''
These are the columns that belong to only one of the two assignees tables. To help users understand which table a row came from, the following insert rules were followed:

*For columns of type int, the value -1 was inserted.
*For columns of type string (character varying), the string 'null' was inserted.

Therefore, if a row has appropriate values for orgname, state, city, etc., but 'null' values for lastname, firstname, address and postcode, the row came from the patentdata table.

==== Index ====
Since the table is relatively large, and is likely to be searched often, an index has been imposed on the table.

allpatent=# CREATE INDEX ON assignees (orgname);
CREATE INDEX

====Sample insert and copy commands ====
INSERT INTO assignees_merge
(
SELECT
'null',
'null',
'null',
'null',
a.assignee,
a.city,
a.country,
a.patent,
a.state,
'null',
a.nationality,
a.residence,
a.asgseq,
a.asgtype
FROM assignees a
);

INSERT INTO assignees_merge
(
SELECT
assignees.lastname,
assignees.firstname,
assignees.address,
assignees.postcode,
assignees.orgname,
assignees.city,
assignees.country,
assignees.patentnumber,
assignees.state,
assignees.patentcountry,
'null',
'null',
-1,
-1
FROM assignees
);

\COPY assignees_merge TO '/tmp/assignees_merge_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
\COPY assignees FROM '/tmp/assignees_merge_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
--1607724

\COPY assignees_merge TO '/tmp/assignees_merge_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
\COPY assignees FROM '/tmp/assignees_merge_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
--3818842

Note: The assignees table was updated on 6/23 to remove the 'null' strings and the -1 values.

==Patents ==

'''Patentdata Schema:'''

Column | Type | Modifiers
--------+-------------------+-----------
patent | integer |
kind | character varying |
claims | integer |
apptype | integer |
appnum | integer |
gdate | date |
gyear | integer |
appdate | date |
appyear | integer |

'''Patent_2015 Schema:'''

Column | Type | Modifiers
-------------------------------+---------+-----------
patentnumber | int | not null
kind | varchar |
grantdate | date |
type | varchar |
applicationnumber | varchar |
filingdate | date |
prioritydate | date |
prioritycountry | varchar |
prioritypatentnumber | varchar |
ussubclass | varchar |
maingroup | varchar |
subgroup | varchar |
cpcsubclass | varchar |
cpcmaingroup | varchar |
cpcsubgroup | varchar |
classificationnationalcountry | varchar |
classificationnationalclass | varchar |
title | varchar |
numberofclaims | int |
primaryexaminerfirstname | varchar |
primaryexaminerlastname | varchar |
primaryexaminerdepartment | varchar |
pctpatentnumber | varchar |
filename | varchar |

''' Overlapping Columns '''
patentdata | patent_2015
--------------+-------------
patent | patentnumber
kind | kind
claims | numberofclaims
apptype | type
appnum | applicationnumber
gdate | grantdate
appdate | filingdate

'''Combined Schema:'''

The final schema of the patents table is :

Column | Type | Modifiers
----------------------+-------------------+-----------
patent | integer | not null
grantdate | date |
prioritydate | date |
prioritycountry | character varying |
prioritypatentnumber | character varying |
cpcsubgroup | character varying |
pctpatentnumber | character varying |
claims | integer |
appnum | integer |
gyear | integer |
appdate | date |
appyear | integer |
nber | integer |
uspc | character varying |
uspc_sub | character varying |

From the total list of columns belonging to both the tables (patentdata and patent_2015), a few columns, most of them related to classification of patents, have been dropped since the data in the tables was not clean.

Additionally, three columns - nber, uspc, uspc_sub have been added from the historicalpatentdata, a table built from data downloaded from the USPTO Bulk Data Storage. The join was executed on the patent number.

Note: Columns were added and deleted through separate [[Patent Data Cleanup - June 2016 |scripts]], so the scripts below differ slightly from the final schema shown above.

==== Index and Key Creation ====
Patent numbers are distinct in this table, and are central to the rest of the fields in the table. A primary key can therefore be imposed on the column. Also, since a number of searches are likely to be conducted on this table, an index has been imposed as well.

Code:
ALTER TABLE patents ADD PRIMARY KEY (patentnumber);
-- RESULT : ALTER TABLE
allpatent=# CREATE UNIQUE INDEX patent_idx ON patents (patentnumber);

====Sample Insert and Copy Statements====
patentdata:
INSERT INTO patents_merged
(
SELECT
patent,
kind,
gdate,
'NULL',
'NULL',
NULL,
NULL,
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
-1,
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
claims,
apptype,
appnum,
gyear,
appdate,
appyear
FROM patents
);
-- RESULT : INSERT 0 3984771

patent_2015:
INSERT INTO patents_merged
(
SELECT
patentnumber,
kind,
grantdate,
type,
applicationnumber,
filingdate,
prioritydate,
prioritycountry,
prioritypatentnumber,
ussubclass,
maingroup,
subgroup,
cpcsubclass,
cpcmaingroup,
cpcsubgroup,
classificationnationalcountry,
classificationnationalclass,
title,
numberofclaims,
primaryexaminerfirstname,
primaryexaminerlastname,
primaryexaminerdepartment,
pctpatentnumber,
filename,
-1,
-1,
-1,
-1,
NULL,
-1
FROM patents
);
-- RESULT : INSERT 0 1646225

COPY SCRIPTS:
patentdata:
\COPY patents_merged TO '/tmp/merged_patents_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
--COPY 3984771

patent_2015:
\COPY patents_merged TO '/tmp/merged_patents_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
--COPY 1646225

PATENTS TABLE
\COPY patents FROM '/tmp/merged_patents_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
-- RESULT : COPY 3984771
\COPY patents FROM '/tmp/merged_patents_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
-- RESULT : COPY 1646225

====TESTING ====
select count(*) FROM (SELECT DISTINCT patentnumber FROM patents) AS t;
--RESULT: 5411151
--EXPECTED: 5426566

Some patent numbers appear more than once because the same patent is present in both patent_2015 and patentdata. The queries below find these duplicates and remove them:

SELECT COUNT(*), *
FROM patents
GROUP BY
patentnumber,
kind,
grantdate,
type,
applicationnumber,
filingdate,
prioritydate,
prioritycountry,
prioritypatentnumber,
ussubclass,
maingroup,
subgroup,
cpcsubclass,
cpcmaingroup,
cpcsubgroup,
classificationnationalcountry,
classificationnationalclass,
title,
numberofclaims,
primaryexaminerfirstname,
primaryexaminerlastname,
primaryexaminerdepartment,
pctpatentnumber,
filename,
claims,
apptype,
appnum,
gyear,
appdate,
appyear
HAVING COUNT(*) > 1;

SELECT patentnumber, count(*)
FROM patents
GROUP BY patentnumber
HAVING count(*)>1;
--7640598

SELECT *
FROM patents op
WHERE op.patentnumber IN
(
SELECT ip.patentnumber
FROM patents ip
GROUP BY ip.patentnumber
HAVING COUNT(*)>1
)
ORDER BY op.patentnumber;

(
SELECT *
INTO patentsCleaned
FROM patents op
WHERE op.patentnumber IN
(
SELECT ip.patentnumber
FROM patents ip
GROUP BY ip.patentnumber
HAVING COUNT(*)=1
)
ORDER BY op.patentnumber
)
--SELECT 5191306

INSERT INTO patentsCleaned(
SELECT *
FROM patents op
WHERE op.patentnumber IN
(
SELECT ip.patentnumber
FROM patents ip
GROUP BY ip.patentnumber
HAVING COUNT(*)>1
)
AND op.applicationnumber NOT LIKE 'NULL'
ORDER BY op.patentnumber
);

--219845

====TESTING:====
allpatent=# select count(*) from patentsCleaned;
count
---------
5411151
(1 row)

allpatent=# select count(*), patentnumber FROM patentsCleaned group by patentnumber having count(*) > 1;
count | patentnumber
-------+--------------
(0 rows)
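
The de-duplication rule used above can be tried on a toy table. The sketch below uses Python's built-in `sqlite3` as a stand-in for PostgreSQL, with made-up rows: it keeps every patent number that occurs once and, for numbers that occur more than once, only the copy whose applicationnumber is not the 'NULL' placeholder. It also checks that the two reported row counts add up to the final total.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patents (patentnumber INTEGER, applicationnumber TEXT);
INSERT INTO patents VALUES
    (1, 'NULL'),      -- present only in patentdata
    (2, '12000001'),  -- present only in patent_2015
    (3, 'NULL'),      -- present in both: patentdata copy
    (3, '12000002');  -- present in both: patent_2015 copy
""")

# Singletons are kept as they are; duplicated numbers keep only the non-placeholder row.
conn.execute("""
CREATE TABLE patentscleaned AS
SELECT * FROM patents
WHERE patentnumber IN
      (SELECT patentnumber FROM patents GROUP BY patentnumber HAVING COUNT(*) = 1)
UNION ALL
SELECT * FROM patents
WHERE patentnumber IN
      (SELECT patentnumber FROM patents GROUP BY patentnumber HAVING COUNT(*) > 1)
  AND applicationnumber <> 'NULL'
""")
cleaned = conn.execute(
    "SELECT patentnumber, applicationnumber FROM patentscleaned ORDER BY patentnumber"
).fetchall()

# Row counts reported above: singletons plus rows kept from the duplicated numbers.
singletons = 5191306
kept_from_duplicates = 219845
total_cleaned = singletons + kept_from_duplicates  # 5411151, the count in patentsCleaned
```

The total of 5411151 equals the number of distinct patent numbers counted before the cleanup, which is consistent with one row per patent surviving.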

== Citations==

In the citations table, we needed to define another function that converts a textual patent number into a number (bigint, since the patent numbers exceeded the range of regular integers).

To extract patent numbers that consist of digits only and ignore everything else:
CREATE OR REPLACE FUNCTION cleanpatno (text) RETURNS bigint AS $$
if ($_[0]) {
my $var=$_[0];
if ($var=~/^\d*$/) {return $var;}
return undef;
}
return undef;
$$ LANGUAGE plperl;
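
For readers less familiar with Perl, the behaviour of cleanpatno can be approximated in Python as follows. This is a sketch rather than a drop-in replacement: the name `clean_patent_number` is hypothetical, and it accepts ASCII digits only, which is slightly stricter than Perl's `\d`.

```python
import re
from typing import Optional


def clean_patent_number(raw: Optional[str]) -> Optional[int]:
    """Return the patent number as an int if raw is all digits, else None."""
    # Perl treats undef, the empty string, and "0" as false, so the outer
    # `if ($_[0])` in cleanpatno sends all three to NULL.
    if raw is None or raw == "" or raw == "0":
        return None
    # Only purely numeric strings pass; anything containing letters is dropped.
    if re.fullmatch(r"[0-9]+", raw):
        return int(raw)
    return None
```

With this rule a value such as "D123456" becomes NULL instead of raising a cast error, which is what lets the CAST in the merge query run over the whole column.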

'''patentdata schema:'''

Column | Type | Modifiers
------------+-------------------+-----------
patent | integer |
cit_date | date |
cit_name | character varying |
cit_kind | character varying |
cit_country | character varying |
citation | integer |
category | character varying |
citseq | integer |

SELECT patent as citingpatentnumber, citation AS citedpatentnumber
INTO citations_merged
FROM citations;
--SELECT 38452957

'''patent_2015 schema:'''
Column | Type | Modifiers
---------------------+---------+-----------
citingpatentnumber | integer |
citingpatentcountry | text |
citedpatentnumber | text |
citedpatentcountry | text |

SELECT CAST(citingpatentnumber AS bigint), CAST(cleanpatno( citedpatentnumber) AS bigint) as citedpatentnumber
INTO citations_merged
FROM citations;
-- RESULT : SELECT 59227881

'''Overlapping Columns'''

patent_2015 | patentdata |
---------------------+---------------+
citingpatentnumber | patent |
citedpatentnumber | citation |

''' Combined Schema:'''

Column | Type | Modifiers
--------------------+--------+-----------
citingpatentnumber | bigint |
citedpatentnumber | bigint |

Copy Statements:

patentdata:
\COPY citations_merged TO '/tmp/merged_citations_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
--COPY 38452957

patent_2015:
\COPY citations_merged TO '/tmp/merged_citations_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
-- RESULT : COPY 59227881

allpatent:
\COPY citations FROM '/tmp/merged_citations_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
--RESULT : COPY 59227881

\COPY citations FROM '/tmp/merged_citations_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
--RESULT: COPY 38452957

CLONING:
CREATE DATABASE allpatentsProcessed WITH TEMPLATE allpatent OWNER researcher;

== USPTO Consolidated Patent Data ==

Scripts:

/* creating patent data tables from : https://bulkdata.uspto.gov/data2/patent/maintenancefee/ */

CREATE TABLE PatentMaintenanceFee(
patentnumber varchar,
applicationnumber int,
smallentity varchar,
filingdate date,
grantissuedate date,
maintenancefeedate date,
maintenancefeecode varchar
);

\COPY PatentMaintenanceFee FROM '/bulk/USPTO_Consolidated/MaintFeeEvents_20160613.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
-- RESULT : COPY 14042059

/* creating tables for historical patent data - USPTO */

CREATE TABLE HistoricalPatentData(
applicationid int,
pubno varchar,
patentnumber varchar,
NBER int,
USPC varchar,
USPC_sub varchar,
applicationdate date,
prioritydate date,
pubdate date,
displaydate date,
disptype varchar,
exp_dt date,
exp_dt_max date,
pta int
);

\COPY historicalpatentdata FROM '/bulk/USPTO_Consolidated/HistoricalFiles/historical_masterfile.csv' DELIMITER AS ',' HEADER NULL AS '' CSV;

--COPY 11191813

Patent Data Processing - SQL Steps

2016-07-06T17:35:34Z

RavaliKruthiventi: /* Sample Insert and Copy Statements */

[[Category:Internal]]
[[Internal Classification::Legacy| ]]

== Objective ==
The McNair Center owns two sets of patent data: one inherited from the Harvard Dataverse, which is stored in the database '''patentdata''', and another generated by crawlers pulling data from the USPTO website, which is stored in the database '''patent_2015'''.

We are now merging and cleaning the two data sets and storing them in a schema that is an amalgamation of the two underlying schemas for the citations, assignees, and patents tables. The destination database is '''allpatent'''.

== Assignees Data==

The schema for the assignees table in the '''patentdata''' database is:

Column | Type | Modifiers
-------------+-------------------+-----------
patent | integer |
asgtype | integer |
assignee | character varying |
city | character varying |
state | character varying |
country | character varying |
nationality | character varying |
residence | character varying |
asgseq | integer |

The schema for the assignees table in patent_2015 is:

Column | Type | Modifiers
--------------+---------+-----------
lastname | text |
firstname | text |
orgname | text |
city | text |
country | text |
patentcountry | text |
patentnumber | integer |
state | text |
address | text |
postcode | text |

To merge both schemas, we have some columns that overlap, and some columns that don't.

'''Overlapping Columns'''

patent_2015 | patentdata
--------------+--------------
orgname | assignee
city | city
country | country
patentnumber | patent
state | state

These columns will have entries for most rows in the table, because they exist in both tables. The rest of the columns will be populated based on which table the row is coming from.

'''Final Schema'''

Table "public.assignees"
Column | Type | Modifiers
---------------+-------------------+-----------
lastname | character varying |
firstname | character varying |
address | character varying |
postcode | character varying |
orgname | character varying |
city | character varying |
country | character varying |
patentnumber | integer |
state | character varying |
patentcountry | character varying |
nationality2 | character varying |
residence | character varying |
asgseq | integer |
asgtype | integer |

'''Non-overlapping Columns'''
These columns belong to only one of the two assignees tables. To help users understand which table a row came from, the following insert rules were followed:

*For integer columns, -1 was inserted.
*For string (character varying) columns, the string 'null' was inserted.

Therefore, if a row has proper values for orgname, state, city, etc., but 'null' values for lastname, firstname, address, and postcode, the row came from the patentdata table.
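
The placeholder rule can be turned into a small check that reports which source a merged row came from. This is an illustrative Python sketch; the function `source_of` is hypothetical, and the column groups are taken from the two INSERT statements in the sample commands for this table.

```python
# Columns that exist only in patent_2015; patentdata rows carry 'null' here.
PATENT_2015_ONLY = ("lastname", "firstname", "address", "postcode", "patentcountry")
# Columns that exist only in patentdata; patent_2015 rows carry 'null' or -1 here.
PATENTDATA_ONLY_TEXT = ("nationality2", "residence")
PATENTDATA_ONLY_INT = ("asgseq", "asgtype")


def source_of(row):
    """Guess the source database of a merged assignees row from its placeholders."""
    if all(row[col] == "null" for col in PATENT_2015_ONLY):
        return "patentdata"
    if (all(row[col] == "null" for col in PATENTDATA_ONLY_TEXT)
            and all(row[col] == -1 for col in PATENTDATA_ONLY_INT)):
        return "patent_2015"
    return "unknown"


# Made-up rows, for illustration only.
from_patentdata = {
    "lastname": "null", "firstname": "null", "address": "null", "postcode": "null",
    "patentcountry": "null", "nationality2": "US", "residence": "US",
    "asgseq": 1, "asgtype": 2,
}
from_patent_2015 = {
    "lastname": "Doe", "firstname": "Jane", "address": "1 Main St", "postcode": "77005",
    "patentcountry": "US", "nationality2": "null", "residence": "null",
    "asgseq": -1, "asgtype": -1,
}
```

Such a check only works while the placeholders are still in the table, that is, before any cleanup that replaces them with real NULL values.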

==== Index ====
Since the table is relatively large, and is likely to be searched often, an index has been imposed on the table.

allpatent=# CREATE INDEX ON assignees (orgname);
CREATE INDEX

====Sample insert and copy commands ====
INSERT INTO assignees_merge
(
SELECT
'null',
'null',
'null',
'null',
a.assignee,
a.city,
a.country,
a.patent,
a.state,
'null',
a.nationality,
a.residence,
a.asgseq,
a.asgtype
FROM assignees a
);

INSERT INTO assignees_merge
(
SELECT
assignees.lastname,
assignees.firstname,
assignees.address,
assignees.postcode,
assignees.orgname,
assignees.city,
assignees.country,
assignees.patentnumber,
assignees.state,
assignees.patentcountry,
'null',
'null',
-1,
-1
FROM assignees
);

\COPY assignees_merge TO '/tmp/assignees_merge_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
\COPY assignees FROM '/tmp/assignees_merge_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
--1607724

\COPY assignees_merge TO '/tmp/assignees_merge_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
\COPY assignees FROM '/tmp/assignees_merge_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
--3818842

Note: The assignees table was updated on 6/23 to remove the 'null' strings and the -1 values.
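
The statements used for that update are not recorded here. One way such a cleanup could be written is with NULLIF, which exists in both PostgreSQL and SQLite. The sketch below uses Python's built-in `sqlite3` with made-up rows and a reduced set of columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE assignees (orgname TEXT, lastname TEXT, asgseq INTEGER)")
conn.executemany(
    "INSERT INTO assignees VALUES (?, ?, ?)",
    [("Acme", "null", 1), ("null", "Smith", -1)],
)

# NULLIF(a, b) returns NULL when a equals b, and a otherwise.
conn.execute("""
UPDATE assignees
SET orgname  = NULLIF(orgname, 'null'),
    lastname = NULLIF(lastname, 'null'),
    asgseq   = NULLIF(asgseq, -1)
""")
rows = conn.execute(
    "SELECT orgname, lastname, asgseq FROM assignees ORDER BY rowid"
).fetchall()
```

A genuine value of -1 or a literal 'null' in the source data would also be turned into NULL by this approach, so it is only safe where those values cannot occur naturally.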

==Patents ==

'''Patentdata Schema:'''

Column | Type | Modifiers
--------+-------------------+-----------
patent | integer |
kind | character varying |
claims | integer |
apptype | integer |
appnum | integer |
gdate | date |
gyear | integer |
appdate | date |
appyear | integer |

'''Patent_2015 Schema:'''

Column | Type | Modifiers
-------------------------------+---------+-----------
patentnumber | int | not null
kind | varchar |
grantdate | date |
type | varchar |
applicationnumber | varchar |
filingdate | date |
prioritydate | date |
prioritycountry | varchar |
prioritypatentnumber | varchar |
ussubclass | varchar |
maingroup | varchar |
subgroup | varchar |
cpcsubclass | varchar |
cpcmaingroup | varchar |
cpcsubgroup | varchar |
classificationnationalcountry | varchar |
classificationnationalclass | varchar |
title | varchar |
numberofclaims | int |
primaryexaminerfirstname | varchar |
primaryexaminerlastname | varchar |
primaryexaminerdepartment | varchar |
pctpatentnumber | varchar |
filename | varchar |

'''Overlapping Columns'''
patentdata | patent_2015
--------------+-------------
patent | patentnumber
kind | kind
claims | numberofclaims
apptype | type
appnum | applicationnumber
gdate | grantdate
appdate | filingdate

'''Combined Schema:'''

The final schema of the patents table is:

Column | Type | Modifiers
----------------------+-------------------+-----------
patent | integer | not null
grantdate | date |
prioritydate | date |
prioritycountry | character varying |
prioritypatentnumber | character varying |
cpcsubgroup | character varying |
pctpatentnumber | character varying |
claims | integer |
appnum | integer |
gyear | integer |
appdate | date |
appyear | integer |
nber | integer |
uspc | character varying |
uspc_sub | character varying |

Several columns from the two source tables (patentdata and patent_2015), most of them related to patent classification, were dropped because their data was not clean.

Additionally, three columns (nber, uspc, and uspc_sub) were added from historicalpatentdata, a table built from data downloaded from the USPTO Bulk Data Storage. The join was executed on the patent number.

Note: Columns were added and dropped through separate [[Patent Data Cleanup - June 2016 |scripts]], so the scripts below do not match the final schema exactly.

==== Index and Key Creation ====
Patent numbers are distinct in this table and identify each row, so a primary key can be imposed on that column. Since this table is likely to be searched often, a unique index has been created on the column as well.

Code:
ALTER TABLE patents ADD PRIMARY KEY (patentnumber);
-- RESULT : ALTER TABLE
allpatent=# CREATE UNIQUE INDEX patent_idx ON patents (patentnumber);

====Sample Insert and Copy Statements====
patentdata:
INSERT INTO patents_merged
(
SELECT
patent,
kind,
gdate,
'NULL',
'NULL',
NULL,
NULL,
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
-1,
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
claims,
apptype,
appnum,
gyear,
appdate,
appyear
FROM patents
);
-- RESULT : INSERT 0 3984771

patent_2015:
INSERT INTO patents_merged
(
SELECT
patentnumber,
kind,
grantdate,
type,
applicationnumber,
filingdate,
prioritydate,
prioritycountry,
prioritypatentnumber,
ussubclass,
maingroup,
subgroup,
cpcsubclass,
cpcmaingroup,
cpcsubgroup,
classificationnationalcountry,
classificationnationalclass,
title,
numberofclaims,
primaryexaminerfirstname,
primaryexaminerlastname,
primaryexaminerdepartment,
pctpatentnumber,
filename,
-1,
-1,
-1,
-1,
NULL,
-1
FROM patents
);
-- RESULT : INSERT 0 1646225

COPY SCRIPTS:
patentdata:
\COPY patents_merged TO '/tmp/merged_patents_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
--COPY 3984771

patent_2015:
\COPY patents_merged TO '/tmp/merged_patents_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
--COPY 1646225

PATENTS TABLE
\COPY patents FROM '/tmp/merged_patents_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
-- RESULT : COPY 3984771
\COPY patents FROM '/tmp/merged_patents_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
-- RESULT : COPY 1646225

====TESTING ====
select count(*) FROM (SELECT DISTINCT patentnumber FROM patents) AS t;
--RESULT: 5411151
--EXPECTED: 5426566

Some patent numbers appear more than once because the same patent is present in both patent_2015 and patentdata. The queries below find these duplicates and remove them:

SELECT COUNT(*), *
FROM patents
GROUP BY
patentnumber,
kind,
grantdate,
type,
applicationnumber,
filingdate,
prioritydate,
prioritycountry,
prioritypatentnumber,
ussubclass,
maingroup,
subgroup,
cpcsubclass,
cpcmaingroup,
cpcsubgroup,
classificationnationalcountry,
classificationnationalclass,
title,
numberofclaims,
primaryexaminerfirstname,
primaryexaminerlastname,
primaryexaminerdepartment,
pctpatentnumber,
filename,
claims,
apptype,
appnum,
gyear,
appdate,
appyear
HAVING COUNT(*) > 1;

SELECT patentnumber, count(*)
FROM patents
GROUP BY patentnumber
HAVING count(*)>1;
--7640598

SELECT *
FROM patents op
WHERE op.patentnumber IN
(
SELECT ip.patentnumber
FROM patents ip
GROUP BY ip.patentnumber
HAVING COUNT(*)>1
)
ORDER BY op.patentnumber;

(
SELECT *
INTO patentsCleaned
FROM patents op
WHERE op.patentnumber IN
(
SELECT ip.patentnumber
FROM patents ip
GROUP BY ip.patentnumber
HAVING COUNT(*)=1
)
ORDER BY op.patentnumber
)
--SELECT 5191306

INSERT INTO patentsCleaned(
SELECT *
FROM patents op
WHERE op.patentnumber IN
(
SELECT ip.patentnumber
FROM patents ip
GROUP BY ip.patentnumber
HAVING COUNT(*)>1
)
AND op.applicationnumber NOT LIKE 'NULL'
ORDER BY op.patentnumber
);

--219845

====TESTING:====
allpatent=# select count(*) from patentsCleaned;
count
---------
5411151
(1 row)

allpatent=# select count(*), patentnumber FROM patentsCleaned group by patentnumber having count(*) > 1;
count | patentnumber
-------+--------------
(0 rows)

== Citations==

In the citations table, we needed to define another function that converts a textual patent number into a number (bigint, since the patent numbers exceeded the range of regular integers).

To extract patent numbers that consist of digits only and ignore everything else:
CREATE OR REPLACE FUNCTION cleanpatno (text) RETURNS bigint AS $$
if ($_[0]) {
my $var=$_[0];
if ($var=~/^\d*$/) {return $var;}
return undef;
}
return undef;
$$ LANGUAGE plperl;

'''patentdata schema:'''

Column | Type | Modifiers
------------+-------------------+-----------
patent | integer |
cit_date | date |
cit_name | character varying |
cit_kind | character varying |
cit_country | character varying |
citation | integer |
category | character varying |
citseq | integer |

SELECT patent as citingpatentnumber, citation AS citedpatentnumber
INTO citations_merged
FROM citations;
--SELECT 38452957

'''patent_2015 schema:'''
Column | Type | Modifiers
---------------------+---------+-----------
citingpatentnumber | integer |
citingpatentcountry | text |
citedpatentnumber | text |
citedpatentcountry | text |

SELECT CAST(citingpatentnumber AS bigint), CAST(cleanpatno( citedpatentnumber) AS bigint) as citedpatentnumber
INTO citations_merged
FROM citations;
-- RESULT : SELECT 59227881

'''Overlapping Columns'''

patent_2015 | patentdata |
---------------------+---------------+
citingpatentnumber | patent |
citedpatentnumber | citation |

''' Combined Schema:'''

Column | Type | Modifiers
--------------------+--------+-----------
citingpatentnumber | bigint |
citedpatentnumber | bigint |

Copy Statements:

patentdata:
\COPY citations_merged TO '/tmp/merged_citations_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
--COPY 38452957

patent_2015:
\COPY citations_merged TO '/tmp/merged_citations_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
-- RESULT : COPY 59227881

allpatent:
\COPY citations FROM '/tmp/merged_citations_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
--RESULT : COPY 59227881

\COPY citations FROM '/tmp/merged_citations_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
--RESULT: COPY 38452957

CLONING:
CREATE DATABASE allpatentsProcessed WITH TEMPLATE allpatent OWNER researcher;

== USPTO Consolidated Patent Data ==

Scripts:

/* creating patent data tables from : https://bulkdata.uspto.gov/data2/patent/maintenancefee/ */

CREATE TABLE PatentMaintenanceFee(
patentnumber varchar,
applicationnumber int,
smallentity varchar,
filingdate date,
grantissuedate date,
maintenancefeedate date,
maintenancefeecode varchar
);

\COPY PatentMaintenanceFee FROM '/bulk/USPTO_Consolidated/MaintFeeEvents_20160613.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
-- RESULT : COPY 14042059

/* creating tables for historical patent data - USPTO */

CREATE TABLE HistoricalPatentData(
applicationid int,
pubno varchar,
patentnumber varchar,
NBER int,
USPC varchar,
USPC_sub varchar,
applicationdate date,
prioritydate date,
pubdate date,
displaydate date,
disptype varchar,
exp_dt date,
exp_dt_max date,
pta int
);

\COPY historicalpatentdata FROM '/bulk/USPTO_Consolidated/HistoricalFiles/historical_masterfile.csv' DELIMITER AS ',' HEADER NULL AS '' CSV;

--COPY 11191813

Patent Data Processing - SQL Steps

2016-07-06T17:33:36Z

RavaliKruthiventi: /* Assignees Data */

Patent Data Processing - SQL Steps

2016-07-06T17:30:46Z

RavaliKruthiventi: /* Assignees Data */

[[Category:Internal]]
[[Internal Classification::Legacy| ]]

== Objective ==
The McNair Center owns two sets of patent data. One set is inherited from the Harvard Dataverse and is stored in the database '''patentdata'''. The other is generated by crawlers pulling data from the USPTO website and is stored in the database '''patent_2015'''.

We are now merging and cleaning the two data sets, and storing them in a schema that is an amalgamation of the two underlying schemas for the citations, assignees, and patents tables. The destination database is '''allpatent'''.

== Assignees Data==

The schema for the assignees table in '''patentdata''' database is:

Column | Type | Modifiers
-------------+-------------------+-----------
patent | integer |
asgtype | integer |
assignee | character varying |
city | character varying |
state | character varying |
country | character varying |
nationality | character varying |
residence | character varying |
asgseq | integer |

The schema for the assignees table in '''patent_2015''' is:

Column | Type | Modifiers
--------------+---------+-----------
lastname | text |
firstname | text |
orgname | text |
city | text |
country | text |
patentcountry | text |
patentnumber | integer |
state | text |
address | text |
postcode | text |

The two schemas share some columns, while other columns exist in only one of them.

'''Overlapping Columns'''

patent_2015 | patentdata
--------------+--------------
orgname | assignee
city | city
country | country
patentnumber | patent
state | state

These columns will have entries for most rows in the table, because they exist in both tables. The rest of the columns will be populated based on which table the row is coming from.

'''Final Schema'''
Table "public.assignees"
Column | Type | Modifiers
---------------+-------------------+-----------
lastname | character varying |
firstname | character varying |
address | character varying |
postcode | character varying |
orgname | character varying |
city | character varying |
country | character varying |
patentnumber | integer |
state | character varying |
patentcountry | character varying |
nationality2 | character varying |
residence | character varying |
asgseq | integer |
asgtype | integer |

'''Non-overlapping Columns'''
These are the columns that belong to only one of the two assignees tables. To help users understand which source a row came from, the following insert rules have been followed:

*For columns of type int, the value -1 has been inserted.
*For columns of type string (character varying), the string 'null' has been inserted.

Therefore, if a row has appropriate values for orgname, state, city, etc., but 'null' values for lastname, firstname, address and postcode, the row has come from the patentdata table.
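
The fill rule above can be sketched in Python. This is only an illustration of the rule, not the script that was actually run: the helper names (<code>to_merged_row</code>, <code>guess_source</code>, <code>PATENTDATA_RENAMES</code>) are hypothetical, and the mapping of nationality to nationality2 is assumed from the column order of the insert statements further down.

```python
# Column order of the merged assignees table, as listed in the final schema.
MERGED_COLUMNS = [
    "lastname", "firstname", "address", "postcode", "orgname", "city",
    "country", "patentnumber", "state", "patentcountry", "nationality2",
    "residence", "asgseq", "asgtype",
]
INT_COLUMNS = {"patentnumber", "asgseq", "asgtype"}

# Hypothetical mapping: patentdata column name -> merged column name.
PATENTDATA_RENAMES = {
    "assignee": "orgname",
    "patent": "patentnumber",
    "nationality": "nationality2",  # assumed from the insert column order
}


def to_merged_row(row, renames=None):
    """Lay a source row out in merged order, filling missing columns
    with -1 (integer columns) or the string 'null' (text columns)."""
    renames = renames or {}
    renamed = {renames.get(name, name): value for name, value in row.items()}
    return [
        renamed.get(col, -1 if col in INT_COLUMNS else "null")
        for col in MERGED_COLUMNS
    ]


def guess_source(merged_row):
    """Apply the rule from the text: 'null' in all four person/address
    columns means the row came from patentdata."""
    record = dict(zip(MERGED_COLUMNS, merged_row))
    person_cols = ("lastname", "firstname", "address", "postcode")
    if all(record[col] == "null" for col in person_cols):
        return "patentdata"
    return "patent_2015"
```

The sentinel approach keeps the source visible without adding a column, at the cost of mixing placeholder values into real data, which is presumably why they were removed again later.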

==== Index ====
Since the table is relatively large and is likely to be searched often, an index has been created on the orgname column.

allpatent=# CREATE INDEX ON assignees (orgname);
CREATE INDEX

====Sample insert and copy commands ====
patentdata:
INSERT INTO assignees_merge
(
SELECT
'null',
'null',
'null',
'null',
a.assignee,
a.city,
a.country,
a.patent,
a.state,
'null',
a.nationality,
a.residence,
a.asgseq,
a.asgtype
FROM assignees a
);

patent_2015:
INSERT INTO assignees_merge
(
SELECT
assignees.lastname,
assignees.firstname,
assignees.address,
assignees.postcode,
assignees.orgname,
assignees.city,
assignees.country,
assignees.patentnumber,
assignees.state,
assignees.patentcountry,
'null',
'null',
-1,
-1
FROM assignees
);

\COPY assignees_merge TO '/tmp/assignees_merge_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
\COPY assignees FROM '/tmp/assignees_merge_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
--1607724

\COPY assignees_merge TO '/tmp/assignees_merge_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
\COPY assignees FROM '/tmp/assignees_merge_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
--3818842

Note : The assignees table was updated on 6/23 to remove the 'null' string and the '-1' values.

==Patents ==

'''Patentdata Schema:'''

Column | Type | Modifiers
--------+-------------------+-----------
patent | integer |
kind | character varying |
claims | integer |
apptype | integer |
appnum | integer |
gdate | date |
gyear | integer |
appdate | date |
appyear | integer |

'''Patent_2015 Schema:'''

Column | Type | Modifiers
-------------------------------+---------+-----------
patentnumber | int | not null
kind | varchar |
grantdate | date |
type | varchar |
applicationnumber | varchar |
filingdate | date |
prioritydate | date |
prioritycountry | varchar |
prioritypatentnumber | varchar |
ussubclass | varchar |
maingroup | varchar |
subgroup | varchar |
cpcsubclass | varchar |
cpcmaingroup | varchar |
cpcsubgroup | varchar |
classificationnationalcountry | varchar |
classificationnationalclass | varchar |
title | varchar |
numberofclaims | int |
primaryexaminerfirstname | varchar |
primaryexaminerlastname | varchar |
primaryexaminerdepartment | varchar |
pctpatentnumber | varchar |
filename | varchar |

''' Overlapping Columns '''
patentdata | patent_2015
--------------+-------------
patent | patentnumber
kind | kind
claims | numberofclaims
apptype | type
appnum | applicationnumber
gdate | grantdate
appdate | filingdate

'''Combined Schema:'''

The final schema of the patents table is :

Column | Type | Modifiers
----------------------+-------------------+-----------
patent | integer | not null
grantdate | date |
prioritydate | date |
prioritycountry | character varying |
prioritypatentnumber | character varying |
cpcsubgroup | character varying |
pctpatentnumber | character varying |
claims | integer |
appnum | integer |
gyear | integer |
appdate | date |
appyear | integer |
nber | integer |
uspc | character varying |
uspc_sub | character varying |

From the total list of columns belonging to both the tables (patentdata and patent_2015), a few columns, most of them related to classification of patents, have been dropped since the data in the tables was not clean.

Additionally, three columns (nber, uspc, and uspc_sub) have been added from historicalpatentdata, a table built from data downloaded from the USPTO Bulk Data Storage. The join was executed on the patent number.
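
A join of this kind can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the query that was run: it uses SQLite rather than PostgreSQL, the two tables are cut down to a few columns, and the rows are made-up values. The cast is needed because patentnumber is stored as text in historicalpatentdata; note that in PostgreSQL such a cast raises an error on non-numeric values, so those would have to be filtered out first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patents (patent INTEGER PRIMARY KEY, claims INTEGER);
CREATE TABLE historicalpatentdata (
    patentnumber TEXT, nber INTEGER, uspc TEXT, uspc_sub TEXT
);
-- Made-up sample rows, for illustration only.
INSERT INTO patents VALUES (1001, 5), (1002, 20);
INSERT INTO historicalpatentdata VALUES ('1002', 14, '600', '407');
""")

# LEFT JOIN keeps every patent, with NULLs where no historical row exists.
rows = conn.execute("""
SELECT p.patent, p.claims, h.nber, h.uspc, h.uspc_sub
FROM patents p
LEFT JOIN historicalpatentdata h
  ON CAST(h.patentnumber AS INTEGER) = p.patent
ORDER BY p.patent
""").fetchall()
```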

Note: Columns were added and deleted through separate [[Patent Data Cleanup - June 2016 |scripts]], so the scripts below differ slightly from the final schema shown above.

==== Index and Key Creation ====
Patent numbers are distinct in this table, and are central to the rest of the fields in the table. A primary key can therefore be imposed on the column. Since a number of searches are likely to be conducted on this table, a unique index has been created as well, although in PostgreSQL the primary key constraint already creates a unique index on the same column.

Code:
ALTER TABLE patents ADD PRIMARY KEY (patentnumber);
-- RESULT : ALTER TABLE
allpatent=# CREATE UNIQUE INDEX patent_idx ON patents (patentnumber);

====Sample Insert and Copy Statements====
patentdata:
INSERT INTO patents_merged
(
SELECT
patent,
kind,
gdate,
'NULL',
'NULL',
NULL,
NULL,
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
-1,
'NULL',
'NULL',
'NULL',
'NULL',
'NULL',
claims,
apptype,
appnum,
gyear,
appdate,
appyear
FROM patents
);
-- RESULT : INSERT 0 3984771

patent_2015:
INSERT INTO patents_merged
(
SELECT
patentnumber,
kind,
grantdate,
type,
applicationnumber,
filingdate,
prioritydate,
prioritycountry,
prioritypatentnumber,
ussubclass,
maingroup,
subgroup,
cpcsubclass,
cpcmaingroup,
cpcsubgroup,
classificationnationalcountry,
classificationnationalclass,
title,
numberofclaims,
primaryexaminerfirstname,
primaryexaminerlastname,
primaryexaminerdepartment,
pctpatentnumber,
filename,
-1,
-1,
-1,
-1,
NULL,
-1
FROM patents
);
-- RESULT : INSERT 0 1646225

COPY SCRIPTS:
patentdata:
\COPY patents_merged TO '/tmp/merged_patents_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
--COPY 3984771

patent_2015:
\COPY patents_merged TO '/tmp/merged_patents_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
--COPY 1646225

PATENTS TABLE
\COPY patents FROM '/tmp/merged_patents_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
-- RESULT : COPY 3984771
\COPY patents FROM '/tmp/merged_patents_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
-- RESULT : COPY 1646225

====TESTING ====
select count(*) FROM (SELECT DISTINCT patentnumber FROM patents) AS t;
--RESULT: 5411151
EXPECTED: 5426566

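We found duplicate rows for some patent numbers, which appear in both the patent_2015 and patentdata sources. The queries below locate them and then keep, for each duplicated patent number, only the row whose applicationnumber is not the 'NULL' placeholder.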

SELECT COUNT(*), *
FROM patents
GROUP BY
patentnumber,
kind,
grantdate,
type,
applicationnumber,
filingdate,
prioritydate,
prioritycountry,
prioritypatentnumber,
ussubclass,
maingroup,
subgroup,
cpcsubclass,
cpcmaingroup,
cpcsubgroup,
classificationnationalcountry,
classificationnationalclass,
title,
numberofclaims,
primaryexaminerfirstname,
primaryexaminerlastname,
primaryexaminerdepartment,
pctpatentnumber,
filename,
claims,
apptype,
appnum,
gyear,
appdate,
appyear
HAVING COUNT(*) > 1;

SELECT patentnumber, count(*)
FROM patents
GROUP BY patentnumber
HAVING count(*)>1;
--7640598

SELECT *
FROM patents op
WHERE op.patentnumber IN
(
SELECT ip.patentnumber
FROM patents ip
GROUP BY ip.patentnumber
HAVING COUNT(*)>1
)
ORDER BY op.patentnumber;

SELECT *
INTO patentsCleaned
FROM patents op
WHERE op.patentnumber IN
(
SELECT ip.patentnumber
FROM patents ip
GROUP BY ip.patentnumber
HAVING COUNT(*)=1
)
ORDER BY op.patentnumber;
--SELECT 5191306

INSERT INTO patentsCleaned(
SELECT *
FROM patents op
WHERE op.patentnumber IN
(
SELECT ip.patentnumber
FROM patents ip
GROUP BY ip.patentnumber
HAVING COUNT(*)>1
)
AND op.applicationnumber NOT LIKE 'NULL'
ORDER BY op.patentnumber
);

--219845

====TESTING:====
allpatent=# select count(*) from patentsCleaned;
count
---------
5411151
(1 row)

allpatent=# select count(*), patentnumber FROM patentsCleaned group by patentnumber having count(*) > 1;
count | patentnumber
-------+--------------
(0 rows)
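
The de-duplication rule used above can be restated as a small Python sketch. It assumes rows are plain dicts with the two relevant keys and uses made-up sample values; <code>clean_patents</code> is a hypothetical helper, not part of the original scripts. The reported counts are consistent with this two-step approach: 5191306 + 219845 = 5411151.

```python
from collections import Counter


def clean_patents(rows):
    """Keep rows with a unique patentnumber; for duplicated patent numbers
    keep only rows whose applicationnumber is not the 'NULL' placeholder."""
    counts = Counter(row["patentnumber"] for row in rows)
    cleaned = []
    for row in rows:
        appnum = row["applicationnumber"]
        if counts[row["patentnumber"]] == 1:
            cleaned.append(row)
        elif appnum is not None and appnum != "NULL":
            # Mirrors NOT LIKE 'NULL', which also drops true SQL NULLs.
            cleaned.append(row)
    return sorted(cleaned, key=lambda row: row["patentnumber"])


# Made-up sample: patent 2 exists in both sources.
sample = [
    {"patentnumber": 3, "applicationnumber": "87654321"},
    {"patentnumber": 1, "applicationnumber": "NULL"},
    {"patentnumber": 2, "applicationnumber": "NULL"},
    {"patentnumber": 2, "applicationnumber": "12345678"},
]
cleaned = clean_patents(sample)
```

If a duplicated patent number had two rows with real application numbers, both would survive; the final test query returning 0 rows indicates that this did not happen in the actual data.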

== Citations==

For the citations table, we needed to define another function that converts a textual patent number into a number (bigint, since some patent numbers exceed the range of regular integers).

To extract patent numbers that consist of digits only, and to ignore everything else:
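CREATE OR REPLACE FUNCTION cleanpatno (text) RETURNS bigint AS $$
# NULL, empty and '0' inputs are false in Perl and fall through to undef (SQL NULL).
if ($_[0]) {
my $var=$_[0];
# Keep the value only if it consists entirely of digits.
if ($var=~/^\d*$/) {return $var;}
return undef;
}
return undef;
$$ LANGUAGE plperl;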
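
For readers less familiar with Perl, the behaviour of the function above can be mirrored in Python. This is a sketch for illustration only and was not part of the original processing; it reproduces the detail that the input '0' yields no result, because the string "0" is false in Perl.

```python
import re
from typing import Optional

_DIGITS_ONLY = re.compile(r"[0-9]+")


def cleanpatno(text: Optional[str]) -> Optional[int]:
    """Return the patent number as an int if the text is all digits,
    otherwise None (the equivalent of SQL NULL)."""
    # Perl treats undef, '' and '0' as false, so all three give NULL.
    if not text or text == "0":
        return None
    if _DIGITS_ONLY.fullmatch(text):
        return int(text)
    return None
```

Values with letter prefixes, such as design or reissue patent numbers, therefore come out as NULL in the merged citations table rather than raising an error.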

'''patentdata schema:'''

Column | Type | Modifiers
------------+-------------------+-----------
patent | integer |
cit_date | date |
cit_name | character varying |
cit_kind | character varying |
cit_country | character varying |
citation | integer |
category | character varying |
citseq | integer |

SELECT patent as citingpatentnumber, citation AS citedpatentnumber
INTO citations_merged
FROM citations;
--SELECT 38452957

'''patent_2015 schema:'''
Column | Type | Modifiers
---------------------+---------+-----------
citingpatentnumber | integer |
citingpatentcountry | text |
citedpatentnumber | text |
citedpatentcountry | text |

SELECT CAST(citingpatentnumber AS bigint), CAST(cleanpatno( citedpatentnumber) AS bigint) as citedpatentnumber
INTO citations_merged
FROM citations;
-- RESULT : SELECT 59227881

'''Overlapping Columns'''

patent_2015 | patentdata |
---------------------+---------------+
citingpatentnumber | patent |
citedpatentnumber | citation |

''' Combined Schema:'''

Column | Type | Modifiers
--------------------+--------+-----------
citingpatentnumber | bigint |
citedpatentnumber | bigint |

Copy Statements:

patentdata:
\COPY citations_merged TO '/tmp/merged_citations_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
--COPY 38452957

patent_2015:
\COPY citations_merged TO '/tmp/merged_citations_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
-- RESULT : COPY 59227881

allpatent:
\COPY citations FROM '/tmp/merged_citations_export.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
--RESULT : COPY 59227881

\COPY citations FROM '/tmp/merged_citations_export1.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
--RESULT: COPY 38452957

CLONING:
CREATE DATABASE allpatentsProcessed WITH TEMPLATE allpatent OWNER researcher;

== USPTO Consolidated Patent Data ==

Scripts:

/* creating patent data tables from : https://bulkdata.uspto.gov/data2/patent/maintenancefee/ */

CREATE TABLE PatentMaintenanceFee(
patentnumber varchar,
applicationnumber int,
smallentity varchar,
filingdate date,
grantissuedate date,
maintenancefeedate date,
maintenancefeecode varchar
);

\COPY PatentMaintenanceFee FROM '/bulk/USPTO_Consolidated/MaintFeeEvents_20160613.txt' DELIMITER AS E'\t' HEADER NULL AS '' CSV;
-- RESULT : COPY 14042059

/* creating tables for historical patent data - USPTO */

CREATE TABLE HistoricalPatentData(
applicationid int,
pubno varchar,
patentnumber varchar,
NBER int,
USPC varchar,
USPC_sub varchar,
applicationdate date,
prioritydate date,
pubdate date,
displaydate date,
disptype varchar,
exp_dt date,
exp_dt_max date,
pta int
);

\COPY historicalpatentdata FROM '/bulk/USPTO_Consolidated/HistoricalFiles/historical_masterfile.csv' DELIMITER AS ',' HEADER NULL AS '' CSV;

--COPY 11191813

Patent Data Processing - SQL Steps

2016-07-06T17:30:08Z

RavaliKruthiventi: /* Assignees Data */

Bulk Patent Assignee Processing

2016-07-01T16:07:27Z

RavaliKruthiventi: /* DTD */

== USPTO Assignees Data ==

We would like to download and absorb data from this location on the USPTO website into our tables. The objective is to determine whether this dataset is better than the current version of our patent data (a combination of the data in the patent_2015 and patentdata databases).

== Steps Followed to Extract the Data ==

===Extracting Data from XML Files ===

All the historical USPTO data is available as XML files. Here is the tree structure for the XML files:

<patent-assignment>
+<assignment-record>
+<patent-assignors>
+<patent-assignees>
+<patent-properties>
</patent-assignment>

Each of the above internal nodes is mandatory, and is a logical grouping of information fields. Each node has a corresponding table created with more or less the same fields as the XML elements.

Corresponding tables are:
*assignment-record : assignment
*patent-assignors : assignors
*patent-assignees : assignees
*patent-properties : properties

Additionally, for each file that is downloaded, there are some associated specs. All of these are stored in the PatentAssignment table. Here is the data model diagram.

==== Assignment Records ====

The fields in the assignment record are:
* last_update_date
* purge_indicator
* recorded_date
* correspondent_name
* correspondent_address_1
* correspondent_address_2
* correspondent_address_3
* correspondent_address_4
* conveyance_text

Here is the corresponding XML that we are mapping:

<assignment-record>
  <reel-no>27132</reel-no>
  <frame-no>841</frame-no>
  <last-update-date>
    <date>20160122</date>
  </last-update-date>
  <purge-indicator>N</purge-indicator>
  <recorded-date>
    <date>20111027</date>
  </recorded-date>
  <page-count>2</page-count>
  <correspondent>
    <name>DOUGLAS B. MCKNIGHT</name>
    <address-1>595 MINER ROAD</address-1>
    <address-2>INTELLECTUAL PROPERTY &amp; STANDARDS</address-2>
    <address-3>CLEVELAND, OH 44143</address-3>
  </correspondent>
  <conveyance-text>ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS).</conveyance-text>
</assignment-record>
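
A minimal sketch of how such a record could be flattened into the fields listed above, using Python's standard xml.etree.ElementTree. The function name and the dictionary layout are assumptions made for illustration; reel_no and frame_no are included because the other tables use them as keys, even though they are not in the field list above.

```python
import xml.etree.ElementTree as ET

# Sample record from this page, with the ampersand escaped so that it parses.
SAMPLE = """
<assignment-record>
  <reel-no>27132</reel-no>
  <frame-no>841</frame-no>
  <last-update-date><date>20160122</date></last-update-date>
  <purge-indicator>N</purge-indicator>
  <recorded-date><date>20111027</date></recorded-date>
  <page-count>2</page-count>
  <correspondent>
    <name>DOUGLAS B. MCKNIGHT</name>
    <address-1>595 MINER ROAD</address-1>
    <address-2>INTELLECTUAL PROPERTY &amp; STANDARDS</address-2>
    <address-3>CLEVELAND, OH 44143</address-3>
  </correspondent>
  <conveyance-text>ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS).</conveyance-text>
</assignment-record>
"""

# Column name -> path of the XML element relative to <assignment-record>.
RECORD_FIELDS = {
    "reel_no": "reel-no",
    "frame_no": "frame-no",
    "last_update_date": "last-update-date/date",
    "purge_indicator": "purge-indicator",
    "recorded_date": "recorded-date/date",
    "correspondent_name": "correspondent/name",
    "correspondent_address_1": "correspondent/address-1",
    "correspondent_address_2": "correspondent/address-2",
    "correspondent_address_3": "correspondent/address-3",
    "correspondent_address_4": "correspondent/address-4",
    "conveyance_text": "conveyance-text",
}


def parse_assignment_record(node):
    """Return one dict per record; elements that are absent become None."""
    return {col: node.findtext(path) for col, path in RECORD_FIELDS.items()}


record = parse_assignment_record(ET.fromstring(SAMPLE))
```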

==== Assignors ====

Here are the columns in the assignors table:
* reel_no
* frame_no
* assignor_name
* execution_date

The corresponding XML node is :

<patent-assignors>
  <patent-assignor>
    <name>WALKER, MATTHEW J.</name>
    <execution-date>
      <date>20090512</date>
    </execution-date>
  </patent-assignor>
  <patent-assignor>
    <name>OLSZEWSKI, MARK E.</name>
    <execution-date>
      <date>20090512</date>
    </execution-date>
  </patent-assignor>
</patent-assignors>

==== Assignees ====

Here are the columns in the assignees table:

* reel_no
* frame_no
* assignee_name
* assignee_address_1
* assignee_address_2
* assignee_city
* assignee_state
* assignee_country
* assignee_postcode

The corresponding XML nodes are:

<patent-assignees>
  <patent-assignee>
    <name>KONINKLIJKE PHILIPS ELECTRONICS N V</name>
    <address-1>GROENEWOUDSEWEG 1</address-1>
    <city>EINDHOVEN</city>
    <country-name>NETHERLANDS</country-name>
    <postcode>5621 BA</postcode>
  </patent-assignee>
</patent-assignees>
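
The assignors and assignees tables both repeat reel_no and frame_no on every row. A sketch of how one row per assignor or assignee might be produced is shown here; the helper <code>child_rows</code> is hypothetical, and pairing this reel and frame number with these particular assignors and assignees is assumed purely for illustration.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<patent-assignment>
  <assignment-record>
    <reel-no>27132</reel-no>
    <frame-no>841</frame-no>
  </assignment-record>
  <patent-assignors>
    <patent-assignor>
      <name>WALKER, MATTHEW J.</name>
      <execution-date><date>20090512</date></execution-date>
    </patent-assignor>
    <patent-assignor>
      <name>OLSZEWSKI, MARK E.</name>
      <execution-date><date>20090512</date></execution-date>
    </patent-assignor>
  </patent-assignors>
  <patent-assignees>
    <patent-assignee>
      <name>KONINKLIJKE PHILIPS ELECTRONICS N V</name>
      <address-1>GROENEWOUDSEWEG 1</address-1>
      <city>EINDHOVEN</city>
      <country-name>NETHERLANDS</country-name>
      <postcode>5621 BA</postcode>
    </patent-assignee>
  </patent-assignees>
</patent-assignment>
"""

ASSIGNOR_FIELDS = {
    "assignor_name": "name",
    "execution_date": "execution-date/date",
}
ASSIGNEE_FIELDS = {
    "assignee_name": "name",
    "assignee_address_1": "address-1",
    "assignee_address_2": "address-2",
    "assignee_city": "city",
    "assignee_state": "state",
    "assignee_country": "country-name",
    "assignee_postcode": "postcode",
}


def child_rows(assignment, item_path, fields):
    """Build one row per repeated child, keyed by the parent's reel/frame."""
    key = {
        "reel_no": assignment.findtext("assignment-record/reel-no"),
        "frame_no": assignment.findtext("assignment-record/frame-no"),
    }
    rows = []
    for item in assignment.findall(item_path):
        row = dict(key)
        row.update({col: item.findtext(path) for col, path in fields.items()})
        rows.append(row)
    return rows


assignment = ET.fromstring(SAMPLE)
assignors = child_rows(assignment, "patent-assignors/patent-assignor", ASSIGNOR_FIELDS)
assignees = child_rows(assignment, "patent-assignees/patent-assignee", ASSIGNEE_FIELDS)
```

Optional elements that are missing from the XML, such as state in this sample, come back as None and would be stored as NULL.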

==== Patent Properties ====

Here are the columns in the properties table:

* reel_no
* frame_no
* documentid
* country
* kind
* filingdate
* invention_title

The corresponding XML segment would be:

<patent-properties>
  <patent-property>
    <document-id>
      <country>US</country>
      <doc-number>14143589</doc-number>
      <kind>X0</kind>
      <date>20131230</date>
    </document-id>
    <document-id>
      <country>US</country>
      <doc-number>20140260305</doc-number>
      <kind>A1</kind>
      <date>20140918</date>
    </document-id>
    <invention-title lang="en">LEAN AZIMUTHAL FLAME COMBUSTOR</invention-title>
  </patent-property>
</patent-properties>

Patent properties have a many-to-one relationship with assignments: one assignment can have more than one property.
Note: We are not sure what a document with kind 'X0' represents.
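
One possible way to flatten this node into the properties columns is sketched below. Several choices here are assumptions, not confirmed by the notes above: one row is emitted per document-id, the date element is stored as filingdate, and the reel and frame values passed in are placeholders because the sample does not say which assignment it belongs to.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<patent-properties>
  <patent-property>
    <document-id>
      <country>US</country>
      <doc-number>14143589</doc-number>
      <kind>X0</kind>
      <date>20131230</date>
    </document-id>
    <document-id>
      <country>US</country>
      <doc-number>20140260305</doc-number>
      <kind>A1</kind>
      <date>20140918</date>
    </document-id>
    <invention-title lang="en">LEAN AZIMUTHAL FLAME COMBUSTOR</invention-title>
  </patent-property>
</patent-properties>
"""


def property_rows(properties, reel_no, frame_no):
    """Emit one row per <document-id>, repeating the property's title."""
    rows = []
    for prop in properties.findall("patent-property"):
        title = prop.findtext("invention-title")
        for doc in prop.findall("document-id"):
            rows.append({
                "reel_no": reel_no,
                "frame_no": frame_no,
                "documentid": doc.findtext("doc-number"),
                "country": doc.findtext("country"),
                "kind": doc.findtext("kind"),
                "filingdate": doc.findtext("date"),  # assumed mapping
                "invention_title": title,
            })
    return rows


# Placeholder reel/frame values, for illustration only.
rows = property_rows(ET.fromstring(SAMPLE), reel_no="00000", frame_no="0000")
```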

==== Patent Assignment ====

Every XML file download has some fields associated with it, in addition to a number of patent assignment nodes.

Here are the columns in the table:

* reel_no
* frame_no
* action_key_code
* USPTO_Transaction_Date
* USPTO_Date_Produced
* version

Here is what the XML in a downloaded file looks like:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-assignments>
<us-patent-assignments date-produced="20131101" dtd-version="1.0">
  <action-key-code>DA</action-key-code>
  <transaction-date>
    <date>20160122</date>
  </transaction-date>
  <patent-assignments>
    +<patent-assignment>
    +<patent-assignment>
    +<patent-assignment>
    +<patent-assignment>
    +<patent-assignment>
    +<patent-assignment>
    .
    .
    .
  </patent-assignments>
</us-patent-assignments>
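
Because each download can contain a very large number of patent-assignment nodes, a streaming parse is one reasonable way to read the file-level fields without loading the whole file. The sketch below uses ElementTree's iterparse; the function name, the output keys, and the mapping of dtd-version to the version column are assumptions, and the sample uses empty patent-assignment elements as stand-ins.

```python
import io
import xml.etree.ElementTree as ET

SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-assignments>
<us-patent-assignments date-produced="20131101" dtd-version="1.0">
  <action-key-code>DA</action-key-code>
  <transaction-date><date>20160122</date></transaction-date>
  <patent-assignments>
    <patent-assignment/>
    <patent-assignment/>
  </patent-assignments>
</us-patent-assignments>
"""


def read_file_header(stream):
    """Return the file-level fields and the number of assignments seen."""
    header = {}
    n_assignments = 0
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start" and elem.tag == "us-patent-assignments":
            # Attributes are already available on the start event.
            header["uspto_date_produced"] = elem.get("date-produced")
            header["version"] = elem.get("dtd-version")  # assumed mapping
        elif event == "end" and elem.tag == "action-key-code":
            header["action_key_code"] = elem.text
        elif event == "end" and elem.tag == "transaction-date":
            header["uspto_transaction_date"] = elem.findtext("date")
        elif event == "end" and elem.tag == "patent-assignment":
            n_assignments += 1
            elem.clear()  # free the subtree once it has been processed
    return header, n_assignments


header, n_assignments = read_file_header(io.BytesIO(SAMPLE))
```

In a real loader, the branch that counts patent-assignment nodes is where each assignment would be split into its assignment, assignors, assignees and properties rows before being cleared.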

====DTD====
Here is the DTD specified by the USPTO, which shows, among other things, which fields are optional:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE us-patent-assignments [<!ELEMENT us-patent-assignments (action-key-code, transaction-date, patent-assignments)>
<!ATTLIST us-patent-assignments dtd-version CDATA #IMPLIED
date-produced CDATA #IMPLIED>
<!ELEMENT action-key-code (#PCDATA)>
<!ELEMENT transaction-date (date)>
<!ELEMENT patent-assignments (data-available-code | patent-assignment+)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT data-available-code (#PCDATA)>
<!ELEMENT patent-assignment (assignment-record, patent-assignors, patent-assignees, patent-properties)>
<!ELEMENT assignment-record (reel-no, frame-no, last-update-date, purge-indicator, recorded-date, page-count?, correspondent, conveyance-text)>
<!ELEMENT patent-assignors (patent-assignor+)>
<!ELEMENT patent-assignees (patent-assignee+)>
<!ELEMENT patent-properties (patent-property+)>
<!ELEMENT reel-no (#PCDATA)>
<!ELEMENT frame-no (#PCDATA)>
<!ELEMENT last-update-date (date)>
<!ELEMENT purge-indicator (#PCDATA)>
<!ELEMENT recorded-date (date)>
<!ELEMENT page-count (#PCDATA)>
<!ELEMENT correspondent (name, address-1?, address-2?, address-3?, address-4?)>
<!ELEMENT conveyance-text (#PCDATA)>
<!ELEMENT patent-assignor (name, execution-date?, date-acknowledged?)>
<!ELEMENT patent-assignee (name, address-1?, address-2?, city?, state?, country-name?, postcode?)>
<!ELEMENT patent-property (document-id*, invention-title?)>
<!ELEMENT name (#PCDATA)>
<!ATTLIST name name-type (natural | legal) #IMPLIED>
<!ELEMENT address-1 (#PCDATA)>
<!ELEMENT address-2 (#PCDATA)>
<!ELEMENT address-3 (#PCDATA)>
<!ELEMENT address-4 (#PCDATA)>
<!ELEMENT execution-date (date)>
<!ELEMENT date-acknowledged (date)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT country-name (#PCDATA)>
<!ELEMENT postcode (#PCDATA)>
<!ELEMENT document-id (country, doc-number, kind?, name?, date?)>
<!ELEMENT invention-title (#PCDATA | b | i | u | sup | sub)*>
<!ATTLIST invention-title id ID #IMPLIED
lang CDATA #REQUIRED>
<!ELEMENT country (#PCDATA)>
<!ELEMENT doc-number (#PCDATA)>
<!ELEMENT kind (#PCDATA)>

<!ELEMENT b (#PCDATA | i | u | smallcaps)*>

<!ELEMENT i (#PCDATA | b | u | smallcaps)*>

<!ELEMENT u (#PCDATA | b | i | smallcaps)*>
<!ATTLIST u style (single | double | dash | dots ) 'single' >

<!ELEMENT sup (#PCDATA | b | u | i)*>

<!ELEMENT sub (#PCDATA | b | u | i)*>

<!ELEMENT smallcaps (#PCDATA | b | u | i)*>
]>
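In the DTD above, a child followed by `?` or `*` may be absent, while an unmarked child is required. As a rough sketch of how that could be checked before loading a node, the table below was transcribed by hand from the DTD for a few elements. The names `REQUIRED_CHILDREN` and `missing_children` are hypothetical, and this is not a full DTD validator.

```python
import xml.etree.ElementTree as ET

# Required direct children per element, copied by hand from the DTD above.
# Children marked "?" or "*" are optional and therefore left out.
REQUIRED_CHILDREN = {
    "patent-assignment": ["assignment-record", "patent-assignors",
                          "patent-assignees", "patent-properties"],
    "assignment-record": ["reel-no", "frame-no", "last-update-date",
                          "purge-indicator", "recorded-date",
                          "correspondent", "conveyance-text"],
    "correspondent": ["name"],
    "patent-assignor": ["name"],
    "patent-assignee": ["name"],
    "document-id": ["country", "doc-number"],
}


def missing_children(elem):
    """Return the required child tags that `elem` lacks (direct children only)."""
    present = {child.tag for child in elem}
    return [tag for tag in REQUIRED_CHILDREN.get(elem.tag, []) if tag not in present]


complete = ET.fromstring(
    "<document-id><country>US</country><doc-number>14143589</doc-number></document-id>"
)
incomplete = ET.fromstring("<patent-assignee><city>EINDHOVEN</city></patent-assignee>")
print(missing_children(complete), missing_children(incomplete))
```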

===Inserting Extracted Data into Tables ===

===Clean Up ===

Bulk Patent Assignee Processing

2016-07-01T16:04:42Z

RavaliKruthiventi: /* DTD */

== USPTO Assignees Data ==

We would like to download and absorb data from this location on the USPTO website into our tables. The objective is to determine whether this dataset is better than the current version of our patent data (a combination of the data in the patent_2015 and patentdata databases).

== Steps Followed to Extract the Data ==

===Extracting Data from XML Files ===

All the historical USPTO data is available as XML files. Here is the tree structure for the XML files:

<patent-assignment>
+<assignment-record>
+<patent-assignors>
+<patent-assignees>
+<patent-properties>
</patent-assignment>

Each of the above internal nodes is mandatory, and is a logical grouping of information fields. Each node has a corresponding table created with more or less the same fields as the XML elements.

Corresponding tables are:
*assignment-record : assignment
*patent-assignors : assignors
*patent-assignees : assignees
*patent-properties : properties
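The node-to-table correspondence above can be written down as a small lookup. This is only a sketch: the names `NODE_TO_TABLE` and `split_by_table` are hypothetical, and the first key uses the element name `assignment-record` as it appears in the XML.

```python
import xml.etree.ElementTree as ET

# Child node of <patent-assignment> -> destination table, as listed above.
NODE_TO_TABLE = {
    "assignment-record": "assignment",
    "patent-assignors": "assignors",
    "patent-assignees": "assignees",
    "patent-properties": "properties",
}


def split_by_table(assignment):
    """Group the children of one <patent-assignment> by destination table."""
    return {
        NODE_TO_TABLE[child.tag]: child
        for child in assignment
        if child.tag in NODE_TO_TABLE
    }


assignment = ET.fromstring(
    "<patent-assignment>"
    "<assignment-record/><patent-assignors/>"
    "<patent-assignees/><patent-properties/>"
    "</patent-assignment>"
)
tables = split_by_table(assignment)
print(list(tables))
```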

Additionally, for each file that is downloaded, there are some associated specs. All of these are stored in the PatentAssignment table. Here is the data model diagram.

==== Assignment Records ====

The fields in the assignment record are:
* last_update_date
* purge_indicator
* recorded_date
* correspondent_name
* correspondent_address_1
* correspondent_address_2
* correspondent_address_3
* correspondent_address_4
* conveyance_text

Here is the corresponding XML that we are mapping:

-<assignment-record>
<reel-no>27132</reel-no>
<frame-no>841</frame-no>
-<last-update-date>
<date>20160122</date>
</last-update-date>
<purge-indicator>N</purge-indicator>
-<recorded-date>
<date>20111027</date>
</recorded-date>
<page-count>2</page-count>
-<correspondent>
<name>DOUGLAS B. MCKNIGHT</name>
<address-1>595 MINER ROAD</address-1>
<address-2>INTELLECTUAL PROPERTY & STANDARDS</address-2>
<address-3>CLEVELAND, OH 44143</address-3>
</correspondent>
<conveyance-text>ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS).</conveyance-text>
</assignment-record>
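The mapping from this node to the columns above could look roughly like the sketch below. Several things here are assumptions: the function names, the inclusion of `reel_no` and `frame_no` (not in the column list above, but the other tables are keyed by them), and that every date is a valid YYYYMMDD string. Note that a literal ampersand must be written as `&amp;` for the XML to parse.

```python
import xml.etree.ElementTree as ET
from datetime import datetime


def parse_date(text):
    """Convert a YYYYMMDD string to a date; return None if missing or empty."""
    # Assumes well-formed dates; a malformed value would raise ValueError.
    return datetime.strptime(text, "%Y%m%d").date() if text else None


def assignment_row(record):
    """Flatten one <assignment-record> element into a dict keyed by column name."""
    row = {
        "reel_no": record.findtext("reel-no"),
        "frame_no": record.findtext("frame-no"),
        "last_update_date": parse_date(record.findtext("last-update-date/date")),
        "purge_indicator": record.findtext("purge-indicator"),
        "recorded_date": parse_date(record.findtext("recorded-date/date")),
        "correspondent_name": record.findtext("correspondent/name"),
        "conveyance_text": record.findtext("conveyance-text"),
    }
    # address-1 .. address-4 are optional; absent ones come back as None.
    for n in range(1, 5):
        row[f"correspondent_address_{n}"] = record.findtext(f"correspondent/address-{n}")
    return row


SAMPLE_RECORD = """<assignment-record>
<reel-no>27132</reel-no>
<frame-no>841</frame-no>
<last-update-date><date>20160122</date></last-update-date>
<purge-indicator>N</purge-indicator>
<recorded-date><date>20111027</date></recorded-date>
<page-count>2</page-count>
<correspondent>
<name>DOUGLAS B. MCKNIGHT</name>
<address-1>595 MINER ROAD</address-1>
<address-2>INTELLECTUAL PROPERTY &amp; STANDARDS</address-2>
<address-3>CLEVELAND, OH 44143</address-3>
</correspondent>
<conveyance-text>ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS).</conveyance-text>
</assignment-record>"""

row = assignment_row(ET.fromstring(SAMPLE_RECORD))
print(row["recorded_date"], row["correspondent_address_2"])
```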

==== Assignors ====

Here are the columns in the assignors table:
* reel_no
* frame_no
* assignor_name
* execution_date

The corresponding XML node is:

-<patent-assignors>
-<patent-assignor>
<name>WALKER, MATTHEW J.</name>
-<execution-date>
<date>20090512</date>
</execution-date>
</patent-assignor>
-<patent-assignor>
<name>OLSZEWSKI, MARK E.</name>
-<execution-date>
<date>20090512</date>
</execution-date>
</patent-assignor>
</patent-assignors>
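Since the assignors table carries `reel_no` and `frame_no`, each assignor row presumably takes those values from the assignment record of the enclosing `<patent-assignment>`. A minimal sketch under that assumption follows; the function name `assignor_rows` is hypothetical.

```python
import xml.etree.ElementTree as ET


def assignor_rows(assignment):
    """Return one dict per <patent-assignor> inside a <patent-assignment>."""
    reel_no = assignment.findtext("assignment-record/reel-no")
    frame_no = assignment.findtext("assignment-record/frame-no")
    return [
        {
            "reel_no": reel_no,
            "frame_no": frame_no,
            "assignor_name": assignor.findtext("name"),
            # execution-date is optional in the DTD, so this may be None.
            "execution_date": assignor.findtext("execution-date/date"),
        }
        for assignor in assignment.iterfind("patent-assignors/patent-assignor")
    ]


SAMPLE_ASSIGNMENT = """<patent-assignment>
<assignment-record><reel-no>27132</reel-no><frame-no>841</frame-no></assignment-record>
<patent-assignors>
<patent-assignor>
<name>WALKER, MATTHEW J.</name>
<execution-date><date>20090512</date></execution-date>
</patent-assignor>
<patent-assignor>
<name>OLSZEWSKI, MARK E.</name>
<execution-date><date>20090512</date></execution-date>
</patent-assignor>
</patent-assignors>
</patent-assignment>"""

assignors = assignor_rows(ET.fromstring(SAMPLE_ASSIGNMENT))
for a in assignors:
    print(a)
```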

==== Assignees ====

Here are the columns in the assignees table:

* reel_no
* frame_no
* assignee_name
* assignee_address_1
* assignee_address_2
* assignee_city
* assignee_state
* assignee_country
* assignee_postcode

The corresponding XML nodes are:

-<patent-assignees>
-<patent-assignee>
<name>KONINKLIJKE PHILIPS ELECTRONICS N V</name>
<address-1>GROENEWOUDSEWEG 1</address-1>
<city>EINDHOVEN</city>
<country-name>NETHERLANDS</country-name>
<postcode>5621 BA</postcode>
</patent-assignee>
</patent-assignees>
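In the sample above there is no `<state>` or `<address-2>` element, so those columns would have to be left empty. One way to handle that is a column-to-tag lookup where missing tags become `None`, sketched below. The names `ASSIGNEE_COLUMNS` and `assignee_rows` are hypothetical, and taking `reel_no` and `frame_no` from the enclosing assignment record is an assumption.

```python
import xml.etree.ElementTree as ET

# Column name -> child tag of <patent-assignee>.
ASSIGNEE_COLUMNS = {
    "assignee_name": "name",
    "assignee_address_1": "address-1",
    "assignee_address_2": "address-2",
    "assignee_city": "city",
    "assignee_state": "state",
    "assignee_country": "country-name",
    "assignee_postcode": "postcode",
}


def assignee_rows(assignment):
    """Return one dict per <patent-assignee>; absent optional tags map to None."""
    key = {
        "reel_no": assignment.findtext("assignment-record/reel-no"),
        "frame_no": assignment.findtext("assignment-record/frame-no"),
    }
    rows = []
    for assignee in assignment.iterfind("patent-assignees/patent-assignee"):
        row = dict(key)
        for column, tag in ASSIGNEE_COLUMNS.items():
            row[column] = assignee.findtext(tag)
        rows.append(row)
    return rows


SAMPLE_ASSIGNMENT = """<patent-assignment>
<assignment-record><reel-no>27132</reel-no><frame-no>841</frame-no></assignment-record>
<patent-assignees>
<patent-assignee>
<name>KONINKLIJKE PHILIPS ELECTRONICS N V</name>
<address-1>GROENEWOUDSEWEG 1</address-1>
<city>EINDHOVEN</city>
<country-name>NETHERLANDS</country-name>
<postcode>5621 BA</postcode>
</patent-assignee>
</patent-assignees>
</patent-assignment>"""

assignees = assignee_rows(ET.fromstring(SAMPLE_ASSIGNMENT))
print(assignees[0])
```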

==== Patent Properties ====

Here are the columns in the properties table:

* reel_no
* frame_no
* documentid
* country
* kind
* filingdate
* invention_title

The corresponding XML segment would be:

-<patent-properties>
-<patent-property>
-<document-id>
<country>US</country>
<doc-number>14143589</doc-number>
<kind>X0</kind>
<date>20131230</date>
</document-id>
-<document-id>
<country>US</country>
<doc-number>20140260305</doc-number>
<kind>A1</kind>
<date>20140918</date>
</document-id>
<invention-title lang="en">LEAN AZIMUTHAL FLAME COMBUSTOR</invention-title>
</patent-property>
</patent-properties>

Patent properties have a many-to-one relationship: one patent can have more than one property.
Note: We are not sure what the kind code 'X0' denotes.
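One possible way to flatten this segment is to emit one row per `<document-id>`, repeating the invention title on each row. The sketch below does that. It assumes `documentid` comes from `<doc-number>` and `filingdate` from `<date>`, and the function name `property_rows` is hypothetical. The title is collected with `itertext()` because the DTD allows inline markup such as `<b>` or `<sub>` inside `<invention-title>`.

```python
import xml.etree.ElementTree as ET


def property_rows(assignment):
    """Return one dict per <document-id> under each <patent-property>."""
    reel_no = assignment.findtext("assignment-record/reel-no")
    frame_no = assignment.findtext("assignment-record/frame-no")
    rows = []
    for prop in assignment.iterfind("patent-properties/patent-property"):
        title_elem = prop.find("invention-title")
        # Join all text pieces so inline markup in the title is not lost.
        title = "".join(title_elem.itertext()) if title_elem is not None else None
        for doc in prop.iterfind("document-id"):
            rows.append({
                "reel_no": reel_no,
                "frame_no": frame_no,
                "documentid": doc.findtext("doc-number"),
                "country": doc.findtext("country"),
                "kind": doc.findtext("kind"),
                "filingdate": doc.findtext("date"),
                "invention_title": title,
            })
    return rows


SAMPLE_ASSIGNMENT = """<patent-assignment>
<assignment-record><reel-no>27132</reel-no><frame-no>841</frame-no></assignment-record>
<patent-properties>
<patent-property>
<document-id>
<country>US</country><doc-number>14143589</doc-number>
<kind>X0</kind><date>20131230</date>
</document-id>
<document-id>
<country>US</country><doc-number>20140260305</doc-number>
<kind>A1</kind><date>20140918</date>
</document-id>
<invention-title lang="en">LEAN AZIMUTHAL FLAME COMBUSTOR</invention-title>
</patent-property>
</patent-properties>
</patent-assignment>"""

properties = property_rows(ET.fromstring(SAMPLE_ASSIGNMENT))
for p in properties:
    print(p["documentid"], p["kind"], p["filingdate"])
```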

==== Patent Assignment ====

Every XML file download has some fields associated with it, in addition to a number of patent assignment nodes.

Here are the columns in the table:

* reel_no
* frame_no
* action_key_code
* USPTO_Transaction_Date
* USPTO_Date_Produced
* version

Here is what the XML in a downloaded file looks like:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-assignments>
-<us-patent-assignments date-produced="20131101" dtd-version="1.0">
<action-key-code>DA</action-key-code>
-<transaction-date>
<date>20160122</date>
</transaction-date>
-<patent-assignments>
+<patent-assignment>
+<patent-assignment>
+<patent-assignment>
+<patent-assignment>
+<patent-assignment>
+<patent-assignment>
.
.
.
</patent-assignments>
</us-patent-assignments>
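The file-level fields sit on the root element and its first two children. A small sketch of reading them follows. It assumes that `USPTO_Date_Produced` comes from the `date-produced` attribute and `version` from `dtd-version`, and the function name `file_header` is hypothetical.

```python
import xml.etree.ElementTree as ET


def file_header(root):
    """Collect the file-level fields from a <us-patent-assignments> root."""
    return {
        "action_key_code": root.findtext("action-key-code"),
        "USPTO_Transaction_Date": root.findtext("transaction-date/date"),
        # Both attributes are #IMPLIED in the DTD, so either may be None.
        "USPTO_Date_Produced": root.get("date-produced"),
        "version": root.get("dtd-version"),
    }


SAMPLE_FILE = """<us-patent-assignments date-produced="20131101" dtd-version="1.0">
<action-key-code>DA</action-key-code>
<transaction-date><date>20160122</date></transaction-date>
<patent-assignments>
<patent-assignment/>
</patent-assignments>
</us-patent-assignments>"""

header = file_header(ET.fromstring(SAMPLE_FILE))
print(header)
```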

====DTD====
Here is the DTD provided by the USPTO, which indicates which fields are optional:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE us-patent-assignments [<!ELEMENT us-patent-assignments (action-key-code, transaction-date, patent-assignments)>
<!ATTLIST us-patent-assignments dtd-version CDATA #IMPLIED
date-produced CDATA #IMPLIED>
<!ELEMENT action-key-code (#PCDATA)> 
<!ELEMENT transaction-date (date)> 
<!ELEMENT patent-assignments (data-available-code | patent-assignment+)> 
<!ELEMENT date (#PCDATA)> 
<!ELEMENT data-available-code (#PCDATA)> 
<!ELEMENT patent-assignment (assignment-record, patent-assignors, patent-assignees, patent-properties)> 
<!ELEMENT assignment-record (reel-no, frame-no, last-update-date, purge-indicator, recorded-date, page-count?, correspondent, conveyance-text)> 
<!ELEMENT patent-assignors (patent-assignor+)> 
<!ELEMENT patent-assignees (patent-assignee+)> 
<!ELEMENT patent-properties (patent-property+)> 
<!ELEMENT reel-no (#PCDATA)> 
<!ELEMENT frame-no (#PCDATA)> 
<!ELEMENT last-update-date (date)> 
<!ELEMENT purge-indicator (#PCDATA)> 
<!ELEMENT recorded-date (date)> 
<!ELEMENT page-count (#PCDATA)> 
<!ELEMENT correspondent (name, address-1?, address-2?, address-3?, address-4?)> 
<!ELEMENT conveyance-text (#PCDATA)> 
<!ELEMENT patent-assignor (name, execution-date?, date-acknowledged?)> 
<!ELEMENT patent-assignee (name, address-1?, address-2?, city?, state?, country-name?, postcode?)> 
<!ELEMENT patent-property (document-id*, invention-title?)> 
<!ELEMENT name (#PCDATA)> 
<!ATTLIST name name-type (natural | legal) #IMPLIED> 
<!ELEMENT address-1 (#PCDATA)> 
<!ELEMENT address-2 (#PCDATA)> 
<!ELEMENT address-3 (#PCDATA)> 
<!ELEMENT address-4 (#PCDATA)> 
<!ELEMENT execution-date (date)> 
<!ELEMENT date-acknowledged (date)> 
<!ELEMENT city (#PCDATA)> 
<!ELEMENT state (#PCDATA)> 
<!ELEMENT country-name (#PCDATA)> 
<!ELEMENT postcode (#PCDATA)> 
<!ELEMENT document-id (country, doc-number, kind?, name?, date?)> 
<!ELEMENT invention-title (#PCDATA | b | i | u | sup | sub)*> 
<!ATTLIST invention-title id ID #IMPLIED 
lang CDATA #REQUIRED> 
<!ELEMENT country (#PCDATA)> 
<!ELEMENT doc-number (#PCDATA)> 
<!ELEMENT kind (#PCDATA)> 
 
<!ELEMENT b (#PCDATA | i | u | smallcaps)*> 
 
<!ELEMENT i (#PCDATA | b | u | smallcaps)*> 
 
<!ELEMENT u (#PCDATA | b | i | smallcaps)*> 
<!ATTLIST u style (single | double | dash | dots ) 'single' > 
 
<!ELEMENT sup (#PCDATA | b | u | i)*> 
 
<!ELEMENT sub (#PCDATA | b | u | i)*> 
 
<!ELEMENT smallcaps (#PCDATA | b | u | i)*> 
]> 

===Inserting Extracted Data into Tables ===

===Clean Up ===

Bulk Patent Assignee Processing

2016-07-01T16:04:10Z

RavaliKruthiventi: /* DTD */

Bulk Patent Assignee Processing

2016-07-01T16:02:06Z

RavaliKruthiventi: /* DTD */

Bulk Patent Assignee Processing

2016-07-01T16:00:57Z

RavaliKruthiventi: /* DTD */

Bulk Patent Assignee Processing

2016-07-01T15:58:59Z

RavaliKruthiventi: /* Extracting Data from XML Files */

Bulk Patent Assignee Processing

2016-07-01T15:47:48Z

RavaliKruthiventi: /* Patent Properties */

== USPTO Assignees Data ==

We would like to download and absorb data from this location on the USPTO website into our tables. The objective is to determine whether this dataset is better than the current version of our patent data (a combination of the data in the patent_2015 and patentdata databases).

== Steps Followed to Extract the Data ==

===Extracting Data from XML Files ===

All the historical USPTO data is available as XML files. Here is the tree structure for the XML files:

<patent-assignment>
+<assignment-record>
+<patent-assignors>
+<patent-assignees>
+<patent-properties>
</patent-assignment>

Each of the above internal nodes is mandatory, and is a logical grouping of information fields. Each node has a corresponding table created with more or less the same fields as the XML elements.

Corresponding tables are:
*assignment-record : assignment
*patent-assignors : assignors
*patent-assignees : assignees
*patent-properties : properties
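
The node-to-table mapping above can be sketched in Python as follows. This is a minimal illustration rather than the loader we actually ran: the sample document is a hypothetical miniature file, and <code>NODE_TO_TABLE</code> and <code>tables_for</code> are names made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature file; real USPTO files hold many <patent-assignment> nodes.
SAMPLE = """<us-patent-assignments>
  <patent-assignments>
    <patent-assignment>
      <assignment-record><reel-no>27132</reel-no><frame-no>841</frame-no></assignment-record>
      <patent-assignors/>
      <patent-assignees/>
      <patent-properties/>
    </patent-assignment>
  </patent-assignments>
</us-patent-assignments>"""

# XML node -> table name, following the list above.
NODE_TO_TABLE = {
    "assignment-record": "assignment",
    "patent-assignors": "assignors",
    "patent-assignees": "assignees",
    "patent-properties": "properties",
}


def tables_for(assignment):
    """Return the tables fed by one <patent-assignment>, in document order."""
    return [NODE_TO_TABLE[child.tag] for child in assignment if child.tag in NODE_TO_TABLE]


root = ET.fromstring(SAMPLE)
for assignment in root.iter("patent-assignment"):
    print(tables_for(assignment))
```

Each <code>patent-assignment</code> node therefore contributes to all four tables.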

Additionally, for each file that is downloaded, there are some associated specs. All of these are stored in the PatentAssignment table. Here is the data model diagram.

==== Assignment Records ====

The fields in the assignment record are:
* last_update_date
* purge_indicator
* recorded_date
* correspondent_name
* correspondent_address_1
* correspondent_address_2
* correspondent_address_3
* correspondent_address_4
* conveyance_text

Here is the corresponding XML that we are mapping:

 <assignment-record>
   <reel-no>27132</reel-no>
   <frame-no>841</frame-no>
   <last-update-date>
     <date>20160122</date>
   </last-update-date>
   <purge-indicator>N</purge-indicator>
   <recorded-date>
     <date>20111027</date>
   </recorded-date>
   <page-count>2</page-count>
   <correspondent>
     <name>DOUGLAS B. MCKNIGHT</name>
     <address-1>595 MINER ROAD</address-1>
     <address-2>INTELLECTUAL PROPERTY &amp; STANDARDS</address-2>
     <address-3>CLEVELAND, OH 44143</address-3>
   </correspondent>
   <conveyance-text>ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS).</conveyance-text>
 </assignment-record>
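
As a rough sketch of this mapping, the record above can be flattened into the listed columns with Python's standard library. <code>FIELD_PATHS</code> and <code>record_to_row</code> are hypothetical names, and the paths are read off the sample XML rather than taken from our actual loading script.

```python
import xml.etree.ElementTree as ET

RECORD = """<assignment-record>
  <reel-no>27132</reel-no>
  <frame-no>841</frame-no>
  <last-update-date><date>20160122</date></last-update-date>
  <purge-indicator>N</purge-indicator>
  <recorded-date><date>20111027</date></recorded-date>
  <page-count>2</page-count>
  <correspondent>
    <name>DOUGLAS B. MCKNIGHT</name>
    <address-1>595 MINER ROAD</address-1>
    <address-2>INTELLECTUAL PROPERTY &amp; STANDARDS</address-2>
    <address-3>CLEVELAND, OH 44143</address-3>
  </correspondent>
  <conveyance-text>ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS).</conveyance-text>
</assignment-record>"""

# Column name -> path below <assignment-record> (assumed from the sample).
FIELD_PATHS = {
    "last_update_date": "last-update-date/date",
    "purge_indicator": "purge-indicator",
    "recorded_date": "recorded-date/date",
    "correspondent_name": "correspondent/name",
    "correspondent_address_1": "correspondent/address-1",
    "correspondent_address_2": "correspondent/address-2",
    "correspondent_address_3": "correspondent/address-3",
    "correspondent_address_4": "correspondent/address-4",
    "conveyance_text": "conveyance-text",
}


def record_to_row(record):
    """Flatten one <assignment-record>; missing optional elements become None."""
    return {column: record.findtext(path) for column, path in FIELD_PATHS.items()}


row = record_to_row(ET.fromstring(RECORD))
print(row["recorded_date"], row["correspondent_address_4"])
```

Because <code>findtext</code> returns <code>None</code> for a missing element, the absent <code>address-4</code> simply becomes a NULL-like value in the row.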

==== Assignors ====

Here are the columns in the assignors table:
* reel_no
* frame_no
* assignor_name
* execution_date

The corresponding XML node is:

 <patent-assignors>
   <patent-assignor>
     <name>WALKER, MATTHEW J.</name>
     <execution-date>
       <date>20090512</date>
     </execution-date>
   </patent-assignor>
   <patent-assignor>
     <name>OLSZEWSKI, MARK E.</name>
     <execution-date>
       <date>20090512</date>
     </execution-date>
   </patent-assignor>
 </patent-assignors>
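
A possible way to produce one assignors row per <code>patent-assignor</code> is sketched below. The reel and frame numbers come from the sibling <code>assignment-record</code>. Pairing these two assignors with reel 27132 / frame 841 is an assumption made for illustration, and <code>assignor_rows</code> is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Assumption: these assignors belong to reel 27132 / frame 841.
ASSIGNMENT = """<patent-assignment>
  <assignment-record><reel-no>27132</reel-no><frame-no>841</frame-no></assignment-record>
  <patent-assignors>
    <patent-assignor>
      <name>WALKER, MATTHEW J.</name>
      <execution-date><date>20090512</date></execution-date>
    </patent-assignor>
    <patent-assignor>
      <name>OLSZEWSKI, MARK E.</name>
      <execution-date><date>20090512</date></execution-date>
    </patent-assignor>
  </patent-assignors>
</patent-assignment>"""


def assignor_rows(assignment):
    """Return (reel_no, frame_no, assignor_name, execution_date) tuples."""
    reel = assignment.findtext("assignment-record/reel-no")
    frame = assignment.findtext("assignment-record/frame-no")
    return [
        (reel, frame, assignor.findtext("name"), assignor.findtext("execution-date/date"))
        for assignor in assignment.iterfind("patent-assignors/patent-assignor")
    ]


for row in assignor_rows(ET.fromstring(ASSIGNMENT)):
    print(row)
```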

==== Assignees ====

Here are the columns in the assignees table:

* reel_no
* frame_no
* assignee_name
* assignee_address_1
* assignee_address_2
* assignee_city
* assignee_state
* assignee_country
* assignee_postcode

The corresponding XML nodes are:

 <patent-assignees>
   <patent-assignee>
     <name>KONINKLIJKE PHILIPS ELECTRONICS N V</name>
     <address-1>GROENEWOUDSEWEG 1</address-1>
     <city>EINDHOVEN</city>
     <country-name>NETHERLANDS</country-name>
     <postcode>5621 BA</postcode>
   </patent-assignee>
 </patent-assignees>
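
The column mapping for assignees might look like the sketch below. Only <code>name</code> is guaranteed to be present, so every other column may come back empty. <code>COLUMN_TO_TAG</code> and <code>assignee_rows</code> are hypothetical names, and passing reel 27132 / frame 841 for this assignee is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

ASSIGNEES = """<patent-assignees>
  <patent-assignee>
    <name>KONINKLIJKE PHILIPS ELECTRONICS N V</name>
    <address-1>GROENEWOUDSEWEG 1</address-1>
    <city>EINDHOVEN</city>
    <country-name>NETHERLANDS</country-name>
    <postcode>5621 BA</postcode>
  </patent-assignee>
</patent-assignees>"""

# Table column -> child element of <patent-assignee>.
COLUMN_TO_TAG = {
    "assignee_name": "name",
    "assignee_address_1": "address-1",
    "assignee_address_2": "address-2",
    "assignee_city": "city",
    "assignee_state": "state",
    "assignee_country": "country-name",
    "assignee_postcode": "postcode",
}


def assignee_rows(assignees, reel_no, frame_no):
    """Build one dict per <patent-assignee>; absent elements map to None."""
    rows = []
    for assignee in assignees.iterfind("patent-assignee"):
        row = {"reel_no": reel_no, "frame_no": frame_no}
        for column, tag in COLUMN_TO_TAG.items():
            row[column] = assignee.findtext(tag)
        rows.append(row)
    return rows


# Reel/frame values are assumed for illustration.
rows = assignee_rows(ET.fromstring(ASSIGNEES), "27132", "841")
print(rows[0]["assignee_city"], rows[0]["assignee_state"])
```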

==== Patent Properties ====

Here are the columns in the properties table:

* reel_no
* frame_no
* documentid
* country
* kind
* filingdate
* invention_title

The corresponding XML segment would be:

 <patent-properties>
   <patent-property>
     <document-id>
       <country>US</country>
       <doc-number>14143589</doc-number>
       <kind>X0</kind>
       <date>20131230</date>
     </document-id>
     <document-id>
       <country>US</country>
       <doc-number>20140260305</doc-number>
       <kind>A1</kind>
       <date>20140918</date>
     </document-id>
     <invention-title lang="en">LEAN AZIMUTHAL FLAME COMBUSTOR</invention-title>
   </patent-property>
 </patent-properties>

Patent properties have a many-to-one relationship with assignments: one patent assignment can have more than one property.

Note: We are not sure what documents with kind 'X0' signify.
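
One way to flatten this segment into the properties table is sketched below. It rests on two assumptions that the notes above do not confirm: that each <code>document-id</code> becomes its own row sharing the property's title, and that <code>date</code> feeds the <code>filingdate</code> column. <code>property_rows</code> is a hypothetical helper, and the reel and frame values are placeholders.

```python
import xml.etree.ElementTree as ET

PROPERTIES = """<patent-properties>
  <patent-property>
    <document-id>
      <country>US</country>
      <doc-number>14143589</doc-number>
      <kind>X0</kind>
      <date>20131230</date>
    </document-id>
    <document-id>
      <country>US</country>
      <doc-number>20140260305</doc-number>
      <kind>A1</kind>
      <date>20140918</date>
    </document-id>
    <invention-title lang="en">LEAN AZIMUTHAL FLAME COMBUSTOR</invention-title>
  </patent-property>
</patent-properties>"""


def property_rows(properties, reel_no, frame_no):
    """One row per <document-id>: (reel_no, frame_no, documentid, country,
    kind, filingdate, invention_title)."""
    rows = []
    for prop in properties.iterfind("patent-property"):
        title_el = prop.find("invention-title")
        # itertext() also collects text inside nested <b>/<i>/<sub> markup.
        title = "".join(title_el.itertext()) if title_el is not None else None
        for doc in prop.iterfind("document-id"):
            rows.append((
                reel_no,
                frame_no,
                doc.findtext("doc-number"),
                doc.findtext("country"),
                doc.findtext("kind"),
                doc.findtext("date"),  # assumed to map to filingdate
                title,
            ))
    return rows


# "12345" and "678" are placeholder reel/frame values, not real data.
for row in property_rows(ET.fromstring(PROPERTIES), "12345", "678"):
    print(row)
```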

====DTD====
Here is the DTD specified by the USPTO, which specifies optional fields:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE us-patent-assignments [<!ELEMENT us-patent-assignments (action-key-code, transaction-date, patent-assignments)>
<!ATTLIST us-patent-assignments dtd-version CDATA #IMPLIED
date-produced CDATA #IMPLIED>
<!ELEMENT action-key-code (#PCDATA)>
<!ELEMENT transaction-date (date)>
<!ELEMENT patent-assignments (data-available-code | patent-assignment+)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT data-available-code (#PCDATA)>
<!ELEMENT patent-assignment (assignment-record, patent-assignors, patent-assignees, patent-properties)>
<!ELEMENT assignment-record (reel-no, frame-no, last-update-date, purge-indicator, recorded-date, page-count?, correspondent, conveyance-text)>
<!ELEMENT patent-assignors (patent-assignor+)>
<!ELEMENT patent-assignees (patent-assignee+)>
<!ELEMENT patent-properties (patent-property+)>
<!ELEMENT reel-no (#PCDATA)>
<!ELEMENT frame-no (#PCDATA)>
<!ELEMENT last-update-date (date)>
<!ELEMENT purge-indicator (#PCDATA)>
<!ELEMENT recorded-date (date)>
<!ELEMENT page-count (#PCDATA)>
<!ELEMENT correspondent (name, address-1?, address-2?, address-3?, address-4?)>
<!ELEMENT conveyance-text (#PCDATA)>
<!ELEMENT patent-assignor (name, execution-date?, date-acknowledged?)>
<!ELEMENT patent-assignee (name, address-1?, address-2?, city?, state?, country-name?, postcode?)>
<!ELEMENT patent-property (document-id*, invention-title?)>
<!ELEMENT name (#PCDATA)>
<!ATTLIST name name-type (natural | legal) #IMPLIED>
<!ELEMENT address-1 (#PCDATA)>
<!ELEMENT address-2 (#PCDATA)>
<!ELEMENT address-3 (#PCDATA)>
<!ELEMENT address-4 (#PCDATA)>
<!ELEMENT execution-date (date)>
<!ELEMENT date-acknowledged (date)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT country-name (#PCDATA)>
<!ELEMENT postcode (#PCDATA)>
<!ELEMENT document-id (country, doc-number, kind?, name?, date?)>
<!ELEMENT invention-title (#PCDATA | b | i | u | sup | sub)*>
<!ATTLIST invention-title id ID #IMPLIED
lang CDATA #REQUIRED>
<!ELEMENT country (#PCDATA)>
<!ELEMENT doc-number (#PCDATA)>
<!ELEMENT kind (#PCDATA)>

<!ELEMENT b (#PCDATA | i | u | smallcaps)*>

<!ELEMENT i (#PCDATA | b | u | smallcaps)*>

<!ELEMENT u (#PCDATA | b | i | smallcaps)*>
<!ATTLIST u style (single | double | dash | dots ) 'single' >

<!ELEMENT sup (#PCDATA | b | u | i)*>

<!ELEMENT sub (#PCDATA | b | u | i)*>

<!ELEMENT smallcaps (#PCDATA | b | u | i)*>
]>
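
To see which fields the DTD above marks as optional (a trailing <code>?</code>), a small regular-expression pass over the element declarations is enough. This is a rough sketch that handles only comma-separated content models such as the three excerpted here. It is not a DTD parser, and <code>optional_children</code> is a hypothetical helper.

```python
import re

# Excerpt of the declarations above.
DTD_EXCERPT = """
<!ELEMENT assignment-record (reel-no, frame-no, last-update-date, purge-indicator, recorded-date, page-count?, correspondent, conveyance-text)>
<!ELEMENT patent-assignor (name, execution-date?, date-acknowledged?)>
<!ELEMENT patent-assignee (name, address-1?, address-2?, city?, state?, country-name?, postcode?)>
"""

# Matches <!ELEMENT name (a, b?, c)> and captures the name and the model body.
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+([\w-]+)\s+\(([^)]*)\)>")


def optional_children(dtd_text):
    """Map each element to its children declared with a trailing '?'."""
    result = {}
    for name, model in ELEMENT_DECL.findall(dtd_text):
        parts = [part.strip() for part in model.split(",")]
        optional = [part[:-1] for part in parts if part.endswith("?")]
        if optional:
            result[name] = optional
    return result


for element, children in optional_children(DTD_EXCERPT).items():
    print(element, children)
```

Columns fed by these optional elements should be nullable in the corresponding tables.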

===Inserting Extracted Data into Tables ===

===Clean Up ===

Bulk Patent Assignee Processing

2016-07-01T15:38:50Z

RavaliKruthiventi: /* Extracting Data from XML Files */

== USPTO Assignees Data ==

We would like to download data from this location on the USPTO website and load it into our tables. The objective is to determine whether this dataset is better than our current patent data (a combination of the data in the patent_2015 and patentdata databases).

== Steps Followed to Extract the Data ==

===Extracting Data from XML Files ===

All the historical USPTO data is available as XML files. Here is the tree structure for the XML files:

<patent-assignment>
+<assignment-record>
+<patent-assignors>
+<patent-assignees>
+<patent-properties>
</patent-assignment>

Each of the above internal nodes is mandatory, and is a logical grouping of information fields. Each node has a corresponding table created with more or less the same fields as the XML elements.

Corresponding tables are:
*assignment-record : assignment
*patent-assignors : assignors
*patent-assignees : assignees
*patent-properties : properties

Additionally, for each file that is downloaded, there are some associated specs. All of these are stored in the PatentAssignment table. Here is the data model diagram.

==== Assignment Records ====

The fields in the assignment record are:
* last_update_date
* purge_indicator
* recorded_date
* correspondent_name
* correspondent_address_1
* correspondent_address_2
* correspondent_address_3
* correspondent_address_4
* conveyance_text

Here is the corresponding XML that we are mapping:

 <assignment-record>
   <reel-no>27132</reel-no>
   <frame-no>841</frame-no>
   <last-update-date>
     <date>20160122</date>
   </last-update-date>
   <purge-indicator>N</purge-indicator>
   <recorded-date>
     <date>20111027</date>
   </recorded-date>
   <page-count>2</page-count>
   <correspondent>
     <name>DOUGLAS B. MCKNIGHT</name>
     <address-1>595 MINER ROAD</address-1>
     <address-2>INTELLECTUAL PROPERTY &amp; STANDARDS</address-2>
     <address-3>CLEVELAND, OH 44143</address-3>
   </correspondent>
   <conveyance-text>ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS).</conveyance-text>
 </assignment-record>

==== Assignors ====

Here are the columns in the assignors table:
* reel_no
* frame_no
* assignor_name
* execution_date

The corresponding XML node is:

 <patent-assignors>
   <patent-assignor>
     <name>WALKER, MATTHEW J.</name>
     <execution-date>
       <date>20090512</date>
     </execution-date>
   </patent-assignor>
   <patent-assignor>
     <name>OLSZEWSKI, MARK E.</name>
     <execution-date>
       <date>20090512</date>
     </execution-date>
   </patent-assignor>
 </patent-assignors>

==== Assignees ====

Here are the columns in the assignees table:

* reel_no
* frame_no
* assignee_name
* assignee_address_1
* assignee_address_2
* assignee_city
* assignee_state
* assignee_country
* assignee_postcode

The corresponding XML nodes are:

 <patent-assignees>
   <patent-assignee>
     <name>KONINKLIJKE PHILIPS ELECTRONICS N V</name>
     <address-1>GROENEWOUDSEWEG 1</address-1>
     <city>EINDHOVEN</city>
     <country-name>NETHERLANDS</country-name>
     <postcode>5621 BA</postcode>
   </patent-assignee>
 </patent-assignees>

==== Patent Properties ====

====DTD====
Here is the DTD specified by the USPTO, which specifies optional fields:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE us-patent-assignments [<!ELEMENT us-patent-assignments (action-key-code, transaction-date, patent-assignments)>
<!ATTLIST us-patent-assignments dtd-version CDATA #IMPLIED
date-produced CDATA #IMPLIED>
<!ELEMENT action-key-code (#PCDATA)>
<!ELEMENT transaction-date (date)>
<!ELEMENT patent-assignments (data-available-code | patent-assignment+)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT data-available-code (#PCDATA)>
<!ELEMENT patent-assignment (assignment-record, patent-assignors, patent-assignees, patent-properties)>
<!ELEMENT assignment-record (reel-no, frame-no, last-update-date, purge-indicator, recorded-date, page-count?, correspondent, conveyance-text)>
<!ELEMENT patent-assignors (patent-assignor+)>
<!ELEMENT patent-assignees (patent-assignee+)>
<!ELEMENT patent-properties (patent-property+)>
<!ELEMENT reel-no (#PCDATA)>
<!ELEMENT frame-no (#PCDATA)>
<!ELEMENT last-update-date (date)>
<!ELEMENT purge-indicator (#PCDATA)>
<!ELEMENT recorded-date (date)>
<!ELEMENT page-count (#PCDATA)>
<!ELEMENT correspondent (name, address-1?, address-2?, address-3?, address-4?)>
<!ELEMENT conveyance-text (#PCDATA)>
<!ELEMENT patent-assignor (name, execution-date?, date-acknowledged?)>
<!ELEMENT patent-assignee (name, address-1?, address-2?, city?, state?, country-name?, postcode?)>
<!ELEMENT patent-property (document-id*, invention-title?)>
<!ELEMENT name (#PCDATA)>
<!ATTLIST name name-type (natural | legal) #IMPLIED>
<!ELEMENT address-1 (#PCDATA)>
<!ELEMENT address-2 (#PCDATA)>
<!ELEMENT address-3 (#PCDATA)>
<!ELEMENT address-4 (#PCDATA)>
<!ELEMENT execution-date (date)>
<!ELEMENT date-acknowledged (date)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT country-name (#PCDATA)>
<!ELEMENT postcode (#PCDATA)>
<!ELEMENT document-id (country, doc-number, kind?, name?, date?)>
<!ELEMENT invention-title (#PCDATA | b | i | u | sup | sub)*>
<!ATTLIST invention-title id ID #IMPLIED
lang CDATA #REQUIRED>
<!ELEMENT country (#PCDATA)>
<!ELEMENT doc-number (#PCDATA)>
<!ELEMENT kind (#PCDATA)>

<!ELEMENT b (#PCDATA | i | u | smallcaps)*>

<!ELEMENT i (#PCDATA | b | u | smallcaps)*>

<!ELEMENT u (#PCDATA | b | i | smallcaps)*>
<!ATTLIST u style (single | double | dash | dots ) 'single' >

<!ELEMENT sup (#PCDATA | b | u | i)*>

<!ELEMENT sub (#PCDATA | b | u | i)*>

<!ELEMENT smallcaps (#PCDATA | b | u | i)*>
]>

===Inserting Extracted Data into Tables ===

===Clean Up ===