<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>http://www.edegan.com/mediawiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=ShoebMohammed</id>
	<title>edegan.com - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="http://www.edegan.com/mediawiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=ShoebMohammed"/>
	<link rel="alternate" type="text/html" href="http://www.edegan.com/wiki/Special:Contributions/ShoebMohammed"/>
	<updated>2026-06-08T09:53:22Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.34.2</generator>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Work_Log)&amp;diff=8001</id>
		<title>Shoeb Mohammed (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Work_Log)&amp;diff=8001"/>
		<updated>2016-08-08T20:03:31Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Work Log]]&lt;br /&gt;
[[Category: McNair Admin]]&lt;br /&gt;
[[admin_classification::General Information| ]]&lt;br /&gt;
[[Shoeb Mohammed]] [[Work Logs]] [[Shoeb Mohammed (Work Log) | Work Log page]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*07/15/2016 &lt;br /&gt;
**Started work log and research plan.&lt;br /&gt;
**Review and profile the Matcher perl code.&lt;br /&gt;
**Read tutorials on using Selenium with Perl. The links below are good starting points:&lt;br /&gt;
  https://www.usenix.org/system/files/login/articles/login_apr15_09_blank-edelman.pdf&lt;br /&gt;
  https://www.usenix.org/system/files/login/articles/blank-edelman_0.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*07/18/2016&lt;br /&gt;
**Completed listing page on the wiki for the software repository.&lt;br /&gt;
**Finished installing linux for code development.&lt;br /&gt;
&lt;br /&gt;
*07/19/2016&lt;br /&gt;
**Installed development environment on the linux box - python bindings, editor(emacs), git&lt;br /&gt;
**Decided to use python in place of perl because it is officially supported.&lt;br /&gt;
&lt;br /&gt;
*07/20/2016&lt;br /&gt;
**Created accounts&lt;br /&gt;
**Ran selenium to load a sample website&lt;br /&gt;
&lt;br /&gt;
*07/21/2016&lt;br /&gt;
**read more selenium docs and tutorials.&lt;br /&gt;
**waiting for Dan's codebase&lt;br /&gt;
&lt;br /&gt;
*07/22/2016&lt;br /&gt;
**initial design for the program&lt;br /&gt;
**still waiting for Dan's codebase&lt;br /&gt;
&lt;br /&gt;
*07/25/2016&lt;br /&gt;
**started implementation. need clarification on requirements.&lt;br /&gt;
&lt;br /&gt;
*07/27/2016&lt;br /&gt;
**continue with the implementation. need clarification on search handler requirements.&lt;br /&gt;
&lt;br /&gt;
*07/28/2016, 07/29/2016, 07/30/2016, 08/01/2016&lt;br /&gt;
**continue development. &lt;br /&gt;
&lt;br /&gt;
*08/05/2016&lt;br /&gt;
**completed development. Future todo: implement advanced facilities.&lt;br /&gt;
&lt;br /&gt;
*08/06/2016&lt;br /&gt;
**added Ed's suggestions.&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Work_Log)&amp;diff=8000</id>
		<title>Shoeb Mohammed (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Work_Log)&amp;diff=8000"/>
		<updated>2016-08-06T20:39:21Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Work Log]]&lt;br /&gt;
[[Category: McNair Admin]]&lt;br /&gt;
[[admin_classification::General Information| ]]&lt;br /&gt;
[[Shoeb Mohammed]] [[Work Logs]] [[Shoeb Mohammed (Work Log) | Work Log page]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*07/15/2016 &lt;br /&gt;
**Started work log and research plan.&lt;br /&gt;
**Review and profile the Matcher perl code.&lt;br /&gt;
**Read tutorials on using Selenium with Perl. The links below are good starting points:&lt;br /&gt;
  https://www.usenix.org/system/files/login/articles/login_apr15_09_blank-edelman.pdf&lt;br /&gt;
  https://www.usenix.org/system/files/login/articles/blank-edelman_0.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*07/18/2016&lt;br /&gt;
**Completed listing page on the wiki for the software repository.&lt;br /&gt;
**Finished installing linux for code development.&lt;br /&gt;
&lt;br /&gt;
*07/19/2016&lt;br /&gt;
**Installed development environment on the linux box - python bindings, editor(emacs), git&lt;br /&gt;
**Decided to use python in place of perl because it is officially supported.&lt;br /&gt;
&lt;br /&gt;
*07/20/2016&lt;br /&gt;
**Created accounts&lt;br /&gt;
**Ran selenium to load a sample website&lt;br /&gt;
&lt;br /&gt;
*07/21/2016&lt;br /&gt;
**read more selenium docs and tutorials.&lt;br /&gt;
**waiting for Dan's codebase&lt;br /&gt;
&lt;br /&gt;
*07/22/2016&lt;br /&gt;
**initial design for the program&lt;br /&gt;
**still waiting for Dan's codebase&lt;br /&gt;
&lt;br /&gt;
*07/25/2016&lt;br /&gt;
**started implementation. need clarification on requirements.&lt;br /&gt;
&lt;br /&gt;
*07/27/2016&lt;br /&gt;
**continue with the implementation. need clarification on search handler requirements.&lt;br /&gt;
&lt;br /&gt;
*07/28/2016, 07/29/2016, 07/30/2016, 08/01/2016&lt;br /&gt;
**continue development. &lt;br /&gt;
&lt;br /&gt;
*08/05/2016&lt;br /&gt;
**completed development. Future todo: implement advanced facilities.&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Research_Plans)&amp;diff=7999</id>
		<title>Shoeb Mohammed (Research Plans)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Research_Plans)&amp;diff=7999"/>
		<updated>2016-08-06T20:38:24Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Work Log]]&lt;br /&gt;
[[Category: McNair Admin]]&lt;br /&gt;
[[admin_classification::General Information| ]]&lt;br /&gt;
[[Shoeb Mohammed]] [[Research Plans]] [[Shoeb Mohammed (Research Plans) | Research Plan page]]&lt;br /&gt;
&lt;br /&gt;
====Short Term====&lt;br /&gt;
* Create a listing on the wiki for all software developed at McNair center. - '''''Completed'''''&lt;br /&gt;
* Build a Linux box to run the crawler. - '''''Completed'''''&lt;br /&gt;
&lt;br /&gt;
====Long Term====&lt;br /&gt;
* Optimize/re-design the 'Matcher' software. In particular, speed-up fuzzy matching and possibly re-structure the code to make usage easier.&lt;br /&gt;
* Develop the crawler. Try to begin with code that Dan has. - '''''Completed'''''&lt;br /&gt;
&lt;br /&gt;
====Side Tasks====&lt;br /&gt;
* If possible, redo the patent parser (previously coded by Kranti) to also pull in Patent Citation data.&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Research_Plans)&amp;diff=7998</id>
		<title>Shoeb Mohammed (Research Plans)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Research_Plans)&amp;diff=7998"/>
		<updated>2016-08-06T20:36:12Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Internal]]&lt;br /&gt;
[[Category:Work Log]]&lt;br /&gt;
[[Shoeb Mohammed]] [[Research Plans]] [[Shoeb Mohammed (Research Plans) | Research Plan page]]&lt;br /&gt;
&lt;br /&gt;
====Short Term====&lt;br /&gt;
* Create a listing on the wiki for all software developed at McNair center. - '''''Completed'''''&lt;br /&gt;
* Build a Linux box to run the crawler. - '''''Completed'''''&lt;br /&gt;
&lt;br /&gt;
====Long Term====&lt;br /&gt;
* Optimize/re-design the 'Matcher' software. In particular, speed-up fuzzy matching and possibly re-structure the code to make usage easier.&lt;br /&gt;
* Develop the crawler. Try to begin with code that Dan has. - '''''Completed'''''&lt;br /&gt;
&lt;br /&gt;
====Side Tasks====&lt;br /&gt;
* If possible, redo the patent parser (previously coded by Kranti) to also pull in Patent Citation data.&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Research_Plans)&amp;diff=7997</id>
		<title>Shoeb Mohammed (Research Plans)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Research_Plans)&amp;diff=7997"/>
		<updated>2016-08-06T20:35:49Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Internal]]&lt;br /&gt;
[[Category:Work Log]]&lt;br /&gt;
[[Shoeb Mohammed]] [[Research Plans]] [[Shoeb Mohammed (Research Plans) | Research Plan page]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====Short Term====&lt;br /&gt;
* Create a listing on the wiki for all software developed at McNair center. - '''''Completed'''''&lt;br /&gt;
* Build a Linux box to run the crawler. - '''''Completed'''''&lt;br /&gt;
&lt;br /&gt;
====Long Term====&lt;br /&gt;
* Optimize/re-design the 'Matcher' software. In particular, speed-up fuzzy matching and possibly re-structure the code to make usage easier.&lt;br /&gt;
* Develop the crawler. Try to begin with code that Dan has. - '''''Completed'''''&lt;br /&gt;
&lt;br /&gt;
====Side Tasks====&lt;br /&gt;
* If possible, redo the patent parser (previously coded by Kranti) to also pull in Patent Citation data.&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Research_Plans)&amp;diff=7986</id>
		<title>Shoeb Mohammed (Research Plans)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Research_Plans)&amp;diff=7986"/>
		<updated>2016-08-05T19:50:32Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Work Log]]&lt;br /&gt;
[[Shoeb Mohammed]] [[Research Plans]] [[Shoeb Mohammed (Research Plans) | Research Plan page]]&lt;br /&gt;
&lt;br /&gt;
====Short Term====&lt;br /&gt;
* Create a listing on the wiki for all software developed at McNair center. - '''''Completed'''''&lt;br /&gt;
* Build a Linux box to run the crawler. - '''''Completed'''''&lt;br /&gt;
&lt;br /&gt;
====Long Term====&lt;br /&gt;
* Optimize/re-design the 'Matcher' software. In particular, speed-up fuzzy matching and possibly re-structure the code to make usage easier.&lt;br /&gt;
* Develop the crawler. Try to begin with code that Dan has. - '''''Completed'''''&lt;br /&gt;
&lt;br /&gt;
====Side Tasks====&lt;br /&gt;
* If possible, redo the patent parser (previously coded by Kranti) to also pull in Patent Citation data.&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Research_Plans)&amp;diff=7985</id>
		<title>Shoeb Mohammed (Research Plans)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Research_Plans)&amp;diff=7985"/>
		<updated>2016-08-05T19:48:59Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Work Log]]&lt;br /&gt;
[[Shoeb Mohammed]] [[Research Plans]] [[Shoeb Mohammed (Research Plans) | Research Plan page]]&lt;br /&gt;
&lt;br /&gt;
====Short Term====&lt;br /&gt;
* Create a listing on the wiki for all software developed at McNair center. - completed&lt;br /&gt;
* Build a Linux box to run the crawler. - completed&lt;br /&gt;
&lt;br /&gt;
====Long Term====&lt;br /&gt;
* Optimize/re-design the 'Matcher' software. In particular, speed-up fuzzy matching and possibly re-structure the code to make usage easier.&lt;br /&gt;
* Develop the crawler. Try to begin with code that Dan has. - completed&lt;br /&gt;
&lt;br /&gt;
====Side Tasks====&lt;br /&gt;
* If possible, redo the patent parser (previously coded by Kranti) to also pull in Patent Citation data.&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Research_Plans)&amp;diff=7984</id>
		<title>Shoeb Mohammed (Research Plans)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Research_Plans)&amp;diff=7984"/>
		<updated>2016-08-05T19:48:07Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Work Log]]&lt;br /&gt;
[[Shoeb Mohammed]] [[Research Plans]] [[Shoeb Mohammed (Research Plans) | Research Plan page]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====Short Term====&lt;br /&gt;
* Create a listing on the wiki for all software developed at McNair center.&lt;br /&gt;
* Build a Linux box to run the crawler.&lt;br /&gt;
&lt;br /&gt;
====Long Term====&lt;br /&gt;
* Optimize/re-design the 'Matcher' software. In particular, speed-up fuzzy matching and possibly re-structure the code to make usage easier.&lt;br /&gt;
* Develop the crawler. Try to begin with code that Dan has.&lt;br /&gt;
&lt;br /&gt;
====Side Tasks====&lt;br /&gt;
* If possible, redo the patent parser (previously coded by Kranti) to also pull in Patent Citation data.&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Work_Log)&amp;diff=7983</id>
		<title>Shoeb Mohammed (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Work_Log)&amp;diff=7983"/>
		<updated>2016-08-05T19:45:08Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Work Log]]&lt;br /&gt;
[[Shoeb Mohammed]] [[Work Logs]] [[Shoeb Mohammed (Work Log) | Work Log page]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*07/15/2016 &lt;br /&gt;
**Started work log and research plan.&lt;br /&gt;
**Review and profile the Matcher perl code.&lt;br /&gt;
**Read tutorials on using Selenium with Perl. The links below are good starting points:&lt;br /&gt;
  https://www.usenix.org/system/files/login/articles/login_apr15_09_blank-edelman.pdf&lt;br /&gt;
  https://www.usenix.org/system/files/login/articles/blank-edelman_0.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*07/18/2016&lt;br /&gt;
**Completed listing page on the wiki for the software repository.&lt;br /&gt;
**Finished installing linux for code development.&lt;br /&gt;
&lt;br /&gt;
*07/19/2016&lt;br /&gt;
**Installed development environment on the linux box - python bindings, editor(emacs), git&lt;br /&gt;
**Decided to use python in place of perl because it is officially supported.&lt;br /&gt;
&lt;br /&gt;
*07/20/2016&lt;br /&gt;
**Created accounts&lt;br /&gt;
**Ran selenium to load a sample website&lt;br /&gt;
&lt;br /&gt;
*07/21/2016&lt;br /&gt;
**read more selenium docs and tutorials.&lt;br /&gt;
**waiting for Dan's codebase&lt;br /&gt;
&lt;br /&gt;
*07/22/2016&lt;br /&gt;
**initial design for the program&lt;br /&gt;
**still waiting for Dan's codebase&lt;br /&gt;
&lt;br /&gt;
*07/25/2016&lt;br /&gt;
**started implementation. need clarification on requirements.&lt;br /&gt;
&lt;br /&gt;
*07/27/2016&lt;br /&gt;
**continue with the implementation. need clarification on search handler requirements.&lt;br /&gt;
&lt;br /&gt;
*07/28/2016, 07/29/2016, 07/30/2016, 08/01/2016&lt;br /&gt;
**continue development. &lt;br /&gt;
&lt;br /&gt;
*08/05/2016&lt;br /&gt;
**completed development. Future todo: implement advanced facilities.&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Work_Log)&amp;diff=7586</id>
		<title>Shoeb Mohammed (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Work_Log)&amp;diff=7586"/>
		<updated>2016-07-27T20:19:03Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Work Log]]&lt;br /&gt;
[[Shoeb Mohammed]] [[Work Logs]] [[Shoeb Mohammed (Work Log) | Work Log page]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*07/15/2016 &lt;br /&gt;
**Started work log and research plan.&lt;br /&gt;
**Review and profile the Matcher perl code.&lt;br /&gt;
**Read tutorials on using Selenium with Perl. The links below are good starting points:&lt;br /&gt;
  https://www.usenix.org/system/files/login/articles/login_apr15_09_blank-edelman.pdf&lt;br /&gt;
  https://www.usenix.org/system/files/login/articles/blank-edelman_0.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*07/18/2016&lt;br /&gt;
**Completed listing page on the wiki for the software repository.&lt;br /&gt;
**Finished installing linux for code development.&lt;br /&gt;
&lt;br /&gt;
*07/19/2016&lt;br /&gt;
**Installed development environment on the linux box - python bindings, editor(emacs), git&lt;br /&gt;
**Decided to use python in place of perl because it is officially supported.&lt;br /&gt;
&lt;br /&gt;
*07/20/2016&lt;br /&gt;
**Created accounts&lt;br /&gt;
**Ran selenium to load a sample website&lt;br /&gt;
&lt;br /&gt;
*07/21/2016&lt;br /&gt;
**read more selenium docs and tutorials.&lt;br /&gt;
**waiting for Dan's codebase&lt;br /&gt;
&lt;br /&gt;
*07/22/2016&lt;br /&gt;
**initial design for the program&lt;br /&gt;
**still waiting for Dan's codebase&lt;br /&gt;
&lt;br /&gt;
*07/25/2016&lt;br /&gt;
**started implementation. need clarification on requirements.&lt;br /&gt;
&lt;br /&gt;
*07/27/2016&lt;br /&gt;
**continue with the implementation. need clarification on search handler requirements.&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Work_Log)&amp;diff=7446</id>
		<title>Shoeb Mohammed (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Work_Log)&amp;diff=7446"/>
		<updated>2016-07-25T21:59:35Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Work Log]]&lt;br /&gt;
[[Shoeb Mohammed]] [[Work Logs]] [[Shoeb Mohammed (Work Log) | Work Log page]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*07/15/2016 &lt;br /&gt;
**Started work log and research plan.&lt;br /&gt;
**Review and profile the Matcher perl code.&lt;br /&gt;
**Read tutorials on using Selenium with Perl. The links below are good starting points:&lt;br /&gt;
  https://www.usenix.org/system/files/login/articles/login_apr15_09_blank-edelman.pdf&lt;br /&gt;
  https://www.usenix.org/system/files/login/articles/blank-edelman_0.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*07/18/2016&lt;br /&gt;
**Completed listing page on the wiki for the software repository.&lt;br /&gt;
**Finished installing linux for code development.&lt;br /&gt;
&lt;br /&gt;
*07/19/2016&lt;br /&gt;
**Installed development environment on the linux box - python bindings, editor(emacs), git&lt;br /&gt;
**Decided to use python in place of perl because it is officially supported.&lt;br /&gt;
&lt;br /&gt;
*07/20/2016&lt;br /&gt;
**Created accounts&lt;br /&gt;
**Ran selenium to load a sample website&lt;br /&gt;
&lt;br /&gt;
*07/21/2016&lt;br /&gt;
**read more selenium docs and tutorials.&lt;br /&gt;
**waiting for Dan's codebase&lt;br /&gt;
&lt;br /&gt;
*07/22/2016&lt;br /&gt;
**initial design for the program&lt;br /&gt;
**still waiting for Dan's codebase&lt;br /&gt;
&lt;br /&gt;
*07/25/2016&lt;br /&gt;
**started implementation. need clarification on requirements.&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Work_Log)&amp;diff=7364</id>
		<title>Shoeb Mohammed (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Work_Log)&amp;diff=7364"/>
		<updated>2016-07-21T21:29:34Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Work Log]]&lt;br /&gt;
[[Shoeb Mohammed]] [[Work Logs]] [[Shoeb Mohammed (Work Log) | Work Log page]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*07/15/2016 &lt;br /&gt;
**Started work log and research plan.&lt;br /&gt;
**Review and profile the Matcher perl code.&lt;br /&gt;
**Read tutorials on using Selenium with Perl. The links below are good starting points:&lt;br /&gt;
  https://www.usenix.org/system/files/login/articles/login_apr15_09_blank-edelman.pdf&lt;br /&gt;
  https://www.usenix.org/system/files/login/articles/blank-edelman_0.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*07/18/2016&lt;br /&gt;
**Completed listing page on the wiki for the software repository.&lt;br /&gt;
**Finished installing linux for code development.&lt;br /&gt;
&lt;br /&gt;
*07/19/2016&lt;br /&gt;
**Installed development environment on the linux box - python bindings, editor(emacs), git&lt;br /&gt;
**Decided to use python in place of perl because it is officially supported.&lt;br /&gt;
&lt;br /&gt;
*07/20/2016&lt;br /&gt;
**Created accounts&lt;br /&gt;
**Ran selenium to load a sample website&lt;br /&gt;
&lt;br /&gt;
*07/21/2016&lt;br /&gt;
**read more selenium docs and tutorials.&lt;br /&gt;
**waiting for Dan's codebase&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Work_Log)&amp;diff=7363</id>
		<title>Shoeb Mohammed (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Work_Log)&amp;diff=7363"/>
		<updated>2016-07-21T21:23:25Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Work Log]]&lt;br /&gt;
[[Shoeb Mohammed]] [[Work Logs]] [[Shoeb Mohammed (Work Log) | Work Log page]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*07/15/2016 &lt;br /&gt;
**Started work log and research plan.&lt;br /&gt;
**Review and profile the Matcher perl code.&lt;br /&gt;
**Read tutorials on using Selenium with Perl. The links below are good starting points:&lt;br /&gt;
  https://www.usenix.org/system/files/login/articles/login_apr15_09_blank-edelman.pdf&lt;br /&gt;
  https://www.usenix.org/system/files/login/articles/blank-edelman_0.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*07/18/2016&lt;br /&gt;
**Completed listing page on the wiki for the software repository.&lt;br /&gt;
**Finished installing linux for code development.&lt;br /&gt;
&lt;br /&gt;
*07/19/2016&lt;br /&gt;
**Installed development environment on the linux box - python bindings, editor(emacs), git&lt;br /&gt;
&lt;br /&gt;
*07/20/2016&lt;br /&gt;
**Created accounts&lt;br /&gt;
**Ran selenium to load a sample website&lt;br /&gt;
&lt;br /&gt;
*07/21/2016&lt;br /&gt;
**read more selenium docs and tutorials.&lt;br /&gt;
**waiting for Dan's codebase&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Work_Log)&amp;diff=7284</id>
		<title>Shoeb Mohammed (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Work_Log)&amp;diff=7284"/>
		<updated>2016-07-20T20:05:21Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Work Log]]&lt;br /&gt;
[[Shoeb Mohammed]] [[Work Logs]] [[Shoeb Mohammed (Work Log) | Work Log page]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*07/15/2016 &lt;br /&gt;
**Started work log and research plan.&lt;br /&gt;
**Review and profile the Matcher perl code.&lt;br /&gt;
**Read tutorials on using Selenium with Perl. The links below are good starting points:&lt;br /&gt;
  https://www.usenix.org/system/files/login/articles/login_apr15_09_blank-edelman.pdf&lt;br /&gt;
  https://www.usenix.org/system/files/login/articles/blank-edelman_0.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*07/18/2016&lt;br /&gt;
**Completed listing page on the wiki for the software repository.&lt;br /&gt;
**Finished installing linux for code development.&lt;br /&gt;
&lt;br /&gt;
*07/19/2016&lt;br /&gt;
**Installed development environment on the linux box - python bindings, editor(emacs), git&lt;br /&gt;
&lt;br /&gt;
*07/20/2016&lt;br /&gt;
**Created accounts&lt;br /&gt;
**Ran selenium to load a sample website&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Work_Log)&amp;diff=7127</id>
		<title>Shoeb Mohammed (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Work_Log)&amp;diff=7127"/>
		<updated>2016-07-19T17:32:10Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Work Log]]&lt;br /&gt;
[[Shoeb Mohammed]] [[Work Logs]] [[Shoeb Mohammed (Work Log) | Work Log page]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*07/15/2016 &lt;br /&gt;
**Started work log and research plan.&lt;br /&gt;
**Review and profile the Matcher perl code.&lt;br /&gt;
**Read tutorials on using Selenium with Perl. The links below are good starting points:&lt;br /&gt;
  https://www.usenix.org/system/files/login/articles/login_apr15_09_blank-edelman.pdf&lt;br /&gt;
  https://www.usenix.org/system/files/login/articles/blank-edelman_0.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*07/18/2016&lt;br /&gt;
**Completed listing page on the wiki for the software repository.&lt;br /&gt;
**Finished installing linux for code development.&lt;br /&gt;
&lt;br /&gt;
*07/19/2016&lt;br /&gt;
**Installed development environment on the linux box - python bindings, editor(emacs), git&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Work_Log)&amp;diff=7125</id>
		<title>Shoeb Mohammed (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Work_Log)&amp;diff=7125"/>
		<updated>2016-07-19T17:04:51Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Work Log]]&lt;br /&gt;
[[Shoeb Mohammed]] [[Work Logs]] [[Shoeb Mohammed (Work Log) | Work Log page]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*07/15/2016 &lt;br /&gt;
**Started work log and research plan.&lt;br /&gt;
**Review and profile the Matcher perl code.&lt;br /&gt;
**Read tutorials on using Selenium with Perl. The links below are good starting points:&lt;br /&gt;
  https://www.usenix.org/system/files/login/articles/login_apr15_09_blank-edelman.pdf&lt;br /&gt;
  https://www.usenix.org/system/files/login/articles/blank-edelman_0.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*07/18/2016&lt;br /&gt;
**Completed listing page on the wiki for the software repository.&lt;br /&gt;
**Finished installing linux for code development.&lt;br /&gt;
&lt;br /&gt;
*07/19/2016&lt;br /&gt;
**Installed development environment on the linux box - python bindings, IDE(emacs), git&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Work_Log)&amp;diff=7055</id>
		<title>Shoeb Mohammed (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Work_Log)&amp;diff=7055"/>
		<updated>2016-07-18T20:03:01Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Work Log]]&lt;br /&gt;
[[Shoeb Mohammed]] [[Work Logs]] [[Shoeb Mohammed (Work Log) | Work Log page]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*07/15/2016 &lt;br /&gt;
**Started work log and research plan.&lt;br /&gt;
**Review and profile the Matcher perl code.&lt;br /&gt;
**Read tutorials on using Selenium with Perl. The links below are good starting points:&lt;br /&gt;
  https://www.usenix.org/system/files/login/articles/login_apr15_09_blank-edelman.pdf&lt;br /&gt;
  https://www.usenix.org/system/files/login/articles/blank-edelman_0.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*07/18/2016&lt;br /&gt;
**Completed listing page on the wiki for the software repository.&lt;br /&gt;
**Finished installing linux for code development.&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Work_Log)&amp;diff=7052</id>
		<title>Shoeb Mohammed (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Work_Log)&amp;diff=7052"/>
		<updated>2016-07-18T20:00:24Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Work Log]]&lt;br /&gt;
[[Shoeb Mohammed]] [[Work Logs]]&lt;br /&gt;
&lt;br /&gt;
*07/15/2016 &lt;br /&gt;
**Started work log and research plan.&lt;br /&gt;
**Review and profile the Matcher perl code.&lt;br /&gt;
**Read tutorials on using Selenium with Perl. The links below are good starting points:&lt;br /&gt;
  https://www.usenix.org/system/files/login/articles/login_apr15_09_blank-edelman.pdf&lt;br /&gt;
  https://www.usenix.org/system/files/login/articles/blank-edelman_0.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*07/18/2016&lt;br /&gt;
**Completed listing page on the wiki for the software repository.&lt;br /&gt;
**Finished installing linux for code development.&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Work_Log)&amp;diff=7051</id>
		<title>Shoeb Mohammed (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Work_Log)&amp;diff=7051"/>
		<updated>2016-07-18T20:00:06Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Work Log]]&lt;br /&gt;
[[Shoeb Mohammed]] [[Work Logs]] [[Shoeb Mohammed (Work Log) | (Work Log)]]&lt;br /&gt;
&lt;br /&gt;
*07/15/2016 &lt;br /&gt;
**Started work log and research plan.&lt;br /&gt;
**Review and profile the Matcher perl code.&lt;br /&gt;
**Read tutorials on using Selenium with Perl. The links below are good starting points:&lt;br /&gt;
  https://www.usenix.org/system/files/login/articles/login_apr15_09_blank-edelman.pdf&lt;br /&gt;
  https://www.usenix.org/system/files/login/articles/blank-edelman_0.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*07/18/2016&lt;br /&gt;
**Completed listing page on the wiki for the software repository.&lt;br /&gt;
**Finished installing linux for code development.&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Work_Log)&amp;diff=6959</id>
		<title>Shoeb Mohammed (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Work_Log)&amp;diff=6959"/>
		<updated>2016-07-18T19:36:50Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Work Log]]&lt;br /&gt;
[[Shoeb Mohammed]] [[Work Logs]] [[Shoeb Mohammed (Work Log) | (Work Log)]]&lt;br /&gt;
&lt;br /&gt;
*07/15/2016 &lt;br /&gt;
**Started work log and research plan.&lt;br /&gt;
**Review and profile the Matcher perl code.&lt;br /&gt;
**Read tutorials on using Selenium with Perl. The links below are good starting points:&lt;br /&gt;
  https://www.usenix.org/system/files/login/articles/login_apr15_09_blank-edelman.pdf&lt;br /&gt;
  https://www.usenix.org/system/files/login/articles/blank-edelman_0.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*07/18/2016&lt;br /&gt;
**Completed listing page on the wiki for the software repository.&lt;br /&gt;
**Began installing Linux for development.&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Work_Log)&amp;diff=6958</id>
		<title>Shoeb Mohammed (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Work_Log)&amp;diff=6958"/>
		<updated>2016-07-18T19:36:14Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Work Log]]&lt;br /&gt;
[[Shoeb Mohammed]] [[Work Logs]] [[Shoeb Mohammed (Work Log) | (Work Log)]]&lt;br /&gt;
&lt;br /&gt;
*07/15/2016 &lt;br /&gt;
**Started work log and research plan.&lt;br /&gt;
**Review and profile the Matcher perl code.&lt;br /&gt;
**Read tutorials on using Selenium with Perl. The links below are a good starting point: &lt;br /&gt;
  https://www.usenix.org/system/files/login/articles/login_apr15_09_blank-edelman.pdf&lt;br /&gt;
  https://www.usenix.org/system/files/login/articles/blank-edelman_0.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*07/18/2016&lt;br /&gt;
**Completed listing page on the wiki for the software repository.&lt;br /&gt;
**Began building the Linux box.&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Software_Repository_Listing&amp;diff=6915</id>
		<title>Software Repository Listing</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Software_Repository_Listing&amp;diff=6915"/>
		<updated>2016-07-18T17:15:22Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: /* Geocoding Inventor Locations */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page lists all software/tools available on our [[Software Repository]]. Documentation for each tool is on its own wiki page.&lt;br /&gt;
&lt;br /&gt;
 For information and a tutorial on how to access the McNair git server, see [[Software Repository]].&lt;br /&gt;
 Read the tutorial and instructions before pushing anything to the git server.&lt;br /&gt;
&lt;br /&gt;
=Repositories on McNair [[Software Repository|git server]]=&lt;br /&gt;
&lt;br /&gt;
==Center IT Sysadmin==&lt;br /&gt;
This repository contains all tools and scripts meant for system administration, such as backup scripts.&lt;br /&gt;
*See the [[Center IT]] page for current documentation.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Harvard Dataverse==&lt;br /&gt;
This repository contains all tools and scripts related to Harvard Dataverse.&lt;br /&gt;
*The [[Harvard Dataverse]] page provides instruction on how to access data.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Geocoding Inventor Locations==&lt;br /&gt;
This repository holds software for matching Inventor addresses to known locations.&lt;br /&gt;
Two programs do the same job: an older one implemented in Perl and a newer one implemented in Python. In most cases, use the newer tool.&lt;br /&gt;
*See [[Geocoding Inventor Locations (Tool)]] for documentation on the older version implemented in Perl.&lt;br /&gt;
*See [[Geocode.py]] for the newer version in Python.&lt;br /&gt;
&lt;br /&gt;
==Matcher==&lt;br /&gt;
This repository contains the matcher tool which is used to match firm names given two lists.&lt;br /&gt;
*See [[The Matcher (Tool)]] for documentation.&lt;br /&gt;
&lt;br /&gt;
==Patent Data Parser==&lt;br /&gt;
This repository contains all tools developed for patent data parsing.&lt;br /&gt;
*The [[Patent Data (Tool)]] and [[Patent Data Extraction Scripts (Tool)]] pages on the wiki describe our Patent Database schema and the corresponding XML parsing tools.&lt;br /&gt;
*Also see [[USPTO Assignees Data]], which explains the Patent Assignee Database schema and the relevant XML parsing tools.&lt;br /&gt;
&lt;br /&gt;
==Utilities==&lt;br /&gt;
This repository contains various utilities developed for text processing and other generally useful tools. See the wiki pages for each tool's documentation.&lt;br /&gt;
*[[Fuzzy match names (Tool)]]&lt;br /&gt;
*[[Godo (Tool)]]&lt;br /&gt;
*[[Normalizer Documentation | Normalizer]]. The [[Software Repository|git-server]] holds several variants, such as normalize fixed width and normalize surnames.&lt;br /&gt;
&lt;br /&gt;
==Web Crawler==&lt;br /&gt;
This repository contains all software for web crawlers.&lt;br /&gt;
*[[Whois Parser]] pulls the Whois information for a given list of URLs.&lt;br /&gt;
*[[PhD Masterclass - How to Build a Web Crawler]]: Ed's class on building a web crawler.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[category:McNair Admin]]&lt;br /&gt;
[[admin_classification::Software Repository| ]]&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Software_Repository_Listing&amp;diff=6912</id>
		<title>Software Repository Listing</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Software_Repository_Listing&amp;diff=6912"/>
		<updated>2016-07-18T17:06:27Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: Created page with &amp;quot;This page lists all software/tools available on our Software Repository. The documentation on using a particular tool will be on its separate wiki page.   For information...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page lists all software/tools available on our [[Software Repository]]. Documentation for each tool is on its own wiki page.&lt;br /&gt;
&lt;br /&gt;
 For information and a tutorial on how to access the McNair git server, see [[Software Repository]].&lt;br /&gt;
 Read the tutorial and instructions before pushing anything to the git server.&lt;br /&gt;
&lt;br /&gt;
=Repositories on McNair [[Software Repository|git server]]=&lt;br /&gt;
&lt;br /&gt;
==Center IT Sysadmin==&lt;br /&gt;
This repository contains all tools and scripts meant for system administration, such as backup scripts.&lt;br /&gt;
*See the [[Center IT]] page for current documentation.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Harvard Dataverse==&lt;br /&gt;
This repository contains all tools and scripts related to Harvard Dataverse.&lt;br /&gt;
*The [[Harvard Dataverse]] page provides instruction on how to access data.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Geocoding Inventor Locations==&lt;br /&gt;
This repository holds software for matching Inventor addresses to known locations.&lt;br /&gt;
Two programs do the same job: an older one implemented in Perl and a newer one implemented in Python. In most cases, use the newer tool.&lt;br /&gt;
*See [[Geocoding Inventor Locations (Tool)]] for documentation on the older version implemented in Perl.&lt;br /&gt;
*See [[Geocoding Inventor Locations - New (Tool)]] for the newer version in Python.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Matcher==&lt;br /&gt;
This repository contains the matcher tool which is used to match firm names given two lists.&lt;br /&gt;
*See [[The Matcher (Tool)]] for documentation.&lt;br /&gt;
&lt;br /&gt;
==Patent Data Parser==&lt;br /&gt;
This repository contains all tools developed for patent data parsing.&lt;br /&gt;
*The [[Patent Data (Tool)]] and [[Patent Data Extraction Scripts (Tool)]] pages on the wiki describe our Patent Database schema and the corresponding XML parsing tools.&lt;br /&gt;
*Also see [[USPTO Assignees Data]], which explains the Patent Assignee Database schema and the relevant XML parsing tools.&lt;br /&gt;
&lt;br /&gt;
==Utilities==&lt;br /&gt;
This repository contains various utilities developed for text processing and other generally useful tools. See the wiki pages for each tool's documentation.&lt;br /&gt;
*[[Fuzzy match names (Tool)]]&lt;br /&gt;
*[[Godo (Tool)]]&lt;br /&gt;
*[[Normalizer Documentation | Normalizer]]. The [[Software Repository|git-server]] holds several variants, such as normalize fixed width and normalize surnames.&lt;br /&gt;
&lt;br /&gt;
==Web Crawler==&lt;br /&gt;
This repository contains all software for web crawlers.&lt;br /&gt;
*[[Whois Parser]] pulls the Whois information for a given list of URLs.&lt;br /&gt;
*[[PhD Masterclass - How to Build a Web Crawler]]: Ed's class on building a web crawler.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[category:McNair Admin]]&lt;br /&gt;
[[admin_classification::Software Repository| ]]&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Geocoding_Inventor_Locations_(Tool)&amp;diff=6911</id>
		<title>Geocoding Inventor Locations (Tool)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Geocoding_Inventor_Locations_(Tool)&amp;diff=6911"/>
		<updated>2016-07-18T17:06:11Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: Created page with &amp;quot;*This page is part of a series under the NBER Patent Data Project  This page details the various matching techniques used to Geocode inventor locations i...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;*This page is part of a series under the [[NBER Patent Data |NBER Patent Data Project]]&lt;br /&gt;
&lt;br /&gt;
This page details the various matching techniques used to geocode inventor locations in the NBER patent data. Geocoding inventor locations entails matching the inventor addresses provided in the patent data to known locations throughout the world and recording their longitude and latitude. &lt;br /&gt;
&lt;br /&gt;
==Script Files==&lt;br /&gt;
&lt;br /&gt;
The scripts and modules that operationalize these matching techniques can be downloaded as a bundle with ([http://www.edegan.com/repository/MatchLocations.tar.gz MatchLocations.tar.gz v1.0.1] ~20Mb) or without ([http://www.edegan.com/repository/MatchLocations_Full.tar.gz MatchLocations_Full.tar.gz v1.0.1] ~20Mb) all supporting data files. Note that the current version is 1.0.3, which will be posted shortly. The bundles contain the default directory structure. Defaults can be changed in the MatchLocations.pl script. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The directories are as follows:&lt;br /&gt;
*Source - Source data should be placed here. See below for formatting.&lt;br /&gt;
*Results - Results generated by the scripts, including logs will appear here.&lt;br /&gt;
*GNS - contains GNS reference data named GNS-XX.txt&lt;br /&gt;
*Match - contains the modules&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The bundle contains:&lt;br /&gt;
*MatchLocations.pl - The main script that initializes and processes the matching requests&lt;br /&gt;
*BatchMatch.pl - A script for running batches &lt;br /&gt;
*Match::GNS.pm - Interface to the GNS reference data (see below)&lt;br /&gt;
*Match::Patent.pm - Interface to the Patent Location data (see below)&lt;br /&gt;
*Match::Common.pm - Provides common (string cleaning) routines for both the reference and source interface modules&lt;br /&gt;
*Match::PostalCodes.pm - A module that extracts postcodes of various formats from (address) strings&lt;br /&gt;
*Match::Gram.pm - Custom NGram Module&lt;br /&gt;
*Match::LCS.pm - A standard LCS Module&lt;br /&gt;
*PatentLocations-Stopwords.txt - A Stop Word file (tab delimited)&lt;br /&gt;
*GNS Reference Files - The full bundle contains a full set of correctly named GNS reference files&lt;br /&gt;
&lt;br /&gt;
The MatchLocations.pl script can be run from any shell or command line with perl installed. Example commands are:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;tt&amp;gt;perl MatchLocations.pl -co GB -u -human -r -wf &amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which will process ISO3166 &amp;lt;tt&amp;gt;country&amp;lt;/tt&amp;gt; code GB (Great Britain), include &amp;lt;tt&amp;gt;unmatched&amp;lt;/tt&amp;gt; inputs in the results file, produce a &amp;lt;tt&amp;gt;human&amp;lt;/tt&amp;gt; choices file, write the &amp;lt;tt&amp;gt;report&amp;lt;/tt&amp;gt; to a text file, and &amp;lt;tt&amp;gt;write fuzzy&amp;lt;/tt&amp;gt; matches to additional separate files as well as the main results file. Other options include &amp;lt;tt&amp;gt;over&amp;lt;/tt&amp;gt; to override country designations and &amp;lt;tt&amp;gt;o&amp;lt;/tt&amp;gt; to specify the results filename.&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;tt&amp;gt;perl MatchLocations.pl -h&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
produces a simple help output.&lt;br /&gt;
&lt;br /&gt;
==The Source Files==&lt;br /&gt;
&lt;br /&gt;
Per country source files are extracted from the NBER patent data. The format of the source file(s) is as follows (XX is an ISO3166 code): &lt;br /&gt;
&lt;br /&gt;
 XX.txt - Tab delimited plain text with no (intentional) string quotation. &lt;br /&gt;
 Column(s): &amp;lt;tt&amp;gt;country&amp;lt;/tt&amp;gt; &amp;lt;tt&amp;gt;str&amp;lt;/tt&amp;gt; &amp;lt;tt&amp;gt;cty&amp;lt;/tt&amp;gt; &amp;lt;tt&amp;gt;adm&amp;lt;/tt&amp;gt; &amp;lt;tt&amp;gt;city&amp;lt;/tt&amp;gt; &amp;lt;tt&amp;gt;postcode&amp;lt;/tt&amp;gt; &amp;lt;tt&amp;gt;str&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The column order is not important. &amp;lt;tt&amp;gt;country&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;str&amp;lt;/tt&amp;gt;, and &amp;lt;tt&amp;gt;cty&amp;lt;/tt&amp;gt; cannot all be null. &amp;lt;tt&amp;gt;adm&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;city&amp;lt;/tt&amp;gt;, and &amp;lt;tt&amp;gt;postcode&amp;lt;/tt&amp;gt; are optional 'exception' fields that are processed with priority. They provide hand corrections and other specifically generated information.&lt;br /&gt;
&lt;br /&gt;
The Perl module Match::Patent.pm loads and provides an interface to this source data. The source code is the primary module documentation. The Match::PostalCodes.pm Perl module provides a method to extract [[Postal Codes]] from the addresses for a large number of ISO3166 codes, and implements 'standard' postal code identification for all other jurisdictions.&lt;br /&gt;
&lt;br /&gt;
==Reference Data==&lt;br /&gt;
&lt;br /&gt;
The reference data for the locations (which provides the longitudes and latitudes) is taken from the (U.S.) National Geospatial-Intelligence Agency's [[GEOnet Names Server | GEOnet Names Server (GNS)]], which covers the world excluding the U.S. and Antarctica. &lt;br /&gt;
&lt;br /&gt;
This project uses [[ISO3166]] two-character country codes to name source and reference files. GNS does not use ISO3166 country codes, and so users will need to translate accordingly (see the [[GEOnet Names Server | GNS page]] for details). A full bundle of correctly named GNS files is also available.&lt;br /&gt;
&lt;br /&gt;
The Perl module Match::GNS.pm loads, indexes and provides an interface to key variables from this data. The source code is the primary module documentation. The load() method takes an ISO3166 code, and the index methods and most other methods take specific GNS FC codes (e.g. &amp;quot;P&amp;quot; for populated place, &amp;quot;L&amp;quot; for locality, and &amp;quot;A&amp;quot; for administrative area). Which GNS FC codes are used is specified in the @Letters global variable of MatchLocations.pl and inherited by all other modules. &lt;br /&gt;
&lt;br /&gt;
MatchLocations.pl also retrieves a list of all ISO3166 codes included in the data (from the Match::Patent.pm module) and in any specified override file, and calls Match::GNS.pm to load them. An override file can be specified with the &amp;lt;tt&amp;gt;-over&amp;lt;/tt&amp;gt; option. Override files are tab-delimited and have the format:&lt;br /&gt;
&lt;br /&gt;
 ListedISO3166 1stPreference 2ndPreference 3rdPreference ...&lt;br /&gt;
&lt;br /&gt;
The ISO3166 listed in the source data is then overridden and the alternatives are searched for matches in order of preference. The search is terminated when a match is found or the override set is exhausted.&lt;br /&gt;
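&lt;br /&gt;
The override lookup described above can be sketched in Python. This is only an illustration and not the Perl code in MatchLocations.pl: the function names parse_overrides and first_match, the has_match callback, and the example country codes are all hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def parse_overrides(lines):
    """Map each listed ISO3166 code to its ordered list of preferred alternatives."""
    overrides = {}
    for line in lines:
        # Override files are tab-delimited: ListedISO3166, then the preferences.
        fields = [f.strip() for f in line.rstrip("\n").split("\t")]
        fields = [f for f in fields if f]
        if fields:
            overrides[fields[0]] = fields[1:]
    return overrides


def first_match(listed, overrides, has_match):
    """Try the alternatives in order of preference; stop at the first match.

    has_match is a hypothetical callback standing in for a GNS lookup.
    Returns None when the override set is exhausted without a match.
    """
    for code in overrides.get(listed, [listed]):
        if has_match(code):
            return code
    return None


# Hypothetical override line: a record listed as GB is tried as GB, IE, then IM.
example_overrides = parse_overrides(["GB\tGB\tIE\tIM\n"])
chosen = first_match("GB", example_overrides, lambda code: code == "IE")
```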
&lt;br /&gt;
==The Matching Process==&lt;br /&gt;
&lt;br /&gt;
The matching process is carried out by the [http://www.edegan.com/repository/MatchPatentLocations.pl MatchLocations.pl] script and its dependent modules (detailed above). The script has a standard pod-based command line interface. The &amp;lt;tt&amp;gt;-co&amp;lt;/tt&amp;gt; option specifies the ISO3166 country code to be matched. If the override option is used, then the &amp;lt;tt&amp;gt;-co&amp;lt;/tt&amp;gt; option can be used to specify the source file. When the override option is set to 1, rather than to the filename containing the overrides, the countries in the source file are used to determine which GNS lookups to perform; otherwise the &amp;lt;tt&amp;gt;-co&amp;lt;/tt&amp;gt; option specifies the GNS reference set.&lt;br /&gt;
&lt;br /&gt;
Glossary of terms:&lt;br /&gt;
*Units - isolated logical units from an address, such as the street number and name, the town, or the region. Postal codes are treated separately. &lt;br /&gt;
*Tokens - Single words or sequences of words separated by a space (note that this is a specific usage)&lt;br /&gt;
*n-grams - character sequences, such as bigrams (two letters from aa to zz), trigrams (aaa-zzz) and so forth&lt;br /&gt;
*Exact Matching - Case-insensitive matching of the entire sequence of both the source and the reference strings&lt;br /&gt;
*LCS - Longest Common Subsequence based matching (See below)&lt;br /&gt;
*Administrative area, populated place, and locality - locations identified as a FC=A, FC=P or FC=L respectively in the GNS data. Unless otherwise specified, matches are performed for all GNS FC codes requested (default is A,P,L) separately and in series.&lt;br /&gt;
&lt;br /&gt;
The sequence of processing is as follows (matching only the remaining unmatched locations at each stage):&lt;br /&gt;
#Load the source files, clean and parse (parsing identifies units)&lt;br /&gt;
#Load the reference file, build indices&lt;br /&gt;
#Exact match the exception units of records with exceptions&lt;br /&gt;
#Exact match the units of well-formatted records&lt;br /&gt;
#Exact match tokens (1-5 words)&lt;br /&gt;
#N-gram and LCS match&lt;br /&gt;
#Reconcile multiple matches&lt;br /&gt;
&lt;br /&gt;
===Exact Matching Units===&lt;br /&gt;
&lt;br /&gt;
The exact matching of units is performed for both the exception units and units of &amp;quot;well-formatted&amp;quot; records, that is, records that have comma-separated logical units. Where possible, postcodes are extracted first as a logical unit (to generate the PRS_POSTCODE field). Exact matching is case insensitive and units are trimmed of leading and trailing spaces, but otherwise the match must be exact. Units are matched from last to first, in order of precedence. That is, if the string is Unit1, Unit2, Unit3, Postcode, then Unit3 is matched with precedence over Unit2 and Unit1, and so forth. However, if multiple matches are made for one FC code and a single match is made for another, then preference is given to the combination that assigns different places to the different codes. For example, if the string were &amp;quot;Chelsea, London&amp;quot; and both Chelsea and London were recorded in the GNS data as FC=P, but only London was recorded as FC=A, then it would be most sensible to record P=Chelsea, A=London, and not P=London, A=London. This differencing is done in the matching method and is independent of the resolution of multiple matches at the end.&lt;br /&gt;
&lt;br /&gt;
===Token Matching===&lt;br /&gt;
&lt;br /&gt;
Each source string is cleaned of upper-ASCII characters (i.e. reduced to alphanumerics plus spaces), stripped of its postcode (to generate PRS_POSTCODE) and then separated into a token array on space characters. Thus the string &amp;quot;String1 String2 String3 String4 Postcode&amp;quot; would become:&lt;br /&gt;
[0]String1 &lt;br /&gt;
[1]String2 &lt;br /&gt;
[2]String3 &lt;br /&gt;
[3]String4 &lt;br /&gt;
&lt;br /&gt;
An arbitrary upper limit of 5 is placed on the token set length; if the source token array is shorter than 5 (it has length 4 in the example above), its own length is the limit. Beginning at this limit and decreasing by one after every set of that length has been tried, and starting from the right-hand side and moving one token to the left each time, the token sets are joined with spaces and exact matched against the reference strings. The process iterates until all length-one token sets have been tried, and records the matches in the order that they were made. Thus, continuing the example above, the space-joined source token sets would be, in the order that they are tried:&lt;br /&gt;
#String1 String2 String3 String4 (token set length=4, first and only set)&lt;br /&gt;
#String2 String3 String4 (token set length=3, first set)&lt;br /&gt;
#String1 String2 String3 (token set length=3, second set)&lt;br /&gt;
#String3 String4  (token set length=2, first set)&lt;br /&gt;
#String2 String3 (token set length=2, second set)&lt;br /&gt;
#String1 String2 (token set length=2, third set)&lt;br /&gt;
#String4 (token set length=1, first set)&lt;br /&gt;
#String3 (token set length=1, second set)&lt;br /&gt;
#String2 (token set length=1, third set)&lt;br /&gt;
#String1 (token set length=1, fourth set)&lt;br /&gt;
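&lt;br /&gt;
The ordering above can be sketched in Python. This is a minimal illustration rather than the code in MatchLocations.pl (which is Perl); the function name token_sets and its max_len parameter are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def token_sets(tokens, max_len=5):
    """Return space-joined token sets, longest first and right-most first."""
    # The upper limit is 5, or the array length when the array is shorter.
    upper = min(len(tokens), max_len)
    sets = []
    # Try the longest sets first, shrinking by one token on each pass.
    for length in range(upper, 0, -1):
        # Within a length, start at the right-hand side and slide left.
        for start in range(len(tokens) - length, -1, -1):
            sets.append(" ".join(tokens[start:start + length]))
    return sets


example = token_sets(["String1", "String2", "String3", "String4"])
```
&lt;br /&gt;
For the four-token example, the sketch yields the ten sets listed above, in the same order. Each set would then be exact matched against the reference strings.&lt;br /&gt;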
&lt;br /&gt;
===NGram and LCS Matching===&lt;br /&gt;
&lt;br /&gt;
Longest Common Subsequence (LCS) is a widely used fuzzy matching technique. The [http://en.wikipedia.org/wiki/Longest_common_subsequence Longest Common Subsequence page on wikipedia] provides a very detailed background. However, computing the LCS of two strings takes time proportional to the product of their lengths, so LCS matching every source string against every reference string is extremely processor intensive. To avoid long run-times, LCS matching is done on only a small sub-set of strings that have met the NGram criteria detailed below.&lt;br /&gt;
&lt;br /&gt;
NGrams are character-based token strings. Source and reference strings are transformed to include only characters from one of the following numbered sets:&lt;br /&gt;
#ABCDEFGHIJKLMNOPQRSTUVWXYZ (i.e. uppercase Latin alphabet)&lt;br /&gt;
#0123456789 (i.e. Standard numbers)&lt;br /&gt;
#&amp;quot; &amp;quot; (i.e. the space character)&lt;br /&gt;
#Alphanumeric (i.e. 1 and 2 above)&lt;br /&gt;
#Alphanumeric plus space (1,2,3 above)&lt;br /&gt;
#Alphabet only plus space(1 and 3 above)&lt;br /&gt;
&lt;br /&gt;
Gramsets of various lengths, currently 2, 3 and 4 characters, are used. Reference strings are decomposed into grams and both forward and reverse indexed for speed on large datasets. Source strings are decomposed into grams. Candidate reference strings are found by compiling a list of all reference strings that contain at least one of the source grams (this uses the reverse index). Then each of the candidate reference strings is gram-matched against the source grams to compute a number of scores. Specifically, gram-matching involves computing:&lt;br /&gt;
*The number of grams in the source string&lt;br /&gt;
*The number of grams in the reference string&lt;br /&gt;
*The number of grams in the source string that appear in the reference string&lt;br /&gt;
*The number of grams in the reference string that appear in the source string&lt;br /&gt;
*The above two numbers as a source percentage and a reference percentage&lt;br /&gt;
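&lt;br /&gt;
The scores listed above can be sketched in Python. This sketch assumes that grams are counted with repeats and tested for membership in the other string's set of grams; the actual counting in Match::Gram.pm may differ, and the names grams and gram_scores are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def grams(text, n):
    """Return the list of character n-grams in text, in order, with repeats."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def gram_scores(source, reference, n=2):
    """Compute the gram counts and percentages for one source/reference pair."""
    src, ref = grams(source, n), grams(reference, n)
    src_set, ref_set = set(src), set(ref)
    # Grams of one string that also occur somewhere in the other string.
    src_in_ref = sum(1 for g in src if g in ref_set)
    ref_in_src = sum(1 for g in ref if g in src_set)
    return {
        "source_total": len(src),
        "ref_total": len(ref),
        "src_in_ref": src_in_ref,
        "ref_in_src": ref_in_src,
        "source_pc": 100.0 * src_in_ref / len(src) if src else 0.0,
        "ref_pc": 100.0 * ref_in_src / len(ref) if ref else 0.0,
    }


# Hypothetical pair: a correctly spelled source against a misspelled reference.
scores = gram_scores("CAMBRIDGE", "CAMBRIGE")
```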
&lt;br /&gt;
Then an optional constraint that the first letter of the source and reference strings must match can be employed. If a source and reference pair meets a variety of criteria on the above variables (for example the source string is greater than 10 characters long, the source percentage of grams is greater than 80% and less than 110%, the reference percentage of grams is greater than 90% and less than 105%, and the two strings start with the same letter), then the LCS is computed. An LCS percentage, using the maximum of the lengths of the source and reference strings as a denominator, is also calculated. &lt;br /&gt;
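&lt;br /&gt;
As an illustration of the LCS percentage, the following Python sketch uses the textbook dynamic-programming LCS; it is not taken from Match::LCS.pm, and the function names are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    # prev[j] holds the LCS length of the processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[len(b)]


def lcs_percentage(source, reference):
    """LCS length as a percentage of the longer of the two strings."""
    longest = max(len(source), len(reference))
    if longest == 0:
        return 0.0
    return 100.0 * lcs_length(source, reference) / longest
```
&lt;br /&gt;
The denominator is the maximum of the two lengths, so a short reference string that is wholly contained in a long source string does not score 100%.&lt;br /&gt;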
&lt;br /&gt;
The candidate reference string with the highest gram and LCS scores, assuming that these scores meet the decision threshold is then selected as the closest match. If no candidate reference string meets the decision threshold the source string is left unmatched. The decision thresholds are configured in the MatchLocations.pl script, and sets of Ngram/LCS matchings, using different character sets, gram lengths and decision thresholds, are performed sequentially, with the currently unmatched source strings used as input for each round.&lt;br /&gt;
&lt;br /&gt;
===Reconciling Multiple Matches===&lt;br /&gt;
&lt;br /&gt;
In a small number of cases it is possible that the source string will achieve more than one match for more than one FC code. For example suppose the string &amp;quot;Glouchester Street Cambridge Cambridgeshire&amp;quot; were considered. This could conceivably produce two P matches and one A match with the token matching algorithm detailed above. &lt;br /&gt;
&lt;br /&gt;
To reconcile multiple matches the following process is undertaken:&lt;br /&gt;
*If an FC code has only one match keep that one match&lt;br /&gt;
*Aim for distinction in the set, giving priority in the order that the FC codes are specified in MatchLocations.pl. The default is to include A,P,L in order, so that precedence follows importance and size. This is important if multiple FC codes contain multiple overlapping matches. For example suppose A=1,2 P=2,3 and L=3. The algorithm will look forward and backwards to assign: A=1 P=2 L=3.&lt;br /&gt;
*Determine the set of FC code matches with the shortest distance between them using a [http://en.wikipedia.org/wiki/Haversine_formula Haversine formula] distance calculation based on the GNS reported longitudes and latitudes. (Note that the Haversine formula is implemented in the Match::GNS.pm module and is the most accurate method over short distances, where other methods, like the great-circle method, suffer from compounded rounding error problems.) This is important when multiple FC codes have multiple matches but they do not overlap.&lt;br /&gt;
*If one or more match is found for an FC code then one final 'best' match must be reported, even if it overlaps with another FC code or is distant.&lt;br /&gt;
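&lt;br /&gt;
The distance calculation used in the third step can be sketched in Python as follows. This is the standard Haversine formula rather than the routine in Match::GNS.pm; the function name, the use of kilometres and the mean Earth radius of 6371 km are assumptions.&lt;br /&gt;
&lt;br /&gt;
```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Guard against rounding pushing a marginally above 1.
    a = min(1.0, a)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```
&lt;br /&gt;
Among candidate combinations of FC code matches, the one with the smallest distance between its members would be kept.&lt;br /&gt;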
&lt;br /&gt;
==Human Choices==&lt;br /&gt;
&lt;br /&gt;
It is generally preferable to have a very high degree of confidence in the fuzzy matches, so that they can be treated as correct without individual inspection. However the script and modules are capable of matching to any degree of accuracy. To get further matches that can be inspected/validated/chosen by a human agent, very weak criteria are set for two runs of fuzzy matching, and then in each run the best (in terms of parameter scores) options are recorded and written into a 'human choice' file.&lt;br /&gt;
&lt;br /&gt;
As a result a human choice file may contain:&lt;br /&gt;
#No matches for a source string as none of the reference strings managed to reach even the very weak threshold criteria.&lt;br /&gt;
#One match, as both runs of fuzzy matching produced the same recommendation.&lt;br /&gt;
#Two matches, as both runs of fuzzy matching produced one best candidate and the candidates were unique.&lt;br /&gt;
#More than two matches, as one or both of the fuzzy matching runs had multiple unique candidates with the same scores.&lt;br /&gt;
&lt;br /&gt;
It appears likely that blocks of matches will be able to be identified from the human choice files, by restricting the results sets to ranges for one or more of the provided match accuracy parameters.&lt;br /&gt;
&lt;br /&gt;
==Output Files==&lt;br /&gt;
&lt;br /&gt;
By default all files are outputted to the Results directory. Which files are outputted depends on the options selected, though the main results file is always outputted (with or without unmatched addresses) and includes fuzzy matches (unless the &amp;lt;tt&amp;gt;-e&amp;lt;/tt&amp;gt; option is used to force just exact matching). The main results file outputs:&lt;br /&gt;
*COUNTRY - From the source entry&lt;br /&gt;
*STR - From the source entry&lt;br /&gt;
*CTY - From the source entry&lt;br /&gt;
*EXP_CITY - From the source entry&lt;br /&gt;
*EXP_ADM - From the source entry&lt;br /&gt;
*EXP_POSTCODE - From the source entry&lt;br /&gt;
*CTY_STR - A compound entry, delimited by #, used as an internal key. It is the software's best estimate of an address structure.&lt;br /&gt;
*EXP_STR - A compound entry, delimited by #, made from the exception data in a similar way to CTY_STR&lt;br /&gt;
*PRS_POSTCODE - The software's best estimate of the postcode if any&lt;br /&gt;
*MATCH_TYPE - The match type that was used to make the match&lt;br /&gt;
*PLACE - The name of the most precise location&lt;br /&gt;
*UNI - The GNS unique identifier of the most precise location&lt;br /&gt;
*LAT - The latitude of the most precise location&lt;br /&gt;
*LONG - The longitude of the most precise location&lt;br /&gt;
*FC - The FC code of the most precise location&lt;br /&gt;
&lt;br /&gt;
The most precise location is taken to be the finest grained result. That is the match corresponding to the lowest level FC code. In the case of the default of FC=A,P,L preference is given to L then P then A. The following variables are then repeated for each FC code searched, and prefixed by the FC code (if no match was found for this FC code the entries will be blank):&lt;br /&gt;
*NAME&lt;br /&gt;
*UNI&lt;br /&gt;
*LAT&lt;br /&gt;
*LONG&lt;br /&gt;
&lt;br /&gt;
The fuzzy match file(s), if requested with &amp;lt;tt&amp;gt;-wf&amp;lt;/tt&amp;gt;, have the same format (they are written by the same method). The report file is a copy of the output to the terminal, and can be enabled with the &amp;lt;tt&amp;gt;-r&amp;lt;/tt&amp;gt; option. The human choice file (enabled with &amp;lt;tt&amp;gt;-human&amp;lt;/tt&amp;gt;) has its own format as follows:&lt;br /&gt;
*SOURCENAME - The word, token or string from the source entry that is being considered as relevant for a match&lt;br /&gt;
*REFNAME - The name of a place in the GNS file&lt;br /&gt;
*COUNTRY - From the source entry&lt;br /&gt;
*STR - From the source entry&lt;br /&gt;
*CTY - From the source entry&lt;br /&gt;
*EXP_CITY - From the source entry&lt;br /&gt;
*EXP_ADM - From the source entry&lt;br /&gt;
*EXP_POSTCODE - From the source entry&lt;br /&gt;
*REFTOTAL - The total number of grams in REFNAME&lt;br /&gt;
*SOURCETOTAL - The total number of grams in SOURCENAME&lt;br /&gt;
*REFPC - the percentage of the REFNAME grams that appear in the SOURCENAME gram set&lt;br /&gt;
*SOURCEPC - the percentage of the SOURCENAME grams that appear in the REFNAME gram set&lt;br /&gt;
*LEFTGRAMS - the number of the REFNAME grams that appear in the SOURCENAME gram set&lt;br /&gt;
*RIGHTGRAMS - the number of the SOURCENAME grams that appear in the REFNAME gram set&lt;br /&gt;
*LCSSCORE - The size of the longest common subsequence in characters&lt;br /&gt;
*SOURCELENGTH - The length of SOURCENAME&lt;br /&gt;
*REFLENGTH - The length of REFNAME&lt;br /&gt;
*MAXLENGTH - The maximum of the lengths of SOURCENAME and REFNAME&lt;br /&gt;
*LCSPC - The LCSSCORE divided by the MAXLENGTH&lt;br /&gt;
*FIRSTLETTERBINDS - Whether the fuzzy matching algorithm required the same first letter in SOURCENAME and REFNAME&lt;br /&gt;
*GRAMALPHABET - The gram alphabet used by the matching algorithm&lt;br /&gt;
*GRAMLENGTH - The length of the n-grams used&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[category:McNair Admin]]&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=PhD_Masterclass_-_How_to_Build_a_Web_Crawler&amp;diff=6906</id>
		<title>PhD Masterclass - How to Build a Web Crawler</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=PhD_Masterclass_-_How_to_Build_a_Web_Crawler&amp;diff=6906"/>
		<updated>2016-07-18T17:00:35Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: Created page with &amp;quot;This page provides resources for the PhD Masterclass &amp;quot;How to Build a Web Crawler&amp;quot;, which I gave on Friday 28th January 2011 to interested PhD students at Haas.  ==Tools==  *[h...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page provides resources for the PhD Masterclass &amp;quot;How to Build a Web Crawler&amp;quot;, which I gave on Friday 28th January 2011 to interested PhD students at Haas.&lt;br /&gt;
&lt;br /&gt;
==Tools==&lt;br /&gt;
&lt;br /&gt;
*[http://www.perl.org/ Perl] - Available with a large set of useful modules for Windows from ActiveState as [http://www.activestate.com/activeperl ActivePerl]&lt;br /&gt;
*[http://www.activestate.com/komodo-ide Komodo] - An integrated development environment for Perl available from ActiveState&lt;br /&gt;
*[http://www.textpad.com/ Textpad] - A powerful shareware text editor that supports [http://en.wikipedia.org/wiki/Regular_expression regular expressions]&lt;br /&gt;
&lt;br /&gt;
You should [http://www.activestate.com/komodo-ide/downloads download a trial of Komodo] to help you learn. The trial is valid for 21 days (longer if you keep changing your system clock). Komodo will let you step through your code, line by line, and see the values that your variables take on.&lt;br /&gt;
&lt;br /&gt;
Perl is a free and open language, with a rich history, so you will find a wealth of information on the web to help you learn and use it.&lt;br /&gt;
&lt;br /&gt;
==Sample Perl Code==&lt;br /&gt;
&lt;br /&gt;
We wrote a couple of simple scripts together to get to grips with Perl.&lt;br /&gt;
&lt;br /&gt;
===Running a Perl Script===&lt;br /&gt;
&lt;br /&gt;
The first was (save it in a file called Script1.pl in the root of your R drive):&lt;br /&gt;
&lt;br /&gt;
 print &amp;quot;Hello World&amp;quot;;&lt;br /&gt;
&lt;br /&gt;
To execute the script we can either open a command prompt and run the script:&lt;br /&gt;
 Start-&amp;gt;Run-&amp;gt;&amp;quot;cmd.exe&amp;quot;&lt;br /&gt;
 R:&lt;br /&gt;
 perl Script1.pl&lt;br /&gt;
&lt;br /&gt;
Or we can run it in Komodo by going to:&lt;br /&gt;
 Debug-&amp;gt;Go&lt;br /&gt;
&lt;br /&gt;
(Under Preferences-&amp;gt;Debugger tick the box to avoid being prompted by the debug dialog each time)&lt;br /&gt;
&lt;br /&gt;
Or we can shell on to Bear and run it there:&lt;br /&gt;
 Use PuTTY to connect to bear.haas.berkeley.edu (see [[Research Computing At Haas|here]]).&lt;br /&gt;
 perl Script1.pl&lt;br /&gt;
&lt;br /&gt;
===Processing Text Data===&lt;br /&gt;
&lt;br /&gt;
Next we went to:&lt;br /&gt;
&lt;br /&gt;
 http://www.contractormisconduct.org/index.cfm/1,73,222,html?CaseID=2&lt;br /&gt;
&lt;br /&gt;
And we created a file called Data.txt (saved next to the script) that contained the following:&lt;br /&gt;
&lt;br /&gt;
 Accenture&lt;br /&gt;
 Potential Foreign Corrupt Practices Act Violation&lt;br /&gt;
 Date:  07/01/2003 (Date of Incident Report)&lt;br /&gt;
 &lt;br /&gt;
 Misconduct Type:  Ethics&lt;br /&gt;
 &lt;br /&gt;
 Enforcement Agency:  SEC&lt;br /&gt;
 &lt;br /&gt;
 Contracting Party:  None&lt;br /&gt;
 &lt;br /&gt;
 Court Type:  Administrative&lt;br /&gt;
 &lt;br /&gt;
 Amount:  $0&lt;br /&gt;
 &lt;br /&gt;
 Disposition:  Pending&lt;br /&gt;
 &lt;br /&gt;
 Synopsis:  &amp;quot;As previously reported in July 2003, we became aware of an incident...&amp;quot;&lt;br /&gt;
 &lt;br /&gt;
 Document(s):&lt;br /&gt;
 •1.  SEC 10-K (p. 34 of 137)&lt;br /&gt;
&lt;br /&gt;
We then wrote the following script to process the data:&lt;br /&gt;
&lt;br /&gt;
 #!/usr/bin/perl -w&lt;br /&gt;
 #Lines that start with a # are comments that aren't read by the interpreter&lt;br /&gt;
 &lt;br /&gt;
 use strict;&lt;br /&gt;
 #The strict module forces us to declare variables before we use them&lt;br /&gt;
 &lt;br /&gt;
 my @Textfile;&lt;br /&gt;
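 #Declare an array called Textfile&lt;br /&gt;
 &lt;br /&gt;
 open (DATA,&amp;quot;Data.txt&amp;quot;) || die &amp;quot;Can't open Data.txt to read $!&amp;quot;;&lt;br /&gt;
 #Open a filehandle on our file, stopping with an error message if that fails&lt;br /&gt;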
 &lt;br /&gt;
 while (&amp;lt;DATA&amp;gt;) {&lt;br /&gt;
 #Read the data from the filehandle, line by line&lt;br /&gt;
 &lt;br /&gt;
     chomp $_;&lt;br /&gt;
     #$_ is a special variable - it captures the line being read from the filehandle here&lt;br /&gt;
 &lt;br /&gt;
     if (!$_) {next;}&lt;br /&gt;
     #If the line is empty (i.e. blank) move to the next loop iteration; note that a line containing only 0 is also skipped&lt;br /&gt;
 &lt;br /&gt;
     my $line = $_; &lt;br /&gt;
     #Set a local variable called line to $_&lt;br /&gt;
 &lt;br /&gt;
     push (@Textfile, $line);&lt;br /&gt;
     #Push the line onto the Textfile array&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 my $Doccell;&lt;br /&gt;
 #Declare the Doccell variable&lt;br /&gt;
 &lt;br /&gt;
 for (my $i=0; $i&amp;lt;=$#Textfile; $i++) {&lt;br /&gt;
 #Do a for loop, starting from i=0, going while i is less than or equal to the &lt;br /&gt;
 #last index of the Textfile array, and incrementing by one each time&lt;br /&gt;
 &lt;br /&gt;
     if ($Textfile[$i]=~/^Document\(s\):/) {$Doccell=$i;}&lt;br /&gt;
     #Test to see if the entry matches a regular expression, if it does record the index&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 my @docs = splice(@Textfile,$Doccell);&lt;br /&gt;
 #Create a new array by splicing out everything from the index we just found onwards&lt;br /&gt;
 &lt;br /&gt;
 shift @docs;&lt;br /&gt;
 #Remove the first element of the docs array (the Document(s): header line)&lt;br /&gt;
 &lt;br /&gt;
 my $Firm = shift @Textfile;&lt;br /&gt;
 #Set Firm equal to the first element of Textfile (which we just removed)&lt;br /&gt;
 &lt;br /&gt;
 my $Violation =shift(@Textfile);&lt;br /&gt;
 #Set Violation equal to the (new) first element of Textfile (which we just removed)&lt;br /&gt;
 &lt;br /&gt;
 my $Offense={};&lt;br /&gt;
 #Create an anonymous hash&lt;br /&gt;
 &lt;br /&gt;
 foreach my $cell (@Textfile) {&lt;br /&gt;
 #Iterate over Textfile, setting the current element to cell&lt;br /&gt;
 &lt;br /&gt;
     my ($name,@value)=split(&amp;quot;:&amp;quot;,$cell);&lt;br /&gt;
     #Split the cell on :&lt;br /&gt;
 &lt;br /&gt;
     my $value=join(&amp;quot;:&amp;quot;,@value);&lt;br /&gt;
     #Join the Value array on :&lt;br /&gt;
 &lt;br /&gt;
     $Offense-&amp;gt;{$name}=$value;&lt;br /&gt;
     #Set an entry in the Offense hash&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 $Offense-&amp;gt;{&amp;quot;DocList&amp;quot;}=\@docs;&lt;br /&gt;
 #Set the doclist entry in the Offense hash to a reference to the docs array&lt;br /&gt;
 &lt;br /&gt;
 my $Master=[];&lt;br /&gt;
 #Define an anonymous array&lt;br /&gt;
 &lt;br /&gt;
 $Master-&amp;gt;[0]={};&lt;br /&gt;
 #Define an anonymous hash in the zeroth cell of the anonymous array&lt;br /&gt;
 &lt;br /&gt;
 $Master-&amp;gt;[0]-&amp;gt;{FirmName}=$Firm;&lt;br /&gt;
 #Set a hash entry&lt;br /&gt;
 &lt;br /&gt;
 $Master-&amp;gt;[0]-&amp;gt;{Offense}=$Offense;&lt;br /&gt;
 #Set a hash entry&lt;br /&gt;
 &lt;br /&gt;
 $Master-&amp;gt;[0]-&amp;gt;{Violation}=$Violation;&lt;br /&gt;
 #Set a hash entry&lt;br /&gt;
 &lt;br /&gt;
 open(OUTPUT,&amp;quot;&amp;gt;Result.txt&amp;quot;) || die &amp;quot;Can't write the Result.txt file $!&amp;quot;;&lt;br /&gt;
 #Open a filehandle for writing (overwrite the file if it exists)&lt;br /&gt;
 &lt;br /&gt;
 print OUTPUT $Master-&amp;gt;[0]-&amp;gt;{FirmName};&lt;br /&gt;
 #Print to the output file an entry from the anonymous hash in the anonymous array&lt;br /&gt;
 &lt;br /&gt;
 print OUTPUT &amp;quot;\t&amp;quot;;&lt;br /&gt;
 #Print a tab&lt;br /&gt;
 &lt;br /&gt;
 print OUTPUT $Master-&amp;gt;[0]-&amp;gt;{Violation}.&amp;quot;\t&amp;quot;;&lt;br /&gt;
 #Print another entry with another tab on the end&lt;br /&gt;
 &lt;br /&gt;
 foreach my $key ( sort {$a cmp $b } (keys %{ $Master-&amp;gt;[0]-&amp;gt;{Offense} } )) {&lt;br /&gt;
 #Iterate through the hash's keys, in alphabetical order, setting the current key to $key&lt;br /&gt;
 &lt;br /&gt;
     print OUTPUT  $Master-&amp;gt;[0]-&amp;gt;{Offense}-&amp;gt;{$key}.&amp;quot;\t&amp;quot;;&lt;br /&gt;
     #Print an entry, with a tab&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 print OUTPUT &amp;quot;\n&amp;quot;;&lt;br /&gt;
 #Print a new line&lt;br /&gt;
 &lt;br /&gt;
 close OUTPUT;&lt;br /&gt;
 #Close the output filehandle - this will flush the write buffer&lt;br /&gt;
 &lt;br /&gt;
==Modules==&lt;br /&gt;
&lt;br /&gt;
One of the joys of Perl is [http://www.cpan.org/ CPAN - The Comprehensive Perl Archive Network], which acts as a repository for Perl modules (as well as scripts, distros and much else). There are modules written by people from all over the world for almost every conceivable purpose. There is usually no need to reinvent the wheel in Perl - just grab a module (e.g. Wheel::Base)!&lt;br /&gt;
&lt;br /&gt;
We tested some code using LWP::UserAgent and HTML::TreeBuilder. Useful documentation is here:&lt;br /&gt;
&lt;br /&gt;
*[http://search.cpan.org/~gaas/libwww-perl-5.837/lib/LWP/UserAgent.pm LWP::UserAgent]&lt;br /&gt;
*[http://search.cpan.org/~petdance/WWW-Mechanize-1.66/lib/WWW/Mechanize.pm WWW::Mechanize]&lt;br /&gt;
*[http://search.cpan.org/~gaas/libwww-perl-5.837/lib/HTTP/Response.pm HTTP::Response]&lt;br /&gt;
*[http://search.cpan.org/~jfearn/HTML-Tree-4.1/lib/HTML/TreeBuilder.pm HTML::TreeBuilder]&lt;br /&gt;
*[http://search.cpan.org/~jfearn/HTML-Tree-4.1/lib/HTML/Element.pm HTML::Element]&lt;br /&gt;
*[http://annocpan.org/~GAAS/libwww-perl-5.837/lib/LWP/RobotUA.pm LWP::RobotUA]&lt;br /&gt;
*[http://annocpan.org/~GRANTM/XML-Simple-2.18/lib/XML/Simple.pm XML::Simple]&lt;br /&gt;
&lt;br /&gt;
Below is a simple UserAgent example:&lt;br /&gt;
&lt;br /&gt;
 use LWP::UserAgent;&lt;br /&gt;
 #Use the LWP::UserAgent module&lt;br /&gt;
 &lt;br /&gt;
 my $ua = LWP::UserAgent-&amp;gt;new;&lt;br /&gt;
 #Create a new UserAgent&lt;br /&gt;
 &lt;br /&gt;
 my $url=&amp;quot;http://www.contractormisconduct.org/index.cfm/1,73,222,html?CaseID=2&amp;quot;;&lt;br /&gt;
 #Set up a string containing a URL&lt;br /&gt;
 &lt;br /&gt;
 my $response = $ua-&amp;gt;get($url);&lt;br /&gt;
 #Use the UA 'get' method to retrieve the webpage. This returns an HTTP Response object&lt;br /&gt;
 &lt;br /&gt;
 my $content=$response-&amp;gt;decoded_content;&lt;br /&gt;
 #Get the response as one long text string, so we can work with it...&lt;br /&gt;
&lt;br /&gt;
And now for a TreeBuilder example:&lt;br /&gt;
&lt;br /&gt;
 use HTML::TreeBuilder;&lt;br /&gt;
 #Use the HTML::TreeBuilder module&lt;br /&gt;
 &lt;br /&gt;
 my $tree = HTML::TreeBuilder-&amp;gt;new; # empty tree&lt;br /&gt;
 #Create a new tree object&lt;br /&gt;
 &lt;br /&gt;
 $tree-&amp;gt;parse($content);&lt;br /&gt;
 #Load up the tree from the content string (that we got using UA)&lt;br /&gt;
 &lt;br /&gt;
 $tree-&amp;gt;eof;&lt;br /&gt;
 #Tell the parser that there is no more content, so that it finishes building the tree&lt;br /&gt;
 &lt;br /&gt;
 my $dump=$tree-&amp;gt;as_text;&lt;br /&gt;
 #Optionally dump the tree as text&lt;br /&gt;
 &lt;br /&gt;
 my $incidentelement=$tree-&amp;gt;look_down(&amp;quot;id&amp;quot;,&amp;quot;primecontent&amp;quot;);&lt;br /&gt;
 #Or use HTML::Element methods to look_down the tree for a tag with some properties&lt;br /&gt;
&lt;br /&gt;
==An Example Webcrawler==&lt;br /&gt;
&lt;br /&gt;
I wrote the following simple webcrawler for a fellow PhD student:&lt;br /&gt;
&lt;br /&gt;
 #!/usr/bin/perl -w&lt;br /&gt;
 use strict;&lt;br /&gt;
 &lt;br /&gt;
 use LWP::UserAgent;&lt;br /&gt;
 #Use the LWP::UserAgent module&lt;br /&gt;
 use HTML::TreeBuilder;&lt;br /&gt;
 #Use the HTML::TreeBuilder module&lt;br /&gt;
 &lt;br /&gt;
 my $ua = LWP::UserAgent-&amp;gt;new;&lt;br /&gt;
 #Create a new UserAgent&lt;br /&gt;
 &lt;br /&gt;
 my @Pkids;&lt;br /&gt;
 open (PKIDS,&amp;quot;Pkidfile.txt&amp;quot;) || die &amp;quot;Can't open the PKID file to read $!&amp;quot;;&lt;br /&gt;
 #Open the Pkid file to read - this file has a Pkid on each line. You can get some from here: http://myaccount.sdar.com/RealtorSrch.asp&lt;br /&gt;
 &lt;br /&gt;
 while (&amp;lt;PKIDS&amp;gt;) {&lt;br /&gt;
     #Read the pkid file line by line&lt;br /&gt;
     &lt;br /&gt;
     chomp $_;&lt;br /&gt;
     #Remove the \n (newline symbol) from each line&lt;br /&gt;
     &lt;br /&gt;
     push(@Pkids,$_);&lt;br /&gt;
     #Add the PKID to an array&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 open (RESULTS,&amp;quot;&amp;gt;Results.txt&amp;quot;) || die &amp;quot;Can't write the Results.txt file $!&amp;quot;;&lt;br /&gt;
 #Open the Results file to write&lt;br /&gt;
 &lt;br /&gt;
 my $headerflag=0;&lt;br /&gt;
 #Set a flag to indicate whether we wrote the header line to the output file&lt;br /&gt;
 &lt;br /&gt;
 foreach my $Pkid (sort (@Pkids)) {&lt;br /&gt;
     #Go through the PKIDs in order&lt;br /&gt;
     &lt;br /&gt;
     my $url=&amp;quot;http://myaccount.sdar.com/RealtorSrchDetail.asp?PKID=&amp;quot;.$Pkid;&lt;br /&gt;
     #Set up a string containing a URL&lt;br /&gt;
 &lt;br /&gt;
     my $response = $ua-&amp;gt;get($url);&lt;br /&gt;
     #Use the UA 'get' method to retrieve the webpage. This returns an HTTP Response object&lt;br /&gt;
 &lt;br /&gt;
     my $content=$response-&amp;gt;decoded_content;&lt;br /&gt;
     #Get the response as one long text string, so we can work with it...&lt;br /&gt;
 &lt;br /&gt;
     my $tree = HTML::TreeBuilder-&amp;gt;new; # empty tree&lt;br /&gt;
     #Create a new tree object&lt;br /&gt;
 &lt;br /&gt;
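     $tree-&amp;gt;parse($content);&lt;br /&gt;
     #Load up the tree from the content string (that we got using UA)&lt;br /&gt;
 &lt;br /&gt;
     $tree-&amp;gt;eof;&lt;br /&gt;
     #Tell the parser that there is no more content, so that it finishes building the tree&lt;br /&gt;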
 &lt;br /&gt;
     my $name=$tree-&amp;gt;look_down(&amp;quot;width&amp;quot;,&amp;quot;520&amp;quot;);&lt;br /&gt;
     #Find an element in the HTML that has width=520 (this is where names are stored)&lt;br /&gt;
     &lt;br /&gt;
     my $nametext=$name-&amp;gt;as_text;&lt;br /&gt;
     #Convert it to text&lt;br /&gt;
     &lt;br /&gt;
     $nametext=~s/^\s{1,}//;&lt;br /&gt;
     #Remove leading spaces&lt;br /&gt;
     &lt;br /&gt;
     $nametext=~s/\s{1,}$//;&lt;br /&gt;
     #Remove trailing spaces&lt;br /&gt;
     &lt;br /&gt;
     $nametext=~s/\s{2,}/ /g;&lt;br /&gt;
     #Replace runs of two or more spaces with a single space, globally&lt;br /&gt;
     &lt;br /&gt;
     my @fieldstext;&lt;br /&gt;
     #Declare an array&lt;br /&gt;
     &lt;br /&gt;
     my @fields=$tree-&amp;gt;look_down(&amp;quot;class&amp;quot;,&amp;quot;field_labels&amp;quot;);&lt;br /&gt;
     #Find all of the field elements&lt;br /&gt;
     &lt;br /&gt;
     foreach my $field (@fields) {&lt;br /&gt;
     #Go through them&lt;br /&gt;
     &lt;br /&gt;
         my $fieldparent=$field-&amp;gt;parent;&lt;br /&gt;
         #Go to their parent&lt;br /&gt;
         &lt;br /&gt;
         my $fieldparenttext=$fieldparent-&amp;gt;as_text;&lt;br /&gt;
         #Turn the parent into text&lt;br /&gt;
         &lt;br /&gt;
         $fieldparenttext=~s/^\s{1,}//; $fieldparenttext=~s/\s{1,}$//; $fieldparenttext=~s/\s{2,}/ /g;&lt;br /&gt;
         #Deal with spaces again&lt;br /&gt;
         &lt;br /&gt;
         push @fieldstext,$fieldparenttext;&lt;br /&gt;
         #Add the fields to a list&lt;br /&gt;
     }&lt;br /&gt;
     &lt;br /&gt;
     &amp;amp;writeoutput($Pkid,$nametext,@fieldstext);&lt;br /&gt;
     #Call the write output subroutine&lt;br /&gt;
     &lt;br /&gt;
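     $content=undef;  $tree-&amp;gt;delete;  $tree=undef;  $name=undef;  undef @fields;&lt;br /&gt;
     #Delete the tree explicitly (older HTML::Tree versions do not free it otherwise) and set a bunch of variables to undefined - this frees up memory&lt;br /&gt;
     &lt;br /&gt;
     sleep(2);&lt;br /&gt;
     #Pause for two seconds between requests...&lt;br /&gt;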
 }&lt;br /&gt;
 &lt;br /&gt;
 close (RESULTS);&lt;br /&gt;
 #Close the Results filehandle - this flushes the write buffer&lt;br /&gt;
 &lt;br /&gt;
 sub writeoutput {&lt;br /&gt;
 #Declare the writeoutput subroutine&lt;br /&gt;
 &lt;br /&gt;
     my $data={};&lt;br /&gt;
     #Set up an anonymous hash&lt;br /&gt;
     &lt;br /&gt;
     $data-&amp;gt;{&amp;quot;A Pkid&amp;quot;}=shift @_;&lt;br /&gt;
     #Set the A PKID field to the first parameter passed to the subroutine&lt;br /&gt;
     &lt;br /&gt;
     $data-&amp;gt;{&amp;quot;A Name&amp;quot;}=shift @_;&lt;br /&gt;
     #Set the A Name field to the second parameter passed to the subroutine (the first has now gone)&lt;br /&gt;
     &lt;br /&gt;
     push(my @fields,@_);&lt;br /&gt;
     #Add the remaining parameters to an array&lt;br /&gt;
     &lt;br /&gt;
     foreach my $field (@fields) {&lt;br /&gt;
     #Go through the array&lt;br /&gt;
     &lt;br /&gt;
         my @fieldparts=split(&amp;quot;:&amp;quot;,$field);&lt;br /&gt;
         #Split the fields on colon&lt;br /&gt;
         &lt;br /&gt;
         my $key=shift(@fieldparts);&lt;br /&gt;
         #Set the key&lt;br /&gt;
         &lt;br /&gt;
         $data-&amp;gt;{$key}=join(&amp;quot;:&amp;quot;,@fieldparts);&lt;br /&gt;
         #Write the hash entry&lt;br /&gt;
     }&lt;br /&gt;
     if (!$headerflag) {&lt;br /&gt;
         #If the headerflag is 0 then do this&lt;br /&gt;
         &lt;br /&gt;
         foreach my $key (sort {$a cmp $b} (keys %{$data})) {&lt;br /&gt;
         #Go through the keys&lt;br /&gt;
         &lt;br /&gt;
             print RESULTS $key.&amp;quot;\t&amp;quot;;&lt;br /&gt;
             #Write the key followed by a tab&lt;br /&gt;
         }&lt;br /&gt;
         print RESULTS &amp;quot;\n&amp;quot;;&lt;br /&gt;
         #Print a newline&lt;br /&gt;
         &lt;br /&gt;
         $headerflag=1;&lt;br /&gt;
         #Set the headerflag to 1&lt;br /&gt;
     }&lt;br /&gt;
     &lt;br /&gt;
     foreach my $key (sort {$a cmp $b} (keys %{$data})) {&lt;br /&gt;
         #Go through the keys again&lt;br /&gt;
         &lt;br /&gt;
         print RESULTS $data-&amp;gt;{$key}.&amp;quot;\t&amp;quot;;&lt;br /&gt;
         #This time print the data followed by tabs&lt;br /&gt;
     }&lt;br /&gt;
     print RESULTS &amp;quot;\n&amp;quot;;&lt;br /&gt;
     #print a newline&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 print &amp;quot;Thanks to Ed&amp;quot;;&lt;br /&gt;
 #Thank Ed.&lt;br /&gt;
&lt;br /&gt;
[[category:McNair Admin]]&lt;br /&gt;
[[admin_classification::Software Tutorial| ]]&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Software_Repository&amp;diff=6848</id>
		<title>Software Repository</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Software_Repository&amp;diff=6848"/>
		<updated>2016-07-18T16:06:20Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category: McNair Admin]]&lt;br /&gt;
&lt;br /&gt;
For a listing of all software tools and scripts see [[Software Repository Listing]].&lt;br /&gt;
&lt;br /&gt;
==Background==&lt;br /&gt;
So much software has been written by past computer science interns, with more on the way, that we felt the need for a source code management system, so that developers can work without fear of breaking production and facing Ed's wrath (you do not want that dude angry! Wherever you go, he will find you! No escape.).&lt;br /&gt;
&lt;br /&gt;
To enforce efficient source control we (Ed) chose to host our own git server on the RDP machine using [https://bonobogitserver.com/ Bonobo Git Server], which is open source and runs on the Windows IIS platform.&lt;br /&gt;
&lt;br /&gt;
Installing Bonobo git server is pretty simple:&lt;br /&gt;
* Download the zip file from the Bonobo website.&lt;br /&gt;
* Extract its contents. It should be a single folder containing directories like App_Data, bin etc.&lt;br /&gt;
* Rename that folder to anything you want. I used the name &amp;quot;codebase&amp;quot;.&lt;br /&gt;
* Copy the codebase folder to C:\inetpub\wwwroot\&lt;br /&gt;
* Allow IIS User to modify C:\inetpub\wwwroot\codebase\App_Data folder. To do so:&lt;br /&gt;
**select Properties of App_Data folder,&lt;br /&gt;
**go to Security tab,&lt;br /&gt;
**click edit,&lt;br /&gt;
**select IIS user (in my case IIS_IUSRS) and add Modify and Write permission,&lt;br /&gt;
**confirm these settings with Apply button.&lt;br /&gt;
*Convert ''codebase'' to Application in IIS&lt;br /&gt;
**Run IIS Manager and navigate to Sites -&amp;gt; Default Web Site. You should see Bonobo.Git.Server.&lt;br /&gt;
**Right click on 'codebase' and convert to application.&lt;br /&gt;
**Check if the selected application pool runs on .NET 4.0 and convert the site.&lt;br /&gt;
*Enable Anonymous Authentication in IIS and disable the others. To do so, select the application in the left pane, double-click on the authentication icon in the right pane and set the value of Anonymous Authentication to Enabled.&lt;br /&gt;
*Launch your browser and go to http://localhost/codebase. Now you can see the initial page of the Bonobo Git Server and everything should work.&lt;br /&gt;
**default credentials are ''username'': '''admin''', ''password'': '''admin'''&lt;br /&gt;
**[6-22-2016]: Can also use https://localhost/codebase, which is preferable because otherwise usernames and passwords are transmitted in plain text. The browser will show a security error because we have a self-signed certificate. This is OK if we are restricted to the intranet. If we want to allow public access, we probably need to get a certificate from a Certificate Authority like Verisign etc.&lt;br /&gt;
&lt;br /&gt;
==Our Git Server==&lt;br /&gt;
We have already set up the git server on the RDP machine. Here are the admin credentials:&lt;br /&gt;
*Username: '''boss'''&lt;br /&gt;
*Name: '''Ed'''&lt;br /&gt;
*Surname: '''Egan'''&lt;br /&gt;
*Email: '''Edward.Egan@rice.edu'''&lt;br /&gt;
*Password: '''you_seriously_thought_Id_write_that_in_here??'''&lt;br /&gt;
&lt;br /&gt;
To access this from your computer and not the RDP you can go to http://128.42.44.182/codebase where it will prompt you for your username and password.&lt;br /&gt;
**[6-22-2016]: Can also use https://128.42.44.182/codebase, which is preferable. The browser will show a security error because we have a self-signed certificate. This is OK if we are restricted to the intranet. If we want to allow public access, we probably need to get a certificate from a Certificate Authority like Verisign etc.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Our Git workflow==&lt;br /&gt;
We chose a simple git workflow.&lt;br /&gt;
&lt;br /&gt;
Our aim is not to break things in the master branch. All commits on the master should work.&lt;br /&gt;
&lt;br /&gt;
 1.&lt;br /&gt;
 When adding a new feature or fixing a bug, ALWAYS check out a new feature branch from master.&lt;br /&gt;
 NEVER check out a feature branch from next (see below). The feature branch should be named user/feature_name.&lt;br /&gt;
&lt;br /&gt;
 2.&lt;br /&gt;
 After feature development is complete merge your feature-branch into next.&lt;br /&gt;
&lt;br /&gt;
 3.&lt;br /&gt;
 The next branch is intended for testing and confirming things do not break. So, after feature branches are merged into next and conflicts resolved, we merge into master.&lt;br /&gt;
 After this, you can end the feature branches if you want.&lt;br /&gt;
&lt;br /&gt;
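As a rough sketch, the three steps above could look like the following on the command line. It assumes a throwaway local repository rather than the real codebase server; the branch name alice/fix_parser is hypothetical and the commits are empty placeholders standing in for real work.&lt;br /&gt;
&lt;br /&gt;
```shell
#!/bin/sh
# Sketch of the master / next / feature-branch workflow in a scratch repository.
# The branch name alice/fix_parser and the user details are made up for illustration.
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Name the initial branch master regardless of the local git default
git symbolic-ref HEAD refs/heads/master
git commit -q --allow-empty -m "Initial commit on master"

# Create the long-lived testing branch
git branch next

# 1. Start the feature branch from master, never from next
git checkout -q -b alice/fix_parser master
git commit -q --allow-empty -m "Fix the parser"

# 2. When the feature is complete, merge it into next for testing
git checkout -q next
git merge -q --no-ff -m "Merge alice/fix_parser into next" alice/fix_parser

# 3. Once next has been tested, merge it into master,
#    then optionally delete the finished feature branch
git checkout -q master
git merge -q --no-ff -m "Merge next into master" next
git branch -d alice/fix_parser

# Show the resulting history
git log --oneline --graph master
```
&lt;br /&gt;
The --no-ff option is a choice made for this sketch: it keeps an explicit merge commit for each step so the history shows when a feature reached next and master. Plain merges work too.&lt;br /&gt;
&lt;br /&gt;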
==Quick and dirty git tutorial==&lt;br /&gt;
 For a cool interactive tutorial see http://learngitbranching.js.org/.&lt;br /&gt;
&lt;br /&gt;
 ***&lt;br /&gt;
 You can also use SourceTree, which is a GUI for the git client. It is installed on the RDP.&lt;br /&gt;
 Like using git from the CLI (see below), SourceTree constructs the appropriate commands. The good thing&lt;br /&gt;
 is that it automatically adds the error checking and logging options to each command, which are difficult&lt;br /&gt;
 to recall from memory. SourceTree is freely available from Atlassian at https://www.sourcetreeapp.com/&lt;br /&gt;
 ***&lt;br /&gt;
&lt;br /&gt;
 To use SourceTree you should have a basic understanding of git (branches, commits etc). The interactive tutorial above is very good for this purpose.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*''Installing - '' Depending on your operating system you can install git in different ways:&lt;br /&gt;
** If you are a Windows or Mac user, you can simply download and install the latest release from the [https://git-scm.com/ git-scm website]&lt;br /&gt;
** If you use Ubuntu then all you need to do is type &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;sudo apt-get install git&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
* Check your installation by typing &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt; in a terminal or Windows PowerShell&lt;br /&gt;
*Basic git operations:&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 2em;&amp;quot;&amp;gt;&lt;br /&gt;
* to check out code from a remote repository, use the &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git clone&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt; command. This will create a local repository on your disk and download the source code of the project you wish to work on. Here's an example:&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 5em;&amp;quot;&amp;gt;&amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git clone http://128.42.44.182/codebase/Matcher.git&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* to update your local repository with others' work use the &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git pull&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt; command. It's always good practice to pull before you commit, to make sure that others' code doesn't break yours. Also, you cannot push to the remote unless your local branch is up to date with it. Committing on a stale local repository is fine, but you are likely to have more trouble merging your code with others' later on, because of the conflicts you'll face when you eventually pull. See example:&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 5em;&amp;quot;&amp;gt;&amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git pull&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* to commit your changes to your local repository use the &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git commit&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt; command. Committing your changes is an essential step whether you are adding/removing items from the repository or changing existing items. See example:&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 5em;&amp;quot;&amp;gt;&amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git commit -m &amp;quot;mandatory commit message&amp;quot;&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* to push your commits to the remote repository use the &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git push &amp;lt;remote&amp;gt; &amp;lt;branch&amp;gt;&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt; command, for example &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git push origin master&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt;. Whatever you need pushed to the server must be committed to your local repository first. Git pushes commits rather than individual files or folders; with no arguments, this command normally pushes the current branch to its configured upstream branch.&lt;br /&gt;
&lt;br /&gt;
* to add new files to your repository use the &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git add&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt; command. After that you must commit to ensure that your repository actually has the new file. See example:&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 5em;&amp;quot;&amp;gt;&amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git add &amp;lt;filename/folder name&amp;gt;&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* to remove items from your repository use the &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git rm&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt; command. This deletes the file from your working directory and stages the removal, so you then only need to commit to record the change in your repository. Finally, you push to the server to make sure the server has those items removed as well and that nobody in your team works under the assumption that those items are still there. See example:&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 5em;&amp;quot;&amp;gt;&amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git rm &amp;lt;filename&amp;gt;&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 2em;&amp;quot;&amp;gt;&lt;br /&gt;
''Note'': if removing a non-empty folder use the -r flag to recursively remove all contents of that folder as well:&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 5em;&amp;quot;&amp;gt;&amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git rm -r &amp;lt;foldername&amp;gt;&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[admin_classification::IT Build| ]]&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Software_Repository&amp;diff=6840</id>
		<title>Software Repository</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Software_Repository&amp;diff=6840"/>
		<updated>2016-07-18T15:55:41Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category: McNair Admin]]&lt;br /&gt;
&lt;br /&gt;
For a listing of all software tools and scripts see [[Resources and Tools]].&lt;br /&gt;
&lt;br /&gt;
==Background==&lt;br /&gt;
So much software has been written by past computer science interns, with more on the way, that we felt the need for a source code management system, so that developers can work without fear of breaking production and facing Ed's wrath (you do not want that dude angry! Wherever you go, he will find you! No escape.).&lt;br /&gt;
&lt;br /&gt;
To enforce efficient source control we (Ed) chose to host our own git server on the RDP machine using [https://bonobogitserver.com/ Bonobo Git Server], which is open source and runs on the Windows IIS platform.&lt;br /&gt;
&lt;br /&gt;
Installing Bonobo git server is pretty simple:&lt;br /&gt;
* Download the zip file from the Bonobo website.&lt;br /&gt;
* Extract its contents. It should be a single folder containing directories like App_Data, bin etc.&lt;br /&gt;
* Rename that folder to anything you want. I used the name &amp;quot;codebase&amp;quot;.&lt;br /&gt;
* Copy the codebase folder to C:\inetpub\wwwroot\&lt;br /&gt;
* Allow IIS User to modify C:\inetpub\wwwroot\codebase\App_Data folder. To do so:&lt;br /&gt;
**select Properties of App_Data folder,&lt;br /&gt;
**go to Security tab,&lt;br /&gt;
**click edit,&lt;br /&gt;
**select IIS user (in my case IIS_IUSRS) and add Modify and Write permission,&lt;br /&gt;
**confirm these settings with Apply button.&lt;br /&gt;
*Convert ''codebase'' to Application in IIS&lt;br /&gt;
**Run IIS Manager and navigate to Sites -&amp;gt; Default Web Site. You should see Bonobo.Git.Server.&lt;br /&gt;
**Right click on 'codebase' and convert to application.&lt;br /&gt;
**Check if the selected application pool runs on .NET 4.0 and convert the site.&lt;br /&gt;
*Enable Anonymous Authentication in IIS and disable the others. To do so, select the application in the left pane, double-click on the authentication icon in the right pane and set the value of Anonymous Authentication to Enabled.&lt;br /&gt;
*Launch your browser and go to http://localhost/codebase. Now you can see the initial page of the Bonobo Git Server and everything should work.&lt;br /&gt;
**default credentials are ''username'': '''admin''', ''password'': '''admin'''&lt;br /&gt;
**[6-22-2016]: Can also use https://localhost/codebase, which is preferable because otherwise usernames and passwords are transmitted in plain text. The browser will show a security error because we have a self-signed certificate. This is OK if we are restricted to the intranet. If we want to allow public access, we probably need to get a certificate from a Certificate Authority like Verisign etc.&lt;br /&gt;
&lt;br /&gt;
==Our Git Server==&lt;br /&gt;
We have already set up the git server on the RDP machine. Here are the admin credentials:&lt;br /&gt;
*Username: '''boss'''&lt;br /&gt;
*Name: '''Ed'''&lt;br /&gt;
*Surname: '''Egan'''&lt;br /&gt;
*Email: '''Edward.Egan@rice.edu'''&lt;br /&gt;
*Password: '''you_seriously_thought_Id_write_that_in_here??'''&lt;br /&gt;
&lt;br /&gt;
To access this from your computer and not the RDP you can go to http://128.42.44.182/codebase where it will prompt you for your username and password.&lt;br /&gt;
**[6-22-2016]: Can also use https://128.42.44.182/codebase, which is preferable. The browser will show a security error because we have a self-signed certificate. This is OK if we are restricted to the intranet. If we want to allow public access, we probably need to get a certificate from a Certificate Authority like Verisign etc.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Our Git workflow==&lt;br /&gt;
We chose a simple git workflow.&lt;br /&gt;
&lt;br /&gt;
Our aim is not to break things in the master branch. All commits on the master should work.&lt;br /&gt;
&lt;br /&gt;
 1.&lt;br /&gt;
 When adding a new feature or fixing a bug, ALWAYS check out a new feature branch from the master.&lt;br /&gt;
 NEVER checkout a feature branch from next (see below). The feature branch should be named user/feature_name. &lt;br /&gt;
&lt;br /&gt;
 2.&lt;br /&gt;
 After feature development is complete merge your feature-branch into next.&lt;br /&gt;
&lt;br /&gt;
 3.&lt;br /&gt;
 The next branch is intended for testing and confirming things do not break. So, after feature branches are merged into next and conflicts resolved, we merge into master.&lt;br /&gt;
 After this, you can end the feature branches if you want.&lt;br /&gt;
&lt;br /&gt;
==Quick and dirty git tutorial==&lt;br /&gt;
 For a cool interactive tutorial see http://learngitbranching.js.org/.&lt;br /&gt;
&lt;br /&gt;
 ***&lt;br /&gt;
 You can also use SourceTree, which is a GUI client for git. It is installed on the RDP.&lt;br /&gt;
 Like using git from the CLI (see below), SourceTree constructs the appropriate commands. The good thing&lt;br /&gt;
 is that it automatically adds the error-check/logging options to each command, which are difficult&lt;br /&gt;
 to recall from memory. SourceTree is freely available from Atlassian at https://www.sourcetreeapp.com/&lt;br /&gt;
 ***&lt;br /&gt;
&lt;br /&gt;
 To use SourceTree you should have a basic understanding of git (branches, commits, etc.). The interactive tutorial above is very good for this purpose.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*''Installing - '' Depending on your operating system you can install git in one of the following ways:&lt;br /&gt;
** If you are a Windows or Mac user, you can simply download &amp;amp; install the latest release from the [https://git-scm.com/ git-scm website]&lt;br /&gt;
** If you use Ubuntu, all you need to do is type &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;sudo apt-get install git&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
* Check your installation by typing &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt; in a terminal or Windows PowerShell&lt;br /&gt;
*Basic git operations:&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 2em;&amp;quot;&amp;gt;&lt;br /&gt;
* to check out code from a remote repository, use the &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git clone&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt; command. This will create a local repository on your disk as well as download the source code of the project you wish to work on. Here's an example:&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 5em;&amp;quot;&amp;gt;&amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git clone http://128.42.44.182/codebase/Matcher.git&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* to update your repository to include others' work in your project use the &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git pull&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt; command. It's always good practice to update your code before you commit, to ensure that others' code doesn't break yours. Also, you cannot push to the remote unless your local branch is up to date with it. Committing on a stale local repository is fine, but you are likely to face more conflicts when you eventually update and merge your code with others'. See example:&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 5em;&amp;quot;&amp;gt;&amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git pull&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* to commit your changes to your local repository use the &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git commit&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt; command. Committing your changes is an essential step whether you are adding/removing items from the repository or changing existing items. See example :&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 5em;&amp;quot;&amp;gt;&amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git commit -m &amp;quot;mandatory commit message&amp;quot;&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* to push your changes to the remote repository use the &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git push &amp;lt;optional remote&amp;gt; &amp;lt;optional branch&amp;gt;&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt; command. Whatever you need pushed to the server must be committed to your local repository first. Note that git pushes the commits on a branch, not individual files or folders; with no arguments it pushes the current branch to its configured upstream.&lt;br /&gt;
&lt;br /&gt;
* to add new files to your repository use the &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git add&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt; command. After that you must commit to ensure that your repository actually has the new file. See example:&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 5em;&amp;quot;&amp;gt;&amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git add &amp;lt;filename/folder name&amp;gt;&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* to remove items from your repository use the &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git rm&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt; command. It deletes the file from your working directory and stages the removal; after that, commit to ensure that your repository actually has the change persisted. Finally, push to the server to make sure the server has those items removed as well and that nobody in your team works under the assumption that those items are still there. See example:&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 5em;&amp;quot;&amp;gt;&amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git rm &amp;lt;filename&amp;gt;&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 2em;&amp;quot;&amp;gt;&lt;br /&gt;
''Note'': if removing a non-empty folder use the -r flag to recursively remove all contents of that folder as well:&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 5em;&amp;quot;&amp;gt;&amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git rm -r &amp;lt;foldername&amp;gt;&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[admin_classification::IT Build| ]]&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Software_Repository&amp;diff=6838</id>
		<title>Software Repository</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Software_Repository&amp;diff=6838"/>
		<updated>2016-07-18T15:53:51Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category: McNair Admin]]&lt;br /&gt;
&lt;br /&gt;
For a listing of all software tools and scripts see [[Software Tools Listing]].&lt;br /&gt;
&lt;br /&gt;
==Background==&lt;br /&gt;
Given the amount of software that has been written by past computer science interns and more being written, we felt the need to have some kind of source code management system put into place so that developers can work without ever being in fear of breaking production and facing Ed's wrath (you do not want that dude angry! Wherever you go, he will find you! No escape.).&lt;br /&gt;
&lt;br /&gt;
To enforce efficient source control we (Ed) chose to host our own git server on the RDP machine using [https://bonobogitserver.com/ Bonobo Git Server], which is open source and runs on the Windows IIS platform.&lt;br /&gt;
&lt;br /&gt;
Installing Bonobo git server is pretty simple:&lt;br /&gt;
* download the zip file from the Bonobo website.&lt;br /&gt;
* extract its contents. It should be a single folder containing directories like App_Data, bin etc.&lt;br /&gt;
* rename that folder to anything you want. I used the name &amp;quot;codebase&amp;quot;&lt;br /&gt;
* copy the codebase folder to C:\inetpub\wwwroot\&lt;br /&gt;
* Allow IIS User to modify C:\inetpub\wwwroot\codebase\App_Data folder. To do so:&lt;br /&gt;
**select Properties of App_Data folder,&lt;br /&gt;
**go to Security tab,&lt;br /&gt;
**click edit,&lt;br /&gt;
**select IIS user (in my case IIS_IUSRS) and add Modify and Write permission,&lt;br /&gt;
**confirm these settings with Apply button.&lt;br /&gt;
*Convert ''codebase'' to Application in IIS&lt;br /&gt;
**Run IIS Manager and navigate to Sites -&amp;gt; Default Web Site. You should see Bonobo.Git.Server.&lt;br /&gt;
**Right click on 'codebase' and convert to application.&lt;br /&gt;
**Check if the selected application pool runs on .NET 4.0 and convert the site.&lt;br /&gt;
*Enable Anonymous Authentication in IIS and disable the others. To do so, select the application in the left pane, double-click on the authentication icon in the right pane and set the value of Anonymous Authentication to Enabled.&lt;br /&gt;
*Launch your browser and go to http://localhost/codebase. Now you can see the initial page of the Bonobo Git Server and everything should work.&lt;br /&gt;
**default credentials are ''username'': '''admin''', ''password'': '''admin'''&lt;br /&gt;
**[6-22-2016]: Can also use https://localhost/codebase, which is preferable; otherwise usernames/passwords are transmitted in plain text. The browser will show a security error because we have a self-signed certificate. This is OK if we are restricted to the intranet. If we want to allow public access, we probably need to get a certificate from a Certificate Authority like Verisign etc.&lt;br /&gt;
&lt;br /&gt;
==Our Git Server==&lt;br /&gt;
We have already set up the git server on the RDP machine. Here are the admin credentials:&lt;br /&gt;
*Username: '''boss'''&lt;br /&gt;
*Name: '''Ed'''&lt;br /&gt;
*Surname: '''Egan'''&lt;br /&gt;
*Email: '''Edward.Egan@rice.edu'''&lt;br /&gt;
*Password: '''you_seriously_thought_Id_write_that_in_here??'''&lt;br /&gt;
&lt;br /&gt;
To access this from your computer and not the RDP you can go to http://128.42.44.182/codebase where it will prompt you for your username and password.&lt;br /&gt;
**[6-22-2016]: Can also use https://128.42.44.182/codebase, which is preferable. The browser will show a security error because we have a self-signed certificate. This is OK if we are restricted to the intranet. If we want to allow public access, we probably need to get a certificate from a Certificate Authority like Verisign etc.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Our Git workflow==&lt;br /&gt;
We chose a simple git workflow.&lt;br /&gt;
&lt;br /&gt;
Our aim is not to break things in the master branch. All commits on the master should work.&lt;br /&gt;
&lt;br /&gt;
 1.&lt;br /&gt;
 When adding a new feature or fixing a bug, ALWAYS check out a new feature branch from the master.&lt;br /&gt;
 NEVER checkout a feature branch from next (see below). The feature branch should be named user/feature_name. &lt;br /&gt;
&lt;br /&gt;
 2.&lt;br /&gt;
 After feature development is complete merge your feature-branch into next.&lt;br /&gt;
&lt;br /&gt;
 3.&lt;br /&gt;
 The next branch is intended for testing and confirming things do not break. So, after feature branches are merged into next and conflicts resolved, we merge into master.&lt;br /&gt;
 After this, you can end the feature branches if you want.&lt;br /&gt;
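
The three steps above can be sketched as a short shell session. This is only an illustration in a throwaway repository: the branch name shoeb/speedup and the commit message are hypothetical, and it assumes that "merge into master" in step 3 means merging next itself.&lt;br /&gt;

```shell
#!/bin/sh
# Sketch of the master / next / feature-branch workflow in a temporary repository.
# Assumed for illustration only: branch name shoeb/speedup and the commit messages.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master     # make sure the first branch is called master
git config user.name "Example User"
git config user.email "user@example.com"
git commit -q --allow-empty -m "Initial commit on master"
git branch next                             # integration branch used for testing

# 1. Start the feature branch from master, never from next.
git checkout -q -b shoeb/speedup master
git commit -q --allow-empty -m "Speed up fuzzy matching"

# 2. Merge the finished feature branch into next and test there.
git checkout -q next
git merge -q shoeb/speedup

# 3. Once next is confirmed to work, merge it into master.
git checkout -q master
git merge -q next

# Optionally end the feature branch.
git branch -q -d shoeb/speedup
git --no-pager log --oneline --graph
```

Empty commits are used only so the sketch does not need any files; in real work each commit would carry actual changes.&lt;br /&gt;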
&lt;br /&gt;
==Quick and dirty git tutorial==&lt;br /&gt;
 For a cool interactive tutorial see http://learngitbranching.js.org/.&lt;br /&gt;
&lt;br /&gt;
 ***&lt;br /&gt;
 You can also use SourceTree, which is a GUI client for git. It is installed on the RDP.&lt;br /&gt;
 Like using git from the CLI (see below), SourceTree constructs the appropriate commands. The good thing&lt;br /&gt;
 is that it automatically adds the error-check/logging options to each command, which are difficult&lt;br /&gt;
 to recall from memory. SourceTree is freely available from Atlassian at https://www.sourcetreeapp.com/&lt;br /&gt;
 ***&lt;br /&gt;
&lt;br /&gt;
 To use SourceTree you should have a basic understanding of git (branches, commits, etc.). The interactive tutorial above is very good for this purpose.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*''Installing - '' Depending on your operating system you can install git in one of the following ways:&lt;br /&gt;
** If you are a Windows or Mac user, you can simply download &amp;amp; install the latest release from the [https://git-scm.com/ git-scm website]&lt;br /&gt;
** If you use Ubuntu, all you need to do is type &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;sudo apt-get install git&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
* Check your installation by typing &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt; in a terminal or Windows PowerShell&lt;br /&gt;
*Basic git operations:&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 2em;&amp;quot;&amp;gt;&lt;br /&gt;
* to check out code from a remote repository, use the &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git clone&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt; command. This will create a local repository on your disk as well as download the source code of the project you wish to work on. Here's an example:&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 5em;&amp;quot;&amp;gt;&amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git clone http://128.42.44.182/codebase/Matcher.git&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* to update your repository to include others' work in your project use the &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git pull&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt; command. It's always good practice to update your code before you commit, to ensure that others' code doesn't break yours. Also, you cannot push to the remote unless your local branch is up to date with it. Committing on a stale local repository is fine, but you are likely to face more conflicts when you eventually update and merge your code with others'. See example:&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 5em;&amp;quot;&amp;gt;&amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git pull&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* to commit your changes to your local repository use the &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git commit&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt; command. Committing your changes is an essential step whether you are adding/removing items from the repository or changing existing items. See example :&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 5em;&amp;quot;&amp;gt;&amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git commit -m &amp;quot;mandatory commit message&amp;quot;&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* to push your changes to the remote repository use the &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git push &amp;lt;optional remote&amp;gt; &amp;lt;optional branch&amp;gt;&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt; command. Whatever you need pushed to the server must be committed to your local repository first. Note that git pushes the commits on a branch, not individual files or folders; with no arguments it pushes the current branch to its configured upstream.&lt;br /&gt;
&lt;br /&gt;
* to add new files to your repository use the &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git add&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt; command. After that you must commit to ensure that your repository actually has the new file. See example:&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 5em;&amp;quot;&amp;gt;&amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git add &amp;lt;filename/folder name&amp;gt;&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* to remove items from your repository use the &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git rm&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt; command. It deletes the file from your working directory and stages the removal; after that, commit to ensure that your repository actually has the change persisted. Finally, push to the server to make sure the server has those items removed as well and that nobody in your team works under the assumption that those items are still there. See example:&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 5em;&amp;quot;&amp;gt;&amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git rm &amp;lt;filename&amp;gt;&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 2em;&amp;quot;&amp;gt;&lt;br /&gt;
''Note'': if removing a non-empty folder use the -r flag to recursively remove all contents of that folder as well:&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 5em;&amp;quot;&amp;gt;&amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git rm -r &amp;lt;foldername&amp;gt;&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
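
The basic operations can be tried end to end without touching the real server. The sketch below is an assumption-laden illustration: a local bare repository stands in for the remote, and the file names README.txt and docs/notes.txt are made up. It uses the actual git subcommands clone, add, commit, push, pull and rm.&lt;br /&gt;

```shell
#!/bin/sh
# Sketch: clone, add, commit, push, pull and rm against a throwaway local "remote".
# Assumed for illustration only: the bare repository path and the file names.
set -e
work=$(mktemp -d)
git init -q --bare "$work/remote.git"           # stands in for the server repository
git clone -q "$work/remote.git" "$work/local"   # creates the local repository
cd "$work/local"
git symbolic-ref HEAD refs/heads/master         # work on a branch called master
git config user.name "Example User"
git config user.email "user@example.com"

# Add new files, commit them locally, then push the commit to the remote.
mkdir docs
touch README.txt docs/notes.txt
git add README.txt docs
git commit -q -m "Add README and notes"
git push -q origin master

# Bring in others' work before continuing (nothing new here, so this is a no-op).
git pull -q origin master

# Remove a folder recursively, commit the removal and push it.
git rm -q -r docs
git commit -q -m "Remove docs folder"
git push -q origin master
git ls-files
```

Against the real server, the clone URL from the example above would replace the local bare repository path.&lt;br /&gt;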
&lt;br /&gt;
[[admin_classification::IT Build| ]]&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Research_Plans)&amp;diff=6837</id>
		<title>Shoeb Mohammed (Research Plans)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Research_Plans)&amp;diff=6837"/>
		<updated>2016-07-18T15:51:45Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Work Log]]&lt;br /&gt;
[[Shoeb Mohammed]] [[Research Plans]]&lt;br /&gt;
&lt;br /&gt;
====Short Term====&lt;br /&gt;
* Create a listing on the wiki for all software developed at McNair center.&lt;br /&gt;
* Build a Linux box to run the crawler.&lt;br /&gt;
&lt;br /&gt;
====Long Term====&lt;br /&gt;
* Optimize/re-design the 'Matcher' software. In particular, speed-up fuzzy matching and possibly re-structure the code to make usage easier.&lt;br /&gt;
* Develop the crawler. Try to begin with code that Dan has.&lt;br /&gt;
&lt;br /&gt;
====Side Tasks====&lt;br /&gt;
* If possible, redo the patent parser (previously coded by Kranti) to also pull in Patent Citation data.&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Research_Plans)&amp;diff=6784</id>
		<title>Shoeb Mohammed (Research Plans)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Research_Plans)&amp;diff=6784"/>
		<updated>2016-07-15T20:20:25Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: /* Short Term */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Work Log]]&lt;br /&gt;
[[Shoeb Mohammed]] [[Research Plans]] [[Shoeb Mohammed (Research Plans)|(Plan Page)]]&lt;br /&gt;
&lt;br /&gt;
====Short Term====&lt;br /&gt;
* Create a listing on the wiki for all software developed at McNair center.&lt;br /&gt;
* Build a Linux box to run the crawler.&lt;br /&gt;
&lt;br /&gt;
====Long Term====&lt;br /&gt;
* Optimize/re-design the 'Matcher' software. In particular, speed-up fuzzy matching and possibly re-structure the code to make usage easier.&lt;br /&gt;
* Develop the crawler. Try to begin with code that Dan has.&lt;br /&gt;
&lt;br /&gt;
====Side Tasks====&lt;br /&gt;
* If possible, redo the patent parser (previously coded by Kranti) to also pull in Patent Citation data.&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Research_Plans)&amp;diff=6783</id>
		<title>Shoeb Mohammed (Research Plans)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Research_Plans)&amp;diff=6783"/>
		<updated>2016-07-15T20:19:17Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Work Log]]&lt;br /&gt;
[[Shoeb Mohammed]] [[Research Plans]] [[Shoeb Mohammed (Research Plans)|(Plan Page)]]&lt;br /&gt;
&lt;br /&gt;
====Short Term====&lt;br /&gt;
* Create a listing on the wiki for all software developed at McNair center.&lt;br /&gt;
&lt;br /&gt;
====Long Term====&lt;br /&gt;
* Optimize/re-design the 'Matcher' software. In particular, speed-up fuzzy matching and possibly re-structure the code to make usage easier.&lt;br /&gt;
* Develop the crawler. Try to begin with code that Dan has.&lt;br /&gt;
&lt;br /&gt;
====Side Tasks====&lt;br /&gt;
* If possible, redo the patent parser (previously coded by Kranti) to also pull in Patent Citation data.&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Research_Plans)&amp;diff=6781</id>
		<title>Shoeb Mohammed (Research Plans)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Research_Plans)&amp;diff=6781"/>
		<updated>2016-07-15T20:15:21Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Work Log]]&lt;br /&gt;
[[Shoeb Mohammed]] [[Research Plans]] [[Shoeb Mohammed (Research Plans)|(Plan Page)]]&lt;br /&gt;
&lt;br /&gt;
*07/15/2016 &lt;br /&gt;
** Optimize/re-design the 'Matcher' software. In particular, speed-up fuzzy matching and possibly re-structure the code to make usage easier.&lt;br /&gt;
** Develop the crawler. Try to begin with code that Dan has.&lt;br /&gt;
** Create a listing on the wiki for all software developed at McNair center.&lt;br /&gt;
** If possible, redo the patent parser (previously coded by Kranti) to also pull in Patent Citation data.&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Research_Plans)&amp;diff=6780</id>
		<title>Shoeb Mohammed (Research Plans)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Research_Plans)&amp;diff=6780"/>
		<updated>2016-07-15T20:14:35Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Work Log]]&lt;br /&gt;
[[Shoeb Mohammed]] [[Research Plans]] [[Shoeb Mohammed (Research Plans)|(Plan Page)]]&lt;br /&gt;
&lt;br /&gt;
*07/15/2016 &lt;br /&gt;
** Optimize/re-design the 'Matcher' software. In particular, speed-up fuzzy matching and possibly re-structure the code to make usage easier.&lt;br /&gt;
** Develop the crawler starting with code from Dan.&lt;br /&gt;
** Create a listing on the wiki for all software developed at McNair center.&lt;br /&gt;
** If possible, redo the patent parser (previously coded by Kranti) to also pull in Patent Citation data.&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Research_Plans&amp;diff=6620</id>
		<title>Research Plans</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Research_Plans&amp;diff=6620"/>
		<updated>2016-07-15T16:18:10Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category: McNair Admin]]&lt;br /&gt;
[[admin_classification::General Information| ]]&lt;br /&gt;
==Dylan Dickens==&lt;br /&gt;
{{:Dylan Dickens (Research Plan)}}&lt;br /&gt;
&lt;br /&gt;
==Ben Baldazo==&lt;br /&gt;
{{:Ben Baldazo (Research Plans)}}&lt;br /&gt;
&lt;br /&gt;
==Jake Silberman==&lt;br /&gt;
{{:Jake Silberman (Research Plan)}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Shoeb Mohammed==&lt;br /&gt;
{{:Shoeb Mohammed (Research Plans)}}&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Research_Plans)&amp;diff=6618</id>
		<title>Shoeb Mohammed (Research Plans)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Research_Plans)&amp;diff=6618"/>
		<updated>2016-07-15T16:17:33Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Work Log]]&lt;br /&gt;
[[Shoeb Mohammed]] [[Research Plans]] [[Shoeb Mohammed (Research Plans)|(Plan Page)]]&lt;br /&gt;
&lt;br /&gt;
*07/15/2016 &lt;br /&gt;
**&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Research_Plans)&amp;diff=6617</id>
		<title>Shoeb Mohammed (Research Plans)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Research_Plans)&amp;diff=6617"/>
		<updated>2016-07-15T16:14:24Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: Created page with &amp;quot;Category:Work Log Shoeb Mohammed Research Plans (Plan Page)  *07/15/2016  **&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Work Log]]&lt;br /&gt;
[[Shoeb Mohammed]] [[Research Plans]] [[Shoeb Mohammed (Research Plan)|(Plan Page)]]&lt;br /&gt;
&lt;br /&gt;
*07/15/2016 &lt;br /&gt;
**&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Work_Logs&amp;diff=6615</id>
		<title>Work Logs</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Work_Logs&amp;diff=6615"/>
		<updated>2016-07-15T16:11:53Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category: McNair Admin]]&lt;br /&gt;
==Dylan Dickens==&lt;br /&gt;
{{:Dylan Dickens (Work Log)}}&lt;br /&gt;
&lt;br /&gt;
==Jake Silberman==&lt;br /&gt;
{{:Jake Silberman (Work Log)}}&lt;br /&gt;
&lt;br /&gt;
==Marcela Interiano==&lt;br /&gt;
{{:Marcela Interiano (Work Log)}}&lt;br /&gt;
&lt;br /&gt;
==Veeral Shah==&lt;br /&gt;
{{:Veeral Shah (Work Log)}}&lt;br /&gt;
&lt;br /&gt;
==Ariel Sun==&lt;br /&gt;
{{:Ariel Sun (Work Log)}}&lt;br /&gt;
&lt;br /&gt;
==Gunny Liu==&lt;br /&gt;
{{:Gunny Liu (Work Log)}}&lt;br /&gt;
&lt;br /&gt;
[[admin_classification::General Information| ]]&lt;br /&gt;
&lt;br /&gt;
==Ben Baldazo==&lt;br /&gt;
&lt;br /&gt;
{{:Ben Baldazo (Work Log)}}&lt;br /&gt;
&lt;br /&gt;
==Shoeb Mohammed==&lt;br /&gt;
{{:Shoeb Mohammed (Work Log)}}&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Work_Log)&amp;diff=6594</id>
		<title>Shoeb Mohammed (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Work_Log)&amp;diff=6594"/>
		<updated>2016-07-15T16:06:14Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Work Log]]&lt;br /&gt;
[[Shoeb Mohammed]] [[Work Logs]] [[Shoeb Mohammed (Work Log) | (Work Log)]]&lt;br /&gt;
&lt;br /&gt;
*07/15/2016 &lt;br /&gt;
**Started work log and research plan.&lt;br /&gt;
**Review and profile the Matcher perl code.&lt;br /&gt;
**Read tutorials on using Selenium with Perl. The links below are a good starting point:&lt;br /&gt;
  https://www.usenix.org/system/files/login/articles/login_apr15_09_blank-edelman.pdf&lt;br /&gt;
  https://www.usenix.org/system/files/login/articles/blank-edelman_0.pdf&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Work_Log)&amp;diff=6591</id>
		<title>Shoeb Mohammed (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Work_Log)&amp;diff=6591"/>
		<updated>2016-07-15T16:05:56Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Work Log]]&lt;br /&gt;
[[Shoeb Mohammed]] [[Work Logs]] [[Shoeb Mohammed (Work Log) | (Work Log)]]&lt;br /&gt;
&lt;br /&gt;
*07/15/2016 &lt;br /&gt;
**Started work log and research plan.&lt;br /&gt;
**Review and profile the Matcher perl code.&lt;br /&gt;
**Read tutorials on using Selenium with Perl. These links are a good starting point:&lt;br /&gt;
  https://www.usenix.org/system/files/login/articles/login_apr15_09_blank-edelman.pdf&lt;br /&gt;
  https://www.usenix.org/system/files/login/articles/blank-edelman_0.pdf&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Work_Log)&amp;diff=6587</id>
		<title>Shoeb Mohammed (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Work_Log)&amp;diff=6587"/>
		<updated>2016-07-15T16:05:34Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Work Log]]&lt;br /&gt;
[[Shoeb Mohammed]] [[Work Logs]] [[Shoeb Mohammed (Work Log) | (Work Log)]]&lt;br /&gt;
&lt;br /&gt;
*07/15/2016 &lt;br /&gt;
**Started work log and research plan.&lt;br /&gt;
**Review and profile the Matcher perl code.&lt;br /&gt;
**Read about using Selenium with Perl. These links are a good starting point:&lt;br /&gt;
  https://www.usenix.org/system/files/login/articles/login_apr15_09_blank-edelman.pdf&lt;br /&gt;
  https://www.usenix.org/system/files/login/articles/blank-edelman_0.pdf&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Work_Log)&amp;diff=6495</id>
		<title>Shoeb Mohammed (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Shoeb_Mohammed_(Work_Log)&amp;diff=6495"/>
		<updated>2016-07-15T15:56:26Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: Created page with &amp;quot;Category:Work Log Shoeb Mohammed Work Logs  (Work Log)  *07/15/2016  **Started work log and research plan.&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Work Log]]&lt;br /&gt;
[[Shoeb Mohammed]] [[Work Logs]] [[Shoeb Mohammed (Work Log) | (Work Log)]]&lt;br /&gt;
&lt;br /&gt;
*07/15/2016 &lt;br /&gt;
**Started work log and research plan.&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Whois_Parser&amp;diff=4790</id>
		<title>Whois Parser</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Whois_Parser&amp;diff=4790"/>
		<updated>2016-07-11T19:08:50Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Internal]]&lt;br /&gt;
[[Internal Classification::Legacy| ]]&lt;br /&gt;
This wiki page is under Additional Links/WhoisParser&lt;br /&gt;
&lt;br /&gt;
The whoisParser was written by Kunal Shah on March 20, 2016, and is located at:&lt;br /&gt;
 repository: Web_Crawler&lt;br /&gt;
 branch: shoeb_patch/whoisParser&lt;br /&gt;
 directory: /WhoIsParser&lt;br /&gt;
 file: whoisParser.pl&lt;br /&gt;
&lt;br /&gt;
To use this parser, copy the Perl program above into a directory, make that directory the current working directory (use the 'cd' command if needed), and run the following command. The directory must also contain the input file (see below).&lt;br /&gt;
&lt;br /&gt;
perl whoisParser.pl -file=listofurls.txt -outfile=listofurls_processed.txt&lt;br /&gt;
&lt;br /&gt;
= NAME =&lt;br /&gt;
&lt;br /&gt;
WhoIs Parser - Retrieves and parses Whois information&lt;br /&gt;
Specifically, takes a file with a column of domain names and populates the&lt;br /&gt;
corresponding columns with information from the WhoIs API.&lt;br /&gt;
&lt;br /&gt;
= SYNOPSIS =&lt;br /&gt;
&lt;br /&gt;
perl whoisParser.pl -file=&amp;lt;file&amp;gt; [-outfile=&amp;lt;file&amp;gt;] &lt;br /&gt;
&lt;br /&gt;
= OPTIONS =&lt;br /&gt;
&lt;br /&gt;
    -file=&amp;lt;file&amp;gt;:           Name of file of domain names. &lt;br /&gt;
    -outfile=&amp;lt;file&amp;gt;:        The name of the outfile &lt;br /&gt;
    -h:                     Display help&lt;br /&gt;
&lt;br /&gt;
= USAGE &amp;amp; FEATURES =&lt;br /&gt;
&lt;br /&gt;
'''Arguments:''' &lt;br /&gt;
&lt;br /&gt;
A text file with a column of domain names&lt;br /&gt;
&lt;br /&gt;
'''Returns:''' &lt;br /&gt;
&lt;br /&gt;
A text file containing the domain names, with the next 12 columns populated with information pulled from the Whois API. A header naming each column is inserted as the first row of the file. The output columns are:&lt;br /&gt;
&lt;br /&gt;
1. Domain Name&lt;br /&gt;
&lt;br /&gt;
2. Creation Date&lt;br /&gt;
&lt;br /&gt;
3. Expiration Date&lt;br /&gt;
&lt;br /&gt;
4. Update Date&lt;br /&gt;
&lt;br /&gt;
5. Registrant Name&lt;br /&gt;
&lt;br /&gt;
6. Registrant Street&lt;br /&gt;
&lt;br /&gt;
7. Registrant City&lt;br /&gt;
&lt;br /&gt;
8. Registrant Postal Code&lt;br /&gt;
&lt;br /&gt;
9. Registrant Country&lt;br /&gt;
&lt;br /&gt;
10. Admin Street&lt;br /&gt;
&lt;br /&gt;
11. Admin City&lt;br /&gt;
&lt;br /&gt;
12. Admin Postal Code&lt;br /&gt;
&lt;br /&gt;
13. Admin Country&lt;br /&gt;
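A minimal Python sketch of how the output file could be read back for further processing. It assumes the file is tab-separated with the header row described above, as the sample output on this page suggests; the function name read_whois_output is hypothetical and is not part of whoisParser.pl.&lt;br /&gt;

```python
import csv
import io

# Column names copied from the list above; the first row of the output file is this header.
COLUMNS = [
    "Domain Name", "Creation Date", "Expiration Date", "Update Date",
    "Registrant Name", "Registrant Street", "Registrant City",
    "Registrant Postal Code", "Registrant Country",
    "Admin Street", "Admin City", "Admin Postal Code", "Admin Country",
]

def read_whois_output(handle):
    """Return one dict per data row, padding short rows with empty strings."""
    # QUOTE_NONE: address fields may contain stray quote characters (assumption).
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    rows = []
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        padded = fields + [""] * (len(header) - len(fields))
        rows.append(dict(zip(header, padded)))
    return rows

# Two rows in the same layout as the sample output on this page.
sample = "\n".join([
    "\t".join(COLUMNS),
    "http://2nd.md/\t2010-11-17\t2017-11-17",
    "http://none yet",
]) + "\n"
rows = read_whois_output(io.StringIO(sample))
print(rows[0]["Creation Date"])
```

Rows for which no Whois data was found carry only the domain name, so the sketch pads them to the full column count rather than dropping them.&lt;br /&gt;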
&lt;br /&gt;
= BUGS &amp;amp; FEEDBACK =&lt;br /&gt;
&lt;br /&gt;
Worked as expected on all example files. Please report any discovered bugs to Kunal.&lt;br /&gt;
&lt;br /&gt;
Tested files:&lt;br /&gt;
Input: example_file.txt&lt;br /&gt;
&lt;br /&gt;
Output: example_outfile.txt&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Input Text: &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:input.jpg|400px|thumb|right|WhoIs input file in Excel]]&lt;br /&gt;
&lt;br /&gt;
http://1986ventures.com&lt;br /&gt;
&lt;br /&gt;
http://2nd.md/&lt;br /&gt;
&lt;br /&gt;
http://www.2ndsquare.com&lt;br /&gt;
&lt;br /&gt;
http://www.32nddegree.com/&lt;br /&gt;
&lt;br /&gt;
http://www.80legs.com&lt;br /&gt;
&lt;br /&gt;
http://hotmailpasswordsupportnumber.info/&lt;br /&gt;
&lt;br /&gt;
http://www.MidtownDelivery.com&lt;br /&gt;
&lt;br /&gt;
http://accreu.com&lt;br /&gt;
&lt;br /&gt;
http://www.actionfigurelabs.com&lt;br /&gt;
&lt;br /&gt;
https://m.facebook.com/AddictivePerformance99&lt;br /&gt;
&lt;br /&gt;
http://www.additech.com/&lt;br /&gt;
&lt;br /&gt;
http://adknowledgents.wix.com/adknowledgents&lt;br /&gt;
&lt;br /&gt;
http://www.rmudata.com&lt;br /&gt;
&lt;br /&gt;
http://www.advancedcardiodr.com/&lt;br /&gt;
&lt;br /&gt;
http://alwii.org&lt;br /&gt;
&lt;br /&gt;
http://www.advancedseismic.com&lt;br /&gt;
&lt;br /&gt;
http://www.AdvoWire.com&lt;br /&gt;
&lt;br /&gt;
http://www.aggredyne.com&lt;br /&gt;
&lt;br /&gt;
http://www.akrostechlabs.com/&lt;br /&gt;
&lt;br /&gt;
http://www.aleedex.com&lt;br /&gt;
&lt;br /&gt;
http://www.alertlogic.com/&lt;br /&gt;
&lt;br /&gt;
http://www.aliceandlove.com&lt;br /&gt;
&lt;br /&gt;
https://www.alignedsigns.com/ppcregistration6.htm&lt;br /&gt;
&lt;br /&gt;
https://www.alliedwarranty.com/&lt;br /&gt;
&lt;br /&gt;
http://none yet&lt;br /&gt;
&lt;br /&gt;
http://www.alpheus.net&lt;br /&gt;
&lt;br /&gt;
Output Text:&lt;br /&gt;
&lt;br /&gt;
[[Image:output.jpg|400px|thumb|right|WhoIs output file in Excel]]&lt;br /&gt;
&lt;br /&gt;
Domain Name	Creation Date	Expiration Date	Update Date	Registrant Name	Registrant Street	Registrant City	Registrant Postal Code	Registrant Country	Admin Street	Admin City	Admin Postal Code	Admin Country&lt;br /&gt;
&lt;br /&gt;
http://1986ventures.com	2013-09-12T09:25:51Z	12-sep-2016		Domain Admin	C/O ID#10760, PO Box 16 Note - Visit PrivacyProtect.org to contact the domain owner/operator Note - Visit PrivacyProtect.org to contact the domain owner/operator	Nobby Beach	QLD 4218	AU	C/O ID#10760, PO Box 16 Note - Visit PrivacyProtect.org to contact the domain owner/operator Note - Visit PrivacyProtect.org to contact the domain owner/operator	Nobby Beach	QLD 4218	AU&lt;br /&gt;
&lt;br /&gt;
http://2nd.md/	2010-11-17	2017-11-17										&lt;br /&gt;
&lt;br /&gt;
http://www.2ndsquare.com	2013-10-16T04:01:29Z	16-oct-2016	2015-10-16T20:38:12Z	Sameer Khan	22215 Tower Terr	San Antonio	78259	US	22215 Tower Terr	San Antonio	78259	US&lt;br /&gt;
&lt;br /&gt;
http://www.32nddegree.com/	2008-02-18T18:45:15Z	18-feb-2020		Cutshall, Wes	1321 Upland Dr.	Houston	77043	US	1321 Upland Dr.	Houston	77043	US&lt;br /&gt;
&lt;br /&gt;
http://www.80legs.com	2008-07-17T21:09:48Z	17-jul-2016		Shion Deysarkar	904 West Avenue	Austin	78701	US	904 West Avenue	Austin	78701	US&lt;br /&gt;
&lt;br /&gt;
http://hotmailpasswordsupportnumber.info/&lt;br /&gt;
&lt;br /&gt;
http://www.MidtownDelivery.com	2012-01-23T05:01:21Z	23-jan-2017	2015-01-05T05:24:56Z	Jim Wiseheart	7655 S. Braeswood#21	Houston	77071	US	7655 S. Braeswood#21	Houston	77071	US&lt;br /&gt;
&lt;br /&gt;
http://accreu.com	2011-05-05T00:11:53.000Z	05-may-2016		Oneandone Private Registration	701 Lee Road Suite 300ATTN	Chesterbrook	19087	US	701 Lee Road Suite 300ATTN	Chesterbrook	19087	US&lt;br /&gt;
&lt;br /&gt;
http://www.actionfigurelabs.com	2011-02-18T17:40:24Z	18-feb-2017		Phillip Leech	2223 Willowby Dr	Houston	77008	US	2223 Willowby Dr	Houston	77008	US&lt;br /&gt;
&lt;br /&gt;
https://m.facebook.com/AddictivePerformance99&lt;br /&gt;
&lt;br /&gt;
http://www.additech.com/	1997-01-24T05:00:00Z	25-jan-2018		Additech, Inc.	10925 Kinghurst	Houston	77099	US	10925 Kinghurst	Houston	77099	US&lt;br /&gt;
&lt;br /&gt;
http://adknowledgents.wix.com/adknowledgents&lt;br /&gt;
&lt;br /&gt;
http://www.rmudata.com	2000-04-13T17:09:54Z	13-apr-2017		PERFECT PRIVACY, LLC	12808 Gran Bay Parkway West	Jacksonville	32258	US	12808 Gran Bay Parkway West	Jacksonville	32258	US&lt;br /&gt;
&lt;br /&gt;
http://www.advancedcardiodr.com/	2012-04-17T14:12:09Z	17-apr-2022	2015-01-08T22:09:14Z	Sharafat Hussain	Advanced Cardiovascular Care Center800 Peakwood Drive, Suite 8C	Houston	77090	US	Advanced Cardiovascular Care Center800 Peakwood Drive, Suite 8C	Houston	77090	US&lt;br /&gt;
&lt;br /&gt;
http://alwii.org	2011-05-31T21:48:05Z			Chi Mao	1917 Ashland St, 2nd FloorIn Select Specialty Hospital	Houston	77008	US	1917 Ashland St, 2nd FloorIn Select Specialty Hospital	Houston	77008	US&lt;br /&gt;
&lt;br /&gt;
http://www.advancedseismic.com	2009-10-30T19:00:47Z	30-oct-2016	2015-10-31T11:28:22Z	na na	na	na	88888	US	na	na	88888	US&lt;br /&gt;
&lt;br /&gt;
http://www.AdvoWire.com	2013-07-13T08:43:39Z	13-jul-2018	2013-07-13T08:43:39Z	Jason Pampell	6516 North Gessner	Houston	77040	US	6516 North Gessner	Houston	77040	US&lt;br /&gt;
&lt;br /&gt;
http://www.aggredyne.com	2011-04-01T21:03:52Z	01-apr-2018		Robert C. Hux	10530 Rockley Rd.,Suite 150	Houston	77099	US	10530 Rockley Rd.,Suite 150	Houston	77099	US&lt;br /&gt;
&lt;br /&gt;
http://www.akrostechlabs.com/	2008-03-24T17:34:07Z	24-mar-2017	2015-03-24T01:54:15Z	Registration Private	DomainsByProxy.com14747 N Northsight Blvd Suite 111, PMB 309	Scottsdale	85260	US	DomainsByProxy.com14747 N Northsight Blvd Suite 111, PMB 309	Scottsdale	85260	US&lt;br /&gt;
&lt;br /&gt;
http://www.aleedex.com	2012-12-27T20:15:55Z	10-jun-2019	2013-06-14T09:54:17Z	Farid Premani	10500 Reserve at Fountain Lake	Stafford	77477	US	10500 Reserve at Fountain Lake	Stafford	77477	US&lt;br /&gt;
&lt;br /&gt;
http://www.alertlogic.com/	2003-10-10T21:24:13Z	10-oct-2019		PERFECT PRIVACY, LLC	12808 Gran Bay Pkwy West	Jacksonville	32258	US	12808 Gran Bay Pkwy West	Jacksonville	32258	US&lt;br /&gt;
&lt;br /&gt;
http://www.aliceandlove.com	2014-08-07T01:42:29Z	07-aug-2016		c/o WHOIStrustee.com Limited	Riverside View	Thornes Lane	WF1 5QW	GB	Riverside View	Thornes Lane	WF1 5QW	GB&lt;br /&gt;
&lt;br /&gt;
https://www.alignedsigns.com/ppcregistration6.htm&lt;br /&gt;
&lt;br /&gt;
https://www.alliedwarranty.com/	2004-03-31T20:07:28Z	31-mar-2018	2014-03-16T04:17:39Z	Registration Private	DomainsByProxy.com14747 N Northsight Blvd Suite 111, PMB 309	Scottsdale	85260	US	DomainsByProxy.com14747 N Northsight Blvd Suite 111, PMB 309	Scottsdale	85260	US&lt;br /&gt;
&lt;br /&gt;
http://none yet&lt;br /&gt;
&lt;br /&gt;
http://www.alpheus.net	2003-03-27T23:14:33Z	27-mar-2018	2016-03-28T11:22:05Z	Alpheus Firstcall	1301 Fannin St.20th Floor	Houston	77002	US	1301 Fannin St.20th Floor	Houston	77002	US&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Whois_Parser&amp;diff=4789</id>
		<title>Whois Parser</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Whois_Parser&amp;diff=4789"/>
		<updated>2016-07-11T19:06:20Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Internal]]&lt;br /&gt;
[[Internal Classification::Legacy| ]]&lt;br /&gt;
This wiki page is under Additional Links/WhoisParser&lt;br /&gt;
&lt;br /&gt;
The whoisParser was written by Kunal Shah on March 20, 2016, and is located at:&lt;br /&gt;
 repository: Web_Crawler&lt;br /&gt;
 branch: shoeb_patch/whoisparser&lt;br /&gt;
 directory: /WhoIsParser&lt;br /&gt;
 file: whoisParser.pl&lt;br /&gt;
&lt;br /&gt;
To use this parser, copy the Perl program above into a directory, make that directory the current working directory (use the 'cd' command if needed), and run the following command. The directory must also contain the input file (see below).&lt;br /&gt;
&lt;br /&gt;
perl whoisParser.pl -file=listofurls.txt -outfile=listofurls_processed.txt&lt;br /&gt;
&lt;br /&gt;
= NAME =&lt;br /&gt;
&lt;br /&gt;
WhoIs Parser - Retrieves and parses Whois information&lt;br /&gt;
Specifically, takes a file with a column of domain names and populates the&lt;br /&gt;
corresponding columns with information from the WhoIs API.&lt;br /&gt;
&lt;br /&gt;
= SYNOPSIS =&lt;br /&gt;
&lt;br /&gt;
perl whoisParser.pl -file=&amp;lt;file&amp;gt; [-outfile=&amp;lt;file&amp;gt;] &lt;br /&gt;
&lt;br /&gt;
= OPTIONS =&lt;br /&gt;
&lt;br /&gt;
    -file=&amp;lt;file&amp;gt;:           Name of file of domain names. &lt;br /&gt;
    -outfile=&amp;lt;file&amp;gt;:        The name of the outfile &lt;br /&gt;
    -h:                     Display help&lt;br /&gt;
&lt;br /&gt;
= USAGE &amp;amp; FEATURES =&lt;br /&gt;
&lt;br /&gt;
'''Arguments:''' &lt;br /&gt;
&lt;br /&gt;
A text file with a column of domain names&lt;br /&gt;
&lt;br /&gt;
'''Returns:''' &lt;br /&gt;
&lt;br /&gt;
A text file containing the domain names, with the next 12 columns populated with information pulled from the Whois API. A header naming each column is inserted as the first row of the file. The output columns are:&lt;br /&gt;
&lt;br /&gt;
1. Domain Name&lt;br /&gt;
&lt;br /&gt;
2. Creation Date&lt;br /&gt;
&lt;br /&gt;
3. Expiration Date&lt;br /&gt;
&lt;br /&gt;
4. Update Date&lt;br /&gt;
&lt;br /&gt;
5. Registrant Name&lt;br /&gt;
&lt;br /&gt;
6. Registrant Street&lt;br /&gt;
&lt;br /&gt;
7. Registrant City&lt;br /&gt;
&lt;br /&gt;
8. Registrant Postal Code&lt;br /&gt;
&lt;br /&gt;
9. Registrant Country&lt;br /&gt;
&lt;br /&gt;
10. Admin Street&lt;br /&gt;
&lt;br /&gt;
11. Admin City&lt;br /&gt;
&lt;br /&gt;
12. Admin Postal Code&lt;br /&gt;
&lt;br /&gt;
13. Admin Country&lt;br /&gt;
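The date columns in the sample output on this page arrive in several formats (for example 2013-09-12T09:25:51Z, 2010-11-17 and 12-sep-2016). A minimal Python sketch of one way to normalize them after the fact. The format list covers only the formats visible in the sample, and the function name normalize_date is hypothetical; it is not part of whoisParser.pl.&lt;br /&gt;

```python
from datetime import datetime

# Formats observed in the sample output on this page; other registrars may use more.
FORMATS = [
    "%Y-%m-%dT%H:%M:%SZ",
    "%Y-%m-%dT%H:%M:%S.%fZ",
    "%Y-%m-%d",
    "%d-%b-%Y",
]

def normalize_date(text):
    """Return the date as YYYY-MM-DD, or None if the field is empty or unrecognized."""
    text = text.strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next known format
    return None

print(normalize_date("12-sep-2016"))
```

The month-name format relies on an English locale, and the time of day is discarded, which may or may not suit a given analysis.&lt;br /&gt;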
&lt;br /&gt;
= BUGS &amp;amp; FEEDBACK =&lt;br /&gt;
&lt;br /&gt;
Worked as expected on all example files. Please report any discovered bugs to Kunal.&lt;br /&gt;
&lt;br /&gt;
Tested files:&lt;br /&gt;
Input: example_file.txt&lt;br /&gt;
&lt;br /&gt;
Output: example_outfile.txt&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Input Text: &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:input.jpg|400px|thumb|right|WhoIs input file in Excel]]&lt;br /&gt;
&lt;br /&gt;
http://1986ventures.com&lt;br /&gt;
&lt;br /&gt;
http://2nd.md/&lt;br /&gt;
&lt;br /&gt;
http://www.2ndsquare.com&lt;br /&gt;
&lt;br /&gt;
http://www.32nddegree.com/&lt;br /&gt;
&lt;br /&gt;
http://www.80legs.com&lt;br /&gt;
&lt;br /&gt;
http://hotmailpasswordsupportnumber.info/&lt;br /&gt;
&lt;br /&gt;
http://www.MidtownDelivery.com&lt;br /&gt;
&lt;br /&gt;
http://accreu.com&lt;br /&gt;
&lt;br /&gt;
http://www.actionfigurelabs.com&lt;br /&gt;
&lt;br /&gt;
https://m.facebook.com/AddictivePerformance99&lt;br /&gt;
&lt;br /&gt;
http://www.additech.com/&lt;br /&gt;
&lt;br /&gt;
http://adknowledgents.wix.com/adknowledgents&lt;br /&gt;
&lt;br /&gt;
http://www.rmudata.com&lt;br /&gt;
&lt;br /&gt;
http://www.advancedcardiodr.com/&lt;br /&gt;
&lt;br /&gt;
http://alwii.org&lt;br /&gt;
&lt;br /&gt;
http://www.advancedseismic.com&lt;br /&gt;
&lt;br /&gt;
http://www.AdvoWire.com&lt;br /&gt;
&lt;br /&gt;
http://www.aggredyne.com&lt;br /&gt;
&lt;br /&gt;
http://www.akrostechlabs.com/&lt;br /&gt;
&lt;br /&gt;
http://www.aleedex.com&lt;br /&gt;
&lt;br /&gt;
http://www.alertlogic.com/&lt;br /&gt;
&lt;br /&gt;
http://www.aliceandlove.com&lt;br /&gt;
&lt;br /&gt;
https://www.alignedsigns.com/ppcregistration6.htm&lt;br /&gt;
&lt;br /&gt;
https://www.alliedwarranty.com/&lt;br /&gt;
&lt;br /&gt;
http://none yet&lt;br /&gt;
&lt;br /&gt;
http://www.alpheus.net&lt;br /&gt;
&lt;br /&gt;
Output Text:&lt;br /&gt;
&lt;br /&gt;
[[Image:output.jpg|400px|thumb|right|WhoIs output file in Excel]]&lt;br /&gt;
&lt;br /&gt;
Domain Name	Creation Date	Expiration Date	Update Date	Registrant Name	Registrant Street	Registrant City	Registrant Postal Code	Registrant Country	Admin Street	Admin City	Admin Postal Code	Admin Country&lt;br /&gt;
&lt;br /&gt;
http://1986ventures.com	2013-09-12T09:25:51Z	12-sep-2016		Domain Admin	C/O ID#10760, PO Box 16 Note - Visit PrivacyProtect.org to contact the domain owner/operator Note - Visit PrivacyProtect.org to contact the domain owner/operator	Nobby Beach	QLD 4218	AU	C/O ID#10760, PO Box 16 Note - Visit PrivacyProtect.org to contact the domain owner/operator Note - Visit PrivacyProtect.org to contact the domain owner/operator	Nobby Beach	QLD 4218	AU&lt;br /&gt;
&lt;br /&gt;
http://2nd.md/	2010-11-17	2017-11-17										&lt;br /&gt;
&lt;br /&gt;
http://www.2ndsquare.com	2013-10-16T04:01:29Z	16-oct-2016	2015-10-16T20:38:12Z	Sameer Khan	22215 Tower Terr	San Antonio	78259	US	22215 Tower Terr	San Antonio	78259	US&lt;br /&gt;
&lt;br /&gt;
http://www.32nddegree.com/	2008-02-18T18:45:15Z	18-feb-2020		Cutshall, Wes	1321 Upland Dr.	Houston	77043	US	1321 Upland Dr.	Houston	77043	US&lt;br /&gt;
&lt;br /&gt;
http://www.80legs.com	2008-07-17T21:09:48Z	17-jul-2016		Shion Deysarkar	904 West Avenue	Austin	78701	US	904 West Avenue	Austin	78701	US&lt;br /&gt;
&lt;br /&gt;
http://hotmailpasswordsupportnumber.info/&lt;br /&gt;
&lt;br /&gt;
http://www.MidtownDelivery.com	2012-01-23T05:01:21Z	23-jan-2017	2015-01-05T05:24:56Z	Jim Wiseheart	7655 S. Braeswood#21	Houston	77071	US	7655 S. Braeswood#21	Houston	77071	US&lt;br /&gt;
&lt;br /&gt;
http://accreu.com	2011-05-05T00:11:53.000Z	05-may-2016		Oneandone Private Registration	701 Lee Road Suite 300ATTN	Chesterbrook	19087	US	701 Lee Road Suite 300ATTN	Chesterbrook	19087	US&lt;br /&gt;
&lt;br /&gt;
http://www.actionfigurelabs.com	2011-02-18T17:40:24Z	18-feb-2017		Phillip Leech	2223 Willowby Dr	Houston	77008	US	2223 Willowby Dr	Houston	77008	US&lt;br /&gt;
&lt;br /&gt;
https://m.facebook.com/AddictivePerformance99&lt;br /&gt;
&lt;br /&gt;
http://www.additech.com/	1997-01-24T05:00:00Z	25-jan-2018		Additech, Inc.	10925 Kinghurst	Houston	77099	US	10925 Kinghurst	Houston	77099	US&lt;br /&gt;
&lt;br /&gt;
http://adknowledgents.wix.com/adknowledgents&lt;br /&gt;
&lt;br /&gt;
http://www.rmudata.com	2000-04-13T17:09:54Z	13-apr-2017		PERFECT PRIVACY, LLC	12808 Gran Bay Parkway West	Jacksonville	32258	US	12808 Gran Bay Parkway West	Jacksonville	32258	US&lt;br /&gt;
&lt;br /&gt;
http://www.advancedcardiodr.com/	2012-04-17T14:12:09Z	17-apr-2022	2015-01-08T22:09:14Z	Sharafat Hussain	Advanced Cardiovascular Care Center800 Peakwood Drive, Suite 8C	Houston	77090	US	Advanced Cardiovascular Care Center800 Peakwood Drive, Suite 8C	Houston	77090	US&lt;br /&gt;
&lt;br /&gt;
http://alwii.org	2011-05-31T21:48:05Z			Chi Mao	1917 Ashland St, 2nd FloorIn Select Specialty Hospital	Houston	77008	US	1917 Ashland St, 2nd FloorIn Select Specialty Hospital	Houston	77008	US&lt;br /&gt;
&lt;br /&gt;
http://www.advancedseismic.com	2009-10-30T19:00:47Z	30-oct-2016	2015-10-31T11:28:22Z	na na	na	na	88888	US	na	na	88888	US&lt;br /&gt;
&lt;br /&gt;
http://www.AdvoWire.com	2013-07-13T08:43:39Z	13-jul-2018	2013-07-13T08:43:39Z	Jason Pampell	6516 North Gessner	Houston	77040	US	6516 North Gessner	Houston	77040	US&lt;br /&gt;
&lt;br /&gt;
http://www.aggredyne.com	2011-04-01T21:03:52Z	01-apr-2018		Robert C. Hux	10530 Rockley Rd.,Suite 150	Houston	77099	US	10530 Rockley Rd.,Suite 150	Houston	77099	US&lt;br /&gt;
&lt;br /&gt;
http://www.akrostechlabs.com/	2008-03-24T17:34:07Z	24-mar-2017	2015-03-24T01:54:15Z	Registration Private	DomainsByProxy.com14747 N Northsight Blvd Suite 111, PMB 309	Scottsdale	85260	US	DomainsByProxy.com14747 N Northsight Blvd Suite 111, PMB 309	Scottsdale	85260	US&lt;br /&gt;
&lt;br /&gt;
http://www.aleedex.com	2012-12-27T20:15:55Z	10-jun-2019	2013-06-14T09:54:17Z	Farid Premani	10500 Reserve at Fountain Lake	Stafford	77477	US	10500 Reserve at Fountain Lake	Stafford	77477	US&lt;br /&gt;
&lt;br /&gt;
http://www.alertlogic.com/	2003-10-10T21:24:13Z	10-oct-2019		PERFECT PRIVACY, LLC	12808 Gran Bay Pkwy West	Jacksonville	32258	US	12808 Gran Bay Pkwy West	Jacksonville	32258	US&lt;br /&gt;
&lt;br /&gt;
http://www.aliceandlove.com	2014-08-07T01:42:29Z	07-aug-2016		c/o WHOIStrustee.com Limited	Riverside View	Thornes Lane	WF1 5QW	GB	Riverside View	Thornes Lane	WF1 5QW	GB&lt;br /&gt;
&lt;br /&gt;
https://www.alignedsigns.com/ppcregistration6.htm&lt;br /&gt;
&lt;br /&gt;
https://www.alliedwarranty.com/	2004-03-31T20:07:28Z	31-mar-2018	2014-03-16T04:17:39Z	Registration Private	DomainsByProxy.com14747 N Northsight Blvd Suite 111, PMB 309	Scottsdale	85260	US	DomainsByProxy.com14747 N Northsight Blvd Suite 111, PMB 309	Scottsdale	85260	US&lt;br /&gt;
&lt;br /&gt;
http://none yet&lt;br /&gt;
&lt;br /&gt;
http://www.alpheus.net	2003-03-27T23:14:33Z	27-mar-2018	2016-03-28T11:22:05Z	Alpheus Firstcall	1301 Fannin St.20th Floor	Houston	77002	US	1301 Fannin St.20th Floor	Houston	77002	US&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Whois_Parser&amp;diff=4788</id>
		<title>Whois Parser</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Whois_Parser&amp;diff=4788"/>
		<updated>2016-07-11T19:00:55Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Internal]]&lt;br /&gt;
[[Internal Classification::Legacy| ]]&lt;br /&gt;
This wiki page is under Additional Links/WhoisParser&lt;br /&gt;
&lt;br /&gt;
The whoisParser was written by Kunal Shah on March 20, 2016, and is located at:&lt;br /&gt;
 repository: Web_Crawler&lt;br /&gt;
 branch: kunal/whoisparser&lt;br /&gt;
 directory: /WhoIsParser&lt;br /&gt;
 file: whoisParser.pl&lt;br /&gt;
&lt;br /&gt;
To use this parser, copy the Perl program above into a directory, make that directory the current working directory (use the 'cd' command if needed), and run the following command. The directory must also contain the input file (see below).&lt;br /&gt;
&lt;br /&gt;
perl whoisParser.pl -file=listofurls.txt -outfile=listofurls_processed.txt&lt;br /&gt;
&lt;br /&gt;
= NAME =&lt;br /&gt;
&lt;br /&gt;
WhoIs Parser - Retrieves and parses Whois information&lt;br /&gt;
Specifically, takes a file with a column of domain names and populates the&lt;br /&gt;
corresponding columns with information from the WhoIs API.&lt;br /&gt;
&lt;br /&gt;
= SYNOPSIS =&lt;br /&gt;
&lt;br /&gt;
perl whoisParser.pl -file=&amp;lt;file&amp;gt; [-outfile=&amp;lt;file&amp;gt;] &lt;br /&gt;
&lt;br /&gt;
= OPTIONS =&lt;br /&gt;
&lt;br /&gt;
    -file=&amp;lt;file&amp;gt;:           Name of file of domain names. &lt;br /&gt;
    -outfile=&amp;lt;file&amp;gt;:        The name of the outfile &lt;br /&gt;
    -h:                     Display help&lt;br /&gt;
&lt;br /&gt;
= USAGE &amp;amp; FEATURES =&lt;br /&gt;
&lt;br /&gt;
'''Arguments:''' &lt;br /&gt;
&lt;br /&gt;
A text file with a column of domain names&lt;br /&gt;
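&lt;br /&gt;
The sample input further down this page lists full URLs rather than bare domain names, and how whoisParser.pl reduces them is not described here. A minimal Python sketch of one way to extract a hostname from each line before a lookup; the function name hostname_of and the rule of dropping a leading www. are assumptions for illustration only.&lt;br /&gt;

```python
from urllib.parse import urlsplit

def hostname_of(url):
    """Return the lowercased hostname without a leading 'www.', or None if unusable."""
    try:
        host = urlsplit(url.strip()).hostname
    except ValueError:
        return None
    # Reject placeholders such as "http://none yet" and names without a dot.
    if not host or " " in host or "." not in host:
        return None
    if host.startswith("www."):
        host = host[4:]
    return host

print(hostname_of("http://www.MidtownDelivery.com"))
```

A hostname is not always the registered domain: m.facebook.com and adknowledgents.wix.com in the sample input are subdomains of someone else's registration, which may be why those rows come back empty in the sample output.&lt;br /&gt;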
&lt;br /&gt;
'''Returns:''' &lt;br /&gt;
&lt;br /&gt;
A text file of the domain names with the next 12 columns populated with information pulled from the Whois API. A header specifying each column is inserted into the first row of the file.       The columns of information outputed are:&lt;br /&gt;
&lt;br /&gt;
1. Domain Name&lt;br /&gt;
&lt;br /&gt;
2. Creation Date&lt;br /&gt;
&lt;br /&gt;
3. Expiration Date&lt;br /&gt;
&lt;br /&gt;
4. Update Date&lt;br /&gt;
&lt;br /&gt;
5. Registrant Name&lt;br /&gt;
&lt;br /&gt;
6. Registrant Street&lt;br /&gt;
&lt;br /&gt;
7. Registrant City&lt;br /&gt;
&lt;br /&gt;
8. Registrant Postal Code&lt;br /&gt;
&lt;br /&gt;
9. Registrant Country&lt;br /&gt;
&lt;br /&gt;
10. Admin Street&lt;br /&gt;
&lt;br /&gt;
11. Admin City&lt;br /&gt;
&lt;br /&gt;
12. Admin Postal Code&lt;br /&gt;
&lt;br /&gt;
13. Admin Country&lt;br /&gt;
&lt;br /&gt;
= BUGS &amp;amp; FEEDBACK =&lt;br /&gt;
&lt;br /&gt;
Worked as expected on all example files. Please report any discovered bugs to Kunal.&lt;br /&gt;
&lt;br /&gt;
Tested files:&lt;br /&gt;
Input: example_file.txt&lt;br /&gt;
&lt;br /&gt;
Output: example_outfile.txt&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Input Text: &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:input.jpg|400px|thumb|right|WhoIs input file in Excel]]&lt;br /&gt;
&lt;br /&gt;
http://1986ventures.com&lt;br /&gt;
&lt;br /&gt;
http://2nd.md/&lt;br /&gt;
&lt;br /&gt;
http://www.2ndsquare.com&lt;br /&gt;
&lt;br /&gt;
http://www.32nddegree.com/&lt;br /&gt;
&lt;br /&gt;
http://www.80legs.com&lt;br /&gt;
&lt;br /&gt;
http://hotmailpasswordsupportnumber.info/&lt;br /&gt;
&lt;br /&gt;
http://www.MidtownDelivery.com&lt;br /&gt;
&lt;br /&gt;
http://accreu.com&lt;br /&gt;
&lt;br /&gt;
http://www.actionfigurelabs.com&lt;br /&gt;
&lt;br /&gt;
https://m.facebook.com/AddictivePerformance99&lt;br /&gt;
&lt;br /&gt;
http://www.additech.com/&lt;br /&gt;
&lt;br /&gt;
http://adknowledgents.wix.com/adknowledgents&lt;br /&gt;
&lt;br /&gt;
http://www.rmudata.com&lt;br /&gt;
&lt;br /&gt;
http://www.advancedcardiodr.com/&lt;br /&gt;
&lt;br /&gt;
http://alwii.org&lt;br /&gt;
&lt;br /&gt;
http://www.advancedseismic.com&lt;br /&gt;
&lt;br /&gt;
http://www.AdvoWire.com&lt;br /&gt;
&lt;br /&gt;
http://www.aggredyne.com&lt;br /&gt;
&lt;br /&gt;
http://www.akrostechlabs.com/&lt;br /&gt;
&lt;br /&gt;
http://www.aleedex.com&lt;br /&gt;
&lt;br /&gt;
http://www.alertlogic.com/&lt;br /&gt;
&lt;br /&gt;
http://www.aliceandlove.com&lt;br /&gt;
&lt;br /&gt;
https://www.alignedsigns.com/ppcregistration6.htm&lt;br /&gt;
&lt;br /&gt;
https://www.alliedwarranty.com/&lt;br /&gt;
&lt;br /&gt;
http://none yet&lt;br /&gt;
&lt;br /&gt;
http://www.alpheus.net&lt;br /&gt;
&lt;br /&gt;
Output Text:&lt;br /&gt;
&lt;br /&gt;
[[Image:output.jpg|400px|thumb|right|WhoIs output file in Excel]]&lt;br /&gt;
&lt;br /&gt;
Domain Name	Creation Date	Expiration Date	Update Date	Registrant Name	Registrant Street	Registrant City	Registrant Postal Code	Registrant Country	Admin Street	Admin City	Admin Postal Code	Admin Country&lt;br /&gt;
&lt;br /&gt;
http://1986ventures.com	2013-09-12T09:25:51Z	12-sep-2016		Domain Admin	C/O ID#10760, PO Box 16 Note - Visit PrivacyProtect.org to contact the domain owner/operator Note - Visit PrivacyProtect.org to contact the domain owner/operator	Nobby Beach	QLD 4218	AU	C/O ID#10760, PO Box 16 Note - Visit PrivacyProtect.org to contact the domain owner/operator Note - Visit PrivacyProtect.org to contact the domain owner/operator	Nobby Beach	QLD 4218	AU&lt;br /&gt;
&lt;br /&gt;
http://2nd.md/	2010-11-17	2017-11-17										&lt;br /&gt;
&lt;br /&gt;
http://www.2ndsquare.com	2013-10-16T04:01:29Z	16-oct-2016	2015-10-16T20:38:12Z	Sameer Khan	22215 Tower Terr	San Antonio	78259	US	22215 Tower Terr	San Antonio	78259	US&lt;br /&gt;
&lt;br /&gt;
http://www.32nddegree.com/	2008-02-18T18:45:15Z	18-feb-2020		Cutshall, Wes	1321 Upland Dr.	Houston	77043	US	1321 Upland Dr.	Houston	77043	US&lt;br /&gt;
&lt;br /&gt;
http://www.80legs.com	2008-07-17T21:09:48Z	17-jul-2016		Shion Deysarkar	904 West Avenue	Austin	78701	US	904 West Avenue	Austin	78701	US&lt;br /&gt;
&lt;br /&gt;
http://hotmailpasswordsupportnumber.info/&lt;br /&gt;
&lt;br /&gt;
http://www.MidtownDelivery.com	2012-01-23T05:01:21Z	23-jan-2017	2015-01-05T05:24:56Z	Jim Wiseheart	7655 S. Braeswood#21	Houston	77071	US	7655 S. Braeswood#21	Houston	77071	US&lt;br /&gt;
&lt;br /&gt;
http://accreu.com	2011-05-05T00:11:53.000Z	05-may-2016		Oneandone Private Registration	701 Lee Road Suite 300ATTN	Chesterbrook	19087	US	701 Lee Road Suite 300ATTN	Chesterbrook	19087	US&lt;br /&gt;
&lt;br /&gt;
http://www.actionfigurelabs.com	2011-02-18T17:40:24Z	18-feb-2017		Phillip Leech	2223 Willowby Dr	Houston	77008	US	2223 Willowby Dr	Houston	77008	US&lt;br /&gt;
&lt;br /&gt;
https://m.facebook.com/AddictivePerformance99&lt;br /&gt;
&lt;br /&gt;
http://www.additech.com/	1997-01-24T05:00:00Z	25-jan-2018		Additech, Inc.	10925 Kinghurst	Houston	77099	US	10925 Kinghurst	Houston	77099	US&lt;br /&gt;
&lt;br /&gt;
http://adknowledgents.wix.com/adknowledgents&lt;br /&gt;
&lt;br /&gt;
http://www.rmudata.com	2000-04-13T17:09:54Z	13-apr-2017		PERFECT PRIVACY, LLC	12808 Gran Bay Parkway West	Jacksonville	32258	US	12808 Gran Bay Parkway West	Jacksonville	32258	US&lt;br /&gt;
&lt;br /&gt;
http://www.advancedcardiodr.com/	2012-04-17T14:12:09Z	17-apr-2022	2015-01-08T22:09:14Z	Sharafat Hussain	Advanced Cardiovascular Care Center800 Peakwood Drive, Suite 8C	Houston	77090	US	Advanced Cardiovascular Care Center800 Peakwood Drive, Suite 8C	Houston	77090	US&lt;br /&gt;
&lt;br /&gt;
http://alwii.org	2011-05-31T21:48:05Z			Chi Mao	1917 Ashland St, 2nd FloorIn Select Specialty Hospital	Houston	77008	US	1917 Ashland St, 2nd FloorIn Select Specialty Hospital	Houston	77008	US&lt;br /&gt;
&lt;br /&gt;
http://www.advancedseismic.com	2009-10-30T19:00:47Z	30-oct-2016	2015-10-31T11:28:22Z	na na	na	na	88888	US	na	na	88888	US&lt;br /&gt;
&lt;br /&gt;
http://www.AdvoWire.com	2013-07-13T08:43:39Z	13-jul-2018	2013-07-13T08:43:39Z	Jason Pampell	6516 North Gessner	Houston	77040	US	6516 North Gessner	Houston	77040	US&lt;br /&gt;
&lt;br /&gt;
http://www.aggredyne.com	2011-04-01T21:03:52Z	01-apr-2018		Robert C. Hux	10530 Rockley Rd.,Suite 150	Houston	77099	US	10530 Rockley Rd.,Suite 150	Houston	77099	US&lt;br /&gt;
&lt;br /&gt;
http://www.akrostechlabs.com/	2008-03-24T17:34:07Z	24-mar-2017	2015-03-24T01:54:15Z	Registration Private	DomainsByProxy.com14747 N Northsight Blvd Suite 111, PMB 309	Scottsdale	85260	US	DomainsByProxy.com14747 N Northsight Blvd Suite 111, PMB 309	Scottsdale	85260	US&lt;br /&gt;
&lt;br /&gt;
http://www.aleedex.com	2012-12-27T20:15:55Z	10-jun-2019	2013-06-14T09:54:17Z	Farid Premani	10500 Reserve at Fountain Lake	Stafford	77477	US	10500 Reserve at Fountain Lake	Stafford	77477	US&lt;br /&gt;
&lt;br /&gt;
http://www.alertlogic.com/	2003-10-10T21:24:13Z	10-oct-2019		PERFECT PRIVACY, LLC	12808 Gran Bay Pkwy West	Jacksonville	32258	US	12808 Gran Bay Pkwy West	Jacksonville	32258	US&lt;br /&gt;
&lt;br /&gt;
http://www.aliceandlove.com	2014-08-07T01:42:29Z	07-aug-2016		c/o WHOIStrustee.com Limited	Riverside View	Thornes Lane	WF1 5QW	GB	Riverside View	Thornes Lane	WF1 5QW	GB&lt;br /&gt;
&lt;br /&gt;
https://www.alignedsigns.com/ppcregistration6.htm&lt;br /&gt;
&lt;br /&gt;
https://www.alliedwarranty.com/	2004-03-31T20:07:28Z	31-mar-2018	2014-03-16T04:17:39Z	Registration Private	DomainsByProxy.com14747 N Northsight Blvd Suite 111, PMB 309	Scottsdale	85260	US	DomainsByProxy.com14747 N Northsight Blvd Suite 111, PMB 309	Scottsdale	85260	US&lt;br /&gt;
&lt;br /&gt;
http://none yet&lt;br /&gt;
&lt;br /&gt;
http://www.alpheus.net	2003-03-27T23:14:33Z	27-mar-2018	2016-03-28T11:22:05Z	Alpheus Firstcall	1301 Fannin St.20th Floor	Houston	77002	US	1301 Fannin St.20th Floor	Houston	77002	US&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Bulk_Patent_Assignee_Processing&amp;diff=4646</id>
		<title>Bulk Patent Assignee Processing</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Bulk_Patent_Assignee_Processing&amp;diff=4646"/>
		<updated>2016-07-08T15:56:50Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: /* Scripts for processing data */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== USPTO Assignees Data ==&lt;br /&gt;
&lt;br /&gt;
We would like to download data from this location on the USPTO website and load it into our tables. The objective is to determine whether this dataset is better than the current version of our patent data (a combination of the data in the patent_2015 and patentdata databases).&lt;br /&gt;
&lt;br /&gt;
== Steps Followed to Extract the Data ==&lt;br /&gt;
&lt;br /&gt;
===Extracting Data from XML Files ===&lt;br /&gt;
&lt;br /&gt;
All the historical USPTO data is available as XML files. Here is the tree structure for the XML files:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;patent-assignment&amp;gt;&lt;br /&gt;
        +&amp;lt;assignment-record&amp;gt;&lt;br /&gt;
        +&amp;lt;patent-assignors&amp;gt;&lt;br /&gt;
        +&amp;lt;patent-assignees&amp;gt;&lt;br /&gt;
        +&amp;lt;patent-properties&amp;gt;&lt;br /&gt;
 &amp;lt;/patent-assignment&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Each of the above internal nodes is mandatory, and is a logical grouping of information fields. Each node has a corresponding table created with more or less the same fields as the XML elements.&lt;br /&gt;
&lt;br /&gt;
Corresponding tables are:&lt;br /&gt;
*assignment-record : assignment&lt;br /&gt;
*patent-assignors : assignors&lt;br /&gt;
*patent-assignees : assignees&lt;br /&gt;
*patent-properties : properties&lt;br /&gt;
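&lt;br /&gt;
A minimal Python sketch of the node-to-table mapping, using the node names from the tree structure shown earlier. It builds an empty patent-assignment element in memory instead of reading a USPTO file, and the function name tables_for is hypothetical; the actual loading scripts may work differently.&lt;br /&gt;

```python
import xml.etree.ElementTree as ET

# Node-to-table mapping, with node names taken from the tree structure above.
NODE_TO_TABLE = {
    "assignment-record": "assignment",
    "patent-assignors": "assignors",
    "patent-assignees": "assignees",
    "patent-properties": "properties",
}

def tables_for(assignment):
    """Return the target table of each child node; fail if a mandatory node is missing."""
    present = [child.tag for child in assignment]
    missing = [tag for tag in NODE_TO_TABLE if tag not in present]
    if missing:
        raise ValueError("missing mandatory nodes: " + ", ".join(missing))
    return [NODE_TO_TABLE[tag] for tag in present if tag in NODE_TO_TABLE]

# Build an empty patent-assignment element in memory for illustration.
assignment = ET.Element("patent-assignment")
for tag in NODE_TO_TABLE:
    ET.SubElement(assignment, tag)
print(tables_for(assignment))
```

Checking for the four mandatory nodes up front means a malformed record is rejected before any of its parts reach a table.&lt;br /&gt;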
&lt;br /&gt;
Additionally, for each file that is downloaded, there are some associated specs. All of these are stored in the PatentAssignment table. Here is the data model diagram.&lt;br /&gt;
&lt;br /&gt;
==== Assignment Records ====&lt;br /&gt;
&lt;br /&gt;
The fields in the assignment record are:&lt;br /&gt;
* last_update_date&lt;br /&gt;
* purge_indicator&lt;br /&gt;
* recorded_date&lt;br /&gt;
* correspondent_name&lt;br /&gt;
* correspondent_address_1&lt;br /&gt;
* correspondent_address_2&lt;br /&gt;
* correspondent_address_3&lt;br /&gt;
* correspondent_address_4&lt;br /&gt;
* conveyance_text&lt;br /&gt;
&lt;br /&gt;
Here is the corresponding XML that we are mapping:&lt;br /&gt;
 &lt;br /&gt;
   -&amp;lt;assignment-record&amp;gt;&lt;br /&gt;
       &amp;lt;reel-no&amp;gt;27132&amp;lt;/reel-no&amp;gt;&lt;br /&gt;
       &amp;lt;frame-no&amp;gt;841&amp;lt;/frame-no&amp;gt;&lt;br /&gt;
      -&amp;lt;last-update-date&amp;gt;&lt;br /&gt;
          &amp;lt;date&amp;gt;20160122&amp;lt;/date&amp;gt;&lt;br /&gt;
       &amp;lt;/last-update-date&amp;gt;&lt;br /&gt;
       &amp;lt;purge-indicator&amp;gt;N&amp;lt;/purge-indicator&amp;gt;&lt;br /&gt;
          -&amp;lt;recorded-date&amp;gt;&lt;br /&gt;
              &amp;lt;date&amp;gt;20111027&amp;lt;/date&amp;gt;&lt;br /&gt;
           &amp;lt;/recorded-date&amp;gt;&lt;br /&gt;
         &amp;lt;page-count&amp;gt;2&amp;lt;/page-count&amp;gt;&lt;br /&gt;
      -&amp;lt;correspondent&amp;gt;&lt;br /&gt;
           &amp;lt;name&amp;gt;DOUGLAS B. MCKNIGHT&amp;lt;/name&amp;gt;&lt;br /&gt;
           &amp;lt;address-1&amp;gt;595 MINER ROAD&amp;lt;/address-1&amp;gt;&lt;br /&gt;
           &amp;lt;address-2&amp;gt;INTELLECTUAL PROPERTY &amp;amp; STANDARDS&amp;lt;/address-2&amp;gt;&lt;br /&gt;
           &amp;lt;address-3&amp;gt;CLEVELAND, OH 44143&amp;lt;/address-3&amp;gt;&lt;br /&gt;
        &amp;lt;/correspondent&amp;gt;&lt;br /&gt;
        &amp;lt;conveyance-text&amp;gt;ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS).&amp;lt;/conveyance-text&amp;gt;&lt;br /&gt;
  &amp;lt;/assignment-record&amp;gt;&lt;br /&gt;
&lt;br /&gt;
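The mapping from the nested assignment-record element to the flat column list above can be sketched as follows. This is only an illustration in Python, not the Perl parser itself; the helper names text_of and flatten_assignment_record are hypothetical, and the sample element is built programmatically from a few of the values shown above.&lt;br /&gt;
&lt;br /&gt;
```python
import xml.etree.ElementTree as ET


def text_of(node, path):
    """Return the stripped text at `path` under `node`, or None if absent."""
    found = node.find(path)
    if found is None or found.text is None:
        return None
    return found.text.strip()


def flatten_assignment_record(record):
    """Map one assignment-record element to a dict keyed by column name."""
    row = {
        "last_update_date": text_of(record, "last-update-date/date"),
        "purge_indicator": text_of(record, "purge-indicator"),
        "recorded_date": text_of(record, "recorded-date/date"),
        "correspondent_name": text_of(record, "correspondent/name"),
        "conveyance_text": text_of(record, "conveyance-text"),
    }
    # address-1 .. address-4 are optional in the DTD, so missing ones become None.
    for i in range(1, 5):
        row[f"correspondent_address_{i}"] = text_of(record, f"correspondent/address-{i}")
    return row


# Build a small sample record (values copied from the XML shown above).
record = ET.Element("assignment-record")
ET.SubElement(ET.SubElement(record, "last-update-date"), "date").text = "20160122"
ET.SubElement(record, "purge-indicator").text = "N"
ET.SubElement(ET.SubElement(record, "recorded-date"), "date").text = "20111027"
correspondent = ET.SubElement(record, "correspondent")
ET.SubElement(correspondent, "name").text = "DOUGLAS B. MCKNIGHT"
ET.SubElement(correspondent, "address-1").text = "595 MINER ROAD"
ET.SubElement(record, "conveyance-text").text = "ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS)."

row = flatten_assignment_record(record)
print(row)
```
&lt;br /&gt;
Dates are kept as the raw YYYYMMDD strings here; converting them to a date type is left to the loading step.&lt;br /&gt;
&lt;br /&gt;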
&lt;br /&gt;
==== Assignors ====&lt;br /&gt;
&lt;br /&gt;
Here are the columns in the assignors table:&lt;br /&gt;
* reel_no&lt;br /&gt;
* frame_no&lt;br /&gt;
* assignor_name&lt;br /&gt;
* execution_date&lt;br /&gt;
&lt;br /&gt;
The corresponding XML node is:&lt;br /&gt;
&lt;br /&gt;
 -&amp;lt;patent-assignors&amp;gt;&lt;br /&gt;
    -&amp;lt;patent-assignor&amp;gt;&lt;br /&gt;
       &amp;lt;name&amp;gt;WALKER, MATTHEW J.&amp;lt;/name&amp;gt;&lt;br /&gt;
      -&amp;lt;execution-date&amp;gt;&lt;br /&gt;
          &amp;lt;date&amp;gt;20090512&amp;lt;/date&amp;gt;&lt;br /&gt;
       &amp;lt;/execution-date&amp;gt;&lt;br /&gt;
     &amp;lt;/patent-assignor&amp;gt;&lt;br /&gt;
    -&amp;lt;patent-assignor&amp;gt;&lt;br /&gt;
       &amp;lt;name&amp;gt;OLSZEWSKI, MARK E.&amp;lt;/name&amp;gt;&lt;br /&gt;
      -&amp;lt;execution-date&amp;gt;&lt;br /&gt;
          &amp;lt;date&amp;gt;20090512&amp;lt;/date&amp;gt;&lt;br /&gt;
       &amp;lt;/execution-date&amp;gt;&lt;br /&gt;
     &amp;lt;/patent-assignor&amp;gt;&lt;br /&gt;
   &amp;lt;/patent-assignors&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Assignees ====&lt;br /&gt;
&lt;br /&gt;
Here are the columns in the assignees table:&lt;br /&gt;
&lt;br /&gt;
* reel_no&lt;br /&gt;
* frame_no&lt;br /&gt;
* assignee_name&lt;br /&gt;
* assignee_address_1&lt;br /&gt;
* assignee_address_2&lt;br /&gt;
* assignee_city&lt;br /&gt;
* assignee_state&lt;br /&gt;
* assignee_country&lt;br /&gt;
* assignee_postcode&lt;br /&gt;
&lt;br /&gt;
The corresponding XML nodes are:&lt;br /&gt;
&lt;br /&gt;
  -&amp;lt;patent-assignees&amp;gt;&lt;br /&gt;
    -&amp;lt;patent-assignee&amp;gt;&lt;br /&gt;
        &amp;lt;name&amp;gt;KONINKLIJKE PHILIPS ELECTRONICS N V&amp;lt;/name&amp;gt;&lt;br /&gt;
        &amp;lt;address-1&amp;gt;GROENEWOUDSEWEG 1&amp;lt;/address-1&amp;gt;&lt;br /&gt;
        &amp;lt;city&amp;gt;EINDHOVEN&amp;lt;/city&amp;gt;&lt;br /&gt;
        &amp;lt;country-name&amp;gt;NETHERLANDS&amp;lt;/country-name&amp;gt;&lt;br /&gt;
        &amp;lt;postcode&amp;gt;5621 BA&amp;lt;/postcode&amp;gt;&lt;br /&gt;
      &amp;lt;/patent-assignee&amp;gt;&lt;br /&gt;
    &amp;lt;/patent-assignees&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Patent Properties ====&lt;br /&gt;
&lt;br /&gt;
Here are the columns in the properties table:&lt;br /&gt;
&lt;br /&gt;
* reel_no&lt;br /&gt;
* frame_no&lt;br /&gt;
* documentid&lt;br /&gt;
* country&lt;br /&gt;
* kind&lt;br /&gt;
* filingdate&lt;br /&gt;
* invention_title&lt;br /&gt;
&lt;br /&gt;
The corresponding XML segment would be:&lt;br /&gt;
&lt;br /&gt;
  -&amp;lt;patent-properties&amp;gt;&lt;br /&gt;
    -&amp;lt;patent-property&amp;gt;&lt;br /&gt;
      -&amp;lt;document-id&amp;gt;&lt;br /&gt;
          &amp;lt;country&amp;gt;US&amp;lt;/country&amp;gt;&lt;br /&gt;
          &amp;lt;doc-number&amp;gt;14143589&amp;lt;/doc-number&amp;gt;&lt;br /&gt;
          &amp;lt;kind&amp;gt;X0&amp;lt;/kind&amp;gt;&lt;br /&gt;
          &amp;lt;date&amp;gt;20131230&amp;lt;/date&amp;gt;&lt;br /&gt;
       &amp;lt;/document-id&amp;gt;&lt;br /&gt;
      -&amp;lt;document-id&amp;gt;&lt;br /&gt;
          &amp;lt;country&amp;gt;US&amp;lt;/country&amp;gt;&lt;br /&gt;
          &amp;lt;doc-number&amp;gt;20140260305&amp;lt;/doc-number&amp;gt;&lt;br /&gt;
          &amp;lt;kind&amp;gt;A1&amp;lt;/kind&amp;gt;&lt;br /&gt;
          &amp;lt;date&amp;gt;20140918&amp;lt;/date&amp;gt;&lt;br /&gt;
       &amp;lt;/document-id&amp;gt;&lt;br /&gt;
      &amp;lt;invention-title lang=&amp;quot;en&amp;quot;&amp;gt;LEAN AZIMUTHAL FLAME COMBUSTOR&amp;lt;/invention-title&amp;gt;&lt;br /&gt;
    &amp;lt;/patent-property&amp;gt;&lt;br /&gt;
  &amp;lt;/patent-properties&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Patent properties have a many-to-one relationship: one patent can have more than one property.&lt;br /&gt;
 Note: We are not sure what documents with kind 'X0' represent.&lt;br /&gt;
&lt;br /&gt;
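One way to picture how a patent-property with several document-id children could become rows in the properties table is sketched below. It assumes one row per document-id, with the invention title repeated on each row, and it assumes the date child feeds the filingdate column; neither assumption is confirmed by the parser source. The function name property_rows and the reel/frame values in the example call are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
import xml.etree.ElementTree as ET


def property_rows(reel_no, frame_no, patent_property):
    """Return one dict per document-id, each repeating the shared invention title."""
    title_node = patent_property.find("invention-title")
    # itertext() also collects text nested inside b/i/u/sup/sub formatting children.
    title = "".join(title_node.itertext()).strip() if title_node is not None else None
    rows = []
    for doc in patent_property.findall("document-id"):
        rows.append({
            "reel_no": reel_no,
            "frame_no": frame_no,
            "documentid": doc.findtext("doc-number"),
            "country": doc.findtext("country"),
            "kind": doc.findtext("kind"),        # optional in the DTD
            "filingdate": doc.findtext("date"),  # optional in the DTD; mapping is an assumption
            "invention_title": title,
        })
    return rows


def add_document_id(parent, country, number, kind, date):
    """Append a document-id child with the given field values."""
    doc = ET.SubElement(parent, "document-id")
    ET.SubElement(doc, "country").text = country
    ET.SubElement(doc, "doc-number").text = number
    ET.SubElement(doc, "kind").text = kind
    ET.SubElement(doc, "date").text = date


# Sample values copied from the XML segment above.
prop = ET.Element("patent-property")
add_document_id(prop, "US", "14143589", "X0", "20131230")
add_document_id(prop, "US", "20140260305", "A1", "20140918")
ET.SubElement(prop, "invention-title", {"lang": "en"}).text = "LEAN AZIMUTHAL FLAME COMBUSTOR"

rows = property_rows("27132", "841", prop)  # hypothetical reel/frame pairing
for r in rows:
    print(r["documentid"], r["kind"], r["filingdate"])
```
&lt;br /&gt;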
&lt;br /&gt;
==== Patent Assignment ====&lt;br /&gt;
&lt;br /&gt;
Every XML file download has some fields associated with it, in addition to a number of patent assignment nodes.&lt;br /&gt;
&lt;br /&gt;
Here are the columns in the table:&lt;br /&gt;
&lt;br /&gt;
* reel_no&lt;br /&gt;
* frame_no&lt;br /&gt;
* action_key_code&lt;br /&gt;
* USPTO_Transaction_Date&lt;br /&gt;
* USPTO_Date_Produced&lt;br /&gt;
* version&lt;br /&gt;
&lt;br /&gt;
Here is what the XML in a downloaded file looks like:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;?xml version=&amp;quot;1.0&amp;quot; encoding=&amp;quot;UTF-8&amp;quot;?&amp;gt;&lt;br /&gt;
  &amp;lt;!DOCTYPE us-patent-assignments&amp;gt;&lt;br /&gt;
 -&amp;lt;us-patent-assignments date-produced=&amp;quot;20131101&amp;quot; dtd-version=&amp;quot;1.0&amp;quot;&amp;gt;&lt;br /&gt;
     &amp;lt;action-key-code&amp;gt;DA&amp;lt;/action-key-code&amp;gt;&lt;br /&gt;
    -&amp;lt;transaction-date&amp;gt;&lt;br /&gt;
        &amp;lt;date&amp;gt;20160122&amp;lt;/date&amp;gt;&lt;br /&gt;
     &amp;lt;/transaction-date&amp;gt;&lt;br /&gt;
    -&amp;lt;patent-assignments&amp;gt;&lt;br /&gt;
        +&amp;lt;patent-assignment&amp;gt;&lt;br /&gt;
        +&amp;lt;patent-assignment&amp;gt;&lt;br /&gt;
        +&amp;lt;patent-assignment&amp;gt;&lt;br /&gt;
        +&amp;lt;patent-assignment&amp;gt;&lt;br /&gt;
        +&amp;lt;patent-assignment&amp;gt;&lt;br /&gt;
        +&amp;lt;patent-assignment&amp;gt;&lt;br /&gt;
             .&lt;br /&gt;
             .&lt;br /&gt;
             .&lt;br /&gt;
      &amp;lt;/patent-assignments&amp;gt;&lt;br /&gt;
  &amp;lt;/us-patent-assignments&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====DTD====&lt;br /&gt;
Here is the DTD specified by the USPTO, which indicates which fields are optional:&lt;br /&gt;
    &lt;br /&gt;
 &amp;lt;?xml version=&amp;quot;1.0&amp;quot; encoding=&amp;quot;utf-8&amp;quot;?&amp;gt; &lt;br /&gt;
 &amp;lt;!DOCTYPE us-patent-assignments [&amp;lt;!ELEMENT us-patent-assignments (action-key-code, transaction-date, patent-assignments)&amp;gt;&lt;br /&gt;
 &amp;lt;!ATTLIST us-patent-assignments  dtd-version   CDATA  #IMPLIED &lt;br /&gt;
 				 date-produced CDATA  #IMPLIED&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT action-key-code (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT transaction-date (date)&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT patent-assignments (data-available-code | patent-assignment+)&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT date (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT data-available-code (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT patent-assignment (assignment-record, patent-assignors, patent-assignees, patent-properties)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT assignment-record (reel-no, frame-no, last-update-date, purge-indicator, recorded-date, page-count?, correspondent, conveyance-text)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT patent-assignors (patent-assignor+)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT patent-assignees (patent-assignee+)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT patent-properties (patent-property+)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT reel-no (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT frame-no (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT last-update-date (date)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT purge-indicator (#PCDATA)&amp;gt;  &lt;br /&gt;
 &amp;lt;!ELEMENT recorded-date (date)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT page-count (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT correspondent (name, address-1?, address-2?, address-3?, address-4?)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT conveyance-text (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT patent-assignor (name, execution-date?, date-acknowledged?)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT patent-assignee (name, address-1?, address-2?, city?, state?, country-name?, postcode?)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT patent-property (document-id*, invention-title?)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT name (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ATTLIST name name-type (natural | legal)  #IMPLIED&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT address-1 (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT address-2 (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT address-3 (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT address-4 (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT execution-date (date)&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT date-acknowledged (date)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT city (#PCDATA)&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT state (#PCDATA)&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT country-name (#PCDATA)&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT postcode (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT document-id (country, doc-number, kind?, name?, date?)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT invention-title (#PCDATA | b | i | u | sup | sub)*&amp;gt;&lt;br /&gt;
 &amp;lt;!ATTLIST invention-title  id   ID     #IMPLIED &lt;br /&gt;
 			   lang CDATA  #REQUIRED&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT country (#PCDATA)&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT doc-number (#PCDATA)&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT kind (#PCDATA)&amp;gt;&lt;br /&gt;
 &amp;lt;!--bold formatting for text--&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT b (#PCDATA | i | u | smallcaps)*&amp;gt;&lt;br /&gt;
 &amp;lt;!--italic formatting for text--&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT i (#PCDATA | b | u | smallcaps)*&amp;gt;&lt;br /&gt;
 &amp;lt;!--underscore: style - single is default--&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT u (#PCDATA | b | i | smallcaps)*&amp;gt;&lt;br /&gt;
 &amp;lt;!ATTLIST u  style  (single | double | dash | dots )  'single' &amp;gt;&lt;br /&gt;
 &amp;lt;!--superscripted text--&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT sup (#PCDATA | b | u | i)*&amp;gt;&lt;br /&gt;
 &amp;lt;!--subscripted text--&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT sub (#PCDATA | b | u | i)*&amp;gt;&lt;br /&gt;
 &amp;lt;!--small capitals--&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT smallcaps (#PCDATA | b | u | i)*&amp;gt;&lt;br /&gt;
 ]&amp;gt;&lt;br /&gt;
&lt;br /&gt;
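The occurrence suffixes in the DTD decide which columns may be empty: no suffix means exactly one, "+" means one or more, "?" means zero or one, and "*" means zero or more. A minimal sketch of reading that off a content model is given below. The function name split_content_model is hypothetical, and the sketch only handles comma-separated sequences, not choice groups written with "|".&lt;br /&gt;
&lt;br /&gt;
```python
def split_content_model(model):
    """Split a comma-separated DTD content model into required and optional child names."""
    required, optional = [], []
    for item in model.split(","):
        name = item.strip()
        if name.endswith("?") or name.endswith("*"):
            # "?" (zero or one) and "*" (zero or more) both allow the child to be absent.
            optional.append(name[:-1])
        else:
            # No suffix means exactly one; "+" means one or more. Both are required.
            required.append(name.rstrip("+"))
    return required, optional


# Content model of patent-assignee, copied from the DTD above.
assignee_model = "name, address-1?, address-2?, city?, state?, country-name?, postcode?"
required, optional = split_content_model(assignee_model)
print("required:", required)
print("optional:", optional)
```
&lt;br /&gt;
For patent-assignee this shows that only the name is guaranteed, so every address column in the assignees table has to accept missing values.&lt;br /&gt;
&lt;br /&gt;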
&lt;br /&gt;
===Inserting Extracted Data into Tables ===&lt;br /&gt;
&lt;br /&gt;
===Clean Up ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Scripts for processing data ==&lt;br /&gt;
The programs/scripts (see details below) are located on our [[Software Repository|Bonobo Git Server]]. &lt;br /&gt;
 repository: Patent_Data_Parser &lt;br /&gt;
 branch: next&lt;br /&gt;
 directory: /uspto_assignees_xml_parser&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Downloading raw bulk data from USPTO ===&lt;br /&gt;
 repository: Patent_Data_Parser &lt;br /&gt;
 branch: next&lt;br /&gt;
 directory: /uspto_assignees_xml_parser&lt;br /&gt;
 file: USPTO_Assignee_Download.pl&lt;br /&gt;
&lt;br /&gt;
The downloader script used to fetch the XML files is essentially the same, with minor changes, as the one used for downloading USPTO patent data.&lt;br /&gt;
The current version of the downloader script downloads all files from the base URL: https://bulkdata.uspto.gov/data2/patent/assignment/&lt;br /&gt;
&lt;br /&gt;
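The core of such a downloader is turning the index page at the base URL into a list of file URLs. The Python sketch below illustrates that step only and is not the Perl script itself. It assumes the index page lists the archives as plain href links ending in .zip; the function name zip_links and the sample file names are hypothetical, and no network request is made.&lt;br /&gt;
&lt;br /&gt;
```python
import re
from urllib.parse import urljoin

BASE_URL = "https://bulkdata.uspto.gov/data2/patent/assignment/"


def zip_links(index_html, base_url=BASE_URL):
    """Return absolute URLs for every .zip href in an index page, in page order."""
    hrefs = re.findall(r'href="([^"]+\.zip)"', index_html)
    urls = []
    for href in hrefs:
        url = urljoin(base_url, href)
        if url not in urls:  # skip links that appear more than once on the page
            urls.append(url)
    return urls


# A stand-in for the fetched index page text (hypothetical file names).
sample_index = 'href="example-1.zip" href="notes.txt" href="example-2.zip"'
for url in zip_links(sample_index):
    print(url)
```
&lt;br /&gt;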
=== Parsing the XML files ===&lt;br /&gt;
 repository: Patent_Data_Parser &lt;br /&gt;
 branch: next&lt;br /&gt;
 directory: /uspto_assignees_xml_parser&lt;br /&gt;
 file: uspto_assignees_XML_parser.plx&lt;br /&gt;
&lt;br /&gt;
==== NAME ====&lt;br /&gt;
&lt;br /&gt;
uspto_assignees_XML_parser.plx - Parses XML files and populates a database.&lt;br /&gt;
&lt;br /&gt;
Specifically, parses every file in a directory according to a schema (see above).&lt;br /&gt;
Then populates a database on the RDP. &lt;br /&gt;
&lt;br /&gt;
==== SYNOPSIS ====&lt;br /&gt;
&lt;br /&gt;
perl uspto_assignees_XML_parser.plx /path/to/directory_containing_XML_files&lt;br /&gt;
&lt;br /&gt;
==== USAGE &amp;amp; FEATURES ====&lt;br /&gt;
&lt;br /&gt;
'''Arguments'''&lt;br /&gt;
The full path to the directory is provided as a command-line argument. The directory should contain the XML files to parse and no other files.&lt;br /&gt;
This path must be specified in Windows format (with '\') and NOT Unix format.&lt;br /&gt;
&lt;br /&gt;
'''Features and Effects'''&lt;br /&gt;
As each XML file is parsed, a database on the local host (RDP) is populated. If an error occurs at any point, for example because a particular&lt;br /&gt;
XML file is bad/invalid or a psql statement cannot be executed, the program aborts with a message.&lt;br /&gt;
&lt;br /&gt;
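The abort-on-first-error behaviour can be sketched as below. This is a Python illustration, not the Perl parser: the function name parse_directory is hypothetical, the database step is left out, and the sketch only counts the patent-assignment nodes in each file.&lt;br /&gt;
&lt;br /&gt;
```python
import xml.etree.ElementTree as ET
from pathlib import Path


def parse_directory(directory):
    """Parse every file in `directory`; return (file name, assignment count) pairs.

    Stops the program with a message as soon as one file is not well-formed XML.
    """
    results = []
    for path in sorted(Path(directory).iterdir()):
        try:
            root = ET.parse(str(path)).getroot()
        except ET.ParseError as err:
            # Abort immediately rather than skipping the bad file.
            raise SystemExit(f"Aborting: {path.name} is not well-formed XML ({err})")
        # The root is us-patent-assignments; assignments sit one level below it.
        count = len(root.findall("patent-assignments/patent-assignment"))
        results.append((path.name, count))
    return results
```
&lt;br /&gt;
Because pathlib is used for the directory listing, the same call accepts both Windows and Unix style paths.&lt;br /&gt;
&lt;br /&gt;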
We chose to populate a local database because remote connections are too slow. The database is eventually moved to the database server manually.&lt;br /&gt;
&lt;br /&gt;
==== TESTS ====&lt;br /&gt;
The first version works as expected. It was used to populate the assignees database by parsing XML files from the USPTO (see above).&lt;br /&gt;
We parsed all XML files dated up to 7/4/2016.&lt;br /&gt;
&lt;br /&gt;
==== TO DO ====&lt;br /&gt;
*Add more command line options to improve usability.&lt;br /&gt;
*Improve portability to allow Unix/Linux pathnames. This is straightforward to do with Perl modules File::Basename and File::Spec.&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Bulk_Patent_Assignee_Processing&amp;diff=4645</id>
		<title>Bulk Patent Assignee Processing</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Bulk_Patent_Assignee_Processing&amp;diff=4645"/>
		<updated>2016-07-08T15:55:55Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: /* Scripts for processing data */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== USPTO Assignees Data ==&lt;br /&gt;
&lt;br /&gt;
We would like to download and load data from this location on the USPTO website into our tables. The objective is to determine whether this dataset is better than the current version of our patent data (a combination of the data in the patent_2015 and patentdata databases).&lt;br /&gt;
&lt;br /&gt;
== Steps Followed to Extract the Data ==&lt;br /&gt;
&lt;br /&gt;
===Extracting Data from XML Files ===&lt;br /&gt;
&lt;br /&gt;
All the historical USPTO data is available as XML files. Here is the tree structure for the XML files:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;patent-assignment&amp;gt;&lt;br /&gt;
        +&amp;lt;assignment-record&amp;gt;&lt;br /&gt;
        +&amp;lt;patent-assignors&amp;gt;&lt;br /&gt;
        +&amp;lt;patent-assignees&amp;gt;&lt;br /&gt;
        +&amp;lt;patent-properties&amp;gt;&lt;br /&gt;
 &amp;lt;/patent-assignment&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Each of the above internal nodes is mandatory, and is a logical grouping of information fields. Each node has a corresponding table created with more or less the same fields as the XML elements.&lt;br /&gt;
&lt;br /&gt;
Corresponding tables are:&lt;br /&gt;
*assignment-record : assignment&lt;br /&gt;
*patent-assignors : assignors&lt;br /&gt;
*patent-assignees : assignees&lt;br /&gt;
*patent-properties : properties&lt;br /&gt;
&lt;br /&gt;
Additionally, for each file that is downloaded, there are some associated specs. All of these are stored in the PatentAssignment table. Here is the data model diagram.&lt;br /&gt;
&lt;br /&gt;
==== Assignment Records ====&lt;br /&gt;
&lt;br /&gt;
The fields in the assignment record are:&lt;br /&gt;
* last_update_date&lt;br /&gt;
* purge_indicator&lt;br /&gt;
* recorded_date&lt;br /&gt;
* correspondent_name&lt;br /&gt;
* correspondent_address_1&lt;br /&gt;
* correspondent_address_2&lt;br /&gt;
* correspondent_address_3&lt;br /&gt;
* correspondent_address_4&lt;br /&gt;
* conveyance_text&lt;br /&gt;
&lt;br /&gt;
Here is the corresponding XML that we are mapping:&lt;br /&gt;
 &lt;br /&gt;
   -&amp;lt;assignment-record&amp;gt;&lt;br /&gt;
       &amp;lt;reel-no&amp;gt;27132&amp;lt;/reel-no&amp;gt;&lt;br /&gt;
       &amp;lt;frame-no&amp;gt;841&amp;lt;/frame-no&amp;gt;&lt;br /&gt;
      -&amp;lt;last-update-date&amp;gt;&lt;br /&gt;
          &amp;lt;date&amp;gt;20160122&amp;lt;/date&amp;gt;&lt;br /&gt;
       &amp;lt;/last-update-date&amp;gt;&lt;br /&gt;
       &amp;lt;purge-indicator&amp;gt;N&amp;lt;/purge-indicator&amp;gt;&lt;br /&gt;
          -&amp;lt;recorded-date&amp;gt;&lt;br /&gt;
              &amp;lt;date&amp;gt;20111027&amp;lt;/date&amp;gt;&lt;br /&gt;
           &amp;lt;/recorded-date&amp;gt;&lt;br /&gt;
         &amp;lt;page-count&amp;gt;2&amp;lt;/page-count&amp;gt;&lt;br /&gt;
      -&amp;lt;correspondent&amp;gt;&lt;br /&gt;
           &amp;lt;name&amp;gt;DOUGLAS B. MCKNIGHT&amp;lt;/name&amp;gt;&lt;br /&gt;
           &amp;lt;address-1&amp;gt;595 MINER ROAD&amp;lt;/address-1&amp;gt;&lt;br /&gt;
           &amp;lt;address-2&amp;gt;INTELLECTUAL PROPERTY &amp;amp; STANDARDS&amp;lt;/address-2&amp;gt;&lt;br /&gt;
           &amp;lt;address-3&amp;gt;CLEVELAND, OH 44143&amp;lt;/address-3&amp;gt;&lt;br /&gt;
        &amp;lt;/correspondent&amp;gt;&lt;br /&gt;
        &amp;lt;conveyance-text&amp;gt;ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS).&amp;lt;/conveyance-text&amp;gt;&lt;br /&gt;
  &amp;lt;/assignment-record&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Assignors ====&lt;br /&gt;
&lt;br /&gt;
Here are the columns in the assignors table:&lt;br /&gt;
* reel_no&lt;br /&gt;
* frame_no&lt;br /&gt;
* assignor_name&lt;br /&gt;
* execution_date&lt;br /&gt;
&lt;br /&gt;
The corresponding XML node is:&lt;br /&gt;
&lt;br /&gt;
 -&amp;lt;patent-assignors&amp;gt;&lt;br /&gt;
    -&amp;lt;patent-assignor&amp;gt;&lt;br /&gt;
       &amp;lt;name&amp;gt;WALKER, MATTHEW J.&amp;lt;/name&amp;gt;&lt;br /&gt;
      -&amp;lt;execution-date&amp;gt;&lt;br /&gt;
          &amp;lt;date&amp;gt;20090512&amp;lt;/date&amp;gt;&lt;br /&gt;
       &amp;lt;/execution-date&amp;gt;&lt;br /&gt;
     &amp;lt;/patent-assignor&amp;gt;&lt;br /&gt;
    -&amp;lt;patent-assignor&amp;gt;&lt;br /&gt;
       &amp;lt;name&amp;gt;OLSZEWSKI, MARK E.&amp;lt;/name&amp;gt;&lt;br /&gt;
      -&amp;lt;execution-date&amp;gt;&lt;br /&gt;
          &amp;lt;date&amp;gt;20090512&amp;lt;/date&amp;gt;&lt;br /&gt;
       &amp;lt;/execution-date&amp;gt;&lt;br /&gt;
     &amp;lt;/patent-assignor&amp;gt;&lt;br /&gt;
   &amp;lt;/patent-assignors&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Assignees ====&lt;br /&gt;
&lt;br /&gt;
Here are the columns in the assignees table:&lt;br /&gt;
&lt;br /&gt;
* reel_no&lt;br /&gt;
* frame_no&lt;br /&gt;
* assignee_name&lt;br /&gt;
* assignee_address_1&lt;br /&gt;
* assignee_address_2&lt;br /&gt;
* assignee_city&lt;br /&gt;
* assignee_state&lt;br /&gt;
* assignee_country&lt;br /&gt;
* assignee_postcode&lt;br /&gt;
&lt;br /&gt;
The corresponding XML nodes are:&lt;br /&gt;
&lt;br /&gt;
  -&amp;lt;patent-assignees&amp;gt;&lt;br /&gt;
    -&amp;lt;patent-assignee&amp;gt;&lt;br /&gt;
        &amp;lt;name&amp;gt;KONINKLIJKE PHILIPS ELECTRONICS N V&amp;lt;/name&amp;gt;&lt;br /&gt;
        &amp;lt;address-1&amp;gt;GROENEWOUDSEWEG 1&amp;lt;/address-1&amp;gt;&lt;br /&gt;
        &amp;lt;city&amp;gt;EINDHOVEN&amp;lt;/city&amp;gt;&lt;br /&gt;
        &amp;lt;country-name&amp;gt;NETHERLANDS&amp;lt;/country-name&amp;gt;&lt;br /&gt;
        &amp;lt;postcode&amp;gt;5621 BA&amp;lt;/postcode&amp;gt;&lt;br /&gt;
      &amp;lt;/patent-assignee&amp;gt;&lt;br /&gt;
    &amp;lt;/patent-assignees&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Patent Properties ====&lt;br /&gt;
&lt;br /&gt;
Here are the columns in the properties table:&lt;br /&gt;
&lt;br /&gt;
* reel_no&lt;br /&gt;
* frame_no&lt;br /&gt;
* documentid&lt;br /&gt;
* country&lt;br /&gt;
* kind&lt;br /&gt;
* filingdate&lt;br /&gt;
* invention_title&lt;br /&gt;
&lt;br /&gt;
The corresponding XML segment would be:&lt;br /&gt;
&lt;br /&gt;
  -&amp;lt;patent-properties&amp;gt;&lt;br /&gt;
    -&amp;lt;patent-property&amp;gt;&lt;br /&gt;
      -&amp;lt;document-id&amp;gt;&lt;br /&gt;
          &amp;lt;country&amp;gt;US&amp;lt;/country&amp;gt;&lt;br /&gt;
          &amp;lt;doc-number&amp;gt;14143589&amp;lt;/doc-number&amp;gt;&lt;br /&gt;
          &amp;lt;kind&amp;gt;X0&amp;lt;/kind&amp;gt;&lt;br /&gt;
          &amp;lt;date&amp;gt;20131230&amp;lt;/date&amp;gt;&lt;br /&gt;
       &amp;lt;/document-id&amp;gt;&lt;br /&gt;
      -&amp;lt;document-id&amp;gt;&lt;br /&gt;
          &amp;lt;country&amp;gt;US&amp;lt;/country&amp;gt;&lt;br /&gt;
          &amp;lt;doc-number&amp;gt;20140260305&amp;lt;/doc-number&amp;gt;&lt;br /&gt;
          &amp;lt;kind&amp;gt;A1&amp;lt;/kind&amp;gt;&lt;br /&gt;
          &amp;lt;date&amp;gt;20140918&amp;lt;/date&amp;gt;&lt;br /&gt;
       &amp;lt;/document-id&amp;gt;&lt;br /&gt;
      &amp;lt;invention-title lang=&amp;quot;en&amp;quot;&amp;gt;LEAN AZIMUTHAL FLAME COMBUSTOR&amp;lt;/invention-title&amp;gt;&lt;br /&gt;
    &amp;lt;/patent-property&amp;gt;&lt;br /&gt;
  &amp;lt;/patent-properties&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Patent properties have a many-to-one relationship: one patent can have more than one property.&lt;br /&gt;
 Note: We are not sure what documents with kind 'X0' represent.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Patent Assignment ====&lt;br /&gt;
&lt;br /&gt;
Every XML file download has some fields associated with it, in addition to a number of patent assignment nodes.&lt;br /&gt;
&lt;br /&gt;
Here are the columns in the table:&lt;br /&gt;
&lt;br /&gt;
* reel_no&lt;br /&gt;
* frame_no&lt;br /&gt;
* action_key_code&lt;br /&gt;
* USPTO_Transaction_Date&lt;br /&gt;
* USPTO_Date_Produced&lt;br /&gt;
* version&lt;br /&gt;
&lt;br /&gt;
Here is what the XML in a downloaded file looks like:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;?xml version=&amp;quot;1.0&amp;quot; encoding=&amp;quot;UTF-8&amp;quot;?&amp;gt;&lt;br /&gt;
  &amp;lt;!DOCTYPE us-patent-assignments&amp;gt;&lt;br /&gt;
 -&amp;lt;us-patent-assignments date-produced=&amp;quot;20131101&amp;quot; dtd-version=&amp;quot;1.0&amp;quot;&amp;gt;&lt;br /&gt;
     &amp;lt;action-key-code&amp;gt;DA&amp;lt;/action-key-code&amp;gt;&lt;br /&gt;
    -&amp;lt;transaction-date&amp;gt;&lt;br /&gt;
        &amp;lt;date&amp;gt;20160122&amp;lt;/date&amp;gt;&lt;br /&gt;
     &amp;lt;/transaction-date&amp;gt;&lt;br /&gt;
    -&amp;lt;patent-assignments&amp;gt;&lt;br /&gt;
        +&amp;lt;patent-assignment&amp;gt;&lt;br /&gt;
        +&amp;lt;patent-assignment&amp;gt;&lt;br /&gt;
        +&amp;lt;patent-assignment&amp;gt;&lt;br /&gt;
        +&amp;lt;patent-assignment&amp;gt;&lt;br /&gt;
        +&amp;lt;patent-assignment&amp;gt;&lt;br /&gt;
        +&amp;lt;patent-assignment&amp;gt;&lt;br /&gt;
             .&lt;br /&gt;
             .&lt;br /&gt;
             .&lt;br /&gt;
      &amp;lt;/patent-assignments&amp;gt;&lt;br /&gt;
  &amp;lt;/us-patent-assignments&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====DTD====&lt;br /&gt;
Here is the DTD specified by the USPTO, which indicates which fields are optional:&lt;br /&gt;
    &lt;br /&gt;
 &amp;lt;?xml version=&amp;quot;1.0&amp;quot; encoding=&amp;quot;utf-8&amp;quot;?&amp;gt; &lt;br /&gt;
 &amp;lt;!DOCTYPE us-patent-assignments [&amp;lt;!ELEMENT us-patent-assignments (action-key-code, transaction-date, patent-assignments)&amp;gt;&lt;br /&gt;
 &amp;lt;!ATTLIST us-patent-assignments  dtd-version   CDATA  #IMPLIED &lt;br /&gt;
 				 date-produced CDATA  #IMPLIED&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT action-key-code (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT transaction-date (date)&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT patent-assignments (data-available-code | patent-assignment+)&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT date (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT data-available-code (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT patent-assignment (assignment-record, patent-assignors, patent-assignees, patent-properties)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT assignment-record (reel-no, frame-no, last-update-date, purge-indicator, recorded-date, page-count?, correspondent, conveyance-text)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT patent-assignors (patent-assignor+)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT patent-assignees (patent-assignee+)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT patent-properties (patent-property+)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT reel-no (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT frame-no (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT last-update-date (date)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT purge-indicator (#PCDATA)&amp;gt;  &lt;br /&gt;
 &amp;lt;!ELEMENT recorded-date (date)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT page-count (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT correspondent (name, address-1?, address-2?, address-3?, address-4?)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT conveyance-text (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT patent-assignor (name, execution-date?, date-acknowledged?)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT patent-assignee (name, address-1?, address-2?, city?, state?, country-name?, postcode?)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT patent-property (document-id*, invention-title?)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT name (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ATTLIST name name-type (natural | legal)  #IMPLIED&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT address-1 (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT address-2 (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT address-3 (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT address-4 (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT execution-date (date)&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT date-acknowledged (date)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT city (#PCDATA)&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT state (#PCDATA)&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT country-name (#PCDATA)&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT postcode (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT document-id (country, doc-number, kind?, name?, date?)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT invention-title (#PCDATA | b | i | u | sup | sub)*&amp;gt;&lt;br /&gt;
 &amp;lt;!ATTLIST invention-title  id   ID     #IMPLIED &lt;br /&gt;
 			   lang CDATA  #REQUIRED&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT country (#PCDATA)&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT doc-number (#PCDATA)&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT kind (#PCDATA)&amp;gt;&lt;br /&gt;
 &amp;lt;!--bold formatting for text--&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT b (#PCDATA | i | u | smallcaps)*&amp;gt;&lt;br /&gt;
 &amp;lt;!--italic formatting for text--&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT i (#PCDATA | b | u | smallcaps)*&amp;gt;&lt;br /&gt;
 &amp;lt;!--underscore: style - single is default--&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT u (#PCDATA | b | i | smallcaps)*&amp;gt;&lt;br /&gt;
 &amp;lt;!ATTLIST u  style  (single | double | dash | dots )  'single' &amp;gt;&lt;br /&gt;
 &amp;lt;!--superscripted text--&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT sup (#PCDATA | b | u | i)*&amp;gt;&lt;br /&gt;
 &amp;lt;!--subscripted text--&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT sub (#PCDATA | b | u | i)*&amp;gt;&lt;br /&gt;
 &amp;lt;!--small capitals--&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT smallcaps (#PCDATA | b | u | i)*&amp;gt;&lt;br /&gt;
 ]&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Inserting Extracted Data into Tables ===&lt;br /&gt;
&lt;br /&gt;
===Clean Up ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Scripts for processing data ==&lt;br /&gt;
The programs/scripts are located on our [[Software Repository|Bonobo Git Server]].&lt;br /&gt;
 repository: Patent_Data_Parser &lt;br /&gt;
 branch: next&lt;br /&gt;
 directory: /uspto_assignees_xml_parser&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Downloading raw bulk data from USPTO ===&lt;br /&gt;
 repository: Patent_Data_Parser &lt;br /&gt;
 branch: next&lt;br /&gt;
 directory: /uspto_assignees_xml_parser&lt;br /&gt;
 file: USPTO_Assignee_Download.pl&lt;br /&gt;
&lt;br /&gt;
The downloader script used to fetch the XML files is essentially the same, with minor changes, as the one used for downloading USPTO patent data.&lt;br /&gt;
The current version of the downloader script downloads all files from the base URL: https://bulkdata.uspto.gov/data2/patent/assignment/&lt;br /&gt;
&lt;br /&gt;
=== Parsing the XML files ===&lt;br /&gt;
 repository: Patent_Data_Parser &lt;br /&gt;
 branch: next&lt;br /&gt;
 directory: /uspto_assignees_xml_parser&lt;br /&gt;
 file: uspto_assignees_XML_parser.plx&lt;br /&gt;
&lt;br /&gt;
==== NAME ====&lt;br /&gt;
&lt;br /&gt;
uspto_assignees_XML_parser.plx - Parses XML files and populates a database.&lt;br /&gt;
&lt;br /&gt;
Specifically, parses every file in a directory according to a schema (see above).&lt;br /&gt;
Then populates a database on the RDP. &lt;br /&gt;
&lt;br /&gt;
==== SYNOPSIS ====&lt;br /&gt;
&lt;br /&gt;
perl uspto_assignees_XML_parser.plx /path/to/directory_containing_XML_files&lt;br /&gt;
&lt;br /&gt;
==== USAGE &amp;amp; FEATURES ====&lt;br /&gt;
&lt;br /&gt;
'''Arguments'''&lt;br /&gt;
The full path to the directory is provided as a command-line argument. The directory should contain the XML files to parse and no other files.&lt;br /&gt;
This path must be specified in Windows format (with '\') and NOT Unix format.&lt;br /&gt;
&lt;br /&gt;
'''Features and Effects'''&lt;br /&gt;
As each XML file is parsed, a database on the local host (RDP) is populated. If an error occurs at any point, for example because a particular&lt;br /&gt;
XML file is bad/invalid or a psql statement cannot be executed, the program aborts with a message.&lt;br /&gt;
&lt;br /&gt;
We chose to populate a local database because remote connections are too slow. The database is eventually moved to the database server manually.&lt;br /&gt;
&lt;br /&gt;
==== TESTS ====&lt;br /&gt;
The first version works as expected. It was used to populate the assignees database by parsing XML files from the USPTO (see above).&lt;br /&gt;
We parsed all XML files dated up to 7/4/2016.&lt;br /&gt;
&lt;br /&gt;
==== TO DO ====&lt;br /&gt;
*Add more command line options to improve usability.&lt;br /&gt;
*Improve portability to allow Unix/Linux pathnames. This is straightforward to do with Perl modules File::Basename and File::Spec.&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Software_Repository&amp;diff=4641</id>
		<title>Software Repository</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Software_Repository&amp;diff=4641"/>
		<updated>2016-07-08T15:50:22Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: /* Our Git workflow */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category: McNair Admin]]&lt;br /&gt;
&lt;br /&gt;
==Background==&lt;br /&gt;
Given the amount of software that has been written by past computer science interns and more being written, we felt the need to have some kind of source code management system put into place so that developers can work without ever being in fear of breaking production and facing Ed's wrath (you do not want that dude angry! Wherever you go, he will find you! No escape.).&lt;br /&gt;
&lt;br /&gt;
To enforce efficient source control we (Ed) chose to host our own git server on the RDP machine using [https://bonobogitserver.com/ Bonobo Git Server], which is open source and runs on the Windows IIS platform.&lt;br /&gt;
&lt;br /&gt;
Installing Bonobo git server is pretty simple:&lt;br /&gt;
* download the zip file from the Bonobo website.&lt;br /&gt;
* extract its contents. It should be a single folder containing directories like App_Data, bin etc.&lt;br /&gt;
* rename that folder to anything you want. I used the name &amp;quot;codebase&amp;quot;&lt;br /&gt;
* copy the codebase folder to C:\inetpub\wwwroot\&lt;br /&gt;
* Allow IIS User to modify C:\inetpub\wwwroot\codebase\App_Data folder. To do so:&lt;br /&gt;
**select Properties of App_Data folder,&lt;br /&gt;
**go to Security tab,&lt;br /&gt;
**click edit,&lt;br /&gt;
**select IIS user (in my case IIS_IUSRS) and add Modify and Write permission,&lt;br /&gt;
**confirm these settings with Apply button.&lt;br /&gt;
*Convert ''codebase'' to Application in IIS&lt;br /&gt;
**Run IIS Manager and navigate to Sites -&amp;gt; Default Web Site. You should see Bonobo.Git.Server.&lt;br /&gt;
**Right click on 'codebase' and convert to application.&lt;br /&gt;
**Check if the selected application pool runs on .NET 4.0 and convert the site.&lt;br /&gt;
*Enable Anonymous Authentication in IIS and disable the others. To do so, select the application in the left pane, double-click on the Authentication icon in the right pane and set the value of Anonymous Authentication to Enabled.&lt;br /&gt;
*Launch your browser and go to http://localhost/codebase. Now you can see the initial page of the Bonobo Git Server and everything should work.&lt;br /&gt;
**default credentials are ''username'': '''admin''', ''password'': '''admin'''&lt;br /&gt;
**[6-22-2016]: Can also use https://localhost/codebase, which is preferable; otherwise usernames and passwords are transmitted in plain text. The browser will show a security error because we have a self-signed certificate. This is OK if we are restricted to the intranet. If we want to allow public access, we probably need to get a certificate from a Certificate Authority such as Verisign.&lt;br /&gt;
&lt;br /&gt;
==Our Git Server==&lt;br /&gt;
We have already set up the git server on the RDP machine. Here are the admin credentials:&lt;br /&gt;
*Username: '''boss'''&lt;br /&gt;
*Name: '''Ed'''&lt;br /&gt;
*Surname: '''Egan'''&lt;br /&gt;
*Email: '''Edward.Egan@rice.edu'''&lt;br /&gt;
*Password: '''you_seriously_thought_Id_write_that_in_here??'''&lt;br /&gt;
&lt;br /&gt;
To access this from your computer and not the RDP you can go to http://128.42.44.182/codebase where it will prompt you for your username and password.&lt;br /&gt;
**[6-22-2016]: Can also use https://128.42.44.182/codebase, which is preferable. The browser will show a security error because we have a self-signed certificate. This is OK if we are restricted to the intranet. If we want to allow public access, we probably need to get a certificate from a Certificate Authority such as Verisign.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Our Git workflow==&lt;br /&gt;
We chose a simple git workflow.&lt;br /&gt;
&lt;br /&gt;
Our aim is not to break things in the master branch. All commits on the master should work.&lt;br /&gt;
&lt;br /&gt;
 1.&lt;br /&gt;
 When adding a new feature or fixing a bug, ALWAYS check out a new feature branch from the master.&lt;br /&gt;
 NEVER checkout a feature branch from next (see below). The feature branch should be named user/feature_name. &lt;br /&gt;
&lt;br /&gt;
 2.&lt;br /&gt;
 After feature development is complete, merge your feature branch into next.&lt;br /&gt;
&lt;br /&gt;
 3.&lt;br /&gt;
 The next branch is intended for testing and confirming things do not break. So, after feature branches are merged into next and conflicts resolved, we merge into master.&lt;br /&gt;
 After this, you can end the feature branches if you want.&lt;br /&gt;
&lt;br /&gt;
==Quick and dirty git tutorial==&lt;br /&gt;
 For a cool interactive tutorial see http://learngitbranching.js.org/.&lt;br /&gt;
&lt;br /&gt;
 ***&lt;br /&gt;
 You can also use SourceTree, a GUI git client. It is installed on the RDP.&lt;br /&gt;
 Like using git from the CLI (see below), SourceTree constructs the appropriate commands. The good thing&lt;br /&gt;
 is that it automatically adds the error-checking and logging options to each command, which are difficult&lt;br /&gt;
 to recall from memory. SourceTree is freely available from Atlassian at https://www.sourcetreeapp.com/&lt;br /&gt;
 ***&lt;br /&gt;
&lt;br /&gt;
 To use SourceTree you should have a basic understanding of git (branches, commits, etc.). The interactive tutorial above is very good for this purpose.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*''Installing - '' Depending on your operating system you can install git in different ways:&lt;br /&gt;
** If you are a Windows or Mac user, you can simply download &amp;amp; install the latest release from the [https://git-scm.com/ git scm website]&lt;br /&gt;
** If you use ubuntu then all you need to do is type &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;sudo apt-get install git&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
* Check your installation by typing &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt; in a terminal or Windows PowerShell&lt;br /&gt;
*Basic git operations:&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 2em;&amp;quot;&amp;gt;&lt;br /&gt;
* to checkout code from remote repository, use the &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git clone&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt; command. This will create a local repository on your disk as well as download the source code of the project you wish to work on. Here's an example:&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 5em;&amp;quot;&amp;gt;&amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git clone http://128.42.44.182/codebase/Matcher.git&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* to update your repository to include others' work in your project use the &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git pull&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt; command. It is good practice to update your code before you commit, so that you find out early whether others' code breaks yours. Also, git rejects a push if the remote branch has commits that your local branch lacks. Committing on a stale local repository is fine, but the longer you wait the more conflicts you are likely to face when you eventually pull and merge. See example:&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 5em;&amp;quot;&amp;gt;&amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git pull&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* to commit your changes to your local repository use the &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git commit&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt; command. Committing your changes is an essential step whether you are adding/removing items from the repository or changing existing items. See example :&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 5em;&amp;quot;&amp;gt;&amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git commit -m &amp;quot;mandatory commit message&amp;quot;&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* to push your changes to the remote repository use the &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git push&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt; command, optionally followed by a remote and a branch name, as in &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git push origin master&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt;. Whatever you need pushed to the server must be committed to your local repository first. Note that git pushes commits, not individual files or folders.&lt;br /&gt;
&lt;br /&gt;
* to add new files to your repository use the &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git add&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt; command. After that you must commit to ensure that your repository actually has the new file. See example:&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 5em;&amp;quot;&amp;gt;&amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git add &amp;lt;filename/folder name&amp;gt;&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* to remove items from your repository use the &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git rm&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt; command. It deletes the file from your working directory and stages the removal; you then commit so that the change is persisted in your repository. Finally, you push to the server so that the items are removed there as well and nobody in your team works under the assumption that they are still there. See example:&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 5em;&amp;quot;&amp;gt;&amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git rm &amp;lt;filename&amp;gt;&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 2em;&amp;quot;&amp;gt;&lt;br /&gt;
''Note'': if removing a non-empty folder, use the -r flag to recursively remove all contents of that folder as well:&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 5em;&amp;quot;&amp;gt;&amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git rm -r &amp;lt;foldername&amp;gt;&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[admin_classification::IT Build| ]]&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Software_Repository&amp;diff=4639</id>
		<title>Software Repository</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Software_Repository&amp;diff=4639"/>
		<updated>2016-07-08T15:47:56Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: /* Background */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category: McNair Admin]]&lt;br /&gt;
&lt;br /&gt;
==Background==&lt;br /&gt;
Given the amount of software that has been written by past computer science interns and more being written, we felt the need to have some kind of source code management system put into place so that developers can work without ever being in fear of breaking production and facing Ed's wrath (you do not want that dude angry! Wherever you go, he will find you! No escape.).&lt;br /&gt;
&lt;br /&gt;
To enforce efficient source control we (Ed) chose to host our own git server on the RDP machine using [https://bonobogitserver.com/ Bonobo Git Server], which is open source and runs on the Windows IIS platform.&lt;br /&gt;
&lt;br /&gt;
Installing Bonobo git server is pretty simple:&lt;br /&gt;
* download the zip file from the Bonobo website.&lt;br /&gt;
* extract its contents. It should be a single folder containing directories like App_Data, bin etc.&lt;br /&gt;
* rename that folder to anything you want. I used the name &amp;quot;codebase&amp;quot;&lt;br /&gt;
* copy the codebase folder to C:\inetpub\wwwroot\&lt;br /&gt;
* Allow IIS User to modify C:\inetpub\wwwroot\codebase\App_Data folder. To do so:&lt;br /&gt;
**select Properties of App_Data folder,&lt;br /&gt;
**go to Security tab,&lt;br /&gt;
**click edit,&lt;br /&gt;
**select IIS user (in my case IIS_IUSRS) and add Modify and Write permission,&lt;br /&gt;
**confirm these settings with Apply button.&lt;br /&gt;
*Convert ''codebase'' to Application in IIS&lt;br /&gt;
**Run IIS Manager and navigate to Sites -&amp;gt; Default Web Site. You should see Bonobo.Git.Server.&lt;br /&gt;
**Right click on 'codebase' and convert to application.&lt;br /&gt;
**Check if the selected application pool runs on .NET 4.0 and convert the site.&lt;br /&gt;
*Enable Anonymous Authentication in IIS and disable the others. To do so, select the application in the left pane, double-click on the Authentication icon in the right pane and set the value of Anonymous Authentication to Enabled.&lt;br /&gt;
*Launch your browser and go to http://localhost/codebase. Now you can see the initial page of the Bonobo Git Server and everything should work.&lt;br /&gt;
**default credentials are ''username'': '''admin''', ''password'': '''admin'''&lt;br /&gt;
**[6-22-2016]: Can also use https://localhost/codebase, which is preferable; otherwise usernames and passwords are transmitted in plain text. The browser will show a security error because we have a self-signed certificate. This is OK if we are restricted to the intranet. If we want to allow public access, we probably need to get a certificate from a Certificate Authority such as Verisign.&lt;br /&gt;
&lt;br /&gt;
==Our Git Server==&lt;br /&gt;
We have already set up the git server on the RDP machine. Here are the admin credentials:&lt;br /&gt;
*Username: '''boss'''&lt;br /&gt;
*Name: '''Ed'''&lt;br /&gt;
*Surname: '''Egan'''&lt;br /&gt;
*Email: '''Edward.Egan@rice.edu'''&lt;br /&gt;
*Password: '''you_seriously_thought_Id_write_that_in_here??'''&lt;br /&gt;
&lt;br /&gt;
To access this from your computer and not the RDP you can go to http://128.42.44.182/codebase where it will prompt you for your username and password.&lt;br /&gt;
**[6-22-2016]: Can also use https://128.42.44.182/codebase, which is preferable. The browser will show a security error because we have a self-signed certificate. This is OK if we are restricted to the intranet. If we want to allow public access, we probably need to get a certificate from a Certificate Authority such as Verisign.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Our Git workflow==&lt;br /&gt;
We chose a simple git workflow.&lt;br /&gt;
&lt;br /&gt;
Our aim is not to break things in the master branch. All commits on the master should work.&lt;br /&gt;
&lt;br /&gt;
 1.&lt;br /&gt;
 When adding a new feature or fixing a bug (well, why fix it, was that not a feature?), ALWAYS check out a new feature branch from the master.&lt;br /&gt;
 NEVER checkout a feature branch from next (see below). The feature branch should be named user/feature_name. &lt;br /&gt;
&lt;br /&gt;
 2.&lt;br /&gt;
 After feature development is complete, merge your feature branch into next.&lt;br /&gt;
&lt;br /&gt;
 3.&lt;br /&gt;
 The next branch is intended for testing and confirming that things do not break. So, after feature branches are merged into next and conflicts resolved, if things work we push it to master.&lt;br /&gt;
 You can end the feature branches if you want.  &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Quick and dirty git tutorial==&lt;br /&gt;
 For a cool interactive tutorial see http://learngitbranching.js.org/.&lt;br /&gt;
&lt;br /&gt;
 ***&lt;br /&gt;
 You can also use SourceTree, a GUI git client. It is installed on the RDP.&lt;br /&gt;
 Like using git from the CLI (see below), SourceTree constructs the appropriate commands. The good thing&lt;br /&gt;
 is that it automatically adds the error-checking and logging options to each command, which are difficult&lt;br /&gt;
 to recall from memory. SourceTree is freely available from Atlassian at https://www.sourcetreeapp.com/&lt;br /&gt;
 ***&lt;br /&gt;
&lt;br /&gt;
 To use SourceTree you should have a basic understanding of git (branches, commits, etc.). The interactive tutorial above is very good for this purpose.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*''Installing - '' Depending on your operating system you can install git in different ways:&lt;br /&gt;
** If you are a Windows or Mac user, you can simply download &amp;amp; install the latest release from the [https://git-scm.com/ git scm website]&lt;br /&gt;
** If you use ubuntu then all you need to do is type &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;sudo apt-get install git&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt;&lt;br /&gt;
* Check your installation by typing &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt; in a terminal or Windows PowerShell&lt;br /&gt;
*Basic git operations:&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 2em;&amp;quot;&amp;gt;&lt;br /&gt;
* to checkout code from remote repository, use the &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git clone&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt; command. This will create a local repository on your disk as well as download the source code of the project you wish to work on. Here's an example:&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 5em;&amp;quot;&amp;gt;&amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git clone http://128.42.44.182/codebase/Matcher.git&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* to update your repository to include others' work in your project use the &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git pull&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt; command. It is good practice to update your code before you commit, so that you find out early whether others' code breaks yours. Also, git rejects a push if the remote branch has commits that your local branch lacks. Committing on a stale local repository is fine, but the longer you wait the more conflicts you are likely to face when you eventually pull and merge. See example:&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 5em;&amp;quot;&amp;gt;&amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git pull&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* to commit your changes to your local repository use the &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git commit&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt; command. Committing your changes is an essential step whether you are adding/removing items from the repository or changing existing items. See example :&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 5em;&amp;quot;&amp;gt;&amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git commit -m &amp;quot;mandatory commit message&amp;quot;&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* to push your changes to the remote repository use the &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git push&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt; command, optionally followed by a remote and a branch name, as in &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git push origin master&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt;. Whatever you need pushed to the server must be committed to your local repository first. Note that git pushes commits, not individual files or folders.&lt;br /&gt;
&lt;br /&gt;
* to add new files to your repository use the &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git add&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt; command. After that you must commit to ensure that your repository actually has the new file. See example:&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 5em;&amp;quot;&amp;gt;&amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git add &amp;lt;filename/folder name&amp;gt;&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* to remove items from your repository use the &amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git rm&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt; command. It deletes the file from your working directory and stages the removal; you then commit so that the change is persisted in your repository. Finally, you push to the server so that the items are removed there as well and nobody in your team works under the assumption that they are still there. See example:&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 5em;&amp;quot;&amp;gt;&amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git rm &amp;lt;filename&amp;gt;&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 2em;&amp;quot;&amp;gt;&lt;br /&gt;
''Note'': if removing a non-empty folder, use the -r flag to recursively remove all contents of that folder as well:&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;text-align: left; direction: ltr; margin-left: 5em;&amp;quot;&amp;gt;&amp;lt;code&amp;gt;&amp;lt;big&amp;gt;git rm -r &amp;lt;foldername&amp;gt;&amp;lt;/big&amp;gt;&amp;lt;/code&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[admin_classification::IT Build| ]]&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Bulk_Patent_Assignee_Processing&amp;diff=4636</id>
		<title>Bulk Patent Assignee Processing</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Bulk_Patent_Assignee_Processing&amp;diff=4636"/>
		<updated>2016-07-08T15:38:04Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: /* Downloading raw bulk data from USPTO */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== USPTO Assignees Data ==&lt;br /&gt;
&lt;br /&gt;
We would like to download and absorb data from this location on the USPTO website into our tables. The objective is to determine whether this dataset is better than the current version of our patent data (a combination of the data in the patent_2015 and patentdata databases).&lt;br /&gt;
&lt;br /&gt;
== Steps Followed to Extract the Data ==&lt;br /&gt;
&lt;br /&gt;
===Extracting Data from XML Files ===&lt;br /&gt;
&lt;br /&gt;
All the historical USPTO data is available as XML files. Here is the tree structure for the XML files:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;patent-assignment&amp;gt;&lt;br /&gt;
        +&amp;lt;assignment-record&amp;gt;&lt;br /&gt;
        +&amp;lt;patent-assignors&amp;gt;&lt;br /&gt;
        +&amp;lt;patent-assignees&amp;gt;&lt;br /&gt;
        +&amp;lt;patent-properties&amp;gt;&lt;br /&gt;
 &amp;lt;/patent-assignment&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Each of the above internal nodes is mandatory, and is a logical grouping of information fields. Each node has a corresponding table created with more or less the same fields as the XML elements.&lt;br /&gt;
&lt;br /&gt;
Corresponding tables are:&lt;br /&gt;
*assignment-record : assignment&lt;br /&gt;
*patent-assignors : assignors&lt;br /&gt;
*patent-assignees : assignees&lt;br /&gt;
*patent-properties : properties&lt;br /&gt;
&lt;br /&gt;
Additionally, for each file that is downloaded, there are some associated specs. All of these are stored in the PatentAssignment table. Here is the data model diagram.&lt;br /&gt;
&lt;br /&gt;
==== Assignment Records ====&lt;br /&gt;
&lt;br /&gt;
The fields in the assignment record are:&lt;br /&gt;
* last_update_date&lt;br /&gt;
* purge_indicator&lt;br /&gt;
* recorded_date&lt;br /&gt;
* correspondent_name&lt;br /&gt;
* correspondent_address_1&lt;br /&gt;
* correspondent_address_2&lt;br /&gt;
* correspondent_address_3&lt;br /&gt;
* correspondent_address_4&lt;br /&gt;
* conveyance_text&lt;br /&gt;
&lt;br /&gt;
Here is the corresponding XML that we are mapping:&lt;br /&gt;
 &lt;br /&gt;
   -&amp;lt;assignment-record&amp;gt;&lt;br /&gt;
       &amp;lt;reel-no&amp;gt;27132&amp;lt;/reel-no&amp;gt;&lt;br /&gt;
       &amp;lt;frame-no&amp;gt;841&amp;lt;/frame-no&amp;gt;&lt;br /&gt;
      -&amp;lt;last-update-date&amp;gt;&lt;br /&gt;
          &amp;lt;date&amp;gt;20160122&amp;lt;/date&amp;gt;&lt;br /&gt;
       &amp;lt;/last-update-date&amp;gt;&lt;br /&gt;
       &amp;lt;purge-indicator&amp;gt;N&amp;lt;/purge-indicator&amp;gt;&lt;br /&gt;
          -&amp;lt;recorded-date&amp;gt;&lt;br /&gt;
              &amp;lt;date&amp;gt;20111027&amp;lt;/date&amp;gt;&lt;br /&gt;
           &amp;lt;/recorded-date&amp;gt;&lt;br /&gt;
         &amp;lt;page-count&amp;gt;2&amp;lt;/page-count&amp;gt;&lt;br /&gt;
      -&amp;lt;correspondent&amp;gt;&lt;br /&gt;
           &amp;lt;name&amp;gt;DOUGLAS B. MCKNIGHT&amp;lt;/name&amp;gt;&lt;br /&gt;
           &amp;lt;address-1&amp;gt;595 MINER ROAD&amp;lt;/address-1&amp;gt;&lt;br /&gt;
           &amp;lt;address-2&amp;gt;INTELLECTUAL PROPERTY &amp;amp; STANDARDS&amp;lt;/address-2&amp;gt;&lt;br /&gt;
           &amp;lt;address-3&amp;gt;CLEVELAND, OH 44143&amp;lt;/address-3&amp;gt;&lt;br /&gt;
        &amp;lt;/correspondent&amp;gt;&lt;br /&gt;
        &amp;lt;conveyance-text&amp;gt;ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS).&amp;lt;/conveyance-text&amp;gt;&lt;br /&gt;
  &amp;lt;/assignment-record&amp;gt;&lt;br /&gt;
&lt;br /&gt;
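The mapping from the assignment-record node to the fields listed above can be sketched in Python. This is an illustration only, not the Perl parser used for this project: the names build_sample_record and flatten_assignment_record are hypothetical, and the sample is a trimmed copy of the XML above, built with ElementTree calls instead of literal markup.&lt;br /&gt;

```python
import xml.etree.ElementTree as ET


def build_sample_record():
    """Rebuild a trimmed copy of the assignment-record shown above."""
    rec = ET.Element("assignment-record")
    ET.SubElement(rec, "reel-no").text = "27132"
    ET.SubElement(rec, "frame-no").text = "841"
    ET.SubElement(ET.SubElement(rec, "last-update-date"), "date").text = "20160122"
    ET.SubElement(rec, "purge-indicator").text = "N"
    ET.SubElement(ET.SubElement(rec, "recorded-date"), "date").text = "20111027"
    corr = ET.SubElement(rec, "correspondent")
    ET.SubElement(corr, "name").text = "DOUGLAS B. MCKNIGHT"
    ET.SubElement(corr, "address-1").text = "595 MINER ROAD"
    ET.SubElement(rec, "conveyance-text").text = (
        "ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS)."
    )
    return rec


def flatten_assignment_record(rec):
    """Return one row (a dict) keyed by the field names listed above."""
    row = {
        "last_update_date": rec.findtext("last-update-date/date"),
        "purge_indicator": rec.findtext("purge-indicator"),
        "recorded_date": rec.findtext("recorded-date/date"),
        "correspondent_name": rec.findtext("correspondent/name"),
        "conveyance_text": rec.findtext("conveyance-text"),
    }
    # The address lines are optional in the XML; absent ones come back as None.
    for n in range(1, 5):
        key = "correspondent_address_%d" % n
        row[key] = rec.findtext("correspondent/address-%d" % n)
    return row


if __name__ == "__main__":
    print(flatten_assignment_record(build_sample_record()))
```

Nested date elements are flattened into a single field, and address lines that are absent are stored as None, which a database layer would typically write as NULL.&lt;br /&gt;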
&lt;br /&gt;
==== Assignors ====&lt;br /&gt;
&lt;br /&gt;
Here are the columns in the assignors table:&lt;br /&gt;
* reel_no&lt;br /&gt;
* frame_no&lt;br /&gt;
* assignor_name&lt;br /&gt;
* execution_date&lt;br /&gt;
&lt;br /&gt;
The corresponding XML node is :&lt;br /&gt;
&lt;br /&gt;
 -&amp;lt;patent-assignors&amp;gt;&lt;br /&gt;
    -&amp;lt;patent-assignor&amp;gt;&lt;br /&gt;
       &amp;lt;name&amp;gt;WALKER, MATTHEW J.&amp;lt;/name&amp;gt;&lt;br /&gt;
      -&amp;lt;execution-date&amp;gt;&lt;br /&gt;
          &amp;lt;date&amp;gt;20090512&amp;lt;/date&amp;gt;&lt;br /&gt;
       &amp;lt;/execution-date&amp;gt;&lt;br /&gt;
     &amp;lt;/patent-assignor&amp;gt;&lt;br /&gt;
    -&amp;lt;patent-assignor&amp;gt;&lt;br /&gt;
       &amp;lt;name&amp;gt;OLSZEWSKI, MARK E.&amp;lt;/name&amp;gt;&lt;br /&gt;
      -&amp;lt;execution-date&amp;gt;&lt;br /&gt;
          &amp;lt;date&amp;gt;20090512&amp;lt;/date&amp;gt;&lt;br /&gt;
       &amp;lt;/execution-date&amp;gt;&lt;br /&gt;
     &amp;lt;/patent-assignor&amp;gt;&lt;br /&gt;
   &amp;lt;/patent-assignors&amp;gt;&lt;br /&gt;
&lt;br /&gt;
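Because one assignment can list several assignors, each patent-assignor element becomes its own row. A minimal Python sketch (not the project's Perl code) is below; the function name assignor_rows is hypothetical, and taking reel_no and frame_no from the sibling assignment-record is an assumption about how the rows are keyed.&lt;br /&gt;

```python
import xml.etree.ElementTree as ET


def assignor_rows(assignment):
    """Return one dict per patent-assignor under a patent-assignment element."""
    # Assumed key: the reel and frame numbers of the enclosing assignment.
    reel_no = assignment.findtext("assignment-record/reel-no")
    frame_no = assignment.findtext("assignment-record/frame-no")
    rows = []
    for assignor in assignment.iterfind("patent-assignors/patent-assignor"):
        rows.append({
            "reel_no": reel_no,
            "frame_no": frame_no,
            "assignor_name": assignor.findtext("name"),
            # execution-date is optional, so this may be None.
            "execution_date": assignor.findtext("execution-date/date"),
        })
    return rows
```

For the sample above this would yield two rows, one for each assignor, sharing the same reel_no and frame_no.&lt;br /&gt;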
==== Assignees ====&lt;br /&gt;
&lt;br /&gt;
Here are the columns in the assignees table:&lt;br /&gt;
&lt;br /&gt;
* reel_no&lt;br /&gt;
* frame_no&lt;br /&gt;
* assignee_name&lt;br /&gt;
* assignee_address_1&lt;br /&gt;
* assignee_address_2&lt;br /&gt;
* assignee_city&lt;br /&gt;
* assignee_state&lt;br /&gt;
* assignee_country&lt;br /&gt;
* assignee_postcode&lt;br /&gt;
&lt;br /&gt;
The corresponding XML nodes are:&lt;br /&gt;
&lt;br /&gt;
  -&amp;lt;patent-assignees&amp;gt;&lt;br /&gt;
    -&amp;lt;patent-assignee&amp;gt;&lt;br /&gt;
        &amp;lt;name&amp;gt;KONINKLIJKE PHILIPS ELECTRONICS N V&amp;lt;/name&amp;gt;&lt;br /&gt;
        &amp;lt;address-1&amp;gt;GROENEWOUDSEWEG 1&amp;lt;/address-1&amp;gt;&lt;br /&gt;
        &amp;lt;city&amp;gt;EINDHOVEN&amp;lt;/city&amp;gt;&lt;br /&gt;
        &amp;lt;country-name&amp;gt;NETHERLANDS&amp;lt;/country-name&amp;gt;&lt;br /&gt;
        &amp;lt;postcode&amp;gt;5621 BA&amp;lt;/postcode&amp;gt;&lt;br /&gt;
      &amp;lt;/patent-assignee&amp;gt;&lt;br /&gt;
    &amp;lt;/patent-assignees&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Patent Properties ====&lt;br /&gt;
&lt;br /&gt;
Here are the columns in the properties table:&lt;br /&gt;
&lt;br /&gt;
* reel_no&lt;br /&gt;
* frame_no&lt;br /&gt;
* documentid&lt;br /&gt;
* country&lt;br /&gt;
* kind&lt;br /&gt;
* filingdate&lt;br /&gt;
* invention_title&lt;br /&gt;
&lt;br /&gt;
The corresponding XML segment would be:&lt;br /&gt;
&lt;br /&gt;
  -&amp;lt;patent-properties&amp;gt;&lt;br /&gt;
    -&amp;lt;patent-property&amp;gt;&lt;br /&gt;
      -&amp;lt;document-id&amp;gt;&lt;br /&gt;
          &amp;lt;country&amp;gt;US&amp;lt;/country&amp;gt;&lt;br /&gt;
          &amp;lt;doc-number&amp;gt;14143589&amp;lt;/doc-number&amp;gt;&lt;br /&gt;
          &amp;lt;kind&amp;gt;X0&amp;lt;/kind&amp;gt;&lt;br /&gt;
          &amp;lt;date&amp;gt;20131230&amp;lt;/date&amp;gt;&lt;br /&gt;
       &amp;lt;/document-id&amp;gt;&lt;br /&gt;
      -&amp;lt;document-id&amp;gt;&lt;br /&gt;
          &amp;lt;country&amp;gt;US&amp;lt;/country&amp;gt;&lt;br /&gt;
          &amp;lt;doc-number&amp;gt;20140260305&amp;lt;/doc-number&amp;gt;&lt;br /&gt;
          &amp;lt;kind&amp;gt;A1&amp;lt;/kind&amp;gt;&lt;br /&gt;
          &amp;lt;date&amp;gt;20140918&amp;lt;/date&amp;gt;&lt;br /&gt;
       &amp;lt;/document-id&amp;gt;&lt;br /&gt;
      &amp;lt;invention-title lang=&amp;quot;en&amp;quot;&amp;gt;LEAN AZIMUTHAL FLAME COMBUSTOR&amp;lt;/invention-title&amp;gt;&lt;br /&gt;
    &amp;lt;/patent-property&amp;gt;&lt;br /&gt;
  &amp;lt;/patent-properties&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Patent properties have a many-to-one relationship: one patent assignment can have more than one property.&lt;br /&gt;
 Note: We are not sure what documents with kind 'X0' say&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Patent Assignment ====&lt;br /&gt;
&lt;br /&gt;
Every XML file download has some fields associated with it, in addition to a number of patent assignment nodes.&lt;br /&gt;
&lt;br /&gt;
Here are the columns in the table:&lt;br /&gt;
&lt;br /&gt;
* reel_no&lt;br /&gt;
* frame_no&lt;br /&gt;
* action_key_code&lt;br /&gt;
* USPTO_Transaction_Date&lt;br /&gt;
* USPTO_Date_Produced&lt;br /&gt;
* version&lt;br /&gt;
&lt;br /&gt;
Here is what the XML in a downloaded file looks like:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;?xml version=&amp;quot;1.0&amp;quot; encoding=&amp;quot;UTF-8&amp;quot;?&amp;gt;&lt;br /&gt;
  &amp;lt;!DOCTYPE us-patent-assignments&amp;gt;&lt;br /&gt;
 -&amp;lt;us-patent-assignments date-produced=&amp;quot;20131101&amp;quot; dtd-version=&amp;quot;1.0&amp;quot;&amp;gt;&lt;br /&gt;
     &amp;lt;action-key-code&amp;gt;DA&amp;lt;/action-key-code&amp;gt;&lt;br /&gt;
    -&amp;lt;transaction-date&amp;gt;&lt;br /&gt;
        &amp;lt;date&amp;gt;20160122&amp;lt;/date&amp;gt;&lt;br /&gt;
     &amp;lt;/transaction-date&amp;gt;&lt;br /&gt;
    -&amp;lt;patent-assignments&amp;gt;&lt;br /&gt;
        +&amp;lt;patent-assignment&amp;gt;&lt;br /&gt;
        +&amp;lt;patent-assignment&amp;gt;&lt;br /&gt;
        +&amp;lt;patent-assignment&amp;gt;&lt;br /&gt;
        +&amp;lt;patent-assignment&amp;gt;&lt;br /&gt;
        +&amp;lt;patent-assignment&amp;gt;&lt;br /&gt;
        +&amp;lt;patent-assignment&amp;gt;&lt;br /&gt;
             .&lt;br /&gt;
             .&lt;br /&gt;
             .&lt;br /&gt;
      &amp;lt;/patent-assignments&amp;gt;&lt;br /&gt;
  &amp;lt;/us-patent-assignments&amp;gt;&lt;br /&gt;
&lt;br /&gt;
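The file-level fields live partly in attributes of the root element and partly in child elements. The Python sketch below (an illustration, not the project's Perl code) shows one way to read them; the function names are hypothetical, and mapping date-produced to USPTO_Date_Produced and dtd-version to version is inferred from the column names rather than confirmed.&lt;br /&gt;

```python
import xml.etree.ElementTree as ET


def build_sample_root():
    """Rebuild the root element shown above, without its patent-assignment children."""
    root = ET.Element(
        "us-patent-assignments",
        {"date-produced": "20131101", "dtd-version": "1.0"},
    )
    ET.SubElement(root, "action-key-code").text = "DA"
    ET.SubElement(ET.SubElement(root, "transaction-date"), "date").text = "20160122"
    ET.SubElement(root, "patent-assignments")
    return root


def file_level_fields(root):
    """Collect the per-file values from the root element."""
    return {
        "action_key_code": root.findtext("action-key-code"),
        "USPTO_Transaction_Date": root.findtext("transaction-date/date"),
        # Attributes are read with get(); a missing attribute yields None.
        "USPTO_Date_Produced": root.get("date-produced"),
        "version": root.get("dtd-version"),
    }


if __name__ == "__main__":
    print(file_level_fields(build_sample_root()))
```

The reel_no and frame_no columns are left out of this sketch because they belong to individual assignments rather than to the file header.&lt;br /&gt;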
&lt;br /&gt;
&lt;br /&gt;
====DTD====&lt;br /&gt;
Here is the DTD specified by the USPTO, which shows which fields are optional:&lt;br /&gt;
    &lt;br /&gt;
 &amp;lt;?xml version=&amp;quot;1.0&amp;quot; encoding=&amp;quot;utf-8&amp;quot;?&amp;gt; &lt;br /&gt;
 &amp;lt;!DOCTYPE us-patent-assignments [&amp;lt;!ELEMENT us-patent-assignments (action-key-code, transaction-date, patent-assignments)&amp;gt;&lt;br /&gt;
 &amp;lt;!ATTLIST us-patent-assignments  dtd-version   CDATA  #IMPLIED &lt;br /&gt;
 				 date-produced CDATA  #IMPLIED&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT action-key-code (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT transaction-date (date)&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT patent-assignments (data-available-code | patent-assignment+)&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT date (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT data-available-code (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT patent-assignment (assignment-record, patent-assignors, patent-assignees, patent-properties)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT assignment-record (reel-no, frame-no, last-update-date, purge-indicator, recorded-date, page-count?, correspondent, conveyance-text)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT patent-assignors (patent-assignor+)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT patent-assignees (patent-assignee+)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT patent-properties (patent-property+)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT reel-no (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT frame-no (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT last-update-date (date)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT purge-indicator (#PCDATA)&amp;gt;  &lt;br /&gt;
 &amp;lt;!ELEMENT recorded-date (date)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT page-count (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT correspondent (name, address-1?, address-2?, address-3?, address-4?)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT conveyance-text (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT patent-assignor (name, execution-date?, date-acknowledged?)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT patent-assignee (name, address-1?, address-2?, city?, state?, country-name?, postcode?)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT patent-property (document-id*, invention-title?)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT name (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ATTLIST name name-type (natural | legal)  #IMPLIED&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT address-1 (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT address-2 (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT address-3 (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT address-4 (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT execution-date (date)&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT date-acknowledged (date)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT city (#PCDATA)&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT state (#PCDATA)&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT country-name (#PCDATA)&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT postcode (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT document-id (country, doc-number, kind?, name?, date?)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT invention-title (#PCDATA | b | i | u | sup | sub)*&amp;gt;&lt;br /&gt;
 &amp;lt;!ATTLIST invention-title  id   ID     #IMPLIED &lt;br /&gt;
 			   lang CDATA  #REQUIRED&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT country (#PCDATA)&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT doc-number (#PCDATA)&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT kind (#PCDATA)&amp;gt;&lt;br /&gt;
 &amp;lt;!--bold formatting for text--&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT b (#PCDATA | i | u | smallcaps)*&amp;gt;&lt;br /&gt;
 &amp;lt;!--italic formatting for text--&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT i (#PCDATA | b | u | smallcaps)*&amp;gt;&lt;br /&gt;
 &amp;lt;!--underscore: style - single is default--&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT u (#PCDATA | b | i | smallcaps)*&amp;gt;&lt;br /&gt;
 &amp;lt;!ATTLIST u  style  (single | double | dash | dots )  'single' &amp;gt;&lt;br /&gt;
 &amp;lt;!--superscripted text--&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT sup (#PCDATA | b | u | i)*&amp;gt;&lt;br /&gt;
 &amp;lt;!--subscripted text--&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT sub (#PCDATA | b | u | i)*&amp;gt;&lt;br /&gt;
 &amp;lt;!--small capitals--&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT smallcaps (#PCDATA | b | u | i)*&amp;gt;&lt;br /&gt;
 ]&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Inserting Extracted Data into Tables ===&lt;br /&gt;
&lt;br /&gt;
===Clean Up ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Scripts for processing data ==&lt;br /&gt;
The programs/scripts are located at our [[Software Repository]].&lt;br /&gt;
 repository: Patent_Data_Parser &lt;br /&gt;
 branch: next&lt;br /&gt;
 directory: /uspto_assignees_xml_parser&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Downloading raw bulk data from USPTO ===&lt;br /&gt;
 repository: Patent_Data_Parser &lt;br /&gt;
 branch: next&lt;br /&gt;
 directory: /uspto_assignees_xml_parser&lt;br /&gt;
 file: USPTO_Assignee_Download.pl&lt;br /&gt;
&lt;br /&gt;
The downloader script for the XML files is essentially the same, with minor changes, as the one used for downloading USPTO patent data.&lt;br /&gt;
That is, the current version of the downloader script downloads all files from the base URL: https://bulkdata.uspto.gov/data2/patent/assignment/&lt;br /&gt;
&lt;br /&gt;
=== Parsing the XML files ===&lt;br /&gt;
 repository: Patent_Data_Parser &lt;br /&gt;
 branch: next&lt;br /&gt;
 directory: /uspto_assignees_xml_parser&lt;br /&gt;
 file: uspto_assignees_XML_parser.plx&lt;br /&gt;
&lt;br /&gt;
==== NAME ====&lt;br /&gt;
&lt;br /&gt;
uspto_assignees_XML_parser.plx - Parses XML files and populates a database.&lt;br /&gt;
&lt;br /&gt;
Specifically, parses every file in a directory according to a schema (see above).&lt;br /&gt;
Then populates a database on the RDP. &lt;br /&gt;
&lt;br /&gt;
==== SYNOPSIS ====&lt;br /&gt;
&lt;br /&gt;
perl uspto_assignees_XML_parser.plx /path/to/directory_containing_XML_files&lt;br /&gt;
&lt;br /&gt;
==== USAGE &amp;amp; FEATURES ====&lt;br /&gt;
&lt;br /&gt;
'''Arguments'''&lt;br /&gt;
The full path to the directory is given as a command-line argument. The directory should contain the XML files to parse and no other files.&lt;br /&gt;
The path must be specified in Windows format (with '\') and NOT Unix format.&lt;br /&gt;
&lt;br /&gt;
'''Features and Effects'''&lt;br /&gt;
As each XML file is parsed, a database on the local host (RDP) is populated. If an error occurs at any point, for example because a particular&lt;br /&gt;
XML file is bad/invalid or a psql statement cannot be executed, the program aborts with a message.&lt;br /&gt;
&lt;br /&gt;
We chose to populate a local database because remote connections are too slow. The database is later moved to the database server manually.&lt;br /&gt;
&lt;br /&gt;
==== TESTS ====&lt;br /&gt;
The first version works as expected. It was used to populate the assignees database by parsing the XML files from the USPTO (see above).&lt;br /&gt;
We parsed all XML files dated up to 7/4/2016.&lt;br /&gt;
&lt;br /&gt;
==== TO DO ====&lt;br /&gt;
*Add more command line options to improve usability.&lt;br /&gt;
*Improve portability to allow Unix/Linux pathnames. This is straightforward to do with Perl modules File::Basename and File::Spec.&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Bulk_Patent_Assignee_Processing&amp;diff=4420</id>
		<title>Bulk Patent Assignee Processing</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Bulk_Patent_Assignee_Processing&amp;diff=4420"/>
		<updated>2016-07-05T20:37:21Z</updated>

		<summary type="html">&lt;p&gt;ShoebMohammed: /* Scripts for processing data */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== USPTO Assignees Data ==&lt;br /&gt;
&lt;br /&gt;
We would like to download and absorb data from this location on the USPTO website into our tables. The objective is to determine whether this dataset is better than the current version of our patent data (a combination of the data in the patent_2015 and patentdata databases).&lt;br /&gt;
&lt;br /&gt;
== Steps Followed to Extract the Data ==&lt;br /&gt;
&lt;br /&gt;
===Extracting Data from XML Files ===&lt;br /&gt;
&lt;br /&gt;
All the historical USPTO data is available as XML files. Here is the tree structure for the XML files:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;patent-assignment&amp;gt;&lt;br /&gt;
        +&amp;lt;assignment-record&amp;gt;&lt;br /&gt;
        +&amp;lt;patent-assignors&amp;gt;&lt;br /&gt;
        +&amp;lt;patent-assignees&amp;gt;&lt;br /&gt;
        +&amp;lt;patent-properties&amp;gt;&lt;br /&gt;
 &amp;lt;/patent-assignment&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Each of the above internal nodes is mandatory, and is a logical grouping of information fields. Each node has a corresponding table created with more or less the same fields as the XML elements.&lt;br /&gt;
&lt;br /&gt;
Corresponding tables are:&lt;br /&gt;
*assignment-record : assignment&lt;br /&gt;
*patent-assignors : assignors&lt;br /&gt;
*patent-assignees : assignees&lt;br /&gt;
*patent-properties : properties&lt;br /&gt;
&lt;br /&gt;
Additionally, each downloaded file carries some file-level fields of its own. All of these are stored in the PatentAssignment table. Here is the data model diagram.&lt;br /&gt;
&lt;br /&gt;
==== Assignment Records ====&lt;br /&gt;
&lt;br /&gt;
The fields in the assignment record are:&lt;br /&gt;
* last_update_date&lt;br /&gt;
* purge_indicator&lt;br /&gt;
* recorded_date&lt;br /&gt;
* correspondent_name&lt;br /&gt;
* correspondent_address_1&lt;br /&gt;
* correspondent_address_2&lt;br /&gt;
* correspondent_address_3&lt;br /&gt;
* correspondent_address_4&lt;br /&gt;
* conveyance_text&lt;br /&gt;
&lt;br /&gt;
Here is the corresponding XML that we are mapping:&lt;br /&gt;
 &lt;br /&gt;
   -&amp;lt;assignment-record&amp;gt;&lt;br /&gt;
       &amp;lt;reel-no&amp;gt;27132&amp;lt;/reel-no&amp;gt;&lt;br /&gt;
       &amp;lt;frame-no&amp;gt;841&amp;lt;/frame-no&amp;gt;&lt;br /&gt;
      -&amp;lt;last-update-date&amp;gt;&lt;br /&gt;
          &amp;lt;date&amp;gt;20160122&amp;lt;/date&amp;gt;&lt;br /&gt;
       &amp;lt;/last-update-date&amp;gt;&lt;br /&gt;
       &amp;lt;purge-indicator&amp;gt;N&amp;lt;/purge-indicator&amp;gt;&lt;br /&gt;
          -&amp;lt;recorded-date&amp;gt;&lt;br /&gt;
              &amp;lt;date&amp;gt;20111027&amp;lt;/date&amp;gt;&lt;br /&gt;
           &amp;lt;/recorded-date&amp;gt;&lt;br /&gt;
         &amp;lt;page-count&amp;gt;2&amp;lt;/page-count&amp;gt;&lt;br /&gt;
      -&amp;lt;correspondent&amp;gt;&lt;br /&gt;
           &amp;lt;name&amp;gt;DOUGLAS B. MCKNIGHT&amp;lt;/name&amp;gt;&lt;br /&gt;
           &amp;lt;address-1&amp;gt;595 MINER ROAD&amp;lt;/address-1&amp;gt;&lt;br /&gt;
           &amp;lt;address-2&amp;gt;INTELLECTUAL PROPERTY &amp;amp; STANDARDS&amp;lt;/address-2&amp;gt;&lt;br /&gt;
           &amp;lt;address-3&amp;gt;CLEVELAND, OH 44143&amp;lt;/address-3&amp;gt;&lt;br /&gt;
        &amp;lt;/correspondent&amp;gt;&lt;br /&gt;
        &amp;lt;conveyance-text&amp;gt;ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS).&amp;lt;/conveyance-text&amp;gt;&lt;br /&gt;
  &amp;lt;/assignment-record&amp;gt;&lt;br /&gt;
&lt;br /&gt;
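The dates in the sample above sit one level down, inside a date child, and are written as YYYYMMDD. A minimal Python sketch of reading them follows; it is an illustration only, not the Perl parser from the repository. The helper name nested_date is hypothetical, and the sample element is built in code rather than read from a USPTO file.&lt;br /&gt;
&lt;pre&gt;
```python
import xml.etree.ElementTree as ET
from datetime import datetime


def nested_date(parent, tag):
    """Return the YYYYMMDD text under parent/tag/date as a datetime.date, or None if absent."""
    text = parent.findtext(tag + "/date")
    if text is None:
        return None
    return datetime.strptime(text.strip(), "%Y%m%d").date()


# Hypothetical sample built in code, mirroring the values shown above.
record = ET.Element("assignment-record")
ET.SubElement(ET.SubElement(record, "last-update-date"), "date").text = "20160122"
ET.SubElement(ET.SubElement(record, "recorded-date"), "date").text = "20111027"

print(nested_date(record, "last-update-date"))  # 2016-01-22
print(nested_date(record, "recorded-date"))     # 2011-10-27
```
&lt;/pre&gt;
This sketch assumes every date is a valid YYYYMMDD string; strptime raises ValueError on anything else, so real data may need extra handling.&lt;br /&gt;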
&lt;br /&gt;
==== Assignors ====&lt;br /&gt;
&lt;br /&gt;
Here are the columns in the assignors table:&lt;br /&gt;
* reel_no&lt;br /&gt;
* frame_no&lt;br /&gt;
* assignor_name&lt;br /&gt;
* execution_date&lt;br /&gt;
&lt;br /&gt;
The corresponding XML node is:&lt;br /&gt;
&lt;br /&gt;
 -&amp;lt;patent-assignors&amp;gt;&lt;br /&gt;
    -&amp;lt;patent-assignor&amp;gt;&lt;br /&gt;
       &amp;lt;name&amp;gt;WALKER, MATTHEW J.&amp;lt;/name&amp;gt;&lt;br /&gt;
      -&amp;lt;execution-date&amp;gt;&lt;br /&gt;
          &amp;lt;date&amp;gt;20090512&amp;lt;/date&amp;gt;&lt;br /&gt;
       &amp;lt;/execution-date&amp;gt;&lt;br /&gt;
     &amp;lt;/patent-assignor&amp;gt;&lt;br /&gt;
    -&amp;lt;patent-assignor&amp;gt;&lt;br /&gt;
       &amp;lt;name&amp;gt;OLSZEWSKI, MARK E.&amp;lt;/name&amp;gt;&lt;br /&gt;
      -&amp;lt;execution-date&amp;gt;&lt;br /&gt;
          &amp;lt;date&amp;gt;20090512&amp;lt;/date&amp;gt;&lt;br /&gt;
       &amp;lt;/execution-date&amp;gt;&lt;br /&gt;
     &amp;lt;/patent-assignor&amp;gt;&lt;br /&gt;
   &amp;lt;/patent-assignors&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Assignees ====&lt;br /&gt;
&lt;br /&gt;
Here are the columns in the assignees table:&lt;br /&gt;
&lt;br /&gt;
* reel_no&lt;br /&gt;
* frame_no&lt;br /&gt;
* assignee_name&lt;br /&gt;
* assignee_address_1&lt;br /&gt;
* assignee_address_2&lt;br /&gt;
* assignee_city&lt;br /&gt;
* assignee_state&lt;br /&gt;
* assignee_country&lt;br /&gt;
* assignee_postcode&lt;br /&gt;
&lt;br /&gt;
The corresponding XML nodes are:&lt;br /&gt;
&lt;br /&gt;
  -&amp;lt;patent-assignees&amp;gt;&lt;br /&gt;
    -&amp;lt;patent-assignee&amp;gt;&lt;br /&gt;
        &amp;lt;name&amp;gt;KONINKLIJKE PHILIPS ELECTRONICS N V&amp;lt;/name&amp;gt;&lt;br /&gt;
        &amp;lt;address-1&amp;gt;GROENEWOUDSEWEG 1&amp;lt;/address-1&amp;gt;&lt;br /&gt;
        &amp;lt;city&amp;gt;EINDHOVEN&amp;lt;/city&amp;gt;&lt;br /&gt;
        &amp;lt;country-name&amp;gt;NETHERLANDS&amp;lt;/country-name&amp;gt;&lt;br /&gt;
        &amp;lt;postcode&amp;gt;5621 BA&amp;lt;/postcode&amp;gt;&lt;br /&gt;
      &amp;lt;/patent-assignee&amp;gt;&lt;br /&gt;
    &amp;lt;/patent-assignees&amp;gt;&lt;br /&gt;
&lt;br /&gt;
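The DTD on this page marks every assignee field except name as optional, and the sample above indeed has no address-2 or state. The following Python sketch shows one way to map an assignee element to the columns listed above, filling absent fields with None. It is an illustration under assumptions, not the repository's Perl code: the names ASSIGNEE_COLUMNS and assignee_row are hypothetical, and reel_no and frame_no are left out because they come from the assignment record.&lt;br /&gt;
&lt;pre&gt;
```python
import xml.etree.ElementTree as ET

# Element name in the XML -> column name in the assignees table (taken from the lists above).
ASSIGNEE_COLUMNS = {
    "name": "assignee_name",
    "address-1": "assignee_address_1",
    "address-2": "assignee_address_2",
    "city": "assignee_city",
    "state": "assignee_state",
    "country-name": "assignee_country",
    "postcode": "assignee_postcode",
}


def assignee_row(assignee):
    """Map one patent-assignee element to a row dict; absent optional fields become None."""
    return {column: assignee.findtext(tag) for tag, column in ASSIGNEE_COLUMNS.items()}


# Hypothetical sample built in code, mirroring some of the values shown above.
assignee = ET.Element("patent-assignee")
ET.SubElement(assignee, "name").text = "KONINKLIJKE PHILIPS ELECTRONICS N V"
ET.SubElement(assignee, "city").text = "EINDHOVEN"
ET.SubElement(assignee, "country-name").text = "NETHERLANDS"

row = assignee_row(assignee)
print(row["assignee_city"], row["assignee_state"])  # EINDHOVEN None
```
&lt;/pre&gt;
Keeping the mapping in a single dictionary makes the one irregular name visible: country-name is stored as assignee_country.&lt;br /&gt;
&lt;br /&gt;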
==== Patent Properties ====&lt;br /&gt;
&lt;br /&gt;
Here are the columns in the properties table:&lt;br /&gt;
&lt;br /&gt;
* reel_no&lt;br /&gt;
* frame_no&lt;br /&gt;
* documentid&lt;br /&gt;
* country&lt;br /&gt;
* kind&lt;br /&gt;
* filingdate&lt;br /&gt;
* invention_title&lt;br /&gt;
&lt;br /&gt;
The corresponding XML segment would be:&lt;br /&gt;
&lt;br /&gt;
  -&amp;lt;patent-properties&amp;gt;&lt;br /&gt;
    -&amp;lt;patent-property&amp;gt;&lt;br /&gt;
      -&amp;lt;document-id&amp;gt;&lt;br /&gt;
          &amp;lt;country&amp;gt;US&amp;lt;/country&amp;gt;&lt;br /&gt;
          &amp;lt;doc-number&amp;gt;14143589&amp;lt;/doc-number&amp;gt;&lt;br /&gt;
          &amp;lt;kind&amp;gt;X0&amp;lt;/kind&amp;gt;&lt;br /&gt;
          &amp;lt;date&amp;gt;20131230&amp;lt;/date&amp;gt;&lt;br /&gt;
       &amp;lt;/document-id&amp;gt;&lt;br /&gt;
      -&amp;lt;document-id&amp;gt;&lt;br /&gt;
          &amp;lt;country&amp;gt;US&amp;lt;/country&amp;gt;&lt;br /&gt;
          &amp;lt;doc-number&amp;gt;20140260305&amp;lt;/doc-number&amp;gt;&lt;br /&gt;
          &amp;lt;kind&amp;gt;A1&amp;lt;/kind&amp;gt;&lt;br /&gt;
          &amp;lt;date&amp;gt;20140918&amp;lt;/date&amp;gt;&lt;br /&gt;
       &amp;lt;/document-id&amp;gt;&lt;br /&gt;
      &amp;lt;invention-title lang=&amp;quot;en&amp;quot;&amp;gt;LEAN AZIMUTHAL FLAME COMBUSTOR&amp;lt;/invention-title&amp;gt;&lt;br /&gt;
    &amp;lt;/patent-property&amp;gt;&lt;br /&gt;
  &amp;lt;/patent-properties&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Patent properties have a many-to-one relationship: one patent assignment can have more than one property.&lt;br /&gt;
 Note: We are not sure what documents with kind 'X0' denote.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Patent Assignment ====&lt;br /&gt;
&lt;br /&gt;
Every XML file download has some fields associated with it, in addition to a number of patent assignment nodes.&lt;br /&gt;
&lt;br /&gt;
Here are the columns in the table:&lt;br /&gt;
&lt;br /&gt;
* reel_no&lt;br /&gt;
* frame_no&lt;br /&gt;
* action_key_code&lt;br /&gt;
* USPTO_Transaction_Date&lt;br /&gt;
* USPTO_Date_Produced&lt;br /&gt;
* version&lt;br /&gt;
&lt;br /&gt;
Here is what the XML in a downloaded file looks like:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;?xml version=&amp;quot;1.0&amp;quot; encoding=&amp;quot;UTF-8&amp;quot;?&amp;gt;&lt;br /&gt;
  &amp;lt;!DOCTYPE us-patent-assignments&amp;gt;&lt;br /&gt;
 -&amp;lt;us-patent-assignments date-produced=&amp;quot;20131101&amp;quot; dtd-version=&amp;quot;1.0&amp;quot;&amp;gt;&lt;br /&gt;
     &amp;lt;action-key-code&amp;gt;DA&amp;lt;/action-key-code&amp;gt;&lt;br /&gt;
    -&amp;lt;transaction-date&amp;gt;&lt;br /&gt;
        &amp;lt;date&amp;gt;20160122&amp;lt;/date&amp;gt;&lt;br /&gt;
     &amp;lt;/transaction-date&amp;gt;&lt;br /&gt;
    -&amp;lt;patent-assignments&amp;gt;&lt;br /&gt;
        +&amp;lt;patent-assignment&amp;gt;&lt;br /&gt;
        +&amp;lt;patent-assignment&amp;gt;&lt;br /&gt;
        +&amp;lt;patent-assignment&amp;gt;&lt;br /&gt;
        +&amp;lt;patent-assignment&amp;gt;&lt;br /&gt;
        +&amp;lt;patent-assignment&amp;gt;&lt;br /&gt;
        +&amp;lt;patent-assignment&amp;gt;&lt;br /&gt;
             .&lt;br /&gt;
             .&lt;br /&gt;
             .&lt;br /&gt;
      &amp;lt;/patent-assignments&amp;gt;&lt;br /&gt;
  &amp;lt;/us-patent-assignments&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====DTD====&lt;br /&gt;
Here is the DTD published by the USPTO, which specifies which fields are optional and which may repeat:&lt;br /&gt;
    &lt;br /&gt;
 &amp;lt;?xml version=&amp;quot;1.0&amp;quot; encoding=&amp;quot;utf-8&amp;quot;?&amp;gt; &lt;br /&gt;
 &amp;lt;!DOCTYPE us-patent-assignments [&amp;lt;!ELEMENT us-patent-assignments (action-key-code, transaction-date, patent-assignments)&amp;gt;&lt;br /&gt;
 &amp;lt;!ATTLIST us-patent-assignments  dtd-version   CDATA  #IMPLIED &lt;br /&gt;
 				 date-produced CDATA  #IMPLIED&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT action-key-code (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT transaction-date (date)&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT patent-assignments (data-available-code | patent-assignment+)&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT date (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT data-available-code (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT patent-assignment (assignment-record, patent-assignors, patent-assignees, patent-properties)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT assignment-record (reel-no, frame-no, last-update-date, purge-indicator, recorded-date, page-count?, correspondent, conveyance-text)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT patent-assignors (patent-assignor+)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT patent-assignees (patent-assignee+)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT patent-properties (patent-property+)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT reel-no (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT frame-no (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT last-update-date (date)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT purge-indicator (#PCDATA)&amp;gt;  &lt;br /&gt;
 &amp;lt;!ELEMENT recorded-date (date)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT page-count (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT correspondent (name, address-1?, address-2?, address-3?, address-4?)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT conveyance-text (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT patent-assignor (name, execution-date?, date-acknowledged?)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT patent-assignee (name, address-1?, address-2?, city?, state?, country-name?, postcode?)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT patent-property (document-id*, invention-title?)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT name (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ATTLIST name name-type (natural | legal)  #IMPLIED&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT address-1 (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT address-2 (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT address-3 (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT address-4 (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT execution-date (date)&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT date-acknowledged (date)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT city (#PCDATA)&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT state (#PCDATA)&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT country-name (#PCDATA)&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT postcode (#PCDATA)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT document-id (country, doc-number, kind?, name?, date?)&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT invention-title (#PCDATA | b | i | u | sup | sub)*&amp;gt;&lt;br /&gt;
 &amp;lt;!ATTLIST invention-title  id   ID     #IMPLIED &lt;br /&gt;
 			   lang CDATA  #REQUIRED&amp;gt; &lt;br /&gt;
 &amp;lt;!ELEMENT country (#PCDATA)&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT doc-number (#PCDATA)&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT kind (#PCDATA)&amp;gt;&lt;br /&gt;
 &amp;lt;!--bold formatting for text--&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT b (#PCDATA | i | u | smallcaps)*&amp;gt;&lt;br /&gt;
 &amp;lt;!--italic formatting for text--&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT i (#PCDATA | b | u | smallcaps)*&amp;gt;&lt;br /&gt;
 &amp;lt;!--underscore: style - single is default--&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT u (#PCDATA | b | i | smallcaps)*&amp;gt;&lt;br /&gt;
 &amp;lt;!ATTLIST u  style  (single | double | dash | dots )  'single' &amp;gt;&lt;br /&gt;
 &amp;lt;!--superscripted text--&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT sup (#PCDATA | b | u | i)*&amp;gt;&lt;br /&gt;
 &amp;lt;!--subscripted text--&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT sub (#PCDATA | b | u | i)*&amp;gt;&lt;br /&gt;
 &amp;lt;!--small capitals--&amp;gt;&lt;br /&gt;
 &amp;lt;!ELEMENT smallcaps (#PCDATA | b | u | i)*&amp;gt;&lt;br /&gt;
 ]&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Inserting Extracted Data into Tables ===&lt;br /&gt;
&lt;br /&gt;
===Clean Up ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Scripts for processing data ==&lt;br /&gt;
The programs/scripts are located at our [[Software Repository]].&lt;br /&gt;
 repository: Patent_Data_Parser &lt;br /&gt;
 branch: next&lt;br /&gt;
 directory: /uspto_assignees_xml_parser&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Downloading raw bulk data from USPTO ===&lt;br /&gt;
 repository: Patent_Data_Parser &lt;br /&gt;
 branch: next&lt;br /&gt;
 directory: /uspto_assignees_xml_parser&lt;br /&gt;
 file: USPTO_Assignee_Download.pl&lt;br /&gt;
&lt;br /&gt;
The XML files are available at https://bulkdata.uspto.gov/data2/patent/assignment/&lt;br /&gt;
&lt;br /&gt;
The downloader script for the XML files is essentially the same, with minor changes, as the one used for downloading USPTO patent data.&lt;br /&gt;
That is, the current version of the downloader script downloads all files from the base URL (see above).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Parsing the XML files ===&lt;br /&gt;
 repository: Patent_Data_Parser &lt;br /&gt;
 branch: next&lt;br /&gt;
 directory: /uspto_assignees_xml_parser&lt;br /&gt;
 file: uspto_assignees_XML_parser.plx&lt;br /&gt;
&lt;br /&gt;
==== NAME ====&lt;br /&gt;
&lt;br /&gt;
uspto_assignees_XML_parser.plx - Parses XML files and populates a database.&lt;br /&gt;
&lt;br /&gt;
Specifically, parses every file in a directory according to a schema (see above).&lt;br /&gt;
Then populates a database on the RDP. &lt;br /&gt;
&lt;br /&gt;
==== SYNOPSIS ====&lt;br /&gt;
&lt;br /&gt;
perl uspto_assignees_XML_parser.plx /path/to/directory_containing_XML_files&lt;br /&gt;
&lt;br /&gt;
==== USAGE &amp;amp; FEATURES ====&lt;br /&gt;
&lt;br /&gt;
'''Arguments'''&lt;br /&gt;
The full path to the directory is given as a command-line argument. The directory should contain the XML files to parse and no other files.&lt;br /&gt;
The path must be specified in Windows format (with '\') and NOT Unix format.&lt;br /&gt;
&lt;br /&gt;
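The difference between the two path formats can be illustrated with a short Python sketch. This is not the Perl script itself, which would use modules such as File::Spec as noted in the TO DO section. The function name join_xml_path and the directory and file names are hypothetical.&lt;br /&gt;
&lt;pre&gt;
```python
from pathlib import PurePosixPath, PureWindowsPath


def join_xml_path(directory, filename, windows=True):
    """Join a directory and a file name using Windows or Unix path rules."""
    flavour = PureWindowsPath if windows else PurePosixPath
    return str(flavour(directory) / filename)


# The directory and file names below are hypothetical.
print(join_xml_path("C:\\assignments\\xml", "sample.xml"))
print(join_xml_path("/data/assignments/xml", "sample.xml", windows=False))
```
&lt;/pre&gt;
Letting a path library choose the separator, rather than concatenating strings with '\', is what would make the same code work on both kinds of system.&lt;br /&gt;
&lt;br /&gt;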
'''Features and Effects'''&lt;br /&gt;
As each XML file is parsed, a database on the local host (RDP) is populated. If an error occurs at any point, for example because a particular&lt;br /&gt;
XML file is bad/invalid or a psql statement cannot be executed, the program aborts with a message.&lt;br /&gt;
&lt;br /&gt;
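The abort-on-error behaviour described above can be sketched in Python as follows. This is a rough illustration under assumptions, not the actual Perl implementation: the function name parse_directory is hypothetical, the database step is replaced by a simple count of patent-assignment nodes, and only malformed XML (not a failed psql statement) triggers the abort.&lt;br /&gt;
&lt;pre&gt;
```python
import sys
import xml.etree.ElementTree as ET
from pathlib import Path


def parse_directory(directory):
    """Parse every file in `directory`; abort with a message on the first invalid one."""
    assignments = 0
    for path in sorted(Path(directory).iterdir()):
        try:
            root = ET.parse(str(path)).getroot()
        except ET.ParseError as err:
            sys.exit(f"Aborting: {path.name} is not well-formed XML ({err})")
        # A real parser would insert rows into the database here;
        # this sketch only counts the patent-assignment nodes it finds.
        assignments += len(root.findall("./patent-assignments/patent-assignment"))
    return assignments
```
&lt;/pre&gt;
Calling parse_directory with the path to the directory of XML files returns the total number of assignments found, or exits with a message naming the offending file.&lt;br /&gt;
&lt;br /&gt;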
We chose to populate a local database because remote connections are too slow. The database is later moved to the database server manually.&lt;br /&gt;
&lt;br /&gt;
==== TESTS ====&lt;br /&gt;
The first version works as expected. It was used to populate the assignees database by parsing the XML files from the USPTO (see above).&lt;br /&gt;
We parsed all XML files dated up to 7/4/2016.&lt;br /&gt;
&lt;br /&gt;
==== TO DO ====&lt;br /&gt;
*Add more command line options to improve usability.&lt;br /&gt;
*Improve portability to allow Unix/Linux pathnames. This is straightforward to do with Perl modules File::Basename and File::Spec.&lt;/div&gt;</summary>
		<author><name>ShoebMohammed</name></author>
		
	</entry>
</feed>