<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>http://www.edegan.com/mediawiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Hbrown512</id>
	<title>edegan.com - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="http://www.edegan.com/mediawiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Hbrown512"/>
	<link rel="alternate" type="text/html" href="http://www.edegan.com/wiki/Special:Contributions/Hbrown512"/>
	<updated>2026-05-18T01:50:27Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.34.2</generator>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=22246</id>
		<title>USITC</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=22246"/>
		<updated>2017-11-30T20:18:50Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: /* USITC 337 Cases Tab Delimited Text */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=USITC Data&lt;br /&gt;
|Has owner=Harrison Brown&lt;br /&gt;
|Has start date=9/11/2017&lt;br /&gt;
|Has notes=In Progress&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Files==&lt;br /&gt;
&lt;br /&gt;
My files are in two places.&lt;br /&gt;
Development work was done here:&lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
I had to use the Postgres server to download the PDFs.&lt;br /&gt;
The Postgres server:&lt;br /&gt;
 128.42.44.182/bulk/USITC&lt;br /&gt;
&lt;br /&gt;
==Additional Information==&lt;br /&gt;
The USITC provides more information than just the 337 notices.&lt;br /&gt;
&lt;br /&gt;
Here is information and a database on Section 701/731&lt;br /&gt;
 https://www.usitc.gov/trade_remedy/trade_research_tools&lt;br /&gt;
 https://pubapps2.usitc.gov/sunset/&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=New Work=&lt;br /&gt;
==USITC 337 Cases Tab Delimited Text==&lt;br /&gt;
USITC patent information was gathered from the investigations.json file downloaded from the USITC website (https://pubapps2.usitc.gov/337external/, Click on Cases Instituted After 2008).&lt;br /&gt;
This contains information on 337 cases and their respondents/complainants and the patents that were part of the case. &lt;br /&gt;
The code and results for this program are here:&lt;br /&gt;
 Projects/USITC/JSON_scraping_python&lt;br /&gt;
The program grabs the information, places it into lists of lists in Python, and then writes to the file names listed below. The files do not have headers and null values are set to be empty strings.&lt;br /&gt;
To create the tab-delimited text files, run code.py in the JSON_scraping_python directory. All of the file names are hard-coded in it. It creates the following files:&lt;br /&gt;
 investigation_info.txt&lt;br /&gt;
 Schema for this file is id, title, investigation number, investigation type, docket number, date of publication notice&lt;br /&gt;
&lt;br /&gt;
 complainant_info.txt&lt;br /&gt;
 Schema for this file is investigation id, investigation number, complainant name, complainant outside party ID, complainant city, complainant country&lt;br /&gt;
&lt;br /&gt;
 respondent_info.txt&lt;br /&gt;
 Schema for this file is investigation id, investigation number, respondent outside party ID, respondent name, respondent city, respondent country&lt;br /&gt;
&lt;br /&gt;
 patent_info.txt&lt;br /&gt;
 Schema for this file is investigation number, patent ID, patent number, active date, inactive date&lt;br /&gt;
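&lt;br /&gt;
Because the files have no headers, column names have to be supplied when reading them back. Here is a minimal sketch, assuming the column order matches the schemas listed above. The short column labels and the helper read_rows are made up for illustration and are not part of code.py, and the sample row is invented:&lt;br /&gt;

```python
import csv
import io

# Hypothetical labels for the columns, in the order given by the schemas above.
SCHEMAS = {
    "investigation_info.txt": [
        "id", "title", "investigation_number", "investigation_type",
        "docket_number", "date_of_publication_notice",
    ],
    "complainant_info.txt": [
        "investigation_id", "investigation_number", "complainant_name",
        "complainant_outside_party_id", "complainant_city", "complainant_country",
    ],
    "respondent_info.txt": [
        "investigation_id", "investigation_number", "respondent_outside_party_id",
        "respondent_name", "respondent_city", "respondent_country",
    ],
    "patent_info.txt": [
        "investigation_number", "patent_id", "patent_number",
        "active_date", "inactive_date",
    ],
}


def read_rows(handle, columns):
    """Yield one dict per line of a headerless tab-delimited file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        # Null values were written as empty strings, so map them back to None.
        yield {name: (value or None) for name, value in zip(columns, fields)}


# In-memory stand-in for patent_info.txt; the values are invented.
sample = io.StringIO("966\t1\t1234567\t2015-09-24\t\n")
rows = list(read_rows(sample, SCHEMAS["patent_info.txt"]))
print(rows[0]["patent_number"], rows[0]["inactive_date"])
```

For a real file, the StringIO object would be replaced by open(path, newline="") on one of the four output files.&lt;br /&gt;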
&lt;br /&gt;
==XML Information==&lt;br /&gt;
UPDATE: The JSON file of the data was used to convert to tab-delimited text.&lt;br /&gt;
There is an XML file that contains information on investigations. To get it, go to the link below and click the 'xml' link under the 'Cases instituted after October 2008' tab.&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/&lt;br /&gt;
&lt;br /&gt;
The information in this file can be extracted with an XML parser. For each investigation, we can find:&lt;br /&gt;
* Investigation number, ex - &amp;lt;entry key=&amp;quot;investigationNo&amp;quot;&amp;gt;966&amp;lt;/entry&amp;gt;&lt;br /&gt;
* Date of publication notice, ex - &amp;lt;entry key=&amp;quot;dateOfPublicationFrNotice&amp;quot;&amp;gt;2015-09-24T04:00:00.000Z&amp;lt;/entry&amp;gt;&lt;br /&gt;
* Title, ex - &amp;lt;entry key=&amp;quot;title&amp;quot;&amp;gt;Silicon-on-Insulator Wafers&amp;lt;/entry&amp;gt;&lt;br /&gt;
* Patent numbers, found under &amp;lt;entry key=&amp;quot;patentNumbers&amp;quot;&amp;gt;&lt;br /&gt;
* Investigation type, ex - &amp;lt;entry key=&amp;quot;investigationType&amp;quot;&amp;gt;Violation&amp;lt;/entry&amp;gt;&lt;br /&gt;
* Respondents, found under &amp;lt;entry key=&amp;quot;respondent&amp;quot;&amp;gt;&lt;br /&gt;
* Complainants, found under &amp;lt;entry key=&amp;quot;complainant&amp;quot;&amp;gt;&lt;br /&gt;
Additional information can also be gathered from the XML document.&lt;br /&gt;
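&lt;br /&gt;
The entry lookups listed above can be sketched with Python's standard ElementTree parser. This is only an illustration: the wrapper tag names "investigations" and "investigation" are assumptions (only the entry elements and their key attributes come from the file), and the sample document is built in code rather than downloaded:&lt;br /&gt;

```python
import xml.etree.ElementTree as ET


def entries_by_key(investigation):
    """Map the key attribute of each direct child entry to its stripped text."""
    return {
        entry.get("key"): (entry.text or "").strip()
        for entry in investigation.findall("entry")
    }


# Build a small stand-in document instead of downloading the real file.
# The wrapper tag names "investigations" and "investigation" are assumptions.
root = ET.Element("investigations")
investigation = ET.SubElement(root, "investigation")
sample_values = [
    ("investigationNo", "966"),
    ("dateOfPublicationFrNotice", "2015-09-24T04:00:00.000Z"),
    ("title", "Silicon-on-Insulator Wafers"),
    ("investigationType", "Violation"),
]
for key, text in sample_values:
    ET.SubElement(investigation, "entry", {"key": key}).text = text

# Round-trip through text so the parsing step mirrors reading a real file.
parsed = ET.fromstring(ET.tostring(root, encoding="unicode"))
records = [entries_by_key(node) for node in parsed.findall("investigation")]
print(records[0]["investigationNo"], records[0]["title"])
```

Entries such as patentNumbers, respondent, and complainant appear to hold nested content, so they would need their own handling rather than a plain text lookup.&lt;br /&gt;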
&lt;br /&gt;
To find information on cases instituted prior to October 2008, go to the link above and click on 'Looking for cases instituted prior to October 2008?'; this downloads a CSV file.&lt;br /&gt;
* The Investigation Number, Title, Unfair Act Alleged, Patent Numbers, Complainants, and Respondents can be grabbed easily from the CSV.&lt;br /&gt;
* The Target Date and the Beginning and Ending Dates contain notes (some cases are extended and dates are changed), so we may need to do some text processing to grab this information.&lt;br /&gt;
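&lt;br /&gt;
The text processing mentioned above could start by pulling the dates out of a cell and keeping the remainder as a note. Here is a rough sketch, assuming dates appear as MM/DD/YYYY; the actual format in the CSV has not been checked here, and the function name and sample string are invented for illustration:&lt;br /&gt;

```python
import re
from datetime import datetime

# Assumed date format: MM/DD/YYYY. Adjust the pattern to what the CSV really uses.
DATE_PATTERN = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def split_dates_and_notes(cell):
    """Return (list of dates found in the cell, leftover note text)."""
    dates = [
        datetime(int(year), int(month), int(day)).date()
        for month, day, year in DATE_PATTERN.findall(cell)
    ]
    # Whatever is left after removing the dates is treated as the note.
    notes = DATE_PATTERN.sub("", cell).strip(" ;,()")
    return dates, notes


# Invented example of a date cell with an extension note.
dates, notes = split_dates_and_notes("03/15/2006 (extended to 06/15/2006)")
print([d.isoformat() for d in dates], notes)
```

A string that matches the pattern but is not a valid date raises ValueError here, which is a reasonable way to flag cells that need manual review.&lt;br /&gt;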
&lt;br /&gt;
==Old Work==&lt;br /&gt;
This is the work that was done before I knew about the XML files provided by the USITC (see Alternative Solutions section below)&lt;br /&gt;
&lt;br /&gt;
Here is a csv of the data that I have been able to scrape from the HTML of https://www.usitc.gov/secretary/fed_reg_notices/337.htm&lt;br /&gt;
The results.csv file is found here:&lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
For every notice paper, there is a line in the CSV file that&lt;br /&gt;
contains the Investigation Title, Investigation No., link to the PDF on the website, Notice description, and date the notice was issued&lt;br /&gt;
&lt;br /&gt;
I have also downloaded the PDFs from the website. These are the PDFs from the links in the CSV file. Some of the PDFs could not be downloaded. The PDFs are here:&lt;br /&gt;
 E:\McNair\Projects\USITC\pdfs_copy&lt;br /&gt;
&lt;br /&gt;
These files were downloaded using the script on the Postgres server. For some reason, there are issues downloading PDFs onto the remote Windows machine.&lt;br /&gt;
You must download the PDFs on the Postgres server and transfer them to the RDP. The script to download the PDFs is:&lt;br /&gt;
 128.42.44.182\bulk\USITC\download&lt;br /&gt;
&lt;br /&gt;
The PDFs can be scraped using the PDF scraper from a previous project, found here:&lt;br /&gt;
 E:\McNair\software\utilities\PDF_RIPPER&lt;br /&gt;
This file was modified to scrape all of the PDFs in the pdfs folder. The modified code is in the McNair/Project/USITC/ directory and is called pdf_to_text_bulk.py&lt;br /&gt;
&lt;br /&gt;
An example of a PDF that parses correctly: https://www.usitc.gov/secretary/fed_reg_notices/337/337_959_notice02062017sgl.pdf&lt;br /&gt;
 E:\McNair\Projects\USITC\Parsed_Texts\337_959_notice02062017sgl.txt&lt;br /&gt;
For some PDFs, the parsing does not work completely and the text is scrambled.&lt;br /&gt;
&lt;br /&gt;
I have started using NLTK to gather information about the filings. See code in Projects/USITC/ProcessingTexts&lt;br /&gt;
I am working on extracting the respondents from these documents.&lt;br /&gt;
&lt;br /&gt;
==Alternative Solutions==&lt;br /&gt;
Here is where you can get data on the USITC 337 notices instead of extracting this information from PDFs&lt;br /&gt;
&lt;br /&gt;
337Info - Unfair Import Investigations Information System&lt;br /&gt;
&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/&lt;br /&gt;
&lt;br /&gt;
This site has links to complaints. Most of the data that can be gathered is there, in a clean format that could easily be web scraped if needed. There is a query builder (link below). There is also a link that downloads all of the raw data in JSON or XML.&lt;br /&gt;
&lt;br /&gt;
Here are links to various  statistics we could use:&lt;br /&gt;
Section 337 Statistics: Settlement Rate Data&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics_settlement_rate_data.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
 https://www.usitc.gov/press_room/337_stats.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics: Number of New, Completed, and Active Investigations by Fiscal Year (Updated Quarterly)&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics_number_new_completed_and_active.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
This contains links to various other pages with statistics&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics.htm&lt;br /&gt;
&lt;br /&gt;
Here is a dictionary of terms used in these documents:&lt;br /&gt;
 https://www.usitc.gov/documents/337Info_ext_DataDictionary.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There are FAQs listed here &lt;br /&gt;
 https://www.usitc.gov/documents/337Info_FAQ.pdf&lt;br /&gt;
&lt;br /&gt;
There is a query builder for Section 337 notices here:&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/advanced&lt;br /&gt;
To use it, select fields from the GUI at the bottom of the page.&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=22245</id>
		<title>USITC</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=22245"/>
		<updated>2017-11-30T20:15:01Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: /* JSON Information */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=USITC Data&lt;br /&gt;
|Has owner=Harrison Brown&lt;br /&gt;
|Has start date=9/11/2017&lt;br /&gt;
|Has notes=In Progress&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Files==&lt;br /&gt;
&lt;br /&gt;
My files are in two places.&lt;br /&gt;
Development work was done here:&lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
I had to use the Postgres server to download the PDFs.&lt;br /&gt;
The Postgres server:&lt;br /&gt;
 128.42.44.182/bulk/USITC&lt;br /&gt;
&lt;br /&gt;
==Additional Information==&lt;br /&gt;
The USITC provides more information than just the 337 notices.&lt;br /&gt;
&lt;br /&gt;
Here is information and a database on Section 701/731&lt;br /&gt;
 https://www.usitc.gov/trade_remedy/trade_research_tools&lt;br /&gt;
 https://pubapps2.usitc.gov/sunset/&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=New Work=&lt;br /&gt;
==USITC 337 Cases Tab Delimited Text==&lt;br /&gt;
USITC patent information was gathered from the investigations.json file downloaded from the USITC website (https://pubapps2.usitc.gov/337external/, Click on Cases Instituted After 2008).&lt;br /&gt;
This contains information on 337 cases and their respondents/complainants and the patents that were part of the case. &lt;br /&gt;
The code and results for this program are here:&lt;br /&gt;
 Projects/USITC/JSON_scraping_python&lt;br /&gt;
The program grabs the information, places it into lists of lists in Python, and then writes to the file names listed below. The files do not have headers and null values are set to be empty strings.&lt;br /&gt;
To create the tab-delimited text files, run code.py in the JSON_scraping_python directory. All of the file names are hard-coded in it. It creates the following files:&lt;br /&gt;
 investigation_info.txt&lt;br /&gt;
 Schema for this file is id, title, investigation number, investigation type, docket number, date of publication notice&lt;br /&gt;
&lt;br /&gt;
 complainant_info.txt&lt;br /&gt;
 Schema for this file is investigation id, investigation number, complainant name, complainant outside party ID, complainant city, complainant country&lt;br /&gt;
&lt;br /&gt;
 respondent_info.txt&lt;br /&gt;
 Schema for this file is investigation id, investigation number, respondent outside party ID, respondent name, respondent city, respondent country&lt;br /&gt;
&lt;br /&gt;
 patent_info.txt&lt;br /&gt;
 Schema for this file is investigation number, patent ID, patent number, active date, inactive date&lt;br /&gt;
&lt;br /&gt;
==XML Information==&lt;br /&gt;
UPDATE: The JSON file of the data was used to convert to tab-delimited text.&lt;br /&gt;
There is an XML file that contains information on investigations. To get it, go to the link below and click the 'xml' link under the 'Cases instituted after October 2008' tab.&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/&lt;br /&gt;
&lt;br /&gt;
The information in this file can be extracted with an XML parser. For each investigation, we can find:&lt;br /&gt;
* Investigation number, ex - &amp;lt;entry key=&amp;quot;investigationNo&amp;quot;&amp;gt;966&amp;lt;/entry&amp;gt;&lt;br /&gt;
* Date of publication notice, ex - &amp;lt;entry key=&amp;quot;dateOfPublicationFrNotice&amp;quot;&amp;gt;2015-09-24T04:00:00.000Z&amp;lt;/entry&amp;gt;&lt;br /&gt;
* Title, ex - &amp;lt;entry key=&amp;quot;title&amp;quot;&amp;gt;Silicon-on-Insulator Wafers&amp;lt;/entry&amp;gt;&lt;br /&gt;
* Patent numbers, found under &amp;lt;entry key=&amp;quot;patentNumbers&amp;quot;&amp;gt;&lt;br /&gt;
* Investigation type, ex - &amp;lt;entry key=&amp;quot;investigationType&amp;quot;&amp;gt;Violation&amp;lt;/entry&amp;gt;&lt;br /&gt;
* Respondents, found under &amp;lt;entry key=&amp;quot;respondent&amp;quot;&amp;gt;&lt;br /&gt;
* Complainants, found under &amp;lt;entry key=&amp;quot;complainant&amp;quot;&amp;gt;&lt;br /&gt;
Additional information can also be gathered from the XML document.&lt;br /&gt;
&lt;br /&gt;
To find information on cases instituted prior to October 2008, go to the link above and click on 'Looking for cases instituted prior to October 2008?'; this downloads a CSV file.&lt;br /&gt;
* The Investigation Number, Title, Unfair Act Alleged, Patent Numbers, Complainants, and Respondents can be grabbed easily from the CSV.&lt;br /&gt;
* The Target Date and the Beginning and Ending Dates contain notes (some cases are extended and dates are changed), so we may need to do some text processing to grab this information.&lt;br /&gt;
&lt;br /&gt;
==Old Work==&lt;br /&gt;
This is the work that was done before I knew about the XML files provided by the USITC (see Alternative Solutions section below)&lt;br /&gt;
&lt;br /&gt;
Here is a csv of the data that I have been able to scrape from the HTML of https://www.usitc.gov/secretary/fed_reg_notices/337.htm&lt;br /&gt;
The results.csv file is found here:&lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
For every notice paper, there is a line in the CSV file that&lt;br /&gt;
contains the Investigation Title, Investigation No., link to the PDF on the website, Notice description, and date the notice was issued&lt;br /&gt;
&lt;br /&gt;
I have also downloaded the PDFs from the website. These are the PDFs from the links in the CSV file. Some of the PDFs could not be downloaded. The PDFs are here:&lt;br /&gt;
 E:\McNair\Projects\USITC\pdfs_copy&lt;br /&gt;
&lt;br /&gt;
These files were downloaded using the script on the Postgres server. For some reason, there are issues downloading PDFs onto the remote Windows machine.&lt;br /&gt;
You must download the PDFs on the Postgres server and transfer them to the RDP. The script to download the PDFs is:&lt;br /&gt;
 128.42.44.182\bulk\USITC\download&lt;br /&gt;
&lt;br /&gt;
The PDFs can be scraped using the PDF scraper from a previous project, found here:&lt;br /&gt;
 E:\McNair\software\utilities\PDF_RIPPER&lt;br /&gt;
This file was modified to scrape all of the PDFs in the pdfs folder. The modified code is in the McNair/Project/USITC/ directory and is called pdf_to_text_bulk.py&lt;br /&gt;
&lt;br /&gt;
An example of a PDF that parses correctly: https://www.usitc.gov/secretary/fed_reg_notices/337/337_959_notice02062017sgl.pdf&lt;br /&gt;
 E:\McNair\Projects\USITC\Parsed_Texts\337_959_notice02062017sgl.txt&lt;br /&gt;
For some PDFs, the parsing does not work completely and the text is scrambled.&lt;br /&gt;
&lt;br /&gt;
I have started using NLTK to gather information about the filings. See code in Projects/USITC/ProcessingTexts&lt;br /&gt;
I am working on extracting the respondents from these documents.&lt;br /&gt;
&lt;br /&gt;
==Alternative Solutions==&lt;br /&gt;
Here is where you can get data on the USITC 337 notices instead of extracting this information from PDFs&lt;br /&gt;
&lt;br /&gt;
337Info - Unfair Import Investigations Information System&lt;br /&gt;
&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/&lt;br /&gt;
&lt;br /&gt;
This site has links to complaints. Most of the data that can be gathered is there, in a clean format that could easily be web scraped if needed. There is a query builder (link below). There is also a link that downloads all of the raw data in JSON or XML.&lt;br /&gt;
&lt;br /&gt;
Here are links to various  statistics we could use:&lt;br /&gt;
Section 337 Statistics: Settlement Rate Data&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics_settlement_rate_data.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
 https://www.usitc.gov/press_room/337_stats.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics: Number of New, Completed, and Active Investigations by Fiscal Year (Updated Quarterly)&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics_number_new_completed_and_active.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
This contains links to various other pages with statistics&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics.htm&lt;br /&gt;
&lt;br /&gt;
Here is a dictionary of terms used in these documents:&lt;br /&gt;
 https://www.usitc.gov/documents/337Info_ext_DataDictionary.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There are FAQs listed here &lt;br /&gt;
 https://www.usitc.gov/documents/337Info_FAQ.pdf&lt;br /&gt;
&lt;br /&gt;
There is a query builder for Section 337 notices here:&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/advanced&lt;br /&gt;
To use it, select fields from the GUI at the bottom of the page.&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=22244</id>
		<title>USITC</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=22244"/>
		<updated>2017-11-30T20:11:46Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: /* JSON Information */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=USITC Data&lt;br /&gt;
|Has owner=Harrison Brown&lt;br /&gt;
|Has start date=9/11/2017&lt;br /&gt;
|Has notes=In Progress&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Files==&lt;br /&gt;
&lt;br /&gt;
My files are in two places.&lt;br /&gt;
Development work was done here:&lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
I had to use the Postgres server to download the PDFs.&lt;br /&gt;
The Postgres server:&lt;br /&gt;
 128.42.44.182/bulk/USITC&lt;br /&gt;
&lt;br /&gt;
==Additional Information==&lt;br /&gt;
The USITC provides more information than just the 337 notices.&lt;br /&gt;
&lt;br /&gt;
Here is information and a database on Section 701/731&lt;br /&gt;
 https://www.usitc.gov/trade_remedy/trade_research_tools&lt;br /&gt;
 https://pubapps2.usitc.gov/sunset/&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=New Work=&lt;br /&gt;
==JSON Information==&lt;br /&gt;
The tab-delimited text files are here:&lt;br /&gt;
 Projects/USITC/JSON_scraping_python&lt;br /&gt;
To create the tab-delimited text files, run code.py in the JSON_scraping_python directory. All of the file names are hard-coded in it. It creates the following files:&lt;br /&gt;
 investigation_info.txt&lt;br /&gt;
 Schema for this file is id, title, investigation number, investigation type, docket number, date of publication notice&lt;br /&gt;
&lt;br /&gt;
 complainant_info.txt&lt;br /&gt;
 Schema for this file is investigation id, investigation number, complainant name, complainant outside party ID, complainant city, complainant country&lt;br /&gt;
&lt;br /&gt;
 respondent_info.txt&lt;br /&gt;
 Schema for this file is investigation id, investigation number, respondent outside party ID, respondent name, respondent city, respondent country&lt;br /&gt;
&lt;br /&gt;
 patent_info.txt&lt;br /&gt;
 Schema for this file is investigation number, patent ID, patent number, active date, inactive date&lt;br /&gt;
&lt;br /&gt;
This information was gathered from the investigations.json file downloaded from the USITC website (https://pubapps2.usitc.gov/337external/, Click on Cases Instituted After 2008)&lt;br /&gt;
The program grabs the information, places it into lists of lists, and then writes to the file names listed above.&lt;br /&gt;
&lt;br /&gt;
==XML Information==&lt;br /&gt;
UPDATE: The JSON file of the data was used to convert to tab-delimited text.&lt;br /&gt;
There is an XML file that contains information on investigations. To get it, go to the link below and click the 'xml' link under the 'Cases instituted after October 2008' tab.&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/&lt;br /&gt;
&lt;br /&gt;
The information in this file can be extracted with an XML parser. For each investigation, we can find:&lt;br /&gt;
* Investigation number, ex - &amp;lt;entry key=&amp;quot;investigationNo&amp;quot;&amp;gt;966&amp;lt;/entry&amp;gt;&lt;br /&gt;
* Date of publication notice, ex - &amp;lt;entry key=&amp;quot;dateOfPublicationFrNotice&amp;quot;&amp;gt;2015-09-24T04:00:00.000Z&amp;lt;/entry&amp;gt;&lt;br /&gt;
* Title, ex - &amp;lt;entry key=&amp;quot;title&amp;quot;&amp;gt;Silicon-on-Insulator Wafers&amp;lt;/entry&amp;gt;&lt;br /&gt;
* Patent numbers, found under &amp;lt;entry key=&amp;quot;patentNumbers&amp;quot;&amp;gt;&lt;br /&gt;
* Investigation type, ex - &amp;lt;entry key=&amp;quot;investigationType&amp;quot;&amp;gt;Violation&amp;lt;/entry&amp;gt;&lt;br /&gt;
* Respondents, found under &amp;lt;entry key=&amp;quot;respondent&amp;quot;&amp;gt;&lt;br /&gt;
* Complainants, found under &amp;lt;entry key=&amp;quot;complainant&amp;quot;&amp;gt;&lt;br /&gt;
Additional information can also be gathered from the XML document.&lt;br /&gt;
&lt;br /&gt;
To find information on cases instituted prior to October 2008, go to the link above and click on 'Looking for cases instituted prior to October 2008?'; this downloads a CSV file.&lt;br /&gt;
* The Investigation Number, Title, Unfair Act Alleged, Patent Numbers, Complainants, and Respondents can be grabbed easily from the CSV.&lt;br /&gt;
* The Target Date and the Beginning and Ending Dates contain notes (some cases are extended and dates are changed), so we may need to do some text processing to grab this information.&lt;br /&gt;
&lt;br /&gt;
==Old Work==&lt;br /&gt;
This is the work that was done before I knew about the XML files provided by the USITC (see Alternative Solutions section below)&lt;br /&gt;
&lt;br /&gt;
Here is a csv of the data that I have been able to scrape from the HTML of https://www.usitc.gov/secretary/fed_reg_notices/337.htm&lt;br /&gt;
The results.csv file is found here:&lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
For every notice paper, there is a line in the CSV file that&lt;br /&gt;
contains the Investigation Title, Investigation No., link to the PDF on the website, Notice description, and date the notice was issued&lt;br /&gt;
&lt;br /&gt;
I have also downloaded the PDFs from the website. These are the PDFs from the links in the CSV file. Some of the PDFs could not be downloaded. The PDFs are here:&lt;br /&gt;
 E:\McNair\Projects\USITC\pdfs_copy&lt;br /&gt;
&lt;br /&gt;
These files were downloaded using the script on the Postgres server. For some reason, there are issues downloading PDFs onto the remote Windows machine.&lt;br /&gt;
You must download the PDFs on the Postgres server and transfer them to the RDP. The script to download the PDFs is:&lt;br /&gt;
 128.42.44.182\bulk\USITC\download&lt;br /&gt;
&lt;br /&gt;
The PDFs can be scraped using the PDF scraper from a previous project, found here:&lt;br /&gt;
 E:\McNair\software\utilities\PDF_RIPPER&lt;br /&gt;
This file was modified to scrape all of the PDFs in the pdfs folder. The modified code is in the McNair/Project/USITC/ directory and is called pdf_to_text_bulk.py&lt;br /&gt;
&lt;br /&gt;
An example of a PDF that parses correctly: https://www.usitc.gov/secretary/fed_reg_notices/337/337_959_notice02062017sgl.pdf&lt;br /&gt;
 E:\McNair\Projects\USITC\Parsed_Texts\337_959_notice02062017sgl.txt&lt;br /&gt;
For some PDFs, the parsing does not work completely and the text is scrambled.&lt;br /&gt;
&lt;br /&gt;
I have started using NLTK to gather information about the filings. See code in Projects/USITC/ProcessingTexts&lt;br /&gt;
I am working on extracting the respondents from these documents.&lt;br /&gt;
&lt;br /&gt;
==Alternative Solutions==&lt;br /&gt;
Here is where you can get data on the USITC 337 notices instead of extracting this information from PDFs&lt;br /&gt;
&lt;br /&gt;
337Info - Unfair Import Investigations Information System&lt;br /&gt;
&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/&lt;br /&gt;
&lt;br /&gt;
This site has links to complaints. Most of the data that can be gathered is there, in a clean format that could easily be web scraped if needed. There is a query builder (link below). There is also a link that downloads all of the raw data in JSON or XML.&lt;br /&gt;
&lt;br /&gt;
Here are links to various  statistics we could use:&lt;br /&gt;
Section 337 Statistics: Settlement Rate Data&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics_settlement_rate_data.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
 https://www.usitc.gov/press_room/337_stats.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics: Number of New, Completed, and Active Investigations by Fiscal Year (Updated Quarterly)&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics_number_new_completed_and_active.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
This contains links to various other pages with statistics&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics.htm&lt;br /&gt;
&lt;br /&gt;
Here is a dictionary of terms used in these documents:&lt;br /&gt;
 https://www.usitc.gov/documents/337Info_ext_DataDictionary.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There are FAQs listed here &lt;br /&gt;
 https://www.usitc.gov/documents/337Info_FAQ.pdf&lt;br /&gt;
&lt;br /&gt;
There is a query builder for Section 337 notices here:&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/advanced&lt;br /&gt;
To use it, select fields from the GUI at the bottom of the page.&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=22243</id>
		<title>USITC</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=22243"/>
		<updated>2017-11-30T20:08:06Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: /* JSON Information */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=USITC Data&lt;br /&gt;
|Has owner=Harrison Brown&lt;br /&gt;
|Has start date=9/11/2017&lt;br /&gt;
|Has notes=In Progress&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Files==&lt;br /&gt;
&lt;br /&gt;
My files are in two places.&lt;br /&gt;
Development work was done here:&lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
I had to use the Postgres server to download the PDFs.&lt;br /&gt;
The Postgres server:&lt;br /&gt;
 128.42.44.182/bulk/USITC&lt;br /&gt;
&lt;br /&gt;
==Additional Information==&lt;br /&gt;
The USITC provides more information than just the 337 notices.&lt;br /&gt;
&lt;br /&gt;
Here is information and a database on Section 701/731&lt;br /&gt;
 https://www.usitc.gov/trade_remedy/trade_research_tools&lt;br /&gt;
 https://pubapps2.usitc.gov/sunset/&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=New Work=&lt;br /&gt;
==JSON Information==&lt;br /&gt;
The tab-delimited text files are here:&lt;br /&gt;
 Projects/USITC/JSON_scraping_python&lt;br /&gt;
To create the tab-delimited text files, run code.py in the JSON_scraping_python directory. All of the file names are hard-coded in it. It creates the following files:&lt;br /&gt;
 investigation_info.txt&lt;br /&gt;
 Schema for this file is id, title, investigation number, investigation type, docket number, date of publication notice&lt;br /&gt;
&lt;br /&gt;
 complainant_info.txt&lt;br /&gt;
 Schema for this file is investigation id, investigation number, complainant name, complainant outside party ID, complainant city, complainant country&lt;br /&gt;
&lt;br /&gt;
 respondent_info.txt&lt;br /&gt;
 Schema for this file is investigation id, investigation number, respondent outside party ID, respondent name, respondent city, respondent country&lt;br /&gt;
&lt;br /&gt;
 patent_info.txt&lt;br /&gt;
 Schema for this file is investigation number, patent ID, patent number, active date, inactive date&lt;br /&gt;
&lt;br /&gt;
==XML Information==&lt;br /&gt;
UPDATE: The JSON file of the data was used to convert to tab-delimited text.&lt;br /&gt;
There is an XML file that contains information on investigations. To get it, go to the link below and click the 'xml' link under the 'Cases instituted after October 2008' tab.&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/&lt;br /&gt;
&lt;br /&gt;
The information in this file can be extracted with an XML parser. For each investigation, we can find:&lt;br /&gt;
* Investigation number, ex - &amp;lt;entry key=&amp;quot;investigationNo&amp;quot;&amp;gt;966&amp;lt;/entry&amp;gt;&lt;br /&gt;
* Date of publication notice, ex - &amp;lt;entry key=&amp;quot;dateOfPublicationFrNotice&amp;quot;&amp;gt;2015-09-24T04:00:00.000Z&amp;lt;/entry&amp;gt;&lt;br /&gt;
* Title, ex - &amp;lt;entry key=&amp;quot;title&amp;quot;&amp;gt;Silicon-on-Insulator Wafers&amp;lt;/entry&amp;gt;&lt;br /&gt;
* Patent numbers, found under &amp;lt;entry key=&amp;quot;patentNumbers&amp;quot;&amp;gt;&lt;br /&gt;
* Investigation type, ex - &amp;lt;entry key=&amp;quot;investigationType&amp;quot;&amp;gt;Violation&amp;lt;/entry&amp;gt;&lt;br /&gt;
* Respondents, found under &amp;lt;entry key=&amp;quot;respondent&amp;quot;&amp;gt;&lt;br /&gt;
* Complainants, found under &amp;lt;entry key=&amp;quot;complainant&amp;quot;&amp;gt;&lt;br /&gt;
Additional information can also be gathered from the XML document.&lt;br /&gt;
&lt;br /&gt;
To find information on cases instituted prior to October 2008, go to the link above and click on 'Looking for cases instituted prior to October 2008?'; this downloads a CSV file.&lt;br /&gt;
* The Investigation Number, Title, Unfair Act Alleged, Patent Numbers, Complainants, and Respondents can be grabbed easily from the CSV.&lt;br /&gt;
* The Target Date and the Beginning and Ending Dates contain notes (some cases are extended and dates are changed), so we may need to do some text processing to grab this information.&lt;br /&gt;
&lt;br /&gt;
==Old Work==&lt;br /&gt;
This is the work that was done before I knew about the XML files provided by the USITC (see Alternative Solutions section below)&lt;br /&gt;
&lt;br /&gt;
Here is a CSV of the data that I was able to scrape from the HTML of https://www.usitc.gov/secretary/fed_reg_notices/337.htm&lt;br /&gt;
The results.csv file is found here:&lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
For every notice, there is a line in the CSV file that contains the Investigation Title, Investigation No., link to the PDF on the website, Notice description, and date the notice was issued.&lt;br /&gt;
&lt;br /&gt;
I have also downloaded the PDFs from the website, using the links in the CSV file. Some of the PDFs could not be downloaded. The PDFs are here:&lt;br /&gt;
 E:\McNair\Projects\USITC\pdfs_copy&lt;br /&gt;
&lt;br /&gt;
These files were downloaded using the script on the Postgres server. For some reason, there are issues downloading PDFs directly onto the remote Windows machine,&lt;br /&gt;
so you must download the PDFs on the Postgres server and transfer them to the RDP machine. The script to download the PDFs is:&lt;br /&gt;
 128.42.44.182\bulk\USITC\download&lt;br /&gt;
&lt;br /&gt;
The PDFs can be scraped with the PDF scraper from a previous project, found here:&lt;br /&gt;
 E:\McNair\software\utilities\PDF_RIPPER&lt;br /&gt;
The scraper was modified to process all of the PDFs in the pdfs folder. The modified code is in the McNair/Projects/USITC/ directory and is called pdf_to_text_bulk.py&lt;br /&gt;
&lt;br /&gt;
An example of a PDF that parses correctly: https://www.usitc.gov/secretary/fed_reg_notices/337/337_959_notice02062017sgl.pdf&lt;br /&gt;
 E:\McNair\Projects\USITC\Parsed_Texts\337_959_notice02062017sgl.txt&lt;br /&gt;
For some PDFs, the parsing does not work completely and the text is scrambled.&lt;br /&gt;
&lt;br /&gt;
I have started using NLTK to gather information about the filings. See code in Projects/USITC/ProcessingTexts&lt;br /&gt;
I am working on extracting the respondents from these documents.&lt;br /&gt;
&lt;br /&gt;
==Alternative Solutions==&lt;br /&gt;
Here is where you can get data on the USITC 337 notices instead of extracting this information from PDFs&lt;br /&gt;
&lt;br /&gt;
337Info - Unfair Import Investigations Information System&lt;br /&gt;
&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/&lt;br /&gt;
&lt;br /&gt;
This page links to the complaints. Most of the data that can be gathered is there, in a clean format that could easily be web scraped if needed. There is a query builder (link below), and also a link that downloads all of the raw data in JSON or XML.&lt;br /&gt;
&lt;br /&gt;
Here are links to various statistics we could use:&lt;br /&gt;
Section 337 Statistics: Settlement Rate Data&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics_settlement_rate_data.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
 https://www.usitc.gov/press_room/337_stats.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics: Number of New, Completed, and Active Investigations by Fiscal Year (Updated Quarterly)&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics_number_new_completed_and_active.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
This contains links to various other pages with statistics&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics.htm&lt;br /&gt;
&lt;br /&gt;
Here is a dictionary of terms used in these Documents&lt;br /&gt;
 https://www.usitc.gov/documents/337Info_ext_DataDictionary.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There are FAQs listed here &lt;br /&gt;
 https://www.usitc.gov/documents/337Info_FAQ.pdf&lt;br /&gt;
&lt;br /&gt;
There is a query builder for Section 337 notices here:&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/advanced&lt;br /&gt;
To use it, select fields from the GUI at the bottom of the page.&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=22242</id>
		<title>USITC</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=22242"/>
		<updated>2017-11-30T20:06:37Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=USITC Data&lt;br /&gt;
|Has owner=Harrison Brown&lt;br /&gt;
|Has start date=9/11/2017&lt;br /&gt;
|Has notes=In Progress&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Files==&lt;br /&gt;
&lt;br /&gt;
My files are in two different places.&lt;br /&gt;
Development work was done here:&lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
I had to use the Postgres SQL server to download the PDFs.&lt;br /&gt;
The Postgres SQL server:&lt;br /&gt;
 128.42.44.182/bulk/USITC&lt;br /&gt;
&lt;br /&gt;
==Additional Information==&lt;br /&gt;
There is more information that the USITC provides besides 337 notices. &lt;br /&gt;
&lt;br /&gt;
Here is information and a database on Section 701/731&lt;br /&gt;
 https://www.usitc.gov/trade_remedy/trade_research_tools&lt;br /&gt;
 https://pubapps2.usitc.gov/sunset/&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=New Work=&lt;br /&gt;
==JSON Information==&lt;br /&gt;
The tab-delimited text files are here:&lt;br /&gt;
 Projects/USITC/JSON_scraping_python&lt;br /&gt;
To create the tab-delimited text files, run code.py in the JSON_scraping_python directory. All of the file names are hard-coded in it. It will create:&lt;br /&gt;
 investigation_info.txt &lt;br /&gt;
 Schema for this file is id, title, investigation number, investigation type, docket number, date of publication notice&lt;br /&gt;
&lt;br /&gt;
 complainant_info.txt&lt;br /&gt;
 Schema for this file is investigation id, investigation number, complainant name, complainant outside party ID, comp_city, comp country&lt;br /&gt;
 &lt;br /&gt;
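A minimal sketch of the JSON to tab-delimited step for the investigation_info.txt schema above. This is not the actual code.py; the JSON key names 'id' and 'docketNo', the sample record, and the helper records_to_tsv are assumptions, and only the other key names are taken from the field names seen in the USITC data.&lt;br /&gt;

```python
import csv
import io
import json

# Column order follows the investigation_info.txt schema.
# 'id' and 'docketNo' are assumed key names.
COLUMNS = [
    'id',
    'title',
    'investigationNo',
    'investigationType',
    'docketNo',
    'dateOfPublicationFrNotice',
]


def records_to_tsv(records, columns):
    # Write one tab-separated line per record; missing keys become empty cells.
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter='\t', lineterminator='\n')
    for record in records:
        writer.writerow([record.get(column, '') for column in columns])
    return buffer.getvalue()


# Hypothetical one-record payload standing in for the downloaded JSON.
raw = json.dumps([{
    'id': 1,
    'title': 'Silicon-on-Insulator Wafers',
    'investigationNo': '966',
    'investigationType': 'Violation',
}])
tsv = records_to_tsv(json.loads(raw), COLUMNS)
print(tsv)
```

Using csv.writer instead of joining strings by hand means any tab or newline inside a title gets quoted rather than silently breaking the column layout.&lt;br /&gt;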
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==XML Information==&lt;br /&gt;
UPDATE: the JSON data file was used to produce the tab-delimited text (see JSON Information above).&lt;br /&gt;
There is an XML file that contains information on investigations. To get it, go to the link below and click the 'xml' link under the 'Cases instituted after October 2008' tab.&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/&lt;br /&gt;
&lt;br /&gt;
The information in this file can be extracted with an XML parser. For each investigation, we can find:&lt;br /&gt;
* Investigation number, ex - &amp;lt;entry key=&amp;quot;investigationNo&amp;quot;&amp;gt;966&amp;lt;/entry&amp;gt;&lt;br /&gt;
* Date of publication notice, ex - &amp;lt;entry key=&amp;quot;dateOfPublicationFrNotice&amp;quot;&amp;gt;2015-09-24T04:00:00.000Z&amp;lt;/entry&amp;gt;&lt;br /&gt;
* Title, ex - &amp;lt;entry key=&amp;quot;title&amp;quot;&amp;gt;Silicon-on-Insulator Wafers&amp;lt;/entry&amp;gt;&lt;br /&gt;
* Patent numbers, under &amp;lt;entry key=&amp;quot;patentNumbers&amp;quot;&amp;gt;&lt;br /&gt;
* Investigation type, ex - &amp;lt;entry key=&amp;quot;investigationType&amp;quot;&amp;gt;Violation&amp;lt;/entry&amp;gt;&lt;br /&gt;
* Respondents, under &amp;lt;entry key=&amp;quot;respondent&amp;quot;&amp;gt;&lt;br /&gt;
* Complainants, under &amp;lt;entry key=&amp;quot;complainant&amp;quot;&amp;gt;&lt;br /&gt;
Additional information can also be gathered from the XML document&lt;br /&gt;
&lt;br /&gt;
To find information on cases instituted prior to October 2008, go to the link above and click on 'Looking for cases instituted prior to October 2008?'; it will download a CSV file.&lt;br /&gt;
* The investigation number, title, unfair act alleged, patent numbers, complainants, and respondents can be read directly from the CSV&lt;br /&gt;
* The Target Date and the Beginning and Ending Dates contain notes (some cases are extended and their dates are changed), so we may need to do some text processing to extract this information&lt;br /&gt;
&lt;br /&gt;
==Old Work==&lt;br /&gt;
This is the work that was done before I knew about the XML files provided by the USITC (see Alternative Solutions section below)&lt;br /&gt;
&lt;br /&gt;
Here is a CSV of the data that I was able to scrape from the HTML of https://www.usitc.gov/secretary/fed_reg_notices/337.htm&lt;br /&gt;
The results.csv file is found here:&lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
For every notice, there is a line in the CSV file that contains the Investigation Title, Investigation No., link to the PDF on the website, Notice description, and date the notice was issued.&lt;br /&gt;
&lt;br /&gt;
I have also downloaded the PDFs from the website, using the links in the CSV file. Some of the PDFs could not be downloaded. The PDFs are here:&lt;br /&gt;
 E:\McNair\Projects\USITC\pdfs_copy&lt;br /&gt;
&lt;br /&gt;
These files were downloaded using the script on the Postgres server. For some reason, there are issues downloading PDFs directly onto the remote Windows machine,&lt;br /&gt;
so you must download the PDFs on the Postgres server and transfer them to the RDP machine. The script to download the PDFs is:&lt;br /&gt;
 128.42.44.182\bulk\USITC\download&lt;br /&gt;
&lt;br /&gt;
The PDFs can be scraped with the PDF scraper from a previous project, found here:&lt;br /&gt;
 E:\McNair\software\utilities\PDF_RIPPER&lt;br /&gt;
The scraper was modified to process all of the PDFs in the pdfs folder. The modified code is in the McNair/Projects/USITC/ directory and is called pdf_to_text_bulk.py&lt;br /&gt;
&lt;br /&gt;
An example of a PDF that parses correctly: https://www.usitc.gov/secretary/fed_reg_notices/337/337_959_notice02062017sgl.pdf&lt;br /&gt;
 E:\McNair\Projects\USITC\Parsed_Texts\337_959_notice02062017sgl.txt&lt;br /&gt;
For some PDFs, the parsing does not work completely and the text is scrambled.&lt;br /&gt;
&lt;br /&gt;
I have started using NLTK to gather information about the filings. See code in Projects/USITC/ProcessingTexts&lt;br /&gt;
I am working on extracting the respondents from these documents.&lt;br /&gt;
&lt;br /&gt;
==Alternative Solutions==&lt;br /&gt;
Here is where you can get data on the USITC 337 notices instead of extracting this information from PDFs&lt;br /&gt;
&lt;br /&gt;
337Info - Unfair Import Investigations Information System&lt;br /&gt;
&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/&lt;br /&gt;
&lt;br /&gt;
This page links to the complaints. Most of the data that can be gathered is there, in a clean format that could easily be web scraped if needed. There is a query builder (link below), and also a link that downloads all of the raw data in JSON or XML.&lt;br /&gt;
&lt;br /&gt;
Here are links to various statistics we could use:&lt;br /&gt;
Section 337 Statistics: Settlement Rate Data&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics_settlement_rate_data.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
 https://www.usitc.gov/press_room/337_stats.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics: Number of New, Completed, and Active Investigations by Fiscal Year (Updated Quarterly)&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics_number_new_completed_and_active.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
This contains links to various other pages with statistics&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics.htm&lt;br /&gt;
&lt;br /&gt;
Here is a dictionary of terms used in these Documents&lt;br /&gt;
 https://www.usitc.gov/documents/337Info_ext_DataDictionary.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There are FAQs listed here &lt;br /&gt;
 https://www.usitc.gov/documents/337Info_FAQ.pdf&lt;br /&gt;
&lt;br /&gt;
There is a query builder for Section 337 notices here:&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/advanced&lt;br /&gt;
To use it, select fields from the GUI at the bottom of the page.&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=22240</id>
		<title>USITC</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=22240"/>
		<updated>2017-11-30T20:01:48Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: /* New Work */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=USITC Data&lt;br /&gt;
|Has owner=Harrison Brown&lt;br /&gt;
|Has start date=9/11/2017&lt;br /&gt;
|Has notes=In Progress&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Files==&lt;br /&gt;
&lt;br /&gt;
My files are in two different places.&lt;br /&gt;
Development work was done here:&lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
I had to use the Postgres SQL server to download the PDFs.&lt;br /&gt;
The Postgres SQL server:&lt;br /&gt;
 128.42.44.182/bulk/USITC&lt;br /&gt;
&lt;br /&gt;
==Additional Information==&lt;br /&gt;
There is more information that the USITC provides besides 337 notices. &lt;br /&gt;
&lt;br /&gt;
Here is information and a database on Section 701/731&lt;br /&gt;
 https://www.usitc.gov/trade_remedy/trade_research_tools&lt;br /&gt;
 https://pubapps2.usitc.gov/sunset/&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=New Work=&lt;br /&gt;
==JSON Information==&lt;br /&gt;
The tab-delimited text files are here&lt;br /&gt;
==XML Information==&lt;br /&gt;
There is an XML file that contains information on investigations. To get it, go to the link below and click the 'xml' link under the 'Cases instituted after October 2008' tab.&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/&lt;br /&gt;
&lt;br /&gt;
The information in this file can be extracted with an XML parser. For each investigation, we can find:&lt;br /&gt;
* Investigation number, ex - &amp;lt;entry key=&amp;quot;investigationNo&amp;quot;&amp;gt;966&amp;lt;/entry&amp;gt;&lt;br /&gt;
* Date of publication notice, ex - &amp;lt;entry key=&amp;quot;dateOfPublicationFrNotice&amp;quot;&amp;gt;2015-09-24T04:00:00.000Z&amp;lt;/entry&amp;gt;&lt;br /&gt;
* Title, ex - &amp;lt;entry key=&amp;quot;title&amp;quot;&amp;gt;Silicon-on-Insulator Wafers&amp;lt;/entry&amp;gt;&lt;br /&gt;
* Patent numbers, under &amp;lt;entry key=&amp;quot;patentNumbers&amp;quot;&amp;gt;&lt;br /&gt;
* Investigation type, ex - &amp;lt;entry key=&amp;quot;investigationType&amp;quot;&amp;gt;Violation&amp;lt;/entry&amp;gt;&lt;br /&gt;
* Respondents, under &amp;lt;entry key=&amp;quot;respondent&amp;quot;&amp;gt;&lt;br /&gt;
* Complainants, under &amp;lt;entry key=&amp;quot;complainant&amp;quot;&amp;gt;&lt;br /&gt;
Additional information can also be gathered from the XML document&lt;br /&gt;
&lt;br /&gt;
To find information on cases instituted prior to October 2008, go to the link above and click on 'Looking for cases instituted prior to October 2008?'; it will download a CSV file.&lt;br /&gt;
* The investigation number, title, unfair act alleged, patent numbers, complainants, and respondents can be read directly from the CSV&lt;br /&gt;
* The Target Date and the Beginning and Ending Dates contain notes (some cases are extended and their dates are changed), so we may need to do some text processing to extract this information&lt;br /&gt;
&lt;br /&gt;
==Old Work==&lt;br /&gt;
This is the work that was done before I knew about the XML files provided by the USITC (see Alternative Solutions section below)&lt;br /&gt;
&lt;br /&gt;
Here is a CSV of the data that I was able to scrape from the HTML of https://www.usitc.gov/secretary/fed_reg_notices/337.htm&lt;br /&gt;
The results.csv file is found here:&lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
For every notice, there is a line in the CSV file that contains the Investigation Title, Investigation No., link to the PDF on the website, Notice description, and date the notice was issued.&lt;br /&gt;
&lt;br /&gt;
I have also downloaded the PDFs from the website, using the links in the CSV file. Some of the PDFs could not be downloaded. The PDFs are here:&lt;br /&gt;
 E:\McNair\Projects\USITC\pdfs_copy&lt;br /&gt;
&lt;br /&gt;
These files were downloaded using the script on the Postgres server. For some reason, there are issues downloading PDFs directly onto the remote Windows machine,&lt;br /&gt;
so you must download the PDFs on the Postgres server and transfer them to the RDP machine. The script to download the PDFs is:&lt;br /&gt;
 128.42.44.182\bulk\USITC\download&lt;br /&gt;
&lt;br /&gt;
The PDFs can be scraped with the PDF scraper from a previous project, found here:&lt;br /&gt;
 E:\McNair\software\utilities\PDF_RIPPER&lt;br /&gt;
The scraper was modified to process all of the PDFs in the pdfs folder. The modified code is in the McNair/Projects/USITC/ directory and is called pdf_to_text_bulk.py&lt;br /&gt;
&lt;br /&gt;
An example of a PDF that parses correctly: https://www.usitc.gov/secretary/fed_reg_notices/337/337_959_notice02062017sgl.pdf&lt;br /&gt;
 E:\McNair\Projects\USITC\Parsed_Texts\337_959_notice02062017sgl.txt&lt;br /&gt;
For some PDFs, the parsing does not work completely and the text is scrambled.&lt;br /&gt;
&lt;br /&gt;
I have started using NLTK to gather information about the filings. See code in Projects/USITC/ProcessingTexts&lt;br /&gt;
I am working on extracting the respondents from these documents.&lt;br /&gt;
&lt;br /&gt;
==Alternative Solutions==&lt;br /&gt;
Here is where you can get data on the USITC 337 notices instead of extracting this information from PDFs&lt;br /&gt;
&lt;br /&gt;
337Info - Unfair Import Investigations Information System&lt;br /&gt;
&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/&lt;br /&gt;
&lt;br /&gt;
This page links to the complaints. Most of the data that can be gathered is there, in a clean format that could easily be web scraped if needed. There is a query builder (link below), and also a link that downloads all of the raw data in JSON or XML.&lt;br /&gt;
&lt;br /&gt;
Here are links to various statistics we could use:&lt;br /&gt;
Section 337 Statistics: Settlement Rate Data&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics_settlement_rate_data.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
 https://www.usitc.gov/press_room/337_stats.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics: Number of New, Completed, and Active Investigations by Fiscal Year (Updated Quarterly)&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics_number_new_completed_and_active.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
This contains links to various other pages with statistics&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics.htm&lt;br /&gt;
&lt;br /&gt;
Here is a dictionary of terms used in these Documents&lt;br /&gt;
 https://www.usitc.gov/documents/337Info_ext_DataDictionary.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There are FAQs listed here &lt;br /&gt;
 https://www.usitc.gov/documents/337Info_FAQ.pdf&lt;br /&gt;
&lt;br /&gt;
There is a query builder for Section 337 notices here:&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/advanced&lt;br /&gt;
To use it, select fields from the GUI at the bottom of the page.&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Harrison_Brown_(Work_Log)&amp;diff=22239</id>
		<title>Harrison Brown (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Harrison_Brown_(Work_Log)&amp;diff=22239"/>
		<updated>2017-11-30T20:00:42Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: /* Fall 2017 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;===Fall 2017===&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt;&lt;br /&gt;
[[Harrison Brown]] [[Work Logs]] [[Harrison Brown (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
2017-11-29:&lt;br /&gt;
*Got the tab-delimited text files written for USITC data. Added detail to project page. &lt;br /&gt;
&lt;br /&gt;
2017-11-29:&lt;br /&gt;
*Finishing up converting JSON to tab-delimited text, see USITC/JSON_scraping_python. Worked on creating images with ArcGIS&lt;br /&gt;
&lt;br /&gt;
2017-11-13:&lt;br /&gt;
*Worked on getting JSON to tab-delimited text&lt;br /&gt;
&lt;br /&gt;
2017-11-01:&lt;br /&gt;
*Looked at Oliver's code. Got git repository set up for the project on Bonobo. Started messing around with reading the XML documents in Java.&lt;br /&gt;
&lt;br /&gt;
2017-10-30:&lt;br /&gt;
*Worked on seeing what data can be gathered from the CSV and XML files. Started project page for project.&lt;br /&gt;
&lt;br /&gt;
2017-10-26:&lt;br /&gt;
*Met with Ed to talk about the direction of the project. Starting to work on extracting information from the XML files. Working on adding documentation to wiki and work log. Looking into work from other projects that may use XML.&lt;br /&gt;
&lt;br /&gt;
2017-10-25:&lt;br /&gt;
*Found information about a USITC database that we could use. Added this information to the wiki, and updated information on USITC wiki page.&lt;br /&gt;
&lt;br /&gt;
2017-10-19:&lt;br /&gt;
*Continued to look into NLTK. Talked with Ed about looking into alternative approaches to gathering this data.&lt;br /&gt;
&lt;br /&gt;
2017-10-18:&lt;br /&gt;
*Trying to figure out the best way to extract respondents from the documents. Right now, using NLTK exclusively will not get us any more accuracy than using regular expressions. Currently neither will allow us to match every entity correctly, so I am trying to figure out alternate approaches.&lt;br /&gt;
&lt;br /&gt;
2017-10-16:&lt;br /&gt;
*NLTK&lt;br /&gt;
** NLTK Information&lt;br /&gt;
*** Need to convert text to ASCII. Had issues with my PDF texts and had to convert them&lt;br /&gt;
*** Can use the sent_tokenize() function to split a document into sentences, which is easier than regular expressions&lt;br /&gt;
*** Use pos_tag() to tag the sentences. This can be used to extract proper nouns&lt;br /&gt;
**** Trying to figure out how to use this to grab location data from these documents&lt;br /&gt;
*** Worked with Peter to try to extract geographic information from the documents. We looked into tools Geograpy and GeoText. Geograpy does not have the functionality that we would like. GeoText looks to be better but we have issues with dependencies. Will try to resolve these next time.&lt;br /&gt;
&lt;br /&gt;
2017-10-11:&lt;br /&gt;
*Started to use NLTK library for gathering information to extract respondents. See code in Projects/USITC/ProcessingTexts&lt;br /&gt;
&lt;br /&gt;
2017-10-05:&lt;br /&gt;
*Made photos for the requested maps in ArcGIS with Peter and Jeemin.&lt;br /&gt;
        To access:&lt;br /&gt;
        Go to E:\McNair\Projects\Agglomeration\HarrisonPeterWorkArcGIS&lt;br /&gt;
         The photos can be found in there&lt;br /&gt;
        To generate the photos, open ArcMap with the beginMapArc file&lt;br /&gt;
        To generate a PNG, click File, then Export to export the photos&lt;br /&gt;
        To adjust the data, right-click on the table name in the layers lab, hit Properties, then Query Builder&lt;br /&gt;
&lt;br /&gt;
2017-10-04:&lt;br /&gt;
*Worked with Peter on connecting ArcGIS to the database and displaying different points in ArcGIS&lt;br /&gt;
&lt;br /&gt;
2017-10-02:&lt;br /&gt;
*Started work with ArcGIS. Got the data with startups from Houston into the ArcGIS application. For notes see McNair/Projects/Agglomeration&lt;br /&gt;
&lt;br /&gt;
2017-09-28:&lt;br /&gt;
*Helped Christy with set up on Postgres server. Looked through text documents to see what information I could gather. Looked at Stanford NLTK library for extracting the respondents from the documents.&lt;br /&gt;
&lt;br /&gt;
2017-09-28:&lt;br /&gt;
*Got the PDFs parsed to text. Some of the formatting is off; will need to determine if data can still be gathered.&lt;br /&gt;
&lt;br /&gt;
2017-09-25:&lt;br /&gt;
*Got 3000 PDFs downloaded. Script works. Completed a task to get emails for people who had written papers about economics and entrepreneurship. Started work on parsing the PDFs to text&lt;br /&gt;
&lt;br /&gt;
2017-09-20:&lt;br /&gt;
*Shell program did not work. Created a Python program that can catch all exceptions (URL does not exist, lost connection, and improperly formatted URL). Hopefully it will complete with no problems. This program is found on the database server under the USITC folder.&lt;br /&gt;
*Got connected to the database server and mounted the drive onto my computer. Got the list of all the PDFs on the website and started a shell script on the database server to download all of the PDFs. I will leave it running overnight; hopefully it completes by tomorrow.&lt;br /&gt;
&lt;br /&gt;
2017-09-17: &lt;br /&gt;
*Added features to python program to pull the dates in numerical form. Worked on pulling the PDFs from the website. Currently working on pulling them in Python. The program can run and pull PDFs on my local machine but it doesn't work on the Remote Desktop. I will work on this next time.&lt;br /&gt;
&lt;br /&gt;
2017-09-14: &lt;br /&gt;
*Have a python program that can scrape the entire webpage and navigate through all of the pages that contain section 337 documents. You can see these files and more information on the USITC project page. It can pull all of the information that is in the HTML that can be gathered for each case. The PDFs now need to be scraped; will start work on that next time. Generated a csv file with more than 4000 entries from the webpage. There is a small edge case I need to fix where the entry does not contain the Investigation No.&lt;br /&gt;
&lt;br /&gt;
2017-09-13: &lt;br /&gt;
*Worked on parsing the USITC website Section 337 Notices. Nearly have all of the data I can scrape. Scraper works, but there are a few edge cases where information in the tables is part of a Notice but does not have an Investigation Number. Will hopefully finish this next time. Also added my USITC project to the projects page; I did not have it linked.&lt;br /&gt;
&lt;br /&gt;
2017-09-11: &lt;br /&gt;
	Met with Dr. Egan and got assigned project. Set up project page USITC, started coding in Python for the web crawler. Look in McNair/Projects/USITC for project notes and code.&lt;br /&gt;
&lt;br /&gt;
2017-09-07: &lt;br /&gt;
	Set Up Work Log Pages, Slack, Microsoft Remote Desktop&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt;&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Harrison_Brown_(Work_Log)&amp;diff=22212</id>
		<title>Harrison Brown (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Harrison_Brown_(Work_Log)&amp;diff=22212"/>
		<updated>2017-11-29T21:03:54Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: /* Fall 2017 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;===Fall 2017===&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt;&lt;br /&gt;
[[Harrison Brown]] [[Work Logs]] [[Harrison Brown (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
2017-11-29:&lt;br /&gt;
*Finishing up converting JSON to tab-delimited text, see USITC/JSON_scraping_python&lt;br /&gt;
&lt;br /&gt;
2017-11-13:&lt;br /&gt;
*Worked on getting JSON to tab-delimited text&lt;br /&gt;
&lt;br /&gt;
2017-11-01:&lt;br /&gt;
*Looked at Oliver's code. Got git repository set up for the project on Bonobo. Started messing around with reading the XML documents in Java.&lt;br /&gt;
&lt;br /&gt;
2017-10-30:&lt;br /&gt;
*Worked on seeing what data can be gathered from the CSV and XML files. Started project page for project.&lt;br /&gt;
&lt;br /&gt;
2017-10-26:&lt;br /&gt;
*Met with Ed to talk about the direction of the project. Starting to work on extracting information from the XML files. Working on adding documentation to wiki and work log. Looking into work from other projects that may use XML.&lt;br /&gt;
&lt;br /&gt;
2017-10-25:&lt;br /&gt;
*Found information about a USITC database that we could use. Added this information to the wiki, and updated information on USITC wiki page.&lt;br /&gt;
&lt;br /&gt;
2017-10-19:&lt;br /&gt;
*Continued to look into NLTK. Talked with Ed about looking into alternative approaches to gathering this data.&lt;br /&gt;
&lt;br /&gt;
2017-10-18:&lt;br /&gt;
*Trying to figure out the best way to extract respondents from the documents. Right now, using NLTK exclusively will not get us any more accuracy than using regular expressions. Currently neither will allow us to match every entity correctly, so I am trying to figure out alternate approaches.&lt;br /&gt;
&lt;br /&gt;
2017-10-16:&lt;br /&gt;
*NLTK&lt;br /&gt;
** NLTK Information&lt;br /&gt;
*** Need to convert text to ASCII. Had issues with my PDF texts and had to convert them&lt;br /&gt;
*** Can use the sent_tokenize() function to split a document into sentences, which is easier than regular expressions&lt;br /&gt;
*** Use pos_tag() to tag the sentences. This can be used to extract proper nouns&lt;br /&gt;
**** Trying to figure out how to use this to grab location data from these documents&lt;br /&gt;
*** Worked with Peter to try to extract geographic information from the documents. We looked into tools Geograpy and GeoText. Geograpy does not have the functionality that we would like. GeoText looks to be better but we have issues with dependencies. Will try to resolve these next time.&lt;br /&gt;
&lt;br /&gt;
2017-10-11:&lt;br /&gt;
*Started to use NLTK library for gathering information to extract respondents. See code in Projects/USITC/ProcessingTexts&lt;br /&gt;
&lt;br /&gt;
2017-10-05:&lt;br /&gt;
*Made photos for the requested maps in ArcGIS with Peter and Jeemin.&lt;br /&gt;
        To access:&lt;br /&gt;
        Go to E:\McNair\Projects\Agglomeration\HarrisonPeterWorkArcGIS&lt;br /&gt;
         The photos can be found in there&lt;br /&gt;
        To generate the photos, open ArcMap with the beginMapArc file&lt;br /&gt;
        To generate a PNG, click File, then Export to export the photos&lt;br /&gt;
        To adjust the data, right-click on the table name in the layers lab, hit Properties, then Query Builder&lt;br /&gt;
&lt;br /&gt;
2017-10-04:&lt;br /&gt;
*Worked with Peter on connecting ArcGIS to the database and displaying different points in ArcGIS&lt;br /&gt;
&lt;br /&gt;
2017-10-02:&lt;br /&gt;
*Started work with ArcGIS. Got the data with startups from Houston into the ArcGIS application. For notes see McNair/Projects/Agglomeration&lt;br /&gt;
&lt;br /&gt;
2017-09-28:&lt;br /&gt;
*Helped Christy with set up on Postgres server. Looked through text documents to see what information I could gather. Looked at Stanford NLTK library for extracting the respondents from the documents.&lt;br /&gt;
&lt;br /&gt;
2017-09-28:&lt;br /&gt;
*Got the PDFs parsed to text. Some of the formatting is off; will need to determine if data can still be gathered.&lt;br /&gt;
&lt;br /&gt;
2017-09-25:&lt;br /&gt;
*Got 3000 PDFs downloaded. Script works. Completed a task to get emails for people who had written papers about economics and entrepreneurship. Started work on parsing the PDFs to text&lt;br /&gt;
&lt;br /&gt;
2017-09-20:&lt;br /&gt;
*Shell program did not work. Created a Python program that can catch all exceptions (URL does not exist, lost connection, and improperly formatted URL). Hopefully it will complete with no problems. This program is found on the database server under the USITC folder.&lt;br /&gt;
*Got connected to the database server and mounted the drive onto my computer. Got the list of all the PDFs on the website and started a shell script on the database server to download all of the PDFs. I will leave it running overnight; hopefully it completes by tomorrow.&lt;br /&gt;
&lt;br /&gt;
2017-09-17: &lt;br /&gt;
*Added features to the Python program to pull the dates in numerical form. Worked on pulling the PDFs from the website. Currently working on pulling them in Python. The program can run and pull PDFs on my local machine but it doesn't work on the Remote Desktop. I will work on this next time.&lt;br /&gt;
&lt;br /&gt;
2017-09-14: &lt;br /&gt;
*Have a Python program that can scrape the entire webpage and navigate through all of the pages that contain Section 337 documents. You can see these files and more information on the USITC project page. It can pull all of the information that is in the HTML that can be gathered for each case. The PDFs now need to be scraped; will start work on that next time. Generated a CSV file with more than 4000 entries from the webpage. There is a small edge case I need to fix where the entry does not contain the Investigation No.&lt;br /&gt;
&lt;br /&gt;
2017-09-13: &lt;br /&gt;
*Worked on parsing the USITC website Section 337 Notices. Nearly have all of the data I can scrape. Scraper works, but there are a few edge cases where information in the tables is part of a Notice but does not have an Investigation Number. Will finish this hopefully next time. Also added my USITC project to the projects page; I did not have it linked.&lt;br /&gt;
&lt;br /&gt;
2017-09-11: &lt;br /&gt;
	Met with Dr. Egan and got assigned project. Set Up Project Page USITC, Started Coding in Python for the Web Crawler. Look in McNair/Projects/USITC for project notes and code. &lt;br /&gt;
&lt;br /&gt;
2017-09-07: &lt;br /&gt;
	Set Up Work Log Pages, Slack, Microsoft Remote Desktop&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt;&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Harrison_Brown_(Work_Log)&amp;diff=22211</id>
		<title>Harrison Brown (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Harrison_Brown_(Work_Log)&amp;diff=22211"/>
		<updated>2017-11-29T21:03:11Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: /* Fall 2017 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;===Fall 2017===&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt;&lt;br /&gt;
[[Harrison Brown]] [[Work Logs]] [[Harrison Brown (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
2017-11-29:&lt;br /&gt;
*Finishing up converting JSON to tab-delimited text&lt;br /&gt;
&lt;br /&gt;
2017-11-13:&lt;br /&gt;
*Worked on getting JSON to tab-delimited text&lt;br /&gt;
&lt;br /&gt;
2017-11-01:&lt;br /&gt;
*Looked at Oliver's code. Got git repository set up for the project on Bonobo. Started messing around with reading the XML documents in Java.&lt;br /&gt;
&lt;br /&gt;
2017-10-30:&lt;br /&gt;
*Worked on seeing what data can be gathered from the CSV and XML files. Started project page for project.&lt;br /&gt;
&lt;br /&gt;
2017-10-26:&lt;br /&gt;
*Met with Ed to talk about the direction of the project. Starting to work on extracting information from the XML files. Working on adding documentation to wiki and work log. Looking into work from other projects that may use XML.&lt;br /&gt;
&lt;br /&gt;
2017-10-25:&lt;br /&gt;
*Found information about a USITC database that we could use. Added this information to the wiki, and updated information on USITC wiki page.&lt;br /&gt;
&lt;br /&gt;
2017-10-19:&lt;br /&gt;
*Continued to look into NLTK. Talked with Ed about looking into alternative approaches to gathering this data.&lt;br /&gt;
&lt;br /&gt;
2017-10-18:&lt;br /&gt;
*Trying to figure out the best way to extract respondents from the documents. Right now, using NLTK exclusively will not get us any more accuracy than using regular expressions. Currently neither will allow us to match every entity correctly, so trying to figure out alternate approaches.&lt;br /&gt;
&lt;br /&gt;
2017-10-16:&lt;br /&gt;
*NLTK&lt;br /&gt;
** NLTK Information&lt;br /&gt;
*** Need to convert text to ASCII. Had issues with my PDF texts and had to convert&lt;br /&gt;
*** Can use sent_tokenize() function to split document into sentences, easier than regular expressions&lt;br /&gt;
*** Use pos_tag() to tag the sentences. This can be used to extract proper nouns&lt;br /&gt;
**** Trying to figure out how to use this to grab location data from these documents&lt;br /&gt;
*** Worked with Peter to try to extract geographic information from the documents. We looked into tools Geograpy and GeoText. Geograpy does not have the functionality that we would like. GeoText looks to be better but we have issues with dependencies. Will try to resolve these next time.&lt;br /&gt;
&lt;br /&gt;
2017-10-11:&lt;br /&gt;
*Started to use NLTK library for gathering information to extract respondents. See code in Projects/USITC/ProcessingTexts&lt;br /&gt;
&lt;br /&gt;
2017-10-05:&lt;br /&gt;
*Made photos for the requested maps in ArcGIS with Peter and Jeemin.&lt;br /&gt;
        To access:&lt;br /&gt;
        Go to E:\McNair\Projects\Agglomeration\HarrisonPeterWorkArcGIS&lt;br /&gt;
         The photos can be found in there&lt;br /&gt;
        To generate the photos, open ArcMap with the beginMapArc file&lt;br /&gt;
        To export a photo as a PNG, click File, then Export&lt;br /&gt;
        To adjust the data right click on the table name in the layers lab, and hit properties, then query builder&lt;br /&gt;
&lt;br /&gt;
2017-10-04:&lt;br /&gt;
*Worked with Peter on connecting ArcGIS to the database and displaying different points in ArcGIS&lt;br /&gt;
&lt;br /&gt;
2017-10-02:&lt;br /&gt;
*Started work with ArcGIS. Got the data with startups from Houston into the ArcGIS application. For notes see McNair/Projects/Agglomeration&lt;br /&gt;
&lt;br /&gt;
2017-09-28:&lt;br /&gt;
*Helped Christy with set up on Postgres server. Looked through text documents to see what information I could gather. Looked at Stanford NLTK library for extracting the respondents from the documents.&lt;br /&gt;
&lt;br /&gt;
2017-09-28:&lt;br /&gt;
*Got the PDFs parsed to text. Some of the formatting is off; will need to determine whether data can still be gathered.&lt;br /&gt;
&lt;br /&gt;
2017-09-25:&lt;br /&gt;
*Got 3000 PDFs downloaded. Script works. Completed a task to get emails for people who had written papers about economics and entrepreneurship. Started work on parsing the PDFs to text.&lt;br /&gt;
&lt;br /&gt;
2017-09-20:&lt;br /&gt;
*Shell program did not work. Created a Python program that can catch all exceptions (URL does not exist, lost connection, and improperly formatted URL). Hopefully it will complete with no problems. This program is found on the database server under the USITC folder.&lt;br /&gt;
*Got connected to the database server and mounted the drive onto my computer. Got the list of all the PDFs on the website and started a shell script on the database server to download all of the PDFs. I will leave it running overnight; hopefully it completes by tomorrow.&lt;br /&gt;
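&lt;br /&gt;
A minimal sketch of the kind of download loop described in the 2017-09-20 entry, not the actual program on the database server. It uses only the Python standard library. The function names download_pdf and download_all, the 30-second timeout, the file-naming rule, and the choice to record failures and keep going are all assumptions made for illustration.&lt;br /&gt;

```python
import http.client
import urllib.error
import urllib.request
from pathlib import Path


def download_pdf(url, dest_dir, timeout=30):
    """Fetch one URL into dest_dir; return (ok, message) instead of raising."""
    # Hypothetical naming rule: use the last path segment as the file name.
    name = url.rstrip("/").rsplit("/", 1)[-1] or "index.pdf"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            data = resp.read()
    except ValueError as exc:
        # Improperly formatted URL (for example, no scheme).
        return False, f"bad url: {exc}"
    except urllib.error.HTTPError as exc:
        # The server answered, but the document is not there (404 and similar).
        return False, f"http error {exc.code}"
    except (OSError, http.client.HTTPException) as exc:
        # URLError, timeouts, and dropped connections end up here.
        return False, f"connection problem: {exc}"
    dest = Path(dest_dir) / name
    dest.write_bytes(data)
    return True, str(dest)


def download_all(urls, dest_dir):
    """Try every URL; return a list of (url, reason) pairs for the failures."""
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    failures = []
    for url in urls:
        ok, message = download_pdf(url, dest_dir)
        if not ok:
            failures.append((url, message))
    return failures
```

Calling download_all(list_of_urls, "pdfs") returns only the URLs that failed, so an overnight run can finish and the failures can be retried or inspected afterwards.&lt;br /&gt;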
&lt;br /&gt;
2017-09-17: &lt;br /&gt;
*Added features to the Python program to pull the dates in numerical form. Worked on pulling the PDFs from the website. Currently working on pulling them in Python. The program can run and pull PDFs on my local machine but it doesn't work on the Remote Desktop. I will work on this next time.&lt;br /&gt;
&lt;br /&gt;
2017-09-14: &lt;br /&gt;
*Have a Python program that can scrape the entire webpage and navigate through all of the pages that contain Section 337 documents. You can see these files and more information on the USITC project page. It can pull all of the information that is in the HTML that can be gathered for each case. The PDFs now need to be scraped; will start work on that next time. Generated a CSV file with more than 4000 entries from the webpage. There is a small edge case I need to fix where the entry does not contain the Investigation No.&lt;br /&gt;
&lt;br /&gt;
2017-09-13: &lt;br /&gt;
*Worked on parsing the USITC website Section 337 Notices. Nearly have all of the data I can scrape. Scraper works, but there are a few edge cases where information in the tables is part of a Notice but does not have an Investigation Number. Will finish this hopefully next time. Also added my USITC project to the projects page; I did not have it linked.&lt;br /&gt;
&lt;br /&gt;
2017-09-11: &lt;br /&gt;
	Met with Dr. Egan and got assigned project. Set Up Project Page USITC, Started Coding in Python for the Web Crawler. Look in McNair/Projects/USITC for project notes and code. &lt;br /&gt;
&lt;br /&gt;
2017-09-07: &lt;br /&gt;
	Set Up Work Log Pages, Slack, Microsoft Remote Desktop&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt;&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=ArcMap_/_ArcGIS_Documentation&amp;diff=22210</id>
		<title>ArcMap / ArcGIS Documentation</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=ArcMap_/_ArcGIS_Documentation&amp;diff=22210"/>
		<updated>2017-11-29T20:19:15Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: /* ArcMap */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=ArcMap / ArcGIS Documentation&lt;br /&gt;
|Has owner=Jeemin Sim,&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
= ArcMap =&lt;br /&gt;
&lt;br /&gt;
== Add Text file (.txt) as data on ArcMap ==&lt;br /&gt;
 1) Open ArcMap&lt;br /&gt;
 2) Locate the button with a black cross and a yellow diamond underneath (below 'Selection')&lt;br /&gt;
 3) Click the tiny triangle drop-down button to its immediate right&lt;br /&gt;
 4) Select 'Add Data'&lt;br /&gt;
 5) Find location where local text file resides. For example: (Home - Documents\ArcGIS)&lt;br /&gt;
 6) Select a text file of interest and click Add&lt;br /&gt;
    - The text file should have at least two columns: X and Y coordinates&lt;br /&gt;
 7) Locate the text file in the Table Of Contents tab on the left&lt;br /&gt;
 8) Right click the text file&lt;br /&gt;
 9) Select 'Display XY Data...'&lt;br /&gt;
 10) In the pop-up window 'Display XY Data', choose XField and YField - the default should work&lt;br /&gt;
 11) Check that in the same pop-up window the box labelled 'Coordinate System of Input Coordinates'&lt;br /&gt;
     contains GCS_WGS_1984&lt;br /&gt;
     - If not: click the 'Edit...' Button. --&amp;gt; Click the Plus sign for 'Geographic Coordinate Systems'&lt;br /&gt;
                --&amp;gt; Click the Plus sign for 'World' --&amp;gt; Select 'WGS 1984' --&amp;gt; Press OK&lt;br /&gt;
 12) Click 'OK'&lt;br /&gt;
 13) Most likely a window titled 'Table Does Not Have Object-ID Field' will pop up,&lt;br /&gt;
     but do not fret&lt;br /&gt;
 14) The points will show on the big screen in the middle&lt;br /&gt;
&lt;br /&gt;
==Manipulate Data With Queries==&lt;br /&gt;
 1) Right click on data layer&lt;br /&gt;
 2) Click on Properties&lt;br /&gt;
 3) Click on Definition Query&lt;br /&gt;
 4) There are two options, definition query or query builder&lt;br /&gt;
 5) Type In Query and Hit Apply and Then OK&lt;br /&gt;
&lt;br /&gt;
== Add Base layer or map on ArcMap ==&lt;br /&gt;
&lt;br /&gt;
 1) Locate the button with a black cross and a yellow diamond underneath (below 'Selection')&lt;br /&gt;
 2) Click the tiny triangle drop-down button to its immediate right&lt;br /&gt;
 3) Select 'Add Basemap...'&lt;br /&gt;
 4) Choose a base map of your liking. 'Light Gray Canvas' is frequently used.&lt;br /&gt;
 5) Click 'Add'&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=ArcMap_/_ArcGIS_Documentation&amp;diff=22209</id>
		<title>ArcMap / ArcGIS Documentation</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=ArcMap_/_ArcGIS_Documentation&amp;diff=22209"/>
		<updated>2017-11-29T20:18:44Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=ArcMap / ArcGIS Documentation&lt;br /&gt;
|Has owner=Jeemin Sim,&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
= ArcMap =&lt;br /&gt;
&lt;br /&gt;
== Add Text file (.txt) as data on ArcMap ==&lt;br /&gt;
 1) Open ArcMap&lt;br /&gt;
 2) Locate the button with a black cross and a yellow diamond underneath (below 'Selection')&lt;br /&gt;
 3) Click the tiny triangle drop-down button to its immediate right&lt;br /&gt;
 4) Select 'Add Data'&lt;br /&gt;
 5) Find location where local text file resides. For example: (Home - Documents\ArcGIS)&lt;br /&gt;
 6) Select a text file of interest and click Add&lt;br /&gt;
    - The text file should have at least two columns: X and Y coordinates&lt;br /&gt;
 7) Locate the text file in the Table Of Contents tab on the left&lt;br /&gt;
 8) Right click the text file&lt;br /&gt;
 9) Select 'Display XY Data...'&lt;br /&gt;
 10) In the pop-up window 'Display XY Data', choose XField and YField - the default should work&lt;br /&gt;
 11) Check that in the same pop-up window the box labelled 'Coordinate System of Input Coordinates'&lt;br /&gt;
     contains GCS_WGS_1984&lt;br /&gt;
     - If not: click the 'Edit...' Button. --&amp;gt; Click the Plus sign for 'Geographic Coordinate Systems'&lt;br /&gt;
                --&amp;gt; Click the Plus sign for 'World' --&amp;gt; Select 'WGS 1984' --&amp;gt; Press OK&lt;br /&gt;
 12) Click 'OK'&lt;br /&gt;
 13) Most likely a window titled 'Table Does Not Have Object-ID Field' will pop up,&lt;br /&gt;
     but do not fret&lt;br /&gt;
 14) The points will show on the big screen in the middle&lt;br /&gt;
&lt;br /&gt;
==Manipulate Data With Queries==&lt;br /&gt;
1) Right click on data layer&lt;br /&gt;
2) Click on Properties&lt;br /&gt;
3) Click on Definition Query&lt;br /&gt;
4) There are two options, definition query or query builder&lt;br /&gt;
5) Type In Query and Hit Apply and Then OK&lt;br /&gt;
&lt;br /&gt;
== Add Base layer or map on ArcMap ==&lt;br /&gt;
&lt;br /&gt;
 1) Locate the button with a black cross and a yellow diamond underneath (below 'Selection')&lt;br /&gt;
 2) Click the tiny triangle drop-down button to its immediate right&lt;br /&gt;
 3) Select 'Add Basemap...'&lt;br /&gt;
 4) Choose a base map of your liking. 'Light Gray Canvas' is frequently used.&lt;br /&gt;
 5) Click 'Add'&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Harrison_Brown_(Work_Log)&amp;diff=21719</id>
		<title>Harrison Brown (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Harrison_Brown_(Work_Log)&amp;diff=21719"/>
		<updated>2017-11-09T19:28:07Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;===Fall 2017===&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt;&lt;br /&gt;
[[Harrison Brown]] [[Work Logs]] [[Harrison Brown (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
2017-11-01:&lt;br /&gt;
*Looked at Oliver's code. Got git repository set up for the project on Bonobo. Started messing around with reading the XML documents in Java.&lt;br /&gt;
&lt;br /&gt;
2017-10-30:&lt;br /&gt;
*Worked on seeing what data can be gathered from the CSV and XML files. Started project page for project.&lt;br /&gt;
&lt;br /&gt;
2017-10-26:&lt;br /&gt;
*Met with Ed to talk about the direction of the project. Starting to work on extracting information from the XML files. Working on adding documentation to wiki and work log. Looking into work from other projects that may use XML.&lt;br /&gt;
&lt;br /&gt;
2017-10-25:&lt;br /&gt;
*Found information about a USITC database that we could use. Added this information to the wiki, and updated information on USITC wiki page.&lt;br /&gt;
&lt;br /&gt;
2017-10-19:&lt;br /&gt;
*Continued to look into NLTK. Talked with Ed about looking into alternative approaches to gathering this data.&lt;br /&gt;
&lt;br /&gt;
2017-10-18:&lt;br /&gt;
*Trying to figure out the best way to extract respondents from the documents. Right now, using NLTK exclusively will not get us any more accuracy than using regular expressions. Currently neither will allow us to match every entity correctly, so trying to figure out alternate approaches.&lt;br /&gt;
&lt;br /&gt;
2017-10-16:&lt;br /&gt;
*NLTK&lt;br /&gt;
** NLTK Information&lt;br /&gt;
*** Need to convert text to ASCII. Had issues with my PDF texts and had to convert&lt;br /&gt;
*** Can use sent_tokenize() function to split document into sentences, easier than regular expressions&lt;br /&gt;
*** Use pos_tag() to tag the sentences. This can be used to extract proper nouns&lt;br /&gt;
**** Trying to figure out how to use this to grab location data from these documents&lt;br /&gt;
*** Worked with Peter to try to extract geographic information from the documents. We looked into tools Geograpy and GeoText. Geograpy does not have the functionality that we would like. GeoText looks to be better but we have issues with dependencies. Will try to resolve these next time.&lt;br /&gt;
&lt;br /&gt;
2017-10-11:&lt;br /&gt;
*Started to use NLTK library for gathering information to extract respondents. See code in Projects/USITC/ProcessingTexts&lt;br /&gt;
&lt;br /&gt;
2017-10-05:&lt;br /&gt;
*Made photos for the requested maps in ArcGIS with Peter and Jeemin.&lt;br /&gt;
        To access:&lt;br /&gt;
        Go to E:\McNair\Projects\Agglomeration\HarrisonPeterWorkArcGIS&lt;br /&gt;
         The photos can be found in there&lt;br /&gt;
        To generate the photos, open ArcMap with the beginMapArc file&lt;br /&gt;
        To export a photo as a PNG, click File, then Export&lt;br /&gt;
        To adjust the data right click on the table name in the layers lab, and hit properties, then query builder&lt;br /&gt;
&lt;br /&gt;
2017-10-04:&lt;br /&gt;
*Worked with Peter on connecting ArcGIS to the database and displaying different points in ArcGIS&lt;br /&gt;
&lt;br /&gt;
2017-10-02:&lt;br /&gt;
*Started work with ArcGIS. Got the data with startups from Houston into the ArcGIS application. For notes see McNair/Projects/Agglomeration&lt;br /&gt;
&lt;br /&gt;
2017-09-28:&lt;br /&gt;
*Helped Christy with set up on Postgres server. Looked through text documents to see what information I could gather. Looked at Stanford NLTK library for extracting the respondents from the documents.&lt;br /&gt;
&lt;br /&gt;
2017-09-28:&lt;br /&gt;
*Got the PDFs parsed to text. Some of the formatting is off; will need to determine whether data can still be gathered.&lt;br /&gt;
&lt;br /&gt;
2017-09-25:&lt;br /&gt;
*Got 3000 PDFs downloaded. Script works. Completed a task to get emails for people who had written papers about economics and entrepreneurship. Started work on parsing the PDFs to text.&lt;br /&gt;
&lt;br /&gt;
2017-09-20:&lt;br /&gt;
*Shell program did not work. Created a Python program that can catch all exceptions (URL does not exist, lost connection, and improperly formatted URL). Hopefully it will complete with no problems. This program is found on the database server under the USITC folder.&lt;br /&gt;
*Got connected to the database server and mounted the drive onto my computer. Got the list of all the PDFs on the website and started a shell script on the database server to download all of the PDFs. I will leave it running overnight; hopefully it completes by tomorrow.&lt;br /&gt;
&lt;br /&gt;
2017-09-17: &lt;br /&gt;
*Added features to the Python program to pull the dates in numerical form. Worked on pulling the PDFs from the website. Currently working on pulling them in Python. The program can run and pull PDFs on my local machine but it doesn't work on the Remote Desktop. I will work on this next time.&lt;br /&gt;
&lt;br /&gt;
2017-09-14: &lt;br /&gt;
*Have a Python program that can scrape the entire webpage and navigate through all of the pages that contain Section 337 documents. You can see these files and more information on the USITC project page. It can pull all of the information that is in the HTML that can be gathered for each case. The PDFs now need to be scraped; will start work on that next time. Generated a CSV file with more than 4000 entries from the webpage. There is a small edge case I need to fix where the entry does not contain the Investigation No.&lt;br /&gt;
&lt;br /&gt;
2017-09-13: &lt;br /&gt;
*Worked on parsing the USITC website Section 337 Notices. Nearly have all of the data I can scrape. Scraper works, but there are a few edge cases where information in the tables is part of a Notice but does not have an Investigation Number. Will finish this hopefully next time. Also added my USITC project to the projects page; I did not have it linked.&lt;br /&gt;
&lt;br /&gt;
2017-09-11: &lt;br /&gt;
	Met with Dr. Egan and got assigned project. Set Up Project Page USITC, Started Coding in Python for the Web Crawler. Look in McNair/Projects/USITC for project notes and code. &lt;br /&gt;
&lt;br /&gt;
2017-09-07: &lt;br /&gt;
	Set Up Work Log Pages, Slack, Microsoft Remote Desktop&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt;&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Harrison_Brown_(Work_Log)&amp;diff=21456</id>
		<title>Harrison Brown (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Harrison_Brown_(Work_Log)&amp;diff=21456"/>
		<updated>2017-11-01T20:42:37Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Harrison Brown]] [[Work Logs]] [[Harrison Brown (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
===Fall 2017 Work===&lt;br /&gt;
09/07/2017 2:20pm-3:50pm - Set Up Work Log Pages, Slack, Microsoft Remote Desktop&lt;br /&gt;
&lt;br /&gt;
09/11/2017 1pm-5pm - Met with Dr. Egan and got assigned project. Set Up Project Page USITC, Started Coding in Python for the Web Crawler. Look in McNair/Projects/USITC for project notes and code. &lt;br /&gt;
&lt;br /&gt;
09/13/2017 1pm-3:50pm - Worked on parsing the USITC website Section 337 Notices. Nearly have all of the data I can scrape. Scraper works, but there are a few edge cases where information in the tables is part of a Notice but does not have an Investigation Number. Will finish this hopefully next time. Also added my USITC project to the projects page; I did not have it linked.&lt;br /&gt;
&lt;br /&gt;
09/14/2017 1pm-3:50pm - Have a Python program that can scrape the entire webpage and navigate through all of the pages that contain Section 337 documents. You can see these files and more information on the USITC project page. It can pull all of the information that is in the HTML that can be gathered for each case. The PDFs now need to be scraped; will start work on that next time. Generated a CSV file with more than 4000 entries from the webpage. There is a small edge case I need to fix where the entry does not contain the Investigation No.&lt;br /&gt;
&lt;br /&gt;
09/18/2017 1pm-5:00pm - Added features to the Python program to pull the dates in numerical form. Worked on pulling the PDFs from the website. Currently working on pulling them in Python. The program can run and pull PDFs on my local machine but it doesn't work on the Remote Desktop. I will work on this next time.&lt;br /&gt;
&lt;br /&gt;
09/20/2017 1pm- 3:50pm - Got connected to the database server and mounted the drive onto my computer. Got the list of all the PDFs on the website and started a shell script on the database server to download all of the PDFs. I will leave it running overnight; hopefully it completes by tomorrow.&lt;br /&gt;
&lt;br /&gt;
09/20/2017 1pm- 2:30pm - Shell program did not work. Created a Python program that can catch all exceptions (URL does not exist, lost connection, and improperly formatted URL). Hopefully it will complete with no problems. This program is found on the database server under the USITC folder.&lt;br /&gt;
&lt;br /&gt;
09/25/2017 1pm- 5:00pm - Got 3000 PDFs downloaded. Script works. Completed a task to get emails for people who had written papers about economics and entrepreneurship. Started work on parsing the PDFs to text.&lt;br /&gt;
&lt;br /&gt;
09/27/2017 1pm- 3:50pm - Got the PDFs parsed to text. Some of the formatting is off; will need to determine whether data can still be gathered.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
09/28/2017 1pm- 3:50pm - Helped Christy with set up on Postgres server. Looked through text documents to see what information I could gather. Looked at Stanford NLTK library for extracting the respondents from the documents.&lt;br /&gt;
&lt;br /&gt;
10/02/2017 1pm- 5:00pm - Started work with ArcGIS. Got the data with startups from Houston into the ArcGIS application. For notes see McNair/Projects/Agglomeration&lt;br /&gt;
&lt;br /&gt;
10/04/2017 1pm- 3:50pm - Worked with Peter on connecting ArcGIS to the database and displaying different points in ArcGIS&lt;br /&gt;
&lt;br /&gt;
10/05/2017 1pm- 3:50pm - Made photos for the requested maps in ArcGIS with Peter and Jeemin.&lt;br /&gt;
        To access:&lt;br /&gt;
        Go to E:\McNair\Projects\Agglomeration\HarrisonPeterWorkArcGIS&lt;br /&gt;
         The photos can be found in there&lt;br /&gt;
        To generate the photos, open ArcMap with the beginMapArc file&lt;br /&gt;
        To export a photo as a PNG, click File, then Export&lt;br /&gt;
        To adjust the data right click on the table name in the layers lab, and hit properties, then query builder&lt;br /&gt;
10/11/2017 1pm- 3:50pm - Started to use NLTK library for gathering information to extract respondents. See code in Projects/USITC/ProcessingTexts&lt;br /&gt;
&lt;br /&gt;
10/16/2017 1pm - 5:00pm - NLTK&lt;br /&gt;
* NLTK Information&lt;br /&gt;
** Need to convert text to ASCII. Had issues with my PDF texts and had to convert&lt;br /&gt;
** Can use sent_tokenize() function to split document into sentences, easier than regular expressions&lt;br /&gt;
** Use pos_tag() to tag the sentences. This can be used to extract proper nouns&lt;br /&gt;
*** Trying to figure out how to use this to grab location data from these documents&lt;br /&gt;
** Worked with Peter to try to extract geographic information from the documents. We looked into tools Geograpy and GeoText. Geograpy does not have the functionality that we would like. GeoText looks to be better but we have issues with dependencies. Will try to resolve these next time.&lt;br /&gt;
&lt;br /&gt;
10/18/2017 1pm - 3:50pm &lt;br /&gt;
Trying to figure out the best way to extract respondents from the documents. Right now, using NLTK exclusively will not get us any more accuracy than using regular expressions. Currently neither will allow us to match every entity correctly, so trying to figure out alternate approaches.&lt;br /&gt;
&lt;br /&gt;
10/19/2017 1pm - 3:50pm &lt;br /&gt;
Continued to look into NLTK. Talked with Ed about looking into alternative approaches to gathering this data.&lt;br /&gt;
&lt;br /&gt;
10/25/2017 1pm - 3:50pm &lt;br /&gt;
Found information about a USITC database that we could use. Added this information to the wiki, and updated information on USITC wiki page.&lt;br /&gt;
&lt;br /&gt;
10/26/2017 1pm - 3:50pm &lt;br /&gt;
Met with Ed to talk about the direction of the project. Starting to work on extracting information from the XML files. Working on adding documentation to wiki and work log. Looking into work from other projects that may use XML.&lt;br /&gt;
&lt;br /&gt;
10/30/2017 1pm - 5:00pm &lt;br /&gt;
Worked on seeing what data can be gathered from the CSV and XML files. Started project page for project.&lt;br /&gt;
&lt;br /&gt;
11/01/2017 1pm - 3:50pm &lt;br /&gt;
Looked at Oliver's code. Got git repository set up for the project on Bonobo. Started messing around with reading the XML documents in Java.&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=XML_Parsing&amp;diff=21453</id>
		<title>XML Parsing</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=XML_Parsing&amp;diff=21453"/>
		<updated>2017-11-01T19:43:00Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Python Libraries&lt;br /&gt;
|Has owner=Harrison Brown&lt;br /&gt;
|Has deadline=Never&lt;br /&gt;
|Has keywords=XML&lt;br /&gt;
}}&lt;br /&gt;
This page is dedicated to XML parsing projects that have been done&lt;br /&gt;
&lt;br /&gt;
==Projects Using XML==&lt;br /&gt;
This will hold the projects that use XML&lt;br /&gt;
Had Issues Checking out with HTTPS&lt;br /&gt;
&lt;br /&gt;
* [[Reproducible_Patent_Data|Reproducible Patent Data]], uses Java&lt;br /&gt;
&lt;br /&gt;
==Potentially Relevant Projects==&lt;br /&gt;
HTML Parsing/Potentially relevant projects&lt;br /&gt;
&lt;br /&gt;
* [[Accelerator_Seed_List_(Data)|Accelerator Seed List (Data)]], look at F6S&lt;br /&gt;
&lt;br /&gt;
==Information==&lt;br /&gt;
This will hold information on how to parse XML&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=XML_Parsing&amp;diff=21451</id>
		<title>XML Parsing</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=XML_Parsing&amp;diff=21451"/>
		<updated>2017-11-01T19:02:13Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Python Libraries&lt;br /&gt;
|Has owner=Harrison Brown&lt;br /&gt;
|Has deadline=Never&lt;br /&gt;
|Has keywords=XML&lt;br /&gt;
}}&lt;br /&gt;
This page is dedicated to XML parsing projects that have been done&lt;br /&gt;
&lt;br /&gt;
==Projects Using XML==&lt;br /&gt;
This will hold the projects that use XML&lt;br /&gt;
&lt;br /&gt;
* [[Reproducible_Patent_Data|Reproducible Patent Data]], uses Java&lt;br /&gt;
&lt;br /&gt;
==Potentially Relevant Projects==&lt;br /&gt;
HTML Parsing/Potentially relevant projects&lt;br /&gt;
&lt;br /&gt;
* [[Accelerator_Seed_List_(Data)|Accelerator Seed List (Data)]], look at F6S&lt;br /&gt;
&lt;br /&gt;
==Information==&lt;br /&gt;
This will hold information on how to parse XML&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Harrison_Brown_(Work_Log)&amp;diff=21332</id>
		<title>Harrison Brown (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Harrison_Brown_(Work_Log)&amp;diff=21332"/>
		<updated>2017-10-30T21:51:31Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Harrison Brown]] [[Work Logs]] [[Harrison Brown (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
===Fall 2017 Work===&lt;br /&gt;
09/07/2017 2:20pm-3:50pm - Set Up Work Log Pages, Slack, Microsoft Remote Desktop&lt;br /&gt;
&lt;br /&gt;
09/11/2017 1pm-5pm - Met with Dr. Egan and got assigned project. Set Up Project Page USITC, Started Coding in Python for the Web Crawler. Look in McNair/Projects/UISTC for project notes and code. &lt;br /&gt;
&lt;br /&gt;
09/13/2017 1pm-3:50pm - Worked on parsing the USITC website Section 337 Notices. Nearly have all of the data I can scrape. Scraper works, but there are a few edges &lt;br /&gt;
cases where information in the tables are part of a Notice but do not have Investigation Numbers. Will finish this hopefully next time. Also added my USITC project to the projects page I did not have it linked&lt;br /&gt;
&lt;br /&gt;
09/14/2017 1pm-3:50pm - Have a python program that can scrape the entire webpage and navigate through all of the pages that contain section 337 documents. You can see these files and more information on the USITC project page. It can pull all of the information that is in the HTML that can be gathered for each case. The PDFs now need to be scraped; will start work on that next time. Generated a csv file with more than 4000 entries from the webpage. There is a small edge case I need to fix where the entry does not contain the Investigation No.&lt;br /&gt;
&lt;br /&gt;
09/18/2017 1pm-5:00pm - Added features to python program to pull the dates in numerical form. Worked on pulling the PDFs from the website. Currently working on pulling them in Python. The program can run and pull PDFs on my local machine but it doesn't work on the Remote Desktop. I will work on this next time.&lt;br /&gt;
&lt;br /&gt;
09/20/2017 1pm- 3:50pm - Got connected to the database server and mounted the drive onto my computer. Got the list of all the PDFs on the website and started a shell script on the database server to download all of them. I will leave it running overnight; hopefully it completes by tomorrow.&lt;br /&gt;
&lt;br /&gt;
09/20/2017 1pm- 2:30pm - The shell script did not work. Created a Python program that catches all exceptions (URL does not exist, lost connection, and improperly formatted URL). Hopefully it will complete with no problems. This program is found on the database server under the USITC folder.&lt;br /&gt;
&lt;br /&gt;
09/25/2017 1pm- 5:00pm - Got 3000 PDFs downloaded. Script works. Completed a task to get emails for people who had written papers about economics and entrepreneurship. Started work on parsing the PDFs to text.&lt;br /&gt;
&lt;br /&gt;
09/27/2017 1pm- 3:50pm - Got the PDFs parsed to text. Some of the formatting is off; will need to determine if data can still be gathered.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
09/28/2017 1pm- 3:50pm - Helped Christy with set up on Postgres server. Looked through text documents to see what information I could gather. Looked at Stanford NLTK library for extracting the respondents from the documents.&lt;br /&gt;
&lt;br /&gt;
10/02/2017 1pm- 5:00pm - Started work with ArcGIS. Got the data with startups from Houston into the ArcGIS application. For notes see McNair/Projects/Agglomeration&lt;br /&gt;
&lt;br /&gt;
10/04/2017 1pm- 3:50pm - Worked with Peter on connecting ArcGIS to the database and displaying different points in ArcGIS&lt;br /&gt;
&lt;br /&gt;
10/05/2017 1pm- 3:50pm - Made photos for the requested maps in ArcGIS with Peter and Jeemin.&lt;br /&gt;
        To access:&lt;br /&gt;
        Go to E:\McNair\Projects\Agglomeration\HarrisonPeterWorkArcGIS&lt;br /&gt;
         The photos can be found in there&lt;br /&gt;
        To generate the photos open ArcMap with the beginMapArc file&lt;br /&gt;
        To generate a PNG, click File, then Export&lt;br /&gt;
        To adjust the data right click on the table name in the layers lab, and hit properties, then query builder&lt;br /&gt;
10/11/2017 1pm- 3:50pm - Started to use NLTK library for gathering information to extract respondents. See code in Projects/USITC/ProcessingTexts&lt;br /&gt;
&lt;br /&gt;
10/16/2017 1pm - 5:00pm - NLTK&lt;br /&gt;
* NLTK Information&lt;br /&gt;
** Need to convert text to ascii. Had issues with my PDF texts and had to convert&lt;br /&gt;
** Can use sent_tokenize() function to split document into sentences, easier than regular expressions&lt;br /&gt;
** Use pos_tag() to tag the sentences. This can be used to extract proper nouns&lt;br /&gt;
*** Trying to figure out how to use this to grab location data from these documents&lt;br /&gt;
** Worked with Peter to try to extract geographic information from the documents. We looked into tools Geograpy and GeoText. Geograpy does not have the functionality that we would like. GeoText looks to be better but we have issues with dependencies. Will try to resolve these next time.&lt;br /&gt;
&lt;br /&gt;
10/18/2017 1pm - 3:50pm &lt;br /&gt;
Trying to figure out the best way to extract respondents from the documents. Right now, using NLTK exclusively will not get us any more accuracy than using regular expressions. Currently neither will allow us to match every entity correctly, so I am trying to figure out alternate approaches.&lt;br /&gt;
&lt;br /&gt;
10/19/2017 1pm - 3:50pm &lt;br /&gt;
Continued to look into NLTK. Talked with Ed about looking into alternative approaches to gathering this data.&lt;br /&gt;
&lt;br /&gt;
10/25/2017 1pm - 3:50pm &lt;br /&gt;
Found information about a USITC database that we could use. Added this information to the wiki, and updated information on USITC wiki page.&lt;br /&gt;
&lt;br /&gt;
10/26/2017 1pm - 3:50pm &lt;br /&gt;
Met with Ed to talk about the direction of the project. Starting to work on extracting information from the XML files. Working on adding documentation to wiki and work log. Looking into work from other projects that may use XML.&lt;br /&gt;
&lt;br /&gt;
10/30/2017 1pm - 5:00pm &lt;br /&gt;
Worked on seeing what data can be gathered from the CSV and XML files. Started project page for project.&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=21327</id>
		<title>USITC</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=21327"/>
		<updated>2017-10-30T20:35:02Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=USITC Data&lt;br /&gt;
|Has owner=Harrison Brown&lt;br /&gt;
|Has start date=9/11/2017&lt;br /&gt;
|Has notes=In Progress&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Files==&lt;br /&gt;
&lt;br /&gt;
My files are in two different places.&lt;br /&gt;
Development work was done here:&lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
I had to use the Postgres SQL server to download the PDFs.&lt;br /&gt;
The Postgres SQL Server:&lt;br /&gt;
 128.42.44.182/bulk/USITC&lt;br /&gt;
&lt;br /&gt;
==Additional Information==&lt;br /&gt;
There is more information that the USITC provides besides 337 notices. &lt;br /&gt;
&lt;br /&gt;
Here is information and a database on Section 701/731&lt;br /&gt;
 https://www.usitc.gov/trade_remedy/trade_research_tools&lt;br /&gt;
 https://pubapps2.usitc.gov/sunset/&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==New Work==&lt;br /&gt;
There is an XML file that contains information on investigations. To get it, go to the link below and click the 'xml' link under the 'Cases instituted after October 2008' tab.&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/&lt;br /&gt;
&lt;br /&gt;
The information in this file can be read with an XML parser. For each investigation, we can find:&lt;br /&gt;
* Investigation Number ex - (&amp;lt;entry key=&amp;quot;investigationNo&amp;quot;&amp;gt;966&amp;lt;/entry&amp;gt;)&lt;br /&gt;
* Date of publication Notice - (&amp;lt;entry key=&amp;quot;dateOfPublicationFrNotice&amp;quot;&amp;gt;2015-09-24T04:00:00.000Z&amp;lt;/entry&amp;gt;)&lt;br /&gt;
* Title ex - &amp;lt;entry key=&amp;quot;title&amp;quot;&amp;gt;Silicon-on-Insulator Wafers&amp;lt;/entry&amp;gt;&lt;br /&gt;
* There is an entry for patent numbers, ex - &amp;lt;entry key=&amp;quot;patentNumbers&amp;quot;&amp;gt;&lt;br /&gt;
* Investigation Type ex - &amp;lt;entry key=&amp;quot;investigationType&amp;quot;&amp;gt;Violation&amp;lt;/entry&amp;gt;&lt;br /&gt;
* Respondents can be found under &amp;lt;entry key=&amp;quot;respondent&amp;quot;&amp;gt;&lt;br /&gt;
* Complainant can be found under &amp;lt;entry key=&amp;quot;complainant&amp;quot;&amp;gt;&lt;br /&gt;
Additional information can also be gathered from the XML document&lt;br /&gt;
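&lt;br /&gt;
As a rough sketch of how these fields could be pulled out with Python's standard xml.etree.ElementTree module: the wrapper element name (record_tag, default "investigation") is an assumption, since the layout of the downloaded file is not recorded on this page. Only the key names and the date format come from the examples above.&lt;br /&gt;

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Keys taken from the examples on this page; the real file has more.
SIMPLE_KEYS = ("investigationNo", "dateOfPublicationFrNotice",
               "title", "investigationType")


def parse_investigations(xml_text, record_tag="investigation"):
    """Return one dict per record element, keyed by each entry's 'key' attribute.

    record_tag is a hypothetical name for the element that wraps one case.
    """
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter(record_tag):
        row = {}
        # Only direct 'entry' children with plain text values are read here;
        # nested entries such as respondent would need their own handling.
        for entry in record.findall("entry"):
            key = entry.get("key")
            if key in SIMPLE_KEYS:
                row[key] = (entry.text or "").strip()
        records.append(row)
    return records


def parse_notice_date(value):
    """Parse a timestamp shaped like 2015-09-24T04:00:00.000Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
```

Entries with nested content (patentNumbers, respondent, complainant) are skipped on purpose, because their inner structure is not described here.&lt;br /&gt;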
&lt;br /&gt;
To find information on cases instituted prior to October 2008, go to the link above and click on 'Looking for cases instituted prior to October 2008?'; this downloads a CSV file.&lt;br /&gt;
* The Investigation Number, Title, Unfair Act Alleged, Patent Numbers, Complainants, and Respondents can be read directly from the CSV.&lt;br /&gt;
* The Target Date, Beginning Date, and Ending Date fields contain notes (some cases are extended and their dates changed), so we may need to do some text processing to extract the dates.&lt;br /&gt;
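&lt;br /&gt;
One possible approach to the text processing for the date fields, sketched with a regular expression. The MM/DD/YYYY format is an assumption; I have not recorded the actual date format used in the CSV here, so the pattern would need adjusting to match the real cells.&lt;br /&gt;

```python
import re
from datetime import datetime

# Assumed date shape: MM/DD/YYYY (hypothetical, check against the real CSV).
DATE_PATTERN = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def extract_dates(cell):
    """Return every valid MM/DD/YYYY date found in a free-text cell, in order."""
    dates = []
    for month, day, year in DATE_PATTERN.findall(cell):
        try:
            dates.append(datetime(int(year), int(month), int(day)).date())
        except ValueError:
            # Skip matches that look like dates but are not (e.g. month 13).
            continue
    return dates
```

When a cell holds an original date followed by a note about an extension, the first element would be the original date and the last the most recent one.&lt;br /&gt;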
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Old Work==&lt;br /&gt;
This is the work that was done before I knew about the XML files provided by the USITC (see Alternative Solutions section below)&lt;br /&gt;
&lt;br /&gt;
The data I was able to scrape from the HTML of https://www.usitc.gov/secretary/fed_reg_notices/337.htm is in a CSV file.&lt;br /&gt;
The results.csv file is found here:&lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
For every notice paper, there is a line in the CSV file that&lt;br /&gt;
contains the Investigation Title, Investigation No., link to the PDF on the website, Notice description, and date the notice was issued&lt;br /&gt;
&lt;br /&gt;
I have also downloaded the PDFs from the website, using the links in the CSV file. Some of the PDFs could not be downloaded. The PDFs are here:&lt;br /&gt;
 E:\McNair\Projects\USITC\pdfs_copy&lt;br /&gt;
&lt;br /&gt;
These files were downloaded using the script on the Postgres server. Downloading PDFs directly onto the remote Windows machine did not work, for reasons that are still unclear.&lt;br /&gt;
You must download the PDFs on the Postgres server and transfer them to the RDP. The script to download the PDFs is here:&lt;br /&gt;
 128.42.44.182\bulk\USITC\download&lt;br /&gt;
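&lt;br /&gt;
The following is not the script itself, only a minimal sketch of the idea behind it: try every URL, record failures, and keep going rather than stopping on the first bad link. The function name download_all and its arguments are made up for illustration, and it uses only the Python standard library.&lt;br /&gt;

```python
import http.client
import os
import urllib.request


def download_all(urls, out_dir):
    """Try every URL; return (saved_paths, failures) instead of stopping on errors."""
    os.makedirs(out_dir, exist_ok=True)
    saved, failed = [], []
    for url in urls:
        # Name the local file after the last path segment of the URL.
        target = os.path.join(out_dir, url.rstrip("/").split("/")[-1])
        try:
            with urllib.request.urlopen(url, timeout=30) as response:
                data = response.read()
        except (ValueError, OSError, http.client.HTTPException) as err:
            # ValueError: improperly formatted URL.
            # OSError: covers URLError/HTTPError (missing URL) and timeouts.
            # HTTPException: connection dropped partway through the read.
            failed.append((url, repr(err)))
            continue
        with open(target, "wb") as handle:
            handle.write(data)
        saved.append(target)
    return saved, failed
```

The failures list can be written out afterwards to see which PDFs still need to be fetched by hand.&lt;br /&gt;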
&lt;br /&gt;
The PDFs can be converted to text using the PDF scraper from a previous project, found here:&lt;br /&gt;
 E:\McNair\software\utilities\PDF_RIPPER&lt;br /&gt;
That script was modified to process all of the PDFs in the pdfs folder. The modified code is in the McNair/Projects/USITC/ directory and is called pdf_to_text_bulk.py.&lt;br /&gt;
&lt;br /&gt;
An example of PDF parsing that works is parsing this pdf: https://www.usitc.gov/secretary/fed_reg_notices/337/337_959_notice02062017sgl.pdf&lt;br /&gt;
 E:\McNair\Projects\USITC\Parsed_Texts\337_959_notice02062017sgl.txt&lt;br /&gt;
There are PDFs where the parsing does not work completely and the  text is scrambled.&lt;br /&gt;
&lt;br /&gt;
I have started using NLTK to gather information about the filings. See code in Projects/USITC/ProcessingTexts&lt;br /&gt;
I am working on extracting the respondents from these documents.&lt;br /&gt;
&lt;br /&gt;
==Alternative Solutions==&lt;br /&gt;
Here is where you can get data on the USITC 337 notices instead of extracting this information from PDFs&lt;br /&gt;
&lt;br /&gt;
337Info - Unfair Import Investigations Information System&lt;br /&gt;
&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/&lt;br /&gt;
&lt;br /&gt;
This page links to the complaints. Most of the data that can be gathered is there, in a clean format that could easily be web scraped if needed. There is a query builder (link below) and a link that downloads all of the raw data in JSON or XML.&lt;br /&gt;
&lt;br /&gt;
Here are links to various  statistics we could use:&lt;br /&gt;
Section 337 Statistics: Settlement Rate Data&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics_settlement_rate_data.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
 https://www.usitc.gov/press_room/337_stats.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics: Number of New, Completed, and Active Investigations by Fiscal Year (Updated Quarterly)&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics_number_new_completed_and_active.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
This contains links to various other pages with statistics&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics.htm&lt;br /&gt;
&lt;br /&gt;
Here is a dictionary of terms used in these Documents&lt;br /&gt;
 https://www.usitc.gov/documents/337Info_ext_DataDictionary.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There are FAQs listed here &lt;br /&gt;
 https://www.usitc.gov/documents/337Info_FAQ.pdf&lt;br /&gt;
&lt;br /&gt;
There is a query builder for Section 337 Notices Here&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/advanced&lt;br /&gt;
To use you must select fields from the GUI at the bottom of the page&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=21325</id>
		<title>USITC</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=21325"/>
		<updated>2017-10-30T20:27:43Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=USITC Data&lt;br /&gt;
|Has owner=Harrison Brown&lt;br /&gt;
|Has start date=9/11/2017&lt;br /&gt;
|Has notes=In Progress&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Files==&lt;br /&gt;
&lt;br /&gt;
My files are in two different places.&lt;br /&gt;
Development work was done here:&lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
I had to use the Postgres SQL server to download the PDFs.&lt;br /&gt;
The Postgres SQL Server:&lt;br /&gt;
 128.42.44.182/bulk/USITC&lt;br /&gt;
&lt;br /&gt;
==Additional Information==&lt;br /&gt;
There is more information that the USITC provides besides 337 notices. &lt;br /&gt;
&lt;br /&gt;
Here is information and a database on Section 701/731&lt;br /&gt;
 https://www.usitc.gov/trade_remedy/trade_research_tools&lt;br /&gt;
 https://pubapps2.usitc.gov/sunset/&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==New Work==&lt;br /&gt;
There is an XML file that contains information on investigations. To get it, go to the link below and click the 'xml' link under the 'Cases instituted after October 2008' tab.&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/&lt;br /&gt;
&lt;br /&gt;
The information in this file can be read with an XML parser. For each investigation, we can find:&lt;br /&gt;
* Investigation Number ex - (&amp;lt;entry key=&amp;quot;investigationNo&amp;quot;&amp;gt;966&amp;lt;/entry&amp;gt;)&lt;br /&gt;
* Date of publication Notice - (&amp;lt;entry key=&amp;quot;dateOfPublicationFrNotice&amp;quot;&amp;gt;2015-09-24T04:00:00.000Z&amp;lt;/entry&amp;gt;)&lt;br /&gt;
* Title ex - &amp;lt;entry key=&amp;quot;title&amp;quot;&amp;gt;Silicon-on-Insulator Wafers&amp;lt;/entry&amp;gt;&lt;br /&gt;
* There is an entry for patent numbers, ex - &amp;lt;entry key=&amp;quot;patentNumbers&amp;quot;&amp;gt;&lt;br /&gt;
* Investigation Type ex - &amp;lt;entry key=&amp;quot;investigationType&amp;quot;&amp;gt;Violation&amp;lt;/entry&amp;gt;&lt;br /&gt;
* Respondents can be found under &amp;lt;entry key=&amp;quot;respondent&amp;quot;&amp;gt;&lt;br /&gt;
* Complainant can be found under &amp;lt;entry key=&amp;quot;complainant&amp;quot;&amp;gt;&lt;br /&gt;
Additional information can also be gathered from the XML document&lt;br /&gt;
&lt;br /&gt;
To find information on cases instituted prior to October 2008, go to the link above and click on 'Looking for cases instituted prior to October 2008?'; this downloads a CSV file.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Old Work==&lt;br /&gt;
This is the work that was done before I knew about the XML files provided by the USITC (see Alternative Solutions section below)&lt;br /&gt;
&lt;br /&gt;
The data I was able to scrape from the HTML of https://www.usitc.gov/secretary/fed_reg_notices/337.htm is in a CSV file.&lt;br /&gt;
The results.csv file is found here:&lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
For every notice paper, there is a line in the CSV file that&lt;br /&gt;
contains the Investigation Title, Investigation No., link to the PDF on the website, Notice description, and date the notice was issued&lt;br /&gt;
&lt;br /&gt;
I have also downloaded the PDFs from the website, using the links in the CSV file. Some of the PDFs could not be downloaded. The PDFs are here:&lt;br /&gt;
 E:\McNair\Projects\USITC\pdfs_copy&lt;br /&gt;
&lt;br /&gt;
These files were downloaded using the script on the Postgres server. Downloading PDFs directly onto the remote Windows machine did not work, for reasons that are still unclear.&lt;br /&gt;
You must download the PDFs on the Postgres server and transfer them to the RDP. The script to download the PDFs is here:&lt;br /&gt;
 128.42.44.182\bulk\USITC\download&lt;br /&gt;
&lt;br /&gt;
The PDFs can be converted to text using the PDF scraper from a previous project, found here:&lt;br /&gt;
 E:\McNair\software\utilities\PDF_RIPPER&lt;br /&gt;
That script was modified to process all of the PDFs in the pdfs folder. The modified code is in the McNair/Projects/USITC/ directory and is called pdf_to_text_bulk.py.&lt;br /&gt;
&lt;br /&gt;
An example of PDF parsing that works is parsing this pdf: https://www.usitc.gov/secretary/fed_reg_notices/337/337_959_notice02062017sgl.pdf&lt;br /&gt;
 E:\McNair\Projects\USITC\Parsed_Texts\337_959_notice02062017sgl.txt&lt;br /&gt;
There are PDFs where the parsing does not work completely and the  text is scrambled.&lt;br /&gt;
&lt;br /&gt;
I have started using NLTK to gather information about the filings. See code in Projects/USITC/ProcessingTexts&lt;br /&gt;
I am working on extracting the respondents from these documents.&lt;br /&gt;
&lt;br /&gt;
==Alternative Solutions==&lt;br /&gt;
Here is where you can get data on the USITC 337 notices instead of extracting this information from PDFs&lt;br /&gt;
&lt;br /&gt;
337Info - Unfair Import Investigations Information System&lt;br /&gt;
&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/&lt;br /&gt;
&lt;br /&gt;
This page links to the complaints. Most of the data that can be gathered is there, in a clean format that could easily be web scraped if needed. There is a query builder (link below) and a link that downloads all of the raw data in JSON or XML.&lt;br /&gt;
&lt;br /&gt;
Here are links to various  statistics we could use:&lt;br /&gt;
Section 337 Statistics: Settlement Rate Data&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics_settlement_rate_data.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
 https://www.usitc.gov/press_room/337_stats.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics: Number of New, Completed, and Active Investigations by Fiscal Year (Updated Quarterly)&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics_number_new_completed_and_active.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
This contains links to various other pages with statistics&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics.htm&lt;br /&gt;
&lt;br /&gt;
Here is a dictionary of terms used in these Documents&lt;br /&gt;
 https://www.usitc.gov/documents/337Info_ext_DataDictionary.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There are FAQs listed here &lt;br /&gt;
 https://www.usitc.gov/documents/337Info_FAQ.pdf&lt;br /&gt;
&lt;br /&gt;
There is a query builder for Section 337 Notices Here&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/advanced&lt;br /&gt;
To use you must select fields from the GUI at the bottom of the page&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=21324</id>
		<title>USITC</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=21324"/>
		<updated>2017-10-30T20:26:18Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=USITC Data&lt;br /&gt;
|Has owner=Harrison Brown&lt;br /&gt;
|Has start date=9/11/2017&lt;br /&gt;
|Has notes=In Progress&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Files==&lt;br /&gt;
&lt;br /&gt;
My files are in two different places.&lt;br /&gt;
Development work was done here:&lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
I had to use the Postgres SQL server to download the PDFs.&lt;br /&gt;
The Postgres SQL Server:&lt;br /&gt;
 128.42.44.182/bulk/USITC&lt;br /&gt;
&lt;br /&gt;
==Additional Information==&lt;br /&gt;
There is more information that the USITC provides besides 337 notices. &lt;br /&gt;
&lt;br /&gt;
Here is information and a database on Section 701/731&lt;br /&gt;
 https://www.usitc.gov/trade_remedy/trade_research_tools&lt;br /&gt;
 https://pubapps2.usitc.gov/sunset/&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==New Work==&lt;br /&gt;
There is an XML file that contains information on investigations. To get it, go to the link below and click the 'xml' link under the 'Cases instituted after October 2008' tab.&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/&lt;br /&gt;
&lt;br /&gt;
The information in this file can be read with an XML parser. For each investigation, we can find:&lt;br /&gt;
* Investigation Number ex - (&amp;lt;entry key=&amp;quot;investigationNo&amp;quot;&amp;gt;966&amp;lt;/entry&amp;gt;)&lt;br /&gt;
* Date of publication Notice - (&amp;lt;entry key=&amp;quot;dateOfPublicationFrNotice&amp;quot;&amp;gt;2015-09-24T04:00:00.000Z&amp;lt;/entry&amp;gt;)&lt;br /&gt;
* Title ex - &amp;lt;entry key=&amp;quot;title&amp;quot;&amp;gt;Silicon-on-Insulator Wafers&amp;lt;/entry&amp;gt;&lt;br /&gt;
* There is an entry for patent numbers, ex - &amp;lt;entry key=&amp;quot;patentNumbers&amp;quot;&amp;gt;&lt;br /&gt;
* Investigation Type ex - &amp;lt;entry key=&amp;quot;investigationType&amp;quot;&amp;gt;Violation&amp;lt;/entry&amp;gt;&lt;br /&gt;
* Respondents can be found under &amp;lt;entry key=&amp;quot;respondent&amp;quot;&amp;gt;&lt;br /&gt;
* Complainant can be found under &amp;lt;entry key=&amp;quot;complainant&amp;quot;&amp;gt;&lt;br /&gt;
Additional information can also be gathered from the XML document&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Old Work==&lt;br /&gt;
This is the work that was done before I knew about the XML files provided by the USITC (see Alternative Solutions section below)&lt;br /&gt;
&lt;br /&gt;
The data I was able to scrape from the HTML of https://www.usitc.gov/secretary/fed_reg_notices/337.htm is in a CSV file.&lt;br /&gt;
The results.csv file is found here:&lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
For every notice paper, there is a line in the CSV file that&lt;br /&gt;
contains the Investigation Title, Investigation No., link to the PDF on the website, Notice description, and date the notice was issued&lt;br /&gt;
&lt;br /&gt;
I have also downloaded the PDFs from the website, using the links in the CSV file. Some of the PDFs could not be downloaded. The PDFs are here:&lt;br /&gt;
 E:\McNair\Projects\USITC\pdfs_copy&lt;br /&gt;
&lt;br /&gt;
These files were downloaded using the script on the Postgres server. Downloading PDFs directly onto the remote Windows machine did not work, for reasons that are still unclear.&lt;br /&gt;
You must download the PDFs on the Postgres server and transfer them to the RDP. The script to download the PDFs is here:&lt;br /&gt;
 128.42.44.182\bulk\USITC\download&lt;br /&gt;
&lt;br /&gt;
The PDFs can be converted to text using the PDF scraper from a previous project, found here:&lt;br /&gt;
 E:\McNair\software\utilities\PDF_RIPPER&lt;br /&gt;
That script was modified to process all of the PDFs in the pdfs folder. The modified code is in the McNair/Projects/USITC/ directory and is called pdf_to_text_bulk.py.&lt;br /&gt;
&lt;br /&gt;
An example of PDF parsing that works is parsing this pdf: https://www.usitc.gov/secretary/fed_reg_notices/337/337_959_notice02062017sgl.pdf&lt;br /&gt;
 E:\McNair\Projects\USITC\Parsed_Texts\337_959_notice02062017sgl.txt&lt;br /&gt;
There are PDFs where the parsing does not work completely and the  text is scrambled.&lt;br /&gt;
&lt;br /&gt;
I have started using NLTK to gather information about the filings. See code in Projects/USITC/ProcessingTexts&lt;br /&gt;
I am working on extracting the respondents from these documents.&lt;br /&gt;
&lt;br /&gt;
==Alternative Solutions==&lt;br /&gt;
Here is where you can get data on the USITC 337 notices instead of extracting this information from PDFs&lt;br /&gt;
&lt;br /&gt;
337Info - Unfair Import Investigations Information System&lt;br /&gt;
&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/&lt;br /&gt;
&lt;br /&gt;
This page links to the complaints. Most of the data that can be gathered is there, in a clean format that could easily be web scraped if needed. There is a query builder (link below) and a link that downloads all of the raw data in JSON or XML.&lt;br /&gt;
&lt;br /&gt;
Here are links to various  statistics we could use:&lt;br /&gt;
Section 337 Statistics: Settlement Rate Data&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics_settlement_rate_data.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
 https://www.usitc.gov/press_room/337_stats.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics: Number of New, Completed, and Active Investigations by Fiscal Year (Updated Quarterly)&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics_number_new_completed_and_active.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
This contains links to various other pages with statistics&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics.htm&lt;br /&gt;
&lt;br /&gt;
Here is a dictionary of terms used in these Documents&lt;br /&gt;
 https://www.usitc.gov/documents/337Info_ext_DataDictionary.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There are FAQs listed here &lt;br /&gt;
 https://www.usitc.gov/documents/337Info_FAQ.pdf&lt;br /&gt;
&lt;br /&gt;
There is a query builder for Section 337 Notices Here&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/advanced&lt;br /&gt;
To use you must select fields from the GUI at the bottom of the page&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=21323</id>
		<title>USITC</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=21323"/>
		<updated>2017-10-30T20:25:31Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=USITC Data&lt;br /&gt;
|Has owner=Harrison Brown&lt;br /&gt;
|Has start date=9/11/2017&lt;br /&gt;
|Has notes=In Progress&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Files==&lt;br /&gt;
&lt;br /&gt;
My files are in two different places.&lt;br /&gt;
Development work was done here:&lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
I had to use the Postgres SQL server to download the PDFs.&lt;br /&gt;
The Postgres SQL Server:&lt;br /&gt;
 128.42.44.182/bulk/USITC&lt;br /&gt;
&lt;br /&gt;
==Additional Information==&lt;br /&gt;
There is more information that the USITC provides besides 337 notices. &lt;br /&gt;
&lt;br /&gt;
Here is information and a database on Section 701/731&lt;br /&gt;
 https://www.usitc.gov/trade_remedy/trade_research_tools&lt;br /&gt;
 https://pubapps2.usitc.gov/sunset/&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==New Work==&lt;br /&gt;
There is an XML file that contains information on investigations. To get it, go to the link below and click the 'xml' link under the 'Cases instituted after October 2008' tab.&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/&lt;br /&gt;
&lt;br /&gt;
The information in this file can be read with an XML parser. For each investigation, we can find:&lt;br /&gt;
* Investigation Number ex - (&amp;lt;entry key=&amp;quot;investigationNo&amp;quot;&amp;gt;966&amp;lt;/entry&amp;gt;)&lt;br /&gt;
* Date of publication Notice - (&amp;lt;entry key=&amp;quot;dateOfPublicationFrNotice&amp;quot;&amp;gt;2015-09-24T04:00:00.000Z&amp;lt;/entry&amp;gt;)&lt;br /&gt;
* Title ex - &amp;lt;entry key=&amp;quot;title&amp;quot;&amp;gt;Silicon-on-Insulator Wafers&amp;lt;/entry&amp;gt;&lt;br /&gt;
* There is an entry for patent numbers, ex - &amp;lt;entry key=&amp;quot;patentNumbers&amp;quot;&amp;gt;&lt;br /&gt;
* Investigation Type ex - &amp;lt;entry key=&amp;quot;investigationType&amp;quot;&amp;gt;Violation&amp;lt;/entry&amp;gt;&lt;br /&gt;
* Respondents can be found under &amp;lt;entry key=&amp;quot;respondent&amp;quot;&amp;gt;&lt;br /&gt;
* Complainants can be found under &amp;lt;entry key=&amp;quot;complainant&amp;quot;&amp;gt;&lt;br /&gt;
Additional information can also be gathered from the XML document&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Old Work==&lt;br /&gt;
This is the work that was done before I knew about the XML files provided by the USITC (see Alternative Solutions section below)&lt;br /&gt;
&lt;br /&gt;
The data I was able to scrape from the HTML of https://www.usitc.gov/secretary/fed_reg_notices/337.htm is in a CSV file.&lt;br /&gt;
The results.csv file is found here:&lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
For every notice paper, there is a line in the CSV file that&lt;br /&gt;
contains the Investigation Title, Investigation No., link to the PDF on the website, Notice description, and date the notice was issued&lt;br /&gt;
&lt;br /&gt;
I have also downloaded the PDFs from the website, using the links in the CSV file. Some of the PDFs could not be downloaded. The PDFs are here:&lt;br /&gt;
 E:\McNair\Projects\USITC\pdfs_copy&lt;br /&gt;
&lt;br /&gt;
These files were downloaded using the script on the Postgres server. Downloading PDFs directly onto the remote Windows machine did not work, for reasons that are still unclear.&lt;br /&gt;
You must download the PDFs on the Postgres server and transfer them to the RDP. The script to download the PDFs is here:&lt;br /&gt;
 128.42.44.182\bulk\USITC\download&lt;br /&gt;
&lt;br /&gt;
The PDFs can be converted to text using the PDF scraper from a previous project, found here:&lt;br /&gt;
 E:\McNair\software\utilities\PDF_RIPPER&lt;br /&gt;
That script was modified to process all of the PDFs in the pdfs folder. The modified code is in the McNair/Projects/USITC/ directory and is called pdf_to_text_bulk.py.&lt;br /&gt;
&lt;br /&gt;
An example of PDF parsing that works is parsing this pdf: https://www.usitc.gov/secretary/fed_reg_notices/337/337_959_notice02062017sgl.pdf&lt;br /&gt;
 E:\McNair\Projects\USITC\Parsed_Texts\337_959_notice02062017sgl.txt&lt;br /&gt;
There are PDFs where the parsing does not work completely and the  text is scrambled.&lt;br /&gt;
&lt;br /&gt;
I have started using NLTK to gather information about the filings. See code in Projects/USITC/ProcessingTexts&lt;br /&gt;
I am working on extracting the respondents from these documents.&lt;br /&gt;
&lt;br /&gt;
==Alternative Solutions==&lt;br /&gt;
Here is where you can get data on the USITC 337 notices instead of extracting this information from PDFs&lt;br /&gt;
&lt;br /&gt;
337Info - Unfair Import Investigations Information System&lt;br /&gt;
&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/&lt;br /&gt;
&lt;br /&gt;
This page links to the complaints. Most of the data that can be gathered is there, in a clean format that could easily be web scraped if needed. There is a query builder (link below) and a link that downloads all of the raw data in JSON or XML.&lt;br /&gt;
&lt;br /&gt;
Here are links to various  statistics we could use:&lt;br /&gt;
Section 337 Statistics: Settlement Rate Data&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics_settlement_rate_data.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
 https://www.usitc.gov/press_room/337_stats.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics: Number of New, Completed, and Active Investigations by Fiscal Year (Updated Quarterly)&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics_number_new_completed_and_active.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
This contains links to various other pages with statistics&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics.htm&lt;br /&gt;
&lt;br /&gt;
Here is a dictionary of terms used in these documents&lt;br /&gt;
 https://www.usitc.gov/documents/337Info_ext_DataDictionary.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There are FAQs listed here &lt;br /&gt;
 https://www.usitc.gov/documents/337Info_FAQ.pdf&lt;br /&gt;
&lt;br /&gt;
There is a query builder for Section 337 notices here&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/advanced&lt;br /&gt;
To use it, you must select fields from the GUI at the bottom of the page&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=21321</id>
		<title>USITC</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=21321"/>
		<updated>2017-10-30T20:19:11Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=USITC Data&lt;br /&gt;
|Has owner=Harrison Brown&lt;br /&gt;
|Has start date=9/11/2017&lt;br /&gt;
|Has notes=In Progress&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Files==&lt;br /&gt;
&lt;br /&gt;
My files are in two different places.&lt;br /&gt;
Development work was done here:&lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
I had to use the Postgres SQL server to download the PDFs&lt;br /&gt;
The Postgres SQL Server:&lt;br /&gt;
 128.42.44.182/bulk/USITC&lt;br /&gt;
&lt;br /&gt;
==Additional Information==&lt;br /&gt;
The USITC provides more information than just the 337 notices.&lt;br /&gt;
&lt;br /&gt;
Here is information and a database on Section 701/731&lt;br /&gt;
 https://www.usitc.gov/trade_remedy/trade_research_tools&lt;br /&gt;
 https://pubapps2.usitc.gov/sunset/&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==New Work==&lt;br /&gt;
There is an XML file that contains information on investigations. To get it, go to the link below and click the 'xml' link under the tab that is under 'Cases instituted after October 2008'.&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/&lt;br /&gt;
The information found in this file can be extracted with an XML parser. For each investigation, we can find out&lt;br /&gt;
* Test&lt;br /&gt;
* Test 1&lt;br /&gt;
* Test 2&lt;br /&gt;
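A minimal sketch of reading such a file with Python's standard xml.etree.ElementTree module. The tag names (investigations, investigation, investigationNumber, title) and the sample values are hypothetical stand-ins, not the actual schema of the USITC file; the sample document is built in code only so that the example is self-contained.&lt;br /&gt;

```python
import xml.etree.ElementTree as ET


def build_sample_xml():
    """Build a tiny stand-in document; the tag names are assumptions."""
    root = ET.Element("investigations")
    for number, title in [("337-TA-0001", "Example Widgets"),
                          ("337-TA-0002", "Example Gadgets")]:
        inv = ET.SubElement(root, "investigation")
        ET.SubElement(inv, "investigationNumber").text = number
        ET.SubElement(inv, "title").text = title
    return ET.tostring(root, encoding="unicode")


def parse_investigations(xml_text):
    """Return one dict per investigation element, keyed by child tag name."""
    root = ET.fromstring(xml_text)
    rows = []
    for inv in root.iter("investigation"):
        rows.append({child.tag: (child.text or "").strip() for child in inv})
    return rows


rows = parse_investigations(build_sample_xml())
for row in rows:
    print(row["investigationNumber"], row["title"])
```

For the real file, pass the contents of the downloaded XML to parse_investigations() and adjust the tag names to whatever the file actually uses.&lt;br /&gt;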
&lt;br /&gt;
&lt;br /&gt;
==Old Work==&lt;br /&gt;
This is the work that was done before I knew about the XML files provided by the USITC (see Alternative Solutions section below)&lt;br /&gt;
&lt;br /&gt;
Here is a CSV of the data that I have been able to scrape from the HTML of https://www.usitc.gov/secretary/fed_reg_notices/337.htm&lt;br /&gt;
It is the results.csv file found here:&lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
For every notice paper, there is a line in the CSV file that&lt;br /&gt;
contains the Investigation Title, Investigation No., link to the PDF on the website, Notice description, and date the notice was issued&lt;br /&gt;
&lt;br /&gt;
I have also downloaded the PDFs from the website. These are the PDFs from the links in the CSV file. Some of the PDFs could not be downloaded. The PDFs are here&lt;br /&gt;
 E:\McNair\Projects\USITC\pdfs_copy&lt;br /&gt;
&lt;br /&gt;
These files were downloaded using the script on the Postgres server. For some reason there are issues downloading PDFs onto the remote Windows machine.&lt;br /&gt;
You must download the PDFs on the Postgres server and transfer them to the RDP. The script to download the PDFs&lt;br /&gt;
is&lt;br /&gt;
 128.42.44.182\bulk\USITC\download&lt;br /&gt;
&lt;br /&gt;
Using the PDF scraper from a previous project, found here&lt;br /&gt;
 E:\McNair\software\utilities\PDF_RIPPER&lt;br /&gt;
you can scrape the PDFs. The scraper was modified to process every PDF in the pdfs folder. The modified code is in the McNair/Project/USITC/ directory and is&lt;br /&gt;
called pdf_to_text_bulk.py&lt;br /&gt;
&lt;br /&gt;
An example of PDF parsing that works is parsing this pdf: https://www.usitc.gov/secretary/fed_reg_notices/337/337_959_notice02062017sgl.pdf&lt;br /&gt;
 E:\McNair\Projects\USITC\Parsed_Texts\337_959_notice02062017sgl.txt&lt;br /&gt;
For some PDFs the parsing does not work completely and the text is scrambled.&lt;br /&gt;
&lt;br /&gt;
I have started using NLTK to gather information about the filings. See code in Projects/USITC/ProcessingTexts&lt;br /&gt;
I am working on extracting the respondents from these documents.&lt;br /&gt;
&lt;br /&gt;
==Alternative Solutions==&lt;br /&gt;
Here is where you can get data on the USITC 337 notices instead of extracting this information from PDFs&lt;br /&gt;
&lt;br /&gt;
337Info - Unfair Import Investigations Information System&lt;br /&gt;
&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/&lt;br /&gt;
&lt;br /&gt;
This page links to the Complaints. Most of the data that can be gathered is there, in a clean format that could easily be web scraped if needed. They have a query builder (link below). There is also a link that downloads all of the raw data in JSON or XML.&lt;br /&gt;
&lt;br /&gt;
Here are links to various statistics we could use:&lt;br /&gt;
Section 337 Statistics: Settlement Rate Data&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics_settlement_rate_data.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
 https://www.usitc.gov/press_room/337_stats.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics: Number of New, Completed, and Active Investigations by Fiscal Year (Updated Quarterly)&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics_number_new_completed_and_active.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
This contains links to various other pages with statistics&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics.htm&lt;br /&gt;
&lt;br /&gt;
Here is a dictionary of terms used in these documents&lt;br /&gt;
 https://www.usitc.gov/documents/337Info_ext_DataDictionary.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There are FAQs listed here &lt;br /&gt;
 https://www.usitc.gov/documents/337Info_FAQ.pdf&lt;br /&gt;
&lt;br /&gt;
There is a query builder for Section 337 notices here&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/advanced&lt;br /&gt;
To use it, you must select fields from the GUI at the bottom of the page&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=21320</id>
		<title>USITC</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=21320"/>
		<updated>2017-10-30T20:13:19Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=USITC Data&lt;br /&gt;
|Has owner=Harrison Brown&lt;br /&gt;
|Has start date=9/11/2017&lt;br /&gt;
|Has notes=In Progress&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Files==&lt;br /&gt;
&lt;br /&gt;
My files are in two different places.&lt;br /&gt;
Development work was done here:&lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
I had to use the Postgres SQL server to download the PDFs&lt;br /&gt;
The Postgres SQL Server:&lt;br /&gt;
 128.42.44.182/bulk/USITC&lt;br /&gt;
&lt;br /&gt;
==Additional Information==&lt;br /&gt;
The USITC provides more information than just the 337 notices.&lt;br /&gt;
&lt;br /&gt;
Here is information and a database on Section 701/731&lt;br /&gt;
 https://www.usitc.gov/trade_remedy/trade_research_tools&lt;br /&gt;
 https://pubapps2.usitc.gov/sunset/&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==New Work==&lt;br /&gt;
Here is the information that we can gather from the XML file containing information about investigations&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/&lt;br /&gt;
To get this file, click the 'xml' link under the tab that is under 'Cases instituted after October 2008'&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Old Work==&lt;br /&gt;
This is the work that was done before I knew about the XML files provided by the USITC (see Alternative Solutions section below)&lt;br /&gt;
&lt;br /&gt;
Here is a CSV of the data that I have been able to scrape from the HTML of https://www.usitc.gov/secretary/fed_reg_notices/337.htm&lt;br /&gt;
It is the results.csv file found here:&lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
For every notice paper, there is a line in the CSV file that&lt;br /&gt;
contains the Investigation Title, Investigation No., link to the PDF on the website, Notice description, and date the notice was issued&lt;br /&gt;
&lt;br /&gt;
I have also downloaded the PDFs from the website. These are the PDFs from the links in the CSV file. Some of the PDFs could not be downloaded. The PDFs are here&lt;br /&gt;
 E:\McNair\Projects\USITC\pdfs_copy&lt;br /&gt;
&lt;br /&gt;
These files were downloaded using the script on the Postgres server. For some reason there are issues downloading PDFs onto the remote Windows machine.&lt;br /&gt;
You must download the PDFs on the Postgres server and transfer them to the RDP. The script to download the PDFs&lt;br /&gt;
is&lt;br /&gt;
 128.42.44.182\bulk\USITC\download&lt;br /&gt;
&lt;br /&gt;
Using the PDF scraper from a previous project, found here&lt;br /&gt;
 E:\McNair\software\utilities\PDF_RIPPER&lt;br /&gt;
you can scrape the PDFs. The scraper was modified to process every PDF in the pdfs folder. The modified code is in the McNair/Project/USITC/ directory and is&lt;br /&gt;
called pdf_to_text_bulk.py&lt;br /&gt;
&lt;br /&gt;
An example of PDF parsing that works is parsing this pdf: https://www.usitc.gov/secretary/fed_reg_notices/337/337_959_notice02062017sgl.pdf&lt;br /&gt;
 E:\McNair\Projects\USITC\Parsed_Texts\337_959_notice02062017sgl.txt&lt;br /&gt;
For some PDFs the parsing does not work completely and the text is scrambled.&lt;br /&gt;
&lt;br /&gt;
I have started using NLTK to gather information about the filings. See code in Projects/USITC/ProcessingTexts&lt;br /&gt;
I am working on extracting the respondents from these documents.&lt;br /&gt;
&lt;br /&gt;
==Alternative Solutions==&lt;br /&gt;
Here is where you can get data on the USITC 337 notices instead of extracting this information from PDFs&lt;br /&gt;
&lt;br /&gt;
337Info - Unfair Import Investigations Information System&lt;br /&gt;
&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/&lt;br /&gt;
&lt;br /&gt;
This page links to the Complaints. Most of the data that can be gathered is there, in a clean format that could easily be web scraped if needed. They have a query builder (link below). There is also a link that downloads all of the raw data in JSON or XML.&lt;br /&gt;
&lt;br /&gt;
Here are links to various statistics we could use:&lt;br /&gt;
Section 337 Statistics: Settlement Rate Data&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics_settlement_rate_data.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
 https://www.usitc.gov/press_room/337_stats.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics: Number of New, Completed, and Active Investigations by Fiscal Year (Updated Quarterly)&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics_number_new_completed_and_active.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
This contains links to various other pages with statistics&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics.htm&lt;br /&gt;
&lt;br /&gt;
Here is a dictionary of terms used in these documents&lt;br /&gt;
 https://www.usitc.gov/documents/337Info_ext_DataDictionary.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There are FAQs listed here &lt;br /&gt;
 https://www.usitc.gov/documents/337Info_FAQ.pdf&lt;br /&gt;
&lt;br /&gt;
There is a query builder for Section 337 notices here&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/advanced&lt;br /&gt;
To use it, you must select fields from the GUI at the bottom of the page&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=XML_Parsing&amp;diff=21318</id>
		<title>XML Parsing</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=XML_Parsing&amp;diff=21318"/>
		<updated>2017-10-30T19:53:22Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Python Libraries&lt;br /&gt;
|Has owner=Harrison Brown&lt;br /&gt;
|Has deadline=Never&lt;br /&gt;
|Has keywords=XML&lt;br /&gt;
}}&lt;br /&gt;
This page is dedicated to XML parsing projects that have been done&lt;br /&gt;
&lt;br /&gt;
==Projects Using XML==&lt;br /&gt;
This will hold the projects that use XML&lt;br /&gt;
&lt;br /&gt;
* [[Reproducible_Patent_Data|Reproducible Patent Data]], uses Java&lt;br /&gt;
&lt;br /&gt;
==Potentially Relevant Projects==&lt;br /&gt;
HTML Parsing/Potentially relevant projects&lt;br /&gt;
&lt;br /&gt;
* [[Accelerator_Seed_List_(Data)|Accelerator Seed List (Data)]], look at F6S&lt;br /&gt;
&lt;br /&gt;
==Information==&lt;br /&gt;
This will hold information on how to parse XML&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=XML_Parsing&amp;diff=21305</id>
		<title>XML Parsing</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=XML_Parsing&amp;diff=21305"/>
		<updated>2017-10-30T18:58:35Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: /* Projects Using XML */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Python Libraries&lt;br /&gt;
|Has owner=Harrison Brown&lt;br /&gt;
|Has deadline=Never&lt;br /&gt;
|Has keywords=XML&lt;br /&gt;
}}&lt;br /&gt;
This page is dedicated to XML parsing projects that have been done&lt;br /&gt;
&lt;br /&gt;
==Projects Using XML==&lt;br /&gt;
This will hold the projects that use XML&lt;br /&gt;
&lt;br /&gt;
* [[Reproducible_Patent_Data|Reproducible Patent Data]]&lt;br /&gt;
&lt;br /&gt;
==Potentially Relevant Projects==&lt;br /&gt;
HTML Parsing/Potentially relevant projects&lt;br /&gt;
&lt;br /&gt;
* [[Accelerator_Seed_List_(Data)|Accelerator Seed List (Data)]], look at F6S&lt;br /&gt;
&lt;br /&gt;
==Information==&lt;br /&gt;
This will hold information on how to parse XML&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=XML_Parsing&amp;diff=21304</id>
		<title>XML Parsing</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=XML_Parsing&amp;diff=21304"/>
		<updated>2017-10-30T18:58:13Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Python Libraries&lt;br /&gt;
|Has owner=Harrison Brown&lt;br /&gt;
|Has deadline=Never&lt;br /&gt;
|Has keywords=XML&lt;br /&gt;
}}&lt;br /&gt;
This page is dedicated to XML parsing projects that have been done&lt;br /&gt;
&lt;br /&gt;
==Projects Using XML==&lt;br /&gt;
This will hold the projects that use XML&lt;br /&gt;
&lt;br /&gt;
* [[Reproducible_Patent_Data|Reproducible Patent Data]]&lt;br /&gt;
&lt;br /&gt;
==Potentially Relevant Projects==&lt;br /&gt;
HTML Parsing/Potentially relevant projects&lt;br /&gt;
&lt;br /&gt;
* [[Accelerator_Seed_List_(Data)|Accelerator Seed List (Data)]], look at F6S&lt;br /&gt;
&lt;br /&gt;
==Information==&lt;br /&gt;
This will hold information on how to parse XML&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=XML_Parsing&amp;diff=21297</id>
		<title>XML Parsing</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=XML_Parsing&amp;diff=21297"/>
		<updated>2017-10-30T18:48:43Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: /* Projects Using XML */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Python Libraries&lt;br /&gt;
|Has owner=Harrison Brown&lt;br /&gt;
|Has deadline=Never&lt;br /&gt;
|Has keywords=XML&lt;br /&gt;
}}&lt;br /&gt;
This page is dedicated to XML parsing projects that have been done&lt;br /&gt;
&lt;br /&gt;
==Projects Using XML==&lt;br /&gt;
This will hold the projects that use XML&lt;br /&gt;
&lt;br /&gt;
* [[Reproducible_Patent_Data|Reproducible Patent Data]]&lt;br /&gt;
&lt;br /&gt;
==Information==&lt;br /&gt;
This will hold information on how to parse XML&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=XML_Parsing&amp;diff=21176</id>
		<title>XML Parsing</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=XML_Parsing&amp;diff=21176"/>
		<updated>2017-10-26T20:29:23Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: Created page with &amp;quot;{{McNair Projects |Has title=Python Libraries |Has owner=Harrison Brown |Has deadline=Never |Has keywords=XML }} This page is dedicated to XML parsing project that have been d...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Python Libraries&lt;br /&gt;
|Has owner=Harrison Brown&lt;br /&gt;
|Has deadline=Never&lt;br /&gt;
|Has keywords=XML&lt;br /&gt;
}}&lt;br /&gt;
This page is dedicated to XML parsing projects that have been done&lt;br /&gt;
&lt;br /&gt;
==Projects Using XML==&lt;br /&gt;
This will hold the projects that use XML&lt;br /&gt;
&lt;br /&gt;
==Information==&lt;br /&gt;
This will hold information on how to parse XML&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Harrison_Brown_(Work_Log)&amp;diff=21174</id>
		<title>Harrison Brown (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Harrison_Brown_(Work_Log)&amp;diff=21174"/>
		<updated>2017-10-26T20:25:44Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Harrison Brown]] [[Work Logs]] [[Harrison Brown (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
===Fall 2017 Work===&lt;br /&gt;
09/07/2017 2:20pm-3:50pm - Set Up Work Log Pages, Slack, Microsoft Remote Desktop&lt;br /&gt;
&lt;br /&gt;
09/11/2017 1pm-5pm - Met with Dr. Egan and was assigned a project. Set up the USITC project page and started coding the web crawler in Python. Look in McNair/Projects/USITC for project notes and code.&lt;br /&gt;
&lt;br /&gt;
09/13/2017 1pm-3:50pm - Worked on parsing the USITC website Section 337 Notices. Nearly have all of the data I can scrape. The scraper works, but there are a few edge cases where information in the tables is part of a Notice but does not have an Investigation Number. Will hopefully finish this next time. Also added my USITC project to the projects page, since I did not have it linked&lt;br /&gt;
&lt;br /&gt;
09/14/2017 1pm-3:50pm - Have a python program that can scrape the entire webpage and navigate through all of the pages that contain section 337 documents. You can see these files and more information on the USITC project page. It can pull all of the information that is in the HTML that can be gathered for each case. The PDFs now need to be scraped; will start work on that next time. Generated a csv file with more than 4000 entries from the webpage. There is a small edge case I need to fix where the entry does not contain the Investigation No.&lt;br /&gt;
&lt;br /&gt;
09/18/2017 1pm-5:00pm - Added features to python program to pull the dates in numerical form. Worked on pulling the PDFs from the website. Currently working on pulling them in Python. The program can run and pull PDFs on my local machine but it doesn't work on the Remote Desktop. I will work on this next time.&lt;br /&gt;
&lt;br /&gt;
09/20/2017 1pm- 3:50pm - Got connected to the database server and mounted the drive onto my computer. Got the list of all the PDFs on the website and started a shell script on the database server to download all of them. I will leave it running overnight; hopefully it completes by tomorrow.&lt;br /&gt;
&lt;br /&gt;
09/20/2017 1pm- 2:30pm - The shell script did not work. Created a Python program that catches all exceptions (URL does not exist, lost connection, and improperly formatted URL). Hopefully it will complete with no problems. This program is found on the database server under the USITC folder.&lt;br /&gt;
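A rough sketch of what such a download loop could look like, using only the standard library. This is not the program on the database server; the function names and status strings are made up for illustration, and the mapping of exceptions to the three failure cases is an assumption.&lt;br /&gt;

```python
import shutil
import urllib.error
import urllib.request


def download(url, dest_path, timeout=30):
    """Try to save url to dest_path; return a short status string."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            with open(dest_path, "wb") as out:
                shutil.copyfileobj(response, out)
        return "ok"
    except ValueError:
        # Improperly formatted URL (for example, no scheme).
        return "bad_url"
    except urllib.error.HTTPError as err:
        # The server answered, but the document does not exist (404, etc.).
        return "http_error_%d" % err.code
    except (urllib.error.URLError, OSError):
        # Lost connection, DNS failure, timeout, or local write error.
        return "connection_error"


def download_all(urls, dest_for):
    """Download every URL, recording failures instead of stopping."""
    return {url: download(url, dest_for(url)) for url in urls}
```

Because every failure is turned into a status string, one bad link does not stop the run, and the returned dictionary shows which PDFs still need to be retried.&lt;br /&gt;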
&lt;br /&gt;
09/25/2017 1pm- 5:00pm - Got 3000 PDFs downloaded. Script works. Completed a task to get emails for people who had written papers about economics and entrepreneurship. Started work on parsing the PDFs to text&lt;br /&gt;
&lt;br /&gt;
09/27/2017 1pm- 3:50pm - Got the PDFs parsed to text. Some of the formatting is off; will need to determine if data can still be gathered.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
09/28/2017 1pm- 3:50pm - Helped Christy with set up on Postgres server. Looked through text documents to see what information I could gather. Looked at Stanford NLTK library for extracting the respondents from the documents.&lt;br /&gt;
&lt;br /&gt;
10/02/2017 1pm- 5:00pm - Started work with ArcGIS. Got the data with startups from Houston into the ArcGIS application. For notes see McNair/Projects/Agglomeration&lt;br /&gt;
&lt;br /&gt;
10/04/2017 1pm- 3:50pm - Worked with Peter on connecting ArcGIS to the database and displaying different points in ArcGIS&lt;br /&gt;
&lt;br /&gt;
10/05/2017 1pm- 3:50pm - Made photos for the requested maps in ArcGIS with Peter and Jeemin.&lt;br /&gt;
        To access:&lt;br /&gt;
        Go to E:\McNair\Projects\Agglomeration\HarrisonPeterWorkArcGIS&lt;br /&gt;
         The photos can be found in there&lt;br /&gt;
        To generate the photos open ArcMap with the beginMapArc file&lt;br /&gt;
        To generate a PNG, click File, then Export to export the photos&lt;br /&gt;
        To adjust the data, right click on the table name in the layers lab, hit Properties, then Query Builder&lt;br /&gt;
10/11/2017 1pm- 3:50pm - Started to use NLTK library for gathering information to extract respondents. See code in Projects/USITC/ProcessingTexts&lt;br /&gt;
&lt;br /&gt;
10/16/2017 1pm - 5:00pm - NLTK&lt;br /&gt;
* NLTK Information&lt;br /&gt;
** Need to convert text to ascii. Had issues with my PDF texts and had to convert&lt;br /&gt;
** Can use sent_tokenize() function to split document into sentences, easier than regular expressions&lt;br /&gt;
** Use pos_tag() to tag the sentences. This can be used to extract proper nouns&lt;br /&gt;
*** Trying to figure out how to use this to grab location data from these documents&lt;br /&gt;
** Worked with Peter to try to extract geographic information from the documents. We looked into tools Geograpy and GeoText. Geograpy does not have the functionality that we would like. GeoText looks to be better but we have issues with dependencies. Will try to resolve these next time.&lt;br /&gt;
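The log does not say how the ASCII conversion was done. One possible approach, shown here only as an assumption, is to normalize the text with the standard unicodedata module and drop whatever has no ASCII equivalent; to_ascii is a hypothetical helper name.&lt;br /&gt;

```python
import unicodedata


def to_ascii(text):
    """Fold text to plain ASCII, dropping characters with no ASCII form."""
    # NFKD splits accented letters into a base letter plus combining marks
    # and expands compatibility characters such as the "fi" ligature.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Soci\u00e9t\u00e9 \ufb01ling \u201cnotice\u201d"))
```

Note that characters without a decomposition, such as curly quotes, are silently removed, so this is lossy; it is meant only to get PDF text into a form that sent_tokenize() and pos_tag() can process.&lt;br /&gt;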
&lt;br /&gt;
10/18/2017 1pm - 3:50pm &lt;br /&gt;
Trying to figure out the best way to extract respondents from the documents. Right now, using NLTK exclusively will not get us any more accuracy than using regular expressions. Currently neither will allow us to match every entity correctly, so I am trying to figure out alternate approaches.&lt;br /&gt;
&lt;br /&gt;
10/19/2017 1pm - 3:50pm &lt;br /&gt;
Continued to look into NLTK. Talked with Ed about looking into alternative approaches to gathering this data.&lt;br /&gt;
&lt;br /&gt;
10/25/2017 1pm - 3:50pm &lt;br /&gt;
Found information about a USITC database that we could use. Added this information to the wiki, and updated information on USITC wiki page.&lt;br /&gt;
&lt;br /&gt;
10/26/2017 1pm - 3:50pm &lt;br /&gt;
Met with Ed to talk about the direction of the project. Starting to work on extracting information from the XML files. Working on adding documentation to wiki and work log. Looking into work from other projects that may use XML.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=21166</id>
		<title>USITC</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=21166"/>
		<updated>2017-10-26T19:56:54Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=USITC Data&lt;br /&gt;
|Has owner=Harrison Brown&lt;br /&gt;
|Has start date=9/11/2017&lt;br /&gt;
|Has notes=In Progress&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Files==&lt;br /&gt;
&lt;br /&gt;
My files are in two different places.&lt;br /&gt;
Development work was done here:&lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
I had to use the Postgres SQL server to download the PDFs&lt;br /&gt;
The Postgres SQL Server:&lt;br /&gt;
 128.42.44.182/bulk/USITC&lt;br /&gt;
&lt;br /&gt;
==Additional Information==&lt;br /&gt;
The USITC provides more information than just the 337 notices.&lt;br /&gt;
&lt;br /&gt;
Here is information and a database on Section 701/731&lt;br /&gt;
 https://www.usitc.gov/trade_remedy/trade_research_tools&lt;br /&gt;
 https://pubapps2.usitc.gov/sunset/&lt;br /&gt;
&lt;br /&gt;
==Old Work==&lt;br /&gt;
This is the work that was done before I knew about the XML files provided by the USITC (see Alternative Solutions section below)&lt;br /&gt;
&lt;br /&gt;
Here is a CSV of the data that I have been able to scrape from the HTML of https://www.usitc.gov/secretary/fed_reg_notices/337.htm&lt;br /&gt;
It is the results.csv file found here:&lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
For every notice paper, there is a line in the CSV file that&lt;br /&gt;
contains the Investigation Title, Investigation No., link to the PDF on the website, Notice description, and date the notice was issued&lt;br /&gt;
&lt;br /&gt;
I have also downloaded the PDFs from the website. These are the PDFs from the links in the CSV file. Some of the PDFs could not be downloaded. The PDFs are here&lt;br /&gt;
 E:\McNair\Projects\USITC\pdfs_copy&lt;br /&gt;
&lt;br /&gt;
These files were downloaded using the script on the Postgres server. For some reason there are issues downloading PDFs onto the remote Windows machine.&lt;br /&gt;
You must download the PDFs on the Postgres server and transfer them to the RDP. The script to download the PDFs&lt;br /&gt;
is&lt;br /&gt;
 128.42.44.182\bulk\USITC\download&lt;br /&gt;
&lt;br /&gt;
Using the PDF scraper from a previous project, found here&lt;br /&gt;
 E:\McNair\software\utilities\PDF_RIPPER&lt;br /&gt;
you can scrape the PDFs. The scraper was modified to process every PDF in the pdfs folder. The modified code is in the McNair/Project/USITC/ directory and is&lt;br /&gt;
called pdf_to_text_bulk.py&lt;br /&gt;
&lt;br /&gt;
An example of PDF parsing that works is parsing this pdf: https://www.usitc.gov/secretary/fed_reg_notices/337/337_959_notice02062017sgl.pdf&lt;br /&gt;
 E:\McNair\Projects\USITC\Parsed_Texts\337_959_notice02062017sgl.txt&lt;br /&gt;
For some PDFs the parsing does not work completely and the text is scrambled.&lt;br /&gt;
&lt;br /&gt;
I have started using NLTK to gather information about the filings. See code in Projects/USITC/ProcessingTexts&lt;br /&gt;
I am working on extracting the respondents from these documents.&lt;br /&gt;
&lt;br /&gt;
==Alternative Solutions==&lt;br /&gt;
Here is where you can get data on the USITC 337 notices instead of extracting this information from PDFs&lt;br /&gt;
&lt;br /&gt;
337Info - Unfair Import Investigations Information System&lt;br /&gt;
&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/&lt;br /&gt;
&lt;br /&gt;
This page links to the Complaints. Most of the data that can be gathered is there, in a clean format that could easily be web scraped if needed. They have a query builder (link below). There is also a link that downloads all of the raw data in JSON or XML.&lt;br /&gt;
&lt;br /&gt;
Here are links to various statistics we could use:&lt;br /&gt;
Section 337 Statistics: Settlement Rate Data&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics_settlement_rate_data.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
 https://www.usitc.gov/press_room/337_stats.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics: Number of New, Completed, and Active Investigations by Fiscal Year (Updated Quarterly)&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics_number_new_completed_and_active.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
This contains links to various other pages with statistics&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics.htm&lt;br /&gt;
&lt;br /&gt;
Here is a dictionary of terms used in these documents&lt;br /&gt;
 https://www.usitc.gov/documents/337Info_ext_DataDictionary.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There are FAQs listed here &lt;br /&gt;
 https://www.usitc.gov/documents/337Info_FAQ.pdf&lt;br /&gt;
&lt;br /&gt;
There is a query builder for Section 337 notices here&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/advanced&lt;br /&gt;
To use it, you must select fields from the GUI at the bottom of the page&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Status==&lt;br /&gt;
Currently the web scraper is able to gather all of the data that I can gather from the HTML.&lt;br /&gt;
There are a few cases where the Investigation Number is not listed and I need to test for those and&lt;br /&gt;
fix that in the code.&lt;br /&gt;
&lt;br /&gt;
Downloaded most of the PDFs. There were errors downloading some of the files. I need to determine which PDFs could not be downloaded and why.&lt;br /&gt;
Investigating what other ways we can gather the information.&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=21161</id>
		<title>USITC</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=21161"/>
		<updated>2017-10-26T19:50:19Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=USITC Data&lt;br /&gt;
|Has owner=Harrison Brown&lt;br /&gt;
|Has start date=9/11/2017&lt;br /&gt;
|Has notes=In Progress&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Files==&lt;br /&gt;
&lt;br /&gt;
My files are in two different places.&lt;br /&gt;
Development work was done here:&lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
I had to use the Postgres SQL server to download the PDFs&lt;br /&gt;
The Postgres SQL Server:&lt;br /&gt;
 128.42.44.182/bulk/USITC&lt;br /&gt;
&lt;br /&gt;
==Additional Information==&lt;br /&gt;
The USITC provides more information than just the 337 notices.&lt;br /&gt;
&lt;br /&gt;
Here is information and a database on Section 701/731&lt;br /&gt;
 https://www.usitc.gov/trade_remedy/trade_research_tools&lt;br /&gt;
 https://pubapps2.usitc.gov/sunset/&lt;br /&gt;
&lt;br /&gt;
==Old Work==&lt;br /&gt;
This is the work that was done before I knew about the XML files provided by the USITC (see Alternative Solutions section below)&lt;br /&gt;
&lt;br /&gt;
Here is a CSV of the data that I have been able to scrape from the HTML of https://www.usitc.gov/secretary/fed_reg_notices/337.htm&lt;br /&gt;
It is the results.csv file found here:&lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
For every notice paper, there is a line in the CSV file that&lt;br /&gt;
contains the Investigation Title, Investigation No., link to the PDF on the website, Notice description, and date the notice was issued&lt;br /&gt;
&lt;br /&gt;
I have also downloaded the PDFs from the website. These are the PDFs from the links in the CSV file. Some of the PDFs could not be downloaded. The PDFs are here:&lt;br /&gt;
 E:\McNair\Projects\USITC\pdfs_copy&lt;br /&gt;
&lt;br /&gt;
These files were downloaded using the script on the Postgres server. Downloading PDFs directly onto the remote Windows machine fails for a reason that has not been identified.&lt;br /&gt;
You must download the PDFs on the Postgres server and transfer them to the RDP machine. The script to download the PDFs is:&lt;br /&gt;
 128.42.44.182\bulk\USITC\download&lt;br /&gt;
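&lt;br /&gt;
A minimal Python sketch of a download step that records which links fail is shown here. It is not the script on the Postgres server; the function names, the timeout, and the example.org links are hypothetical, and it assumes each PDF is saved under the last path segment of its URL.&lt;br /&gt;

```python
import os
import tempfile
import urllib.request
from urllib.parse import urlparse


def fetch_bytes(url, timeout=30):
    """Fetch a URL and return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_pdfs(links, dest_dir, fetch=fetch_bytes):
    """Save each link into dest_dir; return (link, error text) for failures."""
    os.makedirs(dest_dir, exist_ok=True)
    failures = []
    for link in links:
        # Assumption: the file is saved under the last path segment of the URL.
        name = os.path.basename(urlparse(link).path)
        try:
            data = fetch(link)
        except OSError as error:
            # urllib.error.URLError and socket timeouts are OSError subclasses.
            failures.append((link, str(error)))
            continue
        with open(os.path.join(dest_dir, name), "wb") as handle:
            handle.write(data)
    return failures


# Offline demonstration: a stand-in fetcher replaces the network call.
def fake_fetch(url):
    if url.endswith("bad.pdf"):
        raise OSError("simulated failure")
    return b"%PDF-1.4 placeholder"


demo_links = [
    "https://example.org/notices/good.pdf",
    "https://example.org/notices/bad.pdf",
]
with tempfile.TemporaryDirectory() as dest:
    demo_failures = download_pdfs(demo_links, dest, fetch=fake_fetch)
    saved_files = sorted(os.listdir(dest))
print(saved_files, demo_failures)
```

Keeping the list of failed links together with the error text would make it easier to work out later why a given PDF could not be downloaded.&lt;br /&gt;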
&lt;br /&gt;
You can scrape the PDFs using the PDF scraper from a previous project, found here:&lt;br /&gt;
 E:\McNair\software\utilities\PDF_RIPPER&lt;br /&gt;
This file was modified to scrape all of the PDFs in the pdfs folder. The modified code is in the McNair/Project/USITC/ directory and is called pdf_to_text_bulk.py.&lt;br /&gt;
&lt;br /&gt;
An example of PDF parsing that works is parsing this pdf: https://www.usitc.gov/secretary/fed_reg_notices/337/337_959_notice02062017sgl.pdf&lt;br /&gt;
 E:\McNair\Projects\USITC\Parsed_Texts\337_959_notice02062017sgl.txt&lt;br /&gt;
For some PDFs the parsing does not work completely and the text is scrambled.&lt;br /&gt;
&lt;br /&gt;
I have started using NLTK to gather information about the filings. See code in Projects/USITC/ProcessingTexts&lt;br /&gt;
I am working on extracting the respondents from these documents.&lt;br /&gt;
&lt;br /&gt;
==Alternative Solutions==&lt;br /&gt;
Here is where you can get data on the USITC 337 notices instead of extracting this information from PDFs&lt;br /&gt;
&lt;br /&gt;
337Info - Unfair Import Investigations Information System&lt;br /&gt;
&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/&lt;br /&gt;
&lt;br /&gt;
This page links to complaints. Most of the data that can be gathered is there, in a clean format that could easily be web scraped if needed. There is a query builder (link below). There is also a link that downloads all of the raw data in JSON or XML.&lt;br /&gt;
&lt;br /&gt;
Here are links to various statistics we could use:&lt;br /&gt;
Section 337 Statistics: Settlement Rate Data&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics_settlement_rate_data.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
 https://www.usitc.gov/press_room/337_stats.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics: Number of New, Completed, and Active Investigations by Fiscal Year (Updated Quarterly)&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics_number_new_completed_and_active.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
This contains links to various other pages with statistics&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics.htm&lt;br /&gt;
&lt;br /&gt;
Here is a dictionary of terms used in these documents:&lt;br /&gt;
 https://www.usitc.gov/documents/337Info_ext_DataDictionary.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There are FAQs listed here &lt;br /&gt;
 https://www.usitc.gov/documents/337Info_FAQ.pdf&lt;br /&gt;
&lt;br /&gt;
There is a query builder for Section 337 notices here:&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/advanced&lt;br /&gt;
To use it, you must select fields from the GUI at the bottom of the page.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Status==&lt;br /&gt;
Currently the web scraper gathers all of the data that is available in the HTML.&lt;br /&gt;
There are a few cases where the Investigation Number is not listed; I need to test for those and fix that in the code.&lt;br /&gt;
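&lt;br /&gt;
One way to test for the missing Investigation Numbers is to scan the scraped CSV for rows where that field is blank. The sketch below is only an illustration: the column name "Investigation No." is an assumption about the header of results.csv, and the sample rows are made up.&lt;br /&gt;

```python
import csv
import io

# Assumed column name; the real header in results.csv may differ.
INV_COLUMN = "Investigation No."


def rows_missing_investigation_no(csv_file, column=INV_COLUMN):
    """Return (row_number, row) pairs whose investigation number is blank."""
    missing = []
    reader = csv.DictReader(csv_file)
    # Data rows start at 2 because row 1 is the header.
    for row_number, row in enumerate(reader, start=2):
        value = (row.get(column) or "").strip()
        if not value:
            missing.append((row_number, row))
    return missing


# Small in-memory example standing in for results.csv (made-up rows).
sample = io.StringIO(
    "Investigation Title,Investigation No.,Link\n"
    "Certain Widgets,337-TA-000,a.pdf\n"
    "Certain Gadgets,,b.pdf\n"
)
missing_rows = rows_missing_investigation_no(sample)
print([number for number, _ in missing_rows])
```

For the real file, the same function could be called on an open handle to results.csv, and the returned rows checked by hand against the notice page.&lt;br /&gt;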
&lt;br /&gt;
I have downloaded most of the PDFs. There were errors downloading some of the files. I need to determine which PDFs could not be downloaded and why.&lt;br /&gt;
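Finding which PDFs are missing could be done by comparing the links in the CSV against the files in the PDF folder. This is a sketch under assumptions: the link column is assumed to be named "Link", each PDF is assumed to be saved under the last path segment of its URL, and the sample links are hypothetical. It reports which files are missing or empty, not why the download failed.&lt;br /&gt;

```python
import csv
import io
import os
import tempfile
from urllib.parse import urlparse

# Assumed column name for the PDF link; adjust to the real header.
LINK_COLUMN = "Link"


def expected_pdf_names(csv_file, column=LINK_COLUMN):
    """Map each expected PDF file name to the link it came from."""
    names = {}
    for row in csv.DictReader(csv_file):
        link = (row.get(column) or "").strip()
        if link:
            names[os.path.basename(urlparse(link).path)] = link
    return names


def missing_pdfs(csv_file, pdf_dir, column=LINK_COLUMN):
    """Return links whose PDF is absent or empty in pdf_dir, sorted by name."""
    missing = []
    for name, link in sorted(expected_pdf_names(csv_file, column).items()):
        path = os.path.join(pdf_dir, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(link)
    return missing


# Demonstration with a temporary folder instead of the real pdfs folder.
sample = io.StringIO(
    "Investigation No.,Link\n"
    "337-TA-000,https://example.org/notices/a.pdf\n"
    "337-TA-001,https://example.org/notices/b.pdf\n"
)
with tempfile.TemporaryDirectory() as pdf_dir:
    with open(os.path.join(pdf_dir, "a.pdf"), "wb") as handle:
        handle.write(b"%PDF-1.4 placeholder")
    not_downloaded = missing_pdfs(sample, pdf_dir)
print(not_downloaded)
```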
I am investigating other ways to gather the information.&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=21157</id>
		<title>USITC</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=21157"/>
		<updated>2017-10-26T19:42:50Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=USITC Data&lt;br /&gt;
|Has owner=Harrison Brown&lt;br /&gt;
|Has start date=9/11/2017&lt;br /&gt;
|Has notes=In Progress&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Files==&lt;br /&gt;
&lt;br /&gt;
My files are in two different places.&lt;br /&gt;
Development work was done here:&lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
I had to use the Postgres SQL server to download the PDFs.&lt;br /&gt;
The Postgres SQL Server:&lt;br /&gt;
 128.42.44.182/bulk/USITC&lt;br /&gt;
&lt;br /&gt;
==Old Work==&lt;br /&gt;
This is the work that was done before I knew about the XML files provided by the USITC (see Alternative Solutions section below)&lt;br /&gt;
&lt;br /&gt;
Here is a csv of the data that I have been able to scrape from the HTML of https://www.usitc.gov/secretary/fed_reg_notices/337.htm&lt;br /&gt;
The results.csv file found here, &lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
For every notice paper, there is a line in the CSV file that&lt;br /&gt;
contains the Investigation Title, Investigation No., link to the PDF on the website, Notice description, and date the notice was issued&lt;br /&gt;
&lt;br /&gt;
I have also downloaded the PDFs from the website. These are the PDFs from the links in the CSV file. Some of the PDFs could not be downloaded. The PDFs are here:&lt;br /&gt;
 E:\McNair\Projects\USITC\pdfs_copy&lt;br /&gt;
&lt;br /&gt;
These files were downloaded using the script on the Postgres server. Downloading PDFs directly onto the remote Windows machine fails for a reason that has not been identified.&lt;br /&gt;
You must download the PDFs on the Postgres server and transfer them to the RDP machine. The script to download the PDFs is:&lt;br /&gt;
 128.42.44.182\bulk\USITC\download&lt;br /&gt;
&lt;br /&gt;
You can scrape the PDFs using the PDF scraper from a previous project, found here:&lt;br /&gt;
 E:\McNair\software\utilities\PDF_RIPPER&lt;br /&gt;
This file was modified to scrape all of the PDFs in the pdfs folder. The modified code is in the McNair/Project/USITC/ directory and is called pdf_to_text_bulk.py.&lt;br /&gt;
&lt;br /&gt;
An example of PDF parsing that works is parsing this pdf: https://www.usitc.gov/secretary/fed_reg_notices/337/337_959_notice02062017sgl.pdf&lt;br /&gt;
 E:\McNair\Projects\USITC\Parsed_Texts\337_959_notice02062017sgl.txt&lt;br /&gt;
There are PDFs where the parsing does not work completely and the  text is scrambled.&lt;br /&gt;
&lt;br /&gt;
I have started using NLTK to gather information about the filings. See code in Projects/USITC/ProcessingTexts&lt;br /&gt;
I am working on extracting the respondents from these documents.&lt;br /&gt;
&lt;br /&gt;
==Alternative Solutions==&lt;br /&gt;
Here is where you can get data on the USITC 337 notices instead of extracting this information from PDFs&lt;br /&gt;
&lt;br /&gt;
337Info - Unfair Import Investigations Information System&lt;br /&gt;
&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/&lt;br /&gt;
&lt;br /&gt;
This page links to complaints. Most of the data that can be gathered is there, in a clean format that could easily be web scraped if needed. There is a query builder (link below). There is also a link that downloads all of the raw data in JSON or XML.&lt;br /&gt;
&lt;br /&gt;
Here are links to various  statistics we could use:&lt;br /&gt;
Section 337 Statistics: Settlement Rate Data&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics_settlement_rate_data.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
 https://www.usitc.gov/press_room/337_stats.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics: Number of New, Completed, and Active Investigations by Fiscal Year (Updated Quarterly)&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics_number_new_completed_and_active.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
This contains links to various other pages with statistics&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics.htm&lt;br /&gt;
&lt;br /&gt;
Here is a dictionary of terms used in these Documents&lt;br /&gt;
 https://www.usitc.gov/documents/337Info_ext_DataDictionary.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There are FAQs listed here &lt;br /&gt;
 https://www.usitc.gov/documents/337Info_FAQ.pdf&lt;br /&gt;
&lt;br /&gt;
There is a query builder for Section 337 Notices Here&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/advanced&lt;br /&gt;
To use you must select fields from the GUI at the bottom of the page&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Status==&lt;br /&gt;
Currently the web scraper is able to gather all of the data that I can gather from the HTML.&lt;br /&gt;
There are a few cases where the Investigation Number is not listed and I need to test for those and&lt;br /&gt;
fix that in the code.&lt;br /&gt;
&lt;br /&gt;
I have downloaded most of the PDFs. There were errors downloading some of the files. I need to determine which PDFs could not be downloaded and why.&lt;br /&gt;
Investigating what other ways we can gather the information.&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=21153</id>
		<title>USITC</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=21153"/>
		<updated>2017-10-26T19:35:07Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=USITC Data&lt;br /&gt;
|Has owner=Harrison Brown&lt;br /&gt;
|Has start date=9/11/2017&lt;br /&gt;
|Has notes=In Progress&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Files==&lt;br /&gt;
&lt;br /&gt;
My files are in two different places.&lt;br /&gt;
Development work was done here:&lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
I had to use the Postgres SQL server to download the PDFs.&lt;br /&gt;
The Postgres SQL Server:&lt;br /&gt;
 128.42.44.182/bulk/USITC&lt;br /&gt;
&lt;br /&gt;
==Old Work==&lt;br /&gt;
This is the work that was done before I knew about the XML files provided by the USITC (see Alternative Solutions section below)&lt;br /&gt;
&lt;br /&gt;
Here is a csv of the data that I have been able to scrape from the HTML of https://www.usitc.gov/secretary/fed_reg_notices/337.htm&lt;br /&gt;
The results.csv file found here, &lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
For every notice paper, there is a line in the CSV file that&lt;br /&gt;
contains the Investigation Title, Investigation No., link to the PDF on the website, Notice description, and date the notice was issued&lt;br /&gt;
&lt;br /&gt;
I have also downloaded the PDFs from the website. These are the PDFs from the links in the CSV file. Some of the PDFs could not be downloaded. The PDFs are here:&lt;br /&gt;
 E:\McNair\Projects\USITC\pdfs_copy&lt;br /&gt;
&lt;br /&gt;
These files were downloaded using the script on the Postgres server. Downloading PDFs directly onto the remote Windows machine fails for a reason that has not been identified.&lt;br /&gt;
You must download the PDFs on the Postgres server and transfer them to the RDP machine. The script to download the PDFs is:&lt;br /&gt;
 128.42.44.182\bulk\USITC\download&lt;br /&gt;
&lt;br /&gt;
You can scrape the PDFs using the PDF scraper from a previous project, found here:&lt;br /&gt;
 E:\McNair\software\utilities\PDF_RIPPER&lt;br /&gt;
This file was modified to scrape all of the PDFs in the pdfs folder. The modified code is in the McNair/Project/USITC/ directory and is called pdf_to_text_bulk.py.&lt;br /&gt;
&lt;br /&gt;
An example of PDF parsing that works is parsing this pdf: https://www.usitc.gov/secretary/fed_reg_notices/337/337_959_notice02062017sgl.pdf&lt;br /&gt;
 E:\McNair\Projects\USITC\Parsed_Texts\337_959_notice02062017sgl.txt&lt;br /&gt;
There are PDFs where the parsing does not work completely and the  text is scrambled.&lt;br /&gt;
&lt;br /&gt;
I have started using NLTK to gather information about the filings. See code in Projects/USITC/ProcessingTexts&lt;br /&gt;
I am working on extracting the respondents from these documents.&lt;br /&gt;
&lt;br /&gt;
==Alternative Solutions==&lt;br /&gt;
&lt;br /&gt;
337Info - Unfair Import Investigations Information System&lt;br /&gt;
&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/&lt;br /&gt;
&lt;br /&gt;
This page links to complaints. Most of the data that can be gathered is there, in a clean format that could easily be web scraped if needed. There is a query builder (link below). There is also a link that downloads all of the raw data in JSON or XML.&lt;br /&gt;
&lt;br /&gt;
Here are links to various  statistics we could use:&lt;br /&gt;
Section 337 Statistics: Settlement Rate Data&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics_settlement_rate_data.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
 https://www.usitc.gov/press_room/337_stats.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics: Number of New, Completed, and Active Investigations by Fiscal Year (Updated Quarterly)&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics_number_new_completed_and_active.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
This contains links to various other pages with statistics&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics.htm&lt;br /&gt;
&lt;br /&gt;
Here is a dictionary of terms used in these Documents&lt;br /&gt;
 https://www.usitc.gov/documents/337Info_ext_DataDictionary.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There are FAQs listed here &lt;br /&gt;
 https://www.usitc.gov/documents/337Info_FAQ.pdf&lt;br /&gt;
&lt;br /&gt;
There is a query builder for Section 337 Notices Here&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/advanced&lt;br /&gt;
To use you must select fields from the GUI at the bottom of the page&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Status==&lt;br /&gt;
Currently the web scraper is able to gather all of the data that I can gather from the HTML.&lt;br /&gt;
There are a few cases where the Investigation Number is not listed and I need to test for those and&lt;br /&gt;
fix that in the code.&lt;br /&gt;
&lt;br /&gt;
I have downloaded most of the PDFs. There were errors downloading some of the files. I need to determine which PDFs could not be downloaded and why.&lt;br /&gt;
Investigating what other ways we can gather the information.&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=21152</id>
		<title>USITC</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=21152"/>
		<updated>2017-10-26T19:34:30Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=USITC Data&lt;br /&gt;
|Has owner=Harrison Brown&lt;br /&gt;
|Has start date=9/11/2017&lt;br /&gt;
|Has notes=In Progress&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Files==&lt;br /&gt;
&lt;br /&gt;
My files are in two different places.&lt;br /&gt;
Development work was done here:&lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
I had to use the Postgres SQL server to download the PDFs.&lt;br /&gt;
The Postgres SQL Server:&lt;br /&gt;
 128.42.44.182/bulk/USITC&lt;br /&gt;
&lt;br /&gt;
==Old Work==&lt;br /&gt;
This is the work that was done before I knew about the XML files provided by the USITC (see below)&lt;br /&gt;
&lt;br /&gt;
Here is a csv of the data that I have been able to scrape from the HTML of https://www.usitc.gov/secretary/fed_reg_notices/337.htm&lt;br /&gt;
The results.csv file found here, &lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
For every notice paper, there is a line in the CSV file that&lt;br /&gt;
contains the Investigation Title, Investigation No., link to the PDF on the website, Notice description, and date the notice was issued&lt;br /&gt;
&lt;br /&gt;
I have also downloaded the PDFs from the website. These are the PDFs from the links in the CSV file. Some of the PDFs could not be downloaded. The PDFs are here:&lt;br /&gt;
 E:\McNair\Projects\USITC\pdfs_copy&lt;br /&gt;
&lt;br /&gt;
These files were downloaded using the script on the Postgres server. Downloading PDFs directly onto the remote Windows machine fails for a reason that has not been identified.&lt;br /&gt;
You must download the PDFs on the Postgres server and transfer them to the RDP machine. The script to download the PDFs is:&lt;br /&gt;
 128.42.44.182\bulk\USITC\download&lt;br /&gt;
&lt;br /&gt;
You can scrape the PDFs using the PDF scraper from a previous project, found here:&lt;br /&gt;
 E:\McNair\software\utilities\PDF_RIPPER&lt;br /&gt;
This file was modified to scrape all of the PDFs in the pdfs folder. The modified code is in the McNair/Project/USITC/ directory and is called pdf_to_text_bulk.py.&lt;br /&gt;
&lt;br /&gt;
An example of PDF parsing that works is parsing this pdf: https://www.usitc.gov/secretary/fed_reg_notices/337/337_959_notice02062017sgl.pdf&lt;br /&gt;
 E:\McNair\Projects\USITC\Parsed_Texts\337_959_notice02062017sgl.txt&lt;br /&gt;
There are PDFs where the parsing does not work completely and the  text is scrambled.&lt;br /&gt;
&lt;br /&gt;
I have started using NLTK to gather information about the filings. See code in Projects/USITC/ProcessingTexts&lt;br /&gt;
I am working on extracting the respondents from these documents.&lt;br /&gt;
&lt;br /&gt;
==Alternative Solutions==&lt;br /&gt;
&lt;br /&gt;
337Info - Unfair Import Investigations Information System&lt;br /&gt;
&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/&lt;br /&gt;
&lt;br /&gt;
This page links to complaints. Most of the data that can be gathered is there, in a clean format that could easily be web scraped if needed. There is a query builder (link below). There is also a link that downloads all of the raw data in JSON or XML.&lt;br /&gt;
&lt;br /&gt;
Here are links to various  statistics we could use:&lt;br /&gt;
Section 337 Statistics: Settlement Rate Data&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics_settlement_rate_data.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
 https://www.usitc.gov/press_room/337_stats.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics: Number of New, Completed, and Active Investigations by Fiscal Year (Updated Quarterly)&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics_number_new_completed_and_active.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
This contains links to various other pages with statistics&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics.htm&lt;br /&gt;
&lt;br /&gt;
Here is a dictionary of terms used in these Documents&lt;br /&gt;
 https://www.usitc.gov/documents/337Info_ext_DataDictionary.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There are FAQs listed here &lt;br /&gt;
 https://www.usitc.gov/documents/337Info_FAQ.pdf&lt;br /&gt;
&lt;br /&gt;
There is a query builder for Section 337 Notices Here&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/advanced&lt;br /&gt;
To use you must select fields from the GUI at the bottom of the page&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Status==&lt;br /&gt;
Currently the web scraper is able to gather all of the data that I can gather from the HTML.&lt;br /&gt;
There are a few cases where the Investigation Number is not listed and I need to test for those and&lt;br /&gt;
fix that in the code.&lt;br /&gt;
&lt;br /&gt;
I have downloaded most of the PDFs. There were errors downloading some of the files. I need to determine which PDFs could not be downloaded and why.&lt;br /&gt;
Investigating what other ways we can gather the information.&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=21120</id>
		<title>USITC</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=21120"/>
		<updated>2017-10-25T19:54:37Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=USITC Data&lt;br /&gt;
|Has owner=Harrison Brown&lt;br /&gt;
|Has start date=9/11/2017&lt;br /&gt;
|Has notes=In Progress&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Files==&lt;br /&gt;
&lt;br /&gt;
My files are in two different places:&lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
The Postgres SQL Server:&lt;br /&gt;
 128.42.44.182/bulk/USITC&lt;br /&gt;
&lt;br /&gt;
==Work==&lt;br /&gt;
&lt;br /&gt;
The results.csv file found here, &lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
is a csv of the data that I have been able to scrape from the HTML&lt;br /&gt;
of https://www.usitc.gov/secretary/fed_reg_notices/337.htm&lt;br /&gt;
&lt;br /&gt;
For every notice paper, there is a line in the CSV file that&lt;br /&gt;
contains the Investigation Title, Investigation No., link to the PDF on the website, Notice description, and date the notice was issued&lt;br /&gt;
&lt;br /&gt;
I have also downloaded the PDFs from the website. These are the PDFs from the links in the CSV file. Some of the PDFs could not be downloaded. The PDFs are here:&lt;br /&gt;
 E:\McNair\Projects\USITC\pdfs_copy&lt;br /&gt;
&lt;br /&gt;
These files were downloaded using the script on the Postgres server. Downloading PDFs directly onto the remote Windows machine fails for a reason that has not been identified.&lt;br /&gt;
You must download the PDFs on the Postgres server and transfer them to the RDP machine. The script to download the PDFs is:&lt;br /&gt;
 128.42.44.182\bulk\USITC\download&lt;br /&gt;
&lt;br /&gt;
You can scrape the PDFs using the PDF scraper from a previous project, found here:&lt;br /&gt;
 E:\McNair\software\utilities\PDF_RIPPER&lt;br /&gt;
This file was modified to scrape all of the PDFs in the pdfs folder. The modified code is in the McNair/Project/USITC/ directory and is called pdf_to_text_bulk.py.&lt;br /&gt;
&lt;br /&gt;
An example of PDF parsing that works is parsing this pdf: https://www.usitc.gov/secretary/fed_reg_notices/337/337_959_notice02062017sgl.pdf&lt;br /&gt;
 E:\McNair\Projects\USITC\Parsed_Texts\337_959_notice02062017sgl.txt&lt;br /&gt;
There are PDFs where the parsing does not work completely and the  text is scrambled.&lt;br /&gt;
&lt;br /&gt;
I have started using NLTK to gather information about the filings. See code in Projects/USITC/ProcessingTexts&lt;br /&gt;
I am working on extracting the respondents from these documents.&lt;br /&gt;
&lt;br /&gt;
==Alternative Solutions==&lt;br /&gt;
&lt;br /&gt;
337Info - Unfair Import Investigations Information System&lt;br /&gt;
&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/&lt;br /&gt;
&lt;br /&gt;
This page links to complaints. Most of the data that can be gathered is there, in a clean format that could easily be web scraped if needed. There is a query builder (link below). There is also a link that downloads all of the raw data in JSON or XML.&lt;br /&gt;
&lt;br /&gt;
Here are links to various  statistics we could use:&lt;br /&gt;
Section 337 Statistics: Settlement Rate Data&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics_settlement_rate_data.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
 https://www.usitc.gov/press_room/337_stats.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics: Number of New, Completed, and Active Investigations by Fiscal Year (Updated Quarterly)&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics_number_new_completed_and_active.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
This contains links to various other pages with statistics&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics.htm&lt;br /&gt;
&lt;br /&gt;
Here is a dictionary of terms used in these Documents&lt;br /&gt;
 https://www.usitc.gov/documents/337Info_ext_DataDictionary.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There are FAQs listed here &lt;br /&gt;
 https://www.usitc.gov/documents/337Info_FAQ.pdf&lt;br /&gt;
&lt;br /&gt;
There is a query builder for Section 337 Notices Here&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/advanced&lt;br /&gt;
To use you must select fields from the GUI at the bottom of the page&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Status==&lt;br /&gt;
Currently the web scraper is able to gather all of the data that I can gather from the HTML.&lt;br /&gt;
There are a few cases where the Investigation Number is not listed and I need to test for those and&lt;br /&gt;
fix that in the code.&lt;br /&gt;
&lt;br /&gt;
I have downloaded most of the PDFs. There were errors downloading some of the files. I need to determine which PDFs could not be downloaded and why.&lt;br /&gt;
Investigating what other ways we can gather the information.&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=21119</id>
		<title>USITC</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=21119"/>
		<updated>2017-10-25T19:51:39Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=USITC Data&lt;br /&gt;
|Has owner=Harrison Brown&lt;br /&gt;
|Has start date=9/11/2017&lt;br /&gt;
|Has notes=In Progress&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Files==&lt;br /&gt;
&lt;br /&gt;
My files are in two different places:&lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
The Postgres SQL Server:&lt;br /&gt;
 128.42.44.182/bulk/USITC&lt;br /&gt;
&lt;br /&gt;
==Work==&lt;br /&gt;
&lt;br /&gt;
The results.csv file found here, &lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
is a csv of the data that I have been able to scrape from the HTML&lt;br /&gt;
of https://www.usitc.gov/secretary/fed_reg_notices/337.htm&lt;br /&gt;
&lt;br /&gt;
For every notice paper, there is a line in the CSV file that&lt;br /&gt;
contains the Investigation Title, Investigation No., link to the PDF on the website, Notice description, and date the notice was issued&lt;br /&gt;
&lt;br /&gt;
I have also downloaded the PDFs from the website. These are the PDFs from the links in the CSV file. Some of the PDFs could not be downloaded. The PDFs are here:&lt;br /&gt;
 E:\McNair\Projects\USITC\pdfs_copy&lt;br /&gt;
&lt;br /&gt;
These files were downloaded using the script on the Postgres server. Downloading PDFs directly onto the remote Windows machine fails for a reason that has not been identified.&lt;br /&gt;
You must download the PDFs on the Postgres server and transfer them to the RDP machine. The script to download the PDFs is:&lt;br /&gt;
 128.42.44.182\bulk\USITC\download&lt;br /&gt;
&lt;br /&gt;
You can scrape the PDFs using the PDF scraper from a previous project, found here:&lt;br /&gt;
 E:\McNair\software\utilities\PDF_RIPPER&lt;br /&gt;
This file was modified to scrape all of the PDFs in the pdfs folder. The modified code is in the McNair/Project/USITC/ directory and is called pdf_to_text_bulk.py.&lt;br /&gt;
&lt;br /&gt;
An example of PDF parsing that works is parsing this pdf: https://www.usitc.gov/secretary/fed_reg_notices/337/337_959_notice02062017sgl.pdf&lt;br /&gt;
 E:\McNair\Projects\USITC\Parsed_Texts\337_959_notice02062017sgl.txt&lt;br /&gt;
There are PDFs where the parsing does not work completely and the  text is scrambled.&lt;br /&gt;
&lt;br /&gt;
I have started using NLTK to gather information about the filings. See code in Projects/USITC/ProcessingTexts&lt;br /&gt;
I am working on extracting the respondents from these documents.&lt;br /&gt;
&lt;br /&gt;
==Alternative Solutions==&lt;br /&gt;
&lt;br /&gt;
337Info - Unfair Import Investigations Information System&lt;br /&gt;
&lt;br /&gt;
https://pubapps2.usitc.gov/337external/&lt;br /&gt;
&lt;br /&gt;
This page links to complaints. Most of the data that can be gathered is there, in a clean format that could easily be web scraped if needed. There is a query builder (link below). There is also a link that downloads all of the raw data in JSON or XML.&lt;br /&gt;
&lt;br /&gt;
Here are links to various  statistics we could use:&lt;br /&gt;
Section 337 Statistics: Settlement Rate Data&lt;br /&gt;
https://www.usitc.gov/intellectual_property/337_statistics_settlement_rate_data.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
 https://www.usitc.gov/press_room/337_stats.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics: Number of New, Completed, and Active Investigations by Fiscal Year (Updated Quarterly)&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics_number_new_completed_and_active.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
This contains links to various other pages with statistics&lt;br /&gt;
 https://www.usitc.gov/intellectual_property/337_statistics.htm&lt;br /&gt;
&lt;br /&gt;
Here is a dictionary of terms used in these Documents&lt;br /&gt;
 https://www.usitc.gov/documents/337Info_ext_DataDictionary.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There are FAQs listed here &lt;br /&gt;
 https://www.usitc.gov/documents/337Info_FAQ.pdf&lt;br /&gt;
&lt;br /&gt;
There is a query builder for Section 337 Notices Here&lt;br /&gt;
 https://pubapps2.usitc.gov/337external/advanced&lt;br /&gt;
&lt;br /&gt;
To use you must select fields from the GUI at the bottom of the page&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Status==&lt;br /&gt;
Currently the web scraper is able to gather all of the data that I can gather from the HTML.&lt;br /&gt;
There are a few cases where the Investigation Number is not listed and I need to test for those and&lt;br /&gt;
fix that in the code.&lt;br /&gt;
&lt;br /&gt;
I have downloaded most of the PDFs. There were errors downloading some of the files. I need to determine which PDFs could not be downloaded and why.&lt;br /&gt;
Investigating what other ways we can gather the information.&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=21118</id>
		<title>USITC</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=21118"/>
		<updated>2017-10-25T19:49:25Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=USITC Data&lt;br /&gt;
|Has owner=Harrison Brown&lt;br /&gt;
|Has start date=9/11/2017&lt;br /&gt;
|Has notes=In Progress&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Files==&lt;br /&gt;
&lt;br /&gt;
My files are in two different places:&lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
The Postgres SQL Server:&lt;br /&gt;
 128.42.44.182/bulk/USITC&lt;br /&gt;
&lt;br /&gt;
==Work==&lt;br /&gt;
&lt;br /&gt;
The results.csv file found here, &lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
is a csv of the data that I have been able to scrape from the HTML&lt;br /&gt;
of https://www.usitc.gov/secretary/fed_reg_notices/337.htm&lt;br /&gt;
&lt;br /&gt;
For every notice, there is a line in the CSV file that&lt;br /&gt;
contains the Investigation Title, Investigation No., link to the PDF on the website, Notice description, and date the notice was issued.&lt;br /&gt;
&lt;br /&gt;
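As a rough illustration of the CSV layout described above, the sketch below reads rows and flags the ones with no Investigation No. The column names and the sample rows are assumptions made up for illustration; the real header of results.csv may differ.&lt;br /&gt;

```python
import csv
import io

# Hypothetical column names; the actual results.csv header may differ.
FIELDS = ["title", "investigation_no", "pdf_link", "description", "date"]


def rows_missing_investigation_no(csv_file):
    """Return the rows whose Investigation No. field is empty."""
    reader = csv.DictReader(csv_file, fieldnames=FIELDS)
    return [row for row in reader if not (row["investigation_no"] or "").strip()]


# Made-up sample data standing in for results.csv.
sample = io.StringIO(
    "Certain Widgets,337-TA-000,https://example.com/a.pdf,Notice of institution,2017-01-01\n"
    "Certain Gadgets,,https://example.com/b.pdf,Notice of termination,2017-02-01\n"
)
missing = rows_missing_investigation_no(sample)
print(len(missing), missing[0]["title"])
```

To run this against the real file, an open file handle for results.csv could be passed in place of the sample, once the column order is confirmed.&lt;br /&gt;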
I have also downloaded the PDFs from the website. These are the PDFs from the links in the CSV file. Some of the PDFs could not be downloaded. The PDFs are here:&lt;br /&gt;
 E:\McNair\Projects\USITC\pdfs_copy&lt;br /&gt;
&lt;br /&gt;
These files were downloaded using the script on the Postgres Server. For some reason, there are issues downloading PDFs onto the remote Windows machine.&lt;br /&gt;
You must download the PDFs on the Postgres Server and transfer them to the RDP. The script to download the PDFs&lt;br /&gt;
is &lt;br /&gt;
 128.42.44.182\bulk\USITC\download&lt;br /&gt;
&lt;br /&gt;
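The download script itself is not reproduced on this page. The following is only a minimal sketch of a downloader that records failures instead of stopping, assuming the standard-library urllib module. The function name download_pdfs, the opener parameter (included so the fetch function can be swapped out when testing), and the failure labels are all hypothetical.&lt;br /&gt;

```python
import os
import urllib.error
import urllib.request


def download_pdfs(urls, out_dir, opener=urllib.request.urlopen):
    """Download each URL into out_dir; return a list of (url, reason) failures."""
    os.makedirs(out_dir, exist_ok=True)
    failures = []
    for url in urls:
        # Use the last path segment of the URL as the local file name.
        name = url.rstrip("/").rsplit("/", 1)[-1]
        try:
            with opener(url, timeout=30) as response:
                data = response.read()
            with open(os.path.join(out_dir, name), "wb") as handle:
                handle.write(data)
        except ValueError as err:  # improperly formatted URL
            failures.append((url, "bad url: %s" % err))
        except urllib.error.HTTPError as err:  # server replied with an error status
            failures.append((url, "http %s" % err.code))
        except (urllib.error.URLError, OSError) as err:  # network or file-system error
            failures.append((url, "connection: %s" % err))
    return failures
```

Calling download_pdfs with the PDF links from the CSV file would return the list of links that failed, along with a short reason for each, which is one way to work out which PDFs were not downloaded and why.&lt;br /&gt;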
Using the PDF scraper from a previous project, found here&lt;br /&gt;
 E:\McNair\software\utilities\PDF_RIPPER&lt;br /&gt;
you can scrape the PDFs. This file was modified to scrape all of the PDFs in the pdfs folder. The modified code is in the McNair/Project/USITC/ directory and it is&lt;br /&gt;
called pdf_to_text_bulk.py&lt;br /&gt;
&lt;br /&gt;
An example of PDF parsing that works is parsing this pdf: https://www.usitc.gov/secretary/fed_reg_notices/337/337_959_notice02062017sgl.pdf&lt;br /&gt;
 E:\McNair\Projects\USITC\Parsed_Texts\337_959_notice02062017sgl.txt&lt;br /&gt;
There are PDFs where the parsing does not work completely and the text is scrambled.&lt;br /&gt;
&lt;br /&gt;
I have started using NLTK to gather information about the filings. See code in Projects/USITC/ProcessingTexts&lt;br /&gt;
I am working on extracting the respondents from these documents.&lt;br /&gt;
&lt;br /&gt;
==Alternative Solutions==&lt;br /&gt;
&lt;br /&gt;
337Info - Unfair Import Investigations Information System&lt;br /&gt;
&lt;br /&gt;
https://pubapps2.usitc.gov/337external/&lt;br /&gt;
&lt;br /&gt;
This link has links to Complaints. Most of the data that can be gathered is there and it is in a nice format that could be easily web scraped. There is a query builder but I believe there may be errors with it. I will try to see if I can get it to work.&lt;br /&gt;
&lt;br /&gt;
There are various statistics that are publicly available that we could use.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics: Settlement Rate Data&lt;br /&gt;
https://www.usitc.gov/intellectual_property/337_statistics_settlement_rate_data.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
https://www.usitc.gov/press_room/337_stats.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics: Number of New, Completed, and Active Investigations by Fiscal Year (Updated Quarterly)&lt;br /&gt;
https://www.usitc.gov/intellectual_property/337_statistics_number_new_completed_and_active.htm&lt;br /&gt;
&lt;br /&gt;
Section 337 Statistics&lt;br /&gt;
This contains links to various other pages with statistics&lt;br /&gt;
https://www.usitc.gov/intellectual_property/337_statistics.htm&lt;br /&gt;
&lt;br /&gt;
Here is a dictionary of terms used in these Documents&lt;br /&gt;
https://www.usitc.gov/documents/337Info_ext_DataDictionary.pdf&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There are FAQs listed here &lt;br /&gt;
https://www.usitc.gov/documents/337Info_FAQ.pdf&lt;br /&gt;
&lt;br /&gt;
There is a query builder for Section 337 notices here:&lt;br /&gt;
https://pubapps2.usitc.gov/337external/advanced&lt;br /&gt;
&lt;br /&gt;
To use it, select fields from the GUI at the bottom of the page.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Status==&lt;br /&gt;
Currently the web scraper is able to gather all of the data that I can gather from the HTML.&lt;br /&gt;
There are a few cases where the Investigation Number is not listed and I need to test for those and&lt;br /&gt;
fix that in the code.&lt;br /&gt;
&lt;br /&gt;
Downloaded most of the PDFs. There were errors downloading some of the files. I need to determine which PDFs could not be downloaded and why. &lt;br /&gt;
I am investigating other ways we can gather the information.&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=21110</id>
		<title>USITC</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=21110"/>
		<updated>2017-10-25T18:54:24Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=USITC Data&lt;br /&gt;
|Has owner=Harrison Brown&lt;br /&gt;
|Has start date=9/11/2017&lt;br /&gt;
|Has notes=In Progress&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Files==&lt;br /&gt;
&lt;br /&gt;
My files are in two different places:&lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
The Postgres SQL Server:&lt;br /&gt;
 128.42.44.182/bulk/USITC&lt;br /&gt;
&lt;br /&gt;
==Work==&lt;br /&gt;
&lt;br /&gt;
The results.csv file found here, &lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
is a csv of the data that I have been able to scrape from the HTML&lt;br /&gt;
of https://www.usitc.gov/secretary/fed_reg_notices/337.htm&lt;br /&gt;
&lt;br /&gt;
For every notice, there is a line in the CSV file that&lt;br /&gt;
contains the Investigation Title, Investigation No., link to the PDF on the website, Notice description, and date the notice was issued.&lt;br /&gt;
&lt;br /&gt;
I have also downloaded the PDFs from the website. These are the PDFs from the links in the CSV file. Some of the PDFs could not be downloaded. The PDFs are here:&lt;br /&gt;
 E:\McNair\Projects\USITC\pdfs_copy&lt;br /&gt;
&lt;br /&gt;
These files were downloaded using the script on the Postgres Server. For some reason, there are issues downloading PDFs onto the remote Windows machine.&lt;br /&gt;
You must download the PDFs on the Postgres Server and transfer them to the RDP. The script to download the PDFs&lt;br /&gt;
is &lt;br /&gt;
 128.42.44.182\bulk\USITC\download&lt;br /&gt;
&lt;br /&gt;
Using the PDF scraper from a previous project, found here&lt;br /&gt;
 E:\McNair\software\utilities\PDF_RIPPER&lt;br /&gt;
you can scrape the PDFs. This file was modified to scrape all of the PDFs in the pdfs folder. The modified code is in the McNair/Project/USITC/ directory and it is&lt;br /&gt;
called pdf_to_text_bulk.py&lt;br /&gt;
&lt;br /&gt;
An example of PDF parsing that works is parsing this pdf: https://www.usitc.gov/secretary/fed_reg_notices/337/337_959_notice02062017sgl.pdf&lt;br /&gt;
 E:\McNair\Projects\USITC\Parsed_Texts\337_959_notice02062017sgl.txt&lt;br /&gt;
There are PDFs where the parsing does not work completely and the text is scrambled.&lt;br /&gt;
&lt;br /&gt;
I have started using NLTK to gather information about the filings. See code in Projects/USITC/ProcessingTexts&lt;br /&gt;
I am working on extracting the respondents from these documents.&lt;br /&gt;
&lt;br /&gt;
==Alternative Solutions==&lt;br /&gt;
&lt;br /&gt;
==Status==&lt;br /&gt;
Currently the web scraper is able to gather all of the data that I can gather from the HTML.&lt;br /&gt;
There are a few cases where the Investigation Number is not listed and I need to test for those and&lt;br /&gt;
fix that in the code.&lt;br /&gt;
&lt;br /&gt;
Downloaded most of the PDFs. There were errors downloading some of the files. I need to determine which PDFs could not be downloaded and why. &lt;br /&gt;
I am investigating other ways we can gather the information.&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=21109</id>
		<title>USITC</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=21109"/>
		<updated>2017-10-25T18:53:54Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: /* Status */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=USITC Data&lt;br /&gt;
|Has owner=Harrison Brown&lt;br /&gt;
|Has start date=9/11/2017&lt;br /&gt;
|Has notes=In Progress&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Files==&lt;br /&gt;
&lt;br /&gt;
My files are in two different places:&lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
The Postgres SQL Server:&lt;br /&gt;
 128.42.44.182/bulk/USITC&lt;br /&gt;
&lt;br /&gt;
==Work==&lt;br /&gt;
&lt;br /&gt;
The results.csv file found here, &lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
is a csv of the data that I have been able to scrape from the HTML&lt;br /&gt;
of https://www.usitc.gov/secretary/fed_reg_notices/337.htm&lt;br /&gt;
&lt;br /&gt;
For every notice, there is a line in the CSV file that&lt;br /&gt;
contains the Investigation Title, Investigation No., link to the PDF on the website, Notice description, and date the notice was issued.&lt;br /&gt;
&lt;br /&gt;
I have also downloaded the PDFs from the website. These are the PDFs from the links in the CSV file. Some of the PDFs could not be downloaded. The PDFs are here:&lt;br /&gt;
 E:\McNair\Projects\USITC\pdfs_copy&lt;br /&gt;
&lt;br /&gt;
These files were downloaded using the script on the Postgres Server. For some reason, there are issues downloading PDFs onto the remote Windows machine.&lt;br /&gt;
You must download the PDFs on the Postgres Server and transfer them to the RDP. The script to download the PDFs&lt;br /&gt;
is &lt;br /&gt;
 128.42.44.182\bulk\USITC\download&lt;br /&gt;
&lt;br /&gt;
Using the PDF scraper from a previous project, found here&lt;br /&gt;
 E:\McNair\software\utilities\PDF_RIPPER&lt;br /&gt;
you can scrape the PDFs. This file was modified to scrape all of the PDFs in the pdfs folder. The modified code is in the McNair/Project/USITC/ directory and it is&lt;br /&gt;
called pdf_to_text_bulk.py&lt;br /&gt;
&lt;br /&gt;
An example of PDF parsing that works is parsing this pdf: https://www.usitc.gov/secretary/fed_reg_notices/337/337_959_notice02062017sgl.pdf&lt;br /&gt;
 E:\McNair\Projects\USITC\Parsed_Texts\337_959_notice02062017sgl.txt&lt;br /&gt;
There are PDFs where the parsing does not work completely and the text is scrambled.&lt;br /&gt;
&lt;br /&gt;
I have started using NLTK to gather information about the filings. See code in Projects/USITC/ProcessingTexts&lt;br /&gt;
I am working on extracting the respondents from these documents.&lt;br /&gt;
&lt;br /&gt;
==Status==&lt;br /&gt;
Currently the web scraper is able to gather all of the data that I can gather from the HTML.&lt;br /&gt;
There are a few cases where the Investigation Number is not listed and I need to test for those and&lt;br /&gt;
fix that in the code.&lt;br /&gt;
&lt;br /&gt;
Downloaded most of the PDFs. There were errors downloading some of the files. I need to determine which PDFs could not be downloaded and why. &lt;br /&gt;
I am investigating other ways we can gather the information.&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=21108</id>
		<title>USITC</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=USITC&amp;diff=21108"/>
		<updated>2017-10-25T18:53:02Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=USITC Data&lt;br /&gt;
|Has owner=Harrison Brown&lt;br /&gt;
|Has start date=9/11/2017&lt;br /&gt;
|Has notes=In Progress&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Files==&lt;br /&gt;
&lt;br /&gt;
My files are in two different places:&lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
&lt;br /&gt;
The Postgres SQL Server:&lt;br /&gt;
 128.42.44.182/bulk/USITC&lt;br /&gt;
&lt;br /&gt;
==Work==&lt;br /&gt;
&lt;br /&gt;
The results.csv file found here, &lt;br /&gt;
 E:\McNair\Projects\USITC&lt;br /&gt;
is a csv of the data that I have been able to scrape from the HTML&lt;br /&gt;
of https://www.usitc.gov/secretary/fed_reg_notices/337.htm&lt;br /&gt;
&lt;br /&gt;
For every notice, there is a line in the CSV file that&lt;br /&gt;
contains the Investigation Title, Investigation No., link to the PDF on the website, Notice description, and date the notice was issued.&lt;br /&gt;
&lt;br /&gt;
I have also downloaded the PDFs from the website. These are the PDFs from the links in the CSV file. Some of the PDFs could not be downloaded. The PDFs are here:&lt;br /&gt;
 E:\McNair\Projects\USITC\pdfs_copy&lt;br /&gt;
&lt;br /&gt;
These files were downloaded using the script on the Postgres Server. For some reason, there are issues downloading PDFs onto the remote Windows machine.&lt;br /&gt;
You must download the PDFs on the Postgres Server and transfer them to the RDP. The script to download the PDFs&lt;br /&gt;
is &lt;br /&gt;
 128.42.44.182\bulk\USITC\download&lt;br /&gt;
&lt;br /&gt;
Using the PDF scraper from a previous project, found here&lt;br /&gt;
 E:\McNair\software\utilities\PDF_RIPPER&lt;br /&gt;
you can scrape the PDFs. This file was modified to scrape all of the PDFs in the pdfs folder. The modified code is in the McNair/Project/USITC/ directory and it is&lt;br /&gt;
called pdf_to_text_bulk.py&lt;br /&gt;
&lt;br /&gt;
An example of PDF parsing that works is parsing this pdf: https://www.usitc.gov/secretary/fed_reg_notices/337/337_959_notice02062017sgl.pdf&lt;br /&gt;
 E:\McNair\Projects\USITC\Parsed_Texts\337_959_notice02062017sgl.txt&lt;br /&gt;
There are PDFs where the parsing does not work completely and the text is scrambled.&lt;br /&gt;
&lt;br /&gt;
I have started using NLTK to gather information about the filings. See code in Projects/USITC/ProcessingTexts&lt;br /&gt;
I am working on extracting the respondents from these documents.&lt;br /&gt;
&lt;br /&gt;
==Status==&lt;br /&gt;
Currently the web scraper is able to gather all of the data that I can gather from the HTML.&lt;br /&gt;
There are a few cases where the Investigation Number is not listed and I need to test for those and&lt;br /&gt;
fix that in the code.&lt;br /&gt;
&lt;br /&gt;
Downloaded most of the PDFs. There were errors downloading some of the files.&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Python_Libraries&amp;diff=20928</id>
		<title>Python Libraries</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Python_Libraries&amp;diff=20928"/>
		<updated>2017-10-19T20:47:07Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Python Libraries&lt;br /&gt;
|Has owner=Peter Jalbert, Harrison Brown, Christy Warden, Jeemin Sim,&lt;br /&gt;
|Has deadline=Never&lt;br /&gt;
|Has keywords=Python, Libraries&lt;br /&gt;
}}&lt;br /&gt;
This page is dedicated to documenting all Python libraries, working or not. Please include a description of what the library is for, whether or not it is functional, and how to import and use it.&lt;br /&gt;
&lt;br /&gt;
==Geocoding Libraries==&lt;br /&gt;
&lt;br /&gt;
=NLP Libraries=&lt;br /&gt;
==NLTK==&lt;br /&gt;
NLTK is the Natural Language Toolkit&lt;br /&gt;
*NLTK Information&lt;br /&gt;
**Need to convert text to ASCII. I had issues with my PDF texts and had to convert them.&lt;br /&gt;
**Can use the sent_tokenize() function to split a document into sentences, which is easier than regular expressions&lt;br /&gt;
**Use pos_tag() to tag the sentences. This can be used to extract proper nouns&lt;br /&gt;
**There are several packages that need to be downloaded. To do this:&lt;br /&gt;
***open up python in the shell&lt;br /&gt;
****run nltk.download()&lt;br /&gt;
****download all packages&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Python_Libraries&amp;diff=20927</id>
		<title>Python Libraries</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Python_Libraries&amp;diff=20927"/>
		<updated>2017-10-19T20:41:03Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Python Libraries&lt;br /&gt;
|Has owner=Peter Jalbert, Harrison Brown, Christy Warden, Jeemin Sim,&lt;br /&gt;
|Has deadline=Never&lt;br /&gt;
|Has keywords=Python, Libraries&lt;br /&gt;
}}&lt;br /&gt;
This page is dedicated to documenting all Python libraries, working or not. Please include a description of what the library is for, whether or not it is functional, and how to import and use it.&lt;br /&gt;
&lt;br /&gt;
==Geocoding Libraries==&lt;br /&gt;
&lt;br /&gt;
=NLP Libraries=&lt;br /&gt;
==NLTK==&lt;br /&gt;
NLTK is the Natural Language Toolkit&lt;br /&gt;
*NLTK Information&lt;br /&gt;
**Need to convert text to ASCII. I had issues with my PDF texts and had to convert them.&lt;br /&gt;
**Can use the sent_tokenize() function to split a document into sentences, which is easier than regular expressions&lt;br /&gt;
**Use pos_tag() to tag the sentences. This can be used to extract proper nouns&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Python_Libraries&amp;diff=20926</id>
		<title>Python Libraries</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Python_Libraries&amp;diff=20926"/>
		<updated>2017-10-19T20:40:09Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Python Libraries&lt;br /&gt;
|Has owner=Peter Jalbert, Harrison Brown, Christy Warden, Jeemin Sim,&lt;br /&gt;
|Has deadline=Never&lt;br /&gt;
|Has keywords=Python, Libraries&lt;br /&gt;
}}&lt;br /&gt;
This page is dedicated to documenting all Python libraries, working or not. Please include a description of what the library is for, whether or not it is functional, and how to import and use it.&lt;br /&gt;
&lt;br /&gt;
==Geocoding Libraries==&lt;br /&gt;
&lt;br /&gt;
=NLP Libraries=&lt;br /&gt;
==NLTK==&lt;br /&gt;
NLTK is the Natural Language Toolkit&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Python_Libraries&amp;diff=20925</id>
		<title>Python Libraries</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Python_Libraries&amp;diff=20925"/>
		<updated>2017-10-19T20:39:34Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Python Libraries&lt;br /&gt;
|Has owner=Peter Jalbert, Harrison Brown, Christy Warden, Jeemin Sim,&lt;br /&gt;
|Has deadline=Never&lt;br /&gt;
|Has keywords=Python, Libraries&lt;br /&gt;
}}&lt;br /&gt;
This page is dedicated to documenting all Python libraries, working or not. Please include a description of what the library is for, whether or not it is functional, and how to import and use it.&lt;br /&gt;
&lt;br /&gt;
==Geocoding Libraries==&lt;br /&gt;
&lt;br /&gt;
==NLP Libraries==&lt;br /&gt;
=NLTK=&lt;br /&gt;
NLTK is the Natural Language Toolkit&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Harrison_Brown_(Work_Log)&amp;diff=20862</id>
		<title>Harrison Brown (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Harrison_Brown_(Work_Log)&amp;diff=20862"/>
		<updated>2017-10-18T20:30:46Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Harrison Brown]] [[Work Logs]] [[Harrison Brown (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
===Fall 2017 Work===&lt;br /&gt;
09/07/2017 2:20pm-3:50pm - Set Up Work Log Pages, Slack, Microsoft Remote Desktop&lt;br /&gt;
&lt;br /&gt;
09/11/2017 1pm-5pm - Met with Dr. Egan and got assigned a project. Set up the USITC project page and started coding the web crawler in Python. Look in McNair/Projects/USITC for project notes and code. &lt;br /&gt;
&lt;br /&gt;
09/13/2017 1pm-3:50pm - Worked on parsing the USITC website Section 337 Notices. Nearly have all of the data I can scrape. Scraper works, but there are a few edge cases where information in the tables is part of a Notice but does not have an Investigation Number. Will hopefully finish this next time. Also added my USITC project to the projects page; I did not have it linked.&lt;br /&gt;
&lt;br /&gt;
09/14/2017 1pm-3:50pm - Have a python program that can scrape the entire webpage and navigate through all of the pages that contain section 337 documents. You can see these files and more information on the USITC project page. It can pull all of the information that is in the HTML that can be gathered for each case. The PDFs now need to be scraped; will start work on that next time. Generated a csv file with more than 4000 entries from the webpage. There is a small edge case I need to fix where the entry does not contain the Investigation No.&lt;br /&gt;
&lt;br /&gt;
09/18/2017 1pm-5:00pm - Added features to python program to pull the dates in numerical form. Worked on pulling the PDFs from the website. Currently working on pulling them in Python. The program can run and pull PDFs on my local machine but it doesn't work on the Remote Desktop. I will work on this next time.&lt;br /&gt;
&lt;br /&gt;
09/20/2017 1pm- 3:50pm - Got connected to the database server and mounted the drive onto my computer. Got the list of all the PDFs on the website and started a shell script on the database server to download all of the PDFs. I will leave it running overnight; hopefully it completes by tomorrow. &lt;br /&gt;
&lt;br /&gt;
09/20/2017 1pm- 2:30pm - Shell program did not work. Created a Python program that can catch all exceptions (URL does not exist, lost connection, and improperly formatted URL). Hopefully it will complete with no problems. This program is found on the database server under the USITC folder.&lt;br /&gt;
&lt;br /&gt;
09/25/2017 1pm- 5:00pm - Got 3000 PDFs downloaded. Script works. Completed a task to get emails for people who had written papers about economics and entrepreneurship. Started work on parsing the PDFs to text.&lt;br /&gt;
&lt;br /&gt;
09/27/2017 1pm- 3:50pm - Got the PDFs parsed to text. Some of the formatting is off; will need to determine if data can still be gathered.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
09/28/2017 1pm- 3:50pm - Helped Christy with set up on Postgres server. Looked through text documents to see what information I could gather. Looked at Stanford NLTK library for extracting the respondents from the documents.&lt;br /&gt;
&lt;br /&gt;
10/02/2017 1pm- 5:00pm - Started work with ArcGIS. Got the data with startups from Houston into the ArcGIS application. For notes see McNair/Projects/Agglomeration&lt;br /&gt;
&lt;br /&gt;
10/04/2017 1pm- 3:50pm - Worked with Peter on connecting ArcGIS to the database and displaying different points in ArcGIS&lt;br /&gt;
&lt;br /&gt;
10/05/2017 1pm- 3:50pm - Made photos for the requested maps in ArcGIS with Peter and Jeemin.&lt;br /&gt;
        To access:&lt;br /&gt;
        Go to E:\McNair\Projects\Agglomeration\HarrisonPeterWorkArcGIS&lt;br /&gt;
         The photos can be found in there&lt;br /&gt;
        To generate the photos open ArcMap with the beginMapArc file&lt;br /&gt;
        To generate a PNG Click, File, Export to export the photos&lt;br /&gt;
        To adjust the data right click on the table name in the layers lab, and hit properties, then query builder&lt;br /&gt;
10/11/2017 1pm- 3:50pm - Started to use NLTK library for gathering information to extract respondents. See code in Projects/USITC/ProcessingTexts&lt;br /&gt;
&lt;br /&gt;
10/16/2017 1pm - 5:00pm - NLTK&lt;br /&gt;
* NLTK Information&lt;br /&gt;
** Need to convert text to ASCII. I had issues with my PDF texts and had to convert them.&lt;br /&gt;
** Can use the sent_tokenize() function to split a document into sentences, which is easier than regular expressions&lt;br /&gt;
** Use pos_tag() to tag the sentences. This can be used to extract proper nouns&lt;br /&gt;
*** Trying to figure out how to use this to grab location data from these documents&lt;br /&gt;
** Worked with Peter to try to extract geographic information from the documents. We looked into tools Geograpy and GeoText. Geograpy does not have the functionality that we would like. GeoText looks to be better but we have issues with dependencies. Will try to resolve these next time.&lt;br /&gt;
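A minimal sketch of the NLTK steps in the list above, assuming the text is first reduced to ASCII and that the tokenizer and tagger data have already been fetched with nltk.download(). The helper names to_ascii and proper_nouns are hypothetical and are not taken from the project code.&lt;br /&gt;

```python
import unicodedata


def to_ascii(text):
    """Drop characters that have no ASCII equivalent after normalization."""
    normalized = unicodedata.normalize("NFKD", text)
    return normalized.encode("ascii", "ignore").decode("ascii")


def proper_nouns(text):
    """Return the tokens NLTK tags as proper nouns (NNP or NNPS)."""
    # Imported here so to_ascii() works even without NLTK installed.
    # Requires the tokenizer and tagger data fetched via nltk.download().
    import nltk

    found = []
    for sentence in nltk.sent_tokenize(to_ascii(text)):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        found.extend(word for word, tag in tagged if tag in ("NNP", "NNPS"))
    return found


print(to_ascii("Caf\u00e9 notice \u2013 337"))
```

The proper-noun list is only a starting point: it returns individual tokens, so multi-word respondent names or locations would still need to be grouped afterwards.&lt;br /&gt;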
&lt;br /&gt;
10/18/2017 1pm - 3:50pm &lt;br /&gt;
Trying to figure out the best way to extract respondents from the documents. Right now, using NLTK exclusively will not get us any more accuracy than using regular expressions. Currently neither will allow us to match every entity correctly, so I am trying to figure out alternate approaches. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Harrison_Brown_(Work_Log)&amp;diff=20825</id>
		<title>Harrison Brown (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Harrison_Brown_(Work_Log)&amp;diff=20825"/>
		<updated>2017-10-16T21:37:56Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Harrison Brown]] [[Work Logs]] [[Harrison Brown (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
===Fall 2017 Work===&lt;br /&gt;
09/07/2017 2:20pm-3:50pm - Set Up Work Log Pages, Slack, Microsoft Remote Desktop&lt;br /&gt;
&lt;br /&gt;
09/11/2017 1pm-5pm - Met with Dr. Egan and got assigned a project. Set up the USITC project page and started coding the web crawler in Python. Look in McNair/Projects/USITC for project notes and code. &lt;br /&gt;
&lt;br /&gt;
09/13/2017 1pm-3:50pm - Worked on parsing the USITC website Section 337 Notices. Nearly have all of the data I can scrape. Scraper works, but there are a few edge cases where information in the tables is part of a Notice but does not have an Investigation Number. Will hopefully finish this next time. Also added my USITC project to the projects page; I did not have it linked.&lt;br /&gt;
&lt;br /&gt;
09/14/2017 1pm-3:50pm - Have a python program that can scrape the entire webpage and navigate through all of the pages that contain section 337 documents. You can see these files and more information on the USITC project page. It can pull all of the information that is in the HTML that can be gathered for each case. The PDFs now need to be scraped; will start work on that next time. Generated a csv file with more than 4000 entries from the webpage. There is a small edge case I need to fix where the entry does not contain the Investigation No.&lt;br /&gt;
&lt;br /&gt;
09/18/2017 1pm-5:00pm - Added features to python program to pull the dates in numerical form. Worked on pulling the PDFs from the website. Currently working on pulling them in Python. The program can run and pull PDFs on my local machine but it doesn't work on the Remote Desktop. I will work on this next time.&lt;br /&gt;
&lt;br /&gt;
09/20/2017 1pm- 3:50pm - Got connected to the database server and mounted the drive onto my computer. Got the list of all the PDFs on the website and started a shell script on the database server to download all of the PDFs. I will leave it running overnight; hopefully it completes by tomorrow. &lt;br /&gt;
&lt;br /&gt;
09/20/2017 1pm- 2:30pm - Shell program did not work. Created a Python program that can catch all exceptions (URL does not exist, lost connection, and improperly formatted URL). Hopefully it will complete with no problems. This program is found on the database server under the USITC folder.&lt;br /&gt;
&lt;br /&gt;
09/25/2017 1pm- 5:00pm - Got 3000 PDFs downloaded. Script works. Completed a task to get emails for people who had written papers about economics and entrepreneurship. Started work on parsing the PDFs to text.&lt;br /&gt;
&lt;br /&gt;
09/27/2017 1pm- 3:50pm - Got the PDFs parsed to text. Some of the formatting is off; will need to determine if data can still be gathered.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
09/28/2017 1pm- 3:50pm - Helped Christy with set up on Postgres server. Looked through text documents to see what information I could gather. Looked at Stanford NLTK library for extracting the respondents from the documents.&lt;br /&gt;
&lt;br /&gt;
10/02/2017 1pm- 5:00pm - Started work with ArcGIS. Got the data with startups from Houston into the ArcGIS application. For notes see McNair/Projects/Agglomeration&lt;br /&gt;
&lt;br /&gt;
10/04/2017 1pm- 3:50pm - Worked with Peter on connecting ArcGIS to the database and displaying different points in ArcGIS&lt;br /&gt;
&lt;br /&gt;
10/05/2017 1pm- 3:50pm - Made photos for the requested maps in ArcGIS with Peter and Jeemin.&lt;br /&gt;
        To access:&lt;br /&gt;
        Go to E:\McNair\Projects\Agglomeration\HarrisonPeterWorkArcGIS&lt;br /&gt;
         The photos can be found in there&lt;br /&gt;
        To generate the photos open ArcMap with the beginMapArc file&lt;br /&gt;
        To generate a PNG Click, File, Export to export the photos&lt;br /&gt;
        To adjust the data right click on the table name in the layers lab, and hit properties, then query builder&lt;br /&gt;
10/11/2017 1pm- 3:50pm - Started to use NLTK library for gathering information to extract respondents. See code in Projects/USITC/ProcessingTexts&lt;br /&gt;
&lt;br /&gt;
10/16/2017 1pm - 5:00pm - NLTK&lt;br /&gt;
* NLTK Information&lt;br /&gt;
** Need to convert text to ASCII. I had issues with my PDF texts and had to convert them.&lt;br /&gt;
** Can use the sent_tokenize() function to split a document into sentences, which is easier than regular expressions&lt;br /&gt;
** Use pos_tag() to tag the sentences. This can be used to extract proper nouns&lt;br /&gt;
*** Trying to figure out how to use this to grab location data from these documents&lt;br /&gt;
** Worked with Peter to try to extract geographic information from the documents. We looked into tools Geograpy and GeoText. Geograpy does not have the functionality that we would like. GeoText looks to be better but we have issues with dependencies. Will try to resolve these next time.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Harrison_Brown_(Work_Log)&amp;diff=20806</id>
		<title>Harrison Brown (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Harrison_Brown_(Work_Log)&amp;diff=20806"/>
		<updated>2017-10-16T18:25:05Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Harrison Brown]] [[Work Logs]] [[Harrison Brown (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
===Fall 2017 Work===&lt;br /&gt;
09/07/2017 2:20pm-3:50pm - Set Up Work Log Pages, Slack, Microsoft Remote Desktop&lt;br /&gt;
&lt;br /&gt;
09/11/2017 1pm-5pm - Met with Dr. Egan and got assigned a project. Set up the USITC project page and started coding the web crawler in Python. Look in McNair/Projects/USITC for project notes and code. &lt;br /&gt;
&lt;br /&gt;
09/13/2017 1pm-3:50pm - Worked on parsing the USITC website Section 337 Notices. Nearly have all of the data I can scrape. Scraper works, but there are a few edge cases where information in the tables is part of a Notice but does not have an Investigation Number. Will hopefully finish this next time. Also added my USITC project to the projects page; I did not have it linked.&lt;br /&gt;
&lt;br /&gt;
09/14/2017 1pm-3:50pm - Have a python program that can scrape the entire webpage and navigate through all of the pages that contain section 337 documents. You can see these files and more information on the USITC project page. It can pull all of the information that is in the HTML that can be gathered for each case. The PDFs now need to be scraped; will start work on that next time. Generated a csv file with more than 4000 entries from the webpage. There is a small edge case I need to fix where the entry does not contain the Investigation No.&lt;br /&gt;
&lt;br /&gt;
09/18/2017 1pm-5:00pm - Added features to python program to pull the dates in numerical form. Worked on pulling the PDFs from the website. Currently working on pulling them in Python. The program can run and pull PDFs on my local machine but it doesn't work on the Remote Desktop. I will work on this next time.&lt;br /&gt;
&lt;br /&gt;
09/20/2017 1pm- 3:50pm - Got connected to the database server and mounted the drive onto my computer. Got the list of all the PDFs on the website and started a shell script on the database server to download them. I will leave it running overnight; hopefully it completes by tomorrow.&lt;br /&gt;
&lt;br /&gt;
09/20/2017 1pm- 2:30pm - The shell script did not work. Created a Python program that can catch all of the exceptions (URL does not exist, lost connection, and improperly formatted URL). Hopefully it will complete with no problems. This program is on the database server under the USITC folder.&lt;br /&gt;
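
The three failure cases listed above can be handled along these lines. This is only a minimal sketch using the Python 3 standard library, not the program on the database server; the function name download_pdf, its parameters, and the returned error strings are hypothetical.

```python
import urllib.error
import urllib.request


def download_pdf(url, dest_path, timeout=30):
    """Try to download one PDF.

    Returns None on success, or a short error string on failure,
    so a caller can log the problem and move on to the next URL.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            data = response.read()
    except ValueError as exc:
        # Improperly formatted URL (for example, no scheme).
        return "bad url: %s" % exc
    except urllib.error.HTTPError as exc:
        # The server answered, but the document does not exist (404, etc.).
        return "http error %d" % exc.code
    except (urllib.error.URLError, OSError) as exc:
        # Lost connection, DNS failure, or timeout.
        return "connection problem: %s" % exc
    with open(dest_path, "wb") as handle:
        handle.write(data)
    return None
```

Returning an error string instead of raising lets a long overnight run continue past individual bad links.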
&lt;br /&gt;
09/25/2017 1pm- 5:00pm - Got 3000 PDFs downloaded. The script works. Completed a task to get emails for people who had written papers about economics and entrepreneurship. Started work on parsing the PDFs to text.&lt;br /&gt;
&lt;br /&gt;
09/27/2017 1pm- 3:50pm - Got the PDFs parsed to text. Some of the formatting is off, so I will need to determine whether data can still be gathered.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
09/28/2017 1pm- 3:50pm - Helped Christy with set up on Postgres server. Looked through text documents to see what information I could gather. Looked at Stanford NLTK library for extracting the respondents from the documents.&lt;br /&gt;
&lt;br /&gt;
10/02/2017 1pm- 5:00pm - Started work with ArcGIS. Got the data with startups from Houston into the ArcGIS application. For notes see McNair/Projects/Agglomeration&lt;br /&gt;
&lt;br /&gt;
10/04/2017 1pm- 3:50pm - Worked with Peter on connecting ArcGIS to the database and displaying different points in ArcGIS&lt;br /&gt;
&lt;br /&gt;
10/05/2017 1pm- 3:50pm - Made photos for the requested maps in ArcGIS with Peter and Jeemin.&lt;br /&gt;
        To access:&lt;br /&gt;
        Go to E:\McNair\Projects\Agglomeration\HarrisonPeterWorkArcGIS&lt;br /&gt;
         The photos can be found in there&lt;br /&gt;
        To generate the photos open ArcMap with the beginMapArc file&lt;br /&gt;
        To generate a PNG, click File, then Export to export the photos&lt;br /&gt;
        To adjust the data right click on the table name in the layers lab, and hit properties, then query builder&lt;br /&gt;
10/11/2017 1pm- 3:50pm - Started to use NLTK library for gathering information to extract respondents. See code in Projects/USITC/ProcessingTexts&lt;br /&gt;
&lt;br /&gt;
10/16/2017 1pm - 5:00pm - NLTK&lt;br /&gt;
* NLTK Information&lt;br /&gt;
** Need to convert text to ASCII. Had issues with my PDF texts and had to convert them&lt;br /&gt;
** Can use the sent_tokenize() function to split a document into sentences, which is easier than regular expressions&lt;br /&gt;
** Use pos_tag() to tag the sentences. This can be used to extract proper nouns&lt;br /&gt;
*** Trying to figure out how to use this to grab location data from these documents&lt;br /&gt;
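
The ASCII conversion mentioned in the first bullet above can be sketched with the standard library alone. This is an assumption about one way to do it, not the code in Projects/USITC/ProcessingTexts; the helper name to_ascii is hypothetical.

```python
import unicodedata


def to_ascii(text):
    """Decompose accented characters, then drop anything still non-ASCII."""
    # NFKD splits a character such as an accented e into a plain "e"
    # followed by a combining accent mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" silently discards the leftover marks
    # and any other character that has no ASCII form.
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

The cleaned string can then be passed to sent_tokenize() and pos_tag(). Characters with no ASCII decomposition, such as curly quotes, are dropped rather than replaced.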
&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Harrison_Brown_(Work_Log)&amp;diff=20805</id>
		<title>Harrison Brown (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Harrison_Brown_(Work_Log)&amp;diff=20805"/>
		<updated>2017-10-16T18:04:55Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Harrison Brown]] [[Work Logs]] [[Harrison Brown (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
===Fall 2017 Work===&lt;br /&gt;
09/07/2017 2:20pm-3:50pm - Set Up Work Log Pages, Slack, Microsoft Remote Desktop&lt;br /&gt;
&lt;br /&gt;
09/11/2017 1pm-5pm - Met with Dr. Egan and got assigned a project. Set up the USITC project page and started coding the web crawler in Python. Look in McNair/Projects/USITC for project notes and code.&lt;br /&gt;
&lt;br /&gt;
09/13/2017 1pm-3:50pm - Worked on parsing the Section 337 Notices on the USITC website. Nearly have all of the data I can scrape. The scraper works, but there are a few edge cases where information in the tables is part of a Notice but does not have an Investigation Number. Will hopefully finish this next time. Also added my USITC project to the projects page, since I did not have it linked.&lt;br /&gt;
&lt;br /&gt;
09/14/2017 1pm-3:50pm - Have a python program that can scrape the entire webpage and navigate through all of the pages that contain section 337 documents. You can see these files and more information on the USITC project page. It can pull all of the information that is in the HTML that can be gathered for each case. The PDFs now need to be scraped; will start work on that next time. Generated a csv file with more than 4000 entries from the webpage. There is a small edge case I need to fix where the entry does not contain the Investigation No.&lt;br /&gt;
&lt;br /&gt;
09/18/2017 1pm-5:00pm - Added features to python program to pull the dates in numerical form. Worked on pulling the PDFs from the website. Currently working on pulling them in Python. The program can run and pull PDFs on my local machine but it doesn't work on the Remote Desktop. I will work on this next time.&lt;br /&gt;
&lt;br /&gt;
09/20/2017 1pm- 3:50pm - Got connected to the database server and mounted the drive onto my computer. Got the list of all the PDFs on the website and started a shell script on the database server to download them. I will leave it running overnight; hopefully it completes by tomorrow.&lt;br /&gt;
&lt;br /&gt;
09/20/2017 1pm- 2:30pm - The shell script did not work. Created a Python program that can catch all of the exceptions (URL does not exist, lost connection, and improperly formatted URL). Hopefully it will complete with no problems. This program is on the database server under the USITC folder.&lt;br /&gt;
&lt;br /&gt;
09/25/2017 1pm- 5:00pm - Got 3000 PDFs downloaded. The script works. Completed a task to get emails for people who had written papers about economics and entrepreneurship. Started work on parsing the PDFs to text.&lt;br /&gt;
&lt;br /&gt;
09/27/2017 1pm- 3:50pm - Got the PDFs parsed to text. Some of the formatting is off, so I will need to determine whether data can still be gathered.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
09/28/2017 1pm- 3:50pm - Helped Christy with set up on Postgres server. Looked through text documents to see what information I could gather. Looked at Stanford NLTK library for extracting the respondents from the documents.&lt;br /&gt;
&lt;br /&gt;
10/02/2017 1pm- 5:00pm - Started work with ArcGIS. Got the data with startups from Houston into the ArcGIS application. For notes see McNair/Projects/Agglomeration&lt;br /&gt;
&lt;br /&gt;
10/04/2017 1pm- 3:50pm - Worked with Peter on connecting ArcGIS to the database and displaying different points in ArcGIS&lt;br /&gt;
&lt;br /&gt;
10/05/2017 1pm- 3:50pm - Made photos for the requested maps in ArcGIS with Peter and Jeemin.&lt;br /&gt;
        To access:&lt;br /&gt;
        Go to E:\McNair\Projects\Agglomeration\HarrisonPeterWorkArcGIS&lt;br /&gt;
         The photos can be found in there&lt;br /&gt;
        To generate the photos open ArcMap with the beginMapArc file&lt;br /&gt;
        To generate a PNG, click File, then Export to export the photos&lt;br /&gt;
        To adjust the data right click on the table name in the layers lab, and hit properties, then query builder&lt;br /&gt;
10/11/2017 1pm- 3:50pm - Started to use NLTK library for gathering information to extract respondents. See code in Projects/USITC/ProcessingTexts&lt;br /&gt;
&lt;br /&gt;
10/16/2017 1pm - 5:00pm - NLTK&lt;br /&gt;
* NLTK Information&lt;br /&gt;
** Can use the sent_tokenize() function to split a document into sentences, which is easier than regular expressions&lt;br /&gt;
** Use pos_tag() to tag the sentences. This can be used to extract proper nouns&lt;br /&gt;
*** Trying to figure out how to use this to grab location data from these documents&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Harrison_Brown_(Work_Log)&amp;diff=20804</id>
		<title>Harrison Brown (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Harrison_Brown_(Work_Log)&amp;diff=20804"/>
		<updated>2017-10-16T18:03:23Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Harrison Brown]] [[Work Logs]] [[Harrison Brown (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
===Fall 2017 Work===&lt;br /&gt;
09/07/2017 2:20pm-3:50pm - Set Up Work Log Pages, Slack, Microsoft Remote Desktop&lt;br /&gt;
&lt;br /&gt;
09/11/2017 1pm-5pm - Met with Dr. Egan and got assigned a project. Set up the USITC project page and started coding the web crawler in Python. Look in McNair/Projects/USITC for project notes and code.&lt;br /&gt;
&lt;br /&gt;
09/13/2017 1pm-3:50pm - Worked on parsing the Section 337 Notices on the USITC website. Nearly have all of the data I can scrape. The scraper works, but there are a few edge cases where information in the tables is part of a Notice but does not have an Investigation Number. Will hopefully finish this next time. Also added my USITC project to the projects page, since I did not have it linked.&lt;br /&gt;
&lt;br /&gt;
09/14/2017 1pm-3:50pm - Have a python program that can scrape the entire webpage and navigate through all of the pages that contain section 337 documents. You can see these files and more information on the USITC project page. It can pull all of the information that is in the HTML that can be gathered for each case. The PDFs now need to be scraped; will start work on that next time. Generated a csv file with more than 4000 entries from the webpage. There is a small edge case I need to fix where the entry does not contain the Investigation No.&lt;br /&gt;
&lt;br /&gt;
09/18/2017 1pm-5:00pm - Added features to python program to pull the dates in numerical form. Worked on pulling the PDFs from the website. Currently working on pulling them in Python. The program can run and pull PDFs on my local machine but it doesn't work on the Remote Desktop. I will work on this next time.&lt;br /&gt;
&lt;br /&gt;
09/20/2017 1pm- 3:50pm - Got connected to the database server and mounted the drive onto my computer. Got the list of all the PDFs on the website and started a shell script on the database server to download them. I will leave it running overnight; hopefully it completes by tomorrow.&lt;br /&gt;
&lt;br /&gt;
09/20/2017 1pm- 2:30pm - The shell script did not work. Created a Python program that can catch all of the exceptions (URL does not exist, lost connection, and improperly formatted URL). Hopefully it will complete with no problems. This program is on the database server under the USITC folder.&lt;br /&gt;
&lt;br /&gt;
09/25/2017 1pm- 5:00pm - Got 3000 PDFs downloaded. The script works. Completed a task to get emails for people who had written papers about economics and entrepreneurship. Started work on parsing the PDFs to text.&lt;br /&gt;
&lt;br /&gt;
09/27/2017 1pm- 3:50pm - Got the PDFs parsed to text. Some of the formatting is off, so I will need to determine whether data can still be gathered.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
09/28/2017 1pm- 3:50pm - Helped Christy with set up on Postgres server. Looked through text documents to see what information I could gather. Looked at Stanford NLTK library for extracting the respondents from the documents.&lt;br /&gt;
&lt;br /&gt;
10/02/2017 1pm- 5:00pm - Started work with ArcGIS. Got the data with startups from Houston into the ArcGIS application. For notes see McNair/Projects/Agglomeration&lt;br /&gt;
&lt;br /&gt;
10/04/2017 1pm- 3:50pm - Worked with Peter on connecting ArcGIS to the database and displaying different points in ArcGIS&lt;br /&gt;
&lt;br /&gt;
10/05/2017 1pm- 3:50pm - Made photos for the requested maps in ArcGIS with Peter and Jeemin.&lt;br /&gt;
        To access:&lt;br /&gt;
        Go to E:\McNair\Projects\Agglomeration\HarrisonPeterWorkArcGIS&lt;br /&gt;
         The photos can be found in there&lt;br /&gt;
        To generate the photos open ArcMap with the beginMapArc file&lt;br /&gt;
        To generate a PNG, click File, then Export to export the photos&lt;br /&gt;
        To adjust the data right click on the table name in the layers lab, and hit properties, then query builder&lt;br /&gt;
10/11/2017 1pm- 3:50pm - Started to use NLTK library for gathering information to extract respondents. See code in Projects/USITC/ProcessingTexts&lt;br /&gt;
&lt;br /&gt;
10/16/2017 1pm - 5:00pm - NLTK&lt;br /&gt;
* NLTK Information&lt;br /&gt;
** Can use sent_tokenize to split a document into sentences, which is easier than regular expressions&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Harrison_Brown_(Work_Log)&amp;diff=20741</id>
		<title>Harrison Brown (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Harrison_Brown_(Work_Log)&amp;diff=20741"/>
		<updated>2017-10-11T20:30:30Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Harrison Brown]] [[Work Logs]] [[Harrison Brown (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
===Fall 2017 Work===&lt;br /&gt;
09/07/2017 2:20pm-3:50pm - Set Up Work Log Pages, Slack, Microsoft Remote Desktop&lt;br /&gt;
&lt;br /&gt;
09/11/2017 1pm-5pm - Met with Dr. Egan and got assigned a project. Set up the USITC project page and started coding the web crawler in Python. Look in McNair/Projects/USITC for project notes and code.&lt;br /&gt;
&lt;br /&gt;
09/13/2017 1pm-3:50pm - Worked on parsing the Section 337 Notices on the USITC website. Nearly have all of the data I can scrape. The scraper works, but there are a few edge cases where information in the tables is part of a Notice but does not have an Investigation Number. Will hopefully finish this next time. Also added my USITC project to the projects page, since I did not have it linked.&lt;br /&gt;
&lt;br /&gt;
09/14/2017 1pm-3:50pm - Have a python program that can scrape the entire webpage and navigate through all of the pages that contain section 337 documents. You can see these files and more information on the USITC project page. It can pull all of the information that is in the HTML that can be gathered for each case. The PDFs now need to be scraped; will start work on that next time. Generated a csv file with more than 4000 entries from the webpage. There is a small edge case I need to fix where the entry does not contain the Investigation No.&lt;br /&gt;
&lt;br /&gt;
09/18/2017 1pm-5:00pm - Added features to python program to pull the dates in numerical form. Worked on pulling the PDFs from the website. Currently working on pulling them in Python. The program can run and pull PDFs on my local machine but it doesn't work on the Remote Desktop. I will work on this next time.&lt;br /&gt;
&lt;br /&gt;
09/20/2017 1pm- 3:50pm - Got connected to the database server and mounted the drive onto my computer. Got the list of all the PDFs on the website and started a shell script on the database server to download them. I will leave it running overnight; hopefully it completes by tomorrow.&lt;br /&gt;
&lt;br /&gt;
09/20/2017 1pm- 2:30pm - The shell script did not work. Created a Python program that can catch all of the exceptions (URL does not exist, lost connection, and improperly formatted URL). Hopefully it will complete with no problems. This program is on the database server under the USITC folder.&lt;br /&gt;
&lt;br /&gt;
09/25/2017 1pm- 5:00pm - Got 3000 PDFs downloaded. The script works. Completed a task to get emails for people who had written papers about economics and entrepreneurship. Started work on parsing the PDFs to text.&lt;br /&gt;
&lt;br /&gt;
09/27/2017 1pm- 3:50pm - Got the PDFs parsed to text. Some of the formatting is off, so I will need to determine whether data can still be gathered.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
09/28/2017 1pm- 3:50pm - Helped Christy with set up on Postgres server. Looked through text documents to see what information I could gather. Looked at Stanford NLTK library for extracting the respondents from the documents.&lt;br /&gt;
&lt;br /&gt;
10/02/2017 1pm- 5:00pm - Started work with ArcGIS. Got the data with startups from Houston into the ArcGIS application. For notes see McNair/Projects/Agglomeration&lt;br /&gt;
&lt;br /&gt;
10/04/2017 1pm- 3:50pm - Worked with Peter on connecting ArcGIS to the database and displaying different points in ArcGIS&lt;br /&gt;
&lt;br /&gt;
10/05/2017 1pm- 3:50pm - Made photos for the requested maps in ArcGIS with Peter and Jeemin.&lt;br /&gt;
        To access:&lt;br /&gt;
        Go to E:\McNair\Projects\Agglomeration\HarrisonPeterWorkArcGIS&lt;br /&gt;
         The photos can be found in there&lt;br /&gt;
        To generate the photos open ArcMap with the beginMapArc file&lt;br /&gt;
        To generate a PNG, click File, then Export to export the photos&lt;br /&gt;
        To adjust the data right click on the table name in the layers lab, and hit properties, then query builder&lt;br /&gt;
10/11/2017 1pm- 3:50pm - Started to use NLTK library for gathering information to extract respondents. See code in Projects/USITC/ProcessingTexts&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Harrison_Brown_(Work_Log)&amp;diff=20638</id>
		<title>Harrison Brown (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Harrison_Brown_(Work_Log)&amp;diff=20638"/>
		<updated>2017-10-05T20:35:07Z</updated>

		<summary type="html">&lt;p&gt;Hbrown512: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Harrison Brown]] [[Work Logs]] [[Harrison Brown (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
===Fall 2017 Work===&lt;br /&gt;
09/07/2017 2:20pm-3:50pm - Set Up Work Log Pages, Slack, Microsoft Remote Desktop&lt;br /&gt;
&lt;br /&gt;
09/11/2017 1pm-5pm - Met with Dr. Egan and got assigned a project. Set up the USITC project page and started coding the web crawler in Python. Look in McNair/Projects/USITC for project notes and code.&lt;br /&gt;
&lt;br /&gt;
09/13/2017 1pm-3:50pm - Worked on parsing the Section 337 Notices on the USITC website. Nearly have all of the data I can scrape. The scraper works, but there are a few edge cases where information in the tables is part of a Notice but does not have an Investigation Number. Will hopefully finish this next time. Also added my USITC project to the projects page, since I did not have it linked.&lt;br /&gt;
&lt;br /&gt;
09/14/2017 1pm-3:50pm - Have a python program that can scrape the entire webpage and navigate through all of the pages that contain section 337 documents. You can see these files and more information on the USITC project page. It can pull all of the information that is in the HTML that can be gathered for each case. The PDFs now need to be scraped; will start work on that next time. Generated a csv file with more than 4000 entries from the webpage. There is a small edge case I need to fix where the entry does not contain the Investigation No.&lt;br /&gt;
&lt;br /&gt;
09/18/2017 1pm-5:00pm - Added features to python program to pull the dates in numerical form. Worked on pulling the PDFs from the website. Currently working on pulling them in Python. The program can run and pull PDFs on my local machine but it doesn't work on the Remote Desktop. I will work on this next time.&lt;br /&gt;
&lt;br /&gt;
09/20/2017 1pm- 3:50pm - Got connected to the database server and mounted the drive onto my computer. Got the list of all the PDFs on the website and started a shell script on the database server to download them. I will leave it running overnight; hopefully it completes by tomorrow.&lt;br /&gt;
&lt;br /&gt;
09/20/2017 1pm- 2:30pm - The shell script did not work. Created a Python program that can catch all of the exceptions (URL does not exist, lost connection, and improperly formatted URL). Hopefully it will complete with no problems. This program is on the database server under the USITC folder.&lt;br /&gt;
&lt;br /&gt;
09/25/2017 1pm- 5:00pm - Got 3000 PDFs downloaded. The script works. Completed a task to get emails for people who had written papers about economics and entrepreneurship. Started work on parsing the PDFs to text.&lt;br /&gt;
&lt;br /&gt;
09/27/2017 1pm- 3:50pm - Got the PDFs parsed to text. Some of the formatting is off, so I will need to determine whether data can still be gathered.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
09/28/2017 1pm- 3:50pm - Helped Christy with set up on Postgres server. Looked through text documents to see what information I could gather. Looked at Stanford NLTK library for extracting the respondents from the documents.&lt;br /&gt;
&lt;br /&gt;
10/02/2017 1pm- 5:00pm - Started work with ArcGIS. Got the data with startups from Houston into the ArcGIS application. For notes see McNair/Projects/Agglomeration&lt;br /&gt;
&lt;br /&gt;
10/04/2017 1pm- 3:50pm - Worked with Peter on connecting ArcGIS to the database and displaying different points in ArcGIS&lt;br /&gt;
&lt;br /&gt;
10/04/2017 1pm- 3:50pm - Made photos for the requested maps in ArcGIS with Peter and Jeemin.&lt;br /&gt;
        To access:&lt;br /&gt;
        Go to E:\McNair\Projects\Agglomeration\HarrisonPeterWorkArcGIS&lt;br /&gt;
         The photos can be found in there&lt;br /&gt;
        To generate the photos open ArcMap with the beginMapArc file&lt;br /&gt;
        To generate a PNG, click File, then Export to export the photos&lt;br /&gt;
        To adjust the data right click on the table name in the layers lab, and hit properties, then query builder&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>Hbrown512</name></author>
		
	</entry>
</feed>