Changes

369 bytes added, 16:00, 30 November 2017


===Fall 2017===

[[Harrison Brown]] [[Work Logs]] [[Harrison Brown (Work Log)|(log page)]]

2017-11-29:

*Got the tab-delimited text files written for USITC data. Added detail to project page.

2017-11-29:

*Finishing up converting JSON to tab-delimited text, see USITC/JSON_scraping_python. Worked on creating images with ArcGIS.

2017-11-13:

*Worked on getting JSON to tab-delimited text.

2017-11-01:

*Looked at Oliver's code. Got git repository set up for the project on Bonobo. Started messing around with reading the XML documents in Java.

2017-10-30:

*Worked on seeing what data can be gathered from the CSV and XML files. Started project page for project.

2017-10-26:

*Met with Ed to talk about the direction of the project. Starting to work on extracting information from the XML files. Working on adding documentation to wiki and work log. Looking into work from other projects that may use XML.

2017-10-25:

*Found information about a USITC database that we could use. Added this information to the wiki, and updated information on USITC wiki page.

2017-10-19:

*Continued to look into NLTK. Talked with Ed about looking into alternative approaches to gathering this data.

2017-10-18:

*Trying to figure out the best way to extract respondents from the documents. Right now, using NLTK exclusively will not get us any more accuracy than using regular expressions. Currently neither will allow us to match every entity correctly, so trying to figure out alternate approaches.

2017-10-16:

*NLTK

** NLTK Information

*** Need to convert text to ASCII. Had issues with my PDF texts and had to convert them

*** Can use sent_tokenize() function to split document into sentences, easier than regular expressions

*** Use pos_tag() to tag the sentences. This can be used to extract proper nouns

**** Trying to figure out how to use this to grab location data from these documents

*** Worked with Peter to try to extract geographic information from the documents. We looked into tools Geograpy and GeoText. Geograpy does not have the functionality that we would like. GeoText looks to be better but we have issues with dependencies. Will try to resolve these next time.
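The NLTK steps listed above (convert to ASCII, split into sentences, tag, keep proper nouns) can be sketched as follows. This is an illustration rather than the code in Projects/USITC/ProcessingTexts: the helper names are hypothetical, and `extract_proper_nouns` assumes NLTK is installed along with its sentence tokenizer and part-of-speech tagger data.

```python
import unicodedata


def to_ascii(text):
    """Normalize text and drop any character with no ASCII equivalent."""
    normalized = unicodedata.normalize("NFKD", text)
    return normalized.encode("ascii", "ignore").decode("ascii")


def proper_noun_phrases(tagged_tokens):
    """Group consecutive NNP/NNPS tokens (Penn Treebank tags) into phrases.

    `tagged_tokens` is a list of (token, tag) pairs, the shape pos_tag() returns.
    """
    phrases, current = [], []
    for token, tag in tagged_tokens:
        if tag in ("NNP", "NNPS"):
            current.append(token)
        elif current:
            phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


def extract_proper_nouns(raw_text):
    """Run the full pipeline: ASCII cleanup, sentence split, tagging, grouping."""
    # Imported here so the two helpers above work without NLTK installed.
    from nltk import pos_tag, sent_tokenize, word_tokenize

    results = []
    for sentence in sent_tokenize(to_ascii(raw_text)):
        results.extend(proper_noun_phrases(pos_tag(word_tokenize(sentence))))
    return results
```

Proper-noun phrases are only candidates for respondents or locations; they would still need filtering against a list of known places or company suffixes.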

2017-10-11:

*Started to use NLTK library for gathering information to extract respondents. See code in Projects/USITC/ProcessingTexts

2017-10-05:

*Made photos for the requested maps in ArcGIS with Peter and Jeemin.

To access:

Go to E:\McNair\Projects\Agglomeration\HarrisonPeterWorkArcGIS

To generate a PNG, click File, then Export, to export the photos

To adjust the data, right-click the table name in the layers tab, then choose Properties, then Query Builder


2017-10-04:

*Worked with Peter on connecting ArcGIS to the database and displaying different points in ArcGIS.

2017-10-02:

*Started work with ArcGIS. Got the data with startups from Houston into the ArcGIS application. For notes see McNair/Projects/Agglomeration

2017-09-28:

*Helped Christy with set up on Postgres server. Looked through text documents to see what information I could gather. Looked at Stanford NLTK library for extracting the respondents from the documents.

2017-09-28:

*Got the PDFs parsed to text. Some of the formatting is off; will need to determine if data can still be gathered.

2017-09-25:

*Got 3000 PDFs downloaded. Script works. Completed a task to get emails for people who had written papers about economics and entrepreneurship. Started work on parsing the PDFs to text.

2017-09-20:

*Shell program did not work. Created a Python program that can catch all exceptions (URL does not exist, lost connection, and improperly formatted URL). Hopefully it will complete with no problems. This program is found on the database server under the USITC folder.

*Got connected to the database server and mounted the drive onto my computer. Got the list of all the PDFs on the website and started a shell script on the database server to download all of the PDFs. I will leave it running overnight; hopefully it completes by tomorrow.
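A download loop that survives the three failure cases named above (missing URL, lost connection, malformed URL) might be sketched like this. It is not the program stored on the database server; the function names, the injectable `fetch` parameter, and the numbered output file names are assumptions made for illustration.

```python
import os
import urllib.error
import urllib.request


def fetch_bytes(url, timeout=30):
    """Download one URL and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, dest_dir, fetch=fetch_bytes):
    """Save each URL into dest_dir; return (url, reason) pairs for failures.

    A failed URL is recorded and skipped so one bad link cannot stop the run.
    """
    os.makedirs(dest_dir, exist_ok=True)
    failures = []
    for index, url in enumerate(urls):
        try:
            data = fetch(url)
        except ValueError as err:
            # Improperly formatted URL, e.g. missing scheme.
            failures.append((url, "bad url: %s" % err))
            continue
        except urllib.error.HTTPError as err:
            # URL does not exist (404) or another HTTP error status.
            failures.append((url, "http error: %s" % err.code))
            continue
        except (urllib.error.URLError, OSError) as err:
            # Lost connection, DNS failure, or timeout.
            failures.append((url, "connection error: %s" % err))
            continue
        path = os.path.join(dest_dir, "%05d.pdf" % index)
        with open(path, "wb") as handle:
            handle.write(data)
    return failures
```

Returning the failures rather than raising makes it possible to rerun the function on just the URLs that failed after an overnight run.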

2017-09-17:

*Added features to Python program to pull the dates in numerical form. Worked on pulling the PDFs from the website. Currently working on pulling them in Python. The program can run and pull PDFs on my local machine but it doesn't work on the Remote Desktop. I will work on this next time.
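Pulling dates in numerical form can be done with the standard library, as in the sketch below. The input format shown (a spelled-out month such as "September 18, 2017") is an assumption about how the dates appear on the page, and the function name is hypothetical.

```python
from datetime import datetime


def to_numeric_date(text, input_format="%B %d, %Y"):
    """Parse a date string and return it as YYYY-MM-DD.

    `input_format` is a strptime pattern; the default is an assumed format.
    """
    return datetime.strptime(text.strip(), input_format).strftime("%Y-%m-%d")
```

A different source format can be handled by passing another pattern, for example `to_numeric_date("09/18/2017", "%m/%d/%Y")`.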

2017-09-14:

*Have a Python program that can scrape the entire webpage and navigate through all of the pages that contain Section 337 documents. You can see these files and more information on the USITC project page. It can pull all of the information that is in the HTML that can be gathered for each case. The PDFs now need to be scraped; will start work on that next time. Generated a CSV file with more than 4000 entries from the webpage. There is a small edge case I need to fix where the entry does not contain the Investigation No.

2017-09-13:

*Worked on parsing the USITC website Section 337 Notices. Nearly have all of the data I can scrape. Scraper works, but there are a few edge cases where information in the tables is part of a Notice but does not have Investigation Numbers. Will finish this hopefully next time. Also added my USITC project to the projects page; I did not have it linked.

2017-09-11:

Met with Dr. Egan and got assigned project. Set up project page USITC. Started coding in Python for the web crawler. Look in McNair/Projects/USITC for project notes and code.

2017-09-07:

Set Up Work Log Pages, Slack, Microsoft Remote Desktop

</onlyinclude>

[[Category:Work Log]]

Hbrown512



Harrison Brown (Work Log) (view source)

Revision as of 16:00, 30 November 2017
