Changes

Harrison Brown (Work Log)

Revision as of 15:28, 9 November 2017

249 bytes removed, 15:28, 9 November 2017

no edit summary

[[Harrison Brown]] [[Work Logs]] [[Harrison Brown (Work Log)|(log page)]]

2017-11-01:
*Looked at Oliver's code. Got git repository set up for the project on Bonobo. Started messing around with reading the XML documents in Java.

2017-10-30:
*Worked on seeing what data can be gathered from the CSV and XML files. Started project page for project.

2017-10-26:
*Met with Ed to talk about the direction of the project. Starting to work on extracting information from the XML files. Working on adding documentation to wiki and work log. Looking into work from other projects that may use XML.

2017-10-25:
*Found information about a USITC database that we could use. Added this information to the wiki, and updated information on the USITC wiki page.

2017-10-19:
*Continued to look into NLTK. Talked with Ed about looking into alternative approaches to gathering this data.

2017-10-18:
*Trying to figure out the best way to extract respondents from the documents. Right now, using NLTK exclusively will not get us any more accuracy than using regular expressions. Currently neither will allow us to match every entity correctly, so trying to figure out alternate approaches.

2017-10-16:
*NLTK
** NLTK Information
*** Need to convert text to ASCII. Had issues with my PDF texts and had to convert.
*** Can use sent_tokenize() function to split document into sentences, easier than regular expressions
*** Use pos_tag() to tag the sentences. This can be used to extract proper nouns
**** Trying to figure out how to use this to grab location data from these documents
*** Worked with Peter to try to extract geographic information from the documents. We looked into tools Geograpy and GeoText. Geograpy does not have the functionality that we would like. GeoText looks to be better but we have issues with dependencies. Will try to resolve these next time.
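The ASCII conversion mentioned above can be done with the standard library before the text is handed to NLTK. This is a minimal sketch, assuming the problem characters are accents, ligatures, and curly quotes left over from PDF extraction; `to_ascii` is a hypothetical helper name, not code from the project.

```python
import unicodedata


def to_ascii(text):
    """Return an ASCII-only version of text (hypothetical helper)."""
    # NFKD splits accented letters into base letter + combining mark and
    # expands compatibility characters such as the "fi" ligature.
    normalized = unicodedata.normalize("NFKD", text)
    # Anything that still has no ASCII equivalent (for example curly
    # quotes) is silently dropped by errors="ignore".
    return normalized.encode("ascii", "ignore").decode("ascii")
```

Dropping characters is lossy, so it may be worth keeping the original text file alongside the converted one.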

2017-10-11:
*Started to use NLTK library for gathering information to extract respondents. See code in Projects/USITC/ProcessingTexts

2017-10-05:
*Made photos for the requested maps in ArcGIS with Peter and Jeemin.

To access:

Go to E:\McNair\Projects\Agglomeration\HarrisonPeterWorkArcGIS

To generate a PNG, click File, then Export to export the photos

To adjust the data, right-click the table name in the layers tab, then choose Properties, then Query Builder


2017-10-04:
*Worked with Peter on connecting ArcGIS to the database and displaying different points in ArcGIS

2017-10-02:
*Started work with ArcGIS. Got the data with startups from Houston into the ArcGIS application. For notes see McNair/Projects/Agglomeration

2017-09-28:
*Helped Christy with set up on Postgres server. Looked through text documents to see what information I could gather. Looked at Stanford NLTK library for extracting the respondents from the documents.

2017-09-28:
*Got the PDFs parsed to text. Some of the formatting is off; will need to determine if data can still be gathered.

2017-09-25:
*Got 3000 PDFs downloaded. Script works. Completed a task to get emails for people who had written papers about economics and entrepreneurship. Started work on parsing the PDFs to text.

2017-09-20:
*Shell program did not work. Created a Python program that can catch all exceptions (URL does not exist, lost connection, and improperly formatted URL). Hopefully it will complete with no problems. This program is found on the database server under the USITC folder.
*Got connected to the database server and mounted the drive onto my computer. Got the list of all the PDFs on the website and started a shell script on the database server to download all of the PDFs. I will leave it running overnight; hopefully it completes by tomorrow.
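The exception handling described in the entry above (nonexistent URL, lost connection, improperly formatted URL) might look roughly like the sketch below. It is not the program stored on the database server; `download_pdf`, its status strings, and the `opener` parameter are assumptions made for illustration, using only the standard library.

```python
import urllib.error
import urllib.request


def download_pdf(url, dest_path, opener=urllib.request.urlopen):
    """Try to save one PDF; return a status string instead of raising."""
    try:
        with opener(url, timeout=30) as response:
            data = response.read()
    except ValueError:
        # urllib raises ValueError for URLs it cannot parse (e.g. no scheme).
        return "bad-url"
    except urllib.error.HTTPError as err:
        # The server answered, but with an error such as 404 (no such page).
        return f"http-{err.code}"
    except OSError:
        # URLError, timeouts, and dropped connections are OSError subclasses.
        return "connection-failed"
    with open(dest_path, "wb") as out:
        out.write(data)
    return "ok"
```

Returning a status per URL lets a long overnight run continue past bad links, and the statuses can be logged to see which PDFs need a retry.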

2017-09-17:
*Added features to python program to pull the dates in numerical form. Worked on pulling the PDFs from the website. Currently working on pulling them in Python. The program can run and pull PDFs on my local machine but it doesn't work on the Remote Desktop. I will work on this next time.
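Pulling dates in numerical form, as mentioned above, can be sketched with `datetime.strptime`. The two input formats below are assumptions, since the log does not say how the website writes its dates, and `to_numeric_date` is a hypothetical name.

```python
from datetime import datetime

# Assumed input formats; extend this tuple if the site uses others.
KNOWN_FORMATS = ("%m/%d/%Y", "%B %d, %Y")


def to_numeric_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            # This format did not match; try the next one.
            continue
    return None
```

Returning `None` for unrecognized strings keeps unexpected date formats visible in the output instead of stopping the scrape.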

2017-09-14:
*Have a python program that can scrape the entire webpage and navigate through all of the pages that contain section 337 documents. You can see these files and more information on the USITC project page. It can pull all of the information that is in the HTML that can be gathered for each case. The PDFs now need to be scraped; will start work on that next time. Generated a csv file with more than 4000 entries from the webpage. There is a small edge case I need to fix where the entry does not contain the Investigation No.

2017-09-13:
*Worked on parsing the USITC website Section 337 Notices. Nearly have all of the data I can scrape. Scraper works, but there are a few edge cases where information in the tables is part of a Notice but does not have Investigation Numbers. Will finish this hopefully next time. Also added my USITC project to the projects page; I did not have it linked.

2017-09-11:
*Met with Dr. Egan and got assigned project. Set up project page USITC, started coding in Python for the web crawler. Look in McNair/Projects/USITC for project notes and code.

2017-09-07:
*Set up Work Log pages, Slack, Microsoft Remote Desktop
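For the CSV edge case noted above (entries without an Investigation No.), one option is to let the writer fill in a blank instead of failing on the missing field. This is only a sketch: the column names and `write_notices` are hypothetical and are not taken from the scraper.

```python
import csv

# Hypothetical column names; the real scraper's columns are not listed here.
FIELDS = ["investigation_no", "title", "date"]


def write_notices(rows, out):
    """Write notice dicts as CSV, leaving missing fields blank."""
    # restval is the value DictWriter uses for keys absent from a row.
    writer = csv.DictWriter(out, fieldnames=FIELDS, restval="")
    writer.writeheader()
    writer.writerows(rows)
```

Rows that come out with an empty first column can then be filtered later to review the notices that lacked an Investigation No.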

</onlyinclude>

[[Category:Work Log]]

Hbrown512

