Difference between revisions of "Harrison Brown (Work Log)"

From edegan.com

Revision as of 16:34, 28 September 2017

Harrison Brown Work Logs (log page)

Fall 2017 Work

09/07/2017 2:20pm-3:50pm - Set up work log pages, Slack, and Microsoft Remote Desktop.

09/11/2017 1pm-5pm - Met with Dr. Egan and was assigned a project. Set up the USITC project page and started coding the web crawler in Python. Look in McNair/Projects/USITC for project notes and code.

09/13/2017 1pm-3:50pm - Worked on parsing the Section 337 notices on the USITC website. Nearly have all of the data I can scrape. The scraper works, but there are a few edge cases where information in the tables is part of a notice but does not have an investigation number. Will hopefully finish this next time. Also added my USITC project to the projects page; it had not been linked before.

09/14/2017 1pm-3:50pm - Have a Python program that can scrape the entire webpage and navigate through all of the pages that contain Section 337 documents. You can see these files and more information on the USITC project page. It can pull all of the information for each case that is available in the HTML. The PDFs now need to be scraped; will start work on that next time. Generated a CSV file with more than 4,000 entries from the webpage. There is a small edge case I need to fix where an entry does not contain the Investigation No.
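The table-scraping step above can be sketched roughly like this. This is a minimal stand-alone sketch, not the actual project code (which lives in McNair/Projects/USITC); the sample HTML and column layout are assumptions, and the real program fetches and pages through the live site.

```python
import csv
import io
from html.parser import HTMLParser

# A tiny stand-in for one page of the USITC notices table; the real code
# fetches and pages through the live site (the layout here is an assumption).
SAMPLE_HTML = """
<table>
  <tr><th>Investigation No.</th><th>Title</th><th>Date</th></tr>
  <tr><td>337-TA-1000</td><td>Certain Widgets</td><td>09/14/2017</td></tr>
  <tr><td></td><td>Notice missing its number</td><td>09/14/2017</td></tr>
</table>
"""

class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, row by row."""
    def __init__(self):
        super().__init__()
        self.rows, self.row, self.in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th"):
            self.in_cell = True
            self.row.append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self.row:
            self.rows.append(self.row)
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.row[-1] += data.strip()

def table_to_csv(html):
    """Parse the table and write it as CSV, flagging rows that hit the
    missing-Investigation-No. edge case instead of dropping them."""
    parser = TableRows()
    parser.feed(html)
    header, *body = parser.rows
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    for row in body:
        if not row[0]:
            row[0] = "UNKNOWN"  # edge case: notice with no Investigation No.
        writer.writerow(row)
    return buf.getvalue()

print(table_to_csv(SAMPLE_HTML))
```

Flagging the missing number with a placeholder rather than dropping the row keeps the CSV complete while making the edge cases easy to find later.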

09/18/2017 1pm-5:00pm - Added features to the Python program to pull the dates in numerical form. Worked on pulling the PDFs from the website, currently in Python. The program runs and pulls PDFs on my local machine, but it doesn't work on the Remote Desktop. I will work on this next time.
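Pulling dates in numerical form can be sketched like this (a minimal sketch; the assumption that the site spells dates out like "September 14, 2017" is mine, not stated in the log):

```python
from datetime import datetime

def to_numeric_date(text_date):
    """Convert a spelled-out date like 'September 14, 2017' to MM/DD/YYYY.
    The input format is an assumption about how the site displays dates."""
    return datetime.strptime(text_date, "%B %d, %Y").strftime("%m/%d/%Y")

print(to_numeric_date("September 14, 2017"))  # 09/14/2017
```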

09/20/2017 1pm- 3:50pm - Got connected to the database server and mounted the drive onto my computer. Got the list of all the PDFs on the website and started a shell script on the database server to download all of them. I will leave it running overnight; hopefully it completes by tomorrow.

09/20/2017 1pm- 2:30pm - Shell program did not work. Created a Python program that can catch all exceptions (URL does not exist, lost connection, and improperly formatted URL). Hopefully it will complete with no problems. This program is found on the database server under the USITC folder.
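The exception handling described above can be sketched like this. A minimal sketch, not the actual script on the database server; the function name and file handling are assumptions, but the three failure modes are the ones the log lists.

```python
import urllib.error
import urllib.request

def download_pdf(url, dest_path):
    """Download one PDF, returning True on success and False on any of the
    failure modes noted in the log: improperly formatted URL, URL that does
    not exist, or a lost connection."""
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            data = resp.read()
    except ValueError:                        # improperly formatted URL
        print(f"bad url: {url}")
        return False
    except urllib.error.HTTPError as e:       # URL does not exist (404, etc.)
        print(f"http {e.code}: {url}")
        return False
    except (urllib.error.URLError, OSError):  # lost connection / timeout
        print(f"connection failed: {url}")
        return False
    with open(dest_path, "wb") as f:
        f.write(data)
    return True
```

Catching each failure separately lets a long overnight run log the bad URLs and keep going instead of dying partway through the list.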

09/25/2017 1pm- 5:00pm - Got 3,000 PDFs downloaded; the script works. Completed a task to get emails for people who had written papers about economics and entrepreneurship. Started work on parsing the PDFs to text.

09/27/2017 1pm- 3:50pm - Got the PDFs parsed to text. Some of the formatting is off; will need to determine whether the data can still be gathered.
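A batch PDF-to-text pass can be sketched like this. The log does not say which converter was used, so the choice of poppler's `pdftotext` command-line tool here is an assumption, as are the directory layout and function name.

```python
import pathlib
import subprocess

def pdfs_to_text(pdf_dir, txt_dir):
    """Convert every PDF under pdf_dir to a .txt file in txt_dir using the
    poppler `pdftotext` tool (tool choice is an assumption). Returns the
    paths of the text files written; failures are reported and skipped."""
    pdf_dir = pathlib.Path(pdf_dir)
    txt_dir = pathlib.Path(txt_dir)
    txt_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        out = txt_dir / (pdf.stem + ".txt")
        try:
            # -layout tries to preserve the page layout, which helps when
            # the notices use tables or columns.
            subprocess.run(["pdftotext", "-layout", str(pdf), str(out)],
                           check=True)
        except (OSError, subprocess.CalledProcessError) as e:
            print(f"failed on {pdf.name}: {e}")
            continue
        written.append(out)
    return written
```

Skipping failed files rather than aborting matters with thousands of downloaded PDFs, since a few are likely to be truncated or malformed.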


09/28/2017 1pm- 3:50pm - Helped Christy with setup on the Postgres server. Looked through the text documents to see what information I could gather. Looked at the Stanford NLTK library for extracting the respondents from the documents.
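Before an NLP approach, a simple pattern-matching pass can pull respondents when the notice text is regular enough. This is a regex heuristic sketch, not the NER approach mentioned above; the sample text and the phrase "names as respondents" are hypothetical, since real Section 337 notice layouts vary and may be garbled by the PDF-to-text step.

```python
import re

# A hypothetical snippet of a parsed notice; real layouts vary and may be
# garbled by the PDF-to-text conversion.
SAMPLE_TEXT = """
The complaint names as respondents Acme Corp. of Dallas, TX;
Widget Industries, Inc. of Tokyo, Japan; and Gadget LLC of Austin, TX.
"""

def extract_respondents(text):
    """Pull respondent names with a simple regex heuristic: grab the text
    after 'names as respondents', split on semicolons, and drop the
    '... of <location>' tail. A stopgap before a proper NER approach."""
    flat = text.replace("\n", " ").strip()
    m = re.search(r"names as respondents\s+(.+?)\.\s*$", flat, re.IGNORECASE)
    if not m:
        return []
    parts = re.split(r";\s*(?:and\s+)?", m.group(1))
    return [re.sub(r"\s+of\s+.*$", "", p).strip() for p in parts if p.strip()]

print(extract_respondents(SAMPLE_TEXT))
```

A heuristic like this gives a quick baseline to compare an NER model against, and its failures show which notice formats actually need the heavier approach.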