Difference between revisions of "USITC"

From edegan.com

Revision as of 14:44, 28 September 2017


McNair Project
USITC
Project Information
Project Title: USITC Data
Owner: Harrison Brown
Start Date: 9/11/2017
Deadline:
Primary Billing:
Notes: In Progress
Has project status: Active
Copyright © 2016 edegan.com. All Rights Reserved.


Files

The files are in 2 different places:

E:\McNair\Projects\USITC

The PostgreSQL server:

128.42.44.182/bulk/USITC

The results.csv file found here,

E:\McNair\Projects\USITC

is a CSV of the data that I have been able to scrape from the HTML of https://www.usitc.gov/secretary/fed_reg_notices/337.htm

For every notice, there is one line in the CSV file that contains the Investigation Title, Investigation No., link to the PDF on the website, Notice description, and the date the notice was issued.
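The row layout described above can be read with a few lines of Python. This is only a minimal sketch: the field names in FIELDS, the Notice class, and the assumption that results.csv has no header row and lists the five fields in this order are all hypothetical, so check them against the real file before relying on it.

```python
import csv
import io
from dataclasses import dataclass

# Hypothetical field names and order; the real results.csv may differ.
FIELDS = ["title", "investigation_no", "pdf_link", "description", "date_issued"]


@dataclass
class Notice:
    title: str
    investigation_no: str
    pdf_link: str
    description: str
    date_issued: str


def read_notices(handle):
    """Return one Notice per CSV line, assuming no header row."""
    reader = csv.DictReader(handle, fieldnames=FIELDS)
    # Missing trailing fields come back as None, so fall back to "".
    return [Notice(**{k: (row[k] or "").strip() for k in FIELDS}) for row in reader]


# Usage with a made-up row (not real data from the project):
sample = io.StringIO(
    "A title,337-TA-000,http://example.com/a.pdf,Some notice,01/01/2017\n"
)
notices = read_notices(sample)
```

For the real file, something like `open(r"E:\McNair\Projects\USITC\results.csv", newline="")` would take the place of the in-memory sample.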


I have also downloaded the PDFs from the website. These are the PDFs linked in the CSV file. Some of the PDFs could not be downloaded. The PDFs are here:

E:\McNair\Projects\USITC\pdf_copy

These files were downloaded using the script on the Postgres server. For some reason, there are issues downloading PDFs onto the remote Windows machine, so you must download the PDFs on the Postgres server and transfer them to the RDP. The script to download the PDFs is here:

128.42.44.182\bulk\USITC\download
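The download script itself is not reproduced on this page (the Status section describes it as a shell script), so the following is not that script. It is a rough Python sketch of the same idea, assuming the input is a list of PDF links taken from the CSV. The function names fetch_bytes and download_pdfs are hypothetical. Failures are collected instead of stopping the run, since some files are known to fail.

```python
import os
import urllib.request
from urllib.parse import urlparse


def fetch_bytes(url, timeout=30):
    """Download a URL and return its body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_pdfs(links, out_dir, fetch=fetch_bytes):
    """Save each linked file into out_dir; return (url, error) pairs for failures."""
    os.makedirs(out_dir, exist_ok=True)
    failures = []
    for url in links:
        name = os.path.basename(urlparse(url).path)
        if not name:
            failures.append((url, "no file name in URL"))
            continue
        target = os.path.join(out_dir, name)
        if os.path.exists(target):
            continue  # already downloaded on an earlier run
        try:
            data = fetch(url)
            with open(target, "wb") as out:
                out.write(data)
        except OSError as exc:  # URLError, HTTPError and timeouts are OSError subclasses
            failures.append((url, str(exc)))
    return failures
```

The returned list can be written to a file so that the PDFs that failed can be retried or checked by hand later.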

I am parsing the PDFs using the PDF scraper from a previous project, found here:

E:\McNair\software\utilities\PDF_RIPPER

An example where the PDF parsing works is this PDF: https://www.usitc.gov/secretary/fed_reg_notices/337/337_959_notice02062017sgl.pdf

The parsed text is here:

E:\McNair\Projects\USITC\Parsed_Texts\337_959_notice02062017sgl.txt

However, there will be PDFs where the parsing does not work completely and the text is scrambled.

Status

Check my work log to see what I have done on a day-to-day basis.

Currently, the web scraper gathers all of the data that is available in the HTML. In a few cases the Investigation Number is not listed; I need to test for those and fix that in the code.
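One way to test for the missing Investigation Number cases is sketched below. It is not the scraper's actual code. The dictionary keys investigation_no and pdf_link are hypothetical names for the scraped fields. The fallback relies on an assumption taken from the single example file name on this page (337_959_notice02062017sgl.pdf): that the digits after "337_" identify the investigation. Treat any value it returns as a guess to verify by hand.

```python
import re

# Assumption: file names look like 337_959_notice02062017sgl.pdf and the
# digits after "337_" identify the investigation. This is unverified.
_NAME_RE = re.compile(r"337_(\d+)_")


def guess_number_from_link(pdf_link):
    """Return the digits after '337_' in the file name, or None if absent."""
    file_name = pdf_link.rsplit("/", 1)[-1]
    match = _NAME_RE.search(file_name)
    return match.group(1) if match else None


def rows_missing_number(rows):
    """Yield (row_index, guess) for rows whose investigation number is blank."""
    for i, row in enumerate(rows):
        if not (row.get("investigation_no") or "").strip():
            yield i, guess_number_from_link(row.get("pdf_link") or "")
```

Listing the affected rows first, before changing the scraper, shows how many cases there are and whether the file name is a usable fallback at all.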

The next step is to parse the PDFs. I am currently running a script to convert them to text.

I ran a shell script to download the PDFs. Most of the PDFs were downloaded, but there were errors downloading some of the files.