580 bytes added ,  15:34, 27 September 2017
}}
==Files==
This is where the files will go.
The files are in two different places:
E:\McNair\Projects\USITC
The Postgres SQL Server: 128.42.44.182/bulk/USITC

The results .csv file contains the data that I have been able to scrape from the HTML of https://www.usitc.gov/secretary/fed_reg_notices/337.htm
For every notice, there is a line in the CSV file that contains the Investigation Title, Investigation No., link to the PDF on the website, notice description, and the date the notice was issued.
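
As a rough illustration of how the results file could be read back in, here is a minimal sketch. It assumes that the CSV has no header row and that each line has exactly five fields in the order listed above; neither assumption has been checked against the actual file, and the `Notice` type and `read_notices` function are hypothetical names used only for this example.

```python
import csv
import io
from typing import NamedTuple


class Notice(NamedTuple):
    # Field order is an assumption based on the description of the CSV above.
    title: str
    investigation_no: str
    pdf_url: str
    description: str
    date_issued: str


def read_notices(handle):
    """Read notice rows from an open CSV file object.

    Rows that do not have exactly five fields are skipped rather than guessed at.
    """
    notices = []
    for row in csv.reader(handle):
        if len(row) != 5:
            continue  # malformed or partial line
        notices.append(Notice(*(field.strip() for field in row)))
    return notices


if __name__ == "__main__":
    # Made-up sample data, not taken from the real results file.
    sample = io.StringIO(
        "Certain Widgets,337-TA-000,https://example.com/a.pdf,Notice of institution,2017-01-01\n"
    )
    for notice in read_notices(sample):
        print(notice.investigation_no, notice.pdf_url)
```

To read the real file, open it with `open(path, newline="")` and pass the handle to `read_notices`.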
 
 
I have also downloaded the PDFs from the website. They are here:
E:\McNair\Projects\USITC\pdf_copy
 
These files were downloaded using a script on the Postgres server. There are issues downloading PDFs directly onto the remote Windows machine, for reasons that are not yet clear.
You must download the PDFs on the Postgres server and then transfer them to the RDP machine. The script to download the PDFs is:
128.42.44.182/bulk/USITC/download
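
The download script itself is not reproduced on this page. The following is only a sketch of what a download step of this kind might look like in Python, assuming the PDF links have already been collected from the results CSV. The function names `pdf_filename` and `download_pdfs` are hypothetical and are not taken from the script on the server.

```python
import os
import posixpath
import urllib.request
from urllib.parse import urlparse


def pdf_filename(url):
    """Return the last path component of a URL, used here as the local file name."""
    return posixpath.basename(urlparse(url).path)


def download_pdfs(urls, out_dir):
    """Download each PDF URL into out_dir and return the list of local paths.

    Files that already exist are not downloaded again, so the function
    can be re-run after an interrupted download.
    """
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for url in urls:
        name = pdf_filename(url)
        if not name.lower().endswith(".pdf"):
            continue  # not a PDF link
        target = os.path.join(out_dir, name)
        if not os.path.exists(target):
            urllib.request.urlretrieve(url, target)
        saved.append(target)
    return saved
```

Skipping files that already exist keeps a long download restartable, which matters when the files then have to be transferred to another machine.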
 
I am using the PDF scraper from a previous project, found here:
E/McNair/software/utilities/PDF_RIPPER
fix that in the code.
Next steps will be to parse the PDFs. I am currently running a script to convert them to text.
I am also currently running a shell script to download the PDFs, and will update this page when that is completed.
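
For the PDF-to-text conversion step mentioned above, a batch conversion could be sketched as follows. This is not the PDF_RIPPER utility, whose interface is not described here. It assumes instead that the `pdftotext` command-line tool from Poppler is installed and on the PATH. The function names and directory arguments are hypothetical.

```python
import subprocess
from pathlib import Path


def text_path_for(pdf_path, out_dir):
    """Map a PDF path to the .txt path that will hold its extracted text."""
    return Path(out_dir) / (Path(pdf_path).stem + ".txt")


def convert_pdfs(pdf_dir, out_dir):
    """Convert every *.pdf in pdf_dir to text in out_dir.

    Assumes the pdftotext tool (Poppler) is available; PDFs that fail
    to convert are skipped and not included in the returned list.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    converted = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        target = text_path_for(pdf, out)
        result = subprocess.run(
            ["pdftotext", "-layout", str(pdf), str(target)],
            check=False,
        )
        if result.returncode == 0:
            converted.append(target)
    return converted
```

Scanned notices that contain only page images would produce empty text files with this approach and would need OCR instead.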
