55 bytes added ,  15:46, 25 July 2018
===Google Scholar Crawler===
I used the [[Google Scholar Crawler]].
I ran it on the selenium box and switched among the Rice Visitor, Rice Owls, and eduroam networks to prevent Google Scholar from blocking me.
===Downloading PDFs===
I used pdfdownloader.py (see [[PDF Downloader]]).
I tweaked the code to handle repeated file names.
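
A minimal sketch of one way repeated file names could be handled. This is not the actual change made to pdfdownloader.py; the helper name <code>unique_path</code> and the numeric-suffix scheme are assumptions made for illustration.

```python
from pathlib import Path


def unique_path(directory, filename):
    """Return a path in `directory` that does not exist yet.

    Hypothetical helper: if `filename` is already taken, a numeric
    suffix (_1, _2, ...) is added before the extension.
    """
    directory = Path(directory)
    candidate = directory / filename
    stem, suffix = candidate.stem, candidate.suffix
    counter = 1
    while candidate.exists():
        candidate = directory / f"{stem}_{counter}{suffix}"
        counter += 1
    return candidate
```

The download step would then write each PDF to <code>unique_path(output_dir, name)</code> instead of overwriting an earlier file with the same name.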
===pdf_to_txt_bulk_PTLR.py===
See [[PDF to Text Converter]]
 
The code must be run in E because the libraries it uses are not in Z.
I reinstalled pdfminer, which might be a problem in the future if the libraries change.
This program converts all PDFs to txt files. It also generates two log files: _LOG_ERR.txt, which lists the PDFs that could not be converted, and _LOG_RUN.txt, which lists the PDFs that were converted successfully. Some of the successfully converted files, especially the very small ones, do not contain the text of the paper.
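
A rough sketch of the conversion loop described above, which also flags converted files that contain almost no text. It is not the code in pdf_to_txt_bulk_PTLR.py: the function name <code>convert_all</code>, the <code>extract</code> parameter, and the <code>min_chars</code> threshold are assumptions. The extractor is passed in as a function so the sketch does not depend on a particular pdfminer version.

```python
from pathlib import Path


def convert_all(pdf_dir, txt_dir, extract, min_chars=200):
    """Convert every PDF in `pdf_dir` to a txt file in `txt_dir`.

    `extract` is any function that takes a PDF path and returns its text
    (an assumption; the real script calls pdfminer directly).
    `min_chars` is an arbitrary cutoff for flagging near-empty output.
    """
    pdf_dir, txt_dir = Path(pdf_dir), Path(txt_dir)
    txt_dir.mkdir(parents=True, exist_ok=True)
    converted, failed, suspicious = [], [], []

    for pdf in sorted(pdf_dir.glob("*.pdf")):
        try:
            text = extract(str(pdf))
        except Exception:
            # Any extraction error sends the file to the error log.
            failed.append(pdf.name)
            continue
        (txt_dir / (pdf.stem + ".txt")).write_text(text, encoding="utf-8")
        converted.append(pdf.name)
        # A "successful" conversion can still be missing the paper's text.
        if len(text.strip()) < min_chars:
            suspicious.append(pdf.name)

    (txt_dir / "_LOG_RUN.txt").write_text("\n".join(converted), encoding="utf-8")
    (txt_dir / "_LOG_ERR.txt").write_text("\n".join(failed), encoding="utf-8")
    return converted, failed, suspicious
```

The third returned list gives the files worth checking by hand, since they appear in _LOG_RUN.txt but probably do not contain the paper's text.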