Edit Project: Patent Thicket



===Location of Files===
E://McNair/Software/Patent_Thicket

Downloaded PDFs: E://McNair/Projects/Software/Patent_Thicket/AllPDFs/successful_downloads

Converted PDFs to txt files: E://McNair/Projects/Software/Patent_Thicket/Parsed_Texts

===Google Scholar Crawler===
Used [[Google Scholar Crawler]]

I used the selenium box and switched among the Rice Visitor, Rice Owls, and eduroam networks to prevent Google Scholar from blocking me. I collected 613 PDF URLs and 958 BibTeX files from 100 pages of Google Scholar results for the search "patent thicket."

===Downloading PDFs===
Used [[PDF Downloader]]

I tweaked the code to handle repeated file names. 5 of the PDF URLs were not downloadable, so I ended up with 608 working PDFs.

===pdf_to_txt_bulk_PTLR.py===
See [[PDF to Text Converter]]

The code must be run from E because the libraries it uses are not available on Z. I reinstalled pdfminer, which might cause problems in the future if the libraries change.

This program converts all PDFs to txt files. It also generates two log files: _LOG_ERR.txt, which lists the PDFs that could not be converted, and _LOG_RUN.txt, which lists the PDFs that were converted successfully. Some of the files that were successfully converted, especially the very small ones, do not contain the text of the paper.

There were 573 successful txt files and 36 files that failed to convert. That totals 609, one more than the 608 downloaded PDFs, and I'm not sure why.
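One way to look into the count mismatch and the near-empty txt files is to compare the two folders directly. The sketch below is only an illustration, not part of pdf_to_txt_bulk_PTLR.py. It assumes that each txt file has the same base name as its PDF, that the log files hold one file name per line, and that anything under 200 characters counts as "near empty". None of these assumptions is confirmed by the converter itself. The function names `audit_conversion` and `duplicate_lines` are hypothetical.

```python
from collections import Counter
from pathlib import Path


def audit_conversion(pdf_dir, txt_dir, min_chars=200):
    """Compare PDFs against converted txt files by base name.

    Assumption: "paper.pdf" is converted to "paper.txt". Files whose
    names start with "_LOG_" are treated as logs and skipped.
    """
    pdf_stems = {p.stem for p in Path(pdf_dir).glob("*.pdf")}
    txt_files = {
        p.stem: p
        for p in Path(txt_dir).glob("*.txt")
        if not p.name.startswith("_LOG_")
    }

    # PDFs with no matching txt file (candidates for failed conversions).
    missing_txt = sorted(pdf_stems - txt_files.keys())
    # txt files with no matching PDF (could explain an extra count).
    orphan_txt = sorted(txt_files.keys() - pdf_stems)
    # txt files that exist but contain almost no text.
    near_empty_txt = sorted(
        stem
        for stem, path in txt_files.items()
        if len(path.read_text(encoding="utf-8", errors="ignore").strip()) < min_chars
    )
    return {
        "missing_txt": missing_txt,
        "orphan_txt": orphan_txt,
        "near_empty_txt": near_empty_txt,
    }


def duplicate_lines(log_path):
    """Return entries that appear more than once in a log file.

    Assumption: the log holds one file name per line.
    """
    text = Path(log_path).read_text(encoding="utf-8", errors="ignore")
    counts = Counter(line.strip() for line in text.splitlines() if line.strip())
    return sorted(name for name, n in counts.items() if n > 1)
```

Calling `audit_conversion` with the downloaded-PDF folder and the Parsed_Texts folder returns three sorted lists of base names. A non-empty `orphan_txt` list, or a name returned by `duplicate_lines` for either log, would be one possible explanation for the extra file. The `near_empty_txt` list points to conversions worth checking by hand.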
