Thursday: 2:15-3:45
 
=Code=
12/12/17: [[Scholar Crawler Main Program]]
=Steps=
Incomplete; still struggling to find links.
 
==Keywords List==
 
Find a copy of the Keywords List in the Dropbox: https://www.dropbox.com/s/mw5ep33fv7vz1rp/Keywords%20%3A%20Categories.xlsx?dl=0
=Christy's LOG=
2) Classifying papers based on the matrix of term appearances that the current program builds.
 
 
'''10/02'''
 
The program is finally outputting something useful, YAY. In FindKeyTerms.py (under McNair/Software/Google_Scholar_Crawler) I can input the path of a folder of txt files and it will scan all of them for the key words. It puts a report for every file in a new folder called KeyTerms, which appears in the input folder once the program terminates. An example report will be emailed to Lauren for corrections and adjustments. Each report currently takes all the categories in the codification page and says 1) how many terms in that category appeared and 2) how many times each of those terms appeared. At the bottom, it suggests potential definitions of patent thicket found in the paper, but this part is pretty poor for now and needs adjustment. On the bright side, the program executes absurdly quickly and we can get through hundreds of files in less than a minute. In addition, while the program is running it also writes a bag-of-words vector for each file into a folder called WordBags in the input folder, for future neural net use to classify the papers. We need a training dataset that is relatively large.
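
For reference, a minimal sketch of the folder-scanning idea (the CATEGORIES dictionary below is a placeholder, not the real keyword list, and FindKeyTerms.py itself may be organized differently):

<pre>
import os
from collections import Counter

# Placeholder keyword list grouped by category; the real terms come from
# the codification page / Keywords List spreadsheet.
CATEGORIES = {
    "thickets": ["patent thicket", "royalty stacking"],
    "litigation": ["infringement", "injunction"],
}

def scan_folder(input_folder):
    """Scan every .txt file in input_folder, write a per-file keyword report
    to a KeyTerms subfolder and a raw word-count file to a WordBags subfolder."""
    key_dir = os.path.join(input_folder, "KeyTerms")
    bag_dir = os.path.join(input_folder, "WordBags")
    os.makedirs(key_dir, exist_ok=True)
    os.makedirs(bag_dir, exist_ok=True)

    for name in os.listdir(input_folder):
        if not name.endswith(".txt"):
            continue
        with open(os.path.join(input_folder, name), errors="ignore") as f:
            text = f.read().lower()

        # Per-category report: how many distinct terms appeared, and how often each did.
        with open(os.path.join(key_dir, name), "w") as report:
            for category, terms in CATEGORIES.items():
                counts = {t: text.count(t) for t in terms if t in text}
                report.write("%s: %d terms matched\n" % (category, len(counts)))
                for term, n in counts.items():
                    report.write("  %s: %d\n" % (term, n))

        # Bag-of-words vector for later classification work.
        bag = Counter(text.split())
        with open(os.path.join(bag_dir, name), "w") as out:
            for word, n in bag.most_common():
                out.write("%s %d\n" % (word, n))
</pre>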
 
Stuff to work on:
 
1) Neural net classification (computer suggesting which kind of paper it is)
 
2) Improving patent thicket definition finding (a rough sketch of one approach is below this list)
 
3) Finding the authors and including them as a feature in the vectors
 
4) Potentially going back to the Google Scholar problem to try to find the PDFs automatically.
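
For item 2, one rough idea (not what the current code does; the cue phrases are guesses) is to pull out sentences that mention the phrase alongside definition-style wording:

<pre>
import re

def candidate_definitions(text, phrase="patent thicket"):
    """Return sentences that mention `phrase` together with wording that
    often signals a definition ("defined as", "refers to", ...)."""
    cues = ("defined as", "refers to", "we define", "is a", "are a")
    # Naive sentence split; good enough for a first pass over plain text.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s.strip() for s in sentences
            if phrase in s.lower() and any(c in s.lower() for c in cues)]
</pre>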
 
 
'''10/10'''
 
Found a way to get past Google Scholar blocking my crawling, so I spent the day writing Selenium code. The crawler can now automatically download the BibTeX entries for the ten search results on a page for a given search term, which is awesome. I am part of the way through having it also save the PDF link once it has saved the BibTeX for each result. Yay Selenium :')))
 
Code located at E:/McNair/Software/Google_Scholar_Crawler/downloadPDFs.py
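
Roughly, the Selenium part looks like the sketch below; the Scholar URL format and CSS selectors are assumptions, and downloadPDFs.py may do this differently:

<pre>
import time
from urllib.parse import quote_plus
from selenium import webdriver
from selenium.webdriver.common.by import By

def collect_pdf_links(query, pages=1, pause=20):
    """Walk through Google Scholar result pages for `query` and collect the
    direct PDF links shown next to each result. Selectors are assumptions."""
    driver = webdriver.Firefox()
    links = []
    try:
        for page in range(pages):
            driver.get("https://scholar.google.com/scholar?q=%s&start=%d"
                       % (quote_plus(query), page * 10))
            # The [PDF] box next to a result, if the paper has one.
            for a in driver.find_elements(By.CSS_SELECTOR, ".gs_or_ggsm a"):
                links.append(a.get_attribute("href"))
            time.sleep(pause)  # go slowly so Scholar does not flag us as a bot
    finally:
        driver.quit()
    return links
</pre>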
 
'''11/02'''
 
Things are good! Today I extended the program so that we can request however many pages of search results we want and collect the PDF links for every result that has one. Towards the end of the day, Google Scholar picked up that we were a robot and started blocking me. Hopefully the block goes away by the time I am back on Monday. Now working on parsing the saved txt file of links so we can visit each site and download the PDFs. That should not be particularly difficult.
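
Once the links are saved, the download step could look something like this (assuming the txt file has one URL per line; the real script may differ):

<pre>
import os
import requests

def download_pdfs(link_file, out_folder):
    """Read one URL per line from link_file and save anything that comes
    back as a PDF into out_folder."""
    os.makedirs(out_folder, exist_ok=True)
    with open(link_file) as f:
        urls = [line.strip() for line in f if line.strip()]
    for i, url in enumerate(urls):
        try:
            r = requests.get(url, timeout=30)
        except requests.RequestException:
            continue  # skip dead links and keep going
        if "pdf" in r.headers.get("Content-Type", "").lower():
            with open(os.path.join(out_folder, "paper_%03d.pdf" % i), "wb") as out:
                out.write(r.content)
</pre>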
 
'''11/28'''
 
Basically everything is ready to go, so long as Google Scholar leaves me alone. We currently have a program which takes in a search term and the number of pages you want to search. The crawler pulls as many PDF links from those pages as possible (it goes slowly to avoid getting blocked). Next, it downloads all the PDFs discovered by the crawler (and can also save the links for journals whose PDFs were not linked on Scholar). It then converts all the PDFs to text. Finally, it searches through each paper for the list of key terms and for any definitions of patent thickets. I will be writing documentation for these pieces of code today.
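
The PDF-to-text step can be handled by several tools; below is a minimal sketch using the pdfminer.six library (an assumption, the actual pipeline may call a different converter):

<pre>
import os
from pdfminer.high_level import extract_text  # pdfminer.six

def pdfs_to_text(pdf_folder, txt_folder):
    """Convert every PDF in pdf_folder into a plain-text file in txt_folder
    so the keyword scanner can read it."""
    os.makedirs(txt_folder, exist_ok=True)
    for name in os.listdir(pdf_folder):
        if not name.lower().endswith(".pdf"):
            continue
        text = extract_text(os.path.join(pdf_folder, name))
        out_name = os.path.splitext(name)[0] + ".txt"
        with open(os.path.join(txt_folder, out_name), "w", errors="ignore") as out:
            out.write(text)
</pre>
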
=Lauren's LOG=