Difference between revisions of "PTLR Webcrawler"
[[PTLR Codification]]
Latest revision as of 15:00, 12 December 2017
Christy
Monday: 3-5
Tuesday: 9-10:30, 4-5:45
Thursday: 2:15-3:45
Code
12/12/17: Scholar Crawler Main Program
Steps
Search on Google
Complete. Run the query from the command line to get results.
Download BibTeX
Complete
Download PDFs
Incomplete, struggling to find links.
Keywords List
Find a copy of the Keywords List in the Dropbox: https://www.dropbox.com/s/mw5ep33fv7vz1rp/Keywords%20%3A%20Categories.xlsx?dl=0
Christy's LOG
09/27
Created file FindKeyTerms.py in Software/Google_Scholar_Crawler which takes in a text file and returns counts of the key terms from the codification page. Already included SIS, DHCI and OP terms and working on adding the others.
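The counting step described above can be sketched roughly as follows. This is not the contents of FindKeyTerms.py; the term list, the function names and the whole-word matching rule are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical sample terms; the real list comes from the PTLR Codification page.
KEY_TERMS = ["patent thicket", "patent pool", "cross-licensing"]


def count_key_terms(text, terms=KEY_TERMS):
    """Return a Counter mapping each term to its case-insensitive count in text."""
    lowered = text.lower()
    counts = Counter()
    for term in terms:
        # \b anchors make this a whole-word match, so plural forms such as
        # "patent pools" are NOT counted unless they are listed as terms too.
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        counts[term] = len(re.findall(pattern, lowered))
    return counts


def count_key_terms_in_file(path, terms=KEY_TERMS):
    """Read a converted .txt paper and count the key terms in it."""
    with open(path, encoding="utf-8", errors="ignore") as handle:
        return count_key_terms(handle.read(), terms)
```

Whole-word matching is a deliberate simplification here; a real term list would probably need plural and hyphenation variants listed explicitly.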
09/28
Thought that the PDF-to-text converter wasn't working, but realized that it does work, just sloooowly (70 papers converted overnight). Should be fine since we are still developing the rest of the code and we only need to convert them to txt once.
Continued to load PTLR codification terms into the word-finding code and got most of the way through (there are so many ahhh but I'm learning ways to do this more quickly). Once they're all loaded up, I will create some example files of the kind of output this program will produce for Lauren to review, and start:
1) Seeking definitions of patent thicket (I think I'll start by pulling any sentence that patent thicket occurs in as well as the sentence before and after).
2) Classifying papers based on the matrix of term appearances that the current program builds.
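Item 1 above (pulling each sentence that mentions patent thicket, plus the sentence before and after) might look something like this. The sentence splitter is deliberately naive and the function names are made up for this sketch; abbreviations such as "et al." will split in the wrong place.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def definition_candidates(text, phrase="patent thicket"):
    """Return each sentence containing phrase, joined with its two neighbours."""
    sentences = split_sentences(text)
    candidates = []
    for i, sentence in enumerate(sentences):
        if phrase in sentence.lower():
            start = max(0, i - 1)             # sentence before, if there is one
            end = min(len(sentences), i + 2)  # sentence after, if there is one
            candidates.append(" ".join(sentences[start:end]))
    return candidates
```

If the phrase appears in consecutive sentences, the returned windows overlap; merging them would be a reasonable next refinement.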
10/02
Program is finally outputting something useful, YAY. In FindKeyTerms.py (under McNair/Software/Google_Scholar_Crawler) I can input the path of a folder of txt files, and it will scan all of them for the key words. It puts a report for every file in a new folder called KeyTerms, which appears in the input folder once the program terminates. An example file will be emailed to Lauren for corrections and adjustment. For each category on the codification page, the report currently says 1) how many terms in that category appeared and 2) how many times each of those terms appeared. At the bottom, it suggests potential definitions of patent thicket in the paper, but this part is pretty poor for now and needs adjustment. On the bright side, the program executes absurdly quickly, and we can get through hundreds of files in less than a minute. In addition, while the program is running I am outputting a bag-of-words vector into a folder called WordBags in the input folder, for future neural net usage to classify the papers. That will need a relatively large training dataset.
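A rough sketch of the folder workflow described above, for reference. Only the folder names KeyTerms and WordBags come from the log; the category contents, the report layout, the JSON format for the word bags and the one-bag-per-file choice are assumptions, not the actual FindKeyTerms.py behaviour.

```python
import json
import re
from collections import Counter
from pathlib import Path

# Category labels are borrowed from the log; the terms are placeholders.
CATEGORIES = {
    "SIS": ["patent thicket", "royalty stacking"],
    "OP": ["patent pool"],
}


def count_term(lowered_text, term):
    """Whole-word count of term in already-lowercased text."""
    return len(re.findall(r"\b" + re.escape(term) + r"\b", lowered_text))


def build_report(text):
    """Per category: how many terms appeared, then the count for each term."""
    lowered = text.lower()
    lines = []
    for category, terms in CATEGORIES.items():
        counts = {term: count_term(lowered, term) for term in terms}
        present = sum(1 for c in counts.values() if c > 0)
        lines.append(f"{category}: {present} of {len(terms)} terms appeared")
        lines.extend(f"  {term}: {c}" for term, c in counts.items())
    return "\n".join(lines)


def bag_of_words(text):
    """Map each lowercase alphabetic word to its frequency."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def process_folder(folder):
    """Write one report and one word bag per .txt file in folder."""
    folder = Path(folder)
    reports = folder / "KeyTerms"
    bags = folder / "WordBags"
    reports.mkdir(exist_ok=True)
    bags.mkdir(exist_ok=True)
    # glob("*.txt") is not recursive, so the output folders are never rescanned.
    for txt in sorted(folder.glob("*.txt")):
        text = txt.read_text(encoding="utf-8", errors="ignore")
        (reports / txt.name).write_text(build_report(text), encoding="utf-8")
        (bags / (txt.stem + ".json")).write_text(
            json.dumps(bag_of_words(text)), encoding="utf-8"
        )
```

Storing the bags as JSON keeps them readable and easy to load later, at the cost of needing a shared vocabulary step before they can feed a classifier.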
Stuff to work on:
1) Neural net classification (computer suggesting which kind of paper it is)
2) Improving patent thicket definition finding
3) Finding the authors and having this as a contributing factor of the vectors
4) Potentially going back to the Google Scholar problem to try to find the PDFs automatically.
10/10
Found a way to get past Google Scholar blocking my crawling, so spent time writing Selenium code. I can now automatically download the BibTeX entries for the 10 search results returned for a given search term, which is awesome. I am part of the way through having the crawler save the PDF link once it has saved the BibTeX for the search results. Yay Selenium :')))
Code located at E:/McNair/Software/Google_Scholar_Crawler/downloadPDFs.py
11/02
Things are good! Today I extended the program so that we can get however many pages of search results we want and get the PDF links for all the ones we can see. Towards the end of the day, Google Scholar picked up that we were a robot and started blocking me. Hopefully the block goes away when I am back on Monday. Now working on parsing the txt file so we can go to the websites we saved and download the PDFs. Should not be particularly difficult.
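The paging and link-file steps above could be sketched with the standard library alone. This is not downloadPDFs.py: it assumes 10 results per page, a links file with one URL per line, and made-up output file names, and the delay is simply there to avoid hammering the sites that host the PDFs.

```python
import time
import urllib.request
from pathlib import Path
from urllib.parse import urlencode, urlparse

RESULTS_PER_PAGE = 10  # assumption, based on the 10 results per search noted earlier


def result_page_urls(query, pages):
    """Build one search URL per results page using the q and start parameters."""
    base = "https://scholar.google.com/scholar"
    return [
        base + "?" + urlencode({"q": query, "start": page * RESULTS_PER_PAGE})
        for page in range(pages)
    ]


def read_pdf_links(lines):
    """Keep unique http(s) URLs from the saved lines, preserving their order."""
    seen = set()
    links = []
    for line in lines:
        url = line.strip()
        if urlparse(url).scheme in ("http", "https") and url not in seen:
            seen.add(url)
            links.append(url)
    return links


def download_pdfs(links, out_dir, delay_seconds=5.0):
    """Fetch each link into out_dir, pausing between requests."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for index, url in enumerate(links):
        target = out / f"paper_{index:03d}.pdf"  # hypothetical naming scheme
        with urllib.request.urlopen(url) as response:
            target.write_bytes(response.read())
        time.sleep(delay_seconds)
```

A more careful version would catch per-link download errors so that one dead link does not stop the whole batch.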
11/28
Basically everything is ready to go, so long as Google Scholar leaves me alone. We currently have a program which will take in a search term and number of pages you want to search. The crawler will pull as many PDFs from this many pages as possible (it'll go slowly to avoid getting caught). Next, it will download all the PDFs discovered by the crawler (also possibly save the links for journals whose PDFs were not linked on scholar). It will then convert all the PDFs to text. Finally, it will search through the paper for a list of terms and for any definitions of patent thickets. I will be making documentation for these pieces of code today.
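For the documentation, the command-line interface described above (a search term plus a number of pages) might be sketched like this. The argument names and the stage descriptions are illustrative assumptions; the real program's options may differ.

```python
import argparse


def parse_crawler_args(argv=None):
    """Parse a search term and a page count from the command line."""
    parser = argparse.ArgumentParser(
        description="Crawl search results, download PDFs, convert and scan them."
    )
    parser.add_argument("search_term", help="query to search for")
    parser.add_argument("pages", type=int, help="number of result pages to crawl")
    args = parser.parse_args(argv)
    if args.pages < 1:
        parser.error("pages must be at least 1")
    return args


def planned_stages(args):
    """List the pipeline stages in the order the log describes them."""
    return [
        f"crawl {args.pages} page(s) of results for '{args.search_term}'",
        "download the PDFs that were found",
        "convert the PDFs to text",
        "search the text for key terms and patent thicket definitions",
    ]
```

Calling `parse_crawler_args(["patent thicket", "3"])` in a test or an interactive session avoids touching the real command line.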
Lauren's LOG
09/27
Took a random sample from "Candidate Papers by LB" and am reading each paper, extracting the definitions, and coding the definitions by hand. This is expected to be a control group against which the computer-coded papers will be tested for accuracy in the future. The random sample contains the following publications:
Entezarkheir (2016) - Patent Ownership Fragmentation and Market Value An Empirical Analysis.pdf
Herrera (2014) - Not Purely Wasteful Exploring a Potential Benefit to Weak Patents.pdf
Kumari et al. (2017) - Managing Intellectual Property in Collaborative Way to Meet the Agricultural Challenges in India.pdf
Pauly (2015) - The Role of Intellectual Property in Collaborative Research Crossing the 'Valley of Death' by Turning Discovery into Health.pdf
Lampe Moser (2013) - Patent Pools and Innovation in Substitute Technologies - Evidence From the 19th-Century Sewing Machine Industry.pdf
Phuc (2014) - Firm's Strategic Responses in Standardization.pdf
Reisinger Tarantino (2016) - Patent Pools in Vertically Related Markets.pdf
Miller Tabarrok (2014) - Ill-Conceived, Even If Competently Administered - Software Patents, Litigation, and Innovation--A Comment on Graham and Vishnubhakat.pdf
Llanes Poblete (2014) - Ex Ante Agreements in Standard Setting and Patent-Pool Formation.pdf
Utku (2014) The Near Certainty of Patent Assertion Entity Victory in Portfolio Patent Litigation.pdf
Trappey et al. (2016) - Computer Supported Comparative Analysis of Technology Portfolio for LTE-A Patent Pools.pdf
Delcamp Leiponen (2015) - Patent Acquisition Services - A Market Solution to a Legal Problem or Nuclear Warfare.pdf
Allison Lemley Schwartz (2015) - Our Divided Patent System.pdf
Cremers Schliessler (2014) - Patent Litigation Settlement in Germany - Why Parties Settle During Trial.pdf
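Once computer-coded results exist for the sample above, the accuracy check against the hand-coded control group could be as simple as the sketch below. The dictionary layout (paper name mapped to a single code) and the function name are assumptions; the actual coding scheme may assign several codes per paper.

```python
def coding_accuracy(hand_coded, computer_coded):
    """Compare codes for the papers present in both mappings.

    Returns (share of papers with matching codes, sorted list of disagreements).
    """
    shared = sorted(set(hand_coded) & set(computer_coded))
    if not shared:
        raise ValueError("no papers in common between the two codings")
    disagreements = [p for p in shared if hand_coded[p] != computer_coded[p]]
    accuracy = (len(shared) - len(disagreements)) / len(shared)
    return accuracy, disagreements
```

Returning the list of disagreements alongside the score makes it easy to reread exactly the papers where the program and the hand coding differ.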
09/28
I added a section to the PTLR Codification page titled "Individual Terms." Ed would like to have all downloaded papers searched for these terms and to record how frequently each one appears.