Latest revision as of 13:47, 21 September 2020


Project: Google Scholar Crawler

Project Information
Has title: Google Scholar Crawler
Has owner: Christy Warden
Has start date: November 10, 2017
Has keywords: Google, Scholar, Tool
Has project status: Active
Dependent(s): Patent Thicket
Has sponsor: McNair Center
Has project output: Tool
Copyright © 2019 edegan.com. All Rights Reserved.

Overview

Google does not provide an official API for Google Scholar. This page documents our investigation into alternative methods for parsing and crawling data from Google Scholar.

Existing Libraries

A couple of Python parsers for Google Scholar exist, but none of them satisfies everything we need from this crawler.

Scholar.py

The scholar.py script is the most extensive command line tool for parsing Google Scholar information. Given a search query, it returns results such as title, URL, year, number of citations, Cluster ID, Citations list, Version list, and an excerpt.

For example, once scholar.py is downloaded and all necessary components are installed, the following command:

python scholar.py -c 3 --phrase "innovation" 

produces the following results:

Title Mastering the dynamics of innovation: how companies can seize opportunities in the face of technological change
          URL http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1496719
         Year 1994
    Citations 5107
     Versions 5
   Cluster ID 6139131108983230018
Citations list http://scholar.google.com/scholar?cites=6139131108983230018&as_sdt=2005&sciodt=0,5&hl=en
Versions list http://scholar.google.com/scholar?cluster=6139131108983230018&hl=en&as_sdt=0,5
      Excerpt Abstract: Explores how innovation transforms industries, suggesting a strategic model to help firms to adjust to ever-shifting market dynamics. Understanding and adapting to innovation--  
'at once the creator and destroyer of industries and corporations'--is essential  ...
        Title National innovation systems: a comparative analysis
          URL http://books.google.com/books?hl=en&lr=&id=YFDGjgxc2CYC&oi=fnd&pg=PR7&dq=%22innovation%22&ots=Opaxro2BTV&sig=9-svcPMAzs8nHezDp94Z-HATdRk
         Year 1993
    Citations 8590
     Versions 6
   Cluster ID 13756840170990063961
Citations list http://scholar.google.com/scholar?cites=13756840170990063961&as_sdt=2005&sciodt=0,5&hl=en
Versions list http://scholar.google.com/scholar?cluster=13756840170990063961&hl=en&as_sdt=0,5
      Excerpt The slowdown of growth in Western industrialized nations in the last twenty years, along with the rise of Japan as a major economic and technological power (and enhanced technical   
sophistication of Taiwan, Korea, and other NICs) has led to what the authors believe to be  ...
        Title Profiting from technological innovation: Implications for integration, collaboration, licensing and public policy
          URL http://www.sciencedirect.com/science/article/pii/0048733386900272
         Year 1986
    Citations 10397
     Versions 38
   Cluster ID 14785720633759689821
Citations list http://scholar.google.com/scholar?cites=14785720633759689821&as_sdt=2005&sciodt=0,5&hl=en
Versions list http://scholar.google.com/scholar?cluster=14785720633759689821&hl=en&as_sdt=0,5
      Excerpt Abstract This paper attempts to explain why innovating firms often fail to obtain significant economic returns from an innovation, while customers, imitators and other industry participants  
benefit Business strategy—particularly as it relates to the firm's decision to  ...
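Note that the Citations list and Versions list URLs depend only on the Cluster ID, so they can be rebuilt directly from it. A minimal sketch, with the query parameters copied from the example output above:

```python
def citations_url(cluster_id):
    """Link to the Google Scholar page listing papers that cite this cluster."""
    return ("http://scholar.google.com/scholar?cites=%s"
            "&as_sdt=2005&sciodt=0,5&hl=en" % cluster_id)

def versions_url(cluster_id):
    """Link to the Google Scholar page listing all versions of this cluster."""
    return ("http://scholar.google.com/scholar?cluster=%s"
            "&hl=en&as_sdt=0,5" % cluster_id)
```

For instance, `citations_url("6139131108983230018")` reproduces the Citations list link shown in the first result above.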


Scholarly

Another parser of potential interest is scholarly. However, it produces less information than the scholar parser does.


Code Written for McNair

downloadPDFs.py

Overview

downloadPDFs.py is currently being replaced by scholarcrawl.py, located in the same directory. This code exists in E:\McNair\Software\Google_Scholar_Crawler\downloadPDFs.py.

This program takes in a key term to search for and a number of pages to search. It collects information about the papers returned by the search. It depends on Selenium because Google Scholar blocks traditional crawling, and it runs somewhat slowly to avoid being blocked by the website.
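The pacing idea can be sketched independently of Selenium. In this sketch, `fetch_page` is a hypothetical stand-in for the Selenium page load (not the actual downloadPDFs.py internals), and the delay bounds are assumptions rather than known-safe values:

```python
import random
import time

def crawl_pages(query, num_pages, fetch_page, sleep=time.sleep):
    """Visit result pages one at a time, pausing a random interval between
    requests to stay below Google's (unknown) rate limits.

    fetch_page(query, page) stands in for the Selenium page load and should
    return the parsed results from that page.
    """
    results = []
    for page in range(num_pages):
        results.extend(fetch_page(query, page))
        # Random, deliberately generous pause between page loads.
        sleep(random.uniform(10, 30))
    return results
```

Injecting `sleep` makes the pacing logic testable without actually waiting.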

How to Use

Before you run the program, build a directory that you want all the results to go in. Inside this directory, create a folder called "BibTeX." For example, I could make a folder in E:\McNair\Projects\Patent_Thickets called "My_Crawl"; inside My_Crawl I would make a "BibTeX" folder. You should also choose a search term and decide how many pages you want to search.

Open the program downloadPDFs.py in Komodo. At the very end of the program, type:

main(your query, your output directory, your num pages)

Replace "your query" with the search term you want (like "patent thickets", making sure to include the quotes around the term). Replace "your output directory" with the directory the output files should go to; using my example above, I would type "E:\McNair\Projects\Patent_Thickets\My_Crawl", again including the quotes. Finally, replace "your num pages" with the number of pages you want to search. Then click the play button in the top center of the screen.

What you'll get back

After the program is done running, go back to the folder you created to see the outputs. First, in your BibTeX folder, you will see a series of files named by the BibTeX keys of papers; each is a text file containing the BibTeX for that paper. In your outer folder, you will have files named "Query_your query_pdfTable7.txt", where "your query" is your search term and the trailing number varies. Each of these files is a text file with BibTeX keys in the left column and a link to the paper's PDF in the right column.
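A minimal parser for the pdfTable files just described, assuming whitespace separates the two columns (the exact separator is an assumption from the description):

```python
def parse_pdf_table(text):
    """Parse a pdfTable file: each non-empty line holds a BibTeX key in the
    left column and a PDF link in the right column. Returns {key: link}."""
    table = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            # First field is the BibTeX key, last field is the PDF link.
            table[parts[0]] = parts[-1]
    return table
```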

In Progress

1) Trying to find the sweet spot where we move as fast as possible without being detected by Google.

2) Trying to make it so that if a link to the PDF cannot be found directly on Google, the link to the journal will be saved so that someone can go look it up and try to download it later.

Notes

BibTeX entries will be saved for all papers, but not all PDFs are available online, so not every paper viewed will have a link.


scholarcrawl.py

Overview

This code is the work-in-progress replacement for downloadPDFs.py. The problem with downloadPDFs.py was that it's impossible to find the sweet spot for avoiding detection by Google, since there is no public information about how many clicks, or how fast, gets you marked as a robot. scholarcrawl.py works around the issue by catching each time Google stops us, waiting 24 hours, and then trying again from the page we were stopped on. It has been under test since Friday, December 8, 2017; as of December 12, 2017 it is still running as expected and has searched through 34 pages.
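The catch-and-wait strategy can be sketched as follows. Here `fetch_page` and `BlockedError` are hypothetical names, not the actual scholarcrawl.py internals; the point is that the page counter only advances on success, so after the wait the crawl resumes on the page where it was stopped:

```python
import time

class BlockedError(Exception):
    """Raised when Google serves a robot check instead of results."""

def crawl_with_resume(query, num_pages, fetch_page,
                      wait_seconds=24 * 60 * 60, sleep=time.sleep):
    """Crawl result pages; whenever Google blocks us, wait (24 hours by
    default) and retry the same page rather than starting over."""
    results = []
    page = 0
    while page < num_pages:
        try:
            results.extend(fetch_page(query, page))
            page += 1  # advance only after a successful fetch
        except BlockedError:
            sleep(wait_seconds)  # back off, then resume on the same page
    return results
```

Injecting `sleep` keeps the resume logic testable without a real 24-hour wait.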

How to Use

Before you run the program, build a directory that you want all the results to go in. Inside this directory, create a folder called "BibTeX." For example, I could make a folder in E:\McNair\Projects\Patent_Thickets called "My_Crawl"; inside My_Crawl I would make a "BibTeX" folder. You should also choose a search term and decide how many pages you want to search.

Open the program scholarcrawl.py in Komodo. At the very end of the program, type:

main(your query, your output directory, your num pages)

Replace "your query" with the search term you want (like "patent thickets", making sure to include the quotes around the term). Replace "your output directory" with the directory the output files should go to; using my example above, I would type "E:\McNair\Projects\Patent_Thickets\My_Crawl", again including the quotes. Finally, replace "your num pages" with the number of pages you want to search. Then click the play button in the top center of the screen.

What you'll get back

After the program is done running, go back to the folder you created to see the outputs. First, in your BibTeX folder, you will see a series of files named by the BibTeX keys of papers; each is a text file containing the BibTeX for that paper. In your outer folder, you will have files named "Query_your query_pdfTable7.txt", where "your query" is your search term and the trailing number varies. Each of these files is a text file with BibTeX keys in the left column and a link to the paper's PDF in the right column.

In Progress

1) Testing

Notes

BibTeX entries will be saved for all papers, but not all PDFs are available online, so not every paper viewed will have a link.