Google Crawler

 

McNair Project
Project Title: Google Crawler
Owner: Anne Freeman
Project Status: Active
Depends upon it: Ecosystem Organization Classifier, Incubator Seed Data
The Google Crawler collects information from web searches of the form "incubator" + "city, state". For each search, it runs the query on Google and records the title and URL of every result.
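For example, a hypothetical location entry "Houston, Texas" produces the query "incubator Houston, Texas", issued as a URL like https://www.google.com/search?q=incubator+Houston%2C+Texas (the spaces encoded as "+" and the comma as "%2C").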

The crawler opens a text file containing a list of locations in the format "city, state", with one entry per line. It prepends the Google search prefix "https://www.google.com/search?q=" to the key term "incubator" followed by the city and state name, URL-encoding the commas and spaces as Google expects. Then, using BeautifulSoup, the script opens each generated URL and parses the resulting page to collect the titles and URLs of the results (a sketch of the full loop appears after the list below). The titles and URLs are stored in a CSV file in the following format:

  • first row: city, state
  • second row: titles of the results
  • third row: URLs of the results
  • fourth row: blank

This pattern repeats for each city, state query.
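
The following is a minimal sketch of the workflow described above, not the project script itself. The file names (locations.txt, incubator_results.csv), the request headers, and the result-parsing logic are assumptions made for illustration; in particular, Google's result markup changes frequently, so the parsing step may need different selectors. The URL construction and the four-row CSV layout follow the description on this page, and the sketch assumes the requests and beautifulsoup4 packages.

 # Sketch only: build "incubator" + "city, state" query URLs, fetch each
 # results page, parse titles and URLs with BeautifulSoup, and write the
 # four-row-per-location CSV layout described above.
 import csv
 import requests
 from urllib.parse import quote_plus
 from bs4 import BeautifulSoup

 SEARCH_PREFIX = "https://www.google.com/search?q="

 def build_query_url(location):
     # quote_plus escapes spaces as "+" and commas as "%2C"
     return SEARCH_PREFIX + quote_plus("incubator " + location)

 def parse_results(html):
     # Assumption: on the non-JavaScript results page, result links look
     # like <a href="/url?q=ACTUAL_URL&...">; adjust if the markup differs.
     soup = BeautifulSoup(html, "html.parser")
     pairs = []
     for anchor in soup.find_all("a", href=True):
         href = anchor["href"]
         if href.startswith("/url?q="):
             url = href[len("/url?q="):].split("&")[0]
             title = anchor.get_text(" ", strip=True)
             if title and url:
                 pairs.append((title, url))
     return pairs

 with open("locations.txt") as f:          # one "city, state" per line
     locations = [line.strip() for line in f if line.strip()]

 with open("incubator_results.csv", "w", newline="") as out:
     writer = csv.writer(out)
     for location in locations:
         page = requests.get(build_query_url(location),
                             headers={"User-Agent": "Mozilla/5.0"},
                             timeout=10)
         results = parse_results(page.text)
         writer.writerow([location])                  # first row: city, state
         writer.writerow([t for t, _ in results])     # second row: titles
         writer.writerow([u for _, u in results])     # third row: urls
         writer.writerow([])                          # fourth row: blank

In practice, Google throttles or blocks repeated automated requests, so pauses between queries (for example with time.sleep) are usually needed when running a crawler like this over a long list of locations.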

Relevant files, including the Python script, text files, and CSV files, are located in:

E:\projects\Kauffman Incubator Project\01 Classify entrepreneurship ecosystem organizations\GoogleCrawler