{{McNair Projects
|Has title=Google Web Crawler
|Has owner=Anne Freeman
|Has project status=Active
}}
==Background==
We wanted to create a Google web crawler that could collect data from Google web searches specific to individual cities. The searches take the format "incubator" + "city, state". The crawler was modeled on a previous researcher's web crawler, which collected information on accelerators; we could not simply modify that crawler because it used an outdated Python module.
The output from this crawler could be used in several ways:
# The pages can be passed to Amazon's [https://www.mturk.com/ Mechanical Turk] to outsource the task of classifying pages as incubators.
==Selenium Implementation==
The Selenium implementation of the crawler requires a downloaded Chrome driver (chromedriver). The crawler opens the text file containing a list of locations in the format "city, state", with each entry separated by a newline. It prepends the Google search query domain "https://www.google.com/search?q=" to the key term "incubator" and appends the city and state name, using URL escape characters for the commas and spaces. The crawler then uses the chromedriver browser to access each URL and parse the results for each location. Its default is to parse 10 pages of results, meaning that approximately 100 lines of data are collected for each location. A sketch of this flow appears below.
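The following is a minimal sketch of that flow, not the project's actual script; the input file name, the CSS selectors, and the pagination handling are assumptions, since Google's result markup changes over time.

<pre>
# Minimal sketch of the Selenium crawler described above; requires
# chromedriver and the selenium package. Selectors are assumptions.
from urllib.parse import quote_plus

from selenium import webdriver
from selenium.webdriver.common.by import By

SEARCH_PREFIX = "https://www.google.com/search?q="
PAGES_PER_LOCATION = 10  # default: ~10 results per page, ~100 per location

driver = webdriver.Chrome()  # uses the downloaded chromedriver

with open("locations.txt") as f:  # hypothetical input file name
    locations = [line.strip() for line in f if line.strip()]

for location in locations:
    # quote_plus escapes the comma (%2C) and turns spaces into '+',
    # the escape characters mentioned above.
    driver.get(SEARCH_PREFIX + quote_plus('incubator "%s"' % location))
    for _ in range(PAGES_PER_LOCATION):
        # Result titles are <h3> elements inside links; the exact
        # selector is an assumption about Google's markup.
        for h3 in driver.find_elements(By.CSS_SELECTOR, "div.g h3"):
            links = h3.find_elements(By.XPATH, "./ancestor::a")
            if links:
                print(location, h3.text, links[0].get_attribute("href"), sep="\t")
        # "pnnext" is the id Google uses for the next-page link, if present.
        next_page = driver.find_elements(By.ID, "pnnext")
        if not next_page:
            break
        next_page[0].click()

driver.quit()
</pre>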
Relevant files, including the Python script and text files, are located in
E:\projects\Kauffman Incubator Project\01 Classify entrepreneurship ecosystem organizations\SeleniumCrawler
==Beautiful Soup Implementation==
When we created the web crawler, our first implementation used Beautiful Soup to directly "request" the URL. The crawler took the same input file (one "city, state" entry per line, separated by newlines) and formatted the queries in the same manner. Then, using Beautiful Soup, the script opens each of the generated URLs and parses the resulting page to collect the titles and URLs of the results. The data collected is stored in a tab-separated text file, with each row containing: city, state, title of result, URL.
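A minimal sketch of that first approach, under the same caveats (the file names and the result selector are assumptions, not the project's actual script):

<pre>
# Minimal sketch of the Beautiful Soup crawler described above;
# requires the requests and beautifulsoup4 packages.
import csv
from urllib.parse import quote_plus

import requests
from bs4 import BeautifulSoup

SEARCH_PREFIX = "https://www.google.com/search?q="

with open("locations.txt") as f:  # hypothetical input file name
    locations = [line.strip() for line in f if line.strip()]

with open("results.txt", "w", newline="") as out:
    writer = csv.writer(out, delimiter="\t")  # tab-separated output
    for location in locations:
        url = SEARCH_PREFIX + quote_plus('incubator "%s"' % location)
        # Direct requests like this are what got the crawler blocked.
        page = requests.get(url)
        soup = BeautifulSoup(page.text, "html.parser")
        city, state = [part.strip() for part in location.split(",", 1)]
        # Result titles are <h3> elements inside links; the selector is
        # an assumption about Google's markup at the time.
        for h3 in soup.select("div.g h3"):
            link = h3.find_parent("a")
            if link is not None:
                writer.writerow([city, state, h3.get_text(), link["href"]])
</pre>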
Relevant files, including the Python script, text files, and CSV files, are located in
E:\projects\Kauffman Incubator Project\01 Classify entrepreneurship ecosystem organizations\GoogleCrawler

This crawler was frequently blocked, as it directly performed queries to Google and parsed the results with Beautiful Soup. Additionally, this implementation would only collect eight results for each location. To prevent the crawler from being blocked and to collect more results, we decided to switch to Selenium.