Changes

1,443 bytes added, 13:47, 21 September 2020
{{Project|Has project output=Tool|Has sponsor=McNair ProjectsCenter|Has title=Web Google Crawler
|Has owner=Anne Freeman,
|Has project status=Active
Relevant files, including the Python scripts and text files, are located in:
E:\projects\Kauffman Incubator Project\01 Classify entrepreneurship ecosystem organizations\SeleniumCrawler SeleniumScraper
==Beautiful Soup Implementation==
This crawler was frequently blocked because it sent queries directly to Google and parsed the results with Beautiful Soup. It also collected only eight results for each location. To avoid being blocked and to collect more results, we switched to Selenium.
 
== Things to note/What needs work ==
The scraper written with Beautiful Soup does not work: it is frequently blocked by Google. The scraper written with Selenium loads the search URL directly rather than typing the search term into the search box and pressing Enter. The Selenium script also does not collect results from multiple pages; I believe it currently collects results from the first page only.
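One possible way to reach later result pages while still loading URLs directly is to step Google's <code>start</code> query parameter in multiples of ten. The sketch below is only an illustration and is not the code in incubator_selenium_scrape.py: the function names <code>build_search_url</code> and <code>collect_pages</code> are hypothetical, and ten results per page is an assumption. The <code>driver</code> argument is expected to be a Selenium WebDriver; only its <code>get()</code> method and <code>page_source</code> attribute are used.

```python
from urllib.parse import urlencode

RESULTS_PER_PAGE = 10  # assumption: Google's default page size


def build_search_url(term, page=0):
    """Return a Google search URL for the given zero-based results page."""
    params = {"q": term, "start": page * RESULTS_PER_PAGE}
    return "https://www.google.com/search?" + urlencode(params)


def collect_pages(driver, term, n_pages):
    """Load the first n_pages of results and return the HTML of each page.

    driver is any object offering Selenium's WebDriver.get() and page_source.
    """
    pages = []
    for page in range(n_pages):
        driver.get(build_search_url(term, page))
        pages.append(driver.page_source)
    return pages
```

Pausing between requests (for example with <code>time.sleep</code>) may lower the chance of being blocked, although that has not been tested here.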
 
== How to Run ==
The scripts incubator_scrape_data.py and incubator_selenium_scrape.py were written on a Mac in a virtualenv using Python 3.6.5.
The following packages were installed in the environment for the Selenium script:
* numpy 1.16.2
* pandas 0.24.2
* pip 19.1.1
* python-dateutil 2.8.0
* pytz 2019.1
* selenium 3.141.0
* setuptools 41.0.0
* six 1.12.0
* urllib3 1.24.1
* wheel 0.33.1
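Before running the scripts, it may help to confirm that the active environment matches the versions listed above. The following is a minimal sketch under stated assumptions, not part of the project scripts: the names <code>PINNED</code>, <code>installed_version</code> and <code>find_mismatches</code> are hypothetical. It uses <code>importlib.metadata</code> where available (Python 3.8+) and falls back to <code>pkg_resources</code> on Python 3.6.

```python
# Pinned versions copied from the package list above.
PINNED = {
    "numpy": "1.16.2",
    "pandas": "0.24.2",
    "pip": "19.1.1",
    "python-dateutil": "2.8.0",
    "pytz": "2019.1",
    "selenium": "3.141.0",
    "setuptools": "41.0.0",
    "six": "1.12.0",
    "urllib3": "1.24.1",
    "wheel": "0.33.1",
}


def installed_version(name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        from importlib.metadata import PackageNotFoundError, version
    except ImportError:
        # Python 3.6/3.7: fall back to pkg_resources, which ships with setuptools.
        import pkg_resources
        try:
            return pkg_resources.get_distribution(name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Return {package: (wanted, found)} for every package that differs."""
    mismatches = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {wanted}, found {found}")
```

An empty result from <code>find_mismatches(PINNED)</code> means the environment matches the list.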
 
==Five Cities==
 
We retrieved the first 10 pages of results for each city in our 'five' cities. These included:
*Washington, DC and surrounds:
**Arlington VA
**Alexandria VA
**Crystal City VA
**Fairfax VA
**Washington DC
**Springfield MD
**Bethesda MD
**Gaithersburg MD
**Rockville MD
**Frederick MD
*Burlington VT
*Boulder, CO and select other CO cities:
**Boulder CO
**Colorado Springs CO
**Fort Collins CO
*The Twin Cities and adjacent city:
**St. Paul MN
**Minneapolis MN
**Bloomington MN
*Austin TX
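
The city list above can be expressed as a small data structure that enumerates one search job per city and results page. This is a hedged sketch rather than the project's code: <code>CITY_GROUPS</code> and <code>search_jobs</code> are hypothetical names, and the search term <code>"incubator"</code> in the usage line is an assumed placeholder, since the actual query text is not recorded on this page.

```python
# City groups copied from the list above.
CITY_GROUPS = {
    "Washington, DC and surrounds": [
        "Arlington VA", "Alexandria VA", "Crystal City VA", "Fairfax VA",
        "Washington DC", "Springfield MD", "Bethesda MD", "Gaithersburg MD",
        "Rockville MD", "Frederick MD",
    ],
    "Burlington VT": ["Burlington VT"],
    "Boulder, CO and select other CO cities": [
        "Boulder CO", "Colorado Springs CO", "Fort Collins CO",
    ],
    "The Twin Cities and adjacent city": [
        "St. Paul MN", "Minneapolis MN", "Bloomington MN",
    ],
    "Austin TX": ["Austin TX"],
}

PAGES_PER_CITY = 10  # "the first 10 pages of results for each city"


def search_jobs(term, groups=CITY_GROUPS, pages=PAGES_PER_CITY):
    """Yield (query, page_number) pairs, one per city and results page."""
    for cities in groups.values():
        for city in cities:
            for page in range(1, pages + 1):
                yield f"{term} {city}", page


if __name__ == "__main__":
    jobs = list(search_jobs("incubator"))  # placeholder search term
    print(len(jobs), "searches planned")
```

Keeping the groups in one dictionary makes it straightforward to add or drop a city without touching the loop.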
