{{Project
|Has project output=Data,Tool
|Has sponsor=McNair Center
|Has title=INBIA
|Has owner=Anne Freeman
}}
==Retrieve Data from URLs Generated==
We wrote a web crawler that
# reads in the csv file containing the URLs to scrape into a pandas dataframe
# changes the urls by replacing ''?c=companyprofile&'' with ''companyprofile?'' and appending the domain http://exchange.inbia.org/network/findacompany to each url
# opens each url and extracts information using an element tree parser
# collects the information from each url and stores it in a tab-separated txt file
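The URL rewriting step above can be sketched as follows. This is a minimal illustration, not the project's actual code: the relative-URL format in the example is an assumption, since the source CSV layout is not documented here.

```python
# Sketch of the URL rewriting step: swap '?c=companyprofile&' for
# 'companyprofile?' and prepend the exchange.inbia.org domain.
DOMAIN = "http://exchange.inbia.org/network/findacompany"

def rewrite_url(raw_url: str) -> str:
    """Turn a raw profile link from the CSV into a fetchable absolute URL."""
    return DOMAIN + raw_url.replace("?c=companyprofile&", "companyprofile?")

# Hypothetical example input; the real CSV's URL format may differ.
full_url = rewrite_url("/?c=companyprofile&UserKey=abc123")
print(full_url)
```

In the real script each rewritten URL would then be fetched (e.g. with requests) and parsed before writing the output file.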
The crawler generates a tab-separated text file called INBIA_data.txt containing [company_name, street_address, city, state, zipcode, country, website, contact_person], populated with information from the 415 entries in the database.
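Writing the tab-separated output with those fields might look like the sketch below. The helper name and the sample row are hypothetical; only the field order comes from the description above.

```python
import csv

# Field order as described for INBIA_data.txt
FIELDS = ["company_name", "street_address", "city", "state", "zipcode",
          "country", "website", "contact_person"]

def write_rows(path, rows):
    """Write scraped records as a tab-separated text file with a header row."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(FIELDS)
        writer.writerows(rows)

# Made-up example record for illustration
write_rows("INBIA_data.txt", [
    ["Example Incubator", "1 Main St", "Houston", "TX", "77005", "USA",
     "http://example.com", "Jane Doe"],
])
```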
The text file and the python script (inbia_scrape.py) are located in E:\projects\Kauffman Incubator Project\01 Classify entrepreneurship ecosystem organizations\INBIA

== How to Run ==
The script inbia_scrape.py was written in a virtualenv on a Mac, using Python 3.6.5. The following packages were loaded in that virtualenv:
* beautifulsoup4 4.7.1
* certifi 2019.3.9
* chardet 3.0.4
* idna 2.8
* numpy 1.16.2
* pandas 0.24.2
* pip 19.1.1
* python-dateutil 2.8.0
* pytz 2018.9
* requests 2.21.0
* setuptools 40.8.0
* six 1.12.0
* soupsieve 1.9
* urllib3 1.24.1
* wheel 0.33.1
