673 bytes added ,  13:44, 21 September 2020
{{Project|Has project output=Data,Tool|Has sponsor=McNair Center
|Has title=INBIA
|Has owner=Anne Freeman,
|Depends upon it=Incubator Seed Data
}}
 
 
==Initial Review of INBIA==
The [https://inbia.org/ International Business Innovation Association (INBIA)] has a [http://exchange.inbia.org/network/findacompany directory] that contains information for 415 incubators within the United States. It provides reliable links to a secondary page within the INBIA domain. This page contains information including the incubator's name, address, a link to the home page of their website, and information for key contacts. The secondary pages have the same HTML structure and are reliable in the data they contain, making INBIA an ideal candidate for web crawling methods to collect data from the internal pages.
==Retrieve Data from URLs Generated==
We wrote a web crawler that
# reads the csv file containing the URLs to scrape into a pandas dataframe
# changes the URLs by replacing ''?c=companyprofile&'' with ''companyprofile?'' and appending the domain http://exchange.inbia.org/network/findacompany to each URL
# opens each URL and extracts information using an element tree parser
# collects the information from each URL and stores it in a txt file
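The URL rewriting in step 2 can be sketched as follows. This is a minimal illustration, not the code in inbia_scrape.py: the function name <code>rewrite_url</code>, the way the domain and the rewritten path are joined (with a single slash), and the query parameter <code>UserKey</code> in the example are all assumptions.

```python
# Base address named in the steps above.
BASE_URL = "http://exchange.inbia.org/network/findacompany"


def rewrite_url(raw_url, base_url=BASE_URL):
    """Turn a directory link into a full profile URL.

    Hypothetical helper: joining with a single "/" is an assumption,
    since the page does not say exactly how the domain is attached.
    """
    # Replace the query-style prefix with the path-style prefix.
    rewritten = raw_url.replace("?c=companyprofile&", "companyprofile?")
    # Attach the domain, avoiding a doubled slash.
    return base_url.rstrip("/") + "/" + rewritten.lstrip("/")


if __name__ == "__main__":
    # "UserKey=abc123" is a made-up query string used only for illustration.
    print(rewrite_url("?c=companyprofile&UserKey=abc123"))
```

With the URLs held in a pandas dataframe, the same function could be applied to a column with <code>Series.map</code>, for example <code>df["url"].map(rewrite_url)</code>, where the column name <code>url</code> is again hypothetical.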
The crawler generates a tab-separated text file called INBIA_data.txt containing the columns [company_name, street_address, city, state, zipcode, country, website], populated with information from the 415 entries in the database.
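Writing the tab-separated output could look roughly like the sketch below, which uses Python's standard <code>csv</code> module. It is an illustration under assumptions rather than the project's actual code: the helper name <code>write_records</code>, the header row, and the sample record are hypothetical.

```python
import csv

# Column order as listed for INBIA_data.txt.
COLUMNS = ["company_name", "street_address", "city", "state",
           "zipcode", "country", "website"]


def write_records(records, path):
    """Write a list of dicts to a tab-separated text file.

    Hypothetical helper: writing a header row is an assumption; the
    page does not say whether the real file has one.
    """
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle,
            fieldnames=COLUMNS,
            delimiter="\t",          # tab-separated output
            extrasaction="ignore",   # skip keys that are not in COLUMNS
        )
        writer.writeheader()
        for record in records:
            # Missing keys are written as empty strings (DictWriter default).
            writer.writerow(record)


if __name__ == "__main__":
    # Made-up record used only to show the call.
    sample = [{"company_name": "Example Incubator", "city": "Houston", "state": "TX"}]
    write_records(sample, "INBIA_data.txt")
```

Using <code>csv.DictWriter</code> keeps the column order fixed even when a profile page lacks one of the fields.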
The txt file and the python script (inbia_scrape.py) are located in E:\projects\Kauffman Incubator Project\01 Classify entrepreneurship ecosystem organizations\INBIA
== How to Run ==
The following script was coded in a virtualenv on a Mac, using Python 3.6.5.

The following packages were loaded in that virtualenv:
* beautifulsoup4 4.7.1
* certifi 2019.3.9
* chardet 3.0.4
* idna 2.8
* numpy 1.16.2
* pandas 0.24.2
* pip 19.1.1
* python-dateutil 2.8.0
* pytz 2018.9
* requests 2.21.0
* setuptools 40.8.0
* six 1.12.0
* soupsieve 1.9
* urllib3 1.24.1
* wheel 0.33.1
