1,429 bytes added ,  13:44, 21 September 2020
{{Project|Has project output=Data,Tool|Has sponsor=McNair Center
|Has title=INBIA
|Has owner=Anne Freeman,
|Depends upon it=Incubator Seed Data
}}
==Initial Review of INBIA==
The International Business Innovation Association (INBIA) has a directory that contains information for 415 incubators within the United States. Each directory entry links to a secondary page within the INBIA domain. This page lists the incubator's name, address, a link to the home page of its website, and information for key contacts. The secondary pages share the same HTML structure and contain consistent data, making INBIA a good candidate for web crawling to collect data from the internal pages.
==Retrieve Data from URLs Generated==
We wrote a web crawler that:
# reads the CSV file containing the URLs to scrape into a pandas DataFrame
# rewrites each URL by replacing ''?c=companyprofile&'' with ''companyprofile?'' and adding the domain
# opens each URL and extracts information using an element tree parser
# collects the information from each URL and stores it in a txt file

The crawler generates a tab-separated text file called INBIA_data.txt containing [company_name, street_address, city, state, zipcode, country, website], populated with information from the 415 entries in the database.

The txt file and the Python script are located in
 E:\projects\Kauffman Incubator Project\01 Classify entrepreneurship ecosystem organizations\INBIA

== How to Run ==
The script was written in a virtualenv on a Mac, using Python 3.6.5. The following packages were loaded in that virtualenv:
* beautifulsoup4 4.7.1
* certifi 2019.3.9
* chardet 3.0.4
* idna 2.8
* numpy 1.16.2
* pandas 0.24.2
* pip 19.1.1
* python-dateutil 2.8.0
* pytz 2018.9
* requests 2.21.0
* setuptools 40.8.0
* six 1.12.0
* soupsieve 1.9
* urllib3 1.24.1
* wheel 0.33.1
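
The URL rewriting and output steps of the crawler described under "Retrieve Data from URLs Generated" can be sketched roughly as follows. This is not the project's script: it uses the standard-library <code>csv</code> module in place of pandas so that it runs without extra packages, and it leaves out the fetch-and-parse step because the page structure is not described here. The domain <code>https://example.org/</code>, the CSV column name <code>url</code>, the <code>UserKey</code> query parameter, and the sample record are all placeholders assumed for illustration, as is the choice to put the domain in front of the rewritten URL.

```python
import csv
import io

# Assumption: placeholder domain; the real INBIA base URL is not given here.
DOMAIN = "https://example.org/"

# Column order of INBIA_data.txt as listed in the section above.
FIELDS = ["company_name", "street_address", "city", "state",
          "zipcode", "country", "website"]


def rewrite_url(url, domain=DOMAIN):
    """Swap '?c=companyprofile&' for 'companyprofile?' and prefix the domain."""
    return domain + url.replace("?c=companyprofile&", "companyprofile?")


def read_urls(csv_file, column="url"):
    """Read one column of URLs from an open CSV file (column name is assumed)."""
    return [row[column] for row in csv.DictReader(csv_file)]


def write_records(records, out_file):
    """Write one tab-separated line per incubator, in FIELDS order."""
    writer = csv.writer(out_file, delimiter="\t", lineterminator="\n")
    for record in records:
        # Missing fields are written as empty strings.
        writer.writerow([record.get(field, "") for field in FIELDS])


# Usage with in-memory data standing in for the real CSV and scraped pages.
urls = read_urls(io.StringIO("url\n?c=companyprofile&UserKey=abc123\n"))
full_urls = [rewrite_url(u) for u in urls]

out = io.StringIO()
write_records([{"company_name": "Example Incubator", "city": "Example City"}], out)
```

In a real run, <code>io.StringIO</code> would be replaced by files opened with <code>open(...)</code>, and each record would come from parsing the page fetched at the rewritten URL.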
