
1,429 bytes added , 12:44, 21 September 2020


{{Project|Has project output=Data,Tool|Has sponsor=McNair Center

|Has title=INBIA

|Has owner=Anne Freeman,

|Depends upon it=Incubator Seed Data

}}

==Initial Review of INBIA==

The [https://inbia.org/ International Business Innovation Association (INBIA)] has a [http://exchange.inbia.org/network/findacompany directory] that contains information for 415 incubators within the United States. Each directory entry links reliably to a secondary page within the INBIA domain. This page lists the incubator's name, address, a link to the home page of its website, and information for key contacts. The secondary pages share the same HTML structure and are consistent in the data they contain, making INBIA an ideal candidate for web crawling methods to collect data from the internal pages.

==Retrieve Data from URLs Generated==

We wrote a web crawler that:
# reads the CSV file containing the URLs to scrape into a pandas DataFrame
# changes the URLs by replacing ''?c=companyprofile&'' with ''companyprofile?'' and appending the domain http://exchange.inbia.org/network/findacompany to each URL
# opens each URL and extracts information using an element tree parser
# collects information from each URL and stores it in a txt file

The crawler generates a tab-separated text file called INBIA_data.txt containing [company_name, street_address, city, state, zipcode, country, website], populated with information from the 415 entries in the database.

The txt file and the Python script (inbia_scrape.py) are located in
 E:\projects\Kauffman Incubator Project\01 Classify entrepreneurship ecosystem organizations\INBIA

== How to Run ==

The script inbia_scrape.py was written in a virtualenv on a Mac, using Python 3.6.5. The following packages were loaded in that virtualenv:
* beautifulsoup4 4.7.1
* certifi 2019.3.9
* chardet 3.0.4
* idna 2.8
* numpy 1.16.2
* pandas 0.24.2
* pip 19.1.1
* python-dateutil 2.8.0
* pytz 2018.9
* requests 2.21.0
* setuptools 40.8.0
* six 1.12.0
* soupsieve 1.9
* urllib3 1.24.1
* wheel 0.33.1
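The URL rewriting and output steps of the crawler described above can be illustrated with a minimal sketch. This is not the code from inbia_scrape.py. The function names <code>rewrite_url</code> and <code>write_rows</code> are hypothetical, and the sketch assumes that the CSV stores each link as a relative fragment beginning with ''?c=companyprofile&'', that the directory URL is joined in front of it with a slash, and that the output file starts with a header row. Fetching and parsing the pages is left out because the HTML structure of the secondary pages is not documented here.

```python
import csv

# Directory URL and output columns as given in the text above.
BASE_URL = "http://exchange.inbia.org/network/findacompany"
FIELDS = ["company_name", "street_address", "city", "state",
          "zipcode", "country", "website"]


def rewrite_url(raw_url):
    """Replace '?c=companyprofile&' with 'companyprofile?' and attach the base URL.

    Assumption: raw_url is a relative fragment, so the base URL goes in front,
    joined by a single slash.
    """
    path = raw_url.replace("?c=companyprofile&", "companyprofile?")
    return BASE_URL.rstrip("/") + "/" + path.lstrip("/")


def write_rows(rows, out_path):
    """Write dictionaries of scraped fields to a tab-separated text file.

    Assumption: a header row is written first; missing fields become empty strings.
    """
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS, delimiter="\t")
        writer.writeheader()
        for row in rows:
            writer.writerow({field: row.get(field, "") for field in FIELDS})
```

For example, <code>rewrite_url("?c=companyprofile&UserKey=abc")</code> (the query string here is made up) would produce a URL under the directory path ending in ''companyprofile?UserKey=abc'', and <code>write_rows(records, "INBIA_data.txt")</code> would write one line per incubator.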
