{{Project
|Has project output=Data,Tool
|Has sponsor=McNair Center
|Has title=INBIA
|Has owner=Anne Freeman
}}
==Retrieve Data from URLs Generated==
We wrote a web crawler that
# reads in the csv file containing the URLs to scrape into a pandas dataframe
# changes the urls by replacing ''?c=companyprofile&'' with ''companyprofile?'' and appending the domain http://exchange.inbia.org/network/findacompany to each url
# opens each url and extracts information using an element tree parser
# collects the information from each url and stores it in a tab-separated txt file
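The URL rewriting step above can be sketched as follows. This is a minimal illustration, not the project's actual code: the relative-URL format in the example is an assumption, since the source CSV layout is not documented here.

```python
# Sketch of the URL rewriting step: swap '?c=companyprofile&' for
# 'companyprofile?' and prepend the exchange.inbia.org domain.
DOMAIN = "http://exchange.inbia.org/network/findacompany"

def rewrite_url(raw_url: str) -> str:
    """Turn a raw profile link from the CSV into a fetchable absolute URL."""
    return DOMAIN + raw_url.replace("?c=companyprofile&", "companyprofile?")

# Hypothetical example input; the real CSV's URL format may differ.
full_url = rewrite_url("/?c=companyprofile&UserKey=abc123")
print(full_url)
```

In the real script each rewritten URL would then be fetched (e.g. with requests) and parsed before writing the output file.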
The crawler generates a tab-separated text file called INBIA_data.txt containing [company_name, street_address, city, state, zipcode, country, website, contact_person], populated with information from the 415 entries in the database.
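Writing the tab-separated output with those fields might look like the sketch below. The helper name and the sample row are hypothetical; only the field order comes from the description above.

```python
import csv

# Field order as described for INBIA_data.txt
FIELDS = ["company_name", "street_address", "city", "state", "zipcode",
          "country", "website", "contact_person"]

def write_rows(path, rows):
    """Write scraped records as a tab-separated text file with a header row."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(FIELDS)
        writer.writerows(rows)

# Made-up example record for illustration
write_rows("INBIA_data.txt", [
    ["Example Incubator", "1 Main St", "Houston", "TX", "77005", "USA",
     "http://example.com", "Jane Doe"],
])
```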
The text file and the python script (inbia_scrape.py) are located in E:\projects\Kauffman Incubator Project\01 Classify entrepreneurship ecosystem organizations\INBIA

== How to Run ==
The script inbia_scrape.py was written in a virtualenv on a Mac, using Python 3.6.5. The following packages were loaded in that virtualenv:
* beautifulsoup4 4.7.1
* certifi 2019.3.9
* chardet 3.0.4
* idna 2.8
* numpy 1.16.2
* pandas 0.24.2
* pip 19.1.1
* python-dateutil 2.8.0
* pytz 2018.9
* requests 2.21.0
* setuptools 40.8.0
* six 1.12.0
* soupsieve 1.9
* urllib3 1.24.1
* wheel 0.33.1
