INBIA

From edegan.com
Revision as of 12:02, 3 April 2019 by AnneFreeman (talk | contribs)


McNair Project
INBIA
Project Information
Project Title INBIA
Owner Anne Freeman
Start Date
Deadline
Primary Billing
Notes
Has project status Active
Copyright © 2016 edegan.com. All Rights Reserved.



Initial Review of INBIA

The International Business Innovation Association (INBIA) maintains a directory containing information on 415 incubators in the United States. Each directory entry links reliably to a secondary page within the INBIA domain. That page lists the incubator's name, address, a link to its website home page, and information for key contacts. The secondary pages all share the same HTML structure and contain consistent data, making INBIA an ideal candidate for collecting data from the internal pages with a web crawler.

See Wiki Page Table for more details on source evaluations.

Retrieve URLs from INBIA Directory

We retrieved the INBIA data as follows:

  1. Go to http://exchange.inbia.org/network/findacompany/ and search US
  2. Change to 100 results per page
  3. Save HTML page of 0-100
  4. Choose next page, Save HTML page of 100-200
  5. Sort Z-A
  6. Save HTML page 418-318
  7. Choose next page, Save HTML page of 318-218
  8. Note that we are missing some that start with L and M
  9. Search US L, Choose page with L as first letter, Save HTML of L
  10. Search US M, Choose page with M as first letter, Save HTML of M

Then process each of those HTML files with regular expressions in TextPad:

  • Search .*biobubblekey Replace #
  • Search ^[^#].*\n Replace NOTHING
  • Search .*href=\" Replace NOTHING
  • Search <\/a> Replace NOTHING
  • Search \"> Replace \t
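The same five replacements can be approximated in Python. This is a minimal sketch rather than the procedure actually used: the function name `extract_rows` is hypothetical, and it assumes each directory entry sits on a single line with `biobubblekey` appearing before the `href` attribute, which is what the TextPad patterns imply.

```python
import re

def extract_rows(html_text):
    """Apply the TextPad replacements line by line.

    Returns strings of the form 'URL extension<TAB>company name'.
    """
    rows = []
    for line in html_text.splitlines():
        # Step 1: mark lines containing biobubblekey with a leading '#'.
        marked = re.sub(r".*biobubblekey", "#", line, count=1)
        # Step 2: drop every line that was not marked.
        if not marked.startswith("#"):
            continue
        # Step 3: remove everything up to and including href=".
        marked = re.sub(r'.*href="', "", marked, count=1)
        # Step 4: remove the closing anchor tag.
        marked = marked.replace("</a>", "")
        # Step 5: turn the end of the opening tag into a tab separator.
        marked = marked.replace('">', "\t")
        rows.append(marked.strip())
    return rows
```

Unlike the TextPad pass, this sketch also drops blank lines, since they never receive the `#` marker.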

Then combine the files, discard duplicates, reorder the columns so that the name comes first, and sort. This results in a file without headers whose lines look like:

1863 Ventures/Project 500	/?c=companyprofile&UserKey=4794e0a6-3f61-4357-a1cb-513baf00957e	
4th Sector Innovations	/?c=companyprofile&UserKey=cc47b04e-1c2a-4019-88b3-05d1163a0d6a	
712 Innovations	/?c=companyprofile&UserKey=531ad600-e11a-4c74-9f37-bace816b9325	
AccelerateHER	/?c=companyprofile&UserKey=3c05d1c1-91b5-48ae-8ec3-c77765b10c2b	
ACTION Innovation Network	/?c=companyprofile&UserKey=5ac08dd0-364d-47b2-8de0-a7536a3b4802	
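The combine, de-duplicate, reorder, and sort step could also be scripted. The sketch below is illustrative only: `combine_rows` is a hypothetical name, and it assumes the input rows are in the 'URL extension<TAB>name' order that the regular expression replacements produce. The sort is case-insensitive because the sample above lists AccelerateHER before ACTION Innovation Network.

```python
def combine_rows(*row_lists):
    """Merge several lists of 'URL<TAB>name' rows into sorted (name, URL) pairs."""
    unique = set()
    for rows in row_lists:
        for row in rows:
            url, name = row.split("\t", 1)
            # Swap the columns so that the name comes first; the set drops duplicates.
            unique.add((name.strip(), url.strip()))
    # Case-insensitive sort on the name, with the raw values as tie-breakers.
    return sorted(unique, key=lambda pair: (pair[0].casefold(), pair[0], pair[1]))
```

Overlapping saves, such as the separate L and M pages, collapse automatically because identical pairs are stored only once.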

We can now build a crawler to call http://exchange.inbia.org/network/findacompany/ followed by the URL extension (either encoded or with &amp; replaced with just &). For example, http://exchange.inbia.org/network/findacompany/?c=companyprofile&UserKey=da2dbe35-9afa-4141-9b31-4e2cfd46a5aa gets the company page for Cambridge Innovation Center.
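As a minimal sketch of that URL construction (the names `BASE_URL` and `build_profile_url` are hypothetical), the extension is decoded if necessary and joined to the directory address:

```python
BASE_URL = "http://exchange.inbia.org/network/findacompany/"

def build_profile_url(extension):
    """Join the directory address and a URL extension from the link file."""
    # Replace the HTML-encoded ampersand with a plain one, if present.
    decoded = extension.replace("&amp;", "&")
    # The extension starts with '/', which BASE_URL already ends with.
    return BASE_URL + decoded.lstrip("/")
```

Calling `build_profile_url("/?c=companyprofile&amp;UserKey=da2dbe35-9afa-4141-9b31-4e2cfd46a5aa")` yields the Cambridge Innovation Center address given above.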

We can then extract the contact information, including the website URL and the people listed, using either Beautiful Soup or regular expressions.


Retrieve Data from Generated URLs

We wrote a web crawler that:

  1. reads the CSV file into a pandas DataFrame
  2. rewrites each URL by replacing ?c=companyprofile& with companyprofile? and prepending the domain http://exchange.inbia.org/network/findacompany
  3. opens each URL and extracts information using an ElementTree parser
  4. writes the information for each URL to a CSV file
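The URL rewrite in step 2 and the output in step 4 can be sketched as follows. This is not the inbia_scrape.py script itself: the function names are hypothetical, the standard-library `csv` module stands in for pandas to keep the example self-contained, and the extraction in step 3 is omitted because the structure of the profile pages is not described here.

```python
import csv

DOMAIN = "http://exchange.inbia.org/network/findacompany"

# Column names of the output file.
FIELDS = ["company_name", "street_address", "city", "state",
          "zipcode", "country", "website", "contact_person"]

def rewrite_profile_url(path):
    """Step 2: '/?c=companyprofile&UserKey=X' becomes DOMAIN + '/companyprofile?UserKey=X'."""
    return DOMAIN + path.replace("?c=companyprofile&", "companyprofile?")

def write_rows(rows, out):
    """Step 4: write one dictionary per profile page to an open text stream."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```

Fields missing from a row are written as empty cells, so a profile page with incomplete contact details still produces a well-formed line.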


The crawler generates a CSV file called INBIA_data.csv with the columns [company_name, street_address, city, state, zipcode, country, website, contact_person], populated with information from the 415 entries in the database.

The CSV file and the Python script (inbia_scrape.py) are located in E:\projects\Kauffman Incubator Project\01 Classify entrepreneurship ecosystem organizations\INBIA