Difference between revisions of "INBIA"
Revision as of 10:31, 3 April 2019
| INBIA | |
|---|---|
| Project Information | |
| Project Title | INBIA |
| Owner | Anne Freeman |
| Start Date | |
| Deadline | |
| Primary Billing | |
| Notes | |
| Has project status | Active |
| Subsumes: | Incubator Seed Data, Ecosystem Organization Classifier |
| Copyright © 2016 edegan.com. All Rights Reserved. | |
The International Business Innovation Association (INBIA, https://inbia.org/) has a directory (http://exchange.inbia.org/network/findacompany) containing information on 415 incubators in the United States.
INBIA
We retrieved the INBIA data as follows:
- Go to http://exchange.inbia.org/network/findacompany/ and search US
- Change to 100 results per page
- Save HTML page of 0-100
- Choose next page, Save HTML page of 100-200
- Sort Z-A
- Save HTML page 418-318
- Choose next page, Save HTML page of 318-218
- Note that this leaves a gap (roughly entries 200-218): we are missing some that start with L and M
- Search US L, Choose page with L as first letter, Save HTML of L
- Search US M, Choose page with M as first letter, Save HTML of M
Then process each of those HTML files with regular expressions in TextPad (NOTHING means replace with an empty string):
- Search .*biobubblekey Replace #
- Search ^[^#].*\n Replace NOTHING
- Search .*href=\" Replace NOTHING
- Search <\/a> Replace NOTHING
- Search \"> Replace \t
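The same five search-and-replace passes can be sketched in Python. This is a minimal sketch rather than the method actually used: the function name `extract_rows` and the sample markup are hypothetical, and it assumes, as the regular expressions imply, that each directory entry sits on a single line where `biobubblekey` appears before the `href` attribute.

```python
import re


def extract_rows(html_text):
    """Apply the five TextPad passes to every line of one saved page."""
    rows = []
    for line in html_text.splitlines():
        # Pass 1: mark lines containing "biobubblekey" with a leading '#'.
        marked = re.sub(r'.*biobubblekey', '#', line)
        # Pass 2: drop every line that does not start with '#'.
        if not marked.startswith('#'):
            continue
        # Pass 3: remove everything up to and including href=" (also drops the '#').
        marked = re.sub(r'.*href="', '', marked)
        # Pass 4: remove the closing anchor tag.
        marked = marked.replace('</a>', '')
        # Pass 5: turn the "> between URL and link text into a tab.
        marked = marked.replace('">', '\t')
        rows.append(marked.strip())
    return rows


# Hypothetical markup; the real INBIA page structure is not shown in this document.
sample_html = (
    '<div class="header">Results</div>\n'
    '<a class="biobubblekey" href="/?c=companyprofile&amp;UserKey='
    '531ad600-e11a-4c74-9f37-bace816b9325">712 Innovations</a>\n'
)
rows = extract_rows(sample_html)
```

Each surviving row has the URL extension first and the name second, separated by a tab, which is why the columns still need to be moved afterwards.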
Then combine files, throw out duplicates, move columns, sort. This results in a file without headers where the lines are like:
    1863 Ventures/Project 500 /?c=companyprofile&UserKey=4794e0a6-3f61-4357-a1cb-513baf00957e
    4th Sector Innovations /?c=companyprofile&UserKey=cc47b04e-1c2a-4019-88b3-05d1163a0d6a
    712 Innovations /?c=companyprofile&UserKey=531ad600-e11a-4c74-9f37-bace816b9325
    AccelerateHER /?c=companyprofile&UserKey=3c05d1c1-91b5-48ae-8ec3-c77765b10c2b
    ACTION Innovation Network /?c=companyprofile&UserKey=5ac08dd0-364d-47b2-8de0-a7536a3b4802
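The combine, de-duplicate, move-columns, and sort step could look roughly like the following in Python. It is a sketch under assumptions: the input is taken to be lists of tab-separated "URL extension, name" rows (one list per saved page), the function name `combine_rows` is made up for illustration, and the case-insensitive sort is inferred from the sample above, where AccelerateHER precedes ACTION Innovation Network.

```python
def combine_rows(row_lists):
    """Merge rows from several pages into sorted, unique (name, url) pairs."""
    seen = set()
    combined = []
    for rows in row_lists:
        for row in rows:
            # Each row is "url<TAB>name"; swap the columns to (name, url).
            url, name = row.split('\t', 1)
            pair = (name.strip(), url.strip())
            if pair not in seen:  # throw out duplicates across overlapping pages
                seen.add(pair)
                combined.append(pair)
    # Sort by name, ignoring case.
    combined.sort(key=lambda pair: pair[0].casefold())
    return combined


# Illustrative rows with shortened, made-up URL extensions.
page_a = ['/?c=1\tACTION Innovation Network', '/?c=2\t712 Innovations']
page_b = ['/?c=2\t712 Innovations', '/?c=3\tAccelerateHER']
combined = combine_rows([page_a, page_b])
```

De-duplicating on the full (name, URL) pair keeps two organizations that happen to share a name but have different profile keys.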
We can now build a crawler that requests http://exchange.inbia.org/network/findacompany/ followed by the URL extension (either left encoded, or with &amp; replaced with a plain &). For example, http://exchange.inbia.org/network/findacompany/?c=companyprofile&UserKey=da2dbe35-9afa-4141-9b31-4e2cfd46a5aa gets the company page for Cambridge Innovation Center.
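A minimal Python sketch of the URL construction follows, using only the standard library. The names `build_profile_url` and `fetch_profile` are hypothetical, and the sketch assumes the site still answers plain GET requests at the address given above, which may no longer be true.

```python
import html
import urllib.request

BASE_URL = "http://exchange.inbia.org/network/findacompany/"


def build_profile_url(extension):
    """Join the base URL and a URL extension such as '/?c=companyprofile&UserKey=...'."""
    # Decode HTML entities so that '&amp;' becomes a plain '&'.
    decoded = html.unescape(extension)
    # Drop the leading slash so the extension is appended to the base path.
    return BASE_URL + decoded.lstrip("/")


def fetch_profile(extension, timeout=30):
    """Download one company page and return its HTML as text (makes a network call)."""
    with urllib.request.urlopen(build_profile_url(extension), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


example_url = build_profile_url(
    "/?c=companyprofile&amp;UserKey=da2dbe35-9afa-4141-9b31-4e2cfd46a5aa"
)
```

Plain concatenation is used instead of `urllib.parse.urljoin` because the extension starts with a slash, which `urljoin` would resolve against the site root rather than against `/network/findacompany/`.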
We can then extract the contact information (including the URL) and the people from each company page, using either Beautiful Soup or regular expressions.
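As a rough illustration of the regular-expression route, the sketch below collects external website links and mailto addresses from a company page. The markup of the real company pages is not shown in this document, so the sample HTML, the function name `extract_contacts`, and the rule "any non-INBIA http link is a company website" are all assumptions; extracting the people would need selectors based on the actual page structure.

```python
import re

# Matches the value of every double-quoted href attribute.
HREF_RE = re.compile(r'href="([^"]+)"', re.IGNORECASE)


def extract_contacts(page_html):
    """Return external website links and e-mail addresses found in href attributes."""
    websites, emails = [], []
    for href in HREF_RE.findall(page_html):
        lowered = href.lower()
        if lowered.startswith("mailto:"):
            emails.append(href[len("mailto:"):])
        elif lowered.startswith(("http://", "https://")) and "inbia.org" not in lowered:
            # Assumption: links pointing off the INBIA site are company websites.
            websites.append(href)
    return {"websites": websites, "emails": emails}


# Hypothetical page fragment for illustration only.
sample_page = (
    '<a href="/?c=companyprofile&amp;UserKey=abc">Profile</a>\n'
    '<a href="https://example.com/">Website</a>\n'
    '<a href="mailto:info@example.com">Email</a>\n'
    '<a href="http://exchange.inbia.org/network/">Directory</a>\n'
)
contacts = extract_contacts(sample_page)
```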