Difference between revisions of "Talk:Accelerator Seed List (Data)"
Latest revision as of 17:07, 2 August 2017
Hi Veeral,
Contents
- 1 Intro
- 2 Important docs
- 3 To-do list
- 4 Don't worry about this stuff
- 5 Veeral's Summer Work
- 5.1 WHAT I'VE DONE
- 5.2 NEXT STEPS
- 5.3 All New Files and what they Contain
- 5.4 Completing Master List of Accelerators (Process)
- 5.4.1 Transfer all of the organizations data into Access
- 5.4.2 Use Access keyword queries with the short descriptions of each organization to accumulate a list of Potential Accelerators from Organizations data
- 5.4.3 Match Potential Accelerators with Cleaned Cohort Data using The Matcher (Tool).
- 5.4.4 Updated Cleaned Cohort List
- 5.5 Global Accelerator Network Parser Spec
- 5.6 Parser Results
Intro
Welcome to the project. The documents are here: E:\Mcnair\Projects\Accelerators
SQL documents are here: E:\Mcnair\Projects\Accelerators\SQL_Data
Database Drive is here: Z:\Bulk\Accelerators
The database is called accelerator
Important docs
The SDC pull that includes all of the round data since 1999: E:\Mcnair\Projects\Accelerators\VC_Data_Repeated_Down.txt or E:\Mcnair\Projects\Accelerators\"VC Data.xlsx"
The Cohorts of accelerators (under the Updated tab on the bottom): E:\Mcnair\Projects\Accelerators\"Clean Cohort Data.xlsx"
The Crunchbase Snapshots of organizations: E:\Mcnair\Projects\Accelerators\"Crunchbase Snapshot"\organizations.csv
To-do list
1. Filter out actual accelerators from the Crunchbase organizations data
- Possibly by running accelerator_keywords.py
- Possibly by using string searching in organizations.csv
- Watch out for venture capital companies (the organizations file has many of these, and we'll probably pick up a lot of them in our "accelerator" filtered list)
2. Match this list against the current list of accelerators
- We have our own copy of the matcher in the accelerators E drive (try mode 1 and mode 2 for different results, mode 2 might be more helpful)
- This will tell you whether it was part of the old list or not (and therefore whether we need to get data for it or not)
3. Find cohort data for all of the new accelerators (ones not previously on the list & if they're not accelerators take them off the list)
- We used regex for this
- Once you find the cohort data, put it into the updated cohort data list Excel file
- You just need the cohort company name and the name of the accelerator
4. Match the cohort data against the round data from SDC
- Make sure to get both the accelerator name and the cohort company name in the first document
- In the second document (to match against the first) put the list of all companies funded in rounds (from SDC)
- In summary: File1 = Accelerator Cohorts and File2 = SDC data
5. Upload the match file into the psql database, then follow the code in accelerators.sql
- Write new code for your newly uploaded tables and documents; you should be able to follow what we've already done to get a similar percentVC table
- Your new table should look like the previous percent VC table, PercentVc4
^All of the above is for the VC percentage rankings
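As a rough illustration of steps 4 and 5, here is a minimal Python sketch of a per-accelerator VC percentage. It is not the code in accelerators.sql and it does not use the Matcher: it assumes exact matching on lowercased names, and it assumes "percent VC" means the share of an accelerator's cohort companies that appear in the SDC round data. The function names and sample rows are made up.

```python
from collections import defaultdict


def normalize(name):
    """Crude normalization (lowercase, collapse spaces); the Matcher does more."""
    return " ".join(name.lower().split())


def percent_vc(cohorts, funded_companies):
    """Return {accelerator: percent of its cohort companies found in the funded set}.

    cohorts: iterable of (accelerator_name, company_name) pairs (File1).
    funded_companies: iterable of company names from the round data (File2).
    """
    funded = {normalize(company) for company in funded_companies}
    totals = defaultdict(int)
    hits = defaultdict(int)
    for accelerator, company in cohorts:
        totals[accelerator] += 1
        if normalize(company) in funded:
            hits[accelerator] += 1
    return {acc: 100.0 * hits[acc] / totals[acc] for acc in totals}


# Made-up sample data, only to show the shape of the inputs.
sample_cohorts = [
    ("Acc One", "Foo Inc"),
    ("Acc One", "Bar LLC"),
    ("Acc Two", "Baz Co"),
]
sample_funded = ["foo inc", "Qux"]
result = percent_vc(sample_cohorts, sample_funded)
print(result)
```

Exact matching will miss name variants (Inc vs. Incorporated and so on), which is the reason the real workflow goes through the Matcher first.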
For more info you can use the whoisparser, which gets data on website registration (location, time, who registered it, and potentially age if you treat the website registration date as an age). You can also do an automated Google lookup (this will harvest addresses that are within Google)
^These two will get you the information of where & how old
Don't worry about this stuff
Rank on VC
- Getting a VC percentage for each Accelerator
Also categorize
- Age
- Nonprofit or not
- Location
RegEx Code for repeating data down for the round data from SDC:
Find:
\n([^\t]+\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t)(.*)\n\t\t\t\t\t\t\t\t\t\t
Replace with:
\n\1\2\n\1
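The regex above copies the first ten tab-separated fields of a row into the following row when that row starts with ten empty fields. A minimal Python sketch of the same fill-down idea is below. It is an alternative to the regex, not a translation of it, and the function name and sample rows are made up. The demo uses two leading columns instead of ten to keep it short.

```python
def fill_down(lines, n_cols=10):
    """Copy the first n_cols fields from the previous row into any row
    whose first n_cols fields are all blank."""
    filled = []
    previous_head = None
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        head, rest = fields[:n_cols], fields[n_cols:]
        if previous_head is not None and len(head) == n_cols and not any(head):
            # Blank leading fields: reuse the values from the row above.
            head = previous_head
        else:
            previous_head = head
        filled.append("\t".join(head + rest))
    return filled


# Made-up rows: company and year are repeated down onto the second round.
sample = [
    "Acme\t2001\tSeed\t1.0",
    "\t\tSeries A\t5.0",
    "Bolt\t2003\tSeed\t0.5",
]
filled_rows = fill_down(sample, n_cols=2)
print("\n".join(filled_rows))
```

Unlike a single find-and-replace pass, this handles several blank rows in a row without rerunning.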
=if(isnumber(search("blah",B2))=TRUE,1,0) where blah is the substring (what you're searching for) and B2 is the string (what you're searching in). The result is 1 if the substring is present and 0 if it isn't.
=sum(A1:C1) This just sums the cells from A1 to C1
Veeral's Summer Work
WHAT I'VE DONE
1) Used the 2013 Crunchbase Snapshot information to find more accelerators using keyword matching and manual research/Googling. Ended up with ~70 new accelerators, which were all added to the current list.
2) Cohorts were manually obtained for each new accelerator and saved under (E:\McNair\Projects\Accelerators\Data) in the form [Accelerator Name].cohort.txt
3) All new accelerators and corresponding cohorts were added to the Cleaned Cohort Data.xls spreadsheet in a new sheet called "Veeral - Updated"
4) Crawled through the Global Accelerator Network (GAN) site to obtain all of the GAN data. The parser, input, and output are located in (E:\McNair\Projects\Accelerators\GAN_Data)
5) Used the Crunchbase "Organizations" data and Whois parser to put together a comprehensive text file with all of our current accelerators and information on them (like URL, Location, Creation Date) located in (E:\McNair\Projects\Accelerators\Veeral\Accelerator_Data)
6) Matched existing SDC Platinum VC funding data (located in E:\McNair\Projects\Accelerators\VC Data) with Updated Cohort Data using the Matcher to obtain the Updated AccCo_VC matched file.
7) Copied the Updated AccCo_VC matched file and the Updated Cohort data textfile into the Z:\Accelerators database location.
NEXT STEPS
1) Calculate the Percent VC funding rates for newly updated accelerator cohort data.
2) Find a way to obtain more variables for the current list of accelerators.
- POTENTIAL VARIABLES WE WANT:
- Company Type (e.g. Corporate, University, etc)
- Industry (e.g. Health, High-Tech, Food, etc)
- Equity
- Cohort size
- Seed Capital
- Employees
- ANY MORE YOU CAN FIND THAT MAY BE STATISTICALLY SIGNIFICANT
3) WRITE PAPERS
All New Files and what they Contain
Accelerator Data
(Located in E:\McNair\Projects\Accelerators\Veeral)
Cleaned Cohort Data (Excel) - The sheet named "Veeral - Updated" has the most up to date Accelerator Cohort data. All other sheets are old data.
Organizations (Access) - Contains Crunchbase 2013 Snapshot Data used to extract more accelerators that are now all in the Cleaned Cohort Data.
Updated Cohort Data (TXT) - Most up to date Accelerator Cohort data.
Accelerator Data (TXT) - list of all Accelerators in Updated Cohort Data and other collected Accelerator characteristics. We have the cohort txt files (Located in Data folder; called "Accelerator Name".cohort) for every Accelerator in this list.
SQL Data for acquiring VC funding rates
- (Located in Z:\Accelerators)
- (Instructions for using SQL are located in E:\McNair\Projects\Accelerators\SQL_Data under "accelerator sql V")
- (Database is called "Accelerators")
Updated_AccCo_VC (TXT) - newer version of AccCo_VC
Updated_Cohort_Data (TXT) - newer version of Cohort_Data
GAN Data (Located in E:\McNair\Projects\Accelerators\GAN Data)
Completing Master List of Accelerators (Process)
(Note: all files are found and stored under E:\McNair\Projects\Accelerators)
Transfer all of the organizations data into Access
- Done - Organizations.accdb
Use Access keyword queries with the short descriptions of each organization to accumulate a list of Potential Accelerators from Organizations data
- Companies with at least 2 keywords from [accel, startup, mentor, seed, program, week, pitch, found, stage, incubat]
- Companies with location_country_code = USA
- 381 Potential Accelerators (these are not exclusively accelerators: at first glance, some VC firms and startup firms snuck into the list). The plan is to match this list against the list of accelerators, then eliminate the unmatched entries that turn out not to be accelerators.
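The keyword rule above was run as Access queries. Purely as an illustration, the same filter could be sketched in Python against organizations.csv. The column name short_description is an assumption (the notes only say "short descriptions"); location_country_code is the field named above. The sample rows are made up.

```python
import csv
import io

# Keyword stems copied from the list above.
KEYWORDS = ["accel", "startup", "mentor", "seed", "program",
            "week", "pitch", "found", "stage", "incubat"]


def keyword_hits(description):
    """Count how many distinct keyword stems occur in the description."""
    text = description.lower()
    return sum(1 for kw in KEYWORDS if kw in text)


def potential_accelerators(rows, min_hits=2, country="USA"):
    """Keep rows in the given country with at least min_hits keyword stems.

    'short_description' is an assumed column name.
    """
    return [
        row for row in rows
        if row.get("location_country_code") == country
        and keyword_hits(row.get("short_description") or "") >= min_hits
    ]


# Tiny made-up sample standing in for organizations.csv.
SAMPLE = """name,short_description,location_country_code
Alpha Labs,A 12-week mentor-driven startup program,USA
Beta Capital,Venture capital firm investing in growth companies,USA
Gamma Hub,Seed accelerator for early stage founders,CAN
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
matches = potential_accelerators(rows)
print([row["name"] for row in matches])
```

Substring matching on stems is loose by design ("found" matches both "founder" and "foundation"), so VC firms and ordinary startups can still slip through, as noted above.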
Match Potential Accelerators with Cleaned Cohort Data using The Matcher (Tool).
1. List of current accelerators obtained from Cleaned Cohort Data is in Organizations.accdb under the query, "List of Accelerators". The 381 Potential Accelerators are under the "Potential Accelerators" Query.
2. Matched the Cleaned Cohort Data accelerator list with the potential accelerators obtained from the 2013 Crunchbase snapshot. After matching, there were 329 potential accelerators.
3. Manually went through the 329 potential accelerators using Google searches and came up with 101 new accelerators - Can be found at ____________ (TBD)
4. Finding all of the cohorts of each new accelerator.
- Organized each cohort so the Name is in the first column and Description is in the second column.
- Saved each cohort txt file under the format "..Accelerator Name..".cohort - for example, the cohorts of Velocity Accelerator would be saved under "Velocity Accelerator.cohort"
5. I am now going to add the new accelerators to our existing list and cross check our new, updated list of accelerators with all of the sources of accelerators that we've gone through so far plus the new 2017 Crunchbase data.
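For the file convention in step 4 (Name in the first column, Description in the second, one file per accelerator), a minimal Python sketch could look like this. The tab delimiter is an assumption, since the notes only say "columns", and the function name and sample company are made up.

```python
import csv
import os
import tempfile


def write_cohort_file(directory, accelerator_name, companies):
    """Write (name, description) rows to '<accelerator_name>.cohort'.

    Assumes tab-separated columns; returns the path that was written.
    """
    path = os.path.join(directory, accelerator_name + ".cohort")
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(companies)
    return path


# Demo in a temporary folder with a made-up cohort company.
demo_dir = tempfile.mkdtemp()
demo_path = write_cohort_file(
    demo_dir, "Velocity Accelerator", [("ExampleCo", "Makes widgets")]
)
print(os.path.basename(demo_path))
```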
Updated Cleaned Cohort List
Using the 70 or so new accelerators obtained from the Crunchbase snapshot, I ran Peter's "parse_cohort_data" script (located in E:\McNair\Projects\Accelerators\Code+Final_Data) on the new accelerator cohort files, which are all in the New Crunchbase Accelerator Cohorts folder in Data (E:\McNair\Projects\Accelerators\Data\New Crunchbase Accelerator Cohorts)
RESULTS
New AccCO_VC Match file - (E:\McNair\Projects\Accelerators\Veeral\Updated AccCo_VC)
COMPLETED MASTER LIST - (E:\McNair\Projects\Accelerators\Veeral\Accelerator_Data)
Global Accelerator Network Parser Spec
HTML File - E:\McNair\Projects\Accelerators\GAN_data.txt
An entry:
<div class="member_entry clear">
...
</div>
Within an entry:
Logo:
<header class="member">
  <div class="logo">
    <a href="http://gan.co/members/view/desai-accelerator"><img alt="123_large" src="./GAN_files/123_large.png"></a>
  </div>
Name:
<header class="member">
  <h2 class="name">
    <a href="http://gan.co/members/view/desai-accelerator">Desai Accelerator</a>
  </h2>
Location:
<header class="member">
  <h3 class="location">
    Ann Arbor, MI, USA
  </h3>
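To show how the name and location fields in this spec could be pulled out, here is a minimal sketch using Python's standard html.parser module. It is not the project's actual GAN parser. It assumes every entry follows the sample markup above (a div with class member_entry containing an h2 with class name and an h3 with class location); the class and variable names in the code are made up.

```python
from html.parser import HTMLParser


class MemberParser(HTMLParser):
    """Collect the name and location of each member_entry block."""

    def __init__(self):
        super().__init__()
        self.members = []   # one dict per entry
        self._field = None  # field whose text is currently being read

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "member_entry" in classes:
            self.members.append({"name": "", "location": ""})
        elif tag == "h2" and "name" in classes:
            self._field = "name"
        elif tag == "h3" and "location" in classes:
            self._field = "location"

    def handle_endtag(self, tag):
        if tag in ("h2", "h3"):
            self._field = None

    def handle_data(self, data):
        if self._field and self.members:
            member = self.members[-1]
            member[self._field] = (member[self._field] + " " + data.strip()).strip()


SAMPLE_ENTRY = """
<div class="member_entry clear">
  <header class="member">
    <h2 class="name">
      <a href="http://gan.co/members/view/desai-accelerator">Desai Accelerator</a>
    </h2>
    <h3 class="location">
      Ann Arbor, MI, USA
    </h3>
  </header>
</div>
"""

member_parser = MemberParser()
member_parser.feed(SAMPLE_ENTRY)
print(member_parser.members)
```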
For Statistics on Companies:
We want stats for -- section class = "companies", "companies_funded", "companies_funded_raised", "funding_raised", "exits", "exit_funding", "employees", "mentors", "years"
<section class="stats clear">
  <ul class="single_stats clear">
    <a href="http://gan.co/members/standalone_filter?label=total_companies&span=20,-1" class="companies">
      <span class="icon hide_text">GAN Compass</span>
      <strong class="number">Under 20</strong>
      <em class="caption">
        Graduated Companies
      </em>
    </a>
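A similar sketch for the statistics, again with the standard html.parser module and not the project's actual parser. In the sample above the stat class ("companies") sits on the a element and the value sits in a strong element with class number, so the sketch assumes that layout holds for the other stat classes too. The class and variable names in the code are made up.

```python
from html.parser import HTMLParser

# Stat classes listed in the spec above.
STAT_CLASSES = {"companies", "companies_funded", "companies_funded_raised",
                "funding_raised", "exits", "exit_funding",
                "employees", "mentors", "years"}


class StatsParser(HTMLParser):
    """Map each stat class to the text of its <strong class="number">."""

    def __init__(self):
        super().__init__()
        self.stats = {}
        self._stat = None        # stat class of the <a> currently open
        self._in_number = False  # True while inside <strong class="number">

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a":
            found = STAT_CLASSES.intersection(classes)
            self._stat = found.pop() if found else None
        elif tag == "strong" and "number" in classes and self._stat:
            self._in_number = True

    def handle_endtag(self, tag):
        if tag == "strong":
            self._in_number = False
        elif tag == "a":
            self._stat = None

    def handle_data(self, data):
        if self._in_number and self._stat:
            self.stats[self._stat] = data.strip()


SAMPLE_STATS = """
<section class="stats clear">
  <ul class="single_stats clear">
    <a href="http://gan.co/members/standalone_filter?label=total_companies&span=20,-1" class="companies">
      <span class="icon hide_text">GAN Compass</span>
      <strong class="number">Under 20</strong>
      <em class="caption">Graduated Companies</em>
    </a>
  </ul>
</section>
"""

stats_parser = StatsParser()
stats_parser.feed(SAMPLE_STATS)
print(stats_parser.stats)
```

The values are kept as text ("Under 20") because the site reports ranges, not exact counts.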
For Terms of Companies
We want the terms for equity stake and seed capital (e.g. "0% equity stake" for "$25k seed capital")
<div class="terms_holder clear">
  <section class="terms">
    <h4>Terms</h4>
    <p>
      <a href="http://gan.co/members/standalone_filter?label=terms_equity&span=4,-1">0% equity stake</a>
      for
      <a href="http://gan.co/members/standalone_filter?label=terms_seed&span=26,20">$25k seed capital</a>
    </p>
  </section>
</div>
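For the terms, one possible approach is to strip the tags and read the two numbers with a regular expression. The sketch below assumes every entry uses the same wording as the sample ("N% equity stake ... $Nk seed capital"); entries phrased differently would return None. The function and field names are made up, and this is not the project's actual parser.

```python
import re

# Matches e.g. "0% equity stake for $25k seed capital" after tags are removed.
TERMS_RE = re.compile(
    r"(?P<equity>\d+(?:\.\d+)?)%\s*equity stake"
    r".*?\$(?P<seed>\d+(?:\.\d+)?)k\s*seed capital",
    re.DOTALL,
)
TAG_RE = re.compile(r"<[^>]+>")


def parse_terms(html_fragment):
    """Return equity percent and seed capital in dollars, or None if absent."""
    text = TAG_RE.sub(" ", html_fragment)
    match = TERMS_RE.search(text)
    if match is None:
        return None
    return {
        "equity_pct": float(match.group("equity")),
        "seed_capital_usd": float(match.group("seed")) * 1000,
    }


SAMPLE_TERMS = (
    '<a href="http://gan.co/members/standalone_filter?label=terms_equity&span=4,-1">'
    "0% equity stake</a> for "
    '<a href="http://gan.co/members/standalone_filter?label=terms_seed&span=26,20">'
    "$25k seed capital</a>"
)

terms = parse_terms(SAMPLE_TERMS)
print(terms)
```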
Parser Results
The code and the resulting tab-separated text file are located here:
E:\McNair\Projects\Accelerators\Web Scraping for Accelerators