1. Filter out actual accelerators from the Crunchbase organizations data
*Possibly by running accelerator_keywords.py '''HOW DO YOU RUN THIS?'''
*Possibly by using string searching in organizations.csv '''SHOULD I ADD MORE FILTERS?'''
*Watch out for venture capital companies (the organizations file has many of these, and we'll probably pick up a lot of them in our "accelerator" filtered list)
2. Match this list against the current list of accelerators
*We have our own copy of the matcher in the accelerators E drive (try mode 1 and mode 2 for different results, mode 2 might be more helpful) '''CAN'T SEEM TO FIND DIFFERENT MODES - ALSO HOW DO YOU USE?'''
*This will tell you whether it was part of the old list or not (and therefore whether we need to get data for it or not)
3. Find cohort data for all of the new accelerators (ones not previously on the list & if they're not accelerators take them off the list)
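
As a rough illustration of steps 1 and 2, the sketch below filters organization rows by keyword, sets aside likely venture capital firms for manual review, and drops names already on the current list. It is not the actual accelerator_keywords.py or the Matcher. The keyword lists, the column names (name, short_description) and the sample rows are all assumptions made for the example.

```python
import csv
import io

# Hypothetical keyword lists; the real accelerator_keywords.py may use others.
ACCELERATOR_KEYWORDS = ("accelerator", "incubator", "bootcamp")
VC_KEYWORDS = ("venture capital", "ventures", "capital partners")


def normalize(name):
    """Lowercase and collapse whitespace so names compare loosely."""
    return " ".join(name.lower().split())


def filter_candidates(rows, known_names):
    """Split keyword hits into new candidates and likely VC firms."""
    known = {normalize(n) for n in known_names}
    new_candidates, possible_vc = [], []
    for row in rows:
        name = row.get("name", "")
        text = normalize(name + " " + row.get("short_description", ""))
        if not any(k in text for k in ACCELERATOR_KEYWORDS):
            continue  # no accelerator keyword at all
        if normalize(name) in known:
            continue  # already on the current list, no new data needed
        if any(k in text for k in VC_KEYWORDS):
            possible_vc.append(name)  # flag for manual review
        else:
            new_candidates.append(name)
    return new_candidates, possible_vc


# Made-up rows standing in for organizations.csv (column names are assumed).
sample_csv = """name,short_description
Velocity Accelerator,Startup accelerator program
Acme Ventures,Venture capital firm with an accelerator arm
Desai Accelerator,University accelerator
Widget Co,Makes widgets
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))
new_candidates, possible_vc = filter_candidates(rows, ["Desai  Accelerator"])
print(new_candidates)
print(possible_vc)
```

Keeping the venture capital hits in a separate list, rather than discarding them, leaves the final call to the manual check described in step 3.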
='''Veeral's Summer Work'''=
==WHAT I'VE DONE==
'''1)''' Used the 2013 Crunchbase Snapshot information to find more accelerators using keyword matching and manual research/googling. Ended up with ~70 new accelerators, which were all added to the current list.

'''2)''' Cohorts were manually obtained for each new accelerator and saved under (E:\McNair\Projects\Accelerators\Data) in the form [Accelerator Name].cohort.txt

'''3)''' All new accelerators and corresponding cohorts were added to the Cleaned Cohort Data.xls spreadsheet in a new sheet called "Veeral - Updated"

'''4)''' Crawled through the Global Accelerator Network (GAN) site to obtain all of the GAN data. The parser, input, and output are located in (E:\McNair\Projects\Accelerators\GAN_Data)

'''5)''' Used the Crunchbase "Organizations" data and Whois parser to put together a comprehensive text file with all of our current accelerators and information on them (like URL, Location, Creation Date), located in (E:\McNair\Projects\Accelerators\Veeral\Accelerator_Data)

'''6)''' Matched existing SDC Platinum VC funding data (located in E:\McNair\Projects\Accelerators\VC Data) with the Updated Cohort Data using the Matcher to obtain the Updated AccCo_VC matched file.

'''7)''' Copied the Updated AccCo_VC matched file and the Updated Cohort Data text file into the Z:\Accelerators database location.

==NEXT STEPS==
'''1)''' Calculate the percent VC funding rates for the newly updated accelerator cohort data.

'''2)''' Find a way to obtain more variables for the current list of accelerators.
*POTENTIAL VARIABLES WE WANT:
**Company Type (i.e. Corporate, University, etc)
**Industry (i.e. Health, High-Tech, Food, etc)
**Equity
**Cohort size
**Seed Capital
**Employees
**ANY MORE YOU CAN FIND THAT MAY BE STATISTICALLY SIGNIFICANT

'''3)''' WRITE PAPERS

==All New Files and what they Contain==
'''Accelerator Data''' (Located in E:\McNair\Projects\Accelerators\Veeral)

Cleaned Cohort Data (Excel) - The sheet named "Veeral - Updated" has the most up to date Accelerator Cohort data. All other sheets are old data.

Organizations (Access) - Contains Crunchbase 2013 Snapshot Data used to extract more accelerators that are now all in the Cleaned Cohort Data.

Updated Cohort Data (TXT) - Most up to date Accelerator Cohort data.

Accelerator Data (TXT) - List of all Accelerators in Updated Cohort Data and other collected Accelerator characteristics. We have the cohort txt files (located in the Data folder; called "Accelerator Name".cohort) for every Accelerator in this list.

'''SQL Data for acquiring VC funding rates'''
*(Located in Z:\Accelerators)
*(Instructions for using SQL are located in E:\McNair\Projects\Accelerators\SQL_Data under "accelerator sql V")
*(Database is called "Accelerators")

Updated_AccCo_VC (TXT) - newer version of AccCo_VC

Updated_Cohort_Data (TXT) - newer version of Cohort_Data

'''GAN Data''' (Located in E:\McNair\Projects\Accelerators\GAN Data)

==Completing Master List of Accelerators (Process)==
(Note: all files are found and stored under E:\McNair\Projects\Accelerators)
===Match Potential Accelerators with Cleaned Cohort Data using [[The Matcher (Tool)]].===
'''1.''' The list of current accelerators obtained from Cleaned Cohort Data is in Organizations.accdb under the query "List of Accelerators". The 381 potential accelerators are under the "Potential Accelerators" query.

'''2.''' Matched the Cleaned Cohort Data accelerator list with the potential accelerators obtained from the 2013 Crunchbase snapshot. There were 329 potential accelerators.

'''3.''' Manually went through the 329 potential accelerators by Google searching and came up with 101 new accelerators - Can be found at ____________ (TBD)

'''4.''' Found all of the cohorts of each new accelerator.
*Organized each cohort so the Name is in the first column and the Description is in the second column.
*Saved each cohort txt file under the format "[Accelerator Name].cohort" - for example, the cohorts of Velocity Accelerator would be saved under "Velocity Accelerator.cohort"
*Discovered the Global Accelerator Network - downloaded all of the HTML and examined it to find out how we can parse the website.

'''5.''' I am now going to add new accelerators to our existing list and cross-check our new, updated list of accelerators with all sources of accelerators that we've gone through so far plus the new 2017 Crunchbase data.
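
The cohort file layout described in step 4 (Name in the first column, Description in the second, one file per accelerator) could be produced along the lines of the sketch below. This is only an illustration, not the script that was used: the tab separator is an assumption based on the tab-separated output mentioned elsewhere on this page, and the helper names and the sample company are hypothetical.

```python
import os
import tempfile


def cohort_filename(accelerator_name):
    """Build the file name, e.g. 'Velocity Accelerator.cohort'."""
    return accelerator_name.strip() + ".cohort"


def write_cohort(directory, accelerator_name, companies):
    """Write (name, description) pairs as two tab-separated columns."""
    path = os.path.join(directory, cohort_filename(accelerator_name))
    with open(path, "w", encoding="utf-8", newline="") as handle:
        for name, description in companies:
            # Stray tabs or newlines inside a field would break the
            # two-column layout, so collapse all whitespace runs.
            clean = [" ".join(field.split()) for field in (name, description)]
            handle.write("\t".join(clean) + "\n")
    return path


def read_cohort(path):
    """Read a cohort file back into (name, description) tuples."""
    with open(path, encoding="utf-8") as handle:
        return [tuple(line.rstrip("\n").split("\t", 1))
                for line in handle if line.strip()]


# Hypothetical company used only to exercise the helpers.
with tempfile.TemporaryDirectory() as tmp:
    path = write_cohort(
        tmp, "Velocity Accelerator",
        [("Example Co", "Hypothetical   company\tdescription")])
    saved_name = os.path.basename(path)
    rows_back = read_cohort(path)
print(saved_name, rows_back)
```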
===Necessities to Parse Updated Cleaned Cohort List===
Using the 70 or so new accelerators obtained from the Crunchbase snapshot, I ran Peter's "parse_cohort_data" script located in E:\McNair\Projects\Accelerators\Code+Final_Data on the new accelerator cohort files, all in the New Crunchbase Accelerator Cohorts folder in Data (E:\McNair\Projects\Accelerators\Data\New Crunchbase Accelerator Cohorts)

'''RESULTS'''

New AccCo_VC Match file - (E:\McNair\Projects\Accelerators\Veeral\Updated AccCo_VC)

COMPLETED MASTER LIST - (E:\McNair\Projects\Accelerators\Veeral\Accelerator_Data)

==Global Accelerator Network HTML Parser Spec==
HTML File - E:\McNair\Projects\Accelerators\GAN_data.txt

'''An entry:'''
<nowiki>
<div class="member_entry clear">
...
</div>
</nowiki>

'''Within an entry:'''

'''Logo:''' <nowiki> <header class="member"> <div class="logo"> <a href="http://gan.co/members/view/desai-accelerator"><img alt="123_large" src="./GAN_files/123_large.png"></a> </div> </nowiki>

'''Name:''' <nowiki> <header class="member"> <h2 class="name"> <a href="http://gan.co/members/view/desai-accelerator">Desai Accelerator</a> </h2> </nowiki>

'''Location:''' <nowiki> <header class="member"> <h3 class="location"> Ann Arbor, MI, USA </h3> </nowiki>
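
One way to pull the name, profile link and location out of an entry is sketched below with Python's standard html.parser module. It is not the parser stored in the GAN_Data folder. It assumes that the logo, name and location all sit inside a single header with class "member", which the excerpts above suggest but do not confirm, and the MemberParser class name is made up for the example.

```python
from html.parser import HTMLParser


class MemberParser(HTMLParser):
    """Collect name, profile URL and location for each member header."""

    def __init__(self):
        super().__init__()
        self.members = []
        self._field = None  # "name" or "location" while inside that heading

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "header" and "member" in classes:
            self.members.append({"name": "", "url": "", "location": ""})
        elif tag == "h2" and "name" in classes:
            self._field = "name"
        elif tag == "h3" and "location" in classes:
            self._field = "location"
        elif tag == "a" and self._field == "name" and self.members:
            # Only the link inside the name heading is kept; the logo
            # link is skipped because no field is active at that point.
            self.members[-1]["url"] = attrs.get("href", "")

    def handle_endtag(self, tag):
        if tag in ("h2", "h3"):
            self._field = None

    def handle_data(self, data):
        if self._field and self.members:
            self.members[-1][self._field] += data


# Sample entry assembled from the excerpts above (layout is assumed).
sample = '''
<div class="member_entry clear">
<header class="member">
<div class="logo"><a href="http://gan.co/members/view/desai-accelerator"><img alt="123_large" src="./GAN_files/123_large.png"></a></div>
<h2 class="name"> <a href="http://gan.co/members/view/desai-accelerator">Desai Accelerator</a> </h2>
<h3 class="location"> Ann Arbor, MI, USA </h3>
</header>
</div>
'''
parser = MemberParser()
parser.feed(sample)
# Collapse the whitespace that surrounds the text inside the headings.
members = [{key: " ".join(value.split()) for key, value in m.items()}
           for m in parser.members]
print(members)
```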
====For Statistics on Companies:====
<nowiki>
<section class="stats clear">
<ul class="single_stats clear">
<a href="http://gan.co/members/standalone_filter?label=total_companies&amp;span=20,-1" class="companies">
<span class="icon hide_text">GAN Compass</span>
<strong class="number">Under 20</strong>
<em class="caption">
Graduated Companies
</em>
</a>
</nowiki>
We want stats for -- class = "companies", "companies_funded", "companies_funded_raised", "funding_raised", "exits", "exit_funding", "employees", "mentors", "years"
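
The statistics could be collected with the same standard-library approach, as in the sketch below. It assumes each statistic is a link whose class is one of the names listed above and whose value sits in a strong tag with class "number", as in the excerpt. The StatsParser name is hypothetical, and the sample only covers the "companies" statistic shown in the source.

```python
from html.parser import HTMLParser

STAT_CLASSES = ("companies", "companies_funded", "companies_funded_raised",
                "funding_raised", "exits", "exit_funding", "employees",
                "mentors", "years")


class StatsParser(HTMLParser):
    """Map each statistic's link class to the text of its number tag."""

    def __init__(self):
        super().__init__()
        self.stats = {}
        self._stat = None      # class of the statistic link we are inside
        self._capture = False  # True while inside <strong class="number">

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a":
            matches = [c for c in classes if c in STAT_CLASSES]
            self._stat = matches[0] if matches else None
        elif tag == "strong" and "number" in classes and self._stat:
            self._capture = True
            self.stats[self._stat] = ""

    def handle_endtag(self, tag):
        if tag == "strong":
            self._capture = False
        elif tag == "a":
            self._stat = None

    def handle_data(self, data):
        if self._capture:
            self.stats[self._stat] += data


sample = '''
<section class="stats clear">
<ul class="single_stats clear">
<a href="http://gan.co/members/standalone_filter?label=total_companies&amp;span=20,-1" class="companies">
<span class="icon hide_text">GAN Compass</span>
<strong class="number">Under 20</strong>
<em class="caption">Graduated Companies</em>
</a>
</ul>
</section>
'''
parser = StatsParser()
parser.feed(sample)
stats = {key: value.strip() for key, value in parser.stats.items()}
# One tab-separated row, with an empty cell for any statistic not found.
row = "\t".join(stats.get(name, "") for name in STAT_CLASSES)
print(stats)
```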
====For Terms of Companies====
'''We want terms for equity stake and seed capital (for example, $25k)'''

<nowiki> <div class="terms_holder clear"> <section class="terms"> <h4>Terms</h4> <p> <a href="http://gan.co/members/standalone_filter?label=terms_equity&amp;span=4,-1">0% equity stake</a> for <a href="http://gan.co/members/standalone_filter?label=terms_seed&amp;span=26,20">$25k seed capital</a> </p> </section> </div> </nowiki>
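
For the terms, the link targets in the excerpt carry a label parameter (terms_equity, terms_seed) that can be used as a key. The sketch below uses a regular expression rather than a full HTML parser, which is a simplification that only holds while the markup stays as regular as the excerpt. The parse_terms name and the output keys are hypothetical.

```python
import re

# Capture the label=terms_... value from the href and the link text.
TERM_LINK = re.compile(
    r'<a\s+href="[^"]*label=(terms_\w+)[^"]*">\s*([^<]*?)\s*</a>')


def parse_terms(html):
    """Return the equity and seed capital terms found in an entry."""
    terms = {label: text for label, text in TERM_LINK.findall(html)}
    return {"equity": terms.get("terms_equity", ""),
            "seed": terms.get("terms_seed", "")}


sample = '''
<div class="terms_holder clear"> <section class="terms"> <h4>Terms</h4>
<p> <a href="http://gan.co/members/standalone_filter?label=terms_equity&amp;span=4,-1">0% equity stake</a>
for <a href="http://gan.co/members/standalone_filter?label=terms_seed&amp;span=26,20">$25k seed capital</a> </p>
</section> </div>
'''
terms = parse_terms(sample)
print(terms)
```

Entries without a terms block simply come back with empty strings, so they can still be written as a row in the tab-separated output.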
==Parser Results==
The code and the resulting tab-separated text file are located here: E:\McNair\Projects\Accelerators\Web Scraping Accelerators