1. Filter out actual accelerators from the Crunchbase organizations data
*Possibly by running accelerator_keywords.py '''HOW DO YOU RUN THIS?'''
*Possibly by using string searching in organizations.csv '''SHOULD I ADD MORE FILTERS?'''
*Watch out for venture capital companies (the organizations file has many of these, and we'll probably pick up a lot of them in our "accelerator" filtered list)
2. Match this list against the current list of accelerators
*We have our own copy of the matcher in the accelerators E drive (try mode 1 and mode 2 for different results; mode 2 might be more helpful) '''CAN'T SEEM TO FIND DIFFERENT MODES - ALSO HOW DO YOU USE?'''
*This will tell you whether it was part of the old list or not (and therefore whether we need to get data for it or not)
3. Find cohort data for all of the new accelerators (the ones not previously on the list; if they turn out not to be accelerators, take them off the list)
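The string-searching idea in step 1 could be sketched roughly as follows. This is only an illustration, not the contents of accelerator_keywords.py: the keyword lists and the column names (<code>name</code>, <code>short_description</code>) are assumptions and would need to be checked against the real organizations.csv.

```python
import csv
import io

# Hypothetical keyword lists; tune these against the real data.
ACCELERATOR_KEYWORDS = ("accelerator", "incubator", "seed program")
VC_KEYWORDS = ("venture capital", "venture partners", "private equity")


def looks_like_accelerator(name, description):
    """Return True if the text mentions an accelerator keyword and no VC keyword."""
    text = f"{name} {description}".lower()
    if any(keyword in text for keyword in VC_KEYWORDS):
        return False
    return any(keyword in text for keyword in ACCELERATOR_KEYWORDS)


def filter_accelerators(csv_file, name_col="name", desc_col="short_description"):
    """Yield rows from an open CSV file that look like accelerators.

    The column names are assumptions, not the confirmed Crunchbase schema.
    """
    for row in csv.DictReader(csv_file):
        if looks_like_accelerator(row.get(name_col) or "", row.get(desc_col) or ""):
            yield row


if __name__ == "__main__":
    # Tiny made-up sample standing in for organizations.csv.
    sample = io.StringIO(
        "name,short_description\n"
        "Alpha Labs,A startup accelerator for hardware founders\n"
        "Beta Capital,Venture capital firm backing accelerator graduates\n"
        "Gamma Foods,Organic snack company\n"
    )
    for row in filter_accelerators(sample):
        print(row["name"])
```

Dropping every row that mentions a VC keyword will also drop real accelerators that describe themselves that way, so the filtered list still needs a manual pass.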
='''Veeral's Summer Work'''=
==WHAT I'VE DONE==
'''1)''' Used the 2013 Crunchbase Snapshot information to find more accelerators using keyword matching and manual researching/googling. Ended up with ~70 new accelerators which were all added to the current list

'''2)''' Cohorts were manually obtained for each new accelerator and saved under (E:\McNair\Projects\Accelerators\Data) in the form [Accelerator Name].cohort.txt

'''3)''' All new accelerators and corresponding cohorts were added to Cleaned Cohort Data.xls spreadsheet in a new sheet called "Veeral - Updated"

'''4)''' Crawled through the Global Accelerator Network (GAN) site to obtain all of the GAN data. The parser, input, and output is located in (E:\McNair\Projects\Accelerators\GAN_Data)

'''5)''' Used the Crunchbase "Organizations" data and Whois parser to put together a comprehensive Textfile with all of our current accelerators and information on them (like URL, Location, Creation Date) located in (E:\McNair\Projects\Accelerators\Veeral\Accelerator_Data)

'''6)''' Matched existing SDC Platinum VC funding data (located in E:\McNair\Projects\Accelerators\VC Data) with Updated Cohort Data using the Matcher to obtain the Updated AccCo_VC matched file.

'''7)''' Copied the Updated AccCo_VC matched file and the Updated Cohort data textfile into the Z:\Accelerators database location.
==NEXT STEPS==
'''1)''' Calculate the Percent VC funding rates for newly updated accelerator cohort data.

'''2)''' Find a way to obtain more variables for the current list of accelerators.
*POTENTIAL VARIABLES WE WANT:
**Company Type (i.e. Corporate, University, etc)
**Industry (i.e. Health, High-Tech, Food, etc)
**Equity
**Cohort size
**Seed Capital
**Employees
**ANY MORE YOU CAN FIND THAT MAY BE STATISTICALLY SIGNIFICANT

'''3)''' WRITE PAPERS
==All New Files and what they Contain==
'''Accelerator Data''' (Located in E:\McNair\Projects\Accelerators\Veeral)

Cleaned Cohort Data (Excel) - The sheet named "Veeral - Updated" has the most up to date Accelerator Cohort data. All other sheets are old data.

Organizations (Access) - Contains Crunchbase 2013 Snapshot Data used to extract more accelerators that are now all in the Cleaned Cohort Data.

Updated Cohort Data (TXT) - Most up to date Accelerator Cohort data.

Accelerator Data (TXT) - list of all Accelerators in Updated Cohort Data and other collected Accelerator characteristics. We have the cohort txt files (Located in Data folder; called "Accelerator Name".cohort) for every Accelerator in this list.

'''SQL Data for acquiring VC funding rates'''
*(Located in Z:\Accelerators)
*(Instructions for using SQL are located in E:\McNair\Projects\Accelerators\SQL_Data under "accelerator sql V")
*(Database is called "Accelerators")

Updated_AccCo_VC (TXT) - newer version of AccCo_VC

Updated_Cohort_Data (TXT) - newer version of Cohort_Data

'''GAN Data''' (Located in E:\McNair\Projects\Accelerators\GAN Data)
==Completing Master List of Accelerators (Process)==
(Note: all files are found and stored under E:\McNair\Projects\Accelerators)
===Match Potential Accelerators with Cleaned Cohort Data using [[The Matcher (Tool)]].===
'''1.''' List of current accelerators obtained from Cleaned Cohort Data is in Organizations.accdb under the query "List of Accelerators". The 381 Potential Accelerators are under the "Potential Accelerators" query.

'''2.''' Matched the Cleaned Cohort Data accelerator list with the potential accelerators obtained from the 2013 Crunchbase snapshot. There were 329 potential accelerators.

'''3.''' Manually went through the 329 potential accelerators by Google searching and came up with 101 new accelerators.
*Can be found at ____________ (TBD)
'''4.''' Finding all of the cohorts of each new accelerator.
**Organized each cohort so the Name is in the first column and Description is in the second column.
**Saved each cohort txt file under the format "..Cohort Name..".cohort - for example, the cohorts of Velocity Accelerator would be saved under "Velocity Accelerator.cohort"

'''5.''' I am now going to add the new accelerators to our existing list and cross check our new, updated list of accelerators with all of the sources of accelerators that we've gone through so far plus the new 2017 Crunchbase data.
**This will include the following:
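The cohort file layout from step 4 (Name in the first column, Description in the second, one file per accelerator) can be read back with a short script. This is a minimal sketch and not Peter's "parse_cohort_data" script: it assumes the columns are tab-separated and the files are UTF-8, neither of which is stated above, and the function names are made up for illustration.

```python
from pathlib import Path


def read_cohort_file(path):
    """Return (accelerator, company, description) tuples from one cohort file.

    Assumes tab-separated columns; the real delimiter may differ.
    """
    path = Path(path)
    # "Velocity Accelerator.cohort" or "Velocity Accelerator.cohort.txt"
    # both give the accelerator name "Velocity Accelerator".
    accelerator = path.name.split(".cohort")[0]
    records = []
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            name = fields[0].strip()
            if not name:
                continue  # skip blank lines
            description = fields[1].strip() if len(fields) > 1 else ""
            records.append((accelerator, name, description))
    return records


def read_cohort_folder(folder):
    """Read every cohort file in a folder into one list of records."""
    records = []
    for path in sorted(Path(folder).glob("*.cohort*")):
        records.extend(read_cohort_file(path))
    return records
```

Calling <code>read_cohort_folder</code> on the folder that holds the cohort files gives one flat list that can be written out or compared against the existing cohort data.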
===Updated Cleaned Cohort List===
Using the 70 or so new accelerators obtained from the Crunchbase snapshot, I ran Peter's "parse_cohort_data" script located in E:\McNair\Projects\Accelerators\Code+Final_Data on the new accelerator cohort files, all in the New Crunchbase Accelerator Cohorts Folder in Data (E:\McNair\Projects\Accelerators\Data\New Crunchbase Accelerator Cohorts)
*Discovered the Global Accelerator Network - downloaded all of the HTML and examined it to find out how we can parse the website.

'''RESULTS'''

New AccCo_VC Match file - (E:\McNair\Projects\Accelerators\Veeral\Updated AccCo_VC)
=COMPLETED MASTER LIST - (E:\McNair\Projects\Accelerators\Veeral\Accelerator_Data)=
==Global Accelerator Network Parser Spec==
HTML File - E:\McNair\Projects\Accelerators\GAN_data.txt
</div>
</nowiki>
 
==Parser Results==
The code and the resulting tab-separated text file are located here:
E:\McNair\Projects\Accelerators\Web Scraping for Accelerators