Determinants of Seed Accelerator Performance: The Horse, the Jockey, and the Racetrack
Academic Paper | |
---|---|
Title | Determinants of Seed Accelerator Performance: The Horse, the Jockey, and the Racetrack |
Author | Ed Egan, Yael Hochberg |
Status | In development |
© edegan.com, 2016 |
Current Work
Note that TFTRTA-AcceleratorFinal.txt in E:\projects\accelerators was updated to include all creation dates and dead dates. This is not reflected below, except that the load SQL in the script was updated too.
Load the existing data
Dbase is accelerators
SQL code is in:
E:\projects\accelerators\LoadAcceleratorTables.sql
This script:
Loads files from The File to Rule Them All.xlsx
- AcceleratorsFinal 165
- cohortsfinal 12941
- FoundersMain 187
- FoundersExperience 823
- FoundersEducation 353
Loads 5 timing info files
- Timing1 'Formatted Timing Info.txt' 1167
- Timing2 'merging_work.txt' 257
- Timing3 'additional_timing_info2-fixed.txt' 1521
- Timing4 'SmallBatchTimingInfo.txt' 169
- Timing5 'TurkData2ndPush-FormattedTimingWHeaderClean.txt' 1538
See Seed Accelerator Data Assembly for more information on these files.
Determine 'conamecommon' and 'conamevariant' for all conames in the timing files and cohortsfinal. Also create an accelerator name lookup file between the timing files and TFTRTA ('AcceleratorFinalTimingUnionAcceleratorName.txt') and load it. Use both to build:
- TimingUnionNamesProper 3592 (Coname, Accelerator pair)
- CohortsFinalCommon 12941
Note that the following 'accelerators' were listed in the timing info but not in TFTRTA:
- KarmaTech
- Make In LA
- Rockstart AI
- Talent Tech Labs
- Ventures Accelerator
- Wake Forest Innovations
- White House Demo Day
- XRC Labs
The timing files were processed and their data was assembled. The stack starts with Attended (12896 obs, 7044 with year and 6493 with year and quarter) and sequentially adds timing information until the last table, Attended5 (15460 obs, 10446 with year and 9871 with year and quarter), is produced. With the exception of timing2, each timing file added new cohort cos. Timing1 and timing5 had evidence URLs (total of just 248 distinct).
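The stacking step described above can be pictured with a minimal Python sketch. This is not the SQL in LoadAcceleratorTables.sql: the record layout, the function name stack_timing, and the precedence rule (existing values are kept, gaps are filled, unseen pairs are added) are all assumptions made for illustration, and the toy data is invented.

```python
from typing import Dict, Iterable, Optional, Tuple

Key = Tuple[str, str]                         # (coname, accelerator) pair
Timing = Tuple[Optional[int], Optional[int]]  # (year, quarter); None = unknown
Row = Tuple[str, str, Optional[int], Optional[int]]


def stack_timing(attended: Dict[Key, Timing], timing_source: Iterable[Row]) -> Dict[Key, Timing]:
    """Return a new table with one timing source layered onto `attended`.

    Assumed (hypothetical) rule: values already present are kept, missing
    values are filled in, and pairs not seen before become new cohort cos.
    """
    result = dict(attended)
    for coname, accelerator, year, quarter in timing_source:
        key = (coname, accelerator)
        old_year, old_quarter = result.get(key, (None, None))
        if old_year is None:
            # No year yet: take whatever the new source offers.
            result[key] = (year, quarter)
        elif old_quarter is None and year == old_year:
            # Years agree: the source may supply the missing quarter.
            result[key] = (old_year, quarter)
        else:
            # Keep what we already have.
            result[key] = (old_year, old_quarter)
    return result


# Toy data, invented purely for illustration.
attended = {("Acme", "AccelA"): (None, None), ("Beta", "AccelA"): (2014, None)}
timing1 = [("Acme", "AccelA", 2015, 2),
           ("Beta", "AccelA", 2014, 3),
           ("Gamma", "AccelB", 2016, None)]

attended2 = stack_timing(attended, timing1)
obs = len(attended2)
with_year = sum(1 for y, _ in attended2.values() if y is not None)
with_year_quarter = sum(1 for y, q in attended2.values() if y is not None and q is not None)
print(obs, with_year, with_year_quarter)
```

Calling stack_timing once per timing file, each time on the previous result, would give a chain analogous to Attended through Attended5, with the three counts reported at each step.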
New Pull
Made tables:
- TheMissing, 129 accs missing total of 4979 cohort cos
- ThePresent, 153 accs with total of 10446 cohort cos
- ThePresentByYear, 601 acc years
- TheReview, 475 acc years -> "TheReview.txt"
TheReview.txt was then processed into SearchTerms.txt in E:\projects\accelerators\Google:
Accelerator SearchTerm Year
After some experimentation, we decided to add the following keywords to every search: demo day graduation pitch competition cohort
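As a rough illustration of how a search string might be assembled from SearchTerms.txt, here is a minimal Python sketch. The keyword string is taken from the text above; the assumption that the file is tab-delimited with a header row, the order of the terms in the query, and the function names are hypothetical and are not taken from DemoDayCrawler.py.

```python
import csv
import io

# Keywords quoted from the text; everything else in this sketch is illustrative.
EXTRA_KEYWORDS = "demo day graduation pitch competition cohort"


def build_query(search_term, year):
    """Combine an accelerator's search term, a year, and the fixed keywords."""
    return " ".join([str(search_term).strip(), str(year).strip(), EXTRA_KEYWORDS])


def queries_from_search_terms(handle):
    """Yield (accelerator, year, query) for each row of a SearchTerms-style file.

    Assumes tab-delimited text with the header: Accelerator, SearchTerm, Year.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        yield row["Accelerator"], row["Year"], build_query(row["SearchTerm"], row["Year"])


# Invented sample row standing in for the real file.
sample = io.StringIO("Accelerator\tSearchTerm\tYear\n"
                     "Example Accel\tExample Accelerator\t2015\n")
queries = list(queries_from_search_terms(sample))
for accelerator, year, query in queries:
    print(accelerator, year, query, sep=" | ")
```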
We fixed up and ran E:\projects\accelerators\Google\DemoDayCrawler.py. This script was based on E:\mcnair\Software\Accelerators\DemoDayCrawler.py, rather than the more recent E:\mcnair\Projects\Accelerator Demo Day\Test Run\STEP1_crawl.py
The output is:
- E:\projects\accelerators\Google\Results.txt 2515
- E:\projects\accelerators\Google\Results folder containing html
Previously run Google search results are in:
- 5 results per accelerator -- E:\mcnair\Software\Accelerators\demoday_crawl_full.txt 2777
- 10 results per accelerator -- E:\mcnair\Projects\Accelerator Demo Day\Test Run\demoday_crawl_full_from_testrun.txt 4351
- 10 results per select accelerator year -- E:\mcnair\Projects\Accelerator Demo Day\Test Run\demoday_crawl_full.txt 1230
These were all copied to Z:\accelerators and cleaned up, and loaded along with the new Results.txt into accelerators. The SQL is in E:\projects\accelerators\LoadAcceleratorTables.sql
It looks like 2340/2514 of our pages are new...
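The count of new pages was produced in SQL, and the exact comparison is not recorded here. One way to arrive at such a figure, assuming pages are compared by URL after light normalization (an assumption, as are the function names and the toy URLs), is sketched below in Python.

```python
from urllib.parse import urlsplit


def normalize_url(url):
    """Crude normalization (an assumption): drop the scheme and fragment,
    lowercase the host, and strip a trailing slash from the path."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    path = parts.path.rstrip("/")
    query = "?" + parts.query if parts.query else ""
    return host + path + query


def count_new(new_urls, previous_urls):
    """Return (number of distinct new pages not seen before, distinct new pages)."""
    previous = {normalize_url(u) for u in previous_urls}
    current = {normalize_url(u) for u in new_urls}
    return len(current - previous), len(current)


# Toy URLs, invented for illustration only.
new_results = ["http://a.example/x/", "https://b.example/y", "https://c.example/z?p=1"]
old_results = ["https://A.example/x"]
fresh, total = count_new(new_results, old_results)
print(f"{fresh}/{total} pages are new")
```

Stricter or looser normalization (for example, keeping the scheme or ignoring query strings) would change the count, so the rule should match whatever the loading SQL actually does.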
Other info
Found the following list of accelerators by accident: https://www.s-b-z.com/FORMING%20THE%20BUSINESS/db/accelerators.aspx
To do
Still to do:
- Re-train the classifier
- Run the classifier on the Google results
- Post the results to Mech Turk
- Process the Mech Turk results
- Match cohort cos to portcos (regenerate GotVC and add timing)
- Match cohort cos to crunchbase again
Previous Work
The main Accelerator Demo Day page was built by Minh Le and documented in Minh_Le_(Work_Log).
VC Code
The old VC code is in
E:\mcnair\Projects\MatchingEntrepsToVC\DataWorkMatchingEntrepsV2-2.sql
It uses vcdb2 and forks off of roundlinejoinerleanff, building the following sequence of tables:
- roundlineaggfirmsseq -> roundlineaggseqwexit (using roundlineaggfunds)
- RoundLineMasterSeqBase (from roundlineaggseqwexit and 10 LJ'd tables)
- RoundLineMasterSeq (RoundLineMasterSeqBase with FirmnameRoundInduTotal, FirmnameRoundInduHist)
- Build out by stage -- MatchMostNumerousSeed, MatchHighestRandomSeed, etc.
- RoundLineByStageKeys -> MasterByStageBase -> MasterByStage -> MasterByStageKeys -> MasterByStageBlownout
There is untested seq table code at the end of
E:\projects\vcdb3\OriginalSQL\MatchingEntrepsV3.sql
It builds only roundlineaggfirmsseq.
Accelerator Demo Day
See the Accelerator Demo Day page for more information. We ran the code and posted several iterations to Turk, and completed at least one iteration by hand. From the Amazon Mechanical Turk for Analyzing Demo Day Classifier's Results page:
- E:\mcnair\Projects\Accelerator Demo Day\Turk\batch_results_all_accs_excel.xlsx -- looks like it contains the results of a Turk run. 265 results, 160 usable.
- Accelerator_Demo_Day#Hand_Collecting_Data provides a link to a Google Sheet. This sheet was downloaded to E:\projects\accelerators\Demo Day Timing Info.xlsx - it contains 136 observations. Files of this format were processed by a script written by Grace?
Accelerator Code
The last build was by Ed and Hira. Hira's notes are on the Seed Accelerator Data Assembly page.
Claims:
- dbase is likely vcdb2
- All data files are in Z:/accelerator
- The SQL file that loads all data is: LoadAccData.sql. It is located in E:\McNair\Projects\Accelerators\Summer 2018
- Source data is E:\McNair\Projects\Accelerators\Summer 2018\The File To Rule Them All.xlsx
- timing_final - This table is based on the most up-to-date timing information, compiled in the source file Z:/accelerator/Formatted Timing Info.txt (by Grace)
- additional_timing_info - source file: "merging_work.xlsx", located in E:\Projects\McNair\Seed DB
- additional_timing_info2 - source file: "formatted timing info2.txt", located in E:\Projects\McNair\Accelerators\Summer 2018. This was collected through MTurk.
- timing_combined - This table combines all the timing information we have by appending timing_final, additional_timing_info and additional_timing_info2 (tables 4, 7 and 8 in the original numbering)
- cohortcompanies_wtiming - merges the data in tables cohortcompany and timing_combined
- See also, Grace's code E:/McNair/Projects/Accelerators/Summer 2018/format_timing.py. Last file it produced was TurkData2ndPush-FormattedTiming.txt
Hira's code
Load:
- timing_final from formatted_timing_final.txt -- 1167
- additional_timing_info from merging_work.txt -- 257
- additional_timing_info2 from additional_timing_info2.txt -- 1523
- timing_combined from all three above (additional_timing_info,timing_final,additional_timing_info2) -- 2817
- cohortsfinal from cohorts_final.txt, from The File to Rule Them All.xlsx
- founders from founders_main.txt, from The File to Rule Them All.xlsx
- founders_experience from founders_experience.txt, from The File to Rule Them All.xlsx
Last code written by Ed was likely:
E:\mcnair\Projects\Accelerators\Summer 2018\FindTiming.sql
/*
timing_final Demo manual fill out effort --1167
additional_timing_info SeedDB crawl --257
additional_timing_info2 Main MTURK Crawl --1523
*/
Timing related tables:
- Draws from timing_combined
- Produces FindThese and FindTheseCos
- \COPY TurkRun2 FROM 'TurkData2ndPush-FormattedTimingWHeader.txt' --1538
- \COPY ManualAdd2 FROM 'SmallBatchTimingInfo.txt' --169
Timing Info Files
TurkData2ndPush-FormattedTimingWHeaderClean.txt <- TurkData2ndPush-FormattedTimingWHeader.txt
- Header: company pagedetails accelerator date cohortname
- 1539, cohortname is patchy but otherwise great
SmallBatchTimingInfo.txt
- Header: conamestd accelerator date month year cohort quarter
- 171, everything is patchy
merging_work.txt
- Header: conamestd accelerator matched coname url cohort name date month Year Quarter
- 259, very clean file
additional_timing_info2-fixed.txt
- Header: companyname accelerator cohortname date month year season type
- 1524 (seems messy)
- Same as: Formatted Timing Info2 wHeaderCleaned.txt <- Formatted Timing Info2 wHeader.txt
- Header of that file: Coname Accelerator ResultDate ResultType CohortName
- 1524, fairly clean
Formatted Timing Info.txt
- Header: coname acceleratorname keyword url webpage predicted gooddata page_details full_date month year cohort_name notes prog_duration_wks actual_date actual_month actual_year season
- 1168, fairly clean
- Same as: formatted_timing_final.txt
- Header of that file: coname acceleratorname keyword url webpage predicted gooddata page_details full_date month year cohort_name notes prog_duration_wks actual_date actual_month actual_year season
- 1169
Files in Summer 2018 with provenance
SmallBatchTimingInfo.txt
- Appears hand collected
- 170 lines
- Header: conamestd accelerator date month year cohort quarter
TurkData2ndPush-FormattedTimingWHeader.txt
- Processed by format_timing.py
- Comes from Final Turk Push.xlsx
- 1515 lines, company name normalized
Formatted Timing Info 2
- No header, but the columns are: coname accelerator date pagetype
- 1523 lines
- Seems to have come from GraceData.txt and been processed by an earlier version of format_timing.py
Formatted Timing Info
- Header: coname acceleratorname keyword url webpage predicted gooddata page_details full_date month year cohort_name notes prog_duration_wks actual_date actual_month actual_year season
- 1168 lines, company name normalized
- Seems to have come from Demo Day Timing Info - Good Data Only.txt
Demo Day Timing Info Companies
- No header, but coname appears normalized
- 1143 lines
- Might have come from Demo Day Timing Info - Good Data Only.txt
- Made obsolete by Formatted Timing Info?
Note that the most recent file is NewBatchForTimingInfo.txt, which contains coname, accelerator pairs. It's not clear if it was ever run.