
==Hand Collecting Data==
For the crawl, we only looked for data on accelerators whose companies did not receive venture capital (which Ed found via VentureXpert) and lacked timing info. The purpose of this crawl is to find timing info where we cannot find it otherwise; if a company received VC, we can find timing info via that investment. The file we used to find companies that lack timing info and did not receive VC is:

 /bulk/McNair/Projects/Accelerators/Summer 2018/Merged W Crunchbase Data as of July 17.xlsx

We filtered this sheet in Excel (and checked our work by filtering in SQL) and found 809 companies that lacked timing info and did not receive VC. These companies correspond to 74 accelerators that we needed to crawl. We used the crawler to search for cohort companies listed for these accelerators. During the initial test run, the number of good pages was 359. The data is then hand-coded by fellow interns.
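
As a rough illustration of the Excel/SQL filter (not the exact sheet logic), a minimal pandas sketch is below. The column names vc_amount, cohort_date, and accelerator are assumptions and may differ from the actual headers in the merged sheet.

 import pandas as pd
 
 # Load the merged Crunchbase sheet used for the filter
 df = pd.read_excel(
     "/bulk/McNair/Projects/Accelerators/Summer 2018/"
     "Merged W Crunchbase Data as of July 17.xlsx"
 )
 
 # Keep companies with no recorded VC investment and no timing info.
 # "vc_amount", "cohort_date", and "accelerator" are assumed column names.
 no_vc_no_timing = df[df["vc_amount"].isna() & df["cohort_date"].isna()]
 print(len(no_vc_no_timing))                      # ~809 companies expected
 print(no_vc_no_timing["accelerator"].nunique())  # ~74 accelerators expected
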
The file for hand-coding is in:
1. Go to the given URL.
 
2. Record whether the page is good data (column F); this can later be used by [[Minh Le]] to refine/fine-tune training data.
 
3. Record whether the page is announcing a cohort or recapping/explaining a demo day (column G). This variable will be used to decide whether we should subtract weeks from the given date (e.g. if the page recaps a demo day, the cohort went through the accelerator over the past ~12 weeks, and we should subtract weeks accordingly; see the sketch after this list).
 
4. Record the day, month, and year, as well as the companies listed for the given accelerator.
 
5. Note any other information, such as a cohort's special name.
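
For step 3, here is a minimal sketch of the date adjustment under the ~12-week assumption. The function name, parameters, and the fixed 12-week default are illustrative, since actual program lengths vary by accelerator.

 from datetime import date, timedelta
 
 def estimate_cohort_start(page_date, is_demo_day_recap, program_weeks=12):
     """Estimate when a cohort started.
 
     If the page recaps a demo day, subtract the assumed program length
     (~12 weeks) from the page date; otherwise the page date itself is
     treated as the cohort announcement date.
     """
     if is_demo_day_recap:
         return page_date - timedelta(weeks=program_weeks)
     return page_date
 
 # A demo-day recap dated 2018-06-15 implies a cohort start near 2018-03-23
 print(estimate_cohort_start(date(2018, 6, 15), True))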
