Difference between revisions of "U.S. Seed Accelerators"

Project
U.S. Seed Accelerators
Project Information
Has title	U.S. Seed Accelerators
Has owner	Connor Rothschild
Has start date	06/18/2018
Has deadline date
Has keywords	accelerators, data
Has project status	Active
Is dependent on	Industry Classifier, Demo Day Page Parser, Accelerator Demo Day, Demo Day Page Google Classifier, Crunchbase Accelerator Equity, Crunchbase Accelerator Founders, Crunchbase Data
Does subsume	Accelerator Data, Accelerator Seed List (Data)
Has sponsor	McNair Center
Has project output	Data, Tool
	Copyright © 2019 edegan.com. All Rights Reserved.

Latest revision as of 12:31, 6 October 2020

The U.S. Seed Accelerators project subsumes several related projects. These projects were intended to assemble near-population data on high-growth high-tech seed accelerators in the U.S. and understand how to automate the data collection process. As such, the project includes both a dataset and prototypes. Some of the prototypes were used in the Kauffman Incubator Project.

Project Location

The master file can be found at

/bulk/McNair/Projects/Accelerators/Summer 2018/The File to Rule Them All.xlsx

Note that TFTRTA-AcceleratorFinal.txt in E:\projects\accelerators was updated to include all creation dates and dead dates.

Relevant Former Projects

This page serves as an updated and tidied version of the data and work presented on the Accelerator Seed List (Data) Project, which subsumed Accelerator Data. Both of these projects (and as a corollary, this project) are dependent on the Demo Day Page Parser, Industry Classifier, and the Whois Parser.

Update for Hira

Final MTurk Push

Minh and I pushed a final batch of HITs to MTurk. We found that, even after MTurk, our data was missing timing info for around 1,000 companies. Upon further inspection, we realized that around 800 of these companies belonged to only ~10 accelerators. We think the problem was that Google returns the most recent results first, so we missed older cohorts for large accelerators. We therefore re-ran Minh's crawler on these accelerators with different year parameters and got 650 results.

Upon pushing these to MTurk, we got good results for 144 of them. This number is what remained after filtering out results with no companies listed, no date listed, or no accelerator listed (after searching manually), removing duplicates, and removing accelerators we do not care about. The 144 results collectively list 1,538 companies.

This file can be found here:

/bulk/McNair/Projects/Accelerators/Summer 2018/Final Turk Push.xlsx

The next step is to plug this sheet into Grace's Python script which takes these companies and converts each company to its own row, so that it can be merged with our other data.
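
The reshaping step described above can be sketched roughly as follows. This is not Grace's script, only a minimal illustration. The field names (`accelerator`, `cohort_date`, `companies`) and the semicolon separator are assumptions, not the actual layout of Final Turk Push.xlsx.

```python
def explode_companies(rows, sep=";"):
    """Turn one row per cohort into one row per company."""
    out = []
    for row in rows:
        for name in row["companies"].split(sep):
            name = name.strip()
            if name:  # skip empty fragments such as a trailing separator
                out.append({
                    "accelerator": row["accelerator"],
                    "cohort_date": row["cohort_date"],
                    "company": name,
                })
    return out


# Hypothetical example row, not taken from the project data.
cohorts = [{"accelerator": "Example Accelerator",
            "cohort_date": "2015-06",
            "companies": "Alpha; Beta ;Gamma;"}]
company_rows = explode_companies(cohorts)
```

Each output row carries the accelerator and cohort date along with a single company, which is the shape needed to merge with the other company-level data.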

Manual Searching

For the other 170 companies that lacked timing info (and were not worth crawling for, because few companies were assigned to each accelerator), McNair Center interns searched manually. We found timing information for 128 of the 170.

The sheet can be found here:

https://docs.google.com/spreadsheets/d/1hGgxNwLph0tWtqO_8bNUGM-kzVXTeb-N26ojwL3TTuk/edit?usp=sharing

It is ready to be merged with our existing data.

Recoded Founders' Experience

I have updated and reclassified Founders' job titles. We began with 451 unique job titles, and were able to condense them into 16 categories, which are:

Academic
Advisor
Board
C-level
CEO
Director
Founder
Investment
Management
Marketing
Partner
President
VP
Other

The formulas used to recode, the old data, and the newest, updated data can be found on this Google Sheet:

https://docs.google.com/spreadsheets/d/179ML4c1cO_1zooCGj4yjuXXUPwTDKZu52656_8uoNig/edit?usp=sharing

This has been merged into the File to Rule Them All.
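
For anyone who would rather recode titles in a script than in sheet formulas, the idea can be sketched as an ordered list of keyword rules. The keywords and their priority order below are assumptions made for illustration; the rules actually used are the formulas in the Google Sheet above. Matching is done on whole words, so that "Director" is not mistaken for "CTO".

```python
import re

# Hypothetical keyword rules. Order matters: the first matching rule wins,
# so "VP" is listed before "President" to catch "Vice President".
RULES = [
    ("CEO", ("ceo", "chief executive officer")),
    ("C-level", ("cto", "cfo", "coo", "chief")),
    ("Founder", ("founder", "cofounder")),
    ("VP", ("vp", "vice president")),
    ("President", ("president",)),
    ("Director", ("director",)),
    ("Partner", ("partner",)),
    ("Board", ("board",)),
    ("Advisor", ("advisor", "adviser")),
    ("Academic", ("professor", "lecturer")),
]


def recode_title(title):
    """Map a free-text job title to a category, or "Other" if no rule matches."""
    # Lowercase, keep only letter runs, and pad with spaces for whole-word matching.
    words = " " + " ".join(re.findall(r"[a-z]+", title.lower())) + " "
    for category, keywords in RULES:
        if any(" " + kw + " " in words for kw in keywords):
            return category
    return "Other"
```

For example, `recode_title("Co-Founder & CEO")` falls under CEO here only because the CEO rule is listed first; swapping the rule order changes which category wins for titles that match more than one.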

Recoded Stage

I have updated and cleaned up the "what stage accs look for companies in" by splitting it up into three categories:

seed
early stage venture
late

Other classifications were either collapsed into these three or were too infrequent (n<2) to be coded as their own classification. The Google Sheet used to recode the stage variable can be found here:

https://docs.google.com/spreadsheets/d/1G_XbIrHB6YOU5tWs0dqZot6_eJLDsp-nxoAIuHM9_Yc/edit?usp=sharing

It has been merged into The File To Rule Them All.

Recoded Dead Accelerators

We have updated dead accelerators on the following Google Sheet:

https://docs.google.com/spreadsheets/d/1_mZ8QgEXwSoTeyVbiEg2ZfoQukvCHfr0NKSk2QMGYnI/edit?usp=sharing

This has been merged into The File To Rule Them All.

Recoded Equity/Investment

The Google Sheet with this work is here:

https://docs.google.com/spreadsheets/d/1xFlFR1OAoHY4XgesB8ZAugL99DgT0OehxIEzcmYuMB8/edit?usp=sharing

It has also been merged into The File To Rule Them All.

I have updated the equity variable using data from https://www.seed-db.com/accelerators.

I have also updated the columns with a new normalized version of investment. Within the File To Rule Them All, you will find two normalizations of investment:

the midpoint normalization, which uses the average of the low and high ends of each accelerator's investment range.
the upper bound normalization, which uses the highest amount each accelerator will invest.

This dual normalization was performed because many accelerators say they invest "Up to $__,___" so a midpoint may not accurately reflect actual investment amounts.

The average investment when using the midpoint is $40,164 and the average investment when using the upper limit is $48,313.

NOTE: There may be one outlier to control for, as Boost VC says they offer between $50,000 and $500,000. This is a huge range, and the upper limit of $500,000 may throw off our analysis.

By removing Boost VC's investment amounts, the average using midpoint drops to $37,555 (~$3,000 less) and the average using upper limits drops to $43,293 (~$5,000 less). The distance between the two averages drops from ~$8,000 to ~$6,000. We should consider/discuss removing or controlling for Boost VC.
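
The two normalizations can be sketched in a few lines of Python. This is an illustration only, not the spreadsheet formulas we used. In particular, treating a lone "up to $X" as the range $0 to $X for the midpoint is an assumption, and the sample strings are made up.

```python
import re
from statistics import mean


def parse_investment(text):
    """Return (midpoint, upper_bound) in dollars, or None if no dollar amount is found."""
    amounts = [int(a.replace(",", "")) for a in re.findall(r"\$(\d[\d,]*)", text)]
    if not amounts:
        return None
    upper = max(amounts)
    # Assumption: a lone "up to $X" is treated as the range $0 to $X.
    lower = 0 if len(amounts) == 1 and "up to" in text.lower() else min(amounts)
    return (lower + upper) / 2, upper


def average_normalizations(texts):
    """Average midpoint and average upper bound over all entries that parse."""
    parsed = [p for p in map(parse_investment, texts) if p is not None]
    return mean(p[0] for p in parsed), mean(p[1] for p in parsed)


# Made-up sample entries; the last one mimics a very wide range.
samples = ["$20,000", "Up to $50,000", "between $50,000 and $500,000"]
with_outlier = average_normalizations(samples)
without_outlier = average_normalizations(samples[:2])
```

Comparing `with_outlier` and `without_outlier` is the same check as the Boost VC comparison above: one very wide range pulls both averages up sharply.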

Amazon Mechanical Turk Pricing

Information can be found at http://mcnair.bakerinstitute.org/wiki/Accelerator_Demo_Day#Pricing

Recoded Founders Education

I have recoded two components of the founders' education sheet:

1) Degree name has been reclassified into nine categories:

High School
Associates
Bachelors
Masters
Certificate
JD
MBA
PhD
Other

2) Majors have also been recoded into nine categories:

H = Humanities
SS = Social Sciences
NS = Natural Sciences
E = Engineering (includes computer science)
B = Business and Economics
L = Leadership
MBA
JD
O = Other

The Google Sheet I used to reclassify can be found here:

https://docs.google.com/spreadsheets/d/1XWtCTeaof8WxAuOCbn3XEFaK0sh-ZIif2JqcdbkH72I/edit?usp=sharing

The sheet OLD contains old, outdated data; WORK contains the process and formulas used to reclassify; and Updated Info contains only good, updated data.

The data has been merged into The File to Rule Them All. Old data has not yet been deleted but can be at any time, since we have it saved in the Google Sheet OLD (just wanted the go-ahead from Hira or Ed).

Recoded multiple campuses and cohorts

The File to Rule Them All contains an updated address variable. 139 of our 166 accelerators have addresses that are available online. The ones we could not find information for are left blank.

The collaborative sheet that Hira, Maxine, and I worked on to update this list can be accessed here:

https://docs.google.com/spreadsheets/d/1nktgJZfm3L8IsSCHgYbPasSdvKb7QHKxZp8K5Si2iNo/edit?usp=sharing

We have also added a new cohort list. Under The File To Rule Them All, the sheet Multiple Campuses contains the different locations for accelerators with multiple campuses. Column B can be used to filter out multiple location accelerators.

Fixed Manual Data from Google Sheet

We created a new sheet with only data we want to keep, and cleaned it up. That sheet is called "Good Data Only", at the same link:

https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit?usp=sharing

I first used our "recap" and "announced" classification to standardize and fix the dates.

Columns N-R contain our new data. Please note that all of these columns are based on formulas and will be made erroneous if edited.
Column N is the # of weeks for an accelerator program, gathered via VLookup from The File to Rule Them All.
Column O is the Actual Date we want to record, and was gathered by subtracting the # of weeks from a date if the listed page was a recap.
Columns P and Q are the Month and Years stripped from Column O.
Finally, Column R is the season variable, as Ed said it should be coded.
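
The logic of columns O through R can be sketched in Python as below. The sheet itself uses formulas; this is only an illustration. The season mapping is an assumption (meteorological seasons by month), since the coding Ed specified is not written out on this page.

```python
from datetime import date, timedelta

# Assumption: seasons by calendar month; the project's actual coding may differ.
SEASONS = {12: "Winter", 1: "Winter", 2: "Winter",
           3: "Spring", 4: "Spring", 5: "Spring",
           6: "Summer", 7: "Summer", 8: "Summer",
           9: "Fall", 10: "Fall", 11: "Fall"}


def cohort_timing(page_date, page_type, program_weeks):
    """Return (actual_date, month, year, season) for one row.

    A "recap" page is dated at the end of the program, so the program
    length is subtracted; any other page type keeps its date unchanged.
    """
    if page_type == "recap":
        actual = page_date - timedelta(weeks=program_weeks)
    else:
        actual = page_date
    return actual, actual.month, actual.year, SEASONS[actual.month]


recap_row = cohort_timing(date(2018, 8, 20), "recap", 12)
announced_row = cohort_timing(date(2018, 8, 20), "announced", 12)
```

Here `recap_row` starts from an August 20 recap page for a 12-week program and lands on May 28, while `announced_row` keeps the August date.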

We have also gone through and removed all bad data, all duplicates, and all rows without timing info. This is the most complete list possible.

Recoded employee count

I have added a new column in Cohorts Final (of the File to Rule Them All) yet left the old column in case you would prefer to edit/classify differently. Column AB (emp_count_scale) contains a variable coded on a scale of 1 to 9, with each number corresponding to one of the employee_count classifications (1 the lowest, 9 the highest). The exact output can be modified (1 could instead be tiny, 2 be small... 9 be huge).

The employee count column is standardized and can easily be edited given some modification of the Excel formula.
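
Outside of Excel, the same recoding is a simple lookup. The sketch below assumes the employee_count values are the nine Crunchbase size buckets listed in the code; that list is an assumption and should be checked against the values actually present in Cohorts Final.

```python
# Assumption: the nine employee_count buckets, ordered smallest to largest.
BUCKETS = ["1-10", "11-50", "51-100", "101-250", "251-500",
           "501-1000", "1001-5000", "5001-10000", "10000+"]

# Map each bucket to its rank, 1 (lowest) through 9 (highest).
SCALE = {bucket: rank for rank, bucket in enumerate(BUCKETS, start=1)}


def emp_count_scale(employee_count):
    """Return the 1-9 rank for a bucket, or None for blank or unrecognized values."""
    return SCALE.get(employee_count)
```

To output labels such as "tiny" or "huge" instead of numbers, only the values in `SCALE` need to change.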

Recent Work

Here's a project update on the work that has been done since coming to McNair. The most recent file is

/bulk/McNair/Projects/Accelerators/Summer 2018/The File to Rule Them All.xlsx

Merging Cohort Companies with Crunchbase Info

More information on this part of the project can be found on the page Merging Existing Data with Crunchbase.

The newest updated sheet of cohort company info is under the Cohorts Final sheet of The File to Rule Them All.xlsx.

Working with Maxine Tao, we have matched companies to their respective pages and information found in Crunchbase (via UUID). We ensured single matches by doing a 1-1-1-1 match with our data and with Crunchbase (using the Matcher). We then received additional information on 8092 companies. The following new information (on top of what we already had) is included in the sheet:

short_description
long_description
category_list (details the company's category)
category_group_list (a less refined, more all-encompassing category classification)
founded_on date
employee_count
linkedin_url
address

And the following information was also pulled from Crunchbase and merged with our existing data:

URL (was merged with courl cells)
city (was merged with colocation)
state_code (was merged with colocation)
country_code (was merged with colocation)
status (was merged with costatus)

The Equity Variables

Maxine Tao and I have added six new variables to the Accelerators Final sheet. Those variables are:

Terms of joining - terms of joining accelerator and important details about program
Equity? (1/0) - cells contain a 1 if the accelerator takes equity, a 0 if the accelerator definitively does not, and are blank if we could not find that information
Equity Amount - the % of equity the accelerator will take (can sometimes be a range, e.g. 5-7%)
Equity Amount Normalized - this copies the previous column but only keeps %>0, and if a range was given (e.g. 5-7%) it returns the average (e.g. 6%)
Investment Amount - the $ the accelerator invests in a company to begin, if relevant (could also be a range or an "up to $######")
Investment Notes - anything to comment on previous 4 columns

These six variables tell us more about the characteristics of accelerators; specifically, which ones take equity and which ones do not, and how much equity accelerators take.

Relevant information:

82 accelerators take equity, 42 do not, and we lack information for 37.
The average % of equity among accelerators who take equity is 6.49% (rough estimate; do not use for anything official). We got this number by looking only at accelerators who take equity, coding a reported range as its average (e.g. 4%-10% equity would be coded as 7% equity), and taking the mean.
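
As a worked illustration of that calculation, the sketch below normalizes each Equity Amount entry and then averages the ones that are greater than zero. The parsing rule (take every number in the cell and average them) is an assumption, and the sample cells are made up rather than taken from the sheet.

```python
import re
from statistics import mean


def normalize_equity(text):
    """Return the equity % as a float (a range becomes its average), or None if blank or 0."""
    values = [float(v) for v in re.findall(r"\d+(?:\.\d+)?", text)]
    if not values:
        return None
    average = mean(values)
    return average if average > 0 else None


def average_equity(cells):
    """Mean normalized equity over the cells that report equity greater than 0."""
    normalized = [normalize_equity(c) for c in cells]
    return mean(v for v in normalized if v is not None)


# Made-up sample cells: a range, a zero, a blank, and a single value.
sample_mean = average_equity(["5-7%", "0%", "", "8%"])
```

With these sample cells, the range 5-7% is coded as 6%, the zero and blank cells are dropped, and the mean of 6% and 8% is 7%.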

Matching Accelerators to UUIDs via Crunchbase

We've also added UUIDs for 163 of our 166 accelerators. The UUIDs can be found in Column AE of the Accelerators Final sheet.

The file with accelerators matched to Crunchbase UUIDs can be found at:

/bulk/McNair/Projects/Accelerators/Summer 2018/Accelerators and UUIDs.xlsx

This is the master file and should never be modified unless we find a UUID changed. ALL OTHER SHEETS with UUIDs are linked to this sheet so its changes will be reflected elsewhere.

More information can be found on the Crunchbase Data page.

Linking Accelerators to Founders/LinkedIn Crawling

Grace Tan got the LinkedIn Crawler (Python) to work, which means we currently have the following information about accelerator founders:

Current Job Title
Location
Employer
Job(s) Title
Dates Employed
Time Employed
Location of jobs
Extra Description
School Name
Degree Name
Major
Attended
Graduated
Societies

This information can be found in the various Founders sheets in The File to Rule Them All.

Finding Company URLs

See http://mcnair.bakerinstitute.org/wiki/URL_Finder_(Tool)#Summer_2018_URL_Finder_work for more details.

Seed DB Parser

See Seed DB Parser for information on functionality.

The results from crawling Seed DB gave us more information for 257 companies. This is located in (sheet: final):

E:\McNair\Projects\Seed DB\merging work.xlsx

An Overview

This project will be used to determine which accelerators are the most effective at churning out successful startups, as well as what characteristics are exhibited by these accelerators. First, we need to gather as much data as we can about as many accelerators as we can in order to look at factors that differentiate successful vs. unsuccessful ventures. Next, we need to create a web crawling program which will gather information about accelerators across the world by accessing their websites and extracting information. I believe that our overall goal with this research project is to gain insight into the methods of successful accelerators, as well as to find out what exactly differentiates very successful accelerators from dead accelerators.

Helpful Links: http://seedrankings.com/

Remaining To Dos

The last update on Accelerator Seed List (Data) said the following needed to be done:

Cross-reference sheet with data from Peter's old accelerator consolidation file ("accelerator_data_noflag" and "accelerator_data" in "All Relevant Files") and fill in missing data
Variables that are 100% NOT in these 2 files:
- Cohort Breakout?
- Subtype
- Designed for Students?
- Campuses
- Stage
- Software Tech
- What stage do they look for?

(Outdated) Necessary Steps

Acquiring the necessary data to complete the Accelerator Master Variable List and the Cohort List will require the following (not necessarily in this order):

Step Zero: Connect to Crunchbase and Link Data

Complete - more info: Crunchbase Data

Step One: LinkedIn Founders Data

This project will begin by working with Grace Tan and Maxine Tao to connect accelerators to their founders and cohort companies using Crunchbase and LinkedIn crawlers. Grace and Maxine will go through Crunchbase and find the UUID for companies and their founders (reference Crunchbase Data, Crunchbase Accelerator Founders, Crunchbase Accelerator Equity). Connect them using SQL and feed the names of founders into our LinkedIn crawler (headed by Grace Tan).

The list of founders for accelerators can be found at

McNair/Projects/Accelerators/Fall 2017/founders_linkedin.txt

The Unfound Founders file codes a 0 for all companies not listed within the LinkedIn Founders file, and a 1 for those that do have founders listed.

Given the founders' names, we will then be able to use the LinkedIn Crawler (Python) to find the relevant details of an accelerator founder (education, work experience, etc.). This data on founders will help us answer the horse, jockey, racetrack question of which variables affect a startup's success (the accelerator, the founders, the environment/city).

Step Two: Linking Accelerators to Cohorts Using Investments on Crunchbase

In this step we focus on accelerators who take equity from the companies that engage in their program. We do this to avoid including accelerators who also run funds or invest in various companies but do not take equity. Including them would give us misleading results and lead us to believe some companies are in cohorts at accelerators when they really are not.

Maxine will acquire the list of accelerators who take equity from companies from the following sheet:

E://McNair/Projects/Accelerators/All Relevant Files/accelerator_data_noflag.txt

Looking at the file, however, shows that very few are actually categorized well and the equity variable is messy. Moving forward, we need to check/refine/fix this classification.

This file has 266 rows. The most recent, actual version of our accelerator database (found at E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx under the sheet Master Variable List) only has 167 rows, meaning the accelerator_data_noflag.txt file has too many rows.

We will need to do a left join of Accelerator Master List with accelerator_data_noflag.txt to get rid of the accelerator names that are in accelerator_data_noflag but NOT in Accelerator Master List.
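
The left join described above could be done in SQL; as a rough sketch of the same idea in plain Python, the function below keeps every row of the master list and attaches the equity field from the other file when the names match. The field names (`name`, `equity`) and the case-insensitive match on trimmed names are assumptions, not the actual column headers.

```python
def left_join(master_rows, other_rows, key="name", fields=("equity",)):
    """Keep every master row; copy `fields` from the matching other row, else None."""
    # If other_rows repeats a name, the last occurrence wins.
    lookup = {row[key].strip().lower(): row for row in other_rows}
    joined = []
    for row in master_rows:
        match = lookup.get(row[key].strip().lower(), {})
        merged = dict(row)
        for field in fields:
            merged[field] = match.get(field)
        joined.append(merged)
    return joined


# Hypothetical rows, for illustration only.
master = [{"name": "Alpha Labs"}, {"name": "Beta Works"}]
noflag = [{"name": "alpha labs ", "equity": "Y"},
          {"name": "Gamma Fund", "equity": "N"}]
result = left_join(master, noflag)
```

Rows that appear only in the second list (Gamma Fund here) are dropped, and master rows without a match keep a blank equity value, which is what we want when trimming the 266-row file down to the master list.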

Once this is finished, we should have an “Equity” classification variable for every accelerator in Accelerator Master List. The accelerators that have a Y (or maybe it’s a 1) are the ones that take equity. These are the accelerators we will be able to do the Crunchbase work on to see when accelerators take equity.

We then look at the accelerators' investments (or at companies and the entities which invested in them) and cross-reference the list of companies and accelerators. Once we find a match, we know that a company went through an accelerator and in which year it went through a cohort.

From this, we get the following data:

Accelerator a given company went through
Year said company went through a cohort/Specific cohort company went through
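
A rough sketch of this cross-reference is below. It assumes investment records with hypothetical fields `investor_uuid`, `company_name`, and `announced_on` (a YYYY-MM-DD string), plus a dictionary from the UUIDs of equity-taking accelerators to their names. The real Crunchbase table and column names may differ.

```python
def cohorts_from_investments(investments, accelerator_names_by_uuid):
    """Return one row per investment made by a known equity-taking accelerator."""
    rows = []
    for inv in investments:
        accelerator = accelerator_names_by_uuid.get(inv["investor_uuid"])
        if accelerator is None:
            continue  # investor is not one of our accelerators
        rows.append({
            "company": inv["company_name"],
            "accelerator": accelerator,
            "year": int(inv["announced_on"][:4]),
        })
    return rows


# Hypothetical data, for illustration only.
accelerators = {"uuid-1": "Example Accelerator"}
investments = [
    {"investor_uuid": "uuid-1", "company_name": "Alpha", "announced_on": "2016-03-01"},
    {"investor_uuid": "uuid-9", "company_name": "Beta", "announced_on": "2017-05-10"},
]
matched = cohorts_from_investments(investments, accelerators)
```

Using the investment year as the cohort year is itself an approximation, since an accelerator's investment may be announced before or after the program runs.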

Step Three: Demo Day Crawler

This part of the project relies on the contributions of the wonderful Minh Le. Better documentation for the project can be found on the Demo Day Page Parser, Demo Day Page Google Classifier, and Accelerator Demo Day project pages.

Essentially, this part of the accelerator data project will use the Demo Day Page Parser to look through accelerator websites for pages which list a cohort's 'Demo Day', or the day in which accelerators present their companies to a group of special investors (here's an example FAQ page from Y Combinator). The Demo Day Page Google Classifier will then determine if the page is, in fact, a demo day page.

Given a cohort's demo day, we can gather a few pieces of key information (check with Ed to make sure this is the correct information to gather from Demo Days):

The date a cohort began/the season the cohort went through the accelerator
- This is acquired by looking at the cohort's demo day date, and subtracting the number of weeks/months of a cohort for that given accelerator. The length of a cohort can be found in the file:

E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx

I assume we can also acquire the companies in a specific cohort, as in we will have the list of all companies in the (for example) Spring 2018 Cohort of Techstars.

Step Four: Non-profit Finder

More at Non-profit Finder

Another important step in this project is finding which accelerators are non-profits.

A comprehensive list of nonprofits taken from the IRS can be found here:

E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Nonprofits in US.xlsx

Warning: this file has 1 million rows

This file should be cross-referenced with the list of accelerators to find which ones are listed; those accelerators are non-profits. We'll need someone good at SQL to do this with a match.

Potential problem:

The names of listed non-profits will likely be somewhat different than the names of accelerators, because companies often file for tax exemption with different names than they show the public.
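
One way to soften the naming problem is to normalize both names before comparing them and to accept close, not just exact, matches. The sketch below uses Python's standard `difflib` for the similarity score. The list of words to drop and the 0.85 threshold are assumptions that would need tuning. Comparing every accelerator against a million rows this way would be slow, so for the full file an exact match on the normalized names (or a database join) is probably a better first pass.

```python
import re
from difflib import SequenceMatcher

# Assumption: common legal suffixes and filler words to ignore when comparing names.
DROP_WORDS = {"inc", "incorporated", "llc", "corp", "corporation", "co", "the"}


def normalize(name):
    """Lowercase, strip punctuation, and remove legal suffixes."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(w for w in words if w not in DROP_WORDS)


def best_match(accelerator, nonprofit_names, threshold=0.85):
    """Return (name, score) for the closest nonprofit, or (None, score) if below threshold."""
    target = normalize(accelerator)
    best_name, best_score = None, 0.0
    for candidate in nonprofit_names:
        score = SequenceMatcher(None, target, normalize(candidate)).ratio()
        if score > best_score:
            best_name, best_score = candidate, score
    if best_score >= threshold:
        return best_name, best_score
    return None, best_score
```

For example, `best_match("Example Accelerator", ["EXAMPLE ACCELERATOR INC", "Unrelated Charity"])` matches the first name, because both normalize to "example accelerator". Any match found this way should still be checked by hand.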

Workflow Image

(the color coding for this image is a very rough and preliminary estimate subject to change) I imagine the project will look something like this, in that it will require the following information to fully complete the project:

@@ Line 1: / Line 1: @@
-{{McNair Projects
+{{Project
+|Has project output=Data,Tool
+|Has sponsor=McNair Center
 |Has title=U.S. Seed Accelerators
 |Has owner=Connor Rothschild,
@@ Line 10: / Line 12: @@
 |Does subsume=Accelerator Data, Accelerator Seed List (Data),
 }}
+<onlyinclude>The [[U.S. Seed Accelerators]] project subsumes several related projects. These projects were intended to assemble near-population data on high-growth high-tech seed accelerators in the U.S. and understand how to automate the data collection process. As such, the project includes both a dataset and prototypes. Some of the prototypes were used in the [[Kauffman Incubator Project]].</onlyinclude>
 ==Project Location==
 The master file can be found at
   /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''
+Note that TFTRTA-AcceleratorFinal.txt in E:\projects\accelerators was updated to included all creation dates and dead dates.
 ==Relevant Former Projects==
@@ Line 21: / Line 26: @@
 ==Update for Hira==
-After our Skype call, I did the following:
+===Final MTurk Push===
+Minh and I pushed a final batch of HITs to MTurk. We found that, among our data even after MTurk, we were missing timing info for around 1000 companies. Upon further inspection, we realized that around 800 of these companies belonged to only ~10 accelerators. We think the problem was that Google searches most recent results first, so we missed out on old cohorts for large accelerators. We therefore re-ran Minh's crawler on these accelerators with different year parameters. We got 650 results.
+Upon pushing these to MTurk, we got good results for 144 companies. This number was the product of filtering out accelerators with no companies listed, no date listed, and no accelerator listed (after searching manually). We removed duplicates and removed accelerators we do not care about. The 144 companies collectively have 1,538 companies.
-===Recoding Founders===
+This file can be found here:
+ /bulk/McNair/Projects/Accelerators/Summer 2018/Final Turk Push.xlsx
-I've began recoding the Founders variable. I can categorize 232 of the 824 listed job titles (there are 466 distinct job titles)into 9 categories (the most frequent results as given in a PivotTable found in the sheet).
+The next step is to plug this sheet into Grace's Python script which takes these companies and converts each company to its own row, so that it can be merged with our other data.
-We will need to talk about how to categorize the most extraneous job titles.
+===Manual Searching===
-===Multiple campuses and cohorts===
+For the other 170 companies we lacked timing info for (that were not worth crawling for because there were few companies assigned to each accelerator) McNair Center interns manually searched for timing info. Of the 170 companies we searched for, we found timing information for 128 of them.
+The sheet can be found here:
+ https://docs.google.com/spreadsheets/d/1hGgxNwLph0tWtqO_8bNUGM-kzVXTeb-N26ojwL3TTuk/edit?usp=sharing
+And is ready to merge in with our existing data.
+===Recoded Founders' Experience===
+I have updated and reclassified Founders' job titles. We began with 451 unique job titles, and were able to condense them into 16 categories, which are:
+*Academic
+*Advisor
+*Board
+*C-level
+*CEO
+*Director
+*Founder
+*Investment
+*Management
+*Marketing
+*Partner
+*President
+*VP
+*Other
+The formulas used to recode, the old data, and the newest, updated data can be found on this Google Sheet:
+ https://docs.google.com/spreadsheets/d/179ML4c1cO_1zooCGj4yjuXXUPwTDKZu52656_8uoNig/edit?usp=sharing
+This has been merged into the File to Rule Them All.
+===Recoded Stage===
+I have updated and cleaned up the "what stage accs look for companies in" by splitting it up into three categories:
+*seed
+*early stage venture
+*late
+Other classifications were collapsed into these three or were not significant (n<2) enough to be coded as a classification. The Google Sheet used to recode the stage variable can be found here:
+ https://docs.google.com/spreadsheets/d/1G_XbIrHB6YOU5tWs0dqZot6_eJLDsp-nxoAIuHM9_Yc/edit?usp=sharing
+It has been merged into The File To Rule Them All.
+===Recoded Dead Accelerators===
+We have updated dead accelerators on the following Google Sheet
+ https://docs.google.com/spreadsheets/d/1_mZ8QgEXwSoTeyVbiEg2ZfoQukvCHfr0NKSk2QMGYnI/edit?usp=sharing
+This has been merged this into The File To Rule Them All.
+===Recoded Equity/Investment===
+The Google Sheet with this work is here:
+ https://docs.google.com/spreadsheets/d/1xFlFR1OAoHY4XgesB8ZAugL99DgT0OehxIEzcmYuMB8/edit?usp=sharing
+It has also been merged into The File To Rule Them All.
+I have updated equity from data from https://www.seed-db.com/accelerators.
+I have also updated the columns with a new normalized version of investment. Within the File To Rule Them All, you will find two normalizations of investment:
+*the midpoint normalization, used by taking an average of the accelerators' investment ranges.
+*the upper bound normalization, used by taking an average of the highest amount accelerators will invest.
+This dual normalization was performed because many accelerators say they invest "Up to $__,___" so a midpoint may not accurately reflect actual investment amounts.
+The average investment when using the midpoint is $40,164 and the average investment when using the upper limit is $48,313.
+'''NOTE: There may be one outlier to control for, as Boost VC says they offer **between $50,000 and $500,000**. This is a huge range and the upper limit of $500,00 may throw off our analysis.
+By removing Boost VC's investment amounts, the average using midpoint drops to $37,555 (~$3,000 less) and the average using upper limits drops to $43,293 (~$5,000 less). The distance between the two averages drops from ~$8,000 to ~$6,000. We should consider/discuss removing or controlling for Boost VC.'''
+===Amazon Mechanical Turk Pricing===
+Information can be found here [http://mcnair.bakerinstitute.org/wiki/Accelerator_Demo_Day#Pricing]
+===Recoded Founders Education===
+I have recoded two components of the founders' education sheet:
+) Degree name has been reclassified into nine categories:
+*High School
+*Associates
+*Bachelors
+*Masters
+*Certificate
+*JD
+*MBA
+*PhD
+*Other
+) Majors have also been recoded into nine categories:
+*H = Humanities
+*SS = Social Sciences
+*NS = Natural Sciences
+*E = Engineering (includes computer science)
+*B = Business and Economics
+*L = Leadership
+*MBA
+*JD
+*O = Other
+The Google Sheet I used to reclassify can be found here:
+ https://docs.google.com/spreadsheets/d/1XWtCTeaof8WxAuOCbn3XEFaK0sh-ZIif2JqcdbkH72I/edit?usp=sharing
+With the Sheets '''OLD''' containing old, outdated data, '''WORK''' containing the process/formulas of reclassifying, and '''Updated Info''' containing only good, updated data.
+The data has been merged into '''The File to Rule Them All'''. Old data has not yet been deleted but can at any time, considering we have it saved in the Google Sheet '''OLD''' (just wanted the go ahead from Hira or Ed).
+===Recoded multiple campuses and cohorts===
 The File to Rule Them All contains an updated address variable. 139 of our 166 accelerators have addresses that are available online. The ones we could not find information for are left blank.
@@ Line 38: / Line 154: @@
 We have also added a new cohort list. Under The File To Rule Them All, the sheet '''Multiple Campuses''' contains the different locations for accelerators with multiple campuses. Column B can be used to filter out multiple location accelerators.
-===Fixed Google Sheet===
+===Fixed Manual Data from Google Sheet===
 We created a new sheet with only data we want to keep, and cleaned it up. That sheet is called "Good Data Only", at the same link:
@@ Line 53: / Line 169: @@
 We have also gone through and removed all bad data, all duplicates, and all rows without timing info. These is the most complete list possible.
-===Recode employee count===
+===Recoded employee count===
 I have added a new column in Cohorts Final (of the File to Rule Them All) yet left the old column in case you would prefer to edit/classify differently.
@@ Line 59: / Line 175: @@
 The employee count column is standardized and can easily be edited given some modification of the Excel formula.
-===Normalized investment amount===
-I've been trying to fix the investment amount. But I think its smart we discuss this before I move forward. I've done some tentative standardization (finding the average in a range if given), but so many accelerators "take up to __%"  equity and "invest up to $___" that I think its smartest we think hard about how to standardize first. Also notable is that some accelerators say they provide "$____ up front and another $___ in follow up funding for each stage." How do we deal with these? Message me if you'd like to talk more about this.
-I refrained from creating max and min investment columns lest spending time on the data before we discuss it.
-===Remaning to do===
-*Founders Experience: code job title
-*Founders Education: remove unknowns, code degree and code major
 ==Recent Work==
@@ Line 148: / Line 253: @@
 ===Finding Company URLs===
-Excel master datasets are in:
+See http://mcnair.bakerinstitute.org/wiki/URL_Finder_(Tool)#Summer_2018_URL_Finder_work for more details.
- E:\McNair\Projects\Accelerators\Summer 2018
-Code and files specific to this URL finder are in:
- E:\McNair\Projects\Accelerators\Summer 2018\url finder
-====Results====
-I used STEP1_crawl.py and STEP2_findcorrecturl.py to add approximately 1000 more URLs into 'The File to Rule Them All.xlsx'.
-====Testing====
-In this file (sheet: 'Most Recent Merged Data' note that this is just a copy of 'Cohorts Final' in 'The File to Rule Them All'):
- E:\McNair\Projects\Accelerators\Summer 2018\Merged W Crunchbase Data as of July 17.xlx
-We filter for companies (~4000) that did not receive VC, are not in crunchbase, and do not have URLs.
-Using a Google crawler(STEP1_crawl.py) and URL matching script(STEP2_findcorrecturl.py), we will try to find as many URLs as possible.
-To test, I ran about 40 companies from "smallcompanylist.txt", using only the company name as a search term and taking the first 4 valid results (see don't collect list in code). The google crawler and URL matcher was able to correctly identify around 20 URLs. It also misidentifies some URLs that look really similar to the company name, but it is accurate for the most part if the name is not too generic. I then tried to run the 20 unfound company names through the crawler again, but this time I used company name + startup as the search term. This did not identify any more correct URLs.
-It seems reasonable to assume that if the company URL cannot be found within the first 4 valid search results, then that company probably does not have URL at all. This is the case for many of the unfound 20 URLs from my test run above.
-====Actual Run Info====
-The companies we needed to find URLs for are in a file called 'ACTUALNEEDEDCOMPANIES.txt'.
-The first four results for every company, as found by STEP1_crawl.py, are in 'ACTUAL_crawled_company_urls.txt'.
-The results after the matching done by STEP2_findcorrecturl.py, are in 'ACTUAL_finalurls.txt'.
+===Seed DB Parser===
+See [[Seed DB Parser]] for information on functionality.
-Note that in the end, I decided to only take URLs that were given a match score of greater than 0.9.
+The results from crawling Seed DB gave us more information for 257 companies. This is located in (sheet: final):
+ E:\McNair\Projects\Seed DB\merging work.xlsx
 ==An Overview==