<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>http://www.edegan.com/mediawiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Connorrothschild</id>
	<title>edegan.com - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="http://www.edegan.com/mediawiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Connorrothschild"/>
	<link rel="alternate" type="text/html" href="http://www.edegan.com/wiki/Special:Contributions/Connorrothschild"/>
	<updated>2026-06-02T02:07:45Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.34.2</generator>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23933</id>
		<title>U.S. Seed Accelerators</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23933"/>
		<updated>2018-08-04T18:17:28Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=U.S. Seed Accelerators&lt;br /&gt;
|Has owner=Connor Rothschild,&lt;br /&gt;
|Has start date=06/18/2018&lt;br /&gt;
|Has keywords=accelerators, data&lt;br /&gt;
|Has notes=Continuation of [[Accelerator Data]]&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Is dependent on=Industry Classifier, Demo Day Page Parser, Accelerator Demo Day, Demo Day Page Google Classifier, Crunchbase Accelerator Equity, Crunchbase Accelerator Founders, Crunchbase Data,&lt;br /&gt;
|Has image=Seed-Accelerator.jpg&lt;br /&gt;
|Does subsume=Accelerator Data, Accelerator Seed List (Data),&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Project Location==&lt;br /&gt;
The master file can be found at &lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
==Relevant Former Projects==&lt;br /&gt;
This page serves as an updated and tidied version of the data and work presented on the [[Accelerator Seed List (Data)]] Project, which subsumed [[Accelerator Data]].&lt;br /&gt;
Both of these projects (and as a corollary, this project) are dependent on the [[Demo Day Page Parser]], [[Industry Classifier]], and the [[Whois Parser]].&lt;br /&gt;
&lt;br /&gt;
==Update for Hira==&lt;br /&gt;
&lt;br /&gt;
===Final MTurk Push===&lt;br /&gt;
&lt;br /&gt;
Minh and I pushed a final batch of HITs to MTurk. Even after the earlier MTurk rounds, our data was still missing timing info for around 1000 companies. Upon further inspection, we realized that around 800 of these companies belonged to only ~10 accelerators. We think the problem was that Google returns the most recent results first, so we missed the older cohorts of large accelerators. We therefore re-ran Minh's crawler on these accelerators with different year parameters and got 650 results. &lt;br /&gt;
&lt;br /&gt;
Upon pushing these to MTurk, we got good results for 144 of them. We arrived at this number by filtering out results with no companies listed, no date listed, or no accelerator listed (after searching manually), removing duplicates, and removing accelerators we do not care about. These 144 results collectively list 1,538 companies.&lt;br /&gt;
&lt;br /&gt;
This file can be found here:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Final Turk Push.xlsx&lt;br /&gt;
&lt;br /&gt;
The next step is to plug this sheet into Grace's Python script, which gives each company its own row so that the sheet can be merged with our other data.&lt;br /&gt;
&lt;br /&gt;
===Manual Searching===&lt;br /&gt;
&lt;br /&gt;
For the other 170 companies that lacked timing info (and were not worth crawling for, because each accelerator had only a few of them), McNair Center interns searched manually. We found timing information for 128 of the 170. &lt;br /&gt;
&lt;br /&gt;
The sheet can be found here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1hGgxNwLph0tWtqO_8bNUGM-kzVXTeb-N26ojwL3TTuk/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
It is ready to be merged with our existing data.&lt;br /&gt;
&lt;br /&gt;
===Recoded Founders' Experience===&lt;br /&gt;
&lt;br /&gt;
I have updated and reclassified Founders' job titles. We began with 451 unique job titles, and were able to condense them into 16 categories, which are:&lt;br /&gt;
*Academic&lt;br /&gt;
*Advisor&lt;br /&gt;
*Board&lt;br /&gt;
*C-level&lt;br /&gt;
*CEO&lt;br /&gt;
*Director&lt;br /&gt;
*Founder&lt;br /&gt;
*Investment&lt;br /&gt;
*Management&lt;br /&gt;
*Marketing&lt;br /&gt;
*Partner&lt;br /&gt;
*President&lt;br /&gt;
*VP&lt;br /&gt;
*Other&lt;br /&gt;
&lt;br /&gt;
The formulas used to recode, the old data, and the newest, updated data can be found on this Google Sheet:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/179ML4c1cO_1zooCGj4yjuXXUPwTDKZu52656_8uoNig/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
This has been merged into the File to Rule Them All.&lt;br /&gt;
&lt;br /&gt;
===Recoded Stage===&lt;br /&gt;
&lt;br /&gt;
I have updated and cleaned up the &amp;quot;what stage accs look for companies in&amp;quot; by splitting it up into three categories:&lt;br /&gt;
*seed&lt;br /&gt;
*early stage venture&lt;br /&gt;
*late&lt;br /&gt;
&lt;br /&gt;
Other classifications were either collapsed into these three or were too rare (n&amp;lt;2) to be coded as their own classification. The Google Sheet used to recode the stage variable can be found here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1G_XbIrHB6YOU5tWs0dqZot6_eJLDsp-nxoAIuHM9_Yc/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
It has been merged into The File To Rule Them All.&lt;br /&gt;
&lt;br /&gt;
===Recoded Dead Accelerators===&lt;br /&gt;
&lt;br /&gt;
We have updated dead accelerators on the following Google Sheet:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1_mZ8QgEXwSoTeyVbiEg2ZfoQukvCHfr0NKSk2QMGYnI/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
This has been merged into The File To Rule Them All.&lt;br /&gt;
&lt;br /&gt;
===Recoded Equity/Investment===&lt;br /&gt;
&lt;br /&gt;
The Google Sheet with this work is here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1xFlFR1OAoHY4XgesB8ZAugL99DgT0OehxIEzcmYuMB8/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
It has also been merged into The File To Rule Them All.&lt;br /&gt;
&lt;br /&gt;
I have updated equity using data from https://www.seed-db.com/accelerators.&lt;br /&gt;
&lt;br /&gt;
I have also updated the columns with a new normalized version of investment. Within the File To Rule Them All, you will find two normalizations of investment:&lt;br /&gt;
*the midpoint normalization, which uses the midpoint (the average of the lower and upper bounds) of each accelerator's investment range.&lt;br /&gt;
*the upper bound normalization, which uses the highest amount each accelerator will invest.&lt;br /&gt;
&lt;br /&gt;
This dual normalization was performed because many accelerators say they invest &amp;quot;Up to $__,___&amp;quot; so a midpoint may not accurately reflect actual investment amounts.&lt;br /&gt;
&lt;br /&gt;
The average investment when using the midpoint is $40,164 and the average investment when using the upper limit is $48,313.&lt;br /&gt;
&lt;br /&gt;
'''NOTE: There may be one outlier to control for, as Boost VC says they offer between $50,000 and $500,000. This is a huge range and the upper limit of $500,000 may throw off our analysis.&lt;br /&gt;
&lt;br /&gt;
By removing Boost VC's investment amounts, the average using midpoint drops to $37,555 (~$3,000 less) and the average using upper limits drops to $43,293 (~$5,000 less). The distance between the two averages drops from ~$8,000 to ~$6,000. We should consider/discuss removing or controlling for Boost VC.'''&lt;br /&gt;
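&lt;br /&gt;
The two normalizations, and the effect of dropping an outlier, can be sketched in Python as follows. This is only an illustration, not the spreadsheet formulas actually used: the function names and the first two accelerators are hypothetical, and treating an up-to offer as a range starting at $0 is an assumption. Only the Boost VC range comes from this page.&lt;br /&gt;
&lt;br /&gt;
```python
from statistics import mean


def normalize(low, high):
    """Return (midpoint, upper_bound) for one accelerator's investment range."""
    return (low + high) / 2, high


# Hypothetical ranges in dollars. An "up to $X" offer is assumed to start at 0.
ranges = {
    "Accelerator A": (20000, 20000),   # flat amount (hypothetical)
    "Accelerator B": (0, 100000),      # "up to $100,000" (hypothetical)
    "Boost VC": (50000, 500000),       # range quoted on this page
}


def averages(ranges, exclude=()):
    """Average midpoint and average upper bound, skipping excluded names."""
    kept = [normalize(lo, hi) for name, (lo, hi) in ranges.items()
            if name not in exclude]
    return mean(m for m, _ in kept), mean(u for _, u in kept)


with_outlier = averages(ranges)
without_outlier = averages(ranges, exclude=("Boost VC",))
print(with_outlier, without_outlier)
```
&lt;br /&gt;
Comparing the two tuples shows how much a single wide range can move both averages, which is the concern raised about Boost VC.&lt;br /&gt;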
&lt;br /&gt;
===Amazon Mechanical Turk Pricing===&lt;br /&gt;
Information can be found here [http://mcnair.bakerinstitute.org/wiki/Accelerator_Demo_Day#Pricing]&lt;br /&gt;
&lt;br /&gt;
===Recoded Founders Education===&lt;br /&gt;
&lt;br /&gt;
I have recoded two components of the founders' education sheet:&lt;br /&gt;
&lt;br /&gt;
1) Degree name has been reclassified into nine categories:&lt;br /&gt;
*High School&lt;br /&gt;
*Associates&lt;br /&gt;
*Bachelors&lt;br /&gt;
*Masters&lt;br /&gt;
*Certificate&lt;br /&gt;
*JD&lt;br /&gt;
*MBA&lt;br /&gt;
*PhD&lt;br /&gt;
*Other&lt;br /&gt;
&lt;br /&gt;
2) Majors have also been recoded into nine categories:&lt;br /&gt;
*H = Humanities&lt;br /&gt;
*SS = Social Sciences&lt;br /&gt;
*NS = Natural Sciences&lt;br /&gt;
*E = Engineering (includes computer science)&lt;br /&gt;
*B = Business and Economics&lt;br /&gt;
*L = Leadership&lt;br /&gt;
*MBA&lt;br /&gt;
*JD&lt;br /&gt;
*O = Other&lt;br /&gt;
&lt;br /&gt;
The Google Sheet I used to reclassify can be found here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1XWtCTeaof8WxAuOCbn3XEFaK0sh-ZIif2JqcdbkH72I/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
The sheet '''OLD''' contains old, outdated data, '''WORK''' contains the formulas and process used to reclassify, and '''Updated Info''' contains only good, updated data. &lt;br /&gt;
&lt;br /&gt;
The data has been merged into '''The File to Rule Them All'''. Old data has not yet been deleted but can be at any time, since it is saved in the Google Sheet '''OLD''' (I am just waiting for the go-ahead from Hira or Ed).&lt;br /&gt;
&lt;br /&gt;
===Recoded multiple campuses and cohorts===&lt;br /&gt;
&lt;br /&gt;
The File to Rule Them All contains an updated address variable. 139 of our 166 accelerators have addresses that are available online. The ones we could not find information for are left blank.&lt;br /&gt;
&lt;br /&gt;
The collaborative sheet that Hira, Maxine, and I worked on to update this list can be accessed here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1nktgJZfm3L8IsSCHgYbPasSdvKb7QHKxZp8K5Si2iNo/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
We have also added a new cohort list. Under The File To Rule Them All, the sheet '''Multiple Campuses''' contains the different locations for accelerators with multiple campuses. Column B can be used to filter out multiple location accelerators.&lt;br /&gt;
&lt;br /&gt;
===Fixed Manual Data from Google Sheet===&lt;br /&gt;
&lt;br /&gt;
We created a new sheet with only data we want to keep, and cleaned it up. That sheet is called &amp;quot;Good Data Only&amp;quot;, at the same link:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
I first used our &amp;quot;recap&amp;quot; and &amp;quot;announced&amp;quot; classification to standardize and fix the dates.&lt;br /&gt;
&lt;br /&gt;
*Columns N-R contain our new data. Please note that all of these columns are based on formulas and will be made erroneous if edited.&lt;br /&gt;
*Column N is the # of weeks for an accelerator program, gathered via VLookup from The File to Rule Them All.&lt;br /&gt;
*Column O is the Actual Date we want to record, and was gathered by subtracting the # of weeks from a date '''if''' the listed page was a '''recap'''.&lt;br /&gt;
*Columns P and Q are the Month and Years stripped from Column O.&lt;br /&gt;
*Finally, Column R is the season variable, as Ed said it should be coded. &lt;br /&gt;
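&lt;br /&gt;
The date logic behind Columns O through Q can be sketched in Python. This is an illustration rather than the sheet's actual formulas: the function name and the example date are hypothetical, and the season coding in Column R is left out because its rules are not described on this page.&lt;br /&gt;
&lt;br /&gt;
```python
from datetime import date, timedelta


def actual_date(listed, program_weeks, page_type):
    """Return the date to record for a cohort.

    A "recap" page is written at the end of a program, so the program
    length is subtracted. Any other page type (e.g. "announced") is
    assumed to already carry the date we want.
    """
    if page_type == "recap":
        return listed - timedelta(weeks=program_weeks)
    return listed


# Hypothetical example: a recap dated 2018-08-01 for a 12-week program.
recorded = actual_date(date(2018, 8, 1), 12, "recap")
month, year = recorded.month, recorded.year   # Columns P and Q
print(recorded, month, year)
```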
&lt;br /&gt;
We have also gone through and removed all bad data, all duplicates, and all rows without timing info. This is the most complete list possible.&lt;br /&gt;
&lt;br /&gt;
===Recoded employee count===&lt;br /&gt;
&lt;br /&gt;
I have added a new column in Cohorts Final (of the File to Rule Them All) but left the old column in place in case you would prefer to edit/classify differently.&lt;br /&gt;
Column AB (emp_count_scale) contains a variable coded on a scale of 1 to 9, with each number corresponding to one of the employee_count classifications (1 the lowest, 9 the highest). The exact output can be modified (1 could instead be tiny, 2 be small... 9 be huge).&lt;br /&gt;
&lt;br /&gt;
The employee count column is standardized and can easily be edited given some modification of the Excel formula.&lt;br /&gt;
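&lt;br /&gt;
As a rough Python equivalent of the emp_count_scale recoding: the bucket labels below are assumed to follow Crunchbase's employee count ranges and should be checked against the values actually present in the employee_count column. The function name is hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
# Assumed employee_count labels, ordered from smallest to largest.
BUCKETS = [
    "1-10", "11-50", "51-100", "101-250", "251-500",
    "501-1000", "1001-5000", "5001-10000", "10000+",
]

# Map each label to its rank: 1 is the lowest bucket, 9 the highest.
SCALE = {label: rank for rank, label in enumerate(BUCKETS, start=1)}


def emp_count_scale(label):
    """Return the 1 to 9 rank for a label, or None if it is missing or unknown."""
    return SCALE.get(label)


print(emp_count_scale("11-50"))
```
&lt;br /&gt;
Swapping the ranks for words (tiny, small, and so on) only requires changing the values in the mapping.&lt;br /&gt;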
&lt;br /&gt;
==Recent Work==&lt;br /&gt;
&lt;br /&gt;
Here's a project update on the work that has been done since coming to McNair. The most recent file is&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
===Merging Cohort Companies with Crunchbase Info===&lt;br /&gt;
&lt;br /&gt;
More information on this part of the project can be found on the page [[Merging Existing Data with Crunchbase]].&lt;br /&gt;
&lt;br /&gt;
The newest updated sheet of cohort company info is under the '''Cohorts Final''' sheet of '''The File to Rule Them All.xlsx'''.&lt;br /&gt;
&lt;br /&gt;
Working with [[Maxine Tao]], we have matched companies to their respective pages and information found in Crunchbase (via UUID). We ensured single matches by doing a 1-1-1-1 match with our data and with Crunchbase (using the Matcher). We then received additional information on 8092 companies. The following new information (on top of what we already had) is included in the sheet:&lt;br /&gt;
*short_description&lt;br /&gt;
*long_description&lt;br /&gt;
*category_list (details the company's category)&lt;br /&gt;
*category_group_list (a less refined, more all-encompassing category classification)&lt;br /&gt;
*founded_on date&lt;br /&gt;
*employee_count&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*address&lt;br /&gt;
&lt;br /&gt;
And the following information was also pulled from Crunchbase and '''merged''' with our existing data: &lt;br /&gt;
*URL (was merged with courl cells) &lt;br /&gt;
*city (was merged with colocation)&lt;br /&gt;
*state_code (was merged with colocation)&lt;br /&gt;
*country_code (was merged with colocation)&lt;br /&gt;
*status (was merged with costatus)&lt;br /&gt;
&lt;br /&gt;
===The Equity Variables===&lt;br /&gt;
&lt;br /&gt;
[[Maxine Tao]] and I have added six new variables to the '''Accelerators Final''' sheet. Those variables are:&lt;br /&gt;
&lt;br /&gt;
*Terms of joining - terms of joining accelerator and important details about program&lt;br /&gt;
*Equity? (1/0) - cells contain a 1 if the accelerator takes equity, a 0 if it definitively does not, and are blank if we could not find that information&lt;br /&gt;
*Equity Amount - the % of equity the accelerator will take (can sometimes be a range, e.g. 5-7%)&lt;br /&gt;
*Equity Amount Normalized - this copies the previous column but only keeps %&amp;gt;0, and if a range was given (e.g. 5-7%) it returns the average (e.g. 6%)&lt;br /&gt;
*Investment Amount - the $ the accelerator invests in a company to begin, if relevant (also could be a range or a &amp;quot;up to $######&amp;quot;)&lt;br /&gt;
*Investment Notes - anything to comment on previous 4 columns&lt;br /&gt;
&lt;br /&gt;
These six variables tell us more about the characteristics of accelerators; specifically, which ones take equity and which ones do not, and how much equity accelerators take.&lt;br /&gt;
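&lt;br /&gt;
The rule behind Equity Amount Normalized can be sketched in Python. This is a hedged illustration, not the formula used in the sheet; the function name is hypothetical, and it assumes the Equity Amount cell is free text such as 6% or 5-7%.&lt;br /&gt;
&lt;br /&gt;
```python
import re


def normalize_equity(text):
    """Return the equity % as a single number, or None.

    A range such as "5-7%" becomes its average (6.0). Cells that are
    empty or report 0% return None, since only positive values are kept.
    """
    values = [float(v) for v in re.findall(r"\d+(?:\.\d+)?", text)]
    if not values or max(values) == 0:
        return None
    return sum(values) / len(values)


print(normalize_equity("5-7%"), normalize_equity("6%"), normalize_equity("0%"))
```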
&lt;br /&gt;
Relevant information: &lt;br /&gt;
*82 accelerators take equity, 42 do not, and we lack information for 37.&lt;br /&gt;
*The average % of equity among accelerators that take equity is 6.49% (a rough estimate; do not use it for anything official). We got this number by looking only at accelerators that take equity, coding each reported range as its midpoint (e.g. 4%-10% equity would be coded as 7% equity), and taking the mean.&lt;br /&gt;
&lt;br /&gt;
===Matching Accelerators to UUIDs via Crunchbase===&lt;br /&gt;
&lt;br /&gt;
We've also added UUIDs for 163 of our 166 accelerators. The UUIDs can be found in Column AE of the '''Accelerators Final''' sheet.&lt;br /&gt;
&lt;br /&gt;
The file with accelerators matched to Crunchbase UUIDs can be found at:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Accelerators and UUIDs.xlsx&lt;br /&gt;
&lt;br /&gt;
This is the master file and should never be modified unless we find a UUID changed. ALL OTHER SHEETS with UUIDs are linked to this sheet so its changes will be reflected elsewhere.&lt;br /&gt;
&lt;br /&gt;
More information can be found on the [[Crunchbase Data]] page. &lt;br /&gt;
&lt;br /&gt;
===Linking Accelerators to Founders/LinkedIn Crawling===&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] got the [[LinkedIn Crawler (Python)]] to work, which means we currently have the following information about accelerator founders:&lt;br /&gt;
*Current Job Title&lt;br /&gt;
*Location&lt;br /&gt;
*Employer&lt;br /&gt;
*Job(s) Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location of jobs&lt;br /&gt;
*Extra Description&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
This information can be found in the various '''Founders''' sheets in '''The File to Rule Them All'''.&lt;br /&gt;
&lt;br /&gt;
===Finding Company URLs===&lt;br /&gt;
See http://mcnair.bakerinstitute.org/wiki/URL_Finder_(Tool)#Summer_2018_URL_Finder_work for more details.&lt;br /&gt;
&lt;br /&gt;
===Seed DB Parser===&lt;br /&gt;
See [[Seed DB Parser]] for information on functionality. &lt;br /&gt;
&lt;br /&gt;
The results from crawling Seed DB gave us more information for 257 companies. This is located in (sheet: final):&lt;br /&gt;
 E:\McNair\Projects\Seed DB\merging work.xlsx&lt;br /&gt;
&lt;br /&gt;
==An Overview==&lt;br /&gt;
This project will be used to determine which accelerators are the most effective at churning out successful startups, as well as what characteristics are exhibited by these accelerators. First, we need to gather as much data as we can about as many accelerators as we can in order to look at factors that differentiate successful vs. unsuccessful ventures. Next, we need to create a web crawling program which will gather information about accelerators across the world by accessing their websites and extracting information. I believe that our overall goal with this research project is to gain insight into the methods of successful accelerators, as well as to find out what exactly differentiates very successful accelerators from dead accelerators.&lt;br /&gt;
&lt;br /&gt;
Helpful Links: http://seedrankings.com/&lt;br /&gt;
&lt;br /&gt;
==Remaining To Dos==&lt;br /&gt;
&lt;br /&gt;
The last update on [[Accelerator Seed List (Data)]] said the following needed to be done: &lt;br /&gt;
&lt;br /&gt;
*Cross-reference sheet with data from Peter's old accelerator consolidation file (&amp;quot;accelerator_data_noflag&amp;quot; and &amp;quot;accelerator_data&amp;quot; in &amp;quot;All Relevant Files&amp;quot;) and fill in missing data&lt;br /&gt;
*Variables that are 100% NOT in these 2 files:&lt;br /&gt;
**Cohort Breakout?&lt;br /&gt;
**Subtype&lt;br /&gt;
**Designed for Students?&lt;br /&gt;
**Campuses&lt;br /&gt;
**Stage&lt;br /&gt;
**Software Tech&lt;br /&gt;
**What stage do they look for?&lt;br /&gt;
&lt;br /&gt;
==(Outdated) Necessary Steps==&lt;br /&gt;
&lt;br /&gt;
'''Acquiring the necessary data to complete the Accelerator Master Variable List and the Cohort List will require the following (not necessarily in this order):'''&lt;br /&gt;
&lt;br /&gt;
===Step Zero: Connect to Crunchbase and Link Data===&lt;br /&gt;
&lt;br /&gt;
Complete - more info: [[Crunchbase Data]]&lt;br /&gt;
&lt;br /&gt;
===Step One: LinkedIn Founders Data===&lt;br /&gt;
&lt;br /&gt;
This project will begin by working with [[Grace Tan]] and [[Maxine Tao]] to connect accelerators to their founders and cohort companies using Crunchbase and LinkedIn crawlers. Grace and Maxine will go through [http://crunchbase.com Crunchbase] and find the UUID for companies and their founders (reference [[Crunchbase Data]], [[Crunchbase Accelerator Founders]], [[Crunchbase Accelerator Equity]]). Connect them using SQL and feed the names of founders into our LinkedIn crawler (headed by [[Grace Tan]]).&lt;br /&gt;
&lt;br /&gt;
The list of founders for accelerators can be found at&lt;br /&gt;
 McNair/Projects/Accelerators/Fall 2017/founders_linkedin.txt&lt;br /&gt;
&lt;br /&gt;
The '''Unfound Founders''' file codes a 0 for all companies '''''not listed''''' within the LinkedIn Founders file, and a 1 for those that do have founders listed.&lt;br /&gt;
&lt;br /&gt;
'''Given the founders' names, we will then be able to use the [[LinkedIn Crawler (Python)]] to find the relevant details of an accelerator founder (education, work experience, etc.) This data on founders will help us solve the horse, jockey, racetrack question to detect what variables affect a startup's success (the accelerator, the founders, the environment/city).'''&lt;br /&gt;
&lt;br /&gt;
===Step Two: Linking Accelerators to Cohorts Using Investments on Crunchbase===&lt;br /&gt;
&lt;br /&gt;
In this step we focus on accelerators who take equity from the companies that engage in their program. We do this to prevent looking at accelerators who may also run funds/invest in various companies but do not take equity. This would provide us misleading results and lead us to believe some companies are in cohorts at accelerators that they are really not. &lt;br /&gt;
&lt;br /&gt;
Maxine will acquire the list of accelerators who take equity from companies from the following sheet:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/All Relevant Files/accelerator_data_noflag.txt&lt;br /&gt;
&lt;br /&gt;
Looking at the file, however, shows that very few are actually categorized well and the equity variable is messy. Moving forward, we need to check/refine/fix this classification.&lt;br /&gt;
&lt;br /&gt;
This file has 266 rows. The most recent, actual version of our accelerator database (found at E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx under the sheet Master Variable List) only has 167 rows, meaning the accelerator_data_noflag.txt file has too many rows. &lt;br /&gt;
&lt;br /&gt;
We will need to do a left join of Accelerator Master List with accelerator_data_noflag.txt to get rid of the accelerator names that are in accelerator_data_noflag but NOT in Accelerator Master List.&lt;br /&gt;
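&lt;br /&gt;
A left join of this kind can be sketched in plain Python. The rows below are hypothetical stand-ins for the two files, and matching on the exact accelerator name is an assumption; real names may need cleaning before they match.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical stand-in for the Accelerator Master List (the left table).
master_list = ["Techstars", "Y Combinator"]

# Hypothetical stand-in for accelerator_data_noflag.txt: name mapped to equity flag.
noflag = {
    "Techstars": "Y",
    "Some Defunct Accelerator": "N",   # not in the master list, so it is dropped
}

# Left join: keep every master row, attach the equity flag when one exists.
joined = [(name, noflag.get(name)) for name in master_list]

for name, equity in joined:
    print(name, equity)
```
&lt;br /&gt;
Every accelerator in the master list survives the join, and those without a match carry an empty equity value that still needs to be filled in.&lt;br /&gt;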
&lt;br /&gt;
Once this is finished, we should have an “Equity” classification variable for every accelerator in Accelerator Master List. The accelerators that have a Y (or maybe it’s a 1) are the ones that do take equity. These are the accelerators we’ll be able to do the Crunchbase work on to see when accelerators take equity.&lt;br /&gt;
&lt;br /&gt;
We then look at the accelerators' investments (or at companies and the entities that invested in them) and cross-reference the list of companies/accelerators. Once we find a match, we know that a company went through an accelerator and in which year it went through a cohort.&lt;br /&gt;
&lt;br /&gt;
'''From this, we get the following data:'''&lt;br /&gt;
*Accelerator a given company went through&lt;br /&gt;
*Year said company went through a cohort/Specific cohort company went through&lt;br /&gt;
&lt;br /&gt;
===Step Three: Demo Day Crawler===&lt;br /&gt;
&lt;br /&gt;
This part of the project relies on the contributions of the wonderful [[Minh Le]]. Better documentation for the project can be found on the [[Demo Day Page Parser]], [[Demo Day Page Google Classifier]], and [[Accelerator Demo Day]] project pages.&lt;br /&gt;
&lt;br /&gt;
Essentially, this part of the accelerator data project will use the [[Demo Day Page Parser]] to look through accelerator websites for pages which list a cohort's 'Demo Day', or the day in which accelerators present their companies to a group of special investors ([https://www.ycombinator.com/demoday/faq/ here's an example FAQ page from Y Combinator]). The [[Demo Day Page Google Classifier]] will then determine if the page is, in fact, a demo day page. &lt;br /&gt;
&lt;br /&gt;
'''Given a cohort's demo day, we can gather a few pieces of key information (check with Ed to make sure this is the correct information to gather from Demo Days):'''&lt;br /&gt;
*The date a cohort began/the season the cohort went through the accelerator&lt;br /&gt;
**This is acquired by looking at the cohort's demo day date, and subtracting the number of weeks/months of a cohort for that given accelerator. The length of a cohort can be found in the file:&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx&lt;br /&gt;
*I assume we can also acquire the companies in a specific cohort, as in we will have the list of all companies in the (for example) Spring 2018 Cohort of Techstars.&lt;br /&gt;
&lt;br /&gt;
===Step Four: Non-profit Finder===&lt;br /&gt;
&lt;br /&gt;
More at [[Non-profit Finder]]&lt;br /&gt;
&lt;br /&gt;
Another important step in this project is finding which accelerators are non-profits.&lt;br /&gt;
&lt;br /&gt;
A comprehensive list of nonprofits taken from the [https://www.irs.gov/charities-non-profits/tax-exempt-organization-search-bulk-data-downloads IRS] can be found here:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Nonprofits in US.xlsx&lt;br /&gt;
Warning: this file has 1 million rows&lt;br /&gt;
&lt;br /&gt;
This file should be cross-referenced with the list of accelerators to find which ones are listed--those accelerators are non-profits. We'll need someone good in SQL to do this with a match.&lt;br /&gt;
&lt;br /&gt;
Potential problem:&lt;br /&gt;
*The names of listed non-profits will likely be somewhat different than the names of accelerators, because companies often file for tax exemption with different names than they show the public.&lt;br /&gt;
&lt;br /&gt;
==Workflow Image==&lt;br /&gt;
&lt;br /&gt;
(the color coding for this image is a very rough and preliminary estimate subject to change) I imagine the project will look something like this, in that it will require the following information to fully complete the project:&lt;br /&gt;
&lt;br /&gt;
[[File:image1.png]]&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23875</id>
		<title>U.S. Seed Accelerators</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23875"/>
		<updated>2018-07-31T23:25:50Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: /* Update for Hira */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=U.S. Seed Accelerators&lt;br /&gt;
|Has owner=Connor Rothschild,&lt;br /&gt;
|Has start date=06/18/2018&lt;br /&gt;
|Has keywords=accelerators, data&lt;br /&gt;
|Has notes=Continuation of [[Accelerator Data]]&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Is dependent on=Industry Classifier, Demo Day Page Parser, Accelerator Demo Day, Demo Day Page Google Classifier, Crunchbase Accelerator Equity, Crunchbase Accelerator Founders, Crunchbase Data,&lt;br /&gt;
|Has image=Seed-Accelerator.jpg&lt;br /&gt;
|Does subsume=Accelerator Data, Accelerator Seed List (Data),&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Project Location==&lt;br /&gt;
The master file can be found at &lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
==Relevant Former Projects==&lt;br /&gt;
This page serves as an updated and tidied version of the data and work presented on the [[Accelerator Seed List (Data)]] Project, which subsumed [[Accelerator Data]].&lt;br /&gt;
Both of these projects (and as a corollary, this project) are dependent on the [[Demo Day Page Parser]], [[Industry Classifier]], and the [[Whois Parser]].&lt;br /&gt;
&lt;br /&gt;
==Update for Hira==&lt;br /&gt;
&lt;br /&gt;
===Recoded Founders' Experience===&lt;br /&gt;
&lt;br /&gt;
I have updated and reclassified Founders' job titles. We began with 451 unique job titles, and were able to condense them into 16 categories, which are:&lt;br /&gt;
*Academic&lt;br /&gt;
*Advisor&lt;br /&gt;
*Board&lt;br /&gt;
*C-level&lt;br /&gt;
*CEO&lt;br /&gt;
*Director&lt;br /&gt;
*Founder&lt;br /&gt;
*Investment&lt;br /&gt;
*Management&lt;br /&gt;
*Marketing&lt;br /&gt;
*Partner&lt;br /&gt;
*President&lt;br /&gt;
*VP&lt;br /&gt;
*Other&lt;br /&gt;
&lt;br /&gt;
The formulas used to recode, the old data, and the newest, updated data can be found on this Google Sheet:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/179ML4c1cO_1zooCGj4yjuXXUPwTDKZu52656_8uoNig/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
This has been merged into the File to Rule Them All.&lt;br /&gt;
&lt;br /&gt;
===Recoded Stage===&lt;br /&gt;
&lt;br /&gt;
I have updated and cleaned up the &amp;quot;what stage accs look for companies in&amp;quot; by splitting it up into three categories:&lt;br /&gt;
*seed&lt;br /&gt;
*early stage venture&lt;br /&gt;
*late&lt;br /&gt;
&lt;br /&gt;
Other classifications were either collapsed into these three or were too rare (n&amp;lt;2) to be coded as their own classification. The Google Sheet used to recode the stage variable can be found here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1G_XbIrHB6YOU5tWs0dqZot6_eJLDsp-nxoAIuHM9_Yc/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
It has been merged into The File To Rule Them All.&lt;br /&gt;
&lt;br /&gt;
===Recoded Dead Accelerators===&lt;br /&gt;
&lt;br /&gt;
We have updated dead accelerators on the following Google Sheet:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1_mZ8QgEXwSoTeyVbiEg2ZfoQukvCHfr0NKSk2QMGYnI/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
This has been merged into The File To Rule Them All.&lt;br /&gt;
&lt;br /&gt;
===Recoded Equity/Investment===&lt;br /&gt;
&lt;br /&gt;
The Google Sheet with this work is here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1xFlFR1OAoHY4XgesB8ZAugL99DgT0OehxIEzcmYuMB8/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
It has also been merged into The File To Rule Them All.&lt;br /&gt;
&lt;br /&gt;
I have updated equity using data from https://www.seed-db.com/accelerators.&lt;br /&gt;
&lt;br /&gt;
I have also updated the columns with a new normalized version of investment. Within the File To Rule Them All, you will find two normalizations of investment:&lt;br /&gt;
*the midpoint normalization, which uses the midpoint (the average of the lower and upper bounds) of each accelerator's investment range.&lt;br /&gt;
*the upper bound normalization, which uses the highest amount each accelerator will invest.&lt;br /&gt;
&lt;br /&gt;
This dual normalization was performed because many accelerators say they invest &amp;quot;Up to $__,___&amp;quot; so a midpoint may not accurately reflect actual investment amounts.&lt;br /&gt;
&lt;br /&gt;
The average investment when using the midpoint is $40,164 and the average investment when using the upper limit is $48,313.&lt;br /&gt;
&lt;br /&gt;
'''NOTE: There may be one outlier to control for, as Boost VC says they offer between $50,000 and $500,000. This is a huge range and the upper limit of $500,000 may throw off our analysis.&lt;br /&gt;
&lt;br /&gt;
By removing Boost VC's investment amounts, the average using midpoint drops to $37,555 (~$3,000 less) and the average using upper limits drops to $43,293 (~$5,000 less). The distance between the two averages drops from ~$8,000 to ~$6,000. We should consider/discuss removing or controlling for Boost VC.'''&lt;br /&gt;
&lt;br /&gt;
===Amazon Mechanical Turk Pricing===&lt;br /&gt;
Information can be found here [http://mcnair.bakerinstitute.org/wiki/Accelerator_Demo_Day#Pricing]&lt;br /&gt;
&lt;br /&gt;
===Recoded Founders Education===&lt;br /&gt;
&lt;br /&gt;
I have recoded two components of the founders' education sheet:&lt;br /&gt;
&lt;br /&gt;
1) Degree name has been reclassified into nine categories:&lt;br /&gt;
*High School&lt;br /&gt;
*Associates&lt;br /&gt;
*Bachelors&lt;br /&gt;
*Masters&lt;br /&gt;
*Certificate&lt;br /&gt;
*JD&lt;br /&gt;
*MBA&lt;br /&gt;
*PhD&lt;br /&gt;
*Other&lt;br /&gt;
&lt;br /&gt;
2) Majors have also been recoded into nine categories:&lt;br /&gt;
*H = Humanities&lt;br /&gt;
*SS = Social Sciences&lt;br /&gt;
*NS = Natural Sciences&lt;br /&gt;
*E = Engineering (includes computer science)&lt;br /&gt;
*B = Business and Economics&lt;br /&gt;
*L = Leadership&lt;br /&gt;
*MBA&lt;br /&gt;
*JD&lt;br /&gt;
*O = Other&lt;br /&gt;
&lt;br /&gt;
The Google Sheet I used to reclassify can be found here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1XWtCTeaof8WxAuOCbn3XEFaK0sh-ZIif2JqcdbkH72I/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
The sheet '''OLD''' contains old, outdated data, '''WORK''' contains the process/formulas used for reclassifying, and '''Updated Info''' contains only good, updated data. &lt;br /&gt;
&lt;br /&gt;
The data has been merged into '''The File to Rule Them All'''. Old data has not yet been deleted but can be at any time, since we have it saved in the Google Sheet '''OLD''' (just wanted the go ahead from Hira or Ed).&lt;br /&gt;
&lt;br /&gt;
===Recoded multiple campuses and cohorts===&lt;br /&gt;
&lt;br /&gt;
The File to Rule Them All contains an updated address variable. 139 of our 166 accelerators have addresses that are available online. The ones we could not find information for are left blank.&lt;br /&gt;
&lt;br /&gt;
The collaborative sheet that Hira, Maxine, and I worked on to update this list can be accessed here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1nktgJZfm3L8IsSCHgYbPasSdvKb7QHKxZp8K5Si2iNo/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
We have also added a new cohort list. Under The File To Rule Them All, the sheet '''Multiple Campuses''' contains the different locations for accelerators with multiple campuses. Column B can be used to filter out multiple location accelerators.&lt;br /&gt;
&lt;br /&gt;
===Fixed Manual Data from Google Sheet===&lt;br /&gt;
&lt;br /&gt;
We created a new sheet with only data we want to keep, and cleaned it up. That sheet is called &amp;quot;Good Data Only&amp;quot;, at the same link:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
I first used our &amp;quot;recap&amp;quot; and &amp;quot;announced&amp;quot; classification to standardize and fix the dates.&lt;br /&gt;
&lt;br /&gt;
*Columns N-R contain our new data. Please note that all of these columns are based on formulas and will be made erroneous if edited.&lt;br /&gt;
*Column N is the # of weeks for an accelerator program, gathered via VLookup from The File to Rule Them All.&lt;br /&gt;
*Column O is the Actual Date we want to record, and was gathered by subtracting the # of weeks from a date '''if''' the listed page was a '''recap'''.&lt;br /&gt;
*Columns P and Q are the Month and Years stripped from Column O.&lt;br /&gt;
*Finally, Column R is the season variable, as Ed said it should be coded. &lt;br /&gt;
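&lt;br /&gt;
The date logic behind Columns O to Q can be sketched in Python. This is only an illustration of the rule described above, with hypothetical names; it assumes that pages classified as "announced" already carry the date we want, and it does not reproduce the season coding in Column R.&lt;br /&gt;

```python
from datetime import date, timedelta

def cohort_date(page_date, page_type, program_weeks):
    """Date to record for a cohort, given the date found on a demo day page."""
    # Recap pages are written at the end of a program, so shift back by its length.
    if page_type == "recap":
        return page_date - timedelta(weeks=program_weeks)
    # Assumption: any other page type already carries the date to record.
    return page_date

recorded = cohort_date(date(2018, 6, 15), "recap", 12)
month, year = recorded.month, recorded.year  # analogous to Columns P and Q
```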
&lt;br /&gt;
We have also gone through and removed all bad data, all duplicates, and all rows without timing info. This is the most complete list possible.&lt;br /&gt;
&lt;br /&gt;
===Recoded employee count===&lt;br /&gt;
&lt;br /&gt;
I have added a new column in Cohorts Final (of the File to Rule Them All) yet left the old column in case you would prefer to edit/classify differently.&lt;br /&gt;
Column AB (emp_count_scale) contains a variable coded on a scale of 1 to 9, with each number corresponding to one of the employee_count classifications (1 the lowest, 9 the highest). The exact output can be modified (1 could instead be tiny, 2 be small... 9 be huge).&lt;br /&gt;
&lt;br /&gt;
The employee count column is standardized and can easily be edited given some modification of the Excel formula.&lt;br /&gt;
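&lt;br /&gt;
For reference, the same recoding can be sketched in Python. The nine bucket labels below are an assumption about how the employee_count classifications are written; check them against the actual values in Cohorts Final before relying on this.&lt;br /&gt;

```python
# Assumed employee_count classifications, ordered from smallest to largest.
EMPLOYEE_BUCKETS = [
    "1-10", "11-50", "51-100", "101-250", "251-500",
    "501-1000", "1001-5000", "5001-10000", "10000+",
]

# Map each classification to its position on the 1 to 9 scale.
EMP_COUNT_SCALE = {bucket: rank for rank, bucket in enumerate(EMPLOYEE_BUCKETS, start=1)}

def emp_count_scale(employee_count):
    """Return the 1 to 9 scale value, or None for blank or unrecognized input."""
    return EMP_COUNT_SCALE.get((employee_count or "").strip())
```

Swapping the numbers for labels (tiny, small, and so on) only requires changing the values in EMP_COUNT_SCALE.&lt;br /&gt;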
&lt;br /&gt;
==Recent Work==&lt;br /&gt;
&lt;br /&gt;
Here's a project update on the work that has been done since coming to McNair. The most recent file is&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
===Merging Cohort Companies with Crunchbase Info===&lt;br /&gt;
&lt;br /&gt;
More information on this part of the project can be found on the page [[Merging Existing Data with Crunchbase]].&lt;br /&gt;
&lt;br /&gt;
The newest updated sheet of cohort company info is under the '''Cohorts Final''' sheet of '''The File to Rule Them All.xlsx'''.&lt;br /&gt;
&lt;br /&gt;
Working with [[Maxine Tao]], we have matched companies to their respective pages and information found in Crunchbase (via UUID). We ensured single matches by doing a 1-1-1-1 match with our data and with Crunchbase (using the Matcher). We then received additional information on 8092 companies. The following new information (on top of what we already had) is included in the sheet:&lt;br /&gt;
*short_description&lt;br /&gt;
*long_description&lt;br /&gt;
*category_list (details the company's category)&lt;br /&gt;
*category_group_list (a less refined, more all-encompassing category classification)&lt;br /&gt;
*founded_on date&lt;br /&gt;
*employee_count&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*address&lt;br /&gt;
&lt;br /&gt;
And the following information was also pulled from Crunchbase and '''merged''' with our existing data: &lt;br /&gt;
*URL (was merged with courl cells) &lt;br /&gt;
*city (was merged with colocation)&lt;br /&gt;
*state_code (was merged with colocation)&lt;br /&gt;
*country_code (was merged with colocation)&lt;br /&gt;
*status (was merged with costatus)&lt;br /&gt;
&lt;br /&gt;
===The Equity Variables===&lt;br /&gt;
&lt;br /&gt;
[[Maxine Tao]] and I have added six new variables to the '''Accelerators Final''' sheet. Those variables are:&lt;br /&gt;
&lt;br /&gt;
*Terms of joining - terms of joining accelerator and important details about program&lt;br /&gt;
*Equity? (1/0) - cells contain a 1 if the accelerator takes equity, a 0 if an accelerator definitively does not, and are blank if we could not find that information&lt;br /&gt;
*Equity Amount - the % of equity the accelerator will take (can sometimes be a range (eg. 5-7%))&lt;br /&gt;
*Equity Amount Normalized - this copies the previous column but only keeps %&amp;gt;0, and if a range was given (e.g. 5-7%) it returns the average (e.g. 6%)&lt;br /&gt;
*Investment Amount - the $ the accelerator invests in a company to begin, if relevant (also could be a range or a &amp;quot;up to $######&amp;quot;)&lt;br /&gt;
*Investment Notes - anything to comment on previous 4 columns&lt;br /&gt;
&lt;br /&gt;
These six variables tell us more about the characteristics of accelerators; specifically, which ones take equity and which ones do not, and how much equity accelerators take.&lt;br /&gt;
&lt;br /&gt;
Relevant information: &lt;br /&gt;
*82 accelerators take equity, 42 do not, and we lack information for 37.&lt;br /&gt;
*The average % of equity among accelerators who take equity is 6.49% (a rough estimate; do not use for anything official). We got this number by looking only at accelerators who take equity, averaging the equity amount for accelerators who report a range (e.g. 4%-10% equity would be coded as 7% equity), and taking the mean.&lt;br /&gt;
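&lt;br /&gt;
The Equity Amount Normalized rule can be sketched in Python. This is a hedged illustration with a hypothetical function name, not the formula used in the sheet; it assumes equity is written as a single percentage or a simple range.&lt;br /&gt;

```python
import re

def normalize_equity(text):
    """Average the percentages in an equity description; None if blank or zero."""
    # Pull out every number, so "5-7%" yields [5.0, 7.0] and "6%" yields [6.0].
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", text)]
    if not numbers:
        return None
    average = sum(numbers) / len(numbers)
    # Zero equity is dropped, since the normalized column only keeps positive values.
    return average if average else None
```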
&lt;br /&gt;
===Matching Accelerators to UUIDs via Crunchbase===&lt;br /&gt;
&lt;br /&gt;
We've also added UUIDs for 163 of our 166 accelerators. The UUIDs can be found in Column AE of the '''Accelerators Final''' sheet.&lt;br /&gt;
&lt;br /&gt;
The file with accelerators matched to Crunchbase UUIDs can be found at:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Accelerators and UUIDs.xlsx&lt;br /&gt;
&lt;br /&gt;
This is the master file and should never be modified unless we find a UUID changed. ALL OTHER SHEETS with UUIDs are linked to this sheet so its changes will be reflected elsewhere.&lt;br /&gt;
&lt;br /&gt;
More information can be found on the [[Crunchbase Data]] page. &lt;br /&gt;
&lt;br /&gt;
===Linking Accelerators to Founders/LinkedIn Crawling===&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] got the [[LinkedIn Crawler (Python)]] to work, which means we currently have the following information about accelerator founders:&lt;br /&gt;
*Current Job Title&lt;br /&gt;
*Location&lt;br /&gt;
*Employer&lt;br /&gt;
*Job(s) Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location of jobs&lt;br /&gt;
*Extra Description&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
This information can be found in the various '''Founders''' sheets in '''The File to Rule Them All'''.&lt;br /&gt;
&lt;br /&gt;
===Finding Company URLs===&lt;br /&gt;
Excel master datasets are in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018&lt;br /&gt;
&lt;br /&gt;
Code and files specific to this URL finder are in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\url finder&lt;br /&gt;
&lt;br /&gt;
====Results====&lt;br /&gt;
I used STEP1_crawl.py and STEP2_findcorrecturl.py to add approximately 1000 more URLs into 'The File to Rule Them All.xlsx'.&lt;br /&gt;
&lt;br /&gt;
====Testing====&lt;br /&gt;
&lt;br /&gt;
In this file (sheet: 'Most Recent Merged Data' note that this is just a copy of 'Cohorts Final' in 'The File to Rule Them All'):&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\Merged W Crunchbase Data as of July 17.xlx&lt;br /&gt;
&lt;br /&gt;
We filter for companies (~4000) that did not receive VC, are not in crunchbase, and do not have URLs.&lt;br /&gt;
Using a Google crawler(STEP1_crawl.py) and URL matching script(STEP2_findcorrecturl.py), we will try to find as many URLs as possible.&lt;br /&gt;
&lt;br /&gt;
To test, I ran about 40 companies from &amp;quot;smallcompanylist.txt&amp;quot;, using only the company name as a search term and taking the first 4 valid results (see the don't collect list in the code). The Google crawler and URL matcher were able to correctly identify around 20 URLs. They also misidentify some URLs that look really similar to the company name, but they are accurate for the most part if the name is not too generic. I then ran the 20 unfound company names through the crawler again, this time using company name + startup as the search term. This did not identify any more correct URLs.&lt;br /&gt;
&lt;br /&gt;
It seems reasonable to assume that if the company URL cannot be found within the first 4 valid search results, then that company probably does not have a URL at all. This is the case for many of the unfound 20 URLs from my test run above.&lt;br /&gt;
&lt;br /&gt;
====Actual Run Info====&lt;br /&gt;
&lt;br /&gt;
The companies we needed to find URLs for are in a file called 'ACTUALNEEDEDCOMPANIES.txt'.&lt;br /&gt;
&lt;br /&gt;
The first four results for every company, as found by STEP1_crawl.py, are in 'ACTUAL_crawled_company_urls.txt'.&lt;br /&gt;
&lt;br /&gt;
The results after the matching done by STEP2_findcorrecturl.py, are in 'ACTUAL_finalurls.txt'.&lt;br /&gt;
&lt;br /&gt;
Note that in the end, I decided to only take URLs that were given a match score of greater than 0.9 by setting this restriction in STEP2_findcorrecturl.py. Then I manually removed any duplicates/inaccurate results. If you want, you can set the threshold lower in STEP2 and use STEP3_clean.py to find the URL with the highest score for each company. &lt;br /&gt;
&lt;br /&gt;
The point of this URL finder is to find timing information for companies. Timing information can be found on Whois. See the page http://mcnair.bakerinstitute.org/wiki/Whois_Parser#Summer_2018_Work for information on running the whois parser. &lt;br /&gt;
&lt;br /&gt;
====Using Python files====&lt;br /&gt;
'''To use STEP1_crawl.py''': &lt;br /&gt;
 INPUT: a list of company names (or anything) you would like to find websites for by searching on google&lt;br /&gt;
 OUTPUT: a list of company names and the top X number of results from google &lt;br /&gt;
&lt;br /&gt;
1. Change LIST_FILEPATH in line 26 to be the name of the file that contains the list of things you would like to search. &lt;br /&gt;
&lt;br /&gt;
2. Change NUMRESULT to be however many results you would like from Google. &lt;br /&gt;
&lt;br /&gt;
3. Adjust DONT_COLLECT to include any websites that you don't want. &lt;br /&gt;
&lt;br /&gt;
4. If you would like to add another search keyword, add this in line 87 which is queries.append(name + &amp;quot;whatever you want here&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
5. Change line 127 to be the name of your output file.&lt;br /&gt;
&lt;br /&gt;
'''To use STEP2_findcorrecturl.py''':&lt;br /&gt;
 INPUT: output file from STEP1&lt;br /&gt;
 OUTPUT: a file formatted the same as the output of STEP1, but URLs that do not match over the threshold value you set will be replaced with &amp;quot;no match&amp;quot; &lt;br /&gt;
&lt;br /&gt;
1. Change file f to be the output file name from STEP1. Change g to be the desired name of the output file for this part. &lt;br /&gt;
&lt;br /&gt;
2. In the if statement on line 44, set your desired threshold. Note that anything greater than 0.6 is generally considered a decent match. It might be safer to use 0.75 and my use of 0.9 ensures almost exact matches. However, you should consider that if your list of companies (or whatever you are searching) includes really common names, then a 90%+ match might not be exactly what you're looking for. &lt;br /&gt;
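&lt;br /&gt;
The thresholding idea can be sketched as follows. This is not the code in STEP2_findcorrecturl.py: the function names are hypothetical, and the use of difflib's SequenceMatcher ratio as the match score is an assumption. It also assumes each URL includes its scheme (http or https).&lt;br /&gt;

```python
from difflib import SequenceMatcher
from operator import gt
from urllib.parse import urlparse

def match_score(company, url):
    """Similarity (0 to 1) between a company name and the first label of a URL's host."""
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    label = host.split(".")[0]
    # Compare only letters and digits, so "Acme Labs" is compared as "acmelabs".
    name = "".join(ch for ch in company.lower() if ch.isalnum())
    return SequenceMatcher(None, name, label).ratio()

def filter_urls(company, urls, threshold=0.9):
    """Keep URLs scoring strictly above the threshold; replace the rest with "no match"."""
    return [u if gt(match_score(company, u), threshold) else "no match" for u in urls]
```

Lowering the threshold keeps more candidates for generic names, at the cost of more manual cleanup afterwards.&lt;br /&gt;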
&lt;br /&gt;
'''To use STEP3_clean.py''':&lt;br /&gt;
&lt;br /&gt;
Note this is an optional step to use depending on the accuracy level you need and what kind of data you crawled earlier. I chose not to use this and instead set a more restrictive threshold in STEP2. &lt;br /&gt;
&lt;br /&gt;
1. Change file f to be the output file from STEP2 (you should delete anything that says &amp;quot;no match&amp;quot;, and when you use STEP2, you must also write the ratio score to the text file). Change g to be the desired name of the output file for this part. &lt;br /&gt;
&lt;br /&gt;
Your output should be a text file containing the company name and the URL that had the highest assigned score in STEP2. In case of more than 1 URL with the highest score, the script should take the first one.&lt;br /&gt;
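&lt;br /&gt;
The selection rule described above can be sketched in Python. The function name and the (company, URL, score) tuple layout are hypothetical and do not reflect the actual file format read by STEP3_clean.py.&lt;br /&gt;

```python
def best_urls(rows):
    """rows: iterable of (company, url, score) tuples, in file order."""
    grouped = {}
    for company, url, score in rows:
        grouped.setdefault(company, []).append((url, score))
    # max() returns the first maximal item, so ties keep the earliest URL.
    return {
        company: max(candidates, key=lambda pair: pair[1])[0]
        for company, candidates in grouped.items()
    }
```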
&lt;br /&gt;
==An Overview==&lt;br /&gt;
This project will be used to determine which accelerators are the most effective at churning out successful startups, as well as what characteristics are exhibited by these accelerators. First, we need to gather as much data as we can about as many accelerators as we can in order to look at factors that differentiate successful vs. unsuccessful ventures. Next, we need to create a web crawling program which will gather information about accelerators across the world by accessing their websites and extracting information. I believe that our overall goal with this research project is to gain insight into the methods of successful accelerators, as well as to find out what exactly differentiates very successful accelerators from dead accelerators.&lt;br /&gt;
&lt;br /&gt;
Helpful Links: http://seedrankings.com/&lt;br /&gt;
&lt;br /&gt;
==Remaining To Dos==&lt;br /&gt;
&lt;br /&gt;
The last update on [[Accelerator Seed List (Data)]] said the following needed to be done: &lt;br /&gt;
&lt;br /&gt;
*Cross-reference sheet with data from Peter's old accelerator consolidation file (&amp;quot;accelerator_data_noflag&amp;quot; and &amp;quot;accelerator_data&amp;quot; in &amp;quot;All Relevant Files&amp;quot;) and fill in missing data&lt;br /&gt;
*Variables that are 100% NOT in these 2 files:&lt;br /&gt;
**Cohort Breakout?&lt;br /&gt;
**Subtype&lt;br /&gt;
**Designed for Students?&lt;br /&gt;
**Campuses&lt;br /&gt;
**Stage&lt;br /&gt;
**Software Tech&lt;br /&gt;
**What stage do they look for?&lt;br /&gt;
&lt;br /&gt;
==(Outdated) Necessary Steps==&lt;br /&gt;
&lt;br /&gt;
'''Acquiring the necessary data to complete the Accelerator Master Variable List and the Cohort List will require the following (not necessarily in this order):'''&lt;br /&gt;
&lt;br /&gt;
===Step Zero: Connect to Crunchbase and Link Data===&lt;br /&gt;
&lt;br /&gt;
Complete - more info: [[Crunchbase Data]]&lt;br /&gt;
&lt;br /&gt;
===Step One: LinkedIn Founders Data===&lt;br /&gt;
&lt;br /&gt;
This project will begin by working with [[Grace Tan]] and [[Maxine Tao]] to connect accelerators to their founders and cohort companies using Crunchbase and LinkedIn crawlers. Grace and Maxine will go through [http://crunchbase.com Crunchbase] and find the UUID for companies and their founders (reference [[Crunchbase Data]], [[Crunchbase Accelerator Founders]], [[Crunchbase Accelerator Equity]]). Connect them using SQL and feed the names of founders into our LinkedIn crawler (headed by [[Grace Tan]]).&lt;br /&gt;
&lt;br /&gt;
The list of founders for accelerators can be found at&lt;br /&gt;
 McNair/Projects/Accelerators/Fall 2017/founders_linkedin.txt&lt;br /&gt;
&lt;br /&gt;
The '''Unfound Founders''' file codes a 0 for all companies '''''not listed''''' within the LinkedIn Founders file, and a 1 for those that do have founders listed.&lt;br /&gt;
&lt;br /&gt;
'''Given the founders' names, we will then be able to use the [[LinkedIn Crawler (Python)]] to find the relevant details of an accelerator founder (education, work experience, etc.) This data on founders will help us solve the horse, jockey, racetrack question to detect what variables affect a startup's success (the accelerator, the founders, the environment/city).'''&lt;br /&gt;
&lt;br /&gt;
===Step Two: Linking Accelerators to Cohorts Using Investments on Crunchbase===&lt;br /&gt;
&lt;br /&gt;
In this step we focus on accelerators who take equity from the companies that engage in their program. We do this to prevent looking at accelerators who may also run funds/invest in various companies but do not take equity. This would provide us misleading results and lead us to believe some companies are in cohorts at accelerators that they are really not. &lt;br /&gt;
&lt;br /&gt;
Maxine will acquire the list of accelerators who take equity from companies from the following sheet:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/All Relevant Files/accelerator_data_noflag.txt&lt;br /&gt;
&lt;br /&gt;
Looking at the file, however, shows that very few are actually categorized well and the equity variable is messy. Moving forward, we need to check/refine/fix this classification.&lt;br /&gt;
&lt;br /&gt;
This file has 266 rows. The most recent, actual version of our accelerator database (found at E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx under the sheet Master Variable List) only has 167 rows, meaning the accelerator_data_noflag.txt file has too many rows. &lt;br /&gt;
&lt;br /&gt;
We will need to do a left join of Accelerator Master List with accelerator_data_noflag.txt to get rid of the accelerator names that are in accelerator_data_noflag but NOT in Accelerator Master List.&lt;br /&gt;
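&lt;br /&gt;
The effect of that left join can be sketched in Python. This is an illustration only: the "name" field and the function name are hypothetical, and matching on lowercased, trimmed names is an assumption about how the two lists line up.&lt;br /&gt;

```python
def left_join(master_names, noflag_rows):
    """Keep every master-list accelerator; attach its noflag row if one exists."""
    # Index the noflag rows by a normalized accelerator name.
    by_name = {row["name"].strip().lower(): row for row in noflag_rows}
    # Rows that appear only in the noflag file are dropped; unmatched master names get None.
    return [(name, by_name.get(name.strip().lower())) for name in master_names]
```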
&lt;br /&gt;
Once this is finished, we should have an “Equity” classification variable for every accelerator in Accelerator Master List. The accelerators that have a Y (or maybe it’s a 1) are the ones that do take equity. These are the accelerators we’ll be able to do the Crunchbase work on to see when accelerators take equity.&lt;br /&gt;
&lt;br /&gt;
We then look at the accelerators' investments (or companies and the entities which invested in them), cross-reference the list of companies/accelerators, and once we find a match, we know that a company went through an accelerator and during which year they went through a cohort.&lt;br /&gt;
&lt;br /&gt;
'''From this, we get the following data:'''&lt;br /&gt;
*Accelerator a given company went through&lt;br /&gt;
*Year said company went through a cohort/Specific cohort company went through&lt;br /&gt;
&lt;br /&gt;
===Step Three: Demo Day Crawler===&lt;br /&gt;
&lt;br /&gt;
This part of the project relies on the contributions of the wonderful [[Minh Le]]. Better documentation for the project can be found on the [[Demo Day Page Parser]], [[Demo Day Page Google Classifier]], and [[Accelerator Demo Day]] project pages.&lt;br /&gt;
&lt;br /&gt;
Essentially, this part of the accelerator data project will use the [[Demo Day Page Parser]] to look through accelerator websites for pages which list a cohort's 'Demo Day', or the day in which accelerators present their companies to a group of special investors ([https://www.ycombinator.com/demoday/faq/ here's an example FAQ page from Y Combinator]). The [[Demo Day Page Google Classifier]] will then determine if the page is, in fact, a demo day page. &lt;br /&gt;
&lt;br /&gt;
'''Given a cohort's demo day, we can gather a few pieces of key information (check with Ed to make sure this is the correct information to gather from Demo Days):'''&lt;br /&gt;
*The date a cohort began/the season the cohort went through the accelerator&lt;br /&gt;
**This is acquired by looking at the cohort's demo day date, and subtracting the number of weeks/months of a cohort for that given accelerator. The length of a cohort can be found in the file:&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx&lt;br /&gt;
*I assume we can also acquire the companies in a specific cohort, as in we will have the list of all companies in the (for example) Spring 2018 Cohort of Techstars.&lt;br /&gt;
&lt;br /&gt;
===Step Four: Non-profit Finder===&lt;br /&gt;
&lt;br /&gt;
More at [[Non-profit Finder]]&lt;br /&gt;
&lt;br /&gt;
Another important step in this project is finding which accelerators are non-profits.&lt;br /&gt;
&lt;br /&gt;
A comprehensive list of nonprofits taken from the [https://www.irs.gov/charities-non-profits/tax-exempt-organization-search-bulk-data-downloads IRS] can be found here:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Nonprofits in US.xlsx&lt;br /&gt;
Warning: this file has 1 million rows&lt;br /&gt;
&lt;br /&gt;
This file should be cross-referenced with the list of accelerators to find which ones are listed--those accelerators are non-profits. We'll need someone good in SQL to do this with a match.&lt;br /&gt;
&lt;br /&gt;
Potential problem:&lt;br /&gt;
*The names of listed non-profits will likely be somewhat different than the names of accelerators, because companies often file for tax exemption with different names than they show the public.&lt;br /&gt;
&lt;br /&gt;
==Workflow Image==&lt;br /&gt;
&lt;br /&gt;
(the color coding for this image is a very rough and preliminary estimate subject to change) I imagine the project will look something like this, in that it will require the following information to fully complete the project:&lt;br /&gt;
&lt;br /&gt;
[[File:image1.png]]&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23874</id>
		<title>U.S. Seed Accelerators</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23874"/>
		<updated>2018-07-31T23:23:30Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: /* Update for Hira */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=U.S. Seed Accelerators&lt;br /&gt;
|Has owner=Connor Rothschild,&lt;br /&gt;
|Has start date=06/18/2018&lt;br /&gt;
|Has keywords=accelerators, data&lt;br /&gt;
|Has notes=Continuation of [[Accelerator Data]]&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Is dependent on=Industry Classifier, Demo Day Page Parser, Accelerator Demo Day, Demo Day Page Google Classifier, Crunchbase Accelerator Equity, Crunchbase Accelerator Founders, Crunchbase Data,&lt;br /&gt;
|Has image=Seed-Accelerator.jpg&lt;br /&gt;
|Does subsume=Accelerator Data, Accelerator Seed List (Data),&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Project Location==&lt;br /&gt;
The master file can be found at &lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
==Relevant Former Projects==&lt;br /&gt;
This page serves as an updated and tidied version of the data and work presented on the [[Accelerator Seed List (Data)]] Project, which subsumed [[Accelerator Data]].&lt;br /&gt;
Both of these projects (and as a corollary, this project) are dependent on the [[Demo Day Page Parser]], [[Industry Classifier]], and the [[Whois Parser]].&lt;br /&gt;
&lt;br /&gt;
==Update for Hira==&lt;br /&gt;
&lt;br /&gt;
===Recoding Founders' Experience===&lt;br /&gt;
&lt;br /&gt;
I have updated and reclassified Founders' job titles. We began with 451 unique job titles, and were able to condense them into 16 categories, which are:&lt;br /&gt;
*Academic&lt;br /&gt;
*Advisor&lt;br /&gt;
*Board&lt;br /&gt;
*C-level&lt;br /&gt;
*CEO&lt;br /&gt;
*Director&lt;br /&gt;
*Founder&lt;br /&gt;
*Investment&lt;br /&gt;
*Management&lt;br /&gt;
*Marketing&lt;br /&gt;
*Partner&lt;br /&gt;
*President&lt;br /&gt;
*VP&lt;br /&gt;
*Other&lt;br /&gt;
&lt;br /&gt;
The formulas used to recode, the old data, and the newest, updated data can be found on this Google Sheet:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/179ML4c1cO_1zooCGj4yjuXXUPwTDKZu52656_8uoNig/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
This has been merged into the File to Rule Them All.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Recoding Stage===&lt;br /&gt;
&lt;br /&gt;
I have updated and cleaned up the &amp;quot;what stage accs look for companies in&amp;quot; by splitting it up into three categories:&lt;br /&gt;
*seed&lt;br /&gt;
*early stage venture&lt;br /&gt;
*late&lt;br /&gt;
&lt;br /&gt;
Other classifications were collapsed into these three or were not significant (n&amp;lt;2) enough to be coded as a classification. The Google Sheet used to recode the stage variable can be found here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1G_XbIrHB6YOU5tWs0dqZot6_eJLDsp-nxoAIuHM9_Yc/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
It has been merged into The File To Rule Them All.&lt;br /&gt;
&lt;br /&gt;
===Dead Accelerators===&lt;br /&gt;
&lt;br /&gt;
We have updated dead accelerators on the following Google Sheet&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1_mZ8QgEXwSoTeyVbiEg2ZfoQukvCHfr0NKSk2QMGYnI/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
This has been merged into The File To Rule Them All.&lt;br /&gt;
&lt;br /&gt;
===Recoding Equity/Investment===&lt;br /&gt;
&lt;br /&gt;
The Google Sheet with this work is here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1xFlFR1OAoHY4XgesB8ZAugL99DgT0OehxIEzcmYuMB8/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
It has also been merged into The File To Rule Them All.&lt;br /&gt;
&lt;br /&gt;
I have updated equity using data from https://www.seed-db.com/accelerators.&lt;br /&gt;
&lt;br /&gt;
I have also updated the columns with a new normalized version of investment. Within the File To Rule Them All, you will find two normalizations of investment:&lt;br /&gt;
*the midpoint normalization, computed by taking the midpoint of each accelerator's investment range.&lt;br /&gt;
*the upper bound normalization, computed by taking the highest amount each accelerator will invest.&lt;br /&gt;
&lt;br /&gt;
This dual normalization was performed because many accelerators say they invest &amp;quot;Up to $__,___&amp;quot; so a midpoint may not accurately reflect actual investment amounts.&lt;br /&gt;
&lt;br /&gt;
The average investment when using the midpoint is $40,164 and the average investment when using the upper limit is $48,313.&lt;br /&gt;
&lt;br /&gt;
'''NOTE: There may be one outlier to control for, as Boost VC says they offer **between $50,000 and $500,000**. This is a huge range and the upper limit of $500,000 may throw off our analysis.&lt;br /&gt;
&lt;br /&gt;
By removing Boost VC's investment amounts, the average using midpoint drops to $37,555 (~$3,000 less) and the average using upper limits drops to $43,293 (~$5,000 less). The distance between the two averages drops from ~$8,000 to ~$6,000. We should consider/discuss removing or controlling for Boost VC.'''&lt;br /&gt;
&lt;br /&gt;
===Amazon Mechanical Turk Pricing===&lt;br /&gt;
Information can be found here [http://mcnair.bakerinstitute.org/wiki/Accelerator_Demo_Day#Pricing]&lt;br /&gt;
&lt;br /&gt;
===Recoded Founders Education===&lt;br /&gt;
&lt;br /&gt;
I have recoded two components of the founders' education sheet:&lt;br /&gt;
&lt;br /&gt;
1) Degree name has been reclassified into nine categories:&lt;br /&gt;
*High School&lt;br /&gt;
*Associates&lt;br /&gt;
*Bachelors&lt;br /&gt;
*Masters&lt;br /&gt;
*Certificate&lt;br /&gt;
*JD&lt;br /&gt;
*MBA&lt;br /&gt;
*PhD&lt;br /&gt;
*Other&lt;br /&gt;
&lt;br /&gt;
2) Majors have also been recoded into nine categories:&lt;br /&gt;
*H = Humanities&lt;br /&gt;
*SS = Social Sciences&lt;br /&gt;
*NS = Natural Sciences&lt;br /&gt;
*E = Engineering (includes computer science)&lt;br /&gt;
*B = Business and Economics&lt;br /&gt;
*L = Leadership&lt;br /&gt;
*MBA&lt;br /&gt;
*JD&lt;br /&gt;
*O = Other&lt;br /&gt;
&lt;br /&gt;
The Google Sheet I used to reclassify can be found here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1XWtCTeaof8WxAuOCbn3XEFaK0sh-ZIif2JqcdbkH72I/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
The sheet '''OLD''' contains old, outdated data, '''WORK''' contains the process/formulas used for reclassifying, and '''Updated Info''' contains only good, updated data. &lt;br /&gt;
&lt;br /&gt;
The data has been merged into '''The File to Rule Them All'''. Old data has not yet been deleted but can be at any time, since we have it saved in the Google Sheet '''OLD''' (just wanted the go ahead from Hira or Ed).&lt;br /&gt;
&lt;br /&gt;
===Multiple campuses and cohorts===&lt;br /&gt;
&lt;br /&gt;
The File to Rule Them All contains an updated address variable. 139 of our 166 accelerators have addresses that are available online. The ones we could not find information for are left blank.&lt;br /&gt;
&lt;br /&gt;
The collaborative sheet that Hira, Maxine, and I worked on to update this list can be accessed here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1nktgJZfm3L8IsSCHgYbPasSdvKb7QHKxZp8K5Si2iNo/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
We have also added a new cohort list. Under The File To Rule Them All, the sheet '''Multiple Campuses''' contains the different locations for accelerators with multiple campuses. Column B can be used to filter out multiple location accelerators.&lt;br /&gt;
&lt;br /&gt;
===Fixed Google Sheet===&lt;br /&gt;
&lt;br /&gt;
We created a new sheet with only data we want to keep, and cleaned it up. That sheet is called &amp;quot;Good Data Only&amp;quot;, at the same link:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
I first used our &amp;quot;recap&amp;quot; and &amp;quot;announced&amp;quot; classification to standardize and fix the dates.&lt;br /&gt;
&lt;br /&gt;
*Columns N-R contain our new data. Please note that all of these columns are based on formulas and will be made erroneous if edited.&lt;br /&gt;
*Column N is the # of weeks for an accelerator program, gathered via VLookup from The File to Rule Them All.&lt;br /&gt;
*Column O is the Actual Date we want to record, and was gathered by subtracting the # of weeks from a date '''if''' the listed page was a '''recap'''.&lt;br /&gt;
*Columns P and Q are the Month and Years stripped from Column O.&lt;br /&gt;
*Finally, Column R is the season variable, as Ed said it should be coded. &lt;br /&gt;
&lt;br /&gt;
We have also gone through and removed all bad data, all duplicates, and all rows without timing info. This is the most complete list possible.&lt;br /&gt;
&lt;br /&gt;
===Recode employee count===&lt;br /&gt;
&lt;br /&gt;
I have added a new column in Cohorts Final (of the File to Rule Them All) but left the old column in place in case you would prefer to edit/classify differently.&lt;br /&gt;
Column AB (emp_count_scale) contains a variable coded on a scale of 1 to 9, with each number corresponding to one of the employee_count classifications (1 the lowest, 9 the highest). The exact output can be modified (1 could instead be tiny, 2 be small... 9 be huge).&lt;br /&gt;
&lt;br /&gt;
The employee count column is standardized and can easily be edited given some modification of the Excel formula.&lt;br /&gt;
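&lt;br /&gt;
As an illustration of the 1 to 9 recoding, here is a minimal Python sketch. The bucket labels are an assumption about how the Crunchbase employee_count field is written and should be checked against the actual column; the names EMPLOYEE_BUCKETS and emp_count_scale are hypothetical and only mirror the column name used above.&lt;br /&gt;
&lt;br /&gt;
```python
# Assumed employee_count buckets, ordered from smallest to largest.
EMPLOYEE_BUCKETS = ['1-10', '11-50', '51-100', '101-250', '251-500',
                    '501-1000', '1001-5000', '5001-10000', '10000+']

# Map each bucket to its 1-based rank: '1-10' becomes 1, '10000+' becomes 9.
EMP_COUNT_SCALE = {bucket: rank
                   for rank, bucket in enumerate(EMPLOYEE_BUCKETS, start=1)}


def emp_count_scale(employee_count):
    """Return the 1-9 scale value, or None for blank or unrecognized input."""
    if not employee_count:
        return None
    return EMP_COUNT_SCALE.get(employee_count.strip())


print(emp_count_scale('11-50'))
```
&lt;br /&gt;
Swapping the numeric ranks for labels (tiny, small, and so on) would only require changing the values in the mapping.&lt;br /&gt;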
&lt;br /&gt;
==Recent Work==&lt;br /&gt;
&lt;br /&gt;
Here's a project update on the work that has been done since coming to McNair. The most recent file is&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
===Merging Cohort Companies with Crunchbase Info===&lt;br /&gt;
&lt;br /&gt;
More information on this part of the project can be found on the page [[Merging Existing Data with Crunchbase]].&lt;br /&gt;
&lt;br /&gt;
The newest updated sheet of cohort company info is under the '''Cohorts Final''' sheet of '''The File to Rule Them All.xlsx'''.&lt;br /&gt;
&lt;br /&gt;
Working with [[Maxine Tao]], we have matched companies to their respective pages and information found in Crunchbase (via UUID). We ensured single matches by doing a 1-1-1-1 match with our data and with Crunchbase (using the Matcher). We then received additional information on 8092 companies. The following new information (on top of what we already had) is included in the sheet:&lt;br /&gt;
*short_description&lt;br /&gt;
*long_description&lt;br /&gt;
*category_list (details the company's category)&lt;br /&gt;
*category_group_list (a less refined, more all-encompassing category classification)&lt;br /&gt;
*founded_on date&lt;br /&gt;
*employee_count&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*address&lt;br /&gt;
&lt;br /&gt;
And the following information was also pulled from Crunchbase and '''merged''' with our existing data: &lt;br /&gt;
*URL (was merged with courl cells) &lt;br /&gt;
*city (was merged with colocation)&lt;br /&gt;
*state_code (was merged with colocation)&lt;br /&gt;
*country_code (was merged with colocation)&lt;br /&gt;
*status (was merged with costatus)&lt;br /&gt;
&lt;br /&gt;
===The Equity Variables===&lt;br /&gt;
&lt;br /&gt;
[[Maxine Tao]] and I have added six new variables to the '''Accelerators Final''' sheet. Those variables are:&lt;br /&gt;
&lt;br /&gt;
*Terms of joining - terms of joining accelerator and important details about program&lt;br /&gt;
*Equity? (1/0) - cells contain a 1 if the accelerator takes equity, a 0 if the accelerator definitively does not, and are blank if we could not find that information&lt;br /&gt;
*Equity Amount - the % of equity the accelerator will take (can sometimes be a range, e.g. 5-7%)&lt;br /&gt;
*Equity Amount Normalized - this copies the previous column but only keeps %&amp;gt;0, and if a range was given (e.g. 5-7%) it returns the average (e.g. 6%)&lt;br /&gt;
*Investment Amount - the $ the accelerator invests in a company to begin, if relevant (could also be a range or an &amp;quot;up to $######&amp;quot;)&lt;br /&gt;
*Investment Notes - anything to comment on previous 4 columns&lt;br /&gt;
&lt;br /&gt;
These six variables tell us more about the characteristics of accelerators; specifically, which ones take equity and which ones do not, and how much equity accelerators take.&lt;br /&gt;
&lt;br /&gt;
Relevant information: &lt;br /&gt;
*82 accelerators take equity, 42 do not, and we lack information for 37.&lt;br /&gt;
*The average % of equity among accelerators who take equity is 6.49%. This is a rough estimate and should not be used for anything official. We computed it by looking only at accelerators who take equity, replacing any reported range with its midpoint (e.g. 4%-10% equity would be coded as 7% equity), and taking the mean.&lt;br /&gt;
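&lt;br /&gt;
The range-to-midpoint normalization and the mean can be sketched as follows. This is a minimal sketch under stated assumptions, not the formula used in the sheet: the function names are hypothetical, and the parser assumes equity cells look like '6%', '5-7%' or '4%-10%'.&lt;br /&gt;
&lt;br /&gt;
```python
import re


def normalize_equity(raw):
    """Return equity as one percentage, or None if the cell is blank or zero.

    A range such as '5-7%' is replaced by its midpoint (6.0).
    """
    numbers = [float(n) for n in re.findall(r'\d+(?:\.\d+)?', raw or '')]
    if not numbers:
        return None
    bounds = numbers[:2]  # a single value, or the two ends of a range
    value = sum(bounds) / len(bounds)
    return value if value > 0 else None


def mean_equity(raw_values):
    """Mean of the normalized values, skipping blanks and zeros."""
    kept = [v for v in map(normalize_equity, raw_values) if v is not None]
    return sum(kept) / len(kept) if kept else None


# Hypothetical cells: a range, a single value, a zero, and a blank.
print(mean_equity(['5-7%', '8%', '0%', '']))
```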
&lt;br /&gt;
===Matching Accelerators to UUIDs via Crunchbase===&lt;br /&gt;
&lt;br /&gt;
We've also added UUIDs for 163 of our 166 accelerators. The UUIDs can be found in Column AE of the '''Accelerators Final''' sheet.&lt;br /&gt;
&lt;br /&gt;
The file with accelerators matched to Crunchbase UUIDs can be found at:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Accelerators and UUIDs.xlsx&lt;br /&gt;
&lt;br /&gt;
This is the master file and should never be modified unless we find a UUID changed. ALL OTHER SHEETS with UUIDs are linked to this sheet so its changes will be reflected elsewhere.&lt;br /&gt;
&lt;br /&gt;
More information can be found on the [[Crunchbase Data]] page. &lt;br /&gt;
&lt;br /&gt;
===Linking Accelerators to Founders/LinkedIn Crawling===&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] got the [[LinkedIn Crawler (Python)]] to work, which means we currently have the following information about accelerator founders:&lt;br /&gt;
*Current Job Title&lt;br /&gt;
*Location&lt;br /&gt;
*Employer&lt;br /&gt;
*Job(s) Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location of jobs&lt;br /&gt;
*Extra Description&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
This information can be found in the various '''Founders''' sheets in '''The File to Rule Them All'''.&lt;br /&gt;
&lt;br /&gt;
===Finding Company URLs===&lt;br /&gt;
Excel master datasets are in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018&lt;br /&gt;
&lt;br /&gt;
Code and files specific to this URL finder are in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\url finder&lt;br /&gt;
&lt;br /&gt;
====Results====&lt;br /&gt;
I used STEP1_crawl.py and STEP2_findcorrecturl.py to add approximately 1000 more URLs into 'The File to Rule Them All.xlsx'.&lt;br /&gt;
&lt;br /&gt;
====Testing====&lt;br /&gt;
&lt;br /&gt;
In this file (sheet: 'Most Recent Merged Data' note that this is just a copy of 'Cohorts Final' in 'The File to Rule Them All'):&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\Merged W Crunchbase Data as of July 17.xlx&lt;br /&gt;
&lt;br /&gt;
We filter for companies (~4000) that did not receive VC, are not in Crunchbase, and do not have URLs.&lt;br /&gt;
Using a Google crawler (STEP1_crawl.py) and a URL matching script (STEP2_findcorrecturl.py), we will try to find as many URLs as possible.&lt;br /&gt;
&lt;br /&gt;
To test, I ran about 40 companies from &amp;quot;smallcompanylist.txt&amp;quot;, using only the company name as a search term and taking the first 4 valid results (see the DONT_COLLECT list in the code). The Google crawler and URL matcher were able to correctly identify around 20 URLs. They also misidentify some URLs that look very similar to the company name, but they are accurate for the most part if the name is not too generic. I then ran the 20 unfound company names through the crawler again, this time using company name + startup as the search term. This did not identify any more correct URLs.&lt;br /&gt;
&lt;br /&gt;
It seems reasonable to assume that if the company URL cannot be found within the first 4 valid search results, then that company probably does not have a URL at all. This is the case for many of the 20 unfound URLs from my test run above.&lt;br /&gt;
&lt;br /&gt;
====Actual Run Info====&lt;br /&gt;
&lt;br /&gt;
The companies we needed to find URLs for are in a file called 'ACTUALNEEDEDCOMPANIES.txt'.&lt;br /&gt;
&lt;br /&gt;
The first four results for every company, as found by STEP1_crawl.py, are in 'ACTUAL_crawled_company_urls.txt'.&lt;br /&gt;
&lt;br /&gt;
The results after the matching done by STEP2_findcorrecturl.py are in 'ACTUAL_finalurls.txt'.&lt;br /&gt;
&lt;br /&gt;
Note that in the end, I decided to only take URLs that were given a match score of greater than 0.9 by setting this restriction in STEP2_findcorrecturl.py. Then I manually removed any duplicates/inaccurate results. If you want, you can set the threshold lower in STEP2 and use STEP3_clean.py to find the URL with the highest score for each company. &lt;br /&gt;
&lt;br /&gt;
The point of this URL finder is to find timing information for companies. Timing information can be found on Whois. See the page http://mcnair.bakerinstitute.org/wiki/Whois_Parser#Summer_2018_Work for information on running the whois parser. &lt;br /&gt;
&lt;br /&gt;
====Using Python files====&lt;br /&gt;
'''To use STEP1_crawl.py''': &lt;br /&gt;
 INPUT: a list of company names (or anything) you would like to find websites for by searching on google&lt;br /&gt;
 OUTPUT: a list of company names and the top X number of results from google &lt;br /&gt;
&lt;br /&gt;
1. Change LIST_FILEPATH in line 26 to be the name of the file that contains the list of things you would like to search. &lt;br /&gt;
&lt;br /&gt;
2. Change NUMRESULT to be however many results you would like from Google. &lt;br /&gt;
&lt;br /&gt;
3. Adjust DONT_COLLECT to include any websites that you don't want. &lt;br /&gt;
&lt;br /&gt;
4. If you would like to add another search keyword, add this in line 87 which is queries.append(name + &amp;quot;whatever you want here&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
5. Change line 127 to be the name of your output file.&lt;br /&gt;
&lt;br /&gt;
'''To use STEP2_findcorrecturl.py''':&lt;br /&gt;
 INPUT: output file from STEP1&lt;br /&gt;
 OUTPUT: a file formatted the same as the output of STEP1, but URLs that do not match over the threshold value you set will be replaced with &amp;quot;no match&amp;quot; &lt;br /&gt;
&lt;br /&gt;
1. Change file f to be the output file name from STEP1. Change g to be the desired name of the output file for this part. &lt;br /&gt;
&lt;br /&gt;
2. In the if statement on line 44, set your desired threshold. Note that anything greater than 0.6 is generally considered a decent match. A threshold of 0.75 is a safer choice, and the 0.9 I used ensures almost exact matches. However, if your list of companies (or whatever you are searching) includes really common names, then even a 90%+ match might not be exactly what you're looking for. &lt;br /&gt;
&lt;br /&gt;
'''To use STEP3_clean.py''':&lt;br /&gt;
&lt;br /&gt;
Note this is an optional step to use depending on the accuracy level you need and what kind of data you crawled earlier. I chose not to use this and instead set a more restrictive threshold in STEP2. &lt;br /&gt;
&lt;br /&gt;
1. Change file f to be the output file from STEP2 (you should delete anything that says &amp;quot;no match&amp;quot;, and when you use STEP2, you must also write the ratio score to the text file). Change g to be the desired name of the output file for this part. &lt;br /&gt;
&lt;br /&gt;
Your output should be a text file containing the company name and the URL that had the highest assigned score in STEP2. In case of more than 1 URL with the highest score, the script should take the first one.&lt;br /&gt;
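&lt;br /&gt;
The thresholding in STEP2 and the highest-score selection in STEP3 can be sketched together as below. This is not the code in those scripts: it assumes the match score is a difflib-style similarity ratio between the company name and the first label of the URL's host name, and the function names match_score and best_url are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
from difflib import SequenceMatcher
from urllib.parse import urlparse


def match_score(company, url):
    """Similarity in [0, 1] between a company name and a URL's host label."""
    host = urlparse(url).netloc.lower()
    if host.startswith('www.'):
        host = host[4:]
    label = host.split('.')[0]
    # Compare only letters and digits, so 'Acme Robotics' becomes 'acmerobotics'.
    name = ''.join(ch for ch in company.lower() if ch.isalnum())
    return SequenceMatcher(None, name, label).ratio()


def best_url(company, urls, threshold=0.9):
    """Return the highest-scoring URL at or above threshold, else 'no match'.

    max() returns the first maximal item, so ties go to the earliest URL.
    """
    scored = [(match_score(company, u), u) for u in urls]
    if not scored:
        return 'no match'
    score, url = max(scored, key=lambda pair: pair[0])
    return url if score >= threshold else 'no match'


# Hypothetical crawl results for one company.
print(best_url('Acme Robotics',
               ['https://example.org', 'https://www.acmerobotics.com/about']))
```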
&lt;br /&gt;
==An Overview==&lt;br /&gt;
This project will be used to determine which accelerators are the most effective at churning out successful startups, as well as what characteristics are exhibited by these accelerators. First, we need to gather as much data as we can about as many accelerators as we can in order to look at factors that differentiate successful vs. unsuccessful ventures. Next, we need to create a web crawling program which will gather information about accelerators across the world by accessing their websites and extracting information. I believe that our overall goal with this research project is to gain insight into the methods of successful accelerators, as well as to find out what exactly differentiates very successful accelerators from dead accelerators.&lt;br /&gt;
&lt;br /&gt;
Helpful Links: http://seedrankings.com/&lt;br /&gt;
&lt;br /&gt;
==Remaining To Dos==&lt;br /&gt;
&lt;br /&gt;
The last update on [[Accelerator Seed List (Data)]] said the following needed to be done: &lt;br /&gt;
&lt;br /&gt;
*Cross-reference sheet with data from Peter's old accelerator consolidation file (&amp;quot;accelerator_data_noflag&amp;quot; and &amp;quot;accelerator_data&amp;quot; in &amp;quot;All Relevant Files&amp;quot;) and fill in missing data&lt;br /&gt;
*Variables that are 100% NOT in these 2 files:&lt;br /&gt;
**Cohort Breakout?&lt;br /&gt;
**Subtype&lt;br /&gt;
**Designed for Students?&lt;br /&gt;
**Campuses&lt;br /&gt;
**Stage&lt;br /&gt;
**Software Tech&lt;br /&gt;
**What stage do they look for?&lt;br /&gt;
&lt;br /&gt;
==(Outdated) Necessary Steps==&lt;br /&gt;
&lt;br /&gt;
'''Acquiring the necessary data to complete the Accelerator Master Variable List and the Cohort List will require the following (not necessarily in this order):'''&lt;br /&gt;
&lt;br /&gt;
===Step Zero: Connect to Crunchbase and Link Data===&lt;br /&gt;
&lt;br /&gt;
Complete - more info: [[Crunchbase Data]]&lt;br /&gt;
&lt;br /&gt;
===Step One: LinkedIn Founders Data===&lt;br /&gt;
&lt;br /&gt;
This project will begin by working with [[Grace Tan]] and [[Maxine Tao]] to connect accelerators to their founders and cohort companies using Crunchbase and LinkedIn crawlers. Grace and Maxine will go through [http://crunchbase.com Crunchbase] and find the UUID for companies and their founders (reference [[Crunchbase Data]], [[Crunchbase Accelerator Founders]], [[Crunchbase Accelerator Equity]]). Connect them using SQL and feed the names of founders into our LinkedIn crawler (headed by [[Grace Tan]]).&lt;br /&gt;
&lt;br /&gt;
The list of founders for accelerators can be found at&lt;br /&gt;
 McNair/Projects/Accelerators/Fall 2017/founders_linkedin.txt&lt;br /&gt;
&lt;br /&gt;
The '''Unfound Founders''' file codes a 0 for all companies '''''not listed''''' within the LinkedIn Founders file, and a 1 for those that do have founders listed.&lt;br /&gt;
&lt;br /&gt;
'''Given the founders' names, we will then be able to use the [[LinkedIn Crawler (Python)]] to find the relevant details of an accelerator founder (education, work experience, etc.). This data on founders will help us solve the horse, jockey, racetrack question to detect what variables affect a startup's success (the accelerator, the founders, the environment/city).'''&lt;br /&gt;
&lt;br /&gt;
===Step Two: Linking Accelerators to Cohorts Using Investments on Crunchbase===&lt;br /&gt;
&lt;br /&gt;
In this step we focus on accelerators who take equity from the companies that engage in their program. We do this to prevent looking at accelerators who may also run funds/invest in various companies but do not take equity. This would provide us misleading results and lead us to believe some companies are in cohorts at accelerators that they are really not. &lt;br /&gt;
&lt;br /&gt;
Maxine will acquire the list of accelerators who take equity from companies from the following sheet:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/All Relevant Files/accelerator_data_noflag.txt&lt;br /&gt;
&lt;br /&gt;
Looking at the file, however, shows that very few are actually categorized well and the equity variable is messy. Moving forward, we need to check/refine/fix this classification.&lt;br /&gt;
&lt;br /&gt;
This file has 266 rows. The most recent, actual version of our accelerator database (found at E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx under the sheet Master Variable List) only has 167 rows, meaning the accelerator_data_noflag.txt file has too many rows. &lt;br /&gt;
&lt;br /&gt;
We will need to do a left join of Accelerator Master List with accelerator_data_noflag.txt to get rid of the accelerator names that are in accelerator_data_noflag but NOT in Accelerator Master List.&lt;br /&gt;
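&lt;br /&gt;
A left join keeps every row of the left table (the master list) and drops rows that appear only in the right table. A minimal Python sketch of that idea follows; the field names 'name' and 'equity' and the case-insensitive name matching are assumptions for illustration, not the actual layout of either file.&lt;br /&gt;
&lt;br /&gt;
```python
def left_join_equity(master_names, noflag_rows):
    """Keep every master accelerator; attach its equity flag when one exists."""
    equity_by_name = {row['name'].strip().lower(): row['equity']
                      for row in noflag_rows}
    return [
        {'name': name, 'equity': equity_by_name.get(name.strip().lower())}
        for name in master_names
    ]


# Hypothetical rows: one accelerator in both lists, one only in each list.
master = ['Techstars', 'Y Combinator']
noflag = [{'name': 'techstars ', 'equity': 'Y'},
          {'name': 'Not In Master', 'equity': 'N'}]
joined = left_join_equity(master, noflag)
print(joined)
```
&lt;br /&gt;
Accelerators that end up with no equity value are the ones that still need to be classified by hand.&lt;br /&gt;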
&lt;br /&gt;
Once this is finished, we should have an “Equity” classification variable for every accelerator in Accelerator Master List. The accelerators that have a Y (or maybe it’s a 1) are the ones that take equity. These are the accelerators we’ll be able to do the Crunchbase work on to see when accelerators take equity.&lt;br /&gt;
&lt;br /&gt;
We then look at the accelerators' investments (or at companies and the entities that invested in them) and cross-reference the list of companies/accelerators. Once we find a match, we know that a company went through an accelerator and in which year it went through a cohort.&lt;br /&gt;
&lt;br /&gt;
'''From this, we get the following data:'''&lt;br /&gt;
*Accelerator a given company went through&lt;br /&gt;
*Year said company went through a cohort/Specific cohort company went through&lt;br /&gt;
&lt;br /&gt;
===Step Three: Demo Day Crawler===&lt;br /&gt;
&lt;br /&gt;
This part of the project relies on the contributions of the wonderful [[Minh Le]]. Better documentation for the project can be found on the [[Demo Day Page Parser]], [[Demo Day Page Google Classifier]], and [[Accelerator Demo Day]] project pages.&lt;br /&gt;
&lt;br /&gt;
Essentially, this part of the accelerator data project will use the [[Demo Day Page Parser]] to look through accelerator websites for pages which list a cohort's 'Demo Day', or the day in which accelerators present their companies to a group of special investors ([https://www.ycombinator.com/demoday/faq/ here's an example FAQ page from Y Combinator]). The [[Demo Day Page Google Classifier]] will then determine if the page is, in fact, a demo day page. &lt;br /&gt;
&lt;br /&gt;
'''Given a cohort's demo day, we can gather a few pieces of key information (check with Ed to make sure this is the correct information to gather from Demo Days):'''&lt;br /&gt;
*The date a cohort began/the season the cohort went through the accelerator&lt;br /&gt;
**This is acquired by looking at the cohort's demo day date, and subtracting the number of weeks/months of a cohort for that given accelerator. The length of a cohort can be found in the file:&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx&lt;br /&gt;
*I assume we can also acquire the companies in a specific cohort, as in we will have the list of all companies in the (for example) Spring 2018 Cohort of Techstars.&lt;br /&gt;
&lt;br /&gt;
===Step Four: Non-profit Finder===&lt;br /&gt;
&lt;br /&gt;
More at [[Non-profit Finder]]&lt;br /&gt;
&lt;br /&gt;
Another important step in this project is finding which accelerators are non-profits.&lt;br /&gt;
&lt;br /&gt;
A comprehensive list of nonprofits taken from the [https://www.irs.gov/charities-non-profits/tax-exempt-organization-search-bulk-data-downloads IRS] can be found here:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Nonprofits in US.xlsx&lt;br /&gt;
Warning: this file has 1 million rows&lt;br /&gt;
&lt;br /&gt;
This file should be cross-referenced with the list of accelerators to find which ones are listed--those accelerators are non-profits. We'll need someone good in SQL to do this with a match.&lt;br /&gt;
&lt;br /&gt;
Potential problem:&lt;br /&gt;
*The names of listed non-profits will likely be somewhat different than the names of accelerators, because companies often file for tax exemption with different names than they show the public.&lt;br /&gt;
&lt;br /&gt;
==Workflow Image==&lt;br /&gt;
&lt;br /&gt;
I imagine the project will look something like this, in that it will require the following information to complete. (The color coding for this image is a very rough, preliminary estimate and is subject to change.)&lt;br /&gt;
&lt;br /&gt;
[[File:image1.png]]&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23873</id>
		<title>U.S. Seed Accelerators</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23873"/>
		<updated>2018-07-31T23:22:37Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: /* Update for Hira */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=U.S. Seed Accelerators&lt;br /&gt;
|Has owner=Connor Rothschild,&lt;br /&gt;
|Has start date=06/18/2018&lt;br /&gt;
|Has keywords=accelerators, data&lt;br /&gt;
|Has notes=Continuation of [[Accelerator Data]]&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Is dependent on=Industry Classifier, Demo Day Page Parser, Accelerator Demo Day, Demo Day Page Google Classifier, Crunchbase Accelerator Equity, Crunchbase Accelerator Founders, Crunchbase Data,&lt;br /&gt;
|Has image=Seed-Accelerator.jpg&lt;br /&gt;
|Does subsume=Accelerator Data, Accelerator Seed List (Data),&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Project Location==&lt;br /&gt;
The master file can be found at &lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
==Relevant Former Projects==&lt;br /&gt;
This page serves as an updated and tidied version of the data and work presented on the [[Accelerator Seed List (Data)]] Project, which subsumed [[Accelerator Data]].&lt;br /&gt;
Both of these projects (and as a corollary, this project) are dependent on the [[Demo Day Page Parser]], [[Industry Classifier]], and the [[Whois Parser]].&lt;br /&gt;
&lt;br /&gt;
==Update for Hira==&lt;br /&gt;
&lt;br /&gt;
===Recoding Founders' Experience===&lt;br /&gt;
&lt;br /&gt;
I have updated and reclassified Founders' job titles. We began with 451 unique job titles, and were able to condense them into 16 categories, which are:&lt;br /&gt;
*Academic&lt;br /&gt;
*Advisor&lt;br /&gt;
*Board&lt;br /&gt;
*C-level&lt;br /&gt;
*CEO&lt;br /&gt;
*Director&lt;br /&gt;
*Founder&lt;br /&gt;
*Investment&lt;br /&gt;
*Management&lt;br /&gt;
*Marketing&lt;br /&gt;
*Partner&lt;br /&gt;
*President&lt;br /&gt;
*VP&lt;br /&gt;
*Other&lt;br /&gt;
&lt;br /&gt;
The formulas used to recode, the old data, and the newest, updated data can be found on this Google Sheet:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/179ML4c1cO_1zooCGj4yjuXXUPwTDKZu52656_8uoNig/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
This has been merged into the File to Rule Them All.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Recoding Stage===&lt;br /&gt;
&lt;br /&gt;
I have updated and cleaned up the &amp;quot;what stage accs look for companies in&amp;quot; by splitting it up into three categories:&lt;br /&gt;
*seed&lt;br /&gt;
*early stage venture&lt;br /&gt;
*late&lt;br /&gt;
&lt;br /&gt;
Other classifications were collapsed into these three or were not significant (n&amp;lt;2) enough to be coded as a classification. The Google Sheet used to recode the stage variable can be found here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1G_XbIrHB6YOU5tWs0dqZot6_eJLDsp-nxoAIuHM9_Yc/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
It has been merged into The File To Rule Them All.&lt;br /&gt;
&lt;br /&gt;
===Dead Accelerators===&lt;br /&gt;
&lt;br /&gt;
We have updated dead accelerators on the following Google Sheet&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1_mZ8QgEXwSoTeyVbiEg2ZfoQukvCHfr0NKSk2QMGYnI/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
This has been merged into The File To Rule Them All.&lt;br /&gt;
&lt;br /&gt;
===Recoding Equity/Investment===&lt;br /&gt;
&lt;br /&gt;
The Google Sheet with this work is here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1xFlFR1OAoHY4XgesB8ZAugL99DgT0OehxIEzcmYuMB8/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
It has also been merged into The File To Rule Them All.&lt;br /&gt;
&lt;br /&gt;
I have updated equity using data from https://www.seed-db.com/accelerators.&lt;br /&gt;
&lt;br /&gt;
I have also updated the columns with a new normalized version of investment. Within the File To Rule Them All, you will find two normalizations of investment:&lt;br /&gt;
*the midpoint normalization, computed by taking the midpoint of each accelerator's investment range.&lt;br /&gt;
*the upper bound normalization, computed by taking the highest amount each accelerator will invest.&lt;br /&gt;
&lt;br /&gt;
This dual normalization was performed because many accelerators say they invest &amp;quot;Up to $__,___&amp;quot; so a midpoint may not accurately reflect actual investment amounts.&lt;br /&gt;
&lt;br /&gt;
The average investment when using the midpoint is $40,164 and the average investment when using the upper limit is $48,313.&lt;br /&gt;
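&lt;br /&gt;
The two normalizations can be sketched as follows. This is a minimal sketch, not the sheet's formulas: the function names are hypothetical, amounts are assumed to be whole dollars, and an &amp;quot;Up to&amp;quot; entry with a single amount is assumed to use that amount for both the midpoint and the upper bound.&lt;br /&gt;
&lt;br /&gt;
```python
import re


def investment_bounds(raw):
    """Parse text such as 'Up to $100,000' into a (low, high) pair of dollars."""
    amounts = [float(a.replace(',', '')) for a in re.findall(r'\d[\d,]*', raw)]
    if not amounts:
        return None
    return (min(amounts), max(amounts))


def midpoint(raw):
    """Midpoint of the stated range; a single amount is its own midpoint."""
    bounds = investment_bounds(raw)
    return None if bounds is None else (bounds[0] + bounds[1]) / 2


def upper_bound(raw):
    """Highest amount the accelerator says it will invest."""
    bounds = investment_bounds(raw)
    return None if bounds is None else bounds[1]


# A wide range such as the one quoted for Boost VC pulls the two measures apart.
print(midpoint('between $50,000 and $500,000'),
      upper_bound('between $50,000 and $500,000'))
```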
&lt;br /&gt;
'''NOTE: There may be one outlier to control for, as Boost VC says they offer **between $50,000 and $500,000**. This is a huge range and the upper limit of $500,000 may throw off our analysis.&lt;br /&gt;
&lt;br /&gt;
By removing Boost VC's investment amounts, the average using midpoint drops to $37,555 (~$3,000 less) and the average using upper limits drops to $43,293 (~$5,000 less). The distance between the two averages drops from ~$8,000 to ~$6,000. We should consider/discuss removing or controlling for Boost VC.'''&lt;br /&gt;
&lt;br /&gt;
===Amazon Mechanical Turk Pricing===&lt;br /&gt;
Information can be found here [http://mcnair.bakerinstitute.org/wiki/Accelerator_Demo_Day#Pricing]&lt;br /&gt;
&lt;br /&gt;
===Recoded Founders Education===&lt;br /&gt;
&lt;br /&gt;
I have recoded two components of the founders' education sheet:&lt;br /&gt;
&lt;br /&gt;
1) Degree name has been reclassified into nine categories:&lt;br /&gt;
*High School&lt;br /&gt;
*Associates&lt;br /&gt;
*Bachelors&lt;br /&gt;
*Masters&lt;br /&gt;
*Certificate&lt;br /&gt;
*JD&lt;br /&gt;
*MBA&lt;br /&gt;
*PhD&lt;br /&gt;
*Other&lt;br /&gt;
&lt;br /&gt;
2) Majors have also been recoded into nine categories:&lt;br /&gt;
*H = Humanities&lt;br /&gt;
*SS = Social Sciences&lt;br /&gt;
*NS = Natural Sciences&lt;br /&gt;
*E = Engineering (includes computer science)&lt;br /&gt;
*B = Business and Economics&lt;br /&gt;
*L = Leadership&lt;br /&gt;
*MBA&lt;br /&gt;
*JD&lt;br /&gt;
*O = Other&lt;br /&gt;
&lt;br /&gt;
The Google Sheet I used to reclassify can be found here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1XWtCTeaof8WxAuOCbn3XEFaK0sh-ZIif2JqcdbkH72I/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
The sheet '''OLD''' contains old, outdated data, '''WORK''' contains the process/formulas used for reclassifying, and '''Updated Info''' contains only good, updated data. &lt;br /&gt;
&lt;br /&gt;
The data has been merged into '''The File to Rule Them All'''. Old data has not yet been deleted but can be at any time, considering we have it saved in the Google Sheet '''OLD''' (just wanted the go-ahead from Hira or Ed).&lt;br /&gt;
&lt;br /&gt;
===Multiple campuses and cohorts===&lt;br /&gt;
&lt;br /&gt;
The File to Rule Them All contains an updated address variable. 139 of our 166 accelerators have addresses that are available online. The ones we could not find information for are left blank.&lt;br /&gt;
&lt;br /&gt;
The collaborative sheet that Hira, Maxine, and I worked on to update this list can be accessed here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1nktgJZfm3L8IsSCHgYbPasSdvKb7QHKxZp8K5Si2iNo/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
We have also added a new cohort list. Under The File To Rule Them All, the sheet '''Multiple Campuses''' contains the different locations for accelerators with multiple campuses. Column B can be used to filter out multiple location accelerators.&lt;br /&gt;
&lt;br /&gt;
===Fixed Google Sheet===&lt;br /&gt;
&lt;br /&gt;
We created a new sheet with only data we want to keep, and cleaned it up. That sheet is called &amp;quot;Good Data Only&amp;quot;, at the same link:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
I first used our &amp;quot;recap&amp;quot; and &amp;quot;announced&amp;quot; classification to standardize and fix the dates.&lt;br /&gt;
&lt;br /&gt;
*Columns N-R contain our new data. Please note that all of these columns are formula-driven, so editing their cells directly will break the values.&lt;br /&gt;
*Column N is the # of weeks for an accelerator program, gathered via VLookup from The File to Rule Them All.&lt;br /&gt;
*Column O is the Actual Date we want to record, and was gathered by subtracting the # of weeks from a date '''if''' the listed page was a '''recap'''.&lt;br /&gt;
*Columns P and Q are the month and year extracted from Column O.&lt;br /&gt;
*Finally, Column R is the season variable, as Ed said it should be coded. &lt;br /&gt;
&lt;br /&gt;
We have also gone through and removed all bad data, all duplicates, and all rows without timing info. This is the most complete list possible.&lt;br /&gt;
&lt;br /&gt;
===Recode employee count===&lt;br /&gt;
&lt;br /&gt;
I have added a new column in Cohorts Final (of the File to Rule Them All) but left the old column in place in case you would prefer to edit/classify differently.&lt;br /&gt;
Column AB (emp_count_scale) contains a variable coded on a scale of 1 to 9, with each number corresponding to one of the employee_count classifications (1 the lowest, 9 the highest). The exact output can be modified (1 could instead be tiny, 2 be small... 9 be huge).&lt;br /&gt;
&lt;br /&gt;
The employee count column is standardized and can easily be edited given some modification of the Excel formula.&lt;br /&gt;
&lt;br /&gt;
===Remaining to do===&lt;br /&gt;
&lt;br /&gt;
===Recoding Founders Experience===&lt;br /&gt;
&lt;br /&gt;
I've begun recoding the Founders variable. I can categorize 232 of the 824 listed job titles (there are 466 distinct job titles) into 9 categories (the most frequent results as given in a PivotTable found in the sheet).&lt;br /&gt;
&lt;br /&gt;
We will need to talk about how to categorize the most extraneous job titles.&lt;br /&gt;
&lt;br /&gt;
==Recent Work==&lt;br /&gt;
&lt;br /&gt;
Here's a project update on the work that has been done since coming to McNair. The most recent file is&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
===Merging Cohort Companies with Crunchbase Info===&lt;br /&gt;
&lt;br /&gt;
More information on this part of the project can be found on the page [[Merging Existing Data with Crunchbase]].&lt;br /&gt;
&lt;br /&gt;
The newest updated sheet of cohort company info is under the '''Cohorts Final''' sheet of '''The File to Rule Them All.xlsx'''.&lt;br /&gt;
&lt;br /&gt;
Working with [[Maxine Tao]], we have matched companies to their respective pages and information found in Crunchbase (via UUID). We ensured single matches by doing a 1-1-1-1 match with our data and with Crunchbase (using the Matcher). We then received additional information on 8092 companies. The following new information (on top of what we already had) is included in the sheet:&lt;br /&gt;
*short_description&lt;br /&gt;
*long_description&lt;br /&gt;
*category_list (details the company's category)&lt;br /&gt;
*category_group_list (a less refined, more all-encompassing category classification)&lt;br /&gt;
*founded_on date&lt;br /&gt;
*employee_count&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*address&lt;br /&gt;
&lt;br /&gt;
And the following information was also pulled from Crunchbase and '''merged''' with our existing data: &lt;br /&gt;
*URL (was merged with courl cells) &lt;br /&gt;
*city (was merged with colocation)&lt;br /&gt;
*state_code (was merged with colocation)&lt;br /&gt;
*country_code (was merged with colocation)&lt;br /&gt;
*status (was merged with costatus)&lt;br /&gt;
&lt;br /&gt;
===The Equity Variables===&lt;br /&gt;
&lt;br /&gt;
[[Maxine Tao]] and I have added six new variables to the '''Accelerators Final''' sheet. Those variables are:&lt;br /&gt;
&lt;br /&gt;
*Terms of joining - terms of joining accelerator and important details about program&lt;br /&gt;
*Equity? (1/0) - cells contain a 1 if the accelerator takes equity, a 0 if the accelerator definitively does not, and are blank if we could not find that information&lt;br /&gt;
*Equity Amount - the % of equity the accelerator will take (can sometimes be a range, e.g. 5-7%)&lt;br /&gt;
*Equity Amount Normalized - this copies the previous column but only keeps %&amp;gt;0, and if a range was given (e.g. 5-7%) it returns the average (e.g. 6%)&lt;br /&gt;
*Investment Amount - the $ the accelerator invests in a company to begin, if relevant (could also be a range or an &amp;quot;up to $######&amp;quot;)&lt;br /&gt;
*Investment Notes - anything to comment on previous 4 columns&lt;br /&gt;
&lt;br /&gt;
These six variables tell us more about the characteristics of accelerators; specifically, which ones take equity and which ones do not, and how much equity accelerators take.&lt;br /&gt;
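&lt;br /&gt;
The rule behind Equity Amount Normalized can be sketched in Python. This is a minimal illustration, not the code used to build the sheet: the function name normalize_equity and the input strings are assumptions, and it only handles a single percentage or a simple low-high range.&lt;br /&gt;

```python
import re

def normalize_equity(raw):
    # Hypothetical helper: returns the equity percentage as a float,
    # or None when the cell is blank or the value is not above 0.
    if raw is None:
        return None
    # Pull out every number in the cell, e.g. '5-7%' gives [5.0, 7.0].
    numbers = [float(n) for n in re.findall(r'\d+(?:\.\d+)?', str(raw))]
    if not numbers:
        return None
    # A single value is kept as-is; a range is averaged over its two endpoints.
    endpoints = numbers[:2]
    value = sum(endpoints) / len(endpoints)
    # Only percentages greater than 0 are kept.
    return value if value > 0 else None

print(normalize_equity('5-7%'))    # 6.0
print(normalize_equity('4%-10%'))  # 7.0
print(normalize_equity('0%'))      # None
```

Cells with free-text terms (for example, equity that depends on a later funding round) would still need to be coded by hand.&lt;br /&gt;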
&lt;br /&gt;
Relevant information: &lt;br /&gt;
*82 accelerators take equity, 42 do not, and we lack information for 37.&lt;br /&gt;
*The average % of equity among accelerators who take equity is 6.49% (rough estimate; do not use for anything official). We got this number by looking only at accelerators who take equity, coding any reported range as its midpoint (e.g. 4%-10% equity would be coded as 7% equity), and taking the mean.&lt;br /&gt;
&lt;br /&gt;
===Matching Accelerators to UUIDs via Crunchbase===&lt;br /&gt;
&lt;br /&gt;
We've also added UUIDs for 163 of our 166 accelerators. The UUIDs can be found in Column AE of the '''Accelerators Final''' sheet.&lt;br /&gt;
&lt;br /&gt;
The file with accelerators matched to Crunchbase UUIDs can be found at:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Accelerators and UUIDs.xlsx&lt;br /&gt;
&lt;br /&gt;
This is the master file and should never be modified unless we find a UUID changed. ALL OTHER SHEETS with UUIDs are linked to this sheet so its changes will be reflected elsewhere.&lt;br /&gt;
&lt;br /&gt;
More information can be found on the [[Crunchbase Data]] page. &lt;br /&gt;
&lt;br /&gt;
===Linking Accelerators to Founders/LinkedIn Crawling===&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] got the [[LinkedIn Crawler (Python)]] to work, which means we currently have the following information about accelerator founders:&lt;br /&gt;
*Current Job Title&lt;br /&gt;
*Location&lt;br /&gt;
*Employer&lt;br /&gt;
*Job(s) Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location of jobs&lt;br /&gt;
*Extra Description&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
This information can be found in the various '''Founders''' sheets in '''The File to Rule Them All'''.&lt;br /&gt;
&lt;br /&gt;
===Finding Company URLs===&lt;br /&gt;
Excel master datasets are in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018&lt;br /&gt;
&lt;br /&gt;
Code and files specific to this URL finder are in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\url finder&lt;br /&gt;
&lt;br /&gt;
====Results====&lt;br /&gt;
I used STEP1_crawl.py and STEP2_findcorrecturl.py to add approximately 1000 more URLs into 'The File to Rule Them All.xlsx'.&lt;br /&gt;
&lt;br /&gt;
====Testing====&lt;br /&gt;
&lt;br /&gt;
In this file (sheet: 'Most Recent Merged Data'; note that this is just a copy of 'Cohorts Final' in 'The File to Rule Them All'):&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\Merged W Crunchbase Data as of July 17.xlsx&lt;br /&gt;
&lt;br /&gt;
We filter for companies (~4000) that did not receive VC, are not in Crunchbase, and do not have URLs.&lt;br /&gt;
Using a Google crawler (STEP1_crawl.py) and a URL matching script (STEP2_findcorrecturl.py), we will try to find as many URLs as possible.&lt;br /&gt;
&lt;br /&gt;
To test, I ran about 40 companies from &amp;quot;smallcompanylist.txt&amp;quot;, using only the company name as a search term and taking the first 4 valid results (see the don't-collect list in the code). The Google crawler and URL matcher were able to correctly identify around 20 URLs. They also misidentify some URLs that look very similar to the company name, but they are accurate for the most part if the name is not too generic. I then ran the 20 unfound company names through the crawler again, this time using company name + startup as the search term. This did not identify any more correct URLs.&lt;br /&gt;
&lt;br /&gt;
It seems reasonable to assume that if the company URL cannot be found within the first 4 valid search results, then that company probably does not have a URL at all. This is the case for many of the 20 unfound companies from my test run above.&lt;br /&gt;
&lt;br /&gt;
====Actual Run Info====&lt;br /&gt;
&lt;br /&gt;
The companies we needed to find URLs for are in a file called 'ACTUALNEEDEDCOMPANIES.txt'.&lt;br /&gt;
&lt;br /&gt;
The first four results for every company, as found by STEP1_crawl.py, are in 'ACTUAL_crawled_company_urls.txt'.&lt;br /&gt;
&lt;br /&gt;
The results after the matching done by STEP2_findcorrecturl.py, are in 'ACTUAL_finalurls.txt'.&lt;br /&gt;
&lt;br /&gt;
Note that in the end, I decided to only take URLs that were given a match score of greater than 0.9 by setting this restriction in STEP2_findcorrecturl.py. Then I manually removed any duplicates/inaccurate results. If you want, you can set the threshold lower in STEP2 and use STEP3_clean.py to find the URL with the highest score for each company. &lt;br /&gt;
&lt;br /&gt;
The point of this URL finder is to find timing information for companies. Timing information can be found on Whois. See the page http://mcnair.bakerinstitute.org/wiki/Whois_Parser#Summer_2018_Work for information on running the whois parser. &lt;br /&gt;
&lt;br /&gt;
====Using Python files====&lt;br /&gt;
'''To use STEP1_crawl.py''': &lt;br /&gt;
 INPUT: a list of company names (or anything) you would like to find websites for by searching on google&lt;br /&gt;
 OUTPUT: a list of company names and the top X number of results from google &lt;br /&gt;
&lt;br /&gt;
1. Change LIST_FILEPATH in line 26 to be the name of the file that contains the list of things you would like to search. &lt;br /&gt;
&lt;br /&gt;
2. Change NUMRESULT to be however many results you would like from Google. &lt;br /&gt;
&lt;br /&gt;
3. Adjust DONT_COLLECT to include any websites that you don't want. &lt;br /&gt;
&lt;br /&gt;
4. If you would like to add another search keyword, add this in line 87 which is queries.append(name + &amp;quot;whatever you want here&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
5. Change line 127 to be the name of your output file.&lt;br /&gt;
&lt;br /&gt;
'''To use STEP2_findcorrecturl.py''':&lt;br /&gt;
 INPUT: output file from STEP1&lt;br /&gt;
 OUTPUT: a file formatted the same as the output of STEP1, but URLs that do not match over the threshold value you set will be replaced with &amp;quot;no match&amp;quot; &lt;br /&gt;
&lt;br /&gt;
1. Change file f to be the output file name from STEP1. Change g to be the desired name of the output file for this part. &lt;br /&gt;
&lt;br /&gt;
2. In the if statement on line 44, set your desired threshold. Note that anything greater than 0.6 is generally considered a decent match. It might be safer to use 0.75; my use of 0.9 ensures almost exact matches. However, you should consider that if your list of companies (or whatever you are searching) includes really common names, then even a 90%+ match might not be exactly what you're looking for. &lt;br /&gt;
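&lt;br /&gt;
The thresholding in STEP2 can be illustrated with a short sketch. It assumes the match score is a string-similarity ratio between the company name and the first label of the URL's host name, computed here with difflib.SequenceMatcher from the Python standard library. The actual scoring in STEP2_findcorrecturl.py may differ, and the names match_score, filter_urls, and THRESHOLD as well as the sample company are hypothetical.&lt;br /&gt;

```python
from difflib import SequenceMatcher
from urllib.parse import urlparse

THRESHOLD = 0.9  # assumed cutoff, mirroring the value used for the actual run

def match_score(company, url):
    # Compare the lowercased company name (whitespace removed) with the
    # first label of the host name, ignoring a leading 'www.'.
    host = urlparse(url).netloc.lower()
    if host.startswith('www.'):
        host = host[4:]
    label = host.split('.')[0]
    name = "".join(company.lower().split())
    return SequenceMatcher(None, name, label).ratio()

def filter_urls(company, urls, threshold=THRESHOLD):
    # URLs that do not score above the threshold are replaced with 'no match'.
    return [u if match_score(company, u) > threshold else 'no match' for u in urls]

results = filter_urls('Acme Rockets', [
    'https://www.acmerockets.com/about',
    'https://en.wikipedia.org/wiki/Rocket',
])
print(results)
```

With a generic company name, an unrelated site can still score highly under this kind of measure, which is why a high threshold alone does not guarantee a correct match.&lt;br /&gt;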
&lt;br /&gt;
'''To use STEP3_clean.py''':&lt;br /&gt;
&lt;br /&gt;
Note this is an optional step to use depending on the accuracy level you need and what kind of data you crawled earlier. I chose not to use this and instead set a more restrictive threshold in STEP2. &lt;br /&gt;
&lt;br /&gt;
1. Change file f to be the output file from STEP2 (you should delete anything that says &amp;quot;no match&amp;quot;, and when you use STEP2, you must also write the ratio score to the text file). Change g to be the desired name of the output file for this part. &lt;br /&gt;
&lt;br /&gt;
Your output should be a text file containing the company name and the URL that had the highest assigned score in STEP2. In case of more than 1 URL with the highest score, the script should take the first one.&lt;br /&gt;
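&lt;br /&gt;
The selection rule described above (keep the highest-scoring URL per company, and the first one on a tie) can be sketched as follows. The input format, a list of (company, url, score) tuples, is an assumption for illustration; STEP3_clean.py reads its input from the STEP2 text file. The name best_url_per_company and the sample rows are hypothetical.&lt;br /&gt;

```python
def best_url_per_company(rows):
    # rows: iterable of (company, url, score) tuples in crawl order.
    best = {}
    for company, url, score in rows:
        # A strict comparison keeps the first URL when scores tie.
        if company not in best or score > best[company][1]:
            best[company] = (url, score)
    return {company: url for company, (url, score) in best.items()}

rows = [
    ('Acme Rockets', 'https://acmerockets.com', 0.95),
    ('Acme Rockets', 'https://acme-rockets.io', 0.95),
    ('Beta Labs', 'https://example.org', 0.70),
    ('Beta Labs', 'https://betalabs.co', 0.88),
]
print(best_url_per_company(rows))
```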
&lt;br /&gt;
==An Overview==&lt;br /&gt;
This project will be used to determine which accelerators are the most effective at churning out successful startups, as well as what characteristics are exhibited by these accelerators. First, we need to gather as much data as we can about as many accelerators as we can in order to look at factors that differentiate successful vs. unsuccessful ventures. Next, we need to create a web crawling program which will gather information about accelerators across the world by accessing their websites and extracting information. I believe that our overall goal with this research project is to gain insight into the methods of successful accelerators, as well as to find out what exactly differentiates very successful accelerators from dead accelerators.&lt;br /&gt;
&lt;br /&gt;
Helpful Links: http://seedrankings.com/&lt;br /&gt;
&lt;br /&gt;
==Remaining To Dos==&lt;br /&gt;
&lt;br /&gt;
The last update on [[Accelerator Seed List (Data)]] said the following needed to be done: &lt;br /&gt;
&lt;br /&gt;
*Cross-reference sheet with data from Peter's old accelerator consolidation file (&amp;quot;accelerator_data_noflag&amp;quot; and &amp;quot;accelerator_data&amp;quot; in &amp;quot;All Relevant Files&amp;quot;) and fill in missing data&lt;br /&gt;
*Variables that are 100% NOT in these 2 files:&lt;br /&gt;
**Cohort Breakout?&lt;br /&gt;
**Subtype&lt;br /&gt;
**Designed for Students?&lt;br /&gt;
**Campuses&lt;br /&gt;
**Stage&lt;br /&gt;
**Software Tech&lt;br /&gt;
**What stage do they look for?&lt;br /&gt;
&lt;br /&gt;
==(Outdated) Necessary Steps==&lt;br /&gt;
&lt;br /&gt;
'''Acquiring the necessary data to complete the Accelerator Master Variable List and the Cohort List will require the following (not necessarily in this order):'''&lt;br /&gt;
&lt;br /&gt;
===Step Zero: Connect to Crunchbase and Link Data===&lt;br /&gt;
&lt;br /&gt;
Complete - more info: [[Crunchbase Data]]&lt;br /&gt;
&lt;br /&gt;
===Step One: LinkedIn Founders Data===&lt;br /&gt;
&lt;br /&gt;
This project will begin by working with [[Grace Tan]] and [[Maxine Tao]] to connect accelerators to their founders and cohort companies using Crunchbase and LinkedIn crawlers. Grace and Maxine will go through [http://crunchbase.com Crunchbase] and find the UUID for companies and their founders (reference [[Crunchbase Data]], [[Crunchbase Accelerator Founders]], [[Crunchbase Accelerator Equity]]). Connect them using SQL and feed the names of founders into our LinkedIn crawler (headed by [[Grace Tan]]).&lt;br /&gt;
&lt;br /&gt;
The list of founders for accelerators can be found at&lt;br /&gt;
 McNair/Projects/Accelerators/Fall 2017/founders_linkedin.txt&lt;br /&gt;
&lt;br /&gt;
The '''Unfound Founders''' file codes a 0 for all companies '''''not listed''''' within the LinkedIn Founders file, and a 1 for those that do have founders listed.&lt;br /&gt;
&lt;br /&gt;
'''Given the founders' names, we will then be able to use the [[LinkedIn Crawler (Python)]] to find the relevant details of an accelerator founder (education, work experience, etc.) This data on founders will help us solve the horse, jockey, racetrack question to detect what variables affect a startup's success (the accelerator, the founders, the environment/city).'''&lt;br /&gt;
&lt;br /&gt;
===Step Two: Linking Accelerators to Cohorts Using Investments on Crunchbase===&lt;br /&gt;
&lt;br /&gt;
In this step we focus on accelerators who take equity from the companies that engage in their program. We do this to prevent looking at accelerators who may also run funds/invest in various companies but do not take equity. This would provide us misleading results and lead us to believe some companies are in cohorts at accelerators that they are really not. &lt;br /&gt;
&lt;br /&gt;
Maxine will acquire the list of accelerators who take equity from companies from the following sheet:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/All Relevant Files/accelerator_data_noflag.txt&lt;br /&gt;
&lt;br /&gt;
Looking at the file, however, shows that very few are actually categorized well and the equity variable is messy. Moving forward, we need to check/refine/fix this classification.&lt;br /&gt;
&lt;br /&gt;
This file has 266 rows. The most recent, actual version of our accelerator database (found at E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx under the sheet Master Variable List) only has 167 rows, meaning the accelerator_data_noflag.txt file has too many rows. &lt;br /&gt;
&lt;br /&gt;
We will need to do a left join of Accelerator Master List with accelerator_data_noflag.txt to get rid of the accelerator names that are in accelerator_data_noflag but NOT in Accelerator Master List.&lt;br /&gt;
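&lt;br /&gt;
The effect of that left join can be sketched in plain Python. The sample rows and the field names name and equity are made up for illustration; the real files have different columns, and accelerator names would likely need cleaning before they match exactly.&lt;br /&gt;

```python
def left_join(master_names, noflag_rows):
    # Index the larger file by a normalized accelerator name.
    lookup = {row['name'].strip().lower(): row for row in noflag_rows}
    joined = []
    for name in master_names:
        match = lookup.get(name.strip().lower())
        # Every master accelerator is kept; equity is None when no row matches.
        joined.append({'name': name, 'equity': match['equity'] if match else None})
    return joined

# Hypothetical sample data, not taken from the real files.
master = ['Alpha Accelerator', 'Beta Accelerator', 'Gamma Accelerator']
noflag = [
    {'name': 'alpha accelerator', 'equity': 'Y'},
    {'name': 'Beta Accelerator ', 'equity': 'N'},
    {'name': 'Not In Master List', 'equity': 'Y'},
]
for row in left_join(master, noflag):
    print(row)
```

Rows that exist only in accelerator_data_noflag drop out, while master accelerators without a match are kept with an empty equity value so they can be filled in by hand.&lt;br /&gt;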
&lt;br /&gt;
Once this is finished, we should have an “Equity” classification variable for every accelerator in Accelerator Master List. The accelerators that have a Y (or maybe it’s a 1) are the ones that do take equity. These are the accelerators we’ll be able to do the Crunchbase work on to see when accelerators take equity.&lt;br /&gt;
&lt;br /&gt;
We then look at the accelerators' investments (or companies and the entities which invested in them) and cross-reference the list of companies/accelerators. Once we find a match, we know that a company went through an accelerator and in which year it went through a cohort.&lt;br /&gt;
&lt;br /&gt;
'''From this, we get the following data:'''&lt;br /&gt;
*Accelerator a given company went through&lt;br /&gt;
*Year said company went through a cohort/Specific cohort company went through&lt;br /&gt;
&lt;br /&gt;
===Step Three: Demo Day Crawler===&lt;br /&gt;
&lt;br /&gt;
This part of the project relies on the contributions of the wonderful [[Minh Le]]. Better documentation for the project can be found on the [[Demo Day Page Parser]], [[Demo Day Page Google Classifier]], and [[Accelerator Demo Day]] project pages.&lt;br /&gt;
&lt;br /&gt;
Essentially, this part of the accelerator data project will use the [[Demo Day Page Parser]] to look through accelerator websites for pages which list a cohort's 'Demo Day', or the day in which accelerators present their companies to a group of special investors ([https://www.ycombinator.com/demoday/faq/ here's an example FAQ page from Y Combinator]). The [[Demo Day Page Google Classifier]] will then determine if the page is, in fact, a demo day page. &lt;br /&gt;
&lt;br /&gt;
'''Given a cohort's demo day, we can gather a few pieces of key information (check with Ed to make sure this is the correct information to gather from Demo Days):'''&lt;br /&gt;
*The date a cohort began/the season the cohort went through the accelerator&lt;br /&gt;
**This is acquired by looking at the cohort's demo day date, and subtracting the number of weeks/months of a cohort for that given accelerator. The length of a cohort can be found in the file:&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx&lt;br /&gt;
*I assume we can also acquire the companies in a specific cohort, as in we will have the list of all companies in the (for example) Spring 2018 Cohort of Techstars.&lt;br /&gt;
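&lt;br /&gt;
The date arithmetic described above is simple enough to sketch. The demo day date and the 12-week program length below are placeholder values, and cohort_start is a hypothetical helper; in practice the length would come from the accelerator's row in the Master Variable List.&lt;br /&gt;

```python
from datetime import date, timedelta

def cohort_start(demo_day, cohort_weeks):
    # Subtract the program length from the demo day to estimate the start date.
    return demo_day - timedelta(weeks=cohort_weeks)

start = cohort_start(date(2018, 6, 15), 12)
print(start)  # 2018-03-23
```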
&lt;br /&gt;
===Step Four: Non-profit Finder===&lt;br /&gt;
&lt;br /&gt;
More at [[Non-profit Finder]]&lt;br /&gt;
&lt;br /&gt;
Another important step in this project is finding which accelerators are non-profits.&lt;br /&gt;
&lt;br /&gt;
A comprehensive list of nonprofits taken from the [https://www.irs.gov/charities-non-profits/tax-exempt-organization-search-bulk-data-downloads IRS] can be found here:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Nonprofits in US.xlsx&lt;br /&gt;
Warning: this file has 1 million rows&lt;br /&gt;
&lt;br /&gt;
This file should be cross-referenced with the list of accelerators to find which ones are listed--those accelerators are non-profits. We'll need someone good in SQL to do this with a match.&lt;br /&gt;
&lt;br /&gt;
Potential problem:&lt;br /&gt;
*The names of listed non-profits will likely be somewhat different than the names of accelerators, because companies often file for tax exemption with different names than they show the public.&lt;br /&gt;
&lt;br /&gt;
==Workflow Image==&lt;br /&gt;
&lt;br /&gt;
(the color coding for this image is a very rough and preliminary estimate subject to change) I imagine the project will look something like this, in that it will require the following information to fully complete the project:&lt;br /&gt;
&lt;br /&gt;
[[File:image1.png]]&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Accelerator_Demo_Day&amp;diff=23872</id>
		<title>Accelerator Demo Day</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Accelerator_Demo_Day&amp;diff=23872"/>
		<updated>2018-07-31T22:31:33Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: /* Amazon Mechanical Turk */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Accelerator Demo Day&lt;br /&gt;
|Has owner=Minh Le,&lt;br /&gt;
|Has start date=06/18/2018&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Does subsume=Demo Day Page Parser, Demo Day Page Google Classifier&lt;br /&gt;
}}&lt;br /&gt;
==Project Introduction==&lt;br /&gt;
This project utilizes Selenium and machine learning to find good candidate web pages and to classify web pages as demo day pages containing a list of cohort companies, ultimately to gather good candidates to push to Mechanical Turk. The code is written in Python 3 using Selenium and Tensorflow (Keras). This article also covers the preliminaries of the Mechanical Turk tool and how it can be used to collect data.&lt;br /&gt;
&lt;br /&gt;
==Project Goal==&lt;br /&gt;
The goal of this project is to find good &amp;quot;Demo Day&amp;quot; candidate web pages and to submit these pages to Amazon Mechanical Turk for data collection. A good candidate is defined as a page containing a list of cohort companies associated with an accelerator. Through observation, good candidates usually contain time and location information about the demo day as well, and are thus sufficient to be pushed to MTurk to collect data.&lt;br /&gt;
&lt;br /&gt;
==Code Location==&lt;br /&gt;
The source code and relevant files for the project can be found here: &lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\&lt;br /&gt;
The current working model using RF is in:&lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\Test Run&lt;br /&gt;
The RNN model is in:&lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\Experiment&lt;br /&gt;
The RNN is still under much development. Modifying anything in this folder is not recommended.&lt;br /&gt;
&lt;br /&gt;
All the other folders are used for experimenting purposes; please don't touch them. If you want to understand more about the files as a general user, go to the section A Quick Glance through the File in The Directory below. If you are a developer, go to the Advance User Guide section.&lt;br /&gt;
&lt;br /&gt;
==General User Guide: How to Use this Project (Random Forest model)==&lt;br /&gt;
&lt;br /&gt;
First, change your directory to the working folder:&lt;br /&gt;
 cd E:\McNair\Projects\Accelerator Demo Day\Test Run&lt;br /&gt;
Then you need to specify the list of accelerators you want to crawl by modifying the following file:&lt;br /&gt;
 ListOfAccsToCrawl.txt&lt;br /&gt;
The first line must remain fixed as &amp;quot;Accelerator&amp;quot;. Each following row is an accelerator's name. The names do not need to be case sensitive, but it is preferable to keep the original capitalization if possible.&lt;br /&gt;
&lt;br /&gt;
All necessary preparations are now complete. Now onto running the code!&lt;br /&gt;
&lt;br /&gt;
Running the project is as simple as executing the code in the correct order. The files are named in the format &amp;quot;STEPX_name&amp;quot;, where X is the order of execution. To be more specific, run the following 4 commands:&lt;br /&gt;
 ''# Crawl Google to get the data for the demo day pages for the accelerator stored in ListOfAccsToCrawl.txt''&lt;br /&gt;
 python3 STEP1_crawl.py&lt;br /&gt;
 ''# Preprocess data using a bag-of-words approach: each page is characterized by the frequencies of chosen keywords. Chosen keywords are stored in words.txt. This script creates a file called feature_matrix.txt''&lt;br /&gt;
 python3 STEP2_preprocessing_feature_matrix_generator.py&lt;br /&gt;
 ''# Train the RF model''&lt;br /&gt;
 python3 STEP3_train_rf.py&lt;br /&gt;
 ''# Run the model to classify the crawled HTML pages.''&lt;br /&gt;
 python3 STEP4_classify_rf.py&lt;br /&gt;
&lt;br /&gt;
The result is stored in the CrawledHTMLFull folder and is sorted into two folders: positive and negative. The positive folder contains HTMLs that the classifier judged to be a &amp;quot;good candidate.&amp;quot; The negative folder contains the rest. There is also a txt file called prediction.txt that lists everything. feature.txt is irrelevant for the general user; please ignore it. Its sole purpose is for analyzing and debugging.&lt;br /&gt;
&lt;br /&gt;
NEVER touch the TrainingHTML folder, datareader.py or the classifier.txt. These are used internally to train data.&lt;br /&gt;
&lt;br /&gt;
==A Quick Glance through the File in The Directory==&lt;br /&gt;
All working files are stored in this folder:&lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\Test Run&lt;br /&gt;
The file   &lt;br /&gt;
&lt;br /&gt;
==Amazon Mechanical Turk==&lt;br /&gt;
Login info:&lt;br /&gt;
 username: mcnair@rice.edu&lt;br /&gt;
 password: amount&lt;br /&gt;
&lt;br /&gt;
There's a file in the folder &lt;br /&gt;
 CrawledHTMLFull&lt;br /&gt;
called&lt;br /&gt;
 FinalResultWithURL&lt;br /&gt;
that was manually created by combining the file&lt;br /&gt;
 crawled_demoday_page_list.txt&lt;br /&gt;
in the mother folder and the file &lt;br /&gt;
 predicted.txt&lt;br /&gt;
This file combines the predictions with the actual URLs of the websites. &lt;br /&gt;
&lt;br /&gt;
Since MTurk makes it hard for us to display the downloaded HTML, it is much faster to just copy the url into the question box rather than trying to display the downloaded HTML.&lt;br /&gt;
&lt;br /&gt;
The advantage to this is that some websites, such as techcrunch.com, behave abnormally when downloaded as HTML, so opening these kinds of websites in the browser is actually more beneficial because the UI is not messed up. Moreover, if a website has a paywall or pop-up ads, the user can click out of them. Since paywalls and pop-ups are usually scripts within the HTML, the classifier can't rule these pages out, because the body of the HTML may still contain the useful information we are looking for. Major paywalled websites and websites that require log-ins, such as LinkedIn, have been blacklisted in the crawler. More detail in the crawler section below.&lt;br /&gt;
&lt;br /&gt;
However, there is a disadvantage to this: websites are ever changing, so there is a possibility that in the future the URL may no longer be usable, or may have changed to something else. Downloaded HTMLs, on the other hand, remain the same because they do not require an internet connection to render, and thus their content is static.&lt;br /&gt;
&lt;br /&gt;
To create the MTurk task for this project, follow the tutorial in [[Mechanical Turk (Tool)]]. For testing and development purposes, use https://requestersandbox.mturk.com/&lt;br /&gt;
&lt;br /&gt;
Test account:&lt;br /&gt;
email: mcboatfaceboaty670@gmail.com&lt;br /&gt;
password: sameastheoneforemail2018&lt;br /&gt;
&lt;br /&gt;
For this project, the fields asked of the user are:&lt;br /&gt;
&lt;br /&gt;
*Whether the page had a list of companies going through an accelerator&lt;br /&gt;
*The month and year of the demo day (or article)&lt;br /&gt;
*Accelerator name&lt;br /&gt;
*Companies going through accelerator&lt;br /&gt;
&lt;br /&gt;
Layout:&lt;br /&gt;
&lt;br /&gt;
[[File:Demodayfinal.png]]&lt;br /&gt;
&lt;br /&gt;
===Pricing===&lt;br /&gt;
&lt;br /&gt;
We priced our task at $1.25 per HIT. Assuming workers take less than 10 minutes, this translates into &amp;gt;$7.50 per hour.&lt;br /&gt;
&lt;br /&gt;
We sent out the task in two batches. The first was 20 HITs to be completed by two workers each, so as to test for interjudge reliability.&lt;br /&gt;
&lt;br /&gt;
The second batch was the remaining 264 HITs, to be completed by one worker each.&lt;br /&gt;
&lt;br /&gt;
MTurk charged fees of $.25 per HIT and an additional $.0625, meaning each HIT cost us $1.5625.&lt;br /&gt;
&lt;br /&gt;
OUR FINAL PRICE: ((20*2)+264)*1.5625 = $475.00&lt;br /&gt;
&lt;br /&gt;
==Hand Collecting Data==&lt;br /&gt;
&lt;br /&gt;
To crawl, we only looked for data on accelerators with companies that did not receive venture capital (which Ed found via VentureXpert) and lacked timing info. The purpose of this crawl is to find timing info where we cannot find it otherwise; if a company received VC, we can find timing info via that investment. The file we used to find instances in which a company lacked both timing info and VC is:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Merged W Crunchbase Data as of July 17.xlsx&lt;br /&gt;
&lt;br /&gt;
We filtered this sheet in Excel (and checked our work by filtering in SQL) and found 809 companies that lacked timing info and didn't receive VC. From this, we found 74 accelerators which we needed to crawl for.&lt;br /&gt;
&lt;br /&gt;
We used the crawler to search for cohort companies listed for these accelerators.&lt;br /&gt;
&lt;br /&gt;
During the initial test run, the number of good pages was 359. The data is then handled by hand by fellow interns.&lt;br /&gt;
&lt;br /&gt;
The file for hand-coding is in:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerator Demo Day/Test Run/CrawledDemoDayHTMLFull/'''FinalResultWithURL'''&lt;br /&gt;
&lt;br /&gt;
For the sake of collaboration, the team copied this information to a Google Sheet, accessible here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
We split the process into four parts. Each intern will do the following:&lt;br /&gt;
&lt;br /&gt;
1. Go to the given URL.&lt;br /&gt;
&lt;br /&gt;
2. Record whether the page is good data (column F); this can later be used by [[Minh Le]] to refine/fine-tune training data.&lt;br /&gt;
&lt;br /&gt;
3. Record whether the page is announcing a cohort or recapping/explaining a demo day (column G). This variable will be used to decide if we should subtract weeks from the given date (e.g. if it is recapping a demo day, the cohort went through the accelerator for the past ~12 weeks, and we should subtract weeks as such).&lt;br /&gt;
&lt;br /&gt;
4. Record date, month, year, and the companies listed for that given accelerator.&lt;br /&gt;
&lt;br /&gt;
5. Note any additional information, such as a cohort's special name.&lt;br /&gt;
&lt;br /&gt;
Once this process is finished, we will filter only the 1s in Column F, and [[Connor Rothschild]] and [[Maxine Tao]] will work to populate empty cells in The File to Rule Them All with that data.&lt;br /&gt;
&lt;br /&gt;
==Advance User Guide: An in-depth look into the project and the various settings==&lt;br /&gt;
&lt;br /&gt;
===Accelerators needed to Crawl===&lt;br /&gt;
The list of accelerator names to crawl is stored in the file:&lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\Test Run\ListOfAccsToCrawl.txt&lt;br /&gt;
&lt;br /&gt;
===Training Data===&lt;br /&gt;
Training data is stored in the folder:&lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\Test Run\TrainingHTML&lt;br /&gt;
&lt;br /&gt;
===The Crawler Functionality===&lt;br /&gt;
The crawler functionality is stored in the file:&lt;br /&gt;
 STEP1_crawl.py&lt;br /&gt;
The crawler was optimized for improved speed, improved performance, and improved filtration while remaining functional over the large set of data.&lt;br /&gt;
&lt;br /&gt;
BUG REPORT by Maxine Tao (FIXED): update the crawler with this line of code:&lt;br /&gt;
 search_results = driver.find_elements_by_xpath(&amp;quot;//div[@class='g']/div/div/div/h3/a&amp;quot;) + driver.find_elements_by_xpath(&amp;quot;//div[@class='g']/div/div/h3/a&amp;quot;)&lt;br /&gt;
This was needed because the crawler had stopped grabbing the first web page (I think because Google may have modified how their website looks).&lt;br /&gt;
 &lt;br /&gt;
===The Classifier===&lt;br /&gt;
&lt;br /&gt;
===Input (Features)===&lt;br /&gt;
The input (features) right now is the frequency of X_NUMBER of words appearing in each document. The words are hand selected. This is the naive bag-of-words approach. &lt;br /&gt;
&lt;br /&gt;
Idea: Create a matrix with the first col being the file BiBTex, and the following columns being the words, where the value at (file, word) is the frequency of that word in the file.&lt;br /&gt;
Then, split the matrix into an array of row vectors, and feed each vector into the RNN.&lt;br /&gt;
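&lt;br /&gt;
A minimal sketch of the feature matrix described above, in plain Python. The keyword list and sample documents are invented for illustration (the real keywords live in words.txt), and build_feature_matrix is a hypothetical name, not the function used in STEP2_preprocessing_feature_matrix_generator.py.&lt;br /&gt;

```python
import re
from collections import Counter

def build_feature_matrix(documents, keywords):
    # documents: dict mapping a file name to its text.
    # Returns one row per file: [file name, count of keyword 1, count of keyword 2, ...].
    matrix = []
    for filename, text in documents.items():
        counts = Counter(re.findall(r'[a-z]+', text.lower()))
        matrix.append([filename] + [counts[word] for word in keywords])
    return matrix

# Hypothetical keywords and page texts.
keywords = ['demo', 'cohort', 'startups']
documents = {
    'page1.html': 'Demo Day recap: meet the cohort. Ten startups pitched at demo day.',
    'page2.html': 'Our pricing and contact information.',
}
for row in build_feature_matrix(documents, keywords):
    print(row)
```

Each row after the file name is the vector that would be passed to a classifier.&lt;br /&gt;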
&lt;br /&gt;
This seems to not give really high accuracy with our LSTM RNN, so I will consider a word2vec approach&lt;br /&gt;
&lt;br /&gt;
==Development Notes==&lt;br /&gt;
Right now I am working on two different classifiers: Kyran's old Random Forest model (optimizing it by tweaking parameters and trying different combinations of features) and my RNN text classifier.&lt;br /&gt;
&lt;br /&gt;
The RF model has a ~92% accuracy on the training data and ~70% accuracy on the test data.&lt;br /&gt;
&lt;br /&gt;
The RNN currently has a ~50% accuracy on both train and test data, which is rather concerning. &lt;br /&gt;
&lt;br /&gt;
Test : train ratio is 1:3 (25/75)&lt;br /&gt;
&lt;br /&gt;
Both models currently use the bag-of-words approach to preprocess data, but I will try to use Yang's code in the industry classifier to preprocess using word2vec. I'm not familiar with this approach, but I will try to learn it.&lt;br /&gt;
&lt;br /&gt;
==Reading resources==&lt;br /&gt;
http://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23858</id>
		<title>U.S. Seed Accelerators</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23858"/>
		<updated>2018-07-31T19:04:08Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: /* Update for Hira */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=U.S. Seed Accelerators&lt;br /&gt;
|Has owner=Connor Rothschild,&lt;br /&gt;
|Has start date=06/18/2018&lt;br /&gt;
|Has keywords=accelerators, data&lt;br /&gt;
|Has notes=Continuation of [[Accelerator Data]]&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Is dependent on=Industry Classifier, Demo Day Page Parser, Accelerator Demo Day, Demo Day Page Google Classifier, Crunchbase Accelerator Equity, Crunchbase Accelerator Founders, Crunchbase Data,&lt;br /&gt;
|Has image=Seed-Accelerator.jpg&lt;br /&gt;
|Does subsume=Accelerator Data, Accelerator Seed List (Data),&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Project Location==&lt;br /&gt;
The master file can be found at &lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
==Relevant Former Projects==&lt;br /&gt;
This page serves as an updated and tidied version of the data and work presented on the [[Accelerator Seed List (Data)]] Project, which subsumed [[Accelerator Data]].&lt;br /&gt;
Both of these projects (and as a corollary, this project) are dependent on the [[Demo Day Page Parser]], [[Industry Classifier]], and the [[Whois Parser]].&lt;br /&gt;
&lt;br /&gt;
==Update for Hira==&lt;br /&gt;
&lt;br /&gt;
===Recoding Stage===&lt;br /&gt;
&lt;br /&gt;
I have updated and cleaned up the &amp;quot;what stage accs look for companies in&amp;quot; variable by splitting it into three categories:&lt;br /&gt;
*seed&lt;br /&gt;
*early stage venture&lt;br /&gt;
*late&lt;br /&gt;
&lt;br /&gt;
Other classifications were collapsed into these three or were too rare (n&amp;lt;2) to be coded as their own classification. The Google Sheet used to recode the stage variable can be found here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1G_XbIrHB6YOU5tWs0dqZot6_eJLDsp-nxoAIuHM9_Yc/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
It has been merged into The File To Rule Them All.&lt;br /&gt;
&lt;br /&gt;
===Dead Accelerators===&lt;br /&gt;
&lt;br /&gt;
We have updated dead accelerators on the following Google Sheet&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1_mZ8QgEXwSoTeyVbiEg2ZfoQukvCHfr0NKSk2QMGYnI/edit?usp=sharing&lt;br /&gt;
and have merged this into The File To Rule Them All.&lt;br /&gt;
&lt;br /&gt;
===Recoding Equity/Investment===&lt;br /&gt;
&lt;br /&gt;
The Google Sheet with this work is here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1xFlFR1OAoHY4XgesB8ZAugL99DgT0OehxIEzcmYuMB8/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
It has also been merged into The File To Rule Them All.&lt;br /&gt;
&lt;br /&gt;
I have updated the equity data using https://www.seed-db.com/accelerators.&lt;br /&gt;
&lt;br /&gt;
I have also added columns with a new normalized version of investment. Within the File To Rule Them All, you will find two normalizations of investment:&lt;br /&gt;
*the midpoint normalization, which takes the midpoint of each accelerator's investment range.&lt;br /&gt;
*the upper bound normalization, which takes the highest amount each accelerator will invest.&lt;br /&gt;
&lt;br /&gt;
This dual normalization was performed because many accelerators say they invest &amp;quot;Up to $__,___&amp;quot; so a midpoint may not accurately reflect actual investment amounts.&lt;br /&gt;
&lt;br /&gt;
The average investment when using the midpoint is $40,164 and the average investment when using the upper limit is $48,313.&lt;br /&gt;
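&lt;br /&gt;
As a rough sketch of the two normalizations (the function name and most of the sample ranges are hypothetical, and treating an &amp;quot;up to&amp;quot; amount as a range that starts at $0 is an assumption, not something recorded in the sheet):&lt;br /&gt;
&lt;br /&gt;
```python
from statistics import mean

def normalize_investment(low, high):
    # Return (midpoint, upper_bound) for one stated investment range.
    return (low + high) / 2, high

# Illustrative ranges in dollars. An 'up to' offer is given a low of 0
# (assumption). The last pair is the Boost VC range quoted on this page.
ranges = [(20000, 20000), (0, 50000), (50000, 500000)]

normalized = [normalize_investment(low, high) for low, high in ranges]
avg_midpoint = mean(mid for mid, _ in normalized)
avg_upper = mean(upper for _, upper in normalized)
```
&lt;br /&gt;
With a wide range in the list, both averages are pulled upward, and the upper bound average is pulled much harder than the midpoint average.&lt;br /&gt;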
&lt;br /&gt;
'''NOTE: There may be one outlier to control for, as Boost VC says they offer between $50,000 and $500,000. This is a huge range, and the upper limit of $500,000 may throw off our analysis.&lt;br /&gt;
&lt;br /&gt;
By removing Boost VC's investment amounts, the average using midpoint drops to $37,555 (~$3,000 less) and the average using upper limits drops to $43,293 (~$5,000 less). The distance between the two averages drops from ~$8,000 to ~$6,000. We should consider/discuss removing or controlling for Boost VC.'''&lt;br /&gt;
&lt;br /&gt;
===Amazon Mechanical Turk Pricing===&lt;br /&gt;
Information can be found here [http://mcnair.bakerinstitute.org/wiki/Accelerator_Demo_Day#Pricing]&lt;br /&gt;
&lt;br /&gt;
===Recoded Founders Education===&lt;br /&gt;
&lt;br /&gt;
I have recoded two components of the founders' education sheet:&lt;br /&gt;
&lt;br /&gt;
1) Degree name has been reclassified into nine categories:&lt;br /&gt;
*High School&lt;br /&gt;
*Associates&lt;br /&gt;
*Bachelors&lt;br /&gt;
*Masters&lt;br /&gt;
*Certificate&lt;br /&gt;
*JD&lt;br /&gt;
*MBA&lt;br /&gt;
*PhD&lt;br /&gt;
*Other&lt;br /&gt;
&lt;br /&gt;
2) Majors have also been recoded into nine categories:&lt;br /&gt;
*H = Humanities&lt;br /&gt;
*SS = Social Sciences&lt;br /&gt;
*NS = Natural Sciences&lt;br /&gt;
*E = Engineering (includes computer science)&lt;br /&gt;
*B = Business and Economics&lt;br /&gt;
*L = Leadership&lt;br /&gt;
*MBA&lt;br /&gt;
*JD&lt;br /&gt;
*O = Other&lt;br /&gt;
&lt;br /&gt;
The Google Sheet I used to reclassify can be found here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1XWtCTeaof8WxAuOCbn3XEFaK0sh-ZIif2JqcdbkH72I/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
The sheet '''OLD''' contains the old, outdated data, '''WORK''' contains the reclassification process, and '''Updated Info''' contains only the good, updated data. &lt;br /&gt;
&lt;br /&gt;
The data has been merged into '''The File to Rule Them All'''. The old data has not yet been deleted but can be at any time, since we have it saved in the Google Sheet '''OLD''' (I just wanted the go-ahead from Hira or Ed).&lt;br /&gt;
&lt;br /&gt;
===Multiple campuses and cohorts===&lt;br /&gt;
&lt;br /&gt;
The File to Rule Them All contains an updated address variable. 139 of our 166 accelerators have addresses that are available online. The ones we could not find information for are left blank.&lt;br /&gt;
&lt;br /&gt;
The collaborative sheet that Hira, Maxine, and I worked on to update this list can be accessed here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1nktgJZfm3L8IsSCHgYbPasSdvKb7QHKxZp8K5Si2iNo/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
We have also added a new cohort list. Under The File To Rule Them All, the sheet '''Multiple Campuses''' contains the different locations for accelerators with multiple campuses. Column B can be used to filter out multiple location accelerators.&lt;br /&gt;
&lt;br /&gt;
===Fixed Google Sheet===&lt;br /&gt;
&lt;br /&gt;
We created a new sheet with only data we want to keep, and cleaned it up. That sheet is called &amp;quot;Good Data Only&amp;quot;, at the same link:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
I first used our &amp;quot;recap&amp;quot; and &amp;quot;announced&amp;quot; classification to standardize and fix the dates.&lt;br /&gt;
&lt;br /&gt;
*Columns N-R contain our new data. Please note that all of these columns are based on formulas and will break if edited.&lt;br /&gt;
*Column N is the # of weeks for an accelerator program, gathered via VLookup from The File to Rule Them All.&lt;br /&gt;
*Column O is the Actual Date we want to record, and was gathered by subtracting the # of weeks from a date '''if''' the listed page was a '''recap'''.&lt;br /&gt;
*Columns P and Q are the month and year extracted from Column O.&lt;br /&gt;
*Finally, Column R is the season variable, as Ed said it should be coded. &lt;br /&gt;
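&lt;br /&gt;
The logic behind Columns O through Q can be sketched as follows. This is not the spreadsheet formula itself: the function name and sample date are hypothetical, and leaving non-recap dates unchanged is my reading of the rule above.&lt;br /&gt;
&lt;br /&gt;
```python
from datetime import date, timedelta

def cohort_start(page_date, program_weeks, page_type):
    # For a recap page, step back by the program length (Column N).
    # Any other page type is assumed to already carry the date we want.
    if page_type == 'recap':
        return page_date - timedelta(weeks=program_weeks)
    return page_date

# Hypothetical row: a recap page dated 2018-06-15 for a 12-week program.
start = cohort_start(date(2018, 6, 15), 12, 'recap')   # Column O
month, year = start.month, start.year                  # Columns P and Q
```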
&lt;br /&gt;
We have also gone through and removed all bad data, all duplicates, and all rows without timing info. This is the most complete list possible.&lt;br /&gt;
&lt;br /&gt;
===Recode employee count===&lt;br /&gt;
&lt;br /&gt;
I have added a new column in Cohorts Final (of the File to Rule Them All) but left the old column in case you would prefer to edit/classify differently.&lt;br /&gt;
Column AB (emp_count_scale) contains a variable coded on a scale of 1 to 9, with each number corresponding to one of the employee_count classifications (1 the lowest, 9 the highest). The exact output can be modified (1 could instead be tiny, 2 be small... 9 be huge).&lt;br /&gt;
&lt;br /&gt;
The employee count column is standardized and can easily be edited given some modification of the Excel formula.&lt;br /&gt;
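&lt;br /&gt;
For anyone who would rather recode outside Excel, here is a minimal sketch of the same 1 to 9 mapping. The bucket labels below are an assumption about how employee_count is written, so check them against the actual column before relying on this.&lt;br /&gt;
&lt;br /&gt;
```python
# Assumed bucket labels, ordered from smallest to largest; the real labels
# should be taken from the employee_count column itself.
EMPLOYEE_BUCKETS = ['1-10', '11-50', '51-100', '101-250', '251-500',
                    '501-1000', '1001-5000', '5001-10000', '10000+']

# Map each label to its 1-based rank: the first label is 1, the last is 9.
EMP_COUNT_SCALE = {label: rank
                   for rank, label in enumerate(EMPLOYEE_BUCKETS, start=1)}

def emp_count_scale(employee_count):
    # Labels that are not in the list (blank, 'unknown') return None.
    return EMP_COUNT_SCALE.get(employee_count)
```
&lt;br /&gt;
Swapping the numbers for words (tiny, small, and so on) only requires changing the values in the mapping.&lt;br /&gt;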
&lt;br /&gt;
===Remaining to do===&lt;br /&gt;
&lt;br /&gt;
===Recoding Founders Experience===&lt;br /&gt;
&lt;br /&gt;
I've begun recoding the Founders variable. I can categorize 232 of the 824 listed job titles (there are 466 distinct job titles) into 9 categories (the most frequent results as given in a PivotTable found in the sheet).&lt;br /&gt;
&lt;br /&gt;
We will need to talk about how to categorize the most extraneous job titles.&lt;br /&gt;
&lt;br /&gt;
==Recent Work==&lt;br /&gt;
&lt;br /&gt;
Here's a project update on the work that has been done since coming to McNair. The most recent file is&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
===Merging Cohort Companies with Crunchbase Info===&lt;br /&gt;
&lt;br /&gt;
More information on this part of the project can be found on the page [[Merging Existing Data with Crunchbase]].&lt;br /&gt;
&lt;br /&gt;
The newest updated sheet of cohort company info is under the '''Cohorts Final''' sheet of '''The File to Rule Them All.xlsx'''.&lt;br /&gt;
&lt;br /&gt;
Working with [[Maxine Tao]], we have matched companies to their respective pages and information found in Crunchbase (via UUID). We ensured single matches by doing a 1-1-1-1 match with our data and with Crunchbase (using the Matcher). We then received additional information on 8092 companies. The following new information (on top of what we already had) is included in the sheet:&lt;br /&gt;
*short_description&lt;br /&gt;
*long_description&lt;br /&gt;
*category_list (details the company's category)&lt;br /&gt;
*category_group_list (a less refined, more all-encompassing category classification)&lt;br /&gt;
*founded_on date&lt;br /&gt;
*employee_count&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*address&lt;br /&gt;
&lt;br /&gt;
And the following information was also pulled from Crunchbase and '''merged''' with our existing data: &lt;br /&gt;
*URL (was merged with courl cells) &lt;br /&gt;
*city (was merged with colocation)&lt;br /&gt;
*state_code (was merged with colocation)&lt;br /&gt;
*country_code (was merged with colocation)&lt;br /&gt;
*status (was merged with costatus)&lt;br /&gt;
&lt;br /&gt;
===The Equity Variables===&lt;br /&gt;
&lt;br /&gt;
[[Maxine Tao]] and I have added six new variables to the '''Accelerators Final''' sheet. Those variables are:&lt;br /&gt;
&lt;br /&gt;
*Terms of joining - terms of joining accelerator and important details about program&lt;br /&gt;
*Equity? (1/0) - cells contain a 1 if the accelerator takes equity, a 0 if the accelerator definitively does not, and are blank if we could not find that information&lt;br /&gt;
*Equity Amount - the % of equity the accelerator will take (can sometimes be a range, e.g. 5-7%)&lt;br /&gt;
*Equity Amount Normalized - this copies the previous column but only keeps %&amp;gt;0, and if a range was given (e.g. 5-7%) it returns the average (e.g. 6%)&lt;br /&gt;
*Investment Amount - the $ the accelerator invests in a company to begin, if relevant (could also be a range or an &amp;quot;up to $######&amp;quot;)&lt;br /&gt;
*Investment Notes - anything to comment on previous 4 columns&lt;br /&gt;
&lt;br /&gt;
These six variables tell us more about the characteristics of accelerators; specifically, which ones take equity and which ones do not, and how much equity accelerators take.&lt;br /&gt;
&lt;br /&gt;
Relevant information: &lt;br /&gt;
*82 accelerators take equity, 42 do not, and we lack information for 37.&lt;br /&gt;
*The average % of equity among accelerators who take equity (rough estimate--do not use for anything official) is 6.49%. We got this number by looking only at accelerators who take equity, coding each reported range as its midpoint (e.g. 4%-10% equity is coded as 7%), and taking the mean.&lt;br /&gt;
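&lt;br /&gt;
The Equity Amount Normalized rule can be sketched like this. The function name is hypothetical, and the parsing assumes each cell holds either a single percentage or a simple low-high range.&lt;br /&gt;
&lt;br /&gt;
```python
import re

def normalize_equity(text):
    # Pull every number out of the cell, e.g. '5-7%' gives [5.0, 7.0].
    numbers = [float(n) for n in re.findall(r'\d+(?:\.\d+)?', text)]
    if not numbers:
        return None
    # A single value is its own midpoint; a range is averaged.
    value = (min(numbers) + max(numbers)) / 2
    # Only positive percentages are kept, so 0% becomes None.
    if value == 0:
        return None
    return value
```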
&lt;br /&gt;
===Matching Accelerators to UUIDs via Crunchbase===&lt;br /&gt;
&lt;br /&gt;
We've also added UUIDs for 163 of our 166 accelerators. The UUIDs can be found in Column AE of the '''Accelerators Final''' sheet.&lt;br /&gt;
&lt;br /&gt;
The file with accelerators matched to Crunchbase UUIDs can be found at:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Accelerators and UUIDs.xlsx&lt;br /&gt;
&lt;br /&gt;
This is the master file and should never be modified unless we find a UUID changed. ALL OTHER SHEETS with UUIDs are linked to this sheet so its changes will be reflected elsewhere.&lt;br /&gt;
&lt;br /&gt;
More information can be found on the [[Crunchbase Data]] page. &lt;br /&gt;
&lt;br /&gt;
===Linking Accelerators to Founders/LinkedIn Crawling===&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] got the [[LinkedIn Crawler (Python)]] to work, which means we currently have the following information about accelerator founders:&lt;br /&gt;
*Current Job Title&lt;br /&gt;
*Location&lt;br /&gt;
*Employer&lt;br /&gt;
*Job(s) Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location of jobs&lt;br /&gt;
*Extra Description&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
This information can be found in the various '''Founders''' sheets in '''The File to Rule Them All'''.&lt;br /&gt;
&lt;br /&gt;
===Finding Company URLs===&lt;br /&gt;
Excel master datasets are in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018&lt;br /&gt;
&lt;br /&gt;
Code and files specific to this URL finder are in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\url finder&lt;br /&gt;
&lt;br /&gt;
====Results====&lt;br /&gt;
I used STEP1_crawl.py and STEP2_findcorrecturl.py to add approximately 1000 more URLs into 'The File to Rule Them All.xlsx'.&lt;br /&gt;
&lt;br /&gt;
====Testing====&lt;br /&gt;
&lt;br /&gt;
In this file (sheet: 'Most Recent Merged Data' note that this is just a copy of 'Cohorts Final' in 'The File to Rule Them All'):&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\Merged W Crunchbase Data as of July 17.xlx&lt;br /&gt;
&lt;br /&gt;
We filter for companies (~4000) that did not receive VC, are not in Crunchbase, and do not have URLs.&lt;br /&gt;
Using a Google crawler (STEP1_crawl.py) and a URL matching script (STEP2_findcorrecturl.py), we will try to find as many URLs as possible.&lt;br /&gt;
&lt;br /&gt;
To test, I ran about 40 companies from &amp;quot;smallcompanylist.txt&amp;quot;, using only the company name as the search term and taking the first 4 valid results (see the DONT_COLLECT list in the code). The Google crawler and URL matcher were able to correctly identify around 20 URLs. They also misidentify some URLs that look very similar to the company name, but they are accurate for the most part if the name is not too generic. I then ran the 20 unfound company names through the crawler again, this time using company name + startup as the search term. This did not identify any more correct URLs.&lt;br /&gt;
&lt;br /&gt;
It seems reasonable to assume that if the company URL cannot be found within the first 4 valid search results, then that company probably does not have a URL at all. This is the case for many of the 20 unfound URLs from my test run above.&lt;br /&gt;
&lt;br /&gt;
====Actual Run Info====&lt;br /&gt;
&lt;br /&gt;
The companies we needed to find URLs for are in a file called 'ACTUALNEEDEDCOMPANIES.txt'.&lt;br /&gt;
&lt;br /&gt;
The first four results for every company, as found by STEP1_crawl.py, are in 'ACTUAL_crawled_company_urls.txt'.&lt;br /&gt;
&lt;br /&gt;
The results after the matching done by STEP2_findcorrecturl.py are in 'ACTUAL_finalurls.txt'.&lt;br /&gt;
&lt;br /&gt;
Note that in the end, I decided to only take URLs that were given a match score of greater than 0.9 by setting this restriction in STEP2_findcorrecturl.py. Then I manually removed any duplicates/inaccurate results. If you want, you can set the threshold lower in STEP2 and use STEP3_clean.py to find the URL with the highest score for each company. &lt;br /&gt;
&lt;br /&gt;
The point of this URL finder is to find timing information for companies. Timing information can be found on Whois. See the page http://mcnair.bakerinstitute.org/wiki/Whois_Parser#Summer_2018_Work for information on running the whois parser. &lt;br /&gt;
&lt;br /&gt;
====Using Python files====&lt;br /&gt;
'''To use STEP1_crawl.py''': &lt;br /&gt;
 INPUT: a list of company names (or anything) you would like to find websites for by searching on Google&lt;br /&gt;
 OUTPUT: a list of company names and the top X number of results from Google &lt;br /&gt;
&lt;br /&gt;
1. Change LIST_FILEPATH in line 26 to be the name of the file that contains the list of things you would like to search. &lt;br /&gt;
&lt;br /&gt;
2. Change NUMRESULT to be however many results you would like from Google. &lt;br /&gt;
&lt;br /&gt;
3. Adjust DONT_COLLECT to include any websites that you don't want. &lt;br /&gt;
&lt;br /&gt;
4. If you would like to add another search keyword, add this in line 87 which is queries.append(name + &amp;quot;whatever you want here&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
5. Change line 127 to be the name of your output file.&lt;br /&gt;
&lt;br /&gt;
'''To use STEP2_findcorrecturl.py''':&lt;br /&gt;
 INPUT: output file from STEP1&lt;br /&gt;
 OUTPUT: a file formatted the same as the output of STEP1, but URLs that do not match over the threshold value you set will be replaced with &amp;quot;no match&amp;quot; &lt;br /&gt;
&lt;br /&gt;
1. Change file f to be the output file name from STEP1. Change g to be the desired name of the output file for this part. &lt;br /&gt;
&lt;br /&gt;
2. In the if statement on line 44, set your desired threshold. Note that anything greater than 0.6 is generally considered a decent match. It might be safer to use 0.75; my choice of 0.9 ensures almost exact matches. However, if your list of companies (or whatever you are searching) includes really common names, then even a 90%+ match might not be exactly what you're looking for. &lt;br /&gt;
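&lt;br /&gt;
To illustrate how a threshold like this behaves, here is a small sketch. It is not the code in STEP2_findcorrecturl.py: it assumes the score is a difflib similarity ratio between the company name and the first label of the URL's host name, and both function names are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
from difflib import SequenceMatcher
from urllib.parse import urlparse

def match_score(company, url):
    # Compare the company name with the first label of the host name.
    host = urlparse(url).netloc.lower()
    if host.startswith('www.'):
        host = host[4:]
    label = host.split('.')[0]
    # Drop spaces and punctuation so 'Acme Labs' can match 'acmelabs'.
    name = ''.join(ch for ch in company.lower() if ch.isalnum())
    return SequenceMatcher(None, name, label).ratio()

def filter_url(company, url, threshold=0.9):
    # Keep the URL only when its score is strictly above the threshold.
    if match_score(company, url) > threshold:
        return url
    return 'no match'
```
&lt;br /&gt;
Lowering the threshold lets near-miss domains through, which is why a generic company name needs a manual check even at 0.9.&lt;br /&gt;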
&lt;br /&gt;
'''To use STEP3_clean.py''':&lt;br /&gt;
&lt;br /&gt;
Note this is an optional step to use depending on the accuracy level you need and what kind of data you crawled earlier. I chose not to use this and instead set a more restrictive threshold in STEP2. &lt;br /&gt;
&lt;br /&gt;
1. Change file f to be the output file from STEP2 (you should delete anything that says &amp;quot;no match&amp;quot;, and when you use STEP2, you must also write the ratio score to the text file). Change g to be the desired name of the output file for this part. &lt;br /&gt;
&lt;br /&gt;
Your output should be a text file containing the company name and the URL that had the highest assigned score in STEP2. In case of more than 1 URL with the highest score, the script should take the first one.&lt;br /&gt;
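&lt;br /&gt;
The selection rule described above (highest score wins, first URL wins ties) can be sketched as below. The function name and the in-memory tuple layout are hypothetical; STEP3_clean.py itself reads and writes text files.&lt;br /&gt;
&lt;br /&gt;
```python
def best_urls(rows):
    # rows: (company, url, score) tuples in the order they were crawled.
    best = {}
    for company, url, score in rows:
        # Replace only on a strictly higher score, so the first URL wins ties.
        if company not in best or score > best[company][1]:
            best[company] = (url, score)
    return {company: url for company, (url, _) in best.items()}
```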
&lt;br /&gt;
==An Overview==&lt;br /&gt;
This project will be used to determine which accelerators are the most effective at churning out successful startups, as well as what characteristics are exhibited by these accelerators. First, we need to gather as much data as we can about as many accelerators as we can in order to look at factors that differentiate successful vs. unsuccessful ventures. Next, we need to create a web crawling program which will gather information about accelerators across the world by accessing their websites and extracting information. I believe that our overall goal with this research project is to gain insight into the methods of successful accelerators, as well as to find out what exactly differentiates very successful accelerators from dead accelerators.&lt;br /&gt;
&lt;br /&gt;
Helpful Links: http://seedrankings.com/&lt;br /&gt;
&lt;br /&gt;
==Remaining To Dos==&lt;br /&gt;
&lt;br /&gt;
The last update on [[Accelerator Seed List (Data)]] said the following needed to be done: &lt;br /&gt;
&lt;br /&gt;
*Cross-reference sheet with data from Peter's old accelerator consolidation file (&amp;quot;accelerator_data_noflag&amp;quot; and &amp;quot;accelerator_data&amp;quot; in &amp;quot;All Relevant Files&amp;quot;) and fill in missing data&lt;br /&gt;
*Variables that are 100% NOT in these 2 files:&lt;br /&gt;
**Cohort Breakout?&lt;br /&gt;
**Subtype&lt;br /&gt;
**Designed for Students?&lt;br /&gt;
**Campuses&lt;br /&gt;
**Stage&lt;br /&gt;
**Software Tech&lt;br /&gt;
**What stage do they look for?&lt;br /&gt;
&lt;br /&gt;
==(Outdated) Necessary Steps==&lt;br /&gt;
&lt;br /&gt;
'''Acquiring the necessary data to complete the Accelerator Master Variable List and the Cohort List will require the following (not necessarily in this order):'''&lt;br /&gt;
&lt;br /&gt;
===Step Zero: Connect to Crunchbase and Link Data===&lt;br /&gt;
&lt;br /&gt;
Complete - more info: [[Crunchbase Data]]&lt;br /&gt;
&lt;br /&gt;
===Step One: LinkedIn Founders Data===&lt;br /&gt;
&lt;br /&gt;
This project will begin by working with [[Grace Tan]] and [[Maxine Tao]] to connect accelerators to their founders and cohort companies using Crunchbase and LinkedIn crawlers. Grace and Maxine will go through [http://crunchbase.com Crunchbase] and find the UUID for companies and their founders (reference [[Crunchbase Data]], [[Crunchbase Accelerator Founders]], [[Crunchbase Accelerator Equity]]). Connect them using SQL and feed the names of founders into our LinkedIn crawler (headed by [[Grace Tan]]).&lt;br /&gt;
&lt;br /&gt;
The list of founders for accelerators can be found at&lt;br /&gt;
 McNair/Projects/Accelerators/Fall 2017/founders_linkedin.txt&lt;br /&gt;
&lt;br /&gt;
The '''Unfound Founders''' file codes a 0 for all companies '''''not listed''''' within the LinkedIn Founders file, and a 1 for those that do have founders listed.&lt;br /&gt;
&lt;br /&gt;
'''Given the founders' names, we will then be able to use the [[LinkedIn Crawler (Python)]] to find the relevant details of an accelerator founder (education, work experience, etc.) This data on founders will help us solve the horse, jockey, racetrack question to detect what variables affect a startup's success (the accelerator, the founders, the environment/city).'''&lt;br /&gt;
&lt;br /&gt;
===Step Two: Linking Accelerators to Cohorts Using Investments on Crunchbase===&lt;br /&gt;
&lt;br /&gt;
In this step we focus on accelerators who take equity from the companies that engage in their program. We do this to avoid looking at accelerators who may also run funds/invest in various companies but do not take equity. Including them would give us misleading results and lead us to believe some companies are in cohorts at accelerators when they really are not. &lt;br /&gt;
&lt;br /&gt;
Maxine will acquire the list of accelerators who take equity from companies from the following sheet:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/All Relevant Files/accelerator_data_noflag.txt&lt;br /&gt;
&lt;br /&gt;
Looking at the file, however, shows that very few are actually categorized well and the equity variable is messy. Moving forward, we need to check/refine/fix this classification.&lt;br /&gt;
&lt;br /&gt;
This file has 266 rows. The most recent, actual version of our accelerator database (found at E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx under the sheet Master Variable List) only has 167 rows, meaning the accelerator_data_noflag.txt file has too many rows. &lt;br /&gt;
&lt;br /&gt;
We will need to do a left join of Accelerator Master List with accelerator_data_noflag.txt to get rid of the accelerator names that are in accelerator_data_noflag but NOT in Accelerator Master List.&lt;br /&gt;
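&lt;br /&gt;
The effect of that left join can be sketched in a few lines. The function name and the dictionary layout are hypothetical, and in practice this would more likely be done in SQL.&lt;br /&gt;
&lt;br /&gt;
```python
def left_join(master_names, noflag_rows):
    # noflag_rows maps an accelerator name to its equity flag.
    # Every master name is kept (with None when it has no match);
    # names that appear only in noflag_rows are dropped.
    return {name: noflag_rows.get(name) for name in master_names}
```
&lt;br /&gt;
The result has exactly one entry per accelerator in the master list, which is what trims the 266 rows down to the master list's size.&lt;br /&gt;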
&lt;br /&gt;
Once this is finished, we should have an “Equity” classification variable for every accelerator in Accelerator Master List. The accelerators that have a Y (or maybe it’s a 1) are the ones that take equity. These are the accelerators we’ll be able to do the Crunchbase work on to see when accelerators take equity.&lt;br /&gt;
&lt;br /&gt;
We then look at the accelerators' investments (or companies and the entities which invested in them), cross-reference the list of companies/accelerators, and once we find a match, we know that a company went through an accelerator and during which year they went through a cohort.&lt;br /&gt;
&lt;br /&gt;
'''From this, we get the following data:'''&lt;br /&gt;
*Accelerator a given company went through&lt;br /&gt;
*Year said company went through a cohort/Specific cohort company went through&lt;br /&gt;
&lt;br /&gt;
===Step Three: Demo Day Crawler===&lt;br /&gt;
&lt;br /&gt;
This part of the project relies on the contributions of the wonderful [[Minh Le]]. Better documentation for the project can be found on the [[Demo Day Page Parser]], [[Demo Day Page Google Classifier]], and [[Accelerator Demo Day]] project pages.&lt;br /&gt;
&lt;br /&gt;
Essentially, this part of the accelerator data project will use the [[Demo Day Page Parser]] to look through accelerator websites for pages which list a cohort's 'Demo Day', or the day in which accelerators present their companies to a group of special investors ([https://www.ycombinator.com/demoday/faq/ here's an example FAQ page from Y Combinator]). The [[Demo Day Page Google Classifier]] will then determine if the page is, in fact, a demo day page. &lt;br /&gt;
&lt;br /&gt;
'''Given a cohort's demo day, we can gather a few pieces of key information (check with Ed to make sure this is the correct information to gather from Demo Days):'''&lt;br /&gt;
*The date a cohort began/the season the cohort went through the accelerator&lt;br /&gt;
**This is acquired by looking at the cohort's demo day date, and subtracting the number of weeks/months of a cohort for that given accelerator. The length of a cohort can be found in the file:&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx&lt;br /&gt;
*I assume we can also acquire the companies in a specific cohort, as in we will have the list of all companies in the (for example) Spring 2018 Cohort of Techstars.&lt;br /&gt;
&lt;br /&gt;
===Step Four: Non-profit Finder===&lt;br /&gt;
&lt;br /&gt;
More at [[Non-profit Finder]]&lt;br /&gt;
&lt;br /&gt;
Another important step in this project is finding which accelerators are non-profits.&lt;br /&gt;
&lt;br /&gt;
A comprehensive list of nonprofits taken from the [https://www.irs.gov/charities-non-profits/tax-exempt-organization-search-bulk-data-downloads IRS] can be found here:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Nonprofits in US.xlsx&lt;br /&gt;
Warning: this file has 1 million rows&lt;br /&gt;
&lt;br /&gt;
This file should be cross-referenced with the list of accelerators to find which ones are listed--those accelerators are non-profits. We'll need someone good at SQL to do this with a match.&lt;br /&gt;
&lt;br /&gt;
Potential problem:&lt;br /&gt;
*The names of listed non-profits will likely be somewhat different than the names of accelerators, because companies often file for tax exemption with different names than they show the public.&lt;br /&gt;
&lt;br /&gt;
==Workflow Image==&lt;br /&gt;
&lt;br /&gt;
(the color coding for this image is a very rough and preliminary estimate subject to change) I imagine the project will look something like this, in that it will require the following information to fully complete the project:&lt;br /&gt;
&lt;br /&gt;
[[File:image1.png]]&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23857</id>
		<title>U.S. Seed Accelerators</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23857"/>
		<updated>2018-07-31T16:01:36Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: /* Update for Hira */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=U.S. Seed Accelerators&lt;br /&gt;
|Has owner=Connor Rothschild,&lt;br /&gt;
|Has start date=06/18/2018&lt;br /&gt;
|Has keywords=accelerators, data&lt;br /&gt;
|Has notes=Continuation of [[Accelerator Data]]&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Is dependent on=Industry Classifier, Demo Day Page Parser, Accelerator Demo Day, Demo Day Page Google Classifier, Crunchbase Accelerator Equity, Crunchbase Accelerator Founders, Crunchbase Data,&lt;br /&gt;
|Has image=Seed-Accelerator.jpg&lt;br /&gt;
|Does subsume=Accelerator Data, Accelerator Seed List (Data),&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Project Location==&lt;br /&gt;
The master file can be found at &lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
==Relevant Former Projects==&lt;br /&gt;
This page serves as an updated and tidied version of the data and work presented on the [[Accelerator Seed List (Data)]] Project, which subsumed [[Accelerator Data]].&lt;br /&gt;
Both of these projects (and as a corollary, this project) are dependent on the [[Demo Day Page Parser]], [[Industry Classifier]], and the [[Whois Parser]].&lt;br /&gt;
&lt;br /&gt;
==Update for Hira==&lt;br /&gt;
&lt;br /&gt;
===Dead Accelerators===&lt;br /&gt;
&lt;br /&gt;
We have updated dead accelerators on the following Google Sheet&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1_mZ8QgEXwSoTeyVbiEg2ZfoQukvCHfr0NKSk2QMGYnI/edit?usp=sharing&lt;br /&gt;
and have merged this into The File To Rule Them All.&lt;br /&gt;
&lt;br /&gt;
===Recoding Equity/Investment===&lt;br /&gt;
&lt;br /&gt;
The Google Sheet with this work is here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1xFlFR1OAoHY4XgesB8ZAugL99DgT0OehxIEzcmYuMB8/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
It has also been merged into The File To Rule Them All.&lt;br /&gt;
&lt;br /&gt;
I have updated the equity data using https://www.seed-db.com/accelerators.&lt;br /&gt;
&lt;br /&gt;
I have also added columns with a new normalized version of investment. Within the File To Rule Them All, you will find two normalizations of investment:&lt;br /&gt;
*the midpoint normalization, which takes the midpoint of each accelerator's investment range.&lt;br /&gt;
*the upper bound normalization, which takes the highest amount each accelerator will invest.&lt;br /&gt;
&lt;br /&gt;
This dual normalization was performed because many accelerators say they invest &amp;quot;Up to $__,___&amp;quot; so a midpoint may not accurately reflect actual investment amounts.&lt;br /&gt;
&lt;br /&gt;
The average investment when using the midpoint is $40,164 and the average investment when using the upper limit is $48,313.&lt;br /&gt;
&lt;br /&gt;
'''NOTE: There may be one outlier to control for, as Boost VC says they offer between $50,000 and $500,000. This is a huge range, and the upper limit of $500,000 may throw off our analysis.&lt;br /&gt;
&lt;br /&gt;
By removing Boost VC's investment amounts, the average using midpoint drops to $37,555 (~$3,000 less) and the average using upper limits drops to $43,293 (~$5,000 less). The distance between the two averages drops from ~$8,000 to ~$6,000. We should consider/discuss removing or controlling for Boost VC.'''&lt;br /&gt;
&lt;br /&gt;
===Amazon Mechanical Turk Pricing===&lt;br /&gt;
Information can be found here [http://mcnair.bakerinstitute.org/wiki/Accelerator_Demo_Day#Pricing]&lt;br /&gt;
&lt;br /&gt;
===Recoded Founders Education===&lt;br /&gt;
&lt;br /&gt;
I have recoded two components of the founders' education sheet:&lt;br /&gt;
&lt;br /&gt;
1) Degree name has been reclassified into nine categories:&lt;br /&gt;
*High School&lt;br /&gt;
*Associates&lt;br /&gt;
*Bachelors&lt;br /&gt;
*Masters&lt;br /&gt;
*Certificate&lt;br /&gt;
*JD&lt;br /&gt;
*MBA&lt;br /&gt;
*PhD&lt;br /&gt;
*Other&lt;br /&gt;
&lt;br /&gt;
2) Majors have also been recoded into nine categories:&lt;br /&gt;
*H = Humanities&lt;br /&gt;
*SS = Social Sciences&lt;br /&gt;
*NS = Natural Sciences&lt;br /&gt;
*E = Engineering (includes computer science)&lt;br /&gt;
*B = Business and Economics&lt;br /&gt;
*L = Leadership&lt;br /&gt;
*MBA&lt;br /&gt;
*JD&lt;br /&gt;
*O = Other&lt;br /&gt;
&lt;br /&gt;
The Google Sheet I used to reclassify can be found here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1XWtCTeaof8WxAuOCbn3XEFaK0sh-ZIif2JqcdbkH72I/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
The sheet '''OLD''' contains the old, outdated data, '''WORK''' contains the reclassification process, and '''Updated Info''' contains only the good, updated data. &lt;br /&gt;
&lt;br /&gt;
The data has been merged into '''The File to Rule Them All'''. The old data has not yet been deleted but can be at any time, since we have it saved in the Google Sheet '''OLD''' (I just wanted the go-ahead from Hira or Ed).&lt;br /&gt;
&lt;br /&gt;
===Multiple campuses and cohorts===&lt;br /&gt;
&lt;br /&gt;
The File to Rule Them All contains an updated address variable. 139 of our 166 accelerators have addresses that are available online. The ones we could not find information for are left blank.&lt;br /&gt;
&lt;br /&gt;
The collaborative sheet that Hira, Maxine, and I worked on to update this list can be accessed here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1nktgJZfm3L8IsSCHgYbPasSdvKb7QHKxZp8K5Si2iNo/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
We have also added a new cohort list. Under The File To Rule Them All, the sheet '''Multiple Campuses''' contains the different locations for accelerators with multiple campuses. Column B can be used to filter out multiple location accelerators.&lt;br /&gt;
&lt;br /&gt;
===Fixed Google Sheet===&lt;br /&gt;
&lt;br /&gt;
We created a new sheet with only data we want to keep, and cleaned it up. That sheet is called &amp;quot;Good Data Only&amp;quot;, at the same link:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
I first used our &amp;quot;recap&amp;quot; and &amp;quot;announced&amp;quot; classification to standardize and fix the dates.&lt;br /&gt;
&lt;br /&gt;
*Columns N-R contain our new data. Please note that all of these columns are based on formulas and will be made erroneous if edited.&lt;br /&gt;
*Column N is the # of weeks for an accelerator program, gathered via VLookup from The File to Rule Them All.&lt;br /&gt;
*Column O is the Actual Date we want to record, and was gathered by subtracting the # of weeks from a date '''if''' the listed page was a '''recap'''.&lt;br /&gt;
*Columns P and Q are the Month and Years stripped from Column O.&lt;br /&gt;
*Finally, Column R is the season variable, as Ed said it should be coded. &lt;br /&gt;
&lt;br /&gt;
We have also gone through and removed all bad data, all duplicates, and all rows without timing info. This is the most complete list possible.&lt;br /&gt;
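&lt;br /&gt;
As a rough illustration of the Column O through R logic described above, here is a small Python sketch. The function names, the handling of non-recap pages (the date is kept as-is), and the month-to-season mapping are all assumptions made for illustration; the real season coding is whatever Ed specified in the sheet.&lt;br /&gt;

```python
from datetime import date, timedelta

# Hypothetical month-to-season mapping; the sheet's real coding may differ.
SEASONS = {12: "Winter", 1: "Winter", 2: "Winter",
           3: "Spring", 4: "Spring", 5: "Spring",
           6: "Summer", 7: "Summer", 8: "Summer",
           9: "Fall", 10: "Fall", 11: "Fall"}


def actual_start_date(page_date, program_weeks, page_type):
    """Return the cohort date to record (Column O in the sheet)."""
    if page_type == "recap":
        # Recap pages are dated after the program, so step back its length.
        return page_date - timedelta(weeks=program_weeks)
    # Assumption: "announced" pages already carry the date we want.
    return page_date


def timing_columns(page_date, program_weeks, page_type):
    """Return (actual date, month, year, season), mirroring Columns O-R."""
    actual = actual_start_date(page_date, program_weeks, page_type)
    return actual, actual.month, actual.year, SEASONS[actual.month]
```

For example, timing_columns(date(2018, 8, 15), 12, "recap") steps back 84 days before deriving the month, year, and season.&lt;br /&gt;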
&lt;br /&gt;
===Recode employee count===&lt;br /&gt;
&lt;br /&gt;
I have added a new column in Cohorts Final (of the File to Rule Them All) yet left the old column in case you would prefer to edit/classify differently.&lt;br /&gt;
Column AB (emp_count_scale) contains a variable coded on a scale of 1 to 9, with each number corresponding to one of the employee_count classifications (1 the lowest, 9 the highest). The exact output can be modified (1 could instead be tiny, 2 be small... 9 be huge).&lt;br /&gt;
&lt;br /&gt;
The employee count column is standardized and can easily be edited given some modification of the Excel formula.&lt;br /&gt;
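&lt;br /&gt;
The recode itself lives in an Excel formula, but the same mapping can be sketched in Python. The nine bucket labels below are assumed to follow the usual Crunchbase employee ranges; they are not copied from the sheet, so check them against the actual employee_count values before relying on this.&lt;br /&gt;

```python
# Assumed bucket labels, ordered smallest to largest; verify against the
# values actually present in the employee_count column.
EMPLOYEE_BUCKETS = ["1-10", "11-50", "51-100", "101-250", "251-500",
                    "501-1000", "1001-5000", "5001-10000", "10000+"]

# Map each label to its 1-based rank (1 = lowest, 9 = highest).
EMP_COUNT_SCALE = {label: rank
                   for rank, label in enumerate(EMPLOYEE_BUCKETS, start=1)}


def emp_count_scale(employee_count):
    """Return the 1-9 scale value, or None for blank or unknown labels."""
    if employee_count is None:
        return None
    return EMP_COUNT_SCALE.get(employee_count.strip())
```

Swapping the numbers for labels such as tiny or small would only require changing the values in EMP_COUNT_SCALE.&lt;br /&gt;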
&lt;br /&gt;
===Remaining to do===&lt;br /&gt;
&lt;br /&gt;
===Recoding Founders Experience===&lt;br /&gt;
&lt;br /&gt;
I've begun recoding the Founders variable. I can categorize 232 of the 824 listed job titles (there are 466 distinct job titles) into 9 categories (the most frequent results as given in a PivotTable found in the sheet).&lt;br /&gt;
&lt;br /&gt;
We will need to talk about how to categorize the most extraneous job titles.&lt;br /&gt;
&lt;br /&gt;
==Recent Work==&lt;br /&gt;
&lt;br /&gt;
Here's a project update on the work that has been done since coming to McNair. The most recent file is&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
===Merging Cohort Companies with Crunchbase Info===&lt;br /&gt;
&lt;br /&gt;
More information on this part of the project can be found on the page [[Merging Existing Data with Crunchbase]].&lt;br /&gt;
&lt;br /&gt;
The newest updated sheet of cohort company info is under the '''Cohorts Final''' sheet of '''The File to Rule Them All.xlsx'''.&lt;br /&gt;
&lt;br /&gt;
Working with [[Maxine Tao]], we have matched companies to their respective pages and information found in Crunchbase (via UUID). We ensured single matches by doing a 1-1-1-1 match with our data and with Crunchbase (using the Matcher). We then received additional information on 8092 companies. The following new information (on top of what we already had) is included in the sheet:&lt;br /&gt;
*short_description&lt;br /&gt;
*long_description&lt;br /&gt;
*category_list (details the company's category)&lt;br /&gt;
*category_group_list (a less refined, more all-encompassing category classification)&lt;br /&gt;
*founded_on date&lt;br /&gt;
*employee_count&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*address&lt;br /&gt;
&lt;br /&gt;
And the following information was also pulled from Crunchbase and '''merged''' with our existing data: &lt;br /&gt;
*URL (was merged with courl cells) &lt;br /&gt;
*city (was merged with colocation)&lt;br /&gt;
*state_code (was merged with colocation)&lt;br /&gt;
*country_code (was merged with colocation)&lt;br /&gt;
*status (was merged with costatus)&lt;br /&gt;
&lt;br /&gt;
===The Equity Variables===&lt;br /&gt;
&lt;br /&gt;
[[Maxine Tao]] and I have added six new variables to the '''Accelerators Final''' sheet. Those variables are:&lt;br /&gt;
&lt;br /&gt;
*Terms of joining - terms of joining accelerator and important details about program&lt;br /&gt;
*Equity? (1/0) - cells contain a 1 if the accelerator takes equity, a 0 if an accelerator definitively does not, and are blank if we could not find that information&lt;br /&gt;
*Equity Amount - the % of equity the accelerator will take (can sometimes be a range (e.g. 5-7%))&lt;br /&gt;
*Equity Amount Normalized - this copies the previous column but only keeps %&amp;gt;0, and if a range was given (e.g. 5-7%) it returns the average (e.g. 6%)&lt;br /&gt;
*Investment Amount - the $ the accelerator invests in a company to begin, if relevant (also could be a range or a &amp;quot;up to $######&amp;quot;)&lt;br /&gt;
*Investment Notes - anything to comment on previous 4 columns&lt;br /&gt;
&lt;br /&gt;
These six variables tell us more about the characteristics of accelerators; specifically, which ones take equity and which ones do not, and how much equity accelerators take.&lt;br /&gt;
&lt;br /&gt;
Relevant information: &lt;br /&gt;
*82 accelerators take equity, 42 do not, and we lack information for 37.&lt;br /&gt;
*The average % of equity among accelerators who take equity is 6.49% (rough estimate; do not use for anything official). We got this number by looking only at accelerators who take equity, replacing each reported range with its average (e.g. 4%-10% equity would be coded as 7% equity), and taking the mean.&lt;br /&gt;
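&lt;br /&gt;
The Equity Amount Normalized rule described above (keep only positive percentages, replace a range with its average) can be sketched in Python as follows. This is a minimal illustration, not the formula used in the sheet; the function name and the regular expression are assumptions.&lt;br /&gt;

```python
import re


def normalize_equity(raw):
    """Return equity as one percentage, or None if blank or not positive."""
    if raw is None:
        return None
    # Pull out every number in the cell, e.g. "5-7%" gives [5.0, 7.0].
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", raw)]
    if not numbers:
        return None
    # A single value is its own midpoint; a range is averaged.
    midpoint = (min(numbers) + max(numbers)) / 2
    return midpoint if midpoint > 0 else None
```

With this rule, "5-7%" becomes 6.0 and "4%-10%" becomes 7.0, matching the examples given on this page.&lt;br /&gt;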
&lt;br /&gt;
===Matching Accelerators to UUIDs via Crunchbase===&lt;br /&gt;
&lt;br /&gt;
We've also added UUIDs for 163 of our 166 accelerators. The UUIDs can be found in Column AE of the '''Accelerators Final''' sheet.&lt;br /&gt;
&lt;br /&gt;
The file with accelerators matched to Crunchbase UUIDs can be found at:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Accelerators and UUIDs.xlsx&lt;br /&gt;
&lt;br /&gt;
This is the master file and should never be modified unless we find a UUID changed. ALL OTHER SHEETS with UUIDs are linked to this sheet so its changes will be reflected elsewhere.&lt;br /&gt;
&lt;br /&gt;
More information can be found on the [[Crunchbase Data]] page. &lt;br /&gt;
&lt;br /&gt;
===Linking Accelerators to Founders/LinkedIn Crawling===&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] got the [[LinkedIn Crawler (Python)]] to work, which means we currently have the following information about accelerator founders:&lt;br /&gt;
*Current Job Title&lt;br /&gt;
*Location&lt;br /&gt;
*Employer&lt;br /&gt;
*Job(s) Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location of jobs&lt;br /&gt;
*Extra Description&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
This information can be found in the various '''Founders''' sheets in '''The File to Rule Them All'''.&lt;br /&gt;
&lt;br /&gt;
===Finding Company URLs===&lt;br /&gt;
Excel master datasets are in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018&lt;br /&gt;
&lt;br /&gt;
Code and files specific to this URL finder are in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\url finder&lt;br /&gt;
&lt;br /&gt;
====Results====&lt;br /&gt;
I used STEP1_crawl.py and STEP2_findcorrecturl.py to add approximately 1000 more URLs into 'The File to Rule Them All.xlsx'.&lt;br /&gt;
&lt;br /&gt;
====Testing====&lt;br /&gt;
&lt;br /&gt;
In this file (sheet: 'Most Recent Merged Data' note that this is just a copy of 'Cohorts Final' in 'The File to Rule Them All'):&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\Merged W Crunchbase Data as of July 17.xlx&lt;br /&gt;
&lt;br /&gt;
We filter for companies (~4000) that did not receive VC, are not in Crunchbase, and do not have URLs.&lt;br /&gt;
Using a Google crawler (STEP1_crawl.py) and a URL matching script (STEP2_findcorrecturl.py), we will try to find as many URLs as possible.&lt;br /&gt;
&lt;br /&gt;
To test, I ran about 40 companies from &amp;quot;smallcompanylist.txt&amp;quot;, using only the company name as a search term and taking the first 4 valid results (see the DONT_COLLECT list in the code). The Google crawler and URL matcher were able to correctly identify around 20 URLs. The matcher also misidentifies some URLs that look really similar to the company name, but it is accurate for the most part if the name is not too generic. I then ran the 20 unfound company names through the crawler again, this time using company name + startup as the search term. This did not identify any more correct URLs.&lt;br /&gt;
&lt;br /&gt;
It seems reasonable to assume that if the company URL cannot be found within the first 4 valid search results, then that company probably does not have a URL at all. This is the case for many of the unfound 20 URLs from my test run above.&lt;br /&gt;
&lt;br /&gt;
====Actual Run Info====&lt;br /&gt;
&lt;br /&gt;
The companies we needed to find URLs for are in a file called 'ACTUALNEEDEDCOMPANIES.txt'.&lt;br /&gt;
&lt;br /&gt;
The first four results for every company, as found by STEP1_crawl.py, are in 'ACTUAL_crawled_company_urls.txt'.&lt;br /&gt;
&lt;br /&gt;
The results after the matching done by STEP2_findcorrecturl.py, are in 'ACTUAL_finalurls.txt'.&lt;br /&gt;
&lt;br /&gt;
Note that in the end, I decided to only take URLs that were given a match score of greater than 0.9 by setting this restriction in STEP2_findcorrecturl.py. Then I manually removed any duplicates/inaccurate results. If you want, you can set the threshold lower in STEP2 and use STEP3_clean.py to find the URL with the highest score for each company. &lt;br /&gt;
&lt;br /&gt;
The point of this URL finder is to find timing information for companies. Timing information can be found on Whois. See the page http://mcnair.bakerinstitute.org/wiki/Whois_Parser#Summer_2018_Work for information on running the whois parser. &lt;br /&gt;
&lt;br /&gt;
====Using Python files====&lt;br /&gt;
'''To use STEP1_crawl.py''': &lt;br /&gt;
 INPUT: a list of company names (or anything) you would like to find websites for by searching on google&lt;br /&gt;
 OUTPUT: a list of company names and the top X number of results from google &lt;br /&gt;
&lt;br /&gt;
1. Change LIST_FILEPATH in line 26 to be the name of the file that contains the list of things you would like to search. &lt;br /&gt;
&lt;br /&gt;
2. Change NUMRESULT to be however many results you would like from Google. &lt;br /&gt;
&lt;br /&gt;
3. Adjust DONT_COLLECT to include any websites that you don't want. &lt;br /&gt;
&lt;br /&gt;
4. If you would like to add another search keyword, add this in line 87 which is queries.append(name + &amp;quot;whatever you want here&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
5. Change line 127 to be the name of your output file.&lt;br /&gt;
&lt;br /&gt;
'''To use STEP2_findcorrecturl.py''':&lt;br /&gt;
 INPUT: output file from STEP1&lt;br /&gt;
 OUTPUT: a file formatted the same as the output of STEP1, but URLs that do not match over the threshold value you set will be replaced with &amp;quot;no match&amp;quot; &lt;br /&gt;
&lt;br /&gt;
1. Change file f to be the output file name from STEP1. Change g to be the desired name of the output file for this part. &lt;br /&gt;
&lt;br /&gt;
2. In the if statement on line 44, set your desired threshold. Note that anything greater than 0.6 is generally considered a decent match. It might be safer to use 0.75 and my use of 0.9 ensures almost exact matches. However, you should consider that if your list of companies (or whatever you are searching) includes really common names, then a 90%+ match might not be exactly what you're looking for. &lt;br /&gt;
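&lt;br /&gt;
STEP2_findcorrecturl.py is not reproduced on this page, so the following is only a sketch of how a threshold check like the one in step 2 could work. It assumes the match score is a difflib SequenceMatcher ratio between the cleaned company name and the first label of the URL's domain; that scoring choice and both helper names are assumptions, not the script's actual code.&lt;br /&gt;

```python
from difflib import SequenceMatcher
from urllib.parse import urlparse


def match_score(company, url):
    """Score how closely a URL's domain name resembles a company name."""
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    # Compare against the first label only, e.g. "acme" from "acme.com".
    domain = host.split(".")[0]
    name = "".join(ch for ch in company.lower() if ch.isalnum())
    return SequenceMatcher(None, name, domain).ratio()


def filter_urls(company, urls, threshold=0.9):
    """Replace URLs scoring at or below the threshold with "no match"."""
    return [url if match_score(company, url) > threshold else "no match"
            for url in urls]
```

Lowering the threshold argument to 0.75 or 0.6 loosens the filter in the same way as editing the if statement described above.&lt;br /&gt;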
&lt;br /&gt;
'''To use STEP3_clean.py''':&lt;br /&gt;
&lt;br /&gt;
Note this is an optional step to use depending on the accuracy level you need and what kind of data you crawled earlier. I chose not to use this and instead set a more restrictive threshold in STEP2. &lt;br /&gt;
&lt;br /&gt;
1. Change file f to be the output file from STEP2 (you should delete anything that says &amp;quot;no match&amp;quot;, and when you use STEP2, you must also write the ratio score to the text file). Change g to be the desired name of the output file for this part. &lt;br /&gt;
&lt;br /&gt;
Your output should be a text file containing the company name and the URL that had the highest assigned score in STEP2. In case of more than 1 URL with the highest score, the script should take the first one.&lt;br /&gt;
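&lt;br /&gt;
The selection rule described for STEP3 (highest score per company, first one wins on a tie) can be sketched as below. The (company, url, score) row layout is an assumption about how the STEP2 output would be read in; STEP3_clean.py itself may be structured differently.&lt;br /&gt;

```python
def best_urls(scored_rows):
    """Keep the highest-scoring URL per company; first one wins on ties."""
    best = {}
    for company, url, score in scored_rows:
        # Strict comparison keeps the earlier row when scores are equal.
        if company not in best or score > best[company][1]:
            best[company] = (url, score)
    return {company: url for company, (url, _) in best.items()}
```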
&lt;br /&gt;
==An Overview==&lt;br /&gt;
This project will be used to determine which accelerators are the most effective at churning out successful startups, as well as what characteristics are exhibited by these accelerators. First, we need to gather as much data as we can about as many accelerators as we can in order to look at factors that differentiate successful vs. unsuccessful ventures. Next, we need to create a web crawling program which will gather information about accelerators across the world by accessing their websites and extracting information. I believe that our overall goal with this research project is to gain insight into the methods of successful accelerators, as well as to find out what exactly differentiates very successful accelerators from dead accelerators.&lt;br /&gt;
&lt;br /&gt;
Helpful Links: http://seedrankings.com/&lt;br /&gt;
&lt;br /&gt;
==Remaining To Dos==&lt;br /&gt;
&lt;br /&gt;
The last update on [[Accelerator Seed List (Data)]] said the following needed to be done: &lt;br /&gt;
&lt;br /&gt;
*Cross-reference sheet with data from Peter's old accelerator consolidation file (&amp;quot;accelerator_data_noflag&amp;quot; and &amp;quot;accelerator_data&amp;quot; in &amp;quot;All Relevant Files&amp;quot;) and fill in missing data&lt;br /&gt;
*Variables that are 100% NOT in these 2 files:&lt;br /&gt;
**Cohort Breakout?&lt;br /&gt;
**Subtype&lt;br /&gt;
**Designed for Students?&lt;br /&gt;
**Campuses&lt;br /&gt;
**Stage&lt;br /&gt;
**Software Tech&lt;br /&gt;
**What stage do they look for?&lt;br /&gt;
&lt;br /&gt;
==(Outdated) Necessary Steps==&lt;br /&gt;
&lt;br /&gt;
'''Acquiring the necessary data to complete the Accelerator Master Variable List and the Cohort List will require the following (not necessarily in this order):'''&lt;br /&gt;
&lt;br /&gt;
===Step Zero: Connect to Crunchbase and Link Data===&lt;br /&gt;
&lt;br /&gt;
Complete - more info: [[Crunchbase Data]]&lt;br /&gt;
&lt;br /&gt;
===Step One: LinkedIn Founders Data===&lt;br /&gt;
&lt;br /&gt;
This project will begin by working with [[Grace Tan]] and [[Maxine Tao]] to connect accelerators to their founders and cohort companies using Crunchbase and LinkedIn crawlers. Grace and Maxine will go through [http://crunchbase.com Crunchbase] and find the UUID for companies and their founders (reference [[Crunchbase Data]], [[Crunchbase Accelerator Founders]], [[Crunchbase Accelerator Equity]]). Connect them using SQL and feed the names of founders into our LinkedIn crawler (headed by [[Grace Tan]]).&lt;br /&gt;
&lt;br /&gt;
The list of founders for accelerators can be found at&lt;br /&gt;
 McNair/Projects/Accelerators/Fall 2017/founders_linkedin.txt&lt;br /&gt;
&lt;br /&gt;
The '''Unfound Founders''' file codes a 0 for all companies '''''not listed''''' within the LinkedIn Founders file, and a 1 for those that do have founders listed.&lt;br /&gt;
&lt;br /&gt;
'''Given the founders' names, we will then be able to use the [[LinkedIn Crawler (Python)]] to find the relevant details of an accelerator founder (education, work experience, etc.) This data on founders will help us solve the horse, jockey, racetrack question to detect what variables affect a startup's success (the accelerator, the founders, the environment/city).'''&lt;br /&gt;
&lt;br /&gt;
===Step Two: Linking Accelerators to Cohorts Using Investments on Crunchbase===&lt;br /&gt;
&lt;br /&gt;
In this step we focus on accelerators who take equity from the companies that engage in their program. We do this to prevent looking at accelerators who may also run funds/invest in various companies but do not take equity. This would provide us misleading results and lead us to believe some companies are in cohorts at accelerators that they are really not. &lt;br /&gt;
&lt;br /&gt;
Maxine will acquire the list of accelerators who take equity from companies from the following sheet:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/All Relevant Files/accelerator_data_noflag.txt&lt;br /&gt;
&lt;br /&gt;
Looking at the file, however, shows that very few are actually categorized well and the equity variable is messy. Moving forward, we need to check/refine/fix this classification.&lt;br /&gt;
&lt;br /&gt;
This file has 266 rows. The most recent, actual version of our accelerator database (found at E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx under the sheet Master Variable List) only has 167 rows, meaning the accelerator_data_noflag.txt file has too many rows. &lt;br /&gt;
&lt;br /&gt;
We will need to do a left join of Accelerator Master List with accelerator_data_noflag.txt to get rid of the accelerator names that are in accelerator_data_noflag but NOT in Accelerator Master List.&lt;br /&gt;
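&lt;br /&gt;
The left join described above could be done in SQL; the same idea is sketched here in plain Python. The dictionary keys name and equity are hypothetical column names, and matching on a lowercased, trimmed name is an assumption about how the two lists line up.&lt;br /&gt;

```python
def left_join_equity(master_names, noflag_rows):
    """Attach equity info to each master accelerator; None when absent."""
    # Index the larger file by a normalized name for lookup.
    equity_by_name = {row["name"].strip().lower(): row.get("equity")
                      for row in noflag_rows}
    # Every master row is kept; rows only in noflag_rows drop out.
    return [(name, equity_by_name.get(name.strip().lower()))
            for name in master_names]
```

Accelerators that come back with None are the ones whose equity classification still has to be filled in by hand.&lt;br /&gt;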
&lt;br /&gt;
Once this is finished, we should have an “Equity” classification variable for every accelerator in Accelerator Master List. The accelerators that have a Y (or maybe it’s a 1) are the ones that take equity. These are the accelerators we’ll be able to do the Crunchbase work on to see when accelerators take equity.&lt;br /&gt;
&lt;br /&gt;
We then look at the accelerators' investments (or companies and the entities which invested in them), cross-reference the list of companies/accelerators, and once we find a match, we know that a company went through an accelerator and during which year they went through a cohort.&lt;br /&gt;
&lt;br /&gt;
'''From this, we get the following data:'''&lt;br /&gt;
*Accelerator a given company went through&lt;br /&gt;
*Year said company went through a cohort/Specific cohort company went through&lt;br /&gt;
&lt;br /&gt;
===Step Three: Demo Day Crawler===&lt;br /&gt;
&lt;br /&gt;
This part of the project relies on the contributions of the wonderful [[Minh Le]]. Better documentation for the project can be found on the [[Demo Day Page Parser]], [[Demo Day Page Google Classifier]], and [[Accelerator Demo Day]] project pages.&lt;br /&gt;
&lt;br /&gt;
Essentially, this part of the accelerator data project will use the [[Demo Day Page Parser]] to look through accelerator websites for pages which list a cohort's 'Demo Day', or the day in which accelerators present their companies to a group of special investors ([https://www.ycombinator.com/demoday/faq/ here's an example FAQ page from Y Combinator]). The [[Demo Day Page Google Classifier]] will then determine if the page is, in fact, a demo day page. &lt;br /&gt;
&lt;br /&gt;
'''Given a cohort's demo day, we can gather a few pieces of key information (check with Ed to make sure this is the correct information to gather from Demo Days):'''&lt;br /&gt;
*The date a cohort began/the season the cohort went through the accelerator&lt;br /&gt;
**This is acquired by looking at the cohort's demo day date, and subtracting the number of weeks/months of a cohort for that given accelerator. The length of a cohort can be found in the file:&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx&lt;br /&gt;
*I assume we can also acquire the companies in a specific cohort, as in we will have the list of all companies in the (for example) Spring 2018 Cohort of Techstars.&lt;br /&gt;
&lt;br /&gt;
===Step Four: Non-profit Finder===&lt;br /&gt;
&lt;br /&gt;
More at [[Non-profit Finder]]&lt;br /&gt;
&lt;br /&gt;
Another important step in this project is finding which accelerators are non-profits.&lt;br /&gt;
&lt;br /&gt;
A comprehensive list of nonprofits taken from the [https://www.irs.gov/charities-non-profits/tax-exempt-organization-search-bulk-data-downloads IRS] can be found here:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Nonprofits in US.xlsx&lt;br /&gt;
Warning: this file has 1 million rows&lt;br /&gt;
&lt;br /&gt;
This file should be cross-referenced with the list of accelerators to find which ones are listed--those accelerators are non-profits. We'll need someone good in SQL to do this with a match.&lt;br /&gt;
&lt;br /&gt;
Potential problem:&lt;br /&gt;
*The names of listed non-profits will likely be somewhat different than the names of accelerators, because companies often file for tax exemption with different names than they show the public.&lt;br /&gt;
&lt;br /&gt;
==Workflow Image==&lt;br /&gt;
&lt;br /&gt;
(the color coding for this image is a very rough and preliminary estimate subject to change) I imagine the project will look something like this, in that it will require the following information to fully complete the project:&lt;br /&gt;
&lt;br /&gt;
[[File:image1.png]]&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23856</id>
		<title>U.S. Seed Accelerators</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23856"/>
		<updated>2018-07-31T15:59:16Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: /* Update for Hira */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=U.S. Seed Accelerators&lt;br /&gt;
|Has owner=Connor Rothschild,&lt;br /&gt;
|Has start date=06/18/2018&lt;br /&gt;
|Has keywords=accelerators, data&lt;br /&gt;
|Has notes=Continuation of [[Accelerator Data]]&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Is dependent on=Industry Classifier, Demo Day Page Parser, Accelerator Demo Day, Demo Day Page Google Classifier, Crunchbase Accelerator Equity, Crunchbase Accelerator Founders, Crunchbase Data,&lt;br /&gt;
|Has image=Seed-Accelerator.jpg&lt;br /&gt;
|Does subsume=Accelerator Data, Accelerator Seed List (Data),&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Project Location==&lt;br /&gt;
The master file can be found at &lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
==Relevant Former Projects==&lt;br /&gt;
This page serves as an updated and tidied version of the data and work presented on the [[Accelerator Seed List (Data)]] Project, which subsumed [[Accelerator Data]].&lt;br /&gt;
Both of these projects (and as a corollary, this project) are dependent on the [[Demo Day Page Parser]], [[Industry Classifier]], and the [[Whois Parser]].&lt;br /&gt;
&lt;br /&gt;
==Update for Hira==&lt;br /&gt;
&lt;br /&gt;
===Dead Accelerators===&lt;br /&gt;
&lt;br /&gt;
We have updated dead accelerators on the following Google Sheet&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1_mZ8QgEXwSoTeyVbiEg2ZfoQukvCHfr0NKSk2QMGYnI/edit?usp=sharing&lt;br /&gt;
and have merged this into The File To Rule Them All.&lt;br /&gt;
&lt;br /&gt;
===Recoding Equity===&lt;br /&gt;
&lt;br /&gt;
The Google Sheet with this work is here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1xFlFR1OAoHY4XgesB8ZAugL99DgT0OehxIEzcmYuMB8/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
It has also been merged into The File To Rule Them All.&lt;br /&gt;
&lt;br /&gt;
I have updated equity from data from https://www.seed-db.com/accelerators.&lt;br /&gt;
&lt;br /&gt;
I have also updated the columns with a new normalized version of investment. Within the File To Rule Them All, you will find two normalizations of investment:&lt;br /&gt;
*the midpoint normalization, used by taking an average of the accelerators' investment ranges.&lt;br /&gt;
*the upper bound normalization, used by taking an average of the highest amount accelerators will invest.&lt;br /&gt;
&lt;br /&gt;
This dual normalization was performed because many accelerators say they invest &amp;quot;Up to $__,___&amp;quot; so a midpoint may not accurately reflect actual investment amounts.&lt;br /&gt;
&lt;br /&gt;
The average investment when using the midpoint is $40,164 and the average investment when using the upper limit is $48,313.&lt;br /&gt;
&lt;br /&gt;
'''NOTE: There may be one outlier to control for, as Boost VC says they offer **between $50,000 and $500,000**. This is a huge range and the upper limit of $500,000 may throw off our analysis.&lt;br /&gt;
&lt;br /&gt;
By removing Boost VC's investment amounts, the average using midpoint drops to $37,555 (~$3,000 less) and the average using upper limits drops to $43,293 (~$5,000 less). The distance between the two averages drops from ~$8,000 to ~$6,000. We should consider/discuss removing or controlling for Boost VC.'''&lt;br /&gt;
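&lt;br /&gt;
To make the two normalizations and the outlier check concrete, here is a small Python sketch. Only the Boost VC range comes from this page; the other two rows are made-up placeholders, and the function name is hypothetical. The real averages were computed in The File To Rule Them All, not with this code.&lt;br /&gt;

```python
from statistics import mean

# (low, high) investment ranges in dollars. Only the Boost VC range comes
# from the page; the other two rows are made-up placeholders.
ranges = {
    "Boost VC": (50_000, 500_000),
    "Example Accelerator A": (20_000, 20_000),
    "Example Accelerator B": (25_000, 50_000),
}


def averages(rows, exclude=()):
    """Return (mean of midpoints, mean of upper bounds) over rows."""
    kept = [r for name, r in rows.items() if name not in exclude]
    midpoints = [(low + high) / 2 for low, high in kept]
    uppers = [high for _, high in kept]
    return mean(midpoints), mean(uppers)
```

Calling averages(ranges) and then averages(ranges, exclude=("Boost VC",)) shows how much a single wide range moves both averages.&lt;br /&gt;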
&lt;br /&gt;
===Amazon Mechanical Turk Pricing===&lt;br /&gt;
Information can be found here [http://mcnair.bakerinstitute.org/wiki/Accelerator_Demo_Day#Pricing]&lt;br /&gt;
&lt;br /&gt;
===Recoded Founders Education===&lt;br /&gt;
&lt;br /&gt;
I have recoded two components of the founders' education sheet:&lt;br /&gt;
&lt;br /&gt;
1) Degree name has been reclassified into nine categories:&lt;br /&gt;
*High School&lt;br /&gt;
*Associates&lt;br /&gt;
*Bachelors&lt;br /&gt;
*Masters&lt;br /&gt;
*Certificate&lt;br /&gt;
*JD&lt;br /&gt;
*MBA&lt;br /&gt;
*PhD&lt;br /&gt;
*Other&lt;br /&gt;
&lt;br /&gt;
2) Majors have also been recoded into nine categories:&lt;br /&gt;
*H = Humanities&lt;br /&gt;
*SS = Social Sciences&lt;br /&gt;
*NS = Natural Sciences&lt;br /&gt;
*E = Engineering (includes computer science)&lt;br /&gt;
*B = Business and Economics&lt;br /&gt;
*L = Leadership&lt;br /&gt;
*MBA&lt;br /&gt;
*JD&lt;br /&gt;
*O = Other&lt;br /&gt;
&lt;br /&gt;
The Google Sheet I used to reclassify can be found here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1XWtCTeaof8WxAuOCbn3XEFaK0sh-ZIif2JqcdbkH72I/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
In that file, the sheet '''OLD''' contains old, outdated data, '''WORK''' contains the process of reclassifying, and '''Updated Info''' contains only good, updated data. &lt;br /&gt;
&lt;br /&gt;
The data has been merged into '''The File to Rule Them All'''. Old data has not yet been deleted but can be at any time, considering we have it saved in the Google Sheet '''OLD''' (just wanted the go ahead from Hira or Ed).&lt;br /&gt;
&lt;br /&gt;
===Multiple campuses and cohorts===&lt;br /&gt;
&lt;br /&gt;
The File to Rule Them All contains an updated address variable. 139 of our 166 accelerators have addresses that are available online. The ones we could not find information for are left blank.&lt;br /&gt;
&lt;br /&gt;
The collaborative sheet that Hira, Maxine, and I worked on to update this list can be accessed here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1nktgJZfm3L8IsSCHgYbPasSdvKb7QHKxZp8K5Si2iNo/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
We have also added a new cohort list. Under The File To Rule Them All, the sheet '''Multiple Campuses''' contains the different locations for accelerators with multiple campuses. Column B can be used to filter out multiple location accelerators.&lt;br /&gt;
&lt;br /&gt;
===Fixed Google Sheet===&lt;br /&gt;
&lt;br /&gt;
We created a new sheet with only data we want to keep, and cleaned it up. That sheet is called &amp;quot;Good Data Only&amp;quot;, at the same link:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
I first used our &amp;quot;recap&amp;quot; and &amp;quot;announced&amp;quot; classification to standardize and fix the dates.&lt;br /&gt;
&lt;br /&gt;
*Columns N-R contain our new data. Please note that all of these columns are based on formulas and will be made erroneous if edited.&lt;br /&gt;
*Column N is the # of weeks for an accelerator program, gathered via VLookup from The File to Rule Them All.&lt;br /&gt;
*Column O is the Actual Date we want to record, and was gathered by subtracting the # of weeks from a date '''if''' the listed page was a '''recap'''.&lt;br /&gt;
*Columns P and Q are the Month and Years stripped from Column O.&lt;br /&gt;
*Finally, Column R is the season variable, as Ed said it should be coded. &lt;br /&gt;
&lt;br /&gt;
We have also gone through and removed all bad data, all duplicates, and all rows without timing info. This is the most complete list possible.&lt;br /&gt;
&lt;br /&gt;
===Recode employee count===&lt;br /&gt;
&lt;br /&gt;
I have added a new column in Cohorts Final (of the File to Rule Them All) yet left the old column in case you would prefer to edit/classify differently.&lt;br /&gt;
Column AB (emp_count_scale) contains a variable coded on a scale of 1 to 9, with each number corresponding to one of the employee_count classifications (1 the lowest, 9 the highest). The exact output can be modified (1 could instead be tiny, 2 be small... 9 be huge).&lt;br /&gt;
&lt;br /&gt;
The employee count column is standardized and can easily be edited given some modification of the Excel formula.&lt;br /&gt;
&lt;br /&gt;
===Normalized investment amount===&lt;br /&gt;
&lt;br /&gt;
I've been trying to fix the investment amount, but I think it's smart that we discuss this before I move forward. I've done some tentative standardization (finding the average in a range if given), but so many accelerators &amp;quot;take up to __%&amp;quot; equity and &amp;quot;invest up to $___&amp;quot; that I think it's smartest we think hard about how to standardize first. Also notable is that some accelerators say they provide &amp;quot;$____ up front and another $___ in follow up funding for each stage.&amp;quot; How do we deal with these? Message me if you'd like to talk more about this.&lt;br /&gt;
&lt;br /&gt;
I refrained from creating max and min investment columns to avoid spending time on the data before we discuss it.&lt;br /&gt;
&lt;br /&gt;
===Remaining to do===&lt;br /&gt;
&lt;br /&gt;
===Recoding Founders Experience===&lt;br /&gt;
&lt;br /&gt;
I've begun recoding the Founders variable. I can categorize 232 of the 824 listed job titles (there are 466 distinct job titles) into 9 categories (the most frequent results as given in a PivotTable found in the sheet).&lt;br /&gt;
&lt;br /&gt;
We will need to talk about how to categorize the most extraneous job titles.&lt;br /&gt;
&lt;br /&gt;
==Recent Work==&lt;br /&gt;
&lt;br /&gt;
Here's a project update on the work that has been done since coming to McNair. The most recent file is&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
===Merging Cohort Companies with Crunchbase Info===&lt;br /&gt;
&lt;br /&gt;
More information on this part of the project can be found on the page [[Merging Existing Data with Crunchbase]].&lt;br /&gt;
&lt;br /&gt;
The newest updated sheet of cohort company info is under the '''Cohorts Final''' sheet of '''The File to Rule Them All.xlsx'''.&lt;br /&gt;
&lt;br /&gt;
Working with [[Maxine Tao]], we have matched companies to their respective pages and information found in Crunchbase (via UUID). We ensured single matches by doing a 1-1-1-1 match with our data and with Crunchbase (using the Matcher). We then received additional information on 8092 companies. The following new information (on top of what we already had) is included in the sheet:&lt;br /&gt;
*short_description&lt;br /&gt;
*long_description&lt;br /&gt;
*category_list (details the company's category)&lt;br /&gt;
*category_group_list (a less refined, more all-encompassing category classification)&lt;br /&gt;
*founded_on date&lt;br /&gt;
*employee_count&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*address&lt;br /&gt;
&lt;br /&gt;
And the following information was also pulled from Crunchbase and '''merged''' with our existing data: &lt;br /&gt;
*URL (was merged with courl cells) &lt;br /&gt;
*city (was merged with colocation)&lt;br /&gt;
*state_code (was merged with colocation)&lt;br /&gt;
*country_code (was merged with colocation)&lt;br /&gt;
*status (was merged with costatus)&lt;br /&gt;
&lt;br /&gt;
===The Equity Variables===&lt;br /&gt;
&lt;br /&gt;
[[Maxine Tao]] and I have added six new variables to the '''Accelerators Final''' sheet. Those variables are:&lt;br /&gt;
&lt;br /&gt;
*Terms of joining - terms of joining accelerator and important details about program&lt;br /&gt;
*Equity? (1/0) - cells contain a 1 if the accelerator takes equity, a 0 if the accelerator definitively does not, and are blank if we could not find that information&lt;br /&gt;
*Equity Amount - the % of equity the accelerator will take (can sometimes be a range, e.g. 5-7%)&lt;br /&gt;
*Equity Amount Normalized - this copies the previous column but only keeps %&amp;gt;0, and if a range was given (e.g. 5-7%) it returns the average (e.g. 6%)&lt;br /&gt;
*Investment Amount - the $ the accelerator invests in a company to begin, if relevant (could also be a range or an &amp;quot;up to $######&amp;quot;)&lt;br /&gt;
*Investment Notes - any comments on the previous 4 columns&lt;br /&gt;
&lt;br /&gt;
These six variables tell us more about the characteristics of accelerators; specifically, which ones take equity and which ones do not, and how much equity accelerators take.&lt;br /&gt;
&lt;br /&gt;
Relevant information: &lt;br /&gt;
*82 accelerators take equity, 42 do not, and we lack information for 37.&lt;br /&gt;
*The average % of equity among accelerators who take equity is 6.49% (rough estimate--do not use for anything official). I got this number by looking only at accelerators who take equity, replacing any reported range with its average (e.g. 4%-10% equity would be coded as 7% equity), and taking the mean.&lt;br /&gt;
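&lt;br /&gt;
The range-averaging rule behind Equity Amount Normalized can be sketched roughly as follows. This is only an illustration, not the Excel formula used in the sheet: the function name normalize_equity and the idea of pulling the numbers out of the cell with a regular expression are assumptions.&lt;br /&gt;

```python
import re

def normalize_equity(raw):
    """Return the equity percentage as a float, averaging a range if one is given.

    Hypothetical helper: the sheet itself does this with an Excel formula.
    """
    # Pull every number out of the cell, e.g. "5-7%" gives [5.0, 7.0].
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", raw)]
    # Keep only positive percentages, mirroring the "only keeps % above 0" rule.
    numbers = [n for n in numbers if n > 0]
    if not numbers:
        return None
    # A single value is its own midpoint; a range is averaged.
    return (min(numbers) + max(numbers)) / 2
```

With this sketch, a cell reading 5-7% would be coded as 6.0 and a cell reading 4%-10% as 7.0, which matches the examples given above.&lt;br /&gt;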
&lt;br /&gt;
===Matching Accelerators to UUIDs via Crunchbase===&lt;br /&gt;
&lt;br /&gt;
We've also added UUIDs for 163 of our 166 accelerators. The UUIDs can be found in Column AE of the '''Accelerators Final''' sheet.&lt;br /&gt;
&lt;br /&gt;
The file with accelerators matched to Crunchbase UUIDs can be found at:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Accelerators and UUIDs.xlsx&lt;br /&gt;
&lt;br /&gt;
This is the master file and should never be modified unless we find a UUID changed. ALL OTHER SHEETS with UUIDs are linked to this sheet so its changes will be reflected elsewhere.&lt;br /&gt;
&lt;br /&gt;
More information can be found on the [[Crunchbase Data]] page. &lt;br /&gt;
&lt;br /&gt;
===Linking Accelerators to Founders/LinkedIn Crawling===&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] got the [[LinkedIn Crawler (Python)]] to work, which means we currently have the following information about accelerator founders:&lt;br /&gt;
*Current Job Title&lt;br /&gt;
*Location&lt;br /&gt;
*Employer&lt;br /&gt;
*Job(s) Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location of jobs&lt;br /&gt;
*Extra Description&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
This information can be found in the various '''Founders''' sheets in '''The File to Rule Them All'''.&lt;br /&gt;
&lt;br /&gt;
===Finding Company URLs===&lt;br /&gt;
Excel master datasets are in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018&lt;br /&gt;
&lt;br /&gt;
Code and files specific to this URL finder are in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\url finder&lt;br /&gt;
&lt;br /&gt;
====Results====&lt;br /&gt;
I used STEP1_crawl.py and STEP2_findcorrecturl.py to add approximately 1000 more URLs into 'The File to Rule Them All.xlsx'.&lt;br /&gt;
&lt;br /&gt;
====Testing====&lt;br /&gt;
&lt;br /&gt;
In this file (sheet: 'Most Recent Merged Data' note that this is just a copy of 'Cohorts Final' in 'The File to Rule Them All'):&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\Merged W Crunchbase Data as of July 17.xlx&lt;br /&gt;
&lt;br /&gt;
We filter for companies (~4000) that did not receive VC, are not in Crunchbase, and do not have URLs.&lt;br /&gt;
Using a Google crawler (STEP1_crawl.py) and a URL matching script (STEP2_findcorrecturl.py), we will try to find as many URLs as possible.&lt;br /&gt;
&lt;br /&gt;
To test, I ran about 40 companies from &amp;quot;smallcompanylist.txt&amp;quot;, using only the company name as a search term and taking the first 4 valid results (see the don't-collect list in the code). The Google crawler and URL matcher were able to correctly identify around 20 URLs. They also misidentify some URLs that look very similar to the company name, but they are accurate for the most part if the name is not too generic. I then ran the 20 unfound company names through the crawler again, this time using company name + startup as the search term. This did not identify any more correct URLs.&lt;br /&gt;
&lt;br /&gt;
It seems reasonable to assume that if the company URL cannot be found within the first 4 valid search results, then that company probably does not have a URL at all. This is the case for many of the 20 unfound URLs from my test run above.&lt;br /&gt;
&lt;br /&gt;
====Actual Run Info====&lt;br /&gt;
&lt;br /&gt;
The companies we needed to find URLs for are in a file called 'ACTUALNEEDEDCOMPANIES.txt'.&lt;br /&gt;
&lt;br /&gt;
The first four results for every company, as found by STEP1_crawl.py, are in 'ACTUAL_crawled_company_urls.txt'.&lt;br /&gt;
&lt;br /&gt;
The results after the matching done by STEP2_findcorrecturl.py are in 'ACTUAL_finalurls.txt'.&lt;br /&gt;
&lt;br /&gt;
Note that in the end, I decided to only take URLs that were given a match score of greater than 0.9 by setting this restriction in STEP2_findcorrecturl.py. Then I manually removed any duplicates/inaccurate results. If you want, you can set the threshold lower in STEP2 and use STEP3_clean.py to find the URL with the highest score for each company. &lt;br /&gt;
&lt;br /&gt;
The point of this URL finder is to find timing information for companies. Timing information can be found on Whois. See the page http://mcnair.bakerinstitute.org/wiki/Whois_Parser#Summer_2018_Work for information on running the whois parser. &lt;br /&gt;
&lt;br /&gt;
====Using Python files====&lt;br /&gt;
'''To use STEP1_crawl.py''': &lt;br /&gt;
 INPUT: a list of company names (or anything) you would like to find websites for by searching on Google&lt;br /&gt;
 OUTPUT: a list of company names and the top X number of results from Google &lt;br /&gt;
&lt;br /&gt;
1. Change LIST_FILEPATH in line 26 to be the name of the file that contains the list of things you would like to search. &lt;br /&gt;
&lt;br /&gt;
2. Change NUMRESULT to be however many results you would like from Google. &lt;br /&gt;
&lt;br /&gt;
3. Adjust DONT_COLLECT to include any websites that you don't want. &lt;br /&gt;
&lt;br /&gt;
4. If you would like to add another search keyword, add this in line 87 which is queries.append(name + &amp;quot;whatever you want here&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
5. Change line 127 to be the name of your output file.&lt;br /&gt;
&lt;br /&gt;
'''To use STEP2_findcorrecturl.py''':&lt;br /&gt;
 INPUT: output file from STEP1&lt;br /&gt;
 OUTPUT: a file formatted the same as the output of STEP1, but URLs that do not match over the threshold value you set will be replaced with &amp;quot;no match&amp;quot; &lt;br /&gt;
&lt;br /&gt;
1. Change file f to be the output file name from STEP1. Change g to be the desired name of the output file for this part. &lt;br /&gt;
&lt;br /&gt;
2. In the if statement on line 44, set your desired threshold. Note that anything greater than 0.6 is generally considered a decent match. It might be safer to use 0.75; my use of 0.9 ensures almost exact matches. However, if your list of companies (or whatever you are searching) includes really common names, then even a 90%+ match might not be exactly what you're looking for. &lt;br /&gt;
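&lt;br /&gt;
For readers who do not have the script open, here is a minimal sketch of what threshold matching of this kind can look like. It assumes the score is a difflib-style similarity ratio between the company name and the first part of the URL's host; STEP2_findcorrecturl.py may compute its score differently, and the names match_score and filter_urls are hypothetical.&lt;br /&gt;

```python
from difflib import SequenceMatcher
from urllib.parse import urlparse

def match_score(company, url):
    """Similarity between a company name and the URL's host, from 0.0 to 1.0."""
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    # Compare against the first label of the host, e.g. "acmerobotics" in "acmerobotics.com".
    label = host.split(".")[0]
    # Drop spaces and punctuation from the company name before comparing.
    name = "".join(ch for ch in company.lower() if ch.isalnum())
    return SequenceMatcher(None, name, label).ratio()

def filter_urls(company, urls, threshold=0.9):
    """Replace every URL that does not score above the threshold with "no match"."""
    return [u if match_score(company, u) > threshold else "no match" for u in urls]
```

Lowering the threshold argument keeps more candidate URLs at the cost of more false matches, which is the trade-off described in step 2.&lt;br /&gt;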
&lt;br /&gt;
'''To use STEP3_clean.py''':&lt;br /&gt;
&lt;br /&gt;
Note this is an optional step to use depending on the accuracy level you need and what kind of data you crawled earlier. I chose not to use this and instead set a more restrictive threshold in STEP2. &lt;br /&gt;
&lt;br /&gt;
1. Change file f to be the output file from STEP2. Before running, delete any lines that say &amp;quot;no match&amp;quot;, and make sure that STEP2 also writes the ratio score to the text file. Change g to be the desired name of the output file for this part. &lt;br /&gt;
&lt;br /&gt;
Your output should be a text file containing the company name and the URL that had the highest assigned score in STEP2. In case of more than 1 URL with the highest score, the script should take the first one.&lt;br /&gt;
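&lt;br /&gt;
The selection rule described here (highest score per company, first URL wins a tie) can be sketched as below. The row layout of (company, url, score) tuples and the function name best_urls are assumptions made for illustration; the real script reads and writes text files.&lt;br /&gt;

```python
def best_urls(rows):
    """Keep the highest-scoring URL per company; the first one wins a tie.

    rows is a hypothetical list of (company, url, score) tuples in file order.
    """
    best = {}
    for company, url, score in rows:
        # Strictly-greater comparison keeps the earliest URL when scores tie.
        if company not in best or score > best[company][1]:
            best[company] = (url, score)
    return {company: url for company, (url, score) in best.items()}
```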
&lt;br /&gt;
==An Overview==&lt;br /&gt;
This project will be used to determine which accelerators are the most effective at churning out successful startups, as well as what characteristics are exhibited by these accelerators. First, we need to gather as much data as we can about as many accelerators as we can in order to look at factors that differentiate successful vs. unsuccessful ventures. Next, we need to create a web crawling program which will gather information about accelerators across the world by accessing their websites and extracting information. I believe that our overall goal with this research project is to gain insight into the methods of successful accelerators, as well as to find out what exactly differentiates very successful accelerators from dead accelerators.&lt;br /&gt;
&lt;br /&gt;
Helpful Links: http://seedrankings.com/&lt;br /&gt;
&lt;br /&gt;
==Remaining To Dos==&lt;br /&gt;
&lt;br /&gt;
The last update on [[Accelerator Seed List (Data)]] said the following needed to be done: &lt;br /&gt;
&lt;br /&gt;
*Cross-reference sheet with data from Peter's old accelerator consolidation file (&amp;quot;accelerator_data_noflag&amp;quot; and &amp;quot;accelerator_data&amp;quot; in &amp;quot;All Relevant Files&amp;quot;) and fill in missing data&lt;br /&gt;
*Variables that are 100% NOT in these 2 files:&lt;br /&gt;
**Cohort Breakout?&lt;br /&gt;
**Subtype&lt;br /&gt;
**Designed for Students?&lt;br /&gt;
**Campuses&lt;br /&gt;
**Stage&lt;br /&gt;
**Software Tech&lt;br /&gt;
**What stage do they look for?&lt;br /&gt;
&lt;br /&gt;
==(Outdated) Necessary Steps==&lt;br /&gt;
&lt;br /&gt;
'''Acquiring the necessary data to complete the Accelerator Master Variable List and the Cohort List will require the following (not necessarily in this order):'''&lt;br /&gt;
&lt;br /&gt;
===Step Zero: Connect to Crunchbase and Link Data===&lt;br /&gt;
&lt;br /&gt;
Complete - more info: [[Crunchbase Data]]&lt;br /&gt;
&lt;br /&gt;
===Step One: LinkedIn Founders Data===&lt;br /&gt;
&lt;br /&gt;
This project will begin by working with [[Grace Tan]] and [[Maxine Tao]] to connect accelerators to their founders and cohort companies using Crunchbase and LinkedIn crawlers. Grace and Maxine will go through [http://crunchbase.com Crunchbase] and find the UUID for companies and their founders (reference [[Crunchbase Data]], [[Crunchbase Accelerator Founders]], [[Crunchbase Accelerator Equity]]). Connect them using SQL and feed the names of founders into our LinkedIn crawler (headed by [[Grace Tan]]).&lt;br /&gt;
&lt;br /&gt;
The list of founders for accelerators can be found at&lt;br /&gt;
 McNair/Projects/Accelerators/Fall 2017/founders_linkedin.txt&lt;br /&gt;
&lt;br /&gt;
The '''Unfound Founders''' file codes a 0 for all companies '''''not listed''''' within the LinkedIn Founders file, and a 1 for those that do have founders listed.&lt;br /&gt;
&lt;br /&gt;
'''Given the founders' names, we will then be able to use the [[LinkedIn Crawler (Python)]] to find the relevant details of an accelerator founder (education, work experience, etc.) This data on founders will help us solve the horse, jockey, racetrack question to detect what variables affect a startup's success (the accelerator, the founders, the environment/city).'''&lt;br /&gt;
&lt;br /&gt;
===Step Two: Linking Accelerators to Cohorts Using Investments on Crunchbase===&lt;br /&gt;
&lt;br /&gt;
In this step we focus on accelerators who take equity from the companies that engage in their program. We do this to prevent looking at accelerators who may also run funds/invest in various companies but do not take equity. This would provide us misleading results and lead us to believe some companies are in cohorts at accelerators that they are really not. &lt;br /&gt;
&lt;br /&gt;
Maxine will acquire the list of accelerators who take equity from companies from the following sheet:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/All Relevant Files/accelerator_data_noflag.txt&lt;br /&gt;
&lt;br /&gt;
Looking at the file, however, shows that very few are actually categorized well and the equity variable is messy. Moving forward, we need to check/refine/fix this classification.&lt;br /&gt;
&lt;br /&gt;
This file has 266 rows. The most recent, actual version of our accelerator database (found at E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx under the sheet Master Variable List) only has 167 rows, meaning the accelerator_data_noflag.txt file has too many rows. &lt;br /&gt;
&lt;br /&gt;
We will need to do a left join of Accelerator Master List with accelerator_data_noflag.txt to get rid of the accelerator names that are in accelerator_data_noflag but NOT in Accelerator Master List.&lt;br /&gt;
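&lt;br /&gt;
As a rough sketch of that left join (we would more likely do it in SQL or Excel), the idea is to keep exactly one row per accelerator in the master list and attach the equity flag where a name matches. The function name left_join, the (name, equity) row layout, and the case-insensitive name matching are all assumptions.&lt;br /&gt;

```python
def left_join(master_names, noflag_rows):
    """Left join: one output row per master accelerator, equity is None if unmatched.

    noflag_rows is a hypothetical list of (name, equity) pairs from the larger file.
    """
    # Index the larger file by a case-insensitive, whitespace-trimmed name.
    equity_by_name = {name.strip().lower(): equity for name, equity in noflag_rows}
    # Names that appear only in the larger file are dropped automatically.
    return [(name, equity_by_name.get(name.strip().lower())) for name in master_names]
```

Accelerators that come back with None are the ones whose equity classification still needs to be checked by hand.&lt;br /&gt;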
&lt;br /&gt;
Once this is finished, we should have an “Equity” classification variable for every accelerator in Accelerator Master List. The accelerators that have a Y (or maybe it’s a 1) are the ones that take equity. These are the accelerators we’ll be able to do the Crunchbase work on to see when accelerators take equity.&lt;br /&gt;
&lt;br /&gt;
We then look at the accelerators' investments (or at companies and the entities which invested in them) and cross-reference the list of companies/accelerators. Once we find a match, we know that a company went through an accelerator and in which year it went through a cohort.&lt;br /&gt;
&lt;br /&gt;
'''From this, we get the following data:'''&lt;br /&gt;
*Accelerator a given company went through&lt;br /&gt;
*Year said company went through a cohort/Specific cohort company went through&lt;br /&gt;
&lt;br /&gt;
===Step Three: Demo Day Crawler===&lt;br /&gt;
&lt;br /&gt;
This part of the project relies on the contributions of the wonderful [[Minh Le]]. Better documentation for the project can be found on the [[Demo Day Page Parser]], [[Demo Day Page Google Classifier]], and [[Accelerator Demo Day]] project pages.&lt;br /&gt;
&lt;br /&gt;
Essentially, this part of the accelerator data project will use the [[Demo Day Page Parser]] to look through accelerator websites for pages which list a cohort's 'Demo Day', or the day on which accelerators present their companies to a group of special investors ([https://www.ycombinator.com/demoday/faq/ here's an example FAQ page from Y Combinator]). The [[Demo Day Page Google Classifier]] will then determine if the page is, in fact, a demo day page. &lt;br /&gt;
&lt;br /&gt;
'''Given a cohort's demo day, we can gather a few pieces of key information (check with Ed to make sure this is the correct information to gather from Demo Days):'''&lt;br /&gt;
*The date a cohort began/the season the cohort went through the accelerator&lt;br /&gt;
**This is acquired by looking at the cohort's demo day date, and subtracting the number of weeks/months of a cohort for that given accelerator. The length of a cohort can be found in the file:&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx&lt;br /&gt;
*I assume we can also acquire the companies in a specific cohort, as in we will have the list of all companies in the (for example) Spring 2018 Cohort of Techstars.&lt;br /&gt;
&lt;br /&gt;
===Step Four: Non-profit Finder===&lt;br /&gt;
&lt;br /&gt;
More at [[Non-profit Finder]]&lt;br /&gt;
&lt;br /&gt;
Another important step in this project is finding which accelerators are non-profits.&lt;br /&gt;
&lt;br /&gt;
A comprehensive list of nonprofits taken from the [https://www.irs.gov/charities-non-profits/tax-exempt-organization-search-bulk-data-downloads IRS] can be found here:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Nonprofits in US.xlsx&lt;br /&gt;
Warning: this file has 1 million rows&lt;br /&gt;
&lt;br /&gt;
This file should be cross-referenced with the list of accelerators to find which ones are listed--those accelerators are non-profits. We'll need someone good in SQL to do this with a match.&lt;br /&gt;
&lt;br /&gt;
Potential problem:&lt;br /&gt;
*The names of listed non-profits will likely be somewhat different than the names of accelerators, because companies often file for tax exemption with different names than they show the public.&lt;br /&gt;
&lt;br /&gt;
==Workflow Image==&lt;br /&gt;
&lt;br /&gt;
(the color coding for this image is a very rough and preliminary estimate subject to change) I imagine the project will look something like this, in that it will require the following information to fully complete the project:&lt;br /&gt;
&lt;br /&gt;
[[File:image1.png]]&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23855</id>
		<title>U.S. Seed Accelerators</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23855"/>
		<updated>2018-07-31T15:55:01Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: /* Update for Hira */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=U.S. Seed Accelerators&lt;br /&gt;
|Has owner=Connor Rothschild,&lt;br /&gt;
|Has start date=06/18/2018&lt;br /&gt;
|Has keywords=accelerators, data&lt;br /&gt;
|Has notes=Continuation of [[Accelerator Data]]&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Is dependent on=Industry Classifier, Demo Day Page Parser, Accelerator Demo Day, Demo Day Page Google Classifier, Crunchbase Accelerator Equity, Crunchbase Accelerator Founders, Crunchbase Data,&lt;br /&gt;
|Has image=Seed-Accelerator.jpg&lt;br /&gt;
|Does subsume=Accelerator Data, Accelerator Seed List (Data),&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Project Location==&lt;br /&gt;
The master file can be found at &lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
==Relevant Former Projects==&lt;br /&gt;
This page serves as an updated and tidied version of the data and work presented on the [[Accelerator Seed List (Data)]] Project, which subsumed [[Accelerator Data]].&lt;br /&gt;
Both of these projects (and as a corollary, this project) are dependent on the [[Demo Day Page Parser]], [[Industry Classifier]], and the [[Whois Parser]].&lt;br /&gt;
&lt;br /&gt;
==Update for Hira==&lt;br /&gt;
&lt;br /&gt;
===Recoding Equity===&lt;br /&gt;
&lt;br /&gt;
I have updated equity using data from https://www.seed-db.com/accelerators.&lt;br /&gt;
&lt;br /&gt;
I have also updated the columns with a new normalized version of investment. Within the File To Rule Them All, you will find two normalizations of investment:&lt;br /&gt;
*the midpoint normalization, computed by taking the midpoint of each accelerator's investment range.&lt;br /&gt;
*the upper bound normalization, computed by taking the highest amount each accelerator will invest.&lt;br /&gt;
&lt;br /&gt;
This dual normalization was performed because many accelerators say they invest &amp;quot;Up to $__,___&amp;quot; so a midpoint may not accurately reflect actual investment amounts.&lt;br /&gt;
&lt;br /&gt;
The average investment when using the midpoint is $40,164 and the average investment when using the upper limit is $48,313.&lt;br /&gt;
&lt;br /&gt;
'''NOTE: There may be one outlier to control for, as Boost VC says they offer **between $50,000 and $500,000**. This is a huge range and the upper limit of $500,000 may throw off our analysis.&lt;br /&gt;
&lt;br /&gt;
By removing Boost VC's investment amounts, the average using midpoint drops to $37,555 (~$3,000 less) and the average using upper limits drops to $43,293 (~$5,000 less). The distance between the two averages drops from ~$8,000 to ~$6,000. We should consider/discuss removing or controlling for Boost VC.'''&lt;br /&gt;
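&lt;br /&gt;
To make the two normalizations and the outlier check concrete, here is a small sketch. Only the Boost VC range comes from the note above; Accelerator A, Accelerator B, and their dollar amounts are made up. Treating an up-to offer as having no lower bound, so that its midpoint falls back to the upper bound, is also an assumption about how such rows are handled.&lt;br /&gt;

```python
from statistics import mean

def midpoint(low, high):
    """Average of the range; with no lower bound, fall back to the upper bound."""
    return high if low is None else (low + high) / 2

def upper_bound(low, high):
    """The most the accelerator says it will invest."""
    return high

# Hypothetical (low, high) dollar ranges; low is None for "up to $X" offers.
ranges = {
    "Accelerator A": (20000, 40000),
    "Accelerator B": (None, 50000),
    "Boost VC": (50000, 500000),
}

def averages(data):
    """Return (mean midpoint, mean upper bound) across all accelerators."""
    mids = mean(midpoint(lo, hi) for lo, hi in data.values())
    ups = mean(upper_bound(lo, hi) for lo, hi in data.values())
    return mids, ups

with_outlier = averages(ranges)
without_outlier = averages({k: v for k, v in ranges.items() if k != "Boost VC"})
```

Comparing with_outlier against without_outlier shows how far a single wide range can pull both averages, which is the reason to consider controlling for Boost VC.&lt;br /&gt;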
&lt;br /&gt;
===Amazon Mechanical Turk Pricing===&lt;br /&gt;
Information can be found here [http://mcnair.bakerinstitute.org/wiki/Accelerator_Demo_Day#Pricing]&lt;br /&gt;
&lt;br /&gt;
===Recoded Founders Education===&lt;br /&gt;
&lt;br /&gt;
I have recoded two components of the founders' education sheet:&lt;br /&gt;
&lt;br /&gt;
1) Degree name has been reclassified into nine categories:&lt;br /&gt;
*High School&lt;br /&gt;
*Associates&lt;br /&gt;
*Bachelors&lt;br /&gt;
*Masters&lt;br /&gt;
*Certificate&lt;br /&gt;
*JD&lt;br /&gt;
*MBA&lt;br /&gt;
*PhD&lt;br /&gt;
*Other&lt;br /&gt;
&lt;br /&gt;
2) Majors have also been recoded into nine categories:&lt;br /&gt;
*H = Humanities&lt;br /&gt;
*SS = Social Sciences&lt;br /&gt;
*NS = Natural Sciences&lt;br /&gt;
*E = Engineering (includes computer science)&lt;br /&gt;
*B = Business and Economics&lt;br /&gt;
*L = Leadership&lt;br /&gt;
*MBA&lt;br /&gt;
*JD&lt;br /&gt;
*O = Other&lt;br /&gt;
&lt;br /&gt;
The Google Sheet I used to reclassify can be found here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1XWtCTeaof8WxAuOCbn3XEFaK0sh-ZIif2JqcdbkH72I/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
The sheet '''OLD''' contains old, outdated data, '''WORK''' contains the process of reclassifying, and '''Updated Info''' contains only good, updated data. &lt;br /&gt;
&lt;br /&gt;
The data has been merged into '''The File to Rule Them All'''. Old data has not yet been deleted but can at any time, considering we have it saved in the Google Sheet '''OLD''' (just wanted the go ahead from Hira or Ed).&lt;br /&gt;
&lt;br /&gt;
===Multiple campuses and cohorts===&lt;br /&gt;
&lt;br /&gt;
The File to Rule Them All contains an updated address variable. 139 of our 166 accelerators have addresses that are available online. The ones we could not find information for are left blank.&lt;br /&gt;
&lt;br /&gt;
The collaborative sheet that Hira, Maxine, and I worked on to update this list can be accessed here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1nktgJZfm3L8IsSCHgYbPasSdvKb7QHKxZp8K5Si2iNo/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
We have also added a new cohort list. Under The File To Rule Them All, the sheet '''Multiple Campuses''' contains the different locations for accelerators with multiple campuses. Column B can be used to filter out multiple location accelerators.&lt;br /&gt;
&lt;br /&gt;
===Fixed Google Sheet===&lt;br /&gt;
&lt;br /&gt;
We created a new sheet with only data we want to keep, and cleaned it up. That sheet is called &amp;quot;Good Data Only&amp;quot;, at the same link:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
I first used our &amp;quot;recap&amp;quot; and &amp;quot;announced&amp;quot; classification to standardize and fix the dates.&lt;br /&gt;
&lt;br /&gt;
*Columns N-R contain our new data. Please note that all of these columns are based on formulas and will be made erroneous if edited.&lt;br /&gt;
*Column N is the # of weeks for an accelerator program, gathered via VLookup from The File to Rule Them All.&lt;br /&gt;
*Column O is the Actual Date we want to record, and was gathered by subtracting the # of weeks from a date '''if''' the listed page was a '''recap'''.&lt;br /&gt;
*Columns P and Q are the Month and Years stripped from Column O.&lt;br /&gt;
*Finally, Column R is the season variable, as Ed said it should be coded. &lt;br /&gt;
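&lt;br /&gt;
The logic of Columns O through R can be sketched as follows. This is not the spreadsheet formula itself: leaving the date unchanged for pages that are not recaps is my reading of the rule, and the month-to-season mapping is a hypothetical stand-in, since the coding Ed specified is not written out on this page.&lt;br /&gt;

```python
from datetime import date, timedelta

def cohort_start(page_date, program_weeks, page_type):
    """Column O: subtract the program length only when the page is a recap."""
    if page_type == "recap":
        return page_date - timedelta(weeks=program_weeks)
    # Assumption: any other page type (e.g. "announced") keeps its listed date.
    return page_date

def season(month):
    """Hypothetical season coding by month (an assumption, not Ed's actual rule)."""
    names = ["Winter", "Spring", "Summer", "Fall"]
    # Dec-Feb maps to Winter, Mar-May to Spring, Jun-Aug to Summer, Sep-Nov to Fall.
    return names[(month % 12) // 3]

start = cohort_start(date(2018, 6, 15), 12, "recap")
month, year = start.month, start.year  # Columns P and Q
```

For a recap page dated 15 June 2018 and a 12-week program, this gives a start date of 23 March 2018.&lt;br /&gt;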
&lt;br /&gt;
We have also gone through and removed all bad data, all duplicates, and all rows without timing info. This is the most complete list possible.&lt;br /&gt;
&lt;br /&gt;
===Recode employee count===&lt;br /&gt;
&lt;br /&gt;
I have added a new column in Cohorts Final (of the File to Rule Them All) yet left the old column in case you would prefer to edit/classify differently.&lt;br /&gt;
Column AB (emp_count_scale) contains a variable coded on a scale of 1 to 9, with each number corresponding to one of the employee_count classifications (1 the lowest, 9 the highest). The exact output can be modified (1 could instead be tiny, 2 be small... 9 be huge).&lt;br /&gt;
&lt;br /&gt;
The employee count column is standardized and can easily be edited given some modification of the Excel formula.&lt;br /&gt;
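&lt;br /&gt;
The recoding amounts to a lookup from the employee_count bucket to its position in the ordered list of buckets. The sketch below assumes the nine bucket labels that Crunchbase exports typically use; the labels actually present in Cohorts Final should be checked before relying on it, and the name emp_count_scale is borrowed from the column for illustration only.&lt;br /&gt;

```python
# Assumed Crunchbase employee_count buckets, ordered from smallest to largest.
BUCKETS = ["1-10", "11-50", "51-100", "101-250", "251-500",
           "501-1000", "1001-5000", "5001-10000", "10000+"]

def emp_count_scale(employee_count):
    """Return 1-9 for a known bucket, or None for blanks and unknown values."""
    value = employee_count.strip()
    return BUCKETS.index(value) + 1 if value in BUCKETS else None
```

Swapping the numeric output for labels (tiny, small, and so on) would only require replacing the index with a lookup into a parallel list of names.&lt;br /&gt;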
&lt;br /&gt;
===Normalized investment amount===&lt;br /&gt;
&lt;br /&gt;
I've been trying to fix the investment amount, but I think it's smart that we discuss this before I move forward. I've done some tentative standardization (taking the average of a range if one is given), but so many accelerators &amp;quot;take up to __%&amp;quot; equity and &amp;quot;invest up to $___&amp;quot; that I think it's smartest to think hard about how to standardize first. Also notable is that some accelerators say they provide &amp;quot;$____ up front and another $___ in follow up funding for each stage.&amp;quot; How do we deal with these? Message me if you'd like to talk more about this.&lt;br /&gt;
&lt;br /&gt;
I have refrained from creating max and min investment columns so that I don't spend time on the data before we discuss it.&lt;br /&gt;
&lt;br /&gt;
===Remaining to do===&lt;br /&gt;
&lt;br /&gt;
===Recoding Founders Experience===&lt;br /&gt;
&lt;br /&gt;
I've begun recoding the Founders variable. I can categorize 232 of the 824 listed job titles (there are 466 distinct job titles) into 9 categories (the most frequent results as given in a PivotTable found in the sheet).&lt;br /&gt;
&lt;br /&gt;
We will need to talk about how to categorize the remaining, more unusual job titles.&lt;br /&gt;
&lt;br /&gt;
==Recent Work==&lt;br /&gt;
&lt;br /&gt;
Here's a project update on the work that has been done since coming to McNair. The most recent file is&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
===Merging Cohort Companies with Crunchbase Info===&lt;br /&gt;
&lt;br /&gt;
More information on this part of the project can be found on the page [[Merging Existing Data with Crunchbase]].&lt;br /&gt;
&lt;br /&gt;
The newest updated sheet of cohort company info is under the '''Cohorts Final''' sheet of '''The File to Rule Them All.xlsx'''.&lt;br /&gt;
&lt;br /&gt;
Working with [[Maxine Tao]], we have matched companies to their respective pages and information found in Crunchbase (via UUID). We ensured single matches by doing a 1-1-1-1 match with our data and with Crunchbase (using the Matcher). We then received additional information on 8092 companies. The following new information (on top of what we already had) is included in the sheet:&lt;br /&gt;
*short_description&lt;br /&gt;
*long_description&lt;br /&gt;
*category_list (details the company's category)&lt;br /&gt;
*category_group_list (a less refined, more all-encompassing category classification)&lt;br /&gt;
*founded_on date&lt;br /&gt;
*employee_count&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*address&lt;br /&gt;
&lt;br /&gt;
And the following information was also pulled from Crunchbase and '''merged''' with our existing data: &lt;br /&gt;
*URL (was merged with courl cells) &lt;br /&gt;
*city (was merged with colocation)&lt;br /&gt;
*state_code (was merged with colocation)&lt;br /&gt;
*country_code (was merged with colocation)&lt;br /&gt;
*status (was merged with costatus)&lt;br /&gt;
&lt;br /&gt;
===The Equity Variables===&lt;br /&gt;
&lt;br /&gt;
[[Maxine Tao]] and I have added six new variables to the '''Accelerators Final''' sheet. Those variables are:&lt;br /&gt;
&lt;br /&gt;
*Terms of joining - terms of joining accelerator and important details about program&lt;br /&gt;
*Equity? (1/0) - cells contain a 1 if the accelerator takes equity, a 0 if the accelerator definitively does not, and are blank if we could not find that information&lt;br /&gt;
*Equity Amount - the % of equity the accelerator will take (can sometimes be a range, e.g. 5-7%)&lt;br /&gt;
*Equity Amount Normalized - this copies the previous column but only keeps %&amp;gt;0, and if a range was given (e.g. 5-7%) it returns the average (e.g. 6%)&lt;br /&gt;
*Investment Amount - the $ the accelerator invests in a company to begin, if relevant (could also be a range or an &amp;quot;up to $######&amp;quot;)&lt;br /&gt;
*Investment Notes - any comments on the previous 4 columns&lt;br /&gt;
&lt;br /&gt;
These six variables tell us more about the characteristics of accelerators; specifically, which ones take equity and which ones do not, and how much equity accelerators take.&lt;br /&gt;
&lt;br /&gt;
Relevant information: &lt;br /&gt;
*82 accelerators take equity, 42 do not, and we lack information for 37.&lt;br /&gt;
*The average % of equity among accelerators who take equity is 6.49% (rough estimate--do not use for anything official). I got this number by looking only at accelerators who take equity, replacing any reported range with its average (e.g. 4%-10% equity would be coded as 7% equity), and taking the mean.&lt;br /&gt;
&lt;br /&gt;
===Matching Accelerators to UUIDs via Crunchbase===&lt;br /&gt;
&lt;br /&gt;
We've also added UUIDs for 163 of our 166 accelerators. The UUIDs can be found in Column AE of the '''Accelerators Final''' sheet.&lt;br /&gt;
&lt;br /&gt;
The file with accelerators matched to Crunchbase UUIDs can be found at:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Accelerators and UUIDs.xlsx&lt;br /&gt;
&lt;br /&gt;
This is the master file and should never be modified unless we find a UUID changed. ALL OTHER SHEETS with UUIDs are linked to this sheet so its changes will be reflected elsewhere.&lt;br /&gt;
&lt;br /&gt;
More information can be found on the [[Crunchbase Data]] page. &lt;br /&gt;
&lt;br /&gt;
===Linking Accelerators to Founders/LinkedIn Crawling===&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] got the [[LinkedIn Crawler (Python)]] to work, which means we currently have the following information about accelerator founders:&lt;br /&gt;
*Current Job Title&lt;br /&gt;
*Location&lt;br /&gt;
*Employer&lt;br /&gt;
*Job(s) Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location of jobs&lt;br /&gt;
*Extra Description&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
This information can be found in the various '''Founders''' sheets in '''The File to Rule Them All'''.&lt;br /&gt;
&lt;br /&gt;
===Finding Company URLs===&lt;br /&gt;
Excel master datasets are in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018&lt;br /&gt;
&lt;br /&gt;
Code and files specific to this URL finder are in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\url finder&lt;br /&gt;
&lt;br /&gt;
====Results====&lt;br /&gt;
I used STEP1_crawl.py and STEP2_findcorrecturl.py to add approximately 1000 more URLs into 'The File to Rule Them All.xlsx'.&lt;br /&gt;
&lt;br /&gt;
====Testing====&lt;br /&gt;
&lt;br /&gt;
In this file (sheet: 'Most Recent Merged Data' note that this is just a copy of 'Cohorts Final' in 'The File to Rule Them All'):&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\Merged W Crunchbase Data as of July 17.xlx&lt;br /&gt;
&lt;br /&gt;
We filter for companies (~4000) that did not receive VC, are not in Crunchbase, and do not have URLs.&lt;br /&gt;
Using a Google crawler (STEP1_crawl.py) and a URL matching script (STEP2_findcorrecturl.py), we will try to find as many URLs as possible.&lt;br /&gt;
&lt;br /&gt;
To test, I ran about 40 companies from &amp;quot;smallcompanylist.txt&amp;quot;, using only the company name as the search term and taking the first 4 valid results (see the DONT_COLLECT list in the code). The Google crawler and URL matcher were able to correctly identify around 20 URLs. They also misidentify some URLs that look very similar to the company name, but they are accurate for the most part if the name is not too generic. I then ran the 20 unfound company names through the crawler again, this time using company name + startup as the search term. This did not identify any more correct URLs.&lt;br /&gt;
&lt;br /&gt;
It seems reasonable to assume that if the company URL cannot be found within the first 4 valid search results, then that company probably does not have a URL at all. This is the case for many of the 20 unfound companies from my test run above.&lt;br /&gt;
&lt;br /&gt;
====Actual Run Info====&lt;br /&gt;
&lt;br /&gt;
The companies we needed to find URLs for are in a file called 'ACTUALNEEDEDCOMPANIES.txt'.&lt;br /&gt;
&lt;br /&gt;
The first four results for every company, as found by STEP1_crawl.py, are in 'ACTUAL_crawled_company_urls.txt'.&lt;br /&gt;
&lt;br /&gt;
The results after the matching done by STEP2_findcorrecturl.py are in 'ACTUAL_finalurls.txt'.&lt;br /&gt;
&lt;br /&gt;
Note that in the end, I decided to only take URLs that were given a match score of greater than 0.9 by setting this restriction in STEP2_findcorrecturl.py. Then I manually removed any duplicates/inaccurate results. If you want, you can set the threshold lower in STEP2 and use STEP3_clean.py to find the URL with the highest score for each company. &lt;br /&gt;
&lt;br /&gt;
The point of this URL finder is to find timing information for companies. Timing information can be found on Whois. See the page http://mcnair.bakerinstitute.org/wiki/Whois_Parser#Summer_2018_Work for information on running the whois parser. &lt;br /&gt;
&lt;br /&gt;
====Using Python files====&lt;br /&gt;
'''To use STEP1_crawl.py''': &lt;br /&gt;
 INPUT: a list of company names (or anything) you would like to find websites for by searching on google&lt;br /&gt;
 OUTPUT: a list of company names and the top X number of results from google &lt;br /&gt;
&lt;br /&gt;
1. Change LIST_FILEPATH in line 26 to be the name of the file that contains the list of things you would like to search. &lt;br /&gt;
&lt;br /&gt;
2. Change NUMRESULT to be however many results you would like from Google. &lt;br /&gt;
&lt;br /&gt;
3. Adjust DONT_COLLECT to include any websites that you don't want. &lt;br /&gt;
&lt;br /&gt;
4. If you would like to add another search keyword, add this in line 87 which is queries.append(name + &amp;quot;whatever you want here&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
5. Change line 127 to be the name of your output file.&lt;br /&gt;
&lt;br /&gt;
'''To use STEP2_findcorrecturl.py''':&lt;br /&gt;
 INPUT: output file from STEP1&lt;br /&gt;
 OUTPUT: a file formatted the same as the output of STEP1, but URLs that do not match over the threshold value you set will be replaced with &amp;quot;no match&amp;quot; &lt;br /&gt;
&lt;br /&gt;
1. Change file f to be the output file name from STEP1. Change g to be the desired name of the output file for this part. &lt;br /&gt;
&lt;br /&gt;
2. In the if statement on line 44, set your desired threshold. Note that anything greater than 0.6 is generally considered a decent match. Using 0.75 is safer, and my choice of 0.9 ensures almost exact matches. However, if your list of companies (or whatever you are searching) includes really common names, even a 90%+ match might not be the site you're looking for. &lt;br /&gt;
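&lt;br /&gt;
To make the threshold step concrete, here is a rough sketch of how a ratio-style match score could be computed and filtered. This is not the code in STEP2_findcorrecturl.py: it assumes the score is a difflib similarity ratio between the company name and the first label of the URL's host, and the function names are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
from difflib import SequenceMatcher
from urllib.parse import urlparse

THRESHOLD = 0.9  # the cutoff used for the actual run


def match_score(company, url):
    """Similarity between a company name and the first label of the URL's host.

    Assumes the URL includes a scheme (http:// or https://); without one,
    urlparse returns an empty host and the score is 0.
    """
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    label = host.split(".")[0]
    name = "".join(ch for ch in company.lower() if ch.isalnum())
    return SequenceMatcher(None, name, label).ratio()


def filter_urls(company, urls, threshold=THRESHOLD):
    """Keep URLs scoring above the threshold; replace the rest with "no match"."""
    return [u if match_score(company, u) > threshold else "no match" for u in urls]
```

Lowering the threshold argument lets more candidates through, which is when the optional STEP3 pass becomes useful.&lt;br /&gt;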
&lt;br /&gt;
'''To use STEP3_clean.py''':&lt;br /&gt;
&lt;br /&gt;
Note this is an optional step to use depending on the accuracy level you need and what kind of data you crawled earlier. I chose not to use this and instead set a more restrictive threshold in STEP2. &lt;br /&gt;
&lt;br /&gt;
1. Change file f to be the output file from STEP2 (you should delete anything that says &amp;quot;no match&amp;quot;, and when you use STEP2, you must also write the ratio score to the text file). Change g to be the desired name of the output file for this part. &lt;br /&gt;
&lt;br /&gt;
Your output should be a text file containing the company name and the URL that had the highest assigned score in STEP2. In case of more than 1 URL with the highest score, the script should take the first one.&lt;br /&gt;
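&lt;br /&gt;
The selection rule described above can be sketched as follows. This is an illustration rather than the contents of STEP3_clean.py: reading the STEP2 text file is left out, and the tuple layout (company, url, score) is an assumption about how the rows might be held in memory.&lt;br /&gt;
&lt;br /&gt;
```python
def best_url_per_company(rows):
    """Pick the highest-scoring URL for each company.

    rows: iterable of (company, url, score) tuples in file order.
    On a tie the first row seen wins, because only a strictly
    higher score replaces the current best.
    """
    best = {}
    for company, url, score in rows:
        if company not in best or score > best[company][1]:
            best[company] = (url, score)
    return {company: url for company, (url, _) in best.items()}


# Hypothetical rows for illustration only.
rows = [
    ("Acme", "a.com", 0.80),
    ("Acme", "b.com", 0.95),
    ("Acme", "c.com", 0.95),
    ("Zed", "z.io", 0.70),
]
chosen = best_url_per_company(rows)
```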
&lt;br /&gt;
==An Overview==&lt;br /&gt;
This project will be used to determine which accelerators are the most effective at churning out successful startups, as well as what characteristics are exhibited by these accelerators. First, we need to gather as much data as we can about as many accelerators as we can in order to look at factors that differentiate successful vs. unsuccessful ventures. Next, we need to create a web crawling program which will gather information about accelerators across the world by accessing their websites and extracting information. I believe that our overall goal with this research project is to gain insight into the methods of successful accelerators, as well as to find out what exactly differentiates very successful accelerators from dead accelerators.&lt;br /&gt;
&lt;br /&gt;
Helpful Links: http://seedrankings.com/&lt;br /&gt;
&lt;br /&gt;
==Remaining To Dos==&lt;br /&gt;
&lt;br /&gt;
The last update on [[Accelerator Seed List (Data)]] said the following needed to be done: &lt;br /&gt;
&lt;br /&gt;
*Cross-reference sheet with data from Peter's old accelerator consolidation file (&amp;quot;accelerator_data_noflag&amp;quot; and &amp;quot;accelerator_data&amp;quot; in &amp;quot;All Relevant Files&amp;quot;) and fill in missing data&lt;br /&gt;
*Variables that are 100% NOT in these 2 files:&lt;br /&gt;
**Cohort Breakout?&lt;br /&gt;
**Subtype&lt;br /&gt;
**Designed for Students?&lt;br /&gt;
**Campuses&lt;br /&gt;
**Stage&lt;br /&gt;
**Software Tech&lt;br /&gt;
**What stage do they look for?&lt;br /&gt;
&lt;br /&gt;
==(Outdated) Necessary Steps==&lt;br /&gt;
&lt;br /&gt;
'''Acquiring the necessary data to complete the Accelerator Master Variable List and the Cohort List will require the following (not necessarily in this order):'''&lt;br /&gt;
&lt;br /&gt;
===Step Zero: Connect to Crunchbase and Link Data===&lt;br /&gt;
&lt;br /&gt;
Complete - more info: [[Crunchbase Data]]&lt;br /&gt;
&lt;br /&gt;
===Step One: LinkedIn Founders Data===&lt;br /&gt;
&lt;br /&gt;
This project will begin by working with [[Grace Tan]] and [[Maxine Tao]] to connect accelerators to their founders and cohort companies using Crunchbase and LinkedIn crawlers. Grace and Maxine will go through [http://crunchbase.com Crunchbase] and find the UUID for companies and their founders (reference [[Crunchbase Data]], [[Crunchbase Accelerator Founders]], [[Crunchbase Accelerator Equity]]). Connect them using SQL and feed the names of founders into our LinkedIn crawler (headed by [[Grace Tan]]).&lt;br /&gt;
&lt;br /&gt;
The list of founders for accelerators can be found at&lt;br /&gt;
 McNair/Projects/Accelerators/Fall 2017/founders_linkedin.txt&lt;br /&gt;
&lt;br /&gt;
The '''Unfound Founders''' file codes a 0 for all companies '''''not listed''''' within the LinkedIn Founders file, and a 1 for those that do have founders listed.&lt;br /&gt;
&lt;br /&gt;
'''Given the founders' names, we will then be able to use the [[LinkedIn Crawler (Python)]] to find the relevant details of an accelerator founder (education, work experience, etc.) This data on founders will help us solve the horse, jockey, racetrack question to detect what variables affect a startup's success (the accelerator, the founders, the environment/city).'''&lt;br /&gt;
&lt;br /&gt;
===Step Two: Linking Accelerators to Cohorts Using Investments on Crunchbase===&lt;br /&gt;
&lt;br /&gt;
In this step we focus on accelerators who take equity from the companies that engage in their program. We do this to prevent looking at accelerators who may also run funds/invest in various companies but do not take equity. This would provide us misleading results and lead us to believe some companies are in cohorts at accelerators that they are really not. &lt;br /&gt;
&lt;br /&gt;
Maxine will acquire the list of accelerators who take equity from companies from the following sheet:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/All Relevant Files/accelerator_data_noflag.txt&lt;br /&gt;
&lt;br /&gt;
Looking at the file, however, shows that very few are actually categorized well and the equity variable is messy. Moving forward, we need to check/refine/fix this classification.&lt;br /&gt;
&lt;br /&gt;
This file has 266 rows. The most recent, actual version of our accelerator database (found at E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx under the sheet Master Variable List) only has 167 rows, meaning the accelerator_data_noflag.txt file has too many rows. &lt;br /&gt;
&lt;br /&gt;
We will need to do a left join of Accelerator Master List with accelerator_data_noflag.txt to get rid of the accelerator names that are in accelerator_data_noflag but NOT in Accelerator Master List.&lt;br /&gt;
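&lt;br /&gt;
A small sketch of what that left join amounts to, written in plain Python rather than SQL. The field names (name, equity) and the sample rows are hypothetical; the real files may use different column headers, and matching on a lowercased, trimmed name is an assumption.&lt;br /&gt;
&lt;br /&gt;
```python
def left_join(master_names, noflag_rows):
    """Keep every master-list accelerator and attach its noflag equity value.

    Accelerators that appear only in noflag_rows are dropped; master-list
    accelerators with no noflag row get equity=None.
    """
    by_name = {row["name"].strip().lower(): row for row in noflag_rows}
    joined = []
    for name in master_names:
        match = by_name.get(name.strip().lower())
        joined.append({"name": name, "equity": match["equity"] if match else None})
    return joined


# Hypothetical data for illustration only.
master = ["Alpha Labs", "Beta Works"]
noflag = [
    {"name": "alpha labs ", "equity": "Y"},
    {"name": "Gamma Hub", "equity": "N"},
]
joined = left_join(master, noflag)
```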
&lt;br /&gt;
Once this is finished, we should have an “Equity” classification variable for every accelerator in Accelerator Master List. The accelerators that have a Y (or maybe it’s a 1) are the ones that do take equity. These are the accelerators we’ll be able to do the Crunchbase work on to see when accelerators take equity.&lt;br /&gt;
&lt;br /&gt;
We then look at the accelerators' investments (or companies and the entities which invested in them), cross-reference the list of companies/accelerators, and once we find a match, we know that a company went through an accelerator and during which year they went through a cohort.&lt;br /&gt;
&lt;br /&gt;
'''From this, we get the following data:'''&lt;br /&gt;
*Accelerator a given company went through&lt;br /&gt;
*Year said company went through a cohort/Specific cohort company went through&lt;br /&gt;
&lt;br /&gt;
===Step Three: Demo Day Crawler===&lt;br /&gt;
&lt;br /&gt;
This part of the project relies on the contributions of the wonderful [[Minh Le]]. Better documentation for the project can be found on the [[Demo Day Page Parser]], [[Demo Day Page Google Classifier]], and [[Accelerator Demo Day]] project pages.&lt;br /&gt;
&lt;br /&gt;
Essentially, this part of the accelerator data project will use the [[Demo Day Page Parser]] to look through accelerator websites for pages which list a cohort's 'Demo Day', or the day in which accelerators present their companies to a group of special investors ([https://www.ycombinator.com/demoday/faq/ here's an example FAQ page from Y Combinator]). The [[Demo Day Page Google Classifier]] will then determine if the page is, in fact, a demo day page. &lt;br /&gt;
&lt;br /&gt;
'''Given a cohort's demo day, we can gather a few pieces of key information (check with Ed to make sure this is the correct information to gather from Demo Days):'''&lt;br /&gt;
*The date a cohort began/the season the cohort went through the accelerator&lt;br /&gt;
**This is acquired by looking at the cohort's demo day date, and subtracting the number of weeks/months of a cohort for that given accelerator. The length of a cohort can be found in the file:&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx&lt;br /&gt;
*I assume we can also acquire the companies in a specific cohort, as in we will have the list of all companies in the (for example) Spring 2018 Cohort of Techstars.&lt;br /&gt;
&lt;br /&gt;
===Step Four: Non-profit Finder===&lt;br /&gt;
&lt;br /&gt;
More at [[Non-profit Finder]]&lt;br /&gt;
&lt;br /&gt;
Another important step in this project is finding which accelerators are non-profits.&lt;br /&gt;
&lt;br /&gt;
A comprehensive list of nonprofits taken from the [https://www.irs.gov/charities-non-profits/tax-exempt-organization-search-bulk-data-downloads IRS] can be found here:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Nonprofits in US.xlsx&lt;br /&gt;
Warning: this file has 1 million rows&lt;br /&gt;
&lt;br /&gt;
This file should be cross-referenced with the list of accelerators to find which ones are listed--those accelerators are non-profits. We'll need someone good in SQL to do this with a match.&lt;br /&gt;
&lt;br /&gt;
Potential problem:&lt;br /&gt;
*The names of listed non-profits will likely be somewhat different than the names of accelerators, because companies often file for tax exemption with different names than they show the public.&lt;br /&gt;
&lt;br /&gt;
==Workflow Image==&lt;br /&gt;
&lt;br /&gt;
(the color coding for this image is a very rough and preliminary estimate subject to change) I imagine the project will look something like this, in that it will require the following information to fully complete the project:&lt;br /&gt;
&lt;br /&gt;
[[File:image1.png]]&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23854</id>
		<title>U.S. Seed Accelerators</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23854"/>
		<updated>2018-07-31T15:29:41Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: /* Amazon Mechanical Turk Pricing */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=U.S. Seed Accelerators&lt;br /&gt;
|Has owner=Connor Rothschild,&lt;br /&gt;
|Has start date=06/18/2018&lt;br /&gt;
|Has keywords=accelerators, data&lt;br /&gt;
|Has notes=Continuation of [[Accelerator Data]]&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Is dependent on=Industry Classifier, Demo Day Page Parser, Accelerator Demo Day, Demo Day Page Google Classifier, Crunchbase Accelerator Equity, Crunchbase Accelerator Founders, Crunchbase Data,&lt;br /&gt;
|Has image=Seed-Accelerator.jpg&lt;br /&gt;
|Does subsume=Accelerator Data, Accelerator Seed List (Data),&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Project Location==&lt;br /&gt;
The master file can be found at &lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
==Relevant Former Projects==&lt;br /&gt;
This page serves as an updated and tidied version of the data and work presented on the [[Accelerator Seed List (Data)]] Project, which subsumed [[Accelerator Data]].&lt;br /&gt;
Both of these projects (and as a corollary, this project) are dependent on the [[Demo Day Page Parser]], [[Industry Classifier]], and the [[Whois Parser]].&lt;br /&gt;
&lt;br /&gt;
==Update for Hira==&lt;br /&gt;
&lt;br /&gt;
===Amazon Mechanical Turk Pricing===&lt;br /&gt;
Information can be found here [http://mcnair.bakerinstitute.org/wiki/Accelerator_Demo_Day#Pricing]&lt;br /&gt;
&lt;br /&gt;
===Recoded Founders Education===&lt;br /&gt;
&lt;br /&gt;
I have recoded two components of the founders' education sheet:&lt;br /&gt;
&lt;br /&gt;
1) Degree name has been reclassified into nine categories:&lt;br /&gt;
*High School&lt;br /&gt;
*Associates&lt;br /&gt;
*Bachelors&lt;br /&gt;
*Masters&lt;br /&gt;
*Certificate&lt;br /&gt;
*JD&lt;br /&gt;
*MBA&lt;br /&gt;
*PhD&lt;br /&gt;
*Other&lt;br /&gt;
&lt;br /&gt;
2) Majors have also been recoded into nine categories:&lt;br /&gt;
*H = Humanities&lt;br /&gt;
*SS = Social Sciences&lt;br /&gt;
*NS = Natural Sciences&lt;br /&gt;
*E = Engineering (includes computer science)&lt;br /&gt;
*B = Business and Economics&lt;br /&gt;
*L = Leadership&lt;br /&gt;
*MBA&lt;br /&gt;
*JD&lt;br /&gt;
*O = Other&lt;br /&gt;
&lt;br /&gt;
The Google Sheet I used to reclassify can be found here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1XWtCTeaof8WxAuOCbn3XEFaK0sh-ZIif2JqcdbkH72I/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
In that file, the sheet '''OLD''' contains old, outdated data, '''WORK''' contains the process of reclassifying, and '''Updated Info''' contains only good, updated data. &lt;br /&gt;
&lt;br /&gt;
The data has been merged into '''The File to Rule Them All'''. Old data has not yet been deleted but can be at any time, considering we have it saved in the Google Sheet '''OLD''' (just wanted the go ahead from Hira or Ed).&lt;br /&gt;
&lt;br /&gt;
===Multiple campuses and cohorts===&lt;br /&gt;
&lt;br /&gt;
The File to Rule Them All contains an updated address variable. 139 of our 166 accelerators have addresses that are available online. The ones we could not find information for are left blank.&lt;br /&gt;
&lt;br /&gt;
The collaborative sheet that Hira, Maxine, and I worked on to update this list can be accessed here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1nktgJZfm3L8IsSCHgYbPasSdvKb7QHKxZp8K5Si2iNo/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
We have also added a new cohort list. Under The File To Rule Them All, the sheet '''Multiple Campuses''' contains the different locations for accelerators with multiple campuses. Column B can be used to filter out multiple location accelerators.&lt;br /&gt;
&lt;br /&gt;
===Fixed Google Sheet===&lt;br /&gt;
&lt;br /&gt;
We created a new sheet with only data we want to keep, and cleaned it up. That sheet is called &amp;quot;Good Data Only&amp;quot;, at the same link:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
I first used our &amp;quot;recap&amp;quot; and &amp;quot;announced&amp;quot; classification to standardize and fix the dates.&lt;br /&gt;
&lt;br /&gt;
*Columns N-R contain our new data. Please note that all of these columns are based on formulas and will be made erroneous if edited.&lt;br /&gt;
*Column N is the # of weeks for an accelerator program, gathered via VLookup from The File to Rule Them All.&lt;br /&gt;
*Column O is the Actual Date we want to record, and was gathered by subtracting the # of weeks from a date '''if''' the listed page was a '''recap'''.&lt;br /&gt;
*Columns P and Q are the Month and Years stripped from Column O.&lt;br /&gt;
*Finally, Column R is the season variable, as Ed said it should be coded. &lt;br /&gt;
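&lt;br /&gt;
For anyone who wants to check the formula columns outside of Google Sheets, here is a minimal sketch of the date arithmetic behind Columns O, P, and Q. The function name and the sample date are made up, and keeping the date unchanged for non-recap pages is an assumption; the season coding in Column R is not reproduced here.&lt;br /&gt;
&lt;br /&gt;
```python
from datetime import date, timedelta


def cohort_start(page_date, program_weeks, page_type):
    """Estimate the date we want to record for a cohort.

    A recap page is dated at the end of the program, so the program
    length is subtracted. Any other page type keeps its date (assumption).
    """
    if page_type == "recap":
        return page_date - timedelta(weeks=program_weeks)
    return page_date


# Hypothetical example: a recap page dated 14 June 2018 for a 12-week program.
start = cohort_start(date(2018, 6, 14), 12, "recap")
month, year = start.month, start.year  # the values held in Columns P and Q
```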
&lt;br /&gt;
We have also gone through and removed all bad data, all duplicates, and all rows without timing info. This is the most complete list possible.&lt;br /&gt;
&lt;br /&gt;
===Recode employee count===&lt;br /&gt;
&lt;br /&gt;
I have added a new column in Cohorts Final (of the File to Rule Them All) but left the old column in case you would prefer to edit/classify differently.&lt;br /&gt;
Column AB (emp_count_scale) contains a variable coded on a scale of 1 to 9, with each number corresponding to one of the employee_count classifications (1 the lowest, 9 the highest). The exact output can be modified (1 could instead be tiny, 2 be small... 9 be huge).&lt;br /&gt;
&lt;br /&gt;
The employee count column is standardized and can easily be edited given some modification of the Excel formula.&lt;br /&gt;
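&lt;br /&gt;
The same recoding could be done in Python with a lookup table, sketched below. The nine bucket labels are an assumption based on how Crunchbase usually reports employee_count; check them against the actual values in Cohorts Final before relying on this.&lt;br /&gt;
&lt;br /&gt;
```python
# Assumed Crunchbase employee_count buckets, smallest to largest.
EMPLOYEE_BUCKETS = [
    "1-10", "11-50", "51-100", "101-250", "251-500",
    "501-1000", "1001-5000", "5001-10000", "10000+",
]

# Map each bucket to its rank: 1 is the lowest, 9 the highest.
EMP_COUNT_SCALE = {bucket: rank for rank, bucket in enumerate(EMPLOYEE_BUCKETS, start=1)}


def emp_count_scale(employee_count):
    """Return 1-9 for a known bucket, or None for blanks and unknown labels."""
    return EMP_COUNT_SCALE.get(employee_count.strip())
```

Swapping the numeric ranks for labels such as tiny or huge only requires changing the values in the lookup table.&lt;br /&gt;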
&lt;br /&gt;
===Normalized investment amount===&lt;br /&gt;
&lt;br /&gt;
I've been trying to fix the investment amount, but I think it's smart that we discuss this before I move forward. I've done some tentative standardization (taking the average of a range if one is given), but so many accelerators &amp;quot;take up to __%&amp;quot; equity and &amp;quot;invest up to $___&amp;quot; that I think it's smartest to think hard about how to standardize first. Also notable is that some accelerators say they provide &amp;quot;$____ up front and another $___ in follow up funding for each stage.&amp;quot; How do we deal with these? Message me if you'd like to talk more about this.&lt;br /&gt;
&lt;br /&gt;
I have refrained from creating max and min investment columns so that we don't spend time on the data before we discuss it.&lt;br /&gt;
&lt;br /&gt;
===Remaining to do===&lt;br /&gt;
&lt;br /&gt;
===Recoding Founders Experience===&lt;br /&gt;
&lt;br /&gt;
I've begun recoding the Founders variable. I can categorize 232 of the 824 listed job titles (there are 466 distinct job titles) into 9 categories (the most frequent results as given in a PivotTable found in the sheet).&lt;br /&gt;
&lt;br /&gt;
We will need to talk about how to categorize the most extraneous job titles.&lt;br /&gt;
&lt;br /&gt;
==Recent Work==&lt;br /&gt;
&lt;br /&gt;
Here's a project update on the work that has been done since coming to McNair. The most recent file is&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
===Merging Cohort Companies with Crunchbase Info===&lt;br /&gt;
&lt;br /&gt;
More information on this part of the project can be found on the page [[Merging Existing Data with Crunchbase]].&lt;br /&gt;
&lt;br /&gt;
The newest updated sheet of cohort company info is under the '''Cohorts Final''' sheet of '''The File to Rule Them All.xlsx'''.&lt;br /&gt;
&lt;br /&gt;
Working with [[Maxine Tao]], we have matched companies to their respective pages and information found in Crunchbase (via UUID). We ensured single matches by doing a 1-1-1-1 match with our data and with Crunchbase (using the Matcher). We then received additional information on 8092 companies. The following new information (on top of what we already had) is included in the sheet:&lt;br /&gt;
*short_description&lt;br /&gt;
*long_description&lt;br /&gt;
*category_list (details the company's category)&lt;br /&gt;
*category_group_list (a less refined, more all-encompassing category classification)&lt;br /&gt;
*founded_on date&lt;br /&gt;
*employee_count&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*address&lt;br /&gt;
&lt;br /&gt;
And the following information was also pulled from Crunchbase and '''merged''' with our existing data: &lt;br /&gt;
*URL (was merged with courl cells) &lt;br /&gt;
*city (was merged with colocation)&lt;br /&gt;
*state_code (was merged with colocation)&lt;br /&gt;
*country_code (was merged with colocation)&lt;br /&gt;
*status (was merged with costatus)&lt;br /&gt;
&lt;br /&gt;
===The Equity Variables===&lt;br /&gt;
&lt;br /&gt;
[[Maxine Tao]] and I have added six new variables to the '''Accelerators Final''' sheet. Those variables are:&lt;br /&gt;
&lt;br /&gt;
*Terms of joining - terms of joining accelerator and important details about program&lt;br /&gt;
*Equity? (1/0) - cells contain a 1 if the accelerator takes equity, a 0 if the accelerator definitively does not, and are blank if we could not find that information&lt;br /&gt;
*Equity Amount - the % of equity the accelerator will take (can sometimes be a range (eg. 5-7%))&lt;br /&gt;
*Equity Amount Normalized - this copies the previous column but only keeps %&amp;gt;0, and if a range was given (e.g. 5-7%) it returns the average (e.g. 6%)&lt;br /&gt;
*Investment Amount - the dollar amount the accelerator initially invests in a company, if relevant (could also be a range or an &amp;quot;up to $######&amp;quot; figure)&lt;br /&gt;
*Investment Notes - any comments on the previous 4 columns&lt;br /&gt;
&lt;br /&gt;
These six variables tell us more about the characteristics of accelerators; specifically, which ones take equity and which ones do not, and how much equity accelerators take.&lt;br /&gt;
&lt;br /&gt;
Relevant information: &lt;br /&gt;
*82 accelerators take equity, 42 do not, and we lack information for 37.&lt;br /&gt;
*The average % of equity among accelerators that take equity is 6.49% (rough estimate; do not use for anything official). We got this number by looking only at accelerators that take equity, replacing each reported range with its midpoint (e.g. 4%-10% equity would be coded as 7% equity), and taking the mean.&lt;br /&gt;
&lt;br /&gt;
===Matching Accelerators to UUIDs via Crunchbase===&lt;br /&gt;
&lt;br /&gt;
We've also added UUIDs for 163 of our 166 accelerators. The UUIDs can be found in Column AE of the '''Accelerators Final''' sheet.&lt;br /&gt;
&lt;br /&gt;
The file with accelerators matched to Crunchbase UUIDs can be found at:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Accelerators and UUIDs.xlsx&lt;br /&gt;
&lt;br /&gt;
This is the master file and should not be modified unless a UUID has changed. All other sheets with UUIDs are linked to this sheet, so any change made here will be reflected elsewhere.&lt;br /&gt;
&lt;br /&gt;
More information can be found on the [[Crunchbase Data]] page. &lt;br /&gt;
&lt;br /&gt;
===Linking Accelerators to Founders/LinkedIn Crawling===&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] got the [[LinkedIn Crawler (Python)]] to work, which means we currently have the following information about accelerator founders:&lt;br /&gt;
*Current Job Title&lt;br /&gt;
*Location&lt;br /&gt;
*Employer&lt;br /&gt;
*Job(s) Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location of jobs&lt;br /&gt;
*Extra Description&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
This information can be found in the various '''Founders''' sheets in '''The File to Rule Them All'''.&lt;br /&gt;
&lt;br /&gt;
===Finding Company URLs===&lt;br /&gt;
Excel master datasets are in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018&lt;br /&gt;
&lt;br /&gt;
Code and files specific to this URL finder are in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\url finder&lt;br /&gt;
&lt;br /&gt;
====Results====&lt;br /&gt;
I used STEP1_crawl.py and STEP2_findcorrecturl.py to add approximately 1000 more URLs into 'The File to Rule Them All.xlsx'.&lt;br /&gt;
&lt;br /&gt;
====Testing====&lt;br /&gt;
&lt;br /&gt;
In this file (sheet: 'Most Recent Merged Data' note that this is just a copy of 'Cohorts Final' in 'The File to Rule Them All'):&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\Merged W Crunchbase Data as of July 17.xlx&lt;br /&gt;
&lt;br /&gt;
We filter for companies (~4000) that did not receive VC, are not in Crunchbase, and do not have URLs.&lt;br /&gt;
Using a Google crawler (STEP1_crawl.py) and a URL matching script (STEP2_findcorrecturl.py), we will try to find as many URLs as possible.&lt;br /&gt;
&lt;br /&gt;
To test, I ran about 40 companies from &amp;quot;smallcompanylist.txt&amp;quot;, using only the company name as the search term and taking the first 4 valid results (see the DONT_COLLECT list in the code). The Google crawler and URL matcher were able to correctly identify around 20 URLs. They also misidentify some URLs that look very similar to the company name, but they are accurate for the most part if the name is not too generic. I then ran the 20 unfound company names through the crawler again, this time using company name + startup as the search term. This did not identify any more correct URLs.&lt;br /&gt;
&lt;br /&gt;
It seems reasonable to assume that if the company URL cannot be found within the first 4 valid search results, then that company probably does not have a URL at all. This is the case for many of the 20 unfound companies from my test run above.&lt;br /&gt;
&lt;br /&gt;
====Actual Run Info====&lt;br /&gt;
&lt;br /&gt;
The companies we needed to find URLs for are in a file called 'ACTUALNEEDEDCOMPANIES.txt'.&lt;br /&gt;
&lt;br /&gt;
The first four results for every company, as found by STEP1_crawl.py, are in 'ACTUAL_crawled_company_urls.txt'.&lt;br /&gt;
&lt;br /&gt;
The results after the matching done by STEP2_findcorrecturl.py are in 'ACTUAL_finalurls.txt'.&lt;br /&gt;
&lt;br /&gt;
Note that in the end, I decided to only take URLs that were given a match score of greater than 0.9 by setting this restriction in STEP2_findcorrecturl.py. Then I manually removed any duplicates/inaccurate results. If you want, you can set the threshold lower in STEP2 and use STEP3_clean.py to find the URL with the highest score for each company. &lt;br /&gt;
&lt;br /&gt;
The point of this URL finder is to find timing information for companies. Timing information can be found on Whois. See the page http://mcnair.bakerinstitute.org/wiki/Whois_Parser#Summer_2018_Work for information on running the whois parser. &lt;br /&gt;
&lt;br /&gt;
====Using Python files====&lt;br /&gt;
'''To use STEP1_crawl.py''': &lt;br /&gt;
 INPUT: a list of company names (or anything) you would like to find websites for by searching on google&lt;br /&gt;
 OUTPUT: a list of company names and the top X number of results from google &lt;br /&gt;
&lt;br /&gt;
1. Change LIST_FILEPATH in line 26 to be the name of the file that contains the list of things you would like to search. &lt;br /&gt;
&lt;br /&gt;
2. Change NUMRESULT to be however many results you would like from Google. &lt;br /&gt;
&lt;br /&gt;
3. Adjust DONT_COLLECT to include any websites that you don't want. &lt;br /&gt;
&lt;br /&gt;
4. If you would like to add another search keyword, add this in line 87 which is queries.append(name + &amp;quot;whatever you want here&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
5. Change line 127 to be the name of your output file.&lt;br /&gt;
&lt;br /&gt;
'''To use STEP2_findcorrecturl.py''':&lt;br /&gt;
 INPUT: output file from STEP1&lt;br /&gt;
 OUTPUT: a file formatted the same as the output of STEP1, but URLs that do not match over the threshold value you set will be replaced with &amp;quot;no match&amp;quot; &lt;br /&gt;
&lt;br /&gt;
1. Change file f to be the output file name from STEP1. Change g to be the desired name of the output file for this part. &lt;br /&gt;
&lt;br /&gt;
2. In the if statement on line 44, set your desired threshold. Note that anything greater than 0.6 is generally considered a decent match. Using 0.75 is safer, and my choice of 0.9 ensures almost exact matches. However, if your list of companies (or whatever you are searching) includes really common names, even a 90%+ match might not be the site you're looking for. &lt;br /&gt;
&lt;br /&gt;
'''To use STEP3_clean.py''':&lt;br /&gt;
&lt;br /&gt;
Note this is an optional step to use depending on the accuracy level you need and what kind of data you crawled earlier. I chose not to use this and instead set a more restrictive threshold in STEP2. &lt;br /&gt;
&lt;br /&gt;
1. Change file f to be the output file from STEP2 (you should delete anything that says &amp;quot;no match&amp;quot;, and when you use STEP2, you must also write the ratio score to the text file). Change g to be the desired name of the output file for this part. &lt;br /&gt;
&lt;br /&gt;
Your output should be a text file containing the company name and the URL that had the highest assigned score in STEP2. In case of more than 1 URL with the highest score, the script should take the first one.&lt;br /&gt;
&lt;br /&gt;
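The selection rule above can be sketched as follows, assuming each row of the STEP2 output has been parsed into a (company, url, score) tuple. The real file layout is not documented here, so the row format and the function name are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def best_url_per_company(rows):
    """Return {company: url} keeping the highest-scoring URL; ties keep the first seen."""
    best = {}
    for company, url, score in rows:
        if url == "no match":
            continue  # STEP3 expects these rows to be removed beforehand
        # Strictly greater, so an equal score never replaces the earlier URL.
        if company not in best or score > best[company][1]:
            best[company] = (url, score)
    return {company: url for company, (url, score) in best.items()}
```
&lt;br /&gt;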
==An Overview==&lt;br /&gt;
This project will be used to determine which accelerators are the most effective at churning out successful startups, as well as what characteristics are exhibited by these accelerators. First, we need to gather as much data as we can about as many accelerators as we can in order to look at factors that differentiate successful vs. unsuccessful ventures. Next, we need to create a web crawling program which will gather information about accelerators across the world by accessing their websites and extracting information. I believe that our overall goal with this research project is to gain insight into the methods of successful accelerators, as well as to find out what exactly differentiates very successful accelerators from dead accelerators.&lt;br /&gt;
&lt;br /&gt;
Helpful Links: http://seedrankings.com/&lt;br /&gt;
&lt;br /&gt;
==Remaining To Dos==&lt;br /&gt;
&lt;br /&gt;
The last update on [[Accelerator Seed List (Data)]] said the following needed to be done: &lt;br /&gt;
&lt;br /&gt;
*Cross-reference sheet with data from Peter's old accelerator consolidation file (&amp;quot;accelerator_data_noflag&amp;quot; and &amp;quot;accelerator_data&amp;quot; in &amp;quot;All Relevant Files&amp;quot;) and fill in missing data&lt;br /&gt;
*Variables that are 100% NOT in these 2 files:&lt;br /&gt;
**Cohort Breakout?&lt;br /&gt;
**Subtype&lt;br /&gt;
**Designed for Students?&lt;br /&gt;
**Campuses&lt;br /&gt;
**Stage&lt;br /&gt;
**Software Tech&lt;br /&gt;
**What stage do they look for?&lt;br /&gt;
&lt;br /&gt;
==(Outdated) Necessary Steps==&lt;br /&gt;
&lt;br /&gt;
'''Acquiring the necessary data to complete the Accelerator Master Variable List and the Cohort List will require the following (not necessarily in this order):'''&lt;br /&gt;
&lt;br /&gt;
===Step Zero: Connect to Crunchbase and Link Data===&lt;br /&gt;
&lt;br /&gt;
Complete - more info: [[Crunchbase Data]]&lt;br /&gt;
&lt;br /&gt;
===Step One: LinkedIn Founders Data===&lt;br /&gt;
&lt;br /&gt;
This project will begin by working with [[Grace Tan]] and [[Maxine Tao]] to connect accelerators to their founders and cohort companies using Crunchbase and LinkedIn crawlers. Grace and Maxine will go through [http://crunchbase.com Crunchbase] and find the UUID for companies and their founders (reference [[Crunchbase Data]], [[Crunchbase Accelerator Founders]], [[Crunchbase Accelerator Equity]]). Connect them using SQL and feed the names of founders into our LinkedIn crawler (headed by [[Grace Tan]]).&lt;br /&gt;
&lt;br /&gt;
The list of founders for accelerators can be found at&lt;br /&gt;
 McNair/Projects/Accelerators/Fall 2017/founders_linkedin.txt&lt;br /&gt;
&lt;br /&gt;
The '''Unfound Founders''' file codes a 0 for all companies '''''not listed''''' within the LinkedIn Founders file, and a 1 for those that do have founders listed.&lt;br /&gt;
&lt;br /&gt;
'''Given the founders' names, we will then be able to use the [[LinkedIn Crawler (Python)]] to find the relevant details of an accelerator founder (education, work experience, etc.) This data on founders will help us solve the horse, jockey, racetrack question to detect what variables affect a startup's success (the accelerator, the founders, the environment/city).'''&lt;br /&gt;
&lt;br /&gt;
===Step Two: Linking Accelerators to Cohorts Using Investments on Crunchbase===&lt;br /&gt;
&lt;br /&gt;
In this step we focus on accelerators that take equity from the companies in their programs. We do this to avoid looking at accelerators that also run funds or invest in various companies but do not take equity. Including them would give misleading results, leading us to believe some companies are in cohorts at accelerators when they are not. &lt;br /&gt;
&lt;br /&gt;
Maxine will acquire the list of accelerators who take equity from companies from the following sheet:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/All Relevant Files/accelerator_data_noflag.txt&lt;br /&gt;
&lt;br /&gt;
Looking at the file, however, shows that very few are actually categorized well and the equity variable is messy. Moving forward, we need to check/refine/fix this classification.&lt;br /&gt;
&lt;br /&gt;
This file has 266 rows. The most recent, actual version of our accelerator database (found at E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx under the sheet Master Variable List) only has 167 rows, meaning the accelerator_data_noflag.txt file has too many rows. &lt;br /&gt;
&lt;br /&gt;
We will need to do a left join of Accelerator Master List with accelerator_data_noflag.txt to get rid of the accelerator names that are in accelerator_data_noflag but NOT in Accelerator Master List.&lt;br /&gt;
&lt;br /&gt;
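A minimal sketch of that join, assuming both files have been loaded as lists of dicts keyed by an accelerator name column. The column names (name, equity), the case-insensitive matching, and the function name are assumptions for illustration, not taken from the files.&lt;br /&gt;
&lt;br /&gt;
```python
def left_join_equity(master_rows, noflag_rows):
    """Left join: keep every master row, attach equity where a name matches."""
    equity_by_name = {
        row["name"].strip().lower(): row.get("equity", "")
        for row in noflag_rows
    }
    joined = []
    for row in master_rows:
        key = row["name"].strip().lower()
        # Names found only in noflag_rows are never visited, so they drop out.
        joined.append({**row, "equity": equity_by_name.get(key, "")})
    return joined
```
&lt;br /&gt;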
Once this is finished, we should have an “Equity” classification variable for every accelerator in Accelerator Master List. The accelerators that have a Y (or maybe it’s a 1) are the ones that take equity. These are the accelerators we’ll be able to do the Crunchbase work on to see when they take equity.&lt;br /&gt;
&lt;br /&gt;
We then look at the accelerators' investments (or at companies and the entities that invested in them) and cross-reference the list of companies/accelerators. Once we find a match, we know that a company went through an accelerator and in which year it went through a cohort.&lt;br /&gt;
&lt;br /&gt;
'''From this, we get the following data:'''&lt;br /&gt;
*Accelerator a given company went through&lt;br /&gt;
*Year said company went through a cohort/Specific cohort company went through&lt;br /&gt;
&lt;br /&gt;
===Step Three: Demo Day Crawler===&lt;br /&gt;
&lt;br /&gt;
This part of the project relies on the contributions of the wonderful [[Minh Le]]. Better documentation for the project can be found on the [[Demo Day Page Parser]], [[Demo Day Page Google Classifier]], and [[Accelerator Demo Day]] project pages.&lt;br /&gt;
&lt;br /&gt;
Essentially, this part of the accelerator data project will use the [[Demo Day Page Parser]] to look through accelerator websites for pages which list a cohort's 'Demo Day', or the day in which accelerators present their companies to a group of special investors ([https://www.ycombinator.com/demoday/faq/ here's an example FAQ page from Y Combinator]). The [[Demo Day Page Google Classifier]] will then determine if the page is, in fact, a demo day page. &lt;br /&gt;
&lt;br /&gt;
'''Given a cohort's demo day, we can gather a few pieces of key information (check with Ed to make sure this is the correct information to gather from Demo Days):'''&lt;br /&gt;
*The date a cohort began/the season the cohort went through the accelerator&lt;br /&gt;
**This is acquired by looking at the cohort's demo day date, and subtracting the number of weeks/months of a cohort for that given accelerator. The length of a cohort can be found in the file:&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx&lt;br /&gt;
*I assume we can also acquire the companies in a specific cohort, as in we will have the list of all companies in the (for example) Spring 2018 Cohort of Techstars.&lt;br /&gt;
&lt;br /&gt;
===Step Four: Non-profit Finder===&lt;br /&gt;
&lt;br /&gt;
More at [[Non-profit Finder]]&lt;br /&gt;
&lt;br /&gt;
Another important step in this project is finding which accelerators are non-profits.&lt;br /&gt;
&lt;br /&gt;
A comprehensive list of nonprofits taken from the [https://www.irs.gov/charities-non-profits/tax-exempt-organization-search-bulk-data-downloads IRS] can be found here:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Nonprofits in US.xlsx&lt;br /&gt;
Warning: this file has 1 million rows&lt;br /&gt;
&lt;br /&gt;
This file should be cross-referenced with the list of accelerators to find which ones are listed; those accelerators are non-profits. We'll need someone good at SQL to do this with a match.&lt;br /&gt;
&lt;br /&gt;
Potential problem:&lt;br /&gt;
*The names of listed non-profits will likely be somewhat different than the names of accelerators, because companies often file for tax exemption with different names than they show the public.&lt;br /&gt;
&lt;br /&gt;
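One way to soften the naming problem, sketched under assumptions: normalize both lists (lowercase, drop punctuation and common legal suffixes) before comparing. The suffix list and function names are hypothetical, and exact matching on normalized names will still miss accelerators that file under an unrelated legal name.&lt;br /&gt;
&lt;br /&gt;
```python
import re

# Hypothetical list of words to ignore when comparing names.
LEGAL_SUFFIXES = {"inc", "incorporated", "llc", "corp", "corporation", "the"}


def normalize_name(name):
    """Lowercase, strip punctuation, and drop common legal suffix words."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(w for w in words if w not in LEGAL_SUFFIXES)


def find_nonprofit_accelerators(accelerators, irs_names):
    """Return the accelerators whose normalized name appears in the IRS list."""
    # A set keeps lookups fast even for a list with about 1 million rows.
    irs_normalized = {normalize_name(n) for n in irs_names}
    return [a for a in accelerators if normalize_name(a) in irs_normalized]
```
&lt;br /&gt;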
==Workflow Image==&lt;br /&gt;
&lt;br /&gt;
I imagine the project will look something like this; the image shows the information required to fully complete it. (The color coding is a very rough, preliminary estimate and is subject to change.)&lt;br /&gt;
&lt;br /&gt;
[[File:image1.png]]&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23853</id>
		<title>U.S. Seed Accelerators</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23853"/>
		<updated>2018-07-31T15:29:29Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: /* Amazon Mechanical Turk Pricing */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=U.S. Seed Accelerators&lt;br /&gt;
|Has owner=Connor Rothschild,&lt;br /&gt;
|Has start date=06/18/2018&lt;br /&gt;
|Has keywords=accelerators, data&lt;br /&gt;
|Has notes=Continuation of [[Accelerator Data]]&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Is dependent on=Industry Classifier, Demo Day Page Parser, Accelerator Demo Day, Demo Day Page Google Classifier, Crunchbase Accelerator Equity, Crunchbase Accelerator Founders, Crunchbase Data,&lt;br /&gt;
|Has image=Seed-Accelerator.jpg&lt;br /&gt;
|Does subsume=Accelerator Data, Accelerator Seed List (Data),&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Project Location==&lt;br /&gt;
The master file can be found at &lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
==Relevant Former Projects==&lt;br /&gt;
This page serves as an updated and tidied version of the data and work presented on the [[Accelerator Seed List (Data)]] Project, which subsumed [[Accelerator Data]].&lt;br /&gt;
Both of these projects (and as a corollary, this project) are dependent on the [[Demo Day Page Parser]], [[Industry Classifier]], and the [[Whois Parser]].&lt;br /&gt;
&lt;br /&gt;
==Update for Hira==&lt;br /&gt;
&lt;br /&gt;
===Amazon Mechanical Turk Pricing===&lt;br /&gt;
Information can be found [http://mcnair.bakerinstitute.org/wiki/Accelerator_Demo_Day#Pricing here]&lt;br /&gt;
&lt;br /&gt;
===Recoded Founders Education===&lt;br /&gt;
&lt;br /&gt;
I have recoded two components of the founders' education sheet:&lt;br /&gt;
&lt;br /&gt;
1) Degree name has been reclassified into nine categories:&lt;br /&gt;
*High School&lt;br /&gt;
*Associates&lt;br /&gt;
*Bachelors&lt;br /&gt;
*Masters&lt;br /&gt;
*Certificate&lt;br /&gt;
*JD&lt;br /&gt;
*MBA&lt;br /&gt;
*PhD&lt;br /&gt;
*Other&lt;br /&gt;
&lt;br /&gt;
2) Majors have also been recoded into nine categories:&lt;br /&gt;
*H = Humanities&lt;br /&gt;
*SS = Social Sciences&lt;br /&gt;
*NS = Natural Sciences&lt;br /&gt;
*E = Engineering (includes computer science)&lt;br /&gt;
*B = Business and Economics&lt;br /&gt;
*L = Leadership&lt;br /&gt;
*MBA&lt;br /&gt;
*JD&lt;br /&gt;
*O = Other&lt;br /&gt;
&lt;br /&gt;
The Google Sheet I used to reclassify can be found here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1XWtCTeaof8WxAuOCbn3XEFaK0sh-ZIif2JqcdbkH72I/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
The sheet '''OLD''' contains old, outdated data; '''WORK''' contains the process of reclassifying; and '''Updated Info''' contains only good, updated data. &lt;br /&gt;
&lt;br /&gt;
The data has been merged into '''The File to Rule Them All'''. Old data has not yet been deleted but can be at any time, since we have it saved in the Google Sheet '''OLD''' (I just wanted the go-ahead from Hira or Ed).&lt;br /&gt;
&lt;br /&gt;
===Multiple campuses and cohorts===&lt;br /&gt;
&lt;br /&gt;
The File to Rule Them All contains an updated address variable. 139 of our 166 accelerators have addresses that are available online. The ones we could not find information for are left blank.&lt;br /&gt;
&lt;br /&gt;
The collaborative sheet that Hira, Maxine, and I worked on to update this list can be accessed here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1nktgJZfm3L8IsSCHgYbPasSdvKb7QHKxZp8K5Si2iNo/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
We have also added a new cohort list. Under The File To Rule Them All, the sheet '''Multiple Campuses''' contains the different locations for accelerators with multiple campuses. Column B can be used to filter out multiple location accelerators.&lt;br /&gt;
&lt;br /&gt;
===Fixed Google Sheet===&lt;br /&gt;
&lt;br /&gt;
We created a new sheet with only data we want to keep, and cleaned it up. That sheet is called &amp;quot;Good Data Only&amp;quot;, at the same link:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
I first used our &amp;quot;recap&amp;quot; and &amp;quot;announced&amp;quot; classification to standardize and fix the dates.&lt;br /&gt;
&lt;br /&gt;
*Columns N-R contain our new data. Please note that all of these columns are based on formulas and will be made erroneous if edited.&lt;br /&gt;
*Column N is the # of weeks for an accelerator program, gathered via VLookup from The File to Rule Them All.&lt;br /&gt;
*Column O is the Actual Date we want to record, and was gathered by subtracting the # of weeks from a date '''if''' the listed page was a '''recap'''.&lt;br /&gt;
*Columns P and Q are the Month and Years stripped from Column O.&lt;br /&gt;
*Finally, Column R is the season variable, as Ed said it should be coded. &lt;br /&gt;
&lt;br /&gt;
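The Column N to R logic can be sketched in Python as follows. The recap rule follows the description above; leaving other page types unchanged is an assumption, and the month-to-season mapping is only a placeholder, since the coding Ed specified is not recorded on this page. The function names are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
from datetime import date, timedelta

# Placeholder mapping; replace with the season coding agreed on for the project.
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}


def cohort_start(page_date, program_weeks, page_type):
    """Subtract the program length only when the page is a recap."""
    if page_type == "recap":
        return page_date - timedelta(weeks=program_weeks)
    return page_date  # assumption: other page types are already dated at the start


def timing_columns(page_date, program_weeks, page_type):
    """Return (actual date, month, year, season), mirroring Columns O to R."""
    actual = cohort_start(page_date, program_weeks, page_type)
    return actual, actual.month, actual.year, SEASON_BY_MONTH[actual.month]
```
&lt;br /&gt;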
We have also gone through and removed all bad data, all duplicates, and all rows without timing info. This is the most complete list possible.&lt;br /&gt;
&lt;br /&gt;
===Recode employee count===&lt;br /&gt;
&lt;br /&gt;
I have added a new column in Cohorts Final (of the File to Rule Them All) yet left the old column in case you would prefer to edit/classify differently.&lt;br /&gt;
Column AB (emp_count_scale) contains a variable coded on a scale of 1 to 9, with each number corresponding to one of the employee_count classifications (1 the lowest, 9 the highest). The exact output can be modified (1 could instead be tiny, 2 be small... 9 be huge).&lt;br /&gt;
&lt;br /&gt;
The employee count column is standardized and can easily be edited given some modification of the Excel formula.&lt;br /&gt;
&lt;br /&gt;
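A sketch of the 1 to 9 coding, assuming employee_count uses the nine size ranges listed below (an assumption; check the labels against the actual column). The list, dictionary, and function names are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
# Assumed range labels, ordered from the smallest class to the largest.
EMPLOYEE_RANGES = [
    "1-10", "11-50", "51-100", "101-250", "251-500",
    "501-1000", "1001-5000", "5001-10000", "10000+",
]
# Map each range label to its rank: 1 is the smallest class, 9 the largest.
EMP_COUNT_SCALE = {label: rank for rank, label in enumerate(EMPLOYEE_RANGES, start=1)}


def emp_count_scale(employee_count):
    """Return the 1-9 rank for a range label, or None for blank/unknown values."""
    return EMP_COUNT_SCALE.get(str(employee_count).strip())
```
&lt;br /&gt;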
===Normalized investment amount===&lt;br /&gt;
&lt;br /&gt;
I've been trying to fix the investment amount, but I think it's smart that we discuss this before I move forward. I've done some tentative standardization (finding the average in a range if given), but so many accelerators &amp;quot;take up to __%&amp;quot; equity and &amp;quot;invest up to $___&amp;quot; that I think it's smartest we think hard about how to standardize first. Also notable is that some accelerators say they provide &amp;quot;$____ up front and another $___ in follow up funding for each stage.&amp;quot; How do we deal with these? Message me if you'd like to talk more about this.&lt;br /&gt;
&lt;br /&gt;
I refrained from creating max and min investment columns to avoid spending time on the data before we discuss it.&lt;br /&gt;
&lt;br /&gt;
===Remaining to do===&lt;br /&gt;
&lt;br /&gt;
===Recoding Founders Experience===&lt;br /&gt;
&lt;br /&gt;
I've begun recoding the Founders variable. I can categorize 232 of the 824 listed job titles (there are 466 distinct job titles) into 9 categories (the most frequent results as given in a PivotTable found in the sheet).&lt;br /&gt;
&lt;br /&gt;
We will need to talk about how to categorize the most extraneous job titles.&lt;br /&gt;
&lt;br /&gt;
==Recent Work==&lt;br /&gt;
&lt;br /&gt;
Here's a project update on the work that has been done since coming to McNair. The most recent file is&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
===Merging Cohort Companies with Crunchbase Info===&lt;br /&gt;
&lt;br /&gt;
More information on this part of the project can be found on the page [[Merging Existing Data with Crunchbase]].&lt;br /&gt;
&lt;br /&gt;
The newest updated sheet of cohort company info is under the '''Cohorts Final''' sheet of '''The File to Rule Them All.xlsx'''.&lt;br /&gt;
&lt;br /&gt;
Working with [[Maxine Tao]], we have matched companies to their respective pages and information found in Crunchbase (via UUID). We ensured single matches by doing a 1-1-1-1 match with our data and with Crunchbase (using the Matcher). We then received additional information on 8092 companies. The following new information (on top of what we already had) is included in the sheet:&lt;br /&gt;
*short_description&lt;br /&gt;
*long_description&lt;br /&gt;
*category_list (details the company's category)&lt;br /&gt;
*category_group_list (a less refined, more all-encompassing category classification)&lt;br /&gt;
*founded_on date&lt;br /&gt;
*employee_count&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*address&lt;br /&gt;
&lt;br /&gt;
And the following information was also pulled from Crunchbase and '''merged''' with our existing data: &lt;br /&gt;
*URL (was merged with courl cells) &lt;br /&gt;
*city (was merged with colocation)&lt;br /&gt;
*state_code (was merged with colocation)&lt;br /&gt;
*country_code (was merged with colocation)&lt;br /&gt;
*status (was merged with costatus)&lt;br /&gt;
&lt;br /&gt;
===The Equity Variables===&lt;br /&gt;
&lt;br /&gt;
[[Maxine Tao]] and I have added six new variables to the '''Accelerators Final''' sheet. Those variables are:&lt;br /&gt;
&lt;br /&gt;
*Terms of joining - terms of joining accelerator and important details about program&lt;br /&gt;
*Equity? (1/0) - cells contain a 1 if the accelerator takes equity, a 0 if the accelerator definitively does not, and are blank if we could not find that information&lt;br /&gt;
*Equity Amount - the % of equity the accelerator will take (can sometimes be a range, e.g. 5-7%)&lt;br /&gt;
*Equity Amount Normalized - this copies the previous column but only keeps %&amp;gt;0, and if a range was given (e.g. 5-7%) it returns the average (e.g. 6%)&lt;br /&gt;
*Investment Amount - the $ the accelerator invests in a company to begin, if relevant (could also be a range or an &amp;quot;up to $######&amp;quot;)&lt;br /&gt;
*Investment Notes - anything to comment on previous 4 columns&lt;br /&gt;
&lt;br /&gt;
These six variables tell us more about the characteristics of accelerators; specifically, which ones take equity and which ones do not, and how much equity accelerators take.&lt;br /&gt;
&lt;br /&gt;
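The rule for Equity Amount Normalized can be sketched as below, assuming the raw cells are strings such as 6%, 5-7%, or blank. The function name and the handling of text without numbers are assumptions for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
import re


def normalize_equity(cell):
    """Return equity as a percent: the value itself, or the midpoint of a range.

    Blank cells, text without numbers, and values that are not above zero give None.
    """
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", str(cell))]
    if not numbers:
        return None
    bounds = numbers[:2]  # one number, or the two ends of a range
    value = sum(bounds) / len(bounds)
    return value if value > 0 else None
```
&lt;br /&gt;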
Relevant information: &lt;br /&gt;
*82 accelerators take equity, 42 do not, and we lack information for 37.&lt;br /&gt;
*The average % of equity among accelerators who take equity is 6.49% (rough estimate; do not use for anything official). We got this number by looking only at accelerators who take equity, averaging the equity amount for accelerators who report a range (e.g. 4%-10% equity would be coded as 7% equity), and taking the mean.&lt;br /&gt;
&lt;br /&gt;
===Matching Accelerators to UUIDs via Crunchbase===&lt;br /&gt;
&lt;br /&gt;
We've also added UUIDs for 163 of our 166 accelerators. The UUIDs can be found in Column AE of the '''Accelerators Final''' sheet.&lt;br /&gt;
&lt;br /&gt;
The file with accelerators matched to Crunchbase UUIDs can be found at:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Accelerators and UUIDs.xlsx&lt;br /&gt;
&lt;br /&gt;
This is the master file and should never be modified unless we find a UUID changed. ALL OTHER SHEETS with UUIDs are linked to this sheet so its changes will be reflected elsewhere.&lt;br /&gt;
&lt;br /&gt;
More information can be found on the [[Crunchbase Data]] page. &lt;br /&gt;
&lt;br /&gt;
===Linking Accelerators to Founders/LinkedIn Crawling===&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] got the [[LinkedIn Crawler (Python)]] to work, which means we currently have the following information about accelerator founders:&lt;br /&gt;
*Current Job Title&lt;br /&gt;
*Location&lt;br /&gt;
*Employer&lt;br /&gt;
*Job(s) Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location of jobs&lt;br /&gt;
*Extra Description&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
This information can be found in the various '''Founders''' sheets in '''The File to Rule Them All'''.&lt;br /&gt;
&lt;br /&gt;
===Finding Company URLs===&lt;br /&gt;
Excel master datasets are in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018&lt;br /&gt;
&lt;br /&gt;
Code and files specific to this URL finder are in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\url finder&lt;br /&gt;
&lt;br /&gt;
====Results====&lt;br /&gt;
I used STEP1_crawl.py and STEP2_findcorrecturl.py to add approximately 1000 more URLs into 'The File to Rule Them All.xlsx'.&lt;br /&gt;
&lt;br /&gt;
====Testing====&lt;br /&gt;
&lt;br /&gt;
In this file (sheet: 'Most Recent Merged Data'; note that this is just a copy of 'Cohorts Final' in 'The File to Rule Them All'):&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\Merged W Crunchbase Data as of July 17.xlx&lt;br /&gt;
&lt;br /&gt;
We filter for companies (~4000) that did not receive VC, are not in Crunchbase, and do not have URLs.&lt;br /&gt;
Using a Google crawler (STEP1_crawl.py) and a URL matching script (STEP2_findcorrecturl.py), we will try to find as many URLs as possible.&lt;br /&gt;
&lt;br /&gt;
To test, I ran about 40 companies from &amp;quot;smallcompanylist.txt&amp;quot;, using only the company name as a search term and taking the first 4 valid results (see the DONT_COLLECT list in the code). The Google crawler and URL matcher were able to correctly identify around 20 URLs. They also misidentify some URLs that look really similar to the company name, but they are accurate for the most part if the name is not too generic. I then ran the 20 unfound company names through the crawler again, this time using company name + &amp;quot;startup&amp;quot; as the search term. This did not identify any more correct URLs.&lt;br /&gt;
&lt;br /&gt;
It seems reasonable to assume that if the company URL cannot be found within the first 4 valid search results, then that company probably does not have a URL at all. This is the case for many of the 20 unfound URLs from my test run above.&lt;br /&gt;
&lt;br /&gt;
====Actual Run Info====&lt;br /&gt;
&lt;br /&gt;
The companies we needed to find URLs for are in a file called 'ACTUALNEEDEDCOMPANIES.txt'.&lt;br /&gt;
&lt;br /&gt;
The first four results for every company, as found by STEP1_crawl.py, are in 'ACTUAL_crawled_company_urls.txt'.&lt;br /&gt;
&lt;br /&gt;
The results after the matching done by STEP2_findcorrecturl.py, are in 'ACTUAL_finalurls.txt'.&lt;br /&gt;
&lt;br /&gt;
Note that in the end, I decided to only take URLs that were given a match score of greater than 0.9 by setting this restriction in STEP2_findcorrecturl.py. Then I manually removed any duplicates/inaccurate results. If you want, you can set the threshold lower in STEP2 and use STEP3_clean.py to find the URL with the highest score for each company. &lt;br /&gt;
&lt;br /&gt;
The point of this URL finder is to find timing information for companies. Timing information can be found on Whois. See the page http://mcnair.bakerinstitute.org/wiki/Whois_Parser#Summer_2018_Work for information on running the whois parser. &lt;br /&gt;
&lt;br /&gt;
====Using Python files====&lt;br /&gt;
'''To use STEP1_crawl.py''': &lt;br /&gt;
 INPUT: a list of company names (or anything) you would like to find websites for by searching on google&lt;br /&gt;
 OUTPUT: a list of company names and the top X number of results from google &lt;br /&gt;
&lt;br /&gt;
1. Change LIST_FILEPATH in line 26 to be the name of the file that contains the list of things you would like to search. &lt;br /&gt;
&lt;br /&gt;
2. Change NUMRESULT to be however many results you would like from Google. &lt;br /&gt;
&lt;br /&gt;
3. Adjust DONT_COLLECT to include any websites that you don't want. &lt;br /&gt;
&lt;br /&gt;
4. If you would like to add another search keyword, add it on line 87, which reads queries.append(name + &amp;quot;whatever you want here&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
5. Change line 127 to be the name of your output file.&lt;br /&gt;
&lt;br /&gt;
'''To use STEP2_findcorrecturl.py''':&lt;br /&gt;
 INPUT: output file from STEP1&lt;br /&gt;
 OUTPUT: a file formatted the same as the output of STEP1, but URLs that do not match over the threshold value you set will be replaced with &amp;quot;no match&amp;quot; &lt;br /&gt;
&lt;br /&gt;
1. Change file f to be the output file name from STEP1. Change g to be the desired name of the output file for this part. &lt;br /&gt;
&lt;br /&gt;
2. In the if statement on line 44, set your desired threshold. Anything greater than 0.6 is generally considered a decent match; 0.75 is safer, and my choice of 0.9 ensures almost exact matches. However, if your list of companies (or whatever you are searching for) includes really common names, even a 90%+ match might not be the site you're looking for. &lt;br /&gt;
&lt;br /&gt;
'''To use STEP3_clean.py''':&lt;br /&gt;
&lt;br /&gt;
Note this is an optional step to use depending on the accuracy level you need and what kind of data you crawled earlier. I chose not to use this and instead set a more restrictive threshold in STEP2. &lt;br /&gt;
&lt;br /&gt;
1. Change file f to be the output file from STEP2 (you should delete anything that says &amp;quot;no match&amp;quot;, and when you use STEP2, you must also write the ratio score to the text file). Change g to be the desired name of the output file for this part. &lt;br /&gt;
&lt;br /&gt;
Your output should be a text file containing the company name and the URL that had the highest assigned score in STEP2. In case of more than 1 URL with the highest score, the script should take the first one.&lt;br /&gt;
&lt;br /&gt;
==An Overview==&lt;br /&gt;
This project will be used to determine which accelerators are the most effective at churning out successful startups, as well as what characteristics are exhibited by these accelerators. First, we need to gather as much data as we can about as many accelerators as we can in order to look at factors that differentiate successful vs. unsuccessful ventures. Next, we need to create a web crawling program which will gather information about accelerators across the world by accessing their websites and extracting information. I believe that our overall goal with this research project is to gain insight into the methods of successful accelerators, as well as to find out what exactly differentiates very successful accelerators from dead accelerators.&lt;br /&gt;
&lt;br /&gt;
Helpful Links: http://seedrankings.com/&lt;br /&gt;
&lt;br /&gt;
==Remaining To Dos==&lt;br /&gt;
&lt;br /&gt;
The last update on [[Accelerator Seed List (Data)]] said the following needed to be done: &lt;br /&gt;
&lt;br /&gt;
*Cross-reference sheet with data from Peter's old accelerator consolidation file (&amp;quot;accelerator_data_noflag&amp;quot; and &amp;quot;accelerator_data&amp;quot; in &amp;quot;All Relevant Files&amp;quot;) and fill in missing data&lt;br /&gt;
*Variables that are 100% NOT in these 2 files:&lt;br /&gt;
**Cohort Breakout?&lt;br /&gt;
**Subtype&lt;br /&gt;
**Designed for Students?&lt;br /&gt;
**Campuses&lt;br /&gt;
**Stage&lt;br /&gt;
**Software Tech&lt;br /&gt;
**What stage do they look for?&lt;br /&gt;
&lt;br /&gt;
==(Outdated) Necessary Steps==&lt;br /&gt;
&lt;br /&gt;
'''Acquiring the necessary data to complete the Accelerator Master Variable List and the Cohort List will require the following (not necessarily in this order):'''&lt;br /&gt;
&lt;br /&gt;
===Step Zero: Connect to Crunchbase and Link Data===&lt;br /&gt;
&lt;br /&gt;
Complete - more info: [[Crunchbase Data]]&lt;br /&gt;
&lt;br /&gt;
===Step One: LinkedIn Founders Data===&lt;br /&gt;
&lt;br /&gt;
This project will begin by working with [[Grace Tan]] and [[Maxine Tao]] to connect accelerators to their founders and cohort companies using Crunchbase and LinkedIn crawlers. Grace and Maxine will go through [http://crunchbase.com Crunchbase] and find the UUID for companies and their founders (reference [[Crunchbase Data]], [[Crunchbase Accelerator Founders]], [[Crunchbase Accelerator Equity]]). Connect them using SQL and feed the names of founders into our LinkedIn crawler (headed by [[Grace Tan]]).&lt;br /&gt;
&lt;br /&gt;
The list of founders for accelerators can be found at&lt;br /&gt;
 McNair/Projects/Accelerators/Fall 2017/founders_linkedin.txt&lt;br /&gt;
&lt;br /&gt;
The '''Unfound Founders''' file codes a 0 for all companies '''''not listed''''' within the LinkedIn Founders file, and a 1 for those that do have founders listed.&lt;br /&gt;
&lt;br /&gt;
'''Given the founders' names, we will then be able to use the [[LinkedIn Crawler (Python)]] to find the relevant details of an accelerator founder (education, work experience, etc.) This data on founders will help us solve the horse, jockey, racetrack question to detect what variables affect a startup's success (the accelerator, the founders, the environment/city).'''&lt;br /&gt;
&lt;br /&gt;
===Step Two: Linking Accelerators to Cohorts Using Investments on Crunchbase===&lt;br /&gt;
&lt;br /&gt;
In this step we focus on accelerators that take equity from the companies in their programs. We do this to avoid looking at accelerators that also run funds or invest in various companies but do not take equity. Including them would give misleading results, leading us to believe some companies are in cohorts at accelerators when they are not. &lt;br /&gt;
&lt;br /&gt;
Maxine will acquire the list of accelerators who take equity from companies from the following sheet:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/All Relevant Files/accelerator_data_noflag.txt&lt;br /&gt;
&lt;br /&gt;
Looking at the file, however, shows that very few are actually categorized well and the equity variable is messy. Moving forward, we need to check/refine/fix this classification.&lt;br /&gt;
&lt;br /&gt;
This file has 266 rows. The most recent, actual version of our accelerator database (found at E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx under the sheet Master Variable List) only has 167 rows, meaning the accelerator_data_noflag.txt file has too many rows. &lt;br /&gt;
&lt;br /&gt;
We will need to do a left join of Accelerator Master List with accelerator_data_noflag.txt to get rid of the accelerator names that are in accelerator_data_noflag but NOT in Accelerator Master List.&lt;br /&gt;
&lt;br /&gt;
Once this is finished, we should have an “Equity” classification variable for every accelerator in Accelerator Master List. The accelerators that have a Y (or maybe it’s a 1) are the ones that take equity. These are the accelerators we’ll be able to do the Crunchbase work on to see when they take equity.&lt;br /&gt;
&lt;br /&gt;
We then look at the accelerators' investments (or at companies and the entities that invested in them) and cross-reference the list of companies/accelerators. Once we find a match, we know that a company went through an accelerator and in which year it went through a cohort.&lt;br /&gt;
&lt;br /&gt;
'''From this, we get the following data:'''&lt;br /&gt;
*Accelerator a given company went through&lt;br /&gt;
*Year said company went through a cohort/Specific cohort company went through&lt;br /&gt;
&lt;br /&gt;
===Step Three: Demo Day Crawler===&lt;br /&gt;
&lt;br /&gt;
This part of the project relies on the contributions of the wonderful [[Minh Le]]. Better documentation for the project can be found on the [[Demo Day Page Parser]], [[Demo Day Page Google Classifier]], and [[Accelerator Demo Day]] project pages.&lt;br /&gt;
&lt;br /&gt;
Essentially, this part of the accelerator data project will use the [[Demo Day Page Parser]] to look through accelerator websites for pages which list a cohort's 'Demo Day', or the day on which accelerators present their companies to a group of special investors ([https://www.ycombinator.com/demoday/faq/ here's an example FAQ page from Y Combinator]). The [[Demo Day Page Google Classifier]] will then determine if the page is, in fact, a demo day page. &lt;br /&gt;
&lt;br /&gt;
'''Given a cohort's demo day, we can gather a few pieces of key information (check with Ed to make sure this is the correct information to gather from Demo Days):'''&lt;br /&gt;
*The date a cohort began/the season the cohort went through the accelerator&lt;br /&gt;
**This is acquired by looking at the cohort's demo day date, and subtracting the number of weeks/months of a cohort for that given accelerator. The length of a cohort can be found in the file:&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx&lt;br /&gt;
*I assume we can also acquire the companies in a specific cohort, as in we will have the list of all companies in the (for example) Spring 2018 Cohort of Techstars.&lt;br /&gt;
&lt;br /&gt;
===Step Four: Non-profit Finder===&lt;br /&gt;
&lt;br /&gt;
More at [[Non-profit Finder]]&lt;br /&gt;
&lt;br /&gt;
Another important step in this project is finding which accelerators are non-profits.&lt;br /&gt;
&lt;br /&gt;
A comprehensive list of nonprofits taken from the [https://www.irs.gov/charities-non-profits/tax-exempt-organization-search-bulk-data-downloads IRS] can be found here:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Nonprofits in US.xlsx&lt;br /&gt;
Warning: this file has 1 million rows&lt;br /&gt;
&lt;br /&gt;
This file should be cross-referenced with the list of accelerators to find which ones are listed--those accelerators are non-profits. We'll need someone good at SQL to do this with a match.&lt;br /&gt;
&lt;br /&gt;
Potential problem:&lt;br /&gt;
*The names of listed non-profits will likely be somewhat different than the names of accelerators, because companies often file for tax exemption with different names than they show the public.&lt;br /&gt;
&lt;br /&gt;
==Workflow Image==&lt;br /&gt;
&lt;br /&gt;
I imagine the project will look something like the image below, which shows the information required to complete it. (The color coding is a very rough, preliminary estimate and subject to change.)&lt;br /&gt;
&lt;br /&gt;
[[File:image1.png]]&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23852</id>
		<title>U.S. Seed Accelerators</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23852"/>
		<updated>2018-07-31T15:29:16Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: /* Update for Hira */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=U.S. Seed Accelerators&lt;br /&gt;
|Has owner=Connor Rothschild,&lt;br /&gt;
|Has start date=06/18/2018&lt;br /&gt;
|Has keywords=accelerators, data&lt;br /&gt;
|Has notes=Continuation of [[Accelerator Data]]&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Is dependent on=Industry Classifier, Demo Day Page Parser, Accelerator Demo Day, Demo Day Page Google Classifier, Crunchbase Accelerator Equity, Crunchbase Accelerator Founders, Crunchbase Data,&lt;br /&gt;
|Has image=Seed-Accelerator.jpg&lt;br /&gt;
|Does subsume=Accelerator Data, Accelerator Seed List (Data),&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Project Location==&lt;br /&gt;
The master file can be found at &lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
==Relevant Former Projects==&lt;br /&gt;
This page serves as an updated and tidied version of the data and work presented on the [[Accelerator Seed List (Data)]] Project, which subsumed [[Accelerator Data]].&lt;br /&gt;
Both of these projects (and as a corollary, this project) are dependent on the [[Demo Day Page Parser]], [[Industry Classifier]], and the [[Whois Parser]].&lt;br /&gt;
&lt;br /&gt;
==Update for Hira==&lt;br /&gt;
&lt;br /&gt;
===Amazon Mechanical Turk Pricing===&lt;br /&gt;
Information can be found [http://mcnair.bakerinstitute.org/wiki/Accelerator_Demo_Day#Pricing here].&lt;br /&gt;
&lt;br /&gt;
===Recoded Founders Education===&lt;br /&gt;
&lt;br /&gt;
I have recoded two components of the founders' education sheet:&lt;br /&gt;
&lt;br /&gt;
1) Degree name has been reclassified into nine categories:&lt;br /&gt;
*High School&lt;br /&gt;
*Associates&lt;br /&gt;
*Bachelors&lt;br /&gt;
*Masters&lt;br /&gt;
*Certificate&lt;br /&gt;
*JD&lt;br /&gt;
*MBA&lt;br /&gt;
*PhD&lt;br /&gt;
*Other&lt;br /&gt;
&lt;br /&gt;
2) Majors have also been recoded into nine categories:&lt;br /&gt;
*H = Humanities&lt;br /&gt;
*SS = Social Sciences&lt;br /&gt;
*NS = Natural Sciences&lt;br /&gt;
*E = Engineering (includes computer science)&lt;br /&gt;
*B = Business and Economics&lt;br /&gt;
*L = Leadership&lt;br /&gt;
*MBA&lt;br /&gt;
*JD&lt;br /&gt;
*O = Other&lt;br /&gt;
&lt;br /&gt;
The Google Sheet I used to reclassify can be found here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1XWtCTeaof8WxAuOCbn3XEFaK0sh-ZIif2JqcdbkH72I/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
In that Google Sheet, '''OLD''' contains old, outdated data, '''WORK''' contains the process of reclassifying, and '''Updated Info''' contains only good, updated data. &lt;br /&gt;
&lt;br /&gt;
The data has been merged into '''The File to Rule Them All'''. Old data has not yet been deleted but can be at any time, since we have it saved in the Google Sheet '''OLD''' (I just wanted the go-ahead from Hira or Ed).&lt;br /&gt;
&lt;br /&gt;
===Multiple campuses and cohorts===&lt;br /&gt;
&lt;br /&gt;
The File to Rule Them All contains an updated address variable. 139 of our 166 accelerators have addresses that are available online. The ones we could not find information for are left blank.&lt;br /&gt;
&lt;br /&gt;
The collaborative sheet that Hira, Maxine, and I worked on to update this list can be accessed here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1nktgJZfm3L8IsSCHgYbPasSdvKb7QHKxZp8K5Si2iNo/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
We have also added a new cohort list. Under The File To Rule Them All, the sheet '''Multiple Campuses''' contains the different locations for accelerators with multiple campuses. Column B can be used to filter out multiple location accelerators.&lt;br /&gt;
&lt;br /&gt;
===Fixed Google Sheet===&lt;br /&gt;
&lt;br /&gt;
We created a new sheet with only data we want to keep, and cleaned it up. That sheet is called &amp;quot;Good Data Only&amp;quot;, at the same link:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
I first used our &amp;quot;recap&amp;quot; and &amp;quot;announced&amp;quot; classification to standardize and fix the dates.&lt;br /&gt;
&lt;br /&gt;
*Columns N-R contain our new data. Please note that all of these columns are based on formulas and will be made erroneous if edited.&lt;br /&gt;
*Column N is the # of weeks for an accelerator program, gathered via VLookup from The File to Rule Them All.&lt;br /&gt;
*Column O is the Actual Date we want to record, and was gathered by subtracting the # of weeks from a date '''if''' the listed page was a '''recap'''.&lt;br /&gt;
*Columns P and Q are the Month and Years stripped from Column O.&lt;br /&gt;
*Finally, Column R is the season variable, as Ed said it should be coded. &lt;br /&gt;
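&lt;br /&gt;
The date logic behind Columns N-Q can be sketched in Python. This is an illustration rather than the spreadsheet formulas themselves: the function name and the sample date are hypothetical, and it assumes that a page that is not a recap already lists the date we want to record:&lt;br /&gt;
&lt;br /&gt;
```python
from datetime import date, timedelta


def cohort_start(listed_date, program_weeks, page_type):
    """Return the date to record for one row of the sheet."""
    if page_type == "recap":
        # A recap page is dated around demo day, so step back by the
        # length of the program to estimate when the cohort began.
        return listed_date - timedelta(weeks=program_weeks)
    # Assumption: any other page type already reports the date we want.
    return listed_date


# Hypothetical example: a 12-week program with a recap dated 15 June 2018.
start = cohort_start(date(2018, 6, 15), 12, "recap")
month, year = start.month, start.year  # the values Columns P and Q would hold
```
&lt;br /&gt;
The season variable in Column R is not reproduced here because its coding rule is not stated on this page.&lt;br /&gt;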
&lt;br /&gt;
We have also gone through and removed all bad data, all duplicates, and all rows without timing info. This is the most complete list possible.&lt;br /&gt;
&lt;br /&gt;
===Recode employee count===&lt;br /&gt;
&lt;br /&gt;
I have added a new column in Cohorts Final (of the File to Rule Them All) but left the old column in case you would prefer to edit/classify differently.&lt;br /&gt;
Column AB (emp_count_scale) contains a variable coded on a scale of 1 to 9, with each number corresponding to one of the employee_count classifications (1 the lowest, 9 the highest). The exact output can be modified (1 could instead be tiny, 2 be small... 9 be huge).&lt;br /&gt;
&lt;br /&gt;
The employee count column is standardized and can easily be edited given some modification of the Excel formula.&lt;br /&gt;
&lt;br /&gt;
===Normalized investment amount===&lt;br /&gt;
&lt;br /&gt;
I've been trying to fix the investment amount, but I think it's smart that we discuss this before I move forward. I've done some tentative standardization (taking the average of a range if one is given), but so many accelerators &amp;quot;take up to __%&amp;quot; equity and &amp;quot;invest up to $___&amp;quot; that I think it's smartest to think hard about how to standardize first. Also notable is that some accelerators say they provide &amp;quot;$____ up front and another $___ in follow up funding for each stage.&amp;quot; How do we deal with these? Message me if you'd like to talk more about this.&lt;br /&gt;
&lt;br /&gt;
I refrained from creating max and min investment columns to avoid spending time on the data before we discuss it.&lt;br /&gt;
&lt;br /&gt;
===Remaining to do===&lt;br /&gt;
&lt;br /&gt;
===Recoding Founders Experience===&lt;br /&gt;
&lt;br /&gt;
I've begun recoding the Founders variable. I can categorize 232 of the 824 listed job titles (there are 466 distinct job titles) into 9 categories (the most frequent results as given in a PivotTable found in the sheet).&lt;br /&gt;
&lt;br /&gt;
We will need to talk about how to categorize the most extraneous job titles.&lt;br /&gt;
&lt;br /&gt;
==Recent Work==&lt;br /&gt;
&lt;br /&gt;
Here's a project update on the work that has been done since coming to McNair. The most recent file is&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
===Merging Cohort Companies with Crunchbase Info===&lt;br /&gt;
&lt;br /&gt;
More information on this part of the project can be found on the page [[Merging Existing Data with Crunchbase]].&lt;br /&gt;
&lt;br /&gt;
The newest updated sheet of cohort company info is under the '''Cohorts Final''' sheet of '''The File to Rule Them All.xlsx'''.&lt;br /&gt;
&lt;br /&gt;
Working with [[Maxine Tao]], we have matched companies to their respective pages and information found in Crunchbase (via UUID). We ensured single matches by doing a 1-1-1-1 match with our data and with Crunchbase (using the Matcher). We then received additional information on 8092 companies. The following new information (on top of what we already had) is included in the sheet:&lt;br /&gt;
*short_description&lt;br /&gt;
*long_description&lt;br /&gt;
*category_list (details the company's category)&lt;br /&gt;
*category_group_list (a less refined, more all-encompassing category classification)&lt;br /&gt;
*founded_on date&lt;br /&gt;
*employee_count&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*address&lt;br /&gt;
&lt;br /&gt;
And the following information was also pulled from Crunchbase and '''merged''' with our existing data: &lt;br /&gt;
*URL (was merged with courl cells) &lt;br /&gt;
*city (was merged with colocation)&lt;br /&gt;
*state_code (was merged with colocation)&lt;br /&gt;
*country_code (was merged with colocation)&lt;br /&gt;
*status (was merged with costatus)&lt;br /&gt;
&lt;br /&gt;
===The Equity Variables===&lt;br /&gt;
&lt;br /&gt;
[[Maxine Tao]] and I have added six new variables to the '''Accelerators Final''' sheet. Those variables are:&lt;br /&gt;
&lt;br /&gt;
*Terms of joining - terms of joining accelerator and important details about program&lt;br /&gt;
*Equity? (1/0) - cells contain a 1 if the accelerator takes equity, a 0 if the accelerator definitively does not, and are blank if we could not find that information&lt;br /&gt;
*Equity Amount - the % of equity the accelerator will take (can sometimes be a range (eg. 5-7%))&lt;br /&gt;
*Equity Amount Normalized - this copies the previous column but only keeps %&amp;gt;0, and if a range was given (e.g. 5-7%) it returns the average (e.g. 6%)&lt;br /&gt;
*Investment Amount - the $ the accelerator invests in a company to begin, if relevant (also could be a range or a &amp;quot;up to $######&amp;quot;)&lt;br /&gt;
*Investment Notes - anything to comment on previous 4 columns&lt;br /&gt;
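&lt;br /&gt;
The rule for Equity Amount Normalized can be sketched in Python. This is an illustration of the rule as described above, not the formula used in the sheet; the function name is hypothetical and it assumes each cell is a short text value such as 6% or 5-7%:&lt;br /&gt;
&lt;br /&gt;
```python
import re


def normalize_equity(raw):
    """Return one equity percentage, or None when nothing positive remains."""
    # Pull out every number in the cell, e.g. "5-7%" gives [5.0, 7.0].
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", raw)]
    if not numbers:
        return None
    # A single value is kept as-is; a two-number range becomes its average.
    value = sum(numbers) / len(numbers)
    # Only percentages above zero are kept.
    return value if value > 0 else None
```
&lt;br /&gt;
For example, normalize_equity applied to 5-7% returns 6.0, and applied to 0% it returns None.&lt;br /&gt;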
&lt;br /&gt;
These six variables tell us more about the characteristics of accelerators; specifically, which ones take equity and which ones do not, and how much equity accelerators take.&lt;br /&gt;
&lt;br /&gt;
Relevant information: &lt;br /&gt;
*82 accelerators take equity, 42 do not, and we lack information for 37.&lt;br /&gt;
*The average % of equity among accelerators that take equity (rough estimate--do not use for anything official) is 6.49%. We got this number by looking only at accelerators that take equity, coding accelerators that report a range at the average of that range (e.g. 4%-10% equity would be coded as 7% equity), and taking the mean.&lt;br /&gt;
&lt;br /&gt;
===Matching Accelerators to UUIDs via Crunchbase===&lt;br /&gt;
&lt;br /&gt;
We've also added UUIDs for 163 of our 166 accelerators. The UUIDs can be found in Column AE of the '''Accelerators Final''' sheet.&lt;br /&gt;
&lt;br /&gt;
The file with accelerators matched to Crunchbase UUIDs can be found at:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Accelerators and UUIDs.xlsx&lt;br /&gt;
&lt;br /&gt;
This is the master file and should never be modified unless we find a UUID changed. ALL OTHER SHEETS with UUIDs are linked to this sheet so its changes will be reflected elsewhere.&lt;br /&gt;
&lt;br /&gt;
More information can be found on the [[Crunchbase Data]] page. &lt;br /&gt;
&lt;br /&gt;
===Linking Accelerators to Founders/LinkedIn Crawling===&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] got the [[LinkedIn Crawler (Python)]] to work, which means we currently have the following information about accelerator founders:&lt;br /&gt;
*Current Job Title&lt;br /&gt;
*Location&lt;br /&gt;
*Employer&lt;br /&gt;
*Job(s) Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location of jobs&lt;br /&gt;
*Extra Description&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
This information can be found in the various '''Founders''' sheets in '''The File to Rule Them All'''.&lt;br /&gt;
&lt;br /&gt;
===Finding Company URLs===&lt;br /&gt;
Excel master datasets are in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018&lt;br /&gt;
&lt;br /&gt;
Code and files specific to this URL finder are in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\url finder&lt;br /&gt;
&lt;br /&gt;
====Results====&lt;br /&gt;
I used STEP1_crawl.py and STEP2_findcorrecturl.py to add approximately 1000 more URLs into 'The File to Rule Them All.xlsx'.&lt;br /&gt;
&lt;br /&gt;
====Testing====&lt;br /&gt;
&lt;br /&gt;
In this file (sheet: 'Most Recent Merged Data'; note that this is just a copy of 'Cohorts Final' in 'The File to Rule Them All'):&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\Merged W Crunchbase Data as of July 17.xlx&lt;br /&gt;
&lt;br /&gt;
We filter for companies (~4000) that did not receive VC, are not in Crunchbase, and do not have URLs.&lt;br /&gt;
Using a Google crawler (STEP1_crawl.py) and a URL matching script (STEP2_findcorrecturl.py), we will try to find as many URLs as possible.&lt;br /&gt;
&lt;br /&gt;
To test, I ran about 40 companies from &amp;quot;smallcompanylist.txt&amp;quot;, using only the company name as a search term and taking the first 4 valid results (see the don't collect list in the code). The Google crawler and URL matcher were able to correctly identify around 20 URLs. They also misidentify some URLs that look really similar to the company name, but they are accurate for the most part if the name is not too generic. I then ran the 20 unfound company names through the crawler again, this time using company name + startup as the search term. This did not identify any more correct URLs.&lt;br /&gt;
&lt;br /&gt;
It seems reasonable to assume that if the company URL cannot be found within the first 4 valid search results, then that company probably does not have a URL at all. This is the case for many of the 20 unfound URLs from my test run above.&lt;br /&gt;
&lt;br /&gt;
====Actual Run Info====&lt;br /&gt;
&lt;br /&gt;
The companies we needed to find URLs for are in a file called 'ACTUALNEEDEDCOMPANIES.txt'.&lt;br /&gt;
&lt;br /&gt;
The first four results for every company, as found by STEP1_crawl.py, are in 'ACTUAL_crawled_company_urls.txt'.&lt;br /&gt;
&lt;br /&gt;
The results after the matching done by STEP2_findcorrecturl.py, are in 'ACTUAL_finalurls.txt'.&lt;br /&gt;
&lt;br /&gt;
Note that in the end, I decided to only take URLs that were given a match score of greater than 0.9 by setting this restriction in STEP2_findcorrecturl.py. Then I manually removed any duplicates/inaccurate results. If you want, you can set the threshold lower in STEP2 and use STEP3_clean.py to find the URL with the highest score for each company. &lt;br /&gt;
&lt;br /&gt;
The point of this URL finder is to find timing information for companies. Timing information can be found on Whois. See the page http://mcnair.bakerinstitute.org/wiki/Whois_Parser#Summer_2018_Work for information on running the whois parser. &lt;br /&gt;
&lt;br /&gt;
====Using Python files====&lt;br /&gt;
'''To use STEP1_crawl.py''': &lt;br /&gt;
 INPUT: a list of company names (or anything) you would like to find websites for by searching on google&lt;br /&gt;
 OUTPUT: a list of company names and the top X number of results from google &lt;br /&gt;
&lt;br /&gt;
1. Change LIST_FILEPATH in line 26 to be the name of the file that contains the list of things you would like to search. &lt;br /&gt;
&lt;br /&gt;
2. Change NUMRESULT to be however many results you would like from Google. &lt;br /&gt;
&lt;br /&gt;
3. Adjust DONT_COLLECT to include any websites that you don't want. &lt;br /&gt;
&lt;br /&gt;
4. If you would like to add another search keyword, add this in line 87 which is queries.append(name + &amp;quot;whatever you want here&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
5. Change line 127 to be the name of your output file.&lt;br /&gt;
&lt;br /&gt;
'''To use STEP2_findcorrecturl.py''':&lt;br /&gt;
 INPUT: output file from STEP1&lt;br /&gt;
 OUTPUT: a file formatted the same as the output of STEP1, but URLs that do not match over the threshold value you set will be replaced with &amp;quot;no match&amp;quot; &lt;br /&gt;
&lt;br /&gt;
1. Change file f to be the output file name from STEP1. Change g to be the desired name of the output file for this part. &lt;br /&gt;
&lt;br /&gt;
2. In the if statement on line 44, set your desired threshold. Note that anything greater than 0.6 is generally considered a decent match. It might be safer to use 0.75; my use of 0.9 ensures almost exact matches. However, you should consider that if your list of companies (or whatever you are searching) includes really common names, then a 90%+ match might not be exactly what you're looking for. &lt;br /&gt;
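&lt;br /&gt;
The effect of the threshold can be illustrated with a short sketch. It assumes the match score is a string-similarity ratio between the company name and the URL's domain, computed here with Python's difflib; STEP2_findcorrecturl.py may compute its score differently, and the helper name and sample values are hypothetical:&lt;br /&gt;
&lt;br /&gt;
```python
from difflib import SequenceMatcher
from urllib.parse import urlparse


def match_or_no_match(company, url, threshold=0.9):
    """Return the URL if its domain resembles the company name, else 'no match'."""
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    domain = host.split(".")[0]  # e.g. "examplewidgets" from "examplewidgets.com"
    # Compare against the company name with spaces and punctuation removed.
    name = "".join(ch for ch in company.lower() if ch.isalnum())
    score = SequenceMatcher(None, name, domain).ratio()
    return url if score >= threshold else "no match"


kept = match_or_no_match("Example Widgets", "https://www.examplewidgets.com/")
dropped = match_or_no_match("Example Widgets", "https://www.linkedin.com/company/x")
```
&lt;br /&gt;
Lowering the threshold argument lets looser matches through, which is the trade-off described in step 2.&lt;br /&gt;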
&lt;br /&gt;
'''To use STEP3_clean.py''':&lt;br /&gt;
&lt;br /&gt;
Note this is an optional step to use depending on the accuracy level you need and what kind of data you crawled earlier. I chose not to use this and instead set a more restrictive threshold in STEP2. &lt;br /&gt;
&lt;br /&gt;
1. Change file f to be the output file from STEP2 (you should delete anything that says &amp;quot;no match&amp;quot;, and when you use STEP2, you must also write the ratio score to the text file). Change g to be the desired name of the output file for this part. &lt;br /&gt;
&lt;br /&gt;
Your output should be a text file containing the company name and the URL that had the highest assigned score in STEP2. In case of more than 1 URL with the highest score, the script should take the first one.&lt;br /&gt;
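&lt;br /&gt;
That selection rule can be sketched as follows. This is an illustration, not the contents of STEP3_clean.py, and it assumes the scored results have already been read into (company, URL, score) tuples; the sample rows are hypothetical:&lt;br /&gt;
&lt;br /&gt;
```python
def best_urls(rows):
    """Map each company to its highest-scoring URL, keeping the first on ties."""
    best = {}
    for company, url, score in rows:
        # Replace the stored URL only when the new score is strictly higher,
        # so the earliest URL wins a tie.
        if company not in best or score > best[company][1]:
            best[company] = (url, score)
    return {company: url for company, (url, score) in best.items()}


# Hypothetical sample rows, for illustration only.
rows = [
    ("Example Co", "http://example.com", 0.8),
    ("Example Co", "http://example.org", 0.95),
    ("Example Co", "http://example.net", 0.95),
    ("Sample Inc", "http://sample.example", 0.7),
]
chosen = best_urls(rows)
```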
&lt;br /&gt;
==An Overview==&lt;br /&gt;
This project will be used to determine which accelerators are the most effective at churning out successful startups, as well as what characteristics are exhibited by these accelerators. First, we need to gather as much data as we can about as many accelerators as we can in order to look at factors that differentiate successful vs. unsuccessful ventures. Next, we need to create a web crawling program which will gather information about accelerators across the world by accessing their websites and extracting information. I believe that our overall goal with this research project is to gain insight into the methods of successful accelerators, as well as to find out what exactly differentiates very successful accelerators from dead accelerators.&lt;br /&gt;
&lt;br /&gt;
Helpful Links: http://seedrankings.com/&lt;br /&gt;
&lt;br /&gt;
==Remaining To Dos==&lt;br /&gt;
&lt;br /&gt;
The last update on [[Accelerator Seed List (Data)]] said the following needed to be done: &lt;br /&gt;
&lt;br /&gt;
*Cross-reference sheet with data from Peter's old accelerator consolidation file (&amp;quot;accelerator_data_noflag&amp;quot; and &amp;quot;accelerator_data&amp;quot; in &amp;quot;All Relevant Files&amp;quot;) and fill in missing data&lt;br /&gt;
*Variables that are 100% NOT in these 2 files:&lt;br /&gt;
**Cohort Breakout?&lt;br /&gt;
**Subtype&lt;br /&gt;
**Designed for Students?&lt;br /&gt;
**Campuses&lt;br /&gt;
**Stage&lt;br /&gt;
**Software Tech&lt;br /&gt;
**What stage do they look for?&lt;br /&gt;
&lt;br /&gt;
==(Outdated) Necessary Steps==&lt;br /&gt;
&lt;br /&gt;
'''Acquiring the necessary data to complete the Accelerator Master Variable List and the Cohort List will require the following (not necessarily in this order):'''&lt;br /&gt;
&lt;br /&gt;
===Step Zero: Connect to Crunchbase and Link Data===&lt;br /&gt;
&lt;br /&gt;
Complete - more info: [[Crunchbase Data]]&lt;br /&gt;
&lt;br /&gt;
===Step One: LinkedIn Founders Data===&lt;br /&gt;
&lt;br /&gt;
This project will begin by working with [[Grace Tan]] and [[Maxine Tao]] to connect accelerators to their founders and cohort companies using Crunchbase and LinkedIn crawlers. Grace and Maxine will go through [http://crunchbase.com Crunchbase] and find the UUID for companies and their founders (reference [[Crunchbase Data]], [[Crunchbase Accelerator Founders]], [[Crunchbase Accelerator Equity]]). We will then connect them using SQL and feed the names of founders into our LinkedIn crawler (headed by [[Grace Tan]]).&lt;br /&gt;
&lt;br /&gt;
The list of founders for accelerators can be found at&lt;br /&gt;
 McNair/Projects/Accelerators/Fall 2017/founders_linkedin.txt&lt;br /&gt;
&lt;br /&gt;
The '''Unfound Founders''' file codes a 0 for all companies '''''not listed''''' within the LinkedIn Founders file, and a 1 for those that do have founders listed.&lt;br /&gt;
&lt;br /&gt;
'''Given the founders' names, we will then be able to use the [[LinkedIn Crawler (Python)]] to find the relevant details of an accelerator founder (education, work experience, etc.). This data on founders will help us answer the horse, jockey, racetrack question of which variables affect a startup's success (the accelerator, the founders, the environment/city).'''&lt;br /&gt;
&lt;br /&gt;
===Step Two: Linking Accelerators to Cohorts Using Investments on Crunchbase===&lt;br /&gt;
&lt;br /&gt;
In this step we focus on accelerators that take equity from the companies in their programs. This excludes accelerators that also run funds or invest in various companies but do not take equity. Including them would give misleading results, leading us to believe that some companies were in cohorts at accelerators they never went through. &lt;br /&gt;
&lt;br /&gt;
Maxine will acquire the list of accelerators who take equity from companies from the following sheet:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/All Relevant Files/accelerator_data_noflag.txt&lt;br /&gt;
&lt;br /&gt;
Looking at the file, however, shows that very few are actually categorized well and the equity variable is messy. Moving forward, we need to check/refine/fix this classification.&lt;br /&gt;
&lt;br /&gt;
This file has 266 rows. The most recent, actual version of our accelerator database (found at E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx under the sheet Master Variable List) only has 167 rows, meaning the accelerator_data_noflag.txt file has too many rows. &lt;br /&gt;
&lt;br /&gt;
We will need to do a left join of Accelerator Master List with accelerator_data_noflag.txt to get rid of the accelerator names that are in accelerator_data_noflag but NOT in Accelerator Master List.&lt;br /&gt;
&lt;br /&gt;
Once this is finished, we should have an “Equity” classification variable for every accelerator in Accelerator Master List. The accelerators that have a Y (or maybe it’s a 1) are the ones that take equity. These are the accelerators we’ll be able to use in the Crunchbase work to see when accelerators take equity.&lt;br /&gt;
&lt;br /&gt;
We then look at the accelerators' investments (or companies and the entities that invested in them) and cross-reference the list of companies/accelerators. Once we find a match, we know that a company went through an accelerator and in which year it went through a cohort.&lt;br /&gt;
&lt;br /&gt;
'''From this, we get the following data:'''&lt;br /&gt;
*Accelerator a given company went through&lt;br /&gt;
*Year said company went through a cohort/Specific cohort company went through&lt;br /&gt;
&lt;br /&gt;
===Step Three: Demo Day Crawler===&lt;br /&gt;
&lt;br /&gt;
This part of the project relies on the contributions of the wonderful [[Minh Le]]. Better documentation for the project can be found on the [[Demo Day Page Parser]], [[Demo Day Page Google Classifier]], and [[Accelerator Demo Day]] project pages.&lt;br /&gt;
&lt;br /&gt;
Essentially, this part of the accelerator data project will use the [[Demo Day Page Parser]] to look through accelerator websites for pages which list a cohort's 'Demo Day', or the day on which accelerators present their companies to a group of special investors ([https://www.ycombinator.com/demoday/faq/ here's an example FAQ page from Y Combinator]). The [[Demo Day Page Google Classifier]] will then determine if the page is, in fact, a demo day page. &lt;br /&gt;
&lt;br /&gt;
'''Given a cohort's demo day, we can gather a few pieces of key information (check with Ed to make sure this is the correct information to gather from Demo Days):'''&lt;br /&gt;
*The date a cohort began/the season the cohort went through the accelerator&lt;br /&gt;
**This is acquired by looking at the cohort's demo day date, and subtracting the number of weeks/months of a cohort for that given accelerator. The length of a cohort can be found in the file:&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx&lt;br /&gt;
*I assume we can also acquire the companies in a specific cohort, as in we will have the list of all companies in the (for example) Spring 2018 Cohort of Techstars.&lt;br /&gt;
&lt;br /&gt;
===Step Four: Non-profit Finder===&lt;br /&gt;
&lt;br /&gt;
More at [[Non-profit Finder]]&lt;br /&gt;
&lt;br /&gt;
Another important step in this project is finding which accelerators are non-profits.&lt;br /&gt;
&lt;br /&gt;
A comprehensive list of nonprofits taken from the [https://www.irs.gov/charities-non-profits/tax-exempt-organization-search-bulk-data-downloads IRS] can be found here:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Nonprofits in US.xlsx&lt;br /&gt;
Warning: this file has 1 million rows&lt;br /&gt;
&lt;br /&gt;
This file should be cross-referenced with the list of accelerators to find which ones are listed--those accelerators are non-profits. We'll need someone good at SQL to do this with a match.&lt;br /&gt;
&lt;br /&gt;
Potential problem:&lt;br /&gt;
*The names of listed non-profits will likely be somewhat different than the names of accelerators, because companies often file for tax exemption with different names than they show the public.&lt;br /&gt;
&lt;br /&gt;
==Workflow Image==&lt;br /&gt;
&lt;br /&gt;
I imagine the project will look something like the image below, which shows the information required to complete it. (The color coding is a very rough, preliminary estimate and subject to change.)&lt;br /&gt;
&lt;br /&gt;
[[File:image1.png]]&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Accelerator_Demo_Day&amp;diff=23851</id>
		<title>Accelerator Demo Day</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Accelerator_Demo_Day&amp;diff=23851"/>
		<updated>2018-07-31T15:28:05Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: /* Amazon Mechanical Turk */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Accelerator Demo Day&lt;br /&gt;
|Has owner=Minh Le,&lt;br /&gt;
|Has start date=06/18/2018&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Does subsume=Demo Day Page Parser, Demo Day Page Google Classifier&lt;br /&gt;
}}&lt;br /&gt;
==Project Introduction==&lt;br /&gt;
This project uses Selenium and machine learning to find candidate web pages and classify whether each one is a demo day page containing a list of cohort companies, ultimately to gather good candidates to push to Mechanical Turk. The code is written in Python 3 using Selenium and Tensorflow (Keras). This article also covers the preliminaries of the Mechanical Turk tool and how it can be used to collect data.&lt;br /&gt;
&lt;br /&gt;
==Project Goal==&lt;br /&gt;
The goal of this project is to find good &amp;quot;Demo Day&amp;quot; candidate web pages and to submit these pages to Amazon Mechanical Turk for data collection. A good candidate is defined as a page containing a list of cohort companies associated with an accelerator. From observation, good candidates usually also contain time and location information about the demo day, and are thus sufficient to push to MTurk to collect data.&lt;br /&gt;
&lt;br /&gt;
==Code Location==&lt;br /&gt;
The source code and relevant files for the project can be found here: &lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\&lt;br /&gt;
The current working model using RF is in:&lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\Test Run&lt;br /&gt;
The RNN model is in:&lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\Experiment&lt;br /&gt;
The RNN is still under heavy development. Modifying anything in this folder is not recommended.&lt;br /&gt;
&lt;br /&gt;
All the other folders are used for experiments; please don't touch them. If you want to understand more about the files as a general user, go to the section A Quick Glance through the File in The Directory below. If you are a developer, go to the Advance User Guide section.&lt;br /&gt;
&lt;br /&gt;
==General User Guide: How to Use this Project (Random Forest model)==&lt;br /&gt;
&lt;br /&gt;
First, change your directory to the working folder:&lt;br /&gt;
 cd E:\McNair\Projects\Accelerator Demo Day\Test Run&lt;br /&gt;
Then you need to specify the list of accelerators you want to crawl by modifying the following file:&lt;br /&gt;
 ListOfAccsToCrawl.txt&lt;br /&gt;
The first line must remain fixed as &amp;quot;Accelerator&amp;quot;. Each following line holds one accelerator name. Names are not case-sensitive, but it is preferable to keep the original capitalization where possible.&lt;br /&gt;
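As a minimal sketch of the file format described above, the following reads such a list while checking the fixed header line. The function name read_accelerators is hypothetical and is not taken from the project's code; the sketch assumes one accelerator name per line and UTF-8 encoding.&lt;br /&gt;

```python
from pathlib import Path


def read_accelerators(path):
    """Return the accelerator names listed in a ListOfAccsToCrawl.txt-style file.

    Hypothetical helper: assumes the first line is the fixed header
    'Accelerator' and every later non-blank line is one accelerator name.
    """
    lines = Path(path).read_text(encoding='utf-8').splitlines()
    if not lines or lines[0].strip() != 'Accelerator':
        raise ValueError('first line must be the header "Accelerator"')
    # Keep the original capitalization; only trim surrounding whitespace.
    return [line.strip() for line in lines[1:] if line.strip()]
```

Calling read_accelerators('ListOfAccsToCrawl.txt') from the working folder would return the names in file order.&lt;br /&gt;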
&lt;br /&gt;
All necessary preparations are now complete. Now onto running the code!&lt;br /&gt;
&lt;br /&gt;
Running the project is as simple as executing the code in the correct order. The files are named in the format &amp;quot;STEPX_name&amp;quot;, where X is the order of execution. To be more specific, run the following four commands:&lt;br /&gt;
 ''# Crawl Google to get the demo day pages for the accelerators listed in ListOfAccsToCrawl.txt''&lt;br /&gt;
 python3 STEP1_crawl.py&lt;br /&gt;
 ''# Preprocess the data using a bag-of-words approach: each page is characterized by the frequencies of chosen keywords, which are stored in words.txt. This script creates a file called feature_matrix.txt''&lt;br /&gt;
 python3 STEP2_preprocessing_feature_matrix_generator.py&lt;br /&gt;
 ''# Train the RF model''&lt;br /&gt;
 python3 STEP3_train_rf.py&lt;br /&gt;
 ''# Run the model to classify the crawled HTML pages''&lt;br /&gt;
 python3 STEP4_classify_rf.py&lt;br /&gt;
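The four commands above can also be chained from a single script. The following is a sketch only: run_pipeline and its runner parameter are hypothetical names, and only the script names come from this page. It assumes it is executed from the Test Run folder with a Python 3 interpreter.&lt;br /&gt;

```python
import subprocess
import sys

# Script names are taken from the steps listed above.
STEPS = [
    'STEP1_crawl.py',
    'STEP2_preprocessing_feature_matrix_generator.py',
    'STEP3_train_rf.py',
    'STEP4_classify_rf.py',
]


def run_pipeline(steps=STEPS, runner=subprocess.run):
    """Run each step in order, stopping at the first failure.

    Hypothetical wrapper: `runner` defaults to subprocess.run and is only
    a parameter so the function can be exercised without the real scripts.
    """
    for script in steps:
        # check=True raises CalledProcessError if a step exits non-zero,
        # so later steps never run on incomplete output.
        runner([sys.executable, script], check=True)
```

Calling run_pipeline() from the Test Run folder runs the four steps in order and stops at the first one that fails.&lt;br /&gt;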
&lt;br /&gt;
The results are stored in the CrawledHTMLFull folder, sorted into two subfolders: positive and negative. The positive folder contains the HTML pages that the classifier judged to be good candidates; the negative folder contains the rest. There is also a text file called prediction.txt that lists every prediction. feature.txt is irrelevant for the general user and can be ignored; its sole purpose is analysis and debugging.&lt;br /&gt;
&lt;br /&gt;
NEVER touch the TrainingHTML folder, datareader.py or classifier.txt. These are used internally to train the model.&lt;br /&gt;
&lt;br /&gt;
==A Quick Glance through the File in The Directory==&lt;br /&gt;
All working files are stored in this folder:&lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\Test Run&lt;br /&gt;
The file   &lt;br /&gt;
&lt;br /&gt;
==Amazon Mechanical Turk==&lt;br /&gt;
Login info:&lt;br /&gt;
 username: mcnair@rice.edu&lt;br /&gt;
 password: amount&lt;br /&gt;
&lt;br /&gt;
There's a file in the folder &lt;br /&gt;
 CrawledHTMLFull&lt;br /&gt;
called&lt;br /&gt;
 FinalResultWithURL&lt;br /&gt;
that was manually created by combining the file&lt;br /&gt;
 crawled_demoday_page_list.txt&lt;br /&gt;
in the parent folder and the file&lt;br /&gt;
 predicted.txt&lt;br /&gt;
This file combines the predictions with the actual URLs of the websites.&lt;br /&gt;
&lt;br /&gt;
Since MTurk makes it hard for us to display the downloaded HTML, it is much faster to copy the URL into the question box than to try to display the downloaded HTML.&lt;br /&gt;
&lt;br /&gt;
The advantage of this is that some websites, such as techcrunch.com, behave abnormally when downloaded as HTML, so opening them in the browser works better because the UI is not broken. Moreover, if a website has a paywall or pop-up ads, the user can click out of them. Since paywalls and pop-ups are usually scripts within the HTML, the classifier can't rule such pages out, because the body of the HTML may still contain the information we are looking for. Major paywalled sites and sites that require log-ins, such as LinkedIn, have been blacklisted in the crawler. More detail is in the crawler section below.&lt;br /&gt;
&lt;br /&gt;
However, there is a disadvantage to this: websites are ever-changing, so a URL may stop working or point to different content in the future. Downloaded HTML, on the other hand, needs no internet connection to render, so its content stays static.&lt;br /&gt;
&lt;br /&gt;
To create the MTurk task for this project, follow the tutorial in [[Mechanical Turk (Tool)]]. For testing and development purposes, use https://requestersandbox.mturk.com/&lt;br /&gt;
&lt;br /&gt;
Test account:&lt;br /&gt;
email: mcboatfaceboaty670@gmail.com&lt;br /&gt;
password: sameastheoneforemail2018&lt;br /&gt;
&lt;br /&gt;
For this project, the fields asked of the user are:&lt;br /&gt;
&lt;br /&gt;
*Whether the page had a list of companies going through an accelerator&lt;br /&gt;
*The month and year of the demo day (or article)&lt;br /&gt;
*Accelerator name&lt;br /&gt;
*Companies going through accelerator&lt;br /&gt;
&lt;br /&gt;
Layout:&lt;br /&gt;
&lt;br /&gt;
[[File:Demodayfinal.png]]&lt;br /&gt;
&lt;br /&gt;
===Pricing===&lt;br /&gt;
&lt;br /&gt;
Connor and Minh discussed how to price MTurk HITs so that the pay is neither too generous nor too stingy for workers. Connor could complete four MTurk HITs in 12 minutes (3 min/HIT). He then asked friends who were unfamiliar with MTurk to complete a few surveys, and found they completed around three in 15-20 minutes (5-7 min/HIT). Given this, we think an upper limit of 10 min/HIT is appropriate. In that case, we should price each HIT at $1.50, which gives a worker who takes the full 10 minutes per HIT an appropriate rate of $9.00/hour.&lt;br /&gt;
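The arithmetic behind the $9.00/hour figure can be checked with a short sketch. The function name hourly_rate is hypothetical; the price and timings are the ones quoted above, and the sketch assumes a worker completes HITs back to back with no idle time.&lt;br /&gt;

```python
def hourly_rate(price_per_hit, minutes_per_hit):
    """Effective hourly pay for a worker completing HITs back to back."""
    hits_per_hour = 60 / minutes_per_hit
    return price_per_hit * hits_per_hour


# Figures from the text: $1.50 per HIT at the 10 min/HIT upper limit.
slowest = hourly_rate(1.50, 10)   # 9.0 dollars/hour
# The same price at Connor's observed 3 min/HIT pace pays more.
fastest = hourly_rate(1.50, 3)    # 30.0 dollars/hour
```

Under these assumptions $9.00/hour is the floor: any worker faster than 10 min/HIT earns more at the same price per HIT.&lt;br /&gt;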
&lt;br /&gt;
==Hand Collecting Data==&lt;br /&gt;
&lt;br /&gt;
For the crawl, we only looked for data on accelerators whose companies did not receive venture capital (according to data Ed found via VentureXpert) and lacked timing info. The purpose of this crawl is to find timing info where we cannot find it otherwise; if a company received VC, we can find timing info via that investment. The file we used to find instances that lacked both timing info and VC is:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Merged W Crunchbase Data as of July 17.xlsx&lt;br /&gt;
&lt;br /&gt;
We filtered this sheet in Excel (and checked our work by filtering in SQL) and found 809 companies that lacked timing info and didn't receive VC. From this, we found 74 accelerators which we needed to crawl for.&lt;br /&gt;
&lt;br /&gt;
We used the crawler to search for cohort companies listed for these accelerators.&lt;br /&gt;
&lt;br /&gt;
The initial test run yielded 359 good pages. This data is then processed by hand by fellow interns.&lt;br /&gt;
&lt;br /&gt;
The file for hand-coding is in:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerator Demo Day/Test Run/CrawledDemoDayHTMLFull/'''FinalResultWithURL'''&lt;br /&gt;
&lt;br /&gt;
For the sake of collaboration, the team copied this information to a Google Sheet, accessible here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
We split the process into four parts. Each intern will do the following:&lt;br /&gt;
&lt;br /&gt;
1. Go to the given URL.&lt;br /&gt;
&lt;br /&gt;
2. Record whether the page is good data (column F); this can later be used by [[Minh Le]] to refine/fine-tune training data.&lt;br /&gt;
&lt;br /&gt;
3. Record whether the page is announcing a cohort or recapping/explaining a demo day (column G). This variable will be used to decide if we should subtract weeks from the given date (e.g. if it is recapping a demo day, the cohort went through the accelerator for the past ~12 weeks, and we should subtract weeks as such).&lt;br /&gt;
&lt;br /&gt;
4. Record date, month, year, and the companies listed for that given accelerator.&lt;br /&gt;
&lt;br /&gt;
5. Note any other information, such as a cohort's special name.&lt;br /&gt;
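The date adjustment described in step 3 can be sketched as follows. The function and constant names are hypothetical, and the 12-week programme length is only the approximate figure given in step 3; individual accelerators may differ.&lt;br /&gt;

```python
from datetime import date, timedelta

# Assumption: a typical programme lasts about 12 weeks, per the '~12 weeks'
# figure in step 3. Individual accelerators may differ.
COHORT_WEEKS = 12


def estimate_cohort_start(page_date, is_demo_day_recap, weeks=COHORT_WEEKS):
    """Estimate when a cohort started from the date recorded for a page.

    If the page recaps a demo day, the cohort is assumed to have started
    `weeks` weeks earlier; if it announces a new cohort, the page date is
    used unchanged.
    """
    if is_demo_day_recap:
        return page_date - timedelta(weeks=weeks)
    return page_date
```

For example, a demo day recap dated 31 July 2018 would give an estimated cohort start of 8 May 2018, while a cohort announcement with the same date would be left unchanged.&lt;br /&gt;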
&lt;br /&gt;
Once this process is finished, we will filter only the 1s in Column F, and [[Connor Rothschild]] and [[Maxine Tao]] will work to populate empty cells in The File to Rule Them All with that data.&lt;br /&gt;
&lt;br /&gt;
==Advance User Guide: An in-depth look into the project and the various settings==&lt;br /&gt;
&lt;br /&gt;
===Accelerators needed to Crawl===&lt;br /&gt;
The list of accelerator names to crawl is stored in the file:&lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\Test Run\ListOfAccsToCrawl.txt&lt;br /&gt;
&lt;br /&gt;
===Training Data===&lt;br /&gt;
Training data is stored in the folder:&lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\Test Run\TrainingHTML&lt;br /&gt;
&lt;br /&gt;
===The Crawler Functionality===&lt;br /&gt;
The crawler functionality is stored in the file:&lt;br /&gt;
 STEP1_crawl.py&lt;br /&gt;
The crawler was optimized for speed, performance, and filtering while remaining functional over the large data set.&lt;br /&gt;
&lt;br /&gt;
BUG REPORT by Maxine Tao (FIXED): update the crawler with this line of code:&lt;br /&gt;
 search_results = driver.find_elements_by_xpath(&amp;quot;//div[@class='g']/div/div/div/h3/a&amp;quot;) + driver.find_elements_by_xpath(&amp;quot;//div[@class='g']/div/div/h3/a&amp;quot;)&lt;br /&gt;
The crawler had stopped grabbing the first search result, probably because Google changed the layout of its results page.&lt;br /&gt;
 &lt;br /&gt;
===The Classifier===&lt;br /&gt;
&lt;br /&gt;
===Input (Features)===&lt;br /&gt;
The input (features) is currently the frequency of each of X_NUMBER hand-selected words in each document. This is the naive bag-of-words approach.&lt;br /&gt;
&lt;br /&gt;
Idea: create a matrix whose first column is the file BiBTex and whose remaining columns are the words; the value at (file, word) is the frequency of that word in the file.&lt;br /&gt;
Then split the matrix into an array of row vectors and feed each vector into the RNN.&lt;br /&gt;
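The matrix described above can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation in STEP2_preprocessing_feature_matrix_generator.py: the names feature_row and feature_matrix are hypothetical, the page text is assumed to be already extracted from the HTML, and the keywords are assumed to be lowercase single words.&lt;br /&gt;

```python
import re
from collections import Counter


def feature_row(text, keywords):
    """Count how often each keyword occurs in `text` (case-insensitive)."""
    tokens = re.findall(r'[a-z0-9]+', text.lower())
    counts = Counter(tokens)
    return [counts[word] for word in keywords]


def feature_matrix(documents, keywords):
    """Build one row per document: [file name, count of keyword 1, ...].

    `documents` maps a file name to its plain text. The first column is
    the file identifier and the remaining columns are keyword counts.
    """
    return [[name] + feature_row(text, keywords)
            for name, text in documents.items()]
```

Each row of the result, minus its first column, is the frequency vector that would be passed to the classifier.&lt;br /&gt;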
&lt;br /&gt;
The bag-of-words input does not seem to give high accuracy with our LSTM RNN, so I will consider a word2vec approach.&lt;br /&gt;
&lt;br /&gt;
==Development Notes==&lt;br /&gt;
Right now I am working on two different classifiers: Kyran's old Random Forest model, which I am optimizing by tweaking parameters and trying different combinations of features, and my RNN text classifier.&lt;br /&gt;
&lt;br /&gt;
The RF model has a ~92% accuracy on the training data and ~70% accuracy on the test data.&lt;br /&gt;
&lt;br /&gt;
The RNN currently has ~50% accuracy on both train and test data, which is rather concerning.&lt;br /&gt;
&lt;br /&gt;
Test : train ratio is 1:3 (25/75)&lt;br /&gt;
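A 25/75 split of this kind can be sketched with the standard library alone. The function split_train_test is hypothetical and is not the project's actual splitting code, which is not shown on this page; the fixed seed is an assumption added so that the split is reproducible.&lt;br /&gt;

```python
import random


def split_train_test(samples, test_fraction=0.25, seed=0):
    """Shuffle `samples` and split them into (train, test) lists."""
    shuffled = list(samples)
    # A seeded generator gives the same split on every run.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

With 100 labelled pages this returns 75 training samples and 25 test samples, and every page lands in exactly one of the two lists.&lt;br /&gt;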
&lt;br /&gt;
Both models currently use the bag-of-words approach to preprocess data, but I will try to use Yang's code in the industry classifier to preprocess using word2vec. I'm not familiar with this approach, but I will try to learn it.&lt;br /&gt;
&lt;br /&gt;
==Reading resources==&lt;br /&gt;
http://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Accelerator_Demo_Day&amp;diff=23850</id>
		<title>Accelerator Demo Day</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Accelerator_Demo_Day&amp;diff=23850"/>
		<updated>2018-07-31T15:24:41Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: /* Amazon Mechanical Turk */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Accelerator Demo Day&lt;br /&gt;
|Has owner=Minh Le,&lt;br /&gt;
|Has start date=06/18/2018&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Does subsume=Demo Day Page Parser, Demo Day Page Google Classifier&lt;br /&gt;
}}&lt;br /&gt;
==Project Introduction==&lt;br /&gt;
This project uses Selenium and machine learning to find candidate web pages and classify whether each one is a demo day page containing a list of cohort companies, with the ultimate aim of gathering good candidates to push to Mechanical Turk. The code is written in Python 3 using Selenium and TensorFlow (Keras). This article also covers the preliminaries of the Mechanical Turk tool and how it can be used to collect data.&lt;br /&gt;
&lt;br /&gt;
==Project Goal==&lt;br /&gt;
The goal of this project is to find good &amp;quot;Demo Day&amp;quot; candidate web pages and to submit these pages to Amazon Mechanical Turk for data collection. A good candidate is defined as a page containing a list of cohort companies associated with an accelerator. Through observation, good candidates usually contain time and location information about the demo day as well, and are thus sufficient to be pushed to MTurk for data collection.&lt;br /&gt;
&lt;br /&gt;
==Code Location==&lt;br /&gt;
The source code and relevant files for the project can be found here: &lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\&lt;br /&gt;
The current working model using RF is in:&lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\Test Run&lt;br /&gt;
The RNN model is in:&lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\Experiment&lt;br /&gt;
The RNN is still under heavy development. Modifying anything in this folder is not recommended.&lt;br /&gt;
&lt;br /&gt;
All the other folders are used for experiments; please don't touch them. If you want to understand more about the files as a general user, go to the section A Quick Glance through the File in The Directory below. If you are a developer, go to the Advance User Guide section.&lt;br /&gt;
&lt;br /&gt;
==General User Guide: How to Use this Project (Random Forest model)==&lt;br /&gt;
&lt;br /&gt;
First, change your directory to the working folder:&lt;br /&gt;
 cd E:\McNair\Projects\Accelerator Demo Day\Test Run&lt;br /&gt;
Then you need to specify the list of accelerators you want to crawl by modifying the following file:&lt;br /&gt;
 ListOfAccsToCrawl.txt&lt;br /&gt;
The first line must remain fixed as &amp;quot;Accelerator&amp;quot;. Each following line holds one accelerator name. Names are not case-sensitive, but it is preferable to keep the original capitalization where possible.&lt;br /&gt;
&lt;br /&gt;
All necessary preparations are now complete. Now onto running the code!&lt;br /&gt;
&lt;br /&gt;
Running the project is as simple as executing the code in the correct order. The files are named in the format &amp;quot;STEPX_name&amp;quot;, where X is the order of execution. To be more specific, run the following four commands:&lt;br /&gt;
 ''# Crawl Google to get the demo day pages for the accelerators listed in ListOfAccsToCrawl.txt''&lt;br /&gt;
 python3 STEP1_crawl.py&lt;br /&gt;
 ''# Preprocess the data using a bag-of-words approach: each page is characterized by the frequencies of chosen keywords, which are stored in words.txt. This script creates a file called feature_matrix.txt''&lt;br /&gt;
 python3 STEP2_preprocessing_feature_matrix_generator.py&lt;br /&gt;
 ''# Train the RF model''&lt;br /&gt;
 python3 STEP3_train_rf.py&lt;br /&gt;
 ''# Run the model to classify the crawled HTML pages''&lt;br /&gt;
 python3 STEP4_classify_rf.py&lt;br /&gt;
&lt;br /&gt;
The results are stored in the CrawledHTMLFull folder, sorted into two subfolders: positive and negative. The positive folder contains the HTML pages that the classifier judged to be good candidates; the negative folder contains the rest. There is also a text file called prediction.txt that lists every prediction. feature.txt is irrelevant for the general user and can be ignored; its sole purpose is analysis and debugging.&lt;br /&gt;
&lt;br /&gt;
NEVER touch the TrainingHTML folder, datareader.py or classifier.txt. These are used internally to train the model.&lt;br /&gt;
&lt;br /&gt;
==A Quick Glance through the File in The Directory==&lt;br /&gt;
All working files are stored in this folder:&lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\Test Run&lt;br /&gt;
The file   &lt;br /&gt;
&lt;br /&gt;
==Amazon Mechanical Turk==&lt;br /&gt;
Login info:&lt;br /&gt;
 username: mcnair@rice.edu&lt;br /&gt;
 password: amount&lt;br /&gt;
&lt;br /&gt;
There's a file in the folder &lt;br /&gt;
 CrawledHTMLFull&lt;br /&gt;
called&lt;br /&gt;
 FinalResultWithURL&lt;br /&gt;
that was manually created by combining the file&lt;br /&gt;
 crawled_demoday_page_list.txt&lt;br /&gt;
in the parent folder and the file&lt;br /&gt;
 predicted.txt&lt;br /&gt;
This file combines the predictions with the actual URLs of the websites.&lt;br /&gt;
&lt;br /&gt;
Since MTurk makes it hard for us to display the downloaded HTML, it is much faster to copy the URL into the question box than to try to display the downloaded HTML.&lt;br /&gt;
&lt;br /&gt;
The advantage of this is that some websites, such as techcrunch.com, behave abnormally when downloaded as HTML, so opening them in the browser works better because the UI is not broken. Moreover, if a website has a paywall or pop-up ads, the user can click out of them. Since paywalls and pop-ups are usually scripts within the HTML, the classifier can't rule such pages out, because the body of the HTML may still contain the information we are looking for. Major paywalled sites and sites that require log-ins, such as LinkedIn, have been blacklisted in the crawler. More detail is in the crawler section below.&lt;br /&gt;
&lt;br /&gt;
However, there is a disadvantage to this: websites are ever-changing, so a URL may stop working or point to different content in the future. Downloaded HTML, on the other hand, needs no internet connection to render, so its content stays static.&lt;br /&gt;
&lt;br /&gt;
To create the MTurk task for this project, follow the tutorial in [[Mechanical Turk (Tool)]]. For testing and development purposes, use https://requestersandbox.mturk.com/&lt;br /&gt;
&lt;br /&gt;
Test account:&lt;br /&gt;
email: mcboatfaceboaty670@gmail.com&lt;br /&gt;
password: sameastheoneforemail2018&lt;br /&gt;
&lt;br /&gt;
For this project, the fields asked of the user are:&lt;br /&gt;
&lt;br /&gt;
*Whether the page had a list of companies going through an accelerator&lt;br /&gt;
*The month and year of the demo day (or article)&lt;br /&gt;
*Accelerator name&lt;br /&gt;
*Companies going through accelerator&lt;br /&gt;
&lt;br /&gt;
Layout:&lt;br /&gt;
&lt;br /&gt;
[[File:Demodayfinal.png]]&lt;br /&gt;
&lt;br /&gt;
==Hand Collecting Data==&lt;br /&gt;
&lt;br /&gt;
For the crawl, we only looked for data on accelerators whose companies did not receive venture capital (according to data Ed found via VentureXpert) and lacked timing info. The purpose of this crawl is to find timing info where we cannot find it otherwise; if a company received VC, we can find timing info via that investment. The file we used to find instances that lacked both timing info and VC is:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Merged W Crunchbase Data as of July 17.xlsx&lt;br /&gt;
&lt;br /&gt;
We filtered this sheet in Excel (and checked our work by filtering in SQL) and found 809 companies that lacked timing info and didn't receive VC. From this, we found 74 accelerators which we needed to crawl for.&lt;br /&gt;
&lt;br /&gt;
We used the crawler to search for cohort companies listed for these accelerators.&lt;br /&gt;
&lt;br /&gt;
The initial test run yielded 359 good pages. This data is then processed by hand by fellow interns.&lt;br /&gt;
&lt;br /&gt;
The file for hand-coding is in:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerator Demo Day/Test Run/CrawledDemoDayHTMLFull/'''FinalResultWithURL'''&lt;br /&gt;
&lt;br /&gt;
For the sake of collaboration, the team copied this information to a Google Sheet, accessible here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
We split the process into four parts. Each intern will do the following:&lt;br /&gt;
&lt;br /&gt;
1. Go to the given URL.&lt;br /&gt;
&lt;br /&gt;
2. Record whether the page is good data (column F); this can later be used by [[Minh Le]] to refine/fine-tune training data.&lt;br /&gt;
&lt;br /&gt;
3. Record whether the page is announcing a cohort or recapping/explaining a demo day (column G). This variable will be used to decide if we should subtract weeks from the given date (e.g. if it is recapping a demo day, the cohort went through the accelerator for the past ~12 weeks, and we should subtract weeks as such).&lt;br /&gt;
&lt;br /&gt;
4. Record date, month, year, and the companies listed for that given accelerator.&lt;br /&gt;
&lt;br /&gt;
5. Note any other information, such as a cohort's special name.&lt;br /&gt;
&lt;br /&gt;
Once this process is finished, we will filter only the 1s in Column F, and [[Connor Rothschild]] and [[Maxine Tao]] will work to populate empty cells in The File to Rule Them All with that data.&lt;br /&gt;
&lt;br /&gt;
==Advance User Guide: An in-depth look into the project and the various settings==&lt;br /&gt;
&lt;br /&gt;
===Accelerators needed to Crawl===&lt;br /&gt;
The list of accelerator names to crawl is stored in the file:&lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\Test Run\ListOfAccsToCrawl.txt&lt;br /&gt;
&lt;br /&gt;
===Training Data===&lt;br /&gt;
Training data is stored in the folder:&lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\Test Run\TrainingHTML&lt;br /&gt;
&lt;br /&gt;
===The Crawler Functionality===&lt;br /&gt;
The crawler functionality is stored in the file:&lt;br /&gt;
 STEP1_crawl.py&lt;br /&gt;
The crawler was optimized for speed, performance, and filtering while remaining functional over the large data set.&lt;br /&gt;
&lt;br /&gt;
BUG REPORT by Maxine Tao (FIXED): update the crawler with this line of code:&lt;br /&gt;
 search_results = driver.find_elements_by_xpath(&amp;quot;//div[@class='g']/div/div/div/h3/a&amp;quot;) + driver.find_elements_by_xpath(&amp;quot;//div[@class='g']/div/div/h3/a&amp;quot;)&lt;br /&gt;
The crawler had stopped grabbing the first search result, probably because Google changed the layout of its results page.&lt;br /&gt;
 &lt;br /&gt;
===The Classifier===&lt;br /&gt;
&lt;br /&gt;
===Input (Features)===&lt;br /&gt;
The input (features) is currently the frequency of each of X_NUMBER hand-selected words in each document. This is the naive bag-of-words approach.&lt;br /&gt;
&lt;br /&gt;
Idea: create a matrix whose first column is the file BiBTex and whose remaining columns are the words; the value at (file, word) is the frequency of that word in the file.&lt;br /&gt;
Then split the matrix into an array of row vectors and feed each vector into the RNN.&lt;br /&gt;
&lt;br /&gt;
The bag-of-words input does not seem to give high accuracy with our LSTM RNN, so I will consider a word2vec approach.&lt;br /&gt;
&lt;br /&gt;
==Development Notes==&lt;br /&gt;
Right now I am working on two different classifiers: Kyran's old Random Forest model, which I am optimizing by tweaking parameters and trying different combinations of features, and my RNN text classifier.&lt;br /&gt;
&lt;br /&gt;
The RF model has a ~92% accuracy on the training data and ~70% accuracy on the test data.&lt;br /&gt;
&lt;br /&gt;
The RNN currently has ~50% accuracy on both train and test data, which is rather concerning.&lt;br /&gt;
&lt;br /&gt;
Test : train ratio is 1:3 (25/75)&lt;br /&gt;
&lt;br /&gt;
Both models currently use the bag-of-words approach to preprocess data, but I will try to use Yang's code in the industry classifier to preprocess using word2vec. I'm not familiar with this approach, but I will try to learn it.&lt;br /&gt;
&lt;br /&gt;
==Reading resources==&lt;br /&gt;
http://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Accelerator_Demo_Day&amp;diff=23849</id>
		<title>Accelerator Demo Day</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Accelerator_Demo_Day&amp;diff=23849"/>
		<updated>2018-07-31T15:24:02Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: /* Amazon Mechanical Turk */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Accelerator Demo Day&lt;br /&gt;
|Has owner=Minh Le,&lt;br /&gt;
|Has start date=06/18/2018&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Does subsume=Demo Day Page Parser, Demo Day Page Google Classifier&lt;br /&gt;
}}&lt;br /&gt;
==Project Introduction==&lt;br /&gt;
This project uses Selenium and machine learning to find candidate web pages and classify whether each one is a demo day page containing a list of cohort companies, with the ultimate aim of gathering good candidates to push to Mechanical Turk. The code is written in Python 3 using Selenium and TensorFlow (Keras). This article also covers the preliminaries of the Mechanical Turk tool and how it can be used to collect data.&lt;br /&gt;
&lt;br /&gt;
==Project Goal==&lt;br /&gt;
The goal of this project is to find good &amp;quot;Demo Day&amp;quot; candidate web pages and to submit these pages to Amazon Mechanical Turk for data collection. A good candidate is defined as a page containing a list of cohort companies associated with an accelerator. Through observation, good candidates usually contain time and location information about the demo day as well, and are thus sufficient to be pushed to MTurk for data collection.&lt;br /&gt;
&lt;br /&gt;
==Code Location==&lt;br /&gt;
The source code and relevant files for the project can be found here: &lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\&lt;br /&gt;
The current working model using RF is in:&lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\Test Run&lt;br /&gt;
The RNN model is in:&lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\Experiment&lt;br /&gt;
The RNN is still under heavy development. Modifying anything in this folder is not recommended.&lt;br /&gt;
&lt;br /&gt;
All the other folders are used for experiments; please don't touch them. If you want to understand more about the files as a general user, go to the section A Quick Glance through the File in The Directory below. If you are a developer, go to the Advance User Guide section.&lt;br /&gt;
&lt;br /&gt;
==General User Guide: How to Use this Project (Random Forest model)==&lt;br /&gt;
&lt;br /&gt;
First, change your directory to the working folder:&lt;br /&gt;
 cd E:\McNair\Projects\Accelerator Demo Day\Test Run&lt;br /&gt;
Then you need to specify the list of accelerators you want to crawl by modifying the following file:&lt;br /&gt;
 ListOfAccsToCrawl.txt&lt;br /&gt;
The first line must remain fixed as &amp;quot;Accelerator&amp;quot;. Each following line holds one accelerator name. Names are not case-sensitive, but it is preferable to keep the original capitalization where possible.&lt;br /&gt;
&lt;br /&gt;
All necessary preparations are now complete. Now onto running the code!&lt;br /&gt;
&lt;br /&gt;
Running the project is as simple as executing the code in the correct order. The files are named in the format &amp;quot;STEPX_name&amp;quot;, where X is the order of execution. To be more specific, run the following four commands:&lt;br /&gt;
 ''# Crawl Google to get the demo day pages for the accelerators listed in ListOfAccsToCrawl.txt''&lt;br /&gt;
 python3 STEP1_crawl.py&lt;br /&gt;
 ''# Preprocess the data using a bag-of-words approach: each page is characterized by the frequencies of chosen keywords, which are stored in words.txt. This script creates a file called feature_matrix.txt''&lt;br /&gt;
 python3 STEP2_preprocessing_feature_matrix_generator.py&lt;br /&gt;
 ''# Train the RF model''&lt;br /&gt;
 python3 STEP3_train_rf.py&lt;br /&gt;
 ''# Run the model to classify the crawled HTML pages''&lt;br /&gt;
 python3 STEP4_classify_rf.py&lt;br /&gt;
&lt;br /&gt;
The results are stored in the CrawledHTMLFull folder, sorted into two subfolders: positive and negative. The positive folder contains the HTML pages that the classifier judged to be good candidates; the negative folder contains the rest. There is also a text file called prediction.txt that lists every prediction. feature.txt is irrelevant for the general user and can be ignored; its sole purpose is analysis and debugging.&lt;br /&gt;
&lt;br /&gt;
NEVER touch the TrainingHTML folder, datareader.py or classifier.txt. These are used internally to train the model.&lt;br /&gt;
&lt;br /&gt;
==A Quick Glance through the File in The Directory==&lt;br /&gt;
All working files are stored in this folder:&lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\Test Run&lt;br /&gt;
The file   &lt;br /&gt;
&lt;br /&gt;
==Amazon Mechanical Turk==&lt;br /&gt;
Login info:&lt;br /&gt;
 username: mcnair@rice.edu&lt;br /&gt;
 password: amount&lt;br /&gt;
&lt;br /&gt;
There's a file in the folder &lt;br /&gt;
 CrawledHTMLFull&lt;br /&gt;
called&lt;br /&gt;
 FinalResultWithURL&lt;br /&gt;
that was manually created by combining the file&lt;br /&gt;
 crawled_demoday_page_list.txt&lt;br /&gt;
in the parent folder and the file&lt;br /&gt;
 predicted.txt&lt;br /&gt;
This file combines the predictions with the actual URLs of the websites.&lt;br /&gt;
&lt;br /&gt;
Since MTurk makes it hard for us to display the downloaded HTML, it is much faster to copy the URL into the question box than to try to display the downloaded HTML.&lt;br /&gt;
&lt;br /&gt;
The advantage of this is that some websites, such as techcrunch.com, behave abnormally when downloaded as HTML, so opening them in the browser works better because the UI is not broken. Moreover, if a website has a paywall or pop-up ads, the user can click out of them. Since paywalls and pop-ups are usually scripts within the HTML, the classifier can't rule such pages out, because the body of the HTML may still contain the information we are looking for. Major paywalled sites and sites that require log-ins, such as LinkedIn, have been blacklisted in the crawler. More detail is in the crawler section below.&lt;br /&gt;
&lt;br /&gt;
However, there is a disadvantage to this: websites are ever-changing, so a URL may stop working or point to different content in the future. Downloaded HTML, on the other hand, needs no internet connection to render, so its content stays static.&lt;br /&gt;
&lt;br /&gt;
To create the MTurk task for this project, follow the tutorial in [[Mechanical Turk (Tool)]]. For testing and development purposes, use https://requestersandbox.mturk.com/&lt;br /&gt;
&lt;br /&gt;
Test account:&lt;br /&gt;
email: mcboatfaceboaty670@gmail.com&lt;br /&gt;
password: sameastheoneforemail2018&lt;br /&gt;
&lt;br /&gt;
For this project, the fields asked of the user are:&lt;br /&gt;
&lt;br /&gt;
*Whether the page had a list of companies going through an accelerator&lt;br /&gt;
*The month and year of the demo day (or article)&lt;br /&gt;
*Accelerator name&lt;br /&gt;
*Companies going through accelerator&lt;br /&gt;
&lt;br /&gt;
Layout:&lt;br /&gt;
&lt;br /&gt;
[[File:Demodayfinal.png]]&lt;br /&gt;
&lt;br /&gt;
==Hand Collecting Data==&lt;br /&gt;
&lt;br /&gt;
For the crawl, we only looked for data on accelerators whose companies did not receive venture capital (according to data Ed found via VentureXpert) and lacked timing info. The purpose of this crawl is to find timing info where we cannot find it otherwise; if a company received VC, we can find timing info via that investment. The file we used to find instances that lacked both timing info and VC is:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Merged W Crunchbase Data as of July 17.xlsx&lt;br /&gt;
&lt;br /&gt;
We filtered this sheet in Excel (and checked our work by filtering in SQL) and found 809 companies that lacked timing info and didn't receive VC. From this, we found 74 accelerators which we needed to crawl for.&lt;br /&gt;
&lt;br /&gt;
We used the crawler to search for cohort companies listed for these accelerators.&lt;br /&gt;
&lt;br /&gt;
The initial test run yielded 359 good pages. The data was then processed by hand by fellow interns.&lt;br /&gt;
&lt;br /&gt;
The file for hand-coding is in:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerator Demo Day/Test Run/CrawledDemoDayHTMLFull/'''FinalResultWithURL'''&lt;br /&gt;
&lt;br /&gt;
For the sake of collaboration, the team copied this information to a Google Sheet, accessible here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
We split the process into four parts. Each intern will do the following:&lt;br /&gt;
&lt;br /&gt;
1. Go to the given URL.&lt;br /&gt;
&lt;br /&gt;
2. Record whether the page is good data (column F); this can later be used by [[Minh Le]] to refine/fine-tune training data.&lt;br /&gt;
&lt;br /&gt;
3. Record whether the page is announcing a cohort or recapping/explaining a demo day (column G). This variable will be used to decide if we should subtract weeks from the given date (e.g. if it is recapping a demo day, the cohort went through the accelerator for the past ~12 weeks, and we should subtract weeks as such).&lt;br /&gt;
&lt;br /&gt;
4. Record date, month, year, and the companies listed for that given accelerator.&lt;br /&gt;
&lt;br /&gt;
5. Note any other information, such as a cohort's special name.&lt;br /&gt;
&lt;br /&gt;
Once this process is finished, we will filter only the 1s in Column F, and [[Connor Rothschild]] and [[Maxine Tao]] will work to populate empty cells in The File to Rule Them All with that data.&lt;br /&gt;
&lt;br /&gt;
==Advance User Guide: An in-depth look into the project and the various settings==&lt;br /&gt;
&lt;br /&gt;
===Accelerators needed to Crawl===&lt;br /&gt;
The list of accelerator names to crawl is stored in the file:&lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\Test Run\ListOfAccsToCrawl.txt&lt;br /&gt;
&lt;br /&gt;
===Training Data===&lt;br /&gt;
Training data is stored in the folder:&lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\Test Run\TrainingHTML&lt;br /&gt;
&lt;br /&gt;
===The Crawler Functionality===&lt;br /&gt;
The crawler functionality is stored in the file:&lt;br /&gt;
 STEP1_crawl.py&lt;br /&gt;
The crawler was optimized for speed, performance, and filtering while remaining functional over the large data set.&lt;br /&gt;
&lt;br /&gt;
BUG REPORT by Maxine Tao (FIXED): update the crawler with this line of code:&lt;br /&gt;
 search_results = driver.find_elements_by_xpath(&amp;quot;//div[@class='g']/div/div/div/h3/a&amp;quot;) + driver.find_elements_by_xpath(&amp;quot;//div[@class='g']/div/div/h3/a&amp;quot;)&lt;br /&gt;
This change was needed because the crawler stopped grabbing the first search result, probably because Google modified the layout of its results page.&lt;br /&gt;
 &lt;br /&gt;
===The Classifier===&lt;br /&gt;
&lt;br /&gt;
===Input (Features)===&lt;br /&gt;
The input (features) right now is the frequency of each of X_NUMBER words in each document. The words are hand-selected. This is the naive bag-of-words approach.&lt;br /&gt;
&lt;br /&gt;
Idea: create a matrix in which the first column is the file BiBTex, the following columns are the words, and the value at (file, word) is the frequency of that word in the file.&lt;br /&gt;
Then split the matrix into an array of row vectors and feed each vector into the RNN.&lt;br /&gt;
&lt;br /&gt;
This does not seem to give high accuracy with our LSTM RNN, so I will consider a word2vec approach.&lt;br /&gt;
&lt;br /&gt;
==Development Notes==&lt;br /&gt;
Right now I am working on two different classifiers: Kyran's old Random Forest model, which I am optimizing by tweaking parameters and trying different combinations of features, and my RNN text classifier.&lt;br /&gt;
&lt;br /&gt;
The RF model has a ~92% accuracy on the training data and ~70% accuracy on the test data.&lt;br /&gt;
&lt;br /&gt;
The RNN currently has ~50% accuracy on both the training and test data, which is rather concerning.&lt;br /&gt;
&lt;br /&gt;
Test : train ratio is 1:3 (25/75)&lt;br /&gt;
&lt;br /&gt;
Both models currently use the bag-of-words approach to preprocess data, but I will try to use Yang's code in the industry classifier to preprocess using word2vec. I'm not familiar with this approach, but I will try to learn it.&lt;br /&gt;
&lt;br /&gt;
==Reading resources==&lt;br /&gt;
http://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=File:Demodayfinal.png&amp;diff=23848</id>
		<title>File:Demodayfinal.png</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=File:Demodayfinal.png&amp;diff=23848"/>
		<updated>2018-07-31T15:21:50Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Accelerator_Demo_Day&amp;diff=23847</id>
		<title>Accelerator Demo Day</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Accelerator_Demo_Day&amp;diff=23847"/>
		<updated>2018-07-31T15:20:45Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: /* Amazon Mechanical Turk */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Accelerator Demo Day&lt;br /&gt;
|Has owner=Minh Le,&lt;br /&gt;
|Has start date=06/18/2018&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Does subsume=Demo Day Page Parser, Demo Day Page Google Classifier&lt;br /&gt;
}}&lt;br /&gt;
==Project Introduction==&lt;br /&gt;
This project uses Selenium and machine learning to find candidate web pages and classify whether each one is a demo day page containing a list of cohort companies, ultimately to gather good candidates to push to Mechanical Turk. The code is written in Python 3 using Selenium and TensorFlow (Keras). This article also covers the preliminaries of the Mechanical Turk tool and how it can be used to collect data.&lt;br /&gt;
&lt;br /&gt;
==Project Goal==&lt;br /&gt;
The goal of this project is to find good &amp;quot;Demo Day&amp;quot; candidate web pages and to submit these pages to Amazon Mechanical Turk for data collection. A good candidate is defined as a page containing a list of cohort companies associated with an accelerator. From observation, good candidates usually also contain time and location information about the demo day and are thus sufficient to be pushed to MTurk for data collection.&lt;br /&gt;
&lt;br /&gt;
==Code Location==&lt;br /&gt;
The source code and relevant files for the project can be found here: &lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\&lt;br /&gt;
The current working model using RF is in:&lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\Test Run&lt;br /&gt;
The RNN model is in:&lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\Experiment&lt;br /&gt;
The RNN is still under heavy development. Modifying anything in this folder is not recommended.&lt;br /&gt;
&lt;br /&gt;
All the other folders are used for experiments; please don't touch them. If you want to understand more about the files as a general user, go to the section A Quick Glance through the File in The Directory below. If you are a developer, go to the Advance User Guide section.&lt;br /&gt;
&lt;br /&gt;
==General User Guide: How to Use this Project (Random Forest model)==&lt;br /&gt;
&lt;br /&gt;
First, change your directory to the working folder:&lt;br /&gt;
 cd E:\McNair\Projects\Accelerator Demo Day\Test Run&lt;br /&gt;
Then you need to specify the list of accelerators you want to crawl by modifying the following file:&lt;br /&gt;
 ListOfAccsToCrawl.txt&lt;br /&gt;
The first line must remain fixed as &amp;quot;Accelerator&amp;quot;. Each following line is an accelerator name. The names do not need to be case-sensitive, but it is preferable to keep the original capitalization if possible.&lt;br /&gt;
&lt;br /&gt;
All necessary preparations are now complete. Now onto running the code!&lt;br /&gt;
&lt;br /&gt;
Running the project is as simple as executing the code in the correct order. The files are named in the format &amp;quot;STEPX_name&amp;quot;, where X is the order of execution. Specifically, run the following four commands:&lt;br /&gt;
 ''# Crawl Google to get the data for the demo day pages for the accelerator stored in ListOfAccsToCrawl.txt''&lt;br /&gt;
 python3 STEP1_crawl.py&lt;br /&gt;
 ''# Preprocess data using a bag-of-words approach: each page is characterized by the frequencies of chosen keywords. Chosen keywords are stored in words.txt. This script creates a file called feature_matrix.txt''&lt;br /&gt;
 python3 STEP2_preprocessing_feature_matrix_generator.py&lt;br /&gt;
 ''# Train the RF model''&lt;br /&gt;
 python3 STEP3_train_rf.py&lt;br /&gt;
 ''# Run the model to classify the crawled HTML pages.''&lt;br /&gt;
 python3 STEP4_classify_rf.py&lt;br /&gt;
&lt;br /&gt;
The results are stored in the CrawledHTMLFull folder and sorted into two folders: positive and negative. The positive folder contains the HTML files that the classifier judged to be &amp;quot;good candidates&amp;quot;; the negative folder contains the rest. There is also a text file called prediction.txt that lists everything. feature.txt is irrelevant to the general user; please ignore it. Its sole purpose is analysis and debugging.&lt;br /&gt;
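&lt;br /&gt;
The four steps above could also be chained from one small script. This is only a sketch, not part of the project: the helper names build_commands and run_all are hypothetical, the script names are the ones listed above, and the dry-run default prints the commands instead of executing them.&lt;br /&gt;
&lt;br /&gt;
```python
import subprocess
import sys

# Script names are taken from the four commands listed above.
STEPS = [
    "STEP1_crawl.py",
    "STEP2_preprocessing_feature_matrix_generator.py",
    "STEP3_train_rf.py",
    "STEP4_classify_rf.py",
]


def build_commands(python=sys.executable):
    """Return one [interpreter, script] command per step, in order."""
    return [[python, script] for script in STEPS]


def run_all(dry_run=True):
    """Print the commands (dry run) or execute them one after another."""
    for command in build_commands():
        if dry_run:
            print(" ".join(command))
        else:
            # check=True raises an error and stops if a step fails.
            subprocess.run(command, check=True)


run_all(dry_run=True)
```
&lt;br /&gt;
Calling run_all(dry_run=False) from the Test Run folder would execute the steps for real; stopping at the first failure avoids classifying with a stale model.&lt;br /&gt;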
&lt;br /&gt;
NEVER touch the TrainingHTML folder, datareader.py, or classifier.txt. These are used internally to train the model.&lt;br /&gt;
&lt;br /&gt;
==A Quick Glance through the File in The Directory==&lt;br /&gt;
All working files are stored in this folder:&lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\Test Run&lt;br /&gt;
The file   &lt;br /&gt;
&lt;br /&gt;
==Amazon Mechanical Turk==&lt;br /&gt;
Login info:&lt;br /&gt;
 username: mcnair@rice.edu&lt;br /&gt;
 password: amount&lt;br /&gt;
&lt;br /&gt;
There's a file in the folder &lt;br /&gt;
 CrawledHTMLFull&lt;br /&gt;
called&lt;br /&gt;
 FinalResultWithURL&lt;br /&gt;
that was manually created by combining the file&lt;br /&gt;
 crawled_demoday_page_list.txt&lt;br /&gt;
in the parent folder and the file &lt;br /&gt;
 predicted.txt&lt;br /&gt;
This file pairs each prediction with the actual URL of the website.&lt;br /&gt;
&lt;br /&gt;
Since MTurk makes it hard to display the downloaded HTML, it is much faster to copy the URL into the question box instead.&lt;br /&gt;
&lt;br /&gt;
The advantage of this is that some websites, such as techcrunch.com, behave abnormally when downloaded as HTML, so opening them in the browser is more useful because the UI is not broken. Moreover, if a website has a paywall or pop-up ads, the user can click out of them. Because paywalls and pop-ups are usually scripts within the HTML, the classifier cannot rule such pages out: the body of the HTML may still contain the information we are looking for. Major paywalled sites and sites that require log-ins, such as LinkedIn, have been blacklisted in the crawler. More detail is in the crawler section below.&lt;br /&gt;
&lt;br /&gt;
However, there is a disadvantage to this: websites are ever-changing, so a URL may become unusable or point to different content in the future. Downloaded HTML files, on the other hand, stay the same because they do not require an internet connection to render, so their content is static.&lt;br /&gt;
&lt;br /&gt;
To create the MTurk task for this project, follow the tutorial in [[Mechanical Turk (Tool)]]. For testing and development purposes, use https://requestersandbox.mturk.com/&lt;br /&gt;
&lt;br /&gt;
Test account:&lt;br /&gt;
email: mcboatfaceboaty670@gmail.com&lt;br /&gt;
password: sameastheoneforemail2018&lt;br /&gt;
&lt;br /&gt;
For this project, the fields asked of the user are:&lt;br /&gt;
&lt;br /&gt;
*Whether the page had a list of companies going through an accelerator&lt;br /&gt;
*The month and year of the demo day (or article)&lt;br /&gt;
*Accelerator name&lt;br /&gt;
*Companies going through accelerator&lt;br /&gt;
&lt;br /&gt;
Layout:&lt;br /&gt;
&lt;br /&gt;
[[File:Demodayfinal.png]]&lt;br /&gt;
&lt;br /&gt;
==Hand Collecting Data==&lt;br /&gt;
&lt;br /&gt;
For the crawl, we only looked for data on accelerators whose cohort companies did not receive venture capital (which Ed determined via VentureXpert) and lacked timing info. The purpose of this crawl is to find timing info where we cannot find it otherwise; if a company received VC, we can find timing info via that investment. The file we used to find instances that lacked both timing info and VC is:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Merged W Crunchbase Data as of July 17.xlsx&lt;br /&gt;
&lt;br /&gt;
We filtered this sheet in Excel (and checked our work by filtering in SQL) and found 809 companies that lacked timing info and didn't receive VC. From this, we found 74 accelerators which we needed to crawl for.&lt;br /&gt;
&lt;br /&gt;
We used the crawler to search for cohort companies listed for these accelerators.&lt;br /&gt;
&lt;br /&gt;
The initial test run yielded 359 good pages. The data was then processed by hand by fellow interns.&lt;br /&gt;
&lt;br /&gt;
The file for hand-coding is in:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerator Demo Day/Test Run/CrawledDemoDayHTMLFull/'''FinalResultWithURL'''&lt;br /&gt;
&lt;br /&gt;
For the sake of collaboration, the team copied this information to a Google Sheet, accessible here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
We split the process into four parts. Each intern will do the following:&lt;br /&gt;
&lt;br /&gt;
1. Go to the given URL.&lt;br /&gt;
&lt;br /&gt;
2. Record whether the page is good data (column F); this can later be used by [[Minh Le]] to refine/fine-tune training data.&lt;br /&gt;
&lt;br /&gt;
3. Record whether the page is announcing a cohort or recapping/explaining a demo day (column G). This variable will be used to decide if we should subtract weeks from the given date (e.g. if it is recapping a demo day, the cohort went through the accelerator for the past ~12 weeks, and we should subtract weeks as such).&lt;br /&gt;
&lt;br /&gt;
4. Record date, month, year, and the companies listed for that given accelerator.&lt;br /&gt;
&lt;br /&gt;
5. Note any other information, such as a cohort's special name.&lt;br /&gt;
&lt;br /&gt;
Once this process is finished, we will filter only the 1s in Column F, and [[Connor Rothschild]] and [[Maxine Tao]] will work to populate empty cells in The File to Rule Them All with that data.&lt;br /&gt;
&lt;br /&gt;
==Advance User Guide: An in-depth look into the project and the various settings==&lt;br /&gt;
&lt;br /&gt;
===Accelerators needed to Crawl===&lt;br /&gt;
The list of accelerator names to crawl is stored in the file:&lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\Test Run\ListOfAccsToCrawl.txt&lt;br /&gt;
&lt;br /&gt;
===Training Data===&lt;br /&gt;
Training data is stored in the folder:&lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\Test Run\TrainingHTML&lt;br /&gt;
&lt;br /&gt;
===The Crawler Functionality===&lt;br /&gt;
The crawler functionality is stored in the file:&lt;br /&gt;
 STEP1_crawl.py&lt;br /&gt;
The crawler was optimized for speed, performance, and filtering while remaining functional over the large data set.&lt;br /&gt;
&lt;br /&gt;
BUG REPORT by Maxine Tao (FIXED): update the crawler with this line of code:&lt;br /&gt;
 search_results = driver.find_elements_by_xpath(&amp;quot;//div[@class='g']/div/div/div/h3/a&amp;quot;) + driver.find_elements_by_xpath(&amp;quot;//div[@class='g']/div/div/h3/a&amp;quot;)&lt;br /&gt;
This change was needed because the crawler stopped grabbing the first search result, probably because Google modified the layout of its results page.&lt;br /&gt;
 &lt;br /&gt;
===The Classifier===&lt;br /&gt;
&lt;br /&gt;
===Input (Features)===&lt;br /&gt;
The input (features) right now is the frequency of each of X_NUMBER words in each document. The words are hand-selected. This is the naive bag-of-words approach.&lt;br /&gt;
&lt;br /&gt;
Idea: create a matrix in which the first column is the file BiBTex, the following columns are the words, and the value at (file, word) is the frequency of that word in the file.&lt;br /&gt;
Then split the matrix into an array of row vectors and feed each vector into the RNN.&lt;br /&gt;
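&lt;br /&gt;
The matrix described above could be built roughly as follows. This is a minimal sketch and not the project's STEP2 script: the function name build_feature_matrix, the sample keywords, and the sample page texts are all hypothetical, and tokenization is simplified to lowercase runs of letters.&lt;br /&gt;
&lt;br /&gt;
```python
import re
from collections import Counter


def build_feature_matrix(documents, keywords):
    """Return one row per document: [file name, count of keyword 1, ...].

    `documents` maps a file name to its plain text; `keywords` is the
    hand-selected word list (hypothetical values are used below).
    """
    rows = []
    for name, text in documents.items():
        # Simplified tokenization: lowercase runs of letters only.
        counts = Counter(re.findall(r"[a-z]+", text.lower()))
        rows.append([name] + [counts[word] for word in keywords])
    return rows


# Hypothetical sample data, for illustration only.
keywords = ["demo", "cohort", "startups"]
documents = {
    "page1.html": "Demo day recap: ten startups from the spring cohort.",
    "page2.html": "Contact us about pricing.",
}
matrix = build_feature_matrix(documents, keywords)
for row in matrix:
    print(row)
```
&lt;br /&gt;
Each row after its first element is the frequency vector for one page, which is the form that would be fed to a classifier one row at a time.&lt;br /&gt;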
&lt;br /&gt;
This does not seem to give high accuracy with our LSTM RNN, so I will consider a word2vec approach.&lt;br /&gt;
&lt;br /&gt;
==Development Notes==&lt;br /&gt;
Right now I am working on two different classifiers: Kyran's old Random Forest model, which I am optimizing by tweaking parameters and trying different combinations of features, and my RNN text classifier.&lt;br /&gt;
&lt;br /&gt;
The RF model has a ~92% accuracy on the training data and ~70% accuracy on the test data.&lt;br /&gt;
&lt;br /&gt;
The RNN currently has ~50% accuracy on both the training and test data, which is rather concerning.&lt;br /&gt;
&lt;br /&gt;
Test : train ratio is 1:3 (25/75)&lt;br /&gt;
&lt;br /&gt;
Both models currently use the bag-of-words approach to preprocess data, but I will try to use Yang's code in the industry classifier to preprocess using word2vec. I'm not familiar with this approach, but I will try to learn it.&lt;br /&gt;
&lt;br /&gt;
==Reading resources==&lt;br /&gt;
http://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Connor_Rothschild_(Work_Log)&amp;diff=23846</id>
		<title>Connor Rothschild (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Connor_Rothschild_(Work_Log)&amp;diff=23846"/>
		<updated>2018-07-30T22:29:10Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: /* Summer 2018 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;===Summer 2018===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
7/30/2018 -&lt;br /&gt;
*Recoded founders' education&lt;br /&gt;
*In the process of recoding founders' job experience&lt;br /&gt;
*Worked with Minh to test MTurk survey&lt;br /&gt;
*Talked through MTurk logistics and strategy with Minh&lt;br /&gt;
*Recoded equity and investment variables given new SeedDB data&lt;br /&gt;
*Renormalized investment amount based on midpoint of ranges, and upper bounds (upon Hira's request)&lt;br /&gt;
&lt;br /&gt;
7/29/2018 - &lt;br /&gt;
*Finalized multiple campuses work, refined addresses&lt;br /&gt;
*Upon Hira's request, recoded dead/alive variable for updated accuracy&lt;br /&gt;
&lt;br /&gt;
7/27/2018 -&lt;br /&gt;
*Recoded founders&lt;br /&gt;
*Fixed multiple campuses and cohorts&lt;br /&gt;
*Fixed the Google Sheet&lt;br /&gt;
&lt;br /&gt;
7/26/2018 -&lt;br /&gt;
*Cleaned up and fixed the Google Sheet with timing info. &lt;br /&gt;
*Recoded the employee count variable. &lt;br /&gt;
*Normalized investment amount&lt;br /&gt;
&lt;br /&gt;
7/25/2018 -&lt;br /&gt;
*Created a comprehensive Google Sheet with new timing info, collaborated with other interns to find data. Cleaned up sheet.&lt;br /&gt;
&lt;br /&gt;
7/24/2018 -&lt;br /&gt;
Sick day :(&lt;br /&gt;
&lt;br /&gt;
7/23/2018 - &lt;br /&gt;
*Helped Minh with Demo Day information.&lt;br /&gt;
&lt;br /&gt;
7/19/2018 -&lt;br /&gt;
*Helped Minh with training data for Demo Day Crawler&lt;br /&gt;
&lt;br /&gt;
7/18/2018 - &lt;br /&gt;
*Helped Augi with MA cleaning&lt;br /&gt;
*Talked to Minh about Demo Day progress&lt;br /&gt;
&lt;br /&gt;
7/17/2018 - &lt;br /&gt;
*Worked with Ed to add/merge data from Crunchbase to existing data. This was a replication of the process but done by Ed in SQL, not Excel. New data can be found in &lt;br /&gt;
 /McNair/Projects/Accelerators/Summer 2018/'''Merged With Crunchbase Info as of July 17.xlsx'''&lt;br /&gt;
&lt;br /&gt;
NOTE: Use this data rather than the sheet mentioned in yesterday's entry.&lt;br /&gt;
&lt;br /&gt;
7/16/2018 - &lt;br /&gt;
*Merged cohort company data with Crunchbase data, by doing a Vlookup then cleaning up data. I used a =IF(A2=&amp;quot;&amp;quot;,B2,A2) formula to merge cells only when blanks were present. This provided us updated data for four columns:&lt;br /&gt;
**colocation (removed 6324 blanks)&lt;br /&gt;
**codescription (removed 5151 blanks)&lt;br /&gt;
**costatus (removed 7342 blanks)&lt;br /&gt;
**courl (removed 6670 blanks)&lt;br /&gt;
and new columns:&lt;br /&gt;
**address&lt;br /&gt;
**founded_on date&lt;br /&gt;
**employee_count&lt;br /&gt;
**linkedin_url&lt;br /&gt;
&lt;br /&gt;
These new variables can be found in:&lt;br /&gt;
 /McNair/Projects/Accelerators/Summer 2018/'''Crunchbase Info Populated Empty Cells.xlsx''' (OUTDATED:: DON'T USE)&lt;br /&gt;
&lt;br /&gt;
Upon Ed's approval, I'll move this sheet to replace Cohort Companies in '''The File to Rule Them All'''.&lt;br /&gt;
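&lt;br /&gt;
For reference, the blank-filling rule used in the 7/16/2018 merge (keep our value unless it is blank, otherwise take the Crunchbase value) can be sketched outside Excel as follows. The function name fill_blanks and the sample values are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def fill_blanks(primary, fallback):
    """Keep each primary value unless it is blank; then use the fallback."""
    return [b if a == "" else a for a, b in zip(primary, fallback)]


# Hypothetical sample values, for illustration only.
ours = ["Houston, TX", "", "Austin, TX", ""]
crunchbase = ["Houston, Texas", "Boston, MA", "", "Denver, CO"]
merged = fill_blanks(ours, crunchbase)
print(merged)
```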
&lt;br /&gt;
7/13/2018 -&lt;br /&gt;
*Using SQL, matched our cohort companies with information from Crunchbase. This gave us a lot of new information, like employee counts, company status, the date founded, and the location of the company. This data can be found here:&lt;br /&gt;
 /McNair/Projects/Accelerators/Summer 2018/'''Cohort Companies With Crunchbase Info.xlsx'''&lt;br /&gt;
&lt;br /&gt;
7/12/2018 -&lt;br /&gt;
&lt;br /&gt;
*Created 'The File to Rule them All' with finalized info on accelerators, cohort companies, and founders.&lt;br /&gt;
*Attempted to match our company data to Crunchbase data with SQL to get more info on companies.&lt;br /&gt;
&lt;br /&gt;
7/11/2018 -&lt;br /&gt;
&lt;br /&gt;
*Worked on LinkedIn Founders data. Cleaned up data, removed duplicates, checked for fidelity.&lt;br /&gt;
*Worked with Maxine to finish Crunchbase matching.&lt;br /&gt;
&lt;br /&gt;
7/10/2018 - &lt;br /&gt;
*Merged Clean Cohort Data (Veeral) and Cohort List (new) in the Accelerator Master Variable List file. Cross-referenced this list with Ed's data sent last week, titled accelerator_data_noflag.txt. We found that there are 4866 more entries in the new merged file, meaning Ed's merging may have dropped valid entries. (This was after filtering the list so we only looked at the accelerators on our list).&lt;br /&gt;
&lt;br /&gt;
7/9/2018 - &lt;br /&gt;
*Worked with Maxine to remove duplicates/gather clean data for Crunchbase matching&lt;br /&gt;
&lt;br /&gt;
06/29/2018 - &lt;br /&gt;
*Finished manually coding an equity variable in Master Variable List sheet (with the help of [[Maxine Tao]]).&lt;br /&gt;
*Finished editing terms of joining accelerator:&lt;br /&gt;
*Given the above two tasks, there are five new columns in our Master Variable List sheet: &lt;br /&gt;
**Terms of joining - terms of joining accelerator and important details about program&lt;br /&gt;
**equity (1/0) - cells contain a 1 if the accelerator takes equity, a 0 if an accelerator definitively does not, and are blank if we could not find that information&lt;br /&gt;
**equity amount - the % of equity the accelerator will take (can sometimes be a range (eg. 5-7%))&lt;br /&gt;
**investment - the $ the accelerator invests in a company to begin, if relevant (also could be a range or a &amp;quot;up to $######&amp;quot;)&lt;br /&gt;
**notes - anything to comment on previous 4 columns&lt;br /&gt;
*Taught [[Maxine Tao]] how to VLookup :D&lt;br /&gt;
&lt;br /&gt;
06/28/2018 -&lt;br /&gt;
*Began manually coding an equity variable in Master Variable List sheet. &lt;br /&gt;
*Edited terms of joining accelerator. &lt;br /&gt;
*Helped Grace with LinkedIn crawler.&lt;br /&gt;
&lt;br /&gt;
06/27/2018 - &lt;br /&gt;
*Finished coding duplicates. Final file can be found at:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Duplicate Companies.xlsx&lt;br /&gt;
&lt;br /&gt;
*Dylan taught interns Excel skills&lt;br /&gt;
&lt;br /&gt;
06/26/2018 - &lt;br /&gt;
*Began coding duplicates in CohortMainBaseWCounts.txt file that Ed sent. Sorted by company name alphabetically, then used conditional formatting to highlight when an accelerator had the same name as the accelerator above. This narrowed down the results to instances in which a company would go through the same accelerator twice. Most of the time, this was due to an error with the normalizer, so I moved those un-normalized company names to their own sheet and deleted them from the file.&lt;br /&gt;
&lt;br /&gt;
06/25/2018 - &lt;br /&gt;
*Went through and manually fixed discrepancies between our accelerator data and the Crunchbase data, found at &lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Accelerators Matched by Name and Homepage URL.xlsx&lt;br /&gt;
&lt;br /&gt;
*Finalized a sheet with a list of accelerator names as we code them, as Crunchbase codes them, and the appropriate UUID for each accelerator. I recommend updating the names in our spreadsheet of accelerators to the Crunchbase list so that we will be able to look up that name without having an in-between. The list can be found in the rightmost columns here:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Accelerator Master Variable List - Revised by Ed V2.xlsx&lt;br /&gt;
and here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1n1sX5DqZrm_0vbUXG9ZaZIagF9sa0Kva9PAno-6H854/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
*Worked with [[Minh Le]] to better understand and begin documenting the [[Demo Day Page Parser]] project.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
06/22/2018 - &lt;br /&gt;
*Finished going through Accelerator Master Variable List to refine industry classification and update addresses/accelerator statuses.&lt;br /&gt;
&lt;br /&gt;
06/21/2018 - &lt;br /&gt;
*Began manually editing entries in Accelerator Master Variable List.&lt;br /&gt;
*Reached out to Grace and Maxine and sent them the necessary sheets/txt files so they could begin on their Crunchbase project.&lt;br /&gt;
*I also made these graphics to better represent what our collaborative work would look like, and what the final project would include:&lt;br /&gt;
 https://docs.google.com/document/d/13Mb7lOLydm9r-ENYxSlZJVGgY9wxClATR6Hy8F9YK1Y/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
06/20/2018 - &lt;br /&gt;
*Talked with Ed about project details. &lt;br /&gt;
*Began looking through the Accelerator Master List to better understand project description. &lt;br /&gt;
*Sent Grace and Maxine the relevant company names listed in the Accelerator Master Spreadsheet so they could begin using their relevant parsers and tools to sort through data.&lt;br /&gt;
&lt;br /&gt;
06/19/2018 - &lt;br /&gt;
*Set up work stations on balcony, trained&lt;br /&gt;
&lt;br /&gt;
06/18/2018 - &lt;br /&gt;
*Trained, met other interns&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23830</id>
		<title>U.S. Seed Accelerators</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23830"/>
		<updated>2018-07-30T17:39:30Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: /* Update for Hira */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=U.S. Seed Accelerators&lt;br /&gt;
|Has owner=Connor Rothschild,&lt;br /&gt;
|Has start date=06/18/2018&lt;br /&gt;
|Has keywords=accelerators, data&lt;br /&gt;
|Has notes=Continuation of [[Accelerator Data]]&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Is dependent on=Industry Classifier, Demo Day Page Parser, Accelerator Demo Day, Demo Day Page Google Classifier, Crunchbase Accelerator Equity, Crunchbase Accelerator Founders, Crunchbase Data,&lt;br /&gt;
|Has image=Seed-Accelerator.jpg&lt;br /&gt;
|Does subsume=Accelerator Data, Accelerator Seed List (Data),&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Project Location==&lt;br /&gt;
The master file can be found at &lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
==Relevant Former Projects==&lt;br /&gt;
This page serves as an updated and tidied version of the data and work presented on the [[Accelerator Seed List (Data)]] Project, which subsumed [[Accelerator Data]].&lt;br /&gt;
Both of these projects (and as a corollary, this project) are dependent on the [[Demo Day Page Parser]], [[Industry Classifier]], and the [[Whois Parser]].&lt;br /&gt;
&lt;br /&gt;
==Update for Hira==&lt;br /&gt;
&lt;br /&gt;
===Recoded Founders Education===&lt;br /&gt;
&lt;br /&gt;
I have recoded two components of the founders' education sheet:&lt;br /&gt;
&lt;br /&gt;
1) Degree name has been reclassified into nine categories:&lt;br /&gt;
*High School&lt;br /&gt;
*Associates&lt;br /&gt;
*Bachelors&lt;br /&gt;
*Masters&lt;br /&gt;
*Certificate&lt;br /&gt;
*JD&lt;br /&gt;
*MBA&lt;br /&gt;
*PhD&lt;br /&gt;
*Other&lt;br /&gt;
&lt;br /&gt;
2) Majors have also been recoded into nine categories:&lt;br /&gt;
*H = Humanities&lt;br /&gt;
*SS = Social Sciences&lt;br /&gt;
*NS = Natural Sciences&lt;br /&gt;
*E = Engineering (includes computer science)&lt;br /&gt;
*B = Business and Economics&lt;br /&gt;
*L = Leadership&lt;br /&gt;
*MBA&lt;br /&gt;
*JD&lt;br /&gt;
*O = Other&lt;br /&gt;
&lt;br /&gt;
The Google Sheet I used to reclassify can be found here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1XWtCTeaof8WxAuOCbn3XEFaK0sh-ZIif2JqcdbkH72I/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
The sheet '''OLD''' contains old, outdated data, '''WORK''' contains the process of reclassifying, and '''Updated Info''' contains only good, updated data.&lt;br /&gt;
&lt;br /&gt;
The data has been merged into '''The File to Rule Them All'''. Old data has not yet been deleted but can be at any time, since we have it saved in the Google Sheet '''OLD''' (I just wanted the go-ahead from Hira or Ed).&lt;br /&gt;
&lt;br /&gt;
===Multiple campuses and cohorts===&lt;br /&gt;
&lt;br /&gt;
The File to Rule Them All contains an updated address variable. 139 of our 166 accelerators have addresses that are available online. The ones we could not find information for are left blank.&lt;br /&gt;
&lt;br /&gt;
The collaborative sheet that Hira, Maxine, and I worked on to update this list can be accessed here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1nktgJZfm3L8IsSCHgYbPasSdvKb7QHKxZp8K5Si2iNo/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
We have also added a new cohort list. Under The File To Rule Them All, the sheet '''Multiple Campuses''' contains the different locations for accelerators with multiple campuses. Column B can be used to filter out multiple location accelerators.&lt;br /&gt;
&lt;br /&gt;
===Fixed Google Sheet===&lt;br /&gt;
&lt;br /&gt;
We created a new sheet with only data we want to keep, and cleaned it up. That sheet is called &amp;quot;Good Data Only&amp;quot;, at the same link:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
I first used our &amp;quot;recap&amp;quot; and &amp;quot;announced&amp;quot; classification to standardize and fix the dates.&lt;br /&gt;
&lt;br /&gt;
*Columns N-R contain our new data. Please note that all of these columns are based on formulas and will be made erroneous if edited.&lt;br /&gt;
*Column N is the # of weeks for an accelerator program, gathered via VLookup from The File to Rule Them All.&lt;br /&gt;
*Column O is the Actual Date we want to record, and was gathered by subtracting the # of weeks from a date '''if''' the listed page was a '''recap'''.&lt;br /&gt;
*Columns P and Q are the month and year extracted from Column O.&lt;br /&gt;
*Finally, Column R is the season variable, as Ed said it should be coded. &lt;br /&gt;
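&lt;br /&gt;
As a rough illustration of the column logic above, here is a minimal Python sketch. It is not the sheet's actual formulas: the function names are hypothetical, leaving non-recap dates unchanged is an assumed reading of the rule for Column O, and the month-to-season mapping is an assumption, since the coding Ed specified is not recorded on this page.&lt;br /&gt;
&lt;br /&gt;
```python
from datetime import date, timedelta

# Hypothetical season coding by month; the real coding in Column R is not documented here.
SEASONS = {12: 'Winter', 1: 'Winter', 2: 'Winter',
           3: 'Spring', 4: 'Spring', 5: 'Spring',
           6: 'Summer', 7: 'Summer', 8: 'Summer',
           9: 'Fall', 10: 'Fall', 11: 'Fall'}

def actual_date(listed, weeks, page_type):
    # Column O: a recap page is dated at the end of the program,
    # so subtract the program length (Column N); other pages are kept as-is.
    if page_type == 'recap':
        return listed - timedelta(weeks=weeks)
    return listed

def timing_columns(listed, weeks, page_type):
    start = actual_date(listed, weeks, page_type)
    # Columns P, Q, and R: month, year, and season taken from Column O.
    return start, start.month, start.year, SEASONS[start.month]

print(timing_columns(date(2018, 6, 15), 12, 'recap'))
```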
&lt;br /&gt;
We have also gone through and removed all bad data, all duplicates, and all rows without timing info. This is the most complete list possible.&lt;br /&gt;
&lt;br /&gt;
===Recode employee count===&lt;br /&gt;
&lt;br /&gt;
I have added a new column in Cohorts Final (of The File to Rule Them All) but left the old column in case you would prefer to edit/classify differently.&lt;br /&gt;
Column AB (emp_count_scale) contains a variable coded on a scale of 1 to 9, with each number corresponding to one of the employee_count classifications (1 the lowest, 9 the highest). The exact output can be modified (1 could instead be tiny, 2 be small... 9 be huge).&lt;br /&gt;
&lt;br /&gt;
The employee count column is standardized and can easily be edited given some modification of the Excel formula.&lt;br /&gt;
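&lt;br /&gt;
A minimal Python sketch of the 1 to 9 recoding follows. The nine bucket labels are assumed to be the standard Crunchbase employee_count ranges and should be checked against the values actually present in Cohorts Final; the names BUCKETS, SCALE, and emp_count_scale are hypothetical (the real column is built with an Excel formula).&lt;br /&gt;
&lt;br /&gt;
```python
# Assumed Crunchbase employee_count buckets, ordered smallest to largest.
BUCKETS = ['1-10', '11-50', '51-100', '101-250', '251-500',
           '501-1000', '1001-5000', '5001-10000', '10000+']

# 1 is the lowest bucket, 9 the highest.
SCALE = {bucket: rank for rank, bucket in enumerate(BUCKETS, start=1)}

def emp_count_scale(employee_count):
    # Blank or unrecognized values return None so the cell can stay empty.
    if not employee_count:
        return None
    return SCALE.get(employee_count.strip())

print(emp_count_scale('51-100'))
```
&lt;br /&gt;
Swapping the integers for labels (tiny, small, ... huge) would only require changing the values stored in SCALE.&lt;br /&gt;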
&lt;br /&gt;
===Normalized investment amount===&lt;br /&gt;
&lt;br /&gt;
I've been trying to fix the investment amount, but I think it's smart that we discuss this before I move forward. I've done some tentative standardization (finding the average in a range if given), but so many accelerators &amp;quot;take up to __%&amp;quot; equity and &amp;quot;invest up to $___&amp;quot; that I think it's smartest to think hard about how to standardize first. Also notable is that some accelerators say they provide &amp;quot;$____ up front and another $___ in follow up funding for each stage.&amp;quot; How do we deal with these? Message me if you'd like to talk more about this.&lt;br /&gt;
&lt;br /&gt;
I refrained from creating max and min investment columns to avoid spending time on the data before we discuss it.&lt;br /&gt;
&lt;br /&gt;
===Remaining to do===&lt;br /&gt;
&lt;br /&gt;
===Recoding Founders Experience===&lt;br /&gt;
&lt;br /&gt;
I've begun recoding the Founders variable. I can categorize 232 of the 824 listed job titles (there are 466 distinct job titles) into 9 categories (the most frequent results as given in a PivotTable found in the sheet).&lt;br /&gt;
&lt;br /&gt;
We will need to talk about how to categorize the most extraneous job titles.&lt;br /&gt;
&lt;br /&gt;
==Recent Work==&lt;br /&gt;
&lt;br /&gt;
Here's a project update on the work that has been done since coming to McNair. The most recent file is&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
===Merging Cohort Companies with Crunchbase Info===&lt;br /&gt;
&lt;br /&gt;
More information on this part of the project can be found on the page [[Merging Existing Data with Crunchbase]].&lt;br /&gt;
&lt;br /&gt;
The newest updated sheet of cohort company info is under the '''Cohorts Final''' sheet of '''The File to Rule Them All.xlsx'''.&lt;br /&gt;
&lt;br /&gt;
Working with [[Maxine Tao]], we have matched companies to their respective pages and information found in Crunchbase (via UUID). We ensured single matches by doing a 1-1-1-1 match with our data and with Crunchbase (using the Matcher). We then received additional information on 8092 companies. The following new information (on top of what we already had) is included in the sheet:&lt;br /&gt;
*short_description&lt;br /&gt;
*long_description&lt;br /&gt;
*category_list (details the company's category)&lt;br /&gt;
*category_group_list (a less refined, more all-encompassing category classification)&lt;br /&gt;
*founded_on date&lt;br /&gt;
*employee_count&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*address&lt;br /&gt;
&lt;br /&gt;
And the following information was also pulled from Crunchbase and '''merged''' with our existing data: &lt;br /&gt;
*URL (was merged with courl cells) &lt;br /&gt;
*city (was merged with colocation)&lt;br /&gt;
*state_code (was merged with colocation)&lt;br /&gt;
*country_code (was merged with colocation)&lt;br /&gt;
*status (was merged with costatus)&lt;br /&gt;
&lt;br /&gt;
===The Equity Variables===&lt;br /&gt;
&lt;br /&gt;
[[Maxine Tao]] and I have added six new variables to the '''Accelerators Final''' sheet. Those variables are:&lt;br /&gt;
&lt;br /&gt;
*Terms of joining - terms of joining accelerator and important details about program&lt;br /&gt;
*Equity? (1/0) - cells contain a 1 if the accelerator takes equity, a 0 if the accelerator definitively does not, and are blank if we could not find that information&lt;br /&gt;
*Equity Amount - the % of equity the accelerator will take (can sometimes be a range (e.g. 5-7%))&lt;br /&gt;
*Equity Amount Normalized - this copies the previous column but only keeps %&amp;gt;0, and if a range was given (e.g. 5-7%) it returns the average (e.g. 6%)&lt;br /&gt;
*Investment Amount - the $ the accelerator invests in a company to begin, if relevant (also could be a range or an &amp;quot;up to $######&amp;quot;)&lt;br /&gt;
*Investment Notes - anything to comment on previous 4 columns&lt;br /&gt;
&lt;br /&gt;
These six variables tell us more about the characteristics of accelerators; specifically, which ones take equity and which ones do not, and how much equity accelerators take.&lt;br /&gt;
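&lt;br /&gt;
The range-averaging rule behind Equity Amount Normalized can be sketched in Python as below. This is only an illustration under stated assumptions: the function name normalize_equity is hypothetical, the sheet itself uses spreadsheet formulas, and entries such as &amp;quot;up to 8%&amp;quot; are simply read as 8 here, which may not match how they were coded by hand.&lt;br /&gt;
&lt;br /&gt;
```python
import re

def normalize_equity(raw):
    # Pull every number out of strings such as '6%', '5-7%', or '4%-10%'.
    numbers = [float(n) for n in re.findall(r'\d+(?:\.\d+)?', raw or '')]
    if not numbers:
        return None  # blank cell: no equity information
    # A single value is returned unchanged; a range becomes its average.
    value = sum(numbers) / len(numbers)
    # Only percentages above zero are kept.
    return value if value else None

print(normalize_equity('5-7%'))
```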
&lt;br /&gt;
Relevant information: &lt;br /&gt;
*82 accelerators take equity, 42 do not, and we lack information for 37.&lt;br /&gt;
*The average % of equity among accelerators who take equity (rough estimate--do not use for anything official) is 6.49%. I got this number by looking only at accelerators who take equity, averaging the equity amount for accelerators who report a range (e.g. 4%-10% equity would be coded as 7% equity), and taking the mean.&lt;br /&gt;
&lt;br /&gt;
===Matching Accelerators to UUIDs via Crunchbase===&lt;br /&gt;
&lt;br /&gt;
We've also added UUIDs for 163 of our 166 accelerators. The UUIDs can be found in Column AE of the '''Accelerators Final''' sheet.&lt;br /&gt;
&lt;br /&gt;
The file with accelerators matched to Crunchbase UUIDs can be found at:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Accelerators and UUIDs.xlsx&lt;br /&gt;
&lt;br /&gt;
This is the master file and should never be modified unless we find a UUID changed. ALL OTHER SHEETS with UUIDs are linked to this sheet so its changes will be reflected elsewhere.&lt;br /&gt;
&lt;br /&gt;
More information can be found on the [[Crunchbase Data]] page. &lt;br /&gt;
&lt;br /&gt;
===Linking Accelerators to Founders/LinkedIn Crawling===&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] got the [[LinkedIn Crawler (Python)]] to work, which means we currently have the following information about accelerator founders:&lt;br /&gt;
*Current Job Title&lt;br /&gt;
*Location&lt;br /&gt;
*Employer&lt;br /&gt;
*Job(s) Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location of jobs&lt;br /&gt;
*Extra Description&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
This information can be found in the various '''Founders''' sheets in '''The File to Rule Them All'''.&lt;br /&gt;
&lt;br /&gt;
===Finding Company URLs===&lt;br /&gt;
Excel master datasets are in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018&lt;br /&gt;
&lt;br /&gt;
Code and files specific to this URL finder are in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\url finder&lt;br /&gt;
&lt;br /&gt;
====Results====&lt;br /&gt;
I used STEP1_crawl.py and STEP2_findcorrecturl.py to add approximately 1000 more URLs into 'The File to Rule Them All.xlsx'.&lt;br /&gt;
&lt;br /&gt;
====Testing====&lt;br /&gt;
&lt;br /&gt;
In this file (sheet: 'Most Recent Merged Data'; note that this is just a copy of 'Cohorts Final' in 'The File to Rule Them All'):&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\Merged W Crunchbase Data as of July 17.xlx&lt;br /&gt;
&lt;br /&gt;
We filter for companies (~4000) that did not receive VC, are not in Crunchbase, and do not have URLs.&lt;br /&gt;
Using a Google crawler (STEP1_crawl.py) and a URL matching script (STEP2_findcorrecturl.py), we will try to find as many URLs as possible.&lt;br /&gt;
&lt;br /&gt;
To test, I ran about 40 companies from &amp;quot;smallcompanylist.txt&amp;quot;, using only the company name as a search term and taking the first 4 valid results (see the don't-collect list in the code). The Google crawler and URL matcher were able to correctly identify around 20 URLs. They also misidentify some URLs that look really similar to the company name, but they are accurate for the most part if the name is not too generic. I then ran the 20 unfound company names through the crawler again, this time using company name + startup as the search term. This did not identify any more correct URLs.&lt;br /&gt;
&lt;br /&gt;
It seems reasonable to assume that if the company URL cannot be found within the first 4 valid search results, then that company probably does not have a URL at all. This is the case for many of the 20 unfound URLs from my test run above.&lt;br /&gt;
&lt;br /&gt;
====Actual Run Info====&lt;br /&gt;
&lt;br /&gt;
The companies we needed to find URLs for are in a file called 'ACTUALNEEDEDCOMPANIES.txt'.&lt;br /&gt;
&lt;br /&gt;
The first four results for every company, as found by STEP1_crawl.py, are in 'ACTUAL_crawled_company_urls.txt'.&lt;br /&gt;
&lt;br /&gt;
The results after the matching done by STEP2_findcorrecturl.py are in 'ACTUAL_finalurls.txt'.&lt;br /&gt;
&lt;br /&gt;
Note that in the end, I decided to only take URLs that were given a match score of greater than 0.9 by setting this restriction in STEP2_findcorrecturl.py. Then I manually removed any duplicates/inaccurate results. If you want, you can set the threshold lower in STEP2 and use STEP3_clean.py to find the URL with the highest score for each company. &lt;br /&gt;
&lt;br /&gt;
The point of this URL finder is to find timing information for companies. Timing information can be found on Whois. See the page http://mcnair.bakerinstitute.org/wiki/Whois_Parser#Summer_2018_Work for information on running the whois parser. &lt;br /&gt;
&lt;br /&gt;
====Using Python files====&lt;br /&gt;
'''To use STEP1_crawl.py''': &lt;br /&gt;
 INPUT: a list of company names (or anything) you would like to find websites for by searching on google&lt;br /&gt;
 OUTPUT: a list of company names and the top X number of results from google &lt;br /&gt;
&lt;br /&gt;
1. Change LIST_FILEPATH in line 26 to be the name of the file that contains the list of things you would like to search. &lt;br /&gt;
&lt;br /&gt;
2. Change NUMRESULT to be however many results you would like from Google. &lt;br /&gt;
&lt;br /&gt;
3. Adjust DONT_COLLECT to include any websites that you don't want. &lt;br /&gt;
&lt;br /&gt;
4. If you would like to add another search keyword, add this in line 87 which is queries.append(name + &amp;quot;whatever you want here&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
5. Change line 127 to be the name of your output file.&lt;br /&gt;
&lt;br /&gt;
'''To use STEP2_findcorrecturl.py''':&lt;br /&gt;
 INPUT: output file from STEP1&lt;br /&gt;
 OUTPUT: a file formatted the same as the output of STEP1, but URLs that do not match over the threshold value you set will be replaced with &amp;quot;no match&amp;quot; &lt;br /&gt;
&lt;br /&gt;
1. Change file f to be the output file name from STEP1. Change g to be the desired name of the output file for this part. &lt;br /&gt;
&lt;br /&gt;
2. In the if statement on line 44, set your desired threshold. Note that anything greater than 0.6 is generally considered a decent match. It might be safer to use 0.75; my use of 0.9 ensures almost exact matches. However, you should consider that if your list of companies (or whatever you are searching) includes really common names, then a 90%+ match might not be exactly what you're looking for.&lt;br /&gt;
&lt;br /&gt;
'''To use STEP3_clean.py''':&lt;br /&gt;
&lt;br /&gt;
Note this is an optional step to use depending on the accuracy level you need and what kind of data you crawled earlier. I chose not to use this and instead set a more restrictive threshold in STEP2. &lt;br /&gt;
&lt;br /&gt;
1. Change file f to be the output file from STEP2 (you should delete anything that says &amp;quot;no match&amp;quot;, and when you use STEP2, you must also write the ratio score to the text file). Change g to be the desired name of the output file for this part. &lt;br /&gt;
&lt;br /&gt;
Your output should be a text file containing the company name and the URL that had the highest assigned score in STEP2. In case of more than 1 URL with the highest score, the script should take the first one.&lt;br /&gt;
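&lt;br /&gt;
To make the threshold and best-score steps concrete, here is a minimal Python sketch of the idea. It is not the code in STEP2_findcorrecturl.py or STEP3_clean.py: scoring with the standard library's difflib, comparing the company name against the first label of the domain, and the names domain_label and best_url are all assumptions made for illustration. URLs are assumed to include a scheme such as https://, and get_close_matches keeps scores at or above the cutoff, which differs slightly from a strictly-greater-than rule.&lt;br /&gt;
&lt;br /&gt;
```python
from difflib import get_close_matches
from urllib.parse import urlparse

def domain_label(url):
    # 'https://www.acmelabs.com/about' becomes 'acmelabs'.
    host = urlparse(url).netloc.lower()
    if host.startswith('www.'):
        host = host[4:]
    return host.split('.')[0]

def best_url(company, urls, threshold=0.9):
    # Normalize the company name: lowercase, letters and digits only.
    name = ''.join(ch for ch in company.lower() if ch.isalnum())
    by_label = {}
    for url in urls:
        # setdefault keeps the first URL seen for each domain label.
        by_label.setdefault(domain_label(url), url)
    # cutoff drops candidates whose similarity ratio falls below the threshold;
    # n=1 asks for the single highest-scoring candidate.
    hits = get_close_matches(name, list(by_label), n=1, cutoff=threshold)
    return by_label[hits[0]] if hits else 'no match'

# Hypothetical crawl results for one company.
crawled = ['https://www.crunchbase.com/organization/acme-labs',
           'https://acmelabs.com/about']
print(best_url('Acme Labs', crawled))
```
&lt;br /&gt;
Lowering the threshold argument plays the same role as lowering the value in the STEP2 if statement: more candidates survive, at the cost of more false matches for generic names.&lt;br /&gt;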
&lt;br /&gt;
==An Overview==&lt;br /&gt;
This project will be used to determine which accelerators are the most effective at churning out successful startups, as well as what characteristics are exhibited by these accelerators. First, we need to gather as much data as we can about as many accelerators as we can in order to look at factors that differentiate successful vs. unsuccessful ventures. Next, we need to create a web crawling program which will gather information about accelerators across the world by accessing their websites and extracting information. I believe that our overall goal with this research project is to gain insight into the methods of successful accelerators, as well as to find out what exactly differentiates very successful accelerators from dead accelerators.&lt;br /&gt;
&lt;br /&gt;
Helpful Links: http://seedrankings.com/&lt;br /&gt;
&lt;br /&gt;
==Remaining To Dos==&lt;br /&gt;
&lt;br /&gt;
The last update on [[Accelerator Seed List (Data)]] said the following needed to be done: &lt;br /&gt;
&lt;br /&gt;
*Cross-reference sheet with data from Peter's old accelerator consolidation file (&amp;quot;accelerator_data_noflag&amp;quot; and &amp;quot;accelerator_data&amp;quot; in &amp;quot;All Relevant Files&amp;quot;) and fill in missing data&lt;br /&gt;
*Variables that are 100% NOT in these 2 files:&lt;br /&gt;
**Cohort Breakout?&lt;br /&gt;
**Subtype&lt;br /&gt;
**Designed for Students?&lt;br /&gt;
**Campuses&lt;br /&gt;
**Stage&lt;br /&gt;
**Software Tech&lt;br /&gt;
**What stage do they look for?&lt;br /&gt;
&lt;br /&gt;
==(Outdated) Necessary Steps==&lt;br /&gt;
&lt;br /&gt;
'''Acquiring the necessary data to complete the Accelerator Master Variable List and the Cohort List will require the following (not necessarily in this order):'''&lt;br /&gt;
&lt;br /&gt;
===Step Zero: Connect to Crunchbase and Link Data===&lt;br /&gt;
&lt;br /&gt;
Complete - more info: [[Crunchbase Data]]&lt;br /&gt;
&lt;br /&gt;
===Step One: LinkedIn Founders Data===&lt;br /&gt;
&lt;br /&gt;
This project will begin by working with [[Grace Tan]] and [[Maxine Tao]] to connect accelerators to their founders and cohort companies using Crunchbase and LinkedIn crawlers. Grace and Maxine will go through [http://crunchbase.com Crunchbase] and find the UUID for companies and their founders (reference [[Crunchbase Data]], [[Crunchbase Accelerator Founders]], [[Crunchbase Accelerator Equity]]). Connect them using SQL and feed the names of founders into our LinkedIn crawler (headed by [[Grace Tan]]).&lt;br /&gt;
&lt;br /&gt;
The list of founders for accelerators can be found at&lt;br /&gt;
 McNair/Projects/Accelerators/Fall 2017/founders_linkedin.txt&lt;br /&gt;
&lt;br /&gt;
The '''Unfound Founders''' file codes a 0 for all companies '''''not listed''''' within the LinkedIn Founders file, and a 1 for those that do have founders listed.&lt;br /&gt;
&lt;br /&gt;
'''Given the founders' names, we will then be able to use the [[LinkedIn Crawler (Python)]] to find the relevant details of an accelerator founder (education, work experience, etc.). This data on founders will help us solve the horse, jockey, racetrack question to detect what variables affect a startup's success (the accelerator, the founders, the environment/city).'''&lt;br /&gt;
&lt;br /&gt;
===Step Two: Linking Accelerators to Cohorts Using Investments on Crunchbase===&lt;br /&gt;
&lt;br /&gt;
In this step we focus on accelerators who take equity from the companies that engage in their program. We do this to prevent looking at accelerators who may also run funds/invest in various companies but do not take equity. This would provide us misleading results and lead us to believe some companies are in cohorts at accelerators that they are really not. &lt;br /&gt;
&lt;br /&gt;
Maxine will acquire the list of accelerators who take equity from companies from the following sheet:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/All Relevant Files/accelerator_data_noflag.txt&lt;br /&gt;
&lt;br /&gt;
Looking at the file, however, shows that very few are actually categorized well and the equity variable is messy. Moving forward, we need to check/refine/fix this classification.&lt;br /&gt;
&lt;br /&gt;
This file has 266 rows. The most recent, actual version of our accelerator database (found at E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx under the sheet Master Variable List) only has 167 rows, meaning the accelerator_data_noflag.txt file has too many rows. &lt;br /&gt;
&lt;br /&gt;
We will need to do a left join of Accelerator Master List with accelerator_data_noflag.txt to get rid of the accelerator names that are in accelerator_data_noflag but NOT in Accelerator Master List.&lt;br /&gt;
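&lt;br /&gt;
The left join described above can be sketched in Python as follows. This is a sketch under assumptions rather than the query that was actually run: the field names 'name' and 'equity' and the sample rows are hypothetical, and real accelerator names would likely need normalizing before an exact match works.&lt;br /&gt;
&lt;br /&gt;
```python
def left_join(master_names, noflag_rows):
    # Index the larger file by accelerator name for quick lookup.
    by_name = {row['name']: row for row in noflag_rows}
    # Every accelerator in the master list is kept; names that appear only
    # in accelerator_data_noflag are dropped, and unmatched ones get None.
    return [(name, by_name.get(name)) for name in master_names]

# Hypothetical sample data.
master = ['Techstars', 'Y Combinator']
noflag = [{'name': 'Techstars', 'equity': 'Y'},
          {'name': 'Some Other Accelerator', 'equity': 'N'}]

for name, row in left_join(master, noflag):
    print(name, row['equity'] if row else 'no equity info')
```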
&lt;br /&gt;
Once this is finished, we should have an “Equity” classification variable for every accelerator in Accelerator Master List. The accelerators that have a Y (or maybe it’s a 1) are the ones that take equity. These are the accelerators we’ll be able to do the Crunchbase work on to see when accelerators take equity.&lt;br /&gt;
&lt;br /&gt;
We then look at the accelerators' investments (or companies and the entities which invested in them), cross-reference the list of companies/accelerators, and once we find a match, we know that a company went through an accelerator and during which year they went through a cohort.&lt;br /&gt;
&lt;br /&gt;
'''From this, we get the following data:'''&lt;br /&gt;
*Accelerator a given company went through&lt;br /&gt;
*Year said company went through a cohort/Specific cohort company went through&lt;br /&gt;
&lt;br /&gt;
===Step Three: Demo Day Crawler===&lt;br /&gt;
&lt;br /&gt;
This part of the project relies on the contributions of the wonderful [[Minh Le]]. Better documentation for the project can be found on the [[Demo Day Page Parser]], [[Demo Day Page Google Classifier]], and [[Accelerator Demo Day]] project pages.&lt;br /&gt;
&lt;br /&gt;
Essentially, this part of the accelerator data project will use the [[Demo Day Page Parser]] to look through accelerator websites for pages which list a cohort's 'Demo Day', or the day on which accelerators present their companies to a group of special investors ([https://www.ycombinator.com/demoday/faq/ here's an example FAQ page from Y Combinator]). The [[Demo Day Page Google Classifier]] will then determine if the page is, in fact, a demo day page.&lt;br /&gt;
&lt;br /&gt;
'''Given a cohort's demo day, we can gather a few pieces of key information (check with Ed to make sure this is the correct information to gather from Demo Days):'''&lt;br /&gt;
*The date a cohort began/the season the cohort went through the accelerator&lt;br /&gt;
**This is acquired by looking at the cohort's demo day date, and subtracting the number of weeks/months of a cohort for that given accelerator. The length of a cohort can be found in the file:&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx&lt;br /&gt;
*I assume we can also acquire the companies in a specific cohort, as in we will have the list of all companies in the (for example) Spring 2018 Cohort of Techstars.&lt;br /&gt;
&lt;br /&gt;
===Step Four: Non-profit Finder===&lt;br /&gt;
&lt;br /&gt;
More at [[Non-profit Finder]]&lt;br /&gt;
&lt;br /&gt;
Another important step in this project is finding which accelerators are non-profits.&lt;br /&gt;
&lt;br /&gt;
A comprehensive list of nonprofits taken from the [https://www.irs.gov/charities-non-profits/tax-exempt-organization-search-bulk-data-downloads IRS] can be found here:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Nonprofits in US.xlsx&lt;br /&gt;
Warning: this file has 1 million rows&lt;br /&gt;
&lt;br /&gt;
This file should be cross-referenced with the list of accelerators to find which ones are listed--those accelerators are non-profits. We'll need someone good at SQL to do this with a match.&lt;br /&gt;
&lt;br /&gt;
Potential problem:&lt;br /&gt;
*The names of listed non-profits will likely be somewhat different than the names of accelerators, because companies often file for tax exemption with different names than they show the public.&lt;br /&gt;
&lt;br /&gt;
==Workflow Image==&lt;br /&gt;
&lt;br /&gt;
I imagine the project will look something like this, in that it will require the following information to fully complete the project (the color coding for this image is a very rough and preliminary estimate, subject to change):&lt;br /&gt;
&lt;br /&gt;
[[File:image1.png]]&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Connor_Rothschild_(Work_Log)&amp;diff=23813</id>
		<title>Connor Rothschild (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Connor_Rothschild_(Work_Log)&amp;diff=23813"/>
		<updated>2018-07-27T21:54:21Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: /* Summer 2018 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;===Summer 2018===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
7/27/2018 -&lt;br /&gt;
*Recoded founders&lt;br /&gt;
*Fixed multiple campuses and cohorts&lt;br /&gt;
*Fixed the Google Sheet&lt;br /&gt;
&lt;br /&gt;
7/26/2018 -&lt;br /&gt;
*Cleaned up and fixed the Google Sheet with timing info. &lt;br /&gt;
*Recoded the employee count variable. &lt;br /&gt;
*Normalized investment amount&lt;br /&gt;
&lt;br /&gt;
7/25/2018 -&lt;br /&gt;
*Created a comprehensive Google Sheet with new timing info, collaborated with other interns to find data. Cleaned up sheet.&lt;br /&gt;
&lt;br /&gt;
7/24/2018 -&lt;br /&gt;
Sick day :(&lt;br /&gt;
&lt;br /&gt;
7/23/2018 - &lt;br /&gt;
*Helped Minh with Demo Day information.&lt;br /&gt;
&lt;br /&gt;
7/19/2018 -&lt;br /&gt;
*Helped Minh with training data for Demo Day Crawler&lt;br /&gt;
&lt;br /&gt;
7/18/2018 - &lt;br /&gt;
*Helped Augi with MA cleaning&lt;br /&gt;
*Talked to Minh about Demo Day progress&lt;br /&gt;
&lt;br /&gt;
7/17/2018 - &lt;br /&gt;
*Worked with Ed to add/merge data from Crunchbase to existing data. This was a replication of the process but done by Ed in SQL, not Excel. New data can be found in &lt;br /&gt;
 /McNair/Projects/Accelerators/Summer 2018/'''Merged With Crunchbase Info as of July 17.xlsx'''&lt;br /&gt;
&lt;br /&gt;
NOTE: Use this data rather than the sheet mentioned in yesterday's entry.&lt;br /&gt;
&lt;br /&gt;
7/16/2018 - &lt;br /&gt;
*Merged cohort company data with Crunchbase data, by doing a Vlookup then cleaning up data. I used a =IF(A2=&amp;quot;&amp;quot;,B2,A2) formula to merge cells only when blanks were present. This provided us updated data for four columns:&lt;br /&gt;
**colocation (removed 6324 blanks)&lt;br /&gt;
**codescription (removed 5151 blanks)&lt;br /&gt;
**costatus (removed 7342 blanks)&lt;br /&gt;
**courl (removed 6670 blanks)&lt;br /&gt;
and new columns:&lt;br /&gt;
**address&lt;br /&gt;
**founded_on date&lt;br /&gt;
**employee_count&lt;br /&gt;
**linkedin_url&lt;br /&gt;
&lt;br /&gt;
These new variables can be found in:&lt;br /&gt;
 /McNair/Projects/Accelerators/Summer 2018/'''Crunchbase Info Populated Empty Cells.xlsx''' (OUTDATED:: DON'T USE)&lt;br /&gt;
&lt;br /&gt;
Upon Ed's approval, I'll move this sheet to replace Cohort Companies in '''The File to Rule Them All'''.&lt;br /&gt;
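&lt;br /&gt;
For reference, the blank-filling rule used in the 7/16 merge can be sketched in Python. This only mirrors the Excel IF formula quoted above; the function name fill_blank and the sample rows are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def fill_blank(ours, crunchbase):
    # Same rule as the Excel IF formula: keep our value unless the cell is blank,
    # in which case take the Crunchbase value.
    return crunchbase if ours == '' else ours

# Hypothetical rows: (our colocation value, Crunchbase location value).
rows = [('Houston, TX', 'Houston'), ('', 'Austin, TX'), ('', '')]
merged = [fill_blank(ours, cb) for ours, cb in rows]
print(merged)
```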
&lt;br /&gt;
7/13/2018 -&lt;br /&gt;
*Using SQL, matched our cohort companies with information from Crunchbase. This gave us a lot of new information, like employee counts, company status, the date founded, and the location of the company. This data can be found here:&lt;br /&gt;
 /McNair/Projects/Accelerators/Summer 2018/'''Cohort Companies With Crunchbase Info.xlsx'''&lt;br /&gt;
&lt;br /&gt;
7/12/2018 -&lt;br /&gt;
&lt;br /&gt;
*Created 'The File to Rule them All' with finalized info on accelerators, cohort companies, and founders.&lt;br /&gt;
*Attempted to match our company data to Crunchbase data with SQL to get more info on companies.&lt;br /&gt;
&lt;br /&gt;
7/11/2018 -&lt;br /&gt;
&lt;br /&gt;
*Worked on LinkedIn Founders data. Cleaned up data, removed duplicates, checked for fidelity.&lt;br /&gt;
*Worked with Maxine to finish Crunchbase matching.&lt;br /&gt;
&lt;br /&gt;
7/10/2018 - &lt;br /&gt;
*Merged Clean Cohort Data (Veeral) and Cohort List (new) in the Accelerator Master Variable List file. Cross-referenced this list with Ed's data sent last week, titled accelerator_data_noflag.txt. We found that there are 4866 more entries in the new merged file, meaning Ed's merging may have dropped valid entries. (This was after filtering the list so we only looked at the accelerators on our list).&lt;br /&gt;
&lt;br /&gt;
7/9/2018 - &lt;br /&gt;
*Worked with Maxine to remove duplicates/gather clean data for Crunchbase matching&lt;br /&gt;
&lt;br /&gt;
06/29/2018 - &lt;br /&gt;
*Finished manually coding an equity variable in Master Variable List sheet (with the help of [[Maxine Tao]]).&lt;br /&gt;
*Finished editing terms of joining accelerator.&lt;br /&gt;
*Given the above two tasks, there are five new columns in our Master Variable List sheet: &lt;br /&gt;
**Terms of joining - terms of joining accelerator and important details about program&lt;br /&gt;
**equity (1/0) - cells contain a 1 if the accelerator takes equity, a 0 if the accelerator definitively does not, and are blank if we could not find that information&lt;br /&gt;
**equity amount - the % of equity the accelerator will take (can sometimes be a range (e.g. 5-7%))&lt;br /&gt;
**investment - the $ the accelerator invests in a company to begin, if relevant (also could be a range or an &amp;quot;up to $######&amp;quot;)&lt;br /&gt;
**notes - anything to comment on previous 4 columns&lt;br /&gt;
*Taught [[Maxine Tao]] how to VLookup :D&lt;br /&gt;
&lt;br /&gt;
06/28/2018 -&lt;br /&gt;
*Began manually coding an equity variable in Master Variable List sheet. &lt;br /&gt;
*Edited terms of joining accelerator. &lt;br /&gt;
*Helped Grace with LinkedIn crawler.&lt;br /&gt;
&lt;br /&gt;
06/27/2018 - &lt;br /&gt;
*Finished coding duplicates. Final file can be found at:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Duplicate Companies.xlsx&lt;br /&gt;
&lt;br /&gt;
*Dylan taught interns Excel skills&lt;br /&gt;
&lt;br /&gt;
06/26/2018 - &lt;br /&gt;
*Began coding duplicates in CohortMainBaseWCounts.txt file that Ed sent. Sorted by company name alphabetically, then used conditional formatting to highlight when an accelerator had the same name as the accelerator above. This narrowed down the results to instances in which a company would go through the same accelerator twice. Most of the time, this was due to an error with the normalizer, so I moved those un-normalized company names to their own sheet and deleted them from the file.&lt;br /&gt;
&lt;br /&gt;
06/25/2018 - &lt;br /&gt;
*Went through and manually fixed discrepancies between our accelerator data and the Crunchbase data, found at &lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Accelerators Matched by Name and Homepage URL.xlsx&lt;br /&gt;
&lt;br /&gt;
*Finalized a sheet with a list of accelerator names as we code them, as Crunchbase codes them, and the appropriate UUID for each accelerator. I recommend updating the names in our spreadsheet of accelerators to the Crunchbase list so that we will be able to look up that name without having an in-between. The list can be found in the rightmost columns here:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Accelerator Master Variable List - Revised by Ed V2.xlsx&lt;br /&gt;
and here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1n1sX5DqZrm_0vbUXG9ZaZIagF9sa0Kva9PAno-6H854/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
*Worked with [[Minh Le]] to better understand and begin documenting the [[Demo Day Page Parser]] project.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
06/22/2018 - &lt;br /&gt;
*Finished going through Accelerator Master Variable List to refine industry classification and update addresses/accelerator statuses.&lt;br /&gt;
&lt;br /&gt;
06/21/2018 - &lt;br /&gt;
*Began manually editing entries in Accelerator Master Variable List.&lt;br /&gt;
*Reached out to Grace and Maxine and sent them the necessary sheets/txt files so they could begin on their Crunchbase project.&lt;br /&gt;
*I also made these graphics to better represent what our collaborative work would look like, and what the final project would include:&lt;br /&gt;
 https://docs.google.com/document/d/13Mb7lOLydm9r-ENYxSlZJVGgY9wxClATR6Hy8F9YK1Y/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
06/20/2018 - &lt;br /&gt;
*Talked with Ed about project details. &lt;br /&gt;
*Began looking through the Accelerator Master List to better understand project description. &lt;br /&gt;
*Sent Grace and Maxine the relevant company names listed in the Accelerator Master Spreadsheet so they could begin using their relevant parsers and tools to sort through data.&lt;br /&gt;
&lt;br /&gt;
06/19/2018 - &lt;br /&gt;
*Set up work stations on balcony, trained&lt;br /&gt;
&lt;br /&gt;
06/18/2018 - &lt;br /&gt;
*Trained, met other interns&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23811</id>
		<title>U.S. Seed Accelerators</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23811"/>
		<updated>2018-07-27T21:38:26Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: /* Update for Hira */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=U.S. Seed Accelerators&lt;br /&gt;
|Has owner=Connor Rothschild,&lt;br /&gt;
|Has start date=06/18/2018&lt;br /&gt;
|Has keywords=accelerators, data&lt;br /&gt;
|Has notes=Continuation of [[Accelerator Data]]&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Is dependent on=Industry Classifier, Demo Day Page Parser, Accelerator Demo Day, Demo Day Page Google Classifier, Crunchbase Accelerator Equity, Crunchbase Accelerator Founders, Crunchbase Data,&lt;br /&gt;
|Has image=Seed-Accelerator.jpg&lt;br /&gt;
|Does subsume=Accelerator Data, Accelerator Seed List (Data),&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Project Location==&lt;br /&gt;
The master file can be found at &lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
==Relevant Former Projects==&lt;br /&gt;
This page serves as an updated and tidied version of the data and work presented on the [[Accelerator Seed List (Data)]] Project, which subsumed [[Accelerator Data]].&lt;br /&gt;
Both of these projects (and as a corollary, this project) are dependent on the [[Demo Day Page Parser]], [[Industry Classifier]], and the [[Whois Parser]].&lt;br /&gt;
&lt;br /&gt;
==Update for Hira==&lt;br /&gt;
&lt;br /&gt;
After our Skype call, I did the following:&lt;br /&gt;
&lt;br /&gt;
===Recoding Founders===&lt;br /&gt;
&lt;br /&gt;
I've begun recoding the Founders variable. I can categorize 232 of the 824 listed job titles (there are 466 distinct job titles) into 9 categories (the most frequent results as given in a PivotTable found in the sheet).&lt;br /&gt;
&lt;br /&gt;
We will need to talk about how to categorize the most extraneous job titles.&lt;br /&gt;
&lt;br /&gt;
===Multiple campuses and cohorts===&lt;br /&gt;
&lt;br /&gt;
The File to Rule Them All contains an updated address variable. 139 of our 166 accelerators have addresses that are available online. The ones we could not find information for are left blank.&lt;br /&gt;
&lt;br /&gt;
The collaborative sheet that Hira, Maxine, and I worked on to update this list can be accessed here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1nktgJZfm3L8IsSCHgYbPasSdvKb7QHKxZp8K5Si2iNo/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
We have also added a new cohort list. Under The File To Rule Them All, the sheet '''Multiple Campuses''' contains the different locations for accelerators with multiple campuses. Column B can be used to filter out multiple location accelerators.&lt;br /&gt;
&lt;br /&gt;
===Fixed Google Sheet===&lt;br /&gt;
&lt;br /&gt;
We created a new sheet with only data we want to keep, and cleaned it up. That sheet is called &amp;quot;Good Data Only&amp;quot;, at the same link:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
I first used our &amp;quot;recap&amp;quot; and &amp;quot;announced&amp;quot; classification to standardize and fix the dates.&lt;br /&gt;
&lt;br /&gt;
*Columns N-R contain our new data. Please note that all of these columns are based on formulas and will be made erroneous if edited.&lt;br /&gt;
*Column N is the # of weeks for an accelerator program, gathered via VLookup from The File to Rule Them All.&lt;br /&gt;
*Column O is the Actual Date we want to record, and was gathered by subtracting the # of weeks from a date '''if''' the listed page was a '''recap'''.&lt;br /&gt;
*Columns P and Q are the Month and Years stripped from Column O.&lt;br /&gt;
*Finally, Column R is the season variable, as Ed said it should be coded. &lt;br /&gt;
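&lt;br /&gt;
The column logic above can be sketched in Python. This is only an illustration of the spreadsheet formulas, not the formulas themselves: the function name and the month-to-season mapping are assumptions, since this page does not spell out how Ed asked for the season to be coded.&lt;br /&gt;
&lt;br /&gt;
```python
from datetime import date, timedelta

# Hypothetical month-to-season mapping; the page does not state Ed's coding.
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}


def cohort_timing(page_date, program_weeks, page_type):
    """Return (actual_date, month, year, season) for one row.

    A "recap" page is dated at the end of the program, so the program
    length is subtracted; any other page type keeps the listed date.
    """
    if page_type == "recap":
        actual = page_date - timedelta(weeks=program_weeks)
    else:
        actual = page_date
    return actual, actual.month, actual.year, SEASON_BY_MONTH[actual.month]


# A recap page dated 2018-06-15 for a 12-week program.
print(cohort_timing(date(2018, 6, 15), 12, "recap"))
# (datetime.date(2018, 3, 23), 3, 2018, 'Spring')
```
&lt;br /&gt;
Keeping the recap adjustment in one place mirrors the sheet, where Column O is the only column that depends on the page classification and Columns P through R are derived from it.&lt;br /&gt;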
&lt;br /&gt;
We have also gone through and removed all bad data, all duplicates, and all rows without timing info. This is the most complete list possible.&lt;br /&gt;
&lt;br /&gt;
===Recode employee count===&lt;br /&gt;
&lt;br /&gt;
I have added a new column in Cohorts Final (of the File to Rule Them All) yet left the old column in case you would prefer to edit/classify differently.&lt;br /&gt;
Column AB (emp_count_scale) contains a variable coded on a scale of 1 to 9, with each number corresponding to one of the employee_count classifications (1 the lowest, 9 the highest). The exact output can be modified (1 could instead be tiny, 2 be small... 9 be huge).&lt;br /&gt;
&lt;br /&gt;
The employee count column is standardized and can easily be edited given some modification of the Excel formula.&lt;br /&gt;
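&lt;br /&gt;
As a rough sketch of the recoding behind emp_count_scale, the mapping could look like this in Python. The nine bucket labels are an assumption about how employee_count is written in the sheet; swap in the labels that actually appear in Cohorts Final.&lt;br /&gt;
&lt;br /&gt;
```python
# Assumed employee_count bucket labels, ordered smallest to largest.
EMPLOYEE_BUCKETS = [
    "1-10", "11-50", "51-100", "101-250", "251-500",
    "501-1000", "1001-5000", "5001-10000", "10000+",
]

# Rank each bucket from 1 (lowest) to 9 (highest).
EMP_COUNT_SCALE = {
    bucket: rank for rank, bucket in enumerate(EMPLOYEE_BUCKETS, start=1)
}


def emp_count_scale(employee_count):
    """Map a bucket label to 1..9; blank or unknown labels map to None."""
    return EMP_COUNT_SCALE.get((employee_count or "").strip())


print(emp_count_scale("11-50"))   # 2
print(emp_count_scale("10000+"))  # 9
```
&lt;br /&gt;
Replacing the ranks with labels such as tiny or huge only requires changing the values stored in the dictionary.&lt;br /&gt;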
&lt;br /&gt;
===Normalized investment amount===&lt;br /&gt;
&lt;br /&gt;
I've been trying to fix the investment amount, but I think it's smart that we discuss this before I move forward. I've done some tentative standardization (finding the average in a range if given), but so many accelerators &amp;quot;take up to __%&amp;quot; equity and &amp;quot;invest up to $___&amp;quot; that I think it's smartest that we think hard about how to standardize first. Also notable is that some accelerators say they provide &amp;quot;$____ up front and another $___ in follow up funding for each stage.&amp;quot; How do we deal with these? Message me if you'd like to talk more about this.&lt;br /&gt;
&lt;br /&gt;
I refrained from creating max and min investment columns to avoid spending time on the data before we discuss it.&lt;br /&gt;
&lt;br /&gt;
===Remaining to do===&lt;br /&gt;
&lt;br /&gt;
*Founders Experience: code job title&lt;br /&gt;
*Founders Education: remove unknowns, code degree and code major&lt;br /&gt;
&lt;br /&gt;
==Recent Work==&lt;br /&gt;
&lt;br /&gt;
Here's a project update on the work that has been done since coming to McNair. The most recent file is&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
===Merging Cohort Companies with Crunchbase Info===&lt;br /&gt;
&lt;br /&gt;
More information on this part of the project can be found on the page [[Merging Existing Data with Crunchbase]].&lt;br /&gt;
&lt;br /&gt;
The newest updated sheet of cohort company info is under the '''Cohorts Final''' sheet of '''The File to Rule Them All.xlsx'''.&lt;br /&gt;
&lt;br /&gt;
Working with [[Maxine Tao]], we have matched companies to their respective pages and information found in Crunchbase (via UUID). We ensured single matches by doing a 1-1-1-1 match with our data and with Crunchbase (using the Matcher). We then received additional information on 8092 companies. The following new information (on top of what we already had) is included in the sheet:&lt;br /&gt;
*short_description&lt;br /&gt;
*long_description&lt;br /&gt;
*category_list (details the company's category)&lt;br /&gt;
*category_group_list (a less refined, more all-encompassing category classification)&lt;br /&gt;
*founded_on date&lt;br /&gt;
*employee_count&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*address&lt;br /&gt;
&lt;br /&gt;
And the following information was also pulled from Crunchbase and '''merged''' with our existing data: &lt;br /&gt;
*URL (was merged with courl cells) &lt;br /&gt;
*city (was merged with colocation)&lt;br /&gt;
*state_code (was merged with colocation)&lt;br /&gt;
*country_code (was merged with colocation)&lt;br /&gt;
*status (was merged with costatus)&lt;br /&gt;
&lt;br /&gt;
===The Equity Variables===&lt;br /&gt;
&lt;br /&gt;
[[Maxine Tao]] and I have added six new variables to the '''Accelerators Final''' sheet. Those variables are:&lt;br /&gt;
&lt;br /&gt;
*Terms of joining - terms of joining accelerator and important details about program&lt;br /&gt;
*Equity? (1/0) - cells contain a 1 if the accelerator takes equity, a 0 if the accelerator definitively does not, and are blank if we could not find that information&lt;br /&gt;
*Equity Amount - the % of equity the accelerator will take (can sometimes be a range (e.g. 5-7%))&lt;br /&gt;
*Equity Amount Normalized - this copies the previous column but only keeps %&amp;gt;0, and if a range was given (e.g. 5-7%) it returns the average (e.g. 6%)&lt;br /&gt;
*Investment Amount - the $ the accelerator invests in a company to begin, if relevant (could also be a range or an &amp;quot;up to $######&amp;quot; figure)&lt;br /&gt;
*Investment Notes - any comments on the previous 4 columns&lt;br /&gt;
&lt;br /&gt;
These six variables tell us more about the characteristics of accelerators; specifically, which ones take equity and which ones do not, and how much equity accelerators take.&lt;br /&gt;
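&lt;br /&gt;
The rule behind Equity Amount Normalized (keep only nonzero percentages, and average a range) can be sketched in Python. The function name is hypothetical, and treating an &amp;quot;up to&amp;quot; figure as the figure itself is a simplifying assumption that we still need to discuss.&lt;br /&gt;
&lt;br /&gt;
```python
import re


def normalize_equity(raw):
    """Return one equity percentage, or None when nothing usable is given.

    A range such as "5-7%" becomes the average of its two endpoints.
    Assumption: "up to 8%" is read as 8, which may overstate the equity.
    """
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", raw or "")]
    if not numbers:
        return None
    endpoints = numbers[:2]
    value = sum(endpoints) / len(endpoints)
    # Zero equity is dropped, matching the rule of keeping nonzero values only.
    return value if value != 0 else None


print(normalize_equity("5-7%"))    # 6.0
print(normalize_equity("4%-10%"))  # 7.0
print(normalize_equity("0%"))      # None
```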
&lt;br /&gt;
Relevant information: &lt;br /&gt;
*82 accelerators take equity, 42 do not, and we lack information for 37.&lt;br /&gt;
*The average % of equity among accelerators that take equity is 6.49% (rough estimate; do not use for anything official). We got this number by looking only at accelerators that take equity, averaging the equity amount for accelerators that report a range (e.g. 4%-10% equity would be coded as 7% equity), and taking the mean.&lt;br /&gt;
&lt;br /&gt;
===Matching Accelerators to UUIDs via Crunchbase===&lt;br /&gt;
&lt;br /&gt;
We've also added UUIDs for 163 of our 166 accelerators. The UUIDs can be found in Column AE of the '''Accelerators Final''' sheet.&lt;br /&gt;
&lt;br /&gt;
The file with accelerators matched to Crunchbase UUIDs can be found at:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Accelerators and UUIDs.xlsx&lt;br /&gt;
&lt;br /&gt;
This is the master file and should never be modified unless we find a UUID changed. ALL OTHER SHEETS with UUIDs are linked to this sheet so its changes will be reflected elsewhere.&lt;br /&gt;
&lt;br /&gt;
More information can be found on the [[Crunchbase Data]] page. &lt;br /&gt;
&lt;br /&gt;
===Linking Accelerators to Founders/LinkedIn Crawling===&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] got the [[LinkedIn Crawler (Python)]] to work, which means we currently have the following information about accelerator founders:&lt;br /&gt;
*Current Job Title&lt;br /&gt;
*Location&lt;br /&gt;
*Employer&lt;br /&gt;
*Job(s) Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location of jobs&lt;br /&gt;
*Extra Description&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
This information can be found in the various '''Founders''' sheets in '''The File to Rule Them All'''.&lt;br /&gt;
&lt;br /&gt;
==An Overview==&lt;br /&gt;
This project will be used to determine which accelerators are the most effective at churning out successful startups, as well as what characteristics are exhibited by these accelerators. First, we need to gather as much data as we can about as many accelerators as we can in order to look at factors that differentiate successful vs. unsuccessful ventures. Next, we need to create a web crawling program which will gather information about accelerators across the world by accessing their websites and extracting information. I believe that our overall goal with this research project is to gain insight into the methods of successful accelerators, as well as to find out what exactly differentiates very successful accelerators from dead accelerators.&lt;br /&gt;
&lt;br /&gt;
Helpful Links: http://seedrankings.com/&lt;br /&gt;
&lt;br /&gt;
==Remaining To Dos==&lt;br /&gt;
&lt;br /&gt;
The last update on [[Accelerator Seed List (Data)]] said the following needed to be done: &lt;br /&gt;
&lt;br /&gt;
*Cross-reference sheet with data from Peter's old accelerator consolidation file (&amp;quot;accelerator_data_noflag&amp;quot; and &amp;quot;accelerator_data&amp;quot; in &amp;quot;All Relevant Files&amp;quot;) and fill in missing data&lt;br /&gt;
*Variables that are 100% NOT in these 2 files:&lt;br /&gt;
**Cohort Breakout?&lt;br /&gt;
**Subtype&lt;br /&gt;
**Designed for Students?&lt;br /&gt;
**Campuses&lt;br /&gt;
**Stage&lt;br /&gt;
**Software Tech&lt;br /&gt;
**What stage do they look for?&lt;br /&gt;
&lt;br /&gt;
==(Outdated) Necessary Steps==&lt;br /&gt;
&lt;br /&gt;
'''Acquiring the necessary data to complete the Accelerator Master Variable List and the Cohort List will require the following (not necessarily in this order):'''&lt;br /&gt;
&lt;br /&gt;
===Step Zero: Connect to Crunchbase and Link Data===&lt;br /&gt;
&lt;br /&gt;
Complete - more info: [[Crunchbase Data]]&lt;br /&gt;
&lt;br /&gt;
===Step One: LinkedIn Founders Data===&lt;br /&gt;
&lt;br /&gt;
This project will begin by working with [[Grace Tan]] and [[Maxine Tao]] to connect accelerators to their founders and cohort companies using Crunchbase and LinkedIn crawlers. Grace and Maxine will go through [http://crunchbase.com Crunchbase] and find the UUID for companies and their founders (reference [[Crunchbase Data]], [[Crunchbase Accelerator Founders]], [[Crunchbase Accelerator Equity]]). Connect them using SQL and feed the names of founders into our LinkedIn crawler (headed by [[Grace Tan]]).&lt;br /&gt;
&lt;br /&gt;
The list of founders for accelerators can be found at&lt;br /&gt;
 McNair/Projects/Accelerators/Fall 2017/founders_linkedin.txt&lt;br /&gt;
&lt;br /&gt;
The '''Unfound Founders''' file codes a 0 for all companies '''''not listed''''' within the LinkedIn Founders file, and a 1 for those that do have founders listed.&lt;br /&gt;
&lt;br /&gt;
'''Given the founders' names, we will then be able to use the [[LinkedIn Crawler (Python)]] to find the relevant details of an accelerator founder (education, work experience, etc.) This data on founders will help us solve the horse, jockey, racetrack question to detect what variables affect a startup's success (the accelerator, the founders, the environment/city).'''&lt;br /&gt;
&lt;br /&gt;
===Step Two: Linking Accelerators to Cohorts Using Investments on Crunchbase===&lt;br /&gt;
&lt;br /&gt;
In this step we focus on accelerators who take equity from the companies that engage in their program. We do this to prevent looking at accelerators who may also run funds/invest in various companies but do not take equity. This would provide us misleading results and lead us to believe some companies are in cohorts at accelerators that they are really not. &lt;br /&gt;
&lt;br /&gt;
Maxine will acquire the list of accelerators who take equity from companies from the following sheet:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/All Relevant Files/accelerator_data_noflag.txt&lt;br /&gt;
&lt;br /&gt;
Looking at the file, however, shows that very few are actually categorized well and the equity variable is messy. Moving forward, we need to check/refine/fix this classification.&lt;br /&gt;
&lt;br /&gt;
This file has 266 rows. The most recent, actual version of our accelerator database (found at E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx under the sheet Master Variable List) only has 167 rows, meaning the accelerator_data_noflag.txt file has too many rows. &lt;br /&gt;
&lt;br /&gt;
We will need to do a left join of Accelerator Master List with accelerator_data_noflag.txt to get rid of the accelerator names that are in accelerator_data_noflag but NOT in Accelerator Master List.&lt;br /&gt;
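&lt;br /&gt;
A minimal sketch of that left join in plain Python, assuming both files have been read into lists of dictionaries. The column names name and equity and the sample rows are made up for illustration; in practice the same join could be done in SQL.&lt;br /&gt;
&lt;br /&gt;
```python
def left_join(master_rows, noflag_rows, key="name"):
    """Keep every master row and attach the matching noflag row if one exists.

    Rows that appear only in noflag_rows are dropped, which is the point
    of joining from the master list.
    """
    lookup = {row[key].strip().lower(): row for row in noflag_rows}
    joined = []
    for row in master_rows:
        match = lookup.get(row[key].strip().lower(), {})
        merged = dict(match)
        merged.update(row)  # master values win on shared columns
        joined.append(merged)
    return joined


# Hypothetical rows standing in for the two files.
master = [{"name": "Accelerator A"}, {"name": "Accelerator B"}]
noflag = [
    {"name": "accelerator a", "equity": "Y"},
    {"name": "Accelerator Z", "equity": "N"},
]
print(left_join(master, noflag))
```
&lt;br /&gt;
Matching on a lowercased, trimmed name is a guess at the cleaning needed; accelerators whose names are spelled differently in the two files would still need manual review.&lt;br /&gt;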
&lt;br /&gt;
Once this is finished, we should have an “Equity” classification variable for every accelerator in Accelerator Master List. The accelerators that have a Y (or maybe it’s a 1) are the ones that take equity. These are the accelerators we’ll be able to do the Crunchbase work on to see when accelerators take equity.&lt;br /&gt;
&lt;br /&gt;
We then look at the accelerators' investments (or companies and the entities which invested in them), cross-reference the list of companies/accelerators, and once we find a match, we know that a company went through an accelerator and during which year they went through a cohort.&lt;br /&gt;
&lt;br /&gt;
'''From this, we get the following data:'''&lt;br /&gt;
*Accelerator a given company went through&lt;br /&gt;
*Year said company went through a cohort/Specific cohort company went through&lt;br /&gt;
&lt;br /&gt;
===Step Three: Demo Day Crawler===&lt;br /&gt;
&lt;br /&gt;
This part of the project relies on the contributions of the wonderful [[Minh Le]]. Better documentation for the project can be found on the [[Demo Day Page Parser]], [[Demo Day Page Google Classifier]], and [[Accelerator Demo Day]] project pages.&lt;br /&gt;
&lt;br /&gt;
Essentially, this part of the accelerator data project will use the [[Demo Day Page Parser]] to look through accelerator websites for pages which list a cohort's 'Demo Day', or the day in which accelerators present their companies to a group of special investors ([https://www.ycombinator.com/demoday/faq/ here's an example FAQ page from Y Combinator]). The [[Demo Day Page Google Classifier]] will then determine if the page is, in fact, a demo day page. &lt;br /&gt;
&lt;br /&gt;
'''Given a cohort's demo day, we can gather a few pieces of key information (check with Ed to make sure this is the correct information to gather from Demo Days):'''&lt;br /&gt;
*The date a cohort began/the season the cohort went through the accelerator&lt;br /&gt;
**This is acquired by looking at the cohort's demo day date, and subtracting the number of weeks/months of a cohort for that given accelerator. The length of a cohort can be found in the file:&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx&lt;br /&gt;
*I assume we can also acquire the companies in a specific cohort, as in we will have the list of all companies in the (for example) Spring 2018 Cohort of Techstars.&lt;br /&gt;
&lt;br /&gt;
===Step Four: Non-profit Finder===&lt;br /&gt;
&lt;br /&gt;
More at [[Non-profit Finder]]&lt;br /&gt;
&lt;br /&gt;
Another important step in this project is finding which accelerators are non-profits.&lt;br /&gt;
&lt;br /&gt;
A comprehensive list of nonprofits taken from the [https://www.irs.gov/charities-non-profits/tax-exempt-organization-search-bulk-data-downloads IRS] can be found here:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Nonprofits in US.xlsx&lt;br /&gt;
Warning: this file has 1 million rows&lt;br /&gt;
&lt;br /&gt;
This file should be cross-referenced with the list of accelerators to find which ones are listed--those accelerators are non-profits. We'll need someone good in SQL to do this with a match.&lt;br /&gt;
&lt;br /&gt;
Potential problem:&lt;br /&gt;
*The names of listed non-profits will likely be somewhat different than the names of accelerators, because companies often file for tax exemption with different names than they show the public.&lt;br /&gt;
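&lt;br /&gt;
One possible way to soften the naming problem is to normalize both lists and then fuzzy-match them. The sketch below uses Python's difflib; the suffix list, the 0.85 cutoff, and the sample names are all assumptions and would need tuning against the real IRS file.&lt;br /&gt;
&lt;br /&gt;
```python
import difflib
import re

# Hypothetical list of words to ignore when comparing organization names.
SUFFIXES = {"inc", "llc", "corp", "corporation", "foundation", "the"}


def normalize_name(name):
    """Lowercase, strip punctuation, and drop common legal suffixes."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(w for w in words if w not in SUFFIXES)


def match_nonprofits(accelerators, irs_names, cutoff=0.85):
    """Map each accelerator to its closest IRS name, if one is close enough."""
    normalized_irs = {normalize_name(n): n for n in irs_names}
    candidates = list(normalized_irs)
    matches = {}
    for acc in accelerators:
        close = difflib.get_close_matches(
            normalize_name(acc), candidates, n=1, cutoff=cutoff
        )
        if close:
            matches[acc] = normalized_irs[close[0]]
    return matches


# Made-up names, only to show the shape of the output.
print(match_nonprofits(
    ["Example Accelerator", "Sample Labs"],
    ["EXAMPLE ACCELERATOR INC", "UNRELATED CHARITY FOUNDATION"],
))
# {'Example Accelerator': 'EXAMPLE ACCELERATOR INC'}
```
&lt;br /&gt;
Fuzzy matching against roughly 1 million rows will be slow, so an exact match on the normalized names is worth trying first, with the fuzzy pass reserved for the accelerators that remain unmatched.&lt;br /&gt;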
&lt;br /&gt;
==Workflow Image==&lt;br /&gt;
&lt;br /&gt;
(the color coding for this image is a very rough and preliminary estimate subject to change) I imagine the project will look something like this, in that it will require the following information to fully complete the project:&lt;br /&gt;
&lt;br /&gt;
[[File:image1.png]]&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23808</id>
		<title>U.S. Seed Accelerators</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23808"/>
		<updated>2018-07-27T20:23:45Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: /* Update for Hira */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=U.S. Seed Accelerators&lt;br /&gt;
|Has owner=Connor Rothschild,&lt;br /&gt;
|Has start date=06/18/2018&lt;br /&gt;
|Has keywords=accelerators, data&lt;br /&gt;
|Has notes=Continuation of [[Accelerator Data]]&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Is dependent on=Industry Classifier, Demo Day Page Parser, Accelerator Demo Day, Demo Day Page Google Classifier, Crunchbase Accelerator Equity, Crunchbase Accelerator Founders, Crunchbase Data,&lt;br /&gt;
|Has image=Seed-Accelerator.jpg&lt;br /&gt;
|Does subsume=Accelerator Data, Accelerator Seed List (Data),&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Project Location==&lt;br /&gt;
The master file can be found at &lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
==Relevant Former Projects==&lt;br /&gt;
This page serves as an updated and tidied version of the data and work presented on the [[Accelerator Seed List (Data)]] Project, which subsumed [[Accelerator Data]].&lt;br /&gt;
Both of these projects (and as a corollary, this project) are dependent on the [[Demo Day Page Parser]], [[Industry Classifier]], and the [[Whois Parser]].&lt;br /&gt;
&lt;br /&gt;
==Update for Hira==&lt;br /&gt;
&lt;br /&gt;
After our Skype call, I did the following:&lt;br /&gt;
&lt;br /&gt;
===Multiple campuses and cohorts===&lt;br /&gt;
&lt;br /&gt;
The File to Rule Them All contains an updated address variable. 139 of our 166 accelerators have addresses that are available online. The ones we could not find information for are left blank.&lt;br /&gt;
&lt;br /&gt;
The collaborative sheet that Hira, Maxine, and I worked on to update this list can be accessed here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1nktgJZfm3L8IsSCHgYbPasSdvKb7QHKxZp8K5Si2iNo/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
We have also added a new cohort list. Under The File To Rule Them All, the sheet '''Multiple Campuses''' contains the different locations for accelerators with multiple campuses. Column B can be used to filter out multiple location accelerators.&lt;br /&gt;
&lt;br /&gt;
===Fixed Google Sheet===&lt;br /&gt;
&lt;br /&gt;
We created a new sheet with only data we want to keep, and cleaned it up. That sheet is called &amp;quot;Good Data Only&amp;quot;, at the same link:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
I first used our &amp;quot;recap&amp;quot; and &amp;quot;announced&amp;quot; classification to standardize and fix the dates.&lt;br /&gt;
&lt;br /&gt;
*Columns N-R contain our new data. Please note that all of these columns are based on formulas and will be made erroneous if edited.&lt;br /&gt;
*Column N is the # of weeks for an accelerator program, gathered via VLookup from The File to Rule Them All.&lt;br /&gt;
*Column O is the Actual Date we want to record, and was gathered by subtracting the # of weeks from a date '''if''' the listed page was a '''recap'''.&lt;br /&gt;
*Columns P and Q are the Month and Years stripped from Column O.&lt;br /&gt;
*Finally, Column R is the season variable, as Ed said it should be coded. &lt;br /&gt;
&lt;br /&gt;
We have also gone through and removed all bad data, all duplicates, and all rows without timing info. This is the most complete list possible.&lt;br /&gt;
&lt;br /&gt;
===Recode employee count===&lt;br /&gt;
&lt;br /&gt;
I have added a new column in Cohorts Final (of the File to Rule Them All) yet left the old column in case you would prefer to edit/classify differently.&lt;br /&gt;
Column AB (emp_count_scale) contains a variable coded on a scale of 1 to 9, with each number corresponding to one of the employee_count classifications (1 the lowest, 9 the highest). The exact output can be modified (1 could instead be tiny, 2 be small... 9 be huge).&lt;br /&gt;
&lt;br /&gt;
The employee count column is standardized and can easily be edited given some modification of the Excel formula.&lt;br /&gt;
&lt;br /&gt;
===Normalized investment amount===&lt;br /&gt;
&lt;br /&gt;
I've been trying to fix the investment amount, but I think it's smart that we discuss this before I move forward. I've done some tentative standardization (finding the average in a range if given), but so many accelerators &amp;quot;take up to __%&amp;quot; equity and &amp;quot;invest up to $___&amp;quot; that I think it's smartest that we think hard about how to standardize first. Also notable is that some accelerators say they provide &amp;quot;$____ up front and another $___ in follow up funding for each stage.&amp;quot; How do we deal with these? Message me if you'd like to talk more about this.&lt;br /&gt;
&lt;br /&gt;
I refrained from creating max and min investment columns to avoid spending time on the data before we discuss it.&lt;br /&gt;
&lt;br /&gt;
===Remaining to do===&lt;br /&gt;
&lt;br /&gt;
*Founders Experience: code job title&lt;br /&gt;
*Founders Education: remove unknowns, code degree and code major&lt;br /&gt;
&lt;br /&gt;
==Recent Work==&lt;br /&gt;
&lt;br /&gt;
Here's a project update on the work that has been done since coming to McNair. The most recent file is&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
===Merging Cohort Companies with Crunchbase Info===&lt;br /&gt;
&lt;br /&gt;
More information on this part of the project can be found on the page [[Merging Existing Data with Crunchbase]].&lt;br /&gt;
&lt;br /&gt;
The newest updated sheet of cohort company info is under the '''Cohorts Final''' sheet of '''The File to Rule Them All.xlsx'''.&lt;br /&gt;
&lt;br /&gt;
Working with [[Maxine Tao]], we have matched companies to their respective pages and information found in Crunchbase (via UUID). We ensured single matches by doing a 1-1-1-1 match with our data and with Crunchbase (using the Matcher). We then received additional information on 8092 companies. The following new information (on top of what we already had) is included in the sheet:&lt;br /&gt;
*short_description&lt;br /&gt;
*long_description&lt;br /&gt;
*category_list (details the company's category)&lt;br /&gt;
*category_group_list (a less refined, more all-encompassing category classification)&lt;br /&gt;
*founded_on date&lt;br /&gt;
*employee_count&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*address&lt;br /&gt;
&lt;br /&gt;
And the following information was also pulled from Crunchbase and '''merged''' with our existing data: &lt;br /&gt;
*URL (was merged with courl cells) &lt;br /&gt;
*city (was merged with colocation)&lt;br /&gt;
*state_code (was merged with colocation)&lt;br /&gt;
*country_code (was merged with colocation)&lt;br /&gt;
*status (was merged with costatus)&lt;br /&gt;
&lt;br /&gt;
===The Equity Variables===&lt;br /&gt;
&lt;br /&gt;
[[Maxine Tao]] and I have added six new variables to the '''Accelerators Final''' sheet. Those variables are:&lt;br /&gt;
&lt;br /&gt;
*Terms of joining - terms of joining accelerator and important details about program&lt;br /&gt;
*Equity? (1/0) - cells contain a 1 if the accelerator takes equity, a 0 if the accelerator definitively does not, and are blank if we could not find that information&lt;br /&gt;
*Equity Amount - the % of equity the accelerator will take (can sometimes be a range (e.g. 5-7%))&lt;br /&gt;
*Equity Amount Normalized - this copies the previous column but only keeps %&amp;gt;0, and if a range was given (e.g. 5-7%) it returns the average (e.g. 6%)&lt;br /&gt;
*Investment Amount - the $ the accelerator invests in a company to begin, if relevant (could also be a range or an &amp;quot;up to $######&amp;quot; figure)&lt;br /&gt;
*Investment Notes - any comments on the previous 4 columns&lt;br /&gt;
&lt;br /&gt;
These six variables tell us more about the characteristics of accelerators; specifically, which ones take equity and which ones do not, and how much equity accelerators take.&lt;br /&gt;
&lt;br /&gt;
Relevant information: &lt;br /&gt;
*82 accelerators take equity, 42 do not, and we lack information for 37.&lt;br /&gt;
*The average % of equity among accelerators that take equity is 6.49% (rough estimate; do not use for anything official). We got this number by looking only at accelerators that take equity, averaging the equity amount for accelerators that report a range (e.g. 4%-10% equity would be coded as 7% equity), and taking the mean.&lt;br /&gt;
&lt;br /&gt;
===Matching Accelerators to UUIDs via Crunchbase===&lt;br /&gt;
&lt;br /&gt;
We've also added UUIDs for 163 of our 166 accelerators. The UUIDs can be found in Column AE of the '''Accelerators Final''' sheet.&lt;br /&gt;
&lt;br /&gt;
The file with accelerators matched to Crunchbase UUIDs can be found at:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Accelerators and UUIDs.xlsx&lt;br /&gt;
&lt;br /&gt;
This is the master file and should never be modified unless we find a UUID changed. ALL OTHER SHEETS with UUIDs are linked to this sheet so its changes will be reflected elsewhere.&lt;br /&gt;
&lt;br /&gt;
More information can be found on the [[Crunchbase Data]] page. &lt;br /&gt;
&lt;br /&gt;
===Linking Accelerators to Founders/LinkedIn Crawling===&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] got the [[LinkedIn Crawler (Python)]] to work, which means we currently have the following information about accelerator founders:&lt;br /&gt;
*Current Job Title&lt;br /&gt;
*Location&lt;br /&gt;
*Employer&lt;br /&gt;
*Job(s) Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location of jobs&lt;br /&gt;
*Extra Description&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
This information can be found in the various '''Founders''' sheets in '''The File to Rule Them All'''.&lt;br /&gt;
&lt;br /&gt;
==An Overview==&lt;br /&gt;
This project will be used to determine which accelerators are the most effective at churning out successful startups, as well as what characteristics are exhibited by these accelerators. First, we need to gather as much data as we can about as many accelerators as we can in order to look at factors that differentiate successful vs. unsuccessful ventures. Next, we need to create a web crawling program which will gather information about accelerators across the world by accessing their websites and extracting information. I believe that our overall goal with this research project is to gain insight into the methods of successful accelerators, as well as to find out what exactly differentiates very successful accelerators from dead accelerators.&lt;br /&gt;
&lt;br /&gt;
Helpful Links: http://seedrankings.com/&lt;br /&gt;
&lt;br /&gt;
==Remaining To Dos==&lt;br /&gt;
&lt;br /&gt;
The last update on [[Accelerator Seed List (Data)]] said the following needed to be done: &lt;br /&gt;
&lt;br /&gt;
*Cross-reference sheet with data from Peter's old accelerator consolidation file (&amp;quot;accelerator_data_noflag&amp;quot; and &amp;quot;accelerator_data&amp;quot; in &amp;quot;All Relevant Files&amp;quot;) and fill in missing data&lt;br /&gt;
*Variables that are 100% NOT in these 2 files:&lt;br /&gt;
**Cohort Breakout?&lt;br /&gt;
**Subtype&lt;br /&gt;
**Designed for Students?&lt;br /&gt;
**Campuses&lt;br /&gt;
**Stage&lt;br /&gt;
**Software Tech&lt;br /&gt;
**What stage do they look for?&lt;br /&gt;
&lt;br /&gt;
==(Outdated) Necessary Steps==&lt;br /&gt;
&lt;br /&gt;
'''Acquiring the necessary data to complete the Accelerator Master Variable List and the Cohort List will require the following (not necessarily in this order):'''&lt;br /&gt;
&lt;br /&gt;
===Step Zero: Connect to Crunchbase and Link Data===&lt;br /&gt;
&lt;br /&gt;
Complete - more info: [[Crunchbase Data]]&lt;br /&gt;
&lt;br /&gt;
===Step One: LinkedIn Founders Data===&lt;br /&gt;
&lt;br /&gt;
This project will begin by working with [[Grace Tan]] and [[Maxine Tao]] to connect accelerators to their founders and cohort companies using Crunchbase and LinkedIn crawlers. Grace and Maxine will go through [http://crunchbase.com Crunchbase] and find the UUID for companies and their founders (reference [[Crunchbase Data]], [[Crunchbase Accelerator Founders]], [[Crunchbase Accelerator Equity]]). Connect them using SQL and feed the names of founders into our LinkedIn crawler (headed by [[Grace Tan]]).&lt;br /&gt;
&lt;br /&gt;
The list of founders for accelerators can be found at&lt;br /&gt;
 McNair/Projects/Accelerators/Fall 2017/founders_linkedin.txt&lt;br /&gt;
&lt;br /&gt;
The '''Unfound Founders''' file codes a 0 for all companies '''''not listed''''' within the LinkedIn Founders file, and a 1 for those that do have founders listed.&lt;br /&gt;
&lt;br /&gt;
'''Given the founders' names, we will then be able to use the [[LinkedIn Crawler (Python)]] to find the relevant details of an accelerator founder (education, work experience, etc.) This data on founders will help us solve the horse, jockey, racetrack question to detect what variables affect a startup's success (the accelerator, the founders, the environment/city).'''&lt;br /&gt;
&lt;br /&gt;
===Step Two: Linking Accelerators to Cohorts Using Investments on Crunchbase===&lt;br /&gt;
&lt;br /&gt;
In this step we focus on accelerators who take equity from the companies that engage in their program. We do this to prevent looking at accelerators who may also run funds/invest in various companies but do not take equity. This would provide us misleading results and lead us to believe some companies are in cohorts at accelerators that they are really not. &lt;br /&gt;
&lt;br /&gt;
Maxine will acquire the list of accelerators who take equity from companies from the following sheet:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/All Relevant Files/accelerator_data_noflag.txt&lt;br /&gt;
&lt;br /&gt;
Looking at the file, however, shows that very few are actually categorized well and the equity variable is messy. Moving forward, we need to check/refine/fix this classification.&lt;br /&gt;
&lt;br /&gt;
This file has 266 rows. The most recent, actual version of our accelerator database (found at E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx under the sheet Master Variable List) only has 167 rows, meaning the accelerator_data_noflag.txt file has too many rows. &lt;br /&gt;
&lt;br /&gt;
We will need to do a left join of Accelerator Master List with accelerator_data_noflag.txt to get rid of the accelerator names that are in accelerator_data_noflag but NOT in Accelerator Master List.&lt;br /&gt;
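&lt;br /&gt;
A minimal sketch of that left join, in plain Python rather than SQL. It assumes both files have been loaded as lists of rows, that accelerator names live in a hypothetical name column, and that the equity flag lives in a hypothetical equity column; neither column name is confirmed by the files themselves, and the sample rows are made up.&lt;br /&gt;
```python
def left_join_equity(master_rows, noflag_rows):
    """Keep every master-list accelerator; attach equity from noflag when present."""
    # Index the larger file by a normalized name (hypothetical "name" column).
    equity_by_name = {}
    for row in noflag_rows:
        key = row["name"].strip().lower()
        equity_by_name.setdefault(key, row.get("equity"))

    joined = []
    for row in master_rows:
        key = row["name"].strip().lower()
        # Left join: unmatched master rows are kept with equity=None.
        joined.append({**row, "equity": equity_by_name.get(key)})
    return joined


# Made-up sample rows for illustration only.
master = [{"name": "Techstars"}, {"name": "Y Combinator"}]
noflag = [
    {"name": "techstars ", "equity": "Y"},
    {"name": "Some Fund", "equity": "N"},  # not in master, so it is dropped
]
result = left_join_equity(master, noflag)
```
Because the master list is on the left side of the join, rows that exist only in accelerator_data_noflag drop out, while master-list accelerators with no match stay in the result with a blank equity value that can be filled in by hand.&lt;br /&gt;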
&lt;br /&gt;
Once this is finished, we should have an “Equity” classification variable for every accelerator in Accelerator Master List. The accelerators coded Y (or possibly 1) are the ones that take equity. These are the accelerators we will be able to do the Crunchbase work on to see when accelerators take equity.&lt;br /&gt;
&lt;br /&gt;
We then look at the accelerators' investments (or at companies and the entities that invested in them) and cross-reference the list of companies and accelerators. Once we find a match, we know that a company went through an accelerator and in which year it went through a cohort.&lt;br /&gt;
&lt;br /&gt;
'''From this, we get the following data:'''&lt;br /&gt;
*Accelerator a given company went through&lt;br /&gt;
*Year said company went through a cohort/Specific cohort company went through&lt;br /&gt;
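&lt;br /&gt;
The cross-reference above can be sketched as follows. This is only an illustration under stated assumptions: the field names investor_uuid, company_uuid, and announced_on are hypothetical stand-ins for whatever the Crunchbase investment tables actually use, and the investment year is assumed to approximate the cohort year.&lt;br /&gt;
```python
from datetime import date


def cohorts_from_investments(investments, accelerator_uuids):
    """Return (company, accelerator, year) for investments made by known accelerators."""
    matches = []
    for inv in investments:
        # Keep only investments whose investor is one of our equity-taking accelerators.
        if inv["investor_uuid"] in accelerator_uuids:
            matches.append(
                (inv["company_uuid"], inv["investor_uuid"], inv["announced_on"].year)
            )
    return matches


# Made-up placeholder records, not real Crunchbase data.
investments = [
    {"company_uuid": "co-1", "investor_uuid": "acc-1", "announced_on": date(2016, 3, 1)},
    {"company_uuid": "co-2", "investor_uuid": "vc-9", "announced_on": date(2017, 5, 9)},
]
cohort_rows = cohorts_from_investments(investments, {"acc-1"})
```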
&lt;br /&gt;
===Step Three: Demo Day Crawler===&lt;br /&gt;
&lt;br /&gt;
This part of the project relies on the contributions of the wonderful [[Minh Le]]. Better documentation for the project can be found on the [[Demo Day Page Parser]], [[Demo Day Page Google Classifier]], and [[Accelerator Demo Day]] project pages.&lt;br /&gt;
&lt;br /&gt;
Essentially, this part of the accelerator data project will use the [[Demo Day Page Parser]] to look through accelerator websites for pages that list a cohort's 'Demo Day', the day on which an accelerator presents its companies to a group of special investors ([https://www.ycombinator.com/demoday/faq/ here's an example FAQ page from Y Combinator]). The [[Demo Day Page Google Classifier]] will then determine whether the page is, in fact, a demo day page.&lt;br /&gt;
&lt;br /&gt;
'''Given a cohort's demo day, we can gather a few pieces of key information (check with Ed to make sure this is the correct information to gather from Demo Days):'''&lt;br /&gt;
*The date a cohort began/the season the cohort went through the accelerator&lt;br /&gt;
**This is acquired by looking at the cohort's demo day date, and subtracting the number of weeks/months of a cohort for that given accelerator. The length of a cohort can be found in the file:&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx&lt;br /&gt;
*I assume we can also acquire the companies in a specific cohort, as in we will have the list of all companies in the (for example) Spring 2018 Cohort of Techstars.&lt;br /&gt;
&lt;br /&gt;
===Step Four: Non-profit Finder===&lt;br /&gt;
&lt;br /&gt;
More at [[Non-profit Finder]]&lt;br /&gt;
&lt;br /&gt;
Another important step in this project is finding which accelerators are non-profits.&lt;br /&gt;
&lt;br /&gt;
A comprehensive list of nonprofits taken from the [https://www.irs.gov/charities-non-profits/tax-exempt-organization-search-bulk-data-downloads IRS] can be found here:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Nonprofits in US.xlsx&lt;br /&gt;
Warning: this file has 1 million rows&lt;br /&gt;
&lt;br /&gt;
This file should be cross-referenced with the list of accelerators to find which ones are listed--those accelerators are non-profits. We'll need someone skilled in SQL to do this with a match.&lt;br /&gt;
&lt;br /&gt;
Potential problem:&lt;br /&gt;
*The names of listed non-profits will likely be somewhat different than the names of accelerators, because companies often file for tax exemption with different names than they show the public.&lt;br /&gt;
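&lt;br /&gt;
One way to soften the naming problem is to normalize both lists before matching. The sketch below is an assumption-laden illustration, not the project's method: the suffix list is hypothetical and incomplete, the sample names are made up, and an exact match after normalization will still miss accelerators that filed under an entirely different legal name.&lt;br /&gt;
```python
import re

# Hypothetical list of legal suffixes and filler words to ignore; extend as needed.
LEGAL_SUFFIXES = {"inc", "incorporated", "llc", "corp", "corporation", "co", "the"}


def normalize(name):
    """Lowercase, drop punctuation, and remove common legal suffixes."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(w for w in words if w not in LEGAL_SUFFIXES)


def find_nonprofit_accelerators(accelerators, nonprofit_names):
    """Return accelerators whose normalized name appears in the nonprofit list."""
    irs = {normalize(n) for n in nonprofit_names}
    return [a for a in accelerators if normalize(a) in irs]


# Made-up names for illustration only.
accelerators = ["Example Accelerator", "Other Labs"]
nonprofits = ["EXAMPLE ACCELERATOR INC", "UNRELATED CHARITY"]
found = find_nonprofit_accelerators(accelerators, nonprofits)
```
Building a set of normalized nonprofit names keeps each lookup cheap, which matters with a file of roughly 1 million rows.&lt;br /&gt;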
&lt;br /&gt;
==Workflow Image==&lt;br /&gt;
&lt;br /&gt;
I imagine the project workflow will look something like the image below, which shows the information required to complete the project. (The color coding in this image is a very rough, preliminary estimate and is subject to change.)&lt;br /&gt;
&lt;br /&gt;
[[File:image1.png]]&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23791</id>
		<title>U.S. Seed Accelerators</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23791"/>
		<updated>2018-07-26T21:59:54Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: /* Fixed Google Sheet */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=U.S. Seed Accelerators&lt;br /&gt;
|Has owner=Connor Rothschild,&lt;br /&gt;
|Has start date=06/18/2018&lt;br /&gt;
|Has keywords=accelerators, data&lt;br /&gt;
|Has notes=Continuation of [[Accelerator Data]]&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Is dependent on=Industry Classifier, Demo Day Page Parser, Accelerator Demo Day, Demo Day Page Google Classifier, Crunchbase Accelerator Equity, Crunchbase Accelerator Founders, Crunchbase Data,&lt;br /&gt;
|Has image=Seed-Accelerator.jpg&lt;br /&gt;
|Does subsume=Accelerator Data, Accelerator Seed List (Data),&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Project Location==&lt;br /&gt;
The master file can be found at &lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
==Relevant Former Projects==&lt;br /&gt;
This page serves as an updated and tidied version of the data and work presented on the [[Accelerator Seed List (Data)]] Project, which subsumed [[Accelerator Data]].&lt;br /&gt;
Both of these projects (and as a corollary, this project) are dependent on the [[Demo Day Page Parser]], [[Industry Classifier]], and the [[Whois Parser]].&lt;br /&gt;
&lt;br /&gt;
==Update for Hira==&lt;br /&gt;
&lt;br /&gt;
After our Skype call, I did the following:&lt;br /&gt;
&lt;br /&gt;
===Fixed Google Sheet===&lt;br /&gt;
&lt;br /&gt;
We created a new sheet with only data we want to keep, and cleaned it up. That sheet is called &amp;quot;Good Data Only&amp;quot;, at the same link:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
I first used our &amp;quot;recap&amp;quot; and &amp;quot;announced&amp;quot; classification to standardize and fix the dates.&lt;br /&gt;
&lt;br /&gt;
*Columns N-R contain our new data. Please note that all of these columns are based on formulas and will be made erroneous if edited.&lt;br /&gt;
*Column N is the # of weeks for an accelerator program, gathered via VLookup from The File to Rule Them All.&lt;br /&gt;
*Column O is the Actual Date we want to record, and was gathered by subtracting the # of weeks from a date '''if''' the listed page was a '''recap'''.&lt;br /&gt;
*Columns P and Q are the Month and Years stripped from Column O.&lt;br /&gt;
*Finally, Column R is the season variable, as Ed said it should be coded. &lt;br /&gt;
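&lt;br /&gt;
The logic behind Columns O through R can be sketched roughly as below. Two things here are assumptions rather than facts from the sheet: that an announced page already carries the date we want, and the month-to-season mapping, since this page does not spell out how Ed wants the season coded.&lt;br /&gt;
```python
from datetime import date, timedelta

# Assumed season coding; the actual scheme is not documented on this page.
SEASONS = {12: "Winter", 1: "Winter", 2: "Winter",
           3: "Spring", 4: "Spring", 5: "Spring",
           6: "Summer", 7: "Summer", 8: "Summer",
           9: "Fall", 10: "Fall", 11: "Fall"}


def cohort_timing(page_date, program_weeks, page_type):
    """Mirror Columns O-R: actual date, month, year, season."""
    if page_type == "recap":
        # A recap page is dated at the end of the program, so step back.
        actual = page_date - timedelta(weeks=program_weeks)
    else:
        # Assumption: "announced" pages already carry the date we want.
        actual = page_date
    return actual, actual.month, actual.year, SEASONS[actual.month]


# Made-up example: a recap page dated 2018-06-15 for a 12-week program.
example = cohort_timing(date(2018, 6, 15), 12, "recap")
```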
&lt;br /&gt;
We have also gone through and removed all bad data, all duplicates, and all rows without timing info. This is the most complete list possible.&lt;br /&gt;
&lt;br /&gt;
===Recode employee count===&lt;br /&gt;
&lt;br /&gt;
I have added a new column in Cohorts Final (of the File to Rule Them All) yet left the old column in case you would prefer to edit/classify differently.&lt;br /&gt;
Column AB (emp_count_scale) contains a variable coded on a scale of 1 to 9, with each number corresponding to one of the employee_count classifications (1 the lowest, 9 the highest). The exact output can be modified (1 could instead be tiny, 2 be small... 9 be huge).&lt;br /&gt;
&lt;br /&gt;
The employee count column is standardized and can easily be edited given some modification of the Excel formula.&lt;br /&gt;
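&lt;br /&gt;
For reference, the recoding behind emp_count_scale amounts to a lookup like the one below. The nine bucket labels are an assumption about how Crunchbase reports employee_count and should be checked against the values actually present in Cohorts Final.&lt;br /&gt;
```python
# Assumed Crunchbase employee_count buckets, ordered smallest to largest.
EMPLOYEE_BUCKETS = ["1-10", "11-50", "51-100", "101-250", "251-500",
                    "501-1000", "1001-5000", "5001-10000", "10000+"]
EMP_COUNT_SCALE = {bucket: rank for rank, bucket in enumerate(EMPLOYEE_BUCKETS, start=1)}


def emp_count_scale(employee_count):
    """Return 1-9 for a known bucket, or None for blanks and unknowns."""
    return EMP_COUNT_SCALE.get((employee_count or "").strip())
```
Swapping the numbers for labels (tiny, small, and so on) would only require changing the values stored in the lookup.&lt;br /&gt;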
&lt;br /&gt;
===Normalized investment amount===&lt;br /&gt;
&lt;br /&gt;
I've been trying to fix the investment amount, but I think it's smart that we discuss this before I move forward. I've done some tentative standardization (finding the average in a range if given), but so many accelerators &amp;quot;take up to __%&amp;quot; equity and &amp;quot;invest up to $___&amp;quot; that I think it's smartest to think hard about how to standardize first. Also notable is that some accelerators say they provide &amp;quot;$____ up front and another $___ in follow up funding for each stage.&amp;quot; How do we deal with these? Message me if you'd like to talk more about this.&lt;br /&gt;
&lt;br /&gt;
I refrained from creating max and min investment columns to avoid spending time on the data before we discuss it.&lt;br /&gt;
&lt;br /&gt;
===Remaining to do===&lt;br /&gt;
&lt;br /&gt;
*Founders Experience: code job title&lt;br /&gt;
*Founders Education: remove unknowns, code degree and code major&lt;br /&gt;
*Multiple campus and cohorts&lt;br /&gt;
&lt;br /&gt;
==Recent Work==&lt;br /&gt;
&lt;br /&gt;
Here's a project update on the work that has been done since coming to McNair. The most recent file is&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
===Merging Cohort Companies with Crunchbase Info===&lt;br /&gt;
&lt;br /&gt;
More information on this part of the project can be found on the page [[Merging Existing Data with Crunchbase]].&lt;br /&gt;
&lt;br /&gt;
The newest updated sheet of cohort company info is under the '''Cohorts Final''' sheet of '''The File to Rule Them All.xlsx'''.&lt;br /&gt;
&lt;br /&gt;
Working with [[Maxine Tao]], we have matched companies to their respective pages and information found in Crunchbase (via UUID). We ensured single matches by doing a 1-1-1-1 match with our data and with Crunchbase (using the Matcher). We then received additional information on 8092 companies. The following new information (on top of what we already had) is included in the sheet:&lt;br /&gt;
*short_description&lt;br /&gt;
*long_description&lt;br /&gt;
*category_list (details the company's category)&lt;br /&gt;
*category_group_list (a less refined, more all-encompassing category classification)&lt;br /&gt;
*founded_on date&lt;br /&gt;
*employee_count&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*address&lt;br /&gt;
&lt;br /&gt;
And the following information was also pulled from Crunchbase and '''merged''' with our existing data: &lt;br /&gt;
*URL (was merged with courl cells) &lt;br /&gt;
*city (was merged with colocation)&lt;br /&gt;
*state_code (was merged with colocation)&lt;br /&gt;
*country_code (was merged with colocation)&lt;br /&gt;
*status (was merged with costatus)&lt;br /&gt;
&lt;br /&gt;
===The Equity Variables===&lt;br /&gt;
&lt;br /&gt;
[[Maxine Tao]] and I have added six new variables to the '''Accelerators Final''' sheet. Those variables are:&lt;br /&gt;
&lt;br /&gt;
*Terms of joining - terms of joining accelerator and important details about program&lt;br /&gt;
*Equity? (1/0) - cells contain a 1 if the accelerator takes equity, a 0 if the accelerator definitively does not, and are blank if we could not find that information&lt;br /&gt;
*Equity Amount - the % of equity the accelerator will take (sometimes a range, e.g. 5-7%)&lt;br /&gt;
*Equity Amount Normalized - this copies the previous column but only keeps %&amp;gt;0, and if a range was given (e.g. 5-7%) it returns the average (e.g. 6%)&lt;br /&gt;
*Investment Amount - the $ the accelerator invests in a company to begin, if relevant (also could be a range or a &amp;quot;up to $######&amp;quot;)&lt;br /&gt;
*Investment Notes - anything to comment on previous 4 columns&lt;br /&gt;
&lt;br /&gt;
These six variables tell us more about the characteristics of accelerators; specifically, which ones take equity and which ones do not, and how much equity accelerators take.&lt;br /&gt;
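&lt;br /&gt;
The rule behind Equity Amount Normalized can be sketched as a small function. This is an illustration of the averaging described above, not the spreadsheet formula itself; in particular, treating a phrase such as up to 8% as a plain 8% is an assumption that still needs to be discussed.&lt;br /&gt;
```python
import re


def normalize_equity(raw):
    """Return equity as a percentage, averaging a range; None if absent or zero."""
    # Pull every number out of the cell, e.g. "5-7%" gives [5.0, 7.0].
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", raw or "")]
    if not numbers:
        return None
    value = sum(numbers) / len(numbers)
    # Only percentages above zero are kept.
    return value if value > 0 else None
```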
&lt;br /&gt;
Relevant information: &lt;br /&gt;
*82 accelerators take equity, 42 do not, and we lack information for 37.&lt;br /&gt;
*The average % of equity among accelerators that take equity (rough estimate--do not use for anything official) is 6.49%. We got this number by looking only at accelerators that take equity, coding any reported range as its midpoint (e.g. 4%-10% equity would be coded as 7% equity), and taking the mean.&lt;br /&gt;
&lt;br /&gt;
===Matching Accelerators to UUIDs via Crunchbase===&lt;br /&gt;
&lt;br /&gt;
We've also added UUIDs for 163 of our 166 accelerators. The UUIDs can be found in Column AE of the '''Accelerators Final''' sheet.&lt;br /&gt;
&lt;br /&gt;
The file with accelerators matched to Crunchbase UUIDs can be found at:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Accelerators and UUIDs.xlsx&lt;br /&gt;
&lt;br /&gt;
This is the master file and should never be modified unless we find a UUID changed. ALL OTHER SHEETS with UUIDs are linked to this sheet so its changes will be reflected elsewhere.&lt;br /&gt;
&lt;br /&gt;
More information can be found on the [[Crunchbase Data]] page. &lt;br /&gt;
&lt;br /&gt;
===Linking Accelerators to Founders/LinkedIn Crawling===&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] got the [[LinkedIn Crawler (Python)]] to work, which means we currently have the following information about accelerator founders:&lt;br /&gt;
*Current Job Title&lt;br /&gt;
*Location&lt;br /&gt;
*Employer&lt;br /&gt;
*Job(s) Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location of jobs&lt;br /&gt;
*Extra Description&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
This information can be found in the various '''Founders''' sheets in '''The File to Rule Them All'''.&lt;br /&gt;
&lt;br /&gt;
==An Overview==&lt;br /&gt;
This project will be used to determine which accelerators are the most effective at churning out successful startups, as well as what characteristics are exhibited by these accelerators. First, we need to gather as much data as we can about as many accelerators as we can in order to look at factors that differentiate successful vs. unsuccessful ventures. Next, we need to create a web crawling program which will gather information about accelerators across the world by accessing their websites and extracting information. I believe that our overall goal with this research project is to gain insight into the methods of successful accelerators, as well as to find out what exactly differentiates very successful accelerators from dead accelerators.&lt;br /&gt;
&lt;br /&gt;
Helpful Links: http://seedrankings.com/&lt;br /&gt;
&lt;br /&gt;
==Remaining To Dos==&lt;br /&gt;
&lt;br /&gt;
The last update on [[Accelerator Seed List (Data)]] said the following needed to be done: &lt;br /&gt;
&lt;br /&gt;
*Cross-reference sheet with data from Peter's old accelerator consolidation file (&amp;quot;accelerator_data_noflag&amp;quot; and &amp;quot;accelerator_data&amp;quot; in &amp;quot;All Relevant Files&amp;quot;) and fill in missing data&lt;br /&gt;
*Variables that are 100% NOT in these 2 files:&lt;br /&gt;
**Cohort Breakout?&lt;br /&gt;
**Subtype&lt;br /&gt;
**Designed for Students?&lt;br /&gt;
**Campuses&lt;br /&gt;
**Stage&lt;br /&gt;
**Software Tech&lt;br /&gt;
**What stage do they look for?&lt;br /&gt;
&lt;br /&gt;
==(Outdated) Necessary Steps==&lt;br /&gt;
&lt;br /&gt;
'''Acquiring the necessary data to complete the Accelerator Master Variable List and the Cohort List will require the following (not necessarily in this order):'''&lt;br /&gt;
&lt;br /&gt;
===Step Zero: Connect to Crunchbase and Link Data===&lt;br /&gt;
&lt;br /&gt;
Complete - more info: [[Crunchbase Data]]&lt;br /&gt;
&lt;br /&gt;
===Step One: LinkedIn Founders Data===&lt;br /&gt;
&lt;br /&gt;
This project will begin by working with [[Grace Tan]] and [[Maxine Tao]] to connect accelerators to their founders and cohort companies using Crunchbase and LinkedIn crawlers. Grace and Maxine will go through [http://crunchbase.com Crunchbase] and find the UUIDs for companies and their founders (see [[Crunchbase Data]], [[Crunchbase Accelerator Founders]], [[Crunchbase Accelerator Equity]]). We will then connect them using SQL and feed the founders' names into our LinkedIn crawler (headed by [[Grace Tan]]).&lt;br /&gt;
&lt;br /&gt;
The list of founders for accelerators can be found at&lt;br /&gt;
 McNair/Projects/Accelerators/Fall 2017/founders_linkedin.txt&lt;br /&gt;
&lt;br /&gt;
The '''Unfound Founders''' file codes a 0 for all companies '''''not listed''''' within the LinkedIn Founders file, and a 1 for those that do have founders listed.&lt;br /&gt;
&lt;br /&gt;
'''Given the founders' names, we will then be able to use the [[LinkedIn Crawler (Python)]] to find the relevant details of an accelerator founder (education, work experience, etc.). This data on founders will help us answer the horse, jockey, racetrack question: which variables affect a startup's success (the accelerator, the founders, or the environment/city).'''&lt;br /&gt;
&lt;br /&gt;
===Step Two: Linking Accelerators to Cohorts Using Investments on Crunchbase===&lt;br /&gt;
&lt;br /&gt;
In this step we focus on accelerators that take equity from the companies in their programs. We do this to avoid looking at accelerators that may also run funds or invest in various companies but do not take equity. Including them would give us misleading results and lead us to believe that some companies were in accelerator cohorts that they were never part of.&lt;br /&gt;
&lt;br /&gt;
Maxine will acquire the list of accelerators who take equity from companies from the following sheet:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/All Relevant Files/accelerator_data_noflag.txt&lt;br /&gt;
&lt;br /&gt;
Looking at the file, however, shows that very few are actually categorized well and the equity variable is messy. Moving forward, we need to check/refine/fix this classification.&lt;br /&gt;
&lt;br /&gt;
This file has 266 rows. The most recent, actual version of our accelerator database (found at E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx under the sheet Master Variable List) only has 167 rows, meaning the accelerator_data_noflag.txt file has too many rows. &lt;br /&gt;
&lt;br /&gt;
We will need to do a left join of Accelerator Master List with accelerator_data_noflag.txt to get rid of the accelerator names that are in accelerator_data_noflag but NOT in Accelerator Master List.&lt;br /&gt;
&lt;br /&gt;
Once this is finished, we should have an “Equity” classification variable for every accelerator in Accelerator Master List. The accelerators coded Y (or possibly 1) are the ones that take equity. These are the accelerators we will be able to do the Crunchbase work on to see when accelerators take equity.&lt;br /&gt;
&lt;br /&gt;
We then look at the accelerators' investments (or at companies and the entities that invested in them) and cross-reference the list of companies and accelerators. Once we find a match, we know that a company went through an accelerator and in which year it went through a cohort.&lt;br /&gt;
&lt;br /&gt;
'''From this, we get the following data:'''&lt;br /&gt;
*Accelerator a given company went through&lt;br /&gt;
*Year said company went through a cohort/Specific cohort company went through&lt;br /&gt;
&lt;br /&gt;
===Step Three: Demo Day Crawler===&lt;br /&gt;
&lt;br /&gt;
This part of the project relies on the contributions of the wonderful [[Minh Le]]. Better documentation for the project can be found on the [[Demo Day Page Parser]], [[Demo Day Page Google Classifier]], and [[Accelerator Demo Day]] project pages.&lt;br /&gt;
&lt;br /&gt;
Essentially, this part of the accelerator data project will use the [[Demo Day Page Parser]] to look through accelerator websites for pages that list a cohort's 'Demo Day', the day on which an accelerator presents its companies to a group of special investors ([https://www.ycombinator.com/demoday/faq/ here's an example FAQ page from Y Combinator]). The [[Demo Day Page Google Classifier]] will then determine whether the page is, in fact, a demo day page.&lt;br /&gt;
&lt;br /&gt;
'''Given a cohort's demo day, we can gather a few pieces of key information (check with Ed to make sure this is the correct information to gather from Demo Days):'''&lt;br /&gt;
*The date a cohort began/the season the cohort went through the accelerator&lt;br /&gt;
**This is acquired by looking at the cohort's demo day date, and subtracting the number of weeks/months of a cohort for that given accelerator. The length of a cohort can be found in the file:&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx&lt;br /&gt;
*I assume we can also acquire the companies in a specific cohort, as in we will have the list of all companies in the (for example) Spring 2018 Cohort of Techstars.&lt;br /&gt;
&lt;br /&gt;
===Step Four: Non-profit Finder===&lt;br /&gt;
&lt;br /&gt;
More at [[Non-profit Finder]]&lt;br /&gt;
&lt;br /&gt;
Another important step in this project is finding which accelerators are non-profits.&lt;br /&gt;
&lt;br /&gt;
A comprehensive list of nonprofits taken from the [https://www.irs.gov/charities-non-profits/tax-exempt-organization-search-bulk-data-downloads IRS] can be found here:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Nonprofits in US.xlsx&lt;br /&gt;
Warning: this file has 1 million rows&lt;br /&gt;
&lt;br /&gt;
This file should be cross-referenced with the list of accelerators to find which ones are listed--those accelerators are non-profits. We'll need someone skilled in SQL to do this with a match.&lt;br /&gt;
&lt;br /&gt;
Potential problem:&lt;br /&gt;
*The names of listed non-profits will likely be somewhat different than the names of accelerators, because companies often file for tax exemption with different names than they show the public.&lt;br /&gt;
&lt;br /&gt;
==Workflow Image==&lt;br /&gt;
&lt;br /&gt;
I imagine the project workflow will look something like the image below, which shows the information required to complete the project. (The color coding in this image is a very rough, preliminary estimate and is subject to change.)&lt;br /&gt;
&lt;br /&gt;
[[File:image1.png]]&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23790</id>
		<title>U.S. Seed Accelerators</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23790"/>
		<updated>2018-07-26T21:59:31Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: /* FIxed Google Sheet */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=U.S. Seed Accelerators&lt;br /&gt;
|Has owner=Connor Rothschild,&lt;br /&gt;
|Has start date=06/18/2018&lt;br /&gt;
|Has keywords=accelerators, data&lt;br /&gt;
|Has notes=Continuation of [[Accelerator Data]]&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Is dependent on=Industry Classifier, Demo Day Page Parser, Accelerator Demo Day, Demo Day Page Google Classifier, Crunchbase Accelerator Equity, Crunchbase Accelerator Founders, Crunchbase Data,&lt;br /&gt;
|Has image=Seed-Accelerator.jpg&lt;br /&gt;
|Does subsume=Accelerator Data, Accelerator Seed List (Data),&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Project Location==&lt;br /&gt;
The master file can be found at &lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
==Relevant Former Projects==&lt;br /&gt;
This page serves as an updated and tidied version of the data and work presented on the [[Accelerator Seed List (Data)]] Project, which subsumed [[Accelerator Data]].&lt;br /&gt;
Both of these projects (and as a corollary, this project) are dependent on the [[Demo Day Page Parser]], [[Industry Classifier]], and the [[Whois Parser]].&lt;br /&gt;
&lt;br /&gt;
==Update for Hira==&lt;br /&gt;
&lt;br /&gt;
After our Skype call, I did the following:&lt;br /&gt;
&lt;br /&gt;
===Fixed Google Sheet===&lt;br /&gt;
&lt;br /&gt;
I first used our &amp;quot;recap&amp;quot; and &amp;quot;announced&amp;quot; classification to standardize and fix the dates.&lt;br /&gt;
We created a new sheet with only data we want to keep, and cleaned it up. That sheet is called &amp;quot;Good Data Only&amp;quot;, at the same link:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
*Columns N-R contain our new data. Please note that all of these columns are based on formulas and will be made erroneous if edited.&lt;br /&gt;
*Column N is the # of weeks for an accelerator program, gathered via VLookup from The File to Rule Them All.&lt;br /&gt;
*Column O is the Actual Date we want to record, and was gathered by subtracting the # of weeks from a date '''if''' the listed page was a '''recap'''.&lt;br /&gt;
*Columns P and Q are the Month and Years stripped from Column O.&lt;br /&gt;
*Finally, Column R is the season variable, as Ed said it should be coded. &lt;br /&gt;
&lt;br /&gt;
We have also gone through and removed all bad data, all duplicates, and all rows without timing info. This is the most complete list possible.&lt;br /&gt;
&lt;br /&gt;
===Recode employee count===&lt;br /&gt;
&lt;br /&gt;
I have added a new column in Cohorts Final (of the File to Rule Them All) yet left the old column in case you would prefer to edit/classify differently.&lt;br /&gt;
Column AB (emp_count_scale) contains a variable coded on a scale of 1 to 9, with each number corresponding to one of the employee_count classifications (1 the lowest, 9 the highest). The exact output can be modified (1 could instead be tiny, 2 be small... 9 be huge).&lt;br /&gt;
&lt;br /&gt;
The employee count column is standardized and can easily be edited given some modification of the Excel formula.&lt;br /&gt;
&lt;br /&gt;
===Normalized investment amount===&lt;br /&gt;
&lt;br /&gt;
I've been trying to fix the investment amount, but I think it's smart that we discuss this before I move forward. I've done some tentative standardization (finding the average in a range if given), but so many accelerators &amp;quot;take up to __%&amp;quot; equity and &amp;quot;invest up to $___&amp;quot; that I think it's smartest to think hard about how to standardize first. Also notable is that some accelerators say they provide &amp;quot;$____ up front and another $___ in follow up funding for each stage.&amp;quot; How do we deal with these? Message me if you'd like to talk more about this.&lt;br /&gt;
&lt;br /&gt;
I refrained from creating max and min investment columns to avoid spending time on the data before we discuss it.&lt;br /&gt;
&lt;br /&gt;
===Remaining to do===&lt;br /&gt;
&lt;br /&gt;
*Founders Experience: code job title&lt;br /&gt;
*Founders Education: remove unknowns, code degree and code major&lt;br /&gt;
*Multiple campus and cohorts&lt;br /&gt;
&lt;br /&gt;
==Recent Work==&lt;br /&gt;
&lt;br /&gt;
Here's a project update on the work that has been done since coming to McNair. The most recent file is&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
===Merging Cohort Companies with Crunchbase Info===&lt;br /&gt;
&lt;br /&gt;
More information on this part of the project can be found on the page [[Merging Existing Data with Crunchbase]].&lt;br /&gt;
&lt;br /&gt;
The newest updated sheet of cohort company info is under the '''Cohorts Final''' sheet of '''The File to Rule Them All.xlsx'''.&lt;br /&gt;
&lt;br /&gt;
Working with [[Maxine Tao]], we have matched companies to their respective pages and information found in Crunchbase (via UUID). We ensured single matches by doing a 1-1-1-1 match with our data and with Crunchbase (using the Matcher). We then received additional information on 8092 companies. The following new information (on top of what we already had) is included in the sheet:&lt;br /&gt;
*short_description&lt;br /&gt;
*long_description&lt;br /&gt;
*category_list (details the company's category)&lt;br /&gt;
*category_group_list (a less refined, more all-encompassing category classification)&lt;br /&gt;
*founded_on date&lt;br /&gt;
*employee_count&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*address&lt;br /&gt;
&lt;br /&gt;
And the following information was also pulled from Crunchbase and '''merged''' with our existing data: &lt;br /&gt;
*URL (was merged with courl cells) &lt;br /&gt;
*city (was merged with colocation)&lt;br /&gt;
*state_code (was merged with colocation)&lt;br /&gt;
*country_code (was merged with colocation)&lt;br /&gt;
*status (was merged with costatus)&lt;br /&gt;
&lt;br /&gt;
===The Equity Variables===&lt;br /&gt;
&lt;br /&gt;
[[Maxine Tao]] and I have added six new variables to the '''Accelerators Final''' sheet. Those variables are:&lt;br /&gt;
&lt;br /&gt;
*Terms of joining - terms of joining accelerator and important details about program&lt;br /&gt;
*Equity? (1/0) - cells contain a 1 if the accelerator takes equity, a 0 if the accelerator definitively does not, and are blank if we could not find that information&lt;br /&gt;
*Equity Amount - the % of equity the accelerator will take (sometimes a range, e.g. 5-7%)&lt;br /&gt;
*Equity Amount Normalized - this copies the previous column but only keeps %&amp;gt;0, and if a range was given (e.g. 5-7%) it returns the average (e.g. 6%)&lt;br /&gt;
*Investment Amount - the $ the accelerator invests in a company to begin, if relevant (also could be a range or a &amp;quot;up to $######&amp;quot;)&lt;br /&gt;
*Investment Notes - anything to comment on previous 4 columns&lt;br /&gt;
&lt;br /&gt;
These six variables tell us more about the characteristics of accelerators; specifically, which ones take equity and which ones do not, and how much equity accelerators take.&lt;br /&gt;
&lt;br /&gt;
Relevant information: &lt;br /&gt;
*82 accelerators take equity, 42 do not, and we lack information for 37.&lt;br /&gt;
*The average % of equity among accelerators that take equity (rough estimate--do not use for anything official) is 6.49%. We got this number by looking only at accelerators that take equity, coding any reported range as its midpoint (e.g. 4%-10% equity would be coded as 7% equity), and taking the mean.&lt;br /&gt;
&lt;br /&gt;
===Matching Accelerators to UUIDs via Crunchbase===&lt;br /&gt;
&lt;br /&gt;
We've also added UUIDs for 163 of our 166 accelerators. The UUIDs can be found in Column AE of the '''Accelerators Final''' sheet.&lt;br /&gt;
&lt;br /&gt;
The file with accelerators matched to Crunchbase UUIDs can be found at:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Accelerators and UUIDs.xlsx&lt;br /&gt;
&lt;br /&gt;
This is the master file and should never be modified unless we find a UUID changed. ALL OTHER SHEETS with UUIDs are linked to this sheet so its changes will be reflected elsewhere.&lt;br /&gt;
&lt;br /&gt;
More information can be found on the [[Crunchbase Data]] page. &lt;br /&gt;
&lt;br /&gt;
===Linking Accelerators to Founders/LinkedIn Crawling===&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] got the [[LinkedIn Crawler (Python)]] to work, which means we currently have the following information about accelerator founders:&lt;br /&gt;
*Current Job Title&lt;br /&gt;
*Location&lt;br /&gt;
*Employer&lt;br /&gt;
*Job(s) Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location of jobs&lt;br /&gt;
*Extra Description&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
This information can be found in the various '''Founders''' sheets in '''The File to Rule Them All'''.&lt;br /&gt;
&lt;br /&gt;
==An Overview==&lt;br /&gt;
This project will be used to determine which accelerators are the most effective at churning out successful startups, as well as what characteristics are exhibited by these accelerators. First, we need to gather as much data as we can about as many accelerators as we can in order to look at factors that differentiate successful vs. unsuccessful ventures. Next, we need to create a web crawling program which will gather information about accelerators across the world by accessing their websites and extracting information. I believe that our overall goal with this research project is to gain insight into the methods of successful accelerators, as well as to find out what exactly differentiates very successful accelerators from dead accelerators.&lt;br /&gt;
&lt;br /&gt;
Helpful Links: http://seedrankings.com/&lt;br /&gt;
&lt;br /&gt;
==Remaining To Dos==&lt;br /&gt;
&lt;br /&gt;
The last update on [[Accelerator Seed List (Data)]] said the following needed to be done: &lt;br /&gt;
&lt;br /&gt;
*Cross-reference sheet with data from Peter's old accelerator consolidation file (&amp;quot;accelerator_data_noflag&amp;quot; and &amp;quot;accelerator_data&amp;quot; in &amp;quot;All Relevant Files&amp;quot;) and fill in missing data&lt;br /&gt;
*Variables that are 100% NOT in these 2 files:&lt;br /&gt;
**Cohort Breakout?&lt;br /&gt;
**Subtype&lt;br /&gt;
**Designed for Students?&lt;br /&gt;
**Campuses&lt;br /&gt;
**Stage&lt;br /&gt;
**Software Tech&lt;br /&gt;
**What stage do they look for?&lt;br /&gt;
&lt;br /&gt;
==(Outdated) Necessary Steps==&lt;br /&gt;
&lt;br /&gt;
'''Acquiring the necessary data to complete the Accelerator Master Variable List and the Cohort List will require the following (not necessarily in this order):'''&lt;br /&gt;
&lt;br /&gt;
===Step Zero: Connect to Crunchbase and Link Data===&lt;br /&gt;
&lt;br /&gt;
Complete - more info: [[Crunchbase Data]]&lt;br /&gt;
&lt;br /&gt;
===Step One: LinkedIn Founders Data===&lt;br /&gt;
&lt;br /&gt;
This project will begin by working with [[Grace Tan]] and [[Maxine Tao]] to connect accelerators to their founders and cohort companies using Crunchbase and LinkedIn crawlers. Grace and Maxine will go through [http://crunchbase.com Crunchbase] and find the UUID for companies and their founders (reference [[Crunchbase Data]], [[Crunchbase Accelerator Founders]], [[Crunchbase Accelerator Equity]]). Connect them using SQL and feed the names of founders into our LinkedIn crawler (headed by [[Grace Tan]]).&lt;br /&gt;
&lt;br /&gt;
The list of founders for accelerators can be found at&lt;br /&gt;
 McNair/Projects/Accelerators/Fall 2017/founders_linkedin.txt&lt;br /&gt;
&lt;br /&gt;
The '''Unfound Founders''' file codes a 0 for all companies '''''not listed''''' within the LinkedIn Founders file, and a 1 for those that do have founders listed.&lt;br /&gt;
&lt;br /&gt;
'''Given the founders' names, we will then be able to use the [[LinkedIn Crawler (Python)]] to find the relevant details of an accelerator founder (education, work experience, etc.). This data on founders will help us solve the horse, jockey, racetrack question to detect what variables affect a startup's success (the accelerator, the founders, the environment/city).'''&lt;br /&gt;
&lt;br /&gt;
===Step Two: Linking Accelerators to Cohorts Using Investments on Crunchbase===&lt;br /&gt;
&lt;br /&gt;
In this step we focus on accelerators who take equity from the companies that engage in their program. We do this to avoid looking at accelerators who may also run funds or invest in various companies but do not take equity. Including them would give us misleading results and lead us to believe some companies were in cohorts at accelerators when they really were not. &lt;br /&gt;
&lt;br /&gt;
Maxine will acquire the list of accelerators who take equity from companies from the following sheet:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/All Relevant Files/accelerator_data_noflag.txt&lt;br /&gt;
&lt;br /&gt;
Looking at the file, however, shows that very few are actually categorized well and the equity variable is messy. Moving forward, we need to check/refine/fix this classification.&lt;br /&gt;
&lt;br /&gt;
This file has 266 rows. The most recent, actual version of our accelerator database (found at E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx under the sheet Master Variable List) only has 167 rows, meaning the accelerator_data_noflag.txt file has too many rows. &lt;br /&gt;
&lt;br /&gt;
We will need to do a left join of Accelerator Master List with accelerator_data_noflag.txt to get rid of the accelerator names that are in accelerator_data_noflag but NOT in Accelerator Master List.&lt;br /&gt;
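&lt;br /&gt;
The left join described above can be sketched in plain Python as follows. This is only an illustration: the field names &amp;quot;name&amp;quot; and &amp;quot;equity&amp;quot; and the example rows are assumptions, not the actual column names in accelerator_data_noflag.txt.&lt;br /&gt;

```python
def left_join_equity(master_names, noflag_rows):
    """Keep one row per master-list accelerator, attaching equity if known."""
    # Index the older file by a case-insensitive, whitespace-trimmed name.
    equity_by_name = {
        row["name"].strip().lower(): row.get("equity")
        for row in noflag_rows
    }
    # Left join: every master name survives; unmatched names get None.
    return [
        {"name": name, "equity": equity_by_name.get(name.strip().lower())}
        for name in master_names
    ]


# Made-up accelerator names, for illustration only.
master = ["Example Accelerator A", "Example Accelerator B"]
noflag = [
    {"name": "example accelerator a ", "equity": "Y"},
    {"name": "Example Accelerator C", "equity": "N"},  # not in master, dropped
]
joined = left_join_equity(master, noflag)
```

Rows that exist only in the older file disappear, and master-list accelerators with no match keep an empty equity value that still needs to be filled in by hand.&lt;br /&gt;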
&lt;br /&gt;
Once this is finished, we should have an “Equity” classification variable for every accelerator in Accelerator Master List. The accelerators that have a Y (or maybe it’s a 1) are the ones that do take equity. These are the accelerators we’ll be able to do the Crunchbase work on to see when accelerators take equity.&lt;br /&gt;
&lt;br /&gt;
We then look at the accelerators' investments (or at companies and the entities that invested in them), cross-reference the list of companies/accelerators, and once we find a match, we know that a company went through an accelerator and in which year it went through a cohort.&lt;br /&gt;
&lt;br /&gt;
'''From this, we get the following data:'''&lt;br /&gt;
*Accelerator a given company went through&lt;br /&gt;
*Year said company went through a cohort/Specific cohort company went through&lt;br /&gt;
&lt;br /&gt;
===Step Three: Demo Day Crawler===&lt;br /&gt;
&lt;br /&gt;
This part of the project relies on the contributions of the wonderful [[Minh Le]]. Better documentation for the project can be found on the [[Demo Day Page Parser]], [[Demo Day Page Google Classifier]], and [[Accelerator Demo Day]] project pages.&lt;br /&gt;
&lt;br /&gt;
Essentially, this part of the accelerator data project will use the [[Demo Day Page Parser]] to look through accelerator websites for pages which list a cohort's 'Demo Day', or the day in which accelerators present their companies to a group of special investors ([https://www.ycombinator.com/demoday/faq/ here's an example FAQ page from Y Combinator]). The [[Demo Day Page Google Classifier]] will then determine if the page is, in fact, a demo day page. &lt;br /&gt;
&lt;br /&gt;
'''Given a cohort's demo day, we can gather a few pieces of key information (check with Ed to make sure this is the correct information to gather from Demo Days):'''&lt;br /&gt;
*The date a cohort began/the season the cohort went through the accelerator&lt;br /&gt;
**This is acquired by looking at the cohort's demo day date, and subtracting the number of weeks/months of a cohort for that given accelerator. The length of a cohort can be found in the file:&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx&lt;br /&gt;
*I assume we can also acquire the companies in a specific cohort, as in we will have the list of all companies in the (for example) Spring 2018 Cohort of Techstars.&lt;br /&gt;
&lt;br /&gt;
===Step Four: Non-profit Finder===&lt;br /&gt;
&lt;br /&gt;
More at [[Non-profit Finder]]&lt;br /&gt;
&lt;br /&gt;
Another important step in this project is finding which accelerators are non-profits.&lt;br /&gt;
&lt;br /&gt;
A comprehensive list of nonprofits taken from the [https://www.irs.gov/charities-non-profits/tax-exempt-organization-search-bulk-data-downloads IRS] can be found here:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Nonprofits in US.xlsx&lt;br /&gt;
Warning: this file has 1 million rows&lt;br /&gt;
&lt;br /&gt;
This file should be cross-referenced with the list of accelerators to find which ones are listed; those accelerators are non-profits. We'll need someone good at SQL to do this with a match.&lt;br /&gt;
&lt;br /&gt;
Potential problem:&lt;br /&gt;
*The names of listed non-profits will likely be somewhat different than the names of accelerators, because companies often file for tax exemption with different names than they show the public.&lt;br /&gt;
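&lt;br /&gt;
One possible way to soften the naming problem is to normalize both lists before matching. The Python sketch below is an assumption-laden illustration: the suffix list, the function names, and the example names are all hypothetical, and it only catches differences in case, punctuation, and legal suffixes.&lt;br /&gt;

```python
import re

# Hypothetical list of legal suffixes to ignore; extend as needed.
SUFFIXES = {"inc", "incorporated", "llc", "corp", "corporation", "foundation"}


def normalize_name(name):
    """Lowercase, strip punctuation, and drop common legal suffixes."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(w for w in words if w not in SUFFIXES)


def find_nonprofits(accelerators, irs_names):
    """Return accelerators whose normalized name appears in the IRS list."""
    irs_keys = {normalize_name(n) for n in irs_names}
    return [a for a in accelerators if normalize_name(a) in irs_keys]


matches = find_nonprofits(
    ["Example Accelerator", "Sample Labs"],          # made-up names
    ["EXAMPLE ACCELERATOR, INC.", "Other Org LLC"],  # made-up IRS entries
)
```

Accelerators that file under an entirely different legal name would still need to be checked by hand.&lt;br /&gt;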
&lt;br /&gt;
==Workflow Image==&lt;br /&gt;
&lt;br /&gt;
(the color coding for this image is a very rough and preliminary estimate subject to change) I imagine the project will look something like this, in that it will require the following information to fully complete the project:&lt;br /&gt;
&lt;br /&gt;
[[File:image1.png]]&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23789</id>
		<title>U.S. Seed Accelerators</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23789"/>
		<updated>2018-07-26T21:59:20Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: /* FIxing Google Sheet */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=U.S. Seed Accelerators&lt;br /&gt;
|Has owner=Connor Rothschild,&lt;br /&gt;
|Has start date=06/18/2018&lt;br /&gt;
|Has keywords=accelerators, data&lt;br /&gt;
|Has notes=Continuation of [[Accelerator Data]]&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Is dependent on=Industry Classifier, Demo Day Page Parser, Accelerator Demo Day, Demo Day Page Google Classifier, Crunchbase Accelerator Equity, Crunchbase Accelerator Founders, Crunchbase Data,&lt;br /&gt;
|Has image=Seed-Accelerator.jpg&lt;br /&gt;
|Does subsume=Accelerator Data, Accelerator Seed List (Data),&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Project Location==&lt;br /&gt;
The master file can be found at &lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
==Relevant Former Projects==&lt;br /&gt;
This page serves as an updated and tidied version of the data and work presented on the [[Accelerator Seed List (Data)]] Project, which subsumed [[Accelerator Data]].&lt;br /&gt;
Both of these projects (and as a corollary, this project) are dependent on the [[Demo Day Page Parser]], [[Industry Classifier]], and the [[Whois Parser]].&lt;br /&gt;
&lt;br /&gt;
==Update for Hira==&lt;br /&gt;
&lt;br /&gt;
After our Skype call, I did the following:&lt;br /&gt;
&lt;br /&gt;
===Fixed Google Sheet===&lt;br /&gt;
&lt;br /&gt;
I first used our &amp;quot;recap&amp;quot; and &amp;quot;announced&amp;quot; classification to standardize and fix the dates.&lt;br /&gt;
We created a new sheet with only data we want to keep, and cleaned it up. That sheet is called &amp;quot;Good Data Only&amp;quot;, at the same link:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
*Columns N-R contain our new data. Please note that all of these columns are based on formulas and will be made erroneous if edited.&lt;br /&gt;
*Column N is the # of weeks for an accelerator program, gathered via VLookup from The File to Rule Them All.&lt;br /&gt;
*Column O is the Actual Date we want to record, and was gathered by subtracting the # of weeks from a date '''if''' the listed page was a '''recap'''.&lt;br /&gt;
*Columns P and Q are the Month and Year stripped from Column O.&lt;br /&gt;
*Finally, Column R is the season variable, as Ed said it should be coded. &lt;br /&gt;
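&lt;br /&gt;
The logic behind Columns O through Q can be sketched in Python as follows. The function name cohort_start and the example date are hypothetical, and treating every non-recap page as already carrying the date we want is an assumption; the sheet itself does this with formulas.&lt;br /&gt;

```python
from datetime import date, timedelta


def cohort_start(page_date, program_weeks, page_type):
    """Estimate the date to record for a cohort.

    For a "recap" page the listed date is treated as the end of the program,
    so the program length is subtracted; other pages keep the listed date.
    """
    if page_type == "recap":
        return page_date - timedelta(weeks=program_weeks)
    return page_date


# Made-up example: a recap page dated 15 June 2018 for a 12-week program.
start = cohort_start(date(2018, 6, 15), 12, "recap")
month, year = start.month, start.year
```

The season variable in Column R is not reproduced here because its coding rule is not spelled out on this page.&lt;br /&gt;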
&lt;br /&gt;
We have also gone through and removed all bad data, all duplicates, and all rows without timing info. This is the most complete list possible.&lt;br /&gt;
&lt;br /&gt;
===Recode employee count===&lt;br /&gt;
&lt;br /&gt;
I have added a new column in Cohorts Final (of the File to Rule Them All) but left the old column in case you would prefer to edit/classify differently.&lt;br /&gt;
Column AB (emp_count_scale) contains a variable coded on a scale of 1 to 9, with each number corresponding to one of the employee_count classifications (1 the lowest, 9 the highest). The exact output can be modified (1 could instead be tiny, 2 be small... 9 be huge).&lt;br /&gt;
&lt;br /&gt;
The employee count column is standardized and can easily be edited given some modification of the Excel formula.&lt;br /&gt;
&lt;br /&gt;
===Normalized investment amount===&lt;br /&gt;
&lt;br /&gt;
I've been trying to fix the investment amount, but I think it's smart we discuss this before I move forward. I've done some tentative standardization (finding the average in a range if given), but so many accelerators &amp;quot;take up to __%&amp;quot; equity and &amp;quot;invest up to $___&amp;quot; that I think it's smartest we think hard about how to standardize first. Also notable is that some accelerators say they provide &amp;quot;$____ up front and another $___ in follow up funding for each stage.&amp;quot; How do we deal with these? Message me if you'd like to talk more about this.&lt;br /&gt;
&lt;br /&gt;
I refrained from creating max and min investment columns to avoid spending time on the data before we discuss it.&lt;br /&gt;
&lt;br /&gt;
===Remaining to do===&lt;br /&gt;
&lt;br /&gt;
*Founders Experience: code job title&lt;br /&gt;
*Founders Education: remove unknowns, code degree and code major&lt;br /&gt;
*Multiple campus and cohorts&lt;br /&gt;
&lt;br /&gt;
==Recent Work==&lt;br /&gt;
&lt;br /&gt;
Here's a project update on the work that has been done since coming to McNair. The most recent file is&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
===Merging Cohort Companies with Crunchbase Info===&lt;br /&gt;
&lt;br /&gt;
More information on this part of the project can be found on the page [[Merging Existing Data with Crunchbase]].&lt;br /&gt;
&lt;br /&gt;
The newest updated sheet of cohort company info is under the '''Cohorts Final''' sheet of '''The File to Rule Them All.xlsx'''.&lt;br /&gt;
&lt;br /&gt;
Working with [[Maxine Tao]], we have matched companies to their respective pages and information found in Crunchbase (via UUID). We ensured single matches by doing a 1-1-1-1 match with our data and with Crunchbase (using the Matcher). We then received additional information on 8092 companies. The following new information (on top of what we already had) is included in the sheet:&lt;br /&gt;
*short_description&lt;br /&gt;
*long_description&lt;br /&gt;
*category_list (details the company's category)&lt;br /&gt;
*category_group_list (a less refined, more all-encompassing category classification)&lt;br /&gt;
*founded_on date&lt;br /&gt;
*employee_count&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*address&lt;br /&gt;
&lt;br /&gt;
And the following information was also pulled from Crunchbase and '''merged''' with our existing data: &lt;br /&gt;
*URL (was merged with courl cells) &lt;br /&gt;
*city (was merged with colocation)&lt;br /&gt;
*state_code (was merged with colocation)&lt;br /&gt;
*country_code (was merged with colocation)&lt;br /&gt;
*status (was merged with costatus)&lt;br /&gt;
&lt;br /&gt;
===The Equity Variables===&lt;br /&gt;
&lt;br /&gt;
[[Maxine Tao]] and I have added six new variables to the '''Accelerators Final''' sheet. Those variables are:&lt;br /&gt;
&lt;br /&gt;
*Terms of joining - terms of joining accelerator and important details about program&lt;br /&gt;
*Equity? (1/0) - cells contain a 1 if the accelerator takes equity, a 0 if an accelerator definitively does not, and are blank if we could not find that information&lt;br /&gt;
*Equity Amount - the % of equity the accelerator will take (can sometimes be a range (e.g. 5-7%))&lt;br /&gt;
*Equity Amount Normalized - this copies the previous column but only keeps %&amp;gt;0, and if a range was given (e.g. 5-7%) it returns the average (e.g. 6%)&lt;br /&gt;
*Investment Amount - the $ the accelerator invests in a company to begin, if relevant (could also be a range or an &amp;quot;up to $######&amp;quot; figure)&lt;br /&gt;
*Investment Notes - anything to comment on previous 4 columns&lt;br /&gt;
&lt;br /&gt;
These six variables tell us more about the characteristics of accelerators; specifically, which ones take equity and which ones do not, and how much equity accelerators take.&lt;br /&gt;
&lt;br /&gt;
Relevant information: &lt;br /&gt;
*82 accelerators take equity, 42 do not, and we lack information for 37.&lt;br /&gt;
*The average % of equity among accelerators who take equity is 6.49% (rough estimate; do not use for anything official). We got this number by looking only at accelerators who take equity, averaging the equity amount for accelerators who report a range (e.g. 4%-10% equity would be coded as 7% equity), and taking the mean.&lt;br /&gt;
&lt;br /&gt;
===Matching Accelerators to UUIDs via Crunchbase===&lt;br /&gt;
&lt;br /&gt;
We've also added UUIDs for 163 of our 166 accelerators. The UUIDs can be found in Column AE of the '''Accelerators Final''' sheet.&lt;br /&gt;
&lt;br /&gt;
The file with accelerators matched to Crunchbase UUIDs can be found at:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Accelerators and UUIDs.xlsx&lt;br /&gt;
&lt;br /&gt;
This is the master file and should never be modified unless we find that a UUID has changed. ALL OTHER SHEETS with UUIDs are linked to this sheet, so its changes will be reflected elsewhere.&lt;br /&gt;
&lt;br /&gt;
More information can be found on the [[Crunchbase Data]] page. &lt;br /&gt;
&lt;br /&gt;
===Linking Accelerators to Founders/LinkedIn Crawling===&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] got the [[LinkedIn Crawler (Python)]] to work, which means we currently have the following information about accelerator founders:&lt;br /&gt;
*Current Job Title&lt;br /&gt;
*Location&lt;br /&gt;
*Employer&lt;br /&gt;
*Job(s) Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location of jobs&lt;br /&gt;
*Extra Description&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
This information can be found in the various '''Founders''' sheets in '''The File to Rule Them All'''.&lt;br /&gt;
&lt;br /&gt;
==An Overview==&lt;br /&gt;
This project will be used to determine which accelerators are the most effective at churning out successful startups, as well as what characteristics are exhibited by these accelerators. First, we need to gather as much data as we can about as many accelerators as we can in order to look at factors that differentiate successful vs. unsuccessful ventures. Next, we need to create a web crawling program which will gather information about accelerators across the world by accessing their websites and extracting information. I believe that our overall goal with this research project is to gain insight into the methods of successful accelerators, as well as to find out what exactly differentiates very successful accelerators from dead accelerators.&lt;br /&gt;
&lt;br /&gt;
Helpful Links: http://seedrankings.com/&lt;br /&gt;
&lt;br /&gt;
==Remaining To Dos==&lt;br /&gt;
&lt;br /&gt;
The last update on [[Accelerator Seed List (Data)]] said the following needed to be done: &lt;br /&gt;
&lt;br /&gt;
*Cross-reference sheet with data from Peter's old accelerator consolidation file (&amp;quot;accelerator_data_noflag&amp;quot; and &amp;quot;accelerator_data&amp;quot; in &amp;quot;All Relevant Files&amp;quot;) and fill in missing data&lt;br /&gt;
*Variables that are 100% NOT in these 2 files:&lt;br /&gt;
**Cohort Breakout?&lt;br /&gt;
**Subtype&lt;br /&gt;
**Designed for Students?&lt;br /&gt;
**Campuses&lt;br /&gt;
**Stage&lt;br /&gt;
**Software Tech&lt;br /&gt;
**What stage do they look for?&lt;br /&gt;
&lt;br /&gt;
==(Outdated) Necessary Steps==&lt;br /&gt;
&lt;br /&gt;
'''Acquiring the necessary data to complete the Accelerator Master Variable List and the Cohort List will require the following (not necessarily in this order):'''&lt;br /&gt;
&lt;br /&gt;
===Step Zero: Connect to Crunchbase and Link Data===&lt;br /&gt;
&lt;br /&gt;
Complete - more info: [[Crunchbase Data]]&lt;br /&gt;
&lt;br /&gt;
===Step One: LinkedIn Founders Data===&lt;br /&gt;
&lt;br /&gt;
This project will begin by working with [[Grace Tan]] and [[Maxine Tao]] to connect accelerators to their founders and cohort companies using Crunchbase and LinkedIn crawlers. Grace and Maxine will go through [http://crunchbase.com Crunchbase] and find the UUID for companies and their founders (reference [[Crunchbase Data]], [[Crunchbase Accelerator Founders]], [[Crunchbase Accelerator Equity]]). Connect them using SQL and feed the names of founders into our LinkedIn crawler (headed by [[Grace Tan]]).&lt;br /&gt;
&lt;br /&gt;
The list of founders for accelerators can be found at&lt;br /&gt;
 McNair/Projects/Accelerators/Fall 2017/founders_linkedin.txt&lt;br /&gt;
&lt;br /&gt;
The '''Unfound Founders''' file codes a 0 for all companies '''''not listed''''' within the LinkedIn Founders file, and a 1 for those that do have founders listed.&lt;br /&gt;
&lt;br /&gt;
'''Given the founders' names, we will then be able to use the [[LinkedIn Crawler (Python)]] to find the relevant details of an accelerator founder (education, work experience, etc.). This data on founders will help us solve the horse, jockey, racetrack question to detect what variables affect a startup's success (the accelerator, the founders, the environment/city).'''&lt;br /&gt;
&lt;br /&gt;
===Step Two: Linking Accelerators to Cohorts Using Investments on Crunchbase===&lt;br /&gt;
&lt;br /&gt;
In this step we focus on accelerators who take equity from the companies that engage in their program. We do this to avoid looking at accelerators who may also run funds or invest in various companies but do not take equity. Including them would give us misleading results and lead us to believe some companies were in cohorts at accelerators when they really were not. &lt;br /&gt;
&lt;br /&gt;
Maxine will acquire the list of accelerators who take equity from companies from the following sheet:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/All Relevant Files/accelerator_data_noflag.txt&lt;br /&gt;
&lt;br /&gt;
Looking at the file, however, shows that very few are actually categorized well and the equity variable is messy. Moving forward, we need to check/refine/fix this classification.&lt;br /&gt;
&lt;br /&gt;
This file has 266 rows. The most recent, actual version of our accelerator database (found at E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx under the sheet Master Variable List) only has 167 rows, meaning the accelerator_data_noflag.txt file has too many rows. &lt;br /&gt;
&lt;br /&gt;
We will need to do a left join of Accelerator Master List with accelerator_data_noflag.txt to get rid of the accelerator names that are in accelerator_data_noflag but NOT in Accelerator Master List.&lt;br /&gt;
&lt;br /&gt;
Once this is finished, we should have an “Equity” classification variable for every accelerator in Accelerator Master List. The accelerators that have a Y (or maybe it’s a 1) are the ones that do take equity. These are the accelerators we’ll be able to do the Crunchbase work on to see when accelerators take equity.&lt;br /&gt;
&lt;br /&gt;
We then look at the accelerators' investments (or at companies and the entities that invested in them), cross-reference the list of companies/accelerators, and once we find a match, we know that a company went through an accelerator and in which year it went through a cohort.&lt;br /&gt;
&lt;br /&gt;
'''From this, we get the following data:'''&lt;br /&gt;
*Accelerator a given company went through&lt;br /&gt;
*Year said company went through a cohort/Specific cohort company went through&lt;br /&gt;
&lt;br /&gt;
===Step Three: Demo Day Crawler===&lt;br /&gt;
&lt;br /&gt;
This part of the project relies on the contributions of the wonderful [[Minh Le]]. Better documentation for the project can be found on the [[Demo Day Page Parser]], [[Demo Day Page Google Classifier]], and [[Accelerator Demo Day]] project pages.&lt;br /&gt;
&lt;br /&gt;
Essentially, this part of the accelerator data project will use the [[Demo Day Page Parser]] to look through accelerator websites for pages which list a cohort's 'Demo Day', or the day in which accelerators present their companies to a group of special investors ([https://www.ycombinator.com/demoday/faq/ here's an example FAQ page from Y Combinator]). The [[Demo Day Page Google Classifier]] will then determine if the page is, in fact, a demo day page. &lt;br /&gt;
&lt;br /&gt;
'''Given a cohort's demo day, we can gather a few pieces of key information (check with Ed to make sure this is the correct information to gather from Demo Days):'''&lt;br /&gt;
*The date a cohort began/the season the cohort went through the accelerator&lt;br /&gt;
**This is acquired by looking at the cohort's demo day date, and subtracting the number of weeks/months of a cohort for that given accelerator. The length of a cohort can be found in the file:&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx&lt;br /&gt;
*I assume we can also acquire the companies in a specific cohort, as in we will have the list of all companies in the (for example) Spring 2018 Cohort of Techstars.&lt;br /&gt;
&lt;br /&gt;
===Step Four: Non-profit Finder===&lt;br /&gt;
&lt;br /&gt;
More at [[Non-profit Finder]]&lt;br /&gt;
&lt;br /&gt;
Another important step in this project is finding which accelerators are non-profits.&lt;br /&gt;
&lt;br /&gt;
A comprehensive list of nonprofits taken from the [https://www.irs.gov/charities-non-profits/tax-exempt-organization-search-bulk-data-downloads IRS] can be found here:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Nonprofits in US.xlsx&lt;br /&gt;
Warning: this file has 1 million rows&lt;br /&gt;
&lt;br /&gt;
This file should be cross-referenced with the list of accelerators to find which ones are listed; those accelerators are non-profits. We'll need someone good at SQL to do this with a match.&lt;br /&gt;
&lt;br /&gt;
Potential problem:&lt;br /&gt;
*The names of listed non-profits will likely be somewhat different than the names of accelerators, because companies often file for tax exemption with different names than they show the public.&lt;br /&gt;
&lt;br /&gt;
==Workflow Image==&lt;br /&gt;
&lt;br /&gt;
(the color coding for this image is a very rough and preliminary estimate subject to change) I imagine the project will look something like this, in that it will require the following information to fully complete the project:&lt;br /&gt;
&lt;br /&gt;
[[File:image1.png]]&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23788</id>
		<title>U.S. Seed Accelerators</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23788"/>
		<updated>2018-07-26T21:58:36Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: /* Remaning to do */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=U.S. Seed Accelerators&lt;br /&gt;
|Has owner=Connor Rothschild,&lt;br /&gt;
|Has start date=06/18/2018&lt;br /&gt;
|Has keywords=accelerators, data&lt;br /&gt;
|Has notes=Continuation of [[Accelerator Data]]&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Is dependent on=Industry Classifier, Demo Day Page Parser, Accelerator Demo Day, Demo Day Page Google Classifier, Crunchbase Accelerator Equity, Crunchbase Accelerator Founders, Crunchbase Data,&lt;br /&gt;
|Has image=Seed-Accelerator.jpg&lt;br /&gt;
|Does subsume=Accelerator Data, Accelerator Seed List (Data),&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Project Location==&lt;br /&gt;
The master file can be found at &lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
==Relevant Former Projects==&lt;br /&gt;
This page serves as an updated and tidied version of the data and work presented on the [[Accelerator Seed List (Data)]] Project, which subsumed [[Accelerator Data]].&lt;br /&gt;
Both of these projects (and as a corollary, this project) are dependent on the [[Demo Day Page Parser]], [[Industry Classifier]], and the [[Whois Parser]].&lt;br /&gt;
&lt;br /&gt;
==Update for Hira==&lt;br /&gt;
&lt;br /&gt;
After our Skype call, I did the following:&lt;br /&gt;
&lt;br /&gt;
===Fixing Google Sheet===&lt;br /&gt;
&lt;br /&gt;
I first used our &amp;quot;recap&amp;quot; and &amp;quot;announced&amp;quot; classification to standardize and fix the dates.&lt;br /&gt;
We created a new sheet with only data we want to keep, and cleaned it up. That sheet is called &amp;quot;Good Data Only&amp;quot;, at the same link:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
*Columns N-R contain our new data. Please note that all of these columns are based on formulas and will be made erroneous if edited.&lt;br /&gt;
*Column N is the # of weeks for an accelerator program, gathered via VLookup from The File to Rule Them All.&lt;br /&gt;
*Column O is the Actual Date we want to record, and was gathered by subtracting the # of weeks from a date '''if''' the listed page was a '''recap'''.&lt;br /&gt;
*Columns P and Q are the Month and Year stripped from Column O.&lt;br /&gt;
*Finally, Column R is the season variable, as Ed said it should be coded. &lt;br /&gt;
&lt;br /&gt;
We have also gone through and removed all bad data, all duplicates, and all rows without timing info. This is the most complete list possible.&lt;br /&gt;
&lt;br /&gt;
===Recode employee count===&lt;br /&gt;
&lt;br /&gt;
I have added a new column in Cohorts Final (of the File to Rule Them All) but left the old column in case you would prefer to edit/classify differently.&lt;br /&gt;
Column AB (emp_count_scale) contains a variable coded on a scale of 1 to 9, with each number corresponding to one of the employee_count classifications (1 the lowest, 9 the highest). The exact output can be modified (1 could instead be tiny, 2 be small... 9 be huge).&lt;br /&gt;
&lt;br /&gt;
The employee count column is standardized and can easily be edited given some modification of the Excel formula.&lt;br /&gt;
&lt;br /&gt;
===Normalized investment amount===&lt;br /&gt;
&lt;br /&gt;
I've been trying to fix the investment amount, but I think it's smart we discuss this before I move forward. I've done some tentative standardization (finding the average in a range if given), but so many accelerators &amp;quot;take up to __%&amp;quot; equity and &amp;quot;invest up to $___&amp;quot; that I think it's smartest we think hard about how to standardize first. Also notable is that some accelerators say they provide &amp;quot;$____ up front and another $___ in follow up funding for each stage.&amp;quot; How do we deal with these? Message me if you'd like to talk more about this.&lt;br /&gt;
&lt;br /&gt;
I refrained from creating max and min investment columns to avoid spending time on the data before we discuss it.&lt;br /&gt;
&lt;br /&gt;
===Remaining to do===&lt;br /&gt;
&lt;br /&gt;
*Founders Experience: code job title&lt;br /&gt;
*Founders Education: remove unknowns, code degree and code major&lt;br /&gt;
*Multiple campus and cohorts&lt;br /&gt;
&lt;br /&gt;
==Recent Work==&lt;br /&gt;
&lt;br /&gt;
Here's a project update on the work that has been done since coming to McNair. The most recent file is&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
===Merging Cohort Companies with Crunchbase Info===&lt;br /&gt;
&lt;br /&gt;
More information on this part of the project can be found on the page [[Merging Existing Data with Crunchbase]].&lt;br /&gt;
&lt;br /&gt;
The newest updated sheet of cohort company info is under the '''Cohorts Final''' sheet of '''The File to Rule Them All.xlsx'''.&lt;br /&gt;
&lt;br /&gt;
Working with [[Maxine Tao]], we have matched companies to their respective pages and information found in Crunchbase (via UUID). We ensured single matches by doing a 1-1-1-1 match with our data and with Crunchbase (using the Matcher). We then received additional information on 8092 companies. The following new information (on top of what we already had) is included in the sheet:&lt;br /&gt;
*short_description&lt;br /&gt;
*long_description&lt;br /&gt;
*category_list (details the company's category)&lt;br /&gt;
*category_group_list (a less refined, more all-encompassing category classification)&lt;br /&gt;
*founded_on date&lt;br /&gt;
*employee_count&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*address&lt;br /&gt;
&lt;br /&gt;
And the following information was also pulled from Crunchbase and '''merged''' with our existing data: &lt;br /&gt;
*URL (was merged with courl cells) &lt;br /&gt;
*city (was merged with colocation)&lt;br /&gt;
*state_code (was merged with colocation)&lt;br /&gt;
*country_code (was merged with colocation)&lt;br /&gt;
*status (was merged with costatus)&lt;br /&gt;
&lt;br /&gt;
===The Equity Variables===&lt;br /&gt;
&lt;br /&gt;
[[Maxine Tao]] and I have added six new variables to the '''Accelerators Final''' sheet. Those variables are:&lt;br /&gt;
&lt;br /&gt;
*Terms of joining - terms of joining accelerator and important details about program&lt;br /&gt;
*Equity? (1/0) - cells contain a 1 if the accelerator takes equity, a 0 if the accelerator definitively does not, and are blank if we could not find that information&lt;br /&gt;
*Equity Amount - the % of equity the accelerator will take (can sometimes be a range (e.g. 5-7%))&lt;br /&gt;
*Equity Amount Normalized - this copies the previous column but only keeps %&amp;gt;0, and if a range was given (e.g. 5-7%) it returns the average (e.g. 6%)&lt;br /&gt;
*Investment Amount - the $ the accelerator invests in a company to begin, if relevant (also could be a range or an &amp;quot;up to $######&amp;quot;)&lt;br /&gt;
*Investment Notes - anything to comment on previous 4 columns&lt;br /&gt;
&lt;br /&gt;
These six variables tell us more about the characteristics of accelerators; specifically, which ones take equity and which ones do not, and how much equity accelerators take.&lt;br /&gt;
&lt;br /&gt;
Relevant information: &lt;br /&gt;
*82 accelerators take equity, 42 do not, and we lack information for 37.&lt;br /&gt;
*The average % of equity among accelerators who take equity is 6.49% (rough estimate--do not use for anything official). We got this number by looking only at accelerators who take equity, averaging the equity amount for accelerators who report a range (e.g. 4%-10% equity would be coded as 7% equity), and taking the mean.&lt;br /&gt;
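&lt;br /&gt;
The range-averaging rule used for Equity Amount Normalized can be sketched as follows. This is an illustration under assumptions, not the formula in the sheet: the function name is hypothetical, every number found in the cell is averaged, and an &amp;quot;up to&amp;quot; value is taken at face value, which is exactly the open question raised under Normalized investment amount.&lt;br /&gt;

```python
import re


def normalize_equity(raw):
    """Average the numbers in a cell such as '5-7%' (hypothetical helper).

    Returns None for blank cells and for a stated 0%, mirroring the rule
    that only positive percentages are kept.
    """
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", raw or "")]
    if not numbers:
        return None
    value = sum(numbers) / len(numbers)
    # The regex never matches a minus sign, so zero is the only
    # non-positive result possible here.
    return value if value else None
```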
&lt;br /&gt;
===Matching Accelerators to UUIDs via Crunchbase===&lt;br /&gt;
&lt;br /&gt;
We've also added UUIDs for 163 of our 166 accelerators. The UUIDs can be found in Column AE of the '''Accelerators Final''' sheet.&lt;br /&gt;
&lt;br /&gt;
The file with accelerators matched to Crunchbase UUIDs can be found at:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Accelerators and UUIDs.xlsx&lt;br /&gt;
&lt;br /&gt;
This is the master file and should never be modified unless we find a UUID changed. ALL OTHER SHEETS with UUIDs are linked to this sheet so its changes will be reflected elsewhere.&lt;br /&gt;
&lt;br /&gt;
More information can be found on the [[Crunchbase Data]] page. &lt;br /&gt;
&lt;br /&gt;
===Linking Accelerators to Founders/LinkedIn Crawling===&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] got the [[LinkedIn Crawler (Python)]] to work, which means we currently have the following information about accelerator founders:&lt;br /&gt;
*Current Job Title&lt;br /&gt;
*Location&lt;br /&gt;
*Employer&lt;br /&gt;
*Job(s) Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location of jobs&lt;br /&gt;
*Extra Description&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
This information can be found in the various '''Founders''' sheets in '''The File to Rule Them All'''.&lt;br /&gt;
&lt;br /&gt;
==An Overview==&lt;br /&gt;
This project will be used to determine which accelerators are the most effective at churning out successful startups, as well as what characteristics are exhibited by these accelerators. First, we need to gather as much data as we can about as many accelerators as we can in order to look at factors that differentiate successful vs. unsuccessful ventures. Next, we need to create a web crawling program which will gather information about accelerators across the world by accessing their websites and extracting information. I believe that our overall goal with this research project is to gain insight into the methods of successful accelerators, as well as to find out what exactly differentiates very successful accelerators from dead accelerators.&lt;br /&gt;
&lt;br /&gt;
Helpful Links: http://seedrankings.com/&lt;br /&gt;
&lt;br /&gt;
==Remaining To Dos==&lt;br /&gt;
&lt;br /&gt;
The last update on [[Accelerator Seed List (Data)]] said the following needed to be done: &lt;br /&gt;
&lt;br /&gt;
*Cross-reference sheet with data from Peter's old accelerator consolidation file (&amp;quot;accelerator_data_noflag&amp;quot; and &amp;quot;accelerator_data&amp;quot; in &amp;quot;All Relevant Files&amp;quot;) and fill in missing data&lt;br /&gt;
*Variables that are 100% NOT in these 2 files:&lt;br /&gt;
**Cohort Breakout?&lt;br /&gt;
**Subtype&lt;br /&gt;
**Designed for Students?&lt;br /&gt;
**Campuses&lt;br /&gt;
**Stage&lt;br /&gt;
**Software Tech&lt;br /&gt;
**What stage do they look for?&lt;br /&gt;
&lt;br /&gt;
==(Outdated) Necessary Steps==&lt;br /&gt;
&lt;br /&gt;
'''Acquiring the necessary data to complete the Accelerator Master Variable List and the Cohort List will require the following (not necessarily in this order):'''&lt;br /&gt;
&lt;br /&gt;
===Step Zero: Connect to Crunchbase and Link Data===&lt;br /&gt;
&lt;br /&gt;
Complete - more info: [[Crunchbase Data]]&lt;br /&gt;
&lt;br /&gt;
===Step One: LinkedIn Founders Data===&lt;br /&gt;
&lt;br /&gt;
This project will begin by working with [[Grace Tan]] and [[Maxine Tao]] to connect accelerators to their founders and cohort companies using Crunchbase and LinkedIn crawlers. Grace and Maxine will go through [http://crunchbase.com Crunchbase] and find the UUID for companies and their founders (reference [[Crunchbase Data]], [[Crunchbase Accelerator Founders]], [[Crunchbase Accelerator Equity]]). We will then connect them using SQL and feed the names of founders into our LinkedIn crawler (headed by [[Grace Tan]]).&lt;br /&gt;
&lt;br /&gt;
The list of founders for accelerators can be found at&lt;br /&gt;
 McNair/Projects/Accelerators/Fall 2017/founders_linkedin.txt&lt;br /&gt;
&lt;br /&gt;
The '''Unfound Founders''' file codes a 0 for all companies '''''not listed''''' within the LinkedIn Founders file, and a 1 for those that do have founders listed.&lt;br /&gt;
&lt;br /&gt;
'''Given the founders' names, we will then be able to use the [[LinkedIn Crawler (Python)]] to find the relevant details of an accelerator founder (education, work experience, etc.). This data on founders will help us solve the horse, jockey, racetrack question to detect what variables affect a startup's success (the accelerator, the founders, the environment/city).'''&lt;br /&gt;
&lt;br /&gt;
===Step Two: Linking Accelerators to Cohorts Using Investments on Crunchbase===&lt;br /&gt;
&lt;br /&gt;
In this step we focus on accelerators who take equity from the companies that participate in their program. We do this to avoid looking at accelerators who may also run funds/invest in various companies but do not take equity. Including them would give us misleading results and lead us to believe some companies are in cohorts at accelerators when they really are not.&lt;br /&gt;
&lt;br /&gt;
Maxine will acquire the list of accelerators who take equity from companies from the following sheet:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/All Relevant Files/accelerator_data_noflag.txt&lt;br /&gt;
&lt;br /&gt;
Looking at the file, however, shows that very few are actually categorized well and the equity variable is messy. Moving forward, we need to check/refine/fix this classification.&lt;br /&gt;
&lt;br /&gt;
This file has 266 rows. The most recent, actual version of our accelerator database (found at E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx under the sheet Master Variable List) only has 167 rows, meaning the accelerator_data_noflag.txt file has too many rows. &lt;br /&gt;
&lt;br /&gt;
We will need to do a left join of Accelerator Master List with accelerator_data_noflag.txt to get rid of the accelerator names that are in accelerator_data_noflag but NOT in Accelerator Master List.&lt;br /&gt;
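&lt;br /&gt;
A left join of this kind keeps every accelerator in the master list and drops names that appear only in accelerator_data_noflag. A minimal sketch in plain Python, assuming hypothetical field names (name, equity) and a simple case-insensitive name match:&lt;br /&gt;

```python
def left_join(master_names, noflag_rows):
    """Attach the equity flag from noflag_rows to each master accelerator.

    Accelerators missing from noflag_rows are kept with equity set to None;
    rows that exist only in noflag_rows are dropped.
    """
    equity_by_name = {
        row["name"].strip().lower(): row.get("equity") for row in noflag_rows
    }
    return [
        {"name": name, "equity": equity_by_name.get(name.strip().lower())}
        for name in master_names
    ]


# Illustrative rows only, not real project data.
master = ["Techstars", "Y Combinator"]
noflag = [
    {"name": "techstars", "equity": "Y"},
    {"name": "Not In Master", "equity": "N"},
]
joined = left_join(master, noflag)
```

Names rarely match this cleanly in practice, so the rows that come back with no equity value are the ones to check by hand.&lt;br /&gt;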
&lt;br /&gt;
Once this is finished, we should have an “Equity” classification variable for every accelerator in Accelerator Master List. The accelerators that have a Y (or maybe it’s a 1) are the ones that take equity. These are the accelerators we’ll be able to do the Crunchbase work on to see when accelerators take equity.&lt;br /&gt;
&lt;br /&gt;
We then look at the accelerators' investments (or at companies and the entities which invested in them) and cross-reference the list of companies/accelerators. Once we find a match, we know that a company went through an accelerator and in which year it went through a cohort.&lt;br /&gt;
&lt;br /&gt;
'''From this, we get the following data:'''&lt;br /&gt;
*Accelerator a given company went through&lt;br /&gt;
*Year said company went through a cohort/Specific cohort company went through&lt;br /&gt;
&lt;br /&gt;
===Step Three: Demo Day Crawler===&lt;br /&gt;
&lt;br /&gt;
This part of the project relies on the contributions of the wonderful [[Minh Le]]. Better documentation for the project can be found on the [[Demo Day Page Parser]], [[Demo Day Page Google Classifier]], and [[Accelerator Demo Day]] project pages.&lt;br /&gt;
&lt;br /&gt;
Essentially, this part of the accelerator data project will use the [[Demo Day Page Parser]] to look through accelerator websites for pages which list a cohort's 'Demo Day', or the day on which accelerators present their companies to a group of special investors ([https://www.ycombinator.com/demoday/faq/ here's an example FAQ page from Y Combinator]). The [[Demo Day Page Google Classifier]] will then determine if the page is, in fact, a demo day page.&lt;br /&gt;
&lt;br /&gt;
'''Given a cohort's demo day, we can gather a few pieces of key information (check with Ed to make sure this is the correct information to gather from Demo Days):'''&lt;br /&gt;
*The date a cohort began/the season the cohort went through the accelerator&lt;br /&gt;
**This is acquired by looking at the cohort's demo day date, and subtracting the number of weeks/months of a cohort for that given accelerator. The length of a cohort can be found in the file:&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx&lt;br /&gt;
*I assume we can also acquire the companies in a specific cohort, as in we will have the list of all companies in the (for example) Spring 2018 Cohort of Techstars.&lt;br /&gt;
&lt;br /&gt;
===Step Four: Non-profit Finder===&lt;br /&gt;
&lt;br /&gt;
More at [[Non-profit Finder]]&lt;br /&gt;
&lt;br /&gt;
Another important step in this project is finding which accelerators are non-profits.&lt;br /&gt;
&lt;br /&gt;
A comprehensive list of nonprofits taken from the [https://www.irs.gov/charities-non-profits/tax-exempt-organization-search-bulk-data-downloads IRS] can be found here:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Nonprofits in US.xlsx&lt;br /&gt;
Warning: this file has 1 million rows&lt;br /&gt;
&lt;br /&gt;
This file should be cross-referenced with the list of accelerators to find which ones are listed--those accelerators are non-profits. We'll need someone good at SQL to do this with a match.&lt;br /&gt;
&lt;br /&gt;
Potential problem:&lt;br /&gt;
*The names of listed non-profits will likely be somewhat different than the names of accelerators, because companies often file for tax exemption with different names than they show the public.&lt;br /&gt;
&lt;br /&gt;
==Workflow Image==&lt;br /&gt;
&lt;br /&gt;
(the color coding for this image is a very rough and preliminary estimate subject to change) I imagine the project will look something like this, in that it will require the following information to fully complete the project:&lt;br /&gt;
&lt;br /&gt;
[[File:image1.png]]&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23786</id>
		<title>U.S. Seed Accelerators</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23786"/>
		<updated>2018-07-26T21:53:59Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=U.S. Seed Accelerators&lt;br /&gt;
|Has owner=Connor Rothschild,&lt;br /&gt;
|Has start date=06/18/2018&lt;br /&gt;
|Has keywords=accelerators, data&lt;br /&gt;
|Has notes=Continuation of [[Accelerator Data]]&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Is dependent on=Industry Classifier, Demo Day Page Parser, Accelerator Demo Day, Demo Day Page Google Classifier, Crunchbase Accelerator Equity, Crunchbase Accelerator Founders, Crunchbase Data,&lt;br /&gt;
|Has image=Seed-Accelerator.jpg&lt;br /&gt;
|Does subsume=Accelerator Data, Accelerator Seed List (Data),&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Project Location==&lt;br /&gt;
The master file can be found at &lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
==Relevant Former Projects==&lt;br /&gt;
This page serves as an updated and tidied version of the data and work presented on the [[Accelerator Seed List (Data)]] Project, which subsumed [[Accelerator Data]].&lt;br /&gt;
Both of these projects (and as a corollary, this project) are dependent on the [[Demo Day Page Parser]], [[Industry Classifier]], and the [[Whois Parser]].&lt;br /&gt;
&lt;br /&gt;
==Update for Hira==&lt;br /&gt;
&lt;br /&gt;
After our Skype call, I did the following:&lt;br /&gt;
&lt;br /&gt;
===Fixing Google Sheet===&lt;br /&gt;
&lt;br /&gt;
I first used our &amp;quot;recap&amp;quot; and &amp;quot;announced&amp;quot; classification to standardize and fix the dates.&lt;br /&gt;
We created a new sheet with only data we want to keep, and cleaned it up. That sheet is called &amp;quot;Good Data Only&amp;quot;, at the same link:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
*Columns N-R contain our new data. Please note that all of these columns are based on formulas and will be made erroneous if edited.&lt;br /&gt;
*Column N is the # of weeks for an accelerator program, gathered via VLookup from The File to Rule Them All.&lt;br /&gt;
*Column O is the Actual Date we want to record, and was gathered by subtracting the # of weeks from a date '''if''' the listed page was a '''recap'''.&lt;br /&gt;
*Columns P and Q are the Month and Years stripped from Column O.&lt;br /&gt;
*Finally, Column R is the season variable, as Ed said it should be coded. &lt;br /&gt;
&lt;br /&gt;
We have also gone through and removed all bad data, all duplicates, and all rows without timing info. This is the most complete list possible.&lt;br /&gt;
&lt;br /&gt;
===Recode employee count===&lt;br /&gt;
&lt;br /&gt;
I have added a new column in Cohorts Final (of The File to Rule Them All) but left the old column in case you would prefer to edit/classify differently.&lt;br /&gt;
Column AB (emp_count_scale) contains a variable coded on a scale of 1 to 9, with each number corresponding to one of the employee_count classifications (1 the lowest, 9 the highest). The exact output can be modified (1 could instead be tiny, 2 be small... 9 be huge).&lt;br /&gt;
&lt;br /&gt;
The employee count column is standardized and can easily be edited given some modification of the Excel formula.&lt;br /&gt;
&lt;br /&gt;
===Normalized investment amount===&lt;br /&gt;
&lt;br /&gt;
I've been trying to fix the investment amount, but I think it's smart that we discuss this before I move forward. I've done some tentative standardization (finding the average in a range if given), but so many accelerators &amp;quot;take up to __%&amp;quot; equity and &amp;quot;invest up to $___&amp;quot; that I think it's smartest to think hard about how to standardize first. Also notable is that some accelerators say they provide &amp;quot;$____ up front and another $___ in follow up funding for each stage.&amp;quot; How do we deal with these? Message me if you'd like to talk more about this.&lt;br /&gt;
&lt;br /&gt;
I refrained from creating max and min investment columns to avoid spending time on the data before we discuss it.&lt;br /&gt;
&lt;br /&gt;
===Remaining to do===&lt;br /&gt;
&lt;br /&gt;
*Founders Experience: code job title&lt;br /&gt;
*Founders Education: remove unknowns, code degree and code major&lt;br /&gt;
*Multiple campus and cohorts&lt;br /&gt;
&lt;br /&gt;
==Recent Work==&lt;br /&gt;
&lt;br /&gt;
Here's a project update on the work that has been done since coming to McNair. The most recent file is&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
===Merging Cohort Companies with Crunchbase Info===&lt;br /&gt;
&lt;br /&gt;
More information on this part of the project can be found on the page [[Merging Existing Data with Crunchbase]].&lt;br /&gt;
&lt;br /&gt;
The newest updated sheet of cohort company info is under the '''Cohorts Final''' sheet of '''The File to Rule Them All.xlsx'''.&lt;br /&gt;
&lt;br /&gt;
Working with [[Maxine Tao]], we have matched companies to their respective pages and information found in Crunchbase (via UUID). We ensured single matches by doing a 1-1-1-1 match with our data and with Crunchbase (using the Matcher). We then received additional information on 8092 companies. The following new information (on top of what we already had) is included in the sheet:&lt;br /&gt;
*short_description&lt;br /&gt;
*long_description&lt;br /&gt;
*category_list (details the company's category)&lt;br /&gt;
*category_group_list (a less refined, more all-encompassing category classification)&lt;br /&gt;
*founded_on date&lt;br /&gt;
*employee_count&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*address&lt;br /&gt;
&lt;br /&gt;
And the following information was also pulled from Crunchbase and '''merged''' with our existing data: &lt;br /&gt;
*URL (was merged with courl cells) &lt;br /&gt;
*city (was merged with colocation)&lt;br /&gt;
*state_code (was merged with colocation)&lt;br /&gt;
*country_code (was merged with colocation)&lt;br /&gt;
*status (was merged with costatus)&lt;br /&gt;
&lt;br /&gt;
===The Equity Variables===&lt;br /&gt;
&lt;br /&gt;
[[Maxine Tao]] and I have added six new variables to the '''Accelerators Final''' sheet. Those variables are:&lt;br /&gt;
&lt;br /&gt;
*Terms of joining - terms of joining accelerator and important details about program&lt;br /&gt;
*Equity? (1/0) - cells contain a 1 if the accelerator takes equity, a 0 if the accelerator definitively does not, and are blank if we could not find that information&lt;br /&gt;
*Equity Amount - the % of equity the accelerator will take (can sometimes be a range (e.g. 5-7%))&lt;br /&gt;
*Equity Amount Normalized - this copies the previous column but only keeps %&amp;gt;0, and if a range was given (e.g. 5-7%) it returns the average (e.g. 6%)&lt;br /&gt;
*Investment Amount - the $ the accelerator invests in a company to begin, if relevant (also could be a range or an &amp;quot;up to $######&amp;quot;)&lt;br /&gt;
*Investment Notes - anything to comment on previous 4 columns&lt;br /&gt;
&lt;br /&gt;
These six variables tell us more about the characteristics of accelerators; specifically, which ones take equity and which ones do not, and how much equity accelerators take.&lt;br /&gt;
&lt;br /&gt;
Relevant information: &lt;br /&gt;
*82 accelerators take equity, 42 do not, and we lack information for 37.&lt;br /&gt;
*The average % of equity among accelerators who take equity is 6.49% (rough estimate--do not use for anything official). We got this number by looking only at accelerators who take equity, averaging the equity amount for accelerators who report a range (e.g. 4%-10% equity would be coded as 7% equity), and taking the mean.&lt;br /&gt;
&lt;br /&gt;
===Matching Accelerators to UUIDs via Crunchbase===&lt;br /&gt;
&lt;br /&gt;
We've also added UUIDs for 163 of our 166 accelerators. The UUIDs can be found in Column AE of the '''Accelerators Final''' sheet.&lt;br /&gt;
&lt;br /&gt;
The file with accelerators matched to Crunchbase UUIDs can be found at:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Accelerators and UUIDs.xlsx&lt;br /&gt;
&lt;br /&gt;
This is the master file and should never be modified unless we find a UUID changed. ALL OTHER SHEETS with UUIDs are linked to this sheet so its changes will be reflected elsewhere.&lt;br /&gt;
&lt;br /&gt;
More information can be found on the [[Crunchbase Data]] page. &lt;br /&gt;
&lt;br /&gt;
===Linking Accelerators to Founders/LinkedIn Crawling===&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] got the [[LinkedIn Crawler (Python)]] to work, which means we currently have the following information about accelerator founders:&lt;br /&gt;
*Current Job Title&lt;br /&gt;
*Location&lt;br /&gt;
*Employer&lt;br /&gt;
*Job(s) Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location of jobs&lt;br /&gt;
*Extra Description&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
This information can be found in the various '''Founders''' sheets in '''The File to Rule Them All'''.&lt;br /&gt;
&lt;br /&gt;
==An Overview==&lt;br /&gt;
This project will be used to determine which accelerators are the most effective at churning out successful startups, as well as what characteristics are exhibited by these accelerators. First, we need to gather as much data as we can about as many accelerators as we can in order to look at factors that differentiate successful vs. unsuccessful ventures. Next, we need to create a web crawling program which will gather information about accelerators across the world by accessing their websites and extracting information. I believe that our overall goal with this research project is to gain insight into the methods of successful accelerators, as well as to find out what exactly differentiates very successful accelerators from dead accelerators.&lt;br /&gt;
&lt;br /&gt;
Helpful Links: http://seedrankings.com/&lt;br /&gt;
&lt;br /&gt;
==Remaining To Dos==&lt;br /&gt;
&lt;br /&gt;
The last update on [[Accelerator Seed List (Data)]] said the following needed to be done: &lt;br /&gt;
&lt;br /&gt;
*Cross-reference sheet with data from Peter's old accelerator consolidation file (&amp;quot;accelerator_data_noflag&amp;quot; and &amp;quot;accelerator_data&amp;quot; in &amp;quot;All Relevant Files&amp;quot;) and fill in missing data&lt;br /&gt;
*Variables that are 100% NOT in these 2 files:&lt;br /&gt;
**Cohort Breakout?&lt;br /&gt;
**Subtype&lt;br /&gt;
**Designed for Students?&lt;br /&gt;
**Campuses&lt;br /&gt;
**Stage&lt;br /&gt;
**Software Tech&lt;br /&gt;
**What stage do they look for?&lt;br /&gt;
&lt;br /&gt;
==(Outdated) Necessary Steps==&lt;br /&gt;
&lt;br /&gt;
'''Acquiring the necessary data to complete the Accelerator Master Variable List and the Cohort List will require the following (not necessarily in this order):'''&lt;br /&gt;
&lt;br /&gt;
===Step Zero: Connect to Crunchbase and Link Data===&lt;br /&gt;
&lt;br /&gt;
Complete - more info: [[Crunchbase Data]]&lt;br /&gt;
&lt;br /&gt;
===Step One: LinkedIn Founders Data===&lt;br /&gt;
&lt;br /&gt;
This project will begin by working with [[Grace Tan]] and [[Maxine Tao]] to connect accelerators to their founders and cohort companies using Crunchbase and LinkedIn crawlers. Grace and Maxine will go through [http://crunchbase.com Crunchbase] and find the UUID for companies and their founders (reference [[Crunchbase Data]], [[Crunchbase Accelerator Founders]], [[Crunchbase Accelerator Equity]]). We will then connect them using SQL and feed the names of founders into our LinkedIn crawler (headed by [[Grace Tan]]).&lt;br /&gt;
&lt;br /&gt;
The list of founders for accelerators can be found at&lt;br /&gt;
 McNair/Projects/Accelerators/Fall 2017/founders_linkedin.txt&lt;br /&gt;
&lt;br /&gt;
The '''Unfound Founders''' file codes a 0 for all companies '''''not listed''''' within the LinkedIn Founders file, and a 1 for those that do have founders listed.&lt;br /&gt;
&lt;br /&gt;
'''Given the founders' names, we will then be able to use the [[LinkedIn Crawler (Python)]] to find the relevant details of an accelerator founder (education, work experience, etc.). This data on founders will help us solve the horse, jockey, racetrack question to detect what variables affect a startup's success (the accelerator, the founders, the environment/city).'''&lt;br /&gt;
&lt;br /&gt;
===Step Two: Linking Accelerators to Cohorts Using Investments on Crunchbase===&lt;br /&gt;
&lt;br /&gt;
In this step we focus on accelerators who take equity from the companies that participate in their program. We do this to avoid looking at accelerators who may also run funds/invest in various companies but do not take equity. Including them would give us misleading results and lead us to believe some companies are in cohorts at accelerators when they really are not.&lt;br /&gt;
&lt;br /&gt;
Maxine will acquire the list of accelerators who take equity from companies from the following sheet:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/All Relevant Files/accelerator_data_noflag.txt&lt;br /&gt;
&lt;br /&gt;
Looking at the file, however, shows that very few are actually categorized well and the equity variable is messy. Moving forward, we need to check/refine/fix this classification.&lt;br /&gt;
&lt;br /&gt;
This file has 266 rows. The most recent, actual version of our accelerator database (found at E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx under the sheet Master Variable List) only has 167 rows, meaning the accelerator_data_noflag.txt file has too many rows. &lt;br /&gt;
&lt;br /&gt;
We will need to do a left join of Accelerator Master List with accelerator_data_noflag.txt to get rid of the accelerator names that are in accelerator_data_noflag but NOT in Accelerator Master List.&lt;br /&gt;
&lt;br /&gt;
Once this is finished, we should have an “Equity” classification variable for every accelerator in Accelerator Master List. The accelerators that have a Y (or maybe it’s a 1) are the ones that take equity. These are the accelerators we’ll be able to do the Crunchbase work on to see when accelerators take equity.&lt;br /&gt;
&lt;br /&gt;
We then look at the accelerators' investments (or at companies and the entities which invested in them) and cross-reference the list of companies/accelerators. Once we find a match, we know that a company went through an accelerator and in which year it went through a cohort.&lt;br /&gt;
&lt;br /&gt;
'''From this, we get the following data:'''&lt;br /&gt;
*Accelerator a given company went through&lt;br /&gt;
*Year said company went through a cohort/Specific cohort company went through&lt;br /&gt;
&lt;br /&gt;
===Step Three: Demo Day Crawler===&lt;br /&gt;
&lt;br /&gt;
This part of the project relies on the contributions of the wonderful [[Minh Le]]. Better documentation for the project can be found on the [[Demo Day Page Parser]], [[Demo Day Page Google Classifier]], and [[Accelerator Demo Day]] project pages.&lt;br /&gt;
&lt;br /&gt;
Essentially, this part of the accelerator data project will use the [[Demo Day Page Parser]] to look through accelerator websites for pages which list a cohort's 'Demo Day', or the day on which accelerators present their companies to a group of special investors ([https://www.ycombinator.com/demoday/faq/ here's an example FAQ page from Y Combinator]). The [[Demo Day Page Google Classifier]] will then determine if the page is, in fact, a demo day page.&lt;br /&gt;
&lt;br /&gt;
'''Given a cohort's demo day, we can gather a few pieces of key information (check with Ed to make sure this is the correct information to gather from Demo Days):'''&lt;br /&gt;
*The date a cohort began/the season the cohort went through the accelerator&lt;br /&gt;
**This is acquired by looking at the cohort's demo day date, and subtracting the number of weeks/months of a cohort for that given accelerator. The length of a cohort can be found in the file:&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx&lt;br /&gt;
*I assume we can also acquire the companies in a specific cohort, as in we will have the list of all companies in the (for example) Spring 2018 Cohort of Techstars.&lt;br /&gt;
&lt;br /&gt;
===Step Four: Non-profit Finder===&lt;br /&gt;
&lt;br /&gt;
More at [[Non-profit Finder]]&lt;br /&gt;
&lt;br /&gt;
Another important step in this project is finding which accelerators are non-profits.&lt;br /&gt;
&lt;br /&gt;
A comprehensive list of nonprofits taken from the [https://www.irs.gov/charities-non-profits/tax-exempt-organization-search-bulk-data-downloads IRS] can be found here:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Nonprofits in US.xlsx&lt;br /&gt;
Warning: this file has 1 million rows&lt;br /&gt;
&lt;br /&gt;
This file should be cross-referenced with the list of accelerators to find which ones are listed--those accelerators are non-profits. We'll need someone good at SQL to do this with a match.&lt;br /&gt;
&lt;br /&gt;
Potential problem:&lt;br /&gt;
*The names of listed non-profits will likely be somewhat different than the names of accelerators, because companies often file for tax exemption with different names than they show the public.&lt;br /&gt;
&lt;br /&gt;
==Workflow Image==&lt;br /&gt;
&lt;br /&gt;
(the color coding for this image is a very rough and preliminary estimate subject to change) I imagine the project will look something like this, in that it will require the following information to fully complete the project:&lt;br /&gt;
&lt;br /&gt;
[[File:image1.png]]&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Accelerator_Demo_Day&amp;diff=23781</id>
		<title>Accelerator Demo Day</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Accelerator_Demo_Day&amp;diff=23781"/>
		<updated>2018-07-25T20:12:11Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: /* Hand Collecting Data */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Accelerator Demo Day&lt;br /&gt;
|Has owner=Minh Le,&lt;br /&gt;
|Has start date=06/18/2018&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Does subsume=Demo Day Page Parser, Demo Day Page Google Classifier&lt;br /&gt;
}}&lt;br /&gt;
==Project Introduction==&lt;br /&gt;
This project uses Selenium and machine learning to find candidate web pages and classify whether each one is a demo day page containing a list of cohort companies, ultimately to gather good candidates to push to Mechanical Turk. The code is written in Python 3 using Selenium and Tensorflow (Keras). This article also covers the preliminaries of the Mechanical Turk tool and how it can be used to collect data.&lt;br /&gt;
&lt;br /&gt;
==Project Goal==&lt;br /&gt;
The goal of this project is to find good &amp;quot;Demo Day&amp;quot; candidate web pages and to submit these pages to Amazon Mechanical Turk for data collection. A good candidate is defined as a page containing a list of cohort companies associated with an accelerator. In our observation, good candidates usually also contain time and location information about the demo day, and are thus sufficient to be pushed to MTurk to collect data.&lt;br /&gt;
&lt;br /&gt;
==Code Location==&lt;br /&gt;
The source code and relevant files for the project can be found here: &lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\&lt;br /&gt;
The current working model using RF is in:&lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\Test Run&lt;br /&gt;
The RNN model is in:&lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\Experiment&lt;br /&gt;
The RNN is still under heavy development. Modifying anything in this folder is not recommended.&lt;br /&gt;
&lt;br /&gt;
All the other folders are used for experiments; please don't touch them. If you want to understand more about the files as a general user, go to the section A Quick Glance through the File in The Directory below. If you are a developer, go to the Advance User Guide section.&lt;br /&gt;
&lt;br /&gt;
==General User Guide: How to Use this Project (Random Forest model)==&lt;br /&gt;
&lt;br /&gt;
First, change your directory to the working folder:&lt;br /&gt;
 cd E:\McNair\Projects\Accelerator Demo Day\Test Run&lt;br /&gt;
Then you need to specify the list of accelerators you want to crawl by modifying the following file:&lt;br /&gt;
 ListOfAccsToCrawl.txt&lt;br /&gt;
The first line must remain fixed as &amp;quot;Accelerator&amp;quot;. Each following row is an accelerator's name. Names are not case sensitive, but it is preferable to keep the original capitalization where possible.&lt;br /&gt;
&lt;br /&gt;
All necessary preparations are now complete. Now onto running the code!&lt;br /&gt;
&lt;br /&gt;
Running the project is as simple as executing the code in the correct order. The files are named in the format &amp;quot;STEPX_name&amp;quot;, where X is the order of execution. To be more specific, run the following 4 commands:&lt;br /&gt;
 ''# Crawl Google to get the data for the demo day pages for the accelerators listed in ListOfAccsToCrawl.txt''&lt;br /&gt;
 python3 STEP1_crawl.py&lt;br /&gt;
 ''# Preprocess data using a bag-of-words approach: each page is characterized by the frequencies of chosen keywords. Chosen keywords are stored in words.txt. This script creates a file called feature_matrix.txt''&lt;br /&gt;
 python3 STEP2_preprocessing_feature_matrix_generator.py&lt;br /&gt;
 ''# Train the RF model''&lt;br /&gt;
 python3 STEP3_train_rf.py&lt;br /&gt;
 ''# Run the model to classify the crawled HTML pages.''&lt;br /&gt;
 python3 STEP4_classify_rf.py&lt;br /&gt;
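The four commands above can also be chained from one small Python driver. This is a minimal sketch and not part of the project: run_pipeline and STEPS are hypothetical names, and it assumes it is launched from the Test Run folder so that the script names resolve.&lt;br /&gt;

```python
import subprocess
import sys

# The four step scripts, in execution order.
STEPS = [
    "STEP1_crawl.py",
    "STEP2_preprocessing_feature_matrix_generator.py",
    "STEP3_train_rf.py",
    "STEP4_classify_rf.py",
]

def run_pipeline(runner=subprocess.run, python=sys.executable):
    """Run each step in order; check=True raises on the first failing step."""
    for script in STEPS:
        print("Running", script)
        runner([python, script], check=True)
```

Calling run_pipeline() executes the steps with the current Python interpreter and stops as soon as one step fails, so a broken crawl never feeds a stale feature matrix into the classifier.&lt;br /&gt;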
&lt;br /&gt;
The results are stored in the CrawledHTMLFull folder and are sorted into two folders: positive and negative. The positive folder contains the HTML pages that the classifier judged to be a &amp;quot;good candidate.&amp;quot; The negative folder contains the rest. There is also a txt file called prediction.txt that lists everything. feature.txt is irrelevant for the general user; please ignore it. Its sole purpose is analysis and debugging.&lt;br /&gt;
&lt;br /&gt;
NEVER touch the TrainingHTML folder, datareader.py or the classifier.txt. These are used internally to train the model.&lt;br /&gt;
&lt;br /&gt;
==A Quick Glance through the File in The Directory==&lt;br /&gt;
All working files are stored in this folder:&lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\Test Run&lt;br /&gt;
The file   &lt;br /&gt;
&lt;br /&gt;
==Amazon Mechanical Turk==&lt;br /&gt;
There's a file in the folder &lt;br /&gt;
 CrawledHTMLFull&lt;br /&gt;
called&lt;br /&gt;
 FinalResultWithURL&lt;br /&gt;
that was manually created by combining the file&lt;br /&gt;
 crawled_demoday_page_list.txt&lt;br /&gt;
in the mother folder and the file &lt;br /&gt;
 predicted.txt&lt;br /&gt;
This file matches the predictions to the actual URLs of the websites.&lt;br /&gt;
&lt;br /&gt;
Since MTurk makes it hard for us to display the downloaded HTML, it is much faster to just copy the URL into the question box.&lt;br /&gt;
&lt;br /&gt;
The advantage of this is that some websites, such as techcrunch.com, behave abnormally when downloaded as HTML, so opening them in the browser is actually better because the UI is not broken. Moreover, if a website has a paywall or pop-up ads, the user can click out of them. Since paywalls and pop-ups are usually scripts within the HTML, the classifier can't rule these pages out, because the body of the HTML may still contain the information we are looking for. Major paywalled websites and websites that require log-ins, such as linkedin, have been blacklisted in the crawler. More detail is in the crawler section below.&lt;br /&gt;
&lt;br /&gt;
However, there is a disadvantage to this: websites are ever changing, so the URL may stop working or point to something else in the future. Downloaded HTML, on the other hand, stays the same because it does not require an internet connection to render, so its content is static.&lt;br /&gt;
&lt;br /&gt;
To create the MTurk task for this project, follow the tutorial in [[Mechanical Turk (Tool)]]. For testing and development purposes, use https://requestersandbox.mturk.com/&lt;br /&gt;
&lt;br /&gt;
Test account:&lt;br /&gt;
email: mcboatfaceboaty670@gmail.com&lt;br /&gt;
password: sameastheoneforemail2018&lt;br /&gt;
&lt;br /&gt;
For this project, the fields asked of the user are:&lt;br /&gt;
&lt;br /&gt;
*Whether the page had a list of companies going through an accelerator&lt;br /&gt;
*The month and year of the demo day (or article)&lt;br /&gt;
*Accelerator name&lt;br /&gt;
*Companies going through accelerator&lt;br /&gt;
&lt;br /&gt;
Layout:&lt;br /&gt;
&lt;br /&gt;
[[File:Screen Shot 2018-07-25 at 11.37.02 AM.png]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Hand Collecting Data==&lt;br /&gt;
&lt;br /&gt;
For the crawl, we only looked for accelerators whose companies did not receive venture capital (according to data Ed found via VentureXpert) and lacked timing info. The purpose of this crawl is to find timing info where we cannot find it otherwise; if a company received VC, we can find timing info via that investment. The file we used to find instances that lack both timing info and VC is:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Merged W Crunchbase Data as of July 17.xlsx&lt;br /&gt;
&lt;br /&gt;
We filtered this sheet in Excel (and checked our work by filtering in SQL) and found 809 companies that lacked timing info and didn't receive VC. From this, we found 74 accelerators which we needed to crawl for.&lt;br /&gt;
&lt;br /&gt;
We used the crawler to search for cohort companies listed for these accelerators.&lt;br /&gt;
&lt;br /&gt;
During the initial test run, the number of good pages was 359. The data is then processed by hand by fellow interns.&lt;br /&gt;
&lt;br /&gt;
The file for hand-coding is in:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerator Demo Day/Test Run/CrawledDemoDayHTMLFull/'''FinalResultWithURL'''&lt;br /&gt;
&lt;br /&gt;
For the sake of collaboration, the team copied this information to a Google Sheet, accessible here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
We split the process into four parts. Each intern will do the following:&lt;br /&gt;
&lt;br /&gt;
1. Go to the given URL.&lt;br /&gt;
&lt;br /&gt;
2. Record whether the page is good data (column F); this can later be used by [[Minh Le]] to refine/fine-tune training data.&lt;br /&gt;
&lt;br /&gt;
3. Record whether the page is announcing a cohort or recapping/explaining a demo day (column G). This variable will be used to decide if we should subtract weeks from the given date (e.g. if it is recapping a demo day, the cohort went through the accelerator for the past ~12 weeks, and we should subtract weeks as such).&lt;br /&gt;
&lt;br /&gt;
4. Record date, month, year, and the companies listed for that given accelerator.&lt;br /&gt;
&lt;br /&gt;
5. Note any additional information, such as a cohort's special name.&lt;br /&gt;
&lt;br /&gt;
Once this process is finished, we will keep only the rows with a 1 in Column F, and [[Connor Rothschild]] and [[Maxine Tao]] will work to populate empty cells in The File to Rule Them All with that data.&lt;br /&gt;
&lt;br /&gt;
==Advance User Guide: An in-depth look into the project and the various settings==&lt;br /&gt;
&lt;br /&gt;
===Accelerators needed to Crawl===&lt;br /&gt;
The list of names of accelerators to crawl is stored in the file:&lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\Test Run\ListOfAccsToCrawl.txt&lt;br /&gt;
&lt;br /&gt;
===Training Data===&lt;br /&gt;
Training data is stored in the folder:&lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\Test Run\TrainingHTML&lt;br /&gt;
&lt;br /&gt;
===The Crawler Functionality===&lt;br /&gt;
The crawler functionality is stored in the file:&lt;br /&gt;
 STEP1_crawl.py&lt;br /&gt;
The crawler was optimized for improved speed, improved performance and improved filtration while remaining functional over the large set of data.&lt;br /&gt;
&lt;br /&gt;
BUG REPORT by Maxine Tao (FIXED): update the crawler with this line of code:&lt;br /&gt;
 search_results = driver.find_elements_by_xpath(&amp;quot;//div[@class='g']/div/div/div/h3/a&amp;quot;) + driver.find_elements_by_xpath(&amp;quot;//div[@class='g']/div/div/h3/a&amp;quot;)&lt;br /&gt;
This was needed because the crawler stopped grabbing the first web page, probably because Google modified the layout of its results page.&lt;br /&gt;
 &lt;br /&gt;
===The Classifier===&lt;br /&gt;
&lt;br /&gt;
===Input (Features)===&lt;br /&gt;
The input (features) right now is the frequency of X_NUMBER words appearing in each document. The words are hand selected. This is the naive bag-of-words approach. &lt;br /&gt;
&lt;br /&gt;
Idea: Create a matrix with the first column being the file BiBTex, and the following columns being the words; the value at (file, word) is the frequency of that word in the file.&lt;br /&gt;
Then, split the matrix into an array of row vectors, and feed each vector into the RNN.&lt;br /&gt;
&lt;br /&gt;
This does not seem to give very high accuracy with our LSTM RNN, so I will consider a word2vec approach.&lt;br /&gt;
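The bag-of-words matrix described above can be sketched with the standard library alone. This is an illustration under assumptions, not the code in STEP2: the function names, the keyword list, and the sample pages are hypothetical, and the real keywords live in words.txt.&lt;br /&gt;

```python
import re
from collections import Counter

def feature_row(text, keywords):
    """Count how often each chosen keyword appears in one page's text."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    return [counts[k] for k in keywords]

def feature_matrix(pages, keywords):
    """pages maps a file name to its text; returns one (name, counts) row per file."""
    return [(name, feature_row(text, keywords)) for name, text in sorted(pages.items())]

keywords = ["demo", "cohort", "startups"]  # hypothetical; the real list is in words.txt
pages = {  # made-up page text, for illustration only
    "a.html": "Demo Day: meet the cohort. Demo starts at 5.",
    "b.html": "Pricing page",
}
for name, row in feature_matrix(pages, keywords):
    print(name, row)
```

Each row is the vector for one file, so the rows can be handed to a classifier one at a time. Word order is discarded, which is one plausible reason a sequence model such as an LSTM gains little from this representation.&lt;br /&gt;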
&lt;br /&gt;
==Development Notes==&lt;br /&gt;
Right now I am working on two different classifiers: Kyran's old Random Forest model, which I am optimizing by tweaking parameters and trying different combinations of features, and my RNN text classifier.&lt;br /&gt;
&lt;br /&gt;
The RF model has a ~92% accuracy on the training data and ~70% accuracy on the test data.&lt;br /&gt;
&lt;br /&gt;
The RNN currently has ~50% accuracy on both train and test data, which is rather concerning. &lt;br /&gt;
&lt;br /&gt;
Test : train ratio is 1:3 (25/75)&lt;br /&gt;
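A 25/75 split like the one above can be reproduced with a seeded shuffle. This is a generic sketch, not necessarily how the project's scripts split the data; split_train_test and its parameters are hypothetical names.&lt;br /&gt;

```python
import random

def split_train_test(items, test_fraction=0.25, seed=0):
    """Shuffle with a fixed seed, then hold out test_fraction of the items."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = split_train_test(range(100))
print(len(train), len(test))
```

Fixing the seed keeps the split identical between runs, so accuracy numbers for the RF and RNN models are measured on the same held-out pages.&lt;br /&gt;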
&lt;br /&gt;
Both models currently use the bag-of-words approach to preprocess data, but I will try to use Yang's code in the industry classifier to preprocess using word2vec. I'm not familiar with this approach, but I will try to learn it.&lt;br /&gt;
&lt;br /&gt;
==Reading resources==&lt;br /&gt;
http://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Accelerator_Demo_Day&amp;diff=23779</id>
		<title>Accelerator Demo Day</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Accelerator_Demo_Day&amp;diff=23779"/>
		<updated>2018-07-25T20:04:03Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: /* Hand Collecting Data */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Accelerator Demo Day&lt;br /&gt;
|Has owner=Minh Le,&lt;br /&gt;
|Has start date=06/18/2018&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Does subsume=Demo Day Page Parser, Demo Day Page Google Classifier&lt;br /&gt;
}}&lt;br /&gt;
==Project Introduction==&lt;br /&gt;
This project uses Selenium and machine learning to find candidate web pages and classify whether each one is a demo day page containing a list of cohort companies, ultimately to gather good candidates to push to Mechanical Turk. The code is written in Python 3 using Selenium and Tensorflow (Keras). This article also covers the preliminaries of the Mechanical Turk tool and how it can be used to collect data.&lt;br /&gt;
&lt;br /&gt;
==Project Goal==&lt;br /&gt;
The goal of this project is to find good &amp;quot;Demo Day&amp;quot; candidate web pages and to submit these pages to Amazon Mechanical Turk for data collection. A good candidate is defined as a page containing a list of cohort companies associated with an accelerator. In our observation, good candidates usually also contain time and location information about the demo day, and are thus sufficient to be pushed to MTurk to collect data.&lt;br /&gt;
&lt;br /&gt;
==Code Location==&lt;br /&gt;
The source code and relevant files for the project can be found here: &lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\&lt;br /&gt;
The current working model using RF is in:&lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\Test Run&lt;br /&gt;
The RNN model is in:&lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\Experiment&lt;br /&gt;
The RNN is still under heavy development. Modifying anything in this folder is not recommended.&lt;br /&gt;
&lt;br /&gt;
All the other folders are used for experiments; please don't touch them. If you want to understand more about the files as a general user, go to the section A Quick Glance through the File in The Directory below. If you are a developer, go to the Advance User Guide section.&lt;br /&gt;
&lt;br /&gt;
==General User Guide: How to Use this Project (Random Forest model)==&lt;br /&gt;
&lt;br /&gt;
First, change your directory to the working folder:&lt;br /&gt;
 cd E:\McNair\Projects\Accelerator Demo Day\Test Run&lt;br /&gt;
Then you need to specify the list of accelerators you want to crawl by modifying the following file:&lt;br /&gt;
 ListOfAccsToCrawl.txt&lt;br /&gt;
The first line must remain fixed as &amp;quot;Accelerator&amp;quot;. Each following row is an accelerator's name. Names are not case sensitive, but it is preferable to keep the original capitalization where possible.&lt;br /&gt;
&lt;br /&gt;
All necessary preparations are now complete. Now onto running the code!&lt;br /&gt;
&lt;br /&gt;
Running the project is as simple as executing the code in the correct order. The files are named in the format &amp;quot;STEPX_name&amp;quot;, where X is the order of execution. To be more specific, run the following 4 commands:&lt;br /&gt;
 ''# Crawl Google to get the data for the demo day pages for the accelerators listed in ListOfAccsToCrawl.txt''&lt;br /&gt;
 python3 STEP1_crawl.py&lt;br /&gt;
 ''# Preprocess data using a bag-of-words approach: each page is characterized by the frequencies of chosen keywords. Chosen keywords are stored in words.txt. This script creates a file called feature_matrix.txt''&lt;br /&gt;
 python3 STEP2_preprocessing_feature_matrix_generator.py&lt;br /&gt;
 ''# Train the RF model''&lt;br /&gt;
 python3 STEP3_train_rf.py&lt;br /&gt;
 ''# Run the model to classify the crawled HTML pages.''&lt;br /&gt;
 python3 STEP4_classify_rf.py&lt;br /&gt;
&lt;br /&gt;
The results are stored in the CrawledHTMLFull folder and are sorted into two folders: positive and negative. The positive folder contains the HTML pages that the classifier judged to be a &amp;quot;good candidate.&amp;quot; The negative folder contains the rest. There is also a txt file called prediction.txt that lists everything. feature.txt is irrelevant for the general user; please ignore it. Its sole purpose is analysis and debugging.&lt;br /&gt;
&lt;br /&gt;
NEVER touch the TrainingHTML folder, datareader.py or the classifier.txt. These are used internally to train the model.&lt;br /&gt;
&lt;br /&gt;
==A Quick Glance through the File in The Directory==&lt;br /&gt;
All working files are stored in this folder:&lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\Test Run&lt;br /&gt;
The file   &lt;br /&gt;
&lt;br /&gt;
==Amazon Mechanical Turk==&lt;br /&gt;
There's a file in the folder &lt;br /&gt;
 CrawledHTMLFull&lt;br /&gt;
called&lt;br /&gt;
 FinalResultWithURL&lt;br /&gt;
that was manually created by combining the file&lt;br /&gt;
 crawled_demoday_page_list.txt&lt;br /&gt;
in the mother folder and the file &lt;br /&gt;
 predicted.txt&lt;br /&gt;
This file matches the predictions to the actual URLs of the websites.&lt;br /&gt;
&lt;br /&gt;
Since MTurk makes it hard for us to display the downloaded HTML, it is much faster to just copy the URL into the question box.&lt;br /&gt;
&lt;br /&gt;
The advantage of this is that some websites, such as techcrunch.com, behave abnormally when downloaded as HTML, so opening them in the browser is actually better because the UI is not broken. Moreover, if a website has a paywall or pop-up ads, the user can click out of them. Since paywalls and pop-ups are usually scripts within the HTML, the classifier can't rule these pages out, because the body of the HTML may still contain the information we are looking for. Major paywalled websites and websites that require log-ins, such as linkedin, have been blacklisted in the crawler. More detail is in the crawler section below.&lt;br /&gt;
&lt;br /&gt;
However, there is a disadvantage to this: websites are ever changing, so the URL may stop working or point to something else in the future. Downloaded HTML, on the other hand, stays the same because it does not require an internet connection to render, so its content is static.&lt;br /&gt;
&lt;br /&gt;
To create the MTurk task for this project, follow the tutorial in [[Mechanical Turk (Tool)]]. For testing and development purposes, use https://requestersandbox.mturk.com/&lt;br /&gt;
&lt;br /&gt;
Test account:&lt;br /&gt;
email: mcboatfaceboaty670@gmail.com&lt;br /&gt;
password: sameastheoneforemail2018&lt;br /&gt;
&lt;br /&gt;
For this project, the fields asked of the user are:&lt;br /&gt;
&lt;br /&gt;
*Whether the page had a list of companies going through an accelerator&lt;br /&gt;
*The month and year of the demo day (or article)&lt;br /&gt;
*Accelerator name&lt;br /&gt;
*Companies going through accelerator&lt;br /&gt;
&lt;br /&gt;
Layout:&lt;br /&gt;
&lt;br /&gt;
[[File:Screen Shot 2018-07-25 at 11.37.02 AM.png]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Hand Collecting Data==&lt;br /&gt;
&lt;br /&gt;
During the initial test run, the number of good pages was 359. The data is then processed by hand by fellow interns.&lt;br /&gt;
&lt;br /&gt;
The file for hand-coding is in:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerator Demo Day/Test Run/CrawledDemoDayHTMLFull/'''FinalResultWithURL'''&lt;br /&gt;
&lt;br /&gt;
For the sake of collaboration, the team copied this information to a Google Sheet, accessible here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
We split the process into four parts. Each intern will do the following:&lt;br /&gt;
&lt;br /&gt;
1. Go to the given URL.&lt;br /&gt;
2. Record whether the page is good data (column F); this can later be used by [[Minh Le]] to refine/fine-tune training data.&lt;br /&gt;
3. Record whether the page is announcing a cohort or recapping/explaining a demo day (column G). This variable will be used to decide if we should subtract weeks from the given date (e.g. if it is recapping a demo day, the cohort went through the accelerator for the past ~12 weeks, and we should subtract weeks as such).&lt;br /&gt;
4. Record date, month, year, and the companies listed for that given accelerator.&lt;br /&gt;
5. Note any additional information, such as a cohort's special name.&lt;br /&gt;
&lt;br /&gt;
Once this process is finished, we will keep only the rows with a 1 in Column F, and [[Connor Rothschild]] and [[Maxine Tao]] will work to populate empty cells in The File to Rule Them All with that data.&lt;br /&gt;
&lt;br /&gt;
==Advance User Guide: An in-depth look into the project and the various settings==&lt;br /&gt;
&lt;br /&gt;
===Accelerators needed to Crawl===&lt;br /&gt;
The list of names of accelerators to crawl is stored in the file:&lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\Test Run\ListOfAccsToCrawl.txt&lt;br /&gt;
&lt;br /&gt;
===Training Data===&lt;br /&gt;
Training data is stored in the folder:&lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\Test Run\TrainingHTML&lt;br /&gt;
&lt;br /&gt;
===The Crawler Functionality===&lt;br /&gt;
The crawler functionality is stored in the file:&lt;br /&gt;
 STEP1_crawl.py&lt;br /&gt;
The crawler was optimized for improved speed, improved performance and improved filtration while remaining functional over the large set of data.&lt;br /&gt;
&lt;br /&gt;
BUG REPORT by Maxine Tao (FIXED): update the crawler with this line of code:&lt;br /&gt;
 search_results = driver.find_elements_by_xpath(&amp;quot;//div[@class='g']/div/div/div/h3/a&amp;quot;) + driver.find_elements_by_xpath(&amp;quot;//div[@class='g']/div/div/h3/a&amp;quot;)&lt;br /&gt;
This was needed because the crawler stopped grabbing the first web page, probably because Google modified the layout of its results page.&lt;br /&gt;
 &lt;br /&gt;
===The Classifier===&lt;br /&gt;
&lt;br /&gt;
===Input (Features)===&lt;br /&gt;
The input (features) right now is the frequency of X_NUMBER words appearing in each document. The words are hand selected. This is the naive bag-of-words approach. &lt;br /&gt;
&lt;br /&gt;
Idea: Create a matrix with the first column being the file BiBTex, and the following columns being the words; the value at (file, word) is the frequency of that word in the file.&lt;br /&gt;
Then, split the matrix into an array of row vectors, and feed each vector into the RNN.&lt;br /&gt;
&lt;br /&gt;
This does not seem to give very high accuracy with our LSTM RNN, so I will consider a word2vec approach.&lt;br /&gt;
&lt;br /&gt;
==Development Notes==&lt;br /&gt;
Right now I am working on two different classifiers: Kyran's old Random Forest model, which I am optimizing by tweaking parameters and trying different combinations of features, and my RNN text classifier.&lt;br /&gt;
&lt;br /&gt;
The RF model has a ~92% accuracy on the training data and ~70% accuracy on the test data.&lt;br /&gt;
&lt;br /&gt;
The RNN currently has ~50% accuracy on both train and test data, which is rather concerning. &lt;br /&gt;
&lt;br /&gt;
Test : train ratio is 1:3 (25/75)&lt;br /&gt;
&lt;br /&gt;
Both models currently use the bag-of-words approach to preprocess data, but I will try to use Yang's code in the industry classifier to preprocess using word2vec. I'm not familiar with this approach, but I will try to learn it.&lt;br /&gt;
&lt;br /&gt;
==Reading resources==&lt;br /&gt;
http://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=File:Screen_Shot_2018-07-25_at_11.37.02_AM.png&amp;diff=23764</id>
		<title>File:Screen Shot 2018-07-25 at 11.37.02 AM.png</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=File:Screen_Shot_2018-07-25_at_11.37.02_AM.png&amp;diff=23764"/>
		<updated>2018-07-25T16:38:45Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Accelerator_Demo_Day&amp;diff=23763</id>
		<title>Accelerator Demo Day</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Accelerator_Demo_Day&amp;diff=23763"/>
		<updated>2018-07-25T16:38:15Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: /* Amazon Mechanical Turk */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Accelerator Demo Day&lt;br /&gt;
|Has owner=Minh Le,&lt;br /&gt;
|Has start date=06/18/2018&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Does subsume=Demo Day Page Parser, Demo Day Page Google Classifier&lt;br /&gt;
}}&lt;br /&gt;
==Project Introduction==&lt;br /&gt;
This project uses Selenium and machine learning to find candidate web pages and classify whether each one is a demo day page containing a list of cohort companies, ultimately to gather good candidates to push to Mechanical Turk. The code is written in Python 3 using Selenium and Tensorflow (Keras).&lt;br /&gt;
&lt;br /&gt;
==Project Goal==&lt;br /&gt;
The goal of this project is to find good &amp;quot;Demo Day&amp;quot; candidate web pages and to submit these pages to Amazon Mechanical Turk for data collection. A good candidate is defined as a page containing a list of cohort companies associated with an accelerator. In our observation, good candidates usually also contain time and location information about the demo day, and are thus sufficient to be pushed to MTurk to collect data.&lt;br /&gt;
&lt;br /&gt;
==Code Location==&lt;br /&gt;
The source code and relevant files for the project can be found here: &lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\&lt;br /&gt;
The current working model using RF is in:&lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\Test Run&lt;br /&gt;
The RNN model is in:&lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\Experiment&lt;br /&gt;
The RNN is still under heavy development. Modifying anything in this folder is not recommended.&lt;br /&gt;
&lt;br /&gt;
All the other folders are used for experiments; please don't touch them.&lt;br /&gt;
&lt;br /&gt;
==General User Guide: How to Use this Project (Random Forest model)==&lt;br /&gt;
&lt;br /&gt;
First, change your directory to the working folder:&lt;br /&gt;
 cd E:\McNair\Projects\Accelerator Demo Day\Test Run&lt;br /&gt;
Then you need to specify the list of accelerators you want to crawl by modifying the following file:&lt;br /&gt;
 ListOfAccsToCrawl.txt&lt;br /&gt;
The first line must remain fixed as &amp;quot;Accelerator&amp;quot;. Each following row is an accelerator's name. Names are not case sensitive, but it is preferable to keep the original capitalization where possible.&lt;br /&gt;
&lt;br /&gt;
All necessary preparations are now complete. Now onto running the code!&lt;br /&gt;
&lt;br /&gt;
Running the project is as simple as executing the code in the correct order. The files are named in the format &amp;quot;STEPX_name&amp;quot;, where X is the order of execution. To be more specific, run the following 4 commands:&lt;br /&gt;
 ''# Crawl Google to get the data for the demo day pages for the accelerators listed in ListOfAccsToCrawl.txt''&lt;br /&gt;
 python3 STEP1_crawl.py&lt;br /&gt;
 ''# Preprocess data using a bag-of-words approach: each page is characterized by the frequencies of chosen keywords. Chosen keywords are stored in words.txt. This script creates a file called feature_matrix.txt''&lt;br /&gt;
 python3 STEP2_preprocessing_feature_matrix_generator.py&lt;br /&gt;
 ''# Train the RF model''&lt;br /&gt;
 python3 STEP3_train_rf.py&lt;br /&gt;
 ''# Run the model to classify the crawled HTML pages.''&lt;br /&gt;
 python3 STEP4_classify_rf.py&lt;br /&gt;
&lt;br /&gt;
The results are stored in the CrawledHTMLFull folder and are sorted into two folders: positive and negative. The positive folder contains the HTML pages that the classifier judged to be a &amp;quot;good candidate.&amp;quot; The negative folder contains the rest. There is also a txt file called prediction.txt that lists everything. feature.txt is irrelevant for the general user; please ignore it. Its sole purpose is analysis and debugging.&lt;br /&gt;
&lt;br /&gt;
NEVER touch the TrainingHTML folder, datareader.py or the classifier.txt. These are used internally to train the model.&lt;br /&gt;
&lt;br /&gt;
==Amazon Mechanical Turk==&lt;br /&gt;
There's a file in the folder &lt;br /&gt;
 CrawledHTMLFull&lt;br /&gt;
called&lt;br /&gt;
 FinalResultWithURL&lt;br /&gt;
that was manually created by combining the file&lt;br /&gt;
 crawled_demoday_page_list.txt&lt;br /&gt;
in the mother folder and the file &lt;br /&gt;
 predicted.txt&lt;br /&gt;
This file matches the predictions to the actual URLs of the websites.&lt;br /&gt;
&lt;br /&gt;
Since MTurk makes it hard for us to display the downloaded HTML, it is much faster to just copy the URL into the question box.&lt;br /&gt;
&lt;br /&gt;
The advantage of this is that some websites, such as techcrunch.com, behave abnormally when downloaded as HTML, so opening them in the browser is actually better because the UI is not broken. Moreover, if a website has a paywall or pop-up ads, the user can click out of them. Since paywalls and pop-ups are usually scripts within the HTML, the classifier can't rule these pages out, because the body of the HTML may still contain the information we are looking for. Major paywalled websites and websites that require log-ins, such as linkedin, have been blacklisted in the crawler. More detail is in the crawler section below.&lt;br /&gt;
&lt;br /&gt;
However, there is a disadvantage to this: websites are ever changing, so the URL may stop working or point to something else in the future. Downloaded HTML, on the other hand, stays the same because it does not require an internet connection to render, so its content is static.&lt;br /&gt;
&lt;br /&gt;
To create the MTurk task for this project, follow the tutorial in [[Mechanical Turk (Tool)]]. For testing and development purposes, use https://requestersandbox.mturk.com/&lt;br /&gt;
&lt;br /&gt;
Test account:&lt;br /&gt;
email: mcboatfaceboaty670@gmail.com&lt;br /&gt;
password: sameastheoneforemail2018&lt;br /&gt;
&lt;br /&gt;
For this project, the fields asked of the user are:&lt;br /&gt;
&lt;br /&gt;
*Whether the page had a list of companies going through an accelerator&lt;br /&gt;
*The month and year of the demo day (or article)&lt;br /&gt;
*Accelerator name&lt;br /&gt;
*Companies going through accelerator&lt;br /&gt;
&lt;br /&gt;
Layout:&lt;br /&gt;
&lt;br /&gt;
[[File:Screen Shot 2018-07-25 at 11.37.02 AM.png]]&lt;br /&gt;
&lt;br /&gt;
==Advanced User Guide: An in-depth look into the project and the various settings==&lt;br /&gt;
&lt;br /&gt;
==The Crawler Functionality==&lt;br /&gt;
The crawler functionality is stored in the file:&lt;br /&gt;
 STEP1_crawl.py&lt;br /&gt;
The crawler was optimized for speed, performance, and filtration while remaining functional over a large data set.&lt;br /&gt;
&lt;br /&gt;
BUG REPORT by Maxine Tao (FIXED): update the crawler with this line of code:&lt;br /&gt;
 search_results = driver.find_elements_by_xpath(&amp;quot;//div[@class='g']/div/div/div/h3/a&amp;quot;) + driver.find_elements_by_xpath(&amp;quot;//div[@class='g']/div/div/h3/a&amp;quot;)&lt;br /&gt;
This was needed because the crawler stopped grabbing the first web page (I think Google may have modified how its results page looks).&lt;br /&gt;
 &lt;br /&gt;
==The Classifier==&lt;br /&gt;
&lt;br /&gt;
===Input (Features)===&lt;br /&gt;
The input (features) right now is the frequency of X_NUMBER words appearing in each document. The words are hand-selected. This is the naive bag-of-words approach.&lt;br /&gt;
&lt;br /&gt;
Idea: Create a matrix whose first column is the file BiBTex and whose remaining columns are the words; the value at (file, word) is the frequency of that word in the file.&lt;br /&gt;
Then split the matrix into an array of row vectors and feed each vector into the RNN.&lt;br /&gt;
&lt;br /&gt;
This does not seem to give high accuracy with our LSTM RNN, so I will consider a word2vec approach.&lt;br /&gt;
&lt;br /&gt;
==Development Notes==&lt;br /&gt;
Right now I am working on two different classifiers: Kyran's old Random Forest model, which I am optimizing by tweaking parameters and trying different combinations of features, and my RNN text classifier.&lt;br /&gt;
&lt;br /&gt;
The RF model has a ~92% accuracy on the training data and ~70% accuracy on the test data.&lt;br /&gt;
&lt;br /&gt;
The RNN currently has ~50% accuracy on both train and test data, which is rather concerning.&lt;br /&gt;
&lt;br /&gt;
Test : train ratio is 1:3 (25/75)&lt;br /&gt;
&lt;br /&gt;
Both models currently use the bag-of-words approach to preprocess data, but I will try to use Yang's code in the industry classifier to preprocess using word2vec. I'm not familiar with this approach, but I will try to learn it.&lt;br /&gt;
&lt;br /&gt;
==Reading resources==&lt;br /&gt;
http://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=File:Mturkdesign1.png&amp;diff=23762</id>
		<title>File:Mturkdesign1.png</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=File:Mturkdesign1.png&amp;diff=23762"/>
		<updated>2018-07-25T16:37:25Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Accelerator_Demo_Day&amp;diff=23761</id>
		<title>Accelerator Demo Day</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Accelerator_Demo_Day&amp;diff=23761"/>
		<updated>2018-07-25T16:36:07Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: /* Amazon Mechanical Turk */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Accelerator Demo Day&lt;br /&gt;
|Has owner=Minh Le,&lt;br /&gt;
|Has start date=06/18/2018&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Does subsume=Demo Day Page Parser, Demo Day Page Google Classifier&lt;br /&gt;
}}&lt;br /&gt;
==Project Introduction==&lt;br /&gt;
This project uses Selenium and machine learning to find good candidate web pages and to classify whether a page is a demo day page containing a list of cohort companies, ultimately to gather good candidates to push to Mechanical Turk. The code is written in Python 3 using Selenium and TensorFlow (Keras).&lt;br /&gt;
&lt;br /&gt;
==Project Goal==&lt;br /&gt;
The goal of this project is to find good &amp;quot;Demo Day&amp;quot; candidate web pages and to submit these pages to Amazon Mechanical Turk for data collection. A good candidate is defined as a page containing a list of cohort companies associated with an accelerator. Through observation, good candidates usually contain time and location information about the demo day as well, and are thus sufficient to be pushed to MTurk to collect data.&lt;br /&gt;
&lt;br /&gt;
==Code Location==&lt;br /&gt;
The source code and relevant files for the project can be found here: &lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\&lt;br /&gt;
The current working model using RF is in:&lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\Test Run&lt;br /&gt;
The RNN model is in:&lt;br /&gt;
 E:\McNair\Projects\Accelerator Demo Day\Experiment&lt;br /&gt;
The RNN is still under heavy development. Modifying anything in this folder is not recommended.&lt;br /&gt;
&lt;br /&gt;
All other folders are used for experiments; please don't touch them.&lt;br /&gt;
&lt;br /&gt;
==General User Guide: How to Use this Project (Random Forest model)==&lt;br /&gt;
&lt;br /&gt;
First, change your directory to the working folder:&lt;br /&gt;
 cd E:\McNair\Projects\Accelerator Demo Day\Test Run&lt;br /&gt;
Then you need to specify the list of accelerators you want to crawl by modifying the following file:&lt;br /&gt;
 ListOfAccsToCrawl.txt&lt;br /&gt;
The first line must remain fixed as &amp;quot;Accelerator&amp;quot;. The following rows are the accelerators' names. The names are not case sensitive, but it is preferable to keep the original capitalization if possible.&lt;br /&gt;
&lt;br /&gt;
All necessary preparations are now complete. Now onto running the code!&lt;br /&gt;
&lt;br /&gt;
Running the project is as simple as executing the code in the correct order. The files are named in the format &amp;quot;STEPX_name&amp;quot;, where X is the order of execution. To be more specific, run the following 4 commands:&lt;br /&gt;
 ''# Crawl Google to get the data for the demo day pages for the accelerator stored in ListOfAccsToCrawl.txt''&lt;br /&gt;
 python3 STEP1_crawl.py&lt;br /&gt;
 ''# Preprocess data using a bag-of-words approach: each page is characterized by the frequencies of chosen keywords. Chosen keywords are stored in words.txt. This script creates a file called feature_matrix.txt''&lt;br /&gt;
 python3 STEP2_preprocessing_feature_matrix_generator.py&lt;br /&gt;
 ''# Train the RF model''&lt;br /&gt;
 python3 STEP3_train_rf.py&lt;br /&gt;
 ''# Run the model to classify the crawled HTML pages.''&lt;br /&gt;
 python3 STEP4_classify_rf.py&lt;br /&gt;
&lt;br /&gt;
The results are stored in the CrawledHTMLFull folder, sorted into two folders: positive and negative. The positive folder contains the HTML pages that the classifier judged to be &amp;quot;good candidates.&amp;quot; The negative folder contains the rest. There is also a txt file called prediction.txt that lists everything. feature.txt is irrelevant for the general user; please ignore it. Its sole purpose is analysis and debugging.&lt;br /&gt;
&lt;br /&gt;
NEVER touch the TrainingHTML folder, datareader.py, or classifier.txt. These are used internally for training.&lt;br /&gt;
&lt;br /&gt;
==Amazon Mechanical Turk==&lt;br /&gt;
There's a file in the folder &lt;br /&gt;
 CrawledHTMLFull&lt;br /&gt;
called&lt;br /&gt;
 FinalResultWithURL&lt;br /&gt;
that was manually created by combining the file&lt;br /&gt;
 crawled_demoday_page_list.txt&lt;br /&gt;
in the parent folder and the file&lt;br /&gt;
 predicted.txt&lt;br /&gt;
This file pairs each prediction with the actual URL of the website.&lt;br /&gt;
&lt;br /&gt;
Because MTurk makes it hard to display the downloaded HTML, it is much faster to paste the URL into the question box instead.&lt;br /&gt;
&lt;br /&gt;
The advantage of this is that some websites, such as techcrunch.com, behave abnormally when downloaded as HTML, so opening them in the browser is more useful because the UI is not broken. Moreover, if a website has a paywall or pop-up ads, the user can click out of them. Because paywalls and pop-ups are usually scripts within the HTML, the classifier can't rule such pages out: the body of the HTML may still contain the information we are looking for. Major paywalled sites and websites that require log-ins, such as LinkedIn, have been blacklisted in the crawler. More detail is in the crawler section below.&lt;br /&gt;
&lt;br /&gt;
However, there is a disadvantage: websites are ever changing, so in the future a URL may no longer work or may point to something else. Downloaded HTML, on the other hand, stays the same because it does not need an internet connection to render, so its content is static.&lt;br /&gt;
&lt;br /&gt;
To create the MTurk task for this project, follow the tutorial in [[Mechanical Turk (Tool)]]. For testing and development purposes, use https://requestersandbox.mturk.com/&lt;br /&gt;
&lt;br /&gt;
Test account:&lt;br /&gt;
email: mcboatfaceboaty670@gmail.com&lt;br /&gt;
password: sameastheoneforemail2018&lt;br /&gt;
&lt;br /&gt;
For this project, the user is asked for the following fields:&lt;br /&gt;
&lt;br /&gt;
*Whether the page had a list of companies going through an accelerator&lt;br /&gt;
*The month and year of the demo day (or article)&lt;br /&gt;
*Accelerator name&lt;br /&gt;
*Companies going through accelerator&lt;br /&gt;
&lt;br /&gt;
Layout:&lt;br /&gt;
&lt;br /&gt;
[[File:Mturkdesign1.png]]&lt;br /&gt;
&lt;br /&gt;
==Advanced User Guide: An in-depth look into the project and the various settings==&lt;br /&gt;
&lt;br /&gt;
==The Crawler Functionality==&lt;br /&gt;
The crawler functionality is stored in the file:&lt;br /&gt;
 STEP1_crawl.py&lt;br /&gt;
The crawler was optimized for speed, performance, and filtration while remaining functional over a large data set.&lt;br /&gt;
&lt;br /&gt;
BUG REPORT by Maxine Tao (FIXED): update the crawler with this line of code:&lt;br /&gt;
 search_results = driver.find_elements_by_xpath(&amp;quot;//div[@class='g']/div/div/div/h3/a&amp;quot;) + driver.find_elements_by_xpath(&amp;quot;//div[@class='g']/div/div/h3/a&amp;quot;)&lt;br /&gt;
This was needed because the crawler stopped grabbing the first web page (I think Google may have modified how its results page looks).&lt;br /&gt;
 &lt;br /&gt;
==The Classifier==&lt;br /&gt;
&lt;br /&gt;
===Input (Features)===&lt;br /&gt;
The input (features) right now is the frequency of X_NUMBER words appearing in each document. The words are hand-selected. This is the naive bag-of-words approach.&lt;br /&gt;
&lt;br /&gt;
Idea: Create a matrix whose first column is the file BiBTex and whose remaining columns are the words; the value at (file, word) is the frequency of that word in the file.&lt;br /&gt;
Then split the matrix into an array of row vectors and feed each vector into the RNN.&lt;br /&gt;
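&lt;br /&gt;
The matrix idea above can be sketched roughly as follows. This is a minimal illustration and not the code in STEP2_preprocessing_feature_matrix_generator.py: the function names, the keywords, and the sample pages are all hypothetical, and it assumes single-word keywords matched after lowercasing.&lt;br /&gt;
&lt;br /&gt;
```python
import re
from collections import Counter


def keyword_frequencies(text, keywords):
    # Count how often each hand-picked keyword occurs in one document.
    tokens = re.findall(r'[a-z0-9]+', text.lower())
    counts = Counter(tokens)
    return [counts[word] for word in keywords]


def build_feature_matrix(documents, keywords):
    # documents maps a file name to its text.
    # Each row is [file name, count of keyword 1, count of keyword 2, ...].
    return [[name] + keyword_frequencies(text, keywords)
            for name, text in sorted(documents.items())]


# Hypothetical keywords and pages, for illustration only.
keywords = ['demo', 'day', 'cohort']
documents = {
    'a.html': 'Demo Day: meet the cohort. Demo starts at noon.',
    'b.html': 'Contact us for pricing.',
}
matrix = build_feature_matrix(documents, keywords)
```
&lt;br /&gt;
Each row of matrix can then be handed to a classifier as one feature vector after dropping the leading file name.&lt;br /&gt;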
&lt;br /&gt;
This does not seem to give high accuracy with our LSTM RNN, so I will consider a word2vec approach.&lt;br /&gt;
&lt;br /&gt;
==Development Notes==&lt;br /&gt;
Right now I am working on two different classifiers: Kyran's old Random Forest model, which I am optimizing by tweaking parameters and trying different combinations of features, and my RNN text classifier.&lt;br /&gt;
&lt;br /&gt;
The RF model has a ~92% accuracy on the training data and ~70% accuracy on the test data.&lt;br /&gt;
&lt;br /&gt;
The RNN currently has ~50% accuracy on both train and test data, which is rather concerning.&lt;br /&gt;
&lt;br /&gt;
Test : train ratio is 1:3 (25/75)&lt;br /&gt;
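&lt;br /&gt;
For reference, a 25/75 split can be sketched as below. This is only an assumption about how such a split might be done; the function name, the seed, and the page names are hypothetical and are not taken from the project code.&lt;br /&gt;
&lt;br /&gt;
```python
import random


def split_train_test(samples, test_fraction=0.25, seed=0):
    # Shuffle a copy so the caller's list is left untouched.
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test items become the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical page names, for illustration only.
pages = ['page%d.html' % i for i in range(8)]
train, test = split_train_test(pages)
```
&lt;br /&gt;
Fixing the seed keeps the split reproducible, which makes accuracy numbers comparable between runs.&lt;br /&gt;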
&lt;br /&gt;
Both models currently use the bag-of-words approach to preprocess data, but I will try to use Yang's code in the industry classifier to preprocess using word2vec. I'm not familiar with this approach, but I will try to learn it.&lt;br /&gt;
&lt;br /&gt;
==Reading resources==&lt;br /&gt;
http://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23751</id>
		<title>U.S. Seed Accelerators</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23751"/>
		<updated>2018-07-24T20:56:07Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=U.S. Seed Accelerators&lt;br /&gt;
|Has owner=Connor Rothschild,&lt;br /&gt;
|Has start date=06/18/2018&lt;br /&gt;
|Has keywords=accelerators, data&lt;br /&gt;
|Has notes=Continuation of [[Accelerator Data]]&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Is dependent on=Industry Classifier, Demo Day Page Parser, Accelerator Demo Day, Demo Day Page Google Classifier, Crunchbase Accelerator Equity, Crunchbase Accelerator Founders, Crunchbase Data,&lt;br /&gt;
|Has image=Seed-Accelerator.jpg&lt;br /&gt;
|Does subsume=Accelerator Data, Accelerator Seed List (Data),&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Project Location==&lt;br /&gt;
The master file can be found at &lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
==Relevant Former Projects==&lt;br /&gt;
This page serves as an updated and tidied version of the data and work presented on the [[Accelerator Seed List (Data)]] Project, which subsumed [[Accelerator Data]].&lt;br /&gt;
Both of these projects (and as a corollary, this project) are dependent on the [[Demo Day Page Parser]], [[Industry Classifier]], and the [[Whois Parser]].&lt;br /&gt;
&lt;br /&gt;
==Most Recent Work==&lt;br /&gt;
&lt;br /&gt;
Here's a project update on the work that has been done since coming to McNair. The most recent file is&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
===Merging Cohort Companies with Crunchbase Info===&lt;br /&gt;
&lt;br /&gt;
More information on this part of the project can be found on the page [[Merging Existing Data with Crunchbase]].&lt;br /&gt;
&lt;br /&gt;
The newest updated sheet of cohort company info is under the '''Cohorts Final''' sheet of '''The File to Rule Them All.xlsx'''.&lt;br /&gt;
&lt;br /&gt;
Working with [[Maxine Tao]], we have matched companies to their respective pages and information found in Crunchbase (via UUID). We ensured single matches by doing a 1-1-1-1 match with our data and with Crunchbase (using the Matcher). We then received additional information on 8092 companies. The following new information (on top of what we already had) is included in the sheet:&lt;br /&gt;
*short_description&lt;br /&gt;
*long_description&lt;br /&gt;
*category_list (details the company's category)&lt;br /&gt;
*category_group_list (a less refined, more all-encompassing category classification)&lt;br /&gt;
*founded_on date&lt;br /&gt;
*employee_count&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*address&lt;br /&gt;
&lt;br /&gt;
And the following information was also pulled from Crunchbase and '''merged''' with our existing data: &lt;br /&gt;
*URL (was merged with courl cells) &lt;br /&gt;
*city (was merged with colocation)&lt;br /&gt;
*state_code (was merged with colocation)&lt;br /&gt;
*country_code (was merged with colocation)&lt;br /&gt;
*status (was merged with costatus)&lt;br /&gt;
&lt;br /&gt;
===The Equity Variables===&lt;br /&gt;
&lt;br /&gt;
[[Maxine Tao]] and I have added six new variables to the '''Accelerators Final''' sheet. Those variables are:&lt;br /&gt;
&lt;br /&gt;
*Terms of joining - terms of joining accelerator and important details about program&lt;br /&gt;
*Equity? (1/0) - cells contain a 1 if the accelerator takes equity, a 0 if it definitively does not, and are blank if we could not find that information&lt;br /&gt;
*Equity Amount - the % of equity the accelerator will take (can sometimes be a range, e.g. 5-7%)&lt;br /&gt;
*Equity Amount Normalized - this copies the previous column but only keeps %&amp;gt;0, and if a range was given (e.g. 5-7%) it returns the average (e.g. 6%)&lt;br /&gt;
*Investment Amount - the $ the accelerator invests in a company to begin, if relevant (could also be a range or an &amp;quot;up to $######&amp;quot;)&lt;br /&gt;
*Investment Notes - any comments on the previous 4 columns&lt;br /&gt;
&lt;br /&gt;
These six variables tell us more about the characteristics of accelerators; specifically, which ones take equity and which ones do not, and how much equity accelerators take.&lt;br /&gt;
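&lt;br /&gt;
The rule behind Equity Amount Normalized can be sketched as follows. This is an illustration under assumptions, not the spreadsheet formula actually used: the function name is hypothetical, and it assumes the Equity Amount cell is text such as '6%' or '5-7%'.&lt;br /&gt;
&lt;br /&gt;
```python
import re


def normalize_equity(raw):
    # Return the equity percentage as a float, or None when there is
    # no positive percentage to keep.
    numbers = [float(n) for n in re.findall(r'\d+(?:\.\d+)?', raw or '')]
    if not numbers:
        return None
    # A range such as 5-7% becomes its midpoint; a single value is unchanged.
    value = sum(numbers) / len(numbers)
    return value if value != 0 else None
```
&lt;br /&gt;
With this sketch, normalize_equity('5-7%') gives 6.0, and blank or 0% cells give None so they drop out of any average.&lt;br /&gt;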
&lt;br /&gt;
Relevant information: &lt;br /&gt;
*82 accelerators take equity, 42 do not, and we lack information for 37.&lt;br /&gt;
*The average % of equity among accelerators who take equity (rough estimate--do not use for anything official) is 6.49%. We got this number by looking only at accelerators who take equity, coding each reported range as its midpoint (e.g. 4%-10% equity would be coded as 7% equity), and taking the mean.&lt;br /&gt;
&lt;br /&gt;
===Matching Accelerators to UUIDs via Crunchbase===&lt;br /&gt;
&lt;br /&gt;
We've also added UUIDs for 163 of our 166 accelerators. The UUIDs can be found in Column AE of the '''Accelerators Final''' sheet.&lt;br /&gt;
&lt;br /&gt;
The file with accelerators matched to Crunchbase UUIDs can be found at:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Accelerators and UUIDs.xlsx&lt;br /&gt;
&lt;br /&gt;
This is the master file and should never be modified unless we find that a UUID has changed. ALL OTHER SHEETS with UUIDs are linked to this sheet, so its changes will be reflected elsewhere.&lt;br /&gt;
&lt;br /&gt;
More information can be found on the [[Crunchbase Data]] page. &lt;br /&gt;
&lt;br /&gt;
===Linking Accelerators to Founders/LinkedIn Crawling===&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] got the [[LinkedIn Crawler (Python)]] to work, which means we currently have the following information about accelerator founders:&lt;br /&gt;
*Current Job Title&lt;br /&gt;
*Location&lt;br /&gt;
*Employer&lt;br /&gt;
*Job(s) Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location of jobs&lt;br /&gt;
*Extra Description&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
This information can be found in the various '''Founders''' sheets in '''The File to Rule Them All'''.&lt;br /&gt;
&lt;br /&gt;
==An Overview==&lt;br /&gt;
This project will be used to determine which accelerators are the most effective at churning out successful startups, as well as what characteristics are exhibited by these accelerators. First, we need to gather as much data as we can about as many accelerators as we can in order to look at factors that differentiate successful vs. unsuccessful ventures. Next, we need to create a web crawling program which will gather information about accelerators across the world by accessing their websites and extracting information. I believe that our overall goal with this research project is to gain insight into the methods of successful accelerators, as well as to find out what exactly differentiates very successful accelerators from dead accelerators.&lt;br /&gt;
&lt;br /&gt;
Helpful Links: http://seedrankings.com/&lt;br /&gt;
&lt;br /&gt;
==Remaining To Dos==&lt;br /&gt;
&lt;br /&gt;
The last update on [[Accelerator Seed List (Data)]] said the following needed to be done: &lt;br /&gt;
&lt;br /&gt;
*Cross-reference sheet with data from Peter's old accelerator consolidation file (&amp;quot;accelerator_data_noflag&amp;quot; and &amp;quot;accelerator_data&amp;quot; in &amp;quot;All Relevant Files&amp;quot;) and fill in missing data&lt;br /&gt;
*Variables that are 100% NOT in these 2 files:&lt;br /&gt;
**Cohort Breakout?&lt;br /&gt;
**Subtype&lt;br /&gt;
**Designed for Students?&lt;br /&gt;
**Campuses&lt;br /&gt;
**Stage&lt;br /&gt;
**Software Tech&lt;br /&gt;
**What stage do they look for?&lt;br /&gt;
&lt;br /&gt;
==(Outdated) Necessary Steps==&lt;br /&gt;
&lt;br /&gt;
'''Acquiring the necessary data to complete the Accelerator Master Variable List and the Cohort List will require the following (not necessarily in this order):'''&lt;br /&gt;
&lt;br /&gt;
===Step Zero: Connect to Crunchbase and Link Data===&lt;br /&gt;
&lt;br /&gt;
Complete - more info: [[Crunchbase Data]]&lt;br /&gt;
&lt;br /&gt;
===Step One: LinkedIn Founders Data===&lt;br /&gt;
&lt;br /&gt;
This project will begin by working with [[Grace Tan]] and [[Maxine Tao]] to connect accelerators to their founders and cohort companies using Crunchbase and LinkedIn crawlers. Grace and Maxine will go through [http://crunchbase.com Crunchbase] and find the UUID for companies and their founders (reference [[Crunchbase Data]], [[Crunchbase Accelerator Founders]], [[Crunchbase Accelerator Equity]]). Connect them using SQL and feed the names of founders into our LinkedIn crawler (headed by [[Grace Tan]]).&lt;br /&gt;
&lt;br /&gt;
The list of founders for accelerators can be found at&lt;br /&gt;
 McNair/Projects/Accelerators/Fall 2017/founders_linkedin.txt&lt;br /&gt;
&lt;br /&gt;
The '''Unfound Founders''' file codes a 0 for all companies '''''not listed''''' within the LinkedIn Founders file, and a 1 for those that do have founders listed.&lt;br /&gt;
&lt;br /&gt;
'''Given the founders' names, we will then be able to use the [[LinkedIn Crawler (Python)]] to find the relevant details of an accelerator founder (education, work experience, etc.) This data on founders will help us solve the horse, jockey, racetrack question to detect what variables affect a startup's success (the accelerator, the founders, the environment/city).'''&lt;br /&gt;
&lt;br /&gt;
===Step Two: Linking Accelerators to Cohorts Using Investments on Crunchbase===&lt;br /&gt;
&lt;br /&gt;
In this step we focus on accelerators that take equity from the companies in their program. We do this to avoid looking at accelerators that also run funds or invest in various companies but do not take equity. Including them would give misleading results and lead us to believe some companies were in cohorts at accelerators when they were not.&lt;br /&gt;
&lt;br /&gt;
Maxine will acquire the list of accelerators who take equity from companies from the following sheet:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/All Relevant Files/accelerator_data_noflag.txt&lt;br /&gt;
&lt;br /&gt;
Looking at the file, however, shows that very few are actually categorized well and the equity variable is messy. Moving forward, we need to check/refine/fix this classification.&lt;br /&gt;
&lt;br /&gt;
This file has 266 rows. The most recent, actual version of our accelerator database (found at E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx under the sheet Master Variable List) only has 167 rows, meaning the accelerator_data_noflag.txt file has too many rows. &lt;br /&gt;
&lt;br /&gt;
We will need to do a left join of Accelerator Master List with accelerator_data_noflag.txt to get rid of the accelerator names that are in accelerator_data_noflag but NOT in Accelerator Master List.&lt;br /&gt;
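&lt;br /&gt;
The left join described above can be sketched in plain Python as follows. This is a rough illustration only: the column names 'name' and 'equity', the sample rows, and the case-insensitive name matching are assumptions, not the real layout of either file.&lt;br /&gt;
&lt;br /&gt;
```python
def left_join(master_rows, other_rows, key):
    # Keep every master row; attach the matching row from other_rows
    # when one exists. Keys are compared case-insensitively.
    lookup = {row[key].strip().lower(): row for row in other_rows}
    joined = []
    for row in master_rows:
        match = lookup.get(row[key].strip().lower(), {})
        merged = dict(match)
        merged.update(row)  # master values win on shared columns
        joined.append(merged)
    return joined


# Hypothetical rows with made-up values; the real columns may differ.
master = [{'name': 'Techstars'}, {'name': 'Y Combinator'}]
noflag = [
    {'name': 'techstars', 'equity': 'Y'},
    {'name': 'Some Other Fund', 'equity': 'N'},
]
result = left_join(master, noflag, 'name')
```
&lt;br /&gt;
Rows that appear only in the second list are dropped, and master rows without a match simply come back without the extra columns.&lt;br /&gt;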
&lt;br /&gt;
Once this is finished, we should have an “Equity” classification variable for every accelerator in Accelerator Master List. The accelerators that have a Y (or maybe it’s a 1) are the ones that take equity. These are the accelerators we’ll be able to do the Crunchbase work on to see when accelerators take equity.&lt;br /&gt;
&lt;br /&gt;
We then look at the accelerators' investments (or at companies and the entities that invested in them) and cross-reference the list of companies/accelerators. Once we find a match, we know that a company went through an accelerator and in which year it went through a cohort.&lt;br /&gt;
&lt;br /&gt;
'''From this, we get the following data:'''&lt;br /&gt;
*Accelerator a given company went through&lt;br /&gt;
*Year said company went through a cohort/Specific cohort company went through&lt;br /&gt;
&lt;br /&gt;
===Step Three: Demo Day Crawler===&lt;br /&gt;
&lt;br /&gt;
This part of the project relies on the contributions of the wonderful [[Minh Le]]. Better documentation for the project can be found on the [[Demo Day Page Parser]], [[Demo Day Page Google Classifier]], and [[Accelerator Demo Day]] project pages.&lt;br /&gt;
&lt;br /&gt;
Essentially, this part of the accelerator data project will use the [[Demo Day Page Parser]] to look through accelerator websites for pages which list a cohort's 'Demo Day', or the day on which accelerators present their companies to a group of special investors ([https://www.ycombinator.com/demoday/faq/ here's an example FAQ page from Y Combinator]). The [[Demo Day Page Google Classifier]] will then determine if the page is, in fact, a demo day page.&lt;br /&gt;
&lt;br /&gt;
'''Given a cohort's demo day, we can gather a few pieces of key information (check with Ed to make sure this is the correct information to gather from Demo Days):'''&lt;br /&gt;
*The date a cohort began/the season the cohort went through the accelerator&lt;br /&gt;
**This is acquired by looking at the cohort's demo day date, and subtracting the number of weeks/months of a cohort for that given accelerator. The length of a cohort can be found in the file:&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx&lt;br /&gt;
*I assume we can also acquire the companies in a specific cohort, as in we will have the list of all companies in the (for example) Spring 2018 Cohort of Techstars.&lt;br /&gt;
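&lt;br /&gt;
The date arithmetic above can be sketched as follows. This is a minimal example that assumes the cohort length is recorded in weeks; the function name, the demo day date, and the 12-week length are hypothetical values chosen for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
from datetime import date, timedelta


def cohort_start(demo_day, cohort_weeks):
    # Work backwards from the demo day by the length of the program.
    return demo_day - timedelta(weeks=cohort_weeks)


# Hypothetical demo day and cohort length.
start = cohort_start(date(2018, 6, 15), 12)
```
&lt;br /&gt;
If a cohort length is given in months rather than weeks, it would need to be converted first, since timedelta has no month unit.&lt;br /&gt;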
&lt;br /&gt;
===Step Four: Non-profit Finder===&lt;br /&gt;
&lt;br /&gt;
More at [[Non-profit Finder]]&lt;br /&gt;
&lt;br /&gt;
Another important step in this project is finding which accelerators are non-profits.&lt;br /&gt;
&lt;br /&gt;
A comprehensive list of nonprofits taken from the [https://www.irs.gov/charities-non-profits/tax-exempt-organization-search-bulk-data-downloads IRS] can be found here:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Nonprofits in US.xlsx&lt;br /&gt;
Warning: this file has 1 million rows&lt;br /&gt;
&lt;br /&gt;
This file should be cross-referenced with the list of accelerators to find which ones are listed--those accelerators are non-profits. We'll need someone good at SQL to do this with a match.&lt;br /&gt;
&lt;br /&gt;
Potential problem:&lt;br /&gt;
*The names of listed non-profits will likely be somewhat different than the names of accelerators, because companies often file for tax exemption with different names than they show the public.&lt;br /&gt;
&lt;br /&gt;
==Workflow Image==&lt;br /&gt;
&lt;br /&gt;
(the color coding for this image is a very rough and preliminary estimate subject to change) I imagine the project will look something like this, in that it will require the following information to fully complete the project:&lt;br /&gt;
&lt;br /&gt;
[[File:image1.png]]&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23750</id>
		<title>U.S. Seed Accelerators</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23750"/>
		<updated>2018-07-24T20:52:09Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: /* 7/9/18 Update */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=U.S. Seed Accelerators&lt;br /&gt;
|Has owner=Connor Rothschild,&lt;br /&gt;
|Has start date=06/18/2018&lt;br /&gt;
|Has keywords=accelerators, data&lt;br /&gt;
|Has notes=Continuation of [[Accelerator Data]]&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Is dependent on=Industry Classifier, Demo Day Page Parser, Accelerator Demo Day, Demo Day Page Google Classifier, Crunchbase Accelerator Equity, Crunchbase Accelerator Founders, Crunchbase Data,&lt;br /&gt;
|Has image=Seed-Accelerator.jpg&lt;br /&gt;
|Does subsume=Accelerator Data, Accelerator Seed List (Data),&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Project Location==&lt;br /&gt;
The master file can be found at &lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
==Relevant Former Projects==&lt;br /&gt;
This page serves as an updated and tidied version of the data and work presented on the [[Accelerator Seed List (Data)]] Project, which subsumed [[Accelerator Data]].&lt;br /&gt;
Both of these projects (and as a corollary, this project) are dependent on the [[Demo Day Page Parser]], [[Industry Classifier]], and the [[Whois Parser]].&lt;br /&gt;
&lt;br /&gt;
==Most Recent Work==&lt;br /&gt;
&lt;br /&gt;
Here's a project update on the work that has been done since coming to McNair. The most recent file is&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
===Merging Cohort Companies with Crunchbase Info===&lt;br /&gt;
&lt;br /&gt;
More information on this part of the project can be found on the page [[Merging Existing Data with Crunchbase]].&lt;br /&gt;
&lt;br /&gt;
The newest updated sheet of cohort company info is under the '''Cohorts Final''' sheet of '''The File to Rule Them All.xlsx'''.&lt;br /&gt;
&lt;br /&gt;
Working with [[Maxine Tao]], we have matched companies to their respective pages and information found in Crunchbase (via UUID). We ensured single matches by doing a 1-1-1-1 match with our data and with Crunchbase (using the Matcher). We then received additional information on 8092 companies. The following new information (on top of what we already had) is included in the sheet:&lt;br /&gt;
*short_description&lt;br /&gt;
*long_description&lt;br /&gt;
*category_list (details the company's category)&lt;br /&gt;
*category_group_list (a less refined, more all-encompassing category classification)&lt;br /&gt;
*founded_on date&lt;br /&gt;
*employee_count&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*address&lt;br /&gt;
&lt;br /&gt;
And the following information was also pulled from Crunchbase and '''merged''' with our existing data: &lt;br /&gt;
*URL (was merged with courl cells) &lt;br /&gt;
*city (was merged with colocation)&lt;br /&gt;
*state_code (was merged with colocation)&lt;br /&gt;
*country_code (was merged with colocation)&lt;br /&gt;
*status (was merged with costatus)&lt;br /&gt;
&lt;br /&gt;
===The Equity Variables===&lt;br /&gt;
&lt;br /&gt;
[[Maxine Tao]] and I have added six new variables to the '''Accelerators Final''' sheet. Those variables are:&lt;br /&gt;
&lt;br /&gt;
*Terms of joining - terms of joining accelerator and important details about program&lt;br /&gt;
*Equity? (1/0) - cells contain a 1 if the accelerator takes equity, a 0 if it definitively does not, and are blank if we could not find that information&lt;br /&gt;
*Equity Amount - the % of equity the accelerator will take (can sometimes be a range, e.g. 5-7%)&lt;br /&gt;
*Equity Amount Normalized - this copies the previous column but only keeps %&amp;gt;0, and if a range was given (e.g. 5-7%) it returns the average (e.g. 6%)&lt;br /&gt;
*Investment Amount - the $ the accelerator invests in a company to begin, if relevant (also could be a range or a &amp;quot;up to $######&amp;quot;)&lt;br /&gt;
*Investment Notes - anything to comment on previous 4 columns&lt;br /&gt;
&lt;br /&gt;
These six variables tell us more about the characteristics of accelerators; specifically, which ones take equity and which ones do not, and how much equity accelerators take.&lt;br /&gt;
&lt;br /&gt;
Relevant information: &lt;br /&gt;
*82 accelerators take equity, 42 do not, and we lack information for 37.&lt;br /&gt;
*The average % of equity among accelerators who take equity is 6.49% (rough estimate--do not use for anything official). We got this number by looking only at accelerators who take equity, replacing any reported range with its midpoint (e.g. 4%-10% equity would be coded as 7% equity), and taking the mean.&lt;br /&gt;
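&lt;br /&gt;
The '''Equity Amount Normalized''' rule and the rough average can be sketched in Python as follows. This is a minimal illustration, not the procedure actually used on the sheet; the function names and the sample strings are assumptions made for the example.&lt;br /&gt;

```python
import re
from statistics import mean


def normalize_equity(raw):
    """Return equity as one percentage, or None when blank or zero.

    Hypothetical helper: a range such as '5-7%' collapses to its midpoint.
    """
    # Pull every number out of strings such as '6%', '5-7%' or '4%-10%'.
    numbers = [float(n) for n in re.findall(r'\d+(?:\.\d+)?', raw or '')]
    if not numbers:
        return None
    value = mean(numbers)
    # Only positive percentages are kept, mirroring the normalized column.
    return value if value else None


def average_equity(raw_values):
    """Mean of the normalized values, ignoring blanks and zeros."""
    values = [v for v in map(normalize_equity, raw_values) if v is not None]
    return mean(values) if values else None
```

Calling average_equity on the '''Equity Amount''' column would then skip blank and zero cells automatically.&lt;br /&gt;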
&lt;br /&gt;
===Matching Accelerators to UUIDs via Crunchbase===&lt;br /&gt;
&lt;br /&gt;
We've also added UUIDs for 163 of our 166 accelerators. The UUIDs can be found in Column AE of the '''Accelerators Final''' sheet.&lt;br /&gt;
&lt;br /&gt;
The file with accelerators matched to Crunchbase UUIDs can be found at:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Accelerators and UUIDs.xlsx&lt;br /&gt;
&lt;br /&gt;
This is the master file and should never be modified unless we find a UUID changed. ALL OTHER SHEETS with UUIDs are linked to this sheet so its changes will be reflected elsewhere.&lt;br /&gt;
&lt;br /&gt;
More information can be found on the [[Crunchbase Data]] page. &lt;br /&gt;
&lt;br /&gt;
===Linking Accelerators to Founders/LinkedIn Crawling===&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] got the [[LinkedIn Crawler (Python)]] to work, which means we currently have the following information about accelerator founders:&lt;br /&gt;
*Current Job Title&lt;br /&gt;
*Location&lt;br /&gt;
*Employer&lt;br /&gt;
*Job(s) Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location of jobs&lt;br /&gt;
*Extra Description&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
This information can be found in the various '''Founders''' sheets in '''The File to Rule Them All'''.&lt;br /&gt;
&lt;br /&gt;
==An Overview==&lt;br /&gt;
This project will be used to determine which accelerators are the most effective at churning out successful startups, as well as what characteristics are exhibited by these accelerators. First, we need to gather as much data as we can about as many accelerators as we can in order to look at factors that differentiate successful vs. unsuccessful ventures. Next, we need to create a web crawling program which will gather information about accelerators across the world by accessing their websites and extracting information. I believe that our overall goal with this research project is to gain insight into the methods of successful accelerators, as well as to find out what exactly differentiates very successful accelerators from dead accelerators.&lt;br /&gt;
&lt;br /&gt;
Helpful Links: http://seedrankings.com/&lt;br /&gt;
&lt;br /&gt;
This project is developing broad and near-population data on accelerators and their cohort companies. The objective is to identify which cohorts of which accelerators a cohort company was trained in, obtain details of the accelerators, and obtain details of the cohort companies, including information about any venture capital investment that the cohort company might have received and any IPO or acquisition the company may have experienced.&lt;br /&gt;
&lt;br /&gt;
The primary use of this data is for an academic paper detailed on the [[Matching Entrepreneurs to Accelerators and VCs (Academic Paper)]] page. &lt;br /&gt;
&lt;br /&gt;
However, this project can also provide useful data to other academic papers ([[Urban Start-up Agglomeration]], [[Hubs (Academic Paper)]], and [[Hubs Scorecard (Academic Paper)]]), projects ([[Houston Entrepreneurship]]) and blog posts (under the [[Emerging Ecosystems]] umbrella project).&lt;br /&gt;
&lt;br /&gt;
(OUTDATED) The most recent update provided on [[Accelerator Seed List (Data)]] was on 05/21/2018. This update included the most recent '''master file''' of accelerator data, found at &lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx&lt;br /&gt;
&lt;br /&gt;
(OUTDATED) The Google Sheets Master Sheet is found here&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1ikuxYwp9JIRrjz4qQcbdwTpbHOne-q2PterYTjzofjw/edit?ts=5aa2f1f9#gid=0&lt;br /&gt;
&lt;br /&gt;
==Remaining To Dos==&lt;br /&gt;
&lt;br /&gt;
The last update on [[Accelerator Seed List (Data)]] said the following needed to be done: &lt;br /&gt;
&lt;br /&gt;
*Cross-reference sheet with data from Peter's old accelerator consolidation file (&amp;quot;accelerator_data_noflag&amp;quot; and &amp;quot;accelerator_data&amp;quot; in &amp;quot;All Relevant Files&amp;quot;) and fill in missing data&lt;br /&gt;
*Variables that are 100% NOT in these 2 files:&lt;br /&gt;
**Cohort Breakout?&lt;br /&gt;
**Subtype&lt;br /&gt;
**Designed for Students?&lt;br /&gt;
**Campuses&lt;br /&gt;
**Stage&lt;br /&gt;
**Software Tech&lt;br /&gt;
**What stage do they look for?&lt;br /&gt;
&lt;br /&gt;
TODO:&lt;br /&gt;
 McNair/Projects/Accelerators/Fall 2017/unfound_founders.txt&lt;br /&gt;
A 0 means we don't have founder data for that accelerator.&lt;br /&gt;
Specs: A tab delimited text file with the following fields:&lt;br /&gt;
 Accelerator   First Name   Last Name   LinkedInURL(if possible)&lt;br /&gt;
Getting the LinkedInURL will ensure accuracy, but the process will work without it.&lt;br /&gt;
&lt;br /&gt;
*Shrey: Find &amp;quot;demo day&amp;quot; keywords, so that we can search AcceleratorName Year Keyword and get back potential demo day pages&lt;br /&gt;
&lt;br /&gt;
It is unclear if any of these tasks have been done since the update on 05/21. I will begin by seeing which of these things have been carried out.&lt;br /&gt;
&lt;br /&gt;
==Other Listed To Dos==&lt;br /&gt;
&lt;br /&gt;
*We have compiled a very long list of accelerators from many different databases. For the past couple of weeks, everyone in the center has been going through this list, 20 at a time, classifying each one as an accelerator or not an accelerator, and then proceeding to gather data on the accelerator using the process outlined below. This process went very smoothly. We have successfully gone through about 80% of the list. We are still missing information on the last hundred or so names. All of the collected data is located on the RDP, within the &amp;quot;Accelerators&amp;quot; folder under &amp;quot;Data&amp;quot; or on the [https://docs.google.com/spreadsheets/d/1ikuxYwp9JIRrjz4qQcbdwTpbHOne-q2PterYTjzofjw/edit?ts=5aa2f1f9#gid=1132417337 &amp;quot;Accelerator Master Variable List&amp;quot; Google sheet].&lt;br /&gt;
*We have listed all of the startups from the accelerators that have break out cohorts on their website on the [https://docs.google.com/spreadsheets/d/1ikuxYwp9JIRrjz4qQcbdwTpbHOne-q2PterYTjzofjw/edit?ts=5aa2f1f9#gid=1132417337 &amp;quot;Accelerator Master Variable List&amp;quot; Google sheet]. This contains the following information in the &amp;quot;Cohort List (new)&amp;quot; sheet: accelerator name, year, cohort name, company name, description, founders, category/sector, and location. &lt;br /&gt;
*Next steps include going through the demo day pages that have been downloaded and writing notes on the different types if possible (see [[Demo Day Page Google Classifier]]).&lt;br /&gt;
&lt;br /&gt;
==Moving Forward==&lt;br /&gt;
&lt;br /&gt;
'''Acquiring the necessary data to complete the Accelerator Master Variable List and the Cohort List will require the following (not necessarily in this order):'''&lt;br /&gt;
&lt;br /&gt;
===Step Zero: Connect to Crunchbase and Link Data - COMPLETE===&lt;br /&gt;
&lt;br /&gt;
[[Crunchbase Data]]&lt;br /&gt;
&lt;br /&gt;
===Step One: LinkedIn Founders Data===&lt;br /&gt;
&lt;br /&gt;
This project will begin by working with [[Grace Tan]] and [[Maxine Tao]] to connect accelerators to their founders and cohort companies using Crunchbase and LinkedIn crawlers. Grace and Maxine will go through [http://crunchbase.com Crunchbase] and find the UUID for companies and their founders (reference [[Crunchbase Data]], [[Crunchbase Accelerator Founders]], [[Crunchbase Accelerator Equity]]). Connect them using SQL and feed the names of founders into our LinkedIn crawler (headed by [[Grace Tan]]).&lt;br /&gt;
&lt;br /&gt;
The list of founders for accelerators can be found at&lt;br /&gt;
 McNair/Projects/Accelerators/Fall 2017/founders_linkedin.txt&lt;br /&gt;
&lt;br /&gt;
The '''Unfound Founders''' file codes a 0 for all companies '''''not listed''''' within the LinkedIn Founders file, and a 1 for those that do have founders listed.&lt;br /&gt;
&lt;br /&gt;
'''Given the founders' names, we will then be able to use the [[LinkedIn Crawler (Python)]] to find the relevant details of an accelerator founder (education, work experience, etc.) This data on founders will help us solve the horse, jockey, racetrack question to detect what variables affect a startup's success (the accelerator, the founders, the environment/city).'''&lt;br /&gt;
&lt;br /&gt;
===Step Two: Linking Accelerators to Cohorts Using Investments on Crunchbase===&lt;br /&gt;
&lt;br /&gt;
In this step we focus on accelerators who take equity from the companies that engage in their program. We do this to avoid looking at accelerators who may also run funds/invest in various companies but do not take equity. Including them would give us misleading results and lead us to believe some companies were in cohorts at accelerators when they really were not. &lt;br /&gt;
&lt;br /&gt;
Maxine will acquire the list of accelerators who take equity from companies from the following sheet:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/All Relevant Files/accelerator_data_noflag.txt&lt;br /&gt;
&lt;br /&gt;
Looking at the file, however, shows that very few are actually categorized well and the equity variable is messy. Moving forward, we need to check/refine/fix this classification.&lt;br /&gt;
&lt;br /&gt;
This file has 266 rows. The most recent, actual version of our accelerator database (found at E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx under the sheet Master Variable List) only has 167 rows, meaning the accelerator_data_noflag.txt file has too many rows. &lt;br /&gt;
&lt;br /&gt;
We will need to do a left join of Accelerator Master List with accelerator_data_noflag.txt to get rid of the accelerator names that are in accelerator_data_noflag but NOT in Accelerator Master List.&lt;br /&gt;
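&lt;br /&gt;
A minimal sketch of that left join in plain Python is shown below. The column names ('name', 'equity') and the matching on lower-cased, trimmed names are assumptions for illustration; the real files may use different headers and may need a more careful match.&lt;br /&gt;

```python
def clean(name):
    """Lower-case and trim a name so trivial differences do not block a match."""
    return name.strip().lower()


def left_join(master_rows, noflag_rows, key='name', column='equity'):
    """Keep every master row and attach the noflag value for `column`.

    Rows that exist only in noflag_rows are dropped; master rows with no
    match get None for `column`.
    """
    lookup = {clean(row[key]): row.get(column) for row in noflag_rows}
    return [
        dict(row, **{column: lookup.get(clean(row[key]))})
        for row in master_rows
    ]
```

Master rows that come back with None in the equity column are the ones that still need to be classified by hand.&lt;br /&gt;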
&lt;br /&gt;
Once this is finished, we should have an “Equity” classification variable for every accelerator in Accelerator Master List. The accelerators that have a Y (or maybe it’s a 1) are the ones that do take equity. These are the accelerators we’ll be able to do the Crunchbase work on to see when accelerators take equity.&lt;br /&gt;
&lt;br /&gt;
We then look at the accelerators' investments (or at companies and the entities which invested in them) and cross-reference the list of companies/accelerators. Once we find a match, we know that a company went through an accelerator and in which year it went through a cohort.&lt;br /&gt;
&lt;br /&gt;
'''From this, we get the following data:'''&lt;br /&gt;
*Accelerator a given company went through&lt;br /&gt;
*Year said company went through a cohort/Specific cohort company went through&lt;br /&gt;
&lt;br /&gt;
===Step Three: Demo Day Crawler===&lt;br /&gt;
&lt;br /&gt;
This part of the project relies on the contributions of the wonderful [[Minh Le]]. Better documentation for the project can be found on the [[Demo Day Page Parser]], [[Demo Day Page Google Classifier]], and [[Accelerator Demo Day]] project pages.&lt;br /&gt;
&lt;br /&gt;
Essentially, this part of the accelerator data project will use the [[Demo Day Page Parser]] to look through accelerator websites for pages which list a cohort's 'Demo Day', or the day on which accelerators present their companies to a group of special investors ([https://www.ycombinator.com/demoday/faq/ here's an example FAQ page from Y Combinator]). The [[Demo Day Page Google Classifier]] will then determine if the page is, in fact, a demo day page. &lt;br /&gt;
&lt;br /&gt;
'''Given a cohort's demo day, we can gather a few pieces of key information (check with Ed to make sure this is the correct information to gather from Demo Days):'''&lt;br /&gt;
*The date a cohort began/the season the cohort went through the accelerator&lt;br /&gt;
**This is acquired by looking at the cohort's demo day date, and subtracting the number of weeks/months of a cohort for that given accelerator. The length of a cohort can be found in the file:&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx&lt;br /&gt;
*I assume we can also acquire the companies in a specific cohort, as in we will have the list of all companies in the (for example) Spring 2018 Cohort of Techstars.&lt;br /&gt;
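&lt;br /&gt;
The date arithmetic described above can be sketched as follows. This is only an illustration: the function names are hypothetical, the program length is assumed to be recorded in weeks, and the month-to-season mapping is an assumption rather than a project convention.&lt;br /&gt;

```python
from datetime import date, timedelta

# Assumed mapping from calendar month to season label.
SEASONS = {
    12: 'Winter', 1: 'Winter', 2: 'Winter',
    3: 'Spring', 4: 'Spring', 5: 'Spring',
    6: 'Summer', 7: 'Summer', 8: 'Summer',
    9: 'Fall', 10: 'Fall', 11: 'Fall',
}


def cohort_start(demo_day, program_weeks):
    """Estimate the cohort start by stepping back the program length."""
    return demo_day - timedelta(weeks=program_weeks)


def cohort_season(demo_day, program_weeks):
    """Label the cohort by the season and year in which it started."""
    start = cohort_start(demo_day, program_weeks)
    return '{} {}'.format(SEASONS[start.month], start.year)
```

For example, a demo day on date(2018, 6, 15) after a 12-week program gives an estimated start of date(2018, 3, 23).&lt;br /&gt;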
&lt;br /&gt;
===Step Four: Non-profit Finder===&lt;br /&gt;
&lt;br /&gt;
More at [[Non-profit Finder]]&lt;br /&gt;
&lt;br /&gt;
Another important step in this project is finding which accelerators are non-profits.&lt;br /&gt;
&lt;br /&gt;
A comprehensive list of nonprofits taken from the [https://www.irs.gov/charities-non-profits/tax-exempt-organization-search-bulk-data-downloads IRS] can be found here:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Nonprofits in US.xlsx&lt;br /&gt;
Warning: this file has 1 million rows&lt;br /&gt;
&lt;br /&gt;
This file should be cross-referenced with the list of accelerators to find which ones are listed--those accelerators are non-profits. We'll need someone good in SQL to do this with a match.&lt;br /&gt;
&lt;br /&gt;
Potential problem:&lt;br /&gt;
*The names of listed non-profits will likely be somewhat different than the names of accelerators, because companies often file for tax exemption with different names than they show the public.&lt;br /&gt;
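&lt;br /&gt;
One way to soften the naming problem is to normalize both lists and accept close, rather than exact, matches. The sketch below uses Python's standard difflib module instead of SQL; the suffix list, the cutoff, and the function names are assumptions, and every candidate it returns would still need a manual check.&lt;br /&gt;

```python
import re
from difflib import get_close_matches

# Assumed list of words that often differ between legal and public names.
NOISE_WORDS = {'inc', 'llc', 'corp', 'corporation', 'foundation', 'the'}


def normalize(name):
    """Lower-case, drop punctuation, and remove common legal suffixes."""
    words = re.findall(r'[a-z0-9]+', name.lower())
    return ' '.join(w for w in words if w not in NOISE_WORDS)


def nonprofit_candidates(accelerators, nonprofit_names, cutoff=0.85):
    """Map each accelerator to the IRS names that closely resemble it."""
    index = {}
    for name in nonprofit_names:
        index.setdefault(normalize(name), name)
    keys = list(index)
    candidates = {}
    for accelerator in accelerators:
        hits = get_close_matches(normalize(accelerator), keys, n=3, cutoff=cutoff)
        if hits:
            candidates[accelerator] = [index[hit] for hit in hits]
    return candidates
```

Fuzzy comparison against roughly a million names is slow, so it may be worth trying an exact match on the normalized names first and reserving the fuzzy step for the accelerators that remain unmatched.&lt;br /&gt;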
&lt;br /&gt;
==Workflow Image==&lt;br /&gt;
&lt;br /&gt;
I imagine the project will look something like this, in terms of the information required to fully complete it (the color coding for this image is a very rough, preliminary estimate and is subject to change):&lt;br /&gt;
&lt;br /&gt;
[[File:image1.png]]&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23719</id>
		<title>U.S. Seed Accelerators</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23719"/>
		<updated>2018-07-23T17:32:31Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: /* 7/9/18 Update */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=U.S. Seed Accelerators&lt;br /&gt;
|Has owner=Connor Rothschild,&lt;br /&gt;
|Has start date=06/18/2018&lt;br /&gt;
|Has keywords=accelerators, data&lt;br /&gt;
|Has notes=Continuation of [[Accelerator Data]]&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Is dependent on=Industry Classifier, Demo Day Page Parser, Accelerator Demo Day, Demo Day Page Google Classifier, Crunchbase Accelerator Equity, Crunchbase Accelerator Founders, Crunchbase Data,&lt;br /&gt;
|Has image=Seed-Accelerator.jpg&lt;br /&gt;
|Does subsume=Accelerator Data, Accelerator Seed List (Data),&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Project Location==&lt;br /&gt;
The master file can be found at &lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
==Relevant Former Projects==&lt;br /&gt;
This page serves as an updated and tidied version of the data and work presented on the [[Accelerator Seed List (Data)]] Project, which subsumed [[Accelerator Data]].&lt;br /&gt;
Both of these projects (and as a corollary, this project) are dependent on the [[Demo Day Page Parser]], [[Industry Classifier]], and the [[Whois Parser]].&lt;br /&gt;
&lt;br /&gt;
==7/9/18 Update==&lt;br /&gt;
&lt;br /&gt;
Here's a project update on the work that has been done since coming to McNair. The most recent file is&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
===Merging Cohort Companies with Crunchbase Info===&lt;br /&gt;
&lt;br /&gt;
More information on this part of the project can be found on the page [[Merging Existing Data with Crunchbase]].&lt;br /&gt;
&lt;br /&gt;
The newest updated sheet of cohort company info is under the '''Cohorts Final''' sheet of '''The File to Rule Them All.xlsx'''.&lt;br /&gt;
&lt;br /&gt;
Working with [[Maxine Tao]], we have matched companies to their respective pages and information found in Crunchbase (via UUID). We ensured single matches by doing a 1-1-1-1 match with our data and with Crunchbase (using the Matcher). We then received additional information on 8092 companies. The following new information is included in the sheet:&lt;br /&gt;
*short_description&lt;br /&gt;
*long_description&lt;br /&gt;
*category_list (details the company's category)&lt;br /&gt;
*category_group_list (a less refined, more all-encompassing category classification)&lt;br /&gt;
*founded_on date&lt;br /&gt;
*employee_count&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*address&lt;br /&gt;
&lt;br /&gt;
And the following information was also pulled from Crunchbase and '''merged''' with our existing data: &lt;br /&gt;
*URL (was merged with courl cells) &lt;br /&gt;
*city (was merged with colocation)&lt;br /&gt;
*state_code (was merged with colocation)&lt;br /&gt;
*country_code (was merged with colocation)&lt;br /&gt;
*status (was merged with costatus)&lt;br /&gt;
&lt;br /&gt;
===The Equity Variables: COMPLETE===&lt;br /&gt;
&lt;br /&gt;
[[Maxine Tao]] and I have added six new variables to the '''Accelerators Final''' sheet. Those variables are:&lt;br /&gt;
&lt;br /&gt;
*Terms of joining - terms of joining accelerator and important details about program&lt;br /&gt;
*Equity? (1/0) - cells contain a 1 if the accelerator takes equity, a 0 if the accelerator definitively does not, and are blank if we could not find that information&lt;br /&gt;
*Equity Amount - the % of equity the accelerator will take (can sometimes be a range, e.g. 5-7%)&lt;br /&gt;
*Equity Amount Normalized - this copies the previous column but only keeps %&amp;gt;0, and if a range was given (e.g. 5-7%) it returns the average (e.g. 6%)&lt;br /&gt;
*Investment Amount - the $ the accelerator invests in a company to begin, if relevant (also could be a range or a &amp;quot;up to $######&amp;quot;)&lt;br /&gt;
*Investment Notes - anything to comment on previous 4 columns&lt;br /&gt;
&lt;br /&gt;
These six variables tell us more about the characteristics of accelerators; specifically, which ones take equity and which ones do not, and how much equity accelerators take.&lt;br /&gt;
&lt;br /&gt;
Relevant information: &lt;br /&gt;
*82 accelerators take equity, 42 do not, and we lack information for 37.&lt;br /&gt;
*The average % of equity among accelerators who take equity is 6.49% (rough estimate--do not use for anything official). We got this number by looking only at accelerators who take equity, replacing any reported range with its midpoint (e.g. 4%-10% equity would be coded as 7% equity), and taking the mean.&lt;br /&gt;
&lt;br /&gt;
===Matching Accelerators to UUIDs: COMPLETE===&lt;br /&gt;
&lt;br /&gt;
We've also added UUIDs for 163 of our 166 accelerators. The UUIDs can be found in Column AE of the '''Accelerators Final''' sheet.&lt;br /&gt;
&lt;br /&gt;
The file with accelerators matched to Crunchbase UUIDs can be found at:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Accelerators and UUIDs.xlsx&lt;br /&gt;
&lt;br /&gt;
This is the master file and should never be modified unless we find a UUID changed. ALL OTHER SHEETS with UUIDs are linked to this sheet so its changes will be reflected elsewhere.&lt;br /&gt;
&lt;br /&gt;
More information can be found on the Crunchbase Data page.&lt;br /&gt;
&lt;br /&gt;
===Linking Accelerators to Founders/LinkedIn Crawling: COMPLETE===&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] got the [[LinkedIn Crawler (Python)]] to work, which means we currently have the following information about accelerator founders:&lt;br /&gt;
*Current Job Title&lt;br /&gt;
*Location&lt;br /&gt;
*Employer&lt;br /&gt;
*Job(s) Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location of jobs&lt;br /&gt;
*Extra Description&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
This information can be found in the various '''Founders''' sheets in '''The File to Rule Them All'''.&lt;br /&gt;
&lt;br /&gt;
==An Overview==&lt;br /&gt;
This project will be used to determine which accelerators are the most effective at churning out successful startups, as well as what characteristics are exhibited by these accelerators. First, we need to gather as much data as we can about as many accelerators as we can in order to look at factors that differentiate successful vs. unsuccessful ventures. Next, we need to create a web crawling program which will gather information about accelerators across the world by accessing their websites and extracting information. I believe that our overall goal with this research project is to gain insight into the methods of successful accelerators, as well as to find out what exactly differentiates very successful accelerators from dead accelerators.&lt;br /&gt;
&lt;br /&gt;
Helpful Links: http://seedrankings.com/&lt;br /&gt;
&lt;br /&gt;
This project is developing broad and near-population data on accelerators and their cohort companies. The objective is to identify which cohorts of which accelerators a cohort company was trained in, obtain details of the accelerators, and obtain details of the cohort companies, including information about any venture capital investment that the cohort company might have received and any IPO or acquisition the company may have experienced.&lt;br /&gt;
&lt;br /&gt;
The primary use of this data is for an academic paper detailed on the [[Matching Entrepreneurs to Accelerators and VCs (Academic Paper)]] page. &lt;br /&gt;
&lt;br /&gt;
However, this project can also provide useful data to other academic papers ([[Urban Start-up Agglomeration]], [[Hubs (Academic Paper)]], and [[Hubs Scorecard (Academic Paper)]]), projects ([[Houston Entrepreneurship]]) and blog posts (under the [[Emerging Ecosystems]] umbrella project).&lt;br /&gt;
&lt;br /&gt;
(OUTDATED) The most recent update provided on [[Accelerator Seed List (Data)]] was on 05/21/2018. This update included the most recent '''master file''' of accelerator data, found at &lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx&lt;br /&gt;
&lt;br /&gt;
(OUTDATED) The Google Sheets Master Sheet is found here&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1ikuxYwp9JIRrjz4qQcbdwTpbHOne-q2PterYTjzofjw/edit?ts=5aa2f1f9#gid=0&lt;br /&gt;
&lt;br /&gt;
==Remaining To Dos==&lt;br /&gt;
&lt;br /&gt;
The last update on [[Accelerator Seed List (Data)]] said the following needed to be done: &lt;br /&gt;
&lt;br /&gt;
*Cross-reference sheet with data from Peter's old accelerator consolidation file (&amp;quot;accelerator_data_noflag&amp;quot; and &amp;quot;accelerator_data&amp;quot; in &amp;quot;All Relevant Files&amp;quot;) and fill in missing data&lt;br /&gt;
*Variables that are 100% NOT in these 2 files:&lt;br /&gt;
**Cohort Breakout?&lt;br /&gt;
**Subtype&lt;br /&gt;
**Designed for Students?&lt;br /&gt;
**Campuses&lt;br /&gt;
**Stage&lt;br /&gt;
**Software Tech&lt;br /&gt;
**What stage do they look for?&lt;br /&gt;
&lt;br /&gt;
TODO:&lt;br /&gt;
 McNair/Projects/Accelerators/Fall 2017/unfound_founders.txt&lt;br /&gt;
A 0 means we don't have founder data for that accelerator.&lt;br /&gt;
Specs: A tab delimited text file with the following fields:&lt;br /&gt;
 Accelerator   First Name   Last Name   LinkedInURL(if possible)&lt;br /&gt;
Getting the LinkedInURL will ensure accuracy, but will work without it.&lt;br /&gt;
&lt;br /&gt;
*Shrey: Find &amp;quot;demo day&amp;quot; keywords, so that we can search AcceleratorName Year Keyword and get back potential demo day pages&lt;br /&gt;
&lt;br /&gt;
It is unclear if any of these tasks have been done since the update on 05/21. I will begin by seeing which of these things have been carried out.&lt;br /&gt;
&lt;br /&gt;
==Other Listed To Dos==&lt;br /&gt;
&lt;br /&gt;
*We have compiled a very long list of accelerators from many different databases. For the past couple of weeks, everyone in the center has been going through this list, 20 at a time, classifying each one as an accelerator or not an accelerator, and then proceeding to gather data on the accelerator using the process outlined below. This process went very smoothly. We have successfully gone through about 80% of the list. We are still missing information on the last hundred or so names. All of the collected data is located on the RDP, within the &amp;quot;Accelerators&amp;quot; folder under &amp;quot;Data&amp;quot; or on the [https://docs.google.com/spreadsheets/d/1ikuxYwp9JIRrjz4qQcbdwTpbHOne-q2PterYTjzofjw/edit?ts=5aa2f1f9#gid=1132417337 &amp;quot;Accelerator Master Variable List&amp;quot; Google sheet].&lt;br /&gt;
*We have listed all of the startups from the accelerators that have break out cohorts on their website on the [https://docs.google.com/spreadsheets/d/1ikuxYwp9JIRrjz4qQcbdwTpbHOne-q2PterYTjzofjw/edit?ts=5aa2f1f9#gid=1132417337 &amp;quot;Accelerator Master Variable List&amp;quot; Google sheet]. This contains the following information in the &amp;quot;Cohort List (new)&amp;quot; sheet: accelerator name, year, cohort name, company name, description, founders, category/sector, and location. &lt;br /&gt;
*Next steps include going through the demo day pages that have been downloaded and writing notes on the different types if possible (see [[Demo Day Page Google Classifier]]).&lt;br /&gt;
&lt;br /&gt;
==Moving Forward==&lt;br /&gt;
&lt;br /&gt;
'''Acquiring the necessary data to complete the Accelerator Master Variable List and the Cohort List will require the following (not necessarily in this order):'''&lt;br /&gt;
&lt;br /&gt;
===Step Zero: Connect to Crunchbase and Link Data - COMPLETE===&lt;br /&gt;
&lt;br /&gt;
[[Crunchbase Data]]&lt;br /&gt;
&lt;br /&gt;
===Step One: LinkedIn Founders Data===&lt;br /&gt;
&lt;br /&gt;
This project will begin by working with [[Grace Tan]] and [[Maxine Tao]] to connect accelerators to their founders and cohort companies using Crunchbase and LinkedIn crawlers. Grace and Maxine will go through [http://crunchbase.com Crunchbase] and find the UUID for companies and their founders (reference [[Crunchbase Data]], [[Crunchbase Accelerator Founders]], [[Crunchbase Accelerator Equity]]). Connect them using SQL and feed the names of founders into our LinkedIn crawler (headed by [[Grace Tan]]).&lt;br /&gt;
&lt;br /&gt;
The list of founders for accelerators can be found at&lt;br /&gt;
 McNair/Projects/Accelerators/Fall 2017/founders_linkedin.txt&lt;br /&gt;
&lt;br /&gt;
The '''Unfound Founders''' file codes a 0 for all companies '''''not listed''''' within the LinkedIn Founders file, and a 1 for those that do have founders listed.&lt;br /&gt;
&lt;br /&gt;
'''Given the founders' names, we will then be able to use the [[LinkedIn Crawler (Python)]] to find the relevant details of an accelerator founder (education, work experience, etc.) This data on founders will help us solve the horse, jockey, racetrack question to detect what variables affect a startup's success (the accelerator, the founders, the environment/city).'''&lt;br /&gt;
&lt;br /&gt;
===Step Two: Linking Accelerators to Cohorts Using Investments on Crunchbase===&lt;br /&gt;
&lt;br /&gt;
In this step we focus on accelerators who take equity from the companies that engage in their program. We do this to avoid looking at accelerators who may also run funds/invest in various companies but do not take equity. Including them would give us misleading results and lead us to believe some companies were in cohorts at accelerators when they really were not. &lt;br /&gt;
&lt;br /&gt;
Maxine will acquire the list of accelerators who take equity from companies from the following sheet:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/All Relevant Files/accelerator_data_noflag.txt&lt;br /&gt;
&lt;br /&gt;
Looking at the file, however, shows that very few are actually categorized well and the equity variable is messy. Moving forward, we need to check/refine/fix this classification.&lt;br /&gt;
&lt;br /&gt;
This file has 266 rows. The most recent, actual version of our accelerator database (found at E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx under the sheet Master Variable List) only has 167 rows, meaning the accelerator_data_noflag.txt file has too many rows. &lt;br /&gt;
&lt;br /&gt;
We will need to do a left join of Accelerator Master List with accelerator_data_noflag.txt to get rid of the accelerator names that are in accelerator_data_noflag but NOT in Accelerator Master List.&lt;br /&gt;
&lt;br /&gt;
Once this is finished, we should have an “Equity” classification variable for every accelerator in Accelerator Master List. The accelerators that have a Y (or maybe it’s a 1) are the ones that do take equity. These are the accelerators we’ll be able to do the Crunchbase work on to see when accelerators take equity.&lt;br /&gt;
&lt;br /&gt;
We then look at the accelerators' investments (or at companies and the entities which invested in them) and cross-reference the list of companies/accelerators. Once we find a match, we know that a company went through an accelerator and in which year it went through a cohort.&lt;br /&gt;
&lt;br /&gt;
'''From this, we get the following data:'''&lt;br /&gt;
*Accelerator a given company went through&lt;br /&gt;
*Year said company went through a cohort/Specific cohort company went through&lt;br /&gt;
&lt;br /&gt;
===Step Three: Demo Day Crawler===&lt;br /&gt;
&lt;br /&gt;
This part of the project relies on the contributions of the wonderful [[Minh Le]]. Better documentation for the project can be found on the [[Demo Day Page Parser]], [[Demo Day Page Google Classifier]], and [[Accelerator Demo Day]] project pages.&lt;br /&gt;
&lt;br /&gt;
Essentially, this part of the accelerator data project will use the [[Demo Day Page Parser]] to look through accelerator websites for pages which list a cohort's 'Demo Day', or the day on which accelerators present their companies to a group of special investors ([https://www.ycombinator.com/demoday/faq/ here's an example FAQ page from Y Combinator]). The [[Demo Day Page Google Classifier]] will then determine if the page is, in fact, a demo day page. &lt;br /&gt;
&lt;br /&gt;
'''Given a cohort's demo day, we can gather a few pieces of key information (check with Ed to make sure this is the correct information to gather from Demo Days):'''&lt;br /&gt;
*The date a cohort began/the season the cohort went through the accelerator&lt;br /&gt;
**This is acquired by looking at the cohort's demo day date, and subtracting the number of weeks/months of a cohort for that given accelerator. The length of a cohort can be found in the file:&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx&lt;br /&gt;
*I assume we can also acquire the companies in a specific cohort, as in we will have the list of all companies in the (for example) Spring 2018 Cohort of Techstars.&lt;br /&gt;
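&lt;br /&gt;
The date subtraction described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names, the 12-week program length, the example demo day date, and the month-to-season mapping are all assumptions made for the example.&lt;br /&gt;

```python
from datetime import date, timedelta


def cohort_start(demo_day, program_weeks):
    """Estimate the cohort start date by stepping back from demo day."""
    return demo_day - timedelta(weeks=program_weeks)


def season(day):
    """Map a date to a label such as 'Spring 2018' (assumed month-to-season mapping)."""
    names = {12: "Winter", 1: "Winter", 2: "Winter",
             3: "Spring", 4: "Spring", 5: "Spring",
             6: "Summer", 7: "Summer", 8: "Summer",
             9: "Fall", 10: "Fall", 11: "Fall"}
    return f"{names[day.month]} {day.year}"


# Hypothetical example: a 12-week program with a demo day on 2018-05-24.
start = cohort_start(date(2018, 5, 24), 12)
print(start, season(start))
```

In practice the program length would come from the cohort length column of the master variable list rather than a hard-coded value.&lt;br /&gt;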
&lt;br /&gt;
===Step Four: Non-profit Finder===&lt;br /&gt;
&lt;br /&gt;
More at [[Non-profit Finder]]&lt;br /&gt;
&lt;br /&gt;
Another important step in this project is finding which accelerators are non-profits.&lt;br /&gt;
&lt;br /&gt;
A comprehensive list of nonprofits taken from the [https://www.irs.gov/charities-non-profits/tax-exempt-organization-search-bulk-data-downloads IRS] can be found here:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Nonprofits in US.xlsx&lt;br /&gt;
Warning: this file has 1 million rows&lt;br /&gt;
&lt;br /&gt;
This file should be cross-referenced with the list of accelerators to find which ones are listed--those accelerators are non-profits. We'll need someone good in SQL to do this with a match.&lt;br /&gt;
&lt;br /&gt;
Potential problem:&lt;br /&gt;
*The names of listed non-profits will likely be somewhat different than the names of accelerators, because companies often file for tax exemption with different names than they show the public.&lt;br /&gt;
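&lt;br /&gt;
One possible way to soften the name-mismatch problem is approximate string matching. The sketch below uses Python's standard difflib module; the suffix list, the 0.85 cutoff, and the example names are assumptions chosen for illustration, not values taken from the project.&lt;br /&gt;

```python
import difflib
import re

# Hypothetical list of legal-form words to drop before comparing names.
SUFFIXES = {"inc", "llc", "corp", "corporation", "foundation", "the"}


def normalize(name):
    """Lowercase, strip punctuation, and drop common legal suffixes."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(w for w in words if w not in SUFFIXES)


def match_nonprofits(accelerators, irs_names, cutoff=0.85):
    """Return {accelerator: IRS name} for the closest match above the cutoff."""
    by_key = {normalize(n): n for n in irs_names}
    matches = {}
    for acc in accelerators:
        hits = difflib.get_close_matches(normalize(acc), list(by_key), n=1, cutoff=cutoff)
        if hits:
            matches[acc] = by_key[hits[0]]
    return matches


# Hypothetical names, for illustration only.
print(match_nonprofits(["Example Accelerator"], ["EXAMPLE ACCELERATOR INC"]))
```

Comparing every accelerator against all 1 million IRS rows this way would be slow, so the IRS list would likely need to be narrowed first (for example by state or first letter), and any approximate matches should be checked by hand.&lt;br /&gt;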
&lt;br /&gt;
==Workflow Image==&lt;br /&gt;
&lt;br /&gt;
I imagine the project will look something like this, in that it will require the following information to complete. (The color coding in this image is a very rough, preliminary estimate and is subject to change.)&lt;br /&gt;
&lt;br /&gt;
[[File:image1.png]]&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23718</id>
		<title>U.S. Seed Accelerators</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23718"/>
		<updated>2018-07-23T17:16:22Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: /* 7/9/18 Update */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=U.S. Seed Accelerators&lt;br /&gt;
|Has owner=Connor Rothschild,&lt;br /&gt;
|Has start date=06/18/2018&lt;br /&gt;
|Has keywords=accelerators, data&lt;br /&gt;
|Has notes=Continuation of [[Accelerator Data]]&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Is dependent on=Industry Classifier, Demo Day Page Parser, Accelerator Demo Day, Demo Day Page Google Classifier, Crunchbase Accelerator Equity, Crunchbase Accelerator Founders, Crunchbase Data,&lt;br /&gt;
|Has image=Seed-Accelerator.jpg&lt;br /&gt;
|Does subsume=Accelerator Data, Accelerator Seed List (Data),&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Project Location==&lt;br /&gt;
The master file can be found at &lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
==Relevant Former Projects==&lt;br /&gt;
This page serves as an updated and tidied version of the data and work presented on the [[Accelerator Seed List (Data)]] Project, which subsumed [[Accelerator Data]].&lt;br /&gt;
Both of these projects (and as a corollary, this project) are dependent on the [[Demo Day Page Parser]], [[Industry Classifier]], and the [[Whois Parser]].&lt;br /&gt;
&lt;br /&gt;
==7/9/18 Update==&lt;br /&gt;
&lt;br /&gt;
Here's a project update on the work that has been done since coming to McNair. The most recent file is&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
===The Equity Variables: COMPLETE===&lt;br /&gt;
&lt;br /&gt;
[[Maxine Tao]] and I have added six new variables to the '''Accelerators Final''' sheet. Those variables are:&lt;br /&gt;
&lt;br /&gt;
*Terms of joining - terms of joining accelerator and important details about program&lt;br /&gt;
*Equity? (1/0) - cells contain a 1 if the accelerator takes equity, a 0 if the accelerator definitively does not, and are blank if we could not find that information&lt;br /&gt;
*Equity Amount - the % of equity the accelerator will take (can sometimes be a range, e.g. 5-7%)&lt;br /&gt;
*Equity Amount Normalized - this copies the previous column but only keeps %&amp;gt;0, and if a range was given (e.g. 5-7%) it returns the average (e.g. 6%)&lt;br /&gt;
*Investment Amount - the $ the accelerator initially invests in a company, if relevant (could also be a range or an &amp;quot;up to $######&amp;quot;)&lt;br /&gt;
*Investment Notes - anything to comment on previous 4 columns&lt;br /&gt;
&lt;br /&gt;
These six variables tell us more about the characteristics of accelerators; specifically, which ones take equity and which ones do not, and how much equity accelerators take.&lt;br /&gt;
&lt;br /&gt;
Relevant information: &lt;br /&gt;
*82 accelerators take equity, 42 do not, and we lack information for 37.&lt;br /&gt;
*The average % of equity among accelerators that take equity is 6.49% (rough estimate--do not use for anything official). We got this number by looking only at accelerators that take equity, replacing each reported range with its midpoint (e.g. 4%-10% equity would be coded as 7% equity), and taking the mean.&lt;br /&gt;
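&lt;br /&gt;
The normalization rule for the Equity Amount column (keep only values above 0, replace a range with its midpoint, then average) can be sketched as below. This is an illustration under assumptions: the function name and the example cell values are hypothetical, and the actual sheet may have been normalized by hand or with spreadsheet formulas.&lt;br /&gt;

```python
import re
from statistics import mean


def normalize_equity(raw):
    """Return the equity percentage as a float, or None if blank or zero.

    A range such as '5-7%' is replaced by its midpoint (6.0).
    """
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", raw or "")]
    if not numbers:
        return None
    value = mean(numbers[:2])
    # A value of 0.0 is falsy, so zero-equity cells are dropped here.
    return value or None


# Hypothetical cell values, for illustration only.
cells = ["6%", "5-7%", "0%", "", "4%-10%"]
kept = [v for v in map(normalize_equity, cells) if v is not None]
print(mean(kept))
```

Cells with free text such as an "up to" amount would need separate handling; this sketch simply reads the first one or two numbers it finds.&lt;br /&gt;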
&lt;br /&gt;
===Matching Accelerators to UUIDs: COMPLETE===&lt;br /&gt;
&lt;br /&gt;
We've also added UUIDs for 163 of our 166 accelerators. The UUIDs can be found in Column AE of the '''Accelerators Final''' sheet.&lt;br /&gt;
&lt;br /&gt;
The file with accelerators matched to Crunchbase UUIDs can be found at:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Accelerators and UUIDs.xlsx&lt;br /&gt;
&lt;br /&gt;
This is the master file and should never be modified unless we find that a UUID has changed. ALL OTHER SHEETS with UUIDs are linked to this sheet, so its changes will be reflected elsewhere.&lt;br /&gt;
&lt;br /&gt;
More information can be found on the Crunchbase Data page.&lt;br /&gt;
&lt;br /&gt;
===Linking Accelerators to Founders/LinkedIn Crawling: COMPLETE===&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] got the [[LinkedIn Crawler (Python)]] to work, which means we currently have the following information about accelerator founders:&lt;br /&gt;
*Current Job Title&lt;br /&gt;
*Location&lt;br /&gt;
*Employer&lt;br /&gt;
*Job(s) Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location of jobs&lt;br /&gt;
*Extra Description&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
This information can be found in the various '''Founders''' sheets in '''The File to Rule Them All'''.&lt;br /&gt;
&lt;br /&gt;
==An Overview==&lt;br /&gt;
This project will be used to determine which accelerators are the most effective at churning out successful startups, as well as what characteristics are exhibited by these accelerators. First, we need to gather as much data as we can about as many accelerators as we can in order to look at factors that differentiate successful vs. unsuccessful ventures. Next, we need to create a web crawling program which will gather information about accelerators across the world by accessing their websites and extracting information. I believe that our overall goal with this research project is to gain insight into the methods of successful accelerators, as well as to find out what exactly differentiates very successful accelerators from dead accelerators.&lt;br /&gt;
&lt;br /&gt;
Helpful Links: http://seedrankings.com/&lt;br /&gt;
&lt;br /&gt;
This project is developing broad and near-population data on accelerators and their cohort companies. The objective is to identify which cohorts of which accelerators a cohort company was trained in, obtain details of the accelerators, and obtain details of the cohort companies, including information about any venture capital investment that the cohort company might have received and any IPO or acquisition the company may have experienced.&lt;br /&gt;
&lt;br /&gt;
The primary use of this data is for an academic paper detailed on the [[Matching Entrepreneurs to Accelerators and VCs (Academic Paper)]] page. &lt;br /&gt;
&lt;br /&gt;
However, this project can also provide useful data to other academic papers ([[Urban Start-up Agglomeration]], [[Hubs (Academic Paper)]], and [[Hubs Scorecard (Academic Paper)]]), projects ([[Houston Entrepreneurship]]) and blog posts (under the [[Emerging Ecosystems]] umbrella project).&lt;br /&gt;
&lt;br /&gt;
(OUTDATED) The most recent update provided on [[Accelerator Seed List (Data)]] was on 05/21/2018. This update included the most recent '''master file''' of accelerator data, found at &lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx&lt;br /&gt;
&lt;br /&gt;
(OUTDATED) The Google Sheets Master Sheet is found here&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1ikuxYwp9JIRrjz4qQcbdwTpbHOne-q2PterYTjzofjw/edit?ts=5aa2f1f9#gid=0&lt;br /&gt;
&lt;br /&gt;
==Remaining To Dos==&lt;br /&gt;
&lt;br /&gt;
The last update on [[Accelerator Seed List (Data)]] said the following needed to be done: &lt;br /&gt;
&lt;br /&gt;
*Cross-reference sheet with data from Peter's old accelerator consolidation file (&amp;quot;accelerator_data_noflag&amp;quot; and &amp;quot;accelerator_data&amp;quot; in &amp;quot;All Relevant Files&amp;quot;) and fill in missing data&lt;br /&gt;
*Variables that are 100% NOT in these 2 files:&lt;br /&gt;
**Cohort Breakout?&lt;br /&gt;
**Subtype&lt;br /&gt;
**Designed for Students?&lt;br /&gt;
**Campuses&lt;br /&gt;
**Stage&lt;br /&gt;
**Software Tech&lt;br /&gt;
**What stage do they look for?&lt;br /&gt;
&lt;br /&gt;
TODO:&lt;br /&gt;
 McNair/Projects/Accelerators/Fall 2017/unfound_founders.txt&lt;br /&gt;
A 0 means we don't have founder data for that accelerator.&lt;br /&gt;
Specs: A tab delimited text file with the following fields:&lt;br /&gt;
 Accelerator   First Name   Last Name   LinkedInURL(if possible)&lt;br /&gt;
Including the LinkedInURL will ensure accuracy, but it is not required.&lt;br /&gt;
&lt;br /&gt;
*Shrey: Find &amp;quot;demo day&amp;quot; keywords, so that we can search AcceleratorName Year Keyword and get back potential demo day pages&lt;br /&gt;
&lt;br /&gt;
It is unclear if any of these tasks have been done since the update on 05/21. I will begin by seeing which of these things have been carried out.&lt;br /&gt;
&lt;br /&gt;
==Other Listed To Dos==&lt;br /&gt;
&lt;br /&gt;
*We have compiled a very long list of accelerators from many different databases. For the past couple of weeks, everyone in the center has been going through this list, 20 at a time, classifying each one as an accelerator or not an accelerator, and then proceeding to gather data on the accelerator using the process outlined below. This process went very smoothly. We have successfully gone through about 80% of the list. We are still missing information on the last hundred or so names. All of the collected data is located on the RDP, within the &amp;quot;Accelerators&amp;quot; folder under &amp;quot;Data&amp;quot; or on the [https://docs.google.com/spreadsheets/d/1ikuxYwp9JIRrjz4qQcbdwTpbHOne-q2PterYTjzofjw/edit?ts=5aa2f1f9#gid=1132417337 &amp;quot;Accelerator Master Variable List&amp;quot; Google sheet].&lt;br /&gt;
*We have listed all of the startups from the accelerators that have break out cohorts on their website on the [https://docs.google.com/spreadsheets/d/1ikuxYwp9JIRrjz4qQcbdwTpbHOne-q2PterYTjzofjw/edit?ts=5aa2f1f9#gid=1132417337 &amp;quot;Accelerator Master Variable List&amp;quot; Google sheet]. This contains the following information in the &amp;quot;Cohort List (new)&amp;quot; sheet: accelerator name, year, cohort name, company name, description, founders, category/sector, and location. &lt;br /&gt;
*Next steps include going through the demo day pages that have been downloaded and writing notes on the different types if possible (see [[Demo Day Page Google Classifier]]).&lt;br /&gt;
&lt;br /&gt;
==Moving Forward==&lt;br /&gt;
&lt;br /&gt;
'''Acquiring the necessary data to complete the Accelerator Master Variable List and the Cohort List will require the following (not necessarily in this order):'''&lt;br /&gt;
&lt;br /&gt;
===Step Zero: Connect to Crunchbase and Link Data - COMPLETE===&lt;br /&gt;
&lt;br /&gt;
[[Crunchbase Data]]&lt;br /&gt;
&lt;br /&gt;
===Step One: LinkedIn Founders Data===&lt;br /&gt;
&lt;br /&gt;
This project will begin by working with [[Grace Tan]] and [[Maxine Tao]] to connect accelerators to their founders and cohort companies using Crunchbase and LinkedIn crawlers. Grace and Maxine will go through [http://crunchbase.com Crunchbase] and find the UUID for companies and their founders (reference [[Crunchbase Data]], [[Crunchbase Accelerator Founders]], [[Crunchbase Accelerator Equity]]). We will then connect them using SQL and feed the names of founders into our LinkedIn crawler (headed by [[Grace Tan]]).&lt;br /&gt;
&lt;br /&gt;
The list of founders for accelerators can be found at&lt;br /&gt;
 McNair/Projects/Accelerators/Fall 2017/founders_linkedin.txt&lt;br /&gt;
&lt;br /&gt;
The '''Unfound Founders''' file codes a 0 for all companies '''''not listed''''' within the LinkedIn Founders file, and a 1 for those that do have founders listed.&lt;br /&gt;
&lt;br /&gt;
'''Given the founders' names, we will then be able to use the [[LinkedIn Crawler (Python)]] to find the relevant details of an accelerator founder (education, work experience, etc.). This data on founders will help us answer the horse, jockey, racetrack question of which variables affect a startup's success (the accelerator, the founders, the environment/city).'''&lt;br /&gt;
&lt;br /&gt;
===Step Two: Linking Accelerators to Cohorts Using Investments on Crunchbase===&lt;br /&gt;
&lt;br /&gt;
In this step we focus on accelerators that take equity from the companies that participate in their program. We do this to avoid looking at accelerators that may also run funds/invest in various companies but do not take equity. Including them would give us misleading results and lead us to believe some companies are in cohorts at accelerators when they really are not. &lt;br /&gt;
&lt;br /&gt;
Maxine will acquire the list of accelerators who take equity from companies from the following sheet:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/All Relevant Files/accelerator_data_noflag.txt&lt;br /&gt;
&lt;br /&gt;
Looking at the file, however, shows that very few are actually categorized well and the equity variable is messy. Moving forward, we need to check/refine/fix this classification.&lt;br /&gt;
&lt;br /&gt;
This file has 266 rows. The most recent, actual version of our accelerator database (found at E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx under the sheet Master Variable List) only has 167 rows, meaning the accelerator_data_noflag.txt file has too many rows. &lt;br /&gt;
&lt;br /&gt;
We will need to do a left join of Accelerator Master List with accelerator_data_noflag.txt to get rid of the accelerator names that are in accelerator_data_noflag but NOT in Accelerator Master List.&lt;br /&gt;
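&lt;br /&gt;
The left join described above can be sketched with Python's built-in sqlite3 module. The table names, column names, and rows here are hypothetical stand-ins for Accelerator Master List and accelerator_data_noflag.txt; the real files would first need to be loaded into tables.&lt;br /&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical tables standing in for the two files.
conn.executescript("""
    CREATE TABLE master (name TEXT PRIMARY KEY);
    CREATE TABLE noflag (name TEXT, equity TEXT);
    INSERT INTO master VALUES ('Accelerator A'), ('Accelerator B');
    INSERT INTO noflag VALUES ('Accelerator A', 'Y'), ('Accelerator C', 'N');
""")

# Every master row is kept; noflag rows with no master match are dropped.
rows = conn.execute("""
    SELECT m.name, n.equity
    FROM master AS m
    LEFT JOIN noflag AS n ON n.name = m.name
    ORDER BY m.name
""").fetchall()
print(rows)
```

Accelerators that appear only in the master list come back with an empty equity value, which marks the ones whose classification still has to be researched.&lt;br /&gt;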
&lt;br /&gt;
Once this is finished, we should have an “Equity” classification variable for every accelerator in Accelerator Master List. The accelerators coded Y (or possibly 1) are the ones that take equity. These are the accelerators we will be able to use in the Crunchbase work to see when accelerators take equity.&lt;br /&gt;
&lt;br /&gt;
We then look at the accelerators' investments (or at companies and the entities that invested in them) and cross-reference the list of companies/accelerators. Once we find a match, we know that a company went through an accelerator and in which year it went through a cohort.&lt;br /&gt;
&lt;br /&gt;
'''From this, we get the following data:'''&lt;br /&gt;
*Accelerator a given company went through&lt;br /&gt;
*Year said company went through a cohort/Specific cohort company went through&lt;br /&gt;
&lt;br /&gt;
===Step Three: Demo Day Crawler===&lt;br /&gt;
&lt;br /&gt;
This part of the project relies on the contributions of the wonderful [[Minh Le]]. Better documentation for the project can be found on the [[Demo Day Page Parser]], [[Demo Day Page Google Classifier]], and [[Accelerator Demo Day]] project pages.&lt;br /&gt;
&lt;br /&gt;
Essentially, this part of the accelerator data project will use the [[Demo Day Page Parser]] to look through accelerator websites for pages that list a cohort's 'Demo Day', the day on which accelerators present their companies to a group of special investors ([https://www.ycombinator.com/demoday/faq/ here's an example FAQ page from Y Combinator]). The [[Demo Day Page Google Classifier]] will then determine whether the page is, in fact, a demo day page. &lt;br /&gt;
&lt;br /&gt;
'''Given a cohort's demo day, we can gather a few pieces of key information (check with Ed to make sure this is the correct information to gather from Demo Days):'''&lt;br /&gt;
*The date a cohort began/the season the cohort went through the accelerator&lt;br /&gt;
**This is acquired by looking at the cohort's demo day date, and subtracting the number of weeks/months of a cohort for that given accelerator. The length of a cohort can be found in the file:&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx&lt;br /&gt;
*I assume we can also acquire the companies in a specific cohort, as in we will have the list of all companies in the (for example) Spring 2018 Cohort of Techstars.&lt;br /&gt;
&lt;br /&gt;
===Step Four: Non-profit Finder===&lt;br /&gt;
&lt;br /&gt;
More at [[Non-profit Finder]]&lt;br /&gt;
&lt;br /&gt;
Another important step in this project is finding which accelerators are non-profits.&lt;br /&gt;
&lt;br /&gt;
A comprehensive list of nonprofits taken from the [https://www.irs.gov/charities-non-profits/tax-exempt-organization-search-bulk-data-downloads IRS] can be found here:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Nonprofits in US.xlsx&lt;br /&gt;
Warning: this file has 1 million rows&lt;br /&gt;
&lt;br /&gt;
This file should be cross-referenced with the list of accelerators to find which ones are listed--those accelerators are non-profits. We'll need someone good in SQL to do this with a match.&lt;br /&gt;
&lt;br /&gt;
Potential problem:&lt;br /&gt;
*The names of listed non-profits will likely be somewhat different than the names of accelerators, because companies often file for tax exemption with different names than they show the public.&lt;br /&gt;
&lt;br /&gt;
==Workflow Image==&lt;br /&gt;
&lt;br /&gt;
I imagine the project will look something like this, in that it will require the following information to complete. (The color coding in this image is a very rough, preliminary estimate and is subject to change.)&lt;br /&gt;
&lt;br /&gt;
[[File:image1.png]]&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23717</id>
		<title>U.S. Seed Accelerators</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23717"/>
		<updated>2018-07-23T17:12:38Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: /* The Equity Variables: COMPLETE */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=U.S. Seed Accelerators&lt;br /&gt;
|Has owner=Connor Rothschild,&lt;br /&gt;
|Has start date=06/18/2018&lt;br /&gt;
|Has keywords=accelerators, data&lt;br /&gt;
|Has notes=Continuation of [[Accelerator Data]]&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Is dependent on=Industry Classifier, Demo Day Page Parser, Accelerator Demo Day, Demo Day Page Google Classifier, Crunchbase Accelerator Equity, Crunchbase Accelerator Founders, Crunchbase Data,&lt;br /&gt;
|Has image=Seed-Accelerator.jpg&lt;br /&gt;
|Does subsume=Accelerator Data, Accelerator Seed List (Data),&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Project Location==&lt;br /&gt;
The master file can be found at &lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
==Relevant Former Projects==&lt;br /&gt;
This page serves as an updated and tidied version of the data and work presented on the [[Accelerator Seed List (Data)]] Project, which subsumed [[Accelerator Data]].&lt;br /&gt;
Both of these projects (and as a corollary, this project) are dependent on the [[Demo Day Page Parser]], [[Industry Classifier]], and the [[Whois Parser]].&lt;br /&gt;
&lt;br /&gt;
==7/9/18 Update==&lt;br /&gt;
&lt;br /&gt;
Here's a project update on the work that has been done since coming to McNair.&lt;br /&gt;
&lt;br /&gt;
===The Equity Variables: COMPLETE===&lt;br /&gt;
&lt;br /&gt;
[[Maxine Tao]] and I have added six new variables to the '''Accelerator Master Variable List - Revised by Ed V2''' file. Those variables are:&lt;br /&gt;
&lt;br /&gt;
*Terms of joining - terms of joining accelerator and important details about program&lt;br /&gt;
*Equity? (1/0) - cells contain a 1 if the accelerator takes equity, a 0 if the accelerator definitively does not, and are blank if we could not find that information&lt;br /&gt;
*Equity Amount - the % of equity the accelerator will take (can sometimes be a range, e.g. 5-7%)&lt;br /&gt;
*Equity Amount Normalized - this copies the previous column but only keeps %&amp;gt;0, and if a range was given (e.g. 5-7%) it returns the average (e.g. 6%)&lt;br /&gt;
*Investment Amount - the $ the accelerator initially invests in a company, if relevant (could also be a range or an &amp;quot;up to $######&amp;quot;)&lt;br /&gt;
*Investment Notes - anything to comment on previous 4 columns&lt;br /&gt;
&lt;br /&gt;
These six variables tell us more about the characteristics of accelerators; specifically, which ones take equity and which ones do not, and how much equity accelerators take.&lt;br /&gt;
&lt;br /&gt;
Relevant information: &lt;br /&gt;
*82 accelerators take equity, 42 do not, and we lack information for 37.&lt;br /&gt;
*The average % of equity among accelerators that take equity is 6.49% (rough estimate--do not use for anything official). We got this number by looking only at accelerators that take equity, replacing each reported range with its midpoint (e.g. 4%-10% equity would be coded as 7% equity), and taking the mean.&lt;br /&gt;
&lt;br /&gt;
===Matching Accelerators to UUIDs: COMPLETE===&lt;br /&gt;
&lt;br /&gt;
The file with accelerators matched to Crunchbase UUIDs can be found at:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Accelerators and UUIDs.xlsx&lt;br /&gt;
&lt;br /&gt;
This is the master file and should never be modified unless we find that a UUID has changed. ALL OTHER SHEETS with UUIDs are linked to this sheet, so its changes will be reflected elsewhere.&lt;br /&gt;
&lt;br /&gt;
More information can be found on the Crunchbase Data page.&lt;br /&gt;
&lt;br /&gt;
===Linking Accelerators to Founders/LinkedIn Crawling: COMPLETE===&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] got the [[LinkedIn Crawler (Python)]] to work, which means we currently have the following information about accelerator founders:&lt;br /&gt;
*Current Job Title&lt;br /&gt;
*Location&lt;br /&gt;
*Employer&lt;br /&gt;
*Job(s) Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location of jobs&lt;br /&gt;
*Extra Description&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
==An Overview==&lt;br /&gt;
This project will be used to determine which accelerators are the most effective at churning out successful startups, as well as what characteristics are exhibited by these accelerators. First, we need to gather as much data as we can about as many accelerators as we can in order to look at factors that differentiate successful vs. unsuccessful ventures. Next, we need to create a web crawling program which will gather information about accelerators across the world by accessing their websites and extracting information. I believe that our overall goal with this research project is to gain insight into the methods of successful accelerators, as well as to find out what exactly differentiates very successful accelerators from dead accelerators.&lt;br /&gt;
&lt;br /&gt;
Helpful Links: http://seedrankings.com/&lt;br /&gt;
&lt;br /&gt;
This project is developing broad and near-population data on accelerators and their cohort companies. The objective is to identify which cohorts of which accelerators a cohort company was trained in, obtain details of the accelerators, and obtain details of the cohort companies, including information about any venture capital investment that the cohort company might have received and any IPO or acquisition the company may have experienced.&lt;br /&gt;
&lt;br /&gt;
The primary use of this data is for an academic paper detailed on the [[Matching Entrepreneurs to Accelerators and VCs (Academic Paper)]] page. &lt;br /&gt;
&lt;br /&gt;
However, this project can also provide useful data to other academic papers ([[Urban Start-up Agglomeration]], [[Hubs (Academic Paper)]], and [[Hubs Scorecard (Academic Paper)]]), projects ([[Houston Entrepreneurship]]) and blog posts (under the [[Emerging Ecosystems]] umbrella project).&lt;br /&gt;
&lt;br /&gt;
(OUTDATED) The most recent update provided on [[Accelerator Seed List (Data)]] was on 05/21/2018. This update included the most recent '''master file''' of accelerator data, found at &lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx&lt;br /&gt;
&lt;br /&gt;
(OUTDATED) The Google Sheets Master Sheet is found here&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1ikuxYwp9JIRrjz4qQcbdwTpbHOne-q2PterYTjzofjw/edit?ts=5aa2f1f9#gid=0&lt;br /&gt;
&lt;br /&gt;
==Remaining To Dos==&lt;br /&gt;
&lt;br /&gt;
The last update on [[Accelerator Seed List (Data)]] said the following needed to be done: &lt;br /&gt;
&lt;br /&gt;
*Cross-reference sheet with data from Peter's old accelerator consolidation file (&amp;quot;accelerator_data_noflag&amp;quot; and &amp;quot;accelerator_data&amp;quot; in &amp;quot;All Relevant Files&amp;quot;) and fill in missing data&lt;br /&gt;
*Variables that are 100% NOT in these 2 files:&lt;br /&gt;
**Cohort Breakout?&lt;br /&gt;
**Subtype&lt;br /&gt;
**Designed for Students?&lt;br /&gt;
**Campuses&lt;br /&gt;
**Stage&lt;br /&gt;
**Software Tech&lt;br /&gt;
**What stage do they look for?&lt;br /&gt;
&lt;br /&gt;
TODO:&lt;br /&gt;
 McNair/Projects/Accelerators/Fall 2017/unfound_founders.txt&lt;br /&gt;
A 0 means we don't have founder data for that accelerator.&lt;br /&gt;
Specs: A tab delimited text file with the following fields:&lt;br /&gt;
 Accelerator   First Name   Last Name   LinkedInURL(if possible)&lt;br /&gt;
Including the LinkedInURL will ensure accuracy, but it is not required.&lt;br /&gt;
&lt;br /&gt;
*Shrey: Find &amp;quot;demo day&amp;quot; keywords, so that we can search AcceleratorName Year Keyword and get back potential demo day pages&lt;br /&gt;
&lt;br /&gt;
It is unclear if any of these tasks have been done since the update on 05/21. I will begin by seeing which of these things have been carried out.&lt;br /&gt;
&lt;br /&gt;
==Other Listed To Dos==&lt;br /&gt;
&lt;br /&gt;
*We have compiled a very long list of accelerators from many different databases. For the past couple of weeks, everyone in the center has been going through this list, 20 at a time, classifying each one as an accelerator or not an accelerator, and then proceeding to gather data on the accelerator using the process outlined below. This process went very smoothly. We have successfully gone through about 80% of the list. We are still missing information on the last hundred or so names. All of the collected data is located on the RDP, within the &amp;quot;Accelerators&amp;quot; folder under &amp;quot;Data&amp;quot; or on the [https://docs.google.com/spreadsheets/d/1ikuxYwp9JIRrjz4qQcbdwTpbHOne-q2PterYTjzofjw/edit?ts=5aa2f1f9#gid=1132417337 &amp;quot;Accelerator Master Variable List&amp;quot; Google sheet].&lt;br /&gt;
*We have listed all of the startups from the accelerators that have break out cohorts on their website on the [https://docs.google.com/spreadsheets/d/1ikuxYwp9JIRrjz4qQcbdwTpbHOne-q2PterYTjzofjw/edit?ts=5aa2f1f9#gid=1132417337 &amp;quot;Accelerator Master Variable List&amp;quot; Google sheet]. This contains the following information in the &amp;quot;Cohort List (new)&amp;quot; sheet: accelerator name, year, cohort name, company name, description, founders, category/sector, and location. &lt;br /&gt;
*Next steps include going through the demo day pages that have been downloaded and writing notes on the different types if possible (see [[Demo Day Page Google Classifier]]).&lt;br /&gt;
&lt;br /&gt;
==Moving Forward==&lt;br /&gt;
&lt;br /&gt;
'''Acquiring the necessary data to complete the Accelerator Master Variable List and the Cohort List will require the following (not necessarily in this order):'''&lt;br /&gt;
&lt;br /&gt;
===Step Zero: Connect to Crunchbase and Link Data - COMPLETE===&lt;br /&gt;
&lt;br /&gt;
[[Crunchbase Data]]&lt;br /&gt;
&lt;br /&gt;
===Step One: LinkedIn Founders Data===&lt;br /&gt;
&lt;br /&gt;
This project will begin by working with [[Grace Tan]] and [[Maxine Tao]] to connect accelerators to their founders and cohort companies using Crunchbase and LinkedIn crawlers. Grace and Maxine will go through [http://crunchbase.com Crunchbase] and find the UUID for companies and their founders (reference [[Crunchbase Data]], [[Crunchbase Accelerator Founders]], [[Crunchbase Accelerator Equity]]). We will then connect them using SQL and feed the names of founders into our LinkedIn crawler (headed by [[Grace Tan]]).&lt;br /&gt;
&lt;br /&gt;
The list of founders for accelerators can be found at&lt;br /&gt;
 McNair/Projects/Accelerators/Fall 2017/founders_linkedin.txt&lt;br /&gt;
&lt;br /&gt;
The '''Unfound Founders''' file codes a 0 for all companies '''''not listed''''' within the LinkedIn Founders file, and a 1 for those that do have founders listed.&lt;br /&gt;
&lt;br /&gt;
'''Given the founders' names, we will then be able to use the [[LinkedIn Crawler (Python)]] to find the relevant details of an accelerator founder (education, work experience, etc.). This data on founders will help us answer the horse, jockey, racetrack question of which variables affect a startup's success (the accelerator, the founders, the environment/city).'''&lt;br /&gt;
&lt;br /&gt;
===Step Two: Linking Accelerators to Cohorts Using Investments on Crunchbase===&lt;br /&gt;
&lt;br /&gt;
In this step we focus on accelerators that take equity from the companies that participate in their program. We do this to avoid looking at accelerators that may also run funds/invest in various companies but do not take equity. Including them would give us misleading results and lead us to believe some companies are in cohorts at accelerators when they really are not. &lt;br /&gt;
&lt;br /&gt;
Maxine will acquire the list of accelerators who take equity from companies from the following sheet:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/All Relevant Files/accelerator_data_noflag.txt&lt;br /&gt;
&lt;br /&gt;
Looking at the file, however, shows that very few are actually categorized well and the equity variable is messy. Moving forward, we need to check/refine/fix this classification.&lt;br /&gt;
&lt;br /&gt;
This file has 266 rows. The most recent, actual version of our accelerator database (found at E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx under the sheet Master Variable List) only has 167 rows, meaning the accelerator_data_noflag.txt file has too many rows. &lt;br /&gt;
&lt;br /&gt;
We will need to do a left join of Accelerator Master List with accelerator_data_noflag.txt to get rid of the accelerator names that are in accelerator_data_noflag but NOT in Accelerator Master List.&lt;br /&gt;
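&lt;br /&gt;
A minimal Python sketch of the filtering step above, assuming both files have been read into memory and that accelerator names match once whitespace is trimmed and case is ignored. The function name, the sample rows, and the Y/N equity codes are illustrative assumptions, not taken from the actual files.&lt;br /&gt;
&lt;br /&gt;
```python
def left_join_equity(master_names, noflag_rows):
    # Index the older file by a trimmed, lowercased accelerator name.
    equity_by_name = {name.strip().lower(): equity for name, equity in noflag_rows}
    # A left join keeps every master row; unmatched rows get None for equity.
    return [(name, equity_by_name.get(name.strip().lower())) for name in master_names]


# Hypothetical sample data for illustration only.
master = ['Techstars', 'Y Combinator', 'Example Labs']
noflag = [('techstars', 'Y'), ('Y Combinator ', 'Y'), ('Not In Master', 'N')]
joined = left_join_equity(master, noflag)
```
&lt;br /&gt;
Rows that appear only in the older file drop out, while master-list accelerators with no equity value stay in the result with an empty cell, which makes the gaps easy to spot.&lt;br /&gt;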
&lt;br /&gt;
Once this is finished, we should have an “Equity” classification variable for every accelerator in Accelerator Master List. The accelerators coded Y (or possibly 1) are the ones that take equity. These are the accelerators we will be able to run the Crunchbase work on to see when they take equity.&lt;br /&gt;
&lt;br /&gt;
We then look at the accelerators' investments (or at companies and the entities which invested in them) and cross-reference the list of companies and accelerators. Once we find a match, we know that a company went through an accelerator and in which year it went through a cohort.&lt;br /&gt;
&lt;br /&gt;
'''From this, we get the following data:'''&lt;br /&gt;
*Accelerator a given company went through&lt;br /&gt;
*Year said company went through a cohort/Specific cohort company went through&lt;br /&gt;
&lt;br /&gt;
===Step Three: Demo Day Crawler===&lt;br /&gt;
&lt;br /&gt;
This part of the project relies on the contributions of the wonderful [[Minh Le]]. Better documentation for the project can be found on the [[Demo Day Page Parser]], [[Demo Day Page Google Classifier]], and [[Accelerator Demo Day]] project pages.&lt;br /&gt;
&lt;br /&gt;
Essentially, this part of the accelerator data project will use the [[Demo Day Page Parser]] to look through accelerator websites for pages which list a cohort's 'Demo Day', or the day on which accelerators present their companies to a group of special investors ([https://www.ycombinator.com/demoday/faq/ here's an example FAQ page from Y Combinator]). The [[Demo Day Page Google Classifier]] will then determine if the page is, in fact, a demo day page. &lt;br /&gt;
&lt;br /&gt;
'''Given a cohort's demo day, we can gather a few pieces of key information (check with Ed to make sure this is the correct information to gather from Demo Days):'''&lt;br /&gt;
*The date a cohort began/the season the cohort went through the accelerator&lt;br /&gt;
**This is acquired by looking at the cohort's demo day date, and subtracting the number of weeks/months of a cohort for that given accelerator. The length of a cohort can be found in the file:&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx&lt;br /&gt;
*I assume we can also acquire the companies in a specific cohort, as in we will have the list of all companies in the (for example) Spring 2018 Cohort of Techstars.&lt;br /&gt;
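&lt;br /&gt;
The start-date calculation described in the list above can be sketched in Python as follows. This assumes the cohort length is recorded in weeks; the function names, the month-to-season mapping, and the sample demo day date are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
from datetime import date, timedelta


def cohort_start(demo_day, cohort_weeks):
    # Step back from the demo day by the length of the program.
    return demo_day - timedelta(weeks=cohort_weeks)


def season(day):
    # Assumed convention: Dec-Feb Winter, Mar-May Spring, Jun-Aug Summer, Sep-Nov Fall.
    names = ['Winter', 'Spring', 'Summer', 'Fall']
    return names[(day.month % 12) // 3]


# Hypothetical example: a 12-week program with a demo day on 21 August 2018.
start = cohort_start(date(2018, 8, 21), 12)
label = season(start) + ' ' + str(start.year)
```
&lt;br /&gt;
If an accelerator reports its length in months instead, the number would need converting to weeks (or days) first, which introduces some rounding error in the estimated start date.&lt;br /&gt;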
&lt;br /&gt;
===Step Four: Non-profit Finder===&lt;br /&gt;
&lt;br /&gt;
More at [[Non-profit Finder]]&lt;br /&gt;
&lt;br /&gt;
Another important step in this project is finding which accelerators are non-profits.&lt;br /&gt;
&lt;br /&gt;
A comprehensive list of nonprofits taken from the [https://www.irs.gov/charities-non-profits/tax-exempt-organization-search-bulk-data-downloads IRS] can be found here:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Nonprofits in US.xlsx&lt;br /&gt;
Warning: this file has 1 million rows&lt;br /&gt;
&lt;br /&gt;
This file should be cross-referenced with the list of accelerators to find which ones are listed--those accelerators are non-profits. We'll need someone good in SQL to do this with a match.&lt;br /&gt;
&lt;br /&gt;
Potential problem:&lt;br /&gt;
*The names of listed non-profits will likely be somewhat different than the names of accelerators, because companies often file for tax exemption with different names than they show the public.&lt;br /&gt;
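&lt;br /&gt;
One way to soften the naming problem is to normalize both lists before matching. The Python sketch below is only an illustration: the suffix list and function names are assumptions, and organizations that file under a genuinely different name would still need manual review.&lt;br /&gt;
&lt;br /&gt;
```python
import re

# Assumed set of words to ignore; extend it after inspecting the IRS names.
IGNORED_WORDS = {'inc', 'incorporated', 'llc', 'corp', 'corporation', 'the'}


def normalize(name):
    # Lowercase, replace punctuation with spaces, and drop the ignored words.
    words = re.sub(r'[^a-z0-9 ]', ' ', name.lower()).split()
    return ' '.join(w for w in words if w not in IGNORED_WORDS)


def find_nonprofits(accelerators, irs_names):
    # Build the lookup set once so a very large IRS list is scanned only one time.
    irs_keys = {normalize(n) for n in irs_names}
    return [a for a in accelerators if normalize(a) in irs_keys]
```
&lt;br /&gt;
Building the set of normalized IRS names first keeps the lookup fast even when the IRS file has around a million rows.&lt;br /&gt;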
&lt;br /&gt;
==Workflow Image==&lt;br /&gt;
&lt;br /&gt;
I imagine the project will look something like the image below, which shows the information required to complete it. The color coding is a very rough, preliminary estimate and is subject to change.&lt;br /&gt;
&lt;br /&gt;
[[File:image1.png]]&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23716</id>
		<title>U.S. Seed Accelerators</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23716"/>
		<updated>2018-07-23T17:11:10Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: /* The Equity Variable: COMPLETE */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=U.S. Seed Accelerators&lt;br /&gt;
|Has owner=Connor Rothschild,&lt;br /&gt;
|Has start date=06/18/2018&lt;br /&gt;
|Has keywords=accelerators, data&lt;br /&gt;
|Has notes=Continuation of [[Accelerator Data]]&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Is dependent on=Industry Classifier, Demo Day Page Parser, Accelerator Demo Day, Demo Day Page Google Classifier, Crunchbase Accelerator Equity, Crunchbase Accelerator Founders, Crunchbase Data,&lt;br /&gt;
|Has image=Seed-Accelerator.jpg&lt;br /&gt;
|Does subsume=Accelerator Data, Accelerator Seed List (Data),&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Project Location==&lt;br /&gt;
The master file can be found at &lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
==Relevant Former Projects==&lt;br /&gt;
This page serves as an updated and tidied version of the data and work presented on the [[Accelerator Seed List (Data)]] Project, which subsumed [[Accelerator Data]].&lt;br /&gt;
Both of these projects (and as a corollary, this project) are dependent on the [[Demo Day Page Parser]], [[Industry Classifier]], and the [[Whois Parser]].&lt;br /&gt;
&lt;br /&gt;
==7/9/18 Update==&lt;br /&gt;
&lt;br /&gt;
Here's a project update on the work that has been done since coming to McNair.&lt;br /&gt;
&lt;br /&gt;
===The Equity Variables: COMPLETE===&lt;br /&gt;
&lt;br /&gt;
[[Maxine Tao]] and I have added five new variables to the '''Accelerator Master Variable List - Revised by Ed V2''' file. Those variables are:&lt;br /&gt;
&lt;br /&gt;
*Terms of joining - terms of joining accelerator and important details about program&lt;br /&gt;
*Equity (1/0) - cells contain a 1 if the accelerator takes equity, a 0 if the accelerator definitively does not, and are blank if we could not find that information&lt;br /&gt;
*Equity Amount - the % of equity the accelerator will take (can sometimes be a range, e.g. 5-7%)&lt;br /&gt;
*Investment - the $ the accelerator invests in a company to begin, if relevant (could also be a range or an &amp;quot;up to $######&amp;quot;)&lt;br /&gt;
*Notes - anything to comment on previous 4 columns&lt;br /&gt;
These five variables tell us more about the characteristics of accelerators; specifically, which ones take equity and which ones do not, and how much equity accelerators take.&lt;br /&gt;
&lt;br /&gt;
Relevant information: &lt;br /&gt;
*82 accelerators take equity, 42 do not, and we lack information for 37.&lt;br /&gt;
*The average equity stake among accelerators that take equity is 6.49% (a rough estimate; do not use for anything official). We got this number by looking only at accelerators that take equity, replacing each reported range with its midpoint (e.g. 4%-10% equity is coded as 7%), and taking the mean.&lt;br /&gt;
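&lt;br /&gt;
The averaging rule described above can be sketched in Python. This is an illustration under assumptions: the function names are hypothetical, and the cell formats handled are only the ones mentioned on this page (a single percentage or a range).&lt;br /&gt;
&lt;br /&gt;
```python
import re


def equity_midpoint(text):
    # Pull every number out of the cell, e.g. '4%-10%' gives [4.0, 10.0].
    numbers = [float(n) for n in re.findall(r'\d+(?:\.\d+)?', text)]
    if not numbers:
        return None  # no usable figure in this cell
    # A single value is its own midpoint; a range becomes the average of its ends.
    return (min(numbers) + max(numbers)) / 2


def mean_equity(cells):
    # Average the midpoints, skipping cells with no figure.
    values = [v for v in map(equity_midpoint, cells) if v is not None]
    return sum(values) / len(values) if values else None
```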
&lt;br /&gt;
===Matching Accelerators to UUIDs: COMPLETE===&lt;br /&gt;
&lt;br /&gt;
The file with accelerators matched to Crunchbase UUIDs can be found at:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Accelerators and UUIDs.xlsx&lt;br /&gt;
&lt;br /&gt;
This is the master file and should never be modified unless we find a UUID changed. ALL OTHER SHEETS with UUIDs are linked to this sheet so its changes will be reflected elsewhere.&lt;br /&gt;
&lt;br /&gt;
More information can be found on the Crunchbase Data page.&lt;br /&gt;
&lt;br /&gt;
===Linking Accelerators to Founders/LinkedIn Crawling: COMPLETE===&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] got the [[LinkedIn Crawler (Python)]] to work, which means we currently have the following information about accelerator founders:&lt;br /&gt;
*Current Job Title&lt;br /&gt;
*Location&lt;br /&gt;
*Employer&lt;br /&gt;
*Job(s) Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location of jobs&lt;br /&gt;
*Extra Description&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
==An Overview==&lt;br /&gt;
This project will be used to determine which accelerators are the most effective at churning out successful startups, as well as what characteristics are exhibited by these accelerators. First, we need to gather as much data as we can about as many accelerators as we can in order to look at factors that differentiate successful vs. unsuccessful ventures. Next, we need to create a web crawling program which will gather information about accelerators across the world by accessing their websites and extracting information. I believe that our overall goal with this research project is to gain insight into the methods of successful accelerators, as well as to find out what exactly differentiates very successful accelerators from dead accelerators.&lt;br /&gt;
&lt;br /&gt;
Helpful Links: http://seedrankings.com/&lt;br /&gt;
&lt;br /&gt;
This project is developing broad and near-population data on accelerators and their cohort companies. The objective is to identify which cohorts of which accelerators a cohort company was trained in, obtain details of the accelerators, and obtain details of the cohort companies, including information about any venture capital investment that the cohort company might have received and any IPO or acquisition the company may have experienced.&lt;br /&gt;
&lt;br /&gt;
The primary use of this data is for an academic paper detailed on the [[Matching Entrepreneurs to Accelerators and VCs (Academic Paper)]] page. &lt;br /&gt;
&lt;br /&gt;
However, this project can also provide useful data to other academic papers ([[Urban Start-up Agglomeration]], [[Hubs (Academic Paper)]], and [[Hubs Scorecard (Academic Paper)]]), projects ([[Houston Entrepreneurship]]) and blog posts (under the [[Emerging Ecosystems]] umbrella project).&lt;br /&gt;
&lt;br /&gt;
(OUTDATED) The most recent update provided on [[Accelerator Seed List (Data)]] was on 05/21/2018. This update included the most recent '''master file''' of accelerator data, found at &lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx&lt;br /&gt;
&lt;br /&gt;
(OUTDATED) The Google Sheets Master Sheet is found here&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1ikuxYwp9JIRrjz4qQcbdwTpbHOne-q2PterYTjzofjw/edit?ts=5aa2f1f9#gid=0&lt;br /&gt;
&lt;br /&gt;
==Remaining To Dos==&lt;br /&gt;
&lt;br /&gt;
The last update on [[Accelerator Seed List (Data)]] said the following needed to be done: &lt;br /&gt;
&lt;br /&gt;
*Cross-reference sheet with data from Peter's old accelerator consolidation file (&amp;quot;accelerator_data_noflag&amp;quot; and &amp;quot;accelerator_data&amp;quot; in &amp;quot;All Relevant Files&amp;quot;) and fill in missing data&lt;br /&gt;
*Variables that are 100% NOT in these 2 files:&lt;br /&gt;
**Cohort Breakout?&lt;br /&gt;
**Subtype&lt;br /&gt;
**Designed for Students?&lt;br /&gt;
**Campuses&lt;br /&gt;
**Stage&lt;br /&gt;
**Software Tech&lt;br /&gt;
**What stage do they look for?&lt;br /&gt;
&lt;br /&gt;
TODO:&lt;br /&gt;
 McNair/Projects/Accelerators/Fall 2017/unfound_founders.txt&lt;br /&gt;
A 0 means we don't have founder data for that accelerator.&lt;br /&gt;
Specs: A tab delimited text file with the following fields:&lt;br /&gt;
 Accelerator   First Name   Last Name   LinkedInURL(if possible)&lt;br /&gt;
Getting the LinkedInURL will ensure accuracy, but it is not required.&lt;br /&gt;
&lt;br /&gt;
*Shrey: Find &amp;quot;demo day&amp;quot; keywords, so that we can search AcceleratorName Year Keyword and get back potential demo day pages&lt;br /&gt;
&lt;br /&gt;
It is unclear if any of these tasks have been done since the update on 05/21. I will begin by seeing which of these things have been carried out.&lt;br /&gt;
&lt;br /&gt;
==Other Listed To Dos==&lt;br /&gt;
&lt;br /&gt;
*We have compiled a very long list of accelerators from many different databases. For the past couple of weeks, everyone in the center has been going through this list, 20 at a time, classifying each one as an accelerator or not an accelerator, and then proceeding to gather data on the accelerator using the process outlined below. This process went very smoothly. We have successfully gone through about 80% of the list. We are still missing information on the last hundred or so names. All of the collected data is located on the RDP, within the &amp;quot;Accelerators&amp;quot; folder under &amp;quot;Data&amp;quot; or on the [https://docs.google.com/spreadsheets/d/1ikuxYwp9JIRrjz4qQcbdwTpbHOne-q2PterYTjzofjw/edit?ts=5aa2f1f9#gid=1132417337 &amp;quot;Accelerator Master Variable List&amp;quot; Google sheet].&lt;br /&gt;
*We have listed all of the startups from the accelerators that have break out cohorts on their website on the [https://docs.google.com/spreadsheets/d/1ikuxYwp9JIRrjz4qQcbdwTpbHOne-q2PterYTjzofjw/edit?ts=5aa2f1f9#gid=1132417337 &amp;quot;Accelerator Master Variable List&amp;quot; Google sheet]. This contains the following information in the &amp;quot;Cohort List (new)&amp;quot; sheet: accelerator name, year, cohort name, company name, description, founders, category/sector, and location. &lt;br /&gt;
*Next steps include going through the demo day pages that have been downloaded and writing notes on the different types if possible (see [[Demo Day Page Google Classifier]]).&lt;br /&gt;
&lt;br /&gt;
==Moving Forward==&lt;br /&gt;
&lt;br /&gt;
'''Acquiring the necessary data to complete the Accelerator Master Variable List and the Cohort List will require the following (not necessarily in this order):'''&lt;br /&gt;
&lt;br /&gt;
===Step Zero: Connect to Crunchbase and Link Data - COMPLETE===&lt;br /&gt;
&lt;br /&gt;
[[Crunchbase Data]]&lt;br /&gt;
&lt;br /&gt;
===Step One: LinkedIn Founders Data===&lt;br /&gt;
&lt;br /&gt;
This project will begin by working with [[Grace Tan]] and [[Maxine Tao]] to connect accelerators to their founders and cohort companies using Crunchbase and LinkedIn crawlers. Grace and Maxine will go through [http://crunchbase.com Crunchbase] and find the UUIDs for companies and their founders (see [[Crunchbase Data]], [[Crunchbase Accelerator Founders]], and [[Crunchbase Accelerator Equity]]). We will then connect them using SQL and feed the founders' names into our LinkedIn crawler (headed by [[Grace Tan]]).&lt;br /&gt;
&lt;br /&gt;
The list of founders for accelerators can be found at&lt;br /&gt;
 McNair/Projects/Accelerators/Fall 2017/founders_linkedin.txt&lt;br /&gt;
&lt;br /&gt;
The '''Unfound Founders''' file codes a 0 for all companies '''''not listed''''' within the LinkedIn Founders file, and a 1 for those that do have founders listed.&lt;br /&gt;
&lt;br /&gt;
'''Given the founders' names, we will then be able to use the [[LinkedIn Crawler (Python)]] to find the relevant details of an accelerator founder (education, work experience, etc.). This data on founders will help us address the horse, jockey, racetrack question: which variables affect a startup's success (the accelerator, the founders, or the environment/city).'''&lt;br /&gt;
&lt;br /&gt;
===Step Two: Linking Accelerators to Cohorts Using Investments on Crunchbase===&lt;br /&gt;
&lt;br /&gt;
In this step we focus on accelerators that take equity from the companies in their programs. We do this to avoid looking at accelerators that also run funds or invest in various companies but do not take equity. Including them would give misleading results, leading us to believe some companies were in cohorts at accelerators that they never attended. &lt;br /&gt;
&lt;br /&gt;
Maxine will acquire the list of accelerators who take equity from companies from the following sheet:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/All Relevant Files/accelerator_data_noflag.txt&lt;br /&gt;
&lt;br /&gt;
Looking at the file, however, shows that very few are actually categorized well and the equity variable is messy. Moving forward, we need to check/refine/fix this classification.&lt;br /&gt;
&lt;br /&gt;
This file has 266 rows. The most recent, actual version of our accelerator database (found at E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx under the sheet Master Variable List) only has 167 rows, meaning the accelerator_data_noflag.txt file has too many rows. &lt;br /&gt;
&lt;br /&gt;
We will need to do a left join of Accelerator Master List with accelerator_data_noflag.txt to get rid of the accelerator names that are in accelerator_data_noflag but NOT in Accelerator Master List.&lt;br /&gt;
&lt;br /&gt;
Once this is finished, we should have an “Equity” classification variable for every accelerator in Accelerator Master List. The accelerators coded Y (or possibly 1) are the ones that take equity. These are the accelerators we will be able to run the Crunchbase work on to see when they take equity.&lt;br /&gt;
&lt;br /&gt;
We then look at the accelerators' investments (or at companies and the entities which invested in them) and cross-reference the list of companies and accelerators. Once we find a match, we know that a company went through an accelerator and in which year it went through a cohort.&lt;br /&gt;
&lt;br /&gt;
'''From this, we get the following data:'''&lt;br /&gt;
*Accelerator a given company went through&lt;br /&gt;
*Year said company went through a cohort/Specific cohort company went through&lt;br /&gt;
&lt;br /&gt;
===Step Three: Demo Day Crawler===&lt;br /&gt;
&lt;br /&gt;
This part of the project relies on the contributions of the wonderful [[Minh Le]]. Better documentation for the project can be found on the [[Demo Day Page Parser]], [[Demo Day Page Google Classifier]], and [[Accelerator Demo Day]] project pages.&lt;br /&gt;
&lt;br /&gt;
Essentially, this part of the accelerator data project will use the [[Demo Day Page Parser]] to look through accelerator websites for pages which list a cohort's 'Demo Day', or the day on which accelerators present their companies to a group of special investors ([https://www.ycombinator.com/demoday/faq/ here's an example FAQ page from Y Combinator]). The [[Demo Day Page Google Classifier]] will then determine if the page is, in fact, a demo day page. &lt;br /&gt;
&lt;br /&gt;
'''Given a cohort's demo day, we can gather a few pieces of key information (check with Ed to make sure this is the correct information to gather from Demo Days):'''&lt;br /&gt;
*The date a cohort began/the season the cohort went through the accelerator&lt;br /&gt;
**This is acquired by looking at the cohort's demo day date, and subtracting the number of weeks/months of a cohort for that given accelerator. The length of a cohort can be found in the file:&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx&lt;br /&gt;
*I assume we can also acquire the companies in a specific cohort, as in we will have the list of all companies in the (for example) Spring 2018 Cohort of Techstars.&lt;br /&gt;
&lt;br /&gt;
===Step Four: Non-profit Finder===&lt;br /&gt;
&lt;br /&gt;
More at [[Non-profit Finder]]&lt;br /&gt;
&lt;br /&gt;
Another important step in this project is finding which accelerators are non-profits.&lt;br /&gt;
&lt;br /&gt;
A comprehensive list of nonprofits taken from the [https://www.irs.gov/charities-non-profits/tax-exempt-organization-search-bulk-data-downloads IRS] can be found here:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Nonprofits in US.xlsx&lt;br /&gt;
Warning: this file has 1 million rows&lt;br /&gt;
&lt;br /&gt;
This file should be cross-referenced with the list of accelerators to find which ones are listed--those accelerators are non-profits. We'll need someone good in SQL to do this with a match.&lt;br /&gt;
&lt;br /&gt;
Potential problem:&lt;br /&gt;
*The names of listed non-profits will likely be somewhat different than the names of accelerators, because companies often file for tax exemption with different names than they show the public.&lt;br /&gt;
&lt;br /&gt;
==Workflow Image==&lt;br /&gt;
&lt;br /&gt;
I imagine the project will look something like the image below, which shows the information required to complete it. The color coding is a very rough, preliminary estimate and is subject to change.&lt;br /&gt;
&lt;br /&gt;
[[File:image1.png]]&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Merging_Existing_Data_with_Crunchbase&amp;diff=23672</id>
		<title>Merging Existing Data with Crunchbase</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Merging_Existing_Data_with_Crunchbase&amp;diff=23672"/>
		<updated>2018-07-20T16:29:30Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Merging Existing Data with Crunchbase&lt;br /&gt;
|Has owner=Connor Rothschild&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
This page details the process of merging existing data with data pulled from Crunchbase.&lt;br /&gt;
&lt;br /&gt;
==Project Location==&lt;br /&gt;
&lt;br /&gt;
For the merge detailed in this page, our data was from:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/The File to Rule Them All.xlsx&lt;br /&gt;
&lt;br /&gt;
and Crunchbase info can be found in: &lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Cohort Companies with Crunchbase Info.xlsx&lt;br /&gt;
&lt;br /&gt;
The data from Crunchbase, organized into tables, is in a script found at:&lt;br /&gt;
 /bulk/McNair/Software/Database Scripts/Crunchbase2/LoadTables.sql&lt;br /&gt;
&lt;br /&gt;
Code for building tables (from [[Maxine Tao]]):&lt;br /&gt;
 E:\McNair\Software\Database Scripts\Crunchbase2\CompanyMatchScript.sql&lt;br /&gt;
&lt;br /&gt;
And the database is&lt;br /&gt;
 crunchbase2&lt;br /&gt;
&lt;br /&gt;
==Process==&lt;br /&gt;
&lt;br /&gt;
===Step One: Creating UUID Matches===&lt;br /&gt;
&lt;br /&gt;
We began by making sure our company names were unique, creating a 1-1-1-1 relationship (only one instance of a company name in our data, and in Crunchbase data). We did so using the Matcher: we matched our sheet against itself, and the Crunchbase info (pulled from the organizations table detailed below) against itself, to remove duplicates and leave only unique values. More here: http://mcnair.bakerinstitute.org/wiki/Crunchbase_Data#Collecting_Company_Information&lt;br /&gt;
&lt;br /&gt;
Upon Ed's instruction, we then looked at companies ''in Crunchbase'' which had more than one UUID associated with the company name. Of the 670,000 companies in Crunchbase, only 15,000 had names associated with more than one UUID. From this list of 15,000, we used recursive filtering to determine if any companies could be properly matched to the company in our data by looking at additional variables (such as company location).&lt;br /&gt;
&lt;br /&gt;
Upon refining our list based on recursive filtering, we found 40 companies which match our data, and added UUIDs appropriately.&lt;br /&gt;
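&lt;br /&gt;
The duplicate check and location-based filtering in this step can be sketched in Python. This is not the Matcher or the actual script; the function names and the choice of city as the tie-breaking variable are assumptions made for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict


def names_with_multiple_uuids(rows):
    # rows holds (company_name, uuid) pairs from the organizations table.
    uuids_by_name = defaultdict(set)
    for company_name, uuid in rows:
        uuids_by_name[company_name].add(uuid)
    # Every name has at least one UUID, so a count other than 1 means two or more.
    return {name: sorted(ids) for name, ids in uuids_by_name.items() if len(ids) != 1}


def pick_by_city(candidates, our_city):
    # candidates holds (uuid, city) pairs that share one company name.
    matches = [uuid for uuid, city in candidates if city.lower() == our_city.lower()]
    # Accept the match only when exactly one candidate shares our city.
    return matches[0] if len(matches) == 1 else None
```
&lt;br /&gt;
Names that remain ambiguous after one variable can be passed through the same kind of filter again with another variable, which is one reading of the recursive filtering mentioned above.&lt;br /&gt;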
&lt;br /&gt;
===Step Two: Pulling Data===&lt;br /&gt;
&lt;br /&gt;
The necessary tables for this pull can be found at:&lt;br /&gt;
 /bulk/McNair/Software/Database Scripts/Crunchbase2/'''LoadTables.sql'''&lt;br /&gt;
&lt;br /&gt;
We pulled the relevant data from Crunchbase based on unique UUID matches. In the crunchbase2 database, we used the table ''organizations''.&lt;br /&gt;
&lt;br /&gt;
We pull based on the unique UUIDs found by Maxine, which can be found in the file:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx''', in column W.&lt;br /&gt;
&lt;br /&gt;
The table we get most Crunchbase data from looks like this:&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE organizations;&lt;br /&gt;
 CREATE TABLE organizations (&lt;br /&gt;
  company_name varchar(100),&lt;br /&gt;
  role varchar(255),&lt;br /&gt;
  permalink varchar(255),&lt;br /&gt;
  domain varchar(5000),&lt;br /&gt;
  homepage_url varchar(5000),&lt;br /&gt;
  country_code varchar(10),&lt;br /&gt;
  state_code varchar(2),&lt;br /&gt;
  region varchar(50),&lt;br /&gt;
  city varchar(100),&lt;br /&gt;
  address text,&lt;br /&gt;
  status varchar(50),&lt;br /&gt;
  short_description text,&lt;br /&gt;
  category_list text,&lt;br /&gt;
  category_group_list  text,&lt;br /&gt;
  funding_rounds integer,&lt;br /&gt;
  funding_total_usd money,&lt;br /&gt;
  founded_on date, --yyyy-mm-dd&lt;br /&gt;
  last_funding_on date, --yyyy-mm-dd&lt;br /&gt;
  closed_on date, --yyyy-mm-dd&lt;br /&gt;
  employee_count varchar(255),&lt;br /&gt;
  email varchar(255),&lt;br /&gt;
  phone text,&lt;br /&gt;
  facebook_url varchar(5000),&lt;br /&gt;
  linkedin_url varchar(5000),&lt;br /&gt;
  cb_url varchar(5000),&lt;br /&gt;
  logo_url varchar(5000),&lt;br /&gt;
  twitter_url varchar(5000),&lt;br /&gt;
  alias varchar(10000),&lt;br /&gt;
  uuid varchar(255),&lt;br /&gt;
  created_at date, --yyyy-mm-dd-hh-mm-s.s&lt;br /&gt;
  updated_at date, --yyyy-mm-dd-hh-mm-s.s&lt;br /&gt;
  primary_role varchar(255),&lt;br /&gt;
  type varchar(255)&lt;br /&gt;
 );&lt;br /&gt;
&lt;br /&gt;
'''From this list, we care about the following:&lt;br /&gt;
*company_name,&lt;br /&gt;
*domain,&lt;br /&gt;
*country_code,&lt;br /&gt;
*state_code,&lt;br /&gt;
*city,&lt;br /&gt;
*address,&lt;br /&gt;
*status,&lt;br /&gt;
*short_description,&lt;br /&gt;
*category_list,&lt;br /&gt;
*category_group_list,&lt;br /&gt;
*founded_on,&lt;br /&gt;
*employee_count,&lt;br /&gt;
*linkedin_url,&lt;br /&gt;
*uuid -- our primary key'''&lt;br /&gt;
&lt;br /&gt;
We also want to get more information on organization descriptions. To do so, we can pull ''description'' from the table ''organization_descriptions'', matching based on UUID.&lt;br /&gt;
&lt;br /&gt;
We also, for the purposes of industry classification, want to pull ''category_name'' from the table ''category_groups'', matching based on UUID.&lt;br /&gt;
&lt;br /&gt;
Finally, it may be worthwhile to pull variables such as name, description, and started_on from the ''events'' table, in the hopes of finding Cohort years, or potentially demo days. This can also be matched based on UUID.&lt;br /&gt;
&lt;br /&gt;
Given the aforementioned information, we now have much data that can be used to populate empty cells in our existing data, as well as to create new columns.&lt;br /&gt;
&lt;br /&gt;
===Step Three: Merging===&lt;br /&gt;
&lt;br /&gt;
Of the data we've pulled from Crunchbase, we're interested in merging ''four'' columns with our existing data:&lt;br /&gt;
*domain (to be merged with the empty cells of courl)&lt;br /&gt;
*city, state_code, and country_code (some combination of this is to be merged with the empty cells of colocation)&lt;br /&gt;
*status (to be merged with the empty cells of costatus)&lt;br /&gt;
*short_description and description '''''from the table organization_descriptions''''' (some combination to be merged with empty cells of codescription)&lt;br /&gt;
&lt;br /&gt;
Note: we may also be able to merge some combination of category_list, category_group_list, and category_name (from the category_groups table) with cosector in our data, and use it for [[Maxine Tao]]'s industry classifier.&lt;br /&gt;
&lt;br /&gt;
The other columns can be added to the end of our sheet as supplemental data.&lt;br /&gt;
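&lt;br /&gt;
The fill-only-empty-cells merge described in this step can be sketched in Python. The column names come from this page; the function name, the dictionary-per-row layout, and the choice of short_description alone for codescription are assumptions for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
# Our column on the left, the Crunchbase column that can fill it on the right.
COLUMN_MAP = {'courl': 'domain', 'costatus': 'status', 'codescription': 'short_description'}


def fill_empty(ours, crunchbase, column_map):
    # Work on a copy so the original row is left untouched.
    merged = dict(ours)
    for our_col, cb_col in column_map.items():
        # Only fill cells that are missing or empty; existing values win.
        if not merged.get(our_col):
            merged[our_col] = crunchbase.get(cb_col)
    return merged
```
&lt;br /&gt;
Keeping our existing values and using Crunchbase only as a fallback means hand-collected data is never overwritten by the merge.&lt;br /&gt;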
&lt;br /&gt;
==SQL Scripts, Files, and Databases==&lt;br /&gt;
&lt;br /&gt;
The contents of E:\McNair\Projects\Accelerators\Summer 2018\For Ed Merge July 17.xlsx were copied into CohortCosWcbuuid.txt (in the Accelerators folder, as well as Z:/crunchbase2, Z:/../vcdb2).&lt;br /&gt;
&lt;br /&gt;
The script '''AddCBData.sql''' loads this data into '''crunchbase2'''. It then outputs the relevant Crunchbase data into '''CBCohortData.txt'''.&lt;br /&gt;
&lt;br /&gt;
The script '''LoadAcceleratorDataV2.sql''' (see around line 305) loads both '''CohortCosWcbuuid.txt''' and '''CBCohortData.txt''' into the database '''vcdb2'''. It then produces a CohortCoExtended table, which is output to a file.&lt;br /&gt;
&lt;br /&gt;
Note that '''CohortCoExtended.txt''' includes a variable GotVC, which takes the value 1 if the cohort company got VC and zero otherwise:&lt;br /&gt;
&lt;br /&gt;
  gotvc | count&lt;br /&gt;
 -------+-------&lt;br /&gt;
      0 | 11465&lt;br /&gt;
      1 |  1504&lt;br /&gt;
 (2 rows)&lt;br /&gt;
&lt;br /&gt;
We now need to determine which cohort companies we have timing information for and which we don't - and use demo days to get the info we are missing!&lt;br /&gt;
&lt;br /&gt;
==Getting Timing info for Companies Who Got VC==&lt;br /&gt;
&lt;br /&gt;
Line 136 of &lt;br /&gt;
 E:\McNair\Software\Database Scripts\Crunchbase2\CompanyMatchScript.sql&lt;br /&gt;
&lt;br /&gt;
contains the code to find the companies which received VC but did not have timing info. There are 809 such companies. This table was exported into '''needtiminginfo.txt'''. &lt;br /&gt;
&lt;br /&gt;
A list of distinct accelerators was also created, which was given to [[Minh Le]]&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Merging_Existing_Data_with_Crunchbase&amp;diff=23669</id>
		<title>Merging Existing Data with Crunchbase</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Merging_Existing_Data_with_Crunchbase&amp;diff=23669"/>
		<updated>2018-07-20T16:07:08Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Merging Existing Data with Crunchbase&lt;br /&gt;
|Has owner=Connor Rothschild&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
This page details the process of merging existing data with data pulled from Crunchbase.&lt;br /&gt;
&lt;br /&gt;
==Project Location==&lt;br /&gt;
&lt;br /&gt;
For the merge detailed in this page, our data was from:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/The File to Rule Them All.xlsx&lt;br /&gt;
&lt;br /&gt;
and Crunchbase info can be found in: &lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Cohort Companies with Crunchbase Info.xlsx&lt;br /&gt;
&lt;br /&gt;
The data from Crunchbase, organized into tables, is in a script found at:&lt;br /&gt;
 /bulk/McNair/Software/Database Scripts/Crunchbase2/LoadTables.sql&lt;br /&gt;
&lt;br /&gt;
Code for building tables (from [[Maxine Tao]]):&lt;br /&gt;
 E:\McNair\Software\Database Scripts\Crunchbase2\CompanyMatchScript.sql&lt;br /&gt;
&lt;br /&gt;
And the database is&lt;br /&gt;
 crunchbase2&lt;br /&gt;
&lt;br /&gt;
==Process==&lt;br /&gt;
&lt;br /&gt;
===Step One: Creating UUID Matches===&lt;br /&gt;
&lt;br /&gt;
We began by making sure our company names were unique, creating a 1-1-1-1 relationship (only one instance of a company name in our data, and in Crunchbase data). We did so using the Matcher: we matched our sheet against itself, and the Crunchbase info (pulled from the organizations table detailed below) against itself, to remove duplicates and leave only unique values. More here: http://mcnair.bakerinstitute.org/wiki/Crunchbase_Data#Collecting_Company_Information&lt;br /&gt;
&lt;br /&gt;
Upon Ed's instruction, we then looked at companies ''in Crunchbase'' whose company name had more than one UUID associated with it. Of the 670,000 companies in Crunchbase, only 15,000 had names associated with more than one UUID. From this list of 15,000, we used recursive filtering on additional variables (such as company location) to determine whether any could be properly matched to a company in our data.&lt;br /&gt;
&lt;br /&gt;
After refining our list through recursive filtering, we found 40 companies that matched our data, and added their UUIDs accordingly.&lt;br /&gt;
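&lt;br /&gt;
The filtering described above could be sketched roughly as follows. This is only one possible reading of "recursive filtering", not the script that was actually run: the function names, the dictionary-based rows, and the order in which location fields are applied are all assumptions made for illustration.&lt;br /&gt;

```python
from collections import defaultdict


def names_with_multiple_uuids(cb_rows):
    """Return the set of company names that map to more than one UUID."""
    uuids_by_name = defaultdict(set)
    for row in cb_rows:
        uuids_by_name[row["company_name"]].add(row["uuid"])
    return {name for name, uuids in uuids_by_name.items() if len(uuids) != 1}


def disambiguate(our_company, candidates,
                 fields=("country_code", "state_code", "city")):
    """Narrow same-name candidates one field at a time (hypothetical order).

    Returns the UUID if exactly one candidate survives, otherwise None.
    """
    remaining = list(candidates)
    for field in fields:
        if len(remaining) == 1:
            break
        narrowed = [r for r in remaining if r[field] == our_company.get(field)]
        # Only accept the filter if it leaves something; a blank field in
        # our data should not eliminate every candidate.
        if narrowed:
            remaining = narrowed
    return remaining[0]["uuid"] if len(remaining) == 1 else None
```

A company that cannot be narrowed to a single candidate returns None and would still need to be checked by hand.&lt;br /&gt;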
&lt;br /&gt;
===Step Two: Pulling Data===&lt;br /&gt;
&lt;br /&gt;
The necessary tables for this pull can be found at:&lt;br /&gt;
 /bulk/McNair/Software/Database Scripts/Crunchbase2/'''LoadTables.sql'''&lt;br /&gt;
&lt;br /&gt;
We pulled the relevant data from Crunchbase based on unique UUID matches, using the ''organizations'' table in the crunchbase2 database.&lt;br /&gt;
&lt;br /&gt;
The pull was keyed on the unique UUIDs found by Maxine, which can be found in the file:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx''', in column W.&lt;br /&gt;
&lt;br /&gt;
The table we get most Crunchbase data from looks like this:&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE organizations;&lt;br /&gt;
 CREATE TABLE organizations (&lt;br /&gt;
  company_name varchar(100),&lt;br /&gt;
  role varchar(255),&lt;br /&gt;
  permalink varchar(255),&lt;br /&gt;
  domain varchar(5000),&lt;br /&gt;
  homepage_url varchar(5000),&lt;br /&gt;
  country_code varchar(10),&lt;br /&gt;
  state_code varchar(2),&lt;br /&gt;
  region varchar(50),&lt;br /&gt;
  city varchar(100),&lt;br /&gt;
  address text,&lt;br /&gt;
  status varchar(50),&lt;br /&gt;
  short_description text,&lt;br /&gt;
  category_list text,&lt;br /&gt;
  category_group_list  text,&lt;br /&gt;
  funding_rounds integer,&lt;br /&gt;
  funding_total_usd money,&lt;br /&gt;
  founded_on date, --yyyy-mm-dd&lt;br /&gt;
  last_funding_on date, --yyyy-mm-dd&lt;br /&gt;
  closed_on date, --yyyy-mm-dd&lt;br /&gt;
  employee_count varchar(255),&lt;br /&gt;
  email varchar(255),&lt;br /&gt;
  phone text,&lt;br /&gt;
  facebook_url varchar(5000),&lt;br /&gt;
  linkedin_url varchar(5000),&lt;br /&gt;
  cb_url varchar(5000),&lt;br /&gt;
  logo_url varchar(5000),&lt;br /&gt;
  twitter_url varchar(5000),&lt;br /&gt;
  alias varchar(10000),&lt;br /&gt;
  uuid varchar(255),&lt;br /&gt;
  created_at date, --yyyy-mm-dd-hh-mm-s.s&lt;br /&gt;
  updated_at date, --yyyy-mm-dd-hh-mm-s.s&lt;br /&gt;
  primary_role varchar(255),&lt;br /&gt;
  type varchar(255)&lt;br /&gt;
 );&lt;br /&gt;
&lt;br /&gt;
'''From this list, we care about the following:'''&lt;br /&gt;
*company_name&lt;br /&gt;
*domain&lt;br /&gt;
*country_code&lt;br /&gt;
*state_code&lt;br /&gt;
*city&lt;br /&gt;
*address&lt;br /&gt;
*status&lt;br /&gt;
*short_description&lt;br /&gt;
*category_list&lt;br /&gt;
*category_group_list&lt;br /&gt;
*founded_on&lt;br /&gt;
*employee_count&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*uuid (our primary key)&lt;br /&gt;
&lt;br /&gt;
We also want to get more information on organization descriptions. To do so, we can pull ''description'' from the table ''organization_descriptions'', matching based on UUID.&lt;br /&gt;
&lt;br /&gt;
We also, for the purposes of industry classification, want to pull ''category_name'' from the table ''category_groups'', matching based on UUID.&lt;br /&gt;
&lt;br /&gt;
Finally, it may be worthwhile to pull variables such as name, description, and started_on from the ''events'' table, in the hopes of finding Cohort years, or potentially demo days. This can also be matched based on UUID.&lt;br /&gt;
&lt;br /&gt;
Given the aforementioned information, we now have much data that can be used to populate empty cells in our existing data, as well as to create new columns.&lt;br /&gt;
&lt;br /&gt;
===Step Three: Merging===&lt;br /&gt;
&lt;br /&gt;
Of the data we've pulled from Crunchbase, we're interested in merging ''four'' sets of columns with our existing data:&lt;br /&gt;
*domain (to be merged with the empty cells of courl)&lt;br /&gt;
*city, state_code, and country_code (some combination of this is to be merged with the empty cells of colocation)&lt;br /&gt;
*status (to be merged with the empty cells of costatus)&lt;br /&gt;
*short_description and description '''''from the table organization_descriptions''''' (some combination to be merged with empty cells of codescription)&lt;br /&gt;
&lt;br /&gt;
Note: we may also be able to merge some combination of category_list, category_group_list, and (from category_groups table) category_name, to merge with cosector in our data, and use it for [[Maxine Tao]]'s industry classifier.&lt;br /&gt;
&lt;br /&gt;
The other columns can be added to the end of our sheet as supplemental data.&lt;br /&gt;
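&lt;br /&gt;
As a minimal sketch of the fill-the-blanks merge described in this step: the column pairings (courl and domain, costatus and status) come from the list above, but the function names, the row-as-dictionary layout, and the comma-joined location format are assumptions, since the page only says "some combination" of city, state_code, and country_code is used.&lt;br /&gt;

```python
def build_location(cb_row):
    """Join the non-empty location parts (assumed format: City, ST, CCC)."""
    parts = [cb_row.get(key) or "" for key in ("city", "state_code", "country_code")]
    return ", ".join(part for part in parts if part)


def fill_blanks(our_row, cb_row, column_map):
    """Copy a Crunchbase value into our row only where our cell is empty."""
    merged = dict(our_row)
    for our_col, cb_col in column_map.items():
        if not (merged.get(our_col) or "").strip():
            merged[our_col] = cb_row.get(cb_col) or ""
    return merged
```

For example, calling fill_blanks with the map {"courl": "domain", "costatus": "status"} leaves any existing courl or costatus value untouched and fills only the empty cells.&lt;br /&gt;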
&lt;br /&gt;
==SQL Scripts, Files, and Databases==&lt;br /&gt;
&lt;br /&gt;
The contents of E:\McNair\Projects\Accelerators\Summer 2018\For Ed Merge July 17.xlsx were copied into CohortCosWcbuuid.txt (in the Accelerators folder, as well as Z:/crunchbase2, Z:/../vcdb2).&lt;br /&gt;
&lt;br /&gt;
The script '''AddCBData.sql''' loads this data into '''crunchbase2'''. It then outputs the relevant Crunchbase data into '''CBCohortData.txt'''.&lt;br /&gt;
&lt;br /&gt;
The script '''LoadAcceleratorDataV2.sql''' (see around line 305) loads both '''CohortCosWcbuuid.txt''' and '''CBCohortData.txt''' into the database '''vcdb2'''. It then produces a CohortCoExtended table, which is output to a file.&lt;br /&gt;
&lt;br /&gt;
Note that '''CohortCoExtended.txt''' includes a variable GotVC, which takes the value 1 if the cohort company got VC and zero otherwise:&lt;br /&gt;
&lt;br /&gt;
  gotvc | count&lt;br /&gt;
 -------+-------&lt;br /&gt;
      0 | 11465&lt;br /&gt;
      1 |  1504&lt;br /&gt;
 (2 rows)&lt;br /&gt;
&lt;br /&gt;
We now need to determine which cohort companies we have timing information for and which we don't, and then use demo days to get the info we are missing.&lt;br /&gt;
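&lt;br /&gt;
One way to see the timing coverage at a glance is to count companies by gotvc and by whether a year is present. The sketch below runs against an in-memory SQLite table purely for illustration; the real table lives in the PostgreSQL database vcdb2, and the coname column and the sample rows passed in are hypothetical.&lt;br /&gt;

```python
import sqlite3


def timing_coverage(rows):
    """Count companies by gotvc and by whether year is filled in.

    rows: iterable of (coname, year, quarter, gotvc) tuples.
    Returns a list of (gotvc, has_year, count) tuples.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE CohortCoExtended "
        "(coname TEXT, year INTEGER, quarter INTEGER, gotvc INTEGER)"
    )
    conn.executemany("INSERT INTO CohortCoExtended VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(
        "SELECT gotvc, year IS NOT NULL AS has_year, COUNT(*) "
        "FROM CohortCoExtended "
        "GROUP BY gotvc, has_year "
        "ORDER BY gotvc, has_year"
    ).fetchall()
    conn.close()
    return result
```

Rows with has_year equal to 0 are the companies whose timing would have to come from demo days.&lt;br /&gt;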
&lt;br /&gt;
==Getting Timing info for Companies Who Got VC==&lt;br /&gt;
&lt;br /&gt;
 SELECT COUNT(*) FROM CohortCoExtended WHERE year IS NOT NULL AND quarter IS NOT NULL AND gotvc=1;&lt;br /&gt;
  count&lt;br /&gt;
 -------&lt;br /&gt;
    661&lt;br /&gt;
 (1 row)&lt;br /&gt;
&lt;br /&gt;
 vcdb2=# SELECT COUNT(*) FROM CohortCoExtended WHERE year IS NOT NULL AND gotvc=1;&lt;br /&gt;
  count&lt;br /&gt;
 -------&lt;br /&gt;
    684&lt;br /&gt;
 (1 row)&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Connor_Rothschild_(Work_Log)&amp;diff=23658</id>
		<title>Connor Rothschild (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Connor_Rothschild_(Work_Log)&amp;diff=23658"/>
		<updated>2018-07-19T21:45:22Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;===Summer 2018===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
7/19/2018 -&lt;br /&gt;
*Helped Minh with training data for Demo Day Crawler&lt;br /&gt;
&lt;br /&gt;
7/18/2018 - &lt;br /&gt;
*Helped Augi with MA cleaning&lt;br /&gt;
*Talked to Minh about Demo Day progress&lt;br /&gt;
&lt;br /&gt;
7/17/2018 - &lt;br /&gt;
*Worked with Ed to add/merge data from Crunchbase to existing data. This was a replication of the process but done by Ed in SQL, not Excel. New data can be found in &lt;br /&gt;
 /McNair/Projects/Accelerators/Summer 2018/'''Merged With Crunchbase Info as of July 17.xlsx'''&lt;br /&gt;
&lt;br /&gt;
NOTE: Use this data rather than the sheet mentioned in yesterday's entry.&lt;br /&gt;
&lt;br /&gt;
7/16/2018 - &lt;br /&gt;
*Merged cohort company data with Crunchbase data by doing a VLOOKUP and then cleaning up the data. I used a =IF(A2=&amp;quot;&amp;quot;,B2,A2) formula to merge cells only when blanks were present. This gave us updated data for four columns:&lt;br /&gt;
**colocation (removed 6324 blanks)&lt;br /&gt;
**codescription (removed 5151 blanks)&lt;br /&gt;
**costatus (removed 7342 blanks)&lt;br /&gt;
**courl (removed 6670 blanks)&lt;br /&gt;
and new columns:&lt;br /&gt;
**address&lt;br /&gt;
**founded_on date&lt;br /&gt;
**employee_count&lt;br /&gt;
**linkedin_url&lt;br /&gt;
&lt;br /&gt;
These new variables can be found in:&lt;br /&gt;
 /McNair/Projects/Accelerators/Summer 2018/'''Crunchbase Info Populated Empty Cells.xlsx''' (OUTDATED:: DON'T USE)&lt;br /&gt;
&lt;br /&gt;
Upon Ed's approval, I'll move this sheet to replace Cohort Companies in '''The File to Rule Them All'''.&lt;br /&gt;
&lt;br /&gt;
7/13/2018 -&lt;br /&gt;
*Using SQL, matched our cohort companies with information from Crunchbase. This gave us a lot of new information, like employee counts, company status, the date founded, and the location of the company. This data can be found here:&lt;br /&gt;
 /McNair/Projects/Accelerators/Summer 2018/'''Cohort Companies With Crunchbase Info.xlsx'''&lt;br /&gt;
&lt;br /&gt;
7/12/2018 -&lt;br /&gt;
&lt;br /&gt;
*Created 'The File to Rule them All' with finalized info on accelerators, cohort companies, and founders.&lt;br /&gt;
*Attempted to match our company data to Crunchbase data with SQL to get more info on companies.&lt;br /&gt;
&lt;br /&gt;
7/11/2018 -&lt;br /&gt;
&lt;br /&gt;
*Worked on LinkedIn Founders data. Cleaned up data, removed duplicates, checked for fidelity.&lt;br /&gt;
*Worked with Maxine to finish Crunchbase matching.&lt;br /&gt;
&lt;br /&gt;
7/10/2018 - &lt;br /&gt;
*Merged Clean Cohort Data (Veeral) and Cohort List (new) in the Accelerator Master Variable List file. Cross-referenced this list with Ed's data sent last week, titled accelerator_data_noflag.txt. We found that there are 4866 more entries in the new merged file, meaning Ed's merging may have dropped valid entries. (This was after filtering the list so we only looked at the accelerators on our list).&lt;br /&gt;
&lt;br /&gt;
7/9/2018 - &lt;br /&gt;
*Worked with Maxine to remove duplicates/gather clean data for Crunchbase matching&lt;br /&gt;
&lt;br /&gt;
06/29/2018 - &lt;br /&gt;
*Finished manually coding an equity variable in Master Variable List sheet (with the help of [[Maxine Tao]]).&lt;br /&gt;
*Finished editing the terms of joining each accelerator.&lt;br /&gt;
*Given the above two tasks, there are five new columns in our Master Variable List sheet: &lt;br /&gt;
**Terms of joining - terms of joining accelerator and important details about program&lt;br /&gt;
**equity (1/0) - cells contain a 1 if the accelerator takes equity, a 0 if the accelerator definitively does not, and are blank if we could not find that information&lt;br /&gt;
**equity amount - the % of equity the accelerator will take (can sometimes be a range, e.g. 5-7%)&lt;br /&gt;
**investment - the $ the accelerator invests in a company to begin, if relevant (could also be a range or an &amp;quot;up to $######&amp;quot;)&lt;br /&gt;
**notes - comments on the previous four columns&lt;br /&gt;
*Taught [[Maxine Tao]] how to VLookup :D&lt;br /&gt;
&lt;br /&gt;
06/28/2018 -&lt;br /&gt;
*Began manually coding an equity variable in Master Variable List sheet. &lt;br /&gt;
*Edited terms of joining accelerator. &lt;br /&gt;
*Helped Grace with LinkedIn crawler.&lt;br /&gt;
&lt;br /&gt;
06/27/2018 - &lt;br /&gt;
*Finished coding duplicates. Final file can be found at:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Duplicate Companies.xlsx&lt;br /&gt;
&lt;br /&gt;
*Dylan taught interns Excel skills&lt;br /&gt;
&lt;br /&gt;
06/26/2018 - &lt;br /&gt;
*Began coding duplicates in CohortMainBaseWCounts.txt file that Ed sent. Sorted by company name alphabetically, then used conditional formatting to highlight when an accelerator had the same name as the accelerator above. This narrowed down the results to instances in which a company would go through the same accelerator twice. Most of the time, this was due to an error with the normalizer, so I moved those un-normalized company names to their own sheet and deleted them from the file.&lt;br /&gt;
&lt;br /&gt;
06/25/2018 - &lt;br /&gt;
*Went through and manually fixed discrepancies between our accelerator data and the Crunchbase data, found at &lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Accelerators Matched by Name and Homepage URL.xlsx&lt;br /&gt;
&lt;br /&gt;
*Finalized a sheet with a list of accelerator names as we code them, as Crunchbase codes them, and the appropriate UUID for each accelerator. I recommend updating the names in our spreadsheet of accelerators to the Crunchbase list so that we will be able to look up that name without having an in-between. The list can be found in the rightmost columns here:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Accelerator Master Variable List - Revised by Ed V2.xlsx&lt;br /&gt;
and here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1n1sX5DqZrm_0vbUXG9ZaZIagF9sa0Kva9PAno-6H854/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
*Worked with [[Minh Le]] to better understand and begin documenting the [[Demo Day Page Parser]] project.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
06/22/2018 - &lt;br /&gt;
*Finished going through Accelerator Master Variable List to refine industry classification and update addresses/accelerator statuses.&lt;br /&gt;
&lt;br /&gt;
06/21/2018 - &lt;br /&gt;
*Began manually editing entries in Accelerator Master Variable List.&lt;br /&gt;
*Reached out to Grace and Maxine and sent them the necessary sheets/txt files so they could begin on their Crunchbase project.&lt;br /&gt;
*I also made these graphics to better represent what our collaborative work would look like, and what the final project would include:&lt;br /&gt;
 https://docs.google.com/document/d/13Mb7lOLydm9r-ENYxSlZJVGgY9wxClATR6Hy8F9YK1Y/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
06/20/2018 - &lt;br /&gt;
*Talked with Ed about project details. &lt;br /&gt;
*Began looking through the Accelerator Master List to better understand project description. &lt;br /&gt;
*Sent Grace and Maxine the relevant company names listed in the Accelerator Master Spreadsheet so they could begin using their relevant parsers and tools to sort through data.&lt;br /&gt;
&lt;br /&gt;
06/19/2018 - &lt;br /&gt;
*Set up work stations on balcony, trained&lt;br /&gt;
&lt;br /&gt;
06/18/2018 - &lt;br /&gt;
*Trained, met other interns&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23651</id>
		<title>U.S. Seed Accelerators</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=U.S._Seed_Accelerators&amp;diff=23651"/>
		<updated>2018-07-19T20:09:05Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: /* 7/9/18 Update */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=U.S. Seed Accelerators&lt;br /&gt;
|Has owner=Connor Rothschild,&lt;br /&gt;
|Has start date=06/18/2018&lt;br /&gt;
|Has keywords=accelerators, data&lt;br /&gt;
|Has notes=Continuation of [[Accelerator Data]]&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Is dependent on=Industry Classifier, Demo Day Page Parser, Accelerator Demo Day, Demo Day Page Google Classifier, Crunchbase Accelerator Equity, Crunchbase Accelerator Founders, Crunchbase Data,&lt;br /&gt;
|Has image=Seed-Accelerator.jpg&lt;br /&gt;
|Does subsume=Accelerator Data, Accelerator Seed List (Data),&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Project Location==&lt;br /&gt;
The master file can be found at &lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
==Relevant Former Projects==&lt;br /&gt;
This page serves as an updated and tidied version of the data and work presented on the [[Accelerator Seed List (Data)]] Project, which subsumed [[Accelerator Data]].&lt;br /&gt;
Both of these projects (and as a corollary, this project) are dependent on the [[Demo Day Page Parser]], [[Industry Classifier]], and the [[Whois Parser]].&lt;br /&gt;
&lt;br /&gt;
==7/9/18 Update==&lt;br /&gt;
&lt;br /&gt;
Here's a project update on the work that has been done since coming to McNair.&lt;br /&gt;
&lt;br /&gt;
===The Equity Variable: COMPLETE===&lt;br /&gt;
&lt;br /&gt;
[[Maxine Tao]] and I have added five new variables to the '''Accelerator Master Variable List - Revised by Ed V2''' file. Those variables are:&lt;br /&gt;
&lt;br /&gt;
*Terms of joining - terms of joining accelerator and important details about program&lt;br /&gt;
*Equity (1/0) - cells contain a 1 if the accelerator takes equity, a 0 if the accelerator definitively does not, and are blank if we could not find that information&lt;br /&gt;
*Equity Amount - the % of equity the accelerator will take (can sometimes be a range, e.g. 5-7%)&lt;br /&gt;
*Investment - the $ the accelerator invests in a company to begin, if relevant (could also be a range or an &amp;quot;up to $######&amp;quot;)&lt;br /&gt;
*Notes - comments on the previous four columns&lt;br /&gt;
&lt;br /&gt;
These five variables tell us more about the characteristics of accelerators: specifically, which ones take equity, which ones do not, and how much equity they take.&lt;br /&gt;
&lt;br /&gt;
Relevant information: &lt;br /&gt;
*82 accelerators take equity, 42 do not, and we lack information for 37.&lt;br /&gt;
*The average % of equity among accelerators who take equity is 6.49% (a rough estimate; do not use for anything official). We got this number by looking only at accelerators who take equity, coding any reported range as its midpoint (e.g. 4%-10% equity would be coded as 7% equity), and taking the mean.&lt;br /&gt;
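&lt;br /&gt;
The midpoint-then-mean calculation described above can be sketched as follows. The calculation was presumably done in the spreadsheet, so the function names and the regular expression for reading cells such as "4%-10%" are assumptions for illustration only.&lt;br /&gt;

```python
import re


def equity_percent(cell):
    """Turn an equity cell into one number; a range becomes its midpoint."""
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", cell)]
    if not numbers:
        return None  # blank or non-numeric cell
    return (min(numbers) + max(numbers)) / 2


def mean_equity(cells):
    """Average the equity percentages, skipping cells with no number."""
    percents = [p for p in map(equity_percent, cells) if p is not None]
    return sum(percents) / len(percents) if percents else None
```

Here equity_percent("4%-10%") gives 7.0, matching the coding rule in the bullet above.&lt;br /&gt;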
&lt;br /&gt;
===Matching Accelerators to UUIDs: COMPLETE===&lt;br /&gt;
&lt;br /&gt;
The file with accelerators matched to Crunchbase UUIDs can be found at:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Accelerators and UUIDs.xlsx&lt;br /&gt;
&lt;br /&gt;
This is the master file and should never be modified unless we find a UUID changed. ALL OTHER SHEETS with UUIDs are linked to this sheet so its changes will be reflected elsewhere.&lt;br /&gt;
&lt;br /&gt;
More information can be found on the Crunchbase Data page.&lt;br /&gt;
&lt;br /&gt;
===Linking Accelerators to Founders/LinkedIn Crawling: COMPLETE===&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] got the [[LinkedIn Crawler (Python)]] to work, which means we currently have the following information about accelerator founders:&lt;br /&gt;
*Current Job Title&lt;br /&gt;
*Location&lt;br /&gt;
*Employer&lt;br /&gt;
*Job(s) Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location of jobs&lt;br /&gt;
*Extra Description&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
==An Overview==&lt;br /&gt;
This project will be used to determine which accelerators are the most effective at churning out successful startups, as well as what characteristics are exhibited by these accelerators. First, we need to gather as much data as we can about as many accelerators as we can in order to look at factors that differentiate successful vs. unsuccessful ventures. Next, we need to create a web crawling program which will gather information about accelerators across the world by accessing their websites and extracting information. I believe that our overall goal with this research project is to gain insight into the methods of successful accelerators, as well as to find out what exactly differentiates very successful accelerators from dead accelerators.&lt;br /&gt;
&lt;br /&gt;
Helpful Links: http://seedrankings.com/&lt;br /&gt;
&lt;br /&gt;
This project is developing broad and near-population data on accelerators and their cohort companies. The objective is to identify which cohorts of which accelerators a cohort company was trained in, obtain details of the accelerators, and obtain details of the cohort companies, including information about any venture capital investment that the cohort company might have received and any IPO or acquisition the company may have experienced.&lt;br /&gt;
&lt;br /&gt;
The primary use of this data is for an academic paper detailed on the [[Matching Entrepreneurs to Accelerators and VCs (Academic Paper)]] page. &lt;br /&gt;
&lt;br /&gt;
However, this project can also provide useful data to other academic papers ([[Urban Start-up Agglomeration]], [[Hubs (Academic Paper)]], and [[Hubs Scorecard (Academic Paper)]]), projects ([[Houston Entrepreneurship]]) and blog posts (under the [[Emerging Ecosystems]] umbrella project).&lt;br /&gt;
&lt;br /&gt;
(OUTDATED) The most recent update provided on [[Accelerator Seed List (Data)]] was on 05/21/2018. This update included the most recent '''master file''' of accelerator data, found at &lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx&lt;br /&gt;
&lt;br /&gt;
(OUTDATED) The Google Sheets Master Sheet is found here&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1ikuxYwp9JIRrjz4qQcbdwTpbHOne-q2PterYTjzofjw/edit?ts=5aa2f1f9#gid=0&lt;br /&gt;
&lt;br /&gt;
==Remaining To Dos==&lt;br /&gt;
&lt;br /&gt;
The last update on [[Accelerator Seed List (Data)]] said the following needed to be done: &lt;br /&gt;
&lt;br /&gt;
*Cross-reference sheet with data from Peter's old accelerator consolidation file (&amp;quot;accelerator_data_noflag&amp;quot; and &amp;quot;accelerator_data&amp;quot; in &amp;quot;All Relevant Files&amp;quot;) and fill in missing data&lt;br /&gt;
*Variables that are 100% NOT in these 2 files:&lt;br /&gt;
**Cohort Breakout?&lt;br /&gt;
**Subtype&lt;br /&gt;
**Designed for Students?&lt;br /&gt;
**Campuses&lt;br /&gt;
**Stage&lt;br /&gt;
**Software Tech&lt;br /&gt;
**What stage do they look for?&lt;br /&gt;
&lt;br /&gt;
TODO:&lt;br /&gt;
 McNair/Projects/Accelerators/Fall 2017/unfound_founders.txt&lt;br /&gt;
A 0 means we don't have founder data for that accelerator.&lt;br /&gt;
Specs: A tab delimited text file with the following fields:&lt;br /&gt;
 Accelerator   First Name   Last Name   LinkedInURL(if possible)&lt;br /&gt;
Getting the LinkedInURL will ensure accuracy, but will work without it.&lt;br /&gt;
&lt;br /&gt;
*Shrey: Find &amp;quot;demo day&amp;quot; keywords, so that we can search AcceleratorName Year Keyword and get back potential demo day pages&lt;br /&gt;
&lt;br /&gt;
It is unclear if any of these tasks have been done since the update on 05/21. I will begin by seeing which of these things have been carried out.&lt;br /&gt;
&lt;br /&gt;
==Other Listed To Dos==&lt;br /&gt;
&lt;br /&gt;
*We have compiled a very long list of accelerators from many different databases. For the past couple of weeks, everyone in the center has been going through this list, 20 at a time, classifying each one as an accelerator or not an accelerator, and then proceeding to gather data on the accelerator using the process outlined below. This process went very smoothly. We have successfully gone through about 80% of the list. We are still missing information on the last hundred or so names. All of the collected data is located on the RDP, within the &amp;quot;Accelerators&amp;quot; folder under &amp;quot;Data&amp;quot; or on the [https://docs.google.com/spreadsheets/d/1ikuxYwp9JIRrjz4qQcbdwTpbHOne-q2PterYTjzofjw/edit?ts=5aa2f1f9#gid=1132417337 &amp;quot;Accelerator Master Variable List&amp;quot; Google sheet].&lt;br /&gt;
*We have listed all of the startups from the accelerators that have break out cohorts on their website on the [https://docs.google.com/spreadsheets/d/1ikuxYwp9JIRrjz4qQcbdwTpbHOne-q2PterYTjzofjw/edit?ts=5aa2f1f9#gid=1132417337 &amp;quot;Accelerator Master Variable List&amp;quot; Google sheet]. This contains the following information in the &amp;quot;Cohort List (new)&amp;quot; sheet: accelerator name, year, cohort name, company name, description, founders, category/sector, and location. &lt;br /&gt;
*Next steps include going through the demo day pages that have been downloaded and writing notes on the different types if possible (see [[Demo Day Page Google Classifier]]).&lt;br /&gt;
&lt;br /&gt;
==Moving Forward==&lt;br /&gt;
&lt;br /&gt;
'''Acquiring the necessary data to complete the Accelerator Master Variable List and the Cohort List will require the following (not necessarily in this order):'''&lt;br /&gt;
&lt;br /&gt;
===Step Zero: Connect to Crunchbase and Link Data - COMPLETE===&lt;br /&gt;
&lt;br /&gt;
[[Crunchbase Data]]&lt;br /&gt;
&lt;br /&gt;
===Step One: LinkedIn Founders Data===&lt;br /&gt;
&lt;br /&gt;
This project will begin by working with [[Grace Tan]] and [[Maxine Tao]] to connect accelerators to their founders and cohort companies using Crunchbase and LinkedIn crawlers. Grace and Maxine will go through [http://crunchbase.com Crunchbase] and find the UUID for companies and their founders (reference [[Crunchbase Data]], [[Crunchbase Accelerator Founders]], [[Crunchbase Accelerator Equity]]). We will then connect them using SQL and feed the names of founders into our LinkedIn crawler (headed by [[Grace Tan]]).&lt;br /&gt;
&lt;br /&gt;
The list of founders for accelerators can be found at&lt;br /&gt;
 McNair/Projects/Accelerators/Fall 2017/founders_linkedin.txt&lt;br /&gt;
&lt;br /&gt;
The '''Unfound Founders''' file codes a 0 for all companies '''''not listed''''' within the LinkedIn Founders file, and a 1 for those that do have founders listed.&lt;br /&gt;
&lt;br /&gt;
'''Given the founders' names, we will then be able to use the [[LinkedIn Crawler (Python)]] to find the relevant details of an accelerator founder (education, work experience, etc.). This data on founders will help us solve the horse, jockey, racetrack question to detect what variables affect a startup's success (the accelerator, the founders, the environment/city).'''&lt;br /&gt;
&lt;br /&gt;
===Step Two: Linking Accelerators to Cohorts Using Investments on Crunchbase===&lt;br /&gt;
&lt;br /&gt;
In this step we focus on accelerators who take equity from the companies that engage in their program. We do this to prevent looking at accelerators who may also run funds/invest in various companies but do not take equity. This would provide us misleading results and lead us to believe some companies are in cohorts at accelerators that they are really not. &lt;br /&gt;
&lt;br /&gt;
Maxine will acquire the list of accelerators who take equity from companies from the following sheet:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/All Relevant Files/accelerator_data_noflag.txt&lt;br /&gt;
&lt;br /&gt;
Looking at the file, however, shows that very few are actually categorized well and the equity variable is messy. Moving forward, we need to check/refine/fix this classification.&lt;br /&gt;
&lt;br /&gt;
This file has 266 rows. The most recent, actual version of our accelerator database (found at E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx under the sheet Master Variable List) only has 167 rows, meaning the accelerator_data_noflag.txt file has too many rows. &lt;br /&gt;
&lt;br /&gt;
We will need to do a left join of Accelerator Master List with accelerator_data_noflag.txt to get rid of the accelerator names that are in accelerator_data_noflag but NOT in Accelerator Master List.&lt;br /&gt;
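&lt;br /&gt;
A minimal sketch of that left join, assuming both files have been read into lists of dictionaries that share an accelerator-name column. The key name "accelerator" and the function name are hypothetical; in practice this would more likely be done in SQL or Excel.&lt;br /&gt;

```python
def left_join(master_rows, noflag_rows, key="accelerator"):
    """Keep every master row, attaching noflag columns where the name matches.

    Rows that appear only in noflag_rows are dropped, which is the point of
    joining from the master list.
    """
    noflag_by_name = {row[key]: row for row in noflag_rows}
    joined = []
    for row in master_rows:
        extra = noflag_by_name.get(row[key], {})
        # Values already present in the master row win over noflag values.
        joined.append({**extra, **row})
    return joined
```

Master rows with no match simply come back without the extra columns, so they are easy to spot and code by hand.&lt;br /&gt;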
&lt;br /&gt;
Once this is finished, we should have an “Equity” classification variable for every accelerator in Accelerator Master List. The accelerators that have a Y (or maybe it’s a 1) are the ones that take equity. These are the accelerators we’ll be able to do the Crunchbase work on to see when they take equity.&lt;br /&gt;
&lt;br /&gt;
We then look at the accelerators' investments (or at companies and the entities that invested in them) and cross-reference the list of companies and accelerators. Once we find a match, we know that a company went through an accelerator and in which year it went through a cohort.&lt;br /&gt;
&lt;br /&gt;
'''From this, we get the following data:'''&lt;br /&gt;
*Accelerator a given company went through&lt;br /&gt;
*Year said company went through a cohort/Specific cohort company went through&lt;br /&gt;
&lt;br /&gt;
===Step Three: Demo Day Crawler===&lt;br /&gt;
&lt;br /&gt;
This part of the project relies on the contributions of the wonderful [[Minh Le]]. Better documentation for the project can be found on the [[Demo Day Page Parser]], [[Demo Day Page Google Classifier]], and [[Accelerator Demo Day]] project pages.&lt;br /&gt;
&lt;br /&gt;
Essentially, this part of the accelerator data project will use the [[Demo Day Page Parser]] to look through accelerator websites for pages which list a cohort's 'Demo Day', the day on which accelerators present their companies to a group of special investors ([https://www.ycombinator.com/demoday/faq/ here's an example FAQ page from Y Combinator]). The [[Demo Day Page Google Classifier]] will then determine if the page is, in fact, a demo day page. &lt;br /&gt;
&lt;br /&gt;
'''Given a cohort's demo day, we can gather a few pieces of key information (check with Ed to make sure this is the correct information to gather from Demo Days):'''&lt;br /&gt;
*The date a cohort began/the season the cohort went through the accelerator&lt;br /&gt;
**This is acquired by looking at the cohort's demo day date, and subtracting the number of weeks/months of a cohort for that given accelerator. The length of a cohort can be found in the file:&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx&lt;br /&gt;
*I assume we can also acquire the companies in a specific cohort, as in we will have the list of all companies in the (for example) Spring 2018 Cohort of Techstars.&lt;br /&gt;
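&lt;br /&gt;
The date arithmetic in the first bullet above can be sketched like this, assuming the cohort length is recorded in weeks. The function names and the month-to-season mapping are illustrative assumptions, not something defined elsewhere in the project.&lt;br /&gt;

```python
from datetime import date, timedelta


def cohort_start(demo_day, program_weeks):
    """Estimate the cohort start by stepping back from the demo day."""
    return demo_day - timedelta(weeks=program_weeks)


def season(day):
    """Map a date to a season (assumed: Dec-Feb is Winter, Mar-May Spring)."""
    names = ("Winter", "Spring", "Summer", "Fall")
    return names[(day.month % 12) // 3]
```

For a demo day on 21 August 2018 and a 12-week program, cohort_start returns 29 May 2018, which season labels as Spring.&lt;br /&gt;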
&lt;br /&gt;
===Step Four: Non-profit Finder===&lt;br /&gt;
&lt;br /&gt;
More at [[Non-profit Finder]]&lt;br /&gt;
&lt;br /&gt;
Another important step in this project is finding which accelerators are non-profits.&lt;br /&gt;
&lt;br /&gt;
A comprehensive list of nonprofits taken from the [https://www.irs.gov/charities-non-profits/tax-exempt-organization-search-bulk-data-downloads IRS] can be found here:&lt;br /&gt;
&lt;br /&gt;
 E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Nonprofits in US.xlsx&lt;br /&gt;
Warning: this file has 1 million rows&lt;br /&gt;
&lt;br /&gt;
This file should be cross-referenced with the list of accelerators to find which ones are listed--those accelerators are non-profits. We'll need someone good in SQL to do this with a match.&lt;br /&gt;
&lt;br /&gt;
Potential problem:&lt;br /&gt;
*The names of listed non-profits will likely be somewhat different than the names of accelerators, because companies often file for tax exemption with different names than they show the public.&lt;br /&gt;
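&lt;br /&gt;
One partial mitigation for the name problem is to normalize both lists before matching. The sketch below is an assumption-laden illustration: the stop-word list and function names are hypothetical, and an exact match on normalized names will still miss accelerators that file under an entirely different legal name.&lt;br /&gt;

```python
import re

# Hypothetical list of words to ignore when comparing names.
IGNORED_WORDS = {"inc", "incorporated", "llc", "corp", "corporation", "co", "the"}


def normalize_name(name):
    """Lowercase, strip punctuation, and drop common corporate suffixes."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(word for word in words if word not in IGNORED_WORDS)


def find_nonprofit_accelerators(accelerators, irs_names):
    """Return accelerators whose normalized name appears in the IRS list."""
    irs_keys = {normalize_name(name) for name in irs_names}
    return [acc for acc in accelerators if normalize_name(acc) in irs_keys]
```

Building the set of IRS names once keeps the lookup cheap even against a file with about a million rows.&lt;br /&gt;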
&lt;br /&gt;
==Workflow Image==&lt;br /&gt;
&lt;br /&gt;
I imagine the project will look something like the image below, which shows the information required to fully complete it. (The color coding is a very rough, preliminary estimate and is subject to change.)&lt;br /&gt;
&lt;br /&gt;
[[File:image1.png]]&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Connor_Rothschild&amp;diff=23650</id>
		<title>Connor Rothschild</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Connor_Rothschild&amp;diff=23650"/>
		<updated>2018-07-19T19:56:30Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: /* Time at McNair */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Staff&lt;br /&gt;
|position=Research Team&lt;br /&gt;
|name=Connor Rothschild&lt;br /&gt;
|user_image=ConnorRothschild.jpg&lt;br /&gt;
|degree=BA&lt;br /&gt;
|major=Social Policy Analysis, Sociology&lt;br /&gt;
|class=2021&lt;br /&gt;
|join_date=06/18/2018&lt;br /&gt;
|skills=Writing, Public Speaking, Excel,&lt;br /&gt;
|interests=Public Policy, Reading,&lt;br /&gt;
|fun_fact=Is really proud of his email aliases&lt;br /&gt;
|email=connor@rice.edu&lt;br /&gt;
|skype_name=connorrothschild@gmail.com&lt;br /&gt;
|status=Active&lt;br /&gt;
|college=Martel&lt;br /&gt;
}}&lt;br /&gt;
Connor Rothschild is a rising sophomore at Rice University currently working as a Research Assistant for the James A. Baker III Institute for Public Policy's McNair Center for Entrepreneurship and Innovation.&lt;br /&gt;
&lt;br /&gt;
==Early Life==&lt;br /&gt;
&lt;br /&gt;
Connor was born in Shawnee, Oklahoma, to Philip and Jennifer Rothschild. Six months later, when Philip accepted a job as a professor at Missouri State University, the family moved to Springfield, Missouri.&lt;br /&gt;
&lt;br /&gt;
==Education==&lt;br /&gt;
&lt;br /&gt;
Connor is a sophomore at Rice University majoring in Social Policy Analysis and Sociology, with a minor in Statistics. He resides at Martel College. &lt;br /&gt;
&lt;br /&gt;
==Organizational Involvement==&lt;br /&gt;
&lt;br /&gt;
Connor is the Executive Vice President for Community Engagement for the [http://bisf.bakerinstitute.org/ Baker Institute Student Forum]. He is also a project member with Rice's chapter of [http://dfarice.com/ Design for America]. Additionally, Connor is a Sophomore Class Representative at his residential college, Martel. He is also involved with Chi Alpha, the Student Admissions Council, and Orientation Week advising.&lt;br /&gt;
&lt;br /&gt;
==Work Experience==&lt;br /&gt;
&lt;br /&gt;
Connor has worked as a barista at Dunkin' Donuts in Springfield, Missouri, as an intern for a Congressman in Washington, D.C., and as a research assistant for Professor Mark Jones at Rice.&lt;br /&gt;
&lt;br /&gt;
==Time at McNair==&lt;br /&gt;
[[Connor Rothschild (Work Log)]]&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Connor_Rothschild_(Work_Log)&amp;diff=23641</id>
		<title>Connor Rothschild (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Connor_Rothschild_(Work_Log)&amp;diff=23641"/>
		<updated>2018-07-18T21:45:26Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: /* Summer 2018 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;===Summer 2018===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
7/18/2018 - &lt;br /&gt;
*Helped Augi with MA cleaning.&lt;br /&gt;
*Talked to Minh about Demo Day progress.&lt;br /&gt;
&lt;br /&gt;
7/17/2018 - &lt;br /&gt;
*Worked with Ed to merge Crunchbase data into our existing data. This replicated the process from the 7/16 entry, but Ed did it in SQL rather than Excel. The new data can be found in:&lt;br /&gt;
 /McNair/Projects/Accelerators/Summer 2018/'''Merged With Crunchbase Info as of July 17.xlsx'''&lt;br /&gt;
&lt;br /&gt;
NOTE: Use this data rather than the sheet mentioned in yesterday's entry.&lt;br /&gt;
&lt;br /&gt;
7/16/2018 - &lt;br /&gt;
*Merged cohort company data with Crunchbase data using a VLOOKUP, then cleaned up the result. I used the formula =IF(A2=&amp;quot;&amp;quot;,B2,A2) to fill a cell from the other column only when it was blank. This gave us updated data for four columns:&lt;br /&gt;
**colocation (removed 6324 blanks)&lt;br /&gt;
**codescription (removed 5151 blanks)&lt;br /&gt;
**costatus (removed 7342 blanks)&lt;br /&gt;
**courl (removed 6670 blanks)&lt;br /&gt;
and new columns:&lt;br /&gt;
**address&lt;br /&gt;
**founded_on date&lt;br /&gt;
**employee_count&lt;br /&gt;
**linkedin_url&lt;br /&gt;
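&lt;br /&gt;
The fill-only-when-blank rule can be sketched outside Excel as well. This is an illustration, not the workflow actually used: it assumes column A holds our value and column B the Crunchbase value, that rows are already aligned one-to-one (the VLOOKUP step), and the function names and sample rows are made up.&lt;br /&gt;
&lt;br /&gt;
```python
def fill_blank(ours, crunchbase):
    # Mirrors the spreadsheet formula =IF(A2="",B2,A2).
    return crunchbase if ours == "" else ours


def merge_rows(our_rows, cb_rows, columns):
    """Fill blanks column by column and count how many blanks were removed."""
    filled = {c: 0 for c in columns}
    merged = []
    for ours, cb in zip(our_rows, cb_rows):
        row = dict(ours)
        for c in columns:
            row[c] = fill_blank(ours[c], cb.get(c, ""))
            if ours[c] == "" and row[c] != "":
                filled[c] += 1
        merged.append(row)
    return merged, filled


# Made-up rows, for illustration only.
ours = [
    {"courl": "", "costatus": "operating"},
    {"courl": "a.example", "costatus": ""},
]
cb = [
    {"courl": "b.example", "costatus": "closed"},
    {"courl": "c.example", "costatus": ""},
]
merged, filled = merge_rows(ours, cb, ["courl", "costatus"])
```
&lt;br /&gt;
The per-column counter corresponds to the removed-blanks figures reported above: it only increments when a blank cell actually receives a value.&lt;br /&gt;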
&lt;br /&gt;
These new variables can be found in:&lt;br /&gt;
 /McNair/Projects/Accelerators/Summer 2018/'''Crunchbase Info Populated Empty Cells.xlsx''' (OUTDATED: DON'T USE)&lt;br /&gt;
&lt;br /&gt;
Upon Ed's approval, I'll move this sheet to replace Cohort Companies in '''The File to Rule Them All'''.&lt;br /&gt;
&lt;br /&gt;
7/13/2018 -&lt;br /&gt;
*Using SQL, matched our cohort companies with information from Crunchbase. This gave us a lot of new information, like employee counts, company status, the date founded, and the location of the company. This data can be found here:&lt;br /&gt;
 /McNair/Projects/Accelerators/Summer 2018/'''Cohort Companies With Crunchbase Info.xlsx'''&lt;br /&gt;
&lt;br /&gt;
7/12/2018 -&lt;br /&gt;
&lt;br /&gt;
*Created 'The File to Rule them All' with finalized info on accelerators, cohort companies, and founders.&lt;br /&gt;
*Attempted to match our company data to Crunchbase data with SQL to get more info on companies.&lt;br /&gt;
&lt;br /&gt;
7/11/2018 -&lt;br /&gt;
&lt;br /&gt;
*Worked on LinkedIn Founders data. Cleaned up data, removed duplicates, checked for fidelity.&lt;br /&gt;
*Worked with Maxine to finish Crunchbase matching.&lt;br /&gt;
&lt;br /&gt;
7/10/2018 - &lt;br /&gt;
*Merged Clean Cohort Data (Veeral) and Cohort List (new) in the Accelerator Master Variable List file. Cross-referenced this list with the data Ed sent last week, titled accelerator_data_noflag.txt. We found 4866 more entries in the new merged file, which suggests Ed's merge may have dropped valid entries. (This count was taken after filtering the list to the accelerators on our list.)&lt;br /&gt;
&lt;br /&gt;
7/9/2018 - &lt;br /&gt;
*Worked with Maxine to remove duplicates/gather clean data for Crunchbase matching&lt;br /&gt;
&lt;br /&gt;
06/29/2018 - &lt;br /&gt;
*Finished manually coding an equity variable in Master Variable List sheet (with the help of [[Maxine Tao]]).&lt;br /&gt;
*Finished editing terms of joining accelerator.&lt;br /&gt;
*Given the above two tasks, there are five new columns in our Master Variable List sheet:&lt;br /&gt;
**Terms of joining - terms of joining the accelerator and important details about the program&lt;br /&gt;
**equity (1/0) - contains a 1 if the accelerator takes equity, a 0 if the accelerator definitively does not, and is blank if we could not find that information&lt;br /&gt;
**equity amount - the % of equity the accelerator will take (can sometimes be a range, e.g. 5-7%)&lt;br /&gt;
**investment - the $ amount the accelerator initially invests in a company, if relevant (could also be a range or an &amp;quot;up to $######&amp;quot; figure)&lt;br /&gt;
**notes - any comments on the previous four columns&lt;br /&gt;
*Taught [[Maxine Tao]] how to VLookup :D&lt;br /&gt;
&lt;br /&gt;
06/28/2018 -&lt;br /&gt;
*Began manually coding an equity variable in Master Variable List sheet. &lt;br /&gt;
*Edited terms of joining accelerator. &lt;br /&gt;
*Helped Grace with LinkedIn crawler.&lt;br /&gt;
&lt;br /&gt;
06/27/2018 - &lt;br /&gt;
*Finished coding duplicates. Final file can be found at:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Duplicate Companies.xlsx&lt;br /&gt;
&lt;br /&gt;
*Dylan taught interns Excel skills&lt;br /&gt;
&lt;br /&gt;
06/26/2018 - &lt;br /&gt;
*Began coding duplicates in CohortMainBaseWCounts.txt file that Ed sent. Sorted by company name alphabetically, then used conditional formatting to highlight when an accelerator had the same name as the accelerator above. This narrowed down the results to instances in which a company would go through the same accelerator twice. Most of the time, this was due to an error with the normalizer, so I moved those un-normalized company names to their own sheet and deleted them from the file.&lt;br /&gt;
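&lt;br /&gt;
The same check can be approximated in code. Note that this sketch swaps the sort-and-highlight approach for a counter over (company, accelerator) pairs; the field names and sample rows are hypothetical and do not reflect the actual layout of CohortMainBaseWCounts.txt.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import Counter


def repeated_pairs(rows):
    """Return (company, accelerator) pairs that occur more than once."""
    counts = Counter(
        (r["company"].strip().lower(), r["accelerator"].strip().lower())
        for r in rows
    )
    # Every counted pair occurs at least once, so "not 1" means two or more.
    return sorted(pair for pair, n in counts.items() if n != 1)


# Made-up rows, for illustration only.
rows = [
    {"company": "Acme", "accelerator": "Alpha"},
    {"company": "acme ", "accelerator": "Alpha"},
    {"company": "Acme", "accelerator": "Beta"},
    {"company": "Zed", "accelerator": "Alpha"},
]
dupes = repeated_pairs(rows)
```
&lt;br /&gt;
Each flagged pair would still need a manual look to decide whether it is a normalizer error or a company that genuinely went through the same accelerator twice.&lt;br /&gt;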
&lt;br /&gt;
06/25/2018 - &lt;br /&gt;
*Went through and manually fixed discrepancies between our accelerator data and the Crunchbase data, found at &lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Accelerators Matched by Name and Homepage URL.xlsx&lt;br /&gt;
&lt;br /&gt;
*Finalized a sheet with a list of accelerator names as we code them, as Crunchbase codes them, and the appropriate UUID for each accelerator. I recommend updating the names in our spreadsheet of accelerators to the Crunchbase list so that we will be able to look up that name without having an in-between. The list can be found in the rightmost columns here:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Accelerator Master Variable List - Revised by Ed V2.xlsx&lt;br /&gt;
and here:&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/1n1sX5DqZrm_0vbUXG9ZaZIagF9sa0Kva9PAno-6H854/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
*Worked with [[Minh Le]] to better understand and begin documenting the [[Demo Day Page Parser]] project.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
06/22/2018 - &lt;br /&gt;
*Finished going through Accelerator Master Variable List to refine industry classification and update addresses/accelerator statuses.&lt;br /&gt;
&lt;br /&gt;
06/21/2018 - &lt;br /&gt;
*Began manually editing entries in Accelerator Master Variable List.&lt;br /&gt;
*Reached out to Grace and Maxine and sent them the necessary sheets/txt files so they could begin on their Crunchbase project.&lt;br /&gt;
*I also made these graphics to better represent what our collaborative work would look like, and what the final project would include:&lt;br /&gt;
 https://docs.google.com/document/d/13Mb7lOLydm9r-ENYxSlZJVGgY9wxClATR6Hy8F9YK1Y/edit?usp=sharing&lt;br /&gt;
&lt;br /&gt;
06/20/2018 - &lt;br /&gt;
*Talked with Ed about project details. &lt;br /&gt;
*Began looking through the Accelerator Master List to better understand project description. &lt;br /&gt;
*Sent Grace and Maxine the relevant company names listed in the Accelerator Master Spreadsheet so they could begin using their relevant parsers and tools to sort through data.&lt;br /&gt;
&lt;br /&gt;
06/19/2018 - &lt;br /&gt;
*Set up work stations on balcony, trained&lt;br /&gt;
&lt;br /&gt;
06/18/2018 - &lt;br /&gt;
*Trained, met other interns&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Merging_Existing_Data_with_Crunchbase&amp;diff=23615</id>
		<title>Merging Existing Data with Crunchbase</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Merging_Existing_Data_with_Crunchbase&amp;diff=23615"/>
		<updated>2018-07-17T19:39:30Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: /* Step One: Creating UUID Matches */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Merging Existing Data with Crunchbase&lt;br /&gt;
|Has owner=Connor Rothschild&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
This page details the process of merging existing data with data pulled from Crunchbase.&lt;br /&gt;
&lt;br /&gt;
==Project Location==&lt;br /&gt;
&lt;br /&gt;
For the merge detailed in this page, our data was from:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/The File to Rule Them All.xlsx&lt;br /&gt;
&lt;br /&gt;
and Crunchbase info can be found in: &lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Cohort Companies with Crunchbase Info.xlsx&lt;br /&gt;
&lt;br /&gt;
The data from Crunchbase, organized into tables, is in a script found at:&lt;br /&gt;
 /bulk/McNair/Software/Database Scripts/Crunchbase2/LoadTables.sql&lt;br /&gt;
&lt;br /&gt;
Code for building tables (from [[Maxine Tao]]):&lt;br /&gt;
 E:\McNair\Software\Database Scripts\Crunchbase2\CompanyMatchScript.sql&lt;br /&gt;
&lt;br /&gt;
And the database is&lt;br /&gt;
 crunchbase2&lt;br /&gt;
&lt;br /&gt;
==Process==&lt;br /&gt;
&lt;br /&gt;
===Step One: Creating UUID Matches===&lt;br /&gt;
&lt;br /&gt;
We began by making sure our company names were unique, creating a 1-1-1-1 relationship (only one instance of each company name in our data and in the Crunchbase data). We did so using the Matcher: we matched our sheet against itself, and the Crunchbase info (pulled from the organizations table detailed below) against itself, to remove duplicates and leave only unique values. [http://mcnair.bakerinstitute.org/wiki/Crunchbase_Data#Collecting_Company_Information More here]&lt;br /&gt;
&lt;br /&gt;
Upon Ed's instruction, we then looked at companies ''in Crunchbase'' whose company name was associated with more than one UUID. Of the 670,000 companies in Crunchbase, only 15,000 fell into this group. From this list of 15,000, we used recursive filtering on additional variables (such as company location) to determine whether any could be properly matched to a company in our data.&lt;br /&gt;
&lt;br /&gt;
Upon refining our list based on recursive filtering, we found 40 companies which match our data, and added UUIDs appropriately.&lt;br /&gt;
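&lt;br /&gt;
As a rough illustration of this step, the sketch below groups UUIDs by company name and narrows ambiguous names using one extra variable. It is a single narrowing pass on city, which is only my reading of the filtering described above; the function names and sample records are hypothetical, and only company_name, city, and uuid come from the organizations table.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict


def uuids_by_name(organizations):
    """Group Crunchbase UUIDs under each lowercased company name."""
    groups = defaultdict(set)
    for org in organizations:
        groups[org["company_name"].lower()].add(org["uuid"])
    return groups


def resolve(company, organizations):
    """Return the single matching UUID for one of our companies, else None."""
    name = company["name"].lower()
    candidates = [o for o in organizations if o["company_name"].lower() == name]
    if len(candidates) != 1:
        # Ambiguous (or missing) name: narrow by an extra variable, here city.
        candidates = [o for o in candidates if o["city"] == company["city"]]
    return candidates[0]["uuid"] if len(candidates) == 1 else None


# Made-up records, for illustration only.
orgs = [
    {"company_name": "Acme", "city": "Houston", "uuid": "u1"},
    {"company_name": "Acme", "city": "Austin", "uuid": "u2"},
    {"company_name": "Beta", "city": "Boston", "uuid": "u3"},
]
ambiguous = sorted(n for n, ids in uuids_by_name(orgs).items() if len(ids) != 1)
acme_uuid = resolve({"name": "Acme", "city": "Austin"}, orgs)
```
&lt;br /&gt;
Returning None when more than one candidate survives keeps ambiguous companies unmatched rather than guessing a UUID.&lt;br /&gt;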
&lt;br /&gt;
===Step Two: Pulling Data===&lt;br /&gt;
&lt;br /&gt;
The necessary tables for this pull can be found at:&lt;br /&gt;
 /bulk/McNair/Software/Database Scripts/Crunchbase2/'''LoadTables.sql'''&lt;br /&gt;
&lt;br /&gt;
We pulled the relevant data from Crunchbase based on unique UUID matches. In the crunchbase2 database, we used the table ''organizations''.&lt;br /&gt;
&lt;br /&gt;
We pulled based on the unique UUIDs found by Maxine, which can be found in the file:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx''', in column W.&lt;br /&gt;
&lt;br /&gt;
The table we get most Crunchbase data from looks like this:&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE organizations;&lt;br /&gt;
 CREATE TABLE organizations (&lt;br /&gt;
  company_name varchar(100),&lt;br /&gt;
  role varchar(255),&lt;br /&gt;
  permalink varchar(255),&lt;br /&gt;
  domain varchar(5000),&lt;br /&gt;
  homepage_url varchar(5000),&lt;br /&gt;
  country_code varchar(10),&lt;br /&gt;
  state_code varchar(2),&lt;br /&gt;
  region varchar(50),&lt;br /&gt;
  city varchar(100),&lt;br /&gt;
  address text,&lt;br /&gt;
  status varchar(50),&lt;br /&gt;
  short_description text,&lt;br /&gt;
  category_list text,&lt;br /&gt;
  category_group_list  text,&lt;br /&gt;
  funding_rounds integer,&lt;br /&gt;
  funding_total_usd money,&lt;br /&gt;
  founded_on date, --yyyy-mm-dd&lt;br /&gt;
  last_funding_on date, --yyyy-mm-dd&lt;br /&gt;
  closed_on date, --yyyy-mm-dd&lt;br /&gt;
  employee_count varchar(255),&lt;br /&gt;
  email varchar(255),&lt;br /&gt;
  phone text,&lt;br /&gt;
  facebook_url varchar(5000),&lt;br /&gt;
  linkedin_url varchar(5000),&lt;br /&gt;
  cb_url varchar(5000),&lt;br /&gt;
  logo_url varchar(5000),&lt;br /&gt;
  twitter_url varchar(5000),&lt;br /&gt;
  alias varchar(10000),&lt;br /&gt;
  uuid varchar(255),&lt;br /&gt;
  created_at date, --yyyy-mm-dd-hh-mm-s.s&lt;br /&gt;
  updated_at date, --yyyy-mm-dd-hh-mm-s.s&lt;br /&gt;
  primary_role varchar(255),&lt;br /&gt;
  type varchar(255)&lt;br /&gt;
 );&lt;br /&gt;
&lt;br /&gt;
'''From this list, we care about the following:'''&lt;br /&gt;
*company_name&lt;br /&gt;
*domain&lt;br /&gt;
*country_code&lt;br /&gt;
*state_code&lt;br /&gt;
*city&lt;br /&gt;
*address&lt;br /&gt;
*status&lt;br /&gt;
*short_description&lt;br /&gt;
*category_list&lt;br /&gt;
*category_group_list&lt;br /&gt;
*founded_on&lt;br /&gt;
*employee_count&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*uuid (our primary key)&lt;br /&gt;
&lt;br /&gt;
We also want to get more information on organization descriptions. To do so, we can pull ''description'' from the table ''organization_descriptions'', matching based on UUID.&lt;br /&gt;
&lt;br /&gt;
For industry classification, we also want to pull ''category_name'' from the table ''category_groups'', matching based on UUID.&lt;br /&gt;
&lt;br /&gt;
Finally, it may be worthwhile to pull variables such as name, description, and started_on from the ''events'' table, in the hopes of finding Cohort years, or potentially demo days. This can also be matched based on UUID.&lt;br /&gt;
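&lt;br /&gt;
A UUID-based pull of this kind might look like the sketch below. It uses an in-memory SQLite database as a stand-in for crunchbase2, with trimmed-down tables and made-up rows; the our_companies table, its coname column, and the uuid column on organization_descriptions are assumptions for illustration, not the real schema.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

# In-memory stand-in for the crunchbase2 database (assumption: the real
# tables have many more columns; only a few are reproduced here).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE organizations (company_name TEXT, domain TEXT, status TEXT, uuid TEXT);
CREATE TABLE organization_descriptions (uuid TEXT, description TEXT);
CREATE TABLE our_companies (coname TEXT, uuid TEXT);
INSERT INTO organizations VALUES ('Acme', 'acme.example', 'operating', 'u1');
INSERT INTO organizations VALUES ('Beta', 'beta.example', 'closed', 'u2');
INSERT INTO organization_descriptions VALUES ('u1', 'Makes example widgets.');
INSERT INTO our_companies VALUES ('Acme', 'u1');
""")

# Keep only organizations whose UUID is in our list; the LEFT JOIN leaves
# description empty (NULL) when Crunchbase has no long description.
rows = conn.execute("""
SELECT c.coname, o.domain, o.status, d.description
FROM our_companies AS c
JOIN organizations AS o ON o.uuid = c.uuid
LEFT JOIN organization_descriptions AS d ON d.uuid = o.uuid
ORDER BY c.coname
""").fetchall()
conn.close()
```
&lt;br /&gt;
The inner join drops Crunchbase organizations that are not in our list, while the left join keeps matched companies even when a supplemental table has no row for them.&lt;br /&gt;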
&lt;br /&gt;
With this information, we now have enough data to populate empty cells in our existing data and to create new columns.&lt;br /&gt;
&lt;br /&gt;
===Step Three: Merging===&lt;br /&gt;
&lt;br /&gt;
Of the data we've pulled from Crunchbase, we're interested in merging data into ''four'' columns of our existing data:&lt;br /&gt;
*domain (to be merged with the empty cells of courl)&lt;br /&gt;
*city, state_code, and country_code (some combination of this is to be merged with the empty cells of colocation)&lt;br /&gt;
*status (to be merged with the empty cells of costatus)&lt;br /&gt;
*short_description and description '''''from the table organization_descriptions''''' (some combination to be merged with empty cells of codescription)&lt;br /&gt;
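&lt;br /&gt;
The location merge leaves the exact combination open. One possible choice, shown here purely as an assumption, is to join whichever of city, state_code, and country_code are present with commas, and to use the result only where colocation is empty. The function names and sample values are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def build_location(city, state_code, country_code):
    """Join the non-empty location parts with commas."""
    parts = [p.strip() for p in (city, state_code, country_code) if p and p.strip()]
    return ", ".join(parts)


def fill_colocation(colocation, city, state_code, country_code):
    """Keep an existing colocation; otherwise build one from Crunchbase fields."""
    if colocation:
        return colocation
    return build_location(city, state_code, country_code)


# Made-up values, for illustration only.
full = build_location("Houston", "TX", "USA")
partial = build_location("", None, "GBR")
kept = fill_colocation("Austin, TX", "Houston", "TX", "USA")
filled = fill_colocation("", "Houston", "TX", "USA")
```
&lt;br /&gt;
Whatever format is chosen should match how colocation is already written in our data, so that filled and original cells stay comparable.&lt;br /&gt;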
&lt;br /&gt;
Note: we may also be able to merge some combination of category_list, category_group_list, and category_name (from the category_groups table) with cosector in our data, and use it for [[Maxine Tao]]'s industry classifier.&lt;br /&gt;
&lt;br /&gt;
The other columns can be added to the end of our sheet as supplemental data.&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Merging_Existing_Data_with_Crunchbase&amp;diff=23614</id>
		<title>Merging Existing Data with Crunchbase</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Merging_Existing_Data_with_Crunchbase&amp;diff=23614"/>
		<updated>2018-07-17T17:19:38Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Merging Existing Data with Crunchbase&lt;br /&gt;
|Has owner=Connor Rothschild&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
This page details the process of merging existing data with data pulled from Crunchbase.&lt;br /&gt;
&lt;br /&gt;
==Project Location==&lt;br /&gt;
&lt;br /&gt;
For the merge detailed in this page, our data was from:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/The File to Rule Them All.xlsx&lt;br /&gt;
&lt;br /&gt;
and Crunchbase info can be found in: &lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Cohort Companies with Crunchbase Info.xlsx&lt;br /&gt;
&lt;br /&gt;
The data from Crunchbase, organized into tables, is in a script found at:&lt;br /&gt;
 /bulk/McNair/Software/Database Scripts/Crunchbase2/LoadTables.sql&lt;br /&gt;
&lt;br /&gt;
Code for building tables (from [[Maxine Tao]]):&lt;br /&gt;
 E:\McNair\Software\Database Scripts\Crunchbase2\CompanyMatchScript.sql&lt;br /&gt;
&lt;br /&gt;
And the database is&lt;br /&gt;
 crunchbase2&lt;br /&gt;
&lt;br /&gt;
==Process==&lt;br /&gt;
&lt;br /&gt;
===Step One: Creating UUID Matches===&lt;br /&gt;
&lt;br /&gt;
We began by making sure our company names were unique, creating a 1-1-1-1 relationship (only one instance of each company name in our data and in the Crunchbase data). We did so using the Matcher: we matched our sheet against itself, and the Crunchbase info (pulled from the organizations table detailed below) against itself, to remove duplicates and leave only unique values. [http://mcnair.bakerinstitute.org/wiki/Crunchbase_Data#Collecting_Company_Information More here]&lt;br /&gt;
&lt;br /&gt;
Upon Ed's instruction, we then looked at companies ''in Crunchbase'' which had more than one UUID associated with the company name. Of the 670,000 companies in Crunchbase, only 15,000 had duplicate UUIDs. From this list of 15,000, we used recursive filtering to determine if any companies could be properly matched to the company in our data by looking at additional variables (such as company location).&lt;br /&gt;
&lt;br /&gt;
Upon refining our list based on recursive filtering, we found __ companies which match our data, and added UUIDs appropriately.&lt;br /&gt;
&lt;br /&gt;
===Step Two: Pulling Data===&lt;br /&gt;
&lt;br /&gt;
The necessary tables for this pull can be found at:&lt;br /&gt;
 /bulk/McNair/Software/Database Scripts/Crunchbase2/'''LoadTables.sql'''&lt;br /&gt;
&lt;br /&gt;
We pulled the relevant data from Crunchbase based on unique UUID matches. In the crunchbase2 database, we used the table ''organizations''.&lt;br /&gt;
&lt;br /&gt;
We pull based on the unique UUIDs found by Maxine, which can be found in the file:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx''', in column W.&lt;br /&gt;
&lt;br /&gt;
The table we get most Crunchbase data from looks like this:&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE organizations;&lt;br /&gt;
 CREATE TABLE organizations (&lt;br /&gt;
  company_name varchar(100),&lt;br /&gt;
  role varchar(255),&lt;br /&gt;
  permalink varchar(255),&lt;br /&gt;
  domain varchar(5000),&lt;br /&gt;
  homepage_url varchar(5000),&lt;br /&gt;
  country_code varchar(10),&lt;br /&gt;
  state_code varchar(2),&lt;br /&gt;
  region varchar(50),&lt;br /&gt;
  city varchar(100),&lt;br /&gt;
  address text,&lt;br /&gt;
  status varchar(50),&lt;br /&gt;
  short_description text,&lt;br /&gt;
  category_list text,&lt;br /&gt;
  category_group_list  text,&lt;br /&gt;
  funding_rounds integer,&lt;br /&gt;
  funding_total_usd money,&lt;br /&gt;
  founded_on date, --yyyy-mm-dd&lt;br /&gt;
  last_funding_on date, --yyyy-mm-dd&lt;br /&gt;
  closed_on date, --yyyy-mm-dd&lt;br /&gt;
  employee_count varchar(255),&lt;br /&gt;
  email varchar(255),&lt;br /&gt;
  phone text,&lt;br /&gt;
  facebook_url varchar(5000),&lt;br /&gt;
  linkedin_url varchar(5000),&lt;br /&gt;
  cb_url varchar(5000),&lt;br /&gt;
  logo_url varchar(5000),&lt;br /&gt;
  twitter_url varchar(5000),&lt;br /&gt;
  alias varchar(10000),&lt;br /&gt;
  uuid varchar(255),&lt;br /&gt;
  created_at date, --yyyy-mm-dd-hh-mm-s.s&lt;br /&gt;
  updated_at date, --yyyy-mm-dd-hh-mm-s.s&lt;br /&gt;
  primary_role varchar(255),&lt;br /&gt;
  type varchar(255)&lt;br /&gt;
 );&lt;br /&gt;
&lt;br /&gt;
'''From this list, we care about the following:&lt;br /&gt;
*company_name,&lt;br /&gt;
*domain,&lt;br /&gt;
*country_code,&lt;br /&gt;
*state_code,&lt;br /&gt;
*city,&lt;br /&gt;
*address,&lt;br /&gt;
*status,&lt;br /&gt;
*short_description,&lt;br /&gt;
*category_list,&lt;br /&gt;
*category_group_list,&lt;br /&gt;
*founded_on,&lt;br /&gt;
*employee_count,&lt;br /&gt;
*linkedin_url,&lt;br /&gt;
*uuid -- our primary key'''&lt;br /&gt;
&lt;br /&gt;
We also want to get more information on organization descriptions. To do so, we can pull ''description'' from the table ''organization_descriptions'', matching based on UUID.&lt;br /&gt;
&lt;br /&gt;
We also, for the purposes of industry classification, want to pull ''category_name'' from the table ''category_groups'', matching based on UUID.&lt;br /&gt;
&lt;br /&gt;
Finally, it may be worthwhile to pull variables such as name, description, and started_on from the ''events'' table, in the hopes of finding Cohort years, or potentially demo days. This can also be matched based on UUID.&lt;br /&gt;
&lt;br /&gt;
Given the aforementioned information, we now have much data that can be used to populate empty cells in our existing data, as well as to create new columns.&lt;br /&gt;
&lt;br /&gt;
===Step Three: Merging===&lt;br /&gt;
&lt;br /&gt;
Of the data we've pulled from Crunchbase, we're interested in merging ''four'' columns with our existing data:&lt;br /&gt;
*domain (to be merged with the empty cells of courl)&lt;br /&gt;
*city, state_code, and country_code (some combination of this is to be merged with the empty cells of colocation)&lt;br /&gt;
*status (to be merged with the empty cells of costatus)&lt;br /&gt;
*short_description and description '''''from the table organization_descriptions''''' (some combination to be merged with empty cells of codescription)&lt;br /&gt;
&lt;br /&gt;
Note: we may also be able to merge some combination of category_list, category_group_list, and (from category_groups table) category_name, to merge with cosector in our data, and use it for [[Maxine Tao]]'s industry classifier.&lt;br /&gt;
&lt;br /&gt;
The other columns can be added to the end of our sheet as supplemental data.&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Merging_Existing_Data_with_Crunchbase&amp;diff=23613</id>
		<title>Merging Existing Data with Crunchbase</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Merging_Existing_Data_with_Crunchbase&amp;diff=23613"/>
		<updated>2018-07-17T17:15:35Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Merging Existing Data with Crunchbase&lt;br /&gt;
|Has owner=Connor Rothschild&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
This page details the process of merging existing data with data pulled from Crunchbase.&lt;br /&gt;
&lt;br /&gt;
==Project Location==&lt;br /&gt;
&lt;br /&gt;
For the merge detailed in this page, our data was from:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/The File to Rule Them All.xlsx&lt;br /&gt;
&lt;br /&gt;
and Crunchbase info can be found in: &lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Cohort Companies with Crunchbase Info.xlsx&lt;br /&gt;
&lt;br /&gt;
The data from Crunchbase, organized into tables, is in a script found at:&lt;br /&gt;
 /bulk/McNair/Software/Database Scripts/Crunchbase2/LoadTables.sql&lt;br /&gt;
&lt;br /&gt;
==Process==&lt;br /&gt;
&lt;br /&gt;
===Step One: Creating UUID Matches===&lt;br /&gt;
&lt;br /&gt;
We began by making sure our company names were unique, creating a 1-1-1-1 relationship (only one instance of each company name in our data and in the Crunchbase data). We did so using the Matcher: we matched our sheet against itself, and the Crunchbase info (pulled from the organizations table detailed below) against itself, to remove duplicates and leave only unique values. [http://mcnair.bakerinstitute.org/wiki/Crunchbase_Data#Collecting_Company_Information More here]&lt;br /&gt;
&lt;br /&gt;
Upon Ed's instruction, we then looked at companies ''in Crunchbase'' which had more than one UUID associated with the company name. Of the 670,000 companies in Crunchbase, only 15,000 had duplicate UUIDs. From this list of 15,000, we used recursive filtering to determine if any companies could be properly matched to the company in our data by looking at additional variables (such as company location).&lt;br /&gt;
&lt;br /&gt;
Upon refining our list based on recursive filtering, we found __ companies which match our data, and added UUIDs appropriately.&lt;br /&gt;
&lt;br /&gt;
===Step Two: Pulling Data===&lt;br /&gt;
&lt;br /&gt;
The necessary tables for this pull can be found at:&lt;br /&gt;
 /bulk/McNair/Software/Database Scripts/Crunchbase2/'''LoadTables.sql'''&lt;br /&gt;
&lt;br /&gt;
We pulled the relevant data from Crunchbase based on unique UUID matches. In the crunchbase2 database, we used the table ''organizations''.&lt;br /&gt;
&lt;br /&gt;
We pull based on the unique UUIDs found by Maxine, which can be found in the file:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx''', in column W.&lt;br /&gt;
&lt;br /&gt;
The table we get most Crunchbase data from looks like this:&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE organizations;&lt;br /&gt;
 CREATE TABLE organizations (&lt;br /&gt;
  company_name varchar(100),&lt;br /&gt;
  role varchar(255),&lt;br /&gt;
  permalink varchar(255),&lt;br /&gt;
  domain varchar(5000),&lt;br /&gt;
  homepage_url varchar(5000),&lt;br /&gt;
  country_code varchar(10),&lt;br /&gt;
  state_code varchar(2),&lt;br /&gt;
  region varchar(50),&lt;br /&gt;
  city varchar(100),&lt;br /&gt;
  address text,&lt;br /&gt;
  status varchar(50),&lt;br /&gt;
  short_description text,&lt;br /&gt;
  category_list text,&lt;br /&gt;
  category_group_list  text,&lt;br /&gt;
  funding_rounds integer,&lt;br /&gt;
  funding_total_usd money,&lt;br /&gt;
  founded_on date, --yyyy-mm-dd&lt;br /&gt;
  last_funding_on date, --yyyy-mm-dd&lt;br /&gt;
  closed_on date, --yyyy-mm-dd&lt;br /&gt;
  employee_count varchar(255),&lt;br /&gt;
  email varchar(255),&lt;br /&gt;
  phone text,&lt;br /&gt;
  facebook_url varchar(5000),&lt;br /&gt;
  linkedin_url varchar(5000),&lt;br /&gt;
  cb_url varchar(5000),&lt;br /&gt;
  logo_url varchar(5000),&lt;br /&gt;
  twitter_url varchar(5000),&lt;br /&gt;
  alias varchar(10000),&lt;br /&gt;
  uuid varchar(255),&lt;br /&gt;
  created_at date, --yyyy-mm-dd-hh-mm-s.s&lt;br /&gt;
  updated_at date, --yyyy-mm-dd-hh-mm-s.s&lt;br /&gt;
  primary_role varchar(255),&lt;br /&gt;
  type varchar(255)&lt;br /&gt;
 );&lt;br /&gt;
&lt;br /&gt;
'''From this list, we care about the following:&lt;br /&gt;
*company_name,&lt;br /&gt;
*domain,&lt;br /&gt;
*country_code,&lt;br /&gt;
*state_code,&lt;br /&gt;
*city,&lt;br /&gt;
*address,&lt;br /&gt;
*status,&lt;br /&gt;
*short_description,&lt;br /&gt;
*category_list,&lt;br /&gt;
*category_group_list,&lt;br /&gt;
*founded_on,&lt;br /&gt;
*employee_count,&lt;br /&gt;
*linkedin_url,&lt;br /&gt;
*uuid -- our primary key'''&lt;br /&gt;
&lt;br /&gt;
We also want to get more information on organization descriptions. To do so, we can pull ''description'' from the table ''organization_descriptions'', matching based on UUID.&lt;br /&gt;
&lt;br /&gt;
We also, for the purposes of industry classification, want to pull ''category_name'' from the table ''category_groups'', matching based on UUID.&lt;br /&gt;
&lt;br /&gt;
Finally, it may be worthwhile to pull variables such as name, description, and started_on from the ''events'' table, in the hopes of finding Cohort years, or potentially demo days. This can also be matched based on UUID.&lt;br /&gt;
&lt;br /&gt;
Given the aforementioned information, we now have much data that can be used to populate empty cells in our existing data, as well as to create new columns.&lt;br /&gt;
&lt;br /&gt;
===Step Three: Merging===&lt;br /&gt;
&lt;br /&gt;
Of the data we've pulled from Crunchbase, we're interested in merging ''four'' columns with our existing data:&lt;br /&gt;
*domain (to be merged with the empty cells of courl)&lt;br /&gt;
*city, state_code, and country_code (some combination of this is to be merged with the empty cells of colocation)&lt;br /&gt;
*status (to be merged with the empty cells of costatus)&lt;br /&gt;
*short_description and description '''''from the table organization_descriptions''''' (some combination to be merged with empty cells of codescription)&lt;br /&gt;
&lt;br /&gt;
Note: we may also be able to merge some combination of category_list, category_group_list, and (from category_groups table) category_name, to merge with cosector in our data, and use it for [[Maxine Tao]]'s industry classifier.&lt;br /&gt;
&lt;br /&gt;
The other columns can be added to the end of our sheet as supplemental data.&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Merging_Existing_Data_with_Crunchbase&amp;diff=23612</id>
		<title>Merging Existing Data with Crunchbase</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Merging_Existing_Data_with_Crunchbase&amp;diff=23612"/>
		<updated>2018-07-17T17:15:21Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Merging Existing Data with Crunchbase&lt;br /&gt;
|Has owner=Connor Rothschild&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
This page details the process of merging existing data with data pulled from Crunchbase.&lt;br /&gt;
&lt;br /&gt;
==Project Location==&lt;br /&gt;
&lt;br /&gt;
For the merge detailed in this page, our data was from:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/The File to Rule Them All.xlsx&lt;br /&gt;
&lt;br /&gt;
and Crunchbase info can be found in: &lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Cohort Companies with Crunchbase Info.xlsx&lt;br /&gt;
&lt;br /&gt;
The script that loads the Crunchbase data into tables can be found at:&lt;br /&gt;
 /bulk/McNair/Software/Database Scripts/Crunchbase2/LoadTables.sql&lt;br /&gt;
&lt;br /&gt;
==Process==&lt;br /&gt;
&lt;br /&gt;
===Step One: Creating UUID Matches===&lt;br /&gt;
&lt;br /&gt;
We began by making sure our company names were unique, creating a 1-1 relationship (each company name appears only once in our data and only once in the Crunchbase data). We did so using the Matcher: we matched our sheet against itself, and the Crunchbase info (pulled from the organizations table detailed below) against itself, to remove duplicates and leave only unique values. More here: http://mcnair.bakerinstitute.org/wiki/Crunchbase_Data#Collecting_Company_Information&lt;br /&gt;
&lt;br /&gt;
Upon Ed's instruction, we then looked at companies ''in Crunchbase'' whose company name was associated with more than one UUID. Of the 670,000 companies in Crunchbase, only 15,000 fell into this group. From this list of 15,000, we used recursive filtering on additional variables (such as company location) to determine whether any could be properly matched to a company in our data.&lt;br /&gt;
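&lt;br /&gt;
A minimal sketch of how company names tied to more than one UUID might be flagged, assuming the organizations table has been exported as (company_name, uuid) pairs. The function name and the sample rows are hypothetical, and this is not the Matcher itself.&lt;br /&gt;

```python
from collections import defaultdict


def names_with_multiple_uuids(rows):
    """Return {company_name: sorted UUIDs} for names tied to two or more UUIDs.

    rows is an iterable of (company_name, uuid) pairs.
    """
    uuids_by_name = defaultdict(set)
    for company_name, uuid in rows:
        # Normalise case and whitespace so trivial variants group together.
        key = company_name.strip().lower()
        uuids_by_name[key].add(uuid)
    # Every set holds at least one UUID, so "!= 1" means two or more.
    return {
        name: sorted(uuids)
        for name, uuids in uuids_by_name.items()
        if len(uuids) != 1
    }


# Hypothetical sample rows, not real Crunchbase records.
sample = [
    ("Acme", "uuid-1"),
    ("acme ", "uuid-2"),
    ("Globex", "uuid-3"),
]
duplicates = names_with_multiple_uuids(sample)
```

Names returned by this check would still need the location-based filtering described above before a UUID is assigned.&lt;br /&gt;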
&lt;br /&gt;
Upon refining our list based on recursive filtering, we found __ companies which match our data, and added UUIDs appropriately.&lt;br /&gt;
&lt;br /&gt;
===Step Two: Pulling Data===&lt;br /&gt;
&lt;br /&gt;
The necessary tables for this pull can be found at:&lt;br /&gt;
 /bulk/McNair/Software/Database Scripts/Crunchbase2/'''LoadTables.sql'''&lt;br /&gt;
&lt;br /&gt;
We pulled the relevant data from Crunchbase based on unique UUID matches. In the crunchbase2 database, we used the table ''organizations''.&lt;br /&gt;
&lt;br /&gt;
The pull is keyed on the unique UUIDs found by Maxine, which are stored in column W of the file:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
The table we get most Crunchbase data from looks like this:&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE organizations;&lt;br /&gt;
 CREATE TABLE organizations (&lt;br /&gt;
  company_name varchar(100),&lt;br /&gt;
  role varchar(255),&lt;br /&gt;
  permalink varchar(255),&lt;br /&gt;
  domain varchar(5000),&lt;br /&gt;
  homepage_url varchar(5000),&lt;br /&gt;
  country_code varchar(10),&lt;br /&gt;
  state_code varchar(2),&lt;br /&gt;
  region varchar(50),&lt;br /&gt;
  city varchar(100),&lt;br /&gt;
  address text,&lt;br /&gt;
  status varchar(50),&lt;br /&gt;
  short_description text,&lt;br /&gt;
  category_list text,&lt;br /&gt;
  category_group_list  text,&lt;br /&gt;
  funding_rounds integer,&lt;br /&gt;
  funding_total_usd money,&lt;br /&gt;
  founded_on date, --yyyy-mm-dd&lt;br /&gt;
  last_funding_on date, --yyyy-mm-dd&lt;br /&gt;
  closed_on date, --yyyy-mm-dd&lt;br /&gt;
  employee_count varchar(255),&lt;br /&gt;
  email varchar(255),&lt;br /&gt;
  phone text,&lt;br /&gt;
  facebook_url varchar(5000),&lt;br /&gt;
  linkedin_url varchar(5000),&lt;br /&gt;
  cb_url varchar(5000),&lt;br /&gt;
  logo_url varchar(5000),&lt;br /&gt;
  twitter_url varchar(5000),&lt;br /&gt;
  alias varchar(10000),&lt;br /&gt;
  uuid varchar(255),&lt;br /&gt;
  created_at date, --yyyy-mm-dd-hh-mm-s.s&lt;br /&gt;
  updated_at date, --yyyy-mm-dd-hh-mm-s.s&lt;br /&gt;
  primary_role varchar(255),&lt;br /&gt;
  type varchar(255)&lt;br /&gt;
 );&lt;br /&gt;
&lt;br /&gt;
'''From this list, we care about the following:'''&lt;br /&gt;
*company_name&lt;br /&gt;
*domain&lt;br /&gt;
*country_code&lt;br /&gt;
*state_code&lt;br /&gt;
*city&lt;br /&gt;
*address&lt;br /&gt;
*status&lt;br /&gt;
*short_description&lt;br /&gt;
*category_list&lt;br /&gt;
*category_group_list&lt;br /&gt;
*founded_on&lt;br /&gt;
*employee_count&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*uuid (our primary key)&lt;br /&gt;
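&lt;br /&gt;
A minimal sketch of pulling only these columns for a list of UUIDs. It uses an in-memory SQLite table as a stand-in for the crunchbase2 database, so the connection, the "?" placeholder style, the function name, and the sample row are all assumptions; the real database would need its own driver and placeholder syntax.&lt;br /&gt;

```python
import sqlite3

# The columns of interest listed above.
WANTED_COLUMNS = [
    "company_name", "domain", "country_code", "state_code", "city",
    "address", "status", "short_description", "category_list",
    "category_group_list", "founded_on", "employee_count",
    "linkedin_url", "uuid",
]


def pull_by_uuid(conn, uuids):
    """Return one dict per organizations row whose uuid is in uuids."""
    uuids = list(uuids)
    if not uuids:
        return []
    # One placeholder per UUID keeps the values out of the SQL string.
    placeholders = ", ".join("?" for _ in uuids)
    query = (
        "SELECT " + ", ".join(WANTED_COLUMNS)
        + " FROM organizations WHERE uuid IN (" + placeholders + ")"
    )
    cursor = conn.execute(query, uuids)
    return [dict(zip(WANTED_COLUMNS, record)) for record in cursor]


# Stand-in database with one made-up row; every column is stored as text here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE organizations ("
    + ", ".join(col + " text" for col in WANTED_COLUMNS) + ")"
)
conn.execute(
    "INSERT INTO organizations (company_name, domain, uuid) VALUES (?, ?, ?)",
    ("Example Co", "example.com", "u1"),
)
pulled = pull_by_uuid(conn, ["u1", "u-missing"])
```

UUIDs with no match simply return no row, so the result can be shorter than the input list.&lt;br /&gt;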
&lt;br /&gt;
For fuller organization descriptions, we can also pull ''description'' from the table ''organization_descriptions'', matching on UUID.&lt;br /&gt;
&lt;br /&gt;
For industry classification, we also want to pull ''category_name'' from the table ''category_groups'', matching on UUID.&lt;br /&gt;
&lt;br /&gt;
Finally, it may be worthwhile to pull variables such as name, description, and started_on from the ''events'' table, in the hope of finding cohort years or, potentially, demo days. These can also be matched on UUID.&lt;br /&gt;
&lt;br /&gt;
With the data described above, we can populate empty cells in our existing data and create new columns.&lt;br /&gt;
&lt;br /&gt;
===Step Three: Merging===&lt;br /&gt;
&lt;br /&gt;
Of the data we've pulled from Crunchbase, we're interested in merging ''four'' sets of columns with our existing data:&lt;br /&gt;
*domain (to be merged with the empty cells of courl)&lt;br /&gt;
*city, state_code, and country_code (some combination of this is to be merged with the empty cells of colocation)&lt;br /&gt;
*status (to be merged with the empty cells of costatus)&lt;br /&gt;
*short_description and description '''''from the table organization_descriptions''''' (some combination to be merged with empty cells of codescription)&lt;br /&gt;
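&lt;br /&gt;
The fill described above can be sketched as follows, assuming both our sheet and the Crunchbase pull have been loaded as dictionaries and that rows are matched on uuid. The function name, the comma-joined location format, the use of short_description alone for codescription, and the sample rows are assumptions for illustration.&lt;br /&gt;

```python
# Our column mapped to the Crunchbase columns that may fill it when empty.
# The longer description from organization_descriptions could replace
# short_description here if it has been pulled as well.
FILL_MAP = {
    "courl": ["domain"],
    "colocation": ["city", "state_code", "country_code"],
    "costatus": ["status"],
    "codescription": ["short_description"],
}


def fill_empty_cells(our_rows, crunchbase_by_uuid):
    """Return copies of our_rows with empty cells filled from Crunchbase.

    Existing values are never overwritten, and rows without a UUID
    match are returned unchanged.
    """
    merged = []
    for row in our_rows:
        new_row = dict(row)
        cb = crunchbase_by_uuid.get(row.get("uuid"))
        if cb is not None:
            for our_col, cb_cols in FILL_MAP.items():
                if new_row.get(our_col):
                    continue  # keep the value we already have
                parts = [cb[c] for c in cb_cols if cb.get(c)]
                if parts:
                    new_row[our_col] = ", ".join(parts)
        merged.append(new_row)
    return merged


# Made-up rows for illustration only.
ours = [
    {"uuid": "u1", "courl": "", "colocation": "Houston, TX",
     "costatus": "", "codescription": ""},
    {"uuid": "u2", "courl": "example.org", "colocation": "",
     "costatus": "", "codescription": ""},
]
crunchbase = {
    "u1": {"domain": "example.com", "city": "Austin", "state_code": "TX",
           "country_code": "USA", "status": "operating",
           "short_description": "Sample text"},
}
result = fill_empty_cells(ours, crunchbase)
```

Only empty cells are touched, so any value already entered in our sheet takes precedence over Crunchbase.&lt;br /&gt;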
&lt;br /&gt;
Note: we may also be able to combine category_list, category_group_list, and category_name (from the category_groups table) and merge the result with cosector in our data, which could then be used for [[Maxine Tao]]'s industry classifier.&lt;br /&gt;
&lt;br /&gt;
The other columns can be added to the end of our sheet as supplemental data.&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Merging_Existing_Data_with_Crunchbase&amp;diff=23611</id>
		<title>Merging Existing Data with Crunchbase</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Merging_Existing_Data_with_Crunchbase&amp;diff=23611"/>
		<updated>2018-07-17T17:15:03Z</updated>

		<summary type="html">&lt;p&gt;Connorrothschild: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Merging Existing Data with Crunchbase&lt;br /&gt;
|Has owner=Connor Rothschild&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
This page details the process of merging existing data with data pulled from Crunchbase.&lt;br /&gt;
&lt;br /&gt;
==Project Location==&lt;br /&gt;
&lt;br /&gt;
For the merge detailed on this page, our data came from:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/The File to Rule Them All.xlsx&lt;br /&gt;
&lt;br /&gt;
and Crunchbase info can be found in: &lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/Cohort Companies with Crunchbase Info.xlsx&lt;br /&gt;
&lt;br /&gt;
The script that loads the Crunchbase data into tables can be found at:&lt;br /&gt;
 /bulk/McNair/Software/Database Scripts/Crunchbase2/LoadTables.sql&lt;br /&gt;
&lt;br /&gt;
==Process==&lt;br /&gt;
&lt;br /&gt;
===Step One: Creating UUID Matches===&lt;br /&gt;
&lt;br /&gt;
We began by making sure our company names were unique, creating a 1-1 relationship (each company name appears only once in our data and only once in the Crunchbase data). We did so using the Matcher: we matched our sheet against itself, and the Crunchbase info (pulled from the organizations table detailed below) against itself, to remove duplicates and leave only unique values. More here: http://mcnair.bakerinstitute.org/wiki/Crunchbase_Data#Collecting_Company_Information&lt;br /&gt;
&lt;br /&gt;
Upon Ed's instruction, we then looked at companies ''in Crunchbase'' whose company name was associated with more than one UUID. Of the 670,000 companies in Crunchbase, only 15,000 fell into this group. From this list of 15,000, we used recursive filtering on additional variables (such as company location) to determine whether any could be properly matched to a company in our data.&lt;br /&gt;
&lt;br /&gt;
Upon refining our list based on recursive filtering, we found __ companies which match our data, and added UUIDs appropriately.&lt;br /&gt;
&lt;br /&gt;
===Step Two: Pulling Data===&lt;br /&gt;
&lt;br /&gt;
The necessary tables for this pull can be found at:&lt;br /&gt;
 /bulk/McNair/Software/Database Scripts/Crunchbase2/'''LoadTables.sql'''&lt;br /&gt;
&lt;br /&gt;
We pulled the relevant data from Crunchbase based on unique UUID matches. In the crunchbase2 database, we used the table ''organizations''.&lt;br /&gt;
&lt;br /&gt;
The pull is keyed on the unique UUIDs found by Maxine, which are stored in column W of the file:&lt;br /&gt;
 /bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx'''&lt;br /&gt;
&lt;br /&gt;
The table we get most Crunchbase data from looks like this:&lt;br /&gt;
&lt;br /&gt;
 DROP TABLE organizations;&lt;br /&gt;
 CREATE TABLE organizations (&lt;br /&gt;
  company_name varchar(100),&lt;br /&gt;
  role varchar(255),&lt;br /&gt;
  permalink varchar(255),&lt;br /&gt;
  domain varchar(5000),&lt;br /&gt;
  homepage_url varchar(5000),&lt;br /&gt;
  country_code varchar(10),&lt;br /&gt;
  state_code varchar(2),&lt;br /&gt;
  region varchar(50),&lt;br /&gt;
  city varchar(100),&lt;br /&gt;
  address text,&lt;br /&gt;
  status varchar(50),&lt;br /&gt;
  short_description text,&lt;br /&gt;
  category_list text,&lt;br /&gt;
  category_group_list  text,&lt;br /&gt;
  funding_rounds integer,&lt;br /&gt;
  funding_total_usd money,&lt;br /&gt;
  founded_on date, --yyyy-mm-dd&lt;br /&gt;
  last_funding_on date, --yyyy-mm-dd&lt;br /&gt;
  closed_on date, --yyyy-mm-dd&lt;br /&gt;
  employee_count varchar(255),&lt;br /&gt;
  email varchar(255),&lt;br /&gt;
  phone text,&lt;br /&gt;
  facebook_url varchar(5000),&lt;br /&gt;
  linkedin_url varchar(5000),&lt;br /&gt;
  cb_url varchar(5000),&lt;br /&gt;
  logo_url varchar(5000),&lt;br /&gt;
  twitter_url varchar(5000),&lt;br /&gt;
  alias varchar(10000),&lt;br /&gt;
  uuid varchar(255),&lt;br /&gt;
  created_at date, --yyyy-mm-dd-hh-mm-s.s&lt;br /&gt;
  updated_at date, --yyyy-mm-dd-hh-mm-s.s&lt;br /&gt;
  primary_role varchar(255),&lt;br /&gt;
  type varchar(255)&lt;br /&gt;
 );&lt;br /&gt;
&lt;br /&gt;
'''From this list, we care about the following:'''&lt;br /&gt;
*company_name&lt;br /&gt;
*domain&lt;br /&gt;
*country_code&lt;br /&gt;
*state_code&lt;br /&gt;
*city&lt;br /&gt;
*address&lt;br /&gt;
*status&lt;br /&gt;
*short_description&lt;br /&gt;
*category_list&lt;br /&gt;
*category_group_list&lt;br /&gt;
*founded_on&lt;br /&gt;
*employee_count&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*uuid (our primary key)&lt;br /&gt;
&lt;br /&gt;
For fuller organization descriptions, we can also pull ''description'' from the table ''organization_descriptions'', matching on UUID.&lt;br /&gt;
&lt;br /&gt;
For industry classification, we also want to pull ''category_name'' from the table ''category_groups'', matching on UUID.&lt;br /&gt;
&lt;br /&gt;
Finally, it may be worthwhile to pull variables such as name, description, and started_on from the ''events'' table, in the hope of finding cohort years or, potentially, demo days. These can also be matched on UUID.&lt;br /&gt;
&lt;br /&gt;
With the data described above, we can populate empty cells in our existing data and create new columns.&lt;br /&gt;
&lt;br /&gt;
===Step Three: Merging===&lt;br /&gt;
&lt;br /&gt;
Of the data we've pulled from Crunchbase, we're interested in merging ''four'' sets of columns with our existing data:&lt;br /&gt;
*domain (to be merged with the empty cells of courl)&lt;br /&gt;
*city, state_code, and country_code (some combination of this is to be merged with the empty cells of colocation)&lt;br /&gt;
*status (to be merged with the empty cells of costatus)&lt;br /&gt;
*short_description and description '''''from the table organization_descriptions''''' (some combination to be merged with empty cells of codescription)&lt;br /&gt;
&lt;br /&gt;
Note: we may also be able to combine category_list, category_group_list, and category_name (from the category_groups table) and merge the result with cosector in our data, which could then be used for [[Maxine Tao]]'s industry classifier.&lt;br /&gt;
&lt;br /&gt;
The other columns can be added to the end of our sheet as supplemental data.&lt;/div&gt;</summary>
		<author><name>Connorrothschild</name></author>
		
	</entry>
</feed>