Difference between revisions of "Seed Accelerator Data Assembly"

Latest revision as of 17:28, 3 August 2018

Useful pages

Please add (or subtract) other relevant (or irrelevant) pages!

Database specification

Preamble

We need to get the data into approximately 3NF to prevent errors and make it more useable.

Inputs:

E:\McNair\Projects\Accelerators\Summer 2018\The File To Rule Them All.xlsx
https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit#gid=0

Recent work on the above from Connor:

made the Demo Day Timing Google sheet as clean as possible (fixed dates, removed duplicates, created season column)
recoded the employee count
normalized the investment amount

Connor's next to do:

Founders Experience: code job title
Founders Education: remove unknowns, code degree and code major
Fix multiple campus and cohorts (see below).

This is documented on U.S._Seed_Accelerators#Update_for_Hira.

The current database work is in vcdb2. The code to build the relevant tables in vcdb2 is in:

E:\McNair\Projects\Accelerators\LoadAcceleratorDataV2.sql

Note that AddCBData.sql can be run on crunchbase2 to get the relevant crunchbase data for import into vcdb2 (or elsewhere).

However, we should build a new database for this project:

createdb accelerator

As well as a new folder (Z:\accelerator) for the data.

The work in vcdb2 is essentially to build the data that goes into the File To Rule Them All.xlsx. We should start from there!

Suggested Spec

We need to address the issue with multiple campus and cohorts. This will require loading and manipulation of the data in SQL (Hira) as well as some manual fixes to the data (Connor).

Accelerator Table (fieldname colname/DISCARD, from sheet "Accelerators Final")

acceleratorname (Primary Key) Accelerators
url homepage_url
cohorturl cohort page URL
cohortlisting Break out cohorts on the website? (Y/N)
type Type
alive Alive
typenote Type notes
weeks Weeks
durationnotes Duration Notes
city city (note: address info is for the accelerator HQ here)
state state/region (note: address info is for the accelerator HQ here)
foundingdate Creation Date
terms Terms of Joining Accelerator (non-equity pay/equity/free)
equity Equity?
equityamountdesc Equity Amount
equityamount Equity Amount Normalized (only 1s, range is averaged)
investmentamountdesc Investment Amount
investmentamount Investment Amount Normalized (Midpoint)
investmentnotes Investment Notes
discard industry DISCARD
industry Industry
specgenindu Specified/General
address Address (note: address info is for the accelerator HQ here)
subtype DISCARD
nonprofit nonprofit?
studentfocus designed for students (Y/N)
multicampus Multiple campuses? (Y/N)
software tech (Y/N) DISCARD
stagepref What stage do they look for in cohort companies (SEL)
cbconame Name in Crunchbase
uuid UUID

Note that we will assume that duration and terms are common across all cohorts run by the same accelerator. We might need to revisit this assumption.

CohortCompany Table (fieldname colname/DISCARD, from sheet "Cohorts Final")

conamestd (primary key)
coname
conameorg
colocation
city
state_code
country_code
address
codescription
short_desc
long_desc
cosectors
costatus
cofundingstage
courl
uuid
category_list
category_group_list
founded_on
employee_count
emp_count_scale
linkedin_url
gotvc

We then need to deduplicate this table and make sure that conamestd is a valid primary key.

Note that we won't be taking the following from "Cohorts Final" for this table:

year
accelerator
cohort
quarter
acclocation
accperks DISCARD
cofounder DISCARD

CohortParticipation

Cohort
year
quarter
accelerator (foreign key)
conamestd (foreign key)

Going further

We should likely rebuild the cohort variable to make it a sole "primary key" for a cohort. This would mean turning each entry of cohort into something unique like: TechStars Boulder Fall 2017 or 1440 Cohort 1, so that this key could look up the year of the cohort, its quarter, etc. We could then break CohortParticipation into two tables:

CohortParticipation

Cohort
accelerator (foreign key)
conamestd (foreign key)

Cohort

Cohort
year
quarter

We could then add campus and a seperate table for campuses:

Campus Table

CampusName (e.g., Techstars Boulder)
Accelerator (foreign key)
Address
City
State
Zip
Description

Founders Information

The three founder sheets turn into three tables nicely. We don't need to renormalize them for now, just fix up their variables and do some matching on employer.

Founders:

Accelerator
First Name
Last Name
Full Name
Current Job
Current Location

FoundersExperience:

Accelerator
First Name
Last Name
Full Name
Employer
Job Title
Dates Employed
Time Employed
Location
Extra Description

FoundersEducation:

Accelerator
First Name
Last Name
Full Name
School Name
Degree Name
Major
Attended
Graduated
Societies

To do/For consideration

Minh:

Create a format for collecting timing data
Put the timing job on Turk

Connor:

Try for missing timings that we really (new process) need
Col W should be headquarter address
What stage- Clean up?

Maxine:

Build the google URL finder
Industry classification from description

Grace:

Process and Join in new timing data - new date located in Z:/accelerator/Formatted Timing Info.txt
Make a category group to minorcode lookup - in E:/McNair/Projects/Accelerators/Summer 2018/Cohorts Final - minor code priority Grace.xlsx
Run WHOIS crawler on all valid URLs (not facebook pages, etc.) - Maxine did this
Founders Experience: Match Employers to VC funds/firms, VC backed startups (requires data from Augi) - added 2 columns to The File to Rule Them All (VC and VC start up)

Accelerator Data Assembly Progress (Hira)

All data files are in Z:/accelerator
The SQL file that loads all data is: LoadAccData.sql. It is located in E:\McNair\Projects\Accelerators\Summer 2018.

Data assembly details

The SQL file LoadAccData.sql currently loads data on Cohorts final and Founders from:

 E:\McNair\Projects\Accelerators\Summer 2018\The File To Rule Them All.xlsx

It creates the following tables:

1) cohortsfinal - source file: Cohorts Final sheet in "The File to Rule Them All".

2) CohortCompany - this uses data in cohortsfinal and creates a table with the following:

conamestd
conameorg
colocation
city
state_code
country_code
address
codescription
short_desc
long_desc
cosectors
costatus
cofundingstage
courl
uuid
category_list
category_group_list
founded_on
employee_count
emp_count_scale
linkedin_url
gotvc

3)CohortParticipation - uses table cohortsfinal

cohort
year
quarter
accelerator
conamestd

4) timing_final - This table is based on the most updated information on timing compiled in source file: Z:/accelerator/Formatted Timing Info.txt (by Grace). It includes:

coname
acceleratorname
keyword
url
webpage
predicted
gooddata
page_details
full_date
month
year
cohort_name
notes
prog_duration_wks
actual_date
actual_month
actual_year
season

5) Founders - source file: "The File to Rule Them All - Founders main sheet"

Accelerator
First_Name
Last_Name
Full_Name
Current_Job
Current_Location

6) founders_experience - source file: "The File to Rule Them All - Founders experience sheet"

Accelerator
First_Name
Last_Name
Full_Name
Employer
VC
VC_backed_startup
OLD_Job_Title
NEW_Job_Title
Dates_Employed
Time_Employed
Location
Extra_Description

7) additional_timing_info - source file: "merging_work.xlxs" located in: E:\Projects\McNair\Seed DB 8) additional_timing_info2 - source file: "formatted timing info2.txt" located in E:\Projects\McNair\Accelerators\Summer 2018. This was collected through MTurks. Tables 7 and 8 include columns:

coname
acceleratorname
cohort_name
date
month
year
season

9) timing_combined - This table combines all timing information we have and appends tables 4, 7 and 8. 10) cohortcompanies_wtiming - merges data in tables cohortcompany and timing_combined

Grace's Code

format_timing.py

 E:/McNair/Projects/Accelerators/Summer 2018/format_timing.py

Input: a txt file with accelerator mapped to multiple companies(in a single cell separated by columns or in separate rows)

Output: txt file with companies mapped to accelerators with all the other information in the original file

prioritycodecategory.py

 E:/McNair/Projects/Accelerators/Summer 2018/prioritycodecategory.py

Input: txt file with a list of category groups (Column Y of Cohorts Final in The File to Rule Them All)

Output: txt file with line number and minor code

Final output: I took the txt file and copied the codes and pasted it into the added column Z of the Cohorts Final sheet from The File to Rule Them All.

 E:/McNair/Projects/Accelerators/Summer 2018/Cohorts Final - minor code priority Grace.xlsx

Chooses code based on important from priority ranking dictionary before choosing arbitrarily.

codecategory.py

 E:/McNair/Projects/Accelerators/Summer 2018/codecategory.py

Input: txt file with a list of category groups (column Y of Cohorts Final in The File to Rule Them All)

Output: txt file with line number and multiple minor codes

Final output: I took the minor codes and copied them into column Z of this sheet (a copy of The File to Rule Them All with this added column)

 E:/McNair/Projects/Accelerators/Summer 2018/Cohorts Final - minor code (no priority) Grace.xlsx

I arbitrarily chose the first code when multiple were given. I fixed this in excel by separating on commas. I also manually did a lot of them which is why there are mode values in this file than the one with priority.

Difference between revisions of "Seed Accelerator Data Assembly"

Latest revision as of 17:28, 3 August 2018

Contents

Useful pages

Database specification

Preamble

Suggested Spec

Going further

Founders Information

To do/For consideration

Accelerator Data Assembly Progress (Hira)

Data assembly details

Grace's Code

format_timing.py

prioritycodecategory.py

codecategory.py

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Sites

Sections

Organizations

Help

Tools

@@ Line 210: / Line 210: @@
 Grace:
 *Process and Join in new timing data - new date located in Z:/accelerator/Formatted Timing Info.txt
-*Make a category group to minorcode lookup
+*Make a category group to minorcode lookup - in E:/McNair/Projects/Accelerators/Summer 2018/Cohorts Final - minor code priority Grace.xlsx
-*Run WHOIS crawler on all valid URLs (not facebook pages, etc.)
+*Run WHOIS crawler on all valid URLs (not facebook pages, etc.) - Maxine did this
-*Founders Experience: Match Employers to VC funds/firms, VC backed startups (requires data from Augi)
+*Founders Experience: Match Employers to VC funds/firms, VC backed startups (requires data from Augi) - added 2 columns to The File to Rule Them All (VC and VC start up)
 ==Accelerator Data Assembly Progress (Hira) ==
@@ Line 317: / Line 317: @@
 ) timing_combined - This table combines all timing information we have and appends tables 4, 7 and 8.
 ) cohortcompanies_wtiming - merges data in tables cohortcompany and timing_combined
+==Grace's Code==
+===format_timing.py===
+  E:/McNair/Projects/Accelerators/Summer 2018/format_timing.py
+Input: a txt file with accelerator mapped to multiple companies(in a single cell separated by columns or in separate rows)
+Output: txt file with companies mapped to accelerators with all the other information in the original file
+===prioritycodecategory.py===
+  E:/McNair/Projects/Accelerators/Summer 2018/prioritycodecategory.py
+Input: txt file with a list of category groups (Column Y of Cohorts Final in The File to Rule Them All)
+Output: txt file with line number and minor code
+Final output: I took the txt file and copied the codes and pasted it into the added column Z of the Cohorts Final sheet from The File to Rule Them All.
+  E:/McNair/Projects/Accelerators/Summer 2018/Cohorts Final - minor code priority Grace.xlsx
+Chooses code based on important from priority ranking dictionary before choosing arbitrarily.
+===codecategory.py===
+  E:/McNair/Projects/Accelerators/Summer 2018/codecategory.py
+Input: txt file with a list of category groups (column Y of Cohorts Final in The File to Rule Them All)
+Output: txt file with line number and multiple minor codes
+Final output: I took the minor codes and copied them into column Z of this sheet (a copy of The File to Rule Them All with this added column)
+  E:/McNair/Projects/Accelerators/Summer 2018/Cohorts Final - minor code (no priority) Grace.xlsx
+I arbitrarily chose the first code when multiple were given. I fixed this in excel by separating on commas. I also manually did a lot of them which is why there are mode values in this file than the one with priority.