==Useful pages==

*[[U.S. Seed Accelerators]]
**[[Accelerator Seed List (Data)]]
*[[Accelerator Demo Day]]
**[[Mechanical Turk (Tool)]]
*[[Crunchbase Accelerator Founders]]
**[[Crunchbase Data#Collecting Company Information]]
**[[Merging Existing Data with Crunchbase]]
*[[Whois Parser]]
*[[URL Finder (Tool)]]
*[[Industry Classifier]]
*[[Industry classifier yang]]
*[[VentureXpert Data]]

Please add (or subtract) other relevant (or irrelevant) pages!

==Database specification==

===Preamble===

We need to get the data into approximately 3NF to prevent errors and make it more useable.

Inputs:
E:\McNair\Projects\Accelerators\Summer 2018\The File To Rule Them All.xlsx
https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit#gid=0

Recent work on the above from Connor:
*made the Demo Day Timing Google sheet as clean as possible (fixed dates, removed duplicates, created season column)
*recoded the employee count
*normalized the investment amount

Connor's next to do:
*Founders Experience: code job title
*Founders Education: remove unknowns, code degree and code major
*Fix multiple campus and cohorts (see below).

This is documented on [[U.S._Seed_Accelerators#Update_for_Hira]].

The current database work is in '''vcdb2'''. The code to build the relevant tables in vcdb2 is in:
E:\McNair\Projects\Accelerators\LoadAcceleratorDataV2.sql

Note that AddCBData.sql can be run on '''crunchbase2''' to get the relevant crunchbase data for import into vcdb2 (or elsewhere).

However, we should build a new database for this project:
createdb accelerator

As well as a new folder (Z:\accelerator) for the data.

The work in vcdb2 is essentially to build the data that goes into the File To Rule Them All.xlsx. We should start from there!

===Suggested Spec===

We need to address the issue with multiple campus and cohorts. This will require loading and manipulation of the data in SQL (Hira) as well as some manual fixes to the data (Connor).

Accelerator Table (fieldname colname/DISCARD, from sheet "Accelerators Final")
*acceleratorname (Primary Key) Accelerators
*url homepage_url
*cohorturl cohort page URL
*cohortlisting Break out cohorts on the website? (Y/N)
*type Type
*alive Alive
*typenote Type notes
*weeks Weeks
*durationnotes Duration Notes
*city city (note: address info is for the accelerator HQ here)
*state state/region (note: address info is for the accelerator HQ here)
*foundingdate Creation Date
*terms Terms of Joining Accelerator (non-equity pay/equity/free)
*equity Equity?
*equityamountdesc Equity Amount
*equityamount Equity Amount Normalized (only 1s, range is averaged)
*investmentamountdesc Investment Amount
*investmentamount Investment Amount Normalized (Midpoint)
*investmentnotes Investment Notes
*discard industry DISCARD
*industry Industry
*specgenindu Specified/General
*address Address (note: address info is for the accelerator HQ here)
*subtype DISCARD
*nonprofit nonprofit?
*studentfocus designed for students (Y/N)
*multicampus Multiple campuses? (Y/N)
*software tech (Y/N) DISCARD
*stagepref What stage do they look for in cohort companies (SEL)
*cbconame Name in Crunchbase
*uuid UUID

Note that we will assume that duration and terms are common across all cohorts run by the same accelerator. We might need to revisit this assumption.

CohortCompany Table (fieldname colname/DISCARD, from sheet "Cohorts Final")
*conamestd (primary key)
*coname
*conameorg
*colocation
*city
*state_code
*country_code
*address
*codescription
*short_desc
*long_desc
*cosectors
*costatus
*cofundingstage
*courl
*uuid
*category_list
*category_group_list
*founded_on
*employee_count
*emp_count_scale
*linkedin_url
*gotvc

We then need to deduplicate this table and make sure that conamestd is a valid primary key.

Note that we won't be taking the following from "Cohorts Final" for this table:
*year
*accelerator
*cohort
*quarter
*acclocation
*accperks DISCARD
*cofounder DISCARD

CohortParticipation
*Cohort
*year
*quarter
*accelerator (foreign key)
*conamestd (foreign key)

===Going further===

We should likely rebuild the cohort variable to make it a sole "primary key" for a cohort. This would mean turning each entry of cohort into something unique like: TechStars Boulder Fall 2017 or 1440 Cohort 1, so that this key could look up the year of the cohort, its quarter, etc. We could then break CohortParticipation into two tables:

CohortParticipation
*Cohort
*accelerator (foreign key)
*conamestd (foreign key)

Cohort
*Cohort
*year
*quarter

We could then add campus and a seperate table for campuses:

Campus Table
*CampusName (e.g., Techstars Boulder)
*Accelerator (foreign key)
*Address
*City
*State
*Zip
*Description

===Founders Information===

The three founder sheets turn into three tables nicely. We don't need to renormalize them for now, just fix up their variables and do some matching on employer.

Founders:
*Accelerator
*First Name
*Last Name
*Full Name
*Current Job
*Current Location

FoundersExperience:
*Accelerator
*First Name
*Last Name
*Full Name
*Employer
*Job Title
*Dates Employed
*Time Employed
*Location
*Extra Description

FoundersEducation:
*Accelerator
*First Name
*Last Name
*Full Name
*School Name
*Degree Name
*Major
*Attended
*Graduated
*Societies

==To do/For consideration==

Minh:
*Create a format for collecting timing data
*Put the timing job on Turk

Connor:
*Try for missing timings that we really (new process) need
*Col W should be headquarter address
*What stage- Clean up?

Maxine:
*Build the google URL finder
*Industry classification from description

Grace:
*Process and Join in new timing data - new date located in Z:/accelerator/Formatted Timing Info.txt
*Make a category group to minorcode lookup - in E:/McNair/Projects/Accelerators/Summer 2018/Cohorts Final - minor code priority Grace.xlsx
*Run WHOIS crawler on all valid URLs (not facebook pages, etc.) - Maxine did this
*Founders Experience: Match Employers to VC funds/firms, VC backed startups (requires data from Augi) - added 2 columns to The File to Rule Them All (VC and VC start up)

==Accelerator Data Assembly Progress (Hira) ==

*All data files are in Z:/accelerator
*The SQL file that loads all data is: LoadAccData.sql. It is located in E:\McNair\Projects\Accelerators\Summer 2018.

=== Data assembly details ===

The SQL file LoadAccData.sql currently loads data on Cohorts final and Founders from:
E:\McNair\Projects\Accelerators\Summer 2018\The File To Rule Them All.xlsx

It creates the following tables:

1) cohortsfinal - source file: Cohorts Final sheet in "The File to Rule Them All".

2) CohortCompany - this uses data in cohortsfinal and creates a table with the following:
*conamestd
*conameorg
*colocation
*city
*state_code
*country_code
*address
*codescription
*short_desc
*long_desc
*cosectors
*costatus
*cofundingstage
*courl
*uuid
*category_list
*category_group_list
*founded_on
*employee_count
*emp_count_scale
*linkedin_url
*gotvc

3)CohortParticipation - uses table cohortsfinal
*cohort
*year
*quarter
*accelerator
*conamestd

4) timing_final - This table is based on the most updated information on timing compiled in source file: Z:/accelerator/Formatted Timing Info.txt (by Grace). It includes:
*coname
*acceleratorname
*keyword
*url
*webpage
*predicted
*gooddata
*page_details
*full_date
*month
*year
*cohort_name
*notes
*prog_duration_wks
*actual_date
*actual_month
*actual_year
*season

5) Founders - source file: "The File to Rule Them All - Founders main sheet"
*Accelerator
*First_Name
*Last_Name
*Full_Name
*Current_Job
*Current_Location

6) founders_experience - source file: "The File to Rule Them All - Founders experience sheet"
*Accelerator
*First_Name
*Last_Name
*Full_Name
*Employer
*VC
*VC_backed_startup
*OLD_Job_Title
*NEW_Job_Title
*Dates_Employed
*Time_Employed
*Location
*Extra_Description

7) additional_timing_info - source file: "merging_work.xlxs" located in: E:\Projects\McNair\Seed DB
8) additional_timing_info2 - source file: "formatted timing info2.txt" located in E:\Projects\McNair\Accelerators\Summer 2018. This was collected through MTurks.
Tables 7 and 8 include columns:
*coname
*acceleratorname
*cohort_name
*date
*month
*year
*season

9) timing_combined - This table combines all timing information we have and appends tables 4, 7 and 8.
10) cohortcompanies_wtiming - merges data in tables cohortcompany and timing_combined

==Grace's Code==
===format_timing.py===
E:/McNair/Projects/Accelerators/Summer 2018/format_timing.py

Input: a txt file with accelerator mapped to multiple companies(in a single cell separated by columns or in separate rows)

Output: txt file with companies mapped to accelerators with all the other information in the original file

===prioritycodecategory.py===
E:/McNair/Projects/Accelerators/Summer 2018/prioritycodecategory.py

Input: txt file with a list of category groups
Output: txt file with line number and minor code

Final output: I took the txt file and copied the codes and pasted it into column Z of the Cohorts Final sheet from The File to Rule Them All.
E:/McNair/Projects/Accelerators/Summer 2018/Cohorts Final - minor code priority Grace.xlsx

Chooses code based on important from priority ranking dictionary before choosing arbitrarily.

===codecategory.py===

2018-07-10T21:29:29Z

GraceTan:

{{McNair Projects
|Has title=Crunchbase Accelerator Founders
|Has owner=Grace Tan,
|Has start date=6/18/18
|Has project status=Active
}}

==Related Pages==
[[Crunchbase Data]]
[[Crunchbase Accelerator Equity]]
[[LinkedIn Crawler (Python)]]

==Getting Data==
To get the founder UUIDs from each accelerator, input the accelerator UUID (or name all lowercase if its one work) into this link:
https://api.crunchbase.com/v3.1/organizations/ + UUID of organization + ?relationships=founders&user_key=662e263576fe3e4ea5991edfbcfb9883

==Code==
===scrapefounders.py===
This code lives in
Z:\crunchbase2\scrapefounders.py

This program takes Accelerators and UUIDs.txt found in a Z:\crunchbase2\Accelerators and UUIDs.txt and extracts the accelerator UUIDs and loads the information of each founder from the crunchbase API using the link above. It then takes the information given by the API and returns a dictionary of accelerator UUIDs as keys and founder UUIDs as values.

==Updated LinkedIn Crawler==
We will be using a LinkedIn Crawler to find information about accelerator founders. There is a previous project whose code is found in
E:/McNair/Projects/LinkedIn Crawler/LinkedIn_Crawler/linkedin

My code is found in the selenium computer at the root and at
E:/McNair/Projects/LinkedIn Crawler 2018

There are 6 python files.

===New Test Account===
Username: mcboatfaceboaty670@gmail.com
Password: McNair2018

Use the selenium computer on Rice Visitor wifi.
After logging in a couple of times, LinkedIn will get suspicious and ask you to confirm that you are not a robot using reCaptcha. I got around this by delaying the program by 3 minutes so that I had time to complete the reCaptcha test. However, sometimes reCaptcha loses connection and it forces you to continue the tests which can be frustrating. When this happens, I disconnect and reconnect from the wifi as well as switch between the test accounts.

I never figured out how to stop reCaptcha from losing connection so I spent a lot of time completing reCaptcha tests.

==Code==
===linkedin_crawler_main.py===
Main function for the crawler which opens up a window and enters the known linkedin urls of each founder and puts the information into txt files. It then uses the search box to search for founder name + company and selects the href from the html and opens the profile on a new page. If a founder cannot be found(there are no search results), the founder name and company is put into unavailable_profiles.txt (might be called something else but should have "unavailable" in the name).

===linked_in_crawler.py===
Contains the LinkedInCrawler class that contains functions for login, logout, search, etc. Relies heavily on locating the xpath of an element of that we want on the webpage. I ran into a lot of difficulty with this because the xpath in the code no longer exists in the new linkedin html which includes dynamic ids. To get around this, I located different aspects that we could use to find an xpath to the element we were looking for and also located elements by their css.

Crunchbase Accelerator Founders

2018-07-10T21:20:48Z

GraceTan: