<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>http://www.edegan.com/mediawiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=GraceTan</id>
	<title>edegan.com - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="http://www.edegan.com/mediawiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=GraceTan"/>
	<link rel="alternate" type="text/html" href="http://www.edegan.com/wiki/Special:Contributions/GraceTan"/>
	<updated>2026-06-08T07:09:02Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.34.2</generator>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Grace_Tan&amp;diff=23932</id>
		<title>Grace Tan</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Grace_Tan&amp;diff=23932"/>
		<updated>2018-08-03T22:42:07Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Staff&lt;br /&gt;
|position=Tech Team&lt;br /&gt;
|name=Grace Tan&lt;br /&gt;
|user_image=Profile_Grace_Tan.jpg&lt;br /&gt;
|degree=BA&lt;br /&gt;
|major=Computer Science&lt;br /&gt;
|class=2021&lt;br /&gt;
|join_date=Summer 2018&lt;br /&gt;
|skills=Python, Java&lt;br /&gt;
|email=gzt1@rice.edu&lt;br /&gt;
|skype_name=grace.tan330&lt;br /&gt;
|status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Work Logs==&lt;br /&gt;
[[Grace Tan (Work Log)]]&lt;br /&gt;
&lt;br /&gt;
==Projects==&lt;br /&gt;
[[Crunchbase Data]] &lt;br /&gt;
&lt;br /&gt;
[[Crunchbase Accelerator Founders]]&lt;br /&gt;
&lt;br /&gt;
[[Patent Thicket]]&lt;br /&gt;
&lt;br /&gt;
[[Seed Accelerator Data Assembly#Grace's Code]]&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Seed_Accelerator_Data_Assembly&amp;diff=23927</id>
		<title>Seed Accelerator Data Assembly</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Seed_Accelerator_Data_Assembly&amp;diff=23927"/>
		<updated>2018-08-03T21:28:45Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: /* prioritycodecategory.py */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Useful pages==&lt;br /&gt;
&lt;br /&gt;
*[[U.S. Seed Accelerators]]&lt;br /&gt;
**[[Accelerator Seed List (Data)]]&lt;br /&gt;
*[[Accelerator Demo Day]]&lt;br /&gt;
**[[Mechanical Turk (Tool)]]&lt;br /&gt;
*[[Crunchbase Accelerator Founders]]&lt;br /&gt;
**[[Crunchbase Data#Collecting Company Information]]&lt;br /&gt;
**[[Merging Existing Data with Crunchbase]]&lt;br /&gt;
*[[Whois Parser]]&lt;br /&gt;
*[[URL Finder (Tool)]]&lt;br /&gt;
*[[Industry Classifier]]&lt;br /&gt;
*[[Industry classifier yang]]&lt;br /&gt;
*[[VentureXpert Data]]&lt;br /&gt;
&lt;br /&gt;
Please add (or subtract) other relevant (or irrelevant) pages!&lt;br /&gt;
&lt;br /&gt;
==Database specification==&lt;br /&gt;
&lt;br /&gt;
===Preamble===&lt;br /&gt;
&lt;br /&gt;
We need to get the data into approximately third normal form (3NF) to prevent errors and make it more usable.&lt;br /&gt;
&lt;br /&gt;
Inputs:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\The File To Rule Them All.xlsx&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit#gid=0&lt;br /&gt;
&lt;br /&gt;
Recent work on the above from Connor:&lt;br /&gt;
*made the Demo Day Timing Google sheet as clean as possible (fixed dates, removed duplicates, created season column)&lt;br /&gt;
*recoded the employee count&lt;br /&gt;
*normalized the investment amount&lt;br /&gt;
&lt;br /&gt;
Connor's next to do: &lt;br /&gt;
*Founders Experience: code job title&lt;br /&gt;
*Founders Education: remove unknowns, code degree and code major&lt;br /&gt;
*Fix multiple campus and cohorts (see below). &lt;br /&gt;
&lt;br /&gt;
This is documented on [[U.S._Seed_Accelerators#Update_for_Hira]].&lt;br /&gt;
&lt;br /&gt;
The current database work is in '''vcdb2'''. The code to build the relevant tables in vcdb2 is in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\LoadAcceleratorDataV2.sql&lt;br /&gt;
 &lt;br /&gt;
Note that AddCBData.sql can be run on '''crunchbase2''' to get the relevant crunchbase data for import into vcdb2 (or elsewhere).&lt;br /&gt;
&lt;br /&gt;
However, we should build a new database for this project:&lt;br /&gt;
 createdb accelerator&lt;br /&gt;
&lt;br /&gt;
As well as a new folder (Z:\accelerator) for the data.&lt;br /&gt;
&lt;br /&gt;
The work in vcdb2 is essentially to build the data that goes into the File To Rule Them All.xlsx. We should start from there!&lt;br /&gt;
&lt;br /&gt;
===Suggested Spec===&lt;br /&gt;
&lt;br /&gt;
We need to address the issue with multiple campus and cohorts. This will require loading and manipulation of the data in SQL (Hira) as well as some manual fixes to the data (Connor).&lt;br /&gt;
&lt;br /&gt;
Accelerator Table (fieldname	colname/DISCARD, from sheet &amp;quot;Accelerators Final&amp;quot;)&lt;br /&gt;
*acceleratorname (Primary Key)	Accelerators&lt;br /&gt;
*url	homepage_url&lt;br /&gt;
*cohorturl	cohort page URL&lt;br /&gt;
*cohortlisting	Break out cohorts on the website? (Y/N)&lt;br /&gt;
*type	Type&lt;br /&gt;
*alive	Alive&lt;br /&gt;
*typenote	Type notes&lt;br /&gt;
*weeks	Weeks&lt;br /&gt;
*durationnotes	Duration Notes&lt;br /&gt;
*city	city	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*state	state/region	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*foundingdate	Creation Date&lt;br /&gt;
*terms	Terms of Joining Accelerator (non-equity pay/equity/free)&lt;br /&gt;
*equity	Equity?&lt;br /&gt;
*equityamountdesc	Equity Amount&lt;br /&gt;
*equityamount	Equity Amount Normalized (only 1s, range is averaged)&lt;br /&gt;
*investmentamountdesc	Investment Amount &lt;br /&gt;
*investmentamount	Investment Amount Normalized (Midpoint) &lt;br /&gt;
*investmentnotes	Investment Notes&lt;br /&gt;
*discard industry	DISCARD&lt;br /&gt;
*industry	Industry	&lt;br /&gt;
*specgenindu	Specified/General&lt;br /&gt;
*address	Address	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*subtype	DISCARD&lt;br /&gt;
*nonprofit	nonprofit?&lt;br /&gt;
*studentfocus	designed for students (Y/N)&lt;br /&gt;
*multicampus	Multiple campuses? (Y/N)&lt;br /&gt;
*software tech (Y/N)	DISCARD&lt;br /&gt;
*stagepref	What stage do they look for in cohort companies (SEL)&lt;br /&gt;
*cbconame	Name in Crunchbase&lt;br /&gt;
*uuid	UUID&lt;br /&gt;
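&lt;br /&gt;
The normalized amount fields above (equityamount, investmentamount) average a range down to a single value. A minimal sketch of that midpoint step follows, assuming the description cells hold one or two plain numbers; the function name normalize_amount and the sample strings are hypothetical and not taken from the spreadsheet.&lt;br /&gt;
&lt;br /&gt;
```python
import re


def normalize_amount(desc):
    """Return the midpoint of a range such as '$20,000 - $50,000', or the single value.

    Hypothetical helper: the real cell formats in the sheet may differ.
    """
    # Pull out every number, allowing thousands separators and decimals.
    found = re.findall(r"\d[\d,]*(?:\.\d+)?", desc)
    numbers = [float(n.replace(",", "")) for n in found]
    if not numbers:
        return None  # nothing numeric to normalize
    # Use at most the first two numbers: the low and high ends of the range.
    ends = numbers[:2]
    return sum(ends) / len(ends)


print(normalize_amount("$20,000 - $50,000"))
print(normalize_amount("6%"))
```
&lt;br /&gt;
Cells with no number at all return None, so they can be reviewed by hand rather than silently coded as zero.&lt;br /&gt;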
&lt;br /&gt;
Note that we will assume that duration and terms are common across all cohorts run by the same accelerator. We might need to revisit this assumption.&lt;br /&gt;
&lt;br /&gt;
CohortCompany Table (fieldname	colname/DISCARD, from sheet &amp;quot;Cohorts Final&amp;quot;)&lt;br /&gt;
*conamestd (primary key)&lt;br /&gt;
*coname&lt;br /&gt;
*conameorg&lt;br /&gt;
*colocation&lt;br /&gt;
*city&lt;br /&gt;
*state_code&lt;br /&gt;
*country_code&lt;br /&gt;
*address&lt;br /&gt;
*codescription&lt;br /&gt;
*short_desc&lt;br /&gt;
*long_desc&lt;br /&gt;
*cosectors&lt;br /&gt;
*costatus&lt;br /&gt;
*cofundingstage&lt;br /&gt;
*courl&lt;br /&gt;
*uuid&lt;br /&gt;
*category_list&lt;br /&gt;
*category_group_list&lt;br /&gt;
*founded_on&lt;br /&gt;
*employee_count&lt;br /&gt;
*emp_count_scale&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*gotvc&lt;br /&gt;
&lt;br /&gt;
We then need to deduplicate this table and make sure that conamestd is a valid primary key.&lt;br /&gt;
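&lt;br /&gt;
One way to sketch the deduplication and the primary key check, assuming the rows have already been read into dictionaries and that keeping the first row per conamestd is acceptable (both are assumptions; the helper names are hypothetical):&lt;br /&gt;
&lt;br /&gt;
```python
from collections import Counter


def dedupe_companies(rows, key="conamestd"):
    """Keep the first row seen for each key value and list the keys that repeated."""
    counts = Counter(row[key] for row in rows)
    seen = set()
    unique = []
    for row in rows:
        if row[key] in seen:
            continue  # a later duplicate: drop it
        seen.add(row[key])
        unique.append(row)
    duplicates = sorted(k for k, n in counts.items() if n != 1)
    return unique, duplicates


def is_valid_primary_key(rows, key="conamestd"):
    """A valid primary key is never empty and never repeated."""
    values = [row[key] for row in rows]
    return all(values) and len(set(values)) == len(values)


rows = [
    {"conamestd": "acme", "city": "Houston"},
    {"conamestd": "beta", "city": "Austin"},
    {"conamestd": "acme", "city": "Dallas"},
]
unique, duplicates = dedupe_companies(rows)
print(duplicates)
print(is_valid_primary_key(unique))
```
&lt;br /&gt;
The list of duplicated keys is returned so that conflicting rows (such as the two cities for the same company in the sample) can be checked manually before one is discarded.&lt;br /&gt;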
&lt;br /&gt;
Note that we won't be taking the following from &amp;quot;Cohorts Final&amp;quot; for this table:&lt;br /&gt;
*year&lt;br /&gt;
*accelerator&lt;br /&gt;
*cohort&lt;br /&gt;
*quarter&lt;br /&gt;
*acclocation&lt;br /&gt;
*accperks	DISCARD&lt;br /&gt;
*cofounder	DISCARD&lt;br /&gt;
&lt;br /&gt;
CohortParticipation&lt;br /&gt;
*Cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
*accelerator	(foreign key)&lt;br /&gt;
*conamestd (foreign key)&lt;br /&gt;
&lt;br /&gt;
===Going further===&lt;br /&gt;
&lt;br /&gt;
We should likely rebuild the cohort variable to make it a sole &amp;quot;primary key&amp;quot; for a cohort. This would mean turning each entry of cohort into something unique like: TechStars Boulder Fall 2017 or 1440 Cohort 1, so that this key could look up the year of the cohort, its quarter, etc. We could then break CohortParticipation into two tables:&lt;br /&gt;
&lt;br /&gt;
CohortParticipation&lt;br /&gt;
*Cohort&lt;br /&gt;
*accelerator	(foreign key)&lt;br /&gt;
*conamestd (foreign key)&lt;br /&gt;
&lt;br /&gt;
Cohort&lt;br /&gt;
*Cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
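&lt;br /&gt;
The split into CohortParticipation and Cohort could be sketched as below. It assumes the cohort value has already been rebuilt into a unique name, and it raises an error if one cohort name turns up with two different year/quarter pairs; the function name and sample rows are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def split_participation(rows):
    """Split flat rows into a CohortParticipation list and a Cohort lookup list."""
    cohorts = {}
    participation = []
    for row in rows:
        key = row["cohort"]
        details = {"cohort": key, "year": row["year"], "quarter": row["quarter"]}
        # setdefault stores the first details seen for a key and returns the stored copy.
        if cohorts.setdefault(key, details) != details:
            raise ValueError(f"cohort key is not unique: {key!r}")
        participation.append(
            {
                "cohort": key,
                "accelerator": row["accelerator"],
                "conamestd": row["conamestd"],
            }
        )
    return participation, list(cohorts.values())


flat = [
    {"cohort": "TechStars Boulder Fall 2017", "year": 2017, "quarter": 4,
     "accelerator": "TechStars", "conamestd": "acme"},
    {"cohort": "TechStars Boulder Fall 2017", "year": 2017, "quarter": 4,
     "accelerator": "TechStars", "conamestd": "beta"},
]
participation, cohort_table = split_participation(flat)
print(len(participation), len(cohort_table))
```
&lt;br /&gt;
The error on conflicting details is the useful part: it shows which cohort names still need manual fixes before they can serve as a key.&lt;br /&gt;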
&lt;br /&gt;
We could then add campus and a separate table for campuses:&lt;br /&gt;
&lt;br /&gt;
Campus Table&lt;br /&gt;
*CampusName (e.g., Techstars Boulder)&lt;br /&gt;
*Accelerator (foreign key)&lt;br /&gt;
*Address&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Zip&lt;br /&gt;
*Description &lt;br /&gt;
&lt;br /&gt;
===Founders Information===&lt;br /&gt;
&lt;br /&gt;
The three founder sheets turn into three tables nicely. We don't need to renormalize them for now, just fix up their variables and do some matching on employer.&lt;br /&gt;
&lt;br /&gt;
Founders:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*Current Job&lt;br /&gt;
*Current Location&lt;br /&gt;
&lt;br /&gt;
FoundersExperience:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*Employer&lt;br /&gt;
*Job Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location&lt;br /&gt;
*Extra Description&lt;br /&gt;
&lt;br /&gt;
FoundersEducation:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
==To do/For consideration==&lt;br /&gt;
&lt;br /&gt;
Minh:&lt;br /&gt;
*Create a format for collecting timing data&lt;br /&gt;
*Put the timing job on Turk&lt;br /&gt;
&lt;br /&gt;
Connor:&lt;br /&gt;
*Try to find the missing timings that we really need (new process)&lt;br /&gt;
*Col W should be headquarter address&lt;br /&gt;
*What stage-  Clean up?&lt;br /&gt;
&lt;br /&gt;
Maxine: &lt;br /&gt;
*Build the google URL finder&lt;br /&gt;
*Industry classification from description&lt;br /&gt;
&lt;br /&gt;
Grace:&lt;br /&gt;
*Process and join in new timing data - new data located in Z:/accelerator/Formatted Timing Info.txt&lt;br /&gt;
*Make a category group to minorcode lookup - in E:/McNair/Projects/Accelerators/Summer 2018/Cohorts Final - minor code priority Grace.xlsx&lt;br /&gt;
*Run WHOIS crawler on all valid URLs (not facebook pages, etc.) - Maxine did this&lt;br /&gt;
*Founders Experience: Match Employers to VC funds/firms, VC backed startups (requires data from Augi) - added 2 columns to The File to Rule Them All (VC and VC start up)&lt;br /&gt;
&lt;br /&gt;
==Accelerator Data Assembly Progress (Hira) == &lt;br /&gt;
&lt;br /&gt;
*All data files are in Z:/accelerator&lt;br /&gt;
*The SQL file that loads all data is: LoadAccData.sql. It is located in E:\McNair\Projects\Accelerators\Summer 2018.&lt;br /&gt;
&lt;br /&gt;
=== Data assembly details ===&lt;br /&gt;
&lt;br /&gt;
The SQL file LoadAccData.sql currently loads data on Cohorts final and Founders from:&lt;br /&gt;
  E:\McNair\Projects\Accelerators\Summer 2018\The File To Rule Them All.xlsx&lt;br /&gt;
&lt;br /&gt;
It creates the following tables:&lt;br /&gt;
&lt;br /&gt;
1) cohortsfinal - source file:  Cohorts Final sheet in &amp;quot;The File to Rule Them All&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
2) CohortCompany - this uses data in cohortsfinal and creates a table with the following:&lt;br /&gt;
*conamestd&lt;br /&gt;
*conameorg&lt;br /&gt;
*colocation&lt;br /&gt;
*city&lt;br /&gt;
*state_code&lt;br /&gt;
*country_code&lt;br /&gt;
*address&lt;br /&gt;
*codescription&lt;br /&gt;
*short_desc&lt;br /&gt;
*long_desc&lt;br /&gt;
*cosectors&lt;br /&gt;
*costatus&lt;br /&gt;
*cofundingstage&lt;br /&gt;
*courl&lt;br /&gt;
*uuid&lt;br /&gt;
*category_list&lt;br /&gt;
*category_group_list&lt;br /&gt;
*founded_on&lt;br /&gt;
*employee_count&lt;br /&gt;
*emp_count_scale&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*gotvc&lt;br /&gt;
&lt;br /&gt;
3) CohortParticipation - uses table cohortsfinal&lt;br /&gt;
*cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
*accelerator&lt;br /&gt;
*conamestd&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
4) timing_final - This table is based on the most updated information on timing compiled in source file:  Z:/accelerator/Formatted Timing Info.txt (by Grace). It includes:&lt;br /&gt;
*coname&lt;br /&gt;
*acceleratorname&lt;br /&gt;
*keyword&lt;br /&gt;
*url&lt;br /&gt;
*webpage &lt;br /&gt;
*predicted &lt;br /&gt;
*gooddata&lt;br /&gt;
*page_details &lt;br /&gt;
*full_date &lt;br /&gt;
*month&lt;br /&gt;
*year&lt;br /&gt;
*cohort_name&lt;br /&gt;
*notes&lt;br /&gt;
*prog_duration_wks&lt;br /&gt;
*actual_date &lt;br /&gt;
*actual_month&lt;br /&gt;
*actual_year &lt;br /&gt;
*season &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
5) Founders - source file: &amp;quot;The File to Rule Them All - Founders main sheet&amp;quot;&lt;br /&gt;
*Accelerator &lt;br /&gt;
*First_Name&lt;br /&gt;
*Last_Name&lt;br /&gt;
*Full_Name&lt;br /&gt;
*Current_Job &lt;br /&gt;
*Current_Location&lt;br /&gt;
&lt;br /&gt;
6) founders_experience - source file: &amp;quot;The File to Rule Them All - Founders experience sheet&amp;quot;&lt;br /&gt;
*Accelerator &lt;br /&gt;
*First_Name &lt;br /&gt;
*Last_Name&lt;br /&gt;
*Full_Name&lt;br /&gt;
*Employer &lt;br /&gt;
*VC &lt;br /&gt;
*VC_backed_startup &lt;br /&gt;
*OLD_Job_Title&lt;br /&gt;
*NEW_Job_Title &lt;br /&gt;
*Dates_Employed&lt;br /&gt;
*Time_Employed&lt;br /&gt;
*Location &lt;br /&gt;
*Extra_Description&lt;br /&gt;
&lt;br /&gt;
7) additional_timing_info - source file: &amp;quot;merging_work.xlxs&amp;quot; located in: E:\Projects\McNair\Seed DB&lt;br /&gt;
8) additional_timing_info2 - source file: &amp;quot;formatted timing info2.txt&amp;quot; located in E:\Projects\McNair\Accelerators\Summer 2018. This was collected through Mechanical Turk. &lt;br /&gt;
Tables 7 and 8 include columns:&lt;br /&gt;
*coname&lt;br /&gt;
*acceleratorname&lt;br /&gt;
*cohort_name&lt;br /&gt;
*date&lt;br /&gt;
*month&lt;br /&gt;
*year&lt;br /&gt;
*season&lt;br /&gt;
&lt;br /&gt;
9) timing_combined - This table combines all timing information we have and appends tables 4, 7 and 8.&lt;br /&gt;
10) cohortcompanies_wtiming - merges data in tables cohortcompany and timing_combined&lt;br /&gt;
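&lt;br /&gt;
The append and merge in tables 9 and 10 can be illustrated with an in-memory SQLite database. This is only a sketch and not the contents of LoadAccData.sql: the reduced column set, the sample rows, and the join of conamestd to coname are all assumptions.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Reduced, hypothetical versions of the tables described above.
cur.execute("CREATE TABLE cohortcompany (conamestd TEXT PRIMARY KEY, city TEXT)")
timing_tables = ("timing_final", "additional_timing_info", "additional_timing_info2")
for name in timing_tables:
    cur.execute(
        f"CREATE TABLE {name} (coname TEXT, acceleratorname TEXT, year INTEGER, season TEXT)"
    )

cur.executemany(
    "INSERT INTO cohortcompany VALUES (?, ?)",
    [("acme", "Houston"), ("beta", "Austin"), ("gamma", "Dallas")],
)
cur.execute("INSERT INTO timing_final VALUES ('acme', 'Accelerator A', 2017, 'Fall')")
cur.execute("INSERT INTO additional_timing_info2 VALUES ('beta', 'Accelerator B', 2018, 'Summer')")

# Table 9: append the three timing sources.
cur.execute(
    """
    CREATE TABLE timing_combined AS
    SELECT coname, acceleratorname, year, season FROM timing_final
    UNION ALL
    SELECT coname, acceleratorname, year, season FROM additional_timing_info
    UNION ALL
    SELECT coname, acceleratorname, year, season FROM additional_timing_info2
    """
)

# Table 10: keep every company, with timing where we have it (assumed join key).
cur.execute(
    """
    CREATE TABLE cohortcompanies_wtiming AS
    SELECT c.conamestd, c.city, t.acceleratorname, t.year, t.season
    FROM cohortcompany AS c
    LEFT JOIN timing_combined AS t ON c.conamestd = t.coname
    """
)

rows = cur.execute(
    "SELECT conamestd, acceleratorname, year, season "
    "FROM cohortcompanies_wtiming ORDER BY conamestd"
).fetchall()
print(rows)
```
&lt;br /&gt;
A left join is used in the sketch so that companies without timing data stay in the merged table with empty timing columns.&lt;br /&gt;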
&lt;br /&gt;
&lt;br /&gt;
==Grace's Code==&lt;br /&gt;
===format_timing.py===&lt;br /&gt;
  E:/McNair/Projects/Accelerators/Summer 2018/format_timing.py&lt;br /&gt;
&lt;br /&gt;
Input: a txt file with each accelerator mapped to multiple companies (in a single cell separated by columns, or in separate rows)&lt;br /&gt;
&lt;br /&gt;
Output: txt file with companies mapped to accelerators with all the other information in the original file&lt;br /&gt;
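&lt;br /&gt;
A rough sketch of this reshaping, not the actual format_timing.py: it assumes a tab-delimited file, a column named companies whose cells list companies separated by commas, and an output column named coname. All of these names are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
import csv
import io


def explode_companies(lines, delimiter="\t"):
    """Turn one row per accelerator into one row per company, keeping the other columns."""
    reader = csv.DictReader(lines, delimiter=delimiter)
    out = []
    for row in reader:
        for company in row["companies"].split(","):
            company = company.strip()
            if not company:
                continue  # skip empty pieces such as a trailing comma
            new_row = {k: v for k, v in row.items() if k != "companies"}
            new_row["coname"] = company
            out.append(new_row)
    return out


SAMPLE = (
    "accelerator\tcompanies\tyear\n"
    "Accelerator A\tAcme, Beta\t2017\n"
    "Accelerator A\tGamma\t2018\n"
)
exploded = explode_companies(io.StringIO(SAMPLE))
for item in exploded:
    print(item["coname"], item["accelerator"], item["year"])
```
&lt;br /&gt;
Rows that already hold a single company pass through unchanged, so both input layouts are handled by the same loop.&lt;br /&gt;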
&lt;br /&gt;
===prioritycodecategory.py===&lt;br /&gt;
  E:/McNair/Projects/Accelerators/Summer 2018/prioritycodecategory.py&lt;br /&gt;
&lt;br /&gt;
Input: txt file with a list of category groups (Column Y of Cohorts Final in The File to Rule Them All)&lt;br /&gt;
&lt;br /&gt;
Output: txt file with line number and minor code &lt;br /&gt;
&lt;br /&gt;
Final output: I took the txt file, copied the codes, and pasted them into the added column Z of the Cohorts Final sheet from The File to Rule Them All.&lt;br /&gt;
  E:/McNair/Projects/Accelerators/Summer 2018/Cohorts Final - minor code priority Grace.xlsx&lt;br /&gt;
&lt;br /&gt;
Chooses the code based on importance from a priority ranking dictionary before choosing arbitrarily.&lt;br /&gt;
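&lt;br /&gt;
The selection rule can be sketched as follows. This is not the actual prioritycodecategory.py; the lookup from category group to minor code, the priority ranks, and the sample values are all made up for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
def choose_minor_code(category_groups, code_lookup, priority):
    """Pick one minor code for a comma-separated list of category groups.

    Ranked codes win in rank order (1 is most important). Unranked codes
    share the lowest rank, and ties go to whichever was listed first.
    """
    codes = []
    for group in category_groups.split(","):
        group = group.strip().lower()
        if group in code_lookup:
            codes.append(code_lookup[group])
    if not codes:
        return None
    lowest_rank = len(priority) + 1
    return min(codes, key=lambda code: priority.get(code, lowest_rank))


# Hypothetical lookup tables.
CODE_LOOKUP = {"software": "SW", "health care": "HC", "media": "MD"}
PRIORITY = {"HC": 1, "SW": 2}

lines = ["Software, Health Care", "Media, Software", "Unknown Group"]
for number, line in enumerate(lines, start=1):
    print(number, choose_minor_code(line, CODE_LOOKUP, PRIORITY))
```
&lt;br /&gt;
Printing the line number next to each code mirrors the output format described above, which makes it straightforward to paste the codes back next to the right rows.&lt;br /&gt;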
&lt;br /&gt;
===codecategory.py===&lt;br /&gt;
  E:/McNair/Projects/Accelerators/Summer 2018/codecategory.py&lt;br /&gt;
&lt;br /&gt;
Input: txt file with a list of category groups (column Y of Cohorts Final in The File to Rule Them All)&lt;br /&gt;
&lt;br /&gt;
Output: txt file with line number and multiple minor codes&lt;br /&gt;
&lt;br /&gt;
Final output: I took the minor codes and copied them into column Z of this sheet (a copy of The File to Rule Them All with this added column)&lt;br /&gt;
&lt;br /&gt;
  E:/McNair/Projects/Accelerators/Summer 2018/Cohorts Final - minor code (no priority) Grace.xlsx&lt;br /&gt;
&lt;br /&gt;
I arbitrarily chose the first code when multiple were given. I fixed this in Excel by separating on commas. I also did a lot of them manually, which is why there are more values in this file than in the one with priority.&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Seed_Accelerator_Data_Assembly&amp;diff=23926</id>
		<title>Seed Accelerator Data Assembly</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Seed_Accelerator_Data_Assembly&amp;diff=23926"/>
		<updated>2018-08-03T21:27:49Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: /* codecategory.py */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Useful pages==&lt;br /&gt;
&lt;br /&gt;
*[[U.S. Seed Accelerators]]&lt;br /&gt;
**[[Accelerator Seed List (Data)]]&lt;br /&gt;
*[[Accelerator Demo Day]]&lt;br /&gt;
**[[Mechanical Turk (Tool)]]&lt;br /&gt;
*[[Crunchbase Accelerator Founders]]&lt;br /&gt;
**[[Crunchbase Data#Collecting Company Information]]&lt;br /&gt;
**[[Merging Existing Data with Crunchbase]]&lt;br /&gt;
*[[Whois Parser]]&lt;br /&gt;
*[[URL Finder (Tool)]]&lt;br /&gt;
*[[Industry Classifier]]&lt;br /&gt;
*[[Industry classifier yang]]&lt;br /&gt;
*[[VentureXpert Data]]&lt;br /&gt;
&lt;br /&gt;
Please add (or subtract) other relevant (or irrelevant) pages!&lt;br /&gt;
&lt;br /&gt;
==Database specification==&lt;br /&gt;
&lt;br /&gt;
===Preamble===&lt;br /&gt;
&lt;br /&gt;
We need to get the data into approximately third normal form (3NF) to prevent errors and make it more usable.&lt;br /&gt;
&lt;br /&gt;
Inputs:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\The File To Rule Them All.xlsx&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit#gid=0&lt;br /&gt;
&lt;br /&gt;
Recent work on the above from Connor:&lt;br /&gt;
*made the Demo Day Timing Google sheet as clean as possible (fixed dates, removed duplicates, created season column)&lt;br /&gt;
*recoded the employee count&lt;br /&gt;
*normalized the investment amount&lt;br /&gt;
&lt;br /&gt;
Connor's next to do: &lt;br /&gt;
*Founders Experience: code job title&lt;br /&gt;
*Founders Education: remove unknowns, code degree and code major&lt;br /&gt;
*Fix multiple campus and cohorts (see below). &lt;br /&gt;
&lt;br /&gt;
This is documented on [[U.S._Seed_Accelerators#Update_for_Hira]].&lt;br /&gt;
&lt;br /&gt;
The current database work is in '''vcdb2'''. The code to build the relevant tables in vcdb2 is in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\LoadAcceleratorDataV2.sql&lt;br /&gt;
 &lt;br /&gt;
Note that AddCBData.sql can be run on '''crunchbase2''' to get the relevant crunchbase data for import into vcdb2 (or elsewhere).&lt;br /&gt;
&lt;br /&gt;
However, we should build a new database for this project:&lt;br /&gt;
 createdb accelerator&lt;br /&gt;
&lt;br /&gt;
As well as a new folder (Z:\accelerator) for the data.&lt;br /&gt;
&lt;br /&gt;
The work in vcdb2 is essentially to build the data that goes into the File To Rule Them All.xlsx. We should start from there!&lt;br /&gt;
&lt;br /&gt;
===Suggested Spec===&lt;br /&gt;
&lt;br /&gt;
We need to address the issue with multiple campus and cohorts. This will require loading and manipulation of the data in SQL (Hira) as well as some manual fixes to the data (Connor).&lt;br /&gt;
&lt;br /&gt;
Accelerator Table (fieldname	colname/DISCARD, from sheet &amp;quot;Accelerators Final&amp;quot;)&lt;br /&gt;
*acceleratorname (Primary Key)	Accelerators&lt;br /&gt;
*url	homepage_url&lt;br /&gt;
*cohorturl	cohort page URL&lt;br /&gt;
*cohortlisting	Break out cohorts on the website? (Y/N)&lt;br /&gt;
*type	Type&lt;br /&gt;
*alive	Alive&lt;br /&gt;
*typenote	Type notes&lt;br /&gt;
*weeks	Weeks&lt;br /&gt;
*durationnotes	Duration Notes&lt;br /&gt;
*city	city	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*state	state/region	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*foundingdate	Creation Date&lt;br /&gt;
*terms	Terms of Joining Accelerator (non-equity pay/equity/free)&lt;br /&gt;
*equity	Equity?&lt;br /&gt;
*equityamountdesc	Equity Amount&lt;br /&gt;
*equityamount	Equity Amount Normalized (only 1s, range is averaged)&lt;br /&gt;
*investmentamountdesc	Investment Amount &lt;br /&gt;
*investmentamount	Investment Amount Normalized (Midpoint) &lt;br /&gt;
*investmentnotes	Investment Notes&lt;br /&gt;
*discard industry	DISCARD&lt;br /&gt;
*industry	Industry	&lt;br /&gt;
*specgenindu	Specified/General&lt;br /&gt;
*address	Address	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*subtype	DISCARD&lt;br /&gt;
*nonprofit	nonprofit?&lt;br /&gt;
*studentfocus	designed for students (Y/N)&lt;br /&gt;
*multicampus	Multiple campuses? (Y/N)&lt;br /&gt;
*software tech (Y/N)	DISCARD&lt;br /&gt;
*stagepref	What stage do they look for in cohort companies (SEL)&lt;br /&gt;
*cbconame	Name in Crunchbase&lt;br /&gt;
*uuid	UUID&lt;br /&gt;
&lt;br /&gt;
Note that we will assume that duration and terms are common across all cohorts run by the same accelerator. We might need to revisit this assumption.&lt;br /&gt;
&lt;br /&gt;
CohortCompany Table (fieldname	colname/DISCARD, from sheet &amp;quot;Cohorts Final&amp;quot;)&lt;br /&gt;
*conamestd (primary key)&lt;br /&gt;
*coname&lt;br /&gt;
*conameorg&lt;br /&gt;
*colocation&lt;br /&gt;
*city&lt;br /&gt;
*state_code&lt;br /&gt;
*country_code&lt;br /&gt;
*address&lt;br /&gt;
*codescription&lt;br /&gt;
*short_desc&lt;br /&gt;
*long_desc&lt;br /&gt;
*cosectors&lt;br /&gt;
*costatus&lt;br /&gt;
*cofundingstage&lt;br /&gt;
*courl&lt;br /&gt;
*uuid&lt;br /&gt;
*category_list&lt;br /&gt;
*category_group_list&lt;br /&gt;
*founded_on&lt;br /&gt;
*employee_count&lt;br /&gt;
*emp_count_scale&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*gotvc&lt;br /&gt;
&lt;br /&gt;
We then need to deduplicate this table and make sure that conamestd is a valid primary key.&lt;br /&gt;
&lt;br /&gt;
Note that we won't be taking the following from &amp;quot;Cohorts Final&amp;quot; for this table:&lt;br /&gt;
*year&lt;br /&gt;
*accelerator&lt;br /&gt;
*cohort&lt;br /&gt;
*quarter&lt;br /&gt;
*acclocation&lt;br /&gt;
*accperks	DISCARD&lt;br /&gt;
*cofounder	DISCARD&lt;br /&gt;
&lt;br /&gt;
CohortParticipation&lt;br /&gt;
*Cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
*accelerator	(foreign key)&lt;br /&gt;
*conamestd (foreign key)&lt;br /&gt;
&lt;br /&gt;
===Going further===&lt;br /&gt;
&lt;br /&gt;
We should likely rebuild the cohort variable to make it a sole &amp;quot;primary key&amp;quot; for a cohort. This would mean turning each entry of cohort into something unique like: TechStars Boulder Fall 2017 or 1440 Cohort 1, so that this key could look up the year of the cohort, its quarter, etc. We could then break CohortParticipation into two tables:&lt;br /&gt;
&lt;br /&gt;
CohortParticipation&lt;br /&gt;
*Cohort&lt;br /&gt;
*accelerator	(foreign key)&lt;br /&gt;
*conamestd (foreign key)&lt;br /&gt;
&lt;br /&gt;
Cohort&lt;br /&gt;
*Cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
&lt;br /&gt;
We could then add campus and a separate table for campuses:&lt;br /&gt;
&lt;br /&gt;
Campus Table&lt;br /&gt;
*CampusName (e.g., Techstars Boulder)&lt;br /&gt;
*Accelerator (foreign key)&lt;br /&gt;
*Address&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Zip&lt;br /&gt;
*Description &lt;br /&gt;
&lt;br /&gt;
===Founders Information===&lt;br /&gt;
&lt;br /&gt;
The three founder sheets turn into three tables nicely. We don't need to renormalize them for now, just fix up their variables and do some matching on employer.&lt;br /&gt;
&lt;br /&gt;
Founders:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*Current Job&lt;br /&gt;
*Current Location&lt;br /&gt;
&lt;br /&gt;
FoundersExperience:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*Employer&lt;br /&gt;
*Job Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location&lt;br /&gt;
*Extra Description&lt;br /&gt;
&lt;br /&gt;
FoundersEducation:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
==To do/For consideration==&lt;br /&gt;
&lt;br /&gt;
Minh:&lt;br /&gt;
*Create a format for collecting timing data&lt;br /&gt;
*Put the timing job on Turk&lt;br /&gt;
&lt;br /&gt;
Connor:&lt;br /&gt;
*Try to find the missing timings that we really need (new process)&lt;br /&gt;
*Col W should be headquarter address&lt;br /&gt;
*What stage-  Clean up?&lt;br /&gt;
&lt;br /&gt;
Maxine: &lt;br /&gt;
*Build the google URL finder&lt;br /&gt;
*Industry classification from description&lt;br /&gt;
&lt;br /&gt;
Grace:&lt;br /&gt;
*Process and join in new timing data - new data located in Z:/accelerator/Formatted Timing Info.txt&lt;br /&gt;
*Make a category group to minorcode lookup - in E:/McNair/Projects/Accelerators/Summer 2018/Cohorts Final - minor code priority Grace.xlsx&lt;br /&gt;
*Run WHOIS crawler on all valid URLs (not facebook pages, etc.) - Maxine did this&lt;br /&gt;
*Founders Experience: Match Employers to VC funds/firms, VC backed startups (requires data from Augi) - added 2 columns to The File to Rule Them All (VC and VC start up)&lt;br /&gt;
&lt;br /&gt;
==Accelerator Data Assembly Progress (Hira) == &lt;br /&gt;
&lt;br /&gt;
*All data files are in Z:/accelerator&lt;br /&gt;
*The SQL file that loads all data is: LoadAccData.sql. It is located in E:\McNair\Projects\Accelerators\Summer 2018.&lt;br /&gt;
&lt;br /&gt;
=== Data assembly details ===&lt;br /&gt;
&lt;br /&gt;
The SQL file LoadAccData.sql currently loads data on Cohorts final and Founders from:&lt;br /&gt;
  E:\McNair\Projects\Accelerators\Summer 2018\The File To Rule Them All.xlsx&lt;br /&gt;
&lt;br /&gt;
It creates the following tables:&lt;br /&gt;
&lt;br /&gt;
1) cohortsfinal - source file:  Cohorts Final sheet in &amp;quot;The File to Rule Them All&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
2) CohortCompany - this uses data in cohortsfinal and creates a table with the following:&lt;br /&gt;
*conamestd&lt;br /&gt;
*conameorg&lt;br /&gt;
*colocation&lt;br /&gt;
*city&lt;br /&gt;
*state_code&lt;br /&gt;
*country_code&lt;br /&gt;
*address&lt;br /&gt;
*codescription&lt;br /&gt;
*short_desc&lt;br /&gt;
*long_desc&lt;br /&gt;
*cosectors&lt;br /&gt;
*costatus&lt;br /&gt;
*cofundingstage&lt;br /&gt;
*courl&lt;br /&gt;
*uuid&lt;br /&gt;
*category_list&lt;br /&gt;
*category_group_list&lt;br /&gt;
*founded_on&lt;br /&gt;
*employee_count&lt;br /&gt;
*emp_count_scale&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*gotvc&lt;br /&gt;
&lt;br /&gt;
3) CohortParticipation - uses table cohortsfinal&lt;br /&gt;
*cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
*accelerator&lt;br /&gt;
*conamestd&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
4) timing_final - This table is based on the most updated information on timing compiled in source file:  Z:/accelerator/Formatted Timing Info.txt (by Grace). It includes:&lt;br /&gt;
*coname&lt;br /&gt;
*acceleratorname&lt;br /&gt;
*keyword&lt;br /&gt;
*url&lt;br /&gt;
*webpage &lt;br /&gt;
*predicted &lt;br /&gt;
*gooddata&lt;br /&gt;
*page_details &lt;br /&gt;
*full_date &lt;br /&gt;
*month&lt;br /&gt;
*year&lt;br /&gt;
*cohort_name&lt;br /&gt;
*notes&lt;br /&gt;
*prog_duration_wks&lt;br /&gt;
*actual_date &lt;br /&gt;
*actual_month&lt;br /&gt;
*actual_year &lt;br /&gt;
*season &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
5) Founders - source file: &amp;quot;The File to Rule Them All - Founders main sheet&amp;quot;&lt;br /&gt;
*Accelerator &lt;br /&gt;
*First_Name&lt;br /&gt;
*Last_Name&lt;br /&gt;
*Full_Name&lt;br /&gt;
*Current_Job &lt;br /&gt;
*Current_Location&lt;br /&gt;
&lt;br /&gt;
6) founders_experience - source file: &amp;quot;The File to Rule Them All - Founders experience sheet&amp;quot;&lt;br /&gt;
*Accelerator &lt;br /&gt;
*First_Name &lt;br /&gt;
*Last_Name&lt;br /&gt;
*Full_Name&lt;br /&gt;
*Employer &lt;br /&gt;
*VC &lt;br /&gt;
*VC_backed_startup &lt;br /&gt;
*OLD_Job_Title&lt;br /&gt;
*NEW_Job_Title &lt;br /&gt;
*Dates_Employed&lt;br /&gt;
*Time_Employed&lt;br /&gt;
*Location &lt;br /&gt;
*Extra_Description&lt;br /&gt;
&lt;br /&gt;
7) additional_timing_info - source file: &amp;quot;merging_work.xlxs&amp;quot; located in: E:\Projects\McNair\Seed DB&lt;br /&gt;
8) additional_timing_info2 - source file: &amp;quot;formatted timing info2.txt&amp;quot; located in E:\Projects\McNair\Accelerators\Summer 2018. This was collected through Mechanical Turk. &lt;br /&gt;
Tables 7 and 8 include columns:&lt;br /&gt;
*coname&lt;br /&gt;
*acceleratorname&lt;br /&gt;
*cohort_name&lt;br /&gt;
*date&lt;br /&gt;
*month&lt;br /&gt;
*year&lt;br /&gt;
*season&lt;br /&gt;
&lt;br /&gt;
9) timing_combined - This table combines all timing information we have and appends tables 4, 7 and 8.&lt;br /&gt;
10) cohortcompanies_wtiming - merges data in tables cohortcompany and timing_combined&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Grace's Code==&lt;br /&gt;
===format_timing.py===&lt;br /&gt;
  E:/McNair/Projects/Accelerators/Summer 2018/format_timing.py&lt;br /&gt;
&lt;br /&gt;
Input: a txt file with each accelerator mapped to multiple companies (in a single cell separated by columns, or in separate rows)&lt;br /&gt;
&lt;br /&gt;
Output: txt file with companies mapped to accelerators with all the other information in the original file&lt;br /&gt;
&lt;br /&gt;
===prioritycodecategory.py===&lt;br /&gt;
  E:/McNair/Projects/Accelerators/Summer 2018/prioritycodecategory.py&lt;br /&gt;
&lt;br /&gt;
Input: txt file with a list of category groups&lt;br /&gt;
&lt;br /&gt;
Output: txt file with line number and minor code &lt;br /&gt;
&lt;br /&gt;
Final output: I took the txt file, copied the codes, and pasted them into column Z of the Cohorts Final sheet from The File to Rule Them All.&lt;br /&gt;
  E:/McNair/Projects/Accelerators/Summer 2018/Cohorts Final - minor code priority Grace.xlsx&lt;br /&gt;
&lt;br /&gt;
Chooses the code based on importance from a priority ranking dictionary before choosing arbitrarily.&lt;br /&gt;
&lt;br /&gt;
===codecategory.py===&lt;br /&gt;
  E:/McNair/Projects/Accelerators/Summer 2018/codecategory.py&lt;br /&gt;
&lt;br /&gt;
Input: txt file with a list of category groups (column Y of Cohorts Final in The File to Rule Them All)&lt;br /&gt;
&lt;br /&gt;
Output: txt file with line number and multiple minor codes&lt;br /&gt;
&lt;br /&gt;
Final output: I took the minor codes and copied them into column Z of this sheet (a copy of The File to Rule Them All with this added column)&lt;br /&gt;
&lt;br /&gt;
  E:/McNair/Projects/Accelerators/Summer 2018/Cohorts Final - minor code (no priority) Grace.xlsx&lt;br /&gt;
&lt;br /&gt;
I arbitrarily chose the first code when multiple were given. I fixed this in Excel by separating on commas. I also did a lot of them manually, which is why there are more values in this file than in the one with priority.&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Seed_Accelerator_Data_Assembly&amp;diff=23924</id>
		<title>Seed Accelerator Data Assembly</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Seed_Accelerator_Data_Assembly&amp;diff=23924"/>
		<updated>2018-08-03T21:27:08Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: /* codecategory.py */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Useful pages==&lt;br /&gt;
&lt;br /&gt;
*[[U.S. Seed Accelerators]]&lt;br /&gt;
**[[Accelerator Seed List (Data)]]&lt;br /&gt;
*[[Accelerator Demo Day]]&lt;br /&gt;
**[[Mechanical Turk (Tool)]]&lt;br /&gt;
*[[Crunchbase Accelerator Founders]]&lt;br /&gt;
**[[Crunchbase Data#Collecting Company Information]]&lt;br /&gt;
**[[Merging Existing Data with Crunchbase]]&lt;br /&gt;
*[[Whois Parser]]&lt;br /&gt;
*[[URL Finder (Tool)]]&lt;br /&gt;
*[[Industry Classifier]]&lt;br /&gt;
*[[Industry classifier yang]]&lt;br /&gt;
*[[VentureXpert Data]]&lt;br /&gt;
&lt;br /&gt;
Please add (or subtract) other relevant (or irrelevant) pages!&lt;br /&gt;
&lt;br /&gt;
==Database specification==&lt;br /&gt;
&lt;br /&gt;
===Preamble===&lt;br /&gt;
&lt;br /&gt;
We need to get the data into approximately 3NF to prevent errors and make it more useable.&lt;br /&gt;
&lt;br /&gt;
Inputs:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\The File To Rule Them All.xlsx&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit#gid=0&lt;br /&gt;
&lt;br /&gt;
Recent work on the above from Connor:&lt;br /&gt;
*made the Demo Day Timing Google sheet as clean as possible (fixed dates, removed duplicates, created season column)&lt;br /&gt;
*recoded the employee count&lt;br /&gt;
*normalized the investment amount&lt;br /&gt;
&lt;br /&gt;
Connor's next to do: &lt;br /&gt;
*Founders Experience: code job title&lt;br /&gt;
*Founders Education: remove unknowns, code degree and code major&lt;br /&gt;
*Fix multiple campus and cohorts (see below). &lt;br /&gt;
&lt;br /&gt;
This is documented on [[U.S._Seed_Accelerators#Update_for_Hira]].&lt;br /&gt;
&lt;br /&gt;
The current database work is in '''vcdb2'''. The code to build the relevant tables in vcdb2 is in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\LoadAcceleratorDataV2.sql&lt;br /&gt;
 &lt;br /&gt;
Note that AddCBData.sql can be run on '''crunchbase2''' to get the relevant crunchbase data for import into vcdb2 (or elsewhere).&lt;br /&gt;
&lt;br /&gt;
However, we should build a new database for this project:&lt;br /&gt;
 createdb accelerator&lt;br /&gt;
&lt;br /&gt;
As well as a new folder (Z:\accelerator) for the data.&lt;br /&gt;
&lt;br /&gt;
The work in vcdb2 is essentially to build the data that goes into the File To Rule Them All.xlsx. We should start from there!&lt;br /&gt;
&lt;br /&gt;
===Suggested Spec===&lt;br /&gt;
&lt;br /&gt;
We need to address the issue with multiple campus and cohorts. This will require loading and manipulation of the data in SQL (Hira) as well as some manual fixes to the data (Connor).&lt;br /&gt;
&lt;br /&gt;
Accelerator Table (fieldname	colname/DISCARD, from sheet &amp;quot;Accelerators Final&amp;quot;)&lt;br /&gt;
*acceleratorname (Primary Key)	Accelerators&lt;br /&gt;
*url	homepage_url&lt;br /&gt;
*cohorturl	cohort page URL&lt;br /&gt;
*cohortlisting	Break out cohorts on the website? (Y/N)&lt;br /&gt;
*type	Type&lt;br /&gt;
*alive	Alive&lt;br /&gt;
*typenote	Type notes&lt;br /&gt;
*weeks	Weeks&lt;br /&gt;
*durationnotes	Duration Notes&lt;br /&gt;
*city	city	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*state	state/region	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*foundingdate	Creation Date&lt;br /&gt;
*terms	Terms of Joining Accelerator (non-equity pay/equity/free)&lt;br /&gt;
*equity	Equity?&lt;br /&gt;
*equityamountdesc	Equity Amount&lt;br /&gt;
*equityamount	Equity Amount Normalized (only 1s, range is averaged)&lt;br /&gt;
*investmentamountdesc	Investment Amount &lt;br /&gt;
*investmentamount	Investment Amount Normalized (Midpoint) &lt;br /&gt;
*investmentnotes	Investment Notes&lt;br /&gt;
*discard industry	DISCARD&lt;br /&gt;
*industry	Industry	&lt;br /&gt;
*specgenindu	Specified/General&lt;br /&gt;
*address	Address	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*subtype	DISCARD&lt;br /&gt;
*nonprofit	nonprofit?&lt;br /&gt;
*studentfocus	designed for students (Y/N)&lt;br /&gt;
*multicampus	Multiple campuses? (Y/N)&lt;br /&gt;
*software tech (Y/N)	DISCARD&lt;br /&gt;
*stagepref	What stage do they look for in cohort companies (SEL)&lt;br /&gt;
*cbconame	Name in Crunchbase&lt;br /&gt;
*uuid	UUID&lt;br /&gt;
&lt;br /&gt;
Note that we will assume that duration and terms are common across all cohorts run by the same accelerator. We might need to revisit this assumption.&lt;br /&gt;
&lt;br /&gt;
CohortCompany Table (fieldname	colname/DISCARD, from sheet &amp;quot;Cohorts Final&amp;quot;)&lt;br /&gt;
*conamestd (primary key)&lt;br /&gt;
*coname&lt;br /&gt;
*conameorg&lt;br /&gt;
*colocation&lt;br /&gt;
*city&lt;br /&gt;
*state_code&lt;br /&gt;
*country_code&lt;br /&gt;
*address&lt;br /&gt;
*codescription&lt;br /&gt;
*short_desc&lt;br /&gt;
*long_desc&lt;br /&gt;
*cosectors&lt;br /&gt;
*costatus&lt;br /&gt;
*cofundingstage&lt;br /&gt;
*courl&lt;br /&gt;
*uuid&lt;br /&gt;
*category_list&lt;br /&gt;
*category_group_list&lt;br /&gt;
*founded_on&lt;br /&gt;
*employee_count&lt;br /&gt;
*emp_count_scale&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*gotvc&lt;br /&gt;
&lt;br /&gt;
We then need to deduplicate this table and make sure that conamestd is a valid primary key.&lt;br /&gt;
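&lt;br /&gt;
One way to check whether conamestd is already a valid primary key is to count how often each value occurs. This is only a sketch: it assumes the rows have been read into Python dictionaries, and the function name duplicate_keys is hypothetical.&lt;br /&gt;

```python
from collections import Counter


def duplicate_keys(rows, key="conamestd"):
    """Return the key values that occur more than once, sorted."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n != 1)
```

An empty result means every conamestd value is unique; any names returned are the ones that still need deduplicating.&lt;br /&gt;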
&lt;br /&gt;
Note that we won't be taking the following from &amp;quot;Cohorts Final&amp;quot; for this table:&lt;br /&gt;
*year&lt;br /&gt;
*accelerator&lt;br /&gt;
*cohort&lt;br /&gt;
*quarter&lt;br /&gt;
*acclocation&lt;br /&gt;
*accperks	DISCARD&lt;br /&gt;
*cofounder	DISCARD&lt;br /&gt;
&lt;br /&gt;
CohortParticipation&lt;br /&gt;
*Cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
*accelerator	(foreign key)&lt;br /&gt;
*conamestd (foreign key)&lt;br /&gt;
&lt;br /&gt;
===Going further===&lt;br /&gt;
&lt;br /&gt;
We should likely rebuild the cohort variable to make it a sole &amp;quot;primary key&amp;quot; for a cohort. This would mean turning each entry of cohort into something unique like: TechStars Boulder Fall 2017 or 1440 Cohort 1, so that this key could look up the year of the cohort, its quarter, etc. We could then break CohortParticipation into two tables:&lt;br /&gt;
&lt;br /&gt;
CohortParticipation&lt;br /&gt;
*Cohort&lt;br /&gt;
*accelerator	(foreign key)&lt;br /&gt;
*conamestd (foreign key)&lt;br /&gt;
&lt;br /&gt;
Cohort&lt;br /&gt;
*Cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
&lt;br /&gt;
We could then add campus and a separate table for campuses:&lt;br /&gt;
&lt;br /&gt;
Campus Table&lt;br /&gt;
*CampusName (e.g., Techstars Boulder)&lt;br /&gt;
*Accelerator (foreign key)&lt;br /&gt;
*Address&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Zip&lt;br /&gt;
*Description &lt;br /&gt;
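&lt;br /&gt;
The split proposed above could be sketched as follows. This is an illustration under assumptions, not the project's load script: it uses an in-memory SQLite database as a stand-in for the accelerator database, the column types are guesses, and the accelerator and cohortcompany tables are reduced to their keys.&lt;br /&gt;

```python
import sqlite3

# Column types are assumptions; only the key columns of the two
# parent tables (accelerator, cohortcompany) are shown.
SCHEMA = """
CREATE TABLE accelerator (acceleratorname TEXT PRIMARY KEY);
CREATE TABLE cohortcompany (conamestd TEXT PRIMARY KEY);
CREATE TABLE cohort (
    cohort TEXT PRIMARY KEY,      -- e.g. 'TechStars Boulder Fall 2017'
    year INTEGER,
    quarter INTEGER
);
CREATE TABLE cohortparticipation (
    cohort TEXT REFERENCES cohort (cohort),
    accelerator TEXT REFERENCES accelerator (acceleratorname),
    conamestd TEXT REFERENCES cohortcompany (conamestd)
);
CREATE TABLE campus (
    campusname TEXT PRIMARY KEY,  -- e.g. 'Techstars Boulder'
    accelerator TEXT REFERENCES accelerator (acceleratorname),
    address TEXT,
    city TEXT,
    state TEXT,
    zip TEXT,
    description TEXT
);
"""


def build_schema():
    """Create the sketched tables in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(SCHEMA)
    return conn
```

With foreign keys enabled, a participation row that names a cohort missing from the cohort table is rejected, which is the kind of error the normalization is meant to prevent.&lt;br /&gt;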
&lt;br /&gt;
===Founders Information===&lt;br /&gt;
&lt;br /&gt;
The three founder sheets turn into three tables nicely. We don't need to renormalize them for now, just fix up their variables and do some matching on employer.&lt;br /&gt;
&lt;br /&gt;
Founders:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*Current Job&lt;br /&gt;
*Current Location&lt;br /&gt;
&lt;br /&gt;
FoundersExperience:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*Employer&lt;br /&gt;
*Job Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location&lt;br /&gt;
*Extra Description&lt;br /&gt;
&lt;br /&gt;
FoundersEducation:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
==To do/For consideration==&lt;br /&gt;
&lt;br /&gt;
Minh:&lt;br /&gt;
*Create a format for collecting timing data&lt;br /&gt;
*Put the timing job on Turk&lt;br /&gt;
&lt;br /&gt;
Connor:&lt;br /&gt;
*Try for missing timings that we really (new process) need&lt;br /&gt;
*Col W should be headquarter address&lt;br /&gt;
*What stage-  Clean up?&lt;br /&gt;
&lt;br /&gt;
Maxine: &lt;br /&gt;
*Build the google URL finder&lt;br /&gt;
*Industry classification from description&lt;br /&gt;
&lt;br /&gt;
Grace:&lt;br /&gt;
*Process and join in new timing data - new data located in Z:/accelerator/Formatted Timing Info.txt&lt;br /&gt;
*Make a category group to minorcode lookup - in E:/McNair/Projects/Accelerators/Summer 2018/Cohorts Final - minor code priority Grace.xlsx&lt;br /&gt;
*Run WHOIS crawler on all valid URLs (not facebook pages, etc.) - Maxine did this&lt;br /&gt;
*Founders Experience: Match Employers to VC funds/firms, VC backed startups (requires data from Augi) - added 2 columns to The File to Rule Them All (VC and VC start up)&lt;br /&gt;
&lt;br /&gt;
==Accelerator Data Assembly Progress (Hira) == &lt;br /&gt;
&lt;br /&gt;
*All data files are in Z:/accelerator&lt;br /&gt;
*The SQL file that loads all data is: LoadAccData.sql. It is located in E:\McNair\Projects\Accelerators\Summer 2018.&lt;br /&gt;
&lt;br /&gt;
=== Data assembly details ===&lt;br /&gt;
&lt;br /&gt;
The SQL file LoadAccData.sql currently loads data on Cohorts final and Founders from:&lt;br /&gt;
  E:\McNair\Projects\Accelerators\Summer 2018\The File To Rule Them All.xlsx&lt;br /&gt;
&lt;br /&gt;
It creates the following tables:&lt;br /&gt;
&lt;br /&gt;
1) cohortsfinal - source file:  Cohorts Final sheet in &amp;quot;The File to Rule Them All&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
2) CohortCompany - this uses data in cohortsfinal and creates a table with the following:&lt;br /&gt;
*conamestd&lt;br /&gt;
*conameorg&lt;br /&gt;
*colocation&lt;br /&gt;
*city&lt;br /&gt;
*state_code&lt;br /&gt;
*country_code&lt;br /&gt;
*address&lt;br /&gt;
*codescription&lt;br /&gt;
*short_desc&lt;br /&gt;
*long_desc&lt;br /&gt;
*cosectors&lt;br /&gt;
*costatus&lt;br /&gt;
*cofundingstage&lt;br /&gt;
*courl&lt;br /&gt;
*uuid&lt;br /&gt;
*category_list&lt;br /&gt;
*category_group_list&lt;br /&gt;
*founded_on&lt;br /&gt;
*employee_count&lt;br /&gt;
*emp_count_scale&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*gotvc&lt;br /&gt;
&lt;br /&gt;
3) CohortParticipation - uses table cohortsfinal&lt;br /&gt;
*cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
*accelerator&lt;br /&gt;
*conamestd&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
4) timing_final - This table is based on the most updated information on timing compiled in source file:  Z:/accelerator/Formatted Timing Info.txt (by Grace). It includes:&lt;br /&gt;
*coname&lt;br /&gt;
*acceleratorname&lt;br /&gt;
*keyword&lt;br /&gt;
*url&lt;br /&gt;
*webpage &lt;br /&gt;
*predicted &lt;br /&gt;
*gooddata&lt;br /&gt;
*page_details &lt;br /&gt;
*full_date &lt;br /&gt;
*month&lt;br /&gt;
*year&lt;br /&gt;
*cohort_name&lt;br /&gt;
*notes&lt;br /&gt;
*prog_duration_wks&lt;br /&gt;
*actual_date &lt;br /&gt;
*actual_month&lt;br /&gt;
*actual_year &lt;br /&gt;
*season &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
5) Founders - source file: &amp;quot;The File to Rule Them All - Founders main sheet&amp;quot;&lt;br /&gt;
*Accelerator &lt;br /&gt;
*First_Name&lt;br /&gt;
*Last_Name&lt;br /&gt;
*Full_Name&lt;br /&gt;
*Current_Job &lt;br /&gt;
*Current_Location&lt;br /&gt;
&lt;br /&gt;
6) founders_experience - source file: &amp;quot;The File to Rule Them All - Founders experience sheet&amp;quot;&lt;br /&gt;
*Accelerator &lt;br /&gt;
*First_Name &lt;br /&gt;
*Last_Name&lt;br /&gt;
*Full_Name&lt;br /&gt;
*Employer &lt;br /&gt;
*VC &lt;br /&gt;
*VC_backed_startup &lt;br /&gt;
*OLD_Job_Title&lt;br /&gt;
*NEW_Job_Title &lt;br /&gt;
*Dates_Employed&lt;br /&gt;
*Time_Employed&lt;br /&gt;
*Location &lt;br /&gt;
*Extra_Description&lt;br /&gt;
&lt;br /&gt;
7) additional_timing_info - source file: &amp;quot;merging_work.xlxs&amp;quot; located in: E:\Projects\McNair\Seed DB&lt;br /&gt;
8) additional_timing_info2 - source file: &amp;quot;formatted timing info2.txt&amp;quot; located in E:\Projects\McNair\Accelerators\Summer 2018. This was collected through MTurks. &lt;br /&gt;
Tables 7 and 8 include columns:&lt;br /&gt;
*coname&lt;br /&gt;
*acceleratorname&lt;br /&gt;
*cohort_name&lt;br /&gt;
*date&lt;br /&gt;
*month&lt;br /&gt;
*year&lt;br /&gt;
*season&lt;br /&gt;
&lt;br /&gt;
9) timing_combined - This table combines all timing information we have and appends tables 4, 7 and 8.&lt;br /&gt;
10) cohortcompanies_wtiming - merges data in tables cohortcompany and timing_combined&lt;br /&gt;
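&lt;br /&gt;
The append and merge in tables 9 and 10 can be illustrated in plain Python. This is a sketch and not a translation of LoadAccData.sql: the function names are hypothetical, rows are assumed to be dictionaries, and joining conamestd to coname is an assumption, since the join key is not stated here.&lt;br /&gt;

```python
# Columns that tables 4, 7 and 8 all list.
SHARED = ("coname", "acceleratorname", "cohort_name", "month", "year", "season")


def combine_timing(*tables):
    """Append several timing tables, keeping only their shared columns."""
    combined = []
    for table in tables:
        for row in table:
            combined.append({col: row.get(col) for col in SHARED})
    return combined


def merge_with_timing(companies, timing,
                      company_key="conamestd", timing_key="coname"):
    """Left join: keep every company row, adding timing columns if found."""
    by_name = {}
    for row in timing:
        # Assumption: when a company has several timing rows, keep the first.
        by_name.setdefault(row[timing_key], row)
    merged = []
    for company in companies:
        match = by_name.get(company[company_key], {})
        merged.append({**match, **company})
    return merged
```

Companies without a timing row are kept unchanged, so the merged result has exactly one row per company.&lt;br /&gt;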
&lt;br /&gt;
&lt;br /&gt;
==Grace's Code==&lt;br /&gt;
===format_timing.py===&lt;br /&gt;
  E:/McNair/Projects/Accelerators/Summer 2018/format_timing.py&lt;br /&gt;
&lt;br /&gt;
Input: a txt file with accelerator mapped to multiple companies (in a single cell separated by columns or in separate rows)&lt;br /&gt;
&lt;br /&gt;
Output: txt file with companies mapped to accelerators with all the other information in the original file&lt;br /&gt;
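&lt;br /&gt;
The reshaping described above might look like the following. It is a sketch and not the code in format_timing.py: the function name, the field names companies and coname, and the comma delimiter are all assumptions.&lt;br /&gt;

```python
def explode_companies(rows, company_field="companies", delimiter=","):
    """Turn accelerator rows listing several companies into one row per company."""
    out = []
    for row in rows:
        for name in row[company_field].split(delimiter):
            name = name.strip()
            if not name:
                continue  # skip empty entries such as a trailing delimiter
            new_row = dict(row)          # carry over all other information
            del new_row[company_field]
            new_row["coname"] = name
            out.append(new_row)
    return out
```

Rows that already hold a single company pass through as one output row, so both input layouts are handled the same way.&lt;br /&gt;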
&lt;br /&gt;
===prioritycodecategory.py===&lt;br /&gt;
  E:/McNair/Projects/Accelerators/Summer 2018/prioritycodecategory.py&lt;br /&gt;
&lt;br /&gt;
Input: txt file with a list of category groups&lt;br /&gt;
&lt;br /&gt;
Output: txt file with line number and minor code &lt;br /&gt;
&lt;br /&gt;
Final output: I copied the codes from the txt file and pasted them into column Z of the Cohorts Final sheet from The File to Rule Them All.&lt;br /&gt;
  E:/McNair/Projects/Accelerators/Summer 2018/Cohorts Final - minor code priority Grace.xlsx&lt;br /&gt;
&lt;br /&gt;
Chooses the code based on importance in the priority ranking dictionary before falling back to an arbitrary choice.&lt;br /&gt;
&lt;br /&gt;
===codecategory.py===&lt;br /&gt;
  E:/McNair/Projects/Accelerators/Summer 2018/codecategory.py&lt;br /&gt;
&lt;br /&gt;
Input: txt file with a list of category groups (column Y of Cohorts Final in The File to Rule Them All)&lt;br /&gt;
&lt;br /&gt;
Output: txt file with line number and multiple minor codes&lt;br /&gt;
&lt;br /&gt;
Final output: I took the minor codes and copied them into column Z of this sheet (a copy of The File to Rule Them All with this added column)&lt;br /&gt;
&lt;br /&gt;
  E:/McNair/Projects/Accelerators/Summer 2018/Cohorts Final - minor code (no priority) Grace.xlsx&lt;br /&gt;
&lt;br /&gt;
I arbitrarily chose the first code when multiple were given. I fixed this in Excel by separating on commas.&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Seed_Accelerator_Data_Assembly&amp;diff=23918</id>
		<title>Seed Accelerator Data Assembly</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Seed_Accelerator_Data_Assembly&amp;diff=23918"/>
		<updated>2018-08-03T21:17:47Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: /* prioritycodecategory.py */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Useful pages==&lt;br /&gt;
&lt;br /&gt;
*[[U.S. Seed Accelerators]]&lt;br /&gt;
**[[Accelerator Seed List (Data)]]&lt;br /&gt;
*[[Accelerator Demo Day]]&lt;br /&gt;
**[[Mechanical Turk (Tool)]]&lt;br /&gt;
*[[Crunchbase Accelerator Founders]]&lt;br /&gt;
**[[Crunchbase Data#Collecting Company Information]]&lt;br /&gt;
**[[Merging Existing Data with Crunchbase]]&lt;br /&gt;
*[[Whois Parser]]&lt;br /&gt;
*[[URL Finder (Tool)]]&lt;br /&gt;
*[[Industry Classifier]]&lt;br /&gt;
*[[Industry classifier yang]]&lt;br /&gt;
*[[VentureXpert Data]]&lt;br /&gt;
&lt;br /&gt;
Please add (or subtract) other relevant (or irrelevant) pages!&lt;br /&gt;
&lt;br /&gt;
==Database specification==&lt;br /&gt;
&lt;br /&gt;
===Preamble===&lt;br /&gt;
&lt;br /&gt;
We need to get the data into approximately 3NF to prevent errors and make it more useable.&lt;br /&gt;
&lt;br /&gt;
Inputs:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\The File To Rule Them All.xlsx&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit#gid=0&lt;br /&gt;
&lt;br /&gt;
Recent work on the above from Connor:&lt;br /&gt;
*made the Demo Day Timing Google sheet as clean as possible (fixed dates, removed duplicates, created season column)&lt;br /&gt;
*recoded the employee count&lt;br /&gt;
*normalized the investment amount&lt;br /&gt;
&lt;br /&gt;
Connor's next to do: &lt;br /&gt;
*Founders Experience: code job title&lt;br /&gt;
*Founders Education: remove unknowns, code degree and code major&lt;br /&gt;
*Fix multiple campus and cohorts (see below). &lt;br /&gt;
&lt;br /&gt;
This is documented on [[U.S._Seed_Accelerators#Update_for_Hira]].&lt;br /&gt;
&lt;br /&gt;
The current database work is in '''vcdb2'''. The code to build the relevant tables in vcdb2 is in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\LoadAcceleratorDataV2.sql&lt;br /&gt;
 &lt;br /&gt;
Note that AddCBData.sql can be run on '''crunchbase2''' to get the relevant crunchbase data for import into vcdb2 (or elsewhere).&lt;br /&gt;
&lt;br /&gt;
However, we should build a new database for this project:&lt;br /&gt;
 createdb accelerator&lt;br /&gt;
&lt;br /&gt;
As well as a new folder (Z:\accelerator) for the data.&lt;br /&gt;
&lt;br /&gt;
The work in vcdb2 is essentially to build the data that goes into the File To Rule Them All.xlsx. We should start from there!&lt;br /&gt;
&lt;br /&gt;
===Suggested Spec===&lt;br /&gt;
&lt;br /&gt;
We need to address the issue with multiple campus and cohorts. This will require loading and manipulation of the data in SQL (Hira) as well as some manual fixes to the data (Connor).&lt;br /&gt;
&lt;br /&gt;
Accelerator Table (fieldname	colname/DISCARD, from sheet &amp;quot;Accelerators Final&amp;quot;)&lt;br /&gt;
*acceleratorname (Primary Key)	Accelerators&lt;br /&gt;
*url	homepage_url&lt;br /&gt;
*cohorturl	cohort page URL&lt;br /&gt;
*cohortlisting	Break out cohorts on the website? (Y/N)&lt;br /&gt;
*type	Type&lt;br /&gt;
*alive	Alive&lt;br /&gt;
*typenote	Type notes&lt;br /&gt;
*weeks	Weeks&lt;br /&gt;
*durationnotes	Duration Notes&lt;br /&gt;
*city	city	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*state	state/region	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*foundingdate	Creation Date&lt;br /&gt;
*terms	Terms of Joining Accelerator (non-equity pay/equity/free)&lt;br /&gt;
*equity	Equity?&lt;br /&gt;
*equityamountdesc	Equity Amount&lt;br /&gt;
*equityamount	Equity Amount Normalized (only 1s, range is averaged)&lt;br /&gt;
*investmentamountdesc	Investment Amount &lt;br /&gt;
*investmentamount	Investment Amount Normalized (Midpoint) &lt;br /&gt;
*investmentnotes	Investment Notes&lt;br /&gt;
*discard industry	DISCARD&lt;br /&gt;
*industry	Industry	&lt;br /&gt;
*specgenindu	Specified/General&lt;br /&gt;
*address	Address	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*subtype	DISCARD&lt;br /&gt;
*nonprofit	nonprofit?&lt;br /&gt;
*studentfocus	designed for students (Y/N)&lt;br /&gt;
*multicampus	Multiple campuses? (Y/N)&lt;br /&gt;
*software tech (Y/N)	DISCARD&lt;br /&gt;
*stagepref	What stage do they look for in cohort companies (SEL)&lt;br /&gt;
*cbconame	Name in Crunchbase&lt;br /&gt;
*uuid	UUID&lt;br /&gt;
&lt;br /&gt;
Note that we will assume that duration and terms are common across all cohorts run by the same accelerator. We might need to revisit this assumption.&lt;br /&gt;
&lt;br /&gt;
CohortCompany Table (fieldname	colname/DISCARD, from sheet &amp;quot;Cohorts Final&amp;quot;)&lt;br /&gt;
*conamestd (primary key)&lt;br /&gt;
*coname&lt;br /&gt;
*conameorg&lt;br /&gt;
*colocation&lt;br /&gt;
*city&lt;br /&gt;
*state_code&lt;br /&gt;
*country_code&lt;br /&gt;
*address&lt;br /&gt;
*codescription&lt;br /&gt;
*short_desc&lt;br /&gt;
*long_desc&lt;br /&gt;
*cosectors&lt;br /&gt;
*costatus&lt;br /&gt;
*cofundingstage&lt;br /&gt;
*courl&lt;br /&gt;
*uuid&lt;br /&gt;
*category_list&lt;br /&gt;
*category_group_list&lt;br /&gt;
*founded_on&lt;br /&gt;
*employee_count&lt;br /&gt;
*emp_count_scale&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*gotvc&lt;br /&gt;
&lt;br /&gt;
We then need to deduplicate this table and make sure that conamestd is a valid primary key.&lt;br /&gt;
&lt;br /&gt;
Note that we won't be taking the following from &amp;quot;Cohorts Final&amp;quot; for this table:&lt;br /&gt;
*year&lt;br /&gt;
*accelerator&lt;br /&gt;
*cohort&lt;br /&gt;
*quarter&lt;br /&gt;
*acclocation&lt;br /&gt;
*accperks	DISCARD&lt;br /&gt;
*cofounder	DISCARD&lt;br /&gt;
&lt;br /&gt;
CohortParticipation&lt;br /&gt;
*Cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
*accelerator	(foreign key)&lt;br /&gt;
*conamestd (foreign key)&lt;br /&gt;
&lt;br /&gt;
===Going further===&lt;br /&gt;
&lt;br /&gt;
We should likely rebuild the cohort variable to make it a sole &amp;quot;primary key&amp;quot; for a cohort. This would mean turning each entry of cohort into something unique like: TechStars Boulder Fall 2017 or 1440 Cohort 1, so that this key could look up the year of the cohort, its quarter, etc. We could then break CohortParticipation into two tables:&lt;br /&gt;
&lt;br /&gt;
CohortParticipation&lt;br /&gt;
*Cohort&lt;br /&gt;
*accelerator	(foreign key)&lt;br /&gt;
*conamestd (foreign key)&lt;br /&gt;
&lt;br /&gt;
Cohort&lt;br /&gt;
*Cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
&lt;br /&gt;
We could then add campus and a separate table for campuses:&lt;br /&gt;
&lt;br /&gt;
Campus Table&lt;br /&gt;
*CampusName (e.g., Techstars Boulder)&lt;br /&gt;
*Accelerator (foreign key)&lt;br /&gt;
*Address&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Zip&lt;br /&gt;
*Description &lt;br /&gt;
&lt;br /&gt;
===Founders Information===&lt;br /&gt;
&lt;br /&gt;
The three founder sheets turn into three tables nicely. We don't need to renormalize them for now, just fix up their variables and do some matching on employer.&lt;br /&gt;
&lt;br /&gt;
Founders:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*Current Job&lt;br /&gt;
*Current Location&lt;br /&gt;
&lt;br /&gt;
FoundersExperience:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*Employer&lt;br /&gt;
*Job Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location&lt;br /&gt;
*Extra Description&lt;br /&gt;
&lt;br /&gt;
FoundersEducation:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
==To do/For consideration==&lt;br /&gt;
&lt;br /&gt;
Minh:&lt;br /&gt;
*Create a format for collecting timing data&lt;br /&gt;
*Put the timing job on Turk&lt;br /&gt;
&lt;br /&gt;
Connor:&lt;br /&gt;
*Try for missing timings that we really (new process) need&lt;br /&gt;
*Col W should be headquarter address&lt;br /&gt;
*What stage-  Clean up?&lt;br /&gt;
&lt;br /&gt;
Maxine: &lt;br /&gt;
*Build the google URL finder&lt;br /&gt;
*Industry classification from description&lt;br /&gt;
&lt;br /&gt;
Grace:&lt;br /&gt;
*Process and join in new timing data - new data located in Z:/accelerator/Formatted Timing Info.txt&lt;br /&gt;
*Make a category group to minorcode lookup - in E:/McNair/Projects/Accelerators/Summer 2018/Cohorts Final - minor code priority Grace.xlsx&lt;br /&gt;
*Run WHOIS crawler on all valid URLs (not facebook pages, etc.) - Maxine did this&lt;br /&gt;
*Founders Experience: Match Employers to VC funds/firms, VC backed startups (requires data from Augi) - added 2 columns to The File to Rule Them All (VC and VC start up)&lt;br /&gt;
&lt;br /&gt;
==Accelerator Data Assembly Progress (Hira) == &lt;br /&gt;
&lt;br /&gt;
*All data files are in Z:/accelerator&lt;br /&gt;
*The SQL file that loads all data is: LoadAccData.sql. It is located in E:\McNair\Projects\Accelerators\Summer 2018.&lt;br /&gt;
&lt;br /&gt;
=== Data assembly details ===&lt;br /&gt;
&lt;br /&gt;
The SQL file LoadAccData.sql currently loads data on Cohorts final and Founders from:&lt;br /&gt;
  E:\McNair\Projects\Accelerators\Summer 2018\The File To Rule Them All.xlsx&lt;br /&gt;
&lt;br /&gt;
It creates the following tables:&lt;br /&gt;
&lt;br /&gt;
1) cohortsfinal - source file:  Cohorts Final sheet in &amp;quot;The File to Rule Them All&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
2) CohortCompany - this uses data in cohortsfinal and creates a table with the following:&lt;br /&gt;
*conamestd&lt;br /&gt;
*conameorg&lt;br /&gt;
*colocation&lt;br /&gt;
*city&lt;br /&gt;
*state_code&lt;br /&gt;
*country_code&lt;br /&gt;
*address&lt;br /&gt;
*codescription&lt;br /&gt;
*short_desc&lt;br /&gt;
*long_desc&lt;br /&gt;
*cosectors&lt;br /&gt;
*costatus&lt;br /&gt;
*cofundingstage&lt;br /&gt;
*courl&lt;br /&gt;
*uuid&lt;br /&gt;
*category_list&lt;br /&gt;
*category_group_list&lt;br /&gt;
*founded_on&lt;br /&gt;
*employee_count&lt;br /&gt;
*emp_count_scale&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*gotvc&lt;br /&gt;
&lt;br /&gt;
3) CohortParticipation - uses table cohortsfinal&lt;br /&gt;
*cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
*accelerator&lt;br /&gt;
*conamestd&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
4) timing_final - This table is based on the most updated information on timing compiled in source file:  Z:/accelerator/Formatted Timing Info.txt (by Grace). It includes:&lt;br /&gt;
*coname&lt;br /&gt;
*acceleratorname&lt;br /&gt;
*keyword&lt;br /&gt;
*url&lt;br /&gt;
*webpage &lt;br /&gt;
*predicted &lt;br /&gt;
*gooddata&lt;br /&gt;
*page_details &lt;br /&gt;
*full_date &lt;br /&gt;
*month&lt;br /&gt;
*year&lt;br /&gt;
*cohort_name&lt;br /&gt;
*notes&lt;br /&gt;
*prog_duration_wks&lt;br /&gt;
*actual_date &lt;br /&gt;
*actual_month&lt;br /&gt;
*actual_year &lt;br /&gt;
*season &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
5) Founders - source file: &amp;quot;The File to Rule Them All - Founders main sheet&amp;quot;&lt;br /&gt;
*Accelerator &lt;br /&gt;
*First_Name&lt;br /&gt;
*Last_Name&lt;br /&gt;
*Full_Name&lt;br /&gt;
*Current_Job &lt;br /&gt;
*Current_Location&lt;br /&gt;
&lt;br /&gt;
6) founders_experience - source file: &amp;quot;The File to Rule Them All - Founders experience sheet&amp;quot;&lt;br /&gt;
*Accelerator &lt;br /&gt;
*First_Name &lt;br /&gt;
*Last_Name&lt;br /&gt;
*Full_Name&lt;br /&gt;
*Employer &lt;br /&gt;
*VC &lt;br /&gt;
*VC_backed_startup &lt;br /&gt;
*OLD_Job_Title&lt;br /&gt;
*NEW_Job_Title &lt;br /&gt;
*Dates_Employed&lt;br /&gt;
*Time_Employed&lt;br /&gt;
*Location &lt;br /&gt;
*Extra_Description&lt;br /&gt;
&lt;br /&gt;
7) additional_timing_info - source file: &amp;quot;merging_work.xlxs&amp;quot; located in: E:\Projects\McNair\Seed DB&lt;br /&gt;
8) additional_timing_info2 - source file: &amp;quot;formatted timing info2.txt&amp;quot; located in E:\Projects\McNair\Accelerators\Summer 2018. This was collected through MTurks. &lt;br /&gt;
Tables 7 and 8 include columns:&lt;br /&gt;
*coname&lt;br /&gt;
*acceleratorname&lt;br /&gt;
*cohort_name&lt;br /&gt;
*date&lt;br /&gt;
*month&lt;br /&gt;
*year&lt;br /&gt;
*season&lt;br /&gt;
&lt;br /&gt;
9) timing_combined - This table combines all timing information we have and appends tables 4, 7 and 8.&lt;br /&gt;
10) cohortcompanies_wtiming - merges data in tables cohortcompany and timing_combined&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Grace's Code==&lt;br /&gt;
===format_timing.py===&lt;br /&gt;
  E:/McNair/Projects/Accelerators/Summer 2018/format_timing.py&lt;br /&gt;
&lt;br /&gt;
Input: a txt file with accelerator mapped to multiple companies (in a single cell separated by columns or in separate rows)&lt;br /&gt;
&lt;br /&gt;
Output: txt file with companies mapped to accelerators with all the other information in the original file&lt;br /&gt;
&lt;br /&gt;
===prioritycodecategory.py===&lt;br /&gt;
  E:/McNair/Projects/Accelerators/Summer 2018/prioritycodecategory.py&lt;br /&gt;
&lt;br /&gt;
Input: txt file with a list of category groups&lt;br /&gt;
&lt;br /&gt;
Output: txt file with line number and minor code &lt;br /&gt;
&lt;br /&gt;
Final output: I copied the codes from the txt file and pasted them into column Z of the Cohorts Final sheet from The File to Rule Them All.&lt;br /&gt;
  E:/McNair/Projects/Accelerators/Summer 2018/Cohorts Final - minor code priority Grace.xlsx&lt;br /&gt;
&lt;br /&gt;
Chooses the code based on importance in the priority ranking dictionary before falling back to an arbitrary choice.&lt;br /&gt;
&lt;br /&gt;
===codecategory.py===&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Seed_Accelerator_Data_Assembly&amp;diff=23917</id>
		<title>Seed Accelerator Data Assembly</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Seed_Accelerator_Data_Assembly&amp;diff=23917"/>
		<updated>2018-08-03T21:17:28Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: /* prioritycodecategory.py */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Useful pages==&lt;br /&gt;
&lt;br /&gt;
*[[U.S. Seed Accelerators]]&lt;br /&gt;
**[[Accelerator Seed List (Data)]]&lt;br /&gt;
*[[Accelerator Demo Day]]&lt;br /&gt;
**[[Mechanical Turk (Tool)]]&lt;br /&gt;
*[[Crunchbase Accelerator Founders]]&lt;br /&gt;
**[[Crunchbase Data#Collecting Company Information]]&lt;br /&gt;
**[[Merging Existing Data with Crunchbase]]&lt;br /&gt;
*[[Whois Parser]]&lt;br /&gt;
*[[URL Finder (Tool)]]&lt;br /&gt;
*[[Industry Classifier]]&lt;br /&gt;
*[[Industry classifier yang]]&lt;br /&gt;
*[[VentureXpert Data]]&lt;br /&gt;
&lt;br /&gt;
Please add (or subtract) other relevant (or irrelevant) pages!&lt;br /&gt;
&lt;br /&gt;
==Database specification==&lt;br /&gt;
&lt;br /&gt;
===Preamble===&lt;br /&gt;
&lt;br /&gt;
We need to get the data into approximately 3NF to prevent errors and make it more useable.&lt;br /&gt;
&lt;br /&gt;
Inputs:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\The File To Rule Them All.xlsx&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit#gid=0&lt;br /&gt;
&lt;br /&gt;
Recent work on the above from Connor:&lt;br /&gt;
*made the Demo Day Timing Google sheet as clean as possible (fixed dates, removed duplicates, created season column)&lt;br /&gt;
*recoded the employee count&lt;br /&gt;
*normalized the investment amount&lt;br /&gt;
&lt;br /&gt;
Connor's next to do: &lt;br /&gt;
*Founders Experience: code job title&lt;br /&gt;
*Founders Education: remove unknowns, code degree and code major&lt;br /&gt;
*Fix multiple campus and cohorts (see below). &lt;br /&gt;
&lt;br /&gt;
This is documented on [[U.S._Seed_Accelerators#Update_for_Hira]].&lt;br /&gt;
&lt;br /&gt;
The current database work is in '''vcdb2'''. The code to build the relevant tables in vcdb2 is in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\LoadAcceleratorDataV2.sql&lt;br /&gt;
 &lt;br /&gt;
Note that AddCBData.sql can be run on '''crunchbase2''' to get the relevant crunchbase data for import into vcdb2 (or elsewhere).&lt;br /&gt;
&lt;br /&gt;
However, we should build a new database for this project:&lt;br /&gt;
 createdb accelerator&lt;br /&gt;
&lt;br /&gt;
As well as a new folder (Z:\accelerator) for the data.&lt;br /&gt;
&lt;br /&gt;
The work in vcdb2 is essentially to build the data that goes into the File To Rule Them All.xlsx. We should start from there!&lt;br /&gt;
&lt;br /&gt;
===Suggested Spec===&lt;br /&gt;
&lt;br /&gt;
We need to address the issue with multiple campus and cohorts. This will require loading and manipulation of the data in SQL (Hira) as well as some manual fixes to the data (Connor).&lt;br /&gt;
&lt;br /&gt;
Accelerator Table (fieldname	colname/DISCARD, from sheet &amp;quot;Accelerators Final&amp;quot;)&lt;br /&gt;
*acceleratorname (Primary Key)	Accelerators&lt;br /&gt;
*url	homepage_url&lt;br /&gt;
*cohorturl	cohort page URL&lt;br /&gt;
*cohortlisting	Break out cohorts on the website? (Y/N)&lt;br /&gt;
*type	Type&lt;br /&gt;
*alive	Alive&lt;br /&gt;
*typenote	Type notes&lt;br /&gt;
*weeks	Weeks&lt;br /&gt;
*durationnotes	Duration Notes&lt;br /&gt;
*city	city	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*state	state/region	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*foundingdate	Creation Date&lt;br /&gt;
*terms	Terms of Joining Accelerator (non-equity pay/equity/free)&lt;br /&gt;
*equity	Equity?&lt;br /&gt;
*equityamountdesc	Equity Amount&lt;br /&gt;
*equityamount	Equity Amount Normalized (only 1s, range is averaged)&lt;br /&gt;
*investmentamountdesc	Investment Amount &lt;br /&gt;
*investmentamount	Investment Amount Normalized (Midpoint) &lt;br /&gt;
*investmentnotes	Investment Notes&lt;br /&gt;
*discard industry	DISCARD&lt;br /&gt;
*industry	Industry	&lt;br /&gt;
*specgenindu	Specified/General&lt;br /&gt;
*address	Address	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*subtype	DISCARD&lt;br /&gt;
*nonprofit	nonprofit?&lt;br /&gt;
*studentfocus	designed for students (Y/N)&lt;br /&gt;
*multicampus	Multiple campuses? (Y/N)&lt;br /&gt;
*software tech (Y/N)	DISCARD&lt;br /&gt;
*stagepref	What stage do they look for in cohort companies (SEL)&lt;br /&gt;
*cbconame	Name in Crunchbase&lt;br /&gt;
*uuid	UUID&lt;br /&gt;
&lt;br /&gt;
Note that we will assume that duration and terms are common across all cohorts run by the same accelerator. We might need to revisit this assumption.&lt;br /&gt;
&lt;br /&gt;
CohortCompany Table (fieldname	colname/DISCARD, from sheet &amp;quot;Cohorts Final&amp;quot;)&lt;br /&gt;
*conamestd (primary key)&lt;br /&gt;
*coname&lt;br /&gt;
*conameorg&lt;br /&gt;
*colocation&lt;br /&gt;
*city&lt;br /&gt;
*state_code&lt;br /&gt;
*country_code&lt;br /&gt;
*address&lt;br /&gt;
*codescription&lt;br /&gt;
*short_desc&lt;br /&gt;
*long_desc&lt;br /&gt;
*cosectors&lt;br /&gt;
*costatus&lt;br /&gt;
*cofundingstage&lt;br /&gt;
*courl&lt;br /&gt;
*uuid&lt;br /&gt;
*category_list&lt;br /&gt;
*category_group_list&lt;br /&gt;
*founded_on&lt;br /&gt;
*employee_count&lt;br /&gt;
*emp_count_scale&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*gotvc&lt;br /&gt;
&lt;br /&gt;
We then need to deduplicate this table and make sure that conamestd is a valid primary key.&lt;br /&gt;
&lt;br /&gt;
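A quick way to check whether conamestd can serve as a primary key is to count repeated and empty values before loading. The sketch below is only an illustration: the function name find_key_problems and the sample rows are made up, and the real check could just as well be done in SQL.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import Counter


def find_key_problems(rows, key="conamestd"):
    """Return (values that repeat, number of rows with an empty key)."""
    # Hypothetical helper: treats None and whitespace-only values as missing.
    values = [(row.get(key) or "").strip() for row in rows]
    counts = Counter(values)
    duplicates = sorted(v for v, n in counts.items() if v and n != 1)
    missing = counts.get("", 0)
    return duplicates, missing


# Made-up sample rows, not real project data.
sample_rows = [
    {"conamestd": "ACME"},
    {"conamestd": "ACME"},
    {"conamestd": "WIDGETCO"},
    {"conamestd": ""},
]

duplicates, missing = find_key_problems(sample_rows)
print(duplicates, missing)
```
&lt;br /&gt;
The column is a valid primary key only when the duplicate list is empty and no rows are missing a value.&lt;br /&gt;
&lt;br /&gt;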
Note that we won't be taking the following from &amp;quot;Cohorts Final&amp;quot; for this table:&lt;br /&gt;
*year&lt;br /&gt;
*accelerator&lt;br /&gt;
*cohort&lt;br /&gt;
*quarter&lt;br /&gt;
*acclocation&lt;br /&gt;
*accperks	DISCARD&lt;br /&gt;
*cofounder	DISCARD&lt;br /&gt;
&lt;br /&gt;
CohortParticipation&lt;br /&gt;
*Cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
*accelerator	(foreign key)&lt;br /&gt;
*conamestd (foreign key)&lt;br /&gt;
&lt;br /&gt;
===Going further===&lt;br /&gt;
&lt;br /&gt;
We should likely rebuild the cohort variable to make it a sole &amp;quot;primary key&amp;quot; for a cohort. This would mean turning each entry of cohort into something unique like: TechStars Boulder Fall 2017 or 1440 Cohort 1, so that this key could look up the year of the cohort, its quarter, etc. We could then break CohortParticipation into two tables:&lt;br /&gt;
&lt;br /&gt;
CohortParticipation&lt;br /&gt;
*Cohort&lt;br /&gt;
*accelerator	(foreign key)&lt;br /&gt;
*conamestd (foreign key)&lt;br /&gt;
&lt;br /&gt;
Cohort&lt;br /&gt;
*Cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
&lt;br /&gt;
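As a rough sketch of the split described above, the snippet below builds a unique cohort label and separates flat participation rows into a Cohort lookup and a slimmer CohortParticipation list. The helper make_cohort_key, the field names campus and season, and the sample companies are all assumptions for illustration; quarter could be stored in the Cohort lookup in the same way as year.&lt;br /&gt;
&lt;br /&gt;
```python
def make_cohort_key(accelerator, campus, season, year):
    """Join the non-empty parts into one readable cohort label."""
    # Hypothetical key layout: accelerator, campus, season, year.
    parts = [accelerator, campus, season, str(year) if year else ""]
    return " ".join(p.strip() for p in parts if p and p.strip())


# Made-up participation rows, not real project data.
rows = [
    {"accelerator": "TechStars", "campus": "Boulder", "season": "Fall",
     "year": 2017, "conamestd": "ACME"},
    {"accelerator": "TechStars", "campus": "Boulder", "season": "Fall",
     "year": 2017, "conamestd": "WIDGETCO"},
]

cohort_table = {}         # cohort key mapped to the attributes of that cohort
participation_table = []  # (cohort key, accelerator, conamestd)
for row in rows:
    key = make_cohort_key(row["accelerator"], row["campus"],
                          row["season"], row["year"])
    cohort_table[key] = {"year": row["year"], "season": row["season"]}
    participation_table.append((key, row["accelerator"], row["conamestd"]))

print(sorted(cohort_table))
print(participation_table)
```
&lt;br /&gt;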
We could then add campus and a separate table for campuses:&lt;br /&gt;
&lt;br /&gt;
Campus Table&lt;br /&gt;
*CampusName (e.g., Techstars Boulder)&lt;br /&gt;
*Accelerator (foreign key)&lt;br /&gt;
*Address&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Zip&lt;br /&gt;
*Description &lt;br /&gt;
&lt;br /&gt;
===Founders Information===&lt;br /&gt;
&lt;br /&gt;
The three founder sheets turn into three tables nicely. We don't need to renormalize them for now, just fix up their variables and do some matching on employer.&lt;br /&gt;
&lt;br /&gt;
Founders:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*Current Job&lt;br /&gt;
*Current Location&lt;br /&gt;
&lt;br /&gt;
FoundersExperience:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*Employer&lt;br /&gt;
*Job Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location&lt;br /&gt;
*Extra Description&lt;br /&gt;
&lt;br /&gt;
FoundersEducation:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
==To do/For consideration==&lt;br /&gt;
&lt;br /&gt;
Minh:&lt;br /&gt;
*Create a format for collecting timing data&lt;br /&gt;
*Put the timing job on Turk&lt;br /&gt;
&lt;br /&gt;
Connor:&lt;br /&gt;
*Try to find the missing timings that we really need (new process)&lt;br /&gt;
*Col W should be headquarter address&lt;br /&gt;
*What stage-  Clean up?&lt;br /&gt;
&lt;br /&gt;
Maxine: &lt;br /&gt;
*Build the google URL finder&lt;br /&gt;
*Industry classification from description&lt;br /&gt;
&lt;br /&gt;
Grace:&lt;br /&gt;
*Process and join in new timing data - new data located in Z:/accelerator/Formatted Timing Info.txt&lt;br /&gt;
*Make a category group to minorcode lookup - in E:/McNair/Projects/Accelerators/Summer 2018/Cohorts Final - minor code priority Grace.xlsx&lt;br /&gt;
*Run WHOIS crawler on all valid URLs (not facebook pages, etc.) - Maxine did this&lt;br /&gt;
*Founders Experience: Match Employers to VC funds/firms, VC backed startups (requires data from Augi) - added 2 columns to The File to Rule Them All (VC and VC start up)&lt;br /&gt;
&lt;br /&gt;
==Accelerator Data Assembly Progress (Hira) == &lt;br /&gt;
&lt;br /&gt;
*All data files are in Z:/accelerator&lt;br /&gt;
*The SQL file that loads all data is: LoadAccData.sql. It is located in E:\McNair\Projects\Accelerators\Summer 2018.&lt;br /&gt;
&lt;br /&gt;
=== Data assembly details ===&lt;br /&gt;
&lt;br /&gt;
The SQL file LoadAccData.sql currently loads data on Cohorts final and Founders from:&lt;br /&gt;
  E:\McNair\Projects\Accelerators\Summer 2018\The File To Rule Them All.xlsx&lt;br /&gt;
&lt;br /&gt;
It creates the following tables:&lt;br /&gt;
&lt;br /&gt;
1) cohortsfinal - source file:  Cohorts Final sheet in &amp;quot;The File to Rule Them All&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
2) CohortCompany - this uses data in cohortsfinal and creates a table with the following:&lt;br /&gt;
*conamestd&lt;br /&gt;
*conameorg&lt;br /&gt;
*colocation&lt;br /&gt;
*city&lt;br /&gt;
*state_code&lt;br /&gt;
*country_code&lt;br /&gt;
*address&lt;br /&gt;
*codescription&lt;br /&gt;
*short_desc&lt;br /&gt;
*long_desc&lt;br /&gt;
*cosectors&lt;br /&gt;
*costatus&lt;br /&gt;
*cofundingstage&lt;br /&gt;
*courl&lt;br /&gt;
*uuid&lt;br /&gt;
*category_list&lt;br /&gt;
*category_group_list&lt;br /&gt;
*founded_on&lt;br /&gt;
*employee_count&lt;br /&gt;
*emp_count_scale&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*gotvc&lt;br /&gt;
&lt;br /&gt;
3) CohortParticipation - uses table cohortsfinal&lt;br /&gt;
*cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
*accelerator&lt;br /&gt;
*conamestd&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
4) timing_final - This table is based on the most updated information on timing compiled in source file:  Z:/accelerator/Formatted Timing Info.txt (by Grace). It includes:&lt;br /&gt;
*coname&lt;br /&gt;
*acceleratorname&lt;br /&gt;
*keyword&lt;br /&gt;
*url&lt;br /&gt;
*webpage &lt;br /&gt;
*predicted &lt;br /&gt;
*gooddata&lt;br /&gt;
*page_details &lt;br /&gt;
*full_date &lt;br /&gt;
*month&lt;br /&gt;
*year&lt;br /&gt;
*cohort_name&lt;br /&gt;
*notes&lt;br /&gt;
*prog_duration_wks&lt;br /&gt;
*actual_date &lt;br /&gt;
*actual_month&lt;br /&gt;
*actual_year &lt;br /&gt;
*season &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
5) Founders - source file: &amp;quot;The File to Rule Them All - Founders main sheet&amp;quot;&lt;br /&gt;
*Accelerator &lt;br /&gt;
*First_Name&lt;br /&gt;
*Last_Name&lt;br /&gt;
*Full_Name&lt;br /&gt;
*Current_Job &lt;br /&gt;
*Current_Location&lt;br /&gt;
&lt;br /&gt;
6) founders_experience - source file: &amp;quot;The File to Rule Them All - Founders experience sheet&amp;quot;&lt;br /&gt;
*Accelerator &lt;br /&gt;
*First_Name &lt;br /&gt;
*Last_Name&lt;br /&gt;
*Full_Name&lt;br /&gt;
*Employer &lt;br /&gt;
*VC &lt;br /&gt;
*VC_backed_startup &lt;br /&gt;
*OLD_Job_Title&lt;br /&gt;
*NEW_Job_Title &lt;br /&gt;
*Dates_Employed&lt;br /&gt;
*Time_Employed&lt;br /&gt;
*Location &lt;br /&gt;
*Extra_Description&lt;br /&gt;
&lt;br /&gt;
7) additional_timing_info - source file: &amp;quot;merging_work.xlxs&amp;quot; located in: E:\Projects\McNair\Seed DB&lt;br /&gt;
8) additional_timing_info2 - source file: &amp;quot;formatted timing info2.txt&amp;quot; located in E:\Projects\McNair\Accelerators\Summer 2018. This was collected through MTurks. &lt;br /&gt;
Tables 7 and 8 include columns:&lt;br /&gt;
*coname&lt;br /&gt;
*acceleratorname&lt;br /&gt;
*cohort_name&lt;br /&gt;
*date&lt;br /&gt;
*month&lt;br /&gt;
*year&lt;br /&gt;
*season&lt;br /&gt;
&lt;br /&gt;
9) timing_combined - This table combines all timing information we have and appends tables 4, 7 and 8.&lt;br /&gt;
10) cohortcompanies_wtiming - merges data in tables cohortcompany and timing_combined&lt;br /&gt;
&lt;br /&gt;
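The append step for timing_combined is done in SQL in LoadAccData.sql. Purely as an illustration of the idea, the Python sketch below stacks rows from several timing tables while keeping only the columns they share. The function name, the extra source tag, and the sample rows are assumptions, not part of the actual load script.&lt;br /&gt;
&lt;br /&gt;
```python
SHARED_COLUMNS = ["coname", "acceleratorname", "cohort_name",
                  "month", "year", "season"]


def append_timing_tables(named_tables, columns=SHARED_COLUMNS):
    """Stack rows from several tables, keeping the shared columns plus a source tag."""
    combined = []
    for source, rows in named_tables:
        for row in rows:
            # Columns a table lacks come through as None.
            record = {col: row.get(col) for col in columns}
            record["source"] = source  # assumption: tag rows with their origin table
            combined.append(record)
    return combined


# Made-up rows, not real project data.
timing_final = [{"coname": "Acme", "acceleratorname": "Techstars",
                 "year": 2017, "season": "Fall", "keyword": "demo day"}]
additional_timing_info = [{"coname": "Gizmo", "acceleratorname": "Techstars",
                           "year": 2016, "season": "Spring"}]

timing_combined = append_timing_tables([
    ("timing_final", timing_final),
    ("additional_timing_info", additional_timing_info),
])
print(len(timing_combined))
```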
&lt;br /&gt;
==Grace's Code==&lt;br /&gt;
===format_timing.py===&lt;br /&gt;
  E:/McNair/Projects/Accelerators/Summer 2018/format_timing.py&lt;br /&gt;
&lt;br /&gt;
Input: a txt file with each accelerator mapped to multiple companies (in a single cell separated by columns or in separate rows)&lt;br /&gt;
&lt;br /&gt;
Output: txt file with companies mapped to accelerators with all the other information in the original file&lt;br /&gt;
&lt;br /&gt;
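The reshaping that format_timing.py performs can be pictured roughly as follows. This is not the actual script: it assumes a tab-delimited input, a column named companies, and a comma between company names inside a cell, and the sample rows are made up.&lt;br /&gt;
&lt;br /&gt;
```python
import csv
import io


def explode_companies(lines, company_col="companies", sep=","):
    """Turn one row per accelerator into one row per company."""
    reader = csv.DictReader(lines, delimiter="\t")
    out = []
    for row in reader:
        cell = row.pop(company_col) or ""
        for name in cell.split(sep):
            name = name.strip()
            if name:
                # Carry every other column of the original row along unchanged.
                out.append({"coname": name, **row})
    return out


# Made-up input: one multi-company cell and one single-company row.
sample = io.StringIO(
    "acceleratorname\tcompanies\tyear\n"
    "Techstars\tAcme, WidgetCo\t2017\n"
    "Techstars\tGizmo\t2017\n"
)
records = explode_companies(sample)
for record in records:
    print(record)
```
&lt;br /&gt;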
===prioritycodecategory.py===&lt;br /&gt;
  E:/McNair/Projects/Accelerators/Summer 2018/prioritycodecategory.py&lt;br /&gt;
&lt;br /&gt;
Input: txt file with a list of category groups&lt;br /&gt;
Output: txt file with line number and minor code &lt;br /&gt;
&lt;br /&gt;
Final output: I copied the codes from the txt file and pasted them into column Z of the Cohorts Final sheet from The File to Rule Them All.&lt;br /&gt;
  E:/McNair/Projects/Accelerators/Summer 2018/Cohorts Final - minor code priority Grace.xlsx&lt;br /&gt;
&lt;br /&gt;
Chooses the code based on importance from a priority ranking dictionary before choosing arbitrarily.&lt;br /&gt;
&lt;br /&gt;
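A minimal sketch of that selection rule is shown below. It is not the code in prioritycodecategory.py: the lookup dictionaries, the minor codes, and the comma separator are hypothetical, and "arbitrarily" is interpreted here as taking the first candidate.&lt;br /&gt;
&lt;br /&gt;
```python
def pick_minor_code(category_groups, group_to_code, priority):
    """Pick the best-ranked code among a company's category groups."""
    names = [g.strip() for g in category_groups.split(",")]
    candidates = [group_to_code[g] for g in names if g in group_to_code]
    if not candidates:
        return None
    ranked = [c for c in candidates if c in priority]
    if ranked:
        # A lower number means a higher priority in this sketch.
        return min(ranked, key=lambda c: priority[c])
    # No candidate has a rank, so fall back to the first one.
    return candidates[0]


# Hypothetical lookups for illustration only.
group_to_code = {"Software": "SW", "Health Care": "HC", "Apps": "AP"}
priority = {"HC": 1, "SW": 2}

print(pick_minor_code("Software, Health Care", group_to_code, priority))
print(pick_minor_code("Apps", group_to_code, priority))
```
&lt;br /&gt;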
===codecategory.py===&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Seed_Accelerator_Data_Assembly&amp;diff=23915</id>
		<title>Seed Accelerator Data Assembly</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Seed_Accelerator_Data_Assembly&amp;diff=23915"/>
		<updated>2018-08-03T21:13:30Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: /* format_timing.py */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Useful pages==&lt;br /&gt;
&lt;br /&gt;
*[[U.S. Seed Accelerators]]&lt;br /&gt;
**[[Accelerator Seed List (Data)]]&lt;br /&gt;
*[[Accelerator Demo Day]]&lt;br /&gt;
**[[Mechanical Turk (Tool)]]&lt;br /&gt;
*[[Crunchbase Accelerator Founders]]&lt;br /&gt;
**[[Crunchbase Data#Collecting Company Information]]&lt;br /&gt;
**[[Merging Existing Data with Crunchbase]]&lt;br /&gt;
*[[Whois Parser]]&lt;br /&gt;
*[[URL Finder (Tool)]]&lt;br /&gt;
*[[Industry Classifier]]&lt;br /&gt;
*[[Industry classifier yang]]&lt;br /&gt;
*[[VentureXpert Data]]&lt;br /&gt;
&lt;br /&gt;
Please add (or subtract) other relevant (or irrelevant) pages!&lt;br /&gt;
&lt;br /&gt;
==Database specification==&lt;br /&gt;
&lt;br /&gt;
===Preamble===&lt;br /&gt;
&lt;br /&gt;
We need to get the data into approximately 3NF to prevent errors and make it more useable.&lt;br /&gt;
&lt;br /&gt;
Inputs:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\The File To Rule Them All.xlsx&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit#gid=0&lt;br /&gt;
&lt;br /&gt;
Recent work on the above from Connor:&lt;br /&gt;
*made the Demo Day Timing Google sheet as clean as possible (fixed dates, removed duplicates, created season column)&lt;br /&gt;
*recoded the employee count&lt;br /&gt;
*normalized the investment amount&lt;br /&gt;
&lt;br /&gt;
Connor's next to do: &lt;br /&gt;
*Founders Experience: code job title&lt;br /&gt;
*Founders Education: remove unknowns, code degree and code major&lt;br /&gt;
*Fix multiple campus and cohorts (see below). &lt;br /&gt;
&lt;br /&gt;
This is documented on [[U.S._Seed_Accelerators#Update_for_Hira]].&lt;br /&gt;
&lt;br /&gt;
The current database work is in '''vcdb2'''. The code to build the relevant tables in vcdb2 is in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\LoadAcceleratorDataV2.sql&lt;br /&gt;
 &lt;br /&gt;
Note that AddCBData.sql can be run on '''crunchbase2''' to get the relevant crunchbase data for import into vcdb2 (or elsewhere).&lt;br /&gt;
&lt;br /&gt;
However, we should build a new database for this project:&lt;br /&gt;
 createdb accelerator&lt;br /&gt;
&lt;br /&gt;
As well as a new folder (Z:\accelerator) for the data.&lt;br /&gt;
&lt;br /&gt;
The work in vcdb2 is essentially to build the data that goes into the File To Rule Them All.xlsx. We should start from there!&lt;br /&gt;
&lt;br /&gt;
===Suggested Spec===&lt;br /&gt;
&lt;br /&gt;
We need to address the issue with multiple campus and cohorts. This will require loading and manipulation of the data in SQL (Hira) as well as some manual fixes to the data (Connor).&lt;br /&gt;
&lt;br /&gt;
Accelerator Table (fieldname	colname/DISCARD, from sheet &amp;quot;Accelerators Final&amp;quot;)&lt;br /&gt;
*acceleratorname (Primary Key)	Accelerators&lt;br /&gt;
*url	homepage_url&lt;br /&gt;
*cohorturl	cohort page URL&lt;br /&gt;
*cohortlisting	Break out cohorts on the website? (Y/N)&lt;br /&gt;
*type	Type&lt;br /&gt;
*alive	Alive&lt;br /&gt;
*typenote	Type notes&lt;br /&gt;
*weeks	Weeks&lt;br /&gt;
*durationnotes	Duration Notes&lt;br /&gt;
*city	city	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*state	state/region	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*foundingdate	Creation Date&lt;br /&gt;
*terms	Terms of Joining Accelerator (non-equity pay/equity/free)&lt;br /&gt;
*equity	Equity?&lt;br /&gt;
*equityamountdesc	Equity Amount&lt;br /&gt;
*equityamount	Equity Amount Normalized (only 1s, range is averaged)&lt;br /&gt;
*investmentamountdesc	Investment Amount &lt;br /&gt;
*investmentamount	Investment Amount Normalized (Midpoint) &lt;br /&gt;
*investmentnotes	Investment Notes&lt;br /&gt;
*discard industry	DISCARD&lt;br /&gt;
*industry	Industry	&lt;br /&gt;
*specgenindu	Specified/General&lt;br /&gt;
*address	Address	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*subtype	DISCARD&lt;br /&gt;
*nonprofit	nonprofit?&lt;br /&gt;
*studentfocus	designed for students (Y/N)&lt;br /&gt;
*multicampus	Multiple campuses? (Y/N)&lt;br /&gt;
*software tech (Y/N)	DISCARD&lt;br /&gt;
*stagepref	What stage do they look for in cohort companies (SEL)&lt;br /&gt;
*cbconame	Name in Crunchbase&lt;br /&gt;
*uuid	UUID&lt;br /&gt;
&lt;br /&gt;
Note that we will assume that duration and terms are common across all cohorts run by the same accelerator. We might need to revisit this assumption.&lt;br /&gt;
&lt;br /&gt;
CohortCompany Table (fieldname	colname/DISCARD, from sheet &amp;quot;Cohorts Final&amp;quot;)&lt;br /&gt;
*conamestd (primary key)&lt;br /&gt;
*coname&lt;br /&gt;
*conameorg&lt;br /&gt;
*colocation&lt;br /&gt;
*city&lt;br /&gt;
*state_code&lt;br /&gt;
*country_code&lt;br /&gt;
*address&lt;br /&gt;
*codescription&lt;br /&gt;
*short_desc&lt;br /&gt;
*long_desc&lt;br /&gt;
*cosectors&lt;br /&gt;
*costatus&lt;br /&gt;
*cofundingstage&lt;br /&gt;
*courl&lt;br /&gt;
*uuid&lt;br /&gt;
*category_list&lt;br /&gt;
*category_group_list&lt;br /&gt;
*founded_on&lt;br /&gt;
*employee_count&lt;br /&gt;
*emp_count_scale&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*gotvc&lt;br /&gt;
&lt;br /&gt;
We then need to deduplicate this table and make sure that conamestd is a valid primary key.&lt;br /&gt;
&lt;br /&gt;
Note that we won't be taking the following from &amp;quot;Cohorts Final&amp;quot; for this table:&lt;br /&gt;
*year&lt;br /&gt;
*accelerator&lt;br /&gt;
*cohort&lt;br /&gt;
*quarter&lt;br /&gt;
*acclocation&lt;br /&gt;
*accperks	DISCARD&lt;br /&gt;
*cofounder	DISCARD&lt;br /&gt;
&lt;br /&gt;
CohortParticipation&lt;br /&gt;
*Cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
*accelerator	(foreign key)&lt;br /&gt;
*conamestd (foreign key)&lt;br /&gt;
&lt;br /&gt;
===Going further===&lt;br /&gt;
&lt;br /&gt;
We should likely rebuild the cohort variable to make it a sole &amp;quot;primary key&amp;quot; for a cohort. This would mean turning each entry of cohort into something unique like: TechStars Boulder Fall 2017 or 1440 Cohort 1, so that this key could look up the year of the cohort, its quarter, etc. We could then break CohortParticipation into two tables:&lt;br /&gt;
&lt;br /&gt;
CohortParticipation&lt;br /&gt;
*Cohort&lt;br /&gt;
*accelerator	(foreign key)&lt;br /&gt;
*conamestd (foreign key)&lt;br /&gt;
&lt;br /&gt;
Cohort&lt;br /&gt;
*Cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
&lt;br /&gt;
We could then add campus and a separate table for campuses:&lt;br /&gt;
&lt;br /&gt;
Campus Table&lt;br /&gt;
*CampusName (e.g., Techstars Boulder)&lt;br /&gt;
*Accelerator (foreign key)&lt;br /&gt;
*Address&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Zip&lt;br /&gt;
*Description &lt;br /&gt;
&lt;br /&gt;
===Founders Information===&lt;br /&gt;
&lt;br /&gt;
The three founder sheets turn into three tables nicely. We don't need to renormalize them for now, just fix up their variables and do some matching on employer.&lt;br /&gt;
&lt;br /&gt;
Founders:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*Current Job&lt;br /&gt;
*Current Location&lt;br /&gt;
&lt;br /&gt;
FoundersExperience:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*Employer&lt;br /&gt;
*Job Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location&lt;br /&gt;
*Extra Description&lt;br /&gt;
&lt;br /&gt;
FoundersEducation:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
==To do/For consideration==&lt;br /&gt;
&lt;br /&gt;
Minh:&lt;br /&gt;
*Create a format for collecting timing data&lt;br /&gt;
*Put the timing job on Turk&lt;br /&gt;
&lt;br /&gt;
Connor:&lt;br /&gt;
*Try to find the missing timings that we really need (new process)&lt;br /&gt;
*Col W should be headquarter address&lt;br /&gt;
*What stage-  Clean up?&lt;br /&gt;
&lt;br /&gt;
Maxine: &lt;br /&gt;
*Build the google URL finder&lt;br /&gt;
*Industry classification from description&lt;br /&gt;
&lt;br /&gt;
Grace:&lt;br /&gt;
*Process and join in new timing data - new data located in Z:/accelerator/Formatted Timing Info.txt&lt;br /&gt;
*Make a category group to minorcode lookup - in E:/McNair/Projects/Accelerators/Summer 2018/Cohorts Final - minor code priority Grace.xlsx&lt;br /&gt;
*Run WHOIS crawler on all valid URLs (not facebook pages, etc.) - Maxine did this&lt;br /&gt;
*Founders Experience: Match Employers to VC funds/firms, VC backed startups (requires data from Augi) - added 2 columns to The File to Rule Them All (VC and VC start up)&lt;br /&gt;
&lt;br /&gt;
==Accelerator Data Assembly Progress (Hira) == &lt;br /&gt;
&lt;br /&gt;
*All data files are in Z:/accelerator&lt;br /&gt;
*The SQL file that loads all data is: LoadAccData.sql. It is located in E:\McNair\Projects\Accelerators\Summer 2018.&lt;br /&gt;
&lt;br /&gt;
=== Data assembly details ===&lt;br /&gt;
&lt;br /&gt;
The SQL file LoadAccData.sql currently loads data on Cohorts final and Founders from:&lt;br /&gt;
  E:\McNair\Projects\Accelerators\Summer 2018\The File To Rule Them All.xlsx&lt;br /&gt;
&lt;br /&gt;
It creates the following tables:&lt;br /&gt;
&lt;br /&gt;
1) cohortsfinal - source file:  Cohorts Final sheet in &amp;quot;The File to Rule Them All&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
2) CohortCompany - this uses data in cohortsfinal and creates a table with the following:&lt;br /&gt;
*conamestd&lt;br /&gt;
*conameorg&lt;br /&gt;
*colocation&lt;br /&gt;
*city&lt;br /&gt;
*state_code&lt;br /&gt;
*country_code&lt;br /&gt;
*address&lt;br /&gt;
*codescription&lt;br /&gt;
*short_desc&lt;br /&gt;
*long_desc&lt;br /&gt;
*cosectors&lt;br /&gt;
*costatus&lt;br /&gt;
*cofundingstage&lt;br /&gt;
*courl&lt;br /&gt;
*uuid&lt;br /&gt;
*category_list&lt;br /&gt;
*category_group_list&lt;br /&gt;
*founded_on&lt;br /&gt;
*employee_count&lt;br /&gt;
*emp_count_scale&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*gotvc&lt;br /&gt;
&lt;br /&gt;
3) CohortParticipation - uses table cohortsfinal&lt;br /&gt;
*cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
*accelerator&lt;br /&gt;
*conamestd&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
4) timing_final - This table is based on the most updated information on timing compiled in source file:  Z:/accelerator/Formatted Timing Info.txt (by Grace). It includes:&lt;br /&gt;
*coname&lt;br /&gt;
*acceleratorname&lt;br /&gt;
*keyword&lt;br /&gt;
*url&lt;br /&gt;
*webpage &lt;br /&gt;
*predicted &lt;br /&gt;
*gooddata&lt;br /&gt;
*page_details &lt;br /&gt;
*full_date &lt;br /&gt;
*month&lt;br /&gt;
*year&lt;br /&gt;
*cohort_name&lt;br /&gt;
*notes&lt;br /&gt;
*prog_duration_wks&lt;br /&gt;
*actual_date &lt;br /&gt;
*actual_month&lt;br /&gt;
*actual_year &lt;br /&gt;
*season &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
5) Founders - source file: &amp;quot;The File to Rule Them All - Founders main sheet&amp;quot;&lt;br /&gt;
*Accelerator &lt;br /&gt;
*First_Name&lt;br /&gt;
*Last_Name&lt;br /&gt;
*Full_Name&lt;br /&gt;
*Current_Job &lt;br /&gt;
*Current_Location&lt;br /&gt;
&lt;br /&gt;
6) founders_experience - source file: &amp;quot;The File to Rule Them All - Founders experience sheet&amp;quot;&lt;br /&gt;
*Accelerator &lt;br /&gt;
*First_Name &lt;br /&gt;
*Last_Name&lt;br /&gt;
*Full_Name&lt;br /&gt;
*Employer &lt;br /&gt;
*VC &lt;br /&gt;
*VC_backed_startup &lt;br /&gt;
*OLD_Job_Title&lt;br /&gt;
*NEW_Job_Title &lt;br /&gt;
*Dates_Employed&lt;br /&gt;
*Time_Employed&lt;br /&gt;
*Location &lt;br /&gt;
*Extra_Description&lt;br /&gt;
&lt;br /&gt;
7) additional_timing_info - source file: &amp;quot;merging_work.xlxs&amp;quot; located in: E:\Projects\McNair\Seed DB&lt;br /&gt;
8) additional_timing_info2 - source file: &amp;quot;formatted timing info2.txt&amp;quot; located in E:\Projects\McNair\Accelerators\Summer 2018. This was collected through MTurks. &lt;br /&gt;
Tables 7 and 8 include columns:&lt;br /&gt;
*coname&lt;br /&gt;
*acceleratorname&lt;br /&gt;
*cohort_name&lt;br /&gt;
*date&lt;br /&gt;
*month&lt;br /&gt;
*year&lt;br /&gt;
*season&lt;br /&gt;
&lt;br /&gt;
9) timing_combined - This table combines all timing information we have and appends tables 4, 7 and 8.&lt;br /&gt;
10) cohortcompanies_wtiming - merges data in tables cohortcompany and timing_combined&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Grace's Code==&lt;br /&gt;
===format_timing.py===&lt;br /&gt;
  E:/McNair/Projects/Accelerators/Summer 2018/format_timing.py&lt;br /&gt;
&lt;br /&gt;
Input: a txt file with each accelerator mapped to multiple companies (in a single cell separated by columns or in separate rows)&lt;br /&gt;
&lt;br /&gt;
Output: txt file with companies mapped to accelerators with all the other information in the original file&lt;br /&gt;
&lt;br /&gt;
===prioritycodecategory.py===&lt;br /&gt;
&lt;br /&gt;
===codecategory.py===&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Seed_Accelerator_Data_Assembly&amp;diff=23914</id>
		<title>Seed Accelerator Data Assembly</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Seed_Accelerator_Data_Assembly&amp;diff=23914"/>
		<updated>2018-08-03T21:10:45Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: /* format_timing.py */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Useful pages==&lt;br /&gt;
&lt;br /&gt;
*[[U.S. Seed Accelerators]]&lt;br /&gt;
**[[Accelerator Seed List (Data)]]&lt;br /&gt;
*[[Accelerator Demo Day]]&lt;br /&gt;
**[[Mechanical Turk (Tool)]]&lt;br /&gt;
*[[Crunchbase Accelerator Founders]]&lt;br /&gt;
**[[Crunchbase Data#Collecting Company Information]]&lt;br /&gt;
**[[Merging Existing Data with Crunchbase]]&lt;br /&gt;
*[[Whois Parser]]&lt;br /&gt;
*[[URL Finder (Tool)]]&lt;br /&gt;
*[[Industry Classifier]]&lt;br /&gt;
*[[Industry classifier yang]]&lt;br /&gt;
*[[VentureXpert Data]]&lt;br /&gt;
&lt;br /&gt;
Please add (or subtract) other relevant (or irrelevant) pages!&lt;br /&gt;
&lt;br /&gt;
==Database specification==&lt;br /&gt;
&lt;br /&gt;
===Preamble===&lt;br /&gt;
&lt;br /&gt;
We need to get the data into approximately 3NF to prevent errors and make it more useable.&lt;br /&gt;
&lt;br /&gt;
Inputs:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\The File To Rule Them All.xlsx&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit#gid=0&lt;br /&gt;
&lt;br /&gt;
Recent work on the above from Connor:&lt;br /&gt;
*made the Demo Day Timing Google sheet as clean as possible (fixed dates, removed duplicates, created season column)&lt;br /&gt;
*recoded the employee count&lt;br /&gt;
*normalized the investment amount&lt;br /&gt;
&lt;br /&gt;
Connor's next to do: &lt;br /&gt;
*Founders Experience: code job title&lt;br /&gt;
*Founders Education: remove unknowns, code degree and code major&lt;br /&gt;
*Fix multiple campus and cohorts (see below). &lt;br /&gt;
&lt;br /&gt;
This is documented on [[U.S._Seed_Accelerators#Update_for_Hira]].&lt;br /&gt;
&lt;br /&gt;
The current database work is in '''vcdb2'''. The code to build the relevant tables in vcdb2 is in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\LoadAcceleratorDataV2.sql&lt;br /&gt;
 &lt;br /&gt;
Note that AddCBData.sql can be run on '''crunchbase2''' to get the relevant crunchbase data for import into vcdb2 (or elsewhere).&lt;br /&gt;
&lt;br /&gt;
However, we should build a new database for this project:&lt;br /&gt;
 createdb accelerator&lt;br /&gt;
&lt;br /&gt;
As well as a new folder (Z:\accelerator) for the data.&lt;br /&gt;
&lt;br /&gt;
The work in vcdb2 is essentially to build the data that goes into the File To Rule Them All.xlsx. We should start from there!&lt;br /&gt;
&lt;br /&gt;
===Suggested Spec===&lt;br /&gt;
&lt;br /&gt;
We need to address the issue with multiple campus and cohorts. This will require loading and manipulation of the data in SQL (Hira) as well as some manual fixes to the data (Connor).&lt;br /&gt;
&lt;br /&gt;
Accelerator Table (fieldname	colname/DISCARD, from sheet &amp;quot;Accelerators Final&amp;quot;)&lt;br /&gt;
*acceleratorname (Primary Key)	Accelerators&lt;br /&gt;
*url	homepage_url&lt;br /&gt;
*cohorturl	cohort page URL&lt;br /&gt;
*cohortlisting	Break out cohorts on the website? (Y/N)&lt;br /&gt;
*type	Type&lt;br /&gt;
*alive	Alive&lt;br /&gt;
*typenote	Type notes&lt;br /&gt;
*weeks	Weeks&lt;br /&gt;
*durationnotes	Duration Notes&lt;br /&gt;
*city	city	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*state	state/region	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*foundingdate	Creation Date&lt;br /&gt;
*terms	Terms of Joining Accelerator (non-equity pay/equity/free)&lt;br /&gt;
*equity	Equity?&lt;br /&gt;
*equityamountdesc	Equity Amount&lt;br /&gt;
*equityamount	Equity Amount Normalized (only 1s, range is averaged)&lt;br /&gt;
*investmentamountdesc	Investment Amount &lt;br /&gt;
*investmentamount	Investment Amount Normalized (Midpoint) &lt;br /&gt;
*investmentnotes	Investment Notes&lt;br /&gt;
*discard industry	DISCARD&lt;br /&gt;
*industry	Industry	&lt;br /&gt;
*specgenindu	Specified/General&lt;br /&gt;
*address	Address	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*subtype	DISCARD&lt;br /&gt;
*nonprofit	nonprofit?&lt;br /&gt;
*studentfocus	designed for students (Y/N)&lt;br /&gt;
*multicampus	Multiple campuses? (Y/N)&lt;br /&gt;
*software tech (Y/N)	DISCARD&lt;br /&gt;
*stagepref	What stage do they look for in cohort companies (SEL)&lt;br /&gt;
*cbconame	Name in Crunchbase&lt;br /&gt;
*uuid	UUID&lt;br /&gt;
&lt;br /&gt;
Note that we will assume that duration and terms are common across all cohorts run by the same accelerator. We might need to revisit this assumption.&lt;br /&gt;
&lt;br /&gt;
CohortCompany Table (fieldname	colname/DISCARD, from sheet &amp;quot;Cohorts Final&amp;quot;)&lt;br /&gt;
*conamestd (primary key)&lt;br /&gt;
*coname&lt;br /&gt;
*conameorg&lt;br /&gt;
*colocation&lt;br /&gt;
*city&lt;br /&gt;
*state_code&lt;br /&gt;
*country_code&lt;br /&gt;
*address&lt;br /&gt;
*codescription&lt;br /&gt;
*short_desc&lt;br /&gt;
*long_desc&lt;br /&gt;
*cosectors&lt;br /&gt;
*costatus&lt;br /&gt;
*cofundingstage&lt;br /&gt;
*courl&lt;br /&gt;
*uuid&lt;br /&gt;
*category_list&lt;br /&gt;
*category_group_list&lt;br /&gt;
*founded_on&lt;br /&gt;
*employee_count&lt;br /&gt;
*emp_count_scale&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*gotvc&lt;br /&gt;
&lt;br /&gt;
We then need to deduplicate this table and make sure that conamestd is a valid primary key.&lt;br /&gt;
&lt;br /&gt;
Note that we won't be taking the following from &amp;quot;Cohorts Final&amp;quot; for this table:&lt;br /&gt;
*year&lt;br /&gt;
*accelerator&lt;br /&gt;
*cohort&lt;br /&gt;
*quarter&lt;br /&gt;
*acclocation&lt;br /&gt;
*accperks	DISCARD&lt;br /&gt;
*cofounder	DISCARD&lt;br /&gt;
&lt;br /&gt;
CohortParticipation&lt;br /&gt;
*Cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
*accelerator	(foreign key)&lt;br /&gt;
*conamestd (foreign key)&lt;br /&gt;
&lt;br /&gt;
===Going further===&lt;br /&gt;
&lt;br /&gt;
We should likely rebuild the cohort variable to make it a sole &amp;quot;primary key&amp;quot; for a cohort. This would mean turning each entry of cohort into something unique like: TechStars Boulder Fall 2017 or 1440 Cohort 1, so that this key could look up the year of the cohort, its quarter, etc. We could then break CohortParticipation into two tables:&lt;br /&gt;
&lt;br /&gt;
CohortParticipation&lt;br /&gt;
*Cohort&lt;br /&gt;
*accelerator	(foreign key)&lt;br /&gt;
*conamestd (foreign key)&lt;br /&gt;
&lt;br /&gt;
Cohort&lt;br /&gt;
*Cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
&lt;br /&gt;
We could then add campus and a separate table for campuses:&lt;br /&gt;
&lt;br /&gt;
Campus Table&lt;br /&gt;
*CampusName (e.g., Techstars Boulder)&lt;br /&gt;
*Accelerator (foreign key)&lt;br /&gt;
*Address&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Zip&lt;br /&gt;
*Description &lt;br /&gt;
&lt;br /&gt;
===Founders Information===&lt;br /&gt;
&lt;br /&gt;
The three founder sheets turn into three tables nicely. We don't need to renormalize them for now, just fix up their variables and do some matching on employer.&lt;br /&gt;
&lt;br /&gt;
Founders:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*Current Job&lt;br /&gt;
*Current Location&lt;br /&gt;
&lt;br /&gt;
FoundersExperience:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*Employer&lt;br /&gt;
*Job Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location&lt;br /&gt;
*Extra Description&lt;br /&gt;
&lt;br /&gt;
FoundersEducation:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
==To do/For consideration==&lt;br /&gt;
&lt;br /&gt;
Minh:&lt;br /&gt;
*Create a format for collecting timing data&lt;br /&gt;
*Put the timing job on Turk&lt;br /&gt;
&lt;br /&gt;
Connor:&lt;br /&gt;
*Try for the missing timings that we really need (new process)&lt;br /&gt;
*Col W should be headquarter address&lt;br /&gt;
*What stage-  Clean up?&lt;br /&gt;
&lt;br /&gt;
Maxine: &lt;br /&gt;
*Build the google URL finder&lt;br /&gt;
*Industry classification from description&lt;br /&gt;
&lt;br /&gt;
Grace:&lt;br /&gt;
*Process and Join in new timing data - new data located in Z:/accelerator/Formatted Timing Info.txt&lt;br /&gt;
*Make a category group to minorcode lookup - in E:/McNair/Projects/Accelerators/Summer 2018/Cohorts Final - minor code priority Grace.xlsx&lt;br /&gt;
*Run WHOIS crawler on all valid URLs (not facebook pages, etc.) - Maxine did this&lt;br /&gt;
*Founders Experience: Match Employers to VC funds/firms, VC backed startups (requires data from Augi) - added 2 columns to The File to Rule Them All (VC and VC start up)&lt;br /&gt;
&lt;br /&gt;
==Accelerator Data Assembly Progress (Hira) == &lt;br /&gt;
&lt;br /&gt;
*All data files are in Z:/accelerator&lt;br /&gt;
*The SQL file that loads all data is: LoadAccData.sql. It is located in E:\McNair\Projects\Accelerators\Summer 2018.&lt;br /&gt;
&lt;br /&gt;
=== Data assembly details ===&lt;br /&gt;
&lt;br /&gt;
The SQL file LoadAccData.sql currently loads data on Cohorts final and Founders from:&lt;br /&gt;
  E:\McNair\Projects\Accelerators\Summer 2018\The File To Rule Them All.xlsx&lt;br /&gt;
&lt;br /&gt;
It creates the following tables:&lt;br /&gt;
&lt;br /&gt;
1) cohortsfinal - source file:  Cohorts Final sheet in &amp;quot;The File to Rule Them All&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
2) CohortCompany - this uses data in cohortsfinal and creates a table with the following:&lt;br /&gt;
*conamestd&lt;br /&gt;
*conameorg&lt;br /&gt;
*colocation&lt;br /&gt;
*city&lt;br /&gt;
*state_code&lt;br /&gt;
*country_code&lt;br /&gt;
*address&lt;br /&gt;
*codescription&lt;br /&gt;
*short_desc&lt;br /&gt;
*long_desc&lt;br /&gt;
*cosectors&lt;br /&gt;
*costatus&lt;br /&gt;
*cofundingstage&lt;br /&gt;
*courl&lt;br /&gt;
*uuid&lt;br /&gt;
*category_list&lt;br /&gt;
*category_group_list&lt;br /&gt;
*founded_on&lt;br /&gt;
*employee_count&lt;br /&gt;
*emp_count_scale&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*gotvc&lt;br /&gt;
&lt;br /&gt;
3) CohortParticipation - uses table cohortsfinal&lt;br /&gt;
*cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
*accelerator&lt;br /&gt;
*conamestd&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
4) timing_final - This table is based on the most updated information on timing compiled in source file:  Z:/accelerator/Formatted Timing Info.txt (by Grace). It includes:&lt;br /&gt;
*coname&lt;br /&gt;
*acceleratorname&lt;br /&gt;
*keyword&lt;br /&gt;
*url&lt;br /&gt;
*webpage &lt;br /&gt;
*predicted &lt;br /&gt;
*gooddata&lt;br /&gt;
*page_details &lt;br /&gt;
*full_date &lt;br /&gt;
*month&lt;br /&gt;
*year&lt;br /&gt;
*cohort_name&lt;br /&gt;
*notes&lt;br /&gt;
*prog_duration_wks&lt;br /&gt;
*actual_date &lt;br /&gt;
*actual_month&lt;br /&gt;
*actual_year &lt;br /&gt;
*season &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
5) Founders - source file: &amp;quot;The File to Rule Them All - Founders main sheet&amp;quot;&lt;br /&gt;
*Accelerator &lt;br /&gt;
*First_Name&lt;br /&gt;
*Last_Name&lt;br /&gt;
*Full_Name&lt;br /&gt;
*Current_Job &lt;br /&gt;
*Current_Location&lt;br /&gt;
&lt;br /&gt;
6) founders_experience - source file: &amp;quot;The File to Rule Them All - Founders experience sheet&amp;quot;&lt;br /&gt;
*Accelerator &lt;br /&gt;
*First_Name &lt;br /&gt;
*Last_Name&lt;br /&gt;
*Full_Name&lt;br /&gt;
*Employer &lt;br /&gt;
*VC &lt;br /&gt;
*VC_backed_startup &lt;br /&gt;
*OLD_Job_Title&lt;br /&gt;
*NEW_Job_Title &lt;br /&gt;
*Dates_Employed&lt;br /&gt;
*Time_Employed&lt;br /&gt;
*Location &lt;br /&gt;
*Extra_Description&lt;br /&gt;
&lt;br /&gt;
7) additional_timing_info - source file: &amp;quot;merging_work.xlsx&amp;quot; located in: E:\Projects\McNair\Seed DB&lt;br /&gt;
8) additional_timing_info2 - source file: &amp;quot;formatted timing info2.txt&amp;quot; located in E:\Projects\McNair\Accelerators\Summer 2018. This was collected through MTurk. &lt;br /&gt;
Tables 7 and 8 include columns:&lt;br /&gt;
*coname&lt;br /&gt;
*acceleratorname&lt;br /&gt;
*cohort_name&lt;br /&gt;
*date&lt;br /&gt;
*month&lt;br /&gt;
*year&lt;br /&gt;
*season&lt;br /&gt;
&lt;br /&gt;
9) timing_combined - This table combines all timing information we have and appends tables 4, 7 and 8.&lt;br /&gt;
10) cohortcompanies_wtiming - merges data in tables cohortcompany and timing_combined&lt;br /&gt;
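&lt;br /&gt;
Steps 9 and 10 above can be sketched as follows. This is an illustration against an in-memory SQLite database, not the contents of LoadAccData.sql: the reduced column set, the added source column, and the join on coname = conamestd are all assumptions.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

def combine_timing(conn):
    """Append the three timing tables, then merge them onto the company table."""
    conn.executescript("""
    -- Step 9: stack the sources; 'source' (an assumed extra column) records origin.
    CREATE TABLE timing_combined AS
        SELECT coname, acceleratorname, cohort_name, year, 'timing_final' AS source
          FROM timing_final
        UNION ALL
        SELECT coname, acceleratorname, cohort_name, year, 'additional_timing_info'
          FROM additional_timing_info
        UNION ALL
        SELECT coname, acceleratorname, cohort_name, year, 'additional_timing_info2'
          FROM additional_timing_info2;
    -- Step 10: keep every company; timing columns stay NULL when nothing matches.
    CREATE TABLE cohortcompanies_wtiming AS
        SELECT c.conamestd, t.acceleratorname, t.cohort_name, t.year, t.source
          FROM cohortcompany c
          LEFT JOIN timing_combined t ON t.coname = c.conamestd;
    """)

def demo():
    """Build tiny made-up input tables and run the combine step on them."""
    conn = sqlite3.connect(":memory:")
    for name in ("timing_final", "additional_timing_info", "additional_timing_info2"):
        conn.execute(f"CREATE TABLE {name} "
                     "(coname TEXT, acceleratorname TEXT, cohort_name TEXT, year INTEGER)")
    conn.execute("CREATE TABLE cohortcompany (conamestd TEXT PRIMARY KEY)")
    conn.execute("INSERT INTO timing_final VALUES ('Alpha', 'Acc A', 'Fall 2017', 2017)")
    conn.execute("INSERT INTO additional_timing_info2 VALUES ('Beta', 'Acc B', 'Spring 2018', 2018)")
    conn.executemany("INSERT INTO cohortcompany VALUES (?)",
                     [("Alpha",), ("Beta",), ("Gamma",)])
    combine_timing(conn)
    return conn
```

A left join is used so that companies with no timing information still appear in the merged table and can be counted as missing.&lt;br /&gt;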
&lt;br /&gt;
&lt;br /&gt;
==Grace's Code==&lt;br /&gt;
===format_timing.py===&lt;br /&gt;
  E:/McNair/Projects/Accelerators/Summer 2018/format_timing.py&lt;br /&gt;
&lt;br /&gt;
Input: a txt file with each accelerator mapped to multiple companies (in a single cell separated by columns or in separate rows)&lt;br /&gt;
Output: a txt file with each company mapped to its accelerator, keeping all the other information from the original file&lt;br /&gt;
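&lt;br /&gt;
The reshaping described above can be sketched roughly as below. This is not the code in format_timing.py: the tab-delimited layout, the companies and coname field names, and the comma separator inside a cell are all assumptions made for the example.&lt;br /&gt;
&lt;br /&gt;
```python
import csv
import io

def explode_companies(rows, company_field="companies", sep=","):
    """Yield one output row per company, copying every other field unchanged."""
    for row in rows:
        for name in row[company_field].split(sep):
            name = name.strip()
            if name:  # skip empty pieces left by trailing separators
                out = {k: v for k, v in row.items() if k != company_field}
                out["coname"] = name
                yield out

def reformat(text):
    """Parse tab-delimited text with a header row and return the exploded rows."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(explode_companies(reader))
```

Rows that already hold a single company pass through unchanged, so the same function covers both input layouts.&lt;br /&gt;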
&lt;br /&gt;
===prioritycodecategory.py===&lt;br /&gt;
&lt;br /&gt;
===codecategory.py===&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Seed_Accelerator_Data_Assembly&amp;diff=23911</id>
		<title>Seed Accelerator Data Assembly</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Seed_Accelerator_Data_Assembly&amp;diff=23911"/>
		<updated>2018-08-03T21:06:47Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Useful pages==&lt;br /&gt;
&lt;br /&gt;
*[[U.S. Seed Accelerators]]&lt;br /&gt;
**[[Accelerator Seed List (Data)]]&lt;br /&gt;
*[[Accelerator Demo Day]]&lt;br /&gt;
**[[Mechanical Turk (Tool)]]&lt;br /&gt;
*[[Crunchbase Accelerator Founders]]&lt;br /&gt;
**[[Crunchbase Data#Collecting Company Information]]&lt;br /&gt;
**[[Merging Existing Data with Crunchbase]]&lt;br /&gt;
*[[Whois Parser]]&lt;br /&gt;
*[[URL Finder (Tool)]]&lt;br /&gt;
*[[Industry Classifier]]&lt;br /&gt;
*[[Industry classifier yang]]&lt;br /&gt;
*[[VentureXpert Data]]&lt;br /&gt;
&lt;br /&gt;
Please add (or subtract) other relevant (or irrelevant) pages!&lt;br /&gt;
&lt;br /&gt;
==Database specification==&lt;br /&gt;
&lt;br /&gt;
===Preamble===&lt;br /&gt;
&lt;br /&gt;
We need to get the data into approximately 3NF to prevent errors and make it more useable.&lt;br /&gt;
&lt;br /&gt;
Inputs:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\The File To Rule Them All.xlsx&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit#gid=0&lt;br /&gt;
&lt;br /&gt;
Recent work on the above from Connor:&lt;br /&gt;
*made the Demo Day Timing Google sheet as clean as possible (fixed dates, removed duplicates, created season column)&lt;br /&gt;
*recoded the employee count&lt;br /&gt;
*normalized the investment amount&lt;br /&gt;
&lt;br /&gt;
Connor's next to do: &lt;br /&gt;
*Founders Experience: code job title&lt;br /&gt;
*Founders Education: remove unknowns, code degree and code major&lt;br /&gt;
*Fix multiple campuses and cohorts (see below). &lt;br /&gt;
&lt;br /&gt;
This is documented on [[U.S._Seed_Accelerators#Update_for_Hira]].&lt;br /&gt;
&lt;br /&gt;
The current database work is in '''vcdb2'''. The code to build the relevant tables in vcdb2 is in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\LoadAcceleratorDataV2.sql&lt;br /&gt;
 &lt;br /&gt;
Note that AddCBData.sql can be run on '''crunchbase2''' to get the relevant crunchbase data for import into vcdb2 (or elsewhere).&lt;br /&gt;
&lt;br /&gt;
However, we should build a new database for this project:&lt;br /&gt;
 createdb accelerator&lt;br /&gt;
&lt;br /&gt;
As well as a new folder (Z:\accelerator) for the data.&lt;br /&gt;
&lt;br /&gt;
The work in vcdb2 is essentially to build the data that goes into the File To Rule Them All.xlsx. We should start from there!&lt;br /&gt;
&lt;br /&gt;
===Suggested Spec===&lt;br /&gt;
&lt;br /&gt;
We need to address the issue with multiple campuses and cohorts. This will require loading and manipulation of the data in SQL (Hira) as well as some manual fixes to the data (Connor).&lt;br /&gt;
&lt;br /&gt;
Accelerator Table (fieldname	colname/DISCARD, from sheet &amp;quot;Accelerators Final&amp;quot;)&lt;br /&gt;
*acceleratorname (Primary Key)	Accelerators&lt;br /&gt;
*url	homepage_url&lt;br /&gt;
*cohorturl	cohort page URL&lt;br /&gt;
*cohortlisting	Break out cohorts on the website? (Y/N)&lt;br /&gt;
*type	Type&lt;br /&gt;
*alive	Alive&lt;br /&gt;
*typenote	Type notes&lt;br /&gt;
*weeks	Weeks&lt;br /&gt;
*durationnotes	Duration Notes&lt;br /&gt;
*city	city	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*state	state/region	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*foundingdate	Creation Date&lt;br /&gt;
*terms	Terms of Joining Accelerator (non-equity pay/equity/free)&lt;br /&gt;
*equity	Equity?&lt;br /&gt;
*equityamountdesc	Equity Amount&lt;br /&gt;
*equityamount	Equity Amount Normalized (only 1s, range is averaged)&lt;br /&gt;
*investmentamountdesc	Investment Amount &lt;br /&gt;
*investmentamount	Investment Amount Normalized (Midpoint) &lt;br /&gt;
*investmentnotes	Investment Notes&lt;br /&gt;
*discard industry	DISCARD&lt;br /&gt;
*industry	Industry	&lt;br /&gt;
*specgenindu	Specified/General&lt;br /&gt;
*address	Address	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*subtype	DISCARD&lt;br /&gt;
*nonprofit	nonprofit?&lt;br /&gt;
*studentfocus	designed for students (Y/N)&lt;br /&gt;
*multicampus	Multiple campuses? (Y/N)&lt;br /&gt;
*software tech (Y/N)	DISCARD&lt;br /&gt;
*stagepref	What stage do they look for in cohort companies (SEL)&lt;br /&gt;
*cbconame	Name in Crunchbase&lt;br /&gt;
*uuid	UUID&lt;br /&gt;
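&lt;br /&gt;
As a rough illustration of the two normalized fields above (equityamount, where a range is averaged, and investmentamount, a midpoint), the sketch below takes the midpoint of whatever figures appear in a free-text amount. The function name and the sample strings are hypothetical, and this is only one possible reading of the rule:&lt;br /&gt;
&lt;br /&gt;
```python
import re

def midpoint(text):
    """Return the midpoint of the numbers found in text, or None if there are none."""
    # Match digits with optional thousands separators and an optional decimal part.
    found = re.findall(r"\d[\d,]*(?:\.\d+)?", text)
    numbers = [float(n.replace(",", "")) for n in found]
    if not numbers:
        return None
    # A single value is its own midpoint; a range averages its two ends.
    return (min(numbers) + max(numbers)) / 2
```

Suffixes such as k or m are not handled here, so entries written that way would still need manual recoding.&lt;br /&gt;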
&lt;br /&gt;
Note that we will assume that duration and terms are common across all cohorts run by the same accelerator. We might need to revisit this assumption.&lt;br /&gt;
&lt;br /&gt;
CohortCompany Table (fieldname	colname/DISCARD, from sheet &amp;quot;Cohorts Final&amp;quot;)&lt;br /&gt;
*conamestd (primary key)&lt;br /&gt;
*coname&lt;br /&gt;
*conameorg&lt;br /&gt;
*colocation&lt;br /&gt;
*city&lt;br /&gt;
*state_code&lt;br /&gt;
*country_code&lt;br /&gt;
*address&lt;br /&gt;
*codescription&lt;br /&gt;
*short_desc&lt;br /&gt;
*long_desc&lt;br /&gt;
*cosectors&lt;br /&gt;
*costatus&lt;br /&gt;
*cofundingstage&lt;br /&gt;
*courl&lt;br /&gt;
*uuid&lt;br /&gt;
*category_list&lt;br /&gt;
*category_group_list&lt;br /&gt;
*founded_on&lt;br /&gt;
*employee_count&lt;br /&gt;
*emp_count_scale&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*gotvc&lt;br /&gt;
&lt;br /&gt;
We then need to deduplicate this table and make sure that conamestd is a valid primary key.&lt;br /&gt;
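&lt;br /&gt;
One way to check whether conamestd is a usable primary key is sketched below, assuming the sheet has been read into a list of dictionaries. The helper names and the keep-the-first-row rule are assumptions, and real duplicates may need manual review instead:&lt;br /&gt;
&lt;br /&gt;
```python
from collections import Counter

def duplicate_keys(rows, key="conamestd"):
    """Return key values that are blank or occur more than once."""
    counts = Counter((row.get(key) or "").strip() for row in rows)
    return sorted(k for k, n in counts.items() if n > 1 or k == "")

def dedupe(rows, key="conamestd"):
    """Keep the first row seen for each non-blank key (an assumed tie-break)."""
    seen, kept = set(), []
    for row in rows:
        k = (row.get(key) or "").strip()
        if k and k not in seen:
            seen.add(k)
            kept.append(row)
    return kept
```

An empty result from duplicate_keys after deduplication means every remaining row has a distinct, non-blank conamestd.&lt;br /&gt;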
&lt;br /&gt;
Note that we won't be taking the following from &amp;quot;Cohorts Final&amp;quot; for this table:&lt;br /&gt;
*year&lt;br /&gt;
*accelerator&lt;br /&gt;
*cohort&lt;br /&gt;
*quarter&lt;br /&gt;
*acclocation&lt;br /&gt;
*accperks	DISCARD&lt;br /&gt;
*cofounder	DISCARD&lt;br /&gt;
&lt;br /&gt;
CohortParticipation&lt;br /&gt;
*Cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
*accelerator	(foreign key)&lt;br /&gt;
*conamestd (foreign key)&lt;br /&gt;
&lt;br /&gt;
===Going further===&lt;br /&gt;
&lt;br /&gt;
We should likely rebuild the cohort variable to make it a sole &amp;quot;primary key&amp;quot; for a cohort. This would mean turning each entry of cohort into something unique like: TechStars Boulder Fall 2017 or 1440 Cohort 1, so that this key could look up the year of the cohort, its quarter, etc. We could then break CohortParticipation into two tables:&lt;br /&gt;
&lt;br /&gt;
CohortParticipation&lt;br /&gt;
*Cohort&lt;br /&gt;
*accelerator	(foreign key)&lt;br /&gt;
*conamestd (foreign key)&lt;br /&gt;
&lt;br /&gt;
Cohort&lt;br /&gt;
*Cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
&lt;br /&gt;
We could then add campus and a separate table for campuses:&lt;br /&gt;
&lt;br /&gt;
Campus Table&lt;br /&gt;
*CampusName (e.g., Techstars Boulder)&lt;br /&gt;
*Accelerator (foreign key)&lt;br /&gt;
*Address&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Zip&lt;br /&gt;
*Description &lt;br /&gt;
&lt;br /&gt;
===Founders Information===&lt;br /&gt;
&lt;br /&gt;
The three founder sheets turn into three tables nicely. We don't need to renormalize them for now, just fix up their variables and do some matching on employer.&lt;br /&gt;
&lt;br /&gt;
Founders:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*Current Job&lt;br /&gt;
*Current Location&lt;br /&gt;
&lt;br /&gt;
FoundersExperience:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*Employer&lt;br /&gt;
*Job Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location&lt;br /&gt;
*Extra Description&lt;br /&gt;
&lt;br /&gt;
FoundersEducation:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
==To do/For consideration==&lt;br /&gt;
&lt;br /&gt;
Minh:&lt;br /&gt;
*Create a format for collecting timing data&lt;br /&gt;
*Put the timing job on Turk&lt;br /&gt;
&lt;br /&gt;
Connor:&lt;br /&gt;
*Try for the missing timings that we really need (new process)&lt;br /&gt;
*Col W should be headquarter address&lt;br /&gt;
*What stage-  Clean up?&lt;br /&gt;
&lt;br /&gt;
Maxine: &lt;br /&gt;
*Build the google URL finder&lt;br /&gt;
*Industry classification from description&lt;br /&gt;
&lt;br /&gt;
Grace:&lt;br /&gt;
*Process and Join in new timing data - new data located in Z:/accelerator/Formatted Timing Info.txt&lt;br /&gt;
*Make a category group to minorcode lookup - in E:/McNair/Projects/Accelerators/Summer 2018/Cohorts Final - minor code priority Grace.xlsx&lt;br /&gt;
*Run WHOIS crawler on all valid URLs (not facebook pages, etc.) - Maxine did this&lt;br /&gt;
*Founders Experience: Match Employers to VC funds/firms, VC backed startups (requires data from Augi) - added 2 columns to The File to Rule Them All (VC and VC start up)&lt;br /&gt;
&lt;br /&gt;
==Accelerator Data Assembly Progress (Hira) == &lt;br /&gt;
&lt;br /&gt;
*All data files are in Z:/accelerator&lt;br /&gt;
*The SQL file that loads all data is: LoadAccData.sql. It is located in E:\McNair\Projects\Accelerators\Summer 2018.&lt;br /&gt;
&lt;br /&gt;
=== Data assembly details ===&lt;br /&gt;
&lt;br /&gt;
The SQL file LoadAccData.sql currently loads data on Cohorts final and Founders from:&lt;br /&gt;
  E:\McNair\Projects\Accelerators\Summer 2018\The File To Rule Them All.xlsx&lt;br /&gt;
&lt;br /&gt;
It creates the following tables:&lt;br /&gt;
&lt;br /&gt;
1) cohortsfinal - source file:  Cohorts Final sheet in &amp;quot;The File to Rule Them All&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
2) CohortCompany - this uses data in cohortsfinal and creates a table with the following:&lt;br /&gt;
*conamestd&lt;br /&gt;
*conameorg&lt;br /&gt;
*colocation&lt;br /&gt;
*city&lt;br /&gt;
*state_code&lt;br /&gt;
*country_code&lt;br /&gt;
*address&lt;br /&gt;
*codescription&lt;br /&gt;
*short_desc&lt;br /&gt;
*long_desc&lt;br /&gt;
*cosectors&lt;br /&gt;
*costatus&lt;br /&gt;
*cofundingstage&lt;br /&gt;
*courl&lt;br /&gt;
*uuid&lt;br /&gt;
*category_list&lt;br /&gt;
*category_group_list&lt;br /&gt;
*founded_on&lt;br /&gt;
*employee_count&lt;br /&gt;
*emp_count_scale&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*gotvc&lt;br /&gt;
&lt;br /&gt;
3) CohortParticipation - uses table cohortsfinal&lt;br /&gt;
*cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
*accelerator&lt;br /&gt;
*conamestd&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
4) timing_final - This table is based on the most updated information on timing compiled in source file:  Z:/accelerator/Formatted Timing Info.txt (by Grace). It includes:&lt;br /&gt;
*coname&lt;br /&gt;
*acceleratorname&lt;br /&gt;
*keyword&lt;br /&gt;
*url&lt;br /&gt;
*webpage &lt;br /&gt;
*predicted &lt;br /&gt;
*gooddata&lt;br /&gt;
*page_details &lt;br /&gt;
*full_date &lt;br /&gt;
*month&lt;br /&gt;
*year&lt;br /&gt;
*cohort_name&lt;br /&gt;
*notes&lt;br /&gt;
*prog_duration_wks&lt;br /&gt;
*actual_date &lt;br /&gt;
*actual_month&lt;br /&gt;
*actual_year &lt;br /&gt;
*season &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
5) Founders - source file: &amp;quot;The File to Rule Them All - Founders main sheet&amp;quot;&lt;br /&gt;
*Accelerator &lt;br /&gt;
*First_Name&lt;br /&gt;
*Last_Name&lt;br /&gt;
*Full_Name&lt;br /&gt;
*Current_Job &lt;br /&gt;
*Current_Location&lt;br /&gt;
&lt;br /&gt;
6) founders_experience - source file: &amp;quot;The File to Rule Them All - Founders experience sheet&amp;quot;&lt;br /&gt;
*Accelerator &lt;br /&gt;
*First_Name &lt;br /&gt;
*Last_Name&lt;br /&gt;
*Full_Name&lt;br /&gt;
*Employer &lt;br /&gt;
*VC &lt;br /&gt;
*VC_backed_startup &lt;br /&gt;
*OLD_Job_Title&lt;br /&gt;
*NEW_Job_Title &lt;br /&gt;
*Dates_Employed&lt;br /&gt;
*Time_Employed&lt;br /&gt;
*Location &lt;br /&gt;
*Extra_Description&lt;br /&gt;
&lt;br /&gt;
7) additional_timing_info - source file: &amp;quot;merging_work.xlsx&amp;quot; located in: E:\Projects\McNair\Seed DB&lt;br /&gt;
8) additional_timing_info2 - source file: &amp;quot;formatted timing info2.txt&amp;quot; located in E:\Projects\McNair\Accelerators\Summer 2018. This was collected through MTurk. &lt;br /&gt;
Tables 7 and 8 include columns:&lt;br /&gt;
*coname&lt;br /&gt;
*acceleratorname&lt;br /&gt;
*cohort_name&lt;br /&gt;
*date&lt;br /&gt;
*month&lt;br /&gt;
*year&lt;br /&gt;
*season&lt;br /&gt;
&lt;br /&gt;
9) timing_combined - This table combines all timing information we have and appends tables 4, 7 and 8.&lt;br /&gt;
10) cohortcompanies_wtiming - merges data in tables cohortcompany and timing_combined&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Grace's Code==&lt;br /&gt;
===format_timing.py===&lt;br /&gt;
&lt;br /&gt;
===prioritycodecategory.py===&lt;br /&gt;
&lt;br /&gt;
===codecategory.py===&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Grace_Tan_(Work_Log)&amp;diff=23903</id>
		<title>Grace Tan (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Grace_Tan_(Work_Log)&amp;diff=23903"/>
		<updated>2018-08-03T20:35:14Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Staff&lt;br /&gt;
|status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] [[Work Logs]] [[Grace Tan (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
2018-08-03: Fixed/debugged minor coding with priority ranking. Helped Connor find timing info for missing companies. Cleaned up wiki pages.&lt;br /&gt;
&lt;br /&gt;
2018-08-02: Redid minor codes with priority ranking.&lt;br /&gt;
&lt;br /&gt;
2018-08-01: Entered the rest of the minor codes and arbitrarily picked the first one for those that had multiple codes attached to them. &lt;br /&gt;
&lt;br /&gt;
2018-07-31: Minor coded cohorts based on contents of category group list in the file to rule them all. See Ed's slack message for key/legend. The rows highlighted in red are the ones that I'm not sure about. I wrote a python script to code most of them - E:\McNair\Projects\Accelerators\Summer 2018\codecategory.py . The coded sheet is a google sheet that I will add onto the wiki page once I make one. &lt;br /&gt;
&lt;br /&gt;
2018-07-30: Matched employers with VC firms, funds, and startups. There were 40 matches with firms and funds, and 4 matches with startups. Coded these into 2 columns in Founder Experience table. Updated all wiki pages.&lt;br /&gt;
&lt;br /&gt;
2018-07-27: Reformatted the timing info data to separate out the companies to look like The File To Rule Them All. It is located in E:/McNair/Projects/Accelerators/Summer 2018/Formatted Timing Info.txt. Looked at the WhoIs Parser but Maxine said she figured it out so she will finish it. I will start with Founders Experience on Monday and I do not understand what minorcode lookup means.&lt;br /&gt;
&lt;br /&gt;
2018-07-26: Finished Demo Day Timing Info data. Talked with Ed and Hira about what to do for the last week. Cleaned up Timing Info data.&lt;br /&gt;
&lt;br /&gt;
2018-07-25: Converted the 608 pdfs to txt files using [[PDF to Text Converter]]. All of them converted to txt files but some txt files are empty or do not contain the content of the paper. I do not know of a way to fix it or clean up the txt files to get only the txt files that are actually academic papers. Worked on Demo Day Timing Info data.&lt;br /&gt;
&lt;br /&gt;
2018-07-24: Realized that some pdfs did not download properly because the link was not to an immediate pdf. Found all pdfs possible and came up with 608 total and 5 that I could not find pdfs for. Ran pdf_to_txt_bulk_PTLR.py on the 608 pdfs. &lt;br /&gt;
&lt;br /&gt;
2018-07-23: Found my missing pdfs. There were 7 that were not able to download through the script for some reason so I manually downloaded 6 of them and I could not find the pdf for 1 of them. Now I have 612 pdfs on file and am ready to start converting to txt tomorrow.&lt;br /&gt;
&lt;br /&gt;
2018-07-20: Reinstalled pdfminer and the test that GitHub provides works so it should be installed correctly. Ran all the pdf urls through pdfdownloader.py but came up short with 602/613 pdfs. I found 8 papers that did not successfully get downloaded but that still leaves 3 mystery papers and I'm not sure which ones they are. Did some sql to try to figure it out but it hasn't worked yet.&lt;br /&gt;
&lt;br /&gt;
2018-07-19: Started on converting pdfs to txt files. I found pdf_to_txt_bulk_PTLR.py in E:/McNair/Software/Google_Scholar_Crawler. I copied this and moved my data to E:/McNair/Software/Patent_Thicket. When I tried to run the program, it was giving me an error when trying to import pdfminer because I originally tried running it in Z, so I installed pdfminer.six and it did not complain with python3.6. It then gave me a different error about a missing package, which Wei said we shouldn't touch, so I moved everything back to E. It now runs but cannot convert any pdfs. I have no idea how to fix this.&lt;br /&gt;
&lt;br /&gt;
2018-07-18: Finished running the rest of the 100 pages. Took quite a long time because google scholar was catching me after 2-5 pages rather than 5-10. It helped to switch between the different wifi networks (Rice Visitor, Rice Owls, eduroam). Altogether this resulted in 958 bibtex files and 613 pdfs from 1000 entries. There might be more entries but I'm not sure where to find them. I saved the data and code onto the rdp by connecting to it from the selenium box. &lt;br /&gt;
&lt;br /&gt;
2018-07-17: Ran google scholar crawler. When google scholar blocks me with a 403 error code, I exit the program and rerun it at the page that it last looked at by clicking on the correct page number before crawling. I finished running through 68/100 pages of google scholar.&lt;br /&gt;
&lt;br /&gt;
2018-07-16: Ran through 10 pages of google scholar first thing without a problem. Tried running through all 100 pages but kept on getting caught. Helped Augi with discrepancies in data and will try google scholar crawler again tomorrow.&lt;br /&gt;
&lt;br /&gt;
2018-07-13: Fixed problem where pdf urls were not saving to txt file. Created another txt file to save urls that are not pdfs. Didn't run into a single recaptcha all morning. Towards the end, it started catching me at the 7th query and forced the program to restart. For some reason, selecting the css element triggered google scholar to find me. I changed the css element tag for the &amp;quot;next&amp;quot; button to the path and I was able to get through the 4th page. It is still not able to click on the actual link but I'm not sure if that's supposed to do anything.&lt;br /&gt;
&lt;br /&gt;
2018-07-12: Figured out how to save BibTeX files to computer. Still had to do recaptcha tests. After giving it some time, I was able to run it completely once but I only got 49 BibTeX files and about half as many pdf links. When I tried to work on it further the recaptcha wasn't loading and gave me the error - &amp;quot;Cannot contact reCAPTCHA. Check your connection and try again.&amp;quot; I ended up moving to the selenium computer and spent the rest of the day converting the code to python3 and messing with regex because for some reason it wasn't matching the text correctly.&lt;br /&gt;
&lt;br /&gt;
2018-07-11: Started on Google Scholar Crawler for Patent Thicket Project. I'm not sure what the problem is. The code seems to work except that Google constantly blocks me with reCaptcha tests. I am also not sure if the crawler is saving any data to txt files and if so, where those files are located.&lt;br /&gt;
&lt;br /&gt;
2018-07-10: Finished LinkedIn Crawler! When no search results are found, the crawler skips the href lookup and instead adds the founder and company name to a txt file. Spent way too much time doing reCaptcha tests and logging out and logging in again because firefox and the wifi were slow. Cleaned up code and put it on the rdp as well as fixed the wiki page - [[Crunchbase Accelerator Founders]]&lt;br /&gt;
&lt;br /&gt;
2018-07-09: Continued working on the LinkedIn Crawler. I figured out how to get the web element of the first search result using path. You can achieve the same result by looking for the css element which probably would have been easier. I then used get_attribute('href') to find the href in the web element of the first result to get the url of the founder. Note that previously, there was another function to click on the name on the screen which would open another window with the profile but I found it easier to just extract the url. Next, I will run the (hopefully) working crawler on all the founder data. Update - ran into an error when there are no search results found.&lt;br /&gt;
&lt;br /&gt;
2018-06-29: Spent a large part of today clicking on road signs and cars to prove to LinkedIn that I am not a robot. Figured out how to find search box with css element instead of xpath. Now trying to get information from search results.&lt;br /&gt;
&lt;br /&gt;
2018-06-28: Tried to figure out the xpath for the search box but came up with no solutions. Halfway through, linkedin discovered that I was a bot so I moved to the selenium computer and used the Rice Visitor wifi. Linkedin still wouldn't let me in so we made another test account (see project page for details). It finally let me in at the end of the day but I set one of the delays to be 3-4 min so that I have time to do the tests that linkedin gives to ensure that you are not a bot. I still have no idea how to find search box xpath.&lt;br /&gt;
&lt;br /&gt;
2018-06-27: Took the dictionary of accelerator to founder UUIDs and formed a table with the UUIDs combined with names of founders, gender, and linkedin_url. The file is in Z:\crunchbase2\FounderAccInfo.txt . Started looking at linkedin crawler documentation and got the crawler to get information from profiles with known urls. It crashed when it tried searching up founder names and accelerators so will work on that tomorrow. &lt;br /&gt;
&lt;br /&gt;
2018-06-26: Ended up finding all founders manually. Then talked to Ed and figured out how to get the founder data off the crunchbase API with a link (see project page). Created a python script that goes through all the API pages for each accelerator API and returns a dictionary of accelerator UUIDs mapped to founder UUIDs. I found 209 founders on the API but 224 manually so we'll look at the discrepancy tomorrow. &lt;br /&gt;
&lt;br /&gt;
2018-06-25: We took the 157 Accelerator UUIDs we found and created a new table called AccAllInfo that includes all the attributes of the accelerator that we want from organizations.csv. Maxine and I then split off into our respective projects. I tried joining people to the companies they are linked to in order to find the founders of each accelerator. I found about 90 matches but there are still a lot of holes since some accelerators have no founders and others have multiple founders. Still unsure of how to fix this.&lt;br /&gt;
&lt;br /&gt;
2018-06-22: Matched Connor's master list of accelerators with organizations.csv based on homepage_url and company_name. Found 90 that matched along with 76 blanks. Then tried matching with homepage_url or company_name and manually found about 30 more that had slight variations in url or name that we should keep. Using ILIKE we found ~25 more company UUIDs that match with accelerators on the list.&lt;br /&gt;
&lt;br /&gt;
2018-06-21: Downloaded all 17 v3.1 csv tables and updated LoadTables.sql to match our data. We did this by manually updating the name and size of the fields. To solve the problem of &amp;quot;&amp;quot; from yesterday, we used regular expressions to change the empty string to nothing (see project page). We then worked with Connor to start extracting the accelerators from the organizations in the Crunchbase data. We found a lot of null matches based on company_name and a few that have the same name but are actually different companies. Maybe try matching with homepage_url tomorrow.&lt;br /&gt;
&lt;br /&gt;
2018-06-20: Learned more SQL. Started working on [[Crunchbase Data]] project with Maxine. Old code contained 22 csv tables but new Crunchbase data only has 17 csv tables. We will be using the new Crunchbase API v3.1 (not v3) with only 17 csv tables as data. We then started updating the old SQL tables to align with the 17 tables we have. We ran into a problem where the data has &amp;quot;&amp;quot; in a date-type field, which SQL rejects. Ed was helping us with this but we have not found a solution yet.&lt;br /&gt;
&lt;br /&gt;
2018-06-19: Set up monitors and continued learning SQL. We were also introduced to our projects. I will be continuing Christy's work on the Google Scholar Crawler as well as working with Maxine to update the Crunchbase data and then use that data to crawl Linkedin to find data on startup founders that go through accelerators.&lt;br /&gt;
&lt;br /&gt;
2018-06-18: Introduced to the wiki, connected to RDP, and learned SQL.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt;&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Grace_Tan_(Work_Log)&amp;diff=23879</id>
		<title>Grace Tan (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Grace_Tan_(Work_Log)&amp;diff=23879"/>
		<updated>2018-08-01T21:27:43Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Staff&lt;br /&gt;
|status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] [[Work Logs]] [[Grace Tan (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
2018-08-01: Entered the rest of the minor codes and arbitrarily picked the first one for those that had multiple codes attached to them. &lt;br /&gt;
&lt;br /&gt;
2018-07-31: Minor coded cohorts based on contents of category group list in the file to rule them all. See Ed's slack message for key/legend. The rows highlighted in red are the ones that I'm not sure about. I wrote a python script to code most of them - E:\McNair\Projects\Accelerators\Summer 2018\codecategory.py . The coded sheet is a google sheet that I will add onto the wiki page once I make one. &lt;br /&gt;
&lt;br /&gt;
2018-07-30: Matched employers with VC firms, funds, and startups. There were 40 matches with firms and funds, and 4 matches with startups. Coded these into 2 columns in Founder Experience table. Updated all wiki pages.&lt;br /&gt;
&lt;br /&gt;
2018-07-27: Reformatted the timing info data to separate out the companies to look like The File To Rule Them All. It is located in E:/McNair/Projects/Accelerators/Summer 2018/Formatted Timing Info.txt. Looked at the WhoIs Parser but Maxine said she figured it out so she will finish it. I will start with Founders Experience on Monday and I do not understand what minorcode lookup means.&lt;br /&gt;
&lt;br /&gt;
2018-07-26: Finished Demo Day Timing Info data. Talked with Ed and Hira about what to do for the last week. Cleaned up Timing Info data.&lt;br /&gt;
&lt;br /&gt;
2018-07-25: Converted the 608 pdfs to txt files using [[PDF to Text Converter]]. All of them converted to txt files but some txt files are empty or do not contain the content of the paper. I do not know of a way to fix it or clean up the txt files to get only the txt files that are actually academic papers. Worked on Demo Day Timing Info data.&lt;br /&gt;
&lt;br /&gt;
2018-07-24: Realized that some pdfs did not download properly because the link was not to an immediate pdf. Found all pdfs possible and came up with 608 total and 5 that I could not find pdfs for. Ran pdf_to_txt_bulk_PTLR.py on the 608 pdfs. &lt;br /&gt;
&lt;br /&gt;
2018-07-23: Found my missing pdfs. There were 7 that were not able to download through the script for some reason so I manually downloaded 6 of them and I could not find the pdf for 1 of them. Now I have 612 pdfs on file and am ready to start converting to txt tomorrow.&lt;br /&gt;
&lt;br /&gt;
2018-07-20: Reinstalled pdfminer and the test that GitHub provides works so it should be installed correctly. Ran all the pdf urls through pdfdownloader.py but came up short with 602/613 pdfs. I found 8 papers that did not successfully get downloaded but that still leaves 3 mystery papers and I'm not sure which ones they are. Did some sql to try to figure it out but hasn't worked yet.&lt;br /&gt;
&lt;br /&gt;
2018-07-19: Started on converting pdfs to txt files. I found pdf_to_txt_bulk_PTLR.py in E:/McNair/Software/Google_Scholar_Crawler. I copied this and moved my data to E:/McNair/Software/Patent_Thicket. When I first tried to run the program in Z, it gave me an error when importing pdfminer, so I installed pdfminer.six and the import no longer complained under python3.6. It then gave me a different error about a missing package, which Wei said we shouldn't touch, so I moved everything back to E. It now runs but cannot convert any pdfs. I have no idea how to fix this.&lt;br /&gt;
&lt;br /&gt;
2018-07-18: Finished running the rest of the 100 pages. Took quite a long time because google scholar was catching me after 2-5 pages rather than 5-10. It helped to switch between the different wifi(rice visitor, rice owls, eduroam). Altogether resulted in 958 bibtex files and 613 pdfs from 1000 entries. There might be more entries but I'm not sure where to find them. I saved the data and code onto the rdp by connecting to it from the selenium box. &lt;br /&gt;
&lt;br /&gt;
2018-07-17: Ran google scholar crawler. When google scholar blocks me with a 403 error code, I exit the program and rerun it at the page that it last looked at by clicking on the correct page number before crawling. I finished running through 68/100 pages of google scholar.&lt;br /&gt;
&lt;br /&gt;
2018-07-16: Ran through 10 pages of google scholar first thing without a problem. Tried running through all 100 pages but kept on getting caught. Helped Augi with discrepancies in data and will try google scholar crawler again tomorrow.&lt;br /&gt;
&lt;br /&gt;
2018-07-13: Fixed problem where pdf urls were not saving to txt file. Created another txt file to save urls that are not pdfs. Didn't run into a single recaptcha all morning. Towards the end, it started catching me at the 7th query and forced the program to restart. For some reason, selecting the css element triggered google scholar to find me. I changed the css element tag for the &amp;quot;next&amp;quot; button to the path and I was able to get through the 4th page. It is still not able to click on the actual link but I'm not sure if that's supposed to do anything.&lt;br /&gt;
&lt;br /&gt;
2018-07-12: Figured out how to save BibTeX files to computer. Still had to do recaptcha tests. After giving it some time, I was able to run it completely once but I only got 49 BibTeX files and about half as many pdf links. When I tried to work on it further the recaptcha wasn't loading and gave me the error - &amp;quot;Cannot contact reCAPTCHA. Check your connection and try again.&amp;quot; I ended up moving to the selenium computer and spent the rest of the day converting the code to python3 and messing with regex because for some reason it wasn't matching the text correctly.&lt;br /&gt;
&lt;br /&gt;
2018-07-11: Started on Google Scholar Crawler for Patent Thicket Project. I'm not sure what the problem is. The code seems to work except that Google constantly blocks me and makes me do reCaptcha tests. I am also not sure if the crawler is saving any data to txt files and if so, where those files are located.&lt;br /&gt;
&lt;br /&gt;
2018-07-10: Finished LinkedIn Crawler! When no search results were found, I skipped looking for the href and instead added the founder and company name to a txt file. Spent way too much time doing reCaptcha tests and logging out and back in because Firefox and the wifi were slow. Cleaned up the code and put it on the rdp, and fixed the wiki page - [[Crunchbase Accelerator Founders]]&lt;br /&gt;
&lt;br /&gt;
2018-07-09: Continued working on the LinkedIn Crawler. I figured out how to get the web element of the first search result using its xpath. The same result can be achieved by looking for the css element, which probably would have been easier. I then used get_attribute('href') on the web element of the first result to get the url of the founder's profile. Note that previously, there was another function to click on the name on the screen, which would open another window with the profile, but I found it easier to just extract the url. Next, I will run the (hopefully) working crawler on all the founder data. Update - ran into an error when no search results are found.&lt;br /&gt;
&lt;br /&gt;
2018-06-29: Spent a large part of today clicking on road signs and cars to prove to LinkedIn that I am not a robot. Figured out how to find search box with css element instead of xpath. Now trying to get information from search results.&lt;br /&gt;
&lt;br /&gt;
2018-06-28: Tried to figure out the xpath for the search box but came up with no solutions. Halfway through, linkedin discovered that I was a bot so I moved to the selenium computer and used the Rice Visitor wifi. Linkedin still wouldn't let me in so we made another test account (see project page for details). It finally let me in at the end of the day but I set one of the delays to be 3-4 min so that I have time to do the tests that linkedin gives to ensure that you are not a bot. I still have no idea how to find search box xpath.&lt;br /&gt;
&lt;br /&gt;
2018-06-27: Took the dictionary of accelerator to founder UUIDs and formed a table with the UUIDs combined with names of founders, gender, and linkedin_url. The file is in Z:\crunchbase2\FounderAccInfo.txt . Started looking at linkedin crawler documentation and got the crawler to get information from profiles with known urls. It crashed when it tried searching up founder names and accelerators so will work on that tomorrow. &lt;br /&gt;
&lt;br /&gt;
2018-06-26: Ended up finding all founders manually. Then talked to Ed and figured out how to get the founder data off the crunchbase API with a link (see project page). Created a python script that goes through all the API pages for each accelerator API and returns a dictionary of accelerator UUIDs mapped to founder UUIDs. I found 209 founders on the API but 224 manually so we'll look at the discrepancy tomorrow. &lt;br /&gt;
&lt;br /&gt;
2018-06-25: We took the 157 Accelerator UUIDs we found and created a new table, AccAllInfo, that includes all the attributes of the accelerators that we want from organizations.csv. Maxine and I then split into our respective projects. I tried joining people to the companies they are linked to in order to find the founders of each accelerator. I found about 90 matches but there are still a lot of gaps since some accelerators have no founders and others have multiple founders. Still unsure of how to fix this.&lt;br /&gt;
&lt;br /&gt;
2018-06-22: Matched Connor's master list of accelerators with organizations.csv based on homepage_url and company_name. Found 90 that matched along with 76 blanks. Then tried matching with homepage_url or company_name and manually found about 30 more that had slight variations in url or name that we should keep. Using ILIKE we found ~25 more company UUIDs that match with accelerators on the list.&lt;br /&gt;
&lt;br /&gt;
2018-06-21: Downloaded all 17 v3.1 csv tables and updated LoadTables.sql to match our data. We did this by manually updating the name and size of the fields. To solve the problem of &amp;quot;&amp;quot; from yesterday, we used regular expressions to change the empty string to nothing (see project page). We then worked with Connor to start extracting the accelerators from the organizations in the Crunchbase data. We found a lot of null matches based on company_name and a few that have the same name but are actually different companies. Maybe try matching with homepage_url tomorrow.&lt;br /&gt;
&lt;br /&gt;
2018-06-20: Learned more SQL. Started working on [[Crunchbase Data]] project with Maxine. Old code contained 22 csv tables but new Crunchbase data only has 17 csv tables. We will be using the new Crunchbase API v3.1 (not v3) with only 17 csv tables as data. We then started updating the old SQL tables to align with the 17 tables we have. We ran into a problem where some date-type fields in the data contain &amp;quot;&amp;quot;, which SQL did not accept. Ed was helping us with this but we have not found a solution yet.&lt;br /&gt;
&lt;br /&gt;
2018-06-19: Set up monitors and continued learning SQL. We were also introduced to our projects. I will be continuing Christy's work on the Google Scholar Crawler as well as working with Maxine to update the Crunchbase data and then use that data to crawl Linkedin to find data on startup founders that go through accelerators.&lt;br /&gt;
&lt;br /&gt;
2018-06-18: Introduced to the wiki, connected to RDP, and learned SQL.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt;&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Grace_Tan_(Work_Log)&amp;diff=23871</id>
		<title>Grace Tan (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Grace_Tan_(Work_Log)&amp;diff=23871"/>
		<updated>2018-07-31T21:54:01Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Staff&lt;br /&gt;
|status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] [[Work Logs]] [[Grace Tan (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
2018-07-31: Minor coded cohorts based on the contents of the category group list in The File To Rule Them All. See Ed's Slack message for the key/legend. The rows highlighted in red are the ones that I'm not sure about. I wrote a Python script to code most of them - E:\McNair\Projects\Accelerators\Summer 2018\codecategory.py . The coded sheet is a Google Sheet that I will add to the wiki page once I make one.&lt;br /&gt;
&lt;br /&gt;
2018-07-30: Matched employers with VC firms, funds, and startups. There were 40 matches with firms and funds, and 4 matches with startups. Coded these into 2 columns in Founder Experience table. Updated all wiki pages.&lt;br /&gt;
&lt;br /&gt;
2018-07-27: Reformatted the timing info data to separate out the companies to look like The File To Rule Them All. It is located in E:/McNair/Projects/Accelerators/Summer 2018/Formatted Timing Info.txt. Looked at the WhoIs Parser but Maxine said she figured it out so she will finish it. I will start with Founders Experience on Monday and I do not understand what minorcode lookup means.&lt;br /&gt;
&lt;br /&gt;
2018-07-26: Finished Demo Day Timing Info data. Talked with Ed and Hira about what to do for the last week. Cleaned up Timing Info data.&lt;br /&gt;
&lt;br /&gt;
2018-07-25: Converted the 608 pdfs to txt files using [[PDF to Text Converter]]. All of them converted to txt files but some txt files are empty or do not contain the content of the paper. I do not know of a way to fix it or clean up the txt files to get only the txt files that are actually academic papers. Worked on Demo Day Timing Info data.&lt;br /&gt;
&lt;br /&gt;
2018-07-24: Realized that some pdfs did not download properly because the link was not to an immediate pdf. Found all pdfs possible and came up with 608 total and 5 that I could not find pdfs for. Ran pdf_to_txt_bulk_PTLR.py on the 608 pdfs. &lt;br /&gt;
&lt;br /&gt;
2018-07-23: Found my missing pdfs. There were 7 that were not able to download through the script for some reason so I manually downloaded 6 of them and I could not find the pdf for 1 of them. Now I have 612 pdfs on file and am ready to start converting to txt tomorrow.&lt;br /&gt;
&lt;br /&gt;
2018-07-20: Reinstalled pdfminer and the test that GitHub provides works so it should be installed correctly. Ran all the pdf urls through pdfdownloader.py but came up short with 602/613 pdfs. I found 8 papers that did not successfully get downloaded but that still leaves 3 mystery papers and I'm not sure which ones they are. Did some sql to try to figure it out but hasn't worked yet.&lt;br /&gt;
&lt;br /&gt;
2018-07-19: Started on converting pdfs to txt files. I found pdf_to_txt_bulk_PTLR.py in E:/McNair/Software/Google_Scholar_Crawler. I copied this and moved my data to E:/McNair/Software/Patent_Thicket. When I first tried to run the program in Z, it gave me an error when importing pdfminer, so I installed pdfminer.six and the import no longer complained under python3.6. It then gave me a different error about a missing package, which Wei said we shouldn't touch, so I moved everything back to E. It now runs but cannot convert any pdfs. I have no idea how to fix this.&lt;br /&gt;
&lt;br /&gt;
2018-07-18: Finished running the rest of the 100 pages. Took quite a long time because google scholar was catching me after 2-5 pages rather than 5-10. It helped to switch between the different wifi(rice visitor, rice owls, eduroam). Altogether resulted in 958 bibtex files and 613 pdfs from 1000 entries. There might be more entries but I'm not sure where to find them. I saved the data and code onto the rdp by connecting to it from the selenium box. &lt;br /&gt;
&lt;br /&gt;
2018-07-17: Ran google scholar crawler. When google scholar blocks me with a 403 error code, I exit the program and rerun it at the page that it last looked at by clicking on the correct page number before crawling. I finished running through 68/100 pages of google scholar.&lt;br /&gt;
&lt;br /&gt;
2018-07-16: Ran through 10 pages of google scholar first thing without a problem. Tried running through all 100 pages but kept on getting caught. Helped Augi with discrepancies in data and will try google scholar crawler again tomorrow.&lt;br /&gt;
&lt;br /&gt;
2018-07-13: Fixed problem where pdf urls were not saving to txt file. Created another txt file to save urls that are not pdfs. Didn't run into a single recaptcha all morning. Towards the end, it started catching me at the 7th query and forced the program to restart. For some reason, selecting the css element triggered google scholar to find me. I changed the css element tag for the &amp;quot;next&amp;quot; button to the path and I was able to get through the 4th page. It is still not able to click on the actual link but I'm not sure if that's supposed to do anything.&lt;br /&gt;
&lt;br /&gt;
2018-07-12: Figured out how to save BibTeX files to computer. Still had to do recaptcha tests. After giving it some time, I was able to run it completely once but I only got 49 BibTeX files and about half as many pdf links. When I tried to work on it further the recaptcha wasn't loading and gave me the error - &amp;quot;Cannot contact reCAPTCHA. Check your connection and try again.&amp;quot; I ended up moving to the selenium computer and spent the rest of the day converting the code to python3 and messing with regex because for some reason it wasn't matching the text correctly.&lt;br /&gt;
&lt;br /&gt;
2018-07-11: Started on Google Scholar Crawler for Patent Thicket Project. I'm not sure what the problem is. The code seems to work except that Google constantly blocks me and makes me do reCaptcha tests. I am also not sure if the crawler is saving any data to txt files and if so, where those files are located.&lt;br /&gt;
&lt;br /&gt;
2018-07-10: Finished LinkedIn Crawler! When no search results were found, I skipped looking for the href and instead added the founder and company name to a txt file. Spent way too much time doing reCaptcha tests and logging out and back in because Firefox and the wifi were slow. Cleaned up the code and put it on the rdp, and fixed the wiki page - [[Crunchbase Accelerator Founders]]&lt;br /&gt;
&lt;br /&gt;
2018-07-09: Continued working on the LinkedIn Crawler. I figured out how to get the web element of the first search result using its xpath. The same result can be achieved by looking for the css element, which probably would have been easier. I then used get_attribute('href') on the web element of the first result to get the url of the founder's profile. Note that previously, there was another function to click on the name on the screen, which would open another window with the profile, but I found it easier to just extract the url. Next, I will run the (hopefully) working crawler on all the founder data. Update - ran into an error when no search results are found.&lt;br /&gt;
&lt;br /&gt;
2018-06-29: Spent a large part of today clicking on road signs and cars to prove to LinkedIn that I am not a robot. Figured out how to find search box with css element instead of xpath. Now trying to get information from search results.&lt;br /&gt;
&lt;br /&gt;
2018-06-28: Tried to figure out the xpath for the search box but came up with no solutions. Halfway through, linkedin discovered that I was a bot so I moved to the selenium computer and used the Rice Visitor wifi. Linkedin still wouldn't let me in so we made another test account (see project page for details). It finally let me in at the end of the day but I set one of the delays to be 3-4 min so that I have time to do the tests that linkedin gives to ensure that you are not a bot. I still have no idea how to find search box xpath.&lt;br /&gt;
&lt;br /&gt;
2018-06-27: Took the dictionary of accelerator to founder UUIDs and formed a table with the UUIDs combined with names of founders, gender, and linkedin_url. The file is in Z:\crunchbase2\FounderAccInfo.txt . Started looking at linkedin crawler documentation and got the crawler to get information from profiles with known urls. It crashed when it tried searching up founder names and accelerators so will work on that tomorrow. &lt;br /&gt;
&lt;br /&gt;
2018-06-26: Ended up finding all founders manually. Then talked to Ed and figured out how to get the founder data off the crunchbase API with a link (see project page). Created a python script that goes through all the API pages for each accelerator API and returns a dictionary of accelerator UUIDs mapped to founder UUIDs. I found 209 founders on the API but 224 manually so we'll look at the discrepancy tomorrow. &lt;br /&gt;
&lt;br /&gt;
2018-06-25: We took the 157 Accelerator UUIDs we found and created a new table, AccAllInfo, that includes all the attributes of the accelerators that we want from organizations.csv. Maxine and I then split into our respective projects. I tried joining people to the companies they are linked to in order to find the founders of each accelerator. I found about 90 matches but there are still a lot of gaps since some accelerators have no founders and others have multiple founders. Still unsure of how to fix this.&lt;br /&gt;
&lt;br /&gt;
2018-06-22: Matched Connor's master list of accelerators with organizations.csv based on homepage_url and company_name. Found 90 that matched along with 76 blanks. Then tried matching with homepage_url or company_name and manually found about 30 more that had slight variations in url or name that we should keep. Using ILIKE we found ~25 more company UUIDs that match with accelerators on the list.&lt;br /&gt;
&lt;br /&gt;
2018-06-21: Downloaded all 17 v3.1 csv tables and updated LoadTables.sql to match our data. We did this by manually updating the name and size of the fields. To solve the problem of &amp;quot;&amp;quot; from yesterday, we used regular expressions to change the empty string to nothing (see project page). We then worked with Connor to start extracting the accelerators from the organizations in the Crunchbase data. We found a lot of null matches based on company_name and a few that have the same name but are actually different companies. Maybe try matching with homepage_url tomorrow.&lt;br /&gt;
&lt;br /&gt;
2018-06-20: Learned more SQL. Started working on [[Crunchbase Data]] project with Maxine. Old code contained 22 csv tables but new Crunchbase data only has 17 csv tables. We will be using the new Crunchbase API v3.1 (not v3) with only 17 csv tables as data. We then started updating the old SQL tables to align with the 17 tables we have. We ran into a problem where some date-type fields in the data contain &amp;quot;&amp;quot;, which SQL did not accept. Ed was helping us with this but we have not found a solution yet.&lt;br /&gt;
&lt;br /&gt;
2018-06-19: Set up monitors and continued learning SQL. We were also introduced to our projects. I will be continuing Christy's work on the Google Scholar Crawler as well as working with Maxine to update the Crunchbase data and then use that data to crawl Linkedin to find data on startup founders that go through accelerators.&lt;br /&gt;
&lt;br /&gt;
2018-06-18: Introduced to the wiki, connected to RDP, and learned SQL.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt;&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Crunchbase_Accelerator_Founders&amp;diff=23842</id>
		<title>Crunchbase Accelerator Founders</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Crunchbase_Accelerator_Founders&amp;diff=23842"/>
		<updated>2018-07-30T21:28:28Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Crunchbase Accelerator Founders&lt;br /&gt;
|Has owner=Grace Tan,&lt;br /&gt;
|Has start date=6/18/18&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Related Pages==&lt;br /&gt;
[[Crunchbase Data]]&lt;br /&gt;
&lt;br /&gt;
[[Crunchbase Accelerator Equity]]&lt;br /&gt;
&lt;br /&gt;
[[LinkedIn Crawler (Python)]]&lt;br /&gt;
&lt;br /&gt;
==Project Introduction==&lt;br /&gt;
This project uses the Crunchbase Data and API to find founders of the accelerators we are interested in. We then take the founders and run their names through the LinkedIn Crawler to find information about them.&lt;br /&gt;
&lt;br /&gt;
==Part 1: Getting Data==&lt;br /&gt;
To get the founder UUIDs for each accelerator, insert the accelerator UUID (or its name in all lowercase, if the name is one word) into this link:&lt;br /&gt;
  https://api.crunchbase.com/v3.1/organizations/ + UUID of organization + ?relationships=founders&amp;amp;user_key=662e263576fe3e4ea5991edfbcfb9883&lt;br /&gt;
&lt;br /&gt;
We are doing this because there were more results this way than looking at the people table in the crunchbase db for the keyword &amp;quot;founders.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
===scrapefounders.py===&lt;br /&gt;
This code is located in: &lt;br /&gt;
  Z:\crunchbase2\scrapefounders.py&lt;br /&gt;
&lt;br /&gt;
This program takes Accelerators and UUIDs.txt (located at Z:\crunchbase2\Accelerators and UUIDs.txt), extracts the accelerator UUIDs, and loads the information for each founder from the Crunchbase API using the link above. It then takes the information given by the API and returns a dictionary with accelerator UUIDs as keys and founder UUIDs as values.&lt;br /&gt;
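&lt;br /&gt;
The flow described above can be sketched roughly as follows. This is an illustrative sketch and not the contents of scrapefounders.py: the function names are hypothetical, the JSON layout (data, relationships, founders, items, uuid) is an assumption about the v3.1 response, paging through multiple API pages is left out, and the user key is passed in rather than hard-coded.&lt;br /&gt;

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

API_ROOT = "https://api.crunchbase.com/v3.1/organizations/"


def founders_url(org_uuid, user_key):
    """Build the founders request URL for one accelerator UUID."""
    query = urlencode({"relationships": "founders", "user_key": user_key})
    return API_ROOT + quote(org_uuid) + "?" + query


def founder_uuids(payload):
    """Pull founder UUIDs out of one decoded response (assumed layout)."""
    founders = payload.get("data", {}).get("relationships", {}).get("founders", {})
    return [item["uuid"] for item in founders.get("items", []) if "uuid" in item]


def fetch_json(url):
    """Download and decode one API page; needs network access and a valid key."""
    with urlopen(url) as response:
        return json.load(response)


def scrape_founders(accelerator_uuids, user_key, fetch=fetch_json):
    """Return a dictionary of accelerator UUID to list of founder UUIDs."""
    return {
        acc: founder_uuids(fetch(founders_url(acc, user_key)))
        for acc in accelerator_uuids
    }
```

The fetch argument is only there so the parsing can be tried on a saved response without calling the API; by default the sketch downloads each page with urlopen.&lt;br /&gt;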
&lt;br /&gt;
==Part 2: Updated LinkedIn Crawler==&lt;br /&gt;
We will be using a LinkedIn Crawler to find information about accelerator founders. There is a previous project whose code is found in &lt;br /&gt;
  E:\McNair\Projects\LinkedIn Crawler\LinkedIn_Crawler\linkedin&lt;br /&gt;
&lt;br /&gt;
My code is found in the selenium computer at the root and at&lt;br /&gt;
  E:\McNair\Projects\LinkedIn Crawler 2018&lt;br /&gt;
&lt;br /&gt;
There are 6 python files in the LinkedIn Crawler 2018 directory.&lt;br /&gt;
&lt;br /&gt;
===linkedin_crawler_main.py===&lt;br /&gt;
This contains the main function that will run the LinkedIn Crawler. It includes two test accounts at the top which I went back and forth on to prevent LinkedIn from finding me. &lt;br /&gt;
&lt;br /&gt;
Inputs (set outside of function): username(of test account), password(of test account), query_filepath(txt file that includes name of accelerator, first_name, last_name, linkedin_url)&lt;br /&gt;
&lt;br /&gt;
Output: 3 txt files - founders_education.txt, founders_experience.txt, founders_main.txt&lt;br /&gt;
&lt;br /&gt;
The main function opens up a browser window, visits the known linkedin urls of each founder, and puts the information into txt files. For the remaining founders, it uses the search box to search for founder name + company, selects the href from the html, and opens the profile on a new page. If a founder cannot be found (there are no search results), the founder name and company are put into unavailable_profiles.txt (might be called something else but should have &amp;quot;unavailable&amp;quot; in the name).&lt;br /&gt;
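&lt;br /&gt;
The control flow described above might look roughly like the sketch below. It is not the code in linkedin_crawler_main.py: crawl_founders, open_profile and search_profile_url are hypothetical names, the two callables stand in for the Selenium-driven steps, and the tab-separated layout of the unavailable file is an assumption.&lt;br /&gt;

```python
def crawl_founders(rows, open_profile, search_profile_url, unavailable_path):
    """Visit each founder's profile, searching when no url is known.

    rows: iterable of (accelerator, first_name, last_name, linkedin_url).
    open_profile: hypothetical callable that scrapes one profile url.
    search_profile_url: hypothetical callable returning the href of the
        first search result, or None when there are no results.
    """
    missing = []
    for accelerator, first_name, last_name, linkedin_url in rows:
        name = first_name + " " + last_name
        # Use the known url when there is one, otherwise fall back to search.
        url = linkedin_url or search_profile_url(name + " " + accelerator)
        if url:
            open_profile(url)
        else:
            missing.append((name, accelerator))
    # Record founders with no search results so they can be checked by hand.
    with open(unavailable_path, "a", encoding="utf-8") as handle:
        for name, accelerator in missing:
            handle.write(name + "\t" + accelerator + "\n")
    return missing
```

Passing the browser steps in as callables keeps the fallback logic separate from the Selenium code, so it can be checked without logging in to LinkedIn.&lt;br /&gt;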
&lt;br /&gt;
===linked_in_crawler.py===&lt;br /&gt;
This contains the LinkedInCrawler class that includes functions(login, logout, search, etc) that the driver calls to crawl the LinkedIn website.&lt;br /&gt;
&lt;br /&gt;
This class relies heavily on locating the xpath of each element that we want on the webpage. I ran into a lot of difficulty with this because the xpaths in the old code no longer exist in the new LinkedIn html, which includes dynamic ids. To get around this, I found other attributes that we could use to build an xpath to the element we were looking for, and also located elements by their css.&lt;br /&gt;
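&lt;br /&gt;
The idea of locating an element by a stable attribute instead of a dynamic id can be illustrated with a small sketch. It uses the standard library's ElementTree on a made-up page fragment rather than Selenium on the real LinkedIn html; the ember-style ids and the placeholder attribute are invented for illustration, and the crawler's actual selectors may differ.&lt;br /&gt;

```python
import xml.etree.ElementTree as ET

# Build a tiny stand-in for a page whose ids change on every load.
page = ET.Element("div")
form = ET.SubElement(page, "form", {"id": "ember1234"})
ET.SubElement(form, "input", {"id": "ember1240", "placeholder": "Search", "type": "text"})
ET.SubElement(form, "input", {"id": "ember1241", "type": "submit"})


def find_search_box(root):
    """Locate the search box by a stable attribute instead of its id."""
    return root.find(".//input[@placeholder='Search']")


search_box = find_search_box(page)
```

An xpath keyed on an id such as ember1240 would stop matching as soon as the ids are regenerated, while the attribute-based expression keeps working as long as the placeholder text stays the same.&lt;br /&gt;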
&lt;br /&gt;
===New Test Account===&lt;br /&gt;
  Username: mcboatfaceboaty670@gmail.com&lt;br /&gt;
  Password: McNair2018&lt;br /&gt;
&lt;br /&gt;
===Obstacles and Notes===&lt;br /&gt;
Use the selenium computer on Rice Visitor wifi.&lt;br /&gt;
&lt;br /&gt;
After logging in a couple of times, LinkedIn will get suspicious and ask you to confirm that you are not a robot using reCaptcha. I got around this by delaying the program by 3 minutes so that I had time to complete the reCaptcha test. However, sometimes reCaptcha loses connection and it forces you to continue the tests which can be frustrating. When this happens, I disconnect and reconnect from the wifi as well as switch between the test accounts.&lt;br /&gt;
&lt;br /&gt;
I never figured out how to stop reCaptcha from losing connection so I spent a lot of time completing reCaptcha tests.&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Crunchbase_Accelerator_Founders&amp;diff=23841</id>
		<title>Crunchbase Accelerator Founders</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Crunchbase_Accelerator_Founders&amp;diff=23841"/>
		<updated>2018-07-30T21:27:19Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Crunchbase Accelerator Founders&lt;br /&gt;
|Has owner=Grace Tan,&lt;br /&gt;
|Has start date=6/18/18&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Related Pages==&lt;br /&gt;
[[Crunchbase Data]]&lt;br /&gt;
&lt;br /&gt;
[[Crunchbase Accelerator Equity]]&lt;br /&gt;
&lt;br /&gt;
[[LinkedIn Crawler (Python)]]&lt;br /&gt;
&lt;br /&gt;
==Project Introduction==&lt;br /&gt;
This project uses the Crunchbase Data and API to find founders of the accelerators we are interested in. We then take the founders and run their names through the LinkedIn Crawler to find information about them.&lt;br /&gt;
&lt;br /&gt;
==Part 1: Getting Data==&lt;br /&gt;
To get the founder UUIDs for each accelerator, insert the accelerator UUID (or its name in all lowercase, if the name is one word) into this link:&lt;br /&gt;
  https://api.crunchbase.com/v3.1/organizations/ + UUID of organization + ?relationships=founders&amp;amp;user_key=662e263576fe3e4ea5991edfbcfb9883&lt;br /&gt;
&lt;br /&gt;
===scrapefounders.py===&lt;br /&gt;
This code is located in: &lt;br /&gt;
  Z:\crunchbase2\scrapefounders.py&lt;br /&gt;
&lt;br /&gt;
This program takes Accelerators and UUIDs.txt (located at Z:\crunchbase2\Accelerators and UUIDs.txt), extracts the accelerator UUIDs, and loads the information for each founder from the Crunchbase API using the link above. It then takes the information given by the API and returns a dictionary with accelerator UUIDs as keys and founder UUIDs as values.&lt;br /&gt;
&lt;br /&gt;
==Part 2: Updated LinkedIn Crawler==&lt;br /&gt;
We will be using a LinkedIn Crawler to find information about accelerator founders. There is a previous project whose code is found in &lt;br /&gt;
  E:\McNair\Projects\LinkedIn Crawler\LinkedIn_Crawler\linkedin&lt;br /&gt;
&lt;br /&gt;
My code is found in the selenium computer at the root and at&lt;br /&gt;
  E:\McNair\Projects\LinkedIn Crawler 2018&lt;br /&gt;
&lt;br /&gt;
There are 6 python files in the LinkedIn Crawler 2018 directory.&lt;br /&gt;
&lt;br /&gt;
===linkedin_crawler_main.py===&lt;br /&gt;
This contains the main function that will run the LinkedIn Crawler. It includes two test accounts at the top which I went back and forth on to prevent LinkedIn from finding me. &lt;br /&gt;
&lt;br /&gt;
Inputs (set outside of function): username(of test account), password(of test account), query_filepath(txt file that includes name of accelerator, first_name, last_name, linkedin_url)&lt;br /&gt;
&lt;br /&gt;
Output: 3 txt files - founders_education.txt, founders_experience.txt, founders_main.txt&lt;br /&gt;
&lt;br /&gt;
The main function opens up a browser window, visits the known linkedin urls of each founder, and puts the information into txt files. For the remaining founders, it uses the search box to search for founder name + company, selects the href from the html, and opens the profile on a new page. If a founder cannot be found (there are no search results), the founder name and company are put into unavailable_profiles.txt (might be called something else but should have &amp;quot;unavailable&amp;quot; in the name).&lt;br /&gt;
&lt;br /&gt;
===linked_in_crawler.py===&lt;br /&gt;
This contains the LinkedInCrawler class that includes functions(login, logout, search, etc) that the driver calls to crawl the LinkedIn website.&lt;br /&gt;
&lt;br /&gt;
This class relies heavily on locating the xpath of each element that we want on the webpage. I ran into a lot of difficulty with this because the xpaths in the old code no longer exist in the new LinkedIn html, which includes dynamic ids. To get around this, I found other attributes that we could use to build an xpath to the element we were looking for, and also located elements by their css.&lt;br /&gt;
&lt;br /&gt;
===New Test Account===&lt;br /&gt;
  Username: mcboatfaceboaty670@gmail.com&lt;br /&gt;
  Password: McNair2018&lt;br /&gt;
&lt;br /&gt;
===Obstacles and Notes===&lt;br /&gt;
Use the selenium computer on Rice Visitor wifi.&lt;br /&gt;
&lt;br /&gt;
After logging in a couple of times, LinkedIn will get suspicious and ask you to confirm that you are not a robot using reCaptcha. I got around this by delaying the program by 3 minutes so that I had time to complete the reCaptcha test. However, sometimes reCaptcha loses connection and it forces you to continue the tests which can be frustrating. When this happens, I disconnect and reconnect from the wifi as well as switch between the test accounts.&lt;br /&gt;
&lt;br /&gt;
I never figured out how to stop reCaptcha from losing connection so I spent a lot of time completing reCaptcha tests.&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Crunchbase_Accelerator_Founders&amp;diff=23840</id>
		<title>Crunchbase Accelerator Founders</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Crunchbase_Accelerator_Founders&amp;diff=23840"/>
		<updated>2018-07-30T21:27:03Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Crunchbase Accelerator Founders&lt;br /&gt;
|Has owner=Grace Tan,&lt;br /&gt;
|Has start date=6/18/18&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Related Pages==&lt;br /&gt;
[[Crunchbase Data]]&lt;br /&gt;
&lt;br /&gt;
[[Crunchbase Accelerator Equity]]&lt;br /&gt;
&lt;br /&gt;
[[LinkedIn Crawler (Python)]]&lt;br /&gt;
&lt;br /&gt;
==Project Introduction==&lt;br /&gt;
This project uses the Crunchbase Data and API to find founders of the accelerators we are interested in. We then take the founders and run their names through the LinkedIn Crawler to find information about them.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Part 1: Getting Data==&lt;br /&gt;
To get the founder UUIDs for each accelerator, insert the accelerator's UUID (or its name in all lowercase, if the name is one word) into this link:&lt;br /&gt;
  https://api.crunchbase.com/v3.1/organizations/ + UUID of organization + ?relationships=founders&amp;amp;user_key=662e263576fe3e4ea5991edfbcfb9883&lt;br /&gt;
&lt;br /&gt;
===scrapefounders.py===&lt;br /&gt;
This code is located in: &lt;br /&gt;
  Z:\crunchbase2\scrapefounders.py&lt;br /&gt;
&lt;br /&gt;
This program reads Accelerators and UUIDs.txt (found at Z:\crunchbase2\Accelerators and UUIDs.txt), extracts the accelerator UUIDs, and loads the founder information for each accelerator from the Crunchbase API using the link above. It then returns a dictionary with accelerator UUIDs as keys and founder UUIDs as values.&lt;br /&gt;
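&lt;br /&gt;
As a rough sketch of this step (not the actual contents of scrapefounders.py), the request URL and the resulting dictionary could be built as below. The function names, the YOUR_USER_KEY placeholder, and the assumed response layout (founder items nested under data, relationships, founders, items) are assumptions for illustration; the HTTP request itself is left out.&lt;br /&gt;

```python
from urllib.parse import urlencode

API_ROOT = "https://api.crunchbase.com/v3.1/organizations/"


def founders_url(org_id, user_key):
    """Build the founders request URL for one accelerator."""
    query = urlencode({"relationships": "founders", "user_key": user_key})
    return API_ROOT + org_id + "?" + query


def founder_uuids(response):
    """Pull founder UUIDs out of one parsed JSON response.

    Assumed layout: data, relationships, founders, items (hypothetical).
    """
    items = (
        response.get("data", {})
        .get("relationships", {})
        .get("founders", {})
        .get("items", [])
    )
    return [item["uuid"] for item in items]


def build_founder_map(responses):
    """Map each accelerator UUID to the list of its founder UUIDs."""
    return {acc: founder_uuids(resp) for acc, resp in responses.items()}


# Placeholder key and accelerator id, for illustration only.
example_url = founders_url("some-accelerator", "YOUR_USER_KEY")
```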
&lt;br /&gt;
==Part 2: Updated LinkedIn Crawler==&lt;br /&gt;
We will be using a LinkedIn Crawler to find information about accelerator founders. There is a previous project whose code is found in &lt;br /&gt;
  E:\McNair\Projects\LinkedIn Crawler\LinkedIn_Crawler\linkedin&lt;br /&gt;
&lt;br /&gt;
My code is found in the selenium computer at the root and at&lt;br /&gt;
  E:\McNair\Projects\LinkedIn Crawler 2018&lt;br /&gt;
&lt;br /&gt;
There are 6 python files in the LinkedIn Crawler 2018 directory.&lt;br /&gt;
&lt;br /&gt;
===linkedin_crawler_main.py===&lt;br /&gt;
This contains the main function that runs the LinkedIn Crawler. Two test accounts are defined at the top; I alternated between them to keep LinkedIn from flagging the crawler.&lt;br /&gt;
&lt;br /&gt;
Inputs (set outside of the function): username (of the test account), password (of the test account), query_filepath (txt file that includes the accelerator name, first_name, last_name, linkedin_url)&lt;br /&gt;
&lt;br /&gt;
Output: 3 txt files - founders_education.txt, founders_experience.txt, founders_main.txt&lt;br /&gt;
&lt;br /&gt;
The main function opens a browser window, visits the known LinkedIn URL of each founder, and writes the information to the txt files. It then uses the search box to search for founder name + company, selects the href from the HTML, and opens the profile on a new page. If a founder cannot be found (there are no search results), the founder name and company are written to unavailable_profiles.txt (the file might be called something else, but it should have &amp;quot;unavailable&amp;quot; in the name).&lt;br /&gt;
&lt;br /&gt;
===linked_in_crawler.py===&lt;br /&gt;
This contains the LinkedInCrawler class, which includes the functions (login, logout, search, etc.) that the driver calls to crawl the LinkedIn website.&lt;br /&gt;
&lt;br /&gt;
The class relies heavily on locating the XPath of each element that we want on the webpage. I ran into a lot of difficulty with this because the XPaths in the code no longer exist in the new LinkedIn HTML, which includes dynamic ids. To get around this, I looked for other features of the page that could be used to build an XPath to the element we were looking for, and I also located elements by their CSS.&lt;br /&gt;
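&lt;br /&gt;
One way to write locators that do not depend on dynamic ids is to match on a class name instead. The helpers below are only a sketch: the function names and the class name profile-name are hypothetical, and the real crawler may build its locators differently. The resulting strings could be passed to Selenium's XPath and CSS selector lookups.&lt;br /&gt;

```python
def xpath_by_class(tag, class_name):
    """Build an XPath that matches a tag by one of its class names."""
    # Padding with spaces makes sure a whole class token is matched,
    # not just a substring of a longer class name.
    return (
        "//{tag}[contains(concat(' ', normalize-space(@class), ' '), "
        "' {cls} ')]"
    ).format(tag=tag, cls=class_name)


def css_by_class(tag, class_name):
    """Build the equivalent CSS selector."""
    return "{}.{}".format(tag, class_name)


# Hypothetical class name, for illustration only.
name_xpath = xpath_by_class("h1", "profile-name")
name_css = css_by_class("h1", "profile-name")
```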
&lt;br /&gt;
===New Test Account===&lt;br /&gt;
  Username: mcboatfaceboaty670@gmail.com&lt;br /&gt;
  Password: McNair2018&lt;br /&gt;
&lt;br /&gt;
===Obstacles and Notes===&lt;br /&gt;
Use the selenium computer on Rice Visitor wifi.&lt;br /&gt;
&lt;br /&gt;
After logging in a couple of times, LinkedIn will get suspicious and ask you to confirm that you are not a robot using reCaptcha. I got around this by delaying the program by 3 minutes so that I had time to complete the reCaptcha test. However, sometimes reCaptcha loses connection and it forces you to continue the tests which can be frustrating. When this happens, I disconnect and reconnect from the wifi as well as switch between the test accounts.&lt;br /&gt;
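&lt;br /&gt;
The fixed 3 minute delay could also be written as a polling wait that returns as soon as the reCaptcha has been completed. This is a sketch rather than the crawler's actual code: wait_for_manual_step and the is_done callback (for example, a check that the home page has loaded) are hypothetical names.&lt;br /&gt;

```python
import time


def wait_for_manual_step(is_done, timeout=180.0, poll=5.0, sleep=time.sleep):
    """Poll is_done() every poll seconds for up to timeout seconds.

    Returns True once is_done() reports success, otherwise the result
    of one final check after the timeout has elapsed.
    """
    attempts = int(timeout // poll)
    for _ in range(attempts):
        if is_done():
            return True
        sleep(poll)
    return is_done()
```

The sleep argument is only there so that the wait can be exercised without actually pausing.&lt;br /&gt;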
&lt;br /&gt;
I never figured out how to stop reCaptcha from losing connection so I spent a lot of time completing reCaptcha tests.&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Crunchbase_Accelerator_Founders&amp;diff=23839</id>
		<title>Crunchbase Accelerator Founders</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Crunchbase_Accelerator_Founders&amp;diff=23839"/>
		<updated>2018-07-30T21:23:57Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Crunchbase Accelerator Founders&lt;br /&gt;
|Has owner=Grace Tan,&lt;br /&gt;
|Has start date=6/18/18&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Related Pages==&lt;br /&gt;
[[Crunchbase Data]]&lt;br /&gt;
&lt;br /&gt;
[[Crunchbase Accelerator Equity]]&lt;br /&gt;
&lt;br /&gt;
[[LinkedIn Crawler (Python)]]&lt;br /&gt;
&lt;br /&gt;
==Project Introduction==&lt;br /&gt;
This project uses the Crunchbase Data and API to find founders of the accelerators we are interested in. We then take the founders and run their names through the LinkedIn Crawler to find information about them.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Part 1: Getting Data==&lt;br /&gt;
To get the founder UUIDs for each accelerator, insert the accelerator's UUID (or its name in all lowercase, if the name is one word) into this link:&lt;br /&gt;
  https://api.crunchbase.com/v3.1/organizations/ + UUID of organization + ?relationships=founders&amp;amp;user_key=662e263576fe3e4ea5991edfbcfb9883&lt;br /&gt;
&lt;br /&gt;
===scrapefounders.py===&lt;br /&gt;
This code is located in: &lt;br /&gt;
  Z:\crunchbase2\scrapefounders.py&lt;br /&gt;
&lt;br /&gt;
This program reads Accelerators and UUIDs.txt (found at Z:\crunchbase2\Accelerators and UUIDs.txt), extracts the accelerator UUIDs, and loads the founder information for each accelerator from the Crunchbase API using the link above. It then returns a dictionary with accelerator UUIDs as keys and founder UUIDs as values.&lt;br /&gt;
&lt;br /&gt;
==Part 2: Updated LinkedIn Crawler==&lt;br /&gt;
We will be using a LinkedIn Crawler to find information about accelerator founders. There is a previous project whose code is found in &lt;br /&gt;
  E:\McNair\Projects\LinkedIn Crawler\LinkedIn_Crawler\linkedin&lt;br /&gt;
&lt;br /&gt;
My code is found in the selenium computer at the root and at&lt;br /&gt;
  E:\McNair\Projects\LinkedIn Crawler 2018&lt;br /&gt;
&lt;br /&gt;
There are 6 python files in the LinkedIn Crawler 2018 directory.&lt;br /&gt;
&lt;br /&gt;
===linkedin_crawler_main.py===&lt;br /&gt;
This contains the main function that runs the LinkedIn Crawler. Two test accounts are defined at the top; I alternated between them to keep LinkedIn from flagging the crawler.&lt;br /&gt;
&lt;br /&gt;
Inputs (set outside of the function): username (of the test account), password (of the test account), query_filepath (txt file that includes the accelerator name, first_name, last_name, linkedin_url)&lt;br /&gt;
&lt;br /&gt;
Output: 3 txt files - founders_education.txt, founders_experience.txt, founders_main.txt&lt;br /&gt;
&lt;br /&gt;
The main function opens a browser window, visits the known LinkedIn URL of each founder, and writes the information to the txt files. It then uses the search box to search for founder name + company, selects the href from the HTML, and opens the profile on a new page. If a founder cannot be found (there are no search results), the founder name and company are written to unavailable_profiles.txt (the file might be called something else, but it should have &amp;quot;unavailable&amp;quot; in the name).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===linked_in_crawler.py===&lt;br /&gt;
This contains the LinkedInCrawler class, which includes the functions (login, logout, search, etc.) that the driver calls to crawl the LinkedIn website.&lt;br /&gt;
&lt;br /&gt;
The class relies heavily on locating the XPath of each element that we want on the webpage. I ran into a lot of difficulty with this because the XPaths in the code no longer exist in the new LinkedIn HTML, which includes dynamic ids. To get around this, I looked for other features of the page that could be used to build an XPath to the element we were looking for, and I also located elements by their CSS.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===New Test Account===&lt;br /&gt;
  Username: mcboatfaceboaty670@gmail.com&lt;br /&gt;
  Password: McNair2018&lt;br /&gt;
&lt;br /&gt;
Use the selenium computer on Rice Visitor wifi.&lt;br /&gt;
After logging in a couple of times, LinkedIn will get suspicious and ask you to confirm that you are not a robot using reCaptcha. I got around this by delaying the program by 3 minutes so that I had time to complete the reCaptcha test. However, sometimes reCaptcha loses connection and it forces you to continue the tests which can be frustrating. When this happens, I disconnect and reconnect from the wifi as well as switch between the test accounts.&lt;br /&gt;
&lt;br /&gt;
I never figured out how to stop reCaptcha from losing connection so I spent a lot of time completing reCaptcha tests.&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Crunchbase_Accelerator_Founders&amp;diff=23837</id>
		<title>Crunchbase Accelerator Founders</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Crunchbase_Accelerator_Founders&amp;diff=23837"/>
		<updated>2018-07-30T21:08:26Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Crunchbase Accelerator Founders&lt;br /&gt;
|Has owner=Grace Tan,&lt;br /&gt;
|Has start date=6/18/18&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Related Pages==&lt;br /&gt;
[[Crunchbase Data]]&lt;br /&gt;
&lt;br /&gt;
[[Crunchbase Accelerator Equity]]&lt;br /&gt;
&lt;br /&gt;
[[LinkedIn Crawler (Python)]]&lt;br /&gt;
&lt;br /&gt;
==Project Introduction==&lt;br /&gt;
This project uses the Crunchbase Data and API to find founders of the accelerators we are interested in. We then take the founders and run their names through the LinkedIn Crawler to find information about them.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Part 1: Getting Data==&lt;br /&gt;
To get the founder UUIDs for each accelerator, insert the accelerator's UUID (or its name in all lowercase, if the name is one word) into this link:&lt;br /&gt;
  https://api.crunchbase.com/v3.1/organizations/ + UUID of organization + ?relationships=founders&amp;amp;user_key=662e263576fe3e4ea5991edfbcfb9883&lt;br /&gt;
&lt;br /&gt;
===scrapefounders.py===&lt;br /&gt;
This code is located in: &lt;br /&gt;
  Z:\crunchbase2\scrapefounders.py&lt;br /&gt;
&lt;br /&gt;
This program reads Accelerators and UUIDs.txt (found at Z:\crunchbase2\Accelerators and UUIDs.txt), extracts the accelerator UUIDs, and loads the founder information for each accelerator from the Crunchbase API using the link above. It then returns a dictionary with accelerator UUIDs as keys and founder UUIDs as values.&lt;br /&gt;
&lt;br /&gt;
==Part 2: Updated LinkedIn Crawler==&lt;br /&gt;
We will be using a LinkedIn Crawler to find information about accelerator founders. There is a previous project whose code is found in &lt;br /&gt;
  E:\McNair\Projects\LinkedIn Crawler\LinkedIn_Crawler\linkedin&lt;br /&gt;
&lt;br /&gt;
My code is found in the selenium computer at the root and at&lt;br /&gt;
  E:\McNair\Projects\LinkedIn Crawler 2018&lt;br /&gt;
&lt;br /&gt;
There are 6 python files in the LinkedIn Crawler 2018 directory.&lt;br /&gt;
&lt;br /&gt;
===linkedin_crawler_main.py===&lt;br /&gt;
This contains the main function that runs the LinkedIn Crawler. Two test accounts are defined at the top; I alternated between them to keep LinkedIn from flagging the crawler.&lt;br /&gt;
&lt;br /&gt;
Inputs (set outside of the function): username (of the test account), password (of the test account), query_filepath (txt file that includes the accelerator name, first_name, last_name, linkedin_url)&lt;br /&gt;
&lt;br /&gt;
Output: 3 txt files - founders_education.txt, founders_experience.txt, founders_main.txt&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===New Test Account===&lt;br /&gt;
  Username: mcboatfaceboaty670@gmail.com&lt;br /&gt;
  Password: McNair2018&lt;br /&gt;
&lt;br /&gt;
Use the selenium computer on Rice Visitor wifi.&lt;br /&gt;
After logging in a couple of times, LinkedIn will get suspicious and ask you to confirm that you are not a robot using reCaptcha. I got around this by delaying the program by 3 minutes so that I had time to complete the reCaptcha test. However, sometimes reCaptcha loses connection and it forces you to continue the tests which can be frustrating. When this happens, I disconnect and reconnect from the wifi as well as switch between the test accounts.&lt;br /&gt;
&lt;br /&gt;
I never figured out how to stop reCaptcha from losing connection so I spent a lot of time completing reCaptcha tests.&lt;br /&gt;
&lt;br /&gt;
==Code==&lt;br /&gt;
===linkedin_crawler_main.py===&lt;br /&gt;
The main function opens a browser window, visits the known LinkedIn URL of each founder, and writes the information to the txt files. It then uses the search box to search for founder name + company, selects the href from the HTML, and opens the profile on a new page. If a founder cannot be found (there are no search results), the founder name and company are written to unavailable_profiles.txt (the file might be called something else, but it should have &amp;quot;unavailable&amp;quot; in the name).&lt;br /&gt;
&lt;br /&gt;
===linked_in_crawler.py===&lt;br /&gt;
Contains the LinkedInCrawler class, which provides functions for login, logout, search, etc. It relies heavily on locating the XPath of each element that we want on the webpage. I ran into a lot of difficulty with this because the XPaths in the code no longer exist in the new LinkedIn HTML, which includes dynamic ids. To get around this, I looked for other features of the page that could be used to build an XPath to the element we were looking for, and I also located elements by their CSS.&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Crunchbase_Accelerator_Founders&amp;diff=23836</id>
		<title>Crunchbase Accelerator Founders</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Crunchbase_Accelerator_Founders&amp;diff=23836"/>
		<updated>2018-07-30T21:07:25Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Crunchbase Accelerator Founders&lt;br /&gt;
|Has owner=Grace Tan,&lt;br /&gt;
|Has start date=6/18/18&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Related Pages==&lt;br /&gt;
[[Crunchbase Data]]&lt;br /&gt;
&lt;br /&gt;
[[Crunchbase Accelerator Equity]]&lt;br /&gt;
&lt;br /&gt;
[[LinkedIn Crawler (Python)]]&lt;br /&gt;
&lt;br /&gt;
==Project Introduction==&lt;br /&gt;
This project uses the Crunchbase Data and API to find founders of the accelerators we are interested in. We then take the founders and run their names through the LinkedIn Crawler to find information about them.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Part 1: Getting Data==&lt;br /&gt;
To get the founder UUIDs for each accelerator, insert the accelerator's UUID (or its name in all lowercase, if the name is one word) into this link:&lt;br /&gt;
  https://api.crunchbase.com/v3.1/organizations/ + UUID of organization + ?relationships=founders&amp;amp;user_key=662e263576fe3e4ea5991edfbcfb9883&lt;br /&gt;
&lt;br /&gt;
===scrapefounders.py===&lt;br /&gt;
This code is located in: &lt;br /&gt;
  Z:\crunchbase2\scrapefounders.py&lt;br /&gt;
&lt;br /&gt;
This program reads Accelerators and UUIDs.txt (found at Z:\crunchbase2\Accelerators and UUIDs.txt), extracts the accelerator UUIDs, and loads the founder information for each accelerator from the Crunchbase API using the link above. It then returns a dictionary with accelerator UUIDs as keys and founder UUIDs as values.&lt;br /&gt;
&lt;br /&gt;
==Part 2: Updated LinkedIn Crawler==&lt;br /&gt;
We will be using a LinkedIn Crawler to find information about accelerator founders. There is a previous project whose code is found in &lt;br /&gt;
  E:\McNair\Projects\LinkedIn Crawler\LinkedIn_Crawler\linkedin&lt;br /&gt;
&lt;br /&gt;
My code is found in the selenium computer at the root and at&lt;br /&gt;
  E:\McNair\Projects\LinkedIn Crawler 2018&lt;br /&gt;
&lt;br /&gt;
There are 6 python files in the LinkedIn Crawler 2018 directory.&lt;br /&gt;
&lt;br /&gt;
===linkedin_crawler_main.py===&lt;br /&gt;
This contains the main function that runs the LinkedIn Crawler. Two test accounts are defined at the top; I alternated between them to keep LinkedIn from flagging the crawler.&lt;br /&gt;
&lt;br /&gt;
Inputs (set outside of the function): username (of the test account), password (of the test account), query_filepath (txt file that includes the accelerator name, first_name, last_name, linkedin_url)&lt;br /&gt;
&lt;br /&gt;
Output: 3 txt files - founders_education.txt, founders_experience.txt, founders_main.txt&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===New Test Account===&lt;br /&gt;
  Username: mcboatfaceboaty670@gmail.com&lt;br /&gt;
  Password: McNair2018&lt;br /&gt;
&lt;br /&gt;
Use the selenium computer on Rice Visitor wifi.&lt;br /&gt;
After logging in a couple of times, LinkedIn will get suspicious and ask you to confirm that you are not a robot using reCaptcha. I got around this by delaying the program by 3 minutes so that I had time to complete the reCaptcha test. However, sometimes reCaptcha loses connection and it forces you to continue the tests which can be frustrating. When this happens, I disconnect and reconnect from the wifi as well as switch between the test accounts.&lt;br /&gt;
&lt;br /&gt;
I never figured out how to stop reCaptcha from losing connection so I spent a lot of time completing reCaptcha tests.&lt;br /&gt;
&lt;br /&gt;
==Code==&lt;br /&gt;
===linkedin_crawler_main.py===&lt;br /&gt;
The main function opens a browser window, visits the known LinkedIn URL of each founder, and writes the information to the txt files. It then uses the search box to search for founder name + company, selects the href from the HTML, and opens the profile on a new page. If a founder cannot be found (there are no search results), the founder name and company are written to unavailable_profiles.txt (the file might be called something else, but it should have &amp;quot;unavailable&amp;quot; in the name).&lt;br /&gt;
&lt;br /&gt;
===linked_in_crawler.py===&lt;br /&gt;
Contains the LinkedInCrawler class, which provides functions for login, logout, search, etc. It relies heavily on locating the XPath of each element that we want on the webpage. I ran into a lot of difficulty with this because the XPaths in the code no longer exist in the new LinkedIn HTML, which includes dynamic ids. To get around this, I looked for other features of the page that could be used to build an XPath to the element we were looking for, and I also located elements by their CSS.&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Crunchbase_Accelerator_Founders&amp;diff=23835</id>
		<title>Crunchbase Accelerator Founders</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Crunchbase_Accelerator_Founders&amp;diff=23835"/>
		<updated>2018-07-30T21:03:13Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Crunchbase Accelerator Founders&lt;br /&gt;
|Has owner=Grace Tan,&lt;br /&gt;
|Has start date=6/18/18&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Related Pages==&lt;br /&gt;
[[Crunchbase Data]]&lt;br /&gt;
&lt;br /&gt;
[[Crunchbase Accelerator Equity]]&lt;br /&gt;
&lt;br /&gt;
[[LinkedIn Crawler (Python)]]&lt;br /&gt;
&lt;br /&gt;
==Project Introduction==&lt;br /&gt;
This project uses the Crunchbase Data and API to find founders of the accelerators we are interested in. We then take the founders and run their names through the LinkedIn Crawler to find information about them.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Part 1: Getting Data==&lt;br /&gt;
To get the founder UUIDs for each accelerator, insert the accelerator's UUID (or its name in all lowercase, if the name is one word) into this link:&lt;br /&gt;
  https://api.crunchbase.com/v3.1/organizations/ + UUID of organization + ?relationships=founders&amp;amp;user_key=662e263576fe3e4ea5991edfbcfb9883&lt;br /&gt;
&lt;br /&gt;
===scrapefounders.py===&lt;br /&gt;
This code is located in: &lt;br /&gt;
  Z:\crunchbase2\scrapefounders.py&lt;br /&gt;
&lt;br /&gt;
This program reads Accelerators and UUIDs.txt (found at Z:\crunchbase2\Accelerators and UUIDs.txt), extracts the accelerator UUIDs, and loads the founder information for each accelerator from the Crunchbase API using the link above. It then returns a dictionary with accelerator UUIDs as keys and founder UUIDs as values.&lt;br /&gt;
&lt;br /&gt;
==Part 2: Updated LinkedIn Crawler==&lt;br /&gt;
We will be using a LinkedIn Crawler to find information about accelerator founders. There is a previous project whose code is found in &lt;br /&gt;
  E:\McNair\Projects\LinkedIn Crawler\LinkedIn_Crawler\linkedin&lt;br /&gt;
&lt;br /&gt;
My code is found in the selenium computer at the root and at&lt;br /&gt;
  E:\McNair\Projects\LinkedIn Crawler 2018&lt;br /&gt;
&lt;br /&gt;
There are 6 python files in the LinkedIn Crawler 2018 directory.&lt;br /&gt;
&lt;br /&gt;
===linkedin_crawler_main.py===&lt;br /&gt;
This contains the main function that runs the LinkedIn Crawler. Two test accounts are defined at the top; I alternated between them to keep LinkedIn from flagging the crawler.&lt;br /&gt;
Inputs (set outside of the function): username (of the test account), password (of the test account), query_filepath (txt file that includes the accelerator name, first_name, last_name, linkedin_url)&lt;br /&gt;
Output: 3 txt files - founders_education.txt, founders_experience.txt, founders_main.txt&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===New Test Account===&lt;br /&gt;
  Username: mcboatfaceboaty670@gmail.com&lt;br /&gt;
  Password: McNair2018&lt;br /&gt;
&lt;br /&gt;
Use the selenium computer on Rice Visitor wifi.&lt;br /&gt;
After logging in a couple of times, LinkedIn will get suspicious and ask you to confirm that you are not a robot using reCaptcha. I got around this by delaying the program by 3 minutes so that I had time to complete the reCaptcha test. However, sometimes reCaptcha loses connection and it forces you to continue the tests which can be frustrating. When this happens, I disconnect and reconnect from the wifi as well as switch between the test accounts.&lt;br /&gt;
&lt;br /&gt;
I never figured out how to stop reCaptcha from losing connection so I spent a lot of time completing reCaptcha tests.&lt;br /&gt;
&lt;br /&gt;
==Code==&lt;br /&gt;
===linkedin_crawler_main.py===&lt;br /&gt;
The main function opens a browser window, visits the known LinkedIn URL of each founder, and writes the information to the txt files. It then uses the search box to search for founder name + company, selects the href from the HTML, and opens the profile on a new page. If a founder cannot be found (there are no search results), the founder name and company are written to unavailable_profiles.txt (the file might be called something else, but it should have &amp;quot;unavailable&amp;quot; in the name).&lt;br /&gt;
&lt;br /&gt;
===linked_in_crawler.py===&lt;br /&gt;
Contains the LinkedInCrawler class, which provides functions for login, logout, search, etc. It relies heavily on locating the XPath of each element that we want on the webpage. I ran into a lot of difficulty with this because the XPaths in the code no longer exist in the new LinkedIn HTML, which includes dynamic ids. To get around this, I looked for other features of the page that could be used to build an XPath to the element we were looking for, and I also located elements by their CSS.&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Crunchbase_Accelerator_Founders&amp;diff=23834</id>
		<title>Crunchbase Accelerator Founders</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Crunchbase_Accelerator_Founders&amp;diff=23834"/>
		<updated>2018-07-30T20:40:28Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Crunchbase Accelerator Founders&lt;br /&gt;
|Has owner=Grace Tan,&lt;br /&gt;
|Has start date=6/18/18&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Related Pages==&lt;br /&gt;
[[Crunchbase Data]]&lt;br /&gt;
&lt;br /&gt;
[[Crunchbase Accelerator Equity]]&lt;br /&gt;
&lt;br /&gt;
[[LinkedIn Crawler (Python)]]&lt;br /&gt;
&lt;br /&gt;
==Getting Data==&lt;br /&gt;
To get the founder UUIDs for each accelerator, insert the accelerator's UUID (or its name in all lowercase, if the name is one word) into this link:&lt;br /&gt;
  https://api.crunchbase.com/v3.1/organizations/ + UUID of organization + ?relationships=founders&amp;amp;user_key=662e263576fe3e4ea5991edfbcfb9883&lt;br /&gt;
&lt;br /&gt;
==Code==&lt;br /&gt;
===scrapefounders.py===&lt;br /&gt;
This code lives in &lt;br /&gt;
  Z:\crunchbase2\scrapefounders.py&lt;br /&gt;
&lt;br /&gt;
This program reads Accelerators and UUIDs.txt (found at Z:\crunchbase2\Accelerators and UUIDs.txt), extracts the accelerator UUIDs, and loads the founder information for each accelerator from the Crunchbase API using the link above. It then returns a dictionary with accelerator UUIDs as keys and founder UUIDs as values.&lt;br /&gt;
&lt;br /&gt;
==Updated LinkedIn Crawler==&lt;br /&gt;
We will be using a LinkedIn Crawler to find information about accelerator founders. There is a previous project whose code is found in &lt;br /&gt;
  E:\McNair\Projects\LinkedIn Crawler\LinkedIn_Crawler\linkedin&lt;br /&gt;
&lt;br /&gt;
My code is found in the selenium computer at the root and at&lt;br /&gt;
  E:\McNair\Projects\LinkedIn Crawler 2018&lt;br /&gt;
&lt;br /&gt;
There are 6 python files.&lt;br /&gt;
&lt;br /&gt;
===New Test Account===&lt;br /&gt;
  Username: mcboatfaceboaty670@gmail.com&lt;br /&gt;
  Password: McNair2018&lt;br /&gt;
&lt;br /&gt;
Use the selenium computer on Rice Visitor wifi.&lt;br /&gt;
After logging in a couple of times, LinkedIn will get suspicious and ask you to confirm that you are not a robot using reCaptcha. I got around this by delaying the program by 3 minutes so that I had time to complete the reCaptcha test. However, sometimes reCaptcha loses connection and it forces you to continue the tests which can be frustrating. When this happens, I disconnect and reconnect from the wifi as well as switch between the test accounts.&lt;br /&gt;
&lt;br /&gt;
I never figured out how to stop reCaptcha from losing connection so I spent a lot of time completing reCaptcha tests.&lt;br /&gt;
&lt;br /&gt;
==Code==&lt;br /&gt;
===linkedin_crawler_main.py===&lt;br /&gt;
The main function opens a browser window, visits the known LinkedIn URL of each founder, and writes the information to the txt files. It then uses the search box to search for founder name + company, selects the href from the HTML, and opens the profile on a new page. If a founder cannot be found (there are no search results), the founder name and company are written to unavailable_profiles.txt (the file might be called something else, but it should have &amp;quot;unavailable&amp;quot; in the name).&lt;br /&gt;
&lt;br /&gt;
===linked_in_crawler.py===&lt;br /&gt;
Contains the LinkedInCrawler class, which provides functions for login, logout, search, etc. It relies heavily on locating the XPath of each element that we want on the webpage. I ran into a lot of difficulty with this because the XPaths in the code no longer exist in the new LinkedIn HTML, which includes dynamic ids. To get around this, I looked for other features of the page that could be used to build an XPath to the element we were looking for, and I also located elements by their CSS.&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Crunchbase_Accelerator_Founders&amp;diff=23833</id>
		<title>Crunchbase Accelerator Founders</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Crunchbase_Accelerator_Founders&amp;diff=23833"/>
		<updated>2018-07-30T20:38:35Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Crunchbase Accelerator Founders&lt;br /&gt;
|Has owner=Grace Tan,&lt;br /&gt;
|Has start date=6/18/18&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Related Pages==&lt;br /&gt;
[[Crunchbase Data]]&lt;br /&gt;
[[Crunchbase Accelerator Equity]]&lt;br /&gt;
[[LinkedIn Crawler (Python)]]&lt;br /&gt;
&lt;br /&gt;
==Getting Data==&lt;br /&gt;
To get the founder UUIDs for each accelerator, insert the accelerator's UUID (or its name in all lowercase, if the name is one word) into this link:&lt;br /&gt;
  https://api.crunchbase.com/v3.1/organizations/ + UUID of organization + ?relationships=founders&amp;amp;user_key=662e263576fe3e4ea5991edfbcfb9883&lt;br /&gt;
&lt;br /&gt;
==Code==&lt;br /&gt;
===scrapefounders.py===&lt;br /&gt;
This code lives in &lt;br /&gt;
  Z:\crunchbase2\scrapefounders.py&lt;br /&gt;
&lt;br /&gt;
This program reads Accelerators and UUIDs.txt, found at Z:\crunchbase2\Accelerators and UUIDs.txt, extracts the accelerator UUIDs, and loads the information on each accelerator's founders from the Crunchbase API using the link above. It then returns a dictionary with accelerator UUIDs as keys and founder UUIDs as values.&lt;br /&gt;
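
A minimal sketch of the two steps described above, not the actual contents of scrapefounders.py: building the link for one accelerator and turning decoded JSON responses into the accelerator-to-founder dictionary. The function names and the YOUR_KEY placeholder are hypothetical, and the response layout (data, relationships, founders, items, uuid) is an assumption that should be checked against a real API response.

```python
from urllib.parse import quote, urlencode

API_ROOT = "https://api.crunchbase.com/v3.1/organizations/"


def founders_url(org_id, user_key):
    """Build the founders link for one accelerator UUID (or one-word name)."""
    query = urlencode({"relationships": "founders", "user_key": user_key})
    return API_ROOT + quote(org_id) + "?" + query


def founder_uuids(response):
    """Pull founder UUIDs out of one decoded API response.

    Assumed (unverified) layout: data.relationships.founders.items[i].uuid
    """
    founders = response.get("data", {}).get("relationships", {}).get("founders", {})
    return [item["uuid"] for item in founders.get("items", []) if "uuid" in item]


def build_founder_map(responses):
    """Map each accelerator UUID to the list of its founder UUIDs."""
    return {acc: founder_uuids(resp) for acc, resp in responses.items()}


# Hypothetical sample data standing in for real API responses.
sample = {"data": {"relationships": {"founders": {"items": [{"uuid": "f-1"}, {"uuid": "f-2"}]}}}}
print(founders_url("acc-1", "YOUR_KEY"))
print(build_founder_map({"acc-1": sample, "acc-2": {}}))
```

Accelerators with no founders come back as an empty list instead of raising an error, which makes it easier to compare the API count against a manual count.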
&lt;br /&gt;
==Updated LinkedIn Crawler==&lt;br /&gt;
We will be using a LinkedIn Crawler to find information about accelerator founders. There is a previous project whose code is found in &lt;br /&gt;
  E:\McNair\Projects\LinkedIn Crawler\LinkedIn_Crawler\linkedin&lt;br /&gt;
&lt;br /&gt;
My code is found in the selenium computer at the root and at&lt;br /&gt;
  E:\McNair\Projects\LinkedIn Crawler 2018&lt;br /&gt;
&lt;br /&gt;
There are 6 python files.&lt;br /&gt;
&lt;br /&gt;
===New Test Account===&lt;br /&gt;
  Username: mcboatfaceboaty670@gmail.com&lt;br /&gt;
  Password: McNair2018&lt;br /&gt;
&lt;br /&gt;
Use the selenium computer on Rice Visitor wifi.&lt;br /&gt;
After logging in a couple of times, LinkedIn will get suspicious and ask you to confirm that you are not a robot using reCaptcha. I got around this by delaying the program by 3 minutes so that I had time to complete the reCaptcha test. However, sometimes reCaptcha loses connection and it forces you to continue the tests which can be frustrating. When this happens, I disconnect and reconnect from the wifi as well as switch between the test accounts.&lt;br /&gt;
&lt;br /&gt;
I never figured out how to stop reCaptcha from losing connection so I spent a lot of time completing reCaptcha tests.&lt;br /&gt;
&lt;br /&gt;
==Code==&lt;br /&gt;
===linkedin_crawler_main.py===&lt;br /&gt;
Main function for the crawler. It opens a browser window, visits the known LinkedIn URL of each founder, and writes the profile information to txt files. It then uses the search box to search for founder name + company, selects the href from the HTML, and opens the profile on a new page. If a founder cannot be found (there are no search results), the founder name and company are written to unavailable_profiles.txt (it might be called something else but should have &amp;quot;unavailable&amp;quot; in the name).&lt;br /&gt;
&lt;br /&gt;
===linked_in_crawler.py===&lt;br /&gt;
Contains the LinkedInCrawler class, which has functions for login, logout, search, etc. It relies heavily on locating the xpath of the element that we want on the webpage. I ran into a lot of difficulty with this because the xpaths in the code no longer exist in the new LinkedIn HTML, which includes dynamic ids. To get around this, I looked for other features of the page that we could use to build an xpath to the element we were looking for, and also located elements by their css.&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Crunchbase_Accelerator_Founders&amp;diff=23832</id>
		<title>Crunchbase Accelerator Founders</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Crunchbase_Accelerator_Founders&amp;diff=23832"/>
		<updated>2018-07-30T20:38:09Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Crunchbase Accelerator Founders and LinkedIn Crawler&lt;br /&gt;
|Has owner=Grace Tan,&lt;br /&gt;
|Has start date=6/18/18&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Related Pages==&lt;br /&gt;
[[Crunchbase Data]]&lt;br /&gt;
[[Crunchbase Accelerator Equity]]&lt;br /&gt;
[[LinkedIn Crawler (Python)]]&lt;br /&gt;
&lt;br /&gt;
==Getting Data==&lt;br /&gt;
To get the founder UUIDs for each accelerator, insert the accelerator UUID (or its name in all lowercase, if it is one word) into this link:&lt;br /&gt;
  https://api.crunchbase.com/v3.1/organizations/ + UUID of organization + ?relationships=founders&amp;amp;user_key=662e263576fe3e4ea5991edfbcfb9883&lt;br /&gt;
&lt;br /&gt;
==Code==&lt;br /&gt;
===scrapefounders.py===&lt;br /&gt;
This code lives in &lt;br /&gt;
  Z:\crunchbase2\scrapefounders.py&lt;br /&gt;
&lt;br /&gt;
This program reads Accelerators and UUIDs.txt, found at Z:\crunchbase2\Accelerators and UUIDs.txt, extracts the accelerator UUIDs, and loads the information on each accelerator's founders from the Crunchbase API using the link above. It then returns a dictionary with accelerator UUIDs as keys and founder UUIDs as values.&lt;br /&gt;
&lt;br /&gt;
==Updated LinkedIn Crawler==&lt;br /&gt;
We will be using a LinkedIn Crawler to find information about accelerator founders. There is a previous project whose code is found in &lt;br /&gt;
  E:\McNair\Projects\LinkedIn Crawler\LinkedIn_Crawler\linkedin&lt;br /&gt;
&lt;br /&gt;
My code is found in the selenium computer at the root and at&lt;br /&gt;
  E:\McNair\Projects\LinkedIn Crawler 2018&lt;br /&gt;
&lt;br /&gt;
There are 6 python files.&lt;br /&gt;
&lt;br /&gt;
===New Test Account===&lt;br /&gt;
  Username: mcboatfaceboaty670@gmail.com&lt;br /&gt;
  Password: McNair2018&lt;br /&gt;
&lt;br /&gt;
Use the selenium computer on Rice Visitor wifi.&lt;br /&gt;
After logging in a couple of times, LinkedIn will get suspicious and ask you to confirm that you are not a robot using reCaptcha. I got around this by delaying the program by 3 minutes so that I had time to complete the reCaptcha test. However, sometimes reCaptcha loses connection and it forces you to continue the tests which can be frustrating. When this happens, I disconnect and reconnect from the wifi as well as switch between the test accounts.&lt;br /&gt;
&lt;br /&gt;
I never figured out how to stop reCaptcha from losing connection so I spent a lot of time completing reCaptcha tests.&lt;br /&gt;
&lt;br /&gt;
==Code==&lt;br /&gt;
===linkedin_crawler_main.py===&lt;br /&gt;
Main function for the crawler. It opens a browser window, visits the known LinkedIn URL of each founder, and writes the profile information to txt files. It then uses the search box to search for founder name + company, selects the href from the HTML, and opens the profile on a new page. If a founder cannot be found (there are no search results), the founder name and company are written to unavailable_profiles.txt (it might be called something else but should have &amp;quot;unavailable&amp;quot; in the name).&lt;br /&gt;
&lt;br /&gt;
===linked_in_crawler.py===&lt;br /&gt;
Contains the LinkedInCrawler class, which has functions for login, logout, search, etc. It relies heavily on locating the xpath of the element that we want on the webpage. I ran into a lot of difficulty with this because the xpaths in the code no longer exist in the new LinkedIn HTML, which includes dynamic ids. To get around this, I looked for other features of the page that we could use to build an xpath to the element we were looking for, and also located elements by their css.&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Grace_Tan_(Work_Log)&amp;diff=23831</id>
		<title>Grace Tan (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Grace_Tan_(Work_Log)&amp;diff=23831"/>
		<updated>2018-07-30T20:17:22Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Staff&lt;br /&gt;
|status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] [[Work Logs]] [[Grace Tan (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
2018-07-30: Matched employers with VC firms, funds, and startups. There were 40 matches with firms and funds, and 4 matches with startups. Coded these into 2 columns in Founder Experience table. Updated all wiki pages.&lt;br /&gt;
&lt;br /&gt;
2018-07-27: Reformatted the timing info data to separate out the companies to look like The File To Rule Them All. It is located in E:/McNair/Projects/Accelerators/Summer 2018/Formatted Timing Info.txt. Looked at the WhoIs Parser but Maxine said she figured it out so she will finish it. I will start with Founders Experience on Monday and I do not understand what minorcode lookup means.&lt;br /&gt;
&lt;br /&gt;
2018-07-26: Finished Demo Day Timing Info data. Talked with Ed and Hira about what to do for the last week. Cleaned up Timing Info data.&lt;br /&gt;
&lt;br /&gt;
2018-07-25: Converted the 608 pdfs to txt files using [[PDF to Text Converter]]. All of them converted to txt files but some txt files are empty or do not contain the content of the paper. I do not know of a way to fix it or clean up the txt files to get only the txt files that are actually academic papers. Worked on Demo Day Timing Info data.&lt;br /&gt;
&lt;br /&gt;
2018-07-24: Realized that some pdfs did not download properly because the link was not to an immediate pdf. Found all pdfs possible and came up with 608 total and 5 that I could not find pdfs for. Ran pdf_to_txt_bulk_PTLR.py on the 608 pdfs. &lt;br /&gt;
&lt;br /&gt;
2018-07-23: Found my missing pdfs. There were 7 that were not able to download through the script for some reason so I manually downloaded 6 of them and I could not find the pdf for 1 of them. Now I have 612 pdfs on file and am ready to start converting to txt tomorrow.&lt;br /&gt;
&lt;br /&gt;
2018-07-20: Reinstalled pdfminer and the test that GitHub provides works so it should be installed correctly. Ran all the pdf urls through pdfdownloader.py but came up short with 602/613 pdfs. I found 8 papers that did not successfully get downloaded but that still leaves 3 mystery papers and I'm not sure which ones they are. Did some sql to try to figure it out but hasn't worked yet.&lt;br /&gt;
&lt;br /&gt;
2018-07-19: Started on converting pdfs to txt files. I found pdf_to_txt_bulk_PTLR.py in E:/McNair/Software/Google_Scholar_Crawler. I copied this and moved my data to E:/McNair/Software/Patent_Thicket. When I tried to run the program, it was giving me an error when trying to import pdfminer because I originally tried running it in Z, so I installed pdfminer.six and it did not complain with python3.6. It then gave me a different error with a missing package which Wei said we shouldn't touch, so I moved everything back to E. It now runs but cannot convert any pdfs. I have no idea how to fix this.&lt;br /&gt;
&lt;br /&gt;
2018-07-18: Finished running the rest of the 100 pages. Took quite a long time because google scholar was catching me after 2-5 pages rather than 5-10. It helped to switch between the different wifi networks (rice visitor, rice owls, eduroam). Altogether resulted in 958 bibtex files and 613 pdfs from 1000 entries. There might be more entries but I'm not sure where to find them. I saved the data and code onto the rdp by connecting to it from the selenium box. &lt;br /&gt;
&lt;br /&gt;
2018-07-17: Ran google scholar crawler. When google scholar blocks me with a 403 error code, I exit the program and rerun it at the page that it last looked at by clicking on the correct page number before crawling. I finished running through 68/100 pages of google scholar.&lt;br /&gt;
&lt;br /&gt;
2018-07-16: Ran through 10 pages of google scholar first thing without a problem. Tried running through all 100 pages but kept on getting caught. Helped Augi with discrepancies in data and will try google scholar crawler again tomorrow.&lt;br /&gt;
&lt;br /&gt;
2018-07-13: Fixed problem where pdf urls were not saving to txt file. Created another txt file to save urls that are not pdfs. Didn't run into a single recaptcha all morning. Towards the end, it started catching me at the 7th query and forced the program to restart. For some reason, selecting the css element triggered google scholar to find me. I changed the css element tag for the &amp;quot;next&amp;quot; button to the path and I was able to get through the 4th page. It is still not able to click on the actual link but I'm not sure if that's supposed to do anything.&lt;br /&gt;
&lt;br /&gt;
2018-07-12: Figured out how to save BibTeX files to computer. Still had to do recaptcha tests. After giving it some time, I was able to run it completely once but I only got 49 BibTeX files and about half as many pdf links. When I tried to work on it further the recaptcha wasn't loading and gave me the error - &amp;quot;Cannot contact reCAPTCHA. Check your connection and try again.&amp;quot; I ended up moving to the selenium computer and spent the rest of the day converting the code to python3 and messing with regex because for some reason it wasn't matching the text correctly.&lt;br /&gt;
&lt;br /&gt;
2018-07-11: Started on Google Scholar Crawler for Patent Thicket Project. I'm not sure what the problem is. The code seems to work except that Google constantly blocks me and makes me do reCaptcha tests. I am also not sure if the crawler is saving any data to txt files and if so, where those files are located.&lt;br /&gt;
&lt;br /&gt;
2018-07-10: Finished LinkedIn Crawler! When no search results are found, the crawler now skips looking for the href and instead adds the founder and company name to a txt file. Spent way too much time doing reCaptcha tests and logging out and logging in again because firefox and the wifi were being slow. Cleaned up code and put it on the rdp as well as fixed the wiki page - [[Crunchbase Accelerator Founders]]&lt;br /&gt;
&lt;br /&gt;
2018-07-09: Continued working on the LinkedIn Crawler. I figured out how to get the web element of the first search result using path. You can achieve the same result by looking for the css element which probably would have been easier. I then used get_attribute('href') on the web element of the first result to get the url of the founder. Note that previously, there was another function to click on the name on the screen which would open another window with the profile but I found it easier to just extract the url. Next, I will run the (hopefully) working crawler on all the founder data. Update - ran into an error when there are no search results found.&lt;br /&gt;
&lt;br /&gt;
2018-06-29: Spent a large part of today clicking on road signs and cars to prove to LinkedIn that I am not a robot. Figured out how to find search box with css element instead of xpath. Now trying to get information from search results.&lt;br /&gt;
&lt;br /&gt;
2018-06-28: Tried to figure out the xpath for the search box but came up with no solutions. Halfway through, linkedin discovered that I was a bot so I moved to the selenium computer and used the Rice Visitor wifi. Linkedin still wouldn't let me in so we made another test account (see project page for details). It finally let me in at the end of the day but I set one of the delays to be 3-4 min so that I have time to do the tests that linkedin gives to ensure that you are not a bot. I still have no idea how to find search box xpath.&lt;br /&gt;
&lt;br /&gt;
2018-06-27: Took the dictionary of accelerator to founder UUIDs and formed a table with the UUIDs combined with names of founders, gender, and linkedin_url. The file is in Z:\crunchbase2\FounderAccInfo.txt . Started looking at linkedin crawler documentation and got the crawler to get information from profiles with known urls. It crashed when it tried searching up founder names and accelerators so will work on that tomorrow. &lt;br /&gt;
&lt;br /&gt;
2018-06-26: Ended up finding all founders manually. Then talked to Ed and figured out how to get the founder data off the crunchbase API with a link (see project page). Created a python script that goes through all the API pages for each accelerator API and returns a dictionary of accelerator UUIDs mapped to founder UUIDs. I found 209 founders on the API but 224 manually so we'll look at the discrepancy tomorrow. &lt;br /&gt;
&lt;br /&gt;
2018-06-25: We took the 157 Accelerator UUIDs we found and created a new table called AccAllInfo that includes all the attributes of the accelerator that we want from organizations.csv. Maxine and I then split into our respective projects. I tried joining people to the companies they are linked to in order to find the founders of each accelerator. I found about 90 matches but there are still a lot of holes since some accelerators have no founders and others have multiple founders. Still unsure of how to fix this.&lt;br /&gt;
&lt;br /&gt;
2018-06-22: Matched Connor's master list of accelerators with organizations.csv based on homepage_url and company_name. Found 90 that matched along with 76 blanks. Then tried matching with homepage_url or company_name and manually found about 30 more that had slight variations in url or name that we should keep. Using ILIKE we found ~25 more company UUIDs that match with accelerators on the list.&lt;br /&gt;
&lt;br /&gt;
2018-06-21: Downloaded all 17 v3.1 csv tables and updated LoadTables.sql to match our data. We did this by manually updating the name and size of the fields. To solve the problem of &amp;quot;&amp;quot; from yesterday, we used regular expressions to change the empty string to nothing (see project page). We then worked with Connor to start extracting the accelerators from the organizations in the Crunchbase data. We found a lot of null matches based on company_name and a few that have the same name but are actually different companies. Maybe try matching with homepage_url tomorrow.&lt;br /&gt;
&lt;br /&gt;
2018-06-20: Learned more SQL. Started working on [[Crunchbase Data]] project with Maxine. Old code contained 22 csv tables but new Crunchbase data only has 17 csv tables. We will be using the new Crunchbase API v3.1 (not v3) with only 17 csv tables as data. We then started updating the old SQL tables to align with the 17 tables we have. We ran into a problem where some date-type fields in the data contain &amp;quot;&amp;quot;, and SQL did not like that. Ed was helping us with this but we have not found a solution yet.&lt;br /&gt;
&lt;br /&gt;
2018-06-19: Set up monitors and continued learning SQL. We were also introduced to our projects. I will be continuing Christy's work on the Google Scholar Crawler as well as working with Maxine to update the Crunchbase data and then use that data to crawl Linkedin to find data on startup founders that go through accelerators.&lt;br /&gt;
&lt;br /&gt;
2018-06-18: Introduced to the wiki, connected to RDP, and learned SQL.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt;&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Grace_Tan_(Work_Log)&amp;diff=23812</id>
		<title>Grace Tan (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Grace_Tan_(Work_Log)&amp;diff=23812"/>
		<updated>2018-07-27T21:52:45Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Staff&lt;br /&gt;
|status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] [[Work Logs]] [[Grace Tan (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
2018-07-27: Reformatted the timing info data to separate out the companies to look like The File To Rule Them All. It is located in E:/McNair/Projects/Accelerators/Summer 2018/Formatted Timing Info.txt. Looked at the WhoIs Parser but Maxine said she figured it out so she will finish it. I will start with Founders Experience on Monday and I do not understand what minorcode lookup means.&lt;br /&gt;
&lt;br /&gt;
2018-07-26: Finished Demo Day Timing Info data. Talked with Ed and Hira about what to do for the last week. Cleaned up Timing Info data.&lt;br /&gt;
&lt;br /&gt;
2018-07-25: Converted the 608 pdfs to txt files using [[PDF to Text Converter]]. All of them converted to txt files but some txt files are empty or do not contain the content of the paper. I do not know of a way to fix it or clean up the txt files to get only the txt files that are actually academic papers. Worked on Demo Day Timing Info data.&lt;br /&gt;
&lt;br /&gt;
2018-07-24: Realized that some pdfs did not download properly because the link was not to an immediate pdf. Found all pdfs possible and came up with 608 total and 5 that I could not find pdfs for. Ran pdf_to_txt_bulk_PTLR.py on the 608 pdfs. &lt;br /&gt;
&lt;br /&gt;
2018-07-23: Found my missing pdfs. There were 7 that were not able to download through the script for some reason so I manually downloaded 6 of them and I could not find the pdf for 1 of them. Now I have 612 pdfs on file and am ready to start converting to txt tomorrow.&lt;br /&gt;
&lt;br /&gt;
2018-07-20: Reinstalled pdfminer and the test that GitHub provides works so it should be installed correctly. Ran all the pdf urls through pdfdownloader.py but came up short with 602/613 pdfs. I found 8 papers that did not successfully get downloaded but that still leaves 3 mystery papers and I'm not sure which ones they are. Did some sql to try to figure it out but hasn't worked yet.&lt;br /&gt;
&lt;br /&gt;
2018-07-19: Started on converting pdfs to txt files. I found pdf_to_txt_bulk_PTLR.py in E:/McNair/Software/Google_Scholar_Crawler. I copied this and moved my data to E:/McNair/Software/Patent_Thicket. When I tried to run the program, it was giving me an error when trying to import pdfminer because I originally tried running it in Z, so I installed pdfminer.six and it did not complain with python3.6. It then gave me a different error with a missing package which Wei said we shouldn't touch, so I moved everything back to E. It now runs but cannot convert any pdfs. I have no idea how to fix this.&lt;br /&gt;
&lt;br /&gt;
2018-07-18: Finished running the rest of the 100 pages. Took quite a long time because google scholar was catching me after 2-5 pages rather than 5-10. It helped to switch between the different wifi networks (rice visitor, rice owls, eduroam). Altogether resulted in 958 bibtex files and 613 pdfs from 1000 entries. There might be more entries but I'm not sure where to find them. I saved the data and code onto the rdp by connecting to it from the selenium box. &lt;br /&gt;
&lt;br /&gt;
2018-07-17: Ran google scholar crawler. When google scholar blocks me with a 403 error code, I exit the program and rerun it at the page that it last looked at by clicking on the correct page number before crawling. I finished running through 68/100 pages of google scholar.&lt;br /&gt;
&lt;br /&gt;
2018-07-16: Ran through 10 pages of google scholar first thing without a problem. Tried running through all 100 pages but kept on getting caught. Helped Augi with discrepancies in data and will try google scholar crawler again tomorrow.&lt;br /&gt;
&lt;br /&gt;
2018-07-13: Fixed problem where pdf urls were not saving to txt file. Created another txt file to save urls that are not pdfs. Didn't run into a single recaptcha all morning. Towards the end, it started catching me at the 7th query and forced the program to restart. For some reason, selecting the css element triggered google scholar to find me. I changed the css element tag for the &amp;quot;next&amp;quot; button to the path and I was able to get through the 4th page. It is still not able to click on the actual link but I'm not sure if that's supposed to do anything.&lt;br /&gt;
&lt;br /&gt;
2018-07-12: Figured out how to save BibTeX files to computer. Still had to do recaptcha tests. After giving it some time, I was able to run it completely once but I only got 49 BibTeX files and about half as many pdf links. When I tried to work on it further the recaptcha wasn't loading and gave me the error - &amp;quot;Cannot contact reCAPTCHA. Check your connection and try again.&amp;quot; I ended up moving to the selenium computer and spent the rest of the day converting the code to python3 and messing with regex because for some reason it wasn't matching the text correctly.&lt;br /&gt;
&lt;br /&gt;
2018-07-11: Started on Google Scholar Crawler for Patent Thicket Project. I'm not sure what the problem is. The code seems to work except that Google constantly blocks me and makes me do reCaptcha tests. I am also not sure if the crawler is saving any data to txt files and if so, where those files are located.&lt;br /&gt;
&lt;br /&gt;
2018-07-10: Finished LinkedIn Crawler! When no search results are found, the crawler now skips looking for the href and instead adds the founder and company name to a txt file. Spent way too much time doing reCaptcha tests and logging out and logging in again because firefox and the wifi were being slow. Cleaned up code and put it on the rdp as well as fixed the wiki page - [[Crunchbase Accelerator Founders]]&lt;br /&gt;
&lt;br /&gt;
2018-07-09: Continued working on the LinkedIn Crawler. I figured out how to get the web element of the first search result using path. You can achieve the same result by looking for the css element which probably would have been easier. I then used get_attribute('href') on the web element of the first result to get the url of the founder. Note that previously, there was another function to click on the name on the screen which would open another window with the profile but I found it easier to just extract the url. Next, I will run the (hopefully) working crawler on all the founder data. Update - ran into an error when there are no search results found.&lt;br /&gt;
&lt;br /&gt;
2018-06-29: Spent a large part of today clicking on road signs and cars to prove to LinkedIn that I am not a robot. Figured out how to find search box with css element instead of xpath. Now trying to get information from search results.&lt;br /&gt;
&lt;br /&gt;
2018-06-28: Tried to figure out the xpath for the search box but came up with no solutions. Halfway through, linkedin discovered that I was a bot so I moved to the selenium computer and used the Rice Visitor wifi. Linkedin still wouldn't let me in so we made another test account (see project page for details). It finally let me in at the end of the day but I set one of the delays to be 3-4 min so that I have time to do the tests that linkedin gives to ensure that you are not a bot. I still have no idea how to find search box xpath.&lt;br /&gt;
&lt;br /&gt;
2018-06-27: Took the dictionary of accelerator to founder UUIDs and formed a table with the UUIDs combined with names of founders, gender, and linkedin_url. The file is in Z:\crunchbase2\FounderAccInfo.txt . Started looking at linkedin crawler documentation and got the crawler to get information from profiles with known urls. It crashed when it tried searching up founder names and accelerators so will work on that tomorrow. &lt;br /&gt;
&lt;br /&gt;
2018-06-26: Ended up finding all founders manually. Then talked to Ed and figured out how to get the founder data off the crunchbase API with a link (see project page). Created a python script that goes through all the API pages for each accelerator API and returns a dictionary of accelerator UUIDs mapped to founder UUIDs. I found 209 founders on the API but 224 manually so we'll look at the discrepancy tomorrow. &lt;br /&gt;
&lt;br /&gt;
2018-06-25: We took the 157 Accelerator UUIDs we found and created a new table called AccAllInfo that includes all the attributes of the accelerator that we want from organizations.csv. Maxine and I then split into our respective projects. I tried joining people to the companies they are linked to in order to find the founders of each accelerator. I found about 90 matches but there are still a lot of holes since some accelerators have no founders and others have multiple founders. Still unsure of how to fix this.&lt;br /&gt;
&lt;br /&gt;
2018-06-22: Matched Connor's master list of accelerators with organizations.csv based on homepage_url and company_name. Found 90 that matched along with 76 blanks. Then tried matching with homepage_url or company_name and manually found about 30 more that had slight variations in url or name that we should keep. Using ILIKE we found ~25 more company UUIDs that match with accelerators on the list.&lt;br /&gt;
&lt;br /&gt;
2018-06-21: Downloaded all 17 v3.1 csv tables and updated LoadTables.sql to match our data. We did this by manually updating the name and size of the fields. To solve the problem of &amp;quot;&amp;quot; from yesterday, we used regular expressions to change the empty string to nothing (see project page). We then worked with Connor to start extracting the accelerators from the organizations in the Crunchbase data. We found a lot of null matches based on company_name and a few that have the same name but are actually different companies. Maybe try matching with homepage_url tomorrow.&lt;br /&gt;
&lt;br /&gt;
2018-06-20: Learned more SQL. Started working on [[Crunchbase Data]] project with Maxine. Old code contained 22 csv tables but new Crunchbase data only has 17 csv tables. We will be using the new Crunchbase API v3.1 (not v3) with only 17 csv tables as data. We then started updating the old SQL tables to align with the 17 tables we have. We ran into a problem where some date-type fields in the data contain &amp;quot;&amp;quot;, and SQL did not like that. Ed was helping us with this but we have not found a solution yet.&lt;br /&gt;
&lt;br /&gt;
2018-06-19: Set up monitors and continued learning SQL. We were also introduced to our projects. I will be continuing Christy's work on the Google Scholar Crawler as well as working with Maxine to update the Crunchbase data and then use that data to crawl Linkedin to find data on startup founders that go through accelerators.&lt;br /&gt;
&lt;br /&gt;
2018-06-18: Introduced to the wiki, connected to RDP, and learned SQL.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt;&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Seed_Accelerator_Data_Assembly&amp;diff=23810</id>
		<title>Seed Accelerator Data Assembly</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Seed_Accelerator_Data_Assembly&amp;diff=23810"/>
		<updated>2018-07-27T20:40:09Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Useful pages==&lt;br /&gt;
&lt;br /&gt;
*[[U.S. Seed Accelerators]]&lt;br /&gt;
**[[Accelerator Seed List (Data)]]&lt;br /&gt;
*[[Accelerator Demo Day]]&lt;br /&gt;
**[[Mechanical Turk (Tool)]]&lt;br /&gt;
*[[Crunchbase Accelerator Founders]]&lt;br /&gt;
**[[Crunchbase Data#Collecting Company Information]]&lt;br /&gt;
**[[Merging Existing Data with Crunchbase]]&lt;br /&gt;
*[[Whois Parser]]&lt;br /&gt;
*[[URL Finder (Tool)]]&lt;br /&gt;
*[[Industry Classifier]]&lt;br /&gt;
*[[Industry classifier yang]]&lt;br /&gt;
*[[VentureXpert Data]]&lt;br /&gt;
&lt;br /&gt;
Please add (or subtract) other relevant (or irrelevant) pages!&lt;br /&gt;
&lt;br /&gt;
==Database specification==&lt;br /&gt;
&lt;br /&gt;
===Preamble===&lt;br /&gt;
&lt;br /&gt;
We need to get the data into approximately 3NF to prevent errors and make it more useable.&lt;br /&gt;
&lt;br /&gt;
Inputs:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\The File To Rule Them All.xlsx&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit#gid=0&lt;br /&gt;
&lt;br /&gt;
Recent work on the above from Connor:&lt;br /&gt;
*made the Demo Day Timing Google sheet as clean as possible (fixed dates, removed duplicates, created season column)&lt;br /&gt;
*recoded the employee count&lt;br /&gt;
*normalized the investment amount&lt;br /&gt;
&lt;br /&gt;
Connor's next to do: &lt;br /&gt;
*Founders Experience: code job title&lt;br /&gt;
*Founders Education: remove unknowns, code degree and code major&lt;br /&gt;
*Fix multiple campuses and cohorts (see below). &lt;br /&gt;
&lt;br /&gt;
This is documented on [[U.S._Seed_Accelerators#Update_for_Hira]].&lt;br /&gt;
&lt;br /&gt;
The current database work is in '''vcdb2'''. The code to build the relevant tables in vcdb2 is in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\LoadAcceleratorDataV2.sql&lt;br /&gt;
 &lt;br /&gt;
Note that AddCBData.sql can be run on '''crunchbase2''' to get the relevant crunchbase data for import into vcdb2 (or elsewhere).&lt;br /&gt;
&lt;br /&gt;
However, we should build a new database for this project:&lt;br /&gt;
 createdb accelerator&lt;br /&gt;
&lt;br /&gt;
As well as a new folder (Z:\accelerator) for the data.&lt;br /&gt;
&lt;br /&gt;
The work in vcdb2 is essentially to build the data that goes into the File To Rule Them All.xlsx. We should start from there!&lt;br /&gt;
&lt;br /&gt;
===Suggested Spec===&lt;br /&gt;
&lt;br /&gt;
We need to address the issue with multiple campuses and cohorts. This will require loading and manipulation of the data in SQL (Hira) as well as some manual fixes to the data (Connor).&lt;br /&gt;
&lt;br /&gt;
Accelerator Table (fieldname	colname/DISCARD, from sheet &amp;quot;Accelerators Final&amp;quot;)&lt;br /&gt;
*acceleratorname (Primary Key)	Accelerators&lt;br /&gt;
*url	homepage_url&lt;br /&gt;
*cohorturl	cohort page URL&lt;br /&gt;
*cohortlisting	Break out cohorts on the website? (Y/N)&lt;br /&gt;
*type	Type&lt;br /&gt;
*alive	Alive&lt;br /&gt;
*typenote	Type notes&lt;br /&gt;
*weeks	Weeks&lt;br /&gt;
*durationnotes	Duration Notes&lt;br /&gt;
*city	city	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*state	state/region	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*foundingdate	Creation Date&lt;br /&gt;
*terms	Terms of Joining Accelerator (non-equity pay/equity/free)&lt;br /&gt;
*equity	Equity?&lt;br /&gt;
*equityamountdesc	Equity Amount&lt;br /&gt;
*equityamount	Equity Amount Normalized (only 1s, range is averaged)&lt;br /&gt;
*investmentamountdesc	Investment Amount &lt;br /&gt;
*investmentamount	Investment Amount Normalized (Midpoint) &lt;br /&gt;
*investmentnotes	Investment Notes&lt;br /&gt;
*discard industry	DISCARD&lt;br /&gt;
*industry	Industry	&lt;br /&gt;
*specgenindu	Specified/General&lt;br /&gt;
*address	Address	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*subtype	DISCARD&lt;br /&gt;
*nonprofit	nonprofit?&lt;br /&gt;
*studentfocus	designed for students (Y/N)&lt;br /&gt;
*multicampus	Multiple campuses? (Y/N)&lt;br /&gt;
*software tech (Y/N)	DISCARD&lt;br /&gt;
*stagepref	What stage do they look for in cohort companies (SEL)&lt;br /&gt;
*cbconame	Name in Crunchbase&lt;br /&gt;
*uuid	UUID&lt;br /&gt;
&lt;br /&gt;
Note that we will assume that duration and terms are common across all cohorts run by the same accelerator. We might need to revisit this assumption.&lt;br /&gt;
&lt;br /&gt;
CohortCompany Table (fieldname	colname/DISCARD, from sheet &amp;quot;Cohorts Final&amp;quot;)&lt;br /&gt;
*conamestd (primary key)&lt;br /&gt;
*coname&lt;br /&gt;
*conameorg&lt;br /&gt;
*colocation&lt;br /&gt;
*city&lt;br /&gt;
*state_code&lt;br /&gt;
*country_code&lt;br /&gt;
*address&lt;br /&gt;
*codescription&lt;br /&gt;
*short_desc&lt;br /&gt;
*long_desc&lt;br /&gt;
*cosectors&lt;br /&gt;
*costatus&lt;br /&gt;
*cofundingstage&lt;br /&gt;
*courl&lt;br /&gt;
*uuid&lt;br /&gt;
*category_list&lt;br /&gt;
*category_group_list&lt;br /&gt;
*founded_on&lt;br /&gt;
*employee_count&lt;br /&gt;
*emp_count_scale&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*gotvc&lt;br /&gt;
&lt;br /&gt;
We then need to deduplicate this table and make sure that conamestd is a valid primary key.&lt;br /&gt;
&lt;br /&gt;
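A minimal sketch of the primary-key check, assuming the CohortCompany rows have already been read into Python dictionaries keyed by column name. The function name duplicate_keys is hypothetical, and a real check would more likely be a GROUP BY query in the database itself.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import Counter


def duplicate_keys(rows, key="conamestd"):
    """Return key values that are blank or appear more than once."""
    # Missing or None values are treated as blank (an assumption).
    counts = Counter((row.get(key) or "").strip() for row in rows)
    return sorted(value for value, n in counts.items() if value == "" or n != 1)


if __name__ == "__main__":
    sample = [
        {"conamestd": "ACME"},
        {"conamestd": "ACME"},
        {"conamestd": "BETA"},
        {"conamestd": ""},
    ]
    print(duplicate_keys(sample))
```
&lt;br /&gt;
An empty result would suggest that conamestd can serve as the primary key; any value returned needs deduplication or a manual fix first.&lt;br /&gt;
&lt;br /&gt;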
Note that we won't be taking the following from &amp;quot;Cohorts Final&amp;quot; for this table:&lt;br /&gt;
*year&lt;br /&gt;
*accelerator&lt;br /&gt;
*cohort&lt;br /&gt;
*quarter&lt;br /&gt;
*acclocation&lt;br /&gt;
*accperks	DISCARD&lt;br /&gt;
*cofounder	DISCARD&lt;br /&gt;
&lt;br /&gt;
CohortParticipation&lt;br /&gt;
*Cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
*accelerator	(foreign key)&lt;br /&gt;
*conamestd (foreign key)&lt;br /&gt;
&lt;br /&gt;
===Going further===&lt;br /&gt;
&lt;br /&gt;
We should likely rebuild the cohort variable to make it a sole &amp;quot;primary key&amp;quot; for a cohort. This would mean turning each entry of cohort into something unique like: TechStars Boulder Fall 2017 or 1440 Cohort 1, so that this key could look up the year of the cohort, its quarter, etc. We could then break CohortParticipation into two tables:&lt;br /&gt;
&lt;br /&gt;
CohortParticipation&lt;br /&gt;
*Cohort&lt;br /&gt;
*accelerator	(foreign key)&lt;br /&gt;
*conamestd (foreign key)&lt;br /&gt;
&lt;br /&gt;
Cohort&lt;br /&gt;
*Cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
&lt;br /&gt;
We could then add campus and a separate table for campuses:&lt;br /&gt;
&lt;br /&gt;
Campus Table&lt;br /&gt;
*CampusName (e.g., Techstars Boulder)&lt;br /&gt;
*Accelerator (foreign key)&lt;br /&gt;
*Address&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Zip&lt;br /&gt;
*Description &lt;br /&gt;
&lt;br /&gt;
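The split described above could look roughly like the sketch below. It uses Python's built-in sqlite3 module as a stand-in for the project database, trims the Accelerator and CohortCompany tables down to their keys, and makes several assumptions that are not in the spec: the column types, the foreign key from CohortParticipation to Cohort, and the composite primary key on CohortParticipation.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

# Column types and the constraints marked "assumption" are illustrative only.
SCHEMA = """
CREATE TABLE accelerator (acceleratorname TEXT PRIMARY KEY);
CREATE TABLE cohortcompany (conamestd TEXT PRIMARY KEY);
CREATE TABLE cohort (
    cohort  TEXT PRIMARY KEY,  -- e.g. 'TechStars Boulder Fall 2017'
    year    INTEGER,
    quarter TEXT
);
CREATE TABLE cohortparticipation (
    cohort      TEXT REFERENCES cohort (cohort),  -- assumption
    accelerator TEXT REFERENCES accelerator (acceleratorname),
    conamestd   TEXT REFERENCES cohortcompany (conamestd),
    PRIMARY KEY (cohort, conamestd)  -- assumption
);
CREATE TABLE campus (
    campusname  TEXT PRIMARY KEY,
    accelerator TEXT REFERENCES accelerator (acceleratorname),
    address TEXT, city TEXT, state TEXT, zip TEXT, description TEXT
);
"""


def build(conn):
    """Create the sketched tables on an open connection."""
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
    conn.executescript(SCHEMA)
    return conn


if __name__ == "__main__":
    db = build(sqlite3.connect(":memory:"))
    db.execute("INSERT INTO accelerator VALUES ('TechStars')")
    db.execute("INSERT INTO cohort VALUES ('TechStars Boulder Fall 2017', 2017, 'Fall')")
    print(db.execute("SELECT cohort, year FROM cohort").fetchall())
```
&lt;br /&gt;
With the cohort name as the key, year and quarter are stored once per cohort rather than once per participating company.&lt;br /&gt;
&lt;br /&gt;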
===Founders Information===&lt;br /&gt;
&lt;br /&gt;
The three founder sheets turn into three tables nicely. We don't need to renormalize them for now, just fix up their variables and do some matching on employer.&lt;br /&gt;
&lt;br /&gt;
Founders:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*Current Job&lt;br /&gt;
*Current Location&lt;br /&gt;
&lt;br /&gt;
FoundersExperience:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*Employer&lt;br /&gt;
*Job Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location&lt;br /&gt;
*Extra Description&lt;br /&gt;
&lt;br /&gt;
FoundersEducation:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==To do/For consideration==&lt;br /&gt;
&lt;br /&gt;
Minh:&lt;br /&gt;
*Create a format for collecting timing data&lt;br /&gt;
*Put the timing job on Turk&lt;br /&gt;
&lt;br /&gt;
Connor:&lt;br /&gt;
*Try to find the missing timings that we really need (new process)&lt;br /&gt;
*Col W should be headquarter address&lt;br /&gt;
*What stage-  Clean up?&lt;br /&gt;
&lt;br /&gt;
Maxine: &lt;br /&gt;
*Build the google URL finder&lt;br /&gt;
*Industry classification from description&lt;br /&gt;
&lt;br /&gt;
Grace:&lt;br /&gt;
*Process and join in new timing data - new data located in Z:/accelerator/Formatted Timing Info.txt&lt;br /&gt;
*Make a category group to minorcode lookup&lt;br /&gt;
*Run WHOIS crawler on all valid URLs (not facebook pages, etc.)&lt;br /&gt;
*Founders Experience: Match Employers to VC funds/firms, VC backed startups (requires data from Augi)&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Seed_Accelerator_Data_Assembly&amp;diff=23809</id>
		<title>Seed Accelerator Data Assembly</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Seed_Accelerator_Data_Assembly&amp;diff=23809"/>
		<updated>2018-07-27T20:36:01Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Useful pages==&lt;br /&gt;
&lt;br /&gt;
*[[U.S. Seed Accelerators]]&lt;br /&gt;
**[[Accelerator Seed List (Data)]]&lt;br /&gt;
*[[Accelerator Demo Day]]&lt;br /&gt;
**[[Mechanical Turk (Tool)]]&lt;br /&gt;
*[[Crunchbase Accelerator Founders]]&lt;br /&gt;
**[[Crunchbase Data#Collecting Company Information]]&lt;br /&gt;
**[[Merging Existing Data with Crunchbase]]&lt;br /&gt;
*[[Whois Parser]]&lt;br /&gt;
*[[URL Finder (Tool)]]&lt;br /&gt;
*[[Industry Classifier]]&lt;br /&gt;
*[[Industry classifier yang]]&lt;br /&gt;
*[[VentureXpert Data]]&lt;br /&gt;
&lt;br /&gt;
Please add (or subtract) other relevant (or irrelevant) pages!&lt;br /&gt;
&lt;br /&gt;
==Database specification==&lt;br /&gt;
&lt;br /&gt;
===Preamble===&lt;br /&gt;
&lt;br /&gt;
We need to get the data into approximately 3NF to prevent errors and make it more useable.&lt;br /&gt;
&lt;br /&gt;
Inputs:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\The File To Rule Them All.xlsx&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit#gid=0&lt;br /&gt;
&lt;br /&gt;
Recent work on the above from Connor:&lt;br /&gt;
*made the Demo Day Timing Google sheet as clean as possible (fixed dates, removed duplicates, created season column)&lt;br /&gt;
*recoded the employee count&lt;br /&gt;
*normalized the investment amount&lt;br /&gt;
&lt;br /&gt;
Connor's next to do: &lt;br /&gt;
*Founders Experience: code job title&lt;br /&gt;
*Founders Education: remove unknowns, code degree and code major&lt;br /&gt;
*Fix multiple campus and cohorts (see below). &lt;br /&gt;
&lt;br /&gt;
This is documented on [[U.S._Seed_Accelerators#Update_for_Hira]].&lt;br /&gt;
&lt;br /&gt;
The current database work is in '''vcdb2'''. The code to build the relevant tables in vcdb2 is in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\LoadAcceleratorDataV2.sql&lt;br /&gt;
 &lt;br /&gt;
Note that AddCBData.sql can be run on '''crunchbase2''' to get the relevant crunchbase data for import into vcdb2 (or elsewhere).&lt;br /&gt;
&lt;br /&gt;
However, we should build a new database for this project:&lt;br /&gt;
 createdb accelerator&lt;br /&gt;
&lt;br /&gt;
As well as a new folder (Z:\accelerator) for the data.&lt;br /&gt;
&lt;br /&gt;
The work in vcdb2 is essentially to build the data that goes into the File To Rule Them All.xlsx. We should start from there!&lt;br /&gt;
&lt;br /&gt;
===Suggested Spec===&lt;br /&gt;
&lt;br /&gt;
We need to address the issue with multiple campuses and cohorts. This will require loading and manipulation of the data in SQL (Hira) as well as some manual fixes to the data (Connor).&lt;br /&gt;
&lt;br /&gt;
Accelerator Table (fieldname	colname/DISCARD, from sheet &amp;quot;Accelerators Final&amp;quot;)&lt;br /&gt;
*acceleratorname (Primary Key)	Accelerators&lt;br /&gt;
*url	homepage_url&lt;br /&gt;
*cohorturl	cohort page URL&lt;br /&gt;
*cohortlisting	Break out cohorts on the website? (Y/N)&lt;br /&gt;
*type	Type&lt;br /&gt;
*alive	Alive&lt;br /&gt;
*typenote	Type notes&lt;br /&gt;
*weeks	Weeks&lt;br /&gt;
*durationnotes	Duration Notes&lt;br /&gt;
*city	city	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*state	state/region	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*foundingdate	Creation Date&lt;br /&gt;
*terms	Terms of Joining Accelerator (non-equity pay/equity/free)&lt;br /&gt;
*equity	Equity?&lt;br /&gt;
*equityamountdesc	Equity Amount&lt;br /&gt;
*equityamount	Equity Amount Normalized (only 1s, range is averaged)&lt;br /&gt;
*investmentamountdesc	Investment Amount &lt;br /&gt;
*investmentamount	Investment Amount Normalized (Midpoint) &lt;br /&gt;
*investmentnotes	Investment Notes&lt;br /&gt;
*discard industry	DISCARD&lt;br /&gt;
*industry	Industry	&lt;br /&gt;
*specgenindu	Specified/General&lt;br /&gt;
*address	Address	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*subtype	DISCARD&lt;br /&gt;
*nonprofit	nonprofit?&lt;br /&gt;
*studentfocus	designed for students (Y/N)&lt;br /&gt;
*multicampus	Multiple campuses? (Y/N)&lt;br /&gt;
*software tech (Y/N)	DISCARD&lt;br /&gt;
*stagepref	What stage do they look for in cohort companies (SEL)&lt;br /&gt;
*cbconame	Name in Crunchbase&lt;br /&gt;
*uuid	UUID&lt;br /&gt;
&lt;br /&gt;
Note that we will assume that duration and terms are common across all cohorts run by the same accelerator. We might need to revisit this assumption.&lt;br /&gt;
&lt;br /&gt;
CohortCompany Table (fieldname	colname/DISCARD, from sheet &amp;quot;Cohorts Final&amp;quot;)&lt;br /&gt;
*conamestd (primary key)&lt;br /&gt;
*coname&lt;br /&gt;
*conameorg&lt;br /&gt;
*colocation&lt;br /&gt;
*city&lt;br /&gt;
*state_code&lt;br /&gt;
*country_code&lt;br /&gt;
*address&lt;br /&gt;
*codescription&lt;br /&gt;
*short_desc&lt;br /&gt;
*long_desc&lt;br /&gt;
*cosectors&lt;br /&gt;
*costatus&lt;br /&gt;
*cofundingstage&lt;br /&gt;
*courl&lt;br /&gt;
*uuid&lt;br /&gt;
*category_list&lt;br /&gt;
*category_group_list&lt;br /&gt;
*founded_on&lt;br /&gt;
*employee_count&lt;br /&gt;
*emp_count_scale&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*gotvc&lt;br /&gt;
&lt;br /&gt;
We then need to deduplicate this table and make sure that conamestd is a valid primary key.&lt;br /&gt;
&lt;br /&gt;
Note that we won't be taking the following from &amp;quot;Cohorts Final&amp;quot; for this table:&lt;br /&gt;
*year&lt;br /&gt;
*accelerator&lt;br /&gt;
*cohort&lt;br /&gt;
*quarter&lt;br /&gt;
*acclocation&lt;br /&gt;
*accperks	DISCARD&lt;br /&gt;
*cofounder	DISCARD&lt;br /&gt;
&lt;br /&gt;
CohortParticipation&lt;br /&gt;
*Cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
*accelerator	(foreign key)&lt;br /&gt;
*conamestd (foreign key)&lt;br /&gt;
&lt;br /&gt;
===Going further===&lt;br /&gt;
&lt;br /&gt;
We should likely rebuild the cohort variable to make it a sole &amp;quot;primary key&amp;quot; for a cohort. This would mean turning each entry of cohort into something unique like: TechStars Boulder Fall 2017 or 1440 Cohort 1, so that this key could look up the year of the cohort, its quarter, etc. We could then break CohortParticipation into two tables:&lt;br /&gt;
&lt;br /&gt;
CohortParticipation&lt;br /&gt;
*Cohort&lt;br /&gt;
*accelerator	(foreign key)&lt;br /&gt;
*conamestd (foreign key)&lt;br /&gt;
&lt;br /&gt;
Cohort&lt;br /&gt;
*Cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
&lt;br /&gt;
We could then add campus and a separate table for campuses:&lt;br /&gt;
&lt;br /&gt;
Campus Table&lt;br /&gt;
*CampusName (e.g., Techstars Boulder)&lt;br /&gt;
*Accelerator (foreign key)&lt;br /&gt;
*Address&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Zip&lt;br /&gt;
*Description &lt;br /&gt;
&lt;br /&gt;
===Founders Information===&lt;br /&gt;
&lt;br /&gt;
The three founder sheets turn into three tables nicely. We don't need to renormalize them for now, just fix up their variables and do some matching on employer.&lt;br /&gt;
&lt;br /&gt;
Founders:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*Current Job&lt;br /&gt;
*Current Location&lt;br /&gt;
&lt;br /&gt;
FoundersExperience:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*Employer&lt;br /&gt;
*Job Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location&lt;br /&gt;
*Extra Description&lt;br /&gt;
&lt;br /&gt;
FoundersEducation:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==To do/For consideration==&lt;br /&gt;
&lt;br /&gt;
Minh:&lt;br /&gt;
*Create a format for collecting timing data&lt;br /&gt;
*Put the timing job on Turk&lt;br /&gt;
&lt;br /&gt;
Connor:&lt;br /&gt;
*Try to find the missing timings that we really need (new process)&lt;br /&gt;
*Col W should be headquarter address&lt;br /&gt;
*What stage-  Clean up?&lt;br /&gt;
&lt;br /&gt;
Maxine: &lt;br /&gt;
*Build the google URL finder&lt;br /&gt;
*Industry classification from description&lt;br /&gt;
&lt;br /&gt;
Grace:&lt;br /&gt;
*Process and Join in new timing data&lt;br /&gt;
*Make a category group to minorcode lookup&lt;br /&gt;
*Run WHOIS crawler on all valid URLs (not facebook pages, etc.)&lt;br /&gt;
*Founders Experience: Match Employers to VC funds/firms, VC backed startups (requires data from Augi)&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Grace_Tan_(Work_Log)&amp;diff=23787</id>
		<title>Grace Tan (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Grace_Tan_(Work_Log)&amp;diff=23787"/>
		<updated>2018-07-26T21:55:17Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Staff&lt;br /&gt;
|status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] [[Work Logs]] [[Grace Tan (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
2018-07-26: Finished Demo Day Timing Info data. Talked with Ed and Hira about what to do for the last week. Cleaned up Timing Info data.&lt;br /&gt;
&lt;br /&gt;
2018-07-25: Converted the 608 pdfs to txt files using [[PDF to Text Converter]]. All of them converted to txt files but some txt files are empty or do not contain the content of the paper. I do not know of a way to fix it or clean up the txt files to get only the txt files that are actually academic papers. Worked on Demo Day Timing Info data.&lt;br /&gt;
&lt;br /&gt;
2018-07-24: Realized that some pdfs did not download properly because the link was not to an immediate pdf. Found all pdfs possible and came up with 608 total and 5 that I could not find pdfs for. Ran pdf_to_txt_bulk_PTLR.py on the 608 pdfs. &lt;br /&gt;
&lt;br /&gt;
2018-07-23: Found my missing pdfs. There were 7 that were not able to download through the script for some reason so I manually downloaded 6 of them and I could not find the pdf for 1 of them. Now I have 612 pdfs on file and am ready to start converting to txt tomorrow.&lt;br /&gt;
&lt;br /&gt;
2018-07-20: Reinstalled pdfminer and the test that GitHub provides works so it should be installed correctly. Ran all the pdf urls through pdfdownloader.py but came up short with 602/613 pdfs. I found 8 papers that did not successfully get downloaded but that still leaves 3 mystery papers and I'm not sure which ones they are. Did some sql to try to figure it out but hasn't worked yet.&lt;br /&gt;
&lt;br /&gt;
2018-07-19: Started on converting pdfs to txt files. I found pdf_to_txt_bulk_PTLR.py in E:/McNair/Software/Google_Scholar_Crawler. I copied this and moved my data to E:/McNair/Software/Patent_Thicket. When I tried to run the program, it was giving me an error when trying to import pdfminer because I originally tried running it in Z, so I installed pdfminer.six and it did not complain with python3.6. It then gave me a different error about a missing package, which Wei said we shouldn't touch, so I moved everything back to E. It now runs but cannot convert any pdfs. I have no idea how to fix this.&lt;br /&gt;
&lt;br /&gt;
2018-07-18: Finished running the rest of the 100 pages. Took quite a long time because google scholar was catching me after 2-5 pages rather than 5-10. It helped to switch between the different wifi networks (Rice Visitor, Rice Owls, eduroam). Altogether resulted in 958 bibtex files and 613 pdfs from 1000 entries. There might be more entries but I'm not sure where to find them. I saved the data and code onto the rdp by connecting to it from the selenium box. &lt;br /&gt;
&lt;br /&gt;
2018-07-17: Ran google scholar crawler. When google scholar blocks me with a 403 error code, I exit the program and rerun it at the page that it last looked at by clicking on the correct page number before crawling. I finished running through 68/100 pages of google scholar.&lt;br /&gt;
&lt;br /&gt;
2018-07-16: Ran through 10 pages of google scholar first thing without a problem. Tried running through all 100 pages but kept on getting caught. Helped Augi with discrepancies in data and will try google scholar crawler again tomorrow.&lt;br /&gt;
&lt;br /&gt;
2018-07-13: Fixed problem where pdf urls were not saving to txt file. Created another txt file to save urls that are not pdfs. Didn't run into a single recaptcha all morning. Towards the end, it started catching me at the 7th query and forced the program to restart. For some reason, selecting the css element triggered google scholar to find me. I changed the css element tag for the &amp;quot;next&amp;quot; button to the path and I was able to get through the 4th page. It is still not able to click on the actual link but I'm not sure if that's supposed to do anything.&lt;br /&gt;
&lt;br /&gt;
2018-07-12: Figured out how to save BibTeX files to computer. Still had to do recaptcha tests. After giving it some time, I was able to run it completely once but I only got 49 BibTeX files and about half as many pdf links. When I tried to work on it further the recaptcha wasn't loading and gave me the error - &amp;quot;Cannot contact reCAPTCHA. Check your connection and try again.&amp;quot; I ended up moving to the selenium computer and spent the rest of the day converting the code to python3 and messing with regex because for some reason it wasn't matching the text correctly.&lt;br /&gt;
&lt;br /&gt;
2018-07-11: Started on Google Scholar Crawler for Patent Thicket Project. I'm not sure what the problem is. The code seems to work except that Google constantly blocks me with reCAPTCHA tests. I am also not sure if the crawler is saving any data to txt files and if so, where those files are located.&lt;br /&gt;
&lt;br /&gt;
2018-07-10: Finished LinkedIn Crawler! When no search results were found, I did not find the href and instead added the founder and company name to a txt file. Spent way too much time doing reCAPTCHA tests and logging out and logging in again because Firefox and the wifi were being slow. Cleaned up code and put it on the rdp as well as fixed the wiki page - [[Crunchbase Accelerator Founders]]&lt;br /&gt;
&lt;br /&gt;
2018-07-09: Continued working on the LinkedIn Crawler. I figured out how to get the web element of the first search result using path. You can achieve the same result by looking for the css element which probably would have been easier. I then used get_attribute('href') to find the href in the web element of the first result to get the url of the founder. Note that previously, there was another function to click on the name on the screen which would open another window with the profile but I found it easier to just extract the url. Next, I will run the (hopefully) working crawler on all the founder data. Update - ran into error when there are no search results found.&lt;br /&gt;
&lt;br /&gt;
2018-06-29: Spent a large part of today clicking on road signs and cars to prove to LinkedIn that I am not a robot. Figured out how to find search box with css element instead of xpath. Now trying to get information from search results.&lt;br /&gt;
&lt;br /&gt;
2018-06-28: Tried to figure out the xpath for the search box but came up with no solutions. Halfway through, linkedin discovered that I was a bot so I moved to the selenium computer and used the Rice Visitor wifi. Linkedin still wouldn't let me in so we made another test account (see project page for details). It finally let me in at the end of the day but I set one of the delays to be 3-4 min so that I have time to do the tests that linkedin gives to ensure that you are not a bot. I still have no idea how to find search box xpath.&lt;br /&gt;
&lt;br /&gt;
2018-06-27: Took the dictionary of accelerator to founder UUIDs and formed a table with the UUIDs combined with names of founders, gender, and linkedin_url. The file is in Z:\crunchbase2\FounderAccInfo.txt . Started looking at linkedin crawler documentation and got the crawler to get information from profiles with known urls. It crashed when it tried searching up founder names and accelerators so will work on that tomorrow. &lt;br /&gt;
&lt;br /&gt;
2018-06-26: Ended up finding all founders manually. Then talked to Ed and figured out how to get the founder data off the crunchbase API with a link (see project page). Created a python script that goes through all the API pages for each accelerator API and returns a dictionary of accelerator UUIDs mapped to founder UUIDs. I found 209 founders on the API but 224 manually so we'll look at the discrepancy tomorrow. &lt;br /&gt;
&lt;br /&gt;
2018-06-25: We took the 157 Accelerator UUIDs we found and created a new table that includes all the attributes of the accelerator that we want from organizations.csv called AccAllInfo. Maxine and I then split into our respective projects. I tried joining people to the companies they are linked to in order to find the founders of each accelerator. I found about 90 matches but there are still a lot of gaps since some accelerators have no founders and others have multiple founders. Still unsure of how to fix this.&lt;br /&gt;
&lt;br /&gt;
2018-06-22: Matched Connor's master list of accelerators with organizations.csv based on homepage_url and company_name. Found 90 that matched along with 76 blanks. Then tried matching with homepage_url or company_name and manually found about 30 more that had slight variations in url or name that we should keep. Using ILIKE we found ~25 more company UUIDs that match with accelerators on the list.&lt;br /&gt;
&lt;br /&gt;
2018-06-21: Downloaded all 17 v3.1 csv tables and updated LoadTables.sql to match our data. We did this by manually updating the name and size of the fields. To solve the problem of &amp;quot;&amp;quot; from yesterday, we used regular expressions to change the empty string to nothing (see project page). We then worked with Connor to start extracting the accelerators from the organizations in the Crunchbase data. We found a lot of null matches based on company_name and a few that have the same name but are actually different companies. Maybe try matching with homepage_url tomorrow.&lt;br /&gt;
&lt;br /&gt;
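The empty-string fix from 2018-06-21 could be sketched as below. The regular expression actually used is on the project page and is not reproduced here, so this pattern is only an assumption about what such a fix might look like on comma-separated lines; the name blank_empty_strings is hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
import re

# Matches a field that is exactly "" (at the start of the line or after a
# comma, and followed by a comma or the end of the line).
EMPTY_FIELD = re.compile(r'(^|,)""(?=,|$)')


def blank_empty_strings(line):
    """Replace quoted empty fields with truly empty ones."""
    return EMPTY_FIELD.sub(r"\1", line)


if __name__ == "__main__":
    print(blank_empty_strings('acme,"",2018-06-21'))
```
&lt;br /&gt;
This simple pattern does not handle a quoted field whose own text contains a comma followed by two quote characters, so it is worth spot checking the output before loading.&lt;br /&gt;
&lt;br /&gt;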
2018-06-20: Learned more SQL. Started working on [[Crunchbase Data]] project with Maxine. Old code contained 22 csv tables but new Crunchbase data only has 17 csv tables. We will be using the new Crunchbase API v3.1 (not v3) with only 17 csv tables as data. We then started updating the old SQL tables to align with the 17 tables we have. We ran into a problem where some date-type fields in the data contain &amp;quot;&amp;quot; and SQL did not like that. Ed was helping us with this but we have not found a solution yet.&lt;br /&gt;
&lt;br /&gt;
2018-06-19: Set up monitors and continued learning SQL. We were also introduced to our projects. I will be continuing Christy's work on the Google Scholar Crawler as well as working with Maxine to update the Crunchbase data and then use that data to crawl Linkedin to find data on startup founders that go through accelerators.&lt;br /&gt;
&lt;br /&gt;
2018-06-18: Introduced to the wiki, connected to RDP, and learned SQL.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt;&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Grace_Tan&amp;diff=23776</id>
		<title>Grace Tan</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Grace_Tan&amp;diff=23776"/>
		<updated>2018-07-25T19:50:30Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Staff&lt;br /&gt;
|position=Tech Team&lt;br /&gt;
|name=Grace Tan&lt;br /&gt;
|user_image=Profile_Grace_Tan.jpg&lt;br /&gt;
|degree=BA&lt;br /&gt;
|major=Computer Science&lt;br /&gt;
|class=2021&lt;br /&gt;
|join_date=Summer 2018&lt;br /&gt;
|skills=Python, Java&lt;br /&gt;
|email=gzt1@rice.edu&lt;br /&gt;
|skype_name=grace.tan330&lt;br /&gt;
|status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Work Logs==&lt;br /&gt;
[[Grace Tan (Work Log)]]&lt;br /&gt;
&lt;br /&gt;
==Projects==&lt;br /&gt;
[[Crunchbase Data]] &lt;br /&gt;
&lt;br /&gt;
[[Crunchbase Accelerator Founders]]&lt;br /&gt;
&lt;br /&gt;
[[Patent Thicket]]&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Patent_Thicket&amp;diff=23775</id>
		<title>Patent Thicket</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Patent_Thicket&amp;diff=23775"/>
		<updated>2018-07-25T19:49:16Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Patent Thicket&lt;br /&gt;
|Has owner=Grace Tan&lt;br /&gt;
|Has start date=Summer 2018&lt;br /&gt;
|Has keywords=&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Is dependent on=Google Scholar Crawler, PDF Downloader, PDF to Text Converter&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
===Location of Files===&lt;br /&gt;
 E://McNair/Software/Patent_Thicket&lt;br /&gt;
&lt;br /&gt;
Downloaded PDFs:&lt;br /&gt;
 E://McNair/Projects/Software/Patent_Thicket/AllPDFs/successful_downloads&lt;br /&gt;
&lt;br /&gt;
Converted PDFs to txt files:&lt;br /&gt;
 E://McNair/Projects/Software/Patent_Thicket/Parsed_Texts&lt;br /&gt;
&lt;br /&gt;
===Google Scholar Crawler===&lt;br /&gt;
Used [[Google Scholar Crawler]]&lt;br /&gt;
&lt;br /&gt;
I used the selenium box and switched between Rice Visitor, Rice Owls, and eduroam to prevent Google Scholar from blocking me.&lt;br /&gt;
&lt;br /&gt;
I downloaded 613 pdf urls and 958 bibtex files from 100 pages on Google Scholar when searching for &amp;quot;patent thicket.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
===Downloading PDFs===&lt;br /&gt;
Used [[PDF Downloader]]&lt;br /&gt;
&lt;br /&gt;
I tweaked the code to handle repeated file names. &lt;br /&gt;
5 of the pdf urls were not downloadable so I ended up with 608 working pdfs.&lt;br /&gt;
&lt;br /&gt;
===pdf_to_txt_bulk_PTLR.py===&lt;br /&gt;
See [[PDF to Text Converter]]&lt;br /&gt;
&lt;br /&gt;
The code must be run in E because the libraries it uses are not in Z.&lt;br /&gt;
I reinstalled pdfminer which might be a problem in the future if the libraries change.&lt;br /&gt;
&lt;br /&gt;
This program converts all pdfs to txt files. It also generates two files, _LOG_ERR.txt and _LOG_RUN.txt, which list the names of the pdfs that could not be converted and those that were converted successfully. Some of the files that were successfully converted, especially the very small ones, don't have the text from the paper.&lt;br /&gt;
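&lt;br /&gt;
One way to cross-check the two logs against the downloaded pdfs is sketched below. It assumes the file names have already been read out of the pdf folder and the two log files (the log format itself is not documented here), and the function name reconcile is hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import Counter
from pathlib import PureWindowsPath


def reconcile(pdf_names, converted_names, failed_names):
    """Compare the pdf inventory against the two conversion logs by file stem."""
    def stem(name):
        # PureWindowsPath accepts both / and \ separators on any OS.
        return PureWindowsPath(name).stem

    pdfs = {stem(name) for name in pdf_names}
    logged = Counter(stem(name) for name in list(converted_names) + list(failed_names))
    return {
        "in_neither_log": sorted(pdfs - set(logged)),
        "logged_more_than_once": sorted(n for n, count in logged.items() if count != 1),
        "not_a_known_pdf": sorted(set(logged) - pdfs),
    }


if __name__ == "__main__":
    report = reconcile(["a.pdf", "b.pdf", "c.pdf"], ["a.txt", "a.txt"], ["b.pdf"])
    for label, names in report.items():
        print(label, names)
```
&lt;br /&gt;
Any name that is logged more than once, or logged without a matching pdf, is a candidate explanation when the log totals and the pdf count disagree.&lt;br /&gt;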
&lt;br /&gt;
There were 573 successful txt files and 36 files that failed to convert (609 in total, which does not match the 608 pdfs, and I'm not sure why).&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Patent_Thicket&amp;diff=23774</id>
		<title>Patent Thicket</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Patent_Thicket&amp;diff=23774"/>
		<updated>2018-07-25T19:47:24Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Patent Thicket&lt;br /&gt;
|Has owner=Grace Tan&lt;br /&gt;
|Has start date=Summer 2018&lt;br /&gt;
|Has keywords=&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Is dependent on=Google Scholar Crawler, PDF Downloader, PDF to Text Converter&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
===Location of Files===&lt;br /&gt;
 E://McNair/Software/Patent_Thicket&lt;br /&gt;
&lt;br /&gt;
Downloaded PDFs:&lt;br /&gt;
 E://McNair/Projects/Software/Patent_Thicket/AllPDFs/successful_downloads&lt;br /&gt;
&lt;br /&gt;
Converted PDFs to txt files:&lt;br /&gt;
 E://McNair/Projects/Software/Patent_Thicket/Parsed_Texts&lt;br /&gt;
&lt;br /&gt;
===Google Scholar Crawler===&lt;br /&gt;
Used [[Google Scholar Crawler]]&lt;br /&gt;
&lt;br /&gt;
I used the selenium box and switched between Rice Visitor, Rice Owls, and eduroam to prevent Google Scholar from blocking me.&lt;br /&gt;
&lt;br /&gt;
I downloaded 613 pdf urls and 958 bibtex files from 100 pages on Google Scholar when searching for &amp;quot;patent thicket.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
===Downloading PDFs===&lt;br /&gt;
Used [[PDF Downloader]]&lt;br /&gt;
&lt;br /&gt;
I tweaked the code to handle repeated file names. &lt;br /&gt;
5 of the pdf urls were not downloadable so I ended up with 608 working pdfs.&lt;br /&gt;
&lt;br /&gt;
===pdf_to_txt_bulk_PTLR.py===&lt;br /&gt;
See [[PDF to Text Converter]]&lt;br /&gt;
&lt;br /&gt;
The code must be run in E because the libraries it uses are not in Z.&lt;br /&gt;
I reinstalled pdfminer, which might be a problem in the future if the libraries change.&lt;br /&gt;
&lt;br /&gt;
This program converts all pdfs to txt files. It also generates two log files, _LOG_ERR.txt and _LOG_RUN.txt, which list the names of the pdfs that could not be converted and of those that were converted successfully. Some of the successfully converted files, especially the very small ones, do not contain the text of the paper.&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Patent_Thicket&amp;diff=23773</id>
		<title>Patent Thicket</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Patent_Thicket&amp;diff=23773"/>
		<updated>2018-07-25T19:47:10Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Patent Thicket&lt;br /&gt;
|Has owner=Grace Tan&lt;br /&gt;
|Has start date=Summer 2018&lt;br /&gt;
|Has keywords=&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Is dependent on=Google Scholar Crawler, PDF Downloader, PDF to Text Converter&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
===Location of Files===&lt;br /&gt;
 E://McNair/Software/Patent_Thicket&lt;br /&gt;
&lt;br /&gt;
Downloaded PDFs:&lt;br /&gt;
 E://McNair/Projects/Software/Patent_Thicket/AllPDFs/successful_downloads&lt;br /&gt;
&lt;br /&gt;
Converted PDFs to txt files:&lt;br /&gt;
 E://McNair/Projects/Software/Patent_Thicket/Parsed_Texts&lt;br /&gt;
&lt;br /&gt;
===Google Scholar Crawler===&lt;br /&gt;
Used [[Google Scholar Crawler]]&lt;br /&gt;
&lt;br /&gt;
I used the selenium box and switched between the Rice Visitor, Rice Owls, and eduroam wifi networks to prevent Google Scholar from blocking me.&lt;br /&gt;
&lt;br /&gt;
I downloaded 613 pdf urls and 958 bibtex files from 100 pages of Google Scholar results when searching for &amp;quot;patent thicket.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
===Downloading PDFs===&lt;br /&gt;
Used [[PDF Downloader]]&lt;br /&gt;
&lt;br /&gt;
I tweaked the code to handle repeated file names.&lt;br /&gt;
5 of the pdf urls were not downloadable, so I ended up with 608 working pdfs.&lt;br /&gt;
&lt;br /&gt;
===pdf_to_txt_bulk_PTLR.py===&lt;br /&gt;
See [[PDF to Text Converter]]&lt;br /&gt;
&lt;br /&gt;
The code must be run in E because the libraries it uses are not in Z.&lt;br /&gt;
I reinstalled pdfminer, which might be a problem in the future if the libraries change.&lt;br /&gt;
&lt;br /&gt;
This program converts all pdfs to txt files. It also generates two log files, _LOG_ERR.txt and _LOG_RUN.txt, which list the names of the pdfs that could not be converted and of those that were converted successfully. Some of the successfully converted files, especially the very small ones, do not contain the text of the paper.&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Patent_Thicket&amp;diff=23772</id>
		<title>Patent Thicket</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Patent_Thicket&amp;diff=23772"/>
		<updated>2018-07-25T19:46:50Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Patent Thicket&lt;br /&gt;
|Has owner=Grace Tan&lt;br /&gt;
|Has start date=Summer 2018&lt;br /&gt;
|Has keywords=&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Is dependent on=Google Scholar Crawler, PDF Downloader, PDF to Text Converter&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
===Location of Files===&lt;br /&gt;
 E://McNair/Software/Patent_Thicket&lt;br /&gt;
&lt;br /&gt;
Downloaded PDFs:&lt;br /&gt;
 E://McNair/Projects/Software/Patent_Thicket/AllPDFs/successful_downloads&lt;br /&gt;
&lt;br /&gt;
Converted PDFs to txt files:&lt;br /&gt;
 E://McNair/Projects/Software/Patent_Thicket/Parsed_Texts&lt;br /&gt;
&lt;br /&gt;
===Google Scholar Crawler===&lt;br /&gt;
Used [[Google Scholar Crawler]]&lt;br /&gt;
I used the selenium box and switched between the Rice Visitor, Rice Owls, and eduroam wifi networks to prevent Google Scholar from blocking me.&lt;br /&gt;
&lt;br /&gt;
I downloaded 613 pdf urls and 958 bibtex files from 100 pages of Google Scholar results when searching for &amp;quot;patent thicket.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
===Downloading PDFs===&lt;br /&gt;
Used [[PDF Downloader]]&lt;br /&gt;
&lt;br /&gt;
I tweaked the code to handle repeated file names.&lt;br /&gt;
5 of the pdf urls were not downloadable, so I ended up with 608 working pdfs.&lt;br /&gt;
&lt;br /&gt;
===pdf_to_txt_bulk_PTLR.py===&lt;br /&gt;
See [[PDF to Text Converter]]&lt;br /&gt;
&lt;br /&gt;
The code must be run in E because the libraries it uses are not in Z.&lt;br /&gt;
I reinstalled pdfminer, which might be a problem in the future if the libraries change.&lt;br /&gt;
&lt;br /&gt;
This program converts all pdfs to txt files. It also generates two log files, _LOG_ERR.txt and _LOG_RUN.txt, which list the names of the pdfs that could not be converted and of those that were converted successfully. Some of the successfully converted files, especially the very small ones, do not contain the text of the paper.&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Patent_Thicket&amp;diff=23770</id>
		<title>Patent Thicket</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Patent_Thicket&amp;diff=23770"/>
		<updated>2018-07-25T19:45:50Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Patent Thicket&lt;br /&gt;
|Has owner=Grace Tan&lt;br /&gt;
|Has start date=Summer 2018&lt;br /&gt;
|Has keywords=&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Is dependent on=Google Scholar Crawler, PDF Downloader, PDF to Text Converter&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
===Location of Files===&lt;br /&gt;
 E://McNair/Software/Patent_Thicket&lt;br /&gt;
&lt;br /&gt;
Downloaded PDFs:&lt;br /&gt;
 E://McNair/Projects/Software/Patent_Thicket/AllPDFs/successful_downloads&lt;br /&gt;
&lt;br /&gt;
Converted PDFs to txt files:&lt;br /&gt;
 E://McNair/Projects/Software/Patent_Thicket/Parsed_Texts&lt;br /&gt;
&lt;br /&gt;
===Google Scholar Crawler===&lt;br /&gt;
Used [[Google Scholar Crawler]]&lt;br /&gt;
I used the selenium box and switched between the Rice Visitor, Rice Owls, and eduroam wifi networks to prevent Google Scholar from blocking me.&lt;br /&gt;
&lt;br /&gt;
I downloaded 613 pdf urls and 958 bibtex files from 100 pages of Google Scholar results when searching for &amp;quot;patent thicket.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
===Downloading PDFs===&lt;br /&gt;
Used pdfdownloader.py&lt;br /&gt;
&lt;br /&gt;
I tweaked the code to handle repeated file names.&lt;br /&gt;
5 of the pdf urls were not downloadable, so I ended up with 608 working pdfs.&lt;br /&gt;
&lt;br /&gt;
===pdf_to_txt_bulk_PTLR.py===&lt;br /&gt;
The code must be run in E because the libraries it uses are not in Z.&lt;br /&gt;
I reinstalled pdfminer, which might be a problem in the future if the libraries change.&lt;br /&gt;
&lt;br /&gt;
This program converts all pdfs to txt files. It also generates two log files, _LOG_ERR.txt and _LOG_RUN.txt, which list the names of the pdfs that could not be converted and of those that were converted successfully. Some of the successfully converted files, especially the very small ones, do not contain the text of the paper.&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Patent_Thicket&amp;diff=23769</id>
		<title>Patent Thicket</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Patent_Thicket&amp;diff=23769"/>
		<updated>2018-07-25T19:44:27Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: Created page with &amp;quot;{{McNair Projects |Has title=Patent Thicket |Has owner=Grace Tan |Has start date=Summer 2018 |Has keywords= |Has project status=Active |Is dependent on=Google Scholar Crawler,...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Patent Thicket&lt;br /&gt;
|Has owner=Grace Tan&lt;br /&gt;
|Has start date=Summer 2018&lt;br /&gt;
|Has keywords=&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
|Is dependent on=Google Scholar Crawler, pdfdownloader.py, pdf_to_bulk_PTLR.py&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
===Location of Files===&lt;br /&gt;
 E://McNair/Software/Patent_Thicket&lt;br /&gt;
&lt;br /&gt;
Downloaded PDFs:&lt;br /&gt;
 E://McNair/Projects/Software/Patent_Thicket/AllPDFs/successful_downloads&lt;br /&gt;
&lt;br /&gt;
Converted PDFs to txt files:&lt;br /&gt;
 E://McNair/Projects/Software/Patent_Thicket/Parsed_Texts&lt;br /&gt;
&lt;br /&gt;
===Google Scholar Crawler===&lt;br /&gt;
Used [[Google Scholar Crawler]]&lt;br /&gt;
I used the selenium box and switched between the Rice Visitor, Rice Owls, and eduroam wifi networks to prevent Google Scholar from blocking me.&lt;br /&gt;
&lt;br /&gt;
I downloaded 613 pdf urls and 958 bibtex files from 100 pages of Google Scholar results when searching for &amp;quot;patent thicket.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
===Downloading PDFs===&lt;br /&gt;
Used pdfdownloader.py&lt;br /&gt;
&lt;br /&gt;
I tweaked the code to handle repeated file names.&lt;br /&gt;
5 of the pdf urls were not downloadable, so I ended up with 608 working pdfs.&lt;br /&gt;
&lt;br /&gt;
===pdf_to_txt_bulk_PTLR.py===&lt;br /&gt;
The code must be run in E because the libraries it uses are not in Z.&lt;br /&gt;
I reinstalled pdfminer, which might be a problem in the future if the libraries change.&lt;br /&gt;
&lt;br /&gt;
This program converts all pdfs to txt files. It also generates two log files, _LOG_ERR.txt and _LOG_RUN.txt, which list the names of the pdfs that could not be converted and of those that were converted successfully. Some of the successfully converted files, especially the very small ones, do not contain the text of the paper.&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Grace_Tan_(Work_Log)&amp;diff=23752</id>
		<title>Grace Tan (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Grace_Tan_(Work_Log)&amp;diff=23752"/>
		<updated>2018-07-24T23:27:32Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Staff&lt;br /&gt;
|status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] [[Work Logs]] [[Grace Tan (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
2018-07-24: Realized that some pdfs did not download properly because the link did not point directly to a pdf. Found all the pdfs I could, for a total of 608, with 5 that I could not find pdfs for. Ran pdf_to_txt_bulk_PTLR.py on the 608 pdfs.&lt;br /&gt;
&lt;br /&gt;
2018-07-23: Found my missing pdfs. There were 7 that were not able to download through the script for some reason so I manually downloaded 6 of them and I could not find the pdf for 1 of them. Now I have 612 pdfs on file and am ready to start converting to txt tomorrow.&lt;br /&gt;
&lt;br /&gt;
2018-07-20: Reinstalled pdfminer, and the test that GitHub provides works, so it should be installed correctly. Ran all the pdf urls through pdfdownloader.py but came up short with 602/613 pdfs. I found 8 papers that did not download successfully, but that still leaves 3 mystery papers and I'm not sure which ones they are. Tried some SQL to figure it out, but it hasn't worked yet.&lt;br /&gt;
&lt;br /&gt;
2018-07-19: Started on converting pdfs to txt files. I found pdf_to_txt_bulk_PTLR.py in E:/McNair/Software/Google_Scholar_Crawler. I copied this and moved my data to E:/McNair/Software/Patent_Thicket. When I tried to run the program, it gave me an error when trying to import pdfminer because I originally tried running it in Z, so I installed pdfminer.six and it did not complain with python3.6. It then gave me a different error about a missing package, which Wei said we shouldn't touch, so I moved everything back to E. It now runs but cannot convert any pdfs. I have no idea how to fix this.&lt;br /&gt;
&lt;br /&gt;
2018-07-18: Finished running the rest of the 100 pages. Took quite a long time because google scholar was catching me after 2-5 pages rather than 5-10. It helped to switch between the different wifi networks (rice visitor, rice owls, eduroam). Altogether this resulted in 958 bibtex files and 613 pdfs from 1000 entries. There might be more entries but I'm not sure where to find them. I saved the data and code onto the rdp by connecting to it from the selenium box.&lt;br /&gt;
&lt;br /&gt;
2018-07-17: Ran google scholar crawler. When google scholar blocks me with a 403 error code, I exit the program and rerun it at the page that it last looked at by clicking on the correct page number before crawling. I finished running through 68/100 pages of google scholar.&lt;br /&gt;
&lt;br /&gt;
2018-07-16: Ran through 10 pages of google scholar first thing without a problem. Tried running through all 100 pages but kept on getting caught. Helped Augi with discrepancies in data and will try google scholar crawler again tomorrow.&lt;br /&gt;
&lt;br /&gt;
2018-07-13: Fixed problem where pdf urls were not saving to txt file. Created another txt file to save urls that are not pdfs. Didn't run into a single recaptcha all morning. Towards the end, it started catching me at the 7th query and forced the program to restart. For some reason, selecting the css element triggered google scholar to find me. I changed the css element tag for the &amp;quot;next&amp;quot; button to the path and I was able to get through the 4th page. It is still not able to click on the actual link but I'm not sure if that's supposed to do anything.&lt;br /&gt;
&lt;br /&gt;
2018-07-12: Figured out how to save BibTeX files to computer. Still had to do recaptcha tests. After giving it some time, I was able to run it completely once but I only got 49 BibTeX files and about half as many pdf links. When I tried to work on it further the recaptcha wasn't loading and gave me the error - &amp;quot;Cannot contact reCAPTCHA. Check your connection and try again.&amp;quot; I ended up moving to the selenium computer and spent the rest of the day converting the code to python3 and messing with regex because for some reason it wasn't matching the text correctly.&lt;br /&gt;
&lt;br /&gt;
2018-07-11: Started on Google Scholar Crawler for Patent Thicket Project. I'm not sure what the problem is. The code seems to work except that Google constantly blocks me with reCAPTCHA tests. I am also not sure if the crawler is saving any data to txt files and if so, where those files are located.&lt;br /&gt;
&lt;br /&gt;
2018-07-10: Finished LinkedIn Crawler! When no search results were found, I did not look for the href and instead added the founder and company name to a txt file. Spent way too much time doing reCAPTCHA tests and logging out and logging in again because firefox and the wifi were being slow. Cleaned up code and put it on the rdp as well as fixed the wiki page - [[Crunchbase Accelerator Founders]]&lt;br /&gt;
&lt;br /&gt;
2018-07-09: Continued working on the LinkedIn Crawler. I figured out how to get the web element of the first search result using path. You can achieve the same result by looking for the css element which probably would have been easier. I then used get_attribute('href') to find the href in the web element of the first result to get the url of the founder. Note that previously, there was another function to click on the name on the screen which would open another window with the profile but I found it easier to just extract the url. Next, I will run the (hopefully) working crawler on all the founder data. Update - ran into an error when there are no search results found.&lt;br /&gt;
&lt;br /&gt;
2018-06-29: Spent a large part of today clicking on road signs and cars to prove to LinkedIn that I am not a robot. Figured out how to find search box with css element instead of xpath. Now trying to get information from search results.&lt;br /&gt;
&lt;br /&gt;
2018-06-28: Tried to figure out the xpath for the search box but came up with no solutions. Halfway through, linkedin discovered that I was a bot so I moved to the selenium computer and used the Rice Visitor wifi. Linkedin still wouldn't let me in so we made another test account (see project page for details). It finally let me in at the end of the day but I set one of the delays to be 3-4 min so that I have time to do the tests that linkedin gives to ensure that you are not a bot. I still have no idea how to find search box xpath.&lt;br /&gt;
&lt;br /&gt;
2018-06-27: Took the dictionary of accelerator to founder UUIDs and formed a table with the UUIDs combined with names of founders, gender, and linkedin_url. The file is in Z:\crunchbase2\FounderAccInfo.txt . Started looking at linkedin crawler documentation and got the crawler to get information from profiles with known urls. It crashed when it tried searching up founder names and accelerators so will work on that tomorrow. &lt;br /&gt;
&lt;br /&gt;
2018-06-26: Ended up finding all founders manually. Then talked to Ed and figured out how to get the founder data off the crunchbase API with a link (see project page). Created a python script that goes through all the API pages for each accelerator API and returns a dictionary of accelerator UUIDs mapped to founder UUIDs. I found 209 founders on the API but 224 manually so we'll look at the discrepancy tomorrow. &lt;br /&gt;
&lt;br /&gt;
2018-06-25: We took the 157 Accelerator UUIDs we found and created a new table, called AccAllInfo, that includes all the attributes of the accelerator that we want from organizations.csv. Maxine and I then split into our respective projects. I tried joining people to the companies they are linked to in order to find the founders of each accelerator. I found about 90 matches, but there are still a lot of holes since some accelerators have no founders and others have multiple founders. Still unsure of how to fix this.&lt;br /&gt;
&lt;br /&gt;
2018-06-22: Matched Connor's master list of accelerators with organizations.csv based on homepage_url and company_name. Found 90 that matched along with 76 blanks. Then tried matching with homepage_url or company_name and manually found about 30 more that had slight variations in url or name that we should keep. Using ILIKE we found ~25 more company UUIDs that match with accelerators on the list.&lt;br /&gt;
&lt;br /&gt;
2018-06-21: Downloaded all 17 v3.1 csv tables and updated LoadTables.sql to match our data. We did this by manually updating the name and size of the fields. To solve the problem of &amp;quot;&amp;quot; from yesterday, we used regular expressions to change the empty string to nothing (see project page). We then worked with Connor to start extracting the accelerators from the organizations in the Crunchbase data. We found a lot of null matches based on company_name and a few that have the same name but are actually different companies. Maybe try matching with homepage_url tomorrow.&lt;br /&gt;
&lt;br /&gt;
2018-06-20: Learned more SQL. Started working on [[Crunchbase Data]] project with Maxine. Old code contained 22 csv tables but new Crunchbase data only has 17 csv tables. We will be using the new Crunchbase API v3.1 (not v3) with only 17 csv tables as data. We then started updating the old SQL tables to align with the 17 tables we have. We ran into a problem where a date-type field contained &amp;quot;&amp;quot; in the data, and SQL did not like that. Ed was helping us with this but we have not found a solution yet.&lt;br /&gt;
&lt;br /&gt;
2018-06-19: Set up monitors and continued learning SQL. We were also introduced to our projects. I will be continuing Christy's work on the Google Scholar Crawler as well as working with Maxine to update the Crunchbase data and then use that data to crawl Linkedin to find data on startup founders that go through accelerators.&lt;br /&gt;
&lt;br /&gt;
2018-06-18: Introduced to the wiki, connected to RDP, and learned SQL.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt;&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Grace_Tan_(Work_Log)&amp;diff=23726</id>
		<title>Grace Tan (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Grace_Tan_(Work_Log)&amp;diff=23726"/>
		<updated>2018-07-23T19:14:58Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Staff&lt;br /&gt;
|status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] [[Work Logs]] [[Grace Tan (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
2018-07-23: Found my missing pdfs. There were 7 that were not able to download through the script for some reason so I manually downloaded 6 of them and I could not find the pdf for 1 of them. Now I have 612 pdfs on file and am ready to start converting to txt tomorrow.&lt;br /&gt;
&lt;br /&gt;
2018-07-20: Reinstalled pdfminer, and the test that GitHub provides works, so it should be installed correctly. Ran all the pdf urls through pdfdownloader.py but came up short with 602/613 pdfs. I found 8 papers that did not download successfully, but that still leaves 3 mystery papers and I'm not sure which ones they are. Tried some SQL to figure it out, but it hasn't worked yet.&lt;br /&gt;
&lt;br /&gt;
2018-07-19: Started on converting pdfs to txt files. I found pdf_to_txt_bulk_PTLR.py in E:/McNair/Software/Google_Scholar_Crawler. I copied this and moved my data to E:/McNair/Software/Patent_Thicket. When I tried to run the program, it gave me an error when trying to import pdfminer because I originally tried running it in Z, so I installed pdfminer.six and it did not complain with python3.6. It then gave me a different error about a missing package, which Wei said we shouldn't touch, so I moved everything back to E. It now runs but cannot convert any pdfs. I have no idea how to fix this.&lt;br /&gt;
&lt;br /&gt;
2018-07-18: Finished running the rest of the 100 pages. Took quite a long time because google scholar was catching me after 2-5 pages rather than 5-10. It helped to switch between the different wifi networks (rice visitor, rice owls, eduroam). Altogether this resulted in 958 bibtex files and 613 pdfs from 1000 entries. There might be more entries but I'm not sure where to find them. I saved the data and code onto the rdp by connecting to it from the selenium box.&lt;br /&gt;
&lt;br /&gt;
2018-07-17: Ran google scholar crawler. When google scholar blocks me with a 403 error code, I exit the program and rerun it at the page that it last looked at by clicking on the correct page number before crawling. I finished running through 68/100 pages of google scholar.&lt;br /&gt;
&lt;br /&gt;
2018-07-16: Ran through 10 pages of google scholar first thing without a problem. Tried running through all 100 pages but kept on getting caught. Helped Augi with discrepancies in data and will try google scholar crawler again tomorrow.&lt;br /&gt;
&lt;br /&gt;
2018-07-13: Fixed problem where pdf urls were not saving to txt file. Created another txt file to save urls that are not pdfs. Didn't run into a single recaptcha all morning. Towards the end, it started catching me at the 7th query and forced the program to restart. For some reason, selecting the css element triggered google scholar to find me. I changed the css element tag for the &amp;quot;next&amp;quot; button to the path and I was able to get through the 4th page. It is still not able to click on the actual link but I'm not sure if that's supposed to do anything.&lt;br /&gt;
&lt;br /&gt;
2018-07-12: Figured out how to save BibTeX files to computer. Still had to do recaptcha tests. After giving it some time, I was able to run it completely once but I only got 49 BibTeX files and about half as many pdf links. When I tried to work on it further the recaptcha wasn't loading and gave me the error - &amp;quot;Cannot contact reCAPTCHA. Check your connection and try again.&amp;quot; I ended up moving to the selenium computer and spent the rest of the day converting the code to python3 and messing with regex because for some reason it wasn't matching the text correctly.&lt;br /&gt;
&lt;br /&gt;
2018-07-11: Started on Google Scholar Crawler for Patent Thicket Project. I'm not sure what the problem is. The code seems to work except that Google constantly blocks me with reCAPTCHA tests. I am also not sure if the crawler is saving any data to txt files and if so, where those files are located.&lt;br /&gt;
&lt;br /&gt;
2018-07-10: Finished LinkedIn Crawler! When no search results were found, I did not look for the href and instead added the founder and company name to a txt file. Spent way too much time doing reCAPTCHA tests and logging out and logging in again because firefox and the wifi were being slow. Cleaned up code and put it on the rdp as well as fixed the wiki page - [[Crunchbase Accelerator Founders]]&lt;br /&gt;
&lt;br /&gt;
2018-07-09: Continued working on the LinkedIn Crawler. I figured out how to get the web element of the first search result using path. You can achieve the same result by looking for the css element which probably would have been easier. I then used get_attribute('href') to find the href in the web element of the first result to get the url of the founder. Note that previously, there was another function to click on the name on the screen which would open another window with the profile but I found it easier to just extract the url. Next, I will run the (hopefully) working crawler on all the founder data. Update - ran into an error when there are no search results found.&lt;br /&gt;
&lt;br /&gt;
2018-06-29: Spent a large part of today clicking on road signs and cars to prove to LinkedIn that I am not a robot. Figured out how to find search box with css element instead of xpath. Now trying to get information from search results.&lt;br /&gt;
&lt;br /&gt;
2018-06-28: Tried to figure out the xpath for the search box but came up with no solutions. Halfway through, linkedin discovered that I was a bot so I moved to the selenium computer and used the Rice Visitor wifi. Linkedin still wouldn't let me in so we made another test account (see project page for details). It finally let me in at the end of the day but I set one of the delays to be 3-4 min so that I have time to do the tests that linkedin gives to ensure that you are not a bot. I still have no idea how to find search box xpath.&lt;br /&gt;
&lt;br /&gt;
2018-06-27: Took the dictionary of accelerator to founder UUIDs and formed a table with the UUIDs combined with names of founders, gender, and linkedin_url. The file is in Z:\crunchbase2\FounderAccInfo.txt . Started looking at linkedin crawler documentation and got the crawler to get information from profiles with known urls. It crashed when it tried searching up founder names and accelerators so will work on that tomorrow. &lt;br /&gt;
&lt;br /&gt;
2018-06-26: Ended up finding all founders manually. Then talked to Ed and figured out how to get the founder data off the crunchbase API with a link (see project page). Created a python script that goes through all the API pages for each accelerator API and returns a dictionary of accelerator UUIDs mapped to founder UUIDs. I found 209 founders on the API but 224 manually so we'll look at the discrepancy tomorrow. &lt;br /&gt;
&lt;br /&gt;
2018-06-25: We took the 157 Accelerator UUIDs we found and created a new table, called AccAllInfo, that includes all the attributes of the accelerator that we want from organizations.csv. Maxine and I then split into our respective projects. I tried joining people to the companies they are linked to in order to find the founders of each accelerator. I found about 90 matches, but there are still a lot of holes since some accelerators have no founders and others have multiple founders. Still unsure of how to fix this.&lt;br /&gt;
&lt;br /&gt;
2018-06-22: Matched Connor's master list of accelerators with organizations.csv based on homepage_url and company_name. Found 90 that matched along with 76 blanks. Then tried matching with homepage_url or company_name and manually found about 30 more that had slight variations in url or name that we should keep. Using ILIKE we found ~25 more company UUIDs that match with accelerators on the list.&lt;br /&gt;
&lt;br /&gt;
2018-06-21: Downloaded all 17 v3.1 csv tables and updated LoadTables.sql to match our data. We did this by manually updating the name and size of the fields. To solve the problem of &amp;quot;&amp;quot; from yesterday, we used regular expressions to change the empty string to nothing (see project page). We then worked with Connor to start extracting the accelerators from the organizations in the Crunchbase data. We found a lot of null matches based on company_name and a few that have the same name but are actually different companies. Maybe try matching with homepage_url tomorrow.&lt;br /&gt;
&lt;br /&gt;
2018-06-20: Learned more SQL. Started working on [[Crunchbase Data]] project with Maxine. Old code contained 22 csv tables but new Crunchbase data only has 17 csv tables. We will be using the new Crunchbase API v3.1 (not v3) with only 17 csv tables as data. We then started updating the old SQL tables to align with the 17 tables we have. We ran into a problem where a date-type field contained &amp;quot;&amp;quot; in the data, and SQL did not like that. Ed was helping us with this but we have not found a solution yet.&lt;br /&gt;
&lt;br /&gt;
2018-06-19: Set up monitors and continued learning SQL. We were also introduced to our projects. I will be continuing Christy's work on the Google Scholar Crawler as well as working with Maxine to update the Crunchbase data and then use that data to crawl Linkedin to find data on startup founders that go through accelerators.&lt;br /&gt;
&lt;br /&gt;
2018-06-18: Introduced to the wiki, connected to RDP, and learned SQL.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt;&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Grace_Tan_(Work_Log)&amp;diff=23700</id>
		<title>Grace Tan (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Grace_Tan_(Work_Log)&amp;diff=23700"/>
		<updated>2018-07-20T21:55:57Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Staff&lt;br /&gt;
|status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] [[Work Logs]] [[Grace Tan (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
2018-07-20: Reinstalled pdfminer, and the test that GitHub provides works, so it should be installed correctly. Ran all the pdf urls through pdfdownloader.py but came up short with 602/613 pdfs. I found 8 papers that did not download successfully, but that still leaves 3 mystery papers and I'm not sure which ones they are. Tried some SQL to figure it out, but it hasn't worked yet.&lt;br /&gt;
&lt;br /&gt;
2018-07-19: Started on converting pdfs to txt files. I found pdf_to_txt_bulk_PTLR.py in E:/McNair/Software/Google_Scholar_Crawler. I copied this and moved my data to E:/McNair/Software/Patent_Thicket. When I tried to run the program, it gave me an error when trying to import pdfminer because I originally tried running it in Z, so I installed pdfminer.six and it did not complain with python3.6. It then gave me a different error about a missing package, which Wei said we shouldn't touch, so I moved everything back to E. It now runs but cannot convert any pdfs. I have no idea how to fix this.&lt;br /&gt;
&lt;br /&gt;
2018-07-18: Finished running the rest of the 100 pages. Took quite a long time because google scholar was catching me after 2-5 pages rather than 5-10. It helped to switch between the different wifi(rice visitor, rice owls, eduroam). Altogether resulted in 958 bibtex files and 613 pdfs from 1000 entries. There might be more entries but I'm not sure where to find them. I saved the data and code onto the rdp by connecting to it from the selenium box. &lt;br /&gt;
&lt;br /&gt;
2018-07-17: Ran google scholar crawler. When google scholar blocks me with a 403 error code, I exit the program and rerun it at the page that it last looked at by clicking on the correct page number before crawling. I finished running through 68/100 pages of google scholar.&lt;br /&gt;
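The restart-at-the-last-page routine described in the 2018-07-17 entry could be automated with a small checkpoint file. This is only a minimal sketch, not the crawler's actual code: the Blocked exception, the fetch_page callable, and the checkpoint file format are all hypothetical names introduced for illustration.

```python
import json
import os


class Blocked(Exception):
    """Raised by the (hypothetical) page fetcher when the site answers 403."""


def load_checkpoint(path):
    # Return the next page to crawl; start at page 1 when no checkpoint exists.
    if not os.path.exists(path):
        return 1
    with open(path) as handle:
        return json.load(handle)["next_page"]


def save_checkpoint(path, next_page):
    # Record the page that should be crawled on the next run.
    with open(path, "w") as handle:
        json.dump({"next_page": next_page}, handle)


def crawl(fetch_page, total_pages, checkpoint_path):
    # Crawl from the saved page onward. On a block, stop and return the page
    # that still needs crawling, so a later rerun resumes exactly there.
    start = load_checkpoint(checkpoint_path)
    for page in range(start, total_pages + 1):
        try:
            fetch_page(page)
        except Blocked:
            return page
        save_checkpoint(checkpoint_path, page + 1)
    return None  # every page was crawled
```

Calling crawl again with the same checkpoint path after a block picks up at the page that failed, which replaces clicking on the correct page number by hand.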
&lt;br /&gt;
2018-07-16: Ran through 10 pages of google scholar first thing without a problem. Tried running through all 100 pages but kept on getting caught. Helped Augi with discrepancies in data and will try google scholar crawler again tomorrow.&lt;br /&gt;
&lt;br /&gt;
2018-07-13: Fixed problem where pdf urls were not saving to txt file. Created another txt file to save urls that are not pdfs. Didn't run into a single recaptcha all morning. Towards the end, it started catching me at the 7th query and forced the program to restart. For some reason, selecting the css element triggered google scholar to find me. I changed the css element tag for the &amp;quot;next&amp;quot; button to the path and I was able to get through the 4th page. It is still not able to click on the actual link but I'm not sure if that's supposed to do anything.&lt;br /&gt;
&lt;br /&gt;
2018-07-12: Figured out how to save BibTeX files to computer. Still had to do recaptcha tests. After giving it some time, I was able to run it completely once but I only got 49 BibTeX files and about half as many pdf links. When I tried to work on it further the recaptcha wasn't loading and gave me the error - &amp;quot;Cannot contact reCAPTCHA. Check your connection and try again.&amp;quot; I ended up moving to the selenium computer and spent the rest of the day converting the code to python3 and messing with regex because for some reason it wasn't matching the text correctly.&lt;br /&gt;
&lt;br /&gt;
2018-07-11: Started on Google Scholar Crawler for Patent Thicket Project. I'm not sure what the problem is. The code seems to work except that Google constantly blocks me and makes me do reCAPTCHA tests. I am also not sure if the crawler is saving any data to txt files and, if so, where those files are located.&lt;br /&gt;
&lt;br /&gt;
2018-07-10: Finished LinkedIn Crawler! When no search results were found, I skipped looking for the href and instead added the founder and company name to a txt file. Spent way too much time doing reCAPTCHA tests and logging out and logging in again because Firefox and the wifi were being slow. Cleaned up code and put it on the rdp as well as fixed the wiki page - [[Crunchbase Accelerator Founders]]&lt;br /&gt;
&lt;br /&gt;
2018-07-09: Continued working on the LinkedIn Crawler. I figured out how to get the web element of the first search result using path. You can achieve the same result by looking for the css element, which probably would have been easier. I then used get_attribute('href') to read the href from the web element of the first result and get the url of the founder. Note that previously, there was another function to click on the name on the screen, which would open another window with the profile, but I found it easier to just extract the url. Next, I will run the (hopefully) working crawler on all the founder data. Update - ran into an error when there are no search results found.&lt;br /&gt;
&lt;br /&gt;
2018-06-29: Spent a large part of today clicking on road signs and cars to prove to LinkedIn that I am not a robot. Figured out how to find search box with css element instead of xpath. Now trying to get information from search results.&lt;br /&gt;
&lt;br /&gt;
2018-06-28: Tried to figure out the xpath for the search box but came up with no solutions. Halfway through, linkedin discovered that I was a bot so I moved to the selenium computer and used the Rice Visitor wifi. Linkedin still wouldn't let me in so we made another test account (see project page for details). It finally let me in at the end of the day but I set one of the delays to be 3-4 min so that I have time to do the tests that linkedin gives to ensure that you are not a bot. I still have no idea how to find search box xpath.&lt;br /&gt;
&lt;br /&gt;
2018-06-27: Took the dictionary of accelerator to founder UUIDs and formed a table with the UUIDs combined with names of founders, gender, and linkedin_url. The file is in Z:\crunchbase2\FounderAccInfo.txt . Started looking at linkedin crawler documentation and got the crawler to get information from profiles with known urls. It crashed when it tried searching up founder names and accelerators so will work on that tomorrow. &lt;br /&gt;
&lt;br /&gt;
2018-06-26: Ended up finding all founders manually. Then talked to Ed and figured out how to get the founder data off the crunchbase API with a link (see project page). Created a python script that goes through all the API pages for each accelerator API and returns a dictionary of accelerator UUIDs mapped to founder UUIDs. I found 209 founders on the API but 224 manually so we'll look at the discrepancy tomorrow. &lt;br /&gt;
&lt;br /&gt;
2018-06-25: We took the 157 Accelerator UUIDs we found and created a new table called AccAllInfo that includes all the attributes of the accelerator that we want from organizations.csv. Maxine and I then split into our respective projects. I tried joining people to the companies they are linked to in order to find the founders of each accelerator. I found about 90 matches, but there are still a lot of holes since some accelerators have no founders and others have multiple founders. Still unsure of how to fix this.&lt;br /&gt;
&lt;br /&gt;
2018-06-22: Matched Connor's master list of accelerators with organizations.csv based on homepage_url and company_name. Found 90 that matched along with 76 blanks. Then tried matching with homepage_url or company_name and manually found about 30 more that had slight variations in url or name that we should keep. Using ILIKE we found ~25 more company UUIDs that match with accelerators on the list.&lt;br /&gt;
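The slight variations in url mentioned in the 2018-06-22 entry (scheme, a leading www., a trailing slash) can also be removed before matching, instead of being found by hand. The sketch below is an assumption about how that might look in Python, not the SQL and ILIKE approach we actually used; normalize_url and match_by_url are hypothetical helper names.

```python
from urllib.parse import urlsplit


def normalize_url(url):
    # Reduce a homepage URL to a bare host plus path for comparison.
    url = url.strip().lower()
    if "://" not in url:
        url = "http://" + url  # urlsplit needs a scheme to find the host
    parts = urlsplit(url)
    host = parts.netloc
    if host.startswith("www."):
        host = host[len("www."):]
    return host + parts.path.rstrip("/")


def match_by_url(master_urls, org_rows):
    # org_rows: iterable of (uuid, homepage_url) pairs, e.g. from organizations.csv.
    # Returns each master URL mapped to a matching uuid, or None when unmatched.
    index = {}
    for uuid, homepage_url in org_rows:
        if homepage_url:
            index.setdefault(normalize_url(homepage_url), uuid)
    return {url: index.get(normalize_url(url)) for url in master_urls}
```

Entries that still map to None after this step are the ones left for manual checking or name-based matching.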
&lt;br /&gt;
2018-06-21: Downloaded all 17 v3.1 csv tables and updated LoadTables.sql to match our data. We did this by manually updating the name and size of the fields. To solve the problem of &amp;quot;&amp;quot; from yesterday, we used regular expressions to change the empty string to nothing (see project page). We then worked with Connor to start extracting the accelerators from the organizations in the Crunchbase data. We found a lot of null matches based on company_name and a few that have the same name but are actually different companies. Maybe try matching with homepage_url tomorrow.&lt;br /&gt;
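The regular expression we actually used is on the project page. As a rough illustration only, the empty-string cleanup from the 2018-06-21 entry might look like the sketch below, which assumes comma-delimited lines and a loader that accepts an unquoted empty field for a date column. The function name is hypothetical.

```python
import re

# A field that is exactly "" sits between delimiters or at a line boundary.
EMPTY_FIELD = re.compile(r'(^|,)""(?=,|$)')


def blank_empty_fields(line):
    # Replace each field that is exactly "" with nothing, keeping the comma.
    return EMPTY_FIELD.sub(r"\1", line)


def clean_file(src_path, dst_path):
    # Rewrite a csv file line by line with the empty-string fields blanked.
    with open(src_path, encoding="utf-8") as src:
        with open(dst_path, "w", encoding="utf-8") as dst:
            for line in src:
                dst.write(blank_empty_fields(line.rstrip("\n")) + "\n")
```

This line-by-line approach does not handle quoted fields that contain embedded newlines or the sequence ,"", inside their text, so a real csv parser would be safer for messy data.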
&lt;br /&gt;
2018-06-20: Learned more SQL. Started working on [[Crunchbase Data]] project with Maxine. Old code contained 22 csv tables but new Crunchbase data only has 17 csv tables. We will be using the new Crunchbase API v3.1 (not v3) with only 17 csv tables as data. We then started updating the old SQL tables to align with the 17 tables we have. We ran into a problem where the data had a field of &amp;quot;&amp;quot; for a date type, and SQL did not like that. Ed was helping us with this but we have not found a solution yet.&lt;br /&gt;
&lt;br /&gt;
2018-06-19: Set up monitors and continued learning SQL. We were also introduced to our projects. I will be continuing Christy's work on the Google Scholar Crawler as well as working with Maxine to update the Crunchbase data and then use that data to crawl Linkedin to find data on startup founders that go through accelerators.&lt;br /&gt;
&lt;br /&gt;
2018-06-18: Introduced to the wiki, connected to RDP, and learned SQL.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt;&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Grace_Tan_(Work_Log)&amp;diff=23659</id>
		<title>Grace Tan (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Grace_Tan_(Work_Log)&amp;diff=23659"/>
		<updated>2018-07-19T21:50:43Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Staff&lt;br /&gt;
|status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] [[Work Logs]] [[Grace Tan (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
2018-07-19: Started on converting pdfs to txt files. I found pdf_to_txt_bulk_PTLR.py in E:/McNair/Software/Google_Scholar_Crawler. I copied this and moved my data to E:/McNair/Software/Patent_Thicket. When I tried to run the program, it gave me an error when importing pdfminer because I originally tried running it in Z, so I installed pdfminer.six, and the import did not complain under python3.6. It then gave me a different error about a missing package, which Wei said we shouldn't touch, so I moved everything back to E. It now runs but cannot convert any pdfs. I have no idea how to fix this.&lt;br /&gt;
&lt;br /&gt;
2018-07-18: Finished running the rest of the 100 pages. It took quite a long time because Google Scholar was catching me after 2-5 pages rather than 5-10. It helped to switch between the different wifi networks (Rice Visitor, Rice Owls, eduroam). Altogether, this resulted in 958 BibTeX files and 613 pdfs from 1000 entries. There might be more entries but I'm not sure where to find them. I saved the data and code onto the rdp by connecting to it from the selenium box.&lt;br /&gt;
&lt;br /&gt;
2018-07-17: Ran google scholar crawler. When google scholar blocks me with a 403 error code, I exit the program and rerun it at the page that it last looked at by clicking on the correct page number before crawling. I finished running through 68/100 pages of google scholar.&lt;br /&gt;
&lt;br /&gt;
2018-07-16: Ran through 10 pages of google scholar first thing without a problem. Tried running through all 100 pages but kept on getting caught. Helped Augi with discrepancies in data and will try google scholar crawler again tomorrow.&lt;br /&gt;
&lt;br /&gt;
2018-07-13: Fixed problem where pdf urls were not saving to txt file. Created another txt file to save urls that are not pdfs. Didn't run into a single recaptcha all morning. Towards the end, it started catching me at the 7th query and forced the program to restart. For some reason, selecting the css element triggered google scholar to find me. I changed the css element tag for the &amp;quot;next&amp;quot; button to the path and I was able to get through the 4th page. It is still not able to click on the actual link but I'm not sure if that's supposed to do anything.&lt;br /&gt;
&lt;br /&gt;
2018-07-12: Figured out how to save BibTeX files to computer. Still had to do recaptcha tests. After giving it some time, I was able to run it completely once but I only got 49 BibTeX files and about half as many pdf links. When I tried to work on it further the recaptcha wasn't loading and gave me the error - &amp;quot;Cannot contact reCAPTCHA. Check your connection and try again.&amp;quot; I ended up moving to the selenium computer and spent the rest of the day converting the code to python3 and messing with regex because for some reason it wasn't matching the text correctly.&lt;br /&gt;
&lt;br /&gt;
2018-07-11: Started on Google Scholar Crawler for Patent Thicket Project. I'm not sure what the problem is. The code seems to work except that Google constantly blocks me and makes me do reCAPTCHA tests. I am also not sure if the crawler is saving any data to txt files and, if so, where those files are located.&lt;br /&gt;
&lt;br /&gt;
2018-07-10: Finished LinkedIn Crawler! When no search results were found, I skipped looking for the href and instead added the founder and company name to a txt file. Spent way too much time doing reCAPTCHA tests and logging out and logging in again because Firefox and the wifi were being slow. Cleaned up code and put it on the rdp as well as fixed the wiki page - [[Crunchbase Accelerator Founders]]&lt;br /&gt;
&lt;br /&gt;
2018-07-09: Continued working on the LinkedIn Crawler. I figured out how to get the web element of the first search result using path. You can achieve the same result by looking for the css element, which probably would have been easier. I then used get_attribute('href') to read the href from the web element of the first result and get the url of the founder. Note that previously, there was another function to click on the name on the screen, which would open another window with the profile, but I found it easier to just extract the url. Next, I will run the (hopefully) working crawler on all the founder data. Update - ran into an error when there are no search results found.&lt;br /&gt;
&lt;br /&gt;
2018-06-29: Spent a large part of today clicking on road signs and cars to prove to LinkedIn that I am not a robot. Figured out how to find search box with css element instead of xpath. Now trying to get information from search results.&lt;br /&gt;
&lt;br /&gt;
2018-06-28: Tried to figure out the xpath for the search box but came up with no solutions. Halfway through, linkedin discovered that I was a bot so I moved to the selenium computer and used the Rice Visitor wifi. Linkedin still wouldn't let me in so we made another test account (see project page for details). It finally let me in at the end of the day but I set one of the delays to be 3-4 min so that I have time to do the tests that linkedin gives to ensure that you are not a bot. I still have no idea how to find search box xpath.&lt;br /&gt;
&lt;br /&gt;
2018-06-27: Took the dictionary of accelerator to founder UUIDs and formed a table with the UUIDs combined with names of founders, gender, and linkedin_url. The file is in Z:\crunchbase2\FounderAccInfo.txt . Started looking at linkedin crawler documentation and got the crawler to get information from profiles with known urls. It crashed when it tried searching up founder names and accelerators so will work on that tomorrow. &lt;br /&gt;
&lt;br /&gt;
2018-06-26: Ended up finding all founders manually. Then talked to Ed and figured out how to get the founder data off the crunchbase API with a link (see project page). Created a python script that goes through all the API pages for each accelerator API and returns a dictionary of accelerator UUIDs mapped to founder UUIDs. I found 209 founders on the API but 224 manually so we'll look at the discrepancy tomorrow. &lt;br /&gt;
&lt;br /&gt;
2018-06-25: We took the 157 Accelerator UUIDs we found and created a new table called AccAllInfo that includes all the attributes of the accelerator that we want from organizations.csv. Maxine and I then split into our respective projects. I tried joining people to the companies they are linked to in order to find the founders of each accelerator. I found about 90 matches, but there are still a lot of holes since some accelerators have no founders and others have multiple founders. Still unsure of how to fix this.&lt;br /&gt;
&lt;br /&gt;
2018-06-22: Matched Connor's master list of accelerators with organizations.csv based on homepage_url and company_name. Found 90 that matched along with 76 blanks. Then tried matching with homepage_url or company_name and manually found about 30 more that had slight variations in url or name that we should keep. Using ILIKE we found ~25 more company UUIDs that match with accelerators on the list.&lt;br /&gt;
&lt;br /&gt;
2018-06-21: Downloaded all 17 v3.1 csv tables and updated LoadTables.sql to match our data. We did this by manually updating the name and size of the fields. To solve the problem of &amp;quot;&amp;quot; from yesterday, we used regular expressions to change the empty string to nothing (see project page). We then worked with Connor to start extracting the accelerators from the organizations in the Crunchbase data. We found a lot of null matches based on company_name and a few that have the same name but are actually different companies. Maybe try matching with homepage_url tomorrow.&lt;br /&gt;
&lt;br /&gt;
2018-06-20: Learned more SQL. Started working on [[Crunchbase Data]] project with Maxine. Old code contained 22 csv tables but new Crunchbase data only has 17 csv tables. We will be using the new Crunchbase API v3.1 (not v3) with only 17 csv tables as data. We then started updating the old SQL tables to align with the 17 tables we have. We ran into a problem where the data had a field of &amp;quot;&amp;quot; for a date type, and SQL did not like that. Ed was helping us with this but we have not found a solution yet.&lt;br /&gt;
&lt;br /&gt;
2018-06-19: Set up monitors and continued learning SQL. We were also introduced to our projects. I will be continuing Christy's work on the Google Scholar Crawler as well as working with Maxine to update the Crunchbase data and then use that data to crawl Linkedin to find data on startup founders that go through accelerators.&lt;br /&gt;
&lt;br /&gt;
2018-06-18: Introduced to the wiki, connected to RDP, and learned SQL.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt;&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Grace_Tan_(Work_Log)&amp;diff=23640</id>
		<title>Grace Tan (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Grace_Tan_(Work_Log)&amp;diff=23640"/>
		<updated>2018-07-18T21:44:55Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Staff&lt;br /&gt;
|status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] [[Work Logs]] [[Grace Tan (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
2018-07-18: Finished running the rest of the 100 pages. It took quite a long time because Google Scholar was catching me after 2-5 pages rather than 5-10. It helped to switch between the different wifi networks (Rice Visitor, Rice Owls, eduroam). Altogether, this resulted in 958 BibTeX files and 613 pdfs from 1000 entries. There might be more entries but I'm not sure where to find them. I saved the data and code onto the rdp by connecting to it from the selenium box.&lt;br /&gt;
&lt;br /&gt;
2018-07-17: Ran google scholar crawler. When google scholar blocks me with a 403 error code, I exit the program and rerun it at the page that it last looked at by clicking on the correct page number before crawling. I finished running through 68/100 pages of google scholar.&lt;br /&gt;
&lt;br /&gt;
2018-07-16: Ran through 10 pages of google scholar first thing without a problem. Tried running through all 100 pages but kept on getting caught. Helped Augi with discrepancies in data and will try google scholar crawler again tomorrow.&lt;br /&gt;
&lt;br /&gt;
2018-07-13: Fixed problem where pdf urls were not saving to txt file. Created another txt file to save urls that are not pdfs. Didn't run into a single recaptcha all morning. Towards the end, it started catching me at the 7th query and forced the program to restart. For some reason, selecting the css element triggered google scholar to find me. I changed the css element tag for the &amp;quot;next&amp;quot; button to the path and I was able to get through the 4th page. It is still not able to click on the actual link but I'm not sure if that's supposed to do anything.&lt;br /&gt;
&lt;br /&gt;
2018-07-12: Figured out how to save BibTeX files to computer. Still had to do recaptcha tests. After giving it some time, I was able to run it completely once but I only got 49 BibTeX files and about half as many pdf links. When I tried to work on it further the recaptcha wasn't loading and gave me the error - &amp;quot;Cannot contact reCAPTCHA. Check your connection and try again.&amp;quot; I ended up moving to the selenium computer and spent the rest of the day converting the code to python3 and messing with regex because for some reason it wasn't matching the text correctly.&lt;br /&gt;
&lt;br /&gt;
2018-07-11: Started on Google Scholar Crawler for Patent Thicket Project. I'm not sure what the problem is. The code seems to work except that Google constantly blocks me and makes me do reCAPTCHA tests. I am also not sure if the crawler is saving any data to txt files and, if so, where those files are located.&lt;br /&gt;
&lt;br /&gt;
2018-07-10: Finished LinkedIn Crawler! When no search results were found, I skipped looking for the href and instead added the founder and company name to a txt file. Spent way too much time doing reCAPTCHA tests and logging out and logging in again because Firefox and the wifi were being slow. Cleaned up code and put it on the rdp as well as fixed the wiki page - [[Crunchbase Accelerator Founders]]&lt;br /&gt;
&lt;br /&gt;
2018-07-09: Continued working on the LinkedIn Crawler. I figured out how to get the web element of the first search result using path. You can achieve the same result by looking for the css element, which probably would have been easier. I then used get_attribute('href') to read the href from the web element of the first result and get the url of the founder. Note that previously, there was another function to click on the name on the screen, which would open another window with the profile, but I found it easier to just extract the url. Next, I will run the (hopefully) working crawler on all the founder data. Update - ran into an error when there are no search results found.&lt;br /&gt;
&lt;br /&gt;
2018-06-29: Spent a large part of today clicking on road signs and cars to prove to LinkedIn that I am not a robot. Figured out how to find search box with css element instead of xpath. Now trying to get information from search results.&lt;br /&gt;
&lt;br /&gt;
2018-06-28: Tried to figure out the xpath for the search box but came up with no solutions. Halfway through, linkedin discovered that I was a bot so I moved to the selenium computer and used the Rice Visitor wifi. Linkedin still wouldn't let me in so we made another test account (see project page for details). It finally let me in at the end of the day but I set one of the delays to be 3-4 min so that I have time to do the tests that linkedin gives to ensure that you are not a bot. I still have no idea how to find search box xpath.&lt;br /&gt;
&lt;br /&gt;
2018-06-27: Took the dictionary of accelerator to founder UUIDs and formed a table with the UUIDs combined with names of founders, gender, and linkedin_url. The file is in Z:\crunchbase2\FounderAccInfo.txt . Started looking at linkedin crawler documentation and got the crawler to get information from profiles with known urls. It crashed when it tried searching up founder names and accelerators so will work on that tomorrow. &lt;br /&gt;
&lt;br /&gt;
2018-06-26: Ended up finding all founders manually. Then talked to Ed and figured out how to get the founder data off the crunchbase API with a link (see project page). Created a python script that goes through all the API pages for each accelerator API and returns a dictionary of accelerator UUIDs mapped to founder UUIDs. I found 209 founders on the API but 224 manually so we'll look at the discrepancy tomorrow. &lt;br /&gt;
&lt;br /&gt;
2018-06-25: We took the 157 Accelerator UUIDs we found and created a new table called AccAllInfo that includes all the attributes of the accelerator that we want from organizations.csv. Maxine and I then split into our respective projects. I tried joining people to the companies they are linked to in order to find the founders of each accelerator. I found about 90 matches, but there are still a lot of holes since some accelerators have no founders and others have multiple founders. Still unsure of how to fix this.&lt;br /&gt;
&lt;br /&gt;
2018-06-22: Matched Connor's master list of accelerators with organizations.csv based on homepage_url and company_name. Found 90 that matched along with 76 blanks. Then tried matching with homepage_url or company_name and manually found about 30 more that had slight variations in url or name that we should keep. Using ILIKE we found ~25 more company UUIDs that match with accelerators on the list.&lt;br /&gt;
&lt;br /&gt;
2018-06-21: Downloaded all 17 v3.1 csv tables and updated LoadTables.sql to match our data. We did this by manually updating the name and size of the fields. To solve the problem of &amp;quot;&amp;quot; from yesterday, we used regular expressions to change the empty string to nothing (see project page). We then worked with Connor to start extracting the accelerators from the organizations in the Crunchbase data. We found a lot of null matches based on company_name and a few that have the same name but are actually different companies. Maybe try matching with homepage_url tomorrow.&lt;br /&gt;
&lt;br /&gt;
2018-06-20: Learned more SQL. Started working on [[Crunchbase Data]] project with Maxine. Old code contained 22 csv tables but new Crunchbase data only has 17 csv tables. We will be using the new Crunchbase API v3.1 (not v3) with only 17 csv tables as data. We then started updating the old SQL tables to align with the 17 tables we have. We ran into a problem where the data had a field of &amp;quot;&amp;quot; for a date type, and SQL did not like that. Ed was helping us with this but we have not found a solution yet.&lt;br /&gt;
&lt;br /&gt;
2018-06-19: Set up monitors and continued learning SQL. We were also introduced to our projects. I will be continuing Christy's work on the Google Scholar Crawler as well as working with Maxine to update the Crunchbase data and then use that data to crawl Linkedin to find data on startup founders that go through accelerators.&lt;br /&gt;
&lt;br /&gt;
2018-06-18: Introduced to the wiki, connected to RDP, and learned SQL.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt;&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Grace_Tan_(Work_Log)&amp;diff=23620</id>
		<title>Grace Tan (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Grace_Tan_(Work_Log)&amp;diff=23620"/>
		<updated>2018-07-17T20:53:09Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Staff&lt;br /&gt;
|status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] [[Work Logs]] [[Grace Tan (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
2018-07-17: Ran google scholar crawler. When google scholar blocks me with a 403 error code, I exit the program and rerun it at the page that it last looked at by clicking on the correct page number before crawling.&lt;br /&gt;
&lt;br /&gt;
2018-07-16: Ran through 10 pages of google scholar first thing without a problem. Tried running through all 100 pages but kept on getting caught. Helped Augi with discrepancies in data and will try google scholar crawler again tomorrow.&lt;br /&gt;
&lt;br /&gt;
2018-07-13: Fixed problem where pdf urls were not saving to txt file. Created another txt file to save urls that are not pdfs. Didn't run into a single recaptcha all morning. Towards the end, it started catching me at the 7th query and forced the program to restart. For some reason, selecting the css element triggered google scholar to find me. I changed the css element tag for the &amp;quot;next&amp;quot; button to the path and I was able to get through the 4th page. It is still not able to click on the actual link but I'm not sure if that's supposed to do anything.&lt;br /&gt;
&lt;br /&gt;
2018-07-12: Figured out how to save BibTeX files to computer. Still had to do recaptcha tests. After giving it some time, I was able to run it completely once but I only got 49 BibTeX files and about half as many pdf links. When I tried to work on it further the recaptcha wasn't loading and gave me the error - &amp;quot;Cannot contact reCAPTCHA. Check your connection and try again.&amp;quot; I ended up moving to the selenium computer and spent the rest of the day converting the code to python3 and messing with regex because for some reason it wasn't matching the text correctly.&lt;br /&gt;
&lt;br /&gt;
2018-07-11: Started on Google Scholar Crawler for Patent Thicket Project. I'm not sure what the problem is. The code seems to work except that Google constantly blocks me and makes me do reCAPTCHA tests. I am also not sure if the crawler is saving any data to txt files and, if so, where those files are located.&lt;br /&gt;
&lt;br /&gt;
2018-07-10: Finished LinkedIn Crawler! When no search results were found, I skipped looking for the href and instead added the founder and company name to a txt file. Spent way too much time doing reCAPTCHA tests and logging out and logging in again because Firefox and the wifi were being slow. Cleaned up code and put it on the rdp as well as fixed the wiki page - [[Crunchbase Accelerator Founders]]&lt;br /&gt;
&lt;br /&gt;
2018-07-09: Continued working on the LinkedIn Crawler. I figured out how to get the web element of the first search result using path. You can achieve the same result by looking for the css element, which probably would have been easier. I then used get_attribute('href') to read the href from the web element of the first result and get the url of the founder. Note that previously, there was another function to click on the name on the screen, which would open another window with the profile, but I found it easier to just extract the url. Next, I will run the (hopefully) working crawler on all the founder data. Update - ran into an error when there are no search results found.&lt;br /&gt;
&lt;br /&gt;
2018-06-29: Spent a large part of today clicking on road signs and cars to prove to LinkedIn that I am not a robot. Figured out how to find search box with css element instead of xpath. Now trying to get information from search results.&lt;br /&gt;
&lt;br /&gt;
2018-06-28: Tried to figure out the xpath for the search box but came up with no solutions. Halfway through, linkedin discovered that I was a bot so I moved to the selenium computer and used the Rice Visitor wifi. Linkedin still wouldn't let me in so we made another test account (see project page for details). It finally let me in at the end of the day but I set one of the delays to be 3-4 min so that I have time to do the tests that linkedin gives to ensure that you are not a bot. I still have no idea how to find search box xpath.&lt;br /&gt;
&lt;br /&gt;
2018-06-27: Took the dictionary of accelerator to founder UUIDs and formed a table with the UUIDs combined with names of founders, gender, and linkedin_url. The file is in Z:\crunchbase2\FounderAccInfo.txt . Started looking at linkedin crawler documentation and got the crawler to get information from profiles with known urls. It crashed when it tried searching up founder names and accelerators so will work on that tomorrow. &lt;br /&gt;
&lt;br /&gt;
2018-06-26: Ended up finding all founders manually. Then talked to Ed and figured out how to get the founder data off the crunchbase API with a link (see project page). Created a python script that goes through all the API pages for each accelerator API and returns a dictionary of accelerator UUIDs mapped to founder UUIDs. I found 209 founders on the API but 224 manually so we'll look at the discrepancy tomorrow. &lt;br /&gt;
&lt;br /&gt;
2018-06-25: We took the 157 Accelerator UUIDs we found and created a new table called AccAllInfo that includes all the attributes of the accelerator that we want from organizations.csv. Maxine and I then split into our respective projects. I tried joining people to the companies they are linked to in order to find the founders of each accelerator. I found about 90 matches, but there are still a lot of holes since some accelerators have no founders and others have multiple founders. Still unsure of how to fix this.&lt;br /&gt;
&lt;br /&gt;
2018-06-22: Matched Connor's master list of accelerators with organizations.csv based on homepage_url and company_name. Found 90 that matched along with 76 blanks. Then tried matching with homepage_url or company_name and manually found about 30 more that had slight variations in url or name that we should keep. Using ILIKE we found ~25 more company UUIDs that match with accelerators on the list.&lt;br /&gt;
&lt;br /&gt;
2018-06-21: Downloaded all 17 v3.1 csv tables and updated LoadTables.sql to match our data. We did this by manually updating the name and size of the fields. To solve the problem of &amp;quot;&amp;quot; from yesterday, we used regular expressions to change the empty string to nothing (see project page). We then worked with Connor to start extracting the accelerators from the organizations in the Crunchbase data. We found a lot of null matches based on company_name and a few that have the same name but are actually different companies. Maybe try matching with homepage_url tomorrow.&lt;br /&gt;
&lt;br /&gt;
2018-06-20: Learned more SQL. Started working on [[Crunchbase Data]] project with Maxine. Old code contained 22 csv tables but new Crunchbase data only has 17 csv tables. We will be using the new Crunchbase API v3.1 (not v3) with only 17 csv tables as data. We then started updating the old SQL tables to align with the 17 tables we have. We ran into a problem where the data has &amp;quot;&amp;quot; in a date-type field and SQL did not accept that. Ed was helping us with this but we have not found a solution yet.&lt;br /&gt;
&lt;br /&gt;
2018-06-19: Set up monitors and continued learning SQL. We were also introduced to our projects. I will be continuing Christy's work on the Google Scholar Crawler as well as working with Maxine to update the Crunchbase data and then use that data to crawl Linkedin to find data on startup founders that go through accelerators.&lt;br /&gt;
&lt;br /&gt;
2018-06-18: Introduced to the wiki, connected to RDP, and learned SQL.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt;&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Grace_Tan_(Work_Log)&amp;diff=23597</id>
		<title>Grace Tan (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Grace_Tan_(Work_Log)&amp;diff=23597"/>
		<updated>2018-07-16T21:20:10Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Staff&lt;br /&gt;
|status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] [[Work Logs]] [[Grace Tan (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
2018-07-14: Ran through 10 pages of google scholar first thing without a problem. Tried running through all 100 pages but kept on getting caught. Helped Augi with discrepancies in data and will try google scholar crawler again tomorrow.&lt;br /&gt;
&lt;br /&gt;
2018-07-13: Fixed problem where pdf urls were not saving to txt file. Created another txt file to save urls that are not pdfs. Didn't run into a single recaptcha all morning. Towards the end, it started catching me at the 7th query and forced the program to restart. For some reason, selecting the css element triggered google scholar to find me. I changed the css element tag for the &amp;quot;next&amp;quot; button to the path and I was able to get through the 4th page. It is still not able to click on the actual link but I'm not sure if that's supposed to do anything.&lt;br /&gt;
&lt;br /&gt;
2018-07-12: Figured out how to save BibTeX files to computer. Still had to do recaptcha tests. After giving it some time, I was able to run it completely once but I only got 49 BibTeX files and about half as many pdf links. When I tried to work on it further the recaptcha wasn't loading and gave me the error - &amp;quot;Cannot contact reCAPTCHA. Check your connection and try again.&amp;quot; I ended up moving to the selenium computer and spent the rest of the day converting the code to python3 and messing with regex because for some reason it wasn't matching the text correctly.&lt;br /&gt;
&lt;br /&gt;
2018-07-11: Started on Google Scholar Crawler for Patent Thicket Project. I'm not sure what the problem is. The code seems to work except that Google constantly blocks me with reCAPTCHA tests. I am also not sure if the crawler is saving any data to txt files and if so, where those files are located.&lt;br /&gt;
&lt;br /&gt;
2018-07-10: Finished LinkedIn Crawler! When no search results are found, I skip the href lookup and instead add the founder and company name to a txt file. Spent way too much time doing reCAPTCHA tests and logging out and logging in again because Firefox and the wifi were being slow. Cleaned up code and put it on the rdp as well as fixed the wiki page - [[Crunchbase Accelerator Founders]]&lt;br /&gt;
&lt;br /&gt;
2018-07-09: Continued working on the LinkedIn Crawler. I figured out how to get the web element of the first search result using path. You can achieve the same result by looking for the css element which probably would have been easier. I then used get_attribute('href') on the web element of the first result to get the url of the founder. Note that previously, there was another function to click on the name on the screen which would open another window with the profile but I found it easier to just extract the url. Next, I will run the (hopefully) working crawler on all the founder data. Update - ran into an error when no search results are found.&lt;br /&gt;
&lt;br /&gt;
2018-06-29: Spent a large part of today clicking on road signs and cars to prove to LinkedIn that I am not a robot. Figured out how to find search box with css element instead of xpath. Now trying to get information from search results.&lt;br /&gt;
&lt;br /&gt;
2018-06-28: Tried to figure out the xpath for the search box but came up with no solutions. Halfway through, linkedin discovered that I was a bot so I moved to the selenium computer and used the Rice Visitor wifi. Linkedin still wouldn't let me in so we made another test account (see project page for details). It finally let me in at the end of the day but I set one of the delays to be 3-4 min so that I have time to do the tests that linkedin gives to ensure that you are not a bot. I still have no idea how to find search box xpath.&lt;br /&gt;
&lt;br /&gt;
2018-06-27: Took the dictionary of accelerator to founder UUIDs and formed a table with the UUIDs combined with names of founders, gender, and linkedin_url. The file is in Z:\crunchbase2\FounderAccInfo.txt . Started looking at linkedin crawler documentation and got the crawler to get information from profiles with known urls. It crashed when it tried searching up founder names and accelerators so will work on that tomorrow. &lt;br /&gt;
&lt;br /&gt;
2018-06-26: Ended up finding all founders manually. Then talked to Ed and figured out how to get the founder data off the crunchbase API with a link (see project page). Created a python script that goes through all the API pages for each accelerator API and returns a dictionary of accelerator UUIDs mapped to founder UUIDs. I found 209 founders on the API but 224 manually so we'll look at the discrepancy tomorrow. &lt;br /&gt;
&lt;br /&gt;
2018-06-25: We took the 157 accelerator UUIDs we found and created a new table, AccAllInfo, that includes all the attributes of each accelerator that we want from organizations.csv. Maxine and I then split into our respective projects. I tried joining people to the companies they are linked to in order to find the founders of each accelerator. I found about 90 matches, but there are still a lot of gaps since some accelerators have no founders and others have multiple founders. Still unsure of how to fix this.&lt;br /&gt;
&lt;br /&gt;
2018-06-22: Matched Connor's master list of accelerators with organizations.csv based on homepage_url and company_name. Found 90 that matched along with 76 blanks. Then tried matching with homepage_url or company_name and manually found about 30 more that had slight variations in url or name that we should keep. Using ILIKE we found ~25 more company UUIDs that match with accelerators on the list.&lt;br /&gt;
&lt;br /&gt;
2018-06-21: Downloaded all 17 v3.1 csv tables and updated LoadTables.sql to match our data. We did this by manually updating the name and size of the fields. To solve the problem of &amp;quot;&amp;quot; from yesterday, we used regular expressions to change the empty string to nothing (see project page). We then worked with Connor to start extracting the accelerators from the organizations in the Crunchbase data. We found a lot of null matches based on company_name and a few that have the same name but are actually different companies. Maybe try matching with homepage_url tomorrow.&lt;br /&gt;
&lt;br /&gt;
2018-06-20: Learned more SQL. Started working on [[Crunchbase Data]] project with Maxine. Old code contained 22 csv tables but new Crunchbase data only has 17 csv tables. We will be using the new Crunchbase API v3.1 (not v3) with only 17 csv tables as data. We then started updating the old SQL tables to align with the 17 tables we have. We ran into a problem where the data has &amp;quot;&amp;quot; in a date-type field and SQL did not accept that. Ed was helping us with this but we have not found a solution yet.&lt;br /&gt;
&lt;br /&gt;
2018-06-19: Set up monitors and continued learning SQL. We were also introduced to our projects. I will be continuing Christy's work on the Google Scholar Crawler as well as working with Maxine to update the Crunchbase data and then use that data to crawl Linkedin to find data on startup founders that go through accelerators.&lt;br /&gt;
&lt;br /&gt;
2018-06-18: Introduced to the wiki, connected to RDP, and learned SQL.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt;&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Grace_Tan_(Work_Log)&amp;diff=23563</id>
		<title>Grace Tan (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Grace_Tan_(Work_Log)&amp;diff=23563"/>
		<updated>2018-07-13T21:40:25Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Staff&lt;br /&gt;
|status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] [[Work Logs]] [[Grace Tan (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
2018-07-13: Fixed problem where pdf urls were not saving to txt file. Created another txt file to save urls that are not pdfs. Didn't run into a single recaptcha all morning. Towards the end, it started catching me at the 7th query and forced the program to restart. For some reason, selecting the css element triggered google scholar to find me. I changed the css element tag for the &amp;quot;next&amp;quot; button to the path and I was able to get through the 4th page. It is still not able to click on the actual link but I'm not sure if that's supposed to do anything.&lt;br /&gt;
&lt;br /&gt;
2018-07-12: Figured out how to save BibTeX files to computer. Still had to do recaptcha tests. After giving it some time, I was able to run it completely once but I only got 49 BibTeX files and about half as many pdf links. When I tried to work on it further the recaptcha wasn't loading and gave me the error - &amp;quot;Cannot contact reCAPTCHA. Check your connection and try again.&amp;quot; I ended up moving to the selenium computer and spent the rest of the day converting the code to python3 and messing with regex because for some reason it wasn't matching the text correctly.&lt;br /&gt;
&lt;br /&gt;
2018-07-11: Started on Google Scholar Crawler for Patent Thicket Project. I'm not sure what the problem is. The code seems to work except that Google constantly blocks me with reCAPTCHA tests. I am also not sure if the crawler is saving any data to txt files and if so, where those files are located.&lt;br /&gt;
&lt;br /&gt;
2018-07-10: Finished LinkedIn Crawler! When no search results are found, I skip the href lookup and instead add the founder and company name to a txt file. Spent way too much time doing reCAPTCHA tests and logging out and logging in again because Firefox and the wifi were being slow. Cleaned up code and put it on the rdp as well as fixed the wiki page - [[Crunchbase Accelerator Founders]]&lt;br /&gt;
&lt;br /&gt;
2018-07-09: Continued working on the LinkedIn Crawler. I figured out how to get the web element of the first search result using path. You can achieve the same result by looking for the css element which probably would have been easier. I then used get_attribute('href') on the web element of the first result to get the url of the founder. Note that previously, there was another function to click on the name on the screen which would open another window with the profile but I found it easier to just extract the url. Next, I will run the (hopefully) working crawler on all the founder data. Update - ran into an error when no search results are found.&lt;br /&gt;
&lt;br /&gt;
2018-06-29: Spent a large part of today clicking on road signs and cars to prove to LinkedIn that I am not a robot. Figured out how to find search box with css element instead of xpath. Now trying to get information from search results.&lt;br /&gt;
&lt;br /&gt;
2018-06-28: Tried to figure out the xpath for the search box but came up with no solutions. Halfway through, linkedin discovered that I was a bot so I moved to the selenium computer and used the Rice Visitor wifi. Linkedin still wouldn't let me in so we made another test account (see project page for details). It finally let me in at the end of the day but I set one of the delays to be 3-4 min so that I have time to do the tests that linkedin gives to ensure that you are not a bot. I still have no idea how to find search box xpath.&lt;br /&gt;
&lt;br /&gt;
2018-06-27: Took the dictionary of accelerator to founder UUIDs and formed a table with the UUIDs combined with names of founders, gender, and linkedin_url. The file is in Z:\crunchbase2\FounderAccInfo.txt . Started looking at linkedin crawler documentation and got the crawler to get information from profiles with known urls. It crashed when it tried searching up founder names and accelerators so will work on that tomorrow. &lt;br /&gt;
&lt;br /&gt;
2018-06-26: Ended up finding all founders manually. Then talked to Ed and figured out how to get the founder data off the crunchbase API with a link (see project page). Created a python script that goes through all the API pages for each accelerator API and returns a dictionary of accelerator UUIDs mapped to founder UUIDs. I found 209 founders on the API but 224 manually so we'll look at the discrepancy tomorrow. &lt;br /&gt;
&lt;br /&gt;
2018-06-25: We took the 157 accelerator UUIDs we found and created a new table, AccAllInfo, that includes all the attributes of each accelerator that we want from organizations.csv. Maxine and I then split into our respective projects. I tried joining people to the companies they are linked to in order to find the founders of each accelerator. I found about 90 matches, but there are still a lot of gaps since some accelerators have no founders and others have multiple founders. Still unsure of how to fix this.&lt;br /&gt;
&lt;br /&gt;
2018-06-22: Matched Connor's master list of accelerators with organizations.csv based on homepage_url and company_name. Found 90 that matched along with 76 blanks. Then tried matching with homepage_url or company_name and manually found about 30 more that had slight variations in url or name that we should keep. Using ILIKE we found ~25 more company UUIDs that match with accelerators on the list.&lt;br /&gt;
&lt;br /&gt;
2018-06-21: Downloaded all 17 v3.1 csv tables and updated LoadTables.sql to match our data. We did this by manually updating the name and size of the fields. To solve the problem of &amp;quot;&amp;quot; from yesterday, we used regular expressions to change the empty string to nothing (see project page). We then worked with Connor to start extracting the accelerators from the organizations in the Crunchbase data. We found a lot of null matches based on company_name and a few that have the same name but are actually different companies. Maybe try matching with homepage_url tomorrow.&lt;br /&gt;
&lt;br /&gt;
2018-06-20: Learned more SQL. Started working on [[Crunchbase Data]] project with Maxine. Old code contained 22 csv tables but new Crunchbase data only has 17 csv tables. We will be using the new Crunchbase API v3.1 (not v3) with only 17 csv tables as data. We then started updating the old SQL tables to align with the 17 tables we have. We ran into a problem where the data has &amp;quot;&amp;quot; in a date-type field and SQL did not accept that. Ed was helping us with this but we have not found a solution yet.&lt;br /&gt;
&lt;br /&gt;
2018-06-19: Set up monitors and continued learning SQL. We were also introduced to our projects. I will be continuing Christy's work on the Google Scholar Crawler as well as working with Maxine to update the Crunchbase data and then use that data to crawl Linkedin to find data on startup founders that go through accelerators.&lt;br /&gt;
&lt;br /&gt;
2018-06-18: Introduced to the wiki, connected to RDP, and learned SQL.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt;&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Grace_Tan_(Work_Log)&amp;diff=23537</id>
		<title>Grace Tan (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Grace_Tan_(Work_Log)&amp;diff=23537"/>
		<updated>2018-07-13T17:17:40Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Staff&lt;br /&gt;
|status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] [[Work Logs]] [[Grace Tan (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
2018-07-13: Fixed problem where pdf urls were not saving to txt file. Created another txt file to save urls that are not pdfs. Didn't run into a single recaptcha all morning. Towards the end, it started catching me at the 7th query and forced the program to restart.&lt;br /&gt;
&lt;br /&gt;
2018-07-12: Figured out how to save BibTeX files to computer. Still had to do recaptcha tests. After giving it some time, I was able to run it completely once but I only got 49 BibTeX files and about half as many pdf links. When I tried to work on it further the recaptcha wasn't loading and gave me the error - &amp;quot;Cannot contact reCAPTCHA. Check your connection and try again.&amp;quot; I ended up moving to the selenium computer and spent the rest of the day converting the code to python3 and messing with regex because for some reason it wasn't matching the text correctly.&lt;br /&gt;
&lt;br /&gt;
2018-07-11: Started on Google Scholar Crawler for Patent Thicket Project. I'm not sure what the problem is. The code seems to work except that Google constantly blocks me with reCAPTCHA tests. I am also not sure if the crawler is saving any data to txt files and if so, where those files are located.&lt;br /&gt;
&lt;br /&gt;
2018-07-10: Finished LinkedIn Crawler! When no search results are found, I skip the href lookup and instead add the founder and company name to a txt file. Spent way too much time doing reCAPTCHA tests and logging out and logging in again because Firefox and the wifi were being slow. Cleaned up code and put it on the rdp as well as fixed the wiki page - [[Crunchbase Accelerator Founders]]&lt;br /&gt;
&lt;br /&gt;
2018-07-09: Continued working on the LinkedIn Crawler. I figured out how to get the web element of the first search result using path. You can achieve the same result by looking for the css element which probably would have been easier. I then used get_attribute('href') on the web element of the first result to get the url of the founder. Note that previously, there was another function to click on the name on the screen which would open another window with the profile but I found it easier to just extract the url. Next, I will run the (hopefully) working crawler on all the founder data. Update - ran into an error when no search results are found.&lt;br /&gt;
&lt;br /&gt;
2018-06-29: Spent a large part of today clicking on road signs and cars to prove to LinkedIn that I am not a robot. Figured out how to find search box with css element instead of xpath. Now trying to get information from search results.&lt;br /&gt;
&lt;br /&gt;
2018-06-28: Tried to figure out the xpath for the search box but came up with no solutions. Halfway through, linkedin discovered that I was a bot so I moved to the selenium computer and used the Rice Visitor wifi. Linkedin still wouldn't let me in so we made another test account (see project page for details). It finally let me in at the end of the day but I set one of the delays to be 3-4 min so that I have time to do the tests that linkedin gives to ensure that you are not a bot. I still have no idea how to find search box xpath.&lt;br /&gt;
&lt;br /&gt;
2018-06-27: Took the dictionary of accelerator to founder UUIDs and formed a table with the UUIDs combined with names of founders, gender, and linkedin_url. The file is in Z:\crunchbase2\FounderAccInfo.txt . Started looking at linkedin crawler documentation and got the crawler to get information from profiles with known urls. It crashed when it tried searching up founder names and accelerators so will work on that tomorrow. &lt;br /&gt;
&lt;br /&gt;
2018-06-26: Ended up finding all founders manually. Then talked to Ed and figured out how to get the founder data off the crunchbase API with a link (see project page). Created a python script that goes through all the API pages for each accelerator API and returns a dictionary of accelerator UUIDs mapped to founder UUIDs. I found 209 founders on the API but 224 manually so we'll look at the discrepancy tomorrow. &lt;br /&gt;
&lt;br /&gt;
2018-06-25: We took the 157 accelerator UUIDs we found and created a new table, AccAllInfo, that includes all the attributes of each accelerator that we want from organizations.csv. Maxine and I then split into our respective projects. I tried joining people to the companies they are linked to in order to find the founders of each accelerator. I found about 90 matches, but there are still a lot of gaps since some accelerators have no founders and others have multiple founders. Still unsure of how to fix this.&lt;br /&gt;
&lt;br /&gt;
2018-06-22: Matched Connor's master list of accelerators with organizations.csv based on homepage_url and company_name. Found 90 that matched along with 76 blanks. Then tried matching with homepage_url or company_name and manually found about 30 more that had slight variations in url or name that we should keep. Using ILIKE we found ~25 more company UUIDs that match with accelerators on the list.&lt;br /&gt;
&lt;br /&gt;
2018-06-21: Downloaded all 17 v3.1 csv tables and updated LoadTables.sql to match our data. We did this by manually updating the name and size of the fields. To solve the problem of &amp;quot;&amp;quot; from yesterday, we used regular expressions to change the empty string to nothing (see project page). We then worked with Connor to start extracting the accelerators from the organizations in the Crunchbase data. We found a lot of null matches based on company_name and a few that have the same name but are actually different companies. Maybe try matching with homepage_url tomorrow.&lt;br /&gt;
&lt;br /&gt;
2018-06-20: Learned more SQL. Started working on [[Crunchbase Data]] project with Maxine. Old code contained 22 csv tables but new Crunchbase data only has 17 csv tables. We will be using the new Crunchbase API v3.1 (not v3) with only 17 csv tables as data. We then started updating the old SQL tables to align with the 17 tables we have. We ran into a problem where the data has &amp;quot;&amp;quot; in a date-type field and SQL did not accept that. Ed was helping us with this but we have not found a solution yet.&lt;br /&gt;
&lt;br /&gt;
2018-06-19: Set up monitors and continued learning SQL. We were also introduced to our projects. I will be continuing Christy's work on the Google Scholar Crawler as well as working with Maxine to update the Crunchbase data and then use that data to crawl Linkedin to find data on startup founders that go through accelerators.&lt;br /&gt;
&lt;br /&gt;
2018-06-18: Introduced to the wiki, connected to RDP, and learned SQL.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt;&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Grace_Tan_(Work_Log)&amp;diff=23443</id>
		<title>Grace Tan (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Grace_Tan_(Work_Log)&amp;diff=23443"/>
		<updated>2018-07-12T15:23:55Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Staff&lt;br /&gt;
|status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] [[Work Logs]] [[Grace Tan (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
2018-07-11: Started on Google Scholar Crawler for Patent Thicket Project. I'm not sure what the problem is. The code seems to work except that Google constantly blocks me with reCAPTCHA tests. I am also not sure if the crawler is saving any data to txt files and if so, where those files are located.&lt;br /&gt;
&lt;br /&gt;
2018-07-10: Finished LinkedIn Crawler! When no search results are found, I skip the href lookup and instead add the founder and company name to a txt file. Spent way too much time doing reCAPTCHA tests and logging out and logging in again because Firefox and the wifi were being slow. Cleaned up code and put it on the rdp as well as fixed the wiki page - [[Crunchbase Accelerator Founders]]&lt;br /&gt;
&lt;br /&gt;
2018-07-09: Continued working on the LinkedIn Crawler. I figured out how to get the web element of the first search result using path. You can achieve the same result by looking for the css element which probably would have been easier. I then used get_attribute('href') on the web element of the first result to get the url of the founder. Note that previously, there was another function to click on the name on the screen which would open another window with the profile but I found it easier to just extract the url. Next, I will run the (hopefully) working crawler on all the founder data. Update - ran into an error when no search results are found.&lt;br /&gt;
&lt;br /&gt;
2018-06-29: Spent a large part of today clicking on road signs and cars to prove to LinkedIn that I am not a robot. Figured out how to find search box with css element instead of xpath. Now trying to get information from search results.&lt;br /&gt;
&lt;br /&gt;
2018-06-28: Tried to figure out the xpath for the search box but came up with no solutions. Halfway through, linkedin discovered that I was a bot so I moved to the selenium computer and used the Rice Visitor wifi. Linkedin still wouldn't let me in so we made another test account (see project page for details). It finally let me in at the end of the day but I set one of the delays to be 3-4 min so that I have time to do the tests that linkedin gives to ensure that you are not a bot. I still have no idea how to find search box xpath.&lt;br /&gt;
&lt;br /&gt;
2018-06-27: Took the dictionary of accelerator to founder UUIDs and formed a table with the UUIDs combined with names of founders, gender, and linkedin_url. The file is in Z:\crunchbase2\FounderAccInfo.txt . Started looking at linkedin crawler documentation and got the crawler to get information from profiles with known urls. It crashed when it tried searching up founder names and accelerators so will work on that tomorrow. &lt;br /&gt;
&lt;br /&gt;
2018-06-26: Ended up finding all founders manually. Then talked to Ed and figured out how to get the founder data off the crunchbase API with a link (see project page). Created a python script that goes through all the API pages for each accelerator API and returns a dictionary of accelerator UUIDs mapped to founder UUIDs. I found 209 founders on the API but 224 manually so we'll look at the discrepancy tomorrow. &lt;br /&gt;
&lt;br /&gt;
2018-06-25: We took the 157 accelerator UUIDs we found and created a new table, AccAllInfo, that includes all the attributes of each accelerator that we want from organizations.csv. Maxine and I then split into our respective projects. I tried joining people to the companies they are linked to in order to find the founders of each accelerator. I found about 90 matches, but there are still a lot of gaps since some accelerators have no founders and others have multiple founders. Still unsure of how to fix this.&lt;br /&gt;
&lt;br /&gt;
2018-06-22: Matched Connor's master list of accelerators with organizations.csv based on homepage_url and company_name. Found 90 that matched along with 76 blanks. Then tried matching with homepage_url or company_name and manually found about 30 more that had slight variations in url or name that we should keep. Using ILIKE we found ~25 more company UUIDs that match with accelerators on the list.&lt;br /&gt;
&lt;br /&gt;
2018-06-21: Downloaded all 17 v3.1 csv tables and updated LoadTables.sql to match our data. We did this by manually updating the name and size of the fields. To solve the problem of &amp;quot;&amp;quot; from yesterday, we used regular expressions to change the empty string to nothing (see project page). We then worked with Connor to start extracting the accelerators from the organizations in the Crunchbase data. We found a lot of null matches based on company_name and a few that have the same name but are actually different companies. Maybe try matching with homepage_url tomorrow.&lt;br /&gt;
&lt;br /&gt;
2018-06-20: Learned more SQL. Started working on [[Crunchbase Data]] project with Maxine. Old code contained 22 csv tables but new Crunchbase data only has 17 csv tables. We will be using the new Crunchbase API v3.1 (not v3) with only 17 csv tables as data. We then started updating the old SQL tables to align with the 17 tables we have. We ran into a problem where the data has &amp;quot;&amp;quot; in a date-type field and SQL did not accept that. Ed was helping us with this but we have not found a solution yet.&lt;br /&gt;
&lt;br /&gt;
2018-06-19: Set up monitors and continued learning SQL. We were also introduced to our projects. I will be continuing Christy's work on the Google Scholar Crawler as well as working with Maxine to update the Crunchbase data and then use that data to crawl Linkedin to find data on startup founders that go through accelerators.&lt;br /&gt;
&lt;br /&gt;
2018-06-18: Introduced to the wiki, connected to RDP, and learned SQL.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt;&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Crunchbase_Accelerator_Founders&amp;diff=23417</id>
		<title>Crunchbase Accelerator Founders</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Crunchbase_Accelerator_Founders&amp;diff=23417"/>
		<updated>2018-07-11T15:24:13Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Crunchbase Accelerator Founders&lt;br /&gt;
|Has owner=Grace Tan,&lt;br /&gt;
|Has start date=6/18/18&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Related Pages==&lt;br /&gt;
[[Crunchbase Data]]&lt;br /&gt;
[[Crunchbase Accelerator Equity]]&lt;br /&gt;
[[LinkedIn Crawler (Python)]]&lt;br /&gt;
&lt;br /&gt;
==Getting Data==&lt;br /&gt;
To get the founder UUIDs for each accelerator, insert the accelerator UUID (or its name in all lowercase if it's one word) into this link:&lt;br /&gt;
  https://api.crunchbase.com/v3.1/organizations/ + UUID of organization + ?relationships=founders&amp;amp;user_key=662e263576fe3e4ea5991edfbcfb9883&lt;br /&gt;
&lt;br /&gt;
==Code==&lt;br /&gt;
===scrapefounders.py===&lt;br /&gt;
This code lives in &lt;br /&gt;
  Z:\crunchbase2\scrapefounders.py&lt;br /&gt;
&lt;br /&gt;
This program reads Accelerators and UUIDs.txt, found at Z:\crunchbase2\Accelerators and UUIDs.txt, extracts the accelerator UUIDs, and loads the information of each founder from the Crunchbase API using the link above. It then takes the information given by the API and returns a dictionary with accelerator UUIDs as keys and founder UUIDs as values.&lt;br /&gt;
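&lt;br /&gt;
The flow described above can be sketched roughly as follows. This is not the contents of scrapefounders.py. The function names are hypothetical, and the JSON layout (data, relationships, founders, items, uuid) is an assumption about the v3.1 response that should be checked against a real API page.&lt;br /&gt;

```python
import json
from urllib.parse import urlencode

API_BASE = "https://api.crunchbase.com/v3.1/organizations/"


def build_founders_url(org_uuid, user_key):
    """Build the founders link for one accelerator UUID (or one-word name)."""
    # urlencode joins the two query parameters into a single query string.
    query = urlencode({"relationships": "founders", "user_key": user_key})
    return API_BASE + org_uuid + "?" + query


def founder_uuids(response_text):
    """Pull founder UUIDs out of one API response.

    Assumed layout: data.relationships.founders.items[].uuid
    """
    payload = json.loads(response_text)
    founders = payload.get("data", {}).get("relationships", {}).get("founders", {})
    return [item["uuid"] for item in founders.get("items", [])]


def map_accelerators_to_founders(responses):
    """responses maps accelerator UUID to the raw JSON text returned for it."""
    return {acc: founder_uuids(text) for acc, text in responses.items()}
```

Fetching each URL (for example with urllib.request) is left out so the parsing step can be tried on saved responses without spending API calls.&lt;br /&gt;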
&lt;br /&gt;
==Updated LinkedIn Crawler==&lt;br /&gt;
We will be using a LinkedIn Crawler to find information about accelerator founders. There is a previous project whose code is found in &lt;br /&gt;
  E:\McNair\Projects\LinkedIn Crawler\LinkedIn_Crawler\linkedin&lt;br /&gt;
&lt;br /&gt;
My code is found in the selenium computer at the root and at&lt;br /&gt;
  E:\McNair\Projects\LinkedIn Crawler 2018&lt;br /&gt;
&lt;br /&gt;
There are 6 python files.&lt;br /&gt;
&lt;br /&gt;
===New Test Account===&lt;br /&gt;
  Username: mcboatfaceboaty670@gmail.com&lt;br /&gt;
  Password: McNair2018&lt;br /&gt;
&lt;br /&gt;
Use the selenium computer on Rice Visitor wifi.&lt;br /&gt;
After logging in a couple of times, LinkedIn will get suspicious and ask you to confirm that you are not a robot using reCaptcha. I got around this by delaying the program by 3 minutes so that I had time to complete the reCaptcha test. However, sometimes reCaptcha loses connection and it forces you to continue the tests which can be frustrating. When this happens, I disconnect and reconnect from the wifi as well as switch between the test accounts.&lt;br /&gt;
&lt;br /&gt;
I never figured out how to stop reCaptcha from losing connection so I spent a lot of time completing reCaptcha tests.&lt;br /&gt;
&lt;br /&gt;
==Code==&lt;br /&gt;
===linkedin_crawler_main.py===&lt;br /&gt;
Main function for the crawler, which opens up a window, visits the known linkedin urls of each founder, and puts the information into txt files. It then uses the search box to search for founder name + company, selects the href from the html, and opens the profile on a new page. If a founder cannot be found (there are no search results), the founder name and company are put into unavailable_profiles.txt (might be called something else but should have &amp;quot;unavailable&amp;quot; in the name).&lt;br /&gt;
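&lt;br /&gt;
A minimal sketch of the no-results handling described above, under some assumptions: the function names are hypothetical, results stands for the list of elements returned by a Selenium search for result links, and the tab-separated line format in the unavailable file is a guess.&lt;br /&gt;

```python
def first_result_url(results):
    """Return the href of the first search result, or None if there are none."""
    if not results:
        return None
    # get_attribute is the Selenium WebElement call used to read the link.
    return results[0].get_attribute("href")


def record_founder(name, company, results, unavailable_path="unavailable_profiles.txt"):
    """Return the profile url, logging the founder when no result exists."""
    url = first_result_url(results)
    if url is None:
        # Append so earlier misses are kept across runs.
        with open(unavailable_path, "a", encoding="utf-8") as handle:
            handle.write(name + "\t" + company + "\n")
    return url
```

Checking for an empty result list before touching the first element avoids the crash on searches that return nothing.&lt;br /&gt;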
&lt;br /&gt;
===linked_in_crawler.py===&lt;br /&gt;
Contains the LinkedInCrawler class, which provides functions for login, logout, search, etc. It relies heavily on locating the xpath of the element that we want on the webpage. I ran into a lot of difficulty with this because the xpaths in the old code no longer exist in the new LinkedIn html, which includes dynamic ids. To get around this, I looked for other attributes that we could use to build an xpath to the element we were looking for, and also located elements by their css.&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Grace_Tan_(Work_Log)&amp;diff=23411</id>
		<title>Grace Tan (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Grace_Tan_(Work_Log)&amp;diff=23411"/>
		<updated>2018-07-10T21:36:33Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Staff&lt;br /&gt;
|status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;onlyinclude&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Grace Tan]] [[Work Logs]] [[Grace Tan (Work Log)|(log page)]]&lt;br /&gt;
2018-07-10: Finished LinkedIn Crawler! When no search results were found, I skipped finding the href and instead added the founder and company name to a txt file. Spent way too much time doing reCaptcha tests and logging out and logging in again because Firefox and the wifi were being slow. Cleaned up code and put it on the rdp as well as fixed the wiki page - [[Crunchbase Accelerator Founders]]&lt;br /&gt;
&lt;br /&gt;
2018-07-09: Continued working on the LinkedIn Crawler. I figured out how to get the web element of the first search result using its xpath. You can achieve the same result by looking for the css element, which probably would have been easier. I then used get_attribute('href') to find the href in the web element of the first result to get the url of the founder. Note that previously, there was another function to click on the name on the screen, which would open another window with the profile, but I found it easier to just extract the url. Next, I will run the (hopefully) working crawler on all the founder data. Update - ran into an error when there are no search results found.&lt;br /&gt;
&lt;br /&gt;
2018-06-29: Spent a large part of today clicking on road signs and cars to prove to LinkedIn that I am not a robot. Figured out how to find search box with css element instead of xpath. Now trying to get information from search results.&lt;br /&gt;
&lt;br /&gt;
2018-06-28: Tried to figure out the xpath for the search box but came up with no solutions. Halfway through, linkedin discovered that I was a bot so I moved to the selenium computer and used the Rice Visitor wifi. Linkedin still wouldn't let me in so we made another test account (see project page for details). It finally let me in at the end of the day but I set one of the delays to be 3-4 min so that I have time to do the tests that linkedin gives to ensure that you are not a bot. I still have no idea how to find search box xpath.&lt;br /&gt;
&lt;br /&gt;
2018-06-27: Took the dictionary of accelerator to founder UUIDs and formed a table with the UUIDs combined with names of founders, gender, and linkedin_url. The file is in Z:\crunchbase2\FounderAccInfo.txt . Started looking at linkedin crawler documentation and got the crawler to get information from profiles with known urls. It crashed when it tried searching up founder names and accelerators so will work on that tomorrow. &lt;br /&gt;
&lt;br /&gt;
2018-06-26: Ended up finding all founders manually. Then talked to Ed and figured out how to get the founder data off the crunchbase API with a link (see project page). Created a python script that goes through all the API pages for each accelerator API and returns a dictionary of accelerator UUIDs mapped to founder UUIDs. I found 209 founders on the API but 224 manually so we'll look at the discrepancy tomorrow. &lt;br /&gt;
&lt;br /&gt;
2018-06-25: We took the 157 Accelerator UUIDs we found and created a new table called AccAllInfo that includes all the attributes of the accelerator that we want from organizations.csv. Maxine and I then split into our respective projects. I tried joining people to the companies they are linked to in order to find the founders of each accelerator. I found about 90 matches but there are still a lot of holes since some accelerators have no founders and others have multiple founders. Still unsure of how to fix this.&lt;br /&gt;
&lt;br /&gt;
2018-06-22: Matched Connor's master list of accelerators with organizations.csv based on homepage_url and company_name. Found 90 that matched along with 76 blanks. Then tried matching with homepage_url or company_name and manually found about 30 more that had slight variations in url or name that we should keep. Using ILIKE we found ~25 more company UUIDs that match with accelerators on the list.&lt;br /&gt;
&lt;br /&gt;
2018-06-21: Downloaded all 17 v3.1 csv tables and updated LoadTables.sql to match our data. We did this by manually updating the name and size of the fields. To solve the problem of &amp;quot;&amp;quot; from yesterday, we used regular expressions to change the empty string to nothing (see project page). We then worked with Connor to start extracting the accelerators from the organizations in the Crunchbase data. We found a lot of null matches based on company_name and a few that have the same name but are actually different companies. Maybe try matching with homepage_url tomorrow.&lt;br /&gt;
&lt;br /&gt;
2018-06-20: Learned more SQL. Started working on [[Crunchbase Data]] project with Maxine. Old code contained 22 csv tables but new Crunchbase data only has 17 csv tables. We will be using the new Crunchbase API v3.1 (not v3) with only 17 csv tables as data. We then started updating the old SQL tables to align with the 17 tables we have. We ran into a problem where the data contains &amp;quot;&amp;quot; in a date-type field, and SQL did not like that. Ed was helping us with this but we have not found a solution yet.&lt;br /&gt;
&lt;br /&gt;
2018-06-19: Set up monitors and continued learning SQL. We were also introduced to our projects. I will be continuing Christy's work on the Google Scholar Crawler as well as working with Maxine to update the Crunchbase data and then use that data to crawl Linkedin to find data on startup founders that go through accelerators.&lt;br /&gt;
&lt;br /&gt;
2018-06-18: Introduced to the wiki, connected to RDP, and learned SQL.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/onlyinclude&amp;gt;&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=LinkedIn_Crawler_(Python)&amp;diff=23410</id>
		<title>LinkedIn Crawler (Python)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=LinkedIn_Crawler_(Python)&amp;diff=23410"/>
		<updated>2018-07-10T21:31:23Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has Image=Web-crawler.jpg&lt;br /&gt;
|Has title=LinkedIn Crawler (Python)&lt;br /&gt;
|Has start date=April 3, 2017&lt;br /&gt;
|Has keywords=Selenium, LinkedIn, Crawler,Tool&lt;br /&gt;
}}&lt;br /&gt;
=2018 Update=&lt;br /&gt;
This Crawler was used to find information about founders of accelerators. LinkedIn had changed their website to use dynamic ids to prevent crawlers like this one!&lt;br /&gt;
&lt;br /&gt;
See here: [[Crunchbase Accelerator Founders]]&lt;br /&gt;
&lt;br /&gt;
=Overview=&lt;br /&gt;
&lt;br /&gt;
Files for this project can be found on our Git Server under the directory LinkedIn_Crawler.&lt;br /&gt;
&lt;br /&gt;
This page is dedicated to a new LinkedIn Crawler built using Selenium and Python. The goal of this project is to be able to crawl LinkedIn without being caught by LinkedIn's aggressive [https://www.linkedin.com/help/linkedin/answer/56347/prohibition-of-scraping-software?lang=en anti-scraping rules.] To do this, we will use Selenium to behave like a human, and use time delays to hide bot-like tendencies.&lt;br /&gt;
&lt;br /&gt;
The documentation for Selenium Web Driver can be found [http://selenium-python.readthedocs.io/index.html here].&lt;br /&gt;
&lt;br /&gt;
Relevant scripts can be found in the following directory:&lt;br /&gt;
 E:\McNair\Projects\LinkedIn Crawler&lt;br /&gt;
&lt;br /&gt;
The resulting data for accelerator founders can be found:&lt;br /&gt;
 E:\McNair\Projects\LinkedIn Crawler\LinkedIn_Crawler\linkedin\accelerator_founders_data&lt;br /&gt;
&lt;br /&gt;
The code from the original Summer 2016 Project can be found in:&lt;br /&gt;
 web_crawler\linkedin&lt;br /&gt;
&lt;br /&gt;
The next section will provide details on the construction and functionality of the scripts located in the linkedin directory.&lt;br /&gt;
&lt;br /&gt;
The old documentation said that the programs/scripts (see details below) are located on our [[Software Repository|Bonobo Git Server]]. &lt;br /&gt;
 repository: Web_Crawler&lt;br /&gt;
 branch: researcher/linkedin&lt;br /&gt;
 directory: /linkedin&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Accounts==&lt;br /&gt;
Test Account:&lt;br /&gt;
&lt;br /&gt;
email: testapplicat6@gmail.com&lt;br /&gt;
&lt;br /&gt;
pass: McNair2017&lt;br /&gt;
&lt;br /&gt;
Real Account:&lt;br /&gt;
&lt;br /&gt;
email: ed.edgan@rice.edu&lt;br /&gt;
&lt;br /&gt;
pass: This area has intentionally been left blank.&lt;br /&gt;
&lt;br /&gt;
=LinkedIn Scripts=&lt;br /&gt;
==Overview==&lt;br /&gt;
This section provides a file by file breakdown of the contents of the folder located at:&lt;br /&gt;
 E:\McNair\Projects\LinkedIn Crawler\web_crawler\linkedin&lt;br /&gt;
The main script to run is:&lt;br /&gt;
 run_linkedin_recruiter.py&lt;br /&gt;
&lt;br /&gt;
==run_linkedin_recruiter.py==&lt;br /&gt;
This script executes the linkedin recruiter crawler. At the top of the file, just below the imports, are three fields: username, password, and query_filepath. The username and password fields are for the desired recruiter pro account you would like to log into, and query_filepath is a pathname to a text file that contains a list of properly formatted queries that can be read by the LinkedIn Crawler's simple_search method. The following are the functions listed in the script.&lt;br /&gt;
&lt;br /&gt;
===main()===&lt;br /&gt;
This function runs the LinkedIn Crawler and will automatically begin when called from the command line. If you only want to go through some of the queries, you can change the range of the slice in line 32, and if you wish to only look at a certain number of search results, you can change the range of the slice in line 40.&lt;br /&gt;
&lt;br /&gt;
===open_new_window(driver, element)===&lt;br /&gt;
This function does a shift click on a web element to open the link in a new window. It then changes the window handler to the new window. This method makes it simple to view search results and close them in a quick manner.&lt;br /&gt;
&lt;br /&gt;
===close_window_and_return(driver)===&lt;br /&gt;
This function closes the current window, and returns to the main window. It is used in conjunction with open_new_window() to view search results and close them in an iterative manner.&lt;br /&gt;
&lt;br /&gt;
===close_tab(driver)===&lt;br /&gt;
When necessary, this function is used to close the current tab and return to the main tab. It is similar to close_window_and_return(). This function is used to log out of the account.&lt;br /&gt;
&lt;br /&gt;
==crawlererror.py==&lt;br /&gt;
This script is a simple class construction for error messages. It is used in other scripts to raise errors to the user when errors with the crawler occur. Please continue.&lt;br /&gt;
&lt;br /&gt;
==linked_in_crawler.py==&lt;br /&gt;
This script constructs a class that provides navigation functionality around the traditional LinkedIn site. The beginning section lists some global xpaths that will be used by Selenium throughout the process. These xpaths are used to locate elements within the HTML. The following are some important functions to keep in mind when designing original programs using this code.&lt;br /&gt;
&lt;br /&gt;
=== login(self, username, password)===&lt;br /&gt;
This function takes a username and password, and logs in to LinkedIn. During the process, the function uses the MouseMove move_random() function to move the mouse randomly across the screen like a crazy person.&lt;br /&gt;
&lt;br /&gt;
===logout(self)===&lt;br /&gt;
This function logs out of LinkedIn. It works by clicking on the profile picture, and then selecting logout.&lt;br /&gt;
&lt;br /&gt;
===go_back(self)===&lt;br /&gt;
This function goes back a page if you ever need to do such a thing. This function also doesn't seem to work.&lt;br /&gt;
&lt;br /&gt;
===simple_search(self, query)===&lt;br /&gt;
This function takes a string as a query, and searches it using the search box. At the end of the function's run, a page with search results relevant to your search query will be on the screen.&lt;br /&gt;
&lt;br /&gt;
===advance_search(self, query)===&lt;br /&gt;
This function uses the advanced search feature of LinkedIn. Instead of a string, this function takes in a dictionary mapping predetermined keywords to their necessary values. This function has not been debugged yet.&lt;br /&gt;
&lt;br /&gt;
===get_search_results_on_page(self)===&lt;br /&gt;
This function is supposed to return all the search results on the current page. This function has not been debugged yet.&lt;br /&gt;
&lt;br /&gt;
===get_next_search_page(self)===&lt;br /&gt;
This function is supposed to click and load the next search page if one exists. This function has not been debugged yet.&lt;br /&gt;
&lt;br /&gt;
==linked_in_crawler_recruiter.py==&lt;br /&gt;
This script constructs a class called LinkedInCrawlerRecruiter that implements functionality specifically for the Recruiter Pro feature of LinkedIn. Similar to the regular linked_in_crawler, the program begins with a list of relevant xpaths. It is followed by multiple functions. Their functionalities are listed below.&lt;br /&gt;
&lt;br /&gt;
===login(self, username, password)===&lt;br /&gt;
This function logs into a normal LinkedIn account, and then launches the Recruiter Pro session from the LinkedIn home page. At the end of the function run, there will be a window with the Recruiter Pro feature open, and the Selenium web frame will be on that window.&lt;br /&gt;
&lt;br /&gt;
===simple_search(self, query)===&lt;br /&gt;
Similar to the original LinkedIn Crawler, this function implements a basic string query search for the Recruiter Pro feature. At the end of the function run, a page will be up with the relevant search results of the search query.&lt;br /&gt;
&lt;br /&gt;
===help_search_handler_stuff(self)===&lt;br /&gt;
This function does some things on the current page in an attempt to appear more human. As of now, the function has a notes feature that will randomly jot down notes on the current page.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==utils.py==&lt;br /&gt;
This file contains a few useful functions for waiting and moving the mouse. This is the human file for this project.&lt;br /&gt;
&lt;br /&gt;
===sleep_secs(secs)===&lt;br /&gt;
This is a simple function that has the browser wait for a specified number of seconds.&lt;br /&gt;
&lt;br /&gt;
===sleep_rand(limit=__SLEEP_LIMIT__)===&lt;br /&gt;
This function has the browser wait for a random amount of time less than the user provided limit. If the user does not provide a limit, the browser waits for a random time less than 5 seconds.&lt;br /&gt;
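&lt;br /&gt;
As a rough illustration of the two waiting helpers, and not the code in utils.py itself, the sketch below assumes a 5 second default limit as stated above. SLEEP_LIMIT stands in for __SLEEP_LIMIT__, and the sleeper parameter is a hypothetical addition that lets the delay be observed without actually waiting.&lt;br /&gt;

```python
import random
import time

SLEEP_LIMIT = 5  # seconds; stand-in for the script's __SLEEP_LIMIT__


def sleep_secs(secs):
    """Wait for a fixed number of seconds."""
    time.sleep(secs)


def sleep_rand(limit=SLEEP_LIMIT, sleeper=time.sleep):
    """Wait for a random time below limit and return the delay used."""
    # random.random() is in [0, 1), so the delay stays below limit.
    delay = random.random() * limit
    sleeper(delay)
    return delay
```

Randomizing the pause between actions is what keeps the crawler from clicking at a fixed, machine-like rhythm.&lt;br /&gt;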
&lt;br /&gt;
===move_strategy1(self)===&lt;br /&gt;
This is a function within the MouseMove class. This function moves the mouse randomly across the window. It uses autopy to move the mouse across the window visibly to the user.&lt;br /&gt;
&lt;br /&gt;
===move_to(self, x=None, y=None)===&lt;br /&gt;
This is a function within the MouseMove class. Given x and y coordinates on the screen, this function will move the mouse to that point. &lt;br /&gt;
&lt;br /&gt;
===move_random(self)===&lt;br /&gt;
This function chooses a random MouseMove method and executes it.&lt;br /&gt;
&lt;br /&gt;
==web_driver.py==&lt;br /&gt;
This file contains the relevant functions from the Selenium library that are used for web driving.&lt;br /&gt;
&lt;br /&gt;
=Constructing Your Query=&lt;br /&gt;
&lt;br /&gt;
Using Recruiter to search generic terms such as &amp;quot;CompanyName Founder&amp;quot; does not turn up valuable search results. For optimal performance, it is recommended that you determine through another source the exact person you are looking for. Methods to get such information will be listed below.&lt;br /&gt;
&lt;br /&gt;
==format_founders.py==&lt;br /&gt;
Script location:&lt;br /&gt;
 TBD&lt;br /&gt;
&lt;br /&gt;
This python script takes a textfile of company names, and uses the Crunchbase Snapshot to determine the founder names of each company. If Crunchbase does not have the records of the founder, it is unlikely that a generic search on LinkedIn will provide any useful results. The script returns a new textfile with each company name replaced with &amp;quot;CompanyName Founder FounderName&amp;quot; for each founder of the company listed in the Crunchbase Snapshot. This new textfile can then be used directly with the LinkedIn Crawler to generate accurate search results, and retrieve accurate html pages.&lt;br /&gt;
&lt;br /&gt;
The following lists the functionality of functions in the format_founders.py script.&lt;br /&gt;
&lt;br /&gt;
===create_pickle()===&lt;br /&gt;
This function creates a pickled python dictionary of the Crunchbase Snapshot, people.csv. If a different dataset should be used in the future, one should pickle a dictionary in a similar fashion to this function, and then use that pickled result in the next function to reformat your queries.&lt;br /&gt;
&lt;br /&gt;
===reformat(pathname, output_filename)===&lt;br /&gt;
This function takes a textfile pathname and an output filename, and converts the textfile to searchable terms by using the data from the pickled Crunchbase Snapshot. The new textfile with the corrected queries is saved to the output filename.&lt;br /&gt;
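&lt;br /&gt;
The query rewriting can be sketched as below. This is only an illustration: reformat_queries is a hypothetical name, founders_by_company stands for the dictionary loaded from the pickle, and dropping companies with no known founders is an assumption rather than confirmed behavior of format_founders.py.&lt;br /&gt;

```python
def reformat_queries(company_names, founders_by_company):
    """Turn company names into 'CompanyName Founder FounderName' queries."""
    queries = []
    for company in company_names:
        # Assumption: companies without founders in the snapshot are skipped.
        for founder in founders_by_company.get(company, []):
            queries.append(company + " Founder " + founder)
    return queries


def write_queries(queries, output_filename):
    """Save one query per line so the crawler can read the file directly."""
    with open(output_filename, "w", encoding="utf-8") as handle:
        for query in queries:
            handle.write(query + "\n")
```

A company with two founders produces two query lines, one per founder.&lt;br /&gt;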
&lt;br /&gt;
===Results with Accelerator Data===&lt;br /&gt;
Of the 265 recorded accelerators we have data on, 94 of them have founders listed through the Crunchbase Snapshot. Some of these companies will have multiple founders with profiles, and some of these founders will not have LinkedIn profiles.&lt;br /&gt;
&lt;br /&gt;
The final data is a text file with accelerator name, founder name, profile summary, experience, and education. It can be found at:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\LinkedIn Founders Data&lt;br /&gt;
&lt;br /&gt;
=Fall 2017=&lt;br /&gt;
&lt;br /&gt;
==Accelerator Founders Search==&lt;br /&gt;
&lt;br /&gt;
'''These results are for the paper: The Jockey, The Horse, or the RaceTrack'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Our LinkedIn Recruiter Pro account has expired. Unfortunately, it turns out that profiles cannot be viewed through LinkedIn if the target profile is 3rd degree away or further. However, a Google search on such a LinkedIn profile will still let you view the profile, provided that an account has been logged into prior to the search. &lt;br /&gt;
&lt;br /&gt;
===Piggybacking Google===&lt;br /&gt;
&lt;br /&gt;
In order to get our data, we will piggyback on Google's web crawler to work around the LinkedIn protective wall. The crawler begins by logging into our test LinkedIn Account (credentials displayed at the top), and then launching a Google search for each query. By adding &amp;quot;LinkedIn&amp;quot; before the query, and &amp;quot;Founder&amp;quot; after the query, we can turn up relevant search results. The top 5 results on Google search are explored, scraped, and saved.&lt;br /&gt;
&lt;br /&gt;
We ended up not opting to use the Google method for various reasons.&lt;br /&gt;
&lt;br /&gt;
===Crunchbase API===&lt;br /&gt;
&lt;br /&gt;
Instead, we opted to use data from Crunchbase we have access to through a license. A wiki page on the crunchbase data and how to use the API can be found [http://mcnair.bakerinstitute.org/wiki/Crunchbase_Data here]. The data can be accessed either through the web API (discussed on the Crunchbase Data wiki page), or through the bulk download we have in our SQL server.&lt;br /&gt;
&lt;br /&gt;
The web API has the nice added feature of having a '''Founders''' section. The API returns a JSON when a GET request is submitted using the correct company identifier. The Founders section of this JSON contains information on the Founders of the accelerator if Crunchbase has said data. Details about the data can be found on the [http://mcnair.bakerinstitute.org/wiki/Crunchbase_Data Crunchbase Data Page]. &lt;br /&gt;
&lt;br /&gt;
The script that queried the API is called '''crunchbase_founders.py''' and can be found:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\crunchbase_founders.py&lt;br /&gt;
&lt;br /&gt;
The resulting text file, called '''founders_linkedin.txt''', containing names and linkedin URLs of founders after messing around with the database can be found:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\founders_linkedin.txt&lt;br /&gt;
&lt;br /&gt;
===Crawling LinkedIn===&lt;br /&gt;
&lt;br /&gt;
The next step of the process uses this data to get information about these founders from their LinkedIn profiles. For the founders we have linkedin URLs for, we will use those. For those we do not have linkedin URLs for, we will do a simple LinkedIn search with their name and accelerator name. The code for this crawler, '''linkedin_founders.py''' can be found:&lt;br /&gt;
 E:\McNair\Projects\LinkedIn Crawler\LinkedIn_Crawler\linkedin\linkedin_founders.py&lt;br /&gt;
&lt;br /&gt;
NOTE: Right now, this code needs to run in a virtual environment that contains Python3. This is due to the origins of the project, and this needs to be addressed when we have a lull in the development process. The only virtual environment we have managed to get working is on the Ubuntu machine sitting in the corner of the room. &lt;br /&gt;
&lt;br /&gt;
===Using the Ubuntu Virtual Environment===&lt;br /&gt;
&lt;br /&gt;
Step 1: Login using the researcher credentials. If you don't know what these are, ask someone.&lt;br /&gt;
&lt;br /&gt;
Step 2: Open the command prompt. Type:&lt;br /&gt;
 source dev/python3_venv_linkedin/bin/activate&lt;br /&gt;
&lt;br /&gt;
Your screen should now have (python3_venv_linkedin) next to any command you write. The virtual environment has been activated.&lt;br /&gt;
&lt;br /&gt;
Step 3: Change directories to: &lt;br /&gt;
  ~/dev/web_crawler/linkedin&lt;br /&gt;
&lt;br /&gt;
Step 4: All the files for any sort of LinkedIn Crawler are here. The file for this project is:&lt;br /&gt;
 linkedin_founders.py&lt;br /&gt;
&lt;br /&gt;
This file executes the crawler on all of the information stored in the file founders_linkedin.txt. Any file with the format company-tab-first name-tab-last name-tab-linkedin url-newline- will work.&lt;br /&gt;
The output of the data will be stored in founders_linkedin_main.txt, founders_linkedin_experience.txt, and founders_linkedin_education.txt.&lt;br /&gt;
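&lt;br /&gt;
The input format above can be read with a few lines of Python. This sketch is not taken from linkedin_founders.py. The field names and both function names are hypothetical, and treating a missing url as an empty string is an assumption.&lt;br /&gt;

```python
FIELDS = ("company", "first_name", "last_name", "linkedin_url")


def parse_founder_line(line):
    """Split one tab-separated line into a record keyed by FIELDS."""
    parts = line.rstrip("\r\n").split("\t")
    # Pad short lines so a missing url becomes an empty string.
    parts = (parts + [""] * len(FIELDS))[:len(FIELDS)]
    return dict(zip(FIELDS, parts))


def split_by_known_url(lines):
    """Separate founders with a recorded url from those needing a search."""
    known, unknown = [], []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = parse_founder_line(line)
        (known if record["linkedin_url"] else unknown).append(record)
    return known, unknown
```

Founders in the first list can be visited directly, while those in the second list need a name and accelerator search.&lt;br /&gt;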
&lt;br /&gt;
Step 5: To run the file, enter:&lt;br /&gt;
 python linkedin_founders.py&lt;br /&gt;
&lt;br /&gt;
The crawler will begin running automatically.&lt;br /&gt;
&lt;br /&gt;
Step 6: If you want to leave the virtual environment and return to the normal environment, simply enter the following in the command prompt:&lt;br /&gt;
 deactivate&lt;br /&gt;
&lt;br /&gt;
==LinkedIn Crawler on the RDP==&lt;br /&gt;
As of 12/18/2017, the linkedin crawler has been updated to be compatible with the RDP. Some of the bells and whistles have been removed from the ubuntu version due to download failures related to a missing vcvarsall.bat. &lt;br /&gt;
&lt;br /&gt;
Relevant files are located: &lt;br /&gt;
 E:\McNair\Projects\LinkedIn Crawler\LinkedIn_Crawler\linkedin&lt;br /&gt;
&lt;br /&gt;
===Crawling Google for unknown LinkedIn accounts===&lt;br /&gt;
For accelerator founders without a recorded LinkedIn profile, a quick google search will most likely get the correct page if the person has a LinkedIn profile. The script to run this process is in the same folder, and is called:&lt;br /&gt;
 goog_linkedin_founders.py&lt;br /&gt;
This file uses the same formatted text file for its queries.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Previous Posts about the LinkedIn Crawler=&lt;br /&gt;
== To what extent are we able to reproduce the network structure in LinkedIn (From Previous) == &lt;br /&gt;
&lt;br /&gt;
Example 1: 1st degree contact- You are connected to his profile&lt;br /&gt;
Albert Nabiullin (485 connections)&lt;br /&gt;
&lt;br /&gt;
Example 2: 2nd degree contact- You are connected to someone who is connected to him&lt;br /&gt;
Amir Kazempour Esmati (63 connections)&lt;br /&gt;
&lt;br /&gt;
Example 3: 3rd degree contact- You are connected to someone who is connected to someone else who is connected to her. &lt;br /&gt;
Linda Szabados(500+ connections) &lt;br /&gt;
&lt;br /&gt;
Any profile with a distance greater than three is defined as outside your network. &lt;br /&gt;
&lt;br /&gt;
Summary: Individual-specific network information is not accessible, even for first degree connections. Therefore, any plan to construct a network structure based on the connections of every individual is not feasible. &lt;br /&gt;
&lt;br /&gt;
It seems that the only possible direction would be using the advanced search feature.&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Crunchbase_Accelerator_Founders&amp;diff=23409</id>
		<title>Crunchbase Accelerator Founders</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Crunchbase_Accelerator_Founders&amp;diff=23409"/>
		<updated>2018-07-10T21:29:29Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Crunchbase Accelerator Founders&lt;br /&gt;
|Has owner=Grace Tan,&lt;br /&gt;
|Has start date=6/18/18&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Related Pages==&lt;br /&gt;
[[Crunchbase Data]]&lt;br /&gt;
[[Crunchbase Accelerator Equity]]&lt;br /&gt;
[[LinkedIn Crawler (Python)]]&lt;br /&gt;
&lt;br /&gt;
==Getting Data==&lt;br /&gt;
To get the founder UUIDs for each accelerator, insert the accelerator UUID (or its name in all lowercase if it's one word) into this link:&lt;br /&gt;
  https://api.crunchbase.com/v3.1/organizations/ + UUID of organization + ?relationships=founders&amp;amp;user_key=662e263576fe3e4ea5991edfbcfb9883&lt;br /&gt;
&lt;br /&gt;
==Code==&lt;br /&gt;
===scrapefounders.py===&lt;br /&gt;
This code lives in &lt;br /&gt;
  Z:\crunchbase2\scrapefounders.py&lt;br /&gt;
&lt;br /&gt;
This program reads Accelerators and UUIDs.txt, found at Z:\crunchbase2\Accelerators and UUIDs.txt, extracts the accelerator UUIDs, and loads the information of each founder from the Crunchbase API using the link above. It then takes the information given by the API and returns a dictionary with accelerator UUIDs as keys and founder UUIDs as values.&lt;br /&gt;
&lt;br /&gt;
==Updated LinkedIn Crawler==&lt;br /&gt;
We will be using a LinkedIn Crawler to find information about accelerator founders. There is a previous project whose code is found in &lt;br /&gt;
  E:/McNair/Projects/LinkedIn Crawler/LinkedIn_Crawler/linkedin&lt;br /&gt;
&lt;br /&gt;
My code is found in the selenium computer at the root and at&lt;br /&gt;
  E:/McNair/Projects/LinkedIn Crawler 2018&lt;br /&gt;
&lt;br /&gt;
There are 6 python files.&lt;br /&gt;
&lt;br /&gt;
===New Test Account===&lt;br /&gt;
  Username: mcboatfaceboaty670@gmail.com&lt;br /&gt;
  Password: McNair2018&lt;br /&gt;
&lt;br /&gt;
Use the selenium computer on Rice Visitor wifi.&lt;br /&gt;
After logging in a couple of times, LinkedIn will get suspicious and ask you to confirm that you are not a robot using reCaptcha. I got around this by delaying the program by 3 minutes so that I had time to complete the reCaptcha test. However, sometimes reCaptcha loses connection and it forces you to continue the tests which can be frustrating. When this happens, I disconnect and reconnect from the wifi as well as switch between the test accounts.&lt;br /&gt;
&lt;br /&gt;
I never figured out how to stop reCaptcha from losing connection so I spent a lot of time completing reCaptcha tests.&lt;br /&gt;
&lt;br /&gt;
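The 3 minute delay mentioned above can be sketched as a small helper. The name pause_for_captcha and the injectable sleep argument are assumptions made for illustration; the crawler itself may implement the delay differently.&lt;br /&gt;

```python
import time

# Three minutes, matching the delay described above.
CAPTCHA_PAUSE_SECONDS = 3 * 60


def pause_for_captcha(seconds=CAPTCHA_PAUSE_SECONDS, sleep=time.sleep):
    """Block so that a person can complete the reCaptcha test by hand.

    The sleep argument exists only so the wait can be replaced in a dry run.
    Returns the number of seconds waited.
    """
    sleep(seconds)
    return seconds
```
&lt;br /&gt;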
==Code==&lt;br /&gt;
===linkedin_crawler_main.py===&lt;br /&gt;
Main function for the crawler. It opens a browser window, visits the known LinkedIn URL of each founder, and writes the information into txt files. It then uses the search box to search for founder name + company, selects the href from the html, and opens the profile on a new page. If a founder cannot be found (there are no search results), the founder name and company are written to unavailable_profiles.txt (it might be called something else but should have &amp;quot;unavailable&amp;quot; in the name).&lt;br /&gt;
&lt;br /&gt;
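The control flow described above might look roughly like this. It is a sketch under assumptions: the function crawl_founders, its arguments, and the tab-separated output format are hypothetical, and the search argument stands in for the browser search that the real crawler performs.&lt;br /&gt;

```python
def crawl_founders(founders, known_urls, search, unavailable_path):
    """Resolve a profile URL for each founder and log the ones not found.

    founders: list of (name, company) pairs.
    known_urls: dict from founder name to an already known profile URL.
    search: callable taking a query string and returning a URL or None.
    unavailable_path: text file that collects founders with no result.
    """
    found = {}
    missing = []
    for name, company in founders:
        # Prefer the known URL; otherwise search for name + company.
        url = known_urls.get(name) or search(name + " " + company)
        if url:
            found[name] = url
        else:
            missing.append((name, company))
    # Append the founders with no search results to the unavailable file.
    with open(unavailable_path, "a", encoding="utf-8") as handle:
        for name, company in missing:
            handle.write(name + "\t" + company + "\n")
    return found
```
&lt;br /&gt;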
===linked_in_crawler.py===&lt;br /&gt;
Contains the LinkedInCrawler class, which has functions for login, logout, search, etc. It relies heavily on locating the xpath of the element that we want on the webpage. I ran into a lot of difficulty with this because the xpaths in the code no longer exist in the new LinkedIn html, which includes dynamic ids. To get around this, I found other attributes that we could use to build an xpath to the element we were looking for, and I also located elements by their css.&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Crunchbase_Accelerator_Founders&amp;diff=23408</id>
		<title>Crunchbase Accelerator Founders</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Crunchbase_Accelerator_Founders&amp;diff=23408"/>
		<updated>2018-07-10T21:20:48Z</updated>

		<summary type="html">&lt;p&gt;GraceTan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Crunchbase Accelerator Founders&lt;br /&gt;
|Has owner=Grace Tan,&lt;br /&gt;
|Has start date=6/18/18&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
&lt;br /&gt;
==Related Pages==&lt;br /&gt;
[[Crunchbase Data]]&lt;br /&gt;
[[Crunchbase Accelerator Equity]]&lt;br /&gt;
[[LinkedIn Crawler (Python)]]&lt;br /&gt;
&lt;br /&gt;
==Getting Data==&lt;br /&gt;
To get the founder UUIDs for each accelerator, insert the accelerator's UUID (or its name in all lowercase, if the name is a single word) into this link:&lt;br /&gt;
  https://api.crunchbase.com/v3.1/organizations/ + UUID of organization + ?relationships=founders&amp;amp;user_key=662e263576fe3e4ea5991edfbcfb9883&lt;br /&gt;
&lt;br /&gt;
==Code==&lt;br /&gt;
===scrapefounders.py===&lt;br /&gt;
This code lives in Z:\crunchbase2\scrapefounders.py&lt;br /&gt;
&lt;br /&gt;
This program reads Accelerators and UUIDs.txt, found at Z:\crunchbase2\Accelerators and UUIDs.txt, extracts the accelerator UUIDs, and loads the information for each founder from the Crunchbase API using the link above. It then takes the information given by the API and returns a dictionary with accelerator UUIDs as keys and founder UUIDs as values.&lt;br /&gt;
&lt;br /&gt;
==Updated LinkedIn Crawler==&lt;br /&gt;
We will be using a LinkedIn Crawler to find information about accelerator founders. There is a previous project whose code is found in &lt;br /&gt;
  E:/McNair/Projects/LinkedIn Crawler/LinkedIn_Crawler/linkedin&lt;br /&gt;
&lt;br /&gt;
My code is on the selenium computer, both at the root and at&lt;br /&gt;
  E:/McNair/Projects/LinkedIn Crawler 2018&lt;br /&gt;
&lt;br /&gt;
Five Python files are needed to run the crawler. I also included search.py because it was in the previous code I found, but I did not use it.&lt;br /&gt;
&lt;br /&gt;
===New Test Account===&lt;br /&gt;
  Username: mcboatfaceboaty670@gmail.com&lt;br /&gt;
  Password: McNair2018&lt;br /&gt;
&lt;br /&gt;
Use the selenium computer on Rice Visitor wifi.&lt;br /&gt;
After you log in a couple of times, LinkedIn gets suspicious and asks you to confirm that you are not a robot using reCaptcha. I got around this by delaying the program by 3 minutes so that I had time to complete the reCaptcha test. However, reCaptcha sometimes loses its connection and forces you to keep repeating the tests, which can be frustrating. When this happens, I disconnect from the wifi and reconnect, and I also switch between the test accounts.&lt;br /&gt;
&lt;br /&gt;
==Code==&lt;br /&gt;
===linkedin_crawler_main.py===&lt;br /&gt;
&lt;br /&gt;
===linked_in_crawler.py===&lt;/div&gt;</summary>
		<author><name>GraceTan</name></author>
		
	</entry>
</feed>