<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>http://www.edegan.com/mediawiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=HiraF</id>
	<title>edegan.com - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="http://www.edegan.com/mediawiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=HiraF"/>
	<link rel="alternate" type="text/html" href="http://www.edegan.com/wiki/Special:Contributions/HiraF"/>
	<updated>2026-06-02T01:16:55Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.34.2</generator>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Seed_Accelerator_Data_Assembly&amp;diff=23896</id>
		<title>Seed Accelerator Data Assembly</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Seed_Accelerator_Data_Assembly&amp;diff=23896"/>
		<updated>2018-08-02T20:45:40Z</updated>

		<summary type="html">&lt;p&gt;HiraF: /* Data assembly details */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Useful pages==&lt;br /&gt;
&lt;br /&gt;
*[[U.S. Seed Accelerators]]&lt;br /&gt;
**[[Accelerator Seed List (Data)]]&lt;br /&gt;
*[[Accelerator Demo Day]]&lt;br /&gt;
**[[Mechanical Turk (Tool)]]&lt;br /&gt;
*[[Crunchbase Accelerator Founders]]&lt;br /&gt;
**[[Crunchbase Data#Collecting Company Information]]&lt;br /&gt;
**[[Merging Existing Data with Crunchbase]]&lt;br /&gt;
*[[Whois Parser]]&lt;br /&gt;
*[[URL Finder (Tool)]]&lt;br /&gt;
*[[Industry Classifier]]&lt;br /&gt;
*[[Industry classifier yang]]&lt;br /&gt;
*[[VentureXpert Data]]&lt;br /&gt;
&lt;br /&gt;
Please add (or subtract) other relevant (or irrelevant) pages!&lt;br /&gt;
&lt;br /&gt;
==Database specification==&lt;br /&gt;
&lt;br /&gt;
===Preamble===&lt;br /&gt;
&lt;br /&gt;
We need to get the data into approximately 3NF to prevent errors and make it more useable.&lt;br /&gt;
&lt;br /&gt;
Inputs:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\The File To Rule Them All.xlsx&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit#gid=0&lt;br /&gt;
&lt;br /&gt;
Recent work on the above from Connor:&lt;br /&gt;
*made the Demo Day Timing Google sheet as clean as possible (fixed dates, removed duplicates, created season column)&lt;br /&gt;
*recoded the employee count&lt;br /&gt;
*normalized the investment amount&lt;br /&gt;
&lt;br /&gt;
Connor's next to do: &lt;br /&gt;
*Founders Experience: code job title&lt;br /&gt;
*Founders Education: remove unknowns, code degree and code major&lt;br /&gt;
*Fix multiple campus and cohorts (see below). &lt;br /&gt;
&lt;br /&gt;
This is documented on [[U.S._Seed_Accelerators#Update_for_Hira]].&lt;br /&gt;
&lt;br /&gt;
The current database work is in '''vcdb2'''. The code to build the relevant tables in vcdb2 is in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\LoadAcceleratorDataV2.sql&lt;br /&gt;
 &lt;br /&gt;
Note that AddCBData.sql can be run on '''crunchbase2''' to get the relevant crunchbase data for import into vcdb2 (or elsewhere).&lt;br /&gt;
&lt;br /&gt;
However, we should build a new database for this project:&lt;br /&gt;
 createdb accelerator&lt;br /&gt;
&lt;br /&gt;
As well as a new folder (Z:\accelerator) for the data.&lt;br /&gt;
&lt;br /&gt;
The work in vcdb2 is essentially to build the data that goes into the File To Rule Them All.xlsx. We should start from there!&lt;br /&gt;
&lt;br /&gt;
===Suggested Spec===&lt;br /&gt;
&lt;br /&gt;
We need to address the issue with multiple campus and cohorts. This will require loading and manipulation of the data in SQL (Hira) as well as some manual fixes to the data (Connor).&lt;br /&gt;
&lt;br /&gt;
Accelerator Table (fieldname	colname/DISCARD, from sheet &amp;quot;Accelerators Final&amp;quot;)&lt;br /&gt;
*acceleratorname (Primary Key)	Accelerators&lt;br /&gt;
*url	homepage_url&lt;br /&gt;
*cohorturl	cohort page URL&lt;br /&gt;
*cohortlisting	Break out cohorts on the website? (Y/N)&lt;br /&gt;
*type	Type&lt;br /&gt;
*alive	Alive&lt;br /&gt;
*typenote	Type notes&lt;br /&gt;
*weeks	Weeks&lt;br /&gt;
*durationnotes	Duration Notes&lt;br /&gt;
*city	city	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*state	state/region	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*foundingdate	Creation Date&lt;br /&gt;
*terms	Terms of Joining Accelerator (non-equity pay/equity/free)&lt;br /&gt;
*equity	Equity?&lt;br /&gt;
*equityamountdesc	Equity Amount&lt;br /&gt;
*equityamount	Equity Amount Normalized (only 1s, range is averaged)&lt;br /&gt;
*investmentamountdesc	Investment Amount &lt;br /&gt;
*investmentamount	Investment Amount Normalized (Midpoint) &lt;br /&gt;
*investmentnotes	Investment Notes&lt;br /&gt;
*discard industry	DISCARD&lt;br /&gt;
*industry	Industry	&lt;br /&gt;
*specgenindu	Specified/General&lt;br /&gt;
*address	Address	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*subtype	DISCARD&lt;br /&gt;
*nonprofit	nonprofit?&lt;br /&gt;
*studentfocus	designed for students (Y/N)&lt;br /&gt;
*multicampus	Multiple campuses? (Y/N)&lt;br /&gt;
*software tech (Y/N)	DISCARD&lt;br /&gt;
*stagepref	What stage do they look for in cohort companies (SEL)&lt;br /&gt;
*cbconame	Name in Crunchbase&lt;br /&gt;
*uuid	UUID&lt;br /&gt;
&lt;br /&gt;
Note that we will assume that duration and terms are common across all cohorts run by the same accelerator. We might need to revisit this assumption.&lt;br /&gt;
&lt;br /&gt;
CohortCompany Table (fieldname	colname/DISCARD, from sheet &amp;quot;Cohorts Final&amp;quot;)&lt;br /&gt;
*conamestd (primary key)&lt;br /&gt;
*coname&lt;br /&gt;
*conameorg&lt;br /&gt;
*colocation&lt;br /&gt;
*city&lt;br /&gt;
*state_code&lt;br /&gt;
*country_code&lt;br /&gt;
*address&lt;br /&gt;
*codescription&lt;br /&gt;
*short_desc&lt;br /&gt;
*long_desc&lt;br /&gt;
*cosectors&lt;br /&gt;
*costatus&lt;br /&gt;
*cofundingstage&lt;br /&gt;
*courl&lt;br /&gt;
*uuid&lt;br /&gt;
*category_list&lt;br /&gt;
*category_group_list&lt;br /&gt;
*founded_on&lt;br /&gt;
*employee_count&lt;br /&gt;
*emp_count_scale&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*gotvc&lt;br /&gt;
&lt;br /&gt;
We then need to deduplicate this table and make sure that conamestd is a valid primary key.&lt;br /&gt;
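&lt;br /&gt;
A minimal sketch of the primary key check, assuming the table has been loaded under the name cohortcompany. It uses Python's built-in sqlite3 module with an in-memory database so that it runs on its own; the project database is created with createdb, so the same GROUP BY query would need to be run there instead. The helper name duplicate_keys and the sample rows are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3


def duplicate_keys(conn, table, key):
    """Return (key value, row count) pairs for key values that occur more than once."""
    # table and key are trusted identifiers here; never pass user input.
    sql = (
        f"SELECT {key}, COUNT(*) FROM {table} "
        f"GROUP BY {key} HAVING COUNT(*) != 1 ORDER BY {key}"
    )
    return conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cohortcompany (conamestd TEXT, city TEXT)")
# Hypothetical rows for illustration only, not real project data.
conn.executemany(
    "INSERT INTO cohortcompany VALUES (?, ?)",
    [("acme", "Austin"), ("acme", "Austin"), ("bolt", "Boulder")],
)
dupes = duplicate_keys(conn, "cohortcompany", "conamestd")
print(dupes)  # an empty list would mean conamestd is unique
```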
&lt;br /&gt;
Note that we won't be taking the following from &amp;quot;Cohorts Final&amp;quot; for this table:&lt;br /&gt;
*year&lt;br /&gt;
*accelerator&lt;br /&gt;
*cohort&lt;br /&gt;
*quarter&lt;br /&gt;
*acclocation&lt;br /&gt;
*accperks	DISCARD&lt;br /&gt;
*cofounder	DISCARD&lt;br /&gt;
&lt;br /&gt;
CohortParticipation&lt;br /&gt;
*Cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
*accelerator	(foreign key)&lt;br /&gt;
*conamestd (foreign key)&lt;br /&gt;
&lt;br /&gt;
===Going further===&lt;br /&gt;
&lt;br /&gt;
We should likely rebuild the cohort variable to make it a sole &amp;quot;primary key&amp;quot; for a cohort. This would mean turning each entry of cohort into something unique like: TechStars Boulder Fall 2017 or 1440 Cohort 1, so that this key could look up the year of the cohort, its quarter, etc. We could then break CohortParticipation into two tables:&lt;br /&gt;
&lt;br /&gt;
CohortParticipation&lt;br /&gt;
*Cohort&lt;br /&gt;
*accelerator	(foreign key)&lt;br /&gt;
*conamestd (foreign key)&lt;br /&gt;
&lt;br /&gt;
Cohort&lt;br /&gt;
*Cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
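&lt;br /&gt;
The split described above could be sketched as follows. This is an illustration under assumptions, not the project's load script: it uses Python's built-in sqlite3 module with an in-memory database, the source table is assumed to be called cohortsfinal, and the sample rows (including their years and quarters) are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cohortsfinal (
  cohort TEXT, year INTEGER, quarter INTEGER, accelerator TEXT, conamestd TEXT
);
-- Hypothetical rows for illustration only, not real project data.
INSERT INTO cohortsfinal VALUES
  ('TechStars Boulder Fall 2017', 2017, 4, 'TechStars', 'acme'),
  ('TechStars Boulder Fall 2017', 2017, 4, 'TechStars', 'bolt'),
  ('1440 Cohort 1', 2016, 1, '1440', 'cargo');

-- One row per cohort: the unique cohort name looks up year and quarter.
-- The insert fails if one cohort name carries two different year/quarter pairs.
CREATE TABLE cohort (cohort TEXT PRIMARY KEY, year INTEGER, quarter INTEGER);
INSERT INTO cohort SELECT DISTINCT cohort, year, quarter FROM cohortsfinal;

-- One row per company per cohort.
CREATE TABLE cohortparticipation (
  cohort TEXT REFERENCES cohort(cohort),
  accelerator TEXT,
  conamestd TEXT,
  PRIMARY KEY (cohort, conamestd)
);
INSERT INTO cohortparticipation
  SELECT DISTINCT cohort, accelerator, conamestd FROM cohortsfinal;
""")

cohort_count = conn.execute("SELECT COUNT(*) FROM cohort").fetchone()[0]
# Year and quarter are now recovered through the cohort key.
joined = conn.execute(
    "SELECT p.conamestd, c.year, c.quarter FROM cohortparticipation p "
    "JOIN cohort c ON c.cohort = p.cohort ORDER BY p.conamestd"
).fetchall()
print(cohort_count, joined)
```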
&lt;br /&gt;
We could then add campus and a separate table for campuses:&lt;br /&gt;
&lt;br /&gt;
Campus Table&lt;br /&gt;
*CampusName (e.g., Techstars Boulder)&lt;br /&gt;
*Accelerator (foreign key)&lt;br /&gt;
*Address&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Zip&lt;br /&gt;
*Description &lt;br /&gt;
&lt;br /&gt;
===Founders Information===&lt;br /&gt;
&lt;br /&gt;
The three founder sheets turn into three tables nicely. We don't need to renormalize them for now, just fix up their variables and do some matching on employer.&lt;br /&gt;
&lt;br /&gt;
Founders:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*Current Job&lt;br /&gt;
*Current Location&lt;br /&gt;
&lt;br /&gt;
FoundersExperience:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*Employer&lt;br /&gt;
*Job Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location&lt;br /&gt;
*Extra Description&lt;br /&gt;
&lt;br /&gt;
FoundersEducation:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
==To do/For consideration==&lt;br /&gt;
&lt;br /&gt;
Minh:&lt;br /&gt;
*Create a format for collecting timing data&lt;br /&gt;
*Put the timing job on Turk&lt;br /&gt;
&lt;br /&gt;
Connor:&lt;br /&gt;
*Try for missing timings that we really (new process) need&lt;br /&gt;
*Col W should be headquarter address&lt;br /&gt;
*What stage-  Clean up?&lt;br /&gt;
&lt;br /&gt;
Maxine: &lt;br /&gt;
*Build the google URL finder&lt;br /&gt;
*Industry classification from description&lt;br /&gt;
&lt;br /&gt;
Grace:&lt;br /&gt;
*Process and join in the new timing data - the new data is located in Z:/accelerator/Formatted Timing Info.txt&lt;br /&gt;
*Make a category group to minorcode lookup&lt;br /&gt;
*Run WHOIS crawler on all valid URLs (not facebook pages, etc.)&lt;br /&gt;
*Founders Experience: Match Employers to VC funds/firms, VC backed startups (requires data from Augi)&lt;br /&gt;
&lt;br /&gt;
==Accelerator Data Assembly Progress (Hira) == &lt;br /&gt;
&lt;br /&gt;
*All data files are in Z:/accelerator&lt;br /&gt;
*The SQL file that loads all data is: LoadAccData.sql. It is located in E:\McNair\Projects\Accelerators\Summer 2018.&lt;br /&gt;
&lt;br /&gt;
=== Data assembly details ===&lt;br /&gt;
&lt;br /&gt;
The SQL file LoadAccData.sql currently loads the Cohorts Final and Founders data from:&lt;br /&gt;
  E:\McNair\Projects\Accelerators\Summer 2018\The File To Rule Them All.xlsx&lt;br /&gt;
&lt;br /&gt;
It creates the following tables:&lt;br /&gt;
&lt;br /&gt;
1) cohortsfinal - source file:  Cohorts Final sheet in &amp;quot;The File to Rule Them All&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
2) CohortCompany - this uses data in cohortsfinal and creates a table with the following:&lt;br /&gt;
*conamestd&lt;br /&gt;
*conameorg&lt;br /&gt;
*colocation&lt;br /&gt;
*city&lt;br /&gt;
*state_code&lt;br /&gt;
*country_code&lt;br /&gt;
*address&lt;br /&gt;
*codescription&lt;br /&gt;
*short_desc&lt;br /&gt;
*long_desc&lt;br /&gt;
*cosectors&lt;br /&gt;
*costatus&lt;br /&gt;
*cofundingstage&lt;br /&gt;
*courl&lt;br /&gt;
*uuid&lt;br /&gt;
*category_list&lt;br /&gt;
*category_group_list&lt;br /&gt;
*founded_on&lt;br /&gt;
*employee_count&lt;br /&gt;
*emp_count_scale&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*gotvc&lt;br /&gt;
&lt;br /&gt;
3) CohortParticipation - uses table cohortsfinal&lt;br /&gt;
*cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
*accelerator&lt;br /&gt;
*conamestd&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
4) timing_final - This table is based on the most updated information on timing compiled in source file:  Z:/accelerator/Formatted Timing Info.txt (by Grace). It includes:&lt;br /&gt;
*coname&lt;br /&gt;
*acceleratorname&lt;br /&gt;
*keyword&lt;br /&gt;
*url&lt;br /&gt;
*webpage &lt;br /&gt;
*predicted &lt;br /&gt;
*gooddata&lt;br /&gt;
*page_details &lt;br /&gt;
*full_date &lt;br /&gt;
*month&lt;br /&gt;
*year&lt;br /&gt;
*cohort_name&lt;br /&gt;
*notes&lt;br /&gt;
*prog_duration_wks&lt;br /&gt;
*actual_date &lt;br /&gt;
*actual_month&lt;br /&gt;
*actual_year &lt;br /&gt;
*season &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
5) Founders - source file: &amp;quot;The File to Rule Them All - Founders main sheet&amp;quot;&lt;br /&gt;
*Accelerator &lt;br /&gt;
*First_Name&lt;br /&gt;
*Last_Name&lt;br /&gt;
*Full_Name&lt;br /&gt;
*Current_Job &lt;br /&gt;
*Current_Location&lt;br /&gt;
&lt;br /&gt;
6) founders_experience - source file: &amp;quot;The File to Rule Them All - Founders experience sheet&amp;quot;&lt;br /&gt;
*Accelerator &lt;br /&gt;
*First_Name &lt;br /&gt;
*Last_Name&lt;br /&gt;
*Full_Name&lt;br /&gt;
*Employer &lt;br /&gt;
*VC &lt;br /&gt;
*VC_backed_startup &lt;br /&gt;
*OLD_Job_Title&lt;br /&gt;
*NEW_Job_Title &lt;br /&gt;
*Dates_Employed&lt;br /&gt;
*Time_Employed&lt;br /&gt;
*Location &lt;br /&gt;
*Extra_Description&lt;br /&gt;
&lt;br /&gt;
7) additional_timing_info - source file: &amp;quot;merging_work.xlxs&amp;quot; located in: E:\Projects\McNair\Seed DB&lt;br /&gt;
8) additional_timing_info2 - source file: &amp;quot;formatted timing info2.txt&amp;quot; located in E:\Projects\McNair\Accelerators\Summer 2018. This was collected through Mechanical Turk. &lt;br /&gt;
Tables 7 and 8 include columns:&lt;br /&gt;
*coname&lt;br /&gt;
*acceleratorname&lt;br /&gt;
*cohort_name&lt;br /&gt;
*date&lt;br /&gt;
*month&lt;br /&gt;
*year&lt;br /&gt;
*season&lt;br /&gt;
&lt;br /&gt;
9) timing_combined - This table combines all the timing information we have by appending tables 4, 7 and 8.&lt;br /&gt;
10) cohortcompanies_wtiming - merges data in tables cohortcompany and timing_combined&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Seed_Accelerator_Data_Assembly&amp;diff=23895</id>
		<title>Seed Accelerator Data Assembly</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Seed_Accelerator_Data_Assembly&amp;diff=23895"/>
		<updated>2018-08-02T18:24:23Z</updated>

		<summary type="html">&lt;p&gt;HiraF: /* Data assembly details */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Useful pages==&lt;br /&gt;
&lt;br /&gt;
*[[U.S. Seed Accelerators]]&lt;br /&gt;
**[[Accelerator Seed List (Data)]]&lt;br /&gt;
*[[Accelerator Demo Day]]&lt;br /&gt;
**[[Mechanical Turk (Tool)]]&lt;br /&gt;
*[[Crunchbase Accelerator Founders]]&lt;br /&gt;
**[[Crunchbase Data#Collecting Company Information]]&lt;br /&gt;
**[[Merging Existing Data with Crunchbase]]&lt;br /&gt;
*[[Whois Parser]]&lt;br /&gt;
*[[URL Finder (Tool)]]&lt;br /&gt;
*[[Industry Classifier]]&lt;br /&gt;
*[[Industry classifier yang]]&lt;br /&gt;
*[[VentureXpert Data]]&lt;br /&gt;
&lt;br /&gt;
Please add (or subtract) other relevant (or irrelevant) pages!&lt;br /&gt;
&lt;br /&gt;
==Database specification==&lt;br /&gt;
&lt;br /&gt;
===Preamble===&lt;br /&gt;
&lt;br /&gt;
We need to get the data into approximately 3NF to prevent errors and make it more useable.&lt;br /&gt;
&lt;br /&gt;
Inputs:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\The File To Rule Them All.xlsx&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit#gid=0&lt;br /&gt;
&lt;br /&gt;
Recent work on the above from Connor:&lt;br /&gt;
*made the Demo Day Timing Google sheet as clean as possible (fixed dates, removed duplicates, created season column)&lt;br /&gt;
*recoded the employee count&lt;br /&gt;
*normalized the investment amount&lt;br /&gt;
&lt;br /&gt;
Connor's next to do: &lt;br /&gt;
*Founders Experience: code job title&lt;br /&gt;
*Founders Education: remove unknowns, code degree and code major&lt;br /&gt;
*Fix multiple campus and cohorts (see below). &lt;br /&gt;
&lt;br /&gt;
This is documented on [[U.S._Seed_Accelerators#Update_for_Hira]].&lt;br /&gt;
&lt;br /&gt;
The current database work is in '''vcdb2'''. The code to build the relevant tables in vcdb2 is in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\LoadAcceleratorDataV2.sql&lt;br /&gt;
 &lt;br /&gt;
Note that AddCBData.sql can be run on '''crunchbase2''' to get the relevant crunchbase data for import into vcdb2 (or elsewhere).&lt;br /&gt;
&lt;br /&gt;
However, we should build a new database for this project:&lt;br /&gt;
 createdb accelerator&lt;br /&gt;
&lt;br /&gt;
As well as a new folder (Z:\accelerator) for the data.&lt;br /&gt;
&lt;br /&gt;
The work in vcdb2 is essentially to build the data that goes into the File To Rule Them All.xlsx. We should start from there!&lt;br /&gt;
&lt;br /&gt;
===Suggested Spec===&lt;br /&gt;
&lt;br /&gt;
We need to address the issue with multiple campus and cohorts. This will require loading and manipulation of the data in SQL (Hira) as well as some manual fixes to the data (Connor).&lt;br /&gt;
&lt;br /&gt;
Accelerator Table (fieldname	colname/DISCARD, from sheet &amp;quot;Accelerators Final&amp;quot;)&lt;br /&gt;
*acceleratorname (Primary Key)	Accelerators&lt;br /&gt;
*url	homepage_url&lt;br /&gt;
*cohorturl	cohort page URL&lt;br /&gt;
*cohortlisting	Break out cohorts on the website? (Y/N)&lt;br /&gt;
*type	Type&lt;br /&gt;
*alive	Alive&lt;br /&gt;
*typenote	Type notes&lt;br /&gt;
*weeks	Weeks&lt;br /&gt;
*durationnotes	Duration Notes&lt;br /&gt;
*city	city	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*state	state/region	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*foundingdate	Creation Date&lt;br /&gt;
*terms	Terms of Joining Accelerator (non-equity pay/equity/free)&lt;br /&gt;
*equity	Equity?&lt;br /&gt;
*equityamountdesc	Equity Amount&lt;br /&gt;
*equityamount	Equity Amount Normalized (only 1s, range is averaged)&lt;br /&gt;
*investmentamountdesc	Investment Amount &lt;br /&gt;
*investmentamount	Investment Amount Normalized (Midpoint) &lt;br /&gt;
*investmentnotes	Investment Notes&lt;br /&gt;
*discard industry	DISCARD&lt;br /&gt;
*industry	Industry	&lt;br /&gt;
*specgenindu	Specified/General&lt;br /&gt;
*address	Address	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*subtype	DISCARD&lt;br /&gt;
*nonprofit	nonprofit?&lt;br /&gt;
*studentfocus	designed for students (Y/N)&lt;br /&gt;
*multicampus	Multiple campuses? (Y/N)&lt;br /&gt;
*software tech (Y/N)	DISCARD&lt;br /&gt;
*stagepref	What stage do they look for in cohort companies (SEL)&lt;br /&gt;
*cbconame	Name in Crunchbase&lt;br /&gt;
*uuid	UUID&lt;br /&gt;
&lt;br /&gt;
Note that we will assume that duration and terms are common across all cohorts run by the same accelerator. We might need to revisit this assumption.&lt;br /&gt;
&lt;br /&gt;
CohortCompany Table (fieldname	colname/DISCARD, from sheet &amp;quot;Cohorts Final&amp;quot;)&lt;br /&gt;
*conamestd (primary key)&lt;br /&gt;
*coname&lt;br /&gt;
*conameorg&lt;br /&gt;
*colocation&lt;br /&gt;
*city&lt;br /&gt;
*state_code&lt;br /&gt;
*country_code&lt;br /&gt;
*address&lt;br /&gt;
*codescription&lt;br /&gt;
*short_desc&lt;br /&gt;
*long_desc&lt;br /&gt;
*cosectors&lt;br /&gt;
*costatus&lt;br /&gt;
*cofundingstage&lt;br /&gt;
*courl&lt;br /&gt;
*uuid&lt;br /&gt;
*category_list&lt;br /&gt;
*category_group_list&lt;br /&gt;
*founded_on&lt;br /&gt;
*employee_count&lt;br /&gt;
*emp_count_scale&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*gotvc&lt;br /&gt;
&lt;br /&gt;
We then need to deduplicate this table and make sure that conamestd is a valid primary key.&lt;br /&gt;
&lt;br /&gt;
Note that we won't be taking the following from &amp;quot;Cohorts Final&amp;quot; for this table:&lt;br /&gt;
*year&lt;br /&gt;
*accelerator&lt;br /&gt;
*cohort&lt;br /&gt;
*quarter&lt;br /&gt;
*acclocation&lt;br /&gt;
*accperks	DISCARD&lt;br /&gt;
*cofounder	DISCARD&lt;br /&gt;
&lt;br /&gt;
CohortParticipation&lt;br /&gt;
*Cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
*accelerator	(foreign key)&lt;br /&gt;
*conamestd (foreign key)&lt;br /&gt;
&lt;br /&gt;
===Going further===&lt;br /&gt;
&lt;br /&gt;
We should likely rebuild the cohort variable to make it a sole &amp;quot;primary key&amp;quot; for a cohort. This would mean turning each entry of cohort into something unique like: TechStars Boulder Fall 2017 or 1440 Cohort 1, so that this key could look up the year of the cohort, its quarter, etc. We could then break CohortParticipation into two tables:&lt;br /&gt;
&lt;br /&gt;
CohortParticipation&lt;br /&gt;
*Cohort&lt;br /&gt;
*accelerator	(foreign key)&lt;br /&gt;
*conamestd (foreign key)&lt;br /&gt;
&lt;br /&gt;
Cohort&lt;br /&gt;
*Cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
&lt;br /&gt;
We could then add campus and a separate table for campuses:&lt;br /&gt;
&lt;br /&gt;
Campus Table&lt;br /&gt;
*CampusName (e.g., Techstars Boulder)&lt;br /&gt;
*Accelerator (foreign key)&lt;br /&gt;
*Address&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Zip&lt;br /&gt;
*Description &lt;br /&gt;
&lt;br /&gt;
===Founders Information===&lt;br /&gt;
&lt;br /&gt;
The three founder sheets turn into three tables nicely. We don't need to renormalize them for now, just fix up their variables and do some matching on employer.&lt;br /&gt;
&lt;br /&gt;
Founders:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*Current Job&lt;br /&gt;
*Current Location&lt;br /&gt;
&lt;br /&gt;
FoundersExperience:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*Employer&lt;br /&gt;
*Job Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location&lt;br /&gt;
*Extra Description&lt;br /&gt;
&lt;br /&gt;
FoundersEducation:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
==To do/For consideration==&lt;br /&gt;
&lt;br /&gt;
Minh:&lt;br /&gt;
*Create a format for collecting timing data&lt;br /&gt;
*Put the timing job on Turk&lt;br /&gt;
&lt;br /&gt;
Connor:&lt;br /&gt;
*Try for missing timings that we really (new process) need&lt;br /&gt;
*Col W should be headquarter address&lt;br /&gt;
*What stage-  Clean up?&lt;br /&gt;
&lt;br /&gt;
Maxine: &lt;br /&gt;
*Build the google URL finder&lt;br /&gt;
*Industry classification from description&lt;br /&gt;
&lt;br /&gt;
Grace:&lt;br /&gt;
*Process and join in the new timing data - the new data is located in Z:/accelerator/Formatted Timing Info.txt&lt;br /&gt;
*Make a category group to minorcode lookup&lt;br /&gt;
*Run WHOIS crawler on all valid URLs (not facebook pages, etc.)&lt;br /&gt;
*Founders Experience: Match Employers to VC funds/firms, VC backed startups (requires data from Augi)&lt;br /&gt;
&lt;br /&gt;
==Accelerator Data Assembly Progress (Hira) == &lt;br /&gt;
&lt;br /&gt;
*All data files are in Z:/accelerator&lt;br /&gt;
*The SQL file that loads all data is: LoadAccData.sql. It is located in E:\McNair\Projects\Accelerators\Summer 2018.&lt;br /&gt;
&lt;br /&gt;
=== Data assembly details ===&lt;br /&gt;
&lt;br /&gt;
The SQL file LoadAccData.sql currently loads the Cohorts Final and Founders data from:&lt;br /&gt;
  E:\McNair\Projects\Accelerators\Summer 2018\The File To Rule Them All.xlsx&lt;br /&gt;
&lt;br /&gt;
It creates the following tables:&lt;br /&gt;
&lt;br /&gt;
1) cohortsfinal - source file:  Cohorts Final sheet in &amp;quot;The File to Rule Them All&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
2) CohortCompany - this uses data in cohortsfinal and creates a table with the following:&lt;br /&gt;
*conamestd&lt;br /&gt;
*conameorg&lt;br /&gt;
*colocation&lt;br /&gt;
*city&lt;br /&gt;
*state_code&lt;br /&gt;
*country_code&lt;br /&gt;
*address&lt;br /&gt;
*codescription&lt;br /&gt;
*short_desc&lt;br /&gt;
*long_desc&lt;br /&gt;
*cosectors&lt;br /&gt;
*costatus&lt;br /&gt;
*cofundingstage&lt;br /&gt;
*courl&lt;br /&gt;
*uuid&lt;br /&gt;
*category_list&lt;br /&gt;
*category_group_list&lt;br /&gt;
*founded_on&lt;br /&gt;
*employee_count&lt;br /&gt;
*emp_count_scale&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*gotvc&lt;br /&gt;
&lt;br /&gt;
3) CohortParticipation - uses table cohortsfinal&lt;br /&gt;
*cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
*accelerator&lt;br /&gt;
*conamestd&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
4) timing_final - This table is based on the most updated information on timing compiled in source file:  Z:/accelerator/Formatted Timing Info.txt (by Grace). It includes:&lt;br /&gt;
*coname&lt;br /&gt;
*acceleratorname&lt;br /&gt;
*keyword&lt;br /&gt;
*url&lt;br /&gt;
*webpage &lt;br /&gt;
*predicted &lt;br /&gt;
*gooddata&lt;br /&gt;
*page_details &lt;br /&gt;
*full_date &lt;br /&gt;
*month&lt;br /&gt;
*year&lt;br /&gt;
*cohort_name&lt;br /&gt;
*notes&lt;br /&gt;
*prog_duration_wks&lt;br /&gt;
*actual_date &lt;br /&gt;
*actual_month&lt;br /&gt;
*actual_year &lt;br /&gt;
*season &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
5) Founders - source file: &amp;quot;The File to Rule Them All - Founders main sheet&amp;quot;&lt;br /&gt;
*Accelerator &lt;br /&gt;
*First_Name&lt;br /&gt;
*Last_Name&lt;br /&gt;
*Full_Name&lt;br /&gt;
*Current_Job &lt;br /&gt;
*Current_Location&lt;br /&gt;
&lt;br /&gt;
6) founders_experience - source file: &amp;quot;The File to Rule Them All - Founders experience sheet&amp;quot;&lt;br /&gt;
*Accelerator &lt;br /&gt;
*First_Name &lt;br /&gt;
*Last_Name&lt;br /&gt;
*Full_Name&lt;br /&gt;
*Employer &lt;br /&gt;
*VC &lt;br /&gt;
*VC_backed_startup &lt;br /&gt;
*OLD_Job_Title&lt;br /&gt;
*NEW_Job_Title &lt;br /&gt;
*Dates_Employed&lt;br /&gt;
*Time_Employed&lt;br /&gt;
*Location &lt;br /&gt;
*Extra_Description&lt;br /&gt;
&lt;br /&gt;
7) additional_timing_info - source file: &amp;quot;merging_work.xlxs&amp;quot; located in: E:\Projects\McNair\Seed DB&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Seed_Accelerator_Data_Assembly&amp;diff=23886</id>
		<title>Seed Accelerator Data Assembly</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Seed_Accelerator_Data_Assembly&amp;diff=23886"/>
		<updated>2018-08-02T08:04:45Z</updated>

		<summary type="html">&lt;p&gt;HiraF: /* Data assembly details */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Useful pages==&lt;br /&gt;
&lt;br /&gt;
*[[U.S. Seed Accelerators]]&lt;br /&gt;
**[[Accelerator Seed List (Data)]]&lt;br /&gt;
*[[Accelerator Demo Day]]&lt;br /&gt;
**[[Mechanical Turk (Tool)]]&lt;br /&gt;
*[[Crunchbase Accelerator Founders]]&lt;br /&gt;
**[[Crunchbase Data#Collecting Company Information]]&lt;br /&gt;
**[[Merging Existing Data with Crunchbase]]&lt;br /&gt;
*[[Whois Parser]]&lt;br /&gt;
*[[URL Finder (Tool)]]&lt;br /&gt;
*[[Industry Classifier]]&lt;br /&gt;
*[[Industry classifier yang]]&lt;br /&gt;
*[[VentureXpert Data]]&lt;br /&gt;
&lt;br /&gt;
Please add (or subtract) other relevant (or irrelevant) pages!&lt;br /&gt;
&lt;br /&gt;
==Database specification==&lt;br /&gt;
&lt;br /&gt;
===Preamble===&lt;br /&gt;
&lt;br /&gt;
We need to get the data into approximately 3NF to prevent errors and make it more useable.&lt;br /&gt;
&lt;br /&gt;
Inputs:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\The File To Rule Them All.xlsx&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit#gid=0&lt;br /&gt;
&lt;br /&gt;
Recent work on the above from Connor:&lt;br /&gt;
*made the Demo Day Timing Google sheet as clean as possible (fixed dates, removed duplicates, created season column)&lt;br /&gt;
*recoded the employee count&lt;br /&gt;
*normalized the investment amount&lt;br /&gt;
&lt;br /&gt;
Connor's next to do: &lt;br /&gt;
*Founders Experience: code job title&lt;br /&gt;
*Founders Education: remove unknowns, code degree and code major&lt;br /&gt;
*Fix multiple campus and cohorts (see below). &lt;br /&gt;
&lt;br /&gt;
This is documented on [[U.S._Seed_Accelerators#Update_for_Hira]].&lt;br /&gt;
&lt;br /&gt;
The current database work is in '''vcdb2'''. The code to build the relevant tables in vcdb2 is in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\LoadAcceleratorDataV2.sql&lt;br /&gt;
 &lt;br /&gt;
Note that AddCBData.sql can be run on '''crunchbase2''' to get the relevant crunchbase data for import into vcdb2 (or elsewhere).&lt;br /&gt;
&lt;br /&gt;
However, we should build a new database for this project:&lt;br /&gt;
 createdb accelerator&lt;br /&gt;
&lt;br /&gt;
As well as a new folder (Z:\accelerator) for the data.&lt;br /&gt;
&lt;br /&gt;
The work in vcdb2 is essentially to build the data that goes into the File To Rule Them All.xlsx. We should start from there!&lt;br /&gt;
&lt;br /&gt;
===Suggested Spec===&lt;br /&gt;
&lt;br /&gt;
We need to address the issue with multiple campus and cohorts. This will require loading and manipulation of the data in SQL (Hira) as well as some manual fixes to the data (Connor).&lt;br /&gt;
&lt;br /&gt;
Accelerator Table (fieldname	colname/DISCARD, from sheet &amp;quot;Accelerators Final&amp;quot;)&lt;br /&gt;
*acceleratorname (Primary Key)	Accelerators&lt;br /&gt;
*url	homepage_url&lt;br /&gt;
*cohorturl	cohort page URL&lt;br /&gt;
*cohortlisting	Break out cohorts on the website? (Y/N)&lt;br /&gt;
*type	Type&lt;br /&gt;
*alive	Alive&lt;br /&gt;
*typenote	Type notes&lt;br /&gt;
*weeks	Weeks&lt;br /&gt;
*durationnotes	Duration Notes&lt;br /&gt;
*city	city	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*state	state/region	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*foundingdate	Creation Date&lt;br /&gt;
*terms	Terms of Joining Accelerator (non-equity pay/equity/free)&lt;br /&gt;
*equity	Equity?&lt;br /&gt;
*equityamountdesc	Equity Amount&lt;br /&gt;
*equityamount	Equity Amount Normalized (only 1s, range is averaged)&lt;br /&gt;
*investmentamountdesc	Investment Amount &lt;br /&gt;
*investmentamount	Investment Amount Normalized (Midpoint) &lt;br /&gt;
*investmentnotes	Investment Notes&lt;br /&gt;
*discard industry	DISCARD&lt;br /&gt;
*industry	Industry	&lt;br /&gt;
*specgenindu	Specified/General&lt;br /&gt;
*address	Address	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*subtype	DISCARD&lt;br /&gt;
*nonprofit	nonprofit?&lt;br /&gt;
*studentfocus	designed for students (Y/N)&lt;br /&gt;
*multicampus	Multiple campuses? (Y/N)&lt;br /&gt;
*software tech (Y/N)	DISCARD&lt;br /&gt;
*stagepref	What stage do they look for in cohort companies (SEL)&lt;br /&gt;
*cbconame	Name in Crunchbase&lt;br /&gt;
*uuid	UUID&lt;br /&gt;
&lt;br /&gt;
Note that we will assume that duration and terms are common across all cohorts run by the same accelerator. We might need to revisit this assumption.&lt;br /&gt;
&lt;br /&gt;
CohortCompany Table (fieldname	colname/DISCARD, from sheet &amp;quot;Cohorts Final&amp;quot;)&lt;br /&gt;
*conamestd (primary key)&lt;br /&gt;
*coname&lt;br /&gt;
*conameorg&lt;br /&gt;
*colocation&lt;br /&gt;
*city&lt;br /&gt;
*state_code&lt;br /&gt;
*country_code&lt;br /&gt;
*address&lt;br /&gt;
*codescription&lt;br /&gt;
*short_desc&lt;br /&gt;
*long_desc&lt;br /&gt;
*cosectors&lt;br /&gt;
*costatus&lt;br /&gt;
*cofundingstage&lt;br /&gt;
*courl&lt;br /&gt;
*uuid&lt;br /&gt;
*category_list&lt;br /&gt;
*category_group_list&lt;br /&gt;
*founded_on&lt;br /&gt;
*employee_count&lt;br /&gt;
*emp_count_scale&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*gotvc&lt;br /&gt;
&lt;br /&gt;
We then need to deduplicate this table and make sure that conamestd is a valid primary key.&lt;br /&gt;
&lt;br /&gt;
Note that we won't be taking the following from &amp;quot;Cohorts Final&amp;quot; for this table:&lt;br /&gt;
*year&lt;br /&gt;
*accelerator&lt;br /&gt;
*cohort&lt;br /&gt;
*quarter&lt;br /&gt;
*acclocation&lt;br /&gt;
*accperks	DISCARD&lt;br /&gt;
*cofounder	DISCARD&lt;br /&gt;
&lt;br /&gt;
CohortParticipation&lt;br /&gt;
*Cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
*accelerator	(foreign key)&lt;br /&gt;
*conamestd (foreign key)&lt;br /&gt;
&lt;br /&gt;
===Going further===&lt;br /&gt;
&lt;br /&gt;
We should likely rebuild the cohort variable to make it a sole &amp;quot;primary key&amp;quot; for a cohort. This would mean turning each entry of cohort into something unique like: TechStars Boulder Fall 2017 or 1440 Cohort 1, so that this key could look up the year of the cohort, its quarter, etc. We could then break CohortParticipation into two tables:&lt;br /&gt;
&lt;br /&gt;
CohortParticipation&lt;br /&gt;
*Cohort&lt;br /&gt;
*accelerator	(foreign key)&lt;br /&gt;
*conamestd (foreign key)&lt;br /&gt;
&lt;br /&gt;
Cohort&lt;br /&gt;
*Cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
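&lt;br /&gt;
A minimal sketch of this two-table split, again using an in-memory SQLite database as an assumed stand-in for the project database. The sample cohort, the company key 'acme', the column types, the quarter value and the composite primary key on CohortParticipation are all illustrative assumptions rather than part of the spec.&lt;br /&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE cohort (
    cohort  TEXT PRIMARY KEY,   -- e.g. 'TechStars Boulder Fall 2017'
    year    INTEGER,
    quarter INTEGER
);
CREATE TABLE cohortparticipation (
    cohort      TEXT REFERENCES cohort (cohort),
    accelerator TEXT,           -- would reference the Accelerator table
    conamestd   TEXT,           -- would reference CohortCompany
    PRIMARY KEY (cohort, conamestd)
);
""")

# Hypothetical rows; the quarter value is a placeholder.
conn.execute("INSERT INTO cohort VALUES ('TechStars Boulder Fall 2017', 2017, 4)")
conn.execute(
    "INSERT INTO cohortparticipation VALUES "
    "('TechStars Boulder Fall 2017', 'TechStars', 'acme')"
)

# The cohort key alone now looks up the year and quarter.
lookup = conn.execute(
    "SELECT p.conamestd, c.year, c.quarter "
    "FROM cohortparticipation p JOIN cohort c ON c.cohort = p.cohort"
).fetchall()

# A participation row naming an unknown cohort is rejected.
try:
    conn.execute(
        "INSERT INTO cohortparticipation VALUES ('No Such Cohort', 'X', 'beta')"
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

Year and quarter are stored once per cohort here, so they cannot disagree between two companies in the same cohort.&lt;br /&gt;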
&lt;br /&gt;
We could then add campus and a separate table for campuses:&lt;br /&gt;
&lt;br /&gt;
Campus Table&lt;br /&gt;
*CampusName (e.g., Techstars Boulder)&lt;br /&gt;
*Accelerator (foreign key)&lt;br /&gt;
*Address&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Zip&lt;br /&gt;
*Description &lt;br /&gt;
&lt;br /&gt;
===Founders Information===&lt;br /&gt;
&lt;br /&gt;
The three founder sheets turn into three tables nicely. We don't need to renormalize them for now, just fix up their variables and do some matching on employer.&lt;br /&gt;
&lt;br /&gt;
Founders:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*Current Job&lt;br /&gt;
*Current Location&lt;br /&gt;
&lt;br /&gt;
FoundersExperience:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*Employer&lt;br /&gt;
*Job Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location&lt;br /&gt;
*Extra Description&lt;br /&gt;
&lt;br /&gt;
FoundersEducation:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
==To do/For consideration==&lt;br /&gt;
&lt;br /&gt;
Minh:&lt;br /&gt;
*Create a format for collecting timing data&lt;br /&gt;
*Put the timing job on Turk&lt;br /&gt;
&lt;br /&gt;
Connor:&lt;br /&gt;
*Try to collect the missing timings that we really need (new process)&lt;br /&gt;
*Col W should be the headquarters address&lt;br /&gt;
*What stage-  Clean up?&lt;br /&gt;
&lt;br /&gt;
Maxine: &lt;br /&gt;
*Build the google URL finder&lt;br /&gt;
*Industry classification from description&lt;br /&gt;
&lt;br /&gt;
Grace:&lt;br /&gt;
*Process and join in the new timing data - new data located in Z:/accelerator/Formatted Timing Info.txt&lt;br /&gt;
*Make a category group to minorcode lookup&lt;br /&gt;
*Run WHOIS crawler on all valid URLs (not facebook pages, etc.)&lt;br /&gt;
*Founders Experience: Match Employers to VC funds/firms, VC backed startups (requires data from Augi)&lt;br /&gt;
&lt;br /&gt;
==Accelerator Data Assembly Progress (Hira) == &lt;br /&gt;
&lt;br /&gt;
*All data files are in Z:/accelerator&lt;br /&gt;
*The SQL file that loads all data is: LoadAccData.sql. It is located in E:\McNair\Projects\Accelerators\Summer 2018.&lt;br /&gt;
&lt;br /&gt;
=== Data assembly details ===&lt;br /&gt;
&lt;br /&gt;
The SQL file LoadAccData.sql currently loads data on Cohorts final and Founders from:&lt;br /&gt;
  E:\McNair\Projects\Accelerators\Summer 2018\The File To Rule Them All.xlsx&lt;br /&gt;
&lt;br /&gt;
It creates the following tables:&lt;br /&gt;
&lt;br /&gt;
1) cohortsfinal - source file:  Cohorts Final sheet in &amp;quot;The File to Rule Them All&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
2) CohortCompany - this uses data in cohortsfinal and creates a table with the following:&lt;br /&gt;
*conamestd&lt;br /&gt;
*conameorg&lt;br /&gt;
*colocation&lt;br /&gt;
*city&lt;br /&gt;
*state_code&lt;br /&gt;
*country_code&lt;br /&gt;
*address&lt;br /&gt;
*codescription&lt;br /&gt;
*short_desc&lt;br /&gt;
*long_desc&lt;br /&gt;
*cosectors&lt;br /&gt;
*costatus&lt;br /&gt;
*cofundingstage&lt;br /&gt;
*courl&lt;br /&gt;
*uuid&lt;br /&gt;
*category_list&lt;br /&gt;
*category_group_list&lt;br /&gt;
*founded_on&lt;br /&gt;
*employee_count&lt;br /&gt;
*emp_count_scale&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*gotvc&lt;br /&gt;
&lt;br /&gt;
3) CohortParticipation - uses table cohortsfinal&lt;br /&gt;
*cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
*accelerator&lt;br /&gt;
*conamestd&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
4) timing_final - This table is based on the most updated information on timing compiled in source file:  Z:/accelerator/Formatted Timing Info.txt (by Grace). It includes:&lt;br /&gt;
*coname&lt;br /&gt;
*acceleratorname&lt;br /&gt;
*keyword&lt;br /&gt;
*url&lt;br /&gt;
*webpage &lt;br /&gt;
*predicted &lt;br /&gt;
*gooddata&lt;br /&gt;
*page_details &lt;br /&gt;
*full_date &lt;br /&gt;
*month&lt;br /&gt;
*year&lt;br /&gt;
*cohort_name&lt;br /&gt;
*notes&lt;br /&gt;
*prog_duration_wks&lt;br /&gt;
*actual_date &lt;br /&gt;
*actual_month&lt;br /&gt;
*actual_year &lt;br /&gt;
*season &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
5) Founders - source file: &amp;quot;The File to Rule Them All - Founders main sheet&amp;quot;&lt;br /&gt;
*Accelerator &lt;br /&gt;
*First_Name&lt;br /&gt;
*Last_Name&lt;br /&gt;
*Full_Name&lt;br /&gt;
*Current_Job &lt;br /&gt;
*Current_Location&lt;br /&gt;
&lt;br /&gt;
6) founders_experience - source file: &amp;quot;The File to Rule Them All - Founders experience sheet&amp;quot;&lt;br /&gt;
*Accelerator &lt;br /&gt;
*First_Name &lt;br /&gt;
*Last_Name&lt;br /&gt;
*Full_Name&lt;br /&gt;
*Employer &lt;br /&gt;
*VC &lt;br /&gt;
*VC_backed_startup &lt;br /&gt;
*OLD_Job_Title&lt;br /&gt;
*NEW_Job_Title &lt;br /&gt;
*Dates_Employed&lt;br /&gt;
*Time_Employed&lt;br /&gt;
*Location &lt;br /&gt;
*Extra_Description&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Seed_Accelerator_Data_Assembly&amp;diff=23885</id>
		<title>Seed Accelerator Data Assembly</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Seed_Accelerator_Data_Assembly&amp;diff=23885"/>
		<updated>2018-08-02T08:02:40Z</updated>

		<summary type="html">&lt;p&gt;HiraF: /* Accelerator Data Assembly Progress (Hira) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Useful pages==&lt;br /&gt;
&lt;br /&gt;
*[[U.S. Seed Accelerators]]&lt;br /&gt;
**[[Accelerator Seed List (Data)]]&lt;br /&gt;
*[[Accelerator Demo Day]]&lt;br /&gt;
**[[Mechanical Turk (Tool)]]&lt;br /&gt;
*[[Crunchbase Accelerator Founders]]&lt;br /&gt;
**[[Crunchbase Data#Collecting Company Information]]&lt;br /&gt;
**[[Merging Existing Data with Crunchbase]]&lt;br /&gt;
*[[Whois Parser]]&lt;br /&gt;
*[[URL Finder (Tool)]]&lt;br /&gt;
*[[Industry Classifier]]&lt;br /&gt;
*[[Industry classifier yang]]&lt;br /&gt;
*[[VentureXpert Data]]&lt;br /&gt;
&lt;br /&gt;
Please add (or subtract) other relevant (or irrelevant) pages!&lt;br /&gt;
&lt;br /&gt;
==Database specification==&lt;br /&gt;
&lt;br /&gt;
===Preamble===&lt;br /&gt;
&lt;br /&gt;
We need to get the data into approximately 3NF to prevent errors and make it more useable.&lt;br /&gt;
&lt;br /&gt;
Inputs:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\The File To Rule Them All.xlsx&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit#gid=0&lt;br /&gt;
&lt;br /&gt;
Recent work on the above from Connor:&lt;br /&gt;
*made the Demo Day Timing Google sheet as clean as possible (fixed dates, removed duplicates, created season column)&lt;br /&gt;
*recoded the employee count&lt;br /&gt;
*normalized the investment amount&lt;br /&gt;
&lt;br /&gt;
Connor's next to do: &lt;br /&gt;
*Founders Experience: code job title&lt;br /&gt;
*Founders Education: remove unknowns, code degree and code major&lt;br /&gt;
*Fix multiple campus and cohorts (see below). &lt;br /&gt;
&lt;br /&gt;
This is documented on [[U.S._Seed_Accelerators#Update_for_Hira]].&lt;br /&gt;
&lt;br /&gt;
The current database work is in '''vcdb2'''. The code to build the relevant tables in vcdb2 is in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\LoadAcceleratorDataV2.sql&lt;br /&gt;
 &lt;br /&gt;
Note that AddCBData.sql can be run on '''crunchbase2''' to get the relevant crunchbase data for import into vcdb2 (or elsewhere).&lt;br /&gt;
&lt;br /&gt;
However, we should build a new database for this project:&lt;br /&gt;
 createdb accelerator&lt;br /&gt;
&lt;br /&gt;
As well as a new folder (Z:\accelerator) for the data.&lt;br /&gt;
&lt;br /&gt;
The work in vcdb2 is essentially to build the data that goes into the File To Rule Them All.xlsx. We should start from there!&lt;br /&gt;
&lt;br /&gt;
===Suggested Spec===&lt;br /&gt;
&lt;br /&gt;
We need to address the issue with multiple campuses and cohorts. This will require loading and manipulating the data in SQL (Hira) as well as some manual fixes to the data (Connor).&lt;br /&gt;
&lt;br /&gt;
Accelerator Table (fieldname	colname/DISCARD, from sheet &amp;quot;Accelerators Final&amp;quot;)&lt;br /&gt;
*acceleratorname (Primary Key)	Accelerators&lt;br /&gt;
*url	homepage_url&lt;br /&gt;
*cohorturl	cohort page URL&lt;br /&gt;
*cohortlisting	Break out cohorts on the website? (Y/N)&lt;br /&gt;
*type	Type&lt;br /&gt;
*alive	Alive&lt;br /&gt;
*typenote	Type notes&lt;br /&gt;
*weeks	Weeks&lt;br /&gt;
*durationnotes	Duration Notes&lt;br /&gt;
*city	city	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*state	state/region	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*foundingdate	Creation Date&lt;br /&gt;
*terms	Terms of Joining Accelerator (non-equity pay/equity/free)&lt;br /&gt;
*equity	Equity?&lt;br /&gt;
*equityamountdesc	Equity Amount&lt;br /&gt;
*equityamount	Equity Amount Normalized (only 1s, range is averaged)&lt;br /&gt;
*investmentamountdesc	Investment Amount &lt;br /&gt;
*investmentamount	Investment Amount Normalized (Midpoint) &lt;br /&gt;
*investmentnotes	Investment Notes&lt;br /&gt;
*discard industry	DISCARD&lt;br /&gt;
*industry	Industry	&lt;br /&gt;
*specgenindu	Specified/General&lt;br /&gt;
*address	Address	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*subtype	DISCARD&lt;br /&gt;
*nonprofit	nonprofit?&lt;br /&gt;
*studentfocus	designed for students (Y/N)&lt;br /&gt;
*multicampus	Multiple campuses? (Y/N)&lt;br /&gt;
*software tech (Y/N)	DISCARD&lt;br /&gt;
*stagepref	What stage do they look for in cohort companies (SEL)&lt;br /&gt;
*cbconame	Name in Crunchbase&lt;br /&gt;
*uuid	UUID&lt;br /&gt;
&lt;br /&gt;
Note that we will assume that duration and terms are common across all cohorts run by the same accelerator. We might need to revisit this assumption.&lt;br /&gt;
&lt;br /&gt;
CohortCompany Table (fieldname	colname/DISCARD, from sheet &amp;quot;Cohorts Final&amp;quot;)&lt;br /&gt;
*conamestd (primary key)&lt;br /&gt;
*coname&lt;br /&gt;
*conameorg&lt;br /&gt;
*colocation&lt;br /&gt;
*city&lt;br /&gt;
*state_code&lt;br /&gt;
*country_code&lt;br /&gt;
*address&lt;br /&gt;
*codescription&lt;br /&gt;
*short_desc&lt;br /&gt;
*long_desc&lt;br /&gt;
*cosectors&lt;br /&gt;
*costatus&lt;br /&gt;
*cofundingstage&lt;br /&gt;
*courl&lt;br /&gt;
*uuid&lt;br /&gt;
*category_list&lt;br /&gt;
*category_group_list&lt;br /&gt;
*founded_on&lt;br /&gt;
*employee_count&lt;br /&gt;
*emp_count_scale&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*gotvc&lt;br /&gt;
&lt;br /&gt;
We then need to deduplicate this table and make sure that conamestd is a valid primary key.&lt;br /&gt;
&lt;br /&gt;
Note that we won't be taking the following from &amp;quot;Cohorts Final&amp;quot; for this table:&lt;br /&gt;
*year&lt;br /&gt;
*accelerator&lt;br /&gt;
*cohort&lt;br /&gt;
*quarter&lt;br /&gt;
*acclocation&lt;br /&gt;
*accperks	DISCARD&lt;br /&gt;
*cofounder	DISCARD&lt;br /&gt;
&lt;br /&gt;
CohortParticipation&lt;br /&gt;
*Cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
*accelerator	(foreign key)&lt;br /&gt;
*conamestd (foreign key)&lt;br /&gt;
&lt;br /&gt;
===Going further===&lt;br /&gt;
&lt;br /&gt;
We should likely rebuild the cohort variable to make it a sole &amp;quot;primary key&amp;quot; for a cohort. This would mean turning each entry of cohort into something unique like: TechStars Boulder Fall 2017 or 1440 Cohort 1, so that this key could look up the year of the cohort, its quarter, etc. We could then break CohortParticipation into two tables:&lt;br /&gt;
&lt;br /&gt;
CohortParticipation&lt;br /&gt;
*Cohort&lt;br /&gt;
*accelerator	(foreign key)&lt;br /&gt;
*conamestd (foreign key)&lt;br /&gt;
&lt;br /&gt;
Cohort&lt;br /&gt;
*Cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
&lt;br /&gt;
We could then add campus and a separate table for campuses:&lt;br /&gt;
&lt;br /&gt;
Campus Table&lt;br /&gt;
*CampusName (e.g., Techstars Boulder)&lt;br /&gt;
*Accelerator (foreign key)&lt;br /&gt;
*Address&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Zip&lt;br /&gt;
*Description &lt;br /&gt;
&lt;br /&gt;
===Founders Information===&lt;br /&gt;
&lt;br /&gt;
The three founder sheets turn into three tables nicely. We don't need to renormalize them for now, just fix up their variables and do some matching on employer.&lt;br /&gt;
&lt;br /&gt;
Founders:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*Current Job&lt;br /&gt;
*Current Location&lt;br /&gt;
&lt;br /&gt;
FoundersExperience:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*Employer&lt;br /&gt;
*Job Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location&lt;br /&gt;
*Extra Description&lt;br /&gt;
&lt;br /&gt;
FoundersEducation:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
==To do/For consideration==&lt;br /&gt;
&lt;br /&gt;
Minh:&lt;br /&gt;
*Create a format for collecting timing data&lt;br /&gt;
*Put the timing job on Turk&lt;br /&gt;
&lt;br /&gt;
Connor:&lt;br /&gt;
*Try to collect the missing timings that we really need (new process)&lt;br /&gt;
*Col W should be the headquarters address&lt;br /&gt;
*What stage-  Clean up?&lt;br /&gt;
&lt;br /&gt;
Maxine: &lt;br /&gt;
*Build the google URL finder&lt;br /&gt;
*Industry classification from description&lt;br /&gt;
&lt;br /&gt;
Grace:&lt;br /&gt;
*Process and join in the new timing data - new data located in Z:/accelerator/Formatted Timing Info.txt&lt;br /&gt;
*Make a category group to minorcode lookup&lt;br /&gt;
*Run WHOIS crawler on all valid URLs (not facebook pages, etc.)&lt;br /&gt;
*Founders Experience: Match Employers to VC funds/firms, VC backed startups (requires data from Augi)&lt;br /&gt;
&lt;br /&gt;
==Accelerator Data Assembly Progress (Hira) == &lt;br /&gt;
&lt;br /&gt;
*All data files are in Z:/accelerator&lt;br /&gt;
*The SQL file that loads all data is: LoadAccData.sql. It is located in E:\McNair\Projects\Accelerators\Summer 2018.&lt;br /&gt;
&lt;br /&gt;
=== Data assembly details ===&lt;br /&gt;
&lt;br /&gt;
The SQL file LoadAccData.sql currently loads data on Cohorts final and Founders from:&lt;br /&gt;
  E:\McNair\Projects\Accelerators\Summer 2018\The File To Rule Them All.xlsx&lt;br /&gt;
&lt;br /&gt;
It creates the following tables:&lt;br /&gt;
&lt;br /&gt;
1) cohortsfinal - source file:  Cohorts Final sheet in &amp;quot;The File to Rule Them All&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
2) CohortCompany - this uses data in cohortsfinal and creates a table with the following:&lt;br /&gt;
*conamestd&lt;br /&gt;
*conameorg&lt;br /&gt;
*colocation&lt;br /&gt;
*city&lt;br /&gt;
*state_code&lt;br /&gt;
*country_code&lt;br /&gt;
*address&lt;br /&gt;
*codescription&lt;br /&gt;
*short_desc&lt;br /&gt;
*long_desc&lt;br /&gt;
*cosectors&lt;br /&gt;
*costatus&lt;br /&gt;
*cofundingstage&lt;br /&gt;
*courl&lt;br /&gt;
*uuid&lt;br /&gt;
*category_list&lt;br /&gt;
*category_group_list&lt;br /&gt;
*founded_on&lt;br /&gt;
*employee_count&lt;br /&gt;
*emp_count_scale&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*gotvc&lt;br /&gt;
&lt;br /&gt;
3) CohortParticipation - uses table cohortsfinal&lt;br /&gt;
*cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
*accelerator&lt;br /&gt;
*conamestd&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
4) timing_final - This table is based on the most updated information on timing compiled in source file:  Z:/accelerator/Formatted Timing Info.txt (by Grace). It includes:&lt;br /&gt;
*coname&lt;br /&gt;
*acceleratorname&lt;br /&gt;
*keyword&lt;br /&gt;
*url&lt;br /&gt;
*webpage &lt;br /&gt;
*predicted &lt;br /&gt;
*gooddata&lt;br /&gt;
*page_details &lt;br /&gt;
*full_date &lt;br /&gt;
*month&lt;br /&gt;
*year&lt;br /&gt;
*cohort_name&lt;br /&gt;
*notes&lt;br /&gt;
*prog_duration_wks&lt;br /&gt;
*actual_date &lt;br /&gt;
*actual_month&lt;br /&gt;
*actual_year &lt;br /&gt;
*season &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
5) Founders - source file: &amp;quot;The File to Rule Them All - Founders main sheet&amp;quot;&lt;br /&gt;
*Accelerator &lt;br /&gt;
*First_Name&lt;br /&gt;
*Last_Name&lt;br /&gt;
*Full_Name&lt;br /&gt;
*Current_Job &lt;br /&gt;
*Current_Location&lt;br /&gt;
&lt;br /&gt;
6) founders_experience - source file: &amp;quot;The File to Rule Them All - Founders experience sheet&amp;quot;&lt;br /&gt;
*Accelerator (text)&lt;br /&gt;
*First_Name (text)&lt;br /&gt;
*Last_Name (text)&lt;br /&gt;
*Full_Name (text)&lt;br /&gt;
*Employer (text)&lt;br /&gt;
*VC (varchar(5))&lt;br /&gt;
*VC_backed_startup (varchar(5))&lt;br /&gt;
*OLD_Job_Title (text)&lt;br /&gt;
*NEW_Job_Title (text)&lt;br /&gt;
*Dates_Employed (text)&lt;br /&gt;
*Time_Employed (text)&lt;br /&gt;
*Location (text)&lt;br /&gt;
*Extra_Description (text)&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Seed_Accelerator_Data_Assembly&amp;diff=23884</id>
		<title>Seed Accelerator Data Assembly</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Seed_Accelerator_Data_Assembly&amp;diff=23884"/>
		<updated>2018-08-02T08:01:30Z</updated>

		<summary type="html">&lt;p&gt;HiraF: /* Data assembly details */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Useful pages==&lt;br /&gt;
&lt;br /&gt;
*[[U.S. Seed Accelerators]]&lt;br /&gt;
**[[Accelerator Seed List (Data)]]&lt;br /&gt;
*[[Accelerator Demo Day]]&lt;br /&gt;
**[[Mechanical Turk (Tool)]]&lt;br /&gt;
*[[Crunchbase Accelerator Founders]]&lt;br /&gt;
**[[Crunchbase Data#Collecting Company Information]]&lt;br /&gt;
**[[Merging Existing Data with Crunchbase]]&lt;br /&gt;
*[[Whois Parser]]&lt;br /&gt;
*[[URL Finder (Tool)]]&lt;br /&gt;
*[[Industry Classifier]]&lt;br /&gt;
*[[Industry classifier yang]]&lt;br /&gt;
*[[VentureXpert Data]]&lt;br /&gt;
&lt;br /&gt;
Please add (or subtract) other relevant (or irrelevant) pages!&lt;br /&gt;
&lt;br /&gt;
==Database specification==&lt;br /&gt;
&lt;br /&gt;
===Preamble===&lt;br /&gt;
&lt;br /&gt;
We need to get the data into approximately 3NF to prevent errors and make it more useable.&lt;br /&gt;
&lt;br /&gt;
Inputs:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\The File To Rule Them All.xlsx&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit#gid=0&lt;br /&gt;
&lt;br /&gt;
Recent work on the above from Connor:&lt;br /&gt;
*made the Demo Day Timing Google sheet as clean as possible (fixed dates, removed duplicates, created season column)&lt;br /&gt;
*recoded the employee count&lt;br /&gt;
*normalized the investment amount&lt;br /&gt;
&lt;br /&gt;
Connor's next to do: &lt;br /&gt;
*Founders Experience: code job title&lt;br /&gt;
*Founders Education: remove unknowns, code degree and code major&lt;br /&gt;
*Fix multiple campus and cohorts (see below). &lt;br /&gt;
&lt;br /&gt;
This is documented on [[U.S._Seed_Accelerators#Update_for_Hira]].&lt;br /&gt;
&lt;br /&gt;
The current database work is in '''vcdb2'''. The code to build the relevant tables in vcdb2 is in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\LoadAcceleratorDataV2.sql&lt;br /&gt;
 &lt;br /&gt;
Note that AddCBData.sql can be run on '''crunchbase2''' to get the relevant crunchbase data for import into vcdb2 (or elsewhere).&lt;br /&gt;
&lt;br /&gt;
However, we should build a new database for this project:&lt;br /&gt;
 createdb accelerator&lt;br /&gt;
&lt;br /&gt;
As well as a new folder (Z:\accelerator) for the data.&lt;br /&gt;
&lt;br /&gt;
The work in vcdb2 is essentially to build the data that goes into the File To Rule Them All.xlsx. We should start from there!&lt;br /&gt;
&lt;br /&gt;
===Suggested Spec===&lt;br /&gt;
&lt;br /&gt;
We need to address the issue with multiple campuses and cohorts. This will require loading and manipulating the data in SQL (Hira) as well as some manual fixes to the data (Connor).&lt;br /&gt;
&lt;br /&gt;
Accelerator Table (fieldname	colname/DISCARD, from sheet &amp;quot;Accelerators Final&amp;quot;)&lt;br /&gt;
*acceleratorname (Primary Key)	Accelerators&lt;br /&gt;
*url	homepage_url&lt;br /&gt;
*cohorturl	cohort page URL&lt;br /&gt;
*cohortlisting	Break out cohorts on the website? (Y/N)&lt;br /&gt;
*type	Type&lt;br /&gt;
*alive	Alive&lt;br /&gt;
*typenote	Type notes&lt;br /&gt;
*weeks	Weeks&lt;br /&gt;
*durationnotes	Duration Notes&lt;br /&gt;
*city	city	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*state	state/region	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*foundingdate	Creation Date&lt;br /&gt;
*terms	Terms of Joining Accelerator (non-equity pay/equity/free)&lt;br /&gt;
*equity	Equity?&lt;br /&gt;
*equityamountdesc	Equity Amount&lt;br /&gt;
*equityamount	Equity Amount Normalized (only 1s, range is averaged)&lt;br /&gt;
*investmentamountdesc	Investment Amount &lt;br /&gt;
*investmentamount	Investment Amount Normalized (Midpoint) &lt;br /&gt;
*investmentnotes	Investment Notes&lt;br /&gt;
*discard industry	DISCARD&lt;br /&gt;
*industry	Industry	&lt;br /&gt;
*specgenindu	Specified/General&lt;br /&gt;
*address	Address	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*subtype	DISCARD&lt;br /&gt;
*nonprofit	nonprofit?&lt;br /&gt;
*studentfocus	designed for students (Y/N)&lt;br /&gt;
*multicampus	Multiple campuses? (Y/N)&lt;br /&gt;
*software tech (Y/N)	DISCARD&lt;br /&gt;
*stagepref	What stage do they look for in cohort companies (SEL)&lt;br /&gt;
*cbconame	Name in Crunchbase&lt;br /&gt;
*uuid	UUID&lt;br /&gt;
&lt;br /&gt;
Note that we will assume that duration and terms are common across all cohorts run by the same accelerator. We might need to revisit this assumption.&lt;br /&gt;
&lt;br /&gt;
CohortCompany Table (fieldname	colname/DISCARD, from sheet &amp;quot;Cohorts Final&amp;quot;)&lt;br /&gt;
*conamestd (primary key)&lt;br /&gt;
*coname&lt;br /&gt;
*conameorg&lt;br /&gt;
*colocation&lt;br /&gt;
*city&lt;br /&gt;
*state_code&lt;br /&gt;
*country_code&lt;br /&gt;
*address&lt;br /&gt;
*codescription&lt;br /&gt;
*short_desc&lt;br /&gt;
*long_desc&lt;br /&gt;
*cosectors&lt;br /&gt;
*costatus&lt;br /&gt;
*cofundingstage&lt;br /&gt;
*courl&lt;br /&gt;
*uuid&lt;br /&gt;
*category_list&lt;br /&gt;
*category_group_list&lt;br /&gt;
*founded_on&lt;br /&gt;
*employee_count&lt;br /&gt;
*emp_count_scale&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*gotvc&lt;br /&gt;
&lt;br /&gt;
We then need to deduplicate this table and make sure that conamestd is a valid primary key.&lt;br /&gt;
&lt;br /&gt;
Note that we won't be taking the following from &amp;quot;Cohorts Final&amp;quot; for this table:&lt;br /&gt;
*year&lt;br /&gt;
*accelerator&lt;br /&gt;
*cohort&lt;br /&gt;
*quarter&lt;br /&gt;
*acclocation&lt;br /&gt;
*accperks	DISCARD&lt;br /&gt;
*cofounder	DISCARD&lt;br /&gt;
&lt;br /&gt;
CohortParticipation&lt;br /&gt;
*Cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
*accelerator	(foreign key)&lt;br /&gt;
*conamestd (foreign key)&lt;br /&gt;
&lt;br /&gt;
===Going further===&lt;br /&gt;
&lt;br /&gt;
We should likely rebuild the cohort variable to make it a sole &amp;quot;primary key&amp;quot; for a cohort. This would mean turning each entry of cohort into something unique like: TechStars Boulder Fall 2017 or 1440 Cohort 1, so that this key could look up the year of the cohort, its quarter, etc. We could then break CohortParticipation into two tables:&lt;br /&gt;
&lt;br /&gt;
CohortParticipation&lt;br /&gt;
*Cohort&lt;br /&gt;
*accelerator	(foreign key)&lt;br /&gt;
*conamestd (foreign key)&lt;br /&gt;
&lt;br /&gt;
Cohort&lt;br /&gt;
*Cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
&lt;br /&gt;
We could then add campus and a separate table for campuses:&lt;br /&gt;
&lt;br /&gt;
Campus Table&lt;br /&gt;
*CampusName (e.g., Techstars Boulder)&lt;br /&gt;
*Accelerator (foreign key)&lt;br /&gt;
*Address&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Zip&lt;br /&gt;
*Description &lt;br /&gt;
&lt;br /&gt;
===Founders Information===&lt;br /&gt;
&lt;br /&gt;
The three founder sheets turn into three tables nicely. We don't need to renormalize them for now, just fix up their variables and do some matching on employer.&lt;br /&gt;
&lt;br /&gt;
Founders:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*Current Job&lt;br /&gt;
*Current Location&lt;br /&gt;
&lt;br /&gt;
FoundersExperience:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*Employer&lt;br /&gt;
*Job Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location&lt;br /&gt;
*Extra Description&lt;br /&gt;
&lt;br /&gt;
FoundersEducation:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
==To do/For consideration==&lt;br /&gt;
&lt;br /&gt;
Minh:&lt;br /&gt;
*Create a format for collecting timing data&lt;br /&gt;
*Put the timing job on Turk&lt;br /&gt;
&lt;br /&gt;
Connor:&lt;br /&gt;
*Try to collect the missing timings that we really need (new process)&lt;br /&gt;
*Col W should be the headquarters address&lt;br /&gt;
*What stage-  Clean up?&lt;br /&gt;
&lt;br /&gt;
Maxine: &lt;br /&gt;
*Build the google URL finder&lt;br /&gt;
*Industry classification from description&lt;br /&gt;
&lt;br /&gt;
Grace:&lt;br /&gt;
*Process and join in the new timing data - new data located in Z:/accelerator/Formatted Timing Info.txt&lt;br /&gt;
*Make a category group to minorcode lookup&lt;br /&gt;
*Run WHOIS crawler on all valid URLs (not facebook pages, etc.)&lt;br /&gt;
*Founders Experience: Match Employers to VC funds/firms, VC backed startups (requires data from Augi)&lt;br /&gt;
&lt;br /&gt;
==Accelerator Data Assembly Progress (Hira) == &lt;br /&gt;
&lt;br /&gt;
*All data files are in Z:/accelerator&lt;br /&gt;
*The SQL file that loads all data is: LoadAccData.sql. It is located in E:\McNair\Projects\Accelerators\Summer 2018.&lt;br /&gt;
&lt;br /&gt;
=== Data assembly details ===&lt;br /&gt;
&lt;br /&gt;
The SQL file LoadAccData.sql currently loads data on Cohorts final and Founders from:&lt;br /&gt;
  E:\McNair\Projects\Accelerators\Summer 2018\The File To Rule Them All.xlsx&lt;br /&gt;
&lt;br /&gt;
It creates the following tables:&lt;br /&gt;
&lt;br /&gt;
1) cohortsfinal - source file:  Cohorts Final sheet in &amp;quot;The File to Rule Them All&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
2) CohortCompany - this uses data in cohortsfinal and creates a table with the following:&lt;br /&gt;
*conamestd&lt;br /&gt;
*conameorg&lt;br /&gt;
*colocation&lt;br /&gt;
*city&lt;br /&gt;
*state_code&lt;br /&gt;
*country_code&lt;br /&gt;
*address&lt;br /&gt;
*codescription&lt;br /&gt;
*short_desc&lt;br /&gt;
*long_desc&lt;br /&gt;
*cosectors&lt;br /&gt;
*costatus&lt;br /&gt;
*cofundingstage&lt;br /&gt;
*courl&lt;br /&gt;
*uuid&lt;br /&gt;
*category_list&lt;br /&gt;
*category_group_list&lt;br /&gt;
*founded_on&lt;br /&gt;
*employee_count&lt;br /&gt;
*emp_count_scale&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*gotvc&lt;br /&gt;
&lt;br /&gt;
3) CohortParticipation - uses table cohortsfinal&lt;br /&gt;
*cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
*accelerator&lt;br /&gt;
*conamestd&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
4) timing_final - This table is based on the most updated information on timing compiled in source file:  Z:/accelerator/Formatted Timing Info.txt (by Grace). It includes:&lt;br /&gt;
*coname&lt;br /&gt;
*acceleratorname&lt;br /&gt;
*keyword&lt;br /&gt;
*url&lt;br /&gt;
*webpage &lt;br /&gt;
*predicted &lt;br /&gt;
*gooddata&lt;br /&gt;
*page_details &lt;br /&gt;
*full_date &lt;br /&gt;
*month&lt;br /&gt;
*year&lt;br /&gt;
*cohort_name&lt;br /&gt;
*notes&lt;br /&gt;
*prog_duration_wks&lt;br /&gt;
*actual_date &lt;br /&gt;
*actual_month&lt;br /&gt;
*actual_year &lt;br /&gt;
*season &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
5) Founders - source file: &amp;quot;The File to Rule Them All - Founders main sheet&amp;quot;&lt;br /&gt;
*Accelerator &lt;br /&gt;
*First_Name&lt;br /&gt;
*Last_Name&lt;br /&gt;
*Full_Name&lt;br /&gt;
*Current_Job &lt;br /&gt;
*Current_Location&lt;br /&gt;
&lt;br /&gt;
6) founders_experience - source file: &amp;quot;The File to Rule Them All - Founders experience sheet&amp;quot;&lt;br /&gt;
*Accelerator (text)&lt;br /&gt;
*First_Name (text)&lt;br /&gt;
*Last_Name (text)&lt;br /&gt;
*Full_Name (text)&lt;br /&gt;
*Employer (text)&lt;br /&gt;
*VC (varchar(5))&lt;br /&gt;
*VC_backed_startup (varchar(5))&lt;br /&gt;
*OLD_Job_Title (text)&lt;br /&gt;
*NEW_Job_Title (text)&lt;br /&gt;
*Dates_Employed (text)&lt;br /&gt;
*Time_Employed (text)&lt;br /&gt;
*Location (text)&lt;br /&gt;
*Extra_Description (text)&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Seed_Accelerator_Data_Assembly&amp;diff=23883</id>
		<title>Seed Accelerator Data Assembly</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Seed_Accelerator_Data_Assembly&amp;diff=23883"/>
		<updated>2018-08-02T07:46:01Z</updated>

		<summary type="html">&lt;p&gt;HiraF: /* Accelerator Data Assembly Progress (Hira) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Useful pages==&lt;br /&gt;
&lt;br /&gt;
*[[U.S. Seed Accelerators]]&lt;br /&gt;
**[[Accelerator Seed List (Data)]]&lt;br /&gt;
*[[Accelerator Demo Day]]&lt;br /&gt;
**[[Mechanical Turk (Tool)]]&lt;br /&gt;
*[[Crunchbase Accelerator Founders]]&lt;br /&gt;
**[[Crunchbase Data#Collecting Company Information]]&lt;br /&gt;
**[[Merging Existing Data with Crunchbase]]&lt;br /&gt;
*[[Whois Parser]]&lt;br /&gt;
*[[URL Finder (Tool)]]&lt;br /&gt;
*[[Industry Classifier]]&lt;br /&gt;
*[[Industry classifier yang]]&lt;br /&gt;
*[[VentureXpert Data]]&lt;br /&gt;
&lt;br /&gt;
Please add (or subtract) other relevant (or irrelevant) pages!&lt;br /&gt;
&lt;br /&gt;
==Database specification==&lt;br /&gt;
&lt;br /&gt;
===Preamble===&lt;br /&gt;
&lt;br /&gt;
We need to get the data into approximately 3NF to prevent errors and make it more useable.&lt;br /&gt;
&lt;br /&gt;
Inputs:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\The File To Rule Them All.xlsx&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit#gid=0&lt;br /&gt;
&lt;br /&gt;
Recent work on the above from Connor:&lt;br /&gt;
*made the Demo Day Timing Google sheet as clean as possible (fixed dates, removed duplicates, created season column)&lt;br /&gt;
*recoded the employee count&lt;br /&gt;
*normalized the investment amount&lt;br /&gt;
&lt;br /&gt;
Connor's next to do: &lt;br /&gt;
*Founders Experience: code job title&lt;br /&gt;
*Founders Education: remove unknowns, code degree and code major&lt;br /&gt;
*Fix multiple campus and cohorts (see below). &lt;br /&gt;
&lt;br /&gt;
This is documented on [[U.S._Seed_Accelerators#Update_for_Hira]].&lt;br /&gt;
&lt;br /&gt;
The current database work is in '''vcdb2'''. The code to build the relevant tables in vcdb2 is in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\LoadAcceleratorDataV2.sql&lt;br /&gt;
 &lt;br /&gt;
Note that AddCBData.sql can be run on '''crunchbase2''' to get the relevant crunchbase data for import into vcdb2 (or elsewhere).&lt;br /&gt;
&lt;br /&gt;
However, we should build a new database for this project:&lt;br /&gt;
 createdb accelerator&lt;br /&gt;
&lt;br /&gt;
We should also create a new folder (Z:\accelerator) for the data.&lt;br /&gt;
&lt;br /&gt;
The work in vcdb2 is essentially to build the data that goes into the File To Rule Them All.xlsx. We should start from there!&lt;br /&gt;
&lt;br /&gt;
===Suggested Spec===&lt;br /&gt;
&lt;br /&gt;
We need to address the issue with multiple campuses and cohorts. This will require loading and manipulating the data in SQL (Hira) as well as some manual fixes to the data (Connor).&lt;br /&gt;
&lt;br /&gt;
Accelerator Table (fieldname	colname/DISCARD, from sheet &amp;quot;Accelerators Final&amp;quot;)&lt;br /&gt;
*acceleratorname (Primary Key)	Accelerators&lt;br /&gt;
*url	homepage_url&lt;br /&gt;
*cohorturl	cohort page URL&lt;br /&gt;
*cohortlisting	Break out cohorts on the website? (Y/N)&lt;br /&gt;
*type	Type&lt;br /&gt;
*alive	Alive&lt;br /&gt;
*typenote	Type notes&lt;br /&gt;
*weeks	Weeks&lt;br /&gt;
*durationnotes	Duration Notes&lt;br /&gt;
*city	city	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*state	state/region	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*foundingdate	Creation Date&lt;br /&gt;
*terms	Terms of Joining Accelerator (non-equity pay/equity/free)&lt;br /&gt;
*equity	Equity?&lt;br /&gt;
*equityamountdesc	Equity Amount&lt;br /&gt;
*equityamount	Equity Amount Normalized (only 1s, range is averaged)&lt;br /&gt;
*investmentamountdesc	Investment Amount &lt;br /&gt;
*investmentamount	Investment Amount Normalized (Midpoint) &lt;br /&gt;
*investmentnotes	Investment Notes&lt;br /&gt;
*discard industry	DISCARD&lt;br /&gt;
*industry	Industry	&lt;br /&gt;
*specgenindu	Specified/General&lt;br /&gt;
*address	Address	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*subtype	DISCARD&lt;br /&gt;
*nonprofit	nonprofit?&lt;br /&gt;
*studentfocus	designed for students (Y/N)&lt;br /&gt;
*multicampus	Multiple campuses? (Y/N)&lt;br /&gt;
*software tech (Y/N)	DISCARD&lt;br /&gt;
*stagepref	What stage do they look for in cohort companies (SEL)&lt;br /&gt;
*cbconame	Name in Crunchbase&lt;br /&gt;
*uuid	UUID&lt;br /&gt;
&lt;br /&gt;
Note that we will assume that duration and terms are common across all cohorts run by the same accelerator. We might need to revisit this assumption.&lt;br /&gt;
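&lt;br /&gt;
As a rough illustration of the midpoint idea behind investmentamount, here is a minimal Python sketch. The description format (dollar ranges such as $20,000-$50,000) and the helper name midpoint are assumptions for illustration only; this is not the procedure that was used to fill in the sheet.&lt;br /&gt;
&lt;br /&gt;
```python
import re


def midpoint(description):
    """Return the midpoint of the figures found in a description, or None.

    Hypothetical helper: assumes plain dollar figures such as 20,000 or 50000.
    """
    # Collect every number in the text, dropping thousands separators.
    figures = [
        float(match.replace(",", ""))
        for match in re.findall(r"\d[\d,]*(?:\.\d+)?", description)
    ]
    if not figures:
        return None
    # A single figure is its own midpoint; a range is averaged.
    return (min(figures) + max(figures)) / 2


print(midpoint("$20,000-$50,000"))
```
&lt;br /&gt;
Descriptions that use abbreviations (for example 20k) would need extra handling before a helper like this could be trusted.&lt;br /&gt;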
&lt;br /&gt;
CohortCompany Table (fieldname	colname/DISCARD, from sheet &amp;quot;Cohorts Final&amp;quot;)&lt;br /&gt;
*conamestd (primary key)&lt;br /&gt;
*coname&lt;br /&gt;
*conameorg&lt;br /&gt;
*colocation&lt;br /&gt;
*city&lt;br /&gt;
*state_code&lt;br /&gt;
*country_code&lt;br /&gt;
*address&lt;br /&gt;
*codescription&lt;br /&gt;
*short_desc&lt;br /&gt;
*long_desc&lt;br /&gt;
*cosectors&lt;br /&gt;
*costatus&lt;br /&gt;
*cofundingstage&lt;br /&gt;
*courl&lt;br /&gt;
*uuid&lt;br /&gt;
*category_list&lt;br /&gt;
*category_group_list&lt;br /&gt;
*founded_on&lt;br /&gt;
*employee_count&lt;br /&gt;
*emp_count_scale&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*gotvc&lt;br /&gt;
&lt;br /&gt;
We then need to deduplicate this table and make sure that conamestd is a valid primary key.&lt;br /&gt;
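&lt;br /&gt;
The deduplication check could be sketched as follows. This is a minimal Python example that uses an in-memory SQLite database as a stand-in for the accelerator database; the table name cohortsfinal, the three-column layout and the sample rows are all hypothetical, and keeping the first loaded row per conamestd is just one possible rule.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

# Hypothetical sample rows standing in for the "Cohorts Final" sheet.
rows = [
    ("acme", "Acme Inc.", "Houston"),
    ("acme", "ACME", "Houston"),
    ("globex", "Globex", "Austin"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cohortsfinal (conamestd TEXT, coname TEXT, city TEXT)")
conn.executemany("INSERT INTO cohortsfinal VALUES (?, ?, ?)", rows)

# Step 1: list the conamestd values that appear more than once.
duplicates = conn.execute(
    "SELECT conamestd, COUNT(*) FROM cohortsfinal "
    "GROUP BY conamestd HAVING COUNT(*) > 1 ORDER BY conamestd"
).fetchall()

# Step 2: keep one row per conamestd (here, the first one loaded). The
# PRIMARY KEY constraint would reject the insert if duplicates remained.
conn.execute(
    "CREATE TABLE cohortcompany (conamestd TEXT PRIMARY KEY, coname TEXT, city TEXT)"
)
conn.execute(
    "INSERT INTO cohortcompany "
    "SELECT conamestd, coname, city FROM cohortsfinal "
    "WHERE rowid IN (SELECT MIN(rowid) FROM cohortsfinal GROUP BY conamestd)"
)
company_count = conn.execute("SELECT COUNT(*) FROM cohortcompany").fetchone()[0]
print(duplicates, company_count)
```
&lt;br /&gt;
In practice the duplicates listed in step 1 should be reviewed by hand before choosing which row to keep.&lt;br /&gt;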
&lt;br /&gt;
Note that we won't be taking the following from &amp;quot;Cohorts Final&amp;quot; for this table:&lt;br /&gt;
*year&lt;br /&gt;
*accelerator&lt;br /&gt;
*cohort&lt;br /&gt;
*quarter&lt;br /&gt;
*acclocation&lt;br /&gt;
*accperks	DISCARD&lt;br /&gt;
*cofounder	DISCARD&lt;br /&gt;
&lt;br /&gt;
CohortParticipation&lt;br /&gt;
*Cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
*accelerator	(foreign key)&lt;br /&gt;
*conamestd (foreign key)&lt;br /&gt;
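&lt;br /&gt;
A minimal sketch of how the foreign keys in CohortParticipation tie the three tables together. It is written in Python against an in-memory SQLite database as a stand-in for the real database; only a few columns are shown, and the sample values (including the URL) are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces keys when asked
conn.executescript("""
CREATE TABLE accelerator (acceleratorname TEXT PRIMARY KEY, url TEXT);
CREATE TABLE cohortcompany (conamestd TEXT PRIMARY KEY, coname TEXT);
CREATE TABLE cohortparticipation (
    cohort TEXT,
    year INTEGER,
    quarter INTEGER,
    accelerator TEXT REFERENCES accelerator (acceleratorname),
    conamestd TEXT REFERENCES cohortcompany (conamestd)
);
""")

# Hypothetical sample data.
conn.execute("INSERT INTO accelerator VALUES ('TechStars', 'https://example.com')")
conn.execute("INSERT INTO cohortcompany VALUES ('acme', 'Acme Inc.')")
conn.execute(
    "INSERT INTO cohortparticipation VALUES ('Boulder Fall', 2017, 4, 'TechStars', 'acme')"
)

# A participation row that points at an unknown company is rejected.
try:
    conn.execute(
        "INSERT INTO cohortparticipation "
        "VALUES ('Boulder Fall', 2017, 4, 'TechStars', 'nosuchco')"
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

participation_rows = conn.execute(
    "SELECT COUNT(*) FROM cohortparticipation"
).fetchone()[0]
print(rejected, participation_rows)
```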
&lt;br /&gt;
===Going further===&lt;br /&gt;
&lt;br /&gt;
We should likely rebuild the cohort variable to make it a sole &amp;quot;primary key&amp;quot; for a cohort. This would mean turning each entry of cohort into something unique like: TechStars Boulder Fall 2017 or 1440 Cohort 1, so that this key could look up the year of the cohort, its quarter, etc. We could then break CohortParticipation into two tables:&lt;br /&gt;
&lt;br /&gt;
CohortParticipation&lt;br /&gt;
*Cohort&lt;br /&gt;
*accelerator	(foreign key)&lt;br /&gt;
*conamestd (foreign key)&lt;br /&gt;
&lt;br /&gt;
Cohort&lt;br /&gt;
*Cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
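&lt;br /&gt;
The split into the two tables above could look roughly like this. It is a minimal Python sketch with hypothetical rows, assuming that the cohort label has already been rebuilt into a unique key; the check raises an error if the same label turns up with two different year and quarter values.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical flat rows: (cohort label, year, quarter, accelerator, conamestd).
flat = [
    ("TechStars Boulder Fall 2017", 2017, 4, "TechStars", "acme"),
    ("TechStars Boulder Fall 2017", 2017, 4, "TechStars", "globex"),
    ("1440 Cohort 1", 2016, 2, "1440", "initech"),
]

cohort = {}         # cohort key to (year, quarter)
participation = []  # (cohort key, accelerator, conamestd)
for label, year, quarter, accelerator, conamestd in flat:
    # setdefault stores the first (year, quarter) seen for a key; a mismatch
    # means the label is not yet unique enough to serve as a key.
    if cohort.setdefault(label, (year, quarter)) != (year, quarter):
        raise ValueError("cohort label is not unique: " + label)
    participation.append((label, accelerator, conamestd))

print(len(cohort), len(participation))
```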
&lt;br /&gt;
We could then add campus and a separate table for campuses:&lt;br /&gt;
&lt;br /&gt;
Campus Table&lt;br /&gt;
*CampusName (e.g., Techstars Boulder)&lt;br /&gt;
*Accelerator (foreign key)&lt;br /&gt;
*Address&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Zip&lt;br /&gt;
*Description &lt;br /&gt;
&lt;br /&gt;
===Founders Information===&lt;br /&gt;
&lt;br /&gt;
The three founder sheets turn into three tables nicely. We don't need to renormalize them for now, just fix up their variables and do some matching on employer.&lt;br /&gt;
&lt;br /&gt;
Founders:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*Current Job&lt;br /&gt;
*Current Location&lt;br /&gt;
&lt;br /&gt;
FoundersExperience:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*Employer&lt;br /&gt;
*Job Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location&lt;br /&gt;
*Extra Description&lt;br /&gt;
&lt;br /&gt;
FoundersEducation:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
==To do/For consideration==&lt;br /&gt;
&lt;br /&gt;
Minh:&lt;br /&gt;
*Create a format for collecting timing data&lt;br /&gt;
*Put the timing job on Turk&lt;br /&gt;
&lt;br /&gt;
Connor:&lt;br /&gt;
*Try to find the missing timings that we really need (new process)&lt;br /&gt;
*Col W should be the headquarters address&lt;br /&gt;
*What stage-  Clean up?&lt;br /&gt;
&lt;br /&gt;
Maxine: &lt;br /&gt;
*Build the google URL finder&lt;br /&gt;
*Industry classification from description&lt;br /&gt;
&lt;br /&gt;
Grace:&lt;br /&gt;
*Process and join in the new timing data (the new data is located in Z:/accelerator/Formatted Timing Info.txt)&lt;br /&gt;
*Make a category group to minorcode lookup&lt;br /&gt;
*Run WHOIS crawler on all valid URLs (not facebook pages, etc.)&lt;br /&gt;
*Founders Experience: Match Employers to VC funds/firms, VC backed startups (requires data from Augi)&lt;br /&gt;
&lt;br /&gt;
==Accelerator Data Assembly Progress (Hira) == &lt;br /&gt;
&lt;br /&gt;
*All data files are in Z:/accelerator&lt;br /&gt;
*The SQL file LoadAccData.sql is located in E:\McNair\Projects\Accelerators\Summer 2018.&lt;br /&gt;
&lt;br /&gt;
=== Data assembly details ===&lt;br /&gt;
&lt;br /&gt;
The SQL file LoadAccData.sql currently loads data on Cohorts final and Founders from:&lt;br /&gt;
  E:\McNair\Projects\Accelerators\Summer 2018\The File To Rule Them All.xlsx&lt;br /&gt;
&lt;br /&gt;
It creates the following tables:&lt;br /&gt;
&lt;br /&gt;
1) cohorts final - This table holds all the information from the sheet Cohorts Final in The File To Rule Them All.xlsx.&lt;br /&gt;
&lt;br /&gt;
2) CohortCompany - This table is built from the data in cohorts final and has the following columns:&lt;br /&gt;
*conamestd&lt;br /&gt;
*conameorg&lt;br /&gt;
*colocation&lt;br /&gt;
*city&lt;br /&gt;
*state_code&lt;br /&gt;
*country_code&lt;br /&gt;
*address&lt;br /&gt;
*codescription&lt;br /&gt;
*short_desc&lt;br /&gt;
*long_desc&lt;br /&gt;
*cosectors&lt;br /&gt;
*costatus&lt;br /&gt;
*cofundingstage&lt;br /&gt;
*courl&lt;br /&gt;
*uuid&lt;br /&gt;
*category_list&lt;br /&gt;
*category_group_list&lt;br /&gt;
*founded_on&lt;br /&gt;
*employee_count&lt;br /&gt;
*emp_count_scale&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*gotvc&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Seed_Accelerator_Data_Assembly&amp;diff=23882</id>
		<title>Seed Accelerator Data Assembly</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Seed_Accelerator_Data_Assembly&amp;diff=23882"/>
		<updated>2018-08-02T07:29:52Z</updated>

		<summary type="html">&lt;p&gt;HiraF: /* Database specification */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Useful pages==&lt;br /&gt;
&lt;br /&gt;
*[[U.S. Seed Accelerators]]&lt;br /&gt;
**[[Accelerator Seed List (Data)]]&lt;br /&gt;
*[[Accelerator Demo Day]]&lt;br /&gt;
**[[Mechanical Turk (Tool)]]&lt;br /&gt;
*[[Crunchbase Accelerator Founders]]&lt;br /&gt;
**[[Crunchbase Data#Collecting Company Information]]&lt;br /&gt;
**[[Merging Existing Data with Crunchbase]]&lt;br /&gt;
*[[Whois Parser]]&lt;br /&gt;
*[[URL Finder (Tool)]]&lt;br /&gt;
*[[Industry Classifier]]&lt;br /&gt;
*[[Industry classifier yang]]&lt;br /&gt;
*[[VentureXpert Data]]&lt;br /&gt;
&lt;br /&gt;
Please add (or subtract) other relevant (or irrelevant) pages!&lt;br /&gt;
&lt;br /&gt;
==Database specification==&lt;br /&gt;
&lt;br /&gt;
===Preamble===&lt;br /&gt;
&lt;br /&gt;
We need to get the data into approximately third normal form (3NF) to prevent errors and make it more usable.&lt;br /&gt;
&lt;br /&gt;
Inputs:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\The File To Rule Them All.xlsx&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit#gid=0&lt;br /&gt;
&lt;br /&gt;
Recent work on the above from Connor:&lt;br /&gt;
*made the Demo Day Timing Google sheet as clean as possible (fixed dates, removed duplicates, created season column)&lt;br /&gt;
*recoded the employee count&lt;br /&gt;
*normalized the investment amount&lt;br /&gt;
&lt;br /&gt;
Connor's next to do: &lt;br /&gt;
*Founders Experience: code job title&lt;br /&gt;
*Founders Education: remove unknowns, code degree and code major&lt;br /&gt;
*Fix multiple campuses and cohorts (see below).&lt;br /&gt;
&lt;br /&gt;
This is documented on [[U.S._Seed_Accelerators#Update_for_Hira]].&lt;br /&gt;
&lt;br /&gt;
The current database work is in '''vcdb2'''. The code to build the relevant tables in vcdb2 is in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\LoadAcceleratorDataV2.sql&lt;br /&gt;
 &lt;br /&gt;
Note that AddCBData.sql can be run on '''crunchbase2''' to get the relevant crunchbase data for import into vcdb2 (or elsewhere).&lt;br /&gt;
&lt;br /&gt;
However, we should build a new database for this project:&lt;br /&gt;
 createdb accelerator&lt;br /&gt;
&lt;br /&gt;
We should also create a new folder (Z:\accelerator) for the data.&lt;br /&gt;
&lt;br /&gt;
The work in vcdb2 is essentially to build the data that goes into the File To Rule Them All.xlsx. We should start from there!&lt;br /&gt;
&lt;br /&gt;
===Suggested Spec===&lt;br /&gt;
&lt;br /&gt;
We need to address the issue with multiple campuses and cohorts. This will require loading and manipulating the data in SQL (Hira) as well as some manual fixes to the data (Connor).&lt;br /&gt;
&lt;br /&gt;
Accelerator Table (fieldname	colname/DISCARD, from sheet &amp;quot;Accelerators Final&amp;quot;)&lt;br /&gt;
*acceleratorname (Primary Key)	Accelerators&lt;br /&gt;
*url	homepage_url&lt;br /&gt;
*cohorturl	cohort page URL&lt;br /&gt;
*cohortlisting	Break out cohorts on the website? (Y/N)&lt;br /&gt;
*type	Type&lt;br /&gt;
*alive	Alive&lt;br /&gt;
*typenote	Type notes&lt;br /&gt;
*weeks	Weeks&lt;br /&gt;
*durationnotes	Duration Notes&lt;br /&gt;
*city	city	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*state	state/region	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*foundingdate	Creation Date&lt;br /&gt;
*terms	Terms of Joining Accelerator (non-equity pay/equity/free)&lt;br /&gt;
*equity	Equity?&lt;br /&gt;
*equityamountdesc	Equity Amount&lt;br /&gt;
*equityamount	Equity Amount Normalized (only 1s, range is averaged)&lt;br /&gt;
*investmentamountdesc	Investment Amount &lt;br /&gt;
*investmentamount	Investment Amount Normalized (Midpoint) &lt;br /&gt;
*investmentnotes	Investment Notes&lt;br /&gt;
*discard industry	DISCARD&lt;br /&gt;
*industry	Industry	&lt;br /&gt;
*specgenindu	Specified/General&lt;br /&gt;
*address	Address	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*subtype	DISCARD&lt;br /&gt;
*nonprofit	nonprofit?&lt;br /&gt;
*studentfocus	designed for students (Y/N)&lt;br /&gt;
*multicampus	Multiple campuses? (Y/N)&lt;br /&gt;
*software tech (Y/N)	DISCARD&lt;br /&gt;
*stagepref	What stage do they look for in cohort companies (SEL)&lt;br /&gt;
*cbconame	Name in Crunchbase&lt;br /&gt;
*uuid	UUID&lt;br /&gt;
&lt;br /&gt;
Note that we will assume that duration and terms are common across all cohorts run by the same accelerator. We might need to revisit this assumption.&lt;br /&gt;
&lt;br /&gt;
CohortCompany Table (fieldname	colname/DISCARD, from sheet &amp;quot;Cohorts Final&amp;quot;)&lt;br /&gt;
*conamestd (primary key)&lt;br /&gt;
*coname&lt;br /&gt;
*conameorg&lt;br /&gt;
*colocation&lt;br /&gt;
*city&lt;br /&gt;
*state_code&lt;br /&gt;
*country_code&lt;br /&gt;
*address&lt;br /&gt;
*codescription&lt;br /&gt;
*short_desc&lt;br /&gt;
*long_desc&lt;br /&gt;
*cosectors&lt;br /&gt;
*costatus&lt;br /&gt;
*cofundingstage&lt;br /&gt;
*courl&lt;br /&gt;
*uuid&lt;br /&gt;
*category_list&lt;br /&gt;
*category_group_list&lt;br /&gt;
*founded_on&lt;br /&gt;
*employee_count&lt;br /&gt;
*emp_count_scale&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*gotvc&lt;br /&gt;
&lt;br /&gt;
We then need to deduplicate this table and make sure that conamestd is a valid primary key.&lt;br /&gt;
&lt;br /&gt;
Note that we won't be taking the following from &amp;quot;Cohorts Final&amp;quot; for this table:&lt;br /&gt;
*year&lt;br /&gt;
*accelerator&lt;br /&gt;
*cohort&lt;br /&gt;
*quarter&lt;br /&gt;
*acclocation&lt;br /&gt;
*accperks	DISCARD&lt;br /&gt;
*cofounder	DISCARD&lt;br /&gt;
&lt;br /&gt;
CohortParticipation&lt;br /&gt;
*Cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
*accelerator	(foreign key)&lt;br /&gt;
*conamestd (foreign key)&lt;br /&gt;
&lt;br /&gt;
===Going further===&lt;br /&gt;
&lt;br /&gt;
We should likely rebuild the cohort variable to make it a sole &amp;quot;primary key&amp;quot; for a cohort. This would mean turning each entry of cohort into something unique like: TechStars Boulder Fall 2017 or 1440 Cohort 1, so that this key could look up the year of the cohort, its quarter, etc. We could then break CohortParticipation into two tables:&lt;br /&gt;
&lt;br /&gt;
CohortParticipation&lt;br /&gt;
*Cohort&lt;br /&gt;
*accelerator	(foreign key)&lt;br /&gt;
*conamestd (foreign key)&lt;br /&gt;
&lt;br /&gt;
Cohort&lt;br /&gt;
*Cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
&lt;br /&gt;
We could then add campus and a separate table for campuses:&lt;br /&gt;
&lt;br /&gt;
Campus Table&lt;br /&gt;
*CampusName (e.g., Techstars Boulder)&lt;br /&gt;
*Accelerator (foreign key)&lt;br /&gt;
*Address&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Zip&lt;br /&gt;
*Description &lt;br /&gt;
&lt;br /&gt;
===Founders Information===&lt;br /&gt;
&lt;br /&gt;
The three founder sheets turn into three tables nicely. We don't need to renormalize them for now, just fix up their variables and do some matching on employer.&lt;br /&gt;
&lt;br /&gt;
Founders:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*Current Job&lt;br /&gt;
*Current Location&lt;br /&gt;
&lt;br /&gt;
FoundersExperience:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*Employer&lt;br /&gt;
*Job Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location&lt;br /&gt;
*Extra Description&lt;br /&gt;
&lt;br /&gt;
FoundersEducation:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
==To do/For consideration==&lt;br /&gt;
&lt;br /&gt;
Minh:&lt;br /&gt;
*Create a format for collecting timing data&lt;br /&gt;
*Put the timing job on Turk&lt;br /&gt;
&lt;br /&gt;
Connor:&lt;br /&gt;
*Try to find the missing timings that we really need (new process)&lt;br /&gt;
*Col W should be the headquarters address&lt;br /&gt;
*What stage-  Clean up?&lt;br /&gt;
&lt;br /&gt;
Maxine: &lt;br /&gt;
*Build the google URL finder&lt;br /&gt;
*Industry classification from description&lt;br /&gt;
&lt;br /&gt;
Grace:&lt;br /&gt;
*Process and join in the new timing data (the new data is located in Z:/accelerator/Formatted Timing Info.txt)&lt;br /&gt;
*Make a category group to minorcode lookup&lt;br /&gt;
*Run WHOIS crawler on all valid URLs (not facebook pages, etc.)&lt;br /&gt;
*Founders Experience: Match Employers to VC funds/firms, VC backed startups (requires data from Augi)&lt;br /&gt;
&lt;br /&gt;
==Accelerator Data Assembly Progress (Hira) == &lt;br /&gt;
&lt;br /&gt;
*All data files are in Z:/accelerator&lt;br /&gt;
*The SQL file LoadAccData.sql is located in E:\McNair\Projects\Accelerators\Summer 2018.&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Seed_Accelerator_Data_Assembly&amp;diff=23881</id>
		<title>Seed Accelerator Data Assembly</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Seed_Accelerator_Data_Assembly&amp;diff=23881"/>
		<updated>2018-08-02T07:27:08Z</updated>

		<summary type="html">&lt;p&gt;HiraF: /* To do/For consideration */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Useful pages==&lt;br /&gt;
&lt;br /&gt;
*[[U.S. Seed Accelerators]]&lt;br /&gt;
**[[Accelerator Seed List (Data)]]&lt;br /&gt;
*[[Accelerator Demo Day]]&lt;br /&gt;
**[[Mechanical Turk (Tool)]]&lt;br /&gt;
*[[Crunchbase Accelerator Founders]]&lt;br /&gt;
**[[Crunchbase Data#Collecting Company Information]]&lt;br /&gt;
**[[Merging Existing Data with Crunchbase]]&lt;br /&gt;
*[[Whois Parser]]&lt;br /&gt;
*[[URL Finder (Tool)]]&lt;br /&gt;
*[[Industry Classifier]]&lt;br /&gt;
*[[Industry classifier yang]]&lt;br /&gt;
*[[VentureXpert Data]]&lt;br /&gt;
&lt;br /&gt;
Please add (or subtract) other relevant (or irrelevant) pages!&lt;br /&gt;
&lt;br /&gt;
==Database specification==&lt;br /&gt;
&lt;br /&gt;
===Preamble===&lt;br /&gt;
&lt;br /&gt;
We need to get the data into approximately third normal form (3NF) to prevent errors and make it more usable.&lt;br /&gt;
&lt;br /&gt;
Inputs:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\Summer 2018\The File To Rule Them All.xlsx&lt;br /&gt;
 https://docs.google.com/spreadsheets/d/16Suyp364lMkmUuUmK2dy_9MeSoS1X4DfFl3dYYDGPT4/edit#gid=0&lt;br /&gt;
&lt;br /&gt;
Recent work on the above from Connor:&lt;br /&gt;
*made the Demo Day Timing Google sheet as clean as possible (fixed dates, removed duplicates, created season column)&lt;br /&gt;
*recoded the employee count&lt;br /&gt;
*normalized the investment amount&lt;br /&gt;
&lt;br /&gt;
Connor's next to do: &lt;br /&gt;
*Founders Experience: code job title&lt;br /&gt;
*Founders Education: remove unknowns, code degree and code major&lt;br /&gt;
*Fix multiple campuses and cohorts (see below).&lt;br /&gt;
&lt;br /&gt;
This is documented on [[U.S._Seed_Accelerators#Update_for_Hira]].&lt;br /&gt;
&lt;br /&gt;
The current database work is in '''vcdb2'''. The code to build the relevant tables in vcdb2 is in:&lt;br /&gt;
 E:\McNair\Projects\Accelerators\LoadAcceleratorDataV2.sql&lt;br /&gt;
 &lt;br /&gt;
Note that AddCBData.sql can be run on '''crunchbase2''' to get the relevant crunchbase data for import into vcdb2 (or elsewhere).&lt;br /&gt;
&lt;br /&gt;
However, we should build a new database for this project:&lt;br /&gt;
 createdb accelerator&lt;br /&gt;
&lt;br /&gt;
We should also create a new folder (Z:\accelerator) for the data.&lt;br /&gt;
&lt;br /&gt;
The work in vcdb2 is essentially to build the data that goes into the File To Rule Them All.xlsx. We should start from there!&lt;br /&gt;
&lt;br /&gt;
===Suggested Spec===&lt;br /&gt;
&lt;br /&gt;
We need to address the issue with multiple campuses and cohorts. This will require loading and manipulating the data in SQL (Hira) as well as some manual fixes to the data (Connor).&lt;br /&gt;
&lt;br /&gt;
Accelerator Table (fieldname	colname/DISCARD, from sheet &amp;quot;Accelerators Final&amp;quot;)&lt;br /&gt;
*acceleratorname (Primary Key)	Accelerators&lt;br /&gt;
*url	homepage_url&lt;br /&gt;
*cohorturl	cohort page URL&lt;br /&gt;
*cohortlisting	Break out cohorts on the website? (Y/N)&lt;br /&gt;
*type	Type&lt;br /&gt;
*alive	Alive&lt;br /&gt;
*typenote	Type notes&lt;br /&gt;
*weeks	Weeks&lt;br /&gt;
*durationnotes	Duration Notes&lt;br /&gt;
*city	city	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*state	state/region	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*foundingdate	Creation Date&lt;br /&gt;
*terms	Terms of Joining Accelerator (non-equity pay/equity/free)&lt;br /&gt;
*equity	Equity?&lt;br /&gt;
*equityamountdesc	Equity Amount&lt;br /&gt;
*equityamount	Equity Amount Normalized (only 1s, range is averaged)&lt;br /&gt;
*investmentamountdesc	Investment Amount &lt;br /&gt;
*investmentamount	Investment Amount Normalized (Midpoint) &lt;br /&gt;
*investmentnotes	Investment Notes&lt;br /&gt;
*discard industry	DISCARD&lt;br /&gt;
*industry	Industry	&lt;br /&gt;
*specgenindu	Specified/General&lt;br /&gt;
*address	Address	(note: address info is for the accelerator HQ here)&lt;br /&gt;
*subtype	DISCARD&lt;br /&gt;
*nonprofit	nonprofit?&lt;br /&gt;
*studentfocus	designed for students (Y/N)&lt;br /&gt;
*multicampus	Multiple campuses? (Y/N)&lt;br /&gt;
*software tech (Y/N)	DISCARD&lt;br /&gt;
*stagepref	What stage do they look for in cohort companies (SEL)&lt;br /&gt;
*cbconame	Name in Crunchbase&lt;br /&gt;
*uuid	UUID&lt;br /&gt;
&lt;br /&gt;
Note that we will assume that duration and terms are common across all cohorts run by the same accelerator. We might need to revisit this assumption.&lt;br /&gt;
&lt;br /&gt;
CohortCompany Table (fieldname	colname/DISCARD, from sheet &amp;quot;Cohorts Final&amp;quot;)&lt;br /&gt;
*conamestd (primary key)&lt;br /&gt;
*coname&lt;br /&gt;
*conameorg&lt;br /&gt;
*colocation&lt;br /&gt;
*city&lt;br /&gt;
*state_code&lt;br /&gt;
*country_code&lt;br /&gt;
*address&lt;br /&gt;
*codescription&lt;br /&gt;
*short_desc&lt;br /&gt;
*long_desc&lt;br /&gt;
*cosectors&lt;br /&gt;
*costatus&lt;br /&gt;
*cofundingstage&lt;br /&gt;
*courl&lt;br /&gt;
*uuid&lt;br /&gt;
*category_list&lt;br /&gt;
*category_group_list&lt;br /&gt;
*founded_on&lt;br /&gt;
*employee_count&lt;br /&gt;
*emp_count_scale&lt;br /&gt;
*linkedin_url&lt;br /&gt;
*gotvc&lt;br /&gt;
&lt;br /&gt;
We then need to deduplicate this table and make sure that conamestd is a valid primary key.&lt;br /&gt;
&lt;br /&gt;
Note that we won't be taking the following from &amp;quot;Cohorts Final&amp;quot; for this table:&lt;br /&gt;
*year&lt;br /&gt;
*accelerator&lt;br /&gt;
*cohort&lt;br /&gt;
*quarter&lt;br /&gt;
*acclocation&lt;br /&gt;
*accperks	DISCARD&lt;br /&gt;
*cofounder	DISCARD&lt;br /&gt;
&lt;br /&gt;
CohortParticipation&lt;br /&gt;
*Cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
*accelerator	(foreign key)&lt;br /&gt;
*conamestd (foreign key)&lt;br /&gt;
&lt;br /&gt;
===Going further===&lt;br /&gt;
&lt;br /&gt;
We should likely rebuild the cohort variable to make it a sole &amp;quot;primary key&amp;quot; for a cohort. This would mean turning each entry of cohort into something unique like: TechStars Boulder Fall 2017 or 1440 Cohort 1, so that this key could look up the year of the cohort, its quarter, etc. We could then break CohortParticipation into two tables:&lt;br /&gt;
&lt;br /&gt;
CohortParticipation&lt;br /&gt;
*Cohort&lt;br /&gt;
*accelerator	(foreign key)&lt;br /&gt;
*conamestd (foreign key)&lt;br /&gt;
&lt;br /&gt;
Cohort&lt;br /&gt;
*Cohort&lt;br /&gt;
*year&lt;br /&gt;
*quarter&lt;br /&gt;
&lt;br /&gt;
We could then add campus and a separate table for campuses:&lt;br /&gt;
&lt;br /&gt;
Campus Table&lt;br /&gt;
*CampusName (e.g., Techstars Boulder)&lt;br /&gt;
*Accelerator (foreign key)&lt;br /&gt;
*Address&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Zip&lt;br /&gt;
*Description &lt;br /&gt;
&lt;br /&gt;
===Founders Information===&lt;br /&gt;
&lt;br /&gt;
The three founder sheets turn into three tables nicely. We don't need to renormalize them for now, just fix up their variables and do some matching on employer.&lt;br /&gt;
&lt;br /&gt;
Founders:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*Current Job&lt;br /&gt;
*Current Location&lt;br /&gt;
&lt;br /&gt;
FoundersExperience:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*Employer&lt;br /&gt;
*Job Title&lt;br /&gt;
*Dates Employed&lt;br /&gt;
*Time Employed&lt;br /&gt;
*Location&lt;br /&gt;
*Extra Description&lt;br /&gt;
&lt;br /&gt;
FoundersEducation:&lt;br /&gt;
*Accelerator&lt;br /&gt;
*First Name&lt;br /&gt;
*Last Name&lt;br /&gt;
*Full Name&lt;br /&gt;
*School Name&lt;br /&gt;
*Degree Name&lt;br /&gt;
*Major&lt;br /&gt;
*Attended&lt;br /&gt;
*Graduated&lt;br /&gt;
*Societies&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==To do/For consideration==&lt;br /&gt;
&lt;br /&gt;
Minh:&lt;br /&gt;
*Create a format for collecting timing data&lt;br /&gt;
*Put the timing job on Turk&lt;br /&gt;
&lt;br /&gt;
Connor:&lt;br /&gt;
*Try to find the missing timings that we really need (new process)&lt;br /&gt;
*Col W should be the headquarters address&lt;br /&gt;
*What stage-  Clean up?&lt;br /&gt;
&lt;br /&gt;
Maxine: &lt;br /&gt;
*Build the google URL finder&lt;br /&gt;
*Industry classification from description&lt;br /&gt;
&lt;br /&gt;
Grace:&lt;br /&gt;
*Process and join in the new timing data (the new data is located in Z:/accelerator/Formatted Timing Info.txt)&lt;br /&gt;
*Make a category group to minorcode lookup&lt;br /&gt;
*Run WHOIS crawler on all valid URLs (not facebook pages, etc.)&lt;br /&gt;
*Founders Experience: Match Employers to VC funds/firms, VC backed startups (requires data from Augi)&lt;br /&gt;
&lt;br /&gt;
==Accelerator Data Assembly Progress (Hira) == &lt;br /&gt;
&lt;br /&gt;
*All data files are in Z:/accelerator&lt;br /&gt;
*The SQL file LoadAccData.sql is located in E:\McNair\Projects\Accelerators\Summer 2018.&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hubs_Analysis_2017&amp;diff=19764</id>
		<title>Hubs Analysis 2017</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hubs_Analysis_2017&amp;diff=19764"/>
		<updated>2017-08-04T19:57:33Z</updated>

		<summary type="html">&lt;p&gt;HiraF: /* Analysis Stage2 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Hubs Analysis 2017&lt;br /&gt;
|Has owner=Hira Farooqi,&lt;br /&gt;
|Has keywords=Data Analysis&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Analysis Stage 1 == &lt;br /&gt;
&lt;br /&gt;
The results are based on preliminary data analysis. The do-file that performs some analysis is located here:&lt;br /&gt;
 Z:\Hubs\2017\hubs_data&lt;br /&gt;
&lt;br /&gt;
It is titled '''load_vc_hubs_data.do'''.&lt;br /&gt;
&lt;br /&gt;
The sample of hubs used at this stage is the 29-30 hubs that were shortlisted in summer 2016. After data cleaning we get a sample of ~ 25 hubs. Important observations from initial fixed effects regressions are:&lt;br /&gt;
&lt;br /&gt;
*The sign of the correlation between VC funding and the presence of hubs is very sensitive to the measure of VC funding: it is positive for early investment and negative for later investment.&lt;br /&gt;
*The size and significance of the correlation are sensitive to whether funding is measured by the number of deals or by the amount of funding. Results are more significant for the number of deals than for the value of the deals.&lt;br /&gt;
*Including SBIR data significantly reduces the number of groups in the panel and hence the sample size.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Analysis Stage2 == &lt;br /&gt;
All the final data files related to Hubs analysis are located here:&lt;br /&gt;
 Z:\Hubs\2017\hubs_data&lt;br /&gt;
&lt;br /&gt;
*The Stata do-file '''hubs_load_data.do''' loads the final hubs data file. This is the raw data, which includes variables for early funding, late funding, number of early funding deals, number of late funding deals, dummies for whether a hub is present, etc. The do-file processes the data and creates more variables. It also estimates the hazard rate model and matches the sample based on the estimated hazard ratios. It eliminates all city_states that were never matched with a neighbor in ANY year from 2000-2017. The output from this file is '''matched_hubs.dta'''.&lt;br /&gt;
&lt;br /&gt;
*We need a panel data set with controls and treatments such that, for each matched control-treatment pair with pairid = n in '''matched_hubs.dta''', we assign the same pairid = n to the observations of those city_states 2 periods before and after they were originally assigned control and treatment status. The MATLAB code in '''time_matching_hubs.m''' performs this. However, the resulting panel has problems of overlap. For instance, if chicago_2008 was a treatment and boulder_2014 was a control with pair id 200 in the original hazard rate match, and chicago_2009 is a control in some pair 203, then this observation also gets matched with our pair id 200. This needs to be resolved.&lt;br /&gt;
&lt;br /&gt;
*The Stata do-file '''hubs_load_clean.do''' loads this final panel data set, but this panel has the overlap problem discussed above.&lt;br /&gt;
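&lt;br /&gt;
One way to see where the overlap comes from is to expand every matched observation into its window of 2 periods either side and flag the city-years that end up claimed by more than one pair. The Python sketch below does only that flagging; it is not the code in '''time_matching_hubs.m''', and the austin observation and the dictionary layout are hypothetical additions to the chicago and boulder example above.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict

# Hypothetical matched pairs: pairid mapped to [(city_state, year, role), ...].
matched = {
    200: [("chicago", 2008, "treatment"), ("boulder", 2014, "control")],
    203: [("austin", 2009, "treatment"), ("chicago", 2009, "control")],
}
WINDOW = 2  # periods before and after the matched year

# Expand every matched observation into its window of city-years.
assignments = defaultdict(set)  # (city_state, year) mapped to a set of pairids
for pairid, members in matched.items():
    for city, year, _role in members:
        for window_year in range(year - WINDOW, year + WINDOW + 1):
            assignments[(city, window_year)].add(pairid)

# City-years claimed by more than one pair are the overlaps to resolve.
overlaps = {
    key: sorted(pairids)
    for key, pairids in assignments.items()
    if len(pairids) > 1
}
for key in sorted(overlaps):
    print(key, overlaps[key])
```
&lt;br /&gt;
How to resolve a flagged city-year (for example, by dropping one of the pairs or shortening its window) is a separate decision that this sketch does not make.&lt;br /&gt;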
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[category:Internal]]&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hira_Farooqi_(Work_Log)&amp;diff=19763</id>
		<title>Hira Farooqi (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hira_Farooqi_(Work_Log)&amp;diff=19763"/>
		<updated>2017-08-04T19:56:21Z</updated>

		<summary type="html">&lt;p&gt;HiraF: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Hira Farooqi]] [[Work Logs]] [[Hira Farooqi (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
'''19/06/2017''' - Abstract added to [[Hubs (Academic Paper)]]&lt;br /&gt;
&lt;br /&gt;
'''27/06/2017''' - Cleaned data files for income at MSA level. Data description added to Income Data section on [[Hubs]]&lt;br /&gt;
&lt;br /&gt;
'''28/07/2017''' - Cleaned data files for income at MSA level. For further information on final income data tables and merging, refer to Income Data section in [[Hubs]]. &lt;br /&gt;
Data on Houston's current entrepreneurial ecosystem, including R&amp;amp;D expenditure by firms, NSF, NIH grants and STEM Grad students compiled and added to the following location&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\Houston&lt;br /&gt;
&lt;br /&gt;
'''10/07/2017''' - Received VC data from Adrian. Cleaned and prepared data on grants (NSF, NIH, Clinical trials), income, publicly traded firms and other variables to merge with VC data. Further details in [[Hubs]].&lt;br /&gt;
&lt;br /&gt;
'''11/07/2017''' - All previous data work on Hubs has been shifted to new page  [[Hubs Summer 2016]]. The new master data set is titled '''final.txt'''. [[Hubs]] includes more information on the columns included in the final data set.&lt;br /&gt;
&lt;br /&gt;
'''12/07/2017''' - In addition to grants information on NSF, NIH and clinical trials, we also decided to add SBIR grants data to our final data set. SBIR are government grants given to small businesses. SBIR data and summary tables are now available in the '''cities''' db. Further information is included in [[Hubs]].&lt;br /&gt;
&lt;br /&gt;
'''13-14/07/2017''' - The term '''hubs''' does not have a definition yet. In order to obtain a sample of hubs for this study, we need to finalize our definition of Hubs.  Currently I am gathering further information on hubs listed in the sample of hubs from last year and deciding what the most suitable definition of hubs could be for our study. Updates posted on [[Hubs]].&lt;br /&gt;
&lt;br /&gt;
'''17/07/2017''' - Last year a scorecard was used in order to determine whether a candidate hub would qualify to be in the study sample. This method produced a sample size of ~ 30. In order to increase the sample size, I have started working with 5 main characteristics that our sample hub should possess. Joe and I are going through the list of 125 candidate hubs from last year to check whether any hub rejected last year fulfills our slightly more &amp;quot;relaxed&amp;quot; definition. For further details see [[Hubs]].&lt;br /&gt;
&lt;br /&gt;
'''19/07/2017''' - Developed the definition of hubs. Cleaned hubs data for stata.&lt;br /&gt;
&lt;br /&gt;
'''20/07/2017''' - Cleaned hubs data. Performed preliminary analysis for hubs in stata. See the section on analysis in page [[Hubs Analysis 2017]].&lt;br /&gt;
&lt;br /&gt;
'''21/07/2017''' - Review of propensity score matching.&lt;br /&gt;
&lt;br /&gt;
'''24/07/2017''' - Review of propensity score matching and the Hochberg and Fehder (2015) paper.&lt;br /&gt;
&lt;br /&gt;
'''25/07/2017''' - Review of instrumental variables in panel data methods.&lt;br /&gt;
&lt;br /&gt;
'''26/07/2017''' - To address possible endogeneity in the categorical variable &amp;quot;hub&amp;quot;, one potential instrument is real estate taxes: if real estate taxes are higher, there might be higher demand for hubs. Data on mortgage-based real estate taxes was downloaded from the Census Bureau's American Community Survey. A fixed effects IV regression shows that taxes do not perform satisfactorily as an instrument in this model. &lt;br /&gt;
&lt;br /&gt;
'''31/07/2017''' - Estimated hazard rate using '''stcox''' in stata. Relevant do-file is '''hubs_load_data'''. It is located in Z&amp;gt;Hubs&amp;gt;2017&amp;gt;hubs_data.&lt;br /&gt;
&lt;br /&gt;
'''1/08/2017''' - Explored stata routines for matching on manually supplied scores. '''teffects''' matches only on probabilities that it estimates itself. '''psmatch2''' is more flexible and can match on a score variable that was estimated separately.&lt;br /&gt;
&lt;br /&gt;
'''2/08/2017''' - Estimated the matched sample using hazard ratios from the Cox proportional hazards model.  Relevant do-file is '''hubs_load_data'''. It is located in Z&amp;gt;Hubs&amp;gt;2017&amp;gt;hubs_data.&lt;br /&gt;
&lt;br /&gt;
'''3/08/2017''' - After matching on hazard ratios, we first need to create a panel of treatment-control (T-C) observations for the difference-in-differences analysis. More explanation of the data processing for this panel is in [[Hubs Analysis 2017]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hubs_Analysis_2017&amp;diff=19762</id>
		<title>Hubs Analysis 2017</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hubs_Analysis_2017&amp;diff=19762"/>
		<updated>2017-08-04T19:47:38Z</updated>

		<summary type="html">&lt;p&gt;HiraF: /* Analysis Stage 1 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Hubs Analysis 2017&lt;br /&gt;
|Has owner=Hira Farooqi,&lt;br /&gt;
|Has keywords=Data Analysis&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Analysis Stage 1 == &lt;br /&gt;
&lt;br /&gt;
These results are based on preliminary data analysis. The do-file that performs the analysis is located here:&lt;br /&gt;
 Z:\Hubs\2017\hubs_data&lt;br /&gt;
&lt;br /&gt;
It is titled '''load_vc_hubs_data.do'''.&lt;br /&gt;
&lt;br /&gt;
The sample of hubs used at this stage is the 29-30 hubs that were shortlisted in summer 2016. After data cleaning we get a sample of ~ 25 hubs. Important observations from initial fixed effects regressions are:&lt;br /&gt;
&lt;br /&gt;
*The sign of the correlation between VC funding and the presence of hubs is very sensitive to the measure of VC funding. It is positive for early-stage investment and negative for later-stage investment.&lt;br /&gt;
*The size and significance of the correlation depend on whether funding is measured by the number of deals or by the amount of funding. Results are more significant for the number of deals than for the value of the deals.&lt;br /&gt;
*Including SBIR data significantly reduces the number of groups in the panel and hence the sample size.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Analysis Stage 2 == &lt;br /&gt;
All the final data files related to Hubs analysis are located here:&lt;br /&gt;
 Z:\Hubs\2017\hubs_data&lt;br /&gt;
&lt;br /&gt;
*The stata do-file '''hubs_load_data.do''' loads the final hubs data file. This raw data includes variables such as early funding, late funding, number of early funding deals, number of late funding deals, and dummies for whether a hub is present. The do-file processes the data and creates additional variables. It also estimates the hazard rate model and matches the sample on the estimated hazard ratios. It drops all city_states that were never matched with a neighbor in ANY year from 2000-2017. The output from this file is '''matched_hubs.dta'''. &lt;br /&gt;
&lt;br /&gt;
*We need panel data with controls and treatments such that, for each matched control-treatment pair with pairid = n in '''matched_hubs.dta''', the same pairid = n is assigned to the observations of both city_states from 2 periods before to 2 periods after the year in which they were originally assigned control or treatment status. The matlab code in '''time_matching_hubs.m''' performs this. However, the resulting panel has overlap problems. For instance, if chicago_2008 was a treatment and boulder_2014 was a control with pair id 200 in the original hazard rate match, chicago_2009 may also be a control in some other pair 203, so that observation ends up attached to pair id 200 as well. This still needs to be resolved.  &lt;br /&gt;
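&lt;br /&gt;
The pair-window construction described above can be sketched roughly as follows. This is a Python illustration, not the actual '''time_matching_hubs.m''' code; the function names, the dictionary field names and the austin_2012 observation are hypothetical. It assigns each pair's id to the 2 periods before and after the match year, then lists the city-years that end up in more than one pair, which is the overlap problem noted above.&lt;br /&gt;
&lt;br /&gt;

```python
from collections import defaultdict

def build_pair_panel(pairs, window=2):
    """Expand each matched pair into one row per city-year in the event window."""
    rows = []
    for pair in pairs:
        for role in ("treatment", "control"):
            city, match_year = pair[role]
            for rel_year in range(-window, window + 1):
                rows.append({
                    "pairid": pair["pairid"],
                    "city": city,
                    "year": match_year + rel_year,
                    "role": role,
                    "rel_year": rel_year,
                })
    return rows

def find_overlaps(rows):
    """Return city-years that were assigned to more than one (pairid, role)."""
    assignments = defaultdict(set)
    for row in rows:
        assignments[(row["city"], row["year"])].add((row["pairid"], row["role"]))
    return {key: sorted(found) for key, found in assignments.items() if len(found) != 1}

# Pair 200 follows the example in the text; pair 203's treatment city is made up.
pairs = [
    {"pairid": 200, "treatment": ("chicago", 2008), "control": ("boulder", 2014)},
    {"pairid": 203, "treatment": ("austin", 2012), "control": ("chicago", 2009)},
]
panel = build_pair_panel(pairs)
overlaps = find_overlaps(panel)
print(sorted(overlaps))
```

Listing the overlapping city-years first makes it possible to decide on a rule (for example, dropping or re-matching the affected pairs) before running the difference-in-differences regressions; which rule is appropriate is left open here.&lt;br /&gt;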
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[category:Internal]]&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hira_Farooqi_(Work_Log)&amp;diff=19543</id>
		<title>Hira Farooqi (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hira_Farooqi_(Work_Log)&amp;diff=19543"/>
		<updated>2017-07-26T21:55:21Z</updated>

		<summary type="html">&lt;p&gt;HiraF: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Hira Farooqi]] [[Work Logs]] [[Hira Farooqi (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
'''19/06/2017''' - Abstract added to [[Hubs (Academic Paper)]]&lt;br /&gt;
&lt;br /&gt;
'''27/06/2017''' - Cleaned data files for income at MSA level. Data description added to Income Data section on [[Hubs]]&lt;br /&gt;
&lt;br /&gt;
'''28/06/2017''' - Cleaned data files for income at MSA level. For further information on the final income data tables and merging, refer to the Income Data section in [[Hubs]]. &lt;br /&gt;
Data on Houston's current entrepreneurial ecosystem, including R&amp;amp;D expenditure by firms, NSF and NIH grants, and STEM graduate students, was compiled and added to the following location:&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\Houston&lt;br /&gt;
&lt;br /&gt;
'''10/07/2017''' - Received VC data from Adrian. Cleaned and prepared data on grants (NSF, NIH, Clinical trials), income, publicly traded firms and other variables to merge with VC data. Further details in [[Hubs]].&lt;br /&gt;
&lt;br /&gt;
'''11/07/2017''' - All previous data work on Hubs has been shifted to new page  [[Hubs Summer 2016]]. The new master data set is titled '''final.txt'''. [[Hubs]] includes more information on the columns included in the final data set.&lt;br /&gt;
&lt;br /&gt;
'''12/07/2017''' - In addition to grants information on NSF, NIH and clinical trials, we also decided to add SBIR grants data to our final data set. SBIR are government grants given to small businesses. SBIR data and summary tables are now available in the '''cities''' db. Further information is included in [[Hubs]].&lt;br /&gt;
&lt;br /&gt;
'''13-14/07/2017''' - The term '''hubs''' does not have a definition yet. In order to obtain a sample of hubs for this study, we need to finalize our definition of Hubs.  Currently I am gathering further information on hubs listed in the sample of hubs from last year and deciding what the most suitable definition of hubs could be for our study. Updates posted on [[Hubs]].&lt;br /&gt;
&lt;br /&gt;
'''17/07/2017''' - Last year a scorecard was used in order to determine whether a candidate hub would qualify to be in the study sample. This method produced a sample size of ~ 30. In order to increase the sample size, I have started working with 5 main characteristics that our sample hub should possess. Joe and I are going through the list of 125 candidate hubs from last year to check whether any hub rejected last year fulfills our slightly more &amp;quot;relaxed&amp;quot; definition. For further details see [[Hubs]].&lt;br /&gt;
&lt;br /&gt;
'''19/07/2017''' - Developed the definition of hubs. Cleaned hubs data for stata.&lt;br /&gt;
&lt;br /&gt;
'''20/07/2017''' - Cleaned hubs data. Performed preliminary analysis for hubs in stata. See the section on analysis in page [[Hubs Analysis 2017]].&lt;br /&gt;
&lt;br /&gt;
'''21/07/2017''' - Review of propensity score matching.&lt;br /&gt;
&lt;br /&gt;
'''24/07/2017''' - Review of propensity score matching and the Hochberg and Fehder (2015) paper.&lt;br /&gt;
&lt;br /&gt;
'''25/07/2017''' - Review of instrumental variables in panel data methods.&lt;br /&gt;
&lt;br /&gt;
'''26/07/2017''' - To address possible endogeneity in the categorical variable &amp;quot;hub&amp;quot;, one potential instrument is real estate taxes: if real estate taxes are higher, there might be higher demand for hubs. Data on mortgage-based real estate taxes was downloaded from the Census Bureau's American Community Survey. A fixed effects IV regression shows that taxes do not perform satisfactorily as an instrument in this model. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hira_Farooqi_(Work_Log)&amp;diff=19542</id>
		<title>Hira Farooqi (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hira_Farooqi_(Work_Log)&amp;diff=19542"/>
		<updated>2017-07-26T21:54:58Z</updated>

		<summary type="html">&lt;p&gt;HiraF: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Hira Farooqi]] [[Work Logs]] [[Hira Farooqi (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
'''19/06/2017''' - Abstract added to [[Hubs (Academic Paper)]]&lt;br /&gt;
&lt;br /&gt;
'''27/06/2017''' - Cleaned data files for income at MSA level. Data description added to Income Data section on [[Hubs]]&lt;br /&gt;
&lt;br /&gt;
'''28/06/2017''' - Cleaned data files for income at MSA level. For further information on the final income data tables and merging, refer to the Income Data section in [[Hubs]]. &lt;br /&gt;
Data on Houston's current entrepreneurial ecosystem, including R&amp;amp;D expenditure by firms, NSF and NIH grants, and STEM graduate students, was compiled and added to the following location:&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\Houston&lt;br /&gt;
&lt;br /&gt;
'''10/07/2017''' - Received VC data from Adrian. Cleaned and prepared data on grants (NSF, NIH, Clinical trials), income, publicly traded firms and other variables to merge with VC data. Further details in [[Hubs]].&lt;br /&gt;
&lt;br /&gt;
'''11/07/2017''' - All previous data work on Hubs has been shifted to new page  [[Hubs Summer 2016]]. The new master data set is titled '''final.txt'''. [[Hubs]] includes more information on the columns included in the final data set.&lt;br /&gt;
&lt;br /&gt;
'''12/07/2017''' - In addition to grants information on NSF, NIH and clinical trials, we also decided to add SBIR grants data to our final data set. SBIR are government grants given to small businesses. SBIR data and summary tables are now available in the '''cities''' db. Further information is included in [[Hubs]].&lt;br /&gt;
&lt;br /&gt;
'''13-14/07/2017''' - The term '''hubs''' does not have a definition yet. In order to obtain a sample of hubs for this study, we need to finalize our definition of Hubs.  Currently I am gathering further information on hubs listed in the sample of hubs from last year and deciding what the most suitable definition of hubs could be for our study. Updates posted on [[Hubs]].&lt;br /&gt;
&lt;br /&gt;
'''17/07/2017''' - Last year a scorecard was used in order to determine whether a candidate hub would qualify to be in the study sample. This method produced a sample size of ~ 30. In order to increase the sample size, I have started working with 5 main characteristics that our sample hub should possess. Joe and I are going through the list of 125 candidate hubs from last year to check whether any hub rejected last year fulfills our slightly more &amp;quot;relaxed&amp;quot; definition. For further details see [[Hubs]].&lt;br /&gt;
&lt;br /&gt;
'''19/07/2017''' - Developed the definition of hubs. Cleaned hubs data for stata.&lt;br /&gt;
&lt;br /&gt;
'''20/07/2017''' - Cleaned hubs data. Performed preliminary analysis for hubs in stata. See the section on analysis in page [[Hubs Analysis 2017]].&lt;br /&gt;
&lt;br /&gt;
'''21/07/2017''' - Review of propensity score matching.&lt;br /&gt;
&lt;br /&gt;
'''24/07/2017''' - Review of propensity score matching and the Hochberg and Fehder (2015) paper.&lt;br /&gt;
'''25/07/2017''' - Review of instrumental variables in panel data methods.&lt;br /&gt;
'''26/07/2017''' - To address possible endogeneity in the categorical variable &amp;quot;hub&amp;quot;, one potential instrument is real estate taxes: if real estate taxes are higher, there might be higher demand for hubs. Data on mortgage-based real estate taxes was downloaded from the Census Bureau's American Community Survey. A fixed effects IV regression shows that taxes do not perform satisfactorily as an instrument in this model. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hubs_Summer_2016&amp;diff=19457</id>
		<title>Hubs Summer 2016</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hubs_Summer_2016&amp;diff=19457"/>
		<updated>2017-07-20T21:57:39Z</updated>

		<summary type="html">&lt;p&gt;HiraF: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;===Primary Data Set===&lt;br /&gt;
The Hubs data set, from SDC Platinum, has been constructed on the server:&lt;br /&gt;
 Data files are in 128.42.44.181/bulk/Hubs&lt;br /&gt;
 All files are in 128.42.44.182/bulk/Projects/Ecosystem/Hubs&lt;br /&gt;
 psql Hubs&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The data set includes all United States venture capital transactions (moneytree) from 1990 through 2015.&lt;br /&gt;
Data has been aggregated at the portfolio company, fund, and round level. It will be analyzed at the combined MSA level, in terms of the number of companies funded, the number of funds active, and the flow of investment in a given MSA.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The data set has now been uploaded to the database server as a database named Hubs.&lt;br /&gt;
There are 4 tables: &lt;br /&gt;
*Rounds: Rounddate, coname, state, roundno, stage1, etc.&lt;br /&gt;
*CombinedRounds: Coname, rounddate, discamount, fundname&lt;br /&gt;
*Companies: LastInv, FirstInv, coname, MSA, MSACode, Address, state, datefounded, totalknownfunding, industry(major)&lt;br /&gt;
*Funds: fundname, closingdate, lastinv, firstinv, msa, msacode, avinv, nocoinv, totalknowninv, address&lt;br /&gt;
&lt;br /&gt;
Used variables:&lt;br /&gt;
&lt;br /&gt;
 Companies: Coname, MSACode, Industry, state&lt;br /&gt;
 MSALookupTable: MSACode, MSASuper&lt;br /&gt;
 IndustryLookupTable: IndustryMajor, InduCode&lt;br /&gt;
 -&amp;gt; &lt;br /&gt;
 CompanyInfo: Coname, MSASuper, InduCode, state (complete)&lt;br /&gt;
&lt;br /&gt;
 Funds: fundname, msacode, state&lt;br /&gt;
 MSALookupTable: MSACode, MSASuper &lt;br /&gt;
 -&amp;gt; &lt;br /&gt;
 FundInfo: fundname, msacode, state (complete)&lt;br /&gt;
&lt;br /&gt;
 Rounds: coname, rounddate, stagecode, roundno&lt;br /&gt;
 CombinedRounds: coname, rounddate, discamount, fundname&lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 RoundInfoSuper: coname, rounddate, '''nofunds''', discamount   &lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 RoundInfo: Coname, roundyear, fundname, estamount (complete)&lt;br /&gt;
&lt;br /&gt;
Then take:&lt;br /&gt;
 RoundInfo: Coname, roundyear, fundname, estamount&lt;br /&gt;
 CompanyInfo: Coname, MSASuper, InduCode, state&lt;br /&gt;
 FundInfo: fundname, msacode, state&lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 SuperRoundInfo: Coname, CoMSASuper, CoInduCode, CoState, FundName, FundMSASuper, FundState, RoundYear, RoundEstAmount&lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 MSAPortCos: Count(CoName) As NoPortCosFunded, CoMSASuper, RoundYear&lt;br /&gt;
 ...&lt;br /&gt;
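&lt;br /&gt;
As a rough illustration of the RoundInfo and MSAPortCos steps above, the Python sketch below splits each round's disclosed amount across its syndicate and then counts distinct portfolio companies per MSA-year. The equal split of discamount across funds, the function names and the sample rows are assumptions made for illustration; the actual processing was done in the Hubs database and may differ.&lt;br /&gt;
&lt;br /&gt;

```python
from collections import defaultdict

def estimate_round_info(combined_rounds):
    """Turn (coname, rounddate, discamount, fundname) rows into
    (coname, roundyear, fundname, estamount) rows.

    Assumption: the disclosed amount is split equally across the funds
    in the syndicate (estamount = discamount / nofunds).
    """
    amounts = {}
    syndicates = defaultdict(list)
    for coname, rounddate, discamount, fundname in combined_rounds:
        key = (coname, rounddate)
        amounts[key] = discamount
        syndicates[key].append(fundname)

    round_info = []
    for (coname, rounddate), funds in syndicates.items():
        nofunds = len(funds)
        roundyear = int(rounddate[:4])  # expects dates like "2001-03-15"
        for fundname in funds:
            round_info.append((coname, roundyear, fundname, amounts[(coname, rounddate)] / nofunds))
    return round_info

def count_portcos(round_info, company_msa):
    """Count distinct companies funded per (CoMSASuper, RoundYear)."""
    funded = defaultdict(set)
    for coname, roundyear, _fundname, _estamount in round_info:
        funded[(company_msa[coname], roundyear)].add(coname)
    return {key: len(names) for key, names in funded.items()}

# Made-up rows: one three-fund syndicate and one single-fund round.
combined_rounds = [
    ("Acme", "2001-03-15", 900000.0, "Fund A"),
    ("Acme", "2001-03-15", 900000.0, "Fund B"),
    ("Acme", "2001-03-15", 900000.0, "Fund C"),
    ("Beta", "2001-06-01", 500000.0, "Fund A"),
]
round_info = estimate_round_info(combined_rounds)
portcos = count_portcos(round_info, {"Acme": 1, "Beta": 1})
print(round_info[0], portcos)
```

Counting distinct company names rather than rows keeps a company with several rounds or several investors in one year from being counted more than once; whether Count(CoName) in the original query was distinct is not stated in the notes above.&lt;br /&gt;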
&lt;br /&gt;
'''Notes on Creation of Primary Data Set'''&lt;br /&gt;
&lt;br /&gt;
Raw tables&lt;br /&gt;
* companies (last investment, first investment, company name, MSA, MSA code, address, state, date founded, known funding, industry) &lt;br /&gt;
* funds (fund closing date, last investment, first investment, fund name, address, MSA, MSA code, Average investment, number companies invested (NoCos), known investment) &lt;br /&gt;
* rounds (round date, company name, state, round number, stage 1, stage 2, stage 3) &lt;br /&gt;
* combined rounds (company name, round date, disclosed amount, investor) &lt;br /&gt;
* msalist (changes MSAs to CMSAs— combined MSAs)&lt;br /&gt;
*industry list (changes 6 industry categories to 4— ICT, Life Sciences, Semiconductors, Other) &lt;br /&gt;
&lt;br /&gt;
Process&lt;br /&gt;
*cleaned tables to eliminate duplications, undisclosed variables&lt;br /&gt;
*changed all original characters to include CMSA and Industry Codes (companyinfo3, fundinfocleanfinal, roundinfoclean) &lt;br /&gt;
*matched funds to avoid any issues with names (i.e. Fund ABC L.P./Fund ABC LP/Fund ABC) &lt;br /&gt;
*matched roundinfoclean investors to fundinfocleanfinal investors (roundinfo.txt &amp;gt;&amp;gt; cleanfundfinal.txt)&lt;br /&gt;
*join by round and company conames&lt;br /&gt;
*bridge years (1990-2016), stage, and cmsa&lt;br /&gt;
* populate data with count of companies (Deal flow) and estimated amount ($)&lt;br /&gt;
** data set in 181 hubs folder under summarycmsa.txt (38394)&lt;br /&gt;
&lt;br /&gt;
Key decisions:&lt;br /&gt;
*Dropped undisclosed companies throughout, as they have no address&lt;br /&gt;
*Counts are computed by joining rounds and companies&lt;br /&gt;
*Anything fund-related must come from a disclosed fund&lt;br /&gt;
*Near and far, total invested, fund counts, etc. are all computed using only disclosed funds that match&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Glossary of Tables'''&lt;br /&gt;
 cleanco — used to remove duplicates from companies&lt;br /&gt;
 cleanedcompanies — clean set of companies with no duplicates&lt;br /&gt;
 cmsafunds- &lt;br /&gt;
 cmsas— list of all CMSAs in final data set (for merging) &lt;br /&gt;
 cmsastats- statistics not including empty years (pre-merge) &lt;br /&gt;
 cmsastats2 - statistics separated by year-MSA&lt;br /&gt;
 cmsastats3— statistics separated by year-MSA-stage&lt;br /&gt;
 cmsastats4&lt;br /&gt;
 cmsayears— empty merged table between year and cmsa&lt;br /&gt;
 cmsayearstage — empty merged table between cmsa/years and stage&lt;br /&gt;
 combinedrounds— raw sdc data for combined rounds&lt;br /&gt;
 combinedroundswamt— used to join rounds and combined rounds for roundinfo2&lt;br /&gt;
 companies- raw SDC company data&lt;br /&gt;
 companyinfo — cleaned companies joined with state and CMSA information&lt;br /&gt;
 companyinfo2— companyinfo1 with original industry categories&lt;br /&gt;
 companyinfo3— companyinfo2 with updated industry categories and codes&lt;br /&gt;
 companyinfo4-- clean version of companyinfo3&lt;br /&gt;
 companyround- combined company information with round information&lt;br /&gt;
 companyround2- combined company information with round information, cleaned up from companyround&lt;br /&gt;
 companyround3- combined company information with round information, cleaned up from companyround2&lt;br /&gt;
 '''finaldataset'''- final statistics by CMSA-year, see section Final Primary Data Set for more information&lt;br /&gt;
 fundinfo— funds joined with CMSA info&lt;br /&gt;
 fundinfo2 - clean version of fundinfo1&lt;br /&gt;
 fundinfoclean - used in process to clean fundinfo2&lt;br /&gt;
 fundinfoclean2- used in process to clean fundinfo2&lt;br /&gt;
 fundinfocleanfinal- used in process to clean fundinfo2&lt;br /&gt;
 fundinfocleannodups- final clean set of fundinfo&lt;br /&gt;
 funds - raw SDC fund data&lt;br /&gt;
 Houston - analysis for Houston ecosystem team&lt;br /&gt;
 Houston2- analysis for Houston ecosystem team&lt;br /&gt;
 houston3- analysis for Houston ecosystem team&lt;br /&gt;
 industry — new industry codes (4)— used for all future data sets&lt;br /&gt;
 industrylist— lookup table for new industry codes (went from 6 to 4) &lt;br /&gt;
 joined1- used for matching process&lt;br /&gt;
 joined2- used for matching process&lt;br /&gt;
 matchfund2- used for matching process&lt;br /&gt;
 matchfunds- used for matching process&lt;br /&gt;
 matchroundfund - used for matching process&lt;br /&gt;
 matchroundfund2- used for matching process&lt;br /&gt;
 msalist — lookup table for MSA to CMSA (used for all future data sets) &lt;br /&gt;
 nearfar1-- beginning set before adding nearfar/stage variables &lt;br /&gt;
 nearfar2 -- added binary variables for near/far and for each of the stages, used to build final dataset&lt;br /&gt;
 roundfund— not used— joined round to fund; drop/ignore&lt;br /&gt;
 roundinfo— round info cleaned up to include number of investors in a syndicate and estimated investment per member of syndicate&lt;br /&gt;
 roundinfo2— roundinfo1 including name of investors/funds&lt;br /&gt;
 roundinfo3— clean version of roundinfo2&lt;br /&gt;
 roundinfoclean — final clean version of roundinfo3 (final roundinfo table)&lt;br /&gt;
 rounds — raw SDC round data&lt;br /&gt;
 stages — table for merging stage-year-CMSA&lt;br /&gt;
 superinfo — ignore/drop&lt;br /&gt;
 temp - used for matching process&lt;br /&gt;
 years — table for merging stage-year-CMSA&lt;br /&gt;
&lt;br /&gt;
===Hub Candidates Data Set===&lt;br /&gt;
&lt;br /&gt;
The Hubs candidate data set is a list of potential hubs found in MSAs throughout the country. Researchers are currently pulling qualitative and quantitative information from the candidates' websites in an attempt to determine which candidates can be identified as hubs. This data set is difficult to assemble: little to no quantitative information is available for this category of institution, and collection depends on what information is publicly accessible on the internet.&lt;br /&gt;
&lt;br /&gt;
Characteristics/Variables&lt;br /&gt;
*Year Founded&lt;br /&gt;
*Square footage&lt;br /&gt;
*LinkedIn self-identifiers (how the organization classifies itself on its LinkedIn profile) &lt;br /&gt;
*Activeness on Twitter (binary)&lt;br /&gt;
*Member directory available online (binary)&lt;br /&gt;
*Number of conference rooms&lt;br /&gt;
*Price ($/month) for flex desk &lt;br /&gt;
*Offers reserved desk (binary)&lt;br /&gt;
*Offers office space for rent (binary) &lt;br /&gt;
*Offers community membership, not for coworking but for community events, etc. (binary)&lt;br /&gt;
*Number of events offered per month (estimate)&lt;br /&gt;
*Offers code academy&lt;br /&gt;
*Mission Statement/Vision (for qualitative or key-word analysis) &lt;br /&gt;
&lt;br /&gt;
These characteristics/variables will be used to determine whether a candidate is or is not likely to be a Hub. &lt;br /&gt;
&lt;br /&gt;
As of March 10th 2016, the list contains 125 Hub candidates.&lt;br /&gt;
&lt;br /&gt;
'''Where to find''': The Hubs data set can be found in the Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;dataset folder. It is not currently in the database due to a UTF8 issue.&lt;br /&gt;
&lt;br /&gt;
===Supplementary Data Sets===&lt;br /&gt;
'''Patent data''': to be pulled from USPTO or SDC Platinum. &lt;br /&gt;
&lt;br /&gt;
'''Number of STEM Graduate Students''' (NSF) and '''University R&amp;amp;D Spending''' (NSF):&lt;br /&gt;
*University R&amp;amp;D Data found under file &amp;quot;NSF DATA_2004 to 2011.xlsx&amp;quot; in datasets folder (Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;Datasets)&lt;br /&gt;
*STEM graduate student counts found at the university level for 2014 (&amp;quot;Stem Grad Students.xlsx&amp;quot;) or at the state level (&amp;quot;Science and Engineering Grad Students by State and Year 2000-2011.csv&amp;quot;)&lt;br /&gt;
** not yet uploaded to the server or matched to CMSA codes, because of this discrepancy in the level of aggregation. &lt;br /&gt;
**&amp;quot;Stem Grad Students.xlsx&amp;quot; contains categorized university by MSA, can be used for all university-based projects&lt;br /&gt;
&lt;br /&gt;
'''Per Capita Income''' and '''Employment Data''' (US Census Bureau): &lt;br /&gt;
*&amp;quot;Per Capita Personal Income by MSA 2000-2012.xlsx&amp;quot; in datasets folder (Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;Datasets&amp;gt;&amp;gt;Data from Yael)&lt;br /&gt;
*&amp;quot;Wages and Salaries by MSA 2000-2012.xlsx&amp;quot; in datasets folder (Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;datasets&amp;gt;&amp;gt;Data from Yael)&lt;br /&gt;
**not uploaded to server or matched yet to CMSA code&lt;br /&gt;
&lt;br /&gt;
'''Firm Births''' (BDS)&lt;br /&gt;
*in server 181, under table name &amp;quot;BDS&amp;quot;&lt;br /&gt;
*includes birth, death, net(birth-death) and rate(death rate) for years 1990-2013 for every msa&lt;br /&gt;
*includes code for CMSA but is not aggregated by CMSA&lt;br /&gt;
** i.e. BDS statistics are still separate for all the smaller MSAs in New York's CMSA (code=1)&lt;br /&gt;
&lt;br /&gt;
===Resources===&lt;br /&gt;
* Hochberg and Fehder (2015), located in dropbox&lt;br /&gt;
** Use this paper as a guideline on how to conduct the analysis&lt;br /&gt;
*US Census Bureau data on employment by MSA: http://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_14_5YR_B23027&amp;amp;prodType=table&lt;br /&gt;
*USPTO utility patents by MSA: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/cls_cbsa/allcbsa_gd.htm&lt;br /&gt;
*MSA level trends: http://www.metrotrends.org/data.cf&lt;br /&gt;
&lt;br /&gt;
===The Target Dataset===&lt;br /&gt;
&lt;br /&gt;
We will need to process the following variables:&lt;br /&gt;
*SuperMSA - combine SanFran and SanJose, New York and Newark?, NC Research triangle, others?&lt;br /&gt;
*CSV mapping msas to cmsas is in the folder (and a table in the dbase)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Example dataset:&lt;br /&gt;
 MSA      Year       SeedVCInv      SeedEarlyVCInv      LaterVCInv     NoDeals   FundsInvested   DistinctInvestors   ....&lt;br /&gt;
 ----------------------------------------------------------------------------------------------------------------------------&lt;br /&gt;
 1234     2001       1000000        20000000            30000000       4          7              7&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Note that the unit of observation is MSA-Year.&lt;br /&gt;
&lt;br /&gt;
Variables to be computed at the MSA level:&lt;br /&gt;
*HubActive (binary)&lt;br /&gt;
*NoHubsActive (Count)&lt;br /&gt;
*HubSqFt&lt;br /&gt;
*Other Hub Vars (build list!!!)&lt;br /&gt;
*'''SeedVCInv'''  (Seed/Start-up)&lt;br /&gt;
*'''EarlyVCInv''' (Early Stage)&lt;br /&gt;
*'''LaterStageVC''' (Later)&lt;br /&gt;
*'''OtherStageVC''' (Buyout/Acq, Other)&lt;br /&gt;
*'''NoDeals''' (done by local VCs?)&lt;br /&gt;
**'''NoDealsNear'''&lt;br /&gt;
**'''NoDealsFar'''&lt;br /&gt;
*NoPortCosFunded&lt;br /&gt;
*'''FundsInv''' (in an MSA)&lt;br /&gt;
**'''FundsInvFromNear''' (within MSA?)&lt;br /&gt;
**'''FundsInvFromFar''' (outside MSA?)&lt;br /&gt;
*DistinctInvestors (?)&lt;br /&gt;
**DistinctInvestorsNear (within MSA?)&lt;br /&gt;
**DistinctInvestorsFar (outside MSA?)&lt;br /&gt;
*PatentCount&lt;br /&gt;
*NoSTEMGrads&lt;br /&gt;
*FirmBirths (BDS data)&lt;br /&gt;
*UniRandDSpend&lt;br /&gt;
*PerCapitaIncome&lt;br /&gt;
*Employment&lt;br /&gt;
&lt;br /&gt;
We need to:&lt;br /&gt;
*Check funds invested means dollars invested&lt;br /&gt;
*Categorize near and far! Is it within MSA vs. not, within adjacent MSAs, etc.?&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There may be a second dataset that has Hub-Industry-Year (where industry is semiconductor/non-semiconductor?).&lt;br /&gt;
&lt;br /&gt;
===Final Primary Data Set===&lt;br /&gt;
&lt;br /&gt;
*A deal is a round syndicate. A near (far) deal is a deal with at least one investor that is near (far).&lt;br /&gt;
&lt;br /&gt;
Table name: finaldataset&lt;br /&gt;
 cmsa&lt;br /&gt;
 year&lt;br /&gt;
 totalamountinv--total amount invested &lt;br /&gt;
 nearamountinv--amount invested from local funds&lt;br /&gt;
 faramountinv-- amount invested from funds outside CMSA &lt;br /&gt;
 earlyinv--amount invested in early stage companies &lt;br /&gt;
 laterinv--amount invested in later stage companies &lt;br /&gt;
 startupseedinv--amount invested in seed or startup stage companies &lt;br /&gt;
 otherstageinv--amount invested in Acquisition/Buy-outs/Other stage companies &lt;br /&gt;
 investingfund--distinct funds that are investing in that CMSA-year &lt;br /&gt;
 investingfundnear--distinct funds from that CMSA that invested in that CMSA-year &lt;br /&gt;
 investingfundfar--distinct funds from outside that CMSA that invested in that CMSA-year &lt;br /&gt;
 deals--number of deals &lt;br /&gt;
 neardeals--number of deals with at least one investor from inside the CMSA &lt;br /&gt;
 fardeals--number of deals with at least one investor from outside the CMSA; some deals may count in both categories because syndicate members can be both inside and outside the CMSA&lt;br /&gt;
 earlystagedeals--deals with earlystage companies&lt;br /&gt;
 laterstagedeals--deals with later stage companies &lt;br /&gt;
 startupseeddeals--deals with startup/seed companies &lt;br /&gt;
 otherstagedeals--deals with companies in other stages &lt;br /&gt;
 newportcosfunded--number of portfolio companies to receive their first investment in that year&lt;br /&gt;
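&lt;br /&gt;
The near/far counting for a single CMSA-year can be sketched as below. This Python example is an illustration only, assuming that a deal counts as near if any syndicate member's fund is in the company's CMSA and as far if any member's fund is outside it; the function names and the CMSA codes in the sample deals are made up.&lt;br /&gt;
&lt;br /&gt;

```python
def classify_deal(company_cmsa, fund_cmsas):
    """Return (near, far) flags for one deal (one round syndicate)."""
    near = any(code == company_cmsa for code in fund_cmsas)
    far = any(code != company_cmsa for code in fund_cmsas)
    return near, far

def deal_counts(deals):
    """Aggregate deals, neardeals and fardeals for one CMSA-year.

    deals is a list of (company_cmsa, [cmsa code of each investing fund]).
    """
    totals = {"deals": 0, "neardeals": 0, "fardeals": 0}
    for company_cmsa, fund_cmsas in deals:
        near, far = classify_deal(company_cmsa, fund_cmsas)
        totals["deals"] += 1
        totals["neardeals"] += int(near)
        totals["fardeals"] += int(far)
    return totals

# Three deals for companies in CMSA 1: all-local, mixed, and all-outside syndicates.
sample_deals = [
    (1, [1, 1]),
    (1, [1, 5]),
    (1, [7]),
]
totals = deal_counts(sample_deals)
print(totals)
```

Because the mixed syndicate is counted under both flags, neardeals plus fardeals can exceed deals, so the two columns should not be read as a partition of the deal count.&lt;br /&gt;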
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[category:Internal]]&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hubs&amp;diff=19456</id>
		<title>Hubs</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hubs&amp;diff=19456"/>
		<updated>2017-07-20T21:57:16Z</updated>

		<summary type="html">&lt;p&gt;HiraF: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Hubs&lt;br /&gt;
|Has owner=Hira Farooqi,&lt;br /&gt;
|Has keywords=Data&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
The Hubs Research Project is a full-length academic paper analyzing the effectiveness of &amp;quot;hubs&amp;quot;, a component of the entrepreneurship ecosystem, in the advancement and growth of entrepreneurial success in a metropolitan area. It focuses on cities in the United States as the primary unit of analysis.&lt;br /&gt;
&lt;br /&gt;
This page contains information about data used for this research project, including data sources, location of data on RDP and details on data processing.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Information on initial data work done prior to Summer 2017 can be found at [[Hubs Summer 2016]]. &lt;br /&gt;
&lt;br /&gt;
'''Note on joining:''' The city-state-year ID from VC data is used as the master ID for joining datasets. Each table (e.g. income, nih, nsf, sbir, compustat) is first joined with the VC data on city-state-year ID and then the resulting tables are all joined together in the final table.&lt;br /&gt;
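&lt;br /&gt;
The joining approach in the note above can be sketched as follows. This Python example is only an illustration: the ID format produced by make_id, the column names and the sample values are hypothetical, and it assumes that rows absent from the VC data are dropped (a left join from the VC master table).&lt;br /&gt;
&lt;br /&gt;

```python
def make_id(city, state, year):
    """Build a city-state-year key such as 'houston_tx_2010' (hypothetical format)."""
    return "{}_{}_{}".format(city.strip().lower().replace(" ", "_"), state.strip().lower(), year)

def left_join(master, other, prefix):
    """Keep every master row and attach matching columns from other, with a prefix."""
    joined = {}
    for key, row in master.items():
        merged = dict(row)
        for column, value in other.get(key, {}).items():
            merged[prefix + "_" + column] = value
        joined[key] = merged
    return joined

# Made-up tables keyed by city-state-year ID; the VC table is the master.
vc = {
    make_id("Houston", "TX", 2010): {"deals": 12},
    make_id("Boulder", "CO", 2010): {"deals": 5},
}
income = {make_id("Houston", "TX", 2010): {"percapita": 45000}}
sbir = {
    make_id("Boulder", "CO", 2010): {"grants": 3},
    make_id("Austin", "TX", 2010): {"grants": 9},  # not in the VC data, so dropped
}

final = left_join(left_join(vc, income, "income"), sbir, "sbir")
print(final)
```

Prefixing the joined columns with their source table keeps identically named columns (for example, amounts in both nih and nsf) from overwriting each other in the final table.&lt;br /&gt;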
&lt;br /&gt;
&lt;br /&gt;
===Data by zip code===&lt;br /&gt;
*Population data, 2000-2016 - US Census Bureau (E:\McNair\Hubs\summer 2017)&lt;br /&gt;
https://www2.census.gov/programs-surveys/popest/datasets/&lt;br /&gt;
*Income data, 1998-2014 - The Internal Revenue Service (E:\McNair\Hubs\summer 2017)&lt;br /&gt;
https://www.irs.gov/uac/about-irs&lt;br /&gt;
*DCI index, to assess the economic well-being of communities&lt;br /&gt;
http://eig.org/dci/interactive-maps/u-s-zip-codes&lt;br /&gt;
*R&amp;amp;D Expenses, 1980-2016 - Wharton Research Data Services (E:\McNair\Hubs\summer 2017)&lt;br /&gt;
*Zipcode look-up table obtained from https://www.unitedstateszipcodes.org/zip-code-database/. It's available in (E:\McNair\Hubs\summer 2017).&lt;br /&gt;
&lt;br /&gt;
== Data by MSA ==&lt;br /&gt;
&lt;br /&gt;
We have principal cities of MSAs from the Census:&lt;br /&gt;
https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html&lt;br /&gt;
&lt;br /&gt;
We might be able to go City -&amp;gt; FIPS place code -&amp;gt; MSA?&lt;br /&gt;
&lt;br /&gt;
Cities and their FIPS codes (which don't perfectly correspond) are available from https://www.census.gov/geo/reference/codes/place.html&lt;br /&gt;
&lt;br /&gt;
The Census claims to provide city to MSA here: https://www.census.gov/geo/maps-data/data/ua_rel_download.html&lt;br /&gt;
However, there is only CBSA!&lt;br /&gt;
&lt;br /&gt;
This might do it: https://www2.census.gov/geo/pdfs/maps-data/data/rel/explanation_ua_cbsa_rel_10.pdf&lt;br /&gt;
We can maybe track city to principal city to MSA&lt;br /&gt;
&lt;br /&gt;
==COMPUSTAT Data==&lt;br /&gt;
The data set includes information on publicly traded firms in the US. It was obtained from the Wharton Research Data Services (https://wrds-web.wharton.upenn.edu/wrds/index.cfm?). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Raw Data is in:&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The source file is RandDExpenditures.txt. It contains:&lt;br /&gt;
*Data from 1980-2017 (July). &lt;br /&gt;
*427799 records&lt;br /&gt;
*Fields include:&lt;br /&gt;
**R&amp;amp;D Expenditure&lt;br /&gt;
**Address (inc. city, zip, state)&lt;br /&gt;
**Revenue of firms&lt;br /&gt;
 &lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: COMPUSTAT.sql&lt;br /&gt;
&lt;br /&gt;
Output file is COMPUSTATSummary.txt. It contains:&lt;br /&gt;
*Variables: City, year, No.public firms, sum R&amp;amp;D, sum Sales, sum total assets&lt;br /&gt;
*1979-2016&lt;br /&gt;
*4440 cities&lt;br /&gt;
&lt;br /&gt;
It is located in&lt;br /&gt;
 Z:\Hubs\2017\Output_Files&lt;br /&gt;
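&lt;br /&gt;
As a rough illustration of how a city-year summary like COMPUSTATSummary.txt could be produced, the sketch below counts firms and sums R&amp;amp;D, sales and total assets per city and year. It does not reproduce COMPUSTAT.sql; the field names and sample records are assumptions, and missing values are treated as zero in the sums (as SQL SUM skips NULLs).&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict

# Hypothetical firm-year records; field names are illustrative,
# not the actual headers of RandDExpenditures.txt.
firm_years = [
    {"city": "Houston", "year": 2015, "randd": 10.0, "sales": 200.0, "assets": 500.0},
    {"city": "Houston", "year": 2015, "randd": None, "sales": 50.0, "assets": 80.0},
    {"city": "Austin", "year": 2015, "randd": 5.0, "sales": 20.0, "assets": 40.0},
]


def summarize_by_city_year(rows):
    """Count firms and sum R and D, sales and assets for each (city, year) pair."""
    summary = defaultdict(lambda: {"numfirms": 0, "randd": 0.0, "sales": 0.0, "assets": 0.0})
    for row in rows:
        cell = summary[(row["city"], row["year"])]
        cell["numfirms"] += 1
        for field in ("randd", "sales", "assets"):
            cell[field] += row[field] or 0.0  # a missing value adds nothing to the sum
    return dict(summary)


summary = summarize_by_city_year(firm_years)
```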
&lt;br /&gt;
==NSF Data==&lt;br /&gt;
Data is in:&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: nsf_2017.sql&lt;br /&gt;
&lt;br /&gt;
The source files are: nsf2017.txt, copied from table '''nsf''', and nsf_institution copied from table '''nsf_grants_institution''' from the biotech db.&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*Award ID&lt;br /&gt;
*Award Institution&lt;br /&gt;
*Award Effective date&lt;br /&gt;
*Institution city&lt;br /&gt;
*Award Value&lt;br /&gt;
*Organization state code&lt;br /&gt;
From 1900 - 2017&lt;br /&gt;
&lt;br /&gt;
Output file is nsfSummary.txt. It contains:&lt;br /&gt;
*Variables: City, State code, year, nsf_nogrants, nsf_valuegrant &lt;br /&gt;
*1900-2017&lt;br /&gt;
&lt;br /&gt;
===Joined NSF table===&lt;br /&gt;
The NSF table joined with the VC table is in db '''cities'''. The table is named '''merged_nsf'''.&lt;br /&gt;
Missing values of nogrants and valuegrant for years 1990-2017 are set to 0.&lt;br /&gt;
The sql script is in&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
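&lt;br /&gt;
The zero-filling rule for joined tables can be sketched as follows. This is a minimal Python illustration, not the project's SQL: the function name and sample rows are hypothetical, and the assumption is that a missing value is only replaced with 0 inside the stated year range and is left empty outside it.&lt;br /&gt;
&lt;br /&gt;
```python
def fill_missing_in_coverage(rows, columns, first_year, last_year):
    """Replace missing values with 0 for years inside the coverage range only."""
    covered = range(first_year, last_year + 1)
    filled = []
    for row in rows:
        new_row = dict(row)
        if new_row["year"] in covered:
            for col in columns:
                if new_row.get(col) is None:
                    new_row[col] = 0
        filled.append(new_row)
    return filled


# Made-up rows from a joined table, for illustration only.
merged = [
    {"city_state_year": "HOUSTON_TX_2015", "year": 2015, "nsf_nogrants": None, "nsf_valuegrant": None},
    {"city_state_year": "HOUSTON_TX_1985", "year": 1985, "nsf_nogrants": None, "nsf_valuegrant": None},
    {"city_state_year": "AUSTIN_TX_2015", "year": 2015, "nsf_nogrants": 3, "nsf_valuegrant": 1.5},
]
result = fill_missing_in_coverage(merged, ["nsf_nogrants", "nsf_valuegrant"], 1990, 2017)
```
&lt;br /&gt;
Restricting the fill to the coverage years keeps "no grants recorded" distinct from "no data collected for that year".&lt;br /&gt;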
&lt;br /&gt;
==NIH Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: nih2017.sql&lt;br /&gt;
The source files are: &lt;br /&gt;
*nih_1986_2001.csv&lt;br /&gt;
*nih_2002_2012.txt&lt;br /&gt;
*nih_2013_2015&lt;br /&gt;
located in E:\McNair\Projects\Federal Grant Data\NIH&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The script that cleans NIH data and generates the summary table is titled '''nihSummary'''. It is located here:&lt;br /&gt;
&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
This table includes&lt;br /&gt;
*year&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*country&lt;br /&gt;
*nogrants (number of grants)&lt;br /&gt;
*valuegrant&lt;br /&gt;
*city_state &lt;br /&gt;
&lt;br /&gt;
*Data from 1986-2015&lt;br /&gt;
&lt;br /&gt;
===Joined NIH table===&lt;br /&gt;
The NIH table joined with the VC table is in db '''cities'''. The table is named '''merged_nih'''.&lt;br /&gt;
Missing values of nih_valuegrant and nih_nogrants for years 1986-2015 are set to 0.&lt;br /&gt;
The sql script is in&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
==Clinical Trials Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: ctrials.sql&lt;br /&gt;
The source file is: &lt;br /&gt;
&lt;br /&gt;
*medclinical.txt&lt;br /&gt;
&lt;br /&gt;
located in Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
*Data from 1999-2017&lt;br /&gt;
&lt;br /&gt;
===Joined clinical trials table===&lt;br /&gt;
&lt;br /&gt;
The file which contains the number of trials in each city and year is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
The file is in:&lt;br /&gt;
 Z:\Hubs\2017\clean data&lt;br /&gt;
The name of the file is:&lt;br /&gt;
  ctrialsSummary.txt&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*city&lt;br /&gt;
*year&lt;br /&gt;
*city_state_year&lt;br /&gt;
*noctrials - number of trials&lt;br /&gt;
&lt;br /&gt;
The clinical trials table is joined with the VC table. &lt;br /&gt;
The SQL script for the join is '''new_ctrials.sql''' and it is located in&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
The name of the joined table is '''new_merged_ctrials'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*city_state_id&lt;br /&gt;
*city_state_year&lt;br /&gt;
*year&lt;br /&gt;
*noctrials&lt;br /&gt;
*seedamtm&lt;br /&gt;
*earlyamtm&lt;br /&gt;
*lateramtm&lt;br /&gt;
*selamtm&lt;br /&gt;
*numseeds&lt;br /&gt;
*numearly&lt;br /&gt;
*numlater&lt;br /&gt;
*numsel&lt;br /&gt;
&lt;br /&gt;
Missing values of noctrials for years 1999-2017 are set to 0.&lt;br /&gt;
&lt;br /&gt;
==Population Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: '''population.sql'''&lt;br /&gt;
The source files are: &lt;br /&gt;
*pop2000_2009.xlsx&lt;br /&gt;
*pop2010_2016.xlsx&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*State&lt;br /&gt;
*City name	&lt;br /&gt;
*Year	&lt;br /&gt;
*Population Estimates&lt;br /&gt;
&lt;br /&gt;
Data from 2000-2016&lt;br /&gt;
&lt;br /&gt;
===Joined population table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\2017\clean data&lt;br /&gt;
The file names are &lt;br /&gt;
 1_population.txt - contains data on population estimates from 2000-2009&lt;br /&gt;
 2_population.txt - contains data on population estimates from 2010-2016&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: '''new_population.sql''', located in &lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
The population table is joined with the VC table. The joined table is called '''new_merged_population'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*city_state_year to uniquely identify each city in each year&lt;br /&gt;
*Population estimates&lt;br /&gt;
*Year&lt;br /&gt;
*Code from the state code and Fips code&lt;br /&gt;
*State full name&lt;br /&gt;
&lt;br /&gt;
==Income Data==&lt;br /&gt;
&lt;br /&gt;
Raw data was obtained from the US Census Bureau's American Community Survey.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\MSA Income_raw.zip &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The master list with MSAs and principal cities is titled '''list2.xls'''. It is located at:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
This master list includes:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA name&lt;br /&gt;
*Principal City&lt;br /&gt;
*State&lt;br /&gt;
*Place code (city code)&lt;br /&gt;
*State Code&lt;br /&gt;
&lt;br /&gt;
This master list was edited to associate each principal city with a unique state. For example, New York is the principal city of the New York-New Jersey MSA and was originally associated with the state NY-NJ, so '''list''' was edited to put New York with NY. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Cleaned Income data files are in&lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID &lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Total Household Income &lt;br /&gt;
&lt;br /&gt;
The MSA-City-State look up file is titled '''msa_city_state_wcode.txt'''. It is located in &lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID &lt;br /&gt;
&lt;br /&gt;
The SQL file that merges income data from ACS (by MSA - Year) with the MSA-City file is titled '''income.sql'''. It is located here:&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
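&lt;br /&gt;
The MSA-to-city merge can be pictured as expanding each MSA-year row into one row per principal city of that MSA. The Python sketch below is only an illustration of that idea and is not '''income.sql'''; the MSA code, city names, field names and income figure are made up.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical look-up rows in the spirit of msa_city_state_wcode.txt.
msa_city_lookup = [
    {"msa_code": "M1", "city": "City A", "state": "TX"},
    {"msa_code": "M1", "city": "City B", "state": "TX"},
]
# Hypothetical MSA-year income rows; the income value is invented.
msa_income = [{"msa_code": "M1", "year": 2015, "income": 1000}]


def expand_msa_to_cities(msa_rows, lookup_rows):
    """Give every principal city of an MSA that MSA's value for each year."""
    cities_by_msa = {}
    for entry in lookup_rows:
        cities_by_msa.setdefault(entry["msa_code"], []).append(entry)
    city_rows = []
    for row in msa_rows:
        # MSAs without a look-up entry produce no city rows.
        for entry in cities_by_msa.get(row["msa_code"], []):
            city_rows.append({
                "msa_code": row["msa_code"],
                "city": entry["city"],
                "state": entry["state"],
                "year": row["year"],
                "income": row["income"],
            })
    return city_rows


city_income = expand_msa_to_cities(msa_income, msa_city_lookup)
```
&lt;br /&gt;
Under this approach all principal cities in the same MSA share the same MSA-level figure, which is worth keeping in mind when the data is later analyzed at the city level.&lt;br /&gt;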
   &lt;br /&gt;
 &lt;br /&gt;
The final income table is in db '''cities''' titled '''merged_income'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Year&lt;br /&gt;
*Total Household Income &lt;br /&gt;
&lt;br /&gt;
The table includes 8780 observations&lt;br /&gt;
&lt;br /&gt;
===Joined income table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\clean data&lt;br /&gt;
The file names are:&lt;br /&gt;
 INC_05.txt - INC_15.txt&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: merged_income.sql&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*Income&lt;br /&gt;
*Year&lt;br /&gt;
*Code from the state code and Fips code&lt;br /&gt;
&lt;br /&gt;
==Employment Data==&lt;br /&gt;
&lt;br /&gt;
Data on employment was obtained from the American Community Survey, US Census Bureau.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\Employment Data by MSA&lt;br /&gt;
Cleaned files are in&lt;br /&gt;
 Z:\Hubs\2017\clean data  &lt;br /&gt;
 &lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Employment rate of individuals 16 years or older&lt;br /&gt;
*Unemployment rate of individuals 16 years or older&lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The SQL file that merges employment data from ACS (by MSA - Year) with the MSA-City file is titled '''Employment.sql'''. &lt;br /&gt;
The file is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The final table is in db '''cities''' titled '''merged_employment'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*Year&lt;br /&gt;
*Employment rate&lt;br /&gt;
*Unemployment rate&lt;br /&gt;
&lt;br /&gt;
===Joined employment table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\clean data&lt;br /&gt;
 &lt;br /&gt;
The file names are:&lt;br /&gt;
 EMP_05.txt - EMP_15.txt &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: '''new_employment.sql''' and it is located in &lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
The final table, which is joined with the VC table, is in db '''cities''' and titled '''new_merged_employment'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Code from the state code and Fips code&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*city_state_year to uniquely identify each city in each year&lt;br /&gt;
*Employment rates of individuals of 16 years or older&lt;br /&gt;
*Unemployment rates of individuals of 16 years or older&lt;br /&gt;
*Year&lt;br /&gt;
&lt;br /&gt;
==Schooling Data==&lt;br /&gt;
&lt;br /&gt;
Data on schooling was obtained from the American Community Survey, US Census Bureau.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\School Enrollment Data by MSA&lt;br /&gt;
Cleaned files are in&lt;br /&gt;
 Z:\Hubs\2017\clean data&lt;br /&gt;
 &lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Total number of population 3 years and over enrolled in school&lt;br /&gt;
*Percent of population 3 years and over enrolled in public school&lt;br /&gt;
*Percent of population 3 years and over enrolled in private school &lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The SQL file that merges schooling data from ACS (by MSA - Year) with the MSA-City file is titled '''schooling.sql'''. &lt;br /&gt;
The file is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The final table is in db '''cities''' titled '''merged_schooling'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*Year&lt;br /&gt;
*Total&lt;br /&gt;
*Percent_public_schooling&lt;br /&gt;
*Percent_private_schooling&lt;br /&gt;
&lt;br /&gt;
===Joined schooling table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\clean data&lt;br /&gt;
The file names are:&lt;br /&gt;
 SCH_05.txt - SCH_15.txt&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script which joins this table with VC table is: '''new_merged_schooling.sql'''&lt;br /&gt;
The final table is in db '''cities''' titled '''new_merged_schooling'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*city_state_year to uniquely identify each city in each year&lt;br /&gt;
*Total number of school enrollment&lt;br /&gt;
*Percentage enrolled in public schools&lt;br /&gt;
*Percentage enrolled in private schools&lt;br /&gt;
*Year&lt;br /&gt;
*Code from the state code and Fips code&lt;br /&gt;
&lt;br /&gt;
==VC Data==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
  Z:\Hubs\2017\clean data&lt;br /&gt;
  The file name is new_vc_data.txt&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*year&lt;br /&gt;
*seedamtm - seed, amount in millions&lt;br /&gt;
*earlyamtm - early, amount in millions&lt;br /&gt;
*lateramtm - late, amount in millions&lt;br /&gt;
*selamtm - seed early late, amount in millions&lt;br /&gt;
*numseeds - number of seeds&lt;br /&gt;
*numearly &lt;br /&gt;
*numlater&lt;br /&gt;
*numsel&lt;br /&gt;
*numdeals&lt;br /&gt;
*numalive&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Data from 1948-2017&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The table is in db '''cities''' titled '''new_vc'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*city_state_id&lt;br /&gt;
*city_state_year&lt;br /&gt;
*seedamtm&lt;br /&gt;
*earlyamtm&lt;br /&gt;
*lateramtm&lt;br /&gt;
*selamtm&lt;br /&gt;
*numseeds&lt;br /&gt;
*numearly&lt;br /&gt;
*numlater&lt;br /&gt;
*numsel&lt;br /&gt;
*numdeals&lt;br /&gt;
*numalive&lt;br /&gt;
*year&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Final Joined Data set == &lt;br /&gt;
&lt;br /&gt;
The final data set is in file '''final.txt''' and is located here:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*city_state_year - (ID that data is merged on)&lt;br /&gt;
*year&lt;br /&gt;
*seedamtm - Seed Amount&lt;br /&gt;
*earlyamtm - Early Investment Amount&lt;br /&gt;
*lateramtm - Late Investment Amount&lt;br /&gt;
*selamtm - Seed, early and later amount combined&lt;br /&gt;
*numseeds - Number of seed investments &lt;br /&gt;
*numearly - Number of early investments&lt;br /&gt;
*numlater - Number of late investments&lt;br /&gt;
*numsel &lt;br /&gt;
*numdeals - Number of deals (first contracts)&lt;br /&gt;
*numalive - Number of start ups alive&lt;br /&gt;
*income - Income per capita in each city-year&lt;br /&gt;
*sbir_nogrants - Number of SBIR grants&lt;br /&gt;
*sbir_valuegrant - Value of SBIR grants&lt;br /&gt;
*emp - Employment stats of each city-year&lt;br /&gt;
*unemp - Rate of unemployment&lt;br /&gt;
*popestimate - Population estimate of each city-year&lt;br /&gt;
*private - Enrollment in private schools&lt;br /&gt;
*public - Enrollment in public schools&lt;br /&gt;
*total - &lt;br /&gt;
*numfirms - Number of publicly traded firms&lt;br /&gt;
*randd - R&amp;amp;D expenditure of publicly traded firms&lt;br /&gt;
*revenue - Revenue of PTF&lt;br /&gt;
*totalassets &lt;br /&gt;
*nsf_nogrants - Number of NSF grants&lt;br /&gt;
*valuegrant - Value of NSF grants&lt;br /&gt;
*nih_nogrants - Number of NIH grants&lt;br /&gt;
*nih_valuegrant - Value of NIH grants&lt;br /&gt;
*noctrials - Number of clinical trials&lt;br /&gt;
&lt;br /&gt;
== Defining Hubs == &lt;br /&gt;
'''Summer 2016''' - Last year a master list of 125 &amp;quot;potential&amp;quot; hubs was used. A scorecard was developed to filter these 125 candidate hubs and determine which should be included in the study sample. This method resulted in a sample size of ~ 30. The master list and the final hubs list are in the file titled '''Hubs Data v2_'16'''. It is located here:&lt;br /&gt;
 Z:\Hubs\2017\hubs_data&lt;br /&gt;
&lt;br /&gt;
'''Summer 2017''' - In order to obtain a larger sample of hubs, we developed 5 criteria which produce a more relaxed definition of hubs than last year. These are:&lt;br /&gt;
&lt;br /&gt;
*Availability of co-working space&lt;br /&gt;
*Coding classes or tech events&lt;br /&gt;
*Some focus on the tech sector (this is important as our dependent variable is VC funding)&lt;br /&gt;
*Presence of an accelerator&lt;br /&gt;
*Availability of mentorship for members.&lt;br /&gt;
&lt;br /&gt;
We will review the 125 candidate hubs and select those which satisfy a subset or all of these characteristics.&lt;br /&gt;
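&lt;br /&gt;
One way to apply the five criteria is to count how many each candidate satisfies and keep those at or above a chosen threshold. The Python sketch below is only an illustration: the field names, the sample candidates and the threshold of 3 are assumptions, since the text does not fix how many criteria a hub must meet.&lt;br /&gt;
&lt;br /&gt;
```python
# The five criteria, with hypothetical field names.
CRITERIA = (
    "coworking_space",
    "coding_classes_or_tech_events",
    "tech_focus",
    "accelerator",
    "mentorship",
)


def criteria_met(hub):
    """Count how many of the five criteria a candidate hub satisfies."""
    return sum(1 for name in CRITERIA if hub.get(name))


def select_hubs(candidates, min_criteria):
    """Return names of candidates meeting at least min_criteria criteria."""
    return [hub["name"] for hub in candidates if criteria_met(hub) >= min_criteria]


# Made-up candidates for illustration only.
candidates = [
    {"name": "Hub A", "coworking_space": True, "coding_classes_or_tech_events": True,
     "tech_focus": True, "accelerator": False, "mentorship": True},
    {"name": "Hub B", "coworking_space": True, "coding_classes_or_tech_events": False,
     "tech_focus": False, "accelerator": False, "mentorship": False},
]
selected = select_hubs(candidates, min_criteria=3)  # the threshold is an assumption
```
&lt;br /&gt;
Raising or lowering min_criteria makes the definition stricter or more relaxed, which directly changes the sample size.&lt;br /&gt;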
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[category:Internal]]&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hubs_Analysis_2017&amp;diff=19455</id>
		<title>Hubs Analysis 2017</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hubs_Analysis_2017&amp;diff=19455"/>
		<updated>2017-07-20T21:56:38Z</updated>

		<summary type="html">&lt;p&gt;HiraF: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Hubs Analysis 2017&lt;br /&gt;
|Has owner=Hira Farooqi,&lt;br /&gt;
|Has keywords=Data Analysis&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Analysis Stage 1 == &lt;br /&gt;
&lt;br /&gt;
The results are based on preliminary data analysis. The do-file that performs some analysis is located here:&lt;br /&gt;
 Z:\Hubs\2017\hubs_data&lt;br /&gt;
&lt;br /&gt;
It is titled '''load_vc_hubs_data.do'''.&lt;br /&gt;
&lt;br /&gt;
The sample of hubs used at this stage is the 29-30 hubs that were shortlisted in summer 2016. After data cleaning we get a sample of ~ 25 hubs. Important observations from initial fixed effects regressions are:&lt;br /&gt;
&lt;br /&gt;
* The sign of the correlation between VC funding and the presence of hubs is very sensitive to the measure of VC funding. It is positive for early investments and negative for later investments.&lt;br /&gt;
*The size and significance of the correlation are sensitive to whether funding is measured by number of deals or by amount. Results are more significant for the number of deals than for the value of the deals.&lt;br /&gt;
*Including SBIR data significantly reduces the number of groups in the panel and hence the sample size.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[category:Internal]]&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hubs_Analysis_2017&amp;diff=19454</id>
		<title>Hubs Analysis 2017</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hubs_Analysis_2017&amp;diff=19454"/>
		<updated>2017-07-20T21:54:40Z</updated>

		<summary type="html">&lt;p&gt;HiraF: /* Analysis Stage 1 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Hubs Analysis 2017&lt;br /&gt;
|Has owner=Hira Farooqi,&lt;br /&gt;
|Has keywords=Data Analysis&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Analysis Stage 1 == &lt;br /&gt;
&lt;br /&gt;
The results are based on preliminary data analysis. The do-file that performs some analysis is located here:&lt;br /&gt;
 Z:\Hubs\2017\hubs_data&lt;br /&gt;
&lt;br /&gt;
It is titled '''load_vc_hubs_data.do'''.&lt;br /&gt;
&lt;br /&gt;
The sample of hubs used at this stage is the 29-30 hubs that were shortlisted in summer 2016. After data cleaning we get a sample of ~ 25 hubs. Important observations from initial fixed effects regressions are:&lt;br /&gt;
&lt;br /&gt;
* The sign of the correlation between VC funding and the presence of hubs is very sensitive to the measure of VC funding. It is positive for early investments and negative for later investments.&lt;br /&gt;
*The size and significance of the correlation are sensitive to whether funding is measured by number of deals or by amount. Results are more significant for the number of deals than for the value of the deals.&lt;br /&gt;
*Including SBIR data significantly reduces the number of groups in the panel and hence the sample size.&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hubs_Analysis_2017&amp;diff=19453</id>
		<title>Hubs Analysis 2017</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hubs_Analysis_2017&amp;diff=19453"/>
		<updated>2017-07-20T21:53:00Z</updated>

		<summary type="html">&lt;p&gt;HiraF: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Hubs Analysis 2017&lt;br /&gt;
|Has owner=Hira Farooqi,&lt;br /&gt;
|Has keywords=Data Analysis&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
==Analysis Stage 1 == &lt;br /&gt;
&lt;br /&gt;
The results are based on preliminary data analysis. The do-file that performs some analysis is located here:&lt;br /&gt;
 Z:\Hubs\2017\hubs_data&lt;br /&gt;
&lt;br /&gt;
It is titled '''load_vc_hubs_data.do'''.&lt;br /&gt;
&lt;br /&gt;
The sample of hubs used at this stage is the 29-30 hubs that were shortlisted in summer 2016. After data cleaning we get a sample of ~ 25 hubs. Important observations from initial fixed effects regressions are:&lt;br /&gt;
&lt;br /&gt;
* The sign of the correlation between VC funding and the presence of hubs is very sensitive to the measure of VC funding. It is positive for early investments and negative for later investments.&lt;br /&gt;
*The size and significance of the correlation is sensitive to number of deals vs the amount of funding. Results are more significant for number of deals compared to the results for value of the deals.&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hubs_Analysis_2017&amp;diff=19450</id>
		<title>Hubs Analysis 2017</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hubs_Analysis_2017&amp;diff=19450"/>
		<updated>2017-07-20T21:40:15Z</updated>

		<summary type="html">&lt;p&gt;HiraF: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Hubs Analysis 2017&lt;br /&gt;
|Has owner=Hira Farooqi,&lt;br /&gt;
|Has keywords=Data Analysis&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hubs_Analysis_2017&amp;diff=19449</id>
		<title>Hubs Analysis 2017</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hubs_Analysis_2017&amp;diff=19449"/>
		<updated>2017-07-20T21:39:55Z</updated>

		<summary type="html">&lt;p&gt;HiraF: Created page with &amp;quot;{{McNair Projects |Has title=Hubs |Has owner=Hira Farooqi, |Has keywords=Data Analysis |Has project status=Active }}&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Hubs&lt;br /&gt;
|Has owner=Hira Farooqi,&lt;br /&gt;
|Has keywords=Data Analysis&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hira_Farooqi_(Work_Log)&amp;diff=19448</id>
		<title>Hira Farooqi (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hira_Farooqi_(Work_Log)&amp;diff=19448"/>
		<updated>2017-07-20T21:37:50Z</updated>

		<summary type="html">&lt;p&gt;HiraF: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Hira Farooqi]] [[Work Logs]] [[Hira Farooqi (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
'''19/06/2017''' - Abstract added to [[Hubs (Academic Paper)]]&lt;br /&gt;
&lt;br /&gt;
'''27/06/2017''' - Cleaned data files for income at MSA level. Data description added to Income Data section on [[Hubs]]&lt;br /&gt;
&lt;br /&gt;
'''28/06/2017''' - Cleaned data files for income at MSA level. For further information on final income data tables and merging, refer to Income Data section in [[Hubs]]. &lt;br /&gt;
Data on Houston's current entrepreneurial ecosystem, including R&amp;amp;D expenditure by firms, NSF, NIH grants and STEM Grad students compiled and added to the following location&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\Houston&lt;br /&gt;
&lt;br /&gt;
'''10/07/2017''' - Received VC data from Adrian. Cleaned and prepared data on grants (NSF, NIH, Clinical trials), income, publicly traded firms and other variables to merge with VC data. Further details in [[Hubs]].&lt;br /&gt;
&lt;br /&gt;
'''11/07/2017''' - All previous data work on Hubs has been shifted to new page  [[Hubs Summer 2016]]. The new master data set is titled '''final.txt'''. [[Hubs]] includes more information on the columns included in the final data set.&lt;br /&gt;
&lt;br /&gt;
'''12/07/2017''' - In addition to grants information on NSF, NIH and clinical trials, we also decided to add SBIR grants data to our final data set. SBIR are government grants given to small businesses. SBIR data and summary tables are now available in the '''cities''' db. Further information is included in [[Hubs]].&lt;br /&gt;
&lt;br /&gt;
'''13-14/07/2017''' - The term '''hubs''' does not have a definition yet. In order to obtain a sample of hubs for this study, we need to finalize our definition of Hubs.  Currently I am gathering further information on hubs listed in the sample of hubs from last year and deciding what the most suitable definition of hubs could be for our study. Updates posted on [[Hubs]].&lt;br /&gt;
&lt;br /&gt;
'''17/07/2017''' - Last year a scorecard was used in order to determine whether a candidate hub would qualify to be in the study sample. This method produced a sample size of ~ 30. In order to increase the sample size, I have started working with 5 main characteristics that our sample hub should possess. Joe and I are going through the list of 125 candidate hubs from last year to check whether any hub rejected last year fulfills our slightly more &amp;quot;relaxed&amp;quot; definition. For further details see [[Hubs]].&lt;br /&gt;
&lt;br /&gt;
'''19/07/2017''' - Developed the definition of hubs. Cleaned hubs data for Stata.&lt;br /&gt;
&lt;br /&gt;
'''20/07/2017''' - Cleaned hubs data. Performed preliminary analysis for hubs in Stata. See the section on analysis in page [[Hubs Analysis 2017]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hira_Farooqi_(Work_Log)&amp;diff=19447</id>
		<title>Hira Farooqi (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hira_Farooqi_(Work_Log)&amp;diff=19447"/>
		<updated>2017-07-20T21:36:42Z</updated>

		<summary type="html">&lt;p&gt;HiraF: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Hira Farooqi]] [[Work Logs]] [[Hira Farooqi (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
'''19/06/2017''' - Abstract added to [[Hubs (Academic Paper)]]&lt;br /&gt;
&lt;br /&gt;
'''27/06/2017''' - Cleaned data files for income at MSA level. Data description added to Income Data section on [[Hubs]]&lt;br /&gt;
&lt;br /&gt;
'''28/06/2017''' - Cleaned data files for income at MSA level. For further information on final income data tables and merging, refer to Income Data section in [[Hubs]]. &lt;br /&gt;
Data on Houston's current entrepreneurial ecosystem, including R&amp;amp;D expenditure by firms, NSF, NIH grants and STEM Grad students compiled and added to the following location&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\Houston&lt;br /&gt;
&lt;br /&gt;
'''10/07/2017''' - Received VC data from Adrian. Cleaned and prepared data on grants (NSF, NIH, Clinical trials), income, publicly traded firms and other variables to merge with VC data. Further details in [[Hubs]].&lt;br /&gt;
&lt;br /&gt;
'''11/07/2017''' - All previous data work on Hubs has been shifted to new page  [[Hubs Summer 2016]]. The new master data set is titled '''final.txt'''. [[Hubs]] includes more information on the columns included in the final data set.&lt;br /&gt;
&lt;br /&gt;
'''12/07/2017''' - In addition to grants information on NSF, NIH and clinical trials, we also decided to add SBIR grants data to our final data set. SBIR are government grants given to small businesses. SBIR data and summary tables are now available in the '''cities''' db. Further information is included in [[Hubs]].&lt;br /&gt;
&lt;br /&gt;
'''13-14/07/2017''' - The term '''hubs''' does not have a definition yet. In order to obtain a sample of hubs for this study, we need to finalize our definition of Hubs.  Currently I am gathering further information on hubs listed in the sample of hubs from last year and deciding what the most suitable definition of hubs could be for our study. Updates posted on [[Hubs]].&lt;br /&gt;
&lt;br /&gt;
'''17/07/2017''' - Last year a scorecard was used in order to determine whether a candidate hub would qualify to be in the study sample. This method produced a sample size of ~ 30. In order to increase the sample size, I have started working with 5 main characteristics that our sample hub should possess. Joe and I are going through the list of 125 candidate hubs from last year to check whether any hub rejected last year fulfills our slightly more &amp;quot;relaxed&amp;quot; definition. For further details see [[Hubs]].&lt;br /&gt;
&lt;br /&gt;
'''19/07/2017''' - Developed the definition of hubs. Cleaned hubs data for Stata.&lt;br /&gt;
&lt;br /&gt;
'''20/07/2017''' - Cleaned hubs data. Performed preliminary analysis for hubs in Stata. See the section on analysis in page [[Hubs]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Dylan_Dickens_(Research_Plan)&amp;diff=19446</id>
		<title>Dylan Dickens (Research Plan)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Dylan_Dickens_(Research_Plan)&amp;diff=19446"/>
		<updated>2017-07-20T21:33:20Z</updated>

		<summary type="html">&lt;p&gt;HiraF: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hubs&amp;diff=19383</id>
		<title>Hubs</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hubs&amp;diff=19383"/>
		<updated>2017-07-18T16:22:21Z</updated>

		<summary type="html">&lt;p&gt;HiraF: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Hubs&lt;br /&gt;
|Has owner=Hira Farooqi,&lt;br /&gt;
|Has keywords=Data&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
The Hubs Research Project is a full-length academic paper analyzing the effectiveness of &amp;quot;hubs&amp;quot;, a component of the entrepreneurship ecosystem, in the advancement and growth of entrepreneurial success in a metropolitan area. It focuses on cities in the United States as the primary unit of analysis.&lt;br /&gt;
&lt;br /&gt;
This page contains information about data used for this research project, including data sources, location of data on RDP and details on data processing.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Information on initial data work done prior to Summer 2017 can be found at [[Hubs Summer 2016]]. &lt;br /&gt;
&lt;br /&gt;
'''Note on joining:''' The city-state-year ID from VC data is used as the master ID for joining datasets. Each table (e.g. income, nih, nsf, sbir, compustat) is first joined with the VC data on city-state-year ID and then the resulting tables are all joined together in the final table.&lt;br /&gt;
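&lt;br /&gt;
A minimal sketch of this join order, using Python's built-in sqlite3 module on made-up rows. The short table names (vc, nsf, nih), the column layout, the ID format and the sample values are all assumptions for illustration, as is the use of left joins; the actual scripts in the '''cities''' db may differ.&lt;br /&gt;

```python
import sqlite3


def build_final(vc_rows, nsf_rows, nih_rows):
    """Join each summary table onto the VC table by city_state_year."""
    con = sqlite3.connect(":memory:")
    # Hypothetical, cut-down versions of the tables described on this page.
    con.execute("CREATE TABLE vc (city_state_year TEXT PRIMARY KEY, numdeals INTEGER)")
    con.execute("CREATE TABLE nsf (city_state_year TEXT PRIMARY KEY, nsf_nogrants INTEGER)")
    con.execute("CREATE TABLE nih (city_state_year TEXT PRIMARY KEY, nih_nogrants INTEGER)")
    con.executemany("INSERT INTO vc VALUES (?, ?)", vc_rows)
    con.executemany("INSERT INTO nsf VALUES (?, ?)", nsf_rows)
    con.executemany("INSERT INTO nih VALUES (?, ?)", nih_rows)
    # VC is the master table, so it sits on the left of every join.
    query = """
        SELECT vc.city_state_year, vc.numdeals, nsf.nsf_nogrants, nih.nih_nogrants
        FROM vc
        LEFT JOIN nsf ON nsf.city_state_year = vc.city_state_year
        LEFT JOIN nih ON nih.city_state_year = vc.city_state_year
        ORDER BY vc.city_state_year
    """
    rows = con.execute(query).fetchall()
    con.close()
    return rows


final = build_final(
    vc_rows=[("Austin_TX_2015", 12), ("Houston_TX_2015", 7)],
    nsf_rows=[("Houston_TX_2015", 3)],
    nih_rows=[("Austin_TX_2015", 5), ("Dallas_TX_2015", 2)],
)
for row in final:
    print(row)
```

In this sketch a city-year that appears only in a grant table is dropped, and a city-year with no grant record comes through with an empty value.&lt;br /&gt;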
&lt;br /&gt;
&lt;br /&gt;
===Data by zip code===&lt;br /&gt;
*Population data, 2000-2016 - US Census Bureau (E:\McNair\Hubs\summer 2017)&lt;br /&gt;
https://www2.census.gov/programs-surveys/popest/datasets/&lt;br /&gt;
*Income data, 1998-2014 - The Internal Revenue Service (E:\McNair\Hubs\summer 2017)&lt;br /&gt;
https://www.irs.gov/uac/about-irs&lt;br /&gt;
*DCI index, to assess the economic well-being of communities&lt;br /&gt;
http://eig.org/dci/interactive-maps/u-s-zip-codes&lt;br /&gt;
*R&amp;amp;D Expenses, 1980-2016 - Wharton Research Data Services (E:\McNair\Hubs\summer 2017)&lt;br /&gt;
*Zipcode look-up table obtained from https://www.unitedstateszipcodes.org/zip-code-database/. It's available in (E:\McNair\Hubs\summer 2017).&lt;br /&gt;
&lt;br /&gt;
== Data by MSA ==&lt;br /&gt;
&lt;br /&gt;
We have principal cities of MSAs from the census:&lt;br /&gt;
https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html&lt;br /&gt;
&lt;br /&gt;
We might be able to go City -&amp;gt; FIPS place code -&amp;gt; MSA?&lt;br /&gt;
&lt;br /&gt;
Cities and their FIPS codes (which don't perfectly correspond) are available from https://www.census.gov/geo/reference/codes/place.html&lt;br /&gt;
&lt;br /&gt;
The Census claims to provide city to MSA here: https://www.census.gov/geo/maps-data/data/ua_rel_download.html&lt;br /&gt;
However, there is only CBSA!&lt;br /&gt;
&lt;br /&gt;
This might do it: https://www2.census.gov/geo/pdfs/maps-data/data/rel/explanation_ua_cbsa_rel_10.pdf&lt;br /&gt;
We can maybe track city to principal city to MSA&lt;br /&gt;
&lt;br /&gt;
==COMPUSTAT Data==&lt;br /&gt;
The data set includes information on publicly traded firms in the US. It was obtained from the Wharton Research Data Services (https://wrds-web.wharton.upenn.edu/wrds/index.cfm?). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Raw Data is in:&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The source file is RandDExpenditures.txt. It contains:&lt;br /&gt;
*Data from 1980-2017 (July). &lt;br /&gt;
*427799 records&lt;br /&gt;
*Fields include:&lt;br /&gt;
**R&amp;amp;D Expenditure&lt;br /&gt;
**Address (inc. city, zip, state)&lt;br /&gt;
**Revenue of firms&lt;br /&gt;
 &lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: COMPUSTAT.sql&lt;br /&gt;
&lt;br /&gt;
Output file is COMPUSTATSummary.txt. It contains:&lt;br /&gt;
*Variables: City, year, No.public firms, sum R&amp;amp;D, sum Sales, sum total assets&lt;br /&gt;
*1979-2016&lt;br /&gt;
*4440 cities&lt;br /&gt;
&lt;br /&gt;
It is located in&lt;br /&gt;
 Z:\Hubs\2017\Output_Files&lt;br /&gt;
&lt;br /&gt;
==NSF Data==&lt;br /&gt;
Data is in:&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: nsf_2017.sql&lt;br /&gt;
&lt;br /&gt;
The source files are: nsf2017.txt, copied from table '''nsf''', and nsf_institution copied from table '''nsf_grants_institution''' from the biotech db.&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*Award ID&lt;br /&gt;
*Award Institution&lt;br /&gt;
*Award Effective date&lt;br /&gt;
*Institution city&lt;br /&gt;
*Award Value&lt;br /&gt;
*Organization state code&lt;br /&gt;
From 1900 - 2017&lt;br /&gt;
&lt;br /&gt;
Output file is nsfSummary.txt. It contains:&lt;br /&gt;
*Variables: City, State code, year, nsf_nogrants, nsf_valuegrant &lt;br /&gt;
*1900-2017&lt;br /&gt;
&lt;br /&gt;
===Joined NSF table===&lt;br /&gt;
The NSF table joined with the VC table is in db '''cities'''. The table is named '''merged_nsf'''.&lt;br /&gt;
Missing values of nogrants and valuegrant for years 1990-2017 are set to 0.&lt;br /&gt;
The SQL script is in&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
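&lt;br /&gt;
The zero-filling step can be sketched in plain Python as follows. The function name, the row layout (one dict per city-year) and the sample values are assumptions for illustration; the year range and the column names nogrants and valuegrant come from this section. The real step is done in the SQL script and may be written differently.&lt;br /&gt;

```python
def fill_missing_grants(rows, first_year=1990, last_year=2017):
    """Return copies of rows with missing grant values set to 0 inside the year range."""
    years = range(first_year, last_year + 1)
    filled = []
    for row in rows:
        row = dict(row)  # copy, so the caller's rows are left untouched
        if row["year"] in years:
            for col in ("nogrants", "valuegrant"):
                if row[col] is None:
                    row[col] = 0
        filled.append(row)
    return filled


# Made-up rows: None stands for a missing value after the join.
sample = [
    {"year": 1985, "nogrants": None, "valuegrant": None},
    {"year": 1995, "nogrants": None, "valuegrant": 250.0},
    {"year": 2001, "nogrants": 4, "valuegrant": None},
]
filled_sample = fill_missing_grants(sample)
for row in filled_sample:
    print(row)
```

Rows outside the year range are left as they are, so a missing value there stays missing rather than being read as zero grants.&lt;br /&gt;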
&lt;br /&gt;
==NIH Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: nih2017.sql&lt;br /&gt;
The source files are: &lt;br /&gt;
*nih_1986_2001.csv&lt;br /&gt;
*nih_2002_2012.txt&lt;br /&gt;
*nih_2013_2015&lt;br /&gt;
located in E:\McNair\Projects\Federal Grant Data\NIH&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The script that cleans NIH data and generates the summary table is titled '''nihSummary'''. It is located here:&lt;br /&gt;
&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
This table includes&lt;br /&gt;
*year&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*country&lt;br /&gt;
*nogrants (number of grants)&lt;br /&gt;
*valuegrant&lt;br /&gt;
*city_state &lt;br /&gt;
&lt;br /&gt;
*Data from 1986-2015&lt;br /&gt;
&lt;br /&gt;
===Joined NIH table===&lt;br /&gt;
The NIH table joined with the VC table is in db '''cities'''. The table is named '''merged_nih'''.&lt;br /&gt;
Missing values of nih_valuegrant and nih_nogrants for years 1986-2015 are set to 0.&lt;br /&gt;
The SQL script is in&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
==Clinical Trials Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: ctrials.sql&lt;br /&gt;
The source file is: &lt;br /&gt;
&lt;br /&gt;
*medclinical.txt&lt;br /&gt;
&lt;br /&gt;
located in Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
*Data from 1999-2017&lt;br /&gt;
&lt;br /&gt;
===Joined clinical trials table===&lt;br /&gt;
&lt;br /&gt;
The file which contains the number of trials in each city and year is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
The file is in:&lt;br /&gt;
 Z:\Hubs\2017\clean data&lt;br /&gt;
The name of the file is:&lt;br /&gt;
  ctrialsSummary.txt&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*city&lt;br /&gt;
*year&lt;br /&gt;
*city_state_year&lt;br /&gt;
*noctrials - number of trials&lt;br /&gt;
&lt;br /&gt;
The ctrials table is joined with the VC table. &lt;br /&gt;
The SQL script for the join is '''new_ctrials.sql''', located in&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
The name of the joined table is '''new_merged_ctrials'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*city_state_id&lt;br /&gt;
*city_state_year&lt;br /&gt;
*year&lt;br /&gt;
*noctrials&lt;br /&gt;
*seedamtm&lt;br /&gt;
*earlyamtm&lt;br /&gt;
*lateramtm&lt;br /&gt;
*selamtm&lt;br /&gt;
*numseeds&lt;br /&gt;
*numearly&lt;br /&gt;
*numlater&lt;br /&gt;
*numsel&lt;br /&gt;
&lt;br /&gt;
Missing values of noctrials for years 1999-2017 are set to 0.&lt;br /&gt;
&lt;br /&gt;
==Population Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: '''population.sql'''&lt;br /&gt;
The source files are: &lt;br /&gt;
*pop2000_2009.xlsx&lt;br /&gt;
*pop2010_2016.xlsx&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*State&lt;br /&gt;
*City name	&lt;br /&gt;
*Year	&lt;br /&gt;
*Population Estimates&lt;br /&gt;
&lt;br /&gt;
Data from 2000-2016&lt;br /&gt;
&lt;br /&gt;
===Joined population table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\2017\clean data&lt;br /&gt;
The file names are &lt;br /&gt;
 1_population.txt - contains data on population estimates from 2000-2009&lt;br /&gt;
 2_population.txt - contains data on population estimates from 2010-2016&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: '''new_population.sql''', located in &lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
The population table is joined with the VC table. The joined table is called '''new_merged_population'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*city_state_year to uniquely identify each city in each year&lt;br /&gt;
*Population estimates&lt;br /&gt;
*Year&lt;br /&gt;
*Code from the state code and Fips code&lt;br /&gt;
*State full name&lt;br /&gt;
&lt;br /&gt;
==Income Data==&lt;br /&gt;
&lt;br /&gt;
Raw data was obtained from the US Census Bureau's American Community Survey.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\MSA Income_raw.zip &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The master list with MSAs and principal cities is titled '''list2.xls'''. It is located at:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
This master list includes:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA name&lt;br /&gt;
*Principal City&lt;br /&gt;
*State&lt;br /&gt;
*Place code (city code)&lt;br /&gt;
*State Code&lt;br /&gt;
&lt;br /&gt;
This master list was edited to associate each principal city with a single state. For example, New York is the principal city of the New York-New Jersey MSA, so it was originally associated with the state NY-NJ; '''list''' was edited to put New York with NY. &lt;br /&gt;
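&lt;br /&gt;
This page does not say how the edit was carried out. As one possible way to automate it, the sketch below picks a single state for a principal city from a multi-state MSA label. The function name and the home_state lookup are hypothetical and would have to be supplied by hand or from another source.&lt;br /&gt;

```python
def resolve_state(city, msa_states, home_state):
    """Pick one state for a principal city from an MSA state label such as NY-NJ."""
    candidates = msa_states.split("-")
    if len(candidates) == 1:
        # Single-state MSA: nothing to resolve.
        return candidates[0]
    # Multi-state MSA: use the (hypothetical) city-to-state lookup.
    state = home_state.get(city)
    if state in candidates:
        return state
    # Unknown city or inconsistent lookup: leave it for manual review.
    return None


home_state = {"New York": "NY"}  # made-up lookup for illustration
print(resolve_state("New York", "NY-NJ", home_state))
print(resolve_state("Houston", "TX", home_state))
print(resolve_state("Newark", "NY-NJ", home_state))
```

Returning None for unresolved cities keeps them visible for manual checking instead of guessing a state.&lt;br /&gt;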
&lt;br /&gt;
&lt;br /&gt;
Cleaned Income data files are in&lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID &lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Total Household Income &lt;br /&gt;
&lt;br /&gt;
The MSA-City-State look up file is titled '''msa_city_state_wcode.txt'''. It is located in &lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID &lt;br /&gt;
&lt;br /&gt;
The SQL file that merges income data from ACS (by MSA - Year) with the MSA-City file is titled '''income.sql'''. It is located here:&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
   &lt;br /&gt;
 &lt;br /&gt;
The final income table is in db '''cities''' titled '''merged_income'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Year&lt;br /&gt;
*Total Household Income &lt;br /&gt;
&lt;br /&gt;
The table includes 8780 observations&lt;br /&gt;
&lt;br /&gt;
===Joined income table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\clean data&lt;br /&gt;
The file names are:&lt;br /&gt;
 INC_05.txt - INC_15.txt&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: merged_income.sql&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*Income&lt;br /&gt;
*Year&lt;br /&gt;
*Code from the state code and Fips code&lt;br /&gt;
&lt;br /&gt;
==Employment Data==&lt;br /&gt;
&lt;br /&gt;
Data on employment was obtained from the American Community Survey, US Census Bureau.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\Employment Data by MSA&lt;br /&gt;
Cleaned files are in&lt;br /&gt;
 Z:\Hubs\2017\clean data  &lt;br /&gt;
 &lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Employment rate of individuals 16 years or older&lt;br /&gt;
*Unemployment rate of individuals 16 years or older&lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The SQL file that merges employment data from ACS (by MSA - Year) with the MSA-City file is titled '''Employment.sql'''. &lt;br /&gt;
The file is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The final table is in db '''cities''' titled '''merged_employment'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*Year&lt;br /&gt;
*Employment rate&lt;br /&gt;
*Unemployment rate&lt;br /&gt;
&lt;br /&gt;
===Joined employment table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\clean data&lt;br /&gt;
 &lt;br /&gt;
The file names are:&lt;br /&gt;
 EMP_05.txt - EMP_15.txt &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: '''new_employment.sql''' and it is located in &lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
The final table, which is joined with the VC table, is in db '''cities''' and is titled '''new_merged_employment'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Code from the state code and Fips code&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*city_state_year to uniquely identify each city in each year&lt;br /&gt;
*Employment rates of individuals of 16 years or older&lt;br /&gt;
*Unemployment rates of individuals of 16 years or older&lt;br /&gt;
*Year&lt;br /&gt;
&lt;br /&gt;
==Schooling Data==&lt;br /&gt;
&lt;br /&gt;
Data on schooling was obtained from the American Community Survey, US Census Bureau.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\School Enrollment Data by MSA&lt;br /&gt;
Cleaned files are in&lt;br /&gt;
 Z:\Hubs\2017\clean data&lt;br /&gt;
 &lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Total number of population 3 years and over enrolled in school&lt;br /&gt;
*Percent of population 3 years and over enrolled in public school&lt;br /&gt;
*Percent of population 3 years and over enrolled in private school &lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The SQL file that merges schooling data from ACS (by MSA - Year) with the MSA-City file is titled '''schooling.sql'''. &lt;br /&gt;
The file is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The final table is in db '''cities''' titled '''merged_schooling'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*Year&lt;br /&gt;
*Total&lt;br /&gt;
*Percent_public_schooling&lt;br /&gt;
*Percent_private_schooling&lt;br /&gt;
&lt;br /&gt;
===Joined schooling table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\clean data&lt;br /&gt;
The file names are:&lt;br /&gt;
 SCH_05.txt - SCH_15.txt&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script which joins this table with VC table is: '''new_merged_schooling.sql'''&lt;br /&gt;
The final table is in db '''cities''' titled '''new_merged_schooling'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*city_state_year to uniquely identify each city in each year&lt;br /&gt;
*Total number of school enrollment&lt;br /&gt;
*Percentage enrolled in public schools&lt;br /&gt;
*Percentage enrolled in private schools&lt;br /&gt;
*Year&lt;br /&gt;
*Code from the state code and Fips code&lt;br /&gt;
&lt;br /&gt;
==VC Data==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
  Z:\Hubs\2017\clean data&lt;br /&gt;
  The file name is new_vc_data.txt&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*year&lt;br /&gt;
*seedamtm - seed, amount in millions&lt;br /&gt;
*earlyamtm - early, amount in millions&lt;br /&gt;
*lateramtm - late, amount in millions&lt;br /&gt;
*selamtm - seed early late, amount in millions&lt;br /&gt;
*numseeds - number of seeds&lt;br /&gt;
*numearly &lt;br /&gt;
*numlater&lt;br /&gt;
*numsel&lt;br /&gt;
*numdeals&lt;br /&gt;
*numalive&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Data from 1948-2017&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The table is in db '''cities''' titled '''new_vc'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*city_state_id&lt;br /&gt;
*city_state_year&lt;br /&gt;
*seedamtm&lt;br /&gt;
*earlyamtm&lt;br /&gt;
*lateramtm&lt;br /&gt;
*selamtm&lt;br /&gt;
*numseeds&lt;br /&gt;
*numearly&lt;br /&gt;
*numlater&lt;br /&gt;
*numsel&lt;br /&gt;
*numdeals&lt;br /&gt;
*numalive&lt;br /&gt;
*year&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Final Joined Data set == &lt;br /&gt;
&lt;br /&gt;
The final data set is in file '''final.txt''' and is located here:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*city_state_year - (ID that data is merged on)&lt;br /&gt;
*year&lt;br /&gt;
*seedamtm - Seed Amount&lt;br /&gt;
*earlyamtm - Early Investment Amount&lt;br /&gt;
*lateramtm - Late Investment Amount&lt;br /&gt;
*selamtm - Seed early or late amount&lt;br /&gt;
*numseeds - Number of seed investments &lt;br /&gt;
*numearly - Number of early investments&lt;br /&gt;
*numlater - Number of late investments&lt;br /&gt;
*numsel &lt;br /&gt;
*numdeals - Number of deals (first contracts)&lt;br /&gt;
*numalive - Number of start ups alive&lt;br /&gt;
*income - Income per capita in each city-year&lt;br /&gt;
*sbir_nogrants - Number of SBIR grants&lt;br /&gt;
*sbir_valuegrant - Value of SBIR grants&lt;br /&gt;
*emp - Employment rate in each city-year&lt;br /&gt;
*unemp - Rate of unemployment&lt;br /&gt;
*popestimate - Population estimate of each city-year&lt;br /&gt;
*private - Enrollment in private schools&lt;br /&gt;
*public - Enrollment in public schools&lt;br /&gt;
*total - Total school enrollment&lt;br /&gt;
*numfirms - Number of publicly traded firms&lt;br /&gt;
*randd - R&amp;amp;D expenditure of publicly traded firms&lt;br /&gt;
*revenue - Revenue of publicly traded firms&lt;br /&gt;
*totalassets &lt;br /&gt;
*nsf_nogrants - Number of NSF grants&lt;br /&gt;
*valuegrant - Value of NSF grants&lt;br /&gt;
*nih_nogrants - Number of NIH grants&lt;br /&gt;
*nih_valuegrant - Value of NIH grants&lt;br /&gt;
*noctrials - Number of clinical trials&lt;br /&gt;
&lt;br /&gt;
== Defining Hubs == &lt;br /&gt;
'''Summer 2016''' - Last year a master list of 125 &amp;quot;potential&amp;quot; hubs was used. A scorecard was developed to filter these 125 candidate hubs and determine which should be included in the study sample. This method resulted in a sample size of ~ 30. The master list and the final hubs list are in the file titled '''Hubs Data v2_'16'''. It is located here:&lt;br /&gt;
 Z:\Hubs\2017\hubs_data&lt;br /&gt;
&lt;br /&gt;
'''Summer 2017''' - In order to obtain a larger sample of hubs, we developed 5 criteria which produce a more relaxed definition of hubs than last year's. These are:&lt;br /&gt;
&lt;br /&gt;
*Availability of co-working space&lt;br /&gt;
*Coding classes or tech events&lt;br /&gt;
*Some focus on the tech sector (this is important as our dependent variable is VC funding)&lt;br /&gt;
*Presence of an accelerator&lt;br /&gt;
*Availability of mentorship for members.&lt;br /&gt;
&lt;br /&gt;
We will review the 125 candidate hubs and select those which satisfy a subset or all of these characteristics.&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hira_Farooqi_(Work_Log)&amp;diff=19382</id>
		<title>Hira Farooqi (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hira_Farooqi_(Work_Log)&amp;diff=19382"/>
		<updated>2017-07-18T16:07:07Z</updated>

		<summary type="html">&lt;p&gt;HiraF: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Hira Farooqi]] [[Work Logs]] [[Hira Farooqi (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
'''19/06/2017''' - Abstract added to [[Hubs (Academic Paper)]]&lt;br /&gt;
&lt;br /&gt;
'''27/06/2017''' - Cleaned data files for income at MSA level. Data description added to Income Data section on [[Hubs]]&lt;br /&gt;
&lt;br /&gt;
'''28/06/2017''' - Cleaned data files for income at MSA level. For further information on final income data tables and merging, refer to Income Data section in [[Hubs]]. &lt;br /&gt;
Data on Houston's current entrepreneurial ecosystem, including R&amp;amp;D expenditure by firms, NSF, NIH grants and STEM Grad students compiled and added to the following location&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\Houston&lt;br /&gt;
&lt;br /&gt;
'''10/07/2017''' - Received VC data from Adrian. Cleaned and prepared data on grants (NSF, NIH, Clinical trials), income, publicly traded firms and other variables to merge with VC data. Further details in [[Hubs]].&lt;br /&gt;
&lt;br /&gt;
'''11/07/2017''' - All previous data work on Hubs has been shifted to new page  [[Hubs Summer 2016]]. The new master data set is titled '''final.txt'''. [[Hubs]] includes more information on the columns included in the final data set.&lt;br /&gt;
&lt;br /&gt;
'''12/07/2017''' - In addition to grants information on NSF, NIH and clinical trials, we also decided to add SBIR grants data to our final data set. SBIR are government grants given to small businesses. SBIR data and summary tables are now available in the '''cities''' db. Further information is included in [[Hubs]].&lt;br /&gt;
&lt;br /&gt;
'''13-14/07/2017''' - The term '''hubs''' does not have a definition yet. In order to obtain a sample of hubs for this study, we need to finalize our definition of Hubs.  Currently I am gathering further information on hubs listed in the sample of hubs from last year and deciding what the most suitable definition of hubs could be for our study. Updates posted on [[Hubs]].&lt;br /&gt;
&lt;br /&gt;
'''17/07/2017''' - Last year a scorecard was used in order to determine whether a candidate hub would qualify to be in the study sample. This method produced a sample size of ~ 30. In order to increase the sample size, I have started working with 5 main characteristics that our sample hub should possess. Joe and I are going through the list of 125 candidate hubs from last year to check whether any hub rejected last year fulfills our slightly more &amp;quot;relaxed&amp;quot; definition. For further details see [[Hubs]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hira_Farooqi_(Work_Log)&amp;diff=19365</id>
		<title>Hira Farooqi (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hira_Farooqi_(Work_Log)&amp;diff=19365"/>
		<updated>2017-07-17T18:53:28Z</updated>

		<summary type="html">&lt;p&gt;HiraF: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Hira Farooqi]] [[Work Logs]] [[Hira Farooqi (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
19/06/2017 - Abstract added to [[Hubs (Academic Paper)]]&lt;br /&gt;
&lt;br /&gt;
27/06/2017 - Cleaned data files for income at MSA level. Data description added to Income Data section on [[Hubs]]&lt;br /&gt;
&lt;br /&gt;
28/06/2017 - Cleaned data files for income at MSA level. For further information on final income data tables and merging, refer to Income Data section in [[Hubs]]. &lt;br /&gt;
Data on Houston's current entrepreneurial ecosystem, including R&amp;amp;D expenditure by firms, NSF, NIH grants and STEM Grad students compiled and added to the following location&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\Houston&lt;br /&gt;
&lt;br /&gt;
10/07/2017 - Received VC data from Adrian. Cleaned and prepared data on grants (NSF, NIH, Clinical trials), income, publicly traded firms and other variables to merge with VC data. Further details in [[Hubs]].&lt;br /&gt;
&lt;br /&gt;
11/07/2017 - All previous data work on Hubs has been shifted to new page  [[Hubs Summer 2016]]. The new master data set is titled '''final.txt'''. [[Hubs]] includes more information on the columns included in the final data set.&lt;br /&gt;
&lt;br /&gt;
12/07/2017 - In addition to grants information on NSF, NIH and clinical trials, we also decided to add SBIR grants data to our final data set. SBIR are government grants given to small businesses. SBIR data and summary tables are now available in the '''cities''' db. Further information is included in [[Hubs]].&lt;br /&gt;
&lt;br /&gt;
13-14/07/2017 - The term '''hubs''' does not have a definition yet. In order to obtain a sample of hubs for this study, we need to finalize our definition of Hubs.  Currently I am gathering further information on hubs listed in the sample of hubs from last year and deciding what the most suitable definition of hubs could be for our study. Updates posted on [[Hubs]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hubs&amp;diff=19361</id>
		<title>Hubs</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hubs&amp;diff=19361"/>
		<updated>2017-07-17T17:22:52Z</updated>

		<summary type="html">&lt;p&gt;HiraF: /* Final Joined Data set */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Hubs&lt;br /&gt;
|Has owner=Hira Farooqi,&lt;br /&gt;
|Has keywords=Data&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
The Hubs Research Project is a full-length academic paper analyzing the effectiveness of &amp;quot;hubs&amp;quot;, a component of the entrepreneurship ecosystem, in the advancement and growth of entrepreneurial success in a metropolitan area. It focuses on cities in the United States as the primary unit of analysis.&lt;br /&gt;
&lt;br /&gt;
This page contains information about data used for this research project, including data sources, location of data on RDP and details on data processing.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Information on initial data work done prior to Summer 2017 can be found at [[Hubs Summer 2016]]. &lt;br /&gt;
&lt;br /&gt;
'''Note on joining:''' The city-state-year ID from VC data is used as the master ID for joining datasets. Each table (e.g. income, nih, nsf, sbir, compustat) is first joined with the VC data on city-state-year ID and then the resulting tables are all joined together in the final table.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Data by zip code===&lt;br /&gt;
*Population data, 2000-2016 - US Census Bureau (E:\McNair\Hubs\summer 2017)&lt;br /&gt;
https://www2.census.gov/programs-surveys/popest/datasets/&lt;br /&gt;
*Income data, 1998-2014 - The Internal Revenue Service (E:\McNair\Hubs\summer 2017)&lt;br /&gt;
https://www.irs.gov/uac/about-irs&lt;br /&gt;
*DCI index, to assess the economic well-being of communities&lt;br /&gt;
http://eig.org/dci/interactive-maps/u-s-zip-codes&lt;br /&gt;
*R&amp;amp;D Expenses, 1980-2016 - Wharton Research Data Services (E:\McNair\Hubs\summer 2017)&lt;br /&gt;
*Zipcode look-up table obtained from https://www.unitedstateszipcodes.org/zip-code-database/. It's available in (E:\McNair\Hubs\summer 2017).&lt;br /&gt;
&lt;br /&gt;
== Data by MSA ==&lt;br /&gt;
&lt;br /&gt;
We have principal cities of MSAs from the census:&lt;br /&gt;
https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html&lt;br /&gt;
&lt;br /&gt;
We might be able to go City -&amp;gt; FIPS place code -&amp;gt; MSA?&lt;br /&gt;
&lt;br /&gt;
Cities and their FIPS codes (which don't perfectly correspond) are available from https://www.census.gov/geo/reference/codes/place.html&lt;br /&gt;
&lt;br /&gt;
The Census claims to provide city to MSA here: https://www.census.gov/geo/maps-data/data/ua_rel_download.html&lt;br /&gt;
However, there is only CBSA!&lt;br /&gt;
&lt;br /&gt;
This might do it: https://www2.census.gov/geo/pdfs/maps-data/data/rel/explanation_ua_cbsa_rel_10.pdf&lt;br /&gt;
We can maybe track city to principal city to MSA&lt;br /&gt;
&lt;br /&gt;
==COMPUSTAT Data==&lt;br /&gt;
The data set includes information on publicly traded firms in the US. It was obtained from the Wharton Research Data Services (https://wrds-web.wharton.upenn.edu/wrds/index.cfm?). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Raw Data is in:&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The source file is RandDExpenditures.txt. It contains:&lt;br /&gt;
*Data from 1980-2017 (July). &lt;br /&gt;
*427799 records&lt;br /&gt;
*Fields include:&lt;br /&gt;
**R&amp;amp;D Expenditure&lt;br /&gt;
**Address (inc. city, zip, state)&lt;br /&gt;
**Revenue of firms&lt;br /&gt;
 &lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: COMPUSTAT.sql&lt;br /&gt;
&lt;br /&gt;
Output file is COMPUSTATSummary.txt. It contains:&lt;br /&gt;
*Variables: City, year, No.public firms, sum R&amp;amp;D, sum Sales, sum total assets&lt;br /&gt;
*1979-2016&lt;br /&gt;
*4440 cities&lt;br /&gt;
&lt;br /&gt;
It is located in&lt;br /&gt;
 Z:\Hubs\2017\Output_Files&lt;br /&gt;
&lt;br /&gt;
==NSF Data==&lt;br /&gt;
Data is in:&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: nsf_2017.sql&lt;br /&gt;
&lt;br /&gt;
The source files are: nsf2017.txt, copied from table '''nsf''', and nsf_institution copied from table '''nsf_grants_institution''' from the biotech db.&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*Award ID&lt;br /&gt;
*Award Institution&lt;br /&gt;
*Award Effective date&lt;br /&gt;
*Institution city&lt;br /&gt;
*Award Value&lt;br /&gt;
*Organization state code&lt;br /&gt;
From 1900 - 2017&lt;br /&gt;
&lt;br /&gt;
Output file is nsfSummary.txt. It contains:&lt;br /&gt;
*Variables: City, State code, year, nsf_nogrants, nsf_valuegrant &lt;br /&gt;
*1900-2017&lt;br /&gt;
&lt;br /&gt;
===Joined NSF table===&lt;br /&gt;
The NSF table joined with the VC table is in db '''cities'''. The table is named '''merged_nsf'''.&lt;br /&gt;
Missing values of nogrants and valuegrant for years 1990-2017 are set to 0.&lt;br /&gt;
The SQL script is in&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
==NIH Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: nih2017.sql&lt;br /&gt;
The source files are: &lt;br /&gt;
*nih_1986_2001.csv&lt;br /&gt;
*nih_2002_2012.txt&lt;br /&gt;
*nih_2013_2015&lt;br /&gt;
located in E:\McNair\Projects\Federal Grant Data\NIH&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The script that cleans NIH data and generates the summary table is titled '''nihSummary'''. It is located here:&lt;br /&gt;
&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
This table includes&lt;br /&gt;
*year&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*country&lt;br /&gt;
*nogrants (number of grants)&lt;br /&gt;
*valuegrant&lt;br /&gt;
*city_state &lt;br /&gt;
&lt;br /&gt;
*Data from 1986-2015&lt;br /&gt;
&lt;br /&gt;
===Joined NIH table===&lt;br /&gt;
The NIH table joined with the VC table is in db '''cities'''. The table is named '''merged_nih'''.&lt;br /&gt;
Missing values of nih_valuegrant and nih_nogrants for the years 1986-2015 are set to 0.&lt;br /&gt;
The SQL script is in&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
==Clinical Trials Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: ctrials.sql&lt;br /&gt;
The source file is: &lt;br /&gt;
&lt;br /&gt;
*medclinical.txt&lt;br /&gt;
&lt;br /&gt;
located in Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
*Data from 1999-2017&lt;br /&gt;
&lt;br /&gt;
===Joined clinical trials table===&lt;br /&gt;
&lt;br /&gt;
The file which contains the number of trials in each city and year is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
The file is in:&lt;br /&gt;
 Z:\Hubs\2017\clean data&lt;br /&gt;
The name of the file is:&lt;br /&gt;
  ctrialsSummary.txt&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*city&lt;br /&gt;
*year&lt;br /&gt;
*city_state_year&lt;br /&gt;
*noctrials - number of trials&lt;br /&gt;
&lt;br /&gt;
The ctrials table is joined with the VC table.&lt;br /&gt;
The SQL script that performs the join is '''new_ctrials.sql''', located in&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
The name of the joined table is '''new_merged_ctrials'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*city_state_id&lt;br /&gt;
*city_state_year&lt;br /&gt;
*year&lt;br /&gt;
*noctrials&lt;br /&gt;
*seedamtm&lt;br /&gt;
*earlyamtm&lt;br /&gt;
*lateramtm&lt;br /&gt;
*selamtm&lt;br /&gt;
*numseeds&lt;br /&gt;
*numearly&lt;br /&gt;
*numlater&lt;br /&gt;
*numsel&lt;br /&gt;
&lt;br /&gt;
Missing values of noctrials for the years 1999-2017 are set to 0.&lt;br /&gt;
&lt;br /&gt;
==Population Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: '''population.sql'''&lt;br /&gt;
The source files are: &lt;br /&gt;
*pop2000_2009.xlsx&lt;br /&gt;
*pop2010_2016.xlsx&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*State&lt;br /&gt;
*City name	&lt;br /&gt;
*Year	&lt;br /&gt;
*Population Estimates&lt;br /&gt;
&lt;br /&gt;
Data from 2000-2016&lt;br /&gt;
&lt;br /&gt;
===Joined population table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\2017\clean data&lt;br /&gt;
The file names are &lt;br /&gt;
 1_population.txt - contains data on population estimates from 2000-2009&lt;br /&gt;
 2_population.txt - contains data on population estimates from 2010-2016&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: '''new_population.sql''', located in &lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
The population table is joined with the VC table. The resulting table is called '''new_merged_population'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*city_state_year to uniquely identify each city in each year&lt;br /&gt;
*Population estimates&lt;br /&gt;
*Year&lt;br /&gt;
*Code from the state code and Fips code&lt;br /&gt;
*State full name&lt;br /&gt;
&lt;br /&gt;
==Income Data==&lt;br /&gt;
&lt;br /&gt;
Raw data was obtained from the US Census Bureau's American Community Survey.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\MSA Income_raw.zip &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The master list with MSAs and principal cities is titled '''list2.xls'''. It is located at:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
This master list includes:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA name&lt;br /&gt;
*Principal City&lt;br /&gt;
*State&lt;br /&gt;
*Place code (city code)&lt;br /&gt;
*State Code&lt;br /&gt;
&lt;br /&gt;
This master list was edited to associate each principal city with a single state. For example, New York is the principal city of the New York-New Jersey MSA and was therefore listed with state NY-NJ; '''list''' was edited to put New York with NY.&lt;br /&gt;
&lt;br /&gt;
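The master-list edit described above can be sketched as follows. This is only an illustration in Python: the page does not say how the edit was made, so the override table, the function name and the sample rows are all hypothetical.&lt;br /&gt;

```python
# Hypothetical rows from the master list: (principal_city, state_label).
master = [
    ("New York", "NY-NJ"),
    ("Houston", "TX"),
]

# Assumed hand-made overrides for cities whose MSA spans several states.
overrides = {("New York", "NY-NJ"): "NY"}


def unique_state(city, state_label):
    """Return a single state for a principal city."""
    if "-" not in state_label:
        # Single-state MSA: the label is already the state.
        return state_label
    # Multi-state labels must be resolved explicitly; a missing
    # override raises KeyError instead of guessing a state.
    return overrides[(city, state_label)]


cleaned = [(city, unique_state(city, label)) for city, label in master]
print(cleaned)
```

Raising an error for an unlisted multi-state city, rather than picking the first state in the label, keeps any remaining ambiguous rows visible.&lt;br /&gt;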
&lt;br /&gt;
Cleaned Income data files are in&lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID &lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Total Household Income &lt;br /&gt;
&lt;br /&gt;
The MSA-City-State look up file is titled '''msa_city_state_wcode.txt'''. It is located in &lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID &lt;br /&gt;
&lt;br /&gt;
The SQL file that merges income data from ACS (by MSA - Year) with the MSA-City file is titled '''income.sql'''. It is located here:&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
   &lt;br /&gt;
 &lt;br /&gt;
The final income table is in db '''cities''' titled '''merged_income'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Year&lt;br /&gt;
*Total Household Income &lt;br /&gt;
&lt;br /&gt;
The table includes 8780 observations&lt;br /&gt;
&lt;br /&gt;
===Joined income table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\clean data&lt;br /&gt;
The file names are:&lt;br /&gt;
 INC_05.txt - INC_15.txt&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: merged_income.sql&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*Income&lt;br /&gt;
*Year&lt;br /&gt;
*Code from the state code and Fips code&lt;br /&gt;
&lt;br /&gt;
==Employment Data==&lt;br /&gt;
&lt;br /&gt;
Data on employment was obtained from the American Community Survey, US Census Bureau.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\Employment Data by MSA&lt;br /&gt;
Cleaned files are in&lt;br /&gt;
 Z:\Hubs\2017\clean data  &lt;br /&gt;
 &lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Employment rate of individuals 16 years or older&lt;br /&gt;
*Unemployment rate of individuals 16 years or older&lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The SQL file that merges employment data from ACS (by MSA - Year) with the MSA-City file is titled '''Employment.sql'''. &lt;br /&gt;
The file is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The final table is in db '''cities''' titled '''merged_employment'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*Year&lt;br /&gt;
*Employment rate&lt;br /&gt;
*Unemployment rate&lt;br /&gt;
&lt;br /&gt;
===Joined employment table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\clean data&lt;br /&gt;
 &lt;br /&gt;
The file names are:&lt;br /&gt;
 EMP_05.txt - EMP_15.txt &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: '''new_employment.sql''', located in&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
The final table, joined with the VC table, is in db '''cities''' and is titled '''new_merged_employment'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Code from the state code and Fips code&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*city_state_year to uniquely identify each city in each year&lt;br /&gt;
*Employment rates of individuals of 16 years or older&lt;br /&gt;
*Unemployment rates of individuals of 16 years or older&lt;br /&gt;
*Year&lt;br /&gt;
&lt;br /&gt;
==Schooling Data==&lt;br /&gt;
&lt;br /&gt;
Data on schooling was obtained from the American Community Survey, US Census Bureau.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\School Enrollment Data by MSA&lt;br /&gt;
Cleaned files are in&lt;br /&gt;
 Z:\Hubs\2017\clean data&lt;br /&gt;
 &lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Total number of population 3 years and over enrolled in school&lt;br /&gt;
*Percent of population 3 years and over enrolled in public school&lt;br /&gt;
*Percent of population 3 years and over enrolled in private school &lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The SQL file that merges schooling data from ACS (by MSA - Year) with the MSA-City file is titled '''schooling.sql'''. &lt;br /&gt;
The file is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The final table is in db '''cities''' titled '''merged_schooling'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*Year&lt;br /&gt;
*Total&lt;br /&gt;
*Percent_public_schooling&lt;br /&gt;
*Percent_private_schooling&lt;br /&gt;
&lt;br /&gt;
===Joined schooling table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\clean data&lt;br /&gt;
The file names are:&lt;br /&gt;
 SCH_05.txt - SCH_15.txt&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script which joins this table with VC table is: '''new_merged_schooling.sql'''&lt;br /&gt;
The final table is in db '''cities''' titled '''new_merged_schooling'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*city_state_year to uniquely identify each city in each year&lt;br /&gt;
*Total number of school enrollment&lt;br /&gt;
*Percentage enrolled in public schools&lt;br /&gt;
*Percentage enrolled in private schools&lt;br /&gt;
*Year&lt;br /&gt;
*Code from the state code and Fips code&lt;br /&gt;
&lt;br /&gt;
==VC Data==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
  Z:\Hubs\2017\clean data&lt;br /&gt;
  The file name is new_vc_data.txt&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*year&lt;br /&gt;
*seedamtm - seed, amount in millions&lt;br /&gt;
*earlyamtm - early, amount in millions&lt;br /&gt;
*lateramtm - late, amount in millions&lt;br /&gt;
*selamtm - seed early late, amount in millions&lt;br /&gt;
*numseeds - number of seeds&lt;br /&gt;
*numearly &lt;br /&gt;
*numlater&lt;br /&gt;
*numsel&lt;br /&gt;
*numdeals&lt;br /&gt;
*numalive&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Data from 1948-2017&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The table is in db '''cities''' titled '''new_vc'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*city_state_id&lt;br /&gt;
*city_state_year&lt;br /&gt;
*seedamtm&lt;br /&gt;
*earlyamtm&lt;br /&gt;
*lateramtm&lt;br /&gt;
*selamtm&lt;br /&gt;
*numseeds&lt;br /&gt;
*numearly&lt;br /&gt;
*numlater&lt;br /&gt;
*numsel&lt;br /&gt;
*numdeals&lt;br /&gt;
*numalive&lt;br /&gt;
*year&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Final Joined Data set == &lt;br /&gt;
&lt;br /&gt;
The final data set is in file '''final.txt''' and is located here:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*city_state_year - (ID that data is merged on)&lt;br /&gt;
*year&lt;br /&gt;
*seedamtm - Seed Amount&lt;br /&gt;
*earlyamtm - Early Investment Amount&lt;br /&gt;
*lateramtm - Late Investment Amount&lt;br /&gt;
*selamtm - Seed early or late amount&lt;br /&gt;
*numseeds - Number of seed investments &lt;br /&gt;
*numearly - Number of early investments&lt;br /&gt;
*numlater - Number of late investments&lt;br /&gt;
*numsel &lt;br /&gt;
*numdeals - Number of deals (first contracts)&lt;br /&gt;
*numalive - Number of start ups alive&lt;br /&gt;
*income - Income per capita in each city-year&lt;br /&gt;
*sbir_nogrants - Number of SBIR grants&lt;br /&gt;
*sbir_valuegrant - Value of SBIR grants&lt;br /&gt;
*emp - Employment stats of each city-year&lt;br /&gt;
*unemp - Rate of unemployment&lt;br /&gt;
*popestimate - Population estimate of each city-year&lt;br /&gt;
*private - Enrollment in private schools&lt;br /&gt;
*public - Enrollment in public schools&lt;br /&gt;
*total - Total school enrollment&lt;br /&gt;
*numfirms - Number of publicly traded firms&lt;br /&gt;
*randd - R&amp;amp;D expenditure of publicly traded firms&lt;br /&gt;
*revenue - Revenue of publicly traded firms&lt;br /&gt;
*totalassets &lt;br /&gt;
*nsf_nogrants - Number of NSF grants&lt;br /&gt;
*valuegrant - Value of NSF grants&lt;br /&gt;
*nih_nogrants - Number of NIH grants&lt;br /&gt;
*nih_valuegrant - Value of NIH grants&lt;br /&gt;
*noctrials - Number of clinical trials&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hubs&amp;diff=19360</id>
		<title>Hubs</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hubs&amp;diff=19360"/>
		<updated>2017-07-17T17:21:09Z</updated>

		<summary type="html">&lt;p&gt;HiraF: /* VC Data */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Hubs&lt;br /&gt;
|Has owner=Hira Farooqi,&lt;br /&gt;
|Has keywords=Data&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
The Hubs Research Project is a full-length academic paper analyzing the effectiveness of &amp;quot;hubs&amp;quot;, a component of the entrepreneurship ecosystem, in the advancement and growth of entrepreneurial success in a metropolitan area. It focuses on cities in the United States as the primary unit of analysis.&lt;br /&gt;
&lt;br /&gt;
This page contains information about data used for this research project, including data sources, location of data on RDP and details on data processing.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Information on initial data work done prior to Summer 2017 can be found at [[Hubs Summer 2016]]. &lt;br /&gt;
&lt;br /&gt;
'''Note on joining:''' The city-state-year ID from VC data is used as the master ID for joining datasets. Each table (e.g. income, nih, nsf, sbir, compustat) is first joined with the VC data on city-state-year ID and then the resulting tables are all joined together in the final table.&lt;br /&gt;
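&lt;br /&gt;
The two-step join described in the note above can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module, not the project's actual SQL scripts. The toy rows, the ID format and the use of LEFT JOIN (which keeps every VC city-year) are assumptions; the names merged_nsf, merged_nih, city_state_year, numdeals, nsf_nogrants and nih_nogrants are taken from this page.&lt;br /&gt;

```python
import sqlite3

# Toy in-memory stand-ins for the VC table and two source tables.
# The rows and the ID format are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vc  (city_state_year TEXT PRIMARY KEY, numdeals INTEGER);
    CREATE TABLE nsf (city_state_year TEXT PRIMARY KEY, nsf_nogrants INTEGER);
    CREATE TABLE nih (city_state_year TEXT PRIMARY KEY, nih_nogrants INTEGER);
    INSERT INTO vc  VALUES ('AUSTIN_TX_2010', 30), ('HOUSTON_TX_2010', 12);
    INSERT INTO nsf VALUES ('HOUSTON_TX_2010', 5);
    INSERT INTO nih VALUES ('AUSTIN_TX_2010', 7);
""")

# Step 1: join each source table to the VC table on the master ID.
# LEFT JOIN (an assumption) keeps VC city-years that have no match.
conn.executescript("""
    CREATE TABLE merged_nsf AS
        SELECT vc.city_state_year AS city_state_year,
               nsf.nsf_nogrants AS nsf_nogrants
        FROM vc LEFT JOIN nsf ON vc.city_state_year = nsf.city_state_year;
    CREATE TABLE merged_nih AS
        SELECT vc.city_state_year AS city_state_year,
               nih.nih_nogrants AS nih_nogrants
        FROM vc LEFT JOIN nih ON vc.city_state_year = nih.city_state_year;
""")

# Step 2: join the per-source merged tables together on the same ID.
rows = conn.execute("""
    SELECT vc.city_state_year, vc.numdeals,
           merged_nsf.nsf_nogrants, merged_nih.nih_nogrants
    FROM vc
    JOIN merged_nsf ON vc.city_state_year = merged_nsf.city_state_year
    JOIN merged_nih ON vc.city_state_year = merged_nih.city_state_year
    ORDER BY vc.city_state_year
""").fetchall()

for row in rows:
    print(row)
```

Because every merged table carries exactly the VC set of IDs, the final join neither drops nor duplicates city-years; unmatched values come through as NULL (None in Python).&lt;br /&gt;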
&lt;br /&gt;
&lt;br /&gt;
===Data by zip code===&lt;br /&gt;
*Population data, 2000-2016 - US Census Bureau (E:\McNair\Hubs\summer 2017)&lt;br /&gt;
https://www2.census.gov/programs-surveys/popest/datasets/&lt;br /&gt;
*Income data, 1998-2014 - The Internal Revenue Service (E:\McNair\Hubs\summer 2017)&lt;br /&gt;
https://www.irs.gov/uac/about-irs&lt;br /&gt;
*DCI index, to assess the economic well-being of communities&lt;br /&gt;
http://eig.org/dci/interactive-maps/u-s-zip-codes&lt;br /&gt;
*R&amp;amp;D Expenses, 1980-2016 - Wharton Research Data Services (E:\McNair\Hubs\summer 2017)&lt;br /&gt;
*Zipcode look-up table obtained from https://www.unitedstateszipcodes.org/zip-code-database/. It's available in (E:\McNair\Hubs\summer 2017).&lt;br /&gt;
&lt;br /&gt;
== Data by MSA ==&lt;br /&gt;
&lt;br /&gt;
We have principal cities of MSAs from the census:&lt;br /&gt;
https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html&lt;br /&gt;
&lt;br /&gt;
We might be able to go City -&amp;gt; FIPS place code -&amp;gt; MSA?&lt;br /&gt;
&lt;br /&gt;
Cities and their FIPS codes (which don't perfectly correspond) are available from https://www.census.gov/geo/reference/codes/place.html&lt;br /&gt;
&lt;br /&gt;
The Census claims to provide city to MSA here: https://www.census.gov/geo/maps-data/data/ua_rel_download.html&lt;br /&gt;
However, there is only CBSA!&lt;br /&gt;
&lt;br /&gt;
This might do it: https://www2.census.gov/geo/pdfs/maps-data/data/rel/explanation_ua_cbsa_rel_10.pdf&lt;br /&gt;
We can maybe track city to principal city to MSA&lt;br /&gt;
&lt;br /&gt;
==COMPUSTAT Data==&lt;br /&gt;
The data set includes information on publicly traded firms in the US. It was obtained from the Wharton Research Data Services (https://wrds-web.wharton.upenn.edu/wrds/index.cfm?). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Raw Data is in:&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The source file is RandDExpenditures.txt. It contains:&lt;br /&gt;
*Data from 1980-2017 (July).&lt;br /&gt;
*427799 records&lt;br /&gt;
*Fields include:&lt;br /&gt;
**R&amp;amp;D Expenditure&lt;br /&gt;
**Address (inc. city, zip, state)&lt;br /&gt;
**Revenue of firms&lt;br /&gt;
 &lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: COMPUSTAT.sql&lt;br /&gt;
&lt;br /&gt;
Output file is COMPUSTATSummary.txt. It contains:&lt;br /&gt;
*Variables: City, year, No.public firms, sum R&amp;amp;D, sum Sales, sum total assets&lt;br /&gt;
*1979-2016&lt;br /&gt;
*4440 cities&lt;br /&gt;
&lt;br /&gt;
It is located in&lt;br /&gt;
 Z:\Hubs\2017\Output_Files&lt;br /&gt;
&lt;br /&gt;
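The city-year summary described in this section can be sketched as a GROUP BY aggregation. The example below uses Python's built-in sqlite3 module with made-up rows; the table name firms and its column names are hypothetical, and the real logic lives in COMPUSTAT.sql, which is not reproduced here.&lt;br /&gt;

```python
import sqlite3

# Toy firm-level records; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE firms (
        firm_id TEXT, city TEXT, year INTEGER,
        randd REAL, sales REAL, totalassets REAL
    );
    INSERT INTO firms VALUES
        ('F1', 'Houston', 2010, 1.0, 10.0, 100.0),
        ('F2', 'Houston', 2010, 2.0, 20.0, 200.0),
        ('F3', 'Austin',  2010, 0.5,  5.0,  50.0);
""")

# One output row per city-year: firm count plus summed financials.
summary = conn.execute("""
    SELECT city, year,
           COUNT(DISTINCT firm_id) AS numfirms,
           SUM(randd) AS randd,
           SUM(sales) AS revenue,
           SUM(totalassets) AS totalassets
    FROM firms
    GROUP BY city, year
    ORDER BY city, year
""").fetchall()

for row in summary:
    print(row)
```
&lt;br /&gt;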
==NSF Data==&lt;br /&gt;
Data is in:&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: nsf_2017.sql&lt;br /&gt;
&lt;br /&gt;
The source files are: nsf2017.txt, copied from table '''nsf''', and nsf_institution copied from table '''nsf_grants_institution''' from the biotech db.&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*Award ID&lt;br /&gt;
*Award Institution&lt;br /&gt;
*Award Effective date&lt;br /&gt;
*Institution city&lt;br /&gt;
*Award Value&lt;br /&gt;
*Organization state code&lt;br /&gt;
From 1900 - 2017&lt;br /&gt;
&lt;br /&gt;
Output file is nsfSummary.txt. It contains:&lt;br /&gt;
*Variables: City, State code, year, nsf_nogrants, nsf_valuegrant&lt;br /&gt;
*1900-2017&lt;br /&gt;
&lt;br /&gt;
===Joined NSF table===&lt;br /&gt;
The NSF table joined with the VC table is in db '''cities'''. The table is named '''merged_nsf'''.&lt;br /&gt;
Missing values of nogrants and valuegrant for the years 1990-2017 are set to 0.&lt;br /&gt;
The SQL script is in&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
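The zero-fill rule for the joined NSF table can be sketched like this. It is an illustration with Python's built-in sqlite3 module and made-up rows, not the project's script; the presence of a year column in merged_nsf and the ID format are assumptions.&lt;br /&gt;

```python
import sqlite3

# Toy version of merged_nsf; rows with NULLs are VC city-years
# that had no matching NSF record.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE merged_nsf (
        city_state_year TEXT PRIMARY KEY,
        year INTEGER,
        nogrants INTEGER,
        valuegrant REAL
    );
    INSERT INTO merged_nsf VALUES
        ('AUSTIN_TX_1985', 1985, NULL, NULL),
        ('AUSTIN_TX_1995', 1995, NULL, NULL),
        ('AUSTIN_TX_2000', 2000, 3, 1.5);
""")

# Replace NULLs with 0 only inside the 1990-2017 window; rows outside
# the window keep their NULLs, and observed values are left untouched.
conn.execute("""
    UPDATE merged_nsf
    SET nogrants = COALESCE(nogrants, 0),
        valuegrant = COALESCE(valuegrant, 0)
    WHERE year BETWEEN 1990 AND 2017
""")

filled = conn.execute(
    "SELECT year, nogrants, valuegrant FROM merged_nsf ORDER BY year"
).fetchall()
print(filled)
```

Restricting the update to the stated window keeps a true zero (no grants recorded in a covered year) distinct from a year with no coverage at all.&lt;br /&gt;
&lt;br /&gt;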
==NIH Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: nih2017.sql&lt;br /&gt;
The source files are: &lt;br /&gt;
*nih_1986_2001.csv&lt;br /&gt;
*nih_2002_2012.txt&lt;br /&gt;
*nih_2013_2015&lt;br /&gt;
located in E:\McNair\Projects\Federal Grant Data\NIH&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The script that cleans NIH data and generates the summary table is titled '''nihSummary'''. It is located here:&lt;br /&gt;
&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
This table includes&lt;br /&gt;
*year&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*country&lt;br /&gt;
*nogrants (number of grants)&lt;br /&gt;
*valuegrant&lt;br /&gt;
*city_state &lt;br /&gt;
&lt;br /&gt;
*Data from 1986-2015&lt;br /&gt;
&lt;br /&gt;
===Joined NIH table===&lt;br /&gt;
The NIH table joined with the VC table is in db '''cities'''. The table is named '''merged_nih'''.&lt;br /&gt;
Missing values of nih_valuegrant and nih_nogrants for the years 1986-2015 are set to 0.&lt;br /&gt;
The SQL script is in&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
==Clinical Trials Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: ctrials.sql&lt;br /&gt;
The source file is: &lt;br /&gt;
&lt;br /&gt;
*medclinical.txt&lt;br /&gt;
&lt;br /&gt;
located in Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
*Data from 1999-2017&lt;br /&gt;
&lt;br /&gt;
===Joined clinical trials table===&lt;br /&gt;
&lt;br /&gt;
The file which contains the number of trials in each city and year is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
The file is in:&lt;br /&gt;
 Z:\Hubs\2017\clean data&lt;br /&gt;
The name of the file is:&lt;br /&gt;
  ctrialsSummary.txt&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*city&lt;br /&gt;
*year&lt;br /&gt;
*city_state_year&lt;br /&gt;
*noctrials - number of trials&lt;br /&gt;
&lt;br /&gt;
The ctrials table is joined with the VC table.&lt;br /&gt;
The SQL script that performs the join is '''new_ctrials.sql''', located in&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
The name of the joined table is '''new_merged_ctrials'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*city_state_id&lt;br /&gt;
*city_state_year&lt;br /&gt;
*year&lt;br /&gt;
*noctrials&lt;br /&gt;
*seedamtm&lt;br /&gt;
*earlyamtm&lt;br /&gt;
*lateramtm&lt;br /&gt;
*selamtm&lt;br /&gt;
*numseeds&lt;br /&gt;
*numearly&lt;br /&gt;
*numlater&lt;br /&gt;
*numsel&lt;br /&gt;
&lt;br /&gt;
Missing values of noctrials for the years 1999-2017 are set to 0.&lt;br /&gt;
&lt;br /&gt;
==Population Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: '''population.sql'''&lt;br /&gt;
The source files are: &lt;br /&gt;
*pop2000_2009.xlsx&lt;br /&gt;
*pop2010_2016.xlsx&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*State&lt;br /&gt;
*City name	&lt;br /&gt;
*Year	&lt;br /&gt;
*Population Estimates&lt;br /&gt;
&lt;br /&gt;
Data from 2000-2016&lt;br /&gt;
&lt;br /&gt;
===Joined population table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\2017\clean data&lt;br /&gt;
The file names are &lt;br /&gt;
 1_population.txt - contains data on population estimates from 2000-2009&lt;br /&gt;
 2_population.txt - contains data on population estimates from 2010-2016&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: '''new_population.sql''', located in &lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
The population table is joined with the VC table. The resulting table is called '''new_merged_population'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*city_state_year to uniquely identify each city in each year&lt;br /&gt;
*Population estimates&lt;br /&gt;
*Year&lt;br /&gt;
*Code from the state code and Fips code&lt;br /&gt;
*State full name&lt;br /&gt;
&lt;br /&gt;
==Income Data==&lt;br /&gt;
&lt;br /&gt;
Raw data was obtained from the US Census Bureau's American Community Survey.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\MSA Income_raw.zip &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The master list with MSAs and principal cities is titled '''list2.xls'''. It is located at:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
This master list includes:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA name&lt;br /&gt;
*Principal City&lt;br /&gt;
*State&lt;br /&gt;
*Place code (city code)&lt;br /&gt;
*State Code&lt;br /&gt;
&lt;br /&gt;
This master list was edited to associate each principal city with a single state. For example, New York is the principal city of the New York-New Jersey MSA and was therefore listed with state NY-NJ; '''list''' was edited to put New York with NY.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Cleaned Income data files are in&lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID &lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Total Household Income &lt;br /&gt;
&lt;br /&gt;
The MSA-City-State look up file is titled '''msa_city_state_wcode.txt'''. It is located in &lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID &lt;br /&gt;
&lt;br /&gt;
The SQL file that merges income data from ACS (by MSA - Year) with the MSA-City file is titled '''income.sql'''. It is located here:&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
   &lt;br /&gt;
 &lt;br /&gt;
The final income table is in db '''cities''' titled '''merged_income'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Year&lt;br /&gt;
*Total Household Income &lt;br /&gt;
&lt;br /&gt;
The table includes 8780 observations&lt;br /&gt;
&lt;br /&gt;
===Joined income table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\clean data&lt;br /&gt;
The file names are:&lt;br /&gt;
 INC_05.txt - INC_15.txt&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: merged_income.sql&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*Income&lt;br /&gt;
*Year&lt;br /&gt;
*Code from the state code and Fips code&lt;br /&gt;
&lt;br /&gt;
==Employment Data==&lt;br /&gt;
&lt;br /&gt;
Data on employment was obtained from the American Community Survey, US Census Bureau.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\Employment Data by MSA&lt;br /&gt;
Cleaned files are in&lt;br /&gt;
 Z:\Hubs\2017\clean data  &lt;br /&gt;
 &lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Employment rate of individuals 16 years or older&lt;br /&gt;
*Unemployment rate of individuals 16 years or older&lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The SQL file that merges employment data from ACS (by MSA - Year) with the MSA-City file is titled '''Employment.sql'''. &lt;br /&gt;
The file is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The final table is in db '''cities''' titled '''merged_employment'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*Year&lt;br /&gt;
*Employment rate&lt;br /&gt;
*Unemployment rate&lt;br /&gt;
&lt;br /&gt;
===Joined employment table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\clean data&lt;br /&gt;
 &lt;br /&gt;
The file names are:&lt;br /&gt;
 EMP_05.txt - EMP_15.txt &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: '''new_employment.sql''', located in&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
The final table, joined with the VC table, is in db '''cities''' and is titled '''new_merged_employment'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Code from the state code and Fips code&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*city_state_year to uniquely identify each city in each year&lt;br /&gt;
*Employment rates of individuals of 16 years or older&lt;br /&gt;
*Unemployment rates of individuals of 16 years or older&lt;br /&gt;
*Year&lt;br /&gt;
&lt;br /&gt;
==Schooling Data==&lt;br /&gt;
&lt;br /&gt;
Data on schooling was obtained from the American Community Survey, US Census Bureau.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\School Enrollment Data by MSA&lt;br /&gt;
Cleaned files are in&lt;br /&gt;
 Z:\Hubs\2017\clean data&lt;br /&gt;
 &lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Total number of population 3 years and over enrolled in school&lt;br /&gt;
*Percent of population 3 years and over enrolled in public school&lt;br /&gt;
*Percent of population 3 years and over enrolled in private school &lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The SQL file that merges schooling data from ACS (by MSA - Year) with the MSA-City file is titled '''schooling.sql'''. &lt;br /&gt;
The file is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The final table is in db '''cities''' titled '''merged_schooling'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*Year&lt;br /&gt;
*Total&lt;br /&gt;
*Percent_public_schooling&lt;br /&gt;
*Percent_private_schooling&lt;br /&gt;
&lt;br /&gt;
===Joined schooling table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\clean data&lt;br /&gt;
The file names are:&lt;br /&gt;
 SCH_05.txt - SCH_15.txt&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script which joins this table with VC table is: '''new_merged_schooling.sql'''&lt;br /&gt;
The final table is in db '''cities''' titled '''new_merged_schooling'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*city_state_year to uniquely identify each city in each year&lt;br /&gt;
*Total number of school enrollment&lt;br /&gt;
*Percentage enrolled in public schools&lt;br /&gt;
*Percentage enrolled in private schools&lt;br /&gt;
*Year&lt;br /&gt;
*Code from the state code and Fips code&lt;br /&gt;
&lt;br /&gt;
==VC Data==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
  Z:\Hubs\2017\clean data&lt;br /&gt;
  The file name is new_vc_data.txt&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*year&lt;br /&gt;
*seedamtm - seed, amount in millions&lt;br /&gt;
*earlyamtm - early, amount in millions&lt;br /&gt;
*lateramtm - late, amount in millions&lt;br /&gt;
*selamtm - seed early late, amount in millions&lt;br /&gt;
*numseeds - number of seeds&lt;br /&gt;
*numearly &lt;br /&gt;
*numlater&lt;br /&gt;
*numsel&lt;br /&gt;
*numdeals&lt;br /&gt;
*numalive&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Data from 1948-2017&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The table is in db '''cities''' titled '''new_vc'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*city_state_id&lt;br /&gt;
*city_state_year&lt;br /&gt;
*seedamtm&lt;br /&gt;
*earlyamtm&lt;br /&gt;
*lateramtm&lt;br /&gt;
*selamtm&lt;br /&gt;
*numseeds&lt;br /&gt;
*numearly&lt;br /&gt;
*numlater&lt;br /&gt;
*numsel&lt;br /&gt;
*numdeals&lt;br /&gt;
*numalive&lt;br /&gt;
*year&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Final Joined Data set == &lt;br /&gt;
&lt;br /&gt;
The final data set is in file '''final.txt''' and is located here:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*city_state_year - (ID that data is merged on)&lt;br /&gt;
*year&lt;br /&gt;
*seedamtm - Seed Amount&lt;br /&gt;
*earlyamtm - Early Investment Amount&lt;br /&gt;
*lateramtm - Late Investment Amount&lt;br /&gt;
*selamtm - Seed early or late amount&lt;br /&gt;
*numseeds - Number of seed investments &lt;br /&gt;
*numearly - Number of early investments&lt;br /&gt;
*numlater - Number of late investments&lt;br /&gt;
*numsel &lt;br /&gt;
*numdeals - Number of deals (first contracts)&lt;br /&gt;
*numalive - Number of start ups alive&lt;br /&gt;
*income - Income per capita in each city-year&lt;br /&gt;
*sbir_nogrants - Number of SBIR grants&lt;br /&gt;
*sbir_valuegrant - Value of SBIR grants&lt;br /&gt;
*emp - Employment stats of each city-year&lt;br /&gt;
*unemp - Rate of unemployment&lt;br /&gt;
*popestimate - Population estimate of each city-year&lt;br /&gt;
*private - Enrollment in private schools&lt;br /&gt;
*public - Enrollment in public schools&lt;br /&gt;
*total - Total school enrollment&lt;br /&gt;
*numfirms - Number of publicly traded firms&lt;br /&gt;
*randd - R&amp;amp;D expenditure of publicly traded firms&lt;br /&gt;
*revenue - Revenue of publicly traded firms&lt;br /&gt;
*totalassets &lt;br /&gt;
*nsf_nogrants - Number of NSF grants&lt;br /&gt;
*valuegrant - Value of NSF grants&lt;br /&gt;
*nih_nogrants - Number of NIH grants&lt;br /&gt;
*nih_valuegrant - Value of NIH grants&lt;br /&gt;
*noctrials - Number of clinical trials&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hubs&amp;diff=19308</id>
		<title>Hubs</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hubs&amp;diff=19308"/>
		<updated>2017-07-13T19:59:08Z</updated>

		<summary type="html">&lt;p&gt;HiraF: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Hubs&lt;br /&gt;
|Has owner=Hira Farooqi,&lt;br /&gt;
|Has keywords=Data&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
The Hubs Research Project is a full-length academic paper analyzing the effectiveness of &amp;quot;hubs&amp;quot;, a component of the entrepreneurship ecosystem, in the advancement and growth of entrepreneurial success in a metropolitan area. It focuses on cities in the United States as the primary unit of analysis.&lt;br /&gt;
&lt;br /&gt;
This page contains information about data used for this research project, including data sources, location of data on RDP and details on data processing.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Information on initial data work done prior to Summer 2017 can be found at [[Hubs Summer 2016]]. &lt;br /&gt;
&lt;br /&gt;
'''Note on joining:''' The city-state-year ID from VC data is used as the master ID for joining datasets. Each table (e.g. income, nih, nsf, sbir, compustat) is first joined with the VC data on city-state-year ID and then the resulting tables are all joined together in the final table.&lt;br /&gt;
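&lt;br /&gt;
The joining pattern in the note above can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module, not the project's actual SQL scripts: the toy rows, the ID format (e.g. HOUSTON_TX_2010), the reduced column sets, and the choice of LEFT JOIN are all assumptions made for the example.&lt;br /&gt;

```python
import sqlite3

# Hypothetical miniature tables; the real tables have many more columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vc  (city_state_year TEXT PRIMARY KEY, numseeds INTEGER);
CREATE TABLE nsf (city_state_year TEXT, nsf_nogrants INTEGER);
CREATE TABLE nih (city_state_year TEXT, nih_nogrants INTEGER);
INSERT INTO vc  VALUES ('HOUSTON_TX_2010', 3), ('AUSTIN_TX_2010', 5);
INSERT INTO nsf VALUES ('HOUSTON_TX_2010', 12);
INSERT INTO nih VALUES ('AUSTIN_TX_2010', 7);
""")

# Step 1: join each source table with VC on the master ID.
# LEFT JOIN (an assumption) keeps every VC city-year, matched or not.
conn.executescript("""
CREATE TABLE merged_nsf AS
  SELECT vc.city_state_year AS city_state_year,
         vc.numseeds AS numseeds,
         nsf.nsf_nogrants AS nsf_nogrants
  FROM vc LEFT JOIN nsf ON vc.city_state_year = nsf.city_state_year;
CREATE TABLE merged_nih AS
  SELECT vc.city_state_year AS city_state_year,
         nih.nih_nogrants AS nih_nogrants
  FROM vc LEFT JOIN nih ON vc.city_state_year = nih.city_state_year;
""")

# Step 2: join the per-source merged tables together on the same ID.
final_rows = conn.execute("""
SELECT a.city_state_year, a.numseeds, a.nsf_nogrants, b.nih_nogrants
FROM merged_nsf AS a
JOIN merged_nih AS b ON a.city_state_year = b.city_state_year
ORDER BY a.city_state_year
""").fetchall()

for row in final_rows:
    print(row)
```

Because every merged table starts from the VC rows, the final join in this sketch yields one row per VC city-year, with empty values wherever a source had no match.&lt;br /&gt;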
&lt;br /&gt;
&lt;br /&gt;
==COMPUSTAT Data==&lt;br /&gt;
The data set includes information on publicly traded firms in the US. It was obtained from the Wharton Research Data Services (https://wrds-web.wharton.upenn.edu/wrds/index.cfm?). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Raw Data is in:&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The source file is RandDExpenditures.txt. It contains:&lt;br /&gt;
*Data from 1980-2017 (July)&lt;br /&gt;
*427799 records&lt;br /&gt;
*Fields include:&lt;br /&gt;
**R&amp;amp;D Expenditure&lt;br /&gt;
**Address (inc. city, zip, state)&lt;br /&gt;
**Revenue of firms&lt;br /&gt;
 &lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: COMPUSTAT.sql&lt;br /&gt;
&lt;br /&gt;
Output file is COMPUSTATSummary.txt. It contains:&lt;br /&gt;
*Variables: City, year, No. of public firms, sum R&amp;amp;D, sum Sales, sum total assets&lt;br /&gt;
*1979-2016&lt;br /&gt;
*4440 cities&lt;br /&gt;
&lt;br /&gt;
It is located in&lt;br /&gt;
 Z:\Hubs\2017\Output_Files&lt;br /&gt;
&lt;br /&gt;
==NSF Data==&lt;br /&gt;
Data is in:&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: nsf_2017.sql&lt;br /&gt;
&lt;br /&gt;
The source files are: nsf2017.txt, copied from table '''nsf''', and nsf_institution copied from table '''nsf_grants_institution''' from the biotech db.&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*Award ID&lt;br /&gt;
*Award Institution&lt;br /&gt;
*Award Effective date&lt;br /&gt;
*Institution city&lt;br /&gt;
*Award Value&lt;br /&gt;
*Organization state code&lt;br /&gt;
From 1900 - 2017&lt;br /&gt;
&lt;br /&gt;
Output file is nsfSummary.txt. It contains:&lt;br /&gt;
*Variables: City, State code, year, nsf_nogrants, nsf_valuegrant&lt;br /&gt;
*1900-2017&lt;br /&gt;
&lt;br /&gt;
===Joined NSF table===&lt;br /&gt;
The NSF table joined with the VC table is in db '''cities'''. The table is named '''merged_nsf'''.&lt;br /&gt;
Missing values of nogrants and valuegrant for years 1990-2017 are set to 0.&lt;br /&gt;
The SQL script is in&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
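&lt;br /&gt;
The zero-filling step described for the joined NSF table could look roughly like the sketch below. It uses Python's built-in sqlite3 module; the toy rows, the ID format, and the use of UPDATE with COALESCE are assumptions for illustration and are not taken from the project's script.&lt;br /&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE merged_nsf (
  city_state_year TEXT,
  year            INTEGER,
  nsf_nogrants    INTEGER,
  nsf_valuegrant  REAL
);
-- Hypothetical rows: one before the fill window, two inside it.
INSERT INTO merged_nsf VALUES
  ('A_TX_1985', 1985, NULL, NULL),
  ('A_TX_1995', 1995, NULL, NULL),
  ('A_TX_2005', 2005, 4, 250000.0);

-- Replace missing values with 0, but only for years 1990-2017.
UPDATE merged_nsf
   SET nsf_nogrants   = COALESCE(nsf_nogrants, 0),
       nsf_valuegrant = COALESCE(nsf_valuegrant, 0)
 WHERE year BETWEEN 1990 AND 2017;
""")

filled = conn.execute(
    "SELECT year, nsf_nogrants, nsf_valuegrant FROM merged_nsf ORDER BY year"
).fetchall()

for row in filled:
    print(row)
```

Rows outside the stated year range keep their missing values, which distinguishes "no grants recorded" from "year not covered by the fill".&lt;br /&gt;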
&lt;br /&gt;
==NIH Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: nih2017.sql&lt;br /&gt;
The source files are: &lt;br /&gt;
*nih_1986_2001.csv&lt;br /&gt;
*nih_2002_2012.txt&lt;br /&gt;
*nih_2013_2015&lt;br /&gt;
located in E:\McNair\Projects\Federal Grant Data\NIH&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The script that cleans NIH data and generates the summary table is titled '''nihSummary'''. It is located here:&lt;br /&gt;
&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
This table includes&lt;br /&gt;
*year&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*country&lt;br /&gt;
*nogrants (number of grants)&lt;br /&gt;
*valuegrant&lt;br /&gt;
*city_state &lt;br /&gt;
&lt;br /&gt;
*Data from 1986-2015&lt;br /&gt;
&lt;br /&gt;
===Joined NIH table===&lt;br /&gt;
The NIH table joined with the VC table is in db '''cities'''. The table is named '''merged_nih'''.&lt;br /&gt;
Missing values of nih_valuegrant and nih_nogrants for years 1986-2015 are set to 0.&lt;br /&gt;
The SQL script is in&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
==Clinical Trials Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: ctrials.sql&lt;br /&gt;
The source file is: &lt;br /&gt;
&lt;br /&gt;
*medclinical.txt&lt;br /&gt;
&lt;br /&gt;
located in Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
*Data from 1999-2017&lt;br /&gt;
&lt;br /&gt;
===Joined clinical trials table===&lt;br /&gt;
&lt;br /&gt;
The file which contains the number of trials in each city and year is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
The file is in:&lt;br /&gt;
 Z:\Hubs\2017\clean data&lt;br /&gt;
The name of the file is:&lt;br /&gt;
  ctrialsSummary.txt&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*city&lt;br /&gt;
*year&lt;br /&gt;
*city_state_year&lt;br /&gt;
*noctrials - number of trials&lt;br /&gt;
&lt;br /&gt;
The ctrials table is joined with the VC table.&lt;br /&gt;
The SQL script for the join is '''new_ctrials.sql''', located in&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
The name of the joined table is '''new_merged_ctrials'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*city_state_id&lt;br /&gt;
*city_state_year&lt;br /&gt;
*year&lt;br /&gt;
*noctrials&lt;br /&gt;
*seedamtm&lt;br /&gt;
*earlyamtm&lt;br /&gt;
*lateramtm&lt;br /&gt;
*selamtm&lt;br /&gt;
*numseeds&lt;br /&gt;
*numearly&lt;br /&gt;
*numlater&lt;br /&gt;
*numsel&lt;br /&gt;
&lt;br /&gt;
Missing values of noctrials for years 1999-2017 are set to 0.&lt;br /&gt;
&lt;br /&gt;
==Population Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: '''population.sql'''&lt;br /&gt;
The source files are: &lt;br /&gt;
*pop2000_2009.xlsx&lt;br /&gt;
*pop2010_2016.xlsx&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*State&lt;br /&gt;
*City name	&lt;br /&gt;
*Year	&lt;br /&gt;
*Population Estimates&lt;br /&gt;
&lt;br /&gt;
Data from 2000-2016&lt;br /&gt;
&lt;br /&gt;
===Joined population table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\2017\clean data&lt;br /&gt;
The file names are &lt;br /&gt;
 1_population.txt - contains data on population estimates from 2000-2009&lt;br /&gt;
 2_population.txt - contains data on population estimates from 2010-2016&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: '''new_population.sql''', located in &lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
The population table is joined with the VC table. The joined table is called '''new_merged_population'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*city_state_year to uniquely identify each city in each year&lt;br /&gt;
*Population estimates&lt;br /&gt;
*Year&lt;br /&gt;
*Code derived from the state code and FIPS code&lt;br /&gt;
*State full name&lt;br /&gt;
&lt;br /&gt;
==Income Data==&lt;br /&gt;
&lt;br /&gt;
Raw data was obtained from the US Census Bureau's American Community Survey.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\MSA Income_raw.zip &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The master list with MSAs and principal cities is titled '''list2.xls'''. It is located at:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
This master list includes:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA name&lt;br /&gt;
*Principal City&lt;br /&gt;
*State&lt;br /&gt;
*Place code (city code)&lt;br /&gt;
*State Code&lt;br /&gt;
&lt;br /&gt;
This master list was edited to associate each principal city with a single state. For example, New York is the principal city of the New York-New Jersey MSA, so it was originally associated with state NY-NJ; '''list''' was edited to put New York with NY.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Cleaned Income data files are in&lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID &lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Total Household Income &lt;br /&gt;
&lt;br /&gt;
The MSA-City-State look up file is titled '''msa_city_state_wcode.txt'''. It is located in &lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID &lt;br /&gt;
&lt;br /&gt;
The SQL file that merges income data from ACS (by MSA - Year) with the MSA-City file is titled '''income.sql'''. It is located here:&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
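&lt;br /&gt;
The merge of MSA-level income with the MSA-City-State lookup can be sketched as below. This is an illustrative example with Python's built-in sqlite3 module, not the contents of income.sql: the table names, the MSA codes (M1, M2), the income figures, and the assumption that every city in the lookup inherits its MSA's value are all hypothetical.&lt;br /&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- ACS income is reported per MSA and year (values here are made up).
CREATE TABLE acs_income (msa_code TEXT, year INTEGER, income REAL);
-- Lookup from MSA to principal city and a single state per city.
CREATE TABLE msa_city_state (msa_code TEXT, city TEXT, state TEXT);

INSERT INTO acs_income VALUES ('M1', 2015, 100.0), ('M2', 2015, 80.0);
INSERT INTO msa_city_state VALUES
  ('M1', 'New York',    'NY'),
  ('M1', 'Jersey City', 'NJ'),
  ('M2', 'Houston',     'TX');
""")

# Each city row picks up the income of the MSA it belongs to.
income_rows = conn.execute("""
SELECT l.city, l.state, i.year, i.income
FROM acs_income AS i
JOIN msa_city_state AS l ON i.msa_code = l.msa_code
ORDER BY l.city
""").fetchall()

for row in income_rows:
    print(row)
```

The join turns MSA-year rows into city-state-year rows, which is the shape needed to match against the VC data.&lt;br /&gt;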
   &lt;br /&gt;
 &lt;br /&gt;
The final income table is in db '''cities''' titled '''merged_income'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Year&lt;br /&gt;
*Total Household Income &lt;br /&gt;
&lt;br /&gt;
The table includes 8780 observations&lt;br /&gt;
&lt;br /&gt;
===Joined income table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\clean data&lt;br /&gt;
The file names are:&lt;br /&gt;
 INC_05.txt - INC_15.txt&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: merged_income.sql&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*Income&lt;br /&gt;
*Year&lt;br /&gt;
*Code derived from the state code and FIPS code&lt;br /&gt;
&lt;br /&gt;
==Employment Data==&lt;br /&gt;
&lt;br /&gt;
Data on employment was obtained from the American Community Survey, US Census Bureau.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\Employment Data by MSA&lt;br /&gt;
Cleaned files are in&lt;br /&gt;
 Z:\Hubs\2017\clean data  &lt;br /&gt;
 &lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Employment rate of individuals 16 years or older&lt;br /&gt;
*Unemployment rate of individuals 16 years or older&lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The SQL file that merges employment data from ACS (by MSA - Year) with the MSA-City file is titled '''Employment.sql'''. &lt;br /&gt;
The file is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The final table is in db '''cities''' titled '''merged_employment'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*Year&lt;br /&gt;
*Employment rate&lt;br /&gt;
*Unemployment rate&lt;br /&gt;
&lt;br /&gt;
===Joined employment table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\clean data&lt;br /&gt;
 &lt;br /&gt;
The file names are:&lt;br /&gt;
 EMP_05.txt - EMP_15.txt &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: '''new_employment.sql''', located in&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
The final table, which is joined with the VC table, is in db '''cities''' and is titled '''new_merged_employment'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Code derived from the state code and FIPS code&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*city_state_year to uniquely identify each city in each year&lt;br /&gt;
*Employment rates of individuals of 16 years or older&lt;br /&gt;
*Unemployment rates of individuals of 16 years or older&lt;br /&gt;
*Year&lt;br /&gt;
&lt;br /&gt;
==Schooling Data==&lt;br /&gt;
&lt;br /&gt;
Data on schooling was obtained from the American Community Survey, US Census Bureau.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\School Enrollment Data by MSA&lt;br /&gt;
Cleaned files are in&lt;br /&gt;
 Z:\Hubs\2017\clean data&lt;br /&gt;
 &lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Total number of population 3 years and over enrolled in school&lt;br /&gt;
*Percent of population 3 years and over enrolled in public school&lt;br /&gt;
*Percent of population 3 years and over enrolled in private school &lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The SQL file that merges schooling data from ACS (by MSA - Year) with the MSA-City file is titled '''schooling.sql'''. &lt;br /&gt;
The file is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The final table is in db '''cities''' titled '''merged_schooling'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*Year&lt;br /&gt;
*Total&lt;br /&gt;
*Percent_public_schooling&lt;br /&gt;
*Percent_private_schooling&lt;br /&gt;
&lt;br /&gt;
===Joined schooling table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\clean data&lt;br /&gt;
The file names are:&lt;br /&gt;
 SCH_05.txt - SCH_15.txt&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script which joins this table with VC table is: '''new_merged_schooling.sql'''&lt;br /&gt;
The final table is in db '''cities''' titled '''new_merged_schooling'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*city_state_year to uniquely identify each city in each year&lt;br /&gt;
*Total number of school enrollment&lt;br /&gt;
*Percentage enrolled in public schools&lt;br /&gt;
*Percentage enrolled in private schools&lt;br /&gt;
*Year&lt;br /&gt;
*Code derived from the state code and FIPS code&lt;br /&gt;
&lt;br /&gt;
==VC Data==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
  Z:\VentureCapitalData\SDCVCData&lt;br /&gt;
  The file name is roundcitystateyear.txt&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*year&lt;br /&gt;
*seedamtm - seed, amount in millions&lt;br /&gt;
*earlyamtm - early, amount in millions&lt;br /&gt;
*lateramtm - late, amount in millions&lt;br /&gt;
*selamtm - seed early late, amount in millions&lt;br /&gt;
*numseeds - number of seeds&lt;br /&gt;
*numearly &lt;br /&gt;
*numlater&lt;br /&gt;
*numsel&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Data from 1953-2017&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The table is in db '''cities''' titled '''vc'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*city_state_id&lt;br /&gt;
*city_state_year&lt;br /&gt;
*seedamtm&lt;br /&gt;
*earlyamtm&lt;br /&gt;
*lateramtm&lt;br /&gt;
*selamtm&lt;br /&gt;
*numseeds&lt;br /&gt;
*numearly&lt;br /&gt;
*numlater&lt;br /&gt;
*numsel&lt;br /&gt;
*year&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hubs&amp;diff=19306</id>
		<title>Hubs</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hubs&amp;diff=19306"/>
		<updated>2017-07-13T18:58:00Z</updated>

		<summary type="html">&lt;p&gt;HiraF: /* NIH Data */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Hubs&lt;br /&gt;
|Has owner=Hira Farooqi,&lt;br /&gt;
|Has keywords=Data&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
The Hubs Research Project is a full-length academic paper analyzing the effectiveness of &amp;quot;hubs&amp;quot;, a component of the entrepreneurship ecosystem, in the advancement and growth of entrepreneurial success in a metropolitan area. It focuses on cities in the United States as the primary unit of analysis.&lt;br /&gt;
&lt;br /&gt;
This page contains information about data used for this research project, including data sources, location of data on RDP and details on data processing.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Information on initial data work done prior to Summer 2017 can be found at [[Hubs Summer 2016]]. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==COMPUSTAT Data==&lt;br /&gt;
The data set includes information on publicly traded firms in the US. It was obtained from the Wharton Research Data Services (https://wrds-web.wharton.upenn.edu/wrds/index.cfm?). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Raw Data is in:&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The source file is RandDExpenditures.txt. It contains:&lt;br /&gt;
*Data from 1980-2017 (July)&lt;br /&gt;
*427799 records&lt;br /&gt;
*Fields include:&lt;br /&gt;
**R&amp;amp;D Expenditure&lt;br /&gt;
**Address (inc. city, zip, state)&lt;br /&gt;
**Revenue of firms&lt;br /&gt;
 &lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: COMPUSTAT.sql&lt;br /&gt;
&lt;br /&gt;
Output file is COMPUSTATSummary.txt. It contains:&lt;br /&gt;
*Variables: City, year, No. of public firms, sum R&amp;amp;D, sum Sales, sum total assets&lt;br /&gt;
*1979-2016&lt;br /&gt;
*4440 cities&lt;br /&gt;
&lt;br /&gt;
It is located in&lt;br /&gt;
 Z:\Hubs\2017\Output_Files&lt;br /&gt;
&lt;br /&gt;
==NSF Data==&lt;br /&gt;
Data is in:&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: nsf_2017.sql&lt;br /&gt;
&lt;br /&gt;
The source files are: nsf2017.txt, copied from table '''nsf''', and nsf_institution copied from table '''nsf_grants_institution''' from the biotech db.&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*Award ID&lt;br /&gt;
*Award Institution&lt;br /&gt;
*Award Effective date&lt;br /&gt;
*Institution city&lt;br /&gt;
*Award Value&lt;br /&gt;
*Organization state code&lt;br /&gt;
From 1900 - 2017&lt;br /&gt;
&lt;br /&gt;
Output file is nsfSummary.txt. It contains:&lt;br /&gt;
*Variables: City, State code, year, nsf_nogrants, nsf_valuegrant&lt;br /&gt;
*1900-2017&lt;br /&gt;
&lt;br /&gt;
===Joined NSF table===&lt;br /&gt;
The NSF table joined with the VC table is in db '''cities'''. The table is named '''merged_nsf'''.&lt;br /&gt;
Missing values of nogrants and valuegrant for years 1990-2017 are set to 0.&lt;br /&gt;
The SQL script is in&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
==NIH Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: nih2017.sql&lt;br /&gt;
The source files are: &lt;br /&gt;
*nih_1986_2001.csv&lt;br /&gt;
*nih_2002_2012.txt&lt;br /&gt;
*nih_2013_2015&lt;br /&gt;
located in E:\McNair\Projects\Federal Grant Data\NIH&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The script that cleans NIH data and generates the summary table is titled '''nihSummary'''. It is located here:&lt;br /&gt;
&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
This table includes&lt;br /&gt;
*year&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*country&lt;br /&gt;
*nogrants (number of grants)&lt;br /&gt;
*valuegrant&lt;br /&gt;
*city_state &lt;br /&gt;
&lt;br /&gt;
*Data from 1986-2015&lt;br /&gt;
&lt;br /&gt;
===Joined NIH table===&lt;br /&gt;
The NIH table joined with the VC table is in db '''cities'''. The table is named '''merged_nih'''.&lt;br /&gt;
Missing values of nih_valuegrant and nih_nogrants for years 1986-2015 are set to 0.&lt;br /&gt;
The SQL script is in&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
==Clinical Trials Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: ctrials.sql&lt;br /&gt;
The source file is: &lt;br /&gt;
&lt;br /&gt;
*medclinical.txt&lt;br /&gt;
&lt;br /&gt;
located in Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
*Data from 1999-2017&lt;br /&gt;
&lt;br /&gt;
===Joined clinical trials table===&lt;br /&gt;
&lt;br /&gt;
The file which contains the number of trials in each city and year is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
The file is in:&lt;br /&gt;
 Z:\Hubs\2017\clean data&lt;br /&gt;
The name of the file is:&lt;br /&gt;
  ctrialsSummary.txt&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*city&lt;br /&gt;
*year&lt;br /&gt;
*city_state_year&lt;br /&gt;
*noctrials - number of trials&lt;br /&gt;
&lt;br /&gt;
The ctrials table is joined with the VC table.&lt;br /&gt;
The SQL script for the join is '''new_ctrials.sql''', located in&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
The name of the joined table is '''new_merged_ctrials'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*city_state_id&lt;br /&gt;
*city_state_year&lt;br /&gt;
*year&lt;br /&gt;
*noctrials&lt;br /&gt;
*seedamtm&lt;br /&gt;
*earlyamtm&lt;br /&gt;
*lateramtm&lt;br /&gt;
*selamtm&lt;br /&gt;
*numseeds&lt;br /&gt;
*numearly&lt;br /&gt;
*numlater&lt;br /&gt;
*numsel&lt;br /&gt;
&lt;br /&gt;
Missing values of noctrials for years 1999-2017 are set to 0.&lt;br /&gt;
&lt;br /&gt;
==Population Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: '''population.sql'''&lt;br /&gt;
The source files are: &lt;br /&gt;
*pop2000_2009.xlsx&lt;br /&gt;
*pop2010_2016.xlsx&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*State&lt;br /&gt;
*City name	&lt;br /&gt;
*Year	&lt;br /&gt;
*Population Estimates&lt;br /&gt;
&lt;br /&gt;
Data from 2000-2016&lt;br /&gt;
&lt;br /&gt;
===Joined population table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\2017\clean data&lt;br /&gt;
The file names are &lt;br /&gt;
 1_population.txt - contains data on population estimates from 2000-2009&lt;br /&gt;
 2_population.txt - contains data on population estimates from 2010-2016&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: '''new_population.sql''', located in &lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
The population table is joined with the VC table. The joined table is called '''new_merged_population'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*city_state_year to uniquely identify each city in each year&lt;br /&gt;
*Population estimates&lt;br /&gt;
*Year&lt;br /&gt;
*Code derived from the state code and FIPS code&lt;br /&gt;
*State full name&lt;br /&gt;
&lt;br /&gt;
==Income Data==&lt;br /&gt;
&lt;br /&gt;
Raw data was obtained from the US Census Bureau's American Community Survey.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\MSA Income_raw.zip &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The master list with MSAs and principal cities is titled '''list2.xls'''. It is located at:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
This master list includes:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA name&lt;br /&gt;
*Principal City&lt;br /&gt;
*State&lt;br /&gt;
*Place code (city code)&lt;br /&gt;
*State Code&lt;br /&gt;
&lt;br /&gt;
This master list was edited to associate each principal city with a single state. For example, New York is the principal city of the New York-New Jersey MSA, so it was originally associated with state NY-NJ; '''list''' was edited to put New York with NY.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Cleaned Income data files are in&lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID &lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Total Household Income &lt;br /&gt;
&lt;br /&gt;
The MSA-City-State look up file is titled '''msa_city_state_wcode.txt'''. It is located in &lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID &lt;br /&gt;
&lt;br /&gt;
The SQL file that merges income data from ACS (by MSA - Year) with the MSA-City file is titled '''income.sql'''. It is located here:&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
   &lt;br /&gt;
 &lt;br /&gt;
The final income table is in db '''cities''' titled '''merged_income'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Year&lt;br /&gt;
*Total Household Income &lt;br /&gt;
&lt;br /&gt;
The table includes 8780 observations&lt;br /&gt;
&lt;br /&gt;
===Joined income table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\clean data&lt;br /&gt;
The file names are:&lt;br /&gt;
 INC_05.txt - INC_15.txt&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: merged_income.sql&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*Income&lt;br /&gt;
*Year&lt;br /&gt;
*Code derived from the state code and FIPS code&lt;br /&gt;
&lt;br /&gt;
==Employment Data==&lt;br /&gt;
&lt;br /&gt;
Data on employment was obtained from the American Community Survey, US Census Bureau.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\Employment Data by MSA&lt;br /&gt;
Cleaned files are in&lt;br /&gt;
 Z:\Hubs\2017\clean data  &lt;br /&gt;
 &lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Employment rate of individuals 16 years or older&lt;br /&gt;
*Unemployment rate of individuals 16 years or older&lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The SQL file that merges employment data from ACS (by MSA - Year) with the MSA-City file is titled '''Employment.sql'''. &lt;br /&gt;
The file is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The final table is in db '''cities''' titled '''merged_employment'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*Year&lt;br /&gt;
*Employment rate&lt;br /&gt;
*Unemployment rate&lt;br /&gt;
&lt;br /&gt;
===Joined employment table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\clean data&lt;br /&gt;
 &lt;br /&gt;
The file names are:&lt;br /&gt;
 EMP_05.txt - EMP_15.txt &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: '''new_employment.sql''', located in&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
The final table, which is joined with the VC table, is in db '''cities''' and is titled '''new_merged_employment'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Code derived from the state code and FIPS code&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*city_state_year to uniquely identify each city in each year&lt;br /&gt;
*Employment rates of individuals of 16 years or older&lt;br /&gt;
*Unemployment rates of individuals of 16 years or older&lt;br /&gt;
*Year&lt;br /&gt;
&lt;br /&gt;
==Schooling Data==&lt;br /&gt;
&lt;br /&gt;
Data on schooling was obtained from the American Community Survey, US Census Bureau.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\School Enrollment Data by MSA&lt;br /&gt;
Cleaned files are in&lt;br /&gt;
 Z:\Hubs\2017\clean data&lt;br /&gt;
 &lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Total number of population 3 years and over enrolled in school&lt;br /&gt;
*Percent of population 3 years and over enrolled in public school&lt;br /&gt;
*Percent of population 3 years and over enrolled in private school &lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The SQL file that merges schooling data from ACS (by MSA - Year) with the MSA-City file is titled '''schooling.sql'''. &lt;br /&gt;
The file is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The final table is in db '''cities''' titled '''merged_schooling'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*Year&lt;br /&gt;
*Total&lt;br /&gt;
*Percent_public_schooling&lt;br /&gt;
*Percent_private_schooling&lt;br /&gt;
&lt;br /&gt;
===Joined schooling table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\clean data&lt;br /&gt;
The file names are:&lt;br /&gt;
 SCH_05.txt - SCH_15.txt&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script which joins this table with VC table is: '''new_merged_schooling.sql'''&lt;br /&gt;
The final table is in db '''cities''' titled '''new_merged_schooling'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*city_state_year to uniquely identify each city in each year&lt;br /&gt;
*Total number of school enrollment&lt;br /&gt;
*Percentage enrolled in public schools&lt;br /&gt;
*Percentage enrolled in private schools&lt;br /&gt;
*Year&lt;br /&gt;
*Code derived from the state code and FIPS code&lt;br /&gt;
&lt;br /&gt;
==VC Data==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
  Z:\VentureCapitalData\SDCVCData&lt;br /&gt;
  The file name is roundcitystateyear.txt&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*year&lt;br /&gt;
*seedamtm - seed, amount in millions&lt;br /&gt;
*earlyamtm - early, amount in millions&lt;br /&gt;
*lateramtm - late, amount in millions&lt;br /&gt;
*selamtm - seed early late, amount in millions&lt;br /&gt;
*numseeds - number of seeds&lt;br /&gt;
*numearly &lt;br /&gt;
*numlater&lt;br /&gt;
*numsel&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Data from 1953-2017&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The table is in db '''cities''' titled '''vc'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*city_state_id&lt;br /&gt;
*city_state_year&lt;br /&gt;
*seedamtm&lt;br /&gt;
*earlyamtm&lt;br /&gt;
*lateramtm&lt;br /&gt;
*selamtm&lt;br /&gt;
*numseeds&lt;br /&gt;
*numearly&lt;br /&gt;
*numlater&lt;br /&gt;
*numsel&lt;br /&gt;
*year&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hubs&amp;diff=19305</id>
		<title>Hubs</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hubs&amp;diff=19305"/>
		<updated>2017-07-13T18:52:16Z</updated>

		<summary type="html">&lt;p&gt;HiraF: /* COMPUSTAT Data */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Hubs&lt;br /&gt;
|Has owner=Hira Farooqi,&lt;br /&gt;
|Has keywords=Data&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
The Hubs Research Project is a full-length academic paper analyzing the effectiveness of &amp;quot;hubs&amp;quot;, a component of the entrepreneurship ecosystem, in the advancement and growth of entrepreneurial success in a metropolitan area. It focuses on cities in the United States as the primary unit of analysis.&lt;br /&gt;
&lt;br /&gt;
This page contains information about data used for this research project, including data sources, location of data on RDP and details on data processing.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Information on initial data work done prior to Summer 2017 can be found at [[Hubs Summer 2016]]. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==COMPUSTAT Data==&lt;br /&gt;
The data set includes information on publicly traded firms in the US. It was obtained from the Wharton Research Data Services (https://wrds-web.wharton.upenn.edu/wrds/index.cfm?). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Raw Data is in:&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The source file is RandDExpenditures.txt. It contains:&lt;br /&gt;
*Data from 1980 to July 2017&lt;br /&gt;
*427799 records&lt;br /&gt;
*Fields include:&lt;br /&gt;
**R&amp;amp;D Expenditure&lt;br /&gt;
**Address (inc. city, zip, state)&lt;br /&gt;
**Revenue of firms&lt;br /&gt;
 &lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: COMPUSTAT.sql&lt;br /&gt;
&lt;br /&gt;
Output file is COMPUSTATSummary.txt. It contains:&lt;br /&gt;
*Variables: City, year, No. of public firms, sum R&amp;amp;D, sum Sales, sum total assets&lt;br /&gt;
*1979-2016&lt;br /&gt;
*4440 cities&lt;br /&gt;
&lt;br /&gt;
It is located in&lt;br /&gt;
 Z:\Hubs\2017\Output_Files&lt;br /&gt;
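&lt;br /&gt;
The contents of COMPUSTAT.sql are not reproduced on this page. As a rough sketch only, a city-year summary with the variables listed above could be computed along the following lines. The sketch uses Python's built-in sqlite3 module as a stand-in for the '''cities''' database; the table name firms, the column names (gvkey, city, year, rd, sales, assets) and the sample rows are all hypothetical.&lt;br /&gt;

```python
import sqlite3

# In-memory stand-in for the cities database (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE firms (gvkey TEXT, city TEXT, year INTEGER,
                    rd REAL, sales REAL, assets REAL);
INSERT INTO firms VALUES
  ('001', 'Houston', 2015, 10.0, 100.0, 500.0),
  ('002', 'Houston', 2015,  5.0,  50.0, 200.0),
  ('003', 'Austin',  2015,  2.5,  20.0,  80.0);
""")

# One row per city and year: number of public firms and summed financials.
summary = conn.execute("""
SELECT city, year,
       COUNT(DISTINCT gvkey) AS no_public_firms,
       SUM(rd)     AS sum_rd,
       SUM(sales)  AS sum_sales,
       SUM(assets) AS sum_assets
FROM firms
GROUP BY city, year
ORDER BY city, year
""").fetchall()

for row in summary:
    print(row)
```

Grouping by city alone would merge same-named cities in different states, so a real script would probably group on a city-state key as well.&lt;br /&gt;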
&lt;br /&gt;
==NSF Data==&lt;br /&gt;
Data is in:&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: nsf_2017.sql&lt;br /&gt;
&lt;br /&gt;
The source files are: nsf2017.txt, copied from table '''nsf''', and nsf_institution copied from table '''nsf_grants_institution''' from the biotech db.&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*Award ID&lt;br /&gt;
*Award Institution&lt;br /&gt;
*Award Effective date&lt;br /&gt;
*Institution city&lt;br /&gt;
*Award Value&lt;br /&gt;
*Organization state code&lt;br /&gt;
From 1900 - 2017&lt;br /&gt;
&lt;br /&gt;
Output file is nsfSummary.txt. It contains:&lt;br /&gt;
*Variables: City, State code, year, nsf_nogrants, nsf_valuegrant&lt;br /&gt;
*1900-2017&lt;br /&gt;
&lt;br /&gt;
===Joined NSF table===&lt;br /&gt;
The NSF table joined with the VC table is in db '''cities''' and is named '''merged_nsf'''.&lt;br /&gt;
Missing values of nogrants and valuegrant for the years 1990-2017 are set to 0.&lt;br /&gt;
The SQL script is in&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
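&lt;br /&gt;
The join script itself is not shown here. A minimal sketch of the technique described above (a left join onto the VC table, with missing grant values set to 0 only for 1990-2017) might look like this. It uses Python's built-in sqlite3 module as a stand-in for the '''cities''' database; the table name nsf_summary, the join on city, state and year, and the sample rows are assumptions for illustration.&lt;br /&gt;

```python
import sqlite3

# In-memory stand-in for the cities database (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vc (city TEXT, state TEXT, year INTEGER, numsel INTEGER);
CREATE TABLE nsf_summary (city TEXT, state TEXT, year INTEGER,
                          nsf_nogrants INTEGER, nsf_valuegrant REAL);
INSERT INTO vc VALUES
  ('Houston', 'TX', 1995, 4),
  ('Houston', 'TX', 1996, 6),
  ('Austin',  'TX', 1985, 1);
INSERT INTO nsf_summary VALUES ('Houston', 'TX', 1995, 12, 3400000.0);
""")

# Keep every VC row; fill unmatched grant values with 0 inside 1990-2017
# and leave them NULL outside that window.
merged = conn.execute("""
SELECT vc.city, vc.state, vc.year,
       CASE WHEN vc.year BETWEEN 1990 AND 2017
            THEN COALESCE(n.nsf_nogrants, 0)
            ELSE n.nsf_nogrants END AS nsf_nogrants,
       CASE WHEN vc.year BETWEEN 1990 AND 2017
            THEN COALESCE(n.nsf_valuegrant, 0)
            ELSE n.nsf_valuegrant END AS nsf_valuegrant
FROM vc
LEFT JOIN nsf_summary AS n
  ON n.city = vc.city AND n.state = vc.state AND n.year = vc.year
ORDER BY vc.city, vc.year
""").fetchall()

for row in merged:
    print(row)
```

The same pattern would apply to the other joined tables, with the year window changed to match each source.&lt;br /&gt;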
&lt;br /&gt;
==NIH Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: nih2017.sql&lt;br /&gt;
The source files are: &lt;br /&gt;
*nih_1986_2001.csv&lt;br /&gt;
*nih_2002_2012.txt&lt;br /&gt;
*nih_2013_2015&lt;br /&gt;
located in E:\McNair\Projects\Federal Grant Data\NIH&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The script that cleans NIH data and generates the summary table is titled '''nihSummary'''. It is located here:&lt;br /&gt;
&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
This table includes&lt;br /&gt;
*year&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*country&lt;br /&gt;
*nogrants (number of grants)&lt;br /&gt;
*valuegrant&lt;br /&gt;
*city_state (the city-state ID that we'll merge on)&lt;br /&gt;
&lt;br /&gt;
*Data from 1986-2015&lt;br /&gt;
&lt;br /&gt;
===Joined NIH table===&lt;br /&gt;
The NIH table joined with the VC table is in db '''cities''' and is named '''merged_nih'''.&lt;br /&gt;
Missing values of nih_valuegrant and nih_nogrants for the years 1986-2015 are set to 0.&lt;br /&gt;
The SQL script is in&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
==Clinical Trials Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: ctrials.sql&lt;br /&gt;
The source file is: &lt;br /&gt;
&lt;br /&gt;
*medclinical.txt&lt;br /&gt;
&lt;br /&gt;
located in Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
*Data from 1999-2017&lt;br /&gt;
&lt;br /&gt;
===Joined clinical trials table===&lt;br /&gt;
&lt;br /&gt;
The file which contains the number of trials in each city and year is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
The file is in:&lt;br /&gt;
 Z:\Hubs\2017\clean data&lt;br /&gt;
The name of the file is:&lt;br /&gt;
  ctrialsSummary.txt&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*city&lt;br /&gt;
*year&lt;br /&gt;
*city_state_year&lt;br /&gt;
*noctrials - number of trials&lt;br /&gt;
&lt;br /&gt;
The ctrials table is joined with the VC table.&lt;br /&gt;
The SQL script for the join is '''new_ctrials.sql''', located in&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
The name of the joined table is '''new_merged_ctrials'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*city_state_id&lt;br /&gt;
*city_state_year&lt;br /&gt;
*year&lt;br /&gt;
*noctrials&lt;br /&gt;
*seedamtm&lt;br /&gt;
*earlyamtm&lt;br /&gt;
*lateramtm&lt;br /&gt;
*selamtm&lt;br /&gt;
*numseeds&lt;br /&gt;
*numearly&lt;br /&gt;
*numlater&lt;br /&gt;
*numsel&lt;br /&gt;
&lt;br /&gt;
Missing values of noctrials for the years 1999-2017 are set to 0.&lt;br /&gt;
&lt;br /&gt;
==Population Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: '''population.sql'''&lt;br /&gt;
The source files are: &lt;br /&gt;
*pop2000_2009.xlsx&lt;br /&gt;
*pop2010_2016.xlsx&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*State&lt;br /&gt;
*City name	&lt;br /&gt;
*Year	&lt;br /&gt;
*Population Estimates&lt;br /&gt;
&lt;br /&gt;
Data from 2000-2016&lt;br /&gt;
&lt;br /&gt;
===Joined population table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\2017\clean data&lt;br /&gt;
The file names are &lt;br /&gt;
 1_population.txt - contains data on population estimates from 2000-2009&lt;br /&gt;
 2_population.txt - contains data on population estimates from 2010-2016&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: '''new_population.sql''', located in &lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
The population table is joined to the VC table. The joined table is called '''new_merged_population'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*city_state_year to uniquely identify each city in each year&lt;br /&gt;
*Population estimates&lt;br /&gt;
*Year&lt;br /&gt;
*Code from the state code and FIPS code&lt;br /&gt;
*State full name&lt;br /&gt;
&lt;br /&gt;
==Income Data==&lt;br /&gt;
&lt;br /&gt;
Raw data was obtained from the US Census Bureau's American Community Survey (ACS).&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\MSA Income_raw.zip &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The master list with MSAs and principal cities is titled '''list2.xls'''. It is located at:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
This master list includes:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA name&lt;br /&gt;
*Principal City&lt;br /&gt;
*State&lt;br /&gt;
*Place code (city code)&lt;br /&gt;
*State Code&lt;br /&gt;
&lt;br /&gt;
This master list was edited to associate each principal city with a unique state. For example, New York is the principal city of the New York-New Jersey MSA and was originally associated with state NY-NJ, so '''list''' was edited to put New York with NY.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Cleaned Income data files are in&lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID &lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Total Household Income &lt;br /&gt;
&lt;br /&gt;
The MSA-City-State look up file is titled '''msa_city_state_wcode.txt'''. It is located in &lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID &lt;br /&gt;
&lt;br /&gt;
The SQL file that merges income data from ACS (by MSA - Year) with the MSA-City file is titled '''income.sql'''. It is located here:&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
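&lt;br /&gt;
The contents of income.sql are not shown on this page. As a hedged illustration of the merge described above, the sketch below attaches MSA-year income to each principal city through an MSA-city look-up. It uses Python's built-in sqlite3 module as a stand-in for the '''cities''' database; the table names msa_income and msa_city, the column names and the sample rows are hypothetical.&lt;br /&gt;

```python
import sqlite3

# In-memory stand-in for the cities database (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE msa_income (msa_code TEXT, msa TEXT, year INTEGER, income REAL);
CREATE TABLE msa_city (msa_code TEXT, city TEXT, state TEXT);
INSERT INTO msa_income VALUES
  ('M1', 'MSA One', 2014, 100.0),
  ('M1', 'MSA One', 2015, 110.0),
  ('M2', 'MSA Two', 2015,  90.0);
INSERT INTO msa_city VALUES
  ('M1', 'Alpha', 'TX'),
  ('M1', 'Beta',  'TX'),
  ('M2', 'Gamma', 'NY');
""")

# Each principal city inherits the income of its MSA for every year.
city_income = conn.execute("""
SELECT i.msa, c.city, c.state, i.year, i.income
FROM msa_income AS i
JOIN msa_city AS c ON c.msa_code = i.msa_code
ORDER BY c.city, i.year
""").fetchall()

for row in city_income:
    print(row)
```

Because an MSA can have several principal cities, the join repeats each MSA-year value once per city, which is why the look-up needs one unique state per city.&lt;br /&gt;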
   &lt;br /&gt;
 &lt;br /&gt;
The final income table is in db '''cities''' titled '''merged_income'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Year&lt;br /&gt;
*Total Household Income &lt;br /&gt;
&lt;br /&gt;
The table includes 8780 observations&lt;br /&gt;
&lt;br /&gt;
===Joined income table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\clean data&lt;br /&gt;
The file names are:&lt;br /&gt;
 INC_05.txt - INC_15.txt&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: merged_income.sql&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*Income&lt;br /&gt;
*Year&lt;br /&gt;
*Code from the state code and FIPS code&lt;br /&gt;
&lt;br /&gt;
==Employment Data==&lt;br /&gt;
&lt;br /&gt;
Data on employment was obtained from the American Community Survey (ACS), US Census Bureau.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\Employment Data by MSA&lt;br /&gt;
Cleaned files are in&lt;br /&gt;
 Z:\Hubs\2017\clean data  &lt;br /&gt;
 &lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Employment rate of individuals 16 years or older&lt;br /&gt;
*Unemployment rate of individuals 16 years or older&lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The SQL file that merges employment data from ACS (by MSA - Year) with the MSA-City file is titled '''Employment.sql'''. &lt;br /&gt;
The file is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The final table is in db '''cities''' titled '''merged_employment'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*Year&lt;br /&gt;
*Employment rate&lt;br /&gt;
*Unemployment rate&lt;br /&gt;
&lt;br /&gt;
===Joined employment table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\clean data&lt;br /&gt;
 &lt;br /&gt;
The file names are:&lt;br /&gt;
 EMP_05.txt - EMP_15.txt &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is '''new_employment.sql''', located in&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
The final table, joined to the VC table, is in db '''cities''' and is titled '''new_merged_employment'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Code from the state code and FIPS code&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*city_state_year to uniquely identify each city in each year&lt;br /&gt;
*Employment rates of individuals of 16 years or older&lt;br /&gt;
*Unemployment rates of individuals of 16 years or older&lt;br /&gt;
*Year&lt;br /&gt;
&lt;br /&gt;
==Schooling Data==&lt;br /&gt;
&lt;br /&gt;
Data on schooling was obtained from the American Community Survey (ACS), US Census Bureau.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\School Enrollment Data by MSA&lt;br /&gt;
Cleaned files are in&lt;br /&gt;
 Z:\Hubs\2017\clean data&lt;br /&gt;
 &lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Total number of population 3 years and over enrolled in school&lt;br /&gt;
*Percent of population 3 years and over enrolled in public school&lt;br /&gt;
*Percent of population 3 years and over enrolled in private school &lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The SQL file that merges schooling data from ACS (by MSA - Year) with the MSA-City file is titled '''schooling.sql'''. &lt;br /&gt;
The file is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The final table is in db '''cities''' titled '''merged_schooling'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*Year&lt;br /&gt;
*Total&lt;br /&gt;
*Percent_public_schooling&lt;br /&gt;
*Percent_private_schooling&lt;br /&gt;
&lt;br /&gt;
===Joined schooling table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\clean data&lt;br /&gt;
The file names are:&lt;br /&gt;
 SCH_05.txt - SCH_15.txt&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script which joins this table with VC table is: '''new_merged_schooling.sql'''&lt;br /&gt;
The final table is in db '''cities''' titled '''new_merged_schooling'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*city_state_year to uniquely identify each city in each year&lt;br /&gt;
*Total number of school enrollment&lt;br /&gt;
*Percentage enrolled in public schools&lt;br /&gt;
*Percentage enrolled in private schools&lt;br /&gt;
*Year&lt;br /&gt;
*Code from the state code and FIPS code&lt;br /&gt;
&lt;br /&gt;
==VC Data==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
  Z:\VentureCapitalData\SDCVCData&lt;br /&gt;
  The file name is roundcitystateyear.txt&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*year&lt;br /&gt;
*seedamtm - seed-stage amount, in millions&lt;br /&gt;
*earlyamtm - early-stage amount, in millions&lt;br /&gt;
*lateramtm - later-stage amount, in millions&lt;br /&gt;
*selamtm - seed, early and later stages combined, amount in millions&lt;br /&gt;
*numseeds - number of seed rounds&lt;br /&gt;
*numearly - number of early rounds&lt;br /&gt;
*numlater - number of later rounds&lt;br /&gt;
*numsel - number of seed, early and later rounds combined&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Data from 1953-2017&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The table is in db '''cities''' titled '''vc'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*city_state_id&lt;br /&gt;
*city_state_year&lt;br /&gt;
*seedamtm&lt;br /&gt;
*earlyamtm&lt;br /&gt;
*lateramtm&lt;br /&gt;
*selamtm&lt;br /&gt;
*numseeds&lt;br /&gt;
*numearly&lt;br /&gt;
*numlater&lt;br /&gt;
*numsel&lt;br /&gt;
*year&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hubs&amp;diff=19304</id>
		<title>Hubs</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hubs&amp;diff=19304"/>
		<updated>2017-07-13T18:28:21Z</updated>

		<summary type="html">&lt;p&gt;HiraF: /* COMPUSTAT Data */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Hubs&lt;br /&gt;
|Has owner=Hira Farooqi,&lt;br /&gt;
|Has keywords=Data&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
The Hubs Research Project is a full-length academic paper analyzing the effectiveness of &amp;quot;hubs&amp;quot;, a component of the entrepreneurship ecosystem, in the advancement and growth of entrepreneurial success in a metropolitan area. It focuses on cities in the United States as the primary unit of analysis.&lt;br /&gt;
&lt;br /&gt;
This page contains information about data used for this research project, including data sources, location of data on RDP and details on data processing.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Information on initial data work done prior to Summer 2017 can be found at [[Hubs Summer 2016]]. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==COMPUSTAT Data==&lt;br /&gt;
The data set includes information on publicly traded firms in the US. It was obtained from the Wharton Research Data Services (https://wrds-web.wharton.upenn.edu/wrds/index.cfm?). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Raw Data is in:&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The source file is RandDExpenditures.txt. It contains:&lt;br /&gt;
*Data from 1980 to July 2017&lt;br /&gt;
*427799 records&lt;br /&gt;
*Fields include:&lt;br /&gt;
**R&amp;amp;D Expenditure&lt;br /&gt;
**Address (inc. city, zip, state)&lt;br /&gt;
**Revenue of firms&lt;br /&gt;
 &lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: COMPUSTAT.sql&lt;br /&gt;
&lt;br /&gt;
Output file is COMPUSTATSummary.txt. It contains:&lt;br /&gt;
*Variables: City, year, No. of public firms, sum R&amp;amp;D, sum Sales, sum total assets&lt;br /&gt;
*1979-2016&lt;br /&gt;
*4440 cities&lt;br /&gt;
&lt;br /&gt;
==NSF Data==&lt;br /&gt;
Data is in:&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: nsf_2017.sql&lt;br /&gt;
&lt;br /&gt;
The source files are: nsf2017.txt, copied from table '''nsf''', and nsf_institution copied from table '''nsf_grants_institution''' from the biotech db.&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*Award ID&lt;br /&gt;
*Award Institution&lt;br /&gt;
*Award Effective date&lt;br /&gt;
*Institution city&lt;br /&gt;
*Award Value&lt;br /&gt;
*Organization state code&lt;br /&gt;
From 1900 - 2017&lt;br /&gt;
&lt;br /&gt;
Output file is nsfSummary.txt. It contains:&lt;br /&gt;
*Variables: City, State code, year, nsf_nogrants, nsf_valuegrant&lt;br /&gt;
*1900-2017&lt;br /&gt;
&lt;br /&gt;
===Joined NSF table===&lt;br /&gt;
The NSF table joined with the VC table is in db '''cities''' and is named '''merged_nsf'''.&lt;br /&gt;
Missing values of nogrants and valuegrant for the years 1990-2017 are set to 0.&lt;br /&gt;
The SQL script is in&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
==NIH Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: nih2017.sql&lt;br /&gt;
The source files are: &lt;br /&gt;
*nih_1986_2001.csv&lt;br /&gt;
*nih_2002_2012.txt&lt;br /&gt;
*nih_2013_2015&lt;br /&gt;
located in E:\McNair\Projects\Federal Grant Data\NIH&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The script that cleans NIH data and generates the summary table is titled '''nihSummary'''. It is located here:&lt;br /&gt;
&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
This table includes&lt;br /&gt;
*year&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*country&lt;br /&gt;
*nogrants (number of grants)&lt;br /&gt;
*valuegrant&lt;br /&gt;
*city_state (the city-state ID that we'll merge on)&lt;br /&gt;
&lt;br /&gt;
*Data from 1986-2015&lt;br /&gt;
&lt;br /&gt;
===Joined NIH table===&lt;br /&gt;
The NIH table joined with the VC table is in db '''cities''' and is named '''merged_nih'''.&lt;br /&gt;
Missing values of nih_valuegrant and nih_nogrants for the years 1986-2015 are set to 0.&lt;br /&gt;
The SQL script is in&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
==Clinical Trials Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: ctrials.sql&lt;br /&gt;
The source file is: &lt;br /&gt;
&lt;br /&gt;
*medclinical.txt&lt;br /&gt;
&lt;br /&gt;
located in Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
*Data from 1999-2017&lt;br /&gt;
&lt;br /&gt;
===Joined clinical trials table===&lt;br /&gt;
&lt;br /&gt;
The file which contains the number of trials in each city and year is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
The file is in:&lt;br /&gt;
 Z:\Hubs\2017\clean data&lt;br /&gt;
The name of the file is:&lt;br /&gt;
  ctrialsSummary.txt&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*city&lt;br /&gt;
*year&lt;br /&gt;
*city_state_year&lt;br /&gt;
*noctrials - number of trials&lt;br /&gt;
&lt;br /&gt;
The ctrials table is joined with the VC table.&lt;br /&gt;
The SQL script for the join is '''new_ctrials.sql''', located in&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
The name of the joined table is '''new_merged_ctrials'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*city_state_id&lt;br /&gt;
*city_state_year&lt;br /&gt;
*year&lt;br /&gt;
*noctrials&lt;br /&gt;
*seedamtm&lt;br /&gt;
*earlyamtm&lt;br /&gt;
*lateramtm&lt;br /&gt;
*selamtm&lt;br /&gt;
*numseeds&lt;br /&gt;
*numearly&lt;br /&gt;
*numlater&lt;br /&gt;
*numsel&lt;br /&gt;
&lt;br /&gt;
Missing values of noctrials for the years 1999-2017 are set to 0.&lt;br /&gt;
&lt;br /&gt;
==Population Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: '''population.sql'''&lt;br /&gt;
The source files are: &lt;br /&gt;
*pop2000_2009.xlsx&lt;br /&gt;
*pop2010_2016.xlsx&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*State&lt;br /&gt;
*City name	&lt;br /&gt;
*Year	&lt;br /&gt;
*Population Estimates&lt;br /&gt;
&lt;br /&gt;
Data from 2000-2016&lt;br /&gt;
&lt;br /&gt;
===Joined population table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\2017\clean data&lt;br /&gt;
The file names are &lt;br /&gt;
 1_population.txt - contains data on population estimates from 2000-2009&lt;br /&gt;
 2_population.txt - contains data on population estimates from 2010-2016&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: '''new_population.sql''', located in &lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
The population table is joined to the VC table. The joined table is called '''new_merged_population'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*city_state_year to uniquely identify each city in each year&lt;br /&gt;
*Population estimates&lt;br /&gt;
*Year&lt;br /&gt;
*Code from the state code and FIPS code&lt;br /&gt;
*State full name&lt;br /&gt;
&lt;br /&gt;
==Income Data==&lt;br /&gt;
&lt;br /&gt;
Raw data was obtained from the US Census Bureau's American Community Survey (ACS).&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\MSA Income_raw.zip &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The master list with MSAs and principal cities is titled '''list2.xls'''. It is located at:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
This master list includes:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA name&lt;br /&gt;
*Principal City&lt;br /&gt;
*State&lt;br /&gt;
*Place code (city code)&lt;br /&gt;
*State Code&lt;br /&gt;
&lt;br /&gt;
This master list was edited to associate each principal city with a unique state. For example, New York is the principal city of the New York-New Jersey MSA and was originally associated with state NY-NJ, so '''list''' was edited to put New York with NY.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Cleaned Income data files are in&lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID &lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Total Household Income &lt;br /&gt;
&lt;br /&gt;
The MSA-City-State look up file is titled '''msa_city_state_wcode.txt'''. It is located in &lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID &lt;br /&gt;
&lt;br /&gt;
The SQL file that merges income data from ACS (by MSA - Year) with the MSA-City file is titled '''income.sql'''. It is located here:&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
   &lt;br /&gt;
 &lt;br /&gt;
The final income table is in db '''cities''' titled '''merged_income'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Year&lt;br /&gt;
*Total Household Income &lt;br /&gt;
&lt;br /&gt;
The table includes 8780 observations&lt;br /&gt;
&lt;br /&gt;
===Joined income table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\clean data&lt;br /&gt;
The file names are:&lt;br /&gt;
 INC_05.txt - INC_15.txt&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: merged_income.sql&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*Income&lt;br /&gt;
*Year&lt;br /&gt;
*Code from the state code and FIPS code&lt;br /&gt;
&lt;br /&gt;
==Employment Data==&lt;br /&gt;
&lt;br /&gt;
Data on employment was obtained from the American Community Survey (ACS), US Census Bureau.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\Employment Data by MSA&lt;br /&gt;
Cleaned files are in&lt;br /&gt;
 Z:\Hubs\2017\clean data  &lt;br /&gt;
 &lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Employment rate of individuals 16 years or older&lt;br /&gt;
*Unemployment rate of individuals 16 years or older&lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The SQL file that merges employment data from ACS (by MSA - Year) with the MSA-City file is titled '''Employment.sql'''. &lt;br /&gt;
The file is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The final table is in db '''cities''' titled '''merged_employment'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*Year&lt;br /&gt;
*Employment rate&lt;br /&gt;
*Unemployment rate&lt;br /&gt;
&lt;br /&gt;
===Joined employment table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\clean data&lt;br /&gt;
 &lt;br /&gt;
The file names are:&lt;br /&gt;
 EMP_05.txt - EMP_15.txt &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is '''new_employment.sql''', located in&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
The final table, joined to the VC table, is in db '''cities''' and is titled '''new_merged_employment'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Code from the state code and FIPS code&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*city_state_year to uniquely identify each city in each year&lt;br /&gt;
*Employment rates of individuals of 16 years or older&lt;br /&gt;
*Unemployment rates of individuals of 16 years or older&lt;br /&gt;
*Year&lt;br /&gt;
&lt;br /&gt;
==Schooling Data==&lt;br /&gt;
&lt;br /&gt;
Data on schooling was obtained from the American Community Survey (ACS), US Census Bureau.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\School Enrollment Data by MSA&lt;br /&gt;
Cleaned files are in&lt;br /&gt;
 Z:\Hubs\2017\clean data&lt;br /&gt;
 &lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Total number of population 3 years and over enrolled in school&lt;br /&gt;
*Percent of population 3 years and over enrolled in public school&lt;br /&gt;
*Percent of population 3 years and over enrolled in private school &lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The SQL file that merges schooling data from ACS (by MSA - Year) with the MSA-City file is titled '''schooling.sql'''. &lt;br /&gt;
The file is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The final table is in db '''cities''' titled '''merged_schooling'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*Year&lt;br /&gt;
*Total&lt;br /&gt;
*Percent_public_schooling&lt;br /&gt;
*Percent_private_schooling&lt;br /&gt;
&lt;br /&gt;
===Joined schooling table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\clean data&lt;br /&gt;
The file names are:&lt;br /&gt;
 SCH_05.txt - SCH_15.txt&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script which joins this table with VC table is: '''new_merged_schooling.sql'''&lt;br /&gt;
The final table is in db '''cities''' titled '''new_merged_schooling'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*city_state_year to uniquely identify each city in each year&lt;br /&gt;
*Total number of school enrollment&lt;br /&gt;
*Percentage enrolled in public schools&lt;br /&gt;
*Percentage enrolled in private schools&lt;br /&gt;
*Year&lt;br /&gt;
*Code from the state code and FIPS code&lt;br /&gt;
&lt;br /&gt;
==VC Data==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
  Z:\VentureCapitalData\SDCVCData&lt;br /&gt;
  The file name is roundcitystateyear.txt&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*year&lt;br /&gt;
*seedamtm - seed-stage amount, in millions&lt;br /&gt;
*earlyamtm - early-stage amount, in millions&lt;br /&gt;
*lateramtm - later-stage amount, in millions&lt;br /&gt;
*selamtm - seed, early and later stages combined, amount in millions&lt;br /&gt;
*numseeds - number of seed rounds&lt;br /&gt;
*numearly - number of early rounds&lt;br /&gt;
*numlater - number of later rounds&lt;br /&gt;
*numsel - number of seed, early and later rounds combined&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Data from 1953-2017&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The table is in db '''cities''' titled '''vc'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*city_state_id&lt;br /&gt;
*city_state_year&lt;br /&gt;
*seedamtm&lt;br /&gt;
*earlyamtm&lt;br /&gt;
*lateramtm&lt;br /&gt;
*selamtm&lt;br /&gt;
*numseeds&lt;br /&gt;
*numearly&lt;br /&gt;
*numlater&lt;br /&gt;
*numsel&lt;br /&gt;
*year&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hubs&amp;diff=19303</id>
		<title>Hubs</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hubs&amp;diff=19303"/>
		<updated>2017-07-13T18:16:06Z</updated>

		<summary type="html">&lt;p&gt;HiraF: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Hubs&lt;br /&gt;
|Has owner=Hira Farooqi,&lt;br /&gt;
|Has keywords=Data&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
The Hubs Research Project is a full-length academic paper analyzing the effectiveness of &amp;quot;hubs&amp;quot;, a component of the entrepreneurship ecosystem, in the advancement and growth of entrepreneurial success in a metropolitan area. It focuses on cities in the United States as the primary unit of analysis.&lt;br /&gt;
&lt;br /&gt;
This page contains information about data used for this research project, including data sources, location of data on RDP and details on data processing.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Information on initial data work done prior to Summer 2017 can be found at [[Hubs Summer 2016]]. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==COMPUSTAT Data==&lt;br /&gt;
&lt;br /&gt;
Data is in:&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: COMPUSTAT.sql&lt;br /&gt;
&lt;br /&gt;
The source file is RandDExpenditures.txt. It contains:&lt;br /&gt;
*Data from 1980 to July 2017, all from COMPUSTAT&lt;br /&gt;
*427799 records&lt;br /&gt;
*Fields include:&lt;br /&gt;
**R&amp;amp;D Expenditure&lt;br /&gt;
**Address (inc. city, zip, state)&lt;br /&gt;
&lt;br /&gt;
Output file is COMPUSTATSummary.txt. It contains:&lt;br /&gt;
*Variables: City, year, No. of public firms, sum R&amp;amp;D, sum Sales, sum total assets&lt;br /&gt;
*1979-2016&lt;br /&gt;
*4440 cities&lt;br /&gt;
&lt;br /&gt;
==NSF Data==&lt;br /&gt;
Data is in:&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: nsf_2017.sql&lt;br /&gt;
&lt;br /&gt;
The source files are: nsf2017.txt, copied from table '''nsf''', and nsf_institution copied from table '''nsf_grants_institution''' from the biotech db.&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*Award ID&lt;br /&gt;
*Award Institution&lt;br /&gt;
*Award Effective date&lt;br /&gt;
*Institution city&lt;br /&gt;
*Award Value&lt;br /&gt;
*Organization state code&lt;br /&gt;
From 1900 - 2017&lt;br /&gt;
&lt;br /&gt;
Output file is nsfSummary.txt. It contains:&lt;br /&gt;
*Variables: City, State code, year, nsf_nogrants, nsf_valuegrant &lt;br /&gt;
*1900-2017&lt;br /&gt;
&lt;br /&gt;
===Joined NSF table===&lt;br /&gt;
The NSF table joined with the VC table is in db '''cities'''. The table is named '''merged_nsf'''.&lt;br /&gt;
Missing values of nogrants and valuegrant for the years 1990-2017 are set to 0.&lt;br /&gt;
The SQL script is in&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
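The zero-fill rule for the joined NSF table can be sketched as follows. This is a minimal illustration in Python, not the project's SQL script: the rows, the key format (e.g. HOUSTON_TX) and the function name merge_nsf are assumptions made for the example, and leaving rows outside 1990-2017 as unknown is also an assumption.&lt;br /&gt;

```python
# Hypothetical rows; the real tables live in the cities database.
vc_rows = [
    {"city_state_id": "HOUSTON_TX", "year": 1995},
    {"city_state_id": "HOUSTON_TX", "year": 1996},
    {"city_state_id": "AUSTIN_TX", "year": 1985},
]
nsf_summary = {
    ("HOUSTON_TX", 1995): {"nsf_nogrants": 12, "nsf_valuegrant": 3400000},
}

FILL_YEARS = range(1990, 2018)  # 1990-2017 inclusive


def merge_nsf(vc_rows, nsf_summary):
    """Left-join NSF totals onto VC rows, zero-filling gaps in FILL_YEARS."""
    merged = []
    for row in vc_rows:
        key = (row["city_state_id"], row["year"])
        grants = nsf_summary.get(key)
        if grants is None:
            # No NSF record: 0 inside the window, None (unknown) outside it.
            fill = 0 if row["year"] in FILL_YEARS else None
            grants = {"nsf_nogrants": fill, "nsf_valuegrant": fill}
        merged.append({**row, **grants})
    return merged


merged_nsf = merge_nsf(vc_rows, nsf_summary)
```

Every VC row is kept, so the joined table has one row per city-year in the VC table whether or not NSF grants were recorded.&lt;br /&gt;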
==NIH Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: nih2017.sql&lt;br /&gt;
The source files are: &lt;br /&gt;
*nih_1986_2001.csv&lt;br /&gt;
*nih_2002_2012.txt&lt;br /&gt;
*nih_2013_2015&lt;br /&gt;
located in E:\McNair\Projects\Federal Grant Data\NIH&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The script that cleans NIH data and generates the summary table is titled '''nihSummary'''. It is located here:&lt;br /&gt;
&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
This table includes&lt;br /&gt;
*year&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*country&lt;br /&gt;
*nogrants (number of grants)&lt;br /&gt;
*valuegrant&lt;br /&gt;
*city_state (the city-state ID that we'll merge on)&lt;br /&gt;
&lt;br /&gt;
*Data from 1986-2015&lt;br /&gt;
&lt;br /&gt;
===Joined NIH table===&lt;br /&gt;
The NIH table joined with the VC table is in db '''cities'''. The table is named '''merged_nih'''.&lt;br /&gt;
Missing values of nih_valuegrant and nih_nogrants for the years 1986-2015 are set to 0.&lt;br /&gt;
The SQL script is in&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
==Clinical Trials Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: ctrials.sql&lt;br /&gt;
The source file is: &lt;br /&gt;
&lt;br /&gt;
*medclinical.txt&lt;br /&gt;
&lt;br /&gt;
located in Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
*Data from 1999-2017&lt;br /&gt;
&lt;br /&gt;
===Joined clinical trials table===&lt;br /&gt;
&lt;br /&gt;
The file which contains the number of trials in each city and year is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
The file is in:&lt;br /&gt;
 Z:\Hubs\2017\clean data&lt;br /&gt;
The name of the file is:&lt;br /&gt;
  ctrialsSummary.txt&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*city&lt;br /&gt;
*year&lt;br /&gt;
*city_state_year&lt;br /&gt;
*noctrials - number of trials&lt;br /&gt;
&lt;br /&gt;
The ctrials table is joined with the VC table. &lt;br /&gt;
The SQL script that performs the join is '''new_ctrials.sql''', located in&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
The name of the joined table is '''new_merged_ctrials'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*city_state_id&lt;br /&gt;
*city_state_year&lt;br /&gt;
*year&lt;br /&gt;
*noctrials&lt;br /&gt;
*seedamtm&lt;br /&gt;
*earlyamtm&lt;br /&gt;
*lateramtm&lt;br /&gt;
*selamtm&lt;br /&gt;
*numseeds&lt;br /&gt;
*numearly&lt;br /&gt;
*numlater&lt;br /&gt;
*numsel&lt;br /&gt;
&lt;br /&gt;
Missing values of noctrials for the years 1999-2017 are set to 0.&lt;br /&gt;
&lt;br /&gt;
==Population Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: '''population.sql'''&lt;br /&gt;
The source files are: &lt;br /&gt;
*pop2000_2009.xlsx&lt;br /&gt;
*pop2010_2016.xlsx&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*State&lt;br /&gt;
*City name	&lt;br /&gt;
*Year	&lt;br /&gt;
*Population Estimates&lt;br /&gt;
&lt;br /&gt;
Data from 2000-2016&lt;br /&gt;
&lt;br /&gt;
===Joined population table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\2017\clean data&lt;br /&gt;
The file names are &lt;br /&gt;
 1_population.txt - contains data on population estimates from 2000-2009&lt;br /&gt;
 2_population.txt - contains data on population estimates from 2010-2016&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: '''new_population.sql''', located in &lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
The population table is joined with the VC table. The joined table is called '''new_merged_population'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*city_state_year to uniquely identify each city in each year&lt;br /&gt;
*Population estimates&lt;br /&gt;
*Year&lt;br /&gt;
*Code from the state code and Fips code&lt;br /&gt;
*State full name&lt;br /&gt;
&lt;br /&gt;
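The two identifiers used for merging can be illustrated with a short sketch. The exact format of city_state_id and city_state_year is not documented on this page, so the upper-case, underscore-separated format below and both function names are assumptions for illustration only.&lt;br /&gt;

```python
def make_city_state_id(city, state):
    """Build a normalized city-state key, e.g. 'HOUSTON_TX' (format assumed)."""
    city_part = " ".join(city.split()).upper()
    return city_part + "_" + state.strip().upper()


def make_city_state_year(city, state, year):
    """Extend the city-state key with the year (format assumed)."""
    return make_city_state_id(city, state) + "_" + str(year)


example_id = make_city_state_id("  new   york ", "ny")
example_year_id = make_city_state_year("Houston", "TX", 2010)
```

Normalizing case and whitespace before building the key keeps the same city from producing different keys in different source files.&lt;br /&gt;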
==Income Data==&lt;br /&gt;
&lt;br /&gt;
Raw data was obtained from the US Census Bureau's American Community Survey (ACS).&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\MSA Income_raw.zip &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The master list with MSAs and principal cities is titled '''list2.xls'''. It is located at:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
This master list includes:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA name&lt;br /&gt;
*Principal City&lt;br /&gt;
*State&lt;br /&gt;
*Place code (city code)&lt;br /&gt;
*State Code&lt;br /&gt;
&lt;br /&gt;
This master list was edited so that each principal city is associated with a single state. For example, New York is the principal city of the New York-New Jersey MSA and was originally associated with state NY-NJ, so '''list''' was edited to associate New York with NY. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Cleaned Income data files are in&lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID &lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Total Household Income &lt;br /&gt;
&lt;br /&gt;
The MSA-City-State look up file is titled '''msa_city_state_wcode.txt'''. It is located in &lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID &lt;br /&gt;
&lt;br /&gt;
The SQL file that merges income data from ACS (by MSA - Year) with the MSA-City file is titled '''income.sql'''. It is located here:&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
   &lt;br /&gt;
 &lt;br /&gt;
The final income table is in db '''cities''' titled '''merged_income'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Year&lt;br /&gt;
*Total Household Income &lt;br /&gt;
&lt;br /&gt;
The table includes 8780 observations.&lt;br /&gt;
&lt;br /&gt;
===Joined income table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\clean data&lt;br /&gt;
The file names are:&lt;br /&gt;
 INC_05.txt - INC_15.txt&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: merged_income.sql&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*Income&lt;br /&gt;
*Year&lt;br /&gt;
*Code from the state code and Fips code&lt;br /&gt;
&lt;br /&gt;
==Employment Data==&lt;br /&gt;
&lt;br /&gt;
Data on employment was obtained from the American Community Survey, US Census Bureau.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\Employment Data by MSA&lt;br /&gt;
Cleaned files are in&lt;br /&gt;
 Z:\Hubs\2017\clean data  &lt;br /&gt;
 &lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Employment rate of individuals 16 years or older&lt;br /&gt;
*Unemployment rate of individuals 16 years or older&lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The SQL file that merges employment data from ACS (by MSA - Year) with the MSA-City file is titled '''Employment.sql'''. &lt;br /&gt;
The file is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The final table is in db '''cities''' titled '''merged_employment'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*Year&lt;br /&gt;
*Employment rate&lt;br /&gt;
*Unemployment rate&lt;br /&gt;
&lt;br /&gt;
===Joined employment table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\clean data&lt;br /&gt;
 &lt;br /&gt;
The file names are:&lt;br /&gt;
 EMP_05.txt - EMP_15.txt &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: '''new_employment.sql''' and it is located in &lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
The final table, joined with the VC table, is in db '''cities''' and is titled '''new_merged_employment'''.&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Code from the state code and Fips code&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*city_state_year to uniquely identify each city in each year&lt;br /&gt;
*Employment rates of individuals of 16 years or older&lt;br /&gt;
*Unemployment rates of individuals of 16 years or older&lt;br /&gt;
*Year&lt;br /&gt;
&lt;br /&gt;
==Schooling Data==&lt;br /&gt;
&lt;br /&gt;
Data on schooling was obtained from the American Community Survey, US Census Bureau.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\School Enrollment Data by MSA&lt;br /&gt;
Cleaned files are in&lt;br /&gt;
 Z:\Hubs\2017\clean data&lt;br /&gt;
 &lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Total number of population 3 years and over enrolled in school&lt;br /&gt;
*Percent of population 3 years and over enrolled in public school&lt;br /&gt;
*Percent of population 3 years and over enrolled in private school &lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The SQL file that merges schooling data from ACS (by MSA - Year) with the MSA-City file is titled '''schooling.sql'''. &lt;br /&gt;
The file is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The final table is in db '''cities''' titled '''merged_schooling'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*Year&lt;br /&gt;
*Total&lt;br /&gt;
*Percent_public_schooling&lt;br /&gt;
*Percent_private_schooling&lt;br /&gt;
&lt;br /&gt;
===Joined schooling table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\clean data&lt;br /&gt;
The file names are:&lt;br /&gt;
 SCH_05.txt - SCH_15.txt&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script which joins this table with VC table is: '''new_merged_schooling.sql'''&lt;br /&gt;
The final table is in db '''cities''' titled '''new_merged_schooling'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*city_state_year to uniquely identify each city in each year&lt;br /&gt;
*Total number of school enrollment&lt;br /&gt;
*Percentage enrolled in public schools&lt;br /&gt;
*Percentage enrolled in private schools&lt;br /&gt;
*Year&lt;br /&gt;
*Code from the state code and Fips code&lt;br /&gt;
&lt;br /&gt;
==VC Data==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
  Z:\VentureCapitalData\SDCVCData&lt;br /&gt;
  The file name is roundcitystateyear.txt&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*year&lt;br /&gt;
*seedamtm - seed, amount in millions&lt;br /&gt;
*earlyamtm - early, amount in millions&lt;br /&gt;
*lateramtm - later, amount in millions&lt;br /&gt;
*selamtm - seed early late, amount in millions&lt;br /&gt;
*numseeds - number of seeds&lt;br /&gt;
*numearly &lt;br /&gt;
*numlater&lt;br /&gt;
*numsel&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Data from 1953-2017&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The table is in db '''cities''' titled '''vc'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*city_state_id&lt;br /&gt;
*city_state_year&lt;br /&gt;
*seedamtm&lt;br /&gt;
*earlyamtm&lt;br /&gt;
*lateramtm&lt;br /&gt;
*selamtm&lt;br /&gt;
*numseeds&lt;br /&gt;
*numearly&lt;br /&gt;
*numlater&lt;br /&gt;
*numsel&lt;br /&gt;
*year&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hubs_Summer_2016&amp;diff=19302</id>
		<title>Hubs Summer 2016</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hubs_Summer_2016&amp;diff=19302"/>
		<updated>2017-07-13T18:11:43Z</updated>

		<summary type="html">&lt;p&gt;HiraF: Created page with &amp;quot;===Primary Data Set=== The Hubs data set, from SDC Platinum, has been constructed in the server:  Data files are in 128.42.44.181/bulk/Hubs  All files are in 128.42.44.182/bul...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;===Primary Data Set===&lt;br /&gt;
The Hubs data set, from SDC Platinum, has been constructed on the server:&lt;br /&gt;
 Data files are in 128.42.44.181/bulk/Hubs&lt;br /&gt;
 All files are in 128.42.44.182/bulk/Projects/Ecosystem/Hubs&lt;br /&gt;
 psql Hubs&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The data set includes all United States Venture Capital transactions (moneytree) from the twenty-five year period of 1990 through 2015.&lt;br /&gt;
Data has been aggregated at the portfolio company, fund, and round level. It will be analyzed at the combined MSA level, in terms of the number of companies funded, the number of funds active, and the flow of investment in a given MSA.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The data set has now been uploaded to the database server, named Hubs.&lt;br /&gt;
There are 4 tables: &lt;br /&gt;
*Rounds: Rounddate, coname, state, roundno, stage1, etc.&lt;br /&gt;
*CombinedRounds: Coname, rounddate, discamount, fundname&lt;br /&gt;
*Companies: LastInv, FirstInv, coname, MSA, MSACode, Address, state, datefounded, totalknownfunding, industry(major)&lt;br /&gt;
*Funds: fundname, closingdate, lastinv, firstinv, msa, msacode, avinv, nocoinv, totalknowninv, address&lt;br /&gt;
&lt;br /&gt;
Used variables:&lt;br /&gt;
&lt;br /&gt;
 Companies: Coname, MSACode, Industry, state&lt;br /&gt;
 MSALookupTable: MSACode, MSASuper&lt;br /&gt;
 IndustryLookupTable: IndustryMajor, InduCode&lt;br /&gt;
 -&amp;gt; &lt;br /&gt;
 CompanyInfo: Coname, MSASuper, InduCode, state (complete)&lt;br /&gt;
&lt;br /&gt;
 Funds: fundname, msacode, state&lt;br /&gt;
 MSALookupTable: MSACode, MSASuper &lt;br /&gt;
 -&amp;gt; &lt;br /&gt;
 FundInfo: fundname, msacode, state (complete)&lt;br /&gt;
&lt;br /&gt;
 Rounds: coname, rounddate, stagecode, roundno&lt;br /&gt;
 CombinedRounds: coname, rounddate, discamount, fundname&lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 RoundInfoSuper: coname, rounddate, '''nofunds''', discamount   &lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 RoundInfo: Coname, roundyear, fundname, estamount (complete)&lt;br /&gt;
&lt;br /&gt;
Then take:&lt;br /&gt;
 RoundInfo: Coname, roundyear, fundname, estamount&lt;br /&gt;
 CompanyInfo: Coname, MSASuper, InduCode, state&lt;br /&gt;
 FundInfo: fundname, msacode, state&lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 SuperRoundInfo: Coname, CoMSASuper, CoInduCode, CoState, FundName, FundMSASuper, FundState, RoundYear, RoundEstAmount&lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 MSAPortCos: Count(CoName) As NoPortCosFunded, CoMSASuper, RoundYear&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
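The step from CombinedRounds to RoundInfo can be sketched in Python as follows. This is an illustration under assumptions, not the actual SQL: it assumes each CombinedRounds row repeats the round's disclosed amount, that nofunds is the number of distinct funds in the round, and that estamount is an equal split of discamount across those funds. The sample rows and the function name build_round_info are hypothetical.&lt;br /&gt;

```python
from collections import defaultdict

# Hypothetical CombinedRounds rows: one row per (company, round date, fund).
combined_rounds = [
    {"coname": "Acme", "rounddate": "2001-03-15", "discamount": 6.0, "fundname": "Fund A"},
    {"coname": "Acme", "rounddate": "2001-03-15", "discamount": 6.0, "fundname": "Fund B"},
    {"coname": "Acme", "rounddate": "2001-03-15", "discamount": 6.0, "fundname": "Fund C"},
    {"coname": "Beta", "rounddate": "2002-07-01", "discamount": 2.0, "fundname": "Fund A"},
]


def build_round_info(rows):
    """Return one row per (company, round year, fund) with an estimated amount."""
    funds_by_round = defaultdict(set)
    amount_by_round = {}
    for row in rows:
        key = (row["coname"], row["rounddate"])
        funds_by_round[key].add(row["fundname"])
        amount_by_round[key] = row["discamount"]

    round_info = []
    for key in sorted(funds_by_round):
        coname, rounddate = key
        funds = funds_by_round[key]
        nofunds = len(funds)  # syndicate size for this round
        for fundname in sorted(funds):
            round_info.append({
                "coname": coname,
                "roundyear": int(rounddate[:4]),
                "fundname": fundname,
                # Assumed rule: split the disclosed amount equally.
                "estamount": amount_by_round[key] / nofunds,
            })
    return round_info


round_info = build_round_info(combined_rounds)
```

With these sample rows, each of the three funds in the Acme round is assigned 2.0 of the 6.0 disclosed.&lt;br /&gt;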
'''Notes on Creation of Primary Data Set'''&lt;br /&gt;
&lt;br /&gt;
Raw tables&lt;br /&gt;
* companies (last investment, first investment, company name, MSA, MSA code, address, state, date founded, known funding, industry) &lt;br /&gt;
* funds (fund closing date, last investment, first investment, fund name, address, MSA, MSA code, Average investment, number companies invested (NoCos), known investment) &lt;br /&gt;
* rounds (round date, company name, state, round number, stage 1, stage 2, stage 3) &lt;br /&gt;
* combined rounds (company name, round date, disclosed amount, investor) &lt;br /&gt;
* msalist (changes MSAs to CMSAs— combined MSAs)&lt;br /&gt;
*industry list (changes 6 industry categories to 4— ICT, Life Sciences, Semiconductors, Other) &lt;br /&gt;
&lt;br /&gt;
Process&lt;br /&gt;
*cleaned tables to eliminate duplications, undisclosed variables&lt;br /&gt;
*changed all original characters to include CMSA and Industry Codes (companyinfo3, fundinfocleanfinal, roundinfoclean) &lt;br /&gt;
*matched funds to avoid any issues with names (i.e. Fund ABC L.P./Fund ABC LP/Fund ABC) &lt;br /&gt;
*matched roundinfoclean investors to fundinfocleanfinal investors (roundinfo.txt &amp;gt;&amp;gt; cleanfundfinal.txt)&lt;br /&gt;
*join by round and company conames&lt;br /&gt;
*bridge years (1990-2016), stage, and cmsa&lt;br /&gt;
* populate data with count of companies (Deal flow) and estimated amount ($)&lt;br /&gt;
** data set in 181 hubs folder under summarycmsa.txt (38394)&lt;br /&gt;
&lt;br /&gt;
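The fund-name matching step can be illustrated with a small normalizer. This sketch only handles the L.P./LP variants given as the example above; the regular expression and the function name normalize_fund_name are assumptions, and the actual matching process may have used different rules.&lt;br /&gt;

```python
import re

# Matches a trailing "L.P.", "LP", "L.P" or "LP." preceded by spaces or commas.
_LP_SUFFIX = re.compile(r"[\s,]+L\.?P\.?$", re.IGNORECASE)


def normalize_fund_name(name):
    """Collapse whitespace, drop a trailing LP suffix, and lower-case."""
    cleaned = " ".join(name.split())
    cleaned = _LP_SUFFIX.sub("", cleaned)
    return cleaned.lower()


variants = ["Fund ABC L.P.", "Fund ABC LP", "Fund ABC"]
normalized = {normalize_fund_name(v) for v in variants}
```

All three variants reduce to the same key, so round investors and fund records can be joined on the normalized name.&lt;br /&gt;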
Key decisions:&lt;br /&gt;
*Threw out undisclosed co through-out as no address&lt;br /&gt;
*Count is done by joining round and company&lt;br /&gt;
*Anything fund related must be disclosed fund&lt;br /&gt;
*Near and far, and total invested, and fund counts, etc., are all done using disclosed funds that match only&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Glossary of Tables'''&lt;br /&gt;
 cleanco — used to remove duplicates from companies&lt;br /&gt;
 cleanedcompanies — clean set of companies with no duplicates&lt;br /&gt;
 cmsafunds- &lt;br /&gt;
 cmsas— list of all CMSAs in final data set (for merging) &lt;br /&gt;
 cmsastats- statistics not including empty years (pre-merge) &lt;br /&gt;
 cmsastats2 - statistics separated by year-MSA&lt;br /&gt;
 cmsastats3— statistics separated by year-MSA-stage&lt;br /&gt;
 cmsastats4&lt;br /&gt;
 cmsayears— empty merged table between year and cmsa&lt;br /&gt;
 cmsayearstage — empty merged table between cmsa/years and stage&lt;br /&gt;
 combinedrounds— raw sdc data for combined rounds&lt;br /&gt;
 combinedroundswamt— used to join rounds and combined rounds for roundinfo2&lt;br /&gt;
 companies- raw SDC company data&lt;br /&gt;
 companyinfo — cleaned companies joined with state and CMSA information&lt;br /&gt;
 companyinfo2— companyinfo1 with original industry categories&lt;br /&gt;
 companyinfo3— companyinfo2 with updated industry categories and codes&lt;br /&gt;
 companyinfo4-- clean version of companyinfo3&lt;br /&gt;
 companyround- combined company information with round information&lt;br /&gt;
 companyround2- combined company information with round information, cleaned up from companyround&lt;br /&gt;
 companyround3- combined company information with round information, cleaned up from companyround2&lt;br /&gt;
 '''finaldataset'''- final statistics by CMSA-year, see section Final Primary Data Set for more information&lt;br /&gt;
 fundinfo— funds joined with CMSA info&lt;br /&gt;
 fundinfo2 - clean version of fundinfo1&lt;br /&gt;
 fundinfoclean - used in process to clean fundinfo2&lt;br /&gt;
 fundinfoclean2- used in process to clean fundinfo2&lt;br /&gt;
 fundinfocleanfinal- used in process to clean fundinfo2&lt;br /&gt;
 fundinfocleannodups- final clean set of fundinfo&lt;br /&gt;
 funds - raw SDC fund data&lt;br /&gt;
 Houston - analysis for Houston ecosystem team&lt;br /&gt;
 Houston2- analysis for Houston ecosystem team&lt;br /&gt;
 houston3- analysis for Houston ecosystem team&lt;br /&gt;
 industry — new industry codes (4)— used for all future data sets&lt;br /&gt;
 industrylist— lookup table for new industry codes (went from 6 to 4) &lt;br /&gt;
 joined1- used for matching process&lt;br /&gt;
 joined2- used for matching process&lt;br /&gt;
 matchfund2- used for matching process&lt;br /&gt;
 matchfunds- used for matching process&lt;br /&gt;
 matchroundfund - used for matching process&lt;br /&gt;
 matchroundfund2- used for matching process&lt;br /&gt;
 msalist — lookup table for MSA to CMSA (used for all future data sets) &lt;br /&gt;
 nearfar1-- beginning set before adding nearfar/stage variables &lt;br /&gt;
 nearfar2 -- added binary variables for near/far and for each of the stages, used to build final dataset&lt;br /&gt;
 roundfund— not used— joined round to fund; drop/ignore&lt;br /&gt;
 roundinfo— round info cleaned up to include number of investors in a syndicate and estimated investment per member of syndicate&lt;br /&gt;
 roundinfo2— roundinfo1 including name of investors/funds&lt;br /&gt;
 roundinfo3— clean version of roundinfo2&lt;br /&gt;
 roundinfoclean — final clean version of roundinfo3 (final roundinfo table)&lt;br /&gt;
 rounds — raw SDC round data&lt;br /&gt;
 stages — table for merging stage-year-CMSA&lt;br /&gt;
 superinfo — ignore/drop&lt;br /&gt;
 temp - used for matching process&lt;br /&gt;
 years — table for merging stage-year-CMSA&lt;br /&gt;
&lt;br /&gt;
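The role of the empty merge tables (cmsayears, cmsayearstage) can be sketched as a cross join that is then populated with observed statistics. The CMSA codes, stage labels and sample statistics below are hypothetical, and filling unobserved cells with zeros is an assumption about how empty years were handled.&lt;br /&gt;

```python
from itertools import product

cmsas = [1, 2]  # hypothetical CMSA codes
years = range(1990, 2017)  # 1990-2016 inclusive
stages = ["startupseed", "early", "later", "other"]  # assumed labels

# Hypothetical observed statistics keyed by (cmsa, year, stage).
observed = {(1, 1999, "early"): {"deals": 4, "estamount": 12.5}}


def build_panel(cmsas, years, stages, observed):
    """Cross join cmsa x year x stage, then attach statistics where present."""
    panel = []
    for cmsa, year, stage in product(cmsas, years, stages):
        empty = {"deals": 0, "estamount": 0.0}  # assumed fill for empty cells
        stats = observed.get((cmsa, year, stage), empty)
        panel.append({"cmsa": cmsa, "year": year, "stage": stage, **stats})
    return panel


panel = build_panel(cmsas, years, stages, observed)
```

Building the full grid first guarantees one row per CMSA-year-stage, including years in which a CMSA had no deals.&lt;br /&gt;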
===Hub Candidates Data Set===&lt;br /&gt;
&lt;br /&gt;
The Hubs candidate data set is a list of potential hubs found in MSAs throughout the country. Researchers are currently pulling qualitative and quantitative information from the candidates' websites in an attempt to categorize what can be identified as a hub. This is a difficult data set to build, as there is little to no quantitative information available for this category of institution, and it depends on what information is publicly accessible on the internet.&lt;br /&gt;
&lt;br /&gt;
Characteristics/Variables&lt;br /&gt;
*Year Founded&lt;br /&gt;
*Square footage&lt;br /&gt;
*LinkedIn self-identifiers (how the organization classifies itself on its LinkedIn profile) &lt;br /&gt;
*Activeness on Twitter (binary)&lt;br /&gt;
*Member Directory available online (binary)&lt;br /&gt;
*Number of conference rooms&lt;br /&gt;
*Price ($/month) for Flex desk &lt;br /&gt;
*Offers Reserved desk (binary)&lt;br /&gt;
*Offers office space for rent (binary) &lt;br /&gt;
*Offers community membership-- not for coworking but for community events, etc. (binary)&lt;br /&gt;
*Number of events offered per month (estimate)&lt;br /&gt;
*Offers code academy&lt;br /&gt;
*Mission Statement/Vision (for qualitative or key-word analysis) &lt;br /&gt;
&lt;br /&gt;
These characteristics/variables will be used to determine whether a candidate is or is not likely to be a Hub. &lt;br /&gt;
&lt;br /&gt;
As of March 10th 2016, the list contains 125 Hub candidates.&lt;br /&gt;
&lt;br /&gt;
'''Where to find''': The Hubs data set can be found in the Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;dataset folder. It is not currently in the database due to a UTF8 issue.&lt;br /&gt;
&lt;br /&gt;
===Supplementary Data Sets===&lt;br /&gt;
'''Patent data''': to be pulled from USPTO or SDC Platinum. &lt;br /&gt;
&lt;br /&gt;
'''Number of STEM Graduate Students''' (NSF) and '''University R&amp;amp;D Spending''' (NSF):&lt;br /&gt;
*University R&amp;amp;D Data found under file &amp;quot;NSF DATA_2004 to 2011.xlsx&amp;quot; in datasets folder (Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;Datasets)&lt;br /&gt;
*R&amp;amp;D spending found at the university level for 2014 (&amp;quot;Stem Grad Students.xlsx&amp;quot;) or at state level (&amp;quot;Science and Engineering Grad Students by State and Year 2000-2011.csv&amp;quot;)&lt;br /&gt;
** not uploaded to server or matched yet to CMSA code, because of this discrepancy. &lt;br /&gt;
**&amp;quot;Stem Grad Students.xlsx&amp;quot; contains categorized university by MSA, can be used for all university-based projects&lt;br /&gt;
&lt;br /&gt;
'''Per Capita Income''' and '''Employment Data''' (US Census Bureau): &lt;br /&gt;
*&amp;quot;Per Capita Personal Income by MSA 2000-2012.xlsx&amp;quot; in datasets folder (Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;Datasets&amp;gt;&amp;gt;Data from Yael)&lt;br /&gt;
*&amp;quot;Wages and Salaries by MSA 2000-2012.xlsx&amp;quot; in datasets folder (Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;datasets&amp;gt;&amp;gt;Data from Yael)&lt;br /&gt;
**not uploaded to server or matched yet to CMSA code&lt;br /&gt;
&lt;br /&gt;
'''Firm Births''' (BDS)&lt;br /&gt;
*in server 181, under table name &amp;quot;BDS&amp;quot;&lt;br /&gt;
*includes birth, death, net(birth-death) and rate(death rate) for years 1990-2013 for every msa&lt;br /&gt;
*includes code for CMSA but is not aggregated by CMSA&lt;br /&gt;
** i.e. BDS statistics are still separate for all the smaller MSAs in New York's CMSA (code=1)&lt;br /&gt;
&lt;br /&gt;
===Resources===&lt;br /&gt;
* Yael Hochberg and Fehder (2015), located in dropbox&lt;br /&gt;
** Use this paper as a guideline on how to conduct the analysis&lt;br /&gt;
*US Census Bureau data on employment by MSA: http://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_14_5YR_B23027&amp;amp;prodType=table&lt;br /&gt;
*USPTO utility patents by MSA: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/cls_cbsa/allcbsa_gd.htm&lt;br /&gt;
*MSA level trends: http://www.metrotrends.org/data.cf&lt;br /&gt;
&lt;br /&gt;
===The Target Dataset===&lt;br /&gt;
&lt;br /&gt;
We will need to process the following variables:&lt;br /&gt;
*SuperMSA - combine SanFran and SanJose, New York and Newark?, NC Research triangle, others?&lt;br /&gt;
*CSV mapping msas to cmsas is in the folder (and a table in the dbase)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Example dataset:&lt;br /&gt;
 MSA      Year       SeedVCInv      SeedEarlyVCInv      LaterVCInv     NoDeals   FundsInvested   DistinctInvestors   ....&lt;br /&gt;
 ----------------------------------------------------------------------------------------------------------------------------&lt;br /&gt;
 1234     2001       1000000        20000000            30000000       4          7              7&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Note that the unit of observation is MSA-Year.&lt;br /&gt;
&lt;br /&gt;
Variables to be computed at the MSA level:&lt;br /&gt;
*HubActive (binary)&lt;br /&gt;
*NoHubsActive (Count)&lt;br /&gt;
*HubSqFt&lt;br /&gt;
*Other Hub Vars (build list!!!)&lt;br /&gt;
*'''SeedVCInv'''  (Seed/Start-up)&lt;br /&gt;
*'''EarlyVCInv''' (Early Stage)&lt;br /&gt;
*'''LaterStageVC''' (Later)&lt;br /&gt;
*'''OtherStageVC''' (Buyout/Acq, Other)&lt;br /&gt;
*'''NoDeals''' (done by local VCs?)&lt;br /&gt;
**'''NoDealsNear'''&lt;br /&gt;
**'''NoDealsFar'''&lt;br /&gt;
*NoPortCosFunded&lt;br /&gt;
*'''FundsInv''' (in an MSA)&lt;br /&gt;
**'''FundsInvFromNear''' (within MSA?)&lt;br /&gt;
**'''FundsInvFromFar''' (outside MSA?)&lt;br /&gt;
*DistinctInvestors (?)&lt;br /&gt;
**DistinctInvestorsNear (within MSA?)&lt;br /&gt;
**DistinctInvestorsFar (outside MSA?)&lt;br /&gt;
*PatentCount&lt;br /&gt;
*NoSTEMGrads&lt;br /&gt;
*FirmBirths (BDS data)&lt;br /&gt;
*UniRandDSpend&lt;br /&gt;
*PerCapitaIncome&lt;br /&gt;
*Employment&lt;br /&gt;
&lt;br /&gt;
We need to:&lt;br /&gt;
*Check funds invested means dollars invested&lt;br /&gt;
*Categorize near and far! Is it within MSA vs. not, within adjacent MSAs, etc.?&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There may be a second dataset that has Hub-Industry-Year (where industry is semiconductor/non-semiconductor?).&lt;br /&gt;
&lt;br /&gt;
===Final Primary Data Set===&lt;br /&gt;
&lt;br /&gt;
*Deal is a round syndicate (near/far deal is one investor that is near/far).&lt;br /&gt;
&lt;br /&gt;
Table name: finaldataset&lt;br /&gt;
 cmsa&lt;br /&gt;
 year&lt;br /&gt;
 totalamountinv--total amount invested &lt;br /&gt;
 nearamountinv--amount invested from local funds&lt;br /&gt;
 faramountinv-- amount invested from funds outside CMSA &lt;br /&gt;
 earlyinv--amount invested in early stage companies &lt;br /&gt;
 laterinv--amount invested in later stage companies &lt;br /&gt;
 startupseedinv--amount invested in seed or startup stage companies &lt;br /&gt;
 otherstageinv--amount invested in Acquisition/Buy-outs/Other stage companies &lt;br /&gt;
 investingfund--distinct funds that are investing in that CMSA-year &lt;br /&gt;
 investingfundnear--distinct funds from that CMSA that invested in that CMSA-year &lt;br /&gt;
 investingfundfar--distinct funds from outside that CMSA that invested in that CMSA-year &lt;br /&gt;
 deals--number of deals &lt;br /&gt;
 neardeals--number of deals inside a CMSA &lt;br /&gt;
 fardeals--number of deals from outside a CMSA --some of these deals might count in both categories, because of syndicate members being both inside and outside the CMSA&lt;br /&gt;
 earlystagedeals--deals with earlystage companies&lt;br /&gt;
 laterstagedeals--deals with later stage companies &lt;br /&gt;
 startupseeddeals--deals with startup/seed companies &lt;br /&gt;
 otherstagedeals--deals with companies in other stages &lt;br /&gt;
 newportcosfunded--number of portfolio companies to receive their first investment in that year&lt;br /&gt;
&lt;br /&gt;
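The near/far deal counts can be illustrated with the sketch below, which follows the definition above that a deal is a round syndicate and counts as near (or far) when at least one investor is near (or far). The sample rows, deal labels and the function name count_deals are hypothetical.&lt;br /&gt;

```python
# Hypothetical syndicate rows: one row per (deal, investing fund).
investments = [
    {"deal": "Acme-2001", "co_cmsa": 1, "fund_cmsa": 1},
    {"deal": "Acme-2001", "co_cmsa": 1, "fund_cmsa": 7},
    {"deal": "Beta-2001", "co_cmsa": 1, "fund_cmsa": 7},
]


def count_deals(rows):
    """Count distinct deals, and deals with a near or a far investor."""
    all_deals, near, far = set(), set(), set()
    for row in rows:
        all_deals.add(row["deal"])
        if row["fund_cmsa"] == row["co_cmsa"]:
            near.add(row["deal"])  # investor in the same CMSA as the company
        else:
            far.add(row["deal"])  # investor from outside that CMSA
    return {"deals": len(all_deals), "neardeals": len(near), "fardeals": len(far)}


counts = count_deals(investments)
```

Here the Acme deal has both a near and a far investor, so it is counted in both categories and neardeals plus fardeals exceeds deals.&lt;br /&gt;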
===Data by zip code===&lt;br /&gt;
*Population data, 2000-2016 - US Census Bureau (E:\McNair\Hubs\summer 2017)&lt;br /&gt;
https://www2.census.gov/programs-surveys/popest/datasets/&lt;br /&gt;
*Income data, 1998-2014 - The Internal Revenue Service (E:\McNair\Hubs\summer 2017)&lt;br /&gt;
https://www.irs.gov/uac/about-irs&lt;br /&gt;
*DCI index, to assess the economic well-being of communities&lt;br /&gt;
http://eig.org/dci/interactive-maps/u-s-zip-codes&lt;br /&gt;
*R&amp;amp;D Expenses, 1980-2016 - Wharton Research Data Services (E:\McNair\Hubs\summer 2017)&lt;br /&gt;
*Zipcode look-up table obtained from https://www.unitedstateszipcodes.org/zip-code-database/. It's available in (E:\McNair\Hubs\summer 2017).&lt;br /&gt;
&lt;br /&gt;
== Data by MSA ==&lt;br /&gt;
&lt;br /&gt;
We have principal cities of MSAs from the census:&lt;br /&gt;
https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html&lt;br /&gt;
&lt;br /&gt;
We might be able to go City -&amp;gt; FIPS place code -&amp;gt; MSA?&lt;br /&gt;
&lt;br /&gt;
Cities and their FIPS codes (which don't perfectly correspond) are available from https://www.census.gov/geo/reference/codes/place.html&lt;br /&gt;
&lt;br /&gt;
The Census claims to provide city to MSA here: https://www.census.gov/geo/maps-data/data/ua_rel_download.html&lt;br /&gt;
However, there is only CBSA!&lt;br /&gt;
&lt;br /&gt;
This might do it: https://www2.census.gov/geo/pdfs/maps-data/data/rel/explanation_ua_cbsa_rel_10.pdf&lt;br /&gt;
We can maybe track city to principal city to MSA&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hubs&amp;diff=19301</id>
		<title>Hubs</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hubs&amp;diff=19301"/>
		<updated>2017-07-13T18:11:29Z</updated>

		<summary type="html">&lt;p&gt;HiraF: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Hubs&lt;br /&gt;
|Has owner=Hira Farooqi,&lt;br /&gt;
|Has keywords=Data&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
The Hubs Research Project is a full-length academic paper analyzing the effectiveness of &amp;quot;hubs&amp;quot;, a component of the entrepreneurship ecosystem, in the advancement and growth of entrepreneurial success in a metropolitan area. &lt;br /&gt;
&lt;br /&gt;
This research will primarily focus on large and mid-sized Metropolitan Statistical Areas (MSAs), as that is where the majority of Venture Capital funding is located. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Details of initial data work done prior to Summer 2017 can be found at [[Hubs Summer 2016]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==COMPUSTAT Data==&lt;br /&gt;
&lt;br /&gt;
Data is in:&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: COMPUSTAT.sql&lt;br /&gt;
&lt;br /&gt;
The source file is RandDExpenditures.txt. It contains:&lt;br /&gt;
*Data from 1980-2017 (July). All COMPUSTAT.&lt;br /&gt;
*427799 records&lt;br /&gt;
*Fields include:&lt;br /&gt;
**R&amp;amp;D Expenditure&lt;br /&gt;
**Address (inc. city, zip, state)&lt;br /&gt;
&lt;br /&gt;
Output file is COMPUSTATSummary.txt. It contains:&lt;br /&gt;
*Variables: City, year, No. of public firms, sum R&amp;amp;D, sum Sales, sum total assets&lt;br /&gt;
*1979-2016&lt;br /&gt;
*4440 cities&lt;br /&gt;
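&lt;br /&gt;
A minimal sketch of the city-year aggregation that a summary like COMPUSTATSummary.txt implies, using Python's sqlite3 as a stand-in for the '''cities''' database. The table name, column names (firm, rd, sales, assets) and sample rows are assumptions for illustration and are not taken from COMPUSTAT.sql.&lt;br /&gt;

```python
import sqlite3

# Hypothetical miniature of the source file; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE compustat "
    "(firm TEXT, city TEXT, year INTEGER, rd REAL, sales REAL, assets REAL)"
)
conn.executemany(
    "INSERT INTO compustat VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("A", "Houston", 2015, 1.0, 10.0, 100.0),
        ("B", "Houston", 2015, 2.0, 20.0, 200.0),
        ("A", "Houston", 2016, 3.0, 30.0, 300.0),
    ],
)

# One output row per city-year: firm count plus summed financials.
summary = conn.execute(
    """
    SELECT city, year,
           COUNT(DISTINCT firm) AS nopublicfirms,
           SUM(rd) AS sumrd,
           SUM(sales) AS sumsales,
           SUM(assets) AS sumassets
    FROM compustat
    GROUP BY city, year
    ORDER BY city, year
    """
).fetchall()
```

Counting distinct firms guards against a firm appearing more than once in the same city-year; whether the original script does this is not documented here.&lt;br /&gt;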
&lt;br /&gt;
==NSF Data==&lt;br /&gt;
Data is in:&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: nsf_2017.sql&lt;br /&gt;
&lt;br /&gt;
The source files are: nsf2017.txt, copied from table '''nsf''', and nsf_institution copied from table '''nsf_grants_institution''' from the biotech db.&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*Award ID&lt;br /&gt;
*Award Institution&lt;br /&gt;
*Award Effective date&lt;br /&gt;
*Institution city&lt;br /&gt;
*Award Value&lt;br /&gt;
*Organization state code&lt;br /&gt;
From 1900 - 2017&lt;br /&gt;
&lt;br /&gt;
Output file is nsfSummary.txt. It contains:&lt;br /&gt;
*Variables: City, State code, year, nsf_nogrants, nsf_valuegrant &lt;br /&gt;
*1900-2017&lt;br /&gt;
&lt;br /&gt;
===Joined NSF table===&lt;br /&gt;
The NSF table joined with the VC table is in db '''cities'''. The table is named '''merged_nsf'''.&lt;br /&gt;
Missing values of nogrants and valuegrant for the years 1990-2017 are set to 0.&lt;br /&gt;
The SQL script is in&lt;br /&gt;
 Z:\HUbs\2017\sql scripts&lt;br /&gt;
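&lt;br /&gt;
The join-then-zero-fill step can be sketched as a LEFT JOIN with COALESCE. This is an illustration in Python's sqlite3, not the project's script: the table layouts, the key format and the sample rows are assumptions, and the restriction to a year range is left out for brevity.&lt;br /&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, cut-down versions of the VC table and the NSF summary.
conn.execute("CREATE TABLE vc (city_state_year TEXT PRIMARY KEY, numseeds INTEGER)")
conn.execute(
    "CREATE TABLE nsf_summary "
    "(city_state_year TEXT PRIMARY KEY, nsf_nogrants INTEGER, nsf_valuegrant REAL)"
)
conn.executemany(
    "INSERT INTO vc VALUES (?, ?)",
    [("HOUSTON_TX_2015", 4), ("AUSTIN_TX_2015", 2)],
)
conn.execute(
    "INSERT INTO nsf_summary VALUES (?, ?, ?)",
    ("HOUSTON_TX_2015", 3, 250000.0),
)

# Keep every VC row; city-years with no NSF match get 0 instead of NULL.
merged = conn.execute(
    """
    SELECT vc.city_state_year,
           vc.numseeds,
           COALESCE(n.nsf_nogrants, 0),
           COALESCE(n.nsf_valuegrant, 0)
    FROM vc
    LEFT JOIN nsf_summary AS n ON n.city_state_year = vc.city_state_year
    ORDER BY vc.city_state_year
    """
).fetchall()
```

The same pattern would apply to the NIH and clinical trials joins, with the year range adjusted to each source.&lt;br /&gt;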
&lt;br /&gt;
==NIH Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: nih2017.sql&lt;br /&gt;
The source files are: &lt;br /&gt;
*nih_1986_2001.csv&lt;br /&gt;
*nih_2002_2012.txt&lt;br /&gt;
*nih_2013_2015&lt;br /&gt;
located in E:\McNair\Projects\Federal Grant Data\NIH&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The script that cleans NIH data and generates the summary table is titled '''nihSummary'''. It is located here:&lt;br /&gt;
&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
This table includes&lt;br /&gt;
*year&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*country&lt;br /&gt;
*nogrants (number of grants)&lt;br /&gt;
*valuegrant&lt;br /&gt;
*city_state (the city-state ID that we'll merge on)&lt;br /&gt;
&lt;br /&gt;
*Data from 1986-2015&lt;br /&gt;
&lt;br /&gt;
===Joined NIH table===&lt;br /&gt;
The NIH table joined with the VC table is in db '''cities'''. The table is named '''merged_nih'''.&lt;br /&gt;
Missing values of nih_valuegrant and nih_nogrants for the years 1986-2015 are set to 0.&lt;br /&gt;
The SQL script is in&lt;br /&gt;
 Z:\HUbs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
==Clinical Trials Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: ctrials.sql&lt;br /&gt;
The source file is: &lt;br /&gt;
&lt;br /&gt;
*medclinical.txt&lt;br /&gt;
&lt;br /&gt;
located in Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
*Data from 1999-2017&lt;br /&gt;
&lt;br /&gt;
===Joined clinical trials table===&lt;br /&gt;
&lt;br /&gt;
The file which contains the number of trials in each city and year is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
The file is in:&lt;br /&gt;
 Z:\Hubs\2017\clean data&lt;br /&gt;
The name of the file is:&lt;br /&gt;
  ctrialsSummary.txt&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*city&lt;br /&gt;
*year&lt;br /&gt;
*city_state_year&lt;br /&gt;
*noctrials - number of trials&lt;br /&gt;
&lt;br /&gt;
The ctrials table is joined with the VC table. &lt;br /&gt;
The SQL script for the join is '''new_ctrials.sql''', located in&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
The name of the joined table is '''new_merged_ctrials'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*city_state_id&lt;br /&gt;
*city_state_year&lt;br /&gt;
*year&lt;br /&gt;
*noctrials&lt;br /&gt;
*seedamtm&lt;br /&gt;
*earlyamtm&lt;br /&gt;
*lateramtm&lt;br /&gt;
*selamtm&lt;br /&gt;
*numseeds&lt;br /&gt;
*numearly&lt;br /&gt;
*numlater&lt;br /&gt;
*numsel&lt;br /&gt;
&lt;br /&gt;
Missing values of noctrials for the years 1999-2017 are set to 0.&lt;br /&gt;
&lt;br /&gt;
==Population Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: '''population.sql'''&lt;br /&gt;
The source files are: &lt;br /&gt;
*pop2000_2009.xlsx&lt;br /&gt;
*pop2010_2016.xlsx&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*State&lt;br /&gt;
*City name	&lt;br /&gt;
*Year	&lt;br /&gt;
*Population Estimates&lt;br /&gt;
&lt;br /&gt;
Data from 2000-2016&lt;br /&gt;
&lt;br /&gt;
===Joined population table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\2017\clean data&lt;br /&gt;
The file names are &lt;br /&gt;
 1_population.txt - contains data on population estimates from 2000-2009&lt;br /&gt;
 2_population.txt - contains data on population estimates from 2010-2016&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: '''new_population.sql''', located in &lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
The population table is joined to the VC table. The joined table is called '''new_merged_population'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*city_state_year to uniquely identify each city in each year&lt;br /&gt;
*Population estimates&lt;br /&gt;
*Year&lt;br /&gt;
*Code built from the state code and FIPS code&lt;br /&gt;
*State full name&lt;br /&gt;
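&lt;br /&gt;
A sketch of how merge keys like city_state_id and city_state_year could be built consistently before joining. The exact key format used in the '''cities''' database is not documented on this page, so the upper-case, underscore-separated format below is an assumption, as are the function names.&lt;br /&gt;

```python
def city_state_id(city, state):
    """Build a city-level key. The CITY_ST format is an assumed convention."""
    # Collapse repeated whitespace so "New  York" and "New York" give the same key.
    city_part = "_".join(city.strip().upper().split())
    return city_part + "_" + state.strip().upper()


def city_state_year(city, state, year):
    """Build a city-year key by appending the year to the city-level key."""
    return city_state_id(city, state) + "_" + str(int(year))
```

For example, city_state_year("Houston", "TX", 2015) gives "HOUSTON_TX_2015". Normalizing case and whitespace in one place keeps every source table joining on identical keys.&lt;br /&gt;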
&lt;br /&gt;
==Income Data==&lt;br /&gt;
&lt;br /&gt;
Raw data was obtained from the Census Bureau's American Community Survey.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\MSA Income_raw.zip &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The master list with MSAs and principal cities is titled '''list2.xls'''. It is located at:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
This master list includes:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA name&lt;br /&gt;
*Principal City&lt;br /&gt;
*State&lt;br /&gt;
*Place code (city code)&lt;br /&gt;
*State Code&lt;br /&gt;
&lt;br /&gt;
This master list was edited to associate each principal city with a unique state. For example, New York is a principal city of the New York-New Jersey MSA and was originally associated with the state NY-NJ, so '''list''' was edited to associate New York with NY. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Cleaned Income data files are in&lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID &lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Total Household Income &lt;br /&gt;
&lt;br /&gt;
The MSA-City-State look up file is titled '''msa_city_state_wcode.txt'''. It is located in &lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID &lt;br /&gt;
&lt;br /&gt;
The SQL file that merges income data from ACS (by MSA - Year) with the MSA-City file is titled '''income.sql'''. It is located here:&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
   &lt;br /&gt;
 &lt;br /&gt;
The final income table is in db '''cities''' titled '''merged_income'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Year&lt;br /&gt;
*Total Household Income &lt;br /&gt;
&lt;br /&gt;
The table includes 8780 observations&lt;br /&gt;
&lt;br /&gt;
===Joined income table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\clean data&lt;br /&gt;
The file names are:&lt;br /&gt;
 INC_05.txt - INC_15.txt&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: merged_income.sql&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*Income&lt;br /&gt;
*Year&lt;br /&gt;
*Code built from the state code and FIPS code&lt;br /&gt;
&lt;br /&gt;
==Employment Data==&lt;br /&gt;
&lt;br /&gt;
Data on employment was obtained from the American Community Survey, US Census Bureau.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\Employment Data by MSA&lt;br /&gt;
Cleaned files are in&lt;br /&gt;
 Z:\Hubs\2017\clean data  &lt;br /&gt;
 &lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Employment rate of individuals 16 years or older&lt;br /&gt;
*Unemployment rate of individuals 16 years or older&lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The SQL file that merges employment data from ACS (by MSA - Year) with the MSA-City file is titled '''Employment.sql'''. &lt;br /&gt;
The file is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The final table is in db '''cities''' titled '''merged_employment'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*Year&lt;br /&gt;
*Employment rate&lt;br /&gt;
*Unemployment rate&lt;br /&gt;
&lt;br /&gt;
===Joined employment table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\clean data&lt;br /&gt;
 &lt;br /&gt;
The file names are:&lt;br /&gt;
 EMP_05.txt - EMP_15.txt &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: '''new_employment.sql''' and it is located in &lt;br /&gt;
Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
The final table, which is joined to the VC table, is in db '''cities''' and is titled '''new_merged_employment'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Code built from the state code and FIPS code&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*city_state_year to uniquely identify each city in each year&lt;br /&gt;
*Employment rates of individuals of 16 years or older&lt;br /&gt;
*Unemployment rates of individuals of 16 years or older&lt;br /&gt;
*Year&lt;br /&gt;
&lt;br /&gt;
==Schooling Data==&lt;br /&gt;
&lt;br /&gt;
Data on schooling was obtained from the American Community Survey, US Census Bureau.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\School Enrollment Data by MSA&lt;br /&gt;
Cleaned files are in&lt;br /&gt;
 Z:\Hubs\2017\clean data&lt;br /&gt;
 &lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Total population 3 years and over enrolled in school&lt;br /&gt;
*Percent of population 3 years and over enrolled in public school&lt;br /&gt;
*Percent of population 3 years and over enrolled in private school &lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The SQL file that merges schooling data from ACS (by MSA - Year) with the MSA-City file is titled '''schooling.sql'''. &lt;br /&gt;
The file is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The final table is in db '''cities''' titled '''merged_schooling'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*Year&lt;br /&gt;
*Total&lt;br /&gt;
*Percent_public_schooling&lt;br /&gt;
*Percent_private_schooling&lt;br /&gt;
&lt;br /&gt;
===Joined schooling table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\clean data&lt;br /&gt;
The file names are:&lt;br /&gt;
 SCH_05.txt - SCH_15.txt&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script which joins this table with VC table is: '''new_merged_schooling.sql'''&lt;br /&gt;
The final table is in db '''cities''' titled '''new_merged_schooling'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*city_state_year to uniquely identify each city in each year&lt;br /&gt;
*Total school enrollment&lt;br /&gt;
*Percentage enrolled in public schools&lt;br /&gt;
*Percentage enrolled in private schools&lt;br /&gt;
*Year&lt;br /&gt;
*Code built from the state code and FIPS code&lt;br /&gt;
&lt;br /&gt;
==VC Data==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
  Z:\VentureCapitalData\SDCVCData&lt;br /&gt;
  The file name is roundcitystateyear.txt&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*year&lt;br /&gt;
*seedamtm - seed, amount in millions&lt;br /&gt;
*earlyamtm - early, amount in millions&lt;br /&gt;
*lateramtm - late, amount in millions&lt;br /&gt;
*selamtm - seed early late, amount in millions&lt;br /&gt;
*numseeds - number of seeds&lt;br /&gt;
*numearly &lt;br /&gt;
*numlater&lt;br /&gt;
*numsel&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Data from 1953-2017&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The table is in db '''cities''' titled '''vc'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*city_state_id&lt;br /&gt;
*city_state_year&lt;br /&gt;
*seedamtm&lt;br /&gt;
*earlyamtm&lt;br /&gt;
*lateramtm&lt;br /&gt;
*selamtm&lt;br /&gt;
*numseeds&lt;br /&gt;
*numearly&lt;br /&gt;
*numlater&lt;br /&gt;
*numsel&lt;br /&gt;
*year&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=New_hubs&amp;diff=19299</id>
		<title>New hubs</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=New_hubs&amp;diff=19299"/>
		<updated>2017-07-13T17:28:55Z</updated>

		<summary type="html">&lt;p&gt;HiraF: Created page with &amp;quot;We first join table 1 and table 2 which produce a temporary table with combined data from table1 and table2, which is then joined to table3. This formula can be extended for m...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;We first join table1 and table2, which produces a temporary table with the combined data from both; that table is then joined to table3. This pattern extends to any number of tables: joining N tables requires N-1 join statements. For example, joining two tables requires one join statement, and joining three tables requires two.&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hubs&amp;diff=19298</id>
		<title>Hubs</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hubs&amp;diff=19298"/>
		<updated>2017-07-13T17:28:24Z</updated>

		<summary type="html">&lt;p&gt;HiraF: /* VC Data */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Hubs&lt;br /&gt;
|Has owner=Hira Farooqi,&lt;br /&gt;
|Has keywords=Data&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
The Hubs Research Project is a full-length academic paper analyzing the effectiveness of &amp;quot;hubs&amp;quot;, a component of the entrepreneurship ecosystem, in the advancement and growth of entrepreneurial success in a metropolitan area. &lt;br /&gt;
&lt;br /&gt;
This research will primarily focus on large and mid-sized Metropolitan Statistical Areas (MSAs), as that is where the great majority of Venture Capital funding is located. &lt;br /&gt;
&lt;br /&gt;
===Primary Data Set===&lt;br /&gt;
The Hubs data set, from SDC Platinum, has been constructed on the server:&lt;br /&gt;
 Data files are in 128.42.44.181/bulk/Hubs&lt;br /&gt;
 All files are in 128.42.44.182/bulk/Projects/Ecosystem/Hubs&lt;br /&gt;
 psql Hubs&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The data set includes all United States Venture Capital transactions (moneytree) from the twenty-five year period of 1990 through 2015.&lt;br /&gt;
Data has been aggregated at the portfolio company, fund, and round level. It will be analyzed at the combined MSA level. We will look at it in terms of the number of companies funded, the number of funds active, and the flow of investment in a given MSA.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The data set has now been uploaded to the database server, named Hubs.&lt;br /&gt;
There are 4 tables: &lt;br /&gt;
*Rounds: Rounddate, coname, state, roundno, stage1, etc.&lt;br /&gt;
*CombinedRounds: Coname, rounddate, discamount, fundname&lt;br /&gt;
*Companies: LastInv, FirstInv, coname, MSA, MSACode, Address, state, datefounded, totalknownfunding, industry(major)&lt;br /&gt;
*Funds: fundname, closingdate, lastinv, firstinv, msa, msacode, avinv, nocoinv, totalknowninv, address&lt;br /&gt;
&lt;br /&gt;
Used variables:&lt;br /&gt;
&lt;br /&gt;
 Companies: Coname, MSACode, Industry, state&lt;br /&gt;
 MSALookupTable: MSACode, MSASuper&lt;br /&gt;
 IndustryLookupTable: IndustryMajor, InduCode&lt;br /&gt;
 -&amp;gt; &lt;br /&gt;
 CompanyInfo: Coname, MSASuper, InduCode, state (complete)&lt;br /&gt;
&lt;br /&gt;
 Funds: fundname, msacode, state&lt;br /&gt;
 MSALookupTable: MSACode, MSASuper &lt;br /&gt;
 -&amp;gt; &lt;br /&gt;
 FundInfo: fundname, msacode, state (complete)&lt;br /&gt;
&lt;br /&gt;
 Rounds: coname, rounddate, stagecode, roundno&lt;br /&gt;
 CombinedRounds: coname, rounddate, discamount, fundname&lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 RoundInfoSuper: coname, rounddate, '''nofunds''', discamount   &lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 RoundInfo: Coname, roundyear, fundname, estamount (complete)&lt;br /&gt;
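&lt;br /&gt;
A sketch of the Rounds/CombinedRounds step above: count the funds in each round syndicate ('''nofunds''') and derive a per-fund estimated amount. Splitting the disclosed amount equally across syndicate members is an assumption about how estamount is defined, and the sample rows are hypothetical.&lt;br /&gt;

```python
from collections import defaultdict

# Hypothetical rows shaped like CombinedRounds: (coname, rounddate, discamount, fundname)
combined_rounds = [
    ("Acme", "2010-03-01", 9.0, "Fund A"),
    ("Acme", "2010-03-01", 9.0, "Fund B"),
    ("Acme", "2010-03-01", 9.0, "Fund C"),
    ("Beta", "2011-06-15", 4.0, "Fund A"),
]

# Collect the distinct funds in each (company, round date) syndicate.
syndicates = defaultdict(set)
amounts = {}
for coname, rounddate, discamount, fundname in combined_rounds:
    syndicates[(coname, rounddate)].add(fundname)
    amounts[(coname, rounddate)] = discamount

# One output row per fund per round: (coname, roundyear, fundname, estamount).
round_info = []
for (coname, rounddate), funds in sorted(syndicates.items()):
    nofunds = len(funds)
    estamount = amounts[(coname, rounddate)] / nofunds  # assumed equal split
    for fundname in sorted(funds):
        round_info.append((coname, int(rounddate[:4]), fundname, estamount))
```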
&lt;br /&gt;
Then take:&lt;br /&gt;
 RoundInfo: Coname, roundyear, fundname, estamount&lt;br /&gt;
 CompanyInfo: Coname, MSASuper, InduCode, state&lt;br /&gt;
 FundInfo: fundname, msacode, state&lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 SuperRoundInfo: Coname, CoMSASuper, CoInduCode, CoState, FundName, FundMSASuper, FundState, RoundYear, RoundEstAmount&lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 MSAPortCos: Count(CoName) As NoPortCosFunded, CoMSASuper, RoundYear&lt;br /&gt;
 ...&lt;br /&gt;
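&lt;br /&gt;
A sketch of the MSAPortCos aggregation, again using sqlite3 as a stand-in. Because SuperRoundInfo holds one row per fund per round, a plain Count(CoName) would count a company once per investing fund; the sketch uses COUNT(DISTINCT ...) instead, which is an assumption about the intended meaning of NoPortCosFunded. Sample rows are hypothetical.&lt;br /&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE superroundinfo "
    "(coname TEXT, comsasuper INTEGER, fundname TEXT, roundyear INTEGER)"
)
conn.executemany(
    "INSERT INTO superroundinfo VALUES (?, ?, ?, ?)",
    [
        ("Acme", 1, "Fund A", 2010),
        ("Acme", 1, "Fund B", 2010),  # same company, second syndicate member
        ("Beta", 1, "Fund A", 2010),
        ("Acme", 1, "Fund A", 2011),
    ],
)

# Distinct portfolio companies funded per MSA-year.
msa_port_cos = conn.execute(
    """
    SELECT comsasuper, roundyear, COUNT(DISTINCT coname) AS noportcosfunded
    FROM superroundinfo
    GROUP BY comsasuper, roundyear
    ORDER BY comsasuper, roundyear
    """
).fetchall()
```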
&lt;br /&gt;
'''Notes on Creation of Primary Data Set'''&lt;br /&gt;
&lt;br /&gt;
Raw tables&lt;br /&gt;
* companies (last investment, first investment, company name, MSA, MSA code, address, state, date founded, known funding, industry) &lt;br /&gt;
* funds (fund closing date, last investment, first investment, fund name, address, MSA, MSA code, Average investment, number companies invested (NoCos), known investment) &lt;br /&gt;
* rounds (round date, company name, state, round number, stage 1, stage 2, stage 3) &lt;br /&gt;
* combined rounds (company name, round date, disclosed amount, investor) &lt;br /&gt;
* msalist (changes MSAs to CMSAs— combined MSAs)&lt;br /&gt;
*industry list (changes 6 industry categories to 4— ICT, Life Sciences, Semiconductors, Other) &lt;br /&gt;
&lt;br /&gt;
Process&lt;br /&gt;
*cleaned tables to eliminate duplications, undisclosed variables&lt;br /&gt;
*changed all original characters to include CMSA and Industry Codes (companyinfo3, fundinfocleanfinal, roundinfoclean) &lt;br /&gt;
*matched funds to avoid any issues with names (i.e. Fund ABC L.P./Fund ABC LP/Fund ABC) &lt;br /&gt;
*matched roundinfoclean investors to fundinfocleanfinal investors (roundinfo.txt &amp;gt;&amp;gt; cleanfundfinal.txt)&lt;br /&gt;
*join by round and company conames&lt;br /&gt;
*bridge years (1990-2016), stage, and cmsa&lt;br /&gt;
* populate data with count of companies (Deal flow) and estimated amount ($)&lt;br /&gt;
** data set in 181 hubs folder under summarycmsa.txt (38394)&lt;br /&gt;
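&lt;br /&gt;
The fund-name matching step could start from a simple normalization like the one below, so that variants such as Fund ABC L.P., Fund ABC LP and Fund ABC compare equal. This is only a sketch: the suffix list and the function name are assumptions, and the matching actually used for this data set may have been done differently (for example, with a separate matcher tool or by hand).&lt;br /&gt;

```python
import re

# Trailing legal-form suffixes to drop; this list is an assumption, not exhaustive.
_SUFFIX = re.compile(r"[\s,]+(l\.?\s?p\.?|l\.?l\.?c\.?|inc\.?)$", re.IGNORECASE)


def normalize_fund_name(name):
    """Return a comparison key for a fund name."""
    cleaned = " ".join(name.split())   # collapse runs of whitespace
    cleaned = _SUFFIX.sub("", cleaned) # strip one trailing suffix, if present
    return cleaned.casefold()          # case-insensitive comparison
```

Two names are then treated as the same fund when their normalized keys are equal; anything left unmatched would still need manual review.&lt;br /&gt;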
&lt;br /&gt;
Key decisions:&lt;br /&gt;
*Threw out undisclosed companies throughout, as they have no address&lt;br /&gt;
*Count is done by joining round and company&lt;br /&gt;
*Anything fund related must be disclosed fund&lt;br /&gt;
*Near and far, and total invested, and fund counts, etc., are all computed using only disclosed funds that match&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Glossary of Tables'''&lt;br /&gt;
 cleanco — used to remove duplicates from companies&lt;br /&gt;
 cleanedcompanies — clean set of companies with no duplicates&lt;br /&gt;
 cmsafunds- &lt;br /&gt;
 cmsas— list of all CMSAs in final data set (for merging) &lt;br /&gt;
 cmsastats- statistics not including empty years (pre-merge) &lt;br /&gt;
 cmsastats2 - statistics separated by year-MSA&lt;br /&gt;
 cmsastats3— statistics separated by year-MSA-stage&lt;br /&gt;
 cmsastats4&lt;br /&gt;
 cmsayears— empty merged table between year and cmsa&lt;br /&gt;
 cmsayearstage — empty merged table between cmsa/years and stage&lt;br /&gt;
 combinedrounds— raw sdc data for combined rounds&lt;br /&gt;
 combinedroundswamt— used to join rounds and combined rounds for roundinfo2&lt;br /&gt;
 companies- raw SDC company data&lt;br /&gt;
 companyinfo — cleaned companies joined with state and CMSA information&lt;br /&gt;
 companyinfo2— companyinfo1 with original industry categories&lt;br /&gt;
 companyinfo3— companyinfo2 with updated industry categories and codes&lt;br /&gt;
 companyinfo4-- clean version of companyinfo3&lt;br /&gt;
 companyround- combined company information with round information&lt;br /&gt;
 companyround2- combined company information with round information, cleaned up from companyround&lt;br /&gt;
 companyround3- combined company information with round information, cleaned up from companyround2&lt;br /&gt;
 '''finaldataset'''- final statistics by CMSA-year, see section Final Primary Data Set for more information&lt;br /&gt;
 fundinfo— funds joined with CMSA info&lt;br /&gt;
 fundinfo2 - clean version of fundinfo1&lt;br /&gt;
 fundinfoclean - used in process to clean fundinfo2&lt;br /&gt;
 fundinfoclean2- used in process to clean fundinfo2&lt;br /&gt;
 fundinfocleanfinal- used in process to clean fundinfo2&lt;br /&gt;
 fundinfocleannodups- final clean set of fundinfo&lt;br /&gt;
 funds - raw SDC fund data&lt;br /&gt;
 Houston - analysis for Houston ecosystem team&lt;br /&gt;
 Houston2- analysis for Houston ecosystem team&lt;br /&gt;
 houston3- analysis for Houston ecosystem team&lt;br /&gt;
 industry — new industry codes (4)— used for all future data sets&lt;br /&gt;
 industrylist— lookup table for new industry codes (went from 6 to 4) &lt;br /&gt;
 joined1- used for matching process&lt;br /&gt;
 joined2- used for matching process&lt;br /&gt;
 matchfund2- used for matching process&lt;br /&gt;
 matchfunds- used for matching process&lt;br /&gt;
 matchroundfund - used for matching process&lt;br /&gt;
 matchroundfund2- used for matching process&lt;br /&gt;
 msalist — lookup table for MSA to CMSA (used for all future data sets) &lt;br /&gt;
 nearfar1-- beginning set before adding nearfar/stage variables &lt;br /&gt;
 nearfar2 -- added binary variables for near/far and for each of the stages, used to build final dataset&lt;br /&gt;
 roundfund— not used— joined round to fund; drop/ignore&lt;br /&gt;
 roundinfo— round info cleaned up to include number of investors in a syndicate and estimated investment per member of syndicate&lt;br /&gt;
 roundinfo2— roundinfo1 including name of investors/funds&lt;br /&gt;
 roundinfo3— clean version of roundinfo2&lt;br /&gt;
 roundinfoclean — final clean version of roundinfo3 (final roundinfo table)&lt;br /&gt;
 rounds — raw SDC round data&lt;br /&gt;
 stages — table for merging stage-year-CMSA&lt;br /&gt;
 superinfo — ignore/drop&lt;br /&gt;
 temp - used for matching process&lt;br /&gt;
 years — table for merging stage-year-CMSA&lt;br /&gt;
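&lt;br /&gt;
The empty merge tables in the glossary (cmsayears, cmsayearstage) amount to a cross join of CMSA, year and stage, onto which observed statistics are then merged so that empty cells become zeros. A minimal sketch, with hypothetical CMSA codes, stage labels and counts:&lt;br /&gt;

```python
from itertools import product

cmsas = [1, 2]                                # hypothetical CMSA codes
years = range(1990, 2017)                     # 1990-2016 inclusive
stages = ["Seed", "Early", "Later", "Other"]  # hypothetical stage labels

# Hypothetical observed deal counts keyed by (cmsa, year, stage).
observed = {(1, 1990, "Seed"): 3}

# Every CMSA-year-stage combination appears once; missing cells default to 0.
panel = [
    (cmsa, year, stage, observed.get((cmsa, year, stage), 0))
    for cmsa, year, stage in product(cmsas, years, stages)
]
```

Here the panel has 2 x 27 x 4 = 216 rows regardless of how many cells were actually observed.&lt;br /&gt;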
&lt;br /&gt;
===Hub Candidates Data Set===&lt;br /&gt;
&lt;br /&gt;
The Hubs candidate data set is a list of potential hubs found in MSAs throughout the country. Researchers are currently pulling qualitative and quantitative information from the candidates' websites, in an attempt to categorize what can be identified as a hub. This is a difficult data set to pull: there is little to no quantitative information available for this category of institution, and collection depends on what information is publicly accessible on the internet.&lt;br /&gt;
&lt;br /&gt;
Characteristics/Variables&lt;br /&gt;
*Year Founded&lt;br /&gt;
*Square footage&lt;br /&gt;
*LinkedIn self-identifiers (how the organization classifies itself on its LinkedIn profile) &lt;br /&gt;
*Activeness on Twitter (binary)&lt;br /&gt;
*Member Directory available online (binary)&lt;br /&gt;
*Number of conference rooms&lt;br /&gt;
*Price ($/month) for Flex desk &lt;br /&gt;
*Offers Reserved desk (binary)&lt;br /&gt;
*Offers office space for rent (binary) &lt;br /&gt;
*Offers community membership-- not for coworking but for community events, etc. (binary)&lt;br /&gt;
*Number of events offered per month (estimate)&lt;br /&gt;
*Offers code academy&lt;br /&gt;
*Mission Statement/Vision (for qualitative or key-word analysis) &lt;br /&gt;
&lt;br /&gt;
These characteristics/variables will be used to determine whether a candidate is or is not likely to be a Hub. &lt;br /&gt;
&lt;br /&gt;
As of March 10th 2016, the list contains 125 Hub candidates.&lt;br /&gt;
&lt;br /&gt;
'''Where to find''': The Hubs data set can be found in the Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;dataset folder. It is not currently in the database due to a UTF8 issue&lt;br /&gt;
&lt;br /&gt;
===Supplementary Data Sets===&lt;br /&gt;
'''Patent data''': to be pulled from USPTO or SDC Platinum. &lt;br /&gt;
&lt;br /&gt;
'''Number of STEM Graduate Students''' (NSF) and '''University R&amp;amp;D Spending''' (NSF):&lt;br /&gt;
*University R&amp;amp;D Data found under file &amp;quot;NSF DATA_2004 to 2011.xlsx&amp;quot; in datasets folder (Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;Datasets)&lt;br /&gt;
*R&amp;amp;D spending found at the university level for 2014 (&amp;quot;Stem Grad Students.xlsx&amp;quot;) or at state level (&amp;quot;Science and Engineering Grad Students by State and Year 2000-2011.csv&amp;quot;)&lt;br /&gt;
** not uploaded to server or matched yet to CMSA code, because of this discrepancy. &lt;br /&gt;
**&amp;quot;Stem Grad Students.xlsx&amp;quot; contains categorized university by MSA, can be used for all university-based projects&lt;br /&gt;
&lt;br /&gt;
'''Per Capita Income''' and '''Employment Data''' (US Census Bureau): &lt;br /&gt;
*&amp;quot;Per Capita Personal Income by MSA 2000-2012.xlsx&amp;quot; in datasets folder (Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;Datasets&amp;gt;&amp;gt;Data from Yael)&lt;br /&gt;
*&amp;quot;Wages and Salaries by MSA 2000-2012.xlsx&amp;quot; in datasets folder (Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;datasets&amp;gt;&amp;gt;Data from Yael)&lt;br /&gt;
**not uploaded to server or matched yet to CMSA code&lt;br /&gt;
&lt;br /&gt;
'''Firm Births''' (BDS)&lt;br /&gt;
*in server 181, under table name &amp;quot;BDS&amp;quot;&lt;br /&gt;
*includes birth, death, net(birth-death) and rate(death rate) for years 1990-2013 for every msa&lt;br /&gt;
*includes code for CMSA but is not aggregated by CMSA&lt;br /&gt;
** i.e. BDS statistics are still separate for all the smaller MSAs in New York's CMSA (code=1)&lt;br /&gt;
&lt;br /&gt;
===Resources===&lt;br /&gt;
* Yael Hochberg and Fehder (2015), located in dropbox&lt;br /&gt;
** Use this paper as a guideline on how to conduct the analysis&lt;br /&gt;
*US Census Bureau data on employment by MSA: http://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_14_5YR_B23027&amp;amp;prodType=table&lt;br /&gt;
*USPTO utility patents by MSA: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/cls_cbsa/allcbsa_gd.htm&lt;br /&gt;
*MSA level trends: http://www.metrotrends.org/data.cf&lt;br /&gt;
&lt;br /&gt;
===The Target Dataset===&lt;br /&gt;
&lt;br /&gt;
We will need to process the following variables:&lt;br /&gt;
*SuperMSA - combine SanFran and SanJose, New York and Newark?, NC Research triangle, others?&lt;br /&gt;
*CSV mapping msas to cmsas is in the folder (and a table in the dbase)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Example dataset:&lt;br /&gt;
 MSA      Year       SeedVCInv      SeedEarlyVCInv      LaterVCInv     NoDeals   FundsInvested   DistinctInvestors   ....&lt;br /&gt;
 ----------------------------------------------------------------------------------------------------------------------------&lt;br /&gt;
 1234     2001       1000000        20000000            30000000       4          7              7&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Note that the unit of observation is MSA-Year.&lt;br /&gt;
&lt;br /&gt;
Variables to be computed at the MSA level:&lt;br /&gt;
*HubActive (binary)&lt;br /&gt;
*NoHubsActive (Count)&lt;br /&gt;
*HubSqFt&lt;br /&gt;
*Other Hub Vars (build list!!!)&lt;br /&gt;
*'''SeedVCInv'''  (Seed/Start-up)&lt;br /&gt;
*'''EarlyVCInv''' (Early Stage)&lt;br /&gt;
*'''LaterStageVC''' (Later)&lt;br /&gt;
*'''OtherStageVC''' (Buyout/Acq, Other)&lt;br /&gt;
*'''NoDeals''' (done by local VCs?)&lt;br /&gt;
**'''NoDealsNear'''&lt;br /&gt;
**'''NoDealsFar'''&lt;br /&gt;
*NoPortCosFunded&lt;br /&gt;
*'''FundsInv''' (in an MSA)&lt;br /&gt;
**'''FundsInvFromNear''' (within MSA?)&lt;br /&gt;
**'''FundsInvFromFar''' (outside MSA?)&lt;br /&gt;
*DistinctInvestors (?)&lt;br /&gt;
**DistinctInvestorsNear (within MSA?)&lt;br /&gt;
**DistinctInvestorsFar (outside MSA?)&lt;br /&gt;
*PatentCount&lt;br /&gt;
*NoSTEMGrads&lt;br /&gt;
*FirmBirths (BDS data)&lt;br /&gt;
*UniRandDSpend&lt;br /&gt;
*PerCapitaIncome&lt;br /&gt;
*Employment&lt;br /&gt;
&lt;br /&gt;
We need to:&lt;br /&gt;
*Check funds invested means dollars invested&lt;br /&gt;
*Categorize near and far! Is it within MSA vs. not, within adjacent MSAs, etc.?&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There may be a second dataset that has Hub-Industry-Year (where industry is semiconductor/non-semiconductor?).&lt;br /&gt;
&lt;br /&gt;
===Final Primary Data Set===&lt;br /&gt;
&lt;br /&gt;
*A deal is a round syndicate (a near/far deal is one with at least one investor that is near/far).&lt;br /&gt;
&lt;br /&gt;
Table name: finaldataset&lt;br /&gt;
 cmsa&lt;br /&gt;
 year&lt;br /&gt;
 totalamountinv--total amount invested &lt;br /&gt;
 nearamountinv--amount invested from local funds&lt;br /&gt;
 faramountinv-- amount invested from funds outside CMSA &lt;br /&gt;
 earlyinv--amount invested in early stage companies &lt;br /&gt;
 laterinv--amount invested in later stage companies &lt;br /&gt;
 startupseedinv--amount invested in seed or startup stage companies &lt;br /&gt;
 otherstageinv--amount invested in Acquisition/Buy-outs/Other stage companies &lt;br /&gt;
 investingfund--distinct funds that are investing in that CMSA-year &lt;br /&gt;
 investingfundnear--distinct funds from that CMSA that invested in that CMSA-year &lt;br /&gt;
 investingfundfar--distinct funds from outside that CMSA that invested in that CMSA-year &lt;br /&gt;
 deals--number of deals &lt;br /&gt;
 neardeals--number of deals inside a CMSA &lt;br /&gt;
 fardeals--number of deals from outside a CMSA --some of these deals might count in both categories, because of syndicate members being both inside and outside the CMSA&lt;br /&gt;
 earlystagedeals--deals with earlystage companies&lt;br /&gt;
 laterstagedeals--deals with later stage companies &lt;br /&gt;
 startupseeddeals--deals with startup/seed companies &lt;br /&gt;
 otherstagedeals--deals with companies in other stages &lt;br /&gt;
 newportcosfunded--number of portfolio companies to receive their first investment in that year&lt;br /&gt;
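&lt;br /&gt;
A sketch of the near/far rule for deals, under the reading that a deal is near if at least one syndicate member is in the company's CMSA and far if at least one is outside it, so a mixed syndicate counts in both. The function and argument names are hypothetical.&lt;br /&gt;

```python
def classify_deal(company_cmsa, investor_cmsas):
    """Return (is_near, is_far) for one round syndicate.

    company_cmsa: CMSA code of the portfolio company.
    investor_cmsas: CMSA codes of the disclosed, matched funds in the round.
    """
    is_near = any(code == company_cmsa for code in investor_cmsas)
    is_far = any(code != company_cmsa for code in investor_cmsas)
    return is_near, is_far
```

Under this reading neardeals plus fardeals can exceed deals, and a round with no disclosed, matched funds counts in neither category.&lt;br /&gt;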
&lt;br /&gt;
===Data by zip code===&lt;br /&gt;
*Population data, 2000-2016 - US Census Bureau (E:\McNair\Hubs\summer 2017)&lt;br /&gt;
https://www2.census.gov/programs-surveys/popest/datasets/&lt;br /&gt;
*Income data, 1998-2014 - The Internal Revenue Service (E:\McNair\Hubs\summer 2017)&lt;br /&gt;
https://www.irs.gov/uac/about-irs&lt;br /&gt;
*DCI index, to assess the economic well-being of communities&lt;br /&gt;
http://eig.org/dci/interactive-maps/u-s-zip-codes&lt;br /&gt;
*R&amp;amp;D Expenses, 1980-2016 - Wharton Research Data Services (E:\McNair\Hubs\summer 2017)&lt;br /&gt;
*Zipcode look-up table obtained from https://www.unitedstateszipcodes.org/zip-code-database/. It's available in (E:\McNair\Hubs\summer 2017).&lt;br /&gt;
&lt;br /&gt;
== Data by MSA ==&lt;br /&gt;
&lt;br /&gt;
We have principal cities of MSAs from the Census:&lt;br /&gt;
https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html&lt;br /&gt;
&lt;br /&gt;
We might be able to go City -&amp;gt; FIPS place code -&amp;gt; MSA?&lt;br /&gt;
&lt;br /&gt;
Cities and their FIPS codes (which don't perfectly correspond) are available from https://www.census.gov/geo/reference/codes/place.html&lt;br /&gt;
&lt;br /&gt;
The Census claims to provide city to MSA here: https://www.census.gov/geo/maps-data/data/ua_rel_download.html&lt;br /&gt;
However, there is only CBSA!&lt;br /&gt;
&lt;br /&gt;
This might do it: https://www2.census.gov/geo/pdfs/maps-data/data/rel/explanation_ua_cbsa_rel_10.pdf&lt;br /&gt;
We can maybe track city to principal city to MSA&lt;br /&gt;
&lt;br /&gt;
==COMPUSTAT Data==&lt;br /&gt;
&lt;br /&gt;
Data is in:&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: COMPUSTAT.sql&lt;br /&gt;
&lt;br /&gt;
The source file is RandDExpenditures.txt. It contains:&lt;br /&gt;
*Data from 1980-2017 (July). All COMPUSTAT.&lt;br /&gt;
*427799 records&lt;br /&gt;
*Fields include:&lt;br /&gt;
**R&amp;amp;D Expenditure&lt;br /&gt;
**Address (inc. city, zip, state)&lt;br /&gt;
&lt;br /&gt;
Output file is COMPUSTATSummary.txt. It contains:&lt;br /&gt;
*Variables: City, year, No.public firms, sum R&amp;amp;D, sum Sales, sum total assets&lt;br /&gt;
*1979-2016&lt;br /&gt;
*4440 cities&lt;br /&gt;
&lt;br /&gt;
==NSF Data==&lt;br /&gt;
Data is in:&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: nsf_2017.sql&lt;br /&gt;
&lt;br /&gt;
The source files are: nsf2017.txt, copied from table '''nsf''', and nsf_institution copied from table '''nsf_grants_institution''' from the biotech db.&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*Award ID&lt;br /&gt;
*Award Institution&lt;br /&gt;
*Award Effective date&lt;br /&gt;
*Institution city&lt;br /&gt;
*Award Value&lt;br /&gt;
*Organization state code&lt;br /&gt;
From 1900 - 2017&lt;br /&gt;
&lt;br /&gt;
Output file is nsfSummary.txt. It contains:&lt;br /&gt;
*Variables: City, State code, year, nsf_nogrants, nsf_valuegrant &lt;br /&gt;
*1900-2017&lt;br /&gt;
&lt;br /&gt;
===Joined NSF table===&lt;br /&gt;
The NSF table joined with the VC table is in db '''cities''' and is named '''merged_nsf'''.&lt;br /&gt;
Missing values of nogrants and valuegrant for the years 1990-2017 are set to 0.&lt;br /&gt;
The SQL script is in&lt;br /&gt;
 Z:\HUbs\2017\sql scripts&lt;br /&gt;
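&lt;br /&gt;
The join-and-zero-fill step described above can be sketched as follows. This is a minimal illustration, not the project's script: it uses an in-memory SQLite database, and the miniature tables, the city_state_year join key, and its 'HOUSTON_TX_1990' format are all assumptions made for the example.&lt;br /&gt;
&lt;pre&gt;
```python
import sqlite3

# Hypothetical miniature tables; the real vc and nsf tables have more columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vc (city_state_year TEXT PRIMARY KEY, year INTEGER);
    CREATE TABLE nsf (city_state_year TEXT PRIMARY KEY, nogrants INTEGER, valuegrant REAL);
    INSERT INTO vc VALUES ('HOUSTON_TX_1990', 1990), ('HOUSTON_TX_1991', 1991);
    INSERT INTO nsf VALUES ('HOUSTON_TX_1990', 3, 450000.0);
""")

# LEFT JOIN keeps every VC city-year; COALESCE turns unmatched grants into 0.
merged_nsf = conn.execute("""
    SELECT vc.city_state_year,
           COALESCE(nsf.nogrants, 0),
           COALESCE(nsf.valuegrant, 0)
    FROM vc LEFT JOIN nsf USING (city_state_year)
    WHERE vc.year BETWEEN 1990 AND 2017
    ORDER BY vc.city_state_year
""").fetchall()

print(merged_nsf)
```
&lt;/pre&gt;
The LEFT JOIN is what makes the VC table the base of the merge: city-years with no grants survive the join and receive zeros instead of being dropped.&lt;br /&gt;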
&lt;br /&gt;
==NIH Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: nih2017.sql&lt;br /&gt;
The source files are: &lt;br /&gt;
*nih_1986_2001.csv&lt;br /&gt;
*nih_2002_2012.txt&lt;br /&gt;
*nih_2013_2015&lt;br /&gt;
located in E:\McNair\Projects\Federal Grant Data\NIH&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The script that cleans NIH data and generates the summary table is titled '''nihSummary'''. It is located here:&lt;br /&gt;
&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
This table includes&lt;br /&gt;
*year&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*country&lt;br /&gt;
*nogrants (number of grants)&lt;br /&gt;
*valuegrant&lt;br /&gt;
*city_state (the city-state ID that we'll merge on)&lt;br /&gt;
&lt;br /&gt;
*Data from 1986-2015&lt;br /&gt;
&lt;br /&gt;
===Joined NIH table===&lt;br /&gt;
The NIH table joined with the VC table is in db '''cities''' and is named '''merged_nih'''.&lt;br /&gt;
Missing values of nih_valuegrant and nih_nogrants for the years 1986-2015 are set to 0.&lt;br /&gt;
The SQL script is in&lt;br /&gt;
 Z:\HUbs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
==Clinical Trials Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: ctrials.sql&lt;br /&gt;
The source file is: &lt;br /&gt;
&lt;br /&gt;
*medclinical.txt&lt;br /&gt;
&lt;br /&gt;
located in Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
*Data from 1999-2017&lt;br /&gt;
&lt;br /&gt;
===Joined clinical trials table===&lt;br /&gt;
&lt;br /&gt;
The file which contains the number of trials in each city and year is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
The file is in:&lt;br /&gt;
 Z:\Hubs\2017\clean data&lt;br /&gt;
The name of the file is:&lt;br /&gt;
  ctrialsSummary.txt&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*city&lt;br /&gt;
*year&lt;br /&gt;
*city_state_year&lt;br /&gt;
*noctrials - number of trials&lt;br /&gt;
&lt;br /&gt;
The ctrials table is joined with the VC table. &lt;br /&gt;
The joined SQL script is: '''new_ctrials.sql''' and it is located in&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
The name of the joined table is '''new_merged_ctrials'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*city_state_id&lt;br /&gt;
*city_state_year&lt;br /&gt;
*year&lt;br /&gt;
*noctrials&lt;br /&gt;
*seedamtm&lt;br /&gt;
*earlyamtm&lt;br /&gt;
*lateramtm&lt;br /&gt;
*selamtm&lt;br /&gt;
*numseeds&lt;br /&gt;
*numearly&lt;br /&gt;
*numlater&lt;br /&gt;
*numsel&lt;br /&gt;
&lt;br /&gt;
Missing values of noctrials for the years 1999-2017 are set to 0.&lt;br /&gt;
&lt;br /&gt;
==Population Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: '''population.sql'''&lt;br /&gt;
The source files are: &lt;br /&gt;
*pop2000_2009.xlsx&lt;br /&gt;
*pop2010_2016.xlsx&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*State&lt;br /&gt;
*City name	&lt;br /&gt;
*Year	&lt;br /&gt;
*Population Estimates&lt;br /&gt;
&lt;br /&gt;
Data from 2000-2016&lt;br /&gt;
&lt;br /&gt;
===Joined population table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\2017\clean data&lt;br /&gt;
The file names are &lt;br /&gt;
 1_population.txt - contains data on population estimates from 2000-2009&lt;br /&gt;
 2_population.txt - contains data on population estimates from 2010-2016&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: '''new_population.sql''', located in &lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
The population table is joined to the VC table. The joined table is called '''new_merged_population'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*city_state_year to uniquely identify each city in each year&lt;br /&gt;
*Population estimates&lt;br /&gt;
*Year&lt;br /&gt;
*Code built from the state code and FIPS code&lt;br /&gt;
*State full name&lt;br /&gt;
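&lt;br /&gt;
The two merge keys listed above can be sketched as follows. The page does not record the exact key format, so the uppercase 'CITY_STATE' and 'CITY_STATE_YEAR' layouts and both helper names here are hypothetical.&lt;br /&gt;
&lt;pre&gt;
```python
def make_city_state_id(city, state):
    """Build a hypothetical city-level key such as 'SAN JOSE_CA'."""
    # Collapse repeated whitespace and ignore case so spelling variants match.
    clean_city = " ".join(city.split()).upper()
    return clean_city + "_" + state.strip().upper()


def make_city_state_year(city, state, year):
    """Extend the city key with the year to identify one city-year row."""
    return make_city_state_id(city, state) + "_" + str(year)


print(make_city_state_id(" San  Jose ", "ca"))
print(make_city_state_year("Houston", "TX", 2010))
```
&lt;/pre&gt;
Normalizing case and whitespace before building the key matters because the same key has to be produced independently from every source table that is merged.&lt;br /&gt;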
&lt;br /&gt;
==Income Data==&lt;br /&gt;
&lt;br /&gt;
Raw data was obtained from the Census Bureau's American Community Survey.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\MSA Income_raw.zip &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The master list with MSAs and principal cities is titled '''list2.xls'''. It is located at:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
This master list includes:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA name&lt;br /&gt;
*Principal City&lt;br /&gt;
*State&lt;br /&gt;
*Place code (city code)&lt;br /&gt;
*State Code&lt;br /&gt;
&lt;br /&gt;
This master list was edited to associate each principal city with a single state. E.g. New York, the principal city of the New York-New Jersey MSA, was originally associated with state NY-NJ, so '''list''' was edited to put New York with NY. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Cleaned Income data files are in&lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID &lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Total Household Income &lt;br /&gt;
&lt;br /&gt;
The MSA-City-State look up file is titled '''msa_city_state_wcode.txt'''. It is located in &lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID &lt;br /&gt;
&lt;br /&gt;
The SQL file that merges income data from ACS (by MSA - Year) with the MSA-City file is titled '''income.sql'''. It is located here:&lt;br /&gt;
 Z:\Hubs\2017\sql scripts&lt;br /&gt;
   &lt;br /&gt;
 &lt;br /&gt;
The final income table is in db '''cities''' titled '''merged_income'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Year&lt;br /&gt;
*Total Household Income &lt;br /&gt;
&lt;br /&gt;
The table includes 8780 observations.&lt;br /&gt;
&lt;br /&gt;
===Joined income table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\clean data&lt;br /&gt;
The file names are:&lt;br /&gt;
 INC_05.txt - INC_15.txt&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: merged_income.sql&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*Income&lt;br /&gt;
*Year&lt;br /&gt;
*Code built from the state code and FIPS code&lt;br /&gt;
&lt;br /&gt;
==Employment Data==&lt;br /&gt;
&lt;br /&gt;
Data on employment was obtained from the American Community Survey, US Census Bureau.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\Employment Data by MSA&lt;br /&gt;
Cleaned files are in&lt;br /&gt;
 Z:\Hubs\2017\clean data  &lt;br /&gt;
 &lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Employment rate of individuals 16 years or older&lt;br /&gt;
*Unemployment rate of individuals 16 years or older&lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The SQL file that merges employment data from ACS (by MSA - Year) with the MSA-City file is titled '''Employment.sql'''. &lt;br /&gt;
The file is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The final table is in db '''cities''' titled '''merged_employment'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*Year&lt;br /&gt;
*Employment rate&lt;br /&gt;
*Unemployment rate&lt;br /&gt;
&lt;br /&gt;
===Joined employment table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\clean data&lt;br /&gt;
 &lt;br /&gt;
The file names are:&lt;br /&gt;
 EMP_05.txt - EMP_15.txt &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: '''new_employment.sql''' and it is located in &lt;br /&gt;
Z:\Hubs\2017\sql scripts&lt;br /&gt;
&lt;br /&gt;
The final table, which is joined to the VC table, is in db '''cities''' and is titled '''new_merged_employment'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Code built from the state code and FIPS code&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*city_state_year to uniquely identify each city in each year&lt;br /&gt;
*Employment rates of individuals of 16 years or older&lt;br /&gt;
*Unemployment rates of individuals of 16 years or older&lt;br /&gt;
*Year&lt;br /&gt;
&lt;br /&gt;
==Schooling Data==&lt;br /&gt;
&lt;br /&gt;
Data on schooling was obtained from the American Community Survey, US Census Bureau.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\School Enrollment Data by MSA&lt;br /&gt;
Cleaned files are in&lt;br /&gt;
 Z:\Hubs\2017\clean data&lt;br /&gt;
 &lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Total population 3 years and over enrolled in school&lt;br /&gt;
*Percent of population 3 years and over enrolled in public school&lt;br /&gt;
*Percent of population 3 years and over enrolled in private school &lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The SQL file that merges schooling data from ACS (by MSA - Year) with the MSA-City file is titled '''schooling.sql'''. &lt;br /&gt;
The file is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The final table is in db '''cities''' titled '''merged_schooling'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*Year&lt;br /&gt;
*Total&lt;br /&gt;
*Percent_public_schooling&lt;br /&gt;
*Percent_private_schooling&lt;br /&gt;
&lt;br /&gt;
===Joined schooling table===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs\clean data&lt;br /&gt;
The file names are:&lt;br /&gt;
 SCH_05.txt - SCH_15.txt&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script which joins this table with VC table is: '''new_merged_schooling.sql'''&lt;br /&gt;
The final table is in db '''cities''' titled '''new_merged_schooling'''.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*city_state_year to uniquely identify each city in each year&lt;br /&gt;
*Total school enrollment&lt;br /&gt;
*Percentage enrolled in public schools&lt;br /&gt;
*Percentage enrolled in private schools&lt;br /&gt;
*Year&lt;br /&gt;
*Code built from the state code and FIPS code&lt;br /&gt;
&lt;br /&gt;
==VC Data==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
  Z:\VentureCapitalData\SDCVCData&lt;br /&gt;
  The file name is roundcitystateyear.txt&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*year&lt;br /&gt;
*seedamtm - seed, amount in millions&lt;br /&gt;
*earlyamtm - early, amount in millions&lt;br /&gt;
*lateramtm - late, amount in millions&lt;br /&gt;
*selamtm - seed early late, amount in millions&lt;br /&gt;
*numseeds - number of seeds&lt;br /&gt;
*numearly &lt;br /&gt;
*numlater&lt;br /&gt;
*numsel&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Data from 1953-2017&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The table is in db '''cities''' titled '''vc'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*city_state_id&lt;br /&gt;
*city_state_year&lt;br /&gt;
*seedamtm&lt;br /&gt;
*earlyamtm&lt;br /&gt;
*lateramtm&lt;br /&gt;
*selamtm&lt;br /&gt;
*numseeds&lt;br /&gt;
*numearly&lt;br /&gt;
*numlater&lt;br /&gt;
*numsel&lt;br /&gt;
*year&lt;br /&gt;
&lt;br /&gt;
check pg [[new hubs]]&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hubs&amp;diff=19201</id>
		<title>Hubs</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hubs&amp;diff=19201"/>
		<updated>2017-07-12T16:07:48Z</updated>

		<summary type="html">&lt;p&gt;HiraF: /* NSF Data */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Hubs&lt;br /&gt;
|Has owner=Hira Farooqi,&lt;br /&gt;
|Has keywords=Data&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
The Hubs Research Project is a full-length academic paper analyzing the effectiveness of &amp;quot;hubs&amp;quot;, a component of the entrepreneurship ecosystem, in the advancement and growth of entrepreneurial success in a metropolitan area. &lt;br /&gt;
&lt;br /&gt;
This research will focus primarily on large and mid-sized Metropolitan Statistical Areas (MSAs), as that is where the great majority of Venture Capital funding is located. &lt;br /&gt;
&lt;br /&gt;
===Primary Data Set===&lt;br /&gt;
The Hubs data set, from SDC Platinum, has been constructed on the server:&lt;br /&gt;
 Data files are in 128.42.44.181/bulk/Hubs&lt;br /&gt;
 All files are in 128.42.44.182/bulk/Projects/Ecosystem/Hubs&lt;br /&gt;
 psql Hubs&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The data set includes all United States Venture Capital transactions (moneytree) from 1990 through 2015.&lt;br /&gt;
Data has been aggregated at the portfolio company, fund, and round level. It will be analyzed at the combined MSA level. We will look at it in terms of the number of companies funded, the number of funds active, and the flow of investment in a given MSA.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The data set has now been uploaded to the database server, named Hubs.&lt;br /&gt;
There are 4 tables: &lt;br /&gt;
*Rounds: Rounddate, coname, state, roundno, stage1, etc.&lt;br /&gt;
*CombinedRounds: Coname, rounddate, discamount, fundname&lt;br /&gt;
*Companies: LastInv, FirstInv, coname, MSA, MSACode, Address, state, datefounded, totalknownfunding, industry(major)&lt;br /&gt;
*Funds: fundname, closingdate, lastinv, firstinv, msa, msacode, avinv, nocoinv, totalknowninv, address&lt;br /&gt;
&lt;br /&gt;
Used variables:&lt;br /&gt;
&lt;br /&gt;
 Companies: Coname, MSACode, Industry, state&lt;br /&gt;
 MSALookupTable: MSACode, MSASuper&lt;br /&gt;
 IndustryLookupTable: IndustryMajor, InduCode&lt;br /&gt;
 -&amp;gt; &lt;br /&gt;
 CompanyInfo: Coname, MSASuper, InduCode, state (complete)&lt;br /&gt;
&lt;br /&gt;
 Funds: fundname, msacode, state&lt;br /&gt;
 MSALookupTable: MSACode, MSASuper &lt;br /&gt;
 -&amp;gt; &lt;br /&gt;
 FundInfo: fundname, msacode, state (complete)&lt;br /&gt;
&lt;br /&gt;
 Rounds: coname, rounddate, stagecode, roundno&lt;br /&gt;
 CombinedRounds: coname, rounddate, discamount, fundname&lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 RoundInfoSuper: coname, rounddate, '''nofunds''', discamount   &lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 RoundInfo: Coname, roundyear, fundname, estamount (complete)&lt;br /&gt;
&lt;br /&gt;
Then take:&lt;br /&gt;
 RoundInfo: Coname, roundyear, fundname, estamount&lt;br /&gt;
 CompanyInfo: Coname, MSASuper, InduCode, state&lt;br /&gt;
 FundInfo: fundname, msacode, state&lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 SuperRoundInfo: Coname, CoMSASuper, CoInduCode, CoState, FundName, FundMSASuper, FundState, RoundYear, RoundEstAmount&lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 MSAPortCos: Count(CoName) As NoPortCosFunded, CoMSASuper, RoundYear&lt;br /&gt;
 ...&lt;br /&gt;
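&lt;br /&gt;
The step from CombinedRounds to RoundInfo can be sketched as follows. This is a minimal illustration that assumes estamount is an equal split of discamount across the nofunds members of the syndicate; the sample rows are hypothetical, and the actual SQL may allocate amounts differently.&lt;br /&gt;
&lt;pre&gt;
```python
from collections import defaultdict

# Hypothetical CombinedRounds rows: (coname, rounddate, discamount, fundname).
combined_rounds = [
    ("Acme", "2001-03-01", 9.0, "Fund A"),
    ("Acme", "2001-03-01", 9.0, "Fund B"),
    ("Acme", "2001-03-01", 9.0, "Fund C"),
    ("Bolt", "2002-06-15", 4.0, "Fund A"),
]

# Step 1 (RoundInfoSuper): count the funds in each round's syndicate.
nofunds = defaultdict(int)
for coname, rounddate, discamount, fundname in combined_rounds:
    nofunds[(coname, rounddate)] += 1

# Step 2 (RoundInfo): assume an equal split of the disclosed amount.
round_info = [
    (coname, int(rounddate[:4]), fundname, discamount / nofunds[(coname, rounddate)])
    for coname, rounddate, discamount, fundname in combined_rounds
]

for row in round_info:
    print(row)
```
&lt;/pre&gt;
Each output row has the shape (Coname, roundyear, fundname, estamount), matching the RoundInfo columns listed above.&lt;br /&gt;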
&lt;br /&gt;
'''Notes on Creation of Primary Data Set'''&lt;br /&gt;
&lt;br /&gt;
Raw tables&lt;br /&gt;
* companies (last investment, first investment, company name, MSA, MSA code, address, state, date founded, known funding, industry) &lt;br /&gt;
* funds (fund closing date, last investment, first investment, fund name, address, MSA, MSA code, Average investment, number companies invested (NoCos), known investment) &lt;br /&gt;
* rounds (round date, company name, state, round number, stage 1, stage 2, stage 3) &lt;br /&gt;
* combined rounds (company name, round date, disclosed amount, investor) &lt;br /&gt;
* msalist (changes MSAs to CMSAs— combined MSAs)&lt;br /&gt;
*industry list (changes 6 industry categories to 4— ICT, Life Sciences, Semiconductors, Other) &lt;br /&gt;
&lt;br /&gt;
Process&lt;br /&gt;
*cleaned tables to eliminate duplications, undisclosed variables&lt;br /&gt;
*changed all original characters to include CMSA and Industry Codes (companyinfo3, fundinfocleanfinal, roundinfoclean) &lt;br /&gt;
*matched funds to avoid any issues with names (i.e. Fund ABC L.P./Fund ABC LP/Fund ABC) &lt;br /&gt;
*matched roundinfoclean investors to fundinfocleanfinal investors (roundinfo.txt &amp;gt;&amp;gt; cleanfundfinal.txt)&lt;br /&gt;
*join by round and company conames&lt;br /&gt;
*bridge years (1990-2016), stage, and cmsa&lt;br /&gt;
* populate data with count of companies (Deal flow) and estimated amount ($)&lt;br /&gt;
** data set in 181 hubs folder under summarycmsa.txt (38394)&lt;br /&gt;
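&lt;br /&gt;
The fund-name matching step in the process above can be sketched as follows. This is one possible normalization, not the matcher actually used on the project; the function name and the suffix list are assumptions.&lt;br /&gt;
&lt;pre&gt;
```python
import re

# Assumed list of legal-form suffixes to drop; extend as the data requires.
SUFFIXES = {"lp", "llc", "inc", "ltd"}


def normalize_fund_name(name):
    """Return a lowercase matching key with punctuation and suffixes removed."""
    # Drop periods first so 'L.P.' collapses to 'LP' before tokenizing.
    tokens = re.findall(r"[a-z0-9]+", name.lower().replace(".", ""))
    # Strip trailing legal-form tokens only, keeping the core name intact.
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


names = ["Fund ABC L.P.", "Fund ABC LP", "Fund ABC"]
print({normalize_fund_name(n) for n in names})
```
&lt;/pre&gt;
All three spellings map to the same key, so round investors and fund records can be joined on the key instead of the raw name.&lt;br /&gt;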
&lt;br /&gt;
Key decisions:&lt;br /&gt;
*Threw out undisclosed companies throughout, as they have no address&lt;br /&gt;
*Count is done by joining round and company&lt;br /&gt;
*Anything fund related must be disclosed fund&lt;br /&gt;
*Near and far, and total invested, and fund counts, etc., are all done using disclosed funds that match only&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Glossary of Tables'''&lt;br /&gt;
 cleanco — used to remove duplicates from companies&lt;br /&gt;
 cleanedcompanies — clean set of companies with no duplicates&lt;br /&gt;
 cmsafunds- &lt;br /&gt;
 cmsas— list of all CMSAs in final data set (for merging) &lt;br /&gt;
 cmsastats- statistics not including empty years (pre-merge) &lt;br /&gt;
 cmsastats2 - statistics separated by year-MSA&lt;br /&gt;
 cmsastats3— statistics separated by year-MSA-stage&lt;br /&gt;
 cmsastats4&lt;br /&gt;
 cmsayears— empty merged table between year and cmsa&lt;br /&gt;
 cmsayearstage — empty merged table between cmsa/years and stage&lt;br /&gt;
 combinedrounds— raw sdc data for combined rounds&lt;br /&gt;
 combinedroundswamt— used to join rounds and combined rounds for roundinfo2&lt;br /&gt;
 companies- raw SDC company data&lt;br /&gt;
 companyinfo — cleaned companies joined with state and CMSA information&lt;br /&gt;
 companyinfo2— companyinfo1 with original industry categories&lt;br /&gt;
 companyinfo3— companyinfo2 with updated industry categories and codes&lt;br /&gt;
 companyinfo4-- clean version of companyinfo3&lt;br /&gt;
 companyround- combined company information with round information&lt;br /&gt;
 companyround2- combined company information with round information, cleaned up from companyround&lt;br /&gt;
 companyround3- combined company information with round information, cleaned up from companyround2&lt;br /&gt;
 '''finaldataset'''- final statistics by CMSA-year, see section Final Primary Data Set for more information&lt;br /&gt;
 fundinfo— funds joined with CMSA info&lt;br /&gt;
 fundinfo2 - clean version of fundinfo1&lt;br /&gt;
 fundinfoclean - used in process to clean fundinfo2&lt;br /&gt;
 fundinfoclean2- used in process to clean fundinfo2&lt;br /&gt;
 fundinfocleanfinal- used in process to clean fundinfo2&lt;br /&gt;
 fundinfocleannodups- final clean set of fundinfo&lt;br /&gt;
 funds - raw SDC fund data&lt;br /&gt;
 Houston - analysis for Houston ecosystem team&lt;br /&gt;
 Houston2- analysis for Houston ecosystem team&lt;br /&gt;
 houston3- analysis for Houston ecosystem team&lt;br /&gt;
 industry — new industry codes (4)— used for all future data sets&lt;br /&gt;
 industrylist— lookup table for new industry codes (went from 6 to 4) &lt;br /&gt;
 joined1- used for matching process&lt;br /&gt;
 joined2- used for matching process&lt;br /&gt;
 matchfund2- used for matching process&lt;br /&gt;
 matchfunds- used for matching process&lt;br /&gt;
 matchroundfund - used for matching process&lt;br /&gt;
 matchroundfund2- used for matching process&lt;br /&gt;
 msalist — lookup table for MSA to CMSA (used for all future data sets) &lt;br /&gt;
 nearfar1-- beginning set before adding nearfar/stage variables &lt;br /&gt;
 nearfar2 -- added binary variables for near/far and for each of the stages, used to build final dataset&lt;br /&gt;
 roundfund— not used— joined round to fund; drop/ignore&lt;br /&gt;
 roundinfo— round info cleaned up to include number of investors in a syndicate and estimated investment per member of syndicate&lt;br /&gt;
 roundinfo2— roundinfo1 including name of investors/funds&lt;br /&gt;
 roundinfo3— clean version of roundinfo2&lt;br /&gt;
 roundinfoclean — final clean version of roundinfo3 (final roundinfo table)&lt;br /&gt;
 rounds — raw SDC round data&lt;br /&gt;
 stages — table for merging stage-year-CMSA&lt;br /&gt;
 superinfo — ignore/drop&lt;br /&gt;
 temp - used for matching process&lt;br /&gt;
 years — table for merging stage-year-CMSA&lt;br /&gt;
&lt;br /&gt;
===Hub Candidates Data Set===&lt;br /&gt;
&lt;br /&gt;
The Hubs candidate data set is a list of potential hubs found in MSAs throughout the country. Researchers are currently pulling qualitative and quantitative information from the candidates' websites, in an attempt to categorize what can be identified as a hub. This is a difficult data set to pull: there is little to no quantitative information available for this category of institution, so it depends on what information is publicly accessible on the internet.&lt;br /&gt;
&lt;br /&gt;
Characteristics/Variables&lt;br /&gt;
*Year Founded&lt;br /&gt;
*Square footage&lt;br /&gt;
*LinkedIn self-identifiers (what the organization classifies itself as on its LinkedIn profile) &lt;br /&gt;
*Activeness on Twitter (binary)&lt;br /&gt;
*Member Directory available online (binary)&lt;br /&gt;
*Number of conference rooms&lt;br /&gt;
*Price ($/month) for Flex desk &lt;br /&gt;
*Offers Reserved desk (binary)&lt;br /&gt;
*Offers office space for rent (binary) &lt;br /&gt;
*Offers community membership-- not for coworking but for community events, etc. (binary)&lt;br /&gt;
*Number of events offered per month (estimate)&lt;br /&gt;
*Offers code academy&lt;br /&gt;
*Mission Statement/Vision (for qualitative or key-word analysis) &lt;br /&gt;
&lt;br /&gt;
These characteristics/variables will be used to determine whether a candidate is or is not likely to be a Hub. &lt;br /&gt;
&lt;br /&gt;
As of March 10th 2016, the list contains 125 Hub candidates.&lt;br /&gt;
&lt;br /&gt;
'''Where to find''': The Hubs data set can be found in the Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;dataset folder. It is not currently in the database due to a UTF8 issue.&lt;br /&gt;
&lt;br /&gt;
===Supplementary Data Sets===&lt;br /&gt;
'''Patent data''': to be pulled from USPTO or SDC Platinum. &lt;br /&gt;
&lt;br /&gt;
'''Number of STEM Graduate Students''' (NSF) and '''University R&amp;amp;D Spending''' (NSF):&lt;br /&gt;
*University R&amp;amp;D Data found under file &amp;quot;NSF DATA_2004 to 2011.xlsx&amp;quot; in datasets folder (Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;Datasets)&lt;br /&gt;
*R&amp;amp;D spending found at the university level for 2014 (&amp;quot;Stem Grad Students.xlsx&amp;quot;) or at state level (&amp;quot;Science and Engineering Grad Students by State and Year 2000-2011.csv&amp;quot;)&lt;br /&gt;
** not uploaded to server or matched yet to CMSA code, because of this discrepancy. &lt;br /&gt;
**&amp;quot;Stem Grad Students.xlsx&amp;quot; contains categorized university by MSA, can be used for all university-based projects&lt;br /&gt;
&lt;br /&gt;
'''Per Capita Income''' and '''Employment Data''' (US Census Bureau): &lt;br /&gt;
*&amp;quot;Per Capita Personal Income by MSA 2000-2012.xlsx&amp;quot; in datasets folder (Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;Datasets&amp;gt;&amp;gt;Data from Yael)&lt;br /&gt;
*&amp;quot;Wages and Salaries by MSA 2000-2012.xlsx&amp;quot; in datasets folder (Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;datasets&amp;gt;&amp;gt;Data from Yael)&lt;br /&gt;
**not uploaded to server or matched yet to CMSA code&lt;br /&gt;
&lt;br /&gt;
'''Firm Births''' (BDS)&lt;br /&gt;
*in server 181, under table name &amp;quot;BDS&amp;quot;&lt;br /&gt;
*includes birth, death, net(birth-death) and rate(death rate) for years 1990-2013 for every msa&lt;br /&gt;
*includes code for CMSA but is not aggregated by CMSA&lt;br /&gt;
** i.e. BDS statistics are still separate for all the smaller MSAs in New York's CMSA (code=1)&lt;br /&gt;
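&lt;br /&gt;
Rolling the MSA-level BDS rows up to the CMSA level could look like the sketch below. The sample rows, the column layout, and the CMSA code for Houston are hypothetical; only code 1 for New York's CMSA comes from the note above.&lt;br /&gt;
&lt;pre&gt;
```python
from collections import defaultdict

# Hypothetical BDS rows: (msa, cmsa, year, birth, death). Code 1 is New York's CMSA.
bds_rows = [
    ("New York", 1, 1990, 500, 420),
    ("Newark", 1, 1990, 120, 100),
    ("Houston", 2, 1990, 300, 250),
]

# Sum births and deaths within each CMSA-year.
totals = defaultdict(lambda: [0, 0])
for msa, cmsa, year, birth, death in bds_rows:
    totals[(cmsa, year)][0] += birth
    totals[(cmsa, year)][1] += death

# Net is recomputed from the sums; rates should be recomputed, not summed.
bds_by_cmsa = {
    key: {"birth": b, "death": d, "net": b - d}
    for key, (b, d) in totals.items()
}

print(bds_by_cmsa[(1, 1990)])
```
&lt;/pre&gt;
Counts can be added across MSAs, but the death rate cannot, so it would need to be recomputed from the aggregated counts and whatever denominator the BDS table uses.&lt;br /&gt;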
&lt;br /&gt;
===Resources===&lt;br /&gt;
* Hochberg and Fehder (2015), located in dropbox&lt;br /&gt;
** Use this paper as a guideline on how to conduct the analysis&lt;br /&gt;
*US Census Bureau data on employment by MSA: http://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_14_5YR_B23027&amp;amp;prodType=table&lt;br /&gt;
*USPTO utility patents by MSA: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/cls_cbsa/allcbsa_gd.htm&lt;br /&gt;
*MSA level trends: http://www.metrotrends.org/data.cf&lt;br /&gt;
&lt;br /&gt;
===The Target Dataset===&lt;br /&gt;
&lt;br /&gt;
We will need to process the following variables:&lt;br /&gt;
*SuperMSA - combine SanFran and SanJose, New York and Newark?, NC Research triangle, others?&lt;br /&gt;
*CSV mapping msas to cmsas is in the folder (and a table in the dbase)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Example dataset:&lt;br /&gt;
 MSA      Year       SeedVCInv      SeedEarlyVCInv      LaterVCInv     NoDeals   FundsInvested   DistinctInvestors   ....&lt;br /&gt;
 ----------------------------------------------------------------------------------------------------------------------------&lt;br /&gt;
 1234     2001       1000000        20000000            30000000       4          7              7&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Note that the unit of observation is MSA-Year.&lt;br /&gt;
&lt;br /&gt;
Variables to be computed at the MSA level:&lt;br /&gt;
*HubActive (binary)&lt;br /&gt;
*NoHubsActive (Count)&lt;br /&gt;
*HubSqFt&lt;br /&gt;
*Other Hub Vars (build list!!!)&lt;br /&gt;
*'''SeedVCInv'''  (Seed/Start-up)&lt;br /&gt;
*'''EarlyVCInv''' (Early Stage)&lt;br /&gt;
*'''LaterStageVC''' (Later)&lt;br /&gt;
*'''OtherStageVC''' (Buyout/Acq, Other)&lt;br /&gt;
*'''NoDeals''' (done by local VCs?)&lt;br /&gt;
**'''NoDealsNear'''&lt;br /&gt;
**'''NoDealsFar'''&lt;br /&gt;
*NoPortCosFunded&lt;br /&gt;
*'''FundsInv''' (in an MSA)&lt;br /&gt;
**'''FundsInvFromNear''' (within MSA?)&lt;br /&gt;
**'''FundsInvFromFar''' (outside MSA?)&lt;br /&gt;
*DistinctInvestors (?)&lt;br /&gt;
**DistinctInvestorsNear (within MSA?)&lt;br /&gt;
**DistinctInvestorsFar (outside MSA?)&lt;br /&gt;
*PatentCount&lt;br /&gt;
*NoSTEMGrads&lt;br /&gt;
*FirmBirths (BDS data)&lt;br /&gt;
*UniRandDSpend&lt;br /&gt;
*PerCapitaIncome&lt;br /&gt;
*Employment&lt;br /&gt;
&lt;br /&gt;
We need to:&lt;br /&gt;
*Check that funds invested means dollars invested&lt;br /&gt;
*Categorize near and far! Is it within MSA vs. not, within adjacent MSAs, etc.?&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There may be a second dataset that has Hub-Industry-Year (where industry is semiconductor/non-semiconductor?).&lt;br /&gt;
&lt;br /&gt;
===Final Primary Data Set===&lt;br /&gt;
&lt;br /&gt;
*A deal is a round syndicate; a near (far) deal is one with at least one investor that is near (far).&lt;br /&gt;
&lt;br /&gt;
Table name: finaldataset&lt;br /&gt;
 cmsa&lt;br /&gt;
 year&lt;br /&gt;
 totalamountinv--total amount invested &lt;br /&gt;
 nearamountinv--amount invested from local funds&lt;br /&gt;
 faramountinv-- amount invested from funds outside CMSA &lt;br /&gt;
 earlyinv--amount invested in early stage companies &lt;br /&gt;
 laterinv--amount invested in later stage companies &lt;br /&gt;
 startupseedinv--amount invested in seed or startup stage companies &lt;br /&gt;
 otherstageinv--amount invested in Acquisition/Buy-outs/Other stage companies &lt;br /&gt;
 investingfund--distinct funds that are investing in that CMSA-year &lt;br /&gt;
 investingfundnear--distinct funds from that CMSA that invested in that CMSA-year &lt;br /&gt;
 investingfundfar--distinct funds from outside that CMSA that invested in that CMSA-year &lt;br /&gt;
 deals--number of deals &lt;br /&gt;
 neardeals--number of deals inside a CMSA &lt;br /&gt;
 fardeals--number of deals from outside a CMSA --some of these deals might count in both categories, because of syndicate members being both inside and outside the CMSA&lt;br /&gt;
 earlystagedeals--deals with early stage companies&lt;br /&gt;
 laterstagedeals--deals with later stage companies &lt;br /&gt;
 startupseeddeals--deals with startup/seed companies &lt;br /&gt;
 otherstagedeals--deals with companies in other stages &lt;br /&gt;
 newportcosfunded--number of portfolio companies to receive their first investment in that year&lt;br /&gt;
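&lt;br /&gt;
The near/far deal counts above can be illustrated with a minimal Python sketch. It assumes a deal is one round syndicate and counts as near (far) when at least one syndicate member is inside (outside) the company's CMSA. The function name and the sample CMSA codes are hypothetical and are not taken from the project's SQL scripts.&lt;br /&gt;
&lt;br /&gt;
```python
def classify_deal(company_cmsa, fund_cmsas):
    """Return (is_near, is_far) for one round syndicate.

    Assumption: a deal is near if any fund shares the company's CMSA,
    and far if any fund sits in a different CMSA, so one deal can be both.
    """
    is_near = any(f == company_cmsa for f in fund_cmsas)
    is_far = any(f != company_cmsa for f in fund_cmsas)
    return is_near, is_far


# Hypothetical syndicates: (company CMSA, CMSAs of the investing funds)
deals = [(1, [1, 1]), (1, [1, 7]), (1, [7, 9])]
flags = [classify_deal(c, funds) for c, funds in deals]
neardeals = sum(1 for near, _ in flags if near)
fardeals = sum(1 for _, far in flags if far)
print(len(deals), neardeals, fardeals)
```
Under this assumption neardeals plus fardeals can exceed deals, because a mixed syndicate is counted in both categories.&lt;br /&gt;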
&lt;br /&gt;
===Data by zip code===&lt;br /&gt;
*Population data, 2000-2016 - US Census Bureau (E:\McNair\Hubs\summer 2017)&lt;br /&gt;
https://www2.census.gov/programs-surveys/popest/datasets/&lt;br /&gt;
*Income data, 1998-2014 - The Internal Revenue Service (E:\McNair\Hubs\summer 2017)&lt;br /&gt;
https://www.irs.gov/uac/about-irs&lt;br /&gt;
*DCI index, to assess the economic well-being of communities&lt;br /&gt;
http://eig.org/dci/interactive-maps/u-s-zip-codes&lt;br /&gt;
*R&amp;amp;D Expenses, 1980-2016 - Wharton Research Data Services (E:\McNair\Hubs\summer 2017)&lt;br /&gt;
*Zipcode look-up table obtained from https://www.unitedstateszipcodes.org/zip-code-database/. It's available in (E:\McNair\Hubs\summer 2017).&lt;br /&gt;
&lt;br /&gt;
== Data by MSA ==&lt;br /&gt;
&lt;br /&gt;
We have principal cities of MSAs from the census:&lt;br /&gt;
https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html&lt;br /&gt;
&lt;br /&gt;
We might be able to go City -&amp;gt; FIPS place code -&amp;gt; MSA?&lt;br /&gt;
&lt;br /&gt;
Cities and their FIPS codes (which don't perfectly correspond) are available from https://www.census.gov/geo/reference/codes/place.html&lt;br /&gt;
&lt;br /&gt;
The Census claims to provide city to MSA here: https://www.census.gov/geo/maps-data/data/ua_rel_download.html&lt;br /&gt;
However, there is only CBSA!&lt;br /&gt;
&lt;br /&gt;
This might do it: https://www2.census.gov/geo/pdfs/maps-data/data/rel/explanation_ua_cbsa_rel_10.pdf&lt;br /&gt;
We may be able to map each city to a principal city and then to its MSA.&lt;br /&gt;
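&lt;br /&gt;
As a rough sketch of the City to FIPS place code to MSA idea, the two lookups could be chained as below. The function name, the table layout, and every code in the example are placeholders made up for illustration; the real Census files would need to be loaded into the two dictionaries.&lt;br /&gt;
&lt;br /&gt;
```python
def city_to_msa(city, state, place_codes, place_to_msa):
    """Chain two lookups: (city, state) to place code, then place code to MSA."""
    key = (city.strip().upper(), state.strip().upper())
    place = place_codes.get(key)
    if place is None:
        return None  # city name did not match the place file
    return place_to_msa.get(place)  # None if the place is not a principal city


# Placeholder lookup tables; the codes below are made up for illustration.
place_codes = {("SPRINGFIELD", "XX"): "9900001"}
place_to_msa = {"9900001": "MSA-0001"}

print(city_to_msa(" Springfield", "xx", place_codes, place_to_msa))
print(city_to_msa("Shelbyville", "XX", place_codes, place_to_msa))
```
Returning None at either step makes it easy to list the cities that fail to match and need manual attention.&lt;br /&gt;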
&lt;br /&gt;
==COMPUSTAT Data==&lt;br /&gt;
&lt;br /&gt;
Data is in:&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: COMPUSTAT.sql&lt;br /&gt;
&lt;br /&gt;
The source file is RandDExpenditures.txt. It contains:&lt;br /&gt;
*Data from 1980-2017 (July). All COMPUSTAT.&lt;br /&gt;
*427799 records&lt;br /&gt;
*Fields include:&lt;br /&gt;
**R&amp;amp;D Expenditure&lt;br /&gt;
**Address (inc. city, zip, state)&lt;br /&gt;
&lt;br /&gt;
Output file is COMPUSTATSummary.txt. It contains:&lt;br /&gt;
*Variables: City, year, No.public firms, sum R&amp;amp;D, sum Sales, sum total assets&lt;br /&gt;
*1979-2016&lt;br /&gt;
*4440 cities&lt;br /&gt;
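&lt;br /&gt;
The city-year aggregation behind the summary file can be sketched in Python as follows. This is not the contents of COMPUSTAT.sql; the record layout, the firm identifier, and the choice to treat missing amounts as zero are all assumptions made for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict


def summarize(records):
    """Aggregate firm-year rows to city-year rows.

    Each record is (firm_id, city, year, rnd, sales, assets); missing
    amounts are passed as None and treated as zero in the sums.
    """
    summary = defaultdict(lambda: {"firms": set(), "rnd": 0.0, "sales": 0.0, "assets": 0.0})
    for firm_id, city, year, rnd, sales, assets in records:
        row = summary[(city.upper(), year)]
        row["firms"].add(firm_id)
        row["rnd"] += rnd or 0.0
        row["sales"] += sales or 0.0
        row["assets"] += assets or 0.0
    return {
        key: (len(v["firms"]), v["rnd"], v["sales"], v["assets"])
        for key, v in summary.items()
    }


# Hypothetical rows, not taken from RandDExpenditures.txt
rows = [
    ("F1", "Houston", 2001, 10.0, 100.0, 500.0),
    ("F2", "HOUSTON", 2001, None, 50.0, 200.0),
]
print(summarize(rows)[("HOUSTON", 2001)])
```
Upper-casing the city name before grouping keeps spelling variants such as Houston and HOUSTON in the same city-year row.&lt;br /&gt;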
&lt;br /&gt;
==NSF Data==&lt;br /&gt;
Data is in:&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: nsf_2017.sql&lt;br /&gt;
&lt;br /&gt;
The source files are: nsf2017.txt, copied from table '''nsf''', and nsf_institution copied from table '''nsf_grants_institution''' from the biotech db.&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*Award ID&lt;br /&gt;
*Award Institution&lt;br /&gt;
*Award Effective date&lt;br /&gt;
*Institution city&lt;br /&gt;
*Award Value&lt;br /&gt;
*Organization state code&lt;br /&gt;
From 1900 - 2017&lt;br /&gt;
&lt;br /&gt;
Output file is nsfSummary.txt. It contains:&lt;br /&gt;
*Variables: City, State code, year, nsf_nogrants, nsf_valuegrant &lt;br /&gt;
*1900-2017&lt;br /&gt;
&lt;br /&gt;
==NIH Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: nih2017.sql&lt;br /&gt;
The source files are: &lt;br /&gt;
*nih_1986_2001.csv&lt;br /&gt;
*nih_2002_2012.txt&lt;br /&gt;
*nih_2013_2015&lt;br /&gt;
located in E:\McNair\Projects\Federal Grant Data\NIH&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The script that cleans NIH data and generates the summary table is titled '''nihSummary'''. It is located here:&lt;br /&gt;
&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
This table includes&lt;br /&gt;
*year&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*country&lt;br /&gt;
*nogrants (number of grants)&lt;br /&gt;
*valuegrant&lt;br /&gt;
*city_state (the city-state ID that we'll merge on)&lt;br /&gt;
&lt;br /&gt;
*Data from 1986-2015&lt;br /&gt;
&lt;br /&gt;
==Clinical Trials Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: ctrials.sql&lt;br /&gt;
The source file is: &lt;br /&gt;
&lt;br /&gt;
*medclinical.txt&lt;br /&gt;
&lt;br /&gt;
located in Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
*Data from 1999-2017&lt;br /&gt;
&lt;br /&gt;
===Joined clinical trials data===&lt;br /&gt;
&lt;br /&gt;
The file which contains the number of trials in each city and year is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
The name of the file is &lt;br /&gt;
  ctrialsSummary.txt&lt;br /&gt;
It contains:&lt;br /&gt;
*city&lt;br /&gt;
*year&lt;br /&gt;
*city_state_year&lt;br /&gt;
*noctrials - number of trials&lt;br /&gt;
&lt;br /&gt;
The ctrials table is joined with vc_city_state_year. &lt;br /&gt;
The SQL script for the join is: merged_ctrials.sql&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*city_state_id&lt;br /&gt;
*city_state_year&lt;br /&gt;
*year&lt;br /&gt;
*noctrials&lt;br /&gt;
*seedamtm&lt;br /&gt;
*earlyamtm&lt;br /&gt;
*lateramtm&lt;br /&gt;
*selamtm&lt;br /&gt;
*numseeds&lt;br /&gt;
*numearly&lt;br /&gt;
*numlater&lt;br /&gt;
*numsel&lt;br /&gt;
&lt;br /&gt;
==Population Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: population.sql&lt;br /&gt;
The source files are: &lt;br /&gt;
*pop2000_2009.xlsx&lt;br /&gt;
*pop2010_2016.xlsx&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*State&lt;br /&gt;
*City name	&lt;br /&gt;
*Year	&lt;br /&gt;
*Population Estimates&lt;br /&gt;
&lt;br /&gt;
Data from 2000-2016&lt;br /&gt;
&lt;br /&gt;
===Joined population data===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: merged_population.sql&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*city_state_year to uniquely identify each city in each year&lt;br /&gt;
*Population estimates&lt;br /&gt;
*Year&lt;br /&gt;
*Code from the state code and Fips code&lt;br /&gt;
*State full name&lt;br /&gt;
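&lt;br /&gt;
A minimal sketch of how merge keys like city_state_id and city_state_year might be built. The upper-casing and underscore-joined format are assumptions; the actual SQL scripts may construct the keys differently.&lt;br /&gt;
&lt;br /&gt;
```python
def make_keys(city, state, year):
    """Build merge keys from a city name, a state code, and a year.

    Assumption: keys are upper-cased and underscore-joined; the real
    scripts may use a different format.
    """
    city_state_id = "_".join([city.strip().upper().replace(" ", "_"), state.strip().upper()])
    city_state_year = "{}_{}".format(city_state_id, year)
    return city_state_id, city_state_year


print(make_keys("New York", "ny", 2010))
```
Whatever format is chosen, every joined table has to build its keys the same way, otherwise rows for the same city will fail to match.&lt;br /&gt;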
&lt;br /&gt;
==Income Data==&lt;br /&gt;
&lt;br /&gt;
Raw data was obtained from the US Census Bureau's American Community Survey (ACS).&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\MSA Income_raw.zip &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The master list with MSAs and principal cities is titled '''list2.xls'''. It is located at:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
This master list includes:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA name&lt;br /&gt;
*Principal City&lt;br /&gt;
*State&lt;br /&gt;
*Place code (city code)&lt;br /&gt;
*State Code&lt;br /&gt;
&lt;br /&gt;
This master list was edited to associate each principal city with a single state. E.g., New York is the principal city of the New York-New Jersey MSA, so it was originally associated with state NY-NJ; '''list''' was edited to put New York with NY. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Cleaned Income data files are in&lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID &lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Total Household Income &lt;br /&gt;
&lt;br /&gt;
The MSA-City-State look up file is titled '''msa_city_state_wcode.txt'''. It is located in &lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID &lt;br /&gt;
&lt;br /&gt;
The SQL file that merges income data from ACS (by MSA - Year) with the MSA-City file is titled '''income.sql'''. It is located here:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
   &lt;br /&gt;
 &lt;br /&gt;
The final income table is in db '''cities''' titled '''merged_income'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Year&lt;br /&gt;
*Total Household Income &lt;br /&gt;
&lt;br /&gt;
The table includes 8780 observations&lt;br /&gt;
&lt;br /&gt;
===Joined income data===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: merged_income.sql&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*Income&lt;br /&gt;
*Year&lt;br /&gt;
*Code from the state code and Fips code&lt;br /&gt;
&lt;br /&gt;
==Employment Data==&lt;br /&gt;
&lt;br /&gt;
Data on employment was obtained from the American Community Survey, US Census Bureau.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\Employment Data by MSA&lt;br /&gt;
Cleaned files are in&lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID    &lt;br /&gt;
 &lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Employment rate of individuals 16 years or older&lt;br /&gt;
*Unemployment rate of individuals 16 years or older&lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The SQL file that merges employment data from ACS (by MSA - Year) with the MSA-City file is titled '''Employment.sql'''. &lt;br /&gt;
The file is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The final table is in db '''cities''' titled '''merged_employment'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*Year&lt;br /&gt;
*Employment rate&lt;br /&gt;
*Unemployment rate&lt;br /&gt;
&lt;br /&gt;
===Joined employment data===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: merged_employment.sql&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Code from the state code and Fips code&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*city_state_year to uniquely identify each city in each year&lt;br /&gt;
*Employment rates of individuals of 16 years or older&lt;br /&gt;
*Unemployment rates of individuals of 16 years or older&lt;br /&gt;
*Year&lt;br /&gt;
&lt;br /&gt;
==Schooling Data==&lt;br /&gt;
&lt;br /&gt;
Data on schooling was obtained from the American Community Survey, US Census Bureau.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\School Enrollment Data by MSA&lt;br /&gt;
Cleaned files are in&lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID    &lt;br /&gt;
 &lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Total number of population 3 years and over enrolled in school&lt;br /&gt;
*Percent of population 3 years and over enrolled in public school&lt;br /&gt;
*Percent of population 3 years and over enrolled in private school &lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The SQL file that merges schooling data from ACS (by MSA - Year) with the MSA-City file is titled '''schooling.sql'''. &lt;br /&gt;
The file is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The final table is in db '''cities''' titled '''merged_schooling'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*Year&lt;br /&gt;
*Total&lt;br /&gt;
*Percent_public_schooling&lt;br /&gt;
*Percent_private_schooling&lt;br /&gt;
&lt;br /&gt;
===Joined schooling data===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: merged_schooling.sql&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*city_state_year to uniquely identify each city in each year&lt;br /&gt;
*Total number of school enrollment&lt;br /&gt;
*Percentage enrolled in public schools&lt;br /&gt;
*Percentage enrolled in private schools&lt;br /&gt;
*Year&lt;br /&gt;
*Code from the state code and Fips code&lt;br /&gt;
&lt;br /&gt;
==VC Data==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
  Z:\VentureCapitalData\SDCVCData&lt;br /&gt;
  The file name is roundcitystateyear.txt&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*year&lt;br /&gt;
*seedamtm - seed, amount in millions&lt;br /&gt;
*earlyamtm - early, amount in millions&lt;br /&gt;
*lateramtm - late, amount in millions&lt;br /&gt;
*selamtm - seed early late, amount in millions&lt;br /&gt;
*numseeds - number of seeds&lt;br /&gt;
*numearly &lt;br /&gt;
*numlater&lt;br /&gt;
*numsel&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Data from 1953-2017&lt;br /&gt;
&lt;br /&gt;
The SQL file that merges VC data with the MSA-City file is titled '''vc.sql'''. &lt;br /&gt;
The file is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The final table is in db '''cities''' titled '''vc_city_state_year'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*city_state_id&lt;br /&gt;
*city_state_year&lt;br /&gt;
*seedamtm&lt;br /&gt;
*earlyamtm&lt;br /&gt;
*lateramtm&lt;br /&gt;
*selamtm&lt;br /&gt;
*numseeds&lt;br /&gt;
*numearly&lt;br /&gt;
*numlater&lt;br /&gt;
*numsel&lt;br /&gt;
*year&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hubs&amp;diff=19106</id>
		<title>Hubs</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hubs&amp;diff=19106"/>
		<updated>2017-06-30T21:13:38Z</updated>

		<summary type="html">&lt;p&gt;HiraF: /* NIH Data */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Hubs&lt;br /&gt;
|Has owner=Hira Farooqi,&lt;br /&gt;
|Has keywords=Data&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
The Hubs Research Project is a full-length academic paper analyzing the effectiveness of &amp;quot;hubs&amp;quot;, a component of the entrepreneurship ecosystem, in the advancement and growth of entrepreneurial success in a metropolitan area. &lt;br /&gt;
&lt;br /&gt;
This research will primarily focus on large and mid-sized Metropolitan Statistical Areas (MSAs), as that is where the great majority of Venture Capital funding is located. &lt;br /&gt;
&lt;br /&gt;
===Primary Data Set===&lt;br /&gt;
The Hubs data set, from SDC Platinum, has been constructed in the server:&lt;br /&gt;
 Data files are in 128.42.44.181/bulk/Hubs&lt;br /&gt;
 All files are in 128.42.44.182/bulk/Projects/Ecosystem/Hubs&lt;br /&gt;
 psql Hubs&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The data set includes all United States Venture Capital transactions (moneytree) from the twenty-five year period of 1990 through 2015.&lt;br /&gt;
Data has been aggregated at the portfolio company, fund, and round level. It will be analyzed at the combined MSA level, in terms of the number of companies funded, the number of funds active, and the flow of investment in a given MSA.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The data set has now been uploaded to the database server, named Hubs.&lt;br /&gt;
There are 4 tables: &lt;br /&gt;
*Rounds: Rounddate, coname, state, roundno, stage1, etc.&lt;br /&gt;
*CombinedRounds: Coname, rounddate, discamount, fundname&lt;br /&gt;
*Companies: LastInv, FirstInv, coname, MSA, MSACode, Address, state, datefounded, totalknownfunding, industry(major)&lt;br /&gt;
*Funds: fundname, closingdate, lastinv, firstinv, msa, msacode, avinv, nocoinv, totalknowninv, address&lt;br /&gt;
&lt;br /&gt;
Used variables:&lt;br /&gt;
&lt;br /&gt;
 Companies: Coname, MSACode, Industry, state&lt;br /&gt;
 MSALookupTable: MSACode, MSASuper&lt;br /&gt;
 IndustryLookupTable: IndustryMajor, InduCode&lt;br /&gt;
 -&amp;gt; &lt;br /&gt;
 CompanyInfo: Coname, MSASuper, InduCode, state (complete)&lt;br /&gt;
&lt;br /&gt;
 Funds: fundname, msacode, state&lt;br /&gt;
 MSALookupTable: MSACode, MSASuper &lt;br /&gt;
 -&amp;gt; &lt;br /&gt;
 FundInfo: fundname, msacode, state (complete)&lt;br /&gt;
&lt;br /&gt;
 Rounds: coname, rounddate, stagecode, roundno&lt;br /&gt;
 CombinedRounds: coname, rounddate, discamount, fundname&lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 RoundInfoSuper: coname, rounddate, '''nofunds''', discamount   &lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 RoundInfo: Coname, roundyear, fundname, estamount (complete)&lt;br /&gt;
&lt;br /&gt;
Then take:&lt;br /&gt;
 RoundInfo: Coname, roundyear, fundname, estamount&lt;br /&gt;
 CompanyInfo: Coname, MSASuper, InduCode, state&lt;br /&gt;
 FundInfo: fundname, msacode, state&lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 SuperRoundInfo: Coname, CoMSASuper, CoInduCode, CoState, FundName, FundMSASuper, FundState, RoundYear, RoundEstAmount&lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 MSAPortCos: Count(CoName) As NoPortCosFunded, CoMSASuper, RoundYear&lt;br /&gt;
 ...&lt;br /&gt;
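&lt;br /&gt;
The RoundInfoSuper to RoundInfo step above turns a round's disclosed amount into an estimated amount per fund. A minimal Python sketch, assuming the amount is split equally across the distinct funds in the syndicate (the source only says the per-member amount is estimated; the function name is hypothetical):&lt;br /&gt;
&lt;br /&gt;
```python
def estimate_per_fund(discamount, fundnames):
    """Split a round's disclosed amount equally across its syndicate.

    Returns (nofunds, list of (fundname, estamount)). The equal split is an
    assumption; the source only says the amount per member is estimated.
    """
    funds = sorted(set(fundnames))  # count each fund once per round
    nofunds = len(funds)
    if nofunds == 0:
        return 0, []
    share = discamount / nofunds
    return nofunds, [(name, share) for name in funds]


# Hypothetical round: 3.0 (in the units of discamount) from three funds
print(estimate_per_fund(3.0, ["Fund A", "Fund B", "Fund C"]))
```
Rounds with no disclosed funds return an empty list, which matches the key decision to use only disclosed funds for anything fund related.&lt;br /&gt;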
&lt;br /&gt;
'''Notes on Creation of Primary Data Set'''&lt;br /&gt;
&lt;br /&gt;
Raw tables&lt;br /&gt;
* companies (last investment, first investment, company name, MSA, MSA code, address, state, date founded, known funding, industry) &lt;br /&gt;
* funds (fund closing date, last investment, first investment, fund name, address, MSA, MSA code, Average investment, number companies invested (NoCos), known investment) &lt;br /&gt;
* rounds (round date, company name, state, round number, stage 1, stage 2, stage 3) &lt;br /&gt;
* combined rounds (company name, round date, disclosed amount, investor) &lt;br /&gt;
* msalist (changes MSAs to CMSAs— combined MSAs)&lt;br /&gt;
*industry list (changes 6 industry categories to 4— ICT, Life Sciences, Semiconductors, Other) &lt;br /&gt;
&lt;br /&gt;
Process&lt;br /&gt;
*cleaned tables to eliminate duplications, undisclosed variables&lt;br /&gt;
*changed all original characters to include CMSA and Industry Codes (companyinfo3, fundinfocleanfinal, roundinfoclean) &lt;br /&gt;
*matched funds to avoid any issues with names (i.e. Fund ABC L.P./Fund ABC LP/Fund ABC) &lt;br /&gt;
*matched roundinfoclean investors to fundinfocleanfinal investors (roundinfo.txt &amp;gt;&amp;gt; cleanfundfinal.txt)&lt;br /&gt;
*join by round and company conames&lt;br /&gt;
*bridge years (1990-2016), stage, and cmsa&lt;br /&gt;
* populate data with count of companies (Deal flow) and estimated amount ($)&lt;br /&gt;
** data set in 181 hubs folder under summarycmsa.txt (38394)&lt;br /&gt;
&lt;br /&gt;
Key decisions:&lt;br /&gt;
*Threw out undisclosed co through-out as no address&lt;br /&gt;
*Count is done by joining round and company&lt;br /&gt;
*Anything fund related must be disclosed fund&lt;br /&gt;
*Near and far, and total invested, and fund counts, etc., are all done using disclosed funds that match only&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Glossary of Tables'''&lt;br /&gt;
 cleanco — used to remove duplicates from companies&lt;br /&gt;
 cleanedcompanies — clean set of companies with no duplicates&lt;br /&gt;
 cmsafunds- &lt;br /&gt;
 cmsas— list of all CMSAs in final data set (for merging) &lt;br /&gt;
 cmsastats- statistics not including empty years (pre-merge) &lt;br /&gt;
 cmsastats2 - statistics separated by year-MSA&lt;br /&gt;
 cmsastats3— statistics separated by year-MSA-stage&lt;br /&gt;
 cmsastats4&lt;br /&gt;
 cmsayears— empty merged table between year and cmsa&lt;br /&gt;
 cmsayearstage — empty merged table between cmsa/years and stage&lt;br /&gt;
 combinedrounds— raw sdc data for combined rounds&lt;br /&gt;
 combinedroundswamt— used to join rounds and combined rounds for roundinfo2&lt;br /&gt;
 companies- raw SDC company data&lt;br /&gt;
 companyinfo — cleaned companies joined with state and CMSA information&lt;br /&gt;
 companyinfo2— companyinfo1 with original industry categories&lt;br /&gt;
 companyinfo3— companyinfo2 with updated industry categories and codes&lt;br /&gt;
 companyinfo4-- clean version of companyinfo3&lt;br /&gt;
 companyround- combined company information with round information&lt;br /&gt;
 companyround2- combined company information with round information, cleaned up from companyround&lt;br /&gt;
 companyround3- combined company information with round information, cleaned up from companyround2&lt;br /&gt;
 '''finaldataset'''- final statistics by CMSA-year, see section Final Primary Data Set for more information&lt;br /&gt;
 fundinfo— funds joined with CMSA info&lt;br /&gt;
 fundinfo2 - clean version of fundinfo1&lt;br /&gt;
 fundinfoclean - used in process to clean fundinfo2&lt;br /&gt;
 fundinfoclean2- used in process to clean fundinfo2&lt;br /&gt;
 fundinfocleanfinal- used in process to clean fundinfo2&lt;br /&gt;
 fundinfocleannodups- final clean set of fundinfo&lt;br /&gt;
 funds - raw SDC fund data&lt;br /&gt;
 Houston - analysis for Houston ecosystem team&lt;br /&gt;
 Houston2- analysis for Houston ecosystem team&lt;br /&gt;
 houston3- analysis for Houston ecosystem team&lt;br /&gt;
 industry — new industry codes (4)— used for all future data sets&lt;br /&gt;
 industrylist— lookup table for new industry codes (went from 6 to 4) &lt;br /&gt;
 joined1- used for matching process&lt;br /&gt;
 joined2- used for matching process&lt;br /&gt;
 matchfund2- used for matching process&lt;br /&gt;
 matchfunds- used for matching process&lt;br /&gt;
 matchroundfund - used for matching process&lt;br /&gt;
 matchroundfund2- used for matching process&lt;br /&gt;
 msalist — lookup table for MSA to CMSA (used for all future data sets) &lt;br /&gt;
 nearfar1-- beginning set before adding nearfar/stage variables &lt;br /&gt;
 nearfar2 -- added binary variables for near/far and for each of the stages, used to build final dataset&lt;br /&gt;
 roundfund— not used— joined round to fund; drop/ignore&lt;br /&gt;
 roundinfo— round info cleaned up to include number of investors in a syndicate and estimated investment per member of syndicate&lt;br /&gt;
 roundinfo2— roundinfo1 including name of investors/funds&lt;br /&gt;
 roundinfo3— clean version of roundinfo2&lt;br /&gt;
 roundinfoclean — final clean version of roundinfo3 (final roundinfo table)&lt;br /&gt;
 rounds — raw SDC round data&lt;br /&gt;
 stages — table for merging stage-year-CMSA&lt;br /&gt;
 superinfo — ignore/drop&lt;br /&gt;
 temp - used for matching process&lt;br /&gt;
 years — table for merging stage-year-CMSA&lt;br /&gt;
&lt;br /&gt;
===Hub Candidates Data Set===&lt;br /&gt;
&lt;br /&gt;
The Hub candidates data set is a list of potential hubs found in MSAs throughout the country. Researchers are currently pulling qualitative and quantitative information from the candidates' websites in an attempt to categorize what can be identified as a hub. This is a difficult data set to assemble, as little to no quantitative information is available for this category of institution, and it depends on what information is publicly accessible on the internet.&lt;br /&gt;
&lt;br /&gt;
Characteristics/Variables&lt;br /&gt;
*Year Founded&lt;br /&gt;
*Square footage&lt;br /&gt;
*LinkedIn self-identifiers (how the organization classifies itself on its LinkedIn profile) &lt;br /&gt;
*Activeness on Twitter (binary)&lt;br /&gt;
*Member Directory available online (binary)&lt;br /&gt;
*Number of conference rooms&lt;br /&gt;
*Price ($/month) for Flex desk &lt;br /&gt;
*Offers Reserved desk (binary)&lt;br /&gt;
*Offers office space for rent (binary) &lt;br /&gt;
*Offers community membership-- not for coworking but for community events, etc. (binary)&lt;br /&gt;
*Number of events offered per month (estimate)&lt;br /&gt;
*Offers code academy&lt;br /&gt;
*Mission Statement/Vision (for qualitative or key-word analysis) &lt;br /&gt;
&lt;br /&gt;
These characteristics/variables will be used to determine whether a candidate is or is not likely to be a Hub. &lt;br /&gt;
&lt;br /&gt;
As of March 10th 2016, the list contains 125 Hub candidates.&lt;br /&gt;
&lt;br /&gt;
'''Where to find''': The Hubs data set can be found in the Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;dataset folder. It is not currently in the database due to a UTF8 issue.&lt;br /&gt;
&lt;br /&gt;
===Supplementary Data Sets===&lt;br /&gt;
'''Patent data''': to be pulled from USPTO or SDC Platinum. &lt;br /&gt;
&lt;br /&gt;
'''Number of STEM Graduate Students''' (NSF) and '''University R&amp;amp;D Spending''' (NSF):&lt;br /&gt;
*University R&amp;amp;D Data found under file &amp;quot;NSF DATA_2004 to 2011.xlsx&amp;quot; in datasets folder (Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;Datasets)&lt;br /&gt;
*R&amp;amp;D spending found at the university level for 2014 (&amp;quot;Stem Grad Students.xlsx&amp;quot;) or at state level (&amp;quot;Science and Engineering Grad Students by State and Year 2000-2011.csv&amp;quot;)&lt;br /&gt;
** not uploaded to server or matched yet to CMSA code, because of this discrepancy. &lt;br /&gt;
**&amp;quot;Stem Grad Students.xlsx&amp;quot; contains categorized university by MSA, can be used for all university-based projects&lt;br /&gt;
&lt;br /&gt;
'''Per Capita Income''' and '''Employment Data''' (US Census Bureau): &lt;br /&gt;
*&amp;quot;Per Capita Personal Income by MSA 2000-2012.xlsx&amp;quot; in datasets folder (Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;Datasets&amp;gt;&amp;gt;Data from Yael)&lt;br /&gt;
*&amp;quot;Wages and Salaries by MSA 2000-2012.xlsx&amp;quot; in datasets folder (Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;datasets&amp;gt;&amp;gt;Data from Yael)&lt;br /&gt;
**not uploaded to server or matched yet to CMSA code&lt;br /&gt;
&lt;br /&gt;
'''Firm Births''' (BDS)&lt;br /&gt;
*in server 181, under table name &amp;quot;BDS&amp;quot;&lt;br /&gt;
*includes birth, death, net(birth-death) and rate(death rate) for years 1990-2013 for every msa&lt;br /&gt;
*includes code for CMSA but is not aggregated by CMSA&lt;br /&gt;
** i.e. BDS statistics are still separate for all the smaller MSAs in New York's CMSA (code=1)&lt;br /&gt;
&lt;br /&gt;
===Resources===&lt;br /&gt;
* Yael Hochberg and Fehder (2015), located in dropbox&lt;br /&gt;
** Use this paper as a guideline on how to conduct the analysis&lt;br /&gt;
*US Census Bureau data on employment by MSA: http://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_14_5YR_B23027&amp;amp;prodType=table&lt;br /&gt;
*USPTO utility patents by MSA: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/cls_cbsa/allcbsa_gd.htm&lt;br /&gt;
*MSA level trends: http://www.metrotrends.org/data.cf&lt;br /&gt;
&lt;br /&gt;
===The Target Dataset===&lt;br /&gt;
&lt;br /&gt;
We will need to process the following variables:&lt;br /&gt;
*SuperMSA - combine SanFran and SanJose, New York and Newark?, NC Research triangle, others?&lt;br /&gt;
*CSV mapping msas to cmsas is in the folder (and a table in the dbase)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Example dataset:&lt;br /&gt;
 MSA      Year       SeedVCInv      SeedEarlyVCInv      LaterVCInv     NoDeals   FundsInvested   DistinctInvestors   ....&lt;br /&gt;
 ----------------------------------------------------------------------------------------------------------------------------&lt;br /&gt;
 1234     2001       1000000        20000000            30000000       4          7              7&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Note that the unit of observation is MSA-Year.&lt;br /&gt;
&lt;br /&gt;
Variables to be computed at the MSA level:&lt;br /&gt;
*HubActive (binary)&lt;br /&gt;
*NoHubsActive (Count)&lt;br /&gt;
*HubSqFt&lt;br /&gt;
*Other Hub Vars (build list!!!)&lt;br /&gt;
*'''SeedVCInv'''  (Seed/Start-up)&lt;br /&gt;
*'''EarlyVCInv''' (Early Stage)&lt;br /&gt;
*'''LaterStageVC''' (Later)&lt;br /&gt;
*'''OtherStageVC''' (Buyout/Acq, Other)&lt;br /&gt;
*'''NoDeals''' (done by local VCs?)&lt;br /&gt;
**'''NoDealsNear'''&lt;br /&gt;
**'''NoDealsFar'''&lt;br /&gt;
*NoPortCosFunded&lt;br /&gt;
*'''FundsInv''' (in an MSA)&lt;br /&gt;
**'''FundsInvFromNear''' (within MSA?)&lt;br /&gt;
**'''FundsInvFromFar''' (outside MSA?)&lt;br /&gt;
*DistinctInvestors (?)&lt;br /&gt;
**DistinctInvestorsNear (within MSA?)&lt;br /&gt;
**DistinctInvestorsFar (outside MSA?)&lt;br /&gt;
*PatentCount&lt;br /&gt;
*NoSTEMGrads&lt;br /&gt;
*FirmBirths (BDS data)&lt;br /&gt;
*UniRandDSpend&lt;br /&gt;
*PerCapitaIncome&lt;br /&gt;
*Employment&lt;br /&gt;
&lt;br /&gt;
We need to:&lt;br /&gt;
*Check that funds invested means dollars invested&lt;br /&gt;
*Categorize near and far! Is it within MSA vs. not, within adjacent MSAs, etc.?&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There may be a second dataset that has Hub-Industry-Year (where industry is semiconductor/non-semiconductor?).&lt;br /&gt;
&lt;br /&gt;
===Final Primary Data Set===&lt;br /&gt;
&lt;br /&gt;
*Deal is a round syndicate (near/far deal is one investor that is near/far).&lt;br /&gt;
&lt;br /&gt;
Table name: finaldataset&lt;br /&gt;
 cmsa&lt;br /&gt;
 year&lt;br /&gt;
 totalamountinv--total amount invested &lt;br /&gt;
 nearamountinv--amount invested from local funds&lt;br /&gt;
 faramountinv-- amount invested from funds outside CMSA &lt;br /&gt;
 earlyinv--amount invested in early stage companies &lt;br /&gt;
 laterinv--amount invested in later stage companies &lt;br /&gt;
 startupseedinv--amount invested in seed or startup stage companies &lt;br /&gt;
 otherstageinv--amount invested in Acquisition/Buy-outs/Other stage companies &lt;br /&gt;
 investingfund--distinct funds that are investing in that CMSA-year &lt;br /&gt;
 investingfundnear--distinct funds from that CMSA that invested in that CMSA-year &lt;br /&gt;
 investingfundfar--distinct funds from outside that CMSA that invested in that CMSA-year &lt;br /&gt;
 deals--number of deals &lt;br /&gt;
 neardeals--number of deals inside a CMSA &lt;br /&gt;
 fardeals--number of deals from outside a CMSA --some of these deals might count in both categories, because of syndicate members being both inside and outside the CMSA&lt;br /&gt;
 earlystagedeals--deals with early stage companies&lt;br /&gt;
 laterstagedeals--deals with later stage companies &lt;br /&gt;
 startupseeddeals--deals with startup/seed companies &lt;br /&gt;
 otherstagedeals--deals with companies in other stages &lt;br /&gt;
 newportcosfunded--number of portfolio companies to receive their first investment in that year&lt;br /&gt;
&lt;br /&gt;
===Data by zip code===&lt;br /&gt;
*Population data, 2000-2016 - US Census Bureau (E:\McNair\Hubs\summer 2017)&lt;br /&gt;
https://www2.census.gov/programs-surveys/popest/datasets/&lt;br /&gt;
*Income data, 1998-2014 - The Internal Revenue Service (E:\McNair\Hubs\summer 2017)&lt;br /&gt;
https://www.irs.gov/uac/about-irs&lt;br /&gt;
*DCI index, to assess the economic well-being of communities&lt;br /&gt;
http://eig.org/dci/interactive-maps/u-s-zip-codes&lt;br /&gt;
*R&amp;amp;D Expenses, 1980-2016 - Wharton Research Data Services (E:\McNair\Hubs\summer 2017)&lt;br /&gt;
*Zipcode look-up table obtained from https://www.unitedstateszipcodes.org/zip-code-database/. It's available in (E:\McNair\Hubs\summer 2017).&lt;br /&gt;
&lt;br /&gt;
== Data by MSA ==&lt;br /&gt;
&lt;br /&gt;
We have principal cities of MSAs from the census:&lt;br /&gt;
https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html&lt;br /&gt;
&lt;br /&gt;
We might be able to go City -&amp;gt; FIPS place code -&amp;gt; MSA?&lt;br /&gt;
&lt;br /&gt;
Cities and their FIPS codes (which don't perfectly correspond) are available from https://www.census.gov/geo/reference/codes/place.html&lt;br /&gt;
&lt;br /&gt;
The Census claims to provide city to MSA here: https://www.census.gov/geo/maps-data/data/ua_rel_download.html&lt;br /&gt;
However, there is only CBSA!&lt;br /&gt;
&lt;br /&gt;
This might do it: https://www2.census.gov/geo/pdfs/maps-data/data/rel/explanation_ua_cbsa_rel_10.pdf&lt;br /&gt;
We can maybe track city to principal city to MSA&lt;br /&gt;
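&lt;br /&gt;
A minimal sketch of the city to principal city to MSA lookup considered above. The lookup row, the placeholder code 99999, and the function names are hypothetical; real rows would have to come from the Census delineation files.&lt;br /&gt;

```python
def normalize_city(name):
    """Collapse case and extra whitespace so 'NEW YORK' and 'New  York' compare equal."""
    return " ".join(name.split()).lower()


def msa_for_city(city, state, lookup):
    """Return the MSA code for a (city, state) pair, or None if it is not a principal city."""
    return lookup.get((normalize_city(city), state.strip().upper()))


# Hypothetical lookup with a made-up placeholder code, keyed on (principal city, state).
lookup = {("example city", "TX"): "99999"}

print(msa_for_city("  EXAMPLE   City ", "tx", lookup))
print(msa_for_city("Nowhere", "TX", lookup))
```

Cities that are not principal cities return None here, so they would still need a separate route to an MSA.&lt;br /&gt;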
&lt;br /&gt;
==COMPUSTAT Data==&lt;br /&gt;
&lt;br /&gt;
Data is in:&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: COMPUSTAT.sql&lt;br /&gt;
&lt;br /&gt;
The source file is RandDExpenditures.txt. It contains:&lt;br /&gt;
*Data from 1980-2017 (July). All COMPUSTAT.&lt;br /&gt;
*427799 records&lt;br /&gt;
*Fields include:&lt;br /&gt;
**R&amp;amp;D Expenditure&lt;br /&gt;
**Address (inc. city, zip, state)&lt;br /&gt;
&lt;br /&gt;
Output file is COMPUSTATSummary.txt. It contains:&lt;br /&gt;
*Variables: City, year, No.public firms, sum R&amp;amp;D, sum Sales, sum total assets&lt;br /&gt;
*1979-2016&lt;br /&gt;
*4440 cities&lt;br /&gt;
&lt;br /&gt;
==NSF Data==&lt;br /&gt;
Data is in:&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: nsf_2017.sql&lt;br /&gt;
&lt;br /&gt;
The source file is nsf2017.txt, copied from table titled '''nsf''' in the biotech db.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*Award ID&lt;br /&gt;
*Award Institution&lt;br /&gt;
*Award Effective date&lt;br /&gt;
*Institution city&lt;br /&gt;
*Award Value&lt;br /&gt;
From 1900 - 2017&lt;br /&gt;
&lt;br /&gt;
Output file is nsfSummary.txt. It contains:&lt;br /&gt;
*Variables: City, year, nogrants, valuegrant &lt;br /&gt;
*1900-2017&lt;br /&gt;
&lt;br /&gt;
City names are not unique: e.g. NEW YORK and New York are treated as two different cities, so their data needs to be merged.&lt;br /&gt;
*3854 cities&lt;br /&gt;
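&lt;br /&gt;
One way to merge rows for cities that differ only in case or spacing, sketched under assumptions: the row layout follows the nsfSummary variables (City, year, nogrants, valuegrant), while the function name and sample values are hypothetical.&lt;br /&gt;

```python
from collections import defaultdict


def merge_city_rows(rows):
    """Combine (city, year, nogrants, valuegrant) rows whose city names differ only in case or spacing."""
    totals = defaultdict(lambda: [0, 0.0])
    for city, year, nogrants, valuegrant in rows:
        key = (" ".join(city.split()).upper(), year)
        totals[key][0] += nogrants
        totals[key][1] += valuegrant
    return {key: tuple(vals) for key, vals in totals.items()}


# Made-up sample rows for illustration only.
rows = [
    ("NEW YORK", 2016, 3, 300.0),
    ("New York", 2016, 2, 150.0),
    ("Houston", 2016, 1, 50.0),
]
merged = merge_city_rows(rows)
print(merged[("NEW YORK", 2016)])
```

Because the summary has no state field, same-named cities in different states would also be merged by this key.&lt;br /&gt;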
&lt;br /&gt;
==NIH Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: nih2017.sql&lt;br /&gt;
The source files are: &lt;br /&gt;
*nih_1986_2001.csv&lt;br /&gt;
*nih_2002_2012.txt&lt;br /&gt;
*nih_2013_2015&lt;br /&gt;
located in E:\McNair\Projects\Federal Grant Data\NIH&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The script that cleans NIH data and generates the summary table is titled '''nihSummary'''. It is located here:&lt;br /&gt;
&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
This table includes&lt;br /&gt;
*year&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*country&lt;br /&gt;
*nogrants (number of grants)&lt;br /&gt;
*valuegrant&lt;br /&gt;
*city_state (the city-state ID that we'll merge on)&lt;br /&gt;
&lt;br /&gt;
*Data from 1986-2015&lt;br /&gt;
&lt;br /&gt;
==Clinical Trials Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: ctrials.sql&lt;br /&gt;
The source file is: &lt;br /&gt;
&lt;br /&gt;
*medclinical.txt&lt;br /&gt;
&lt;br /&gt;
located in Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
*Data from 1999-2017&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Population Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: population.sql&lt;br /&gt;
The source files are: &lt;br /&gt;
*pop2000_2009.xlsx&lt;br /&gt;
*pop2010_2016.xlsx&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*State&lt;br /&gt;
*City name	&lt;br /&gt;
*Year	&lt;br /&gt;
*Population Estimates&lt;br /&gt;
&lt;br /&gt;
Data from 2000-2016&lt;br /&gt;
&lt;br /&gt;
===Joined population data===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: merged_population.sql&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The joined table contains:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*Population estimates&lt;br /&gt;
*Year&lt;br /&gt;
*Code derived from the state code and FIPS code&lt;br /&gt;
*State full name&lt;br /&gt;
&lt;br /&gt;
==Income Data==&lt;br /&gt;
&lt;br /&gt;
Raw data was obtained from the US Census Bureau's American Community Survey (ACS).&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\MSA Income_raw.zip &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The master list with MSAs and principal cities is titled '''list2.xls'''. It is located at:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
This master list includes:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA name&lt;br /&gt;
*Principal City&lt;br /&gt;
*State&lt;br /&gt;
*Place code (city code)&lt;br /&gt;
*State Code&lt;br /&gt;
&lt;br /&gt;
This master list was edited to associate each principal city with a single state. E.g. New York is a principal city of the New York-New Jersey MSA, so it was originally associated with the state NY-NJ; '''list''' was edited to put New York with NY.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Cleaned Income data files are in&lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID &lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Total Household Income &lt;br /&gt;
&lt;br /&gt;
The MSA-City-State look up file is titled '''msa_city_state_wcode.txt'''. It is located in &lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID &lt;br /&gt;
&lt;br /&gt;
The SQL file that merges income data from ACS (by MSA - Year) with the MSA-City file is titled '''income.sql'''. It is located here:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
   &lt;br /&gt;
 &lt;br /&gt;
The final income table is in db '''cities''' titled '''merged_income'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Year&lt;br /&gt;
*Total Household Income &lt;br /&gt;
&lt;br /&gt;
The table includes 8780 observations&lt;br /&gt;
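&lt;br /&gt;
The merge of MSA-year income with the MSA-City lookup can be sketched as an inner join on the MSA code. This is not the contents of income.sql; the function name, tuple layouts, and sample values are hypothetical.&lt;br /&gt;

```python
def merge_income_with_cities(income_rows, msa_city_rows):
    """Join (msa_code, year, income) rows to (msa_code, city, state) rows on msa_code."""
    cities_by_msa = {}
    for msa_code, city, state in msa_city_rows:
        cities_by_msa.setdefault(msa_code, []).append((city, state))
    merged = []
    for msa_code, year, income in income_rows:
        # An MSA with several principal cities yields one output row per city.
        for city, state in cities_by_msa.get(msa_code, []):
            merged.append((msa_code, city, state, year, income))
    return merged


# Made-up sample rows; M2 has no city in the lookup and drops out of the join.
income_rows = [("M1", 2005, 1000), ("M1", 2006, 1100), ("M2", 2005, 900)]
msa_city_rows = [("M1", "City A", "TX"), ("M1", "City B", "TX")]
merged = merge_income_with_cities(income_rows, msa_city_rows)
print(len(merged))
```

Since the MSA-level income is repeated for every principal city of that MSA, the merged table has more rows than the MSA-year input.&lt;br /&gt;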
&lt;br /&gt;
==Employment Data==&lt;br /&gt;
&lt;br /&gt;
Data on employment was obtained from the American Community Survey, US Census Bureau.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\Employment Data by MSA&lt;br /&gt;
Cleaned files are in&lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID    &lt;br /&gt;
 &lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Employment rate of individuals 16 years or older&lt;br /&gt;
*Unemployment rate of individuals 16 years or older&lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The SQL file that merges employment data from ACS (by MSA - Year) with the MSA-City file is titled '''Employment.sql'''. &lt;br /&gt;
The file is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The final table is in db '''cities''' titled '''merged_employment'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*Year&lt;br /&gt;
*Employment rate&lt;br /&gt;
*Unemployment rate&lt;br /&gt;
&lt;br /&gt;
==Schooling Data==&lt;br /&gt;
&lt;br /&gt;
Data on schooling was obtained from the American Community Survey, US Census Bureau.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\School Enrollment Data by MSA&lt;br /&gt;
Cleaned files are in&lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID    &lt;br /&gt;
 &lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Total population 3 years and over enrolled in school&lt;br /&gt;
*Percent of population 3 years and over enrolled in public school&lt;br /&gt;
*Percent of population 3 years and over enrolled in private school &lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The SQL file that merges schooling data from ACS (by MSA - Year) with the MSA-City file is titled '''schooling.sql'''. &lt;br /&gt;
The file is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The final table is in db '''cities''' titled '''merged_schooling'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*Year&lt;br /&gt;
*Total&lt;br /&gt;
*Percent_public_schooling&lt;br /&gt;
*Percent_private_schooling&lt;br /&gt;
&lt;br /&gt;
===Joined schooling data===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: merged_schooling.sql&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The joined table contains:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*Total school enrollment&lt;br /&gt;
*Percentage enrolled in public schools&lt;br /&gt;
*Percentage enrolled in private schools&lt;br /&gt;
*Year&lt;br /&gt;
*Code derived from the state code and FIPS code&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hubs&amp;diff=19105</id>
		<title>Hubs</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hubs&amp;diff=19105"/>
		<updated>2017-06-30T21:12:51Z</updated>

		<summary type="html">&lt;p&gt;HiraF: /* NIH Data */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Hubs&lt;br /&gt;
|Has owner=Hira Farooqi,&lt;br /&gt;
|Has keywords=Data&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
The Hubs Research Project is a full-length academic paper analyzing the effectiveness of &amp;quot;hubs&amp;quot;, a component of the entrepreneurship ecosystem, in the advancement and growth of entrepreneurial success in a metropolitan area. &lt;br /&gt;
&lt;br /&gt;
This research will primarily focus on large and mid-sized Metropolitan Statistical Areas (MSAs), as that is where the great majority of Venture Capital funding is located.&lt;br /&gt;
&lt;br /&gt;
===Primary Data Set===&lt;br /&gt;
The Hubs data set, from SDC Platinum, has been constructed on the server:&lt;br /&gt;
 Data files are in 128.42.44.181/bulk/Hubs&lt;br /&gt;
 All files are in 128.42.44.182/bulk/Projects/Ecosystem/Hubs&lt;br /&gt;
 psql Hubs&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The data set includes all United States Venture Capital transactions (moneytree) from the twenty-five year period of 1990 through 2015.&lt;br /&gt;
Data has been aggregated at the portfolio company, fund, and round level. It will be analyzed at the combined MSA level. We will look at it in terms of the number of companies funded, the number of funds active, and the flow of investment in a given MSA.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The data set has now been uploaded to the database server, named Hubs.&lt;br /&gt;
There are 4 tables: &lt;br /&gt;
*Rounds: Rounddate, coname, state, roundno, stage1, etc.&lt;br /&gt;
*CombinedRounds: Coname, rounddate, discamount, fundname&lt;br /&gt;
*Companies: LastInv, FirstInv, coname, MSA, MSACode, Address, state, datefounded, totalknownfunding, industry(major)&lt;br /&gt;
*Funds: fundname, closingdate, lastinv, firstinv, msa, msacode, avinv, nocoinv, totalknowninv, address&lt;br /&gt;
&lt;br /&gt;
Used variables:&lt;br /&gt;
&lt;br /&gt;
 Companies: Coname, MSACode, Industry, state&lt;br /&gt;
 MSALookupTable: MSACode, MSASuper&lt;br /&gt;
 IndustryLookupTable: IndustryMajor, InduCode&lt;br /&gt;
 -&amp;gt; &lt;br /&gt;
 CompanyInfo: Coname, MSASuper, InduCode, state (complete)&lt;br /&gt;
&lt;br /&gt;
 Funds: fundname, msacode, state&lt;br /&gt;
 MSALookupTable: MSACode, MSASuper &lt;br /&gt;
 -&amp;gt; &lt;br /&gt;
 FundInfo: fundname, msacode, state (complete)&lt;br /&gt;
&lt;br /&gt;
 Rounds: coname, rounddate, stagecode, roundno&lt;br /&gt;
 CombinedRounds: coname, rounddate, discamount, fundname&lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 RoundInfoSuper: coname, rounddate, '''nofunds''', discamount   &lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 RoundInfo: Coname, roundyear, fundname, estamount (complete)&lt;br /&gt;
&lt;br /&gt;
Then take:&lt;br /&gt;
 RoundInfo: Coname, roundyear, fundname, estamount&lt;br /&gt;
 CompanyInfo: Coname, MSASuper, InduCode, state&lt;br /&gt;
 FundInfo: fundname, msacode, state&lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 SuperRoundInfo: Coname, CoMSASuper, CoInduCode, CoState, FundName, FundMSASuper, FundState, RoundYear, RoundEstAmount&lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 MSAPortCos: Count(CoName) As NoPortCosFunded, CoMSASuper, RoundYear&lt;br /&gt;
 ...&lt;br /&gt;
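&lt;br /&gt;
A sketch of the RoundInfoSuper to RoundInfo step above, assuming that estamount is the disclosed amount split equally across the distinct funds in a round (nofunds). The equal split, the ISO date format, and the function name are assumptions rather than the project's actual SQL.&lt;br /&gt;

```python
from collections import defaultdict


def estimate_round_amounts(combined_rounds):
    """Split each round's disclosed amount equally across its distinct funds.

    combined_rounds holds (coname, rounddate, discamount, fundname) tuples.
    Returns (coname, roundyear, fundname, estamount) tuples.
    """
    funds = defaultdict(set)
    amounts = {}
    for coname, rounddate, discamount, fundname in combined_rounds:
        funds[(coname, rounddate)].add(fundname)
        amounts[(coname, rounddate)] = discamount
    result = []
    for (coname, rounddate), names in sorted(funds.items()):
        # nofunds is the number of distinct funds in the round syndicate.
        estamount = amounts[(coname, rounddate)] / len(names)
        roundyear = int(rounddate[:4])  # assumes ISO dates such as '2001-03-15'
        for fundname in sorted(names):
            result.append((coname, roundyear, fundname, estamount))
    return result


# Made-up sample round with three syndicate members.
rows = [
    ("Acme", "2001-03-15", 3000000.0, "Fund A"),
    ("Acme", "2001-03-15", 3000000.0, "Fund B"),
    ("Acme", "2001-03-15", 3000000.0, "Fund C"),
]
result = estimate_round_amounts(rows)
for row in result:
    print(row)
```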
&lt;br /&gt;
'''Notes on Creation of Primary Data Set'''&lt;br /&gt;
&lt;br /&gt;
Raw tables&lt;br /&gt;
* companies (last investment, first investment, company name, MSA, MSA code, address, state, date founded, known funding, industry) &lt;br /&gt;
* funds (fund closing date, last investment, first investment, fund name, address, MSA, MSA code, Average investment, number companies invested (NoCos), known investment) &lt;br /&gt;
* rounds (round date, company name, state, round number, stage 1, stage 2, stage 3) &lt;br /&gt;
* combined rounds (company name, round date, disclosed amount, investor) &lt;br /&gt;
* msalist (changes MSAs to CMSAs— combined MSAs)&lt;br /&gt;
*industry list (changes 6 industry categories to 4— ICT, Life Sciences, Semiconductors, Other) &lt;br /&gt;
&lt;br /&gt;
Process&lt;br /&gt;
*cleaned tables to eliminate duplications, undisclosed variables&lt;br /&gt;
*changed all original characters to include CMSA and Industry Codes (companyinfo3, fundinfocleanfinal, roundinfoclean) &lt;br /&gt;
*matched funds to avoid any issues with names (i.e. Fund ABC L.P./Fund ABC LP/Fund ABC) &lt;br /&gt;
*matched roundinfoclean investors to fundinfocleanfinal investors (roundinfo.txt &amp;gt;&amp;gt; cleanfundfinal.txt)&lt;br /&gt;
*join by round and company conames&lt;br /&gt;
*bridge years (1990-2016), stage, and cmsa&lt;br /&gt;
* populate data with count of companies (Deal flow) and estimated amount ($)&lt;br /&gt;
** data set in 181 hubs folder under summarycmsa.txt (38394)&lt;br /&gt;
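&lt;br /&gt;
The fund-name matching step (Fund ABC L.P./Fund ABC LP/Fund ABC) could be approached by normalizing names before joining. The suffix list and function name below are illustrative assumptions, not the matching rules actually used.&lt;br /&gt;

```python
import re

# Trailing legal-form tokens to drop; this list is an illustrative assumption.
LEGAL_SUFFIXES = {"lp", "llc", "inc", "ltd"}


def normalize_fund_name(name):
    """Lowercase, remove punctuation, and drop trailing legal-form suffixes."""
    no_dots = name.lower().replace(".", "")
    cleaned = re.sub(r"[^a-z0-9]+", " ", no_dots)
    tokens = cleaned.split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


variants = ["Fund ABC L.P.", "Fund ABC LP", "Fund ABC"]
normalized = {normalize_fund_name(v) for v in variants}
print(normalized)
```

All three variants collapse to the same key, which can then be used to join round investors to funds.&lt;br /&gt;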
&lt;br /&gt;
Key decisions:&lt;br /&gt;
*Threw out undisclosed companies throughout, as they have no address&lt;br /&gt;
*Count is done by joining round and company&lt;br /&gt;
*Anything fund-related must use a disclosed fund&lt;br /&gt;
*Near and far, total invested, fund counts, etc., are all computed using only disclosed funds that match&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Glossary of Tables'''&lt;br /&gt;
 cleanco — used to remove duplicates from companies&lt;br /&gt;
 cleanedcompanies — clean set of companies with no duplicates&lt;br /&gt;
 cmsafunds- &lt;br /&gt;
 cmsas— list of all CMSAs in final data set (for merging) &lt;br /&gt;
 cmsastats- statistics not including empty years (pre-merge) &lt;br /&gt;
 cmsastats2 - statistics separated by year-MSA&lt;br /&gt;
 cmsastats3— statistics separated by year-MSA-stage&lt;br /&gt;
 cmsastats4&lt;br /&gt;
 cmsayears— empty merged table between year and cmsa&lt;br /&gt;
 cmsayearstage — empty merged table between cmsa/years and stage&lt;br /&gt;
 combinedrounds— raw sdc data for combined rounds&lt;br /&gt;
 combinedroundswamt— used to join rounds and combined rounds for roundinfo2&lt;br /&gt;
 companies- raw SDC company data&lt;br /&gt;
 companyinfo — cleaned companies joined with state and CMSA information&lt;br /&gt;
 companyinfo2— companyinfo1 with original industry categories&lt;br /&gt;
 companyinfo3— companyinfo2 with updated industry categories and codes&lt;br /&gt;
 companyinfo4-- clean version of companyinfo3&lt;br /&gt;
 companyround- combined company information with round information&lt;br /&gt;
 companyround2- combined company information with round information, cleaned up from companyround&lt;br /&gt;
 companyround3- combined company information with round information, cleaned up from companyround2&lt;br /&gt;
 '''finaldataset'''- final statistics by CMSA-year, see section Final Primary Data Set for more information&lt;br /&gt;
 fundinfo— funds joined with CMSA info&lt;br /&gt;
 fundinfo2 - clean version of fundinfo1&lt;br /&gt;
 fundinfoclean - used in process to clean fundinfo2&lt;br /&gt;
 fundinfoclean2- used in process to clean fundinfo2&lt;br /&gt;
 fundinfocleanfinal- used in process to clean fundinfo2&lt;br /&gt;
 fundinfocleannodups- final clean set of fundinfo&lt;br /&gt;
 funds - raw SDC fund data&lt;br /&gt;
 Houston - analysis for Houston ecosystem team&lt;br /&gt;
 Houston2- analysis for Houston ecosystem team&lt;br /&gt;
 houston3- analysis for Houston ecosystem team&lt;br /&gt;
 industry — new industry codes (4)— used for all future data sets&lt;br /&gt;
 industrylist— lookup table for new industry codes (went from 6 to 4) &lt;br /&gt;
 joined1- used for matching process&lt;br /&gt;
 joined2- used for matching process&lt;br /&gt;
 matchfund2- used for matching process&lt;br /&gt;
 matchfunds- used for matching process&lt;br /&gt;
 matchroundfund - used for matching process&lt;br /&gt;
 matchroundfund2- used for matching process&lt;br /&gt;
 msalist — lookup table for MSA to CMSA (used for all future data sets) &lt;br /&gt;
 nearfar1-- beginning set before adding nearfar/stage variables &lt;br /&gt;
 nearfar2 -- added binary variables for near/far and for each of the stages, used to build final dataset&lt;br /&gt;
 roundfund— not used— joined round to fund; drop/ignore&lt;br /&gt;
 roundinfo— round info cleaned up to include number of investors in a syndicate and estimated investment per member of syndicate&lt;br /&gt;
 roundinfo2— roundinfo1 including name of investors/funds&lt;br /&gt;
 roundinfo3— clean version of roundinfo2&lt;br /&gt;
 roundinfoclean — final clean version of roundinfo3 (final roundinfo table)&lt;br /&gt;
 rounds — raw SDC round data&lt;br /&gt;
 stages — table for merging stage-year-CMSA&lt;br /&gt;
 superinfo — ignore/drop&lt;br /&gt;
 temp - used for matching process&lt;br /&gt;
 years — table for merging stage-year-CMSA&lt;br /&gt;
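&lt;br /&gt;
The empty merged tables (cmsayears, cmsayearstage) amount to a cross product of CMSA, year, and stage that is then filled with observed statistics, so that empty years appear as zeros. A minimal sketch, with hypothetical function names and stage labels:&lt;br /&gt;

```python
from itertools import product


def build_skeleton(cmsas, years, stages):
    """Return every (cmsa, year, stage) combination, each starting with zero deals."""
    return {key: 0 for key in product(cmsas, years, stages)}


def fill_counts(skeleton, observed):
    """Copy observed deal counts into the skeleton, leaving unobserved cells at zero."""
    filled = dict(skeleton)
    for key, count in observed.items():
        if key in filled:
            filled[key] = count
    return filled


# Two made-up CMSA codes, the years 1990-2016, and four assumed stage labels.
skeleton = build_skeleton([1, 2], range(1990, 2017), ["seed", "early", "later", "other"])
filled = fill_counts(skeleton, {(1, 2001, "seed"): 4})
print(len(filled), filled[(1, 2001, "seed")], filled[(2, 1995, "later")])
```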
&lt;br /&gt;
===Hub Candidates Data Set===&lt;br /&gt;
&lt;br /&gt;
The Hubs candidate data set is a list of potential hubs found in MSAs throughout the country. Researchers are currently pulling qualitative and quantitative information from the candidates' websites, in an attempt to categorize what can be identified as a hub. This is a difficult data set to pull: there is little to no quantitative information available for this category of institution, and the data set depends on how much information each candidate makes publicly available on the internet.&lt;br /&gt;
&lt;br /&gt;
Characteristics/Variables&lt;br /&gt;
*Year Founded&lt;br /&gt;
*Square footage&lt;br /&gt;
*LinkedIn self-identifiers (how the organization classifies itself on its LinkedIn profile)&lt;br /&gt;
*Activeness on Twitter (binary)&lt;br /&gt;
*Member Directory available online (binary)&lt;br /&gt;
*Number of conference rooms&lt;br /&gt;
*Price ($/month) for Flex desk&lt;br /&gt;
*Offers Reserved desk (binary)&lt;br /&gt;
*Offers office space for rent (binary)&lt;br /&gt;
*Offers community membership, not for coworking but for community events, etc. (binary)&lt;br /&gt;
*Number of events offered per month (estimate)&lt;br /&gt;
*Offers code academy&lt;br /&gt;
*Mission Statement/Vision (for qualitative or key-word analysis) &lt;br /&gt;
&lt;br /&gt;
These characteristics/variables will be used to determine whether a candidate is or is not likely to be a Hub. &lt;br /&gt;
&lt;br /&gt;
As of March 10th 2016, the list contains 125 Hub candidates.&lt;br /&gt;
&lt;br /&gt;
'''Where to find''': The Hubs data set can be found in the Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;dataset folder. It is not currently in the database due to a UTF8 issue&lt;br /&gt;
&lt;br /&gt;
===Supplementary Data Sets===&lt;br /&gt;
'''Patent data''': to be pulled from USPTO or SDC Platinum. &lt;br /&gt;
&lt;br /&gt;
'''Number of STEM Graduate Students''' (NSF) and '''University R&amp;amp;D Spending''' (NSF):&lt;br /&gt;
*University R&amp;amp;D Data found under file &amp;quot;NSF DATA_2004 to 2011.xlsx&amp;quot; in datasets folder (Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;Datasets)&lt;br /&gt;
*R&amp;amp;D spending found at the university level for 2014 (&amp;quot;Stem Grad Students.xlsx&amp;quot;) or at state level (&amp;quot;Science and Engineering Grad Students by State and Year 2000-2011.csv&amp;quot;)&lt;br /&gt;
** not uploaded to server or matched yet to CMSA code, because of this discrepancy. &lt;br /&gt;
**&amp;quot;Stem Grad Students.xlsx&amp;quot; contains categorized university by MSA, can be used for all university-based projects&lt;br /&gt;
&lt;br /&gt;
'''Per Capita Income''' and '''Employment Data''' (US Census Bureau): &lt;br /&gt;
*&amp;quot;Per Capita Personal Income by MSA 2000-2012.xlsx&amp;quot; in datasets folder (Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;Datasets&amp;gt;&amp;gt;Data from Yael)&lt;br /&gt;
*&amp;quot;Wages and Salaries by MSA 2000-2012.xlsx&amp;quot; in datasets folder (Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;datasets&amp;gt;&amp;gt;Data from Yael)&lt;br /&gt;
**not uploaded to server or matched yet to CMSA code&lt;br /&gt;
&lt;br /&gt;
'''Firm Births''' (BDS)&lt;br /&gt;
*in server 181, under table name &amp;quot;BDS&amp;quot;&lt;br /&gt;
*includes birth, death, net(birth-death) and rate(death rate) for years 1990-2013 for every msa&lt;br /&gt;
*includes code for CMSA but is not aggregated by CMSA&lt;br /&gt;
** i.e. BDS statistics are still separate for all the smaller MSAs in New York's CMSA (code=1)&lt;br /&gt;
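&lt;br /&gt;
Aggregating the BDS table to the CMSA level could look like the sketch below: births and deaths are summed over the MSAs sharing a CMSA code and net is recomputed. The tuple layout and function name are hypothetical. Rates are left out because they cannot simply be summed.&lt;br /&gt;

```python
from collections import defaultdict


def aggregate_bds_by_cmsa(rows):
    """Sum births and deaths over the MSAs in each (cmsa, year) and recompute net."""
    totals = defaultdict(lambda: [0, 0])
    for cmsa, year, births, deaths in rows:
        totals[(cmsa, year)][0] += births
        totals[(cmsa, year)][1] += deaths
    return {
        key: {"birth": b, "death": d, "net": b - d}
        for key, (b, d) in totals.items()
    }


# Made-up rows: two MSAs in CMSA 1 and one MSA in CMSA 2.
rows = [(1, 2010, 500, 450), (1, 2010, 120, 130), (2, 2010, 80, 60)]
aggregated = aggregate_bds_by_cmsa(rows)
print(aggregated[(1, 2010)])
```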
&lt;br /&gt;
===Resources===&lt;br /&gt;
* Yael Hochberg and Fehder (2015), located in dropbox&lt;br /&gt;
** Use this paper as a guideline on how to conduct the analysis&lt;br /&gt;
*US Census Bureau data on employment by MSA: http://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_14_5YR_B23027&amp;amp;prodType=table&lt;br /&gt;
*USPTO utility patents by MSA: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/cls_cbsa/allcbsa_gd.htm&lt;br /&gt;
*MSA level trends: http://www.metrotrends.org/data.cf&lt;br /&gt;
&lt;br /&gt;
===The Target Dataset===&lt;br /&gt;
&lt;br /&gt;
We will need to process the following variables:&lt;br /&gt;
*SuperMSA - combine SanFran and SanJose, New York and Newark?, NC Research triangle, others?&lt;br /&gt;
*CSV mapping msas to cmsas is in the folder (and a table in the dbase)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Example dataset:&lt;br /&gt;
 MSA      Year       SeedVCInv      SeedEarlyVCInv      LaterVCInv     NoDeals   FundsInvested   DistinctInvestors   ....&lt;br /&gt;
 ----------------------------------------------------------------------------------------------------------------------------&lt;br /&gt;
 1234     2001       1000000        20000000            30000000       4          7              7&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Note that the unit of observation is MSA-Year.&lt;br /&gt;
&lt;br /&gt;
Variables to be computed at the MSA level:&lt;br /&gt;
*HubActive (binary)&lt;br /&gt;
*NoHubsActive (Count)&lt;br /&gt;
*HubSqFt&lt;br /&gt;
*Other Hub Vars (build list!!!)&lt;br /&gt;
*'''SeedVCInv'''  (Seed/Start-up)&lt;br /&gt;
*'''EarlyVCInv''' (Early Stage)&lt;br /&gt;
*'''LaterStageVC''' (Later)&lt;br /&gt;
*'''OtherStageVC''' (Buyout/Acq, Other)&lt;br /&gt;
*'''NoDeals''' (done by local VCs?)&lt;br /&gt;
**'''NoDealsNear'''&lt;br /&gt;
**'''NoDealsFar'''&lt;br /&gt;
*NoPortCosFunded&lt;br /&gt;
*'''FundsInv''' (in an MSA)&lt;br /&gt;
**'''FundsInvFromNear''' (within MSA?)&lt;br /&gt;
**'''FundsInvFromFar''' (outside MSA?)&lt;br /&gt;
*DistinctInvestors (?)&lt;br /&gt;
**DistinctInvestorsNear (within MSA?)&lt;br /&gt;
**DistinctInvestorsFar (outside MSA?)&lt;br /&gt;
*PatentCount&lt;br /&gt;
*NoSTEMGrads&lt;br /&gt;
*FirmBirths (BDS data)&lt;br /&gt;
*UniRandDSpend&lt;br /&gt;
*PerCapitaIncome&lt;br /&gt;
*Employment&lt;br /&gt;
&lt;br /&gt;
We need to:&lt;br /&gt;
*Check funds invested means dollars invested&lt;br /&gt;
*Categorize near and far! Is it within MSA vs. not, within adjacent MSAs, etc.?&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There may be a second dataset that has Hub-Industry-Year (where industry is semiconductor/non-semiconductor?).&lt;br /&gt;
&lt;br /&gt;
===Final Primary Data Set===&lt;br /&gt;
&lt;br /&gt;
*A deal is a round syndicate (a near/far deal is one with at least one investor that is near/far).&lt;br /&gt;
&lt;br /&gt;
Table name: finaldataset&lt;br /&gt;
 cmsa&lt;br /&gt;
 year&lt;br /&gt;
 totalamountinv--total amount invested &lt;br /&gt;
 nearamountinv--amount invested from local funds&lt;br /&gt;
 faramountinv-- amount invested from funds outside CMSA &lt;br /&gt;
 earlyinv--amount invested in early stage companies &lt;br /&gt;
 laterinv--amount invested in later stage companies &lt;br /&gt;
 startupseedinv--amount invested in seed or startup stage companies &lt;br /&gt;
 otherstageinv--amount invested in Acquisition/Buy-outs/Other stage companies &lt;br /&gt;
 investingfund--distinct funds that are investing in that CMSA-year &lt;br /&gt;
 investingfundnear--distinct funds from that CMSA that invested in that CMSA-year &lt;br /&gt;
 investingfundfar--distinct funds from outside that CMSA that invested in that CMSA-year &lt;br /&gt;
 deals--number of deals &lt;br /&gt;
 neardeals--number of deals inside a CMSA &lt;br /&gt;
 fardeals--number of deals from outside a CMSA; some deals might count in both the near and far categories, because a syndicate can have members both inside and outside the CMSA&lt;br /&gt;
 earlystagedeals--deals with early stage companies&lt;br /&gt;
 laterstagedeals--deals with later stage companies &lt;br /&gt;
 startupseeddeals--deals with startup/seed companies &lt;br /&gt;
 otherstagedeals--deals with companies in other stages &lt;br /&gt;
 newportcosfunded--number of portfolio companies to receive their first investment in that year&lt;br /&gt;
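&lt;br /&gt;
The near/far deal counts can be illustrated as follows, assuming that near means a fund's CMSA equals the portfolio company's CMSA. The function names and sample deals are hypothetical; the point is that a mixed syndicate counts as both a near and a far deal.&lt;br /&gt;

```python
def classify_deal(company_cmsa, fund_cmsas):
    """Return (is_near, is_far) for one deal given the CMSAs of its syndicate funds."""
    is_near = any(cmsa == company_cmsa for cmsa in fund_cmsas)
    is_far = any(cmsa != company_cmsa for cmsa in fund_cmsas)
    return is_near, is_far


def count_deals(deals):
    """Count total, near, and far deals; a mixed syndicate counts as both near and far."""
    counts = {"deals": 0, "neardeals": 0, "fardeals": 0}
    for company_cmsa, fund_cmsas in deals:
        is_near, is_far = classify_deal(company_cmsa, fund_cmsas)
        counts["deals"] += 1
        counts["neardeals"] += int(is_near)
        counts["fardeals"] += int(is_far)
    return counts


# Made-up deals for a company in CMSA 1: all-local, mixed, and all-outside syndicates.
deals = [(1, [1, 1]), (1, [1, 2]), (1, [3])]
counts = count_deals(deals)
print(counts)
```

Here neardeals plus fardeals exceeds deals because the mixed syndicate is counted in both categories.&lt;br /&gt;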
&lt;br /&gt;
===Data by zip code===&lt;br /&gt;
*Population data, 2000-2016 - US Census Bureau (E:\McNair\Hubs\summer 2017)&lt;br /&gt;
https://www2.census.gov/programs-surveys/popest/datasets/&lt;br /&gt;
*Income data, 1998-2014 - The Internal Revenue Service (E:\McNair\Hubs\summer 2017)&lt;br /&gt;
https://www.irs.gov/uac/about-irs&lt;br /&gt;
*DCI index, to assess the economic well-being of communities&lt;br /&gt;
http://eig.org/dci/interactive-maps/u-s-zip-codes&lt;br /&gt;
*R&amp;amp;D Expenses, 1980-2016 - Wharton Research Data Services (E:\McNair\Hubs\summer 2017)&lt;br /&gt;
*Zipcode look-up table obtained from https://www.unitedstateszipcodes.org/zip-code-database/. It's available in (E:\McNair\Hubs\summer 2017).&lt;br /&gt;
&lt;br /&gt;
== Data by MSA ==&lt;br /&gt;
&lt;br /&gt;
We have principal cities of MSAs from the census:&lt;br /&gt;
https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html&lt;br /&gt;
&lt;br /&gt;
We might be able to go City -&amp;gt; FIPS place code -&amp;gt; MSA?&lt;br /&gt;
&lt;br /&gt;
Cities and their FIPS codes (which don't perfectly correspond) are available from https://www.census.gov/geo/reference/codes/place.html&lt;br /&gt;
&lt;br /&gt;
The Census claims to provide city to MSA here: https://www.census.gov/geo/maps-data/data/ua_rel_download.html&lt;br /&gt;
However, there is only CBSA!&lt;br /&gt;
&lt;br /&gt;
This might do it: https://www2.census.gov/geo/pdfs/maps-data/data/rel/explanation_ua_cbsa_rel_10.pdf&lt;br /&gt;
We can maybe track city to principal city to MSA&lt;br /&gt;
&lt;br /&gt;
==COMPUSTAT Data==&lt;br /&gt;
&lt;br /&gt;
Data is in:&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: COMPUSTAT.sql&lt;br /&gt;
&lt;br /&gt;
The source file is RandDExpenditures.txt. It contains:&lt;br /&gt;
*Data from 1980-2017 (July). All COMPUSTAT.&lt;br /&gt;
*427799 records&lt;br /&gt;
*Fields include:&lt;br /&gt;
**R&amp;amp;D Expenditure&lt;br /&gt;
**Address (inc. city, zip, state)&lt;br /&gt;
&lt;br /&gt;
Output file is COMPUSTATSummary.txt. It contains:&lt;br /&gt;
*Variables: City, year, No.public firms, sum R&amp;amp;D, sum Sales, sum total assets&lt;br /&gt;
*1979-2016&lt;br /&gt;
*4440 cities&lt;br /&gt;
&lt;br /&gt;
==NSF Data==&lt;br /&gt;
Data is in:&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: nsf_2017.sql&lt;br /&gt;
&lt;br /&gt;
The source file is nsf2017.txt, copied from table titled '''nsf''' in the biotech db.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*Award ID&lt;br /&gt;
*Award Institution&lt;br /&gt;
*Award Effective date&lt;br /&gt;
*Institution city&lt;br /&gt;
*Award Value&lt;br /&gt;
From 1900 - 2017&lt;br /&gt;
&lt;br /&gt;
Output file is nsfSummary.txt. It contains:&lt;br /&gt;
*Variables: City, year, nogrants, valuegrant &lt;br /&gt;
*1900-2017&lt;br /&gt;
&lt;br /&gt;
City names are not unique: e.g. NEW YORK and New York are treated as two different cities, so their data needs to be merged.&lt;br /&gt;
*3854 cities&lt;br /&gt;
&lt;br /&gt;
==NIH Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: nih2017.sql&lt;br /&gt;
The source files are: &lt;br /&gt;
*nih_1986_2001.csv&lt;br /&gt;
*nih_2002_2012.txt&lt;br /&gt;
*nih_2013_2015&lt;br /&gt;
located in E:\McNair\Projects\Federal Grant Data\NIH&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The script that cleans NIH data and generates the summary table is titled '''nihSummary'''. It is located here:&lt;br /&gt;
&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
This table includes&lt;br /&gt;
*year&lt;br /&gt;
*city&lt;br /&gt;
*state&lt;br /&gt;
*country&lt;br /&gt;
*nogrants (number of grants)&lt;br /&gt;
*valuegrant&lt;br /&gt;
&lt;br /&gt;
*Data from 1986-2015&lt;br /&gt;
&lt;br /&gt;
==Clinical Trials Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: ctrials.sql&lt;br /&gt;
The source file is: &lt;br /&gt;
&lt;br /&gt;
*medclinical.txt&lt;br /&gt;
&lt;br /&gt;
located in Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
*Data from 1999-2017&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Population Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: population.sql&lt;br /&gt;
The source files are: &lt;br /&gt;
*pop2000_2009.xlsx&lt;br /&gt;
*pop2010_2016.xlsx&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*State&lt;br /&gt;
*City name	&lt;br /&gt;
*Year	&lt;br /&gt;
*Population Estimates&lt;br /&gt;
&lt;br /&gt;
Data from 2000-2016&lt;br /&gt;
&lt;br /&gt;
===Joined population data===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: merged_population.sql&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The joined table contains:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*Population estimates&lt;br /&gt;
*Year&lt;br /&gt;
*Code derived from the state code and FIPS code&lt;br /&gt;
*State full name&lt;br /&gt;
&lt;br /&gt;
==Income Data==&lt;br /&gt;
&lt;br /&gt;
Raw data was obtained from the US Census Bureau's American Community Survey (ACS).&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\MSA Income_raw.zip &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The master list with MSAs and principal cities is titled '''list2.xls'''. It is located at:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
This master list includes:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA name&lt;br /&gt;
*Principal City&lt;br /&gt;
*State&lt;br /&gt;
*Place code (city code)&lt;br /&gt;
*State Code&lt;br /&gt;
&lt;br /&gt;
This master list was edited to associate each principal city with a single state. E.g., New York is the principal city of the New York-New Jersey MSA and was originally associated with state NY-NJ, so '''list''' was edited to put New York with NY. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Cleaned Income data files are in&lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID &lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Total Household Income &lt;br /&gt;
&lt;br /&gt;
The MSA-City-State look up file is titled '''msa_city_state_wcode.txt'''. It is located in &lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID &lt;br /&gt;
&lt;br /&gt;
The SQL file that merges income data from ACS (by MSA - Year) with the MSA-City file is titled '''income.sql'''. It is located here:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
   &lt;br /&gt;
 &lt;br /&gt;
The final income table is in db '''cities''' titled '''merged_income'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Year&lt;br /&gt;
*Total Household Income &lt;br /&gt;
&lt;br /&gt;
The table includes 8780 observations&lt;br /&gt;
&lt;br /&gt;
==Employment Data==&lt;br /&gt;
&lt;br /&gt;
Data on employment was obtained from the American Community Survey (ACS), US Census Bureau.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\Employment Data by MSA&lt;br /&gt;
Cleaned files are in&lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID    &lt;br /&gt;
 &lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Employment rate of individuals 16 years or older&lt;br /&gt;
*Unemployment rate of individuals 16 years or older&lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The SQL file that merges employment data from ACS (by MSA - Year) with the MSA-City file is titled '''Employment.sql'''. &lt;br /&gt;
The file is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The final table is in db '''cities''' titled '''merged_employment'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*Year&lt;br /&gt;
*Employment rate&lt;br /&gt;
*Unemployment rate&lt;br /&gt;
&lt;br /&gt;
==Schooling Data==&lt;br /&gt;
&lt;br /&gt;
Data on schooling was obtained from the American Community Survey (ACS), US Census Bureau.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\School Enrollment Data by MSA&lt;br /&gt;
Cleaned files are in&lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID    &lt;br /&gt;
 &lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Total number of population 3 years and over enrolled in school&lt;br /&gt;
*Percent of population 3 years and over enrolled in public school&lt;br /&gt;
*Percent of population 3 years and over enrolled in private school &lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The SQL file that merges schooling data from ACS (by MSA - Year) with the MSA-City file is titled '''schooling.sql'''. &lt;br /&gt;
The file is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The final table is in db '''cities''' titled '''merged_schooling'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*Year&lt;br /&gt;
*Total&lt;br /&gt;
*Percent_public_schooling&lt;br /&gt;
*Percent_private_schooling&lt;br /&gt;
&lt;br /&gt;
===Joined schooling data===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: merged_schooling.sql&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*Total number of school enrollment&lt;br /&gt;
*Percentage enrolled in public schools&lt;br /&gt;
*Percentage enrolled in private schools&lt;br /&gt;
*Year&lt;br /&gt;
*Code from the state code and Fips code&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hubs&amp;diff=19100</id>
		<title>Hubs</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hubs&amp;diff=19100"/>
		<updated>2017-06-30T20:55:24Z</updated>

		<summary type="html">&lt;p&gt;HiraF: /* Income Data */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Hubs&lt;br /&gt;
|Has owner=Hira Farooqi,&lt;br /&gt;
|Has keywords=Data&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
The Hubs Research Project is a full-length academic paper analyzing the effectiveness of &amp;quot;hubs&amp;quot;, a component of the entrepreneurship ecosystem, in the advancement and growth of entrepreneurial success in a metropolitan area. &lt;br /&gt;
&lt;br /&gt;
This research will primarily focus on large and mid-sized Metropolitan Statistical Areas (MSAs), as that is where the great majority of Venture Capital funding is located. &lt;br /&gt;
&lt;br /&gt;
===Primary Data Set===&lt;br /&gt;
The Hubs data set, from SDC Platinum, has been constructed in the server:&lt;br /&gt;
 Data files are in 128.42.44.181/bulk/Hubs&lt;br /&gt;
 All files are in 128.42.44.182/bulk/Projects/Ecosystem/Hubs&lt;br /&gt;
 psql Hubs&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The data set includes all United States Venture Capital transactions (moneytree) from 1990 through 2015.&lt;br /&gt;
Data has been aggregated at the portfolio company, fund, and round level. It will be analyzed at the combined MSA level, in terms of the number of companies funded, the number of funds active, and the flow of investment in a given MSA.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The data set has now been uploaded to the database server, named Hubs.&lt;br /&gt;
There are 4 tables: &lt;br /&gt;
*Rounds: Rounddate, coname, state, roundno, stage1, etc.&lt;br /&gt;
*CombinedRounds: Coname, rounddate, discamount, fundname&lt;br /&gt;
*Companies: LastInv, FirstInv, coname, MSA, MSACode, Address, state, datefounded, totalknownfunding, industry(major)&lt;br /&gt;
*Funds: fundname, closingdate, lastinv, firstinv, msa, msacode, avinv, nocoinv, totalknowninv, address&lt;br /&gt;
&lt;br /&gt;
Used variables:&lt;br /&gt;
&lt;br /&gt;
 Companies: Coname, MSACode, Industry, state&lt;br /&gt;
 MSALookupTable: MSACode, MSASuper&lt;br /&gt;
 IndustryLookupTable: IndustryMajor, InduCode&lt;br /&gt;
 -&amp;gt; &lt;br /&gt;
 CompanyInfo: Coname, MSASuper, InduCode, state (complete)&lt;br /&gt;
&lt;br /&gt;
 Funds: fundname, msacode, state&lt;br /&gt;
 MSALookupTable: MSACode, MSASuper &lt;br /&gt;
 -&amp;gt; &lt;br /&gt;
 FundInfo: fundname, msacode, state (complete)&lt;br /&gt;
&lt;br /&gt;
 Rounds: coname, rounddate, stagecode, roundno&lt;br /&gt;
 CombinedRounds: coname, rounddate, discamount, fundname&lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 RoundInfoSuper: coname, rounddate, '''nofunds''', discamount   &lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 RoundInfo: Coname, roundyear, fundname, estamount (complete)&lt;br /&gt;
&lt;br /&gt;
Then take:&lt;br /&gt;
 RoundInfo: Coname, roundyear, fundname, estamount&lt;br /&gt;
 CompanyInfo: Coname, MSASuper, InduCode, state&lt;br /&gt;
 FundInfo: fundname, msacode, state&lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 SuperRoundInfo: Coname, CoMSASuper, CoInduCode, CoState, FundName, FundMSASuper, FundState, RoundYear, RoundEstAmount&lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 MSAPortCos: Count(CoName) As NoPortCosFunded, CoMSASuper, RoundYear&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
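The step from RoundInfoSuper to RoundInfo can be sketched as below. This is a minimal illustration, not the project's SQL: it assumes that estamount is the disclosed round amount divided equally among the funds in the syndicate (nofunds), and that rounddate is an ISO-style string. The function name and the sample rows are hypothetical.&lt;br /&gt;

```python
from collections import defaultdict


def estimate_per_fund(combined_rounds):
    """Split each round's disclosed amount equally across its funds.

    combined_rounds: iterable of (coname, rounddate, discamount, fundname).
    Returns a list of (coname, roundyear, fundname, estamount) tuples.
    The equal split is an assumption made for this sketch.
    """
    funds_by_round = defaultdict(set)
    amount_by_round = {}
    for coname, rounddate, discamount, fundname in combined_rounds:
        key = (coname, rounddate)
        funds_by_round[key].add(fundname)
        amount_by_round[key] = discamount

    round_info = []
    for key in sorted(funds_by_round):
        coname, rounddate = key
        funds = funds_by_round[key]
        nofunds = len(funds)  # number of distinct funds in the syndicate
        estamount = amount_by_round[key] / nofunds
        roundyear = int(rounddate[:4])  # assumes dates such as 2001-03-15
        for fundname in sorted(funds):
            round_info.append((coname, roundyear, fundname, estamount))
    return round_info


# Hypothetical sample rows, not taken from the SDC data.
rows = [
    ("Acme", "2001-03-15", 3000000.0, "Fund A"),
    ("Acme", "2001-03-15", 3000000.0, "Fund B"),
    ("Acme", "2001-03-15", 3000000.0, "Fund C"),
    ("Bolt", "2002-07-01", 500000.0, "Fund A"),
]
round_info = estimate_per_fund(rows)
```

Each output row corresponds to one company-round-fund combination, which is the grain that the later SuperRoundInfo join expects.&lt;br /&gt;
&lt;br /&gt;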
'''Notes on Creation of Primary Data Set'''&lt;br /&gt;
&lt;br /&gt;
Raw tables&lt;br /&gt;
* companies (last investment, first investment, company name, MSA, MSA code, address, state, date founded, known funding, industry) &lt;br /&gt;
* funds (fund closing date, last investment, first investment, fund name, address, MSA, MSA code, Average investment, number companies invested (NoCos), known investment) &lt;br /&gt;
* rounds (round date, company name, state, round number, stage 1, stage 2, stage 3) &lt;br /&gt;
* combined rounds (company name, round date, disclosed amount, investor) &lt;br /&gt;
* msalist (changes MSAs to CMSAs— combined MSAs)&lt;br /&gt;
*industry list (changes 6 industry categories to 4— ICT, Life Sciences, Semiconductors, Other) &lt;br /&gt;
&lt;br /&gt;
Process&lt;br /&gt;
*cleaned tables to eliminate duplications, undisclosed variables&lt;br /&gt;
*changed all original characters to include CMSA and Industry Codes (companyinfo3, fundinfocleanfinal, roundinfoclean) &lt;br /&gt;
*matched funds to avoid any issues with names (i.e. Fund ABC L.P./Fund ABC LP/Fund ABC) &lt;br /&gt;
*matched roundinfoclean investors to fundinfocleanfinal investors (roundinfo.txt &amp;gt;&amp;gt; cleanfundfinal.txt)&lt;br /&gt;
*join by round and company conames&lt;br /&gt;
*bridge years (1990-2016), stage, and cmsa&lt;br /&gt;
* populate data with count of companies (Deal flow) and estimated amount ($)&lt;br /&gt;
** data set in 181 hubs folder under summarycmsa.txt (38394)&lt;br /&gt;
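&lt;br /&gt;
The fund-name matching step in the process above (Fund ABC L.P./Fund ABC LP/Fund ABC) could be approximated with a normalization key like the one below. This is only a sketch: the actual matching used its own tooling, and the suffix list and function name here are hypothetical.&lt;br /&gt;

```python
import re

# Hypothetical list of legal-form suffixes to drop; extend as needed.
SUFFIXES = ("lp", "llc", "inc", "ltd")


def normalize_fund_name(name):
    """Return a matching key: lowercase, no punctuation, no legal suffix."""
    key = name.lower().replace(".", "")      # "L.P." becomes "lp"
    key = re.sub(r"[^a-z0-9 ]", " ", key)    # other punctuation becomes a space
    words = key.split()
    while words and words[-1] in SUFFIXES:   # strip trailing legal forms
        words.pop()
    return " ".join(words)


names = ["Fund ABC L.P.", "Fund ABC LP", "Fund ABC"]
keys = {normalize_fund_name(n) for n in names}
```

All three spellings collapse to the same key, so rounds and funds can be joined on the key rather than on the raw name.&lt;br /&gt;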
&lt;br /&gt;
Key decisions:&lt;br /&gt;
*Threw out undisclosed companies throughout, as they have no address&lt;br /&gt;
*Count is done by joining round and company&lt;br /&gt;
*Anything fund related must be disclosed fund&lt;br /&gt;
*Near and far, and total invested, and fund counts, etc., are all done using disclosed funds that match only&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Glossary of Tables'''&lt;br /&gt;
 cleanco — used to remove duplicates from companies&lt;br /&gt;
 cleanedcompanies — clean set of companies with no duplicates&lt;br /&gt;
 cmsafunds- &lt;br /&gt;
 cmsas— list of all CMSAs in final data set (for merging) &lt;br /&gt;
 cmsastats- statistics not including empty years (pre-merge) &lt;br /&gt;
 cmsastats2 - statistics separated by year-MSA&lt;br /&gt;
 cmsastats3— statistics separated by year-MSA-stage&lt;br /&gt;
 cmsastats4&lt;br /&gt;
 cmsayears— empty merged table between year and cmsa&lt;br /&gt;
 cmsayearstage — empty merged table between cmsa/years and stage&lt;br /&gt;
 combinedrounds— raw sdc data for combined rounds&lt;br /&gt;
 combinedroundswamt— used to join rounds and combined rounds for roundinfo2&lt;br /&gt;
 companies- raw SDC company data&lt;br /&gt;
 companyinfo — cleaned companies joined with state and CMSA information&lt;br /&gt;
 companyinfo2— companyinfo1 with original industry categories&lt;br /&gt;
 companyinfo3— companyinfo2 with updated industry categories and codes&lt;br /&gt;
 companyinfo4-- clean version of companyinfo3&lt;br /&gt;
 companyround- combined company information with round information&lt;br /&gt;
 companyround2- combined company information with round information, cleaned up from companyround&lt;br /&gt;
 companyround3- combined company information with round information, cleaned up from companyround2&lt;br /&gt;
 '''finaldataset'''- final statistics by CMSA-year, see section Final Primary Data Set for more information&lt;br /&gt;
 fundinfo— funds joined with CMSA info&lt;br /&gt;
 fundinfo2 - clean version of fundinfo1&lt;br /&gt;
 fundinfoclean - used in process to clean fundinfo2&lt;br /&gt;
 fundinfoclean2- used in process to clean fundinfo2&lt;br /&gt;
 fundinfocleanfinal- used in process to clean fundinfo2&lt;br /&gt;
 fundinfocleannodups- final clean set of fundinfo&lt;br /&gt;
 funds - raw SDC fund data&lt;br /&gt;
 Houston - analysis for Houston ecosystem team&lt;br /&gt;
 Houston2- analysis for Houston ecosystem team&lt;br /&gt;
 houston3- analysis for Houston ecosystem team&lt;br /&gt;
 industry — new industry codes (4)— used for all future data sets&lt;br /&gt;
 industrylist— lookup table for new industry codes (went from 6 to 4) &lt;br /&gt;
 joined1- used for matching process&lt;br /&gt;
 joined2- used for matching process&lt;br /&gt;
 matchfund2- used for matching process&lt;br /&gt;
 matchfunds- used for matching process&lt;br /&gt;
 matchroundfund - used for matching process&lt;br /&gt;
 matchroundfund2- used for matching process&lt;br /&gt;
 msalist — lookup table for MSA to CMSA (used for all future data sets) &lt;br /&gt;
 nearfar1-- beginning set before adding nearfar/stage variables &lt;br /&gt;
 nearfar2 -- added binary variables for near/far and for each of the stages, used to build final dataset&lt;br /&gt;
 roundfund— not used— joined round to fund; drop/ignore&lt;br /&gt;
 roundinfo— round info cleaned up to include number of investors in a syndicate and estimated investment per member of syndicate&lt;br /&gt;
 roundinfo2— roundinfo1 including name of investors/funds&lt;br /&gt;
 roundinfo3— clean version of roundinfo2&lt;br /&gt;
 roundinfoclean — final clean version of roundinfo3 (final roundinfo table)&lt;br /&gt;
 rounds — raw SDC round data&lt;br /&gt;
 stages — table for merging stage-year-CMSA&lt;br /&gt;
 superinfo — ignore/drop&lt;br /&gt;
 temp - used for matching process&lt;br /&gt;
 years — table for merging stage-year-CMSA&lt;br /&gt;
&lt;br /&gt;
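The empty merge tables in the glossary (cmsayears, cmsayearstage) suggest a scaffold of every CMSA-year combination that is then filled with statistics, so that years with no deals still appear. A minimal sketch of that idea follows, assuming missing combinations should be filled with zero; the function name, codes, and counts are hypothetical.&lt;br /&gt;

```python
from itertools import product


def build_panel(cmsas, years, deal_counts):
    """Return one row per CMSA-year, with 0 where no deals were recorded."""
    return [
        (cmsa, year, deal_counts.get((cmsa, year), 0))
        for cmsa, year in product(cmsas, years)
    ]


cmsas = [1, 2]                      # placeholder CMSA codes
years = range(1990, 1993)           # shortened range for the example
deal_counts = {(1, 1990): 4, (2, 1992): 1}
panel = build_panel(cmsas, years, deal_counts)
```

Adding stage as a third dimension would extend the same cross product to CMSA-year-stage.&lt;br /&gt;
&lt;br /&gt;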
===Hub Candidates Data Set===&lt;br /&gt;
&lt;br /&gt;
The Hubs candidate data set is a list of potential hubs found in MSAs throughout the country. Researchers are currently pulling qualitative and quantitative information from the candidates' websites, in an attempt to categorize what can be identified as a hub. This is a difficult data set to pull, as there is little to no quantitative information available for this category of institution, and it depends on what information is publicly accessible on the internet.&lt;br /&gt;
&lt;br /&gt;
Characteristics/Variables&lt;br /&gt;
*Year Founded&lt;br /&gt;
*Square footage&lt;br /&gt;
*LinkedIn self-identifiers (how the organization classifies itself on its LinkedIn profile) &lt;br /&gt;
*Activeness on Twitter (binary)&lt;br /&gt;
*Member Directory available online (binary)&lt;br /&gt;
*Number of conference rooms&lt;br /&gt;
*Price ($/month) for Flex desk &lt;br /&gt;
*Offers Reserved desk (binary)&lt;br /&gt;
*Offers office space for rent (binary) &lt;br /&gt;
*Offers community membership-- not for coworking but for community events, etc. (binary)&lt;br /&gt;
*Number of events offered per month (estimate)&lt;br /&gt;
*Offers code academy&lt;br /&gt;
*Mission Statement/Vision (for qualitative or key-word analysis) &lt;br /&gt;
&lt;br /&gt;
These characteristics/variables will be used to determine whether a candidate is or is not likely to be a Hub. &lt;br /&gt;
&lt;br /&gt;
As of March 10th 2016, the list contains 125 Hub candidates.&lt;br /&gt;
&lt;br /&gt;
'''Where to find''': The Hubs data set can be found in the Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;dataset folder. It is not currently in the database due to a UTF-8 encoding issue.&lt;br /&gt;
&lt;br /&gt;
===Supplementary Data Sets===&lt;br /&gt;
'''Patent data''': to be pulled from USPTO or SDC Platinum. &lt;br /&gt;
&lt;br /&gt;
'''Number of STEM Graduate Students''' (NSF) and '''University R&amp;amp;D Spending''' (NSF):&lt;br /&gt;
*University R&amp;amp;D Data found under file &amp;quot;NSF DATA_2004 to 2011.xlsx&amp;quot; in datasets folder (Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;Datasets)&lt;br /&gt;
*R&amp;amp;D spending found at the university level for 2014 (&amp;quot;Stem Grad Students.xlsx&amp;quot;) or at state level (&amp;quot;Science and Engineering Grad Students by State and Year 2000-2011.csv&amp;quot;)&lt;br /&gt;
** not uploaded to server or matched yet to CMSA code, because of this discrepancy. &lt;br /&gt;
**&amp;quot;Stem Grad Students.xlsx&amp;quot; contains categorized university by MSA, can be used for all university-based projects&lt;br /&gt;
&lt;br /&gt;
'''Per Capita Income''' and '''Employment Data''' (US Census Bureau): &lt;br /&gt;
*&amp;quot;Per Capita Personal Income by MSA 2000-2012.xlsx&amp;quot; in datasets folder (Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;Datasets&amp;gt;&amp;gt;Data from Yael)&lt;br /&gt;
*&amp;quot;Wages and Salaries by MSA 2000-2012.xlsx&amp;quot; in datasets folder (Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;datasets&amp;gt;&amp;gt;Data from Yael)&lt;br /&gt;
**not uploaded to server or matched yet to CMSA code&lt;br /&gt;
&lt;br /&gt;
'''Firm Births''' (BDS)&lt;br /&gt;
*in server 181, under table name &amp;quot;BDS&amp;quot;&lt;br /&gt;
*includes birth, death, net(birth-death) and rate(death rate) for years 1990-2013 for every msa&lt;br /&gt;
*includes code for CMSA but is not aggregated by CMSA&lt;br /&gt;
** i.e. BDS statistics are still separate for all the smaller MSAs in New York's CMSA (code=1)&lt;br /&gt;
&lt;br /&gt;
===Resources===&lt;br /&gt;
* Yael Hochberg and Fehder (2015), located in dropbox&lt;br /&gt;
** Use this paper as a guideline on how to conduct the analysis&lt;br /&gt;
*US Census Bureau data on employment by MSA: http://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_14_5YR_B23027&amp;amp;prodType=table&lt;br /&gt;
*USPTO utility patents by MSA: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/cls_cbsa/allcbsa_gd.htm&lt;br /&gt;
*MSA level trends: http://www.metrotrends.org/data.cf&lt;br /&gt;
&lt;br /&gt;
===The Target Dataset===&lt;br /&gt;
&lt;br /&gt;
We will need to process the following variables:&lt;br /&gt;
*SuperMSA - combine SanFran and SanJose, New York and Newark?, NC Research triangle, others?&lt;br /&gt;
*CSV mapping msas to cmsas is in the folder (and a table in the dbase)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Example dataset:&lt;br /&gt;
 MSA      Year       SeedVCInv      SeedEarlyVCInv      LaterVCInv     NoDeals   FundsInvested   DistinctInvestors   ....&lt;br /&gt;
 ----------------------------------------------------------------------------------------------------------------------------&lt;br /&gt;
 1234     2001       1000000        20000000            30000000       4          7              7&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Note that the unit of observation is MSA-Year.&lt;br /&gt;
&lt;br /&gt;
Variables to be computed at the MSA level:&lt;br /&gt;
*HubActive (binary)&lt;br /&gt;
*NoHubsActive (Count)&lt;br /&gt;
*HubSqFt&lt;br /&gt;
*Other Hub Vars (build list!!!)&lt;br /&gt;
*'''SeedVCInv'''  (Seed/Start-up)&lt;br /&gt;
*'''EarlyVCInv''' (Early Stage)&lt;br /&gt;
*'''LaterStageVC''' (Later)&lt;br /&gt;
*'''OtherStageVC''' (Buyout/Acq, Other)&lt;br /&gt;
*'''NoDeals''' (done by local VCs?)&lt;br /&gt;
**'''NoDealsNear'''&lt;br /&gt;
**'''NoDealsFar'''&lt;br /&gt;
*NoPortCosFunded&lt;br /&gt;
*'''FundsInv''' (in an MSA)&lt;br /&gt;
**'''FundsInvFromNear''' (within MSA?)&lt;br /&gt;
**'''FundsInvFromFar''' (outside MSA?)&lt;br /&gt;
*DistinctInvestors (?)&lt;br /&gt;
**DistinctInvestorsNear (within MSA?)&lt;br /&gt;
**DistinctInvestorsFar (outside MSA?)&lt;br /&gt;
*PatentCount&lt;br /&gt;
*NoSTEMGrads&lt;br /&gt;
*FirmBirths (BDS data)&lt;br /&gt;
*UniRandDSpend&lt;br /&gt;
*PerCapitaIncome&lt;br /&gt;
*Employment&lt;br /&gt;
&lt;br /&gt;
We need to:&lt;br /&gt;
*Check funds invested means dollars invested&lt;br /&gt;
*Categorize near and far! Is it within MSA vs. not, within adjacent MSAs, etc.?&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There may be a second dataset that has Hub-Industry-Year (where industry is semiconductor/non-semiconductor?).&lt;br /&gt;
&lt;br /&gt;
===Final Primary Data Set===&lt;br /&gt;
&lt;br /&gt;
*A deal is a round syndicate (a near/far deal is one with an investor that is near/far).&lt;br /&gt;
&lt;br /&gt;
Table name: finaldataset&lt;br /&gt;
 cmsa&lt;br /&gt;
 year&lt;br /&gt;
 totalamountinv--total amount invested &lt;br /&gt;
 nearamountinv--amount invested from local funds&lt;br /&gt;
 faramountinv-- amount invested from funds outside CMSA &lt;br /&gt;
 earlyinv--amount invested in early stage companies &lt;br /&gt;
 laterinv--amount invested in later stage companies &lt;br /&gt;
 startupseedinv--amount invested in seed or startup stage companies &lt;br /&gt;
 otherstageinv--amount invested in Acquisition/Buy-outs/Other stage companies &lt;br /&gt;
 investingfund--distinct funds that are investing in that CMSA-year &lt;br /&gt;
 investingfundnear--distinct funds from that CMSA that invested in that CMSA-year &lt;br /&gt;
 investingfundfar--distinct funds from outside that CMSA that invested in that CMSA-year &lt;br /&gt;
 deals--number of deals &lt;br /&gt;
 neardeals--number of deals inside a CMSA &lt;br /&gt;
 fardeals--number of deals from outside a CMSA --some of these deals might count in both categories, because of syndicate members being both inside and outside the CMSA&lt;br /&gt;
 earlystagedeals--deals with earlystage companies&lt;br /&gt;
 laterstagedeals--deals with later stage companies &lt;br /&gt;
 startupseeddeals--deals with startup/seed companies &lt;br /&gt;
 otherstagedeals--deals with companies in other stages &lt;br /&gt;
 newportcosfunded--number of portfolio companies to receive their first investment in that year&lt;br /&gt;
&lt;br /&gt;
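The near/far deal counts above, including the note that a deal can count in both categories, can be sketched as follows. This assumes "near" means a fund in the same CMSA as the company and "far" means a fund in any other CMSA; the function names and sample data are hypothetical.&lt;br /&gt;

```python
def classify_deal(company_cmsa, fund_cmsas):
    """Return (is_near, is_far) flags for one round syndicate.

    A deal is near if any fund sits in the company's CMSA and far if any
    fund sits elsewhere, so a mixed syndicate is counted in both.
    """
    is_near = any(c == company_cmsa for c in fund_cmsas)
    is_far = any(c != company_cmsa for c in fund_cmsas)
    return is_near, is_far


def count_deals(deals):
    """deals: iterable of (company_cmsa, [fund_cmsa, ...]) pairs."""
    totals = {"deals": 0, "neardeals": 0, "fardeals": 0}
    for company_cmsa, fund_cmsas in deals:
        is_near, is_far = classify_deal(company_cmsa, fund_cmsas)
        totals["deals"] += 1
        totals["neardeals"] += int(is_near)
        totals["fardeals"] += int(is_far)
    return totals


# Three deals in CMSA 1: all-local, mixed, and all-outside syndicates.
example = [(1, [1, 1]), (1, [1, 7]), (1, [7])]
totals = count_deals(example)
```

Because the mixed syndicate is counted twice, neardeals plus fardeals can exceed deals, which matches the caveat in the fardeals description.&lt;br /&gt;
&lt;br /&gt;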
===Data by zip code===&lt;br /&gt;
*Population data, 2000-2016 - US Census Bureau (E:\McNair\Hubs\summer 2017)&lt;br /&gt;
https://www2.census.gov/programs-surveys/popest/datasets/&lt;br /&gt;
*Income data, 1998-2014 - The Internal Revenue Service (E:\McNair\Hubs\summer 2017)&lt;br /&gt;
https://www.irs.gov/uac/about-irs&lt;br /&gt;
*DCI index, to assess the economic well-being of communities&lt;br /&gt;
http://eig.org/dci/interactive-maps/u-s-zip-codes&lt;br /&gt;
*R&amp;amp;D Expenses, 1980-2016 - Wharton Research Data Services (E:\McNair\Hubs\summer 2017)&lt;br /&gt;
*Zipcode look-up table obtained from https://www.unitedstateszipcodes.org/zip-code-database/. It's available in (E:\McNair\Hubs\summer 2017).&lt;br /&gt;
&lt;br /&gt;
== Data by MSA ==&lt;br /&gt;
&lt;br /&gt;
We have principal cities of MSAs from the census:&lt;br /&gt;
https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html&lt;br /&gt;
&lt;br /&gt;
We might be able to go City -&amp;gt; FIPS place code -&amp;gt; MSA?&lt;br /&gt;
&lt;br /&gt;
Cities and their FIPS codes (which don't perfectly correspond) are available from https://www.census.gov/geo/reference/codes/place.html&lt;br /&gt;
&lt;br /&gt;
The Census claims to provide city to MSA here: https://www.census.gov/geo/maps-data/data/ua_rel_download.html&lt;br /&gt;
However, there is only CBSA!&lt;br /&gt;
&lt;br /&gt;
This might do it: https://www2.census.gov/geo/pdfs/maps-data/data/rel/explanation_ua_cbsa_rel_10.pdf&lt;br /&gt;
We can maybe track city to principal city to MSA&lt;br /&gt;
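&lt;br /&gt;
The City to FIPS place code to MSA chain proposed above amounts to two chained lookups. The sketch below shows only the shape of that lookup: the table contents are illustrative placeholder values that have not been checked against the Census files, and the names are hypothetical.&lt;br /&gt;

```python
# Placeholder lookup tables; real ones would be loaded from the Census files.
CITY_TO_PLACE = {("HOUSTON", "TX"): "4835000"}   # (city, state) to place code
PLACE_TO_MSA = {"4835000": "26420"}              # place code to MSA code


def city_to_msa(city, state):
    """Return the MSA code for a city, or None if either lookup misses."""
    place = CITY_TO_PLACE.get((city.strip().upper(), state.strip().upper()))
    if place is None:
        return None
    return PLACE_TO_MSA.get(place)


msa = city_to_msa(" Houston", "tx")
missing = city_to_msa("Nowhere", "TX")
```

Returning None on a miss makes it easy to list the cities that fail to match, which matters because city names and FIPS codes do not correspond perfectly.&lt;br /&gt;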
&lt;br /&gt;
==COMPUSTAT Data==&lt;br /&gt;
&lt;br /&gt;
Data is in:&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: COMPUSTAT.sql&lt;br /&gt;
&lt;br /&gt;
The source file is RandDExpenditures.txt. It contains:&lt;br /&gt;
*Data from 1980-2017 (July). All COMPUSTAT.&lt;br /&gt;
*427799 records&lt;br /&gt;
*Fields include:&lt;br /&gt;
**R&amp;amp;D Expenditure&lt;br /&gt;
**Address (inc. city, zip, state)&lt;br /&gt;
&lt;br /&gt;
Output file is COMPUSTATSummary.txt. It contains:&lt;br /&gt;
*Variables: City, year, No.public firms, sum R&amp;amp;D, sum Sales, sum total assets&lt;br /&gt;
*1979-2016&lt;br /&gt;
*4440 cities&lt;br /&gt;
&lt;br /&gt;
==NSF Data==&lt;br /&gt;
Data is in:&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: nsf_2017.sql&lt;br /&gt;
&lt;br /&gt;
The source file is nsf2017.txt, copied from table titled '''nsf''' in the biotech db.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*Award ID&lt;br /&gt;
*Award Institution&lt;br /&gt;
*Award Effective date&lt;br /&gt;
*Institution city&lt;br /&gt;
*Award Value&lt;br /&gt;
From 1900 - 2017&lt;br /&gt;
&lt;br /&gt;
Output file is nsfSummary.txt. It contains:&lt;br /&gt;
*Variables: City, year, nogrants, valuegrant &lt;br /&gt;
*1900-2017&lt;br /&gt;
&lt;br /&gt;
Cities are not unique. E.g., NEW YORK and New York are treated as two different cities. Their data need to be merged.&lt;br /&gt;
*3854 cities&lt;br /&gt;
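&lt;br /&gt;
One way to merge city rows that differ only in case or spacing is to aggregate on a normalized key, as in this sketch. It assumes rows shaped like nsfSummary.txt (city, year, nogrants, valuegrant); the function name and sample values are hypothetical. Since the summary has no state field, same-named cities in different states would also be merged, so a state column would be needed to avoid that.&lt;br /&gt;

```python
from collections import defaultdict


def merge_city_rows(rows):
    """Aggregate (city, year, nogrants, valuegrant) rows on a normalized city key."""
    merged = defaultdict(lambda: [0, 0.0])
    for city, year, nogrants, valuegrant in rows:
        # Collapse repeated whitespace and ignore case when comparing names.
        key = (" ".join(city.split()).upper(), year)
        merged[key][0] += nogrants
        merged[key][1] += valuegrant
    return {k: tuple(v) for k, v in merged.items()}


# Hypothetical sample rows.
rows = [
    ("NEW YORK", 2010, 3, 150000.0),
    ("New York", 2010, 2, 50000.0),
    ("Houston", 2010, 1, 20000.0),
]
summary = merge_city_rows(rows)
```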
&lt;br /&gt;
==NIH Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: nih2017.sql&lt;br /&gt;
The source files are: &lt;br /&gt;
*nih_1986_2001.csv&lt;br /&gt;
*nih_2002_2012.txt&lt;br /&gt;
*nih_2013_2015&lt;br /&gt;
located in E:\McNair\Projects\Federal Grant Data\NIH&lt;br /&gt;
&lt;br /&gt;
*Data from 1986-2015&lt;br /&gt;
&lt;br /&gt;
==Clinical Trials Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: ctrials.sql&lt;br /&gt;
The source file is: &lt;br /&gt;
&lt;br /&gt;
*medclinical.txt&lt;br /&gt;
&lt;br /&gt;
located in Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
*Data from 1999-2017&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Population Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: population.sql&lt;br /&gt;
The source files are: &lt;br /&gt;
*pop2000_2009.xlsx&lt;br /&gt;
*pop2010_2016.xlsx&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*State&lt;br /&gt;
*City name	&lt;br /&gt;
*Year	&lt;br /&gt;
*Population Estimates&lt;br /&gt;
&lt;br /&gt;
Data from 2000-2016&lt;br /&gt;
&lt;br /&gt;
===Joined population data===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: merged_population.sql&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*Population estimates&lt;br /&gt;
*Year&lt;br /&gt;
*Code from the state code and Fips code&lt;br /&gt;
*State full name&lt;br /&gt;
&lt;br /&gt;
==Income Data==&lt;br /&gt;
&lt;br /&gt;
Raw data was obtained from the US Census Bureau's American Community Survey (ACS).&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\MSA Income_raw.zip &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The master list with MSAs and principal cities is titled '''list2.xls'''. It is located at:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
This master list includes:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA name&lt;br /&gt;
*Principal City&lt;br /&gt;
*State&lt;br /&gt;
*Place code (city code)&lt;br /&gt;
*State Code&lt;br /&gt;
&lt;br /&gt;
This master list was edited to associate each principal city with a single state. E.g., New York is the principal city of the New York-New Jersey MSA and was originally associated with state NY-NJ, so '''list''' was edited to put New York with NY. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Cleaned Income data files are in&lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID &lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Total Household Income &lt;br /&gt;
&lt;br /&gt;
The MSA-City-State look up file is titled '''msa_city_state_wcode.txt'''. It is located in &lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID &lt;br /&gt;
&lt;br /&gt;
The SQL file that merges income data from ACS (by MSA - Year) with the MSA-City file is titled '''income.sql'''. It is located here:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
   &lt;br /&gt;
 &lt;br /&gt;
The final income table is in db '''cities''' titled '''merged_income'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*Year&lt;br /&gt;
*Total Household Income &lt;br /&gt;
&lt;br /&gt;
The table includes 8780 observations&lt;br /&gt;
&lt;br /&gt;
==Employment Data==&lt;br /&gt;
&lt;br /&gt;
Data on employment was obtained from the American Community Survey (ACS), US Census Bureau.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\Employment Data by MSA&lt;br /&gt;
Cleaned files are in&lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID    &lt;br /&gt;
 &lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Employment rate of individuals 16 years or older&lt;br /&gt;
*Unemployment rate of individuals 16 years or older&lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The SQL file that merges employment data from ACS (by MSA - Year) with the MSA-City file is titled '''Employment.sql'''. &lt;br /&gt;
The file is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The final table is in db '''cities''' titled '''merged_employment'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*Year&lt;br /&gt;
*Employment rate&lt;br /&gt;
*Unemployment rate&lt;br /&gt;
&lt;br /&gt;
==Schooling Data==&lt;br /&gt;
&lt;br /&gt;
Data on schooling was obtained from the American Community Survey (ACS), US Census Bureau.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\School Enrollment Data by MSA&lt;br /&gt;
Cleaned files are in&lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID    &lt;br /&gt;
 &lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Total number of population 3 years and over enrolled in school&lt;br /&gt;
*Percent of population 3 years and over enrolled in public school&lt;br /&gt;
*Percent of population 3 years and over enrolled in private school &lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The SQL file that merges schooling data from ACS (by MSA - Year) with the MSA-City file is titled '''schooling.sql'''. &lt;br /&gt;
The file is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The final table is in db '''cities''' titled '''merged_schooling'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*Year&lt;br /&gt;
*Total&lt;br /&gt;
*Percent_public_schooling&lt;br /&gt;
*Percent_private_schooling&lt;br /&gt;
&lt;br /&gt;
===Joined schooling data===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: merged_schooling.sql&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*Total number of school enrollment&lt;br /&gt;
*Percentage enrolled in public schools&lt;br /&gt;
*Percentage enrolled in private schools&lt;br /&gt;
*Year&lt;br /&gt;
*Code from the state code and Fips code&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hubs&amp;diff=19095</id>
		<title>Hubs</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hubs&amp;diff=19095"/>
		<updated>2017-06-30T20:30:15Z</updated>

		<summary type="html">&lt;p&gt;HiraF: /* Income Data */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Hubs&lt;br /&gt;
|Has owner=Hira Farooqi,&lt;br /&gt;
|Has keywords=Data&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
The Hubs Research Project is a full-length academic paper analyzing the effectiveness of &amp;quot;hubs&amp;quot;, a component of the entrepreneurship ecosystem, in the advancement and growth of entrepreneurial success in a metropolitan area. &lt;br /&gt;
&lt;br /&gt;
This research will primarily focus on large and mid-sized Metropolitan Statistical Areas (MSAs), as that is where the great majority of Venture Capital funding is located. &lt;br /&gt;
&lt;br /&gt;
===Primary Data Set===&lt;br /&gt;
The Hubs data set, from SDC Platinum, has been constructed on the server:&lt;br /&gt;
 Data files are in 128.42.44.181/bulk/Hubs&lt;br /&gt;
 All files are in 128.42.44.182/bulk/Projects/Ecosystem/Hubs&lt;br /&gt;
 psql Hubs&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The data set includes all United States Venture Capital transactions (moneytree) from 1990 through 2015.&lt;br /&gt;
Data has been aggregated at the portfolio company, fund, and round level. It will be analyzed at the combined MSA level. We will look at the number of companies funded, the number of funds active, and the flow of investment in a given MSA.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The data set has now been uploaded to the database server, named Hubs.&lt;br /&gt;
There are 4 tables: &lt;br /&gt;
*Rounds: Rounddate, coname, state, roundno, stage1, etc.&lt;br /&gt;
*CombinedRounds: Coname, rounddate, discamount, fundname&lt;br /&gt;
*Companies: LastInv, FirstInv, coname, MSA, MSACode, Address, state, datefounded, totalknownfunding, industry(major)&lt;br /&gt;
*Funds: fundname, closingdate, lastinv, firstinv, msa, msacode, avinv, nocoinv, totalknowninv, address&lt;br /&gt;
&lt;br /&gt;
Used variables:&lt;br /&gt;
&lt;br /&gt;
 Companies: Coname, MSACode, Industry, state&lt;br /&gt;
 MSALookupTable: MSACode, MSASuper&lt;br /&gt;
 IndustryLookupTable: IndustryMajor, InduCode&lt;br /&gt;
 -&amp;gt; &lt;br /&gt;
 CompanyInfo: Coname, MSASuper, InduCode, state (complete)&lt;br /&gt;
&lt;br /&gt;
 Funds: fundname, msacode, state&lt;br /&gt;
 MSALookupTable: MSACode, MSASuper &lt;br /&gt;
 -&amp;gt; &lt;br /&gt;
 FundInfo: fundname, msacode, state (complete)&lt;br /&gt;
&lt;br /&gt;
 Rounds: coname, rounddate, stagecode, roundno&lt;br /&gt;
 CombinedRounds: coname, rounddate, discamount, fundname&lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 RoundInfoSuper: coname, rounddate, '''nofunds''', discamount   &lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 RoundInfo: Coname, roundyear, fundname, estamount (complete)&lt;br /&gt;
&lt;br /&gt;
Then take:&lt;br /&gt;
 RoundInfo: Coname, roundyear, fundname, estamount&lt;br /&gt;
 CompanyInfo: Coname, MSASuper, InduCode, state&lt;br /&gt;
 FundInfo: fundname, msacode, state&lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 SuperRoundInfo: Coname, CoMSASuper, CoInduCode, CoState, FundName, FundMSASuper, FundState, RoundYear, RoundEstAmount&lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 MSAPortCos: Count(CoName) As NoPortCosFunded, CoMSASuper, RoundYear&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
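The step from RoundInfoSuper to RoundInfo above can be sketched as follows. This is a minimal illustration, not the project's SQL: it assumes that estamount is the disclosed round amount split equally across the nofunds funds in the syndicate, and that rounddate is an ISO date string. The function name and the input layout are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict


def estimate_round_amounts(combined_rounds):
    """Split each round's disclosed amount equally across its funds.

    combined_rounds: iterable of (coname, rounddate, discamount, fundname).
    Returns a list of (coname, roundyear, fundname, estamount) rows.
    """
    # Group fund names by round; a round is identified by (coname, rounddate).
    funds_by_round = defaultdict(set)
    amount_by_round = {}
    for coname, rounddate, discamount, fundname in combined_rounds:
        key = (coname, rounddate)
        funds_by_round[key].add(fundname)
        amount_by_round[key] = discamount

    rows = []
    for (coname, rounddate), funds in sorted(funds_by_round.items()):
        nofunds = len(funds)
        # Assumption: equal split of the disclosed amount across the syndicate.
        estamount = amount_by_round[(coname, rounddate)] / nofunds
        roundyear = int(rounddate[:4])  # assumes ISO dates such as "2001-03-15"
        for fundname in sorted(funds):
            rows.append((coname, roundyear, fundname, estamount))
    return rows
```
&lt;br /&gt;
Rounds with an undisclosed amount would need to be filtered out before this step, in line with the key decisions listed in the notes.&lt;br /&gt;
&lt;br /&gt;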
'''Notes on Creation of Primary Data Set'''&lt;br /&gt;
&lt;br /&gt;
Raw tables&lt;br /&gt;
* companies (last investment, first investment, company name, MSA, MSA code, address, state, date founded, known funding, industry) &lt;br /&gt;
* funds (fund closing date, last investment, first investment, fund name, address, MSA, MSA code, Average investment, number companies invested (NoCos), known investment) &lt;br /&gt;
* rounds (round date, company name, state, round number, stage 1, stage 2, stage 3) &lt;br /&gt;
* combined rounds (company name, round date, disclosed amount, investor) &lt;br /&gt;
* msalist (changes MSAs to CMSAs— combined MSAs)&lt;br /&gt;
*industry list (changes 6 industry categories to 4— ICT, Life Sciences, Semiconductors, Other) &lt;br /&gt;
&lt;br /&gt;
Process&lt;br /&gt;
*cleaned tables to eliminate duplications, undisclosed variables&lt;br /&gt;
*changed all original characters to include CMSA and Industry Codes (companyinfo3, fundinfocleanfinal, roundinfoclean) &lt;br /&gt;
*matched funds to avoid any issues with names (i.e. Fund ABC L.P./Fund ABC LP/Fund ABC) &lt;br /&gt;
*matched roundinfoclean investors to fundinfocleanfinal investors (roundinfo.txt &amp;gt;&amp;gt; cleanfundfinal.txt)&lt;br /&gt;
*join by round and company conames&lt;br /&gt;
*bridge years (1990-2016), stage, and cmsa&lt;br /&gt;
* populate data with count of companies (Deal flow) and estimated amount ($)&lt;br /&gt;
** data set in 181 hubs folder under summarycmsa.txt (38394)&lt;br /&gt;
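&lt;br /&gt;
The fund-name matching step in the process above (Fund ABC L.P./Fund ABC LP/Fund ABC) can be illustrated with a small normalization routine. This is a sketch only: the suffix list and the function name are assumptions made for illustration, and the actual matching used for the data set may have worked differently.&lt;br /&gt;
&lt;br /&gt;
```python
import re

# Hypothetical list of legal-form suffixes to ignore when comparing names.
SUFFIXES = {"lp", "llc", "inc", "ltd"}


def normalize_fund_name(name):
    """Return a comparison key for a fund name."""
    # Lower-case and drop periods so that "L.P." becomes "lp".
    cleaned = name.lower().replace(".", "")
    # Replace any remaining punctuation with spaces.
    cleaned = re.sub(r"[^a-z0-9]+", " ", cleaned)
    tokens = cleaned.split()
    # Strip trailing legal-form suffixes.
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)
```
&lt;br /&gt;
Two fund records would then be treated as the same fund when their keys are equal, for example when joining round investors to the fund table.&lt;br /&gt;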
&lt;br /&gt;
Key decisions:&lt;br /&gt;
*Threw out undisclosed companies throughout, as they have no address&lt;br /&gt;
*Count is done by joining round and company&lt;br /&gt;
*Anything fund-related must use a disclosed fund&lt;br /&gt;
*Near and far, total invested, fund counts, etc., are all done using only disclosed funds that match&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Glossary of Tables'''&lt;br /&gt;
 cleanco — used to remove duplicates from companies&lt;br /&gt;
 cleanedcompanies — clean set of companies with no duplicates&lt;br /&gt;
 cmsafunds- &lt;br /&gt;
 cmsas— list of all CMSAs in final data set (for merging) &lt;br /&gt;
 cmsastats- statistics not including empty years (pre-merge) &lt;br /&gt;
 cmsastats2 - statistics separated by year-MSA&lt;br /&gt;
 cmsastats3— statistics separated by year-MSA-stage&lt;br /&gt;
 cmsastats4&lt;br /&gt;
 cmsayears— empty merged table between year and cmsa&lt;br /&gt;
 cmsayearstage — empty merged table between cmsa/years and stage&lt;br /&gt;
 combinedrounds— raw sdc data for combined rounds&lt;br /&gt;
 combinedroundswamt— used to join rounds and combined rounds for roundinfo2&lt;br /&gt;
 companies- raw SDC company data&lt;br /&gt;
 companyinfo — cleaned companies joined with state and CMSA information&lt;br /&gt;
 companyinfo2— companyinfo1 with original industry categories&lt;br /&gt;
 companyinfo3— companyinfo2 with updated industry categories and codes&lt;br /&gt;
 companyinfo4-- clean version of companyinfo3&lt;br /&gt;
 companyround- combined company information with round information&lt;br /&gt;
 companyround2- combined company information with round information, cleaned up from companyround&lt;br /&gt;
 companyround3- combined company information with round information, cleaned up from companyround2&lt;br /&gt;
 '''finaldataset'''- final statistics by CMSA-year, see section Final Primary Data Set for more information&lt;br /&gt;
 fundinfo— funds joined with CMSA info&lt;br /&gt;
 fundinfo2 - clean version of fundinfo1&lt;br /&gt;
 fundinfoclean - used in process to clean fundinfo2&lt;br /&gt;
 fundinfoclean2- used in process to clean fundinfo2&lt;br /&gt;
 fundinfocleanfinal- used in process to clean fundinfo2&lt;br /&gt;
 fundinfocleannodups- final clean set of fundinfo&lt;br /&gt;
 funds - raw SDC fund data&lt;br /&gt;
 Houston - analysis for Houston ecosystem team&lt;br /&gt;
 Houston2- analysis for Houston ecosystem team&lt;br /&gt;
 houston3- analysis for Houston ecosystem team&lt;br /&gt;
 industry — new industry codes (4)— used for all future data sets&lt;br /&gt;
 industrylist— lookup table for new industry codes (went from 6 to 4) &lt;br /&gt;
 joined1- used for matching process&lt;br /&gt;
 joined2- used for matching process&lt;br /&gt;
 matchfund2- used for matching process&lt;br /&gt;
 matchfunds- used for matching process&lt;br /&gt;
 matchroundfund - used for matching process&lt;br /&gt;
 matchroundfund2- used for matching process&lt;br /&gt;
 msalist — lookup table for MSA to CMSA (used for all future data sets) &lt;br /&gt;
 nearfar1-- beginning set before adding nearfar/stage variables &lt;br /&gt;
 nearfar2 -- added binary variables for near/far and for each of the stages, used to build final dataset&lt;br /&gt;
 roundfund— not used— joined round to fund; drop/ignore&lt;br /&gt;
 roundinfo— round info cleaned up to include number of investors in a syndicate and estimated investment per member of syndicate&lt;br /&gt;
 roundinfo2— roundinfo1 including name of investors/funds&lt;br /&gt;
 roundinfo3— clean version of roundinfo2&lt;br /&gt;
 roundinfoclean — final clean version of roundinfo3 (final roundinfo table)&lt;br /&gt;
 rounds — raw SDC round data&lt;br /&gt;
 stages — table for merging stage-year-CMSA&lt;br /&gt;
 superinfo — ignore/drop&lt;br /&gt;
 temp - used for matching process&lt;br /&gt;
 years — table for merging stage-year-CMSA&lt;br /&gt;
&lt;br /&gt;
===Hub Candidates Data Set===&lt;br /&gt;
&lt;br /&gt;
The Hubs candidate data set is a list of potential hubs found in MSAs throughout the country. Researchers are currently pulling qualitative and quantitative information from the candidates' websites, in an attempt to categorize what can be identified as a hub. This is a difficult data set to pull, as there is little to no quantitative information available for this category of institution, and it depends on how much information is publicly accessible on the internet.&lt;br /&gt;
&lt;br /&gt;
Characteristics/Variables&lt;br /&gt;
*Year Founded&lt;br /&gt;
*Square footage&lt;br /&gt;
*LinkedIn self-identifiers (how the organization classifies itself on its LinkedIn profile) &lt;br /&gt;
*Activeness on Twitter (binary)&lt;br /&gt;
*Member Directory available online (binary)&lt;br /&gt;
*Number of conference rooms&lt;br /&gt;
*Price ($/month) for Flex desk &lt;br /&gt;
*Offers Reserved desk (binary)&lt;br /&gt;
*Offers office space for rent (binary) &lt;br /&gt;
*Offers community membership-- not for coworking but for community events, etc. (binary)&lt;br /&gt;
*Number of events offered per month (estimate)&lt;br /&gt;
*Offers code academy&lt;br /&gt;
*Mission Statement/Vision (for qualitative or key-word analysis) &lt;br /&gt;
&lt;br /&gt;
These characteristics/variables will be used to determine whether a candidate is or is not likely to be a Hub. &lt;br /&gt;
&lt;br /&gt;
As of March 10th 2016, the list contains 125 Hub candidates.&lt;br /&gt;
&lt;br /&gt;
'''Where to find''': The Hubs data set can be found in the Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;dataset folder. It is not currently in the database due to a UTF8 issue.&lt;br /&gt;
&lt;br /&gt;
===Supplementary Data Sets===&lt;br /&gt;
'''Patent data''': to be pulled from USPTO or SDC Platinum. &lt;br /&gt;
&lt;br /&gt;
'''Number of STEM Graduate Students''' (NSF) and '''University R&amp;amp;D Spending''' (NSF):&lt;br /&gt;
*University R&amp;amp;D Data found under file &amp;quot;NSF DATA_2004 to 2011.xlsx&amp;quot; in datasets folder (Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;Datasets)&lt;br /&gt;
*R&amp;amp;D spending found at the university level for 2014 (&amp;quot;Stem Grad Students.xlsx&amp;quot;) or at state level (&amp;quot;Science and Engineering Grad Students by State and Year 2000-2011.csv&amp;quot;)&lt;br /&gt;
** not uploaded to server or matched yet to CMSA code, because of this discrepancy. &lt;br /&gt;
**&amp;quot;Stem Grad Students.xlsx&amp;quot; contains categorized university by MSA, can be used for all university-based projects&lt;br /&gt;
&lt;br /&gt;
'''Per Capita Income''' and '''Employment Data''' (US Census Bureau): &lt;br /&gt;
*&amp;quot;Per Capita Personal Income by MSA 2000-2012.xlsx&amp;quot; in datasets folder (Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;Datasets&amp;gt;&amp;gt;Data from Yael)&lt;br /&gt;
*&amp;quot;Wages and Salaries by MSA 2000-2012.xlsx&amp;quot; in datasets folder (Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;datasets&amp;gt;&amp;gt;Data from Yael)&lt;br /&gt;
**not uploaded to server or matched yet to CMSA code&lt;br /&gt;
&lt;br /&gt;
'''Firm Births''' (BDS)&lt;br /&gt;
*in server 181, under table name &amp;quot;BDS&amp;quot;&lt;br /&gt;
*includes birth, death, net(birth-death) and rate(death rate) for years 1990-2013 for every msa&lt;br /&gt;
*includes code for CMSA but is not aggregated by CMSA&lt;br /&gt;
** i.e. BDS statistics are still separate for all the smaller MSAs in New York's CMSA (code=1)&lt;br /&gt;
&lt;br /&gt;
===Resources===&lt;br /&gt;
* Yael Hochberg and Fehder (2015), located in dropbox&lt;br /&gt;
** Use this paper as a guideline on how to conduct the analysis&lt;br /&gt;
*US Census Bureau data on employment by MSA: http://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_14_5YR_B23027&amp;amp;prodType=table&lt;br /&gt;
*USPTO utility patents by MSA: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/cls_cbsa/allcbsa_gd.htm&lt;br /&gt;
*MSA level trends: http://www.metrotrends.org/data.cf&lt;br /&gt;
&lt;br /&gt;
===The Target Dataset===&lt;br /&gt;
&lt;br /&gt;
We will need to process the following variables:&lt;br /&gt;
*SuperMSA - combine SanFran and SanJose, New York and Newark?, NC Research triangle, others?&lt;br /&gt;
*CSV mapping msas to cmsas is in the folder (and a table in the dbase)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Example dataset:&lt;br /&gt;
 MSA      Year       SeedVCInv      SeedEarlyVCInv      LaterVCInv     NoDeals   FundsInvested   DistinctInvestors   ....&lt;br /&gt;
 ----------------------------------------------------------------------------------------------------------------------------&lt;br /&gt;
 1234     2001       1000000        20000000            30000000       4          7              7&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Note that the unit of observation is MSA-Year.&lt;br /&gt;
&lt;br /&gt;
Variables to be computed at the MSA level:&lt;br /&gt;
*HubActive (binary)&lt;br /&gt;
*NoHubsActive (Count)&lt;br /&gt;
*HubSqFt&lt;br /&gt;
*Other Hub Vars (build list!!!)&lt;br /&gt;
*'''SeedVCInv'''  (Seed/Start-up)&lt;br /&gt;
*'''EarlyVCInv''' (Early Stage)&lt;br /&gt;
*'''LaterStageVC''' (Later)&lt;br /&gt;
*'''OtherStageVC''' (Buyout/Acq, Other)&lt;br /&gt;
*'''NoDeals''' (done by local VCs?)&lt;br /&gt;
**'''NoDealsNear'''&lt;br /&gt;
**'''NoDealsFar'''&lt;br /&gt;
*NoPortCosFunded&lt;br /&gt;
*'''FundsInv''' (in an MSA)&lt;br /&gt;
**'''FundsInvFromNear''' (within MSA?)&lt;br /&gt;
**'''FundsInvFromFar''' (outside MSA?)&lt;br /&gt;
*DistinctInvestors (?)&lt;br /&gt;
**DistinctInvestorsNear (within MSA?)&lt;br /&gt;
**DistinctInvestorsFar (outside MSA?)&lt;br /&gt;
*PatentCount&lt;br /&gt;
*NoSTEMGrads&lt;br /&gt;
*FirmBirths (BDS data)&lt;br /&gt;
*UniRandDSpend&lt;br /&gt;
*PerCapitaIncome&lt;br /&gt;
*Employment&lt;br /&gt;
&lt;br /&gt;
We need to:&lt;br /&gt;
*Check funds invested means dollars invested&lt;br /&gt;
*Categorize near and far! Is it within MSA vs. not, within adjacent MSAs, etc.?&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There may be a second dataset that has Hub-Industry-Year (where industry is semiconductor/non-semiconductor?).&lt;br /&gt;
&lt;br /&gt;
===Final Primary Data Set===&lt;br /&gt;
&lt;br /&gt;
*A deal is a round syndicate. A near (far) deal is a deal with at least one investor that is near (far).&lt;br /&gt;
&lt;br /&gt;
Table name: finaldataset&lt;br /&gt;
 cmsa&lt;br /&gt;
 year&lt;br /&gt;
 totalamountinv--total amount invested &lt;br /&gt;
 nearamountinv--amount invested from local funds&lt;br /&gt;
 faramountinv-- amount invested from funds outside CMSA &lt;br /&gt;
 earlyinv--amount invested in early stage companies &lt;br /&gt;
 laterinv--amount invested in later stage companies &lt;br /&gt;
 startupseedinv--amount invested in seed or startup stage companies &lt;br /&gt;
 otherstageinv--amount invested in Acquisition/Buy-outs/Other stage companies &lt;br /&gt;
 investingfund--distinct funds that are investing in that CMSA-year &lt;br /&gt;
 investingfundnear--distinct funds from that CMSA that invested in that CMSA-year &lt;br /&gt;
 investingfundfar--distinct funds from outside that CMSA that invested in that CMSA-year &lt;br /&gt;
 deals--number of deals &lt;br /&gt;
 neardeals--number of deals inside a CMSA &lt;br /&gt;
 fardeals--number of deals from outside a CMSA --some of these deals might count in both categories, because of syndicate members being both inside and outside the CMSA&lt;br /&gt;
 earlystagedeals--deals with earlystage companies&lt;br /&gt;
 laterstagedeals--deals with later stage companies &lt;br /&gt;
 startupseeddeals--deals with startup/seed companies &lt;br /&gt;
 otherstagedeals--deals with companies in other stages &lt;br /&gt;
 newportcosfunded--number of portfolio companies to receive their first investment in that year&lt;br /&gt;
&lt;br /&gt;
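The near/far deal counts described above can be sketched as follows. This is an illustration under stated assumptions rather than the SQL used to build finaldataset: each input row stands for one fund in a round syndicate, a fund is near when its CMSA equals the company's CMSA, and deal_id is a hypothetical identifier for the round.&lt;br /&gt;
&lt;br /&gt;
```python
def count_near_far_deals(syndicate_rows):
    """Count deals, near deals and far deals.

    syndicate_rows: iterable of (deal_id, co_cmsa, fund_cmsa), one row per
    fund in a round syndicate.
    """
    deals = set()
    near = set()
    far = set()
    for deal_id, co_cmsa, fund_cmsa in syndicate_rows:
        deals.add(deal_id)
        if fund_cmsa == co_cmsa:
            near.add(deal_id)  # at least one investor inside the CMSA
        else:
            far.add(deal_id)   # at least one investor outside the CMSA
    return {"deals": len(deals), "neardeals": len(near), "fardeals": len(far)}
```
&lt;br /&gt;
Because a syndicate can include both local and outside funds, the same deal can appear in both neardeals and fardeals, so the two counts need not sum to deals.&lt;br /&gt;
&lt;br /&gt;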
===Data by zip code===&lt;br /&gt;
*Population data, 2000-2016 - US Census Bureau (E:\McNair\Hubs\summer 2017)&lt;br /&gt;
https://www2.census.gov/programs-surveys/popest/datasets/&lt;br /&gt;
*Income data, 1998-2014 - The Internal Revenue Service (E:\McNair\Hubs\summer 2017)&lt;br /&gt;
https://www.irs.gov/uac/about-irs&lt;br /&gt;
*DCI index, to assess the economic well-being of communities&lt;br /&gt;
http://eig.org/dci/interactive-maps/u-s-zip-codes&lt;br /&gt;
*R&amp;amp;D Expenses, 1980-2016 - Wharton Research Data Services (E:\McNair\Hubs\summer 2017)&lt;br /&gt;
*Zipcode look-up table obtained from https://www.unitedstateszipcodes.org/zip-code-database/. It's available in (E:\McNair\Hubs\summer 2017).&lt;br /&gt;
&lt;br /&gt;
== Data by MSA ==&lt;br /&gt;
&lt;br /&gt;
We have principal cities of MSAs from the census:&lt;br /&gt;
https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html&lt;br /&gt;
&lt;br /&gt;
We might be able to go City -&amp;gt; FIPS place code -&amp;gt; MSA?&lt;br /&gt;
&lt;br /&gt;
Cities and their FIPS codes (which don't perfectly correspond) are available from https://www.census.gov/geo/reference/codes/place.html&lt;br /&gt;
&lt;br /&gt;
The Census claims to provide city to MSA here: https://www.census.gov/geo/maps-data/data/ua_rel_download.html&lt;br /&gt;
However, there is only CBSA!&lt;br /&gt;
&lt;br /&gt;
This might do it: https://www2.census.gov/geo/pdfs/maps-data/data/rel/explanation_ua_cbsa_rel_10.pdf&lt;br /&gt;
We can maybe track city to principal city to MSA&lt;br /&gt;
&lt;br /&gt;
==COMPUSTAT Data==&lt;br /&gt;
&lt;br /&gt;
Data is in:&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: COMPUSTAT.sql&lt;br /&gt;
&lt;br /&gt;
The source file is RandDExpenditures.txt. It contains:&lt;br /&gt;
*Date from 1980-2017 (July). All COMPUSTAT.&lt;br /&gt;
*427799 records&lt;br /&gt;
*Fields include:&lt;br /&gt;
**R&amp;amp;D Expenditure&lt;br /&gt;
**Address (inc. city, zip, state)&lt;br /&gt;
&lt;br /&gt;
Output file is COMPUSTATSummary.txt. It contains:&lt;br /&gt;
*Variables: City, year, No.public firms, sum R&amp;amp;D, sum Sales, sum total assets&lt;br /&gt;
*1979-2016&lt;br /&gt;
*4440 cities&lt;br /&gt;
&lt;br /&gt;
==NSF Data==&lt;br /&gt;
Data is in:&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: nsf_2017.sql&lt;br /&gt;
&lt;br /&gt;
The source file is nsf2017.txt, copied from table titled '''nsf''' in the biotech db.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*Award ID&lt;br /&gt;
*Award Institution&lt;br /&gt;
*Award Effective date&lt;br /&gt;
*Institution city&lt;br /&gt;
*Award Value&lt;br /&gt;
From 1900 - 2017&lt;br /&gt;
&lt;br /&gt;
Output file is nsfSummary.txt. It contains:&lt;br /&gt;
*Variables: City, year, nogrants, valuegrant &lt;br /&gt;
*1900-2017&lt;br /&gt;
&lt;br /&gt;
City names are not unique: e.g. NEW YORK and New York are treated as two different cities, so their data needs to be merged.&lt;br /&gt;
*3854 cities&lt;br /&gt;
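&lt;br /&gt;
One way to merge such duplicate city rows is to aggregate on a normalized city name. The sketch below assumes rows shaped like the nsfSummary.txt variables (city, year, nogrants, valuegrant); the function name and the choice to normalize by upper-casing and collapsing whitespace are assumptions for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict


def merge_city_rows(rows):
    """Sum grant counts and values for city names that differ only by case or spacing."""
    totals = defaultdict(lambda: [0, 0.0])
    for city, year, nogrants, valuegrant in rows:
        # Normalize: collapse whitespace and upper-case the name.
        key = (" ".join(city.split()).upper(), year)
        totals[key][0] += nogrants
        totals[key][1] += valuegrant
    return {key: tuple(value) for key, value in totals.items()}
```
&lt;br /&gt;
Note that this key would also merge distinct cities that share a name across states, so the state should be added to the key wherever it is available.&lt;br /&gt;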
&lt;br /&gt;
==NIH Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: nih2017.sql&lt;br /&gt;
The source files are: &lt;br /&gt;
*nih_1986_2001.csv&lt;br /&gt;
*nih_2002_2012.txt&lt;br /&gt;
*nih_2013_2015&lt;br /&gt;
located in E:\McNair\Projects\Federal Grant Data\NIH&lt;br /&gt;
&lt;br /&gt;
*Date from 1986-2015&lt;br /&gt;
&lt;br /&gt;
==Clinical Trials Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: ctrials.sql&lt;br /&gt;
The source file is: &lt;br /&gt;
&lt;br /&gt;
*medclinical.txt&lt;br /&gt;
&lt;br /&gt;
located in Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
*Date from 1999-2017&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Population Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: population.sql&lt;br /&gt;
The source files are: &lt;br /&gt;
*pop2000_2009.xlsx&lt;br /&gt;
*pop2010_2016.xlsx&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*State&lt;br /&gt;
*City name	&lt;br /&gt;
*Year	&lt;br /&gt;
*Population Estimates&lt;br /&gt;
&lt;br /&gt;
Date from 2000-2016&lt;br /&gt;
&lt;br /&gt;
===Joined population data===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: merged_population.sql&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*Population estimates&lt;br /&gt;
*Year&lt;br /&gt;
*Code from the state code and Fips code&lt;br /&gt;
*State full name&lt;br /&gt;
&lt;br /&gt;
==Income Data==&lt;br /&gt;
&lt;br /&gt;
Raw data was obtained from the US Census Bureau's American Community Survey.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\MSA Income_raw.zip &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Date from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The master list with MSAs and principal cities is titled '''list2.xls'''. It is located at:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
This master list includes:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA name&lt;br /&gt;
*Principal City&lt;br /&gt;
*State&lt;br /&gt;
*Place code (city code)&lt;br /&gt;
*State Code&lt;br /&gt;
&lt;br /&gt;
This master list was edited to associate each principal city with a unique state. For example, New York, the principal city of the New York-New Jersey MSA, was originally associated with the state NY-NJ, so '''list''' was edited to put New York with NY. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Cleaned Income data files are in&lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID &lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Total Household Income &lt;br /&gt;
&lt;br /&gt;
The MSA-City-State look up file is titled '''msa_city_state_wcode.txt'''. It is located in &lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID &lt;br /&gt;
&lt;br /&gt;
The SQL file that merges income data from ACS (by MSA - Year) with the MSA-City file is titled '''income.sql'''. It is located here:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
   &lt;br /&gt;
 &lt;br /&gt;
The final income table is in db '''cities''' titled '''merged_income'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*Year&lt;br /&gt;
*Total Household Income &lt;br /&gt;
&lt;br /&gt;
The table includes 8780 observations&lt;br /&gt;
&lt;br /&gt;
==Employment Data==&lt;br /&gt;
&lt;br /&gt;
Data on employment was obtained from the American Community Survey, US Census Bureau.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\Employment Data by MSA&lt;br /&gt;
Cleaned files are in&lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID    &lt;br /&gt;
 &lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Employment rate of individuals 16 years or older&lt;br /&gt;
*Unemployment rate of individuals 16 years or older&lt;br /&gt;
&lt;br /&gt;
Date from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The SQL file that merges employment data from ACS (by MSA - Year) with the MSA-City file is titled '''Employment.sql'''. &lt;br /&gt;
The file is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The final table is in db '''cities''' titled '''merged_employment'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*Year&lt;br /&gt;
*Employment rate&lt;br /&gt;
*Unemployment rate&lt;br /&gt;
&lt;br /&gt;
==Schooling Data==&lt;br /&gt;
&lt;br /&gt;
Data on schooling was obtained from the American Community Survey, US Census Bureau.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\School Enrollment Data by MSA&lt;br /&gt;
Cleaned files are in&lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID    &lt;br /&gt;
 &lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Total number of population 3 years and over enrolled in school&lt;br /&gt;
*Percent of population 3 years and over enrolled in public school&lt;br /&gt;
*Percent of population 3 years and over enrolled in private school &lt;br /&gt;
&lt;br /&gt;
Date from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The SQL file that merges schooling data from ACS (by MSA - Year) with the MSA-City file is titled '''schooling.sql'''. &lt;br /&gt;
The file is located in:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The final table is in db '''cities''' titled '''merged_schooling'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*Year&lt;br /&gt;
*Total&lt;br /&gt;
*Percent_public_schooling&lt;br /&gt;
*Percent_private_schooling&lt;br /&gt;
&lt;br /&gt;
===Joined schooling data===&lt;br /&gt;
&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: merged_schooling.sql&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*City&lt;br /&gt;
*State&lt;br /&gt;
*city_state_id to uniquely identify each city&lt;br /&gt;
*Total school enrollment&lt;br /&gt;
*Percentage enrolled in public schools&lt;br /&gt;
*Percentage enrolled in private schools&lt;br /&gt;
*Year&lt;br /&gt;
*Code derived from the state code and FIPS code&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hira_Farooqi_(Work_Log)&amp;diff=19009</id>
		<title>Hira Farooqi (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hira_Farooqi_(Work_Log)&amp;diff=19009"/>
		<updated>2017-06-29T15:40:42Z</updated>

		<summary type="html">&lt;p&gt;HiraF: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Hira Farooqi]] [[Work Logs]] [[Hira Farooqi (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
19/07/2017 - Abstract added to [[Hubs (Academic Paper)]]&lt;br /&gt;
&lt;br /&gt;
27/07/2017 - Cleaned data files for income at MSA level. Data description added to Income Data section on [[Hubs]]&lt;br /&gt;
&lt;br /&gt;
28/07/2017 - Cleaned data files for income at MSA level. For further information on final income data tables and merging, refer to Income Data section in [[Hubs]]. Data on Houston's current entrepreneurial ecosystem, including R&amp;amp;D expenditure by firms, NSF, NIH grants and STEM Grad students compiled and added to the following location&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\Houston&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hira_Farooqi_(Work_Log)&amp;diff=19006</id>
		<title>Hira Farooqi (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hira_Farooqi_(Work_Log)&amp;diff=19006"/>
		<updated>2017-06-28T22:56:47Z</updated>

		<summary type="html">&lt;p&gt;HiraF: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Hira Farooqi]] [[Work Logs]] [[Hira Farooqi (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
19/07/2017 - Abstract added to [[Hubs (Academic Paper)]]&lt;br /&gt;
&lt;br /&gt;
27/07/2017 - Cleaned data files for income at MSA level. Data description added to Income Data section on [[Hubs (Academic Paper)]]&lt;br /&gt;
&lt;br /&gt;
28/07/2017 - Cleaned data files for income at MSA level. For further information on final income data tables and merging, refer to Income Data section in [[Hubs (Academic Paper)]]. Data on Houston's current entrepreneurial ecosystem, including R&amp;amp;D expenditure by firms, NSF, NIH grants and STEM Grad students compiled and added to the following location&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\Houston&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hira_Farooqi_(Work_Log)&amp;diff=19005</id>
		<title>Hira Farooqi (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hira_Farooqi_(Work_Log)&amp;diff=19005"/>
		<updated>2017-06-28T22:56:06Z</updated>

		<summary type="html">&lt;p&gt;HiraF: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Hira Farooqi]] [[Work Logs]] [[Hira Farooqi (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
19/07/2017 - Abstract added to [[Hubs (Academic Paper)]]&lt;br /&gt;
&lt;br /&gt;
27/07/2017 - Cleaned data files for income at MSA level. Data description added to Income Data section on [[Hubs (Academic Paper)]]&lt;br /&gt;
&lt;br /&gt;
28/07/2017 - Cleaned data files for income at MSA level. For final income data tables and other information on merging, refer to the Income Data section in [[Hubs (Academic Paper)]]. Data on Houston's current entrepreneurial ecosystem, including R&amp;amp;D expenditure by firms, NSF, NIH grants and STEM Grad students compiled and added to the following location&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\Houston&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hira_Farooqi_(Work_Log)&amp;diff=19004</id>
		<title>Hira Farooqi (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hira_Farooqi_(Work_Log)&amp;diff=19004"/>
		<updated>2017-06-28T22:55:48Z</updated>

		<summary type="html">&lt;p&gt;HiraF: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Hira Farooqi]] [[Work Logs]] [[Hira Farooqi (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
19/07/2017 - Abstract added to [[Hubs (Academic Paper)]]&lt;br /&gt;
&lt;br /&gt;
27/07/2017 - Cleaned data files for income at MSA level. Data description added to Income Data section on [[Hubs (Academic Paper)]]&lt;br /&gt;
28/07/2017 - Cleaned data files for income at MSA level. For final income data tables and other information on merging, refer to the Income Data section in [[Hubs (Academic Paper)]]. Data on Houston's current entrepreneurial ecosystem, including R&amp;amp;D expenditure by firms, NSF, NIH grants and STEM Grad students compiled and added to the following location&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\Houston&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hubs&amp;diff=19003</id>
		<title>Hubs</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hubs&amp;diff=19003"/>
		<updated>2017-06-28T22:50:22Z</updated>

		<summary type="html">&lt;p&gt;HiraF: /* Income Data */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Hubs&lt;br /&gt;
|Has owner=Hira Farooqi,&lt;br /&gt;
|Has keywords=Data&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
The Hubs Research Project is a full-length academic paper analyzing the effectiveness of &amp;quot;hubs&amp;quot;, a component of the entrepreneurship ecosystem, in the advancement and growth of entrepreneurial success in a metropolitan area. &lt;br /&gt;
&lt;br /&gt;
This research will primarily focus on large and mid-sized Metropolitan Statistical Areas (MSAs), as that is where the majority of Venture Capital funding is located. &lt;br /&gt;
&lt;br /&gt;
===Primary Data Set===&lt;br /&gt;
The Hubs data set, from SDC Platinum, has been constructed on the server:&lt;br /&gt;
 Data files are in 128.42.44.181/bulk/Hubs&lt;br /&gt;
 All files are in 128.42.44.182/bulk/Projects/Ecosystem/Hubs&lt;br /&gt;
 psql Hubs&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The data set includes all United States Venture Capital transactions (moneytree) from the twenty-five year period of 1990 through 2015.&lt;br /&gt;
Data has been aggregated at the portfolio company, fund, and round level. It will be analyzed at the combined MSA level. We will look at it in terms of the number of companies funded, the number of funds active, and the flow of investment in a given MSA.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The data set has now been uploaded to the database server, named Hubs.&lt;br /&gt;
There are 4 tables: &lt;br /&gt;
*Rounds: Rounddate, coname, state, roundno, stage1, etc.&lt;br /&gt;
*CombinedRounds: Coname, rounddate, discamount, fundname&lt;br /&gt;
*Companies: LastInv, FirstInv, coname, MSA, MSACode, Address, state, datefounded, totalknownfunding, industry(major)&lt;br /&gt;
*Funds: fundname, closingdate, lastinv, firstinv, msa, msacode, avinv, nocoinv, totalknowninv, address&lt;br /&gt;
&lt;br /&gt;
Used variables:&lt;br /&gt;
&lt;br /&gt;
 Companies: Coname, MSACode, Industry, state&lt;br /&gt;
 MSALookupTable: MSACode, MSASuper&lt;br /&gt;
 IndustryLookupTable: IndustryMajor, InduCode&lt;br /&gt;
 -&amp;gt; &lt;br /&gt;
 CompanyInfo: Coname, MSASuper, InduCode, state (complete)&lt;br /&gt;
&lt;br /&gt;
 Funds: fundname, msacode, state&lt;br /&gt;
 MSALookupTable: MSACode, MSASuper &lt;br /&gt;
 -&amp;gt; &lt;br /&gt;
 FundInfo: fundname, msacode, state (complete)&lt;br /&gt;
&lt;br /&gt;
 Rounds: coname, rounddate, stagecode, roundno&lt;br /&gt;
 CombinedRounds: coname, rounddate, discamount, fundname&lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 RoundInfoSuper: coname, rounddate, '''nofunds''', discamount   &lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 RoundInfo: Coname, roundyear, fundname, estamount (complete)&lt;br /&gt;
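&lt;br /&gt;
A minimal Python sketch of the Rounds/CombinedRounds step above, assuming that estamount is the disclosed round amount split evenly across the funds in the syndicate (nofunds) and that discamount repeats the round total on every fund row. The sample rows and the function name are hypothetical; this is an illustration, not the project's actual SQL.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict

# Hypothetical CombinedRounds rows: one row per (company, round date, fund).
combined_rounds = [
    {'coname': 'Acme', 'rounddate': '2001-03-15', 'discamount': 1000000.0, 'fundname': 'Fund A'},
    {'coname': 'Acme', 'rounddate': '2001-03-15', 'discamount': 1000000.0, 'fundname': 'Fund B'},
    {'coname': 'Beta', 'rounddate': '2002-07-01', 'discamount': 300000.0, 'fundname': 'Fund A'},
]


def estimate_round_info(rows):
    """Return RoundInfo-style rows: coname, roundyear, fundname, estamount."""
    # Collect the distinct funds in each (company, round date) syndicate.
    funds_per_round = defaultdict(set)
    for row in rows:
        funds_per_round[(row['coname'], row['rounddate'])].add(row['fundname'])

    out = []
    for row in rows:
        nofunds = len(funds_per_round[(row['coname'], row['rounddate'])])
        out.append({
            'coname': row['coname'],
            'roundyear': int(row['rounddate'][:4]),  # assumes ISO-style dates
            'fundname': row['fundname'],
            # Assumption: an even split of the disclosed amount per fund.
            'estamount': row['discamount'] / nofunds,
        })
    return out


round_info = estimate_round_info(combined_rounds)
```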
&lt;br /&gt;
Then take:&lt;br /&gt;
 RoundInfo: Coname, roundyear, fundname, estamount&lt;br /&gt;
 CompanyInfo: Coname, MSASuper, InduCode, state&lt;br /&gt;
 FundInfo: fundname, msacode, state&lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 SuperRoundInfo: Coname, CoMSASuper, CoInduCode, CoState, FundName, FundMSASuper, FundState, RoundYear, RoundEstAmount&lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 MSAPortCos: Count(CoName) As NoPortCosFunded, CoMSASuper, RoundYear&lt;br /&gt;
 ...&lt;br /&gt;
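&lt;br /&gt;
The chain above can be sketched in Python as two steps: an inner join of RoundInfo to CompanyInfo and FundInfo, then a count per MSA-year. This is a sketch under assumptions: the sample rows are hypothetical, FundInfo is assumed to carry an MSASuper value (SuperRoundInfo needs one), and companies are counted once per MSA-year even when several funds invest in them.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict

# Hypothetical sample rows; column names follow the table sketches above.
round_info = [
    {'coname': 'Acme', 'roundyear': 2001, 'fundname': 'Fund A', 'estamount': 500000.0},
    {'coname': 'Acme', 'roundyear': 2001, 'fundname': 'Fund B', 'estamount': 500000.0},
    {'coname': 'Beta', 'roundyear': 2001, 'fundname': 'Fund A', 'estamount': 250000.0},
]
company_info = {
    'Acme': {'msasuper': 1, 'inducode': 2, 'state': 'NY'},
    'Beta': {'msasuper': 1, 'inducode': 1, 'state': 'NJ'},
}
fund_info = {
    'Fund A': {'msasuper': 1, 'state': 'NY'},
    'Fund B': {'msasuper': 7, 'state': 'CA'},
}


def build_super_round_info(rounds, companies, funds):
    """Join each round row to its company and fund (SuperRoundInfo)."""
    rows = []
    for r in rounds:
        co = companies.get(r['coname'])
        fund = funds.get(r['fundname'])
        if co is None or fund is None:
            continue  # unmatched rows are dropped, as in an inner join
        rows.append({
            'coname': r['coname'],
            'comsasuper': co['msasuper'],
            'coinducode': co['inducode'],
            'costate': co['state'],
            'fundname': r['fundname'],
            'fundmsasuper': fund['msasuper'],
            'fundstate': fund['state'],
            'roundyear': r['roundyear'],
            'roundestamount': r['estamount'],
        })
    return rows


def count_port_cos(super_rows):
    """Count distinct companies per (CoMSASuper, RoundYear), as in MSAPortCos."""
    seen = defaultdict(set)
    for row in super_rows:
        seen[(row['comsasuper'], row['roundyear'])].add(row['coname'])
    return {key: len(names) for key, names in seen.items()}


super_round_info = build_super_round_info(round_info, company_info, fund_info)
msa_port_cos = count_port_cos(super_round_info)
```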
&lt;br /&gt;
'''Notes on Creation of Primary Data Set'''&lt;br /&gt;
&lt;br /&gt;
Raw tables&lt;br /&gt;
* companies (last investment, first investment, company name, MSA, MSA code, address, state, date founded, known funding, industry) &lt;br /&gt;
* funds (fund closing date, last investment, first investment, fund name, address, MSA, MSA code, Average investment, number companies invested (NoCos), known investment) &lt;br /&gt;
* rounds (round date, company name, state, round number, stage 1, stage 2, stage 3) &lt;br /&gt;
* combined rounds (company name, round date, disclosed amount, investor) &lt;br /&gt;
* msalist (changes MSAs to CMSAs— combined MSAs)&lt;br /&gt;
*industry list (changes 6 industry categories to 4— ICT, Life Sciences, Semiconductors, Other) &lt;br /&gt;
&lt;br /&gt;
Process&lt;br /&gt;
*cleaned tables to eliminate duplications, undisclosed variables&lt;br /&gt;
*changed all original characters to include CMSA and Industry Codes (companyinfo3, fundinfocleanfinal, roundinfoclean) &lt;br /&gt;
*matched funds to avoid any issues with names (e.g. Fund ABC L.P./Fund ABC LP/Fund ABC) &lt;br /&gt;
*matched roundinfoclean investors to fundinfocleanfinal investors (roundinfo.txt &amp;gt;&amp;gt; cleanfundfinal.txt)&lt;br /&gt;
*join by round and company conames&lt;br /&gt;
*bridge years (1990-2016), stage, and cmsa&lt;br /&gt;
* populate data with count of companies (Deal flow) and estimated amount ($)&lt;br /&gt;
** data set in 181 hubs folder under summarycmsa.txt (38394)&lt;br /&gt;
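&lt;br /&gt;
The fund-name matching step could look roughly like the following Python sketch, which maps variants such as Fund ABC L.P., Fund ABC LP and Fund ABC to one key. The suffix list and the function name are assumptions made for illustration; the matching actually used for the project may differ.&lt;br /&gt;
&lt;br /&gt;
```python
import re

# Assumed list of legal-form suffixes to ignore when matching names.
LEGAL_SUFFIXES = ('lp', 'llc', 'inc', 'ltd')


def normalize_fund_name(name):
    """Return a lower-case matching key with punctuation and suffixes removed."""
    text = name.lower().replace('.', '')     # 'L.P.' becomes 'lp'
    text = re.sub(r'[^a-z0-9]+', ' ', text)  # other punctuation becomes spaces
    tokens = text.split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()                         # drop trailing legal suffixes
    return ' '.join(tokens)


variants = ['Fund ABC L.P.', 'Fund ABC LP', 'Fund ABC']
keys = [normalize_fund_name(v) for v in variants]
```
&lt;br /&gt;
All three variants share the same key, so round investors and fund records can be joined on the key instead of the raw name.&lt;br /&gt;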
&lt;br /&gt;
Key decisions:&lt;br /&gt;
*Threw out undisclosed companies throughout, as they have no address&lt;br /&gt;
*Count is done by joining round and company&lt;br /&gt;
*Anything fund related must be disclosed fund&lt;br /&gt;
*Near and far, and total invested, and fund counts, etc., are all done using disclosed funds that match only&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Glossary of Tables'''&lt;br /&gt;
 cleanco — used to remove duplicates from companies&lt;br /&gt;
 cleanedcompanies — clean set of companies with no duplicates&lt;br /&gt;
 cmsafunds- &lt;br /&gt;
 cmsas— list of all CMSAs in final data set (for merging) &lt;br /&gt;
 cmsastats- statistics not including empty years (pre-merge) &lt;br /&gt;
 cmsastats2 - statistics separated by year-MSA&lt;br /&gt;
 cmsastats3— statistics separated by year-MSA-stage&lt;br /&gt;
 cmsastats4&lt;br /&gt;
 cmsayears— empty merged table between year and cmsa&lt;br /&gt;
 cmsayearstage — empty merged table between cmsa/years and stage&lt;br /&gt;
 combinedrounds— raw sdc data for combined rounds&lt;br /&gt;
 combinedroundswamt— used to join rounds and combined rounds for roundinfo2&lt;br /&gt;
 companies- raw SDC company data&lt;br /&gt;
 companyinfo — cleaned companies joined with state and CMSA information&lt;br /&gt;
 companyinfo2— companyinfo1 with original industry categories&lt;br /&gt;
 companyinfo3— companyinfo2 with updated industry categories and codes&lt;br /&gt;
 companyinfo4-- clean version of companyinfo3&lt;br /&gt;
 companyround- combined company information with round information&lt;br /&gt;
 companyround2- combined company information with round information, cleaned up from companyround&lt;br /&gt;
 companyround3- combined company information with round information, cleaned up from companyround2&lt;br /&gt;
 '''finaldataset'''- final statistics by CMSA-year, see section Final Primary Data Set for more information&lt;br /&gt;
 fundinfo— funds joined with CMSA info&lt;br /&gt;
 fundinfo2 - clean version of fundinfo1&lt;br /&gt;
 fundinfoclean - used in process to clean fundinfo2&lt;br /&gt;
 fundinfoclean2- used in process to clean fundinfo2&lt;br /&gt;
 fundinfocleanfinal- used in process to clean fundinfo2&lt;br /&gt;
 fundinfocleannodups- final clean set of fundinfo&lt;br /&gt;
 funds - raw SDC fund data&lt;br /&gt;
 Houston - analysis for Houston ecosystem team&lt;br /&gt;
 Houston2- analysis for Houston ecosystem team&lt;br /&gt;
 houston3- analysis for Houston ecosystem team&lt;br /&gt;
 industry — new industry codes (4)— used for all future data sets&lt;br /&gt;
 industrylist— lookup table for new industry codes (went from 6 to 4) &lt;br /&gt;
 joined1- used for matching process&lt;br /&gt;
 joined2- used for matching process&lt;br /&gt;
 matchfund2- used for matching process&lt;br /&gt;
 matchfunds- used for matching process&lt;br /&gt;
 matchroundfund - used for matching process&lt;br /&gt;
 matchroundfund2- used for matching process&lt;br /&gt;
 msalist — lookup table for MSA to CMSA (used for all future data sets) &lt;br /&gt;
 nearfar1-- beginning set before adding nearfar/stage variables &lt;br /&gt;
 nearfar2 -- added binomial variables for near/far and for each of the stages, used to build final dataset&lt;br /&gt;
 roundfund— not used— joined round to fund; drop/ignore&lt;br /&gt;
 roundinfo— round info cleaned up to include number of investors in a syndicate and estimated investment per member of syndicate&lt;br /&gt;
 roundinfo2— roundinfo1 including name of investors/funds&lt;br /&gt;
 roundinfo3— clean version of roundinfo2&lt;br /&gt;
 roundinfoclean — final clean version of roundinfo3 (final roundinfo table)&lt;br /&gt;
 rounds — raw SDC round data&lt;br /&gt;
 stages — table for merging stage-year-CMSA&lt;br /&gt;
 superinfo — ignore/drop&lt;br /&gt;
 temp - used for matching process&lt;br /&gt;
 years — table for merging stage-year-CMSA&lt;br /&gt;
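&lt;br /&gt;
The role of the empty merge tables (cmsas, years, cmsayears) can be illustrated with a short Python sketch: build every CMSA-year pair, then fill in the statistics, with zero where a pair has no deals. The function name and the sample values are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
from itertools import product


def fill_empty_years(stats, cmsas, years):
    """Return one entry per (cmsa, year) pair, using 0 where stats has none."""
    return {
        (cmsa, year): stats.get((cmsa, year), 0)
        for cmsa, year in product(cmsas, years)
    }


# Hypothetical pre-merge statistics: only CMSA 1 in 1990 has any deals.
cmsastats = {(1, 1990): 4}
filled = fill_empty_years(cmsastats, [1, 2], range(1990, 1992))
```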
&lt;br /&gt;
===Hub Candidates Data Set===&lt;br /&gt;
&lt;br /&gt;
The Hubs candidate data set is a list of potential hubs found in MSAs throughout the country. Researchers are currently pulling qualitative and quantitative information from the candidates' websites in an attempt to categorize what can be identified as a hub. This is a difficult data set to assemble, as little to no quantitative information is available for this category of institution, and it depends on what information is publicly accessible on the internet.&lt;br /&gt;
&lt;br /&gt;
Characteristics/Variables&lt;br /&gt;
*Year Founded&lt;br /&gt;
*Square footage&lt;br /&gt;
*LinkedIn self-identifiers (how the organization classifies itself on its LinkedIn profile) &lt;br /&gt;
*Activeness on Twitter (binomial)&lt;br /&gt;
*Member Directory available online (binomial)&lt;br /&gt;
*Number of conference rooms&lt;br /&gt;
*Price ($/month) for Flex desk &lt;br /&gt;
*Offers Reserved desk (binomial)&lt;br /&gt;
*Offers office space for rent (binomial) &lt;br /&gt;
*Offers community membership-- not for coworking but for community events, etc. (binomial)&lt;br /&gt;
*Number of events offered per month (estimate)&lt;br /&gt;
*Offers code academy&lt;br /&gt;
*Mission Statement/Vision (for qualitative or key-word analysis) &lt;br /&gt;
&lt;br /&gt;
These characteristics/variables will be used to determine whether a candidate is or is not likely to be a Hub. &lt;br /&gt;
&lt;br /&gt;
As of March 10th 2016, the list contains 125 Hub candidates.&lt;br /&gt;
&lt;br /&gt;
'''Where to find''': The Hubs data set can be found in the Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;dataset folder. It is not currently in the database due to a UTF8 issue&lt;br /&gt;
&lt;br /&gt;
===Supplementary Data Sets===&lt;br /&gt;
'''Patent data''': to be pulled from USPTO or SDC Platinum. &lt;br /&gt;
&lt;br /&gt;
'''Number of STEM Graduate Students''' (NSF) and '''University R&amp;amp;D Spending''' (NSF):&lt;br /&gt;
*University R&amp;amp;D Data found under file &amp;quot;NSF DATA_2004 to 2011.xlsx&amp;quot; in datasets folder (Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;Datasets)&lt;br /&gt;
*R&amp;amp;D spending found at the university level for 2014 (&amp;quot;Stem Grad Students.xlsx&amp;quot;) or at state level (&amp;quot;Science and Engineering Grad Students by State and Year 2000-2011.csv&amp;quot;)&lt;br /&gt;
** not uploaded to server or matched yet to CMSA code, because of this discrepancy. &lt;br /&gt;
**&amp;quot;Stem Grad Students.xlsx&amp;quot; contains categorized university by MSA, can be used for all university-based projects&lt;br /&gt;
&lt;br /&gt;
'''Per Capita Income''' and '''Employment Data''' (US Census Bureau): &lt;br /&gt;
*&amp;quot;Per Capita Personal Income by MSA 2000-2012.xlsx&amp;quot; in datasets folder (Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;Datasets&amp;gt;&amp;gt;Data from Yael)&lt;br /&gt;
*&amp;quot;Wages and Salaries by MSA 2000-2012.xlsx&amp;quot; in datasets folder (Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;datasets&amp;gt;&amp;gt;Data from Yael)&lt;br /&gt;
**not uploaded to server or matched yet to CMSA code&lt;br /&gt;
&lt;br /&gt;
'''Firm Births''' (BDS)&lt;br /&gt;
*in server 181, under table name &amp;quot;BDS&amp;quot;&lt;br /&gt;
*includes birth, death, net(birth-death) and rate(death rate) for years 1990-2013 for every msa&lt;br /&gt;
*includes code for CMSA but is not aggregated by CMSA&lt;br /&gt;
** i.e. BDS statistics are still separate for all the smaller MSAs in New York's CMSA (code=1)&lt;br /&gt;
&lt;br /&gt;
===Resources===&lt;br /&gt;
* Yael Hochberg and Fehder (2015), located in dropbox&lt;br /&gt;
** Use this paper as a guideline on how to conduct the analysis&lt;br /&gt;
*US Census Bureau data on employment by MSA: http://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_14_5YR_B23027&amp;amp;prodType=table&lt;br /&gt;
*USPTO utility patents by MSA: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/cls_cbsa/allcbsa_gd.htm&lt;br /&gt;
*MSA level trends: http://www.metrotrends.org/data.cf&lt;br /&gt;
&lt;br /&gt;
===The Target Dataset===&lt;br /&gt;
&lt;br /&gt;
We will need to process the following variables:&lt;br /&gt;
*SuperMSA - combine SanFran and SanJose, New York and Newark?, NC Research triangle, others?&lt;br /&gt;
*CSV mapping msas to cmsas is in the folder (and a table in the dbase)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Example dataset:&lt;br /&gt;
 MSA      Year       SeedVCInv      SeedEarlyVCInv      LaterVCInv     NoDeals   FundsInvested   DistinctInvestors   ....&lt;br /&gt;
 ----------------------------------------------------------------------------------------------------------------------------&lt;br /&gt;
 1234     2001       1000000        20000000            30000000       4          7              7&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Note that the unit of observation is MSA-Year.&lt;br /&gt;
&lt;br /&gt;
Variables to be computed at the MSA level:&lt;br /&gt;
*HubActive (binary)&lt;br /&gt;
*NoHubsActive (Count)&lt;br /&gt;
*HubSqFt&lt;br /&gt;
*Other Hub Vars (build list!!!)&lt;br /&gt;
*'''SeedVCInv'''  (Seed/Start-up)&lt;br /&gt;
*'''EarlyVCInv''' (Early Stage)&lt;br /&gt;
*'''LaterStageVC''' (Later)&lt;br /&gt;
*'''OtherStageVC''' (Buyout/Acq, Other)&lt;br /&gt;
*'''NoDeals''' (done by local VCs?)&lt;br /&gt;
**'''NoDealsNear'''&lt;br /&gt;
**'''NoDealsFar'''&lt;br /&gt;
*NoPortCosFunded&lt;br /&gt;
*'''FundsInv''' (in an MSA)&lt;br /&gt;
**'''FundsInvFromNear''' (within MSA?)&lt;br /&gt;
**'''FundsInvFromFar''' (outside MSA?)&lt;br /&gt;
*DistinctInvestors (?)&lt;br /&gt;
**DistinctInvestorsNear (within MSA?)&lt;br /&gt;
**DistinctInvestorsFar (outside MSA?)&lt;br /&gt;
*PatentCount&lt;br /&gt;
*NoSTEMGrads&lt;br /&gt;
*FirmBirths (BDS data)&lt;br /&gt;
*UniRandDSpend&lt;br /&gt;
*PerCapitaIncome&lt;br /&gt;
*Employment&lt;br /&gt;
&lt;br /&gt;
We need to:&lt;br /&gt;
*Check funds invested means dollars invested&lt;br /&gt;
*Categorize near and far! Is it within MSA vs. not, within adjacent MSAs, etc.?&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There may be a second dataset that has Hub-Industry-Year (where industry is semiconductor/non-semiconductor?).&lt;br /&gt;
&lt;br /&gt;
===Final Primary Data Set===&lt;br /&gt;
&lt;br /&gt;
*Deal is a round syndicate (near/far deal is one investor that is near/far).&lt;br /&gt;
&lt;br /&gt;
Table name: finaldataset&lt;br /&gt;
 cmsa&lt;br /&gt;
 year&lt;br /&gt;
 totalamountinv--total amount invested &lt;br /&gt;
 nearamountinv--amount invested from local funds&lt;br /&gt;
 faramountinv-- amount invested from funds outside CMSA &lt;br /&gt;
 earlyinv--amount invested in early stage companies &lt;br /&gt;
 laterinv--amount invested in later stage companies &lt;br /&gt;
 startupseedinv--amount invested in seed or startup stage companies &lt;br /&gt;
 otherstageinv--amount invested in Acquisition/Buy-outs/Other stage companies &lt;br /&gt;
 investingfund--distinct funds that are investing in that CMSA-year &lt;br /&gt;
 investingfundnear--distinct funds from that CMSA that invested in that CMSA-year &lt;br /&gt;
 investingfundfar--distinct funds from outside that CMSA that invested in that CMSA-year &lt;br /&gt;
 deals--number of deals &lt;br /&gt;
 neardeals--number of deals inside a CMSA &lt;br /&gt;
 fardeals--number of deals from outside a CMSA --some of these deals might count in both categories, because of syndicate members being both inside and outside the CMSA&lt;br /&gt;
 earlystagedeals--deals with earlystage companies&lt;br /&gt;
 laterstagedeals--deals with later stage companies &lt;br /&gt;
 startupseeddeals--deals with startup/seed companies &lt;br /&gt;
 otherstagedeals--deals with companies in other stages &lt;br /&gt;
 newportcosfunded--number of portfolio companies to receive their first investment in that year&lt;br /&gt;
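&lt;br /&gt;
A minimal Python sketch of how deals, neardeals and fardeals could be counted, assuming a deal is near when at least one syndicate member is based in the company's CMSA and far when at least one is based outside it. The function names and the sample deals are hypothetical. The sketch shows why a single deal can add to both counts.&lt;br /&gt;
&lt;br /&gt;
```python
def classify_deal(company_cmsa, fund_cmsas):
    """Return (near, far) flags for one round syndicate."""
    near = any(cmsa == company_cmsa for cmsa in fund_cmsas)
    far = any(cmsa != company_cmsa for cmsa in fund_cmsas)
    return near, far


def count_deals(deals):
    """Tally deals, neardeals and fardeals over (company_cmsa, fund_cmsas) pairs."""
    totals = {'deals': 0, 'neardeals': 0, 'fardeals': 0}
    for company_cmsa, fund_cmsas in deals:
        near, far = classify_deal(company_cmsa, fund_cmsas)
        totals['deals'] += 1
        totals['neardeals'] += int(near)
        totals['fardeals'] += int(far)
    return totals


# Hypothetical deals for companies in CMSA 1.
sample_deals = [
    (1, [1, 7]),  # mixed syndicate: counts as both near and far
    (1, [1]),     # local investors only
    (1, [7, 9]),  # outside investors only
]
deal_totals = count_deals(sample_deals)
```
&lt;br /&gt;
Because the mixed syndicate is counted in both categories, neardeals plus fardeals can exceed deals.&lt;br /&gt;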
&lt;br /&gt;
===Data by zip code===&lt;br /&gt;
*Population data, 2000-2016 - US Census Bureau (E:\McNair\Hubs\summer 2017)&lt;br /&gt;
https://www2.census.gov/programs-surveys/popest/datasets/&lt;br /&gt;
*Income data, 1998-2014 - The Internal Revenue Service (E:\McNair\Hubs\summer 2017)&lt;br /&gt;
https://www.irs.gov/uac/about-irs&lt;br /&gt;
*DCI index, to assess the economic well-being of communities&lt;br /&gt;
http://eig.org/dci/interactive-maps/u-s-zip-codes&lt;br /&gt;
*R&amp;amp;D Expenses, 1980-2016 - Wharton Research Data Services (E:\McNair\Hubs\summer 2017)&lt;br /&gt;
*Zipcode look-up table obtained from https://www.unitedstateszipcodes.org/zip-code-database/. It's available in (E:\McNair\Hubs\summer 2017).&lt;br /&gt;
&lt;br /&gt;
== Data by MSA ==&lt;br /&gt;
&lt;br /&gt;
We have principal cities of MSAs from the census:&lt;br /&gt;
https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html&lt;br /&gt;
&lt;br /&gt;
We might be able to go City -&amp;gt; FIPS place code -&amp;gt; MSA?&lt;br /&gt;
&lt;br /&gt;
Cities and their FIPS codes (which don't perfectly correspond) are available from https://www.census.gov/geo/reference/codes/place.html&lt;br /&gt;
&lt;br /&gt;
The Census claims to provide city to MSA here: https://www.census.gov/geo/maps-data/data/ua_rel_download.html&lt;br /&gt;
However, there is only CBSA!&lt;br /&gt;
&lt;br /&gt;
This might do it: https://www2.census.gov/geo/pdfs/maps-data/data/rel/explanation_ua_cbsa_rel_10.pdf&lt;br /&gt;
We can maybe track city to principal city to MSA&lt;br /&gt;
&lt;br /&gt;
==COMPUSTAT Data==&lt;br /&gt;
&lt;br /&gt;
Data is in:&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: COMPUSTAT.sql&lt;br /&gt;
&lt;br /&gt;
The source file is RandDExpenditures.txt. It contains:&lt;br /&gt;
*Data from 1980-2017 (July). All COMPUSTAT.&lt;br /&gt;
*427799 records&lt;br /&gt;
*Fields include:&lt;br /&gt;
**R&amp;amp;D Expenditure&lt;br /&gt;
**Address (inc. city, zip, state)&lt;br /&gt;
&lt;br /&gt;
Output file is COMPUSTATSummary.txt. It contains:&lt;br /&gt;
*Variables: City, year, No.public firms, sum R&amp;amp;D, sum Sales, sum total assets&lt;br /&gt;
*1979-2016&lt;br /&gt;
*4440 cities&lt;br /&gt;
&lt;br /&gt;
==NSF Data==&lt;br /&gt;
Data is in:&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: nsf_2017.sql&lt;br /&gt;
&lt;br /&gt;
The source file is nsf2017.txt, copied from table titled '''nsf''' in the biotech db.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*Award ID&lt;br /&gt;
*Award Institution&lt;br /&gt;
*Award Effective date&lt;br /&gt;
*Institution city&lt;br /&gt;
*Award Value&lt;br /&gt;
From 1900 - 2017&lt;br /&gt;
&lt;br /&gt;
Output file is nsfSummary.txt. It contains:&lt;br /&gt;
*Variables: City, year, nogrants, valuegrant &lt;br /&gt;
*1900-2017&lt;br /&gt;
&lt;br /&gt;
Cities are not unique: for example, NEW YORK and New York are treated as two different cities, so their data needs to be merged.&lt;br /&gt;
*3854 cities&lt;br /&gt;
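&lt;br /&gt;
One way to merge such duplicates is to normalize the city name before summing, as in this Python sketch. The column names follow the nsfSummary variables (City, year, nogrants, valuegrant); the sample rows and the function name are hypothetical. Since the summary has no state column, this would also merge same-named cities in different states, so a state field should be added to the key if one is available.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict


def merge_city_rows(rows):
    """Sum nogrants and valuegrant per (normalized city, year)."""
    totals = defaultdict(lambda: {'nogrants': 0, 'valuegrant': 0.0})
    for row in rows:
        # Collapse repeated whitespace and upper-case the name for matching.
        city_key = ' '.join(row['city'].split()).upper()
        key = (city_key, row['year'])
        totals[key]['nogrants'] += row['nogrants']
        totals[key]['valuegrant'] += row['valuegrant']
    return dict(totals)


# Hypothetical summary rows with the same city spelled two ways.
nsf_rows = [
    {'city': 'NEW YORK', 'year': 2010, 'nogrants': 3, 'valuegrant': 100.0},
    {'city': 'New York', 'year': 2010, 'nogrants': 2, 'valuegrant': 50.0},
    {'city': 'Houston', 'year': 2010, 'nogrants': 1, 'valuegrant': 25.0},
]
merged = merge_city_rows(nsf_rows)
```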
&lt;br /&gt;
==NIH Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: nih2017.sql&lt;br /&gt;
The source files are: &lt;br /&gt;
*nih_1986_2001.csv&lt;br /&gt;
*nih_2002_2012.txt&lt;br /&gt;
*nih_2013_2015&lt;br /&gt;
located in E:\McNair\Projects\Federal Grant Data\NIH&lt;br /&gt;
&lt;br /&gt;
*Data from 1986-2015&lt;br /&gt;
&lt;br /&gt;
==Clinical Trials Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: ctrials.sql&lt;br /&gt;
The source file is: &lt;br /&gt;
&lt;br /&gt;
*medclinical.txt&lt;br /&gt;
&lt;br /&gt;
located in Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
*Data from 1999-2017&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Population Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: population.sql&lt;br /&gt;
The source files are: &lt;br /&gt;
*pop2000_2009.xlsx&lt;br /&gt;
*pop2010_2016.xlsx&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*State&lt;br /&gt;
*City name	&lt;br /&gt;
*Year	&lt;br /&gt;
*Population Estimates&lt;br /&gt;
&lt;br /&gt;
Data from 2000-2016&lt;br /&gt;
&lt;br /&gt;
==Income Data==&lt;br /&gt;
&lt;br /&gt;
Raw data was obtained from Census data, American Community Survey.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\MSA Income_raw.zip &lt;br /&gt;
Cleaned files are in&lt;br /&gt;
 Z:\Hubs\2017\merging_on_ID    &lt;br /&gt;
 &lt;br /&gt;
They contain:&lt;br /&gt;
*MSA code&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Total Household Income &lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;br /&gt;
&lt;br /&gt;
The SQL file that merges income data from ACS (by MSA - Year) with the MSA-City file is titled '''income.sql'''. It is located here:&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
The final income table is in db '''cities''' titled '''merged_income'''.&lt;br /&gt;
&lt;br /&gt;
It includes:&lt;br /&gt;
*MSA&lt;br /&gt;
*City&lt;br /&gt;
*Year&lt;br /&gt;
*Total Household Income &lt;br /&gt;
&lt;br /&gt;
The table includes 8780 observations&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hira_Farooqi_(Work_Log)&amp;diff=18983</id>
		<title>Hira Farooqi (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hira_Farooqi_(Work_Log)&amp;diff=18983"/>
		<updated>2017-06-27T22:32:57Z</updated>

		<summary type="html">&lt;p&gt;HiraF: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Hira Farooqi]] [[Work Logs]] [[Hira Farooqi (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
19/07/2017 - Abstract added to [[Hubs (Academic Paper)]]&lt;br /&gt;
&lt;br /&gt;
27/07/2017 - Cleaned data files for income at MSA level. Data description added to Income Data section on [[Hubs (Academic Paper)]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hira_Farooqi_(Work_Log)&amp;diff=18982</id>
		<title>Hira Farooqi (Work Log)</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hira_Farooqi_(Work_Log)&amp;diff=18982"/>
		<updated>2017-06-27T22:32:33Z</updated>

		<summary type="html">&lt;p&gt;HiraF: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Hira Farooqi]] [[Work Logs]] [[Hira Farooqi (Work Log)|(log page)]]&lt;br /&gt;
&lt;br /&gt;
19/07/2017 - Abstract added to [[Hubs (Academic Paper)]]&lt;br /&gt;
27/07/2017 - Cleaned data files for income at MSA level. Data description added to Income Data section on [[Hubs (Academic Paper)]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Work Log]]&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
	<entry>
		<id>http://www.edegan.com/mediawiki/index.php?title=Hubs&amp;diff=18981</id>
		<title>Hubs</title>
		<link rel="alternate" type="text/html" href="http://www.edegan.com/mediawiki/index.php?title=Hubs&amp;diff=18981"/>
		<updated>2017-06-27T22:28:28Z</updated>

		<summary type="html">&lt;p&gt;HiraF: /* Income Data */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{McNair Projects&lt;br /&gt;
|Has title=Hubs&lt;br /&gt;
|Has owner=Hira Farooqi,&lt;br /&gt;
|Has keywords=Data&lt;br /&gt;
|Has project status=Active&lt;br /&gt;
}}&lt;br /&gt;
The Hubs Research Project is a full-length academic paper analyzing the effectiveness of &amp;quot;hubs&amp;quot;, a component of the entrepreneurship ecosystem, in the advancement and growth of entrepreneurial success in a metropolitan area. &lt;br /&gt;
&lt;br /&gt;
This research will primarily focus on large and mid-sized Metropolitan Statistical Areas (MSAs), as that is where the great majority of venture capital funding is located. &lt;br /&gt;
&lt;br /&gt;
===Primary Data Set===&lt;br /&gt;
The Hubs data set, from SDC Platinum, has been constructed on the server:&lt;br /&gt;
 Data files are in 128.42.44.181/bulk/Hubs&lt;br /&gt;
 All files are in 128.42.44.182/bulk/Projects/Ecosystem/Hubs&lt;br /&gt;
 psql Hubs&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The data set includes all United States Venture Capital transactions (moneytree) from the twenty-five year period of 1990 through 2015.&lt;br /&gt;
Data has been aggregated at the portfolio company, fund, and round level. It will be analyzed at the combined MSA level. We will look at it in terms of the number of companies funded, the number of funds active, and the flow of investment in a given MSA.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The data set has now been uploaded to the database server, named Hubs.&lt;br /&gt;
There are 4 tables: &lt;br /&gt;
*Rounds: Rounddate, coname, state, roundno, stage1, etc.&lt;br /&gt;
*CombinedRounds: Coname, rounddate, discamount, fundname&lt;br /&gt;
*Companies: LastInv, FirstInv, coname, MSA, MSACode, Address, state, datefounded, totalknownfunding, industry(major)&lt;br /&gt;
*Funds: fundname, closingdate, lastinv, firstinv, msa, msacode, avinv, nocoinv, totalknowninv, address&lt;br /&gt;
&lt;br /&gt;
Used variables:&lt;br /&gt;
&lt;br /&gt;
 Companies: Coname, MSACode, Industry, state&lt;br /&gt;
 MSALookupTable: MSACode, MSASuper&lt;br /&gt;
 IndustryLookupTable: IndustryMajor, InduCode&lt;br /&gt;
 -&amp;gt; &lt;br /&gt;
 CompanyInfo: Coname, MSASuper, InduCode, state (complete)&lt;br /&gt;
&lt;br /&gt;
 Funds: fundname, msacode, state&lt;br /&gt;
 MSALookupTable: MSACode, MSASuper &lt;br /&gt;
 -&amp;gt; &lt;br /&gt;
 FundInfo: fundname, msacode, state (complete)&lt;br /&gt;
&lt;br /&gt;
 Rounds: coname, rounddate, stagecode, roundno&lt;br /&gt;
 CombinedRounds: coname, rounddate, discamount, fundname&lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 RoundInfoSuper: coname, rounddate, '''nofunds''', discamount   &lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 RoundInfo: Coname, roundyear, fundname, estamount (complete)&lt;br /&gt;
&lt;br /&gt;
Then take:&lt;br /&gt;
 RoundInfo: Coname, roundyear, fundname, estamount&lt;br /&gt;
 CompanyInfo: Coname, MSASuper, InduCode, state&lt;br /&gt;
 FundInfo: fundname, msacode, state&lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 SuperRoundInfo: Coname, CoMSASuper, CoInduCode, CoState, FundName, FundMSASuper, FundState, RoundYear, RoundEstAmount&lt;br /&gt;
 -&amp;gt;&lt;br /&gt;
 MSAPortCos: Count(CoName) As NoPortCosFunded, CoMSASuper, RoundYear&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
'''Notes on Creation of Primary Data Set'''&lt;br /&gt;
&lt;br /&gt;
Raw tables&lt;br /&gt;
* companies (last investment, first investment, company name, MSA, MSA code, address, state, date founded, known funding, industry) &lt;br /&gt;
* funds (fund closing date, last investment, first investment, fund name, address, MSA, MSA code, Average investment, number companies invested (NoCos), known investment) &lt;br /&gt;
* rounds (round date, company name, state, round number, stage 1, stage 2, stage 3) &lt;br /&gt;
* combined rounds (company name, round date, disclosed amount, investor) &lt;br /&gt;
* msalist (changes MSAs to CMSAs— combined MSAs)&lt;br /&gt;
*industry list (changes 6 industry categories to 4— ICT, Life Sciences, Semiconductors, Other) &lt;br /&gt;
&lt;br /&gt;
Process&lt;br /&gt;
*cleaned tables to eliminate duplications, undisclosed variables&lt;br /&gt;
*changed all original characters to include CMSA and Industry Codes (companyinfo3, fundinfocleanfinal, roundinfoclean) &lt;br /&gt;
*matched funds to avoid any issues with names (e.g. Fund ABC L.P./Fund ABC LP/Fund ABC) &lt;br /&gt;
*matched roundinfoclean investors to fundinfocleanfinal investors (roundinfo.txt &amp;gt;&amp;gt; cleanfundfinal.txt)&lt;br /&gt;
*join by round and company conames&lt;br /&gt;
*bridge years (1990-2016), stage, and cmsa&lt;br /&gt;
* populate data with count of companies (Deal flow) and estimated amount ($)&lt;br /&gt;
** data set in 181 hubs folder under summarycmsa.txt (38394)&lt;br /&gt;
&lt;br /&gt;
Key decisions:&lt;br /&gt;
*Threw out undisclosed companies throughout, as they have no address&lt;br /&gt;
*Count is done by joining round and company&lt;br /&gt;
*Anything fund related must be disclosed fund&lt;br /&gt;
*Near and far, and total invested, and fund counts, etc., are all done using disclosed funds that match only&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Glossary of Tables'''&lt;br /&gt;
 cleanco — used to remove duplicates from companies&lt;br /&gt;
 cleanedcompanies — clean set of companies with no duplicates&lt;br /&gt;
 cmsafunds- &lt;br /&gt;
 cmsas— list of all CMSAs in final data set (for merging) &lt;br /&gt;
 cmsastats- statistics not including empty years (pre-merge) &lt;br /&gt;
 cmsastats2 - statistics separated by year-MSA&lt;br /&gt;
 cmsastats3— statistics separated by year-MSA-stage&lt;br /&gt;
 cmsastats4&lt;br /&gt;
 cmsayears— empty merged table between year and cmsa&lt;br /&gt;
 cmsayearstage — empty merged table between cmsa/years and stage&lt;br /&gt;
 combinedrounds— raw sdc data for combined rounds&lt;br /&gt;
 combinedroundswamt— used to join rounds and combined rounds for roundinfo2&lt;br /&gt;
 companies- raw SDC company data&lt;br /&gt;
 companyinfo — cleaned companies joined with state and CMSA information&lt;br /&gt;
 companyinfo2— companyinfo1 with original industry categories&lt;br /&gt;
 companyinfo3— companyinfo2 with updated industry categories and codes&lt;br /&gt;
 companyinfo4-- clean version of companyinfo3&lt;br /&gt;
 companyround- combined company information with round information&lt;br /&gt;
 companyround2- combined company information with round information, cleaned up from companyround&lt;br /&gt;
 companyround3- combined company information with round information, cleaned up from companyround2&lt;br /&gt;
 '''finaldataset'''- final statistics by CMSA-year, see section Final Primary Data Set for more information&lt;br /&gt;
 fundinfo— funds joined with CMSA info&lt;br /&gt;
 fundinfo2 - clean version of fundinfo1&lt;br /&gt;
 fundinfoclean - used in process to clean fundinfo2&lt;br /&gt;
 fundinfoclean2- used in process to clean fundinfo2&lt;br /&gt;
 fundinfocleanfinal- used in process to clean fundinfo2&lt;br /&gt;
 fundinfocleannodups- final clean set of fundinfo&lt;br /&gt;
 funds - raw SDC fund data&lt;br /&gt;
 Houston - analysis for Houston ecosystem team&lt;br /&gt;
 Houston2- analysis for Houston ecosystem team&lt;br /&gt;
 houston3- analysis for Houston ecosystem team&lt;br /&gt;
 industry — new industry codes (4)— used for all future data sets&lt;br /&gt;
 industrylist— lookup table for new industry codes (went from 6 to 4) &lt;br /&gt;
 joined1- used for matching process&lt;br /&gt;
 joined2- used for matching process&lt;br /&gt;
 matchfund2- used for matching process&lt;br /&gt;
 matchfunds- used for matching process&lt;br /&gt;
 matchroundfund - used for matching process&lt;br /&gt;
 matchroundfund2- used for matching process&lt;br /&gt;
 msalist — lookup table for MSA to CMSA (used for all future data sets) &lt;br /&gt;
 nearfar1-- beginning set before adding nearfar/stage variables &lt;br /&gt;
 nearfar2 -- added binomial variables for near/far and for each of the stages, used to build final dataset&lt;br /&gt;
 roundfund— not used— joined round to fund; drop/ignore&lt;br /&gt;
 roundinfo— round info cleaned up to include number of investors in a syndicate and estimated investment per member of syndicate&lt;br /&gt;
 roundinfo2— roundinfo1 including name of investors/funds&lt;br /&gt;
 roundinfo3— clean version of roundinfo2&lt;br /&gt;
 roundinfoclean — final clean version of roundinfo3 (final roundinfo table)&lt;br /&gt;
 rounds — raw SDC round data&lt;br /&gt;
 stages — table for merging stage-year-CMSA&lt;br /&gt;
 superinfo — ignore/drop&lt;br /&gt;
 temp - used for matching process&lt;br /&gt;
 years — table for merging stage-year-CMSA&lt;br /&gt;
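&lt;br /&gt;
The roundinfo entry above mentions an estimated investment per member of a syndicate. The rule used to build that table is not recorded here, so the following is only a minimal sketch that assumes an equal split of the round amount across distinct investors. The function name and arguments are hypothetical.&lt;br /&gt;
&lt;pre&gt;
```python
def estimate_per_member(round_amount, investors):
    """Return (number of distinct investors, estimated amount per investor).

    Assumes an equal split across the syndicate, which is an illustrative
    assumption and not necessarily the rule used to build roundinfo.
    """
    members = set(investors)  # count each investor once per round
    if not members:
        return 0, None
    return len(members), round_amount / len(members)


if __name__ == "__main__":
    # Hypothetical round: three distinct investors, one listed twice.
    print(estimate_per_member(3000000, ["Fund A", "Fund B", "Fund A", "Fund C"]))
```
&lt;/pre&gt;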
&lt;br /&gt;
===Hub Candidates Data Set===&lt;br /&gt;
&lt;br /&gt;
The Hub candidates data set is a list of potential hubs found in MSAs throughout the country. Researchers are currently pulling qualitative and quantitative information from the candidates' websites in an attempt to determine which candidates can be identified as hubs. This data set is difficult to assemble: little to no quantitative information is available for this category of institution, and collection depends on what each candidate makes publicly available on the internet.&lt;br /&gt;
&lt;br /&gt;
Characteristics/Variables&lt;br /&gt;
*Year Founded&lt;br /&gt;
*Square footage&lt;br /&gt;
*LinkedIn self-identifiers (how the organization classifies itself on its LinkedIn profile)&lt;br /&gt;
*Activeness on Twitter (binary)&lt;br /&gt;
*Member Directory available online (binary)&lt;br /&gt;
*Number of conference rooms&lt;br /&gt;
*Price ($/month) for Flex desk&lt;br /&gt;
*Offers Reserved desk (binary)&lt;br /&gt;
*Offers office space for rent (binary)&lt;br /&gt;
*Offers community membership (binary): not for coworking but for community events, etc.&lt;br /&gt;
*Number of events offered per month (estimate)&lt;br /&gt;
*Offers code academy&lt;br /&gt;
*Mission Statement/Vision (for qualitative or key-word analysis) &lt;br /&gt;
&lt;br /&gt;
These characteristics/variables will be used to determine whether a candidate is or is not likely to be a Hub. &lt;br /&gt;
&lt;br /&gt;
As of March 10, 2016, the list contains 125 Hub candidates.&lt;br /&gt;
&lt;br /&gt;
'''Where to find''': The Hubs data set can be found in the Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;dataset folder. It is not currently in the database due to a UTF-8 encoding issue.&lt;br /&gt;
&lt;br /&gt;
===Supplementary Data Sets===&lt;br /&gt;
'''Patent data''': to be pulled from USPTO or SDC Platinum. &lt;br /&gt;
&lt;br /&gt;
'''Number of STEM Graduate Students''' (NSF) and '''University R&amp;amp;D Spending''' (NSF):&lt;br /&gt;
*University R&amp;amp;D Data found under file &amp;quot;NSF DATA_2004 to 2011.xlsx&amp;quot; in datasets folder (Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;Datasets)&lt;br /&gt;
*STEM graduate student data found at the university level for 2014 (&amp;quot;Stem Grad Students.xlsx&amp;quot;) or at the state level (&amp;quot;Science and Engineering Grad Students by State and Year 2000-2011.csv&amp;quot;)&lt;br /&gt;
** not yet uploaded to the server or matched to CMSA codes, because of this discrepancy between university-level and state-level data.&lt;br /&gt;
**&amp;quot;Stem Grad Students.xlsx&amp;quot; contains categorized university by MSA, can be used for all university-based projects&lt;br /&gt;
&lt;br /&gt;
'''Per Capita Income''' and '''Employment Data''' (US Census Bureau): &lt;br /&gt;
*&amp;quot;Per Capita Personal Income by MSA 2000-2012.xlsx&amp;quot; in datasets folder (Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;Datasets&amp;gt;&amp;gt;Data from Yael)&lt;br /&gt;
*&amp;quot;Wages and Salaries by MSA 2000-2012.xlsx&amp;quot; in datasets folder (Ecosystem&amp;gt;&amp;gt;Hubs&amp;gt;&amp;gt;datasets&amp;gt;&amp;gt;Data from Yael)&lt;br /&gt;
**not uploaded to server or matched yet to CMSA code&lt;br /&gt;
&lt;br /&gt;
'''Firm Births''' (BDS)&lt;br /&gt;
*in server 181, under table name &amp;quot;BDS&amp;quot;&lt;br /&gt;
*includes birth, death, net (birth - death) and rate (death rate) for years 1990-2013 for every MSA&lt;br /&gt;
*includes code for CMSA but is not aggregated by CMSA&lt;br /&gt;
** i.e. BDS statistics are still separate for all the smaller MSAs in New York's CMSA (code=1)&lt;br /&gt;
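&lt;br /&gt;
Since the BDS table carries a CMSA code but is not aggregated by it, rolling the MSA rows up to CMSA-year is a simple grouped sum. The sketch below is illustrative only: the key names cmsa, year, birth and death are assumed, not taken from the actual BDS table. The death rate is left out because rates cannot be summed across MSAs.&lt;br /&gt;
&lt;pre&gt;
```python
from collections import defaultdict


def aggregate_bds_by_cmsa(rows):
    """Sum MSA-level firm births and deaths up to CMSA-year.

    rows: iterable of dicts with the assumed keys
    'cmsa', 'year', 'birth' and 'death'.
    """
    totals = defaultdict(lambda: {"birth": 0, "death": 0})
    for row in rows:
        key = (row["cmsa"], row["year"])
        totals[key]["birth"] += row["birth"]
        totals[key]["death"] += row["death"]
    # Recompute net from the summed counts instead of summing a stored net.
    return {
        key: {"birth": t["birth"], "death": t["death"], "net": t["birth"] - t["death"]}
        for key, t in totals.items()
    }


if __name__ == "__main__":
    # Hypothetical rows: two MSAs inside CMSA 1 and one MSA in CMSA 2.
    sample = [
        {"cmsa": 1, "year": 2000, "birth": 10, "death": 4},
        {"cmsa": 1, "year": 2000, "birth": 5, "death": 6},
        {"cmsa": 2, "year": 2000, "birth": 3, "death": 1},
    ]
    print(aggregate_bds_by_cmsa(sample))
```
&lt;/pre&gt;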
&lt;br /&gt;
===Resources===&lt;br /&gt;
* Yael Hochberg and Fehder (2015), located in Dropbox&lt;br /&gt;
** Use this paper as a guideline on how to conduct the analysis&lt;br /&gt;
*US Census Bureau data on employment by MSA: http://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_14_5YR_B23027&amp;amp;prodType=table&lt;br /&gt;
*USPTO utility patents by MSA: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/cls_cbsa/allcbsa_gd.htm&lt;br /&gt;
*MSA level trends: http://www.metrotrends.org/data.cf&lt;br /&gt;
&lt;br /&gt;
===The Target Dataset===&lt;br /&gt;
&lt;br /&gt;
We will need to process the following variables:&lt;br /&gt;
*SuperMSA - combine San Francisco and San Jose, New York and Newark?, NC Research Triangle, others?&lt;br /&gt;
*A CSV mapping MSAs to CMSAs is in the folder (and is also a table in the database)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Example dataset:&lt;br /&gt;
 MSA      Year       SeedVCInv      SeedEarlyVCInv      LaterVCInv     NoDeals   FundsInvested   DistinctInvestors   ....&lt;br /&gt;
 ----------------------------------------------------------------------------------------------------------------------------&lt;br /&gt;
 1234     2001       1000000        20000000            30000000       4          7              7&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Note that the unit of observation is MSA-Year.&lt;br /&gt;
&lt;br /&gt;
Variables to be computed at the MSA level:&lt;br /&gt;
*HubActive (binary)&lt;br /&gt;
*NoHubsActive (Count)&lt;br /&gt;
*HubSqFt&lt;br /&gt;
*Other Hub Vars (build list!!!)&lt;br /&gt;
*'''SeedVCInv'''  (Seed/Start-up)&lt;br /&gt;
*'''EarlyVCInv''' (Early Stage)&lt;br /&gt;
*'''LaterStageVC''' (Later)&lt;br /&gt;
*'''OtherStageVC''' (Buyout/Acq, Other)&lt;br /&gt;
*'''NoDeals''' (done by local VCs?)&lt;br /&gt;
**'''NoDealsNear'''&lt;br /&gt;
**'''NoDealsFar'''&lt;br /&gt;
*NoPortCosFunded&lt;br /&gt;
*'''FundsInv''' (in an MSA)&lt;br /&gt;
**'''FundsInvFromNear''' (within MSA?)&lt;br /&gt;
**'''FundsInvFromFar''' (outside MSA?)&lt;br /&gt;
*DistinctInvestors (?)&lt;br /&gt;
**DistinctInvestorsNear (within MSA?)&lt;br /&gt;
**DistinctInvestorsFar (outside MSA?)&lt;br /&gt;
*PatentCount&lt;br /&gt;
*NoSTEMGrads&lt;br /&gt;
*FirmBirths (BDS data)&lt;br /&gt;
*UniRandDSpend&lt;br /&gt;
*PerCapitaIncome&lt;br /&gt;
*Employment&lt;br /&gt;
&lt;br /&gt;
We need to:&lt;br /&gt;
*Check whether &amp;quot;funds invested&amp;quot; means dollars invested&lt;br /&gt;
*Categorize near and far! Is it within MSA vs. not, within adjacent MSAs, etc.?&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There may be a second dataset that has Hub-Industry-Year (where industry is semiconductor/non-semiconductor?).&lt;br /&gt;
&lt;br /&gt;
===Final Primary Data Set===&lt;br /&gt;
&lt;br /&gt;
*A deal is a round syndicate (a near/far deal is one with at least one investor that is near/far).&lt;br /&gt;
&lt;br /&gt;
Table name: finaldataset&lt;br /&gt;
 cmsa&lt;br /&gt;
 year&lt;br /&gt;
 totalamountinv--total amount invested &lt;br /&gt;
 nearamountinv--amount invested from local funds&lt;br /&gt;
 faramountinv-- amount invested from funds outside CMSA &lt;br /&gt;
 earlyinv--amount invested in early stage companies &lt;br /&gt;
 laterinv--amount invested in later stage companies &lt;br /&gt;
 startupseedinv--amount invested in seed or startup stage companies &lt;br /&gt;
 otherstageinv--amount invested in Acquisition/Buy-outs/Other stage companies &lt;br /&gt;
 investingfund--distinct funds that are investing in that CMSA-year &lt;br /&gt;
 investingfundnear--distinct funds from that CMSA that invested in that CMSA-year &lt;br /&gt;
 investingfundfar--distinct funds from outside that CMSA that invested in that CMSA-year &lt;br /&gt;
 deals--number of deals &lt;br /&gt;
 neardeals--number of deals inside a CMSA &lt;br /&gt;
 fardeals--number of deals from outside a CMSA --some of these deals might count in both categories, because of syndicate members being both inside and outside the CMSA&lt;br /&gt;
 earlystagedeals--deals with earlystage companies&lt;br /&gt;
 laterstagedeals--deals with later stage companies &lt;br /&gt;
 startupseeddeals--deals with startup/seed companies &lt;br /&gt;
 otherstagedeals--deals with companies in other stages &lt;br /&gt;
 newportcosfunded--number of portfolio companies to receive their first investment in that year&lt;br /&gt;
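&lt;br /&gt;
The note on fardeals says a deal can count as both near and far when its syndicate has members inside and outside the CMSA. A minimal sketch of that counting logic follows. It assumes one input tuple per investment, holding a round id, the company's CMSA, the year and the fund's CMSA. That layout and the function name are hypothetical and do not describe the actual tables.&lt;br /&gt;
&lt;pre&gt;
```python
def count_deals(investments):
    """Count deals, near deals and far deals per CMSA-year.

    investments: iterable of (round_id, company_cmsa, year, fund_cmsa)
    tuples, an assumed layout. A round counts as near if any fund shares
    the company's CMSA and as far if any fund does not, so one round can
    be counted in both categories.
    """
    all_rounds, near_rounds, far_rounds = {}, {}, {}
    for round_id, company_cmsa, year, fund_cmsa in investments:
        key = (company_cmsa, year)
        all_rounds.setdefault(key, set()).add(round_id)
        target = near_rounds if fund_cmsa == company_cmsa else far_rounds
        target.setdefault(key, set()).add(round_id)
    return {
        key: {
            "deals": len(rounds),
            "neardeals": len(near_rounds.get(key, set())),
            "fardeals": len(far_rounds.get(key, set())),
        }
        for key, rounds in all_rounds.items()
    }


if __name__ == "__main__":
    # Hypothetical data: round r1 has one local and one outside fund.
    sample = [("r1", 1, 2001, 1), ("r1", 1, 2001, 2), ("r2", 1, 2001, 2)]
    print(count_deals(sample))
```
&lt;/pre&gt;
In this sketch neardeals plus fardeals can exceed deals, which matches the double counting described above.&lt;br /&gt;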
&lt;br /&gt;
===Data by zip code===&lt;br /&gt;
*Population data, 2000-2016 - US Census Bureau (E:\McNair\Hubs\summer 2017)&lt;br /&gt;
https://www2.census.gov/programs-surveys/popest/datasets/&lt;br /&gt;
*Income data, 1998-2014 - The Internal Revenue Service (E:\McNair\Hubs\summer 2017)&lt;br /&gt;
https://www.irs.gov/uac/about-irs&lt;br /&gt;
*DCI index, to assess the economic well-being of communities&lt;br /&gt;
http://eig.org/dci/interactive-maps/u-s-zip-codes&lt;br /&gt;
*R&amp;amp;D Expenses, 1980-2016 - Wharton Research Data Services (E:\McNair\Hubs\summer 2017)&lt;br /&gt;
*Zipcode look-up table obtained from https://www.unitedstateszipcodes.org/zip-code-database/. It's available in (E:\McNair\Hubs\summer 2017).&lt;br /&gt;
&lt;br /&gt;
== Data by MSA ==&lt;br /&gt;
&lt;br /&gt;
We have principal cities of MSAs from the census:&lt;br /&gt;
https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html&lt;br /&gt;
&lt;br /&gt;
We might be able to go City -&amp;gt; FIPS place code -&amp;gt; MSA?&lt;br /&gt;
&lt;br /&gt;
Cities and their FIPS codes (which don't perfectly correspond) are available from https://www.census.gov/geo/reference/codes/place.html&lt;br /&gt;
&lt;br /&gt;
The Census claims to provide city to MSA here: https://www.census.gov/geo/maps-data/data/ua_rel_download.html&lt;br /&gt;
However, there is only CBSA!&lt;br /&gt;
&lt;br /&gt;
This might do it: https://www2.census.gov/geo/pdfs/maps-data/data/rel/explanation_ua_cbsa_rel_10.pdf&lt;br /&gt;
We can maybe track city to principal city to MSA&lt;br /&gt;
&lt;br /&gt;
==COMPUSTAT Data==&lt;br /&gt;
&lt;br /&gt;
Data is in:&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: COMPUSTAT.sql&lt;br /&gt;
&lt;br /&gt;
The source file is RandDExpenditures.txt. It contains:&lt;br /&gt;
*Data from 1980-2017 (July). All COMPUSTAT.&lt;br /&gt;
*427799 records&lt;br /&gt;
*Fields include:&lt;br /&gt;
**R&amp;amp;D Expenditure&lt;br /&gt;
**Address (inc. city, zip, state)&lt;br /&gt;
&lt;br /&gt;
Output file is COMPUSTATSummary.txt. It contains:&lt;br /&gt;
*Variables: City, year, No. of public firms, sum R&amp;amp;D, sum Sales, sum total assets&lt;br /&gt;
*1979-2016&lt;br /&gt;
*4440 cities&lt;br /&gt;
&lt;br /&gt;
==NSF Data==&lt;br /&gt;
Data is in:&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
 Z:\Hubs\2017&lt;br /&gt;
 &lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
&lt;br /&gt;
SQL script is: nsf_2017.sql&lt;br /&gt;
&lt;br /&gt;
The source file is nsf2017.txt, copied from table titled '''nsf''' in the biotech db.&lt;br /&gt;
&lt;br /&gt;
It contains:&lt;br /&gt;
*Award ID&lt;br /&gt;
*Award Institution&lt;br /&gt;
*Award Effective date&lt;br /&gt;
*Institution city&lt;br /&gt;
*Award Value&lt;br /&gt;
From 1900 - 2017&lt;br /&gt;
&lt;br /&gt;
Output file is nsfSummary.txt. It contains:&lt;br /&gt;
*Variables: City, year, nogrants, valuegrant &lt;br /&gt;
*1900-2017&lt;br /&gt;
&lt;br /&gt;
City names are not unique. E.g., NEW YORK and New York appear as two different cities, so their data need to be merged.&lt;br /&gt;
*3854 cities&lt;br /&gt;
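&lt;br /&gt;
One possible way to merge such duplicates is to normalize case and whitespace before grouping. The sketch below assumes rows shaped like the nsfSummary.txt variables (city, year, nogrants, valuegrant). The function names are hypothetical and this is not the logic in nsf_2017.sql.&lt;br /&gt;
&lt;pre&gt;
```python
from collections import defaultdict


def normalize_city(name):
    # Collapse case and repeated whitespace so that NEW YORK and
    # New York map to the same key.
    return " ".join(name.split()).upper()


def merge_city_rows(rows):
    """Merge rows whose city names differ only by case or spacing.

    rows: iterable of (city, year, nogrants, valuegrant) tuples.
    Returns a dict keyed by (normalized city, year) with summed
    grant counts and grant values.
    """
    merged = defaultdict(lambda: [0, 0.0])
    for city, year, nogrants, valuegrant in rows:
        entry = merged[(normalize_city(city), year)]
        entry[0] += nogrants
        entry[1] += valuegrant
    return {key: tuple(value) for key, value in merged.items()}


if __name__ == "__main__":
    # Hypothetical rows illustrating the duplicate city problem.
    sample = [
        ("NEW YORK", 2000, 2, 100.0),
        ("New York", 2000, 3, 50.0),
        ("Houston", 2000, 1, 10.0),
    ]
    print(merge_city_rows(sample))
```
&lt;/pre&gt;
This keys on the city name only, so cities that share a name in different states would also be merged. Where a state field is available it should probably be added to the key.&lt;br /&gt;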
&lt;br /&gt;
==NIH Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: nih2017.sql&lt;br /&gt;
The source files are: &lt;br /&gt;
*nih_1986_2001.csv&lt;br /&gt;
*nih_2002_2012.txt&lt;br /&gt;
*nih_2013_2015&lt;br /&gt;
located in E:\McNair\Projects\Federal Grant Data\NIH&lt;br /&gt;
&lt;br /&gt;
*Data from 1986-2015&lt;br /&gt;
&lt;br /&gt;
==Clinical Trials Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: ctrials.sql&lt;br /&gt;
The source file is: &lt;br /&gt;
&lt;br /&gt;
*medclinical.txt&lt;br /&gt;
&lt;br /&gt;
located in Z:\Hubs\2017&lt;br /&gt;
&lt;br /&gt;
*Data from 1999-2017&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Population Data==&lt;br /&gt;
Data is in: &lt;br /&gt;
 Z:\Hubs&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017&lt;br /&gt;
&lt;br /&gt;
Database is '''cities'''&lt;br /&gt;
SQL script is: population.sql&lt;br /&gt;
The source files are: &lt;br /&gt;
*pop2000_2009.xlsx&lt;br /&gt;
*pop2010_2016.xlsx&lt;br /&gt;
&lt;br /&gt;
They contain:&lt;br /&gt;
*State&lt;br /&gt;
*City name	&lt;br /&gt;
*Year	&lt;br /&gt;
*Population Estimates&lt;br /&gt;
&lt;br /&gt;
Data from 2000-2016&lt;br /&gt;
&lt;br /&gt;
==Income Data==&lt;br /&gt;
&lt;br /&gt;
Raw data was obtained from the Census Bureau's American Community Survey.&lt;br /&gt;
&lt;br /&gt;
Raw Data is in: &lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\MSA Income_raw.zip &lt;br /&gt;
Cleaned files are in&lt;br /&gt;
 E:\McNair\Projects\Hubs\Summer 2017\MSA_Income_cleaning    &lt;br /&gt;
 &lt;br /&gt;
They contain:&lt;br /&gt;
*MSA&lt;br /&gt;
*Year	&lt;br /&gt;
*Total Household Income &lt;br /&gt;
&lt;br /&gt;
Data from 2005-2015&lt;/div&gt;</summary>
		<author><name>HiraF</name></author>
		
	</entry>
</feed>