edegan.com - User contributions [en]

Urban Start-up Agglomeration and Venture Capital Investment

2017-08-04T21:29:54Z

KerdaV: /* Data */

{{AcademicPaper
|Has title=Urban Start-up Agglomeration
|Has author=Ed Egan,
|Has RAs=Peter Jalbert, Jake Silberman, Christy Warden,
|Has paper status=In development
}}

==Summary==

Agglomeration is generally thought to be one of the most important determinants of growth for urban entrepreneurship ecosystems. However, there is essentially no empirical evidence to support this. This paper takes advantage of geocoding and introduces a novel measure of agglomeration: the smallest total area of circles that together cover all startup offices, subject to each circle containing at least N startups. Using GIS data on cities, this paper controls for the density and socio-demographics of an area to isolate the effect of agglomeration itself.

==Description==

Clusters of economic activity play a significant role in firms' performance and growth. An important driver of growth is knowledge spillover between firms. This includes, among other things, the flow of information and ideas between firms, which can be especially important for the growth of startups and small businesses. This project focuses on the effects of agglomeration on the performance and growth of startup firms. It introduces a novel measure of agglomeration that can be used to empirically test the effects of clustering. This measure is the smallest total circle area that covers all of the startups in the sample such that there are at least n firms in each circle. The project is based on the creation of an algorithm that gives an unbiased measure to be used in the empirical analysis. The regression we are interested in takes the following form:

[[File:regression_equation.png]]

The dependent variable is a measure of firm growth. This measure could be investment one period ahead or growth in investment. The explanatory variables include the number of startup firms, m, the agglomeration measure, A, and a vector of other control variables affecting the growth of firms at time t. Because the circle area, that is, the agglomeration measure A, is endogenous, an instrumental variable is needed to get consistent estimates of the effects we are interested in. The proposed instrument is the presence of a river or road between the points representing the geographical locations of the venture-capital-backed firms. The instrument affects agglomeration without having a direct impact on growth, which makes it a good candidate for a valid instrument.
The next tasks are to determine the additional control variables to include in the regression, the years to include in the analysis, and methods of finding an unbiased measure of agglomeration.
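
The identification argument above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are simulated, the variable names (<code>barrier</code>, <code>area</code>, <code>growth</code>) and all coefficients are hypothetical, and the other controls are left out. With one endogenous regressor and one instrument, the two-stage least squares slope reduces to the ratio of two covariances, which is what the code computes.

```python
import random

def covariance(a, b):
    """Sample covariance of two equal-length lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)

def ols_slope(y, x):
    """Slope from a simple OLS regression of y on x (with a constant)."""
    return covariance(x, y) / covariance(x, x)

def iv_slope(y, x, z):
    """Instrumental-variable slope of y on x, using z as the single instrument."""
    return covariance(z, y) / covariance(z, x)

# Simulated data, purely illustrative (all numbers are assumptions).
rng = random.Random(0)
barrier, area, growth = [], [], []
for _ in range(20000):
    z = float(rng.randint(0, 1))      # hypothetical river/road dummy
    shock = rng.gauss(0, 1)           # unobserved factor driving both area and growth
    a = 1.0 + 2.0 * z + shock + rng.gauss(0, 1)
    g = 1.0 - 0.5 * a + 3.0 * shock + rng.gauss(0, 1)   # assumed true effect: -0.5
    barrier.append(z)
    area.append(a)
    growth.append(g)

print("OLS slope:", round(ols_slope(growth, area), 2))
print("IV slope: ", round(iv_slope(growth, area, barrier), 2))
```

In this setup the OLS slope is pulled away from the assumed effect by the unobserved shock, while the IV slope stays close to it because the instrument is unrelated to that shock by construction. Whether a river or road satisfies the same condition in the real data is exactly what has to be argued.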

==Data==

*SDC VentureXpert
*GIS City Data

*Data on NSF, NIH, population, income, clinical trials, employment, schooling, R&D expenditures and revenue of firms can be found in [[Hubs]].
*Data on the number of new VC-backed firms in each city and year is in:
Z:\Hubs\2017\clean data
The name of the file is '''firm_nr.txt'''.
Database is cities
SQL script is: '''nr_firms.sql'''

Raw data is in:
Z:\VentureCapitalData\SDCVCData\vcdb2
The file is '''colevelsimple.txt'''

To check for outliers, I compute the average coordinates for each city and take the difference between each firm's coordinates and those of its city.
The script for the average city coordinates is in
Z:\Hubs\2017\sql scripts and the file name is '''newcolevel.sql'''.

The differences are computed in Excel. The file containing the differences is in
Z:\Hubs\2017 and the file name is '''new_colevel.txt'''.
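
The outlier check described above can be sketched in a few lines of Python. This is not the SQL script or the Excel step themselves; the record layout (firm, city, latitude, longitude) and the sample firms are assumptions made for illustration.

```python
from collections import defaultdict

def city_centroids(firms):
    """Average latitude and longitude of the firms in each city."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for _name, city, lat, lon in firms:
        entry = sums[city]
        entry[0] += lat
        entry[1] += lon
        entry[2] += 1
    return {city: (s[0] / s[2], s[1] / s[2]) for city, s in sums.items()}

def coordinate_differences(firms):
    """Difference between each firm's coordinates and its city's average."""
    centroids = city_centroids(firms)
    rows = []
    for name, city, lat, lon in firms:
        clat, clon = centroids[city]
        rows.append((name, city, lat - clat, lon - clon))
    return rows

# Hypothetical records: firm D is geocoded far away from the other Houston firms.
firms = [
    ("A", "Houston", 29.76, -95.37),
    ("B", "Houston", 29.75, -95.36),
    ("C", "Houston", 29.77, -95.38),
    ("D", "Houston", 40.71, -74.01),
    ("E", "Austin", 30.27, -97.74),
]

# Largest absolute differences first, so suspicious firms appear at the top.
ranked = sorted(coordinate_differences(firms),
                key=lambda row: max(abs(row[2]), abs(row[3])),
                reverse=True)
for row in ranked:
    print(row)
```

Note that a badly geocoded firm also shifts its city's average, so the differences of the remaining firms in that city are inflated as well. Ranking the differences, rather than applying a fixed cutoff, keeps the check usable in that case.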

*Data on the circle area in each city and year is in:
Z:\Hubs\2017\clean data
The name of the file is '''circles.txt'''. (It contains only 106 observations)

Database is cities
SQL script is: '''circles.sql'''

The script for joining the two tables on the VC table is in:
Z:\Hubs\2017\sql scripts
The name of the file is '''new_firm_nr_circles.sql'''

*We use only the cities with more than 10 active VC-backed firms. Data on the cities and number of active firms is in:
E:\McNair\Projects\Hubs\Summer 2017
The file is '''CitiesWithGT10Active.txt'''

The script for joining the final data with this file is located in
Z:\Hubs\2017\sql scripts
The file name is '''final_joined_kerda.sql'''.
The final data is in
Z:\Hubs\2017\clean data
The file name is '''new_final_kerda.txt'''.

Also:
*Accelerators data is in
Z:\Hubs\2017\clean data
The file name is '''accelerators.txt'''
The table is '''accelerators'''
The accelerators data joined with the VC table is in the '''joined_accelerators''' table.
The script is in
Z:\Hubs\2017\sql scripts
The file name is '''join_accelerators.sql'''

The do file is in
Z:\Hubs\2017\kerda
The name is '''agglomeartion_kerda.do'''
It includes the graphs, tables and the preliminary FE regressions with VC funding amount and growth rate.
It also predicts the hazard rates and matches on the hazard rate in order to create synthetic control and treatment groups.
What is left to do is to add 2 lagged and 3 forward observations for the cities that have a match, and to remove the overlapping observations for years that receive a treatment but at the same time serve as a control.
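
The matching and windowing steps can be sketched as follows. This is an illustration only, not what the do file does: it assumes nearest-neighbour matching without replacement and with a caliper, and the city names, years and hazard values are made up.

```python
def match_on_hazard(treated, controls, caliper=0.05):
    """Match each treated unit to the unused control with the closest hazard.

    treated and controls map a (city, year) key to a predicted hazard rate.
    A match is kept only if the hazards differ by at most `caliper`.
    """
    available = dict(controls)
    matches = {}
    for unit, hazard in sorted(treated.items(), key=lambda item: item[1]):
        if not available:
            break
        best = min(available, key=lambda key: abs(available[key] - hazard))
        if abs(available[best] - hazard) <= caliper:
            matches[unit] = best
            del available[best]  # matching without replacement
    return matches

def event_window(year, lags=2, leads=3):
    """Years to keep around a matched year: 2 lagged and 3 forward by default."""
    return list(range(year - lags, year + leads + 1))

# Hypothetical predicted hazards.
treated = {("Austin", 2010): 0.30, ("Boulder", 2012): 0.10}
controls = {("Denver", 2010): 0.32, ("Tulsa", 2012): 0.11, ("Reno", 2011): 0.60}

matches = match_on_hazard(treated, controls)
for treated_unit, control_unit in matches.items():
    print(treated_unit, "->", control_unit, event_window(treated_unit[1]))
```

Removing the overlapping observations would then amount to dropping any control city-year that falls inside the event window of a treatment in the same city.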

Also:
*[[Enclosing Circle Algorithm]]
*Normalizer
*Geocode.py
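
The project's implementation is on the [[Enclosing Circle Algorithm]] page. As a rough illustration of the measure only, the sketch below finds the smallest circle enclosing one group of points by brute force and sums the circle areas for a given grouping of firms. It assumes planar (projected) coordinates rather than raw latitude and longitude, takes the grouping as given instead of searching for the best one, and uses hypothetical function names.

```python
import math
from itertools import combinations

def circle_from_two(p, q):
    """Circle with the segment pq as its diameter: (cx, cy, radius)."""
    return (p[0] + q[0]) / 2, (p[1] + q[1]) / 2, math.dist(p, q) / 2

def circle_from_three(p, q, r):
    """Circumcircle of three points, or None if they are collinear."""
    (ax, ay), (bx, by), (cx, cy) = p, q, r
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return ux, uy, math.dist((ux, uy), p)

def contains_all(circle, points, eps=1e-9):
    cx, cy, radius = circle
    return all(math.dist((cx, cy), pt) <= radius + eps for pt in points)

def smallest_enclosing_circle(points):
    """Brute-force search over circles defined by two or three of the points."""
    if len(points) == 1:
        return points[0][0], points[0][1], 0.0
    candidates = [circle_from_two(p, q) for p, q in combinations(points, 2)]
    for triple in combinations(points, 3):
        circle = circle_from_three(*triple)
        if circle is not None:
            candidates.append(circle)
    feasible = [c for c in candidates if contains_all(c, points)]
    return min(feasible, key=lambda c: c[2])

def total_circle_area(clusters, min_firms):
    """Sum of enclosing-circle areas for a given grouping of firms."""
    if any(len(cluster) < min_firms for cluster in clusters):
        raise ValueError("every circle must contain at least min_firms firms")
    return sum(math.pi * smallest_enclosing_circle(c)[2] ** 2 for c in clusters)

clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 10.0), (10.0, 12.0)]]
print(total_circle_area(clusters, min_firms=2))
```

The brute-force search checks every pair and triple of points, so it is only practical for small groups. The hard part of the measure, choosing the grouping that minimises the total area, is not attempted here.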

===Unbiased measure===

The number of startups affects the total area of the circles according to some unknown function. The task is to find an unbiased measure of the area, one that is not affected by the number of startups, given the size and the distribution of the startups.

For the unbiased calculation of a measure in a different context see: http://users.nber.org/~edegan/w/images/d/d0/Hall_(2005)_-_A_Note_On_The_Bias_In_Herfindahl_Type_Measures_Based_On_Count_Data.pdf
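
To make the analogy concrete, the sketch below shows the kind of small-sample bias that arises for a Herfindahl index computed from counts. It relies on a standard result for multinomial counts: the naive index overstates concentration when the total count N is small, while the sum of n_i(n_i - 1) divided by N(N - 1) is unbiased for the sum of squared shares. Whether this matches the notation of the linked note exactly has not been checked here, and it does not by itself solve the circle-area problem.

```python
def herfindahl(counts):
    """Naive Herfindahl index: sum of squared observed shares."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)

def herfindahl_unbiased(counts):
    """Unbiased estimate of the sum of squared shares under multinomial sampling."""
    total = sum(counts)
    if total < 2:
        raise ValueError("need at least two observations")
    return sum(c * (c - 1) for c in counts) / (total * (total - 1))

counts = [2, 2]  # hypothetical counts in two categories
naive = herfindahl(counts)
corrected = herfindahl_unbiased(counts)
print(naive, corrected)
```

The correction equals (N * H - 1) / (N - 1), where H is the naive index, so the gap between the two shrinks as N grows. An unbiased area measure would need a similar adjustment for the number of startups.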

===GIS Resources===
*https://www.census.gov/geo/maps-data/data/tiger-line.html
*https://www.census.gov/geo/maps-data/data/tiger.html
*http://postgis.net/features/
*https://en.wikipedia.org/wiki/GIS_file_formats

Urban Start-up Agglomeration and Venture Capital Investment

2017-08-04T21:22:23Z

KerdaV: /* Data */


Urban Start-up Agglomeration and Venture Capital Investment

2017-08-04T21:05:54Z

KerdaV: /* Data */


Urban Start-up Agglomeration and Venture Capital Investment

2017-08-04T21:00:56Z

KerdaV: /* Data */


Urban Start-up Agglomeration and Venture Capital Investment

2017-08-04T20:51:59Z

KerdaV: /* Data */


Urban Start-up Agglomeration and Venture Capital Investment

2017-08-04T20:33:31Z

KerdaV: /* Data */


Hubs

2017-08-04T20:14:06Z

KerdaV: /* VC Data */

Hubs

2017-08-04T18:25:47Z

KerdaV: /* VC Data */

Urban Start-up Agglomeration and Venture Capital Investment

2017-08-01T15:46:48Z

KerdaV: /* Data */

Urban Start-up Agglomeration and Venture Capital Investment

2017-08-01T15:45:40Z

KerdaV: /* Data */

Urban Start-up Agglomeration and Venture Capital Investment

2017-08-01T15:44:10Z

KerdaV: /* Data */

Urban Start-up Agglomeration and Venture Capital Investment

2017-07-31T19:38:41Z

KerdaV: /* Data */

Urban Start-up Agglomeration and Venture Capital Investment

2017-07-31T19:38:11Z

KerdaV: /* Data */

Urban Start-up Agglomeration and Venture Capital Investment

2017-07-31T17:28:53Z

KerdaV: /* Data */

Urban Start-up Agglomeration and Venture Capital Investment

2017-07-31T17:28:13Z

KerdaV: /* Data */

Urban Start-up Agglomeration and Venture Capital Investment

2017-07-31T17:27:19Z

KerdaV: /* Data */

Urban Start-up Agglomeration and Venture Capital Investment

2017-07-31T16:30:28Z

KerdaV:

Hubs

2017-07-14T18:00:23Z

KerdaV: /* VC Data */

Hubs

2017-07-14T17:59:41Z

KerdaV: /* VC Data */

Hubs

2017-07-14T17:58:45Z

KerdaV: /* VC Data */

Hubs Summer 2016

2017-07-13T20:19:38Z

KerdaV:

===Primary Data Set===
The Hubs data set, from SDC Platinum, has been constructed on the server:
Data files are in 128.42.44.181/bulk/Hubs
All files are in 128.42.44.182/bulk/Projects/Ecosystem/Hubs
psql Hubs

The data set includes all United States Venture Capital transactions (moneytree) from the twenty-five year period of 1990 through 2015.
Data has been aggregated at the portfolio company, fund, and round level. It will be analyzed at the combined MSA level. We will look at it in terms of the number of companies funded, the number of funds active, and the flow of investment in a given MSA.

The data set has now been uploaded to the database server, named Hubs.
There are 4 tables:
*Rounds: Rounddate, coname, state, roundno, stage1, etc.
*CombinedRounds: Coname, rounddate, discamount, fundname
*Companies: LastInv, FirstInv, coname, MSA, MSACode, Address, state, datefounded, totalknownfunding, industry(major)
*Funds: fundname, closingdate, lastinv, firstinv, msa, msacode, avinv, nocoinv, totalknowninv, address

Used variables:

Companies: Coname, MSACode, Industry, state
MSALookupTable: MSACode, MSASuper
IndustryLookupTable: IndustryMajor, InduCode
->
CompanyInfo: Coname, MSASuper, InduCode, state (complete)

Funds: fundname, msacode, state
MSALookupTable: MSACode, MSASuper
->
FundInfo: fundname, msacode, state (complete)

Rounds: coname, rounddate, stagecode, roundno
CombinedRounds: coname, rounddate, discamount, fundname
->
RoundInfoSuper: coname, rounddate, '''nofunds''', discamount
->
RoundInfo: Coname, roundyear, fundname, estamount (complete)

Then take:
RoundInfo: Coname, roundyear, fundname, estamount
CompanyInfo: Coname, MSASuper, InduCode, state
FundInfo: fundname, msacode, state
->
SuperRoundInfo: Coname, CoMSASuper, CoInduCode, CoState, FundName, FundMSASuper, FundState, RoundYear, RoundEstAmount
->
MSAPortCos: Count(CoName) As NoPortCosFunded, CoMSASuper, RoundYear
...
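
The table flow above can be sketched with an in-memory SQLite database. This is only an illustration under assumptions: the table and column names are lower-cased versions of those listed above, the sample rows are invented, and COUNT(DISTINCT ...) is used where the outline writes Count(CoName) so that a company with several funds in a round is counted once.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE companies (coname TEXT, msacode INTEGER, industry TEXT, state TEXT);
CREATE TABLE msalookup (msacode INTEGER, msasuper INTEGER);
CREATE TABLE industrylookup (industrymajor TEXT, inducode INTEGER);
CREATE TABLE roundinfo (coname TEXT, roundyear INTEGER, fundname TEXT, estamount REAL);

INSERT INTO companies VALUES ('Acme', 10, 'Biotech', 'TX'), ('Beta', 11, 'Software', 'TX');
INSERT INTO msalookup VALUES (10, 1), (11, 1);
INSERT INTO industrylookup VALUES ('Biotech', 2), ('Software', 1);
INSERT INTO roundinfo VALUES ('Acme', 2001, 'Fund A', 1.0),
                             ('Acme', 2001, 'Fund B', 1.0),
                             ('Beta', 2001, 'Fund A', 2.0);

-- Companies + lookup tables -> CompanyInfo
CREATE TABLE companyinfo AS
SELECT c.coname, m.msasuper, i.inducode, c.state
FROM companies c
JOIN msalookup m ON m.msacode = c.msacode
JOIN industrylookup i ON i.industrymajor = c.industry;
""")

# RoundInfo + CompanyInfo -> MSAPortCos (companies funded per super-MSA and year)
msaportcos = con.execute("""
SELECT ci.msasuper, r.roundyear, COUNT(DISTINCT r.coname) AS noportcosfunded
FROM roundinfo r
JOIN companyinfo ci ON ci.coname = r.coname
GROUP BY ci.msasuper, r.roundyear
""").fetchall()
print(msaportcos)
```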

'''Notes on Creation of Primary Data Set'''

Raw tables
* companies (last investment, first investment, company name, MSA, MSA code, address, state, date founded, known funding, industry)
* funds (fund closing date, last investment, first investment, fund name, address, MSA, MSA code, Average investment, number companies invested (NoCos), known investment)
* rounds (round date, company name, state, round number, stage 1, stage 2, stage 3)
* combined rounds (company name, round date, disclosed amount, investor)
* msalist (changes MSAs to CMSAs— combined MSAs)
*industry list (changes 6 industry categories to 4— ICT, Life Sciences, Semiconductors, Other)

Process
*cleaned tables to eliminate duplicates and undisclosed values
*changed all original characters to include CMSA and Industry Codes (companyinfo3, fundinfocleanfinal, roundinfoclean)
*matched funds to avoid any issues with names (e.g. Fund ABC L.P./Fund ABC LP/Fund ABC)
*matched roundinfoclean investors to fundinfocleanfinal investors (roundinfo.txt >> cleanfundfinal.txt)
*join by round and company conames
*bridge years (1990-2016), stage, and cmsa
* populate data with count of companies (Deal flow) and estimated amount ($)
** data set in 181 hubs folder under summarycmsa.txt (38394)
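
The fund-name matching step in the process above can be sketched as a simple normalization. This is not the project's own matching code: the suffix list and the function name are hypothetical, and real data would likely need more rules.

```python
import re

# Hypothetical list of legal suffixes to drop; extend as needed.
SUFFIXES = {"lp", "llc", "inc", "ltd"}


def normalize_fund_name(name):
    """Lowercase, strip punctuation, and drop trailing legal suffixes."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    tokens = cleaned.split()
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


names = ["Fund ABC L.P.", "Fund ABC LP", "Fund ABC"]
keys = {normalize_fund_name(n) for n in names}
print(keys)
```

All three spellings collapse to one key, which can then be used to join round investors to funds.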

Key decisions:
*Threw out undisclosed companies throughout, as they have no address
*Count is done by joining round and company
*Anything fund related must use a disclosed fund
*Near and far, total invested, fund counts, etc., are all computed using only disclosed funds that match

'''Glossary of Tables'''
*cleanco: used to remove duplicates from companies
*cleanedcompanies: clean set of companies with no duplicates
*cmsafunds
*cmsas: list of all CMSAs in final data set (for merging)
*cmsastats: statistics not including empty years (pre-merge)
*cmsastats2: statistics separated by year-MSA
*cmsastats3: statistics separated by year-MSA-stage
*cmsastats4
*cmsayears: empty merged table between year and cmsa
*cmsayearstage: empty merged table between cmsa/years and stage
*combinedrounds: raw SDC data for combined rounds
*combinedroundswamt: used to join rounds and combined rounds for roundinfo2
*companies: raw SDC company data
*companyinfo: cleaned companies joined with state and CMSA information
*companyinfo2: companyinfo with original industry categories
*companyinfo3: companyinfo2 with updated industry categories and codes
*companyinfo4: clean version of companyinfo3
*companyround: combined company information with round information
*companyround2: combined company information with round information, cleaned up from companyround
*companyround3: combined company information with round information, cleaned up from companyround2
*'''finaldataset''': final statistics by CMSA-year, see section Final Primary Data Set for more information
*fundinfo: funds joined with CMSA info
*fundinfo2: clean version of fundinfo
*fundinfoclean: used in process to clean fundinfo2
*fundinfoclean2: used in process to clean fundinfo2
*fundinfocleanfinal: used in process to clean fundinfo2
*fundinfocleannodups: final clean set of fundinfo
*funds: raw SDC fund data
*Houston: analysis for Houston ecosystem team
*Houston2: analysis for Houston ecosystem team
*houston3: analysis for Houston ecosystem team
*industry: new industry codes (4), used for all future data sets
*industrylist: lookup table for new industry codes (went from 6 to 4)
*joined1: used for matching process
*joined2: used for matching process
*matchfund2: used for matching process
*matchfunds: used for matching process
*matchroundfund: used for matching process
*matchroundfund2: used for matching process
*msalist: lookup table for MSA to CMSA (used for all future data sets)
*nearfar1: beginning set before adding nearfar/stage variables
*nearfar2: added binary variables for near/far and for each of the stages, used to build final dataset
*roundfund: not used; joined round to fund; drop/ignore
*roundinfo: round info cleaned up to include number of investors in a syndicate and estimated investment per member of syndicate
*roundinfo2: roundinfo including name of investors/funds
*roundinfo3: clean version of roundinfo2
*roundinfoclean: final clean version of roundinfo3 (final roundinfo table)
*rounds: raw SDC round data
*stages: table for merging stage-year-CMSA
*superinfo: ignore/drop
*temp: used for matching process
*years: table for merging stage-year-CMSA
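
The roundinfo entry above mentions an estimated investment per syndicate member. One plausible reading, which is an assumption here and not a confirmed description of the SQL used, is an even split of the disclosed round amount across the funds in the round. The sample rows are invented.

```python
from collections import defaultdict

# (coname, rounddate, discamount, fundname) rows, shaped like CombinedRounds
combined_rounds = [
    ("Acme", "2001-03-01", 9.0, "Fund A"),
    ("Acme", "2001-03-01", 9.0, "Fund B"),
    ("Acme", "2001-03-01", 9.0, "Fund C"),
    ("Beta", "2002-06-15", 4.0, "Fund A"),
]

# Collect the distinct funds in each round (the nofunds variable)
funds_per_round = defaultdict(set)
for coname, rounddate, _, fundname in combined_rounds:
    funds_per_round[(coname, rounddate)].add(fundname)

# One row per company, year, and fund, with an evenly split estimated amount
round_info = [
    (coname, int(rounddate[:4]), fundname,
     discamount / len(funds_per_round[(coname, rounddate)]))
    for coname, rounddate, discamount, fundname in combined_rounds
]
print(round_info)
```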

===Hub Candidates Data Set===

The Hubs candidate data set is a list of potential hubs found in MSAs throughout the country. Researchers are currently pulling qualitative and quantitative information from the candidates' websites, in an attempt to categorize what can be identified as a hub. This is a difficult data set to build, as there is little to no quantitative information available for this category of institution, and it depends on what information is publicly accessible on the internet.

Characteristics/Variables
*Year Founded
*Square footage
*LinkedIn self-identifiers (how the organization classifies itself on its LinkedIn profile)
*Activeness on Twitter (binary)
*Member Directory available online (binary)
*Number of conference rooms
*Price ($/month) for Flex desk
*Offers Reserved desk (binary)
*Offers office space for rent (binary)
*Offers community membership: not for coworking but for community events, etc. (binary)
*Number of events offered per month (estimate)
*Offers code academy
*Mission Statement/Vision (for qualitative or key-word analysis)

These characteristics/variables will be used to determine whether a candidate is or is not likely to be a Hub.

As of March 10th 2016, the list contains 125 Hub candidates.

'''Where to find''': The Hubs data set can be found in the Ecosystem>>Hubs>>dataset folder. It is not currently in the database due to a UTF8 issue

===Supplementary Data Sets===
'''Patent data''': to be pulled from USPTO or SDC Platinum.

'''Number of STEM Graduate Students''' (NSF) and '''University R&D Spending''' (NSF):
*University R&D Data found under file "NSF DATA_2004 to 2011.xlsx" in datasets folder (Ecosystem>>Hubs>>Datasets)
*STEM graduate student data found at the university level for 2014 ("Stem Grad Students.xlsx") or at the state level ("Science and Engineering Grad Students by State and Year 2000-2011.csv")
** not uploaded to server or matched yet to CMSA code, because of this discrepancy.
**"Stem Grad Students.xlsx" contains categorized university by MSA, can be used for all university-based projects

'''Per Capita Income''' and '''Employment Data''' (US Census Bureau):
*"Per Capita Personal Income by MSA 2000-2012.xlsx" in datasets folder (Ecosystem>>Hubs>>Datasets>>Data from Yael)
*"Wages and Salaries by MSA 2000-2012.xlsx" in datasets folder (Ecosystem>>Hubs>>datasets>>Data from Yael)
**not uploaded to server or matched yet to CMSA code

'''Firm Births''' (BDS)
*in server 181, under table name "BDS"
*includes birth, death, net(birth-death) and rate(death rate) for years 1990-2013 for every msa
*includes code for CMSA but is not aggregated by CMSA
** i.e. BDS statistics are still separate for all the smaller MSAs in New York's CMSA (code=1)

===Resources===
* Yael Hochberg and Fehder (2015), located in dropbox
** Use this paper as a guideline on how to conduct the analysis
*US Census Bureau data on employment by MSA: http://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_14_5YR_B23027&prodType=table
*USPTO utility patents by MSA: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/cls_cbsa/allcbsa_gd.htm
*MSA level trends: http://www.metrotrends.org/data.cf

===The Target Dataset===

We will need to process the following variables:
*SuperMSA - combine SanFran and SanJose, New York and Newark?, NC Research triangle, others?
*CSV mapping msas to cmsas is in the folder (and a table in the dbase)

Example dataset:
{| class="wikitable"
! MSA !! Year !! SeedVCInv !! SeedEarlyVCInv !! LaterVCInv !! NoDeals !! FundsInvested !! DistinctInvestors !! ...
|-
| 1234 || 2001 || 1000000 || 20000000 || 30000000 || 4 || 7 || 7 || ...
|}

Note that the unit of observation is MSA-Year.

Variables to be computed at the MSA level:
*HubActive (binary)
*NoHubsActive (Count)
*HubSqFt
*Other Hub Vars (build list!!!)
*'''SeedVCInv''' (Seed/Start-up)
*'''EarlyVCInv''' (Early Stage)
*'''LaterStageVC''' (Later)
*'''OtherStageVC''' (Buyout/Acq, Other)
*'''NoDeals''' (done by local VCs?)
**'''NoDealsNear'''
**'''NoDealsFar'''
*NoPortCosFunded
*'''FundsInv''' (in an MSA)
**'''FundsInvFromNear''' (within MSA?)
**'''FundsInvFromFar''' (outside MSA?)
*DistinctInvestors (?)
**DistinctInvestorsNear (within MSA?)
**DistinctInvestorsFar (outside MSA?)
*PatentCount
*NoSTEMGrads
*FirmBirths (BDS data)
*UniRandDSpend
*PerCapitaIncome
*Employment

We need to:
*Check funds invested means dollars invested
*Categorize near and far! Is it within MSA vs. not, within adjacent MSAs, etc.?

There may be a second dataset that has Hub-Industry-Year (where industry is semiconductor/non-semiconductor?).

===Final Primary Data Set===

*A deal is a round syndicate. A deal counts as near (far) if at least one investor in the syndicate is near (far).

Table name: finaldataset
*cmsa
*year
*totalamountinv: total amount invested
*nearamountinv: amount invested from local funds
*faramountinv: amount invested from funds outside CMSA
*earlyinv: amount invested in early stage companies
*laterinv: amount invested in later stage companies
*startupseedinv: amount invested in seed or startup stage companies
*otherstageinv: amount invested in Acquisition/Buy-outs/Other stage companies
*investingfund: distinct funds that are investing in that CMSA-year
*investingfundnear: distinct funds from that CMSA that invested in that CMSA-year
*investingfundfar: distinct funds from outside that CMSA that invested in that CMSA-year
*deals: number of deals
*neardeals: number of deals inside a CMSA
*fardeals: number of deals from outside a CMSA (some of these deals might count in both categories, because syndicate members can be both inside and outside the CMSA)
*earlystagedeals: deals with early stage companies
*laterstagedeals: deals with later stage companies
*startupseeddeals: deals with startup/seed companies
*otherstagedeals: deals with companies in other stages
*newportcosfunded: number of portfolio companies to receive their first investment in that year
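
The near/far deal counts above can be sketched as follows. This assumes "near" means the fund's CMSA equals the company's CMSA, which matches the variable descriptions but is not confirmed as the exact rule; the function name and the sample syndicates are hypothetical.

```python
def classify_deal(company_cmsa, fund_cmsas):
    """Return (is_near, is_far) for one round syndicate."""
    is_near = any(f == company_cmsa for f in fund_cmsas)
    is_far = any(f != company_cmsa for f in fund_cmsas)
    return is_near, is_far


# round id -> (company CMSA, CMSAs of the funds in the syndicate)
deals = {
    "round1": (1, [1, 1]),  # all investors local
    "round2": (1, [1, 5]),  # mixed syndicate, counts as both near and far
    "round3": (1, [7]),     # all investors from outside
}

flags = {rid: classify_deal(cmsa, funds) for rid, (cmsa, funds) in deals.items()}
neardeals = sum(near for near, _ in flags.values())
fardeals = sum(far for _, far in flags.values())
print(len(deals), neardeals, fardeals)
```

Because a mixed syndicate is counted in both categories, neardeals plus fardeals can exceed deals.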

Hubs

2017-07-13T20:18:44Z

KerdaV:

Hubs

2017-07-12T21:18:14Z

KerdaV: /* Joined NIH table */

{{McNair Projects
|Has title=Hubs
|Has owner=Hira Farooqi,
|Has keywords=Data
|Has project status=Active
}}
The Hubs Research Project is a full-length academic paper analyzing the effectiveness of "hubs", a component of the entrepreneurship ecosystem, in the advancement and growth of entrepreneurial success in a metropolitan area.

This research will primarily focus on large and mid-sized Metropolitan Statistical Areas (MSAs), as that is where the great majority of venture capital funding is located.


===Data by zip code===
*Population data, 2000-2016 - US Census Bureau (E:\McNair\Hubs\summer 2017)
https://www2.census.gov/programs-surveys/popest/datasets/
*Income data, 1998-2014 - The Internal Revenue Service (E:\McNair\Hubs\summer 2017)
https://www.irs.gov/uac/about-irs
*DCI index, to assess the economic well-being of communities
http://eig.org/dci/interactive-maps/u-s-zip-codes
*R&D Expenses, 1980-2016 - Wharton Research Data Services (E:\McNair\Hubs\summer 2017)
*Zipcode look-up table obtained from https://www.unitedstateszipcodes.org/zip-code-database/. It's available in (E:\McNair\Hubs\summer 2017).

== Data by MSA ==

We have principal cities of MSAs from the Census:
https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html

We might be able to go City -> FIPS place code -> MSA?

Cities and their FIPS codes (which don't perfectly correspond) are available from https://www.census.gov/geo/reference/codes/place.html

The Census claims to provide city to MSA here: https://www.census.gov/geo/maps-data/data/ua_rel_download.html
However, there is only CBSA!

This might do it: https://www2.census.gov/geo/pdfs/maps-data/data/rel/explanation_ua_cbsa_rel_10.pdf
We can maybe track city to principal city to MSA
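
The proposed City -> FIPS place code -> MSA route can be sketched as two chained lookups. The codes and the city below are placeholders, not real Census values; real lookup tables would be built from the Census files linked above.

```python
# Hypothetical lookup tables with placeholder codes (not real Census values).
city_to_place = {("springfield", "XX"): "9912345"}  # (city, state) -> FIPS place code
place_to_msa = {"9912345": "99999"}                 # FIPS place code -> MSA code


def city_to_msa(city, state):
    """Chain the two lookups; return None if either step has no match."""
    place = city_to_place.get((city.strip().lower(), state.strip().upper()))
    if place is None:
        return None
    return place_to_msa.get(place)


print(city_to_msa(" Springfield", "xx"))
print(city_to_msa("Nowhere", "XX"))
```

Returning None at either step makes unmatched cities easy to list and review, which matters because city names and FIPS places do not correspond perfectly.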

==COMPUSTAT Data==

Data is in:
E:\McNair\Projects\Hubs\Summer 2017
Z:\Hubs\2017

Database is '''cities'''

SQL script is: COMPUSTAT.sql

The source file is RandDExpenditures.txt. It contains:
*Data from 1980-2017 (July). All COMPUSTAT.
*427799 records
*Fields include:
**R&D Expenditure
**Address (inc. city, zip, state)

Output file is COMPUSTATSummary.txt. It contains:
*Variables: City, year, No.public firms, sum R&D, sum Sales, sum total assets
*1979-2016
*4440 cities
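
The city-year summary described above can be sketched as a GROUP BY. This is not the contents of COMPUSTAT.sql: the table name, column names, and rows below are hypothetical stand-ins for the source file's fields.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE compustat "
    "(firm TEXT, city TEXT, year INTEGER, rd REAL, sales REAL, assets REAL)"
)
con.executemany(
    "INSERT INTO compustat VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("A", "Houston", 2000, 1.0, 10.0, 100.0),
        ("B", "Houston", 2000, 2.0, 20.0, 200.0),
        ("A", "Houston", 2001, 3.0, 30.0, 300.0),
    ],
)

# One row per city and year: number of public firms and summed financials
summary = con.execute("""
SELECT city, year, COUNT(DISTINCT firm) AS nopublicfirms,
       SUM(rd) AS sum_rd, SUM(sales) AS sum_sales, SUM(assets) AS sum_assets
FROM compustat
GROUP BY city, year
ORDER BY city, year
""").fetchall()
print(summary)
```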

==NSF Data==
Data is in:
E:\McNair\Projects\Hubs\Summer 2017
Z:\Hubs\2017

Database is '''cities'''

SQL script is: nsf_2017.sql

The source files are: nsf2017.txt, copied from table '''nsf''', and nsf_institution copied from table '''nsf_grants_institution''' from the biotech db.

They contain:
*Award ID
*Award Institution
*Award Effective date
*Institution city
*Award Value
*Organization state code
From 1900 - 2017

Output file is nsfSummary.txt. It contains:
*Variables: City, State code, year, nsf_nogrants, nsf_valuegrant
*1900-2017

===Joined NSF table===
The NSF table joined with the VC table is in db '''cities'''. The table is named '''merged_nsf'''.
Missing values of nogrants and valuegrant for the years 1990-2017 are set to 0.
The SQL script is in
Z:\Hubs\2017\sql scripts
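
Setting missing grant values to 0 after the join can be sketched with a LEFT JOIN and COALESCE. This is an illustration, not the project's script: the table layouts, the city_state key format, and the rows are assumptions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE vc (city_state TEXT, year INTEGER)")
con.execute(
    "CREATE TABLE nsf (city_state TEXT, year INTEGER, nogrants INTEGER, valuegrant REAL)"
)
con.executemany("INSERT INTO vc VALUES (?, ?)",
                [("houston_tx", 1990), ("houston_tx", 1991)])
con.execute("INSERT INTO nsf VALUES ('houston_tx', 1990, 2, 500.0)")

# Keep every VC city-year; city-years without grants get 0 instead of NULL
merged = con.execute("""
SELECT v.city_state, v.year,
       COALESCE(n.nogrants, 0) AS nogrants,
       COALESCE(n.valuegrant, 0) AS valuegrant
FROM vc v
LEFT JOIN nsf n ON n.city_state = v.city_state AND n.year = v.year
ORDER BY v.city_state, v.year
""").fetchall()
print(merged)
```

Zero-filling treats "no matching record" as "no grants", which is only appropriate for years the source data actually covers.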

==NIH Data==
Data is in:
Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is '''cities'''
SQL script is: nih2017.sql
The source files are:
*nih_1986_2001.csv
*nih_2002_2012.txt
*nih_2013_2015
located in E:\McNair\Projects\Federal Grant Data\NIH

The script that cleans NIH data and generates the summary table is titled '''nihSummary'''. It is located here:

E:\McNair\Projects\Hubs\Summer 2017

This table includes
*year
*city
*state
*country
*nogrants (number of grants)
*valuegrant
*city_state (the city-state ID that we'll merge on)

*Data from 1986-2015

===Joined NIH table===
The NIH table joined with the VC table is in db '''cities'''. The table is named '''merged_nih'''.
Missing values of nih_valuegrant and nih_nogrants for the years 1986-2015 are set to 0.
The SQL script is in
Z:\Hubs\2017\sql scripts

==Clinical Trials Data==
Data is in:
Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is '''cities'''
SQL script is: ctrials.sql
The source file is:

*medclinical.txt

located in Z:\Hubs\2017

*Data from 1999-2017

===Joined clinical trials table===

The file which contains the number of trials in each city and year is located in:
Z:\Hubs\2017

The file is in:
Z:\Hubs\2017\clean data
The name of the file is:
ctrialsSummary.txt

It contains:
*city
*year
*city_state_year
*noctrials - number of trials

The ctrials table is joined with the VC table.
The SQL script for the join is '''new_ctrials.sql''', located in
Z:\Hubs\2017\sql scripts

The name of the joined table is '''new_merged_ctrials'''.

It contains:
*city
*state
*city_state_id
*city_state_year
*year
*noctrials
*seedamtm
*earlyamtm
*lateramtm
*selamtm
*numseeds
*numearly
*numlater
*numsel

Missing values of noctrials for the years 1999-2017 are set to 0.

==Population Data==
Data is in:
Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is '''cities'''

SQL script is: '''population.sql'''
The source files are:
*pop2000_2009.xlsx
*pop2010_2016.xlsx

They contain:
*State
*City name
*Year
*Population Estimates

Data from 2000-2016

===Joined population table===

Data is in:
Z:\Hubs\2017\clean data
The file names are
1_population.txt - contains data on population estimates from 2000-2009
2_population.txt - contains data on population estimates from 2010-2016

Database is '''cities'''
SQL script is: '''new_population.sql''', located in
Z:\Hubs\2017\sql scripts

The population table is joined with the VC table. The joined table is called '''new_merged_population'''.

It contains:
*City
*State
*city_state_id to uniquely identify each city
*city_state_year to uniquely identify each city in each year
*Population estimates
*Year
*Code from the state code and Fips code
*State full name
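
The city_state_id and city_state_year keys listed above can be sketched as simple string keys. The exact format used in the project's tables is not documented here, so the lower-case, underscore-joined format and the function names below are assumptions.

```python
def city_state_id(city, state):
    """Build a key that identifies a city, e.g. 'new_york_ny' (assumed format)."""
    city_part = "_".join(city.strip().lower().split())
    return f"{city_part}_{state.strip().lower()}"


def city_state_year(city, state, year):
    """Build a key that identifies a city in a given year."""
    return f"{city_state_id(city, state)}_{year}"


print(city_state_id("New York", "NY"))
print(city_state_year("New York", "NY", 2005))
```

Normalizing case and whitespace before building the key avoids failed joins between tables that spell the same city differently.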

==Income Data==

Raw data was obtained from the US Census Bureau's American Community Survey.

Raw Data is in:
E:\McNair\Projects\Hubs\Summer 2017\MSA Income_raw.zip

Data from 2005-2015

The master list with MSAs and principal cities is titled '''list2.xls'''. It is located at:
Z:\Hubs\2017

This master list includes:
*MSA code
*MSA name
*Principal City
*State
*Place code (city code)
*State Code

This master list was edited to associate each principal city with a unique state. For example, New York is the principal city of the New York-New Jersey MSA and was associated with state NY-NJ, so '''list''' was edited to put New York with NY.

Cleaned Income data files are in
Z:\Hubs\2017\merging_on_ID

They contain:
*MSA code
*MSA
*Year
*Total Household Income

The MSA-City-State look up file is titled '''msa_city_state_wcode.txt'''. It is located in
Z:\Hubs\2017\merging_on_ID

The SQL file that merges income data from ACS (by MSA - Year) with the MSA-City file is titled '''income.sql'''. It is located here:
Z:\Hubs\2017\sql scripts

The final income table is in db '''cities''' titled '''merged_income'''.

It includes:
*MSA
*City
*State
*Year
*Total Household Income

The table includes 8780 observations
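
The merge of MSA-year income onto the MSA-City lookup can be sketched as below. This is not the contents of income.sql: the MSA code and income values are placeholders, and the variable names are hypothetical.

```python
# Placeholder MSA code and income values, for illustration only.
msa_city = [("00001", "New York", "NY"), ("00001", "Newark", "NJ")]
msa_income = [("00001", 2005, 1000.0), ("00001", 2006, 1100.0)]

# Index the MSA-year income rows by MSA code
income_by_msa = {}
for msa, year, income in msa_income:
    income_by_msa.setdefault(msa, []).append((year, income))

# Every principal city inherits all year rows of its MSA
merged_income = [
    (msa, city, state, year, income)
    for msa, city, state in msa_city
    for year, income in income_by_msa.get(msa, [])
]
print(merged_income)
```

Note that with this kind of join every principal city in an MSA carries the same MSA-level value, so the figures are not city-specific.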

===Joined income table===

Data is in:
Z:\Hubs\clean data
The file names are:
INC_05.txt - INC_15.txt

Database is '''cities'''
SQL script is: merged_income.sql

They contain:
*City
*State
*city_state_id to uniquely identify each city
*Income
*Year
*Code from the state code and Fips code

==Employment Data==

Data on employment was obtained from the American Community Survey, US Census Bureau.

Raw Data is in:
E:\McNair\Projects\Hubs\Summer 2017\Employment Data by MSA
Cleaned files are in
Z:\Hubs\2017\clean data

They contain:
*MSA code
*MSA
*Year
*Employment rate of individuals 16 years or older
*Unemployment rate of individuals 16 years or older

Data from 2005-2015

The SQL file that merges employment data from ACS (by MSA - Year) with the MSA-City file is titled '''Employment.sql'''.
The file is located in:
Z:\Hubs\2017

The final table is in db '''cities''' titled '''merged_employment'''.

It includes:
*MSA
*City
*Year
*Employment rate
*Unemployment rate

===Joined employment table===

Data is in:
Z:\Hubs\clean data

The file names are:
EMP_05.txt - EMP_15.txt

Database is '''cities'''
SQL script is: '''new_employment.sql''' and it is located in
Z:\Hubs\2017\sql scripts

The final table which is joined on VC is in db cities titled '''new_merged_employment'''.

They contain:
*City
*State
*Code from the state code and Fips code
*city_state_id to uniquely identify each city
*city_state_year to uniquely identify each city in each year
*Employment rates of individuals of 16 years or older
*Unemployment rates of individuals of 16 years or older
*Year

==Schooling Data==

Data on schooling was obtained from the American Community Survey (ACS), US Census Bureau.

Raw Data is in:
E:\McNair\Projects\Hubs\Summer 2017\School Enrollment Data by MSA
Cleaned files are in
Z:\Hubs\2017\clean data

They contain:
*MSA code
*MSA
*Year
*Total number of population 3 years and over enrolled in school
*Percent of population 3 years and over enrolled in public school
*Percent of population 3 years and over enrolled in private school

Data from 2005-2015

The SQL file that merges schooling data from ACS (by MSA - Year) with the MSA-City file is titled '''schooling.sql'''.
The file is located in:
Z:\Hubs\2017

The final table is in db '''cities''' titled '''merged_schooling'''.

It includes:
*MSA
*City
*Year
*Total
*Percent_public_schooling
*Percent_private_schooling

===Joined schooling table===

Data is in:
Z:\Hubs\clean data
The file names are:
SCH_05.txt - SCH_15.txt

Database is '''cities'''
SQL script which joins this table with VC table is: '''new_merged_schooling.sql'''
The final table is in db '''cities''' titled '''new_merged_schooling'''.

It contains:
*City
*State
*city_state_id to uniquely identify each city
*city_state_year to uniquely identify each city in each year
*Total number of school enrollment
*Percentage enrolled in public schools
*Percentage enrolled in private schools
*Year
*Code from the state code and Fips code

==VC Data==

Raw Data is in:
Z:\VentureCapitalData\SDCVCData
The file name is roundcitystateyear.txt

It contains:
*city
*state
*year
*seedamtm - seed, amount in millions
*earlyamtm - early, amount in millions
*lateramtm - late, amount in millions
*selamtm - seed early late, amount in millions
*numseeds - number of seeds
*numearly
*numlater
*numsel

Data from 1953-2017

The table is in db '''cities''' titled '''vc'''.

It includes:
*city
*state
*city_state_id
*city_state_year
*seedamtm
*earlyamtm
*lateramtm
*selamtm
*numseeds
*numearly
*numlater
*numsel
*year

Hubs

2017-07-12T21:17:44Z

KerdaV: /* NIH Data */

{{McNair Projects
|Has title=Hubs
|Has owner=Hira Farooqi,
|Has keywords=Data
|Has project status=Active
}}
The Hubs Research Project is a full-length academic paper analyzing the effectiveness of "hubs", a component of the entrepreneurship ecosystem, in the advancement and growth of entrepreneurial success in a metropolitan area.

This research will primarily focus on large and mid-sized Metropolitan Statistical Areas (MSAs), as that is where the great majority of Venture Capital funding is located.

===Primary Data Set===
The Hubs data set, from SDC Platinum, has been constructed in the server:
Data files are in 128.42.44.181/bulk/Hubs
All files are in 128.42.44.182/bulk/Projects/Ecosystem/Hubs
psql Hubs

The data set includes all United States Venture Capital transactions (moneytree) from the twenty-five year period of 1990 through 2015.
Data has been aggregated at the portfolio company, fund, and round level. It will be analyzed at the combined MSA level. We will look at the number of companies funded, the number of funds active, and the flow of investment in a given MSA.

The data set has now been uploaded to the database server, named Hubs.
There are 4 tables:
*Rounds: Rounddate, coname, state, roundno, stage1, etc.
*CombinedRounds: Coname, rounddate, discamount, fundname
*Companies: LastInv, FirstInv, coname, MSA, MSACode, Address, state, datefounded, totalknownfunding, industry(major)
*Funds: fundname, closingdate, lastinv, firstinv, msa, msacode, avinv, nocoinv, totalknowninv, address

Used variables:

Companies: Coname, MSACode, Industry, state
MSALookupTable: MSACode, MSASuper
IndustryLookupTable: IndustryMajor, InduCode
->
CompanyInfo: Coname, MSASuper, InduCode, state (complete)

Funds: fundname, msacode, state
MSALookupTable: MSACode, MSASuper
->
FundInfo: fundname, msacode, state (complete)

Rounds: coname, rounddate, stagecode, roundno
CombinedRounds: coname, rounddate, discamount, fundname
->
RoundInfoSuper: coname, rounddate, '''nofunds''', discamount
->
RoundInfo: Coname, roundyear, fundname, estamount (complete)

Then take:
RoundInfo: Coname, roundyear, fundname, estamount
CompanyInfo: Coname, MSASuper, InduCode, state
FundInfo: fundname, msacode, state
->
SuperRoundInfo: Coname, CoMSASuper, CoInduCode, CoState, FundName, FundMSASuper, FundState, RoundYear, RoundEstAmount
->
MSAPortCos: Count(CoName) As NoPortCosFunded, CoMSASuper, RoundYear
...
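
The step from RoundInfoSuper (with '''nofunds''') to RoundInfo (with estamount) can be illustrated with a minimal sketch. It assumes, and this is only an assumption, that the disclosed round amount is split equally among the funds in the syndicate and that round dates are ISO strings; the function name is hypothetical and the actual SQL may differ.

```python
from collections import defaultdict


def estimate_round_amounts(combined_rounds):
    """Split each disclosed round amount equally across its funds.

    combined_rounds: list of (coname, rounddate, discamount, fundname).
    Returns a list of (coname, roundyear, fundname, estamount).
    The equal split is an assumption, not a documented rule.
    """
    # Count distinct funds per (company, round date): the nofunds column.
    funds_per_round = defaultdict(set)
    for coname, rounddate, _amount, fundname in combined_rounds:
        funds_per_round[(coname, rounddate)].add(fundname)

    round_info = []
    seen = set()
    for coname, rounddate, discamount, fundname in combined_rounds:
        key = (coname, rounddate, fundname)
        if key in seen:  # skip duplicate fund rows within a round
            continue
        seen.add(key)
        nofunds = len(funds_per_round[(coname, rounddate)])
        roundyear = int(rounddate[:4])  # assumes 'YYYY-MM-DD' dates
        round_info.append((coname, roundyear, fundname, discamount / nofunds))
    return round_info


# Made-up rows for illustration only.
rows = [
    ("Acme", "2001-03-15", 9.0, "Fund A"),
    ("Acme", "2001-03-15", 9.0, "Fund B"),
    ("Acme", "2001-03-15", 9.0, "Fund C"),
    ("Beta", "2002-07-01", 4.0, "Fund A"),
]
result = estimate_round_amounts(rows)
```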

'''Notes on Creation of Primary Data Set'''

Raw tables
* companies (last investment, first investment, company name, MSA, MSA code, address, state, date founded, known funding, industry)
* funds (fund closing date, last investment, first investment, fund name, address, MSA, MSA code, Average investment, number companies invested (NoCos), known investment)
* rounds (round date, company name, state, round number, stage 1, stage 2, stage 3)
* combined rounds (company name, round date, disclosed amount, investor)
* msalist (changes MSAs to CMSAs— combined MSAs)
*industry list (changes 6 industry categories to 4— ICT, Life Sciences, Semiconductors, Other)

Process
*cleaned tables to eliminate duplications, undisclosed variables
*changed all original characters to include CMSA and Industry Codes (companyinfo3, fundinfocleanfinal, roundinfoclean)
*matched funds to avoid any issues with names (i.e. Fund ABC L.P./Fund ABC LP/Fund ABC)
*matched roundinfoclean investors to fundinfocleanfinal investors (roundinfo.txt >> cleanfundfinal.txt)
*join by round and company conames
*bridge years (1990-2016), stage, and cmsa
* populate data with count of companies (Deal flow) and estimated amount ($)
** data set in 181 hubs folder under summarycmsa.txt (38394)
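
The fund-matching step (Fund ABC L.P./Fund ABC LP/Fund ABC) could be approached by normalizing names before joining. The sketch below is one possible approach, not the matcher that was actually used; the suffix list and function name are hypothetical.

```python
import re

# Hypothetical list of legal suffixes to strip; extend as needed.
LEGAL_SUFFIXES = {"lp", "llc", "ltd", "inc"}


def normalize_fund_name(name):
    """Return a lowercase matching key with punctuation and legal suffixes removed."""
    # Drop periods first so that "L.P." collapses to "lp".
    cleaned = name.lower().replace(".", "")
    # Replace any remaining punctuation with spaces.
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", cleaned)
    tokens = cleaned.split()
    # Strip trailing legal suffixes such as "lp".
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


variants = ["Fund ABC L.P.", "Fund ABC LP", "Fund ABC", "Fund ABC, L.P."]
keys = {normalize_fund_name(v) for v in variants}
```

All four variants map to the same key, so a join on the normalized name would treat them as one fund.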

Key decisions:
*Threw out undisclosed co through-out as no address
*Count is done by joining round and company
*Anything fund related must be disclosed fund
*Near and far, and total invested, and fund counts, etc., are all done using disclosed funds that match only

'''Glossary of Tables'''
cleanco — used to remove duplicates from companies
cleanedcompanies — clean set of companies with no duplicates
cmsafunds-
cmsas— list of all CMSAs in final data set (for merging)
cmsastats- statistics not including empty years (pre-merge)
cmsastats2 - statistics separated by year-MSA
cmsastats3— statistics separated by year-MSA-stage
cmsastats4
cmsayears— empty merged table between year and cmsa
cmsayearstage — empty merged table between cmsa/years and stage
combinedrounds— raw sdc data for combined rounds
combinedroundswamt— used to join rounds and combined rounds for roundinfo2
companies- raw SDC company data
companyinfo — cleaned companies joined with state and CMSA information
companyinfo2— companyinfo1 with original industry categories
companyinfo3— companyinfo2 with updated industry categories and codes
companyinfo4-- clean version of companyinfo3
companyround- combined company information with round information
companyround2- combined company information with round information, cleaned up from companyround
companyround3- combined company information with round information, cleaned up from companyround2
'''finaldataset'''- final statistics by CMSA-year, see section Final Primary Data Set for more information
fundinfo— funds joined with CMSA info
fundinfo2 - clean version of fundinfo1
fundinfoclean - used in process to clean fundinfo2
fundinfoclean2- used in process to clean fundinfo2
fundinfocleanfinal- used in process to clean fundinfo2
fundinfocleannodups- final clean set of fundinfo
funds - raw SDC fund data
Houston - analysis for Houston ecosystem team
Houston2- analysis for Houston ecosystem team
houston3- analysis for Houston ecosystem team
industry — new industry codes (4)— used for all future data sets
industrylist— lookup table for new industry codes (went from 6 to 4)
joined1- used for matching process
joined2- used for matching process
matchfund2- used for matching process
matchfunds- used for matching process
matchroundfund - used for matching process
matchroundfund2- used for matching process
msalist — lookup table for MSA to CMSA (used for all future data sets)
nearfar1-- beginning set before adding nearfar/stage variables
nearfar2 -- added binary variables for near/far and for each of the stages, used to build final dataset
roundfund— not used— joined round to fund; drop/ignore
roundinfo— round info cleaned up to include number of investors in a syndicate and estimated investment per member of syndicate
roundinfo2— roundinfo1 including name of investors/funds
roundinfo3— clean version of roundinfo2
roundinfoclean — final clean version of roundinfo3 (final roundinfo table)
rounds — raw SDC round data
stages — table for merging stage-year-CMSA
superinfo — ignore/drop
temp - used for matching process
years — table for merging stage-year-CMSA
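
The empty merge tables in the glossary (cmsayears, cmsayearstage) serve to bridge years, stages, and CMSAs so that combinations with no deals still appear in the output. A minimal sketch of that idea follows; the function name and the zero defaults are assumptions.

```python
from itertools import product


def build_panel(cmsas, years, stages, stats):
    """Cross CMSAs, years, and stages, then attach statistics where they exist.

    stats maps (cmsa, year, stage) -> (deals, amount). Combinations
    without an entry get zeros, which is an assumption in this sketch.
    """
    panel = []
    for cmsa, year, stage in product(cmsas, years, stages):
        deals, amount = stats.get((cmsa, year, stage), (0, 0.0))
        panel.append(
            {"cmsa": cmsa, "year": year, "stage": stage,
             "deals": deals, "amount": amount}
        )
    return panel


# Made-up inputs for illustration only.
panel = build_panel(
    cmsas=[1, 2],
    years=range(1990, 1993),
    stages=["seed", "early"],
    stats={(1, 1991, "seed"): (3, 4.5)},
)
```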

===Hub Candidates Data Set===

The Hubs candidate data set is a list of potential hubs found in MSAs throughout the country. Researchers are currently pulling qualitative and quantitative information from the candidates' websites in an attempt to categorize what can be identified as a hub. This is a difficult data set to pull: there is little to no quantitative information available for this category of institution, and the data depend on what information is publicly accessible on the internet.

Characteristics/Variables
*Year Founded
*Square footage
*LinkedIn self-identifiers (how the organization classifies itself on its LinkedIn profile)
*Activeness on Twitter (binary)
*Member Directory available online (binary)
*Number of conference rooms
*Price ($/month) for Flex desk
*Offers Reserved desk (binary)
*Offers office space for rent (binary)
*Offers community membership-- not for coworking but for community events, etc. (binary)
*Number of events offered per month (estimate)
*Offers code academy
*Mission Statement/Vision (for qualitative or key-word analysis)

These characteristics/variables will be used to determine whether a candidate is or is not likely to be a Hub.

As of March 10th 2016, the list contains 125 Hub candidates.

'''Where to find''': The Hubs data set can be found in the Ecosystem>>Hubs>>dataset folder. It is not currently in the database due to a UTF-8 issue.

===Supplementary Data Sets===
'''Patent data''': to be pulled from USPTO or SDC Platinum.

'''Number of STEM Graduate Students''' (NSF) and '''University R&D Spending''' (NSF):
*University R&D Data found under file "NSF DATA_2004 to 2011.xlsx" in datasets folder (Ecosystem>>Hubs>>Datasets)
*R&D spending found at the university level for 2014 ("Stem Grad Students.xlsx") or at state level ("Science and Engineering Grad Students by State and Year 2000-2011.csv")
** not uploaded to server or matched yet to CMSA code, because of this discrepancy.
**"Stem Grad Students.xlsx" contains categorized university by MSA, can be used for all university-based projects

'''Per Capita Income''' and '''Employment Data''' (US Census Bureau):
*"Per Capita Personal Income by MSA 2000-2012.xlsx" in datasets folder (Ecosystem>>Hubs>>Datasets>>Data from Yael)
*"Wages and Salaries by MSA 2000-2012.xlsx" in datasets folder (Ecosystem>>Hubs>>datasets>>Data from Yael)
**not uploaded to server or matched yet to CMSA code

'''Firm Births''' (BDS)
*in server 181, under table name "BDS"
*includes birth, death, net(birth-death) and rate(death rate) for years 1990-2013 for every msa
*includes code for CMSA but is not aggregated by CMSA
** i.e. BDS statistics are still separate for all the smaller MSAs in New York's CMSA (code=1)

===Resources===
* Yael Hochberg and Fehder (2015), located in dropbox
** Use this paper as a guideline on how to conduct the analysis
*US Census Bureau data on employment by MSA: http://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_14_5YR_B23027&prodType=table
*USPTO utility patents by MSA: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/cls_cbsa/allcbsa_gd.htm
*MSA level trends: http://www.metrotrends.org/data.cf

===The Target Dataset===

We will need to process the following variables:
*SuperMSA - combine SanFran and SanJose, New York and Newark?, NC Research triangle, others?
*CSV mapping msas to cmsas is in the folder (and a table in the dbase)

Example dataset:
MSA Year SeedVCInv SeedEarlyVCInv LaterVCInv NoDeals FundsInvested DistinctInvestors ....
----------------------------------------------------------------------------------------------------------------------------
1234 2001 1000000 20000000 30000000 4 7 7

Note that the unit of observation is MSA-Year.

Variables to be computed at the MSA level:
*HubActive (binary)
*NoHubsActive (Count)
*HubSqFt
*Other Hub Vars (build list!!!)
*'''SeedVCInv''' (Seed/Start-up)
*'''EarlyVCInv''' (Early Stage)
*'''LaterStageVC''' (Later)
*'''OtherStageVC''' (Buyout/Acq, Other)
*'''NoDeals''' (done by local VCs?)
**'''NoDealsNear'''
**'''NoDealsFar'''
*NoPortCosFunded
*'''FundsInv''' (in an MSA)
**'''FundsInvFromNear''' (within MSA?)
**'''FundsInvFromFar''' (outside MSA?)
*DistinctInvestors (?)
**DistinctInvestorsNear (within MSA?)
**DistinctInvestorsFar (outside MSA?)
*PatentCount
*NoSTEMGrads
*FirmBirths (BDS data)
*UniRandDSpend
*PerCapitaIncome
*Employment

We need to:
*Check funds invested means dollars invested
*Categorize near and far! Is it within MSA vs. not, within adjacent MSAs, etc.?

There may be a second dataset that has Hub-Industry-Year (where industry is semiconductor/non-semiconductor?).

===Final Primary Data Set===

*Deal is a round syndicate (near/far deal is one investor that is near/far).

Table name: finaldataset
cmsa
year
totalamountinv--total amount invested
nearamountinv--amount invested from local funds
faramountinv-- amount invested from funds outside CMSA
earlyinv--amount invested in early stage companies
laterinv--amount invested in later stage companies
startupseedinv--amount invested in seed or startup stage companies
otherstageinv--amount invested in Acquisition/Buy-outs/Other stage companies
investingfund--distinct funds that are investing in that CMSA-year
investingfundnear--distinct funds from that CMSA that invested in that CMSA-year
investingfundfar--distinct funds from outside that CMSA that invested in that CMSA-year
deals--number of deals
neardeals--number of deals inside a CMSA
fardeals--number of deals from outside a CMSA --some of these deals might count in both categories, because of syndicate members being both inside and outside the CMSA
earlystagedeals--deals with earlystage companies
laterstagedeals--deals with later stage companies
startupseeddeals--deals with startup/seed companies
otherstagedeals--deals with companies in other stages
newportcosfunded--number of portfolio companies to receive their first investment in that year
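
The near/far deal counts above can be sketched as follows. The sketch assumes one input row per fund per round and treats a deal as near (far) if at least one syndicate member is inside (outside) the company's CMSA, so a mixed syndicate counts in both categories. Field names are hypothetical.

```python
def count_deals(fund_rounds):
    """Count deals, near deals, and far deals per (CMSA, year).

    fund_rounds: iterable of dicts with keys coname, rounddate,
    round_year, co_cmsa, fund_cmsa (one row per fund per round).
    """
    # Collapse fund-level rows to one record of flags per deal.
    deals = {}
    for row in fund_rounds:
        key = (row["coname"], row["rounddate"], row["round_year"], row["co_cmsa"])
        flags = deals.setdefault(key, {"near": False, "far": False})
        if row["fund_cmsa"] == row["co_cmsa"]:
            flags["near"] = True
        else:
            flags["far"] = True

    # Aggregate deals to the CMSA-year level.
    summary = {}
    for (_coname, _date, year, cmsa), flags in deals.items():
        cell = summary.setdefault(
            (cmsa, year), {"deals": 0, "neardeals": 0, "fardeals": 0}
        )
        cell["deals"] += 1
        cell["neardeals"] += int(flags["near"])
        cell["fardeals"] += int(flags["far"])
    return summary


# Made-up rows: Acme has a mixed syndicate, Beta has a local one.
rows = [
    {"coname": "Acme", "rounddate": "2001-03-15", "round_year": 2001, "co_cmsa": 1, "fund_cmsa": 1},
    {"coname": "Acme", "rounddate": "2001-03-15", "round_year": 2001, "co_cmsa": 1, "fund_cmsa": 2},
    {"coname": "Beta", "rounddate": "2001-06-01", "round_year": 2001, "co_cmsa": 1, "fund_cmsa": 1},
]
summary = count_deals(rows)
```

Because the Acme round counts as both near and far, neardeals plus fardeals can exceed deals.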

===Data by zip code===
*Population data, 2000-2016 - US Census Bureau (E:\McNair\Hubs\summer 2017)
https://www2.census.gov/programs-surveys/popest/datasets/
*Income data, 1998-2014 - The Internal Revenue Service (E:\McNair\Hubs\summer 2017)
https://www.irs.gov/uac/about-irs
*DCI index, to assess the economic well-being of communities
http://eig.org/dci/interactive-maps/u-s-zip-codes
*R&D Expenses, 1980-2016 - Wharton Research Data Services (E:\McNair\Hubs\summer 2017)
*Zipcode look-up table obtained from https://www.unitedstateszipcodes.org/zip-code-database/. It's available in (E:\McNair\Hubs\summer 2017).

== Data by MSA ==

We have principal cities of MSAs from the census:
https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html

We might be able to go City -> FIPS place code -> MSA?

Cities and their FIPS codes (which don't perfectly correspond) are available from https://www.census.gov/geo/reference/codes/place.html

The Census claims to provide city to MSA here: https://www.census.gov/geo/maps-data/data/ua_rel_download.html
However, there is only CBSA!

This might do it: https://www2.census.gov/geo/pdfs/maps-data/data/rel/explanation_ua_cbsa_rel_10.pdf
We can maybe track city to principal city to MSA

==COMPUSTAT Data==

Data is in:
E:\McNair\Projects\Hubs\Summer 2017
Z:\Hubs\2017

Database is '''cities'''

SQL script is: COMPUSTAT.sql

The source file is RandDExpenditures.txt. It contains:
*Data from 1980-2017 (July). All COMPUSTAT.
*427799 records
*Fields include:
**R&D Expenditure
**Address (inc. city, zip, state)

Output file is COMPUSTATSummary.txt. It contains:
*Variables: City, year, No.public firms, sum R&D, sum Sales, sum total assets
*1979-2016
*4440 cities
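
The aggregation from firm-level records to the city-year summary can be sketched as below. This is not the contents of COMPUSTAT.sql; the field names are hypothetical, and treating missing values as zero in the sums is an assumption.

```python
from collections import defaultdict


def summarize_by_city_year(records):
    """Aggregate firm-year records to city-year totals.

    records: iterable of dicts with keys firm_id, city, year, rd,
    sales, assets. A value of None is treated as 0 (an assumption).
    """
    groups = defaultdict(
        lambda: {"firms": set(), "rd": 0.0, "sales": 0.0, "assets": 0.0}
    )
    for rec in records:
        group = groups[(rec["city"], rec["year"])]
        group["firms"].add(rec["firm_id"])
        for field in ("rd", "sales", "assets"):
            group[field] += rec[field] or 0.0

    return {
        key: {
            "nopublicfirms": len(g["firms"]),
            "sum_rd": g["rd"],
            "sum_sales": g["sales"],
            "sum_assets": g["assets"],
        }
        for key, g in groups.items()
    }


# Made-up records for illustration only.
records = [
    {"firm_id": "A", "city": "Houston", "year": 2000, "rd": 1.5, "sales": 10.0, "assets": 20.0},
    {"firm_id": "B", "city": "Houston", "year": 2000, "rd": None, "sales": 5.0, "assets": 7.0},
    {"firm_id": "A", "city": "Houston", "year": 2001, "rd": 2.0, "sales": 11.0, "assets": 21.0},
]
summary = summarize_by_city_year(records)
```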

==NSF Data==
Data is in:
E:\McNair\Projects\Hubs\Summer 2017
Z:\Hubs\2017

Database is '''cities'''

SQL script is: nsf_2017.sql

The source files are: nsf2017.txt, copied from table '''nsf''', and nsf_institution copied from table '''nsf_grants_institution''' from the biotech db.

They contain:
*Award ID
*Award Institution
*Award Effective date
*Institution city
*Award Value
*Organization state code
From 1900 - 2017

Output file is nsfSummary.txt. It contains:
*Variables: City, State code, year, nsf_nogrants, nsf_valuegrant
*1900-2017

===Joined NSF table===
The NSF table joined with the VC table is in db '''cities'''. The table is named '''merged_nsf'''.
Missing values of nogrants and valuegrant for years 1990-2017 are set equal to 0.
The SQL script is in
Z:\Hubs\2017\sql scripts

==NIH Data==
Data is in:
Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is '''cities'''
SQL script is: nih2017.sql
The source files are:
*nih_1986_2001.csv
*nih_2002_2012.txt
*nih_2013_2015
located in E:\McNair\Projects\Federal Grant Data\NIH

The script that cleans NIH data and generates the summary table is titled '''nihSummary'''. It is located here:

E:\McNair\Projects\Hubs\Summer 2017

This table includes
*year
*city
*state
*country
*nogrants (number of grants)
*valuegrant
*city_state (the city-state ID that we'll merge on)

*Data from 1986-2015

===Joined NIH table===
The NIH table joined with the VC table is in db '''cities'''. The table is named '''merged_nih'''.
Missing values of nih_valuegrant and nih_nogrants for years 1986-2015 are set equal to 0.
The SQL script is in
Z:\Hubs\2017\sql scripts

==Clinical Trials Data==
Data is in:
Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is '''cities'''
SQL script is: ctrials.sql
The source file is:

*medclinical.txt

located in Z:\Hubs\2017

*Data from 1999-2017

===Joined clinical trials table===

The file which contains the number of trials in each city and year is located in:
Z:\Hubs\2017

The file is in:
Z:\Hubs\2017\clean data
The name of the file is:
ctrialsSummary.txt

It contains:
*city
*year
*city_state_year
*noctrials - number of trials

The ctrials table is joined with the VC table.
The joined SQL script is: '''new_ctrials.sql''' and it is located in
Z:\Hubs\2017\sql scripts

The name of the joined table is '''new_merged_ctrials'''.

It contains:
*city
*state
*city_state_id
*city_state_year
*year
*noctrials
*seedamtm
*earlyamtm
*lateramtm
*selamtm
*numseeds
*numearly
*numlater
*numsel

All the values of noctrials with missing values for years 1999-2017 are set equal to 0.
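
The zero-fill rule can be sketched as a left join on city_state_year. This is a sketch, not the contents of new_ctrials.sql; the function name is hypothetical, and leaving years outside the coverage window as missing is an assumption.

```python
def left_join_zero_fill(vc_rows, summary_rows, fill_columns, first_year, last_year):
    """Left join vc_rows to summary_rows on city_state_year.

    Unmatched rows inside [first_year, last_year] get 0 in each fill
    column. Unmatched rows outside that window are left as None.
    """
    by_key = {row["city_state_year"]: row for row in summary_rows}
    merged = []
    for vc in vc_rows:
        out = dict(vc)
        match = by_key.get(vc["city_state_year"])
        in_coverage = first_year <= vc["year"] <= last_year
        for col in fill_columns:
            if match is not None:
                out[col] = match[col]
            elif in_coverage:
                out[col] = 0  # covered year with no record: zero trials
            else:
                out[col] = None  # source does not cover this year
        merged.append(out)
    return merged


# Made-up rows for illustration only.
vc = [
    {"city_state_year": "HOUSTON_TX_2005", "year": 2005, "numsel": 4},
    {"city_state_year": "HOUSTON_TX_2006", "year": 2006, "numsel": 2},
    {"city_state_year": "HOUSTON_TX_1995", "year": 1995, "numsel": 1},
]
trials = [{"city_state_year": "HOUSTON_TX_2005", "noctrials": 7}]
merged = left_join_zero_fill(vc, trials, ["noctrials"], 1999, 2017)
```

The same pattern would apply to the NSF and NIH joins with their own columns and year windows.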

==Population Data==
Data is in:
Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is '''cities'''

SQL script is: '''population.sql'''
The source files are:
*pop2000_2009.xlsx
*pop2010_2016.xlsx

They contain:
*State
*City name
*Year
*Population Estimates

Data from 2000-2016

===Joined population table===

Data is in:
Z:\Hubs\2017\clean data
The file names are
1_population.txt - contains data on population estimates from 2000-2009
2_population.txt - contains data on population estimates from 2010-2016

Database is '''cities'''
SQL script is: '''new_population.sql''', located in
Z:\Hubs\2017\sql scripts

The population table is joined with the VC table. The joined table is called '''new_merged_population'''.

It contains:
*City
*State
*city_state_id to uniquely identify each city
*city_state_year to uniquely identify each city in each year
*Population estimates
*Year
*Code from the state code and Fips code
*State full name
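
The two identifiers used across the joined tables can be built from city, state, and year. The exact format of city_state_id and city_state_year is not documented on this page, so the format below (uppercase, underscore-separated) is purely an assumption for illustration.

```python
def make_city_state_id(city, state):
    """Build a hypothetical city-state key, e.g. 'NEW YORK_NY'."""
    # Collapse repeated whitespace and normalize case so that
    # ' New  York ' and 'NEW YORK' produce the same key.
    city_clean = " ".join(city.split()).upper()
    return f"{city_clean}_{state.strip().upper()}"


def make_city_state_year(city, state, year):
    """Build a hypothetical city-state-year key, e.g. 'NEW YORK_NY_2005'."""
    return f"{make_city_state_id(city, state)}_{int(year)}"


city_id = make_city_state_id(" New  York ", "ny")
city_year_id = make_city_state_year("New York", "NY", "2005")
```

Normalizing before building the key matters because the source files come from different providers and may differ in spacing and capitalization.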

==Income Data==

Raw data was obtained from the American Community Survey (ACS), US Census Bureau.

Raw Data is in:
E:\McNair\Projects\Hubs\Summer 2017\MSA Income_raw.zip

Data from 2005-2015

The master list with MSAs and principal cities is titled '''list2.xls'''. It is located at:
Z:\Hubs\2017

This master list includes:
*MSA code
*MSA name
*Principal City
*State
*Place code (city code)
*State Code

This master list was edited so that each principal city is associated with a single state. For example, New York, the principal city of the New York-New Jersey MSA, was originally associated with state NY-NJ, so '''list''' was edited to put New York with NY.

Cleaned Income data files are in
Z:\Hubs\2017\merging_on_ID

They contain:
*MSA code
*MSA
*Year
*Total Household Income

The MSA-City-State look up file is titled '''msa_city_state_wcode.txt'''. It is located in
Z:\Hubs\2017\merging_on_ID

The SQL file that merges income data from ACS (by MSA - Year) with the MSA-City file is titled '''income.sql'''. It is located here:
Z:\Hubs\2017\sql scripts

The final income table is in db '''cities''' titled '''merged_income'''.

It includes:
*MSA
*City
*State
*Year
*Total Household Income

The table includes 8780 observations
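
The merge of MSA-year income with the MSA-city lookup can be sketched as below. It assumes an inner join on MSA code, so each MSA-year row is repeated once per principal city and MSAs missing from the lookup are dropped. This is not the contents of income.sql, and the codes and names in the example are made up.

```python
def merge_income(income_rows, lookup_rows):
    """Attach MSA-year income to every principal city of that MSA.

    income_rows: list of (msa_code, msa, year, income).
    lookup_rows: list of (msa_code, city, state).
    """
    # Index principal cities by MSA code; one MSA can have several.
    cities_by_msa = {}
    for msa_code, city, state in lookup_rows:
        cities_by_msa.setdefault(msa_code, []).append((city, state))

    merged = []
    for msa_code, msa, year, income in income_rows:
        for city, state in cities_by_msa.get(msa_code, []):
            merged.append(
                {"msa": msa, "city": city, "state": state,
                 "year": year, "income": income}
            )
    return merged


# Hypothetical codes and names for illustration only.
income = [("0001", "Example MSA", 2005, 100.0), ("0002", "Other MSA", 2005, 50.0)]
lookup = [("0001", "City A", "TX"), ("0001", "City B", "TX")]
merged = merge_income(income, lookup)
```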

===Joined income table===

Data is in:
Z:\Hubs\clean data
The file names are:
INC_05.txt - INC_15.txt

Database is '''cities'''
SQL script is: merged_income.sql

They contain:
*City
*State
*city_state_id to uniquely identify each city
*Income
*Year
*Code from the state code and Fips code

==Employment Data==

Data on employment was obtained from the American Community Survey (ACS), US Census Bureau.

Raw Data is in:
E:\McNair\Projects\Hubs\Summer 2017\Employment Data by MSA
Cleaned files are in
Z:\Hubs\2017\clean data

They contain:
*MSA code
*MSA
*Year
*Employment rate of individuals 16 years or older
*Unemployment rate of individuals 16 years or older

Data from 2005-2015

The SQL file that merges employment data from ACS (by MSA - Year) with the MSA-City file is titled '''Employment.sql'''.
The file is located in:
Z:\Hubs\2017

The final table is in db '''cities''' titled '''merged_employment'''.

It includes:
*MSA
*City
*Year
*Employment rate
*Unemployment rate

===Joined employment table===

Data is in:
Z:\Hubs\clean data

The file names are:
EMP_05.txt - EMP_15.txt

Database is '''cities'''
SQL script is: '''new_employment.sql''' and it is located in
Z:\Hubs\2017\sql scripts

The final table which is joined on VC is in db cities titled '''new_merged_employment'''.

They contain:
*City
*State
*Code from the state code and Fips code
*city_state_id to uniquely identify each city
*city_state_year to uniquely identify each city in each year
*Employment rates of individuals of 16 years or older
*Unemployment rates of individuals of 16 years or older
*Year

==Schooling Data==

Data on schooling was obtained from the American Community Survey (ACS), US Census Bureau.

Raw Data is in:
E:\McNair\Projects\Hubs\Summer 2017\School Enrollment Data by MSA
Cleaned files are in
Z:\Hubs\2017\clean data

They contain:
*MSA code
*MSA
*Year
*Total number of population 3 years and over enrolled in school
*Percent of population 3 years and over enrolled in public school
*Percent of population 3 years and over enrolled in private school

Data from 2005-2015

The SQL file that merges schooling data from ACS (by MSA - Year) with the MSA-City file is titled '''schooling.sql'''.
The file is located in:
Z:\Hubs\2017

The final table is in db '''cities''' titled '''merged_schooling'''.

It includes:
*MSA
*City
*Year
*Total
*Percent_public_schooling
*Percent_private_schooling

===Joined schooling table===

Data is in:
Z:\Hubs\clean data
The file names are:
SCH_05.txt - SCH_15.txt

Database is '''cities'''
SQL script which joins this table with VC table is: '''new_merged_schooling.sql'''
The final table is in db '''cities''' titled '''new_merged_schooling'''.

It contains:
*City
*State
*city_state_id to uniquely identify each city
*city_state_year to uniquely identify each city in each year
*Total number of school enrollment
*Percentage enrolled in public schools
*Percentage enrolled in private schools
*Year
*Code from the state code and Fips code

==VC Data==

Raw Data is in:
Z:\VentureCapitalData\SDCVCData
The file name is roundcitystateyear.txt

It contains:
*city
*state
*year
*seedamtm - seed, amount in millions
*earlyamtm - early, amount in millions
*lateramtm - late, amount in millions
*selamtm - seed early late, amount in millions
*numseeds - number of seeds
*numearly
*numlater
*numsel

Data from 1953-2017

The table is in db '''cities''' titled '''vc'''.

It includes:
*city
*state
*city_state_id
*city_state_year
*seedamtm
*earlyamtm
*lateramtm
*selamtm
*numseeds
*numearly
*numlater
*numsel
*year

Hubs

2017-07-12T21:16:11Z

KerdaV: /* Joined schooling data */

{{McNair Projects
|Has title=Hubs
|Has owner=Hira Farooqi,
|Has keywords=Data
|Has project status=Active
}}
The Hubs Research Project is a full-length academic paper analyzing the effectiveness of "hubs", a component of the entrepreneurship ecosystem, in the advancement and growth of entrepreneurial success in a metropolitan area.

This research will primarily focus on large and mid-sized Metropolitan Statistical Areas (MSAs), as that is where the great majority of Venture Capital funding is located.

===Primary Data Set===
The Hubs data set, from SDC Platinum, has been constructed in the server:
Data files are in 128.42.44.181/bulk/Hubs
All files are in 128.42.44.182/bulk/Projects/Ecosystem/Hubs
psql Hubs

The data set includes all United States Venture Capital transactions (moneytree) from the twenty-five year period of 1990 through 2015.
Data has been aggregated at the portfolio company, fund, and round level. It will be analyzed at the combined MSA level. We will look at the number of companies funded, the number of funds active, and the flow of investment in a given MSA.

The data set has now been uploaded to the database server, named Hubs.
There are 4 tables:
*Rounds: Rounddate, coname, state, roundno, stage1, etc.
*CombinedRounds: Coname, rounddate, discamount, fundname
*Companies: LastInv, FirstInv, coname, MSA, MSACode, Address, state, datefounded, totalknownfunding, industry(major)
*Funds: fundname, closingdate, lastinv, firstinv, msa, msacode, avinv, nocoinv, totalknowninv, address

Used variables:

Companies: Coname, MSACode, Industry, state
MSALookupTable: MSACode, MSASuper
IndustryLookupTable: IndustryMajor, InduCode
->
CompanyInfo: Coname, MSASuper, InduCode, state (complete)

Funds: fundname, msacode, state
MSALookupTable: MSACode, MSASuper
->
FundInfo: fundname, msacode, state (complete)

Rounds: coname, rounddate, stagecode, roundno
CombinedRounds: coname, rounddate, discamount, fundname
->
RoundInfoSuper: coname, rounddate, '''nofunds''', discamount
->
RoundInfo: Coname, roundyear, fundname, estamount (complete)

Then take:
RoundInfo: Coname, roundyear, fundname, estamount
CompanyInfo: Coname, MSASuper, InduCode, state
FundInfo: fundname, msacode, state
->
SuperRoundInfo: Coname, CoMSASuper, CoInduCode, CoState, FundName, FundMSASuper, FundState, RoundYear, RoundEstAmount
->
MSAPortCos: Count(CoName) As NoPortCosFunded, CoMSASuper, RoundYear
...

'''Notes on Creation of Primary Data Set'''

Raw tables
* companies (last investment, first investment, company name, MSA, MSA code, address, state, date founded, known funding, industry)
* funds (fund closing date, last investment, first investment, fund name, address, MSA, MSA code, Average investment, number companies invested (NoCos), known investment)
* rounds (round date, company name, state, round number, stage 1, stage 2, stage 3)
* combined rounds (company name, round date, disclosed amount, investor)
* msalist (changes MSAs to CMSAs— combined MSAs)
*industry list (changes 6 industry categories to 4— ICT, Life Sciences, Semiconductors, Other)

Process
*cleaned tables to eliminate duplications, undisclosed variables
*changed all original characters to include CMSA and Industry Codes (companyinfo3, fundinfocleanfinal, roundinfoclean)
*matched funds to avoid any issues with names (i.e. Fund ABC L.P./Fund ABC LP/Fund ABC)
*matched roundinfoclean investors to fundinfocleanfinal investors (roundinfo.txt >> cleanfundfinal.txt)
*join by round and company conames
*bridge years (1990-2016), stage, and cmsa
* populate data with count of companies (Deal flow) and estimated amount ($)
** data set in 181 hubs folder under summarycmsa.txt (38394)

Key decisions:
*Threw out undisclosed co through-out as no address
*Count is done by joining round and company
*Anything fund related must be disclosed fund
*Near and far, and total invested, and fund counts, etc., are all done using disclosed funds that match only

'''Glossary of Tables'''
cleanco — used to remove duplicates from companies
cleanedcompanies — clean set of companies with no duplicates
cmsafunds-
cmsas— list of all CMSAs in final data set (for merging)
cmsastats- statistics not including empty years (pre-merge)
cmsastats2 - statistics separated by year-MSA
cmsastats3— statistics separated by year-MSA-stage
cmsastats4
cmsayears— empty merged table between year and cmsa
cmsayearstage — empty merged table between cmsa/years and stage
combinedrounds— raw sdc data for combined rounds
combinedroundswamt— used to join rounds and combined rounds for roundinfo2
companies- raw SDC company data
companyinfo — cleaned companies joined with state and CMSA information
companyinfo2— companyinfo1 with original industry categories
companyinfo3— companyinfo2 with updated industry categories and codes
companyinfo4-- clean version of companyinfo3
companyround- combined company information with round information
companyround2- combined company information with round information, cleaned up from companyround
companyround3- combined company information with round information, cleaned up from companyround2
'''finaldataset'''- final statistics by CMSA-year, see section Final Primary Data Set for more information
fundinfo— funds joined with CMSA info
fundinfo2 - clean version of fundinfo1
fundinfoclean - used in process to clean fundinfo2
fundinfoclean2- used in process to clean fundinfo2
fundinfocleanfinal- used in process to clean fundinfo2
fundinfocleannodups- final clean set of fundinfo
funds - raw SDC fund data
Houston - analysis for Houston ecosystem team
Houston2- analysis for Houston ecosystem team
houston3- analysis for Houston ecosystem team
industry — new industry codes (4)— used for all future data sets
industrylist— lookup table for new industry codes (went from 6 to 4)
joined1- used for matching process
joined2- used for matching process
matchfund2- used for matching process
matchfunds- used for matching process
matchroundfund - used for matching process
matchroundfund2- used for matching process
msalist — lookup table for MSA to CMSA (used for all future data sets)
nearfar1-- beginning set before adding nearfar/stage variables
nearfar2 -- added binary variables for near/far and for each of the stages, used to build final dataset
roundfund— not used— joined round to fund; drop/ignore
roundinfo— round info cleaned up to include number of investors in a syndicate and estimated investment per member of syndicate
roundinfo2— roundinfo1 including name of investors/funds
roundinfo3— clean version of roundinfo2
roundinfoclean — final clean version of roundinfo3 (final roundinfo table)
rounds — raw SDC round data
stages — table for merging stage-year-CMSA
superinfo — ignore/drop
temp - used for matching process
years — table for merging stage-year-CMSA

===Hub Candidates Data Set===

The Hub candidates data set is a list of potential hubs found in MSAs throughout the country. Researchers are currently pulling qualitative and quantitative information from the candidates' websites in an attempt to categorize what can be identified as a hub. This data set is difficult to build: little to no quantitative information is available for this category of institution, so it depends on what information is publicly accessible on the internet.

Characteristics/Variables
*Year Founded
*Square footage
*LinkedIn self-identifiers (how the organization classifies itself on its LinkedIn profile)
*Activeness on Twitter (binomial)
*Member Directory available online (binomial)
*Number of conference rooms
*Price ($/month) for Flex desk
*Offers Reserved desk (binomial)
*Offers office space for rent (binomial)
*Offers community membership-- not for coworking but for community events, etc. (binomial)
*Number of events offered per month (estimate)
*Offers code academy
*Mission Statement/Vision (for qualitative or key-word analysis)

These characteristics/variables will be used to determine whether a candidate is or is not likely to be a Hub.

As of March 10th 2016, the list contains 125 Hub candidates.

'''Where to find''': The Hubs data set can be found in the Ecosystem>>Hubs>>dataset folder. It is not currently in the database due to a UTF8 issue.

===Supplementary Data Sets===
'''Patent data''': to be pulled from USPTO or SDC Platinum.

'''Number of STEM Graduate Students''' (NSF) and '''University R&D Spending''' (NSF):
*University R&D Data found under file "NSF DATA_2004 to 2011.xlsx" in datasets folder (Ecosystem>>Hubs>>Datasets)
*R&D spending found at the university level for 2014 ("Stem Grad Students.xlsx") or at state level ("Science and Engineering Grad Students by State and Year 2000-2011.csv")
** not uploaded to server or matched yet to CMSA code, because of this discrepancy.
**"Stem Grad Students.xlsx" contains categorized university by MSA, can be used for all university-based projects

'''Per Capita Income''' and '''Employment Data''' (US Census Bureau):
*"Per Capita Personal Income by MSA 2000-2012.xlsx" in datasets folder (Ecosystem>>Hubs>>Datasets>>Data from Yael)
*"Wages and Salaries by MSA 2000-2012.xlsx" in datasets folder (Ecosystem>>Hubs>>datasets>>Data from Yael)
**not uploaded to server or matched yet to CMSA code

'''Firm Births''' (BDS)
*in server 181, under table name "BDS"
*includes birth, death, net(birth-death) and rate(death rate) for years 1990-2013 for every msa
*includes code for CMSA but is not aggregated by CMSA
** i.e. BDS statistics are still separate for all the smaller MSAs in New York's CMSA (code=1)

===Resources===
* Yael Hochberg and Fehder (2015), located in dropbox
** Use this paper as a guideline on how to conduct the analysis
*US Census Bureau data on employment by MSA: http://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_14_5YR_B23027&prodType=table
*USPTO utility patents by MSA: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/cls_cbsa/allcbsa_gd.htm
*MSA level trends: http://www.metrotrends.org/data.cf

===The Target Dataset===

We will need to process the following variables:
*SuperMSA - combine SanFran and SanJose, New York and Newark?, NC Research triangle, others?
*CSV mapping msas to cmsas is in the folder (and a table in the dbase)

Example dataset:
{| class="wikitable"
! MSA !! Year !! SeedVCInv !! SeedEarlyVCInv !! LaterVCInv !! NoDeals !! FundsInvested !! DistinctInvestors !! ....
|-
| 1234 || 2001 || 1000000 || 20000000 || 30000000 || 4 || 7 || 7 ||
|}

Note that the unit of observation is MSA-Year.

Variables to be computed at the MSA level:
*HubActive (binary)
*NoHubsActive (Count)
*HubSqFt
*Other Hub Vars (build list!!!)
*'''SeedVCInv''' (Seed/Start-up)
*'''EarlyVCInv''' (Early Stage)
*'''LaterStageVC''' (Later)
*'''OtherStageVC''' (Buyout/Acq, Other)
*'''NoDeals''' (done by local VCs?)
**'''NoDealsNear'''
**'''NoDealsFar'''
*NoPortCosFunded
*'''FundsInv''' (in an MSA)
**'''FundsInvFromNear''' (within MSA?)
**'''FundsInvFromFar''' (outside MSA?)
*DistinctInvestors (?)
**DistinctInvestorsNear (within MSA?)
**DistinctInvestorsFar (outside MSA?)
*PatentCount
*NoSTEMGrads
*FirmBirths (BDS data)
*UniRandDSpend
*PerCapitaIncome
*Employment

We need to:
*Check funds invested means dollars invested
*Categorize near and far! Is it within MSA vs. not, within adjacent MSAs, etc.?

There may be a second dataset that has Hub-Industry-Year (where industry is semiconductor/non-semiconductor?).

===Final Primary Data Set===

*A deal is a round syndicate. A near (far) deal is a deal with at least one investor that is near (far).

Table name: finaldataset
cmsa
year
totalamountinv--total amount invested
nearamountinv--amount invested from local funds
faramountinv-- amount invested from funds outside CMSA
earlyinv--amount invested in early stage companies
laterinv--amount invested in later stage companies
startupseedinv--amount invested in seed or startup stage companies
otherstageinv--amount invested in Acquisition/Buy-outs/Other stage companies
investingfund--distinct funds that are investing in that CMSA-year
investingfundnear--distinct funds from that CMSA that invested in that CMSA-year
investingfundfar--distinct funds from outside that CMSA that invested in that CMSA-year
deals--number of deals
neardeals--number of deals inside a CMSA
fardeals--number of deals from outside a CMSA --some of these deals might count in both categories, because of syndicate members being both inside and outside the CMSA
earlystagedeals--deals with earlystage companies
laterstagedeals--deals with later stage companies
startupseeddeals--deals with startup/seed companies
otherstagedeals--deals with companies in other stages
newportcosfunded--number of portfolio companies to receive their first investment in that year
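
The deal counts above can be sketched in Python as follows. This is only an illustration, not the project's SQL: the input layout (one row per fund in a round, with hypothetical fields round_id, company CMSA, year, fund CMSA) is an assumption, as is reading "near" as "at least one syndicate member in the company's CMSA".

```python
from collections import defaultdict

def count_deals(syndicate_rows):
    """Count deals, neardeals and fardeals per (cmsa, year).

    syndicate_rows: iterable of (round_id, company_cmsa, year, fund_cmsa),
    one row per fund taking part in a round (hypothetical layout).
    """
    # Collect the set of fund CMSAs seen in each round (= one deal).
    rounds = defaultdict(set)
    for round_id, co_cmsa, year, fund_cmsa in syndicate_rows:
        rounds[(round_id, co_cmsa, year)].add(fund_cmsa)

    stats = defaultdict(lambda: {"deals": 0, "neardeals": 0, "fardeals": 0})
    for (_, co_cmsa, year), fund_cmsas in rounds.items():
        row = stats[(co_cmsa, year)]
        row["deals"] += 1
        # Near: at least one fund from the company's own CMSA.
        if co_cmsa in fund_cmsas:
            row["neardeals"] += 1
        # Far: at least one fund from another CMSA. A mixed syndicate
        # is counted in both categories.
        if any(f != co_cmsa for f in fund_cmsas):
            row["fardeals"] += 1
    return dict(stats)

example = [
    ("r1", 1, 2001, 1),  # round r1: one local fund ...
    ("r1", 1, 2001, 7),  # ... and one outside fund
    ("r2", 1, 2001, 7),  # round r2: outside fund only
]
result = count_deals(example)
print(result[(1, 2001)])
```

Because a mixed syndicate is counted as both near and far, neardeals + fardeals can exceed deals.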

===Data by zip code===
*Population data, 2000-2016 - US Census Bureau (E:\McNair\Hubs\summer 2017)
https://www2.census.gov/programs-surveys/popest/datasets/
*Income data, 1998-2014 - The Internal Revenue Service (E:\McNair\Hubs\summer 2017)
https://www.irs.gov/uac/about-irs
*DCI index, to assess the economic well-being of communities
http://eig.org/dci/interactive-maps/u-s-zip-codes
*R&D Expenses, 1980-2016 - Wharton Research Data Services (E:\McNair\Hubs\summer 2017)
*Zipcode look-up table obtained from https://www.unitedstateszipcodes.org/zip-code-database/. It's available in (E:\McNair\Hubs\summer 2017).

== Data by MSA ==

We have principal cities of MSAs from the census:
https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html

We might be able to go City -> FIPS place code -> MSA?

Cities and their FIPS codes (which don't perfectly correspond) are available from https://www.census.gov/geo/reference/codes/place.html

The Census claims to provide city to MSA here: https://www.census.gov/geo/maps-data/data/ua_rel_download.html
However, there is only CBSA!

This might do it: https://www2.census.gov/geo/pdfs/maps-data/data/rel/explanation_ua_cbsa_rel_10.pdf
We can maybe track city to principal city to MSA
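
A minimal sketch of the city -> principal city -> MSA idea, assuming the delineation file has already been reduced to (MSA code, principal city, state) rows. The function names and the MSA codes below are made up for illustration.

```python
def build_city_to_msa(principal_city_rows):
    """Build a (city, state) -> MSA code lookup from principal-city rows."""
    lookup = {}
    for msa_code, city, state in principal_city_rows:
        key = (city.strip().lower(), state.strip().upper())
        # Keep the first MSA seen if a city/state pair appears twice.
        lookup.setdefault(key, msa_code)
    return lookup

def msa_for_city(lookup, city, state):
    """Return the MSA code for a city, or None if it is not a principal city."""
    return lookup.get((city.strip().lower(), state.strip().upper()))

rows = [
    ("90001", "Houston", "TX"),   # hypothetical MSA codes
    ("90002", "New York", "NY"),
    ("90002", "Newark", "NJ"),
]
lookup = build_city_to_msa(rows)
print(msa_for_city(lookup, " houston", "tx"))
print(msa_for_city(lookup, "Galveston", "TX"))
```

Cities that are not principal cities come back as None, so they would still need another route (e.g. the FIPS place code).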

==COMPUSTAT Data==

Data is in:
E:\McNair\Projects\Hubs\Summer 2017
Z:\Hubs\2017

Database is '''cities'''

SQL script is: COMPUSTAT.sql

The source file is RandDExpenditures.txt. It contains:
*Data from 1980-2017 (July). All COMPUSTAT.
*427799 records
*Fields include:
**R&D Expenditure
**Address (inc. city, zip, state)

Output file is COMPUSTATSummary.txt. It contains:
*Variables: City, year, No.public firms, sum R&D, sum Sales, sum total assets
*1979-2016
*4440 cities
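
The city-year summary can be sketched as below. This is not COMPUSTAT.sql: the record field names (firm_id, city, year, rd, sales, assets) are hypothetical, and treating a missing amount as 0 in the sums is an assumption.

```python
from collections import defaultdict

def summarize_by_city_year(records):
    """Aggregate firm-year records to (city, year) totals."""
    firms = defaultdict(set)
    totals = defaultdict(lambda: {"rd": 0.0, "sales": 0.0, "assets": 0.0})
    for rec in records:
        key = (rec["city"], rec["year"])
        firms[key].add(rec["firm_id"])
        for field in ("rd", "sales", "assets"):
            # Missing (None) amounts add nothing to the sum.
            totals[key][field] += rec[field] or 0.0
    # nofirms is the number of distinct public firms in the city-year.
    return {key: {"nofirms": len(firms[key]), **totals[key]} for key in totals}

records = [
    {"firm_id": "A", "city": "Houston", "year": 2001, "rd": 1.5, "sales": 10.0, "assets": 20.0},
    {"firm_id": "B", "city": "Houston", "year": 2001, "rd": None, "sales": 5.0, "assets": 8.0},
    {"firm_id": "A", "city": "Houston", "year": 2002, "rd": 2.0, "sales": 11.0, "assets": 21.0},
]
summary = summarize_by_city_year(records)
print(summary[("Houston", 2001)])
```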

==NSF Data==
Data is in:
E:\McNair\Projects\Hubs\Summer 2017
Z:\Hubs\2017

Database is '''cities'''

SQL script is: nsf_2017.sql

The source files are: nsf2017.txt, copied from table '''nsf''', and nsf_institution copied from table '''nsf_grants_institution''' from the biotech db.

They contain:
*Award ID
*Award Institution
*Award Effective date
*Institution city
*Award Value
*Organization state code
From 1900 - 2017

Output file is nsfSummary.txt. It contains:
*Variables: City, State code, year, nsf_nogrants, nsf_valuegrant
*1900-2017

===Joined NSF table===
The NSF table joined with the VC table is in db '''cities'''. The table is named '''merged_nsf'''.
Missing values of nogrants and valuegrant for years 1990-2017 are set to 0.
The sql script is in
Z:\HUbs\2017\sql scripts
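
A sketch of the join-and-zero-fill step, using Python's sqlite3 with an in-memory database in place of the project's database. The table layouts, the join key format and the sample values are assumptions for illustration; the actual script may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vc  (city_state_year TEXT PRIMARY KEY, year INTEGER, selamtm REAL);
CREATE TABLE nsf (city_state_year TEXT PRIMARY KEY, nogrants INTEGER, valuegrant REAL);
INSERT INTO vc  VALUES ('Houston_TX_2001', 2001, 12.5), ('Austin_TX_2001', 2001, 3.0);
INSERT INTO nsf VALUES ('Houston_TX_2001', 4, 250000.0);
""")

# LEFT JOIN keeps every VC city-year; COALESCE turns the NULLs produced
# by unmatched rows into 0.
merged = conn.execute("""
SELECT vc.city_state_year,
       COALESCE(nsf.nogrants, 0)   AS nogrants,
       COALESCE(nsf.valuegrant, 0) AS valuegrant
FROM vc
LEFT JOIN nsf ON nsf.city_state_year = vc.city_state_year
WHERE vc.year BETWEEN 1990 AND 2017
ORDER BY vc.city_state_year
""").fetchall()

for row in merged:
    print(row)
```

Note that zero-filling treats "no record" as "no grants", which is only appropriate inside the years the source actually covers.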

==NIH Data==
Data is in:
Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is '''cities'''
SQL script is: nih2017.sql
The source files are:
*nih_1986_2001.csv
*nih_2002_2012.txt
*nih_2013_2015
located in E:\McNair\Projects\Federal Grant Data\NIH

The script that cleans NIH data and generates the summary table is titled '''nihSummary'''. It is located here:

E:\McNair\Projects\Hubs\Summer 2017

This table includes
*year
*city
*state
*country
*nogrants (number of grants)
*valuegrant
*city_state (the city-state ID that we'll merge on)

*Data from 1986-2015

==Clinical Trials Data==
Data is in:
Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is '''cities'''
SQL script is: ctrials.sql
The source file is:

*medclinical.txt

located in Z:\Hubs\2017

*Data from 1999-2017

===Joined clinical trials table===

The file which contains the number of trials in each city and year is located in:
Z:\Hubs\2017

The file is in:
Z:\Hubs\2017\clean data
The name of the file is:
ctrialsSummary.txt

It contains:
*city
*year
*city_state_year
*noctrials - number of trials

The ctrials table is joined with the VC table.
The SQL script for the join is '''new_ctrials.sql''' and it is located in
Z:\Hubs\2017\sql scripts

The name of the joined table is '''new_merged_ctrials'''.

It contains:
*city
*state
*city_state_id
*city_state_year
*year
*noctrials
*seedamtm
*earlyamtm
*lateramtm
*selamtm
*numseeds
*numearly
*numlater
*numsel

Missing values of noctrials for years 1999-2017 are set to 0.

==Population Data==
Data is in:
Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is '''cities'''

SQL script is: '''population.sql'''
The source files are:
*pop2000_2009.xlsx
*pop2010_2016.xlsx

They contain:
*State
*City name
*Year
*Population Estimates

Data from 2000-2016

===Joined population table===

Data is in:
Z:\Hubs\2017\clean data
The file names are
1_population.txt - contains data on population estimates from 2000-2009
2_population.txt - contains data on population estimates from 2010-2016

Database is '''cities'''
SQL script is: '''new_population.sql''', located in
Z:\Hubs\2017\sql scripts

The population table is joined with the VC table. The joined table is called '''new_merged_population'''.

It contains:
*City
*State
*city_state_id to uniquely identify each city
*city_state_year to uniquely identify each city in each year
*Population estimates
*Year
*Code from the state code and Fips code
*State full name
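
The two identifiers can be built along these lines. The exact key format used in the '''cities''' database is not recorded here, so the upper-casing and the underscore separator below are assumptions.

```python
def city_state_id(city, state):
    """Build a key that identifies a city (hypothetical format CITY_ST)."""
    # Collapse repeated whitespace so 'New  York' and 'New York' match.
    city_clean = " ".join(city.split()).upper()
    return f"{city_clean}_{state.strip().upper()}"

def city_state_year(city, state, year):
    """Build a key that identifies a city in a given year."""
    return f"{city_state_id(city, state)}_{int(year)}"

print(city_state_id(" houston ", "tx"))
print(city_state_year("New  York", "NY", 2005))
```

Normalizing case and whitespace before building the key matters because the tables are merged on these strings.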

==Income Data==

Raw data was obtained from the US Census Bureau's American Community Survey.

Raw Data is in:
E:\McNair\Projects\Hubs\Summer 2017\MSA Income_raw.zip

Data from 2005-2015

The master list with MSAs and principal cities is titled '''list2.xls'''. It is located at:
Z:\Hubs\2017

This master list includes:
*MSA code
*MSA name
*Principal City
*State
*Place code (city code)
*State Code

This master list was edited to associate each principal city with a unique state. For example, New York, the principal city of the New York-New Jersey MSA, was originally associated with state NY-NJ; '''list''' was edited to put New York with NY.

Cleaned Income data files are in
Z:\Hubs\2017\merging_on_ID

They contain:
*MSA code
*MSA
*Year
*Total Household Income

The MSA-City-State look up file is titled '''msa_city_state_wcode.txt'''. It is located in
Z:\Hubs\2017\merging_on_ID

The SQL file that merges income data from ACS (by MSA - Year) with the MSA-City file is titled '''income.sql'''. It is located here:
Z:\Hubs\2017\sql scripts

The final income table is in db '''cities''' titled '''merged_income'''.

It includes:
*MSA
*City
*State
*Year
*Total Household Income

The table includes 8780 observations

===Joined income table===

Data is in:
Z:\Hubs\clean data
The file names are:
INC_05.txt - INC_15.txt

Database is '''cities'''
SQL script is: merged_income.sql

They contain:
*City
*State
*city_state_id to uniquely identify each city
*Income
*Year
*Code from the state code and Fips code

==Employment Data==

Data on employment was obtained from the American Community Survey, US Census Bureau.

Raw Data is in:
E:\McNair\Projects\Hubs\Summer 2017\Employment Data by MSA
Cleaned files are in
Z:\Hubs\2017\clean data

They contain:
*MSA code
*MSA
*Year
*Employment rate of individuals 16 years or older
*Unemployment rate of individuals 16 years or older

Data from 2005-2015

The SQL file that merges employment data from ACS (by MSA - Year) with the MSA-City file is titled '''Employment.sql'''.
The file is located in:
Z:\Hubs\2017

The final table is in db '''cities''' titled '''merged_employment'''.

It includes:
*MSA
*City
*Year
*Employment rate
*Unemployment rate

===Joined employment table===

Data is in:
Z:\Hubs\clean data

The file names are:
EMP_05.txt - EMP_15.txt

Database is '''cities'''
SQL script is: '''new_employment.sql''' and it is located in
Z:\Hubs\2017\sql scripts

The final table which is joined on VC is in db cities titled '''new_merged_employment'''.

They contain:
*City
*State
*Code from the state code and Fips code
*city_state_id to uniquely identify each city
*city_state_year to uniquely identify each city in each year
*Employment rates of individuals of 16 years or older
*Unemployment rates of individuals of 16 years or older
*Year

==Schooling Data==

Data on schooling was obtained from the American Community Survey, US Census Bureau.

Raw Data is in:
E:\McNair\Projects\Hubs\Summer 2017\School Enrollment Data by MSA
Cleaned files are in
Z:\Hubs\2017\clean data

They contain:
*MSA code
*MSA
*Year
*Total number of population 3 years and over enrolled in school
*Percent of population 3 years and over enrolled in public school
*Percent of population 3 years and over enrolled in private school

Data from 2005-2015

The SQL file that merges schooling data from ACS (by MSA - Year) with the MSA-City file is titled '''schooling.sql'''.
The file is located in:
Z:\Hubs\2017

The final table is in db '''cities''' titled '''merged_schooling'''.

It includes:
*MSA
*City
*Year
*Total
*Percent_public_schooling
*Percent_private_schooling

===Joined schooling table===

Data is in:
Z:\Hubs\clean data
The file names are:
SCH_05.txt - SCH_15.txt

Database is '''cities'''
SQL script which joins this table with VC table is: '''new_merged_schooling.sql'''
The final table is in db '''cities''' titled '''new_merged_schooling'''.

It contains:
*City
*State
*city_state_id to uniquely identify each city
*city_state_year to uniquely identify each city in each year
*Total number of school enrollment
*Percentage enrolled in public schools
*Percentage enrolled in private schools
*Year
*Code from the state code and Fips code

==VC Data==

Raw Data is in:
Z:\VentureCapitalData\SDCVCData
The file name is roundcitystateyear.txt

It contains:
*city
*state
*year
*seedamtm - seed, amount in millions
*earlyamtm - early, amount in millions
*lateramtm - late, amount in millions
*selamtm - seed early late, amount in millions
*numseeds - number of seeds
*numearly
*numlater
*numsel

Data from 1953-2017

The table is in db '''cities''' titled '''vc'''.

It includes:
*city
*state
*city_state_id
*city_state_year
*seedamtm
*earlyamtm
*lateramtm
*selamtm
*numseeds
*numearly
*numlater
*numsel
*year

Hubs

2017-07-12T21:15:54Z

KerdaV: /* Joined employment data */

{{McNair Projects
|Has title=Hubs
|Has owner=Hira Farooqi,
|Has keywords=Data
|Has project status=Active
}}
The Hubs Research Project is a full-length academic paper analyzing the effectiveness of "hubs", a component of the entrepreneurship ecosystem, in the advancement and growth of entrepreneurial success in a metropolitan area.

This research will primarily focus on large and mid-sized Metropolitan Statistical Areas (MSAs), as that is where the great majority of Venture Capital funding is located.

===Primary Data Set===
The Hubs data set, from SDC Platinum, has been constructed in the server:
Data files are in 128.42.44.181/bulk/Hubs
All files are in 128.42.44.182/bulk/Projects/Ecosystem/Hubs
psql Hubs

The data set includes all United States Venture Capital transactions (moneytree) from the twenty-five year period of 1990 through 2015.
Data has been aggregated at the portfolio company, fund, and round level. It will be analyzed at the combined MSA level, in terms of the number of companies funded, the number of funds active, and the flow of investment in a given MSA.

The data set has now been uploaded to the database server, named Hubs.
There are 4 tables:
*Rounds: Rounddate, coname, state, roundno, stage1, etc.
*CombinedRounds: Coname, rounddate, discamount, fundname
*Companies: LastInv, FirstInv, coname, MSA, MSACode, Address, state, datefounded, totalknownfunding, industry(major)
*Funds: fundname, closingdate, lastinv, firstinv, msa, msacode, avinv, nocoinv, totalknowninv, address

Used variables:

Companies: Coname, MSACode, Industry, state
MSALookupTable: MSACode, MSASuper
IndustryLookupTable: IndustryMajor, InduCode
->
CompanyInfo: Coname, MSASuper, InduCode, state (complete)

Funds: fundname, msacode, state
MSALookupTable: MSACode, MSASuper
->
FundInfo: fundname, msacode, state (complete)

Rounds: coname, rounddate, stagecode, roundno
CombinedRounds: coname, rounddate, discamount, fundname
->
RoundInfoSuper: coname, rounddate, '''nofunds''', discamount
->
RoundInfo: Coname, roundyear, fundname, estamount (complete)

Then take:
RoundInfo: Coname, roundyear, fundname, estamount
CompanyInfo: Coname, MSASuper, InduCode, state
FundInfo: fundname, msacode, state
->
SuperRoundInfo: Coname, CoMSASuper, CoInduCode, CoState, FundName, FundMSASuper, FundState, RoundYear, RoundEstAmount
->
MSAPortCos: Count(CoName) As NoPortCosFunded, CoMSASuper, RoundYear
...
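
The RoundInfo and MSAPortCos steps can be sketched as below. The even split of the disclosed amount across syndicate members is an assumption about how estamount is computed; the function names and sample rows are hypothetical.

```python
from collections import defaultdict

def round_info(combined_rounds):
    """Turn CombinedRounds rows into RoundInfo rows.

    combined_rounds: (coname, rounddate 'YYYY-MM-DD', discamount, fundname),
    one row per fund in the round.
    Returns (coname, roundyear, fundname, estamount) rows.
    """
    funds = defaultdict(set)
    amount = {}
    for coname, rounddate, discamount, fundname in combined_rounds:
        funds[(coname, rounddate)].add(fundname)
        amount[(coname, rounddate)] = discamount
    rows = []
    for (coname, rounddate), names in funds.items():
        # nofunds = number of distinct funds in the syndicate;
        # assumed: each fund contributed an equal share.
        est = amount[(coname, rounddate)] / len(names)
        for fundname in sorted(names):
            rows.append((coname, int(rounddate[:4]), fundname, est))
    return rows

def port_cos_funded(rows, company_msa):
    """Count distinct companies funded per (CoMSASuper, RoundYear)."""
    cos = defaultdict(set)
    for coname, year, _fund, _est in rows:
        cos[(company_msa[coname], year)].add(coname)
    return {key: len(names) for key, names in cos.items()}

combined = [
    ("Acme", "2001-03-01", 6.0, "Fund A"),
    ("Acme", "2001-03-01", 6.0, "Fund B"),
    ("Beta", "2001-07-15", 2.0, "Fund A"),
]
info = round_info(combined)
counts = port_cos_funded(info, {"Acme": 1, "Beta": 1})
print(info)
print(counts)
```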

'''Notes on Creation of Primary Data Set'''

Raw tables
* companies (last investment, first investment, company name, MSA, MSA code, address, state, date founded, known funding, industry)
* funds (fund closing date, last investment, first investment, fund name, address, MSA, MSA code, Average investment, number companies invested (NoCos), known investment)
* rounds (round date, company name, state, round number, stage 1, stage 2, stage 3)
* combined rounds (company name, round date, disclosed amount, investor)
* msalist (changes MSAs to CMSAs— combined MSAs)
*industry list (changes 6 industry categories to 4— ICT, Life Sciences, Semiconductors, Other)

Process
*cleaned tables to eliminate duplications, undisclosed variables
*changed all original characters to include CMSA and Industry Codes (companyinfo3, fundinfocleanfinal, roundinfoclean)
*matched funds to avoid any issues with names (i.e. Fund ABC L.P./Fund ABC LP/Fund ABC)
*matched roundinfoclean investors to fundinfocleanfinal investors (roundinfo.txt >> cleanfundfinal.txt)
*join by round and company conames
*bridge years (1990-2016), stage, and cmsa
* populate data with count of companies (Deal flow) and estimated amount ($)
** data set in 181 hubs folder under summarycmsa.txt (38394)
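
The fund-name matching step (Fund ABC L.P./Fund ABC LP/Fund ABC) can be approximated by normalizing names before joining. This is a sketch only: the list of suffixes stripped here is an assumption, and the actual matching may have been done differently.

```python
import re

# Trailing legal-form suffixes to drop (assumed list: LP, LLC, Inc).
_SUFFIX = re.compile(r"[\s,]+(l\.?\s?p\.?|l\.?l\.?c\.?|inc\.?)$", re.IGNORECASE)

def normalize_fund_name(name):
    """Return a lower-cased fund name without a trailing legal suffix."""
    cleaned = " ".join(name.split())   # collapse repeated whitespace
    cleaned = _SUFFIX.sub("", cleaned) # strip one trailing suffix
    return cleaned.lower()

variants = ["Fund ABC L.P.", "Fund ABC LP", "Fund  ABC"]
keys = {normalize_fund_name(v) for v in variants}
print(keys)
```

All three variants collapse to the same key, so round investors and fund records can be joined on it.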

Key decisions:
*Threw out undisclosed companies throughout, as they have no address
*Count is done by joining round and company
*Anything fund related must be disclosed fund
*Near and far, and total invested, and fund counts, etc., are all done using disclosed funds that match only

'''Glossary of Tables'''
cleanco — used to remove duplicates from companies
cleanedcompanies — clean set of companies with no duplicates
cmsafunds-
cmsas— list of all CMSAs in final data set (for merging)
cmsastats- statistics not including empty years (pre-merge)
cmsastats2 - statistics separated by year-MSA
cmsastats3— statistics separated by year-MSA-stage
cmsastats4
cmsayears— empty merged table between year and cmsa
cmsayearstage — empty merged table between cmsa/years and stage
combinedrounds— raw sdc data for combined rounds
combinedroundswamt— used to join rounds and combined rounds for roundinfo2
companies- raw SDC company data
companyinfo — cleaned companies joined with state and CMSA information
companyinfo2— companyinfo1 with original industry categories
companyinfo3— companyinfo2 with updated industry categories and codes
companyinfo4-- clean version of companyinfo3
companyround- combined company information with round information
companyround2- combined company information with round information, cleaned up from companyround
companyround3- combined company information with round information, cleaned up from companyround2
'''finaldataset'''- final statistics by CMSA-year, see section Final Primary Data Set for more information
fundinfo— funds joined with CMSA info
fundinfo2 - clean version of fundinfo1
fundinfoclean - used in process to clean fundinfo2
fundinfoclean2- used in process to clean fundinfo2
fundinfocleanfinal- used in process to clean fundinfo2
fundinfocleannodups- final clean set of fundinfo
funds - raw SDC fund data
Houston - analysis for Houston ecosystem team
Houston2- analysis for Houston ecosystem team
houston3- analysis for Houston ecosystem team
industry — new industry codes (4)— used for all future data sets
industrylist— lookup table for new industry codes (went from 6 to 4)
joined1- used for matching process
joined2- used for matching process
matchfund2- used for matching process
matchfunds- used for matching process
matchroundfund - used for matching process
matchroundfund2- used for matching process
msalist — lookup table for MSA to CMSA (used for all future data sets)
nearfar1-- beginning set before adding nearfar/stage variables
nearfar2 -- added binomial variables for near/far and for each of the stages, used to build final dataset
roundfund— not used— joined round to fund; drop/ignore
roundinfo— round info cleaned up to include number of investors in a syndicate and estimated investment per member of syndicate
roundinfo2— roundinfo1 including name of investors/funds
roundinfo3— clean version of roundinfo2
roundinfoclean — final clean version of roundinfo3 (final roundinfo table)
rounds — raw SDC round data
stages — table for merging stage-year-CMSA
superinfo — ignore/drop
temp - used for matching process
years — table for merging stage-year-CMSA

===Hub Candidates Data Set===

The Hub candidates data set is a list of potential hubs found in MSAs throughout the country. Researchers are currently pulling qualitative and quantitative information from the candidates' websites in an attempt to categorize what can be identified as a hub. This data set is difficult to build: little to no quantitative information is available for this category of institution, so it depends on what information is publicly accessible on the internet.

Characteristics/Variables
*Year Founded
*Square footage
*LinkedIn self-identifiers (how the organization classifies itself on its LinkedIn profile)
*Activeness on Twitter (binomial)
*Member Directory available online (binomial)
*Number of conference rooms
*Price ($/month) for Flex desk
*Offers Reserved desk (binomial)
*Offers office space for rent (binomial)
*Offers community membership-- not for coworking but for community events, etc. (binomial)
*Number of events offered per month (estimate)
*Offers code academy
*Mission Statement/Vision (for qualitative or key-word analysis)

These characteristics/variables will be used to determine whether a candidate is or is not likely to be a Hub.

As of March 10th 2016, the list contains 125 Hub candidates.

'''Where to find''': The Hubs data set can be found in the Ecosystem>>Hubs>>dataset folder. It is not currently in the database due to a UTF8 issue.

===Supplementary Data Sets===
'''Patent data''': to be pulled from USPTO or SDC Platinum.

'''Number of STEM Graduate Students''' (NSF) and '''University R&D Spending''' (NSF):
*University R&D Data found under file "NSF DATA_2004 to 2011.xlsx" in datasets folder (Ecosystem>>Hubs>>Datasets)
*R&D spending found at the university level for 2014 ("Stem Grad Students.xlsx") or at state level ("Science and Engineering Grad Students by State and Year 2000-2011.csv")
** not uploaded to server or matched yet to CMSA code, because of this discrepancy.
**"Stem Grad Students.xlsx" contains categorized university by MSA, can be used for all university-based projects

'''Per Capita Income''' and '''Employment Data''' (US Census Bureau):
*"Per Capita Personal Income by MSA 2000-2012.xlsx" in datasets folder (Ecosystem>>Hubs>>Datasets>>Data from Yael)
*"Wages and Salaries by MSA 2000-2012.xlsx" in datasets folder (Ecosystem>>Hubs>>datasets>>Data from Yael)
**not uploaded to server or matched yet to CMSA code

'''Firm Births''' (BDS)
*in server 181, under table name "BDS"
*includes birth, death, net(birth-death) and rate(death rate) for years 1990-2013 for every msa
*includes code for CMSA but is not aggregated by CMSA
** i.e. BDS statistics are still separate for all the smaller MSAs in New York's CMSA (code=1)

===Resources===
* Yael Hochberg and Fehder (2015), located in dropbox
** Use this paper as a guideline on how to conduct the analysis
*US Census Bureau data on employment by MSA: http://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_14_5YR_B23027&prodType=table
*USPTO utility patents by MSA: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/cls_cbsa/allcbsa_gd.htm
*MSA level trends: http://www.metrotrends.org/data.cf

===The Target Dataset===

We will need to process the following variables:
*SuperMSA - combine SanFran and SanJose, New York and Newark?, NC Research triangle, others?
*CSV mapping msas to cmsas is in the folder (and a table in the dbase)

Example dataset:
{| class="wikitable"
! MSA !! Year !! SeedVCInv !! SeedEarlyVCInv !! LaterVCInv !! NoDeals !! FundsInvested !! DistinctInvestors !! ....
|-
| 1234 || 2001 || 1000000 || 20000000 || 30000000 || 4 || 7 || 7 ||
|}

Note that the unit of observation is MSA-Year.

Variables to be computed at the MSA level:
*HubActive (binary)
*NoHubsActive (Count)
*HubSqFt
*Other Hub Vars (build list!!!)
*'''SeedVCInv''' (Seed/Start-up)
*'''EarlyVCInv''' (Early Stage)
*'''LaterStageVC''' (Later)
*'''OtherStageVC''' (Buyout/Acq, Other)
*'''NoDeals''' (done by local VCs?)
**'''NoDealsNear'''
**'''NoDealsFar'''
*NoPortCosFunded
*'''FundsInv''' (in an MSA)
**'''FundsInvFromNear''' (within MSA?)
**'''FundsInvFromFar''' (outside MSA?)
*DistinctInvestors (?)
**DistinctInvestorsNear (within MSA?)
**DistinctInvestorsFar (outside MSA?)
*PatentCount
*NoSTEMGrads
*FirmBirths (BDS data)
*UniRandDSpend
*PerCapitaIncome
*Employment

We need to:
*Check funds invested means dollars invested
*Categorize near and far! Is it within MSA vs. not, within adjacent MSAs, etc.?

There may be a second dataset that has Hub-Industry-Year (where industry is semiconductor/non-semiconductor?).

===Final Primary Data Set===

*A deal is a round syndicate. A near (far) deal is a deal with at least one investor that is near (far).

Table name: finaldataset
cmsa
year
totalamountinv--total amount invested
nearamountinv--amount invested from local funds
faramountinv-- amount invested from funds outside CMSA
earlyinv--amount invested in early stage companies
laterinv--amount invested in later stage companies
startupseedinv--amount invested in seed or startup stage companies
otherstageinv--amount invested in Acquisition/Buy-outs/Other stage companies
investingfund--distinct funds that are investing in that CMSA-year
investingfundnear--distinct funds from that CMSA that invested in that CMSA-year
investingfundfar--distinct funds from outside that CMSA that invested in that CMSA-year
deals--number of deals
neardeals--number of deals inside a CMSA
fardeals--number of deals from outside a CMSA --some of these deals might count in both categories, because of syndicate members being both inside and outside the CMSA
earlystagedeals--deals with earlystage companies
laterstagedeals--deals with later stage companies
startupseeddeals--deals with startup/seed companies
otherstagedeals--deals with companies in other stages
newportcosfunded--number of portfolio companies to receive their first investment in that year

===Data by zip code===
*Population data, 2000-2016 - US Census Bureau (E:\McNair\Hubs\summer 2017)
https://www2.census.gov/programs-surveys/popest/datasets/
*Income data, 1998-2014 - The Internal Revenue Service (E:\McNair\Hubs\summer 2017)
https://www.irs.gov/uac/about-irs
*DCI index, to assess the economic well-being of communities
http://eig.org/dci/interactive-maps/u-s-zip-codes
*R&D Expenses, 1980-2016 - Wharton Research Data Services (E:\McNair\Hubs\summer 2017)
*Zipcode look-up table obtained from https://www.unitedstateszipcodes.org/zip-code-database/. It's available in (E:\McNair\Hubs\summer 2017).

== Data by MSA ==

We have principal cities of MSAs from the census:
https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html

We might be able to go City -> FIPS place code -> MSA?

Cities and their FIPS codes (which don't perfectly correspond) are available from https://www.census.gov/geo/reference/codes/place.html

The Census claims to provide city to MSA here: https://www.census.gov/geo/maps-data/data/ua_rel_download.html
However, there is only CBSA!

This might do it: https://www2.census.gov/geo/pdfs/maps-data/data/rel/explanation_ua_cbsa_rel_10.pdf
We can maybe track city to principal city to MSA

==COMPUSTAT Data==

Data is in:
E:\McNair\Projects\Hubs\Summer 2017
Z:\Hubs\2017

Database is '''cities'''

SQL script is: COMPUSTAT.sql

The source file is RandDExpenditures.txt. It contains:
*Data from 1980-2017 (July). All COMPUSTAT.
*427799 records
*Fields include:
**R&D Expenditure
**Address (inc. city, zip, state)

Output file is COMPUSTATSummary.txt. It contains:
*Variables: City, year, No.public firms, sum R&D, sum Sales, sum total assets
*1979-2016
*4440 cities

==NSF Data==
Data is in:
E:\McNair\Projects\Hubs\Summer 2017
Z:\Hubs\2017

Database is '''cities'''

SQL script is: nsf_2017.sql

The source files are: nsf2017.txt, copied from table '''nsf''', and nsf_institution copied from table '''nsf_grants_institution''' from the biotech db.

They contain:
*Award ID
*Award Institution
*Award Effective date
*Institution city
*Award Value
*Organization state code
From 1900 - 2017

Output file is nsfSummary.txt. It contains:
*Variables: City, State code, year, nsf_nogrants, nsf_valuegrant
*1900-2017

===Joined NSF table===
The NSF table joined with the VC table is in db '''cities'''. The table is named '''merged_nsf'''.
Missing values of nogrants and valuegrant for years 1990-2017 are set to 0.
The sql script is in
Z:\HUbs\2017\sql scripts

==NIH Data==
Data is in:
Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is '''cities'''
SQL script is: nih2017.sql
The source files are:
*nih_1986_2001.csv
*nih_2002_2012.txt
*nih_2013_2015
located in E:\McNair\Projects\Federal Grant Data\NIH

The script that cleans NIH data and generates the summary table is titled '''nihSummary'''. It is located here:

E:\McNair\Projects\Hubs\Summer 2017

This table includes
*year
*city
*state
*country
*nogrants (number of grants)
*valuegrant
*city_state (the city-state ID that we'll merge on)

*Data from 1986-2015

==Clinical Trials Data==
Data is in:
Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is '''cities'''
SQL script is: ctrials.sql
The source file is:

*medclinical.txt

located in Z:\Hubs\2017

*Data from 1999-2017

===Joined clinical trials table===

The file which contains the number of trials in each city and year is located in:
Z:\Hubs\2017

The file is in:
Z:\Hubs\2017\clean data
The name of the file is:
ctrialsSummary.txt

It contains:
*city
*year
*city_state_year
*noctrials - number of trials

The ctrials table is joined with the VC table.
The joined SQL script is: '''new_ctrials.sql''' and it is located in
Z:\Hubs\2017\sql scripts

The name of the joined table is '''new_merged_ctrials'''.

It contains:
*city
*state
*city_state_id
*city_state_year
*year
*noctrials
*seedamtm
*earlyamtm
*lateramtm
*selamtm
*numseeds
*numearly
*numlater
*numsel

All the values of noctrials with missing values for years 1999-2017 are set equal to 0.

==Population Data==
Data is in:
Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is '''cities'''

SQL script is: '''population.sql'''
The source files are:
*pop2000_2009.xlsx
*pop2010_2016.xlsx

They contain:
*State
*City name
*Year
*Population Estimates

Date from 2000-2016

===Joined population table===

Data is in:
Z:\Hubs\2017\clean data
The file names are
1_population.txt - contains data on population estimates from 2000-2009
2_population.txt - contains data on population estimates from 2010-2016

Database is '''cities'''
SQL script is: '''new_population.sql''', located in
Z:\Hubs\2017\sql scripts

The population table is joined on VC table. The table is called '''new_merged_population'''.

They contain:
*City
*State
*city_state_id to uniquely identify each city
*city_state_year to uniquely identify each city in each year
*Population estimates
*Year
*Code from the state code and Fips code
*State full name
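
The two identifiers could be built along these lines. The exact key format used in the database is not documented here, so the upper-casing and underscore-separated layout below are assumptions made for illustration.

```python
def city_state_id(city: str, state: str) -> str:
    """Build a city-level key from a city name and state code (hypothetical format)."""
    city_part = "_".join(city.strip().upper().split())
    return f"{city_part}_{state.strip().upper()}"


def city_state_year(city: str, state: str, year: int) -> str:
    """Build a city-year key by appending the year to the city-level key."""
    return f"{city_state_id(city, state)}_{int(year)}"
```

Normalizing case and whitespace before building the key helps city names from different sources (Census, VC data) match on merge.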

==Income Data==

Raw data was obtained from Census data, American Communities Survey.

Raw Data is in:
E:\McNair\Projects\Hubs\Summer 2017\MSA Income_raw.zip

Date from 2005-2015

The master list with MSAs and principal cities is titled '''list2.xls'''. It is located at:
Z:\Hubs\2017

This master list includes:
*MSA code
*MSA name
*Principal City
*State
*Place code (city code)
*State Code

This master list was edited to associate each principal city with a single state. For example, New York, the principal city of the New York-New Jersey MSA, was originally associated with the state NY-NJ, so '''list''' was edited to put New York with NY.

Cleaned Income data files are in
Z:\Hubs\2017\merging_on_ID

They contain:
*MSA code
*MSA
*Year
*Total Household Income

The MSA-City-State look up file is titled '''msa_city_state_wcode.txt'''. It is located in
Z:\Hubs\2017\merging_on_ID

The SQL file that merges income data from ACS (by MSA - Year) with the MSA-City file is titled '''income.sql'''. It is located here:
Z:\Hubs\2017\sql scripts
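
As a rough sketch of what this merge does, each MSA-year income value is attached to every principal city of that MSA. The MSA code, cities and income values below are made up, and the structures are assumptions rather than the layout used in income.sql.

```python
# Hypothetical MSA-year income values and MSA-city lookup rows.
income_by_msa_year = {("M001", 2010): 1000.0, ("M001", 2011): 1100.0}
msa_cities = [("M001", "Houston", "TX"), ("M001", "Sugar Land", "TX")]

# One output row per (principal city, year): the city inherits its MSA's income.
merged_income = [
    {"msa": msa, "city": city, "state": state, "year": year, "income": income}
    for (income_msa, year), income in sorted(income_by_msa_year.items())
    for msa, city, state in msa_cities
    if msa == income_msa
]
```

Note that an MSA with several principal cities yields several rows per year, all carrying the same MSA-level income.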

The final income table is in db '''cities''' titled '''merged_income'''.

It includes:
*MSA
*City
*State
*Year
*Total Household Income

The table includes 8780 observations

===Joined income table===

Data is in:
Z:\Hubs\clean data
The file names are:
INC_05.txt - INC_15.txt

Database is '''cities'''
SQL script is: merged_income.sql

They contain:
*City
*State
*city_state_id to uniquely identify each city
*Income
*Year
*Code from the state code and Fips code

==Employment Data==

Data on employment was obtained from American Communities Survey, US Census Bureau.

Raw Data is in:
E:\McNair\Projects\Hubs\Summer 2017\Employment Data by MSA
Cleaned files are in
Z:\Hubs\2017\clean data

They contain:
*MSA code
*MSA
*Year
*Employment rate of individuals 16 years or older
*Unemployment rate of individuals 16 years or older

Date from 2005-2015

The SQL file that merges employment data from ACS (by MSA - Year) with the MSA-City file is titled '''Employment.sql'''.
The file is located in:
Z:\Hubs\2017

The final table is in db '''cities''' titled '''merged_employment'''.

It includes:
*MSA
*City
*Year
*Employment rate
*Unemployment rate

===Joined employment table===

Data is in:
Z:\Hubs\clean data

The file names are:
EMP_05.txt - EMP_15.txt

Database is '''cities'''
SQL script is: '''new_employment.sql''' and it is located in
Z:\Hubs\2017\sql scripts

The final table which is joined on VC is in db cities titled '''new_merged_employment'''.

They contain:
*City
*State
*Code from the state code and Fips code
*city_state_id to uniquely identify each city
*city_state_year to uniquely identify each city in each year
*Employment rates of individuals of 16 years or older
*Unemployment rates of individuals of 16 years or older
*Year

==Schooling Data==

Data on schooling was obtained from American Communities Survey, US Census Bureau.

Raw Data is in:
E:\McNair\Projects\Hubs\Summer 2017\School Enrollment Data by MSA
Cleaned files are in
Z:\Hubs\2017\clean data

They contain:
*MSA code
*MSA
*Year
*Total number of population 3 years and over enrolled in school
*Percent of population 3 years and over enrolled in public school
*Percent of population 3 years and over enrolled in private school

Date from 2005-2015

The SQL file that merges schooling data from ACS (by MSA - Year) with the MSA-City file is titled '''schooling.sql'''.
The file is located in:
Z:\Hubs\2017

The final table is in db '''cities''' titled '''merged_schooling'''.

It includes:
*MSA
*City
*Year
*Total
*Percent_public_schooling
*Percent_private_schooling

===Joined schooling data===

Data is in:
Z:\Hubs\clean data
The file names are:
SCH_05.txt - SCH_15.txt

Database is '''cities'''
SQL script which joins this table with VC table is: '''new_merged_schooling.sql'''
The final table is in db '''cities''' titled '''new_merged_schooling'''.

It contains:
*City
*State
*city_state_id to uniquely identify each city
*city_state_year to uniquely identify each city in each year
*Total number of school enrollment
*Percentage enrolled in public schools
*Percentage enrolled in private schools
*Year
*Code from the state code and Fips code

==VC Data==

Raw Data is in:
Z:\VentureCapitalData\SDCVCData
The file name is roundcitystateyear.txt

It contains:
*city
*state
*year
*seedamtm - seed, amount in millions
*earlyamtm - early, amount in millions
*lateramtm - late, amount in millions
*selamtm - seed early late, amount in millions
*numseeds - number of seeds
*numearly
*numlater
*numsel

Date from 1953-2017

The table is in db '''cities''' titled '''vc'''.

It includes:
*city
*state
*city_state_id
*city_state_year
*seedamtm
*earlyamtm
*lateramtm
*selamtm
*numseeds
*numearly
*numlater
*numsel
*year

Hubs

2017-07-12T21:15:37Z

KerdaV: /* Joined income data */

{{McNair Projects
|Has title=Hubs
|Has owner=Hira Farooqi,
|Has keywords=Data
|Has project status=Active
}}
The Hubs Research Project is a full-length academic paper analyzing the effectiveness of "hubs", a component of the entrepreneurship ecosystem, in the advancement and growth of entrepreneurial success in a metropolitan area.

This research will primarily focus on large and mid-sized Metropolitan Statistical Areas (MSAs), as that is where the great majority of Venture Capital funding is located.

===Primary Data Set===
The Hubs data set, from SDC Platinum, has been constructed in the server:
Data files are in 128.42.44.181/bulk/Hubs
All files are in 128.42.44.182/bulk/Projects/Ecosystem/Hubs
psql Hubs

The data set includes all United States Venture Capital transactions (moneytree) from 1990 through 2015.
Data has been aggregated at the portfolio company, fund, and round level. It will be analyzed at the combined MSA level. We will be looking at it in terms of the number of companies funded, the number of funds active, and the flow of investment in a given MSA.

The data set has now been uploaded to the database server, named Hubs.
There are 4 tables:
*Rounds: Rounddate, coname, state, roundno, stage1, etc.
*CombinedRounds: Coname, rounddate, discamount, fundname
*Companies: LastInv, FirstInv, coname, MSA, MSACode, Address, state, datefounded, totalknownfunding, industry(major)
*Funds: fundname, closingdate, lastinv, firstinv, msa, msacode, avinv, nocoinv, totalknowninv, address

Used variables:

Companies: Coname, MSACode, Industry, state
MSALookupTable: MSACode, MSASuper
IndustryLookupTable: IndustryMajor, InduCode
->
CompanyInfo: Coname, MSASuper, InduCode, state (complete)

Funds: fundname, msacode, state
MSALookupTable: MSACode, MSASuper
->
FundInfo: fundname, msacode, state (complete)

Rounds: coname, rounddate, stagecode, roundno
CombinedRounds: coname, rounddate, discamount, fundname
->
RoundInfoSuper: coname, rounddate, '''nofunds''', discamount
->
RoundInfo: Coname, roundyear, fundname, estamount (complete)
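
The Rounds/CombinedRounds step above can be sketched as follows, assuming that <code>estamount</code> is the disclosed round amount split equally across the funds in the syndicate (the glossary describes it as the estimated investment per syndicate member, but the equal split is an assumption). The sample rows are made up.

```python
from collections import defaultdict

# Hypothetical CombinedRounds rows: (coname, rounddate, discamount, fundname).
combined_rounds = [
    ("AcmeCo", "2001-03-15", 6.0, "Fund X"),
    ("AcmeCo", "2001-03-15", 6.0, "Fund Y"),
    ("AcmeCo", "2001-03-15", 6.0, "Fund Z"),
    ("BetaCo", "2002-07-01", 4.0, "Fund X"),
]


def build_round_info(rows):
    """Return (coname, roundyear, fundname, estamount) with an equal split per fund."""
    funds = defaultdict(set)
    amount = {}
    for coname, rounddate, discamount, fundname in rows:
        funds[(coname, rounddate)].add(fundname)
        amount[(coname, rounddate)] = discamount
    out = []
    for (coname, rounddate), names in funds.items():
        est = amount[(coname, rounddate)] / len(names)  # nofunds = len(names)
        for fundname in sorted(names):
            out.append((coname, int(rounddate[:4]), fundname, est))
    return out


round_info = build_round_info(combined_rounds)
```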

Then take:
RoundInfo: Coname, roundyear, fundname, estamount
CompanyInfo: Coname, MSASuper, InduCode, state
FundInfo: fundname, msacode, state
->
SuperRoundInfo: Coname, CoMSASuper, CoInduCode, CoState, FundName, FundMSASuper, FundState, RoundYear, RoundEstAmount
->
MSAPortCos: Count(CoName) As NoPortCosFunded, CoMSASuper, RoundYear
...

'''Notes on Creation of Primary Data Set'''

Raw tables
* companies (last investment, first investment, company name, MSA, MSA code, address, state, date founded, known funding, industry)
* funds (fund closing date, last investment, first investment, fund name, address, MSA, MSA code, Average investment, number companies invested (NoCos), known investment)
* rounds (round date, company name, state, round number, stage 1, stage 2, stage 3)
* combined rounds (company name, round date, disclosed amount, investor)
* msalist (changes MSAs to CMSAs— combined MSAs)
*industry list (changes 6 industry categories to 4— ICT, Life Sciences, Semiconductors, Other)

Process
*cleaned tables to eliminate duplications, undisclosed variables
*changed all original characters to include CMSA and Industry Codes (companyinfo3, fundinfocleanfinal, roundinfoclean)
*matched funds to avoid any issues with names (i.e. Fund ABC L.P./Fund ABC LP/Fund ABC)
*matched roundinfoclean investors to fundinfocleanfinal investors (roundinfo.txt >> cleanfundfinal.txt)
*join by round and company conames
*bridge years (1990-2016), stage, and cmsa
* populate data with count of companies (Deal flow) and estimated amount ($)
** data set in 181 hubs folder under summarycmsa.txt (38394)
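
The fund-name matching step in the process above could look something like the sketch below. The suffix list and the lower-casing rule are assumptions for illustration; the actual matching used for the data set may have been different.

```python
import re

# Trailing legal-form suffixes to strip; this list is an illustrative assumption.
_SUFFIX = re.compile(r"[\s,]+(l\.?p\.?|l\.?l\.?c\.?|inc\.?)$", re.IGNORECASE)


def normalize_fund_name(name: str) -> str:
    """Collapse whitespace, drop a trailing legal suffix, and lower-case the name."""
    cleaned = " ".join(name.split())
    cleaned = _SUFFIX.sub("", cleaned)
    return cleaned.lower()
```

With this normalization, "Fund ABC L.P.", "Fund ABC LP" and "Fund ABC" all reduce to the same key, so they can be joined as one fund.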

Key decisions:
*Threw out undisclosed companies throughout, as they have no address
*Count is done by joining round and company
*Anything fund related must be disclosed fund
*Near and far, and total invested, and fund counts, etc., are all done using disclosed funds that match only

'''Glossary of Tables'''
cleanco — used to remove duplicates from companies
cleanedcompanies — clean set of companies with no duplicates
cmsafunds-
cmsas— list of all CMSAs in final data set (for merging)
cmsastats- statistics not including empty years (pre-merge)
cmsastats2 - statistics separated by year-MSA
cmsastats3— statistics separated by year-MSA-stage
cmsastats4
cmsayears— empty merged table between year and cmsa
cmsayearstage — empty merged table between cmsa/years and stage
combinedrounds— raw sdc data for combined rounds
combinedroundswamt— used to join rounds and combined rounds for roundinfo2
companies- raw SDC company data
companyinfo — cleaned companies joined with state and CMSA information
companyinfo2— companyinfo1 with original industry categories
companyinfo3— companyinfo2 with updated industry categories and codes
companyinfo4-- clean version of companyinfo3
companyround- combined company information with round information
companyround2- combined company information with round information, cleaned up from companyround
companyround3- combined company information with round information, cleaned up from companyround2
'''finaldataset'''- final statistics by CMSA-year, see section Final Primary Data Set for more information
fundinfo— funds joined with CMSA info
fundinfo2 - clean version of fundinfo1
fundinfoclean - used in process to clean fundinfo2
fundinfoclean2- used in process to clean fundinfo2
fundinfocleanfinal- used in process to clean fundinfo2
fundinfocleannodups- final clean set of fundinfo
funds - raw SDC fund data
Houston - analysis for Houston ecosystem team
Houston2- analysis for Houston ecosystem team
houston3- analysis for Houston ecosystem team
industry — new industry codes (4)— used for all future data sets
industrylist— lookup table for new industry codes (went from 6 to 4)
joined1- used for matching process
joined2- used for matching process
matchfund2- used for matching process
matchfunds- used for matching process
matchroundfund - used for matching process
matchroundfund2- used for matching process
msalist — lookup table for MSA to CMSA (used for all future data sets)
nearfar1-- beginning set before adding nearfar/stage variables
nearfar2 -- added binomial variables for near/far and for each of the stages, used to build final dataset
roundfund— not used— joined round to fund; drop/ignore
roundinfo— round info cleaned up to include number of investors in a syndicate and estimated investment per member of syndicate
roundinfo2— roundinfo1 including name of investors/funds
roundinfo3— clean version of roundinfo2
roundinfoclean — final clean version of roundinfo3 (final roundinfo table)
rounds — raw SDC round data
stages — table for merging stage-year-CMSA
superinfo — ignore/drop
temp - used for matching process
years — table for merging stage-year-CMSA

===Hub Candidates Data Set===

The Hubs candidate data set is a list of potential hubs found in MSAs throughout the country. Researchers are currently pulling qualitative and quantitative information from the candidates' websites in an attempt to categorize what can be identified as a hub. This is a difficult data set to build, as little to no quantitative information is available for this category of institution, and it depends on what information is publicly accessible on the internet.

Characteristics/Variables
*Year Founded
*Square footage
*LinkedIn self-identifiers (what the organization classifies itself as on its LinkedIn profile)
*Activeness on Twitter (binomial)
*Member Directory available online (binomial)
*Number of conference rooms
*Price ($/month) for Flex desk
*Offers Reserved desk (binomial)
*Offers office space for rent (binomial)
*Offers community membership-- not for coworking but for community events, etc. (binomial)
*Number of events offered per month (estimate)
*Offers code academy
*Mission Statement/Vision (for qualitative or key-word analysis)

These characteristics/variables will be used to determine whether a candidate is or is not likely to be a Hub.

As of March 10th 2016, the list contains 125 Hub candidates.

'''Where to find''': The Hubs data set can be found in the Ecosystem>>Hubs>>dataset folder. It is not currently in the database due to a UTF8 issue

===Supplementary Data Sets===
'''Patent data''': to be pulled from USPTO or SDC Platinum.

'''Number of STEM Graduate Students''' (NSF) and '''University R&D Spending''' (NSF):
*University R&D Data found under file "NSF DATA_2004 to 2011.xlsx" in datasets folder (Ecosystem>>Hubs>>Datasets)
*R&D spending found at the university level for 2014 ("Stem Grad Students.xlsx") or at state level ("Science and Engineering Grad Students by State and Year 2000-2011.csv")
** not uploaded to server or matched yet to CMSA code, because of this discrepancy.
**"Stem Grad Students.xlsx" contains categorized university by MSA, can be used for all university-based projects

'''Per Capita Income''' and '''Employment Data''' (US Census Bureau):
*"Per Capita Personal Income by MSA 2000-2012.xlsx" in datasets folder (Ecosystem>>Hubs>>Datasets>>Data from Yael)
*"Wages and Salaries by MSA 2000-2012.xlsx" in datasets folder (Ecosystem>>Hubs>>datasets>>Data from Yael)
**not uploaded to server or matched yet to CMSA code

'''Firm Births''' (BDS)
*in server 181, under table name "BDS"
*includes birth, death, net(birth-death) and rate(death rate) for years 1990-2013 for every msa
*includes code for CMSA but is not aggregated by CMSA
** i.e. BDS statistics are still separate for all the smaller MSAs in New York's CMSA (code=1)
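
Since the BDS table is not yet aggregated by CMSA, the roll-up could be sketched as below. The row layout and numbers are hypothetical. Births and deaths are summed and net is recomputed; the death rate is left out because it cannot simply be summed across MSAs without its denominator.

```python
from collections import defaultdict

# Hypothetical BDS rows at the MSA level, each carrying its CMSA code.
bds_rows = [
    {"msa": "A", "cmsa": 1, "year": 2000, "birth": 100, "death": 80},
    {"msa": "B", "cmsa": 1, "year": 2000, "birth": 50, "death": 60},
    {"msa": "C", "cmsa": 2, "year": 2000, "birth": 10, "death": 5},
]


def aggregate_bds(rows):
    """Sum births and deaths per (cmsa, year) and recompute net = birth - death."""
    totals = defaultdict(lambda: {"birth": 0, "death": 0})
    for row in rows:
        key = (row["cmsa"], row["year"])
        totals[key]["birth"] += row["birth"]
        totals[key]["death"] += row["death"]
    return {key: {**val, "net": val["birth"] - val["death"]} for key, val in totals.items()}


bds_by_cmsa = aggregate_bds(bds_rows)
```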

===Resources===
* Yael Hochberg and Fehder (2015), located in dropbox
** Use this paper as a guideline on how to conduct the analysis
*US Census Bureau data on employment by MSA: http://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_14_5YR_B23027&prodType=table
*USPTO utility patents by MSA: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/cls_cbsa/allcbsa_gd.htm
*MSA level trends: http://www.metrotrends.org/data.cf

===The Target Dataset===

We will need to process the following variables:
*SuperMSA - combine SanFran and SanJose, New York and Newark?, NC Research triangle, others?
*CSV mapping msas to cmsas is in the folder (and a table in the dbase)

Example dataset:
MSA Year SeedVCInv SeedEarlyVCInv LaterVCInv NoDeals FundsInvested DistinctInvestors ....
----------------------------------------------------------------------------------------------------------------------------
1234 2001 1000000 20000000 30000000 4 7 7

Note that the unit of observation is MSA-Year.

Variables to be computed at the MSA level:
*HubActive (binary)
*NoHubsActive (Count)
*HubSqFt
*Other Hub Vars (build list!!!)
*'''SeedVCInv''' (Seed/Start-up)
*'''EarlyVCInv''' (Early Stage)
*'''LaterStageVC''' (Later)
*'''OtherStageVC''' (Buyout/Acq, Other)
*'''NoDeals''' (done by local VCs?)
**'''NoDealsNear'''
**'''NoDealsFar'''
*NoPortCosFunded
*'''FundsInv''' (in an MSA)
**'''FundsInvFromNear''' (within MSA?)
**'''FundsInvFromFar''' (outside MSA?)
*DistinctInvestors (?)
**DistinctInvestorsNear (within MSA?)
**DistinctInvestorsFar (outside MSA?)
*PatentCount
*NoSTEMGrads
*FirmBirths (BDS data)
*UniRandDSpend
*PerCapitaIncome
*Employment

We need to:
*Check funds invested means dollars invested
*Categorize near and far! Is it within MSA vs. not, within adjacent MSAs, etc.?

There may be a second dataset that has Hub-Industry-Year (where industry is semiconductor/non-semiconductor?).

===Final Primary Data Set===

*Deal is a round syndicate (near/far deal is one investor that is near/far).

Table name: finaldataset
cmsa
year
totalamountinv--total amount invested
nearamountinv--amount invested from local funds
faramountinv-- amount invested from funds outside CMSA
earlyinv--amount invested in early stage companies
laterinv--amount invested in later stage companies
startupseedinv--amount invested in seed or startup stage companies
otherstageinv--amount invested in Acquisition/Buy-outs/Other stage companies
investingfund--distinct funds that are investing in that CMSA-year
investingfundnear--distinct funds from that CMSA that invested in that CMSA-year
investingfundfar--distinct funds from outside that CMSA that invested in that CMSA-year
deals--number of deals
neardeals--number of deals inside a CMSA
fardeals--number of deals from outside a CMSA --some of these deals might count in both categories, because of syndicate members being both inside and outside the CMSA
earlystagedeals--deals with earlystage companies
laterstagedeals--deals with later stage companies
startupseeddeals--deals with startup/seed companies
otherstagedeals--deals with companies in other stages
newportcosfunded--number of portfolio companies to receive their first investment in that year
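
The near/far deal counts can be illustrated with a short sketch. It assumes, following the definition above, that a deal is near if at least one syndicate member is in the company's CMSA and far if at least one is outside it, so a single deal can count in both categories. The CMSA codes are made up.

```python
def classify_deal(company_cmsa, fund_cmsas):
    """Return (is_near, is_far) for one round syndicate."""
    near = any(code == company_cmsa for code in fund_cmsas)
    far = any(code != company_cmsa for code in fund_cmsas)
    return near, far


# Hypothetical deals: (company CMSA, CMSAs of the investing funds).
deals = [(1, [1, 1]), (1, [1, 2]), (1, [3])]
flags = [classify_deal(company, funds) for company, funds in deals]

counts = {
    "deals": len(deals),
    "neardeals": sum(near for near, _ in flags),
    "fardeals": sum(far for _, far in flags),
}
```

In this example neardeals plus fardeals exceeds deals, because the mixed syndicate is counted once on each side.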

===Data by zip code===
*Population data, 2000-2016 - US Census Bureau (E:\McNair\Hubs\summer 2017)
https://www2.census.gov/programs-surveys/popest/datasets/
*Income data, 1998-2014 - The Internal Revenue Service (E:\McNair\Hubs\summer 2017)
https://www.irs.gov/uac/about-irs
*DCI index, to assess the economic well-being of communities
http://eig.org/dci/interactive-maps/u-s-zip-codes
*R&D Expenses, 1980-2016 - Wharton Research Data Services (E:\McNair\Hubs\summer 2017)
*Zipcode look-up table obtained from https://www.unitedstateszipcodes.org/zip-code-database/. It's available in (E:\McNair\Hubs\summer 2017).

== Data by MSA ==

We have principal cities of MSAs from the census:
https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html

We might be able to go City -> FIPS place code -> MSA?

Cities and their FIPS codes (which don't perfectly correspond) are available from https://www.census.gov/geo/reference/codes/place.html

The Census claims to provide city to MSA here: https://www.census.gov/geo/maps-data/data/ua_rel_download.html
However, there is only CBSA!

This might do it: https://www2.census.gov/geo/pdfs/maps-data/data/rel/explanation_ua_cbsa_rel_10.pdf
We can maybe track city to principal city to MSA
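
If the City -> FIPS place code -> MSA route works, the lookup could be as simple as the sketch below. Both lookup tables and all codes shown are placeholders, not real Census values, and a city that is not a principal city simply returns nothing.

```python
def city_to_msa(city, state, place_codes, principal_city_msa):
    """Chain two lookups: (city, state) -> place code -> MSA code. Returns None if unmatched."""
    place = place_codes.get((city, state))
    if place is None:
        return None
    return principal_city_msa.get(place)


# Placeholder lookup tables; the codes are invented for illustration.
place_codes = {("Houston", "TX"): "P100", ("Smallville", "TX"): "P200"}
principal_city_msa = {"P100": "M001"}

houston_msa = city_to_msa("Houston", "TX", place_codes, principal_city_msa)
```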

==COMPUSTAT Data==

Data is in:
E:\McNair\Projects\Hubs\Summer 2017
Z:\Hubs\2017

Database is '''cities'''

SQL script is: COMPUSTAT.sql

The source file is RandDExpenditures.txt. It contains:
*Date from 1980-2017 (July). All COMPUSTAT.
*427799 records
*Fields include:
**R&D Expenditure
**Address (inc. city, zip, state)

Output file is COMPUSTATSummary.txt. It contains:
*Variables: City, year, No.public firms, sum R&D, sum Sales, sum total assets
*1979-2016
*4440 cities

==NSF Data==
Data is in:
E:\McNair\Projects\Hubs\Summer 2017
Z:\Hubs\2017

Database is '''cities'''

SQL script is: nsf_2017.sql

The source files are: nsf2017.txt, copied from table '''nsf''', and nsf_institution copied from table '''nsf_grants_institution''' from the biotech db.

They contain:
*Award ID
*Award Institution
*Award Effective date
*Institution city
*Award Value
*Organization state code
From 1900 - 2017

Output file is nsfSummary.txt. It contains:
*Variables: City, State code, year, nsf_nogrants, nsf_valuegrant
*1900-2017

===Joined NSF table===
The joined nsf table with the VC table is found in db '''cities'''. The table is named '''merged_nsf'''.
All the values of nogrants and valuegrant with missing values for years 1990-2017 are set equal to 0.
The sql script is in
Z:\HUbs\2017\sql scripts

==NIH Data==
Data is in:
Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is '''cities'''
SQL script is: nih2017.sql
The source files are:
*nih_1986_2001.csv
*nih_2002_2012.txt
*nih_2013_2015
located in E:\McNair\Projects\Federal Grant Data\NIH

The script that cleans NIH data and generates the summary table is titled '''nihSummary'''. It is located here:

E:\McNair\Projects\Hubs\Summer 2017

This table includes
*year
*city
*state
*country
*nogrants (number of grants)
*valuegrant
*city_state (the city-state ID that we'll merge on)

*Date from 1986-2015

==Clinical Trials Data==
Data is in:
Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is '''cities'''
SQL script is: ctrials.sql
The source file is:

*medclinical.txt

located in Z:\Hubs\2017

*Date from 1999-2017

===Joined clinical trials table===

The file which contains the number of trials in each city and year is located in:
Z:\Hubs\2017

The file is in:
Z:\Hubs\2017\clean data
The name of the file is:
ctrialsSummary.txt

It contains:
*city
*year
*city_state_year
*noctrials - number of trials

The ctrials table is joined with the VC table.
The joined SQL script is: '''new_ctrials.sql''' and it is located in
Z:\Hubs\2017\sql scripts

The name of the joined table is '''new_merged_ctrials'''.

It contains:
*city
*state
*city_state_id
*city_state_year
*year
*noctrials
*seedamtm
*earlyamtm
*lateramtm
*selamtm
*numseeds
*numearly
*numlater
*numsel

All the values of noctrials with missing values for years 1999-2017 are set equal to 0.

==Population Data==
Data is in:
Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is '''cities'''

SQL script is: '''population.sql'''
The source files are:
*pop2000_2009.xlsx
*pop2010_2016.xlsx

They contain:
*State
*City name
*Year
*Population Estimates

Date from 2000-2016

===Joined population table===

Data is in:
Z:\Hubs\2017\clean data
The file names are
1_population.txt - contains data on population estimates from 2000-2009
2_population.txt - contains data on population estimates from 2010-2016

Database is '''cities'''
SQL script is: '''new_population.sql''', located in
Z:\Hubs\2017\sql scripts

The population table is joined on VC table. The table is called '''new_merged_population'''.

They contain:
*City
*State
*city_state_id to uniquely identify each city
*city_state_year to uniquely identify each city in each year
*Population estimates
*Year
*Code from the state code and Fips code
*State full name

==Income Data==

Raw data was obtained from Census data, American Communities Survey.

Raw Data is in:
E:\McNair\Projects\Hubs\Summer 2017\MSA Income_raw.zip

Date from 2005-2015

The master list with MSAs and principal cities is titled '''list2.xls'''. It is located at:
Z:\Hubs\2017

This master list includes:
*MSA code
*MSA name
*Principal City
*State
*Place code (city code)
*State Code

This master list was edited to associate each principal city with a single state. For example, New York, the principal city of the New York-New Jersey MSA, was originally associated with the state NY-NJ, so '''list''' was edited to put New York with NY.

Cleaned Income data files are in
Z:\Hubs\2017\merging_on_ID

They contain:
*MSA code
*MSA
*Year
*Total Household Income

The MSA-City-State look up file is titled '''msa_city_state_wcode.txt'''. It is located in
Z:\Hubs\2017\merging_on_ID

The SQL file that merges income data from ACS (by MSA - Year) with the MSA-City file is titled '''income.sql'''. It is located here:
Z:\Hubs\2017\sql scripts

The final income table is in db '''cities''' titled '''merged_income'''.

It includes:
*MSA
*City
*State
*Year
*Total Household Income

The table includes 8780 observations

===Joined income table===

Data is in:
Z:\Hubs\clean data
The file names are:
INC_05.txt - INC_15.txt

Database is '''cities'''
SQL script is: merged_income.sql

They contain:
*City
*State
*city_state_id to uniquely identify each city
*Income
*Year
*Code from the state code and Fips code

==Employment Data==

Data on employment was obtained from American Communities Survey, US Census Bureau.

Raw Data is in:
E:\McNair\Projects\Hubs\Summer 2017\Employment Data by MSA
Cleaned files are in
Z:\Hubs\2017\clean data

They contain:
*MSA code
*MSA
*Year
*Employment rate of individuals 16 years or older
*Unemployment rate of individuals 16 years or older

Date from 2005-2015

The SQL file that merges employment data from ACS (by MSA - Year) with the MSA-City file is titled '''Employment.sql'''.
The file is located in:
Z:\Hubs\2017

The final table is in db '''cities''' titled '''merged_employment'''.

It includes:
*MSA
*City
*Year
*Employment rate
*Unemployment rate

===Joined employment data===

Data is in:
Z:\Hubs\clean data

The file names are:
EMP_05.txt - EMP_15.txt

Database is '''cities'''
SQL script is: '''new_employment.sql''' and it is located in
Z:\Hubs\2017\sql scripts

The final table which is joined on VC is in db cities titled '''new_merged_employment'''.

They contain:
*City
*State
*Code from the state code and Fips code
*city_state_id to uniquely identify each city
*city_state_year to uniquely identify each city in each year
*Employment rates of individuals of 16 years or older
*Unemployment rates of individuals of 16 years or older
*Year

==Schooling Data==

Data on schooling was obtained from American Communities Survey, US Census Bureau.

Raw Data is in:
E:\McNair\Projects\Hubs\Summer 2017\School Enrollment Data by MSA
Cleaned files are in
Z:\Hubs\2017\clean data

They contain:
*MSA code
*MSA
*Year
*Total number of population 3 years and over enrolled in school
*Percent of population 3 years and over enrolled in public school
*Percent of population 3 years and over enrolled in private school

Date from 2005-2015

The SQL file that merges schooling data from ACS (by MSA - Year) with the MSA-City file is titled '''schooling.sql'''.
The file is located in:
Z:\Hubs\2017

The final table is in db '''cities''' titled '''merged_schooling'''.

It includes:
*MSA
*City
*Year
*Total
*Percent_public_schooling
*Percent_private_schooling

===Joined schooling data===

Data is in:
Z:\Hubs\clean data
The file names are:
SCH_05.txt - SCH_15.txt

Database is '''cities'''
SQL script which joins this table with VC table is: '''new_merged_schooling.sql'''
The final table is in db '''cities''' titled '''new_merged_schooling'''.

It contains:
*City
*State
*city_state_id to uniquely identify each city
*city_state_year to uniquely identify each city in each year
*Total number of school enrollment
*Percentage enrolled in public schools
*Percentage enrolled in private schools
*Year
*Code from the state code and Fips code

==VC Data==

Raw Data is in:
Z:\VentureCapitalData\SDCVCData
The file name is roundcitystateyear.txt

It contains:
*city
*state
*year
*seedamtm - seed, amount in millions
*earlyamtm - early, amount in millions
*lateramtm - late, amount in millions
*selamtm - seed early late, amount in millions
*numseeds - number of seeds
*numearly
*numlater
*numsel

Date from 1953-2017

The table is in db '''cities''' titled '''vc'''.

It includes:
*city
*state
*city_state_id
*city_state_year
*seedamtm
*earlyamtm
*lateramtm
*selamtm
*numseeds
*numearly
*numlater
*numsel
*year

Hubs

2017-07-12T21:15:12Z

KerdaV: /* Joined NSF table */

{{McNair Projects
|Has title=Hubs
|Has owner=Hira Farooqi,
|Has keywords=Data
|Has project status=Active
}}
The Hubs Research Project is a full-length academic paper analyzing the effectiveness of "hubs", a component of the entrepreneurship ecosystem, in the advancement and growth of entrepreneurial success in a metropolitan area.

This research will primarily focus on large and mid-sized Metropolitan Statistical Areas (MSAs), as that is where the great majority of Venture Capital funding is located.

===Primary Data Set===
The Hubs data set, from SDC Platinum, has been constructed in the server:
Data files are in 128.42.44.181/bulk/Hubs
All files are in 128.42.44.182/bulk/Projects/Ecosystem/Hubs
psql Hubs

The data set includes all United States Venture Capital transactions (moneytree) from 1990 through 2015.
Data has been aggregated at the portfolio company, fund, and round level. It will be analyzed at the combined MSA level. We will be looking at it in terms of the number of companies funded, the number of funds active, and the flow of investment in a given MSA.

The data set has now been uploaded to the database server, named Hubs.
There are 4 tables:
*Rounds: Rounddate, coname, state, roundno, stage1, etc.
*CombinedRounds: Coname, rounddate, discamount, fundname
*Companies: LastInv, FirstInv, coname, MSA, MSACode, Address, state, datefounded, totalknownfunding, industry(major)
*Funds: fundname, closingdate, lastinv, firstinv, msa, msacode, avinv, nocoinv, totalknowninv, address

Used variables:

Companies: Coname, MSACode, Industry, state
MSALookupTable: MSACode, MSASuper
IndustryLookupTable: IndustryMajor, InduCode
->
CompanyInfo: Coname, MSASuper, InduCode, state (complete)

Funds: fundname, msacode, state
MSALookupTable: MSACode, MSASuper
->
FundInfo: fundname, msacode, state (complete)

Rounds: coname, rounddate, stagecode, roundno
CombinedRounds: coname, rounddate, discamount, fundname
->
RoundInfoSuper: coname, rounddate, '''nofunds''', discamount
->
RoundInfo: Coname, roundyear, fundname, estamount (complete)

Then take:
RoundInfo: Coname, roundyear, fundname, estamount
CompanyInfo: Coname, MSASuper, InduCode, state
FundInfo: fundname, msacode, state
->
SuperRoundInfo: Coname, CoMSASuper, CoInduCode, CoState, FundName, FundMSASuper, FundState, RoundYear, RoundEstAmount
->
MSAPortCos: Count(CoName) As NoPortCosFunded, CoMSASuper, RoundYear
...

'''Notes on Creation of Primary Data Set'''

Raw tables
* companies (last investment, first investment, company name, MSA, MSA code, address, state, date founded, known funding, industry)
* funds (fund closing date, last investment, first investment, fund name, address, MSA, MSA code, Average investment, number companies invested (NoCos), known investment)
* rounds (round date, company name, state, round number, stage 1, stage 2, stage 3)
* combined rounds (company name, round date, disclosed amount, investor)
* msalist (changes MSAs to CMSAs— combined MSAs)
*industry list (changes 6 industry categories to 4— ICT, Life Sciences, Semiconductors, Other)

Process
*cleaned tables to eliminate duplications, undisclosed variables
*changed all original characters to include CMSA and Industry Codes (companyinfo3, fundinfocleanfinal, roundinfoclean)
*matched funds to avoid any issues with names (i.e. Fund ABC L.P./Fund ABC LP/Fund ABC)
*matched roundinfoclean investors to fundinfocleanfinal investors (roundinfo.txt >> cleanfundfinal.txt)
*join by round and company conames
*bridge years (1990-2016), stage, and cmsa
* populate data with count of companies (Deal flow) and estimated amount ($)
** data set in 181 hubs folder under summarycmsa.txt (38394)
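
The fund-name matching step in the process above (Fund ABC L.P./Fund ABC LP/Fund ABC) could be approximated by normalizing names before joining. The sketch below is an assumption about how such matching might work, not the procedure actually used to build the matching tables; the suffix list and function name are hypothetical.

```python
import re

# Hypothetical list of legal suffixes to ignore when matching names.
_SUFFIXES = {"lp", "llc", "ltd", "inc"}

def normalize_fund_name(name):
    """Lowercase, strip punctuation, and drop trailing legal suffixes."""
    cleaned = re.sub(r"[^\w\s]", "", name.lower())  # "l.p." becomes "lp"
    tokens = cleaned.split()
    while tokens and tokens[-1] in _SUFFIXES:
        tokens.pop()
    return " ".join(tokens)
```

All three spellings in the example map to the same key, so the round and fund tables can be joined on the normalized name.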

Key decisions:
*Dropped undisclosed companies throughout, as they have no address
*Count is done by joining round and company
*Anything fund related must be disclosed fund
*Near and far, and total invested, and fund counts, etc., are all done using disclosed funds that match only

'''Glossary of Tables'''
cleanco — used to remove duplicates from companies
cleanedcompanies — clean set of companies with no duplicates
cmsafunds-
cmsas— list of all CMSAs in final data set (for merging)
cmsastats- statistics not including empty years (pre-merge)
cmsastats2 - statistics separated by year-MSA
cmsastats3— statistics separated by year-MSA-stage
cmsastats4
cmsayears— empty merged table between year and cmsa
cmsayearstage — empty merged table between cmsa/years and stage
combinedrounds— raw sdc data for combined rounds
combinedroundswamt— used to join rounds and combined rounds for roundinfo2
companies- raw SDC company data
companyinfo — cleaned companies joined with state and CMSA information
companyinfo2— companyinfo1 with original industry categories
companyinfo3— companyinfo2 with updated industry categories and codes
companyinfo4-- clean version of companyinfo3
companyround- combined company information with round information
companyround2- combined company information with round information, cleaned up from companyround
companyround3- combined company information with round information, cleaned up from companyround2
'''finaldataset'''- final statistics by CMSA-year, see section Final Primary Data Set for more information
fundinfo— funds joined with CMSA info
fundinfo2 - clean version of fundinfo1
fundinfoclean - used in process to clean fundinfo2
fundinfoclean2- used in process to clean fundinfo2
fundinfocleanfinal- used in process to clean fundinfo2
fundinfocleannodups- final clean set of fundinfo
funds - raw SDC fund data
Houston - analysis for Houston ecosystem team
Houston2- analysis for Houston ecosystem team
houston3- analysis for Houston ecosystem team
industry — new industry codes (4)— used for all future data sets
industrylist— lookup table for new industry codes (went from 6 to 4)
joined1- used for matching process
joined2- used for matching process
matchfund2- used for matching process
matchfunds- used for matching process
matchroundfund - used for matching process
matchroundfund2- used for matching process
msalist — lookup table for MSA to CMSA (used for all future data sets)
nearfar1-- beginning set before adding nearfar/stage variables
nearfar2 -- added binomial variables for near/far and for each of the stages, used to build final dataset
roundfund— not used— joined round to fund; drop/ignore
roundinfo— round info cleaned up to include number of investors in a syndicate and estimated investment per member of syndicate
roundinfo2— roundinfo1 including name of investors/funds
roundinfo3— clean version of roundinfo2
roundinfoclean — final clean version of roundinfo3 (final roundinfo table)
rounds — raw SDC round data
stages — table for merging stage-year-CMSA
superinfo — ignore/drop
temp - used for matching process
years — table for merging stage-year-CMSA
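
The empty merged tables in the glossary (cmsayears, cmsayearstage) amount to a cross join of CMSAs, years, and stages that is later populated with statistics. A minimal sketch of that idea, assuming the 1990-2016 year range from the process notes; the function and field names are hypothetical:

```python
from itertools import product

def build_bridge(cmsas, stages, first_year=1990, last_year=2016):
    """One zeroed row for every CMSA-year-stage combination."""
    years = range(first_year, last_year + 1)
    return [
        {"cmsa": c, "year": y, "stage": s, "deals": 0, "estamount": 0.0}
        for c, y, s in product(cmsas, years, stages)
    ]

def populate(bridge, stats):
    """Fill bridge rows from stats: {(cmsa, year, stage): (deals, estamount)}."""
    for row in bridge:
        key = (row["cmsa"], row["year"], row["stage"])
        if key in stats:
            row["deals"], row["estamount"] = stats[key]
    return bridge
```

Building the empty grid first means that CMSA-years with no deals still appear in the final data set, with zeros rather than missing rows.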

===Hub Candidates Data Set===

The Hubs candidate data set is a list of potential hubs found in MSAs throughout the country. Researchers are currently pulling qualitative and quantitative information from the candidates' websites in an attempt to categorize what can be identified as a hub. This is a difficult data set to build: there is little to no quantitative information available for this category of institution, and the data depends on what information is publicly accessible on the internet.

Characteristics/Variables
*Year Founded
*Square footage
*LinkedIn self-identifiers (how the organization classifies itself on its LinkedIn profile)
*Activeness on Twitter (binomial)
*Member Directory available online (binomial)
*Number of conference rooms
*Price ($/month) for Flex desk
*Offers Reserved desk (binomial)
*Offers office space for rent (binomial)
*Offers community membership-- not for coworking but for community events, etc. (binomial)
*Number of events offered per month (estimate)
*Offers code academy
*Mission Statement/Vision (for qualitative or key-word analysis)

These characteristics/variables will be used to determine whether a candidate is or is not likely to be a Hub.

As of March 10th 2016, the list contains 125 Hub candidates.

'''Where to find''': The Hubs data set can be found in the Ecosystem>>Hubs>>dataset folder. It is not currently in the database due to a UTF-8 issue.

===Supplementary Data Sets===
'''Patent data''': to be pulled from USPTO or SDC Platinum.

'''Number of STEM Graduate Students''' (NSF) and '''University R&D Spending''' (NSF):
*University R&D Data found under file "NSF DATA_2004 to 2011.xlsx" in datasets folder (Ecosystem>>Hubs>>Datasets)
*R&D spending found at the university level for 2014 ("Stem Grad Students.xlsx") or at state level ("Science and Engineering Grad Students by State and Year 2000-2011.csv")
** not uploaded to server or matched yet to CMSA code, because of this discrepancy.
**"Stem Grad Students.xlsx" contains categorized university by MSA, can be used for all university-based projects

'''Per Capita Income''' and '''Employment Data''' (US Census Bureau):
*"Per Capita Personal Income by MSA 2000-2012.xlsx" in datasets folder (Ecosystem>>Hubs>>Datasets>>Data from Yael)
*"Wages and Salaries by MSA 2000-2012.xlsx" in datasets folder (Ecosystem>>Hubs>>datasets>>Data from Yael)
**not uploaded to server or matched yet to CMSA code

'''Firm Births''' (BDS)
*in server 181, under table name "BDS"
*includes birth, death, net (birth - death), and rate (death rate) for years 1990-2013 for every MSA
*includes code for CMSA but is not aggregated by CMSA
** i.e. BDS statistics are still separate for all the smaller MSAs in New York's CMSA (code=1)
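
Since the BDS table carries the CMSA code but is not aggregated, rolling it up to CMSA-year could look like the sketch below. The field names are hypothetical. Only the counts are summed; the death rate is not carried over, because rates cannot be added across MSAs and the denominator is not described here.

```python
from collections import defaultdict

def aggregate_bds_by_cmsa(rows):
    """Sum MSA-level births and deaths to CMSA-year and recompute net.

    rows: dicts with hypothetical keys cmsa, year, birth, death.
    """
    totals = defaultdict(lambda: {"birth": 0, "death": 0})
    for row in rows:
        cell = totals[(row["cmsa"], row["year"])]
        cell["birth"] += row["birth"]
        cell["death"] += row["death"]
    return {
        key: {
            "birth": cell["birth"],
            "death": cell["death"],
            "net": cell["birth"] - cell["death"],
        }
        for key, cell in totals.items()
    }
```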

===Resources===
* Yael Hochberg and Fehder (2015), located in dropbox
** Use this paper as a guideline on how to conduct the analysis
*US Census Bureau data on employment by MSA: http://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_14_5YR_B23027&prodType=table
*USPTO utility patents by MSA: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/cls_cbsa/allcbsa_gd.htm
*MSA level trends: http://www.metrotrends.org/data.cf

===The Target Dataset===

We will need to process the following variables:
*SuperMSA - combine SanFran and SanJose, New York and Newark?, NC Research triangle, others?
*CSV mapping msas to cmsas is in the folder (and a table in the dbase)

Example dataset:
{| class="wikitable"
! MSA !! Year !! SeedVCInv !! SeedEarlyVCInv !! LaterVCInv !! NoDeals !! FundsInvested !! DistinctInvestors !! ...
|-
| 1234 || 2001 || 1000000 || 20000000 || 30000000 || 4 || 7 || 7 ||
|}

Note that the unit of observation is MSA-Year.

Variables to be computed at the MSA level:
*HubActive (binary)
*NoHubsActive (Count)
*HubSqFt
*Other Hub Vars (build list!!!)
*'''SeedVCInv''' (Seed/Start-up)
*'''EarlyVCInv''' (Early Stage)
*'''LaterStageVC''' (Later)
*'''OtherStageVC''' (Buyout/Acq, Other)
*'''NoDeals''' (done by local VCs?)
**'''NoDealsNear'''
**'''NoDealsFar'''
*NoPortCosFunded
*'''FundsInv''' (in an MSA)
**'''FundsInvFromNear''' (within MSA?)
**'''FundsInvFromFar''' (outside MSA?)
*DistinctInvestors (?)
**DistinctInvestorsNear (within MSA?)
**DistinctInvestorsFar (outside MSA?)
*PatentCount
*NoSTEMGrads
*FirmBirths (BDS data)
*UniRandDSpend
*PerCapitaIncome
*Employment

We need to:
*Check funds invested means dollars invested
*Categorize near and far! Is it within MSA vs. not, within adjacent MSAs, etc.?

There may be a second dataset that has Hub-Industry-Year (where industry is semiconductor/non-semiconductor?).

===Final Primary Data Set===

*A deal is a round syndicate. A near (far) deal is one with at least one investor that is near (far).

Table name: finaldataset
cmsa
year
totalamountinv--total amount invested
nearamountinv--amount invested from local funds
faramountinv-- amount invested from funds outside CMSA
earlyinv--amount invested in early stage companies
laterinv--amount invested in later stage companies
startupseedinv--amount invested in seed or startup stage companies
otherstageinv--amount invested in Acquisition/Buy-outs/Other stage companies
investingfund--distinct funds that are investing in that CMSA-year
investingfundnear--distinct funds from that CMSA that invested in that CMSA-year
investingfundfar--distinct funds from outside that CMSA that invested in that CMSA-year
deals--number of deals
neardeals--number of deals inside a CMSA
fardeals--number of deals from outside a CMSA --some of these deals might count in both categories, because of syndicate members being both inside and outside the CMSA
earlystagedeals--deals with earlystage companies
laterstagedeals--deals with later stage companies
startupseeddeals--deals with startup/seed companies
otherstagedeals--deals with companies in other stages
newportcosfunded--number of portfolio companies to receive their first investment in that year
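
The near/far deal counts can be illustrated with a short sketch. It assumes a deal is near if at least one syndicate member is located in the company's CMSA and far if at least one is located outside it, which is why a single deal can count in both categories. The function names and input layout are hypothetical.

```python
def classify_deal(company_cmsa, fund_cmsas):
    """Return (is_near, is_far) for one round syndicate."""
    is_near = any(cmsa == company_cmsa for cmsa in fund_cmsas)
    is_far = any(cmsa != company_cmsa for cmsa in fund_cmsas)
    return is_near, is_far

def count_deals(deals):
    """deals: iterable of (company_cmsa, [fund_cmsa, ...]) for one CMSA-year."""
    counts = {"deals": 0, "neardeals": 0, "fardeals": 0}
    for company_cmsa, fund_cmsas in deals:
        is_near, is_far = classify_deal(company_cmsa, fund_cmsas)
        counts["deals"] += 1
        counts["neardeals"] += int(is_near)
        counts["fardeals"] += int(is_far)
    return counts
```

With a mixed syndicate, neardeals + fardeals can exceed deals, matching the note on fardeals above.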

===Data by zip code===
*Population data, 2000-2016 - US Census Bureau (E:\McNair\Hubs\summer 2017)
https://www2.census.gov/programs-surveys/popest/datasets/
*Income data, 1998-2014 - The Internal Revenue Service (E:\McNair\Hubs\summer 2017)
https://www.irs.gov/uac/about-irs
*DCI index, to assess the economic well-being of communities
http://eig.org/dci/interactive-maps/u-s-zip-codes
*R&D Expenses, 1980-2016 - Wharton Research Data Services (E:\McNair\Hubs\summer 2017)
*Zipcode look-up table obtained from https://www.unitedstateszipcodes.org/zip-code-database/. It's available in (E:\McNair\Hubs\summer 2017).

== Data by MSA ==

We have principal cities of MSAs from the census:
https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html

We might be able to go City -> FIPS place code -> MSA?

Cities and their FIPS codes (which don't perfectly correspond) are available from https://www.census.gov/geo/reference/codes/place.html

The Census claims to provide city to MSA here: https://www.census.gov/geo/maps-data/data/ua_rel_download.html
However, there is only CBSA!

This might do it: https://www2.census.gov/geo/pdfs/maps-data/data/rel/explanation_ua_cbsa_rel_10.pdf
We can maybe track city to principal city to MSA

==COMPUSTAT Data==

Data is in:
E:\McNair\Projects\Hubs\Summer 2017
Z:\Hubs\2017

Database is '''cities'''

SQL script is: COMPUSTAT.sql

The source file is RandDExpenditures.txt. It contains:
*Data from 1980-2017 (July). All COMPUSTAT.
*427799 records
*Fields include:
**R&D Expenditure
**Address (inc. city, zip, state)

Output file is COMPUSTATSummary.txt. It contains:
*Variables: City, year, No.public firms, sum R&D, sum Sales, sum total assets
*1979-2016
*4440 cities
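
The step from the firm-level source file to the city-year summary can be sketched as a simple group-and-sum. This is not the content of COMPUSTAT.sql; the field names (firm_id, rd, sales, assets) are hypothetical, and treating missing values as zero is an assumption.

```python
from collections import defaultdict

def summarize_by_city_year(records):
    """Aggregate firm-level records to one row per (city, year)."""
    cells = defaultdict(
        lambda: {"firms": set(), "rd": 0.0, "sales": 0.0, "assets": 0.0}
    )
    for rec in records:
        cell = cells[(rec["city"], rec["year"])]
        cell["firms"].add(rec["firm_id"])
        # Assumption: a missing (None) value contributes zero to the sum.
        cell["rd"] += rec.get("rd") or 0.0
        cell["sales"] += rec.get("sales") or 0.0
        cell["assets"] += rec.get("assets") or 0.0
    return {
        key: {
            "nofirms": len(cell["firms"]),
            "sum_rd": cell["rd"],
            "sum_sales": cell["sales"],
            "sum_assets": cell["assets"],
        }
        for key, cell in cells.items()
    }
```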

==NSF Data==
Data is in:
E:\McNair\Projects\Hubs\Summer 2017
Z:\Hubs\2017

Database is '''cities'''

SQL script is: nsf_2017.sql

The source files are: nsf2017.txt, copied from table '''nsf''', and nsf_institution copied from table '''nsf_grants_institution''' from the biotech db.

They contain:
*Award ID
*Award Institution
*Award Effective date
*Institution city
*Award Value
*Organization state code
From 1900 - 2017

Output file is nsfSummary.txt. It contains:
*Variables: City, State code, year, nsf_nogrants, nsf_valuegrant
*1900-2017

===Joined NSF table===
The NSF table joined with the VC table is in db '''cities'''. The table is named '''merged_nsf'''.
Missing values of nogrants and valuegrant for the years 1990-2017 are set to 0.
The SQL script is in
Z:\Hubs\2017\sql scripts

==NIH Data==
Data is in:
Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is '''cities'''
SQL script is: nih2017.sql
The source files are:
*nih_1986_2001.csv
*nih_2002_2012.txt
*nih_2013_2015
located in E:\McNair\Projects\Federal Grant Data\NIH

The script that cleans NIH data and generates the summary table is titled '''nihSummary'''. It is located here:

E:\McNair\Projects\Hubs\Summer 2017

This table includes
*year
*city
*state
*country
*nogrants (number of grants)
*valuegrant
*city_state (the city-state ID that we'll merge on)

*Data from 1986-2015

==Clinical Trials Data==
Data is in:
Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is '''cities'''
SQL script is: ctrials.sql
The source file is:

*medclinical.txt

located in Z:\Hubs\2017

*Data from 1999-2017

===Joined clinical trials table===

The file which contains the number of trials in each city and year is located in:
Z:\Hubs\2017

The file is in:
Z:\Hubs\2017\clean data
The name of the file is:
ctrialsSummary.txt

It contains:
*city
*year
*city_state_year
*noctrials - number of trials

The ctrials table is joined with the VC table.
The joined SQL script is: '''new_ctrials.sql''' and it is located in
Z:\Hubs\2017\sql scripts

The name of the joined table is '''new_merged_ctrials'''.

It contains:
*city
*state
*city_state_id
*city_state_year
*year
*noctrials
*seedamtm
*earlyamtm
*lateramtm
*selamtm
*numseeds
*numearly
*numlater
*numsel

Missing values of noctrials for the years 1999-2017 are set to 0.
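
The zero-filling rule can be sketched as a left join of the VC rows onto the trial counts. This is an illustration only, not the content of new_ctrials.sql; it assumes that the join key is city_state_year and that rows outside 1999-2017 keep a missing value rather than 0.

```python
def merge_with_zero_fill(vc_rows, trial_counts, first_year=1999, last_year=2017):
    """Left-join VC rows to trial counts keyed by city_state_year.

    Unmatched rows get noctrials = 0 inside the covered years and
    None (missing) outside them.
    """
    merged = []
    for row in vc_rows:
        noctrials = trial_counts.get(row["city_state_year"])
        if noctrials is None and first_year <= row["year"] <= last_year:
            noctrials = 0
        merged.append({**row, "noctrials": noctrials})
    return merged
```

Restricting the fill to the covered years keeps a true zero (no trials recorded) distinct from years the source does not cover at all.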

==Population Data==
Data is in:
Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is '''cities'''

SQL script is: '''population.sql'''
The source files are:
*pop2000_2009.xlsx
*pop2010_2016.xlsx

They contain:
*State
*City name
*Year
*Population Estimates

Data from 2000-2016

===Joined population table===

Data is in:
Z:\Hubs\2017\clean data
The file names are
1_population.txt - contains data on population estimates from 2000-2009
2_population.txt - contains data on population estimates from 2010-2016

Database is '''cities'''
SQL script is: '''new_population.sql''', located in
Z:\Hubs\2017\sql scripts

The population table is joined with the VC table. The joined table is called '''new_merged_population'''.

It contains:
*City
*State
*city_state_id to uniquely identify each city
*city_state_year to uniquely identify each city in each year
*Population estimates
*Year
*Code from the state code and Fips code
*State full name
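
The city_state_id and city_state_year keys are what the joined tables are merged on. The exact format used in the '''cities''' database is not documented here, so the sketch below is a hypothetical construction (upper-cased city and state joined with underscores) meant only to show why normalizing before building the key matters for the join.

```python
def make_city_state_id(city, state):
    """Hypothetical key format: CITY_STATE, upper-cased, whitespace collapsed."""
    city_clean = " ".join(city.split()).upper()
    return f"{city_clean}_{state.strip().upper()}"

def make_city_state_year(city, state, year):
    """Hypothetical key format: CITY_STATE_YEAR."""
    return f"{make_city_state_id(city, state)}_{int(year)}"
```

Without the normalization, "Houston" and " houston " would produce different keys and silently fail to match across source files.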

==Income Data==

Raw data was obtained from Census data, American Communities Survey.

Raw Data is in:
E:\McNair\Projects\Hubs\Summer 2017\MSA Income_raw.zip

Data from 2005-2015

The master list with MSAs and principal cities is titled '''list2.xls'''. It is located at:
Z:\Hubs\2017

This master list includes:
*MSA code
*MSA name
*Principal City
*State
*Place code (city code)
*State Code

This master list was edited to associate each principal city with a single state. For example, New York is the principal city of the New York-New Jersey MSA and was originally associated with state NY-NJ, so '''list''' was edited to put New York with NY.

Cleaned Income data files are in
Z:\Hubs\2017\merging_on_ID

They contain:
*MSA code
*MSA
*Year
*Total Household Income

The MSA-City-State look up file is titled '''msa_city_state_wcode.txt'''. It is located in
Z:\Hubs\2017\merging_on_ID

The SQL file that merges income data from ACS (by MSA - Year) with the MSA-City file is titled '''income.sql'''. It is located here:
Z:\Hubs\2017\sql scripts

The final income table is in db '''cities''' titled '''merged_income'''.

It includes:
*MSA
*City
*State
*Year
*Total Household Income

The table includes 8780 observations

===Joined income data===

Data is in:
Z:\Hubs\clean data
The file names are:
INC_05.txt - INC_15.txt

Database is '''cities'''
SQL script is: merged_income.sql

They contain:
*City
*State
*city_state_id to uniquely identify each city
*Income
*Year
*Code from the state code and Fips code

==Employment Data==

Data on employment was obtained from American Communities Survey, US Census Bureau.

Raw Data is in:
E:\McNair\Projects\Hubs\Summer 2017\Employment Data by MSA
Cleaned files are in
Z:\Hubs\2017\clean data

They contain:
*MSA code
*MSA
*Year
*Employment rate of individuals 16 years or older
*Unemployment rate of individuals 16 years or older

Data from 2005-2015

The SQL file that merges employment data from ACS (by MSA - Year) with the MSA-City file is titled '''Employment.sql'''.
The file is located in:
Z:\Hubs\2017

The final table is in db '''cities''' titled '''merged_employment'''.

It includes:
*MSA
*City
*Year
*Employment rate
*Unemployment rate

===Joined employment data===

Data is in:
Z:\Hubs\clean data

The file names are:
EMP_05.txt - EMP_15.txt

Database is '''cities'''
SQL script is: '''new_employment.sql''' and it is located in
Z:\Hubs\2017\sql scripts

The final table which is joined on VC is in db cities titled '''new_merged_employment'''.

It contains:
*City
*State
*Code from the state code and Fips code
*city_state_id to uniquely identify each city
*city_state_year to uniquely identify each city in each year
*Employment rates of individuals of 16 years or older
*Unemployment rates of individuals of 16 years or older
*Year

==Schooling Data==

Data on schooling was obtained from American Communities Survey, US Census Bureau.

Raw Data is in:
E:\McNair\Projects\Hubs\Summer 2017\School Enrollment Data by MSA
Cleaned files are in
Z:\Hubs\2017\clean data

They contain:
*MSA code
*MSA
*Year
*Total number of population 3 years and over enrolled in school
*Percent of population 3 years and over enrolled in public school
*Percent of population 3 years and over enrolled in private school

Data from 2005-2015

The SQL file that merges schooling data from ACS (by MSA - Year) with the MSA-City file is titled '''schooling.sql'''.
The file is located in:
Z:\Hubs\2017

The final table is in db '''cities''' titled '''merged_schooling'''.

It includes:
*MSA
*City
*Year
*Total
*Percent_public_schooling
*Percent_private_schooling

===Joined schooling data===

Data is in:
Z:\Hubs\clean data
The file names are:
SCH_05.txt - SCH_15.txt

Database is '''cities'''
SQL script which joins this table with VC table is: '''new_merged_schooling.sql'''
The final table is in db '''cities''' titled '''new_merged_schooling'''.

It contains:
*City
*State
*city_state_id to uniquely identify each city
*city_state_year to uniquely identify each city in each year
*Total number of school enrollment
*Percentage enrolled in public schools
*Percentage enrolled in private schools
*Year
*Code from the state code and Fips code

==VC Data==

Raw Data is in:
Z:\VentureCapitalData\SDCVCData
The file name is roundcitystateyear.txt

It contains:
*city
*state
*year
*seedamtm - seed, amount in millions
*earlyamtm - early, amount in millions
*lateramtm - late, amount in millions
*selamtm - seed early late, amount in millions
*numseeds - number of seeds
*numearly
*numlater
*numsel

Data from 1953-2017

The table is in db '''cities''' titled '''vc'''.

It includes:
*city
*state
*city_state_id
*city_state_year
*seedamtm
*earlyamtm
*lateramtm
*selamtm
*numseeds
*numearly
*numlater
*numsel
*year

Hubs

2017-07-12T21:14:27Z

KerdaV: /* Joined population data */

{{McNair Projects
|Has title=Hubs
|Has owner=Hira Farooqi,
|Has keywords=Data
|Has project status=Active
}}
The Hubs Research Project is a full-length academic paper analyzing the effectiveness of "hubs", a component of the entrepreneurship ecosystem, in the advancement and growth of entrepreneurial success in a metropolitan area.

This research will primarily focus on large and mid-sized Metropolitan Statistical Areas (MSAs), as that is where the great majority of Venture Capital funding is located.

===Primary Data Set===
The Hubs data set, from SDC Platinum, has been constructed in the server:
Data files are in 128.42.44.181/bulk/Hubs
All files are in 128.42.44.182/bulk/Projects/Ecosystem/Hubs
psql Hubs

The data set includes all United States Venture Capital transactions (moneytree) from the twenty-five year period of 1990 through 2015.
Data has been aggregated at the portfolio company, fund, and round level. It will be analyzed at the combined MSA level. We will look at the number of companies funded, the number of funds active, and the flow of investment in a given MSA.

The data set has now been uploaded to the database server, named Hubs.
There are 4 tables:
*Rounds: Rounddate, coname, state, roundno, stage1, etc.
*CombinedRounds: Coname, rounddate, discamount, fundname
*Companies: LastInv, FirstInv, coname, MSA, MSACode, Address, state, datefounded, totalknownfunding, industry(major)
*Funds: fundname, closingdate, lastinv, firstinv, msa, msacode, avinv, nocoinv, totalknowninv, address

Used variables:

Companies: Coname, MSACode, Industry, state
MSALookupTable: MSACode, MSASuper
IndustryLookupTable: IndustryMajor, InduCode
->
CompanyInfo: Coname, MSASuper, InduCode, state (complete)

Funds: fundname, msacode, state
MSALookupTable: MSACode, MSASuper
->
FundInfo: fundname, msacode, state (complete)

Rounds: coname, rounddate, stagecode, roundno
CombinedRounds: coname, rounddate, discamount, fundname
->
RoundInfoSuper: coname, rounddate, '''nofunds''', discamount
->
RoundInfo: Coname, roundyear, fundname, estamount (complete)

Then take:
RoundInfo: Coname, roundyear, fundname, estamount
CompanyInfo: Coname, MSASuper, InduCode, state
FundInfo: fundname, msacode, state
->
SuperRoundInfo: Coname, CoMSASuper, CoInduCode, CoState, FundName, FundMSASuper, FundState, RoundYear, RoundEstAmount
->
MSAPortCos: Count(CoName) As NoPortCosFunded, CoMSASuper, RoundYear
...

'''Notes on Creation of Primary Data Set'''

Raw tables
* companies (last investment, first investment, company name, MSA, MSA code, address, state, date founded, known funding, industry)
* funds (fund closing date, last investment, first investment, fund name, address, MSA, MSA code, Average investment, number companies invested (NoCos), known investment)
* rounds (round date, company name, state, round number, stage 1, stage 2, stage 3)
* combined rounds (company name, round date, disclosed amount, investor)
* msalist (changes MSAs to CMSAs— combined MSAs)
*industry list (changes 6 industry categories to 4— ICT, Life Sciences, Semiconductors, Other)

Process
*cleaned tables to eliminate duplications, undisclosed variables
*changed all original characters to include CMSA and Industry Codes (companyinfo3, fundinfocleanfinal, roundinfoclean)
*matched funds to avoid any issues with names (i.e. Fund ABC L.P./Fund ABC LP/Fund ABC)
*matched roundinfoclean investors to fundinfocleanfinal investors (roundinfo.txt >> cleanfundfinal.txt)
*join by round and company conames
*bridge years (1990-2016), stage, and cmsa
* populate data with count of companies (Deal flow) and estimated amount ($)
** data set in 181 hubs folder under summarycmsa.txt (38394)

Key decisions:
*Dropped undisclosed companies throughout, as they have no address
*Count is done by joining round and company
*Anything fund related must be disclosed fund
*Near and far, and total invested, and fund counts, etc., are all done using disclosed funds that match only

'''Glossary of Tables'''
cleanco — used to remove duplicates from companies
cleanedcompanies — clean set of companies with no duplicates
cmsafunds-
cmsas— list of all CMSAs in final data set (for merging)
cmsastats- statistics not including empty years (pre-merge)
cmsastats2 - statistics separated by year-MSA
cmsastats3— statistics separated by year-MSA-stage
cmsastats4
cmsayears— empty merged table between year and cmsa
cmsayearstage — empty merged table between cmsa/years and stage
combinedrounds— raw sdc data for combined rounds
combinedroundswamt— used to join rounds and combined rounds for roundinfo2
companies- raw SDC company data
companyinfo — cleaned companies joined with state and CMSA information
companyinfo2— companyinfo1 with original industry categories
companyinfo3— companyinfo2 with updated industry categories and codes
companyinfo4-- clean version of companyinfo3
companyround- combined company information with round information
companyround2- combined company information with round information, cleaned up from companyround
companyround3- combined company information with round information, cleaned up from companyround2
'''finaldataset'''- final statistics by CMSA-year, see section Final Primary Data Set for more information
fundinfo— funds joined with CMSA info
fundinfo2 - clean version of fundinfo1
fundinfoclean - used in process to clean fundinfo2
fundinfoclean2- used in process to clean fundinfo2
fundinfocleanfinal- used in process to clean fundinfo2
fundinfocleannodups- final clean set of fundinfo
funds - raw SDC fund data
Houston - analysis for Houston ecosystem team
Houston2- analysis for Houston ecosystem team
houston3- analysis for Houston ecosystem team
industry — new industry codes (4)— used for all future data sets
industrylist— lookup table for new industry codes (went from 6 to 4)
joined1- used for matching process
joined2- used for matching process
matchfund2- used for matching process
matchfunds- used for matching process
matchroundfund - used for matching process
matchroundfund2- used for matching process
msalist — lookup table for MSA to CMSA (used for all future data sets)
nearfar1-- beginning set before adding nearfar/stage variables
nearfar2 -- added binomial variables for near/far and for each of the stages, used to build final dataset
roundfund— not used— joined round to fund; drop/ignore
roundinfo— round info cleaned up to include number of investors in a syndicate and estimated investment per member of syndicate
roundinfo2— roundinfo1 including name of investors/funds
roundinfo3— clean version of roundinfo2
roundinfoclean — final clean version of roundinfo3 (final roundinfo table)
rounds — raw SDC round data
stages — table for merging stage-year-CMSA
superinfo — ignore/drop
temp - used for matching process
years — table for merging stage-year-CMSA

===Hub Candidates Data Set===

The Hubs candidate data set is a list of potential hubs found in MSAs throughout the country. Researchers are currently pulling qualitative and quantitative information from the candidates' websites in an attempt to categorize what can be identified as a hub. This is a difficult data set to build: there is little to no quantitative information available for this category of institution, and the data depends on what information is publicly accessible on the internet.

Characteristics/Variables
*Year Founded
*Square footage
*LinkedIn self-identifiers (how the organization classifies itself on its LinkedIn profile)
*Activeness on Twitter (binomial)
*Member Directory available online (binomial)
*Number of conference rooms
*Price ($/month) for Flex desk
*Offers Reserved desk (binomial)
*Offers office space for rent (binomial)
*Offers community membership-- not for coworking but for community events, etc. (binomial)
*Number of events offered per month (estimate)
*Offers code academy
*Mission Statement/Vision (for qualitative or key-word analysis)

These characteristics/variables will be used to determine whether a candidate is or is not likely to be a Hub.

As of March 10th 2016, the list contains 125 Hub candidates.

'''Where to find''': The Hubs data set can be found in the Ecosystem>>Hubs>>dataset folder. It is not currently in the database due to a UTF-8 issue.

===Supplementary Data Sets===
'''Patent data''': to be pulled from USPTO or SDC Platinum.

'''Number of STEM Graduate Students''' (NSF) and '''University R&D Spending''' (NSF):
*University R&D Data found under file "NSF DATA_2004 to 2011.xlsx" in datasets folder (Ecosystem>>Hubs>>Datasets)
*R&D spending found at the university level for 2014 ("Stem Grad Students.xlsx") or at state level ("Science and Engineering Grad Students by State and Year 2000-2011.csv")
** not uploaded to server or matched yet to CMSA code, because of this discrepancy.
**"Stem Grad Students.xlsx" contains categorized university by MSA, can be used for all university-based projects

'''Per Capita Income''' and '''Employment Data''' (US Census Bureau):
*"Per Capita Personal Income by MSA 2000-2012.xlsx" in datasets folder (Ecosystem>>Hubs>>Datasets>>Data from Yael)
*"Wages and Salaries by MSA 2000-2012.xlsx" in datasets folder (Ecosystem>>Hubs>>datasets>>Data from Yael)
**not uploaded to server or matched yet to CMSA code

'''Firm Births''' (BDS)
*in server 181, under table name "BDS"
*includes birth, death, net (birth - death), and rate (death rate) for years 1990-2013 for every MSA
*includes code for CMSA but is not aggregated by CMSA
** i.e. BDS statistics are still separate for all the smaller MSAs in New York's CMSA (code=1)

===Resources===
* Yael Hochberg and Fehder (2015), located in dropbox
** Use this paper as a guideline on how to conduct the analysis
*US Census Bureau data on employment by MSA: http://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_14_5YR_B23027&prodType=table
*USPTO utility patents by MSA: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/cls_cbsa/allcbsa_gd.htm
*MSA level trends: http://www.metrotrends.org/data.cf

===The Target Dataset===

We will need to process the following variables:
*SuperMSA - combine SanFran and SanJose, New York and Newark?, NC Research triangle, others?
*CSV mapping msas to cmsas is in the folder (and a table in the dbase)

Example dataset:
{| class="wikitable"
! MSA !! Year !! SeedVCInv !! SeedEarlyVCInv !! LaterVCInv !! NoDeals !! FundsInvested !! DistinctInvestors !! ...
|-
| 1234 || 2001 || 1000000 || 20000000 || 30000000 || 4 || 7 || 7 ||
|}

Note that the unit of observation is MSA-Year.

Variables to be computed at the MSA level:
*HubActive (binary)
*NoHubsActive (Count)
*HubSqFt
*Other Hub Vars (build list!!!)
*'''SeedVCInv''' (Seed/Start-up)
*'''EarlyVCInv''' (Early Stage)
*'''LaterStageVC''' (Later)
*'''OtherStageVC''' (Buyout/Acq, Other)
*'''NoDeals''' (done by local VCs?)
**'''NoDealsNear'''
**'''NoDealsFar'''
*NoPortCosFunded
*'''FundsInv''' (in an MSA)
**'''FundsInvFromNear''' (within MSA?)
**'''FundsInvFromFar''' (outside MSA?)
*DistinctInvestors (?)
**DistinctInvestorsNear (within MSA?)
**DistinctInvestorsFar (outside MSA?)
*PatentCount
*NoSTEMGrads
*FirmBirths (BDS data)
*UniRandDSpend
*PerCapitaIncome
*Employment

We need to:
*Check funds invested means dollars invested
*Categorize near and far! Is it within MSA vs. not, within adjacent MSAs, etc.?

There may be a second dataset that has Hub-Industry-Year (where industry is semiconductor/non-semiconductor?).

===Final Primary Data Set===

*A deal is a round syndicate. A near (far) deal is one with at least one investor that is near (far).

Table name: finaldataset
*cmsa
*year
*totalamountinv - total amount invested
*nearamountinv - amount invested from local funds
*faramountinv - amount invested from funds outside CMSA
*earlyinv - amount invested in early stage companies
*laterinv - amount invested in later stage companies
*startupseedinv - amount invested in seed or startup stage companies
*otherstageinv - amount invested in Acquisition/Buy-outs/Other stage companies
*investingfund - distinct funds that are investing in that CMSA-year
*investingfundnear - distinct funds from that CMSA that invested in that CMSA-year
*investingfundfar - distinct funds from outside that CMSA that invested in that CMSA-year
*deals - number of deals
*neardeals - number of deals inside a CMSA
*fardeals - number of deals from outside a CMSA (some of these deals might count in both categories, because syndicate members can be both inside and outside the CMSA)
*earlystagedeals - deals with early stage companies
*laterstagedeals - deals with later stage companies
*startupseeddeals - deals with startup/seed companies
*otherstagedeals - deals with companies in other stages
*newportcosfunded - number of portfolio companies to receive their first investment in that year
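
The near/far deal counts can be illustrated with a minimal Python sketch. It assumes one row per syndicate member, with a hypothetical deal identifier and the CMSA of both the company and the fund; the function name and the row layout are illustrative only and are not taken from the project's SQL scripts.

```python
from collections import defaultdict

def count_near_far_deals(memberships):
    """Count deals per (company CMSA, year).

    memberships: iterable of (deal_id, company_cmsa, year, fund_cmsa),
    one row per syndicate member (hypothetical layout).
    A deal is near if at least one member is in the company's CMSA,
    and far if at least one member is outside it, so a deal can be both.
    """
    all_deals = defaultdict(set)
    near = defaultdict(set)
    far = defaultdict(set)
    for deal_id, company_cmsa, year, fund_cmsa in memberships:
        key = (company_cmsa, year)
        all_deals[key].add(deal_id)
        if fund_cmsa == company_cmsa:
            near[key].add(deal_id)
        else:
            far[key].add(deal_id)
    return {
        key: {
            "deals": len(ids),
            "neardeals": len(near[key]),
            "fardeals": len(far[key]),
        }
        for key, ids in all_deals.items()
    }

# Deal d1 has one local and one outside investor; deal d2 has only an outside investor.
rows = [("d1", 1, 2001, 1), ("d1", 1, 2001, 2), ("d2", 1, 2001, 2)]
counts = count_near_far_deals(rows)
```

In this example neardeals plus fardeals exceeds deals, which is the double counting noted for fardeals.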

===Data by zip code===
*Population data, 2000-2016 - US Census Bureau (E:\McNair\Hubs\summer 2017)
https://www2.census.gov/programs-surveys/popest/datasets/
*Income data, 1998-2014 - The Internal Revenue Service (E:\McNair\Hubs\summer 2017)
https://www.irs.gov/uac/about-irs
*DCI index, to assess the economic well-being of communities
http://eig.org/dci/interactive-maps/u-s-zip-codes
*R&D Expenses, 1980-2016 - Wharton Research Data Services (E:\McNair\Hubs\summer 2017)
*Zipcode look-up table obtained from https://www.unitedstateszipcodes.org/zip-code-database/. It's available in (E:\McNair\Hubs\summer 2017).

== Data by MSA ==

We have principal cities of MSAs from the Census:
https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html

We might be able to go City -> FIPS place code -> MSA?

Cities and their FIPS codes (which don't perfectly correspond) are available from https://www.census.gov/geo/reference/codes/place.html

The Census claims to provide city to MSA here: https://www.census.gov/geo/maps-data/data/ua_rel_download.html
However, there is only CBSA!

This might do it: https://www2.census.gov/geo/pdfs/maps-data/data/rel/explanation_ua_cbsa_rel_10.pdf
We can maybe track city to principal city to MSA

==COMPUSTAT Data==

Data is in:
E:\McNair\Projects\Hubs\Summer 2017
Z:\Hubs\2017

Database is '''cities'''

SQL script is: COMPUSTAT.sql

The source file is RandDExpenditures.txt. It contains:
*Data from 1980-2017 (July). All COMPUSTAT.
*427799 records
*Fields include:
**R&D Expenditure
**Address (inc. city, zip, state)

Output file is COMPUSTATSummary.txt. It contains:
*Variables: City, year, No. public firms, sum R&D, sum Sales, sum total assets
*1979-2016
*4440 cities
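
The city-year summary can be sketched as follows. This is not the contents of COMPUSTAT.sql: the table name, column names, and sample rows are hypothetical, and SQLite is used in memory only so that the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical layout: one row per firm-year with its city and financials.
conn.execute(
    "CREATE TABLE compustat "
    "(firm TEXT, city TEXT, year INTEGER, rd REAL, sales REAL, assets REAL)"
)
conn.executemany(
    "INSERT INTO compustat VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("A", "Houston", 2000, 1.0, 10.0, 100.0),
        ("B", "Houston", 2000, 2.0, 20.0, 200.0),
        ("A", "Houston", 2001, 1.5, 12.0, 110.0),
    ],
)
# One output row per city-year: number of public firms and summed financials.
summary = conn.execute(
    """
    SELECT city, year,
           COUNT(DISTINCT firm) AS nopublicfirms,
           SUM(rd) AS sumrd,
           SUM(sales) AS sumsales,
           SUM(assets) AS sumassets
    FROM compustat
    GROUP BY city, year
    ORDER BY city, year
    """
).fetchall()
```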

==NSF Data==
Data is in:
E:\McNair\Projects\Hubs\Summer 2017
Z:\Hubs\2017

Database is '''cities'''

SQL script is: nsf_2017.sql

The source files are: nsf2017.txt, copied from table '''nsf''', and nsf_institution, copied from table '''nsf_grants_institution''' in the biotech db.

They contain:
*Award ID
*Award Institution
*Award Effective date
*Institution city
*Award Value
*Organization state code
From 1900 - 2017

Output file is nsfSummary.txt. It contains:
*Variables: City, State code, year, nsf_nogrants, nsf_valuegrant
*1900-2017

===Joined NSF table===
The NSF table joined with the VC table is in db '''cities'''. The table is named '''merged_nsf'''.
Missing values of nogrants and valuegrant for the years 1990-2017 are set to 0.
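
A minimal sketch of this kind of join, assuming the VC table is the left side and that both tables share a city_state_year key. The column subset and sample rows are hypothetical, and SQLite stands in for the project database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE vc (city_state_year TEXT PRIMARY KEY, year INTEGER, selamtm REAL);
    CREATE TABLE nsf (city_state_year TEXT PRIMARY KEY, nogrants INTEGER, valuegrant REAL);
    INSERT INTO vc VALUES ('HOUSTON_TX_2001', 2001, 12.5), ('AUSTIN_TX_2001', 2001, 30.0);
    INSERT INTO nsf VALUES ('HOUSTON_TX_2001', 4, 250000.0);
    """
)
# LEFT JOIN keeps every VC city-year; COALESCE turns unmatched grants into 0.
merged = conn.execute(
    """
    SELECT vc.city_state_year,
           vc.selamtm,
           COALESCE(nsf.nogrants, 0) AS nogrants,
           COALESCE(nsf.valuegrant, 0) AS valuegrant
    FROM vc
    LEFT JOIN nsf ON nsf.city_state_year = vc.city_state_year
    WHERE vc.year BETWEEN 1990 AND 2017
    ORDER BY vc.city_state_year
    """
).fetchall()
```

Setting unmatched rows to 0 treats "no record" as "no grants", which is only appropriate for years the grant source actually covers.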

==NIH Data==
Data is in:
Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is '''cities'''
SQL script is: nih2017.sql
The source files are:
*nih_1986_2001.csv
*nih_2002_2012.txt
*nih_2013_2015
located in E:\McNair\Projects\Federal Grant Data\NIH

The script that cleans NIH data and generates the summary table is titled '''nihSummary'''. It is located here:

E:\McNair\Projects\Hubs\Summer 2017

This table includes
*year
*city
*state
*country
*nogrants (number of grants)
*valuegrant
*city_state (the city-state ID that we'll merge on)

*Data from 1986-2015

==Clinical Trials Data==
Data is in:
Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is '''cities'''
SQL script is: ctrials.sql
The source file is:

*medclinical.txt

located in Z:\Hubs\2017

*Data from 1999-2017

===Joined clinical trials table===

The file which contains the number of trials in each city and year is located in:
Z:\Hubs\2017

The file is in:
Z:\Hubs\2017\clean data
The name of the file is:
ctrialsSummary.txt

It contains:
*city
*year
*city_state_year
*noctrials - number of trials

The ctrials table is joined with the VC table.
The SQL script for the join is '''new_ctrials.sql''', located in
Z:\Hubs\2017\sql scripts

The name of the joined table is '''new_merged_ctrials'''.

It contains:
*city
*state
*city_state_id
*city_state_year
*year
*noctrials
*seedamtm
*earlyamtm
*lateramtm
*selamtm
*numseeds
*numearly
*numlater
*numsel

Missing values of noctrials for the years 1999-2017 are set to 0.

==Population Data==
Data is in:
Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is '''cities'''

SQL script is: '''population.sql'''
The source files are:
*pop2000_2009.xlsx
*pop2010_2016.xlsx

They contain:
*State
*City name
*Year
*Population Estimates

Data from 2000-2016

===Joined population table===

Data is in:
Z:\Hubs\2017\clean data
The file names are
1_population.txt - contains data on population estimates from 2000-2009
2_population.txt - contains data on population estimates from 2010-2016

Database is '''cities'''
SQL script is: '''new_population.sql''', located in
Z:\Hubs\2017\sql scripts

The population table is joined with the VC table. The joined table is called '''new_merged_population'''.

It contains:
*City
*State
*city_state_id to uniquely identify each city
*city_state_year to uniquely identify each city in each year
*Population estimates
*Year
*Code from the state code and FIPS code
*State full name
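
The two identifiers can be illustrated with a short Python sketch. The actual format of city_state_id and city_state_year is not documented here, so the upper-case, underscore-separated form below is purely an assumption, as are the function names.

```python
def make_city_state_id(city, state):
    """Build a key that identifies a city (assumed format: CITY_NAME_ST)."""
    # Collapse repeated whitespace so ' New  York ' and 'New York' match.
    city_part = "_".join(city.strip().upper().split())
    return f"{city_part}_{state.strip().upper()}"

def make_city_state_year(city, state, year):
    """Build a key that identifies a city in a given year."""
    return f"{make_city_state_id(city, state)}_{int(year)}"

example_id = make_city_state_id(" New  York ", "ny")
example_key = make_city_state_year("Houston", "TX", 2005)
```

Normalizing case and whitespace before building the key matters because the joins with the VC table are exact string matches on these identifiers.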

==Income Data==

Raw data was obtained from the Census Bureau's American Community Survey (ACS).

Raw Data is in:
E:\McNair\Projects\Hubs\Summer 2017\MSA Income_raw.zip

Data from 2005-2015

The master list with MSAs and principal cities is titled '''list2.xls'''. It is located at:
Z:\Hubs\2017

This master list includes:
*MSA code
*MSA name
*Principal City
*State
*Place code (city code)
*State Code

This master list was edited to associate each principal city with a unique state. For example, New York is the principal city of the New York-New Jersey MSA and was originally associated with the state NY-NJ, so '''list''' was edited to put New York with NY.

Cleaned Income data files are in
Z:\Hubs\2017\merging_on_ID

They contain:
*MSA code
*MSA
*Year
*Total Household Income

The MSA-City-State look up file is titled '''msa_city_state_wcode.txt'''. It is located in
Z:\Hubs\2017\merging_on_ID

The SQL file that merges income data from ACS (by MSA - Year) with the MSA-City file is titled '''income.sql'''. It is located here:
Z:\Hubs\2017\sql scripts

The final income table is in db '''cities''' titled '''merged_income'''.

It includes:
*MSA
*City
*State
*Year
*Total Household Income

The table includes 8780 observations

===Joined income data===

Data is in:
Z:\Hubs\clean data
The file names are:
INC_05.txt - INC_15.txt

Database is '''cities'''
SQL script is: merged_income.sql

They contain:
*City
*State
*city_state_id to uniquely identify each city
*Income
*Year
*Code from the state code and FIPS code

==Employment Data==

Data on employment was obtained from the American Community Survey (ACS), US Census Bureau.

Raw Data is in:
E:\McNair\Projects\Hubs\Summer 2017\Employment Data by MSA
Cleaned files are in
Z:\Hubs\2017\clean data

They contain:
*MSA code
*MSA
*Year
*Employment rate of individuals 16 years or older
*Unemployment rate of individuals 16 years or older

Data from 2005-2015

The SQL file that merges employment data from ACS (by MSA - Year) with the MSA-City file is titled '''Employment.sql'''.
The file is located in:
Z:\Hubs\2017

The final table is in db '''cities''' titled '''merged_employment'''.

It includes:
*MSA
*City
*Year
*Employment rate
*Unemployment rate

===Joined employment data===

Data is in:
Z:\Hubs\clean data

The file names are:
EMP_05.txt - EMP_15.txt

Database is '''cities'''
SQL script is: '''new_employment.sql''' and it is located in
Z:\Hubs\2017\sql scripts

The final table, which is joined with the VC table, is in db '''cities''' and is titled '''new_merged_employment'''.

It contains:
*City
*State
*Code from the state code and FIPS code
*city_state_id to uniquely identify each city
*city_state_year to uniquely identify each city in each year
*Employment rates of individuals of 16 years or older
*Unemployment rates of individuals of 16 years or older
*Year

==Schooling Data==

Data on schooling was obtained from the American Community Survey (ACS), US Census Bureau.

Raw Data is in:
E:\McNair\Projects\Hubs\Summer 2017\School Enrollment Data by MSA
Cleaned files are in
Z:\Hubs\2017\clean data

They contain:
*MSA code
*MSA
*Year
*Total number of population 3 years and over enrolled in school
*Percent of population 3 years and over enrolled in public school
*Percent of population 3 years and over enrolled in private school

Data from 2005-2015

The SQL file that merges schooling data from ACS (by MSA - Year) with the MSA-City file is titled '''schooling.sql'''.
The file is located in:
Z:\Hubs\2017

The final table is in db '''cities''' titled '''merged_schooling'''.

It includes:
*MSA
*City
*Year
*Total
*Percent_public_schooling
*Percent_private_schooling

===Joined schooling data===

Data is in:
Z:\Hubs\clean data
The file names are:
SCH_05.txt - SCH_15.txt

Database is '''cities'''
SQL script which joins this table with VC table is: '''new_merged_schooling.sql'''
The final table is in db '''cities''' titled '''new_merged_schooling'''.

It contains:
*City
*State
*city_state_id to uniquely identify each city
*city_state_year to uniquely identify each city in each year
*Total number of school enrollment
*Percentage enrolled in public schools
*Percentage enrolled in private schools
*Year
*Code from the state code and FIPS code

==VC Data==

Raw Data is in:
Z:\VentureCapitalData\SDCVCData
The file name is roundcitystateyear.txt

It contains:
*city
*state
*year
*seedamtm - seed, amount in millions
*earlyamtm - early, amount in millions
*lateramtm - late, amount in millions
*selamtm - seed early late, amount in millions
*numseeds - number of seeds
*numearly
*numlater
*numsel

Data from 1953-2017

The table is in db '''cities''' titled '''vc'''.

It includes:
*city
*state
*city_state_id
*city_state_year
*seedamtm
*earlyamtm
*lateramtm
*selamtm
*numseeds
*numearly
*numlater
*numsel
*year

Hubs

2017-07-12T21:14:04Z

KerdaV: /* Joined clinical trials data */

{{McNair Projects
|Has title=Hubs
|Has owner=Hira Farooqi,
|Has keywords=Data
|Has project status=Active
}}
The Hubs Research Project is a full-length academic paper analyzing the effectiveness of "hubs", a component of the entrepreneurship ecosystem, in the advancement and growth of entrepreneurial success in a metropolitan area.

This research will primarily focus on large and mid-sized Metropolitan Statistical Areas (MSAs), as that is where the great majority of Venture Capital funding is located.

===Primary Data Set===
The Hubs data set, from SDC Platinum, has been constructed on the server:
Data files are in 128.42.44.181/bulk/Hubs
All files are in 128.42.44.182/bulk/Projects/Ecosystem/Hubs
psql Hubs

The data set includes all United States Venture Capital transactions (moneytree) from 1990 through 2015.
Data has been aggregated at the portfolio company, fund, and round level. It will be analyzed at the combined MSA level. We will look at the number of companies funded, the number of funds active, and the flow of investment in a given MSA.

The data set has now been uploaded to the database server, named Hubs.
There are 4 tables:
*Rounds: Rounddate, coname, state, roundno, stage1, etc.
*CombinedRounds: Coname, rounddate, discamount, fundname
*Companies: LastInv, FirstInv, coname, MSA, MSACode, Address, state, datefounded, totalknownfunding, industry(major)
*Funds: fundname, closingdate, lastinv, firstinv, msa, msacode, avinv, nocoinv, totalknowninv, address

Used variables:

Companies: Coname, MSACode, Industry, state
MSALookupTable: MSACode, MSASuper
IndustryLookupTable: IndustryMajor, InduCode
->
CompanyInfo: Coname, MSASuper, InduCode, state (complete)

Funds: fundname, msacode, state
MSALookupTable: MSACode, MSASuper
->
FundInfo: fundname, msacode, state (complete)

Rounds: coname, rounddate, stagecode, roundno
CombinedRounds: coname, rounddate, discamount, fundname
->
RoundInfoSuper: coname, rounddate, '''nofunds''', discamount
->
RoundInfo: Coname, roundyear, fundname, estamount (complete)

Then take:
RoundInfo: Coname, roundyear, fundname, estamount
CompanyInfo: Coname, MSASuper, InduCode, state
FundInfo: fundname, msacode, state
->
SuperRoundInfo: Coname, CoMSASuper, CoInduCode, CoState, FundName, FundMSASuper, FundState, RoundYear, RoundEstAmount
->
MSAPortCos: Count(CoName) As NoPortCosFunded, CoMSASuper, RoundYear
...
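
The table chain above can be sketched in Python with an in-memory SQLite database. Three things in this sketch are assumptions rather than facts from the project scripts: that discamount is the round total repeated on every syndicate row, that estamount is that total split evenly across the nofunds members, and that NoPortCosFunded counts distinct companies. The sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE combinedrounds (coname TEXT, rounddate TEXT, discamount REAL, fundname TEXT);
    CREATE TABLE companyinfo (coname TEXT, msasuper INTEGER);
    INSERT INTO combinedrounds VALUES
        ('Acme', '2001-03-01', 6.0, 'Fund A'),
        ('Acme', '2001-03-01', 6.0, 'Fund B'),
        ('Beta', '2001-07-15', 2.0, 'Fund A'),
        ('Acme', '2002-05-20', 9.0, 'Fund C');
    INSERT INTO companyinfo VALUES ('Acme', 1), ('Beta', 1);

    -- RoundInfo: one row per company-round-fund with an estimated amount per fund.
    CREATE TABLE roundinfo AS
    SELECT c.coname,
           CAST(strftime('%Y', c.rounddate) AS INTEGER) AS roundyear,
           c.fundname,
           c.discamount / n.nofunds AS estamount
    FROM combinedrounds c
    JOIN (SELECT coname, rounddate, COUNT(DISTINCT fundname) AS nofunds
          FROM combinedrounds
          GROUP BY coname, rounddate) n
      ON n.coname = c.coname AND n.rounddate = c.rounddate;
    """
)
estimates = conn.execute(
    "SELECT coname, fundname, estamount FROM roundinfo "
    "ORDER BY coname, roundyear, fundname"
).fetchall()
# MSAPortCos: number of portfolio companies funded per super-MSA and year.
msaportcos = conn.execute(
    """
    SELECT ci.msasuper, r.roundyear, COUNT(DISTINCT r.coname) AS noportcosfunded
    FROM roundinfo r
    JOIN companyinfo ci ON ci.coname = r.coname
    GROUP BY ci.msasuper, r.roundyear
    ORDER BY r.roundyear
    """
).fetchall()
```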

'''Notes on Creation of Primary Data Set'''

Raw tables
* companies (last investment, first investment, company name, MSA, MSA code, address, state, date founded, known funding, industry)
* funds (fund closing date, last investment, first investment, fund name, address, MSA, MSA code, Average investment, number companies invested (NoCos), known investment)
* rounds (round date, company name, state, round number, stage 1, stage 2, stage 3)
* combined rounds (company name, round date, disclosed amount, investor)
* msalist (changes MSAs to CMSAs— combined MSAs)
*industry list (changes 6 industry categories to 4— ICT, Life Sciences, Semiconductors, Other)

Process
*cleaned tables to eliminate duplications, undisclosed variables
*changed all original characters to include CMSA and Industry Codes (companyinfo3, fundinfocleanfinal, roundinfoclean)
*matched funds to avoid any issues with names (i.e. Fund ABC L.P./Fund ABC LP/Fund ABC)
*matched roundinfoclean investors to fundinfocleanfinal investors (roundinfo.txt >> cleanfundfinal.txt)
*join by round and company conames
*bridge years (1990-2016), stage, and cmsa
* populate data with count of companies (Deal flow) and estimated amount ($)
** data set in 181 hubs folder under summarycmsa.txt (38394)
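
The fund-name matching step can be illustrated with a small normalization function. The notes do not say how the matching was actually done, so this is only one plausible approach: strip punctuation, upper-case the name, and drop trailing legal suffixes. The suffix list is an assumption.

```python
import re

# Hypothetical list of legal suffixes to ignore when comparing fund names.
_SUFFIXES = {"LP", "LLC", "INC", "LTD"}

def normalize_fund_name(name):
    """Return a canonical form so spelling variants of one fund compare equal."""
    # Remove punctuation so 'L.P.' becomes 'LP', then split on whitespace.
    cleaned = re.sub(r"[^\w\s]", "", name.upper())
    tokens = cleaned.split()
    # Drop trailing legal suffixes such as LP or LLC.
    while tokens and tokens[-1] in _SUFFIXES:
        tokens.pop()
    return " ".join(tokens)

variants = ["Fund ABC L.P.", "Fund ABC LP", "Fund ABC"]
canonical = {normalize_fund_name(v) for v in variants}
```

Exact matching on the normalized name is conservative. It will not catch misspellings, so unmatched names would still need a manual pass.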

Key decisions:
*Threw out undisclosed companies throughout, as they have no address
*Count is done by joining round and company
*Anything fund related must use a disclosed fund
*Near and far, total invested, fund counts, etc., are all computed using only disclosed funds that match

'''Glossary of Tables'''
*cleanco - used to remove duplicates from companies
*cleanedcompanies - clean set of companies with no duplicates
*cmsafunds
*cmsas - list of all CMSAs in final data set (for merging)
*cmsastats - statistics not including empty years (pre-merge)
*cmsastats2 - statistics separated by year-MSA
*cmsastats3 - statistics separated by year-MSA-stage
*cmsastats4
*cmsayears - empty merged table between year and cmsa
*cmsayearstage - empty merged table between cmsa/years and stage
*combinedrounds - raw SDC data for combined rounds
*combinedroundswamt - used to join rounds and combined rounds for roundinfo2
*companies - raw SDC company data
*companyinfo - cleaned companies joined with state and CMSA information
*companyinfo2 - companyinfo1 with original industry categories
*companyinfo3 - companyinfo2 with updated industry categories and codes
*companyinfo4 - clean version of companyinfo3
*companyround - combined company information with round information
*companyround2 - combined company information with round information, cleaned up from companyround
*companyround3 - combined company information with round information, cleaned up from companyround2
*'''finaldataset''' - final statistics by CMSA-year, see section Final Primary Data Set for more information
*fundinfo - funds joined with CMSA info
*fundinfo2 - clean version of fundinfo1
*fundinfoclean - used in process to clean fundinfo2
*fundinfoclean2 - used in process to clean fundinfo2
*fundinfocleanfinal - used in process to clean fundinfo2
*fundinfocleannodups - final clean set of fundinfo
*funds - raw SDC fund data
*Houston - analysis for Houston ecosystem team
*Houston2 - analysis for Houston ecosystem team
*houston3 - analysis for Houston ecosystem team
*industry - new industry codes (4), used for all future data sets
*industrylist - lookup table for new industry codes (went from 6 to 4)
*joined1 - used for matching process
*joined2 - used for matching process
*matchfund2 - used for matching process
*matchfunds - used for matching process
*matchroundfund - used for matching process
*matchroundfund2 - used for matching process
*msalist - lookup table for MSA to CMSA (used for all future data sets)
*nearfar1 - beginning set before adding nearfar/stage variables
*nearfar2 - added binary variables for near/far and for each of the stages, used to build final dataset
*roundfund - not used; joined round to fund; drop/ignore
*roundinfo - round info cleaned up to include number of investors in a syndicate and estimated investment per member of syndicate
*roundinfo2 - roundinfo1 including name of investors/funds
*roundinfo3 - clean version of roundinfo2
*roundinfoclean - final clean version of roundinfo3 (final roundinfo table)
*rounds - raw SDC round data
*stages - table for merging stage-year-CMSA
*superinfo - ignore/drop
*temp - used for matching process
*years - table for merging stage-year-CMSA
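
The role of the empty bridge tables (cmsayears, cmsayearstage, stages, years) can be sketched as a full CMSA-year-stage grid onto which observed counts are merged, so that combinations with no deals appear as zeros instead of being absent. The function and the sample values are hypothetical.

```python
from itertools import product

def build_panel(cmsas, years, stages, observed):
    """Return a balanced panel keyed by (cmsa, year, stage).

    observed maps (cmsa, year, stage) to a count; any combination
    missing from it is filled with 0.
    """
    return {key: observed.get(key, 0) for key in product(cmsas, years, stages)}

observed_counts = {(1, 1990, "seed"): 3}
panel = build_panel([1, 2], range(1990, 1992), ["seed", "early"], observed_counts)
```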

===Hub Candidates Data Set===

The Hubs candidate data set is a list of potential hubs found in MSAs throughout the country. Researchers are currently pulling qualitative and quantitative information from the candidates' websites in an attempt to identify what can be classified as a hub. This data set is difficult to build, as there is little to no quantitative information available for this category of institution, and it depends on what information is publicly accessible on the internet.

Characteristics/Variables
*Year Founded
*Square footage
*LinkedIn self-identifiers (how the organization classifies itself on its LinkedIn profile)
*Activeness on Twitter (binary)
*Member Directory available online (binary)
*Number of conference rooms
*Price ($/month) for Flex desk
*Offers Reserved desk (binary)
*Offers office space for rent (binary)
*Offers community membership-- not for coworking but for community events, etc. (binary)
*Number of events offered per month (estimate)
*Offers code academy
*Mission Statement/Vision (for qualitative or key-word analysis)

These characteristics/variables will be used to determine whether a candidate is or is not likely to be a Hub.

As of March 10th 2016, the list contains 125 Hub candidates.

'''Where to find''': The Hubs data set can be found in the Ecosystem>>Hubs>>dataset folder. It is not currently in the database due to a UTF8 issue.

===Supplementary Data Sets===
'''Patent data''': to be pulled from USPTO or SDC Platinum.

'''Number of STEM Graduate Students''' (NSF) and '''University R&D Spending''' (NSF):
*University R&D Data found under file "NSF DATA_2004 to 2011.xlsx" in datasets folder (Ecosystem>>Hubs>>Datasets)
*R&D spending found at the university level for 2014 ("Stem Grad Students.xlsx") or at state level ("Science and Engineering Grad Students by State and Year 2000-2011.csv")
** not uploaded to server or matched yet to CMSA code, because of this discrepancy.
**"Stem Grad Students.xlsx" contains categorized university by MSA, can be used for all university-based projects

'''Per Capita Income''' and '''Employment Data''' (US Census Bureau):
*"Per Capita Personal Income by MSA 2000-2012.xlsx" in datasets folder (Ecosystem>>Hubs>>Datasets>>Data from Yael)
*"Wages and Salaries by MSA 2000-2012.xlsx" in datasets folder (Ecosystem>>Hubs>>datasets>>Data from Yael)
**not uploaded to server or matched yet to CMSA code

'''Firm Births''' (BDS)
*in server 181, under table name "BDS"
*includes birth, death, net(birth-death) and rate(death rate) for years 1990-2013 for every msa
*includes code for CMSA but is not aggregated by CMSA
** i.e. BDS statistics are still separate for all the smaller MSAs in New York's CMSA (code=1)
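
Rolling the BDS table up to the CMSA level could look like the sketch below. The row layout and the MSA codes are hypothetical. Births and deaths can be summed directly, but the death rate cannot be recovered by summing MSA rates, so it is left out here.

```python
from collections import defaultdict

def aggregate_bds_to_cmsa(rows):
    """Sum firm births and deaths over all MSAs in each CMSA-year.

    rows: iterable of (cmsa, msa, year, birth, death), hypothetical layout.
    """
    totals = defaultdict(lambda: {"birth": 0, "death": 0})
    for cmsa, _msa, year, birth, death in rows:
        totals[(cmsa, year)]["birth"] += birth
        totals[(cmsa, year)]["death"] += death
    # Net is recomputed from the summed births and deaths.
    return {
        key: {**t, "net": t["birth"] - t["death"]}
        for key, t in totals.items()
    }

bds_rows = [
    (1, 5600, 2000, 100, 80),
    (1, 5640, 2000, 50, 60),
    (2, 1120, 2000, 10, 5),
]
by_cmsa = aggregate_bds_to_cmsa(bds_rows)
```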

===Resources===
* Yael Hochberg and Fehder (2015), located in dropbox
** Use this paper as a guideline on how to conduct the analysis
*US Census Bureau data on employment by MSA: http://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_14_5YR_B23027&prodType=table
*USPTO utility patents by MSA: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/cls_cbsa/allcbsa_gd.htm
*MSA level trends: http://www.metrotrends.org/data.cf

===The Target Dataset===

We will need to process the following variables:
*SuperMSA - combine SanFran and SanJose, New York and Newark?, NC Research triangle, others?
*CSV mapping msas to cmsas is in the folder (and a table in the dbase)

Example dataset:
{| class="wikitable"
! MSA !! Year !! SeedVCInv !! SeedEarlyVCInv !! LaterVCInv !! NoDeals !! FundsInvested !! DistinctInvestors !! ...
|-
| 1234 || 2001 || 1000000 || 20000000 || 30000000 || 4 || 7 || 7 || ...
|}

Note that the unit of observation is MSA-Year.

Variables to be computed at the MSA level:
*HubActive (binary)
*NoHubsActive (Count)
*HubSqFt
*Other Hub Vars (build list!!!)
*'''SeedVCInv''' (Seed/Start-up)
*'''EarlyVCInv''' (Early Stage)
*'''LaterStageVC''' (Later)
*'''OtherStageVC''' (Buyout/Acq, Other)
*'''NoDeals''' (done by local VCs?)
**'''NoDealsNear'''
**'''NoDealsFar'''
*NoPortCosFunded
*'''FundsInv''' (in an MSA)
**'''FundsInvFromNear''' (within MSA?)
**'''FundsInvFromFar''' (outside MSA?)
*DistinctInvestors (?)
**DistinctInvestorsNear (within MSA?)
**DistinctInvestorsFar (outside MSA?)
*PatentCount
*NoSTEMGrads
*FirmBirths (BDS data)
*UniRandDSpend
*PerCapitaIncome
*Employment

We need to:
*Check that "funds invested" means dollars invested
*Categorize near and far! Is it within MSA vs. not, within adjacent MSAs, etc.?

There may be a second dataset that has Hub-Industry-Year (where industry is semiconductor/non-semiconductor?).

===Final Primary Data Set===

*A deal is a round syndicate. A deal counts as near (far) if at least one investor in the syndicate is near (far).

Table name: finaldataset
*cmsa
*year
*totalamountinv - total amount invested
*nearamountinv - amount invested from local funds
*faramountinv - amount invested from funds outside CMSA
*earlyinv - amount invested in early stage companies
*laterinv - amount invested in later stage companies
*startupseedinv - amount invested in seed or startup stage companies
*otherstageinv - amount invested in Acquisition/Buy-outs/Other stage companies
*investingfund - distinct funds that are investing in that CMSA-year
*investingfundnear - distinct funds from that CMSA that invested in that CMSA-year
*investingfundfar - distinct funds from outside that CMSA that invested in that CMSA-year
*deals - number of deals
*neardeals - number of deals inside a CMSA
*fardeals - number of deals from outside a CMSA (some of these deals might count in both categories, because syndicate members can be both inside and outside the CMSA)
*earlystagedeals - deals with early stage companies
*laterstagedeals - deals with later stage companies
*startupseeddeals - deals with startup/seed companies
*otherstagedeals - deals with companies in other stages
*newportcosfunded - number of portfolio companies to receive their first investment in that year

===Data by zip code===
*Population data, 2000-2016 - US Census Bureau (E:\McNair\Hubs\summer 2017)
https://www2.census.gov/programs-surveys/popest/datasets/
*Income data, 1998-2014 - The Internal Revenue Service (E:\McNair\Hubs\summer 2017)
https://www.irs.gov/uac/about-irs
*DCI index, to assess the economic well-being of communities
http://eig.org/dci/interactive-maps/u-s-zip-codes
*R&D Expenses, 1980-2016 - Wharton Research Data Services (E:\McNair\Hubs\summer 2017)
*Zipcode look-up table obtained from https://www.unitedstateszipcodes.org/zip-code-database/. It's available in (E:\McNair\Hubs\summer 2017).

== Data by MSA ==

We have principal cities of MSAs from the Census:
https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html

We might be able to go City -> FIPS place code -> MSA?

Cities and their FIPS codes (which don't perfectly correspond) are available from https://www.census.gov/geo/reference/codes/place.html

The Census claims to provide city to MSA here: https://www.census.gov/geo/maps-data/data/ua_rel_download.html
However, there is only CBSA!

This might do it: https://www2.census.gov/geo/pdfs/maps-data/data/rel/explanation_ua_cbsa_rel_10.pdf
We can maybe track city to principal city to MSA
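
The proposed City -> FIPS place code -> MSA chain amounts to two lookups. The sketch below assumes two tables have already been built from the Census files: one mapping (city, state) to a place code and one mapping principal-city place codes to MSA codes. Both tables, the function name, and the codes in the example are made up for illustration.

```python
def city_to_msa(city, state, place_codes, principal_city_msa):
    """Resolve a city to an MSA code via its FIPS place code.

    Returns None when the city has no place code or is not a principal city.
    """
    place = place_codes.get((city.strip().upper(), state.strip().upper()))
    if place is None:
        return None
    return principal_city_msa.get(place)

# Placeholder codes, not real FIPS or MSA values.
place_codes = {("HOUSTON", "TX"): "P001", ("KATY", "TX"): "P002"}
principal_city_msa = {"P001": "M100"}

houston_msa = city_to_msa("Houston", "tx", place_codes, principal_city_msa)
katy_msa = city_to_msa("Katy", "TX", place_codes, principal_city_msa)
```

As the second lookup shows, a city that is not a principal city falls through, so this chain alone would leave non-principal cities unassigned.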

==COMPUSTAT Data==

Data is in:
E:\McNair\Projects\Hubs\Summer 2017
Z:\Hubs\2017

Database is '''cities'''

SQL script is: COMPUSTAT.sql

The source file is RandDExpenditures.txt. It contains:
*Data from 1980-2017 (July). All COMPUSTAT.
*427799 records
*Fields include:
**R&D Expenditure
**Address (inc. city, zip, state)

Output file is COMPUSTATSummary.txt. It contains:
*Variables: City, year, No. public firms, sum R&D, sum Sales, sum total assets
*1979-2016
*4440 cities

==NSF Data==
Data is in:
E:\McNair\Projects\Hubs\Summer 2017
Z:\Hubs\2017

Database is '''cities'''

SQL script is: nsf_2017.sql

The source files are: nsf2017.txt, copied from table '''nsf''', and nsf_institution, copied from table '''nsf_grants_institution''' in the biotech db.

They contain:
*Award ID
*Award Institution
*Award Effective date
*Institution city
*Award Value
*Organization state code
From 1900 - 2017

Output file is nsfSummary.txt. It contains:
*Variables: City, State code, year, nsf_nogrants, nsf_valuegrant
*1900-2017

===Joined NSF table===
The NSF table joined with the VC table is in db '''cities'''. The table is named '''merged_nsf'''.
Missing values of nogrants and valuegrant for the years 1990-2017 are set to 0.

==NIH Data==
Data is in:
Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is '''cities'''
SQL script is: nih2017.sql
The source files are:
*nih_1986_2001.csv
*nih_2002_2012.txt
*nih_2013_2015
located in E:\McNair\Projects\Federal Grant Data\NIH

The script that cleans NIH data and generates the summary table is titled '''nihSummary'''. It is located here:

E:\McNair\Projects\Hubs\Summer 2017

This table includes
*year
*city
*state
*country
*nogrants (number of grants)
*valuegrant
*city_state (the city-state ID that we'll merge on)

*Data from 1986-2015

==Clinical Trials Data==
Data is in:
Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is '''cities'''
SQL script is: ctrials.sql
The source file is:

*medclinical.txt

located in Z:\Hubs\2017

*Data from 1999-2017

===Joined clinical trials table===

The file which contains the number of trials in each city and year is located in:
Z:\Hubs\2017

The file is in:
Z:\Hubs\2017\clean data
The name of the file is:
ctrialsSummary.txt

It contains:
*city
*year
*city_state_year
*noctrials - number of trials

The ctrials table is joined with the VC table.
The SQL script for the join is '''new_ctrials.sql''', located in
Z:\Hubs\2017\sql scripts

The name of the joined table is '''new_merged_ctrials'''.

It contains:
*city
*state
*city_state_id
*city_state_year
*year
*noctrials
*seedamtm
*earlyamtm
*lateramtm
*selamtm
*numseeds
*numearly
*numlater
*numsel

Missing values of noctrials for the years 1999-2017 are set to 0.

==Population Data==
Data is in:
Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is '''cities'''

SQL script is: '''population.sql'''
The source files are:
*pop2000_2009.xlsx
*pop2010_2016.xlsx

They contain:
*State
*City name
*Year
*Population Estimates

Data from 2000-2016

===Joined population data===

Data is in:
Z:\Hubs\2017\clean data
The file names are
1_population.txt - contains data on population estimates from 2000-2009
2_population.txt - contains data on population estimates from 2010-2016

Database is '''cities'''
SQL script is: '''new_population.sql''', located in
Z:\Hubs\2017\sql scripts

The population table is joined with the VC table. The joined table is called '''new_merged_population'''.

It contains:
*City
*State
*city_state_id to uniquely identify each city
*city_state_year to uniquely identify each city in each year
*Population estimates
*Year
*Code from the state code and FIPS code
*State full name

==Income Data==

Raw data was obtained from the Census Bureau's American Community Survey (ACS).

Raw Data is in:
E:\McNair\Projects\Hubs\Summer 2017\MSA Income_raw.zip

Data from 2005-2015

The master list with MSAs and principal cities is titled '''list2.xls'''. It is located at:
Z:\Hubs\2017

This master list includes:
*MSA code
*MSA name
*Principal City
*State
*Place code (city code)
*State Code

This master list was edited to associate each principal city with a unique state. For example, New York is the principal city of the New York-New Jersey MSA and was originally associated with the state NY-NJ, so '''list''' was edited to put New York with NY.

Cleaned Income data files are in
Z:\Hubs\2017\merging_on_ID

They contain:
*MSA code
*MSA
*Year
*Total Household Income

The MSA-City-State look up file is titled '''msa_city_state_wcode.txt'''. It is located in
Z:\Hubs\2017\merging_on_ID

The SQL file that merges income data from ACS (by MSA - Year) with the MSA-City file is titled '''income.sql'''. It is located here:
Z:\Hubs\2017\sql scripts

The final income table is in db '''cities''' titled '''merged_income'''.

It includes:
*MSA
*City
*State
*Year
*Total Household Income

The table includes 8780 observations

===Joined income data===

Data is in:
Z:\Hubs\clean data
The file names are:
INC_05.txt - INC_15.txt

Database is '''cities'''
SQL script is: merged_income.sql

They contain:
*City
*State
*city_state_id to uniquely identify each city
*Income
*Year
*Code from the state code and FIPS code

==Employment Data==

Data on employment was obtained from the American Community Survey (ACS), US Census Bureau.

Raw Data is in:
E:\McNair\Projects\Hubs\Summer 2017\Employment Data by MSA
Cleaned files are in
Z:\Hubs\2017\clean data

They contain:
*MSA code
*MSA
*Year
*Employment rate of individuals 16 years or older
*Unemployment rate of individuals 16 years or older

Data from 2005-2015

The SQL file that merges employment data from ACS (by MSA - Year) with the MSA-City file is titled '''Employment.sql'''.
The file is located in:
Z:\Hubs\2017

The final table is in db '''cities''' titled '''merged_employment'''.

It includes:
*MSA
*City
*Year
*Employment rate
*Unemployment rate

===Joined employment data===

Data is in:
Z:\Hubs\clean data

The file names are:
EMP_05.txt - EMP_15.txt

Database is '''cities'''
SQL script is: '''new_employment.sql''' and it is located in
Z:\Hubs\2017\sql scripts

The final table, which is joined with the VC table, is in db '''cities''' and is titled '''new_merged_employment'''.

It contains:
*City
*State
*Code from the state code and FIPS code
*city_state_id to uniquely identify each city
*city_state_year to uniquely identify each city in each year
*Employment rates of individuals of 16 years or older
*Unemployment rates of individuals of 16 years or older
*Year

==Schooling Data==

Data on schooling was obtained from the American Community Survey (ACS), US Census Bureau.

Raw Data is in:
E:\McNair\Projects\Hubs\Summer 2017\School Enrollment Data by MSA
Cleaned files are in
Z:\Hubs\2017\clean data

They contain:
*MSA code
*MSA
*Year
*Total number of population 3 years and over enrolled in school
*Percent of population 3 years and over enrolled in public school
*Percent of population 3 years and over enrolled in private school

Data from 2005-2015

The SQL file that merges schooling data from ACS (by MSA - Year) with the MSA-City file is titled '''schooling.sql'''.
The file is located in:
Z:\Hubs\2017

The final table is in db '''cities''' titled '''merged_schooling'''.

It includes:
*MSA
*City
*Year
*Total
*Percent_public_schooling
*Percent_private_schooling

===Joined schooling data===

Data is in:
Z:\Hubs\clean data
The file names are:
SCH_05.txt - SCH_15.txt

Database is '''cities'''
SQL script which joins this table with VC table is: '''new_merged_schooling.sql'''
The final table is in db '''cities''' titled '''new_merged_schooling'''.

It contains:
*City
*State
*city_state_id to uniquely identify each city
*city_state_year to uniquely identify each city in each year
*Total number of school enrollment
*Percentage enrolled in public schools
*Percentage enrolled in private schools
*Year
*Code from the state code and FIPS code

==VC Data==

Raw Data is in:
Z:\VentureCapitalData\SDCVCData
The file name is roundcitystateyear.txt

It contains:
*city
*state
*year
*seedamtm - seed, amount in millions
*earlyamtm - early, amount in millions
*lateramtm - late, amount in millions
*selamtm - seed early late, amount in millions
*numseeds - number of seeds
*numearly
*numlater
*numsel

Data from 1953-2017

The table is in db '''cities''' titled '''vc'''.

It includes:
*city
*state
*city_state_id
*city_state_year
*seedamtm
*earlyamtm
*lateramtm
*selamtm
*numseeds
*numearly
*numlater
*numsel
*year

Hubs

2017-07-12T21:13:44Z

KerdaV: /* NSF Data */

{{McNair Projects
|Has title=Hubs
|Has owner=Hira Farooqi,
|Has keywords=Data
|Has project status=Active
}}
The Hubs Research Project is a full-length academic paper analyzing the effectiveness of "hubs", a component of the entrepreneurship ecosystem, in the advancement and growth of entrepreneurial success in a metropolitan area.

This research will primarily focus on large and mid-sized Metropolitan Statistical Areas (MSAs), as that is where the great majority of Venture Capital funding is located.

===Primary Data Set===
The Hubs data set, from SDC Platinum, has been constructed in the server:
Data files are in 128.42.44.181/bulk/Hubs
All files are in 128.42.44.182/bulk/Projects/Ecosystem/Hubs
psql Hubs

The data set includes all United States Venture Capital transactions (moneytree) from 1990 through 2015.
Data has been aggregated at the portfolio company, fund, and round level, and will be analyzed at the combined MSA level. We will look at the number of companies funded, the number of funds active, and the flow of investment in a given MSA.

The data set has now been uploaded to the database server, named Hubs.
There are 4 tables:
*Rounds: Rounddate, coname, state, roundno, stage1, etc.
*CombinedRounds: Coname, rounddate, discamount, fundname
*Companies: LastInv, FirstInv, coname, MSA, MSACode, Address, state, datefounded, totalknownfunding, industry(major)
*Funds: fundname, closingdate, lastinv, firstinv, msa, msacode, avinv, nocoinv, totalknowninv, address

Used variables:

Companies: Coname, MSACode, Industry, state
MSALookupTable: MSACode, MSASuper
IndustryLookupTable: IndustryMajor, InduCode
->
CompanyInfo: Coname, MSASuper, InduCode, state (complete)

Funds: fundname, msacode, state
MSALookupTable: MSACode, MSASuper
->
FundInfo: fundname, msacode, state (complete)

Rounds: coname, rounddate, stagecode, roundno
CombinedRounds: coname, rounddate, discamount, fundname
->
RoundInfoSuper: coname, rounddate, '''nofunds''', discamount
->
RoundInfo: Coname, roundyear, fundname, estamount (complete)

Then take:
RoundInfo: Coname, roundyear, fundname, estamount
CompanyInfo: Coname, MSASuper, InduCode, state
FundInfo: fundname, msacode, state
->
SuperRoundInfo: Coname, CoMSASuper, CoInduCode, CoState, FundName, FundMSASuper, FundState, RoundYear, RoundEstAmount
->
MSAPortCos: Count(CoName) As NoPortCosFunded, CoMSASuper, RoundYear
...
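
The RoundInfoSuper to RoundInfo step can be sketched as follows. This is a minimal sketch, assuming that the estimated amount is the disclosed round amount split equally across the funds in the round (the glossary describes roundinfo as holding the estimated investment per member of syndicate) and that round dates are ISO strings. The function name and sample rows are hypothetical and are not taken from the project's SQL.

```python
from collections import defaultdict

def estimate_per_fund(combined_rounds):
    """Split each round's disclosed amount equally across its funds.

    combined_rounds: iterable of (coname, rounddate, discamount, fundname),
    mirroring the CombinedRounds columns. rounddate is assumed to be an
    ISO string such as "2001-03-01". Equal splitting is an assumption.
    """
    funds = defaultdict(set)   # (coname, rounddate) -> distinct fund names
    amount = {}                # (coname, rounddate) -> disclosed amount
    for coname, rounddate, discamount, fundname in combined_rounds:
        key = (coname, rounddate)
        funds[key].add(fundname)
        amount[key] = discamount

    rows = []
    for key in sorted(funds):
        coname, rounddate = key
        nofunds = len(funds[key])          # the nofunds column of RoundInfoSuper
        roundyear = int(rounddate[:4])
        for fundname in sorted(funds[key]):
            rows.append((coname, roundyear, fundname, amount[key] / nofunds))
    return rows
```

Each output row has the RoundInfo shape (Coname, roundyear, fundname, estamount).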

'''Notes on Creation of Primary Data Set'''

Raw tables
* companies (last investment, first investment, company name, MSA, MSA code, address, state, date founded, known funding, industry)
* funds (fund closing date, last investment, first investment, fund name, address, MSA, MSA code, Average investment, number companies invested (NoCos), known investment)
* rounds (round date, company name, state, round number, stage 1, stage 2, stage 3)
* combined rounds (company name, round date, disclosed amount, investor)
* msalist (changes MSAs to CMSAs— combined MSAs)
*industry list (changes 6 industry categories to 4— ICT, Life Sciences, Semiconductors, Other)

Process
*cleaned tables to eliminate duplications, undisclosed variables
*changed all original characters to include CMSA and Industry Codes (companyinfo3, fundinfocleanfinal, roundinfoclean)
*matched funds to avoid any issues with names (e.g. Fund ABC L.P./Fund ABC LP/Fund ABC)
*matched roundinfoclean investors to fundinfocleanfinal investors (roundinfo.txt >> cleanfundfinal.txt)
*join by round and company conames
*bridge years (1990-2016), stage, and cmsa
* populate data with count of companies (Deal flow) and estimated amount ($)
** data set in 181 hubs folder under summarycmsa.txt (38394)
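
The fund-name matching step in the process above (Fund ABC L.P./Fund ABC LP/Fund ABC) can be illustrated with a small normalization helper. This is only a sketch of one possible approach: the suffix list and the function name are assumptions, and the actual matching used for the matchfunds tables may have differed.

```python
import re

# Trailing legal suffixes to ignore when matching; this list is an assumption.
_SUFFIX = re.compile(r"[\s,]+(l\.?p\.?|llc|inc\.?)$", re.IGNORECASE)

def normalize_fund_name(name):
    """Return a matching key: collapsed spaces, legal suffix removed, lower case."""
    key = re.sub(r"\s+", " ", name.strip())
    key = _SUFFIX.sub("", key)
    return key.lower()
```

Two fund names are then treated as the same fund when their keys are equal.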

Key decisions:
*Threw out undisclosed companies throughout, as they have no address
*Count is done by joining round and company
*Anything fund related must be disclosed fund
*Near and far, and total invested, and fund counts, etc., are all done using disclosed funds that match only

'''Glossary of Tables'''
cleanco — used to remove duplicates from companies
cleanedcompanies — clean set of companies with no duplicates
cmsafunds-
cmsas— list of all CMSAs in final data set (for merging)
cmsastats- statistics not including empty years (pre-merge)
cmsastats2 - statistics separated by year-MSA
cmsastats3— statistics separated by year-MSA-stage
cmsastats4
cmsayears— empty merged table between year and cmsa
cmsayearstage — empty merged table between cmsa/years and stage
combinedrounds— raw sdc data for combined rounds
combinedroundswamt— used to join rounds and combined rounds for roundinfo2
companies- raw SDC company data
companyinfo — cleaned companies joined with state and CMSA information
companyinfo2— companyinfo1 with original industry categories
companyinfo3— companyinfo2 with updated industry categories and codes
companyinfo4-- clean version of companyinfo3
companyround- combined company information with round information
companyround2- combined company information with round information, cleaned up from companyround
companyround3- combined company information with round information, cleaned up from companyround2
'''finaldataset'''- final statistics by CMSA-year, see section Final Primary Data Set for more information
fundinfo— funds joined with CMSA info
fundinfo2 - clean version of fundinfo1
fundinfoclean - used in process to clean fundinfo2
fundinfoclean2- used in process to clean fundinfo2
fundinfocleanfinal- used in process to clean fundinfo2
fundinfocleannodups- final clean set of fundinfo
funds - raw SDC fund data
Houston - analysis for Houston ecosystem team
Houston2- analysis for Houston ecosystem team
houston3- analysis for Houston ecosystem team
industry — new industry codes (4)— used for all future data sets
industrylist— lookup table for new industry codes (went from 6 to 4)
joined1- used for matching process
joined2- used for matching process
matchfund2- used for matching process
matchfunds- used for matching process
matchroundfund - used for matching process
matchroundfund2- used for matching process
msalist — lookup table for MSA to CMSA (used for all future data sets)
nearfar1-- beginning set before adding nearfar/stage variables
nearfar2 -- added binomial variables for near/far and for each of the stages, used to build final dataset
roundfund— not used— joined round to fund; drop/ignore
roundinfo— round info cleaned up to include number of investors in a syndicate and estimated investment per member of syndicate
roundinfo2— roundinfo1 including name of investors/funds
roundinfo3— clean version of roundinfo2
roundinfoclean — final clean version of roundinfo3 (final roundinfo table)
rounds — raw SDC round data
stages — table for merging stage-year-CMSA
superinfo — ignore/drop
temp - used for matching process
years — table for merging stage-year-CMSA
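
The "empty merged" tables in the glossary (cmsayears, cmsayearstage) amount to a cross product of CMSAs, years, and stages that is later populated with counts and amounts. A minimal sketch of that idea, with hypothetical function and column names:

```python
from itertools import product

def build_skeleton(cmsas, years, stages):
    """Return one zero-filled row for every (cmsa, year, stage) combination."""
    return [
        {"cmsa": c, "year": y, "stage": s, "deals": 0, "estamount": 0.0}
        for c, y, s in product(cmsas, years, stages)
    ]

def populate(skeleton, stats):
    """Overwrite zero rows with observed statistics.

    stats maps (cmsa, year, stage) -> (deals, estamount); combinations with
    no observed activity keep their zeros.
    """
    for row in skeleton:
        key = (row["cmsa"], row["year"], row["stage"])
        if key in stats:
            row["deals"], row["estamount"] = stats[key]
    return skeleton
```

Building the skeleton first ensures that CMSA-years with no deals appear in the final data as zeros rather than being absent.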

===Hub Candidates Data Set===

The Hubs candidate data set is a list of potential hubs found in MSAs throughout the country. Researchers are currently pulling qualitative and quantitative information from the candidates' websites in an attempt to categorize what can be identified as a hub. This is a difficult data set to build, as there is little to no quantitative information available for this category of institution, and it depends on what information is publicly accessible on the internet.

Characteristics/Variables
*Year Founded
*Square footage
*LinkedIn self-identifiers (how the organization classifies itself on its LinkedIn profile)
*Activeness on Twitter (binomial)
*Member Directory available online (binomial)
*Number of conference rooms
*Price ($/month) for Flex desk
*Offers Reserved desk (binomial)
*Offers office space for rent (binomial)
*Offers community membership-- not for coworking but for community events, etc. (binomial)
*Number of events offered per month (estimate)
*Offers code academy
*Mission Statement/Vision (for qualitative or key-word analysis)

These characteristics/variables will be used to determine whether a candidate is or is not likely to be a Hub.

As of March 10th 2016, the list contains 125 Hub candidates.

'''Where to find''': The Hubs data set can be found in the Ecosystem>>Hubs>>dataset folder. It is not currently in the database due to a UTF8 issue.

===Supplementary Data Sets===
'''Patent data''': to be pulled from USPTO or SDC Platinum.

'''Number of STEM Graduate Students''' (NSF) and '''University R&D Spending''' (NSF):
*University R&D Data found under file "NSF DATA_2004 to 2011.xlsx" in datasets folder (Ecosystem>>Hubs>>Datasets)
*R&D spending found at the university level for 2014 ("Stem Grad Students.xlsx") or at state level ("Science and Engineering Grad Students by State and Year 2000-2011.csv")
** not uploaded to server or matched yet to CMSA code, because of this discrepancy.
**"Stem Grad Students.xlsx" contains categorized university by MSA, can be used for all university-based projects

'''Per Capita Income''' and '''Employment Data''' (US Census Bureau):
*"Per Capita Personal Income by MSA 2000-2012.xlsx" in datasets folder (Ecosystem>>Hubs>>Datasets>>Data from Yael)
*"Wages and Salaries by MSA 2000-2012.xlsx" in datasets folder (Ecosystem>>Hubs>>datasets>>Data from Yael)
**not uploaded to server or matched yet to CMSA code

'''Firm Births''' (BDS)
*in server 181, under table name "BDS"
*includes birth, death, net(birth-death) and rate(death rate) for years 1990-2013 for every msa
*includes code for CMSA but is not aggregated by CMSA
** i.e. BDS statistics are still separate for all the smaller MSAs in New York's CMSA (code=1)
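
Since the BDS table carries a CMSA code but is not aggregated by CMSA, the roll-up could look like the following. This is a sketch with hypothetical field names. It sums births and deaths and recomputes net, and it deliberately leaves out the rate column, because rates cannot be summed across MSAs.

```python
from collections import defaultdict

def aggregate_bds(rows):
    """Sum MSA-level births and deaths up to (cmsa, year).

    rows: iterable of dicts with keys "cmsa", "year", "birth", "death"
    (hypothetical names for the BDS columns).
    """
    totals = defaultdict(lambda: {"birth": 0, "death": 0})
    for r in rows:
        t = totals[(r["cmsa"], r["year"])]
        t["birth"] += r["birth"]
        t["death"] += r["death"]
    # net is recomputed from the aggregated births and deaths
    return {k: {**v, "net": v["birth"] - v["death"]} for k, v in totals.items()}
```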

===Resources===
* Yael Hochberg and Fehder (2015), located in dropbox
** Use this paper as a guideline on how to conduct the analysis
*US Census Bureau data on employment by MSA: http://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_14_5YR_B23027&prodType=table
*USPTO utility patents by MSA: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/cls_cbsa/allcbsa_gd.htm
*MSA level trends: http://www.metrotrends.org/data.cf

===The Target Dataset===

We will need to process the following variables:
*SuperMSA - combine SanFran and SanJose, New York and Newark?, NC Research triangle, others?
*CSV mapping msas to cmsas is in the folder (and a table in the dbase)

Example dataset:
{| class="wikitable"
! MSA !! Year !! SeedVCInv !! SeedEarlyVCInv !! LaterVCInv !! NoDeals !! FundsInvested !! DistinctInvestors !! ...
|-
| 1234 || 2001 || 1000000 || 20000000 || 30000000 || 4 || 7 || 7 || ...
|}

Note that the unit of observation is MSA-Year.

Variables to be computed at the MSA level:
*HubActive (binary)
*NoHubsActive (Count)
*HubSqFt
*Other Hub Vars (build list!!!)
*'''SeedVCInv''' (Seed/Start-up)
*'''EarlyVCInv''' (Early Stage)
*'''LaterStageVC''' (Later)
*'''OtherStageVC''' (Buyout/Acq, Other)
*'''NoDeals''' (done by local VCs?)
**'''NoDealsNear'''
**'''NoDealsFar'''
*NoPortCosFunded
*'''FundsInv''' (in an MSA)
**'''FundsInvFromNear''' (within MSA?)
**'''FundsInvFromFar''' (outside MSA?)
*DistinctInvestors (?)
**DistinctInvestorsNear (within MSA?)
**DistinctInvestorsFar (outside MSA?)
*PatentCount
*NoSTEMGrads
*FirmBirths (BDS data)
*UniRandDSpend
*PerCapitaIncome
*Employment

We need to:
*Check funds invested means dollars invested
*Categorize near and far! Is it within MSA vs. not, within adjacent MSAs, etc.?

There may be a second dataset that has Hub-Industry-Year (where industry is semiconductor/non-semiconductor?).

===Final Primary Data Set===

*Deal is a round syndicate (near/far deal is one investor that is near/far).

Table name: finaldataset
cmsa
year
totalamountinv--total amount invested
nearamountinv--amount invested from local funds
faramountinv-- amount invested from funds outside CMSA
earlyinv--amount invested in early stage companies
laterinv--amount invested in later stage companies
startupseedinv--amount invested in seed or startup stage companies
otherstageinv--amount invested in Acquisition/Buy-outs/Other stage companies
investingfund--distinct funds that are investing in that CMSA-year
investingfundnear--distinct funds from that CMSA that invested in that CMSA-year
investingfundfar--distinct funds from outside that CMSA that invested in that CMSA-year
deals--number of deals
neardeals--number of deals inside a CMSA
fardeals--number of deals from outside a CMSA --some of these deals might count in both categories, because of syndicate members being both inside and outside the CMSA
earlystagedeals--deals with earlystage companies
laterstagedeals--deals with later stage companies
startupseeddeals--deals with startup/seed companies
otherstagedeals--deals with companies in other stages
newportcosfunded--number of portfolio companies to receive their first investment in that year
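
The near/far deal counts can overlap because a deal is a round syndicate, and a syndicate may include funds both inside and outside the company's CMSA. A minimal sketch of that counting rule, using hypothetical field names:

```python
def count_deals(rounds):
    """Count deals, near deals, and far deals for one CMSA-year.

    rounds: list of dicts with "cmsa" (the company's CMSA) and "fund_cmsas"
    (the CMSAs of the disclosed funds in the syndicate). A round counts as
    near if at least one fund is local and as far if at least one fund is
    not, so a mixed syndicate is counted in both.
    """
    counts = {"deals": 0, "neardeals": 0, "fardeals": 0}
    for r in rounds:
        counts["deals"] += 1
        if any(f == r["cmsa"] for f in r["fund_cmsas"]):
            counts["neardeals"] += 1
        if any(f != r["cmsa"] for f in r["fund_cmsas"]):
            counts["fardeals"] += 1
    return counts
```

As a result, neardeals plus fardeals can exceed deals.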

===Data by zip code===
*Population data, 2000-2016 - US Census Bureau (E:\McNair\Hubs\summer 2017)
https://www2.census.gov/programs-surveys/popest/datasets/
*Income data, 1998-2014 - The Internal Revenue Service (E:\McNair\Hubs\summer 2017)
https://www.irs.gov/uac/about-irs
*DCI index, to assess the economic well-being of communities
http://eig.org/dci/interactive-maps/u-s-zip-codes
*R&D Expenses, 1980-2016 - Wharton Research Data Services (E:\McNair\Hubs\summer 2017)
*Zipcode look-up table obtained from https://www.unitedstateszipcodes.org/zip-code-database/. It's available in (E:\McNair\Hubs\summer 2017).

== Data by MSA ==

We have principal cities of MSAs from the census:
https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html

We might be able to go City -> FIPS place code -> MSA?

Cities and their FIPS codes (which don't perfectly correspond) are available from https://www.census.gov/geo/reference/codes/place.html

The Census claims to provide city to MSA here: https://www.census.gov/geo/maps-data/data/ua_rel_download.html
However, there is only CBSA!

This might do it: https://www2.census.gov/geo/pdfs/maps-data/data/rel/explanation_ua_cbsa_rel_10.pdf
We can maybe track city to principal city to MSA
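
If the City -> FIPS place code -> MSA route works, the lookup is a chain of two mappings. The sketch below assumes both mappings have already been loaded from the Census files. The function name, the key format, and the placeholder codes in the usage example are all hypothetical.

```python
def city_to_msa(city, state, place_codes, principal_cities):
    """Chain city -> FIPS place code -> MSA code.

    place_codes: maps (lower-case city, upper-case state) to a place code.
    principal_cities: maps a place code to an MSA code.
    Returns None when either step has no match.
    """
    place = place_codes.get((city.strip().lower(), state.strip().upper()))
    if place is None:
        return None
    return principal_cities.get(place)
```

For example, with placeholder codes:

```python
place_codes = {("exampleville", "TX"): "4800000"}   # placeholder place code
principal_cities = {"4800000": "99999"}             # placeholder MSA code
city_to_msa("Exampleville ", "tx", place_codes, principal_cities)
```

Cities that are not principal cities would return None under this scheme, which matches the caveat that city names and FIPS codes do not correspond perfectly.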

==COMPUSTAT Data==

Data is in:
E:\McNair\Projects\Hubs\Summer 2017
Z:\Hubs\2017

Database is '''cities'''

SQL script is: COMPUSTAT.sql

The source file is RandDExpenditures.txt. It contains:
*Data from 1980-2017 (July). All COMPUSTAT.
*427799 records
*Fields include:
**R&D Expenditure
**Address (inc. city, zip, state)

Output file is COMPUSTATSummary.txt. It contains:
*Variables: City, year, No.public firms, sum R&D, sum Sales, sum total assets
*1979-2016
*4440 cities
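
The summary file is a city-year aggregation of the firm-level records. The following sketch shows the shape of that aggregation. The field names are hypothetical, and treating a missing value as 0 in the sums is an assumption, not something confirmed by COMPUSTAT.sql.

```python
from collections import defaultdict

def summarize_compustat(records):
    """Aggregate firm-year records to (city, year).

    records: dicts with "city", "year", "firm_id", "rd", "sales", "assets".
    Missing (None) amounts are treated as 0 in the sums (an assumption).
    """
    acc = defaultdict(lambda: {"firms": set(), "rd": 0.0, "sales": 0.0, "assets": 0.0})
    for r in records:
        a = acc[(r["city"], r["year"])]
        a["firms"].add(r["firm_id"])
        for field in ("rd", "sales", "assets"):
            a[field] += r.get(field) or 0.0
    return {
        key: {"nofirms": len(a["firms"]), "rd": a["rd"],
              "sales": a["sales"], "assets": a["assets"]}
        for key, a in acc.items()
    }
```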

==NSF Data==
Data is in:
E:\McNair\Projects\Hubs\Summer 2017
Z:\Hubs\2017

Database is '''cities'''

SQL script is: nsf_2017.sql

The source files are: nsf2017.txt, copied from table '''nsf''', and nsf_institution copied from table '''nsf_grants_institution''' from the biotech db.

They contain:
*Award ID
*Award Institution
*Award Effective date
*Institution city
*Award Value
*Organization state code
From 1900 - 2017

Output file is nsfSummary.txt. It contains:
*Variables: City, State code, year, nsf_nogrants, nsf_valuegrant
*1900-2017

===Joined NSF table===
The NSF table joined with the VC table is in db '''cities'''. The table is named '''merged_nsf'''.
Missing values of nogrants and valuegrant for the years 1990-2017 are set to 0.
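
The join and zero-fill can be sketched as a left join from the VC rows onto the NSF summary. This sketch assumes the join key is city_state_year, and the function name and key format are hypothetical. Rows outside 1990-2017 with no NSF match are left missing rather than zero-filled.

```python
def join_nsf_to_vc(vc_rows, nsf_summary, first_year=1990, last_year=2017):
    """Left-join NSF grant totals onto VC rows.

    vc_rows: dicts with at least "city_state_year" and "year".
    nsf_summary: maps city_state_year -> (nogrants, valuegrant).
    Unmatched rows get 0 inside [first_year, last_year] and None outside.
    """
    joined = []
    for row in vc_rows:
        match = nsf_summary.get(row["city_state_year"])
        if match is not None:
            nogrants, valuegrant = match
        elif first_year <= row["year"] <= last_year:
            nogrants, valuegrant = 0, 0.0
        else:
            nogrants, valuegrant = None, None
        joined.append({**row, "nogrants": nogrants, "valuegrant": valuegrant})
    return joined
```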

==NIH Data==
Data is in:
Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is '''cities'''
SQL script is: nih2017.sql
The source files are:
*nih_1986_2001.csv
*nih_2002_2012.txt
*nih_2013_2015
located in E:\McNair\Projects\Federal Grant Data\NIH

The script that cleans NIH data and generates the summary table is titled '''nihSummary'''. It is located here:

E:\McNair\Projects\Hubs\Summer 2017

This table includes
*year
*city
*state
*country
*nogrants (number of grants)
*valuegrant
*city_state (the city-state ID that we'll merge on)

*Data from 1986-2015
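
Merging on a city-state ID only works if the key is built the same way in every table. The helper below is a sketch of one consistent convention. The separator and upper-casing are assumptions, and the actual format of city_state, city_state_id, and city_state_year in the '''cities''' database is not documented here.

```python
def city_state_id(city, state):
    """Build a merge key from a city and state (hypothetical format CITY_ST)."""
    city_key = " ".join(city.split()).upper()   # trim and collapse whitespace
    return f"{city_key}_{state.strip().upper()}"

def city_state_year(city, state, year):
    """Extend the city-state key with the year to identify a city-year."""
    return f"{city_state_id(city, state)}_{int(year)}"
```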

==Clinical Trials Data==
Data is in:
Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is '''cities'''
SQL script is: ctrials.sql
The source file is:

*medclinical.txt

located in Z:\Hubs\2017

*Data from 1999-2017

===Joined clinical trials data===

The file which contains the number of trials in each city and year is located in:
Z:\Hubs\2017

The file is in:
Z:\Hubs\2017\clean data
The name of the file is:
ctrialsSummary.txt

It contains:
*city
*year
*city_state_year
*noctrials - number of trials

The ctrials table is joined with the VC table.
The joined SQL script is: '''new_ctrials.sql''' and it is located in
Z:\Hubs\2017\sql scripts

The name of the joined table is '''new_merged_ctrials'''.

It contains:
*city
*state
*city_state_id
*city_state_year
*year
*noctrials
*seedamtm
*earlyamtm
*lateramtm
*selamtm
*numseeds
*numearly
*numlater
*numsel

Missing values of noctrials for the years 1999-2017 are set to 0.

==Population Data==
Data is in:
Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is '''cities'''

SQL script is: '''population.sql'''
The source files are:
*pop2000_2009.xlsx
*pop2010_2016.xlsx

They contain:
*State
*City name
*Year
*Population Estimates

Data from 2000-2016

===Joined population data===

Data is in:
Z:\Hubs\2017\clean data
The file names are
1_population.txt - contains data on population estimates from 2000-2009
2_population.txt - contains data on population estimates from 2010-2016

Database is '''cities'''
SQL script is: '''new_population.sql''', located in
Z:\Hubs\2017\sql scripts

The population table is joined on the VC table. The resulting table is called '''new_merged_population'''.

It contains:
*City
*State
*city_state_id to uniquely identify each city
*city_state_year to uniquely identify each city in each year
*Population estimates
*Year
*Code from the state code and Fips code
*State full name

==Income Data==

Raw data was obtained from the US Census Bureau's American Community Survey.

Raw Data is in:
E:\McNair\Projects\Hubs\Summer 2017\MSA Income_raw.zip

Data from 2005-2015

The master list with MSAs and principal cities is titled '''list2.xls'''. It is located at:
Z:\Hubs\2017

This master list includes:
*MSA code
*MSA name
*Principal City
*State
*Place code (city code)
*State Code

This master list was edited to associate each principal city with a unique state. E.g. if New York is the principal city located in New York-New Jersey MSA, it was associated with state NY-NJ. So '''list''' was edited to put New York with NY.

Cleaned Income data files are in
Z:\Hubs\2017\merging_on_ID

They contain:
*MSA code
*MSA
*Year
*Total Household Income

The MSA-City-State look up file is titled '''msa_city_state_wcode.txt'''. It is located in
Z:\Hubs\2017\merging_on_ID

The SQL file that merges income data from ACS (by MSA - Year) with the MSA-City file is titled '''income.sql'''. It is located here:
Z:\Hubs\2017\sql scripts

The final income table is in db '''cities''' titled '''merged_income'''.

It includes:
*MSA
*City
*State
*Year
*Total Household Income

The table includes 8780 observations.

===Joined income data===

Data is in:
Z:\Hubs\clean data
The file names are:
INC_05.txt - INC_15.txt

Database is '''cities'''
SQL script is: merged_income.sql

They contain:
*City
*State
*city_state_id to uniquely identify each city
*Income
*Year
*Code from the state code and Fips code

==Employment Data==

Data on employment was obtained from the American Community Survey, US Census Bureau.

Raw Data is in:
E:\McNair\Projects\Hubs\Summer 2017\Employment Data by MSA
Cleaned files are in
Z:\Hubs\2017\clean data

They contain:
*MSA code
*MSA
*Year
*Employment rate of individuals 16 years or older
*Unemployment rate of individuals 16 years or older

Data from 2005-2015

The SQL file that merges employment data from ACS (by MSA - Year) with the MSA-City file is titled '''Employment.sql'''.
The file is located in:
Z:\Hubs\2017

The final table is in db '''cities''' titled '''merged_employment'''.

It includes:
*MSA
*City
*Year
*Employment rate
*Unemployment rate

===Joined employment data===

Data is in:
Z:\Hubs\clean data

The file names are:
EMP_05.txt - EMP_15.txt

Database is '''cities'''
SQL script is: '''new_employment.sql''' and it is located in
Z:\Hubs\2017\sql scripts

The final table, which is joined on the VC table, is in db '''cities''' and is titled '''new_merged_employment'''.

It contains:
*City
*State
*Code from the state code and Fips code
*city_state_id to uniquely identify each city
*city_state_year to uniquely identify each city in each year
*Employment rates of individuals of 16 years or older
*Unemployment rates of individuals of 16 years or older
*Year

==Schooling Data==

Data on schooling was obtained from the American Community Survey, US Census Bureau.

Raw Data is in:
E:\McNair\Projects\Hubs\Summer 2017\School Enrollment Data by MSA
Cleaned files are in
Z:\Hubs\2017\clean data

They contain:
*MSA code
*MSA
*Year
*Total number of population 3 years and over enrolled in school
*Percent of population 3 years and over enrolled in public school
*Percent of population 3 years and over enrolled in private school

Data from 2005-2015

The SQL file that merges schooling data from ACS (by MSA - Year) with the MSA-City file is titled '''schooling.sql'''.
The file is located in:
Z:\Hubs\2017

The final table is in db '''cities''' titled '''merged_schooling'''.

It includes:
*MSA
*City
*Year
*Total
*Percent_public_schooling
*Percent_private_schooling

===Joined schooling data===

Data is in:
Z:\Hubs\clean data
The file names are:
SCH_05.txt - SCH_15.txt

Database is '''cities'''
SQL script which joins this table with VC table is: '''new_merged_schooling.sql'''
The final table is in db '''cities''' titled '''new_merged_schooling'''.

It contains:
*City
*State
*city_state_id to uniquely identify each city
*city_state_year to uniquely identify each city in each year
*Total number of school enrollment
*Percentage enrolled in public schools
*Percentage enrolled in private schools
*Year
*Code from the state code and Fips code

==VC Data==

Raw Data is in:
Z:\VentureCapitalData\SDCVCData
The file name is roundcitystateyear.txt

It contains:
*city
*state
*year
*seedamtm - seed, amount in millions
*earlyamtm - early, amount in millions
*lateramtm - late, amount in millions
*selamtm - seed early late, amount in millions
*numseeds - number of seeds
*numearly
*numlater
*numsel

Data from 1953-2017

The table is in db '''cities''' titled '''vc'''.

It includes:
*city
*state
*city_state_id
*city_state_year
*seedamtm
*earlyamtm
*lateramtm
*selamtm
*numseeds
*numearly
*numlater
*numsel
*year

Hubs

2017-07-12T21:10:45Z

KerdaV: /* Joined clinical trials data */

{{McNair Projects
|Has title=Hubs
|Has owner=Hira Farooqi,
|Has keywords=Data
|Has project status=Active
}}
The Hubs Research Project is a full-length academic paper analyzing the effectiveness of "hubs", a component of the entrepreneurship ecosystem, in the advancement and growth of entrepreneurial success in a metropolitan area.

This research will primarily focus on large and mid-sized Metropolitan Statistical Areas (MSAs), as that is where the great majority of Venture Capital funding is located.

===Primary Data Set===
The Hubs data set, from SDC Platinum, has been constructed in the server:
Data files are in 128.42.44.181/bulk/Hubs
All files are in 128.42.44.182/bulk/Projects/Ecosystem/Hubs
psql Hubs

The data set includes all United States Venture Capital transactions (moneytree) from 1990 through 2015.
Data has been aggregated at the portfolio company, fund, and round level, and will be analyzed at the combined MSA level. We will look at the number of companies funded, the number of funds active, and the flow of investment in a given MSA.

The data set has now been uploaded to the database server, named Hubs.
There are 4 tables:
*Rounds: Rounddate, coname, state, roundno, stage1, etc.
*CombinedRounds: Coname, rounddate, discamount, fundname
*Companies: LastInv, FirstInv, coname, MSA, MSACode, Address, state, datefounded, totalknownfunding, industry(major)
*Funds: fundname, closingdate, lastinv, firstinv, msa, msacode, avinv, nocoinv, totalknowninv, address

Used variables:

Companies: Coname, MSACode, Industry, state
MSALookupTable: MSACode, MSASuper
IndustryLookupTable: IndustryMajor, InduCode
->
CompanyInfo: Coname, MSASuper, InduCode, state (complete)

Funds: fundname, msacode, state
MSALookupTable: MSACode, MSASuper
->
FundInfo: fundname, msacode, state (complete)

Rounds: coname, rounddate, stagecode, roundno
CombinedRounds: coname, rounddate, discamount, fundname
->
RoundInfoSuper: coname, rounddate, '''nofunds''', discamount
->
RoundInfo: Coname, roundyear, fundname, estamount (complete)

Then take:
RoundInfo: Coname, roundyear, fundname, estamount
CompanyInfo: Coname, MSASuper, InduCode, state
FundInfo: fundname, msacode, state
->
SuperRoundInfo: Coname, CoMSASuper, CoInduCode, CoState, FundName, FundMSASuper, FundState, RoundYear, RoundEstAmount
->
MSAPortCos: Count(CoName) As NoPortCosFunded, CoMSASuper, RoundYear
...

'''Notes on Creation of Primary Data Set'''

Raw tables
* companies (last investment, first investment, company name, MSA, MSA code, address, state, date founded, known funding, industry)
* funds (fund closing date, last investment, first investment, fund name, address, MSA, MSA code, Average investment, number companies invested (NoCos), known investment)
* rounds (round date, company name, state, round number, stage 1, stage 2, stage 3)
* combined rounds (company name, round date, disclosed amount, investor)
* msalist (changes MSAs to CMSAs— combined MSAs)
*industry list (changes 6 industry categories to 4— ICT, Life Sciences, Semiconductors, Other)

Process
*cleaned tables to eliminate duplications, undisclosed variables
*changed all original characters to include CMSA and Industry Codes (companyinfo3, fundinfocleanfinal, roundinfoclean)
*matched funds to avoid any issues with names (e.g. Fund ABC L.P./Fund ABC LP/Fund ABC)
*matched roundinfoclean investors to fundinfocleanfinal investors (roundinfo.txt >> cleanfundfinal.txt)
*join by round and company conames
*bridge years (1990-2016), stage, and cmsa
* populate data with count of companies (Deal flow) and estimated amount ($)
** data set in 181 hubs folder under summarycmsa.txt (38394)

Key decisions:
*Threw out undisclosed companies throughout, as they have no address
*Count is done by joining round and company
*Anything fund related must be disclosed fund
*Near and far, and total invested, and fund counts, etc., are all done using disclosed funds that match only

'''Glossary of Tables'''
cleanco — used to remove duplicates from companies
cleanedcompanies — clean set of companies with no duplicates
cmsafunds-
cmsas— list of all CMSAs in final data set (for merging)
cmsastats- statistics not including empty years (pre-merge)
cmsastats2 - statistics separated by year-MSA
cmsastats3— statistics separated by year-MSA-stage
cmsastats4
cmsayears— empty merged table between year and cmsa
cmsayearstage — empty merged table between cmsa/years and stage
combinedrounds— raw sdc data for combined rounds
combinedroundswamt— used to join rounds and combined rounds for roundinfo2
companies- raw SDC company data
companyinfo — cleaned companies joined with state and CMSA information
companyinfo2— companyinfo1 with original industry categories
companyinfo3— companyinfo2 with updated industry categories and codes
companyinfo4-- clean version of companyinfo3
companyround- combined company information with round information
companyround2- combined company information with round information, cleaned up from companyround
companyround3- combined company information with round information, cleaned up from companyround2
'''finaldataset'''- final statistics by CMSA-year, see section Final Primary Data Set for more information
fundinfo— funds joined with CMSA info
fundinfo2 - clean version of fundinfo1
fundinfoclean - used in process to clean fundinfo2
fundinfoclean2- used in process to clean fundinfo2
fundinfocleanfinal- used in process to clean fundinfo2
fundinfocleannodups- final clean set of fundinfo
funds - raw SDC fund data
Houston - analysis for Houston ecosystem team
Houston2- analysis for Houston ecosystem team
houston3- analysis for Houston ecosystem team
industry — new industry codes (4)— used for all future data sets
industrylist— lookup table for new industry codes (went from 6 to 4)
joined1- used for matching process
joined2- used for matching process
matchfund2- used for matching process
matchfunds- used for matching process
matchroundfund - used for matching process
matchroundfund2- used for matching process
msalist — lookup table for MSA to CMSA (used for all future data sets)
nearfar1-- beginning set before adding nearfar/stage variables
nearfar2 -- added binomial variables for near/far and for each of the stages, used to build final dataset
roundfund— not used— joined round to fund; drop/ignore
roundinfo— round info cleaned up to include number of investors in a syndicate and estimated investment per member of syndicate
roundinfo2— roundinfo1 including name of investors/funds
roundinfo3— clean version of roundinfo2
roundinfoclean — final clean version of roundinfo3 (final roundinfo table)
rounds — raw SDC round data
stages — table for merging stage-year-CMSA
superinfo — ignore/drop
temp - used for matching process
years — table for merging stage-year-CMSA

===Hub Candidates Data Set===

The Hubs candidate data set is a list of potential hubs found in MSAs throughout the country. Researchers are currently pulling qualitative and quantitative information from the candidates' websites in an attempt to categorize what can be identified as a hub. This is a difficult data set to build, as there is little to no quantitative information available for this category of institution, and it depends on what information is publicly accessible on the internet.

Characteristics/Variables
*Year Founded
*Square footage
*LinkedIn self-identifiers (how the organization classifies itself on its LinkedIn profile)
*Activeness on Twitter (binomial)
*Member Directory available online (binomial)
*Number of conference rooms
*Price ($/month) for Flex desk
*Offers Reserved desk (binomial)
*Offers office space for rent (binomial)
*Offers community membership-- not for coworking but for community events, etc. (binomial)
*Number of events offered per month (estimate)
*Offers code academy
*Mission Statement/Vision (for qualitative or key-word analysis)

These characteristics/variables will be used to determine whether a candidate is or is not likely to be a Hub.

As of March 10th 2016, the list contains 125 Hub candidates.

'''Where to find''': The Hubs data set can be found in the Ecosystem>>Hubs>>dataset folder. It is not currently in the database due to a UTF8 issue.

===Supplementary Data Sets===
'''Patent data''': to be pulled from USPTO or SDC Platinum.

'''Number of STEM Graduate Students''' (NSF) and '''University R&D Spending''' (NSF):
*University R&D Data found under file "NSF DATA_2004 to 2011.xlsx" in datasets folder (Ecosystem>>Hubs>>Datasets)
*R&D spending found at the university level for 2014 ("Stem Grad Students.xlsx") or at state level ("Science and Engineering Grad Students by State and Year 2000-2011.csv")
** not uploaded to server or matched yet to CMSA code, because of this discrepancy.
**"Stem Grad Students.xlsx" contains categorized university by MSA, can be used for all university-based projects

'''Per Capita Income''' and '''Employment Data''' (US Census Bureau):
*"Per Capita Personal Income by MSA 2000-2012.xlsx" in datasets folder (Ecosystem>>Hubs>>Datasets>>Data from Yael)
*"Wages and Salaries by MSA 2000-2012.xlsx" in datasets folder (Ecosystem>>Hubs>>datasets>>Data from Yael)
**not uploaded to server or matched yet to CMSA code

'''Firm Births''' (BDS)
*in server 181, under table name "BDS"
*includes birth, death, net (birth minus death), and rate (death rate) for years 1990-2013 for every MSA
*includes code for CMSA but is not aggregated by CMSA
** i.e. BDS statistics are still separate for all the smaller MSAs in New York's CMSA (code=1)

===Resources===
* Hochberg and Fehder (2015), located in Dropbox
** Use this paper as a guideline on how to conduct the analysis
*US Census Bureau data on employment by MSA: http://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_14_5YR_B23027&prodType=table
*USPTO utility patents by MSA: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/cls_cbsa/allcbsa_gd.htm
*MSA level trends: http://www.metrotrends.org/data.cf

===The Target Dataset===

We will need to process the following variables:
*SuperMSA - combine San Francisco and San Jose, New York and Newark?, NC Research Triangle, others?
*A CSV mapping MSAs to CMSAs is in the folder (and in a table in the database)

Example dataset:
 MSA   Year  SeedVCInv  SeedEarlyVCInv  LaterVCInv  NoDeals  FundsInvested  DistinctInvestors  ....
 ----  ----  ---------  --------------  ----------  -------  -------------  -----------------
 1234  2001  1000000    20000000        30000000    4        7              7

Note that the unit of observation is MSA-Year.
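
As a rough illustration of what an MSA-Year unit of observation implies, the sketch below collapses round-level records into one row per MSA and year. The input records, the stage labels, and the function name are hypothetical, and only a few of the target columns are shown (using the names from the variable list rather than the example header).

```python
from collections import defaultdict

# Hypothetical round-level records: (msa_code, year, stage, amount_usd).
rounds = [
    (1234, 2001, "seed", 1_000_000),
    (1234, 2001, "early", 20_000_000),
    (1234, 2001, "later", 25_000_000),
    (1234, 2001, "later", 5_000_000),
    (5678, 2001, "seed", 500_000),
]

# Assumed mapping from stage label to the target column it feeds.
STAGE_COLUMNS = {"seed": "SeedVCInv", "early": "EarlyVCInv", "later": "LaterVCInv"}


def aggregate_msa_year(records):
    """Collapse round-level records to one row per (MSA, Year)."""
    panel = defaultdict(
        lambda: {"SeedVCInv": 0, "EarlyVCInv": 0, "LaterVCInv": 0, "NoDeals": 0}
    )
    for msa, year, stage, amount in records:
        row = panel[(msa, year)]
        row[STAGE_COLUMNS[stage]] += amount  # dollars invested, by stage
        row["NoDeals"] += 1                  # each record counts as one deal
    return dict(panel)


panel = aggregate_msa_year(rounds)
```

Each key of <code>panel</code> is one (MSA, Year) observation, so further MSA-level variables could be attached by the same key.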

Variables to be computed at the MSA level:
*HubActive (binary)
*NoHubsActive (Count)
*HubSqFt
*Other Hub Vars (build list!!!)
*'''SeedVCInv''' (Seed/Start-up)
*'''EarlyVCInv''' (Early Stage)
*'''LaterStageVC''' (Later)
*'''OtherStageVC''' (Buyout/Acq, Other)
*'''NoDeals''' (done by local VCs?)
**'''NoDealsNear'''
**'''NoDealsFar'''
*NoPortCosFunded
*'''FundsInv''' (in an MSA)
**'''FundsInvFromNear''' (within MSA?)
**'''FundsInvFromFar''' (outside MSA?)
*DistinctInvestors (?)
**DistinctInvestorsNear (within MSA?)
**DistinctInvestorsFar (outside MSA?)
*PatentCount
*NoSTEMGrads
*FirmBirths (BDS data)
*UniRandDSpend
*PerCapitaIncome
*Employment

We need to:
*Check funds invested means dollars invested
*Categorize near and far! Is it within MSA vs. not, within adjacent MSAs, etc.?

There may be a second dataset that has Hub-Industry-Year (where industry is semiconductor/non-semiconductor?).

===Final Primary Data Set===

*A deal is a round syndicate. A near (far) deal is one with at least one investor that is near (far).

Table name: finaldataset
cmsa
year
totalamountinv--total amount invested
nearamountinv--amount invested from local funds
faramountinv-- amount invested from funds outside CMSA
earlyinv--amount invested in early stage companies
laterinv--amount invested in later stage companies
startupseedinv--amount invested in seed or startup stage companies
otherstageinv--amount invested in Acquisition/Buy-outs/Other stage companies
investingfund--distinct funds that are investing in that CMSA-year
investingfundnear--distinct funds from that CMSA that invested in that CMSA-year
investingfundfar--distinct funds from outside that CMSA that invested in that CMSA-year
deals--number of deals
neardeals--number of deals with at least one investor from inside the CMSA
fardeals--number of deals with at least one investor from outside the CMSA (some deals may count in both categories, because their syndicates include members both inside and outside the CMSA)
earlystagedeals--deals with earlystage companies
laterstagedeals--deals with later stage companies
startupseeddeals--deals with startup/seed companies
otherstagedeals--deals with companies in other stages
newportcosfunded--number of portfolio companies to receive their first investment in that year
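
The near/far deal counts can be sketched as follows. This is a minimal illustration, assuming that each deal is represented by the portfolio company's CMSA code plus the CMSA codes of its disclosed, matched investors; the function names and sample codes are hypothetical and this is not the project's SQL.

```python
def classify_deal(company_cmsa, investor_cmsas):
    """Return (is_near, is_far) for one round syndicate.

    A deal is near if at least one investor shares the company's CMSA,
    and far if at least one investor is from another CMSA, so a mixed
    syndicate is counted in both categories.
    """
    is_near = any(c == company_cmsa for c in investor_cmsas)
    is_far = any(c != company_cmsa for c in investor_cmsas)
    return is_near, is_far


def count_deals(deals):
    """Tally deals, neardeals and fardeals over (company_cmsa, investor_cmsas) pairs."""
    totals = {"deals": 0, "neardeals": 0, "fardeals": 0}
    for company_cmsa, investor_cmsas in deals:
        is_near, is_far = classify_deal(company_cmsa, investor_cmsas)
        totals["deals"] += 1
        totals["neardeals"] += int(is_near)
        totals["fardeals"] += int(is_far)
    return totals


# Hypothetical deals in CMSA 1.
deals = [
    (1, [1, 1]),  # all investors local
    (1, [2]),     # all investors from outside
    (1, [1, 3]),  # mixed syndicate, counted as both near and far
]
totals = count_deals(deals)
```

Because mixed syndicates are counted twice, neardeals plus fardeals can exceed deals.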

===Data by zip code===
*Population data, 2000-2016 - US Census Bureau (E:\McNair\Hubs\summer 2017)
https://www2.census.gov/programs-surveys/popest/datasets/
*Income data, 1998-2014 - The Internal Revenue Service (E:\McNair\Hubs\summer 2017)
https://www.irs.gov/uac/about-irs
*DCI index, to assess the economic well-being of communities
http://eig.org/dci/interactive-maps/u-s-zip-codes
*R&D Expenses, 1980-2016 - Wharton Research Data Services (E:\McNair\Hubs\summer 2017)
*Zipcode look-up table obtained from https://www.unitedstateszipcodes.org/zip-code-database/. It's available in (E:\McNair\Hubs\summer 2017).

== Data by MSA ==

We have principal cities of MSAs from the census:
https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html

We might be able to go City -> FIPS place code -> MSA?

Cities and their FIPS codes (which don't perfectly correspond) are available from https://www.census.gov/geo/reference/codes/place.html

The Census claims to provide city to MSA here: https://www.census.gov/geo/maps-data/data/ua_rel_download.html
However, there is only CBSA!

This might do it: https://www2.census.gov/geo/pdfs/maps-data/data/rel/explanation_ua_cbsa_rel_10.pdf
We can maybe track city to principal city to MSA

==COMPUSTAT Data==

Data is in:
E:\McNair\Projects\Hubs\Summer 2017
Z:\Hubs\2017

Database is '''cities'''

SQL script is: COMPUSTAT.sql

The source file is RandDExpenditures.txt. It contains:
*Data from 1980-2017 (July). All COMPUSTAT.
*427799 records
*Fields include:
**R&D Expenditure
**Address (inc. city, zip, state)

Output file is COMPUSTATSummary.txt. It contains:
*Variables: City, year, No.public firms, sum R&D, sum Sales, sum total assets
*1979-2016
*4440 cities

==NSF Data==
Data is in:
E:\McNair\Projects\Hubs\Summer 2017
Z:\Hubs\2017

Database is '''cities'''

SQL script is: nsf_2017.sql

The source files are: nsf2017.txt, copied from table '''nsf''', and nsf_institution copied from table '''nsf_grants_institution''' from the biotech db.

They contain:
*Award ID
*Award Institution
*Award Effective date
*Institution city
*Award Value
*Organization state code
From 1900 - 2017

Output file is nsfSummary.txt. It contains:
*Variables: City, State code, year, nsf_nogrants, nsf_valuegrant
*1900-2017

==NIH Data==
Data is in:
Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is '''cities'''
SQL script is: nih2017.sql
The source files are:
*nih_1986_2001.csv
*nih_2002_2012.txt
*nih_2013_2015
located in E:\McNair\Projects\Federal Grant Data\NIH

The script that cleans NIH data and generates the summary table is titled '''nihSummary'''. It is located here:

E:\McNair\Projects\Hubs\Summer 2017

This table includes
*year
*city
*state
*country
*nogrants (number of grants)
*valuegrant
*city_state (the city-state ID that we'll merge on)

*Data from 1986-2015

==Clinical Trials Data==
Data is in:
Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is '''cities'''
SQL script is: ctrials.sql
The source file is:

*medclinical.txt

located in Z:\Hubs\2017

*Data from 1999-2017

===Joined clinical trials data===

The file which contains the number of trials in each city and year is located in:
Z:\Hubs\2017

The file is in:
Z:\Hubs\2017\clean data
The name of the file is:
ctrialsSummary.txt

It contains:
*city
*year
*city_state_year
*noctrials - number of trials

The ctrials table is joined with the VC table.
The SQL script that performs the join is '''new_ctrials.sql''', located in
Z:\Hubs\2017\sql scripts

The name of the joined table is '''new_merged_ctrials'''.

It contains:
*city
*state
*city_state_id
*city_state_year
*year
*noctrials
*seedamtm
*earlyamtm
*lateramtm
*selamtm
*numseeds
*numearly
*numlater
*numsel

All the values of noctrials with missing values for years 1999-2017 are set equal to 0.
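
The zero-filling rule can be sketched as a left join from VC city-year rows to trial counts. This is only an illustration: the key format (e.g. HOUSTON_TX_2005), the sample rows, and the function name are assumptions, and the actual join is done in '''new_ctrials.sql'''.

```python
def merge_trials(vc_rows, trial_counts, first_year=1999, last_year=2017):
    """Left-join trial counts onto VC city-year rows.

    Rows with no matching count get noctrials = 0 when the year falls in
    the period covered by the trials data, and None (missing) otherwise.
    """
    merged = []
    for row in vc_rows:
        noctrials = trial_counts.get(row["city_state_year"])
        if noctrials is None and first_year <= row["year"] <= last_year:
            noctrials = 0
        merged.append({**row, "noctrials": noctrials})
    return merged


# Hypothetical VC rows and trial counts keyed by city_state_year.
vc_rows = [
    {"city_state_year": "HOUSTON_TX_1998", "year": 1998},
    {"city_state_year": "HOUSTON_TX_2005", "year": 2005},
    {"city_state_year": "HOUSTON_TX_2010", "year": 2010},
]
trial_counts = {"HOUSTON_TX_2010": 12}
merged = merge_trials(vc_rows, trial_counts)
```

Years outside 1999-2017 are left missing rather than set to 0, since the trials data does not cover them.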

==Population Data==
Data is in:
Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is '''cities'''

SQL script is: '''population.sql'''
The source files are:
*pop2000_2009.xlsx
*pop2010_2016.xlsx

They contain:
*State
*City name
*Year
*Population Estimates

Data from 2000-2016

===Joined population data===

Data is in:
Z:\Hubs\2017\clean data
The file names are
1_population.txt - contains data on population estimates from 2000-2009
2_population.txt - contains data on population estimates from 2010-2016

Database is '''cities'''
SQL script is: '''new_population.sql''', located in
Z:\Hubs\2017\sql scripts

The population table is joined with the VC table. The resulting table is called '''new_merged_population'''.

It contains:
*City
*State
*city_state_id to uniquely identify each city
*city_state_year to uniquely identify each city in each year
*Population estimates
*Year
*Code from the state code and FIPS code
*State full name
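
The page does not record the exact format of city_state_id and city_state_year, so the following is only a sketch of one way such merge keys could be built. The upper-casing, whitespace normalization, and underscore separator are assumptions for illustration.

```python
def city_state_id(city, state):
    """Build a key that identifies a city within a state (assumed format)."""
    city_clean = " ".join(city.split()).upper()  # collapse repeated whitespace
    return f"{city_clean}_{state.strip().upper()}"


def city_state_year(city, state, year):
    """Build a key that identifies a city in a given year (assumed format)."""
    return f"{city_state_id(city, state)}_{int(year)}"


example_id = city_state_id(" New  York ", "ny")
example_key = city_state_year("Houston", "TX", 2005)
```

Normalizing case and whitespace before joining reduces the chance that the same city fails to match across tables because of formatting differences.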

==Income Data==

Raw data was obtained from the US Census Bureau's American Community Survey.

Raw Data is in:
E:\McNair\Projects\Hubs\Summer 2017\MSA Income_raw.zip

Data from 2005-2015

The master list with MSAs and principal cities is titled '''list2.xls'''. It is located at:
Z:\Hubs\2017

This master list includes:
*MSA code
*MSA name
*Principal City
*State
*Place code (city code)
*State Code

This master list was edited so that each principal city is associated with a single state. For example, New York is a principal city of the New York-New Jersey MSA and was originally associated with the state NY-NJ, so the list was edited to associate New York with NY.

Cleaned Income data files are in
Z:\Hubs\2017\merging_on_ID

They contain:
*MSA code
*MSA
*Year
*Total Household Income

The MSA-City-State look up file is titled '''msa_city_state_wcode.txt'''. It is located in
Z:\Hubs\2017\merging_on_ID

The SQL file that merges income data from ACS (by MSA - Year) with the MSA-City file is titled '''income.sql'''. It is located here:
Z:\Hubs\2017\sql scripts

The final income table is in db '''cities''' titled '''merged_income'''.

It includes:
*MSA
*City
*State
*Year
*Total Household Income

The table includes 8780 observations

===Joined income data===

Data is in:
Z:\Hubs\clean data
The file names are:
INC_05.txt - INC_15.txt

Database is '''cities'''
SQL script is: merged_income.sql

They contain:
*City
*State
*city_state_id to uniquely identify each city
*Income
*Year
*Code from the state code and FIPS code

==Employment Data==

Data on employment was obtained from the American Community Survey, US Census Bureau.

Raw Data is in:
E:\McNair\Projects\Hubs\Summer 2017\Employment Data by MSA
Cleaned files are in
Z:\Hubs\2017\clean data

They contain:
*MSA code
*MSA
*Year
*Employment rate of individuals 16 years or older
*Unemployment rate of individuals 16 years or older

Data from 2005-2015

The SQL file that merges employment data from ACS (by MSA - Year) with the MSA-City file is titled '''Employment.sql'''.
The file is located in:
Z:\Hubs\2017

The final table is in db '''cities''' titled '''merged_employment'''.

It includes:
*MSA
*City
*Year
*Employment rate
*Unemployment rate

===Joined employment data===

Data is in:
Z:\Hubs\clean data

The file names are:
EMP_05.txt - EMP_15.txt

Database is '''cities'''
SQL script is: '''new_employment.sql''' and it is located in
Z:\Hubs\2017\sql scripts

The final table, which is joined with the VC table, is in db '''cities''' titled '''new_merged_employment'''.

It contains:
*City
*State
*Code from the state code and FIPS code
*city_state_id to uniquely identify each city
*city_state_year to uniquely identify each city in each year
*Employment rates of individuals of 16 years or older
*Unemployment rates of individuals of 16 years or older
*Year

==Schooling Data==

Data on schooling was obtained from the American Community Survey, US Census Bureau.

Raw Data is in:
E:\McNair\Projects\Hubs\Summer 2017\School Enrollment Data by MSA
Cleaned files are in
Z:\Hubs\2017\clean data

They contain:
*MSA code
*MSA
*Year
*Total number of population 3 years and over enrolled in school
*Percent of population 3 years and over enrolled in public school
*Percent of population 3 years and over enrolled in private school

Data from 2005-2015

The SQL file that merges schooling data from ACS (by MSA - Year) with the MSA-City file is titled '''schooling.sql'''.
The file is located in:
Z:\Hubs\2017

The final table is in db '''cities''' titled '''merged_schooling'''.

It includes:
*MSA
*City
*Year
*Total
*Percent_public_schooling
*Percent_private_schooling

===Joined schooling data===

Data is in:
Z:\Hubs\clean data
The file names are:
SCH_05.txt - SCH_15.txt

Database is '''cities'''
SQL script which joins this table with VC table is: '''new_merged_schooling.sql'''
The final table is in db '''cities''' titled '''new_merged_schooling'''.

It contains:
*City
*State
*city_state_id to uniquely identify each city
*city_state_year to uniquely identify each city in each year
*Total number of school enrollment
*Percentage enrolled in public schools
*Percentage enrolled in private schools
*Year
*Code from the state code and FIPS code

==VC Data==

Raw Data is in:
Z:\VentureCapitalData\SDCVCData
The file name is roundcitystateyear.txt

It contains:
*city
*state
*year
*seedamtm - seed, amount in millions
*earlyamtm - early, amount in millions
*lateramtm - late, amount in millions
*selamtm - seed early late, amount in millions
*numseeds - number of seeds
*numearly
*numlater
*numsel

Data from 1953-2017

The table is in db '''cities''' titled '''vc'''.

It includes:
*city
*state
*city_state_id
*city_state_year
*seedamtm
*earlyamtm
*lateramtm
*selamtm
*numseeds
*numearly
*numlater
*numsel
*year

Hubs

2017-07-12T20:34:03Z

KerdaV: /* VC Data */

{{McNair Projects
|Has title=Hubs
|Has owner=Hira Farooqi,
|Has keywords=Data
|Has project status=Active
}}
The Hubs Research Project is a full-length academic paper analyzing the effectiveness of "hubs", a component of the entrepreneurship ecosystem, in the advancement and growth of entrepreneurial success in a metropolitan area.

This research will focus primarily on large and mid-sized Metropolitan Statistical Areas (MSAs), as that is where the great majority of Venture Capital funding is located.

===Primary Data Set===
The Hubs data set, from SDC Platinum, has been constructed in the server:
Data files are in 128.42.44.181/bulk/Hubs
All files are in 128.42.44.182/bulk/Projects/Ecosystem/Hubs
psql Hubs

The data set includes all United States Venture Capital transactions (moneytree) from 1990 through 2015.
Data has been aggregated at the portfolio company, fund, and round level. It will be analyzed at the combined MSA level. We will look at the number of companies funded, the number of funds active, and the flow of investment in a given MSA.

The data set has now been uploaded to the database server, named Hubs.
There are 4 tables:
*Rounds: Rounddate, coname, state, roundno, stage1, etc.
*CombinedRounds: Coname, rounddate, discamount, fundname
*Companies: LastInv, FirstInv, coname, MSA, MSACode, Address, state, datefounded, totalknownfunding, industry(major)
*Funds: fundname, closingdate, lastinv, firstinv, msa, msacode, avinv, nocoinv, totalknowninv, address

Used variables:

Companies: Coname, MSACode, Industry, state
MSALookupTable: MSACode, MSASuper
IndustryLookupTable: IndustryMajor, InduCode
->
CompanyInfo: Coname, MSASuper, InduCode, state (complete)

Funds: fundname, msacode, state
MSALookupTable: MSACode, MSASuper
->
FundInfo: fundname, msacode, state (complete)

Rounds: coname, rounddate, stagecode, roundno
CombinedRounds: coname, rounddate, discamount, fundname
->
RoundInfoSuper: coname, rounddate, '''nofunds''', discamount
->
RoundInfo: Coname, roundyear, fundname, estamount (complete)

Then take:
RoundInfo: Coname, roundyear, fundname, estamount
CompanyInfo: Coname, MSASuper, InduCode, state
FundInfo: fundname, msacode, state
->
SuperRoundInfo: Coname, CoMSASuper, CoInduCode, CoState, FundName, FundMSASuper, FundState, RoundYear, RoundEstAmount
->
MSAPortCos: Count(CoName) As NoPortCosFunded, CoMSASuper, RoundYear
...
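
The step from RoundInfoSuper (with '''nofunds''') to RoundInfo (with estamount) can be sketched as below. The page says only that the investment per syndicate member is estimated; the equal split of discamount across funds, the ISO date strings, and the function name are assumptions made for this illustration.

```python
from collections import defaultdict


def split_rounds(combined_rounds):
    """Turn (coname, rounddate, discamount, fundname) rows into
    (coname, roundyear, fundname, estamount) rows.

    estamount is discamount divided equally among the distinct funds in
    the round (an assumed rule, not confirmed by the source).
    """
    funds_by_round = defaultdict(list)
    amount_by_round = {}
    for coname, rounddate, discamount, fundname in combined_rounds:
        key = (coname, rounddate)
        if fundname not in funds_by_round[key]:
            funds_by_round[key].append(fundname)
        amount_by_round[key] = discamount

    round_info = []
    for (coname, rounddate), funds in funds_by_round.items():
        nofunds = len(funds)  # number of funds in the syndicate
        estamount = amount_by_round[(coname, rounddate)] / nofunds
        roundyear = int(rounddate[:4])  # assumes dates like "2001-03-15"
        for fundname in funds:
            round_info.append((coname, roundyear, fundname, estamount))
    return round_info


# Hypothetical combined rounds.
combined = [
    ("Acme", "2001-03-15", 9.0, "Fund A"),
    ("Acme", "2001-03-15", 9.0, "Fund B"),
    ("Acme", "2001-03-15", 9.0, "Fund C"),
    ("Beta", "2002-06-01", 4.0, "Fund A"),
]
round_info = split_rounds(combined)
```

Summing estamount over the funds in a round recovers the disclosed amount, which keeps near and far totals consistent with the total invested.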

'''Notes on Creation of Primary Data Set'''

Raw tables
* companies (last investment, first investment, company name, MSA, MSA code, address, state, date founded, known funding, industry)
* funds (fund closing date, last investment, first investment, fund name, address, MSA, MSA code, Average investment, number companies invested (NoCos), known investment)
* rounds (round date, company name, state, round number, stage 1, stage 2, stage 3)
* combined rounds (company name, round date, disclosed amount, investor)
* msalist (changes MSAs to CMSAs— combined MSAs)
*industry list (changes 6 industry categories to 4— ICT, Life Sciences, Semiconductors, Other)

Process
*cleaned tables to eliminate duplications, undisclosed variables
*changed all original characters to include CMSA and Industry Codes (companyinfo3, fundinfocleanfinal, roundinfoclean)
*matched funds to avoid any issues with names (i.e. Fund ABC L.P./Fund ABC LP/Fund ABC)
*matched roundinfoclean investors to fundinfocleanfinal investors (roundinfo.txt >> cleanfundfinal.txt)
*join by round and company conames
*bridge years (1990-2016), stage, and cmsa
* populate data with count of companies (Deal flow) and estimated amount ($)
** data set in 181 hubs folder under summarycmsa.txt (38394)
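
The fund-matching step in the process above (Fund ABC L.P./Fund ABC LP/Fund ABC) could look something like the following sketch. The normalization rule here, stripping a trailing L.P./LP suffix and lower-casing, is an assumption for illustration and may differ from the matching actually used.

```python
import re
from collections import defaultdict

# Trailing "L.P." / "LP" style suffix, preceded by whitespace.
_LP_SUFFIX = re.compile(r"\s+l\.?p\.?$", re.IGNORECASE)


def normalize_fund_name(name):
    """Reduce a fund name to a comparable form (assumed rule)."""
    cleaned = " ".join(name.split())      # collapse repeated whitespace
    cleaned = _LP_SUFFIX.sub("", cleaned)  # drop the L.P./LP suffix if present
    return cleaned.rstrip(" ,").lower()


def group_fund_names(names):
    """Group raw fund names that share the same normalized form."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize_fund_name(name)].append(name)
    return dict(groups)


groups = group_fund_names(["Fund ABC L.P.", "Fund ABC LP", "Fund ABC", "Fund XYZ, L.P."])
```

Grouping by the normalized form lets round-level investor names be matched to the fund table even when the legal suffix is written differently.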

Key decisions:
*Dropped undisclosed companies throughout, as they have no address
*Count is done by joining round and company
*Anything fund-related must use a disclosed fund
*Near and far, total invested, fund counts, etc., are all computed using only disclosed funds that match

'''Glossary of Tables'''
cleanco — used to remove duplicates from companies
cleanedcompanies — clean set of companies with no duplicates
cmsafunds-
cmsas— list of all CMSAs in final data set (for merging)
cmsastats- statistics not including empty years (pre-merge)
cmsastats2 - statistics separated by year-MSA
cmsastats3— statistics separated by year-MSA-stage
cmsastats4
cmsayears— empty merged table between year and cmsa
cmsayearstage — empty merged table between cmsa/years and stage
combinedrounds— raw sdc data for combined rounds
combinedroundswamt— used to join rounds and combined rounds for roundinfo2
companies- raw SDC company data
companyinfo — cleaned companies joined with state and CMSA information
companyinfo2— companyinfo1 with original industry categories
companyinfo3— companyinfo2 with updated industry categories and codes
companyinfo4-- clean version of companyinfo3
companyround- combined company information with round information
companyround2- combined company information with round information, cleaned up from companyround
companyround3- combined company information with round information, cleaned up from companyround2
'''finaldataset'''- final statistics by CMSA-year, see section Final Primary Data Set for more information
fundinfo— funds joined with CMSA info
fundinfo2 - clean version of fundinfo1
fundinfoclean - used in process to clean fundinfo2
fundinfoclean2- used in process to clean fundinfo2
fundinfocleanfinal- used in process to clean fundinfo2
fundinfocleannodups- final clean set of fundinfo
funds - raw SDC fund data
Houston - analysis for Houston ecosystem team
Houston2- analysis for Houston ecosystem team
houston3- analysis for Houston ecosystem team
industry — new industry codes (4)— used for all future data sets
industrylist— lookup table for new industry codes (went from 6 to 4)
joined1- used for matching process
joined2- used for matching process
matchfund2- used for matching process
matchfunds- used for matching process
matchroundfund - used for matching process
matchroundfund2- used for matching process
msalist — lookup table for MSA to CMSA (used for all future data sets)
nearfar1-- beginning set before adding nearfar/stage variables
nearfar2 -- added binary variables for near/far and for each of the stages, used to build final dataset
roundfund— not used— joined round to fund; drop/ignore
roundinfo— round info cleaned up to include number of investors in a syndicate and estimated investment per member of syndicate
roundinfo2— roundinfo1 including name of investors/funds
roundinfo3— clean version of roundinfo2
roundinfoclean — final clean version of roundinfo3 (final roundinfo table)
rounds — raw SDC round data
stages — table for merging stage-year-CMSA
superinfo — ignore/drop
temp - used for matching process
years — table for merging stage-year-CMSA
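
The glossary describes cmsayears and cmsayearstage as empty merged tables used for bridging years, stages, and CMSAs. A minimal sketch of that idea follows: build every CMSA-year-stage combination first, then fill in observed counts so that combinations with no deals remain as zeros. The stage labels, sample values, and function names are hypothetical.

```python
from itertools import product


def build_skeleton(cmsas, years, stages):
    """Create one zero-filled cell per (cmsa, year, stage) combination."""
    return {
        (cmsa, year, stage): {"deals": 0, "estamount": 0.0}
        for cmsa, year, stage in product(cmsas, years, stages)
    }


def populate(skeleton, observed):
    """Add observed (key, deals, estamount) statistics into a copy of the skeleton."""
    filled = {key: dict(cell) for key, cell in skeleton.items()}
    for key, deals, estamount in observed:
        filled[key]["deals"] += deals
        filled[key]["estamount"] += estamount
    return filled


# Hypothetical inputs: two CMSAs, the years 1990-2016, four stage labels.
stages = ["Seed/Start-up", "Early", "Later", "Other"]
skeleton = build_skeleton([1, 2], range(1990, 2017), stages)
filled = populate(skeleton, [((1, 2001, "Early"), 3, 12.5)])
```

An observed key that is not in the skeleton raises a KeyError, which flags records that fall outside the expected CMSAs, years, or stages.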

===Hub Candidates Data Set===

The Hubs candidate data set is a list of potential hubs found in MSAs throughout the country. Researchers are currently pulling qualitative and quantitative information from the candidates' websites to help determine which candidates can be identified as hubs. This is a difficult data set to build: little to no quantitative information is available for this category of institution, so the data depend on what each candidate makes publicly accessible online.

Characteristics/Variables
*Year Founded
*Square footage
*LinkedIn self-identifiers (how the organization classifies itself on its LinkedIn profile)
*Activeness on Twitter (binary)
*Member Directory available online (binary)
*Number of conference rooms
*Price ($/month) for Flex desk
*Offers Reserved desk (binary)
*Offers office space for rent (binary)
*Offers community membership, not for coworking but for community events, etc. (binary)
*Number of events offered per month (estimate)
*Offers code academy
*Mission Statement/Vision (for qualitative or key-word analysis)

These characteristics/variables will be used to determine whether a candidate is or is not likely to be a Hub.

As of March 10th 2016, the list contains 125 Hub candidates.

'''Where to find''': The Hubs data set can be found in the Ecosystem>>Hubs>>dataset folder. It is not currently in the database due to a UTF-8 encoding issue.

===Supplementary Data Sets===
'''Patent data''': to be pulled from USPTO or SDC Platinum.

'''Number of STEM Graduate Students''' (NSF) and '''University R&D Spending''' (NSF):
*University R&D Data found under file "NSF DATA_2004 to 2011.xlsx" in datasets folder (Ecosystem>>Hubs>>Datasets)
*STEM graduate student data found at the university level for 2014 ("Stem Grad Students.xlsx") or at the state level ("Science and Engineering Grad Students by State and Year 2000-2011.csv")
** not yet uploaded to the server or matched to CMSA codes, because of this discrepancy between university-level and state-level data.
**"Stem Grad Students.xlsx" contains categorized university by MSA, can be used for all university-based projects

'''Per Capita Income''' and '''Employment Data''' (US Census Bureau):
*"Per Capita Personal Income by MSA 2000-2012.xlsx" in datasets folder (Ecosystem>>Hubs>>Datasets>>Data from Yael)
*"Wages and Salaries by MSA 2000-2012.xlsx" in datasets folder (Ecosystem>>Hubs>>datasets>>Data from Yael)
**not uploaded to server or matched yet to CMSA code

'''Firm Births''' (BDS)
*in server 181, under table name "BDS"
*includes birth, death, net (birth minus death), and rate (death rate) for years 1990-2013 for every MSA
*includes code for CMSA but is not aggregated by CMSA
** i.e. BDS statistics are still separate for all the smaller MSAs in New York's CMSA (code=1)

===Resources===
* Hochberg and Fehder (2015), located in Dropbox
** Use this paper as a guideline on how to conduct the analysis
*US Census Bureau data on employment by MSA: http://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_14_5YR_B23027&prodType=table
*USPTO utility patents by MSA: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/cls_cbsa/allcbsa_gd.htm
*MSA level trends: http://www.metrotrends.org/data.cf

===The Target Dataset===

We will need to process the following variables:
*SuperMSA - combine San Francisco and San Jose, New York and Newark?, NC Research Triangle, others?
*A CSV mapping MSAs to CMSAs is in the folder (and in a table in the database)

Example dataset:
 MSA   Year  SeedVCInv  SeedEarlyVCInv  LaterVCInv  NoDeals  FundsInvested  DistinctInvestors  ....
 ----  ----  ---------  --------------  ----------  -------  -------------  -----------------
 1234  2001  1000000    20000000        30000000    4        7              7

Note that the unit of observation is MSA-Year.

Variables to be computed at the MSA level:
*HubActive (binary)
*NoHubsActive (Count)
*HubSqFt
*Other Hub Vars (build list!!!)
*'''SeedVCInv''' (Seed/Start-up)
*'''EarlyVCInv''' (Early Stage)
*'''LaterStageVC''' (Later)
*'''OtherStageVC''' (Buyout/Acq, Other)
*'''NoDeals''' (done by local VCs?)
**'''NoDealsNear'''
**'''NoDealsFar'''
*NoPortCosFunded
*'''FundsInv''' (in an MSA)
**'''FundsInvFromNear''' (within MSA?)
**'''FundsInvFromFar''' (outside MSA?)
*DistinctInvestors (?)
**DistinctInvestorsNear (within MSA?)
**DistinctInvestorsFar (outside MSA?)
*PatentCount
*NoSTEMGrads
*FirmBirths (BDS data)
*UniRandDSpend
*PerCapitaIncome
*Employment

We need to:
*Check funds invested means dollars invested
*Categorize near and far! Is it within MSA vs. not, within adjacent MSAs, etc.?

There may be a second dataset that has Hub-Industry-Year (where industry is semiconductor/non-semiconductor?).

===Final Primary Data Set===

*A deal is a round syndicate. A near (far) deal is one with at least one investor that is near (far).

Table name: finaldataset
cmsa
year
totalamountinv--total amount invested
nearamountinv--amount invested from local funds
faramountinv-- amount invested from funds outside CMSA
earlyinv--amount invested in early stage companies
laterinv--amount invested in later stage companies
startupseedinv--amount invested in seed or startup stage companies
otherstageinv--amount invested in Acquisition/Buy-outs/Other stage companies
investingfund--distinct funds that are investing in that CMSA-year
investingfundnear--distinct funds from that CMSA that invested in that CMSA-year
investingfundfar--distinct funds from outside that CMSA that invested in that CMSA-year
deals--number of deals
neardeals--number of deals with at least one investor from inside the CMSA
fardeals--number of deals with at least one investor from outside the CMSA (some deals may count in both categories, because their syndicates include members both inside and outside the CMSA)
earlystagedeals--deals with earlystage companies
laterstagedeals--deals with later stage companies
startupseeddeals--deals with startup/seed companies
otherstagedeals--deals with companies in other stages
newportcosfunded--number of portfolio companies to receive their first investment in that year

===Data by zip code===
*Population data, 2000-2016 - US Census Bureau (E:\McNair\Hubs\summer 2017)
https://www2.census.gov/programs-surveys/popest/datasets/
*Income data, 1998-2014 - The Internal Revenue Service (E:\McNair\Hubs\summer 2017)
https://www.irs.gov/uac/about-irs
*DCI index, to assess the economic well-being of communities
http://eig.org/dci/interactive-maps/u-s-zip-codes
*R&D Expenses, 1980-2016 - Wharton Research Data Services (E:\McNair\Hubs\summer 2017)
*Zipcode look-up table obtained from https://www.unitedstateszipcodes.org/zip-code-database/. It's available in (E:\McNair\Hubs\summer 2017).

== Data by MSA ==

We have principal cities of MSAs from the census:
https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html

We might be able to go City -> FIPS place code -> MSA?

Cities and their FIPS codes (which don't perfectly correspond) are available from https://www.census.gov/geo/reference/codes/place.html

The Census claims to provide city to MSA here: https://www.census.gov/geo/maps-data/data/ua_rel_download.html
However, there is only CBSA!

This might do it: https://www2.census.gov/geo/pdfs/maps-data/data/rel/explanation_ua_cbsa_rel_10.pdf
We can maybe track city to principal city to MSA

==COMPUSTAT Data==

Data is in:
E:\McNair\Projects\Hubs\Summer 2017
Z:\Hubs\2017

Database is '''cities'''

SQL script is: COMPUSTAT.sql

The source file is RandDExpenditures.txt. It contains:
*Data from 1980-2017 (July). All COMPUSTAT.
*427799 records
*Fields include:
**R&D Expenditure
**Address (inc. city, zip, state)

Output file is COMPUSTATSummary.txt. It contains:
*Variables: City, year, No.public firms, sum R&D, sum Sales, sum total assets
*1979-2016
*4440 cities

==NSF Data==
Data is in:
E:\McNair\Projects\Hubs\Summer 2017
Z:\Hubs\2017

Database is '''cities'''

SQL script is: nsf_2017.sql

The source files are: nsf2017.txt, copied from table '''nsf''', and nsf_institution copied from table '''nsf_grants_institution''' from the biotech db.

They contain:
*Award ID
*Award Institution
*Award Effective date
*Institution city
*Award Value
*Organization state code
From 1900 - 2017

Output file is nsfSummary.txt. It contains:
*Variables: City, State code, year, nsf_nogrants, nsf_valuegrant
*1900-2017

==NIH Data==
Data is in:
Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is '''cities'''
SQL script is: nih2017.sql
The source files are:
*nih_1986_2001.csv
*nih_2002_2012.txt
*nih_2013_2015
located in E:\McNair\Projects\Federal Grant Data\NIH

The script that cleans NIH data and generates the summary table is titled '''nihSummary'''. It is located here:

E:\McNair\Projects\Hubs\Summer 2017

This table includes
*year
*city
*state
*country
*nogrants (number of grants)
*valuegrant
*city_state (the city-state ID that we'll merge on)

*Data from 1986-2015

==Clinical Trials Data==
Data is in:
Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is '''cities'''
SQL script is: ctrials.sql
The source file is:

*medclinical.txt

located in Z:\Hubs\2017

*Data from 1999-2017

===Joined clinical trials data===

The file which contains the number of trials in each city and year is located in:
Z:\Hubs\2017

The file is in:
Z:\Hubs\2017\clean data
The name of the file is:
ctrialsSummary.txt

It contains:
*city
*year
*city_state_year
*noctrials - number of trials

The ctrials table is joined with the VC table.
The SQL script that performs the join is '''new_ctrials.sql''', located in
Z:\Hubs\2017\sql scripts

The name of the joined table is '''new_merged_ctrials'''.

It contains:
*city
*state
*city_state_id
*city_state_year
*year
*noctrials
*seedamtm
*earlyamtm
*lateramtm
*selamtm
*numseeds
*numearly
*numlater
*numsel

==Population Data==
Data is in:
Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is '''cities'''

SQL script is: '''population.sql'''
The source files are:
*pop2000_2009.xlsx
*pop2010_2016.xlsx

They contain:
*State
*City name
*Year
*Population Estimates

Data from 2000-2016

===Joined population data===

Data is in:
Z:\Hubs\2017\clean data
The file names are
1_population.txt - contains data on population estimates from 2000-2009
2_population.txt - contains data on population estimates from 2010-2016

Database is '''cities'''
SQL script is: '''new_population.sql''', located in
Z:\Hubs\2017\sql scripts

The population table is joined with the VC table. The resulting table is called '''new_merged_population'''.

It contains:
*City
*State
*city_state_id to uniquely identify each city
*city_state_year to uniquely identify each city in each year
*Population estimates
*Year
*Code from the state code and FIPS code
*State full name

==Income Data==

Raw data was obtained from the US Census Bureau's American Community Survey.

Raw Data is in:
E:\McNair\Projects\Hubs\Summer 2017\MSA Income_raw.zip

Data from 2005-2015

The master list with MSAs and principal cities is titled '''list2.xls'''. It is located at:
Z:\Hubs\2017

This master list includes:
*MSA code
*MSA name
*Principal City
*State
*Place code (city code)
*State Code

This master list was edited so that each principal city is associated with a single state. For example, New York is a principal city of the New York-New Jersey MSA and was originally associated with the state NY-NJ, so the list was edited to associate New York with NY.

Cleaned Income data files are in
Z:\Hubs\2017\merging_on_ID

They contain:
*MSA code
*MSA
*Year
*Total Household Income

The MSA-City-State look up file is titled '''msa_city_state_wcode.txt'''. It is located in
Z:\Hubs\2017\merging_on_ID

The SQL file that merges income data from ACS (by MSA - Year) with the MSA-City file is titled '''income.sql'''. It is located here:
Z:\Hubs\2017\sql scripts

The final income table is in db '''cities''' titled '''merged_income'''.

It includes:
*MSA
*City
*State
*Year
*Total Household Income

The table includes 8780 observations

===Joined income data===

Data is in:
Z:\Hubs\clean data
The file names are:
INC_05.txt - INC_15.txt

Database is '''cities'''
SQL script is: merged_income.sql

They contain:
*City
*State
*city_state_id to uniquely identify each city
*Income
*Year
*Code from the state code and FIPS code

==Employment Data==

Data on employment was obtained from the American Community Survey, US Census Bureau.

Raw Data is in:
E:\McNair\Projects\Hubs\Summer 2017\Employment Data by MSA
Cleaned files are in
Z:\Hubs\2017\clean data

They contain:
*MSA code
*MSA
*Year
*Employment rate of individuals 16 years or older
*Unemployment rate of individuals 16 years or older

Data from 2005-2015

The SQL file that merges employment data from ACS (by MSA - Year) with the MSA-City file is titled '''Employment.sql'''.
The file is located in:
Z:\Hubs\2017

The final table is in db '''cities''' titled '''merged_employment'''.

It includes:
*MSA
*City
*Year
*Employment rate
*Unemployment rate

===Joined employment data===

Data is in:
Z:\Hubs\clean data

The file names are:
EMP_05.txt - EMP_15.txt

Database is '''cities'''
SQL script is: '''new_employment.sql''' and it is located in
Z:\Hubs\2017\sql scripts

The final table, which is joined to the VC table, is in db '''cities''' and is titled '''new_merged_employment'''.

It contains:
*City
*State
*Code from the state code and Fips code
*city_state_id to uniquely identify each city
*city_state_year to uniquely identify each city in each year
*Employment rates of individuals of 16 years or older
*Unemployment rates of individuals of 16 years or older
*Year

==Schooling Data==

Data on schooling was obtained from the American Community Survey (ACS), US Census Bureau.

Raw Data is in:
E:\McNair\Projects\Hubs\Summer 2017\School Enrollment Data by MSA
Cleaned files are in
Z:\Hubs\2017\clean data

They contain:
*MSA code
*MSA
*Year
*Total number of population 3 years and over enrolled in school
*Percent of population 3 years and over enrolled in public school
*Percent of population 3 years and over enrolled in private school

Data from 2005-2015

The SQL file that merges schooling data from ACS (by MSA - Year) with the MSA-City file is titled '''schooling.sql'''.
The file is located in:
Z:\Hubs\2017

The final table is in db '''cities''' titled '''merged_schooling'''.

It includes:
*MSA
*City
*Year
*Total
*Percent_public_schooling
*Percent_private_schooling

===Joined schooling data===

Data is in:
Z:\Hubs\clean data
The file names are:
SCH_05.txt - SCH_15.txt

Database is '''cities'''
SQL script which joins this table with VC table is: '''new_merged_schooling.sql'''
The final table is in db '''cities''' titled '''new_merged_schooling'''.

It contains:
*City
*State
*city_state_id to uniquely identify each city
*city_state_year to uniquely identify each city in each year
*Total number of school enrollment
*Percentage enrolled in public schools
*Percentage enrolled in private schools
*Year
*Code from the state code and Fips code

==VC Data==

Raw Data is in:
Z:\VentureCapitalData\SDCVCData
The file name is roundcitystateyear.txt

It contains:
*city
*state
*year
*seedamtm - seed, amount in millions
*earlyamtm - early, amount in millions
*lateramtm - late, amount in millions
*selamtm - seed early late, amount in millions
*numseeds - number of seeds
*numearly
*numlater
*numsel

Data from 1953-2017

The table is in db '''cities''' titled '''vc'''.

It includes:
*city
*state
*city_state_id
*city_state_year
*seedamtm
*earlyamtm
*lateramtm
*selamtm
*numseeds
*numearly
*numlater
*numsel
*year

Hubs

2017-07-12T20:32:35Z

KerdaV: /* Joined employment data */

{{McNair Projects
|Has title=Hubs
|Has owner=Hira Farooqi,
|Has keywords=Data
|Has project status=Active
}}
The Hubs Research Project is a full-length academic paper analyzing the effectiveness of "hubs", a component of the entrepreneurship ecosystem, in the advancement and growth of entrepreneurial success in a metropolitan area.

This research will primarily focus on large and mid-sized Metropolitan Statistical Areas (MSAs), as that is where the great majority of venture capital funding is located.

===Primary Data Set===
The Hubs data set, built from SDC Platinum data, is stored on the server:
Data files are in 128.42.44.181/bulk/Hubs
All files are in 128.42.44.182/bulk/Projects/Ecosystem/Hubs
psql Hubs

The data set includes all United States Venture Capital transactions (moneytree) from the twenty-five year period of 1990 through 2015.
Data has been aggregated at the portfolio company, fund, and round level. It will be analyzed at the combined MSA level, in terms of the number of companies funded, the number of funds active, and the flow of investment in a given MSA.

The data set has now been uploaded to the database server, named Hubs.
There are 4 tables:
*Rounds: Rounddate, coname, state, roundno, stage1, etc.
*CombinedRounds: Coname, rounddate, discamount, fundname
*Companies: LastInv, FirstInv, coname, MSA, MSACode, Address, state, datefounded, totalknownfunding, industry(major)
*Funds: fundname, closingdate, lastinv, firstinv, msa, msacode, avinv, nocoinv, totalknowninv, address

Used variables:

Companies: Coname, MSACode, Industry, state
MSALookupTable: MSACode, MSASuper
IndustryLookupTable: IndustryMajor, InduCode
->
CompanyInfo: Coname, MSASuper, InduCode, state (complete)

Funds: fundname, msacode, state
MSALookupTable: MSACode, MSASuper
->
FundInfo: fundname, msacode, state (complete)

Rounds: coname, rounddate, stagecode, roundno
CombinedRounds: coname, rounddate, discamount, fundname
->
RoundInfoSuper: coname, rounddate, '''nofunds''', discamount
->
RoundInfo: Coname, roundyear, fundname, estamount (complete)

Then take:
RoundInfo: Coname, roundyear, fundname, estamount
CompanyInfo: Coname, MSASuper, InduCode, state
FundInfo: fundname, msacode, state
->
SuperRoundInfo: Coname, CoMSASuper, CoInduCode, CoState, FundName, FundMSASuper, FundState, RoundYear, RoundEstAmount
->
MSAPortCos: Count(CoName) As NoPortCosFunded, CoMSASuper, RoundYear
...
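
The two derived quantities in the flow above (estamount in RoundInfo and NoPortCosFunded in MSAPortCos) can be sketched as follows. This is only an illustration: it assumes the disclosed round amount is split equally among the funds in the syndicate, and it counts distinct companies per MSA-year. The sample rows and the company_msa mapping are made up.

```python
from collections import defaultdict

# Hypothetical CombinedRounds rows: (coname, rounddate, discamount, fundname)
combined_rounds = [
    ("Acme", "2001-03-15", 6.0, "Fund A"),
    ("Acme", "2001-03-15", 6.0, "Fund B"),
    ("Beta", "2001-07-01", 2.0, "Fund A"),
]
# Hypothetical CompanyInfo lookup: coname -> MSASuper
company_msa = {"Acme": 1, "Beta": 1}

# nofunds: number of distinct funds in each round syndicate
syndicate = defaultdict(set)
for coname, rounddate, _amount, fundname in combined_rounds:
    syndicate[(coname, rounddate)].add(fundname)

# RoundInfo: (coname, roundyear, fundname, estamount), with the
# disclosed amount split equally across syndicate members (assumption)
round_info = []
for coname, rounddate, discamount, fundname in combined_rounds:
    nofunds = len(syndicate[(coname, rounddate)])
    round_info.append((coname, int(rounddate[:4]), fundname, discamount / nofunds))

# MSAPortCos: distinct companies funded per (MSASuper, roundyear)
funded = defaultdict(set)
for coname, roundyear, _fund, _est in round_info:
    funded[(company_msa[coname], roundyear)].add(coname)
msa_port_cos = {key: len(names) for key, names in funded.items()}
```

Counting distinct names rather than rows avoids counting a company once per syndicate member; whether the original Count(CoName) does the same is not stated here.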

'''Notes on Creation of Primary Data Set'''

Raw tables
* companies (last investment, first investment, company name, MSA, MSA code, address, state, date founded, known funding, industry)
* funds (fund closing date, last investment, first investment, fund name, address, MSA, MSA code, Average investment, number companies invested (NoCos), known investment)
* rounds (round date, company name, state, round number, stage 1, stage 2, stage 3)
* combined rounds (company name, round date, disclosed amount, investor)
* msalist (changes MSAs to CMSAs— combined MSAs)
*industry list (changes 6 industry categories to 4— ICT, Life Sciences, Semiconductors, Other)

Process
*cleaned tables to eliminate duplications, undisclosed variables
*changed all original characters to include CMSA and Industry Codes (companyinfo3, fundinfocleanfinal, roundinfoclean)
*matched funds to avoid any issues with names (i.e. Fund ABC L.P./Fund ABC LP/Fund ABC)
*matched roundinfoclean investors to fundinfocleanfinal investors (roundinfo.txt >> cleanfundfinal.txt)
*join by round and company conames
*bridge years (1990-2016), stage, and cmsa
* populate data with count of companies (Deal flow) and estimated amount ($)
** data set in 181 hubs folder under summarycmsa.txt (38394)
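
The fund-matching step in the process above deals with name variants such as Fund ABC L.P./Fund ABC LP/Fund ABC. The actual matching rules are not recorded on this page; a minimal sketch, assuming that stripping a trailing "LP" variant and normalizing case and whitespace is enough (the function name and the suffix pattern are hypothetical):

```python
import re

# Trailing "L.P.", "LP" or "L P", preceded by whitespace or a comma
_LP_SUFFIX = re.compile(r"[\s,]+L\.?\s?P\.?$", re.IGNORECASE)


def normalize_fund_name(name: str) -> str:
    """Return a canonical form of a fund name for matching."""
    cleaned = " ".join(name.split())      # collapse repeated whitespace
    cleaned = _LP_SUFFIX.sub("", cleaned)  # drop a trailing LP suffix
    return cleaned.upper()
```

Two records are then treated as the same fund when their normalized names are equal. Other legal suffixes would need their own patterns.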

Key decisions:
*Dropped undisclosed companies throughout, as they have no address
*Count is done by joining round and company
*Anything fund related must be disclosed fund
*Near and far, and total invested, and fund counts, etc., are all done using disclosed funds that match only

'''Glossary of Tables'''
cleanco — used to remove duplicates from companies
cleanedcompanies — clean set of companies with no duplicates
cmsafunds-
cmsas— list of all CMSAs in final data set (for merging)
cmsastats- statistics not including empty years (pre-merge)
cmsastats2 - statistics separated by year-MSA
cmsastats3— statistics separated by year-MSA-stage
cmsastats4
cmsayears— empty merged table between year and cmsa
cmsayearstage — empty merged table between cmsa/years and stage
combinedrounds— raw sdc data for combined rounds
combinedroundswamt— used to join rounds and combined rounds for roundinfo2
companies- raw SDC company data
companyinfo — cleaned companies joined with state and CMSA information
companyinfo2— companyinfo1 with original industry categories
companyinfo3— companyinfo2 with updated industry categories and codes
companyinfo4-- clean version of companyinfo3
companyround- combined company information with round information
companyround2- combined company information with round information, cleaned up from companyround2
companyround3- combined company information with round information, cleaned up from companyround3
'''finaldataset'''- final statistics by CMSA-year, see section Final Primary Data Set for more information
fundinfo— funds joined with CMSA info
fundinfo2 - clean version of fundinfo1
fundinfoclean - used in process to clean fundinfo2
fundinfoclean2- used in process to clean fundinfo2
fundinfocleanfinal- used in process to clean fundinfo2
fundinfocleannodups- final clean set of fundinfo
funds - raw SDC fund data
Houston - analysis for Houston ecosystem team
Houston2- analysis for Houston ecosystem team
houston3- analysis for Houston ecosystem team
industry — new industry codes (4)— used for all future data sets
industrylist— lookup table for new industry codes (went from 6 to 4)
joined1- used for matching process
joined2- used for matching process
matchfund2- used for matching process
matchfunds- used for matching process
matchroundfund - used for matching process
matchroundfund2- used for matching process
msalist — lookup table for MSA to CMSA (used for all future data sets)
nearfar1-- beginning set before adding nearfar/stage variables
nearfar2 -- added binomial variables for near/far and for each of the stages, used to build final dataset
roundfund— not used— joined round to fund; drop/ignore
roundinfo— round info cleaned up to include number of investors in a syndicate and estimated investment per member of syndicate
roundinfo2— roundinfo1 including name of investors/funds
roundinfo3— clean version of roundinfo2
roundinfoclean — final clean version of roundinfo3 (final roundinfo table)
rounds — raw SDC round data
stages — table for merging stage-year-CMSA
superinfo — ignore/drop
temp - used for matching process
years — table for merging stage-year-CMSA
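
The cmsayears and cmsayearstage tables are described above as empty merged tables. A minimal sketch of how such a scaffold might be built and then populated, assuming that CMSA-year-stage cells with no observed deals are filled with zeros (the CMSA codes, stage labels and observed values are hypothetical):

```python
from itertools import product

cmsas = [1, 2]                      # hypothetical CMSA codes
years = range(1990, 2017)           # 1990-2016, as in the process notes
stages = ["startupseed", "early", "later", "other"]

# Observed statistics keyed by (cmsa, year, stage): (deal count, amount)
observed = {(1, 2001, "early"): (3, 12.5)}

# Every CMSA-year-stage combination appears exactly once; combinations
# with no observed deals get zeros instead of being missing.
panel = {
    key: observed.get(key, (0, 0.0))
    for key in product(cmsas, years, stages)
}
```

Building the full cross product first keeps the panel balanced, so years with no investment in a CMSA appear as zeros rather than dropping out of the data set.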

===Hub Candidates Data Set===

The Hubs candidate data set is a list of potential hubs found in MSAs throughout the country. Researchers are currently pulling qualitative and quantitative information from the candidates' websites, in an attempt to categorize what can be identified as a hub. This is a difficult data set to assemble, as there is little to no quantitative information available for this category of institution, and it depends on what information is publicly accessible on the internet.

Characteristics/Variables
*Year Founded
*Square footage
*LinkedIn self-identifiers (how the organization classifies itself on its LinkedIn profile)
*Activeness on Twitter (binomial)
*Member Directory available online (binomial)
*Number of conference rooms
*Price ($/month) for Flex desk
*Offers Reserved desk (binomial)
*Offers office space for rent (binomial)
*Offers community membership-- not for coworking but for community events, etc. (binomial)
*Number of events offered per month (estimate)
*Offers code academy
*Mission Statement/Vision (for qualitative or key-word analysis)

These characteristics/variables will be used to determine whether a candidate is or is not likely to be a Hub.

As of March 10th 2016, the list contains 125 Hub candidates.

'''Where to find''': The Hubs data set can be found in the Ecosystem>>Hubs>>dataset folder. It is not currently in the database due to a UTF8 issue.

===Supplementary Data Sets===
'''Patent data''': to be pulled from USPTO or SDC Platinum.

'''Number of STEM Graduate Students''' (NSF) and '''University R&D Spending''' (NSF):
*University R&D Data found under file "NSF DATA_2004 to 2011.xlsx" in datasets folder (Ecosystem>>Hubs>>Datasets)
*R&D spending found at the university level for 2014 ("Stem Grad Students.xlsx") or at state level ("Science and Engineering Grad Students by State and Year 2000-2011.csv")
** not uploaded to server or matched yet to CMSA code, because of this discrepancy.
**"Stem Grad Students.xlsx" contains categorized university by MSA, can be used for all university-based projects

'''Per Capita Income''' and '''Employment Data''' (US Census Bureau):
*"Per Capita Personal Income by MSA 2000-2012.xlsx" in datasets folder (Ecosystem>>Hubs>>Datasets>>Data from Yael)
*"Wages and Salaries by MSA 2000-2012.xlsx" in datasets folder (Ecosystem>>Hubs>>datasets>>Data from Yael)
**not uploaded to server or matched yet to CMSA code

'''Firm Births''' (BDS)
*in server 181, under table name "BDS"
*includes birth, death, net(birth-death) and rate(death rate) for years 1990-2013 for every msa
*includes code for CMSA but is not aggregated by CMSA
** i.e. BDS statistics are still separate for all the smaller MSAs in New York's CMSA (code=1)
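
Since the BDS table carries a CMSA code but is not aggregated, the MSA-level rows would have to be rolled up before merging. A minimal sketch, assuming births and deaths can simply be summed within a CMSA-year and net recomputed from the sums (the rows are made up, and the rate column is left out because rates cannot be summed):

```python
from collections import defaultdict

# Hypothetical BDS rows: (cmsa_code, msa_name, year, births, deaths)
bds_rows = [
    (1, "MSA one", 2000, 120, 100),
    (1, "MSA two", 2000, 30, 40),
    (2, "MSA three", 2000, 10, 5),
]

# Sum births and deaths over all MSAs that share a CMSA code and year
totals = defaultdict(lambda: [0, 0])
for cmsa, _msa, year, births, deaths in bds_rows:
    totals[(cmsa, year)][0] += births
    totals[(cmsa, year)][1] += deaths

# net is recomputed from the aggregated counts (birth - death)
bds_by_cmsa = {
    key: {"birth": b, "death": d, "net": b - d}
    for key, (b, d) in totals.items()
}
```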

===Resources===
* Yael Hochberg and Fehder (2015), located in dropbox
** Use this paper as a guideline on how to conduct the analysis
*US Census Bureau data on employment by MSA: http://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_14_5YR_B23027&prodType=table
*USPTO utility patents by MSA: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/cls_cbsa/allcbsa_gd.htm
*MSA level trends: http://www.metrotrends.org/data.cf

===The Target Dataset===

We will need to process the following variables:
*SuperMSA - combine SanFran and SanJose, New York and Newark?, NC Research triangle, others?
*CSV mapping msas to cmsas is in the folder (and a table in the dbase)

Example dataset:
{| class="wikitable"
! MSA !! Year !! SeedVCInv !! SeedEarlyVCInv !! LaterVCInv !! NoDeals !! FundsInvested !! DistinctInvestors !! ...
|-
| 1234 || 2001 || 1000000 || 20000000 || 30000000 || 4 || 7 || 7 || ...
|}

Note that the unit of observation is MSA-Year.

Variables to be computed at the MSA level:
*HubActive (binary)
*NoHubsActive (Count)
*HubSqFt
*Other Hub Vars (build list!!!)
*'''SeedVCInv''' (Seed/Start-up)
*'''EarlyVCInv''' (Early Stage)
*'''LaterStageVC''' (Later)
*'''OtherStageVC''' (Buyout/Acq, Other)
*'''NoDeals''' (done by local VCs?)
**'''NoDealsNear'''
**'''NoDealsFar'''
*NoPortCosFunded
*'''FundsInv''' (in an MSA)
**'''FundsInvFromNear''' (within MSA?)
**'''FundsInvFromFar''' (outside MSA?)
*DistinctInvestors (?)
**DistinctInvestorsNear (within MSA?)
**DistinctInvestorsFar (outside MSA?)
*PatentCount
*NoSTEMGrads
*FirmBirths (BDS data)
*UniRandDSpend
*PerCapitaIncome
*Employment

We need to:
*Check funds invested means dollars invested
*Categorize near and far! Is it within MSA vs. not, within adjacent MSAs, etc.?

There may be a second dataset that has Hub-Industry-Year (where industry is semiconductor/non-semiconductor?).

===Final Primary Data Set===

*A deal is a round syndicate. A near (far) deal is a deal with at least one investor that is near (far).

Table name: finaldataset
cmsa
year
totalamountinv--total amount invested
nearamountinv--amount invested from local funds
faramountinv-- amount invested from funds outside CMSA
earlyinv--amount invested in early stage companies
laterinv--amount invested in later stage companies
startupseedinv--amount invested in seed or startup stage companies
otherstageinv--amount invested in Acquisition/Buy-outs/Other stage companies
investingfund--distinct funds that are investing in that CMSA-year
investingfundnear--distinct funds from that CMSA that invested in that CMSA-year
investingfundfar--distinct funds from outside that CMSA that invested in that CMSA-year
deals--number of deals
neardeals--number of deals inside a CMSA
fardeals--number of deals from outside a CMSA --some of these deals might count in both categories, because of syndicate members being both inside and outside the CMSA
earlystagedeals--deals with earlystage companies
laterstagedeals--deals with later stage companies
startupseeddeals--deals with startup/seed companies
otherstagedeals--deals with companies in other stages
newportcosfunded--number of portfolio companies to receive their first investment in that year
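
The near/far deal counts above can overlap because a syndicate may include funds from both inside and outside the company's CMSA. A minimal sketch of that classification, assuming a deal is near (far) when at least one disclosed, matched fund is inside (outside) the company's CMSA; the function name is hypothetical and undisclosed funds are represented as None:

```python
def classify_deal(company_cmsa, investor_cmsas):
    """Return (is_near, is_far) for one round syndicate."""
    # Only disclosed, matched funds carry a CMSA code; skip the rest.
    known = [cmsa for cmsa in investor_cmsas if cmsa is not None]
    is_near = any(cmsa == company_cmsa for cmsa in known)
    is_far = any(cmsa != company_cmsa for cmsa in known)
    return is_near, is_far
```

A syndicate with one local and one outside fund returns (True, True), so it adds one to both neardeals and fardeals but only one to deals.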

===Data by zip code===
*Population data, 2000-2016 - US Census Bureau (E:\McNair\Hubs\summer 2017)
https://www2.census.gov/programs-surveys/popest/datasets/
*Income data, 1998-2014 - The Internal Revenue Service (E:\McNair\Hubs\summer 2017)
https://www.irs.gov/uac/about-irs
*DCI index, to assess the economic well-being of communities
http://eig.org/dci/interactive-maps/u-s-zip-codes
*R&D Expenses, 1980-2016 - Wharton Research Data Services (E:\McNair\Hubs\summer 2017)
*Zipcode look-up table obtained from https://www.unitedstateszipcodes.org/zip-code-database/. It's available in (E:\McNair\Hubs\summer 2017).

== Data by MSA ==

We have principal cities of MSAs from the census:
https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html

We might be able to go City -> FIPS place code -> MSA?

Cities and their FIPS codes (which don't perfectly correspond) are available from https://www.census.gov/geo/reference/codes/place.html

The Census claims to provide city to MSA here: https://www.census.gov/geo/maps-data/data/ua_rel_download.html
However, there is only CBSA!

This might do it: https://www2.census.gov/geo/pdfs/maps-data/data/rel/explanation_ua_cbsa_rel_10.pdf
We can maybe track city to principal city to MSA

==COMPUSTAT Data==

Data is in:
E:\McNair\Projects\Hubs\Summer 2017
Z:\Hubs\2017

Database is '''cities'''

SQL script is: COMPUSTAT.sql

The source file is RandDExpenditures.txt. It contains:
*Data from 1980-2017 (July). All COMPUSTAT.
*427799 records
*Fields include:
**R&D Expenditure
**Address (inc. city, zip, state)

Output file is COMPUSTATSummary.txt. It contains:
*Variables: City, year, No.public firms, sum R&D, sum Sales, sum total assets
*1979-2016
*4440 cities
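
COMPUSTAT.sql itself is not reproduced on this page. A minimal sketch of a city-year summary with the variables listed above, assuming one row per firm-year in the source (the table name, column names and sample rows are hypothetical, and in-memory SQLite stands in for the '''cities''' db):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE compustat (firm_id TEXT, city TEXT, year INTEGER,
                        rd REAL, sales REAL, assets REAL);
-- Made-up rows; F2 does not report R&D
INSERT INTO compustat VALUES
    ('F1', 'Houston', 2010, 5.0, 100.0, 200.0),
    ('F2', 'Houston', 2010, NULL, 50.0, 80.0),
    ('F3', 'Austin', 2010, 1.0, 10.0, 20.0);
""")

# City, year, number of public firms, and sums of R&D, sales and assets.
# SUM skips NULLs, so firms with missing R&D still count as firms.
summary = conn.execute("""
SELECT city, year, COUNT(DISTINCT firm_id), SUM(rd), SUM(sales), SUM(assets)
FROM compustat
GROUP BY city, year
ORDER BY city, year
""").fetchall()
```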

==NSF Data==
Data is in:
E:\McNair\Projects\Hubs\Summer 2017
Z:\Hubs\2017

Database is '''cities'''

SQL script is: nsf_2017.sql

The source files are: nsf2017.txt, copied from table '''nsf''', and nsf_institution copied from table '''nsf_grants_institution''' from the biotech db.

They contain:
*Award ID
*Award Institution
*Award Effective date
*Institution city
*Award Value
*Organization state code
From 1900 - 2017

Output file is nsfSummary.txt. It contains:
*Variables: City, State code year, nsf_nogrants, nsf_valuegrant
*1900-2017

==NIH Data==
Data is in:
Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is '''cities'''
SQL script is: nih2017.sql
The source files are:
*nih_1986_2001.csv
*nih_2002_2012.txt
*nih_2013_2015
located in E:\McNair\Projects\Federal Grant Data\NIH

The script that cleans NIH data and generates the summary table is titled '''nihSummary'''. It is located here:

E:\McNair\Projects\Hubs\Summer 2017

This table includes
*year
*city
*state
*country
*nogrants (number of grants)
*valuegrant
*city_state (the city-state ID that we'll merge on)

*Data from 1986-2015

==Clinical Trials Data==
Data is in:
Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is '''cities'''
SQL script is: ctrials.sql
The source file is:

*medclinical.txt

located in Z:\Hubs\2017

*Data from 1999-2017

===Joined clinical trials data===

The file which contains the number of trials in each city and year is located in:
Z:\Hubs\2017

The file is in:
Z:\Hubs\2017\clean data
The name of the file is:
ctrialsSummary.txt

It contains:
*city
*year
*city_state_year
*noctrials - number of trials

The ctrials table is joined with the VC table.
The SQL script for the join is: '''new_ctrials.sql''' and it is located in
Z:\Hubs\2017\sql scripts

The name of the joined table is '''new_merged_ctrials'''.

It contains:
*city
*state
*city_state_id
*city_state_year
*year
*noctrials
*seedamtm
*earlyamtm
*lateramtm
*selamtm
*numseeds
*numearly
*numlater
*numsel

==Population Data==
Data is in:
Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is '''cities'''

SQL script is: '''population.sql'''
The source files are:
*pop2000_2009.xlsx
*pop2010_2016.xlsx

They contain:
*State
*City name
*Year
*Population Estimates

Data from 2000-2016

===Joined population data===

Data is in:
Z:\Hubs\2017\clean data
The file names are
1_population.txt - contains data on population estimates from 2000-2009
2_population.txt - contains data on population estimates from 2010-2016

Database is '''cities'''
SQL script is: '''new_population.sql''', located in
Z:\Hubs\2017\sql scripts

The population table is joined to the VC table. The resulting table is called '''new_merged_population'''.

It contains:
*City
*State
*city_state_id to uniquely identify each city
*city_state_year to uniquely identify each city in each year
*Population estimates
*Year
*Code from the state code and Fips code
*State full name

==Income Data==

Raw data was obtained from the US Census Bureau's American Community Survey (ACS).

Raw Data is in:
E:\McNair\Projects\Hubs\Summer 2017\MSA Income_raw.zip

Data from 2005-2015

The master list with MSAs and principal cities is titled '''list2.xls'''. It is located at:
Z:\Hubs\2017

This master list includes:
*MSA code
*MSA name
*Principal City
*State
*Place code (city code)
*State Code

This master list was edited to associate each principal city with a single state. For example, New York is a principal city of the New York-New Jersey MSA, so it was originally associated with the state NY-NJ; '''list''' was edited to associate New York with NY.

Cleaned Income data files are in
Z:\Hubs\2017\merging_on_ID

They contain:
*MSA code
*MSA
*Year
*Total Household Income

The MSA-City-State look up file is titled '''msa_city_state_wcode.txt'''. It is located in
Z:\Hubs\2017\merging_on_ID

The SQL file that merges income data from ACS (by MSA - Year) with the MSA-City file is titled '''income.sql'''. It is located here:
Z:\Hubs\2017\sql scripts

The final income table is in db '''cities''' titled '''merged_income'''.

It includes:
*MSA
*City
*State
*Year
*Total Household Income

The table includes 8780 observations

===Joined income data===

Data is in:
Z:\Hubs\clean data
The file names are:
INC_05.txt - INC_15.txt

Database is '''cities'''
SQL script is: merged_income.sql

They contain:
*City
*State
*city_state_id to uniquely identify each city
*Income
*Year
*Code from the state code and Fips code

==Employment Data==

Data on employment was obtained from the American Community Survey (ACS), US Census Bureau.

Raw Data is in:
E:\McNair\Projects\Hubs\Summer 2017\Employment Data by MSA
Cleaned files are in
Z:\Hubs\2017\clean data

They contain:
*MSA code
*MSA
*Year
*Employment rate of individuals 16 years or older
*Unemployment rate of individuals 16 years or older

Data from 2005-2015

The SQL file that merges employment data from ACS (by MSA - Year) with the MSA-City file is titled '''Employment.sql'''.
The file is located in:
Z:\Hubs\2017

The final table is in db '''cities''' titled '''merged_employment'''.

It includes:
*MSA
*City
*Year
*Employment rate
*Unemployment rate

===Joined employment data===

Data is in:
Z:\Hubs\clean data

The file names are:
EMP_05.txt - EMP_15.txt

Database is '''cities'''
SQL script is: '''new_employment.sql''' and it is located in
Z:\Hubs\2017\sql scripts

The final table, which is joined to the VC table, is in db '''cities''' and is titled '''new_merged_employment'''.

It contains:
*City
*State
*Code from the state code and Fips code
*city_state_id to uniquely identify each city
*city_state_year to uniquely identify each city in each year
*Employment rates of individuals of 16 years or older
*Unemployment rates of individuals of 16 years or older
*Year

==Schooling Data==

Data on schooling was obtained from the American Community Survey (ACS), US Census Bureau.

Raw Data is in:
E:\McNair\Projects\Hubs\Summer 2017\School Enrollment Data by MSA
Cleaned files are in
Z:\Hubs\2017\clean data

They contain:
*MSA code
*MSA
*Year
*Total number of population 3 years and over enrolled in school
*Percent of population 3 years and over enrolled in public school
*Percent of population 3 years and over enrolled in private school

Data from 2005-2015

The SQL file that merges schooling data from ACS (by MSA - Year) with the MSA-City file is titled '''schooling.sql'''.
The file is located in:
Z:\Hubs\2017

The final table is in db '''cities''' titled '''merged_schooling'''.

It includes:
*MSA
*City
*Year
*Total
*Percent_public_schooling
*Percent_private_schooling

===Joined schooling data===

Data is in:
Z:\Hubs\clean data
The file names are:
SCH_05.txt - SCH_15.txt

Database is '''cities'''
SQL script which joins this table with VC table is: '''new_merged_schooling.sql'''
The final table is in db '''cities''' titled '''new_merged_schooling'''.

It contains:
*City
*State
*city_state_id to uniquely identify each city
*city_state_year to uniquely identify each city in each year
*Total number of school enrollment
*Percentage enrolled in public schools
*Percentage enrolled in private schools
*Year
*Code from the state code and Fips code

==VC Data==

Raw Data is in:
Z:\VentureCapitalData\SDCVCData
The file name is roundcitystateyear.txt

It contains:
*city
*state
*year
*seedamtm - seed, amount in millions
*earlyamtm - early, amount in millions
*lateramtm - late, amount in millions
*selamtm - seed early late, amount in millions
*numseeds - number of seeds
*numearly
*numlater
*numsel

Data from 1953-2017

The SQL file that merges VC data with the MSA-City file is titled '''vc.sql'''.
The file is located in:
Z:\Hubs\2017

The final table is in db '''cities''' titled '''vc_city_state_year'''.

It includes:
*city
*state
*city_state_id
*city_state_year
*seedamtm
*earlyamtm
*lateramtm
*selamtm
*numseeds
*numearly
*numlater
*numsel
*year