Difference between revisions of "Urban Start-up Agglomeration and Venture Capital Investment"

From edegan.com
Jump to navigation Jump to search
Line 255: Line 255:
 
===ACS (American Community Survey) Data ===
 
===ACS (American Community Survey) Data ===
  
Steps to download:
+
Steps to download:
*
+
 
 +
1) I know the dataset or table(s) that I want to download.
 +
2) Press Next
 +
3) Choose dataset of interest
 +
* We plan to pull ACS 1-year estimates for each year, from 2005-2016
 +
4)

Revision as of 15:26, 24 October 2017

Academic Paper
Title Urban Start-up Agglomeration
Author Ed Egan
RAs Peter Jalbert, Jake Silberman, Christy Warden
Status In development
© edegan.com, 2016


Summary

Agglomeration is generally thought to be one of the most important determinants of growth for urban entrepreneurship ecosystems. However, there is essentially no empirical evidence to support this. This paper takes advantage of geocoding and introduces a novel measure of agglomeration. This measure is the smallest circle area that covers all startup offices, subject to having at least N startups in each circle. Using GIS data on cities, this paper controls for the density and socio-demographics of an area to identify the effect of just agglomeration.

Description

Clusters of economic activity plays a significant role in the firms performance and growth. An important driver of growth is the knowledge spillover between firms. This includes among others the facilitation of information flow and ideas between firms which could be a milestone especially in the growth of startup firms or small businesses. This project focuses on the effects of agglomeration on the performance and growth of startup firms. It introduces a novel measure of agglomeration which can be used to empirically test the effects of clustering. This measure the is smallest total circle area that covers all of the startups in the sample such that there are at least n firms in each circle. The projects is based on the creation of an algorithm which gives an unbiased measure to be used in the empirical analysis. The regression we are interested in takes the following form:

Regression equation.png

The dependent variable is a measure of growth of the firms. This measure could be investment forwarded one period or growth in investment. The control variables include the number of the startups firms, m, the agglomeration measure, A and a vector of other control variables affecting the growth of firms at time t. Because of the endogeneity in the circle area or the measure of agglomeration, A, there is a need for an instrumental variable to get consistent estimates of the effects we are interested in. The proposed instrument is the presence of a river, or road in between the points representing geographical locations of the venture capital backed up firms. The instrument affects agglomeration without having a direct impact on the growth. This makes it good candidate for a valid instrument. The next tasks are determining the additional control variables to include in the regression, years to include in the analysis and methods of finding an unbiased measure of agglomeration.

Data

Making the circle input data

Ed's additional datawork is in

Z:\VentureCapitalData\SDCVCData\vcdb2\ProcessingCoLevelSimple.sql

The key table for circle processing is CoLevelBlowout, which is restricted (to include cities with greater than 10 active at some point in the data) to make CoLevelForCircles.

We need to:

  1. Winsorize CoLevelBlowout
  2. Compute the circles!
  3. Make the Bay Area (over time) data
  4. Plot the Bay Area data (with colors per Bay Area city) for 1985 to present
  5. Combine the plots to make an animated gif

To winsorize the data we need the formula for Great Circle Distance. The radius of the earth is 6,378km (half of diameter: 12,756 km). So:

GCD = acos( sin(lat1) x sin(lat2) + cos(lat1) x cos(lat2) x cos(long1-long2) ) x r

Main Sources

The primary sources of data for this project are:

  • SDC VentureXpert - from VC Database Rebuild, the key table is
  • GIS City Data
  • Data on NSF, NIH, population, income, clinical trials, employment, schooling, R&D expenditures and revenue of firms can be found in Hubs.

VC data

Data on the number of new vc backed firms in each city and year is in:

Z:\Hubs\2017\clean data
The name of the file is firm_nr.txt.

Database is cities SQL script is: nr_firms.sql

Raw data is in:

Z:\VentureCapitalData\SDCVCData\vcdb2
The file is colevelsimple.txt

In order to see if there are outliers, I get the average coordinates for all cities and find the differences of the firm's coordinates from the city coordinate. The script for the average city coordinates is in

Z:\Hubs\2017\sql scripts and the file name is newcolevel.sql.

The differences are taken in excel. The file containing the differences is in

Z:\Hubs\2017 and the file name is new_colevel.txt.
  • Data on the circle area in each city and year is in:
Z:\Hubs\2017\clean data
The name of the file is circles.txt. (It contains only 106 observations)

Database is cities SQL script is: circles.sql

The script for joining the two tables on the VC table is in:

Z:\Hubs\2017\sql scripts
 The name of the file is new_firm_nr_circles.sql
  • We use the cities with greater than 10 active VC backed firms. Data on the cities and number of active firms is in:
E:\McNair\Projects\Hubs\Summer 2017
The file is CitiesWithGT10Active.txt

The script for joining the final data with this file is located in

Z:\Hubs\2017\sql scripts
The file name is final_joined_kerda.sql.

The final data is in

Z:\Hubs\2017\clean data
The file name is new_final_kerda.txt.

Accelerator data

Accelerators data is in

Z:\Hubs\2017\clean data
The file name is accelerators.txt
The table is accelerators

The joined accelerators data with the VC table is in joined_accelerators table. The script is in

Z:\Hubs\2017\sql scripts
The file name is join_accelerators.sql

The do file is in

Z:\Hubs\2017\kerda
The name is agglomeartion_kerda.do

It includes the graphs, tables and the preliminary FE regressions with VC funding amount and growth rate. It also predicts the hazard rates, matches on the hazard rate in order to create synthetic control and treatment groups. What is left to do is to add 2 lagged and 3 forward observations for the cities which do have a match. Remove the overlapping observations for the years that get a treatment but which at the same time serve as a control.

See also

Also:


Unbiased measure

The number of startups affects the total area of the circles according to some function. The task is to find an unbiased measure of the area, which is not affected by the number of the startups, given the size and their distribution.

For the unbiased calculation of a measure in a different context see: http://users.nber.org/~edegan/w/images/d/d0/Hall_(2005)_-_A_Note_On_The_Bias_In_Herfindahl_Type_Measures_Based_On_Count_Data.pdf

GIS Resources

Useful functions for spatial joins

sum(expression): aggregate to return a sum for a set of records
count(expression): aggregate to return the size of a set of records
ST_Area(geometry) returns the area of the polygons
ST_AsText(geometry) returns WKT text
ST_Buffer(geometry, distance): For geometry: Returns a geometry that represents all points whose distance from this Geometry is less than or equal to distance. Calculations are in the Spatial Reference System of this Geometry. For geography: Uses a planar transform wrapper.
ST_Contains(geometry A, geometry B) returns the true if geometry A contains geometry B
ST_Distance(geometry A, geometry B) returns the minimum distance between geometry A and geometry B
ST_DWithin(geometry A, geometry B, radius) returns the true if geometry A is radius distance or less from geometry B
ST_GeomFromText(text) returns geometry
ST_Intersection(geometry A, geometry B): Returns a geometry that represents the shared portion of geomA and geomB. The geography implementation does a transform to geometry to do the intersection and then transform back to WGS84
ST_Intersects(geometry A, geometry B) returns the true if geometry A intersects geometry B
ST_Length(linestring) returns the length of the linestring
ST_Touches(geometry A, geometry B) returns the true if the boundary of geometry A touches geometry B
ST_Within(geometry A, geometry B) returns the true if geometry A is within geometry B
geometry_a && geometry_b: Returns TRUE if A’s bounding box overlaps B’s.
geometry_a = geometry_b: Returns TRUE if A’s bounding box is the same as B’s.
ST_SetSRID(geometry, srid): Sets the SRID on a geometry to a particular integer value.
ST_SRID(geometry): Returns the spatial reference identifier for the ST_Geometry as defined in spatial_ref_sys table.
ST_Transform(geometry, srid): Returns a new geometry with its coordinates transformed to the SRID referenced by the integer parameter.
ST_Union(): Returns a geometry that represents the point set union of the Geometries.
substring(string [from int] [for int]): PostgreSQL string function to extract substring matching SQL regular expression.
ST_Relate(geometry A, geometry B): Returns a text string representing the DE9IM relationship between the geometries.
ST_GeoHash(geometry A): Returns a text string representing the GeoHash of the bounds of the object.

Native functions for geography

ST_AsText(geography) returns text
ST_GeographyFromText(text) returns geography
ST_AsBinary(geography) returns bytea
ST_GeogFromWKB(bytea) returns geography
ST_AsSVG(geography) returns text
ST_AsGML(geography) returns text
ST_AsKML(geography) returns text
ST_AsGeoJson(geography) returns text
ST_Distance(geography, geography) returns double
ST_DWithin(geography, geography, float8) returns boolean
ST_Area(geography) returns double
ST_Length(geography) returns double
ST_Covers(geography, geography) returns boolean
ST_CoveredBy(geography, geography) returns boolean
ST_Intersects(geography, geography) returns boolean
ST_Buffer(geography, float8) returns geography [1]
ST_Intersection(geography, geography) returns geography [1]

Functions for Linear Referencing

ST_LineInterpolatePoint(geometry A, double measure): Returns a point interpolated along a line.
ST_LineLocatePoint(geometry A, geometry B): Returns a float between 0 and 1 representing the location of the closest point on LineString to the given Point.
ST_Line_Substring(geometry A, double from, double to): Return a linestring being a substring of the input one starting and ending at the given fractions of total 2d length.
ST_Locate_Along_Measure(geometry A, double measure): Return a derived geometry collection value with elements that match the specified measure.
ST_Locate_Between_Measures(geometry A, double from, double to): Return a derived geometry collection value with elements that match the specified range of measures inclusively.
ST_AddMeasure(geometry A, double from, double to): Return a derived geometry with measure elements linearly interpolated between the start and end points. If the geometry has no measure dimension, one is added.

3-D Functions

ST_3DClosestPoint — Returns the 3-dimensional point on g1 that is closest to g2. This is the first point of the 3D shortest line.
ST_3DDistance — For geometry type Returns the 3-dimensional cartesian minimum distance (based on spatial ref) between two geometries in projected units.
ST_3DDWithin — For 3d (z) geometry type Returns true if two geometries 3d distance is within number of units.
ST_3DDFullyWithin — Returns true if all of the 3D geometries are within the specified distance of one another.
ST_3DIntersects — Returns TRUE if the Geometries “spatially intersect” in 3d - only for points and linestrings
ST_3DLongestLine — Returns the 3-dimensional longest line between two geometries
ST_3DMaxDistance — For geometry type Returns the 3-dimensional cartesian maximum distance (based on spatial ref) between two geometries in projected units.
ST_3DShortestLine — Returns the 3-dimensional shortest line between two geometries

Relevant PostgreSQL Commands

\dt *.* Show all tables
\q Exit table

Specifities/ Outliers to consider

New York (decompose)
Princeton area (keep Princeton  unique)
Reston, Virginia (keep)
San Diego (include La Jolla)
Silicon Valley (all distinct)

To make a circle

SELECT ST_Buffer([desired point], [desired radius], 'quad_segs=8') 
FROM [desired table]

quad_segs=8 indicates circle

CirclePostGIS.png

For more precision in circle:

SELECT ST_Transform(geometry( 
    ST_Buffer(geography( 
        ST_Transform( [desired point], 4326 )), 
            [desired radius]')), 
            900913) FROM [desired table]

4326 and 900913 represent particular precision.

Decimal Degrees

We are working with longitude and latitude in decimal degrees. See https://en.wikipedia.org/wiki/Decimal_degrees

When converting radius to km, multiply by 111.3199. For area, multiple by (111.3199)^2=12,392.12013601.

Census Data

Population

The Census Gazetteer files for 2010, 2000 and 1990 can give use population by census place. See https://www.census.gov/geo/maps-data/data/gazetteer.html

The places file contains data for all incorporated places and census designated places (CDPs) in the 50 states, the District of Columbia and Puerto Rico as of the January 1, 2010. The file is tab-delimited text, one line per record. Some records contain special characters.
Download the National Places Gazetteer Files (1.2MB)
Download the State-Based Places Gazetteer Files:
Column	Label	Description
Column 1	USPS	United States Postal Service State Abbreviation
Column 2	GEOID	Geographic Identifier - fully concatenated geographic code (State FIPS and Place FIPS)
Column 3	ANSICODE	American National Standards Insititute code
Column 4	NAME	Name
Column 5	LSAD	Legal/Statistical area descriptor.
Column 6	FUNCSTAT	Functional status of entity.
Column 7	POP10	2010 Census population count.
Column 8	HU10	2010 Census housing unit count.
Column 9	ALAND	Land Area (square meters) - Created for statistical purposes only.
Column 10	AWATER	Water Area (square meters) - Created for statistical purposes only.
Column 11	ALAND_SQMI	Land Area (square miles) - Created for statistical purposes only.
Column 12	AWATER_SQMI	Water Area (square miles) - Created for statistical purposes only.
Column 13	INTPTLAT	Latitude (decimal degrees) First character is blank or "-" denoting North or South latitude respectively
Column 14	INTPTLONG	Longitude (decimal degrees) First character is blank or "-" denoting East or West longitude respectively.

Relationships

See https://www.census.gov/geo/maps-data/data/relationship.html

These text files describe geographic relationships. There are two types of relationship files; those that show the relationship between the same type of geography over time (comparability) and those that show the relationship between two types of geography for the same time period.

ACS (American Community Survey) Data

Steps to download:

1) I know the dataset or table(s) that I want to download. 2) Press Next 3) Choose dataset of interest

  • We plan to pull ACS 1-year estimates for each year, from 2005-2016

4)