Changes

Jump to navigation Jump to search
679 bytes added ,  13:41, 21 September 2020
no edit summary
{{Project|Has project output=Data|Has sponsor=McNair ProjectsCenter
|Has title=Hubs
|Has owner=Jake SilbermanHira Farooqi,|Has keywords=Data,|Has project status=TabledActive|Does subsume=Hubs Analysis 2017,
}}
'''Important Notice: The Hubs Research Project is a full-length academic paper analyzing last update to the effectiveness of "hubs", a component of the entrepreneurship ecosystem, in the advancement data was done manually by Ed and growth of entrepreneurial success is in a metropolitan areaE:\projects\MeasuringHGHTEcosystems\HubsData-RevisedSimplified. xlsx'''
This research will primarily focused on large and mid-sized Metropolitan Statistical Areas (MSAs), as that is where the greater majority of Venture Capital funding is located.
===Primary Data Set===The Hubs data setResearch Project is a full-length academic paper analyzing the effectiveness of "hubs", from SDC Platinuma component of the entrepreneurship ecosystem, has been constructed in the server: Data files are advancement and growth of entrepreneurial success in 128.42.44a metropolitan area.181/bulk/Hubs All files are It focuses on cities in 128the United States as the primary unit of analysis.42.44.182/bulk/Projects/Ecosystem/Hubs psql Hubs
This page contains information about data used for this research project, including data sources, location of data on RDP and details on data processing.
The data set includes all United States Venture Capital transactions (moneytree) from the twenty-five year period of 1990 through 2015.
Data has been aggregated at the portfolio company, fund, and round level. It will be analyzed at the combined MSA level. We will be looking at in terms of number of companies funded in number of funds active, and flow of investment in a given MSA.
The Information on initial data set has now been uploaded work done prior to the database server, named Summer 2017 can be found at [[HubsSummer 2016]].There are 4 tables: *Rounds: Rounddate, coname, state, roundno, stage1, etc.*CombinedRounds: Coname, rounddate, discamount, fundname*Companies: LastInv, FirstInv, coname, MSA, MSACode, Address, state, datefounded, totalknownfunding, industry(major)*Funds: fundname, closingdate, lastinv, firstinv, msa, msacode, avinv, nocoinv, totalknowninv, address
Used variables'''Note on joining:''' The city-state-year ID from VC data is used as the master ID for joining datasets. Each table (e.g. income, nih, nsf, sbir, compustat) is first joined with the VC data on city-state-year ID and then the resulting tables are all joined together in the final table.
Companies: Coname, MSACode, Industry, state
MSALookupTable: MSACode, MSASuper
IndustryLookupTable: IndustryMajor, InduCode
->
CompanyInfo: Coname, MSASuper, InduCode, state (complete)
Funds===Data by zip code===*Population data, 2000-2016 - US Census Bureau (E: fundname\McNair\Hubs\summer 2017)https://www2.census.gov/programs-surveys/popest/datasets/*Income data, msacode1998-2014 - The Internal Revenue Service (E:\McNair\Hubs\summer 2017)https://www.irs.gov/uac/about-irs*DCI index, stateto assess the economic well-being of communities MSALookupTablehttp: MSACode//eig.org/dci/interactive-maps/u-s-zip-codes*R&D Expenses, MSASuper 1980-2016 - Wharton Research Data Services (E:\McNair\Hubs\summer 2017) *Zipcode look-> FundInfoup table obtained from https: fundname, msacode, state //www.unitedstateszipcodes.org/zip-code-database/. It's available in (completeE:\McNair\Hubs\summer 2017).
Rounds: coname, rounddate, stagecode, roundno CombinedRounds: coname, rounddate, discamount, fundname -> RoundInfoSuper: coname, rounddate, '''nofunds''', discamount -> RoundInfo: Coname, roundyear, fundname, estamount (complete)== Data by MSA ==
Then takeWe have principle cities of MSAs from the census: RoundInfohttps: Coname, roundyear, fundname, estamount CompanyInfo: Coname, MSASuper, InduCode, state FundInfo: fundname, msacode, state //www.census.gov/geographies/reference-files/time-series/demo/metro-> SuperRoundInfo: Coname, CoMSASuper, CoInduCode, CoState, FundName, FundMSASuper, FundState, RoundYear, RoundEstAmount micro/delineation-> MSAPortCos: Count(CoName) As NoPortCosFunded, CoMSASuper, RoundYear ..files.html
'''Notes on Creation of Primary Data Set'''We might be able to go City -> FIPS place code -> MSA?
Raw tables* companies Cities and their FIPS codes (last investment, first investment, company name, MSA, MSA code, address, state, date founded, known funding, industry) * funds (fund closing date, last investment, first investment, fund name, address, MSA, MSA code, Average investment, number companies invested (NoCos), known investment) * rounds (round date, company name, state, round number, stage 1, stage 2, stage 3) * combined rounds (company name, round date, disclosed amount, investor) * msalist (changes MSAs to CMSAs— combined MSAs)*industry list (changes 6 industry categories to 4— ICT, Life Sciences, Semiconductors, Otherwhich don't perfectly correspond) are available from https://www.census.gov/geo/reference/codes/place.html
Process*cleaned tables The Census claims to eliminate duplications, undisclosed variables*changed all original characters provide city to include CMSA and Industry Codes (companyinfo3, fundinfocleanfinal, roundinfoclean) *matched funds to avoid any issues with names (i.e. Fund ABC LMSA here: https://www.Pcensus.gov/Fund ABC LPgeo/Fund ABC) *matched roundinfoclean investors to fundinfocleanfinal investors (roundinfo.txt >> cleanfundfinal.txt)*join by round and company conames*bridge years (1990maps-2016), stage, and cmsa* populate data with count of companies (Deal flow) and estimated amount ($)** /data set in 181 hubs folder under summarycmsa/ua_rel_download.txt (38394)htmlHowever, there is only CBSA!
Key decisionsThis might do it:*Threw out undisclosed co throughhttps://www2.census.gov/geo/pdfs/maps-out as no address*Count is done by joining round and company*Anything fund related must be disclosed funddata/data/rel/explanation_ua_cbsa_rel_10.pdf*Near and far, and total invested, and fund counts, etc., are all done using disclosed funds that match onlyWe can maybe track city to principal city to MSA
==COMPUSTAT Data==
The data set includes information on publicly traded firms in the US. It was obtained from the Wharton Research Data Services (https://wrds-web.wharton.upenn.edu/wrds/index.cfm?).
'''Glossary of Tables'''
cleanco — used to remove duplicates from companies
cleanedcompanies — clean set of companies with no duplicates
cmsafunds-
cmsas— list of all CMSAs in final data set (for merging)
cmsastats- statistics not including empty years (pre-merge)
cmsastats2 - statistics separated by year-MSA
cmsastats3— statistics separated by year-MSA-stage
cmsastats4
cmsayears— empty merged table between year and cmsa
cmsayearstage — empty merged table between cmsa/years and stage
combinedrounds— raw sdc data for combined rounds
combinedroundswamt— used to join rounds and combined rounds for roundinfo2
companies- raw SDC company data
companyinfo — cleaned companies joined with state and CMSA information
companyinfo2— companyinfo1 with original industry categories
companyinfo3— companyinfo2 with updated industry categories and codes
companyinfo4-- clean version of companyinfo3
companyround- combined company information with round information
companyround2- combined company information with round information, cleaned up from companyround2
companyround3- combined company information with round information, cleaned up from companyround3
'''finaldataset'''- final statistics by CMSA-year, see section Final Primary Data Set for more information
fundinfo— funds joined with CMSA info
fundinfo2 - clean version of fundinfo1
fundinfoclean - used in process to clean fundinfo2
fundinfoclean2- used in process to clean fundinfo2
fundinfocleanfinal- used in process to clean fundinfo2
fundinfocleannodups- final clean set of fundinfo
funds - raw SDC fund data
Houston - analysis for Houston ecosystem team
Houston2- analysis for Houston ecosystem team
houston3- analysis for Houston ecosystem team
industry — new industry codes (4)— used for all future data sets
industrylist— lookup table for new industry codes (went from 6 to 4)
joined1- used for matching process
joined2- used for matching process
matchfund2- used for matching process
matchfunds- used for matching process
matchroundfund - used for matching process
matchroundfund2- used for matching process
msalist — lookup table for MSA to CMSA (used for all future data sets)
nearfar1-- beginning set before adding nearfar/stage variables
nearfar2 -- added binomial variables for near/far and for each of the stages, used to build final dataset
roundfund— not used— joined round to fund; drop/ignore
roundinfo— round info cleaned up to include number of investors in a syndicate and estimated investment per member of syndicate
roundinfo2— roundinfo1 including name of investors/funds
roundinfo3— clean version of roundinfo2
roundinfoclean — final clean version of roundinfo3 (final roundinfo table)
rounds — raw SDC round data
stages — table for merging stage-year-CMSA
superinfo — ignore/drop
temp - used for matching process
years — table for merging stage-year-CMSA
===Hub Candidates Raw Data Set===is in: E:\McNair\Projects\Hubs\Summer 2017 Z:\Hubs\2017
The Hubs candidate data set source file is a list of potential hubs found in MSAs throughout the countryRandDExpenditures.txt. Researchers are currently pulling qualitative and quantitative information It contains:*Date from the candidate's websites1980-2017 (July). *427799 records*Fields include:**R&D Expenditure**Address (inc. city, in an attempt to categorize what can be identified as a hub. This is a difficult data set to pullzip, as there is little to no quantitative information available for this category state)**Revenue of institution, and firms Database is dependent on accessibility of information to the public on the internet.'''cities'''
Characteristics/Variables*Year Founded*Square footage*LinkedIN self-identifiers (what the organization classifies itself on its LinkedIN profile) *Activeness on Twitter (binomial)*Member Directory available online (binomial)*Number of conference rooms*Price ($/month) for Flex desk *Offers Reserved desk (binomial)*Offers office space for rent (binomial) *Offers community membership-- not for coworking but for community events, etcSQL script is: COMPUSTAT. (binomial)*Number of events offered per month (estimate)*Offers code academy*Mission Statement/Vision (for qualitative or key-word analysis) sql
These characteristics/variables will be used to determine whether a candidate Output file is or is not likely to be a HubCOMPUSTATSummary.txt. It contains:*Variables: City, year, No. public firms, sum R&D, sum Sales, sum total assets*1979-2016*4440 cities
As of March 10th 2016, the list contains 125 Hub candidates.It is located in Z:\Hubs\2017\Output_Files
==NSF Data==Data is in: E:\McNair\Projects\Hubs\Summer 2017 Z:\Hubs\2017 Database is '''Where to findcities''': The Hubs data set can be found in the Ecosystem>>Hubs>>dataset folder. It is not currently in the database due to a UTF8 issue
===Supplementary Data Sets==='''Patent data'''SQL script is: to be pulled from USPTO or SDC Platinumnsf_2017. sql
The source files are: nsf2017.txt, copied from table '''Number of STEM Graduate Studentsnsf''' (NSF) , and nsf_institution copied from table '''University R&D Spendingnsf_grants_institution''' (NSF):*University R&D Data found under file "NSF DATA_2004 to 2011.xlsx" in datasets folder (Ecosystem>>Hubs>>Datasets)*R&D spending found at from the university level for 2014 ("Stem Grad Studentsbiotech db.xlsx) or at state level ("Science and Engineering Grad Students by State and Year 2000-2011.csv")** not uploaded to server or matched yet to CMSA code, because of this discrepancy. **"Stem Grad Students.xlsx" contains categorized university by MSA, can be used for all university-based projects
'''Per Capita Income''' and '''Employment Data''' (US Census Bureau)They contain: *"Per Capita Personal Income by MSA 2000-2012.xlsx" in datasets folder (Ecosystem>>Hubs>>Datasets>>Data from Yael)Award ID*Award Institution*Award Effective date*"Wages and Salaries by MSA 2000-2012.xlsx" in datasets folder (Ecosystem>>Hubs>>datasets>>Data from Yael)Institution city*Award Value*not uploaded to server or matched yet to CMSA Organization state codeFrom 1900 - 2017
'''Firm Births''' (BDS)Output file is nsfSummary.txt. It contains:*in server 181Variables: City, under table name "BDS"*includes birthState code year, deathnsf_nogrants, net(birth-death) and rate(death rate) for years 1990-2013 for every msansf_valuegrant *includes code for CMSA but is not aggregated by CMSA** i.e. BDS statistics are still separate for all the smaller MSAs in New York's CMSA (code=1)1900-2017
===ResourcesJoined NSF table===* Yael Hochberg and Fehder (2015), located The joined nsf table with the VC table is found in dropboxdb '''cities'''. The table is named '''merged_nsf'''.** Use this paper as a guideline on how All the values of nogrants and valuegrant with missing values for years 1990-2017 are set equal to conduct the analysis*US Census Bureau data on employment by MSA: http://factfinder0.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_14_5YR_B23027&prodType=table*USPTO utility patents by MSA: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/cls_cbsa/allcbsa_gd.htmThe sql script is in*MSA level trends Z: http://www.metrotrends.org/data.cf\HUbs\2017\sql scripts
===The Target Dataset=NIH Data==Data is in: Z:\Hubs E:\McNair\Projects\Hubs\Summer 2017
We will need to process the following variablesDatabase is '''cities'''SQL script is: nih2017.sqlThe source files are:*SuperMSA - combine SanFran and SanJose, New York and Newark?, NC Research triangle, others?nih_1986_2001.csv*CSV mapping msas to cmsas is nih_2002_2012.txt*nih_2013_2015located in the folder (and a table in the dbase)E:\McNair\Projects\Federal Grant Data\NIH
Example dataset:
MSA Year SeedVCInv SeedEarlyVCInv LaterVCInv NoDeals FundsInvested DistinctInvestors ....
----------------------------------------------------------------------------------------------------------------------------
1234 2001 1000000 20000000 30000000 4 7 7
The script that cleans NIH data and generates the summary table is titled '''nihSummary'''. It is located here:
Note that the unit of observation is MSA-Year. Z:\Hubs\2017\sql scripts
Variables to be computed at the MSA level:This table includes*HubActive (binary)*NoHubsActive (Count)*HubSqFt*Other Hub Vars (build list!!!)*'''SeedVCInv''' (Seed/Start-up)*'''EarlyVCInv''' (Early Stage)*'''LaterStageVC''' (Later)*'''OtherStageVC''' (Buyout/Acq, Other)*'''NoDeals''' (done by local VCs?)**'''NoDealsNear'''**'''NoDealsFar'''*NoPortCosFunded*'''FundsInv''' (in an MSA)**'''FundsInvFromNear''' (within MSA?)**'''FundsInvFromFar''' (outside MSA?)year*DistinctInvestors (?)city**DistinctInvestorsNear (within MSA?)state**DistinctInvestorsFar (outside MSA?)*PatentCount*NoSTEMGradscountry*FirmBirths nogrants (BDS datanumber of grants)*UniRandDSpend*PerCapitaIncomevaluegrant*Employmentcity_state
We need to:*Check funds invested means dollars invested*Categorize near and far! Is it within MSA vs. not, within adjacent MSAs, etc.?Date from 1986-2015
===Joined NIH table===
The joined NIH table with the VC table is found in db '''cities'''. The table is named '''merged_nih'''.
All the values of nih_valuegrant and nih_nogrants with missing values for years 1986-2015 are set equal to 0.
The sql script is in
Z:\HUbs\2017\sql scripts
There may be a second dataset that has Hub-Industry-Year (where industry ==Clinical Trials Data==Data is semiconductor/non-semiconductor?).in: Z:\Hubs E:\McNair\Projects\Hubs\Summer 2017
===Final Primary Data Set===Database is '''cities'''SQL script is: ctrials.sqlThe source file is:
*Deal is a round syndicate (near/far deal is one investor that is near/far)medclinical.txt
Table namelocated in Z:\Hubs\2017 *Date from 1999-2017 ===Joined clinical trials table=== The file which contains the number of trials in each city and year is located in: Z: finaldataset\Hubs\2017 cmsaThe file is in: Z:\Hubs\2017\clean dataThe name of the file is: ctrialsSummary.txt It contains:*city*year*city_state_year*noctrials - number of trials The ctrials is joined with VC table. The joined SQL script is: '''new_ctrials.sql''' and it is located in totalamountinvZ:\Hubs\2017\sql scripts The name of the joined table is '''new_merged_ctrials'''. It contains:*city*state*city_state_id*city_state_year*year*noctrials*seedamtm*earlyamtm*lateramtm*selamtm*numseeds*numearly*numlater*numsel All the values of noctrials with missing values for years 1999-2017 are set equal to 0. ==Population Data==Data is in: Z:\Hubs E:\McNair\Projects\Hubs\Summer 2017 Database is '''cities''' SQL script is: '''population.sql'''The source files are: *pop2000_2009.xlsx*pop2010_2016.xlsx They contain:*State*City name *Year *Population Estimates Date from 2000-total amount invested 2016 ===Joined population table=== Data is in: Z:\Hubs\2017\clean dataThe file names are nearamountinv1_population.txt -contains data on population estimates from 2000-amount invested from local funds2009 faramountinv2_population.txt -contains data on population estimates from 2010- amount invested 2016  Database is '''cities'''SQL script is: '''new_population.sql''', located in Z:\Hubs\2017\sql scripts The population table is joined on VC table. The table is called '''new_merged_population'''. They contain:*City*State*city_state_id to uniquely identify each city*city_state_year to uniquely identify each city in each year*Population estimates*Year*Code from the state code and Fips code*State full name ==Income Data== Raw data was obtained from Census data, American Communities Survey. Raw Data is in: E:\McNair\Projects\Hubs\Summer 2017\MSA Income_raw.zip   Date from funds outside CMSA 2005-2015 The master list with MSAs and principal cities is titled '''list2.xls'''. It is located at: earlyinvZ:\Hubs\2017 This master list includes:*MSA code*MSA name*Principal City*State*Place code (city code)*State Code This master list was edited to associate each principal city with a unique state. E.g. if New York is the principal city located in New York-New Jersey MSA, it was associated with state NY-amount invested NJ. So '''list''' was edited to put New York with NY.   Cleaned Income data files are in early stage companies laterinvZ:\Hubs\2017\merging_on_ID  They contain:*MSA code*MSA*Year *Total Household Income  The MSA-City-amount invested State look up file is titled '''msa_city_state_wcode.txt'''. It is located in later stage companies startupseedinvZ:\Hubs\2017\merging_on_ID  The SQL file that merges income data from ACS (by MSA - Year) with the MSA-City file is titled '''income.sql'''. It is located here: Z:\Hubs\2017\sql scripts The final income table is in db '''cities''' titled '''merged_income'''. It includes:*MSA*City*State*Year*Total Household Income  The table includes 8780 observations ===Joined income table=== Data is in: Z:\Hubs\clean dataThe file names are: INC_05.txt -amount invested INC_15.txt  Database is '''cities'''SQL script is: merged_income.sql  They contain:*City*State*city_state_id to uniquely identify each city*Income*Year*Code from the state code and Fips code ==Employment Data== Data on employment was obtained from American Communities Survey, US Census Bureau. Raw Data is in: E:\McNair\Projects\Hubs\Summer 2017\Employment Data by MSACleaned files are in seed Z:\Hubs\2017\clean data They contain:*MSA code*MSA*Year *Employment rate of individuals 16 years or older*Unemployment rate of individuals 16 years or startup stage companies older  otherstageinvDate from 2005-2015 The SQL file that merges employment data from ACS (by MSA -Year) with the MSA-amount invested City file is titled '''Employment.sql'''. The file is located in: Z:\Hubs\2017 The final table is in db '''cities''' titled '''merged_employment'''. It includes:*MSA*City*Year*Employment rate*Unemployment rate ===Joined employment table=== Data is in Acquisition/Buy: Z:\Hubs\clean data The file names are: EMP_05.txt -outs/Other stage companies EMP_15.txt  Database is '''cities'''SQL script is: '''new_employment.sql''' and it is located in Z:\Hubs\2017\sql scripts The final table which is joined on VC is in db cities titled '''new_merged_employment'''. They contain:*City*State*Code from the state code and Fips code*city_state_id to uniquely identify each city*city_state_year to uniquely identify each city in each year*Employment rates of individuals of 16 years or older*Unemployment rates of individuals of 16 years or older*Year ==Schooling Data== Data on schooling was obtained from American Communities Survey, US Census Bureau. Raw Data is in: E:\McNair\Projects\Hubs\Summer 2017\School Enrollment Data by MSACleaned files are in investingfundZ:\Hubs\2017\clean data They contain:*MSA code*MSA*Year *Total number of population 3 years and over enrolled in school*Percent of population 3 years and over enrolled in public school*Percent of population 3 years and over enrolled in private school  Date from 2005-2015 The SQL file that merges schooling data from ACS (by MSA -Year) with the MSA-distinct funds that City file is titled '''schooling.sql'''. The file is located in: Z:\Hubs\2017 The final table is in db '''cities''' titled '''merged_schooling'''. It includes:*MSA*City*Year*Total*Percent_public_schooling*Percent_private_schooling ===Joined schooling table=== Data is in: Z:\Hubs\clean dataThe file names are investing : SCH_05.txt - SCH_15.txt  Database is '''cities'''SQL script which joins this table with VC table is: '''new_merged_schooling.sql'''The final table is in db '''cities''' titled '''new_merged_schooling'''. It contains:*City*State*city_state_id to uniquely identify each city*city_state_year to uniquely identify each city in that CMSA-each year*Total number of school enrollment*Percentage enrolled in public schools*Percentage enrolled in private schools*Year*Code from the state code and Fips code ==VC Data==   Raw Data is in: Z:\VentureCapitalData\SDCVCData\vcdb2 The file name is roundleveloutput2.txt It contains:*city*state*year investingfundnear*seedamtm -seed, amount in millions*earlyamtm -distinct funds from that CMSA that invested early, amount in that CMSAmillions*lateramtm -year late, amount in millions investingfundfar*selamtm -seed early late, amount in millions*numseeds -distinct funds number of seeds*numearly *numlater*numsel*numdeals*numalive  Date from outside that CMSA that invested 1948-2017  The table is in db '''cities''' titled '''new_vc'''. It includes:*city*state*city_state_id*city_state_year*seedamtm*earlyamtm*lateramtm*selamtm*numseeds*numearly*numlater*numsel*numdeals*numalive*year ==Final Joined Data set ==  The final data set is in file '''final.txt''' and is located here: Z:\Hubs\2017 It includes:*city*state*city_state_year - (ID that CMSAdata is merged on)*year*seedamtm - Seed Amount*earlyamtm - Early Investment Amount*lateramtm - Late Investment Amount*selamtm - Seed early or late amount*numseeds - Number of seed investments *numearly -year Number of early investments deals*numlater -Number of late investments*numsel *numdeals -number Number of deals (first contracts)*numalive - Number of start ups alive neardeals*income -Income per capita in each city-number year*sbir_nogrants - Number of deals inside a CMSA SBIR grants fardeals*sbir_valuegrant -Value of SBIR grants*emp -number Employment stats of deals from outside a CMSA each city-year*unemp -Rate of unemployment*popestimate -some Population estimate of these deals might count each city-year*private - Enrollment in private schools*public - Enrollment in both categories, because public schools*total - *numfirms - Number of syndicate members being both inside and outside the CMSApublicly traded firms earlystagedeals*randd -R&D expenditure of publicly traded firms*revenue -deals with earlystage companiesRevenue of PTF*totalassets laterstagedeals*nsf_nogrants -Number of NSF grants*valuegrant -deals with later stage companies Value of NSF grants startupseeddeals*nih_nogrants -Number of NIH grants*nih_valuegrant -deals with startup/seed companies Value of NIH grants otherstagedeals*noctrials -NUmber of clinical trials == Defining Hubs == '''Summer 2016''' -deals with companies Last year a master list of 125 "potential" hubs was used. A scorecard was developed which filtered these 125 candidate hubs to determine which of these should be included in the study sample. This method resulted in other stages a sample size of ~ 30. The master list and the final hubs list is titled '''Hubs Data v2_'16'''. It is located here: newportcosfundedZ:\Hubs\2017\hubs_data '''Summer 2017''' -In order to obtain a more statistically significant sample of hubs, we developed 5 criteria which produce a more relaxed definition of hubs than last year. These include *Availability of co-number working space*Coding classes or tech events*Some focus on the tech sector (this is important as our dependent variable is VC funding)*Presence of an accelerator*Availability of mentorship for members. We will review the 125 candidate hubs and select those which satisfy a subset or all of portfolio companies to receive their first investment in that yearthese characteristics.   [[category:Internal]]

Navigation menu