Changes

Jump to navigation Jump to search
5,673 bytes added ,  13:41, 21 September 2020
no edit summary
The {{Project|Has project output=Data|Has sponsor=McNair Center|Has title=Hubs Research Project is a full-length academic paper analyzing the effectiveness of "hubs"|Has owner=Hira Farooqi, a component of the entrepreneurship ecosystem|Has keywords=Data|Has project status=Active|Does subsume=Hubs Analysis 2017, in the advancement and growth of entrepreneurial success in a metropolitan area. }}
This research will primarily focused on large '''Important Notice: The last update to the hubs data was done manually by Ed and midis in E:\projects\MeasuringHGHTEcosystems\HubsData-sized Metropolitan Statistical Areas (MSAs), as that is where the greater majority of Venture Capital funding is locatedRevisedSimplified. xlsx'''
===Primary Data Set===
The Hubs data set, from SDC Platinum, is currently in the process of being constructed.
The data set includes all United States Venture Capital transactions (moneytree) from Hubs Research Project is a full-length academic paper analyzing the twenty-five year period effectiveness of "hubs", a component of 1990 through 2015.Data has been accumulated at the portfolio company, fundentrepreneurship ecosystem, in the advancement and round levelgrowth of entrepreneurial success in a metropolitan area. It will be analyzed at focuses on cities in the United States as the MSA levelprimary unit of analysis. We will be looking at in terms of number of companies funded in number of funds active, and flow of investment in a given MSA
This page contains information about data used for this research project, including data sources, location of data on RDP and details on data processing.
The data set has now been uploaded to the database server, named Hubs.
There are 4 tables:
*Rounds: Rounddate, coname, state, roundno, stage1, etc.
*CombinedRounds: Coname, rounddate, discamount, fundname
*Companies: LastInv, FirstInv, coname, MSA, MSACode, Address, state, datefounded, totalknownfunding, industry(major)
*Funds: fundname, closingdate, lastinv, firstinv, msa, msacode, avinv, nocoinv, totalknowninv, address
Used variables:
Companies: Coname, MSACode, Industry, state MSALookupTable: MSACode, MSASuper IndustryLookupTable: IndustryMajor, InduCode -> CompanyInfo: Coname, MSASuper, InduCode, state (complete)Information on initial data work done prior to Summer 2017 can be found at [[Hubs Summer 2016]].
Funds'''Note on joining: fundname''' The city-state-year ID from VC data is used as the master ID for joining datasets. Each table (e.g. income, msacodenih, state MSALookupTable: MSACodensf, sbir, MSASuper compustat) is first joined with the VC data on city-> FundInfo: fundname, msacode, state (complete)-year ID and then the resulting tables are all joined together in the final table.
Rounds: coname, rounddate, stagecode, roundno
CombinedRounds: coname, rounddate, discamount, fundname
->
RoundInfoSuper: coname, rounddate, '''nofunds''', discamount
->
RoundInfo: Coname, roundyear, fundname, estamount (complete)
Then take===Data by zip code===*Population data, 2000-2016 - US Census Bureau (E:\McNair\Hubs\summer 2017) RoundInfohttps: Coname//www2.census.gov/programs-surveys/popest/datasets/*Income data, roundyear, fundname, estamount1998-2014 - The Internal Revenue Service (E:\McNair\Hubs\summer 2017) CompanyInfohttps: Coname//www.irs.gov/uac/about-irs*DCI index, MSASuper, InduCode, stateto assess the economic well-being of communities FundInfohttp: fundname, msacode, state //eig.org/dci/interactive-maps/u-s-zip->codes SuperRoundInfo: Coname*R&D Expenses, CoMSASuper, CoInduCode, CoState, FundName, FundMSASuper, FundState, RoundYear, RoundEstAmount 1980-2016 -> MSAPortCosWharton Research Data Services (E: Count(CoName\McNair\Hubs\summer 2017) As NoPortCosFunded, CoMSASuper, RoundYear *Zipcode look-up table obtained from https://www.unitedstateszipcodes.org/zip-code-database/. It's available in (E:\McNair\Hubs\summer 2017).
'''Notes on Creation of Primary == Data Set'''by MSA ==
Raw tables* companies (last investment, first investment, company name, MSA, MSA code, address, state, date founded, known funding, industry) * funds (fund closing date, last investment, first investment, fund name, address, MSA, MSA code, Average investment, number companies invested (NoCos), known investment) * rounds (round date, company name, state, round number, stage 1, stage 2, stage 3) * combined rounds (company name, round date, disclosed amount, investor) * msalist (changes MSAs to CMSAs— combined We have principle cities of MSAs)from the census:*industry list (changes 6 industry categories to 4— ICT, Life Sciences, Semiconductors, Other) https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html
Process* cleaned tables We might be able to eliminate duplications, undisclosed variables* changed all original characters to include CMSA and Industry Codes (companyinfo3, fundinfocleanfinal, roundinfoclean) * matched funds to avoid any issues with names (i.e. Fund ABC L.P./Fund ABC LP/Fund ABC) *matched roundinfoclean investors to fundinfocleanfinal investors (roundinfo.txt go City ->FIPS place code -> cleanfundfinal.txt)*join by round and company conames*bridge years (1990-2016), stage, and cmsa* populate data with count of companies (Deal flow) and estimated amount ($)** data set in 181 hubs folder under summarycmsa.txt (38394)MSA?
===Hub Candidates Data Set===Cities and their FIPS codes (which don't perfectly correspond) are available from https://www.census.gov/geo/reference/codes/place.html
The Hubs candidate data set is a list of potential hubs found in MSAs throughout the countryCensus claims to provide city to MSA here: https://www. Researchers are currently pulling qualitative and quantitative information from the candidate's websites, in an attempt to categorize what can be identified as a hubcensus. This is a difficult gov/geo/maps-data/data set to pull/ua_rel_download.htmlHowever, as there is little to no quantitative information available for this category of institution, and is dependent on accessibility of information to the public on the internet.only CBSA!
CharacteristicsThis might do it: https:/Variables*Year Founded*Square footage*LinkedIN self/www2.census.gov/geo/pdfs/maps-identifiers (what the organization classifies itself on its LinkedIN profile) *Activeness on Twitter (binomial)*Member Directory available online (binomial)*Number of conference rooms*Price ($data/data/rel/month) for Flex desk *Offers Reserved desk (binomial)*Offers office space for rent (binomial) *Offers community membership-- not for coworking but for community events, etcexplanation_ua_cbsa_rel_10. (binomial)*Number of events offered per month (estimate)*Offers code academypdf*Mission Statement/Vision (for qualitative or key-word analysis) We can maybe track city to principal city to MSA
These characteristics==COMPUSTAT Data==The data set includes information on publicly traded firms in the US. It was obtained from the Wharton Research Data Services (https:/variables will be used to determine whether a candidate is or is not likely to be a Hub/wrds-web.wharton.upenn.edu/wrds/index.cfm?).
As of March 10th 2016, the list contains 125 Hub candidates.
===Supplementary Raw Data Sets===is in:'''Patent data''' E: to be pulled from USPTO or SDC Platinum. \McNair\Projects\Hubs\Summer 2017*unable to find on the internet, must be pulled from the larger dataset Z:\Hubs\2017
'''Number of STEM Graduate Students''' The source file is RandDExpenditures.txt. It contains:*Date from 1980-2017 (NSFJuly) and . *427799 records*Fields include:**R&D Expenditure**Address (inc. city, zip, state)**Revenue of firms Database is '''University R&D Spendingcities''' (NSF): Grad Students found for the year 2015, no data going back historically; R&D found for the past 10 years
*categorized university by MSA, can be used for all university-based projectsSQL script is: COMPUSTAT.sql
'''Per Capita Income''' and '''Employment Data''' (US Census Bureau)Output file is COMPUSTATSummary.txt. It contains: complete for most recent census*Variables: City, year, No.public firms, sum R&D, sum Sales, unable to find data going back historicallysum total assets*1979-2016*4440 cities
'''Firm Births''' (BDS)It is located in Z: data set found for 1990 to present, currently being cleaned up for use\Hubs\2017\Output_Files
===Resources=NSF Data==* Yael Hochberg and Fehder (2015), located Data is in dropbox:** Use this paper as a guideline on how to conduct the analysis E:\McNair\Projects\Hubs\Summer 2017*US Census Bureau data on employment by MSA Z: http://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_14_5YR_B23027&prodType=table\Hubs\2017*USPTO tility patents by MSA: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/cls_cbsa/allcbsa_gd.htm *MSA level trends: http://www.metrotrends.org/data.cfDatabase is '''cities'''
===To Do===We need to find and clean up data sets at the MSA levelSQL script is: nsf_2017.sql
*Patent data (USPTO)*Number of STEM Graduate Students (NSF)**in progress*University R&D Spending (NSF)*Per Capita Income (US Census) **complete (Employment and Income_MSAThe source files are: nsf2017.xls)*Employment (US Census)**complete (Employment txt, copied from table '''nsf''', and Income_MSAnsf_institution copied from table '''nsf_grants_institution''' from the biotech db.xls)*Firm births (BDS)*SELECT MSAs!!! Possible method: choosing CMSAs with Distinct companies funded ** >100 = 38** >75 = 45** >50 = 52** >25 = 80** Total 238**greater than 100 will give us 52 CMSAs to work with
===Data Cleaning===They contain:*Award ID*Award Institution*Award Effective date*Institution city*Award Value*Organization state codeFrom 1900 - 2017
Cleaning tasksOutput file is nsfSummary.txt. It contains:*Remove PortCos named UndisclosedVariables: City, State code year, etc.*Remove Funds named Unknownnsf_nogrants, etc.nsf_valuegrant *Basic Data cleaning:**Enormous outliers on funds invested**Check dates1900-2017
Lookup tables:===Joined NSF table===The joined nsf table with the VC table is found in db '''cities'''. The table is named '''merged_nsf'''.*SuperMSAsAll the values of nogrants and valuegrant with missing values for years 1990-2017 are set equal to 0.*IndustryThe sql script is in*Stages Z:\HUbs\2017\sql scripts
==NIH Data==
Data is in:
Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017
===Database is '''cities'''SQL script is: nih2017.sqlThe Target Dataset===source files are: *nih_1986_2001.csv*nih_2002_2012.txt*nih_2013_2015located in E:\McNair\Projects\Federal Grant Data\NIH
We will need to process the following variables:
*SuperMSA - combine SanFran and SanJose, New York and Newark?, NC Research triangle, others?
Example datasetThe script that cleans NIH data and generates the summary table is titled '''nihSummary'''. It is located here: MSA Year SeedVCInv SeedEarlyVCInv LaterVCInv NoDeals FundsInvested DistinctInvestors .... ---------------------------------------------------------------------------------------------------------------------------- 1234 2001 1000000 20000000 30000000 4 7 7
Z:\Hubs\2017\sql scripts
Note that the unit This table includes*year*city*state*country*nogrants (number of observation is MSA-Year.grants)*valuegrant*city_state
Variables to be computed at the MSA level:*HubActive (binary)*NoHubsActive (Count)*HubSqFt*Other Hub Vars (build list!!!)*'''SeedVCInv'''*'''SeedEarlyVCInv'''*NoDeals (done by local VCs?)**NoDealsNear**NoDealsFar*'''NoPortCosFunded '''*FundsInv (in an MSA)**FundsInvFromNear (within MSA?)**FundsInvFromFar (outside MSA?)*DistinctInvestors**DistinctInvestorsNear (within MSA?)**DistinctInvestorsFar (outside MSA?)*PatentCount*NoSTEMGrads*FirmBirths (BDS data)*UniRandDSpend*PerCapitaIncome*EmploymentDate from 1986-2015
We need to:===Joined NIH table===*Check funds invested means dollars investedThe joined NIH table with the VC table is found in db '''cities'''. The table is named '''merged_nih'''.*Categorize near All the values of nih_valuegrant and far! Is it within MSA vsnih_nogrants with missing values for years 1986-2015 are set equal to 0. not, within adjacent MSAs, etc.?The sql script is in Z:\HUbs\2017\sql scripts
==Clinical Trials Data==
Data is in:
Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017
There may be Database is '''cities'''SQL script is: ctrials.sqlThe source file is:  *medclinical.txt located in Z:\Hubs\2017 *Date from 1999-2017 ===Joined clinical trials table=== The file which contains the number of trials in each city and year is located in: Z:\Hubs\2017 The file is in: Z:\Hubs\2017\clean dataThe name of the file is: ctrialsSummary.txt It contains:*city*year*city_state_year*noctrials - number of trials The ctrials is joined with VC table. The joined SQL script is: '''new_ctrials.sql''' and it is located in Z:\Hubs\2017\sql scripts The name of the joined table is '''new_merged_ctrials'''. It contains:*city*state*city_state_id*city_state_year*year*noctrials*seedamtm*earlyamtm*lateramtm*selamtm*numseeds*numearly*numlater*numsel All the values of noctrials with missing values for years 1999-2017 are set equal to 0. ==Population Data==Data is in: Z:\Hubs E:\McNair\Projects\Hubs\Summer 2017 Database is '''cities''' SQL script is: '''population.sql'''The source files are: *pop2000_2009.xlsx*pop2010_2016.xlsx They contain:*State*City name *Year *Population Estimates Date from 2000-2016 ===Joined population table=== Data is in: Z:\Hubs\2017\clean dataThe file names are 1_population.txt - contains data on population estimates from 2000-2009 2_population.txt - contains data on population estimates from 2010-2016  Database is '''cities'''SQL script is: '''new_population.sql''', located in Z:\Hubs\2017\sql scripts The population table is joined on VC table. The table is called '''new_merged_population'''. They contain:*City*State*city_state_id to uniquely identify each city*city_state_year to uniquely identify each city in each year*Population estimates*Year*Code from the state code and Fips code*State full name ==Income Data== Raw data was obtained from Census data, American Communities Survey. Raw Data is in: E:\McNair\Projects\Hubs\Summer 2017\MSA Income_raw.zip   Date from 2005-2015 The master list with MSAs and principal cities is titled '''list2.xls'''. It is located at: Z:\Hubs\2017 This master list includes:*MSA code*MSA name*Principal City*State*Place code (city code)*State Code This master list was edited to associate each principal city with a second dataset unique state. E.g. if New York is the principal city located in New York-New Jersey MSA, it was associated with state NY-NJ. So '''list''' was edited to put New York with NY.   Cleaned Income data files are in Z:\Hubs\2017\merging_on_ID  They contain:*MSA code*MSA*Year *Total Household Income  The MSA-City-State look up file is titled '''msa_city_state_wcode.txt'''. It is located in Z:\Hubs\2017\merging_on_ID  The SQL file that has Hubmerges income data from ACS (by MSA -IndustryYear) with the MSA-City file is titled '''income.sql'''. It is located here: Z:\Hubs\2017\sql scripts The final income table is in db '''cities''' titled '''merged_income'''. It includes:*MSA*City*State*Year*Total Household Income  The table includes 8780 observations ===Joined income table=== Data is in: Z:\Hubs\clean dataThe file names are: INC_05.txt -INC_15.txt  Database is '''cities'''SQL script is: merged_income.sql  They contain:*City*State*city_state_id to uniquely identify each city*Income*Year *Code from the state code and Fips code ==Employment Data== Data on employment was obtained from American Communities Survey, US Census Bureau. Raw Data is in: E:\McNair\Projects\Hubs\Summer 2017\Employment Data by MSACleaned files are in Z:\Hubs\2017\clean data They contain:*MSA code*MSA*Year *Employment rate of individuals 16 years or older*Unemployment rate of individuals 16 years or older Date from 2005-2015 The SQL file that merges employment data from ACS (where industry by MSA - Year) with the MSA-City file is semiconductor/nontitled '''Employment.sql'''. The file is located in: Z:\Hubs\2017 The final table is in db '''cities''' titled '''merged_employment'''. It includes:*MSA*City*Year*Employment rate*Unemployment rate ===Joined employment table=== Data is in: Z:\Hubs\clean data The file names are: EMP_05.txt -semiconductor?EMP_15.txt  Database is '''cities'''SQL script is: '''new_employment.sql''' and it is located in Z:\Hubs\2017\sql scripts The final table which is joined on VC is in db cities titled '''new_merged_employment'''. They contain:*City*State*Code from the state code and Fips code*city_state_id to uniquely identify each city*city_state_year to uniquely identify each city in each year*Employment rates of individuals of 16 years or older*Unemployment rates of individuals of 16 years or older*Year ==Schooling Data== Data on schooling was obtained from American Communities Survey, US Census Bureau. Raw Data is in: E:\McNair\Projects\Hubs\Summer 2017\School Enrollment Data by MSACleaned files are in Z:\Hubs\2017\clean data They contain:*MSA code*MSA*Year *Total number of population 3 years and over enrolled in school*Percent of population 3 years and over enrolled in public school*Percent of population 3 years and over enrolled in private school  Date from 2005-2015 The SQL file that merges schooling data from ACS (by MSA - Year) with the MSA-City file is titled '''schooling.sql'''. The file is located in: Z:\Hubs\2017 The final table is in db '''cities''' titled '''merged_schooling'''. It includes:*MSA*City*Year*Total*Percent_public_schooling*Percent_private_schooling ===Joined schooling table=== Data is in: Z:\Hubs\clean dataThe file names are: SCH_05.txt - SCH_15.txt  Database is '''cities'''SQL script which joins this table with VC table is: '''new_merged_schooling.sql'''The final table is in db '''cities''' titled '''new_merged_schooling'''. It contains:*City*State*city_state_id to uniquely identify each city*city_state_year to uniquely identify each city in each year*Total number of school enrollment*Percentage enrolled in public schools*Percentage enrolled in private schools*Year*Code from the state code and Fips code ==VC Data==   Raw Data is in: Z:\VentureCapitalData\SDCVCData\vcdb2 The file name is roundleveloutput2.txt It contains:*city*state*year*seedamtm - seed, amount in millions*earlyamtm - early, amount in millions*lateramtm - late, amount in millions*selamtm - seed early late, amount in millions*numseeds - number of seeds*numearly *numlater*numsel*numdeals*numalive  Date from 1948-2017  The table is in db '''cities''' titled '''new_vc'''. It includes:*city*state*city_state_id*city_state_year*seedamtm*earlyamtm*lateramtm*selamtm*numseeds*numearly*numlater*numsel*numdeals*numalive*year ==Final Joined Data set ==  The final data set is in file '''final.txt''' and is located here: Z:\Hubs\2017 It includes:*city*state*city_state_year - (ID that data is merged on)*year*seedamtm - Seed Amount*earlyamtm - Early Investment Amount*lateramtm - Late Investment Amount*selamtm - Seed early or late amount*numseeds - Number of seed investments *numearly - Number of early investments*numlater - Number of late investments*numsel *numdeals - Number of deals (first contracts)*numalive - Number of start ups alive*income - Income per capita in each city-year*sbir_nogrants - Number of SBIR grants*sbir_valuegrant - Value of SBIR grants*emp - Employment stats of each city-year*unemp - Rate of unemployment*popestimate - Population estimate of each city-year*private - Enrollment in private schools*public - Enrollment in public schools*total - *numfirms - Number of publicly traded firms*randd - R&D expenditure of publicly traded firms*revenue - Revenue of PTF*totalassets *nsf_nogrants - Number of NSF grants*valuegrant - Value of NSF grants*nih_nogrants - Number of NIH grants*nih_valuegrant - Value of NIH grants*noctrials - NUmber of clinical trials == Defining Hubs == '''Summer 2016''' - Last year a master list of 125 "potential" hubs was used. A scorecard was developed which filtered these 125 candidate hubs to determine which of these should be included in the study sample. This method resulted in a sample size of ~ 30. The master list and the final hubs list is titled '''Hubs Data v2_'16'''. It is located here: Z:\Hubs\2017\hubs_data '''Summer 2017''' - In order to obtain a more statistically significant sample of hubs, we developed 5 criteria which produce a more relaxed definition of hubs than last year. These include *Availability of co-working space*Coding classes or tech events*Some focus on the tech sector (this is important as our dependent variable is VC funding)*Presence of an accelerator*Availability of mentorship for members. We will review the 125 candidate hubs and select those which satisfy a subset or all of these characteristics.   [[category:Internal]]

Navigation menu