Changes

5,673 bytes added , 13:41, 21 September 2020

no edit summary

~~The~~ {{Project|Has project output=Data|Has sponsor=McNair Center|Has title=Hubs ~~Research Project is a full-length academic paper analyzing the effectiveness of "hubs"~~|Has owner=Hira Farooqi, ~~a component of the entrepreneurship ecosystem~~|Has keywords=Data|Has project status=Active|Does subsume=Hubs Analysis 2017, ~~in the advancement and growth of entrepreneurial success in a metropolitan area.~~ }}

~~This research will primarily focused on large~~ '''Important Notice: The last update to the hubs data was done manually by Ed and ~~mid~~is in E:\projects\MeasuringHGHTEcosystems\HubsData-~~sized Metropolitan Statistical Areas (MSAs), as that is where the greater majority of Venture Capital funding is located~~RevisedSimplified. xlsx'''

~~===Primary Data Set===~~

~~The Hubs data set, from SDC Platinum, is currently in the process of being constructed.~~

The ~~data set includes all United States Venture Capital transactions (moneytree) from~~ Hubs Research Project is a full-length academic paper analyzing the ~~twenty-five year period~~ effectiveness of "hubs", a component of ~~1990 through 2015.Data has been accumulated at~~ the ~~portfolio company, fund~~entrepreneurship ecosystem, in the advancement and ~~round level~~growth of entrepreneurial success in a metropolitan area. It ~~will be analyzed at~~ focuses on cities in the United States as the ~~MSA level~~primary unit of analysis. ~~We will be looking at in terms of number of companies funded in number of funds active, and flow of investment in a given MSA~~

This page contains information about data used for this research project, including data sources, location of data on RDP and details on data processing.

~~The data set has now been uploaded to the database server, named Hubs.~~

~~There are 4 tables:~~

*Rounds: Rounddate, coname, state, roundno, stage1, etc.

*CombinedRounds: Coname, rounddate, discamount, fundname

*Companies: LastInv, FirstInv, coname, MSA, MSACode, Address, state, datefounded, totalknownfunding, industry(major)

*Funds: fundname, closingdate, lastinv, firstinv, msa, msacode, avinv, nocoinv, totalknowninv, address

~~Used variables:~~

~~Companies: Coname, MSACode, Industry, state~~ ~~MSALookupTable: MSACode, MSASuper~~ ~~IndustryLookupTable: IndustryMajor, InduCode~~ -> ~~CompanyInfo: Coname, MSASuper, InduCode, state (complete)~~Information on initial data work done prior to Summer 2017 can be found at [[Hubs Summer 2016]].

~~Funds~~'''Note on joining: ~~fundname~~''' The city-state-year ID from VC data is used as the master ID for joining datasets. Each table (e.g. income, ~~msacode~~nih, ~~state~~ ~~MSALookupTable: MSACode~~nsf, sbir, ~~MSASuper~~ compustat) is first joined with the VC data on city-> ~~FundInfo: fundname, msacode,~~ state ~~(complete)~~-year ID and then the resulting tables are all joined together in the final table.

~~Rounds: coname, rounddate, stagecode, roundno~~

~~CombinedRounds: coname, rounddate, discamount, fundname~~

->

~~RoundInfoSuper: coname, rounddate, '''nofunds''', discamount~~

->

~~RoundInfo: Coname, roundyear, fundname, estamount (complete)~~

~~Then take~~===Data by zip code===*Population data, 2000-2016 - US Census Bureau (E:\McNair\Hubs\summer 2017) ~~RoundInfo~~https: ~~Coname~~//www2.census.gov/programs-surveys/popest/datasets/*Income data, ~~roundyear, fundname, estamount~~1998-2014 - The Internal Revenue Service (E:\McNair\Hubs\summer 2017) ~~CompanyInfo~~https: ~~Coname~~//www.irs.gov/uac/about-irs*DCI index, ~~MSASuper, InduCode, state~~to assess the economic well-being of communities ~~FundInfo~~http: ~~fundname, msacode, state~~ //eig.org/dci/interactive-maps/u-s-zip->codes ~~SuperRoundInfo: Coname~~*R&D Expenses, ~~CoMSASuper, CoInduCode, CoState, FundName, FundMSASuper, FundState, RoundYear, RoundEstAmount~~ 1980-2016 -> ~~MSAPortCos~~Wharton Research Data Services (E: ~~Count(CoName~~\McNair\Hubs\summer 2017) ~~As NoPortCosFunded, CoMSASuper, RoundYear~~ *Zipcode look-up table obtained from https://www.unitedstateszipcodes.org/zip-code-database/. It's available in (E:\McNair\Hubs\summer 2017).

~~'''Notes on Creation of Primary~~ == Data ~~Set'''~~by MSA ==

~~Raw tables~~* companies (last investment, first investment, company name, MSA, MSA code, address, state, date founded, known funding, industry) * funds (fund closing date, last investment, first investment, fund name, address, MSA, MSA code, Average investment, number companies invested (NoCos), known investment) * rounds (round date, company name, state, round number, stage 1, stage 2, stage 3) * combined rounds (company name, round date, disclosed amount, investor) * msalist (changes MSAs to CMSAs— combined We have principle cities of MSAs)from the census:*industry list (changes 6 industry categories to 4— ICT, Life Sciences, Semiconductors, Other) https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html

~~Process~~* cleaned tables We might be able to ~~eliminate duplications, undisclosed variables~~* changed all original characters to include CMSA and Industry Codes (companyinfo3, fundinfocleanfinal, roundinfoclean) * matched funds to avoid any issues with names (i.e. Fund ABC L.P./Fund ABC LP/Fund ABC) *matched roundinfoclean investors to fundinfocleanfinal investors (roundinfo.txt go City ->FIPS place code -> ~~cleanfundfinal.txt)~~*join by round and company conames*bridge years (1990-2016), stage, and cmsa* populate data with count of companies (Deal flow) and estimated amount ($)** data set in 181 hubs folder under summarycmsa.txt (38394)MSA?

~~===Hub Candidates Data Set===~~Cities and their FIPS codes (which don't perfectly correspond) are available from https://www.census.gov/geo/reference/codes/place.html

The ~~Hubs candidate data set is a list of potential hubs found in MSAs throughout the country~~Census claims to provide city to MSA here: https://www. ~~Researchers are currently pulling qualitative and quantitative information from the candidate's websites, in an attempt to categorize what can be identified as a hub~~census. ~~This is a difficult~~ gov/geo/maps-data/data ~~set to pull~~/ua_rel_download.htmlHowever, as there is ~~little to no quantitative information available for this category of institution, and is dependent on accessibility of information to the public on the internet.~~only CBSA!

~~Characteristics~~This might do it: https:/~~Variables~~*Year Founded*Square footage*LinkedIN self/www2.census.gov/geo/pdfs/maps-~~identifiers (what the organization classifies itself on its LinkedIN profile)~~ *Activeness on Twitter (binomial)*Member Directory available online (binomial)*Number of conference rooms*Price ($data/data/rel/~~month) for Flex desk~~ *Offers Reserved desk (binomial)*Offers office space for rent (binomial) *Offers community membership-- not for coworking but for community events, etcexplanation_ua_cbsa_rel_10. ~~(binomial)~~*Number of events offered per month (estimate)*Offers code academypdf*Mission Statement/Vision (for qualitative or key-word analysis) We can maybe track city to principal city to MSA

~~These characteristics~~==COMPUSTAT Data==The data set includes information on publicly traded firms in the US. It was obtained from the Wharton Research Data Services (https:/~~variables will be used to determine whether a candidate is or is not likely to be a Hub~~/wrds-web.wharton.upenn.edu/wrds/index.cfm?).

~~As of March 10th 2016, the list contains 125 Hub candidates.~~

~~===Supplementary~~ Raw Data ~~Sets===~~is in:~~'''Patent data'''~~ E: ~~to be pulled from USPTO or SDC Platinum.~~ \McNair\Projects\Hubs\Summer 2017*unable to find on the internet, must be pulled from the larger dataset Z:\Hubs\2017

~~'''Number of STEM Graduate Students'''~~ The source file is RandDExpenditures.txt. It contains:*Date from 1980-2017 (~~NSF~~July) ~~and~~ . *427799 records*Fields include:**R&D Expenditure**Address (inc. city, zip, state)**Revenue of firms Database is '''~~University R&D Spending~~cities''' ~~(NSF): Grad Students found for the year 2015, no data going back historically; R&D found for the past 10 years~~

*categorized university by MSA, can be used for all university-based projectsSQL script is: COMPUSTAT.sql

~~'''Per Capita Income''' and '''Employment Data''' (US Census Bureau)~~Output file is COMPUSTATSummary.txt. It contains: ~~complete for most recent census~~*Variables: City, year, No.public firms, sum R&D, sum Sales, ~~unable to find data going back historically~~sum total assets*1979-2016*4440 cities

~~'''Firm Births''' (BDS)~~It is located in Z: ~~data set found for 1990 to present, currently being cleaned up for use~~\Hubs\2017\Output_Files

==~~=Resources=~~NSF Data==* Yael Hochberg and Fehder (2015), located Data is in ~~dropbox~~:** Use this paper as a guideline on how to conduct the analysis E:\McNair\Projects\Hubs\Summer 2017*US Census Bureau data on employment by MSA Z: ~~http://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_14_5YR_B23027&prodType=table~~\Hubs\2017*USPTO tility patents by MSA: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/cls_cbsa/allcbsa_gd.htm *MSA level trends: http://www.metrotrends.org/data.cfDatabase is '''cities'''

~~===To Do===We need to find and clean up data sets at the MSA level~~SQL script is: nsf_2017.sql

*Patent data (USPTO)*Number of STEM Graduate Students (NSF)**in progress*University R&D Spending (NSF)*Per Capita Income (US Census) **complete (Employment and Income_MSAThe source files are: nsf2017.~~xls)~~*Employment (US Census)**complete (Employment txt, copied from table '''nsf''', and ~~Income_MSA~~nsf_institution copied from table '''nsf_grants_institution''' from the biotech db.~~xls)~~*Firm births (BDS)*SELECT MSAs!!! Possible method: choosing CMSAs with Distinct companies funded ** >100 = 38** >75 = 45** >50 = 52** >25 = 80** Total 238**greater than 100 will give us 52 CMSAs to work with

~~===Data Cleaning===~~They contain:*Award ID*Award Institution*Award Effective date*Institution city*Award Value*Organization state codeFrom 1900 - 2017

~~Cleaning tasks~~Output file is nsfSummary.txt. It contains:*~~Remove PortCos named Undisclosed~~Variables: City, State code year, ~~etc.~~*Remove Funds named Unknownnsf_nogrants, ~~etc.~~nsf_valuegrant *~~Basic Data cleaning:~~**Enormous outliers on funds invested**Check dates1900-2017

~~Lookup tables:~~===Joined NSF table===The joined nsf table with the VC table is found in db '''cities'''. The table is named '''merged_nsf'''.*SuperMSAsAll the values of nogrants and valuegrant with missing values for years 1990-2017 are set equal to 0.*IndustryThe sql script is in*Stages Z:\HUbs\2017\sql scripts

==NIH Data==

Data is in:

Z:\Hubs

E:\McNair\Projects\Hubs\Summer 2017

~~===~~Database is '''cities'''SQL script is: nih2017.sqlThe ~~Target Dataset===~~source files are: *nih_1986_2001.csv*nih_2002_2012.txt*nih_2013_2015located in E:\McNair\Projects\Federal Grant Data\NIH

~~We will need to process the following variables:~~

*SuperMSA - combine SanFran and SanJose, New York and Newark?, NC Research triangle, others?

~~Example dataset~~The script that cleans NIH data and generates the summary table is titled '''nihSummary'''. It is located here: ~~MSA Year SeedVCInv SeedEarlyVCInv LaterVCInv NoDeals FundsInvested DistinctInvestors ....~~ ~~----------------------------------------------------------------------------------------------------------------------------~~ ~~1234 2001 1000000 20000000 30000000 4 7 7~~

Z:\Hubs\2017\sql scripts

~~Note that the unit~~ This table includes*year*city*state*country*nogrants (number of ~~observation is MSA-Year.~~grants)*valuegrant*city_state

~~Variables to be computed at the MSA level:~~*~~HubActive (binary)~~*NoHubsActive (Count)*HubSqFt*Other Hub Vars (build list!!!)*'''SeedVCInv'''*'''SeedEarlyVCInv'''*NoDeals (done by local VCs?)**NoDealsNear**NoDealsFar*'''NoPortCosFunded '''*FundsInv (in an MSA)**FundsInvFromNear (within MSA?)**FundsInvFromFar (outside MSA?)*DistinctInvestors**DistinctInvestorsNear (within MSA?)**DistinctInvestorsFar (outside MSA?)*PatentCount*NoSTEMGrads*FirmBirths (BDS data)*UniRandDSpend*PerCapitaIncome*EmploymentDate from 1986-2015

~~We need to:~~===Joined NIH table===*Check funds invested means dollars investedThe joined NIH table with the VC table is found in db '''cities'''. The table is named '''merged_nih'''.*Categorize near All the values of nih_valuegrant and ~~far! Is it within MSA vs~~nih_nogrants with missing values for years 1986-2015 are set equal to 0. ~~not, within adjacent MSAs, etc.?~~The sql script is in Z:\HUbs\2017\sql scripts

==Clinical Trials Data==

Data is in:

Z:\Hubs

E:\McNair\Projects\Hubs\Summer 2017

~~There may be~~ Database is '''cities'''SQL script is: ctrials.sqlThe source file is: *medclinical.txt located in Z:\Hubs\2017 *Date from 1999-2017 ===Joined clinical trials table=== The file which contains the number of trials in each city and year is located in: Z:\Hubs\2017 The file is in: Z:\Hubs\2017\clean dataThe name of the file is: ctrialsSummary.txt It contains:*city*year*city_state_year*noctrials - number of trials The ctrials is joined with VC table. The joined SQL script is: '''new_ctrials.sql''' and it is located in Z:\Hubs\2017\sql scripts The name of the joined table is '''new_merged_ctrials'''. It contains:*city*state*city_state_id*city_state_year*year*noctrials*seedamtm*earlyamtm*lateramtm*selamtm*numseeds*numearly*numlater*numsel All the values of noctrials with missing values for years 1999-2017 are set equal to 0. ==Population Data==Data is in: Z:\Hubs E:\McNair\Projects\Hubs\Summer 2017 Database is '''cities''' SQL script is: '''population.sql'''The source files are: *pop2000_2009.xlsx*pop2010_2016.xlsx They contain:*State*City name *Year *Population Estimates Date from 2000-2016 ===Joined population table=== Data is in: Z:\Hubs\2017\clean dataThe file names are 1_population.txt - contains data on population estimates from 2000-2009 2_population.txt - contains data on population estimates from 2010-2016 Database is '''cities'''SQL script is: '''new_population.sql''', located in Z:\Hubs\2017\sql scripts The population table is joined on VC table. The table is called '''new_merged_population'''. They contain:*City*State*city_state_id to uniquely identify each city*city_state_year to uniquely identify each city in each year*Population estimates*Year*Code from the state code and Fips code*State full name ==Income Data== Raw data was obtained from Census data, American Communities Survey. Raw Data is in: E:\McNair\Projects\Hubs\Summer 2017\MSA Income_raw.zip Date from 2005-2015 The master list with MSAs and principal cities is titled '''list2.xls'''. It is located at: Z:\Hubs\2017 This master list includes:*MSA code*MSA name*Principal City*State*Place code (city code)*State Code This master list was edited to associate each principal city with a ~~second dataset~~ unique state. E.g. if New York is the principal city located in New York-New Jersey MSA, it was associated with state NY-NJ. So '''list''' was edited to put New York with NY. Cleaned Income data files are in Z:\Hubs\2017\merging_on_ID They contain:*MSA code*MSA*Year *Total Household Income The MSA-City-State look up file is titled '''msa_city_state_wcode.txt'''. It is located in Z:\Hubs\2017\merging_on_ID The SQL file that ~~has Hub~~merges income data from ACS (by MSA -~~Industry~~Year) with the MSA-City file is titled '''income.sql'''. It is located here: Z:\Hubs\2017\sql scripts The final income table is in db '''cities''' titled '''merged_income'''. It includes:*MSA*City*State*Year*Total Household Income The table includes 8780 observations ===Joined income table=== Data is in: Z:\Hubs\clean dataThe file names are: INC_05.txt -INC_15.txt Database is '''cities'''SQL script is: merged_income.sql They contain:*City*State*city_state_id to uniquely identify each city*Income*Year *Code from the state code and Fips code ==Employment Data== Data on employment was obtained from American Communities Survey, US Census Bureau. Raw Data is in: E:\McNair\Projects\Hubs\Summer 2017\Employment Data by MSACleaned files are in Z:\Hubs\2017\clean data They contain:*MSA code*MSA*Year *Employment rate of individuals 16 years or older*Unemployment rate of individuals 16 years or older Date from 2005-2015 The SQL file that merges employment data from ACS (~~where industry~~ by MSA - Year) with the MSA-City file is ~~semiconductor/non~~titled '''Employment.sql'''. The file is located in: Z:\Hubs\2017 The final table is in db '''cities''' titled '''merged_employment'''. It includes:*MSA*City*Year*Employment rate*Unemployment rate ===Joined employment table=== Data is in: Z:\Hubs\clean data The file names are: EMP_05.txt -~~semiconductor?~~EMP_15.txt Database is '''cities'''SQL script is: '''new_employment.sql''' and it is located in Z:\Hubs\2017\sql scripts The final table which is joined on VC is in db cities titled '''new_merged_employment'''. They contain:*City*State*Code from the state code and Fips code*city_state_id to uniquely identify each city*city_state_year to uniquely identify each city in each year*Employment rates of individuals of 16 years or older*Unemployment rates of individuals of 16 years or older*Year ==Schooling Data== Data on schooling was obtained from American Communities Survey, US Census Bureau. Raw Data is in: E:\McNair\Projects\Hubs\Summer 2017\School Enrollment Data by MSACleaned files are in Z:\Hubs\2017\clean data They contain:*MSA code*MSA*Year *Total number of population 3 years and over enrolled in school*Percent of population 3 years and over enrolled in public school*Percent of population 3 years and over enrolled in private school Date from 2005-2015 The SQL file that merges schooling data from ACS (by MSA - Year) with the MSA-City file is titled '''schooling.sql'''. The file is located in: Z:\Hubs\2017 The final table is in db '''cities''' titled '''merged_schooling'''. It includes:*MSA*City*Year*Total*Percent_public_schooling*Percent_private_schooling ===Joined schooling table=== Data is in: Z:\Hubs\clean dataThe file names are: SCH_05.txt - SCH_15.txt Database is '''cities'''SQL script which joins this table with VC table is: '''new_merged_schooling.sql'''The final table is in db '''cities''' titled '''new_merged_schooling'''. It contains:*City*State*city_state_id to uniquely identify each city*city_state_year to uniquely identify each city in each year*Total number of school enrollment*Percentage enrolled in public schools*Percentage enrolled in private schools*Year*Code from the state code and Fips code ==VC Data== Raw Data is in: Z:\VentureCapitalData\SDCVCData\vcdb2 The file name is roundleveloutput2.txt It contains:*city*state*year*seedamtm - seed, amount in millions*earlyamtm - early, amount in millions*lateramtm - late, amount in millions*selamtm - seed early late, amount in millions*numseeds - number of seeds*numearly *numlater*numsel*numdeals*numalive Date from 1948-2017 The table is in db '''cities''' titled '''new_vc'''. It includes:*city*state*city_state_id*city_state_year*seedamtm*earlyamtm*lateramtm*selamtm*numseeds*numearly*numlater*numsel*numdeals*numalive*year ==Final Joined Data set == The final data set is in file '''final.txt''' and is located here: Z:\Hubs\2017 It includes:*city*state*city_state_year - (ID that data is merged on)*year*seedamtm - Seed Amount*earlyamtm - Early Investment Amount*lateramtm - Late Investment Amount*selamtm - Seed early or late amount*numseeds - Number of seed investments *numearly - Number of early investments*numlater - Number of late investments*numsel *numdeals - Number of deals (first contracts)*numalive - Number of start ups alive*income - Income per capita in each city-year*sbir_nogrants - Number of SBIR grants*sbir_valuegrant - Value of SBIR grants*emp - Employment stats of each city-year*unemp - Rate of unemployment*popestimate - Population estimate of each city-year*private - Enrollment in private schools*public - Enrollment in public schools*total - *numfirms - Number of publicly traded firms*randd - R&D expenditure of publicly traded firms*revenue - Revenue of PTF*totalassets *nsf_nogrants - Number of NSF grants*valuegrant - Value of NSF grants*nih_nogrants - Number of NIH grants*nih_valuegrant - Value of NIH grants*noctrials - NUmber of clinical trials == Defining Hubs == '''Summer 2016''' - Last year a master list of 125 "potential" hubs was used. A scorecard was developed which filtered these 125 candidate hubs to determine which of these should be included in the study sample. This method resulted in a sample size of ~ 30. The master list and the final hubs list is titled '''Hubs Data v2_'16'''. It is located here: Z:\Hubs\2017\hubs_data '''Summer 2017''' - In order to obtain a more statistically significant sample of hubs, we developed 5 criteria which produce a more relaxed definition of hubs than last year. These include *Availability of co-working space*Coding classes or tech events*Some focus on the tech sector (this is important as our dependent variable is VC funding)*Presence of an accelerator*Availability of mentorship for members. We will review the 125 candidate hubs and select those which satisfy a subset or all of these characteristics. [[category:Internal]]

Ed

Bureaucrats, Interface administrators, Administrators (Semantic MediaWiki), Administrators

7,612

edits

Changes

Hubs (view source)

Revision as of 13:41, 21 September 2020

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Sites

Sections

Organizations

Help

Tools