
3,530 bytes added , 13:41, 21 September 2020


{{Project|Has project output=Data|Has sponsor=McNair Center|Has title=Hubs|Has owner=Hira Farooqi,|Has keywords=Data|Has project status=Active|Does subsume=Hubs Analysis 2017,}}

'''Important Notice: The last update to the hubs data was done manually by Ed and is in E:\projects\MeasuringHGHTEcosystems\HubsData-RevisedSimplified.xlsx'''


The Hubs Research Project is a full-length academic paper analyzing the effectiveness of "hubs", a component of the entrepreneurship ecosystem, in the advancement and growth of entrepreneurial success in a metropolitan area. It focuses on cities in the United States as the primary unit of analysis.

This page contains information about data used for this research project, including data sources, location of data on RDP and details on data processing.


Information on initial data work done prior to Summer 2017 can be found at [[Hubs Summer 2016]].

'''Note on joining:''' The city-state-year ID from the VC data is used as the master ID for joining datasets. Each table (e.g. income, nih, nsf, sbir, compustat) is first joined with the VC data on the city-state-year ID, and then the resulting tables are all joined together in the final table.
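The joining approach in the note can be sketched as follows. This is a minimal illustration in plain Python, not the project's SQL scripts: the helpers `make_id` and `left_join`, the ID format, and the sample rows are all hypothetical, and only the column names (`city_state_year`, `numdeals`, `nih_nogrants`, `nsf_nogrants`) come from this page.

```python
def make_id(city, state, year):
    """Build a city-state-year key (hypothetical format: CITY_ST_YEAR)."""
    return f"{city.strip().upper()}_{state.strip().upper()}_{year}"


def left_join(master, other, key="city_state_year"):
    """Left-join `other` onto `master`, keeping every master row."""
    lookup = {row[key]: row for row in other}
    joined = []
    for row in master:
        merged = dict(row)
        # Copy over every non-key column from the matching row, if any.
        for col, val in lookup.get(row[key], {}).items():
            if col != key:
                merged[col] = val
        joined.append(merged)
    return joined


# Illustrative rows only, not real project data.
vc = [
    {"city_state_year": make_id("Houston", "TX", 2010), "numdeals": 12},
    {"city_state_year": make_id("Austin", "TX", 2010), "numdeals": 30},
]
nih = [{"city_state_year": make_id("houston", "tx", 2010), "nih_nogrants": 5}]
nsf = [{"city_state_year": make_id("Austin", "TX", 2010), "nsf_nogrants": 7}]

# Step 1: join each table with the VC data on the master ID.
merged_nih = left_join(vc, nih)
merged_nsf = left_join(vc, nsf)

# Step 2: join the resulting tables together into the final table.
final = left_join(merged_nih, merged_nsf)
```

Because every intermediate table is built from the VC rows, the final table has exactly one row per VC city-state-year, and cities with no match in a source table simply lack that table's columns until they are filled in.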


===Data by zip code===
*Population data, 2000-2016 - US Census Bureau (E:\McNair\Hubs\summer 2017) https://www2.census.gov/programs-surveys/popest/datasets/
*Income data, 1998-2014 - The Internal Revenue Service (E:\McNair\Hubs\summer 2017) https://www.irs.gov/uac/about-irs
*DCI index, to assess the economic well-being of communities http://eig.org/dci/interactive-maps/u-s-zip-codes
*R&D Expenses, 1980-2016 - Wharton Research Data Services (E:\McNair\Hubs\summer 2017)
*Zipcode look-up table obtained from https://www.unitedstateszipcodes.org/zip-code-database/. It's available in (E:\McNair\Hubs\summer 2017).

== Data by MSA ==

We have principal cities of MSAs from the census: https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html

We might be able to go City -> FIPS place code -> MSA?

Cities and their FIPS codes (which don't perfectly correspond) are available from https://www.census.gov/geo/reference/codes/place.html

The Census claims to provide city to MSA here: https://www.census.gov/geo/maps-data/data/ua_rel_download.html

However, there is only CBSA!

This might do it: https://www2.census.gov/geo/pdfs/maps-data/data/rel/explanation_ua_cbsa_rel_10.pdf

We can maybe track city to principal city to MSA.
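The city to FIPS place code to MSA idea discussed in this section could look roughly like the sketch below. It is only an assumption about how such a chained lookup might work: the function `city_to_msa`, the two lookup dictionaries, and every code in them are placeholders, not values from the Census files.

```python
def city_to_msa(city, state, place_codes, place_to_msa):
    """Resolve (city, state) -> FIPS place code -> MSA; None if a step fails."""
    place = place_codes.get((city.strip().lower(), state.strip().upper()))
    if place is None:
        return None  # city name did not match any place code
    return place_to_msa.get(place)  # None if the place is not in any MSA


# Placeholder lookups; real ones would be loaded from the Census files.
place_codes = {
    ("springfield", "IL"): "IL-PLACE-001",
    ("springfield", "MA"): "MA-PLACE-001",
}
place_to_msa = {"IL-PLACE-001": "MSA-A"}

print(city_to_msa("Springfield", "IL", place_codes, place_to_msa))
print(city_to_msa("Springfield", "MA", place_codes, place_to_msa))
print(city_to_msa("Nowhere", "TX", place_codes, place_to_msa))
```

Keying the first lookup on both city and state matters because city names repeat across states; returning `None` at either step makes it easy to count how many cities fail to match, which is the concern raised above about names and codes not corresponding perfectly.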

==COMPUSTAT Data==
The data set includes information on publicly traded firms in the US. It was obtained from the Wharton Research Data Services (https://wrds-web.wharton.upenn.edu/wrds/index.cfm?).

Raw Data is in E:\McNair\Projects\Hubs\Summer 2017 and Z:\Hubs\2017.

The source file is RandDExpenditures.txt. It contains:
*Data from 1980-2017 (July)
*427799 records
*Fields include:
**R&D Expenditure
**Address (inc. city, zip, state)
**Revenue of firms

Database is '''cities'''

SQL script is: COMPUSTAT.sql

Output file is COMPUSTATSummary.txt. It contains:
*Variables: City, year, No. public firms, sum R&D, sum Sales, sum total assets
*1979-2016
*4440 cities

It is located in Z:\Hubs\2017\Output_Files
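The step from firm-level records to a city-year summary can be illustrated with a short Python sketch. This is not COMPUSTAT.sql; it is a hypothetical equivalent that assumes one input record per firm per year, and the field names `city`, `year`, `randd` and `revenue` as well as the sample rows are illustrative.

```python
from collections import defaultdict


def summarize_by_city_year(records):
    """Count firms and sum R&D and revenue for each (city, year)."""
    totals = defaultdict(lambda: {"numfirms": 0, "randd": 0.0, "revenue": 0.0})
    for rec in records:
        key = (rec["city"].strip().upper(), rec["year"])
        totals[key]["numfirms"] += 1  # assumes one record per firm-year
        # Missing values are treated as 0 so they do not break the sums.
        totals[key]["randd"] += rec.get("randd") or 0.0
        totals[key]["revenue"] += rec.get("revenue") or 0.0
    return dict(totals)


# Illustrative rows only, not real COMPUSTAT data.
records = [
    {"city": "Houston", "year": 2000, "randd": 1.5, "revenue": 10.0},
    {"city": "houston ", "year": 2000, "randd": None, "revenue": 5.0},
    {"city": "Austin", "year": 2000, "randd": 2.0, "revenue": 3.0},
]
summary = summarize_by_city_year(records)
```

Normalizing the city name before grouping is a design choice made here so that spelling variants such as trailing spaces or different capitalization fall into the same city-year cell.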

==NSF Data==
Data is in E:\McNair\Projects\Hubs\Summer 2017 and Z:\Hubs\2017.

Database is '''cities'''

SQL script is: nsf_2017.sql

The source files are: nsf2017.txt, copied from table '''nsf''', and nsf_institution, copied from table '''nsf_grants_institution''' from the biotech db.

They contain:
*Award ID
*Award Institution
*Award Effective date
*Institution city
*Award Value
*Organization state code

From 1900 - 2017

Output file is nsfSummary.txt. It contains:
*Variables: City, State code, year, nsf_nogrants, nsf_valuegrant
*1900-2017

===Joined NSF table===
The joined nsf table with the VC table is found in db '''cities'''. The table is named '''merged_nsf'''.

All the values of nogrants and valuegrant with missing values for years 1990-2017 are set equal to 0.

The sql script is in Z:\Hubs\2017\sql scripts
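The zero-filling rule for the joined table can be sketched in Python as below. This is an illustration under assumptions, not the project's SQL: the function `fill_missing_grants` and the sample rows are hypothetical, and the column names follow the nsfSummary.txt variables listed above.

```python
def fill_missing_grants(rows, first_year=1990, last_year=2017,
                        columns=("nsf_nogrants", "nsf_valuegrant")):
    """Set missing grant columns to 0, but only inside the covered years."""
    filled = []
    for row in rows:
        new_row = dict(row)
        if first_year <= new_row["year"] <= last_year:
            for col in columns:
                if new_row.get(col) is None:
                    new_row[col] = 0
        filled.append(new_row)
    return filled


# Illustrative rows as they might look after a left join onto the VC table.
rows = [
    {"city_state_year": "A_TX_2005", "year": 2005, "nsf_nogrants": None},
    {"city_state_year": "A_TX_2006", "year": 2006, "nsf_nogrants": 3,
     "nsf_valuegrant": 250000},
    {"city_state_year": "A_TX_1985", "year": 1985, "nsf_nogrants": None},
]
result = fill_missing_grants(rows)
```

Restricting the fill to the covered years keeps "no grants recorded" (a true zero) distinct from "no data available for that year", which stays missing.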

==NIH Data==
Data is in Z:\Hubs and E:\McNair\Projects\Hubs\Summer 2017.

Database is '''cities'''

SQL script is: nih2017.sql

The source files are:

*nih_1986_2001.csv

*nih_2002_2012.txt

*nih_2013_2015

located in E:\McNair\Projects\Federal Grant Data\NIH


The script that cleans NIH data and generates the summary table is titled '''nihSummary'''. It is located here:

Z:\Hubs\2017\sql scripts

This table includes

*year

*city

*state

*country

*nogrants (number of grants)

*valuegrant

*city_state

*Data from 1986-2015

===Joined NIH table===
The joined NIH table with the VC table is found in db '''cities'''. The table is named '''merged_nih'''.

All the values of nih_valuegrant and nih_nogrants with missing values for years 1986-2015 are set equal to 0.

The sql script is in Z:\Hubs\2017\sql scripts

==Clinical Trials Data==
Data is in Z:\Hubs and E:\McNair\Projects\Hubs\Summer 2017.

Database is '''cities'''

SQL script is: ctrials.sql

The source file is:

*medclinical.txt located in Z:\Hubs\2017
*Data from 1999-2017

===Joined clinical trials table===

The file which contains the number of trials in each city and year is located in: Z:\Hubs\2017

The file is in: Z:\Hubs\2017\clean data

The name of the file is: ctrialsSummary.txt

It contains:
*city
*year
*city_state_year
*noctrials - number of trials

The ctrials table is joined with the VC table. The joined SQL script is: '''new_ctrials.sql''' and it is located in Z:\Hubs\2017\sql scripts

The name of the joined table is '''new_merged_ctrials'''. It contains:
*city
*state
*city_state_id
*city_state_year
*year
*noctrials
*seedamtm
*earlyamtm
*lateramtm
*selamtm
*numseeds
*numearly
*numlater
*numsel

All the values of noctrials with missing values for years 1999-2017 are set equal to 0.

==Population Data==
Data is in Z:\Hubs and E:\McNair\Projects\Hubs\Summer 2017.

Database is '''cities'''

SQL script is: '''population.sql'''

The source files are:
*pop2000_2009.xlsx
*pop2010_2016.xlsx

They contain:
*State
*City name
*Year
*Population Estimates

Data from 2000-2016

===Joined population table===

Data is in: Z:\Hubs\2017\clean data

The file names are:
*1_population.txt - contains data on population estimates from 2000-2009
*2_population.txt - contains data on population estimates from 2010-2016

Database is '''cities'''

SQL script is: '''new_population.sql''', located in Z:\Hubs\2017\sql scripts

The population table is joined on the VC table. The table is called '''new_merged_population'''. It contains:
*City
*State
*city_state_id to uniquely identify each city
*city_state_year to uniquely identify each city in each year
*Population estimates
*Year
*Code from the state code and FIPS code
*State full name

==Income Data==

Raw data was obtained from Census data, American Community Survey.

Raw Data is in: E:\McNair\Projects\Hubs\Summer 2017\MSA Income_raw.zip

Data from 2005-2015

The master list with MSAs and principal cities is titled '''list2.xls'''.
It is located at: Z:\Hubs\2017

This master list includes:
*MSA code
*MSA name
*Principal City
*State
*Place code (city code)
*State Code

This master list was edited to associate each principal city with a unique state. E.g. New York is the principal city of the New York-New Jersey MSA, so it was originally associated with state NY-NJ. '''list''' was therefore edited to put New York with NY.

Cleaned Income data files are in Z:\Hubs\2017\merging_on_ID

They contain:
*MSA code
*MSA
*Year
*Total Household Income

The MSA-City-State look up file is titled '''msa_city_state_wcode.txt'''. It is located in Z:\Hubs\2017\merging_on_ID

The SQL file that merges income data from ACS (by MSA - Year) with the MSA-City file is titled '''income.sql'''. It is located here: Z:\Hubs\2017\sql scripts

The final income table is in db '''cities''' titled '''merged_income'''. It includes:
*MSA
*City
*State
*Year
*Total Household Income

The table includes 8780 observations.

===Joined income table===

Data is in: Z:\Hubs\clean data

The file names are: INC_05.txt - INC_15.txt

Database is '''cities'''

SQL script is: merged_income.sql

They contain:
*City
*State
*city_state_id to uniquely identify each city
*Income
*Year
*Code from the state code and FIPS code

==Employment Data==

Data on employment was obtained from the American Community Survey, US Census Bureau.

Raw Data is in: E:\McNair\Projects\Hubs\Summer 2017\Employment Data by MSA

Cleaned files are in Z:\Hubs\2017\clean data

They contain:
*MSA code
*MSA
*Year
*Employment rate of individuals 16 years or older
*Unemployment rate of individuals 16 years or older

Data from 2005-2015

The SQL file that merges employment data from ACS (by MSA - Year) with the MSA-City file is titled '''Employment.sql'''. The file is located in: Z:\Hubs\2017

The final table is in db '''cities''' titled '''merged_employment'''.
It includes:
*MSA
*City
*Year
*Employment rate
*Unemployment rate

===Joined employment table===

Data is in: Z:\Hubs\clean data

The file names are: EMP_05.txt - EMP_15.txt

Database is '''cities'''

SQL script is: '''new_employment.sql''' and it is located in Z:\Hubs\2017\sql scripts

The final table which is joined on VC is in db '''cities''' titled '''new_merged_employment'''. It contains:
*City
*State
*Code from the state code and FIPS code
*city_state_id to uniquely identify each city
*city_state_year to uniquely identify each city in each year
*Employment rates of individuals 16 years or older
*Unemployment rates of individuals 16 years or older
*Year

==Schooling Data==

Data on schooling was obtained from the American Community Survey, US Census Bureau.

Raw Data is in: E:\McNair\Projects\Hubs\Summer 2017\School Enrollment Data by MSA

Cleaned files are in Z:\Hubs\2017\clean data

They contain:
*MSA code
*MSA
*Year
*Total number of population 3 years and over enrolled in school
*Percent of population 3 years and over enrolled in public school
*Percent of population 3 years and over enrolled in private school

Data from 2005-2015

The SQL file that merges schooling data from ACS (by MSA - Year) with the MSA-City file is titled '''schooling.sql'''. The file is located in: Z:\Hubs\2017

The final table is in db '''cities''' titled '''merged_schooling'''. It includes:
*MSA
*City
*Year
*Total
*Percent_public_schooling
*Percent_private_schooling

===Joined schooling table===

Data is in: Z:\Hubs\clean data

The file names are: SCH_05.txt - SCH_15.txt

Database is '''cities'''

SQL script which joins this table with the VC table is: '''new_merged_schooling.sql'''

The final table is in db '''cities''' titled '''new_merged_schooling'''.
It contains:
*City
*State
*city_state_id to uniquely identify each city
*city_state_year to uniquely identify each city in each year
*Total number of school enrollment
*Percentage enrolled in public schools
*Percentage enrolled in private schools
*Year
*Code from the state code and FIPS code

==VC Data==

Raw Data is in: Z:\VentureCapitalData\SDCVCData\vcdb2

The file name is roundleveloutput2.txt

It contains:
*city
*state
*year
*seedamtm - seed, amount in millions
*earlyamtm - early, amount in millions
*lateramtm - late, amount in millions
*selamtm - seed early late, amount in millions
*numseeds - number of seeds
*numearly
*numlater
*numsel
*numdeals
*numalive

Data from 1948-2017

The table is in db '''cities''' titled '''new_vc'''. It includes:
*city
*state
*city_state_id
*city_state_year
*seedamtm
*earlyamtm
*lateramtm
*selamtm
*numseeds
*numearly
*numlater
*numsel
*numdeals
*numalive
*year

==Final Joined Data set ==

The final data set is in file '''final.txt''' and is located here: Z:\Hubs\2017

It includes:
*city
*state
*city_state_year - (ID that data is merged on)
*year
*seedamtm - Seed Amount
*earlyamtm - Early Investment Amount
*lateramtm - Late Investment Amount
*selamtm - Seed early or late amount
*numseeds - Number of seed investments
*numearly - Number of early investments
*numlater - Number of late investments
*numsel
*numdeals - Number of deals (first contracts)
*numalive - Number of start ups alive
*income - Income per capita in each city-year
*sbir_nogrants - Number of SBIR grants
*sbir_valuegrant - Value of SBIR grants
*emp - Employment stats of each city-year
*unemp - Rate of unemployment
*popestimate - Population estimate of each city-year
*private - Enrollment in private schools
*public - Enrollment in public schools
*total -
*numfirms - Number of publicly traded firms
*randd - R&D expenditure of publicly traded firms
*revenue - Revenue of publicly traded firms
*totalassets
*nsf_nogrants - Number of NSF grants
*valuegrant - Value of NSF grants
*nih_nogrants - Number of NIH grants
*nih_valuegrant - Value of NIH grants
*noctrials - Number of clinical trials

== Defining Hubs ==

'''Summer 2016''' - Last year a master list of 125 "potential" hubs was used. A scorecard was developed which filtered these 125 candidate hubs to determine which of these should be included in the study sample. This method resulted in a sample size of approximately 30.

The master list and the final hubs list is titled '''Hubs Data v2_'16'''. It is located here: Z:\Hubs\2017\hubs_data

'''Summer 2017''' - In order to obtain a more statistically significant sample of hubs, we developed 5 criteria which produce a more relaxed definition of hubs than last year. These include:
*Availability of co-working space
*Coding classes or tech events
*Some focus on the tech sector (this is important as our dependent variable is VC funding)
*Presence of an accelerator
*Availability of mentorship for members

We will review the 125 candidate hubs and select those which satisfy a subset or all of these characteristics.

[[category:Internal]]
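The Summer 2017 selection rule (keep candidates that satisfy a subset or all of the five criteria) could be applied along the lines of the sketch below. The page does not say how many criteria a candidate must meet, so the `min_criteria` threshold is an assumption; the field names, the function names and the sample candidates are likewise hypothetical.

```python
# One boolean field per Summer 2017 criterion (field names are illustrative).
CRITERIA = (
    "coworking_space",
    "coding_or_tech_events",
    "tech_focus",
    "accelerator",
    "mentorship",
)


def criteria_met(candidate):
    """Count how many of the five criteria a candidate satisfies."""
    return sum(1 for name in CRITERIA if candidate.get(name))


def select_hubs(candidates, min_criteria):
    """Return the names of candidates meeting at least `min_criteria`."""
    return [c["name"] for c in candidates if criteria_met(c) >= min_criteria]


# Made-up candidates, not entries from the actual master list.
candidates = [
    {"name": "Candidate A", "coworking_space": True, "tech_focus": True,
     "accelerator": True, "mentorship": True},
    {"name": "Candidate B", "coworking_space": True},
    {"name": "Candidate C", "coworking_space": True,
     "coding_or_tech_events": True, "tech_focus": True,
     "accelerator": True, "mentorship": True},
]

strict = select_hubs(candidates, min_criteria=5)
relaxed = select_hubs(candidates, min_criteria=3)
```

Making the threshold a parameter shows how the sample size responds to a stricter or more relaxed definition of a hub before one is fixed.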

Ed
