Difference between revisions of "Venture Capital (Data)"

From edegan.com
Jump to navigation Jump to search
Line 54: Line 54:
 
*Area code had a 1- in it
 
*Area code had a 1- in it
 
*Some line counts were off by one or two
 
*Some line counts were off by one or two
 +
*"Firm Capital under Mgmt" column header for VCFirms has a {0mil} which screws up the normalizer. Delete this part of the column title prior to running normalizer.
  
  

Revision as of 17:11, 16 June 2017


McNair Project
Venture Capital (Data)
Project logo 02.png
Project Information
Project Title Venture Capital (Data)
Owner Adrian Smart, Jake Silberman
Start Date
Deadline
Keywords Data
Primary Billing
Notes
Has project status Active
Copyright © 2016 edegan.com. All Rights Reserved.


Project Objective

To create a master venture capital data set on which all other projects requiring venture capital data will be built.

Specifically, we will be building data to support the following:

  • Lists and counts by year as well as state, MSA, and city, zip code, etc.
  • Identify seed, early and later stage deals (first investment), dollars invested and rounds
  • Identify transactional VC dollars invested and rounds
  • Acquisitions and IPOs of VC backed firms
  • Identify when VC backed firms are "alive"
  • Geocode all portfolio companies
  • Geocode VC headquarters and branch offices
  • Compute distances between investors and their portfolio companies
  • Identify lead investors for portfolio companies
  • Identify top 100 VCs

Retrieving data

Old notes are Old Venture Capital Data Work

All pulls and processing scripts are in:

E:\McNair\Projects\VC Database

Each of the following has a .rpt, .ssh, and .txt file, and has the following constraints in its search:

  • Venture related deals
  • Round date in 1/1/1980 to 06/15/2017


The portfolio company (inc. round) based pulls also have:

  • Portfolio company nation = United States


The datasets retrieved from SDC platinum (on June 15-16th 2017) are:

  • Portfolio companies (USVC1980-present.ssh) - attributes to be extracted from roundlevel information
  • Portfolio company descriptions - just the portco name and the long description. Custom processed.
  • Round-on-one-line (USVCRound1980-present) - processed using RoundOnOneLine.pl
  • Funds (USVCFund1980-present)
  • Firms (VCFirms1980-present) - includes branch office so attributes must be extracted
  • IPOs
  • M&A

All files were processed with NormalizeFixedWidth.pl (after footers were removed) unless otherwise indicated. Some files required some minor post-processing to load into PostgreSQL. Issues included:

  • Firm level data didn't normalize correctly - had to adjust headers
  • Stray quotation in address line
  • Area code had a 1- in it
  • Some line counts were off by one or two
  • "Firm Capital under Mgmt" column header for VCFirms has a {0mil} which screws up the normalizer. Delete this part of the column title prior to running normalizer.


Additional datafiles (in E:\McNair\Projects\VC Database):

  • GeocodedVCData.txt 43,724 records, tab-delimited with companynames but with "none" for some geocoords.

Loading the data into SQL

The SQL script and load data are in:

Z:\VentureCapitalData\SDCVCData

The load script is:

LoadData.sql

Processing the base tables

The SQL scripts need to do the following:

  • Extract company infomation from roundbase then process it
    • Add variable for growth (1), transactional (2), or non-VC (3) (see below)
    • Create clean 5-digit ZIP field
  • Extract firm information from firmbase, then:
    • Add indicator variable for US
    • Create clean 5-digit ZIP field
  • Build out BranchOffice (BO) table
  • Build lookups for:
    • State Code
    • raisestatus

Lookup Tables

Stage of Investment

Growth VC (1)
--------------
Seed
Early Stage
Later Stage
Tranactional VC (2)
--------------
Acq. for Expansion
Acquisition
Bridge Loan
Expansion
Pending Acq
Recap or Turnaround
Exclude (non-VC) (3)
--------------
LBO
MBO
Open Market Purchase
PIPE
Secondary Buyout