Difference between revisions of "VCDB20H1"

From edegan.com
Jump to navigation Jump to search
 
(9 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
{{Project
 
{{Project
 +
|Has project output=Data,Tool,How-to
 
|Has title=VCDB20H1
 
|Has title=VCDB20H1
 
|Has owner=Ed Egan,
 
|Has owner=Ed Egan,
 
}}
 
}}
 
+
<onlyinclude>The [[VCDB20H1]] project documents the build of vcbd20h1 -- a '''V'''enture '''C'''apital '''D'''ata'''B'''ase covering until the end of the first half ('''H1''') of 20'''20'''. Vcbd20h1 includes investments, funds, startups, executives, and exits derived from data from [[VentureXpert]]. This project updates [[vcdb4]], which covered (almost) to the of Q3 2019. See also: [[SDC Normalizer]].</onlyinclude>
==Summary==
 
 
 
This page documents the build of vcbd20h1. It updates [[vcdb4]], which covered (almost) to the of Q3 2019, to until the end of the first half (H1) of 2020.
 
  
 
==SDC files==
 
==SDC files==
Line 25: Line 23:
  
 
However, it appears that Thomson-Reuters have drastically redefined their 'venture related' flag, as about 25% of the data appears missing now: Using the venture-related flag, we now have 42,005 portco records, whereas we did have 49,250 in the last pull around 10 months ago...
 
However, it appears that Thomson-Reuters have drastically redefined their 'venture related' flag, as about 25% of the data appears missing now: Using the venture-related flag, we now have 42,005 portco records, whereas we did have 49,250 in the last pull around 10 months ago...
I tried re-pulling PortCo without the venture-related constraint but with a flag for venture-related and buyout (STD-VC and STD-Buyout), which appear the only two such options now (there used to be many more to choose from). This gave 68,596 records, of which 42,068 were venture, 26,527 were not venture, 20,208 were buyout, 48,387 were not buyouts, and 2,2256 were both and 8,575 were neither. I noticed that zozi.com was missing from the new pull, and that it had some early stage deals. It is marked as neither venture nor buyout. I therefore repulled everything without the constraint but with a venture-related and buyout flag at the portco level. (I tested it at the round level and it showed no within-firm variation). I omitted "VC" from the new file names. Note that you should select Columinar Grid and not text as the output method, and don't both appending the session details.
+
I tried re-pulling PortCo without the venture-related constraint but with a flag for venture-related and buyout (STD-VC and STD-Buyout), which appear the only two such options now (there used to be many more to choose from). This gave 68,596 records, of which 42,068 were venture, 26,527 were not venture, 20,208 were buyout, 48,387 were not buyouts, and 2,2256 were both and 8,575 were neither. I noticed that zozi.com was missing from the new pull, and that it had some early stage deals. It is marked as neither venture nor buyout. I therefore repulled everything without the constraint but with a venture-related and buyout flag at the portco level. (I tested it at the round level and it showed no within-firm variation). I omitted "VC" from the new file names. Note that you should select Columnar Grid and not text as the output method, and don't both appending the session details.
  
 
I also re-did the M&A queries, pulling more info from 1980 to 2020H1 in four goes. I removed the 100pc owned constraint from the search, as I'm concerned that when it isn't reported, observations are omitted. It can be added back later. I also included completed vs. withdrawn, pc cash and stock, and some other useful measures. Then I pulled all the non-private acquisitions in two goes as well.
 
I also re-did the M&A queries, pulling more info from 1980 to 2020H1 in four goes. I removed the 100pc owned constraint from the search, as I'm concerned that when it isn't reported, observations are omitted. It can be added back later. I also included completed vs. withdrawn, pc cash and stock, and some other useful measures. Then I pulled all the non-private acquisitions in two goes as well.
Line 47: Line 45:
 
  Sub. 111796
 
  Sub. 111796
  
Normalizing the files is generally straight-forward. Use NormalizeFixedWidth.pl (see [[Normalizer Documentation]]) and only copy down the missing keys (e.g., coname, statecode, datefirst, etc.). For most files there is nothing to fix, just select the last header row and go. For fund fix the header first, so that the capital under management header entry has some space after it. For RoundOnOneLine, remove the footer, run NormalizeFixedWidth.pl first then RoundOnOneLine.pl, and then fix the header. For fund, fund execs, and portco, remove double quotes after normalization.
+
Normalizing the files is generally straight-forward. Use NormalizeFixedWidth.pl (see [[Normalizer Documentation]]) and only copy down the missing keys (e.g., coname, statecode, datefirst, etc.). For most files there is nothing to fix, just select the last header row and go. For firm fix the header first so that the capital under management header entry has some space after it (otherwise it blends with the firm status). For RoundOnOneLine, remove the footer, run NormalizeFixedWidth.pl first then RoundOnOneLine.pl, and then fix the header. For fund, fund execs, and portco, remove double quotes after normalization. For M&As, replace \tnp\t and \tnm\t with \t\t.
  
 
For PortCoLongDescription (see [[Vcdb4#Long_Description]]):  
 
For PortCoLongDescription (see [[Vcdb4#Long_Description]]):  
Line 63: Line 61:
  
 
Other notes:
 
Other notes:
*The Stage Level 3 (in Round) options now include secondary purchase and open market purchase.
+
*The Stage Level 3 (in Round) options include secondary purchase, secondary buyout, and open market purchase.
  
 
==Dbase==
 
==Dbase==

Latest revision as of 12:33, 6 October 2020


Project
VCDB20H1
Project logo 02.png
Project Information
Has title VCDB20H1
Has owner Ed Egan
Has start date
Has deadline date
Has project status
Has project output Data, Tool, How-to
Copyright © 2019 edegan.com. All Rights Reserved.

The VCDB20H1 project documents the build of vcbd20h1 -- a Venture Capital DataBase covering until the end of the first half (H1) of 2020. Vcbd20h1 includes investments, funds, startups, executives, and exits derived from data from VentureXpert. This project updates vcdb4, which covered (almost) to the of Q3 2019. See also: SDC Normalizer.

SDC files

Copy the SDC files from vcdb4/SDC, as well as the two perl scripts:

  • NormalizeFixedWidth.pl
  • RoundOnOneLine.pl

Modify all of the ssh files:

  • Change the date to 07/20/2020 (usually 3 places per file)
  • Change the path to ../vcdb20h1/sdc (usually 3 places per file).

Originally these SDC requests have the following constraints:

  • Venture related deals only
  • Deals from 1/1/1980 to 07/20/2020
  • US companies (also targets and issuers)
  • 100pc owned after the acquisition

However, it appears that Thomson-Reuters have drastically redefined their 'venture related' flag, as about 25% of the data appears missing now: Using the venture-related flag, we now have 42,005 portco records, whereas we did have 49,250 in the last pull around 10 months ago... I tried re-pulling PortCo without the venture-related constraint but with a flag for venture-related and buyout (STD-VC and STD-Buyout), which appear the only two such options now (there used to be many more to choose from). This gave 68,596 records, of which 42,068 were venture, 26,527 were not venture, 20,208 were buyout, 48,387 were not buyouts, and 2,2256 were both and 8,575 were neither. I noticed that zozi.com was missing from the new pull, and that it had some early stage deals. It is marked as neither venture nor buyout. I therefore repulled everything without the constraint but with a venture-related and buyout flag at the portco level. (I tested it at the round level and it showed no within-firm variation). I omitted "VC" from the new file names. Note that you should select Columnar Grid and not text as the output method, and don't both appending the session details.

I also re-did the M&A queries, pulling more info from 1980 to 2020H1 in four goes. I removed the 100pc owned constraint from the search, as I'm concerned that when it isn't reported, observations are omitted. It can be added back later. I also included completed vs. withdrawn, pc cash and stock, and some other useful measures. Then I pulled all the non-private acquisitions in two goes as well.

After these changes, there are only two constraints in the SDC requests (see SDC Platinum for how to make requests using this really buggy tool):

  • Deals from 1/1/1980 to 07/20/2020
  • US companies (also targets and issuers)

The four private M&A pulls were normalized and then combined to create:

USMAPrivate80-20H1-normal.txt

The two public (actually this includes everything that isn't private) M&A pulls were normalized and then combined to create:

USMAPublic80-20H1-normal.txt

The various target statuses in this file are:

Govt.	333
Inv.	1
J.V.	2626
Mutual	218
Public	66176
Sub.	111796

Normalizing the files is generally straight-forward. Use NormalizeFixedWidth.pl (see Normalizer Documentation) and only copy down the missing keys (e.g., coname, statecode, datefirst, etc.). For most files there is nothing to fix, just select the last header row and go. For firm fix the header first so that the capital under management header entry has some space after it (otherwise it blends with the firm status). For RoundOnOneLine, remove the footer, run NormalizeFixedWidth.pl first then RoundOnOneLine.pl, and then fix the header. For fund, fund execs, and portco, remove double quotes after normalization. For M&As, replace \tnp\t and \tnm\t with \t\t.

For PortCoLongDescription (see Vcdb4#Long_Description):

  • Remove the header and footer and save as Process.txt on the dbase server, making sure it is encoded as UNIX and UTF-8.
  • Run the following:
cat Process.txt | perl -pe 's/^([^ ])/###\1/g' > Out1.txt
cat Out1.txt | perl -pe 's/\s{65,}/ /g' > Out2.txt
cat Out2.txt | perl -pe 's/\s*\n//g' > Out3.txt
cat Out3.txt | perl -pe 's/###/\n/g' > Out4.txt
  • Add a header to Out4.txt (make the last header very long!)
  • Run NormalizeFixedWidth.pl on it
  • Remove the following from just the description field: ". Save as Out4wHeaderClean.txt, making sure it is UNIX and UTF-8. Then
 cat Out4wHeaderClean.txt  | perl -pe 's/\s{2,}/ /g' > Out5.txt
  • Copy to USPortCoLongDesc1980Cleaned.txt and make any manual fixes necessary to load (open it in excel and filter based on ISTEXT on row F to identify issues).

Other notes:

  • The Stage Level 3 (in Round) options include secondary purchase, secondary buyout, and open market purchase.

Dbase

Check the filespace on the dbserver (see Posgres Server Configuration): The /data mount is at 23% (845 1K blocks free) and vcdb4 used 37Gb, so we are all good.

Create the new dbase vcdb20h1 as researcher. Then copy over Load.sql and update it.

In addition to the processed SDC requests, copy over the following files to mother (/bulk/vcdb20h1):

  • StateLookup.txt