Difference between revisions of "Vcdb4"

From edegan.com
Jump to navigation Jump to search
Line 71: Line 71:
 
#Run the normalizer
 
#Run the normalizer
 
#Remove duplicate spaces from the description column by pushing the data through excel and running the last regex (save as In5.txt with UNIX/UTF-8)
 
#Remove duplicate spaces from the description column by pushing the data through excel and running the last regex (save as In5.txt with UNIX/UTF-8)
 +
#Remove quote marks from Out5.txt, resave and then put back into excel to create USVCPortCoLongDesc1980Cleaned.txt
  
 
  cat Process.txt | perl -pe 's/^([^ ])/###\1/g' > Out1.txt
 
  cat Process.txt | perl -pe 's/^([^ ])/###\1/g' > Out1.txt
Line 76: Line 77:
 
  cat Out2.txt | perl -pe 's/\n//g' > Out3.txt
 
  cat Out2.txt | perl -pe 's/\n//g' > Out3.txt
 
  cat Out3.txt | perl -pe 's/###/\n/g' > Out4.txt
 
  cat Out3.txt | perl -pe 's/###/\n/g' > Out4.txt
 
+
...
  cat In5.txt | perl -pe 's/\s{2,}/ /g' > Out4.txt
+
  cat In5.txt | perl -pe 's/\s{2,}/ /g' > Out5.txt
  
 
===Round On One Line===
 
===Round On One Line===

Revision as of 15:18, 23 September 2019


Project
Vcdb4
Project logo 02.png
Project Information
Has title vcdb4
Has owner Ed Egan
Has start date
Has deadline date
Has project status Active
Copyright © 2019 edegan.com. All Rights Reserved.


Source Files

Files are in:

E:\projects\vcdb4

The old files from VentureXpert Database are in the subfolder Student Work, and their latest work is in Updated.

We need a set of pulls (according to E:\projects\vcdb3\OriginalSQL\LoadingScriptsV1.sql), which are documented below, as well as some lookup tables (CPI may need updating) and some joined tables (which would have to be updated separately) in MatchingEntrepsV3.sql:

  • PortCoSBIR: PortCoSBIR.txt
  • PortCoPatent: PortCoPatent.txt

And to update RevisedDBaseCode.sql, we'll need to:

  • Join in the Crunchbase (which needs updating)
  • Update the Geocoordinates

Note that this data could support new or updated versions of:

and others.

The build should be done as quickly but cleanly as possible, as it is needed right away but also will likely need to be updated in January of 2020 to reflect 2019's year end.

Loading the data

First, create a dbase:

createdb vcdb4

Then load the data, running Load.sql

The following were nuances of this process:

  • Replace all double quotes with nothing in PortCo, Fund
  • Renormalize firm by fixing the header -- the Capital Under Mgmt field was too close to the next field -- and remove two bad area codes (lines 931 and 11298).

SDC Platinum Requests

Everything was updated to 09/22/2019 as the final date. Some files were renamed for clarity. Each result is a triplet of .ssh, .rpt, and .txt files. The following scripts, reports and their outputs are in E:\projects\vcdb4\SDC:

  • USVCRound1980
  • USVCPortCo1980
  • USVCRoundOnOneLine1980
  • USVCFund1980
  • USVCFirms1980
  • USPortCoLongDesc1980
  • USVCFirmBranchOffices1980
  • USIPO1980
  • USVCPortCoExecs1980
  • USVCFundExecs1980
  • USMAPrivate100pc1985
  • USMAPrivate100pc2013

The two USMAPrivate100pc queries are different. The first pulls just date announced, date effective, target name, target state and tv. The second adds basic acquirer information from 2013 forward (to allow retroactive revision by Thomson for 5+ years) and can be combined with MAUSTargetComp100pc1985-July2018.txt (after adjusting the spacing) to make USMAPrivate100pc2013Full. For some reason, the query always fails with an out of memory message when trying to pull the whole thing.

USSDCRound1980 was updated to remove fields that should have been in USVCPortCos1980 only. When normalizing be sure to only copy down key fields. USMAPrivate100pc1985 was updated to reflect the MAs load in LoadingScriptsV1. There wasn't a good original. We are using 1985 forward as there are data issues that prevent download/extraction for the 1980-1984 data. Year completed was added as a check variable but might have been the source of issues and so was removed. Date Effective can be used instead. And USIPOComp1980 was updated to allow all exchanges (not just NNA). I couldn't require completion in the search, so that will have to be done in the dbase. USVCFund1980 was updated because some variables -- those concerned with the fund's name and fund address -- had changed name. USTRoundOnOneLine1980 was fixed so that it is just the key (coname,statecode,datefirst) and the round info field, so that it works with the RoundOnOneLine.pl script. Finally, note that USPortCoLongDesc1980 needs processing separately (see below).

Long Description

The instructions on Retrieving_US_VC_Data_From_SDC#Scripts_and_other_info were modified as follows:

  1. Remove the header and footer, and then save as Process.txt using UNIX line endings and UTF-8 encoding.
  2. Run the Regex process (note that I modified it slightly)
  3. Manual Clean
  4. Remove quotes ",',`
  5. Put in a new header with a very long description column
  6. Run the normalizer
  7. Remove duplicate spaces from the description column by pushing the data through excel and running the last regex (save as In5.txt with UNIX/UTF-8)
  8. Remove quote marks from Out5.txt, resave and then put back into excel to create USVCPortCoLongDesc1980Cleaned.txt
cat Process.txt | perl -pe 's/^([^ ])/###\1/g' > Out1.txt
cat Out1.txt | perl -pe 's/\s{65,}/ /g' > Out2.txt
cat Out2.txt | perl -pe 's/\n//g' > Out3.txt
cat Out3.txt | perl -pe 's/###/\n/g' > Out4.txt
...
cat In5.txt | perl -pe 's/\s{2,}/ /g' > Out5.txt

Round On One Line

The process is run USVCRoundOnOneLine1980.ssh with USVCRoundOnOneLine1980.rpt to generate USVCRoundOnOneLine1980.txt, then remove the footer and:

perl Normalizer.pl -file="USVCRoundOnOneLine1980-NoFoot.txt"
 copy down the key (0,1,2)
perl RoundOnOneLine.pl -file="USVCRoundOnOneLine1980-NoFoot-normal.txt"
 then put the header back in!

Everything else

Just run the Normalizer. Only copy down key fields -- never copy down anything else as it will introduce data errors. The primary and foreign key fields, which may still need cleaning in SQL to be valid, are as follows (they are marked with * if they should be copied down):

  • USVCRound1980 -- coname*, statecode*, datefirst*, rounddate
  • USVCPortCo1980 -- coname, statecode, datefirst
  • USVCRoundOnOneLine1980 -- Coname*, statecode*, datefirst*, rounddate, fundname
  • USVCFund1980 -- fundname, firmname
  • USVCFirms1980 -- firmname
  • USPortCoLongDesc1980 -- coname*, statecode*, datefirst*
  • USVCFirmBranchOffices1980 --firmname
  • USIPO1980 -- issuer
  • USVCPortCoExecs1980 -- coname*, statecode*, datefirst*
  • USVCFundExecs1980 -- fundname*, and maybe fundyear*
  • USMAPrivate100pc2013 -- dateeffective, targetname, acquirorname

Not that USMAPrivate100pc2013 and USIPO1980 have some non-numerics in their value fields, and we are generally going to have take care of some type issues.