Difference between revisions of "VCDB20H1"

Revision as of 15:23, 22 July 2020


Project
VCDB20H1
Project Information
Has title VCDB20H1
Has owner Ed Egan
Copyright © 2019 edegan.com. All Rights Reserved.


Summary

This page documents the build of vcdb20h1. It updates vcdb4, which covered (almost) to the end of Q3 2019, through the end of the first half (H1) of 2020.

SDC files

Copy the SDC files from vcdb4/SDC, as well as the two perl scripts:

  • NormalizeFixedWidth.pl
  • RoundOnOneLine.pl

Modify all of the ssh files:

  • Change the date to 07/20/2020 (usually 3 places per file)
  • Change the path to ../vcdb20h1/sdc (usually 3 places per file).

These SDC requests have the following constraints:

  • Venture related deals only
  • Deals from 1/1/1980 to 07/20/2020
  • US companies (also targets and issuers)
  • 100pc owned after the acquisition (eliminated, see below)

Note that the basics-only M&A pull from 1985 fails. Delete the script and move on! The other pull adds data to an older pull. I redid the M&A queries, pulling more information for 1980 to 2020H1 in four goes. This removed the 100pc owned constraint from the search; it can be added back later. I also included completed vs. withdrawn status, the percentages paid in cash and stock, and some other useful measures.

We might want to look at what deals that are NOT venture related would yield.

Then normalize the files. Generally, this is straightforward: only copy down the missing keys (e.g., coname, statecode, datefirst), and for most files there is nothing to fix. For RoundOnOneLine, remove the footer, run NormalizeFixedWidth.pl first and then RoundOnOneLine.pl, and then fix the header.

For PortCoLongDescription (see Vcdb4#Long_Description):

  • Remove the header and footer and save as Process.txt, making sure it is UNIX and UTF-8.
  • Run the following:
cat Process.txt | perl -pe 's/^([^ ])/###$1/g' > Out1.txt
cat Out1.txt | perl -pe 's/\s{65,}/ /g' > Out2.txt
cat Out2.txt | perl -pe 's/\s*\n//g' > Out3.txt
cat Out3.txt | perl -pe 's/###/\n/g' > Out4.txt
  • Add a header to Out4.txt (make the last header very long!)
  • Run NormalizeFixedWidth.pl on it
  • Remove double-quote characters (") from just the description field. Save as Out4wHeaderClean.txt, making sure it is UNIX and UTF-8. Then run:
 cat Out4wHeaderClean.txt | perl -pe 's/\s{2,}/ /g' > Out5.txt
  • Copy to USVCPortCoLongDesc1980Cleaned.txt and make any manual fixes necessary to load


Dbase

Check the free space on the dbserver (see Posgres Server Configuration): the /data mount is at 23% (845 1K blocks free) and vcdb4 used 37GB, so we are all good.
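The space check can be scripted along the following lines. This is a sketch assuming a POSIX df; the function names mount_use_pct and has_room are hypothetical, and using the size of vcdb4 as the amount of room needed is only a rough rule of thumb.

```shell
# Sketch: read usage figures from POSIX-format df output.
# df -P prints one line per filesystem; column 4 is available space and
# column 5 is the percentage used (e.g. "23%"). -k reports 1K blocks.

# Print the percentage used on the filesystem holding the given path.
mount_use_pct() {
    df -Pk "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed if the filesystem holding $1 has at least $2 1K blocks free.
has_room() {
    free_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$free_kb" -ge "$2" ]
}

# Example: is there room on /data for another database the size of vcdb4 (37GB)?
#   mount_use_pct /data
#   has_room /data $((37 * 1024 * 1024)) && echo "enough space"
```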

Create the new dbase vcdb20h1 as researcher.

Copy over Load.sql and update it.

In addition to the processed SDC requests, copy over the following files:

  • StateLookup.txt