## The private and public M&A file sets must each be combined into a single file (two files in total) after they have been normalized. Then replace \tnp\t and \tnm\t with \t\t in each.
## For RoundOnOneLine, remove the footer, run NormalizeFixedWidth.pl first, then RoundOnOneLine.pl, and then fix the header.
## PortCoLongDescription must be pre-processed from the command line and then post-processed in Excel (see below as well as [[VCDB20H1]] and [[Vcdb4#Long_Description]]).
Create the postgres database:
# Create a new database on mother (createdb vcdb24) and set up a directory for the input files: bulk\vcdb24
# Copy over (to sql folder) and edit Load.sql. Run it section-by-section.
 
===PortCoLongDescription===
 
Process the Long Description data as follows:
#Remove the header and footer, and then save as Process.txt using UNIX line endings and UTF-8 encoding.
#Run the first section of the regex process below (producing Out5.txt).
#Import Out5.txt into Excel to make it tab-delimited.
#Remove double quotes (") from the description field only.
#Put in a new header.
#Save as In5.txt with UNIX line endings and UTF-8 encoding.
#Run the last section of the regex process below (from In5.txt onward). It collapses runs of spaces in the description and handles records that have no description.
#Try importing USVCPortCoLongDesc1980Cleaned.txt. It should be fine.
 
cat Process.txt | perl -pe 's/^([^ ])/###\1/g' > Out1.txt
cat Out1.txt | perl -pe 's/\s{65,}/ /g' > Out2.txt
cat Out2.txt | perl -pe 's/\n//g' > Out3.txt
cat Out3.txt | perl -pe 's/###/\n/g' > Out4.txt
cat Out4.txt | perl -pe 's/(\d{4}) $/\1\t/g' > Out5.txt
...
cat In5.txt | perl -pe 's/(\d{4})\t$/\1###/g' > Out6.txt
cat Out6.txt | perl -pe 's/\s{2,}/ /g' > Out7.txt
cat Out7.txt | perl -pe 's/###/\t/g' > USPortCoLongDesc1980Cleaned.txt
