Difference between revisions of "SDC Normalizer"
Line 1: | Line 1: | ||
{{Project | {{Project | ||
+ | |Has sponsor=McNair Center | ||
|Has title=SDC Normalizer | |Has title=SDC Normalizer | ||
|Has project status=Complete | |Has project status=Complete |
Revision as of 13:20, 20 September 2020
SDC Normalizer | |
---|---|
Project Information | |
Has title | SDC Normalizer |
Has start date | |
Has deadline date | |
Has keywords | Tool |
Has project status | Complete |
Has sponsor | McNair Center |
Copyright © 2019 edegan.com. All Rights Reserved. |
This project built the SDC Normalizer. See Normalizer Documentation
Contents
Documentation
The perl script Normalizer.pl is provided with a pod-generated HTML documentation and a (CPAN clone) CSS style sheet page. The documentations below was created using the Pod::Simple::Wiki module available from CPAN and is not perfect.
NAME
Normalize Fixed Width - Converts unnormalized data, such as in files downloaded through Thomson SDC Platinum, to 3rd normal form. Processed files may be opened as tabbed delimited text files or columned spreadsheets. Ideal for database inputs, as commas in numbers, percent symbols, ampersands, and other database-averse symbols have been removed.
SYNOPSIS
perl NormalizeFixedWidth.pl -file=<file> [-headerdiscards=<int>] [-footerdiscards=<int>] [-columnnames=<1|0>] [-safenames=<1|0>] [-outfile=<file>] [h]
Note for non-perl programmers, to process file.txt run the command perl NormalizeFixedWidth.pl -file="file.txt"
OPTIONS
-file=<file>: Name of file to Normalize -headerdiscards=<int>: The number of header lines to discard (Default: 0) -footerdiscards=<int>: The number of footer lines to discard (Default: 0) -columnnames=<1|0>: Whether or not the file contains column names on the first line, after discards (Default: 1) -safenames=<1|0>: Determines whether column names will be stripped of non-alphanumeric characters (Default: 1) -outfile=<file>: The name of the outfile sequence (Default: file.txt - expands to file1.txt, file2.txt,etc) -h: Display help
USAGE & FEATURES
Normalizer.pl takes a tab-delimited, string-quoted textfile, such as might be made by copying and pasting an excel sheet into a text file, and creates a series of files containing the data in third normal form and suitable for import into a database.
Several major data providers, notably Thomson Financial's SDC service, use 'in cell' carriage returns to seperate multiple items of data. This makes processing of the data very difficult (without some software to help!). An example spreadsheet might look like this:
-------------------------------------------------------- | col1 | col2 | col3 | col4 | col5 | -------------------------------------------------------- | A | A | A | A | A | -------------------------------------------------------- | B | B | B | B | B | | | | B | B | B | | | | | B | B | -------------------------------------------------------- | | C | C | C | C | | | | | C | C | --------------------------------------------------------
Normalize.pl will prepare tables for import into the database, in 3rd normal form, by doing the following:
- 1 Adding a unique key field (col0) to the 1st output file. ; ; 2 Inserting all data (including null values) that has a 1
- 1 relationship with the key into the 1st output file. ; ; 3 If, as in the case of col4 and col5 above, data has a many to one relationship with the key, it will be grouped into sets. ; ; 4 Foreach of these sets (the example has two) a new output file, with the first key and it's own key, will be created. ;
The set identification is fully automatic and dictated by the data. In the example given, col1 and col2 would be assigned to the 1:1 file, while "col4 & col5" and "col3" would be two seperate sets.
Note that if the software were given only the first line (A) it would only identify a 1:1 relationship. Likewise if the software were given only the first (A) and the third (C) lines, it would identify two sets (col1, col2 & col3; and col4 & col5). Thus, the larger the file, the more accurate the set identification. However, a conceptual underidentification could be corrected manually after import into a database.
BUGS & FEEDBACK
Worked as expected to process three large Hubs pulls:
- Hubs2_4RoundsPart2.txt
- Hubs3_4_PortfolioCompanies.txt
- Hubs4_4_funds.txt
A fourth test file, which was pulled in a particularly difficult format, required use of a second Perl Program to sculpt it into the proper form.
- Hubs1_4CombinedRoundInfo.txt
After processing by NormalizeFixedWidth:
- Hubs1_4CombinedRoundInfo-normal.txt
After separating Regex groups by tabs with HubsCapture.pl:
- Hubs1_4CombinedRoundInfo-tabbedv3
All of the normalized files are stored under Software/Hubs\ Pulls on the server and are marked with "-normal" suffix on their filename. The original pulls are stored under Original Pulls. The concoction of files used to fix the CombinedRoundInfo are under the HubsCombinedRound directory.
DYLAN'S GUIDE: NORMALIZER FOR DUMMIES
PS C:\Users\DylanDickens> e: PS E:\> cd .\McNair PS E:\McNair> cd .\Projects PS E:\McNair\Projects> cd .\AccentureWork PS E:\McNair\Projects\AccentureWork> perl .\NormalizeFixedWidth.pl -file="file.txt"