Normalizer - Converts unnormalized data, such as in files downloaded through Thomson SDC Platinum, to 3rd normal form.
(c) Ed Egan, 2007. All rights reserved. For now.
perl Normalizer.pl -file=<file> [-headerdiscards=<int>] [-footerdiscards=<int>] [-columnnames=<1|0>] [-safenames=<1|0>] [-outfile=<file>] [-h]
-file=<file>: Name of file to Normalize
-headerdiscards=<int>: The number of header lines to discard (Default: 0)
-footerdiscards=<int>: The number of footer lines to discard (Default: 0)
-columnnames=<1|0>: Whether or not the file contains column names on the first line, after discards (Default: 1)
-safenames=<1|0>: Determines whether column names will be stripped of non-alphanumeric characters (Default: 1)
-outfile=<file>: The name of the outfile sequence (Default: file.txt - expands to file1.txt, file2.txt, etc)
-h: Display help
Normalizer.pl takes a tab-delimited, string-quoted text file, such as might be made by copying and pasting an Excel sheet into a text file, and creates a series of files containing the data in third normal form, suitable for import into a database.
Several major data providers, notably Thomson Financial's SDC service, use 'in cell' carriage returns to separate multiple items of data. This makes processing of the data very difficult (without some software to help!). An example spreadsheet might look like this:
--------------------------------------------------------
| col1     | col2     | col3     | col4     | col5     |
--------------------------------------------------------
| A        | A        | A        | A        | A        |
--------------------------------------------------------
| B        | B        | B        | B        | B        |
|          |          | B        | B        | B        |
|          |          |          | B        | B        |
--------------------------------------------------------
|          | C        | C        | C        | C        |
|          |          |          | C        | C        |
--------------------------------------------------------
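To make the input format concrete, here is a minimal Python sketch of reading such a file. It is not Normalizer.pl's own code (the script is written in Perl); the SAMPLE text and the read_records helper are hypothetical, and the sketch assumes multi-valued cells are double-quoted with their items separated by line breaks.

```python
import csv
import io

# Hypothetical sample mirroring the example spreadsheet: tab-delimited,
# with multi-valued cells quoted and their items separated by line breaks.
SAMPLE = (
    'col1\tcol2\tcol3\tcol4\tcol5\n'
    'A\tA\tA\tA\tA\n'
    'B\tB\t"B\nB"\t"B\nB\nB"\t"B\nB\nB"\n'
    '\tC\tC\t"C\nC"\t"C\nC"\n'
)

def read_records(text):
    """Return (column_names, records); each record maps column -> list of items."""
    # newline='' leaves line endings untouched so the csv module can handle
    # line breaks that occur inside quoted cells.
    reader = csv.reader(io.StringIO(text, newline=''), delimiter='\t', quotechar='"')
    names = next(reader)
    records = []
    for row in reader:
        # Split each cell on in-cell line breaks; an empty cell yields no items.
        records.append({n: [v for v in cell.splitlines() if v]
                        for n, cell in zip(names, row)})
    return names, records

names, records = read_records(SAMPLE)
for rec in records:
    # Number of items found in each column of this record.
    print([len(rec[n]) for n in names])
```

Each printed list gives the number of items per column for one record, which is the information a set-identification step would work from.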
Normalizer.pl will prepare tables for import into the database, in 3rd normal form, by doing the following:
The set identification is fully automatic and dictated by the data. In the example given, col1 and col2 would be assigned to the 1:1 file, while "col4 & col5" and "col3" would be two separate sets.
Note that if the software were given only the first line (A) it would only identify a 1:1 relationship. Likewise if the software were given only the first (A) and the third (C) lines, it would identify two sets (col1, col2 & col3; and col4 & col5). Thus, the larger the file, the more accurate the set identification. However, a conceptual underidentification could be corrected manually after import into a database.
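The following Python sketch shows one heuristic that reproduces the three outcomes described above. It is an assumption, not necessarily what Normalizer.pl does internally: a column that never holds more than one item per record goes to the 1:1 file, and the remaining columns are grouped when they hold the same number of items in every record. The COUNTS table and the identify_sets function are hypothetical names introduced for illustration.

```python
from collections import defaultdict

# Number of items per cell for records A, B and C of the example spreadsheet.
COUNTS = {
    'A': {'col1': 1, 'col2': 1, 'col3': 1, 'col4': 1, 'col5': 1},
    'B': {'col1': 1, 'col2': 1, 'col3': 2, 'col4': 3, 'col5': 3},
    'C': {'col1': 0, 'col2': 1, 'col3': 1, 'col4': 2, 'col5': 2},
}
COLUMNS = ['col1', 'col2', 'col3', 'col4', 'col5']

def identify_sets(record_ids):
    """Assumed heuristic: return (one_to_one_columns, multi_valued_sets)."""
    one_to_one = []
    groups = defaultdict(list)
    for col in COLUMNS:
        # The column's item count in each of the records under consideration.
        signature = tuple(COUNTS[r][col] for r in record_ids)
        if max(signature) <= 1:
            # Never more than one item per record: belongs in the 1:1 file.
            one_to_one.append(col)
        else:
            # Columns with identical count patterns are treated as one set.
            groups[signature].append(col)
    return one_to_one, list(groups.values())

print(identify_sets(['A', 'B', 'C']))
print(identify_sets(['A']))
print(identify_sets(['A', 'C']))
```

With all three records the sketch separates col3 from col4 and col5; with fewer records it has less evidence and merges more columns, which is the underidentification the note above warns about.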
Well it seems to work fine for me...
Now, I'm not offering any support whatsoever, you understand, but if you find something that goes wrong, or want an extra feature or two, and particularly if you aren't a perl programmer yourself, then maybe I will help. Mail me and find out.
On the other hand, if you find a bug and fix it, or make some improvement to the code, or just want to give me money, then please contact me right away.
Ed.