NAME

Normalizer - Converts unnormalized data, such as in files downloaded through Thomson SDC Platinum, to 3rd normal form.

SYNOPSIS

perl Normalizer.pl -file=<file> [-headerdiscards=<int>] [-footerdiscards=<int>] [-columnnames=<1|0>] [-safenames=<1|0>] [-outfile=<file>] [-h]

OPTIONS

    -file=<file>:           Name of the file to normalize
    -headerdiscards=<int>:  The number of header lines to discard (Default: 0)
    -footerdiscards=<int>:  The number of footer lines to discard (Default: 0)
    -columnnames=<1|0>:     Whether or not the file contains column names on the first line, after discards (Default: 1)
    -safenames=<1|0>:       Determines whether column names will be stripped of non-alphanumeric characters (Default: 1)
    -outfile=<file>:        The name of the output file sequence (Default: file.txt, which expands to file1.txt, file2.txt, etc.)
    -h:                     Display help
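
As a rough illustration of what the discard and naming options describe, here is a minimal sketch in Python (used for illustration only; Normalizer.pl itself is Perl). The function names, the sample lines and the exact character-stripping rule are assumptions and are not taken from the script's source.

```python
import re

def apply_discards(lines, header_discards=0, footer_discards=0):
    """Drop leading and trailing lines, as -headerdiscards/-footerdiscards describe."""
    end = len(lines) - footer_discards
    return lines[header_discards:end]

def safe_name(name):
    """Assumed -safenames rule: keep only ASCII letters and digits."""
    return re.sub(r"[^A-Za-z0-9]", "", name)

# Hypothetical file contents: one title line, a header row, one data row, one footer.
lines = ["Report title", "Deal No.\tTarget Name", "1\tAcme", "Source: SDC"]
body = apply_discards(lines, header_discards=1, footer_discards=1)

# With -columnnames=1, the first remaining line holds the column names.
columns = [safe_name(c) for c in body[0].split("\t")]
print(columns)  # ['DealNo', 'TargetName']
```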

USAGE AND FEATURES

Normalizer.pl takes a tab-delimited, string-quoted text file, such as might be made by copying and pasting an Excel sheet into a text file, and creates a series of files containing the data in third normal form, suitable for import into a database.

Several major data providers, notably Thomson Financial's SDC service, use 'in cell' carriage returns to separate multiple items of data. This makes processing of the data very difficult (without some software to help!). An example spreadsheet might look like this:

    --------------------------------------------------------
    |   col1   |   col2   |   col3   |   col4   |   col5   |
    --------------------------------------------------------
    |    A     |    A     |    A     |    A     |     A    |
    --------------------------------------------------------
    |    B     |    B     |    B     |    B     |     B    |
    |          |          |    B     |    B     |     B    |
    |          |          |          |    B     |     B    |
    --------------------------------------------------------
    |          |    C     |    C     |    C     |     C    |
    |          |          |          |    C     |     C    |
    --------------------------------------------------------
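
To make the 'in cell' layout concrete, the following Python sketch (illustration only; the script itself is Perl) reads a small tab-delimited, quoted sample and splits each cell on its embedded line breaks. The sample data and variable names are hypothetical, and this is not necessarily how Normalizer.pl parses its input.

```python
import csv
import io

# Hypothetical input: tab-delimited, double-quoted, with newlines inside cells.
raw = 'col1\tcol2\tcol3\n"A"\t"A"\t"A"\n"B"\t"B"\t"B\nB"\n'

# newline="" leaves line endings alone so the csv module can handle quoted newlines.
reader = csv.reader(io.StringIO(raw, newline=""), delimiter="\t", quotechar='"')
header = next(reader)

records = []
for row in reader:
    # Split each cell on in-cell line breaks; an empty cell yields no values.
    records.append([cell.splitlines() for cell in row])

print(records[1])  # [['B'], ['B'], ['B', 'B']]
```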

Normalizer.pl will prepare tables for import into the database, in 3rd normal form, by doing the following:

Adding a unique key field (col0) to the 1st output file.
Inserting all data (including null values) that has a 1:1 relationship with the key into the 1st output file.
Grouping data that has a many-to-one relationship with the key, as col4 and col5 do above, into sets.
Creating, for each of these sets (the example has two), a new output file that holds the key from the 1st output file and its own key.
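
The steps above can be sketched as follows, in Python for illustration only (the script itself is Perl). The sample records, the column grouping and the order of the two key fields in the set file are assumptions; the real output layout may differ.

```python
# Each record maps a column name to its list of in-cell values (hypothetical data).
records = [
    {"col1": ["A"], "col2": ["A"], "col3": ["A"]},
    {"col1": ["B"], "col2": ["B"], "col3": ["B1", "B2"]},
    {"col1": [], "col2": ["C"], "col3": ["C"]},
]
one_to_one = ["col1", "col2"]   # columns with at most one value per record
sets = [["col3"]]               # column groups with many values per record

main_rows = []                  # rows of the 1st output file
set_rows = [[] for _ in sets]   # rows of each further output file

for key, record in enumerate(records, start=1):
    # col0 is the unique key; a missing value is kept as None (null).
    main_rows.append([key] + [(record[c] or [None])[0] for c in one_to_one])
    for i, group in enumerate(sets):
        depth = max(len(record[c]) for c in group)
        for n in range(depth):
            values = [record[c][n] if n < len(record[c]) else None for c in group]
            # Assumed layout: the set file's own key, then the key from file 1.
            set_rows[i].append([len(set_rows[i]) + 1, key] + values)

print(main_rows)    # [[1, 'A', 'A'], [2, 'B', 'B'], [3, None, 'C']]
print(set_rows[0])  # [[1, 1, 'A'], [2, 2, 'B1'], [3, 2, 'B2'], [4, 3, 'C']]
```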

The set identification is fully automatic and dictated by the data. In the example given, col1 and col2 would be assigned to the 1:1 file, while "col4 & col5" and "col3" would be two separate sets.

Note that if the software were given only the first line (A), it would identify only a 1:1 relationship. Likewise, if it were given only the first (A) and the third (C) lines, it would identify two groups: col1, col2 and col3 as 1:1 data, and col4 & col5 as a single set. Thus, the larger the file, the more accurate the set identification. However, a conceptual underidentification could be corrected manually after import into a database.
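
One rule that reproduces the behaviour described here is to group columns by the number of in-cell values they hold in each record. The Python sketch below (illustration only; the script itself is Perl) applies that rule to the example spreadsheet. The rule and the function name identify_sets are inferred from the description above, not taken from the Normalizer.pl source.

```python
from collections import defaultdict

def identify_sets(columns, records):
    """Group columns by how many in-cell values they hold in each record.

    Assumed rule: a column with at most one value in every record is 1:1;
    the other columns are grouped when their per-record counts match exactly.
    """
    one_to_one, groups = [], defaultdict(list)
    for i, name in enumerate(columns):
        counts = tuple(len(record[i]) for record in records)
        if max(counts, default=0) <= 1:
            one_to_one.append(name)
        else:
            groups[counts].append(name)
    return one_to_one, list(groups.values())

columns = ["col1", "col2", "col3", "col4", "col5"]
row_a = [["A"], ["A"], ["A"], ["A"], ["A"]]
row_b = [["B"], ["B"], ["B"] * 2, ["B"] * 3, ["B"] * 3]
row_c = [[], ["C"], ["C"], ["C"] * 2, ["C"] * 2]

print(identify_sets(columns, [row_a, row_b, row_c]))
# (['col1', 'col2'], [['col3'], ['col4', 'col5']])
print(identify_sets(columns, [row_a, row_c]))
# (['col1', 'col2', 'col3'], [['col4', 'col5']])
print(identify_sets(columns, [row_a]))
# (['col1', 'col2', 'col3', 'col4', 'col5'], [])
```

Dropping row B from the input loses the only evidence that col3 can hold several values, which is why a larger file gives a more reliable grouping.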

BUGS & FEEDBACK

Well it seems to work fine for me...

Now, I'm not offering any support whatsoever, you understand, but if you find something that goes wrong, or want an extra feature or two, and particularly if you aren't a Perl programmer yourself, then maybe I will help. Mail me and find out.

On the other hand, if you find a bug and fix it, or make some improvement to the code, or just want to give me money, then please contact me right away.

Ed.

<ed@edegan.com>