Difference between revisions of "Normalizer.pl"
Revision as of 18:12, 1 August 2009
The Perl script Normalizer.pl is provided with pod-generated HTML documentation and a (CPAN clone) CSS style sheet. The documentation below was created using the Pod::Simple::Wiki module available from CPAN.
FILES
- [www.edegan.com/repository/Normalizer.pl Normalizer.pl]
- [www.edegan.com/repository/Normalizer-Documentation.html Normalizer-Documentation.html]
- [www.edegan.com/repository/CPAN-Pod.css CPAN-Pod.css]
NAME
Normalizer - Converts unnormalized data, such as in files downloaded through Thomson SDC Platinum, to 3rd normal form.
(c) Ed Egan, 2007. All rights reserved. For now.
SYNOPSIS
perl Normalizer.pl -file=<file> [-headerdiscards=<int>] [-footerdiscards=<int>] [-columnnames=<1|0>] [-safenames=<1|0>] [-outfile=<file>] [-h]
OPTIONS
-file=<file>: Name of file to Normalize
-headerdiscards=<int>: The number of header lines to discard (Default: 0)
-footerdiscards=<int>: The number of footer lines to discard (Default: 0)
-columnnames=<1|0>: Whether or not the file contains column names on the first line, after discards (Default: 1)
-safenames=<1|0>: Determines whether column names will be stripped of non-alphanumeric characters (Default: 1)
-outfile=<file>: The name of the outfile sequence (Default: file.txt, which expands to file1.txt, file2.txt, etc.)
-h: Display help
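To make the discard and column-name options concrete, here is a minimal Python sketch of the preprocessing they describe. It is not taken from Normalizer.pl: the function name `prepare_lines`, its keyword arguments, and the exact regular expression used for "safe" names are assumptions made for illustration.

```python
import re

def prepare_lines(lines, headerdiscards=0, footerdiscards=0,
                  columnnames=True, safenames=True):
    """Hypothetical helper: apply the discard and column-name options.

    Returns (column_names, data_lines); column_names is None when
    columnnames is False.
    """
    # Drop the requested number of leading and trailing lines.
    end = max(len(lines) - footerdiscards, 0)
    body = lines[headerdiscards:end]

    names = None
    if columnnames and body:
        # The first remaining line holds the tab-delimited column names.
        names = body[0].split("\t")
        body = body[1:]
        if safenames:
            # Assumed rule: keep only ASCII letters and digits.
            names = [re.sub(r"[^A-Za-z0-9]", "", n) for n in names]
    return names, body


lines = ["Report title", "Deal No.\tTarget Name", "1\tAcme", "2\tBeta", "Source: SDC"]
names, body = prepare_lines(lines, headerdiscards=1, footerdiscards=1)
```

In this made-up input, one title line and one footer line are discarded, and the column names lose their space and full stop.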
Usage and Features
Normalizer.pl takes a tab-delimited, string-quoted text file, such as might be made by copying and pasting an Excel sheet into a text file, and creates a series of files containing the data in third normal form and suitable for import into a database.
Several major data providers, notably Thomson Financial's SDC service, use 'in cell' carriage returns to separate multiple items of data. This makes processing of the data very difficult (without some software to help!). An example spreadsheet might look like this:
--------------------------------------------------------
|   col1   |   col2   |   col3   |   col4   |   col5   |
--------------------------------------------------------
|    A     |    A     |    A     |    A     |    A     |
--------------------------------------------------------
|    B     |    B     |    B     |    B     |    B     |
|          |          |    B     |    B     |    B     |
|          |          |          |    B     |    B     |
--------------------------------------------------------
|          |    C     |    C     |    C     |    C     |
|          |          |          |    C     |    C     |
--------------------------------------------------------
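As a rough illustration of what 'in cell' line breaks look like once such a sheet is saved as text, the following Python sketch reads a tab-delimited, double-quoted sample and splits each cell into its items. It assumes double quotes as the string quote and is not the parsing code used by Normalizer.pl; `read_rows` and the sample text are hypothetical.

```python
import csv
import io

# Header, row A and row B of the example, as tab-delimited text.
# Cells holding several items are quoted and contain line breaks.
raw = (
    "col1\tcol2\tcol3\tcol4\tcol5\n"
    "A\tA\tA\tA\tA\n"
    'B\tB\t"B\nB"\t"B\nB\nB"\t"B\nB\nB"\n'
)

def read_rows(text):
    """Hypothetical helper: return (header, rows), each cell a list of items."""
    reader = csv.reader(io.StringIO(text, newline=""),
                        delimiter="\t", quotechar='"')
    header = next(reader)
    rows = []
    for record in reader:
        # splitlines() handles \n, \r and \r\n; an empty cell becomes [].
        rows.append([cell.splitlines() for cell in record])
    return header, rows


header, rows = read_rows(raw)
```

Here the csv reader keeps a quoted multi-line cell together as one field, so row B arrives as a single record whose col4 cell holds three items.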
Normalizer.pl will prepare tables for import into a database, in 3rd normal form, by doing the following:
1. Adding a unique key field (col0) to the 1st output file.
2. Inserting all data (including null values) that has a 1:1 relationship with the key into the 1st output file.
3. If, as in the case of col4 and col5 above, data has a many-to-one relationship with the key, it will be grouped into sets.
4. For each of these sets (the example has two) a new output file, with the first key and its own key, will be created.
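The steps above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the script's code: `build_tables`, the column name `setkey`, and the in-memory list-of-lists representation are all hypothetical, and the column groupings are passed in by hand rather than detected.

```python
def build_tables(header, rows, one_to_one, sets):
    """Hypothetical sketch: return [main_table, set_table_1, set_table_2, ...].

    rows       -- one list per record; each cell is a list of items
    one_to_one -- column indices with at most one item per record
    sets       -- lists of column indices that belong together
    """
    main = [["col0"] + [header[c] for c in one_to_one]]
    set_tables = [[["col0", "setkey"] + [header[c] for c in cols]]
                  for cols in sets]

    for key, row in enumerate(rows, start=1):
        # Steps 1 and 2: unique key plus the 1:1 data (None stands for null).
        main.append([key] + [row[c][0] if row[c] else None for c in one_to_one])
        # Steps 3 and 4: one output table per set, keyed back to col0.
        for table, cols in zip(set_tables, sets):
            depth = max(len(row[c]) for c in cols)
            for i in range(depth):
                own_key = len(table)  # header is row 0, so keys start at 1
                items = [row[c][i] if i < len(row[c]) else None for c in cols]
                table.append([key, own_key] + items)
    return [main] + set_tables


header = ["col1", "col2", "col3", "col4", "col5"]
rows = [
    [["A"], ["A"], ["A"], ["A"], ["A"]],
    [["B"], ["B"], ["B", "B"], ["B", "B", "B"], ["B", "B", "B"]],
    [[], ["C"], ["C"], ["C", "C"], ["C", "C"]],
]
tables = build_tables(header, rows, one_to_one=[0, 1], sets=[[2], [3, 4]])
```

For the example spreadsheet this gives three tables: the 1:1 table with col0, col1 and col2, a table for col3, and a table for col4 and col5.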
The set identification is fully automatic and dictated by the data. In the example given, col1 and col2 would be assigned to the 1:1 file, while "col4 & col5" and "col3" would be two separate sets.
Note that if the software were given only the first line (A) it would only identify a 1:1 relationship. Likewise if the software were given only the first (A) and the third (C) lines, it would identify two sets (col1, col2 & col3; and col4 & col5). Thus, the larger the file, the more accurate the set identification. However, a conceptual underidentification could be corrected manually after import into a database.
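One heuristic that reproduces the behaviour described for this example is sketched below in Python. It is an assumption, not a description of what Normalizer.pl actually does: columns with at most one item in every record are treated as 1:1, and the remaining columns are grouped when their per-record item counts match exactly. The name `identify_sets` is hypothetical.

```python
def identify_sets(rows):
    """Hypothetical heuristic: return (one_to_one_columns, sets).

    rows -- one list per record; each cell is a list of items.
    """
    one_to_one, groups = [], {}
    for c in range(len(rows[0])):
        # Signature of a column: how many items it holds in each record.
        counts = tuple(len(row[c]) for row in rows)
        if max(counts) <= 1:
            one_to_one.append(c)
        else:
            # Columns sharing a signature are assumed to form one set.
            groups.setdefault(counts, []).append(c)
    return one_to_one, list(groups.values())


A = [["A"], ["A"], ["A"], ["A"], ["A"]]
B = [["B"], ["B"], ["B", "B"], ["B", "B", "B"], ["B", "B", "B"]]
C = [[], ["C"], ["C"], ["C", "C"], ["C", "C"]]

only_a = identify_sets([A])        # everything looks 1:1
a_and_c = identify_sets([A, C])    # col3 is still grouped with col1 and col2
full = identify_sets([A, B, C])    # col3 now splits off as its own set
```

Column indices are zero-based here, so index 2 is col3. The three calls mirror the three cases discussed above and show why more data gives a finer grouping.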
BUGS & FEEDBACK
Well it seems to work fine for me...
Now, I'm not offering any support whatsoever, you understand, but if you find something that goes wrong, or want an extra feature or two, and particularly if you aren't a perl programmer yourself, then maybe I will help. Mail me and find out.
On the other hand, if you find a bug and fix it, or make some improvement to the code, or just want to give me money, then please contact me right away.
Ed. <ed@edegan.com>