Changes

Normalizing Surnames (view source)

Revision as of 19:30, 6 July 2009

979 bytes added , 19:30, 6 July 2009

no edit summary

|Phone Book (Hardcopy) || Last Name || First Name || Middle Initial || ||

|}

=The Normalization Script=

A [http://www.edegan.com/repository/NormalizeSurnames.pl script for conducting the normalization] takes all of the above points into consideration. The sequence of normalization is:

# Force the encoding to Latin
# Remove Stop Words (default uses: [http://www.edegan.com/repository/Names-Stopwords.txt Names-Stopwords.txt])
# Mark discards
# Remove or concatenate (default) Tussenvoegsel (default uses: [http://www.edegan.com/repository/Names-Tussenvoegsel.txt Names-Tussenvoegsel.txt])
# Remove first barrel (default) or concatenate double-barrelled names
# Extract "Surname"
# Extract "Firstname Surname" pair
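The sequence above can be illustrated with a minimal Python sketch. This is not the Perl script itself, only one plausible reading of the seven steps. The word lists are hypothetical stand-ins for the two default files. Lowercasing, accent stripping, and the rule that a name with fewer than two tokens is a discard are all assumptions made for the illustration.

```python
import unicodedata

# Hypothetical word lists; the real defaults live in Names-Stopwords.txt
# and Names-Tussenvoegsel.txt, whose contents are not reproduced here.
STOP_WORDS = {"dr", "mr", "mrs", "jr", "phd"}
TUSSENVOEGSEL = {"van", "der", "den", "de", "von"}


def to_latin(text):
    """Step 1 (assumed reading): strip accents, drop non-Latin-1 characters."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return no_marks.encode("latin-1", "ignore").decode("latin-1")


def normalize(name, concat_prefixes=True, concat_barrels=False):
    """Return (surname, "firstname surname"), or None for a discard."""
    # Lowercasing and dropping periods are assumptions that simplify matching.
    tokens = to_latin(name).lower().replace(".", " ").split()
    # Step 2: remove stop words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Step 3: mark discards (assumed rule: fewer than two tokens remain).
    if len(tokens) < 2:
        return None
    # Step 4: collect tussenvoegsel directly before the last token,
    # then concatenate them (default) or drop them.
    surname = tokens[-1]
    prefixes = []
    i = len(tokens) - 2
    while i >= 1 and tokens[i] in TUSSENVOEGSEL:
        prefixes.insert(0, tokens[i])
        i -= 1
    if concat_prefixes:
        surname = "".join(prefixes) + surname
    # Step 5: remove the first barrel (default) or concatenate the barrels.
    parts = [p for p in surname.split("-") if p]
    if len(parts) > 1:
        surname = "".join(parts) if concat_barrels else parts[-1]
    # Steps 6 and 7: the surname and the "firstname surname" pair.
    return surname, f"{tokens[0]} {surname}"
```

For example, under these assumptions <tt>normalize("Dr. Jan van der Berg")</tt> gives the surname <tt>vanderberg</tt> and the pair <tt>jan vanderberg</tt>, while <tt>normalize("Mary Smith-Jones")</tt> gives <tt>jones</tt> and <tt>mary jones</tt>.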

An example command line is: <tt>perl NormalizeSurnames.pl -i=sourcefile.txt -ncol=1 -rcol=3</tt>, where <tt>-ncol</tt> specifies the name column and <tt>-rcol</tt> specifies the column that indicates whether the name is in reversed format (use <tt>-r=1</tt> to force reversals for the entire dataset). Basic help on the script's options is available through <tt>perl NormalizeSurnames.pl -h</tt>.
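To make the roles of the name column and the reversal column concrete, here is a small Python sketch of how such input might be read. It does not reproduce the script's actual parsing. The tab-delimited layout, the 1-based column numbers, the flag value <tt>1</tt>, the "Surname, Firstname" reversed format, and the names <tt>unreverse</tt> and <tt>read_names</tt> are all assumptions made for illustration.

```python
import csv
import io


def unreverse(name):
    """Turn "Surname, Firstname" into "Firstname Surname"; leave others alone."""
    surname, sep, given = name.partition(",")
    if not sep:
        return name.strip()
    return f"{given.strip()} {surname.strip()}".strip()


def read_names(handle, ncol, rcol=None, force_reverse=False):
    """Yield names in "Firstname Surname" order from tab-delimited rows.

    ncol and rcol are 1-based column numbers (an assumption); a row counts as
    reversed when its rcol field is "1", or always when force_reverse is set.
    """
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < ncol:
            continue  # skip rows that have no name column
        name = row[ncol - 1]
        flagged = (
            rcol is not None
            and len(row) >= rcol
            and row[rcol - 1].strip() == "1"
        )
        yield unreverse(name) if (force_reverse or flagged) else name.strip()


# Hypothetical two-row input: name in column 1, reversal flag in column 3.
sample = io.StringIO("Smith, Jane\tx\t1\nJane Smith\tx\t0\n")
names = list(read_names(sample, ncol=1, rcol=3))
```

In this sketch both rows come out in the same "Firstname Surname" order, which is the form the later surname extraction would expect.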

Anonymous user

imported>Ed
