Changes

1,402 bytes added , 02:49, 30 July 2009

*This page is part of a series on [[Classifying Names by Culture]]

==Encodings==

Surnames can be represented in many different encodings. For comparison purposes, it is convenient to have all surnames in a single standard representation, such as the Latin alphabet.
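As a rough illustration of folding names toward plain Latin letters, the sketch below strips diacritics using Python's standard <tt>unicodedata</tt> module. The function name <tt>to_latin</tt> is hypothetical and this is not how NormalizeSurnames.pl is known to work; it only handles accented Latin letters and does not transliterate other scripts (letters such as ß or ø are left unchanged).

```python
import unicodedata


def to_latin(name: str) -> str:
    """Fold accented Latin letters to their unaccented base letters.

    Hypothetical helper for illustration only.
    """
    # NFKD decomposition splits e.g. 'ü' into 'u' plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks and keep the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    for sample in ("Müller", "Núñez", "Smith"):
        print(sample, "->", to_latin(sample))
```

Names in non-Latin scripts would need a separate transliteration step, which is outside the scope of this sketch.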

==Tussenvoegsel==

[http://en.wikipedia.org/wiki/Tussenvoegsel Tussenvoegsel] are surname prefixes, such as the words Van and De. The term is specific to the Dutch language but is used here generically. A custom-compiled [http://www.edegan.com/repository/Names-Tussenvoegsel.txt list of Tussenvoegsel] is used in the normalization process. Tussenvoegsel can be removed (and their presence recorded with a binary variable) or concatenated with the surname. Note that in some sources Tussenvoegsel can be identified by their lack of capitalization.
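The two options described above (remove and record, or concatenate) can be sketched as follows. This is a minimal sketch under stated assumptions: the prefix set is a small hypothetical sample rather than the contents of Names-Tussenvoegsel.txt, the function name is invented for illustration, and the exact way the real script joins prefixes to surnames is not known from this page.

```python
# Hypothetical sample of prefixes; the actual list is Names-Tussenvoegsel.txt.
TUSSENVOEGSEL = {"van", "de", "der", "den", "von", "la"}


def split_tussenvoegsel(surname, concatenate=True):
    """Return (normalized_surname, had_prefix).

    had_prefix plays the role of the binary variable recording whether
    any Tussenvoegsel were found.
    """
    words = surname.split()
    prefix = []
    # Consume leading prefix words, always leaving at least one word
    # so that a surname such as "De" is not emptied out.
    while len(words) > 1 and words[0].lower() in TUSSENVOEGSEL:
        prefix.append(words.pop(0))
    core = " ".join(words)
    if concatenate:
        # Assumed concatenation style: prefixes joined directly to the surname.
        return "".join(prefix) + core, bool(prefix)
    return core, bool(prefix)


if __name__ == "__main__":
    print(split_tussenvoegsel("van der Berg"))
    print(split_tussenvoegsel("van der Berg", concatenate=False))
```

Matching is case-insensitive here; a source that marks Tussenvoegsel by lowercase could instead test <tt>words[0].islower()</tt>.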

==Double-barrelled Surnames==

There are two de facto standard formats (there does not appear to be an [http://www.iso.org ISO] standard):

{|
! Source !! Element 1 !! Element 2 !! Element 3 !! Element 4 !! Element 5
|-
| US census [http://www.census.gov/geo/www/standards/scdd/ADCStandard.pdf Address Data Content Standard] || Name Prefix || First Name || Middle Initial || Surname || Name Suffix
|}

Note: The US census Address Data Content Standard was managed by the [http://www.fgdc.gov/standards/projects/FGDC-standards-projects/addressstandard/ Federal Geographic Data Committee] but is now discontinued.

The phone book format is most commonly encountered as: Surname, Firstname I. In this instance we refer to it as a comma format name.

==The Normalization Script==

A [http://www.edegan.com/repository/NormalizeSurnames.pl script for conducting the normalization] takes all of the above points into consideration. The sequence of normalization is:
# Force the encoding to Latin
# Remove Stop Words (default uses: [http://www.edegan.com/repository/Names-Stopwords.txt Names-Stopwords.txt])
# Remove or concatenate (default) Tussenvoegsel (default uses: [http://www.edegan.com/repository/Names-Tussenvoegsel.txt Names-Tussenvoegsel.txt]). Note that this does not apply to comma formatted names.
# Remove first barrel (default) or concatenate double-barrelled names
# Mark discards
# Extract "Surname"
# Extract "Firstname Surname" pair

An example command line is:

<tt>perl NormalizeSurnames.pl -i=sourcefile.txt -ncol=1 -rcol=3</tt>

where ncol specifies the name column and rcol specifies whether the name is in reversed format (use -r=1 to force reversals for the entire dataset). Basic script help on options is available through:

<tt>perl NormalizeSurnames.pl -h</tt>
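
The handling of comma format names and double-barrelled surnames described on this page might look roughly like the sketch below. It is written in Python rather than Perl and is not taken from NormalizeSurnames.pl; the function names are hypothetical, and it assumes barrels are separated by hyphens.

```python
def parse_comma_name(raw):
    """Split a comma format name 'Surname, Firstname I.' into (firstname, surname)."""
    surname, _, rest = raw.partition(",")
    given = rest.split()
    # The first token after the comma is taken as the first name;
    # any middle initial is ignored.
    firstname = given[0] if given else ""
    return firstname, surname.strip()


def handle_double_barrel(surname, concatenate=False):
    """Remove the first barrel (default) or concatenate all barrels."""
    # Assumption: barrels are hyphen-separated, e.g. 'Smith-Jones'.
    barrels = [b for b in surname.split("-") if b]
    if len(barrels) < 2:
        return surname  # not double-barrelled, leave unchanged
    if concatenate:
        return "".join(barrels)
    return "-".join(barrels[1:])


if __name__ == "__main__":
    first, last = parse_comma_name("Smith-Jones, Mary A.")
    last = handle_double_barrel(last)
    print(last)                # extracted "Surname"
    print(f"{first} {last}")   # extracted "Firstname Surname" pair
```

Stop word removal, Tussenvoegsel handling, and discard marking are omitted here to keep the example short.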

Anonymous user

imported>Ed

Normalizing Surnames (view source)

Revision as of 02:49, 30 July 2009