*This page is part of the [[Classifying Names by Culture]] series
 
==Encodings==
Surnames can be represented in many different encodings. For comparison purposes, it is convenient to have all surnames in a single standard encoding, such as the Latin alphabet.
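As a minimal sketch of what forcing names toward the Latin alphabet can look like, the Python snippet below decomposes each character and drops the combining marks. The function name <code>to_latin</code> is hypothetical and is not taken from the normalization script. This approach only strips diacritics from Latin letters; it does not transliterate other scripts, and letters without a decomposition (for example ø or ß) pass through unchanged.

```python
import unicodedata

def to_latin(name: str) -> str:
    """Strip diacritics: decompose characters, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(to_latin("Müller"))  # Muller
print(to_latin("Nuñez"))   # Nunez
```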
==Tussenvoegsel==
[http://en.wikipedia.org/wiki/Tussenvoegsel Tussenvoegsel] are surname prefixes such as Van and De. The term is specifically Dutch but is used here generically. A custom-compiled [http://www.edegan.com/repository/SurnamesNames-Tussenvoegsel.txt list of Tussenvoegsel] is used in the normalization process. Tussenvoegsel can be removed (and recorded with a binary variable) or concatenated with the surname. Note that in some sources Tussenvoegsel can be identified by their lack of capitalization.
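The two options can be illustrated with a short Python sketch. The prefix set below is a small illustrative subset, not the compiled list linked above, and the function name <code>handle_tussenvoegsel</code> is an assumption made for this example. The function returns the processed surname together with a binary flag recording whether a prefix was found.

```python
# Illustrative subset only; the compiled list linked above is the real source.
TUSSENVOEGSEL = {"van", "de", "der", "den", "ter", "ten"}

def handle_tussenvoegsel(surname, remove=False):
    """Remove or concatenate leading prefixes; return (surname, flag)."""
    tokens = surname.split()
    i = 0
    # Consume leading prefix tokens, but always keep the last token as the core.
    while i < len(tokens) - 1 and tokens[i].lower() in TUSSENVOEGSEL:
        i += 1
    had_prefix = int(i > 0)  # binary variable recording a prefix
    core = " ".join(tokens[i:])
    if remove:
        return core, had_prefix
    # Concatenate: glue the prefix tokens directly onto the core surname.
    return "".join(tokens[:i]) + core, had_prefix

print(handle_tussenvoegsel("Van der Berg"))               # ('VanderBerg', 1)
print(handle_tussenvoegsel("Van der Berg", remove=True))  # ('Berg', 1)
```

Matching is case-insensitive here, so this sketch does not use capitalization as a signal for identifying prefixes.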
==Double-barrelled Surnames==
|}
Note: The US census Address Data Content Standard was managed by the [http://www.fgdc.gov/standards/projects/FGDC-standards-projects/addressstandard/ Federal Geographic Data Committee] but is now discontinued. The phone book format is most commonly encountered as: Surname, Firstname I. In this instance we refer to it as a comma format name.
==The Normalization Script==
A [http://www.edegan.com/repository/NormalizeSurnames.pl script for conducting the normalization] takes all of the above points into consideration. The sequence of normalization is:
# Force the encoding to Latin
# Remove Stop Words (default uses: [http://www.edegan.com/repository/Names-Stopwords.txt Names-Stopwords.txt])
# Remove or concatenate (default) Tussenvoegsel (default uses: [http://www.edegan.com/repository/Names-Tussenvoegsel.txt Names-Tussenvoegsel.txt]) - Note that with comma formatted names this does not apply.
# Remove first barrel (default) or concatenate double-barrelled names
# Mark discards
# Extract "Surname"
# Extract "Firstname Surname" pair
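The sequence above can be sketched in Python as follows. This is not the linked script: the function name <code>normalize_name</code>, the small word sets (stand-ins for Names-Stopwords.txt and Names-Tussenvoegsel.txt), the rule that a name is discarded when no surname remains, and the restriction to "Firstname Surname" input (no comma format names) are all assumptions made for illustration.

```python
import unicodedata

# Hypothetical stand-ins for Names-Stopwords.txt and Names-Tussenvoegsel.txt.
STOPWORDS = {"dr", "mr", "mrs", "jr", "sr"}
TUSSENVOEGSEL = {"van", "de", "der", "den"}

def normalize_name(raw, remove_prefix=False, keep_both_barrels=False):
    """Apply the listed steps to one 'Firstname Surname' style name."""
    # Force the encoding to Latin (diacritics only).
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove stop words.
    tokens = [t for t in text.split() if t.strip(".,").lower() not in STOPWORDS]
    if not tokens:
        return {"surname": "", "pair": "", "discard": True}
    # Remove or concatenate (default) Tussenvoegsel preceding the last token.
    j = len(tokens) - 1
    while j > 0 and tokens[j - 1].lower() in TUSSENVOEGSEL:
        j -= 1
    given, prefix, surname = tokens[:j], tokens[j:-1], tokens[-1]
    if not remove_prefix:
        surname = "".join(prefix) + surname
    # Remove first barrel (default) or concatenate double-barrelled names.
    if "-" in surname:
        first, rest = surname.split("-", 1)
        surname = first + rest if keep_both_barrels else rest
    # Mark discards (assumed rule: nothing left to use as a surname).
    discard = surname == ""
    # Extract "Surname" and the "Firstname Surname" pair.
    pair = f"{given[0]} {surname}" if given else surname
    return {"surname": surname, "pair": pair, "discard": discard}

print(normalize_name("Dr. Jan van der Berg"))
# {'surname': 'vanderBerg', 'pair': 'Jan vanderBerg', 'discard': False}
print(normalize_name("Mary Smith-Jones"))
# {'surname': 'Jones', 'pair': 'Mary Jones', 'discard': False}
```

Because the steps run in the listed order, a name with both a prefix and a hyphen loses the concatenated prefix along with the first barrel under the defaults; whether the actual script behaves this way is not stated here.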