1,524 bytes added ,  02:49, 30 July 2009
*This page is part of a series on [[Classifying Names by Culture]].
 
==Encodings==
Surnames can be represented in many different encodings. For comparison purposes, it is convenient to store all surnames in a single standard representation, such as the Latin alphabet.
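As a rough illustration of forcing names into a single Latin representation, the Python sketch below decomposes accented characters and drops the combining marks. This is an assumption about one workable approach, not a description of what the normalization script does; the function name <tt>to_latin</tt> is hypothetical.

```python
import unicodedata

def to_latin(name: str) -> str:
    """Return `name` with accents removed, keeping only ASCII characters."""
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks left behind by the decomposition.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Discard anything that is still outside ASCII.
    return stripped.encode("ascii", "ignore").decode("ascii")

print(to_latin("Müller"))
print(to_latin("García"))
```

Characters that have no Unicode decomposition (for example ø or ß) are dropped by this sketch rather than transliterated, so a real pipeline would need an explicit mapping for them.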
==Tussenvoegsel==
[http://en.wikipedia.org/wiki/Tussenvoegsel Tussenvoegsel] are surname prefixes such as Van and De. The term is Dutch, but it is used here generically for prefixes in any language. A custom-compiled [http://www.edegan.com/repository/SurnamesNames-Tussenvoegsel.txt list of Tussenvoegsel] is used in the normalization process. Tussenvoegsel can either be removed (with their presence recorded in a binary variable) or concatenated with the surname. Note that in some sources Tussenvoegsel can be identified by their lack of capitalization.
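The two treatments of Tussenvoegsel described above (removal with a binary indicator, or concatenation) can be sketched in Python as follows. The prefix set is a small hypothetical sample rather than the custom-compiled list, and <tt>handle_tussenvoegsel</tt> is an illustrative name, not a function from the normalization script.

```python
from typing import Tuple

# Hypothetical sample; the real list is the custom-compiled file linked above.
TUSSENVOEGSEL = {"van", "de", "der", "den", "von"}

def handle_tussenvoegsel(surname: str, concatenate: bool = True) -> Tuple[str, bool]:
    """Return (normalized surname, had_prefix)."""
    words = surname.split()
    # Count leading prefix words, always leaving one word as the surname itself.
    i = 0
    while i < len(words) - 1 and words[i].lower() in TUSSENVOEGSEL:
        i += 1
    prefix, core = words[:i], words[i:]
    had_prefix = bool(prefix)
    if concatenate:
        # Join prefix and surname into a single token.
        return "".join(prefix + core), had_prefix
    # Drop the prefix; the boolean records that one was present.
    return " ".join(core), had_prefix

print(handle_tussenvoegsel("van der Berg"))
print(handle_tussenvoegsel("van der Berg", concatenate=False))
```

Matching is case-insensitive here, so the sketch does not use the capitalization cue mentioned above.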
==Double-barrelled Surnames==
Some cultures and some datasets routinely reverse (or re-order) the parts of a name; the most common reversal is Surname, FirstName Initial. Such reversals may or may not be indicated by punctuation, and may be systematic across an entire dataset or idiosyncratic to groups or individuals within it. To handle this, the normalization script must support per-record reversal options using indicator variables.
There are two de facto standard formats (there does not appear to be an [http://www.iso.org ISO] standard):
# US Census [http://www.census.gov/geo/www/standards/scdd/ADCStandard.pdf Address Data Content Standard] (ADCS)
# Phone book

{|
!Source !!Element 1 !!Element 2 !!Element 3 !!Element 4 !!Element 5
|-
|US Census [http://www.census.gov/geo/www/standards/scdd/ADCStandard.pdf Address Data Content Standard] || Name Prefix || First Name || Middle Initial || Surname || Name Suffix
|-
|Phone Book (Hardcopy) || Last Name || First Name || Middle Initial || ||
|}
 
Note: The US census Address Data Content Standard was managed by the [http://www.fgdc.gov/standards/projects/FGDC-standards-projects/addressstandard/ Federal Geographic Data Committee] but is now discontinued.
 
The phone book format is most commonly encountered as <tt>Surname, Firstname I.</tt> (where <tt>I.</tt> is the middle initial). We refer to a name in this form as a comma format name.
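A minimal Python sketch of putting a reversed name back into natural order is shown below. It assumes that a comma always signals the phone book format and that, when no comma is present, an explicit indicator says whether the first word is the surname. The function and parameter names are hypothetical.

```python
def reorder_name(name: str, reversed_flag: bool = False) -> str:
    """Rewrite 'Surname, Firstname I.' (or a flagged 'Surname Firstname') in natural order."""
    name = name.strip()
    if "," in name:
        # Comma format: everything before the first comma is the surname.
        surname, _, given = name.partition(",")
        return f"{given.strip()} {surname.strip()}".strip()
    if reversed_flag:
        # No punctuation: trust the indicator and treat the first word as the surname.
        surname, _, given = name.partition(" ")
        return f"{given.strip()} {surname}".strip()
    return name

print(reorder_name("Smith, John A."))
print(reorder_name("Smith John", reversed_flag=True))
```

Using a per-record flag alongside comma detection covers both the systematic and the idiosyncratic reversals described earlier.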
 
==The Normalization Script==
A [http://www.edegan.com/repository/NormalizeSurnames.pl script for conducting the normalization] takes all of the above points into consideration. The sequence of normalization is:
# Force the encoding to Latin
# Remove Stop Words (default uses: [http://www.edegan.com/repository/Names-Stopwords.txt Names-Stopwords.txt])
# Remove or concatenate (default) Tussenvoegsel (default uses: [http://www.edegan.com/repository/Names-Tussenvoegsel.txt Names-Tussenvoegsel.txt]) - Note that with comma formatted names this does not apply.
# Remove first barrel (default) or concatenate double-barrelled names
# Mark discards
# Extract "Surname"
# Extract "Firstname Surname" pair
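Part of this sequence can be sketched in Python as follows. The sketch covers only stop-word removal, double-barrelled handling, and the two extraction steps, for names already in natural order. It assumes that double-barrelled surnames are joined by a hyphen and that the surname is the last word; the stop-word set is a hypothetical sample rather than the contents of Names-Stopwords.txt, and none of the names below come from the Perl script.

```python
from typing import Tuple

# Hypothetical stop words; the script's default list is Names-Stopwords.txt.
STOP_WORDS = {"mr", "mrs", "dr", "jr", "sr"}

def normalize(name: str, concatenate: bool = False) -> Tuple[str, str]:
    """Return (surname, 'Firstname Surname') for a name in natural order."""
    # Remove stop words, ignoring case and trailing periods.
    words = [w for w in name.split() if w.lower().rstrip(".") not in STOP_WORDS]
    if not words:
        return "", ""
    surname = words[-1]
    # Assumed here: a double-barrelled surname is joined by a hyphen.
    if "-" in surname:
        if concatenate:
            surname = surname.replace("-", "")
        else:
            # Default: remove the first barrel and keep the rest.
            surname = surname.split("-", 1)[1]
    first = words[0] if len(words) > 1 else ""
    return surname, f"{first} {surname}".strip()

print(normalize("Dr. Mary Smith-Jones"))
print(normalize("Mary Smith-Jones", concatenate=True))
```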
 
An example command line is:
 
<tt>perl NormalizeSurnames.pl -i=sourcefile.txt -ncol=1 -rcol=3 </tt>
 
where <tt>-ncol</tt> specifies the name column and <tt>-rcol</tt> specifies the column that indicates whether each name is in reversed format (use <tt>-r=1</tt> to force reversal for the entire dataset). Basic help on the script's options is available through:
 
<tt>perl NormalizeSurnames.pl -h</tt>
