Changes

2,861 bytes added , 02:49, 30 July 2009

*This page is part of a series on [[Classifying Names by Culture]]

==Encodings==

Surnames can be represented in many different encodings. For comparison purposes, it is convenient to have all surnames encoded in a single standard encoding, such as the Latin alphabet.
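As a minimal sketch of forcing names toward a plain Latin representation, the following Python function strips diacritics using Unicode decomposition. This is an illustration under stated assumptions, not the method used by the normalization script: the function name <tt>to_latin</tt> is hypothetical, and the approach only handles accented Latin letters. Letters with no decomposition (such as Ø) and non-Latin scripts (such as Cyrillic) would need a separate transliteration table.

```python
import unicodedata


def to_latin(name: str) -> str:
    """Strip diacritics so accented Latin letters compare equal to their base letters."""
    # NFKD splits each accented character into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks and keep the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

For example, <tt>to_latin("Müller")</tt> yields <tt>Muller</tt>, so both spellings compare as equal after normalization.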

==Tussenvoegsel==

[http://en.wikipedia.org/wiki/Tussenvoegsel Tussenvoegsel] are surname prefixes such as the words Van and De. The term is specific to the Dutch language but is used generically here. A custom-compiled [http://www.edegan.com/repository/Names-Tussenvoegsel.txt list of Tussenvoegsel] is used in the normalization process. Tussenvoegsel can be removed (and recorded with a binary variable) or concatenated with the surname. Note that in some sources Tussenvoegsel can be identified by their lack of capitalization.
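The two treatments (remove and record, or concatenate) might be sketched in Python as follows. This is an assumption-laden illustration rather than the script's actual logic: the function name <tt>split_tussenvoegsel</tt> is hypothetical, and the small prefix set stands in for the full list in Names-Tussenvoegsel.txt.

```python
# Hypothetical sample; the real list lives in Names-Tussenvoegsel.txt.
TUSSENVOEGSEL = {"van", "de", "der", "den", "von", "ter"}


def split_tussenvoegsel(surname: str, concatenate: bool = True):
    """Return (normalized_surname, had_prefix) for a surname string."""
    words = surname.split()
    prefix = []
    # Consume leading prefix words, always leaving at least one word as the surname.
    while len(words) > 1 and words[0].lower() in TUSSENVOEGSEL:
        prefix.append(words.pop(0))
    core = " ".join(words)
    had_prefix = bool(prefix)  # the binary variable recording that a prefix was present
    if concatenate:
        return "".join(prefix) + core, had_prefix
    return core, had_prefix
```

With these assumptions, <tt>split_tussenvoegsel("van der Berg")</tt> returns <tt>("vanderBerg", True)</tt>, while passing <tt>concatenate=False</tt> returns <tt>("Berg", True)</tt>.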

==Double-barrelled Surnames==

Surname data often contains honorifics such as Mr, Mrs, Ms, and Dr, as well as suffixes such as Esq., Jr., Roman numerals (II, III, IV, V, etc.) and occasionally academic qualifications (PhD, MSc, etc.). These need to be removed or separated, and can be classified for gender, education, and other characteristics.

Military, political and class honorifics and suffixes also need treatment. These include Sir, M.P., The Hon., Lord, Lt, Cap., Major, Gen., and so forth. Practically all of these honorifics and suffixes are sufficiently distinct from real names to be considered stop words, provided context permits. For example, "Major John Major" could have the first "Major" removed, but removing the "Major" from "John Major" would compromise the name-string. Coding these stop words for gender, education and other variables of interest is possible.
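One way to respect the context caveat above is to strip stop words only from the ends of the name, and only while enough tokens remain to form a name. The Python sketch below illustrates this. It is not the script's implementation: the function name <tt>strip_stop_words</tt>, the <tt>min_tokens</tt> heuristic, and the short stop word set are all assumptions (the script's default list is Names-Stopwords.txt).

```python
# Hypothetical sample; the script's default list is Names-Stopwords.txt.
STOP_WORDS = {"mr", "mrs", "ms", "dr", "sir", "lord", "lt", "major", "gen",
              "esq", "jr", "phd", "msc", "ii", "iii", "iv"}


def strip_stop_words(name: str, min_tokens: int = 2):
    """Return (kept_tokens, removed_tokens), never shrinking below min_tokens."""
    tokens = name.split()
    removed = []

    def is_stop(token: str) -> bool:
        # Ignore trailing punctuation such as the full stop in "Dr." or "Jr.".
        return token.strip(".,").lower() in STOP_WORDS

    # Strip leading honorifics while enough tokens remain to form a name.
    while len(tokens) > min_tokens and is_stop(tokens[0]):
        removed.append(tokens.pop(0))
    # Strip trailing suffixes under the same condition.
    while len(tokens) > min_tokens and is_stop(tokens[-1]):
        removed.append(tokens.pop())
    return tokens, removed
```

Under this heuristic "Major John Major" loses only its first "Major", and "John Major" is left intact. The removed tokens are returned so that they can be coded for gender, education and similar variables. A two-token input such as "Mr Smith" is not stripped, which is a limitation of the assumed minimum of two name tokens.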

==Initials and Middle Names==

Many name sources provide either middle initials or middle names, or sometimes both. In the case of initials very little information can be deduced (possibly more initials are indicative of higher social class or some such, but this is a blind guess). Middle names could be used in much the same fashion as first names, that is, to deduce gender and possibly an SES (Socio-Economic Status) type variable. However, for the most part this is superfluous information that can be ignored.

==Short Names==

==Name Orders and Formats==

Some cultures and some datasets routinely reverse (or re-order) the order of names; the most common reversal is Surname, FirstName Initial. Such reversals may or may not be indicated by punctuation, and may be systematic across an entire dataset or idiosyncratic to groups or individuals within the dataset. To handle this, the normalization script must support idiosyncratic reversal options using indicator variables.

There are two de facto standard formats (there does not appear to be an [http://www.iso.org ISO] standard):

{|
! Source !! Element 1 !! Element 2 !! Element 3 !! Element 4 !! Element 5
|-
| US census [http://www.census.gov/geo/www/standards/scdd/ADCStandard.pdf Address Data Content Standard] || Name Prefix || First Name || Middle Initial || Surname || Name Suffix
|-
| Phone Book (Hardcopy) || Last Name || First Name || Middle Initial || ||
|}

Note: The US census Address Data Content Standard was managed by the [http://www.fgdc.gov/standards/projects/FGDC-standards-projects/addressstandard/ Federal Geographic Data Committee] but is now discontinued.

The phone book format is most commonly encountered as: Surname, Firstname I. In this instance we refer to it as a comma format name.

==The Normalization Script==

A [http://www.edegan.com/repository/NormalizeSurnames.pl script for conducting the normalization] takes all of the above points into consideration. The sequence of normalization is:
# Force the encoding to Latin
# Remove Stop Words (default uses: [http://www.edegan.com/repository/Names-Stopwords.txt Names-Stopwords.txt])
# Remove or concatenate (default) Tussenvoegsel (default uses: [http://www.edegan.com/repository/Names-Tussenvoegsel.txt Names-Tussenvoegsel.txt]). Note that this step does not apply to comma format names.
# Remove first barrel (default) or concatenate double-barrelled names
# Mark discards
# Extract "Surname"
# Extract "Firstname Surname" pair

An example command line is:

<tt>perl NormalizeSurnames.pl -i=sourcefile.txt -ncol=1 -rcol=3 </tt>

where <tt>-ncol</tt> specifies the name column and <tt>-rcol</tt> specifies the column indicating whether the name is in reversed format (use <tt>-r=1</tt> to force reversals for the entire dataset). Basic help on the script's options is available through:

<tt>perl NormalizeSurnames.pl -h</tt>
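To illustrate how a per-record reversal indicator can drive the handling of comma format names, here is a minimal Python sketch. It is not taken from NormalizeSurnames.pl: the function name <tt>reorder_comma_name</tt> and its <tt>reversed_flag</tt> parameter are hypothetical, and reversals that are not marked by a comma (for example "Smith John") are deliberately left untouched.

```python
def reorder_comma_name(name: str, reversed_flag: bool) -> str:
    """Convert 'Surname, Firstname I.' to 'Firstname I. Surname' when flagged."""
    # Leave the name alone unless the indicator is set and a comma is present.
    if not reversed_flag or "," not in name:
        return name.strip()
    # Split on the first comma only: the surname is before it, the rest after.
    surname, _, rest = name.partition(",")
    return f"{rest.strip()} {surname.strip()}".strip()
```

Here <tt>reorder_comma_name("Smith, John Q.", True)</tt> gives <tt>John Q. Smith</tt>, while the same input with the indicator set to <tt>False</tt> is returned unchanged.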

Anonymous user

imported>Ed

Changes

Normalizing Surnames (view source)

Revision as of 02:49, 30 July 2009
