Normalizing Surnames

From edegan.com
Revision as of 19:33, 6 July 2009 by imported>Ed (→‎Tussenvoegsel)

Encodings

Surnames can be represented in many different encodings. For comparison purposes, it is convenient to have surnames encoded in a single standard encoding, such as the Latin alphabet.

The Latin alphabet offers the advantage of simplicity. There are only 26 letter characters, A to Z, provided one ignores case (upper or lower). There are no ligatures or diacritics. As there are (symbols)^n possible n-grams, an encoding with a large number of symbols results in a much higher number of dimensions for the data, even for a small value of n. Furthermore, most datasets used for practical applications are encoded in the Latin alphabet, so a classification system that allows for non-Latin characters would introduce redundancy.
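To make the dimensionality argument concrete, the following short Python sketch counts the possible n-grams for a 26-symbol alphabet and, for comparison, a 52-symbol alphabet (upper and lower case kept distinct). The function name ngram_space is hypothetical and used only for illustration.

```python
def ngram_space(symbols: int, n: int) -> int:
    """Number of distinct n-grams over an alphabet of `symbols` characters."""
    return symbols ** n


# Compare a case-folded Latin alphabet (26) with a case-sensitive one (52).
for n in (1, 2, 3):
    print(n, ngram_space(26, n), ngram_space(52, n))
```

For n = 3 the case-folded alphabet gives 17,576 possible trigrams, against 140,608 when case is kept.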

The first stage of normalization is therefore to check that the encoding is in the Latin alphabet, with a minimal number of other symbols (such as the period, comma, and hyphen) that may provide meta information for further normalization, and to force it into the Latin alphabet if it isn't. Maintaining information about the simplification or removal of ligatures and diacritics (in particular) may be useful and is accomplished through the creation of an additional binary variable.
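One possible way to force a name into the Latin alphabet while recording whether anything was changed is sketched below in Python. This is not the script's actual code: the function name force_latin, the LIGATURES table (illustrative, not exhaustive), and the choice of Unicode NFKD decomposition are all assumptions made for the example.

```python
import unicodedata

# Letters that NFKD does not decompose into a base letter plus accent.
# Illustrative only; a real list would be longer.
LIGATURES = {"Æ": "AE", "æ": "ae", "Œ": "OE", "œ": "oe",
             "ß": "ss", "Ø": "O", "ø": "o"}

# Non-letter symbols kept because they carry meta information.
KEEP = set(" .,-")


def force_latin(name):
    """Return (latin_name, changed), where changed is the binary indicator."""
    replaced = "".join(LIGATURES.get(ch, ch) for ch in name)
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", replaced)
    latin = "".join(ch for ch in decomposed
                    if (ch.isascii() and ch.isalpha()) or ch in KEEP)
    return latin, latin != name


print(force_latin("Müller"))
print(force_latin("Smith-Jones"))
```

Note that this sketch drops every character outside A to Z and the kept symbols, including apostrophes and digits, and sets the indicator whenever the output differs from the input.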

Tussenvoegsel

Tussenvoegsel are surname prefixes, specifically in the Dutch language but used here generically, such as the words Van and De. A custom-compiled list of Tussenvoegsel is used in the normalization process. Tussenvoegsel can be removed (and recorded with a binary variable) or concatenated with the surname. Note that in some sources Tussenvoegsel can be identified by their lack of capitalization.
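The two treatments (removal with a binary indicator, or concatenation) might look like the following Python sketch. The TUSSENVOEGSEL set is a small illustrative stand-in for the custom-compiled list, and handle_tussenvoegsel is a hypothetical name; the input is assumed to be the surname portion already split into tokens.

```python
# Illustrative stand-in for the custom-compiled list.
TUSSENVOEGSEL = {"van", "de", "der", "den", "von", "la", "le"}


def handle_tussenvoegsel(surname_tokens, concatenate=True):
    """Return (surname, had_prefix) for a tokenized surname."""
    prefixes = []
    rest = list(surname_tokens)
    # Peel prefixes off the front, always leaving at least one token.
    while len(rest) > 1 and rest[0].lower() in TUSSENVOEGSEL:
        prefixes.append(rest.pop(0))
    core = " ".join(rest)
    surname = "".join(prefixes) + core if concatenate else core
    return surname, bool(prefixes)


print(handle_tussenvoegsel(["Van", "der", "Berg"]))
print(handle_tussenvoegsel(["Van", "der", "Berg"], concatenate=False))
```

Matching is case-insensitive here, so the capitalization cue mentioned above is not used; a source-specific variant could test it instead of, or in addition to, the list.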

Double-barrelled Surnames

Double-barrelled surnames may be hyphenated and easy to detect, such as Smith-Jones, but also come in many more difficult forms. Spanish naming customs, for example, call for two surnames: a paternal surname (which is dominant) and a maternal surname. They are ordered paternal-maternal and are often written without a hyphen, making discrimination problematic. However, for cultural identification purposes it seems as suitable to use the maternal (last) surname as the (strictly correct) paternal surname. While problems will persist (as in Zarragoza-Watkins), this is to some extent unavoidable.
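For the easy, hyphenated case, a minimal Python sketch of the two options (keep only the last barrel, or concatenate the barrels) is given below. The function name handle_double_barrel is hypothetical, and unhyphenated double surnames are not detected by this approach at all.

```python
def handle_double_barrel(surname, concatenate=False):
    """Return (surname, was_double) for a possibly hyphenated surname."""
    parts = [p for p in surname.split("-") if p]
    if len(parts) < 2:
        return surname, False
    if concatenate:
        return "".join(parts), True
    # Keep only the last barrel (the maternal surname under Spanish ordering).
    return parts[-1], True


print(handle_double_barrel("Smith-Jones"))
print(handle_double_barrel("Smith-Jones", concatenate=True))
```

With more than two barrels this sketch keeps only the final one, which is one of several defensible readings of "removing the first barrel".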

Honorifics and Suffixes

Surname data often contains honorifics such as Mr, Mrs, Ms, and Dr, as well as suffixes such as Esq., Jr., Roman numerals (II, III, IV, V, etc.) and occasionally academic qualifications (PhD, MSc, etc.). These need to be removed or separated, and can be classified for gender, education, and other characteristics.

Military, political and class honorifics and suffixes also need treatment. These include Sir, M.P., The Hon., Lord, Lt, Cap., Major, Gen., and so forth. Practically all of these honorifics and suffixes are sufficiently distinct from real names to be considered stop words, at least where context permits: "Major John Major" could have the first "Major" removed, but removing the "Major" from "John Major" would compromise the name string. Coding these stop words for gender, education and other variables of interest is possible.
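The context rule in the "Major John Major" example can be approximated with a simple guard: only strip a stop word if at least two tokens would remain. The Python sketch below works under that assumption. The HONORIFICS and SUFFIXES sets are short illustrative lists rather than the contents of any real stop-word file, multi-word titles such as "The Hon." are not handled, and the function names are hypothetical.

```python
# Illustrative lists; honorifics are only matched at the front of the
# name and suffixes only at the end.
HONORIFICS = {"mr", "mrs", "ms", "dr", "sir", "lord", "lt", "major", "gen"}
SUFFIXES = {"esq", "jr", "ii", "iii", "iv", "phd", "msc", "mp"}


def _key(token):
    """Normalize a token for lookup: drop periods and commas, lowercase."""
    return token.replace(".", "").rstrip(",").lower()


def strip_stop_words(name, min_tokens=2):
    """Return (cleaned_name, removed_tokens), keeping at least min_tokens."""
    tokens = name.split()
    removed = []
    while len(tokens) > min_tokens and _key(tokens[0]) in HONORIFICS:
        removed.append(tokens.pop(0))
    while len(tokens) > min_tokens and _key(tokens[-1]) in SUFFIXES:
        removed.append(tokens.pop())
    return " ".join(tokens).rstrip(","), removed


print(strip_stop_words("Major John Major"))
print(strip_stop_words("John Major"))
```

The removed tokens are returned so that they can later be coded for gender, education, or other variables. A side effect of the guard is that a two-token input such as "Mr Smith" is left untouched.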

Initials and Middle Names

Many name sources provide either middle initials or middle names, or sometimes both. In the case of initials very little information can be deduced (possibly more initials are indicative of higher social class or some such, but this is a blind guess). Middle names could be used in much the same fashion as first names, that is, to deduce gender and possibly an SES (Socio-Economic Status) type variable. However, for the most part this is superfluous information that can be ignored.

Short Names

It is difficult to classify names consisting of single words as either first names or surnames, or as data errors. For this reason single-word names should probably be discarded. While there is an abundance of surnames composed of two or three letters, single-letter surnames are exceedingly rare. As a single-letter surname could be interpreted as an initial in a different format (as in Smith J), it is possible to process single-letter names in some instances, but not as surnames. The analysis of names depends on frequencies of letter combinations; thus a single-letter surname is not meaningful for the analysis.
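A discard test along these lines could be as simple as the Python sketch below. It assumes the name has already been put into "first names then surname" order, so that the surname is the last token; the function name should_discard is hypothetical.

```python
def should_discard(name):
    """True if the name is a single word or its surname has under two letters."""
    tokens = name.split()
    if len(tokens) < 2:
        return True  # cannot tell first name from surname
    letters_in_surname = sum(ch.isalpha() for ch in tokens[-1])
    return letters_in_surname < 2


print(should_discard("Madonna"))
print(should_discard("Smith J"))
print(should_discard("John Ng"))
```

Under that ordering assumption, "Smith J" is marked for discard because its last token is most likely an initial from a reversed format rather than a surname.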

Name Orders and Formats

Some cultures and some datasets routinely reverse (or re-order) the order of names, the most common reversal being Surname, FirstName Initial. Such reversals may or may not be indicated by punctuation and may be systematic across an entire dataset or idiosyncratic to groups or individuals within the dataset. To accommodate this, the normalization script must support idiosyncratic (per-record) reversal options using indicator variables.
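A sketch of how per-record reversal might be undone is given below in Python. It treats a comma as a punctuation cue for the "Surname, FirstName Initial" form and otherwise relies on an indicator passed in by the caller. The function name unreverse and the parameter reversed_flag are hypothetical, and the comma rule assumes suffixes such as "Jr." have already been stripped.

```python
def unreverse(name, reversed_flag=False):
    """Return the name in 'FirstName Initial Surname' order."""
    if "," in name:
        # "Surname, FirstName Initial": the comma marks the reversal.
        surname, _, given = name.partition(",")
        return f"{given.strip()} {surname.strip()}".strip()
    if reversed_flag:
        tokens = name.split()
        if len(tokens) > 1:
            # No punctuation: assume the first token is the surname.
            return " ".join(tokens[1:] + tokens[:1])
    return name.strip()


print(unreverse("Smith, John A"))
print(unreverse("Smith John A", reversed_flag=True))
```

Without punctuation the sketch can only move a single leading token, so multi-word surnames in reversed, comma-free data would need a different rule.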

There are two de facto standard formats (there does not appear to be an ISO standard):

Source | Element 1 | Element 2 | Element 3 | Element 4 | Element 5
US Census Address Data Content Standard | Name Prefix | First Name | Middle Initial | Surname | Name Suffix
Phone Book (Hardcopy) | Last Name | First Name | Middle Initial | |

The Normalization Script

A script for conducting the normalization takes all of the above points into consideration. The sequence of normalization is:

  1. Force the encoding to Latin
  2. Remove Stop Words (default uses: Names-Stopwords.txt)
  3. Mark discards
  4. Remove or concatenate (default) Tussenvoegsel (default uses: Names-Tussenvoegsel.txt)
  5. Remove first barrel (default) or concatenate double-barrelled names
  6. Extract "Surname"
  7. Extract "Firstname Surname" pair
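The sequence above can be illustrated end to end with a compact Python sketch. It is not a port of the script: the word sets are tiny stand-ins for Names-Stopwords.txt and Names-Tussenvoegsel.txt (whose contents are not shown here), the stop-word step ignores context, the function name normalize and the returned dictionary keys are hypothetical, and the input is assumed to be in "FirstName ... Surname" order.

```python
import unicodedata

STOP_WORDS = {"mr", "mrs", "ms", "dr", "jr", "esq"}   # stand-in list
TUSSENVOEGSEL = {"van", "de", "der", "den"}           # stand-in list


def normalize(name):
    # 1. Force the encoding to Latin (letters plus space, period, hyphen).
    text = unicodedata.normalize("NFKD", name)
    text = "".join(c for c in text
                   if (c.isascii() and c.isalpha()) or c in " .-")
    # 2. Remove stop words.
    tokens = [t for t in text.split()
              if t.replace(".", "").lower() not in STOP_WORDS]
    # 3. Mark discards: single-word names and one-letter surnames.
    if len(tokens) < 2 or sum(c.isalpha() for c in tokens[-1]) < 2:
        return {"discard": True, "surname": None, "pair": None}
    first, rest = tokens[0], tokens[1:]
    # 4. Concatenate Tussenvoegsel (the default) onto the surname.
    start = len(rest) - 1
    while start > 0 and rest[start - 1].lower() in TUSSENVOEGSEL:
        start -= 1
    surname = "".join(rest[start:])
    # 5. Remove the first barrel (the default) of a hyphenated surname.
    barrels = [b for b in surname.split("-") if b]
    if barrels:
        surname = barrels[-1]
    # 6 and 7. Extract the surname and the "Firstname Surname" pair.
    return {"discard": False, "surname": surname,
            "pair": f"{first} {surname}"}


print(normalize("Dr. Jan van der Berg"))
print(normalize("Mary Smith-Jones"))
```

Middle names and initials are simply skipped over in this sketch, in line with treating them as information that can be ignored.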

An example command line is:

perl NormalizeSurnames.pl -i=sourcefile.txt -ncol=1 -rcol=3

where -ncol specifies the name column and -rcol specifies the column that indicates whether the name is in reversed format (use -r=1 to force reversals for the entire dataset). Basic help on the script's options is available through:

perl NormalizeSurnames.pl -h