Normalizing Surnames
Encodings
Surnames can be represented in many different encodings. For comparison purposes, it is convenient to have all surnames in a single standard encoding, such as the Latin alphabet.
The Latin alphabet offers the advantage of simplicity. There are only 26 letters, A to Z, provided one ignores case (upper or lower), and there are no ligatures or diacritics. Because the number of possible n-grams is (symbols)^n, an encoding with a large number of symbols results in a much higher number of dimensions for the data, even for a small value of n. Furthermore, most datasets used for practical applications are encoded in the Latin alphabet, and a classification system that allows for non-Latin characters would therefore introduce redundancy.
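The growth in dimensions can be illustrated with a few lines of Python. This is only a sketch of the arithmetic: the function name ngram_space is hypothetical, and the 52-symbol alphabet (26 letters with case preserved) is an assumed point of comparison, not a figure from the datasets discussed here.

```python
def ngram_space(symbols: int, n: int) -> int:
    """Number of distinct n-grams over an alphabet of `symbols` characters."""
    return symbols ** n


# Compare the case-insensitive Latin alphabet (26 symbols) with an
# assumed 52-symbol alphabet in which upper and lower case are kept apart.
for n in (1, 2, 3):
    print(n, ngram_space(26, n), ngram_space(52, n))
```

Doubling the alphabet multiplies the number of possible n-grams by 2^n, which is why a small symbol set is attractive even for short n-grams.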
The first stage of normalization is therefore to check that the surname is encoded in the Latin alphabet, with a minimal number of other symbols (such as the period, comma, and hyphen) that may provide meta-information for further normalization, and to force it into the Latin alphabet if it is not. Retaining information about the simplification or removal of ligatures and diacritics (in particular) may be useful, and is accomplished by creating an additional binary variable.
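One possible way to implement this stage is sketched below using Python's standard unicodedata module. The function name to_latin, the LIGATURES table, and the decision to keep spaces alongside the period, comma, and hyphen are all assumptions made for illustration; the actual lists used may differ.

```python
import unicodedata

# Assumed expansions for ligatures and letters that Unicode
# decomposition (NFKD) does not break into base letter + diacritic.
LIGATURES = {
    "Æ": "AE", "æ": "ae", "Œ": "OE", "œ": "oe",
    "ß": "ss", "Ø": "O", "ø": "o", "Ł": "L", "ł": "l",
}

# Period, comma, and hyphen come from the text; the space is an assumption
# so that multi-word surnames survive for the later stages.
ALLOWED_SYMBOLS = set(".,- ")


def to_latin(surname):
    """Return (latin_surname, was_simplified).

    was_simplified is the binary variable recording that a ligature,
    diacritic, or other non-Latin character was changed or removed.
    """
    expanded = "".join(LIGATURES.get(ch, ch) for ch in surname)
    decomposed = unicodedata.normalize("NFKD", expanded)
    kept = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            continue  # drop the diacritic, keep the base letter
        if "A" <= ch <= "Z" or "a" <= ch <= "z" or ch in ALLOWED_SYMBOLS:
            kept.append(ch)
        # any other character is dropped in this sketch
    result = "".join(kept)
    return result, result != surname


print(to_latin("Müller"))
print(to_latin("Smith-Jones"))
```

Characters with no entry in the table and no Latin base letter are simply dropped here; a fuller implementation might transliterate them instead.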
Tussenvoegsel
Tussenvoegsel are surname prefixes such as the words Van and De. The term is specific to the Dutch language but is used here generically for such prefixes in any language. A custom-compiled list of Tussenvoegsel is used in the normalization process. Tussenvoegsel can either be removed (with the removal recorded in a binary variable) or concatenated with the surname.
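Both options can be sketched as follows. The TUSSENVOEGSELS tuple is a short hypothetical stand-in for the custom-compiled list, and the function names and the mode parameter are assumptions made for this example.

```python
# Hypothetical, deliberately short stand-in for the custom-compiled list.
TUSSENVOEGSELS = ("van der", "van den", "van de", "van", "de", "den", "der", "ter", "ten")


def split_tussenvoegsel(surname):
    """Split a surname into (prefix, core); prefix is "" if none is found."""
    words = surname.split()
    lowered = [w.lower() for w in words]
    # Try multi-word prefixes first so "van der" wins over "van".
    for prefix in sorted(TUSSENVOEGSELS, key=lambda p: -len(p.split())):
        parts = prefix.split()
        # Require at least one word to remain after the prefix.
        if len(words) > len(parts) and lowered[:len(parts)] == parts:
            return " ".join(words[:len(parts)]), " ".join(words[len(parts):])
    return "", " ".join(words)


def normalize_tussenvoegsel(surname, mode="remove"):
    """Return (surname, has_prefix); has_prefix is the binary variable.

    mode="remove" drops the prefix; mode="concatenate" joins it to the surname.
    """
    prefix, core = split_tussenvoegsel(surname)
    has_prefix = bool(prefix)
    if mode == "concatenate" and has_prefix:
        core = prefix.replace(" ", "") + core
    return core, has_prefix


print(normalize_tussenvoegsel("Van der Berg"))
print(normalize_tussenvoegsel("Van der Berg", mode="concatenate"))
```

Matching is case-insensitive and works on whole words only, so a surname that merely begins with the letters of a prefix is left untouched.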
Double-barrelled Surnames
Double-barrelled surnames may be hyphenated and easy to detect, such as Smith-Jones, but they also come in forms that are harder to detect. Spanish naming customs, for example, call for two surnames: a paternal surname (which is dominant) and a maternal surname. They are ordered paternal then maternal, and are often written without a hyphen.
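A minimal sketch of detecting both forms is given below. It assumes that Tussenvoegsel have already been removed or concatenated, so that any remaining space separates two surnames; without that step, a name such as Van Dyke would be misread as double-barrelled. The function name and the returned tuple layout are hypothetical.

```python
def split_double_barrelled(surname):
    """Return (first, second, is_double).

    The first surname is treated as the dominant one, matching the
    paternal-maternal ordering described above. Assumes prefixes such as
    "Van" or "De" were handled in an earlier normalization step.
    """
    if "-" in surname:
        parts = [p.strip() for p in surname.split("-") if p.strip()]
    else:
        parts = surname.split()
    if len(parts) >= 2:
        # Anything after the first surname is kept together as the second.
        return parts[0], " ".join(parts[1:]), True
    return surname.strip(), "", False


print(split_double_barrelled("Smith-Jones"))
print(split_double_barrelled("Garcia Marquez"))
print(split_double_barrelled("Smith"))
```

The third element of the result can serve as a binary variable, in the same way as for the earlier normalization stages.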