Difference between revisions of "Normalizing Surnames"
Latest revision as of 02:49, 30 July 2009
- This page is part of the Classifying Names by Culture series
Encodings
Surnames can be represented in many different encodings. For comparison purposes, it is convenient to have surnames encoded in a single standard encoding, such as the Latin alphabet.
The Latin alphabet offers the advantage of simplicity. There are only 26 letter characters, A to Z, provided one ignores case (upper or lower). There are no ligatures or diacritics. As there are (symbols)^n possible n-grams, an encoding with a large number of symbols will result in a much higher number of dimensions for the data, even for a small value of n. Furthermore, most datasets used for practical applications are encoded in the Latin alphabet, and a classification system that allows for non-Latin characters would therefore introduce redundancy.
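To make the dimensionality argument concrete, the following Python sketch counts the possible n-grams for the 26-letter alphabet and for a larger alphabet. The figure of 60 symbols is an illustrative assumption, not a property of any particular encoding or dataset.

```python
def ngram_space(symbols: int, n: int) -> int:
    """Number of distinct n-grams over an alphabet with `symbols` characters."""
    return symbols ** n


if __name__ == "__main__":
    # Compare the 26-letter Latin alphabet with a hypothetical 60-symbol one.
    for n in (1, 2, 3):
        print(n, ngram_space(26, n), ngram_space(60, n))
```

Even at n = 3 the hypothetical 60-symbol alphabet yields 216,000 possible trigrams against 17,576 for the plain Latin alphabet.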
The first stage of normalization is therefore to check that the encoding is in the Latin alphabet, with a minimal number of other symbols (such as the period, comma, and hyphen) that may provide meta-information for further normalization, and to force it into the Latin alphabet if it isn't. Maintaining information about the simplification or removal of ligatures and diacritics (in particular) may be useful, and is accomplished through the creation of additional binary variables.
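A minimal Python sketch of this first stage might look as follows. It is not taken from the normalization script: the function name, the small ligature map, and the choice to keep spaces alongside the period, comma, and hyphen are all assumptions made for illustration.

```python
import unicodedata
from typing import Tuple

# Characters that Unicode decomposition leaves intact.
# Illustrative only; a real map would need to be far more complete.
LIGATURES = {"æ": "ae", "Æ": "AE", "œ": "oe", "Œ": "OE",
             "ß": "ss", "ø": "o", "Ø": "O"}

# Assumed set of non-letter symbols worth keeping for later stages.
ALLOWED_PUNCT = set(" .,-")


def force_latin(name: str) -> Tuple[str, bool]:
    """Return (latin_name, was_changed).

    was_changed plays the role of the binary variable recording that
    ligatures, diacritics, or other symbols were simplified or removed.
    """
    expanded = "".join(LIGATURES.get(ch, ch) for ch in name)
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", expanded)
    kept = [ch for ch in decomposed
            if (ch.isascii() and ch.isalpha()) or ch in ALLOWED_PUNCT]
    latin = "".join(kept)
    return latin, latin != name
```

For example, `force_latin("Müller")` gives `("Muller", True)`, while a name already in the Latin alphabet, such as `"Smith-Jones"`, is returned unchanged with the flag set to `False`.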
Tussenvoegsel
Tussenvoegsel are surname prefixes, such as the words Van and De. The term is specifically Dutch but is used here generically. A custom-compiled list of Tussenvoegsel is used in the normalization process. Tussenvoegsel can be removed (and recorded with a binary variable) or concatenated with the surname. Note that in some sources Tussenvoegsel can be identified by their lack of capitalization.
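The two options, removal with a recorded flag or concatenation, can be sketched in Python as below. The prefix set is a tiny hypothetical sample rather than the custom-compiled list, and the function name and token-list input are assumptions.

```python
from typing import List, Tuple

# Hypothetical sample only; the real list is maintained separately.
TUSSENVOEGSEL = {"van", "de", "der", "den", "von"}


def handle_tussenvoegsel(tokens: List[str],
                         concatenate: bool = True) -> Tuple[str, bool]:
    """Return (surname, had_prefix) for the tokens of a surname field."""
    rest = list(tokens)
    prefixes = []
    # Peel off leading prefix words, always leaving at least one token.
    while len(rest) > 1 and rest[0].lower() in TUSSENVOEGSEL:
        prefixes.append(rest.pop(0))
    core = " ".join(rest)
    surname = "".join(prefixes) + core if concatenate else core
    return surname, bool(prefixes)
```

With these assumptions, `handle_tussenvoegsel(["Van", "der", "Berg"])` returns `("VanderBerg", True)`, and passing `concatenate=False` returns `("Berg", True)`.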
Double-barrelled Surnames
Double-barrelled surnames may be hyphenated and easy to detect, such as Smith-Jones, but they also come in many more difficult forms. Spanish Naming Customs, for example, suggest the use of two surnames: a paternal surname (that is dominant) and a maternal surname. They are ordered paternal-maternal, and often appear without the hyphen, making discrimination problematic. However, for cultural identification purposes it seems as suitable to use the maternal (last) surname as to use the (strictly correct) paternal surname. While problems will persist (as in Zarragoza-Watkins), this is to some extent unavoidable.
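A rough Python sketch of the two treatments (keep the last barrel, or concatenate) follows. It assumes the input is already an isolated surname field, so that any remaining spaces separate barrels; the function name and the `keep` parameter are hypothetical.

```python
from typing import Tuple


def resolve_double_barrel(surname: str, keep: str = "last") -> Tuple[str, bool]:
    """Return (surname, was_double_barrelled).

    keep="last" drops every barrel but the final one;
    keep="concat" joins all barrels into a single string.
    """
    # Treat hyphens and spaces alike as barrel separators.
    barrels = surname.replace("-", " ").split()
    if len(barrels) < 2:
        return surname, False
    if keep == "last":
        return barrels[-1], True
    if keep == "concat":
        return "".join(barrels), True
    raise ValueError("keep must be 'last' or 'concat'")
```

Under this sketch `"Smith-Jones"` becomes `"Jones"`, and a mixed case such as Zarragoza-Watkins is reduced to its last barrel, which illustrates the unavoidable loss noted above.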
Honorifics and Suffixes
Surname data often contains honorifics such as Mr, Mrs, Ms, and Dr, as well as suffixes such as Esq., Jr., Roman numerals (II, III, IV, V, etc.) and occasionally academic qualifications (PhD, MSc, etc.). These need to be removed or separated, and can be classified for gender, education, and other characteristics.
Military, political and class honorifics and suffixes also need treatment. These include Sir, M.P., The Hon., Lord, Lt, Cap., Major, Gen., and so forth. Practically all of these honorifics and suffixes are sufficiently distinct from real names to be considered stop words, at least when context permits (e.g. "Major John Major" could have the first "Major" removed, but removing the "Major" from "John Major" would compromise the name-string). Coding these stop words for gender, education and other variables of interest is possible.
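One simple way to respect context is to strip honorifics only from the front of the name and suffixes only from the end. The Python sketch below does this; the two word sets are small illustrative samples, not the stop word list used in practice, and the function names are assumptions.

```python
from typing import List, Tuple

# Illustrative samples only. Multi-word forms such as "The Hon." are not handled.
PREFIXES = {"mr", "mrs", "ms", "dr", "sir", "lord", "lt", "cap", "major", "gen"}
SUFFIXES = {"esq", "jr", "ii", "iii", "iv", "phd", "msc", "mp"}


def _key(token: str) -> str:
    """Lower-case a token and drop periods and commas for lookup."""
    return token.replace(".", "").replace(",", "").lower()


def strip_honorifics(name: str) -> Tuple[str, List[str]]:
    """Return (cleaned_name, removed_tokens)."""
    tokens = name.split()
    removed = []
    # Leading honorifics: stop at the first token that is not one.
    while len(tokens) > 1 and _key(tokens[0]) in PREFIXES:
        removed.append(tokens.pop(0))
    # Trailing suffixes: "Major" is not in SUFFIXES, so "John Major" survives.
    while len(tokens) > 1 and _key(tokens[-1]) in SUFFIXES:
        removed.append(tokens.pop())
    return " ".join(tokens), removed
```

Here `strip_honorifics("Major John Major")` yields `("John Major", ["Major"])`, while `"John Major"` passes through untouched. The removed tokens are returned so that they can later be coded for gender, education, or similar variables.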
Initials and Middle Names
Many name sources provide either middle initials or middle names, or sometimes both. In the case of initials very little information can be deduced (possibly more initials are indicative of higher social class or some such, but this is a blind guess). Middle names could be used in much the same fashion as first names, that is, to deduce gender and possibly an SES (Socio-Economic Status) type variable. However, for the most part this is superfluous information that can be ignored.
Short Names
It is difficult to classify names consisting of single words as either first names or surnames, or as data errors. For this reason single-word names should probably be discarded. While there is an abundance of surnames composed of two or three letters, single-letter surnames are exceedingly rare. Because a single-letter surname may really be an initial from a differently formatted name (as in Smith J), single-letter names can be processed in some instances, but not as surnames. The analysis of names depends on frequencies of letter combinations; thus a single-letter surname is not meaningful for the analysis.
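These discard rules can be expressed as a small check. The Python sketch below is an assumption about how such a check might look; the function name and the reason strings are hypothetical.

```python
from typing import Optional


def discard_reason(name: str, surname: str) -> Optional[str]:
    """Return why a record should be discarded, or None to keep it.

    `name` is the full name string; `surname` is the candidate surname
    already extracted from it.
    """
    if len(name.split()) < 2:
        return "single-word name"
    if len(surname.strip(".")) < 2:
        return "single-letter surname"
    return None
```

A two-letter surname such as Ng is kept, whereas the "J" in "Smith J" is flagged rather than treated as a surname.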
Name Orders and Formats
Some cultures and some datasets routinely reverse (or re-order) the order of names; the most common reversal being Surname, FirstName Initial. Such reversals may or may not be indicated by punctuation and may be systematic across an entire dataset or idiosyncratic to groups or individuals within the dataset. To accommodate this, the normalization script must support idiosyncratic reversal options using indicator variables.
There are two de facto standard formats (there does not appear to be an ISO standard):
Source | Element 1 | Element 2 | Element 3 | Element 4 | Element 5 |
---|---|---|---|---|---|
US census Address Data Content Standard | Name Prefix | First Name | Middle Initial | Surname | Name Suffix |
Phone Book (Hardcopy) | Last Name | First Name | Middle Initial | | |
Note: The US census Address Data Content Standard was managed by the Federal Geographic Data Committee but is now discontinued.
The phone book format is most commonly encountered as "Surname, Firstname I." We refer to a name in this form as a comma format name.
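A minimal Python sketch of handling both formats, with an optional per-record reversal indicator, is given below. The function name and the `reversed_format` parameter are assumptions for illustration. When no indicator is supplied, the presence of a comma is used as the signal.

```python
from typing import Optional, Tuple


def parse_name(raw: str,
               reversed_format: Optional[bool] = None) -> Tuple[str, str]:
    """Return (first_name, surname) from a raw name string."""
    if reversed_format is None:
        # No indicator supplied: infer the reversal from punctuation.
        reversed_format = "," in raw
    if reversed_format and "," in raw:
        # Comma format name: "Surname, Firstname I."
        surname, _, rest = raw.partition(",")
        surname = surname.strip()
        given = rest.split()
    else:
        tokens = raw.split()
        if not tokens:
            return "", ""
        if reversed_format:
            # Reversed without punctuation: "Surname Firstname I"
            surname, given = tokens[0], tokens[1:]
        else:
            # Census-style order: "Firstname I. Surname"
            surname, given = tokens[-1], tokens[:-1]
    first = given[0] if given else ""
    return first, surname
```

Both `parse_name("Smith, John Q.")` and `parse_name("John Q. Smith")` return `("John", "Smith")`. A reversed name with no comma, such as `"Smith John"`, needs the indicator set explicitly.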
The Normalization Script
A script for conducting the normalization (http://www.edegan.com/repository/NormalizeSurnames.pl) takes all of the above points into consideration. The sequence of normalization is:
1. Force the encoding to Latin
2. Remove Stop Words (default uses: Names-Stopwords.txt, http://www.edegan.com/repository/Names-Stopwords.txt)
3. Remove or concatenate (default) Tussenvoegsel (default uses: Names-Tussenvoegsel.txt, http://www.edegan.com/repository/Names-Tussenvoegsel.txt). Note that this step does not apply to comma format names.
4. Remove first barrel (default) or concatenate double-barrelled names
5. Mark discards
6. Extract "Surname"
7. Extract "Firstname Surname" pair
An example command line is:
perl NormalizeSurnames.pl -i=sourcefile.txt -ncol=1 -rcol=3
where -ncol specifies the name column and -rcol specifies the column indicating whether the name is in reversed format (use -r=1 to force reversals for the entire dataset). Basic help on the script's options is available through
perl NormalizeSurnames.pl -h