Difference between revisions of "Normalizing Surnames"

From edegan.com

Revision as of 00:58, 17 June 2009

Encodings

Surnames can be represented in many different encodings. For comparison purposes, it is convenient to have all surnames encoded in a single standard form, such as the basic Latin alphabet.
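One possible way to bring surnames into a single Latin-alphabet form is sketched below in Python. It is only an illustration under stated assumptions: the function name `to_latin` and the `SPECIAL` substitution table are hypothetical, the table is far from complete, and the choice to drop spaces, hyphens and apostrophes is an assumption rather than a rule taken from this page.

```python
import unicodedata

# Hypothetical, incomplete table for letters that Unicode
# decomposition (NFKD) does not split into base letter + accent.
SPECIAL = {"ß": "ss", "æ": "ae", "œ": "oe", "ø": "o", "ł": "l", "þ": "th"}

def to_latin(surname: str) -> str:
    """Reduce a surname to the upper-case letters A to Z."""
    lowered = surname.lower()
    # Replace letters that have no canonical decomposition.
    replaced = "".join(SPECIAL.get(ch, ch) for ch in lowered)
    # Split accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", replaced)
    # Keep only A to Z; this drops accents, spaces and punctuation.
    return "".join(ch for ch in decomposed.upper() if "A" <= ch <= "Z")

print(to_latin("Müller"))
print(to_latin("Strauß"))
```

Because this sketch removes spaces and hyphens, prefixes and double-barrelled names would need to be handled before it is applied.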

The Latin alphabet offers the advantage of simplicity. There are only 26 letter characters, A to Z, provided one ignores case (upper or lower). There are no ligatures or diacritics. As an alphabet of s symbols allows s^n distinct n-grams (26^2 = 676 bigrams and 26^3 = 17,576 trigrams for the Latin alphabet), an encoding with a large number of symbols will result in a much higher number of dimensions for the data, even for a small value of n.
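The effect of alphabet size on the number of possible n-grams can be checked with a short Python sketch. The helper names `ngram_space_size` and `ngrams` are hypothetical, and the 52-symbol case (upper and lower case kept distinct) is only an illustrative comparison.

```python
def ngram_space_size(symbols: int, n: int) -> int:
    """Number of distinct n-grams over an alphabet of `symbols` symbols."""
    return symbols ** n

def ngrams(text: str, n: int) -> list[str]:
    """All overlapping n-grams of `text`, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Compare a 26-letter alphabet with a 52-symbol one (case kept).
for n in (1, 2, 3):
    print(n, ngram_space_size(26, n), ngram_space_size(52, n))

print(ngrams("SMITH", 2))
```

Doubling the alphabet multiplies the number of possible n-grams by 2^n, so the gap widens quickly as n grows.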

Tussenvoegsel


Double-barrelled Surnames

Stop Words