Changes

473 bytes added , 02:49, 30 July 2009


*This page is part of the [[Classifying Names by Culture]] series.

==Encodings==

Surnames can be represented in many different encodings. For comparison purposes, it is convenient to have all surnames encoded in a single standard encoding, such as the Latin alphabet.
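As a rough illustration of forcing names into a single Latin form, the sketch below folds accented letters to plain ASCII using Python's standard <code>unicodedata</code> module. The function name <code>to_latin</code> is hypothetical, and treating "Latin" as "unaccented ASCII" is an assumption, not necessarily what the normalization script does.

```python
import unicodedata

def to_latin(name: str) -> str:
    """Fold accented Latin letters to plain ASCII (an assumed reading of 'Latin')."""
    # NFKD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks and keep the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Discard anything still outside ASCII; names in non-Latin scripts would
    # need a real transliteration table, which this sketch does not provide.
    return stripped.encode("ascii", "ignore").decode("ascii")

print(to_latin("Müller"))
print(to_latin("Núñez"))
```

Letters that have no Unicode decomposition (for example "Ø") are simply dropped by this approach, so it is only suitable as a first approximation.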

|Phone Book (Hardcopy) || Last Name || First Name || Middle Initial || ||

|}

Note: The US census Address Data Content Standard was managed by the [http://www.fgdc.gov/standards/projects/FGDC-standards-projects/addressstandard/ Federal Geographic Data Committee] but is now discontinued.

The phone book format is most commonly encountered as "Surname, Firstname I.", where "I." is the middle initial. We refer to a name written this way as a comma-format name.
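A comma-format name can be split into its parts with a few lines of Python. This is a minimal sketch: the function name <code>parse_comma_name</code> and the returned keys are hypothetical, and it assumes the middle initial, when present, is a single trailing letter with an optional period.

```python
def parse_comma_name(name: str) -> dict:
    """Split 'Surname, Firstname I.' into surname, first name and middle initial."""
    # Everything before the first comma is taken as the surname.
    surname, _, rest = name.partition(",")
    parts = rest.split()
    firstname = parts[0] if parts else ""
    # Assumption: a trailing single letter (optionally followed by a period)
    # is the middle initial.
    initial = ""
    if len(parts) > 1 and len(parts[-1].rstrip(".")) == 1:
        initial = parts[-1].rstrip(".")
    return {"surname": surname.strip(), "firstname": firstname, "initial": initial}

print(parse_comma_name("Smith, John Q."))
```

Names without a comma come back with the whole string as the surname, so callers would need to check for the comma first if other formats are mixed in.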

==The Normalization Script==

# Force the encoding to Latin

# Remove Stop Words (default uses: [http://www.edegan.com/repository/Names-Stopwords.txt Names-Stopwords.txt])

# Remove or concatenate (default) Tussenvoegsel (default uses: [http://www.edegan.com/repository/Names-Tussenvoegsel.txt Names-Tussenvoegsel.txt]). Note that this step does not apply to comma-formatted names.

# Remove first barrel (default) or concatenate double-barrelled names

# Mark discards


# Extract "Surname"

# Extract "Firstname Surname" pair
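The steps listed above can be sketched in Python as follows. This is an illustration under stated assumptions, not the normalization script itself: the function <code>normalize</code>, its parameters, and the returned keys are hypothetical; the two word sets are small stand-ins for the contents of Names-Stopwords.txt and Names-Tussenvoegsel.txt; input is assumed to be in "Firstname Surname" order (not comma format); and a name is marked as a discard only when no tokens survive.

```python
import unicodedata

# Illustrative stand-ins only; the real lists are in Names-Stopwords.txt and
# Names-Tussenvoegsel.txt, whose contents are not reproduced here.
STOP_WORDS = {"dr", "mr", "mrs", "jr", "sr"}
TUSSENVOEGSEL = {"van", "de", "der", "den", "von"}

def normalize(name, concat_tussenvoegsel=True, concat_barrels=False):
    # Force the encoding to Latin (assumed here to mean unaccented ASCII).
    folded = unicodedata.normalize("NFKD", name)
    folded = "".join(c for c in folded if not unicodedata.combining(c))
    folded = folded.encode("ascii", "ignore").decode("ascii")

    # Remove stop words, ignoring case and trailing punctuation.
    tokens = [t for t in folded.split()
              if t.lower().strip(".,") not in STOP_WORDS]

    # Tussenvoegsel: glue onto the following token (default) or drop.
    merged, prefix = [], ""
    for tok in tokens:
        if tok.lower() in TUSSENVOEGSEL:
            if concat_tussenvoegsel:
                prefix += tok
            continue
        merged.append(prefix + tok)
        prefix = ""

    # Double-barrelled surname: remove the first barrel (default) or concatenate.
    if merged and "-" in merged[-1]:
        barrels = [b for b in merged[-1].split("-") if b]
        if len(barrels) > 1:
            merged[-1] = "".join(barrels if concat_barrels else barrels[1:])

    # Mark discards (assumption: nothing usable is left).
    discard = not merged

    # Extract "Surname" and the "Firstname Surname" pair.
    surname = merged[-1] if merged else ""
    pair = f"{merged[0]} {surname}" if len(merged) > 1 else ""
    return {"surname": surname, "pair": pair, "discard": discard}

print(normalize("Dr. Vincent van Gogh"))
print(normalize("Mary Smith-Jones"))
```

The keyword arguments mirror the "(default)" choices in the list: Tussenvoegsel are concatenated unless <code>concat_tussenvoegsel=False</code>, and the first barrel is removed unless <code>concat_barrels=True</code>.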

Anonymous user

imported>Ed


Normalizing Surnames (view source)

Revision as of 02:49, 30 July 2009
