Changes

143 bytes added , 19:23, 20 July 2009

First, many classifiers 'require' a feature matrix of full column rank, so including a variable like the length of the name along with the n-gram frequencies introduces a linear dependence between the columns. Coding EGAN as having length 4 along with the 1-grams E, G, A, and N clearly introduces no new information, because the length is simply the sum of the 1-gram counts. The same is true for the bigrams EG, GA, and AN, or the trigrams EGA and GAN, and so forth. Likewise, coding both bigrams and trigrams introduces no new information.
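The dependence between name length and 1-gram counts can be checked directly. The following is a minimal Python sketch, not part of SurnameFeatures.pl; the sample names and the helper names <tt>unigram_row</tt> and <tt>length_is_redundant</tt> are assumptions made up for illustration.

```python
from collections import Counter

# Illustrative surnames only; any list of strings would do.
names = ["EGAN", "LEE", "ANNA"]

# Column order for the 1-gram count matrix: every letter seen in the sample.
alphabet = sorted(set("".join(names)))


def unigram_row(name, alphabet):
    """Return the 1-gram counts of `name`, one entry per letter in `alphabet`."""
    counts = Counter(name)
    return [counts[ch] for ch in alphabet]


def length_is_redundant(names, alphabet):
    """True if each name's length equals the sum of its 1-gram counts,
    i.e. a length column would be a linear combination of the count columns."""
    return all(sum(unigram_row(n, alphabet)) == len(n) for n in names)


print(alphabet)
print([unigram_row(n, alphabet) for n in names])
print(length_is_redundant(names, alphabet))
```

Because the length column equals the sum of the count columns for every row, adding it to the matrix cannot raise the column rank.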

Second, the assumption of independence among features means that with an n-gram encoding the sequence information is lost. That is, EGA and GAN are assumed to be uncorrelated, though clearly they are not (they overlap by GA). Thus there is potential for improvement by including positional features. One way of denoting the start and end of the string is to add a space to the gram set and delimit the surname with spaces. Thus EGAN would be coded in trigrams as " EG", "EGA", "GAN", and "AN ". As space characters can be difficult to spot and problematic to parse, a hash (#) or underscore (_) is often used in their place.
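The delimited encoding can be sketched in a few lines of Python. This is an illustration under assumptions, not the code used by SurnameFeatures.pl: the function name <tt>padded_ngrams</tt> and its parameters are hypothetical, and a single delimiter character is added on each side, as in the EGAN example.

```python
def padded_ngrams(name, n=3, pad="#"):
    """Return the n-grams of `name` after adding one `pad` character
    to the start and to the end of the string."""
    padded = pad + name + pad
    # Slide a window of width n across the delimited string.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


# Space as the delimiter, matching the trigram example for EGAN.
print(padded_ngrams("EGAN", n=3, pad=" "))

# Hash as a more visible delimiter.
print(padded_ngrams("EGAN", n=3, pad="#"))
```

With the delimiter in place, the first and last grams carry the positional information that plain n-grams lack.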

==Extracting the Features==

An example command line to build the two files and do the reference look-ups is:

<tt> perl SurnameFeatures.pl -i=SourceFile.txt -r=Culture-EganClassification.txt -rcol=6 -rkey=0 -rno=2 -ncol=0 -dcol=5 -rsup=1 -sp=1 -gram=2 -minfq=1 -diag=0 -two=1 </tt>

Here <tt>-rsup</tt> suppresses records that do not have reference look-ups, and <tt>-rkey</tt> and <tt>-rno</tt> specify the key and class number columns in the reference file (here Culture-EganClassification.txt). For simplicity we recommend that country names are standardized to the UN standard and then used as reference keys.
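The reference look-up with suppression can be pictured as a keyed join. The Python sketch below is only an illustration of that idea and does not reproduce SurnameFeatures.pl: the function <tt>attach_classes</tt>, its parameter names, the in-memory row layout, and all sample data are assumptions. The defaults place the key in column 0 and the class number in column 2 of the reference rows, mirroring <tt>-rkey=0</tt> and <tt>-rno=2</tt>.

```python
def attach_classes(records, reference, ref_key_col=0, ref_class_col=2,
                   src_key_col=1, suppress_missing=True):
    """Append a class number to each source record by looking up its key
    in the reference rows; optionally drop records with no match."""
    # Map reference key -> class number.
    lookup = {row[ref_key_col]: row[ref_class_col] for row in reference}
    result = []
    for rec in records:
        cls = lookup.get(rec[src_key_col])
        if cls is None and suppress_missing:
            continue  # no reference look-up for this record: suppress it
        result.append(rec + [cls])
    return result


# Made-up reference rows: key, label, class number.
reference = [["Ireland", "group A", "1"], ["Italy", "group B", "2"]]
# Made-up source records: surname, country key.
records = [["EGAN", "Ireland"], ["ROSSI", "Italy"], ["SMITH", "Atlantis"]]

print(attach_classes(records, reference))
print(attach_classes(records, reference, suppress_missing=False))
```

Standardizing the country names before the join matters because the look-up is an exact match on the key string.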

Anonymous user

imported>Ed

Extracting Features from Surnames (view source)

Revision as of 19:23, 20 July 2009
