*This page is part of a series in [[Classifying Names by Culture]]

Extracting features from surnames entails encoding the frequency of [http://en.wikipedia.org/wiki/Ngram n-grams] and other features such as the string length. Recall that 1-grams are letters or characters, also called unigrams, 2-grams are called bigrams or digraphs, and 3-grams are called trigrams. In some applications entire words, sentences or other tokens are used as grams.
==Assumption of Independence of Features==
Most classification techniques assume that the features are independent. This has two important bearings on classification using n-grams.
First, many classifiers 'require' a feature matrix of full column rank, so including a variable like the length of the name along with the n-gram frequencies introduces a linear dependence between the columns. Thus coding EGAN as having length 4 along with the 1-grams E, G, A, and N clearly introduces no new information, because the length is simply the sum of the 1-gram counts. The same is true for the bigrams EG, GA, and AN, or the trigrams EGA and GAN, and so forth. Likewise, coding both bigrams and trigrams introduces no new information.

Second, the assumption of independence among features means that with an n-gram encoding the sequence information is lost. That is, EGA and GAN are assumed to be uncorrelated, though clearly they are not (as they overlap by GA). Thus there is a potential for improvement by including positional features. One way of denoting the start and end of the string is to add a space to the gram set and delimit the surname with spaces. Thus EGAN would be coded in trigrams as " EG", "EGA", "GAN", and "AN ". As space characters can be difficult to spot and problematic to parse, a hash (#) or underscore (_) is often used in their place.

==Extracting the Features==
Feature extraction is performed by a dedicated script ([http://www.edegan.com/repository/SurnameFeatures.pl SurnameFeatures.pl]). An example command line is:

<tt>perl SurnameFeatures.pl -i=sourcefile.txt -ncol=0 -dcol=5 -sp=1 -gram=2 -minfq=1 -diag=0</tt>

Here <tt>-sp=1</tt> forces the inclusion of spaces in the character set (which is otherwise a-z), as well as before and after the string, <tt>-minfq</tt> sets the minimum global frequency of occurrence of an n-gram for it to be included in the output, and <tt>-diag=1</tt> produces an additional frequency-of-occurrence diagnostic file.
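The delimited n-gram encoding and the minimum-frequency filter described above can be illustrated with a short Python sketch. This is not the logic of SurnameFeatures.pl, only a minimal illustration under stated assumptions: the function names <tt>ngrams</tt> and <tt>feature_counts</tt> are hypothetical, a hash (#) is assumed as the delimiter, and names are simply upper-cased rather than restricted to a particular character set.

```python
from collections import Counter

def ngrams(name, n=2, pad="#"):
    """Return the character n-grams of a name, delimited at both ends with pad.

    pad="#" stands in for the space delimiter; pad="" disables delimiting.
    """
    s = pad + name.upper() + pad
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def feature_counts(names, n=2, pad="#", minfq=1):
    """Build per-name n-gram count rows (hypothetical helper).

    Only grams whose global frequency across all names is at least
    minfq are kept as columns.
    """
    per_name = {name: Counter(ngrams(name, n, pad)) for name in names}
    total = Counter()
    for counts in per_name.values():
        total.update(counts)
    kept = sorted(g for g, c in total.items() if c >= minfq)
    rows = {name: [counts[g] for g in kept] for name, counts in per_name.items()}
    return kept, rows

# Trigrams of EGAN with start and end markers
print(ngrams("EGAN", 3))

# Without delimiters, the unigram counts sum to the name length,
# which is why a separate length column is linearly dependent on them.
print(sum(Counter(ngrams("EGAN", 1, pad="")).values()) == len("EGAN"))
```

Calling <tt>feature_counts(["EGAN", "EGGE"], n=2, minfq=2)</tt> keeps only the bigrams that occur at least twice across both names, which mirrors the role the text gives to <tt>-minfq</tt>.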
The script has several other useful options, including <tt>-two</tt>, which generates two files: one containing the index, the class (if specified through <tt>-refcol</tt> and a reference file is specified with <tt>-r</tt>) and the gram variables, and another containing the index and all other variables. An example command line to build the two files and do the reference look-ups is:

<tt>perl SurnameFeatures.pl -i=SourceFile.txt -r=Culture-EganClassification.txt -rcol=6 -rkey=0 -rno=2 -ncol=0 -dcol=5 -rsup=1 -sp=1 -gram=2 -minfq=1 -diag=0 -two=1</tt>
Here <tt>-rsup</tt> suppresses records that do not have reference lookups, and <tt>-rkey</tt> and <tt>-rno</tt> specify the key and class number columns in the reference file (here Culture-EganClassification.txt). For simplicity we recommend that country names are standardized to the UN standard and then used as reference keys.