Difference between revisions of "Extracting Features from Surnames"

From edegan.com
imported>Ed

Revision as of 21:33, 9 July 2009

Extracting features from surnames entails encoding the frequencies of n-grams, along with other features such as the string length. Recall that 1-grams are single letters or characters, 2-grams are called bigrams or digraphs, and 3-grams are called trigrams.
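
As a minimal sketch of this encoding (written in Python for illustration, and not taken from the author's script; the names `ngram_counts` and `surname_features` and the feature-key format are assumptions), n-gram frequencies and the string length can be collected like this:

```python
from collections import Counter


def ngram_counts(name, n):
    """Count every n-gram (run of n consecutive characters) in a name."""
    name = name.upper()
    # A name shorter than n yields an empty range, hence an empty Counter.
    return Counter(name[i:i + n] for i in range(len(name) - n + 1))


def surname_features(name, max_n=3):
    """Flatten the length and the 1..max_n-gram frequencies into one dict.

    The "<n>gram:<text>" key format is a hypothetical naming choice.
    """
    features = {"length": len(name)}
    for n in range(1, max_n + 1):
        for gram, count in ngram_counts(name, n).items():
            features[f"{n}gram:{gram}"] = count
    return features


features = surname_features("EGAN")
print(features["length"])       # 4
print(ngram_counts("ANNA", 1))  # A and N each occur twice
```

A dictionary per name is convenient here because most n-grams never occur in any given surname, so the implied feature matrix is very sparse.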

Assumption of Independence of Features

Many (in fact most) classification techniques assume that the features are independent. This has two important consequences for classification using n-grams.

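First, many classifiers 'require' a feature matrix of full column rank, so including a variable like the length of the name along with the n-gram frequencies introduces a linear dependence between the columns. Thus coding EGAN as having length 4 along with the 1-grams E, G, A, and N clearly introduces no new information, because the length is simply the sum of the 1-gram counts. The same is true for the bigrams EG, GA, and AN, or the trigrams EGA and GAN, and so forth: a name of length L (with L at least n) always contains L - n + 1 n-grams. Likewise, coding bigrams in addition to trigrams introduces almost no new information, since each bigram count is the sum of the counts of the trigrams that begin with that bigram, plus one if the name ends in it.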
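
The dependence on name length can be checked directly. The following Python sketch is illustrative only: the sample names are made up for the example, and it assumes surnames contain only the letters A to Z. It builds 1-gram count rows with a length column appended and confirms that the length column is always the sum of the others:

```python
from collections import Counter

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # assumption: names use only A-Z


def unigram_row(name):
    """Return the 26 letter counts for a name, in alphabetical order."""
    counts = Counter(name.upper())
    return [counts.get(letter, 0) for letter in ALPHABET]


def total_ngrams(name, n):
    """Number of n-grams in a name: len(name) - n + 1, never below zero."""
    return max(len(name) - n + 1, 0)


names = ["EGAN", "SMITH", "ANNA", "LEE"]  # hypothetical sample data

# Each row holds the 26 letter counts followed by the name length.
rows = [unigram_row(name) + [len(name)] for name in names]

# The last column equals the sum of the first 26 in every row, so the
# matrix cannot have full column rank once the length is included.
length_is_redundant = all(sum(row[:-1]) == row[-1] for row in rows)
print(length_is_redundant)  # True
```

The same argument applies to bigrams and trigrams once an intercept is present, since `total_ngrams(name, n)` differs from the length only by the constant `n - 1`.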

Second, the assumption of independence among features means that with an n-gram encoding the sequence information is lost. That is, EGA and GAN are assumed to be uncorrelated, though clearly they are not (they overlap by GA). Thus there is potential for improvement by including positional features. One way of denoting the start and end of the string is to add a space to the gram set and delimit each surname with spaces. Thus EGAN would be coded in trigrams as " EG", "EGA", "GAN", and "AN ".
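
The space-delimited coding described above can be sketched in a few lines of Python. This is an illustration rather than the author's implementation; the function name `padded_ngrams` and its `pad` parameter are assumptions:

```python
def padded_ngrams(name, n, pad=" "):
    """List the n-grams of a name delimited by one pad character per side.

    The pad marks the start and end of the string, so grams that touch a
    boundary become distinct features.
    """
    padded = f"{pad}{name.upper()}{pad}"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


trigrams = padded_ngrams("EGAN", 3)
print(trigrams)  # [' EG', 'EGA', 'GAN', 'AN ']
```

With this coding, " EG" and "AN " record that EG starts the name and AN ends it, which is positional information that the unpadded trigrams EGA and GAN do not carry.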

Extracting the Features

Feature extraction is performed by a dedicated script, SurnameFeatures.pl (http://www.edegan.com/repository/SurnameFeatures.pl).