Classifying Names by Culture
Individuals' names contain information about their ethnic ancestry and culture (broadly defined). The purpose of this project is to create a classifier that, given an individual's name, can deduce their culture with good accuracy.
Classification techniques use variation in the features of their subjects to predict classes. In the classic example (see R.A. Fisher 1936), a classifier for species of iris used the features "petal width", "petal length", and so forth. In our context, features are properties of names, such as the length of the name string and the frequency of occurrence of n-grams.
An n-gram is a sequence of "n" consecutive characters (a gram of length "n"). For example, using 2-grams, also called bigrams or digraphs, the surname "EGAN" has a frequency of one for each of the grams EG, GA, and AN, and a frequency of zero for all other grams from AA to ZZ.
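The bigram example can be reproduced with a few lines of Python. This is a minimal sketch rather than the project's own code: the function name ngram_counts is hypothetical, and it assumes the surname is already an uppercase string with no spaces.

```python
from collections import Counter


def ngram_counts(name, n=2):
    """Count every run of n consecutive characters in name.

    Hypothetical helper for illustration; assumes name is already
    normalized (uppercase, no spaces).
    """
    return Counter(name[i:i + n] for i in range(len(name) - n + 1))


counts = ngram_counts("EGAN")
# A Counter returns 0 for any gram that never occurs, so grams such as
# "AA" or "ZZ" do not need to be stored explicitly.
print(counts["EG"], counts["GA"], counts["AN"], counts["AA"])  # prints: 1 1 1 0
```

Storing only the grams that occur keeps the representation sparse; a dense vector over all 676 bigrams can be built from it when a classifier needs fixed-length input.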
The process follows these broad steps:
- Sources of Surname Data: Various sources of surname data, with their classifications already known, are needed for training and testing the classifier.
- Normalizing Surnames: Before we can extract features from names, they must be in a standardized format, such as just a surname encoded in the Latin character set with no spaces.
- Extracting Features from Surnames: Given a standardized input, we can extract a number of features from our names, such as the n-grams.
- Culture Based Classifications: Determine which culture-based classification to use.
- Training Classifiers: Train a classifier to use the features to predict the classes.
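The normalization and feature-extraction steps above can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the names normalize_surname, BIGRAMS, and feature_vector are hypothetical, the normalization rule (strip accents, uppercase, keep only A to Z) is one possible choice, and the feature set is limited to name length plus bigram counts.

```python
import unicodedata
from collections import Counter
from itertools import product
from string import ascii_uppercase


def normalize_surname(raw):
    """Uppercase, strip accents, and drop anything that is not A-Z.

    Assumed rule for illustration; other normalizations are possible.
    """
    # NFKD splits accented letters into a base letter plus combining marks,
    # so the filter below keeps the base letter and discards the mark.
    decomposed = unicodedata.normalize("NFKD", raw)
    return "".join(ch for ch in decomposed.upper() if ch in ascii_uppercase)


# All 676 possible bigrams in a fixed order, AA through ZZ.
BIGRAMS = ["".join(pair) for pair in product(ascii_uppercase, repeat=2)]


def feature_vector(surname):
    """Return [name length, count of AA, count of AB, ..., count of ZZ]."""
    counts = Counter(surname[i:i + 2] for i in range(len(surname) - 1))
    return [len(surname)] + [counts[gram] for gram in BIGRAMS]


name = normalize_surname("O'Brien")   # "OBRIEN"
vector = feature_vector(name)         # 677 numbers: length plus 676 bigram counts
```

Each surname then maps to a fixed-length numeric vector, which is the form a classifier expects for both training and prediction.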
Venture Capital Data
Our first major academic application of classifying names by culture is to examine the importance of cultural homogeneity in venture capital contract formation.
The following pages provide details on venture capital data:
- STAT520 (UBC): Cultural Identification using Linear Discriminant Analysis on Surnames (Download DOC)