We are primarily interested in data sources that record both a person's surname and their country of birth.
 
==Olympic Athletes==
<tt>perl NormalizeSurnames.pl -i=Olympics-RawOutputWithUNReversal.txt -ncol=1 -rcol=3</tt>
The resultant output ([http://www.edegan.com/repository/Olympics-RawOutputWithUNReversalShortRawOutputWithUNReversal-Normalized.txt Olympics-RawOutputWithUNReversalShortRawOutputWithUNReversal-Normalized.txt]) was used to create the n-gram variables.

==Internet Movie Database (IMDB)==

A list of all actors and their birth countries was extracted from the [http://www.imdb.com/interfaces#plain IMDB biographies file] using a simple one-off script ([http://www.edegan.com/repository/IMDB-ExtractNamesAndCountries.pl IMDB-ExtractNamesAndCountries.pl]). This produced the input file [http://www.edegan.com/repository/IMDB-BioData.txt IMDB-BioData.txt].

As with the Olympics data, the country names were then corrected to the UN standard name, and individuals born in non-recognized jurisdictions, such as on a cruise ship at sea, were excluded (see [http://www.edegan.com/repository/IMDB-BiosUNCountryCodes.txt IMDB-BiosUNCountryCodes.txt]).

The NormalizeSurnames.pl script was run with the following options (other options left at their defaults):

<tt>perl NormalizeSurnames.pl -i=IMDB-BiosUNCountryCodes.txt -ncol=1 -comma=1</tt>

The resultant output ([http://www.edegan.com/repository/IMDB-BiosUNCountryCodes-Normalized.txt IMDB-BiosUNCountryCodes-Normalized.txt]) was used to create the n-gram variables.
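The country correction and the n-gram step described above could look roughly like the following Python sketch. It is an illustration only, not the scripts used for this project: the lookup table <tt>UN_NAMES</tt>, the function names, and the choice of lowercased character bigrams are all assumptions, since the page does not specify how the n-gram variables were defined.

```python
from collections import Counter

# Hypothetical lookup from raw country strings to UN standard names.
# A value of None marks a birthplace that is not a recognized jurisdiction.
UN_NAMES = {
    "USA": "United States of America",
    "UK": "United Kingdom of Great Britain and Northern Ireland",
    "Russia": "Russian Federation",
    "At sea": None,
}


def standardize(records, lookup=UN_NAMES):
    """Yield (surname, country) pairs with corrected country names.

    Names missing from the lookup are assumed to be standard already and
    pass through unchanged; names mapped to None are dropped.
    """
    for surname, raw_country in records:
        key = raw_country.strip()
        country = lookup.get(key, key)
        if country is None:
            continue
        yield surname, country


def char_ngrams(surname, n=2):
    """Return the character n-grams of a lowercased surname, in order."""
    s = surname.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def ngram_counts_by_country(pairs, n=2):
    """Count character n-grams of surnames separately for each country."""
    counts = {}
    for surname, country in pairs:
        counts.setdefault(country, Counter()).update(char_ngrams(surname, n))
    return counts
```

For example, <tt>ngram_counts_by_country(standardize(records))</tt> would return one counter of bigrams per UN country name, which could then be turned into per-country frequency variables.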
==World Leaders==