Sources of Surname Data

From edegan.com
This page is part of the series Classifying Names by Culture.

Latest revision as of 18:09, 15 July 2009

We are primarily interested in sources of surname data that contain both surnames and the countries of birth of the people who bear them, for training and testing purposes.

Olympic Athletes

The Olympic athletes data were taken from the relevant pages on Wikipedia. Local copies of all of the individual country pages from the 2004 Summer Games were retrieved by a script (Olympics-RetrievePageURLs.pl) that uses an offline version (Olympics-OfflineSource.html) of the Wikipedia "Nations at the 2004 Summer Olympics" page. This script also constructs a list of participating countries (Olympics-ParticipatingCountries.txt).

The offline pages were then parsed by another script (Olympics-ExtractOlypiads.pl), and the resulting output (Olympics-RawOutput.txt) was checked by hand. This output is the basic set of names with countries for the 2004 Olympic athletes used here. Because some individuals competed in multiple events, identical full name strings were collapsed to produce a single record with a count. It seems unlikely that many distinct athletes share an identical full name (several different John Joe Smiths, say), so this reduction should rarely be erroneous. Users of these scripts should note that the Wikipedia source pages have likely changed and should check results carefully.
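The collapsing step described above can be sketched as follows. This is a minimal illustration rather than the original script: the function name, the sample rows, and the choice to key on the (full name, country) pair instead of the name alone are all assumptions.

```python
from collections import Counter

def collapse_names(records):
    """Collapse identical (full_name, country) pairs into one record with a count.

    Keying on the pair rather than the name alone is an assumption made here.
    """
    counts = Counter(records)  # preserves first-seen order
    return [(name, country, n) for (name, country), n in counts.items()]

# Hypothetical rows: one entry per athlete per event.
rows = [
    ("John Joe Smith", "United States"),
    ("John Joe Smith", "United States"),
    ("Maria Garcia", "Spain"),
]
collapsed = collapse_names(rows)
```

An athlete entered in two events therefore appears once, with a count of 2.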

The country names were then corrected to the UN GeoRegions and coded using SQL scripts, and countries with idiosyncratic name reversals were marked to produce a normalization input file (Olympics-RawOutputWithUNReversal.txt). The NormalizeSurnames.pl script was run with the following options (and defaults):

perl NormalizeSurnames.pl -i=Olympics-RawOutputWithUNReversal.txt -ncol=1 -rcol=3
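The internals of NormalizeSurnames.pl are not described on this page, so the following is only a guess at the kind of processing involved. It assumes that -ncol names the zero-based column holding the full name, that -rcol names a column holding a "1"/"0" flag for countries whose names are written surname-first, and that the surname is simply the first or last whitespace-separated token. The function names and sample rows are hypothetical.

```python
def extract_surname(full_name, reversed_order=False):
    """Return the surname token from a whitespace-separated full name.

    Assumption: the surname is the last token, or the first token when the
    country is flagged as writing names surname-first.
    """
    parts = full_name.split()
    if not parts:
        return ""
    return parts[0] if reversed_order else parts[-1]

def normalize_rows(rows, ncol, rcol):
    """Append an extracted surname to each row (a list of column strings)."""
    out = []
    for row in rows:
        is_reversed = row[rcol] == "1"  # assumed flag encoding
        out.append(row + [extract_surname(row[ncol], is_reversed)])
    return out

# Hypothetical rows: country, full name, country code, reversal flag.
sample = [
    ["China", "Liu Xiang", "CHN", "1"],
    ["United States", "Michael Phelps", "USA", "0"],
]
normalized = normalize_rows(sample, ncol=1, rcol=3)
```

Real names with particles, suffixes, or multi-word surnames would need more care than this token rule provides.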

The resultant output (Olympics-RawOutputWithUNReversal-Normalized.txt) was used to create the n-gram variables.
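The page does not say how the n-gram variables were built. One common construction, shown here purely as an assumed sketch, is to count overlapping character n-grams of each normalized surname, padding with a boundary marker so that word-initial and word-final sequences are distinguishable. The function names, the lowercasing, and the "_" pad character are illustrative choices, not taken from the original scripts.

```python
from collections import Counter

def char_ngrams(surname, n=2, pad="_"):
    """Return overlapping character n-grams of a surname, with boundary padding."""
    s = pad + surname.lower() + pad
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def ngram_counts(surname, n=2):
    """Count each n-gram, giving one candidate feature vector per surname."""
    return Counter(char_ngrams(surname, n))

bigrams = char_ngrams("Egan", n=2)
```

For "Egan" with n=2 this yields the bigrams _e, eg, ga, an, and n_.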

Other classifications (aside from UN GeoRegions) can be added using their UN country name lookup tables. For example, the output with the (Egan 2009) CultureClass classification is Olympics-RawOutputWithUNReversalCultureclass-Normalized.txt.
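Adding a classification through a lookup table amounts to a keyed join on the UN country name. The sketch below assumes the lookup is available as a mapping from UN country name to class label; the function name, the column position, and the placeholder labels are hypothetical and are not actual CultureClass categories.

```python
def add_classification(rows, lookup, country_col=0):
    """Append a class label to each row by looking up its UN country name.

    Rows whose country is missing from the lookup get an empty label.
    """
    out = []
    for row in rows:
        label = lookup.get(row[country_col], "")
        out.append(row + [label])
    return out

# Placeholder lookup table and rows (labels are illustrative only).
lookup = {"Ireland": "class_1", "Japan": "class_2"}
rows = [["Ireland", "egan"], ["Japan", "sato"], ["Atlantis", "nemo"]]
classified = add_classification(rows, lookup)
```

Leaving unmatched rows with an empty label, rather than dropping them, makes gaps in a lookup table easy to spot.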

Internet Movie Database (IMDB)

A list of all actors and their birth countries was extracted from the IMDB biographies file using a simple one-off script (IMDB-ExtractNamesAndCountries.pl). This produced the input file IMDB-BioData.txt (http://www.edegan.com/repository/IMDB-BioData.txt). Much as with the Olympics data, the country names were then corrected to the UN GeoRegions, and individuals who were born in non-recognized jurisdictions, such as on a cruise ship at sea, were excluded (see IMDB-BiosUNCountryCodes.txt).

A small percentage of actors have changed their names or use stage names. Care was taken to record actors' original birth names where available.

The NormalizeSurnames.pl script was run with the following options (and defaults):

perl NormalizeSurnames.pl -i=IMDB-BiosUNCountryCodes.txt -ncol=1 -comma=1

The resultant output (IMDB-BiosUNCountryCodes-Normalized.txt) was used to create the n-gram variables.

World Leaders

A request to the CIA to use a web-bot to scrape data from the HTML version of the CIA World Factbook received no response. World leader information was instead downloaded in PDF format from the CIA World Leaders PDF site for April 2008 and converted into a plain-text file (WorldLeaders-Raw.txt).

The raw file was then reprocessed by a one-off script to produce an extracted file (WorldLeaders-Extracted.txt). The country codes were corrected using a look-up table (WorldLeaders-UNCountryLookup.txt) to produce the resulting basic dataset (WorldLeaders-ExtractedUNCountry.txt). Users should note that a very small number of individuals had invalidly coded countries (mostly countries that were recognized by the CIA but not by the UN) and were excluded in this process. Furthermore, some leaders held multiple positions in their governments and so had multiple listings; these were collapsed to a single record.
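The correction, exclusion, and collapsing steps above might look like the following. This is an assumed sketch, not the one-off script itself: the function name, the record layout, and the placeholder codes and names are hypothetical.

```python
def to_un_countries(records, lookup):
    """Map source country codes to UN countries, drop unmapped rows, de-duplicate.

    records: iterable of (leader_name, source_country_code) pairs.
    lookup:  dict from source country code to UN country name.
    """
    seen = set()
    kept = []
    for name, source_code in records:
        un_country = lookup.get(source_code)
        if un_country is None:
            continue  # country not recognized in the lookup: exclude
        key = (name, un_country)
        if key in seen:
            continue  # same leader listed under another position: collapse
        seen.add(key)
        kept.append(key)
    return kept

# Placeholder data: code "ZZ" has no UN equivalent in this toy lookup.
lookup = {"AA": "Country A", "BB": "Country B"}
records = [
    ("Leader One", "AA"),
    ("Leader One", "AA"),   # second position held by the same person
    ("Leader Two", "BB"),
    ("Leader Three", "ZZ"),
]
leaders = to_un_countries(records, lookup)
```

Here the duplicate listing collapses to one record and the leader with the unmapped code is dropped.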

The NormalizeSurnames.pl script was run with the following options (and defaults):

perl NormalizeSurnames.pl -i=WorldLeaders-ExtractedUNCountry.txt -ncol=0 -rcol=1

The resultant output (WorldLeaders-ExtractedUNCountry-Normalized.txt) was used to create the n-gram variables.