Difference between revisions of "Sources of Surname Data"


Revision as of 22:11, 22 June 2009

We are primarily interested in data sources that record both a surname and the country of birth of the person who bears it.

Internet Movie Database (IMDB)

Olympic Athletes

The Olympic Athletes data was taken from the relevant pages on Wikipedia. Local copies of all of the individual country pages for the 2004 Summer Games were retrieved by a script (Olympics-RetrievePageURLs.pl) that uses an offline version (Olympics-OfflineSource.html) of the Wikipedia Nations at the 2004 Summer Olympics page. This script also constructs a list of participating countries (Olympics-ParticipatingCountries.txt).
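The retrieval step described above can be illustrated with a minimal Python sketch. It is not the Perl script named above. It assumes that each country page is linked from the offline category page with a target ending in "_at_the_2004_Summer_Olympics", and the class and function names (CountryLinkParser, country_from_link, participating_countries) are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import unquote

# Assumed naming pattern for the per-country page links.
SUFFIX = "_at_the_2004_Summer_Olympics"


class CountryLinkParser(HTMLParser):
    """Collect the href of every <a> tag whose target ends with SUFFIX."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Keep each link once, in the order it first appears.
        if href.endswith(SUFFIX) and href not in self.links:
            self.links.append(href)


def country_from_link(href):
    """Derive a readable country name from a page link."""
    page = unquote(href.rsplit("/", 1)[-1])
    return page[: -len(SUFFIX)].replace("_", " ")


def participating_countries(html_text):
    """Return (country, link) pairs found in the offline page text."""
    parser = CountryLinkParser()
    parser.feed(html_text)
    return [(country_from_link(h), h) for h in parser.links]
```

Calling participating_countries on the text of a saved copy of the category page would give both the list of countries and the page links to download, which is roughly the pair of outputs the paragraph above attributes to the retrieval script.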

The offline pages were then parsed by another script (Olympics-ExtractOlypiads.pl) and the resulting output (Olympics-RawOutput.txt) was checked by hand. This output is the basic set of names with countries for the 2004 Olympic Athletes used here. Because some individuals competed in multiple events, identical full name strings were collapsed to produce a single record with a count. It seems unlikely that several distinct athletes named, say, John Joe Smith competed, so this reduction should rarely merge different people in error. Users of these scripts should note that the Wikipedia source pages have likely changed since they were retrieved, and should check results carefully.
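The collapsing of identical full name strings can be sketched as follows. This is a minimal illustration in Python, not the original script. The layout of Olympics-RawOutput.txt is not described here, so the sketch assumes the records have already been read into (full name, country) pairs. The function names and the sample names are hypothetical.

```python
def collapse_names(records):
    """Collapse identical full-name strings into one record with a count.

    records: iterable of (full_name, country) pairs (assumed input shape).
    Returns a dict mapping each full name to its count and the set of
    countries it appeared with.
    """
    collapsed = {}
    for name, country in records:
        key = name.strip()
        entry = collapsed.setdefault(key, {"countries": set(), "count": 0})
        entry["countries"].add(country.strip())
        entry["count"] += 1
    return collapsed


def names_with_several_countries(collapsed):
    """List names seen under more than one country, for checking by hand."""
    return sorted(n for n, e in collapsed.items() if len(e["countries"]) > 1)


# Hypothetical sample rows, for illustration only.
rows = [
    ("John Joe Smith", "USA"),
    ("John Joe Smith", "USA"),
    ("Ana Lima", "Brazil"),
    ("Ana Lima", "Portugal"),
]
collapsed = collapse_names(rows)
```

A name that appears under more than one country is the case most likely to be two different people, so names_with_several_countries gives a short list to review by hand rather than trusting the merge blindly.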

World Leaders

  1. Single letter names...