Geocoding Inventor Locations

From edegan.com
Revision as of 01:07, 30 July 2009

This page details the various matching techniques used to geocode inventor locations in the NBER patent data. Geocoding inventor locations entails matching the inventor addresses provided in the patent data to known locations throughout the world and recording their longitude and latitude.

Country Codes

This project uses [http://en.wikipedia.org/wiki/ISO_3166-1_alpha-2 ISO 3166-1 alpha-2] two-character country codes, as recognised by the UN and (with exceptions) used for top-level domains on the internet. The codes for the 246 recognised countries or self-governing territories are as follows:

AD,AE,AF,AG,AI,AL,AM,AN,AO,AQ,AR,AS,AT,AU,AW,AX,AZ,BA,BB,BD,BE,BF,BG,BH,BI,BJ,BL,BM,BN,BO,BR,BS,BT,BV,BW,BY,BZ, CA,CC,CD,CF,CG,CH,CI,CK,CL,CM,CN,CO,CR,CU,CV,CX,CY,CZ,DE,DJ,DK,DM,DO,DZ,EC,EE,EG,EH,ER,ES,ET,FI,FJ,FK,FM,FO,FR, GA,GB,GD,GE,GF,GG,GH,GI,GL,GM,GN,GP,GQ,GR,GS,GT,GU,GW,GY,HK,HM,HN,HR,HT,HU,ID,IE,IL,IM,IN,IO,IQ,IR,IS,IT,JE,JM, JO,JP,KE,KG,KH,KI,KM,KN,KP,KR,KW,KY,KZ,LA,LB,LC,LI,LK,LR,LS,LT,LU,LV,LY, MA,MC,MD,ME,MF,MG,MH,MK,ML,MM,MN,MO,MP,MQ,MR,MS,MT,MU,MV,MW,MX,MY,MZ,NA,NC,NE,NF,NG,NI,NL,NO,NP,NR,NU,NZ,OM, PA,PE,PF,PG,PH,PK,PL,PM,PN,PR,PS,PT,PW,PY,QA,RE,RO,RS,RU,RW,SA,SB,SC,SD,SE,SG,SH,SI,SJ,SK,SL,SM,SN,SO,SR,ST,SV,SY,SZ, TC,TD,TF,TG,TH,TJ,TK,TL,TM,TN,TO,TR,TT,TV,TW,TZ,UA,UG,UM,US,UY,UZ,VA,VC,VE,VG,VI,VN,VU,WF,WS,YE,YT,ZA,ZM,ZW

Note that the code UK is "exceptionally reserved" for the United Kingdom of Great Britain and Northern Ireland, whose official code is GB, and UK is often used in place of GB.
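Because both UK and GB can appear for the same country, code values may need to be normalised before they are compared. The following is a minimal Python sketch of that idea, not part of the project's code. The function name, the alias table, and the choice to map UK onto GB (rather than the reverse) are all assumptions made for illustration.

```python
# Hypothetical alias table: the reserved code "UK" is mapped onto the
# official ISO 3166-1 alpha-2 code "GB". The direction is an assumption.
ALIASES = {"UK": "GB"}


def normalise_country_code(code: str) -> str:
    """Return an upper-case two-letter code, resolving known aliases."""
    cleaned = code.strip().upper()
    # Reject anything that is not exactly two ASCII letters.
    if len(cleaned) != 2 or not (cleaned.isascii() and cleaned.isalpha()):
        raise ValueError(f"not a two-letter country code: {code!r}")
    return ALIASES.get(cleaned, cleaned)
```

A project that names its files with UK, as in GNS-UK.txt, might prefer the opposite mapping; only the alias table would change.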

Reference Data

The reference data for the locations (which provides the longitudes and latitudes) is taken from the (U.S.) National Geospatial-Intelligence Agency's GEOnet Names Server (GNS), which covers the world excluding the U.S. and Antarctica.

The Perl module [http://www.edegan.com/repository/GNS.pm GNS.pm] loads and indexes this data and provides an interface to its key variables. The source code is the primary module documentation. Exported methods include:

  • new() - Constructor. Takes an ISO 3166-1 alpha-2 country code and calls Load()
  • Load() - Expects to find GNS-XX.txt (where XX is an ISO 3166-1 alpha-2 country code) with GNS standard column names, and loads it
  • Index() - Builds the master index and all sub-indices
  • GetIndex() - Takes a specific GNS NT code (e.g. P, L, A) or ALL and returns an index
  • GetUNIs() - Takes a place name and a type (e.g. P, L, A, ALL); returns a list of corresponding UNIs
  • GetLongLat() - Takes a UNI; returns a longitude, latitude pair
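The lookup flow the methods above describe (index place names by type, resolve a name to UNIs, then resolve a UNI to coordinates) can be sketched in Python. This is an illustration only and does not reproduce GNS.pm. The class and method names are hypothetical, the sample rows and coordinates are made up, and the column names (UNI, FC for the type code, LAT, LONG, FULL_NAME) and tab delimiter are assumptions about the GNS file layout.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows; real GNS files have many more columns and rows.
SAMPLE = (
    "UNI\tFC\tLAT\tLONG\tFULL_NAME\n"
    "1001\tP\t51.5\t-0.1167\tLondon\n"
    "1002\tA\t51.5\t-0.0833\tLondon\n"
    "1003\tP\t53.4833\t-2.2333\tManchester\n"
)


class PlaceIndex:
    """Hypothetical stand-in for the load / index / lookup steps."""

    def __init__(self, handle):
        # by_type[type_code][NAME] -> list of UNIs; "ALL" is the master index.
        self.by_type = defaultdict(lambda: defaultdict(list))
        self.coords = {}  # UNI -> (longitude, latitude)
        for row in csv.DictReader(handle, delimiter="\t"):
            uni = row["UNI"]
            name = row["FULL_NAME"].strip().upper()
            self.coords[uni] = (float(row["LONG"]), float(row["LAT"]))
            self.by_type[row["FC"]][name].append(uni)
            self.by_type["ALL"][name].append(uni)

    def get_unis(self, name, place_type="ALL"):
        """Return the UNIs recorded for a place name within one type."""
        key = name.strip().upper()
        return list(self.by_type.get(place_type, {}).get(key, []))

    def get_long_lat(self, uni):
        """Return the (longitude, latitude) pair for a UNI."""
        return self.coords[uni]
```

For example, `PlaceIndex(io.StringIO(SAMPLE)).get_unis("London", "P")` returns the UNIs of populated places named London in the sample, each of which can then be passed to `get_long_lat`.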

The Source Files

Source files are currently extracted from the NBER patent data on a per country basis. The original format of the source files was:

postcode WKU CTY city county

Here WKU is the patent number and CTY is an address field. The postcode, city and county are derived fields, either extracted from CTY by an algorithm or hand-coded. These fields are error-prone but contain some important information, particularly regarding "historical curiosities", as well as approximately 1,000 hand-corrected typos. As a first pass, however, the address fields are regenerated in the matching script.
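As a rough illustration of reading the original five-field format, the sketch below parses one line into named fields. The field order follows the format given above; the tab delimiter, the type and function names, and any example row are assumptions, since the page does not state how the fields are separated.

```python
from typing import NamedTuple


class SourceRecord(NamedTuple):
    """One row of the original source format (names are illustrative)."""
    postcode: str
    wku: str      # patent number
    cty: str      # raw address field
    city: str
    county: str


def parse_source_line(line: str, delimiter: str = "\t") -> SourceRecord:
    """Split one line into its five fields; the delimiter is an assumption."""
    fields = [field.strip() for field in line.rstrip("\n").split(delimiter)]
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    return SourceRecord(*fields)
```

Keeping CTY alongside the derived fields makes it possible to regenerate the postcode, city and county from the raw address while still consulting the hand-corrected values.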

Countries that are being processed include:

  1. The UK (Source file: UK-PatentInventorLocations.txt, Reference file: GNS-UK.txt)

Postal Codes

Postal codes, known as ZIP codes in the U.S., vary by national jurisdiction and for historical reasons. The following postal code formats are listed for reference:

  • United Kingdom (Sourced from Wikipedia): A9 9AA, A99 9AA, A9A 9AA, AA9 9AA, AA99 9AA, AA9A 9AA. Simple Regex: ([A-Z]{1,2}[0-9]{1,2}[A-Z]{0,1}\s[0-9][A-Z]{2,2})
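The simple regex above can be applied to an address string to pull out a candidate postcode. The Python sketch below uses the pattern exactly as given; the function name and the decision to upper-case the input before matching are assumptions for illustration.

```python
import re

# The simple UK postcode pattern from the list above, unchanged.
UK_POSTCODE = re.compile(r"([A-Z]{1,2}[0-9]{1,2}[A-Z]{0,1}\s[0-9][A-Z]{2,2})")


def find_postcode(address):
    """Return the first postcode-like substring of an address, or None."""
    # Upper-case first so that lower-case input still matches (an assumption).
    match = UK_POSTCODE.search(address.upper())
    return match.group(1) if match else None
```

The pattern is deliberately loose: it has no word boundaries and accepts some combinations outside the six listed formats, so a match is best treated as a candidate rather than a validated postcode.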