Geocoding Inventor Locations

From edegan.com
Revision as of 02:57, 30 July 2009 by 67.180.26.152 (talk)

This page details the matching techniques used to geocode inventor locations in the NBER patent data. Geocoding inventor locations entails matching the inventor addresses provided in the patent data to known locations throughout the world and recording their longitude and latitude.

Country Codes

This project uses ISO3166, specifically ISO3166-1 alpha-2, two-character country codes, as recognised by the UN and (with exceptions) used for top-level domains on the internet. The codes for the 246 recognised countries or self-governing territories are as follows: AD,AE,AF,AG,AI,AL,AM,AN,AO,AQ,AR,AS,AT,AU,AW,AX,AZ,BA,BB,BD,BE,BF,BG,BH,BI,BJ,BL,BM,BN,BO,BR,BS,BT,BV,BW,BY,BZ,CA,CC,CD,CF,CG,CH,CI,CK,CL,CM,CN,CO,CR,CU,CV,CX,CY,CZ,DE,DJ,DK,DM,DO,DZ,EC,EE,EG,EH,ER,ES,ET,FI,FJ,FK,FM,FO,FR,GA,GB,GD,GE,GF,GG,GH,GI,GL,GM,GN,GP,GQ,GR,GS,GT,GU,GW,GY,HK,HM,HN,HR,HT,HU,ID,IE,IL,IM,IN,IO,IQ,IR,IS,IT,JE,JM,JO,JP,KE,KG,KH,KI,KM,KN,KP,KR,KW,KY,KZ,LA,LB,LC,LI,LK,LR,LS,LT,LU,LV,LY,MA,MC,MD,ME,MF,MG,MH,MK,ML,MM,MN,MO,MP,MQ,MR,MS,MT,MU,MV,MW,MX,MY,MZ,NA,NC,NE,NF,NG,NI,NL,NO,NP,NR,NU,NZ,OM,PA,PE,PF,PG,PH,PK,PL,PM,PN,PR,PS,PT,PW,PY,QA,RE,RO,RS,RU,RW,SA,SB,SC,SD,SE,SG,SH,SI,SJ,SK,SL,SM,SN,SO,SR,ST,SV,SY,SZ,TC,TD,TF,TG,TH,TJ,TK,TL,TM,TN,TO,TR,TT,TV,TW,TZ,UA,UG,UM,US,UY,UZ,VA,VC,VE,VG,VI,VN,VU,WF,WS,YE,YT,ZA,ZM,ZW

Note that the code UK is on "exceptional reserve" for use by the United Kingdom of Great Britain and Northern Ireland (GB) and is often used in place of GB, though the Patent Data Project uses GB.
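Since UK often appears where GB is meant, incoming codes may need normalising before a per-country file is chosen. The following is a minimal Python sketch of that idea, not code from the project (whose modules are written in Perl). The names ALIASES, KNOWN_CODES and normalize_country_code are hypothetical, and KNOWN_CODES holds only a small excerpt of the code list.

```python
# Hypothetical helper for illustration; not part of the project's Perl code.
# Maps a commonly seen non-standard code onto the alpha-2 code used here.
ALIASES = {"UK": "GB"}

# A small excerpt of the code list; a real check would load all 246 codes.
KNOWN_CODES = {"GB", "US", "DE", "FR", "JP"}


def normalize_country_code(raw):
    """Return the upper-cased alpha-2 code with aliases applied, or None if unknown."""
    code = raw.strip().upper()
    code = ALIASES.get(code, code)
    return code if code in KNOWN_CODES else None


print(normalize_country_code("uk"))  # GB
print(normalize_country_code("XX"))  # None
```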

Reference Data

The reference data for the locations (which provides the longitudes and latitudes) is taken from the (U.S.) National Geospatial-Intelligence Agency's GEOnet Names Server (GNS), which covers the world excluding the U.S. and Antarctica.

Available reference data files for countries that have been or are being processed include:

The Perl module GNS.pm loads, indexes and provides an interface to key variables from this data. The source code is the primary module documentation. Exported methods include:

  • new() - Constructor. Takes an ISO3166 code and calls Load().
  • Load() - Expects to find GNS-XX.txt (where XX is an ISO3166 code) with the standard GNS column names, and loads it.
  • Index() - Builds the master index and all sub-indices.
  • GetIndexKeys() - Takes a specific GNS NT code (e.g. P,L,A) or ALL and returns a set of index keys.
  • GetUNIs() - Takes a place name and a type (e.g. P,L,A,ALL); returns a list of corresponding UNIs.
  • GetLongLat() - Takes a UNI; returns a longitude, latitude pair.
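To make the shape of this interface concrete, here is a rough Python analogue of the load, index and lookup steps. It is a sketch under stated assumptions, not a port of GNS.pm: the class and method names are hypothetical, the sample rows are made up, and the column names (UNI, LAT, LONG, NT, FULL_NAME) are assumed for illustration, with NT holding the P/L/A type code as this page uses the term.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows; the column names are assumptions for this sketch.
SAMPLE = (
    "UNI\tLAT\tLONG\tNT\tFULL_NAME\n"
    "1001\t51.5\t-0.1167\tP\tLondon\n"
    "1002\t51.5\t-0.0833\tA\tLondon\n"
    "1003\t53.4667\t-2.2333\tP\tManchester\n"
)


class GnsIndex:
    """Hypothetical analogue of the load/index/lookup interface described above."""

    def __init__(self, handle):
        self.coords = {}  # UNI -> (longitude, latitude)
        # type code -> lower-cased name -> list of UNIs; "ALL" is the master index
        self.by_type = defaultdict(lambda: defaultdict(list))
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            uni = row["UNI"]
            self.coords[uni] = (float(row["LONG"]), float(row["LAT"]))
            key = row["FULL_NAME"].lower()
            self.by_type[row["NT"]][key].append(uni)
            self.by_type["ALL"][key].append(uni)

    def get_unis(self, name, type_code="ALL"):
        """Return the UNIs whose name matches exactly, ignoring case."""
        return list(self.by_type[type_code].get(name.lower(), []))

    def get_long_lat(self, uni):
        """Return the (longitude, latitude) pair for a UNI."""
        return self.coords[uni]


gns = GnsIndex(io.StringIO(SAMPLE))
print(gns.get_unis("london"))       # ['1001', '1002']
print(gns.get_unis("London", "P"))  # ['1001']
print(gns.get_long_lat("1003"))     # (-2.2333, 53.4667)
```

Keeping one sub-index per type code alongside a master index means a lookup restricted to places or administrative areas does not have to filter the full name list each time.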

The Source Files

Per-country source files are extracted from the NBER patent data. The problem of identifying countries for some address records will be addressed later. The format of the source file(s) is as follows (XX is an ISO3166 code):

  • XX.txt - Tab-delimited plain text with no (intentional) string quotation. Column(s): cty
  • XX_exceptions.txt - Tab-delimited plain text with no (intentional) string quotation. Column(s): cty city adm postcode

The cty column is used as the primary key in both files. XX_exceptions.txt provides details on hand-identified records, or other records where special care has been taken. This file is not strictly necessary but will be processed if present.
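As an illustration of this file layout, the Python sketch below reads tab-delimited rows with quoting disabled and keys each record on cty. It is not the project's loader: read_records is a hypothetical name, the sample rows are invented, and the absence of a header row is an assumption.

```python
import csv
import io

# Column order as listed for XX_exceptions.txt; XX.txt would use ["cty"].
EXCEPTION_COLUMNS = ["cty", "city", "adm", "postcode"]


def read_records(handle, columns):
    """Read tab-delimited rows (no quoting, no header assumed) into a dict keyed by cty."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    records = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        # Pad short rows so every column name has a value.
        row = row + [""] * (len(columns) - len(row))
        record = dict(zip(columns, row))
        records[record["cty"]] = record
    return records


# Invented sample data standing in for a GB_exceptions.txt file.
sample = io.StringIO("LONDON SW1A 1AA\tLondon\t\tSW1A 1AA\nOXFORD\tOxford\n")
exceptions = read_records(sample, EXCEPTION_COLUMNS)
print(exceptions["OXFORD"]["city"])               # Oxford
print(exceptions["LONDON SW1A 1AA"]["postcode"])  # SW1A 1AA
```

Disabling quoting matters because the files carry no intentional string quotation, so a stray quote character in an address is kept as literal text instead of being read as a field delimiter.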

The Perl module PatentLocations.pm loads and provides an interface to this source data. The source code is the primary module documentation. Exported methods include:

  • new() - Takes an ISO3166 country code and calls Load(). Expects to find a stop words file (PatentLocations-Stopwords.txt) and a postcode RegEx file (PatentLocations-PostCode.rex).
  • Load() - Loads the data file(s).
  • CleanAndParse() - Does a first round of cleaning and parsing (calls internal methods). Extracts the postcode and replaces stop words.
  • UnMatched() - Takes an NT code (e.g. P,L,A,ALL) and returns the set of currently unmatched country name keys for that type
  • ReturnMatches() - Marks country name keys with their new match sets
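The cleaning and parsing step can be pictured roughly as follows. This Python sketch is an assumption-laden illustration, not the behaviour of PatentLocations.pm: the stop word set is invented, the postcode pattern is a UK-style example, stop words are simply dropped instead of replaced, and units are split on commas.

```python
import re

# Illustrative stand-ins for the contents of the stop words and postcode files.
STOP_WORDS = {"NEAR", "NR", "C/O"}
POSTCODE = re.compile(r"[A-Z]{1,2}[0-9]{1,2}[A-Z]?\s[0-9][A-Z]{2}")


def clean_and_parse(cty, stop_words=STOP_WORDS, postcode_re=POSTCODE):
    """Pull out the first postcode, drop stop words, and split the rest into units."""
    text = cty.upper()
    match = postcode_re.search(text)
    postcode = match.group(0) if match else None
    if match:
        # Cut the postcode out so it is not treated as part of a unit.
        text = text[:match.start()] + " " + text[match.end():]
    units = []
    for chunk in text.split(","):
        words = [word for word in chunk.split() if word not in stop_words]
        if words:
            units.append(" ".join(words))
    return {"postcode": postcode, "units": units}


print(clean_and_parse("Nr Didcot, Oxfordshire OX11 0QX"))
# {'postcode': 'OX11 0QX', 'units': ['DIDCOT', 'OXFORDSHIRE']}
```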

Postal Codes

Postal codes, known as ZIP codes in the U.S., vary by national jurisdiction and for historical reasons. The following postal code formats are posted for reference:

  • United Kingdom (Sourced from Wikipedia): A9 9AA, A99 9AA, A9A 9AA, AA9 9AA, AA99 9AA, AA9A 9AA. Simple Regex: ([A-Z]{1,2}[0-9]{1,2}[A-Z]{0,1}\s[0-9][A-Z]{2,2})
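A quick way to sanity-check the simple regex is to run it against one postcode per listed format. The Python sketch below does that using the pattern exactly as given above. The example postcodes and the helper name is_uk_postcode are illustrative choices, not project data.

```python
import re

# The simple regex from the list above, unchanged.
UK_POSTCODE = re.compile(r"([A-Z]{1,2}[0-9]{1,2}[A-Z]{0,1}\s[0-9][A-Z]{2,2})")

# One illustrative postcode per format: A9 9AA, A99 9AA, A9A 9AA, AA9 9AA, AA99 9AA, AA9A 9AA.
EXAMPLES = ["M1 1AE", "B33 8TH", "W1A 0AX", "CR2 6XH", "DN55 1PT", "EC1A 1BB"]


def is_uk_postcode(text):
    """True if the whole string, upper-cased and trimmed, fits the simple pattern."""
    return UK_POSTCODE.fullmatch(text.strip().upper()) is not None


for example in EXAMPLES:
    print(example, is_uk_postcode(example))  # every line ends with True
```

The pattern is deliberately loose: it also accepts shapes outside the six listed formats, such as AA99A 9AA (for example "AB12C 3DE"), so a match indicates a plausible postcode, not a valid one.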

The Matching Process

The matching process is carried out by MatchPatentLocations.pl, which has a standard POD-based command line interface. The -co option specifies the ISO3166 country code to be matched. The script uses these modules: PatentLocations.pm, GNS.pm

Glossary of terms:

  • Units - isolated logical units from an address, such as the street number and name, the town, or the region. Postal codes are treated separately.
  • Tokens - Single words or sequences of words separated by a space (note that this is a specific usage)
  • n-grams - Character sequences, such as bigrams (two letters, from aa to zz), trigrams (aaa-zzz) and so forth
  • Exact Matching - Case-insensitive matching of the entire sequence of both the source and the reference strings
  • LCS - Longest Common Subsequence based matching (see below)
  • Place and administrative area - Somewhere identified as NT=P or NT=A respectively in the GNS data. Unless otherwise specified, matches are performed for place and administrative area separately and in series.
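The token, n-gram and exact matching definitions above can be made concrete with a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, and the default cap of five words per token is an assumption borrowed from the 1-5 word range in the processing sequence.

```python
def word_tokens(text, max_words=5):
    """All runs of 1..max_words consecutive words, each joined by a single space."""
    words = text.split()
    tokens = []
    for size in range(1, max_words + 1):
        for start in range(len(words) - size + 1):
            tokens.append(" ".join(words[start:start + size]))
    return tokens


def char_ngrams(text, n=2):
    """All overlapping character sequences of length n (bigrams by default)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def exact_match(source, reference):
    """Case-insensitive comparison of the two entire strings."""
    return source.lower() == reference.lower()


print(word_tokens("st albans herts"))
# ['st', 'albans', 'herts', 'st albans', 'albans herts', 'st albans herts']
print(char_ngrams("leeds", 3))         # ['lee', 'eed', 'eds']
print(exact_match("LONDON", "London")) # True
```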

The sequence of processing is as follows (matching only the remaining unmatched locations at each stage):

  1. Load the source files, clean and parse (parsing identifies units)
  2. Load the reference file, build indices
  3. Exact match the exception units of records with exceptions
  4. Exact match the units of well-formatted records
  5. Exact match tokens (1-5 words)
  6. LCS match the exception units of records with exceptions
  7. LCS match (all other)
  8. n-gram match
  9. Reconcile multiple matches
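The key property of this sequence is that each stage only sees locations left unmatched by the stages before it. The Python sketch below shows that control flow with two toy stages. It is not MatchPatentLocations.pl: the reference dictionary, the stage functions and run_stages are all hypothetical, and the final reconciliation of multiple matches is not covered.

```python
# Toy reference index: lower-cased name -> list of UNIs (made-up values).
REFERENCE = {"london": ["1001"], "manchester": ["1003"]}


def exact_stage(key):
    """Exact, case-insensitive match of the whole key."""
    return REFERENCE.get(key.lower(), [])


def token_stage(key):
    """Match any single word of the key against the reference names."""
    found = []
    for word in key.lower().split():
        found.extend(REFERENCE.get(word, []))
    return found


def run_stages(keys, stages):
    """Apply matchers in order; each stage sees only the keys still unmatched."""
    matches = {}
    unmatched = list(keys)
    for name, matcher in stages:
        still_unmatched = []
        for key in unmatched:
            found = matcher(key)
            if found:
                matches[key] = (name, found)
            else:
                still_unmatched.append(key)
        unmatched = still_unmatched
    return matches, unmatched


matches, unmatched = run_stages(
    ["London", "Greater Manchester", "Atlantis"],
    [("exact", exact_stage), ("token", token_stage)],
)
print(matches)    # {'London': ('exact', ['1001']), 'Greater Manchester': ('token', ['1003'])}
print(unmatched)  # ['Atlantis']
```

Ordering the stages from strictest to loosest means a looser, more error-prone technique never overrides a match already made by a stricter one.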

Longest Common Subsequence (LCS)

Longest Common Subsequence is perhaps the simplest (for certain inefficient implementations) and most widely used of the fuzzy matching techniques. The Longest Common Subsequence page on Wikipedia provides a very detailed background.
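For reference, a standard dynamic-programming computation of the LCS length is sketched below in Python. It is a textbook formulation, not the project's code. The lcs_score function, which divides the LCS length by the length of the longer string to give a value between 0 and 1, is an assumed scoring choice made only for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                # Extend the best subsequence that ends before both characters.
                current.append(previous[j - 1] + 1)
            else:
                # Otherwise carry forward the better of skipping one character.
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def lcs_score(source, reference):
    """Assumed similarity score: LCS length over the longer string's length."""
    source, reference = source.lower(), reference.lower()
    longest = max(len(source), len(reference))
    return lcs_length(source, reference) / longest if longest else 1.0


print(lcs_length("ABCBDAB", "BDCABA"))      # 4
print(lcs_score("Manchestr", "Manchester")) # 0.9
```

Because a subsequence need not be contiguous, a source string with a dropped or inserted letter still shares almost all of its characters, in order, with the reference name, which is what makes LCS useful for misspelled place names.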
