Geocoding Inventor Locations

From edegan.com
Revision as of 21:49, 19 August 2009 by 128.32.74.87 (talk)
Jump to navigation Jump to search

This page details the various matching techniques used to Geocode inventor locations in the NBER patent data. Geocoding inventor locations entails matching the inventor addresses provided in the patent data to known locations through-out the world and recording their longitude and latitude.

The scripts and modules that operationalize these matching techniques can be downloaded individually by function or as a bundle with all supporting data files (MatchPatentLocations.tar.gz).

Reference Data

The reference data for the locations (which provides the longitude and latitudes) is taken from the (U.S.) National Geospatial-Intelligence Agency's GEOnet Names Server (GNS) which covers the world excluding the U.S. and Antartica.

This project uses ISO3166 two-character country codes to name source and reference files. Available reference data files for countries that have or are being processed include:

The perl module GNS.pm loads, indexes and provides an interface to key variables from this data. The source code is the primary module documentation. The load() method takes and ISO3166 code, and the index methods and most other methods take one of two specific GNS FC codes (e.g. "P" for populated place, and "A" for administrative area).

The Source Files

Per country source files are extracted from the NBER patent data. The problem of identifying countries for some address records will be addressed later. The format of the source file(s) is as follows (XX is an ISO3166 code):

  • XX.txt - Tab delimited plain text with no (intentional) string quotation. Column(s): cty
  • XX_exceptions.txt - Tab delimited plain text with no (intentional) string quotation. Column(s): cty city adm postcode

The cty is used as a primary key in both files. The XX_exceptions.txt provides details on hand identified records, or other records where special care has been taken. This file is not strictly required by the scrips but will be processed if present.

The perl module PatentLocations.pm loads and provides an interface to this source data. The source code is the primary module documentation.

Postal Codes

Postal codes, known as ZIP codes in the U.S., vary by national jurisdiction and for historical reasons. The following postal codes formats are posted for reference:

  • United Kingdom (Sourced from Wikipedia): A9 9AA, A99 9AA, A9A 9AA, AA9 9AA, AA99 9AA, AA9A 9AA. Simple Regex: ([A-Z]{1,2}[0-9]{1,2}[A-Z]{0,1}\s[0-9][A-Z]{2,2})

The PostalCodes.pm perl module provides a method to extract a postcode from a text string for a given ISO3166 code.

The Matching Process

The matching process is carried out by MatchPatentLocations.pl, which has a standard pod based command line interface. The -co option specifies the ISO3166 country code to be matched. The script uses these modules: PatentLocations.pm, GNS.pm, CleanStrings.pm, GramMatch.pm, LCS.pm and PostalCodes.pm. In addition to GNS reference files and patent data source files as detailed above, the script also use PatentLocations-Stopwords.txt.

Glossary of terms:

  • Units - isolated logical units from an address, such as the street number and name, the town, or the region. Postal codes are treated separately.
  • Tokens - Single words or sequences of words separated by a space (note that this is a specific usage)
  • n-grams - character sequences, such as bigrams (two letters from aa to zz), trigrams (aaa-zzz) and so forth
  • Exact Matching - Case insensitive of matching of the entire sequence of both the source and the reference strings
  • LCS - Longest Common Subsequence based matching (See below)
  • Place and administrative area - somewhere identified as a FC=P or FC=A respectively in the GNS data. Unless otherwise specified matches are performed for both place and administrative area separately and in series.

The sequence of processing is as follows (matching only the remaining unmatched locations at each stage):

  1. Load the source files, clean and parse (parsing identifies units)
  2. Load the reference file, build indices
  3. Exact match the exception units of records with exceptions
  4. Exact match the units of well-formatted records
  5. Exact match tokens (1-5 words)
  6. N-gram and LCS match
  7. Reconsile multiple matches

Longest Common Subsequence (LCS)

Longest Common Subsequence is perhaps the simplest (for certain inefficient implementations) and most abundantly used of fuzzy matching technique. The Longest Common Subsequence page on wikipedia provides a very detailed background.