Difference between revisions of "Geocoding Inventor Locations"

From edegan.com
Jump to navigation Jump to search
Line 5: Line 5:
 
==Script Files==
 
==Script Files==
  
The scripts and modules that operationalize these matching techniques can be downloaded as individual scripts or as a bundle with all supporting data files ([http://www.edegan.com/repository/MatchLocations.tar.gz MatchLocations.tar.gz] ~20Mb). Note that the reference data should be placed in a subdirectory by default named "GNS", and source data should be placed in a subdirectory by default named "Source". Both defaults can be changed in the MatchLocations.pl script.
+
The scripts and modules that operationalize these matching techniques can be downloaded as a bundle with ([http://www.edegan.com/repository/MatchLocations.tar.gz MatchLocations.tar.gz v1.0.1] ~20Mb) or withour ([http://www.edegan.com/repository/MatchLocations_Full.tar.gz MatchLocations_Full.tar.gz v1.0.1] ~20Mb) all supporting data files. Note that the current version is 1.0.3, which will be posted shortly. The bundles contain the default directory structure. Defaults can be changed in the MatchLocations.pl script.  
 +
 
 +
The directories are as follows:
 +
*Source - Source data should be placed here. See below for formatting.
 +
*Results - Results generated by the scripts, including logs will appear here.
 +
*GNS - contains GNS reference data named GNS-XX.txt
 +
*Match - contains the modules
  
 
The bundle contains:
 
The bundle contains:
*[http://www.edegan.com/repository/MatchLocations.pl MatchLocations.pl] - The main script and initializes and processes the matching requests
+
*MatchLocations.pl - The main script that initializes and processes the matching requests
*[http://www.edegan.com/repository/Match/GNS.pm Match::GNS.pm] - Interface to the GNS reference data (see below)
+
*BatchMatch.pl - A script for running batches
*[http://www.edegan.com/repository/Match/Patent.pm Match::Patent.pm] - Interface to the Patent Location data (see below)
+
*Match::GNS.pm - Interface to the GNS reference data (see below)
*[http://www.edegan.com/repository/Match/Common.pm Match::Common.pm] - Provides common (string cleaning) routines for both the reference and source interface modules
+
*Match::Patent.pm - Interface to the Patent Location data (see below)
*[http://www.edegan.com/repository/Match/PostalCodes.pm Match::PostalCodes.pm] - A module that extracts postcodes of various formats from (address) strings
+
*Match::Common.pm - Provides common (string cleaning) routines for both the reference and source interface modules
*[http://www.edegan.com/repository/PatentLocations-Stopwords.txt PatentLocations-Stopwords.txt] - A Stop Word file (tab delimited)
+
*Match::PostalCodes.pm - A module that extracts postcodes of various formats from (address) strings
*[http://www.edegan.com/repository/Match/Gram.pm Match::Gram.pm] - Custom NGram Module
+
*Match::Gram.pm - Custom NGram Module
*[http://www.edegan.com/repository/Match/LCS.pm Match::LCS.pm] - A standard LCS Module
+
*Match::LCS.pm - A standard LCS Module
 +
*PatentLocations-Stopwords.txt - A Stop Word file (tab delimited)
 +
*GNS Reference Files - The full bundle contains a full set of correctly named GNS reference files
  
 
The MatchLocations.pl script can be run from any shell or command line with perl installed. Example commands are:
 
The MatchLocations.pl script can be run from any shell or command line with perl installed. Example commands are:
Line 21: Line 29:
 
  <tt>perl MatchLocations.pl -co GB -u -human -r -wf </tt>
 
  <tt>perl MatchLocations.pl -co GB -u -human -r -wf </tt>
  
which will process ISO3166 <tt>country</tt> code GB (Great Britain), include <tt>unmatched</tt> inputs in the results file, produce a <tt>human</tt> choices file, write the <tt>report</tt> to a text file, and <tt>write fuzzy</tt> matches to additional seperate files as well as the main results file.
+
which will process ISO3166 <tt>country</tt> code GB (Great Britain), include <tt>unmatched</tt> inputs in the results file, produce a <tt>human</tt> choices file, write the <tt>report</tt> to a text file, and <tt>write fuzzy</tt> matches to additional seperate files as well as the main results file. Other options include <tt>over</tt> to override country designations and <tt>o</tt> to specify the results filename.
  
 
  <tt>perl MatchLocations.pl -h</tt>
 
  <tt>perl MatchLocations.pl -h</tt>
Line 31: Line 39:
 
The reference data for the locations (which provides the longitude and latitudes) is taken from the (U.S.) National Geospatial-Intelligence Agency's [[GEOnet Names Server | GEOnet Names Server (GNS)]] which covers the world excluding the U.S. and Antartica.  
 
The reference data for the locations (which provides the longitude and latitudes) is taken from the (U.S.) National Geospatial-Intelligence Agency's [[GEOnet Names Server | GEOnet Names Server (GNS)]] which covers the world excluding the U.S. and Antartica.  
  
This project uses [[ISO3166]] two-character country codes to name source and reference files. GNS does not use ISO3166 country codes, and so users will need to translate accordingly (see the [[GEOnet Names Server | GNS page]] for details). Example reference data files for countries that have been processed include:
+
This project uses [[ISO3166]] two-character country codes to name source and reference files. GNS does not use ISO3166 country codes, and so users will need to translate accordingly (see the [[GEOnet Names Server | GNS page]] for details). A full bundle of correctly names GNS files is also available.
*Australia: [http://www.edegan.com/repository/GNS-AU.txt GNS-AU.txt]
 
*Belgium: [http://www.edegan.com/repository/GNS-BE.txt GNS-BE.txt]
 
*Canada: [http://www.edegan.com/repository/GNS-CA.txt GNS-CA.txt]
 
*France: [http://www.edegan.com/repository/GNS-FR.txt GNS-FR.txt]
 
*Germany: [http://www.edegan.com/repository/GNS-DE.txt GNS-DE.txt]
 
*Hungary: [http://www.edegan.com/repository/GNS-HU.txt GNS-HU.txt]
 
*Ireland: [http://www.edegan.com/repository/GNS-IE.txt GNS-IE.txt]
 
*Great Britain (The United Kingdom of Great Britain and Northern Ireland): [http://www.edegan.com/repository/GNS-GB.txt GNS-GB.txt]
 
*Spain: [http://www.edegan.com/repository/GNS-ES.txt GNS-ES.txt]
 
*Switzerland: [http://www.edegan.com/repository/GNS-CH.txt GNS-CH.txt]
 
 
 
  
 
The perl module Match::GNS.pm loads, indexes and provides an interface to key variables from this data. The source code is the primary module documentation. The load() method takes and ISO3166 code, and the index methods and most other methods take one of two specific GNS FC codes (e.g. "P" for populated place, and "A" for administrative area).
 
The perl module Match::GNS.pm loads, indexes and provides an interface to key variables from this data. The source code is the primary module documentation. The load() method takes and ISO3166 code, and the index methods and most other methods take one of two specific GNS FC codes (e.g. "P" for populated place, and "A" for administrative area).
Line 48: Line 45:
 
==The Source Files==
 
==The Source Files==
  
Per country source files are extracted from the NBER patent data. The problem of identifying countries for some address records will be addressed later. The format of the source file(s) is as follows (XX is an ISO3166 code):  
+
Per country source files are extracted from the NBER patent data. The format of the source file(s) is as follows (XX is an ISO3166 code):  
*XX.txt - Tab delimited plain text with no (intentional) string quotation. Column(s): <tt>cty</tt>
 
*XX_exceptions.txt - Tab delimited plain text with no (intentional) string quotation. Column(s): <tt>cty city adm postcode</tt>
 
 
 
The <tt>cty</tt> is used as a primary key in both files. The XX_exceptions.txt provides details on hand identified records, or other records where special care has been taken. This file is not strictly required by the scrips but will be processed if present.
 
 
 
The perl module Match::Patent.pm loads and provides an interface to this source data. The source code is the primary module documentation.
 
 
 
==Postal Codes==
 
 
 
Postal codes, known as ZIP codes in the U.S., vary by national jurisdiction and for historical reasons. The following postal codes formats are posted for reference, as are some simple regular expressions that should safely match most variants:
 
 
 
*Australia: ([http://en.wikipedia.org/wiki/Postcodes_in_australia Sourced from Wikipedia]): NNNN where N is a numeric. Australian postcodes should appear at the end of addresses, and are frequently preceded by the acronym for the territory/state (specifically: NSW, ACT, VIC, QLD, SA, WA, TAS, and NT). In the patent data variations include: NNNN, AU-NNNN, XXX NNNN, Xxx. NNNN X.X.X. NNNN, XXXNNNN, where XXX indicate the two or three characters of the acronym.
 
**Simple Regex: <tt>(NSW|Nsw|ACT|Act|VIC|Vic|QLD|Qld|SA|Sa|WA|Wa|TAS|Tas|NT|Nt|Au|AU)?(\w\.\w\.\w\.)?\.?\s?-?\s?\d{4,4}</tt>
 
*Belgium ([http://en.wikipedia.org/wiki/List_of_postal_codes_in_Belgium Sourced from Wikipedia]): NNNN where N is a numeric. Belgian postcodes are usually placed before the city, and the number of trailing zeros indicates the size of the city. However, the following formats also appear frequently in the patent data: NN NNNN, NNN, NNNN, NNNNN, NNNN-, NNN B-NNNN, NNN B-NNN, NN-NNNN, B-NNNN, NN - B NNNN, NN, N - B - NNNN, "NN, B. NNNN", B - NNNN, B -NNNN, B NNNN, B- NNNN, B--NNNN, B-NNNN, B-NNNN-, BNNNN, BNNNNN, BE - NNNN, BE-NNNN, BF-NNNN.
 
**Simple Regex: <tt>\d{0,3},?\s?-{0,2}\s?B?[EF]?\.?\s?-{0,2}\s?\d{1,5}-? </tt>
 
*Canada: ([http://en.wikipedia.org/wiki/Canadian_postal_code Sourced from Wikipedia]): XNX NXN, where X indicates a letter and the N a numeric. The first letter denotes the province or territory. This standard was adopted in 1970 (fully implemented by 1974) and is closely related to the UK and Dutch systems. In the patent data, Canadian postal codes appear (like the Canadians) very well behaved with the following variants appearing: XNX NXN, XNX-NXN, and XNX. Although it is possible that the letter O and the number 0 may be erroneously transcribed.
 
**Simple Refex: <tt>[A-Z0-0][0-9O-O][A-Z0-0]\s?\-?\s?([0-9O-O][A-Z0-0][0-9O-O])?</tt>
 
*Finland ([http://en.wikipedia.org/wiki/Postal_codes_in_Finland Sourced from Wikipedia]): NNMMD where N, M and D are numerics, and NN indicates the municipality, MM the district and D is typically either a 0 (large area), 5 (small area) or 1 (for P.O. Boxes). In the patent data the following Finnish postal codes are evident: NNNNN, NNNNNN, FI--NNNNN, FI-NNNNN, FI-NNNN, FIB-NNNNN, FIN -NNNNN, FIN NNNNN, FIN- NNNNN, FIN-NNNNN, Finn-NNNNN, SF-NNNNN, and SF-NNNNNNN. Also the city name is sometimes followed by two digits.
 
**Simple Refex: <tt>(FI|FIN|Finn|FINN|FIB|SF)?\s?-?-?\s?\d{4,7}</tt>
 
*France ([http://en.wikipedia.org/wiki/Postal_codes_in_France Sourced from Wikipedia]): NNNMM or NNMMM where NN and NNN are numerics indicating the préfectures and sous-préfectures, respectively, and MMM are other numerics. However the following formats also appear frequently in the patent data: F NNNNN, F-NNNNN, F - NNNNN, F- NN NNN, FNNNNN, F-NN NNN, F-NN, FR - NNNNN, FR NNNNN, FR-NNNNN, (NNNNN), (NN), - NNNNN, -NNNNN-, -NN, NN, NN - N-NNNNN, NNNN, NN NNN, NN.NNN, NNN/N, NN., F. NNNNN, F.NNNNN, "FRNN,NNN". French postal codes most often, but not exclusively, occur at the start of the address string. If there are fewer than 5 digits, trailing zeros should be added.
 
**Simple Regex: <tt>\(?F?R?\.?\s?\d?-?\s?\d{2,3}\.?\s?\/?\d{0,3}-?\)?</tt>
 
*Germany ([http://en.wikipedia.org/wiki/List_of_postal_codes_in_Germany Sourced from Wikipedia]): Currently (post 1993) German postcal codes consist of five digits: NNMMM where NN indicates the broad area and MMM indicates the sub-area. Prior to 1993 postal codes had four digits NNNN and between 1989 and 1993, O-NNNN (for East, Ost, Germany) and W-NNNN (for West Germany) was used. However, the following formats also appear frequently in the patent data: (NNNN), (D-NNNN), -NNNN, 0-NNNN, 0 - NNNN, 0-NNN, 0NNNN, N CityName NN, NN CityName NN, NNN CityName NN, NNNN CityName NN, NNNN CityName N, NNNN CityName N/BRD, NNNN CityName N/HB, 1-DNNNN, "10,NNNN", BRD-NNNN, N, NN, NNN, NNNN, NNNNN, NN.NNNNN, NN-NNNNN, d-NNNN, NN - NNNNN, NN NNN, D-N, D-NN, D-NNN, D-NNNN, D-NNNNN, D N CityName NN, D NN, D NNNN, D NNNNN, D- NNNN, D- NNNNN, D--NNNNN, D-0-NNNN, D-0NNNN, D-N-NNNNN, D.NNNN, D.NNNNN, D.-NNNNN, D0NNNN, DNNN NN. DNN, DNNN, DNNNN, DNNNNN, DE - NNNN, DE 0NNNN, DE NNNNN, DE-0NNNNN, DE-0-NNNN, DE-NNNN, DE-NNNNN, O-NNNN, W-NNNN, W-NNNN CityName NN, W-NNNNN CityName NN, WNNNN. Where CityName is Berlin, Hamburg, Dusseldorf, Seevetal, etc., and DM, DS and DW appear instead of DE sometimes. Unfortunately this list is not exhaustive. Readers should note that there is a frequent transcription error of O (Ooh) as 0 (Zero).
 
**Simple Regex (Doesn't catch everything): <tt>\(?(DE|D|1|W|)\.?-?[O0]?(BRD|1|)\s?-?\s?\d{2,5}\)?</tt>
 
*Hungary ([http://en.wikipedia.org/wiki/List_of_postal_codes Sourced from Wikipedia]): H- or HU-NNNN. (Note: Apparently introduced in 1973.). From the patent data the following postcodes can be noted: NN, NNN, NNNN, NNNN-, H-NNNN, H--NNNN, and H-NN-N. However, u. NN and u. N frequently appear at the end of the CTY string, and some cities are followed by roman numerals.
 
**Simple Regex <tt>(H|HU)?\s?-?\s?\d{2,4}-?\d{0,2}</tt>
 
*Ireland ([http://en.wikipedia.org/wiki/Republic_of_Ireland_postal_addresses Sourced from Wikipedia]): The Republic of Ireland does not use postal codes per se. However some cities, particularly Dublin, use one or two digit district numbers following the city name. In the patent data the format bmNNNN also appeared and the district numbers appear strictly at the end of the string, except in the case where it is followed by "Eire.".
 
**Simple Regex <tt>\d{1,2}\s?,?(Eire)?\.?$</tt>
 
*Spain ([http://en.wikipedia.org/wiki/List_of_postal_codes_in_Spain Sourced from Wikipedia]): Post 1976 Spanish postcodes are five digits of the format NNMMM, where NN indicates the province (01-52) or a reserved code (e.g. 80 for P.O. boxes). In the patent data Spansish postcodes are comparatively well behaved, with the following standard variants appearing: NNNNN, NNN NN, NNNN, NN NNNNN, NN- NNNNN, -NNNNN, NNNNN-, "NN, NNNNN", NNN, NN, N NNNNN-, NN-NN, NN-NN NNNNN, NNNNN-IBI, E-NNNNN, E-NNNN, E - NNNNN, E--NNNNN, ES-NNNNN.
 
**Simple Regex: <tt>(E|ES|)\d{0,2},?\s?-{0,2}\s?\d{2,5}-?(IBI|)</tt>
 
*Switzerland ([http://en.wikipedia.org/wiki/Postal_codes_in_Switzerland_and_Liechtenstein Sourced from Wikipedia]): Swiss (and Lictenstein) postcodes are hierarchical four-digit numbers of the form District+Area+Route+PONumber, where districts are numbered West to East (would you expect less from the Swiss?). In the patent data Swiss postcodes are comparatively immaculately behaved with the following formats appearing: NNNN, NNNN-, CH-NNNN, CH - NNNN, CH NNNN, CHNNN, CH- NNNN, CHNNN. Though the "H" may sometimes be lowercase.
 
**Simple Regex: <tt>(CH|Ch|)\s?-?\s?\d{3,4}-?</tt>
 
*United Kingdom ([http://en.wikipedia.org/wiki/UK_postcodes Sourced from Wikipedia]): A9 9AA, A99 9AA, A9A 9AA, AA9 9AA, AA99 9AA, AA9A 9AA.
 
**Simple Regex: <tt>([A-Z]{1,2}[0-9]{1,2}[A-Z]{0,1}\s[0-9][A-Z]{2,2})</tt>
 
  
The Match::PostalCodes.pm perl module provides a method to extract a postcode from a text string for a given ISO3166 code. The simple regular expressions listed above are not used verbatim, as more sophisticed techniques can be employed on per country basis.
+
XX.txt - Tab delimited plain text with no (intentional) string quotation.  
 +
Column(s): <tt>country</tt> <tt>str</tt> <tt>cty</tt> <tt>adm</tt> <tt>city</tt> <tt>postcode</tt> <tt>str</tt>
 +
The column order is not important. <tt>country</tt>, <tt>str</tt>, and <tt>cty</tt> can not all be null. <tt>adm</tt> <tt>city</tt> <tt>postcode</tt> are optional 'exception' fields that are processed with priority. They provide hand corrections and other specifically generated information.
  
 +
The perl module Match::Patent.pm loads and provides an interface to this source data. The source code is the primary module documentation.
 +
[[Postal Codes]]
 
==The Matching Process==
 
==The Matching Process==
  

Revision as of 00:33, 21 January 2010

This page details the various matching techniques used to Geocode inventor locations in the NBER patent data. Geocoding inventor locations entails matching the inventor addresses provided in the patent data to known locations through-out the world and recording their longitude and latitude.

Script Files

The scripts and modules that operationalize these matching techniques can be downloaded as a bundle with (MatchLocations.tar.gz v1.0.1 ~20Mb) or withour (MatchLocations_Full.tar.gz v1.0.1 ~20Mb) all supporting data files. Note that the current version is 1.0.3, which will be posted shortly. The bundles contain the default directory structure. Defaults can be changed in the MatchLocations.pl script.

The directories are as follows:

  • Source - Source data should be placed here. See below for formatting.
  • Results - Results generated by the scripts, including logs will appear here.
  • GNS - contains GNS reference data named GNS-XX.txt
  • Match - contains the modules

The bundle contains:

  • MatchLocations.pl - The main script that initializes and processes the matching requests
  • BatchMatch.pl - A script for running batches
  • Match::GNS.pm - Interface to the GNS reference data (see below)
  • Match::Patent.pm - Interface to the Patent Location data (see below)
  • Match::Common.pm - Provides common (string cleaning) routines for both the reference and source interface modules
  • Match::PostalCodes.pm - A module that extracts postcodes of various formats from (address) strings
  • Match::Gram.pm - Custom NGram Module
  • Match::LCS.pm - A standard LCS Module
  • PatentLocations-Stopwords.txt - A Stop Word file (tab delimited)
  • GNS Reference Files - The full bundle contains a full set of correctly named GNS reference files

The MatchLocations.pl script can be run from any shell or command line with perl installed. Example commands are:

perl MatchLocations.pl -co GB -u -human -r -wf 

which will process ISO3166 country code GB (Great Britain), include unmatched inputs in the results file, produce a human choices file, write the report to a text file, and write fuzzy matches to additional seperate files as well as the main results file. Other options include over to override country designations and o to specify the results filename.

perl MatchLocations.pl -h

produces a simple help output.

Reference Data

The reference data for the locations (which provides the longitude and latitudes) is taken from the (U.S.) National Geospatial-Intelligence Agency's GEOnet Names Server (GNS) which covers the world excluding the U.S. and Antartica.

This project uses ISO3166 two-character country codes to name source and reference files. GNS does not use ISO3166 country codes, and so users will need to translate accordingly (see the GNS page for details). A full bundle of correctly names GNS files is also available.

The perl module Match::GNS.pm loads, indexes and provides an interface to key variables from this data. The source code is the primary module documentation. The load() method takes and ISO3166 code, and the index methods and most other methods take one of two specific GNS FC codes (e.g. "P" for populated place, and "A" for administrative area).

The Source Files

Per country source files are extracted from the NBER patent data. The format of the source file(s) is as follows (XX is an ISO3166 code):

XX.txt - Tab delimited plain text with no (intentional) string quotation. 
Column(s): country str cty adm city postcode str
The column order is not important. country, str, and cty can not all be null. adm city postcode are optional 'exception' fields that are processed with priority. They provide hand corrections and other specifically generated information.

The perl module Match::Patent.pm loads and provides an interface to this source data. The source code is the primary module documentation.

Postal Codes

The Matching Process

The matching process is carried out by MatchLocations.pl script, and its dependent modules (detailed above), which has a standard pod based command line interface. The -co option specifies the ISO3166 country code to be matched.

Glossary of terms:

  • Units - isolated logical units from an address, such as the street number and name, the town, or the region. Postal codes are treated separately.
  • Tokens - Single words or sequences of words separated by a space (note that this is a specific usage)
  • n-grams - character sequences, such as bigrams (two letters from aa to zz), trigrams (aaa-zzz) and so forth
  • Exact Matching - Case insensitive of matching of the entire sequence of both the source and the reference strings
  • LCS - Longest Common Subsequence based matching (See below)
  • Place and administrative area - somewhere identified as a FC=P or FC=A respectively in the GNS data. Unless otherwise specified matches are performed for both place and administrative area separately and in series.

The sequence of processing is as follows (matching only the remaining unmatched locations at each stage):

  1. Load the source files, clean and parse (parsing identifies units)
  2. Load the reference file, build indices
  3. Exact match the exception units of records with exceptions
  4. Exact match the units of well-formatted records
  5. Exact match tokens (1-5 words)
  6. N-gram and LCS match
  7. Reconsile multiple matches

Exact Matching Units

The exact matching of units is performed for both the exception units and units of "well-formatted" records, that is records that have comma seperated logical units. Postcodes are extracted as a logical unit if possible first (to generate the PRS_POSTCODE field). Exact matching is case insensitive and units are trimmed of preceeding and subsequent spaces, but otherwise the match must be exact. Units are matched from the bottom to the top, in order of precedence. That is if the string is Unit1, Unit2, Unit3, Postcode; then Unit3 is matched with precedence over Units 2 and 1, and so forth. However, if multiple matches are made for a "Place" and one match is made for the "Area", then preference is given to a Place name that is different from the Area name. This is done as many Areas are also places, and more information from the source string is used in this fashion. For example if the string were "Chelsea, London" and both Chelsea and London were recorded in the GNS data as Places, but only London was recorded as a Area, then it would be most sensible to record Place=Chelsea, Area=London, and not Place=London, Area=London. The same 'difference preference' is also applied in the rare cases where there are multiple matches on Area but only one on Place.

Token Matching

Each source string is cleaned of upper-asci code (i.e. reduced to alphanumerics plus spaces), removed of its postcode (to generate PRS_POSTCODE) and then seperated into token arrays on space characters. Thus the string "String1 String2 String3 String4 Postcode" would become: [0]String1 [1]String2 [2]String3 [3]String4

An arbitrary upper token set length limit of 5 is used if the length of the source token array (4 in the example above) is greater than or equal to 5. Then beginning at the upper length limit and decreasing by one after each set of this lenght has been tried, and starting from the right hand-side and moving one unit to the left each time, the token sets are joined with spaces and exact matched against the reference string. This process iterates all length one token sets have been tried and records the matches in the order that they were made. Thus continuing the example above the space-joined source token sets would be, in the order that they are tried:

  1. String1 String2 String3 String4 (token set lenght=4, first and only set)
  2. String2 String3 String4 (token set lenght=3, first set)
  3. String1 String2 String3 (token set lenght=3, second set)
  4. String3 String4 (token set lenght=2, first set)
  5. String2 String3 (token set lenght=2, second set)
  6. String1 String2 (token set lenght=2, third set)
  7. String4 (token set lenght=1, first set)
  8. String3 (token set lenght=1, second set)
  9. String2 (token set lenght=1, third set)
  10. String1 (token set lenght=1, fourth set)

As with the Exact Matching, the 'difference preference' for Areas and Places is invoked.

NGram and LCS Matching

Longest Common Subsequence (LCS) is an abundantly used fuzzy matching technique. The Longest Common Subsequence page on wikipedia provides a very detailed background. However, LCS matching of two datasets is an NP-Hard problem and extremely processor intensive. To avoid long run-times, LCS matching is done on only a small sub-set of strings that have met the NGram criteria detailed below.

NGrams are character-based token strings. Source and reference strings are transformed to include only characters from one of the following numbered sets:

  1. ABCDEFGHIJKLMNOPQRSTUVWXYZ (i.e. uppercase Latin alphabet)
  2. 0123456789 (i.e. Standard numbers)
  3. " " (i.e. the space character)
  4. Alphanumeric (i.e. 1 and 2 above)
  5. Alphanumeric plus space (1,2,3 above)
  6. Alphabet only plus space(1 and 3 above)

Gramsets of various lengths, currently 2,3 and 4 characters, are used. Reference strings are decomposed into grams and both forward and reverse indexed for speed on large datasets. Source strings are decomposed into grams. Candidate reference strings are found by compiling a list of all reference strings that contain at least one of the source grams (this uses the reverse index). Then each of the candidate reference strings is gram-matched against the source gram to compute a number of scores. Specically gram-matching involves computing:

  • The number of grams in the source string
  • The number of grams in the reference string
  • The number of grams in the source string that appear in the reference string
  • The number of grams in the reference string that appear in the source string
  • The above two numbers as a source percentage and a reference percentage

Then an optional constraint that the first letter of the source and reference strings must match can be employed. If a source and reference pair meets a variety of criteria on the above variables (for example the source string is greater than 10 characters long, the source percentage of grams is greater then 80% and less than 110%, the reference percentage of grams is greater than 90% and less than 105%, and the two strings start with the same letter), then the LCS is computed. An LCS percentage, using the maximum of the lengths of the source and reference strings as a denominator, is also calculated.

The candidate reference string with the highest gram and LCS scores, assuming that these scores meet the decision threshold is then selected as the closest match. If no candidate reference string meets the decision threshold the source string is left unmatched. The decision thresholds are configured in the MatchLocations.pl script, and sets of Ngram/LCS matchings, using different character sets, gram lengths and decision thresholds, are performed sequentially, with the currently unmatched source strings used as input for each round.

Reconciling Multiple Matches

In a small number of cases it is possible that the source string will achieve more than one A (Area) or P (Place) match. For example suppose the string "Glouchester Street Cambridge Cambridgeshire" were considered. This could concievably produce two P matches and one A match with the token matching algorithm detailed above.

To reconsile multiple matches the following process is undertaken:

  • If there are both P and A matches and more than one of either P and/or A matches, then determine the P-A pair with the shortest distance between then using a Haversine formula distance calculation based on the GNS reported longitudes and latitudes. (Note that the Haversine formula is implemented in the Match::GNS.pm module and is the most accurate method over short distances, where other methods, like the great-circle method, suffer from compounded rounding error problems.)
  • If there are multiple P matches but no A matches, take the one that was arrived at first.
  • If there are multiple A matches but no P matches, take the one that was arrived at first.

Human Choices

It is generally preferrable to have a very high degree of confidence in the fuzzy matches, so that they can be treated as correct without individual inspection. However the script and modules are capable of matching to any degree of accuracy. To get further matches that can be inspected/validated/chosen by a human agent, a very weak criteria is set for two runs of fuzzy matching, and then in each run the best (in terms of parameter scores) options are recorded and written into a 'human choice' file.

As a result a human choice file may contain:

  1. No matches for a source string as none of the reference strings managed to reach even the very weak threshold criteria.
  2. One match, as both runs of fuzzy matching produced the same recommendation.
  3. Two matches, as both runs of fuzzy matching produced one best candidate and the candidates were unique.
  4. More than two matches, as one or both of the fuzzy matching runs had multiple unique candidates with the same scores.

It appears likely that blocks of matches will be able to be identified from the human choice files, by restricting the results sets to ranges for one or more of the provided match accuracy parameters.