Changes

Jump to navigation Jump to search
==Script Files==
The scripts and modules that operationalize these matching techniques can be downloaded as individual scripts or as a bundle with all supporting data files ([http://www.edegan.com/repository/MatchLocations.tar.gz MatchLocations.tar.gzv1.0.1]~20Mb)or without ([http://www.edegan.com/repository/MatchLocations_Full.tar.gz MatchLocations_Full.tar.gz v1.0.1] ~20Mb) all supporting data files. Note that the reference current version is 1.0.3, which will be posted shortly. The bundles contain the default directory structure. Defaults can be changed in the MatchLocations.pl script.   The directories are as follows:*Source - Source data should be placed in a subdirectory here. See below for formatting.*Results - Results generated by default the scripts, including logs will appear here.*GNS - contains GNS reference data named "GNS"-XX.txt*Match - contains the modules 
The bundle contains:
*[http://www.edegan.com/repository/MatchLocations.pl MatchLocations.pl] - The main script and that initializes and processes the matching requests*[http://wwwBatchMatch.edegan.com/repository/GNS.pm pl - A script for running batches *Match::GNS.pm] - Interface to the GNS reference data (see below)*[http://www.edegan.com/repository/Patent.pm Match::Patent.pm] - Interface to the Patent Location data (see below)*[http://www.edegan.com/repository/Gram.pm Match::GramCommon.pm] - Custom NGram Module*[http://www.edegan.com/repository/LCS.pm Match::LCS.pm] - A standard LCS ModuleProvides common (string cleaning) routines for both the reference and source interface modules*[http://www.edegan.com/repository/PostalCodes.pm Match::PostalCodes.pm] - A module that extracts postcodes of various formats from (address) strings*[httpMatch:://www.edegan.com/repository/CommonGram.pm - Custom NGram Module*Match::CommonLCS.pm] - Provides common (string cleaning) routines for both the reference and source interface modulesA standard LCS Module*[http://www.edegan.com/repository/PatentLocations-Stopwords.txt PatentLocations-Stopwords.txt] - A Stop Word file (tab delimited)*GNS Reference Files - The full bundle contains a full set of correctly named GNS reference files
==Reference Data==The MatchLocations.pl script can be run from any shell or command line with perl installed. Example commands are:  <tt>perl MatchLocations.pl -co GB -u -human -r -wf </tt>
The reference data for the locations which will process ISO3166 <tt>country</tt> code GB (which provides the longitude and latitudesGreat Britain) is taken from , include <tt>unmatched</tt> inputs in the (U.S.) National Geospatial-Intelligence Agency's [[GEOnet Names Server | GEOnet Names Server (GNS)]] which covers results file, produce a <tt>human</tt> choices file, write the world excluding <tt>report</tt> to a text file, and <tt>write fuzzy</tt> matches to additional seperate files as well as the U.Smain results file. Other options include <tt>over</tt> to override country designations and Antartica<tt>o</tt> to specify the results filename.
This project uses [[ISO3166]] two-character country codes to name source and reference files. Available reference data files for countries that have or are being processed include:*Belgium: [http://www <tt>perl MatchLocations.edegan.com/repository/GNSpl -BE.txt GNS-BE.txt]*France: [http:/h</www.edegan.com/repository/GNS-FR.txt GNS-FR.txt]*Great Britain (The United Kingdom of Great Britain and Northern Ireland): [http://www.edegan.com/repository/GNS-GB.txt GNS-GB.txt]*Spain: [http://www.edegan.com/repository/GNS-ES.txt GNS-ES.txt]*Switzerland: [http://www.edegan.com/repository/GNS-CH.txt GNS-CH.txt]tt>
The perl module Match::GNS.pm loads, indexes and provides an interface to key variables from this data. The source code is the primary module documentation. The load() method takes and ISO3166 code, and the index methods and most other methods take one of two specific GNS FC codes (e.g. "P" for populated place, and "A" for administrative area)produces a simple help output.
==The Source Files==
Per country source files are extracted from the NBER patent data. The problem of identifying countries for some address records will be addressed later. The format of the source file(s) is as follows (XX is an ISO3166 code): * XX.txt - Tab delimited plain text with no (intentional) string quotation. Column(s): <tt>country</tt> <tt>str</tt> <tt>cty</tt><tt>adm</tt> <tt>city</tt> <tt>postcode</tt> <tt>str</tt> *XX_exceptionsThe column order is not important.txt - Tab delimited plain text with no (intentional) string quotation<tt>country</tt>, <tt>str</tt>, and <tt>cty</tt> can not all be null. Column(s): <tt>cty adm</tt> <tt>city adm </tt> <tt>postcode</tt>are optional 'exception' fields that are processed with priority. They provide hand corrections and other specifically generated information. The perl module Match::Patent.pm loads and provides an interface to this source data. The source code is the primary module documentation. The Match::PostalCodes.pm perl module provides a method to extract [[Postal Codes]] from a the addresses for a large number of ISO3166 codes, and implements 'standard' postal code identification for all other jurisdictions. ==Reference Data==
The <tt>cty</tt> reference data for the locations (which provides the longitude and latitudes) is used as a primary key in both filestaken from the (U.S. The XX_exceptions) National Geospatial-Intelligence Agency's [[GEOnet Names Server | GEOnet Names Server (GNS)]] which covers the world excluding the U.txt provides details on hand identified records, or other records where special care has been takenS. This file is not strictly required by the scrips but will be processed if presentand Antartica.
The perl module Match::PatentThis project uses [[ISO3166]] two-character country codes to name source and reference files.pm loads GNS does not use ISO3166 country codes, and provides an interface so users will need to this source datatranslate accordingly (see the [[GEOnet Names Server | GNS page]] for details). The source code A full bundle of correctly names GNS files is the primary module documentationalso available.
==Postal Codes==The perl module Match::GNS.pm loads, indexes and provides an interface to key variables from this data. The source code is the primary module documentation. The load() method takes an ISO3166 code, and the index methods and most other methods take specific GNS FC codes (e.g. "P" for populated place, "L" for locality, and "A" for administrative area). Which GNS FC codes are used is specified in the @Letters global varible of MatchLocations.pl and inherited by all other modules.
Postal codes, known as ZIP MatchLocations.pl also retrieves a list of all ISO3166 codes included in the U.Sdata (from the MatchPatent.pm module) and in any specified override file, vary by national jurisdiction and for historical reasonscalls Match::GNS. The following postal codes formats pm to load them. An override file can be specified with the <tt>-over</tt> option. Override files are posted for referencetab-delimited and have the format:
*United Kingdom ([http://en ListedISO3166 1stPreference 2ndPreference 3rdPreference .wikipedia.org/wiki/UK_postcodes Sourced from Wikipedia]): A9 9AA, A99 9AA, A9A 9AA, AA9 9AA, AA99 9AA, AA9A 9AA. Simple Regex: <tt>([A-Z]{1,2}[0-9]{1,2}[A-Z]{0,1}\s[0-9][A-Z]{2,2})</tt>
The Match::PostalCodesISO3166 listed in the source data is then overridden and the alternatives are searched for matches in order of preference.pm perl module provides a method to extract The search is terminated when a postcode from a text string for a given ISO3166 codematch is found or the override set is exhausted.
==The Matching Process==
The matching process is carried out by [http://www.edegan.com/repository/MatchPatentLocations.pl MatchPatentLocationsMatchLocations.pl]script, and its dependent modules (detailed above), which has a standard pod based command line interface. The <tt>-co </tt> option specifies the ISO3166 country code to be matched. The script uses these modules: PatentLocations.pmIf the override option is used, GNSthen the <tt>-co</tt> option can be used to specify the source file.pmWhen an override option is set to 1, CleanStrings.pmrather than to the filename containing the overrides, GramMatch.pm, LCS.pm and PostalCodes.pm. In addition then the source files countries are used to determine which GNS reference files and patent data source files as detailed abovelookups to perform, otherwise the script also use PatentLocations<tt>-Stopwords.txtco</tt> option specifies the GNS reference set.
Glossary of terms:
*Exact Matching - Case insensitive of matching of the entire sequence of both the source and the reference strings
*LCS - Longest Common Subsequence based matching (See below)
*Place Administrative area, populated place, and administrative area locality - somewhere locations identified as a FC=A, FC=P or FC=A L respectively in the GNS data. Unless otherwise specified , matches are performed for both place and administrative area all GNS FC codes requested (default is A,P,L) separately and in series.
The sequence of processing is as follows (matching only the remaining unmatched locations at each stage):
===Exact Matching Units===
The exact matching of units is performed for both the exception units and units of "well-formatted" records, that is records that have comma seperated logical units. Postcodes are extracted as a logical unit if possible first (to generate the PRS_POSTCODE field). Exact matching is case insensitive and units are trimmed of preceeding and subsequent spaces, but otherwise the match must be exact. Units are matched from the bottom to the top, in order of precedence. That is if the string is Unit1, Unit2, Unit3, Postcode; then Unit3 is matched with precedence over Units 2 and 1, and so forth. However, if multiple matches are made for a some FC code and one match is made for another, then preference is given to the different combination. For example if the string were "Chelsea, London" and both Chelsea and London were recorded in the GNS data as FC=P, but only London was recorded as a FC=A, then it would be most sensible to record P=Chelsea, A=London, and not P=London, A=London. This is differencing is done in the matching method and independent from the resolution of multiple matches at the end.
===Token Matching===
An arbitrary upper token set length limit of 5 is used if the length of the source token array (4 in the example above) is greater than or equal to 5. Then beginning at the upper length limit and decreasing by one after each set of this lenght has been tried, and starting from the right hand-side and moving one unit to the left each time, the token sets are joined with spaces and exact matched against the reference string. This process iterates all length one token sets have been tried and records the matches in the order that they were made. Thus continuing the example above the space-joined source token sets would be, in the order that they are tried:
#String1 String2 String3 String4 (token set lenghtlength=4, first and only set)#String2 String3 String4 (token set lenghtlength=3, first set)#String1 String2 String3 (token set lenghtlength=3, second set)#String3 String4 (token set lenghtlength=2, first set)#String2 String3 (token set lenghtlength=2, second set)#String1 String2 (token set lenghtlength=2, third set)#String4 (token set lenghtlength=1, first set)#String3 (token set lenghtlength=1, second set)#String2 (token set lenghtlength=1, third set)#String1 (token set lenghtlength=1, fourth set)
===NGram and LCS Matching===
Longest Common Subsequence (LCS) is an abundantly used fuzzy matching technique. The [http://en.wikipedia.org/wiki/Longest_common_subsequence Longest Common Subsequence page on wikipedia] provides a very detailed background. However, LCS is an matching of two datasets is an NP-Hard problem and extremely processor intensive. To avoid long run -times, LCS matching is done on only a small sub-set of string strings that have met the NGram criteria detailed below.
NGram NGrams are letter character-based token strings. Source and reference strings are transformed to include only characters from one of the following numbered sets:
#ABCDEFGHIJKLMNOPQRSTUVWXYZ (i.e. uppercase Latin alphabet)
#0123456789 (i.e. Standard numbers)
The candidate reference string with the highest gram and LCS scores, assuming that these scores meet the decision threshold is then selected as the closest match. If no candidate reference string meets the decision threshold the source string is left unmatched. The decision thresholds are configured in the MatchLocations.pl script, and sets of Ngram/LCS matchings, using different character sets, gram lengths and decision thresholds, are performed sequentially, with the currently unmatched source strings used as input for each round.
===Reconsiling Reconciling Multiple Matches===
In a small number of cases it is possible that the source string will achieve more than one A (Area) or P (Place) matchfor more than one FC code. For example suppose the string "Glouchester Street Cambridge Cambridgeshire" where were considered. This could concievably produce two P matches and one A match with the token matching algorithm detailed above.
To reconsile multiple matches the following process is undertaken:
*If there an FC code has only one match keep that one match*Aim for distinction in the set, giving priority in the order that the FC codes are both specified in MatchLocations.pl. The default is to include A,P ,L in order, so that precedence follows importance and size. This is important if multiple FC codes contain multiple overlapping matches. For example suppose A matches =1,2 P=2,3 and more than one of either P L=3. The algorithm will look forward and/or backwards to assign: A =1 P=2 L=3.*Determine the set of FC code matches, then determine the P-A pair with the shortest distance between then them using a [http://en.wikipedia.org/wiki/Haversine_formula Haversine formula] distance calculation based on the GNS reported longitudes and latitudes. (Note that the Haversine formula is implemented in the Match::GNS.pm module and is the most accurate method over short-distances, where other methods, like the great-circle method, suffer from compounded rounding error problems.)This is important when multiple FC codes have muliple matches but they do not overlap.*If there one or more match is found for an FC code then one final 'best' match must be reported, even if it overlaps with another FC code or is distant. ==Human Choices== It is generally preferrable to have a very high degree of confidence in the fuzzy matches, so that they can be treated as correct without individual inspection. However the script and modules are capable of matching to any degree of accuracy. To get further matches that can be inspected/validated/chosen by a human agent, a very weak criteria is set for two runs of fuzzy matching, and then in each run the best (in terms of parameter scores) options are recorded and written into a 'human choice' file. As a result a human choice file may contain:#No matches for a source string as none of the reference strings managed to reach even the very weak threshold criteria.#One match, as both runs of fuzzy matching produced the same recommendation.#Two matches, as both runs of fuzzy matching produced one best candidate and the candidates were unique.#More than two matches, as one or both of the fuzzy matching runs had multiple P unique candidates with the same scores. It appears likely that blocks of matches will be able to be identified from the human choice files, by restricting the results sets to ranges for one or more of the provided match accuracy parameters. ==Output Files== By default all files are outputted to the Results directory. Which files are outputted depends on the options selected, though the main results file is always outputted (with or without unmatched addresses) and includes fuzzy matches but no (unless the <tt>-e</tt> option is used to force just exact matching). The main results file outputs:*COUNTRY - From the source entry*STR - From the source entry*CTY - From the source entry*EXP_CITY - From the source entry*EXP_ADM - From the source entry*EXP_POSTCODE - From the source entry*CTY_STR - A compound entry, delimited by #, used as an internal key. It is the software's best estimate of an address structure.*EXP_STR - A matchescompound entry, delimited by #, take made from the one exception data in a similar way to CTY_STR*PRS_POSTCODE - The software's best estimate of the postcode if any*MATCH_TYPE - The match type that was arrived at firstused to make the match*PLACE - The name of the most precise location*UNI - The GNS unique identifier of the most precise location*LAT - The latitude of the most precise location*LONG - The longitude of the most precise location*FC - The FC code of the most precise location The most precise location is taken to be the finest grained result. That is the match corresponding to the lowest level FC code. In the case of the default of FC=A,P,L preference is given to L then P then A.The following variables are then repeated for each FC code searched, and prefixed by the FC code (if no match was found for this FC code the entries will be blank):*NAME*UNI*LAT*If there LONG The fuzzy match file(s), if requested with <tt>-wf</tt>, have the same format (they are multiple A matches but no P matcheswritten by the same method). The report file is a copy of the output to the terminal, and can be enabled with the <tt>-r</tt> option. The human choice file (enabled with <tt>-human</tt>) has its own format as follows:*SOURCENAME - The word, take token or string from the source entry that is being considered as relevant for a match*REFNAME - The name of a place in the GNS file*COUNTRY - From the source entry*STR - From the source entry*CTY - From the source entry*EXP_CITY - From the source entry*EXP_ADM - From the source entry*EXP_POSTCODE - From the source entry*REFTOTAL - The total number of grams in REFNAME*SOURCETOTAL - The total number of grams in SOURCENAME*REFPC - the one percentage of the REFNAME grams that appear in the SOURCENAME gram set*SOURCEPC - the percentage of the SOURCENAME grams that appear in the REFNAME gram set*LEFTGRAMS - the number of the REFNAME grams that appear in the SOURCENAME gram set*RIGHTGRAMS - the number of the SOURCENAME grams that was arrived at appear in the REFNAME gram set*LCSSCORE - The size of the longest common subsequence in characters*SOURCELENGTH - The length of SOURCENAME*REFLENGTH - The length of REFNAME*MAXLENGTH - The maximum of the lengths of SOURCENAME and REFNAME*LCSPC - The LCSSCORE divided by the MAXLENGTH*FIRSTLETTERBINDS - Whether the fuzzy matching algorithm required the same first.letter in SOURCENAME and REFNAME*GRAMALPHABET - The gram alphabet used by the matching algorithm*GRAMLENGTH - The length of the n-grams used
Anonymous user

Navigation menu