Changes

1,399 bytes added, 05:47, 22 January 2010

==Script Files==

The scripts and modules that operationalize these matching techniques can be downloaded ~~as individual scripts or~~ as a bundle with ~~all supporting data files~~ ([http://www.edegan.com/repository/MatchLocations.tar.gz MatchLocations.tar.gz v1.0.1] ~20Mb) or without ([http://www.edegan.com/repository/MatchLocations_Full.tar.gz MatchLocations_Full.tar.gz v1.0.1] ~20Mb) all supporting data files. Note that the current version is 1.0.3, which will be posted shortly. The bundles contain the default directory structure. Defaults can be changed in the MatchLocations.pl script. The directories are as follows:
*Source - Source data should be placed here. See below for formatting.
*Results - Results generated by the scripts, including logs, will appear here.
*GNS - Contains GNS reference data named GNS-XX.txt
*Match - Contains the modules

~~Note that the reference data should be placed in a subdirectory by default named "GNS", and source data should be placed in a subdirectory by default named "Source". Both defaults can be changed in the MatchLocations.pl script.~~

The bundle contains:

*~~[http://www.edegan.com/repository/MatchLocations.pl MatchLocations.pl]~~ MatchLocations.pl - The main script ~~and~~ that initializes and processes the matching requests
*BatchMatch.pl - A script for running batches
*~~[http://www.edegan.com/repository/Match/GNS.pm Match::GNS.pm]~~ Match::GNS.pm - Interface to the GNS reference data (see below)
*~~[http://www.edegan.com/repository/Match/Patent.pm Match::Patent.pm]~~ Match::Patent.pm - Interface to the Patent Location data (see below)
*~~[http://www.edegan.com/repository/Match/Common.pm Match::Common.pm]~~ Match::Common.pm - Provides common (string cleaning) routines for both the reference and source interface modules
*~~[http://www.edegan.com/repository/Match/PostalCodes.pm Match::PostalCodes.pm]~~ Match::PostalCodes.pm - A module that extracts postcodes of various formats from (address) strings
*~~[http://www.edegan.com/repository/Match/Gram.pm Match::Gram.pm]~~ Match::Gram.pm - Custom NGram Module
*~~[http://www.edegan.com/repository/Match/LCS.pm Match::LCS.pm]~~ Match::LCS.pm - A standard LCS Module
*~~[http://www.edegan.com/repository/PatentLocations-Stopwords.txt PatentLocations-Stopwords.txt]~~ PatentLocations-Stopwords.txt - A Stop Word file (tab delimited)
*GNS Reference Files - The full bundle contains a full set of correctly named GNS reference files

The MatchLocations.pl script can be run from any shell or command line with perl installed. Example commands are:

<tt>perl MatchLocations.pl -co GB -u -human -r -wf</tt>

which will process ISO3166 <tt>country</tt> code GB (Great Britain), include <tt>unmatched</tt> inputs in the results file, produce a <tt>human</tt> choices file, write the <tt>report</tt> to a text file, and <tt>write fuzzy</tt> matches to additional separate files as well as the main results file. Other options include <tt>over</tt> to override country designations and <tt>o</tt> to specify the results filename.

<tt>perl MatchLocations.pl -h</tt>

produces a simple help output.

~~==Reference Data==~~

~~The reference data for the locations (which provides the longitude and latitudes) is taken from the (U.S.) National Geospatial-Intelligence Agency's [[GEOnet Names Server | GEOnet Names Server (GNS)]] which covers the world excluding the U.S. and Antartica.~~

~~This project uses [[ISO3166]] two-character country codes to name source and reference files. Available reference data files for countries that have or are being processed include:~~
*~~Belgium: [http://www.edegan.com/repository/GNS-BE.txt GNS-BE.txt]~~
*~~France: [http://www.edegan.com/repository/GNS-FR.txt GNS-FR.txt]~~
*~~Germany: [http://www.edegan.com/repository/GNS-DE.txt GNS-DE.txt]~~
*~~Great Britain (The United Kingdom of Great Britain and Northern Ireland): [http://www.edegan.com/repository/GNS-GB.txt GNS-GB.txt]~~
*~~Spain: [http://www.edegan.com/repository/GNS-ES.txt GNS-ES.txt]~~
*~~Switzerland: [http://www.edegan.com/repository/GNS-CH.txt GNS-CH.txt]~~

~~The perl module Match::GNS.pm loads, indexes and provides an interface to key variables from this data. The source code is the primary module documentation. The load() method takes and ISO3166 code, and the index methods and most other methods take one of two specific GNS FC codes (e.g. "P" for populated place, and "A" for administrative area)~~

==The Source Files==

Per country source files are extracted from the NBER patent data. ~~The problem of identifying countries for some address records will be addressed later.~~ The format of the source file(s) is as follows (XX is an ISO3166 code):

XX.txt - Tab delimited plain text with no (intentional) string quotation. Column(s): <tt>country</tt> <tt>str</tt> <tt>cty</tt> <tt>adm</tt> <tt>city</tt> <tt>postcode</tt> <tt>str</tt>

The column order is not important. <tt>country</tt>, <tt>str</tt>, and <tt>cty</tt> cannot all be null. <tt>adm</tt>, <tt>city</tt> and <tt>postcode</tt> are optional 'exception' fields that are processed with priority. They provide hand corrections and other specifically generated information.

*~~XX.txt - Tab delimited plain text with no (intentional) string quotation. Column(s): <tt>cty</tt>~~
*~~XX_exceptions.txt - Tab delimited plain text with no (intentional) string quotation. Column(s): <tt>cty city adm postcode</tt>~~

~~The <tt>cty</tt> is used as a primary key in both files. The XX_exceptions.txt provides details on hand identified records, or other records where special care has been taken. This file is not strictly required by the scrips but will be processed if present.~~

~~The perl module Match::Patent.pm loads and provides an interface to this source data. The source code is the primary module documentation.~~

The perl module Match::Patent.pm loads and provides an interface to this source data. The source code is the primary module documentation. The Match::PostalCodes.pm perl module provides a method to extract [[Postal Codes]] from the addresses for a large number of ISO3166 codes, and implements 'standard' postal code identification for all other jurisdictions.

~~==Postal Codes==~~
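As a rough illustration of what extracting a postcode from an address string involves, here is a minimal Python sketch. It is not the Match::PostalCodes.pm code, which is written in Perl and whose per-country logic is not reproduced on this page. The pattern is the UK "simple regex" quoted in this page's postal code list; the function name <tt>extract_postcode</tt>, the upper-casing step, and the ASCII-only assumption are all hypothetical choices made for the example.

```python
import re

# UK "simple regex" as quoted on this page; real data needs more tolerant rules.
UK_SIMPLE = re.compile(r"([A-Z]{1,2}[0-9]{1,2}[A-Z]{0,1}\s[0-9][A-Z]{2,2})")


def extract_postcode(address, pattern=UK_SIMPLE):
    """Return (postcode, remainder); postcode is None when nothing matches.

    Assumes ASCII input, so that upper-casing does not change string offsets.
    """
    match = pattern.search(address.upper())
    if match is None:
        return None, address
    # Cut the matched span out of the original string and tidy the separators.
    remainder = (address[:match.start()] + address[match.end():]).strip(" ,")
    return match.group(1), remainder


if __name__ == "__main__":
    print(extract_postcode("12 High Street, Cambridge, CB2 1TN"))
```

Returning the remainder alongside the postcode lets later matching stages work on the address with the postcode already removed.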

~~Postal codes, known as ZIP codes in the U.S., vary by national jurisdiction and for historical reasons. The following postal codes formats are posted for reference, as are some simple regular expressions that should safely match most variants:~~

*~~United Kingdom ([http://en.wikipedia.org/wiki/UK_postcodes Sourced from Wikipedia]): A9 9AA, A99 9AA, A9A 9AA, AA9 9AA, AA99 9AA, AA9A 9AA. Simple Regex: <tt>([A-Z]{1,2}[0-9]{1,2}[A-Z]{0,1}\s[0-9][A-Z]{2,2})</tt>~~
*~~France ([http://en.wikipedia.org/wiki/Postal_codes_in_France Sourced from Wikipedia]): NNNMM or NNMMM where NN and NNN are numerics indicating the préfectures and sous-préfectures, respectively, and MMM are other numerics. However the following formats also appear frequently in the patent data: F NNNNN, F-NNNNN, F - NNNNN, F- NN NNN, FNNNNN, F-NN NNN, F-NN, FR - NNNNN, FR NNNNN, FR-NNNNN, (NNNNN), (NN), - NNNNN, -NNNNN-, -NN, NN, NN - N-NNNNN, NNNN, NN NNN, NN.NNN, NNN/N, NN., F. NNNNN, F.NNNNN, "FRNN,NNN". French postal codes most often, but not exclusively, occur at the start of the address string. If there are fewer than 5 digits, trailing zeros should be added.~~
**~~Simple Regex: <tt>\(?F?R?\.?\s?\d?-?\s?\d{2,3}\.?\s?\/?\d{0,3}-?\)?</tt>~~
*~~Belgium ([http://en.wikipedia.org/wiki/List_of_postal_codes_in_Belgium Sourced from Wikipedia]): NNNN where N is a numeric. Belgian postcodes are usually placed before the city, and the number of trailing zeros indicates the size of the city. However, the following formats also appear frequently in the patent data: NN NNNN, NNN, NNNN, NNNNN, NNNNN-, NNN B-NNNN, NNN B-NNN, NN-NNNN, B-NNNN, NN - B NNNN, NN, N - B - NNNN, "NN, B. NNNN", B - NNNN, B -NNNN, B NNNN, B- NNNN, B--NNNN, B-NNNN, B-NNNN-, BNNNN, BNNNNN, BE - NNNN, BE-NNNN, BF-NNNN.~~
**~~Simple Regex: <tt>\d{0,3},?\s?-{0,2}\s?B?[EF]?\.?\s?-{0,2}\s?\d{1,5}-?</tt>~~
*~~Germany ([http://en.wikipedia.org/wiki/List_of_postal_codes_in_Germany Sourced from Wikipedia]): Currently (post 1993) German postcal codes consist of five digits: NNMMM where NN indicates the broad area and MMM indicates the sub-area. Prior to 1993 postal codes had four digits NNNN and between 1989 and 1993, O-NNNN (for East, Ost, Germany) and W-NNNN (for West Germany) was used. However, the following formats also appear frequently in the patent data: (NNNN), (D-NNNN), -NNNN, 0-NNNN, 0 - NNNN, 0-NNN, 0NNNN, N CityName NN, NN CityName NN, NNN CityName NN, NNNN CityName NN, NNNN CityName N, NNNN CityName N/BRD, NNNN CityName N/HB, 1-DNNNN, "10,NNNN", BRD-NNNN, N, NN, NNN, NNNN, NNNNN, NN.NNNNN, NN-NNNNN, d-NNNN, NN - NNNNN, NN NNN, D-N, D-NN, D-NNN, D-NNNN, D-NNNNN, D N CityName NN, D NN, D NNNN, D NNNNN, D- NNNN, D- NNNNN, D--NNNNN, D-0-NNNN, D-0NNNN, D-N-NNNNN, D.NNNN, D.NNNNN, D.-NNNNN, D0NNNN, DNNN NN. DNN, DNNN, DNNNN, DNNNNN, DE - NNNN, DE 0NNNN, DE NNNNN, DE-0NNNNN, DE-0-NNNN, DE-NNNN, DE-NNNNN, O-NNNN, W-NNNN, W-NNNN CityName NN, W-NNNNN CityName NN, WNNNN. Where CityName is Berlin, Hamburg, Dusseldorf, Seevetal, etc., and DM, DS and DW appear instead of DE sometimes. Unfortunately this list is not exhaustive. Readers should note that there is a frequent transcription error of O (Ooh) as 0 (Zero).~~
**~~Simple Regex (Doesn't catch everything): <tt>\(?(DE|D|1|W|)\.?-?[O0]?(BRD|1|)\s?-?\s?\d{2,5}\)?</tt>~~
*~~Spain ([http://en.wikipedia.org/wiki/List_of_postal_codes_in_Spain Sourced from Wikipedia]): Post 1976 Spanish postcodes are five digits of the format NNMMM, where NN indicates the province (01-52) or a reserved code (e.g. 80 for P.O. boxes). In the patent data Spansish postcodes are comparatively well behaved, with the following standard variants appearing: NNNNN, NNN NN, NNNN, NN NNNNN, NN- NNNNN, -NNNNN, NNNNN-, "NN, NNNNN", NNN, NN, N NNNNN-, NN-NN, NN-NN NNNNN, NNNNN-IBI, E-NNNNN, E-NNNN, E - NNNNN, E--NNNNN, ES-NNNNN.~~
**~~Simple Regex: <tt>(E|ES|)\d{0,2},?\s?-{0,2}\s?\d{2,5}-?(IBI|)</tt>~~
*~~Switzerland ([http://en.wikipedia.org/wiki/Postal_codes_in_Switzerland_and_Liechtenstein Sourced from Wikipedia]): Swiss (and Lictenstein) postcodes are hierarchical four-digit numbers of the form District+Area+Route+PONumber, where districts are numbered West to East (would you expect less from the Swiss?). In the patent data Swiss postcodes are comparatively immaculately behaved with the following formats appearing: NNNN, NNNN-, CH-NNNN, CH - NNNN, CH NNNN, CHNNN, CH- NNNN, CHNNN. Though the "H" may sometimes be lowercase.~~
**~~Simple Regex: <tt>(CH|Ch|)\s?-?\s?\d{3,4}-?</tt>~~

==Reference Data==

The reference data for the locations (which provides the longitude and latitudes) is taken from the (U.S.) National Geospatial-Intelligence Agency's [[GEOnet Names Server | GEOnet Names Server (GNS)]] which covers the world excluding the U.S. and Antarctica.

This project uses [[ISO3166]] two-character country codes to name source and reference files. GNS does not use ISO3166 country codes, and so users will need to translate accordingly (see the [[GEOnet Names Server | GNS page]] for details). A full bundle of correctly named GNS files is also available.

~~The Match::PostalCodes.pm perl module provides a method to extract a postcode from a text string for a given country. The simple regular expressions listed above are not used verbatim, as more sophisticed techniques can be employed on per country basis.~~

The perl module Match::GNS.pm loads, indexes and provides an interface to key variables from this data. The source code is the primary module documentation. The load() method takes an ISO3166 code, and the index methods and most other methods take specific GNS FC codes (e.g. "P" for populated place, "L" for locality, and "A" for administrative area). Which GNS FC codes are used is specified in the @Letters global variable of MatchLocations.pl and inherited by all other modules.

MatchLocations.pl also retrieves a list of all ISO3166 codes included in the data (from the Match::Patent.pm module) and in any specified override file, and calls Match::GNS.pm to load them. An override file can be specified with the <tt>-over</tt> option. Override files are tab-delimited and have the format: <tt>ListedISO3166 1stPreference 2ndPreference 3rdPreference ...</tt> The ISO3166 code listed in the source data is then overridden and the alternatives are searched for matches in order of preference. The search is terminated when a match is found or the override set is exhausted.
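The override behaviour described above can be sketched in a few lines of Python. This is an illustration only, not the Perl implementation: the function names <tt>parse_overrides</tt> and <tt>search_with_overrides</tt>, and the <tt>lookup</tt> callback standing in for a GNS query, are hypothetical. The sketch also assumes that a country with no override line is simply searched under its listed code.

```python
def parse_overrides(text):
    """Parse tab-delimited lines: ListedISO3166, 1stPreference, 2ndPreference, ..."""
    overrides = {}
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("\t") if field.strip()]
        if len(fields) >= 2:  # need a listed code plus at least one preference
            overrides[fields[0]] = fields[1:]
    return overrides


def search_with_overrides(listed_code, name, overrides, lookup):
    """Try each preferred country in order; stop at the first match.

    lookup(code, name) is a stand-in for a reference-data query and must
    return a match or None. Returns (code, match) or None when exhausted.
    """
    for code in overrides.get(listed_code, [listed_code]):
        match = lookup(code, name)
        if match is not None:
            return code, match
    return None


if __name__ == "__main__":
    table = parse_overrides("GB\tGB\tIE\nDE\tAT\tCH\tDE\n")
    reference = {"IE": {"dublin"}, "CH": {"basel"}}
    find = lambda code, name: name if name.lower() in reference.get(code, set()) else None
    print(search_with_overrides("GB", "Dublin", table, find))
```

Note that the listed code is only searched if it also appears among the preferences, which follows the statement that the listed code "is then overridden".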

==The Matching Process==

The matching process is carried out by the [http://www.edegan.com/repository/MatchPatentLocations.pl MatchLocations.pl] script and its dependent modules (detailed above). The script has a standard pod-based command line interface. The <tt>-co</tt> option specifies the ISO3166 country code to be matched. If the override option is used, then the <tt>-co</tt> option can be used to specify the source file. When the override option is set to 1, rather than to the filename containing the overrides, the countries in the source file are used to determine which GNS lookups to perform; otherwise the <tt>-co</tt> option specifies the GNS reference set.

Glossary of terms:

*Exact Matching - Case-insensitive matching of the entire sequence of both the source and the reference strings

*LCS - Longest Common Subsequence based matching (See below)

*~~Place~~ Administrative area, populated place, and ~~administrative area~~ locality - ~~somewhere~~ locations identified as FC=A, FC=P or FC=L respectively in the GNS data. Unless otherwise specified, matches are performed for ~~both place and administrative area~~ all GNS FC codes requested (default is A,P,L) separately and in series.

The sequence of processing is as follows (matching only the remaining unmatched locations at each stage):

===Exact Matching Units===

The exact matching of units is performed for both the exception units and units of "well-formatted" records, that is, records that have comma-separated logical units. Postcodes are extracted as a logical unit first if possible (to generate the PRS_POSTCODE field). Exact matching is case insensitive and units are trimmed of leading and trailing spaces, but otherwise the match must be exact. Units are matched from the bottom to the top, in order of precedence. That is, if the string is Unit1, Unit2, Unit3, Postcode; then Unit3 is matched with precedence over Units 2 and 1, and so forth. However, if multiple matches are made for ~~a "Place"~~ some FC code and one match is made for ~~the "Area"~~ another, then preference is given to ~~a Place name that is~~ the different ~~from the Area name. This is done as many Areas are also places, and more information from the source string is used in this fashion~~ combination. For example, if the string were "Chelsea, London" and both Chelsea and London were recorded in the GNS data as ~~Places~~ FC=P, but only London was recorded as ~~a Area~~ FC=A, then it would be most sensible to record ~~Place~~ P=Chelsea, ~~Area~~ A=London, and not ~~Place~~ P=London, ~~Area~~ A=London. ~~The same 'difference preference' is also applied in the rare cases where there are multiple matches on Area but only one on Place.~~ This differencing is done in the matching method and is independent of the resolution of multiple matches at the end.
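The unit matching and the preference for a different combination can be sketched as follows. This Python fragment is only an illustration under stated assumptions: the function name <tt>exact_match_units</tt> and the dictionary-of-sets reference structure are hypothetical, postcode extraction is left out, and the rule used for "difference" (skip a candidate that is the sole match of another FC code) is one plausible reading of the description, not the Perl module's actual logic.

```python
def exact_match_units(address, reference):
    """Match comma-separated units against reference names, per FC code.

    reference maps an FC code (e.g. "P", "A") to a set of place names.
    Returns a dict of FC code -> chosen name.
    """
    units = [unit.strip() for unit in address.split(",") if unit.strip()]
    candidates = {}
    for fc, names in reference.items():
        by_lower = {name.lower(): name for name in names}
        # Right-most unit first, so later units take precedence.
        candidates[fc] = [by_lower[u.lower()] for u in reversed(units)
                          if u.lower() in by_lower]
    # Names that are the only match for some FC code.
    singles = {found[0] for found in candidates.values() if len(found) == 1}
    chosen = {}
    for fc, found in candidates.items():
        if not found:
            continue
        if len(found) == 1:
            chosen[fc] = found[0]
        else:
            # Prefer a name that differs from another code's sole match.
            distinct = [name for name in found if name not in singles]
            chosen[fc] = (distinct or found)[0]
    return chosen


if __name__ == "__main__":
    gns = {"P": {"Chelsea", "London"}, "A": {"London"}}
    print(exact_match_units("Chelsea, London", gns))
```

With the "Chelsea, London" example this yields P=Chelsea and A=London rather than London for both codes.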

===Token Matching===

An arbitrary upper token set length limit of 5 is used if the length of the source token array (4 in the example above) is greater than or equal to 5; otherwise, as in the example, the limit is the length of the array. Then, beginning at the upper length limit and decreasing by one after each set of this length has been tried, and starting from the right-hand side and moving one unit to the left each time, the token sets are joined with spaces and exact matched against the reference string. This process iterates until all length-one token sets have been tried, and records the matches in the order that they were made. Thus, continuing the example above, the space-joined source token sets would be, in the order that they are tried:

#String1 String2 String3 String4 (token set ~~lenght~~length=4, first and only set)
#String2 String3 String4 (token set ~~lenght~~length=3, first set)
#String1 String2 String3 (token set ~~lenght~~length=3, second set)
#String3 String4 (token set ~~lenght~~length=2, first set)
#String2 String3 (token set ~~lenght~~length=2, second set)
#String1 String2 (token set ~~lenght~~length=2, third set)
#String4 (token set ~~lenght~~length=1, first set)
#String3 (token set ~~lenght~~length=1, second set)
#String2 (token set ~~lenght~~length=1, third set)
#String1 (token set ~~lenght~~length=1, fourth set)

~~As with the Exact Matching, the 'difference preference' for Areas and Places is invoked.~~
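The ordering in the list above can be reproduced with a short Python sketch. It is illustrative rather than a port of the Perl code: the names <tt>token_sets</tt> and <tt>match_token_sets</tt> are hypothetical, and the reference data is assumed to be a plain collection of names.

```python
def token_sets(tokens, max_len=5):
    """Yield space-joined token sets, longest first and right-most first."""
    limit = min(max_len, len(tokens))  # cap of 5, or fewer if the array is short
    ordered = []
    for length in range(limit, 0, -1):
        # Start at the right-hand side and move one token left each time.
        for start in range(len(tokens) - length, -1, -1):
            ordered.append(" ".join(tokens[start:start + length]))
    return ordered


def match_token_sets(tokens, reference_names, max_len=5):
    """Return reference names matched (case-insensitively) in the order tried."""
    by_lower = {name.lower(): name for name in reference_names}
    return [by_lower[candidate.lower()]
            for candidate in token_sets(tokens, max_len)
            if candidate.lower() in by_lower]


if __name__ == "__main__":
    for candidate in token_sets(["String1", "String2", "String3", "String4"]):
        print(candidate)
```

Because matches are recorded in the order tried, longer and more right-hand token sets appear first in the result.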

===NGram and LCS Matching===

===Reconciling Multiple Matches===

In a small number of cases it is possible that the source string will achieve more than one ~~A (Area) or P (Place)~~ match for more than one FC code. For example, suppose the string "Glouchester Street Cambridge Cambridgeshire" were considered. This could conceivably produce two P matches and one A match with the token matching algorithm detailed above.

To reconcile multiple matches the following process is undertaken:

*If an FC code has only one match, keep that one match.
*Aim for distinction in the set, giving priority in the order that the FC codes are specified in MatchLocations.pl. The default is to include A,P,L in order, so that precedence follows importance and size. This is important if multiple FC codes contain multiple overlapping matches. For example, suppose A=1,2 P=2,3 and L=3. The algorithm will look forwards and backwards to assign: A=1 P=2 L=3.
*Determine the set of FC code matches with the shortest distance between them using a [http://en.wikipedia.org/wiki/Haversine_formula Haversine formula] distance calculation based on the GNS reported longitudes and latitudes. (Note that the Haversine formula is implemented in the Match::GNS.pm module and is numerically well behaved over short distances, where other great-circle formulations, such as the spherical law of cosines, suffer from rounding error problems.) This is important when multiple FC codes have multiple matches but they do not overlap.
*If one or more match is found for an FC code then one final 'best' match must be reported, even if it overlaps with another FC code or is distant.

*~~If there are both A and P matches, and more than one of either P and/or A matches, then determine the P-A pair with the shortest distance between then using a Haversine formula distance calculation based on the GNS reported longitudes and latitudes.~~
*~~If there are multiple P matches but no A matches, take the one that was arrived at first~~
*~~If there are multiple A matches but no P matches, take the one that was arrived at first.~~

==Human Choices==

It is generally preferable to have a very high degree of confidence in the fuzzy matches, so that they can be treated as correct without individual inspection. However, the script and modules are capable of matching to any degree of accuracy. To get further matches that can be inspected, validated or chosen by a human agent, a very weak criterion is set for two runs of fuzzy matching, and then in each run the best options (in terms of parameter scores) are recorded and written into a 'human choice' file.

As a result a human choice file may contain:
#No matches for a source string, as none of the reference strings managed to reach even the very weak threshold criteria.
#One match, as both runs of fuzzy matching produced the same recommendation.
#Two matches, as both runs of fuzzy matching produced one best candidate and the candidates were unique.
#More than two matches, as one or both of the fuzzy matching runs had multiple unique candidates with the same scores.

It appears likely that blocks of matches will be able to be identified from the human choice files, by restricting the results sets to ranges for one or more of the provided match accuracy parameters.

==Output Files==

By default all files are written to the Results directory. Which files are written depends on the options selected, though the main results file is always written (with or without unmatched addresses) and includes fuzzy matches (unless the <tt>-e</tt> option is used to force just exact matching). The main results file outputs:
*COUNTRY - From the source entry
*STR - From the source entry
*CTY - From the source entry
*EXP_CITY - From the source entry
*EXP_ADM - From the source entry
*EXP_POSTCODE - From the source entry
*CTY_STR - A compound entry, delimited by #, used as an internal key. It is the software's best estimate of an address structure.
*EXP_STR - A compound entry, delimited by #, made from the exception data in a similar way to CTY_STR
*PRS_POSTCODE - The software's best estimate of the postcode, if any
*MATCH_TYPE - The match type that was used to make the match
*PLACE - The name of the most precise location
*UNI - The GNS unique identifier of the most precise location
*LAT - The latitude of the most precise location
*LONG - The longitude of the most precise location
*FC - The FC code of the most precise location

The most precise location is taken to be the finest grained result, that is, the match corresponding to the lowest level FC code. In the case of the default of FC=A,P,L preference is given to L then P then A.

The following variables are then repeated for each FC code searched, and prefixed by the FC code (if no match was found for this FC code the entries will be blank):
*NAME
*UNI
*LAT
*LONG

The fuzzy match file(s), if requested with <tt>-wf</tt>, have the same format (they are written by the same method). The report file is a copy of the output to the terminal, and can be enabled with the <tt>-r</tt> option. The human choice file (enabled with <tt>-human</tt>) has its own format as follows:
*SOURCENAME - The word, token or string from the source entry that is being considered as relevant for a match
*REFNAME - The name of a place in the GNS file
*COUNTRY - From the source entry
*STR - From the source entry
*CTY - From the source entry
*EXP_CITY - From the source entry
*EXP_ADM - From the source entry
*EXP_POSTCODE - From the source entry
*REFTOTAL - The total number of grams in REFNAME
*SOURCETOTAL - The total number of grams in SOURCENAME
*REFPC - The percentage of the REFNAME grams that appear in the SOURCENAME gram set
*SOURCEPC - The percentage of the SOURCENAME grams that appear in the REFNAME gram set
*LEFTGRAMS - The number of the REFNAME grams that appear in the SOURCENAME gram set
*RIGHTGRAMS - The number of the SOURCENAME grams that appear in the REFNAME gram set
*LCSSCORE - The size of the longest common subsequence in characters
*SOURCELENGTH - The length of SOURCENAME
*REFLENGTH - The length of REFNAME
*MAXLENGTH - The maximum of the lengths of SOURCENAME and REFNAME
*LCSPC - The LCSSCORE divided by the MAXLENGTH
*FIRSTLETTERBINDS - Whether the fuzzy matching algorithm required the same first letter in SOURCENAME and REFNAME
*GRAMALPHABET - The gram alphabet used by the matching algorithm
*GRAMLENGTH - The length of the n-grams used
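To make the gram and LCS fields of the human choice file concrete, the following Python sketch computes them for one SOURCENAME/REFNAME pair. It is an illustration, not the Match::Gram.pm or Match::LCS.pm code. The assumptions are loud ones: grams are plain overlapping character bigrams of the lower-cased string with no padding or special alphabet, repeated grams are counted each time they occur, percentages run from 0 to 100, and the function names are hypothetical.

```python
def grams(text, n=2):
    """Overlapping character n-grams of the lower-cased text (no padding)."""
    lowered = text.lower()
    return [lowered[i:i + n] for i in range(len(lowered) - n + 1)]


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def match_scores(source_name, ref_name, n=2):
    """Compute the gram and LCS fields for one candidate pairing."""
    src, ref = grams(source_name, n), grams(ref_name, n)
    src_set, ref_set = set(src), set(ref)
    left = sum(1 for g in ref if g in src_set)   # REFNAME grams found in SOURCENAME
    right = sum(1 for g in src if g in ref_set)  # SOURCENAME grams found in REFNAME
    lcs = lcs_length(source_name.lower(), ref_name.lower())
    longest = max(len(source_name), len(ref_name))
    return {
        "REFTOTAL": len(ref),
        "SOURCETOTAL": len(src),
        "LEFTGRAMS": left,
        "RIGHTGRAMS": right,
        "REFPC": 100.0 * left / len(ref) if ref else 0.0,
        "SOURCEPC": 100.0 * right / len(src) if src else 0.0,
        "LCSSCORE": lcs,
        "SOURCELENGTH": len(source_name),
        "REFLENGTH": len(ref_name),
        "MAXLENGTH": longest,
        "LCSPC": lcs / longest if longest else 0.0,
        "GRAMLENGTH": n,
    }


if __name__ == "__main__":
    print(match_scores("Cambrige", "Cambridge"))
```

Restricting a human choice file to rows above chosen REFPC, SOURCEPC or LCSPC values is one way to pick out the blocks of acceptable matches mentioned earlier.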

Anonymous user

67.188.196.241

Changes

Geocoding Inventor Locations (view source)

Revision as of 05:47, 22 January 2010
