==Script Files==
The scripts and modules that operationalize these matching techniques can be downloaded as individual scripts or as a bundle with all supporting data files ([http://www.edegan.com/repository/MatchLocations.tar.gz MatchLocations.tar.gz], ~20 MB). Note that the reference data should be placed in a subdirectory named "GNS" by default, and the source data in a subdirectory named "Source" by default. Both defaults can be changed in the MatchLocations.pl script.
The bundle contains:
*[http://www.edegan.com/repository/Match/Gram.pm Match::Gram.pm] - Custom NGram Module
*[http://www.edegan.com/repository/Match/LCS.pm Match::LCS.pm] - A standard LCS Module
 
The MatchLocations.pl script can be run from any shell or command line with perl installed. Example commands are:
 
<tt>perl MatchLocations.pl -co GB -u -human -r -wf </tt>
 
which will process ISO3166 <tt>country</tt> code GB (Great Britain), include <tt>unmatched</tt> inputs in the results file, produce a <tt>human</tt> choices file, write the <tt>report</tt> to a text file, and <tt>write fuzzy</tt> matches to additional separate files as well as the main results file.
 
<tt>perl MatchLocations.pl -h</tt>
 
produces a simple help output.
==Reference Data==
The reference data for the locations (which provides the longitudes and latitudes) is taken from the (U.S.) National Geospatial-Intelligence Agency's [[GEOnet Names Server | GEOnet Names Server (GNS)]], which covers the world excluding the U.S. and Antarctica.
This project uses [[ISO3166]] two-character country codes to name source and reference files. The GNS data as distributed does not use ISO3166 country codes, so users will need to translate accordingly (see the [[GEOnet Names Server | GNS page]] for details). Example reference data files for countries that have been or are being processed include:
*Australia: [http://www.edegan.com/repository/GNS-AU.txt GNS-AU.txt]
*Belgium: [http://www.edegan.com/repository/GNS-BE.txt GNS-BE.txt]
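
The translation from GNS country codes to ISO3166 codes can be sketched as a simple lookup. The example below is an illustration only: it assumes that the GNS files are keyed by FIPS 10-4 country codes (check the [[GEOnet Names Server | GNS page]] before relying on this), the function name <tt>reference_filename</tt> is hypothetical, and it is written in Python rather than being taken from the project's Perl scripts.

```python
# Hypothetical lookup from FIPS 10-4 codes (ASSUMED to be what GNS uses)
# to ISO3166 two-character codes, for the countries discussed on this page.
FIPS_TO_ISO = {
    "AS": "AU",  # Australia
    "BE": "BE",  # Belgium
    "CA": "CA",  # Canada
    "FI": "FI",  # Finland
    "FR": "FR",  # France
    "GM": "DE",  # Germany
    "HU": "HU",  # Hungary
    "EI": "IE",  # Ireland
    "SP": "ES",  # Spain
    "SZ": "CH",  # Switzerland
    "UK": "GB",  # United Kingdom
}


def reference_filename(fips_code):
    """Return the project-style reference file name, e.g. GNS-AU.txt."""
    iso_code = FIPS_TO_ISO[fips_code.upper()]  # KeyError for unknown codes
    return "GNS-%s.txt" % iso_code
```

Under these assumptions, <tt>reference_filename("as")</tt> gives the name used for the Australian file listed above.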
Postal codes, known as ZIP codes in the U.S., vary by national jurisdiction and for historical reasons. The following postal code formats are posted for reference, as are some simple regular expressions that should safely match most variants:
*Australia ([http://en.wikipedia.org/wiki/Postcodes_in_australia Sourced from Wikipedia]): NNNN where N is a numeric. Australian postcodes should appear at the end of addresses, and are frequently preceded by the acronym for the territory/state (specifically: NSW, ACT, VIC, QLD, SA, WA, TAS, and NT). In the patent data variations include: NNNN, AU-NNNN, XXX NNNN, Xxx. NNNN, X.X.X. NNNN, XXXNNNN, where XXX indicates the two or three characters of the acronym.
**Simple Regex: <tt>(NSW|Nsw|ACT|Act|VIC|Vic|QLD|Qld|SA|Sa|WA|Wa|TAS|Tas|NT|Nt|Au|AU)?(\w\.\w\.\w\.)?\.?\s?-?\s?\d{4,4}</tt>
*Canada ([http://en.wikipedia.org/wiki/Canadian_postal_code Sourced from Wikipedia]): XNX NXN, where X indicates a letter and N a numeric. The first letter denotes the province or territory. This standard was adopted in 1970 (fully implemented by 1974) and is closely related to the UK and Dutch systems. In the patent data, Canadian postal codes appear (like the Canadians) very well behaved, with the following variants appearing: XNX NXN, XNX-NXN, and XNX, although it is possible that the letter O and the number 0 may be erroneously transcribed.
**Simple Regex: <tt>[A-Z0][0-9O][A-Z0]\s?\-?\s?([0-9O][A-Z0][0-9O])?</tt>
*Finland ([http://en.wikipedia.org/wiki/Postal_codes_in_Finland Sourced from Wikipedia]): NNMMD where N, M and D are numerics, and NN indicates the municipality, MM the district and D is typically either a 0 (large area), 5 (small area) or 1 (for P.O. Boxes). In the patent data the following Finnish postal codes are evident: NNNNN, NNNNNN, FI--NNNNN, FI-NNNNN, FI-NNNN, FIB-NNNNN, FIN -NNNNN, FIN NNNNN, FIN- NNNNN, FIN-NNNNN, Finn-NNNNN, SF-NNNNN, and SF-NNNNNNN. Also the city name is sometimes followed by two digits.
**Simple Regex: <tt>(FI|FIN|Finn|FINN|FIB|SF)?\s?-?-?\s?\d{4,7}</tt>
*France ([http://en.wikipedia.org/wiki/Postal_codes_in_France Sourced from Wikipedia]): NNNMM or NNMMM where NN and NNN are numerics indicating the préfectures and sous-préfectures, respectively, and MMM are other numerics. However the following formats also appear frequently in the patent data: F NNNNN, F-NNNNN, F - NNNNN, F- NN NNN, FNNNNN, F-NN NNN, F-NN, FR - NNNNN, FR NNNNN, FR-NNNNN, (NNNNN), (NN), - NNNNN, -NNNNN-, -NN, NN, NN - N-NNNNN, NNNN, NN NNN, NN.NNN, NNN/N, NN., F. NNNNN, F.NNNNN, "FRNN,NNN". French postal codes most often, but not exclusively, occur at the start of the address string. If there are fewer than 5 digits, trailing zeros should be added.
**Simple Regex: <tt>\(?F?R?\.?\s?\d?-?\s?\d{2,3}\.?\s?\/?\d{0,3}-?\)?</tt>
*Belgium ([http://en.wikipedia.org/wiki/List_of_postal_codes_in_Belgium Sourced from Wikipedia]): NNNN where N is a numeric. Belgian postcodes are usually placed before the city, and the number of trailing zeros indicates the size of the city. However, the following formats also appear frequently in the patent data: NN NNNN, NNN, NNNN, NNNNN, NNNN-, NNN B-NNNN, NNN B-NNN, NN-NNNN, B-NNNN, NN - B NNNN, NN, N - B - NNNN, "NN, B. NNNN", B - NNNN, B -NNNN, B NNNN, B- NNNN, B--NNNN, B-NNNN, B-NNNN-, BNNNN, BNNNNN, BE - NNNN, BE-NNNN, BF-NNNN.
**Simple Regex: <tt>\d{0,3},?\s?-{0,2}\s?B?[EF]?\.?\s?-{0,2}\s?\d{1,5}-? </tt>
*Germany ([http://en.wikipedia.org/wiki/List_of_postal_codes_in_Germany Sourced from Wikipedia]): Currently (post 1993) German postal codes consist of five digits: NNMMM where NN indicates the broad area and MMM indicates the sub-area. Prior to 1993 postal codes had four digits NNNN, and between 1989 and 1993 O-NNNN (for East, Ost, Germany) and W-NNNN (for West Germany) were used. However, the following formats also appear frequently in the patent data: (NNNN), (D-NNNN), -NNNN, 0-NNNN, 0 - NNNN, 0-NNN, 0NNNN, N CityName NN, NN CityName NN, NNN CityName NN, NNNN CityName NN, NNNN CityName N, NNNN CityName N/BRD, NNNN CityName N/HB, 1-DNNNN, "10,NNNN", BRD-NNNN, N, NN, NNN, NNNN, NNNNN, NN.NNNNN, NN-NNNNN, d-NNNN, NN - NNNNN, NN NNN, D-N, D-NN, D-NNN, D-NNNN, D-NNNNN, D N CityName NN, D NN, D NNNN, D NNNNN, D- NNNN, D- NNNNN, D--NNNNN, D-0-NNNN, D-0NNNN, D-N-NNNNN, D.NNNN, D.NNNNN, D.-NNNNN, D0NNNN, DNNN NN. DNN, DNNN, DNNNN, DNNNNN, DE - NNNN, DE 0NNNN, DE NNNNN, DE-0NNNNN, DE-0-NNNN, DE-NNNN, DE-NNNNN, O-NNNN, W-NNNN, W-NNNN CityName NN, W-NNNNN CityName NN, WNNNN. Here CityName is Berlin, Hamburg, Dusseldorf, Seevetal, etc., and DM, DS and DW sometimes appear instead of DE. Unfortunately this list is not exhaustive. Readers should note that there is a frequent transcription error of O (Ooh) as 0 (Zero).
**Simple Regex (Doesn't catch everything): <tt>\(?(DE|D|1|W|)\.?-?[O0]?(BRD|1|)\s?-?\s?\d{2,5}\)?</tt>
*Hungary ([http://en.wikipedia.org/wiki/List_of_postal_codes Sourced from Wikipedia]): H- or HU-NNNN (apparently introduced in 1973). From the patent data the following postcodes can be noted: NN, NNN, NNNN, NNNN-, H-NNNN, H--NNNN, and H-NN-N. However, u. NN and u. N frequently appear at the end of the CTY string, and some cities are followed by roman numerals.
**Simple Regex: <tt>(H|HU)?\s?-?\s?\d{2,4}-?\d{0,2}</tt>
*Ireland ([http://en.wikipedia.org/wiki/Republic_of_Ireland_postal_addresses Sourced from Wikipedia]): The Republic of Ireland does not use postal codes per se. However, some cities, particularly Dublin, use one or two digit district numbers following the city name. In the patent data the format bmNNNN also appeared, and the district numbers appear strictly at the end of the string, except where they are followed by "Eire.".
**Simple Regex: <tt>\d{1,2}\s?,?(Eire)?\.?$</tt>
*Spain ([http://en.wikipedia.org/wiki/List_of_postal_codes_in_Spain Sourced from Wikipedia]): Post 1976 Spanish postcodes are five digits of the format NNMMM, where NN indicates the province (01-52) or a reserved code (e.g. 80 for P.O. boxes). In the patent data Spanish postcodes are comparatively well behaved, with the following standard variants appearing: NNNNN, NNN NN, NNNN, NN NNNNN, NN- NNNNN, -NNNNN, NNNNN-, "NN, NNNNN", NNN, NN, N NNNNN-, NN-NN, NN-NN NNNNN, NNNNN-IBI, E-NNNNN, E-NNNN, E - NNNNN, E--NNNNN, ES-NNNNN.
**Simple Regex: <tt>(E|ES|)\d{0,2},?\s?-{0,2}\s?\d{2,5}-?(IBI|)</tt>
*Switzerland ([http://en.wikipedia.org/wiki/Postal_codes_in_Switzerland_and_Liechtenstein Sourced from Wikipedia]): Swiss (and Liechtenstein) postcodes are hierarchical four-digit numbers of the form District+Area+Route+PONumber, where districts are numbered West to East (would you expect less from the Swiss?). In the patent data Swiss postcodes are comparatively immaculately behaved, with the following formats appearing: NNNN, NNNN-, CH-NNNN, CH - NNNN, CH NNNN, CHNNN, CH- NNNN, CHNNN, though the "H" may sometimes be lowercase.
**Simple Regex: <tt>(CH|Ch|)\s?-?\s?\d{3,4}-?</tt>
*United Kingdom ([http://en.wikipedia.org/wiki/UK_postcodes Sourced from Wikipedia]): A9 9AA, A99 9AA, A9A 9AA, AA9 9AA, AA99 9AA, AA9A 9AA.
**Simple Regex: <tt>([A-Z]{1,2}[0-9]{1,2}[A-Z]{0,1}\s[0-9][A-Z]{2,2})</tt>
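
To show how the simple regular expressions above might be applied, the following minimal sketch looks up a pattern by ISO3166 code and returns the first match in an address string. It is written in Python purely for illustration and is not the project's Perl code; the names <tt>SIMPLE_PATTERNS</tt> and <tt>extract_postcode</tt> are hypothetical, and only three of the listed patterns are included.

```python
import re

# Three of the "Simple Regex" patterns listed above, keyed by ISO3166 code.
SIMPLE_PATTERNS = {
    "GB": r"([A-Z]{1,2}[0-9]{1,2}[A-Z]{0,1}\s[0-9][A-Z]{2,2})",
    "CH": r"(CH|Ch|)\s?-?\s?\d{3,4}-?",
    "IE": r"\d{1,2}\s?,?(Eire)?\.?$",
}


def extract_postcode(iso_code, address):
    """Return the first substring matching the country's pattern, or None."""
    pattern = SIMPLE_PATTERNS.get(iso_code)
    if pattern is None:
        return None  # no pattern recorded for this country
    match = re.search(pattern, address)
    if match is None:
        return None
    # Optional leading whitespace in a pattern can be captured; trim it.
    return match.group(0).strip()
```

Because the patterns are deliberately permissive, a match is only a candidate postcode: a street number with enough digits would also satisfy the Swiss pattern, for example.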
The Match::PostalCodes.pm Perl module provides a method to extract a postcode from a text string for a given ISO3166 code. The simple regular expressions listed above are not used verbatim, as more sophisticated techniques can be employed on a per-country basis.
*If there are multiple P matches but no A matches, take the one that was arrived at first.
*If there are multiple A matches but no P matches, take the one that was arrived at first.
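
As one illustration of per-country handling, the notes above on Canadian O/0 transcription errors and on padding short French codes with trailing zeros could be turned into small clean-up steps. The sketch below is in Python and is an assumption about how such steps might look, not a description of Match::PostalCodes.pm; the function names are hypothetical, and treating a 0 in a letter position as a mistyped O simply follows the permissive Canadian regex above.

```python
import re


def normalize_canadian(code):
    """Repair O/0 confusion by position in an XNX or XNX NXN code."""
    chars = re.sub(r"[\s-]", "", code.upper())
    if len(chars) not in (3, 6):
        return None
    fixed = []
    for index, char in enumerate(chars):
        if index % 2 == 0:
            # Even positions hold letters: a zero here is read as the letter O.
            char = "O" if char == "0" else char
            if not "A" <= char <= "Z":
                return None
        else:
            # Odd positions hold digits: the letter O here is read as zero.
            char = "0" if char == "O" else char
            if not char.isdigit():
                return None
        fixed.append(char)
    joined = "".join(fixed)
    return joined if len(joined) == 3 else joined[:3] + " " + joined[3:]


def pad_french(text):
    """Strip non-digits and add trailing zeros up to five digits."""
    digits = re.sub(r"\D", "", text)
    if not 1 <= len(digits) <= 5:
        return None
    return digits.ljust(5, "0")
```

For example, <tt>normalize_canadian("K1A-OB1")</tt> rewrites the mistyped letter O in the fourth position as a zero, and <tt>pad_french("F-75")</tt> pads the two-digit code to five digits.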
 
==Human Choices==
 
It is generally preferable to have a very high degree of confidence in the fuzzy matches, so that they can be treated as correct without individual inspection. However, the script and modules are capable of matching to any degree of accuracy. To obtain further matches that can be inspected, validated, or chosen by a human agent, very weak criteria are set for two runs of fuzzy matching, and in each run the best options (in terms of parameter scores) are recorded and written into a 'human choice' file.
 
As a result a human choice file may contain:
#No matches for a source string as none of the reference strings managed to reach even the very weak threshold criteria.
#One match, as both runs of fuzzy matching produced the same recommendation.
#Two matches, as both runs of fuzzy matching produced one best candidate and the candidates were unique.
#More than two matches, as one or both of the fuzzy matching runs had multiple unique candidates with the same scores.
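
The four outcomes above follow from taking the union of the best candidates of each run. A minimal sketch in Python, assuming for illustration that each run yields (candidate, score) pairs that already passed the weak threshold and that a single higher-is-better score stands in for the script's parameter scores; the function name <tt>human_choices</tt> is hypothetical:

```python
def human_choices(run_a, run_b):
    """Combine the top-scoring candidates of two fuzzy matching runs.

    Each run is a list of (candidate, score) pairs; an empty list means
    nothing reached the weak threshold. Returns the unique best candidates
    in first-seen order.
    """
    choices = []
    for run in (run_a, run_b):
        if not run:
            continue
        best = max(score for _, score in run)
        for candidate, score in run:
            # Keep every candidate tied for the best score, once only.
            if score == best and candidate not in choices:
                choices.append(candidate)
    return choices
```

An empty result corresponds to the first case, one entry to agreement between the runs, two entries to differing single recommendations, and more than two to tied scores within a run.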
 
It appears likely that blocks of matches can be identified from the human choice files by restricting the result sets to ranges of one or more of the provided match accuracy parameters.
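
Such a restriction could be sketched as a filter over the rows of a human choice file. The Python example below is illustrative only: the row layout and the parameter names <tt>gram</tt> and <tt>lcs</tt> are hypothetical stand-ins for whatever accuracy parameters the file actually records.

```python
def select_block(records, ranges):
    """Keep records whose parameters all lie in the given inclusive ranges.

    records: list of dicts, one per candidate match.
    ranges:  dict mapping a parameter name to a (low, high) pair.
    """
    return [
        record
        for record in records
        if all(low <= record[name] <= high for name, (low, high) in ranges.items())
    ]


# Hypothetical rows; the field names are assumptions for this example.
rows = [
    {"source": "Leeds", "gram": 0.92, "lcs": 0.88},
    {"source": "Leek", "gram": 0.60, "lcs": 0.95},
]
accepted = select_block(rows, {"gram": (0.9, 1.0)})
```

Rows in the selected block could then be accepted together, leaving only the remainder for individual inspection.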