==Script Files==
The scripts and modules that operationalize these matching techniques can be downloaded as individual scripts, or as a bundle with or without all supporting data files: [http://www.edegan.com/repository/MatchLocations.tar.gz MatchLocations.tar.gz v1.0.1] (~20Mb) and [http://www.edegan.com/repository/MatchLocations_Full.tar.gz MatchLocations_Full.tar.gz v1.0.1] (~20Mb). Note that the current version is 1.0.3, which will be posted shortly. The bundles contain the default directory structure. Defaults can be changed in the MatchLocations.pl script. The directories are as follows:
*Source - Source data should be placed here. See below for formatting.
*Results - Results generated by the scripts, including logs, will appear here.
*GNS - Contains GNS reference data named GNS-XX.txt
*Match - Contains the modules
The bundle contains:
*[http://www.edegan.com/repository/MatchLocations.pl MatchLocations.pl] - The main script that initializes and processes the matching requests
*BatchMatch.pl - A script for running batches
*[http://www.edegan.com/repository/Match/GNS.pm Match::GNS.pm] - Interface to the GNS reference data (see below)
*[http://www.edegan.com/repository/Match/Patent.pm Match::Patent.pm] - Interface to the Patent Location data (see below)
*[http://www.edegan.com/repository/Match/Common.pm Match::Common.pm] - Provides common (string cleaning) routines for both the reference and source interface modules
*[http://www.edegan.com/repository/Match/PostalCodes.pm Match::PostalCodes.pm] - A module that extracts postcodes of various formats from (address) strings
*[http://www.edegan.com/repository/Match/Gram.pm Match::Gram.pm] - Custom NGram Module
*[http://www.edegan.com/repository/Match/LCS.pm Match::LCS.pm] - A standard LCS Module
*[http://www.edegan.com/repository/PatentLocations-Stopwords.txt PatentLocations-Stopwords.txt] - A Stop Word file (tab delimited)
*GNS Reference Files - The full bundle contains a full set of correctly named GNS reference files
The MatchLocations.pl script can be run from any shell or command line with perl installed. Example commands are:
<tt>perl MatchLocations.pl -co GB -u -human -r -wf </tt>
which will process ISO3166 <tt>country</tt> code GB (Great Britain), include <tt>unmatched</tt> inputs in the results file, produce a <tt>human</tt> choices file, write the <tt>report</tt> to a text file, and <tt>write fuzzy</tt> matches to additional separate files as well as the main results file. Other options include <tt>over</tt> to override country designations and <tt>o</tt> to specify the results filename.
<tt>perl MatchLocations.pl -h</tt>
The reference data for the locations (which provides the longitudes and latitudes) is taken from the (U.S.) National Geospatial-Intelligence Agency's [[GEOnet Names Server | GEOnet Names Server (GNS)]], which covers the world excluding the U.S. and Antarctica.
This project uses [[ISO3166]] two-character country codes to name source and reference files. GNS does not use ISO3166 country codes, and so users will need to translate accordingly (see the [[GEOnet Names Server | GNS page]] for details). Example reference data files for countries that have been processed include:
*Australia: [http://www.edegan.com/repository/GNS-AU.txt GNS-AU.txt]
*Belgium: [http://www.edegan.com/repository/GNS-BE.txt GNS-BE.txt]
*Canada: [http://www.edegan.com/repository/GNS-CA.txt GNS-CA.txt]
*France: [http://www.edegan.com/repository/GNS-FR.txt GNS-FR.txt]
*Germany: [http://www.edegan.com/repository/GNS-DE.txt GNS-DE.txt]
*Hungary: [http://www.edegan.com/repository/GNS-HU.txt GNS-HU.txt]
*Ireland: [http://www.edegan.com/repository/GNS-IE.txt GNS-IE.txt]
*Great Britain (The United Kingdom of Great Britain and Northern Ireland): [http://www.edegan.com/repository/GNS-GB.txt GNS-GB.txt]
*Spain: [http://www.edegan.com/repository/GNS-ES.txt GNS-ES.txt]
*Switzerland: [http://www.edegan.com/repository/GNS-CH.txt GNS-CH.txt]

A full bundle of correctly named GNS files is also available.
The perl module Match::GNS.pm loads, indexes and provides an interface to key variables from this data. The source code is the primary module documentation. The load() method takes an ISO3166 code, and the index methods and most other methods take one of two specific GNS FC codes (e.g. "P" for populated place, and "A" for administrative area).
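The load-then-index pattern described above can be illustrated with a minimal sketch. It is written in Python rather than the project's Perl, and it is not the Match::GNS.pm interface: the function names, the column names (<tt>FC</tt>, <tt>FULL_NAME</tt>, <tt>LAT</tt>, <tt>LONG</tt>) and the two sample rows are all assumptions made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical two-row sample in a tab-delimited layout. The column names
# and the values are invented for illustration, not taken from a GNS file.
SAMPLE = (
    "FC\tFULL_NAME\tLAT\tLONG\n"
    "P\tSampletown\t10.0\t20.0\n"
    "A\tSampleshire\t10.5\t20.5\n"
)

def load(handle):
    """Read tab-delimited reference rows into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))

def index_by_name(rows, fc):
    """Map lower-cased place name -> list of (lat, long) for one FC code."""
    index = defaultdict(list)
    for row in rows:
        if row["FC"] == fc:
            index[row["FULL_NAME"].lower()].append(
                (float(row["LAT"]), float(row["LONG"]))
            )
    return dict(index)

rows = load(io.StringIO(SAMPLE))
places = index_by_name(rows, "P")       # populated places
admin_areas = index_by_name(rows, "A")  # administrative areas
print(places)
```

Keeping one index per FC code means a city name and an administrative area with the same spelling do not collide.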
==The Source Files==
Per-country source files are extracted from the NBER patent data. The problem of identifying countries for some address records will be addressed later. The format of the source file(s) is as follows (XX is an ISO3166 code):
*XX.txt - Tab delimited plain text with no (intentional) string quotation. Column(s): <tt>cty</tt>
*XX_exceptions.txt - Tab delimited plain text with no (intentional) string quotation. Column(s): <tt>cty city adm postcode</tt>

The <tt>cty</tt> is used as a primary key in both files. The XX_exceptions.txt provides details on hand identified records, or other records where special care has been taken. This file is not strictly required by the scripts but will be processed if present.

The perl module Match::Patent.pm loads and provides an interface to this source data. The source code is the primary module documentation.

==Postal Codes==

Postal codes, known as ZIP codes in the U.S., vary by national jurisdiction and for historical reasons. The following postal code formats are posted for reference, as are some simple regular expressions that should safely match most variants:

*Australia: ([http://en.wikipedia.org/wiki/Postcodes_in_australia Sourced from Wikipedia]): NNNN where N is a numeric. Australian postcodes should appear at the end of addresses, and are frequently preceded by the acronym for the territory/state (specifically: NSW, ACT, VIC, QLD, SA, WA, TAS, and NT). In the patent data variations include: NNNN, AU-NNNN, XXX NNNN, Xxx. NNNN, X.X.X. NNNN, XXXNNNN, where XXX indicates the two or three characters of the acronym.
**Simple Regex: <tt>(NSW|Nsw|ACT|Act|VIC|Vic|QLD|Qld|SA|Sa|WA|Wa|TAS|Tas|NT|Nt|Au|AU)?(\w\.\w\.\w\.)?\.?\s?-?\s?\d{4,4}</tt>
*Belgium ([http://en.wikipedia.org/wiki/List_of_postal_codes_in_Belgium Sourced from Wikipedia]): NNNN where N is a numeric. Belgian postcodes are usually placed before the city, and the number of trailing zeros indicates the size of the city. However, the following formats also appear frequently in the patent data: NN NNNN, NNN, NNNN, NNNNN, NNNN-, NNN B-NNNN, NNN B-NNN, NN-NNNN, B-NNNN, NN - B NNNN, NN, N - B - NNNN, "NN, B. NNNN", B - NNNN, B -NNNN, B NNNN, B- NNNN, B--NNNN, B-NNNN, B-NNNN-, BNNNN, BNNNNN, BE - NNNN, BE-NNNN, BF-NNNN.
**Simple Regex: <tt>\d{0,3},?\s?-{0,2}\s?B?[EF]?\.?\s?-{0,2}\s?\d{1,5}-? </tt>
*Canada: ([http://en.wikipedia.org/wiki/Canadian_postal_code Sourced from Wikipedia]): XNX NXN, where X indicates a letter and N a numeric. The first letter denotes the province or territory. This standard was adopted in 1970 (fully implemented by 1974) and is closely related to the UK and Dutch systems. In the patent data, Canadian postal codes appear (like the Canadians) very well behaved, with the following variants appearing: XNX NXN, XNX-NXN, and XNX, although it is possible that the letter O and the number 0 may be erroneously transcribed.
**Simple Regex: <tt>[A-Z0-0][0-9O-O][A-Z0-0]\s?\-?\s?([0-9O-O][A-Z0-0][0-9O-O])?</tt>
*Finland ([http://en.wikipedia.org/wiki/Postal_codes_in_Finland Sourced from Wikipedia]): NNMMD where N, M and D are numerics, and NN indicates the municipality, MM the district and D is typically either a 0 (large area), 5 (small area) or 1 (for P.O. Boxes). In the patent data the following Finnish postal codes are evident: NNNNN, NNNNNN, FI--NNNNN, FI-NNNNN, FI-NNNN, FIB-NNNNN, FIN -NNNNN, FIN NNNNN, FIN- NNNNN, FIN-NNNNN, Finn-NNNNN, SF-NNNNN, and SF-NNNNNNN. Also, the city name is sometimes followed by two digits.
**Simple Regex: <tt>(FI|FIN|Finn|FINN|FIB|SF)?\s?-?-?\s?\d{4,7}</tt>
*France ([http://en.wikipedia.org/wiki/Postal_codes_in_France Sourced from Wikipedia]): NNNMM or NNMMM where NN and NNN are numerics indicating the préfectures and sous-préfectures, respectively, and MMM are other numerics. However, the following formats also appear frequently in the patent data: F NNNNN, F-NNNNN, F - NNNNN, F- NN NNN, FNNNNN, F-NN NNN, F-NN, FR - NNNNN, FR NNNNN, FR-NNNNN, (NNNNN), (NN), - NNNNN, -NNNNN-, -NN, NN, NN - N-NNNNN, NNNN, NN NNN, NN.NNN, NNN/N, NN., F. NNNNN, F.NNNNN, "FRNN,NNN". French postal codes most often, but not exclusively, occur at the start of the address string. If there are fewer than 5 digits, trailing zeros should be added.
**Simple Regex: <tt>\(?F?R?\.?\s?\d?-?\s?\d{2,3}\.?\s?\/?\d{0,3}-?\)?</tt>
*Germany ([http://en.wikipedia.org/wiki/List_of_postal_codes_in_Germany Sourced from Wikipedia]): Currently (post 1993) German postal codes consist of five digits: NNMMM where NN indicates the broad area and MMM indicates the sub-area. Prior to 1993 postal codes had four digits NNNN, and between 1989 and 1993 O-NNNN (for East, Ost, Germany) and W-NNNN (for West Germany) were used. However, the following formats also appear frequently in the patent data: (NNNN), (D-NNNN), -NNNN, 0-NNNN, 0 - NNNN, 0-NNN, 0NNNN, N CityName NN, NN CityName NN, NNN CityName NN, NNNN CityName NN, NNNN CityName N, NNNN CityName N/BRD, NNNN CityName N/HB, 1-DNNNN, "10,NNNN", BRD-NNNN, N, NN, NNN, NNNN, NNNNN, NN.NNNNN, NN-NNNNN, d-NNNN, NN - NNNNN, NN NNN, D-N, D-NN, D-NNN, D-NNNN, D-NNNNN, D N CityName NN, D NN, D NNNN, D NNNNN, D- NNNN, D- NNNNN, D--NNNNN, D-0-NNNN, D-0NNNN, D-N-NNNNN, D.NNNN, D.NNNNN, D.-NNNNN, D0NNNN, DNNN NN. DNN, DNNN, DNNNN, DNNNNN, DE - NNNN, DE 0NNNN, DE NNNNN, DE-0NNNNN, DE-0-NNNN, DE-NNNN, DE-NNNNN, O-NNNN, W-NNNN, W-NNNN CityName NN, W-NNNNN CityName NN, WNNNN. Here CityName is Berlin, Hamburg, Dusseldorf, Seevetal, etc., and DM, DS and DW sometimes appear instead of DE. Unfortunately this list is not exhaustive. Readers should note that there is a frequent transcription error of O (Oh) as 0 (Zero).
**Simple Regex (Doesn't catch everything): <tt>\(?(DE|D|1|W|)\.?-?[O0]?(BRD|1|)\s?-?\s?\d{2,5}\)?</tt>
*Hungary ([http://en.wikipedia.org/wiki/List_of_postal_codes Sourced from Wikipedia]): H- or HU-NNNN. (Note: apparently introduced in 1973.) From the patent data the following postcodes can be noted: NN, NNN, NNNN, NNNN-, H-NNNN, H--NNNN, and H-NN-N. However, u. NN and u. N frequently appear at the end of the CTY string, and some cities are followed by roman numerals.
**Simple Regex: <tt>(H|HU)?\s?-?\s?\d{2,4}-?\d{0,2}</tt>
*Ireland ([http://en.wikipedia.org/wiki/Republic_of_Ireland_postal_addresses Sourced from Wikipedia]): The Republic of Ireland does not use postal codes per se. However, some cities, particularly Dublin, use one or two digit district numbers following the city name. In the patent data the format bmNNNN also appeared, and the district numbers appear strictly at the end of the string, except in the case where they are followed by "Eire.".
**Simple Regex: <tt>\d{1,2}\s?,?(Eire)?\.?$</tt>
*Spain ([http://en.wikipedia.org/wiki/List_of_postal_codes_in_Spain Sourced from Wikipedia]): Post 1976 Spanish postcodes are five digits of the format NNMMM, where NN indicates the province (01-52) or a reserved code (e.g. 80 for P.O. boxes). In the patent data Spanish postcodes are comparatively well behaved, with the following standard variants appearing: NNNNN, NNN NN, NNNN, NN NNNNN, NN- NNNNN, -NNNNN, NNNNN-, "NN, NNNNN", NNN, NN, N NNNNN-, NN-NN, NN-NN NNNNN, NNNNN-IBI, E-NNNNN, E-NNNN, E - NNNNN, E--NNNNN, ES-NNNNN.
**Simple Regex: <tt>(E|ES|)\d{0,2},?\s?-{0,2}\s?\d{2,5}-?(IBI|)</tt>
*Switzerland ([http://en.wikipedia.org/wiki/Postal_codes_in_Switzerland_and_Liechtenstein Sourced from Wikipedia]): Swiss (and Liechtenstein) postcodes are hierarchical four-digit numbers of the form District+Area+Route+PONumber, where districts are numbered West to East (would you expect less from the Swiss?). In the patent data Swiss postcodes are comparatively immaculately behaved, with the following formats appearing: NNNN, NNNN-, CH-NNNN, CH - NNNN, CH NNNN, CHNNN, CH- NNNN, though the "H" may sometimes be lowercase.
**Simple Regex: <tt>(CH|Ch|)\s?-?\s?\d{3,4}-?</tt>
*United Kingdom ([http://en.wikipedia.org/wiki/UK_postcodes Sourced from Wikipedia]): A9 9AA, A99 9AA, A9A 9AA, AA9 9AA, AA99 9AA, AA9A 9AA.
**Simple Regex: <tt>([A-Z]{1,2}[0-9]{1,2}[A-Z]{0,1}\s[0-9][A-Z]{2,2})</tt>
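As a rough illustration of how the simple regular expressions above might be applied, the following Python sketch looks up a pattern by ISO3166 code and returns the first match in an address string. It is not the Match::PostalCodes.pm implementation: the function name, the dictionary, and the test addresses are hypothetical, and only four of the patterns (copied verbatim from the list) are included.

```python
import re

# Patterns copied verbatim from the list above; only a subset is shown.
SIMPLE_PATTERNS = {
    "AU": r"(NSW|Nsw|ACT|Act|VIC|Vic|QLD|Qld|SA|Sa|WA|Wa|TAS|Tas|NT|Nt|Au|AU)?"
          r"(\w\.\w\.\w\.)?\.?\s?-?\s?\d{4,4}",
    "CA": r"[A-Z0-0][0-9O-O][A-Z0-0]\s?\-?\s?([0-9O-O][A-Z0-0][0-9O-O])?",
    "CH": r"(CH|Ch|)\s?-?\s?\d{3,4}-?",
    "GB": r"([A-Z]{1,2}[0-9]{1,2}[A-Z]{0,1}\s[0-9][A-Z]{2,2})",
}

def extract_postcode(address, country):
    """Return the first substring matching the country's simple regex, or None.

    `country` is an ISO3166 two-character code; unknown codes return None.
    """
    pattern = SIMPLE_PATTERNS.get(country)
    if pattern is None:
        return None
    match = re.search(pattern, address)
    # Strip the optional whitespace the patterns allow around the code.
    return match.group(0).strip() if match else None

# Hypothetical address strings, for illustration only.
for text, code in [("Sydney NSW 2000", "AU"), ("CH-8000 Zurich", "CH")]:
    print(code, extract_postcode(text, code))
```

Because the patterns are deliberately permissive, a first-match search like this can pick up street numbers or other digit runs, which is one reason a per-country treatment is preferable in practice.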
The Match::PostalCodes.pm perl module provides a method to extract a postcode from a string for a given ISO3166 code. The simple regular expressions listed above are not used verbatim, as more sophisticated techniques can be employed on a per-country basis.

Source file format (XX is an ISO3166 code):
*XX.txt - Tab delimited plain text with no (intentional) string quotation. Column(s): <tt>country</tt> <tt>str</tt> <tt>cty</tt> <tt>adm</tt> <tt>city</tt> <tt>postcode</tt>

The column order is not important. <tt>country</tt>, <tt>str</tt>, and <tt>cty</tt> cannot all be null. <tt>adm</tt>, <tt>city</tt> and <tt>postcode</tt> are optional 'exception' fields that are processed with priority. They provide hand corrections and other specifically generated information.
The perl module Match::Patent.pm loads and provides an interface to this source data. The source code is the primary module documentation.
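The column rules for the source file can be sketched as follows. This is an illustrative Python reader, not the Match::Patent.pm interface: the function name, the <tt>has_exception</tt> flag, the sample rows, and the choice to skip rows in which <tt>country</tt>, <tt>str</tt> and <tt>cty</tt> are all empty are assumptions.

```python
import csv
import io

REQUIRED_ANY = ("country", "str", "cty")        # cannot all be null
EXCEPTION_FIELDS = ("adm", "city", "postcode")  # optional, processed with priority

def read_source(handle):
    """Yield one dict per usable row of a tab-delimited source file."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Columns are looked up by header name, so their order is irrelevant;
        # missing columns are treated as empty strings.
        record = {
            key: (row.get(key) or "").strip()
            for key in REQUIRED_ANY + EXCEPTION_FIELDS
        }
        if not any(record[key] for key in REQUIRED_ANY):
            continue  # assumption: rows with no country, str or cty are skipped
        # Flag rows carrying hand corrections so a caller can handle them first.
        record["has_exception"] = any(record[key] for key in EXCEPTION_FIELDS)
        yield record

# Hypothetical sample: columns in arbitrary order, one row entirely empty.
SAMPLE = (
    "cty\tcountry\tpostcode\n"
    "Sampletown, Sampleshire\tGB\t\n"
    "Otherville\tGB\tZZ1 1ZZ\n"
    "\t\t\n"
)

records = list(read_source(io.StringIO(SAMPLE)))
print(len(records))
```

Reading by header name rather than position is what makes the column order unimportant, and <tt>csv.QUOTE_NONE</tt> mirrors the "no string quotation" convention.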
[[Postal Codes]]
==The Matching Process==