Changes

2,922 bytes added , 05:47, 22 January 2010

==Script Files==

The scripts and modules that operationalize these matching techniques can be downloaded as a bundle with ([http://www.edegan.com/repository/MatchLocations.tar.gz MatchLocations.tar.gz v1.0.1] ~20Mb) or without ([http://www.edegan.com/repository/MatchLocations_Full.tar.gz MatchLocations_Full.tar.gz v1.0.1] ~20Mb) all supporting data files. Note that the current version is 1.0.3, which will be posted shortly. The bundles contain the default directory structure. Defaults can be changed in the MatchLocations.pl script.

The directories are as follows:

*GNS - contains GNS reference data named GNS-XX.txt

*Match - contains the modules

The bundle contains:

*Exact Matching - Case-insensitive matching of the entire sequence of both the source and the reference strings

*LCS - Longest Common Subsequence based matching (See below)

*Administrative area, populated place, and locality - locations identified as FC=A, FC=P or FC=L respectively in the GNS data. Unless otherwise specified, matches are performed for all GNS FC codes requested (default is A,P,L) separately and in series.

The sequence of processing is as follows (matching only the remaining unmatched locations at each stage):

===Exact Matching Units===

The exact matching of units is performed for both the exception units and the units of "well-formatted" records, that is, records that have comma-separated logical units. Postcodes are extracted as a logical unit first, if possible (to generate the PRS_POSTCODE field). Exact matching is case insensitive and units are trimmed of leading and trailing spaces, but otherwise the match must be exact. Units are matched from the bottom to the top, in order of precedence. That is, if the string is Unit1, Unit2, Unit3, Postcode, then Unit3 is matched with precedence over Units 2 and 1, and so forth. However, if multiple matches are made for some FC code and one match is made for another, then preference is given to the different combination. For example, if the string were "Chelsea, London" and both Chelsea and London were recorded in the GNS data as FC=P, but only London was recorded as FC=A, then it would be most sensible to record P=Chelsea, A=London, and not P=London, A=London. This differencing is done in the matching method and is independent of the resolution of multiple matches at the end.
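The unit matching and the differencing rule described above can be sketched as follows. This is a minimal Python illustration, not the code from the Perl modules in the bundle: the function names <tt>split_units</tt>, <tt>exact_match_units</tt> and <tt>prefer_different</tt> are hypothetical, the reference data is assumed to be a plain list of names, and postcode extraction is not modelled.

```python
def split_units(address):
    """Split a comma-separated address into trimmed, non-empty logical units."""
    return [unit.strip() for unit in address.split(",") if unit.strip()]


def exact_match_units(address, reference_names):
    """Return (unit, reference_name) pairs, trying the last unit first.

    Matching is case insensitive but otherwise exact (assumed behaviour,
    following the description in the text).
    """
    lookup = {name.strip().lower(): name for name in reference_names}
    matches = []
    for unit in reversed(split_units(address)):  # bottom to top
        key = unit.lower()
        if key in lookup:
            matches.append((unit, lookup[key]))
    return matches


def prefer_different(matches_by_fc):
    """Pick one name per FC code, preferring the 'different combination'.

    matches_by_fc maps an FC code to its matched names in precedence order.
    If a code has several matches, names that are the single match of
    another code are skipped whenever an alternative exists.
    """
    singles = {names[0].lower() for names in matches_by_fc.values() if len(names) == 1}
    chosen = {}
    for fc, names in matches_by_fc.items():
        if not names:
            continue
        distinct = [name for name in names if name.lower() not in singles]
        chosen[fc] = distinct[0] if len(names) > 1 and distinct else names[0]
    return chosen


if __name__ == "__main__":
    print(exact_match_units("Chelsea, London", ["London", "Chelsea"]))
    print(prefer_different({"P": ["London", "Chelsea"], "A": ["London"]}))
```

With the "Chelsea, London" example, London is matched first because it is the lower unit, and the differencing step then selects P=Chelsea alongside A=London.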

===Token Matching===

An arbitrary upper token set length limit of 5 is used if the length of the source token array (4 in the example above) is greater than or equal to 5; otherwise the limit is the length of the array itself. Then, beginning at the upper length limit and decreasing by one after each set of this length has been tried, and starting from the right-hand side and moving one unit to the left each time, the token sets are joined with spaces and exact matched against the reference string. This process iterates until all length-one token sets have been tried, and records the matches in the order that they were made. Thus, continuing the example above, the space-joined source token sets would be, in the order that they are tried:

#String1 String2 String3 String4 (token set length=4, first and only set)
#String2 String3 String4 (token set length=3, first set)
#String1 String2 String3 (token set length=3, second set)
#String3 String4 (token set length=2, first set)
#String2 String3 (token set length=2, second set)
#String1 String2 (token set length=2, third set)
#String4 (token set length=1, first set)
#String3 (token set length=1, second set)
#String2 (token set length=1, third set)
#String1 (token set length=1, fourth set)
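The enumeration order listed above can be reproduced with a short sketch. This is an illustrative Python version under stated assumptions, not the bundled Perl code: <tt>token_sets</tt> and <tt>token_matches</tt> are hypothetical names, and the reference data is assumed to be a plain list of names compared case insensitively.

```python
def token_sets(tokens, max_len=5):
    """Yield space-joined token sets, longest first and right to left."""
    upper = min(len(tokens), max_len)  # cap the set length at max_len
    for length in range(upper, 0, -1):
        # Start at the right-hand side and move one token left each time.
        for start in range(len(tokens) - length, -1, -1):
            yield " ".join(tokens[start:start + length])


def token_matches(tokens, reference_names, max_len=5):
    """Return the token sets that exactly match a reference name, in the order tried."""
    lookup = {name.lower() for name in reference_names}
    return [s for s in token_sets(tokens, max_len) if s.lower() in lookup]


if __name__ == "__main__":
    for candidate in token_sets(["String1", "String2", "String3", "String4"]):
        print(candidate)
```

Because longer sets are tried first, a multi-word name such as "New York" is recorded before its single-token parts.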

===NGram and LCS Matching===

It appears likely that blocks of matches can be identified from the human choice files by restricting the result sets to ranges of one or more of the provided match accuracy parameters.

==Output Files==

By default all files are outputted to the Results directory. Which files are outputted depends on the options selected, though the main results file is always outputted (with or without unmatched addresses) and includes fuzzy matches (unless the <tt>-e</tt> option is used to force just exact matching).

The main results file outputs:
*COUNTRY - From the source entry
*STR - From the source entry
*CTY - From the source entry
*EXP_CITY - From the source entry
*EXP_ADM - From the source entry
*EXP_POSTCODE - From the source entry
*CTY_STR - A compound entry, delimited by #, used as an internal key. It is the software's best estimate of an address structure.
*EXP_STR - A compound entry, delimited by #, made from the exception data in a similar way to CTY_STR
*PRS_POSTCODE - The software's best estimate of the postcode if any
*MATCH_TYPE - The match type that was used to make the match
*PLACE - The name of the most precise location
*UNI - The GNS unique identifier of the most precise location
*LAT - The latitude of the most precise location
*LONG - The longitude of the most precise location
*FC - The FC code of the most precise location

The most precise location is taken to be the finest grained result. That is the match corresponding to the lowest level FC code. In the case of the default of FC=A,P,L preference is given to L then P then A.

The following variables are then repeated for each FC code searched, and prefixed by the FC code (if no match was found for this FC code the entries will be blank):
*NAME
*UNI
*LAT
*LONG

The fuzzy match file(s), if requested with <tt>-wf</tt>, have the same format (they are written by the same method). The report file is a copy of the output to the terminal, and can be enabled with the <tt>-r</tt> option.
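The choice of the most precise location in the main results file (L, then P, then A for the default FC codes) can be sketched as below. This is a hypothetical Python illustration: <tt>most_precise</tt> is not a function from the bundle, the record values are invented for the example, and treating later codes in the requested list as finer grained is an assumption generalised from the default ordering.

```python
def most_precise(matches, fc_order=("A", "P", "L")):
    """Return (fc, record) for the finest grained FC code that has a match.

    matches maps an FC code to a match record, or to None when no match was
    found. Codes later in fc_order are assumed to be finer grained, so the
    order is scanned from the end.
    """
    for fc in reversed(fc_order):
        record = matches.get(fc)
        if record:
            return fc, record
    return None


if __name__ == "__main__":
    # Illustrative values only; real records come from the GNS reference data.
    matches = {
        "A": {"NAME": "London", "UNI": "0", "LAT": "0.0", "LONG": "0.0"},
        "P": {"NAME": "Chelsea", "UNI": "1", "LAT": "0.0", "LONG": "0.0"},
        "L": None,
    }
    print(most_precise(matches))
```

Here no locality was matched, so the populated place record supplies the PLACE, UNI, LAT, LONG and FC columns.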
The human choice file (enabled with <tt>-human</tt>) has its own format as follows:
*SOURCENAME - The word, token or string from the source entry that is being considered as relevant for a match
*REFNAME - The name of a place in the GNS file
*COUNTRY - From the source entry
*STR - From the source entry
*CTY - From the source entry
*EXP_CITY - From the source entry
*EXP_ADM - From the source entry
*EXP_POSTCODE - From the source entry
*REFTOTAL - The total number of grams in REFNAME
*SOURCETOTAL - The total number of grams in SOURCENAME
*REFPC - The percentage of the REFNAME grams that appear in the SOURCENAME gram set
*SOURCEPC - The percentage of the SOURCENAME grams that appear in the REFNAME gram set
*LEFTGRAMS - The number of the REFNAME grams that appear in the SOURCENAME gram set
*RIGHTGRAMS - The number of the SOURCENAME grams that appear in the REFNAME gram set
*LCSSCORE - The size of the longest common subsequence in characters
*SOURCELENGTH - The length of SOURCENAME
*REFLENGTH - The length of REFNAME
*MAXLENGTH - The maximum of the lengths of SOURCENAME and REFNAME
*LCSPC - The LCSSCORE divided by the MAXLENGTH
*FIRSTLETTERBINDS - Whether the fuzzy matching algorithm required the same first letter in SOURCENAME and REFNAME
*GRAMALPHABET - The gram alphabet used by the matching algorithm
*GRAMLENGTH - The length of the n-grams used
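To make the accuracy fields concrete, the following sketch computes them for one SOURCENAME and REFNAME pair. It is a minimal Python illustration, not the bundled implementation. The assumptions are: names are lower-cased, grams are contiguous character n-grams of length 2 with no padding, repeated grams are counted individually, and the gram alphabet and first-letter options are ignored. The function names are hypothetical.

```python
def ngrams(text, n=2):
    """Return the list of contiguous character n-grams of text (lower-cased)."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def match_metrics(source_name, ref_name, n=2):
    """Compute the gram and LCS fields for one source/reference pair."""
    src, ref = ngrams(source_name, n), ngrams(ref_name, n)
    src_set, ref_set = set(src), set(ref)
    left = sum(1 for g in ref if g in src_set)   # REFNAME grams found in SOURCENAME
    right = sum(1 for g in src if g in ref_set)  # SOURCENAME grams found in REFNAME
    lcs = lcs_length(source_name.lower(), ref_name.lower())
    max_len = max(len(source_name), len(ref_name))
    return {
        "REFTOTAL": len(ref),
        "SOURCETOTAL": len(src),
        "REFPC": 100.0 * left / len(ref) if ref else 0.0,
        "SOURCEPC": 100.0 * right / len(src) if src else 0.0,
        "LEFTGRAMS": left,
        "RIGHTGRAMS": right,
        "LCSSCORE": lcs,
        "SOURCELENGTH": len(source_name),
        "REFLENGTH": len(ref_name),
        "MAXLENGTH": max_len,
        "LCSPC": lcs / max_len if max_len else 0.0,
        "GRAMLENGTH": n,
    }


if __name__ == "__main__":
    print(match_metrics("Londn", "London"))
```

For the misspelling "Londn" against "London", four of the five reference grams and three of the four source grams are shared, and the longest common subsequence is the whole of "londn" (5 characters out of a maximum length of 6).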

Anonymous user

67.188.196.241

Changes

Geocoding Inventor Locations (view source)

Revision as of 05:47, 22 January 2010
