Changes

43 bytes added, 05:47, 22 January 2010

*Exact Matching - Case-insensitive matching of the entire sequence of both the source and the reference strings

*LCS - Longest Common Subsequence based matching (See below)

*Administrative area, populated place, and locality - locations identified as FC=A, FC=P or FC=L respectively in the GNS data. Unless otherwise specified, matches are performed for all GNS FC codes requested (default is A,P,L) separately and in series.
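As an illustration of the first two definitions above, the following is a minimal Python sketch of case-insensitive exact matching and of the standard longest-common-subsequence length. The function names <tt>exact_match</tt> and <tt>lcs_length</tt> are hypothetical, and this is only the textbook LCS computation; how the tool turns an LCS length into a match decision is not shown here.

```python
def exact_match(source, reference):
    """Case-insensitive comparison of the entire source and reference strings."""
    return source.casefold() == reference.casefold()


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (textbook dynamic programme)."""
    prev = [0] * (len(b) + 1)          # row for the previous character of a
    for ch_a in a:
        curr = [0]                     # column 0: empty prefix of b
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```

A subsequence need not be contiguous, so LCS tolerates inserted or dropped characters that would defeat an exact match.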

The sequence of processing is as follows (matching only the remaining unmatched locations at each stage):

An arbitrary upper token set length limit of 5 is used if the length of the source token array (4 in the example above) is greater than or equal to 5; otherwise the limit is the length of the array. Then, beginning at the upper length limit and decreasing by one after every set of that length has been tried, and starting from the right-hand side and moving one token to the left each time, the token sets are joined with spaces and exact matched against the reference string. This process iterates until all length-one token sets have been tried, and the matches are recorded in the order in which they were made. Thus, continuing the example above, the space-joined source token sets would be, in the order that they are tried:

#String1 String2 String3 String4 (token set length=4, first and only set)
#String2 String3 String4 (token set length=3, first set)
#String1 String2 String3 (token set length=3, second set)
#String3 String4 (token set length=2, first set)
#String2 String3 (token set length=2, second set)
#String1 String2 (token set length=2, third set)
#String4 (token set length=1, first set)
#String3 (token set length=1, second set)
#String2 (token set length=1, third set)
#String1 (token set length=1, fourth set)
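The ordering listed above can be reproduced with a short Python sketch. The names <tt>token_sets</tt>, <tt>match_token_sets</tt> and <tt>max_len</tt> are hypothetical and the reference names are assumed to be held in a simple collection; this is an illustration of the described procedure, not the tool's actual code.

```python
def token_sets(tokens, max_len=5):
    """Return space-joined token sets: longest first, and right to left within each length."""
    upper = min(len(tokens), max_len)  # cap the set length at max_len (5 in the text)
    result = []
    for length in range(upper, 0, -1):
        # Start with the window flush against the right-hand end, then slide left.
        for start in range(len(tokens) - length, -1, -1):
            result.append(" ".join(tokens[start:start + length]))
    return result


def match_token_sets(tokens, reference_names, max_len=5):
    """Exact match (case-insensitively) each token set; keep matches in the order made."""
    lookup = {name.casefold() for name in reference_names}
    return [s for s in token_sets(tokens, max_len) if s.casefold() in lookup]
```

For example, <tt>token_sets(["String1", "String2", "String3", "String4"])</tt> yields the ten strings above in the same order.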

===NGram and LCS Matching===

==Output Files==

By default all files are written to the Results directory. Which files are written depends on the options selected, though the main results file is always written (with or without unmatched addresses) and includes fuzzy matches (unless the <tt>-e</tt> option is used to force just exact matching). The main results file contains:

*COUNTRY - From the source entry

*STR - From the source entry

*LONG

The fuzzy match file(s), if requested with <tt>-wf</tt>, have the same format (they are written by the same method). The report file is a copy of the output to the terminal, and can be enabled with the <tt>-r</tt> option. The human choice file (enabled with <tt>-human</tt>) has its own format as follows:

*SOURCENAME - The word, token or string from the source entry that is being considered as relevant for a match

*REFNAME - The name of a place in the GNS file

Anonymous user

67.188.196.241


Geocoding Inventor Locations (view source)

