Changes

*Exact Matching - Case-insensitive matching of the entire sequence of both the source and the reference strings
*LCS - Longest Common Subsequence based matching (See below)
*Administrative area, populated place, and locality - locations identified as FC=A, FC=P or FC=L respectively in the GNS data. Unless otherwise specified, matches are performed for all GNS FC codes requested (default is A,P,L) separately and in series.
The sequence of processing is as follows (matching only the remaining unmatched locations at each stage):
An arbitrary upper token set length limit of 5 is used if the length of the source token array (4 in the example above) is greater than or equal to 5; otherwise the limit is the length of the array. Beginning at the upper length limit and decreasing by one after every set of that length has been tried, and starting from the right-hand side and moving one token to the left each time, the token sets are joined with spaces and exact matched against the reference string. This process iterates until all token sets of length one have been tried, and records the matches in the order that they were made. Thus, continuing the example above, the space-joined source token sets would be, in the order that they are tried:
#String1 String2 String3 String4 (token set length=4, first and only set)
#String2 String3 String4 (token set length=3, first set)
#String1 String2 String3 (token set length=3, second set)
#String3 String4 (token set length=2, first set)
#String2 String3 (token set length=2, second set)
#String1 String2 (token set length=2, third set)
#String4 (token set length=1, first set)
#String3 (token set length=1, second set)
#String2 (token set length=1, third set)
#String1 (token set length=1, fourth set)
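
The ordering above can be reproduced with a short sketch. This is an illustration only, not the tool's actual code: the function names <tt>token_sets</tt> and <tt>match_token_sets</tt>, the <tt>max_len</tt> parameter, and the use of a collection of reference names (rather than a single reference string) are assumptions made for the example.

```python
def token_sets(tokens, max_len=5):
    """Yield space-joined token sets, longest first and rightmost first.

    Hypothetical helper: max_len mirrors the arbitrary upper limit of 5
    described in the text.
    """
    upper = min(len(tokens), max_len)
    for size in range(upper, 0, -1):
        # Start at the right-hand side and move one token to the left each time.
        for start in range(len(tokens) - size, -1, -1):
            yield " ".join(tokens[start:start + size])


def match_token_sets(tokens, reference_names, max_len=5):
    """Return the token sets that exactly match a reference name, in the order tried.

    Assumption: the comparison is case-insensitive, following the
    Exact Matching definition.
    """
    reference = {name.lower() for name in reference_names}
    return [s for s in token_sets(tokens, max_len) if s.lower() in reference]


tokens = ["String1", "String2", "String3", "String4"]
for candidate in token_sets(tokens):
    print(candidate)
```

Run on the four-token example, the loop prints the ten token sets in the same order as the list above. <tt>match_token_sets</tt> keeps only those that match a reference name, preserving the order in which they were tried.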
===NGram and LCS Matching===
==Output Files==
By default all files are written to the Results directory. Which files are written depends on the options selected, though the main results file is always written (with or without unmatched addresses) and includes fuzzy matches (unless the <tt>-e</tt> option is used to force just exact matching). The main results file outputs:
*COUNTRY - From the source entry
*STR - From the source entry
*LONG
The fuzzy match file(s), if requested with <tt>-wf</tt>, have the same format (they are written by the same method). The report file is a copy of the output to the terminal, and can be enabled with the <tt>-r</tt> option. The human choice file (enabled with <tt>-human</tt>) has its own format, as follows:
*SOURCENAME - The word, token or string from the source entry that is being considered as relevant for a match
*REFNAME - The name of a place in the GNS file