*Exact Matching - Case-insensitive matching of the entire sequence of both the source and the reference strings
*LCS - Longest Common Subsequence based matching (See below)
 
*Place and administrative area - a location identified as NT=P or NT=A, respectively, in the GNS data. Unless otherwise specified, matches are performed for place and administrative area separately and in series.
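The exact matching defined above can be sketched in a few lines of Python. This is only an illustration, not the code used by the project: the function name <code>exact_match</code> is hypothetical, and the sketch assumes that no whitespace trimming or other normalization is applied beyond ignoring case.

```python
def exact_match(source: str, reference: str) -> bool:
    """Return True when the two entire strings are equal, ignoring case.

    Hypothetical helper for illustration; casefold() is used so that the
    comparison also ignores case for non-ASCII letters.
    """
    return source.casefold() == reference.casefold()


if __name__ == "__main__":
    print(exact_match("Kabul", "KABUL"))       # same sequence, different case
    print(exact_match("Kabul", "Kabul City"))  # not the entire sequence
```

Because the whole sequence must agree, a source string that merely contains the reference name does not count as an exact match.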
The sequence of processing is as follows (matching only the remaining unmatched locations at each stage):
#Load the source files, clean and parse (parsing identifies units)
#Load the reference file, build indices
#Exact match the units of well-formatted records
#Exact match tokens (1-5 words)
#LCS match the exception units of records with exceptions
#LCS match (all other)
#n-gram match
#Reconcile multiple matches

==Longest Common Subsequence (LCS)==
Longest Common Subsequence is perhaps the simplest (at least in its less efficient implementations) and most widely used fuzzy matching technique. The [http://en.wikipedia.org/wiki/Longest_common_subsequence Longest Common Subsequence page on Wikipedia] provides detailed background.
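For readers who want something concrete, the following is a minimal Python sketch of the textbook dynamic-programming computation of LCS length. It is not the implementation used in this matching process. The names <code>lcs_length</code> and <code>lcs_score</code> are hypothetical, and the normalized score (twice the LCS length divided by the combined length of the two strings) is an assumption made for illustration; the text above does not say how LCS results are scored or thresholded.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b.

    Standard dynamic programming, keeping only the previous row,
    so time is O(len(a) * len(b)) and memory is O(len(b)).
    """
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the best subsequence that ends before both characters.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise drop one character from either string.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_score(source: str, reference: str) -> float:
    """Assumed similarity in [0, 1]; case is ignored, as in exact matching."""
    s, r = source.casefold(), reference.casefold()
    if not s or not r:
        return 0.0
    return 2 * lcs_length(s, r) / (len(s) + len(r))


if __name__ == "__main__":
    print(lcs_length("AGGTAB", "GXTXAYB"))
    print(lcs_score("Kandahar", "Qandahar"))
```

A subsequence does not need to be contiguous, which is why alternative spellings that differ by a few inserted or substituted letters still share a long common subsequence.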