Longest Common Subsequence (LCS) is a widely used fuzzy matching technique. The [http://en.wikipedia.org/wiki/Longest_common_subsequence Longest Common Subsequence page on Wikipedia] provides detailed background. However, LCS matching of two datasets is extremely processor intensive: each pairwise comparison takes time proportional to the product of the two string lengths, and every source string may have to be compared against every reference string. (The general LCS problem, for an arbitrary number of sequences, is NP-hard.) To avoid long run times, LCS matching is done only on the small subset of strings that have met the NGram criteria detailed below.
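To make the cost of a single comparison concrete, the following is a minimal sketch of the textbook dynamic-programming computation of LCS length for two strings. It is not necessarily the implementation used here; the function name <code>lcs_length</code> is hypothetical and chosen only for illustration.

```python
def lcs_length(a: str, b: str) -> int:
    """Return the length of the longest common subsequence of a and b.

    Textbook dynamic programming: O(len(a) * len(b)) time,
    O(len(b)) extra memory (only two rows of the table are kept).
    """
    prev = [0] * (len(b) + 1)  # row for the previous character of a
    for ch_a in a:
        curr = [0]  # column 0: LCS with an empty prefix of b is 0
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Matching characters extend the LCS of both shorter prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise drop one character from either string.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


print(lcs_length("AGGTAB", "GXTXAYB"))  # 4 (the subsequence "GTAB")
```

Because the inner loop runs once per character pair, comparing every source string with every reference string multiplies this cost by the number of string pairs, which is why a cheaper pre-filter is applied first.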
NGram

NGrams are character-based token strings. Source and reference strings are transformed to include only characters from one of the following numbered sets:
#ABCDEFGHIJKLMNOPQRSTUVWXYZ (i.e. uppercase Latin alphabet)
#0123456789 (i.e. the decimal digits)
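The transformation described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual implementation: the helper names <code>keep_only</code> and <code>char_ngrams</code>, the upper-casing of the input before filtering, and the NGram length of 3 are all hypothetical.

```python
# The two character sets listed above.
UPPERCASE_LETTERS = frozenset("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
DIGITS = frozenset("0123456789")


def keep_only(text: str, allowed: frozenset) -> str:
    """Drop every character of text that is not in the allowed set.

    Assumption: the input is upper-cased first so that lowercase
    letters survive the uppercase-letter filter.
    """
    return "".join(ch for ch in text.upper() if ch in allowed)


def char_ngrams(text: str, n: int = 3) -> list:
    """Return all overlapping character n-grams of text, in order.

    Assumption: n = 3 is an illustrative default, not a documented value.
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]


letters = keep_only("Acme Corp. 42", UPPERCASE_LETTERS)
print(letters)                                # ACMECORP
print(keep_only("Acme Corp. 42", DIGITS))     # 42
print(char_ngrams(letters))                   # ['ACM', 'CME', 'MEC', 'ECO', 'COR', 'ORP']
```

Strings shorter than the NGram length yield no tokens in this sketch; how such strings are handled in practice is not specified here.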