Difference between revisions of "The Matcher (Tool)"

Project
The Matcher (Tool)
Project Information
Has title	The Matcher (Tool)
Has start date	Summer 2016
Has deadline date
Has keywords	Database, Tool
Has project status	Proposed
Has sponsor	McNair Center
Has project output	Tool, How-to
	Copyright © 2019 edegan.com. All Rights Reserved.

Latest revision as of 12:33, 6 October 2020

The Matcher (Tool) matches and merges datasets using company names as identifiers. It is written in Perl and implements both normalization and fuzzy matching techniques. The normalization methods include 'Hall' and others used by the NBER Patent Data project. The fuzzy matching supports a range of techniques (Ngram, LCS, etc.) that can be used to generate candidate lists for human processing or machine learning, or to apply threshold-based cut-offs.

Code

The code is in the repository. A ready-to-use version is in:

E:\McNair\Software\Scripts\Matcher

Usage:

The Matcher takes two tab-delimited text files containing firm names and matches them. Place both input files in the Input directory; the results of a successful match run are written to the Output directory. The firm names can be in any column(s) of the tab-delimited text files. By default, the software assumes that the names are in the first column of each file, uses the "hall" normalization technique (see below), and uses the algorithm "none" (see below).
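
To make the column convention concrete, here is a minimal Perl sketch of pulling firm names out of one column of tab-delimited lines. The helper name names_from_lines is hypothetical and is not taken from Matcher.pl. The 1-based column number is an assumption that mirrors the -name1=1 default described above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Matcher.pl): return the non-empty values
# found in one column of tab-delimited lines. $col is 1-based, so $col = 1
# means "first column", mirroring the -name1=1 default.
sub names_from_lines {
    my ($lines, $col) = @_;
    die "column must be >= 1\n" if $col < 1;
    my @names;
    for my $line (@$lines) {
        (my $text = $line) =~ s/\r?\n\z//;   # copy the line, strip the line ending
        my @fields = split /\t/, $text, -1;  # -1 keeps trailing empty fields
        my $name = $fields[$col - 1];
        push @names, $name if defined $name && length $name;
    }
    return @names;
}

# Illustrative rows only; real input would be read from a file in Input.
my @rows = ("Acme Corp\tTX\n", "Widget LLC\tCA\n");
print join("|", names_from_lines(\@rows, 1)), "\n";
```

Rows that lack the requested column are skipped here rather than reported, which is a simplification for the sketch.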

Matching has two stages: 1) normalization-based matching; and 2) algorithm matching based on the normalized strings. Using an algorithm of "none" prevents the second stage from taking place. Likewise, "lowercase" normalization only converts all strings to lowercase, so normalization matching becomes equivalent to case-insensitive exact string matching. This setting is generally not recommended except for deliberate testing.
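
The two stages can be sketched in Perl as follows. This is an illustration under stated assumptions, not the Matcher's own code. All function names are hypothetical. The stage-2 measure is a Dice coefficient over character bigrams, used as a stand-in for the Ngram technique. The sketch also assumes that stage 2 only considers names left unmatched by stage 1, which the description above does not state.

```perl
use strict;
use warnings;

# "lowercase" normalization as described above: lowercase only, nothing else.
sub normalize_lowercase { return lc $_[0] }

# All overlapping two-character substrings of a string.
sub bigrams {
    my ($s) = @_;
    return map { substr($s, $_, 2) } 0 .. length($s) - 2;
}

# Stand-in stage-2 measure (an assumption, not the Matcher's Ngram code):
# Dice coefficient = 2 * shared bigrams / total bigrams in both strings.
sub bigram_dice {
    my ($s, $t) = @_;
    my @a = bigrams($s);
    my @b = bigrams($t);
    return 0 unless @a && @b;
    my %seen;
    $seen{$_}++ for @a;
    my $shared = 0;
    for my $gram (@b) {
        if ($seen{$gram}) { $seen{$gram}--; $shared++; }
    }
    return 2 * $shared / (@a + @b);
}

sub two_stage_match {
    my ($list, $ref, $algorithm, $threshold) = @_;

    # Stage 1: exact matching on normalized strings.
    my %ref_by_norm;
    $ref_by_norm{ normalize_lowercase($_) } = $_ for @$ref;
    my (%matched, @unmatched);
    for my $name (@$list) {
        my $norm = normalize_lowercase($name);
        if (exists $ref_by_norm{$norm}) { $matched{$name} = $ref_by_norm{$norm} }
        else                            { push @unmatched, $name }
    }
    return \%matched if $algorithm eq 'none';   # stage 2 is skipped

    # Stage 2: score each leftover name against every normalized ref name
    # and keep the best candidate if it reaches the threshold.
    for my $name (@unmatched) {
        my $norm = normalize_lowercase($name);
        my ($best, $best_score) = (undef, 0);
        for my $key (sort keys %ref_by_norm) {
            my $score = bigram_dice($norm, $key);
            ($best, $best_score) = ($ref_by_norm{$key}, $score)
                if $score > $best_score;
        }
        $matched{$name} = $best if defined $best && $best_score >= $threshold;
    }
    return \%matched;
}

my @list = ("ACME CORP", "Widgets Intl", "Zebra Holdings");
my @ref  = ("Acme Corp", "Widget Intl");
for my $algorithm ('none', 'bigram') {
    my $m = two_stage_match(\@list, \@ref, $algorithm, 0.8);
    print "$algorithm: ", join("; ", map { "$_ => $m->{$_}" } sort keys %$m), "\n";
}
```

With 'none', only the case-insensitive exact match is reported. With the stand-in algorithm and a 0.8 threshold, "Widgets Intl" is also paired with "Widget Intl", while "Zebra Holdings" stays unmatched.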

Although matches should be symmetric, the Matcher works by using one file as the list file and the other as the reference (ref) file. With some algorithms, choosing the shorter file as the ref file may improve performance. By default, the first file (file1) is the list file, though this can be overridden with the -list=2 option.
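
The list and ref roles can be illustrated with a short Perl sketch. The helpers assign_roles and exact_pairs are hypothetical, and indexing the ref names in a hash is an assumption about one reasonable implementation, not a description of Matcher.pl. For exact matching on lowercased names, swapping the roles changes which side is iterated but not the set of pairs found.

```perl
use strict;
use warnings;

# Hypothetical helper: decide which names play the list role and which the
# ref role. $list_option mirrors the -list option (1 by default, or 2).
sub assign_roles {
    my ($names1, $names2, $list_option) = @_;
    return $list_option == 2 ? ($names2, $names1) : ($names1, $names2);
}

# Index the ref names under lowercase keys, then look up each list name.
# Returns [list name, ref name] pairs.
sub exact_pairs {
    my ($list, $ref) = @_;
    my %index;
    $index{ lc $_ } = $_ for @$ref;
    my @pairs;
    for my $name (@$list) {
        my $hit = $index{ lc $name };
        push @pairs, [ $name, $hit ] if defined $hit;
    }
    return @pairs;
}

# Illustrative names only.
my @file1 = ("Acme Corp", "Widget LLC", "Zebra Inc");
my @file2 = ("ACME CORP", "zebra inc");

for my $opt (1, 2) {
    my ($list, $ref) = assign_roles(\@file1, \@file2, $opt);
    my @pairs = exact_pairs($list, $ref);
    printf "-list=%d: %d pairs\n", $opt, scalar @pairs;
}
```

Both settings report the same number of pairs for this data; only the order of the names within each pair differs.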

A simple match between the files small1.txt and small2.txt (both are included) would be run as follows:

perl Matcher.pl -file1="small1.txt" -file2="small2.txt" -name1=1 -name2=1 -list=1 -normalize="hall" -algorithm="none"

Because every option shown takes its default value, this is equivalent to:

perl Matcher.pl -file1="small1.txt" -file2="small2.txt"

References

@@ Line 1: / Line 1: @@
-{{McNair Projects
+{{Project
-|Has title= The Matcher (Tool)
+|Has project output=Tool,How-to
+|Has sponsor=McNair Center
+|Has title=The Matcher (Tool)
 |Has start date=Summer 2016
-|Status=Tabled
+|Has keywords=Database,Tool
-|Deliverable= Other
+|Has project status=Proposed
-|Audience=McNair Staff
+}}
+<onlyinclude>[[The Matcher (Tool)]] is a tool to match and merge datasets using company names as identifiers. It is written in perl and implements both normalization and fuzzy matching techniques. The normalization methods include 'Hall' and others used by the [[NBER Patent Data]] project, and the fuzzy matching supports a range of techniques (Ngram, LCS, etc.) that can be used to generate candidate lists for human processing or machine learning, as well as threshold-based cut-offs.</onlyinclude>
-|Keywords=Database
+==Code==
-|Primary Billing=AccMcNair01
-}}
+The code is in the repository. A ready-to-use version is in:
-==Abstract==
+ E:\McNair\Software\Scripts\Matcher
-A tool to match and merge datasets.
+Usage:
+The matcher takes two tab-delimited text files containing firm names and matches them. The two input files should be put in the Input directory. The results of a successful match run can be found in the Output directory. The firm names can be in any column(s) of the tab-delimited text files. By default, the software assumes that the names are in the first column of each of the two files. Also by default, the normalization technique used is "hall" (see below) and the algorithm is "none" (see below).
+Matching has two stages: 1) Normalization based matching; and 2) Algorithm matching based on normalized strings. Using an algorithm of "none" prevents the second stage from taking place. Likewise, though this is generally not recommended (except for deliberate testing), using "lowercase" normalization will just translate all strings into lowercase without other changes, and normalization matching is the same as case-insensitive exact string matching.
+Although matches should be symmetric, the Matcher works by using one file as the list file and one file as the reference (ref) file. With some algorithms choosing the shorter file to be the ref file may improve performance. By default, the first file (file1) is considered the list file, though this can be overridden with the -list=2 option.
+A simple match between the files small1.txt and small2.txt (both are included) would be run as follows:
+ perl Matcher.pl -file1="small1.txt" -file2="small2.txt" -name1=1 -name2=1 -list=1 -normalize="hall" -algorithm="none"
+which is equivalent to...
+ perl Matcher.pl -file1="small1.txt" -file2="small2.txt"
 ==References==
-<includeonly>
+<!-- flush -->
-[[Category: McNair Projects]]
-</includeonly><!-- flush -->
-[[Category: Internal]]
-[[Internal Classification: Software| ]]
