The Matcher (Tool)

Revision as of 16:42, 12 September 2017 by BenBaldazo

McNair Project
Project Information
Project Title: The Matcher (Tool)
Start Date: Summer 2016
Keywords: Database, Tool
Primary Billing:
Project Status: Proposed
Copyright © 2016 All Rights Reserved.


A tool to match and merge datasets.


The code is in the repository. A ready-to-use version is in:



The Matcher takes two tab-delimited text files containing firm names and matches them. Place the two input files in the Input directory; the results of a successful match run can be found in the Output directory. The firm names can be in any column(s) of the tab-delimited text files. By default, the software assumes that the names are in the first column of each file, uses the "hall" normalization technique (see below), and uses the algorithm "none" (see below).
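As a rough illustration of the input format described above, the following Python sketch pulls firm names out of a chosen column of tab-delimited text. It is not the Matcher's own code: the function name `read_names` and the sample rows are hypothetical, and the sketch assumes 1-based column numbers (consistent with the default of 1 meaning the first column) and no header row.

```python
import csv
import io

def read_names(handle, name_column=1):
    """Return the values in a 1-based column of a tab-delimited stream."""
    reader = csv.reader(handle, delimiter="\t")
    names = []
    for row in reader:
        # Skip rows too short to contain the requested column.
        if len(row) >= name_column:
            names.append(row[name_column - 1])
    return names

# In-memory stand-in for a file placed in the Input directory (made-up rows).
sample = io.StringIO("Acme Corp\tTX\nGlobex Inc\tNY\n")
names = read_names(sample)  # default: names are in the first column
print(names)
```

Passing `name_column=2` would read the second column instead, which mirrors the idea that the names may sit in any column.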

Matching has two stages: 1) normalization-based matching; and 2) algorithm matching based on the normalized strings. Setting the algorithm to "none" prevents the second stage from taking place. Likewise, "lowercase" normalization only converts all strings to lowercase without other changes, so normalization matching becomes the same as case-insensitive exact string matching. This is generally not recommended, except for deliberate testing.
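The two stages can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Matcher's implementation: only the "lowercase" normalization is modeled (the "hall" technique is not reproduced here), the second stage is assumed to run only on names left unmatched by the first, and the `similar` function is a hypothetical stand-in for a matching algorithm built on Python's `difflib`.

```python
import difflib

def normalize_lowercase(name):
    # "lowercase" normalization: change case only, nothing else.
    return name.lower()

def similar(a, b, threshold=0.9):
    # Hypothetical stage-2 algorithm; not one of the Matcher's own.
    return difflib.SequenceMatcher(None, a, b).ratio() >= threshold

def match(list_names, ref_names, normalize=normalize_lowercase, algorithm=None):
    # Stage 1: exact matching on normalized strings.
    ref_index = {}
    for ref in ref_names:
        ref_index.setdefault(normalize(ref), []).append(ref)
    matches, unmatched = [], []
    for name in list_names:
        key = normalize(name)
        if key in ref_index:
            matches.extend((name, ref) for ref in ref_index[key])
        else:
            unmatched.append(name)
    # Stage 2: skipped entirely when no algorithm is given ("none").
    if algorithm is not None:
        for name in unmatched:
            for ref in ref_names:
                if algorithm(normalize(name), normalize(ref)):
                    matches.append((name, ref))
    return matches

stage1_only = match(["ACME CORP", "globex inc"], ["Acme Corp", "Globex Inc."])
both_stages = match(["ACME CORP", "globex inc"], ["Acme Corp", "Globex Inc."],
                    algorithm=similar)
print(stage1_only)
print(both_stages)
```

With no algorithm, only the case-insensitive exact match is found; supplying the stand-in algorithm additionally pairs the names that differ by a trailing period.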

Although matches should be symmetric, the Matcher works by using one file as the list file and the other as the reference (ref) file. With some algorithms, choosing the shorter file as the ref file may improve performance. By default, the first file (file1) is the list file; this can be overridden with the -list=2 option.

A simple match between the files small1.txt and small2.txt (both are included) would be run as follows:

perl -file1="small1.txt" -file2="small2.txt" -name1=1 -name2=1 -list=1 -normalize="hall" -algorithm="none"

which is equivalent to:

perl -file1="small1.txt" -file2="small2.txt"
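The equivalence of the two invocations follows from the defaults described earlier. The Python sketch below illustrates that reasoning only; it is not how the Matcher parses its arguments. The `DEFAULTS` table and the `resolve_options` helper are hypothetical names, with the default values taken from the text above.

```python
# Defaults as described in the text: first column of each file,
# file1 as the list file, "hall" normalization, algorithm "none".
DEFAULTS = {
    "name1": "1",
    "name2": "1",
    "list": "1",
    "normalize": "hall",
    "algorithm": "none",
}

def resolve_options(args):
    """Merge -key=value style arguments over the defaults."""
    options = dict(DEFAULTS)
    for arg in args:
        key, _, value = arg.lstrip("-").partition("=")
        # Drop literal double quotes in case the shell did not remove them.
        options[key] = value.strip('"')
    return options

explicit = resolve_options([
    '-file1="small1.txt"', '-file2="small2.txt"',
    "-name1=1", "-name2=1", "-list=1",
    '-normalize="hall"', '-algorithm="none"',
])
short = resolve_options(['-file1="small1.txt"', '-file2="small2.txt"'])
print(explicit == short)  # prints True
```

Any option left off the command line falls back to its default, so the short form resolves to the same settings as the fully spelled-out one.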