Changes

14,306 bytes added , 05:47, 22 January 2010

This page details the various matching techniques used to Geocode inventor locations in the NBER patent data. Geocoding inventor locations entails matching the inventor addresses provided in the patent data to known locations through-out the world and recording their longitude and latitude.

The reference data for the locations (which provides the longitude and latitudes) is taken from the (U.S.) [http://www.nga.mil National Geospatial-Intelligence Agency's] [http://geonames.nga.mil/ggmagaz/geonames4.asp GEOnet Names Server (GNS)] which covers the world excluding the U.S. and Antartica. The NGA site states: "There are no licensing requirements or restrictions in place for the use of the GNS data. Toponymic information is based on the Geographic Names Data Base, containing official standard names approved by the United States Board on Geographic Names and maintained by the National Geospatial-Intelligence Agency."==Script Files==

The scripts and modules that operationalize these matching techniques can be downloaded as a bundle with ([http://www.~~geonames~~edegan.~~org~~com/ ~~Geonames~~repository/MatchLocations.~~org~~tar.gz MatchLocations.tar.gz v1.0.1], a third-party website based in Switzerland, provides location name data under the creative commons attribution license. However, this data is drawn from the GNS for all locations except the US and Canada, where data is drawn from the ~20Mb) or without ([http://www.~~geonames~~edegan.~~org~~com/repository/ UMatchLocations_Full.Star. ~~Geological Survey Geographic Names Information System] and [http://www~~gz MatchLocations_Full.~~geobase~~tar.~~ca www~~gz v1.~~geobase~~0.ca1] ~~respectively~~~20Mb) all supporting data files. ~~Thus we recommend~~ Note that ~~users take advantage of~~ the ~~original sources and respect~~ current version is 1.0.3, which will be posted shortly. The bundles contain the default directory structure. Defaults can be changed in the ~~original licenses if applicable~~MatchLocations.pl script.

~~==Details of the GEOnet Names Server (GNS)==~~

*Place names The directories are ~~recorded in [http~~as follows:~~//earth-info.nga.mil/gns/html/romanization.html Romanized form]~~*~~Country (and province/territory) names, as well as their assigned codes, are recorded using [http://earth-info.nga.mil/gns/html/FIPS10~~Source -~~4_match.pdf FIPS Standard #10]. Users~~ Source data should ~~note that there are differences between the Federal Information Processing Standard (FIPS) country names and the [[UN GeoRegion Codes |UN Country Names]].~~*The [http://earth-info.nga.mil/gns/html/helpbe placed here.~~htm GNS output format] contains various custom codes~~See below for formatting. ~~Of particular interest are:~~**UNI Results - ~~Unique Name Idenfier (a numeric value - often negative)~~**LAT and LONG - Latitude and Longitude (Decimal - also available as DMS)** NT - Name Type (Only AResults generated by the scripts,~~P and L are of use in matching to addresses)~~including logs will appear here.***A = Administrative region type feature***P = Populated place type feature***V = Vegetation type feature***L = Locality or area type feature***U = Undersea type feature***R = Streets, highways, roads, or railroad type feature***T = Hypsographic type feature***H = Hydrographic type feature***S = Spot type feature**DC GNS - ~~Designation Code (DC provides a refinement of NT, details are in [http://www.edegan.com/repository/~~contains GNS~~-DesignationCodes.txt~~ reference data named GNS-~~DesignationCodes~~XX.txt]**SHORT_FORM Match - ~~a Short Form of the name that is commonly used~~**FULL_NAME - the Long Form of the name**FULL_NAME_ND - the Long Form of contains the ~~name without diacritics~~modules

~~==The Source And Reference Files==~~

~~Source files are currently extracted from~~ The bundle contains:*MatchLocations.pl - The main script that initializes and processes the matching requests*BatchMatch.pl - A script for running batches *Match::GNS.pm - Interface to the GNS reference data (see below)*Match::Patent.pm - Interface to the ~~NBER patent~~ Patent Location data ~~on a per country basis~~(see below)*Match::Common.pm - Provides common (string cleaning) routines for both the reference and source interface modules*Match::PostalCodes.pm - A module that extracts postcodes of various formats from (address) strings*Match::Gram.pm - Custom NGram Module*Match::LCS.pm - A standard LCS Module*PatentLocations-Stopwords. txt - A Stop Word file (tab delimited)*GNS Reference Files - The ~~original format~~ full bundle contains a full set of ~~the source~~ correctly named GNS reference files ~~was:~~

~~<tt>postcode WKU CTY city county</tt>~~The MatchLocations.pl script can be run from any shell or command line with perl installed. Example commands are:

~~Where~~ <tt>~~WKU</tt> is the patent number and <tt>CTY~~perl MatchLocations.pl -co GB -u -human -r -wf </tt> is an address field. The postcode, city and county are derived fields, either extracted from CTY by an algorithm or hand coded. These fields are error prone but contain some important information, particular regarding "historical curiosities", and approximately 1k of hand-corrected typos. As a first pass, however, the address fields are regenerated in the matching script.

~~Countries that are being processed~~ which will process ISO3166 <tt>country</tt> code GB (Great Britain), include:~~#The UK (Source~~ <tt>unmatched</tt> inputs in the results file~~: [http:~~, produce a <tt>human</tt> choices file, write the <tt>report</~~www.edegan.com~~tt> to a text file, and <tt>write fuzzy</~~repository/UK-PatentInventorLocations.txt UK-PatentInventorLocations.txt], Reference~~ tt> matches to additional seperate files as well as the main results file~~: [http://www~~.~~edegan.com~~Other options include <tt>over</~~repository~~tt> to override country designations and <tt>o</~~GNS-UK~~tt> to specify the results filename.~~txt GNS-UK.txt])~~

~~==Postal Codes==~~ <tt>perl MatchLocations.pl -h</tt>

~~Postal codes, known as ZIP codes in the U~~produces a simple help output.~~S., vary by national jurisdiction and for historical reasons. The following postal codes formats are posted for reference:~~

*United Kingdom ==The Source Files== Per country source files are extracted from the NBER patent data. The format of the source file(s) is as follows (XX is an ISO3166 code): XX.txt - Tab delimited plain text with no (intentional) string quotation. Column(~~[http~~s):<tt>country</tt> <tt>str</tt> <tt>cty</entt> <tt>adm</tt> <tt>city</tt> <tt>postcode</tt> <tt>str</tt> The column order is not important.~~wikipedia~~<tt>country</tt>, <tt>str</tt>, and <tt>cty</tt> can not all be null.~~org~~<tt>adm</tt> <tt>city</~~wiki~~tt> <tt>postcode</~~UK_postcodes Sourced~~ tt> are optional 'exception' fields that are processed with priority. They provide hand corrections and other specifically generated information. The perl module Match::Patent.pm loads and provides an interface to this source data. The source code is the primary module documentation. The Match::PostalCodes.pm perl module provides a method to extract [[Postal Codes]] from a the addresses for a large number of ISO3166 codes, and implements 'standard' postal code identification for all other jurisdictions. ==Reference Data== The reference data for the locations (which provides the longitude and latitudes) is taken from ~~Wikipedia~~the (U.S.) National Geospatial-Intelligence Agency's [[GEOnet Names Server | GEOnet Names Server (GNS)]] which covers the world excluding the U.S. and Antartica. This project uses [[ISO3166]] two-character country codes to name source and reference files. GNS does not use ISO3166 country codes, and so users will need to translate accordingly (see the [[GEOnet Names Server | GNS page]] for details). A full bundle of correctly names GNS files is also available. The perl module Match: ~~A9 9AA~~:GNS.pm loads, ~~A99 9AA~~indexes and provides an interface to key variables from this data. The source code is the primary module documentation. The load() method takes an ISO3166 code, ~~A9A 9AA~~and the index methods and most other methods take specific GNS FC codes (e.g. "P" for populated place, ~~AA9 9AA~~"L" for locality, ~~AA99 9AA~~and "A" for administrative area). Which GNS FC codes are used is specified in the @Letters global varible of MatchLocations.pl and inherited by all other modules. MatchLocations.pl also retrieves a list of all ISO3166 codes included in the data (from the MatchPatent.pm module) and in any specified override file, ~~AA9A 9AA~~and calls Match::GNS. ~~Simple Regex~~pm to load them. An override file can be specified with the <tt>-over</tt> option. Override files are tab-delimited and have the format: ListedISO3166 1stPreference 2ndPreference 3rdPreference ... The ISO3166 listed in the source data is then overridden and the alternatives are searched for matches in order of preference. The search is terminated when a match is found or the override set is exhausted. ==The Matching Process== The matching process is carried out by [http://www.edegan.com/repository/MatchPatentLocations.pl MatchLocations.pl] script, and its dependent modules (detailed above), which has a standard pod based command line interface. The <tt>-co</tt> option specifies the ISO3166 country code to be matched. If the override option is used, then the <tt>-co</tt> option can be used to specify the source file. When an override option is set to 1, rather than to the filename containing the overrides, then the source files countries are used to determine which GNS lookups to perform, otherwise the <tt>-co</tt> option specifies the GNS reference set. Glossary of terms:*Units - isolated logical units from an address, such as the street number and name, the town, or the region. Postal codes are treated separately. *Tokens - Single words or sequences of words separated by a space (note that this is a specific usage)*n-grams - character sequences, such as bigrams (two letters from aa to zz), trigrams (aaa-zzz) and so forth*Exact Matching - Case insensitive of matching of the entire sequence of both the source and the reference strings*LCS - Longest Common Subsequence based matching (See below)*Administrative area, populated place, and locality - locations identified as a FC=A, FC=P or FC=L respectively in the GNS data. Unless otherwise specified, matches are performed for all GNS FC codes requested ([default is A,P,L) separately and in series. The sequence of processing is as follows (matching only the remaining unmatched locations at each stage):#Load the source files, clean and parse (parsing identifies units)#Load the reference file, build indices#Exact match the exception units of records with exceptions#Exact match the units of well-~~Z]{~~formatted records#Exact match tokens (1-5 words)#N-gram and LCS match#Reconsile multiple matches ===Exact Matching Units=== The exact matching of units is performed for both the exception units and units of "well-formatted" records, that is records that have comma seperated logical units. Postcodes are extracted as a logical unit if possible first (to generate the PRS_POSTCODE field). Exact matching is case insensitive and units are trimmed of preceeding and subsequent spaces, but otherwise the match must be exact. Units are matched from the bottom to the top, in order of precedence. That is if the string is Unit1,Unit2, Unit3, Postcode; then Unit3 is matched with precedence over Units 2}and 1, and so forth. However, if multiple matches are made for a some FC code and one match is made for another, then preference is given to the different combination. For example if the string were "Chelsea, London" and both Chelsea and London were recorded in the GNS data as FC=P, but only London was recorded as a FC=A, then it would be most sensible to record P=Chelsea, A=London, and not P=London, A=London. This is differencing is done in the matching method and independent from the resolution of multiple matches at the end. ===Token Matching=== Each source string is cleaned of upper-asci code (i.e. reduced to alphanumerics plus spaces), removed of its postcode (to generate PRS_POSTCODE) and then seperated into token arrays on space characters. Thus the string "String1 String2 String3 String4 Postcode" would become:[0]String1 [1]String2 [2]String3 [3]String4 An arbitrary upper token set length limit of 5 is used if the length of the source token array (4 in the example above) is greater than or equal to 5. Then beginning at the upper length limit and decreasing by one after each set of this lenght has been tried, and starting from the right hand-9side and moving one unit to the left each time, the token sets are joined with spaces and exact matched against the reference string. This process iterates all length one token sets have been tried and records the matches in the order that they were made. Thus continuing the example above the space-joined source token sets would be, in the order that they are tried:#String1 String2 String3 String4 (token set length=4, first and only set)#String2 String3 String4 (token set length=3, first set)#String1 String2 String3 (token set length=3, second set)#String3 String4 (token set length=2, first set)#String2 String3 (token set length=2, second set)#String1 String2 (token set length=2, third set)#String4 (token set length=1, first set)#String3 (token set length=1, second set)#String2 (token set length=1, third set)#String1 (token set length=1, fourth set) ===NGram and LCS Matching=== Longest Common Subsequence (LCS) is an abundantly used fuzzy matching technique. The [http://en.wikipedia.org/wiki/Longest_common_subsequence Longest Common Subsequence page on wikipedia]{provides a very detailed background. However, LCS matching of two datasets is an NP-Hard problem and extremely processor intensive. To avoid long run-times, LCS matching is done on only a small sub-set of strings that have met the NGram criteria detailed below. NGrams are character-based token strings. Source and reference strings are transformed to include only characters from one of the following numbered sets:#ABCDEFGHIJKLMNOPQRSTUVWXYZ (i.e. uppercase Latin alphabet)#0123456789 (i.e. Standard numbers)#" " (i.e. the space character)#Alphanumeric (i.e. 1 and 2 above)#Alphanumeric plus space (1,2},3 above)#Alphabet only plus space(1 and 3 above) Gramsets of various lengths, currently 2,3 and 4 characters, are used. Reference strings are decomposed into grams and both forward and reverse indexed for speed on large datasets. Source strings are decomposed into grams. Candidate reference strings are found by compiling a list of all reference strings that contain at least one of the source grams (this uses the reverse index). Then each of the candidate reference strings is gram-matched against the source gram to compute a number of scores. Specically gram-matching involves computing:*The number of grams in the source string*The number of grams in the reference string*The number of grams in the source string that appear in the reference string*The number of grams in the reference string that appear in the source string*The above two numbers as a source percentage and a reference percentage Then an optional constraint that the first letter of the source and reference strings must match can be employed. If a source and reference pair meets a variety of criteria on the above variables (for example the source string is greater than 10 characters long, the source percentage of grams is greater then 80% and less than 110%, the reference percentage of grams is greater than 90% and less than 105%, and the two strings start with the same letter), then the LCS is computed. An LCS percentage, using the maximum of the lengths of the source and reference strings as a denominator, is also calculated. The candidate reference string with the highest gram and LCS scores, assuming that these scores meet the decision threshold is then selected as the closest match. If no candidate reference string meets the decision threshold the source string is left unmatched. The decision thresholds are configured in the MatchLocations.pl script, and sets of Ngram/LCS matchings, using different character sets, gram lengths and decision thresholds, are performed sequentially, with the currently unmatched source strings used as input for each round. ===Reconciling Multiple Matches=== In a small number of cases it is possible that the source string will achieve more than one match for more than one FC code. For example suppose the string "Glouchester Street Cambridge Cambridgeshire" were considered. This could concievably produce two P matches and one A match with the token matching algorithm detailed above. To reconsile multiple matches the following process is undertaken:*If an FC code has only one match keep that one match*Aim for distinction in the set, giving priority in the order that the FC codes are specified in MatchLocations.pl. The default is to include A,P,L in order, so that precedence follows importance and size. This is important if multiple FC codes contain multiple overlapping matches. For example suppose A=1,2 P=2,3 and L=3. The algorithm will look forward and backwards to assign: A=1 P=2 L=3.*Determine the set of FC code matches with the shortest distance between them using a [http://en.wikipedia.org/wiki/Haversine_formula Haversine formula] distance calculation based on the GNS reported longitudes and latitudes. (Note that the Haversine formula is implemented in the Match::GNS.pm module and is the most accurate method over short distances, where other methods, like the great-circle method, suffer from compounded rounding error problems.) This is important when multiple FC codes have muliple matches but they do not overlap.*If one or more match is found for an FC code then one final 'best' match must be reported, even if it overlaps with another FC code or is distant. ==Human Choices== It is generally preferrable to have a very high degree of confidence in the fuzzy matches, so that they can be treated as correct without individual inspection. However the script and modules are capable of matching to any degree of accuracy. To get further matches that can be inspected/validated/chosen by a human agent, a very weak criteria is set for two runs of fuzzy matching, and then in each run the best (in terms of parameter scores) options are recorded and written into a 'human choice' file. As a result a human choice file may contain:#No matches for a source string as none of the reference strings managed to reach even the very weak threshold criteria.#One match, as both runs of fuzzy matching produced the same recommendation.#Two matches, as both runs of fuzzy matching produced one best candidate and the candidates were unique.#More than two matches, as one or both of the fuzzy matching runs had multiple unique candidates with the same scores. It appears likely that blocks of matches will be able to be identified from the human choice files, by restricting the results sets to ranges for one or more of the provided match accuracy parameters. ==Output Files== By default all files are outputted to the Results directory. Which files are outputted depends on the options selected, though the main results file is always outputted (with or without unmatched addresses) and includes fuzzy matches (unless the <tt>-e</tt> option is used to force just exact matching). The main results file outputs:*COUNTRY - From the source entry*STR - From the source entry*CTY - From the source entry*EXP_CITY - From the source entry*EXP_ADM - From the source entry*EXP_POSTCODE - From the source entry*CTY_STR - Acompound entry, delimited by #, used as an internal key. It is the software's best estimate of an address structure.*EXP_STR -~~Z]{0~~A compound entry, delimited by #,~~1}\~~made from the exception data in a similar way to CTY_STR*PRS_POSTCODE - The software's[0best estimate of the postcode if any*MATCH_TYPE - The match type that was used to make the match*PLACE - The name of the most precise location*UNI - The GNS unique identifier of the most precise location*LAT - The latitude of the most precise location*LONG - The longitude of the most precise location*FC -~~9][~~The FC code of the most precise location The most precise location is taken to be the finest grained result. That is the match corresponding to the lowest level FC code. In the case of the default of FC=A,P,L preference is given to L then P then A. The following variables are then repeated for each FC code searched, and prefixed by the FC code (if no match was found for this FC code the entries will be blank):*NAME*UNI*LAT*LONG The fuzzy match file(s), if requested with <tt>-~~Z]{2~~wf</tt>,2}have the same format (they are written by the same method). The report file is a copy of the output to the terminal, and can be enabled with the <tt>-r</tt> option. The human choice file (enabled with <tt>-human</tt>) has its own format as follows:*SOURCENAME - The word, token or string from the source entry that is being considered as relevant for a match*REFNAME - The name of a place in the GNS file*COUNTRY - From the source entry*STR - From the source entry*CTY - From the source entry*EXP_CITY - From the source entry*EXP_ADM - From the source entry*EXP_POSTCODE - From the source entry*REFTOTAL - The total number of grams in REFNAME*SOURCETOTAL - The total number of grams in SOURCENAME*REFPC - the percentage of the REFNAME grams that appear in the SOURCENAME gram set*SOURCEPC - the percentage of the SOURCENAME grams that appear in the REFNAME gram set*LEFTGRAMS - the number of the REFNAME grams that appear in the SOURCENAME gram set*RIGHTGRAMS - the number of the SOURCENAME grams that appear in the REFNAME gram set*LCSSCORE - The size of the longest common subsequence in characters*SOURCELENGTH - The length of SOURCENAME*REFLENGTH - The length of REFNAME*MAXLENGTH - The maximum of the lengths of SOURCENAME and REFNAME*LCSPC - The LCSSCORE divided by the MAXLENGTH*FIRSTLETTERBINDS - Whether the fuzzy matching algorithm required the same first letter in SOURCENAME and REFNAME*GRAMALPHABET - The gram alphabet used by the matching algorithm*GRAMLENGTH - The length of the n-grams used

Anonymous user

67.188.196.241

Changes

Geocoding Inventor Locations (view source)

Revision as of 05:47, 22 January 2010

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Sites

Sections

Organizations

Help

Tools