==Country Codes==
This project uses ISO3166, specifically [http://en.wikipedia.org/wiki/ISO_3166-1_alpha-2 ISO3166-1 alpha-2] two-character country codes, as recognised by the UN and (with exceptions) used for top-level domains on the internet. The codes for the 246 recognised countries or self-governing territories are as follows:[<tt>
AD,AE,AF,AG,AI,AL,AM,AN,AO,AQ,AR,AS,AT,AU,AW,AX,AZ,BA,BB,BD,BE,BF,BG,BH,BI,BJ,BL,BM,BN,BO,BR,BS,BT,BV,BW,BY,BZ,
CA,CC,CD,CF,CG,CH,CI,CK,CL,CM,CN,CO,CR,CU,CV,CX,CY,CZ,DE,DJ,DK,DM,DO,DZ,EC,EE,EG,EH,ER,ES,ET,FI,FJ,FK,FM,FO,FR,
PA,PE,PF,PG,PH,PK,PL,PM,PN,PR,PS,PT,PW,PY,QA,RE,RO,RS,RU,RW,SA,SB,SC,SD,SE,SG,SH,SI,SJ,SK,SL,SM,SN,SO,SR,ST,SV,SY,SZ,
TC,TD,TF,TG,TH,TJ,TK,TL,TM,TN,TO,TR,TT,TV,TW,TZ,UA,UG,UM,US,UY,UZ,VA,VC,VE,VG,VI,VN,VU,WF,WS,YE,YT,ZA,ZM,ZW
]</tt> Note that UK is on "exceptional reserve" for use by the United Kingdom of Great Britain and Northern Ireland (GB) and is often used in place of GB.
==Reference Data==
The reference data for the locations (which provides the longitudes and latitudes) is taken from the (U.S.) National Geospatial-Intelligence Agency's [[GEOnet Names Server | GEOnet Names Server (GNS)]], which covers the world excluding the U.S. and Antarctica.
Available reference data files for countries that have been or are being processed include:
#The UK: [http://www.edegan.com/repository/GNS-GB.txt GNS-GB.txt]

The perl module [http://www.edegan.com/repository/GNS.pm GNS.pm] loads, indexes and provides an interface to key variables from this data. The source code is the primary module documentation. Exported methods include:
*new() - Constructor. Takes an ISO3166 code, calls Load
*Load() - Expects to find GNS-XX.txt (where XX is an ISO3166 code) with GNS standard column names; loads it.
*Index() - Builds the master index and all sub-indices
*GetIndexKeys() - Takes a specific GNS NT code (e.g. P,L,A) or ALL and returns a set of index keys
*GetUNIs() - Takes a place name and a type (e.g. P,L,A,ALL); returns a list of corresponding UNIs
*GetLongLat() - Takes a UNI, returns a longitude, latitude pair
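
The interface above can be pictured with a small sketch. It is written in Python rather than perl, it is not the GNS.pm source, and everything in it is an assumption made for illustration: the class name <tt>GazetteerIndex</tt>, the column names <tt>UNI, LAT, LONG, TYPE, NAME</tt>, and the three sample rows (real GNS files carry many more columns under their own standard names).

```python
import csv
import io
from collections import defaultdict

# Hypothetical three-row extract, tab delimited. Values are illustrative only.
SAMPLE = (
    "UNI\tLAT\tLONG\tTYPE\tNAME\n"
    "1001\t51.5\t-0.1167\tP\tLondon\n"
    "1002\t53.4667\t-2.2333\tP\tManchester\n"
    "1003\t51.5\t-0.0833\tA\tLondon\n"
)


class GazetteerIndex:
    """Illustrative stand-in for the load/index/lookup interface."""

    def __init__(self, handle):
        self.coords = {}  # UNI -> (longitude, latitude)
        # type code -> lower-cased name -> list of UNIs
        self.by_type = defaultdict(lambda: defaultdict(list))
        for row in csv.DictReader(handle, delimiter="\t"):
            uni = row["UNI"]
            self.coords[uni] = (float(row["LONG"]), float(row["LAT"]))
            key = row["NAME"].lower()
            self.by_type[row["TYPE"]][key].append(uni)  # sub-index
            self.by_type["ALL"][key].append(uni)        # master index

    def get_index_keys(self, type_code="ALL"):
        """Return the set of name keys for one type code, or for ALL."""
        return set(self.by_type[type_code])

    def get_unis(self, name, type_code="ALL"):
        """Return the UNIs recorded for a place name within a type."""
        return list(self.by_type[type_code].get(name.lower(), []))

    def get_long_lat(self, uni):
        """Return the (longitude, latitude) pair for a UNI."""
        return self.coords[uni]


index = GazetteerIndex(io.StringIO(SAMPLE))
print(index.get_unis("London", "P"))
print(index.get_long_lat("1002"))
```

Keeping one sub-index per type code alongside an ALL index means a lookup by type never has to filter the full name list.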
==The Source Files==
Per country source files are currently extracted from the NBER patent data. The original problem of identifying countries for some address records will be addressed later. The format of the source file(s) is as follows (XX is an ISO3166 code):
*XX.txt - Tab delimited plain text with no (intentional) string quotation. Column(s): <tt>cty</tt>
*XX_exceptions.txt - Tab delimited plain text with no (intentional) string quotation. Column(s): <tt>cty city adm postcode</tt>

The <tt>cty</tt> is used as a primary key in both files. XX_exceptions.txt provides details on hand identified records, or other records where special care has been taken. This file is not strictly necessary but will be processed if present.
The perl module [http://www.edegan.com/repository/PatentLocations.pm PatentLocations.pm] loads and provides an interface to this source data. The source code is the primary module documentation. Exported methods include:
*new() - Takes an ISO3166 country code, calls Load. Expects to find a stop words file ([http://www.edegan.com/repository/PatentLocations-Stopwords.txt PatentLocations-Stopwords.txt]) and a postcode RegEx file ([http://www.edegan.com/repository/PatentLocations-PostCode.rex PatentLocations-PostCode.rex]).
*Load() - Loads the data file(s)
*CleanAndParse() - Does a first round of cleaning and parsing (calls internal methods). Extracts the postcode and replaces stop words.
*UnMatched() - Takes an NT code (e.g. P,L,A,ALL) and returns the set of currently unmatched country name keys for that type
*ReturnMatches() - Marks country name keys with their new match sets
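
As a rough illustration of the stop word step in CleanAndParse(), the following Python sketch replaces whole-word stop words in an address string. It is not taken from PatentLocations.pm: the function name <tt>replace_stop_words</tt>, the three stop words, and the choice to replace each with a single space are all assumptions; the project's real list lives in its stop words file.

```python
import re

# Hypothetical stop words, for illustration only.
STOP_WORDS = ["near", "nr", "c/o"]


def replace_stop_words(address, stop_words=STOP_WORDS, replacement=" "):
    """Replace whole-word stop words (case-insensitively) and tidy spacing."""
    alternatives = "|".join(re.escape(word) for word in stop_words)
    # The lookarounds stop a word matching inside a longer one ("Nearby").
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)
    cleaned = pattern.sub(replacement, address)
    return re.sub(r"\s+", " ", cleaned).strip()


print(replace_stop_words("c/o Smith, Nr Oxford"))
```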
==Postal Codes==
*United Kingdom ([http://en.wikipedia.org/wiki/UK_postcodes Sourced from Wikipedia]): A9 9AA, A99 9AA, A9A 9AA, AA9 9AA, AA99 9AA, AA9A 9AA. Simple Regex: <tt>([A-Z]{1,2}[0-9]{1,2}[A-Z]{0,1}\s[0-9][A-Z]{2,2})</tt>
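
The simple regex can be checked against the six listed formats with a short Python sketch. The pattern is copied from the line above; the helper <tt>extract_postcode</tt> and the sample address are assumptions added for illustration, and show one possible way to pull the postcode out of an address field.

```python
import re

# Pattern copied verbatim from the page.
UK_POSTCODE = re.compile(r"([A-Z]{1,2}[0-9]{1,2}[A-Z]{0,1}\s[0-9][A-Z]{2,2})")

# The six formats listed on the page.
FORMATS = ["A9 9AA", "A99 9AA", "A9A 9AA", "AA9 9AA", "AA99 9AA", "AA9A 9AA"]


def extract_postcode(address):
    """Return (postcode, remainder); postcode is None when nothing matches."""
    match = UK_POSTCODE.search(address)
    if match is None:
        return None, address
    start, end = match.span(1)
    remainder = address[:start] + address[end:]
    remainder = re.sub(r"\s+", " ", remainder).strip(" ,")
    return match.group(1), remainder


print(all(UK_POSTCODE.fullmatch(f) for f in FORMATS))
print(extract_postcode("12 High Street, Cambridge CB2 1TN"))
```

The pattern is unanchored and upper-case only, so lower-case input does not match, and it is slightly more permissive than the six formats (it also accepts A99A 9AA and AA99A 9AA).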
 
==The Matching Process==
 
The matching process is carried out by [http://www.edegan.com/repository/MatchPatentLocations.pl MatchPatentLocations.pl], which has a standard pod based command line interface. The -co option specifies the ISO3166 country code to be matched. The script uses these modules: PatentLocations.pm, GNS.pm
 
Glossary of terms:
*Units - isolated logical units from an address, such as the street number and name, the town, or the region. Postal codes are treated separately.
*Tokens - Single words or sequences of words separated by a space (note that this is a specific usage)
*n-grams - character sequences, such as bigrams (two letters from aa to zz), trigrams (aaa-zzz) and so forth
*Exact Matching - Case-insensitive matching of the entire sequence of both the source and the reference strings
*LCS - Longest Common Subsequence based matching (See below)
 
*Place and administrative area - somewhere identified as a NT=P or NT=A respectively in the GNS data. Unless otherwise specified matches are performed for both place and administrative area separately and in series.
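
Two of the glossary terms can be made concrete with a short Python sketch. These are the textbook definitions of character n-grams and of longest common subsequence length, not code from MatchPatentLocations.pl; the function names are hypothetical, and lower-casing before comparison is an assumption carried over from the exact matching definition. How the script turns an LCS length into a match decision is not shown here.

```python
def ngrams(text, n):
    """Return the overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (case-insensitive)."""
    a, b = a.lower(), b.lower()
    prev = [0] * (len(b) + 1)  # row of the DP table for the previous character of a
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[len(b)]


print(ngrams("leeds", 2))
print(lcs_length("Edinburgh", "Edinburg"))
```

A subsequence need not be contiguous, which is what lets LCS tolerate dropped or inserted letters in a misspelt place name.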
 
The sequence of processing is as follows:
#Load the source files, clean and parse (parsing identifies units)
#Load the reference file, build indices
#Exact match the exception units of records with exceptions
#Exact match the units of well-formatted records
#Exact match tokens (1-5 words)
#
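
The token step (exact matching of 1-5 word tokens) might look something like the following Python sketch. It is an illustration under stated assumptions, not the script's code: the function names are hypothetical, longer tokens are tried before shorter ones by choice, and the address is assumed to have been cleaned already, since words are split on whitespace only.

```python
def token_windows(text, max_words=5):
    """Yield every run of 1..max_words consecutive words, longest runs first."""
    words = text.split()
    for size in range(min(max_words, len(words)), 0, -1):
        for start in range(len(words) - size + 1):
            yield " ".join(words[start:start + size])


def exact_match_tokens(address, reference_keys, max_words=5):
    """Return the reference keys that equal some token, ignoring case."""
    lookup = {key.lower(): key for key in reference_keys}
    return [
        lookup[token.lower()]
        for token in token_windows(address, max_words)
        if token.lower() in lookup
    ]


# Hypothetical reference keys and address, for illustration only.
reference = {"Milton Keynes", "Milton", "Keynes"}
print(exact_match_tokens("12 High Street Milton Keynes Bucks", reference))
```

Trying longer tokens first lets a caller prefer "Milton Keynes" over its one-word parts when several keys match.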