Difference between revisions of "Geocoding Inventor Locations"

From edegan.com
imported>Ed
 
This page details the various matching techniques used to geocode inventor locations in the NBER patent data. Geocoding inventor locations entails matching the inventor addresses provided in the patent data to known locations throughout the world and recording their longitude and latitude.
 
  
The reference data for the locations (which provides the longitudes and latitudes) is taken from the (U.S.) [http://www.nga.mil National Geospatial-Intelligence Agency's] [http://geonames.nga.mil/ggmagaz/geonames4.asp GEOnet Names Server (GNS)], which covers the world excluding the U.S. and Antarctica. The NGA site states: "There are no licensing requirements or restrictions in place for the use of the GNS data. Toponymic information is based on the Geographic Names Data Base, containing official standard names approved by the United States Board on Geographic Names and maintained by the National Geospatial-Intelligence Agency."
==Reference Data==
  
[http://www.geonames.org/ Geonames.org], a third-party website based in Switzerland, provides location name data under the Creative Commons Attribution license. However, this data is drawn from the GNS for all locations except the U.S. and Canada, where data is drawn from the [http://www.geonames.org/ U.S. Geological Survey Geographic Names Information System] and [http://www.geobase.ca www.geobase.ca] respectively. We therefore recommend that users take advantage of the original sources and respect the original licenses where applicable.
The reference data for the locations (which provides the longitudes and latitudes) is taken from the (U.S.) National Geospatial-Intelligence Agency's [[GEOnet Names Server]], which covers the world excluding the U.S. and Antarctica.
The Perl module (GNS.pm) loads and indexes this data and provides its key variables.
  
==Details of the GEOnet Names Server (GNS)==
 
  
*Place names are recorded in [http://earth-info.nga.mil/gns/html/romanization.html Romanized form]
 
*Country (and province/territory) names, as well as their assigned codes, are recorded using [http://earth-info.nga.mil/gns/html/FIPS10-4_match.pdf FIPS Standard #10]. Users should note that there are differences between the Federal Information Processing Standard (FIPS) country names and the [[UN GeoRegion Codes |UN Country Names]].
 
*The [http://earth-info.nga.mil/gns/html/help.htm GNS output format] contains various custom codes. Of particular interest are:
 
**UNI - Unique Name Identifier (a numeric value - often negative)
 
**LAT and LONG - Latitude and Longitude (Decimal - also available as DMS)
 
**NT - Name Type (only A, P and L are of use in matching to addresses)
 
***A = Administrative region type feature
 
***P = Populated place type feature
 
***V = Vegetation type feature
 
***L = Locality or area type feature
 
***U = Undersea type feature
 
***R = Streets, highways, roads, or railroad type feature
 
***T = Hypsographic type feature
 
***H = Hydrographic type feature
 
***S = Spot type feature
 
**DC - Designation Code (DC provides a refinement of NT; details are in [http://www.edegan.com/repository/GNS-DesignationCodes.txt GNS-DesignationCodes.txt])
 
**SHORT_FORM - a Short Form of the name that is commonly used
 
**FULL_NAME - the Long Form of the name
 
**FULL_NAME_ND - the Long Form of the name without diacritics
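
The field list above suggests a simple way to build a lookup table from a GNS extract. The following is a minimal Python sketch, not the GNS.pm Perl module itself. It assumes the extract is tab-delimited with a header row that uses the column names listed above; the function name <code>load_gns_index</code> and the sample rows are hypothetical and invented purely for illustration.

```python
import csv
import io

# Name types that the list above marks as useful for address matching.
USEFUL_NAME_TYPES = {"A", "P", "L"}


def load_gns_index(handle):
    """Index GNS rows by lower-cased FULL_NAME_ND, keeping only A, P and L rows.

    Assumption: the input is tab-delimited with a header row containing
    UNI, LAT, LONG, NT and FULL_NAME_ND columns.
    """
    index = {}
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        if row["NT"] not in USEFUL_NAME_TYPES:
            continue  # skip rivers, roads, undersea features and so on
        key = row["FULL_NAME_ND"].strip().lower()
        entry = {
            "uni": int(row["UNI"]),      # often negative
            "lat": float(row["LAT"]),    # decimal degrees
            "long": float(row["LONG"]),  # decimal degrees
            "nt": row["NT"],
        }
        # Several places can share a name, so each key holds a list.
        index.setdefault(key, []).append(entry)
    return index


# Hypothetical sample rows (not real GNS records).
SAMPLE = (
    "UNI\tLAT\tLONG\tNT\tFULL_NAME_ND\n"
    "-100\t51.5\t-0.12\tP\tLondon\n"
    "-200\t54.0\t-2.0\tH\tSome River\n"
)

index = load_gns_index(io.StringIO(SAMPLE))
print(index["london"])
```

Keying on the diacritic-free name keeps lookups tolerant of addresses typed without accents, and storing a list per key leaves room for a later disambiguation step when a name is not unique.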
 
  
 
==The Source And Reference Files==
 

Revision as of 00:26, 30 July 2009


Source files are currently extracted from the NBER patent data on a per-country basis. The original format of the source files was:

postcode WKU CTY city county

Where WKU is the patent number and CTY is an address field. The postcode, city and county are derived fields, either extracted from CTY by an algorithm or hand coded. These fields are error prone but contain some important information, particularly regarding "historical curiosities", as well as approximately 1,000 hand-corrected typos. As a first pass, however, the address fields are regenerated in the matching script.
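
A record in this layout could be read as sketched below. This is an illustration only: the page does not state the field delimiter, so a tab is assumed, and the names SourceRecord and parse_source_line, as well as the sample line, are hypothetical.

```python
from typing import NamedTuple


class SourceRecord(NamedTuple):
    """One line of a source file: postcode WKU CTY city county."""
    postcode: str
    wku: str      # patent number, kept as text to preserve any leading zeros
    cty: str      # raw address field
    city: str
    county: str


def parse_source_line(line, sep="\t"):
    """Split one source line into its five fields.

    Assumption: fields are separated by `sep` (a tab by default).
    """
    parts = line.rstrip("\n").split(sep)
    if len(parts) != 5:
        raise ValueError("expected 5 fields, got %d" % len(parts))
    return SourceRecord(*(part.strip() for part in parts))


# Hypothetical sample line (not taken from the real data).
sample = "CB2 1TN\t4000000\tCambridge, Cambridgeshire CB2 1TN\tCambridge\tCambridgeshire\n"
record = parse_source_line(sample)
print(record.wku, record.city)
```

Because the derived fields are error prone, a matching script can keep only wku and cty from each record and regenerate the rest.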

Countries that are being processed include:

  1. The UK (Source file: UK-PatentInventorLocations.txt, Reference file: GNS-UK.txt)

Postal Codes

Postal codes, known as ZIP codes in the U.S., vary by national jurisdiction and for historical reasons. The following postal code formats are listed for reference:

  • United Kingdom (Sourced from Wikipedia): A9 9AA, A99 9AA, A9A 9AA, AA9 9AA, AA99 9AA, AA9A 9AA. Simple Regex: ([A-Z]{1,2}[0-9]{1,2}[A-Z]{0,1}\s[0-9][A-Z]{2,2})
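
The simple regex above can be tried against the six listed formats as follows. This is a small Python sketch rather than the matching script itself; the helper name find_postcode and the sample addresses are hypothetical.

```python
import re

# The simple regex from the list above, unchanged.
UK_POSTCODE = re.compile(r"([A-Z]{1,2}[0-9]{1,2}[A-Z]{0,1}\s[0-9][A-Z]{2,2})")

# The six formats listed above, with A = letter and 9 = digit.
FORMATS = ["A9 9AA", "A99 9AA", "A9A 9AA", "AA9 9AA", "AA99 9AA", "AA9A 9AA"]


def find_postcode(address):
    """Return the first postcode-like substring of an address, or None."""
    match = UK_POSTCODE.search(address.upper())
    return match.group(1) if match else None


# Every listed format should be accepted in full by the pattern.
all_formats_match = all(UK_POSTCODE.fullmatch(fmt) for fmt in FORMATS)
print(all_formats_match)

print(find_postcode("12 High St, London SW1A 1AA"))
print(find_postcode("Oxford"))
```

The pattern is deliberately loose: it is not anchored to word boundaries and it also accepts shapes outside the six listed formats (for example A99A 9AA), so a hit is a candidate postcode rather than a validated one.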