Changes

Jump to navigation Jump to search
Some patterns can be used to extract state information.
'''a. '[,] State Postcode''''
The state and post code are always together, separated by a space. We can extract state information with regular expression
SQL Code:
 
WHEN city ~* '([,]|[.])\s\w{2,}\s{0,}\w{0,}\s{1,}\d{5}[-]\d{4}'
THEN LTRIM(RTRIM(replace(regexp_replace(SUBSTRING(city, '[,]\s\w{2,}\s{0,}\w{0,}\s{1,}\d{5}[-]\d{4}'), '\d{5}-\d{4}', ''),',','')))
For example,Examples:
city | state_city
NEW YORK, NY 10022-3201 | NY
BROOKINGS, SOUTH DAKOTA 57006-0128 | SOUTH DAKOTA
'''b. '\s State(abbreviation) Postcode''''
'(^|\s)\w{2}\s{1}\d{5}[-]\d{4}'
SQL code:
 
WHEN city ~* '(^|\s)\w{2}\s{1}\d{5}[-]\d{4}'
THEN regexp_replace(SUBSTRING(city, '\w{2}\s{1,}\d{5}[-]\d{4}'), '\d{5}-\d{4}', '')
For exampleExamples
NEW YORK NY 10022-3201 |NY
WAUKEGAN IL 60085-2195 |IL
'''c. 'D.C.' '''
'D[.]C[.]\s\d{5}-\d{4}'
SQL code:
 
WHEN city ~* 'D[.]C[.]\s\d{5}-\d{4}'
THEN 'D.C.'
Z:/PatentAddress/
'''a. '\s{2,} CityName [,] State Postcode''''
SQL code:
 
CASE WHEN addrline1 ~* '\s{2,}\w{1,}\s{0,}\w{0,}\s{0,}[,]\s{0,}\w{1,}\s{0,}\w{0,}\s{0,}\d{5}[-]\d{4}'
THEN LTRIM(RTRIM(REPLACE(SUBSTRING(SUBSTRING(addrline1, '\s{2,}\w{1,}\s{0,}\w{0,}\s{0,}[,]\s{0,}\w{1,}\s{0,}\w{0,}\s{0,}\d{5} [-]\d{4}'),'.*[,]'),',','')))
1313 N. MARKET STREET HERCULES PLAZAWILMINGTON, DE 19894-0001
'''b. '[,]\s{1,} CityName [,] State Postcode' '''
SQL code:
BALLSTON TOWER ONE 800 NORTH QUINCY STREET, ARLINGTON, VA 22217-5660 |ARLINGTON
'''c. 'CityName [,] State Postcode' (no leading spaces)'''
SQL code:
ROCHESTER, NY 14650-2201 |ROCHESTER
'''d. 'CityName State(abbreviation) Postcode' (no leading spaces)'''
SQL code:
TARRYTOWN NY 10591-6706 |TARRYTOWN
'''e. 'CityName State (full name) Postcode' (no leading spaces)'''
NEW YORK NEW YORK 10022-3201
''This pattern can't be identified because of much noise.''
'''f. 'CityName Postcode' (no leading spaces)'''
LITTLE ELM 75068-3787
MASSACHUSETTS 02780-7319 ('State Postcode')
'''*Other Noise: '''
BATON, ROUGE, LA 70809-4562 (the city name is separated by a comma)
ST. PAUL, MIN 55133-3427 (city name contains a dot)
'''5. Output (Tables)'''
'''*ptoassigneend_allus'''
:::This table contains all the U.S. patents extracted from ptoassigneend table. The rule to generate this table is 'country = 'UNITED STATES' ' or any of the features contains U.S. postcode.
:::Noise exists because of postcode errors.
:::The ptoassigneend_allus table may miss some U.S. patents which lose postcode records.
:::The SQL code is:
CREATE TABLE ptoassigneend_allus AS
postcode | character varying(80) |
::'''*ptoassigneend_missus_final'''
:::State and postcode information extracted from addrline1, addrline2 and city columns are stored in this table. See section 2 and 3. :::This table is a subset of ptoassigneend_allus table.
This table is a subset of ptoassigneend_allus table.
Table "public.ptoassigneend_missus_final"
Column | Type | Modifiers
state_addr2 | text |
:::postcode_city is the postcode extracted from 'city'; postcode_addr1 is the postcode extracted from 'addrline1'; postcode_addr2 is the postcode extracted from 'addrline2'.  :::state_city is the state name extracted from 'city'; state_addr1 is the state name extracted from 'addrline1'; state_addr2 is the state name extracted from 'addrline2'.
::*ptoassigneend_missus_city_finalstate_city is the state name extracted from 'city'; state_addr1 is the state name extracted from 'addrline1'; state_addr2 is the state name extracted from 'addrline2'.
:::City information are extracted from addrline1, addrline2 and city columns and are stored in ptoassigneend_city table. See section 4.'''*ptoassigneend_missus_city_final'''
:::This table is a subset of ptoassigneend_allus City information are extracted from addrline1, addrline2 and city columns and are stored in ptoassigneend_city table. See section 4.
This table is a subset of ptoassigneend_allus table.
Table "public.ptoassigneend_missus_city_final"
Column | Type | Modifiers
city_city | text |
:::city_city is the city name extracted from 'city'; city_addr1 is the city name extracted from 'addrline1'; city_addr2 is the city name extracted from 'addrline2'.
:'''6. Issues'''
::* The post code and zip regex for other countries besides U.S. can be found here:
http://stackoverflow.com/questions/578406/what-is-the-ultimate-postal-code-and-zip-regex
Example::: For example, 
"US", "\d{5}([ \-]\d{4})?"
"CA", "[ABCEGHJKLMNPRSTVXY]\d[ABCEGHJ-NPRSTV-Z][ ]?\d[ABCEGHJ-NPRSTV-Z]\d"
"FR", "\d{2}[ ]?\d{3}"
::* The city feature needs to be standardized. For example, 'GRAND CAYMAN, CAYMAN ISLAND' and 'GRAND CAYMAN' indicate the same city.
::* Some state and country features don't match.  :::For example:
Example:
addrline2 | city | country
2882 SAND HILL ROAD MENLO PARK, CALIFORNIA 94025-7022 | TOKYO | JAPAN
3001 ORCHARD PARKWAY SAN JOSE, CA 95134-2088 | MINATO-KU, TOKYO 107 | JAPAN
* Both state name and its abbreviation exist.
::* Both state name and its abbreviation exist. ::* Some records are 'A CORP. OF CA.' or 'A CORP. OF NY' or 'A CORPORATION OF CA', which may be another method to extract state information.

Navigation menu