
As mentioned in Section 2.2.2, city, state and postcode information is extracted from 'addrline1', 'addrline2' and 'city'. The original table also contains 'postcode', 'city' and 'state'. This gives up to four candidates for each of city, state and postcode: the original value plus one extracted from each of the three address fields.
The information extracted from the addresses and the information in the original table are not necessarily consistent. '''The objective of this section is to pick the best postcode, city and state for each record, and to create a Master Table with the original features plus the cleaned postcode, city and state.'''
Reminder: 'postcode_city' is the postcode extracted from 'city'; 'postcode_addr1' is the postcode extracted from 'addrline1'; 'postcode_addr2' is the postcode extracted from 'addrline2'.
'postcode_city', 'postcode_addr1' and 'postcode_addr2' are all consistent.
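The consistency claim above could be checked per record with something like the following minimal sketch. The function name `extracted_postcodes_agree` is hypothetical, and it assumes that "consistent" means every non-empty extracted postcode is the same string, which the text does not spell out.

```python
def extracted_postcodes_agree(postcode_city, postcode_addr1, postcode_addr2):
    """Return True if all non-empty extracted postcodes are identical.

    Assumption: missing values (None or blank strings) are ignored, and
    'consistent' means exact string equality after trimming whitespace.
    """
    present = {
        p.strip()
        for p in (postcode_city, postcode_addr1, postcode_addr2)
        if p and p.strip()
    }
    # Zero or one distinct value means there is nothing to disagree about.
    return len(present) <= 1
```

Records for which this returns False would be the ones to inspect by hand.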
* Inconsistency between 'postcode' and 'postcode_addr1'
Case 1: 'postcode_addr1' beats 'postcode' because 'addrline1' is more detailed. For example:
addrline1 | postcode_addr1 | postcode_newpostcode
P.O. BOX 6 / 83707-0006 | 83707-0006 | 83716
BLDG. C01, M.S. A126 P.O. BOX 80028 LOS ANGELES, CA 90080-0028 | 90080-0028 | 90045
* Inconsistency between 'postcode' and 'postcode_addr2'
Case 2: 'postcode_addr2' beats 'postcode' because 'addrline2' is more detailed. For example:
addrline2 | postcode_addr2 | postcode_newpostcode
P.O. BOX 6 / 83707-0006 | 83707-0006 | 83716-9632
P.O. BOX 5800 - MS0161, ALBUQUERQUE, NEW MEXICO 87185-0161 | 87185-0161 | 87123-0161
* Inconsistency between 'postcode' and 'postcode_city'
Case 3: 'postcode_city' beats 'postcode'.
city | state | postcode_city | postcode
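The three cases above share one rule: a postcode extracted from an address field is preferred over the original 'postcode'. A minimal sketch of that rule follows. The function name `pick_postcode` is hypothetical, and the order in which the three extracted fields are tried (addrline1, then addrline2, then city) is an assumption; the text only states that each of them beats 'postcode'.

```python
from typing import Optional


def pick_postcode(
    postcode: Optional[str],
    postcode_addr1: Optional[str] = None,
    postcode_addr2: Optional[str] = None,
    postcode_city: Optional[str] = None,
) -> Optional[str]:
    """Pick a cleaned postcode for one record.

    Extracted postcodes take priority over the original 'postcode'
    (Cases 1 to 3). The order among the extracted fields is assumed.
    """
    for candidate in (postcode_addr1, postcode_addr2, postcode_city):
        if candidate and candidate.strip():
            return candidate.strip()
    # No extracted postcode available: fall back to the original value.
    if postcode and postcode.strip():
        return postcode.strip()
    return None


# Values taken from the Case 1 and Case 2 examples.
print(pick_postcode("83716", postcode_addr1="83707-0006"))
print(pick_postcode("87123-0161", postcode_addr2="87185-0161"))
```

Because the extracted postcodes are reported to be consistent with each other, the assumed order among them should rarely matter in practice.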
Reminder: 'state_city' is the state extracted from 'city'; 'state_addr1' is the state extracted from 'addrline1'; 'state_addr2' is the state extracted from 'addrline2'.
All the cleaned states for U.S. patents are stored in ptoassigneend_us_cleaned (see feature state_cleaned).
Note: We might want to convert state names to standard codes.
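One possible way to do the conversion mentioned in the note is a lookup table from full state names to two-letter USPS codes. The sketch below is illustrative only: `STATE_CODES` lists just a handful of states and would need to be completed, and the function name `to_state_code` is hypothetical.

```python
# Partial mapping for illustration; a real table would cover every state.
STATE_CODES = {
    "CALIFORNIA": "CA",
    "IDAHO": "ID",
    "NEW MEXICO": "NM",
    "NEW YORK": "NY",
    "TEXAS": "TX",
}


def to_state_code(state):
    """Map a state name or code to its two-letter code, or None if unknown."""
    if state is None:
        return None
    # Normalise case and collapse repeated whitespace.
    key = " ".join(state.upper().split())
    if key in STATE_CODES.values():
        return key  # already a known two-letter code
    return STATE_CODES.get(key)


print(to_state_code("New Mexico"))
print(to_state_code("ca"))
```

Returning None for unrecognised values keeps unmapped states visible for manual review instead of passing them through silently.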
