Changes

Jump to navigation Jump to search
==== Output: ptoassigneend_us_identify0 ====
As mentionedin Section 3, the ptoassigneend_us_extracted is cleanedclean. Copy all the records in ptoassigneend_us_extracted to ptoassigneend_us_identify0.
Store remaining records in ptoassigneend_us_temp.
==== Output: ptoassigneend_us_identify1 ====
The following section works on the remaining records which are stored in ptoassigneend_us_temp.
First, filter Filter out records with city that is a city, zip that is a zip, state that is a state.
Note: The consistency between city and state or city and postcode is not checked in this section.
* ptoassigneend_us_citylist
Select Copy clean city records in ptoassigneend_us_extracted and store them in to ptoassigneend_us_citylist (775).
Since the city list is not long, I briefly cleaned the list by hand, and updated the ptoassigneend_us_citylist (730).
* zip that is a zip
Match the pattern 55d-4 4d or 5 digits.
*state that is a state
*city that is a city
One option method to identify clean city is to find city records that match ptoassigneend_us_citylist.
SQL Code:
-- SELECT 2603422
* The table ptoassigneend_us_identify1 stores records that meet all the requirements above: zip with 5-4 or 5 digits, state not null or and not spaces, and city in ptoasigneend_us_citylist.
SQL Code:
==== Output: ptoassigneend_us_identify2 ====
Part of 'city' contains commaat the end. Remove comma, and then match 'city' with ptoassigneend_us_citylist.
SELECT *, replace(city, ',', '') clean_city
# SELECT 14508
Store remaining data (excluding data in ptoassigneend_us_identify0, ptoassigneend_us_identify1 & ptoassigneend_us_identify2) in ptoassigneend_temp3ptoassigneend_us_temp3.
=== Clean Address: more patterns ===
ObjectTable: ptoassigneend_temp3 ptoassigneend_us_temp3
SQL code is in E:\McNair\Projects\PatentAddress\RestructureAddressInfo(Second Stage).sql
Identifying five-digit postcode is risky because of the existence of P.O. BOX #, SUITE #, etc.
One option is to identify state and postcode together with the following SQL codefunction: (take 'addrline1' as an example)
SQL function:
CREATE OR REPLACE FUNCTION ExtractPostcode2(adr text) RETURNS text AS $$
SELECT CASE WHEN (adr ~* '([,]|[.])\s\w{2,}\s{0,}\w{0,}\s{1,}\d{5}' OR
767 FIFTH AVE., NEW YORK, NY 10153 | 10153
Even excluding the P.O. BOX # and SUITE #, noise Noise still exists.
After extracting postcode, next, the following function is used to get clean postcode.
The priority is 'postcode' if it is '\d{5}', postcode_addr1, postcode_addr2, postcode_city
CREATE OR REPLACE FUNCTION PostcodeClean2 (text,text,text,text) RETURNS text AS $$
$pri1=$_[0];
$pri2=$_[1];
if ($pri3) {return $pri3;}
return undef;
$$ LANGUAGE plperl;
The details and SQL function are in E:\McNair\Projects\PatentAddress\Cleang_Step2RestructureAddressInfo(Second Stage).sql
The output is table ptoassigneend_us_postex which include a new feature 'postcode_extracted'.
====Clean 'city'====
'city' can be is cleaned using the following patterns.
*Pattern 1: 'city' is like ~ 'city name, state ID'
*Pattern 2: 'city' is like ~ 'city name, state postcode (5 digits)'
*Pattern 3: 'city' is like ~ 'city name,'
The SQL function is:
$$ LANGUAGE SQL;
The details and SQL function are in E:\McNair\Projects\PatentAddress\Cleang_Step2RestructureAddressInfo(Second Stage).sql
The output is table ptoassigneend_us_postex2 which include a new feature 'city_extracted' and 'postcode_extracted'.
===Identify Clean Data (Round Two)===
A new list of clean city is extracted in Section 4.3.2. This list, combined with 'ptoassigneend_us_citylist', creates a new city list 'ptoassigneend_us_citylist2' which can be used to identify clean data.
Since the city list is not long, I briefly cleaned the list it by hand, and stored it in ptoassigneend_us_citylist2.
* Actually, we can buy find a full list of U.S. cities online: https://www.uscitieslist.org/.
====Output: ptoassigneend_us_identify3====
Similar to Section 4.2, identify clean data that meets all the requirements: postcode_extracted with 5-4 or 5 digits, state not null or and not spaces, and city_extracted in ptoasigneend_us_citylist2.
SQL Code:
====Output: ptoassigneend_us_identify4====
Some of the city records contain dots. Remove dots, and then match 'city' with ptoassigneend_us_citylist2.
SQL Code:
CREATE TABLE ptoassigneend_us_identify4 AS

Navigation menu