Changes

Jump to navigation Jump to search
=== Clean Address: more patterns ===
After identifying clean data, the remaining records are cleaned Object: ptoassigneend_temp3  SQL code is in the following wayE:\McNair\Projects\PatentAddress\Cleaning_Step2.sql
====Clean Postcode====
One option is to identify state and postcode together with the following SQL code: (take 'addrline1' as an example)
SELECT addrline1SQL function: FROM ptoassigneend_us_temp2CREATE OR REPLACE FUNCTION ExtractPostcode2(adr text) RETURNS text AS $$ WHERE SELECT CASE WHEN (addrline1 adr ~* '([,]|[.])\s\w{2,}\s{0,}\w{0,}\s{1,}\d{5}' OR addrline1 adr ~* '(^|\s)\w{2}\s{1}\d{5}') AND NOT (addrline1 adr ~* 'BO' OR addrline1 adr ~* 'P[.]O') AND NOT (addrline1 adr ~* 'SUITE\s\d{5}') THEN SUBSTRING(adr, '\d{5}') ELSE NULL END AS result; # SELECT 3601$$ LANGUAGE SQL;
Examples:
Even excluding the P.O. BOX # and SUITE #, noise exists.
 
After extracting postcode, next, the following function is used to get clean postcode.
 
The priority is 'postcode' if it is '\d{5}', postcode_addr1, postcode_addr2, postcode_city
 
CREATE OR REPLACE FUNCTION PostcodeClean2 (text,text,text,text) RETURNS text AS $$
$pri1=$_[0];
$pri2=$_[1
];
$pri3=$_[2];
$postcode=$_[3];
if ($postcode) {return $postcode;}
if ($pri1) {return $pri1;}
if ($pri2) {return $pri2;}
if ($pri3) {return $pri3;}
return undef;
$$ LANGUAGE plperl;
The details and SQL function are in E:\McNair\Projects\PatentAddress\Cleang_Step2.sql
The details and SQL function are in E:\McNair\Projects\PatentAddress\Cleang_Step2.sql
The output is table ptoassigneend_us_postex2 which include a new feature 'city_extracted' and 'postcode_extracted'.
CREATE TABLE ptoassigneend_us_postex2 AS

Navigation menu