Changes

Jump to navigation Jump to search
=== Identify Clean Data ===
As mentioned, the ptoassigneend_us_extracted is cleaned. This section works on the remaining records which are stored in ptoassigneend_us_temp.==== Output: ptoassigneend_us_identify0 ====
As mentioned, the ptoassigneend_us_extracted is cleaned. Copy all the records in ptoassigneend_us_extracted to ptoassigneend_us_identify0. ==== Output: ptoassigneend_us_identify1 ==== The following section works on the remaining records which are stored in ptoassigneend_us_temp. First, filter out records with city that is a city, zip that is a zip, state that is a state.  Note: The consistency between city and state or city and postcode is not checked in this section.
* ptoassigneend_us_citylist
Select distinct clean city records in ptoassigneend_us_extracted and store them in ptoassigneend_us_citylist (775).
Since the city list is not long, I briefly cleaned the list by hand, and stored it in updated the ptoassigneend_us_citylist (730).
====* zip that is a zip====
Match the pattern 5-4 or 5 digits.
====*state that is a state====
Select distinct state records with
The output shows that all the records not null or not spaces are valid state names.
====*city that is a city====
One option to identify clean city is to find records that match ptoassigneend_us_citylist.
-- SELECT 2603422
==== Output ==== * The table ptoassigneend_us_identify1 This table stores records that meet all the requirements above: zip with 5-4 or 5 digits, state not null or not spaces, and city in ptoasigneend_us_citylist.
SQL Code:
SELECT 2511356
* Store remaining records in ptoassigneend_us_temp2. ==== Output: ptoassigneend_us_identify2 ==== Part of 'city' contains punctuation. Remove punctuation marks, and then match 'city' with ptoassigneend_us_citylist.
CREATE TABLE ptoassigneend_us_temp3 AS
SELECT *, replace(city, ',', '') clean_city
FROM ptoassigneend_us_temp2;
# SELECT 1055874
Output: The output is ptoassigneend_us_identify2.
SQL code:
CREATE TABLE ptoassigneend_us_identify2 AS
SELECT *

Navigation menu