Changes

Jump to navigation Jump to search
==== Identify Clean Data ====
As mentioned, the ptoassigneend_us_extracted is cleaned. This section works on the remaining records which are stored in ptoassigneend_us_temp.
Next, filter out records with city that is a city, zip that is a zip, state that is a state. Note: The consistency between city and state or city and postcode is not checked in this section.
*state that is a state
All Select distinct state records with  SELECT DISTINCT state FROM ptoassigneend_us_temp; The output shows that all the records not null or '' are valid state names.
*city that is a city
 
No comma, no dot
 
The feature city is messy. In this way, we first clean feature city before identifying clean data.
 
===== Clean Feature City =====
 
The following patterns can be used to clean feature city.
 
Select distinct city records and store them in table citylist (30971).
 
* city, state postcode(5)
 
'.*[,].*\d{5}'
 
-- 565
 
* city, state code
 
'.*[,]\s{0,}\w{2}$'
 
-- 284
 
* city, state code
 
'.*[,]\s{0,}\w{2}\s'
 
* IS NOT NULL AND city != '' AND city !~* '([,]|[.])'
 
-- 23501

Navigation menu