Changes

Jump to navigation Jump to search
1,262 bytes added ,  15:15, 17 July 2017
==Cleaning the Companybase table==
Every table will contain some duplicate keys and erroneous entries. We're going to clean the companybase table so that every key (coname, statecode, datefirstinv) is unique. This means that there will be a 1:1 relationship between 1 key and 1 entry. Given an entry you will be able to create a unique key and given a coname, statecode, datefirstinv key you will be able to find exactly 1 entry that the key corresponds to in the companybase table set.
 
So first check to see if the key is valid on the base data using the following 2 queries.
SELECT COUNT(*)
FROM (SELECT coname, statecode, datefirstinv FROM companybase)a;
--44774
 
SELECT COUNT(*)
FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase)a;
--44771
You can see that they key is not unique because the counts don't match up. There are 44,771 distinct keys but there are 44,774 keys in the companybase table. So 1 key can match to more than one entry in the table.
Some of the data in the companybase table contains undisclosed company names and companies that exist in other countries outside the US. So let's build flags for these two events and check the key count again.
DROP TABLE companybase1;
CREATE TABLE companybase1 AS
SELECT *,
CASE
WHEN nationcode = 'US' THEN 1::int
ELSE 0::int
END AS alwaysusflag,
CASE
WHEN coname = 'Undisclosed Company' THEN 1::int
ELSE 0::int
END AS undisclosedflag
FROM companybase;
 
SELECT COUNT(*)
FROM (SELECT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;
--44771
 
SELECT COUNT(*)
FROM (SELECT DISTINCT coname, statecode, datefirstinv FROM companybase1 WHERE alwaysusflag = 1 AND undisclosedflag = 0)a;
--44770

Navigation menu