US Address Verification
| US Address Verification | |
|---|---|
| Project Information | |
| Project Title | US Address Verification |
| Owner | Oliver Chang |
| Start Date | June 2017 |
| Deadline | June 2017 |
| Primary Billing | |
| Notes | |
| Has project status | Active |
| Copyright © 2016 edegan.com. All Rights Reserved. | |
This project is composed of SQL queries against patents.ptoassigneend_allus to gain confidence in the data. The scripts and rationale behind the design decisions in this table are described in its wiki page here and all original work was done by Sonia Zhang.
Total Row Counts
| Table Name | count(*) | Description |
|---|---|---|
| ptoassigneend | 7,234,001 | Base data with duplicates removed |
| ptoassigneend_allus | 3,572,605 | Sonia's subset of only US addresses; 49.4% of ptoassigneend |
| ptoassigneend_us_cleaned | 3,572,605 | "cleaned postcode, city, and state" of ptoassigneend_allus; this is the one we want to work with |
| ptoassigneend_us_extracted | 5,343 | cleaned up complete addresses; too small to be worthwhile |
Table Origin
Note that the suffix nd stands for "no duplicates". Assume this method was done correctly.
It appears that the country field is consistent across all of the entries, and that the country names are taken from a small lookup table.
Proof:
patent=# select country, count(country) as c from ptoassigneend group by country order by c desc limit 50;
country | c
----------------------------------------+---------
UNITED STATES | 3570216
JAPAN | 1334774
GERMANY | 429429
KOREA, REPUBLIC OF | 284830
TAIWAN | 231493
FRANCE | 172995
CANADA | 132387
SWITZERLAND | 99281
UNITED KINGDOM | 90634
CHINA | 88954
NETHERLANDS | 84573
SWEDEN | 70538
ITALY | 61416
FINLAND | 44116
AUSTRALIA | 43314
ISRAEL | 41857
ENGLAND | 31341
DENMARK | 24742
BELGIUM | 24147
AUSTRIA | 19487
SINGAPORE | 17518
HONG KONG | 17091
SPAIN | 13288
INDIA | 12589
IRELAND | 11754
NORWAY | 11290
STATELESS | 10917
GREAT BRITAIN | 10692
BERMUDA | 8780
CAYMAN ISLANDS | 8206
NOT PROVIDED | 6658
VIRGIN ISLANDS, BRITISH | 6510
NEW ZEALAND | 6364
ONTARIO | 6047
LUXEMBOURG | 5856
KOREA, DEMOCRATIC PEOPLE'S REPUBLIC OF | 5625
LIECHTENSTEIN | 4711
BRAZIL | 4368
SOUTH AFRICA | 3699
RUSSIAN FEDERATION | 3375
SAUDI ARABIA | 3335
BARBADOS | 3115
HUNGARY | 2642
GERMAN DEMOCRATIC REPUBLIC | 2401
MEXICO | 2083
QUEBEC | 1777
NETHERLANDS ANTILLES | 1690
POLAND | 1607
MALAYSIA | 1606
BRITISH COLUMBIA | 1279
(50 rows)
There are single-digit edge cases that in the strictest interpretation of country are correct.
patent=# select count(*) from ptoassigneend; count --------- 7234001 (1 row) patent=# select count(*) from ptoassigneend where country = 'UNITED STATES'; count --------- 3570216 (1 row) patent=# select count(*) from ptoassigneend where country like '%UNITED STATES%'; count --------- 3570217 (1 row) patent=# select * from ptoassigneend where (country like '%UNITED STATES%' and country != 'UNITED STATES'); reelno | frameno | name | addrline1 | addrline2 | --------+---------+------------------------------+---------------+----------------------+-------- 29344 | 970 | WARNER CHILCOTT COMPANY, LLC | P.O. BOX 1005 | UNION STREET, KM 1.1 | FAJARDO(1 row)