Changes

Jump to navigation Jump to search
25,389 bytes added ,  13:47, 21 September 2020
no edit summary
{{Project|Has project output=Tool|Has sponsor=McNair ProjectsCenter
|Has title=US Address Verification
|Has owner=Oliver Chang,
|Has start date=June 2017
|Has deadline=June 2017
|Has project status=ActiveComplete
}}
 
 
This project is composed of SQL queries against <code>patents.ptoassigneend_allus</code> to gain confidence in the data. The scripts and rationale behind the design decisions in this table are described in its wiki page [[Patent_Assignment_Data_Restructure#Restructure_Address_Information_.28First_Stage.29|here]] and all original work was done by Sonia Zhang.
 
== Total Row Counts ==
 
{| class="wikitable"
|+Number of Rows
|-
! scope="col" | Table Name !! scope="col" | count(*) !! scope="col" | Description of Tables in <code>patent</code>
|-
| ptoassigneend
| 7,234,001
| Base data with duplicates removed
|-
| ptoassigneend_allus
| 3,572,605
| Sonia's subset of only US addresses; 49.4% of ptoassigneend
|-
| ptoassigneend_us_cleaned
| 3,572,605
| "cleaned postcode, city, and state" of ptoassigneend_allus; '''this is the one we want to work with'''
|-
| ptoassigneend_us_extracted
| 5,343
| cleaned up complete addresses; too small to be worthwhile
|-
| june_2017_hotfix
| 3,370,613
| quick and dirty fix to get first 5 postcode digits from ptoassigneend_us_cleaned using heuristics (see [[#A_Stopgap_Measure]])
|-
| june_2017_zipcode_join
| 1,365,408
| hack to join zipcodes to patent ids; see <code>E:\McNair\Projects\SimplerPatentData\src\db\ZipcodeMapAssignmentToPatent.sql</code> for methodology
|}
 
Note that the suffix <code>nd</code> stands for "no duplicates". Assume this method was done correctly.
 
== The Good ==
 
=== <code>country</code> column is reliable ===
 
It appears that the country field is consistent across all of the entries, and that the country names are taken from a small lookup table.
 
'''Proof:'''
 
<nowiki>
patent=# select country, count(country) as c from ptoassigneend group by country order by c desc limit 50;
country | c
----------------------------------------+---------
UNITED STATES | 3570216
JAPAN | 1334774
GERMANY | 429429
KOREA, REPUBLIC OF | 284830
TAIWAN | 231493
FRANCE | 172995
CANADA | 132387
SWITZERLAND | 99281
UNITED KINGDOM | 90634
CHINA | 88954
NETHERLANDS | 84573
SWEDEN | 70538
ITALY | 61416
FINLAND | 44116
AUSTRALIA | 43314
ISRAEL | 41857
ENGLAND | 31341
DENMARK | 24742
BELGIUM | 24147
AUSTRIA | 19487
SINGAPORE | 17518
HONG KONG | 17091
SPAIN | 13288
INDIA | 12589
IRELAND | 11754
NORWAY | 11290
STATELESS | 10917
GREAT BRITAIN | 10692
BERMUDA | 8780
CAYMAN ISLANDS | 8206
NOT PROVIDED | 6658
VIRGIN ISLANDS, BRITISH | 6510
NEW ZEALAND | 6364
ONTARIO | 6047
LUXEMBOURG | 5856
KOREA, DEMOCRATIC PEOPLE'S REPUBLIC OF | 5625
LIECHTENSTEIN | 4711
BRAZIL | 4368
SOUTH AFRICA | 3699
RUSSIAN FEDERATION | 3375
SAUDI ARABIA | 3335
BARBADOS | 3115
HUNGARY | 2642
GERMAN DEMOCRATIC REPUBLIC | 2401
MEXICO | 2083
QUEBEC | 1777
NETHERLANDS ANTILLES | 1690
POLAND | 1607
MALAYSIA | 1606
BRITISH COLUMBIA | 1279
(50 rows)
</nowiki>
 
=== territories, etc are categorized differently ===
 
There are single-digit edge cases that in the strictest interpretation of country are correct.
 
<nowiki>
patent=# select count(*) from ptoassigneend;
count
---------
7234001
(1 row)
 
patent=# select count(*) from ptoassigneend where country = 'UNITED STATES';
count
---------
3570216
(1 row)
 
patent=# select count(*) from ptoassigneend where country like '%UNITED STATES%';
count
---------
3570217
(1 row)
 
patent=# select * from ptoassigneend where (country like '%UNITED STATES%' and country != 'UNITED STATES');
reelno | frameno | name | addrline1 | addrline2 |
--------+---------+------------------------------+---------------+----------------------+--------
29344 | 970 | WARNER CHILCOTT COMPANY, LLC | P.O. BOX 1005 | UNION STREET, KM 1.1 | FAJARDO(1 row)
</nowiki>
 
=== all zipcodes are valid US zipcodes ===
 
Using the data from [http://download.geonames.org/export/zip/ geonames] for the United States, put into the table via <code>psql < Z:\zipcodes-oliver\dump.sql</code> where <code>dump.sql</code> is generated via <code>E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\uspto_assignments\GeonamesZips.java</code>, all of the values for postcodes are valid US postcodes.
 
<nowiki>
patent=# select count(distinct c.postcode_f5_cleaned) from ptoassigneend_us_cleaned as c left join geonames_us_zipcodes
as z on c.postcode_f5_cleaned = z.zipcode;
count
-------
1116
(1 row)
patent=# select count(distinct postcode_f5_cleaned) from ptoassigneend_us_cleaned;
count
-------
1116
(1 row)
</nowiki>
 
== The Bad ==
 
=== <code>state</code> column is ill-constrained ===
 
<nowiki>
# returns 57 rows (District of Columbia is counted as distinct state)
# select distinct state, count(state) from ptoassigneend_us_cleaned group by state limit 100;
...
Armed Forces in Europe, the Middle East, Africa, and Canada | 2
UNITED STATES ARMY | 2
Armed Forces in the Pacific | 3
| 1
NATIONAL AERONAUTICS AND SPACE ADMINISTRATION | 25
| 0
...
</nowiki>
 
 
=== <code>ptoassigneend_us_cleaned.postcode_*_cleaned</code> is total garbage ===
 
This garbage stretches to include
 
* <code>ptoassigneend_us_cleaned.postcode_f5_cleaned</code>
* <code>ptoassigneend_us_cleaned.postcode_cleaned</code>
 
==== only .15% of records have their postcodes extracted correctly ====
 
<nowiki>
patent=# select count(*) from ptoassigneend_us_cleaned;
count
---------
3572605
(1 row)
patent=# select count(*) from ptoassigneend_us_cleaned where postcode_cleaned is not null;
count
---------
3376480
(1 row)
patent=# select count(*) from ptoassigneend_us_cleaned where postcode_f5_cleaned is not null;
count
-------
5344
(1 row)
patent=# select postcode_cleaned, postcode_f5_cleaned from ptoassigneend_us_cleaned where (postcode_cleaned is not null and postcode_f5_cleaned is null) limit 12;
postcode_cleaned | postcode_f5_cleaned
-------------------------------------------+---------------------
£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 |
£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 |
£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 |
£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 |
£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 |
£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 |
£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 |
£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 |
£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 |
£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 |
£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 |
£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 |
(12 rows)
 
patent=# select count(*) from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082';
count
---------
3371136
(1 row)
</nowiki>
 
==== the underlying data is fine though... ====
<nowiki>
patent=# select postcode, postcode_addr1, postcode_addr2, postcode_city from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082' limit 5;
postcode | postcode_addr1 | postcode_addr2 | postcode_city
----------+----------------+----------------+---------------
75024 | | |
55379 | | |
94538 | | |
23219 | | |
73114 | | |
(5 rows)</nowiki>
 
Moreover, selecting from the tables that <code>ptoassigneend_us_cleaned</code> is derived from did not yield this string. Therefore, there is likely an error in the SQL script, perhaps with some wonky copy-pasting from the internet.
 
==== ...this case is hit when postcode is present but none of the other fields are present ====
 
<nowiki>
patent=# select count(*) from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u
0082';
count
---------
3371136
(1 row)
patent=# select count(*) from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u
0082' and postcode is not null;
count
---------
3371136
(1 row)
 
patent=# select count(*) from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u 0082' and postcode is not null and postcode_addr1 is null and postcode_addr2 is null and postcode_city is null;
count
---------
3371136
(1 row)</nowiki>
 
==== ...easy gains can be gotten as a quick fix ====
 
<nowiki>
patent=# select count(*) from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082' and postcode is not null;
count
---------
3371136
(1 row)
 
patent=# select count(*) from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082' and char_length(postcode) = 5;
count
---------
2971542
(1 row)</nowiki>
 
==== ...5 digit postcode works great ====
 
<nowiki>
patent=# select postcode from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082' and char_length(postcode) = 5 limit 12;
postcode
----------
75024
55379
94538
23219
73114
95134
33487
60603
84604
20191
97213
10504
(12 rows)</nowiki>
 
==== ...9 digit postcode works ok (two forms) ====
 
<nowiki>
patent=# select postcode from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082' and char_length(postcode) = 9 limit 12;
postcode
-----------
USA 20585
USA 33458
4826-2766
181951501
772522463
USA 27601
772522463
USA 20310
USA 47202
174012991
913929221
USA 12345
(12 rows)</nowiki>
 
==== ...10 digit postcode works great ====
 
<nowiki>
patent=# select postcode from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082' and char_length(postcode) = 10 limit 12;
postcode
------------
92121-1714
95134-1706
80527-2400
95066-4544
01862-2000
07962-2245
90245-5012
55133-3427
30332-0415
95134-1706
20892-7660
94080-4990
(12 rows)</nowiki>
 
== The Ugly ==
 
=== fields directly contradict each other ===
 
Possible Heuristics:
 
* Presence of Ln, St, Ave, Blvd, Cir, etc...
 
=== approximate data is available for stateless assignees ===
 
==== only on the order of 500 stateless assignees have postcodes ====
<nowiki>
patent=# select count(*) from ptoassigneend where (country = 'NOT PROVIDED' or country = 'STATELESS');
count
-------
17575
(1 row)
 
patent=# select count(*) from ptoassigneend where (country = 'NOT PROVIDED' or country = 'STATELESS') and (postcode is
not null);
count
-------
329
(1 row)
 
patent=# select count(*) from ptoassigneend where (country = 'NOT PROVIDED' or country = 'STATELESS') and (postcode is
not null or addrline1 is not null);
count
-------
1572
(1 row)
</nowiki>
 
 
==== approximate location data is available (but unused) ====
 
<nowiki>
patent=# select name, addrline1, addrline2, postcode from ptoassigneend where (country = 'NOT PROVIDED' or country = 'S
TATELESS') and (addrline1 is not null or addrline2 is not null) limit 30;
name | addrline1 | addrline2 | postcode
--------------------------------------------------------------------------------------+----------------------------------+----------------------------------+----------
REGAL KING COMMERCIAL OFFSHORE DE MACAU LIMITED | ALAMEDA DR. CARLOS D'S ASSUMPCAO | DYNASTY PLAZA, 19 AND. (H) MACAO |
ECODYNE CORPORATION | | A CORP. OF DE |
SHELL OIL COMPANY | | A CORP. OF DE |
RCA CORPORATION | | A CORP. OF DE |
BACHEM FEINCHEMIKALIEN A.G. | | A SWISS CORP. |
AMERICAN STERILIZER COMPANY | | A PA. CORP. |
CREATIVE TECHNOLOGY LTD. | CREATIVE RESOURCE | 31 INTERNATIONAL BUSINESS PARK | 60922
FUJITSU LIMITED | NAKAHARA-KU, KAWASAKI-SHI | 1-1, KAMIKODANAKA 4-CHOME | 211-8
SIEMENS AKTIENGESELLSCHAFT | | A GERMAN CORP |
ISABERG AB, HESTRA | | A CORP. OF SWEDEN |
MOELLER MANUFACTURING CO., INC. | | A DE CORP. |
MOBIL OIL CORPORATION | | A CORP OF NY |
GENERAL ELECTRIC COMPANY | | A NY CORP. |
GENERAL ELECTRIC COMPANY | A CORP. OF NY | |
BTR INDUSTRIES LIMITED | CARLISLE PLACE | BTR HOUSE |
DIATRON CORPORATION | | A CORP. OF CA |
CLOSURES AND PACKAGING SERVICES LIMITED | P. O. BOX 119 | | GY1 3HB
DR. JOHANNES HEIDENHAIN GMBH A CORPORATION OF GERMANY | | TRAUNREUT CITY |
AMERICAN SAFETY RAZOR COMPANY | | A DE CORP. |
ETHICON, INC. | | A NJ CORP. |
GENERAL ELECTRIC COMPANY | | A NEW YORK CORP. |
MOBIL OIL CORPORATION | | A CORP. OF NY |
RICHARD LANGLECHNER GMBH | MAUERBERGERSTR. 15 | |
KONISHIROKU PHOTO INDUSTRY CO., LTD., | | A CORP OF JAPAN |
SHELL OIL COMPANY | | A CORP. OF DE |
SIEMENS AKTIENGESELLSCHAFT A GERMAN CORPORATION | | MUNICH |
C.A. BRIGGS COMPANY | | A CORP OF PA |
GTE PRODUCTS CORPORATION | | A DE CORP. |
TRAMEX ENGINEERING LIMITED | 8 CLARE STREET, | A IRISH COMPANY |
MOBIL OIL CORPORATION | | A CORPORATION OF NEW YORK |
</nowiki>
 
== A Stopgap Measure ==
 
As a quick and dirty fix, walk through the cases and do the best we can. The script for that is located at <code>E:\McNair\Projects\SimplerPatentData\src\db\hacks\June2017PostcodeHotfix.sql</code> and it creates the following table:
 
<nowiki>
DROP VIEW IF EXISTS june_2017_postcodecode_hotfix;
CREATE VIEW june_2017_postcode_hotfix AS
SELECT
reelno,
frameno,
name,
city_cleaned as city,
GodHelpUsAll(postcode_cleaned, postcode) as postcode
FROM ptoassigneend_us_cleaned;</nowiki>
 
<nowiki>
patent=# select count(*) from june_2017_postcode_hotfix ;
count
---------
3572605
(1 row)
 
patent=# select count(*) from june_2017_postcode_hotfix where postcode is not null;
count
---------
3370613
(1 row)</nowiki>

Navigation menu