Changes

Jump to navigation Jump to search
20,239 bytes added ,  13:47, 21 September 2020
no edit summary
{{Project|Has project output=Tool|Has sponsor=McNair ProjectsCenter
|Has title=US Address Verification
|Has owner=Oliver Chang,
|Has start date=June 2017
|Has deadline=June 2017
|Has project status=ActiveComplete
}}
|+Number of Rows
|-
! scope="col" | Table Name !! scope="col" | count(*) !! scope="col" | Descriptionof Tables in <code>patent</code>
|-
| ptoassigneend
| 5,343
| cleaned up complete addresses; too small to be worthwhile
|-
| june_2017_hotfix
| 3,370,613
| quick and dirty fix to get first 5 postcode digits from ptoassigneend_us_cleaned using heuristics (see [[#A_Stopgap_Measure]])
|-
| june_2017_zipcode_join
| 1,365,408
| hack to join zipcodes to patent ids; see <code>E:\McNair\Projects\SimplerPatentData\src\db\ZipcodeMapAssignmentToPatent.sql</code> for methodology
|}
--------+---------+------------------------------+---------------+----------------------+--------
29344 | 970 | WARNER CHILCOTT COMPANY, LLC | P.O. BOX 1005 | UNION STREET, KM 1.1 | FAJARDO(1 row)
</nowiki>
 
=== all zipcodes are valid US zipcodes ===
 
Using the data from [http://download.geonames.org/export/zip/ geonames] for the United States, put into the table via <code>psql < Z:\zipcodes-oliver\dump.sql</code> where <code>dump.sql</code> is generated via <code>E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\uspto_assignments\GeonamesZips.java</code>, all of the values for postcodes are valid US postcodes.
 
<nowiki>
patent=# select count(distinct c.postcode_f5_cleaned) from ptoassigneend_us_cleaned as c left join geonames_us_zipcodes
as z on c.postcode_f5_cleaned = z.zipcode;
count
-------
1116
(1 row)
patent=# select count(distinct postcode_f5_cleaned) from ptoassigneend_us_cleaned;
count
-------
1116
(1 row)
</nowiki>
== The Bad ==
=== <code>state</code> column is ill-constrained ===  <nowiki># returns 57 rows (District of Columbia is counted as distinct state)# select distinct state, count(state) from ptoassigneend_us_cleaned group by state limit 100;...Armed Forces in Europe, the Middle East, Africa, and Canada | 2UNITED STATES ARMY | 2Armed Forces in the Pacific | 3 | 1NATIONAL AERONAUTICS AND SPACE ADMINISTRATION | 25 | 0...</nowiki>  === <code>ptoassigneend_us_cleaned.postcode_*_cleaned</code> is total garbage === This garbage stretches to include * <code>ptoassigneend_us_cleaned.postcode_f5_cleaned</code>* <code>ptoassigneend_us_cleaned.postcode_cleaned</code> ==== only .15% of records have their postcodes extracted correctly ====  <nowiki>patent=# select count(*) from ptoassigneend_us_cleaned; count --------- 3572605 (1 row) patent=# select count(*) from ptoassigneend_us_cleaned where postcode_cleaned is not null; count --------- 3376480 (1 row) patent=# select count(*) from ptoassigneend_us_cleaned where postcode_f5_cleaned is not null; count ------- 5344 (1 row) patent=# select postcode_cleaned, postcode_f5_cleaned from ptoassigneend_us_cleaned where (postcode_cleaned is not all zipcodes null and postcode_f5_cleaned is null) limit 12; postcode_cleaned | postcode_f5_cleaned -------------------------------------------+--------------------- £\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 | £\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 | £\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 | £\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 | £\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 | £\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 | £\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 | £\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 | £\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 | £\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 | £\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 | £\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 | (12 rows)  patent=# select count(*) from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082'; count--------- 3371136(1 row) </nowiki> ==== the underlying data is fine though... ==== <nowiki>patent=# select postcode, postcode_addr1, postcode_addr2, postcode_city from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082' limit 5; postcode | postcode_addr1 | postcode_addr2 | postcode_city----------+----------------+----------------+--------------- 75024 | | | 55379 | | | 94538 | | | 23219 | | | 73114 | | |(5 rows)</nowiki> Moreover, selecting from the tables that <code>ptoassigneend_us_cleaned</code> is derived from did not yield this string. Therefore, there is likely an error in the SQL script, perhaps with some wonky copy-pasting from the internet. ==== ...this case is hit when postcode is present but none of the other fields are created equal present ====  <nowiki>patent=# select count(*) from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u 0082'; count --------- 3371136 (1 row) patent=# select count(*) from ptoassigneend_us_cleaned where postcode_cleaned =E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u 0082' and postcode is not null; count --------- 3371136 (1 row)  patent=# select count(*) from ptoassigneend_us_cleaned where postcode_cleaned =E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u 0082' and postcode is not null and postcode_addr1 is null and postcode_addr2 is null and postcode_city is null; count--------- 3371136(1 row)</nowiki>
TODO==== ...easy gains can be gotten as a quick fix ====  <nowiki>patent=# select count(*) from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082' and postcode is not null; count--------- 3371136(1 row) patent=# select count(*) from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082' and char_length(postcode) = 5; count--------- 2971542(1 row)</nowiki> ==== ...5 digit postcode works great ====  <nowiki>patent=# select postcode from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082' and char_length(postcode) = 5 limit 12; postcode---------- 75024 55379 94538 23219 73114 95134 33487 60603 84604 20191 97213 10504(12 rows)</nowiki> ==== ...9 digit postcode works ok (two forms) ====  <nowiki>patent=# select postcode from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082' and char_length(postcode) = 9 limit 12; postcode----------- USA 20585 USA 33458 4826-2766 181951501 772522463 USA 27601 772522463 USA 20310 USA 47202 174012991 913929221 USA 12345(12 rows)</nowiki> ==== ...10 digit postcode works great ====  <nowiki>patent=# select postcode from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082' and char_length(postcode) = 10 limit 12; postcode------------ 92121-1714 95134-1706 80527-2400 95066-4544 01862-2000 07962-2245 90245-5012 55133-3427 30332-0415 95134-1706 20892-7660 94080-4990(12 rows)</nowiki>
== The Ugly ==
* Presence of Ln, St, Ave, Blvd, Cir, etc...
== Conclusion = approximate data is available for stateless assignees ===
==== only on the order of 500 stateless assignees have postcodes ==== <nowiki>patent=# select count(*) from ptoassigneend where (country = 'NOT PROVIDED' or country = 'STATELESS'); count------- 17575(1 row) patent=# select count(* Consistent Rows) from ptoassigneend where (country = 'NOT PROVIDED' or country = 'STATELESS') and (postcode isnot null); count------- 329(1 row) patent=# select count(*) from ptoassigneend where (country = 'NOT PROVIDED' or country = 'STATELESS') and (postcode isnot null or addrline1 is not null); count------- 1572(1 row)</nowiki>  ==== approximate location data is available (but unused) ====  <nowiki>patent=# select name, addrline1, addrline2, postcode from ptoassigneend where (country = 'NOT PROVIDED' or country = 'STATELESS') and (addrline1 is not null or addrline2 is not null) limit 30; name | addrline1 | addrline2 | postcode--------------------------------------------------------------------------------------+----------------------------------+----------------------------------+---------- REGAL KING COMMERCIAL OFFSHORE DE MACAU LIMITED | ALAMEDA DR. CARLOS D'S ASSUMPCAO | DYNASTY PLAZA, 19 AND. (H) MACAO | ECODYNE CORPORATION | | A CORP. OF DE | SHELL OIL COMPANY | | A CORP. OF DE | RCA CORPORATION | | A CORP. OF DE | BACHEM FEINCHEMIKALIEN A.G. | | A SWISS CORP. | AMERICAN STERILIZER COMPANY | | A PA. CORP. | CREATIVE TECHNOLOGY LTD. | CREATIVE RESOURCE | 31 INTERNATIONAL BUSINESS PARK | 60922 FUJITSU LIMITED | NAKAHARA-KU, KAWASAKI-SHI | 1-1, KAMIKODANAKA 4-CHOME | 211-8 SIEMENS AKTIENGESELLSCHAFT | | A GERMAN CORP | ISABERG AB, HESTRA | | A CORP. OF SWEDEN | MOELLER MANUFACTURING CO., INC. | | A DE CORP. | MOBIL OIL CORPORATION | | A CORP OF NY | GENERAL ELECTRIC COMPANY | | A NY CORP. | GENERAL ELECTRIC COMPANY | A CORP. OF NY | | BTR INDUSTRIES LIMITED | CARLISLE PLACE | BTR HOUSE | DIATRON CORPORATION | | A CORP. OF CA | CLOSURES AND PACKAGING SERVICES LIMITED | P. O. BOX 119 | | GY1 3HB DR. JOHANNES HEIDENHAIN GMBH A CORPORATION OF GERMANY | | TRAUNREUT CITY | AMERICAN SAFETY RAZOR COMPANY | | A DE CORP. | ETHICON, INC. | | A NJ CORP. | GENERAL ELECTRIC COMPANY | | A NEW YORK CORP. | MOBIL OIL CORPORATION | | A CORP. OF NY | RICHARD LANGLECHNER GMBH | MAUERBERGERSTR. 15 | | KONISHIROKU PHOTO INDUSTRY CO., LTD., | | A CORP OF JAPAN | SHELL OIL COMPANY | | A CORP. OF DE | SIEMENS AKTIENGESELLSCHAFT A GERMAN CORPORATION | | MUNICH | C.A. BRIGGS COMPANY | | A CORP OF PA | GTE PRODUCTS CORPORATION | | A DE CORP. | TRAMEX ENGINEERING LIMITED | 8 CLARE STREET, | A IRISH COMPANY | MOBIL OIL CORPORATION | | A CORPORATION OF NEW YORK |</nowiki> == A Stopgap Measure == As a quick and dirty fix, walk through the cases and do the best we can. The script for that is located at <code>E:\McNair\Projects\SimplerPatentData\src\db\hacks\June2017PostcodeHotfix.sql</code> and it creates the following table:  <nowiki>DROP VIEW IF EXISTS june_2017_postcodecode_hotfix;CREATE VIEW june_2017_postcode_hotfix AS SELECT reelno, frameno, name, city_cleaned as city, GodHelpUsAll(postcode_cleaned, postcode) as postcode FROM ptoassigneend_us_cleaned;</nowiki>  <nowiki>patent=# select count(* Inconsistent Rows:) from june_2017_postcode_hotfix ; count--------- 3572605(1 row) patent=# select count(* Uncertain Rows:) from june_2017_postcode_hotfix where postcode is not null; count--------- 3370613(1 row)</nowiki>

Navigation menu