American Community Survey (ACS) Data

| Project Information | |
|---|---|
| Has title | American Community Survey (ACS) Data |
| Has owner | Jeemin Sim |
| Has start date | |
| Has deadline date | |
| Has project status | |
| Has sponsor | McNair Center |
| Has project output | Data |
See also:
- PostGIS Installation
- Urban Start-up Agglomeration
- Tiger Geocoder
Notes
The ACS releases datasets in September through December of the following year. It has the following datasets:
- ACS5: 5-year estimates for a broad list of variables, from 2009-2020, covering 29,573 places.
- ACS1: 1-year estimates from 2005-2020 for select variables for 630 places with populations of 65,000 people or more
- ACS1s: 1-year supplements covering 2,323 places with 20,000 people or more
- Some historical datasets (like ACS3), which are not currently supported
ACS provides information on a broad range of geographies, including states, places, and ZCTAs (ZIP Code Tabulation Areas).
Other data sources
- PUMS data for 2017: https://www2.census.gov/programs-surveys/acs/data/pums/2017/1-Year/
- Population and other data are available from: http://census.ire.org/data/bulkdata.html
Note: PUMS data is only available for the Public Use Microdata Areas, each of which has a population of 100,000 or more.
Using api.census.gov
The new approach is to use api.census.gov to request information. This applies to ACS and a broad range of other Census datasets.
Each year and dataset has its own API URL, which takes get and for name-value pairs that specify variables and geographies, respectively. The base URL for ACS is: https://api.census.gov/data/yyyy/acs/acsx/, where yyyy is the year and x is 1, 1s, 3, or 5. An example request is:
- Requesting variable B06011_001E for ZCTAs from ACS5 for 2019: https://api.census.gov/data/2019/acs/acs5?get=NAME,B06011_001E&for=zip%20code%20tabulation%20area:
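A minimal Python sketch of that request, using the requests library:

 import requests

 # Median income (B06011_001E) for all ZCTAs from the 2019 ACS5, as in the example above.
 url = ("https://api.census.gov/data/2019/acs/acs5"
        "?get=NAME,B06011_001E"
        "&for=zip%20code%20tabulation%20area:*")

 rows = requests.get(url).json()
 print(rows[0])   # zeroth row is the column headers
 print(rows[1])   # first data row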
Pages listing the available geographies, variables, and some examples are on the same base URL. For example:
- The 2019 ACS5 available geographies: https://api.census.gov/data/2019/acs/acs5/geography.html
- The 2019 ACS5 available variables: https://api.census.gov/data/2019/acs/acs5/variables.html
- Some example URLs for the API: https://api.census.gov/data/2019/acs/acs5/examples.html
get_ACS.py
A script called get_ACS.py automates retrieving and compiling results from api.census.gov. The year, geographic unit, and dataset are set in a (user-editable) parameters dictionary, and the script is called with variables as command-line arguments. Note that groups of variables are requested in the same way as individual variables.
The script is available in:
E:\projects\census\
Example commands are:
 python3 get_ACS.py B19013_001E B06009 B02001
 python3 get_ACS.py B06011_001E
The script produces tab-delimited text files with timestamped filenames (to change the output to CSV, edit line 63).
Note: In the script, I rely on the response.json() method of the requests library to handle the "JSON" produced by the API; I never use json.loads or similar. The Census doesn't adhere to any reasonable standards for encoding URLs or data!
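get_ACS.py itself is not reproduced on this page. The following is a minimal sketch of the approach it describes (a parameters dictionary plus command-line variables, written out as tab-delimited text with a timestamped filename); the parameter names, defaults, and filename pattern are illustrative assumptions rather than the script's actual code:

 # Minimal sketch of the approach described above -- not the actual get_ACS.py.
 import sys
 import time
 import requests

 params = {
     'year': '2019',
     'dataset': 'acs5',
     'for': 'zip%20code%20tabulation%20area:*',
 }

 def pull(variables):
     # Build the URL by hand; see "Improper Encoding" below.
     url = ("https://api.census.gov/data/%(year)s/acs/%(dataset)s" % params
            + "?get=NAME," + ",".join(variables)
            + "&for=" + params['for'])
     return requests.get(url).json()   # rely on response.json(), as noted

 if __name__ == '__main__':
     variables = sys.argv[1:]   # e.g. B19013_001E B06009 B02001
     # (Wrapping bare group codes in group(...) is omitted in this sketch.)
     rows = pull(variables)
     outfile = "ACS_pull_" + time.strftime("%Y%m%d_%H%M%S") + ".txt"
     with open(outfile, 'w') as f:
         for row in rows:
             f.write("\t".join("" if v is None else str(v) for v in row) + "\n")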
Examples and Documentation
Example API calls for median income, educational attainment, and race from the 5yr ACS, using the 2019 dataset:
- Median Income: https://api.census.gov/data/2019/acs/acs5?get=NAME,B06011_001E&for=zip%20code%20tabulation%20area:*
- Educational Attainment: https://api.census.gov/data/2019/acs/acs5?get=NAME,group(B06009)&for=zip%20code%20tabulation%20area:*
- Race: https://api.census.gov/data/2019/acs/acs5?get=NAME,group(B02001)&for=zip%20code%20tabulation%20area:*
Other links:
- The documentation on the ACS5: https://www.census.gov/data/developers/data-sets/acs-5year.html
- A slide deck on how to use the API: https://www.census.gov/content/dam/Census/programs-surveys/acs/guidance/training-presentations/06212017_ACS_Census_API.pdf (Note that this guide uses an incorrect URL base that only works for specific years).
Other datasets available using api.census.gov are listed here: https://www.census.gov/data/developers/data-sets.html.
Working with the API
Getting a Key
The API doesn't seem to require a key, but you can request one from https://api.census.gov/.
Quoting, Nulls, and Error Codes
Every field is string-quoted (even if it's a number), including error codes; the exception is nulls. The data contains null values, as well as values that indicate error codes (-999999999, -888888888, -666666666, -555555555, -222222222, and others). See https://www.census.gov/data/developers/data-sets/acs-1year/notes-on-acs-estimate-and-annotation-values.html. Generally, it appears that all negative values can be stripped.
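One way to treat both the nulls and the negative sentinel codes as missing during post-processing is sketched below (a minimal helper, not part of get_ACS.py):

 def clean_value(v):
     """Return None for nulls and negative sentinel codes, a float for numbers,
     and the original string for non-numeric fields (e.g. NAME)."""
     if v is None:
         return None
     try:
         x = float(v)   # every value comes back string-quoted
     except ValueError:
         return v
     return None if x < 0 else x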
Improper Encoding
The API uses improper URL encoding (or more accurately, doesn't use URL encoding), so most methods for auto-populating URL parameters in scripting languages will not function correctly. Build URLs manually.
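For example, Python's standard urlencode produces a parameter string quite different from the literal form used in the API examples above:

 from urllib.parse import urlencode

 # What a standard encoder produces...
 print(urlencode({'get': 'NAME,B06011_001E', 'for': 'zip code tabulation area:*'}))
 # get=NAME%2CB06011_001E&for=zip+code+tabulation+area%3A%2A

 # ...versus the literal parameter string in the API's own examples:
 # get=NAME,B06011_001E&for=zip%20code%20tabulation%20area:*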
The API returns improper JSON. Essentially, it encodes a flat data structure into JSON-like syntax one line at a time and completely ignores the structure of the data (and so much of the point of encoding it in JSON!). However, the first (zeroth) column is the requested geography name, so multiple pulls can be joined using that, and the first (zeroth) row is a valid set of column headers. Also, the nulls are valid JSON encodings. This, together with the string quoting, makes using a JSON handler likely worthwhile. An alternative is to strip the JSON brackets and treat the data as a badly encoded CSV, which is what it is.
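A minimal sketch of exploiting that structure, with toy rows standing in for two separate pulls:

 def to_records(rows):
     """Turn the header row plus data rows into dicts keyed by the zeroth column."""
     header = rows[0]
     return {row[0]: dict(zip(header, row)) for row in rows[1:]}

 # Toy rows with the shape described above (zeroth row = headers, zeroth column = geography).
 income = to_records([["NAME", "B06011_001E"], ["ZCTA5 77005", "100"]])
 education = to_records([["NAME", "B06009_001E"], ["ZCTA5 77005", "12"]])
 merged = {name: {**income[name], **education.get(name, {})} for name in income}
 print(merged["ZCTA5 77005"])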
Field and Group Names
To request a group, ask for group(B06009) where you would normally request a field like B06011_001E. It looks like programmers can rely on groups and fields for estimates being identifiable from the "_" and the string length (\w{1}\d{5}\w{0,1}_\d{3}\w{1}), where the first letter is B or C and the last letter is E. Attributes for estimates can have one last letter (like M for Margin) or two last letters (like AE, Annotation of Estimate, and AM, Annotation of Margin). Note that requests for individual estimates don't return the attributes, but requests for groups do.
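The quoted pattern can be used directly to tell estimate fields from group codes, for example:

 import re

 # The estimate-field pattern quoted above (extend the final \w{1} to \w{1,2}
 # if the two-letter attribute suffixes also need to match).
 FIELD_RE = re.compile(r'^\w{1}\d{5}\w{0,1}_\d{3}\w{1}$')

 print(bool(FIELD_RE.match('B06011_001E')))   # True  -- an estimate field
 print(bool(FIELD_RE.match('B06009')))        # False -- a group code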
Old Approach using AFF
The approach below, using American FactFinder (AFF), was the old standard. Unfortunately, AFF has now been phased out. See the Fact Finder Download Center.
Old Files
All of the corrected files and the Load.sql script are in:
E:\projects\agglomeration\ACS\CleanFiles
The Load.sql script was run against Vcdb4
Python script to pull certain columns from the Excel files:
- E:\McNair\Projects\Agglomeration\ACS_Downloaded_Data\pullCertainColumns.py
SQL Commands to create tables and load data:
- E:\McNair\Projects\Agglomeration\ACS_Downloaded_Data\DataLoading_SQL_Commands
Excel file downloads from the ACS website:
- E:\McNair\Projects\Agglomeration\ACS_Downloaded_Data\csvFiles
Steps to Obtain Data
1) Go to https://factfinder.census.gov/faces/nav/jsf/pages/download_center.xhtml
2) Select 'I know the dataset or table(s) that I want to download.'
3) Press Next
4) For 'Select a program:' choose 'American Community Survey'
5) For 'Select a dataset and click Add to Your Selections:' choose '<YEAR OF INTEREST> ACS 1-year estimates'
6) Press 'Add To Your Selections'
7) Press Next
8) For 'Select a geographic type:' choose 'Place - 160'
9) For 'Select a state:' don't choose a state, as we wish to download all.
10) For 'Select one or more geographic areas...' choose 'All Places within United States and Puerto Rico'
11) Press Next
For the new pull, I used 2017 5yr Estimates.
Tables Pulled
S1501 EDUCATIONAL ATTAINMENT
S2301 EMPLOYMENT STATUS
B01003 TOTAL POPULATION
B02001 RACE
B07201 GEOGRAPHICAL MOBILITY
B08303 TRAVEL TIME TO WORK
B19013 MEDIAN HOUSEHOLD INCOME
B19053 SELF-EMPLOYMENT INCOME IN THE PAST 12 MONTHS FOR HOUSEHOLDS
B19083 GINI INDEX OF INCOME INEQUALITY
B25003 TENURE
B25105 MEDIAN MONTHLY HOUSING COSTS
B28011 INTERNET SUBSCRIPTIONS IN HOUSEHOLD
G001 GEOGRAPHIC IDENTIFIERS
At the end, I chose to have full descriptions but not geographic components.
Notes on Data
- Educational Attainment figures are based on the population aged 25 years or older (except for 2010, 2011, and 2012 - that's why there are fewer entries in those tables)
Loading to a Dbase
The zip file from AFF was downloaded and extracted to E:\projects\agglomeration\ACS\AFFFiles
There are 13 files named ACS_17_5YR_table_with_ann.csv, where table is S1501, etc. The files have two header lines and are CSV, with quoted strings of the format "Abanda CDP, Alabama" for the display label. There are 5 columns in the first file (a reading sketch follows the column list):
- GEO.id e.g., 1600000US0100100
- GEO.id2 e.g., 0100100
- GEO.display-label e.g., "Abanda CDP, Alabama"
- HD01_VD01 (Estimate; Total) e.g., 174
- HD02_VD01 (Margin of Error; Total) e.g., 158
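A minimal sketch of reading one of these files in Python, keeping the first header line and skipping the second (the filename below follows the naming pattern above):

 import csv

 # Line 1 of an AFF file is the machine-readable header; line 2 is the
 # human-readable description line, which is skipped here.
 with open('ACS_17_5YR_S1501_with_ann.csv', newline='') as f:
     reader = csv.reader(f)
     header = next(reader)   # GEO.id, GEO.id2, GEO.display-label, ...
     next(reader)            # skip the second (description) header line
     rows = [dict(zip(header, row)) for row in reader]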
GEO.id2 appears to match our Geoids. Note the leading zero, so these fields should be converted to varchar(7) with front (zero) padding.
Checking GEO.id2 of 0103076 against Tigerplaces:

 SELECT * from Tigerplaces WHERE geoid='0103076';
 statecode statename gid statefp placefp placens geoid name namelsad lsad classfp pcicbsa pcinecta mtfcc funcstat aland awater intptlat intptlon geom
 AL Alabama 380 01 03076 02403132 0103076 Auburn Auburn city 25 C1 Y N G4110 A 152375113 2646161 +32.6077220 -085.4895446

The corresponding row in the AFF file is:

 1600000US0103076,0103076,"Auburn city, Alabama",61462,65
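A minimal sketch of the padding this implies, using illustrative table and column names rather than those actually in Load.sql:

 -- Illustrative names only: pad a bare numeric geoid out to 7 characters
 -- so it matches the zero-padded GEO.id2 / Tigerplaces geoid format.
 ALTER TABLE acs_s1501 ADD COLUMN geoid varchar(7);
 UPDATE acs_s1501 SET geoid = lpad(geoid2::text, 7, '0');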
All files are renamed to table.txt (e.g., S1501.txt) and have their second header line removed. The first three columns are common to all files, but files can then have multiple estimate and margin columns; redundant ones are deleted. The state is also split out. The following are kept:
- B01003 - Population - Total, margin
- B02001 - Race - Total, WhiteAlone, BlackAlone, IndianAlone, AsianAlone, IslanderAlone, OtherAlone, TwoPlus
- B07201 - Moving - Total, Same1yr, DiffinUS1yr, DiffinMSA1yr, Abroad1yr
- B08303 - Driving - Total, Lt5, Btw5And9, Btw10And14, Btw..., Btw60And89, Gt90
- B19013 - HHIncome - MedHHInc2017
- B19053 - SelfEmployment - Total, NumWSelfEmpIncome, NumWOSelfEmpInc
- B19083 - GiniIndex - Gini, margin
- B25003 - Accomodation - Total, Owner, Renter
- B25105 - Housing - monthlyhousingcost
- B28011 - Internet - Total, InternetSub, DialUp, Broadband, Satellite, Other, NoSub, NoInternet
- S1501 - Education - Total25plus, Male25Plus, Female25Plus, Assoc25Plus, Bach25Plus, Grad25Plus, HSOrHigher, BachOrHigher, PCPovertyDueToEdu, MedEarnings
- S2301 - Labor - Total, LFPR, EmpPopRatio, UnEmpRate