I could get at my goals "the long way" but am hoping to stay completely within R. I am looking to append Census demographic data by zip code to records in my database. I know that R has a few Census-based packages, but, unless I am missing something, these data do not seem to exist at the zip code level, nor is it intuitive to merge onto an existing data frame.
In short, is it possible to do this within R, or is my best approach to grab the data elsewhere and read it into R?
Any help will be greatly appreciated!
In short, no. Census to zip translations are generally created from proprietary sources.
It's unlikely that you'll find anything at the zipcode level from a census perspective (privacy). However, that doesn't mean you're left in the cold. You can use the zipcodes that you have and append census data from the MSA, muSA or CSA level. Now all you need is a listing of postal codes within your MSA, muSA or CSA so that you can merge. There's a bunch online that are pretty cheap if you don't already have such a list.
For example, in Canada, we can get income data from CRA at the FSA level (the first three digits of a postal code in the form A1A 1A1). I'm not sure whether the IRS provides similar information, and I'm not too familiar with US Census data, but I imagine it is available at the CSA level at the very least.
If you're bewildered by all these acronyms:
MSA: http://en.wikipedia.org/wiki/Metropolitan_Statistical_Area
CSA: http://en.wikipedia.org/wiki/Combined_statistical_area
muSA: http://en.wikipedia.org/wiki/Micropolitan_Statistical_Area
As others in this thread have mentioned, the Census Bureau American FactFinder is a free source of comprehensive and detailed data. Unfortunately, it’s not particularly easy to use in its raw format.
We’ve pulled, cleaned, consolidated, and reformatted the Census Bureau data. The details of this process and how to use the data files can be found on our team blog.
None of these tables actually have a field called “ZIP code.” Rather, they have a field called “ZCTA5”. A ZCTA5 (or ZCTA) can be thought of as interchangeable with a zip code, given the following caveats:
There are no ZCTAs for PO Box ZIP codes - this means that for roughly 42,000 US ZIP Codes there are roughly 32,000 ZCTAs.
ZCTAs, which stand for Zip Code Tabulation Areas, are based on zip codes but don’t necessarily follow exact zip code boundaries. If you would like to read more about ZCTAs, please refer to this link. The Census Bureau also provides an animation that shows how ZCTAs are formed.
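Once you have a ZCTA-level table, appending it to your own records is an ordinary keyed merge. Below is a minimal base R sketch, assuming your records carry a zip column that may have lost its leading zeros when read as a number. The data frames records and zcta and their columns are made up for illustration; the population figures are only sample values.

```r
# Hypothetical records whose zip column was read as numeric,
# so leading zeros are gone (01001 became 1001).
records <- data.frame(
  id  = 1:4,
  zip = c(1001, 1002, 99923, 12345),
  stringsAsFactors = FALSE
)

# Hypothetical ZCTA-level table keyed by a 5-character ZCTA5 code.
zcta <- data.frame(
  ZCTA5      = c("01001", "01002", "99923"),
  population = c(17438, 29780, 13),
  stringsAsFactors = FALSE
)

# Restore the 5-character, zero-padded key before merging.
records$zip5 <- sprintf("%05d", as.integer(records$zip))

# Left join: keep every record, even when no ZCTA matches
# (for example PO Box ZIP codes, which have no ZCTA).
merged <- merge(records, zcta, by.x = "zip5", by.y = "ZCTA5", all.x = TRUE)
print(merged)
```

Records whose ZIP code has no ZCTA come back with NA in the appended columns, which makes the unmatched cases easy to find.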
I just wrote an R package called totalcensus (https://github.com/GL-Li/totalcensus), with which you can easily extract any data from the decennial census and ACS surveys.
If you still care about this old question: you can get total population (included by default) and population by race from the national data of the 2010 decennial census or the 2015 ACS 5-year survey.
From 2015 ACS 5-year survey. Download national data with download_census("acs5year", 2015, "US") and then:
zip_acs5 <- read_acs5year(
year = 2015,
states = "US",
geo_headers = "ZCTA5",
table_contents = c(
"white = B02001_002",
"black = B02001_003",
"asian = B02001_005"
),
summary_level = "860"
)
# GEOID lon lat ZCTA5 state population white black asian GEOCOMP SUMLEV NAME
# 1: 86000US01001 -72.62827 42.06233 01001 NA 17438 16014 230 639 all 860 ZCTA5 01001
# 2: 86000US01002 -72.45851 42.36398 01002 NA 29780 23333 1399 3853 all 860 ZCTA5 01002
# 3: 86000US01003 -72.52411 42.38994 01003 NA 11241 8967 699 1266 all 860 ZCTA5 01003
# 4: 86000US01005 -72.10660 42.41885 01005 NA 5201 5062 40 81 all 860 ZCTA5 01005
# 5: 86000US01007 -72.40047 42.27901 01007 NA 14838 14086 104 330 all 860 ZCTA5 01007
# ---
# 32985: 86000US99923 -130.04103 56.00232 99923 NA 13 13 0 0 all 860 ZCTA5 99923
# 32986: 86000US99925 -132.94593 55.55020 99925 NA 826 368 7 0 all 860 ZCTA5 99925
# 32987: 86000US99926 -131.47074 55.13807 99926 NA 1711 141 0 2 all 860 ZCTA5 99926
# 32988: 86000US99927 -133.45792 56.23906 99927 NA 123 114 0 0 all 860 ZCTA5 99927
# 32989: 86000US99929 -131.60683 56.41383 99929 NA 2365 1643 5 60 all 860 ZCTA5 99929
From Census 2010. Download national data with download_census("decennial", 2010, "US") and then:
zip_2010 <- read_decennial(
year = 2010,
states = "US",
table_contents = c(
"white = P0030002",
"black = P0030003",
"asian = P0030005"
),
geo_headers = "ZCTA5",
summary_level = "860"
)
# lon lat ZCTA5 state population white black asian GEOCOMP SUMLEV
# 1: -66.74996 18.18056 00601 NA 18570 17285 572 5 all 860
# 2: -67.17613 18.36227 00602 NA 41520 35980 2210 22 all 860
# 3: -67.11989 18.45518 00603 NA 54689 45348 4141 85 all 860
# 4: -66.93291 18.15835 00606 NA 6615 5883 314 3 all 860
# 5: -67.12587 18.29096 00610 NA 29016 23796 2083 37 all 860
# ---
# 33116: -130.04103 56.00232 99923 NA 87 79 0 0 all 860
# 33117: -132.94593 55.55020 99925 NA 819 350 2 4 all 860
# 33118: -131.47074 55.13807 99926 NA 1460 145 6 2 all 860
# 33119: -133.45792 56.23906 99927 NA 94 74 0 0 all 860
# 33120: -131.60683 56.41383 99929 NA 2338 1691 3 33 all 860
Your best bet is probably with the U.S. Census Bureau TIGER/Line shapefiles. They have ZIP code tabulation area shapefiles (ZCTA5) for 2010 at the state level which may be sufficient for your purposes.
Census data itself can be found at American FactFinder. For example, you can get population estimates at the sub-county level (i.e. city/town), but not straight-forward population estimates at the zip-code level. I don't know the details of your data set, but one solution might require the use of relationship tables that are also available as part of the TIGER/Line data, or alternatively spatially joining the place names containing the census data (subcounty shapefiles) with the ZCTA5 codes.
Note from the metadata: "These products are free to use in a product or publication, however acknowledgement must be given to the U.S. Census Bureau as the source."
HTH
A simple for loop will get zip-level population. You need to get an API key (app token) first, and it only covers the US for now.
library(data.table)
library(jsonlite)
# ziplist: character vector of 5-digit zip codes to look up
masterdata <- data.table()
for (z in seq_along(ziplist)) {
  print(z)
  textt <- paste0("http://api.opendatanetwork.com/data/v1/values?variable=demographics.population.count&entity_id=8600000US",ziplist[z],"&forecast=3&describe=false&format=&app_token=YOURKEYHERE")
  # make the request once; skip this zip code if the call fails
  result <- try(fromJSON(textt), silent = TRUE)
  if (inherits(result, "try-error")) next
  data <- as.data.table(result$data)
  # the first row is a header row; keep its second cell as the queried entity
  zipcode <- data[[2]][1]
  data <- data[2:nrow(data)]
  setnames(data, c("Year", "Population", "Forecasted"))
  data[, ZipCodeQuery := zipcode]
  data[, ZipCodeData := ziplist[z]]
  masterdata <- rbind(masterdata, data)
}
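The same skip-on-failure pattern can be written without growing a table inside the loop. The sketch below uses only base R. The function fetch_population is a hypothetical stand-in for the API request (it is not part of any package) and fails for one zip code on purpose, to show how failed lookups are dropped.

```r
# Hypothetical stand-in for the API call; errors for one zip code
# to mimic a request that returns no data.
fetch_population <- function(zip) {
  if (zip == "00000") stop("no data for this zip")
  data.frame(
    ZipCodeData = zip,
    Year        = 2015:2016,
    Population  = c(100, 110),
    stringsAsFactors = FALSE
  )
}

ziplist <- c("01001", "00000", "01002")

# One result per zip code; a failed request becomes NULL instead of
# stopping the whole run.
results <- lapply(ziplist, function(zip) {
  tryCatch(fetch_population(zip), error = function(e) NULL)
})

# Drop the failures and stack the remaining pieces in one step.
masterdata <- do.call(rbind, Filter(Negate(is.null), results))
print(masterdata)
```

Binding once at the end avoids copying the accumulated table on every iteration, which matters when the zip list runs into the tens of thousands.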
I am working through creating pivot tables with the pivottabler package to summarise frequencies of rock art classes by location. The data I am summarising are from published papers; I have them stored in an RDS file created in R, and they look like this:
> head(cyp_art_freq)
Class Location value
1: Figurative Princess Charlotte Bay 347
2: Track Princess Charlotte Bay 35
3: Non-Figurative Princess Charlotte Bay 18
4: Figurative Mitchell-Palmer and Chillagoe 320
5: Track Mitchell-Palmer and Chillagoe 79
6: Non-Figurative Mitchell-Palmer and Chillagoe 1002
>str(cyp_art_freq)
Classes ‘data.table’ and 'data.frame': 12 obs. of 3 variables:
 $ Class   : chr "Figurative" "Track" "Non-Figurative" "Figurative" ...
 $ Location: chr "Princess Charlotte Bay" "Princess Charlotte Bay" "Princess Charlotte Bay" "Mitchell-Palmer and Chillagoe" ...
 $ value   : num 347 35 18 320 79 ...
 - attr(*, ".internal.selfref")=<externalptr>
The problem is that pivottabler does not sum the contents of the 'value' col. Instead, it counts the number of rows/cases. So, as the graphic below shows, the resulting table includes a total of 12 cases when the result should be into the 1000s. I think this relates to the 'value' column which is a count of a larger dataset. I've tried pivot_longer and pivot_wider, changed datatypes and used CSVs instead of RDS for import (and more).
The code block I'm using for this data works with the built-in bhmtrains dataset and with my other datasets, so I suspect I need to either tell pivottabler to tally the contents of the 'value' col, or expand the underlying dataset to one row per case.
How might I ensure that the 'Count' columns actually count the contents of the input 'value' column? I hope that is clear, and thanks for any suggestions on how to address this issue.
table01 <- PivotTable$new()
table01$addData(cyp_art_freq)
table01$addColumnDataGroups("Class", totalCaption = "Total")
table01$defineCalculation(calculationName="Count", summariseExpression="n()", caption="Count", visible=TRUE)
filterOverrides <- PivotFilterOverrides$new(table01, keepOnlyFiltersFor="Count")
table01$defineCalculation(calculationName="TOCTotal", filters=filterOverrides,
summariseExpression="n()", caption="TOC Total", visible=FALSE)
table01$defineCalculation(calculationName="PercentageAllMotifs", type="calculation",
basedOn=c("Count", "TOCTotal"),
calculationExpression="values$Count/values$TOCTotal*100",
format="%.1f %%", caption="Percent")
table01$addRowDataGroups("Location")
table01$theme <- "compact"
table01$renderPivot()
table01$evaluatePivot()
The PT returned from this code
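The difference between counting rows and summing a pre-aggregated column can be seen in base R, independent of pivottabler. The sketch below rebuilds the six rows printed above: table() counts one row per cell, while xtabs() with value on the left-hand side sums it. In pivottabler, the analogous change would presumably be to use a summing expression such as "sum(value)" in place of "n()" in summariseExpression; that is an assumption and is not tested here.

```r
# The six rows shown in head(cyp_art_freq), rebuilt as a plain data frame.
cyp <- data.frame(
  Class    = rep(c("Figurative", "Track", "Non-Figurative"), times = 2),
  Location = rep(c("Princess Charlotte Bay",
                   "Mitchell-Palmer and Chillagoe"), each = 3),
  value    = c(347, 35, 18, 320, 79, 1002),
  stringsAsFactors = FALSE
)

# Counting rows: every Location/Class cell is 1, because the data
# are already aggregated to one row per combination.
row_counts <- table(cyp$Location, cyp$Class)

# Summing the 'value' column instead gives the motif frequencies.
motif_sums <- xtabs(value ~ Location + Class, data = cyp)

print(addmargins(row_counts))
print(addmargins(motif_sums))
```

With all 12 rows of the real data, the first table would total 12, which matches the symptom described in the question.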
Pretty simple problem, I think, but not sure of the proper solution. Have done some research on this and think I recall seeing a solution somewhere, but cannot remember where...anyway,
I want to get DP03 one-year ACS data for all Ohio counties for 2019. However, the code below only accesses 39 of Ohio's 88 counties. How can I access the remaining counties?
My guess is that data is only being pulled for counties with populations greater than 60,000.
library(tidycensus)
library(tidyverse)
acs_2019 <- load_variables(2019, dataset = "acs1/profile")
DP03 <- acs_2019 %>%
filter(str_detect(name, pattern = "^DP03")) %>%
pull(name, label)
Ohio_county <-
get_acs(geography = "county",
year = 2019,
state = "OH",
survey = "acs1",
variables = DP03,
output = "wide")
This results in a table that looks like this...
Ohio_county
# A tibble: 39 x 550
GEOID NAME `Estimate!!EMPL~ `Estimate!!EMPL~ `Estimate!!EMPL~ `Estimate!!EMPL~ `Estimate!!EMPL~
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 39057 Gree~ 138295 815 138295 NA 87465
2 39043 Erie~ 61316 516 61316 NA 38013
3 39153 Summ~ 442279 1273 442279 NA 286777
4 39029 Colu~ 83317 634 83317 NA 48375
5 39099 Maho~ 188298 687 188298 NA 113806
6 39145 Scio~ 60956 588 60956 NA 29928
7 39003 Alle~ 81560 377 81560 NA 49316
8 39023 Clar~ 108730 549 108730 NA 64874
9 39093 Lora~ 250606 896 250606 NA 150136
10 39113 Mont~ 428140 954 428140 NA 267189
Pretty sure I've seen a solution somewhere, but cannot recall where.
Any help would be appreciated since it would let the office more easily pull census data rather than wading through the US Census Bureau site. Best of luck and Thank you!
My colleague, who had already pulled the data, did not specify whether the DP03 data came from the ACS 1-year survey or the ACS 5-year survey. As it turns out, it was from the ACS 5-year survey, which includes all Ohio counties, not just those with populations of 65,000 or more. See the comments above for a description of how this answer was determined.
Code for all counties is here
library(tidycensus)
library(tidyverse)
acs_2018 <- load_variables(2018, dataset = "acs5/profile")
DP03 <- acs_2018 %>%
filter(str_detect(name, pattern = "^DP03")) %>%
pull(name)
Ohio_county <-
get_acs(geography = "county",
year = 2018,
state = "OH",
survey = "acs5",
variables = DP03,
output = "wide")
I am looking to automate the process of downloading Census data for all block groups in the US using the tidycensus package. There are instructions from the developer for downloading all tracts within the US; however, block groups cannot be accessed using the same method.
Here is my current code that does not work
library(tidyverse)
library(tidycensus)
census_api_key("key here")
# create lists of state and county codes
data("fips_codes")
temp <- data.frame(state = as.character(fips_codes$state_code),
county = fips_codes$county_code,
stringsAsFactors = F)
temp <- aggregate(county~state, temp, c)
state <- temp$state
coun <- temp$county
# use map2_df to loop through the files, similar to the "tract" data pull
home <- map2_df(state, coun, function(x,y) {
get_acs(geography = "block group", variables = "B25038_001", #random var
state = x,county = y)
})
The resulting error is
No encoding supplied: defaulting to UTF-8.
Error: parse error: premature EOF
(right here) ------^
A similar approach to convert the counties within each state into a list also does not work
temp <- aggregate(county~state, temp, c)
state <- temp$state
coun <- temp$county
df<- map2_df(state, coun, function(x,y) {
get_acs(geography = "block group", variables = "B25038_001",
state = x,county = y)
})
Error: Result 1 is not a length 1 atomic vector is returned.
Does anyone know how this could be done? More than likely I am not using the functions or syntax properly, and I am also not very good with loops. Any help would be appreciated.
The solution was provided by the author of tidycensus (Kyle Walker), and is as follows:
Unfortunately this just doesn't work at the moment. If it did work,
your code would need to identify the counties within each state within
a function evaluated by map_df and then stitch together the dataset
county-by-county, and state-by-state. The issue is that block group
data is only available by county, so you'd need to walk through all
3000+ counties in the US in turn. If it did work, a successful call
would look like this:
library(tigris)
library(tidyverse)
library(tidycensus)
library(sf)
ctys <- counties(cb = TRUE)
state_codes <- unique(fips_codes$state_code)[1:51]
bgs <- map_df(state_codes, function(state_code) {
state <- filter(ctys, STATEFP == state_code)
county_codes <- state$COUNTYFP
get_acs(geography = "block group", variables = "B25038_001",
state = state_code, county = county_codes)
})
The issue is that while I have internal logic to allow for multi-state
calls, or multi-county calls within a state, tidycensus can't yet
handle multi-state and multi-county calls simultaneously.
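The state-then-county walk described in the quote can be sketched in base R. The key step is turning a table of state and county codes into a named list with one character vector of counties per state, which split() does directly (aggregate() with c tends to return a harder-to-use structure). The function fetch_block_groups is a hypothetical stand-in for the real download call, and the small fips table is made up so the sketch runs offline.

```r
# Made-up excerpt of a state/county code table.
fips <- data.frame(
  state_code  = c("01", "01", "02", "02", "02"),
  county_code = c("001", "003", "013", "016", "020"),
  stringsAsFactors = FALSE
)

# Named list: one element per state, holding that state's county codes.
counties_by_state <- split(fips$county_code, fips$state_code)

# Hypothetical stand-in for the per-state request; a real version would
# call the census API with one state and its vector of counties.
fetch_block_groups <- function(state, counties) {
  data.frame(state = state, county = counties, stringsAsFactors = FALSE)
}

# Walk the states one at a time, passing each state its own counties,
# then stitch the pieces together.
pieces  <- Map(fetch_block_groups, names(counties_by_state), counties_by_state)
all_bgs <- do.call(rbind, pieces)
rownames(all_bgs) <- NULL
print(all_bgs)
```

Each call sees exactly one state and that state's counties, so no single request mixes several states with several counties.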
Try this package: totalcensus at https://github.com/GL-Li/totalcensus. It downloads census data files to your own computer and extracts any data from these files. After setting up the folders and path, run the code below if you want all block group data from the 2015 ACS 5-year survey.
library(totalcensus)
# download the 2015 ACS 5-year survey data, which is about 50 GB.
download_census("acs5year", 2015)
# read block group data of variable B25038_001 from all states plus DC
block_groups <- read_acs5year(
year = 2015,
states = states_DC,
table_contents = "B25038_001",
summary_level = "block group"
)
The extracted data of 217739 block groups of all states and DC:
# GEOID lon lat state population B25038_001 GEOCOMP SUMLEV NAME
# 1: 15000US020130001001 -164.1232 54.80448 AK 982 91 all 150 Block Group 1, Census Tract 1, Aleutians East Borough, Alaska
# 2: 15000US020130001002 -161.1786 55.60224 AK 1116 247 all 150 Block Group 2, Census Tract 1, Aleutians East Borough, Alaska
# 3: 15000US020130001003 -160.0655 55.13399 AK 1206 352 all 150 Block Group 3, Census Tract 1, Aleutians East Borough, Alaska
# 4: 15000US020160001001 178.3388 51.95945 AK 1065 264 all 150 Block Group 1, Census Tract 1, Aleutians West Census Area, Alaska
# 5: 15000US020160002001 -166.8899 53.85881 AK 2038 380 all 150 Block Group 1, Census Tract 2, Aleutians West Census Area, Alaska
# ---
# 217735: 15000US560459511001 -104.7889 43.99520 WY 1392 651 all 150 Block Group 1, Census Tract 9511, Weston County, Wyoming
# 217736: 15000US560459511002 -104.4785 43.76853 WY 2050 742 all 150 Block Group 2, Census Tract 9511, Weston County, Wyoming
# 217737: 15000US560459513001 -104.2575 43.88160 WY 1291 520 all 150 Block Group 1, Census Tract 9513, Weston County, Wyoming
# 217738: 15000US560459513002 -104.1807 43.85406 WY 1046 526 all 150 Block Group 2, Census Tract 9513, Weston County, Wyoming
# 217739: 15000US560459513003 -104.2601 43.84355 WY 1373 547 all 150 Block Group 3, Census Tract 9513, Weston County, Wyoming
I have population data for Pakistan's four provinces (1. KPK, 2. Punjab, 3. Sindh, 4. Baluchistan) plus 5. Islamabad, and want to show it on a Geo chart. The data:
p.name population
1 KPK 3615
2 Punjab 5348
3 Sindh 5500
4 Baloachistan 4500
5 Islamabad 2500
G <- gvisGeoChart(pop, "state.name", "population",
options=list(region="Pakistan",
resolution="provinces",
width=600, height=400))
But the chart shows the alert "Requested map does not exist".
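Try using an uppercase ISO-3166-1 alpha-2 code for the region:
G <- gvisGeoChart(pop, "state.name", "population",
options=list(region="PK",
resolution="provinces",
width=600, height=400))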
I would like to select in my dataframe (catch) only the rows for which my "tspp.name" variable is the same as my "elasmo.name" variable.
For example, rows #74807 and #74809 in this case would be selected, but not row #74823, because the elasmo.name is "Skates" and the tspp.name is "Northern shrimp".
I am sure there is an easy answer for this, but I have not found it yet. Any hints would be appreciated.
> catch[4:6,]
gear tripID obsID sortie setID date time NAFO lat long dur depth bodymesh
74807 GRL2 G00001 A 1 13 2000-01-04 13:40:00 2H 562550 594350 2.000000 377 80
74809 GRL2 G00001 A 1 14 2000-01-04 23:30:00 2H 562550 594350 2.166667 370 80
74823 GRL2 G00001 A 1 16 2000-01-05 07:45:00 2H 561450 593050 3.000000 408 80
codendmesh mail.fil long.fil nbr.fil hook.shape hook.size hooks VTS tspp tspp.name elasmo
74807 45 NA NA NA NA NA 3.3 2211 Northern shrimp 2211
74809 45 NA NA NA NA NA 3.2 2211 Northern shrimp 2211
74823 45 NA NA NA NA NA 3.3 2211 Northern shrimp 211
elasmo.name kept discard Tcatch date.1 latitude longitude EID
74807 Northern shrimp 2747 50 2797 2000-01-04 56.91667 -60.21667 G00001-13
74809 Northern shrimp 4919 100 5019 2000-01-04 56.91667 -60.21667 G00001-14
74823 Skates 0 50 50 2000-01-05 56.73333 -60.00000 G00001-16
fgear
74807 Shrimp trawl (stern) with a grid
74809 Shrimp trawl (stern) with a grid
74823 Shrimp trawl (stern) with a grid
I know what the problem is - you need to read in the data "as is", by adding the argument as.is=TRUE to the read.csv command (which you presumably used to load everything in). Without this, the strings get stored as factors, and all methods suggested above will fail (as you've discovered!)
Once you've read in the data correctly, you can use either
catch[which(catch$tspp.name == catch$elasmo.name),]
or
subset(catch, tspp.name == elasmo.name)
to obtain the matching rows - do not omit the which in the first one otherwise the code will fail when doing comparisons with NAs.
Below is a 30-second example using a small fabricated data set that illustrates all these points explicitly.
First, create a text file on disk that looks like this (I saved it as "F:/test.dat" but it can be saved anywhere)...
col1~col2
a~b
a~a
b~b
c~NA
NA~d
NA~NA
Let's load it in with strings converted to factors (the read.csv default before R 4.0.0), just to see the methods proposed above fall over:
> dat=read.csv("F:/test.dat",sep="~") # don't forget to check the filename
> dat[which(dat$col1==dat$col2),]
Error in Ops.factor(dat$col1, dat$col2) : level sets of factors are different
> dat[dat$col1==dat$col2,]
Error in Ops.factor(dat$col1, dat$col2) : level sets of factors are different
> subset(dat,col1==col2)
Error in Ops.factor(col1, col2) : level sets of factors are different
This is exactly the problem you were having. If you type dat$col1 and dat$col2 you'll see that the first has factor levels a b c while the second has factor levels a b d - hence the error messages.
Now let's do the same, but this time reading in the data "as is":
> dat=read.csv("F:/test.dat",sep="~",as.is=TRUE) # note the as.is=TRUE
> dat[which(dat$col1==dat$col2),]
col1 col2
2 a a
3 b b
> dat[dat$col1==dat$col2,]
col1 col2
2 a a
3 b b
NA <NA> <NA>
NA.1 <NA> <NA>
NA.2 <NA> <NA>
> subset(dat,col1==col2)
col1 col2
2 a a
3 b b
As you can see, the first method (based on which) and the third method (based on subset) both give the right answer, while the second method gets confused by comparisons with NA. I would personally advocate the subset method as in my opinion it's the neatest.
A final note: There are other ways that you can get strings arising as factors in a data frame - and to avoid all of those headaches, always remember to include the argument stringsAsFactors = FALSE at the end whenever you create a data frame using data.frame. For instance, the correct way to create the object dat directly in R would be:
dat=data.frame(col1=c("a","a","b","c",NA,NA), col2=c("b","a","b",NA,"d",NA),
stringsAsFactors=FALSE)
Type dat$col1 and dat$col2 and you'll see they've been interpreted correctly. If you try it again but with the stringsAsFactors argument omitted (or set to TRUE), you'll see those darned factors appear (just like the dodgy first method of loading from disk).
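If the data frame has already been loaded with factor columns and re-reading it is inconvenient, a workaround is to compare the character values instead of the factors. This is a small sketch on the same fabricated data, with stringsAsFactors=TRUE set on purpose to reproduce the problem; the names same and matched are just illustrative.

```r
# Same fabricated data, deliberately stored as factors.
dat <- data.frame(col1 = c("a", "a", "b", "c", NA, NA),
                  col2 = c("b", "a", "b", NA, "d", NA),
                  stringsAsFactors = TRUE)

# Comparing the factors directly fails because their level sets differ,
# so compare the underlying strings instead.
same <- as.character(dat$col1) == as.character(dat$col2)

# which() drops the NA comparisons, as in the first method above.
matched <- dat[which(same), ]
print(matched)
```

This leaves the columns as factors in dat itself; only the comparison is done on strings.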
In short, always remember as.is=TRUE and stringsAsFactors=FALSE, and learn how to use the subset command, and you won't go far wrong!
Hope this helps :)