tidycensus: download all block groups in R

I am looking to automate downloading Census data for all block groups in the US using the tidycensus package. There are instructions from the developer for downloading all tracts within the US; however, block groups cannot be accessed using the same method.
Here is my current code, which does not work:
library(tidyverse)
library(tidycensus)
census_api_key("key here")

# create lists of state and county codes
data("fips_codes")
temp <- data.frame(state = as.character(fips_codes$state_code),
                   county = fips_codes$county_code,
                   stringsAsFactors = FALSE)
temp <- aggregate(county ~ state, temp, c)
state <- temp$state
coun <- temp$county

# use map2_df to loop through the files, similar to the "tract" data pull
home <- map2_df(state, coun, function(x, y) {
  get_acs(geography = "block group", variables = "B25038_001", # random var
          state = x, county = y)
})
The resulting error is
No encoding supplied: defaulting to UTF-8.
Error: parse error: premature EOF
(right here) ------^
A similar approach that converts the counties within each state into a list also does not work:
temp <- aggregate(county ~ state, temp, c)
state <- temp$state
coun <- temp$county
df <- map2_df(state, coun, function(x, y) {
  get_acs(geography = "block group", variables = "B25038_001",
          state = x, county = y)
})
This returns the error:
Error: Result 1 is not a length 1 atomic vector
Does anyone know how this could be done? More than likely I am misusing the functions or the syntax, and I am also not very good with loops. Any help would be appreciated.

The solution was provided by the author of tidycensus (Kyle Walker), and is as follows:
Unfortunately this just doesn't work at the moment. If it did work,
your code would need to identify the counties within each state within
a function evaluated by map_df and then stitch together the dataset
county-by-county, and state-by-state. The issue is that block group
data is only available by county, so you'd need to walk through all
3000+ counties in the US in turn. If it did work, a successful call
would look like this:
library(tigris)
library(tidyverse)
library(tidycensus)
library(sf)

ctys <- counties(cb = TRUE)

state_codes <- unique(fips_codes$state_code)[1:51]

bgs <- map_df(state_codes, function(state_code) {
  state <- filter(ctys, STATEFP == state_code)
  county_codes <- state$COUNTYFP
  get_acs(geography = "block group", variables = "B25038_001",
          state = state_code, county = county_codes)
})
The issue is that while I have internal logic to allow for multi-state
calls, or multi-county calls within a state, tidycensus can't yet
handle multi-state and multi-county calls simultaneously.
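Whichever way you loop over counties, you first need a per-state list of county codes. The `aggregate(county ~ state, temp, c)` trick in the question can be done more transparently with base R's `split()`; here is a minimal sketch where made-up FIPS-like codes stand in for `tidycensus::fips_codes`:

```r
# Build a named list with one vector of county codes per state.
# The toy data frame stands in for tidycensus::fips_codes.
fips <- data.frame(
  state_code  = c("01", "01", "02", "02", "02"),
  county_code = c("001", "003", "013", "016", "020"),
  stringsAsFactors = FALSE
)

# split() returns a named list: counties_by_state[["02"]] holds the
# county codes for state "02", ready for a single-state get_acs() call
counties_by_state <- split(fips$county_code, fips$state_code)

counties_by_state[["02"]]
# [1] "013" "016" "020"
```

With the real `fips_codes` table, each list element could then be passed as the `county` argument of one `get_acs()` call per state, as in the answer above.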

Try this package: totalcensus at https://github.com/GL-Li/totalcensus. It downloads Census data files to your own computer and extracts any data from those files. After setting up the folders and path, run the code below to get all block-group data from the 2015 ACS 5-year survey.
library(totalcensus)

# download the 2015 ACS 5-year survey data, which is about 50 GB
download_census("acs5year", 2015)

# read block group data of variable B25038_001 from all states plus DC
block_groups <- read_acs5year(
  year = 2015,
  states = states_DC,
  table_contents = "B25038_001",
  summary_level = "block group"
)
The extracted data cover 217,739 block groups across all states and DC:
# GEOID lon lat state population B25038_001 GEOCOMP SUMLEV NAME
# 1: 15000US020130001001 -164.1232 54.80448 AK 982 91 all 150 Block Group 1, Census Tract 1, Aleutians East Borough, Alaska
# 2: 15000US020130001002 -161.1786 55.60224 AK 1116 247 all 150 Block Group 2, Census Tract 1, Aleutians East Borough, Alaska
# 3: 15000US020130001003 -160.0655 55.13399 AK 1206 352 all 150 Block Group 3, Census Tract 1, Aleutians East Borough, Alaska
# 4: 15000US020160001001 178.3388 51.95945 AK 1065 264 all 150 Block Group 1, Census Tract 1, Aleutians West Census Area, Alaska
# 5: 15000US020160002001 -166.8899 53.85881 AK 2038 380 all 150 Block Group 1, Census Tract 2, Aleutians West Census Area, Alaska
# ---
# 217735: 15000US560459511001 -104.7889 43.99520 WY 1392 651 all 150 Block Group 1, Census Tract 9511, Weston County, Wyoming
# 217736: 15000US560459511002 -104.4785 43.76853 WY 2050 742 all 150 Block Group 2, Census Tract 9511, Weston County, Wyoming
# 217737: 15000US560459513001 -104.2575 43.88160 WY 1291 520 all 150 Block Group 1, Census Tract 9513, Weston County, Wyoming
# 217738: 15000US560459513002 -104.1807 43.85406 WY 1046 526 all 150 Block Group 2, Census Tract 9513, Weston County, Wyoming
# 217739: 15000US560459513003 -104.2601 43.84355 WY 1373 547 all 150 Block Group 3, Census Tract 9513, Weston County, Wyoming

Related

Is there an R function to generate a column of U.S county names based on latitude and longitude? [duplicate]

Is there a fast way to convert latitude and longitude coordinates to state codes in R? I've been using the zipcode package as a lookup table, but it's too slow when I'm querying lots of lat/long values.
If not in R, is there any way to do this using the Google geocoder or any other type of fast querying service?
Thanks!
Here are two options, one using sf and one using sp package functions. sf is the more modern (and, here in 2020, recommended) package for analyzing spatial data, but in case it's still useful, I am leaving my original 2012 answer showing how to do this with sp-related functions.
Method 1 (using sf):
library(sf)
library(spData)

## pointsDF: A data.frame whose first column contains longitudes and
##           whose second column contains latitudes.
##
## states:   An sf MULTIPOLYGON object with 50 states plus DC.
##
## name_col: Name of a column in `states` that supplies the states'
##           names.
lonlat_to_state <- function(pointsDF,
                            states = spData::us_states,
                            name_col = "NAME") {
    ## Convert points data.frame to an sf POINTS object
    pts <- st_as_sf(pointsDF, coords = 1:2, crs = 4326)

    ## Transform spatial data to some planar coordinate system
    ## (e.g. Web Mercator) as required for geometric operations
    states <- st_transform(states, crs = 3857)
    pts <- st_transform(pts, crs = 3857)

    ## Find names of state (if any) intersected by each point
    state_names <- states[[name_col]]
    ii <- as.integer(st_intersects(pts, states))
    state_names[ii]
}

## Test the function with points in Wisconsin, Oregon, and France
testPoints <- data.frame(x = c(-90, -120, 0), y = c(44, 44, 44))

lonlat_to_state(testPoints)
## [1] "Wisconsin" "Oregon"    NA
If you need higher resolution state boundaries, read in your own vector data as an sf object using sf::st_read() or by some other means. One nice option is to install the rnaturalearth package and use it to load a state vector layer from rnaturalearthhires. Then use the lonlat_to_state() function we just defined as shown here:
library(rnaturalearth)
us_states_ne <- ne_states(country = "United States of America",
                          returnclass = "sf")
lonlat_to_state(testPoints, states = us_states_ne, name_col = "name")
## [1] "Wisconsin" "Oregon"    NA
For very accurate results, you can download a geopackage containing GADM-maintained administrative borders for the United States from this page. Then, load the state boundary data and use them like this:
USA_gadm <- st_read(dsn = "gadm36_USA.gpkg", layer = "gadm36_USA_1")
lonlat_to_state(testPoints, states = USA_gadm, name_col = "NAME_1")
## [1] "Wisconsin" "Oregon" NA
Method 2 (using sp):
Here is a function that takes a data.frame of lat-longs within the lower 48 states, and for each point, returns the state in which it is located.
Most of the function simply prepares the SpatialPoints and SpatialPolygons objects needed by the over() function in the sp package, which does the real heavy lifting of calculating the 'intersection' of points and polygons:
library(sp)
library(maps)
library(maptools)

# The single argument to this function, pointsDF, is a data.frame in which:
#   - column 1 contains the longitude in degrees (negative in the US)
#   - column 2 contains the latitude in degrees
lonlat_to_state_sp <- function(pointsDF) {
    # Prepare SpatialPolygons object with one SpatialPolygon
    # per state (plus DC, minus HI & AK)
    states <- map('state', fill = TRUE, col = "transparent", plot = FALSE)
    IDs <- sapply(strsplit(states$names, ":"), function(x) x[1])
    states_sp <- map2SpatialPolygons(states, IDs = IDs,
                                     proj4string = CRS("+proj=longlat +datum=WGS84"))

    # Convert pointsDF to a SpatialPoints object
    pointsSP <- SpatialPoints(pointsDF,
                              proj4string = CRS("+proj=longlat +datum=WGS84"))

    # Use 'over' to get _indices_ of the Polygons object containing each point
    indices <- over(pointsSP, states_sp)

    # Return the state names of the Polygons object containing each point
    stateNames <- sapply(states_sp@polygons, function(x) x@ID)
    stateNames[indices]
}

# Test the function using points in Wisconsin and Oregon
testPoints <- data.frame(x = c(-90, -120), y = c(44, 44))

lonlat_to_state_sp(testPoints)
## [1] "wisconsin" "oregon"
You can do it in a few lines of R.
library(sp)
library(rgdal)

# lat and long
Lat <- 57.25
Lon <- -9.41

# make a data frame
coords <- as.data.frame(cbind(Lon, Lat))

# and into Spatial
points <- SpatialPoints(coords)

# SpatialPolygonsDataFrame - I'm using a shapefile of UK counties
counties <- readOGR(".", "uk_counties")

# assume same proj as shapefile!
proj4string(points) <- proj4string(counties)

# get the county polygon the point is in
result <- as.character(over(points, counties)$County_Name)
See ?over in the sp package.
You'll need to have the state boundaries as a SpatialPolygonsDataFrame.
Example data (polygons and points)
library(raster)
pols <- shapefile(system.file("external/lux.shp", package = "raster"))
xy <- coordinates(pols)
Use raster::extract
extract(pols, xy)
# point.ID poly.ID ID_1 NAME_1 ID_2 NAME_2 AREA
#1 1 1 1 Diekirch 1 Clervaux 312
#2 2 2 1 Diekirch 2 Diekirch 218
#3 3 3 1 Diekirch 3 Redange 259
#4 4 4 1 Diekirch 4 Vianden 76
#5 5 5 1 Diekirch 5 Wiltz 263
#6 6 6 2 Grevenmacher 6 Echternach 188
#7 7 7 2 Grevenmacher 7 Remich 129
#8 8 8 2 Grevenmacher 12 Grevenmacher 210
#9 9 9 3 Luxembourg 8 Capellen 185
#10 10 10 3 Luxembourg 9 Esch-sur-Alzette 251
#11 11 11 3 Luxembourg 10 Luxembourg 237
#12 12 12 3 Luxembourg 11 Mersch 233
It's very straightforward using sf:
library(maps)
library(sf)

## Get the states map, turn into sf object
US <- st_as_sf(map("state", plot = FALSE, fill = TRUE))

## Test the function using points in Wisconsin and Oregon
testPoints <- data.frame(x = c(-90, -120), y = c(44, 44))

# Make it a spatial data frame, using the same coordinate system as the US spatial data frame
testPoints <- st_as_sf(testPoints, coords = c("x", "y"), crs = st_crs(US))

# ... and perform a spatial join!
st_join(testPoints, US)

         ID       geometry
1 wisconsin POINT (-90 44)
2    oregon POINT (-120 44)

Tidycensus - One year ACS for All Counties in a State

Pretty simple problem, I think, but I'm not sure of the proper solution. I have done some research on this and think I recall seeing a solution somewhere, but cannot remember where... anyway,
I want to get DP03 one-year ACS data for all Ohio counties for 2019. However, the code below only accesses 39 of Ohio's 88 counties. How can I access the remaining counties?
My guess is that data is only being pulled for counties with populations greater than 60,000.
library(tidycensus)
library(tidyverse)

acs_2019 <- load_variables(2019, dataset = "acs1/profile")

DP03 <- acs_2019 %>%
  filter(str_detect(name, pattern = "^DP03")) %>%
  pull(name, label)

Ohio_county <-
  get_acs(geography = "county",
          year = 2019,
          state = "OH",
          survey = "acs1",
          variables = DP03,
          output = "wide")
This results in a table that looks like this...
Ohio_county
# A tibble: 39 x 550
GEOID NAME `Estimate!!EMPL~ `Estimate!!EMPL~ `Estimate!!EMPL~ `Estimate!!EMPL~ `Estimate!!EMPL~
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 39057 Gree~ 138295 815 138295 NA 87465
2 39043 Erie~ 61316 516 61316 NA 38013
3 39153 Summ~ 442279 1273 442279 NA 286777
4 39029 Colu~ 83317 634 83317 NA 48375
5 39099 Maho~ 188298 687 188298 NA 113806
6 39145 Scio~ 60956 588 60956 NA 29928
7 39003 Alle~ 81560 377 81560 NA 49316
8 39023 Clar~ 108730 549 108730 NA 64874
9 39093 Lora~ 250606 896 250606 NA 150136
10 39113 Mont~ 428140 954 428140 NA 267189
Pretty sure I've seen a solution somewhere, but cannot recall where.
Any help would be appreciated since it would let the office more easily pull census data rather than wading through the US Census Bureau site. Best of luck and Thank you!
My colleague, who already pulled the data, did not specify whether or not the DP03 data came from the ACS 1 year survey or the ACS 5 year survey. As it turns out, it was from the ACS 5 year survey, which includes all Ohio counties, not just those counties over 65,000 population. Follow comments above for a description of how this answer was determined.
Code for all counties is here:
library(tidycensus)
library(tidyverse)

acs_2018 <- load_variables(2018, dataset = "acs5/profile")

DP03 <- acs_2018 %>%
  filter(str_detect(name, pattern = "^DP03")) %>%
  pull(name)

Ohio_county <-
  get_acs(geography = "county",
          year = 2018,
          state = "OH",
          survey = "acs5",
          variables = DP03,
          output = "wide")

Fuzzy matching of rows of two datasets without using a for-loop

I have two datasets A and B with 8 columns each. Dataset A has 942 rows and Dataset B has 5079 rows. I have to compare Dataset A and Dataset B and do fuzzy matching; if a row of Dataset A matches any row in Dataset B, I have to mark it as "Matched" in an additional column of Dataset A.
I'm relatively new to R and not sure how to optimize R code with lapply, mapply or sapply instead of a for loop.
Following is my code
##############################
# Install Necessary Packages #
##############################
# install.packages("openxlsx")
# install.packages("stringdist")
# install.packages("XLConnect")

##############################
# Load Packages              #
##############################
library(openxlsx)
library(stringdist)
library(XLConnect)

cmd_newleads <- read.xlsx("Src/CMD - New Leads to Load.xlsx", sheet = "Top Leads Full Data", startRow = 1, colNames = TRUE)
cmd_newleads[c("Lead_Match", "Opportunity_Match")] <- ""

c4c_leads <- read.xlsx("Src/C4C - Leads.xlsx", sheet = "Leads", startRow = 1, colNames = TRUE)
#c4c_opportunities <- read.xlsx("Src/C4C - Opportunities Data 6-24-16.xlsx", sheet = "Export 06-24-2016 04.55.46 PM", startRow = 1, colNames = TRUE)

cmd_newleads_selcols <- cmd_newleads[, c("project_name", "project_address", "project_city", "project_state_province_region_code", "project_postalcode", "project_country", "project_sector", "project_type")]
cmd_newleads_selcols[is.na(cmd_newleads_selcols)] <- ""

c4cleads_selcols <- c4c_leads[, c("Lead", "Address1.(Lead)", "City.(Lead)", "Region.(Lead)", "Postal.Code.(Lead)", "Country.(Lead)", "Sector.(Lead)", "Type.(Lead)")]
c4cleads_selcols[is.na(c4cleads_selcols)] <- ""

rcount_cmdnewleads <- nrow(cmd_newleads)
rcount_c4cleads <- nrow(c4c_leads)

for (i in 1:rcount_cmdnewleads) {
  cmd_project_name        <- cmd_newleads_selcols[i, 1]
  cmd_project_address     <- cmd_newleads_selcols[i, 2]
  cmd_project_city        <- cmd_newleads_selcols[i, 3]
  cmd_project_region_code <- cmd_newleads_selcols[i, 4]
  cmd_project_postalcode  <- cmd_newleads_selcols[i, 5]
  cmd_project_country     <- cmd_newleads_selcols[i, 6]
  cmd_project_sector      <- cmd_newleads_selcols[i, 7]
  cmd_project_type        <- cmd_newleads_selcols[i, 8]
  for (j in 1:rcount_c4cleads) {
    c4cleads_project_name        <- c4cleads_selcols[j, 1]
    c4cleads_project_address     <- c4cleads_selcols[j, 2]
    c4cleads_project_city        <- c4cleads_selcols[j, 3]
    c4cleads_project_region_code <- c4cleads_selcols[j, 4]
    c4cleads_project_postalcode  <- c4cleads_selcols[j, 5]
    c4cleads_project_country     <- c4cleads_selcols[j, 6]
    c4cleads_project_sector      <- c4cleads_selcols[j, 7]
    c4cleads_project_type        <- c4cleads_selcols[j, 8]
    project_percent    <- stringsim(cmd_project_name, c4cleads_project_name, method = "dl", p = 0.1)
    address_percent    <- stringsim(cmd_project_address, c4cleads_project_address, method = "dl", p = 0.1)
    city_percent       <- stringsim(cmd_project_city, c4cleads_project_city, method = "dl", p = 0.1)
    region_percent     <- stringsim(cmd_project_region_code, c4cleads_project_region_code, method = "dl", p = 0.1)
    postalcode_percent <- stringsim(cmd_project_postalcode, c4cleads_project_postalcode, method = "dl", p = 0.1)
    country_percent    <- stringsim(cmd_project_country, c4cleads_project_country, method = "dl", p = 0.1)
    sector_percent     <- stringsim(cmd_project_sector, c4cleads_project_sector, method = "dl", p = 0.1)
    type_percent       <- stringsim(cmd_project_type, c4cleads_project_type, method = "dl", p = 0.1)
    if (project_percent > 0.833 && address_percent > 0.833 && city_percent > 0.833 &&
        region_percent > 0.833 && postalcode_percent > 0.833 && country_percent > 0.833 &&
        sector_percent > 0.833 && type_percent > 0.833) {
      cmd_newleads[i, 51] <- c4c_leads$Lead.ID[j]
    } else {
      cmd_newleads[i, 51] <- "New Lead"
    }
  }
}
Sample data for cmd_newleads_selcols and c4cleads_selcols respectively
project_name project_address project_city
1 Wynn Mystic Casino & Hotel 22 Chemical Ln Everett
2 Northpoint Complex Development East Street Cambridge
3 Northpoint Complex Development East Street Cambridge
4 Northpoint Complex Development East Street Cambridge
5 Northpoint Complex Development East Street Cambridge
6 Northpoint Complex Development East Street Cambridge
project_state_province_region_code project_postalcode
1 MA 02149
2 MA 02138
3 MA 02138
4 MA 02138
5 MA 02138
6 MA 02138
project_country project_sector project_type
1 United States of America Hospitality New Building
2 United States of America Apartments New Building
3 United States of America Apartments New Building
4 United States of America Apartments New Building
5 United States of America Apartments New Building
6 United States of America Apartments New Building
Lead Address1.(Lead) City.(Lead) Region.(Lead) Postal.Code.(Lead) Country.(Lead)
1 1 Hotel Brooklyn Bridge Park Old Fulton St & Furman St Brooklyn New York 11201 United States
2 10 Trinity Square Hotel 10 Trinity Square London # EC3P United Kingdom
3 100 Stewart 1900 1st Avenue Seattle Washington 98101 United States
4 1136 S Wabash # # # # Not assigned
5 115-129 37th Street 115-129 37th Street Union CIty New Jersey # United States
6 1418 W Addison 1418 w Addison Chicago # 60613 Not assigned
Sector.(Lead) Type.(Lead)
1 Hospitality New Building
2 Hospitality Brand Conversion
3 Hospitality New Building
4 High Rise Residential New Building
5 Developer New Building
6 High Rise Residential New Building
If you are experiencing efficiency problems, it's not because you are using a for loop. The main issue is that you are doing a lot of work for every possible combination of rows in your two data sets. Using more efficient language features might speed things up a bit, but it wouldn't change the fact that you're doing a lot of unnecessary computation.
One of the best ways to increase efficiency in data matching problems is to rule out obvious non-matches to cut down on unnecessary computations. For example, you could change your inner loop to first check some key condition; if the score is low (i.e. it's obviously a non-match) you don't need to compute similarity scores for the rest of the attributes.
For example:
for (i in 1:rcount_cmdnewleads) {
  cmd_project_name <- cmd_newleads_selcols[i, 1]
  ...
  for (j in 1:rcount_c4cleads) {
    c4cleads_project_name <- c4cleads_selcols[j, 1]
    project_percent <- stringsim(cmd_project_name, c4cleads_project_name, method = "dl", p = 0.1)
    if (project_percent < .83) {
      # you already know that this is a non-match, so go to the next one
      next
    } else {
      # check the rest of the values!
      ...
    }
  }
}
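As a rough end-to-end illustration of the same early-exit idea, here is a self-contained sketch that uses base R's `adist()` (generalized edit distance) in place of `stringdist::stringsim()`, so it runs without extra packages; the lead names and the 0.83 threshold are made up for illustration:

```r
# Early-exit fuzzy matching sketch. adist() stands in for
# stringdist::stringsim(); similarity is normalized by the longer string.
name_sim <- function(a, b) {
  1 - adist(a, b) / pmax(nchar(a), nchar(b))
}

leads    <- c("Wynn Mystic Casino & Hotel", "Northpoint Complex")   # made-up
existing <- c("Wynn Mystic Casino and Hotel", "Riverside Towers")   # made-up

matches <- character(length(leads))
for (i in seq_along(leads)) {
  matches[i] <- "New Lead"
  for (j in seq_along(existing)) {
    # Cheap check first: skip the expensive field-by-field
    # comparison when the names are clearly different.
    if (name_sim(leads[i], existing[j]) < 0.83) next
    # ...compare address, city, region, etc. here before accepting...
    matches[i] <- existing[j]
  }
}
matches
```

The inner loop bails out on most candidate pairs after a single comparison, which is where the speedup comes from.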
I'm not familiar with the R RecordLinkage package, but the Python recordlinkage package has tools for ruling out obvious non-matches early in the process to increase efficiency. Consider checking out this tutorial to learn more about speeding up record linkage by ruling out obvious non-matches.
You might want to look at the package RecordLinkage, which allows you to perform phonetic matching, probabilistic record linkage and machine learning approaches.

Faster way to process 1.2 million JSON geolocation queries from large dataframe

I am working on the Gowalla location-based checkin dataset, which has around 6.44 million checkins. These checkins cover 1.28 million unique locations, but Gowalla only gives latitudes and longitudes. So I need to find the city, state and country for each of those lat/long pairs. From another post on StackOverflow I was able to create the R query below, which queries OpenStreetMap and finds the relevant geographical details.
Unfortunately it takes around 1 minute to process 125 rows, which means 1.28 million rows would take a couple of days. Is there a faster way to find these details? Maybe there is a package with built-in latitudes and longitudes of world cities that could find the city name for a given lat/long, so I would not have to do online querying.
venueTable is a data frame with 3 columns: vid (venue ID), lat (latitude) and long (longitude).
for (i in 1:nrow(venueTable)) {
  # indicator to display the current value of i on screen
  cat(paste(".", i, "."))
  # compose the url query
  url <- paste("http://nominatim.openstreetmap.org/reverse.php?format=json&lat=",
               venueTable$lat[i], "&lon=", venueTable$long[i])
  url <- gsub(' ', '', url)
  x <- fromJSON(url)
  venueTable$display_name[i] <- x$display_name
  venueTable$country[i] <- x$address$country
}
I am using the jsonlite package in R, which parses the JSON result of each query into x, a structure holding the various returned fields; I then pull out the fields I need with x$display_name or x$address$city.
My laptop is a Core i5-3230M with 8 GB RAM and a 120 GB SSD, running Windows 8.
You're going to have issues even if you persevere with time. The service you're querying allows 'an absolute maximum of one request per second', which you're already breaching. It's likely that they will throttle your requests before you reach 1.2m queries. Their website notes similar APIs for larger uses have only around 15k free daily requests.
It'd be much better for you to use an offline option. A quick search shows that there are many freely available datasets of populated places, along with their longitude and latitudes. Here's one we'll use: http://simplemaps.com/resources/world-cities-data
> library(dplyr)
> cities.data <- read.csv("world_cities.csv") %>% tbl_df
> print(cities.data)
Source: local data frame [7,322 x 9]
city city_ascii lat lng pop country iso2 iso3 province
(fctr) (fctr) (dbl) (dbl) (dbl) (fctr) (fctr) (fctr) (fctr)
1 Qal eh-ye Now Qal eh-ye 34.9830 63.1333 2997 Afghanistan AF AFG Badghis
2 Chaghcharan Chaghcharan 34.5167 65.2500 15000 Afghanistan AF AFG Ghor
3 Lashkar Gah Lashkar Gah 31.5830 64.3600 201546 Afghanistan AF AFG Hilmand
4 Zaranj Zaranj 31.1120 61.8870 49851 Afghanistan AF AFG Nimroz
5 Tarin Kowt Tarin Kowt 32.6333 65.8667 10000 Afghanistan AF AFG Uruzgan
6 Zareh Sharan Zareh Sharan 32.8500 68.4167 13737 Afghanistan AF AFG Paktika
7 Asadabad Asadabad 34.8660 71.1500 48400 Afghanistan AF AFG Kunar
8 Taloqan Taloqan 36.7300 69.5400 64256 Afghanistan AF AFG Takhar
9 Mahmud-E Eraqi Mahmud-E Eraqi 35.0167 69.3333 7407 Afghanistan AF AFG Kapisa
10 Mehtar Lam Mehtar Lam 34.6500 70.1667 17345 Afghanistan AF AFG Laghman
.. ... ... ... ... ... ... ... ... ...
It's hard to demonstrate without any actual data examples (helpful to provide!), but we can make up some toy data.
# make up toy data
> candidate.longlat <- data.frame(vid = 1:3,
                                  lat = c(12.53, -16.31, 42.87),
                                  long = c(-70.03, -48.95, 74.59))
Using the distm function in geosphere, we can calculate the distances between all your data and all the city locations at once. For you, this will make a matrix containing ~8,400,000,000 numbers, so it might take a while (you can explore parallelisation), and it may be highly memory intensive.
> install.packages("geosphere")
> library(geosphere)
# compute distance matrix using geosphere
> distance.matrix <- distm(x = candidate.longlat[, c("long", "lat")],
                           y = cities.data[, c("lng", "lat")])
It's then easy to find the closest city to each of your data points, and cbind it to your data.frame.
# work out which index in the matrix is closest to the data
> closest.index <- apply(distance.matrix, 1, which.min)
# cbind city and country of match with original query
> candidate.longlat <- cbind(candidate.longlat,
                             cities.data[closest.index, c("city", "country")])
> print(candidate.longlat)
vid lat long city country
1 1 12.53 -70.03 Oranjestad Aruba
2 2 -16.31 -48.95 Anapolis Brazil
3 3 42.87 74.59 Bishkek Kyrgyzstan
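If the full distance matrix is too large to hold in memory, the same nearest-city lookup can be done row by row (or in chunks). Here is a minimal base-R sketch in which a hand-rolled haversine formula stands in for geosphere::distm, and the city coordinates are toy values:

```r
# Row-by-row nearest-city lookup; the haversine formula replaces
# geosphere::distm so no extra packages are needed. City data are toy values.
haversine_km <- function(lon1, lat1, lon2, lat2) {
  rad <- pi / 180
  dlon <- (lon2 - lon1) * rad
  dlat <- (lat2 - lat1) * rad
  a <- sin(dlat / 2)^2 + cos(lat1 * rad) * cos(lat2 * rad) * sin(dlon / 2)^2
  6371 * 2 * asin(pmin(1, sqrt(a)))
}

cities <- data.frame(city = c("Oranjestad", "Anapolis", "Bishkek"),
                     lng  = c(-70.04, -48.96, 74.59),
                     lat  = c(12.52, -16.33, 42.87),
                     stringsAsFactors = FALSE)

queries <- data.frame(lng = c(-70.03, 74.60), lat = c(12.53, 42.88))

# One small distance vector per query point instead of one giant matrix
nearest <- vapply(seq_len(nrow(queries)), function(i) {
  which.min(haversine_km(queries$lng[i], queries$lat[i],
                         cities$lng, cities$lat))
}, integer(1))

cities$city[nearest]
# [1] "Oranjestad" "Bishkek"
```

Memory use stays proportional to the number of cities rather than queries × cities, at the cost of an R-level loop; chunking (many rows per iteration) recovers most of the vectorized speed.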
Here's an alternate way using R's inherent spatial processing capabilities:
library(sp)
library(rgeos)
library(rgdal)

# world places shapefile
URL1 <- "http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_populated_places.zip"
fil1 <- basename(URL1)
if (!file.exists(fil1)) download.file(URL1, fil1)
unzip(fil1)

places <- readOGR("ne_10m_populated_places.shp", "ne_10m_populated_places",
                  stringsAsFactors = FALSE)

# some data from the other answer since you didn't provide any
URL2 <- "http://simplemaps.com/resources/files/world/world_cities.csv"
fil2 <- basename(URL2)
if (!file.exists(fil2)) download.file(URL2, fil2)

# we need the points from said data
dat <- read.csv(fil2, stringsAsFactors = FALSE)
pts <- SpatialPoints(dat[, c("lng", "lat")], CRS(proj4string(places)))

# this is not necessary;
# I just don't like the warning about longlat not being a real projection
robin <- "+proj=robin +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs"
pts <- spTransform(pts, robin)
places <- spTransform(places, robin)

# compute the distance (makes a pretty big matrix, so you should do this
# in chunks unless you have a ton of memory, or do it row-by-row)
far <- gDistance(pts, places, byid = TRUE)

# find the closest one
closest <- apply(far, 1, which.min)

# map to the fields (you may want to map to other fields)
locs <- places@data[closest, c("NAME", "ADM1NAME", "ISO_A2")]
locs[sample(nrow(locs), 10), ]
## NAME ADM1NAME ISO_A2
## 3274 Szczecin West Pomeranian PL
## 1039 Balakhna Nizhegorod RU
## 1012 Chitre Herrera PA
## 3382 L'Aquila Abruzzo IT
## 1982 Dothan Alabama US
## 5159 Bayankhongor Bayanhongor MN
## 620 Deming New Mexico US
## 1907 Fort Smith Arkansas US
## 481 Dedougou Mou Houn BF
## 7169 Prague Prague CZ
It's about a minute (on my system) for ~7,500 points, so you're looking at a couple of hours vs a day or more. You can do this in parallel and probably get it done in less than an hour.
For better place resolution, you could use a very lightweight shapefile of country or Admin 1 polygons, then use a second process to do the distance from better resolution points for those geographic locations.

Zip Code Demographics in R

I could get at my goals "the long way" but am hoping to stay completely within R. I am looking to append Census demographic data by zip code to records in my database. I know that R has a few Census-based packages, but, unless I am missing something, these data do not seem to exist at the zip code level, nor is it intuitive to merge onto an existing data frame.
In short, is it possible to do this within R, or is my best approach to grab the data elsewhere and read it into R?
Any help will be greatly appreciated!
In short, no. Census to zip translations are generally created from proprietary sources.
It's unlikely that you'll find anything at the zipcode level from a census perspective (privacy). However, that doesn't mean you're left in the cold. You can use the zipcodes that you have and append census data from the MSA, muSA or CSA level. Now all you need is a listing of postal codes within your MSA, muSA or CSA so that you can merge. There's a bunch online that are pretty cheap if you don't already have such a list.
For example, in Canada, we can get income data from CRA at the FSA level (the first three digits of a postal code in the form A1A 1A1). I'm not sure what or if the IRS provides similar information, I'm also not too familiar with US Census data, but I imagine they provide information at the CSA level at the very least.
If you're bewildered by all these acronyms:
MSA: http://en.wikipedia.org/wiki/Metropolitan_Statistical_Area
CSA: http://en.wikipedia.org/wiki/Combined_statistical_area
muSA: http://en.wikipedia.org/wiki/Micropolitan_Statistical_Area
As others in this thread have mentioned, the Census Bureau American FactFinder is a free source of comprehensive and detailed data. Unfortunately, it’s not particularly easy to use in its raw format.
We’ve pulled, cleaned, consolidated, and reformatted the Census Bureau data. The details of this process and how to use the data files can be found on our team blog.
None of these tables actually have a field called “ZIP code.” Rather, they have a field called “ZCTA5”. A ZCTA5 (or ZCTA) can be thought of as interchangeable with a zip code given following caveats:
There are no ZCTAs for PO Box ZIP codes - this means that for 42,000 US ZIP Codes there are 32,000 ZCTAs.
ZCTAs, which stand for Zip Code Tabulation Areas, are based on zip codes but don’t necessarily follow exact zip code boundaries. If you would like to read more about ZCTAs, please refer to this link. The Census Bureau also provides an animation that shows how ZCTAs are formed.
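One practical note when merging on ZCTA5: ZCTAs are 5-character strings, so ZIP codes that were read in as numbers lose their leading zeros ("01001" becomes 1001) and the join silently drops those rows. A minimal sketch of padding them back before joining (the ZIP values here are made up):

```r
# ZIPs read as numbers lose leading zeros; pad to 5 characters
# before merging on a ZCTA5 field. Example values are made up.
zips_numeric <- c(1001L, 2138L, 60613L)
zcta5 <- sprintf("%05d", zips_numeric)
zcta5
# [1] "01001" "02138" "60613"
```

Reading the ZIP column as character in the first place (e.g. with colClasses in read.csv) avoids the problem entirely.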
I just wrote an R package called totalcensus (https://github.com/GL-Li/totalcensus), with which you can easily extract any data from the decennial census and ACS surveys.
For this old question, if you still care: you can get total population (by default) and population by race from the national data of the 2010 decennial census or the 2015 ACS 5-year survey.
From the 2015 ACS 5-year survey: download the national data with download_census("acs5year", 2015, "US") and then:
zip_acs5 <- read_acs5year(
  year = 2015,
  states = "US",
  geo_headers = "ZCTA5",
  table_contents = c(
    "white = B02001_002",
    "black = B02001_003",
    "asian = B02001_005"
  ),
  summary_level = "860"
)
# GEOID lon lat ZCTA5 state population white black asian GEOCOMP SUMLEV NAME
# 1: 86000US01001 -72.62827 42.06233 01001 NA 17438 16014 230 639 all 860 ZCTA5 01001
# 2: 86000US01002 -72.45851 42.36398 01002 NA 29780 23333 1399 3853 all 860 ZCTA5 01002
# 3: 86000US01003 -72.52411 42.38994 01003 NA 11241 8967 699 1266 all 860 ZCTA5 01003
# 4: 86000US01005 -72.10660 42.41885 01005 NA 5201 5062 40 81 all 860 ZCTA5 01005
# 5: 86000US01007 -72.40047 42.27901 01007 NA 14838 14086 104 330 all 860 ZCTA5 01007
# ---
# 32985: 86000US99923 -130.04103 56.00232 99923 NA 13 13 0 0 all 860 ZCTA5 99923
# 32986: 86000US99925 -132.94593 55.55020 99925 NA 826 368 7 0 all 860 ZCTA5 99925
# 32987: 86000US99926 -131.47074 55.13807 99926 NA 1711 141 0 2 all 860 ZCTA5 99926
# 32988: 86000US99927 -133.45792 56.23906 99927 NA 123 114 0 0 all 860 ZCTA5 99927
# 32989: 86000US99929 -131.60683 56.41383 99929 NA 2365 1643 5 60 all 860 ZCTA5 99929
From the 2010 decennial census: download the national data with download_census("decennial", 2010, "US") and then:
zip_2010 <- read_decennial(
  year = 2010,
  states = "US",
  table_contents = c(
    "white = P0030002",
    "black = P0030003",
    "asian = P0030005"
  ),
  geo_headers = "ZCTA5",
  summary_level = "860"
)
# lon lat ZCTA5 state population white black asian GEOCOMP SUMLEV
# 1: -66.74996 18.18056 00601 NA 18570 17285 572 5 all 860
# 2: -67.17613 18.36227 00602 NA 41520 35980 2210 22 all 860
# 3: -67.11989 18.45518 00603 NA 54689 45348 4141 85 all 860
# 4: -66.93291 18.15835 00606 NA 6615 5883 314 3 all 860
# 5: -67.12587 18.29096 00610 NA 29016 23796 2083 37 all 860
# ---
# 33116: -130.04103 56.00232 99923 NA 87 79 0 0 all 860
# 33117: -132.94593 55.55020 99925 NA 819 350 2 4 all 860
# 33118: -131.47074 55.13807 99926 NA 1460 145 6 2 all 860
# 33119: -133.45792 56.23906 99927 NA 94 74 0 0 all 860
# 33120: -131.60683 56.41383 99929 NA 2338 1691 3 33 all 860
Your best bet is probably with the U.S. Census Bureau TIGER/Line shapefiles. They have ZIP code tabulation area shapefiles (ZCTA5) for 2010 at the state level which may be sufficient for your purposes.
Census data itself can be found at American FactFinder. For example, you can get population estimates at the sub-county level (i.e. city/town), but not straight-forward population estimates at the zip-code level. I don't know the details of your data set, but one solution might require the use of relationship tables that are also available as part of the TIGER/Line data, or alternatively spatially joining the place names containing the census data (subcounty shapefiles) with the ZCTA5 codes.
Note from the metadata: "These products are free to use in a product or publication, however acknowledgement must be given to the U.S. Census Bureau as the source."
HTH
A simple for loop to get ZIP-level population. You need to get an API key, though. It covers the US for now.
library(data.table)

masterdata <- data.table()
for (z in 1:length(ziplist)) {
  print(z)
  textt <- paste0("http://api.opendatanetwork.com/data/v1/values?variable=demographics.population.count&entity_id=8600000US",
                  ziplist[z], "&forecast=3&describe=false&format=&app_token=YOURKEYHERE")
  errorornot <- try(jsonlite::fromJSON(textt), silent = TRUE)
  if (is(errorornot, "try-error")) next
  data <- jsonlite::fromJSON(textt)
  data <- as.data.table(data$data)
  zipcode <- data[1, 2]
  data <- data[2:nrow(data)]
  setnames(data, c("Year", "Population", "Forecasted"))
  data[, ZipCodeQuery := zipcode]
  data[, ZipCodeData := ziplist[z]]
  masterdata <- rbind(masterdata, data)
}
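One caveat with the loop above: masterdata <- rbind(masterdata, data) copies the accumulated table on every iteration, which gets slow for long ZIP lists. A common pattern is to collect the per-ZIP results in a list and bind once at the end; here is a base-R sketch in which toy data frames stand in for the parsed API responses:

```r
# Collect per-iteration results in a list, then bind once at the end,
# instead of rbind()-ing inside the loop. Toy data stand in for the
# per-ZIP API responses.
ziplist <- c("01001", "02138", "60613")

results <- vector("list", length(ziplist))
for (z in seq_along(ziplist)) {
  # imagine this data frame came from the API call for ziplist[z]
  results[[z]] <- data.frame(Year = 2013:2015,
                             Population = c(100, 110, 120),
                             ZipCodeData = ziplist[z],
                             stringsAsFactors = FALSE)
}

# one bind instead of one per iteration
# (data.table users would call data.table::rbindlist(results))
masterdata <- do.call(rbind, results)
nrow(masterdata)
# [1] 9
```

This turns quadratic copying into a single allocation and matters once ziplist runs to thousands of entries.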
