ggmap in R: How do I extract individual location features from geocoding? - r

I'm trying to clean up user-entered addresses, so I thought using ggmap to extract the longitude/latitude and the matched address would be a way to standardize everything. However, the address it returns sometimes includes colloquial place names, which makes it hard to parse out the individual location components.
Here's the code I'm using:
for(i in 1:nrow(Raw_Address)) {
  result <- try(geocode(Raw_Address$Address_Total[i], output = "more", source = "google"))
  Raw_Address$lon[i] <- as.numeric(result[1])
  Raw_Address$lat[i] <- as.numeric(result[2])
  Raw_Address$geoAddress[i] <- as.character(result[3])
}
I tried changing "latlona" to "more" and going through the result indices, but only got back different longitudes/latitudes. I couldn't find anything in the documentation that describes the structure of the returned result.
Basically, I want Street Name, City, State, Zip, Longitude, and Latitude.
Edit: Here's an example of the data
User Input: 1651 SE TIFFANY AVE. PORT ST. LUCIE FL
GGMAP Output: martin health systems - tiffany ave., 1651 se tiffany ave, port st. lucie, fl 34952, usa
This is hard to parse because of the colloquial name. I could try to parse it with the stringr package, but that probably wouldn't cover every case. Still, it does return a consistent address even when users misspell "Tiffany" or write out "Saint" instead of "St."

Rather than using a for loop, purrr::map_dfr will iterate over a vector and rbind the resulting data frames into a single one, which is handy here. For example,
library(tidyverse)
libraries <- tribble(
  ~library,                       ~address,
  "Library of Congress",          "101 Independence Ave SE, Washington, DC 20540",
  "British Library",              "96 Euston Rd, London NW1 2DB, UK",
  "New York Public Library",      "476 5th Ave, New York, NY 10018",
  "Library and Archives Canada",  "395 Wellington St, Ottawa, ON K1A 0N4, Canada"
)

library_locations <- map_dfr(libraries$address, ggmap::geocode,
                             output = "more", source = "dsk")
This will output a lot of messages, some telling you what geocode is calling, e.g.
#> Information from URL : http://www.datasciencetoolkit.org/maps/api/geocode/json?address=101%20Independence%20Ave%20SE,%20Washington,%20DC%2020540&sensor=false
and some warning that factors are being coerced to character:
#> Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character
#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector
which they should be, so you can ignore them all. (If you really want you can write more code to make them go away, but you'll end up with the same thing.)
Combine the resulting data frames, and you get all the location data linked to your original dataset:
full_join(libraries, library_locations)
#> Joining, by = "address"
#> # A tibble: 4 x 15
#> library address lon lat type loctype north south east west
#> <chr> <chr> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Librar… 101 In… -77.0 38.9 stre… rooftop 38.9 38.9 -77.0 -77.0
#> 2 Britis… 96 Eus… -0.125 51.5 stre… rooftop 51.5 51.5 -0.124 -0.126
#> 3 New Yo… 476 5t… -74.0 40.8 stre… rooftop 40.8 40.8 -74.0 -74.0
#> 4 Librar… 395 We… -114. 60.1 coun… approx… 83.1 41.7 -52.3 -141.
#> # … with 5 more variables: street_number <chr>, route <chr>,
#> # locality <chr>, administrative_area_level_1 <chr>, country <chr>
You may notice that Data Science Toolkit has utterly failed to geocode Library and Archives Canada, for whatever reason; it's marked as a country instead of an address. Geocoders are faulty sometimes. From here, subset out whatever you don't need, as in the sketch below.
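For example, a minimal sketch of keeping just the columns relevant to the original question (column names taken from the output above; assign the join to a variable first):

library(dplyr)

library_full <- full_join(libraries, library_locations)

# keep only the pieces needed: coordinates plus the parsed address parts
library_full %>%
  select(library, lon, lat, street_number, route, locality,
         administrative_area_level_1)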
If you want even more information, you can use geocode's output = "all" method, but that returns a list you'll need to parse, which takes more work.
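If you go that route, here is a rough sketch of pulling individual components out of the "all" output. get_component is a hypothetical helper, and the structure shown assumes a Google-style response, so check what your source actually returns:

library(purrr)

raw <- ggmap::geocode("1651 SE Tiffany Ave, Port St. Lucie, FL",
                      output = "all", source = "google")

# each address component carries long_name/short_name plus a "types" vector
components <- raw$results[[1]]$address_components

get_component <- function(components, type) {
  hit <- detect(components, ~ type %in% .x$types)
  if (is.null(hit)) NA_character_ else hit$long_name
}

data.frame(
  street = get_component(components, "route"),
  city   = get_component(components, "locality"),
  state  = get_component(components, "administrative_area_level_1"),
  zip    = get_component(components, "postal_code"),
  stringsAsFactors = FALSE
)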

Related

Error in subset.default using bulk_postcode_lookup

Ultimately I want to use postcodes for all state-funded secondary schools in England, but for now I'm trying to figure out what code I will need, so I'm using a selection of just five.
I want to retrieve the coordinates (so latitude and longitude) and the lsoa value for each postcode.
pc_list <- list(postcodes = c("PE7 3BY", "ME15 9AZ", "BS21 6AH", "SG18 8JB", "M11 2NA"))
pclist1 <- bulk_postcode_lookup(pc_list)
This returns all the information about those 5 postcodes. Now I want it just to return information on those 3 variables (latitude, longitude and lsoa) that I'm interested in.
pclist2 <- subset(pclist1, select = c(longitude, latitude, lsoa))
This returns the following error.
Error in subset.default(pclist1, select = c(longitude, latitude, lsoa)) :
argument "subset" is missing, with no default
Once I am able to get this information, I want to add these 3 variables along with their relevant postcode into a new dataframe that I can perform subsequent analysis on - is this what pclist2 will be?
A slightly modified example from https://docs.ropensci.org/PostcodesioR/articles/Introduction.html#multiple-postcodes follows. (The error happens because pclist1 is a nested list with one element per postcode, not a data frame, so the call dispatches to subset.default(), which requires a subset argument.) For whatever reason I only received positive responses when I removed the spaces from the postcodes:
library(PostcodesioR)
library(purrr)
pc_list <- list(postcodes = c("PE73BY", "ME159AZ", "BS216AH", "SG188JB", "M112NA"))
pclist1 <- bulk_postcode_lookup(pc_list)
# extract 2nd list item from each response (the "result" list)
bulk_list <- lapply(pclist1, "[[", 2)
# extract list of items from results lists, return tibble / data frame
bulk_df <- map_dfr(bulk_list, `[`, c("postcode", "longitude", "latitude", "lsoa"))
Resulting tibble / data frame :
bulk_df
#> # A tibble: 5 × 4
#> postcode longitude latitude lsoa
#> <chr> <dbl> <dbl> <chr>
#> 1 PE7 3BY -0.226 52.5 Peterborough 019D
#> 2 ME15 9AZ 0.538 51.3 Maidstone 013C
#> 3 BS21 6AH -2.84 51.4 North Somerset 005A
#> 4 SG18 8JB -0.249 52.1 Central Bedfordshire 006C
#> 5 M11 2NA -2.18 53.5 Manchester 015E
Created on 2023-01-13 with reprex v2.0.2
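From there, bulk_df is exactly the kind of data frame you can carry into subsequent analysis, or join back onto your school records. A sketch, where schools is a hypothetical data frame with a postcode column formatted the same way the API returns it (with the space):

library(dplyr)

# hypothetical data frame of schools with a "postcode" column
schools_geo <- left_join(schools, bulk_df, by = "postcode")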

Tidycensus - One year ACS for All Counties in a State

Pretty simple problem, I think, but not sure of the proper solution. Have done some research on this and think I recall seeing a solution somewhere, but cannot remember where...anyway,
I want to get DP03 one-year ACS data for all Ohio counties for 2019. However, the code below only returns 39 of Ohio's 88 counties. How can I access the remaining counties?
My guess is that data is only being pulled for counties with populations greater than 60,000.
library(tidycensus)
library(tidyverse)
acs_2019 <- load_variables(2019, dataset = "acs1/profile")

DP03 <- acs_2019 %>%
  filter(str_detect(name, pattern = "^DP03")) %>%
  pull(name, label)

Ohio_county <-
  get_acs(geography = "county",
          year = 2019,
          state = "OH",
          survey = "acs1",
          variables = DP03,
          output = "wide")
This results in a table that looks like this...
Ohio_county
# A tibble: 39 x 550
GEOID NAME `Estimate!!EMPL~ `Estimate!!EMPL~ `Estimate!!EMPL~ `Estimate!!EMPL~ `Estimate!!EMPL~
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 39057 Gree~ 138295 815 138295 NA 87465
2 39043 Erie~ 61316 516 61316 NA 38013
3 39153 Summ~ 442279 1273 442279 NA 286777
4 39029 Colu~ 83317 634 83317 NA 48375
5 39099 Maho~ 188298 687 188298 NA 113806
6 39145 Scio~ 60956 588 60956 NA 29928
7 39003 Alle~ 81560 377 81560 NA 49316
8 39023 Clar~ 108730 549 108730 NA 64874
9 39093 Lora~ 250606 896 250606 NA 150136
10 39113 Mont~ 428140 954 428140 NA 267189
Pretty sure I've seen a solution somewhere, but cannot recall where.
Any help would be appreciated since it would let the office more easily pull census data rather than wading through the US Census Bureau site. Best of luck and Thank you!
My colleague, who had already pulled the data, did not specify whether the DP03 data came from the ACS 1-year survey or the ACS 5-year survey. As it turns out, it was from the ACS 5-year survey, which includes all Ohio counties, not just those with populations over 65,000. Follow the comments above for a description of how this answer was determined.
Code for all counties is here:
library(tidycensus)
library(tidyverse)
acs_2018 <- load_variables(2018, dataset = "acs5/profile")

DP03 <- acs_2018 %>%
  filter(str_detect(name, pattern = "^DP03")) %>%
  pull(name)

Ohio_county <-
  get_acs(geography = "county",
          year = 2018,
          state = "OH",
          survey = "acs5",
          variables = DP03,
          output = "wide")

Apply an API Function over 2 columns of Dataframe, Output a Third Column

Here are the first four columns of a dataframe that is 20K rows long.
# A tibble: 4 x 4
Address Address_Total lon lat
<chr> <dbl> <dbl> <dbl>
1 !500 s. dobson rd., mesa, AZ, 852… 14.1 -112. 33.4
2 # l10, jackson, MS, 39202, United… 16.1 NA NA
3 0 fletcher allen health care, bur… 300 -73.2 44.5
4 00 w main st # 110, babylon, NY, … 287. NA NA
I want to convert the geocodes in the dataframe (lon and lat values) to county codes (FIPS). I found a great script that does that using the FCC API. All you need to do is input a latitude/longitude pair:
geo2fips <- function(latitude, longitude) {
  url <- "https://geo.fcc.gov/api/census/block/find?format=json&latitude=%f&longitude=%f"
  url <- sprintf(url, latitude, longitude)
  json <- RCurl::getURL(url)
  json <- RJSONIO::fromJSON(json)
  as.character(json$County['FIPS'])
}
For instance, if I insert a combo of lat / long it comes up with this, which is perfect:
> # Orange County
> geo2fips(28.35975, -81.421988)
[1] "12095"
What I want to do is to use some member of the apply family to run geo2fips over the entire dataset from top to bottom. I would like the output to be a fifth column of my dataframe called "FIPS" or something like that, which just contains the FIPS codes.
Can anyone help? I've been at this for hours and can't get it to work. I'm sure it's just a syntax issue with the apply family, and I'm probably not passing the data frame columns correctly.
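For the record, a minimal sketch of the kind of call being asked for, using mapply over the two columns. Here df stands in for the 20K-row data frame above; note that geo2fips expects latitude first, then longitude, and 20K rows means 20K API calls, so you may want to throttle or cache:

# base R: iterate over the lat/lon columns in parallel, skipping missing coordinates
df$FIPS <- mapply(function(lat, lon) {
  if (is.na(lat) || is.na(lon)) return(NA_character_)
  geo2fips(lat, lon)
}, df$lat, df$lon)

# tidyverse equivalent
# df$FIPS <- purrr::map2_chr(df$lat, df$lon,
#                            ~ if (is.na(.x) || is.na(.y)) NA_character_ else geo2fips(.x, .y))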

Faster way to process 1.2 million JSON geolocation queries from large dataframe

I am working with the Gowalla location-based check-in dataset, which has around 6.44 million check-ins across 1.28 million unique locations. Gowalla only gives latitudes and longitudes, so I need to find the city, state, and country for each of those coordinates. From another post on Stack Overflow I put together the R code below, which queries OpenStreetMap and retrieves the relevant geographical details.
Unfortunately it takes around 1 minute to process 125 rows, which means 1.28 million rows would take days. Is there a faster way to find these details? Maybe there is some package with built-in latitudes and longitudes of world cities that I could use to look up the city name for a given lat/long, so I don't have to query online.
venueTable is a data frame with 3 columns: vid (venue id), lat (latitude), and long (longitude).
for(i in 1:nrow(venueTable)) {
  # indicator to show the current value of i on screen
  cat(paste(".", i, "."))
  # compose the reverse-geocoding URL for this venue
  url <- paste("http://nominatim.openstreetmap.org/reverse.php?format=json&lat=",
               venueTable$lat[i], "&lon=", venueTable$long[i])
  url <- gsub(' ', '', url)
  x <- fromJSON(url)
  venueTable$display_name[i] <- x$display_name
  venueTable$country[i] <- x$address$country
}
I am using the jsonlite package in R, which parses the JSON response into x, a nested list of the returned results, so I can use x$display_name or x$address$city to pull the fields I need.
My laptop has a Core i5-3230M with 8 GB RAM and a 120 GB SSD, running Windows 8.
You're going to have issues even if you persevere with time. The service you're querying allows 'an absolute maximum of one request per second', which you're already breaching. It's likely that they will throttle your requests before you reach 1.2m queries. Their website notes similar APIs for larger uses have only around 15k free daily requests.
It'd be much better for you to use an offline option. A quick search shows that there are many freely available datasets of populated places, along with their longitude and latitudes. Here's one we'll use: http://simplemaps.com/resources/world-cities-data
> library(dplyr)
> cities.data <- read.csv("world_cities.csv") %>% tbl_df
> print(cities.data)
Source: local data frame [7,322 x 9]
city city_ascii lat lng pop country iso2 iso3 province
(fctr) (fctr) (dbl) (dbl) (dbl) (fctr) (fctr) (fctr) (fctr)
1 Qal eh-ye Now Qal eh-ye 34.9830 63.1333 2997 Afghanistan AF AFG Badghis
2 Chaghcharan Chaghcharan 34.5167 65.2500 15000 Afghanistan AF AFG Ghor
3 Lashkar Gah Lashkar Gah 31.5830 64.3600 201546 Afghanistan AF AFG Hilmand
4 Zaranj Zaranj 31.1120 61.8870 49851 Afghanistan AF AFG Nimroz
5 Tarin Kowt Tarin Kowt 32.6333 65.8667 10000 Afghanistan AF AFG Uruzgan
6 Zareh Sharan Zareh Sharan 32.8500 68.4167 13737 Afghanistan AF AFG Paktika
7 Asadabad Asadabad 34.8660 71.1500 48400 Afghanistan AF AFG Kunar
8 Taloqan Taloqan 36.7300 69.5400 64256 Afghanistan AF AFG Takhar
9 Mahmud-E Eraqi Mahmud-E Eraqi 35.0167 69.3333 7407 Afghanistan AF AFG Kapisa
10 Mehtar Lam Mehtar Lam 34.6500 70.1667 17345 Afghanistan AF AFG Laghman
.. ... ... ... ... ... ... ... ... ...
It's hard to demonstrate without any actual data examples (helpful to provide!), but we can make up some toy data.
# make up toy data
> candidate.longlat <- data.frame(vid = 1:3,
                                  lat = c(12.53, -16.31, 42.87),
                                  long = c(-70.03, -48.95, 74.59))
Using the distm function in geosphere, we can calculate the distances between all your data and all the city locations at once. For you, this will make a matrix of roughly 1.28 million x 7,322, about 9.4 billion numbers, so it might take a while (you can explore parallelisation), and it may be highly memory intensive.
> install.packages("geosphere")
> library(geosphere)
# compute distance matrix using geosphere
> distance.matrix <- distm(x = candidate.longlat[,c("long", "lat")],
                           y = cities.data[,c("lng", "lat")])
It's then easy to find the closest city to each of your data points, and cbind it to your data.frame.
# work out which index in the matrix is closest to the data
> closest.index <- apply(distance.matrix, 1, which.min)
# cbind city and country of match with original query
> candidate.longlat <- cbind(candidate.longlat, cities.data[closest.index, c("city", "country")])
> print(candidate.longlat)
vid lat long city country
1 1 12.53 -70.03 Oranjestad Aruba
2 2 -16.31 -48.95 Anapolis Brazil
3 3 42.87 74.59 Bishkek Kyrgyzstan
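If the full 1.28-million-row distance matrix is too much memory, the same nearest-city lookup can be done in chunks (and the chunks handed to parallel::mclapply if you want to parallelise). A sketch, assuming venueTable is the venue table described in the question (columns vid, lat, long) and an arbitrary chunk size:

library(geosphere)

chunk_size <- 10000
row_chunks <- split(seq_len(nrow(venueTable)),
                    ceiling(seq_len(nrow(venueTable)) / chunk_size))

closest.index <- unlist(lapply(row_chunks, function(idx) {
  d <- distm(x = venueTable[idx, c("long", "lat")],
             y = cities.data[, c("lng", "lat")])
  apply(d, 1, which.min)   # nearest city for each venue in this chunk
}))

venueTable <- cbind(venueTable, cities.data[closest.index, c("city", "country")])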
Here's an alternate way using R's inherent spatial processing capabilities:
library(sp)
library(rgeos)
library(rgdal)
# world places shapefile
URL1 <- "http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_populated_places.zip"
fil1 <- basename(URL1)
if (!file.exists(fil1)) download.file(URL1, fil1)
unzip(fil1)
places <- readOGR("ne_10m_populated_places.shp", "ne_10m_populated_places",
                  stringsAsFactors=FALSE)
# some data from the other answer since you didn't provide any
URL2 <- "http://simplemaps.com/resources/files/world/world_cities.csv"
fil2 <- basename(URL2)
if (!file.exists(fil2)) download.file(URL2, fil2)
# we need the points from said dat
dat <- read.csv(fil2, stringsAsFactors=FALSE)
pts <- SpatialPoints(dat[,c("lng", "lat")], CRS(proj4string(places)))
# this is not necessary
# I just don't like the warning about longlat not being a real projection
robin <- "+proj=robin +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs"
pts <- spTransform(pts, robin)
places <- spTransform(places, robin)
# compute the distance (makes a pretty big matrix, so you should do this
# in chunks unless you have a ton of memory, or do it row-by-row)
far <- gDistance(pts, places, byid=TRUE)
# find the closest one
closest <- apply(far, 1, which.min)
# map to the fields (you may want to map to other fields)
locs <- places@data[closest, c("NAME", "ADM1NAME", "ISO_A2")]
locs[sample(nrow(locs), 10),]
## NAME ADM1NAME ISO_A2
## 3274 Szczecin West Pomeranian PL
## 1039 Balakhna Nizhegorod RU
## 1012 Chitre Herrera PA
## 3382 L'Aquila Abruzzo IT
## 1982 Dothan Alabama US
## 5159 Bayankhongor Bayanhongor MN
## 620 Deming New Mexico US
## 1907 Fort Smith Arkansas US
## 481 Dedougou Mou Houn BF
## 7169 Prague Prague CZ
It's about a minute (on my system) for ~7,500 points, so you're looking at a couple of hours versus a day or more. You can do this in parallel and probably get it done in less than an hour.
For better place resolution, you could use a very lightweight shapefile of country or Admin 1 polygons, then use a second process to do the distance from better resolution points for those geographic locations.

Zip Code Demographics in R

I could get at my goals "the long way" but am hoping to stay completely within R. I am looking to append Census demographic data by zip code to records in my database. I know that R has a few Census-based packages, but, unless I am missing something, these data do not seem to exist at the zip code level, nor is it intuitive to merge onto an existing data frame.
In short, is it possible to do this within R, or is my best approach to grab the data elsewhere and read it into R?
Any help will be greatly appreciated!
In short, no. Census to zip translations are generally created from proprietary sources.
It's unlikely that you'll find anything at the zipcode level from a census perspective (privacy). However, that doesn't mean you're left in the cold. You can use the zipcodes that you have and append census data from the MSA, muSA or CSA level. Now all you need is a listing of postal codes within your MSA, muSA or CSA so that you can merge. There's a bunch online that are pretty cheap if you don't already have such a list.
For example, in Canada, we can get income data from CRA at the FSA level (the first three digits of a postal code in the form A1A 1A1). I'm not sure what or if the IRS provides similar information, I'm also not too familiar with US Census data, but I imagine they provide information at the CSA level at the very least.
If you're bewildered by all these acronyms:
MSA: http://en.wikipedia.org/wiki/Metropolitan_Statistical_Area
CSA: http://en.wikipedia.org/wiki/Combined_statistical_area
muSA: http://en.wikipedia.org/wiki/Micropolitan_Statistical_Area
As others in this thread have mentioned, the Census Bureau American FactFinder is a free source of comprehensive and detailed data. Unfortunately, it’s not particularly easy to use in its raw format.
We’ve pulled, cleaned, consolidated, and reformatted the Census Bureau data. The details of this process and how to use the data files can be found on our team blog.
None of these tables actually have a field called "ZIP code." Rather, they have a field called "ZCTA5". A ZCTA5 (or ZCTA) can be thought of as interchangeable with a zip code given the following caveats:
There are no ZCTAs for PO Box ZIP codes - this means that for 42,000 US ZIP Codes there are 32,000 ZCTAs.
ZCTAs, which stand for Zip Code Tabulation Areas, are based on zip codes but don’t necessarily follow exact zip code boundaries. If you would like to read more about ZCTAs, please refer to this link. The Census Bureau also provides an animation that shows how ZCTAs are formed.
I just wrote an R package called totalcensus (https://github.com/GL-Li/totalcensus), with which you can easily extract any data in the decennial census and ACS surveys.
For this old question, if you still care: you can get total population (returned by default) and population by race from the national data of the 2010 decennial census or the 2015 ACS 5-year survey.
From the 2015 ACS 5-year survey: download the national data with download_census("acs5year", 2015, "US") and then:
zip_acs5 <- read_acs5year(
  year = 2015,
  states = "US",
  geo_headers = "ZCTA5",
  table_contents = c(
    "white = B02001_002",
    "black = B02001_003",
    "asian = B02001_005"
  ),
  summary_level = "860"
)
# GEOID lon lat ZCTA5 state population white black asian GEOCOMP SUMLEV NAME
# 1: 86000US01001 -72.62827 42.06233 01001 NA 17438 16014 230 639 all 860 ZCTA5 01001
# 2: 86000US01002 -72.45851 42.36398 01002 NA 29780 23333 1399 3853 all 860 ZCTA5 01002
# 3: 86000US01003 -72.52411 42.38994 01003 NA 11241 8967 699 1266 all 860 ZCTA5 01003
# 4: 86000US01005 -72.10660 42.41885 01005 NA 5201 5062 40 81 all 860 ZCTA5 01005
# 5: 86000US01007 -72.40047 42.27901 01007 NA 14838 14086 104 330 all 860 ZCTA5 01007
# ---
# 32985: 86000US99923 -130.04103 56.00232 99923 NA 13 13 0 0 all 860 ZCTA5 99923
# 32986: 86000US99925 -132.94593 55.55020 99925 NA 826 368 7 0 all 860 ZCTA5 99925
# 32987: 86000US99926 -131.47074 55.13807 99926 NA 1711 141 0 2 all 860 ZCTA5 99926
# 32988: 86000US99927 -133.45792 56.23906 99927 NA 123 114 0 0 all 860 ZCTA5 99927
# 32989: 86000US99929 -131.60683 56.41383 99929 NA 2365 1643 5 60 all 860 ZCTA5 99929
From the 2010 decennial census: download the national data with download_census("decennial", 2010, "US") and then:
zip_2010 <- read_decennial(
  year = 2010,
  states = "US",
  table_contents = c(
    "white = P0030002",
    "black = P0030003",
    "asian = P0030005"
  ),
  geo_headers = "ZCTA5",
  summary_level = "860"
)
# lon lat ZCTA5 state population white black asian GEOCOMP SUMLEV
# 1: -66.74996 18.18056 00601 NA 18570 17285 572 5 all 860
# 2: -67.17613 18.36227 00602 NA 41520 35980 2210 22 all 860
# 3: -67.11989 18.45518 00603 NA 54689 45348 4141 85 all 860
# 4: -66.93291 18.15835 00606 NA 6615 5883 314 3 all 860
# 5: -67.12587 18.29096 00610 NA 29016 23796 2083 37 all 860
# ---
# 33116: -130.04103 56.00232 99923 NA 87 79 0 0 all 860
# 33117: -132.94593 55.55020 99925 NA 819 350 2 4 all 860
# 33118: -131.47074 55.13807 99926 NA 1460 145 6 2 all 860
# 33119: -133.45792 56.23906 99927 NA 94 74 0 0 all 860
# 33120: -131.60683 56.41383 99929 NA 2338 1691 3 33 all 860
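To append these demographics to your own records, a sketch of the merge; my_records is a hypothetical data frame whose zip column is stored as 5-character strings, matching the ZCTA5 column above:

library(dplyr)

my_records_demo <- my_records %>%
  left_join(
    zip_acs5 %>% select(ZCTA5, population, white, black, asian),
    by = c("zip" = "ZCTA5")
  )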
Your best bet is probably with the U.S. Census Bureau TIGER/Line shapefiles. They have ZIP code tabulation area shapefiles (ZCTA5) for 2010 at the state level which may be sufficient for your purposes.
Census data itself can be found at American FactFinder. For example, you can get population estimates at the sub-county level (i.e. city/town), but not straightforward population estimates at the zip-code level. I don't know the details of your data set, but one solution might require the use of relationship tables that are also available as part of the TIGER/Line data, or alternatively spatially joining the place names containing the census data (subcounty shapefiles) with the ZCTA5 codes; a rough sketch of that spatial join is below.
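A rough sp-style sketch of that spatial join. The layer and field names here are assumptions based on typical TIGER/Line 2010 downloads, so check the actual files you end up with:

library(sp)
library(rgdal)

# assumed layer names; use whatever TIGER/Line shapefiles you downloaded
zcta   <- readOGR(".", "tl_2010_us_zcta510")   # ZCTA5 polygons
cousub <- readOGR(".", "tl_2010_25_cousub10")  # county subdivisions for one state, as an example

# take subdivision centroids and look up the ZCTA polygon each one falls in
centroids <- SpatialPoints(coordinates(cousub), CRS(proj4string(zcta)))
cousub$ZCTA5 <- over(centroids, zcta)$ZCTA5CE10  # ZCTA5CE10 field name: check your shapefile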
Note from the metadata: "These products are free to use in a product or publication, however acknowledgement must be given to the U.S. Census Bureau as the source."
HTH
A simple for loop to get zip-level population. You need to get an API key though, and it only covers the US for now.
library(data.table)

masterdata <- data.table()
for (z in 1:length(ziplist)) {
  print(z)
  # build the Open Data Network query for this ZIP code
  textt <- paste0("http://api.opendatanetwork.com/data/v1/values?variable=demographics.population.count&entity_id=8600000US",
                  ziplist[z],
                  "&forecast=3&describe=false&format=&app_token=YOURKEYHERE")
  # skip ZIP codes whose query errors out
  errorornot <- try(jsonlite::fromJSON(textt), silent = TRUE)
  if (is(errorornot, "try-error")) next
  data <- jsonlite::fromJSON(textt)
  data <- as.data.table(data$data)
  zipcode <- data[1, 2]
  data <- data[2:nrow(data)]
  setnames(data, c("Year", "Population", "Forecasted"))
  data[, ZipCodeQuery := zipcode]
  data[, ZipCodeData := ziplist[z]]
  masterdata <- rbind(masterdata, data)
}
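Here ziplist is assumed to be a character vector of 5-digit ZIP codes, e.g. ziplist <- c("10001", "60614", "94103"), and YOURKEYHERE is your Open Data Network app token.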
