Geocoding in R using googleway - r

I have read Batch Geocoding with googleway R
I am attempting to geocode some addresses using googleway. I want the geocodes, address, and county returned back.
Using the answer linked to above I created the following function.
geocodes<-lapply(seq_along(res),function(x) {
coordinates<-res[[x]]$results$geometry$location
df<-as.data.frame(unlist(res[[x]]$results$address_components))
address<-paste(df[1,],df[2,],sep = " ")
city<-paste0(df[3,])
county<-paste0(df[4,])
state<-paste0(df[5,])
zip<-paste0(df[7,])
coordinates<-cbind(coordinates,address,city,county,state,zip)
coordinates<-as.data.frame(coordinates)
})
Then put it back together like so...
library(data.table)
done<-rbindlist(geocodes))
The issue is getting the address and county back out from the 'res' list. The answer linked to above pulls the address from the dataframe that was sent to google and assumes the list is in the right order and there are no multiple match results back from google (in my list there seems to be a couple). Point is, taking the addresses from one file and the coordinates from another seems rather reckless and since I need the county anyway, I need a way to pull it out of google's resulting list saved in 'res'.
The issue is that some addresses have more "types" than others which means referencing by row as I did above does not work.
I also tried including rbindlist inside the function to convert the sublist into a datatable and then pull out the fields but can't quite get it to work. The issue with this approach is that actual addresses are in a vector but the 'types' field which I would use to filter or select is in a sublist.
The best way I can describe it is like this -
list <- c(long address),c(short address), types(LIST(street number, route, county, etc.))
Obviously, I'm a beginner at this. I know there's a simpler way but I am just really struggling with lists and R seems to make extensive use of them.
Edit:
I definitely recognize that I cannot rbind the whole list. I need to pull specific elements out and bind just those. A big part of the problem, in my mind, is that I do not have a great handle on indexing and manipulating lists.
Here are some addresses to try - "301 Adams St, Friendship, WI 53934, USA" has an 7X3 "address components" and corresponding "types" list of 7. Compare that to "222 S Walnut St, Appleton, WI 45911, USA" which has an address components of 9X3 and "types" list of 9. The types list needs to be connected back to the address components matrix because the types list identifies what each row of the address components matrix contains.
Then there are more complexities introduced by imperfect matches. Try "211 Grand Avenue, Rothschild, WI, 54474" and you get 2 lists, one for east grand ave and one for west grand ave. Google seems to prefer the east since that's what comes out in the "formatted address." I don't really care which is used since the county will be the same for either. The "location" interestingly contains 2 sets of geocodes which, presumably, refer to the two matches. I think this complexity can be ignored since the location consisting of two coordinates is still stored as a 'double' (not a list!) so it should stack with the coordinates for the other addresses.
Edit: This should really work but I'm getting an error in the do.call(rbind,types) line of the function.
geocodes<-lapply(seq_along(res),function(x) {
coordinates<-res[[x]]$results$geometry$location
types<-res[[x]]$results$address_components[[1]]$types
types<-do.call(rbind,types)
types<-types[,1]
address<-as.data.frame(res[[x]]$results$address_components[[1]]$long_name,strings.As.Factors=FALSE)
names(address)[1]<-"V2"
address<-cbind(address,types)
address<-tidyr::spread(address,types,V2)
address<-cbind(address,coordinates)
})
R says the "types" object is not a list so it can't rbind it. I tried coercing it to a list but still get the error. I checked using the following paired down function and found #294 is null. This halts the function. I get "over query limit" as an error but I am not over the query limit.
geocodes<-lapply(seq_along(res),function(x) {
types<-res[[x]]$results$address_components[[1]]$types
print(typeof(types))
})

Here's my solution using tidyverse functions. This gets the geocode and also the formatted address in case you want it (other components of the result can be returned as well, they just need to be added to the table in the last row of the map function that gets returned.
suppressPackageStartupMessages(require(tidyverse))
suppressPackageStartupMessages(require(googleway))
set_key("your key here")
df <- tibble(full_address = c("2379 ADDISON BLVD HIGH POINT 27262",
"1751 W LEXINGTON AVE HIGH POINT 27262", "dljknbkjs"))
df %>%
mutate(geocode_result = map(full_address, function(full_address) {
res <- google_geocode(full_address)
if(res$status == "OK") {
geo <- geocode_coordinates(res) %>% as_tibble()
formatted_address <- geocode_address(res)
geocode <- bind_cols(geo, formatted_address = formatted_address)
}
else geocode <- tibble(lat = NA, lng = NA, formatted_address = NA)
return(geocode)
})) %>%
unnest()
#> # A tibble: 3 x 4
#> full_address lat lng formatted_address
#> <chr> <dbl> <dbl> <chr>
#> 1 2379 ADDISON BLVD HIGH POI… 36.0 -80.0 2379 Addison Blvd, High Point, N…
#> 2 1751 W LEXINGTON AVE HIGH … 36.0 -80.1 1751 W Lexington Ave, High Point…
#> 3 dljknbkjs NA NA <NA>
Created on 2019-04-14 by the reprex package (v0.2.1)

Ok, I'll answer it myself.
Begin with a dataframe of addresses. I called mine "addresses" and the singular column in the dataframe is also called "Addresses" (note that I capitalized it).
Use googleway to get the geocode data. I did this using apply to loop across the rows in the address dataframe
library(googleway)
res<-apply(addresses,1,function (x){
google_geocode(address=x[['Address']], key='insert your google api key here - its free to get')
})
Here is the function I wrote to get the nested lists into a dataframe.
geocodes<-lapply(seq_along(res),function(x) {
coordinates<-res[[x]]$results$geometry$location
types<-res[[x]]$results$address_components[[1]]$types
types<-do.call(rbind,types)
types<-types[,1]
address<-as.data.frame(res[[x]]$results$address_components[[1]]$long_name,strings.As.Factors=FALSE)
names(address)[1]<-"V2"
address<-cbind(address,types)
address<-tidyr::spread(address,types,V2)
address<-cbind(address,coordinates)
})
library(data.table)
geocodes<-rbindlist(geocodes,fill=TRUE)
lapply loops along the items in the list, within the function I create a coordinates dataframe and put the geocodes there. I also wanted the other address components, particularly the county, so I also created the "types" dataframe which identifies what the items in the address are. I cbind the address items with the types, then use spread from the tidyr package to reshape the dataframe into wideformat so it's just 1 row wide. I then cbind in the lat and lon from the coordinates dataframe.
The rbindlist stacks it all back together. You could use do.call(rbind, geocodes) but rbindlist is faster.

Related

How do I preserve prexisting identifiers when geocoding a list of addresses in R?

I'm currently working with an R script set up to use RDSTK, a wrapper for the Data Science Toolkit API based on this, to geocode a list of addresses from a CSV.
The script appears to work, but the list of addresses has a preexisting unique identifier which isn't preserved in the process - the input file has two columns: id, and address. The id column, for the purposes of the geocoding process, is meaningless, but I'd like the output to retain it - that is, I'd like the output, which has three columns (address, long, and lat) to have four - id being the first.
The issue is that
The output is not in the same order as the input addresses, or doesn't appear to be, so I cannot simply tack on the column of addresses at the end, and
The output does not include nulls, so the two would not be the same number of rows in any case, even if it was the same order, and
I am not sure how to effectively tie the id column in such that it becomes a part of the geocoding process, which obviously would be the ideal solution.
Here is the script:
require("RDSTK")
library(httr)
library(rjson)
dff = read.csv("C:/Users/name/Documents/batchtestv2.csv")
data <- paste0("[",paste(paste0("\"",dff$address,"\""),collapse=","),"]")
url <- "http://www.datasciencetoolkit.org/street2coordinates"
response <- POST(url,body=data)
json <- fromJSON(content(response,type="text"))
geocode <- do.call(rbind,lapply(json, function(x) c(long=x$longitude,lat=x$latitude)))
geocode
write.csv(geocode, file = "C:/Users/name/Documents/geocodetest.csv")
And here is a sample of the output:
2633 Camino Ramon Suite 500 San Ramon California 94583 United States -121.96208 37.77027
555 Lordship Boulevard Stratford Connecticut 6615 United States -73.14098 41.16542
500 West 13th Street Fort Worth Texas 76102 United States -97.33288 32.74782
50 North Laura Street Suite 2500 Jacksonville Florida 32202 United States -81.65923 30.32733
7781 South Little Egypt Road Stanley North Carolina 28164 United States -81.00597 35.44482
Maybe the solution is extraordinarily simple and I'm just being dense - it's entirely possible (I don't have extensive experience with any particular language, so I sometimes miss obvious things) but I haven't been able to solve it.
Thanks in advance!

Julia DataFrames: Problems with Split-Apply-Combine strategy

I have some data (from a R course assignment, but that doesn't matter) that I want to use split-apply-combine strategy, but I'm having some problems. The data is on a DataFrame, called outcome, and each line represents a Hospital. Each column has an information about that hospital, like name, location, rates, etc.
My objective is to obtain the Hospital with the lowest "Mortality by Heart Attack Rate" of each State.
I was playing around with some strategies, and got a problem using the by function:
best_heart_rate(df) = sort(df, cols = :Mortality)[end,:]
best_hospitals = by(hospitals, :State, best_heart_rate)
The idea was to split the hospitals DataFrame by State, sort each of the SubDataFrames by Mortality Rate, get the lowest one, and combine the lines in a new DataFrame
But when I used this strategy, I got:
ERROR: no method nrow(SubDataFrame{Array{Int64,1}})
in sort at /home/paulo/.julia/v0.3/DataFrames/src/dataframe/sort.jl:311
in sort at /home/paulo/.julia/v0.3/DataFrames/src/dataframe/sort.jl:296
in f at none:1
in based_on at /home/paulo/.julia/v0.3/DataFrames/src/groupeddataframe/grouping.jl:144
in by at /home/paulo/.julia/v0.3/DataFrames/src/groupeddataframe/grouping.jl:202
I suppose the nrow function is not implemented for SubDataFrames, so I got an error. So I used a nastier code:
best_heart_rate(df) = (df[sortperm(df[:,:Mortality] , rev=true), :])[1,:]
best_hospitals = by(hospitals, :State, best_heart_rate)
Seems to work. But now there is a NA problem: how can I remove the rows from the SubDataFrames that have NA on the Mortality column? Is there a better strategy to accomplish my objective?
I think this might work, if I've understood you correctly:
# Let me make up some data about hospitals in states
hospitals = DataFrame(State=sample(["CA", "MA", "PA"], 10), mortality=rand(10), hospital=split("abcdefghij", ""))
hospitals[3, :mortality] = NA
# You can use the indmax function to find the index of the maximum element
by(hospitals[complete_cases(hospitals), :], :State, df -> df[indmax(df[:mortality]), [:mortality, :hospital]])
State mortality hospital
1 CA 0.9469632421111882 j
2 MA 0.7137144590022733 f
3 PA 0.8811901895164764 e

How to store trees/nested lists in R?

I have a list of boroughs and a list of localities (like this one). Each locality lies in exactly one borough. What's the best way to store this kind of hierarchical structure in R, considerung that I'd like to have a convenient and readable way of accessing these, and using this list to accumulate data on the locality-level to the borough level.
I've come up with the following:
localities <- list("Mitte" = c("Mitte", "Moabit", "Hansaviertel", "Tiergarten", "Wedding", "Gesundbrunnen",
"Friedrichshain-Kreuzberg" = c("Friedrichshain", "Kreuzberg")
)
But I am not sure if this is the most elegant and accessible way.
If I wanted to assign additional information on the localitiy-level, I could do that by replacing the c(...) by some other call, like rbind(c('0201', '0202'), c("Friedrichshain", "Kreuzberg")) if I wanted to add additional information to the borough-level (like an abbreviated name and a full name for each list), how would I do this?
Edit: For example, I'd like to condense a table like this into a borough-wise version.
Hard to know without having a better view on how you intend to use this, but I would strongly recommend moving away from a nested list structure to a data frame structure:
library(reshape2)
loc.df <- melt(localities)
This is what the molten data looks like:
value L1
1 Mitte Mitte
2 Moabit Mitte
3 Hansaviertel Mitte
4 Tiergarten Mitte
5 Wedding Mitte
6 Gesundbrunnen Mitte
7 Friedrichshain Friedrichshain-Kreuzberg
8 Kreuzberg Friedrichshain-Kreuzberg
You can then use all the standard data frame and other computations:
loc.df$population <- sample(100:500, nrow(loc.df)) # make up population
tapply(loc.df$population, loc.df$L1, mean) # population by borough
gives mean population by Borough:
Friedrichshain-Kreuzberg Mitte
278.5000 383.8333
For more complex calculations you can use data.table and dplyr
You can extract all of this data directly into a data.frame using the XML library.
library(XML)
theurl <- "http://en.wikipedia.org/wiki/Boroughs_and_localities_of_Berlin#List_of_localities"
tables<-readHTMLTable(theurl)
boroughs<-tables[[1]]$Borough
localities<-tables[c(3:14)]
names(localities) <- as.character(boroughs)
all<-do.call("rbind", localities)
#Roland, I think you will find data frames superior to lists for the reasons cited earlier, but also because there is other data on the web page you reference. Loading to a data frame will make it easy to go further if you wish. For example, making comparisons based on population density or other items provided "for free" on the page will be a snap from a data frame.

`data.table` way to select subsets based on `agrep`?

I'm trying to convert from data.frame to data.table, and need some advice on some logical indexing I am trying to do on a single column. Here is a table I have:
places <- data.table(name=c('Brisbane', 'Sydney', 'Auckland',
'New Zealand', 'Australia'),
search=c('Brisbane AU Australia',
'Sydney AU Australia',
'Auckland NZ New Zealand',
'NZ New Zealand',
'AU Australia'))
# name search
# 1: Brisbane Brisbane AU Australia
# 2: Sydney Sydney AU Australia
# 3: Auckland Auckland NZ New Zealand
# 4: New Zealand NZ New Zealand
# 5: Australia AU Australia
setkey(places, search)
I want to extract rows whose search column matches all words in a list, like so:
words <- c('AU', 'Brisbane')
hits <- places
for (w in words) {
hits <- hits[search %like% w]
}
# I end up with the 'Brisbane AU Australia' row.
I have one question:
Is there a more data.table-way to do this? It seems to me that storing hits each time seems like a data.frame way to do this.
This is subject to the caveat that I eventually want to use agrep rather than grep/%like%:
words <- c('AU', 'Bisbane') # note the mis-spelling
hits <- places
for (w in words) {
hits <- hits[agrep(w, search)]
}
I feel like this doesn't quite take advantage of data.table's capabilities and would appreciate thoughts on how to modify the code so it does.
EDIT
I want the for loop because places is quite large, and I only want to find rows that match all the words. Hence I only need to search in the results for the last word for the next word (that is, successively refine the results).
With the talk of "binary scan" vs "vector scan" in the data.table introduction (i.e. "bad way" is DT[DT$x == "R" & DT$y == "h"], "good way" is setkey(DT, x, y); DT[J("R", "h")] I just wondered if there was some way I could apply this approach here.
Mathematical.coffee, as I mentioned under comments, you can not "partial match" by setting a column (or more columns) as key column(s). That is, in the data.table places, you've set the column "search" as the key column. Here, you can fast subset by using data.table's binary search (as opposed to vector scan subsetting) by doing:
places["Brisbane AU Australia"] # binary search when "search" column is key'd
# is faster compared to:
places[search == "Brisbane AU Australia"] # vector scan
But in your case, yo require:
places["AU"]
to give all rows with has a partial match of "AU" within the key column. And this is not possible (while it's certainly a very interesting feature to have).
If the substring you're searching for by itself does not contain mismatches, then you can try splitting the search strings into separate columns. That is, the column search if split into three columns containing Brisbane, AU and Australia, then you can set the key of the data.table to the columns that contain AU and Brisbane. Then, you can query the way you mention as:
# fast subset, AU and Brisbane are entries of the two key columns
places[J("AU", "Brisbane")]
You can vectorize the agrep function to avoid looping.
Note that the result of agrep2 is a list hence the unlist call
words <- c("Bisbane", "NZ")
agrep2 <- Vectorize(agrep, vectorize.args = "pattern")
places[unlist(agrep2(words, search))]
## name search
## 1: Brisbane Brisbane AU Australia
## 2: Auckland Auckland NZ New Zealand
## 3: New Zealand NZ New Zealand

Apply a function to each row in a data.frame and append the result to the data.frame in R

It seems like I should either know how to do this or at least find the answer here or elsewhere. Unfortunately neither is working.
I have a data frame of customers where one column is their id and another column is their full address. I want to add 3 columns for each row with the lat, long and county code from a geocode lookup.
That data frame looks like
customer_id fulladdress
1 123 Main St., Anywhere, FL
2 321 Oak St., Thisplace, CA
I created a geocode function that takes the full address and returns a data frame with lat, long and county columns.
How can I apply my geocode function to each row of the data frame and append the results as 3 columns into the existing data frame so that it looks like this:
customer_id fulladdress lat long county
1 123 Main St., Anywhere, FL 33.2345 -92.3333 43754
2 321 Oak St., Thisplace, CA 25.3333 -120.333 32960
I've tried playing with apply and ddply, but I can't seem to figure out what either one is doing. I tried this with ddply but all it does is give me back the original data frame.
ddply(customers[1:3,], .(fulladdress), function(x) { geocode(x$fulladdress)})
Thanks for the help.
Thanks for putting me on the right track. Here is what finally worked:
cbind(customers, t(sapply(customers$fulladdress,geocode, USE.NAMES=F)))

Resources