I am fed up with Google's geocoding and decided to try an alternative. The Data Science Toolkit (http://www.datasciencetoolkit.org) allows you to geocode an unlimited number of addresses. R has an excellent package that serves as a wrapper for its functions (CRAN:RDSTK). The package has a function called street2coordinates() that interfaces with the Data Science Toolkit's geocoding utility.
However, the RDSTK function street2coordinates() does not work if you try to geocode something simple like "City, Country". In the following example I try to use the function to get the latitude and longitude for the city of Phoenix:
> require("RDSTK")
> street2coordinates("Phoenix+Arizona+United+States")
[1] full.address
<0 rows> (or 0-length row.names)
The utility from the data science toolkit works perfectly. This is the URL request that gives the answer:
http://www.datasciencetoolkit.org/maps/api/geocode/json?sensor=false&address=Phoenix+Arizona+United+States
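For a single address, I can reproduce that request from R like so (a quick sketch using httr and rjson; as noted above, the status comes back "OK"):
library(httr)
library(rjson)
resp <- GET("http://www.datasciencetoolkit.org/maps/api/geocode/json",
            query = list(sensor = "false", address = "Phoenix, Arizona, United States"))
fromJSON(content(resp, type = "text"))$status
# [1] "OK"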
I am interested in geocoding multiple addresses (both complete addresses and city names). I know that the Data Science Toolkit URL will handle them well.
How do I interface with the URL and get multiple latitudes and longitudes into a data frame with the addresses?
Here is a sample dataset:
dff <- data.frame(address=c(
"Birmingham, Alabama, United States",
"Mobile, Alabama, United States",
"Phoenix, Arizona, United States",
"Tucson, Arizona, United States",
"Little Rock, Arkansas, United States",
"Berkeley, California, United States",
"Duarte, California, United States",
"Encinitas, California, United States",
"La Jolla, California, United States",
"Los Angeles, California, United States",
"Orange, California, United States",
"Redwood City, California, United States",
"Sacramento, California, United States",
"San Francisco, California, United States",
"Stanford, California, United States",
"Hartford, Connecticut, United States",
"New Haven, Connecticut, United States"
))
You can do it like this:
library(httr)
library(rjson)
# build a JSON array of quoted address strings for the request body
data <- paste0("[", paste(paste0("\"", dff$address, "\""), collapse = ","), "]")
url <- "http://www.datasciencetoolkit.org/street2coordinates"
# a single POST request geocodes every address at once
response <- POST(url, body = data)
json <- fromJSON(content(response, type = "text"))
# pull long/lat out of each result; addresses with no result are dropped
geocode <- do.call(rbind, sapply(json,
                                 function(x) c(long = x$longitude, lat = x$latitude)))
geocode
# long lat
# San Francisco, California, United States -117.88536 35.18713
# Mobile, Alabama, United States -88.10318 30.70114
# La Jolla, California, United States -117.87645 33.85751
# Duarte, California, United States -118.29866 33.78659
# Little Rock, Arkansas, United States -91.20736 33.60892
# Tucson, Arizona, United States -110.97087 32.21798
# Redwood City, California, United States -117.88536 35.18713
# New Haven, Connecticut, United States -72.92751 41.36571
# Berkeley, California, United States -122.29673 37.86058
# Hartford, Connecticut, United States -72.76356 41.78516
# Sacramento, California, United States -121.55541 38.38046
# Encinitas, California, United States -116.84605 33.01693
# Birmingham, Alabama, United States -86.80190 33.45641
# Stanford, California, United States -122.16750 37.42509
# Orange, California, United States -117.85311 33.78780
# Los Angeles, California, United States -117.88536 35.18713
This takes advantage of the POST interface to the street2coordinates API (documented here), which returns all the results in one request, rather than issuing multiple GET requests.
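As an aside, the request body is just a JSON array of address strings, so it could equally be built with rjson's toJSON() instead of pasted together by hand; a minimal sketch:
library(rjson)
data <- toJSON(as.character(dff$address))
# produces the same body: ["Birmingham, Alabama, United States", ...]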
The absence of Phoenix seems to be a bug in the street2coordinates API. If you go to the API demo page and try "Phoenix, Arizona, United States", you get a null response. However, as your example shows, using their "Google-style Geocoder" does give a result for Phoenix. So here's a solution using repeated GET requests. Note that this runs much more slowly.
geo.dsk <- function(addr){ # single-address geocode with the Data Science Toolkit
  require(httr)
  require(rjson)
  # the "Google-style" geocoder endpoint: one GET request per address
  url <- "http://www.datasciencetoolkit.org/maps/api/geocode/json"
  response <- GET(url, query = list(sensor = "FALSE", address = addr))
  json <- fromJSON(content(response, type = "text"))
  # drill down to the location of the first result
  loc <- json['results'][[1]][[1]]$geometry$location
  return(c(address = addr, long = loc$lng, lat = loc$lat))
}
result <- do.call(rbind,lapply(as.character(dff$address),geo.dsk))
result <- data.frame(result)
result
# address long lat
# 1 Birmingham, Alabama, United States -86.801904 33.456412
# 2 Mobile, Alabama, United States -88.103184 30.701142
# 3 Phoenix, Arizona, United States -112.0733333 33.4483333
# 4 Tucson, Arizona, United States -110.970869 32.217975
# 5 Little Rock, Arkansas, United States -91.207356 33.608922
# 6 Berkeley, California, United States -122.29673 37.860576
# 7 Duarte, California, United States -118.298662 33.786594
# 8 Encinitas, California, United States -116.846046 33.016928
# 9 La Jolla, California, United States -117.876447 33.857515
# 10 Los Angeles, California, United States -117.885359 35.187133
# 11 Orange, California, United States -117.853112 33.787795
# 12 Redwood City, California, United States -117.885359 35.187133
# 13 Sacramento, California, United States -121.555406 38.380456
# 14 San Francisco, California, United States -117.885359 35.187133
# 15 Stanford, California, United States -122.1675 37.42509
# 16 Hartford, Connecticut, United States -72.763564 41.78516
# 17 New Haven, Connecticut, United States -72.927507 41.365709
The ggmap package includes support for geocoding with either Google or the Data Science Toolkit, the latter via their "Google-style geocoder". This is quite slow for multiple addresses, as noted in the earlier answer.
library(ggmap)
result <- geocode(as.character(dff[[1]]), source = "dsk")
print(cbind(dff, result))
# address lon lat
# 1 Birmingham, Alabama, United States -86.80190 33.45641
# 2 Mobile, Alabama, United States -88.10318 30.70114
# 3 Phoenix, Arizona, United States -112.07404 33.44838
# 4 Tucson, Arizona, United States -110.97087 32.21798
# 5 Little Rock, Arkansas, United States -91.20736 33.60892
# 6 Berkeley, California, United States -122.29673 37.86058
# 7 Duarte, California, United States -118.29866 33.78659
# 8 Encinitas, California, United States -116.84605 33.01693
# 9 La Jolla, California, United States -117.87645 33.85751
# 10 Los Angeles, California, United States -117.88536 35.18713
# 11 Orange, California, United States -117.85311 33.78780
# 12 Redwood City, California, United States -117.88536 35.18713
# 13 Sacramento, California, United States -121.55541 38.38046
# 14 San Francisco, California, United States -117.88536 35.18713
# 15 Stanford, California, United States -122.16750 37.42509
# 16 Hartford, Connecticut, United States -72.76356 41.78516
# 17 New Haven, Connecticut, United States -72.92751 41.36571
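ggmap also provides mutate_geocode(), which appends the coordinate columns to the data frame in one call; a minimal sketch, under the same source = "dsk" assumption as above:
library(ggmap)
result <- mutate_geocode(dff, address, source = "dsk")
# result is dff with lon and lat columns appended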
I have a column that contains thousands of descriptions like this (example):
Description
Building a hospital in the city of LA, USA
Building a school in the city of NYC, USA
Building shops in the city of Chicago, USA
I'd like to create a column with the first word after "city of", like this:
Description                                   City
Building a hospital in the city of LA, USA    LA
Building a school in the city of NYC, USA     NYC
Building shops in the city of Chicago, USA    Chicago
I tried the following code after seeing the topic "Extracting string after specific word", but my column is filled only with missing values:
library(stringr)
df$city <- data.frame(str_extract(df$Description, "(?<=city of:\\s)[^;]+"))
df$city <- data.frame(str_extract(df$Description, "(?<=of:\\s)[^;]+"))
I took a look at the dput() output, and it is the same as the descriptions I see in the data frame directly.
Solution
This should do the trick for the data you showed:
df$city <- str_extract(df$Description, "(?<=city of )(\\w+)")
df
#> Description city
#> 1 Building a hospital in the city of LA, USA LA
#> 2 Building a school in the city of NYC, USA NYC
#> 3 Building shops in the city of Chicago, USA Chicago
Alternative
However, if you want the whole string up to the first comma (for example, for cities with a space in their name), you can go with:
df$city <- str_extract(df$Description, "(?<=city of )(.+)(?=,)")
Check out the following example:
df <- data.frame(Description = c("Building a hospital in the city of LA, USA",
"Building a school in the city of NYC, USA",
"Building shops in the city of Chicago, USA",
"Building a church in the city of Salt Lake City, USA"))
str_extract(df$Description, "(?<=the city of )(\\w+)")
#> [1] "LA" "NYC" "Chicago" "Salt"
str_extract(df$Description, "(?<=the city of )(.+)(?=,)")
#> [1] "LA" "NYC" "Chicago" "Salt Lake City"
Documentation
Check out ?regex:
Patterns (?=...) and (?!...) are zero-width positive and negative
lookahead assertions: they match if an attempt to match the ...
forward from the current position would succeed (or not), but use up
no characters in the string being processed. Patterns (?<=...) and
(?<!...) are the lookbehind equivalents: they do not allow repetition
quantifiers nor \C in ....
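The negative variants work the same way. For example, a small sketch that matches "city" only when it is not followed by " of", so it skips the first occurrence here:
library(stringr)
str_locate("the city of LA is a big city", "city(?! of)")
#      start end
# [1,]    25  28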
I was working with the googleway package and had a bunch of addresses from which I needed to parse out various components that were buried in a nested list of lists. Loops (not encouraged) and apply functions both seemed confusing, and I was not sure whether there was a tidy solution. I found that the map function (specifically the pluck function it calls on lists behind the scenes) could accomplish my goal, so I will share my solution.
Problem:
I need to pull out certain information about the White House such as
Latitude
Longitude
You need to set up your Google Cloud API key with googleway::set_key(API_KEY); what follows is just an example of the kind of nested list that I hope someone working with this package will find useful.
# Addresses for the White House and the Lincoln Memorial
library(purrr)
library(googleway)
address_vec <- c(
  "1600 Pennsylvania Ave NW, Washington, DC 20006",
  "2 Lincoln Memorial Cir NW, Washington, DC 20002"
)
address_vec <- pmap(list(address_vec), googleway::google_geocode)
which outputs:
[[1]]
[[1]]$results
address_components
1 1600, Pennsylvania Avenue Northwest, Northwest Washington, Washington, District of Columbia, United States, 20500, 1600, Pennsylvania Avenue NW, Northwest Washington, Washington, DC, US, 20500, street_number, route, neighborhood, political, locality, political, administrative_area_level_1, political, country, political, postal_code
formatted_address geometry.bounds.northeast.lat
1 1600 Pennsylvania Avenue NW, Washington, DC 20500, USA 38.8979
geometry.bounds.northeast.lng geometry.bounds.southwest.lat geometry.bounds.southwest.lng geometry.location.lat
1 -77.03551 38.89731 -77.03796 38.89766
geometry.location.lng geometry.location_type geometry.viewport.northeast.lat geometry.viewport.northeast.lng
1 -77.03657 ROOFTOP 38.89895 -77.03539
geometry.viewport.southwest.lat geometry.viewport.southwest.lng place_id
1 38.89626 -77.03808 ChIJGVtI4by3t4kRr51d_Qm_x58
types
1 establishment, point_of_interest, premise
[[1]]$status
[1] "OK"
[[2]]
[[2]]$results
address_components
1 2, Lincoln Memorial Circle Northwest, Southwest Washington, Washington, District of Columbia, United States, 20037, 2, Lincoln Memorial Cir NW, Southwest Washington, Washington, DC, US, 20037, street_number, route, neighborhood, political, locality, political, administrative_area_level_1, political, country, political, postal_code
formatted_address geometry.location.lat geometry.location.lng
1 2 Lincoln Memorial Cir NW, Washington, DC 20037, USA 38.88927 -77.05018
geometry.location_type geometry.viewport.northeast.lat geometry.viewport.northeast.lng
1 ROOFTOP 38.89062 -77.04883
geometry.viewport.southwest.lat geometry.viewport.southwest.lng place_id
1 38.88792 -77.05152 ChIJgRuEham3t4kRFju4R6De__g
plus_code.compound_code plus_code.global_code types
1 VWQX+PW Washington, DC, USA 87C4VWQX+PW street_address
[[2]]$status
[1] "OK"
Here's some code that I got from the Googleway Vignette:
df <- google_geocode(address = "Flinders Street Station",
key = key,
simplify = TRUE)
geocode_coordinates(df)
# lat lng
# 1 -37.81827 144.9671
It looks like what you need to do is:
df <- google_geocode("1600 Pennsylvania Ave")
geocode_coordinates(df)
The solution I came up with is a custom function that can access any section of the list:
geocode_accessor <- function(df, accessor, ...) {
  # a list of accessors is converted to a pluck()-style extractor;
  # unlist() flattens the result to a vector
  unlist(map(df, list(accessor, ...)))
}
This has three important parts to understand:
The map function is calling the pluck function for us (it replaces the use of [[ ). You can read more about what is happening here, but just know this lets us access things by name
The "..." in the function's definition as well as in the list allows us to access multiple levels. Again, the use of list() to access further levels in a list is explained in the pluck documentation
The use of unlist converts the list to a vector (what I want in my instance)
Putting this all together, we can get the latitude of the White House & Lincoln Memorial:
geocode_accessor(address_vec, "results", "geometry", "location", "lat")
[1] 38.89766 38.88927
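For comparison, the same extraction can be done by handing pluck() straight to map_dbl(), which flattens to a numeric vector in one step (a sketch, using the same address_vec as above):
library(purrr)
map_dbl(address_vec, pluck, "results", "geometry", "location", "lat")
# [1] 38.89766 38.88927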
I have a list of university names input with spelling errors and inconsistencies. I need to match them against an official list of university names to link my data together.
I know fuzzy matching/join is my way to go, but I'm a bit lost on the correct method. Any help would be greatly appreciated.
d<-data.frame(name=c("University of New Yorkk", "The University of South
Carolina", "Syracuuse University", "University of South Texas",
"The University of No Carolina"), score = c(1,3,6,10,4))
y<-data.frame(name=c("University of South Texas", "The University of North
Carolina", "University of South Carolina", "Syracuse
University","University of New York"), distance = c(100, 400, 200, 20, 70))
And I want an output that matches them together as closely as possible:
matched<-data.frame(name=c("University of New Yorkk", "The University of South Carolina",
"Syracuuse University","University of South Texas","The University of No Carolina"),
correctmatch = c("University of New York", "University of South Carolina",
"Syracuse University","University of South Texas", "The University of North Carolina"))
I use adist() for things like this and have a little wrapper function called closest_match() to help compare a value against a set of "good/permitted" values.
library(magrittr) # for the %>%
closest_match <- function(bad_value, good_values) {
distances <- adist(bad_value, good_values, ignore.case = TRUE) %>%
as.numeric() %>%
setNames(good_values)
distances[distances == min(distances)] %>%
names()
}
sapply(d$name, function(x) closest_match(x, y$name)) %>%
setNames(d$name)
      University of New Yorkk  The University of South\n Carolina       Syracuuse University
     "University of New York"      "University of South Carolina"   "University of New York"
    University of South Texas   The University of No Carolina
  "University of South Texas"  "University of South Carolina"
adist() utilizes Levenshtein distance to compare similarity between two strings.
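For example, the classic illustration of how the distance counts single-character edits:
adist("kitten", "sitting")
# [1] 3   (substitute k -> s, substitute e -> i, insert g)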
I have a column State as shown below:
State
Arizona, Arizona, Arizona, Arizona,
Arizona, Arizona, Arizona, California Carmel Beach, California LBC, California Napa, Arizona
Virginia, Virginia, Virginia
.
.
.
I want to remove all duplicate words of a specific type, retaining one unique word; in this case I want to remove only the duplicate "Arizona" and "Virginia" words. The final dataset should look like this:
Result
Arizona
Arizona, California Carmel Beach, California LBC, California Napa
Virginia
.
.
.
# Create a test data vector
testin <- c(
"Arizona, Arizona, Arizona, Arizona, ",
"Arizona, Arizona, Arizona, California Carmel Beach, California LBC, California Napa, Arizona",
"Virginia, Virginia, Virginia"
)
# The names to remove if duplicated
kickDuplicates <- c("Arizona", "Virginia")
# create a list of vectors of place names
broken <- strsplit(testin, ",\\s*")
# paste each broken vector of place names back together
# .......kicking out duplicated instances of the chosen names
testout <- sapply(broken, FUN = function(x) paste(x[!duplicated(x) | !x %in% kickDuplicates ], collapse = ", "))
# see what we did
testout
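This prints the cleaned-up vector, matching the Result column in the question:
# [1] "Arizona"
# [2] "Arizona, California Carmel Beach, California LBC, California Napa"
# [3] "Virginia"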
I think this is what you want.
# `state` is the input character vector (e.g. df1$State below)
trimmed <- gsub('^\\s*', '', state)   # strip leading whitespace
trimmed <- gsub('\\s*$', '', trimmed) # strip trailing whitespace
lapply(lapply(strsplit(trimmed, '\\s*,\\s*'), unique), paste, collapse = ', ')
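With state set to the testin vector from the previous answer, this should return:
# [[1]]
# [1] "Arizona"
#
# [[2]]
# [1] "Arizona, California Carmel Beach, California LBC, California Napa"
#
# [[3]]
# [1] "Virginia"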
You could also try a single gsub to get the unique values: the pattern below deletes a word (together with its trailing ", ") whenever that same word appears again later in the string, so only the last occurrence survives. Note that the order of the elements will be different.
df1$Result <- gsub('(\\b\\S+\\b)(?=.*\\b\\1\\b.*), ', "",
df1$State, perl=TRUE)
Regex101
df1$Result
#[1] "Arizona"
#[2] "California Carmel Beach, California LBC, California Napa, Arizona"
#[3] "Virginia"
data
df1 <- structure(list(State = c("Arizona, Arizona, Arizona, Arizona",
"Arizona, Arizona, Arizona, California Carmel Beach, California LBC, California Napa, Arizona",
"Virginia, Virginia, Virginia")), .Names = "State", class = "data.frame",
row.names = c(NA, -3L))
I have a data frame with the columns city, state, and country. I want to create a string that concatenates: "City, State, Country". However, one of my cities doesn't have a state (it has an NA instead). I want the string for that city to be "City, Country". Here is the code that creates the wrong string:
# define City, State, Country
city <- c("Austin", "Knoxville", "Salk Lake City", "Prague")
state <- c("Texas", "Tennessee", "Utah", NA)
country <- c("United States", "United States", "United States", "Czech Rep")
# create data frame
dff <- data.frame(city, state, country)
# create full string
dff["string"] <- paste(city, state, country, sep=", ")
When I display dff$string, I get the following. Note that the last string contains an unwanted "NA,":
> dff["string"]
string
1 Austin, Texas, United States
2 Knoxville, Tennessee, United States
3 Salk Lake City, Utah, United States
4 Prague, NA, Czech Rep
How do I skip that NA, including its sep = ", "?
The alternative is to just fix it up afterwards:
gsub("NA, ","",dff$string)
#[1] "Austin, Texas, United States"
#[2] "Knoxville, Tennessee, United States"
#[3] "Salk Lake City, Utah, United States"
#[4] "Prague, Czech Rep"
Alternative #2 is to use apply, once you have your three-column data.frame dff (i.e. before the string column is added, otherwise that column would be pasted in too):
apply(dff, 1, function(x) paste(na.omit(x),collapse=", ") )
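Applied to the three-column dff, this should return:
# [1] "Austin, Texas, United States"
# [2] "Knoxville, Tennessee, United States"
# [3] "Salk Lake City, Utah, United States"
# [4] "Prague, Czech Rep"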
Late to the party, but tidyr's unite() provides a one-step approach (na.rm requires tidyr >= 1.0.0):
library(tidyr)
dff %>% unite("string", c(city, state, country), sep = ", ", remove = FALSE, na.rm = TRUE)
string city state country
1 Austin, Texas, United States Austin Texas United States
2 Knoxville, Tennessee, United States Knoxville Tennessee United States
3 Salk Lake City, Utah, United States Salk Lake City Utah United States
4 Prague, Czech Rep Prague <NA> Czech Rep