Iterate in R with a function that requires four vectors - r

I'm trying to find the distance between multiple cities using the distHaversine function in the geosphere package. This function requires several arguments:
The longitude and latitude of the first place.
The longitude and latitude of the second place.
The radius of the earth in whatever unit (I'm using r = 3961 for miles).
When I input this as a vector, it works easily:
HongKong <- c(114.17, 22.31)
GrandCanyon <- c(-112.11, 36.11)
distHaversine(HongKong, GrandCanyon, r=3961)
#[1] 7399.113 distance in miles
However, my actual datasets look like this:
location1 <- tibble(person = c("Sally", "Jane", "Lisa"),
current_loc = c("Bogota Colombia", "Paris France", "Hong Kong China"),
lon = c(-74.072, 2.352, 114.169),
lat = c(4.710, 48.857, 22.319))
location2 <- tibble(destination = c("Atlanta United States", "Rome Italy", "Bangkok Thailand", "Grand Canyon United States"),
lon = c(-84.388, 12.496, 100.501, -112.113),
lat = c(33.748, 41.903, 13.756, 36.107))
What I want is for there to be rows that say how far each destination is from the person's current location.
I know there has to be a way using purrr's pmap_dbl(), but I'm unable to figure it out.
Bonus points if your code uses the tidyverse and if there's any easy way to make a column that identifies the closest destination. Thank you!
In an ideal world, I would get this:
solution <- tibble(person = c("Sally", "Jane", "Lisa"),
current_loc = c("Bogota Colombia", "Paris France", "Hong Kong China"),
lon = c(-74.072, 2.352, 114.169),
lat = c(4.710, 48.857, 22.319),
dist_Atlanta = c(1000, 2000, 7000),
dist_Rome = c(2000, 500, 3000),
dist_Bangkok = c(7000, 5000, 1000),
dist_Grand = c(1500, 4000, 7500),
nearest = c("Atlanta United States", "Rome Italy", "Bangkok Thailand"))
Note: The numbers in the dist columns are random; however, they would be the output from the distHaversine() function. The name of those columns is arbitrary--it does not need to be called that. Also, if the nearest column is out of the scope of this question, I think that I can figure that one out.

distHaversine pairs the points it is given row by row; it does not compute every combination of the two inputs. To get all combinations, we can send each pair of location1 and location2 rows to the function one by one. One way using sapply would be
location1[paste0("dist_", stringr::word(location2$destination))] <-
  t(sapply(seq_len(nrow(location1)), function(i)
    sapply(seq_len(nrow(location2)), function(j) {
      distHaversine(location1[i, c("lon", "lat")], location2[j, c("lon", "lat")], r=3961)
    })))
# columns 5:8 are the four dist_ columns created above
location1$nearest <- location2$destination[apply(location1[5:8], 1, which.min)]
# A tibble: 3 x 9
# person current_loc lon lat dist_Atlanta dist_Rome dist_Bangkok dist_Grand nearest
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#1 Sally Bogota Colombia -74.1 4.71 2114. 5828. 11114. 3246. Atlanta United States
#2 Jane Paris France 2.35 48.9 4375. 687. 5871. 5329. Rome Italy
#3 Lisa Hong Kong China 114. 22.3 8380. 5768. 1075. 7399. Bangkok Thailand
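If geosphere is not at hand, the all-combinations step can also be sketched in base R. The helper haversine_miles below is a hypothetical name and not part of any package; it implements the standard haversine formula with the same r = 3961 used in the question, and outer() applies it to every pair of row indices. This is only an illustration of the idea under those assumptions, not a replacement for distHaversine.

```r
# Hypothetical helper (not from geosphere): haversine great-circle distance.
# Inputs are in degrees and are vectorised; r sets the output unit.
haversine_miles <- function(lon1, lat1, lon2, lat2, r = 3961) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

location1 <- data.frame(
  person = c("Sally", "Jane", "Lisa"),
  lon = c(-74.072, 2.352, 114.169),
  lat = c(4.710, 48.857, 22.319),
  stringsAsFactors = FALSE
)
location2 <- data.frame(
  destination = c("Atlanta United States", "Rome Italy",
                  "Bangkok Thailand", "Grand Canyon United States"),
  lon = c(-84.388, 12.496, 100.501, -112.113),
  lat = c(33.748, 41.903, 13.756, 36.107),
  stringsAsFactors = FALSE
)

# outer() expands every (i, j) index pair; rows = people, columns = destinations.
dist_mat <- outer(
  seq_len(nrow(location1)), seq_len(nrow(location2)),
  function(i, j) haversine_miles(location1$lon[i], location1$lat[i],
                                 location2$lon[j], location2$lat[j])
)
colnames(dist_mat) <- location2$destination

# Closest destination per person: index of the row-wise minimum.
nearest <- location2$destination[apply(dist_mat, 1, which.min)]
```

Because the helper is vectorised, outer() needs only one call to it, which avoids the nested sapply loops.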

Using the tidyverse and a map function from purrr, as you asked, I found a solution, all in one pipeline.
# renaming lon and lat variables in each df
location1 <- location1 %>%
rename(lon.act = lon, lat.act = lat)
location2 <- location2 %>%
rename(lon.dest = lon, lat.dest = lat)
# getting distances
merge(location1, location2, all = TRUE) %>%
group_by(person,current_loc, destination) %>%
nest() %>%
mutate( act = map(data, `[`, c("lon.act", "lat.act")),
dest = map(data, `[`, c("lon.dest", "lat.dest")),
dist = map2(act, dest, ~distHaversine(.x, .y, r = 3961))) %>%
unnest(data, dist) %>%
group_by(person) %>%
mutate(mindis = dist == min(dist))


Computing correlations between variables in 2 dataframes

I am trying to compute the correlations of the countries below with the USA. I have a relatively big dataset, with 80+ variables and 3000+ observations in my first df, so I am trying to automate this in R instead of using Excel.
I am trying to compute correlations for the countries in the first df (i.e. Germany, Italy, Japan and more) with USA in the 2nd df. So it should go Germany - USA, Italy - USA, Japan - USA and so on.
I'm not sure how to begin: should I loop over every column in the first table and correlate it with USA in the 2nd? Help is much appreciated.
(Tables not shown: the first df has one column per country, with more countries and more rows than listed here; the second df has the USA column.)
You could use this approach:
countries = c("Germany", "Italy", "Japan")
left_join(df1, df2) %>% summarise(across(countries, ~cor(., USA)))
or, as the OP did not have access to the latest version of dplyr and across():
left_join(df1, df2) %>% summarise_at(countries, ~cor(., USA))
left_join merges df1 and df2 so that the dates always match up with one another.
summarise lets you perform column-wise operations.
across selects which columns to correlate with USA.
~cor(., USA) takes each country column and computes its correlation with USA.
Germany Italy Japan
<dbl> <dbl> <dbl>
1 -0.393 -0.147 -0.214
Thank you Damien Georges for the data.
Something like that should do the trick:
df1 <-
  data.frame(
    date = 2001:2010,
    Germany = runif(10),
    Italy = runif(10),
    Japan = runif(10)
  )
df2 <-
  data.frame(
    date = 2001:2010,
    USA = runif(10)
  )
df.cor <-
df1 %>%
summarise(across(-one_of('date'), ~ cor(.x, df2$USA)))
Note: You have to be sure that the dates are consistent between df1 and df2. You can use a join function (e.g. left_join) to ensure this.
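To make the point about date alignment concrete, here is a minimal base R sketch. The toy numbers are made up for illustration (they are not the OP's data), and the dates deliberately overlap only partly; merge() keeps the common dates before each country column is correlated with USA.

```r
# Deterministic toy data (made up for illustration); dates only partly overlap.
df1 <- data.frame(date = 2001:2006,
                  Germany = c(1, 2, 3, 4, 5, 6),
                  Italy   = c(6, 5, 4, 3, 2, 1))
df2 <- data.frame(date = 2002:2007,
                  USA = c(20, 30, 40, 50, 60, 70))

# Inner join on date keeps only the rows present in both data frames.
both <- merge(df1, df2, by = "date")

# Correlate every country column with USA; the result is a named numeric vector.
country_cols <- setdiff(names(df1), "date")
cors <- sapply(both[country_cols], cor, y = both$USA)
```

Passing df2$USA directly, without the merge, would silently pair 2001 in df1 with 2002 in df2 here, which is the mismatch the note warns about.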
Here are two base R solutions, depending on the final format you want.
Both with the new pipe operator, introduced in R 4.1.0.
df2[-1] |> cor(df1[-1])
# Germany Italy Japan
#USA 0.3161338 0.5483885 0.1725733
df1[-1] |> cor(df2[-1])
#Germany 0.3161338
#Italy 0.5483885
#Japan 0.1725733
More traditional but equivalent versions:
cor(df2[-1], df1[-1])
cor(df1[-1], df2[-1])
Data creation code borrowed from Damien Georges.
df1 <-
  data.frame(
    date = 2001:2010,
    Germany = runif(10),
    Italy = runif(10),
    Japan = runif(10)
  )
df2 <-
  data.frame(
    date = 2001:2010,
    USA = runif(10)
  )

From state and county names to fips in R

I have the following data frame in R, and I would like to get fips codes for it. I tried to use the fips function in the usmap package, but I could not get the codes because the function expects the names enclosed in double quotes. Then I tried to use paste0(""", df$state, """), but that did not work either. Is there an efficient way to get fips?
> df1
state county
1 california napa
2 florida palm beach
3 florida collier
4 florida duval
I can get "\"california\"" by using dQuote. Thanks. After the conversion of each column, I tried the followings. How do I deal with this issue?
> df1$state <- dQuote(df1$state, FALSE)
> df1$county <- dQuote(df1$county, FALSE)
> fips(state = df1$state, county = df1$county)
Error in fips(state = df1$state, county = df1$county) :
`county` parameter cannot be used with multiple states.
> fips(state = df1$state[1], county = df1$county[1])
Error in fips(state = df1$state[1], county = df1$county[1]) :
"napa" is not a valid county in "california".
> fips(state = "california", county = "napa")
[1] "06055"
We can split the dataset by state and apply fips to each group, since fips accepts several counties only when they belong to a single state
lapply(split(df1, df1$state), function(x)
fips(state = x$state[1], county = x$county))
#[1] "06055"
#[1] "12099" "12021" "12031"
Or with Map
lst1 <- split(df1$county, df1$state)
Map(fips, lst1, state = names(lst1))
#[1] "06055"
#[1] "12099" "12021" "12031"
Or with tidyverse
df1 %>%
  group_by(state) %>%
  summarise(new = list(fips(state = first(state), county = county))) %>%
  unnest(c(new))
# A tibble: 4 x 2
# state new
# <chr> <chr>
#1 california 06055
#2 florida 12099
#3 florida 12021
#4 florida 12031
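If the goal is a fips column on the original data frame, the per-state results can be put back in row order with unsplit(). The sketch below does not call usmap: fips_stub is a hypothetical stand-in that mimics the "one state, many counties" restriction and returns only the four codes shown in the outputs above, so the example runs without the package. With usmap loaded, the assumption is that fips could be passed in its place.

```r
# Hypothetical stand-in for usmap::fips(): one state, a vector of counties.
# The codes are the ones shown in the outputs above.
fips_stub <- function(state, county) {
  stopifnot(length(state) == 1)
  lookup <- c("california|napa" = "06055",
              "florida|palm beach" = "12099",
              "florida|collier" = "12021",
              "florida|duval" = "12031")
  unname(lookup[paste(state, county, sep = "|")])
}

df1 <- data.frame(state = c("california", "florida", "florida", "florida"),
                  county = c("napa", "palm beach", "collier", "duval"),
                  stringsAsFactors = FALSE)

# One call per state, each with all of that state's counties.
groups <- split(df1$county, df1$state)
codes <- Map(fips_stub, state = names(groups), county = groups)

# unsplit() reverses split(), restoring the original row order.
df1$fips <- unsplit(codes, df1$state)
```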
df1 <- structure(list(state = c("california", "florida", "florida",
"florida"), county = c("napa", "palm beach", "collier", "duval"
)), class = "data.frame", row.names = c("1", "2", "3", "4"))

extract country names (or other entity) from column

I have a data.frame containing countries and cities in the column location, and I want to extract the former by matching with the world.cities$country.etc dataframe from library(maps) (or any other collection of country names).
Consider this example:
df <- data.frame(location = c("Aarup, Denmark",
                              "Switzerland",
                              "Estonia: Aaspere"),
                 other_col = c(2,3,4))
I attempted using this code
df %>% extract(location,
into = c("country", "rest_location"),
remove = FALSE,
function(x) x[which x %in% world.cities$country.etc])
But am not successful; I expect something like this:
location other_col country rest_location
1 Aarup, Denmark 2 Denmark Aarup,
2 Switzerland 3 Switzerland
3 Estonia: Aaspere 4 Estonia : Aaspere
We can create a pattern of all country names by pasting them together and use str_extract_all to get all the country names which match the pattern in location and remove the words which match the country names to get rest_location.
all_countries <- str_c(unique(world.cities$country.etc), collapse = "|")
df$country <- sapply(str_extract_all(df$location, all_countries), toString)
df$rest_location <- str_remove_all(df$location, all_countries)
#OR can also do
#df$rest_location <- str_remove_all(df$location, df$country)
# location other_col country rest_location
#1 Aarup, Denmark 2 Denmark Aarup,
#2 Switzerland 3 Switzerland
#3 Estonia: Aaspere 4 Estonia : Aaspere
sapply and toString are used for country so that, if a location contains more than one country name, they are all concatenated into one string.
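The same pattern-based idea can be sketched in base R with regmatches() and gregexpr(). To keep the example self-contained, the short countries vector below is a stand-in for unique(world.cities$country.etc); it is an assumption for illustration, not the real list.

```r
# Stand-in for unique(world.cities$country.etc) (assumed for illustration).
countries <- c("Denmark", "Switzerland", "Estonia")
location <- c("Aarup, Denmark", "Switzerland", "Estonia: Aaspere")

# One alternation pattern that matches any of the country names.
pattern <- paste(countries, collapse = "|")

# All matches per element, collapsed to a single string each.
country <- sapply(regmatches(location, gregexpr(pattern, location)), toString)

# Whatever is left once the country names are removed.
rest_location <- trimws(gsub(pattern, "", location))
```

As in the stringr version, punctuation stays in rest_location; a further gsub("[[:punct:]]", "", ...) would strip it if needed.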
You can try this as a starting point:
df %>%
rownames_to_column() %>%
separate_rows(location) %>%
mutate(gr = location %in% world.cities$country.etc) %>%
mutate(gr = ifelse(gr, "country", "rest_location")) %>%
spread(gr, location) %>%
right_join(df %>%
             rownames_to_column(),
           by = c("rowname", "other_col")) %>%
select(location, other_col, country, rest_location)
location other_col country rest_location
1 Aarup, Denmark 2 Denmark Aarup
2 Switzerland 3 Switzerland <NA>
3 Estonia: Aaspere 4 Estonia Aaspere
Of note, this only works if there are only two "words" in the location column. If necessary, you have to specify a suitable separator, e.g. sep=",|:".
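The effect of choosing the separator can be seen in a small base R sketch. It splits only on a comma or colon (plus any following spaces) instead of on every non-word character, so a multi-word place name would stay in one piece. The countries vector is again a stand-in for world.cities$country.etc, assumed for illustration.

```r
# Stand-in for world.cities$country.etc (assumed for illustration).
countries <- c("Denmark", "Switzerland", "Estonia")
location <- c("Aarup, Denmark", "Switzerland", "Estonia: Aaspere")

# Split on "," or ":" followed by optional whitespace, not on spaces alone.
parts <- strsplit(location, "[,:][[:space:]]*")

# Classify each piece as a country or as the remaining location.
country <- sapply(parts, function(p) toString(p[p %in% countries]))
rest_location <- sapply(parts, function(p) toString(p[!p %in% countries]))
```

Unlike the pipeline above, a location with no remainder yields an empty string here rather than NA.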
Base R (not including maps package):
# Import the library:
library(maps)
# Split the string on the spaces:
country_city_vec <- strsplit(df$location, "\\s+")
# Replicate the other col's rows by the split string vec:
rolled_out_df <- data.frame(other_col = rep(df$other_col, sapply(country_city_vec, length)),
location = gsub("[[:punct:]]", "", unlist(country_city_vec)), stringsAsFactors = F)
# Match with the world df:
matched_with_world_df <- merge(df,
setNames(rolled_out_df[rolled_out_df$location %in% world.cities$country.etc,],
c("other_col", "country")),
by = "other_col", all.x = T)
# Extract the city/location drilldown:
matched_with_world_df$rest_location <- trimws(gsub("[[:punct:]]", "",
  gsub(paste0(matched_with_world_df$country, collapse = "|"),
       "", matched_with_world_df$location)), "both")

How can I fuzzy string match multiple strings from different sized data frames?

I would like to match the strings from my first dataset with all of their closest common matches.
Data looks like:
New York
Te xas
Fl0 rida
New york
new york
desired result is:
col_1 col_2 col_3 col4
California Californiia callifoornia
Texas T3xas texas Te xas
Florida folrida Fl0 rida
New York New york new york
The question is:
How do I search for common strings between the first dataset and the
second dataset, and generate a list of terms in the second dataset
that align with each term in the first?
Thanks in advance.
library(fuzzyjoin); library(tidyverse)
dataset1 %>%
  stringdist_left_join(dataset2,
                       max_dist = 3) %>%
rename(col_1 = "states.x") %>%
group_by(col_1) %>%
mutate(col = paste0("col_", row_number() + 1)) %>%
spread(col, states.y)
#Joining by: "states"
## A tibble: 4 x 4
## Groups: col_1 [4]
# col_1 col_2 col_3 col_4
# <chr> <chr> <chr> <chr>
#1 California Californiia callifoornia NA
#2 Florida Fl0 rida folrida NA
#3 New York New york new york NA
#4 Texas T3xas Te xas texas
dataset1 <- data.frame(states = c("California",
                                  "Texas",
                                  "Florida",
                                  "New York"),
                       stringsAsFactors = F)
dataset2 <- data.frame(stringsAsFactors = F,
                       states = c(
                         "Californiia",
                         "callifoornia",
                         "T3xas",
                         "texas",
                         "folrida",
                         "Te xas",
                         "Fl0 rida",
                         "New york",
                         "new york"
                       ))
I read a bit about stringdist and came up with this. It's a workaround, but I like it. Can definitely be improved:
library(stringdist)
library(janitor) # for remove_constant()
ds1a <- read.csv('dataset1')
ds2a <- read.csv('dataset2')
# rows = strings from dataset 2, columns = strings from dataset 1
distancematrix <- stringdistmatrix(ds2a$name, ds1a$name, useNames = TRUE)
df <- data.frame(distancematrix, check.names = FALSE)
# go through this df, and every cell that's < 4, replace with the column name, otherwise replace with empty string
for (j in seq_len(ncol(df))) {
  trigger <- df[, j] < 4
  df[trigger, j] <- names(df)[j]
  df[!trigger, j] <- ""
}
df <- remove_constant(df)
write.csv(df, file="~/Desktop/df.csv")
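The distance-matrix idea can also be sketched without extra packages, using adist() from the utils package that ships with R. It computes a case-sensitive Levenshtein distance, which may differ from the metric used by stringdist or fuzzyjoin, so borderline matches can change. The threshold of 3 mirrors max_dist above, and the strings are the ones from the desired result in the question.

```r
canonical <- c("California", "Texas", "Florida", "New York")
variants <- c("Californiia", "callifoornia", "T3xas", "texas", "Te xas",
              "folrida", "Fl0 rida", "New york", "new york")

# adist() returns a matrix of edit distances: rows = canonical, columns = variants.
d <- adist(canonical, variants)

# For each canonical name, keep the variants within the threshold.
max_dist <- 3
matches <- lapply(seq_along(canonical),
                  function(i) variants[d[i, ] <= max_dist])
names(matches) <- canonical
```

The result is a named list rather than a wide table, which avoids the NA padding when terms have different numbers of matches.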

Using a variable number of groups with do in function

I would like to understand if and how this could be achieved using the tidyverse framework.
Assume I have the following simple function:
my_fn <- function(list_char) {
  data.frame(comma_separated = rep(paste0(list_char, collapse = ","),2),
             second_col = "test",
             stringsAsFactors = FALSE)
}
Given the below list:
list_char <- list(name = "Chris", city = "London", language = "R")
my function works fine if you run:
my_fn(list_char)
However, if we replace some of the list's elements with character vectors, we can use the dplyr::do function in the following way:
list_char_mult <- list(name = c("Chris", "Mike"),
city = c("New York", "London"), language = "R")
expand.grid(list_char_mult, stringsAsFactors = FALSE) %>%
tbl_df() %>%
group_by_all() %>%
do(my_fn(list(name = .$name, city = .$city, language = "R")))
The question is how to write a function that could do this for a list with a variable number of elements. For example:
my_fn_generic <- function(list_char_mult) {
  expand.grid(list_char_mult, stringsAsFactors = FALSE) %>%
    tbl_df() %>%
    group_by_all() %>%
    do(...) # what should go here so that it works for any number of elements?
}
Regarding how to use the function with variable number of arguments
my_fn_generic <- function(list_char) {
  expand.grid(list_char, stringsAsFactors = FALSE) %>%
    tbl_df() %>%
    group_by_all() %>%
    do(do.call(my_fn, list(.)))
}
my_fn_generic(list_char_mult)
# A tibble: 4 x 4
# Groups: name, city, language [4]
# name city language comma_separated
# <chr> <chr> <chr> <chr>
#1 Chris London R Chris,London,R
#2 Chris New York R Chris,New York,R
#3 Mike London R Mike,London,R
#4 Mike New York R Mike,New York,R
Or use the pmap
list_char_mult %>%
expand.grid(., stringsAsFactors = FALSE) %>%
mutate(comma_separated = purrr::pmap_chr(.l = ., .f = paste, sep=", ") )
# name city language comma_separated
#1 Chris New York R Chris, New York, R
#2 Mike New York R Mike, New York, R
#3 Chris London R Chris, London, R
#4 Mike London R Mike, London, R
If I understand your question, you could use apply without grouping:
expand.grid(list_char_mult, stringsAsFactors = FALSE) %>%
mutate(comma_separated = apply(., 1, paste, collapse=","))
expand.grid(list_char_mult, stringsAsFactors = FALSE) %>%
mutate(comma_separated = apply(., 1, my_fn))
name city language comma_separated
1 Chris London R Chris,London,R
2 Chris New York R Chris,New York,R
3 Mike London R Mike,London,R
4 Mike New York R Mike,New York,R
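For completeness, the "any number of elements" part can be sketched in base R with do.call(), which passes however many columns expand.grid() produces to paste(). The function name combine_all is hypothetical, chosen for this example only.

```r
# Hypothetical helper: works for a list with any number of elements.
combine_all <- function(list_char, sep = ",") {
  grid <- expand.grid(list_char, stringsAsFactors = FALSE)
  # unname() so that a column called "sep" or "collapse" cannot be
  # mistaken for an argument of paste().
  grid$comma_separated <- do.call(paste, c(unname(as.list(grid)), sep = sep))
  grid
}

list_char_mult <- list(name = c("Chris", "Mike"),
                       city = c("New York", "London"), language = "R")
res <- combine_all(list_char_mult)

# The same function handles a shorter list without changes.
res2 <- combine_all(list(name = c("Chris", "Mike"), language = "R"))
```

Since paste() is vectorised over rows, no grouping or row-wise apply is needed.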
