Iterate in R with a function that requires four vectors

I'm trying to find the distance between multiple cities using the distHaversine function in the geosphere package. This function requires several arguments:
The longitude and latitude of the first place.
The longitude and latitude of the second place.
The radius of the earth in whatever unit you choose (I'm using r = 3961 for miles).
When I input the coordinates as plain vectors, it works easily:
HongKong <- c(114.17, 22.31)
GrandCanyon <- c(-112.11, 36.11)
library(geosphere)
distHaversine(HongKong, GrandCanyon, r=3961)
#[1] 7399.113 distance in miles
However, my actual datasets look like this:
library(dplyr)
location1 <- tibble(person = c("Sally", "Jane", "Lisa"),
                    current_loc = c("Bogota Colombia", "Paris France", "Hong Kong China"),
                    lon = c(-74.072, 2.352, 114.169),
                    lat = c(4.710, 48.857, 22.319))
location2 <- tibble(destination = c("Atlanta United States", "Rome Italy", "Bangkok Thailand", "Grand Canyon United States"),
                    lon = c(-84.388, 12.496, 100.501, -112.113),
                    lat = c(33.748, 41.903, 13.756, 36.107))
What I want is a column for each destination that says how far it is from the person's current location.
I know there has to be a way using purrr's pmap_dbl(), but I'm unable to figure it out.
Bonus points if your code uses the tidyverse and if there's any easy way to make a column that identifies the closest destination. Thank you!
In an ideal world, I would get this:
solution <- tibble(person = c("Sally", "Jane", "Lisa"),
                   current_loc = c("Bogota Colombia", "Paris France", "Hong Kong China"),
                   lon = c(-74.072, 2.352, 114.169),
                   lat = c(4.710, 48.857, 22.319),
                   dist_Atlanta = c(1000, 2000, 7000),
                   dist_Rome = c(2000, 500, 3000),
                   dist_Bangkok = c(7000, 5000, 1000),
                   dist_Grand = c(1500, 4000, 7500),
                   nearest = c("Atlanta United States", "Rome Italy", "Bangkok Thailand"))
Note: The numbers in the dist columns are random; however, they would be the output of the distHaversine() function. The names of those columns are arbitrary; they do not need to be called that. Also, if the nearest column is out of the scope of this question, I think I can figure that one out.

distHaversine() takes one pair of points at a time, so we need to send every combination of location1 and location2 rows to the function, one by one. One way using sapply would be:
library(geosphere)
location1[paste0("dist_", stringr::word(location2$destination))] <-
  t(sapply(seq_len(nrow(location1)), function(i)
    sapply(seq_len(nrow(location2)), function(j) {
      distHaversine(location1[i, c("lon", "lat")], location2[j, c("lon", "lat")], r = 3961)
    })))
location1$nearest <- location2$destination[apply(location1[5:8], 1, which.min)]
location1
# A tibble: 3 x 9
# person current_loc lon lat dist_Atlanta dist_Rome dist_Bangkok dist_Grand nearest
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#1 Sally Bogota Colombia -74.1 4.71 2114. 5828. 11114. 3246. Atlanta United States
#2 Jane Paris France 2.35 48.9 4375. 687. 5871. 5329. Rome Italy
#3 Lisa Hong Kong China 114. 22.3 8380. 5768. 1075. 7399. Bangkok Thailand
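Since the question mentions purrr's pmap_dbl(), here is a hedged sketch of the same computation with it; the lon1/lat1/lon2/lat2 names and the crossing() step are my own, not from the original data:
library(tidyverse)
library(geosphere)
# build every person/destination pair, then feed each coordinate quadruple
# to distHaversine(); assumes the location1/location2 tibbles defined above
crossing(rename(location1, lon1 = lon, lat1 = lat),
         rename(location2, lon2 = lon, lat2 = lat)) %>%
  mutate(dist = pmap_dbl(list(lon1, lat1, lon2, lat2),
                         ~ distHaversine(c(..1, ..2), c(..3, ..4), r = 3961))) %>%
  group_by(person) %>%
  mutate(nearest = destination[which.min(dist)]) %>%
  ungroup()
This keeps the result in long format; pivot_wider() would turn it back into one dist column per destination.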

Using the tidyverse and a map function from purrr, as you asked, I found a solution, all in one pipeline.
library(tidyverse)
library(geosphere)
# renaming lon and lat variables in each df
location1 <- location1 %>%
  rename(lon.act = lon, lat.act = lat)
location2 <- location2 %>%
  rename(lon.dest = lon, lat.dest = lat)
# getting distances
merge(location1, location2, all = TRUE) %>%
  group_by(person, current_loc, destination) %>%
  nest() %>%
  mutate(act = map(data, `[`, c("lon.act", "lat.act")) %>%
           map(as.numeric),
         dest = map(data, `[`, c("lon.dest", "lat.dest")) %>%
           map(as.numeric),
         dist = map2(act, dest, ~ distHaversine(.x, .y, r = 3961))) %>%
  unnest(c(data, dist)) %>%
  group_by(person) %>%
  mutate(mindis = dist == min(dist))
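If you want the wide dist_* columns from the question, a hedged follow-up sketch (assuming the piped result above is saved in a variable, here called res, a name of my own):
res %>%
  select(person, current_loc, destination, dist) %>%
  pivot_wider(names_from = destination, values_from = dist, names_prefix = "dist_")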

Related

Computing correlations between variables in 2 dataframes

I am trying to compute the correlations of the countries below with USA. I have a relatively big dataset with 80+ variables and 3000+ observations in my first df, so I am trying to use R to automate this instead of using Excel.
I am trying to compute correlations for the countries in the first df (i.e. Germany, Italy, Japan, and more) with USA in the 2nd df. So it should go Germany - USA, Italy - USA, Japan - USA, and so on.
I am not too sure how I should begin. Should I loop over every column in the first table to correlate with USA in the 2nd? Help is much appreciated.
Thanks!
df1

Date        Germany  Italy  Japan  More countries...
01-01-2020  1000     200    2304   More numbers...
01-02-2020  2000     389    2098   More numbers...
and on and on

df2

Date        USA
01-01-2020  500
01-02-2020  600
and on and on
You could use this approach:
library(dplyr); library(magrittr)
countries <- c("Germany", "Italy", "Japan")
left_join(df1, df2) %>% summarise(across(all_of(countries), ~ cor(.x, USA)))
or, if you do not have a recent version of dplyr with across():
left_join(df1, df2) %>% summarise_at(countries, ~ cor(., USA))
left_join merges df1 and df2 so that the dates always match up with one another.
summarise allows you to perform column-wise operations.
across selects which columns you want to correlate with USA.
~ cor(., USA) says: take each country column and compute its correlation with USA.
Germany Italy Japan
<dbl> <dbl> <dbl>
1 -0.393 -0.147 -0.214
Thank you Damien Georges for the data.
Something like this should do the trick:
library(dplyr)
df1 <-
  tibble(
    date = 2001:2010,
    Germany = runif(10),
    Italy = runif(10),
    Japan = runif(10)
  )
df2 <-
  tibble(
    date = 2001:2010,
    USA = runif(10)
  )
df.cor <-
  df1 %>%
  summarise(across(-one_of('date'), ~ cor(.x, df2$USA)))
df.cor
Note: you have to be sure that the dates are consistent between df1 and df2. You can use a join function (e.g. left_join) to ensure this, as sketched below.
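For example, a minimal sketch of that guard (assuming both data frames share a date column):
df.cor <-
  df1 %>%
  left_join(df2, by = "date") %>%
  summarise(across(c(Germany, Italy, Japan), ~ cor(.x, USA)))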
Here are two base R solutions, depending on the final format you want. Both use the native pipe operator, introduced in R 4.1.0.
df2[-1] |> cor(df1[-1]) |> as.data.frame()
# Germany Italy Japan
#USA 0.3161338 0.5483885 0.1725733
df1[-1] |> cor(df2[-1]) |> as.data.frame()
# USA
#Germany 0.3161338
#Italy 0.5483885
#Japan 0.1725733
More traditional but equivalent versions:
as.data.frame(cor(df2[-1], df1[-1]))
as.data.frame(cor(df1[-1], df2[-1]))
Data
Data creation code borrowed from Damien Georges.
set.seed(2021)
df1 <-
  data.frame(
    date = 2001:2010,
    Germany = runif(10),
    Italy = runif(10),
    Japan = runif(10)
  )
df2 <-
  data.frame(
    date = 2001:2010,
    USA = runif(10)
  )

From state and county names to fips in R

I have the following data frame in R. I would like to get fips from this dataset. I tried to use the fips function in usmap (https://rdrr.io/cran/usmap/man/fips.html), but I could not get fips from it because I need to enclose each name in double quotes. Then I tried to use paste0('"', df$state, '"'), but I could not get it to work. Is there an efficient way to get fips?
> df1
state county
1 california napa
2 florida palm beach
3 florida collier
4 florida duval
UPDATE
I can get "\"california\"" by using dQuote. Thanks. After the conversion of each column, I tried the following. How do I deal with this issue?
> df1$state <- dQuote(df1$state, FALSE)
> df1$county <- dQuote(df1$county, FALSE)
> fips(state = df1$state, county = df1$county)
Error in fips(state = df1$state, county = df1$county) :
`county` parameter cannot be used with multiple states.
> fips(state = df1$state[1], county = df1$county[1])
Error in fips(state = df1$state[1], county = df1$county[1]) :
"napa" is not a valid county in "california".
> fips(state = "california", county = "napa")
[1] "06055"
We can split the dataset by state and apply fips to each group:
library(usmap)
lapply(split(df1, df1$state), function(x)
  fips(state = x$state[1], county = x$county))
#$california
#[1] "06055"
#$florida
#[1] "12099" "12021" "12031"
Or with Map
lst1 <- split(df1$county, df1$state)
Map(fips, lst1, state = names(lst1))
#$california
#[1] "06055"
#$florida
#[1] "12099" "12021" "12031"
Or with the tidyverse:
library(dplyr)
library(tidyr)
df1 %>%
  group_by(state) %>%
  summarise(new = list(fips(state = first(state), county = county))) %>%
  unnest(c(new))
# A tibble: 4 x 2
# state new
# <chr> <chr>
#1 california 06055
#2 florida 12099
#3 florida 12021
#4 florida 12031
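A rowwise sketch is also possible, since fips() accepts a single state/county pair at a time (as the question's last call shows); this variant is my own, not from the original answers:
library(dplyr)
library(usmap)
# call fips() once per row, each time with one state and one county
df1 %>%
  rowwise() %>%
  mutate(fips = fips(state = state, county = county)) %>%
  ungroup()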
data
df1 <- structure(list(state = c("california", "florida", "florida",
"florida"), county = c("napa", "palm beach", "collier", "duval"
)), class = "data.frame", row.names = c("1", "2", "3", "4"))

extract country names (or other entity) from column

I have a data.frame containing countries and cities in the column location, and I want to extract the former by matching against the world.cities$country.etc column from library(maps) (or any other collection of country names).
Consider this example:
df <- data.frame(location = c("Aarup, Denmark",
                              "Switzerland",
                              "Estonia: Aaspere"),
                 other_col = c(2, 3, 4))
I attempted using this code
df %>% extract(location,
               into = c("country", "rest_location"),
               remove = FALSE,
               function(x) x[which(x %in% world.cities$country.etc)])
but I am not successful; I expect something like this:
location other_col country rest_location
1 Aarup, Denmark 2 Denmark Aarup,
2 Switzerland 3 Switzerland
3 Estonia: Aaspere 4 Estonia : Aaspere
We can create a pattern of all country names by pasting them together with "|". We can then use str_extract_all to get all the country names that match the pattern in location, and remove those matches to get rest_location.
library(maps)
library(stringr)
all_countries <- str_c(unique(world.cities$country.etc), collapse = "|")
df$country <- sapply(str_extract_all(df$location, all_countries), toString)
df$rest_location <- str_remove_all(df$location, all_countries)
#OR can also do
#df$rest_location <- str_remove_all(df$location, df$country)
df
# location other_col country rest_location
#1 Aarup, Denmark 2 Denmark Aarup,
#2 Switzerland 3 Switzerland
#3 Estonia: Aaspere 4 Estonia : Aaspere
We use sapply and toString for country because, if location contains more than one country name, they are all concatenated into one string.
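For instance, a hypothetical location naming two countries collapses into one comma-separated value (using the all_countries pattern built above):
sapply(str_extract_all("Aarup, Denmark and Switzerland", all_countries), toString)
#[1] "Denmark, Switzerland"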
You can try this as a starting point:
library(tidyverse)
df %>%
  rownames_to_column() %>%
  separate_rows(location) %>%
  mutate(gr = location %in% world.cities$country.etc) %>%
  mutate(gr = ifelse(gr, "country", "rest_location")) %>%
  spread(gr, location) %>%
  right_join(df %>%
               rownames_to_column(),
             by = c("rowname", "other_col")) %>%
  select(location, other_col, country, rest_location)
location other_col country rest_location
1 Aarup, Denmark 2 Denmark Aarup
2 Switzerland 3 Switzerland <NA>
3 Estonia: Aaspere 4 Estonia Aaspere
Of note, this only works if there are only two "words" in the location column. If necessary, you have to specify a suitable separator, e.g. sep = ",|:", as sketched below.
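A hedged sketch of that tweak; the trimws() call is my addition, to strip the spaces left over after splitting on "," or ":":
df %>%
  separate_rows(location, sep = ",|:") %>%
  mutate(location = trimws(location))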
Base R (apart from the maps package):
# Import the library:
library(maps)
# Split the string on the spaces:
country_city_vec <- strsplit(df$location, "\\s+")
# Replicate the other col's rows by the split string vec:
rolled_out_df <- data.frame(other_col = rep(df$other_col, sapply(country_city_vec, length)),
                            location = gsub("[[:punct:]]", "", unlist(country_city_vec)),
                            stringsAsFactors = FALSE)
# Match with the world df:
matched_with_world_df <- merge(df,
                               setNames(rolled_out_df[rolled_out_df$location %in% world.cities$country.etc, ],
                                        c("other_col", "country")),
                               by = "other_col", all.x = TRUE)
# Extract the city/location drilldown:
matched_with_world_df$rest_location <- trimws(gsub("[[:punct:]]", "",
                                                   gsub(paste0(matched_with_world_df$country, collapse = "|"),
                                                        "", matched_with_world_df$location)),
                                              "both")

How can I fuzzy string match multiple strings from different sized data frames?

I would like to match the strings from my first dataset with all of their closest common matches.
Data looks like:
dataset1:
California
Texas
Florida
New York
dataset2:
Californiia
callifoornia
T3xas
Te xas
texas
Fl0 rida
folrida
New york
new york
desired result is:
col_1      col_2       col_3        col_4
California Californiia callifoornia
Texas      T3xas       texas        Te xas
Florida    folrida     Fl0 rida
New York   New york    new york
The question is: How do I search for common strings between the first dataset and the second dataset, and generate a list of terms in the second dataset that align with each term in the first?
Thanks in advance.
library(fuzzyjoin); library(tidyverse)
dataset1 %>%
  stringdist_left_join(dataset2,
                       max_dist = 3) %>%
  rename(col_1 = "states.x") %>%
  group_by(col_1) %>%
  mutate(col = paste0("col_", row_number() + 1)) %>%
  spread(col, states.y)
#Joining by: "states"
## A tibble: 4 x 4
## Groups: col_1 [4]
# col_1 col_2 col_3 col_4
# <chr> <chr> <chr> <chr>
#1 California Californiia callifoornia NA
#2 Florida Fl0 rida folrida NA
#3 New York New york new york NA
#4 Texas T3xas Te xas texas
data:
dataset1 <- data.frame(states = c("California",
                                  "Texas",
                                  "Florida",
                                  "New York"),
                       stringsAsFactors = FALSE)
dataset2 <- data.frame(stringsAsFactors = FALSE,
                       states = c("Californiia",
                                  "callifoornia",
                                  "T3xas",
                                  "Te xas",
                                  "texas",
                                  "Fl0 rida",
                                  "folrida",
                                  "New york",
                                  "new york"))
I read a bit about stringdist and came up with this. It's a workaround, but I like it. It can definitely be improved:
library(stringdist)
library(janitor)
ds1a <- read.csv('dataset1')
ds2a <- read.csv('dataset2')
distancematrix <- stringdistmatrix(ds2a$name, ds1a$name, useNames = TRUE)
df <- as.data.frame(distancematrix)
# go through this df; every cell that is < 4 is replaced with the column name, otherwise with an empty string
for (j in 1:ncol(df)) {
  trigger <- df[, j] < 4
  df[trigger, j] <- names(df)[j]
  df[!trigger, j] <- ""
}
df <- remove_constant(df)
write.csv(df, file="~/Desktop/df.csv")
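A simpler route (my own suggestion, not part of the original answer) is stringdist::amatch(), which returns, for each messy string, the index of its closest match in a lookup table within a maximum distance:
library(stringdist)
# maxDist = 4 is an assumption; tune it to your data
idx <- amatch(dataset2$states, dataset1$states, maxDist = 4)
# group the messy names under the clean name each one matched
split(dataset2$states, dataset1$states[idx])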

Using a variable number of groups with do in function

I would like to understand if and how this could be achieved using the tidyverse framework.
Assume I have the following simple function:
my_fn <- function(list_char) {
  data.frame(comma_separated = rep(paste0(list_char, collapse = ","), 2),
             second_col = "test",
             stringsAsFactors = FALSE)
}
Given the below list:
list_char <- list(name = "Chris", city = "London", language = "R")
my function works fine if you run:
my_fn(list_char)
However, if we replace some of the list's elements with character vectors, we can use the dplyr::do function in the following way:
list_char_mult <- list(name = c("Chris", "Mike"),
                       city = c("New York", "London"),
                       language = "R")
expand.grid(list_char_mult, stringsAsFactors = FALSE) %>%
  tbl_df() %>%
  group_by_all() %>%
  do(my_fn(list(name = .$name, city = .$city, language = "R")))
The question is how to write a function that could do this for a list with a variable number of elements. For example:
my_fn_generic <- function(list_char_mult) {
  expand.grid(list_char_mult, stringsAsFactors = FALSE) %>%
    tbl_df() %>%
    group_by_all() %>%
    do(my_fn(...))
}
Thanks
Regarding how to use the function with a variable number of arguments:
my_fn_generic <- function(list_char) {
  expand.grid(list_char, stringsAsFactors = FALSE) %>%
    tbl_df() %>%
    group_by_all() %>%
    do(do.call(my_fn, list(.)))
}
my_fn_generic(list_char_mult)
# A tibble: 4 x 4
# Groups: name, city, language [4]
# name city language comma_separated
# <chr> <chr> <chr> <chr>
#1 Chris London R Chris,London,R
#2 Chris New York R Chris,New York,R
#3 Mike London R Mike,London,R
#4 Mike New York R Mike,New York,R
Or use pmap:
library(tidyverse)
list_char_mult %>%
  expand.grid(., stringsAsFactors = FALSE) %>%
  mutate(comma_separated = purrr::pmap_chr(.l = ., .f = paste, sep = ", "))
# name city language comma_separated
#1 Chris New York R Chris, New York, R
#2 Mike New York R Mike, New York, R
#3 Chris London R Chris, London, R
#4 Mike London R Mike, London, R
If I understand your question, you could use apply without grouping:
expand.grid(list_char_mult, stringsAsFactors = FALSE) %>%
  mutate(comma_separated = apply(., 1, paste, collapse = ","))
expand.grid(list_char_mult, stringsAsFactors = FALSE) %>%
  mutate(comma_separated = apply(., 1, my_fn))
name city language comma_separated
1 Chris London R Chris,London,R
2 Chris New York R Chris,New York,R
3 Mike London R Mike,London,R
4 Mike New York R Mike,New York,R
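Since dplyr::do() is superseded in recent dplyr versions, here is a hedged sketch of the same idea with group_modify(); the my_fn_generic2 name and the across(everything()) grouping are my own substitutions, not from the original answers:
my_fn_generic2 <- function(list_char) {
  expand.grid(list_char, stringsAsFactors = FALSE) %>%
    as_tibble() %>%
    group_by(across(everything())) %>%
    # .y is the one-row tibble of group keys; pass it to my_fn as a list
    group_modify(~ my_fn(as.list(.y))) %>%
    ungroup()
}
my_fn_generic2(list_char_mult)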
