Calculate distance between multiple latitude and longitude points - r

I have a dataset that has latitude and longitude information for participants' home and work, and I'd like to create a new column in the dataset containing the euclidean distance between home and work for each participant. I think this should be relatively simple, but all the other Q&As I've seen seem to be dealing with slightly different issues.
To start, I tried running this code (using the geosphere package):
distm(c(homelong, homelat), c(worklong, worklat), fun=distHaversine)
But got an error saying "Error in .pointsToMatrix(x) : Wrong length for a vector, should be 2" because (if I understand correctly) I'm trying to calculate the distance between multiple sets of two points.
Can I adjust this code to get what I'm looking for, or is there something else I should be trying instead? Thanks!

distm() returns a distance matrix, which is not what you want; you want the pairwise distances. So use the distance function (distHaversine(), distGeo(), or whatever) directly:
library(tidyverse)
locations <- tibble(
homelong = c(0, 2),
homelat = c(2, 5),
worklong = c(70, 60),
worklat = c(45, 60)
)
locations <- locations %>%
mutate(
dist = geosphere::distHaversine(cbind(homelong, homelat), cbind(worklong, worklat))
)
locations
#> # A tibble: 2 × 5
#> homelong homelat worklong worklat dist
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0 2 70 45 8299015.
#> 2 2 5 60 60 7809933.
Note that geosphere functions want matrices as inputs, so you can cbind() your columns together. Don't c() them; that's creating a single shapeless vector and losing the differentiation between lon and lat. This is the cause of the error, I suspect; the vector only has one dimension, not two like a matrix.
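To make the c() versus cbind() point concrete, here is a minimal base R sketch using the same home coordinates as above. The names flat and pts are only for illustration.

```r
homelong <- c(0, 2)
homelat  <- c(2, 5)

# c() flattens both columns into one vector of length 4,
# so the pairing of each longitude with its latitude is lost.
flat <- c(homelong, homelat)
length(flat)

# cbind() keeps one row per point and one column per coordinate,
# which is the two-column layout the geosphere functions expect.
pts <- cbind(homelong, homelat)
dim(pts)
```

With more than one participant, the flattened vector is longer than 2, which is consistent with the "Wrong length for a vector, should be 2" message in the question.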

You can have the latitudes and longitudes in a dataframe and then do rowwise operations on the dataframe to get the distance corresponding to each row.
library(tidyverse)
library(geosphere)
locations <- tibble(
homelong = c(0, 2),
homelat = c(2, 5),
worklong = c(70, 60),
worklat = c(45, 60)
)
locations %>%
rowwise() %>%
mutate(d = as.numeric(distm(c(homelong, homelat), c(worklong, worklat), fun = distHaversine)))
results in
# A tibble: 2 x 5
# Rowwise:
homelong homelat worklong worklat d
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0 2 70 45 8299015.
2 2 5 60 60 7809933.
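For intuition about what is being computed per row, the haversine formula can also be written directly in vectorised base R. This is only a sketch: haversine_m is a hypothetical helper, not part of geosphere, and the radius of 6378137 m is an assumption chosen to match what I understand distHaversine() uses by default.

```r
# Hypothetical helper: great-circle distance in metres on a sphere of radius r.
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  rad  <- pi / 180
  dlat <- (lat2 - lat1) * rad
  dlon <- (lon2 - lon1) * rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * rad) * cos(lat2 * rad) * sin(dlon / 2)^2
  # pmin() guards against sqrt(a) creeping above 1 through rounding
  2 * r * asin(pmin(1, sqrt(a)))
}

locations <- data.frame(
  homelong = c(0, 2),
  homelat  = c(2, 5),
  worklong = c(70, 60),
  worklat  = c(45, 60)
)

# All arguments are whole columns, so every row is handled in one call.
locations$dist <- with(locations, haversine_m(homelong, homelat, worklong, worklat))
locations
```

Because the arithmetic is vectorised, no rowwise() step is needed, which may matter on large datasets.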

Related

Spatial point distance analysis by group in R

I have a dataset which looks like this, though much larger
## Fake data for stack
exdb <- data.frame(zone = c(1,1,1,2,2,2),
site = c("study", "collect", "collect", "study", "collect", "collect"),
x = c(53.307726, 53.310660, 53.307089, 53.313831, 53.319087, 53.318792),
y = c(-6.222291, -6.217151, -6.215080, -6.214152, -6.218723, -6.215815))
I need to run a point analysis between the STUDY site and the COLLECT site to see the distance in metres. The problem is that I have many different ZONES or groups that are all independent (i.e the distance from a point in zone 1 is irrelevant to a point in zone 2).
For this reason I need to do two things,
the point analysis, which computes the distance between the one study site per zone and the multiple collect sites in meters,
and then write a FOREACH or a LOOP function which calculates this distance for every group in the data set.
an optimal output would look like
exdb <- data.frame(zone = c(1,1,1,2,2,2),
site = c("study", "collect", "collect", "study", "collect", "collect"),
x = c(53.307726, 53.310660, 53.307089, 53.313831, 53.319087, 53.318792),
y = c(-6.222291, -6.217151, -6.215080, -6.214152, -6.218723, -6.215815),
dist = c(0, 10.3, 30.4, 0, 12.5, 11.2))
Where the study site in each zone is always 0, as it is the distance from this site, and the distance to each collect site is ONLY CALCULATED TO THE STUDY SITE IN EACH UNIQUE ZONE.
Thank you very much.
Kil
Simple base R version; the only package needed is geosphere, for distGeo().
Starting with exdb as above.
First add a new column called dist with the value "study" because the plan is to self-merge on zone and site=="study":
> exdb$dist = "study"
Self-Merge, keeping only the coordinate columns:
> MM = merge(exdb, exdb,
by.x=c("zone","site"),
by.y=c("zone","dist"))[,c("x.x","y.x","x.y","y.y")]
Use distGeo to overwrite the dist column. Keeps it neat and tidy:
> exdb$dist = distGeo(MM[,2:1],MM[,4:3])
> exdb
zone site x y dist
1 1 study 53.30773 -6.222291 0.0000
2 1 collect 53.31066 -6.217151 473.2943
3 1 collect 53.30709 -6.215080 485.8806
4 2 study 53.31383 -6.214152 0.0000
5 2 collect 53.31909 -6.218723 659.5238
6 2 collect 53.31879 -6.215815 563.1349
Returns the same answer as @wimpel, with no dependencies beyond geosphere and in fewer lines of code.
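The self-merge relies on the merged rows coming back in the same order as exdb. If that is a concern, a variant of the same idea is to look up each row's study site with match(). This is a sketch rather than the answerer's method; study_x and study_y are hypothetical column names.

```r
exdb <- data.frame(
  zone = c(1, 1, 1, 2, 2, 2),
  site = c("study", "collect", "collect", "study", "collect", "collect"),
  x = c(53.307726, 53.310660, 53.307089, 53.313831, 53.319087, 53.318792),
  y = c(-6.222291, -6.217151, -6.215080, -6.214152, -6.218723, -6.215815)
)

# One row per zone: the study site (assumes exactly one study site per zone).
study <- exdb[exdb$site == "study", ]

# For every row of exdb, find the position of its zone's study site.
idx <- match(exdb$zone, study$zone)

# Attach the study coordinates next to each row's own coordinates.
exdb$study_x <- study$x[idx]
exdb$study_y <- study$y[idx]
exdb
```

The paired columns can then be passed to a distance function in longitude-first order, for example cbind(y, x) and cbind(study_y, study_x) for distGeo(), since x holds latitude in this data.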
Maybe something like this?
Assuming x and y are latitude and longitude, we can use pracma's haversine() to get the distance. First pivot the table so that each row holds both points (the study site and one collect site); haversine() returns kilometres, so multiply by 1000 to get metres:
library(tidyverse)
library(pracma)
#>
#> Attaching package: 'pracma'
#> The following object is masked from 'package:purrr':
#>
#> cross
data <- data.frame(zone = c(1, 1, 1, 2, 2, 2), site = c(
"study", "collect", "collect", "study",
"collect", "collect"
), x = c(
53.307726, 53.310660, 53.307089,
53.313831, 53.319087, 53.318792
), y = c(-6.222291, -6.217151, -6.215080, -6.214152, -6.218723, -6.215815))
data %>%
pivot_wider(names_from = site, values_from = c(x, y)) %>%
unnest(y_collect, y_study, x_collect, x_study) %>%
mutate(
dist = list(x_study, y_study, x_collect, y_collect) %>% pmap_dbl(~haversine(c(..1, ..2), c(..3, ..4)) * 1000)
)
#> Warning: Values are not uniquely identified; output will contain list-cols.
#> * Use `values_fn = list` to suppress this warning.
#> * Use `values_fn = length` to identify where the duplicates arise
#> * Use `values_fn = {summary_fun}` to summarise duplicates
#> Warning: Values are not uniquely identified; output will contain list-cols.
#> * Use `values_fn = list` to suppress this warning.
#> * Use `values_fn = length` to identify where the duplicates arise
#> * Use `values_fn = {summary_fun}` to summarise duplicates
#> Warning: unnest() has a new interface. See ?unnest for details.
#> Try `df %>% unnest(c(y_collect, y_study, x_collect, x_study))`, with `mutate()` if needed
#> # A tibble: 4 x 6
#> zone x_study x_collect y_study y_collect dist
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 53.3 53.3 -6.22 -6.22 472.
#> 2 1 53.3 53.3 -6.22 -6.22 484.
#> 3 2 53.3 53.3 -6.21 -6.22 659.
#> 4 2 53.3 53.3 -6.21 -6.22 563.
Created on 2021-09-13 by the reprex package (v2.0.1)
I'm still learning the spatial side but does this work?
library(sf)
library(tidyverse)
exdb %>%
arrange(zone, desc(site)) %>% # ensure study is first within each zone
st_as_sf(coords = c("y", "x"), crs = 4326) %>% # sf expects longitude first; in this data y is longitude and x is latitude
group_by(zone) %>%
mutate(
study_coord = geometry[1],
dist = st_distance(geometry, study_coord, by_element = TRUE)
)
I believe this should work, but I could not reproduce the distances in your desired output.
library(data.table)
library(purrr) # Or tidyverse
library(geosphere)
# Make your data a data.table
setDT(mydata)
# Split to a list based on zone and site
L <- split(mydata, by = c("zone", "site"), flatten = FALSE)
# Loop over list
L <- lapply(L, function(zone) {
# get reference point to take distance from
point.study <- c(zone$study$y,zone$study$x)
zone$study$dist <- 0
# Calculate distance
zone$collect$dist <- unlist(purrr::pmap( list(a = zone$collect$y,
b = zone$collect$x ),
~(geosphere::distGeo( point.study, c(..1, ..2)))))
return(zone)
})
# Rowbind the results together
data.table::rbindlist(lapply(L, data.table::rbindlist))
# zone site x y dist
# 1: 1 study 53.30773 -6.222291 0.0000
# 2: 1 collect 53.31066 -6.217151 473.2943
# 3: 1 collect 53.30709 -6.215080 485.8806
# 4: 2 study 53.31383 -6.214152 0.0000
# 5: 2 collect 53.31909 -6.218723 659.5238
# 6: 2 collect 53.31879 -6.215815 563.1349

Mutate when function output has data in two columns (geosphere)

I have animal survey data from transects. Transects are divided into sections. There are lat/lon data for the start/endpoints of some sections but not others, and I want to calculate the start/endpoints for sections where these values are missing. Missing start/endpoints should be calculated using the section bearing (degrees) and section length (m).
Example data:
Section  StartLon   StartLat  EndLon     EndLat    Bearing  Length
1        -132.4053  53.00704  -132.4053  53.00714  360      5
2        -132.4053  53.00714  NA         NA        360      10
I'm trying to use destPoint (geosphere) to calculate the missing start/endpoints (NAs). The output of destPoint looks like:
lon lat
[1,] -132.4053 53.00701
My code:
data %>%
mutate(EndLon = if_else(is.na(EndLon), destPoint(c(StartLon, StartLat), Bearing, Length), EndLon))
data %>%
mutate(EndLat = if_else(is.na(EndLat), destPoint(c(StartLon, StartLat), Bearing, Length), EndLat))
My code gives this error:
Error: Problem with `mutate()` input `test`.
x Wrong length for a vector, should be 2
i Input `test` is `if_else(...)`
I think the error is because the output of destPoint is two values (lon and lat), and the mutated column can only hold one value. Maybe there's a way to use select() so that only lon or lat goes into the mutated column?
Hoping for a dplyr solution.
We may use rowwise
library(dplyr)
library(geosphere)
data %>%
rowwise %>%
mutate(EndLon = if(is.na(EndLon))
destPoint(c(StartLon, StartLat), Bearing, Length)[, 'lon'] else EndLon) %>%
ungroup
-output
# A tibble: 2 x 7
# Section StartLon StartLat EndLon EndLat Bearing Length
# <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
#1 1 -132. 53.0 -132. 53.0 360 5
#2 2 -132. 53.0 -132. NA 360 10
data
data <- structure(list(Section = 1:2, StartLon = c(-132.4053, -132.4053
), StartLat = c(53.00704, 53.00714), EndLon = c(-132.4053, NA
), EndLat = c(53.00714, NA), Bearing = c(360L, 360L), Length = c(5L,
10L)), class = "data.frame", row.names = c(NA, -2L))
The issue is that c(StartLon, StartLat) concatenates the whole of both columns, so one of the arguments to if_else ends up with a different length from the rest. With rowwise, the data is grouped by row, so we can use if/else (which requires a logical condition of length 1).
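The answer above fills EndLon only, and EndLat is still NA in its output. Since destPoint() accepts a matrix of start points along with vectors of bearings and distances, both columns can be filled without rowwise. A minimal sketch, assuming the same example data; dest is a hypothetical intermediate name.

```r
library(geosphere)

data <- data.frame(
  Section  = 1:2,
  StartLon = c(-132.4053, -132.4053),
  StartLat = c(53.00704, 53.00714),
  EndLon   = c(-132.4053, NA),
  EndLat   = c(53.00714, NA),
  Bearing  = c(360, 360),
  Length   = c(5, 10)
)

# One destination per row: a two-column matrix with columns "lon" and "lat".
dest <- destPoint(cbind(data$StartLon, data$StartLat), data$Bearing, data$Length)

# Keep surveyed end points and fill in only the missing ones.
data$EndLon <- ifelse(is.na(data$EndLon), dest[, "lon"], data$EndLon)
data$EndLat <- ifelse(is.na(data$EndLat), dest[, "lat"], data$EndLat)
data
```

The same two expressions should also work inside mutate() if a dplyr pipeline is preferred, because each one returns a single column.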

Programmatically create new variables which are sums of nested series of other variables

I have data giving me the percentage of people in some groups who have various levels of educational attainment:
df <- data_frame(group = c("A", "B"),
no.highschool = c(20, 10),
high.school = c(70,40),
college = c(10, 40),
graduate = c(0,10))
df
# A tibble: 2 x 5
group no.highschool high.school college graduate
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 20. 70. 10. 0.
2 B 10. 40. 40. 10.
E.g., in group A 70% of people have a high school education.
I want to generate 4 variables that give me the proportion of people in each group with less than each of the 4 levels of education (e.g., lessthan_no.highschool, lessthan_high.school, etc.).
desired df would be:
desired.df <- data.frame(group = c("A", "B"),
no.highschool = c(20, 10),
high.school = c(70,40),
college = c(10, 40),
graduate = c(0,10),
lessthan_no.highschool = c(0,0),
lessthan_high.school = c(20, 10),
lessthan_college = c(90, 50),
lessthan_graduate = c(100, 90))
In my actual data I have many groups and a lot more levels of education. Of course I could do this one variable at a time, but how could I do this programmatically (and elegantly) using tidyverse tools?
I would start by doing something like a mutate_at() inside of a map(), but where I get tripped up is that the list of variables being summed is different for each of the new variables. You could pass in the list of new variables and their corresponding variables to be summed as two lists to a pmap(), but it's not obvious how to generate that second list concisely. Wondering if there's some kind of nesting solution...
Here is a base R solution. Though the question asks for a tidyverse one, considering the dialog in the comments to the question I have decided to post it.
It uses apply and cumsum to do the hard work. Then there are some cosmetic concerns before cbinding into the final result.
tmp <- apply(df[-1], 1, function(x){
s <- cumsum(x)
100*c(0, s[-length(s)])/sum(x)
})
rownames(tmp) <- paste("lessthan", names(df)[-1], sep = "_")
desired.df <- cbind(df, t(tmp))
desired.df
# group no.highschool high.school college graduate lessthan_no.highschool
#1 A 20 70 10 0 0
#2 B 10 40 40 10 0
# lessthan_high.school lessthan_college lessthan_graduate
#1 20 90 100
#2 10 50 90
how could I do this programmatically (and elegantly) using tidyverse tools?
Definitely the first step is to tidy your data. Encoding information (like edu level) in column names is not tidy. When you convert education to a factor, make sure the levels are in the correct order - I used the order in which they appeared in the original data column names.
library(dplyr) # for %>%, mutate(), group_by(), lag() and arrange()
library(tidyr)
tidy_result = df %>% gather(key = "education", value = "n", -group) %>%
mutate(education = factor(education, levels = names(df)[-1])) %>%
group_by(group) %>%
mutate(lessthan_x = lag(cumsum(n), default = 0) / sum(n) * 100) %>%
arrange(group, education)
tidy_result
# # A tibble: 8 x 4
# # Groups: group [2]
# group education n lessthan_x
# <chr> <fct> <dbl> <dbl>
# 1 A no.highschool 20 0
# 2 A high.school 70 20
# 3 A college 10 90
# 4 A graduate 0 100
# 5 B no.highschool 10 0
# 6 B high.school 40 10
# 7 B college 40 50
# 8 B graduate 10 90
This gives us a nice, tidy result. If you want to spread/cast this data into your un-tidy desired.df format, I would recommend using data.table::dcast, as (to my knowledge) the tidyverse does not offer a nice way to spread multiple columns. See Spreading multiple columns with tidyr or How can I spread repeated measures of multiple variables into wide format? for the data.table solution or an inelegant tidyr/dplyr version. Before spreading, you could create a key less_than_x_key = paste("lessthan", education, sep = "_").
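Newer versions of tidyr (1.0.0 and later) provide pivot_longer() and pivot_wider(), which can handle the reshaping back to wide format. The following is a sketch under that assumption, not part of the original answer; less_than and wide are hypothetical names.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  group = c("A", "B"),
  no.highschool = c(20, 10),
  high.school = c(70, 40),
  college = c(10, 40),
  graduate = c(0, 10)
)

less_than <- df %>%
  # long format keeps the original column order within each group
  pivot_longer(-group, names_to = "education", values_to = "n") %>%
  group_by(group) %>%
  # share of the group strictly below each education level
  mutate(lessthan = lag(cumsum(n), default = 0) / sum(n) * 100) %>%
  ungroup() %>%
  select(-n) %>%
  # back to one row per group, one lessthan_* column per level
  pivot_wider(names_from = education, values_from = lessthan,
              names_prefix = "lessthan_")

wide <- left_join(df, less_than, by = "group")
wide
```

The cumulative sum depends on the column order of df, so the education columns need to be in ascending order before pivoting.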

Set rnorm parameters equal to vector

I have a data frame that contains columns of sample sizes, means, and standard deviations, as well as a target value:
ssize <- c(200, 300, 150)
mean <- c(10, 40, 50)
sd <- c(5, 15, 65)
target <- c(7, 23, 30)
df <- data.frame(ssize, mean, sd, target)
I wish to add another variable below that returns the number of elements less than the target value, as drawn from a normal distribution with parameters mean and sd and sample size ssize. However, I cannot get rnorm to use the values of each row as parameters. For example, running
df$below <- sum(rnorm(df$ssize, df$mean, df$sd) < df$target)
generates distributions that have sample sizes equal to length(df$ssize) instead of the value of df$ssize itself.
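The behaviour described here follows from how rnorm() treats its first argument: when n has length greater than 1, the number of draws is taken to be length(n). A small base R sketch, with Map() shown as one way to get one sample per row (draws is a hypothetical name):

```r
set.seed(1)
ssize <- c(200, 300, 150)
mu    <- c(10, 40, 50)
sigma <- c(5, 15, 65)

# A vector n means "draw length(n) values", so this returns only 3 numbers.
length(rnorm(ssize, mu, sigma))

# Map() calls rnorm() once per row, giving one sample per parameter set.
draws <- Map(rnorm, ssize, mu, sigma)
lengths(draws)
```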
Updated: data table solution for large datasets?
The solutions from @alistaire and @G5W work well, but I would like to extract the mean value of below from 100 replicates of rnorm, for each row. I tried both solutions:
df <- df %>% mutate(below = mean(replicate(100, pmap_int(., ~sum(rnorm(..1, ..2, ..3) < ..4)))))
df$below <- with(df, sapply(seq_len(nrow(df)), function(i) mean(replicate(100, sum(rnorm(n[i], mean[i], sd[i]) < target[i])))))
But they will take a very long time to run with my dataset, which has >4.3m rows. Is there a data table (or other) solution that might be faster?
List columns are a natural way to do this, so you can store the samples right next to the parameters that generated them. Using purrr for iteration,
library(tidyverse)
set.seed(47) # for reproducibility
df <- data_frame(n = c(200, 300, 150), # rename to name of parameter in rnorm so pmap works naturally
mean = c(10, 40, 50),
sd = c(5, 15, 65),
target = c(7, 23, 30))
df %>%
mutate(samples = pmap(.[1:3], rnorm), # iterate in parallel over parameters and store samples as list column
below = map2_int(samples, target, ~sum(.x < .y))) # iterate over samples and target, calculate number below, and simplify to integer vector
#> # A tibble: 3 x 6
#> n mean sd target samples below
#> <dbl> <dbl> <dbl> <dbl> <list> <int>
#> 1 200 10 5 7 <dbl [200]> 47
#> 2 300 40 15 23 <dbl [300]> 41
#> 3 150 50 65 30 <dbl [150]> 58
You can do this in base R with sapply and an anonymous function
df$below = with(df,
sapply(1:3, function(i) sum(rnorm(ssize[i], mean[i], sd[i]) < target[i])))
df$below
[1] 44 45 48
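For the large-data follow-up, one option that avoids simulating the normal draws at all: the number of n independent normal draws falling below target follows a Binomial(n, p) distribution with p = pnorm(target, mean, sd). This is a different technique from the answers above, sketched here in base R; p_below, expected_below and mean_below_100 are hypothetical names.

```r
set.seed(47)
df <- data.frame(
  ssize  = c(200, 300, 150),
  mean   = c(10, 40, 50),
  sd     = c(5, 15, 65),
  target = c(7, 23, 30)
)

# Probability that a single draw falls below the target, per row.
p_below <- with(df, pnorm(target, mean = mean, sd = sd))

# One simulated count per row, vectorised over all rows at once.
df$below <- rbinom(nrow(df), size = df$ssize, prob = p_below)

# The long-run mean of that count is simply n * p.
df$expected_below <- df$ssize * p_below

# Mean over 100 simulated replicates, if a simulated value is still wanted.
df$mean_below_100 <- rowMeans(replicate(100, rbinom(nrow(df), df$ssize, p_below)))
df
```

Everything here is vectorised over rows, so it should scale to millions of rows far better than a per-row loop, though I have not timed it on data of that size.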

dplyr rowwise, find closest (latitude, longitude) record in a second data frame

Would like to assign each zip code to its closest ghcnd weather collection station. Using library(zipcode) and the list of ghcnd stations available from NOAA (ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/)
Trying to use dplyr; using a rowwise %>% mutate(). All rows are assigned the same value from the lookup table.
#small selection of zip codes from library(zipcode)
zip_samp <- "zip latitude longitude
30002 33.77212 -84.26491
30003 33.96035 -84.03786
30004 34.11918 -84.30292
30005 34.08004 -84.21929
39885 31.71000 -84.34000
39886 31.73000 -84.60000
39897 30.90000 -84.32000
39901 33.89125 -84.07456
"
zip <- read.table(text=zip_samp, header=TRUE)
#two example stations
station_samp <- "id lat long
US1GADK0015 33.7794 -84.2572
US1GAGW0024 33.8885 -84.0998
"
stations <- read.table(text=station_samp, header=TRUE)
Illustrate desired output by hardcoding:
as.character(stations[which.min(distGeo(c(-84.26491, 33.77212), select(stations, long, lat))), "id"])
[1] "US1GADK0015"
as.character(stations[which.min(distGeo(c(-84.07456, 33.89125), select(stations, long, lat))), "id"])
[1] "US1GAGW0024"
Note that these two zip codes are assigned to different station ids, but when the same formula is applied rowwise with dplyr, all zip codes are assigned to a single ID.
assigned <- zip %>%
select(longitude, latitude) %>%
rowwise() %>%
mutate(station =
as.character(stations[which.min(distGeo(., select(stations, long, lat))), "id"])
)
print(assigned)
Source: local data frame [8 x 3]
Groups: <by row>
# A tibble: 8 x 3
longitude latitude station
<dbl> <dbl> <chr>
1 -84.26491 33.77212 US1GADK0015
2 -84.03786 33.96035 US1GADK0015
3 -84.30292 34.11918 US1GADK0015
4 -84.21929 34.08004 US1GADK0015
5 -84.34000 31.71000 US1GADK0015
6 -84.60000 31.73000 US1GADK0015
7 -84.32000 30.90000 US1GADK0015
8 -84.07456 33.89125 US1GADK0015
Is there an alternative to indexing inside the mutate() call?
note, trying to avoid creating the entire distance matrix at once. I hope computing rowwise will require fewer resources
Also, there are a couple of similar questions on SO, none using dplyr though. Hoping to figure out why this application of rowwise %>% mutate isn't producing the expected results.
See if this works for you (an eyeball of the data says it does):
library(tidyverse)
library(microbenchmark)
library(zipcode)
stat_df <- read_fwf(
"ghcnd-stations.txt",
fwf_widths(widths=c(11, 1, 8, 1, 9, 1, 6, 1, 2, 1, 30, 1, 3, 1, 3, 1, 5))
)
stations <- select(stat_df, station_id = X1, latitude = X3, longitude = X5)
closest_station <- function(lat, lon) {
index <- which.min(sqrt((stations$latitude-lat)^2 + (stations$longitude-lon)^2)) # less precise but likely good enough
stations[index,]$station_id
}
data(zipcode)
zipcode <- tbl_df(zipcode)
zipcode
set.seed(1492)
smpl <- zipcode[sample(nrow(zipcode), 100),]
mutate(smpl, station_id = map2_chr(latitude, longitude, closest_station))
## # A tibble: 100 x 6
## zip city state latitude longitude station_id
## <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 28137 Richfield NC 35.49326 -80.25524 US1NCSN0006
## 2 22027 Dunn Loring VA 38.89392 -77.21976 US1VAFX0064
## 3 19080 Wayne PA 40.04320 -75.35768 US1PADL0005
## 4 12459 New Kingston NY 42.22799 -74.68912 USC00305743
## 5 06082 Enfield CT 41.98724 -72.56365 US1CTHR0005
## 6 01302 Greenfield MA 42.52218 -72.62416 USC00193295
## 7 83540 Lapwai ID 46.39708 -116.78649 USC00105132
## 8 49266 Osseo MI 41.84489 -84.55244 USC00203823
## 9 37871 Strawberry Plains TN 36.04051 -83.67934 USC00408677
## 10 75042 Garland TX 32.91562 -96.67399 US1TXDA0065
## # ... with 90 more rows
I think this math is right but have to jet…
microbenchmark(mutate(smpl, station_id = map2_chr(latitude, longitude, closest_station)))
# ~300ms for 100
# ((nrow(zipcode)/100) * 300) / 1000 / 60 == ~3m
