Spatial point distance analysis by group in R

I have a dataset which looks like this, though much larger:
# Fake data for Stack Overflow
exdb <- data.frame(zone = c(1, 1, 1, 2, 2, 2),
                   site = c("study", "collect", "collect", "study", "collect", "collect"),
                   x = c(53.307726, 53.310660, 53.307089, 53.313831, 53.319087, 53.318792),
                   y = c(-6.222291, -6.217151, -6.215080, -6.214152, -6.218723, -6.215815))
I need to run a point analysis between the STUDY site and the COLLECT sites to get the distance in metres. The problem is that I have many different ZONES or groups that are all independent (i.e. the distance from a point in zone 1 is irrelevant to a point in zone 2).
For this reason I need to do two things:
the point analysis, which computes the distance in metres between the one study site per zone and the multiple collect sites,
and then a FOREACH or LOOP function which calculates this distance for every group in the data set.
An optimal output would look like:
exdb <- data.frame(zone = c(1, 1, 1, 2, 2, 2),
                   site = c("study", "collect", "collect", "study", "collect", "collect"),
                   x = c(53.307726, 53.310660, 53.307089, 53.313831, 53.319087, 53.318792),
                   y = c(-6.222291, -6.217151, -6.215080, -6.214152, -6.218723, -6.215815),
                   dist = c(0, 10.3, 30.4, 0, 12.5, 11.2))
Here dist is always 0 for the study site, as it is the site distances are measured from, and the distance to each collect site is ONLY CALCULATED TO THE STUDY SITE IN EACH UNIQUE ZONE.
Thank you very much.
Kil

Simple base R version; the only extra package needed is geosphere, for distGeo().
Starting with exdb as above.
First add a new column called dist with the value "study", because the plan is to self-merge on zone and site == "study":
> exdb$dist = "study"
Self-merge, keeping only the coordinate columns:
> MM = merge(exdb, exdb,
             by.x = c("zone", "site"),
             by.y = c("zone", "dist"))[, c("x.x", "y.x", "x.y", "y.y")]
Use geosphere's distGeo to overwrite the dist column. Keeps it neat and tidy. Note the reversed column order: distGeo expects (lon, lat), and here x is latitude and y is longitude:
> library(geosphere)
> exdb$dist = distGeo(MM[, 2:1], MM[, 4:3])
> exdb
  zone    site        x         y     dist
1    1   study 53.30773 -6.222291   0.0000
2    1 collect 53.31066 -6.217151 473.2943
3    1 collect 53.30709 -6.215080 485.8806
4    2   study 53.31383 -6.214152   0.0000
5    2 collect 53.31909 -6.218723 659.5238
6    2 collect 53.31879 -6.215815 563.1349
Returns the same answer as @wimpel's answer below, but with fewer dependencies and in fewer lines of code.
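Since the question explicitly asks for a loop, here is a minimal sketch of a plain base R loop over zones doing the same thing (again assuming geosphere for distGeo(), and that each zone has exactly one study row). It writes dist back by row index, so it does not rely on merge()'s ordering:
library(geosphere)
exdb$dist <- NA_real_
for (i in split(seq_len(nrow(exdb)), exdb$zone)) {
  # row index of the single study site in this zone
  s <- i[exdb$site[i] == "study"][1]
  # distGeo() expects (lon, lat); here x is latitude and y is longitude
  exdb$dist[i] <- distGeo(cbind(exdb$y[i], exdb$x[i]), c(exdb$y[s], exdb$x[s]))
}
exdb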

Maybe something like this?
Assuming x and y are latitude and longitude, we can pivot the table so that each row holds both points of a pair, then use the haversine function to get the distance in metres:
library(tidyverse)
library(pracma)
#>
#> Attaching package: 'pracma'
#> The following object is masked from 'package:purrr':
#>
#>     cross
data <- data.frame(zone = c(1, 1, 1, 2, 2, 2),
                   site = c("study", "collect", "collect", "study", "collect", "collect"),
                   x = c(53.307726, 53.310660, 53.307089, 53.313831, 53.319087, 53.318792),
                   y = c(-6.222291, -6.217151, -6.215080, -6.214152, -6.218723, -6.215815))
data %>%
  pivot_wider(names_from = site, values_from = c(x, y)) %>%
  unnest(y_collect, y_study, x_collect, x_study) %>%
  mutate(
    # pracma::haversine() returns kilometres, hence the * 1000 for metres
    dist = list(x_study, y_study, x_collect, y_collect) %>%
      pmap_dbl(~ haversine(c(..1, ..2), c(..3, ..4)) * 1000)
  )
#> Warning: Values are not uniquely identified; output will contain list-cols.
#> * Use `values_fn = list` to suppress this warning.
#> * Use `values_fn = length` to identify where the duplicates arise
#> * Use `values_fn = {summary_fun}` to summarise duplicates
#> Warning: Values are not uniquely identified; output will contain list-cols.
#> * Use `values_fn = list` to suppress this warning.
#> * Use `values_fn = length` to identify where the duplicates arise
#> * Use `values_fn = {summary_fun}` to summarise duplicates
#> Warning: unnest() has a new interface. See ?unnest for details.
#> Try `df %>% unnest(c(y_collect, y_study, x_collect, x_study))`, with `mutate()` if needed
#> # A tibble: 4 x 6
#>    zone x_study x_collect y_study y_collect  dist
#>   <dbl>   <dbl>     <dbl>   <dbl>     <dbl> <dbl>
#> 1     1    53.3      53.3   -6.22     -6.22  472.
#> 2     1    53.3      53.3   -6.22     -6.22  484.
#> 3     2    53.3      53.3   -6.21     -6.22  659.
#> 4     2    53.3      53.3   -6.21     -6.22  563.
Created on 2021-09-13 by the reprex package (v2.0.1)

I'm still learning the spatial side, but does this work?
library(sf)
library(tidyverse)
exdb %>%
  arrange(zone, desc(site)) %>% # ensure the study row comes first in each zone
  st_as_sf(coords = c("y", "x"), crs = 4326) %>% # sf wants (lon, lat); here x is latitude, y longitude
  group_by(zone) %>%
  mutate(
    study_coord = geometry[1],
    dist = st_distance(geometry, study_coord, by_element = TRUE)
  )

I believe this should work, though I could not reproduce the distances in your desired output.
library(data.table)
library(purrr) # or tidyverse
library(geosphere)

# Make your data a data.table
setDT(mydata)

# Split to a list based on zone and site
L <- split(mydata, by = c("zone", "site"), flatten = FALSE)

# Loop over the list
L <- lapply(L, function(zone) {
  # Get the reference point to take the distance from (lon, lat order for distGeo)
  point.study <- c(zone$study$y, zone$study$x)
  zone$study$dist <- 0
  # Calculate the distance to each collect site
  zone$collect$dist <- unlist(purrr::pmap(list(a = zone$collect$y,
                                               b = zone$collect$x),
                                          ~ geosphere::distGeo(point.study, c(..1, ..2))))
  return(zone)
})

# Rowbind the results together
data.table::rbindlist(lapply(L, data.table::rbindlist))
#    zone    site        x         y     dist
# 1:    1   study 53.30773 -6.222291   0.0000
# 2:    1 collect 53.31066 -6.217151 473.2943
# 3:    1 collect 53.30709 -6.215080 485.8806
# 4:    2   study 53.31383 -6.214152   0.0000
# 5:    2 collect 53.31909 -6.218723 659.5238
# 6:    2 collect 53.31879 -6.215815 563.1349
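As an aside, the split/loop/rbind round trip can also be collapsed into a single grouped update in data.table; a minimal sketch of that idea, assuming the same geosphere dependency and exactly one study row per zone:
library(data.table)
library(geosphere)
setDT(mydata)
# For each zone, measure every row against that zone's study point;
# distGeo() takes (lon, lat), and the single study point is recycled
mydata[, dist := distGeo(cbind(y, x),
                         c(y[site == "study"][1], x[site == "study"][1])),
       by = zone]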

Related

How to add a column to my data frame that calculates the distance between lat/long points and the previous point with matching IDs

I have a data frame of individual animals with a unique ID, the lat/long where they were found, and the date they were found. The database has frequent returns of the same individual. I have over 2000 individuals. I want to add a column to my data frame that calculates the Euclidean distance between each animal's current and previous location. I want to add a second column that tells me which calculation number I'm on for each individual. The data frame is already organized by sequential date. I'm trying to solve this in R.
Event  ID  Lat    Long
1      1   31.89  -80.98
2      2   31.54  -80.12
3      1   31.45  -81.92
4      1   31.64  -81.82
5      2   31.23  -80.98
Add columns so that it now looks like:
Event  ID  Lat    Long    Dist.                          Calculation #
1      1   31.89  -80.98  -                              0
2      2   31.54  -80.12  -                              0
3      1   31.45  -81.92  Distance between events 1 & 3  1
4      1   31.64  -81.82  Distance between events 3 & 4  2
5      2   31.23  -80.98  Distance between events 2 & 5  1
Is there a faster way to do this without a for loop? I'm stuck on where to start. I know I can use a distance function from a geospatial package once I have the unique IDs sorted, but I'm having trouble iterating through my data.
Here is one option which leans on the sf package and dplyr. The function sf::st_distance calculates distances between pairs of points, and dplyr::lag can be used to look "one row behind". You will want to confirm your coordinate system, which I guessed here is WGS84/4326.
library(dplyr)
library(sf)
dat <- read.table(text = " Event ID Lat Long
1 1 31.89 -80.98
2 2 31.54 -80.12
3 1 31.45 -81.92
4 1 31.64 -81.82
5 2 31.23 -80.98", h = T)
dat_sf <- st_as_sf(dat, coords = c('Long', 'Lat'), crs = 4326)
dat_sf %>%
  arrange(ID) %>%
  group_by(ID) %>%
  mutate(
    distance = as.numeric(st_distance(geometry, lag(geometry), by_element = TRUE)),
    calculation = row_number() - 1
  )
#> Simple feature collection with 5 features and 4 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: -81.92 ymin: 31.23 xmax: -80.12 ymax: 31.89
#> Geodetic CRS: WGS 84
#> # A tibble: 5 x 5
#> # Groups:   ID [2]
#>   Event    ID       geometry distance calculation
#> * <int> <int>    <POINT [°]>    <dbl>       <dbl>
#> 1     1     1 (-80.98 31.89)      NA            0
#> 2     3     1 (-81.92 31.45)  101524.           1
#> 3     4     1 (-81.82 31.64)   23155.           2
#> 4     2     2 (-80.12 31.54)      NA            0
#> 5     5     2 (-80.98 31.23)   88615.           1
Created on 2022-11-14 by the reprex package (v2.0.0)
Try this:
1. Load library geosphere.
2. Create demo data.
3. Get all unique IDs and sort the dataframe by ID and event.
4. Append the last known coords of each animal to each row.
5. Apply the distance function to each row.
library(geosphere)
library(dplyr)
library(magrittr) # for %<>%

df <- data.frame(
  event = seq(5),
  id = c(1, 2, 1, 1, 2),
  lat = c(31.89, 31.54, 31.45, 31.64, 31.23),
  long = c(-80.98, -80.12, -81.92, -81.82, -80.98)
)

keys <- df$id %>% unique
df %<>% dplyr::arrange(id, event)

df <- keys %>% lapply(
  function(key){
    tmp <- df[df$id == key, ]
    # initialise the "last" coords to the current ones (the first row keeps these)
    tmp$last_lat <- tmp$lat
    tmp$last_long <- tmp$long
    # shift the coords down one row within each animal
    tmp[2:nrow(tmp), ]$last_lat <- tmp[1:(nrow(tmp) - 1), ]$lat
    tmp[2:nrow(tmp), ]$last_long <- tmp[1:(nrow(tmp) - 1), ]$long
    tmp %>% return
  }
) %>% do.call(rbind, .)

df %<>% mutate(dist = distHaversine(cbind(long, lat), cbind(last_long, last_lat)))
Since you said you need speed, below is the same code as above but run in parallel:
library(tictoc)
library(parallel)
library(dplyr)
library(magrittr)
library(geosphere)

tic()
clust <- makeCluster(detectCores() - 1)

df <- data.frame(
  event = seq(5),
  id = c(1, 2, 1, 1, 2),
  lat = c(31.89, 31.54, 31.45, 31.64, 31.23),
  long = c(-80.98, -80.12, -81.92, -81.82, -80.98)
)

keys <- df$id %>% unique
df %<>% dplyr::arrange(id, event)

clusterExport(clust, "df")
clusterEvalQ(clust, library(magrittr))

df <- keys %>% parLapply(
  clust, .,
  function(key){
    tmp <- df[df$id == key, ]
    tmp$last_lat <- tmp$lat
    tmp$last_long <- tmp$long
    tmp[2:nrow(tmp), ]$last_lat <- tmp[1:(nrow(tmp) - 1), ]$lat
    tmp[2:nrow(tmp), ]$last_long <- tmp[1:(nrow(tmp) - 1), ]$long
    tmp %>% return
  }
) %>% do.call(rbind, .)

df %<>% mutate(dist = distHaversine(cbind(long, lat), cbind(last_long, last_lat)))

stopCluster(clust)
toc()
Above, tictoc just records the execution time. I created a cluster with your number of CPU cores minus 1, changed the lapply part to parLapply, and stopped the cluster when done. The second version will be slower than the first on a small dataset (due to the overhead of setting up the parallel computation), but on a large dataset it will be much faster.

Calculating 5th, 15th, etc. percentiles in R

I'm a little new to R and was hoping to get some insight about how to calculate any given percentile, for example the 5th, 15th, etc.
The data I'm working with has two columns:
salary: (datatype is numeric / double)
student (set up as factor / integer, but only has yes/no)
I've already used:
favstats(salary~student, data=Default, na.rm=TRUE)
to get the two rows of stats broken down by whether they're a student or not; however, I'm not sure how to make the output show a percentile of my choosing.
Would love to know the simplest way to go about this in RStudio.
Thank you!
The quantile() function in base R does this.
x <- rnorm(100)
percentiles <- c(0.05, 0.15)
quantile(x, percentiles)
#>        5%       15%
#> -1.593506 -1.120130
If you need to produce a more complex summary table, you can do something with {tidyverse} like this:
library(tidyverse)
n <- 50
d <- tibble(student = rep(c(T, F), each = n),
            salary = c(rnorm(n, 75, 10), rnorm(n, 95, 15)))

d %>%
  group_by(student) %>%
  summarize(quantile_salary = quantile(salary, percentiles))
#> # A tibble: 4 × 2
#> # Groups:   student [2]
#>   student quantile_salary
#>   <lgl>             <dbl>
#> 1 FALSE              70.0
#> 2 FALSE              78.5
#> 3 TRUE               57.3
#> 4 TRUE               64.0
Created on 2022-09-22 by the reprex package (v2.0.1)
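One caveat: returning more than one row per group from summarize() was deprecated in dplyr 1.1.0 in favour of reframe(). Assuming a recent dplyr, the same summary (continuing with d and percentiles from above) can also label which percentile each row is:
d %>%
  group_by(student) %>%
  reframe(percentile = percentiles,
          quantile_salary = quantile(salary, percentiles))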

Calculate distance between multiple latitude and longitude points

I have a dataset that has latitude and longitude information for participants' home and work, and I'd like to create a new column in the dataset containing the Euclidean distance between home and work for each participant. I think this should be relatively simple, but all the other Q&As I've seen seem to be dealing with slightly different issues.
To start, I tried running this code (using the geosphere package):
distm(c(homelong, homelat), c(worklong, worklat), fun=distHaversine)
But I got an error saying "Error in .pointsToMatrix(x) : Wrong length for a vector, should be 2" because (if I understand correctly) I'm trying to calculate the distance between multiple sets of two points.
Can I adjust this code to get what I'm looking for, or is there something else I should be trying instead? Thanks!
distm() returns a distance matrix, which is not what you want; you want the pairwise distances. So use the distance function (distHaversine(), distGeo(), or whatever) directly:
library(tidyverse)
locations <- tibble(
  homelong = c(0, 2),
  homelat = c(2, 5),
  worklong = c(70, 60),
  worklat = c(45, 60)
)

locations <- locations %>%
  mutate(
    dist = geosphere::distHaversine(cbind(homelong, homelat), cbind(worklong, worklat))
  )

locations
#> # A tibble: 2 × 5
#>   homelong homelat worklong worklat     dist
#>      <dbl>   <dbl>    <dbl>   <dbl>    <dbl>
#> 1        0       2       70      45 8299015.
#> 2        2       5       60      60 7809933.
Note that geosphere functions want matrices as inputs, so you can cbind() your columns together. Don't c() them; that creates a single shapeless vector and loses the differentiation between lon and lat. This is the cause of the error, I suspect: the vector has only one dimension, not two like a matrix.
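A quick illustration of the difference:
homelong <- c(0, 2); homelat <- c(2, 5)
cbind(homelong, homelat) # 2-column matrix: one (lon, lat) pair per row
#>      homelong homelat
#> [1,]        0       2
#> [2,]        2       5
c(homelong, homelat)     # flat length-4 vector: the pairing is lost
#> [1] 0 2 2 5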
You can have the latitudes and longitudes in a dataframe and then do rowwise operations on the dataframe to get the distance corresponding to each row.
library(tidyverse)
library(geosphere)
locations <- tibble(
  homelong = c(0, 2),
  homelat = c(2, 5),
  worklong = c(70, 60),
  worklat = c(45, 60)
)

locations %>%
  rowwise() %>%
  mutate(d = as.numeric(distm(c(homelong, homelat), c(worklong, worklat), fun = distHaversine)))
results in
# A tibble: 2 x 5
# Rowwise:
  homelong homelat worklong worklat        d
     <dbl>   <dbl>    <dbl>   <dbl>    <dbl>
1        0       2       70      45 8299015.
2        2       5       60      60 7809933.
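Note that rowwise() calls distm() once per row, so on a large dataset the vectorized cbind() approach above will be considerably faster.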

select top n values by group with n depending on another value in the data frame

I'm quite new to r and coding in general. Your help would be highly appreciated :)
I'm trying to select the top n values by group, with n depending on another value (in the following called factor) from my data frame. Then the selected values should be summarised by group to calculate the mean (d100). My goal is to get one value for d100 per group.
(Background: In forestry there is an indicator called d100 which is the mean diameter of the 100 thickest trees per hectare. If the size of the sampling area is smaller than 1 ha you need to select accordingly fewer trees to calculate d100. That's what the factor is for.)
First I tried to put the factor inside my dataframe as its own column. Then I thought maybe it would help to have something like a "lookup table", because R said that n must be a single number. But I don't know how to create a lookup function. (See the last part of the sample code.) Or maybe summarising df$factor before using it would do the trick?
Sample data:
(I indicated expressions where I'm not sure how to code them in R like this: 'I dont know how')
# creating sample data
library(tidyverse)
df <- data.frame(group = c(rep(1, each = 5), rep(2, each = 8), rep(3, each = 10)),
                 BHD = c(rnorm(23, mean = 30, sd = 5)),
                 factor = c(rep(pi*(15/100)^2, each = 5), rep(pi*(20/100)^2, each = 8),
                            rep(pi*(25/100)^2, each = 10)))
# group by ID, then select top_n values of df$BHD with n depending on the value of df$factor
df %>%
  group_by(group) %>%
  slice_max(BHD,
            n = 100*df$factor,
            with_ties = F) %>%
  summarise(d100 = mean('sliced values per group'))

# other thought: having a "lookup table" for the factor like this:
lt <- data.frame(group = c(1, 2, 3),
                 factor = c(pi*(15/100)^2, pi*(20/100)^2, pi*(25/100)^2))
# then
df %>%
  group_by(group) %>%
  slice_max(BHD,
            n = 100*lt$factor 'where lt$group == df$group',
            with_ties = F) %>%
  summarise(d100 = mean('sliced values per group'))
I already found this answer to a problem which seems similar to mine, but it didn't quite help.
Since all the factor values are the same within each group, you can select any one factor value.
library(dplyr)
df %>%
  group_by(group) %>%
  top_n(BHD, n = 100 * first(factor)) %>%
  ungroup
#    group   BHD factor
#    <dbl> <dbl>  <dbl>
#  1     1  25.8 0.0707
#  2     1  24.6 0.0707
#  3     1  27.6 0.0707
#  4     1  28.3 0.0707
#  5     1  29.2 0.0707
#  6     2  28.8 0.126
#  7     2  39.5 0.126
#  8     2  23.1 0.126
#  9     2  27.9 0.126
# 10     2  31.7 0.126
# … with 13 more rows
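To get from there to the single d100 value per group that the question asks for, here is a minimal sketch that ranks within each group and then averages. Filtering on row_number() sidesteps the "n must be a single number" problem entirely, and round() is used because the number of trees to keep must be whole:
library(dplyr)
df %>%
  group_by(group) %>%
  arrange(desc(BHD), .by_group = TRUE) %>%
  # keep the 100 * factor thickest trees of each group (all of them if the group is smaller)
  filter(row_number() <= round(100 * first(factor))) %>%
  summarise(d100 = mean(BHD))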

Using custom function to apply across multiple groups and subsets

I am having trouble applying a custom function to multiple groups within a data frame and mutating the result onto the original data. I am trying to calculate the percent inhibition for each row of data (each observation in the experiment has a value). The challenging issue is that the function needs the mean of two different groups of values (positive and negative controls) and then uses that mean value in each calculation.
In other words, the experimental value is subtracted from the mean of the negative control, and the result is divided by the mean of the negative control minus the mean of the positive control.
Each observation, including the + and - controls, should have a calculated percent inhibition; as a double check, within each experiment (grouping) the mean pct inhibition of the - controls should be around 0 and of the + controls around 100.
The function:
percent_inhibition <- function(uninhibited, inhibited, unknown){
  uninhibited <- as.vector(uninhibited)
  inhibited <- as.vector(inhibited)
  unknown <- as.vector(unknown)

  mu_u <- mean(uninhibited, na.rm = TRUE)
  mu_i <- mean(inhibited, na.rm = TRUE)

  percent_inhibition <- (mu_u - unknown)/(mu_u - mu_i)*100
  return(percent_inhibition)
}
I have a data frame with multiple variables: target, box, replicate, and sample type. I am able to do the calculation by subsetting the data (below, to 1 target, box, and replicate) but have not been able to figure out the right way to apply it to all of the data.
subset <- data %>%
  filter(target == "A", box == "1", replicate == 1)

uninhib <- subset$value[subset$sample == "uninhib"]
inhib <- subset$value[subset$sample == "inhib"]

pct <- subset %>%
  mutate(pct = percent_inhibition(uninhib, inhib, .$value))
I have tried group_by and do, and nest functions, but my knowledge is lacking in how to apply these functions to my subsetting problem. I'm stuck when it comes to the subset of the subset (calculating the means) and then applying that to the individual values. I am hoping there is an elegant way to do this without all of the subsetting, but I am at a loss on how.
I have tried:
inhibition <- data %>%
  group_by(target, box, replicate) %>%
  mutate(pct = (percent_inhibition(.$value[.$sample == "uninhib"],
                                   .$value[.$sample == "inhib"], .$value)))
But I get the error that columns are not the right length, because of the group_by function.
The problem is that . in a magrittr pipe refers to the whole data frame, not the current group, so the lengths don't line up; inside mutate() the bare column value is already the grouped slice:
library(tidyr)
library(purrr) # Or tidyverse
library(dplyr)
data %>%
  group_by(target, box, replicate) %>%
  mutate(pct = {
    x <- split(value, sample)
    percent_inhibition(x$uninhib, x$inhib, value)
  })
#> # A tibble: 10,000 x 6
#> # Groups:   target, box, replicate [27]
#>    target box   replicate sample    value     pct
#>    <chr>  <chr>     <int> <chr>     <dbl>   <dbl>
#>  1 A      1             3 inhib   -0.836    1941.
#>  2 C      1             1 uninhib -0.221    -281.
#>  3 B      3             2 inhib   -2.10     1547.
#>  4 C      1             1 uninhib -1.67    -3081.
#>  5 C      1             3 inhib   -1.10    -1017.
#>  6 A      2             1 inhib   -1.67      906.
#>  7 B      3             1 uninhib -0.0495    -57.3
#>  8 C      3             2 inhib    1.56     5469.
#>  9 B      3             2 uninhib -0.405     321.
#> 10 B      1             2 inhib    0.786   -3471.
#> # … with 9,990 more rows
Created on 2019-03-25 by the reprex package (v0.2.1)
Or:
data %>%
  group_by(target, box, replicate) %>%
  mutate(pct = percent_inhibition(value[sample == "uninhib"],
                                  value[sample == "inhib"], value))
With data as:
n <- 10000L
set.seed(123)
data <- tibble(
  target = sample(LETTERS[1:3], n, replace = TRUE),
  box = sample(as.character(1:3), n, replace = TRUE),
  replicate = sample(1:3, n, replace = TRUE),
  sample = sample(c("inhib", "uninhib"), n, replace = TRUE),
  value = rnorm(n)
)
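And for the double check the question mentions, a short sketch (saving the pipeline above to a hypothetical inhibition object): because the function is centred on the group means, the mean pct of the "uninhib" rows should come out at 0 and of the "inhib" rows at 100 within each grouping:
library(dplyr)
inhibition <- data %>%
  group_by(target, box, replicate) %>%
  mutate(pct = percent_inhibition(value[sample == "uninhib"],
                                  value[sample == "inhib"], value)) %>%
  ungroup()

# mean pct per experiment and sample type: ~0 for uninhib, ~100 for inhib
inhibition %>%
  group_by(target, box, replicate, sample) %>%
  summarise(mean_pct = mean(pct, na.rm = TRUE), .groups = "drop")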
