Distance to nearest point by group using sf - r

I have a dataset that looks similar to the example below. For each code I would like to calculate the distance to the next nearest code that belongs to the same area as it. So in my example, for each code belonging to area A001 I would be after an additional column in the dataset that contains the minimum distance to one of the other points that belong to area A001. I assume there should be a way of using st_distance to achieve this?
require("data.table")
require("sf")
dt1 <- data.table(
code=c("A00111", "A00112","A00113","A00211","A00212","A00213","A00214","A00311","A00312"),
area=c("A001", "A001","A001","A002","A002","A002","A002","A003","A003"),
x=c(325147,323095,596020,257409,241206,248371,261076,595218,596678),
y=c(286151,284740,335814,079727,084266,078283,062045,333889,337836))
sf1 <- st_as_sf(dt1, coords = c("x","y"), crs=27700, na.fail=FALSE)

There might be a 'cleaner' way to get here, but this gets you the correct values.
library(tidyverse)
# intermediate fun to help later in apply()
smallest_non_zero <- function(x) {
min_val <- min(x[x != 0])
x[match(min_val, x)]
}
closest_grp_distances <- sf1 %>%
group_split(area) %>%
map(~st_distance(., .) %>% # returns matrix
apply(1, smallest_non_zero)) %>%
unlist()
sf1$closest_grp_distances <- closest_grp_distances
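I wanted to use base R's split(), but it doesn't have a method for sf objects. Note that group_split() returns the groups sorted by area, so assigning the unlisted result back to sf1 assumes its rows are already ordered by area, as they are in the example.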
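Here is a variant of the same idea, as a minimal sketch rather than a definitive implementation. It writes each result back by row index, so it does not depend on the rows being sorted by area. It also masks the diagonal of the distance matrix instead of dropping zeros, so two distinct points at the same location are still reported as 0 apart. The column name nearest_in_area is made up for illustration.

```r
library(sf)

# Same example data as in the question (EPSG:27700, so distances are in metres)
pts <- data.frame(
  code = c("A00111", "A00112", "A00113", "A00211", "A00212",
           "A00213", "A00214", "A00311", "A00312"),
  area = c("A001", "A001", "A001", "A002", "A002",
           "A002", "A002", "A003", "A003"),
  x = c(325147, 323095, 596020, 257409, 241206, 248371, 261076, 595218, 596678),
  y = c(286151, 284740, 335814, 79727, 84266, 78283, 62045, 333889, 337836)
)
sf1 <- st_as_sf(pts, coords = c("x", "y"), crs = 27700)

# nearest_in_area is a hypothetical column name
sf1$nearest_in_area <- NA_real_
for (idx in split(seq_len(nrow(sf1)), sf1$area)) {
  # pairwise distances within one area, stripped of units
  d <- matrix(as.numeric(st_distance(sf1[idx, ])), nrow = length(idx))
  diag(d) <- Inf                       # ignore each point's distance to itself
  sf1$nearest_in_area[idx] <- apply(d, 1, min)
}
```

An area that contains a single point gets Inf, since there is no other point to measure to.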

Related

How can I access a matrix's entries using a for loop in R?

I have a distance matrix with all distances between all points in the data set.
How can I access the individual distances in the matrix without using a for loop?
This is a working example using a for loop:
# Create a distance matrix of all possible distances
DistMatrix <- Stations %>%
st_sf(crs = 4326) %>%
filter(!is.na(end.station.id)) %>%
st_distance(Stations$geometry)
# initialisation of new distance data frame Dist containing start and end station id.
Dist <- Bike %>%
select(start.station.id, end.station.id, dateS, tripduration) %>%
filter(!is.na(end.station.id)) %>%
mutate(Dist = NA)
# iterates over all rows and allocates the corresponding distance to start and end station id's.
for(i in 1:length(Dist$dateS)){
Dist$Dist[i] <- DistMatrix[which(Stations$end.station.id == Dist$end.station.id[i]),
which(Stations$end.station.id == Dist$start.station.id[i])]
}
This is my best attempt at this problem using dplyr::mutate:
Dist2 <- Dist2 %>% mutate(Dist = DistMatrix[which(end.station.id == Stations[1]),
which(start.station.id == Stations[1])])
The expected outcome is the dataframe Dist with the column Dist (distances) filled in.
Is there a working solution to this problem?
Thx!
EDIT: The example code is more detailed now.
EDIT 2: Added expected outcome.
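One loop-free option, sketched on made-up data since the real Stations and Bike objects are not shown: look up the row and column positions with match(), then index the matrix with a two-column matrix of (row, column) pairs. The names station_ids, dist_matrix and trips are hypothetical stand-ins for the objects in the question.

```r
# Hypothetical stand-ins for the station ids and the distance matrix
station_ids <- c(10, 20, 30)
dist_matrix <- matrix(c(0, 5, 9,
                        5, 0, 4,
                        9, 4, 0), nrow = 3, byrow = TRUE)

# Hypothetical trips table with start and end station ids
trips <- data.frame(start.station.id = c(10, 30, 20),
                    end.station.id   = c(20, 10, 20))

# Position of each trip's end and start station in the matrix
rows <- match(trips$end.station.id, station_ids)
cols <- match(trips$start.station.id, station_ids)

# Indexing with a two-column matrix picks one entry per (row, col) pair
trips$Dist <- dist_matrix[cbind(rows, cols)]
```

Unlike which(), match() returns NA for an id that is missing from the station list, so such trips end up with NA instead of raising an error.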

Lagged values multiple columns with function in R

I would like to create lagged values for multiple columns in R.
First, I used a function to create lead/lag like this:
mleadlag <- function(x, n, ts_id) {
pos <- match(as.numeric(ts_id) + n, as.numeric(ts_id))
x[pos]
}
Second, I would like to apply this function to several columns. firm.characteristics is the list of columns for which I would like to compute lagged values.
library(dplyr)
firm.characteristics <- colnames(df)[4:6]
for(i in 1:length(firm.characteristics)){
df <- df %>%
group_by(company) %>%
mutate(!!paste0("lag_", i) := mleadlag(df[[i]] ,-1, fye)) %>%
ungroup()
}
However, I didn't get the correct values. The output for all companies in year t is the last row in year t-1. It didn't group by the company and compute the lagged values.
Can anyone tell me what is wrong in the loop? Or what should I do to get the correct lagged values?
Thank you so much for your help.
Reproducible sample could be like this:
set.seed(42) ## for sake of reproducibility
n <- 6
dat <- data.frame(company=1:n,
fye=2009,
x=rnorm(n),
y=rnorm(n),
z=rnorm(n),
k=rnorm(n),
m=rnorm(n))
dat2 <- data.frame(company=1:n,
fye=2010,
x=rnorm(n),
y=rnorm(n),
z=rnorm(n),
k=rnorm(n),
m=rnorm(n))
dat3 <- data.frame(company=1:n,
fye=2011,
x=rnorm(n),
y=rnorm(n),
z=rnorm(n),
k=rnorm(n),
m=rnorm(n))
df <- rbind(dat,dat2,dat3)
I would try to stay away from loops in the tidyverse. Most tidyverse operations that would traditionally need a loop already have a vectorised equivalent, which is fast and, in my opinion, more intuitive. This is a great use case for dplyr's across(). I first changed the df to a tibble.
library(dplyr)
df %>%
as_tibble() %>%
group_by(company) %>%
mutate(
# lag() works by row position, so rows must be ordered by fye within each company
# all_of() selects the columns named in the character vector
across(all_of(firm.characteristics), ~lag(.x, 1L), .names = "lag_{.col}")
) %>%
ungroup()
This generates the required lagged values. For more information see dplyr's across documentation.
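If some company-years are missing, a lag by row position returns the value from the previous available row rather than from year t-1. A sketch that keeps the asker's mleadlag() (which matches on the year itself) inside across(), on a small made-up panel where company 2 has no 2010 row:

```r
library(dplyr)

# The asker's helper: value of x at the period n steps away from each ts_id
mleadlag <- function(x, n, ts_id) {
  pos <- match(as.numeric(ts_id) + n, as.numeric(ts_id))
  x[pos]
}

# Made-up panel; company 2 is missing fye 2010
toy <- data.frame(company = c(1, 1, 1, 2, 2),
                  fye     = c(2009, 2010, 2011, 2009, 2011),
                  x       = c(1, 2, 3, 10, 30))

out <- toy %>%
  group_by(company) %>%
  # fye here is the current group's column, so matching stays within a company
  mutate(across(all_of("x"), ~ mleadlag(.x, -1, fye), .names = "lag_{.col}")) %>%
  ungroup()
```

For company 2 the 2011 row gets NA, because there is no 2010 observation to take the value from.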

Want to write a loop to find reflectance values for each column

So I have this code in R that I'm using on a dataframe df that comes in the format where each row is a wavelength (823 rows/wavelengths) and each column is a pixel (written as V1-V2554).
I have code that normalises the reflectance values of a single spectrum/pixel (here V6):
# Define function to find vector length
veclen=function(vec) {
sqrt(sum(vec^2))
}
# Find vector length for spectrum of each pixel
df_vecV6 <- df %>%
group_by(Wavelength) %>%
summarise(veclengthV6 = veclen(V6))
# Join new variable "veclength"
df <- df %>%
left_join(df_vecV6, by = "Wavelength")
# Define function that return normalized vector
vecnorm=function(vector) {
vector/veclen(vector)
}
# Normalize by dividing each reflectance value by the vector’s length
df$refl_normV6 <- vecnorm(df$V6)
but I want to create a loop to do this for all 2553 columns. I started writing it but seem to come up with problems. In this case df is finaldatat and I wanted to create a list svec to store vector lengths before the next steps:
for(i in (1:ncol(finaldatat))){
svec[[i]]<- finaldatat %>%
#group_by(Wavelength) %>%
summarise (x = veclen(finaldatat[,i]))
}
That first step runs, but the vector lengths that are meant to be below zero are way above so I already know there's a problem. Any help is appreciated!
Ideally in the final dataframe I would only have the normalised results in the same 2554x824 format.
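You can use dplyr's across() to apply vecnorm to all columns from V1 to V2554. Because each row is a single wavelength, grouping by Wavelength would normalise every value on its own; to normalise each pixel's whole spectrum, apply the function column-wise without grouping.
result <- df %>%
mutate(across(V1:V2554, vecnorm))
# In versions of dplyr older than 1.0.0, use mutate_at() instead:
# result <- df %>% mutate_at(vars(V1:V2554), vecnorm)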
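A quick check on made-up data, in base R, that normalising column by column gives every spectrum a vector length of 1. With one row per wavelength, grouping by Wavelength first would instead divide each value by its own absolute value. The toy data frame is hypothetical; veclen and vecnorm are the asker's functions.

```r
veclen  <- function(vec) sqrt(sum(vec^2))
vecnorm <- function(vec) vec / veclen(vec)

# Hypothetical toy data: 3 wavelengths (rows) x 2 pixels (columns)
toy <- data.frame(Wavelength = c(400, 500, 600),
                  V1 = c(3, 4, 0),
                  V2 = c(1, 2, 2))

# Normalise every pixel column, leaving Wavelength untouched
pixel_cols <- setdiff(names(toy), "Wavelength")
toy[pixel_cols] <- lapply(toy[pixel_cols], vecnorm)

# Vector length of each normalised spectrum
lengths_after <- sapply(toy[pixel_cols], veclen)
```

The result keeps the original layout (one row per wavelength, one column per pixel), which is the format asked for in the question.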

How to filter a dataframe using a list of multiple ranges of a variable

I'm attempting to filter a large signal intensity dataframe using a list of ranges of one variable (chromosome position) in the dataframe. The list has 256 ranges in total, with start and end positions. I can successfully filter the dataframe using a single range, but I can't seem to get this to loop over the entire dataframe.
DT is the original signal intensity dataframe (SNP, Chr, Position, Intensity Ratio) and PR is a dataframe with the chromosome and the start and end Position of each range:
Chr Start End
1 130104 207101
1 1423247 4459324
1 6543121 7924836
This line of code works to extract the data from a single range:
test <- DT %>% filter(Chr %in% ("1")) %>% filter(Position %in% c(PR$Start[1]:PR$End[1]))
This does NOT work:
for (i in 1:nrow(PR)){
help <- DT %>% filter(Chr %in% ("1")) %>% filter(Position %in% c(PR$Start[i]:PR$End[i]))
}
The above code produces a dataframe with a random selection of data that doesn't correspond to the range of positions.
This doesn't work either:
range = data.table(start=PR$Start,end=PR$End)
x <- DT[Position %inrange% range]
Thank you in advance!
Your data.table solution worked for me. Does this work for you, with my made up data?
dt <- data.table(id = 1:100, var=runif(100))
ranges <- data.table(start=c(20,50,70), end=c(30,55,72))
dt[id %inrange% ranges]
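As for the for loop in the question: it assigns to the same object on every pass, so only the rows for the last range survive. A base R sketch that accumulates the matches instead and also respects the chromosome, using the three ranges from the question and a made-up DT:

```r
# Ranges from the question
PR <- data.frame(Chr   = c("1", "1", "1"),
                 Start = c(130104, 1423247, 6543121),
                 End   = c(207101, 4459324, 7924836),
                 stringsAsFactors = FALSE)

# Made-up signal table (SNP names and positions are hypothetical)
DT <- data.frame(SNP      = paste0("rs", 1:6),
                 Chr      = c("1", "1", "1", "1", "2", "1"),
                 Position = c(100000, 150000, 2000000, 5000000, 150000, 7000000),
                 stringsAsFactors = FALSE)

# Start with nothing selected and switch rows on range by range
keep <- rep(FALSE, nrow(DT))
for (i in seq_len(nrow(PR))) {
  in_range <- DT$Chr == PR$Chr[i] &
    DT$Position >= PR$Start[i] &
    DT$Position <= PR$End[i]
  keep <- keep | in_range
}
filtered <- DT[keep, ]
```

Comparing against Start and End directly also avoids building the full integer sequence Start:End for every range.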

Method for calculating distance between all points in a dataframe containing a list of xy coordinates

I'm sure this has been answered before, but I can't find the thread for the life of me!
I am trying to use R to produce a list of all the distances between pairs of xy coordinates in a dataframe. The data is stored something like this:
ID = c('1','2','3','4','5','6','7')
x = c(1,2,4,5,1,3,1)
y = c(3,5,6,3,1,5,1)
df= data.frame(ID,x,y)
At the moment I can calculate the distance between two points using:
length = sqrt((x1 - x2)^2+(y1 - y2)^2).
However, I am uncertain as to where to go next. Should I use something from plyr or a for loop?
Thanks for any help!
Have you tried ?dist? The formula you listed is the Euclidean distance.
dist(df[,-1])
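dist() returns a compact lower-triangle object. To turn it into the list of pairs the question asks for, one option (a sketch on a smaller made-up data frame) is to pair it with combn(), which enumerates the ID pairs in the same order as dist() stores its values:

```r
# Smaller made-up example in the same shape as the question's data
df <- data.frame(ID = c("1", "2", "3"),
                 x  = c(0, 3, 0),
                 y  = c(0, 4, 1),
                 stringsAsFactors = FALSE)

# Euclidean distances between all rows, using only the coordinate columns
d <- dist(df[, c("x", "y")])

# All unordered ID pairs: (1,2), (1,3), (2,3), matching the order inside d
id_pairs <- t(combn(df$ID, 2))

pair_dist <- data.frame(ID1  = id_pairs[, 1],
                        ID2  = id_pairs[, 2],
                        dist = as.vector(d))
```

Each pair appears once, so there are no self-distances and no mirrored duplicates to filter out.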
You can use a self-join to get all combinations and then apply your distance formula. All of this is easily doable using the tidyverse (a collection of packages from Hadley Wickham):
# Load the tidyverse
library(tidyverse)
# Set up a fake key to join on (just a constant)
df <- df %>% mutate(k = 1)
# Perform the join, remove the key, then create the distance
df %>%
full_join(df, by = "k") %>%
mutate(dist = sqrt((x.x - x.y)^2 + (y.x - y.y)^2)) %>%
select(-k)
N.B. using this method, you'll also calculate the distance between each point and itself (as well as with all other points). It's easy to filter those points out though:
df %>%
full_join(df, by = "k") %>%
filter(ID.x != ID.y) %>%
mutate(dist = sqrt((x.x - x.y)^2 + (y.x - y.y)^2)) %>%
select(-k)
For more information about using the tidyverse set of packages I'd recommend R for Data Science or the tidyverse website.
