How to find the three closest (nearest) values within a vector in R?

I would like to find out the three closest numbers in a vector.
Something like
v = c(10,23,25,26,38,50)
c = findClosest(v,3)
c
23 25 26
I tried sort(colSums(as.matrix(dist(v))))[1:3], and it kind of works, but it selects the three numbers with the minimum overall distance, not the three closest numbers.
There is already an answer for MATLAB, but I do not know how to translate it to R:
%finds the index with the minimal difference in A
minDiffInd = find(abs(diff(A))==min(abs(diff(A))));
%extract this index, and its neighbor index from A
val1 = A(minDiffInd);
val2 = A(minDiffInd+1);
How to find two closest (nearest) values within a vector in MATLAB?
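For reference, a near-literal R translation of that MATLAB snippet (two closest values only; my sketch, assuming the vector is sorted first) would be:
A <- sort(v)
minDiffInd <- which.min(abs(diff(A)))  # index with the minimal difference in A
val1 <- A[minDiffInd]                  # extract this value ...
val2 <- A[minDiffInd + 1]              # ... and its neighbour
c(val1, val2)
#[1] 25 26
Note that which.min returns only the first minimum, whereas MATLAB's find returns every index attaining it.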

My assumption is that, for the n nearest values in a sorted vector, the only thing that matters is the difference v[i] - v[i - (n - 1)]. That is, we are looking for the minimum of diff(x, lag = n - 1L).
findClosest <- function(x, n) {
  x <- sort(x)
  x[seq.int(which.min(diff(x, lag = n - 1L)), length.out = n)]
}
findClosest(v, 3L)
[1] 23 25 26

Let's define "nearest numbers" as "numbers with the minimal sum of L1 distances". You can achieve what you want with a combination of diff and a windowed sum.
You could write a much shorter function, but I wrote it step by step to make it easier to follow (a more compact sketch follows the examples below).
v <- c(10,23,25,26,38,50)
#' Find the n nearest numbers in a vector
#'
#' @param v Numeric vector
#' @param n Number of nearest numbers to extract
#'
#' @details "Nearest numbers" defined as the numbers which minimise the
#' within-group sum of L1 distances.
#'
findClosest <- function(v, n) {
  # Sort and remove NA
  v <- sort(v, na.last = NA)
  # Compute L1 distances between closest points. We know each point is next to
  # its closest neighbour since we sorted.
  delta <- diff(v)
  # Compute the sum of L1 distances on a rolling window with n - 1 elements.
  # Why n - 1? Because we are looking at deltas, and 2 deltas ~ 3 elements.
  withingroup_distances <- zoo::rollsum(delta, k = n - 1)
  # Now it's simply finding the group with the minimum within-group sum
  # and working out the elements
  group_index <- which.min(withingroup_distances)
  element_indices <- group_index + 0:(n - 1)
  v[element_indices]
}
findClosest(v, 2)
# 25 26
findClosest(v, 3)
# 23 25 26
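For reference, a more compact sketch of the same idea (my condensation of the step-by-step function above, assuming zoo is installed):
findClosest2 <- function(v, n) {
  v <- sort(v, na.last = NA)
  start <- which.min(zoo::rollsum(diff(v), k = n - 1))
  v[start:(start + n - 1)]
}
findClosest2(v, 3)
# 23 25 26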

A base R option. The idea is to first sort the vector, subtract every ith element of the sorted vector from its (i + n - 1)th element, and select the group with the minimum difference.
closest_n_vectors <- function(v, n) {
  v1 <- sort(v)
  inds <- which.min(sapply(head(seq_along(v1), -(n - 1)), function(x)
    v1[x + n - 1] - v1[x]))
  v1[inds:(inds + n - 1)]
}
closest_n_vectors(v, 3)
#[1] 23 25 26
closest_n_vectors(c(2, 10, 1, 20, 4, 5, 23), 2)
#[1] 1 2
closest_n_vectors(c(19, 23, 45, 67, 89, 65, 1), 2)
#[1] 65 67
closest_n_vectors(c(19, 23, 45, 67, 89, 65, 1), 3)
#[1] 1 19 23
In case of a tie, this will return the numbers with the smallest values, since we are using which.min.
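For example (my illustration of the tie behaviour):
closest_n_vectors(c(1, 2, 10, 11), 2)
#[1] 1 2   # {1, 2} and {10, 11} both have a gap of 1; the smaller pair is returned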
BENCHMARKS
Since we have quite a few answers, it is worth benchmarking all the solutions so far.
set.seed(1234)
x <- sample(100000000, 100000)
identical(findClosest_antoine(x, 3), findClosest_Sotos(x, 3)) &&
  identical(findClosest_antoine(x, 3), closest_n_vectors_Ronak(x, 3)) &&
  identical(findClosest_antoine(x, 3), findClosest_Cole(x, 3))
#[1] TRUE
microbenchmark::microbenchmark(
  antoine = findClosest_antoine(x, 3),
  Sotos = findClosest_Sotos(x, 3),
  Ronak = closest_n_vectors_Ronak(x, 3),
  Cole = findClosest_Cole(x, 3),
  times = 10
)
#Unit: milliseconds
#    expr      min       lq     mean   median       uq      max neval cld
# antoine  148.751  159.071  163.298  162.581  167.365  181.314    10   b
#   Sotos 1086.098 1349.762 1372.232 1398.211 1453.217 1553.945    10   c
#   Ronak   54.248   56.870   78.886   83.129   94.748  100.299    10   a
#    Cole    4.958    5.042    6.202    6.047    7.386    7.915    10   a

An idea is to use the zoo library to do a rolling operation, i.e.
library(zoo)
m1 <- rollapply(v, 3, by = 1, function(i) c(sum(diff(i)), c(i)))
m1[which.min(m1[, 1]), ][-1]
#[1] 23 25 26
Or make it into a function,
findClosest <- function(vec, n) {
  require(zoo)
  vec1 <- sort(vec)
  m1 <- rollapply(vec1, n, by = 1, function(i) c(sum(diff(i)), c(i)))
  return(m1[which.min(m1[, 1]), ][-1])
}
findClosest(v, 3)
#[1] 23 25 26

For use in a data frame:
data %>%
  group_by(var1, var2) %>%
  do(data.frame(findClosest(.$val, 3)))


Minimum absolute difference between vector pairs (greedy algorithm)

Given a numeric vector, I'd like to find the smallest absolute difference in combinations of size 2. However, the point of friction comes with the use of combn to create the matrix holding the pairs. How would one handle issues when a matrix/vector is too large?
When the number of resulting pairs (number of columns) using combn is too large, I get the following error:
Error in matrix(r, nrow = len.r, ncol = count) :
invalid 'ncol' value (too large or NA)
This post states that the size limit of a matrix is roughly one billion rows and two columns.
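For a rough sense of the scale involved (my illustration, not part of the original question), with 1e5 values the number of pairs already exceeds the largest dimension a matrix can have:
choose(1e5, 2)        # 4,999,950,000 pairs, i.e. the columns combn would need
.Machine$integer.max  # 2,147,483,647, the maximum possible ncol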
Here is the code I've used. Apologies for the use of cat in my function output -- I'm solving the Minimum Absolute Difference in an Array greedy-algorithm problem on HackerRank, and R outputs are only counted as correct if they're given using cat:
minimumAbsoluteDifference <- function(arr) {
  combos <- combn(arr, 2)
  cat(min(abs(combos[1, ] - combos[2, ])))
}
# This works fine
input0 <- c(3, -7, 0)
minimumAbsoluteDifference(input0) #returns 3
# This fails
inputFail <- rpois(10e4, 1)
minimumAbsoluteDifference(inputFail)
#Error in matrix(r, nrow = len.r, ncol = count) :
# invalid 'ncol' value (too large or NA)
TL;DR
No need for combn or the like, simply:
min(abs(diff(sort(v))))
The Nitty Gritty
Finding the difference between every possible combination is O(n^2). So when we get to vectors of length 1e5, the task becomes burdensome both computationally and memory-wise.
We need a different approach.
How about sorting and taking the difference only with its neighbor?
After sorting, for any element v[j], the smallest absolute difference involving v[j] is against one of its immediate neighbours, i.e. min(|v[j] - v[j-1]|, |v[j] - v[j+1]|). For example, given the sorted vector v:
v = -9 -8 -6 -4 -2 3 8
The smallest distance involving -2 is one of:
|-2 - 3| = 5
|-4 - (-2)| = 2
There is no need to check any other elements.
This is easily implemented in base R as follows:
getAbsMin <- function(v) min(abs(diff(sort(v))))
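A quick sanity check against brute force on the small example above (my check; only feasible because the vector is tiny):
v <- c(-9, -8, -6, -4, -2, 3, 8)
getAbsMin(v)                                    # 1  (between -9 and -8)
min(combn(v, 2, function(p) abs(p[1] - p[2])))  # 1, the same answer by brute force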
I'm not going to use rpois because, with any reasonably sized vector, duplicates will be produced, which trivially gives 0 as the answer. A more sensible test would be with runif or sample (minimumAbsoluteDifference2 is from the answer provided by @RuiBarradas):
set.seed(1729)
randUnif100 <- lapply(1:100, function(x) {
  runif(1e3, -100, 100)
})
randInts100 <- lapply(1:100, function(x) {
  sample(-(1e9):(1e9), 1e3)
})
head(sapply(randInts100, getAbsMin))
[1] 586 3860 2243 2511 5186 3047
identical(sapply(randInts100, minimumAbsoluteDifference2),
sapply(randInts100, getAbsMin))
[1] TRUE
options(scipen = 99)
head(sapply(randUnif100, getAbsMin))
[1] 0.00018277206 0.00020549633 0.00009834766 0.00008395873 0.00005299225 0.00009313226
identical(sapply(randUnif100, minimumAbsoluteDifference2),
sapply(randUnif100, getAbsMin))
[1] TRUE
It's very fast as well:
library(microbenchmark)
microbenchmark(a = getAbsMin(randInts100[[50]]),
               b = minimumAbsoluteDifference2(randInts100[[50]]),
               times = 25, unit = "relative")
Unit: relative
 expr      min       lq     mean   median       uq      max neval
    a   1.0000   1.0000   1.0000   1.0000  1.00000  1.00000    25
    b 117.9799 113.2221 105.5144 107.6901 98.55391 81.05468    25
Even for very large vectors, the result is instantaneous:
set.seed(321)
largeTest <- sample(-(1e12):(1e12), 1e6)
system.time(print(getAbsMin(largeTest)))
[1] 3
user system elapsed
0.083 0.003 0.087
Something like this?
minimumAbsoluteDifference2 <- function(x){
  stopifnot(length(x) >= 2)
  n <- length(x)
  inx <- rep(TRUE, n)
  m <- NULL
  for(i in seq_along(x)[-n]){
    inx[i] <- FALSE
    curr <- abs(x[i] - x[which(inx)])
    m <- min(c(m, curr))
  }
  m
}
# This works fine
input0 <- c(3, -7, 0)
minimumAbsoluteDifference(input0) #returns 3
minimumAbsoluteDifference2(input0) #returns 3
set.seed(2020)
input1 <- rpois(1e3, 1)
minimumAbsoluteDifference(input1) #returns 0
minimumAbsoluteDifference2(input1) #returns 0
inputFail <- rpois(1e5, 1)
minimumAbsoluteDifference(inputFail) # This fails
minimumAbsoluteDifference2(inputFail) # This does not fail

Fast sum of values of a vector above given thresholds

I have a vector of threshold values, thresholds, and another vector, x. I'd like to create a new vector, say vec_sum, of the same length as thresholds, that stores, for each element of thresholds, the sum of values of x larger than this element.
What is the fastest way of doing this?
The naive way I'm doing it is
vec_sum <- rep(NA, length(thresholds))
for(i in seq_along(thresholds))
{
  vec_sum[i] <- sum(x[x > thresholds[i]])
}
In case it helps, thresholds is already sorted.
Here is another solution using cumsum:
f1 <- function(v, th){
  v2 <- v[order(v)]
  v2s <- rev(cumsum(rev(v2)))
  return(v2s[findInterval(th, v2) + 1])
}
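To see the mechanics of f1, here is a small self-contained trace (my annotation, not part of the original answer):
v  <- c(2, 3, 1, 19, 4, 6, 5, 15, 7:14, 16:18, 20)  # 1:20 in random order
th <- c(3, 5, 10)
v2  <- sort(v)
v2s <- rev(cumsum(rev(v2)))    # v2s[i] = sum of v2[i], v2[i+1], ..., v2[n]
findInterval(th, v2)           # 3 5 10: index of the last sorted value <= each threshold
v2s[findInterval(th, v2) + 1]  # 204 195 155: sums of the values strictly above each threshold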
Here are some tests and a comparison with the other answer (as well as its example data) by Ronak:
f2 <- function(x, thresholds){
  if (all(x < thresholds[1])) return(rep(0, length(thresholds)))
  if (all(x > thresholds[length(thresholds)])) return(rep(sum(x), length(thresholds)))
  return(rev(cumsum(rev(tapply(x,
    findInterval(x, thresholds, left.open = TRUE), sum)[-1]))))
}
test_th <- c(3, 5, 10)
test_x <- c(2, 3, 1, 19, 4, 6, 5, 15, 7:14, 16:18, 20)
vec_sum <- rep(NA, length(test_th))
for(i in seq_along(test_th)) {
  vec_sum[i] <- sum(test_x[test_x > test_th[i]])
}
all(dplyr::near(f1(test_x, test_th), vec_sum))
# [1] TRUE
all(dplyr::near(f2(test_x, test_th), vec_sum))
# [1] TRUE
set.seed(123)
test_x <- rnorm(10000)
test_th <- sort(rnorm(100)) ## f2 requires sorted threshold values
vec_sum <- rep(NA, length(test_th))
for(i in seq_along(test_th)) {
  vec_sum[i] <- sum(test_x[test_x > test_th[i]])
}
all(dplyr::near(f1(test_x, test_th), vec_sum))
# [1] TRUE
all(dplyr::near(f2(test_x, test_th), vec_sum))
# [1] FALSE
# Warning message:
# In x - y : longer object length is not a multiple of shorter object length
library(microbenchmark)
microbenchmark(
  a = f1(test_x, test_th),
  b = f2(test_x, test_th)
)
# Unit: microseconds
#  expr      min       lq      mean   median       uq       max neval
#     a  587.116  682.864  900.3572  694.713  703.726 10647.206   100
#     b 1157.213 1203.063 1260.0663 1223.600 1258.552  2143.069   100
Not sure if this is any faster, but we can use findInterval to cut x by thresholds. We take the sum of each group using tapply and take the cumsum in reverse.
as.integer(rev(cumsum(rev(tapply(x,
findInterval(x, thresholds, left.open = TRUE), sum)[-1]))))
Tested on
thresholds <- c(3, 5, 10)
x <- c(2, 3, 1, 19, 4, 6, 5, 15, 7:14, 16:18, 20) #1:20 in random order
vec_sum <- rep(NA, length(thresholds))
for(i in seq_along(thresholds)) {
  vec_sum[i] <- sum(x[x > thresholds[i]])
}
vec_sum
#[1] 204 195 155
Using the proposed solution
as.integer(rev(cumsum(rev(tapply(x,
findInterval(x, thresholds, left.open = TRUE), sum)[-1]))))
#[1] 204 195 155
Explaining the answer: findInterval returns the group to which each value of x belongs
findInterval(x, thresholds, left.open = TRUE)
#[1] 0 0 0 3 1 2 1 3 2 2 2 2 3 3 3 3 3 3 3 3
We use tapply to get the sum of each group
tapply(x, findInterval(x, thresholds, left.open = TRUE), sum)
# 0 1 2 3
# 6 9 40 155
The 0 group should be excluded, since those values are smaller than every threshold (hence the [-1]). The sum for group 1 should also include groups 2 and 3, and the sum for group 2 should include group 3. So we reverse the sequence and take the cumsum.
cumsum(rev(tapply(x, findInterval(x, thresholds, left.open = TRUE), sum)[-1]))
# 3 2 1
#155 195 204
To get it back in the original order and match it with the thresholds, we reverse it again
rev(cumsum(rev(tapply(x, findInterval(x, thresholds, left.open = TRUE), sum)[-1])))
# 1 2 3
#204 195 155
Edge cases:
If all values are below the lowest threshold, or all values are above the highest threshold, we need an extra check and should return the following.
if (all(x < thresholds[1])) rep(0, length(thresholds))
if (all(x > thresholds[length(thresholds)])) rep(sum(x), length(thresholds))
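A quick check of these edge cases, using the f2 wrapper from the benchmark answer above (my example inputs):
f2(c(1, 2), thresholds)    # all values below the lowest threshold
#[1] 0 0 0
f2(c(20, 30), thresholds)  # all values above the highest threshold
#[1] 50 50 50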

make a variable based on a cumulative sum with reset based on condition

I want a variable such as desired_output, based on a cumulative sum over cumsumover, where the cumsum function resets every time it reaches the next number in thresh.
cumsumover <- c(1, 2, 7, 4, 2, 5)
thresh <- c(3, 7, 11)
desired_output <- c(3, 3 ,7 ,11 ,11 ,11) # same length as cumsumover
This question is similar, but I can't wrap my head around the code.
dplyr / R cumulative sum with reset
Compared to similar questions, my condition is specified in a vector of a different length than cumsumover.
Any help would be greatly appreciated. Bonus if both a base R and a tidyverse approach are provided.
In base R, we can use cut with breaks as thresh and labels as letters of the same length as thresh.
cut(cumsum(cumsumover), breaks = c(0, thresh[-1], max(cumsum(cumsumover))),
    labels = letters[seq_along(thresh)])
#[1] a a b c c c
We replaced the last element of thresh with max(cumsum(cumsumover)) so that anything beyond the last element of thresh is assigned the last label.
If we want labels as thresh instead of letters
cut(cumsum(cumsumover), breaks = c(0, thresh[-1], max(cumsum(cumsumover))), labels = thresh)
#[1] 3 3 7 11 11 11
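For reference, the intermediate values behind that cut() call (my annotation):
cumsum(cumsumover)
#[1]  1  3 10 14 16 21
c(0, thresh[-1], max(cumsum(cumsumover)))  # the breaks actually used
#[1]  0  7 11 21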
Here is another solution:
data:
cumsumover <- c(1, 2, 7, 4, 2, 5)
thresh <- c(3, 7, 11)
code:
outp <- letters[1:3]  # to make the solution more general
cumsumover_copy <- cumsumover  # I use <<- inside sapply, so I make a copy to stay safe
unlist(
  sapply(seq_along(thresh), function(x) {
    cs_over <- cumsum(cumsumover_copy)
    ntimes <- sum(cs_over <= thresh[x])
    cumsumover_copy <<- cumsumover_copy[-(1:ntimes)]
    return(rep(outp[x], ntimes))
  })
)
result:
#[1] "a" "a" "b" "c" "c" "c"
Using .bincode you can do this:
thresh[.bincode(cumsum(cumsumover), c(-Inf,thresh[-1],Inf))]
[1] 3 3 7 11 11 11
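Breaking the one-liner down (my annotation):
cs <- cumsum(cumsumover)                # 1 3 10 14 16 21
.bincode(cs, c(-Inf, thresh[-1], Inf))  # bin index of each value
#[1] 1 1 2 3 3 3
thresh[.bincode(cs, c(-Inf, thresh[-1], Inf))]
#[1] 3 3 7 11 11 11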
.bincode is what cut uses internally; cut basically adds labels and checks on top, so calling .bincode directly is more efficient:
x <- rep(cumsum(cumsumover), 10000)
microbenchmark::microbenchmark(
  bincode = thresh[.bincode(x, c(-Inf, thresh[-1], Inf))],
  cut = cut(x, breaks = c(-Inf, thresh[-1], Inf), labels = thresh))
# Unit: microseconds
#     expr    min      lq     mean  median      uq     max neval
#  bincode  450.2  459.75  654.794  482.10  642.20  5028.4   100
#      cut 1739.3 1864.90 2622.593 2215.15 2713.25 12194.8   100

Finding an inversion score using R

Sorry in advance if "inversion score" isn't the proper terminology. Here's a wiki entry.
Consider a list of values, for instance
1 2 3 4 7 6 9 10 8
would have three penalties (a score of 3):
The 6 comes after 7
The 8 comes after 9
The 8 comes after 10
How can I calculate this inversion for a given vector of numbers in R? Note that some values will be NA, and I just want to skip these.
Your "inversion score" is a central component of Kendall's tau statistic. According to Wikipedia (see link), the tau statistic is (# concordant pairs-#discordant pairs)/(n*(n-1)/2). I believe that what R reports as T is the number of concordant pairs. Therefore, we should be able to reconstruct the number of discordant pairs (which I think is what you want) via n*(n-1)/2-T, as follows
x <- c(1,2,3,4,7,6,9,10,8)
(cc <- cor.test(sort(x),x,method="kendall"))
## Kendall's rank correlation tau
## data: sort(x) and x
## T = 33, p-value = 0.0008543
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## 0.8333333
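For this example, n = 9, so the number of discordant pairs is 9*8/2 - 33 = 36 - 33 = 3, which matches the expected inversion score.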
So this function should work:
ff <- function(x) {
  cc <- cor.test(sort(x), x, method = "kendall")
  n <- length(x)
  n*(n-1)/2 - unname(cc$statistic["T"])
}
ff(x) is 3, as requested (it would be good if you gave more examples of desired output ...). I haven't checked speed, but this has the advantage of being implemented in underlying C code.
I quickly came up with two strategies: a naive one, and a more clever one using the outer function.
We look at two vectors of numbers A and B, where A is your example.
A <- scan(text = "1 2 3 4 7 6 9 10 8")
B <- sample(1:2321)
Define and try the naive inversion counting:
simpleInversion <- function(A) {
  sum <- 0
  n <- length(A)
  for (i in 1:(n-1)) {
    for (j in (i+1):n) {
      sum <- sum + (A[i] > A[j])
    }
  }
  return(sum)
}
simpleInversion(A)
simpleInversion(B)
Define and try the slightly more clever inversion counting:
cleverInversion <- function(A) {
  tab <- outer(A, A, FUN = ">")
  return(sum(tab[upper.tri(tab)]))
}
cleverInversion(A)
cleverInversion(B)
For the version which ignores NAs we can simply add an na.omit:
cleverInversion2 <- function(A) {
  AA <- na.omit(A)
  Tab <- outer(AA, AA, FUN = ">")
  return(sum(Tab[upper.tri(Tab)]))
}
A[2] <- NA
cleverInversion2(A)
Hope this helps.
Edit: A faster version
Both functions become quite slow as the size of the vector grows. So I came up with a faster version:
fastInversion <- function(A) {
  return(sum(cbind(1, -1) %*% combn(na.omit(A), 2) > 0))
}
C <- sample(c(1:500, NA))
library("microbenchmark")
microbenchmark(
  simpleInversion(C),
  cleverInversion(C),
  fastInversion(C))
#Unit: microseconds
#               expr        min         lq     median         uq        max neval
# simpleInversion(C) 128538.770 130483.626 133999.272 144660.116 185767.208   100
# cleverInversion(C)   9546.897   9893.358  10513.799  12564.298  17041.789   100
#   fastInversion(C)    104.632    114.229    193.144    198.209    324.614   100
So we gain quite a speed-up of nearly two orders of magnitude. The speed-up is even greater for larger vectors.
You could test each pair of values in your vector, counting the number that are inverted:
inversion.score <- function(vec) {
  sum(apply(combn(length(vec), 2), 2, function(x) vec[x[2]] < vec[x[1]]), na.rm = T)
}
inversion.score(c(1, 2, 3, 7, 6, 9, 10, 8, NA))
# [1] 3

Interpolate NA values

I have two sets of samples that are time independent. I would like to merge them and calculate the missing values
for the times where I do not have values for both. Simplified example:
A <- cbind(time = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
           Avalue = c(1, 2, 3, 2, 1, 2, 3, 2, 1, 2))
B <- cbind(time = c(15, 30, 45, 60), Bvalue = c(100, 200, 300, 400))
C <- merge(A,B, all=TRUE)
   time Avalue Bvalue
1    10      1     NA
2    15     NA    100
3    20      2     NA
4    30      3    200
5    40      2     NA
6    45     NA    300
7    50      1     NA
8    60      2    400
9    70      3     NA
10   80      2     NA
11   90      1     NA
12  100      2     NA
By assuming linear change between samples, it is possible to calculate the missing NA values.
Intuitively it is easy to see that the A value at times 15 and 45 should be 1.5. But a proper calculation for B,
for instance at time 20, would be
100 + (20 - 15) * (200 - 100) / (30 - 15)
which equals 133.33333.
The first parenthesis is the time between the estimation time and the last available sample.
The second parenthesis is the difference between the nearest samples.
The third parenthesis is the time between the nearest samples.
How can I use R to calculate the NA values?
Using the zoo package:
library(zoo)
Cz <- zoo(C)
index(Cz) <- Cz[,1]
Cz_approx <- na.approx(Cz)
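As a quick sanity check on the hand calculation in the question (my addition, not part of the original answer), base R's approx() applies the same linear rule:
approx(x = c(15, 30), y = c(100, 200), xout = 20)$y
#[1] 133.3333
Since the zoo index was set to the time column, na.approx interpolates on the time axis and Cz_approx should show this same value for Bvalue at time 20.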
The proper way to do this statistically and still get valid confidence intervals is to use Multiple Imputation. See Rubin's classic book, and there's an excellent R package for this (mi).
An ugly and probably inefficient Base R solution:
# Data provided:
A <- cbind(time = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
           Avalue = c(1, 2, 3, 2, 1, 2, 3, 2, 1, 2))
B <- cbind(time = c(15, 30, 45, 60), Bvalue = c(100, 200, 300, 400))
C <- merge(A,B, all=TRUE)
# Scalar valued at the minimum time difference: -> min_time_diff
min_time_diff <- min(diff(C$time))
# Adjust frequency of the series to hold all steps in range: -> df
df <- merge(C,
            data.frame(time = seq(min_time_diff,
                                  max(C$time),
                                  by = min_time_diff)),
            by = "time",
            all = TRUE)
# Linear interpolation function handling ties;
# returns an interpolated vector of the same length
# as the input vector: -> vector
l_interp_vec <- function(na_vec){
  approx(x = na_vec,
         method = "linear",
         ties = "constant",
         n = length(na_vec))$y
}
# Applied to a data frame, replacing NA values
# in each of the numeric vectors
# with interpolated values.
# input is a data frame: -> data.frame()
interped_df <- data.frame(lapply(df, function(x){
  if(is.numeric(x)){
    # Store a scalar of the min row where x isn't NA: -> min_non_na
    min_non_na <- min(which(!(is.na(x))))
    # Store a scalar of the max row where x isn't NA: -> max_non_na
    max_non_na <- max(which(!(is.na(x))))
    # Store a scalar of the number of rows needed to impute prior
    # to the first non-NA value: -> ru_lower
    ru_lower <- ifelse(min_non_na > 1, min_non_na - 1, min_non_na)
    # Store a scalar of the number of rows needed to impute after
    # the last non-NA value: -> ru_upper
    ru_upper <- ifelse(max_non_na == length(x),
                       length(x) - 1,
                       (length(x) - (max_non_na + 1)))
    # Store a vector ramping up to the first non-NA value: -> ramp_up
    ramp_up <- as.numeric(
      cumsum(rep(x[min_non_na]/(min_non_na), ru_lower))
    )
    # Apply the interpolation function on vector "x": -> y
    y <- as.numeric(l_interp_vec(as.numeric(x[min_non_na:max_non_na])))
    # Create a vector that combines the ramp_up vector
    # and y if the first NA is at row 1: -> z
    if(length(ramp_up) > 1 & max_non_na != length(x)){
      # Create a vector of interpolations if there are
      # multiple NA values after the last non-NA value: -> lower_l_int
      lower_l_int <- as.numeric(cumsum(rep(mean(diff(c(ramp_up, y))),
                                           ru_upper + 1)) +
                                  as.numeric(x[max_non_na]))
      # Store the linear interpolations in a vector: -> z
      z <- as.numeric(c(ramp_up, y, lower_l_int))
    }else if(length(ramp_up) > 1 & max_non_na == length(x)){
      # Store the linear interpolations in a vector: -> z
      z <- as.numeric(c(ramp_up, y))
    }else if(min_non_na == 1 & max_non_na != length(x)){
      # Create a vector of interpolations if there are
      # multiple NA values after the last non-NA value: -> lower_l_int
      lower_l_int <- as.numeric(cumsum(rep(mean(diff(c(ramp_up, y))),
                                           ru_upper + 1)) +
                                  as.numeric(x[max_non_na]))
      # Store the linear interpolations in a vector: -> z
      z <- as.numeric(c(y, lower_l_int))
    }else{
      # Store the linear interpolations in a vector: -> z
      z <- as.numeric(y)
    }
    # Interpolate between points in x, return new x:
    return(as.numeric(ifelse(is.na(x), z, x)))
  }else{
    x
  }
}))
# Subset interped df to only contain
# the time values in C, store a data frame: -> int_df_subset
int_df_subset <- interped_df[interped_df$time %in% C$time,]
