R: find nearest index - r

I have two vectors with a few thousand points, but generalized here:
A <- c(10, 20, 30, 40, 50)
b <- c(13, 17, 20)
How can I get the indices of A that are nearest to b? The expected outcome would be c(1, 2, 2).
I know that findInterval can only find the first occurrence, and not the nearest, and I'm aware that which.min(abs(b[2] - A)) is getting warmer, but I can't figure out how to vectorize it to work with long vectors of both A and b.

You can just wrap your code in sapply. I think this runs at about the same speed as a for loop, so it isn't technically vectorized:
sapply(b,function(x)which.min(abs(x - A)))

findInterval gets you very close. You just have to pick between the offset it returns and the next one (vec must be sorted in increasing order, as findInterval requires):
# returns the index of the element of vec nearest to each x
nearest.vec <- function(x, vec)
{
  smallCandidate <- findInterval(x, vec, all.inside = TRUE)
  largeCandidate <- smallCandidate + 1
  # nudge is TRUE if the large candidate is nearer, FALSE otherwise
  nudge <- 2 * x > vec[smallCandidate] + vec[largeCandidate]
  return(smallCandidate + nudge)
}
nearest.vec(b, A)
This returns (1, 2, 2), and should be comparable to findInterval in performance.
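If vec is not already sorted, one possible workaround (a sketch, not part of the answer above) is to sort a copy and map the result back through order(). The name nearest_index is made up for illustration, and the sketch assumes vec has at least two elements and no NAs.

```r
# Hypothetical helper: nearest index in an UNSORTED vec.
# Assumes length(vec) >= 2 and no NA values.
nearest_index <- function(x, vec) {
  ord <- order(vec)          # positions of vec in increasing order
  sorted <- vec[ord]
  lo <- findInterval(x, sorted, all.inside = TRUE)
  hi <- lo + 1L
  # TRUE where the upper neighbour is strictly nearer than the lower one
  nudge <- 2 * x > sorted[lo] + sorted[hi]
  ord[lo + nudge]            # map back to positions in the original vec
}

nearest_index(c(13, 17, 20), c(30, 10, 50, 20, 40))
# [1] 2 4 4
```

Exact ties go to the smaller value, because the comparison is strict.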

Here's a solution that uses R's often overlooked outer function. Not sure if it'll perform better, but it does avoid sapply.
A <- c(10, 20, 30, 40, 50)
b <- c(13, 17, 20)
dist <- abs(outer(A, b, '-'))
result <- apply(dist, 2, which.min)
# [1] 1 2 2
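As a rough sanity check, the three approaches in this thread can be compared on random data. This is only a sketch: the seed and sizes are arbitrary choices, and integer inputs are used so that floating-point near-ties cannot make the methods disagree. Keep in mind that outer builds a length(A) by length(b) matrix, so its memory use grows with the product of the two lengths.

```r
set.seed(42)                      # arbitrary seed, for reproducibility only
A <- sort(sample(1e6, 1000))      # sorted, distinct integers
b <- sample(1e6, 200)

# 1) sapply over which.min
by_sapply <- sapply(b, function(x) which.min(abs(x - A)))

# 2) outer + apply (materialises a 1000 x 200 matrix)
by_outer <- apply(abs(outer(A, b, "-")), 2, which.min)

# 3) findInterval, then choose between the two neighbours
lo <- findInterval(b, A, all.inside = TRUE)
by_interval <- lo + (2 * b > A[lo] + A[lo + 1L])

stopifnot(identical(by_sapply, by_outer),
          identical(by_sapply, by_interval))
```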

Related

Rounding numbers to the nearest values (with different intervals) in R

I want to round (or replace) numbers in a
a <- c(0.505, 1.555, 2.667, 53.850, 411.793)
to the nearest values in b:
b <- c(0, 5, 10, 50, 100, 200, 500)
The output will be this:
a_rnd <- c(0, 0, 5, 50, 500)
The logic is simple, but I couldn't find any solution, because all the approaches I found require the values in b to be equally spaced!
How can I achieve this?
You can use sapply to loop over all values of a, find the index of the nearest element of b for each, and use these indexes to extract the proper b values:
b[sapply(a, function(x) which.min(abs(x - b)))]
#> [1] 0 0 5 50 500
This is a relatively simple approach:
b[apply(abs(outer(a, b, "-")), 1, which.min)]
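For long vectors, another possibility (a sketch, not taken from the answers above) is to cut at the midpoints between consecutive values of b with findInterval. The function name round_to_nearest is made up here, and the sketch assumes b is sorted in increasing order with at least two elements.

```r
# Hypothetical helper: snap each element of a to the nearest value in b.
# Assumes b is sorted increasingly and length(b) >= 2.
round_to_nearest <- function(a, b) {
  mids <- (head(b, -1) + tail(b, -1)) / 2   # midpoints between neighbours
  b[findInterval(a, mids) + 1]              # interval number -> value of b
}

a <- c(0.505, 1.555, 2.667, 53.850, 411.793)
b <- c(0, 5, 10, 50, 100, 200, 500)
round_to_nearest(a, b)
# [1]   0   0   5  50 500
```

One difference from the which.min versions: a value lying exactly on a midpoint goes to the larger neighbour here, whereas which.min picks the smaller one.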

How to find the three closest (nearest) values within a vector?

I would like to find out the three closest numbers in a vector.
Something like
v = c(10,23,25,26,38,50)
c = findClosest(v,3)
c
23 25 26
I tried with sort(colSums(as.matrix(dist(x))))[1:3], and it kind of works, but it selects the three numbers with minimum overall distance not the three closest numbers.
There is already an answer for MATLAB ("How to find two closest (nearest) values within a vector in MATLAB?"), but I do not know how to translate it to R:
%finds the index with the minimal difference in A
minDiffInd = find(abs(diff(A))==min(abs(diff(A))));
%extract this index, and its neighbor index from A
val1 = A(minDiffInd);
val2 = A(minDiffInd+1);
My assumption is that for the n nearest values, the only thing that matters is the difference v[i] - v[i - (n-1)] in the sorted vector. That is, we need the minimum of diff(x, lag = n - 1L).
findClosest <- function(x, n) {
  x <- sort(x)
  x[seq.int(which.min(diff(x, lag = n - 1L)), length.out = n)]
}
findClosest(v, 3L)
[1] 23 25 26
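To see why the lagged difference is enough, it may help to trace the intermediate values on the example vector. This is just the function above unrolled step by step; the variable names spans and start are illustrative.

```r
v <- c(10, 23, 25, 26, 38, 50)
n <- 3L

x <- sort(v)
# Width of every window of n consecutive sorted values: x[i + n - 1] - x[i]
spans <- diff(x, lag = n - 1L)
spans
# [1] 15  3 13 24

start <- which.min(spans)          # the narrowest window starts here
x[seq.int(start, length.out = n)]
# [1] 23 25 26
```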
Let's define "nearest numbers" by "numbers with minimal sum of L1 distances". You can achieve what you want by a combination of diff and windowed sum.
You could write a much shorter function but I wrote it step by step to make it easier to follow.
v <- c(10,23,25,26,38,50)
#' Find the n nearest numbers in a vector
#'
#' @param v Numeric vector
#' @param n Number of nearest numbers to extract
#'
#' @details "Nearest numbers" defined as the numbers which minimise the
#' within-group sum of L1 distances.
#'
findClosest <- function(v, n) {
  # Sort and remove NA
  v <- sort(v, na.last = NA)
  # Compute L1 distances between closest points. We know each point is next to
  # its closest neighbour since we sorted.
  delta <- diff(v)
  # Compute sum of L1 distances on a rolling window with n - 1 elements
  # Why n-1 ? Because we are looking at deltas and 2 deltas ~ 3 elements.
  withingroup_distances <- zoo::rollsum(delta, k = n - 1)
  # Now it's simply finding the group with minimum within-group sum
  # And working out the elements
  group_index <- which.min(withingroup_distances)
  element_indices <- group_index + 0:(n-1)
  v[element_indices]
}
findClosest(v, 2)
# 25 26
findClosest(v, 3)
# 23 25 26
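If the zoo dependency is unwanted, the rolling sum could be sketched in base R with cumsum. The same snippet also illustrates that the sum of n - 1 consecutive deltas telescopes to the width of the window, which is why this definition and the lagged-difference one pick the same group. Variable names here are illustrative only.

```r
v <- sort(c(10, 23, 25, 26, 38, 50))
n <- 3

delta <- diff(v)                                  # gaps between neighbours
# Rolling sum over n - 1 gaps, via differences of the cumulative sum
window_sums <- diff(c(0, cumsum(delta)), lag = n - 1)
window_sums
# [1] 15  3 13 24

# The same numbers as the window widths v[i + n - 1] - v[i]
all.equal(window_sums, diff(v, lag = n - 1))
# [1] TRUE
```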
A base R option: the idea is to first sort the vector, subtract the ith element from the (i + n - 1)th element for every i, and select the group with the minimum difference.
closest_n_vectors <- function(v, n) {
  v1 <- sort(v)
  inds <- which.min(sapply(head(seq_along(v1), -(n - 1)), function(x)
    v1[x + n - 1] - v1[x]))
  v1[inds:(inds + n - 1)]
}
closest_n_vectors(v, 3)
#[1] 23 25 26
closest_n_vectors(c(2, 10, 1, 20, 4, 5, 23), 2)
#[1] 1 2
closest_n_vectors(c(19, 23, 45, 67, 89, 65, 1), 2)
#[1] 65 67
closest_n_vectors(c(19, 23, 45, 67, 89, 65, 1), 3)
#[1] 1 19 23
In case of a tie, this returns the group with the smallest values, since which.min picks the first minimum.
BENCHMARKS
Since we have got quite a few answers, it is worth benchmarking all the solutions so far (each function renamed after its author):
set.seed(1234)
x <- sample(100000000, 100000)
# Note: identical() compares only its first two arguments (antoine vs Sotos here)
identical(findClosest_antoine(x, 3), findClosest_Sotos(x, 3),
          closest_n_vectors_Ronak(x, 3), findClosest_Cole(x, 3))
#[1] TRUE
microbenchmark::microbenchmark(
antoine = findClosest_antoine(x, 3),
Sotos = findClosest_Sotos(x, 3),
Ronak = closest_n_vectors_Ronak(x, 3),
Cole = findClosest_Cole(x, 3),
times = 10
)
#Unit: milliseconds
#    expr      min       lq     mean   median       uq      max neval cld
# antoine  148.751  159.071  163.298  162.581  167.365  181.314    10   b
#   Sotos 1086.098 1349.762 1372.232 1398.211 1453.217 1553.945    10   c
#   Ronak   54.248   56.870   78.886   83.129   94.748  100.299    10   a
#    Cole    4.958    5.042    6.202    6.047    7.386    7.915    10   a
One idea is to use the zoo library to do a rolling operation, i.e.
library(zoo)
m1 <- rollapply(v, 3, by = 1, function(i)c(sum(diff(i)), c(i)))
m1[which.min(m1[, 1]),][-1]
#[1] 23 25 26
Or make it into a function,
findClosest <- function(vec, n) {
  require(zoo)
  vec1 <- sort(vec)
  m1 <- rollapply(vec1, n, by = 1, function(i) c(sum(diff(i)), c(i)))
  return(m1[which.min(m1[, 1]), ][-1])
}
findClosest(v, 3)
#[1] 23 25 26
For use in a dataframe,
data%>%
group_by(var1,var2)%>%
do(data.frame(findClosest(.$val,3)))

Find the values that are immediately below a given set of values and return the entry from another variable

I have two data frames:
a <- c(10, 20, 30)
c <- c(1, 50, 100)
df1 <- data.frame(a, c)
x <- c(80, 30, 15)
z <- c(10, 46, 99)
df2 <- data.frame(x, z)
I want to find the values in c that are immediately below the values in z, and then return the equivalent values in a.
So matching z to c will give me the locations: 1, 1, 2, and I want to output those locations from a (i.e 10, 10, 20)
Edit: For each value in z I want to find the location of the value that is below it in c, then return the value in a based on that location
You can use outer with the comparison <. colSums then counts the TRUEs in each column, which gives your answer provided that df1 is ordered on c, i.e.
colSums(outer(df1$c, df2$z, `<`))
#[1] 1 1 2
or
df1$a[colSums(outer(df1$c, df2$z, `<`))]
#[1] 10 10 20
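For longer vectors, outer builds a full comparison matrix. A possible alternative (a sketch, assuming c is sorted in increasing order) is findInterval with left.open = TRUE, which counts how many elements of c are strictly below each z. The vector names c_vals, a_vals and z_vals are made up here to keep the snippet self-contained and to avoid masking base::c.

```r
a_vals <- c(10, 20, 30)
c_vals <- c(1, 50, 100)     # must be sorted increasingly
z_vals <- c(10, 46, 99)

# With left.open = TRUE the intervals are (c[i], c[i + 1]], so the result
# is the number of c values strictly less than each z.
idx <- findInterval(z_vals, c_vals, left.open = TRUE)
idx
# [1] 1 1 2

a_vals[idx]
# [1] 10 10 20
```

As with the colSums version, a z that is not above any value in c yields index 0, and indexing with 0 silently drops that element.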

R solver optimization

I am new to solvers in R and I would like a simple example in R for the problem below:
I have four columns, for which I calculate the individual sums as in the illustrated sample example below:
The problem I want to solve in R:
Find the optimal lines (rows) that simultaneously satisfy the statements below:
For the first two columns (a, b), the individual sums should be as close to 0 as possible
The sums of (c, d) should be as close to 5 as possible
I have no restrictions on which solver package to use. It would be helpful to have an example of R code for this!
EDIT
For the same solution I would like to apply some rules:
I want sum(c) > sum(d) AND sum(d) < (a static number, like 5)
Also, if I want the sums to fall into a range of numbers and not just static numbers, how could the solution be written?
Using M defined reproducibly in the Note at the end we find the b which minimizes the following objective where b is a 0/1 vector:
sum((b %*% M - c(0, 0, 5, 5))^2)
1) CVXR Using the CVXR package we get a solution c(1, 0, 0, 1, 1) which means choose rows 1, 4 and 5.
library(CVXR)
n <- nrow(M)
b <- Variable(n, boolean = TRUE)
pred <- t(b) %*% M
y <- c(0, 0, 5, 5)
objective <- Minimize(sum((t(y) - pred)^2))
problem <- Problem(objective)
soln <- solve(problem)
bval <- soln$getValue(b)
zapsmall(c(bval))
## [1] 1 0 0 1 1
2) Brute Force Alternatively, since there are only 5 rows, there are only 2^5 possible solutions, so we can try them all and pick the one which minimizes the objective. First we compute a matrix solns with 2^5 columns such that each column is one possible solution. Then we compute the objective function for each column and take the one which minimizes it.
n <- nrow(M)
inverse.which <- function(ix, n) replace(integer(n), ix, 1)
L <- lapply(0:n, function(i) apply(combn(n, i), 2, inverse.which, n))
solns <- do.call(cbind, L)
pred <- t(t(solns) %*% M)
obj <- colSums((pred - c(0, 0, 5, 5))^2)
solns[, which.min(obj)]
## [1] 1 0 0 1 1
Note
M <- matrix(c(.38, -.25, .78, .83, -.65,
.24, -.35, .44, -.88, .15,
3, 5, 13, -15, 18,
18, -7, 23, -19, 7), 5)
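The rules from the question's EDIT can be layered onto the brute-force idea by filtering the candidate selections before minimizing. The following is only a sketch of that idea, not part of the answer above: it enumerates the 0/1 vectors with expand.grid instead of combn, and the names grid, sums, feasible and best are illustrative. It stays practical only while 2^n is small.

```r
M <- matrix(c(.38, -.25, .78, .83, -.65,
              .24, -.35, .44, -.88, .15,
              3, 5, 13, -15, 18,
              18, -7, 23, -19, 7), 5)
n <- nrow(M)

# One row per candidate selection (all 2^n vectors of 0/1)
grid <- as.matrix(expand.grid(rep(list(0:1), n)))
sums <- grid %*% M                                  # column sums per candidate
obj  <- rowSums(sweep(sums, 2, c(0, 0, 5, 5))^2)    # squared distance to target

# Unconstrained optimum, for comparison with the answer above
unname(grid[which.min(obj), ])
# [1] 1 0 0 1 1

# Rules from the EDIT: sum(c) > sum(d) and sum(d) < 5
feasible <- sums[, 3] > sums[, 4] & sums[, 4] < 5
best <- unname(grid[feasible, , drop = FALSE][which.min(obj[feasible]), ])
best
```

A range requirement (for example, a sum between two bounds) could be expressed the same way, as one more logical condition combined into feasible.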

Find indices of numbers of interest in a vector

I have a long vector, let's say A <- c(12, 16, 23, 15, 89, 43, ...) and I would like to find the positions of some numbers in this vector, contained in another vector, B <- c(16, 89).
In this example, I would like to obtain the vector c(2,5). At the moment I am using a for loop, but I would really like to avoid it:
C <- numeric(length(B))
for (i in seq_along(C)) {
  C[i] <- which(A == B[i])
}
Any suggestion?
Thanks in advance
Try with
x <- which(A %in% B)
#> x
#[1] 2 5
Hope this helps
You can use simply:
match(B,A)
#[1] 2 5
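The two suggestions agree on this example but are not interchangeable in general. A small illustration, with B reordered and given a value that is absent from A (the extra value 7 is made up for the demonstration):

```r
A <- c(12, 16, 23, 15, 89, 43)
B <- c(89, 16, 7)

# Positions in A that hold any value from B, in the order of A
which(A %in% B)
# [1] 2 5

# For each element of B, its first position in A (NA if absent), in the order of B
match(B, A)
# [1]  5  2 NA
```

So match keeps the result aligned with B, which is what the original for loop produces, while which(A %in% B) also reports every duplicate occurrence in A.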
