Weight any set of numbers to sum to zero in R

I want to write a function in R that calculates weights to rescale any set of numbers so that they sum to zero. For example, if
x <- c(-5, 6, 2, 4, -3)
I want a function that returns a new vector which has been weighted to force the vector sum to zero, by taking something off the positive numbers and adding something to the negative values...
EDIT: To clarify, I do not want to shift values up or down a scale... I want to weight so that the rescaled negative numbers become slightly more/less negative and the rescaled positive numbers become slightly less/more positive.
I am not sure 1) how to go about calculating the right values for proportional weights and 2) whether there is a function in R that can do it.

How about
x <- scale(x)
> x
[,1]
[1,] -1.2450825
[2,] 1.1162809
[3,] 0.2576033
[4,] 0.6869421
[5,] -0.8157437
attr(,"scaled:center")
[1] 0.8
attr(,"scaled:scale")
[1] 4.658326
> sum(scale(x))
[1] 5.551115e-17
Edit:
As suggested by @Josh O'Brien, setting scale = FALSE gives
scale(x, scale = FALSE)
[,1]
[1,] -5.8
[2,] 5.2
[3,] 1.2
[4,] 3.2
[5,] -3.8
attr(,"scaled:center")
[1] 0.8
sum(scale(x, scale = FALSE))
[1] 6.661338e-16

1) offsets @jdharrison has already indicated that if you want a vector a such that sum(x - a) is zero, then setting a to the mean of x will do it.
2) weight vector The wording of the question seems to ask for a weight vector w such that sum(w * x) is zero.
(i) If x is not constant (i.e. its elements are not all the same) then, in mathematical notation, P = I - xx'/(x'x) is a projection orthogonal to x and P1 = 1 - x(x'1)/(x'x) is a vector in the range of P, so switching to R code:
w <- 1 - x * sum(x) / sum(x*x)
is such a weight vector. We can verify this:
> sum(w*x)
[1] 2.220446e-16
(ii) If x is constant but not identically zero then choose any non-constant vector, say s <- seq_along(x). Then Ps = s - x(x's)/(x'x) is orthogonal to x, so:
x <- c(1, 1, 1, 1)
s <- seq_along(x)
w <- s - x * sum(s*x) / sum(x*x)
sum(w * x)
giving:
> sum(w * x)
[1] 0
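If it helps, the two cases can be folded into one small helper (a sketch only; zero_sum_weights is a made-up name, not something from the answer or from base R):
zero_sum_weights <- function(x) {
  ## case (i): x not constant -> use s = 1; case (ii): constant, nonzero -> use s = seq_along(x)
  ## assumes x is not identically zero
  s <- if (length(unique(x)) > 1) rep(1, length(x)) else seq_along(x)
  s - x * sum(s * x) / sum(x * x)
}
x <- c(-5, 6, 2, 4, -3)
sum(zero_sum_weights(x) * x)  # ~0 up to floating point error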

Elaborating @jdharrison's comment:
> x
[1] -5 6 2 4 -3
> sum(x)
[1] 4
> mean(x)
[1] 0.8
> x - mean(x)
[1] -5.8 5.2 1.2 3.2 -3.8
> sum(x - mean(x))
[1] 6.661338e-16 #floating point 0
So x - mean(x) will do the trick.

If you want to keep the sign after the rescaling...
x <- c(-5, -3, 0, 2, 4, 6, 50)
rescale_zero <- function(x){
  x1 <- x[x > 0]
  x2 <- x[x < 0]
  d <- (sum(x1) + sum(x2)) / 2
  w1 <- (sum(x1) - d) / sum(x1)
  w2 <- (sum(x2) - d) / sum(x2)
  y <- x
  y[x > 0] <- x1 * w1
  y[x < 0] <- x2 * w2
  y
}
rescale_zero(x)
# [1] -21.875000 -13.125000 0.000000 1.129032 2.258065 3.387097 28.225806
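A quick check that the result sums to zero and keeps the original signs:
y <- rescale_zero(x)
sum(y)                   # ~0 up to floating point error
all(sign(y) == sign(x))  # TRUE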

Related

Sum of correlation matrix convergence

Assume a correlation matrix P with a diagonal of zeros. I want to determine the order n at which the sum of all powers of the correlation matrix converges, i.e. where diag(3) + P + P%^%2 + P%^%3 + ... + P%^%n converges, meaning the L1 norm drops below a tolerance. I looked into "How to find when a matrix converges with a loop" but that doesn't do it for me, since it doesn't keep the powers, nor does it sum them up. I can do it in a really lengthy and lousy way with for loops and all, but I don't want to, since I have a big data frame with many time windows, so I'm looking for something efficient. Thanks!
P <- matrix(c(0,0.1,0.8,0.1,0,-0.7,0.8,-0.7,0), nrow = 3, ncol = 3, byrow = TRUE)
Some notes: The %^% operator is from expm package. To sum the matrices I used matrix(mapply(sum, diag(3), P, P%^%2, P%^%3, MoreArgs=list(na.rm=T)), ncol=3).
x %^% n computes the nth power of x efficiently, but it is inefficient to compute x %^% i for all i from 0 to n, because each x %^% i requires O(log(i)) matrix multiplications.
In general, the most efficient way to compute all of the powers of x up to the nth is recursive multiplication by x, possibly taking advantage of the diagonalizability of x.
The difference is nontrivial for large n: whereas
x2 <- x %^% 2
x3 <- x %^% 3
x4 <- x %^% 4
## and so on
requires O(log(n!)) = O(n * log(n)) matrix multiplications,
x2 <- x %*% x
x3 <- x2 %*% x
x4 <- x3 %*% x
## and so on
requires just O(n).
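The diagonalization route mentioned above could look roughly like this (a sketch only, assuming x is diagonalizable; power_series_eigen is a made-up name, and Re() is there because eigen() can return complex conjugate pairs for a real non-symmetric x):
power_series_eigen <- function(x, n) {
  e <- eigen(x)
  d <- e$values
  V <- e$vectors
  ## scalar geometric series on each eigenvalue: sum_{i=0}^{n} d^i
  s <- ifelse(d == 1, n + 1, (1 - d^(n + 1)) / (1 - d))
  ## V %*% diag(s) %*% solve(V), written without diag(); Re() drops the
  ## negligible imaginary parts left by complex eigenvalue pairs
  Re(V %*% (s * solve(V)))
}
For the convergent case (spectral radius below 1) the infinite sum is simply solve(diag(nrow(x)) - x), but the question asks for the order n at which the partial sums settle, which the function below tracks directly.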
Here is a function that recursively computes the powers of a matrix x and their sum until it encounters a power whose 1-norm is less than tol. It begins by checking that the spectral radius of x is less than 1, which is a necessary and sufficient condition for convergence of the norm of x %^% n to 0 and thus a necessary condition for convergence of the power series. It does not attempt to diagonalize x, which would simplify computation of the power series but complicate computation of norms.
f <- function(x, tol = 1e-06, nmax = 1e+03) {
    stopifnot(max(abs(eigen(x, only.values = TRUE)$values)) < 1)
    pow <- sum <- diag(nrow(x))
    nrm <- rep.int(NA_real_, nmax + 1)
    i <- 1
    while ((nrm[i] <- norm(pow, "1")) >= tol && i <= nmax) {
        pow <- pow %*% x
        sum <- sum + pow
        i <- i + 1
    }
    list(x = x, tol = tol, nmax = nmax, n = i - 1, sum = sum,
         norm = nrm[seq_len(i)], converged = nrm[i] < tol)
}
Your matrix P has spectral radius greater than 1, hence:
P <- matrix(c(0, 0.1, 0.8, 0.1, 0, -0.7, 0.8, -0.7, 0), 3L, 3L, byrow = TRUE)
f(P)
Error in f(P) :
max(abs(eigen(x, only.values = TRUE)$values)) < 1 is not TRUE
We can always construct a matrix P whose spectral radius is less than 1, for the purpose of testing f:
set.seed(1L)
m <- 3L
V <- matrix(rnorm(m * m), m, m)
D <- diag(runif(m, -0.9, 0.9))
P <- V %*% D %*% solve(V)
all.equal(sort(eigen(P)$values), sort(diag(D))) # [1] TRUE
(fP <- f(P))
$x
[,1] [,2] [,3]
[1,] 0.26445172 0.5317116 -0.2432849
[2,] 0.04932194 0.6332122 0.1496390
[3,] -0.31174920 0.6847937 0.1682702
$tol
[1] 1e-06
$nmax
[1] 1000
$n
[1] 60
$sum
[,1] [,2] [,3]
[1,] 1.53006915 2.081717 -0.07302465
[2,] -0.04249899 4.047528 0.74063387
[3,] -0.60849191 2.552208 1.83947562
$norm
[1] 1.000000e+00 1.849717e+00 1.223442e+00 1.008928e+00 7.799426e-01
[6] 6.131516e-01 4.795602e-01 3.754905e-01 2.938577e-01 2.299751e-01
[11] 1.799651e-01 1.408263e-01 1.101966e-01 8.622768e-02 6.747162e-02
[16] 5.279503e-02 4.131077e-02 3.232455e-02 2.529304e-02 1.979107e-02
[21] 1.548592e-02 1.211727e-02 9.481396e-03 7.418905e-03 5.805067e-03
[26] 4.542288e-03 3.554202e-03 2.781054e-03 2.176090e-03 1.702724e-03
[31] 1.332329e-03 1.042507e-03 8.157298e-04 6.382837e-04 4.994374e-04
[36] 3.907945e-04 3.057848e-04 2.392672e-04 1.872193e-04 1.464934e-04
[41] 1.146266e-04 8.969179e-05 7.018108e-05 5.491455e-05 4.296896e-05
[46] 3.362189e-05 2.630810e-05 2.058529e-05 1.610736e-05 1.260351e-05
[51] 9.861865e-06 7.716607e-06 6.038009e-06 4.724558e-06 3.696822e-06
[56] 2.892650e-06 2.263410e-06 1.771049e-06 1.385792e-06 1.084340e-06
[61] 8.484627e-07
$converged
[1] TRUE
Hence convergence is attained at n = 60. You can check that the reported sum is correct by comparing against the directly (but inefficiently) calculated value:
library("expm")
all.equal(Reduce(`+`, lapply(0:fP$n, function(i) P %^% i)), fP$sum) # [1] TRUE
And just for fun:
plot(0:fP$n, fP$norm)

Alternative to for loop for fast calculations when equations depend on each other

I am using a for-loop to do step-by-step calculations where several equations depend on each other. Because of this dependence, I cannot find a solution where I do the calculations inside a dataframe. My main motivation is to speed up the calculations when the Time vector is very large in the reprex below.
Could you please suggest alternatives to the following for-loop-based calculations, preferably inside a dataframe in R? The only thing I can think of is using a for loop in Rcpp.
Reproducible Example
last_time <- 10
STEP = 1
Time <- seq(from = 0, to = last_time, by = STEP)
## empty vectors
eq1 <- vector(mode = "double", length = length(Time))
eq2 <- vector(mode = "double", length = length(Time))
eq <- vector(mode = "double", length = length(Time))
eq3 <- vector(mode = "double", length = length(Time))
eq4 <- vector(mode = "double", length = length(Time))
## adding the first values
eq1[1] <- 25
eq2[1] <- 25
eq[1] <- 25
eq3[1] <- 100
eq4[1] <- 2
for (t in 2:length(Time)) {
  ## eq1
  eq1[t] <- eq[t-1] + (2.5 * STEP * (1 - (eq[t-1])/25))
  ## eq2
  eq2[t] <- (-2 * STEP) + ((-2^2) * (STEP^2)) - (2 * eq3[t-1]) - (eq[t-1] * STEP)
  ## min.
  eq[t] <- min(eq1[t], eq2[t])
  ## eq3
  eq3[t] <- (eq[t] - eq[t-1])/(STEP)
  ## eq4
  eq4[t] <- eq4[t-1] + (eq[t-1] * STEP) + (0.5 * eq3[t-1] * (STEP)^2)
}
Output:
my_data <- data.frame(Time, eq1, eq2, eq, eq3, eq4)
my_data
#> Time eq1 eq2 eq eq3 eq4
#> 1 0 25.00000 25.00000 25.00000 -256.00000 2.0000
#> 2 1 25.00000 -231.00000 -231.00000 25.60000 -101.0000
#> 3 2 -205.40000 225.00000 -205.40000 23.04000 -319.2000
#> 4 3 -182.36000 199.40000 -182.36000 20.73600 -513.0800
#> 5 4 -161.62400 176.36000 -161.62400 18.66240 -685.0720
#> 6 5 -142.96160 155.62400 -142.96160 16.79616 -837.3648
#> 7 6 -126.16544 136.96160 -126.16544 15.11654 -971.9283
#> 8 7 -111.04890 120.16544 -111.04890 13.60489 -1090.5355
#> 9 8 -97.44401 105.04890 -97.44401 12.24440 -1194.7819
#> 10 9 -85.19961 91.44401 -85.19961 11.01996 -1286.1037
#> 11 10 -74.17965 79.19961 -74.17965 0.00000 -1365.7934
Created on 2021-02-28 by the reprex package (v1.0.0)
You could define a recursive function. A loop is faster than recursion though.
g <- function(m, STEP, time, x=2) {
  if (time == 0) m
  else {
    ## eq1
    m[x, 2] <- m[x - 1, 1] + 2.5*STEP*(1 - (m[x - 1, 1])/25)
    ## eq2
    m[x, 3] <- -2*STEP + -2^2*STEP^2 - 2*m[x - 1, 4] - m[x - 1, 1]*STEP
    ## min.
    m[x, 1] <- min(m[x, 2], m[x, 3])
    ## eq3
    m[x - 1, 4] <- (m[x, 1] - m[x - 1, 1])/STEP
    ## eq4
    m[x, 5] <- m[x - 1, 5] + m[x - 1, 1]*STEP + 0.5*m[x - 1, 4]*STEP^2
    g(m, STEP, time - 1, x + 1)
  }
}
Usage
last_time <- 10; STEP <- 1
First <- c(eq0=25, eq1=25, eq2=25, eq3=100, eq4=2)
m <- matrix(0, last_time + 1, length(First), dimnames=list(NULL, names(First)))
m[1, ] <- First
g(m, STEP, last_time)
# eq0 eq1 eq2 eq3 eq4
# [1,] 25.00000 25.00000 25.00000 -256.00000 2.0000
# [2,] -231.00000 25.00000 -231.00000 25.60000 -101.0000
# [3,] -205.40000 -205.40000 225.00000 23.04000 -319.2000
# [4,] -182.36000 -182.36000 199.40000 20.73600 -513.0800
# [5,] -161.62400 -161.62400 176.36000 18.66240 -685.0720
# [6,] -142.96160 -142.96160 155.62400 16.79616 -837.3648
# [7,] -126.16544 -126.16544 136.96160 15.11654 -971.9283
# [8,] -111.04890 -111.04890 120.16544 13.60489 -1090.5355
# [9,] -97.44401 -97.44401 105.04890 12.24440 -1194.7819
# [10,] -85.19961 -85.19961 91.44401 11.01996 -1286.1037
# [11,] -74.17965 -74.17965 79.19961 0.00000 -1365.7934
Since you asked how it works: the recursive filter of stats::filter() can be used with mapply() as follows:
dataframe <- mapply(stats::filter,
                    dataframe,
                    filter = vector,
                    method = "recursive")
where vector is e.g. c(25), which could be your first value eq1[1] <- 25.
The recursive filter works like a recursive loop but is a bit more elegant. For a column holding the values 20, 30, 40 and filter = c(25), the mapply recursive filter would compute:
row (timepoint) 1: 20
row (timepoint) 2: 30 + (20 * 25)
row (timepoint) 3: 40 + ((20 * 25) + 30) * 25
It calculates the value in the first row and uses it in the next row, where it is multiplied by the filter coefficient. If you play around with stats::filter and the recursive method, you may be able to reproduce the same result. It is a row-based calculation over time, similar to Rcpp but more flexible.
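To make that concrete, here is a minimal sketch of a recursive filter next to the equivalent explicit loop (the series y and the coefficient 0.5 are made up for illustration and are not the solution to the system above):
y <- c(20, 30, 40)
filtered <- stats::filter(y, filter = 0.5, method = "recursive")
## the recursive filter computes out[t] = y[t] + 0.5 * out[t - 1]
out <- numeric(length(y))
out[1] <- y[1]
for (t in 2:length(y)) out[t] <- y[t] + 0.5 * out[t - 1]
all.equal(as.numeric(filtered), out)  # TRUE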

Generate vector of 'random' proportions of a given length within specific boundaries

I want to generate a vector of a given length, e.g., n = 5. Each value in the vector should be a proportion (i.e., a value between 0 and 1) so that across n elements they sum up to 1.
Unfortunately, I have two vectors: one (mymins) defines the allowed lower boundaries of each proportion and the other (mymaxs) defines the allowed top boundaries of each proportion.
In my example below the desired proportion for the first element is allowed to fall anywhere between 0.3 and 0.9. And for the last element, the desired proportion is allowed to fall between 0.05 and 0.7.
mymins <- c(0.3, 0.1, 0, 0.2, 0.05)
mymaxs <- c(0.9, 1, 1, 1, 0.7)
Let's assume that mymins are always 'legitimate' (i.e., their sum is never larger than 1).
How could I find a set of 5 proportions such that they all sum to 1 but lie within the boundaries?
Here is what I tried:
n = 5
mydif <- mymaxs - mymins # possible range for each proportion
myorder <- rank(mydif) # order those differences from smallest to largest
mytarget <- sum(mydif) # sum up the 5 ranges
x <- sort(runif(n))[myorder] # generate 5 random values and sort them in the order of mydif
x2 <- mymins + x / sum(x) * mytarget # rescale random values to sum up to mytarget and add them to mymins
x3 <- x2/sum(x2) # rescale x2 to sum up to 1
As you can see, I am not very far - because after rescaling some values are outside of their allowed boundaries.
I should probably also mention that I need this operation to be fast - because I am using it in an optimization loop.
I also tried to find a solution using optim, however the problem is that it always finds the same solution - and I need to generate a DIFFERENT solution every time I compute the proportions:
myfun <- function(x) {
  x <- round(x, 4)
  abovemins <- x - mymins
  n_belowmins <- sum(abovemins < 0)
  if (n_belowmins > 0) return(100000)
  belowmax <- x - mymaxs
  n_abovemax <- sum(belowmax > 0)
  if (n_abovemax > 0) return(100000)
  mydist <- abs(sum(x) - 1)
  return(mydist)
}
myopt <- optim(par = mymins + 0.01, fn = myfun)
myopt$par
sum(round(myopt$par, 4))
Thank you very much for your suggestions!
Perhaps it's better to think of this in a different way. Your samples actually need to sum to 0.35 (which is 1 - sum(mymins)), then be added on to the minimum values.
constrained_sample <- function(mymins, mymaxs)
{
  sizes <- mymaxs - mymins
  samp <- runif(5) * sizes
  samp / sum(samp) * (1 - sum(mymins)) + mymins
}
It works like this:
constrained_sample(mymins, mymaxs)
#> [1] 0.31728333 0.17839397 0.07196067 0.29146744 0.14089459
We can test this works by running the following loop, which will print a message to the console if any of the criteria aren't met:
for(i in 1:1000)
{
  test <- constrained_sample(mymins, mymaxs)
  if(!all(test > mymins) | !all(test < mymaxs) | abs(sum(test) - 1) > 1e-6) cat("failure")
}
This throws no errors, since the criteria are always met. However, as @GregorThomas points out, the bounds aren't realistic in this case. We can see the range of solutions constrained by your conditions using a boxplot:
library(ggplot2)
samp <- constrained_sample(mymins, mymaxs)
for(i in 1:999) samp <- rbind(samp, constrained_sample(mymins, mymaxs))
df <- data.frame(val = c(samp[,1], samp[,2], samp[,3], samp[,4], samp[,5]),
                 index = factor(rep(1:5, each = 1000)))
ggplot(df, aes(x = index, y = val)) + geom_boxplot()
Because you need 5 random numbers to sum to 1, you really only have 4 independent numbers and one dependent number.
mymins <- c(0.3, 0.1, 0, 0.2, 0.05)
mymaxs <- c(0.9, 1, 1, 1, 0.7)
set.seed(42)
iter <- 1000
while(iter > 0 &&
      (
        (1 - sum(x <- runif(4, mymins[-5], mymaxs[-5]))) < mymins[5] ||
        (1 - sum(x)) > mymaxs[5]
      )
) iter <- iter - 1
if (iter < 1) {
  # failed
  stop("unable to find something within 1000 iterations")
} else {
  x <- c(x, 1 - sum(x))
}
sum(x)
# [1] 1
all(mymins <= x & x <= mymaxs)
# [1] TRUE
x
# [1] 0.37732330 0.21618036 0.07225311 0.24250359 0.09173965
The reason I use iter there is to make sure you don't take an "infinite" amount of time to find something. If your mymins and mymaxs combination makes this mathematically infeasible (as your first example was), then you don't need to spin forever. If it is mathematically improbable to find a solution in a reasonable amount of time, you need to weigh how long you are willing to keep trying.
One reason this takes so long is that we are iteratively pulling entropy. If you expect this to go for a long time, then it is generally better to pre-calculate as much as you think you'll need (overall) and run things as a matrix.
set.seed(42)
n <- 10000
m <- matrix(runif(prod(n, length(mymins)-1)), nrow = n)
m <- t(t(m) * (mymaxs[-5] - mymins[-5]) + mymins[-5])
remainders <- (1 - rowSums(m))
ind <- mymins[5] <= remainders & remainders <= mymaxs[5]
table(ind)
# ind
# FALSE TRUE
# 9981 19
m <- cbind(m[ind,,drop=FALSE], remainders[ind])
nrow(m)
# [1] 19
rowSums(m)
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
head(m)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0.3405821 0.1306152 0.05931363 0.2199362 0.24955282
# [2,] 0.3601376 0.1367465 0.20235704 0.2477507 0.05300821
# [3,] 0.4469526 0.1279795 0.02265618 0.2881733 0.11423845
# [4,] 0.5450527 0.1029903 0.07503371 0.2052423 0.07168103
# [5,] 0.3161519 0.1469783 0.15290720 0.3268470 0.05711557
# [6,] 0.4782448 0.1185735 0.01664063 0.2178225 0.16871845
all(
mymins[1] <= m[,1] & m[,1] <= mymaxs[1],
mymins[2] <= m[,2] & m[,2] <= mymaxs[2],
mymins[3] <= m[,3] & m[,3] <= mymaxs[3],
mymins[4] <= m[,4] & m[,4] <= mymaxs[4],
mymins[5] <= m[,5] & m[,5] <= mymaxs[5]
)
# [1] TRUE
This time it took 10000 attempts to make 19 valid combinations. It might take more or fewer attempts based on randomness, so ymmv with regards to how much you need to pre-generate.
If your example bounds are realistic, we can refine them quite a bit, narrowing the range of possibilities. For the current version of the question with:
mymins = c(0.3, 0.1, 0, 0.2, 0.05)
mymaxs = c(0.9, 1, 1, 1, 0.7)
What's the max for x[1]? Well, if x[2:5] take on minimum values, they will add up to 0.1 + 0 + 0.2 + 0.05 = 0.35, so based on the other mins only we know that max value for x[1] is 1 - 0.35 = 0.65. The 0.9 in mymaxs is way too high.
We can calculate the actual max values taking the minimum of the max values based on the minimums and the mymaxs vector:
new_max = pmin(mymaxs, 1 - (sum(mymins) - mymins))
new_max
# [1] 0.65 0.45 0.35 0.55 0.40
We can similarly revise the min bounds, though in this case even the revised max bounds new_max are still high enough that they have no impact on the minimums.
new_min = pmax(mymins, 1 - (sum(new_max) - new_max))
new_min
# [1] 0.30 0.10 0.00 0.20 0.05
With these adjustments, we should be able to see easily if any solutions are possible (all(new_min < new_max)). And then generating random numbers as in r2evans's answer should go much quicker using the new bounds.
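For example, plugging the tightened bounds into the vectorized rejection approach above (a sketch; new_min and new_max are the revised vectors computed above):
set.seed(42)
n <- 10000
m <- matrix(runif(prod(n, length(new_min) - 1)), nrow = n)
m <- t(t(m) * (new_max[-5] - new_min[-5]) + new_min[-5])
remainders <- 1 - rowSums(m)
ind <- new_min[5] <= remainders & remainders <= new_max[5]
mean(ind)  # acceptance rate; far better than the ~19/10000 seen with the raw bounds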

Choose closest x elements by index in a list/vector

If I have a vector such as x <-c(1,2,3,4,5,6,7,8,9), I want a function f such that
f(vector,index,num) where it takes the vector and gives me num "closest" elements to that one on the index
Examples:
f(x,3,4) = c(1,2,4,5)
f(x,1,5) = c(2,3,4,5,6)
f(x,8,3) = c(6,7,9)
Since there is also the issue that, if num is odd, we need to choose whether to take the extra element from the left or the right side by symmetry, let's go with choosing the left side (but the right side is ok too),
i.e. f(x,4,5) = c(1,2,3,5,6) and f(x,7,3) = c(5,6,8)
I hope my question is clear, thank you for any help/responses!
edit: The original vector of c(1:9) is arbitrary; the vector could be a vector of strings, or a vector of length 1000 with shuffled numbers with repeats etc.,
e.g. c(1,7,4,2,3,7,2,6,234,56,8)
num_closest_by_indices <- function(v, idx, num) {
  # Try the base case, where idx is not within (num/2) of the edge
  i <- abs(seq_along(v) - idx)
  i[idx] <- +Inf # sentinel
  # If there are not enough elements in the base case, incrementally add more
  for (cutoff_idx in seq(floor(num/2), num)) {
    if (sum(i <= cutoff_idx) >= num) {
      # This will add two extra indices every iteration. Strictly, if we have an
      # even length we should add the leftmost one first and `continue`, to break
      # ties towards the left.
      return(v[i <= cutoff_idx])
    }
  }
}
Here's an illustration of this algorithm (with x = 1:9, idx = 3, num = 4): we rank the indices in order of desirability, then pick the lowest num legal ones:
> seq_along(x)
1 2 3 4 5 6 7 8 9
> seq_along(x) - idx
-2 -1 0 1 2 3 4 5 6
> i <- abs(seq_along(x) - idx)
2 1 0 1 2 3 4 5 6
> i[idx] <- +Inf # sentinel to prevent us returning the element itself
2 1 Inf 1 2 3 4 5 6
Now we can just find num elements with smallest values (break ties arbitrarily, unless you have a preference (left)).
Our first guess is all indices <= (num/2) ; this might not be enough if index is within (num/2) of the start/end.
> i <= 2
TRUE TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
> v[i <= 2]
1 2 4 5
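A quick check of the full function against examples from the question (note that for an interior index with an odd num, e.g. f(x, 7, 3), this version can return one extra element, as the comment in the loop acknowledges):
x <- 1:9
num_closest_by_indices(x, 3, 4)  # 1 2 4 5
num_closest_by_indices(x, 8, 3)  # 6 7 9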
So, adapting @dash2's code to handle the corner cases where some indices are illegal (nonpositive, or > length(x)), i.e. not %in% 1:length(x). If min(elems) is below 1, its value tells us how far the window has run off the front, so we must shift the window up accordingly.
Notes:
in the end the code is simpler and faster if we handle it by three piecewise cases. Aww.
it actually seems to simplify things if we pick (num+1) indices, then remove elem before returning the answer (the setdiff(elems, elem) below).
Like so:
f <- function (vec, elem, n) {
  elems <- seq(elem - ceiling(n/2), elem + floor(n/2))
  if (max(elems) > length(vec)) elems <- elems - (max(elems) - length(vec))
  if (elems[1] < 1) elems <- elems + (1 - elems[1])
  elems <- setdiff(elems, elem)
  vec[elems]
}
Giving results:
> f(1:9, 1, 5)
[1] 2 3 4 5 6
> f(1:9, 9, 5)
[1] 4 5 6 7 8
> f(1:9, 2, 5)
[1] 1 3 4 5 6
> f(1:9, 4, 5)
[1] 1 2 3 5 6
> f(1:9, 4, 4)
[1] 2 3 5 6
> f(1:9, 2, 4)
[1] 1 3 4 5
> f(1:9, 1, 4)
[1] 2 3 4 5
> f(1:9, 9, 4)
[1] 5 6 7 8
Start a function with the variable argument x first, and the reference table and n after
.nearest_n <- function(x, table, n) {
The algorithm assumes that table is numeric, without any duplicates, and all values finite; n has to be less than or equal to the length of the table
    ## assert & setup
    stopifnot(
        is.numeric(table), !anyDuplicated(table), all(is.finite(table)),
        n <= length(table)
    )
Sort the table and then 'clamp' maximum and minimum values
## sort and clamp
table <- c(-Inf, sort(table), Inf)
len <- length(table)
Find the interval in table where x occurs; findInterval() uses an efficient search. Use the interval index as the initial lower index, and add 1 for the upper index, making sure to stay in-bounds.
## where to start?
lower <- findInterval(x, table)
upper <- min(lower + 1L, len)
Find the nearest n neighbors by comparing the distances from x to the lower and upper candidate values, recording the nearer one, and incrementing the lower or upper index as appropriate, making sure to stay in-bounds.
    ## find
    nearest <- numeric(n)
    for (i in seq_len(n)) {
        if (abs(x - table[lower]) < abs(x - table[upper])) {
            nearest[i] = table[lower]
            lower = max(1L, lower - 1L)
        } else {
            nearest[i] = table[upper]
            upper = min(len, upper + 1L)
        }
    }
Then return the solution and finish the function
nearest
}
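For a quick scalar sanity check, using the sorted table from the vectorized example further below:
table <- c(5, 29, 41, 42, 50, 51, 79, 83, 86, 91)
.nearest_n(30, table, 4)
## [1] 29 41 42 50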
The code might seem verbose, but is actually relatively efficient because the only operations on the entire vector (sort(), findInterval()) are implemented efficiently in R.
A particular advantage of this approach is that it can be vectorized in its first argument, calculating the test for using lower (use_lower = ...) as a vector and using pmin() / pmax() as clamps.
.nearest_n <- function(x, table, n) {
    ## assert & setup
    stopifnot(
        is.numeric(table), !anyDuplicated(table), all(is.finite(table)),
        n <= length(table)
    )
    ## sort and clamp
    table <- c(-Inf, sort(table), Inf)
    len <- length(table)
    ## where to start?
    lower <- findInterval(x, table)
    upper <- pmin(lower + 1L, len)
    ## find
    nearest <- matrix(0, nrow = length(x), ncol = n)
    for (i in seq_len(n)) {
        use_lower <- abs(x - table[lower]) < abs(x - table[upper])
        nearest[, i] <- ifelse(use_lower, table[lower], table[upper])
        lower[use_lower] <- pmax(1L, lower[use_lower] - 1L)
        upper[!use_lower] <- pmin(len, upper[!use_lower] + 1L)
    }
    ## return
    nearest
}
For instance
> set.seed(123)
> table <- sample(100, 10)
> sort(table)
[1] 5 29 41 42 50 51 79 83 86 91
> .nearest_n(c(30, 20), table, 4)
[,1] [,2] [,3] [,4]
[1,] 29 41 42 50
[2,] 29 5 41 42
Generalize this by taking any argument and coercing it to the required form using a reference look-up table table0 and the indexes into it table1
nearest_n <- function(x, table, n) {
    ## coerce to common form
    table0 <- sort(unique(c(x, table)))
    x <- match(x, table0)
    table1 <- match(table, table0)
    ## find nearest
    m <- .nearest_n(x, table1, n)
    ## result in original form
    matrix(table0[m], nrow = nrow(m))
}
As an example...
> set.seed(123)
> table <- sample(c(letters, LETTERS), 30)
> nearest_n(c("M", "Z"), table, 5)
[,1] [,2] [,3] [,4] [,5]
[1,] "o" "L" "O" "l" "P"
[2,] "Z" "z" "Y" "y" "w"

Logical check of whether vector values are at the same precision or not

I have a vector with a variable number of elements in it, and I want to check whether its last two elements are at the same digit order.
For example, if the last two values are 0.0194 and 0.0198, return TRUE, because their first digit after zero is at the same order (0.01, i.e. order 10^-2). As another example, the numbers could be 0.00014 and 0.00012; their precision is still around the same, so the function should also return TRUE.
How can we build a logical statement or function to check this?
x<- c(0.817104, 0.241665, 0.040581, 0.022903, 0.019478, 0.019846)
I may be over-thinking this, but you can test that the order of magnitude and first non-zero digit are identical for each.
x <- c(0.817104, 0.241665, 0.040581, 0.022903, 0.019478, 0.019846)
oom <- function(x, base = 10) as.integer(ifelse(x == 0, 0, floor(log(abs(x), base))))
oom(x)
# [1] -1 -1 -2 -2 -2 -2
(tr <- trunc(x / 10 ** oom(x, 10)))
# [1] 8 2 4 2 1 1
So for the last two, the order of magnitude for both is -2 and the first non-zero digit is 1 for both.
Put into a function:
f <- function(x) {
  oom <- function(x, base = 10) as.integer(ifelse(x == 0, 0, floor(log(abs(x), base))))
  x <- tail(x, 2)
  oo <- oom(x)
  tr <- trunc(x / 10 ** oo)
  (oo[1] == oo[2]) & (tr[1] == tr[2])
}
## more test cases
x1 <- c(0.019, 0.011)
x2 <- c(0.01, 0.001)
f(x) ## TRUE
f(x1) ## TRUE
f(x2) ## FALSE
Here is a more general function than the above for checking the last n instead of 2
g <- function(x, n = 2) {
  oom <- function(x, base = 10) as.integer(ifelse(x == 0, 0, floor(log(abs(x), base))))
  x <- tail(x, n)
  oo <- oom(x)
  tr <- trunc(x / 10 ** oo)
  Reduce(`==`, oo) & Reduce(`==`, tr)
}
g(c(.24, .15, .14), 2) ## TRUE
g(c(.24, .15, .14), 3) ## FALSE
@rawr worries about over-thinking. I guess I should as well. This is what I came up with, and do note that this handles the fact that print representations of floating point numbers are sometimes deceiving.
orddig <- function(x) {
  which(sapply(0:16, function(n) {
    isTRUE(all.equal(x * 10^n, round(x * 10^n, 0)))
  }))[1]
}
> sapply( c(0.00014 , 0.00012 ), orddig)
[1] 6 6
My original efforts were with the signif function but that's a different numerical thought trajectory, since 0.01 and 0.001 have the same number of significant digits. Also notice that:
> sapply( 10^5*c(0.00014 , 0.00012 ), trunc, 4)
[1] 13 12
Which was why we need the isTRUE(all.equal(... , ...))
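Applied to the last two elements of the question's vector, orddig() gives the same value for both (note it returns the matching position within 0:16, which is one more than the power of ten itself):
x <- c(0.817104, 0.241665, 0.040581, 0.022903, 0.019478, 0.019846)
sapply(tail(x, 2), orddig)
## [1] 7 7   -- equal, so the last two values are at the same precision by this measure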
