Find closest value with condition - r

I have a function that finds me the nearest values for each row in a matrix. It then reports a list with an index of the nearest rows. However, I want it to exclude values if they are +1 in the first AND +1 in the second column away from a particular set of values (-1 in the first and -1 in the second column should also be removed). Moreover, +1 in first column and -1 in second column with respect to the values of interest should also be avoided.
As an example, if I want things closest to c(2, 1), it should accept c(3,1), c(2,2) or c(1,1), but NOT c(3,2) and not c(1,0).
Basically, for an output to be reported either column 1 or column 2 should be a value of 1 away from a row of interest, but not both.
input looks like this
x
v1 v2
[1,] 3 1
[2,] 2 1
[3,] 3 2
[4,] 1 2
[5,] 8 5
myfunc(x)
The output looks like this. Notice that the closest thing to row 2 ($V2 in output) is row 1,3,4. The answer should only be 1 though.
$V1
[1] 2 3
$V2
[1] 1 3 4
$V3
[1] 1 2
$V4
[1] 2
$V5
integer(0)
Here is myfunc
myfunc = function(t){
  d1 <- dist(t[,1])
  d2 <- dist(t[,2])
  dF <- as.matrix(d1) <= 1 & as.matrix(d2) <= 1
  diag(dF) <- NA
  colnames(dF) <- NULL
  dF2 <- lapply(as.data.frame(dF), which)
  return(dF2)
}

Basically, the rows that you want to find should differ from your reference element by +1 or -1 in one column and be identical in the other column. That means that the sum over the absolute values of the differences is exactly one. For your example c(2, 1), this works as follows:
c(3, 1): difference is c(1, 0), thus sum(abs(c(1, 0))) = 1 + 0 = 1
c(1, 1): difference is c(-1, 0), thus sum(abs(c(-1, 0))) = 1 + 0 = 1
etc.
The following function checks exactly this:
myfunc <- function(x) {
  do_row <- function(r) {
    r_mat <- matrix(rep(r, length = length(x)), ncol = ncol(x), byrow = TRUE)
    abs_dist <- abs(r_mat - x)
    return(which(rowSums(abs_dist) == 1))
  }
  return(apply(x, 1, do_row))
}
do_row() does the job for a single row, and then apply() is used to do this with each row. For your example, I get:
myfunc(x)
## [[1]]
## [1] 2 3
##
## [[2]]
## [1] 1
##
## [[3]]
## [1] 1
##
## [[4]]
## integer(0)
##
## [[5]]
## integer(0)
Using sweep(), one can write a shorter function:
myfunc2 <- function(x) {
apply(x, 1, function(r) which(rowSums(abs(sweep(x, 2, r))) == 1))
}
But this seems harder to understand, and it turns out to be slower by about a factor of two for your matrix x. (I have also tried it with a large matrix, and there the efficiency seems about the same.)
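As a further hedged sketch, not part of the answer above: the condition "sum of absolute differences equals one" is the Manhattan distance between two rows, which base R's dist() can compute for all row pairs at once. The name myfunc3 is hypothetical, and the exact == 1 comparison assumes the matrix holds integer-valued numbers.

```r
# Sketch: use the Manhattan (city-block) distance between all pairs of rows.
# myfunc3 is a hypothetical name; assumes x holds integer-valued numbers,
# so that comparing the distance to 1 with == is safe.
myfunc3 <- function(x) {
  d <- as.matrix(dist(x, method = "manhattan"))
  # for each row, report the indices of rows at distance exactly 1
  lapply(seq_len(nrow(d)), function(i) unname(which(d[i, ] == 1)))
}

# the example matrix from the question (filled column by column)
x <- matrix(c(3, 2, 3, 1, 8,
              1, 1, 2, 2, 5), ncol = 2)
res <- myfunc3(x)
res
```

This builds the full n-by-n distance matrix, so memory use grows quadratically with the number of rows.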

Related

How to write a function that returns indexes of input vector according to some rules, using repeat, while loop, or other iteration technique

I'm trying to write a function that takes a numeric vector as input, and returns the indexes of a shorter version of the input vector, according to some rules:
(a) if all elements are identical, return just the index of the first element; i.e., return 1; else:
if NOT all elements identical, then test for whether special_treatment_value is among them:
(b) if special_treatment_value is there, return the input vector's indexes except for the indexes of elements where special_treatment_value appeared; else:
(c) if special_treatment_value is not there, return the indexes of the input vector as-is, i.e., 1:length(x).
The problem: if we ended up in route (b), we might encounter a situation in which all vector elements are now the same. In such case, we would like to iterate through (a) again to minimize to just the first element.
Example
Let's say that I want to pass the following vectors through my function:
my_vec_1 <- c(1, 2, 1, 2, 3)
my_vec_2 <- c(4, 4, 4)
my_vec_3 <- c(1, 2, 1, 4, 1)
my_vec_4 <- c(3, 3, 3, 4)
and that:
special_treatment_value <- 4
According to my rules, the function should return the outputs:
for my_vec_1: it fits route (c) and thus the output should be 1:5 (indexes of all)
for my_vec_2: it fits route (a) and thus the output should be 1 (index of first)
for my_vec_3: it fits route (b). output should be 1 2 3 5 (indexes of all except for special value's)
my_vec_4 demonstrates the problem. My desired output is 1 because first we go through route (b) then I want to pass through (a). But right now it doesn't happen and my function (see below) returns 1 2 3 (indexes of all except for special value's).
my current attempt
get_indexes <- function(x, special_val) {
  if (var(x) == 0) { # route (a)
    output_idx <- 1
    return(output_idx)
  }
  idx_entire_length <- 1:length(x)
  if (any(x == special_val)) { # route (b)
    idx_to_remove <- which(x == special_val)
    output_idx <- idx_entire_length[-idx_to_remove]
    return(output_idx)
  }
  # else
  output_idx <- idx_entire_length # route (c)
  return(output_idx)
}
get_indexes(my_vec_1, 4)
#> [1] 1 2 3 4 5
get_indexes(my_vec_2, 4)
#> [1] 1
get_indexes(my_vec_3, 4)
#> [1] 1 2 3 5
get_indexes(my_vec_4, 4)
#> [1] 1 2 3
I guess there should be some repeat block or while loop, but I can't figure out how to implement it correctly (and efficiently).
You can try
foo <- function(x, y){
  tmp <- which(x != y)
  if(dplyr::n_distinct(x[x!=y])<=1){
    tmp <- 1
  }
  return(tmp)
}
Instead of n_distinct(), you can use length(unique(...)).
Result:
lapply(list(my_vec_1, my_vec_2, my_vec_3, my_vec_4), foo, 4)
[[1]]
[1] 1 2 3 4 5
[[2]]
[1] 1
[[3]]
[1] 1 2 3 5
[[4]]
[1] 1
You could repeat the condition for going through (a) inside condition (b), for example:
f <- function(x, treatment){
  if(var(x) == 0) 1 else {
    if(treatment %in% x) {
      x[-which(x == treatment)] |>
        (\(.) if(var(.) == 0) 1 else (1:length(x))[-which(x == treatment)])()
    } else {
      1:length(x)
    }
  }
}
lapply(list(my_vec_1, my_vec_2, my_vec_3, my_vec_4), f, 4)
[[1]]
[1] 1 2 3 4 5
[[2]]
[1] 1
[[3]]
[1] 1 2 3 5
[[4]]
[1] 1
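The idea of re-checking route (a) after route (b) can also be sketched in base R without var(). The function name get_indexes2 is hypothetical; following the question, it returns 1 whenever the vector, or what is left after dropping the special value, contains a single distinct value.

```r
# Sketch in base R (get_indexes2 is a hypothetical name).
get_indexes2 <- function(x, special_val) {
  # route (a): all elements identical
  if (length(unique(x)) == 1) return(1L)
  # route (b)/(c): indexes of everything except the special value
  keep <- which(x != special_val)
  # route (b) followed by (a): something was dropped and the rest is constant
  if (length(keep) < length(x) && length(unique(x[keep])) == 1) return(1L)
  keep
}

my_vec_1 <- c(1, 2, 1, 2, 3)
my_vec_2 <- c(4, 4, 4)
my_vec_3 <- c(1, 2, 1, 4, 1)
my_vec_4 <- c(3, 3, 3, 4)
res <- lapply(list(my_vec_1, my_vec_2, my_vec_3, my_vec_4), get_indexes2, 4)
res
```

Because only one value is ever removed, a single re-check is enough here, so no repeat or while loop is needed.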

For and If in R data programming

I want to evaluate the distance between non-zero data. So if I have 50 data points, and only the first and last are non-zero, I want the result to be 49.
For example, my data is:
1. 0
2. 0
3. 5
4. 6
5. 0
6. 1
7. 0
Based on my data above, I want to get 4 variables:
v0 = 3 (because the distance from the 0th to the 3rd data point is 3 jumps)
v1 = 1 (because the distance from the 3rd to the 4th data point is 1 jump)
v2 = 2 (because the distance from the 4th to the 6th data point is 2 jumps)
v3 = 1 (because the distance from the 6th to the 7th data point is 1 jump)
This is my code:
data=c(0,0,5,6,0,1,0)
t=1
for (i in data) {
if (i == 0) {
t[i]=t+1
}
else {
t[i]=1
}
}
t
The result is:
[1] 1 NA NA NA 1 1
Could you help me in figuring out this problem? I also hope that the code is using some kind of loop, so that it can be applied to any other data.
The general rule is not clear from the question but if x is the input we assume that:
the input is non-negative
the first element in output is the position of the first +ve element in x
subsequent elements of output are distances between successive +ve elements of x
if that results in a vector whose sum is less than length(x) append the remainder
To do that determine the positions of the positive elements of c(1, x), calculate the differences between successive elements in that reduced vector using diff and then if they don't sum to length(x) append the remainder.
dists <- function(x) {
  d <- diff(which(c(1, x) > 0))
  if (sum(d) < length(x)) c(d, length(x) - sum(d)) else d
}
# distance to 5 is 3 and then to 6 is 1 and then to 1 is 2 and 1 is left
x1 <- c(0, 0, 5, 6, 0, 1, 0)
dists(x1)
## [1] 3 1 2 1
# distance to first 1 is 1 and from that to second 1 is 3
x2 <- c(1, 0, 0, 1)
dists(x2)
## [1] 1 3
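To make the steps inside dists() easier to follow, here is the same computation for x1 broken into intermediate values. The variable names padded, pos, d, remainder and result are only for illustration.

```r
x1 <- c(0, 0, 5, 6, 0, 1, 0)

padded <- c(1, x1)                 # prepend a positive value acting as "position 0"
pos <- which(padded > 0)           # positions of positive elements: 1 4 5 7
d <- diff(pos)                     # gaps between successive positions: 3 1 2
remainder <- length(x1) - sum(d)   # what is left after the last positive element: 1
result <- if (remainder > 0) c(d, remainder) else d
result
```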
Here it is redone using a loop:
dists2 <- function(x) {
  pos <- 0
  out <- numeric(0)
  for(i in seq_along(x)) {
    if (x[i]) {
      out <- c(out, i - pos)
      pos <- i
    }
  }
  if (sum(out) < length(x)) out <- c(out, length(x) - sum(out))
  out
}
dists2(x1)
## [1] 3 1 2 1
dists2(x2)
## [1] 1 3
Updates
Simplification based on comments below answer. Added loop approach.

Creating a vector of sequences

I'm trying to find all the numbers less than the square root of an input number.
I've written a function which will do this on entering one number. I have a sequence of numbers that I wish to evaluate the function for.
x <- 1:1000
z <- x^2+1
findy <- function(z){
y <<- seq(1, sqrt(z), 1)
}
n <- length(y)
for (i in 1:n) {
a[i] <- z[i] - y[i]
}
What I want to do is as follows.
Start with a vector z <- 1:1000
Create a new vector: w <- z^2 + 1
then for each number in this vector evaluate the function above.
Example
z <- c(1, 2, 3, 4)
w <- c(2, 5, 10, 17)
(this is where it gets tricky to describe the output)
y= 1
1,2
1,2,3
1,2,3,4
If that makes sense.
Then I would like to be able to pull out certain values of the above array.
If anyone could help then that would be amazing!
An option using sequence and split. The function returns a list.
f <- function(x) {
  w <- x^2 + 1 # why do you need this line?
  out <- sequence(sqrt(w)) # same as sequence(x)
  split(out, cumsum(out == 1L))
}
out <- f(1:4)
out
#$`1`
#[1] 1
#
#$`2`
#[1] 1 2
#
#$`3`
#[1] 1 2 3
#
#$`4`
#[1] 1 2 3 4
To extract the vectors you can use $ or [[ (or [ to get a sub-list)
out$`1` # output is a vector
[1] 1
or
out[2:3] # output is a list
#$`2`
#[1] 1 2
#$`3`
#[1] 1 2 3
See help("Extract") for details.
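An alternative sketch with lapply(), assuming the goal is one integer sequence from 1 up to floor(sqrt(w)) for each element of w. The name seqs is hypothetical, and naming the list by z is just one way to make single sequences easy to pull out.

```r
# Sketch: one integer sequence per element of w, from 1 up to floor(sqrt(w)).
z <- 1:4
w <- z^2 + 1                       # 2 5 10 17
seqs <- lapply(w, function(v) seq_len(floor(sqrt(v))))
names(seqs) <- z                   # label each sequence by its z value

seqs[["3"]]                        # the sequence belonging to z = 3
lengths(seqs)                      # how many numbers each sequence holds
```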

Choose closest x elements by index in a list/vector

If I have a vector such as x <-c(1,2,3,4,5,6,7,8,9), I want a function f such that
f(vector,index,num) where it takes the vector and gives me num "closest" elements to that one on the index
Examples:
f(x,3,4) = c(1,2,4,5)
f(x,1,5) = c(2,3,4,5,6)
f(x,8,3) = c(6,7,9)
Since there is also the issue where if we have an odd num, we will need to choose whether to pick left or right side by symmetry, let's go with choosing the left side (but right side is ok too)
i.e f(x,4,5) = c(1,2,3,5,6) and f(x,7,3) = c(5,6,8)
I hope my question is clear, thank you for any help/responses!
edit: The original vector of c(1:9) is arbitrary, the vector could be a vector of strings, or a vector of length 1000 with shuffled numbers with repeats etc.
i.e c(1,7,4,2,3,7,2,6,234,56,8)
num_closest_by_indices <- function(v, idx, num) {
  # Try the base case, where idx is not within (num/2) of the edge
  i <- abs(seq_along(v) - idx)
  i[idx] <- +Inf # sentinel
  # If there are not enough elements in the base case, incrementally add more
  for (cutoff_idx in seq(floor(num/2), num)) {
    if (sum(i <= cutoff_idx) >= num) {
      # This will add two extra indices every iteration. Strictly if we have an even length, we should add the leftmost one first and `continue`, to break ties towards the left.
      return(v[i <= cutoff_idx])
    }
  }
}
Here's an illustration of this algorithm: we rank the indices in order of desirability, then pick the lowest num legal ones:
> seq_along(x)
1 2 3 4 5 6 7 8 9
> seq_along(x) - idx
-2 -1 0 1 2 3 4 5 6
> i <- abs(seq_along(x) - idx)
2 1 0 1 2 3 4 5 6
> i[idx] <- +Inf # sentinel to prevent us returning the element itself
2 1 Inf 1 2 3 4 5 6
Now we can just find num elements with smallest values (break ties arbitrarily, unless you have a preference (left)).
Our first guess is all indices <= (num/2) ; this might not be enough if index is within (num/2) of the start/end.
> i <= 2
TRUE TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
> v[i <= 2]
1 2 4 5
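The ranking idea illustrated above can also be sketched directly with order(), ranking positions by distance from idx and breaking ties toward the left. The name closest_by_index is hypothetical, and the sketch assumes num is at most length(v) - 1.

```r
# Sketch (closest_by_index is a hypothetical name).
# Assumes 1 <= idx <= length(v) and num <= length(v) - 1.
closest_by_index <- function(v, idx, num) {
  pos <- seq_along(v)[-idx]            # every position except idx itself
  # rank by distance from idx; ties go to the smaller (left) position
  ord <- order(abs(pos - idx), pos)
  chosen <- sort(pos[ord[seq_len(num)]])
  v[chosen]
}

x <- 1:9
closest_by_index(x, 3, 4)
closest_by_index(x, 1, 5)
closest_by_index(x, 8, 3)
closest_by_index(x, 4, 5)
closest_by_index(x, 7, 3)
```

Since it works on positions only, the same function can be applied to character vectors or vectors with repeated values.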
So, adapting @dash2's code to handle the corner cases where some indices are illegal (nonpositive, or > length(x)), i.e. not %in% 1:L. Then min(elems) would tell us the number of illegal indices which we cannot pick, hence we must pick abs(min(elems)) more.
Notes:
in the end the code is simpler and faster to handle it by three piecewise cases. Aww.
it actually seems to simplify things if we pick (num+1) indices, then remove idx before returning the answer. Using result[-idx] to remove it.
Like so:
f <- function (vec, elem, n) {
  elems <- seq(elem - ceiling(n/2), elem + floor(n/2))
  if (max(elems) > length(vec)) elems <- elems - (max(elems) - length(vec))
  if (elems[1] < 1) elems <- elems + (1 - elems[1])
  elems <- setdiff(elems, elem)
  vec[elems]
}
Giving results:
> f(1:9, 1, 5)
[1] 2 3 4 5 6
> f(1:9, 9, 5)
[1] 4 5 6 7 8
> f(1:9, 2, 5)
[1] 1 3 4 5 6
> f(1:9, 4, 5)
[1] 1 2 3 5 6
> f(1:9, 4, 4)
[1] 2 3 5 6
> f(1:9, 2, 4)
[1] 1 3 4 5
> f(1:9, 1, 4)
[1] 2 3 4 5
> f(1:9, 9, 4)
[1] 5 6 7 8
Start a function with the variable argument x first, and the reference table and n after
.nearest_n <- function(x, table, n) {
The algorithm assumes that table is numeric, without any duplicates, and all values finite; n has to be less than or equal to the length of the table
## assert & setup
stopifnot(
is.numeric(table), !anyDuplicated(table), all(is.finite(table)),
n <= length(table)
)
Sort the table and then 'clamp' maximum and minimum values
## sort and clamp
table <- c(-Inf, sort(table), Inf)
len <- length(table)
Find the interval in table where x occurs; findInterval() uses an efficient search. Use the interval index as the initial lower index, and add 1 for the upper index, making sure to stay in-bounds.
## where to start?
lower <- findInterval(x, table)
upper <- min(lower + 1L, len)
Find the nearest n neighbors by comparing the lower and upper index distance to x, record the nearest value, and increment the lower or upper index as appropriate and making sure to stay in-bounds
## find
nearest <- numeric(n)
for (i in seq_len(n)) {
if (abs(x - table[lower]) < abs(x - table[upper])) {
nearest[i] = table[lower]
lower = max(1L, lower - 1L)
} else {
nearest[i] = table[upper]
upper = min(len, upper + 1L)
}
}
Then return the solution and finish the function
nearest
}
The code might seem verbose, but is actually relatively efficient because the only operations on the entire vector (sort(), findInterval()) are implemented efficiently in R.
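To see what the findInterval() starting point looks like, here is a small sketch with a made-up table (tab and the query values are assumptions for illustration only). For each query it shows the clamped neighbours just below and just above.

```r
# Sketch: how findInterval() locates the starting lower/upper indices.
tab <- c(-Inf, 5, 29, 41, 42, Inf)        # already sorted and clamped
q <- c(30, 20, 41)                        # made-up query values

lower <- findInterval(q, tab)             # index i with tab[i] <= q < tab[i + 1]
upper <- pmin(lower + 1L, length(tab))    # next index, kept in-bounds

tab[lower]                                # nearest value at or below each query
tab[upper]                                # nearest value above each query
```

Note that a query equal to a table value (41 here) gets that value as its lower neighbour, so it is picked first with distance zero.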
A particular advantage of this approach is that it can be vectorized in its first argument, calculating the test for using lower (use_lower = ...) as a vector and using pmin() / pmax() as clamps.
.nearest_n <- function(x, table, n) {
  ## assert & setup
  stopifnot(
    is.numeric(table), !anyDuplicated(table), all(is.finite(table)),
    n <= length(table)
  )
  ## sort and clamp
  table <- c(-Inf, sort(table), Inf)
  len <- length(table)
  ## where to start?
  lower <- findInterval(x, table)
  upper <- pmin(lower + 1L, len)
  ## find
  nearest <- matrix(0, nrow = length(x), ncol = n)
  for (i in seq_len(n)) {
    use_lower <- abs(x - table[lower]) < abs(x - table[upper])
    nearest[,i] <- ifelse(use_lower, table[lower], table[upper])
    lower[use_lower] <- pmax(1L, lower[use_lower] - 1L)
    upper[!use_lower] <- pmin(len, upper[!use_lower] + 1L)
  }
  # return
  nearest
}
For instance
> set.seed(123)
> table <- sample(100, 10)
> sort(table)
[1] 5 29 41 42 50 51 79 83 86 91
> .nearest_n(c(30, 20), table, 4)
[,1] [,2] [,3] [,4]
[1,] 29 41 42 50
[2,] 29 5 41 42
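One way to sanity-check results like these is a brute-force version that sorts the whole table by distance for every query. This is only a sketch: nearest_brute is a hypothetical helper, the table values are typed in from the sorted output above rather than sampled, and when two values are equally distant its tie order may differ from .nearest_n().

```r
# Sketch: brute-force nearest-n, useful for checking small cases.
# nearest_brute is a hypothetical name; it sorts the full table per query.
nearest_brute <- function(x, table, n) {
  rows <- lapply(x, function(xi) table[order(abs(table - xi))[seq_len(n)]])
  do.call(rbind, rows)                 # one row per query value
}

tab <- c(5, 29, 41, 42, 50, 51, 79, 83, 86, 91)
res <- nearest_brute(c(30, 20), tab, 4)
res
```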
Generalize this by taking any argument and coercing it to the required form using a reference look-up table table0 and the indexes into it table1
nearest_n <- function(x, table, n) {
  ## coerce to common form
  table0 <- sort(unique(c(x, table)))
  x <- match(x, table0)
  table1 <- match(table, table0)
  ## find nearest
  m <- .nearest_n(x, table1, n)
  ## result in original form
  matrix(table0[m], nrow = nrow(m))
}
As an example...
> set.seed(123)
> table <- sample(c(letters, LETTERS), 30)
> nearest_n(c("M", "Z"), table, 5)
[,1] [,2] [,3] [,4] [,5]
[1,] "o" "L" "O" "l" "P"
[2,] "Z" "z" "Y" "y" "w"

Conditional expression for a specific column in a list of data frames in R

Sorry if the title is confusing.
I have a list of data frames combined into temp.list. I want to raise each row of a specific column to a power based on the value in vec. For example, vec has the values 2, 0, and 3. I want to do: X2^2, log(X2), X2^3. So do log(X2) if the value in vec==0. The last three lines of code are where I have an issue.
M1 <- data.frame(matrix(1:4, nrow = 2, ncol = 2))
M2 <- data.frame(matrix(1:9, nrow = 3, ncol = 3))
M3 <- data.frame(matrix(1:4, nrow = 2, ncol = 2))
mlist <- list(M1, M2, M3)
temp.list <-mlist
vec <- c(2,0,3)
The code below works! But I don't want to raise X2^0.
for(i in 1:length(vec)){
temp.list[[i]]$X2 <- temp.list[[i]]$X2^vec[[i]]
}
The code below replaces all rows of X2 by the first value calculated in X2.
for(i in 1:length(vec)){
  temp.list[[i]]$X2 <- ifelse(vec[[i]]==0, log(temp.list[[i]]$X2), temp.list[[i]]$X2^vec[[i]])
}
Any other ways of doing this would also be much appreciated.
You could use this:
for(i in 1:length(vec)){
  temp.list[[i]]$X2 <- if(vec[[i]]==0) log(temp.list[[i]]$X2)
                       else temp.list[[i]]$X2^vec[[i]]
}
temp.list
# [[1]]
# X1 X2
# 1 1 9
# 2 2 16
# [[2]]
# X1 X2 X3
# 1 1 1.386294 7
# 2 2 1.609438 8
# 3 3 1.791759 9
# [[3]]
# X1 X2
# 1 1 27
# 2 2 64
The problem is with the ifelse(...) statement, which returns a vector of the same length as the condition (i.e., length 1 in your case, so that single value is recycled across all rows of X2). The if (...) ... else ... statement evaluates the condition once and executes whichever branch is appropriate, returning the whole vector.
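A small sketch of the difference, using made-up values (col, p, a and b are hypothetical names): with a length-one condition, ifelse() keeps only the first element of the chosen branch, while if/else returns the whole vector.

```r
col <- c(4, 5, 6)   # stands in for one X2 column
p <- 0              # stands in for one element of vec

# ifelse() shapes its result like the condition, which has length 1 here
a <- ifelse(p == 0, log(col), col^p)
length(a)           # 1

# if/else evaluates the condition once and returns the full branch value
b <- if (p == 0) log(col) else col^p
length(b)           # 3
```

Assigning the length-one result a to a data frame column recycles it, which is why every row ended up with the first computed value.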
