closest smaller and bigger value comparing two vectors - r

Let's say I have two vectors:
a <- c(5,10,12)
b <- c(4,11,15)
I would like to compare a with b and obtain the smaller closest value to each element. The smaller closest value to 5 is 4, for 10 is 4 and for 12 is 11. And the same but finding the closest bigger value. For 5 is 11, for 10 is 11 and for 12 is 15.
Desired vector of closest smaller values:
4 4 11
Desired vector of closest bigger values:
11 11 15
I found another example using the function Closest from the package DescTools, but the results are different:
> unlist(lapply(a, function(i) min(Closest(x = b, a = i))))
[1] 4 11 11
> unlist(lapply(a, function(i) max(Closest(x = b, a = i))))
[1] 4 11 11
Do you know how I could achieve my objective?

This should do it:
> sapply(a,function(x) b[tail(which(b<x),1)])
[1] 4 4 11
> sapply(a,function(x) b[head(which(b>x),1)])
[1] 11 11 15
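One caveat is worth sketching: when an element of a has no smaller (or bigger) value in b, which() returns an empty index and sapply returns a list instead of a vector. The helper names closest_smaller and closest_bigger below are hypothetical, and returning NA for a missing neighbour is an assumption about the desired behaviour, not something the question specifies.

```r
a <- c(5, 10, 12)
b <- c(4, 11, 15)

# Largest element of ref strictly below x, or NA if there is none (assumed behaviour)
closest_smaller <- function(x, ref) {
  cand <- ref[ref < x]
  if (length(cand) == 0) NA_real_ else max(cand)
}

# Smallest element of ref strictly above x, or NA if there is none (assumed behaviour)
closest_bigger <- function(x, ref) {
  cand <- ref[ref > x]
  if (length(cand) == 0) NA_real_ else min(cand)
}

# vapply guarantees a plain numeric vector even when some results are NA
smaller <- vapply(a, closest_smaller, numeric(1), ref = b)
bigger  <- vapply(a, closest_bigger,  numeric(1), ref = b)
```

Because max() and min() are used, this version does not need b to be sorted.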

Assuming both are sorted and unique:
idx <- findInterval(a, b)
b[idx]
[1] 4 4 11
b[idx+1]
[1] 11 11 15
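A hedged variant of the findInterval idea, assuming b is sorted and that "smaller" and "bigger" are meant strictly: left.open = TRUE makes a value equal to an element of b fall into the interval below it, and positions outside the range of b become NA. The vectors a2 and b2 are made up for illustration.

```r
a2 <- c(3, 5, 11, 12, 20)   # illustrative data, includes values outside the range of b2
b2 <- c(4, 11, 15)          # must be sorted in increasing order

# Index of the largest element of b2 strictly below each element of a2
lo <- findInterval(a2, b2, left.open = TRUE)
lo[lo == 0] <- NA           # nothing in b2 is smaller
smaller <- b2[lo]

# Index of the smallest element of b2 strictly above each element of a2
hi <- findInterval(a2, b2) + 1
bigger <- b2[hi]            # indexing past the end of b2 yields NA
```

Here smaller is NA 4 4 11 15 and bigger is 4 11 15 15 NA.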

Related

Get a uniformly distributed sample but keep the order

How could I get a sample of the values of a vector but keep their order, without comparing the values themselves against each other?
For example:
V1 contains the values (1,2,3,4,5,6,7,8,9,10,11,12,13,14)
I would like to get a sample
sample <- c(2,7,10,14)
As you can see, the values are still in order but randomly selected.
But if I use a function such as sample or rdunif in R, I get a randomly ordered selection:
i.e. (7,10,2,14)
Thank you!
With the following solution you do not compare the elements of your original vector in order to sort them; the only thing you do is shuffle a vector of logical values (TRUE or FALSE).
Let's say you want to pick n elements from the already-ordered vector v and maintain their order. Then you can do
v <- 1:14
n <- 4
set.seed(42) # for reproducibility
logi <- sample(c(rep(TRUE, n), rep(FALSE, length(v) - n)))
v[logi]
# [1] 1 6 7 14
EDIT to prove that the vector v can be any vector, and we still manage to maintain its original order.
set.seed(1)
n <- 4
v <- sample(14, replace = FALSE)
v
# [1] 9 4 7 1 2 12 3 6 10 8 5 11 13 14
set.seed(42) # for reproducibility
logi <- sample(c(rep(TRUE, n), rep(FALSE, length(v) - n)))
v[logi]
# [1] 9 12 3 14
These numbers do indeed respect the original order of vector v.
Let's see if we can't do this when the original V1 is not in numerical order.
set.seed(42)
v <- sample(1:14, 14, replace = FALSE)
# [1] 1 5 14 9 10 4 2 8 12 11 6 13 7 3
n <- 4
foo <- sample(v, length(v) - n, replace = FALSE)
match(foo,v)
v[-match(foo,v)]
# [1] 1 13 7 3
Now the output sample values are in the same order they are in the original vector.
V1 <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14)
sample_V1 <- sample(V1, 4)
sort(sample_V1)
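Sorting the sampled values only restores the original order when the vector itself is already sorted. A minimal sketch of an alternative, assuming the goal is to keep the positional order of an arbitrary vector: sample positions, sort the positions, then index. The name pos is hypothetical.

```r
set.seed(1)  # only for reproducibility of this sketch
v <- c(9, 4, 7, 1, 2, 12, 3, 6, 10, 8, 5, 11, 13, 14)  # not in numerical order
n <- 4

# Draw n distinct positions and sort the positions, not the values
pos <- sort(sample(length(v), n))
picked <- v[pos]
```

The values in picked appear in the same relative order as in v, and no element of v is ever compared with another.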

Fastest way to find nearest value in vector

I have two integer/posixct vectors:
a <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15) #has > 2 mil elements
b <- c(4,6,10,16) # 200000 elements
Now my resulting vector c should contain for each element of vector a the nearest element of b:
c <- c(4,4,4,4,4,6,6,...)
I tried it with apply and which.min(abs(a - b)) but it's very very slow.
Is there any more clever way to solve this? Is there a data.table solution?
You can use either of the following:
which(abs(x - your.number) == min(abs(x - your.number)))
or
which.min(abs(x - your.number))
where x is your vector and your.number is the value. If you have a matrix or data.frame, first convert it to a numeric vector in whatever way is appropriate, and then apply this to the resulting vector.
For example:
x <- 1:100
your.number <- 21.5
which(abs(x - your.number) == min(abs(x - your.number)))
would output:
[1] 21 22
Update: Based on the very kind comment of hendy I have added the following to make it more clear:
Note that the results above (i.e. 21 and 22) are the indexes of the items (this is how which() works in R), so if you want the actual values, you have to use these indexes to extract them. Here is another example:
x <- seq(from = 100, to = 10, by = -5)
x
[1] 100 95 90 85 80 75 70 65 60 55 50 45 40 35 30 25 20 15 10
Now let's find the number closest to 42:
your.number <- 42
target.index <- which(abs(x - your.number) == min(abs(x - your.number)))
x[target.index]
which would output the "value" we are looking for from the x vector:
[1] 40
Not quite sure how it will behave with your volume but cut is quite fast.
The idea is to cut your vector a at the midpoints between the elements of b.
Note that I am assuming the elements in b are strictly increasing!
Something like this:
a <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15) #has > 2 mil elements
b <- c(4,6,10,16) # 200000 elements
cuts <- c(-Inf, b[-1]-diff(b)/2, Inf)
# Will yield: c(-Inf, 5, 8, 13, Inf)
cut(a, breaks=cuts, labels=b)
# [1] 4 4 4 4 4 6 6 6 10 10 10 10 10 16 16
# Levels: 4 6 10 16
This is even faster using a lower-level function like findInterval (which, again, assumes that breakpoints are non-decreasing).
findInterval(a, cuts)
[1] 1 1 1 1 2 2 2 3 3 3 3 3 4 4 4
So of course you can do something like:
index = findInterval(a, cuts)
b[index]
# [1] 4 4 4 4 6 6 6 10 10 10 10 10 16 16 16
Note that you can choose what happens to elements of a that are equidistant to an element of b by passing the relevant arguments to cut (or findInterval), see their help page.
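The midpoint idea above can be wrapped in a small helper. This is a sketch only: the name nearest_value is hypothetical, and sorting and de-duplicating ref inside the function is an added assumption so that the breakpoints are strictly increasing.

```r
# For each element of x, return the nearest element of ref (sketch, not benchmarked)
nearest_value <- function(x, ref) {
  ref <- sort(unique(ref))                 # findInterval needs increasing breakpoints
  if (length(ref) == 1) return(rep(ref, length(x)))
  mids <- ref[-1] - diff(ref) / 2          # midpoints between neighbouring ref values
  ref[findInterval(x, mids) + 1]           # interval 0 maps to ref[1], and so on
}

a <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
b <- c(4, 6, 10, 16)
res <- nearest_value(a, b)
```

With findInterval's default settings, a value exactly on a midpoint is assigned to the larger neighbour, so 5 maps to 6 here.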
library(data.table)
a=data.table(Value=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15))
a[,merge:=Value]
b=data.table(Value=c(4,6,10,16))
b[,merge:=Value]
setkeyv(a,c('merge'))
setkeyv(b,c('merge'))
Merge_a_b=a[b,roll='nearest']
When joining two data.tables, the roll argument accepts 'nearest', which matches each row of the table inside the brackets to the row of the outer table with the nearest key value. The result has one row per row of the table inside the brackets (here b), so to get the nearest element of b for every element of a, swap the roles: b[a, roll='nearest']. As usual, the join requires a common key.
For those who would be satisfied with the slow solution:
sapply(a, function(a, b) {b[which.min(abs(a-b))]}, b)
Here might be a simple base R option, using max.col + outer:
b[max.col(-abs(outer(a,b,"-")))]
which gives
> b[max.col(-abs(outer(a,b,"-")))]
[1] 4 4 4 4 6 6 6 10 10 10 10 10 16 16 16
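Two points about this approach are worth a hedged sketch. First, max.col breaks ties at random by default, so equidistant elements such as 5 (between 4 and 6) can land on either neighbour from run to run; passing ties.method makes the result deterministic. Second, outer builds a length(a) by length(b) matrix, so this is only practical for small inputs. The names d, nearest_low and nearest_high are hypothetical.

```r
a <- 1:15
b <- c(4, 6, 10, 16)

# Negative absolute distances: the row maximum marks the nearest element of b
d <- -abs(outer(a, b, "-"))

# Ties resolved towards the smaller neighbour ...
nearest_low  <- b[max.col(d, ties.method = "first")]
# ... or towards the larger neighbour
nearest_high <- b[max.col(d, ties.method = "last")]
```

The two results differ only for the equidistant values 5, 8 and 13.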
Late to the party, but there is now a function in the DescTools package called Closest which does almost exactly what you want (it just doesn't handle multiple values at once).
To get around this we can lapply over your vector a and find the closest value for each element.
library(DescTools)
lapply(a, function(i) Closest(x = b, a = i))
You might notice that more values are being returned than exist in a. This is because Closest will return both values if the value you are testing is exactly between two (e.g. 3 is exactly between 1 and 5, so both 1 and 5 would be returned).
To get around this, put either min or max around the result:
lapply(a, function(i) min(Closest(x = b, a = i)))
lapply(a, function(i) max(Closest(x = b, a = i)))
Then unlist the result to get a plain vector :)

Split a vector into three vectors of unequal length in R

A question from a relative n00b: I’d like to split a vector into three vectors of different lengths, with the values assigned to each vector at random. For example, I’d like to split the vector of length 12 below into vectors of length 2, 3, and 7.
I can get three equal sized vectors using this:
test<-1:12
split(test,sample(1:3))
Any suggestions on how to split test into vectors of 2,3, and 7 instead of three vectors of length 4?
You could use rep to create the indices for each group and then split based on that
split(1:12, rep(1:3, c(2, 3, 7)))
If you wanted the items to be randomly assigned so that it's not just the first 2 items in the first vector, the next 3 items in the second vector, ..., you could just add a call to sample
split(1:12, sample(rep(1:3, c(2, 3, 7))))
If you don't have the specific lengths (2,3,7) in mind but just don't want it to be equal length vectors every time then SimonO101's answer is the way to go.
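As a minimal sketch, the rep-plus-sample approach generalises to any vector of group sizes. The name sizes is hypothetical, and the sizes are assumed to sum to the length of the vector being split.

```r
test <- 1:12
sizes <- c(2, 3, 7)   # desired group lengths; assumed to sum to length(test)

# One group label per element (2 ones, 3 twos, 7 threes), shuffled before splitting
labels <- sample(rep(seq_along(sizes), times = sizes))
groups <- split(test, labels)

lengths(groups)       # group sizes are always 2, 3 and 7; membership is random
```

Each run assigns different elements to the groups, but the group lengths stay fixed.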
How about using sample slightly differently...
set.seed(123)
test<-1:12
split( test , sample(3, 12 , repl = TRUE) )
#$`1`
#[1] 1 6
#$`2`
#[1] 3 7 9 10 12
#$`3`
#[1] 2 4 5 8 11
set.seed(1234)
test<-1:12
split( test , sample(3, 12 , repl = TRUE) )
#$`1`
#[1] 1 7 8
#$`2`
#[1] 2 3 4 6 9 10 12
#$`3`
#[1] 5 11
The first argument in sample is the number of groups to split the vector into. The second argument is the number of elements in the vector. This will randomly assign each successive element into one of 3 vectors. For 4 vectors just do split( test , sample(4, 12 , repl = TRUE) ).
It is easier than you think. To split the vector in three new randomly chosen sets run the following code:
test <- 1:12
split(sample(test), 1:3)
By doing so, any time you run this code you get a new random distribution into three different sets (perfect for k-fold cross validation).
You get:
> split(sample(test), 1:3)
$`1`
[1] 5 8 7 3
$`2`
[1] 4 1 10 9
$`3`
[1] 2 11 12 6
> split(sample(test), 1:3)
$`1`
[1] 12 6 4 1
$`2`
[1] 3 8 7 5
$`3`
[1] 9 2 10 11
You could use an auxiliary vector to format the way you want to split your data. Example:
Data <- c(1,2,3,4,5,6)
Format <- c("X","Y","X","Y","Z","Z")
output <- split(Data,Format)
Will generate the output:
$X
[1] 1 3
$Y
[1] 2 4
$Z
[1] 5 6

Select smaller square matrices along the diagonal inside a big square matrix in R

First of all, if the title of my question is not clear, please go ahead and edit it!
So suppose I have a square matrix.
ex = outer(1:4, 2:5, "+")
colnames(ex) = paste(rep(c("Subj1", "Subj2"), each=2), "_",
rep("Factor1", each=2), ".", rep(c("A", "B")), sep="")
rownames(ex) = paste(rep(c("Subj1", "Subj2"), each=2), "_",
rep("Factor2", each=2), ".", rep(c("A", "B")), sep="")
In this matrix, I would like to extract the values in the red boxes (the 2x2 blocks along the diagonal, one per subject), which are the values for the different combinations of factor levels within each subject (but not across different subjects), and save them into a vector in the sequence below:
[1] 3, 4, 4, 5, 7, 8, 8, 9
I can of course use a loop like the one below:
v = NULL
for(i in 1:16){if(ex2[i,2] == ex2[i,3]) v[i] = ex2[i,1]}
v = v[!is.na(v)]
v
[1] 3 4 4 5 7 8 8 9
I wonder if there is a more elegant way to do this that can take into account the number of subjects, the number of factors, as well as the number of levels within each factor (assuming that all factors have an equal number of levels.)
To extract the submatrices in the red boxes, you can simply do:
ex[1:2, 1:2]
and
ex[3:4, 3:4]
To turn them into a single vector like you want, just do:
c(ex[1:2, 1:2], ex[3:4, 3:4])
# [1] 3 4 4 5 7 8 8 9
ETA: To answer your question in more general terms: let's say we had the number of subjects and levels set up in advance (increasing the number of factors is more complicated, unless I'm mistaken, because then it would no longer be a two-dimensional matrix).
num.subjects = 2
num.levels = 2
size = num.subjects * num.levels
ex = outer(1:size, (1:size)+1, "+")
We can get the solution like this:
subjects = rep(1:num.subjects, each=num.levels)
v = c(sapply(1:num.subjects, function(s) ex[subjects == s, subjects == s]))
v is now
[1] 3 4 4 5 7 8 8 9
This can be extended to much larger numbers of subjects and levels. Setting subjects to 3 and levels to 4 gets:
[1] 3 4 5 6 4 5 6 7 5 6 7 8 6 7 8 9 11 12 13 14 12 13 14 15 13
[26] 14 15 16 14 15 16 17 19 20 21 22 20 21 22 23 21 22 23 24 22 23 24 25
To give a bit more explanation: creating a list of each of the per-individual submatrices can be done pretty simply:
matrices = lapply(1:num.subjects, function(s) ex[subjects == s, subjects == s])
matrices is now:
[[1]]
[,1] [,2]
[1,] 3 4
[2,] 4 5
[[2]]
[,1] [,2]
[1,] 7 8
[2,] 8 9
For the vector version, you'll have to concatenate each individually and then overall. This is effectively what the above solution does.
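An alternative sketch that avoids the loop over subjects: build a logical mask that is TRUE wherever the row and the column belong to the same subject, and index the matrix with it. The name same.subject is hypothetical. Because the blocks sit along the diagonal and R indexes matrices column by column, the values come out in the same order as the sapply version above.

```r
num.subjects <- 2
num.levels <- 2
size <- num.subjects * num.levels
ex <- outer(1:size, (1:size) + 1, "+")

# Subject label for every row/column: 1 1 2 2
subjects <- rep(1:num.subjects, each = num.levels)

# TRUE where row and column belong to the same subject (the diagonal blocks)
same.subject <- outer(subjects, subjects, "==")

# Logical matrix indexing returns the selected cells in column-major order
v <- ex[same.subject]
```

For the 4x4 example this gives 3 4 4 5 7 8 8 9.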
