Check if each number in a vector is between some numbers in another vector in R

Say I have two vectors, A and B. A has 15 elements and B has 28 elements.
A = c(13,14,29,31,32,39,42,51,59,61,68,91,102,109,120)
B = c(26,26,28,29,30,30,33,38,41,42,45,46,47,47,49,49,80,81,86,86,90,90,92,100,101,105,105,107)
I want a 14 by 27 matrix, Z, where the (i,j) entry is 1 if (B_j, B_{j+1}] overlaps with (A_i, A_{i+1}].
For instance, the (3,4) entry of Z would be 1 since (29,31] and (29,30] overlap, with 30 as a common number. Is there a fast way to compute this?
I have the following code:
Z = matrix(0, length(A)-1, length(B)-1)
for (i in 1:(length(A)-1)) {
  nn = which(B > A[i] & B <= A[i+1])
  if (length(nn) > 0) {
    Z[i, nn-1] = 1
  }
}
It works, but my A and B vectors often contain 30,000+ elements and it is incredibly slow. Even allocating the matrix Z takes an unnecessarily long time. Can anyone help with this?
Ideally, there is a vectorized solution to this, or a well-written function from a package that can do this like cutting a cake.

Here's an option using outer() comparisons. As noted in the comments, the intermediate matrices can get big, and you'll have to see whether the speed improvement is worth it:
# res1[i, j] is TRUE when B[j] > A[i]; res2[i, j] is TRUE when B[j] <= A[i]
res1 <- outer(A, B, FUN = function(A, B) B > A)
res2 <- outer(A, B, FUN = function(A, B) B <= A)
dim(res1); dim(res2)
# drop the last row of res1 and the first row of res2 so that row i tests
# B[j] > A[i] together with B[j] <= A[i+1]
res3 <- (res1[-nrow(res1), ] + res2[-1, ]) == 2
image(res3)
dim(res3)
op <- par(mfcol=c(1,2))
image(Z, main="Z")
image(res3, main="res3")
par(op)
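As a quick sanity check (not part of the original answer): res3 flags each B[j] itself, whereas Z's column k refers to B[k+1], so dropping res3's first column should reproduce Z exactly:
# res3[i, j] tests whether B[j] lies in (A[i], A[i+1]]; Z[i, k] does the
# same for B[k+1], so the two line up after dropping res3's first column
all(res3[, -1] == (Z == 1))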

If closed intervals [B_j, B_{j+1}] and [A_i, A_{i+1}] are OK for you as well, you could use data.table's foverlaps():
A <- as.integer(c(13,14,29,31,32,39,42,51,59,61,68,91,102,109,120))
B <- as.integer(c(26,26,28,29,30,30,33,38,41,42,45,46,47,47,49,49,80,81,86,86,90,90,92,100,101,105,105,107))
library(data.table)
DT_A <- data.table(A0 = A, A1 = shift(A, type = "lead"), key = c("A0", "A1"))[-length(A)]
DT_B <- data.table(B0 = B, B1 = shift(B, type = "lead"), key = c("B0", "B1"))[-length(B)]
# all pairs of overlapping intervals, returned as (xid, yid) index pairs
ind_true <- foverlaps(DT_A, DT_B, type="any", mult="all", which=TRUE)[!is.na(yid)]
mat <- matrix(0, length(A)-1, length(B)-1)
# index with a two-column matrix so that only the matched (xid, yid) pairs
# are set, not their whole cross product
mat[cbind(ind_true$xid, ind_true$yid)] <- 1

This answer uses matrix indexing and relies on expand.grid(), though there are much faster implementations of it (a faster drop-in version, expand.grid.jc, is included below). Lag each vector to build two-column matrices of interval endpoints for A and B, then index into those matrices with an expanded grid of row pairs and apply a simple boolean overlap check. The function returns a logical matrix.
overlap = function(id, x1, x2) {
  idA = id[, 1]
  idB = id[, 2]
  # two intervals overlap if either endpoint of one lies inside the other
  o = (x1[idA, 1] >= x2[idB, 1] & x1[idA, 1] <= x2[idB, 2]) |
      (x1[idA, 2] >= x2[idB, 1] & x1[idA, 2] <= x2[idB, 2]) |
      (x1[idA, 1] <= x2[idB, 1] & x1[idA, 2] >= x2[idB, 1]) |
      (x1[idA, 1] <= x2[idB, 2] & x1[idA, 2] >= x2[idB, 2])
  matrix(o, nrow = nrow(x1))
}
library(dplyr) # for lag(); base R's lag() is for time series and won't shift a plain vector
A = c(13,14,29,31,32,39,42,51,59,61,68,91,102,109,120)
nA = cbind(lag(A), A)[-1, ] # two-column matrix of A's interval endpoints
B = c(26,26,28,29,30,30,33,38,41,42,45,46,47,47,49,49,80,81,86,86,90,90,92,100,101,105,105,107)
nB = cbind(lag(B), B)[-1, ]
expand.grid.jc <- function(seq1, seq2) {
  cbind(Var1 = rep.int(seq1, length(seq2)),
        Var2 = rep.int(seq2, rep.int(length(seq1), length(seq2))))
}
ids = expand.grid.jc(1:nrow(nA),1:nrow(nB))
overlap(ids,nA,nB)
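For completeness, a sketch that is not from the answers above: since A is sorted, the OP's loop maps directly onto findInterval(), avoiding both the loop and any large intermediate matrices (Z2 is just a hypothetical name for the result):
# i such that A[i] < B[j] <= A[i+1]; 0 or length(A) when B[j] falls outside
idx <- findInterval(B, A, left.open = TRUE)
# keep in-range hits and drop j = 1, mirroring the OP's Z[i, nn-1] indexing
keep <- idx >= 1 & idx <= length(A) - 1 & seq_along(B) > 1
Z2 <- matrix(0, length(A) - 1, length(B) - 1)
Z2[cbind(idx[keep], seq_along(B)[keep] - 1)] <- 1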

Related

Find P(X<Y<Z) in R

I want to find P(X<Y<Z) in R. For each value of z_i, I want to check whether it satisfies the conditions or not. I demonstrate the problem below, using the ifelse function in R. I don't know how to put multiple statements within ifelse: when I type ifelse(z[i]>y>x, 1, 0) I get errors. I want to know how to include this.
x = c(1,1)
y = c(2,2)
z = c(3,3)
value = NULL
n1 = length(x)
n2 = length(y)
n3 = length(z)
for (i in 1:length(z)) {
  value[i] = sum(ifelse(z[i] > y & z[i] > x & y > x, 1, 0))
}
value
The desired output should be 4 4. But the above code gives 2 2. Thanks in advance.
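Assuming the goal is to count, for each z_i, every (x_j, y_k) pair with x_j < y_k < z_i (which is what gives 4 4 here), a minimal sketch using outer():
# for each z_i: keep the y values below z_i, then count pairs with x_j < y_k
value <- sapply(z, function(zi) {
  yz <- y[y < zi]
  sum(outer(x, yz, "<"))
})
value
# [1] 4 4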

Listing all integer compositions in R

What I look for is basically an R version of the answer to this question: Generating all permutation of numbers that sums up to N. First of all, the answer uses Java, which I have a really hard time reading. Second of all, the code uses a "deque", which I can't figure out how to implement in R.
I have found several algorithms to do this, but they have all been written in programming languages using structures not available in R such as deques, heaps or list-comprehensions.
What I actually need is a way of finding all the vectors v of length N-1 where:
sum(v * 1:(N-1)) == N
and I think I can manage that myself if only I find a way of obtaining all the ordered integer partitions.
As an example for N = 4 all the ordered integer partitions using numbers 1 to N-1 are:
1+1+1+1
1+1+2
1+3
2+2
What I effectively need is output of the either form:
c(1,1,1,1)
c(1,1,2)
c(1,3)
c(2,2)
Or of the form:
c(4,0,0)
c(2,1,0)
c(1,0,1)
c(0,2,0)
since I should be able to convert the former format to the latter by myself. Any hint as to how to approach this problem using R would be greatly appreciated. The latter format is exactly the vectors v such that sum(v * 1:3) is 4.
EDIT:
My own attempt:
rek = function(mat, id1, id2){
  if (id1 + id2 != length(mat) + 1) { # if next state is not absorbing
    mat[id1] = mat[id1] - 1
    mat[id2] = mat[id2] - 1
    mat[id1 + id2] = mat[id1 + id2] + 1
    out = mat
    id = which(mat > 0)
    for (i in id) {
      for (j in id[id >= i]) {
        if (j == i & mat[i] == 1) {
          next
        }
        out = rbind(out, rek(mat, i, j))
      }
    }
    return(out)
  }
}
start = c(n, rep(0, n-2))
states = rbind(start, rek(start, 1, 1))
states = states[!duplicated(states), ] #only unique states.
This is incredibly inefficient. E.g. when n = 11, my states matrix has over 120,000 rows prior to removing duplicates, which leaves only 55 rows.
EDIT 2:
Using the parts() function described below I came up with:
temp = partitions::parts(n)
temp = t(temp)
for (i in 1:length(temp[,1])) {
  row = temp[i,]
  if (any(row > (n-1))) { # if absorbing state
    next
  }
  counts = plyr::count(row[row > 0])
  newrow = rep(0, n-1)
  id = counts$x
  numbs = counts$freq
  newrow[id] = numbs
  states = rbind(states, newrow)
}
states = states[-1,] # removing the first row, added manually
which gives me exactly the vectors v such that sum(v * 1:(N-1)) is N.
If anyone is interested, this is to be used within coalescent theory, as a way to describe the possible relations between N individuals, omitting the case where all are related. As an example with N = 4:
(4, 0, 0) -- No individuals are related
(2, 1, 0) -- Two individuals are related, the rest are not
(0, 2, 0) -- The individuals are pair-wise related
(1, 0, 1) -- Three individuals are related, the other individual is not.
parts() from the partitions package can help:
library(partitions)
N <- 4
res <- unique(lapply(asplit(parts(N),2),function(x) sort(x[x>0])))[-1]
which gives
> res
[[1]]
[1] 1 3
[[2]]
[1] 2 2
[[3]]
[1] 1 1 2
[[4]]
[1] 1 1 1 1
If you would like to write a custom base R function, here is a recursive version
f <- function(n, vhead = n, v = c()) {
  if (n == 0) return(list(v))
  unlist(lapply(seq_len(min(n, vhead)), function(k) f(n - k, k, c(k, v))), recursive = FALSE)
}
then we can run
res <- Filter(function(x) length(x) > 1, f(N))
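To convert from the partition format to the count format (vectors v with sum(v * 1:(N-1)) == N), tabulate() does it in one step; a sketch, assuming res comes from either approach above:
# tabulate(p, nbins = N - 1) counts how often each part 1..N-1 occurs in p
V <- t(sapply(res, tabulate, nbins = N - 1))
# each row of V satisfies sum(row * 1:(N-1)) == N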

rev() in R and how to apply it to a list using loops

I have a list, say {a,b,c,d,...}, where each element a, b, c, d, ... is a data.table whose column order I need to reverse. However, I only want to rev() everything except the first column, as it is an ID. I tried using a loop to do it, but it returned
Error in `[<-.data.table`(`*tmp*`, , -1, value = list(code_a = c("a", :
Item 1 of column numbers in j is -1 which is outside range [1,ncol=4]. Use column names instead in j to add new columns.
Example:
a <- c("a","b","c","d","e","f")
b <- 1:6
c <- c("F","E","D","C","B","A")
d <- 10:15
dt1 <- data.table("ID" = b, "code_a" = a)
dt2 <- data.table("ID" = b, "code_c" = c)
dt3 <- data.table("ID" = b, "code_d" = d)
dt <- list(dt1,dt2,dt3)
rev_dt <- rev(dt)
merged_list <- list()
rev_merged_list <- list()
rev_merged_list <- Reduce(merge, rev_dt, accumulate = TRUE)
merged_list <- rev_merged_list
merged_list <- rev(merged_list)
for (z in 1:length(dt)) {
  merged_list[[z]][, -1] = rev(merged_list[[z]][, -1])
}
More Information:
The for loop here is supposed to be:
- for z from 1 to the length of dt
- the merged_list element z (accessed with double square brackets) should be a data.table
- where the data does not include the first column
- should be assigned to the rev of the same element z, where the first column is also excluded
Does this logic hold for the above loop? I am unsure what is wrong!
Expected Output:
output_ <- list()
a_ <- data.table("ID" = b, "code_a" = a, "code_c" = c, "code_d" = d)
b_ <- data.table("ID" = b, "code_c" = c, "code_d" = d)
c_ <- data.table("ID" = b, "code_d" = d)
output_[[1]] <- a_
output_[[2]] <- b_
output_[[3]] <- c_
output_
I was told yesterday that for the merge above I can specify a right-hand merge, but in doing so I need to specify by = "ID" in the merge, and I am unsure what the x and y values are when merging multiple sets of data.
I am also under the impression that lapply() can do the same thing as the loop, but I am unsure how I might achieve that in this case. Thanks~
We can use setcolorder, which reorders the columns by reference:
for (i in seq_along(merged_list)) {
  setcolorder(merged_list[[i]],
              c(names(merged_list[[i]])[1], rev(names(merged_list[[i]])[-1])))
}
all.equal(merged_list, output_, check.attributes = FALSE)
#[1] TRUE
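Since setcolorder() modifies each data.table by reference, the same loop can be written with lapply(), as the question anticipated (a sketch):
# the return value is irrelevant because setcolorder() works in place
invisible(lapply(merged_list, function(d) {
  setcolorder(d, c(names(d)[1], rev(names(d)[-1])))
}))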

Matching rows from one dataset to a reference dataset (R)

I have a question.
Let's say I have two data frames.
values <- data.frame(x = rnorm(10000), y = rnorm(10000), matches = 0)
reference <- data.frame(a = rnorm(10000), b = rnorm(10000))
For each row in “values”, I would like to know how many matches there are in the “reference” dataset within a defined range.
library(data.table) # for %between%
system.time(
  for (i in 1:nrow(values)) {
    # defining the valid range
    x1 <- values$x[i] - 0.1
    x2 <- values$x[i] + 0.1
    y1 <- values$y[i] - 0.2
    y2 <- values$y[i] + 0.2
    # matching values against the reference dataset
    values$matches[i] <- nrow(reference[reference$a %between% c(x1, x2) &
                                        reference$b %between% c(y1, y2), ])
  }
)
   user  system elapsed
   9.91    0.03    9.94
The example above is functional, but for large datasets it takes ages.
Can this maybe be done with data.table?
Thank you in advance
Here is one data.table method:
# set up the data.tables
values <- setDT(data.frame(x = rnorm(10000), y = rnorm(10000), matches = 0))
reference <- setDT(data.frame(a = rnorm(10000), b = rnorm(10000)))
# initialize matches as integer for speed, then count matches per row
values[, matches := integer(nrow(values))]
values[, matches := sum((reference$a %between% c(x - 0.1, x + 0.1)) *
                        (reference$b %between% c(y - 0.2, y + 0.2))), by = rownames(values)]
It's probably faster than what you have, though there is likely a faster method.
Here's another solution using dplyr's rowwise().
If the "defined range" is symetric you can improve performance by checking only two conditions:
count_matches <- function(x, y) {
  sum(abs(reference$a - x) <= 0.1 & abs(reference$b - y) <= 0.2)
}
library(dplyr)
values %>%
  rowwise() %>%
  mutate(matches = count_matches(x, y))
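A likely faster data.table route is a non-equi join (a sketch, assuming a recent data.table, which supports non-equi joins since 1.9.8; the x1/x2/y1/y2 helper columns are introduced here only to hold each row's bounds):
library(data.table)
setDT(values); setDT(reference)
values[, `:=`(x1 = x - 0.1, x2 = x + 0.1, y1 = y - 0.2, y2 = y + 0.2)]
# one joined row per (values, reference) match; which = TRUE returns the
# index of the matching values row, so tabulate() turns hits into counts
hits <- values[reference, on = .(x1 <= a, x2 >= a, y1 <= b, y2 >= b),
               which = TRUE, nomatch = NULL]
values[, matches := tabulate(hits, nbins = .N)]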

Subsetting a dataframe in R: avoiding a for loop

In a large dataframe (1 million+ rows), I am counting the number of elements (rows) that are within a particular range and satisfy a third criterion. I have 33 such ranges and use a very slow for loop that gets me the answer, no problem.
As speed is of massive concern, I would appreciate any help to get this to run faster. Can I get rid of the for loop and "vectorise" or any sort of "apply" solution?
Thanks in advance
Code:
N.data<-c(1:33)
Lower<-c(0,100000,125000,150000,175000,200000,225000,250000,275000,300000,325000,350000,375000,400000,425000,450000,475000,500000,550000,600000,650000,700000,750000,800000,850000,900000,950000,1000000,1100000,1200000,1300000,1400000,1500000)
Upper<-c(100000,125000,150000,175000,200000,225000,250000,275000,300000,325000,350000,375000,400000,425000,450000,475000,500000,550000,600000,650000,700000,750000,800000,850000,900000,950000,1000000,1100000,1200000,1300000,1400000,1500000, 5000000)
for (i in 1:length(N.data)) {
  N.data[i] <- nrow(dataset[dataset$Z == c & dataset$X > Lower[i] & dataset$X < Upper[i], ])
}
A more efficient approach:
# first logical index (vector)
idx1 <- dataset$Z == c
# second logical index (matrix)
idx2 <- mapply(function(l, u) dataset$X > l & dataset$X < u, Lower, Upper)
# combine both indices and count number of rows
N.data <- colSums(idx1 & idx2)
apply functions are not VECTORIZED. They are merely more efficient implementations of a for loop. To achieve what you seek using vectorization, here is one approach.
# Create a Dummy Dataset and Breaks
dataset = data.frame(
X = rpois(100, 10),
Z = rpois(100, 20)
)
breaks = seq(0, max(dataset$Z), length = 5)
# Add Column with Breaks
dataset = transform(dataset, Z2 = cut(Z, breaks, labels = FALSE))
# Use Aggregate to compute length for each value of Z2
c = 10
aggregate(Z ~ Z2, data = dataset, length, subset = (X == c))
This should be more efficient than using mapply, as it is completely vectorized.
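A fully vectorized alternative (a sketch, not from the answers above): since Lower is sorted, findInterval() assigns each X to its range in one pass and tabulate() counts per range. The loop uses strict inequalities, so values sitting exactly on a breakpoint, or above the final Upper bound, are treated slightly differently here:
x <- dataset$X[dataset$Z == c]
bins <- findInterval(x, Lower) # which of the 33 ranges each X falls into
N.data <- tabulate(bins, nbins = length(Lower))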
