R: fastest way to check presence of each element of a vector in each of the columns of a matrix - r

I have an integer vector a
a=function(l) as.integer(runif(l,1,600))
a(100)
[1] 414 476 6 58 74 76 45 359 482 340 103 575 494 323 74 347 157 503 385 518 547 192 149 222 152 67 497 588 388 140 457 429 353
[34] 484 91 310 394 122 302 158 405 43 300 439 173 375 218 357 98 196 260 588 499 230 22 369 36 291 221 358 296 206 96 439 423 281
[67] 581 127 178 330 403 91 297 341 280 164 442 114 234 36 257 307 320 307 222 53 327 394 467 480 323 97 109 564 258 2 355 253 596
[100] 215
and an integer matrix B
B=function(c) matrix(as.integer(runif(5*c,1,600)),nrow=5)
B(10)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 250 411 181 345 4 519 167 395 130 388
[2,] 383 377 555 304 119 317 586 351 136 528
[3,] 238 262 513 476 579 145 461 191 262 302
[4,] 428 467 217 590 50 171 450 189 140 158
[5,] 178 14 31 148 285 365 515 64 166 584
and I would like to make a new boolean l x c matrix that shows whether or not each vector element in a is present in each specific column of matrix B.
I tried this with
ispresent1 = function (a,B) {
out = outer(a, B, FUN = "==" )
apply(out,c(1,3),FUN="any") }
or with
ispresent2 = function (a,B) t(sapply(1:length(a), function(i) apply(B,2,function(x) a[[i]] %in% x)))
but neither of these approaches is very fast:
a1=a(1000)
B1=B(20000)
system.time(ispresent1(a1,B1))
user system elapsed
76.63 1.08 77.84
system.time(ispresent2(a1,B1))
user system elapsed
218.10 0.00 230.00
(in my application matrix B would have about 500 000 - 2 million columns)
Probably this is something trivial, but what is the proper way to do this?
EDIT: proper syntax, as mentioned below, is ispresent = function (a,B) apply(B,2,function(x) { a %in% x } ), but the Rcpp solution below is still almost 2 times faster! Thanks for this!

After digging a little, and out of curiosity about @Backlin's Rcpp answer, I wrote a benchmark of the original solution and our two solutions:
I had to modify Backlin's function slightly, as inline didn't work on my Windows box (sorry if I missed something with it; let me know if there's something to adapt).
Code used:
set.seed(123) # Fix the generator
a=function(l) as.integer(runif(l,1,600))
B=function(c) matrix(as.integer(runif(5*c,1,600)),nrow=5)
ispresent1 = function (a,B) {
out = outer(a, B, FUN = "==" )
apply(out,c(1,3),FUN="any") }
a1=a(1000)
B1=B(20000)
tensibai <- function(v,m) {
apply(m,2,function(x) { v %in% x })
}
library(Rcpp)
cppFunction("LogicalMatrix backlin(IntegerVector a,IntegerMatrix B) {
IntegerVector av(a);
IntegerMatrix Bm(B);
int i,j,k;
LogicalMatrix out(av.size(), Bm.ncol());
for(i = 0; i < av.size(); i++){
for(j = 0; j < Bm.ncol(); j++){
for(k = 0; k < Bm.nrow() && av[i] != Bm(k, j); k++);
if(k < Bm.nrow()) out(i, j) = true;
}
}
return(out);
}")
Validation:
> identical(ispresent1(a1,B1),tensibai(a1,B1))
[1] TRUE
> identical(ispresent1(a1,B1),backlin(a1,B1))
[1] TRUE
Benchmark:
> library(microbenchmark)
> microbenchmark(ispresent1(a1,B1),tensibai(a1,B1),backlin(a1,B1),times=3)
Unit: milliseconds
expr min lq mean median uq max neval
ispresent1(a1, B1) 36358.4633 36683.0932 37312.0568 37007.7231 37788.8536 38569.9840 3
tensibai(a1, B1) 603.6323 645.7884 802.0970 687.9445 901.3294 1114.7144 3
backlin(a1, B1) 471.5052 506.2873 528.3476 541.0694 556.7689 572.4684 3
Backlin's solution is slightly faster, showing once more that Rcpp is a good choice, provided you know C++ in the first place :)

Rcpp is awesome for problems like this. It is quite possible that there is some way to do it with data.table or with an existing function, but with the inline package it takes almost less time to write it yourself than to find out.
require(inline)
ispresent.cpp <- cxxfunction(signature(a="integer", B="integer"),
plugin="Rcpp", body='
IntegerVector av(a);
IntegerMatrix Bm(B);
int i,j,k;
LogicalMatrix out(av.size(), Bm.ncol());
for(i = 0; i < av.size(); i++){
for(j = 0; j < Bm.ncol(); j++){
for(k = 0; k < Bm.nrow() && av[i] != Bm(k, j); k++);
if(k < Bm.nrow()) out(i, j) = true;
}
}
return(out);
')
set.seed(123)
a1 <- a(1000)
B1 <- B(20000)
system.time(res.cpp <- ispresent.cpp(a1, B1))
user system elapsed
0.442 0.005 0.446
res1 <- ispresent1(a1,B1)
identical(res1, res.cpp)
[1] TRUE

a=function(l) as.integer(runif(l,1,600))
B=function(c) matrix(as.integer(runif(5*c,1,600)),nrow=5)
ispresent1 = function (a,B) {
out = outer(a, B, FUN = "==" )
apply(out,c(1,3),FUN="any") }
ispresent2 = function (a,B) t(sapply(1:length(a), function(i) apply(B,2,function(x) a[[i]] %in% x)))
ispresent3<-function(a,B){
tf<-matrix((B %in% a),nrow=5)
sapply(1:ncol(tf),function(x) a %in% B[,x][tf[,x]])
}
a1=a(1000)
B1=B(20000)
> system.time(ispresent1(a1,B1))
user system elapsed
29.91 0.48 30.44
> system.time(ispresent2(a1,B1))
user system elapsed
89.65 0.15 89.83
> system.time(ispresent3(a1,B1))
user system elapsed
0.83 0.00 0.86
res1<-ispresent1(a1,B1)
res3<-ispresent3(a1,B1)
> identical(res1,res3)
[1] TRUE
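Since B has only five rows here, one more pure-R variant is to loop over the rows of B instead of its columns and OR together one outer() comparison per row. This is only a sketch: the name ispresent_rows is made up for illustration, it was not part of the timings above, and each outer() call allocates a full length(a) x ncol(B) logical matrix, so memory may become the limit with millions of columns.

```r
# Hypothetical helper (not from the answers above): one outer() per row of B.
ispresent_rows <- function(a, B) {
  out <- matrix(FALSE, nrow = length(a), ncol = ncol(B))
  for (k in seq_len(nrow(B))) {
    # TRUE where a[i] equals the k-th entry of column j
    out <- out | outer(a, B[k, ], FUN = "==")
  }
  out
}

# Toy data: columns of B_small are (1, 9), (3, 3) and (2, 8)
a_small <- c(1L, 2L, 3L)
B_small <- matrix(c(1L, 9L, 3L, 3L, 2L, 8L), nrow = 2)

res <- ispresent_rows(a_small, B_small)
ref <- apply(B_small, 2, function(x) a_small %in% x)  # the apply/%in% version
all(res == ref)
```

Whether this beats the apply/%in% version on large inputs would need to be measured on your own data.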

Related

How to make the difference between two consecutive elements of a vector and remove the one ending with two zeros if the difference is less than 10

I am trying to generate a vector breaks_x derived from another vector break_init. If the difference between two successive elements of break_init is less than 10, the element ending with two zeros should be removed.
My code always removes break_init[i], even when it does not end with two zeros.
Can anyone help, please?
break_init <- c(100,195,200,238,300,326,400,481,500,537,600,607,697,700,800,875,900,908,957)
breaks_x <- vector()
for(i in 1:(length(break_init) - 1))
{
if (break_init[i+1] - break_init[i] >= 10) {
breaks_x[i] <- break_init[i]
} else {
if (grepl("[00]$", as.character(break_init[i])) == TRUE){
breaks_x[i] <- NA
} else if (grepl("[00]$", as.character(break_init[i])) == FALSE) {
breaks_x[i+1] <- NA
} else {
breaks_x[i] <- break_init[i]
}
}
}
[1] 0 100 NA 200 238 300 326 400 481 500 537 NA 607 NA 700 800 875 NA 908 957 #result breaks_x
[1] 0 100 195 NA 238 300 326 400 481 500 537 NA 607 697 NA 800 875 NA 908 957 #what I want my result to be
r2evans has the right idea. Just a little modification to check both the forward and the backwards difference:
bln10 <- diff(break_init) < 10
breaks_x <- replace(break_init, (c(FALSE, bln10) | c(bln10, FALSE)) & break_init %% 100 == 0, NA)
breaks_x
# [1] 100 195 NA 238 300 326 400 481 500 537 NA 607 697 NA 800 875 NA 908 957
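To see why the forward and backward checks are both needed, here is the same idea wrapped in a small function and run on the first five breaks. The function name drop_close_hundreds and the gap argument are made up for this sketch.

```r
# Hypothetical wrapper around the replace() idea above.
drop_close_hundreds <- function(x, gap = 10) {
  close <- diff(x) < gap                 # close[i]: x[i+1] - x[i] < gap
  # an element is "near a neighbour" if the gap before OR after it is small
  near_neighbour <- c(FALSE, close) | c(close, FALSE)
  # only multiples of 100 (values ending in two zeros) are blanked out
  replace(x, near_neighbour & x %% 100 == 0, NA)
}

x <- c(100, 195, 200, 238, 300)
res <- drop_close_hundreds(x)
res
# [1] 100 195  NA 238 300
```

Here 195 and 200 are both flagged as close to a neighbour, but only 200 ends in two zeros, so only 200 becomes NA.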

R function generating incorrect results

I am trying to get better with functions in R and I was working on a function to pull out every odd value from 100 to 500 that was divisible by 3. I got close with the function below. It keeps returning all of the values correctly but it also includes the first number in the sequence (101) when it should not. Any help would be greatly appreciated. The code I wrote is as follows:
Test=function(n){
if(n>100){
s=seq(from=101,to=n,by=2)
p=c()
for(i in seq(from=101,to=n,by=2)){
if(any(s==i)){
p=c(p,i)
s=c(s[(s%%3)==0],i)
}}
return (p)}else{
stop
}}
Test(500)
Here is a function that gets all odd multiples of 3. It's fully vectorized, no loops at all.
Check if n is within the range [100, 500].
Create an integer vector N from 100 to n.
Create a logical index of the elements of N that are divisible by 3 but not by 2.
Extract the elements of N that match the index i.
The main work is done in 3 code lines.
Test <- function(n){
stopifnot(n >= 100)
stopifnot(n <= 500)
N <- seq_len(n)[-(1:99)]
i <- ((N %% 3) == 0) & ((N %% 2) != 0)
N[i]
}
Test(500)
Here is a vectorised one-liner which optionally allows you to change the lower bound from a default of 100 to anything you like. If the bounds are wrong, it returns an empty vector rather than throwing an error.
It works by creating a vector of 1:500 (or more generally, 1:n), then testing whether each element is greater than 100 (or whichever lower bound m you set), AND whether each element is odd AND whether each element is divisible by 3. It uses the which function to return the indices of the elements that pass all the tests.
Test <- function(n, m = 100) which(1:n > m & 1:n %% 2 != 0 & 1:n %% 3 == 0)
So you can use it as specified in your question:
Test(500)
# [1] 105 111 117 123 129 135 141 147 153 159 165 171 177 183 189 195 201 207 213 219
# [21] 225 231 237 243 249 255 261 267 273 279 285 291 297 303 309 315 321 327 333 339
# [41] 345 351 357 363 369 375 381 387 393 399 405 411 417 423 429 435 441 447 453 459
# [61] 465 471 477 483 489 495
Or play around with upper and lower bounds:
Test(100, 50)
# [1] 51 57 63 69 75 81 87 93 99
Here is a function example for your objective
Test <- function(n) {
if (n < 100 || n > 500) stop("out of range")
v <- seq(101, n, by = 2)
na.omit(ifelse(v %% 2 == 1 & v %% 3 == 0, v, NA))
}
stop() is called when n is outside the range [100, 500].
ifelse() returns the desired odd values and NA everywhere else.
na.omit() drops the NAs and produces the final result.
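As a cross-check on the answers above: an odd multiple of 3 is exactly a number that leaves remainder 3 when divided by 6, so the result can also be generated directly with a step of 6 instead of filtering. The function name odd_multiples_of_3 and the lower argument are assumptions made for this sketch.

```r
# Hypothetical alternative: generate the values instead of filtering them.
odd_multiples_of_3 <- function(n, lower = 100) {
  # smallest number above `lower` that is congruent to 3 modulo 6
  first <- lower + 1 + (3 - (lower + 1)) %% 6
  if (first > n) return(integer(0))   # nothing in range
  seq(first, n, by = 6)
}

res <- odd_multiples_of_3(500)
head(res)
# [1] 105 111 117 123 129 135
length(res)
# [1] 66
```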

Finding nearest matching points

What I would like to do is for the red points find the nearest equivalent blue dot on the other side of the abline (i.e. 1,5 find 5,1).
Data:
https://1drv.ms/f/s!Asb7WztvacfOuesIq4evh0jjvejZ4Q
Edit: to open data do readRDS("path/to/data")
So what I have tried is to find the difference between the x and y coordinates, rank them, and then find the min value going down the ranks for both x and y. The results are pretty bad. What I'm struggling with is finding the nearest match between tuples.
My attempt:
find_nearest <- function(query, subject){
weight_df <- data.frame(ID=query$ID)
#find difference of first, then second, rank and find match in both going from top to bottom
tmp_df <- query
for(i in 1:nrow(subject)){
first_order <- order(abs(query$mean_score_n-subject$mean_score_n[i]))
second_order <- order(abs(query$mean_score_p-subject$mean_score_p[i]))
tmp_df$order_1[first_order] <- seq(1, nrow(tmp_df))
tmp_df$order_2[second_order] <- seq(1, nrow(tmp_df))
weight_df[,i+1] <- tmp_df$order_1 + tmp_df$order_2
}
rownames(weight_df) <- weight_df$ID
weight_df$ID <- NULL
print(dim(weight_df))
nearest_match <- list()
count <- 1
subject_ids <- NA
query_ids <- NA
while(ncol(weight_df) > 0 & count <= ncol(weight_df)){
pos <- which(weight_df == min(weight_df, na.rm = TRUE), arr.ind = TRUE)
if(length(unique(rownames(pos))) > 1){
for(i in nrow(pos)){
#if subject/query already used then mask and find another
if(subject$ID[pos[i,2]] %in% subject_ids){
weight_df[pos[i,1],pos[i,2]] <- NA
}else if(query$ID[pos[i,1]] %in% query_ids){
weight_df[pos[i,1],pos[i,2]] <- NA
}else{
subject_ids <- c(subject_ids, subject$ID[pos[i,2]])
query_ids <- c(query_ids, query$ID[pos[i,1]])
nearest_match[[count]] <- data.frame(query=query[pos[i,1],]$ID, subject=subject[pos[i,2],]$ID)
#mask
weight_df[pos[i,1],pos[i,2]] <- NA
count <- count + 1
}
}
}else if(nrow(pos) > 1){
#if subject/query already used then mask and find another
if(subject$ID[pos[1,2]] %in% subject_ids){
weight_df[pos[1,1],pos[1,2]] <- NA
}else if(query$ID[pos[1,1]] %in% query_ids){
weight_df[pos[1,1],pos[1,2]] <- NA
}else{
subject_ids <- c(subject_ids, subject$ID[pos[1,1]])
query_ids <- c(query_ids, query$ID[pos[1,1]])
nearest_match[[count]] <- data.frame(query=query[pos[1,1],]$ID, subject=subject[pos[1,2],]$ID)
#mask
weight_df[pos[1,1],pos[1,2]] <- NA
count <- count + 1
}
}else{
#if subject/query already used then mask and find another
if(subject$ID[pos[2]] %in% subject_ids){
weight_df[pos[1],pos[2]] <- NA
}else if(query$ID[pos[1]] %in% query_ids){
weight_df[pos[1],pos[2]] <- NA
}else{
subject_ids <- c(subject_ids, subject$ID[pos[2]])
query_ids <- c(query_ids, query$ID[pos[1]])
nearest_match[[count]] <- data.frame(query=query[pos[1],]$ID, subject=subject[pos[2],]$ID)
#mask
weight_df[pos[1],pos[2]] <- NA
count <- count + 1
}
}
}
out <- plyr::ldply(nearest_match, rbind)
out <- merge(out, data.frame(subject=subject$ID,
mean_score_p_n=subject$mean_score_p,
mean_score_n_n= subject$mean_score_n), by="subject", all.x=TRUE)
out <- merge(out, data.frame(query=query$ID,
mean_score_p_p=query$mean_score_p,
mean_score_n_p= query$mean_score_n), by="query", all.x=TRUE)
return(out)
}
Edit: is this what the solution looks like for you?
ggplot() +
geom_point(data=B[out,], aes(x=mean_score_p, y= mean_score_n, color="red")) +
geom_point(data=A, aes(x=mean_score_p, y=mean_score_n, color="blue")) +
geom_abline(intercept = 0, slope = 1)
Let
query <- readRDS("query.dms")
subject <- readRDS("subject.dms")
kA <- nrow(subject)
kB <- nrow(query)
A <- as.matrix(subject[, 2:3])
B <- as.matrix(query[, 2:3])
where we want to find the closest "reverse" point (row) in B to each point in A.
Solution permitting non-unique results
Then, assuming that you are using the Euclidean distance,
D <- as.matrix(dist(rbind(A, B[, 2:1])))[(1 + kA):(kA + kB), 1:kA]
unname(apply(D, 2, which.min))
# [1] 268 183 350 284 21 360 132 287 100 298 58 56 170 70 47 305 353
# [18] 43 266 198 58 215 198 389 412 321 255 181 79 340 292 268 198 54
# [35] 390 38 376 47 19 94 244 18 168 201 160 194 114 247 287 273 182
# [52] 87 94 87 192 63 160 244 101 298 62
are the corresponding row numbers in B. The trick was to switch the coordinates of the points in B by using B[, 2:1].
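The coordinate swap can be checked on a toy example. The sketch below uses made-up points and computes only the B-by-A block of distances with outer(), rather than calling dist() on the stacked matrix; that is a substitution made for illustration, not the code used above.

```r
# Toy points (made up): each row is one point (x, y)
A <- rbind(c(1, 5), c(2, 8))
B <- rbind(c(8, 2), c(5, 1), c(0, 0))

B_swapped <- B[, 2:1]   # mirror B across the line y = x

# Euclidean distances: rows = points in B, columns = points in A
D <- sqrt(outer(B_swapped[, 1], A[, 1], "-")^2 +
          outer(B_swapped[, 2], A[, 2], "-")^2)

nearest <- apply(D, 2, which.min)   # row of B closest to each row of A
nearest
# [1] 2 1
```

So (1, 5) is matched with (5, 1), and (2, 8) with (8, 2), as intended.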
Solution with unique results
out <- vector("numeric", length = kA)
colnames(D) <- 1:ncol(D)
rownames(D) <- 1:nrow(D)
while(any(out == 0))
for(i in 1:nrow(D)) {
aux <- apply(D, 2, which.min)
if(i %in% aux) {
win <- which(aux == i)[which.min(D[i, aux == i])]
out[as.numeric(names(win))] <- as.numeric(rownames(D)[i])
D <- D[-i, -win, drop = FALSE]
}
}
out
# [1] 268 183 350 284 21 360 132 213 100 298 22 56 170 70 128 305 353
# [18] 43 266 198 58 215 294 389 412 321 255 181 79 340 292 20 347 54
# [35] 390 38 376 47 19 94 73 18 168 201 160 194 114 247 287 273 182
# [52] 87 365 158 192 63 211 244 101 68 62
whereas
all(table(out) == 1)
# [1] TRUE
confirms uniqueness. The solution is not the most efficient, but on your dataset it takes only a couple of seconds. It takes some time because it keeps going over all the available points in B checking if it is the closest one to any of the points in A. If so, the corresponding point in B is assigned to the closest one in A. Then both the point in A and the point in B are eliminated from the distance matrix. The loop goes until every point in A has some match in B.
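For comparison, a simpler greedy variant is sketched below. It is not the same procedure as the loop above: it repeatedly takes the globally smallest remaining distance and then blocks that row and column, so it may produce a different (but still unique) matching. The function name greedy_unique_match is made up, and D is assumed to be a finite distance matrix with rows for points in B and columns for points in A.

```r
# Hypothetical greedy matcher: a different technique from the loop above.
greedy_unique_match <- function(D) {
  out <- integer(ncol(D))                # out[j] = row of B matched to column j
  for (step in seq_len(min(dim(D)))) {
    pos <- arrayInd(which.min(D), dim(D))  # (row, col) of the smallest distance
    out[pos[2]] <- pos[1]
    D[pos[1], ] <- Inf                   # this point in B is now used
    D[, pos[2]] <- Inf                   # this point in A is now matched
  }
  out
}

# Toy distance matrix: 3 candidate points in B (rows), 2 points in A (columns)
D_toy <- rbind(c(1.0, 2.0),
               c(3.0, 0.5),
               c(0.2, 4.0))
res <- greedy_unique_match(D_toy)
res
# [1] 3 2
```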

How to combine the n arguments in c() [R]?

I have generated a random matrix d and then performed some matrix operations.
Finally, I need to store the result in a vector B. The code is below:
set.seed(42)
n <- 3
m <- 4
d <- matrix(sample(0:255, n*m, replace=T), nrow = n, ncol = m)
# some matrix operation
B <-c(d[1,], d[2,], d[3,])
> d
[,1] [,2] [,3] [,4]
[1,] 234 212 188 180
[2,] 239 164 34 117
[3,] 73 132 168 184
> B
[1] 234 212 188 180 239 164 34 117 73 132 168 184
>
Could someone please explain how to rewrite the last
line as a function, so that it combines the n rows into one vector?
I have tried
B <- sapply(1:n, FUN=function(i) B<-c(d[i,]))
Thanks!
This function should do it (overkill, since c(t(d)) as suggested by @joran works fine):
vectorizeByRow <- function(IN) {
OUT <- rep(NA_real_, length(IN))
nc <- ncol(IN)
nr <- nrow(IN)
a <- seq(1, length(IN), nc)
b <- a + nc - 1
for (n in 1:length(a)) {
OUT[a[n]:b[n]] <- IN[n,]
}
OUT
}
Use:
vectorizeByRow(d)
Produces:
[1] 234 212 188 180 239 164 34 117 73 132
[11] 168 184
This is from the HandyStuff package. Disclaimer: I am the author.
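The c(t(d)) shortcut mentioned above works because R stores matrices column by column: transposing first turns the rows of d into columns, and c() then reads them off in order. A minimal sketch, with the matrix from the question typed in by hand so it does not depend on the random number generator:

```r
# The 3 x 4 matrix from the question, entered column by column
d <- matrix(c(234, 239,  73,
              212, 164, 132,
              188,  34, 168,
              180, 117, 184), nrow = 3)

B <- c(t(d))   # rows of d, one after another
B
#  [1] 234 212 188 180 239 164  34 117  73 132 168 184

# Same result as spelling the rows out by hand
identical(B, c(d[1, ], d[2, ], d[3, ]))
# [1] TRUE
```

This works for any number of rows, so n never has to appear in the code.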

R Conditional summing

I've just started my adventure with programming in R. I need to create a program summing numbers divisible by 3 and 5 in the range of 1 to 1000, using the '%%' operator. I came up with an idea to create two matrices with the numbers from 1 to 1000 in one column and their remainders in the second one. However, I don't know how to sum the proper elements (kind of "sum if" function in Excel). I attach all I've done below. Thanks in advance for your help!
s1<-1:1000
in1<-s1%%3
m1<-matrix(c(s1,in1), 1000, 2, byrow=FALSE)
s2<-1:1000
in2<-s2%%5
m2<-matrix(c(s2,in2),1000,2,byrow=FALSE)
Mathematically, the best way is probably to find the least common multiple of the two numbers and check the remainder vs that:
# borrowed from Roland Rau
# http://r.789695.n4.nabble.com/Greatest-common-divisor-of-two-numbers-td823047.html
gcd <- function(a,b) if (b==0) a else gcd(b, a %% b)
lcm <- function(a,b) abs(a*b)/gcd(a,b)
s <- seq(1000)
s[ (s %% lcm(3,5)) == 0 ]
# [1] 15 30 45 60 75 90 105 120 135 150 165 180 195 210
# [15] 225 240 255 270 285 300 315 330 345 360 375 390 405 420
# [29] 435 450 465 480 495 510 525 540 555 570 585 600 615 630
# [43] 645 660 675 690 705 720 735 750 765 780 795 810 825 840
# [57] 855 870 885 900 915 930 945 960 975 990
Since your s is every number from 1 to 1000, you could instead do
seq(lcm(3,5), 1000, by=lcm(3,5))
Just use sum on either result if that's what you want to do.
Props to @HoneyDippedBadger for figuring out what the OP was after.
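The sum itself can be double-checked without building the sequence at all: the multiples of 15 up to 1000 are 15 * (1, 2, ..., 66), so their sum is 15 * 66 * 67 / 2. A small sketch comparing the closed form with the direct sum (the variable names are made up for illustration):

```r
k <- 15                 # least common multiple of 3 and 5
n <- 1000
m <- n %/% k            # number of multiples of k up to n (66)

total_closed <- k * m * (m + 1) / 2   # k * (1 + 2 + ... + m)
total_direct <- sum(seq(k, n, by = k))

total_closed
# [1] 33165
total_direct
# [1] 33165
```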
See if this helps
x = 1:1000 ## store the numbers 1 to 1000 in x
x ## print x
Div = x[x%%3==0 & x%%5==0] ## keep the numbers between 1 and 1000 divisible by both 3 and 5
Div ## print the numbers stored in Div
length(Div) ## how many such numbers there are
table(x%%3==0 & x%%5==0) ## counts of TRUE and FALSE for the condition
sum(Div) ## sum of the numbers divisible by both 3 and 5
