I have the following vectors:
> X <- c(1,1,3,4)
> a <- c(1,1,2,2)
> b <- c(2,1,4,3)
> c <- c(2,1,4,6)
I want to compare each element of X with the corresponding elements of a, b and c, and assign a class to each row of X. For example:
The first element of X is 1 and it has a match in the corresponding element of vector a, so I need to assign the class '1-1' (no matter which vector the match came from).
The second element of X is 1 and it also has a match (three of them, in fact), so again the class is '1-1'.
The third element of X is 3 and it doesn't have a match, so I should look for the next integer value, which is 4, and there is a 4 (in b and c). So the class should be '3-4'.
The fourth element of X is 4 and it doesn't have a match. There is also no 5 (the next integer), so it should look for the previous integer, which is 3, and there is a 3. So the class should be '4-3'.
I actually have thousands of rows for each vector and have to do this for every row. Any suggestions for doing this in a less complicated way? I would prefer to use base R functions.
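To pin down the expected output, here is a deliberately naive row-by-row sketch (not from the original post) of the matching rule as the answers below interpret it: pick the nearest candidate value, breaking ties in favour of the value above X. The answers below compute the same thing in a vectorized, much faster way.
classify_row <- function(x, candidates) {
  d <- abs(candidates - x)
  d[candidates > x] <- d[candidates > x] - 0.5   # tie-break toward higher values
  paste(x, candidates[which.min(d)], sep = "-")
}
sapply(seq_along(X), function(i) classify_row(X[i], c(a[i], b[i], c[i])))
## [1] "1-1" "1-1" "3-4" "4-3"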
Based on rbatt's comment and answer I realized my original answer was quite lacking. Here's a redo...
match_nearest <- function( x, table )
{
  # Signed distances; x is recycled down the columns of the candidate matrix.
  dist <- x - table
  # Negative distances mean the candidate lies above x; shrinking their magnitude
  # by 0.5 makes a value just above x win over a value equally far below it.
  tgt <- which( dist < 0, arr.ind=TRUE, useNames=FALSE )
  dist[tgt] <- abs( dist[tgt] + .5 )
  # For each row, pick the candidate with the smallest adjusted distance.
  table[ cbind( seq_along(x), max.col( -dist, ties.method="first" ) ) ]
}
X <- c(1,1,3,4)
a <- c(1,1,2,2)
b <- c(2,1,4,3)
c <- c(2,1,4,6)
paste(X, match_nearest(X, cbind(a,b,c) ), sep="-")
## [1] "1-1" "1-1" "3-4" "4-3"
Comparing against the original answer and rbatt's, we find that neither was correct!
set.seed(1)
X <- rbinom(n=1E4, size=10, prob=0.5)
a <- rbinom(n=1E4, size=10, prob=0.5)
b <- rbinom(n=1E4, size=10, prob=0.5)
c <- rbinom(n=1E4, size=10, prob=0.5)
T <- current_solution(X,a,b,c)
R <- rbatt_solution(X,a,b,c)
all.equal( T, R )
## [1] "195 string mismatches"
# Look at mismatched rows...
mismatch <- head( which( T != R ) )
cbind(X,a,b,c)[mismatch,]
## X a b c
## [1,] 4 6 3 3
## [2,] 5 7 4 7
## [3,] 5 8 3 9
## [4,] 5 7 7 4
## [5,] 4 6 3 7
## [6,] 5 7 4 2
T[mismatch]
## [1] "4-3" "5-4" "5-3" "5-4" "4-3" "5-4"
R[mismatch]
## [1] "4-6" "5-7" "5-8" "5-7" "4-6" "5-7"
and needlessly slow...
library(microbenchmark)
bm <- microbenchmark( current_solution(X,a,b,c),
previous_solution(X,a,b,c),
rbatt_solution(X,a,b,c) )
print(bm, order="median")
## Unit: milliseconds
## expr min lq median uq max neval
## current_solution(X, a, b, c) 7.088 7.298 7.996 8.268 38.25 100
## rbatt_solution(X, a, b, c) 33.920 38.236 46.524 53.441 85.50 100
## previous_solution(X, a, b, c) 83.082 93.869 101.997 115.961 135.98 100
Looks like current_solution is getting it right; but without an expected output from the OP, we can't be completely sure...
Here's the functions...
current_solution <- function(X,a,b,c) {
paste(X, match_nearest(X, cbind(a,b,c) ), sep="-")
}
# DO NOT USE... it is wrong!
previous_solution <- function(X,a,b,c) {
dat <- rbind(X,a,b,c)
v <- apply(dat,2, function(v) {
v2 <- v[1] - v
v2[v2<0] <- abs( v2[v2<0]) - 1
v[ which.min( v2[-1] ) + 1 ]
})
paste("X", v, sep="-")
}
# DO NOT USE... it is wrong!
rbatt_solution <- function(X,a,b,c) {
mat <- cbind(X,a,b,c)
diff.signed <- mat[,"X"]-mat[,c("a","b","c")]
diff.break <- abs(diff.signed) + sign(diff.signed)*0.5
min.ind <- apply(diff.break, 1, which.min)
ind.array <- matrix(c(1:nrow(mat),min.ind), ncol=2)
match.value <- mat[,c("a","b","c")][ind.array]
ref.class <- paste(X, match.value, sep="-")
ref.class
}
This solution should provide the output you want. Also, it is ~ 3x faster than Thell's solution, because the differences are vectorized and are not calculated row-wise with apply.
I compare times for the two approaches below. Note that if you want the "class" as another column in a data.frame, just uncomment the last line of my function. I commented it out to make the calculation times between the two answers more comparable (creating a data.frame is quite slow).
# Example data from Thell, plus 1 more
X1 <- c(1,1,3,4,7,1, 5)
a1 <- c(1,1,2,2,2,2, 9)
b1 <- c(2,1,4,3,3,3, 3)
c1 <- c(2,1,4,6,6,6, 7)
# Random example data, much larger
# X1 <- rbinom(n=1E4, size=10, prob=0.5)
# a1 <- rbinom(n=1E4, size=10, prob=0.5)
# b1 <- rbinom(n=1E4, size=10, prob=0.5)
# c1 <- rbinom(n=1E4, size=10, prob=0.5)
My answer:
rbTest <- function(){
mat <- cbind(X1,a1,b1,c1)
diff.signed <- mat[,"X1"]-mat[,c("a1","b1","c1")] # differences (with sign)
diff.break <- abs(diff.signed) + sign(diff.signed)*0.5 # penalize for differences that are negative by adding 0.5 to them (break ties by preferring higher integer)
min.ind <- apply(diff.break, 1, which.min) # index of smallest difference (prefer larger integers when there is a tie)
ind.array <- matrix(c(1:nrow(mat),min.ind), ncol=2) # array index format
match.value <- mat[,c("a1","b1","c1")][ind.array] # value of the smallest difference (value of the match)
ref.class <- paste(X1, match.value, sep="-") # the 'class' in the format 'ref-match'
ref.class
# data.frame(class=ref.class, mat)
}
Thell's answer:
thTest <- function(){
dat <- rbind(X1,a1,b1,c1)
apply(dat,2, function(v) {
# Get distance
v2 <- v[1] - v
# Prefer values >= v[1]
v2[v2<0] <- abs( v2[v2<0]) - 1
# Obtain and return nearest v excluding v[1]
v[ which.min( v2[-1] ) + 1 ]
})
}
Benchmark on large matrix (10,000 rows)
# > microbenchmark(rbTest(), thTest())
# Unit: milliseconds
# expr min lq median uq max neval
# rbTest() 47.95451 52.01729 59.36161 71.94076 103.1314 100
# thTest() 167.49798 180.69627 195.02828 204.19916 315.0610 100
Benchmark on small matrix (7 rows)
# > microbenchmark(rbTest(), thTest())
# Unit: microseconds
# expr min lq median uq max neval
# rbTest() 108.299 112.3550 115.4225 119.4630 146.722 100
# thTest() 147.727 152.2015 155.9005 159.3115 235.898 100
Example output (small matrix):
# > rbTest()
# [1] "1-1" "1-1" "3-4" "4-3" "7-6" "1-2" "5-7" "6-1"
# > thTest()
# [1] 1 1 4 3 6 2 7
I have a matrix with exactly 2 rows and n columns, for example:
c(0,0,0,0,1,0,2,0,1,0,1,1,1,0,2)->a1
c(0,2,0,0,0,0,2,1,1,0,0,0,0,2,0)->a2
rbind(a1,a2)->matr
For a specific column (in this example column 9, which has 1 in both rows) I need to find, to the left and to the right, the first column containing 2/0 or 0/2; in this example that is column 2 on the left and column 14 on the right.
The elements of every row can only be 0, 1 or 2, nothing else. Is there a way to do that operation fast on large matrices (with 2 rows)? I need to do it 600k times, so speed is a consideration.
library(compiler)
myfun <- cmpfun(function(m, cl) {
li <- ri <- cl
nc <- ncol(m)
repeat {
li <- li - 1
if(li == 0 || ((m[1, li] != 1) && (m[1, li] + m[2, li] == 2))) {
l <- li
break
}
}
repeat {
ri <- ri + 1
if(ri == nc || ((m[1, ri] != 1) && (m[1, ri] + m[2, ri] == 2))) {
r <- ri
break
}
}
c(l, r)
})
and, after taking into account @Martin Morgan's observations,
set.seed(1)
N <- 1000000
test <- rbind(sample(0:2, N, replace = TRUE),
sample(0:2, N, replace = TRUE))
library(microbenchmark)
microbenchmark(myfun(test, N / 2), fun(test, N / 2), foo(test, N / 2),
AWebb(test, N / 2), RHertel(test, N / 2))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# myfun(test, N/2) 4.658 20.033 2.237153e+01 22.536 26.022 85.567 100 a
# fun(test, N/2) 36685.750 47842.185 9.762663e+04 65571.546 120321.921 365958.316 100 b
# foo(test, N/2) 2622845.039 3009735.216 3.244457e+06 3185893.218 3369894.754 5170015.109 100 d
# AWebb(test, N/2) 121504.084 142926.590 1.990204e+05 193864.670 209918.770 489765.471 100 c
# RHertel(test, N/2) 65998.733 76805.465 1.187384e+05 86089.980 144793.416 385880.056 100 b
set.seed(123)
test <- rbind(sample(0:2, N, replace = TRUE, prob = c(5, 90, 5)),
sample(0:2, N, replace = TRUE, prob = c(5, 90, 5)))
microbenchmark(myfun(test, N / 2), fun(test, N / 2), foo(test, N / 2),
AWebb(test, N / 2), RHertel(test, N / 2))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# myfun(test, N/2) 81.805 103.732 121.9619 106.459 122.36 307.736 100 a
# fun(test, N/2) 26362.845 34553.968 83582.9801 42325.755 106303.84 403212.369 100 b
# foo(test, N/2) 2598806.742 2952221.561 3244907.3385 3188498.072 3505774.31 4382981.304 100 d
# AWebb(test, N/2) 109446.866 125243.095 199204.1013 176207.024 242577.02 653299.857 100 c
# RHertel(test, N/2) 56045.309 67566.762 125066.9207 79042.886 143996.71 632227.710 100 b
I was slower than @Laterow, but anyhow, this is a similar approach:
foo <- function(mtr, targetcol) {
matr1 <- colSums(mtr)
matr2 <- apply(mtr, 2, function(x) x[1]*x[2])
cols <- which(matr1 == 2 & matr2 == 0) - targetcol
left <- cols[cols < 0]
right <- cols[cols > 0]
c(ifelse(length(left) == 0, NA, targetcol + max(left)),
ifelse(length(right) == 0, NA, targetcol + min(right)))
}
foo(matr,9) #2 14
Combine the information by squaring the two rows and adding them: only a 0/2 or 2/0 column gives a sum of 4 (the other possible pairs give 0, 1, 2, 5 or 8). Then simply find the last such column before column 9 (rev(which())[1]) and the first such column after it (which()[1]).
fun <- function(matr, col){
valid <- which((matr[1,]^2 + matr[2,]^2) == 4)
if (length(valid) == 0) return(c(NA,NA))
left <- valid[rev(which(valid < col))[1]]
right <- valid[which(valid > col)[1]]
c(left,right)
}
fun(matr,9)
# [1] 2 14
fun(matr,1)
# [1] NA 2
fun(matrix(0,nrow=2,ncol=100),9)
# [1] NA NA
Benchmark
set.seed(1)
test <- rbind(sample(0:2,1000000,replace=T),
sample(0:2,1000000,replace=T))
microbenchmark::microbenchmark(fun(test,9))
# Unit: milliseconds
# expr min lq mean median uq max neval
# fun(test, 9) 22.7297 27.21038 30.91314 27.55106 28.08437 51.92393 100
Edit: Thanks to @MatthewLundberg for pointing out a lot of mistakes.
If you are doing this many times, precompute all the locations
loc <- which((a1==2 & a2==0) | (a1==0 & a2==2))
You can then find the first to the left and right with findInterval
i <- findInterval(9, loc); loc[c(i, i + 1)]
# [1] 2 14
Note that findInterval is vectorized should you care to specify multiple target columns.
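For example, a small sketch (reusing loc from above; the target columns are made up) of looking up several targets at once:
targets <- c(5, 9, 12)
i <- findInterval(targets, loc)
cbind(target = targets, left = loc[i], right = loc[i + 1])
#      target left right
# [1,]      5    2    14
# [2,]      9    2    14
# [3,]     12    2    14
(Edge cases, such as a target before the first location in loc, would still need extra handling.)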
That is an interesting question. Here's how I would address it.
First a vector is defined which contains the product of each column:
a3 <- matr[1,]*matr[2,]
Then we can find the columns with pairs of (0/2) or (2/0) rather easily, since we know that the matrix can only contain the values 0, 1, and 2:
the02s <- which(colSums(matr)==2 & a3==0)
Next we want to find the pairs of (0/2) or (2/0) that are closest to a given column number, on the left and on the right of that column. The column number could be 9, for instance:
thecol <- 9
Now we have basically all we need to find the index (the column number in the matrix) of a combination of (0/2) or (2/0) that is closest to the column thecol. We just need to use the output of findInterval():
pos <- findInterval(thecol,the02s)
pos <- c(pos, pos+1)
pos[pos==0] <- NA # output NA if no column was found on the left
And the result is:
the02s[pos]
# 2 14
So the indices of the closest columns on either side of thecol fulfilling the required condition would be 2 and 14 in this case, and we can confirm that these column numbers both contain one of the relevant combinations:
matr[,14]
#a1 a2
# 0 2
matr[,2]
#a1 a2
# 0 2
Edit: I changed the answer such that NA is returned in the case where no column exists on the left and/or on the right of thecol in the matrix that fulfills the required condition.
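For example, with thecol <- 1 there is no qualifying column on the left, and the NA substitution kicks in:
pos <- findInterval(1, the02s)   # 0, nothing to the left of column 1
pos <- c(pos, pos + 1)
pos[pos == 0] <- NA
the02s[pos]
# [1] NA  2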
Is there a way to speed up the combn command to get all unique combinations of 2 elements taken from a vector?
Usually this would be set up like this:
# Get latest version of data.table
library(devtools)
install_github("Rdatatable/data.table", build_vignettes = FALSE)
library(data.table)
# Toy data
d <- data.table(id=as.character(paste0("A", 10001:15000)))
# Transform data
system.time({
d.1 <- as.data.table(t(combn(d$id, 2)))
})
However, combn is 10 times slower (23sec versus 3 sec on my computer) than calculating all possible combinations using data.table.
system.time({
d.2 <- d[, list(neighbor=d$id[-which(d$id==id)]), by=c("id")]
})
Dealing with very large vectors, I am searching for a way to save memory by only calculating the unique combinations (like combn), but with the speed of data.table (see second code snippet).
I appreciate any help.
Here's a way using the data.table function foverlaps() that also turns out to be fast!
require(data.table) ## 1.9.4+
d[, `:=`(id1 = 1L, id2 = .I)] ## add interval columns for overlaps
setkey(d, id1, id2)
system.time(olaps <- foverlaps(d, d, type="within", which=TRUE)[xid != yid])
# 0.603 0.062 0.717
Note that foverlaps() does not calculate all permutations. The subset xid != yid is needed to remove self-overlaps. This subset could be handled more efficiently internally by implementing an ignoreSelf argument, similar to IRanges::findOverlaps.
Now it's just a matter of performing a subset using the ids obtained:
system.time(ans <- setDT(list(d$id[olaps$xid], d$id[olaps$yid])))
# 0.576 0.047 0.662
So totally, ~1.4 seconds.
The advantage is that you can do it the same way even if your data.table d has more than one column for which you have to get the combinations, and it uses the same amount of memory (since we return the indices). In that case, you'd just do:
cbind(d[olaps$xid, ..your_cols], d[olaps$yid, ..your_cols])
But it's limited to replacing just combn(., 2L). Not more than 2L.
You could use combnPrim from gRbase
source("http://bioconductor.org/biocLite.R")
biocLite("gRbase") # will install dependent packages automatically.
system.time({
d.1 <- as.data.table(t(combn(d$id, 2)))
})
# user system elapsed
# 27.322 0.585 27.674
system.time({
d.2 <- as.data.table(t(combnPrim(d$id,2)))
})
# user system elapsed
# 2.317 0.110 2.425
identical(d.1[order(V1, V2),], d.2[order(V1,V2),])
#[1] TRUE
A post with any variation of the word Fast in the title is incomplete without benchmarks. Before we post any benchmarks, I would just like to mention that since this question was posted, two highly optimized packages for generating combinations, arrangements and RcppAlgos (I am the author), have been released for R. Note that since version 2.3.0 of RcppAlgos we can take advantage of multiple threads for even greater efficiency.
To give you an idea of their speed over combn and gRbase::combnPrim, here is a basic benchmark:
## We test generating just over 3 million combinations
choose(25, 10)
[1] 3268760
microbenchmark(arrngmnt = arrangements::combinations(25, 10),
combn = combn(25, 10),
gRBase = gRbase::combnPrim(25, 10),
serAlgos = RcppAlgos::comboGeneral(25, 10),
parAlgos = RcppAlgos::comboGeneral(25, 10, nThreads = 4),
unit = "relative", times = 20)
Unit: relative
expr min lq mean median uq max neval
arrngmnt 2.979378 3.072319 1.898390 3.756307 2.139258 0.4842967 20
combn 226.470755 230.410716 118.157110 232.905393 125.718512 17.7778585 20
gRBase 34.219914 34.209820 18.789954 34.218320 19.934485 3.6455493 20
serAlgos 2.836651 3.078791 2.458645 3.703929 2.231475 1.1652445 20
parAlgos 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000 20
Now, we benchmark the other functions posted for the very specific case of producing 'choose 2' combinations and returning a data.table object.
The functions are as follows:
funAkraf <- function(d) {
a <- comb2.int(length(d$id)) ## comb2.int from the answer given by @akraf
setDT(list(V1 = d$id[a[,1]], V2 = d$id[a[,2]]))
}
funAnirban <- function(d) {
indices <- combi2inds(d$id)
ans2 <- setDT(list(d$id[indices$xid], d$id[indices$yid]))
ans2
}
funArun <- function(d) {
d[, `:=`(id1 = 1L, id2 = .I)] ## add interval columns for overlaps
setkey(d, id1, id2)
olaps <- foverlaps(d, d, type="within", which=TRUE)[xid != yid]
ans <- setDT(list(d$id[olaps$xid], d$id[olaps$yid]))
ans
}
funArrangements <- function(d) {
a <- arrangements::combinations(x = d$id, k = 2)
setDT(list(a[, 1], a[, 2]))
}
funGRbase <- function(d) {
a <- gRbase::combnPrim(d$id,2)
setDT(list(a[1, ], a[2, ]))
}
funOPCombn <- function(d) {
a <- combn(d$id, 2)
setDT(list(a[1, ], a[2, ]))
}
funRcppAlgos <- function(d) {
a <- RcppAlgos::comboGeneral(d$id, 2, nThreads = 4)
setDT(list(a[, 1], a[, 2]))
}
Benchmark with OP Data
And here are the benchmarks on the example given by the OP:
d <- data.table(id=as.character(paste0("A", 10001:15000)))
microbenchmark(funAkraf(d),
funAnirban(d),
funArrangements(d),
funArun(d),
funGRbase(d),
funOPCombn(d),
funRcppAlgos(d),
times = 10, unit = "relative")
Unit: relative
expr min lq mean median uq max neval
funAkraf(d) 3.220550 2.971264 2.815023 2.665616 2.344018 3.383673 10
funAnirban(d) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10
funArrangements(d) 1.464730 1.689231 1.834650 1.960233 1.932361 1.693305 10
funArun(d) 3.256889 2.908075 2.634831 2.729180 2.432277 2.193849 10
funGRbase(d) 3.513847 3.340637 3.327845 3.196399 3.291480 3.129362 10
funOPCombn(d) 30.310469 26.255374 21.656376 22.386270 18.527904 15.626261 10
funRcppAlgos(d) 1.676808 1.956696 1.943773 2.085968 1.949133 1.804180 10
We see that the function provided by @AnirbanMukherjee is the fastest for this task, followed by RcppAlgos/arrangements. For this task, nThreads has no effect, as the vector passed is a character vector, which is not thread safe. What if we instead converted id to a factor?
Benchmarks with Factors (i.e. Categorical Variables)
dFac <- d
dFac$id <- as.factor(dFac$id)
library(microbenchmark)
microbenchmark(funAkraf(dFac),
funAnirban(dFac),
funArrangements(dFac),
funArun(dFac),
funGRbase(dFac),
funOPCombn(dFac),
funRcppAlgos(dFac),
times = 10, unit = "relative")
Unit: relative
expr min lq mean median uq max neval
funAkraf(dFac) 10.898202 10.949896 7.589814 10.01369 8.050005 5.557014 10
funAnirban(dFac) 3.104212 3.337344 2.317024 3.00254 2.471887 1.530978 10
funArrangements(dFac) 2.054116 2.058768 1.858268 1.94507 2.797956 1.691875 10
funArun(dFac) 10.646680 12.905119 7.703085 11.50311 8.410893 3.802155 10
funGRbase(dFac) 16.523356 21.609917 12.991400 19.73776 13.599870 6.498135 10
funOPCombn(dFac) 108.301876 108.753085 64.338478 95.56197 65.494335 28.183104 10
funRcppAlgos(dFac) 1.000000 1.000000 1.000000 1.00000 1.000000 1.000000 10
Now, we see that RcppAlgos is around 2x faster than any other solution. In particular, the RcppAlgos solution is about 3x faster than the formerly fastest solution given by Anirban. It should be noted that this increase in efficiency was possible because factor variables are really integers under the hood with some additional attributes.
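A quick illustration of that last point (toy values, purely for demonstration): a factor is stored as integer codes plus a levels attribute, which is far cheaper to compare and move around than character strings.
f <- factor(c("A10001", "A10002", "A10001"))
typeof(f)      # "integer"
unclass(f)     # underlying codes 1 2 1, plus the "levels" attribute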
Confirm Equality
They all give the same result as well. The only caveat is that the gRbase solution doesn't support factors. That is, if you pass a factor, it will be converted to character. Thus all of the solutions will give the same result if you were to pass dFac except for the gRbase solution:
identical(funAkraf(d), funOPCombn(d))
#[1] TRUE
identical(funAkraf(d), funArrangements(d))
#[1] TRUE
identical(funRcppAlgos(d), funArrangements(d))
#[1] TRUE
identical(funRcppAlgos(d), funAnirban(d))
#[1] TRUE
identical(funRcppAlgos(d), funArun(d))
#[1] TRUE
## different order... we must sort
identical(funRcppAlgos(d), funGRbase(d))
[1] FALSE
d1 <- funGRbase(d)
d2 <- funRcppAlgos(d)
## now it's the same
identical(d1[order(V1, V2),], d2[order(V1,V2),])
#[1] TRUE
Thanks to @Frank for pointing out how to compare two data.tables without going through the pains of creating new data.tables and then arranging them:
fsetequal(funRcppAlgos(d), funGRbase(d))
[1] TRUE
Here is a solution using Rcpp.
library(Rcpp)
library(data.table)
cppFunction('
Rcpp::DataFrame combi2(Rcpp::CharacterVector inputVector){
int len = inputVector.size();
int retLen = len * (len-1) / 2;
Rcpp::CharacterVector outputVector1(retLen);
Rcpp::CharacterVector outputVector2(retLen);
int start = 0;
for (int i = 0; i < len; ++i){
for (int j = i+1; j < len; ++j){
outputVector1(start) = inputVector(i);
outputVector2(start) = inputVector(j);
++start;
}
}
return(Rcpp::DataFrame::create(Rcpp::Named("id") = outputVector1,
Rcpp::Named("neighbor") = outputVector2));
};
')
# Toy data
d <- data.table(id=as.character(paste0("A", 10001:15000)))
system.time({
d.2 <- d[, list(neighbor=d$id[-which(d$id==id)]), by=c("id")]
})
# 1.908 0.397 2.389
system.time({
d[, `:=`(id1 = 1L, id2 = .I)] ## add interval columns for overlaps
setkey(d, id1, id2)
olaps <- foverlaps(d, d, type="within", which=TRUE)[xid != yid]
ans <- setDT(list(d$id[olaps$xid], d$id[olaps$yid]))
})
# 0.653 0.038 0.705
system.time(ans2 <- combi2(d$id))
# 1.377 0.108 1.495
Using the Rcpp function to get the indices and then forming the data.table works better.
cppFunction('
Rcpp::DataFrame combi2inds(const Rcpp::CharacterVector inputVector){
const int len = inputVector.size();
const int retLen = len * (len-1) / 2;
Rcpp::IntegerVector outputVector1(retLen);
Rcpp::IntegerVector outputVector2(retLen);
int indexSkip;
for (int i = 0; i < len; ++i){
indexSkip = len * i - ((i+1) * i)/2;
for (int j = 0; j < len-1-i; ++j){
outputVector1(indexSkip+j) = i+1;
outputVector2(indexSkip+j) = i+j+1+1;
}
}
return(Rcpp::DataFrame::create(Rcpp::Named("xid") = outputVector1,
Rcpp::Named("yid") = outputVector2));
};
')
system.time({
indices <- combi2inds(d$id)
ans2 <- setDT(list(d$id[indices$xid], d$id[indices$yid]))
})
# 0.389 0.027 0.425
Here are two base-R solutions if you don't want to use additional dependencies:
comb2.int uses rep and other sequence generating functions to generate the desired output.
comb2.mat creates a matrix, uses upper.tri() to get the upper triangle and which(..., arr.ind = TRUE) to obtain the column and row indices => all combinations.
Possibility 1: comb2.int
comb2.int <- function(n, rep = FALSE){
if(!rep){
# e.g. n=3 => (1,2), (1,3), (2,3)
x <- rep(1:n,(n:1)-1)
i <- seq_along(x)+1
o <- c(0,cumsum((n-2):1))
y <- i-o[x]
}else{
# e.g. n=3 => (1,1), (1,2), (1,3), (2,2), (2,3), (3,3)
x <- rep(1:n,n:1)
i <- seq_along(x)
o <- c(0,cumsum(n:2))
y <- i-o[x]+x-1
}
return(cbind(x,y))
}
Possibility 2: comb2.mat
comb2.mat <- function(n, rep = FALSE){
# Use which(..., arr.ind = TRUE) to get coordinates.
m <- matrix(FALSE, nrow = n, ncol = n)
idxs <- which(upper.tri(m, diag = rep), arr.ind = TRUE)
return(idxs)
}
The functions give the same result as combn(.):
for(i in 2:8){
# --- comb2.int ------------------
stopifnot(comb2.int(i) == t(combn(i,2)))
# => Equal
# --- comb2.mat ------------------
m <- comb2.mat(i)
colnames(m) <- NULL # difference 1: colnames
m <- m[order(m[,1]),] # difference 2: output order
stopifnot(m == t(combn(i,2)))
# => Equal up to above differences
}
But I have elements other than sequential integers in my vector!
Use the return values as indices:
v <- LETTERS[1:5]
c <- comb2.int(length(v))
cbind(v[c[,1]], v[c[,2]])
#> [,1] [,2]
#> [1,] "A" "B"
#> [2,] "A" "C"
#> [3,] "A" "D"
#> [4,] "A" "E"
#> [5,] "B" "C"
#> [6,] "B" "D"
#> [7,] "B" "E"
#> [8,] "C" "D"
#> [9,] "C" "E"
#> [10,] "D" "E"
Benchmark:
time(combn) = ~15x time(comb2.mat) = ~70x time(comb2.int), based on the medians below:
library(microbenchmark)
n <- 800
microbenchmark({
comb2.int(n)
},{
comb2.mat(n)
},{
t(combn(n, 2))
})
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> { comb2.int(n) } 4.394051 4.731737 6.350406 5.334463 7.22677 14.68808 100
#> { comb2.mat(n) } 20.131455 22.901534 31.648521 24.411782 26.95821 297.70684 100
#> { t(combn(n, 2)) } 363.687284 374.826268 391.038755 380.012274 389.59960 532.30305 100
I am looking for an efficient solution for the following problem:
b <- matrix(c(0,0,0,1,1,0), nrow = 2, byrow = T)
weight <- c(1,1)
times <- 5
abc <- do.call(rbind, replicate(times, b, simplify=FALSE))
weight <- rep.int(weight,times)
sum1 <- as.numeric(rep.int(NA,nrow(abc)))
##Rprof()
for(j in 1:nrow(abc)){
a <- abc[j,]
sum1[j] <- sum(weight[rowSums(t(a == t(abc)) + 0) == ncol(abc)])
}
##Rprof(NULL)
##summaryRprof()
Is there a faster way to do this? Rprof shows that rowSums(), t(), == and + are quite slow. If the matrix has 20,000 rows it takes about 21 seconds.
Thanks for helping!
Edit: I have a matrix abc and a vector weight whose length equals nrow(abc). The first value of weight corresponds to the first row of abc, and so on. Now I would like to determine which rows of abc are equal, remember the positions of those rows, and sum up the weights belonging to those positions. That sum should then be stored for each row.
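To spell out the intended result on the sample data above, here is a minimal sketch (not from the original post) that groups rows by their contents and gives every row the total weight of its group:
key  <- apply(abc, 1, paste, collapse = ";")   # identical rows share a key
sum1 <- ave(weight, key, FUN = sum)            # per-row sum of the group's weights
sum1
## [1] 5 5 5 5 5 5 5 5 5 5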
Here is a way that looks valid and fast:
ff <- function(mat, weights)
{
  # Build a character key per row, so identical rows get identical keys.
  rs <- apply(mat, 1, paste, collapse = ";")
  # Sum the weights for each unique key, then map those sums back onto every row.
  unlist(lapply(unique(rs),
                function(x)
                  sum(weights[match(rs, x, 0) > 0])))[match(rs, unique(rs))]
}
ff(abc, weight)
# [1] 5 5 5 5 5 5 5 5 5 5
And comparing with your function:
ffOP <- function(mat, weights)
{
sum1 <- as.numeric(rep.int(NA,nrow(mat)))
for(j in 1:nrow(mat)) {
a <- mat[j,]
sum1[j] <- sum(weights[rowSums(t(a == t(mat)) + 0) == ncol(mat)])
}
sum1
}
ffOP(abc, weight)
# [1] 5 5 5 5 5 5 5 5 5 5
library(microbenchmark)
m = do.call(rbind, replicate(1e3, matrix(0:11, 3, 4), simplify = F))
set.seed(101); w = runif(1e3*3)
all.equal(ffOP(m, w), ff(m, w))
#[1] TRUE
microbenchmark(ffOP(m, w), ff(m, w), times = 10)
#Unit: milliseconds
# expr min lq median uq max neval
# ffOP(m, w) 969.83968 986.47941 996.68563 1015.53552 1051.23847 10
# ff(m, w) 20.42426 20.64002 21.36508 21.97182 22.59127 10
For the record, I also implemented your approach in C; here are the benchmarks:
#> microbenchmark(ffOP(m, w), ff(m, w), ffC(m, w), times = 10)
#Unit: milliseconds
# expr min lq median uq max neval
# ffOP(m, w) 957.66691 967.09429 991.35232 1000.53070 1016.74100 10
# ff(m, w) 20.60243 20.85578 21.70578 22.13434 23.04924 10
# ffC(m, w) 36.24618 36.40940 37.18927 37.39877 38.83358 10
I would like to convert an array with multiple dimensions (e.g. x, y, z; see 'arr' below) into a data.frame, but keep one dimension in the columns (e.g. z, see 'df2' below).
Currently, I use melt and dcast functions in reshape2 package.
set.seed(1111)
num <- 1000
dim_names <- list(x = seq(num), y = seq(num),
z = paste0('Z', 1:5))
dim_arr <- as.numeric(lapply(dim_names, length))
arr <- array(runif(prod(dim_arr)), dim = dim_arr)
dimnames(arr) <- dim_names
library(reshape2)
df <- melt(arr)
head(df)
system.time(df2 <- dcast(df, x + y ~ z, value.var = 'value'))
head(df2)
x y Z1 Z2 Z3 Z4 Z5
1 1 1 0.4655026 0.8027921 0.1950717 0.0403759 0.04669389
2 1 2 0.5156263 0.5427343 0.5799924 0.1911539 0.26069063
3 1 3 0.2788747 0.9394142 0.9081274 0.7712205 0.68748300
4 1 4 0.2827058 0.8001632 0.6995503 0.9913805 0.25421346
5 1 5 0.7054767 0.8013255 0.2511769 0.6556174 0.07780849
6 1 6 0.5576141 0.6452644 0.3362980 0.7353494 0.93147223
However, it took about 10 s to convert 5 M values
user system elapsed
8.13 1.11 9.39
Are there more efficient methods? Thanks for any suggestions.
Here's a slightly more generalized solution for a 4-dimensional array using a combination of aperm(...) and matrix(...). I'm not wizard enough to generalize this further.
nx <- 2 ; ny <- 3 ; nz <- 4 ; nw <- 5
original <- array(rnorm(nx*ny*nz*nw), dim=c(nx,ny,nz,nw),
dimnames=list(x=sprintf('x%s', 1:nx), y=sprintf('y%s', 1:ny),
z=sprintf('z%s', 1:nz), w=sprintf('w%s', 1:nw)))
This is your existing method, which uses melt(...) and dcast(...) to keep all but the last dimension as rows and spread the last dimension into columns:
f.dcast <- function(a) dcast(melt(a), x + y + z ~ w)
The following does the same thing by using aperm(...) to write out the data as a vector in a particular order so that it winds up as a properly formatted matrix, then cbinds with the variable names:
f.aperm <- function(a) {
d <- dim(a)
data <- matrix(as.vector(aperm(a, c(4,3,2,1))), ncol=d[4], byrow=T)
colnames(data) <- dimnames(a)[[4]]
# specify levels in the same order as the input so they don't wind up alphabetical
varnames <- data.frame(
factor(rep(dimnames(a)[[1]], times=1, each=d[2]*d[3]), levels=dimnames(a)[[1]]),
factor(rep(dimnames(a)[[2]], times=d[1], each=d[3] ), levels=dimnames(a)[[2]]),
factor(rep(dimnames(a)[[3]], times=d[1]*d[2], each=1 ), levels=dimnames(a)[[3]])
)
names(varnames) <- names(dimnames(a))[1:3]
cbind(varnames, data)
}
They both give me the same result:
> desired <- f.dcast(original)
> test <- f.aperm(original)
> all.equal(desired, test)
[1] TRUE
The second method is 6 times faster for an array this size:
> microbenchmark::microbenchmark(f.dcast(original), f.aperm(original))
Unit: milliseconds
expr min lq mean median uq max neval
f.dcast(original) 7.270208 7.343227 7.703360 7.481984 7.698812 10.392141 100
f.aperm(original) 1.218162 1.244595 1.327204 1.259987 1.291986 4.182391 100
If I increase the size of the original array:
nx <- 10 ; ny <- 20 ; nz <- 30 ; nw <- 40
Then the second method is over ten times faster:
> microbenchmark::microbenchmark(f.dcast(original), f.aperm(original))
Unit: milliseconds
expr min lq mean median uq max neval
f.dcast(original) 303.21812 385.44857 381.29150 392.11693 394.81721 472.80343 100
f.aperm(original) 18.62788 22.25814 28.85363 23.90133 24.54939 97.96776 100
cbind(x=rep(1:1000, each=1000),
      y=1:1000,
      matrix(arr, ncol=5, dimnames=list(list(), dimnames(arr)$z)))
Elapsed time for that was around a tenth of a second. This was the result of str():
num [1:1000000, 1:7] 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:7] "x" "y" "Z1" "Z2" ..
I suppose you could put in row.names, although it does increase elapsed time to a bit over a second.
The function below calculates the mean of a vector. However, it first checks the proportion of NA's present in the vector
and if above a given threshold, returns NA instead of the mean.
My issue is that my current implementation is rather inefficient. It takes more than 7x longer than simply running mean(vec, na.rm=TRUE).
I tried an alternate method using na.omit, but that is even slower.
Given the size of my data, executing the single lapply is taking over 40 minutes.
Any suggestions on how to accomplish the same task more quickly?
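The na.omit variant mentioned above presumably looked something like the following sketch (an assumption; the OP did not show it). na.omit() copies the vector and records the removed positions in an attribute, which is why it ends up even slower:
meanIfThreshNaOmit <- function(vec, thresh = 12/15) {
  ok <- na.omit(vec)                                 # copy of vec without the NAs
  if (1 - length(ok) / length(vec) > thresh) return(NA_real_)
  mean(ok)
}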
UPDATE - RE: @thelatemail's solution and @Arun's comment:
I am executing this function over several hundred groups, each of varying size. The sample data originally provided in this question was given as a neat data frame simply for ease of creating artificial data.
Alternate sample data, to avoid the confusion:
# Sample Data
# ------------
set.seed(1)
# slightly different sizes for each group
N1 <- 5e3
N2 <- N1 + as.integer(rnorm(1, 0, 100))
# One group has only a moderate amount of NA's
SAMP1 <- rnorm(N1)
SAMP1[sample(N1, .25 * N1, FALSE)] <- NA # add in NA's
# Another group has many NA's
SAMP2 <- rnorm(N2)
SAMP2[sample(N2, .95 * N2, FALSE)] <- NA # add in large number of NA's
# put them all in a list
SAMP.NEW <- list(SAMP1, SAMP2)
# keep it clean
rm(SAMP1, SAMP2)
# Execute
# -------
lapply(SAMP.NEW, meanIfThresh)
Original sample data, function, etc.
# Sample Data
# ------------
set.seed(1)
rows <- 20000 # actual data has more than 7M rows
cols <- 1000
SAMP <- replicate(cols, rnorm(rows))
SAMP[sample(length(SAMP), .25 * length(SAMP), FALSE)] <- NA # add in NA's
# Select 5 random rows, and have them be 90% NA
tooSparse <- sample(rows, 5)
for (r in tooSparse)
SAMP[r, sample(cols, cols * .9, FALSE)] <- NA
# Function
# ------------
meanIfThresh <- function(vec, thresh=12/15) {
# Calculates the mean of vec; however,
# if the proportion of NA values in vec is greater than thresh, returns NA instead.
# thresh : the maximum proportion of the data that may be missing (NA).
# e.g. with the default of 12/15 (80%), more than 80% NAs means NA is returned.
len <- length(vec)
if( (sum(is.na(vec)) / len) > thresh)
return(NA_real_)
# if the proportion of NA's is greater than the threshold, return NA
# example: if I'm looking at 14 days, and I have 12 NA's,
# my proportion is 85.7 % = (12 / 14)
# default thresh is 80.0 % = (12 / 15)
# Thus, 12 NAs in a group of 14 would be rejected
# else, calculate the mean, removing NA's
return(mean(vec, na.rm=TRUE))
}
# Execute
# -----------------
apply(SAMP, 1, meanIfThresh)
# Compare with `mean`
#----------------
plain <- apply(SAMP, 1, mean, na.rm=TRUE)
modified <- apply(SAMP, 1, meanIfThresh)
# obviously different
identical(plain, modified)
plain[tooSparse]
modified[tooSparse]
microbenchmark( "meanIfThresh" = apply(SAMP, 1, meanIfThresh)
, "mean (regular)" = apply(SAMP, 1, mean, na.rm=TRUE)
, times = 15L)
# With the actual data, the penalty is sevenfold
# Unit: seconds
# expr min lq median uq max neval
# meanIfThresh 1.658600 1.677472 1.690460 1.751913 2.110871 15
# mean (regular) 1.422478 1.485320 1.503468 1.532175 1.547450 15
Couldn't you just replace the high-NA rows' mean values afterwards, like so?
# changed `result <- apply(SAMP,1,mean,na.rm=TRUE)`
result <- rowMeans(SAMP, na.rm=TRUE)
NArows <- rowSums(is.na(SAMP))/ncol(SAMP) > 0.8
result[NArows] <- NA
Some benchmarking:
Ricardo <- function(vec, thresh=12/15) {
len <- length(vec)
if( (sum(is.na(vec)) / len) > thresh)
return(NA_real_)
return(mean(vec, na.rm=TRUE))
}
DanielFischer <- function(vec, thresh=12/15) {
len <- length(vec)
nas <- is.na(vec)
Nna <- sum(nas)
if( (Nna / len) > thresh)
return(NA_real_)
return(sum(vec[!nas])/(len-Nna))
}
thelatemail <- function(mat) {
result <- rowMeans(mat, na.rm=TRUE)
NArows <- rowSums(is.na(mat))/ncol(mat) > 0.8
result[NArows] <- NA
result
}
require(microbenchmark)
microbenchmark(m1 <- apply(SAMP, 1, Ricardo),
m2 <- apply(SAMP, 1, DanielFischer),
m3 <- thelatemail(SAMP), times = 5L)
Unit: milliseconds
expr min lq median uq max neval
m1 <- apply(SAMP, 1, Ricardo) 2923.7260 2944.2599 3066.8204 3090.8127 3105.4283 5
m2 <- apply(SAMP, 1, DanielFischer) 2643.4883 2683.1034 2755.7032 2799.5155 3089.6015 5
# m3 <- thelatemail(SAMP) 337.1862 340.6339 371.6148 376.5517 383.4436 5
all.equal(m1, m2) # TRUE
all.equal(m1, m3) # TRUE
Do you really have to go through your vector vec twice in your function? If you store the NA positions first, it might speed up your calculations a bit:
meanIfThresh2 <- function(vec, thresh=12/15) {
len <- length(vec)
nas <- is.na(vec)
Nna <- sum(nas)
if( (Nna / len) > thresh)
return(NA_real_)
return(sum(vec[!nas])/(len-Nna))
}
EDIT: I performed a similar benchmark to see the effect of this change:
> microbenchmark( "meanIfThresh" = apply(SAMP, 1, meanIfThresh)
+ , "meanIfThresh2" = apply(SAMP, 1, meanIfThresh2)
+ , "mean (regular)" = apply(SAMP, 1, mean, na.rm=TRUE)
+ , times = 15L)
Unit: seconds
expr min lq median uq max neval
meanIfThresh 2.009858 2.156104 2.158372 2.166092 2.192493 15
meanIfThresh2 1.825470 1.828273 1.829424 1.834407 1.872028 15
mean (regular) 1.868568 1.882526 1.889852 1.893564 1.907495 15