Is there a way to speed up the combn command to get all unique combinations of 2 elements taken from a vector?
Usually this would be set up like this:
# Get latest version of data.table
library(devtools)
install_github("Rdatatable/data.table", build_vignettes = FALSE)
library(data.table)
# Toy data
d <- data.table(id=as.character(paste0("A", 10001:15000)))
# Transform data
system.time({
d.1 <- as.data.table(t(combn(d$id, 2)))
})
However, combn is roughly 8 times slower (23 sec versus 3 sec on my computer) than calculating all possible pairs using data.table.
system.time({
d.2 <- d[, list(neighbor=d$id[-which(d$id==id)]), by=c("id")]
})
Since I am dealing with very large vectors, I am looking for a way to save memory by calculating only the unique combinations (like combn), but with the speed of data.table (see the second code snippet).
I appreciate any help.
Here's a way using the data.table function foverlaps(), which also turns out to be fast!
require(data.table) ## 1.9.4+
d[, `:=`(id1 = 1L, id2 = .I)] ## add interval columns for overlaps
setkey(d, id1, id2)
system.time(olaps <- foverlaps(d, d, type="within", which=TRUE)[xid != yid])
# 0.603 0.062 0.717
Note that foverlaps() does not calculate all permutations. The subset xid != yid is needed to remove self-overlaps. The subset could be handled internally more efficiently by implementing an ignoreSelf argument, similar to IRanges::findOverlaps.
Now it's just a matter of performing a subset using the ids obtained:
system.time(ans <- setDT(list(d$id[olaps$xid], d$id[olaps$yid])))
# 0.576 0.047 0.662
So in total, ~1.4 seconds.
The advantage is that you can do the same thing even if your data.table d has more than one column for which you need the combinations, using the same amount of memory (since we return the indices). In that case, you'd just do:
cbind(d[olaps$xid, ..your_cols], d[olaps$yid, ..your_cols])
But it's limited to replacing just combn(., 2L). Not more than 2L.
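To see why the interval trick works, here is a minimal base-R sketch on a toy size. It is not how foverlaps() is implemented; it only mirrors the logic, assuming row i is given the interval [1, i], so that interval x lies within interval y exactly when x <= y. The names n, within and pairs are illustrative only.

```r
n <- 5                                   # toy size; the question uses 5000
idx <- seq_len(n)

# within[x, y] is TRUE when interval [1, x] lies inside interval [1, y]
within <- outer(idx, idx, "<=")

# drop self-overlaps, mirroring the [xid != yid] subset
pairs <- which(within & row(within) != col(within), arr.ind = TRUE)

# sort by first index, then second, to match the order used by combn()
pairs <- pairs[order(pairs[, 1], pairs[, 2]), , drop = FALSE]

nrow(pairs) == choose(n, 2)
```

Every remaining pair has xid < yid, which is exactly the set of 2-combinations.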
You could use combnPrim from gRbase
source("http://bioconductor.org/biocLite.R")
biocLite("gRbase") # will install dependent packages automatically.
library(gRbase)
system.time({
d.1 <- as.data.table(t(combn(d$id, 2)))
})
# user system elapsed
# 27.322 0.585 27.674
system.time({
d.2 <- as.data.table(t(combnPrim(d$id,2)))
})
# user system elapsed
# 2.317 0.110 2.425
identical(d.1[order(V1, V2),], d.2[order(V1,V2),])
#[1] TRUE
A post with any variation of the word Fast in the title is incomplete without benchmarks. Before posting any, I would just like to mention that since this question was posted, two highly optimized packages for generating combinations have been released for R: arrangements and RcppAlgos (I am the author). Note that since version 2.3.0, RcppAlgos can take advantage of multiple threads for even greater efficiency.
To give you an idea of their speed over combn and gRbase::combnPrim, here is a basic benchmark:
## We test generating just over 3 million combinations
choose(25, 10)
[1] 3268760
microbenchmark(arrngmnt = arrangements::combinations(25, 10),
combn = combn(25, 10),
gRBase = gRbase::combnPrim(25, 10),
serAlgos = RcppAlgos::comboGeneral(25, 10),
parAlgos = RcppAlgos::comboGeneral(25, 10, nThreads = 4),
unit = "relative", times = 20)
Unit: relative
expr min lq mean median uq max neval
arrngmnt 2.979378 3.072319 1.898390 3.756307 2.139258 0.4842967 20
combn 226.470755 230.410716 118.157110 232.905393 125.718512 17.7778585 20
gRBase 34.219914 34.209820 18.789954 34.218320 19.934485 3.6455493 20
serAlgos 2.836651 3.078791 2.458645 3.703929 2.231475 1.1652445 20
parAlgos 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000 20
Now, we benchmark the other functions posted for the very specific case of producing combinations choose 2 and producing a data.table object.
The functions are as follows:
funAkraf <- function(d) {
a <- comb2.int(length(d$id)) ## comb2.int from the answer given by @akraf
setDT(list(V1 = d$id[a[,1]], V2 = d$id[a[,2]]))
}
funAnirban <- function(d) {
indices <- combi2inds(d$id)
ans2 <- setDT(list(d$id[indices$xid], d$id[indices$yid]))
ans2
}
funArun <- function(d) {
d[, `:=`(id1 = 1L, id2 = .I)] ## add interval columns for overlaps
setkey(d, id1, id2)
olaps <- foverlaps(d, d, type="within", which=TRUE)[xid != yid]
ans <- setDT(list(d$id[olaps$xid], d$id[olaps$yid]))
ans
}
funArrangements <- function(d) {
a <- arrangements::combinations(x = d$id, k = 2)
setDT(list(a[, 1], a[, 2]))
}
funGRbase <- function(d) {
a <- gRbase::combnPrim(d$id,2)
setDT(list(a[1, ], a[2, ]))
}
funOPCombn <- function(d) {
a <- combn(d$id, 2)
setDT(list(a[1, ], a[2, ]))
}
funRcppAlgos <- function(d) {
a <- RcppAlgos::comboGeneral(d$id, 2, nThreads = 4)
setDT(list(a[, 1], a[, 2]))
}
Benchmark with OP Data
And here are the benchmarks on the example given by the OP:
d <- data.table(id=as.character(paste0("A", 10001:15000)))
microbenchmark(funAkraf(d),
funAnirban(d),
funArrangements(d),
funArun(d),
funGRbase(d),
funOPCombn(d),
funRcppAlgos(d),
times = 10, unit = "relative")
Unit: relative
expr min lq mean median uq max neval
funAkraf(d) 3.220550 2.971264 2.815023 2.665616 2.344018 3.383673 10
funAnirban(d) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10
funArrangements(d) 1.464730 1.689231 1.834650 1.960233 1.932361 1.693305 10
funArun(d) 3.256889 2.908075 2.634831 2.729180 2.432277 2.193849 10
funGRbase(d) 3.513847 3.340637 3.327845 3.196399 3.291480 3.129362 10
funOPCombn(d) 30.310469 26.255374 21.656376 22.386270 18.527904 15.626261 10
funRcppAlgos(d) 1.676808 1.956696 1.943773 2.085968 1.949133 1.804180 10
We see that the function provided by @AnirbanMukherjee is the fastest for this task, followed by RcppAlgos/arrangements. For this task, nThreads has no effect because the vector passed is a character vector, which is not thread safe. What if we instead converted id to a factor?
Benchmarks with Factors (i.e. Categorical Variables)
dFac <- d
dFac$id <- as.factor(dFac$id)
library(microbenchmark)
microbenchmark(funAkraf(dFac),
funAnirban(dFac),
funArrangements(dFac),
funArun(dFac),
funGRbase(dFac),
funOPCombn(dFac),
funRcppAlgos(dFac),
times = 10, unit = "relative")
Unit: relative
expr min lq mean median uq max neval
funAkraf(dFac) 10.898202 10.949896 7.589814 10.01369 8.050005 5.557014 10
funAnirban(dFac) 3.104212 3.337344 2.317024 3.00254 2.471887 1.530978 10
funArrangements(dFac) 2.054116 2.058768 1.858268 1.94507 2.797956 1.691875 10
funArun(dFac) 10.646680 12.905119 7.703085 11.50311 8.410893 3.802155 10
funGRbase(dFac) 16.523356 21.609917 12.991400 19.73776 13.599870 6.498135 10
funOPCombn(dFac) 108.301876 108.753085 64.338478 95.56197 65.494335 28.183104 10
funRcppAlgos(dFac) 1.000000 1.000000 1.000000 1.00000 1.000000 1.000000 10
Now we see that RcppAlgos is around 2x faster than any other solution. In particular, the RcppAlgos solution is about 3x faster than the formerly fastest solution given by Anirban. It should be noted that this increase in efficiency was possible because factor variables are really integers under the hood with some additional attributes.
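As a quick illustration of that last point, the base-R sketch below shows the integer codes stored inside a factor. The toy vector is made up for the example; only base functions are used.

```r
f <- factor(c("A10003", "A10001", "A10002"))

typeof(f)                 # storage type of a factor is integer
codes <- as.integer(f)    # positions into levels(f)
levels(f)[codes]          # recovers the original labels
attributes(f)             # the "additional attributes": levels and class
```

Because the payload is an integer vector, a combination routine can work on the codes and reattach the levels afterwards.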
Confirm Equality
They all give the same result as well. The only caveat is that the gRbase solution doesn't support factors. That is, if you pass a factor, it will be converted to character. Thus all of the solutions will give the same result if you were to pass dFac except for the gRbase solution:
identical(funAkraf(d), funOPCombn(d))
#[1] TRUE
identical(funAkraf(d), funArrangements(d))
#[1] TRUE
identical(funRcppAlgos(d), funArrangements(d))
#[1] TRUE
identical(funRcppAlgos(d), funAnirban(d))
#[1] TRUE
identical(funRcppAlgos(d), funArun(d))
#[1] TRUE
## different order... we must sort
identical(funRcppAlgos(d), funGRbase(d))
[1] FALSE
d1 <- funGRbase(d)
d2 <- funRcppAlgos(d)
## now it's the same
identical(d1[order(V1, V2),], d2[order(V1,V2),])
#[1] TRUE
Thanks to @Frank for pointing out how to compare two data.tables without going through the pains of creating new data.tables and then arranging them:
fsetequal(funRcppAlgos(d), funGRbase(d))
[1] TRUE
Here is a solution using Rcpp.
library(Rcpp)
library(data.table)
cppFunction('
Rcpp::DataFrame combi2(Rcpp::CharacterVector inputVector){
int len = inputVector.size();
int retLen = len * (len-1) / 2;
Rcpp::CharacterVector outputVector1(retLen);
Rcpp::CharacterVector outputVector2(retLen);
int start = 0;
for (int i = 0; i < len; ++i){
for (int j = i+1; j < len; ++j){
outputVector1(start) = inputVector(i);
outputVector2(start) = inputVector(j);
++start;
}
}
return(Rcpp::DataFrame::create(Rcpp::Named("id") = outputVector1,
Rcpp::Named("neighbor") = outputVector2));
};
')
# Toy data
d <- data.table(id=as.character(paste0("A", 10001:15000)))
system.time({
d.2 <- d[, list(neighbor=d$id[-which(d$id==id)]), by=c("id")]
})
# 1.908 0.397 2.389
system.time({
d[, `:=`(id1 = 1L, id2 = .I)] ## add interval columns for overlaps
setkey(d, id1, id2)
olaps <- foverlaps(d, d, type="within", which=TRUE)[xid != yid]
ans <- setDT(list(d$id[olaps$xid], d$id[olaps$yid]))
})
# 0.653 0.038 0.705
system.time(ans2 <- combi2(d$id))
# 1.377 0.108 1.495
Using the Rcpp function only to get the indices, and then forming the data.table from them, works better.
cppFunction('
Rcpp::DataFrame combi2inds(const Rcpp::CharacterVector inputVector){
const int len = inputVector.size();
const int retLen = len * (len-1) / 2;
Rcpp::IntegerVector outputVector1(retLen);
Rcpp::IntegerVector outputVector2(retLen);
int indexSkip;
for (int i = 0; i < len; ++i){
indexSkip = len * i - ((i+1) * i)/2;
for (int j = 0; j < len-1-i; ++j){
outputVector1(indexSkip+j) = i+1;
outputVector2(indexSkip+j) = i+j+1+1;
}
}
return(Rcpp::DataFrame::create(Rcpp::Named("xid") = outputVector1,
Rcpp::Named("yid") = outputVector2));
};
')
system.time({
indices <- combi2inds(d$id)
ans2 <- setDT(list(d$id[indices$xid], d$id[indices$yid]))
})
# 0.389 0.027 0.425
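The offset arithmetic in combi2inds can be checked in plain R. The sketch below assumes the same zero-based outer index i as the C++ loop and verifies that indexSkip equals the number of pairs written by all earlier values of i. The variable names are illustrative.

```r
n <- 6
i <- 0:(n - 1)                          # zero-based outer index, as in the C++ loop

# offset used in combi2inds: len * i - ((i + 1) * i) / 2
index_skip <- n * i - ((i + 1) * i) / 2

# outer index k writes n - 1 - k pairs, so the offset for i is the
# running total of pairs written by 0, ..., i - 1
pairs_before <- c(0, cumsum(n - 1 - i))[seq_len(n)]

all(index_skip == pairs_before)
```

Knowing the offset up front is what lets the inner loop write to outputVector1 and outputVector2 without keeping a running counter.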
Here are two base-R solutions if you don't want to use additional dependencies:
comb2.int uses rep and other sequence generating functions to generate the desired output.
comb2.mat creates a matrix, uses upper.tri() to get the upper triangle and which(..., arr.ind = TRUE) to obtain the column and row indices => all combinations.
Possibility 1: comb2.int
comb2.int <- function(n, rep = FALSE){
if(!rep){
# e.g. n=3 => (1,2), (1,3), (2,3)
x <- rep(1:n,(n:1)-1)
i <- seq_along(x)+1
o <- c(0,cumsum((n-2):1))
y <- i-o[x]
}else{
# e.g. n=3 => (1,1), (1,2), (1,3), (2,2), (2,3), (3,3)
x <- rep(1:n,n:1)
i <- seq_along(x)
o <- c(0,cumsum(n:2))
y <- i-o[x]+x-1
}
return(cbind(x,y))
}
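To make the sequence arithmetic in comb2.int easier to follow, here is a step-by-step trace of the rep = FALSE branch for n = 4. It repeats the same four assignments outside the function; the comments show the intermediate values.

```r
n <- 4
x <- rep(1:n, (n:1) - 1)        # 1 1 1 2 2 3: first index, value k repeated n - k times
i <- seq_along(x) + 1           # 2 3 4 5 6 7: running counter shifted by one
o <- c(0, cumsum((n - 2):1))    # 0 2 3: offset to subtract for each value of x
y <- i - o[x]                   # 2 3 4 3 4 4: second index
cbind(x, y)
```

The result lists (1,2), (1,3), (1,4), (2,3), (2,4), (3,4), the same rows as t(combn(4, 2)).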
Possibility 2: comb2.mat
comb2.mat <- function(n, rep = FALSE){
# Use which(..., arr.ind = TRUE) to get coordinates.
m <- matrix(FALSE, nrow = n, ncol = n)
idxs <- which(upper.tri(m, diag = rep), arr.ind = TRUE)
return(idxs)
}
The functions give the same result as combn(.):
for(i in 2:8){
# --- comb2.int ------------------
stopifnot(comb2.int(i) == t(combn(i,2)))
# => Equal
# --- comb2.mat ------------------
m <- comb2.mat(i)
colnames(m) <- NULL # difference 1: colnames
m <- m[order(m[,1]),] # difference 2: output order
stopifnot(m == t(combn(i,2)))
# => Equal up to above differences
}
But I have other elements in my vector than sequential integers!
Use the return values as indices:
v <- LETTERS[1:5]
c <- comb2.int(length(v))
cbind(v[c[,1]], v[c[,2]])
#> [,1] [,2]
#> [1,] "A" "B"
#> [2,] "A" "C"
#> [3,] "A" "D"
#> [4,] "A" "E"
#> [5,] "B" "C"
#> [6,] "B" "D"
#> [7,] "B" "E"
#> [8,] "C" "D"
#> [9,] "C" "E"
#> [10,] "D" "E"
Benchmark:
time(combn) = ~15x time(comb2.mat) = ~70x time(comb2.int) (based on the medians below):
library(microbenchmark)
n <- 800
microbenchmark({
comb2.int(n)
},{
comb2.mat(n)
},{
t(combn(n, 2))
})
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> { comb2.int(n) } 4.394051 4.731737 6.350406 5.334463 7.22677 14.68808 100
#> { comb2.mat(n) } 20.131455 22.901534 31.648521 24.411782 26.95821 297.70684 100
#> { t(combn(n, 2)) } 363.687284 374.826268 391.038755 380.012274 389.59960 532.30305 100
Related
Simply put, I have 378742 observations (each observation has a launch and deadline date) and I want to check the overlap of the duration of each observation against all other (378741) observations and sum them up.
I am running the following code, which takes forever (my estimate is that it would take 205 days) because of the nested loop. Is there a way to speed up the calculations? (I use the DescTools package for the Overlap command.)
a <- c(1:378742)
for (i in 1:378742) {
mydata$competition[i] <- sum(a, na.rm = T)
for (j in 1:378742) {
a[j] <- Overlap(c(mydata$Launched[i], mydata$Deadline[i]), c(mydata$Launched[j], mydata$Deadline[j]))
}
}
You can save significant time by vectorizing your inner loop (I then use apply() for the outer loop):
# We'll need both DescTools and microbenchmark
library(DescTools)
library(microbenchmark)
# Make example data
set.seed(123) # setting seed for reproducibility
n <- 10
x <- sample(seq(as.Date("2008/10/20"), as.Date("2018/10/20"), "day"), n)
y <- sample(seq(as.Date("2008/10/20"), as.Date("2018/10/20"), "day"), n)
(mat <- cbind(x, y))
#> x y
#> [1,] 15222 17667
#> [2,] 17050 15827
#> [3,] 15665 16645
#> [4,] 17395 16262
#> [5,] 17603 14547
#> [6,] 14338 17454
#> [7,] 16098 15069
#> [8,] 17425 14325
#> [9,] 16181 15367
#> [10,] 15835 17650
# First get the answer using nested loops
a <- z <- 1:n
for (i in 1:n) {
for (j in 1:n) {
a[j] <- Overlap(mat[i, ],mat[j, ])
}
  # Notice I've moved this sum to the bottom,
  # so that our first element isn't just a sum from one to n
z[i] <- sum(a, na.rm = T)
}
z
#> [1] 16102 9561 7860 7969 18169 18140 6690 18037 6017 12374
apply(mat, 1, function(r) sum(Overlap(r, mat)))
#> [1] 16102 9561 7860 7969 18169 18140 6690 18037 6017 12374
microbenchmark(apply = apply(mat, 1, function(r) sum(Overlap(r, mat))),
loop = for (i in 1:n) {
for (j in 1:n) {
a[j] <- Overlap(mat[i, ],mat[j, ])
}
  # Notice I've moved this sum to the bottom,
  # so that our first element isn't just a sum from one to n
z[i] <- sum(a, na.rm = T)
})
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> apply 7.538967 7.688929 7.894379 7.767989 7.891177 13.57523 100
#> loop 76.051011 77.203810 80.045325 78.158369 79.206538 114.68139 100
#> cld
#> a
#> b
Created on 2018-10-20 by the reprex package (v0.2.1)
Now let's try to get a sense of how it scales with (slightly) bigger example data (if the data gets too big the benchmarks take forever):
#
n <- 100
x <- sample(seq(as.Date("2008/10/20"), as.Date("2018/10/20"), "day"), n, r = T)
y <- sample(seq(as.Date("2008/10/20"), as.Date("2018/10/20"), "day"), n, r = T)
mat <- cbind(x, y)
a <- z <- 1:n
for (i in 1:n) {
for (j in 1:n) {
a[j] <- Overlap(mat[i, ],mat[j, ])
}
z[i] <- sum(a, na.rm = T)
}
# In case you're concerned it still works:
all.equal(z, apply(mat, 1, function(r) sum(Overlap(r, mat))))
#> [1] TRUE
microbenchmark(apply = apply(mat, 1, function(r) sum(Overlap(r, mat))),
loop = for (i in 1:n) {
for (j in 1:n) {
a[j] <- Overlap(mat[i, ],mat[j, ])
}
  # Notice I've moved this sum to the bottom,
  # so that our first element isn't just a sum from one to n
z[i] <- sum(a, na.rm = T)
})
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> apply 258.1151 262.8007 269.8172 265.9643 276.8799 296.2167 100
#> loop 5806.9834 5841.3362 5890.4988 5863.7317 5884.2308 6222.1670 100
#> cld
#> a
#> b
Created on 2018-10-20 by the reprex package (v0.2.1)
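If you want to see the vectorized inner loop without any package, here is a minimal base-R sketch. It is not the DescTools implementation: it assumes every interval is stored with start <= end, measures overlap as the length of the shared stretch, and counts the self-overlap just like the nested loop does. The data and the helper overlap_with_all are made up for illustration.

```r
starts <- c(1, 3, 10, 4)   # made-up launch days
ends   <- c(5, 8, 12, 6)   # made-up deadline days

# overlap of interval i with every interval at once (vectorized inner loop)
overlap_with_all <- function(i) {
  shared <- pmin(ends[i], ends) - pmax(starts[i], starts)
  sum(pmax(0, shared))     # negative values mean "no overlap"
}

competition <- vapply(seq_along(starts), overlap_with_all, numeric(1))
competition
```

For these four intervals the totals are 7, 9, 2 and 5. The outer loop is still there, but each iteration now does one pass over whole vectors instead of n separate function calls.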
In bioinformatics, we use the GenomicRanges package to find overlapping ranges.
I once estimated that my usual for-loops and lapply calls would keep my computer busy for 5 days. But then I found the GenomicRanges package, and it did the job in seconds!
(Shame on me, I still don't know exactly how it works. It probably has to do with an ordered tree structure and efficient intersection, and perhaps also involves C++ code.)
Either way, the result is lightning fast.
You will be amazed!
GenomicRanges package for lightning-fast range calculations
############################
# Install GenomicRanges package
############################
# `BiocManager` was introduced this year
# Bioconductor is the main code repository for bioinformaticians.
# It is a kind of `CRAN` for bioinformaticians programming with R
install.packages("BiocManager")
require(BiocManager)
BiocManager::install("GenomicRanges")
# In older systems, you have to do:
install.packages("BiocInstaller")
require(BiocInstaller)
biocLite("GenomicRanges")
############################
# Load the GenomicRanges package
############################
require(GenomicRanges)
############################
# create dates as positive intervals
############################
set.seed(123) # for reproducibility of random stuff
n <- 1000 # later: 378742
x <- sample(seq(as.Date("2008/10/20"), as.Date("2038/10/20"), "day"), replace=TRUE, n)
# y <- sapply(x, function(date) date + sample(1:1000, 1)) # too slow!
deltas <- sample(1:10000, replace=TRUE, n) # immediate response `sapply` needs very long
y <- x + deltas
df <- data.frame(seqnames="1", start=x, end=y)
gr <- GRanges(df)
gr <- sort(gr)
############################
# Be careful, GRanges obj is 1-based system and not 0-based!
############################
# each row is one index - gr behaves when indexing like a vector
gr[5] # selects fifth row
gr[4:7] # selects 4th to 7th row
############################
# which range overlaps with which range?
############################
system.time({hits <- findOverlaps(gr, gr)})
# system.time({ your-R-expression }) - very convenient speed measuring!
# the numbers in the table are the index (i-th row) in each of the tables
# query and subject table - which are in this case identical tables - gr
############################
# what is the amount of overlap?
############################
overlaps <- pintersect(gr[queryHits(hits)], gr[subjectHits(hits)])
amount.overlaps <- width(overlaps) - 1 # - 1 because 1-based systems do +1 when ranges
# 1-base versus 0-based coordinate systems: https://www.biostars.org/p/84686/
This question is related to my previous one. Here is a small sample data. I have used both data.table and data.frame to find a faster solution.
library(data.table)
test.dt <- data.table(strt=c(1,1,2,3,5,2), end=c(2,1,5,5,5,4), a1.2=c(1,2,3,4,5,6),
a2.3=c(2,4,6,8,10,12), a3.4=c(3,1,2,4,5,1), a4.5=c(5,1,15,10,12,10),
a5.6=c(4,8,2,1,3,9))
test.dt[,rown:=as.numeric(row.names(test.dt))]
test.df <- data.frame(strt=c(1,1,2,3,5,2), end=c(2,1,5,5,5,4), a1.2=c(1,2,3,4,5,6),
a2.3=c(2,4,6,8,10,12), a3.4=c(3,1,2,4,5,1), a4.5=c(5,1,15,10,12,10),
a5.6=c(4,8,2,1,3,9))
test.df$rown <- as.numeric(row.names(test.df))
> test.df
strt end a1.2 a2.3 a3.4 a4.5 a5.6 rown
1 1 2 1 2 3 5 4 1
2 1 1 2 4 1 1 8 2
3 2 5 3 6 2 15 2 3
4 3 5 4 8 4 10 1 4
5 5 5 5 10 5 12 3 5
6 2 4 6 12 1 10 9 6
I want to use the start and end column values to determine the range of columns to subset (columns from a1.2 to a5.6) and obtain the mean. For example, in the first row, since strt=1 and end=2, I need to get the mean of a1.2 and a2.3; in the third row, I need to get the mean of a2.3, a3.4, a4.5, and a5.6
The output should be a vector like this
> k
1 2 3 4 5 6
1.500000 2.000000 6.250000 5.000000 3.000000 7.666667
Here, is what I tried:
Solution 1: This uses the data.table and applies a function over it.
func.dt <- function(rown, x, y) {
tmp <- paste0("a", x, "." , x+1)
tmp1 <- paste0("a", y, "." , y+1)
rowMeans(test.dt[rown,get(tmp):get(tmp1), with=FALSE])
}
k <- test.dt[, func.dt(rown, strt, end), by=.(rown)]
Solution 2: This uses the data.frame and applies a function over it.
func.df <- function(rown, x, y) {
rowMeans(test.df[rown,(x+2):(y+2), drop=FALSE])
}
k1 <- mapply(func.df, test.df$rown, test.df$strt, test.df$end)
Solution 3: This uses the data.frame and loops through it.
test.ave <- rep(NA, length(test.df$strt))
for (i in 1 : length(test.df$strt)) {
test.ave[i] <- rowMeans(test.df[i, as.numeric(test.df[i,1]+2):as.numeric(test.df[i,2]+2), drop=FALSE])
}
Benchmarking shows that Solution 2 is the fastest.
test replications elapsed relative user.self sys.self user.child sys.child
1 sol1 100 0.67 4.786 0.67 0 NA NA
2 sol2 100 0.14 1.000 0.14 0 NA NA
3 sol3 100 0.15 1.071 0.16 0 NA NA
But, this is not good enough for me. Given the size of my data, these functions would need to run for a few days before I get the output. I am sure that I am not fully utilizing the power of data.table and I also know that my functions are crappy (they refer to the dataset in the global environment without passing it). Unfortunately, I am out of my depth and do not know how to fix these issues and make my functions fast. I would greatly appreciate any suggestions that help in improving my function(s) or point to alternate solutions.
I was curious how fast I could make this without resorting to writing custom C or C++ code. The best I could come up with is below. Note that using mean.default will provide greater precision, since it does a second pass over the data for error correction.
f_jmu <- compiler::cmpfun({function(m) {
# remove start/end columns from 'm' matrix
ma <- m[,-(1:2)]
# column index for each row in 'ma' matrix
cm <- col(ma)
# logical index of whether we need the column for each row
i <- cm >= m[,1L] & cm <= m[,2L]
# multiply the input matrix by the index matrix and sum it
# divide by the sum of the index matrix to get the mean
rowSums(i*ma) / rowSums(i)
}})
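To show the masking idea on something small, the sketch below applies the same steps to the six-row sample from the question, built here as a plain numeric matrix so the block is self-contained. The names keep and row_means are illustrative.

```r
m <- cbind(strt = c(1, 1, 2, 3, 5, 2), end = c(2, 1, 5, 5, 5, 4),
           a1.2 = c(1, 2, 3, 4, 5, 6), a2.3 = c(2, 4, 6, 8, 10, 12),
           a3.4 = c(3, 1, 2, 4, 5, 1), a4.5 = c(5, 1, 15, 10, 12, 10),
           a5.6 = c(4, 8, 2, 1, 3, 9))

ma <- m[, -(1:2)]                        # value columns only
cm <- col(ma)                            # column number of every cell
keep <- cm >= m[, 1] & cm <= m[, 2]      # TRUE where strt <= column <= end

# masked cells contribute 0 to the sum; dividing by the count gives the mean
row_means <- rowSums(keep * ma) / rowSums(keep)
row_means
```

This reproduces the expected vector from the question: 1.5, 2, 6.25, 5, 3 and 7.666667.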
The Rcpp function is still faster (not surprisingly), but the function above gets respectably close. Here's an example on 50 million observations on my laptop with an i7-4600U and 12GB of RAM.
set.seed(21)
N <- 5e7
test.df <- data.frame(strt = 1L,
end = sample(5, N, replace = TRUE),
a1.2 = sample(3, N, replace = TRUE),
a2.3 = sample(7, N, replace = TRUE),
a3.4 = sample(14, N, replace = TRUE),
a4.5 = sample(8, N, replace = TRUE),
a5.6 = sample(30, N, replace = TRUE))
test.df$strt <- pmax(1L, test.df$end - sample(3, N, replace = TRUE) + 1L)
test.m <- as.matrix(test.df)
Also note that I take care to ensure that test.m is an integer matrix. That helps reduce the memory footprint, which can help make things faster.
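A small check of the integer-versus-double point, using only base R. The two toy data frames are made up; the size comparison uses vectors of one million elements.

```r
df_int <- data.frame(strt = 1:3, end = c(2L, 3L, 3L))       # integer columns
df_dbl <- data.frame(strt = c(1, 2, 3), end = c(2, 3, 3))   # double columns

storage.mode(as.matrix(df_int))   # "integer"
storage.mode(as.matrix(df_dbl))   # "double"

# integers take 4 bytes per element, doubles 8
size_int <- as.numeric(object.size(integer(1e6)))
size_dbl <- as.numeric(object.size(numeric(1e6)))
size_dbl / size_int
```

A single double column is enough to turn the whole matrix into a double matrix, which is why the example data is built with 1L and sample().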
R> system.time(st1 <- MYrcpp(test.m))
user system elapsed
0.900 0.216 1.112
R> system.time(st2 <- f_jmu(test.m))
user system elapsed
6.804 0.756 7.560
R> identical(st1, st2)
[1] TRUE
Unless you can think of a way to do this with a clever subsetting approach, I think you've reached R's speed barrier. You'll want to use a low-level language like C++ for this problem. Fortunately, the Rcpp package makes interfacing with C++ in R simple. Disclaimer: I've never written a single line of C++ code in my life. This code may be very inefficient.
library(Rcpp)
cppFunction('NumericVector MYrcpp(NumericMatrix x) {
int nrow = x.nrow(), ncol = x.ncol();
NumericVector out(nrow);
for (int i = 0; i < nrow; i++) {
double avg = 0;
int start = x(i,0);
int end = x(i,1);
int N = end - start + 1;
while(start<=end){
avg += x(i, start + 1);
start = start + 1;
}
out[i] = avg/N;
}
return out;
}')
For this code I'm going to pass the data.frame as a matrix (i.e. testM <- as.matrix(test.df))
Let's see if it works...
MYrcpp(testM)
[1] 1.500000 2.000000 6.250000 5.000000 3.000000 7.666667
How fast is it?
Unit: microseconds
expr min lq mean median uq max neval
f2() 1543.099 1632.3025 2039.7350 1843.458 2246.951 4735.851 100
f3() 1859.832 1993.0265 2642.8874 2168.012 2493.788 19619.882 100
f4() 281.541 315.2680 364.2197 345.328 375.877 1089.994 100
MYrcpp(testM) 3.422 10.0205 16.7708 19.552 21.507 56.700 100
Where f2(), f3() and f4() are defined as
f2 <- function(){
func.df <- function(rown, x, y) {
rowMeans(test.df[rown,(x+2):(y+2), drop=FALSE])
}
k1 <- mapply(func.df, test.df$rown, test.df$strt, test.df$end)
}
f3 <- function(){
test.ave <- rep(NA, length(test.df$strt))
for (i in 1 : length(test.df$strt)) {
test.ave[i] <- rowMeans(test.df[i,as.numeric(test.df[i,1]+2):as.numeric(test.df[i,2]+2), drop=FALSE])
}
}
f4 <- function(){
lapply(
apply(test.df,1, function(x){
x[(x[1]+2):(x[2]+2)]}),
mean)
}
That's roughly a 20x increase over the fastest.
Note that to run the above code you'll need a C++ compiler that R can access. For Windows, look into Rtools. For more on Rcpp read this
Now let's see how it scales.
N = 5e3
test.df <- data.frame(strt = 1,
end = sample(5, N, replace = TRUE),
a1.2 = sample(3, N, replace = TRUE),
a2.3 = sample(7, N, replace = TRUE),
a3.4 = sample(14, N, replace = TRUE),
a4.5 = sample(8, N, replace = TRUE),
a5.6 = sample(30, N, replace = TRUE))
test.df$rown <- as.numeric(row.names(test.df))
test.dt <- as.data.table(test.df)
testM <- as.matrix(test.df)  # rebuild the matrix from the new data
microbenchmark(f4(), MYrcpp(testM))
Unit: microseconds
expr min lq mean median uq max neval
f4() 88647.256 108314.549 125451.4045 120736.073 133487.5295 259502.49 100
MYrcpp(testM) 196.003 216.533 242.6732 235.107 261.0125 499.54 100
With 5e3 rows MYrcpp is now roughly 500x faster (comparing medians). This is partially due to the fact that f4() does not scale well, as Richard discusses in the comment. f4() is essentially invoking a nested for loop by calling an apply within a lapply. Interestingly, the C++ code is also invoking a nested loop by utilizing a while loop inside a for loop. The speed disparity is due in large part to the fact that the C++ code is already compiled and does not need to be interpreted into something the machine can understand at run time.
I'm not sure how big your data set is, but when I run MYrcpp on a data.frame with 1e7 rows, which is the largest data.frame I could allocate on my crummy laptop, it ran in 500 milliseconds.
Update: R equivalent of C++ code
MYr <- function(x){
nrow <- nrow(x)
ncol <- ncol(x)
out <- matrix(NA, nrow = 1, ncol = nrow)
for(i in 1:nrow){
avg <- 0
start <- x[i,1]
end <- x[i,2]
N <- end - start + 1
while(start<=end){
avg <- avg + x[i, start + 2]
start = start + 1
}
out[i] <- avg/N
}
out
}
Both MYrcpp and MYr are similar in many ways. Let me discuss a couple of the differences
The first line of MYrcpp is different from the MYr. In words the first line of MYrcpp, NumericVector MYrcpp(NumericMatrix x), means that we are defining a function whose name is MYrcpp which returns an output of class NumericVector and takes an input x of class NumericMatrix.
In C++ you have to declare the class of a variable when you introduce it, i.e. int nrow = x.nrow() declares a variable named nrow of class int (i.e. integer) and assigns it x.nrow(), the number of rows of x. (IGNORE if you're overwhelmed: nrow() is a method for instances of class NumericMatrix. As in Python, you call a method by attaching it to the instance. The R equivalent is S3 and S4 methods.)
When you subset in C++ you use () instead of [] like in R. Also, indexing begins at zero (like in Python). For example, x(0,1) in C++ is equivalent to x[1,2] in R
++ is an operator that means increment by 1, i.e. j++ is the same as j = j + 1. += is an operator that means add together and assign, i.e. a += b is the same as a = a + b.
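The indexing and operator differences listed above can be spelled out in R itself. This is only a sketch of the correspondence; at_zero_based is a hypothetical helper written for the example, not part of Rcpp.

```r
x <- matrix(1:6, nrow = 2)

# hypothetical helper: read a cell using C++-style zero-based indices
at_zero_based <- function(m, i, j) m[i + 1, j + 1]

at_zero_based(x, 0, 1) == x[1, 2]   # x(0,1) in C++ is x[1,2] in R

a <- 5
b <- 2
a <- a + b     # what a += b does in C++

j <- 0
j <- j + 1     # what j++ does in C++
```

This is also why MYrcpp reads x(i, start + 1) while MYr reads x[i, start + 2]: the same column, one index apart.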
My solution is the first one in the benchmark
library(microbenchmark)
microbenchmark(
lapply(
apply(test.df,1, function(x){
x[(x[1]+2):(x[2]+2)]}),
mean),
test.dt[, func.dt(rown, strt, end), by=.(rown)]
)
min lq mean median uq max neval
138.654 175.7355 254.6245 201.074 244.810 3702.443 100
4243.641 4747.5195 5576.3399 5252.567 6247.201 8520.286 100
It seems to be 25 times faster, but this is a small dataset. I am sure there is a better way to do this than what I have done.
I am learning how to properly set up loops but still struggle with the correct indexing and syntax.
Below I have a list of two matrices, 2 columns x 3 rows each.
I want to look at the second column in each matrix, and create ideally a new column with values 1 if x>0.50, else = 0. Help will be much appreciated for both ways to do it, it should help me learn. Also, if you know any good reference of structuring loops in higher dimensional lists it would be much appreciated. Thanks so much.
a <- c(0.1,0.2,0.3)
b <- c(0.8,0.2,0.5)
c <- c(0.4,0.9,1.0)
d <- c(0.7,0.9,0.2)
ab <- cbind(a,b)
cd <- cbind(c,d)
abcd <- list(ab,cd)
presabs <- vector("list",ncol(y))
# Trying lapply but indexing something wrong
res <- lapply(abcd, function(x) if (x[,2]>0.5) {1} else {0})
# Other method also not working:
for (i in 1:length(abcd))
for (j in 1:length(a)){
{
if(abc[[i]][j]>0.50){
presabs[j] <- 1
} else {
presabs[j] <- 0
}
}
}
You can use ifelse, but in this case it is not even needed. The OP's question is about creating a binary variable as the third column. This can be done in many ways:
lapply(abcd, function(x) cbind(x,new= +(x[,2]>0.5)))
or
lapply(abcd, function(x) cbind(x,new= (x[,2]>0.5)+0L))
Or
lapply(abcd, function(x) cbind(x,new= (x[,2]>0.5)*1))
Or
lapply(abcd, function(x) cbind(x,new= as.integer(x[,2]>0.5)))
If the values to be changed are different, for example
a <- 3
b <- 2
lapply(abcd, function(x) cbind(x, new= c(a, b)[(x[,2] > 0.5)+1L]))
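For reference, this is what the binary variants return on the question's data. The block rebuilds the two matrices so that it runs on its own; it uses the as.integer() variant, and the others give the same 0/1 values.

```r
ab <- cbind(a = c(0.1, 0.2, 0.3), b = c(0.8, 0.2, 0.5))
cd <- cbind(c = c(0.4, 0.9, 1.0), d = c(0.7, 0.9, 0.2))
abcd <- list(ab, cd)

# the comparison is vectorized, so the whole second column is tested at once
res <- lapply(abcd, function(x) cbind(x, new = as.integer(x[, 2] > 0.5)))

res[[1]][, "new"]   # 1 0 0 (0.5 is not greater than 0.5)
res[[2]][, "new"]   # 1 1 0
```

The lapply attempt in the question goes wrong because if expects a single TRUE or FALSE, whereas x[,2] > 0.5 is a vector with one value per row.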
Benchmarks
set.seed(25)
abcd1 <- lapply(1:60, function(i) matrix(rnorm(1e5*2), ncol=2))
viaChris <- function() lapply(abcd1, function(x) f(x, a=1, b=0, thresh =.5))
akrun <- function() lapply(abcd1, function(x) cbind(x, lab= +(x[,2] >0.5)))
system.time(viaChris())
# user system elapsed
#1.683 0.000 1.444
system.time(akrun())
# user system elapsed
# 0.481 0.000 0.322
library(microbenchmark)
microbenchmark(akrun(), viaChris(), unit='relative', times=20L)
#Unit: relative
# expr min lq mean median uq max neval cld
# akrun() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 20 a
#viaChris() 3.726728 3.459581 3.475673 3.488114 3.400262 3.445557 20 b
The above answer works because logical values are coerced into 1 (for TRUE) and 0 (for FALSE). A more general solution might look something like:
lapply(abcd, function(x) cbind(x, ifelse(x[, 2] > .5, a, b)))
where a and b are numeric values you can specify. We can even be more general. For example:
## Define a general function that adds a new column of values
## based on whether or not the values in the i'th column of the
## matrix exceeds a threshold.
f = function(x, a, b, thresh, i = 2)
cbind(x, lab = ifelse(x[, i] > thresh, a, b))
## Apply the function above to each matrix in the list 'abcd', with
## a = 1, b = 0, and thresh = .5.
lapply(abcd, function(x) f(x, a = 1, b = 0, thresh = .5))
I have the following vectors
> X <- c(1,1,3,4)
> a <- c(1,1,2,2)
> b <- c(2,1,4,3)
> c <- c(2,1,4,6)
I want to compare each element of X with the corresponding elements of a, b and c, and finally I need a class assigned to each element of X. For example:
The first element of X is 1 and it has a match in corresponding element vector a, then I need to assign a class as '1-1' (no matter from which vector it got the match)
The second element of X is 1 and it also has match (in fact 3) so, again the class is '1-1'
The third element of X is 3 and it doesn't have a match then I should look for next integer value, which is 4 and there is 4 (in b and c). So the class should be '3-4'
The fourth element of X is 4 and it doesn't have a match. Also there is no 5 (next integer) then it should look for the previous integer which is 3 and there is 3. So the class should be '4-3'
I actually have thousands of rows for each vector and I have to do this for each row. Any suggestions for doing it in a less complicated way? I would prefer to use base R functions.
Based on rbatt's comment and answer I realized my original answer was quite lacking. Here's a redo...
match_nearest <- function( x, table )
{
dist <- x - table
tgt <- which( dist < 0, arr.ind=TRUE, useNames=F )
dist[tgt] <- abs( dist[tgt] + .5 )
table[ cbind( seq_along(x), max.col( -dist, ties.method="first" ) ) ]
}
X <- c(1,1,3,4)
a <- c(1,1,2,2)
b <- c(2,1,4,3)
c <- c(2,1,4,6)
paste(X, match_nearest(X, cbind(a,b,c) ), sep="-")
## [1] "1-1" "1-1" "3-4" "4-3"
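The tie-breaking in match_nearest comes from the 0.5 shift applied to negative distances. Here is a trace of that step for the fourth element of the example (X = 4 with candidates 2, 3 and 6), written with which.min on a single row instead of max.col on the whole matrix. The variable names are illustrative.

```r
x <- 4
candidates <- c(a = 2, b = 3, c = 6)

dist <- x - candidates                    # 2, 1, -2
above <- dist < 0                         # candidates larger than x
dist[above] <- abs(dist[above] + 0.5)     # -2 becomes 1.5

# ranking after the shift: exact match (0), one above (0.5),
# one below (1), two above (1.5), two below (2), ...
picked <- candidates[which.min(dist)]
picked
```

The candidate 3 wins with an adjusted distance of 1, which gives the class "4-3" from the expected output.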
Compared to the original answer and rbatt's we find neither was correct!
set.seed(1)
X <- rbinom(n=1E4, size=10, prob=0.5)
a <- rbinom(n=1E4, size=10, prob=0.5)
b <- rbinom(n=1E4, size=10, prob=0.5)
c <- rbinom(n=1E4, size=10, prob=0.5)
T <- current_solution(X,a,b,c)
R <- rbatt_solution(X,a,b,c)
all.equal( T, R )
## [1] "195 string mismatches"
# Look at mismatched rows...
mismatch <- head( which( T != R ) )
cbind(X,a,b,c)[mismatch,]
##      X a b c
## [1,] 4 6 3 3
## [2,] 5 7 4 7
## [3,] 5 8 3 9
## [4,] 5 7 7 4
## [5,] 4 6 3 7
## [6,] 5 7 4 2
T[mismatch]
## [1] "4-3" "5-4" "5-3" "5-4" "4-3" "5-4"
R[mismatch]
## [1] "4-6" "5-7" "5-8" "5-7" "4-6" "5-7"
and needlessly slow...
library(microbenchmark)
bm <- microbenchmark( current_solution(X,a,b,c),
previous_solution(X,a,b,c),
rbatt_solution(X,a,b,c) )
print(bm, order="median")
## Unit: milliseconds
##                           expr    min     lq  median      uq    max neval
##   current_solution(X, a, b, c)  7.088  7.298   7.996   8.268  38.25   100
##     rbatt_solution(X, a, b, c) 33.920 38.236  46.524  53.441  85.50   100
##  previous_solution(X, a, b, c) 83.082 93.869 101.997 115.961 135.98   100
Looks like the current_solution is getting it right; but without an expected output ...
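To see where the two weightings part ways, here is the first mismatched row (X = 4 against 6, 3, 3) worked by hand. The variable names are only for illustration; the two formulas are the ones used in rbatt_solution and match_nearest.

```r
x    <- 4
cand <- c(a = 6, b = 3, c = 3)
d    <- x - cand                       # -2  1  1

# rbatt_solution weighting: 2 - 0.5, 1 + 0.5, 1 + 0.5 -> a three-way tie at 1.5
rb <- abs(d) + sign(d) * 0.5
# match_nearest weighting: |-2 + 0.5| = 1.5, 1, 1 -> the 3 in column b wins
mn <- ifelse(d < 0, abs(d + 0.5), d)

cand[which.min(rb)]   # which.min takes the first of the tied values: a = 6
cand[which.min(mn)]   # b = 3
```

So the "4-6" comes from a tie between a candidate two above X and a candidate one below X, which which.min resolves by column order.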
Here are the functions...
current_solution <- function(X,a,b,c) {
paste(X, match_nearest(X, cbind(a,b,c) ), sep="-")
}
# DO NOT USE... it is wrong!
previous_solution <- function(X,a,b,c) {
dat <- rbind(X,a,b,c)
v <- apply(dat,2, function(v) {
v2 <- v[1] - v
v2[v2<0] <- abs( v2[v2<0]) - 1
v[ which.min( v2[-1] ) + 1 ]
})
paste(X, v, sep="-")
}
# DO NOT USE... it is wrong!
rbatt_solution <- function(X,a,b,c) {
mat <- cbind(X,a,b,c)
diff.signed <- mat[,"X"]-mat[,c("a","b","c")]
diff.break <- abs(diff.signed) + sign(diff.signed)*0.5
min.ind <- apply(diff.break, 1, which.min)
ind.array <- matrix(c(1:nrow(mat),min.ind), ncol=2)
match.value <- mat[,c("a","b","c")][ind.array]
ref.class <- paste(X, match.value, sep="-")
ref.class
}
This solution should provide the output you want. Also, it is ~ 3x faster than Thell's solution, because the differences are vectorized and are not calculated row-wise with apply.
I compare times for the two approaches below. Note that if you want the "class" as another column in a data.frame, just uncomment the last line of my function. I commented it out to make the calculation times between the two answers more comparable (creating a data.frame is quite slow).
# Example data from Thell, plus 1 more
X1 <- c(1,1,3,4,7,1, 5)
a1 <- c(1,1,2,2,2,2, 9)
b1 <- c(2,1,4,3,3,3, 3)
c1 <- c(2,1,4,6,6,6, 7)
# Random example data, much larger
# X1 <- rbinom(n=1E4, size=10, prob=0.5)
# a1 <- rbinom(n=1E4, size=10, prob=0.5)
# b1 <- rbinom(n=1E4, size=10, prob=0.5)
# c1 <- rbinom(n=1E4, size=10, prob=0.5)
My answer:
rbTest <- function(){
mat <- cbind(X1,a1,b1,c1)
diff.signed <- mat[,"X1"]-mat[,c("a1","b1","c1")] # differences (with sign)
diff.break <- abs(diff.signed) + sign(diff.signed)*0.5 # add 0.5 when the candidate is below X, subtract 0.5 when it is above (break ties by preferring higher integer)
min.ind <- apply(diff.break, 1, which.min) # index of smallest difference (prefer larger integers when there is a tie)
ind.array <- matrix(c(1:nrow(mat),min.ind), ncol=2) # array index format
match.value <- mat[,c("a1","b1","c1")][ind.array] # value of the smallest difference (value of the match)
ref.class <- paste(X1, match.value, sep="-") # the 'class' in the format 'ref-match'
ref.class
# data.frame(class=ref.class, mat)
}
Thell's answer:
thTest <- function(){
dat <- rbind(X1,a1,b1,c1)
apply(dat,2, function(v) {
# Get distance
v2 <- v[1] - v
# Prefer values >= v[1]
v2[v2<0] <- abs( v2[v2<0]) - 1
# Obtain and return nearest v excluding v[1]
v[ which.min( v2[-1] ) + 1 ]
})
}
Benchmark on large matrix (10,000 rows)
# > microbenchmark(rbTest(), thTest())
# Unit: milliseconds
# expr min lq median uq max neval
# rbTest() 47.95451 52.01729 59.36161 71.94076 103.1314 100
# thTest() 167.49798 180.69627 195.02828 204.19916 315.0610 100
Benchmark on small matrix (7 rows)
# > microbenchmark(rbTest(), thTest())
# Unit: microseconds
# expr min lq median uq max neval
# rbTest() 108.299 112.3550 115.4225 119.4630 146.722 100
# thTest() 147.727 152.2015 155.9005 159.3115 235.898 100
Example output (small matrix):
# > rbTest()
# [1] "1-1" "1-1" "3-4" "4-3" "7-6" "1-2" "5-7"
# > thTest()
# [1] 1 1 4 3 6 2 7
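One possible further tweak, not benchmarked here: the remaining apply(diff.break, 1, which.min) call could in principle be replaced by max.col() on the negated matrix, which finds the row-wise minimum without a per-row function call. A small sketch on made-up data m (not the data above):

```r
set.seed(42)
# Made-up matrix; values differ by at least 1 or are exactly equal
m <- matrix(sample(0:10, 30, replace = TRUE) + 0.5, ncol = 3)

by_apply  <- apply(m, 1, which.min)                # row-wise, first minimum
by_maxcol <- max.col(-m, ties.method = "first")    # same idea without apply

all(by_apply == by_maxcol)
```

Note that max.col() treats entries within a small relative tolerance of the row maximum as ties, so the two can disagree when values in a row are nearly but not exactly equal.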
The function below calculates the mean of a vector. However, it first checks the proportion of NA's present in the vector and, if that is above a given threshold, returns NA instead of the mean.
My issue is that my current implementation is rather inefficient. It takes more than 7x longer than simply running mean(vec, na.rm=TRUE).
I tried an alternate method using na.omit, but that is even slower.
Given the size of my data, executing the single lapply takes over 40 minutes.
Any suggestions on how to accomplish the same task more quickly?
UPDATE - RE: @thelatemail's solution and @Arun's comment:
I am executing this function over several hundred groups, each of varying size. The sample data (originally) provided in this question was a neat data frame simply for ease of creating artificial data.
Alternate sample data to avoid the confusion:
# Sample Data
# ------------
set.seed(1)
# slightly different sizes for each group
N1 <- 5e3
N2 <- N1 + as.integer(rnorm(1, 0, 100))
# One group has only a moderate amount of NA's
SAMP1 <- rnorm(N1)
SAMP1[sample(N1, .25 * N1, FALSE)] <- NA # add in NA's
# Another group has many NA's
SAMP2 <- rnorm(N2)
SAMP2[sample(N2, .95 * N2, FALSE)] <- NA # add in large number of NA's
# put them all in a list
SAMP.NEW <- list(SAMP1, SAMP2)
# keep it clean
rm(SAMP1, SAMP2)
# Execute
# -------
lapply(SAMP.NEW, meanIfThresh)
Original Sample Data, function etc
# Sample Data
# ------------
set.seed(1)
rows <- 20000 # actual data has more than 7M rows
cols <- 1000
SAMP <- replicate(cols, rnorm(rows))
SAMP[sample(length(SAMP), .25 * length(SAMP), FALSE)] <- NA # add in NA's
# Select 5 random rows, and have them be 90% NA
tooSparse <- sample(rows, 5)
for (r in tooSparse)
SAMP[r, sample(cols, cols * .9, FALSE)] <- NA
# Function
# ------------
meanIfThresh <- function(vec, thresh=12/15) {
  # Calculates the mean of vec, ignoring NA's. However, if the
  # proportion of NA's in vec is greater than thresh, returns NA.
  # thresh : the largest proportion of NA's that is still accepted.
  #   example: if I'm looking at 14 days, and I have 12 NA's,
  #     my proportion is 85.7 % = (12 / 14)
  #     default thresh is 80.0 % = (12 / 15)
  #   Thus, 12 NAs in a group of 14 would be rejected
  len <- length(vec)
  # if the proportion of NA's is greater than the threshold, return NA
  if( (sum(is.na(vec)) / len) > thresh)
    return(NA_real_)
  # else, calculate the mean, removing NA's
  return(mean(vec, na.rm=TRUE))
}
# Execute
# -----------------
apply(SAMP, 1, meanIfThresh)
# Compare with `mean`
#----------------
plain <- apply(SAMP, 1, mean, na.rm=TRUE)
modified <- apply(SAMP, 1, meanIfThresh)
# obviously different
identical(plain, modified)
plain[tooSparse]
modified[tooSparse]
microbenchmark( "meanIfThresh" = apply(SAMP, 1, meanIfThresh)
, "mean (regular)" = apply(SAMP, 1, mean, na.rm=TRUE)
, times = 15L)
# With the actual data, the penalty is sevenfold
# Unit: seconds
#            expr      min       lq   median       uq      max neval
#    meanIfThresh 1.658600 1.677472 1.690460 1.751913 2.110871    15
#  mean (regular) 1.422478 1.485320 1.503468 1.532175 1.547450    15
Couldn't you just replace the mean values of the high-NA rows afterwards, like so?
# changed `result <- apply(SAMP,1,mean,na.rm=TRUE)`
result <- rowMeans(SAMP, na.rm=TRUE)
NArows <- rowSums(is.na(SAMP))/ncol(SAMP) > 0.8
result[NArows] <- NA
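A tiny made-up matrix makes the masking step easy to follow. This sketch uses rowMeans(is.na(m)), which gives the same NA proportion as rowSums(is.na(m))/ncol(m); the matrix m and the variable thresh are illustrative only.

```r
# Three rows, six columns; NA proportions are 0, 0.5 and 5/6
m <- rbind(c(1,  2,  3,  4,  5,  6),
           c(1,  NA, 3,  NA, 5,  NA),
           c(NA, NA, NA, NA, NA, 7))
thresh <- 0.8

result <- rowMeans(m, na.rm = TRUE)        # 3.5, 3, 7
result[rowMeans(is.na(m)) > thresh] <- NA  # only the third row exceeds 0.8
result
## [1] 3.5 3.0  NA
```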
Some benchmarking:
Ricardo <- function(vec, thresh=12/15) {
len <- length(vec)
if( (sum(is.na(vec)) / len) > thresh)
return(NA_real_)
return(mean(vec, na.rm=TRUE))
}
DanielFischer <- function(vec, thresh=12/15) {
len <- length(vec)
nas <- is.na(vec)
Nna <- sum(nas)
if( (Nna / len) > thresh)
return(NA_real_)
return(sum(vec[!nas])/(len-Nna))
}
thelatemail <- function(mat) {
result <- rowMeans(mat, na.rm=TRUE)
NArows <- rowSums(is.na(mat))/ncol(mat) > 0.8
result[NArows] <- NA
result
}
require(microbenchmark)
microbenchmark(m1 <- apply(SAMP, 1, Ricardo),
m2 <- apply(SAMP, 1, DanielFischer),
m3 <- thelatemail(SAMP), times = 5L)
Unit: milliseconds
                                expr       min        lq    median        uq       max neval
       m1 <- apply(SAMP, 1, Ricardo) 2923.7260 2944.2599 3066.8204 3090.8127 3105.4283     5
 m2 <- apply(SAMP, 1, DanielFischer) 2643.4883 2683.1034 2755.7032 2799.5155 3089.6015     5
             m3 <- thelatemail(SAMP)  337.1862  340.6339  371.6148  376.5517  383.4436     5
all.equal(m1, m2) # TRUE
all.equal(m1, m3) # TRUE
Your function goes through the vector vec twice: once in is.na and again in mean with na.rm=TRUE. If you store the NA positions first, it might speed up your calculations a bit:
meanIfThresh2 <- function(vec, thresh=12/15) {
len <- length(vec)
nas <- is.na(vec)
Nna <- sum(nas)
if( (Nna / len) > thresh)
return(NA_real_)
return(sum(vec[!nas])/(len-Nna))
}
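Since the update in the question mentions groups of different sizes held in a list, here is a minimal sketch of applying this function to such a list. The list groups is made up for illustration, and vapply is used instead of lapply only to get a plain numeric vector back.

```r
meanIfThresh2 <- function(vec, thresh = 12/15) {
  len <- length(vec)
  nas <- is.na(vec)                 # computed once, reused below
  Nna <- sum(nas)
  if ((Nna / len) > thresh)
    return(NA_real_)
  sum(vec[!nas]) / (len - Nna)      # mean of the non-NA values
}

# Made-up groups of unequal length
groups <- list(c(1, 2, NA, 4),                # 25% NA  -> mean of 1, 2, 4
               c(NA, NA, NA, NA, NA, 6))      # 5/6 NA  -> above 0.8, so NA

out <- vapply(groups, meanIfThresh2, numeric(1))
out
## [1] 2.333333       NA
```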
EDIT: I ran a similar benchmark to see the effect of this change:
> microbenchmark( "meanIfThresh" = apply(SAMP, 1, meanIfThresh)
+ , "meanIfThresh2" = apply(SAMP, 1, meanIfThresh2)
+ , "mean (regular)" = apply(SAMP, 1, mean, na.rm=TRUE)
+ , times = 15L)
Unit: seconds
           expr      min       lq   median       uq      max neval
   meanIfThresh 2.009858 2.156104 2.158372 2.166092 2.192493    15
  meanIfThresh2 1.825470 1.828273 1.829424 1.834407 1.872028    15
 mean (regular) 1.868568 1.882526 1.889852 1.893564 1.907495    15