I could not find a prior question about this, but this one is fairly close.
Often I make new objects and want them to have the same dimnames (names, colnames, rownames) as some other object. Normally I would use names(), or rownames() plus colnames(), but I'm tired of doing this and want a better solution. I also want a solution that allows for partial matching, so I need a new function. My trouble is that getting it exactly right turns out not to be easy.
First, a helper function:
get_dims = function(x) {
if (is.null(dim(x))) {
return(length(x))
} else {
return(dim(x))
}
}
This gets the dimensions of any object. dim() returns NULL for objects that have no dim attribute (plain vectors and lists), whereas I think it really should just return their length.
Next, we make up some minimal test data:
t = matrix(1:9, nrow=3)
t2 = t
rownames(t) = LETTERS[1:3]; colnames(t) = letters[1:3]
Inspect:
> t
a b c
A 1 4 7
B 2 5 8
C 3 6 9
> t2
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
The test is that t2 should get the dimnames of t. I print them because == apparently cannot handle this list comparison: dimnames(t2) is NULL, so the comparison returns logical(0).
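As a side note, here is a minimal sketch of how the check could be automated instead of printed. identical() compares two objects as a whole and always returns a single TRUE or FALSE. The names m and m2 are stand-ins for t and t2, chosen here only to avoid masking base::t().

```r
# m and m2 are hypothetical stand-ins for t and t2 from the test data
m  <- matrix(1:9, nrow = 3, dimnames = list(LETTERS[1:3], letters[1:3]))
m2 <- matrix(1:9, nrow = 3)

# == is elementwise; comparing against NULL dimnames gives a zero-length result
length(dimnames(m) == dimnames(m2))

# identical() compares the whole objects and returns one logical value
before <- identical(dimnames(m), dimnames(m2))
dimnames(m2) <- dimnames(m)
after <- identical(dimnames(m), dimnames(m2))
c(before = before, after = after)
```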
A simple solution is to take in the object whose names I want to copy, the object where I want to copy them to, and simply change the dimnames in the function and return the object back. This can be done like this:
copy_names1 = function(x, y, partialmatching = T) {
#find object dimensions
x_dims = get_dims(x)
y_dims = get_dims(y)
#set names if matching dims
if (all(x_dims == y_dims)) {
#loop over each dimension
for (dim in seq_along(dimnames(x))) {
dimnames(y)[[dim]] <- dimnames(x)[[dim]]
}
}
return(y)
}
Test:
> copy_names1(t, t2)
a b c
A 1 4 7
B 2 5 8
C 3 6 9
So it works fine, but returns the object, which means one has to use the assignment operator, which is not needed with the normal *names() functions.
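For comparison, the built-in *names() setters avoid the explicit reassignment because they are replacement functions. A minimal sketch of the same idea follows; the name `copy_names<-` and the objects m and m2 are made up for illustration, and the function only handles the exact-match case (no partial matching).

```r
# Hypothetical replacement function: called as copy_names(target) <- template
`copy_names<-` <- function(x, value) {
  # only copy when both objects have the same dimensions
  if (identical(dim(x), dim(value))) {
    dimnames(x) <- dimnames(value)
  }
  x  # a replacement function must return the modified object
}

m  <- matrix(1:9, nrow = 3, dimnames = list(LETTERS[1:3], letters[1:3]))
m2 <- matrix(1:9, nrow = 3)

copy_names(m2) <- m
m2
```

R rewrites the call as m2 <- `copy_names<-`(m2, value = m), so the assignment still happens, just not in the user's code.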
We can also assign from within the function using assign():
copy_names2 = function(x, y, partialmatching = T) {
#find object dimensions
x_dims = get_dims(x)
y_dims = get_dims(y)
#what is the object in y parameter?
y_obj_name = deparse(substitute(y))
#set names if matching dims
if (all(x_dims == y_dims)) {
#loop over each dimension
for (dim in seq_along(dimnames(x))) {
dimnames(y)[[dim]] <- dimnames(x)[[dim]]
}
}
#assign in the global environment (pos = 1)
assign(y_obj_name, pos = 1, value = y)
}
Test:
> copy_names2(t, t2)
> t2
a b c
A 1 4 7
B 2 5 8
C 3 6 9
It also works: it does not require using the assignment operator and returns silently. However, it does copy the object in RAM (I think) which is bad when using large objects. It would be better to call dimnames on the existing object without copying it. So I try that:
copy_names3 = function(x, y, partialmatching = T) {
#find object dimensions
x_dims = get_dims(x)
y_dims = get_dims(y)
#what is the object in y parameter?
y_obj_name = deparse(substitute(y))
get(y_obj_name, pos = -1) #test that it works
#set names if matching dims
if (all(x_dims == y_dims)) {
#loop over each dimension
for (dim in seq_along(dimnames(x))) {
dimnames(get(y_obj_name, pos = -1))[[dim]] <- dimnames(x)[[dim]]
}
}
}
Test:
> copy_names3(t, t2)
Error in dimnames(get(y_obj_name, pos = -1))[[dim]] <- dimnames(x)[[dim]] :
could not find function "get<-"
A very cryptic error! According to the previous question, get() cannot be used like this because it only fetches values; it does not assign them. The answer there suggests using assign() instead. However, in the documentation for assign() we find:
assign does not dispatch assignment methods, so it cannot be used to
set elements of vectors, names, attributes, etc.
How does one copy dimnames without copying objects with a function?
I'm not sure how the "partial matching" is supposed to work, but maybe this:
t = matrix(1:9, nrow=3)
t2 = t
t2 <- rbind(t2, 11:13)
rownames(t) = LETTERS[1:3]; colnames(t) = letters[1:3]
d <- dim(t) == dim(t2)
dimnames(t2)[d] <- dimnames(t)[d]
t2
# a b c
#[1,] 1 4 7
#[2,] 2 5 8
#[3,] 3 6 9
#[4,] 11 12 13
Edit:
Here is how you can do this from inside a "setter" function without eval(parse(...)):
t = matrix(1:9, nrow=3)
t2 = t
t2 <- rbind(t2, 11:13)
rownames(t) = LETTERS[1:3]; colnames(t) = letters[1:3]
fun <- function(x, template, ...) {
y <- substitute(x)
z <- substitute(template)
d <- dim(x) == dim(template)
expr <- bquote(dimnames(.(y))[.(d)] <- dimnames(.(z))[.(d)])
eval(expr, ...)
invisible(NULL)
}
fun(t2, t, .GlobalEnv)
t2
# a b c
#[1,] 1 4 7
#[2,] 2 5 8
#[3,] 3 6 9
#[4,] 11 12 13
Of course, if you need something really fast, you need to implement it in C (as was done with the dimnames<- function).
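To see what the bquote() step builds, here is a small sketch outside of any function. The object names target and template are made up for this illustration; inside fun() the symbols come from substitute() instead of quote().

```r
template <- matrix(1:9, nrow = 3, dimnames = list(LETTERS[1:3], letters[1:3]))
target   <- rbind(matrix(1:9, nrow = 3), 11:13)

# which dimensions agree? here only the number of columns
d <- dim(target) == dim(template)

# bquote() splices the symbols and the logical vector into one assignment call
expr <- bquote(dimnames(.(quote(target)))[.(d)] <- dimnames(.(quote(template)))[.(d)])
print(expr)

# evaluating the call modifies target in the environment where it is evaluated
eval(expr)
colnames(target)
```

Because the expression is built from language objects rather than pasted strings, no parsing of text is involved.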
One solution is to run the dimnames call in the parent environment instead of inside the function. It can be done like this:
copy_names4 = function(x, y, partialmatching = T) {
library(stringr)
#find object dimensions
x_dims = get_dims(x)
y_dims = get_dims(y)
#what is the object in y parameter?
x_obj_name = deparse(substitute(x))
y_obj_name = deparse(substitute(y))
#set names if matching dims
if (all(x_dims == y_dims)) {
#loop over each dimension
for (dim in seq_along(dimnames(x))) {
str_call = str_c("dimnames(", y_obj_name, ")[[", dim, "]] <- dimnames(" ,x_obj_name, ")[[", dim, "]]")
eval(parse(text = str_call), parent.frame(1))
}
}
}
Test it:
> t2
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
> copy_names4(t, t2)
> t2
a b c
A 1 4 7
B 2 5 8
C 3 6 9
Success!
But is it faster?
library(microbenchmark)
microbenchmark(copy_names1 = {t2 = copy_names1(t, t2)},
copy_names2 = copy_names2(t, t2),
copy_names4 = copy_names4(t, t2))
Results:
Unit: microseconds
expr min lq mean median uq max neval
copy_names1 8.778 10.6795 14.57945 11.9960 15.653 46.812 100
copy_names2 24.869 27.7950 38.62004 33.7925 39.937 202.168 100
copy_names4 466.067 478.9405 507.48058 494.4460 514.488 840.559 100
Surprisingly, the initial version was much faster, by about 40-50 times. However, the last one should be faster for large objects. Let's try a larger test:
#larger test
t = matrix(1:9000000, nrow=3000)
t2 = t
rownames(t) = sample(LETTERS[1:26], size = 3000, replace = T); colnames(t) = sample(letters[1:26], size = 3000, replace = T)
t[1:5, 1:5]
t2[1:5, 1:5]
microbenchmark(copy_names1 = {t2 = copy_names1(t, t2)},
copy_names2 = copy_names2(t, t2),
copy_names4 = copy_names4(t, t2))
Results:
Unit: milliseconds
expr min lq mean median uq max neval
copy_names1 4.146032 4.442115 33.09852 12.14201 13.00495 242.2970 100
copy_names2 4.229708 4.553877 41.39389 12.23739 20.12995 229.4899 100
copy_names4 5.104497 5.499469 44.42764 13.24267 21.41507 228.7731 100
Now they are about equally fast, though the first two are still slightly faster.
I want to build a data frame with three columns:
value holds a label for the number n,
factorial holds factorial(n), computed by a recursive function,
previous.sum holds the sum of the previous values of the factorials.
I wrote a recursive function that successfully generates the value and factorial columns, but I am still struggling with the sum column.
Thanks
The R code is below:
fact <- function(n, x){
if (n<=1){
return (n)
} else {
n*fact(n-1)
}
}
recurDf <- function(n, df){
if (n<=1){
df <- rbind (df, data.frame("value" = paste('Value', n) , "factorial" = n, "previous.sum" = 1) )
return (df)
} else {
if(is.null(df)) {
#df <- data.frame(matrix(ncol = 3, nrow = 0))
#colnames(df) <- c("value", "factorial", "previous.sum")
df <- data.frame("value"= 'va', "factorial" =0, "previous.sum" = 0)
}
rbind (recurDf(n-1,df), data.frame("value" = paste('Value', n) , "factorial" = fact(n), "previous.sum" = sum(recurDf(n-1,df)$factorial) ))
}
}
recurDf(4, NULL)
The following returns the factorial of n in its first component and the cumulative sum of all factorials up to n in its second component.
fact2 <- function(n) {
if (n <= 1) c(1,1)
else {
prev <- Recall(n-1)
n * prev[1] + c(0, prev[2])
}
}
fact2(1)
## [1] 1 1
fact2(2)
## [1] 2 3
fact2(3)
## [1] 6 9
fact2(4)
## [1] 24 33
cbind(1:4, t(sapply(1:4, fact2)))
## [,1] [,2] [,3]
## [1,] 1 1 1
## [2,] 2 2 3
## [3,] 3 6 9
## [4,] 4 24 33
Is there a reason you need to do this recursively?
There are much simpler ways to get to your answer.
recurDf <- function(n){
df <- data.frame("value" = c(paste('Value',1:n)) , "factorial" = c(1:n))
df$factorial <- factorial(df$factorial)
df$previous.sum <- cumsum(df$factorial)
return (df)
}
recurDf(4)
This returns
value factorial previous.sum
1 Value 1 1 1
2 Value 2 2 3
3 Value 3 6 9
4 Value 4 24 33
Say I have a function for subsetting (this is just a minimal example):
f <- function(x, ind = seq(length(x))) {
x[ind]
}
(Note: one could use only seq(x) instead of seq(length(x)), but I don't find it very clear.)
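A short sketch of how the three spellings differ in edge cases, in case it helps. seq_along() is not used in the post; it is shown here as the usual alternative. x1 is a made-up length-one vector.

```r
x1 <- 10                     # a vector of length one

seq(x1)                      # counts from 1 up to the value 10
seq(length(x1))              # 1
seq_along(x1)                # 1

# for an empty vector, seq(length(...)) counts down from 1 to 0
seq(length(integer(0)))
seq_along(integer(0))        # integer(0)
```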
So, if
x <- 1:5
ind <- c(2, 4)
ind2 <- which(x > 5) # integer(0)
I have the following results:
f(x)
[1] 1 2 3 4 5
f(x, ind)
[1] 2 4
f(x, -ind)
[1] 1 3 5
f(x, ind2)
integer(0)
f(x, -ind2)
integer(0)
For the last result, we would have wanted to get all of x, but this is a common cause of error (as mentioned in the book Advanced R).
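The reason is that negating an empty vector gives back an empty vector, so x[-ind2] is just x[integer(0)]. A minimal sketch, with a logical mask shown as one possible way around it (the mask is an illustration, not part of f or f2):

```r
x <- 1:5
empty <- which(x > 5)               # integer(0)

identical(-empty, empty)            # negation leaves an empty vector unchanged
x[-empty]                           # so nothing is selected

# a logical mask does not have this problem
keep <- !(seq_along(x) %in% empty)
x[keep]

# and it still removes indices when there are some
x[!(seq_along(x) %in% c(2, 4))]
```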
So, if I want to make a function for removing indices, I use:
f2 <- function(x, ind.rm) {
f(x, ind = `if`(length(ind.rm) > 0, -ind.rm, seq(length(x))))
}
Then I get what I wanted:
f2(x, ind)
[1] 1 3 5
f2(x, ind2)
[1] 1 2 3 4 5
My question is:
Can I do something cleaner that doesn't need to pass seq(length(x)) explicitly in f2, but instead directly uses the default value of f's parameter ind when ind.rm is integer(0)?
If you anticipate having "empty" negative indices a lot, you can get a performance improvement for these cases if you can avoid the indexing used by x[seq(x)] as opposed to just x. In other words, if you are able to combine f and f2 into something like:
new_f <- function(x, ind.rm){
if(length(ind.rm)) x[-ind.rm] else x
}
There will be a huge speedup in the case of empty negative indices.
n <- 1000000L
x <- 1:n
ind <- seq(0L,n,2L)
ind2 <- which(x>n+1) # integer(0)
library(microbenchmark)
microbenchmark(
f2(x, ind),
new_f(x, ind),
f2(x, ind2),
new_f(x, ind2)
)
all.equal(f2(x, ind), new_f(x, ind)) # TRUE - same result at about same speed
all.equal(f2(x, ind2), new_f(x, ind2)) # TRUE - same result at much faster speed
Unit: nanoseconds
expr min lq mean median uq max neval
f2(x, ind) 6223596 7377396.5 11039152.47 9317005 10271521 50434514 100
new_f(x, ind) 6190239 7398993.0 11129271.17 9239386 10202882 59717093 100
f2(x, ind2) 6823589 7992571.5 11267034.52 9217149 10568524 63417978 100
new_f(x, ind2) 428 1283.5 5414.74 6843 7271 14969 100
What you have isn't bad, but if you want to avoid passing the default value of a default argument you could restructure like this:
f2 <- function(x, ind.rm) {
`if`(length(ind.rm) > 0, f(x,-ind.rm), f(x))
}
which is slightly shorter than what you have.
On Edit
Based on the comments, it seems you want to be able to pass a function nothing (rather than simply not pass at all), so that it uses the default value. You can do so by writing a function which is set up to receive nothing, also known as NULL. You can rewrite your f as:
f <- function(x, ind = NULL) {
if(is.null(ind)){ind <- seq(length(x))}
x[ind]
}
NULL functions as a flag which tells the receiving function to use a default value for the parameter, although that default value must be set in the body of the function.
Now f2 can be rewritten as
f2 <- function(x, ind.rm) {
f(x, ind = `if`(length(ind.rm) > 0, -ind.rm, NULL))
}
This is slightly more readable than what you have, but at the cost of making the original function slightly longer.
To implement "parameter1 = if(cond1) then value1 else default_value_of_param1", I used formals to get default parameters as a call:
f <- function(x, ind.row = seq_len(nrow(x)), ind.col = seq_len(ncol(x))) {
x[ind.row, ind.col]
}
f2 <- function(x, ind.row.rm = integer(0), ind.col.rm = integer(0)) {
f.args <- formals(f)
f(x,
ind.row = `if`(length(ind.row.rm) > 0, -ind.row.rm, eval(f.args$ind.row)),
ind.col = `if`(length(ind.col.rm) > 0, -ind.col.rm, eval(f.args$ind.col)))
}
Then:
> x <- matrix(1:6, 2)
> f2(x, 1:2)
[,1] [,2] [,3]
> f2(x, , 1:2)
[1] 5 6
> f2(x, 1, 2)
[1] 2 6
> f2(x, , 1)
[,1] [,2]
[1,] 3 5
[2,] 4 6
> f2(x, 1, )
[1] 2 4 6
> f2(x)
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
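A small sketch of why the eval() calls work: formals() hands back each default as an unevaluated call, and eval() then looks up x in the environment it is called from. Inside f2 that lookup only succeeds because f2's own argument is also named x. The one-argument f below is a cut-down version for illustration.

```r
f <- function(x, ind.row = seq_len(nrow(x))) x[ind.row, , drop = FALSE]

# the default is stored as an unevaluated call, not as a value
default_expr <- formals(f)$ind.row
is.call(default_expr)

# eval() resolves x in the calling environment, so x must exist there
x <- matrix(1:6, 2)
eval(default_expr)
```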
I have a list of vectors of variable length, for example:
q <- list(c(1,3,5), c(2,4), c(1,3,5), c(2,5), c(7), c(2,5))
I need to count the number of occurrences for each of the vectors in the list, for example (any other suitable datastructure acceptable):
list(list(c(1,3,5), 2), list(c(2,4), 1), list(c(2,5), 2), list(c(7), 1))
Is there an efficient way to do this? The actual list has tens of thousands of items so quadratic behaviour is not feasible.
match and unique accept and handle "list"s too (?match warns for being slow on "list"s). So, with:
match(q, unique(q))
#[1] 1 2 1 3 4 3
each element is mapped to a single integer. Then:
tabulate(match(q, unique(q)))
#[1] 2 1 2 1
And find a structure to present the results:
as.data.frame(cbind(vec = unique(q), n = tabulate(match(q, unique(q)))))
# vec n
#1 1, 3, 5 2
#2 2, 4 1
#3 2, 5 2
#4 7 1
As an alternative to the match(x, unique(x)) approach, we could map each element to a single value by deparsing:
table(sapply(q, deparse))
#
# 7 c(1, 3, 5) c(2, 4) c(2, 5)
# 1 2 1 2
Also, since this is a case with unique integers, and assuming they fall in a small range, we could map each element to a single integer after transforming each element to a binary representation:
n = max(unlist(q))
pow2 = 2 ^ (0:(n - 1))
sapply(q, function(x) tabulate(x, nbins = n)) # 'binary' form
sapply(q, function(x) sum(tabulate(x, nbins = n) * pow2))
#[1] 21 10 21 18 64 18
and then tabulate as before.
And just to compare the above alternatives:
f1 = function(x)
{
ux = unique(x)
i = match(x, ux)
cbind(vec = ux, n = tabulate(i))
}
f2 = function(x)
{
xc = sapply(x, deparse)
i = match(xc, unique(xc))
cbind(vec = x[!duplicated(i)], n = tabulate(i))
}
f3 = function(x)
{
n = max(unlist(x))
pow2 = 2 ^ (0:(n - 1))
v = sapply(x, function(X) sum(tabulate(X, nbins = n) * pow2))
i = match(v, unique(v))
cbind(vec = x[!duplicated(v)], n = tabulate(i))
}
q2 = rep_len(q, 1e3)
all.equal(f1(q2), f2(q2))
#[1] TRUE
all.equal(f2(q2), f3(q2))
#[1] TRUE
microbenchmark::microbenchmark(f1(q2), f2(q2), f3(q2))
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# f1(q2) 7.980041 8.161524 10.525946 8.291678 8.848133 178.96333 100 b
# f2(q2) 24.407143 24.964991 27.311056 25.514834 27.538643 45.25388 100 c
# f3(q2) 3.951567 4.127482 4.688778 4.261985 4.518463 10.25980 100 a
Another interesting alternative is based on ordering. R > 3.3.0 has a grouping function, built off data.table, which, along with the ordering, provides some attributes for further manipulation:
Make all elements of equal length and "transpose" (probably the slowest operation in this case, though I'm not sure how else to feed grouping):
n = max(lengths(q))
qq = .mapply(c, lapply(q, "[", seq_len(n)), NULL)
Use ordering to group similar elements mapped to integers:
gr = do.call(grouping, qq)
e = attr(gr, "ends")
i = rep(seq_along(e), c(e[1], diff(e)))[order(gr)]
i
#[1] 1 2 1 3 4 3
then, tabulate as before.
To continue the comparisons:
f4 = function(x)
{
n = max(lengths(x))
x2 = .mapply(c, lapply(x, "[", seq_len(n)), NULL)
gr = do.call(grouping, x2)
e = attr(gr, "ends")
i = rep(seq_along(e), c(e[1], diff(e)))[order(gr)]
cbind(vec = x[!duplicated(i)], n = tabulate(i))
}
all.equal(f3(q2), f4(q2))
#[1] TRUE
microbenchmark::microbenchmark(f1(q2), f2(q2), f3(q2), f4(q2))
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# f1(q2) 7.956377 8.048250 8.792181 8.131771 8.270101 21.944331 100 b
# f2(q2) 24.228966 24.618728 28.043548 25.031807 26.188219 195.456203 100 c
# f3(q2) 3.963746 4.103295 4.801138 4.179508 4.360991 35.105431 100 a
# f4(q2) 2.874151 2.985512 3.219568 3.066248 3.186657 7.763236 100 a
In this comparison q's elements are of small length to accommodate f3, but f3 (because of large exponentiation) and f4 (because of mapply) will suffer, in performance, if "list"s of larger elements are used.
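Besides speed, the powers-of-two encoding used in f3 has a correctness limit that is worth keeping in mind: doubles represent integers exactly only up to 2^53, so once the largest value in the list exceeds 53, distinct vectors can map to the same key. A minimal sketch, where key() is a hypothetical helper that repeats the encoding step of f3:

```r
# beyond 2^53 adjacent integers are no longer distinguishable as doubles
(2^53 + 1) == 2^53

# key() is a made-up helper reproducing the encoding used in f3
key <- function(x, n) sum(tabulate(x, nbins = n) * 2^(0:(n - 1)))

# small values: different vectors get different keys
key(c(1, 2), 54) == key(2, 54)

# values up to 54: c(1, 54) and 54 collide
key(c(1, 54), 54) == key(54, 54)
```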
One way is to paste each vector, unlist and tabulate, i.e.
table(unlist(lapply(q, paste, collapse = ',')))
#1,3,5 2,4 2,5 7
# 2 1 2 1
This question is related to my previous one. Here is a small sample dataset. I have used both data.table and data.frame to find a faster solution.
library(data.table)
test.dt <- data.table(strt=c(1,1,2,3,5,2), end=c(2,1,5,5,5,4), a1.2=c(1,2,3,4,5,6),
a2.3=c(2,4,6,8,10,12), a3.4=c(3,1,2,4,5,1), a4.5=c(5,1,15,10,12,10),
a5.6=c(4,8,2,1,3,9))
test.dt[,rown:=as.numeric(row.names(test.dt))]
test.df <- data.frame(strt=c(1,1,2,3,5,2), end=c(2,1,5,5,5,4), a1.2=c(1,2,3,4,5,6),
a2.3=c(2,4,6,8,10,12), a3.4=c(3,1,2,4,5,1), a4.5=c(5,1,15,10,12,10),
a5.6=c(4,8,2,1,3,9))
test.df$rown <- as.numeric(row.names(test.df))
> test.df
strt end a1.2 a2.3 a3.4 a4.5 a5.6 rown
1 1 2 1 2 3 5 4 1
2 1 1 2 4 1 1 8 2
3 2 5 3 6 2 15 2 3
4 3 5 4 8 4 10 1 4
5 5 5 5 10 5 12 3 5
6 2 4 6 12 1 10 9 6
I want to use the start and end column values to determine the range of columns to subset (columns from a1.2 to a5.6) and obtain the mean. For example, in the first row, since strt=1 and end=2, I need to get the mean of a1.2 and a2.3; in the third row, I need to get the mean of a2.3, a3.4, a4.5, and a5.6
The output should be a vector like this
> k
1 2 3 4 5 6
1.500000 2.000000 6.250000 5.000000 3.000000 7.666667
Here, is what I tried:
Solution 1: This uses the data.table and applies a function over it.
func.dt <- function(rown, x, y) {
tmp <- paste0("a", x, "." , x+1)
tmp1 <- paste0("a", y, "." , y+1)
rowMeans(test.dt[rown,get(tmp):get(tmp1), with=FALSE])
}
k <- test.dt[, func.dt(rown, strt, end), by=.(rown)]
Solution 2: This uses the data.frame and applies a function over it.
func.df <- function(rown, x, y) {
rowMeans(test.df[rown,(x+2):(y+2), drop=FALSE])
}
k1 <- mapply(func.df, test.df$rown, test.df$strt, test.df$end)
Solution 3: This uses the data.frame and loops through it.
test.ave <- rep(NA, length(test.df$strt))
for (i in 1 : length(test.df$strt)) {
test.ave[i] <- rowMeans(test.df[i, as.numeric(test.df[i,1]+2):as.numeric(test.df[i,2]+2), drop=FALSE])
}
Benchmarking shows that Solution 2 is the fastest.
test replications elapsed relative user.self sys.self user.child sys.child
1 sol1 100 0.67 4.786 0.67 0 NA NA
2 sol2 100 0.14 1.000 0.14 0 NA NA
3 sol3 100 0.15 1.071 0.16 0 NA NA
But, this is not good enough for me. Given the size of my data, these functions would need to run for a few days before I get the output. I am sure that I am not fully utilizing the power of data.table and I also know that my functions are crappy (they refer to the dataset in the global environment without passing it). Unfortunately, I am out of my depth and do not know how to fix these issues and make my functions fast. I would greatly appreciate any suggestions that help in improving my function(s) or point to alternate solutions.
I was curious how fast I could make this without resorting to writing custom C or C++ code. The best I could come up with is below. Note that using mean.default will provide greater precision, since it does a second pass over the data for error correction.
f_jmu <- compiler::cmpfun({function(m) {
# remove start/end columns from 'm' matrix
ma <- m[,-(1:2)]
# column index for each row in 'ma' matrix
cm <- col(ma)
# logical index of whether we need the column for each row
i <- cm >= m[,1L] & cm <= m[,2L]
# multiply the input matrix by the index matrix and sum it
# divide by the sum of the index matrix to get the mean
rowSums(i*ma) / rowSums(i)
}})
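To make the masking idea easier to follow, here is the same computation step by step on the six-row sample from the question (without the rown column). The variable names m, ma, keep and res are chosen for this sketch.

```r
m <- cbind(strt = c(1, 1, 2, 3, 5, 2), end = c(2, 1, 5, 5, 5, 4),
           a1.2 = c(1, 2, 3, 4, 5, 6), a2.3 = c(2, 4, 6, 8, 10, 12),
           a3.4 = c(3, 1, 2, 4, 5, 1), a4.5 = c(5, 1, 15, 10, 12, 10),
           a5.6 = c(4, 8, 2, 1, 3, 9))

ma <- m[, -(1:2)]                # value columns only

# TRUE where the column number lies between strt and end of that row
keep <- col(ma) >= m[, 1] & col(ma) <= m[, 2]
keep

# masked row sums divided by the number of selected columns per row
res <- rowSums(keep * ma) / rowSums(keep)
res
```

The result agrees with the vector k requested in the question.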
The Rcpp function is still faster (not surprisingly), but the function above gets respectably close. Here's an example on 50 million observations on my laptop with an i7-4600U and 12GB of RAM.
set.seed(21)
N <- 5e7
test.df <- data.frame(strt = 1L,
end = sample(5, N, replace = TRUE),
a1.2 = sample(3, N, replace = TRUE),
a2.3 = sample(7, N, replace = TRUE),
a3.4 = sample(14, N, replace = TRUE),
a4.5 = sample(8, N, replace = TRUE),
a5.6 = sample(30, N, replace = TRUE))
test.df$strt <- pmax(1L, test.df$end - sample(3, N, replace = TRUE) + 1L)
test.m <- as.matrix(test.df)
Also note that I take care to ensure that test.m is an integer matrix. That helps reduce the memory footprint, which can help make things faster.
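A quick way to see the difference in footprint, as a rough sketch on made-up data (exact sizes depend on the platform):

```r
n <- 1e5
m_int <- matrix(sample(5L, n, replace = TRUE), ncol = 5)  # integer storage
m_dbl <- m_int + 0                                        # same values as doubles

typeof(m_int)
typeof(m_dbl)

# integers take 4 bytes per element, doubles 8
size_int <- as.numeric(object.size(m_int))
size_dbl <- as.numeric(object.size(m_dbl))
c(integer = size_int, double = size_dbl)
```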
R> system.time(st1 <- MYrcpp(test.m))
user system elapsed
0.900 0.216 1.112
R> system.time(st2 <- f_jmu(test.m))
user system elapsed
6.804 0.756 7.560
R> identical(st1, st2)
[1] TRUE
Unless you can think of a way to do this with a clever subsetting approach, I think you've reached R's speed barrier. You'll want to use a low-level language like C++ for this problem. Fortunately, the Rcpp package makes interfacing with C++ in R simple. Disclaimer: I've never written a single line of C++ code in my life. This code may be very inefficient.
library(Rcpp)
cppFunction('NumericVector MYrcpp(NumericMatrix x) {
int nrow = x.nrow(), ncol = x.ncol();
NumericVector out(nrow);
for (int i = 0; i < nrow; i++) {
double avg = 0;
int start = x(i,0);
int end = x(i,1);
int N = end - start + 1;
while(start<=end){
avg += x(i, start + 1);
start = start + 1;
}
out[i] = avg/N;
}
return out;
}')
For this code I'm going to pass the data.frame as a matrix (i.e. testM <- as.matrix(test.df))
Let's see if it works...
MYrcpp(testM)
[1] 1.500000 2.000000 6.250000 5.000000 3.000000 7.666667
How fast is it?
Unit: microseconds
expr min lq mean median uq max neval
f2() 1543.099 1632.3025 2039.7350 1843.458 2246.951 4735.851 100
f3() 1859.832 1993.0265 2642.8874 2168.012 2493.788 19619.882 100
f4() 281.541 315.2680 364.2197 345.328 375.877 1089.994 100
MYrcpp(testM) 3.422 10.0205 16.7708 19.552 21.507 56.700 100
Where f2(), f3() and f4() are defined as
f2 <- function(){
func.df <- function(rown, x, y) {
rowMeans(test.df[rown,(x+2):(y+2), drop=FALSE])
}
k1 <- mapply(func.df, test.df$rown, test.df$strt, test.df$end)
}
f3 <- function(){
test.ave <- rep(NA, length(test.df$strt))
for (i in 1 : length(test.df$strt)) {
test.ave[i] <- rowMeans(test.df[i,as.numeric(test.df[i,1]+2):as.numeric(test.df[i,2]+2), drop=FALSE])
}
}
f4 <- function(){
lapply(
apply(test.df,1, function(x){
x[(x[1]+2):(x[2]+2)]}),
mean)
}
That's roughly a 20x increase over the fastest.
Note: to run the above code you'll need a C++ compiler that R can access. On Windows, look into Rtools. For more on Rcpp read this
Now let's see how it scales.
N = 5e3
test.df <- data.frame(strt = 1,
end = sample(5, N, replace = TRUE),
a1.2 = sample(3, N, replace = TRUE),
a2.3 = sample(7, N, replace = TRUE),
a3.4 = sample(14, N, replace = TRUE),
a4.5 = sample(8, N, replace = TRUE),
a5.6 = sample(30, N, replace = TRUE))
test.df$rown <- as.numeric(row.names(test.df))
test.dt <- as.data.table(test.df)
testM <- as.matrix(test.df)  # assumed step: rebuild the matrix passed to MYrcpp
microbenchmark(f4(), MYrcpp(testM))
Unit: microseconds
expr min lq mean median uq max neval
f4() 88647.256 108314.549 125451.4045 120736.073 133487.5295 259502.49 100
MYrcpp(testM) 196.003 216.533 242.6732 235.107 261.0125 499.54 100
With 5e3 rows MYrcpp is now roughly 500x faster (comparing medians). This is partially due to the fact that f4() is not going to scale well, as Richard discusses in the comment. f4() is essentially invoking a nested for loop by calling an apply within an lapply. Interestingly, the C++ code is also invoking a nested loop by using a while loop inside a for loop. The speed disparity is due in large part to the fact that the C++ code is already compiled and does not need to be interpreted into something the machine can understand at run time.
I'm not sure how big your data set is, but when I run MYrcpp on a data.frame with 1e7 rows, which is the largest data.frame I could allocate on my crummy laptop, it ran in 500 milliseconds.
Update: R equivalent of C++ code
MYr <- function(x){
nrow <- nrow(x)
ncol <- ncol(x)
out <- numeric(nrow)  # plain numeric vector, like NumericVector out(nrow)
for(i in 1:nrow){
avg <- 0
start <- x[i,1]
end <- x[i,2]
N <- end - start + 1
while(start<=end){
avg <- avg + x[i, start + 2]
start = start + 1
}
out[i] <- avg/N
}
out
}
Both MYrcpp and MYr are similar in many ways. Let me discuss a couple of the differences
The first line of MYrcpp is different from the MYr. In words the first line of MYrcpp, NumericVector MYrcpp(NumericMatrix x), means that we are defining a function whose name is MYrcpp which returns an output of class NumericVector and takes an input x of class NumericMatrix.
In C++ you have to declare the type of a variable when you introduce it, i.e. int nrow = x.nrow() declares a variable named nrow of type int (i.e. integer) and assigns it x.nrow(), the number of rows of x. (IGNORE if you're overwhelmed: nrow() is a method for instances of class NumericMatrix. Like in Python, you call a method by attaching it to the instance. The closest R equivalents are S3 and S4 methods.)
When you subset in C++ you use () instead of [] like in R. Also, indexing begins at zero (like in Python). For example, x(0,1) in C++ is equivalent to x[1,2] in R
++ is an operator that means increment by 1, i.e. j++ is the same as j = j + 1. += is an operator that means add and assign, i.e. a += b is the same as a = a + b.
My solution is the first one in the benchmark
library(microbenchmark)
microbenchmark(
lapply(
apply(test.df,1, function(x){
x[(x[1]+2):(x[2]+2)]}),
mean),
test.dt[, func.dt(rown, strt, end), by=.(rown)]
)
min lq mean median uq max neval
138.654 175.7355 254.6245 201.074 244.810 3702.443 100
4243.641 4747.5195 5576.3399 5252.567 6247.201 8520.286 100
It seems to be 25 times faster, but this is a small dataset. I am sure there is a better way to do this than what I have done.
I have a foreach loop that produces a list in each iteration, and a .combine function to combine them that looks like this:
mergelists = function(x,xn) {
padlen = length(x[[1]])
for (n in names(x)[!names(x) %in% names(xn)]) xn[[n]] = 0
for (n in names(xn)[!names(xn) %in% names(x)]) xn[[n]] = c(rep(0,padlen), xn[[n]])
for (idx in names(xn)) { x[[idx]] = c( x[[idx]], xn[[idx]] ) }
x
}
The first two for-loops modify the new list (xn) to make it compatible with the one that gathers the results (x). The last one joins x and xn onto x.
I believe my code is ridiculously inefficient, because it re-allocates a lot and uses for-loops. But I can't think of a better solution. Any ideas?
Some more explanation:
I don't know the list names in advance (they are patterns from a bootstrap exercise which takes place in the foreach part).
Example:
> x
$foo
[1] 3 2
$bar
[1] 3 2
and
> xn
$foo
[1] 1
$baz
[1] 1
should join to
> x
$foo
[1] 3 2 1
$bar
[1] 3 2 0
$baz
[1] 0 0 1
That's it.
In my benchmarking, this approach takes longer than your approach, but since I already worked it out, I thought I'd post it anyway. Here's to doubling effort. If the names are completely unknown and you are forced to pad with zeros in the .combine function, you could try the following. (perhaps try it on a subset of your iterations first to see if it works):
library(reshape2)
mergeList2 <- function(x, xn) {
xDF <- data.frame(ID = seq_along(x[[1]]), x)
xnDF <- data.frame(ID = seq_along(xn[[1]]) + nrow(xDF), xn)
meltedX <- melt(xDF, id = "ID")
meltedXN <- melt(xnDF, id = "ID")
res <- as.list(dcast(rbind(meltedX, meltedXN), ID ~ variable,
fill = 0))[-1]
return(res)
}
Your example:
mergeList2(list(foo = c(3, 2), bar = c(3, 2)),
list(foo = 1, baz= 1))
# $foo
# [1] 3 2 1
# $bar
# [1] 3 2 0
# $baz
# [1] 0 0 1
Test it out with a foreach example
library(foreach)
set.seed(1)
foreach(dd = 1:10, .combine = mergeList2) %do% {
theNames <- sample(c("foo", "bar", "baz"), 2)
ans <- as.list(rpois(2, 4))
names(ans) <- theNames
ans
}
# $foo
# [1] 4 7 2 4 0 2 0 4 5 3
# $baz
# [1] 7 0 0 5 3 5 3 4 0 5
# $bar
# [1] 0 5 2 0 5 0 0 0 6 0
If foo and bar exist in every list and are in order, then mapply works. As @BenBarnes suggested, having a pre-processing step to create the 0's makes this a viable option even if they do not exist everywhere. Sorting is easy. I've changed the 0's to NAs since that seems more appropriate.
# Make data
x <- list(foo=c(3,2),bar=c(6,7))
xn <- list(foo=c(1),bar=c(1),aught=c(5,2))
lol <- list(x=x,xn=xn)
# Pre-process
allnames <- sort(unique(unlist(lapply(lol, names))))
cleanlist <- function(l,allnames) {
ret <- l[allnames]
names(ret) <- allnames
ret[sapply(ret,is.null)] <- NA
ret
}
lol <- lapply(lol,cleanlist,allnames=allnames)
# Combine
do.call("mapply", c(c,lol) )
Which produces:
aught bar foo
x NA 6 3
xn1 5 7 2
xn2 2 1 1
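If the zero padding from the question is what you need (every vector keeps the full length, with 0 where a name was absent), the same pre-process-then-combine idea can be sketched as below. merge_padded is a made-up name, and the sketch assumes all vectors within one list share the same length, as in the question's example.

```r
# Hypothetical helper: pad missing names with zeros, then concatenate pairwise
merge_padded <- function(x, xn) {
  len_x  <- if (length(x))  length(x[[1]])  else 0L
  len_xn <- if (length(xn)) length(xn[[1]]) else 0L
  all_names <- union(names(x), names(xn))

  pad <- function(lst, len) {
    out <- lst[all_names]              # absent names come back as NULL
    names(out) <- all_names
    out[vapply(out, is.null, logical(1))] <- list(rep(0, len))
    out
  }

  Map(c, pad(x, len_x), pad(xn, len_xn))
}

res <- merge_padded(list(foo = c(3, 2), bar = c(3, 2)),
                    list(foo = 1, baz = 1))
res
```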
Benchmarking
That said, if you're hoping for speed gains, the original version is still the fastest, presumably because it does the least. But the loopless approach is pretty elegant and scales to an arbitrary number of x's.
library(microbenchmark)
microbenchmark( mergelists(lol$x,lol$xn), mergeList2(lol$x,lol$xn), do.call("mapply", c(c,lol) ) )
Unit: microseconds
expr min lq median uq max
1 do.call("mapply", c(c, lol)) 155.048 159.5175 192.0635 195.5555 245.841
2 mergeList2(lol$x, lol$xn) 19938.288 20095.9905 20225.4750 20719.6730 27143.674
3 mergelists(lol$x, lol$xn) 63.416 68.1650 78.0825 84.3680 95.265