Efficiently modify list in R

I have a foreach loop that produces a list on each iteration, and a .combine function to merge the results that looks like this:
mergelists = function(x, xn) {
  padlen = length(x[[1]])
  for (n in names(x)[!names(x) %in% names(xn)]) xn[[n]] = 0
  for (n in names(xn)[!names(xn) %in% names(x)]) xn[[n]] = c(rep(0, padlen), xn[[n]])
  for (idx in names(xn)) { x[[idx]] = c(x[[idx]], xn[[idx]]) }
  x
}
The first two for-loops modify the new list (xn) to make it compatible with the one that gathers the results (x). The last one appends the elements of xn onto x.
I believe my code is ridiculously inefficient, because it re-allocates a lot and uses for-loops. But I can't think of a better solution. Any ideas?
Some more explanation:
I don't know the list names in advance (they are patterns from a bootstrap exercise which takes place in the foreach part).
Example:
> x
$foo
[1] 3 2
$bar
[1] 3 2
and
> xn
$foo
[1] 1
$baz
[1] 1
should join to
> x
$foo
[1] 3 2 1
$bar
[1] 3 2 0
$baz
[1] 0 0 1
That's it.
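For concreteness, the target semantics can also be written as a small reference sketch using union() over the two name sets (merged() and pad() are made-up names, not part of the original code):
merged <- function(x, xn) {
  nms <- union(names(x), names(xn))
  # pad with zeros when a name is missing from one of the lists
  pad <- function(l, n, len) if (is.null(l[[n]])) rep(0, len) else l[[n]]
  setNames(lapply(nms, function(n)
    c(pad(x, n, length(x[[1]])), pad(xn, n, length(xn[[1]])))), nms)
}
merged(list(foo = c(3, 2), bar = c(3, 2)), list(foo = 1, baz = 1))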

In my benchmarking, this approach takes longer than your approach, but since I already worked it out, I thought I'd post it anyway. Here's to doubling effort. If the names are completely unknown and you are forced to pad with zeros in the .combine function, you could try the following (perhaps try it on a subset of your iterations first to see if it works):
library(reshape2)
mergeList2 <- function(x, xn) {
  xDF <- data.frame(ID = seq_along(x[[1]]), x)
  xnDF <- data.frame(ID = seq_along(xn[[1]]) + nrow(xDF), xn)
  meltedX <- melt(xDF, id = "ID")
  meltedXN <- melt(xnDF, id = "ID")
  res <- as.list(dcast(rbind(meltedX, meltedXN), ID ~ variable,
                       fill = 0))[-1]
  return(res)
}
Your example:
mergeList2(list(foo = c(3, 2), bar = c(3, 2)),
           list(foo = 1, baz = 1))
# $foo
# [1] 3 2 1
# $bar
# [1] 3 2 0
# $baz
# [1] 0 0 1
Test it out with a foreach example:
library(foreach)
set.seed(1)
foreach(dd = 1:10, .combine = mergeList2) %do% {
  theNames <- sample(c("foo", "bar", "baz"), 2)
  ans <- as.list(rpois(2, 4))
  names(ans) <- theNames
  ans
}
# $foo
# [1] 4 7 2 4 0 2 0 4 5 3
# $baz
# [1] 7 0 0 5 3 5 3 4 0 5
# $bar
# [1] 0 5 2 0 5 0 0 0 6 0

If foo and bar exist in every list and are in order, then mapply works. As @BenBarnes suggested, having a pre-processing step to create the 0's makes this a viable option even if they do not exist everywhere. Sorting is easy. I've changed the 0's to NAs since that seems more appropriate.
# Make data
x <- list(foo=c(3,2),bar=c(6,7))
xn <- list(foo=c(1),bar=c(1),aught=c(5,2))
lol <- list(x=x,xn=xn)
# Pre-process
allnames <- sort(unique(unlist(lapply(lol, names))))
cleanlist <- function(l, allnames) {
  ret <- l[allnames]
  names(ret) <- allnames
  ret[sapply(ret, is.null)] <- NA
  ret
}
lol <- lapply(lol,cleanlist,allnames=allnames)
# Combine
do.call("mapply", c(c,lol) )
Which produces:
aught bar foo
x NA 6 3
xn1 5 7 2
xn2 2 1 1
Benchmarking
That said, if you're hoping for speed gains, the original version is still the fastest, presumably because it does the least. But the loopless approach is pretty elegant and scales to an arbitrary number of x's.
library(microbenchmark)
microbenchmark( mergelists(lol$x,lol$xn), mergeList2(lol$x,lol$xn), do.call("mapply", c(c,lol) ) )
Unit: microseconds
expr min lq median uq max
1 do.call("mapply", c(c, lol)) 155.048 159.5175 192.0635 195.5555 245.841
2 mergeList2(lol$x, lol$xn) 19938.288 20095.9905 20225.4750 20719.6730 27143.674
3 mergelists(lol$x, lol$xn) 63.416 68.1650 78.0825 84.3680 95.265
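As an aside on the scaling claim above, the same one-liner combines three lists at once after the same pre-processing (x2 here is a made-up third result):
x2 <- list(foo = 9, bar = 4)
lol3 <- list(x = x, xn = xn, x2 = x2)
allnames3 <- sort(unique(unlist(lapply(lol3, names))))
lol3 <- lapply(lol3, cleanlist, allnames = allnames3)
do.call("mapply", c(c, lol3))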

Find closest value with condition

I have a function that finds the nearest values for each row in a matrix. It then reports a list with an index of the nearest rows. However, I want it to exclude values if they are +1 in the first AND +1 in the second column away from a particular set of values (-1 in the first and -1 in the second column should also be removed). Moreover, +1 in the first column and -1 in the second column with respect to the values of interest should also be avoided.
As an example, if I want things closest to c(2, 1), it should accept c(3, 1) or c(2, 2) or c(1, 1), but NOT c(3, 2) and not c(1, 0).
Basically, for an output to be reported, either column 1 or column 2 should be a value of 1 away from a row of interest, but not both.
The input looks like this:
x
v1 v2
[1,] 3 1
[2,] 2 1
[3,] 3 2
[4,] 1 2
[5,] 8 5
myfunc(x)
The output looks like this. Notice that the closest things to row 2 ($V2 in the output) are rows 1, 3, and 4. The answer should only be 1, though.
$V1
[1] 2 3
$V2
[1] 1 3 4
$V3
[1] 1 2
$V4
[1] 2
$V5
integer(0)
Here is myfunc:
myfunc = function(t) {
  d1 <- dist(t[, 1])
  d2 <- dist(t[, 2])
  dF <- as.matrix(d1) <= 1 & as.matrix(d2) <= 1
  diag(dF) <- NA
  colnames(dF) <- NULL
  dF2 <- lapply(as.data.frame(dF), which)
  return(dF2)
}
Basically, the rows that you want to find should differ from your reference element by +1 or -1 in one column and be identical in the other column. That means that the sum over the absolute values of the differences is exactly one. For your example c(2, 1), this works as follows:
c(3, 1): difference is c(1, 0), thus sum(abs(c(1, 0))) = 1 + 0 = 1
c(1, 1): difference is c(-1, 0), thus sum(abs(c(-1, 0))) = 1 + 0 = 1
etc.
The following function checks exactly this:
myfunc <- function(x) {
  do_row <- function(r) {
    r_mat <- matrix(rep(r, length = length(x)), ncol = ncol(x), byrow = TRUE)
    abs_dist <- abs(r_mat - x)
    return(which(rowSums(abs_dist) == 1))
  }
  return(apply(x, 1, do_row))
}
do_row() does the job for a single row, and then apply() is used to do this with each row. For your example, I get:
myfunc(x)
## [[1]]
## [1] 2 3
##
## [[2]]
## [1] 1
##
## [[3]]
## [1] 1
##
## [[4]]
## integer(0)
##
## [[5]]
## integer(0)
Using sweep(), one can write a shorter function:
myfunc2 <- function(x) {
  apply(x, 1, function(r) which(rowSums(abs(sweep(x, 2, r))) == 1))
}
But this seems harder to understand, and it turns out that it is slower by about a factor of two for your matrix x. (I have also tried it with a large matrix, and there the efficiency seems about the same.)
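To unpack the one-liner: sweep() subtracts the reference row from every row (its default FUN is "-"), so rowSums(abs(...)) yields the Manhattan distances. A quick check, assuming x is the question's matrix:
r <- c(2, 1)
sweep(x, 2, r)               # each row of x minus c(2, 1)
rowSums(abs(sweep(x, 2, r))) # Manhattan distance of each row from r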

Copying dimnames without copying objects?

I could not find a prior question about this, but this one is fairly close.
Often I make new objects and want them to have the same dimnames (names, colnames, rownames) as some other object. Normally, I would use names, or rownames + colnames, but I'm tired of doing this and I want a better solution. I also want a solution that allows for partial matching, so I need a new function. My trouble is that it is apparently not quite easy to get it exactly right.
First, a helper function:
get_dims = function(x) {
  if (is.null(dim(x))) {
    return(length(x))
  } else {
    return(dim(x))
  }
}
This gets the dimensions of any object. dim() returns NULL for objects without a dim attribute (plain vectors and lists), whereas for this purpose it really should just return their length.
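A quick illustration of that behavior:
dim(1:5)                        # NULL: a plain vector has no dim attribute
get_dims(1:5)                   # 5: falls back to length()
get_dims(matrix(1:6, nrow = 2)) # 2 3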
Next, we make up some minimal test data:
t = matrix(1:9, nrow=3)
t2 = t
rownames(t) = LETTERS[1:3]; colnames(t) = letters[1:3]
Inspect:
> t
a b c
A 1 4 7
B 2 5 8
C 3 6 9
> t2
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
The test is that t2 should get the dimnames of t. I print them because == apparently cannot handle list comparison (returns logical(0)).
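As an aside, identical() handles the list comparison that == cannot, so the test can be automated:
identical(dimnames(t), dimnames(t2)) # FALSE until the dimnames are copied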
A simple solution is to take in the object whose names I want to copy, the object where I want to copy them to, and simply change the dimnames in the function and return the object back. This can be done like this:
copy_names1 = function(x, y, partialmatching = T) {
  #find object dimensions
  x_dims = get_dims(x)
  y_dims = get_dims(y)
  #set names if matching dims
  if (all(x_dims == y_dims)) {
    #loop over each dimension
    for (dim in 1:length(dimnames(x))) {
      dimnames(y)[[dim]] <- dimnames(x)[[dim]]
    }
  }
  return(y)
}
Test:
> copy_names1(t, t2)
a b c
A 1 4 7
B 2 5 8
C 3 6 9
So it works fine, but it returns the object, which means one has to use the assignment operator; that is not needed with the normal *names() functions.
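As a side note, the familiar replacement syntax can be recovered by defining a replacement function; a minimal sketch (copy_dimnames<- is a made-up name, and R still copies y behind the scenes):
`copy_dimnames<-` = function(y, value) {
  # assumes matching dimensions, as in copy_names1
  dimnames(y) <- dimnames(value)
  y
}
copy_dimnames(t2) <- t # same effect as t2 = copy_names1(t, t2)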
We can also assign from within the function using assign():
copy_names2 = function(x, y, partialmatching = T) {
  #find object dimensions
  x_dims = get_dims(x)
  y_dims = get_dims(y)
  #what is the object in y parameter?
  y_obj_name = deparse(substitute(y))
  #set names if matching dims
  if (all(x_dims == y_dims)) {
    #loop over each dimension
    for (dim in 1:length(dimnames(x))) {
      dimnames(y)[[dim]] <- dimnames(x)[[dim]]
    }
  }
  #assign in the outer envir
  assign(y_obj_name, pos = 1, value = y)
}
Test:
> copy_names2(t, t2)
> t2
a b c
A 1 4 7
B 2 5 8
C 3 6 9
It also works: it does not require using the assignment operator and returns silently. However, it does copy the object in RAM (I think), which is bad when using large objects. It would be better to call dimnames on the existing object without copying it. So I try that:
copy_names3 = function(x, y, partialmatching = T) {
  #find object dimensions
  x_dims = get_dims(x)
  y_dims = get_dims(y)
  #what is the object in y parameter?
  y_obj_name = deparse(substitute(y))
  get(y_obj_name, pos = -1) #test that it works
  #set names if matching dims
  if (all(x_dims == y_dims)) {
    #loop over each dimension
    for (dim in 1:length(dimnames(x))) {
      dimnames(get(y_obj_name, pos = -1))[[dim]] <- dimnames(x)[[dim]]
    }
  }
}
Test:
> copy_names3(t, t2)
Error in dimnames(get(y_obj_name, pos = -1))[[dim]] <- dimnames(x)[[dim]] :
could not find function "get<-"
A very cryptic error! According to the previous question, get() cannot be used like this because it only fetches values, it does not assign them. The answer there says to use assign() instead. However, in the documentation for assign() we find:
assign does not dispatch assignment methods, so it cannot be used to
set elements of vectors, names, attributes, etc.
How does one copy dimnames without copying objects with a function?
I'm not sure how the "partial matching" is supposed to work, but maybe this:
t = matrix(1:9, nrow=3)
t2 = t
t2 <- rbind(t2, 11:13)
rownames(t) = LETTERS[1:3]; colnames(t) = letters[1:3]
d <- dim(t) == dim(t2)
dimnames(t2)[d] <- dimnames(t)[d]
t2
# a b c
#[1,] 1 4 7
#[2,] 2 5 8
#[3,] 3 6 9
#[4,] 11 12 13
Edit:
Here is how you can do this from inside a "setter" function without eval(parse(...)):
t = matrix(1:9, nrow=3)
t2 = t
t2 <- rbind(t2, 11:13)
rownames(t) = LETTERS[1:3]; colnames(t) = letters[1:3]
fun <- function(x, template, ...) {
  y <- substitute(x)
  z <- substitute(template)
  d <- dim(x) == dim(template)
  expr <- bquote(dimnames(.(y))[.(d)] <- dimnames(.(z))[.(d)])
  eval(expr, ...)
  invisible(NULL)
}
fun(t2, t, .GlobalEnv)
t2
# a b c
#[1,] 1 4 7
#[2,] 2 5 8
#[3,] 3 6 9
#[4,] 11 12 13
Of course, if you need something really fast, you need to implement it in C (as was done with the dimnames<- function).
One solution is to run the dimnames call in the parent environment instead of inside the function. It can be done like this:
copy_names4 = function(x, y, partialmatching = T) {
  library(stringr)
  #find object dimensions
  x_dims = get_dims(x)
  y_dims = get_dims(y)
  #what is the object in y parameter?
  x_obj_name = deparse(substitute(x))
  y_obj_name = deparse(substitute(y))
  #set names if matching dims
  if (all(x_dims == y_dims)) {
    #loop over each dimension
    for (dim in 1:length(dimnames(x))) {
      str_call = str_c("dimnames(", y_obj_name, ")[[", dim, "]] <- dimnames(", x_obj_name, ")[[", dim, "]]")
      eval(parse(text = str_call), parent.frame(1))
    }
  }
}
Test it:
> t2
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
> copy_names4(t, t2)
> t2
a b c
A 1 4 7
B 2 5 8
C 3 6 9
Success!
But is it faster?
library(microbenchmark)
microbenchmark(copy_names1 = {t2 = copy_names1(t, t2)},
               copy_names2 = copy_names2(t, t2),
               copy_names4 = copy_names4(t, t2))
Results:
Unit: microseconds
expr min lq mean median uq max neval
copy_names1 8.778 10.6795 14.57945 11.9960 15.653 46.812 100
copy_names2 24.869 27.7950 38.62004 33.7925 39.937 202.168 100
copy_names4 466.067 478.9405 507.48058 494.4460 514.488 840.559 100
Surprisingly, the initial version was much faster, by about 40-50 times. However, the last one should be faster for large objects. Let's try a larger test:
#larger test
t = matrix(1:9000000, nrow=3000)
t2 = t
rownames(t) = sample(LETTERS[1:26], size = 3000, replace = T); colnames(t) = sample(letters[1:26], size = 3000, replace = T)
t[1:5, 1:5]
t2[1:5, 1:5]
microbenchmark(copy_names1 = {t2 = copy_names1(t, t2)},
               copy_names2 = copy_names2(t, t2),
               copy_names4 = copy_names4(t, t2))
Results:
Unit: milliseconds
expr min lq mean median uq max neval
copy_names1 4.146032 4.442115 33.09852 12.14201 13.00495 242.2970 100
copy_names2 4.229708 4.553877 41.39389 12.23739 20.12995 229.4899 100
copy_names4 5.104497 5.499469 44.42764 13.24267 21.41507 228.7731 100
Now they are about equally fast, though the first two are still slightly faster.

Counting column data in a matrix with resets

I'm gathering data on how much my cats poop into a matrix:
m <- cbind(fluffy=c(1.1,1.2,1.3,1.4),misterCuddles=c(0.9,NA,1.1,1.0))
row.names(m) <- c("2013-01-01", "2013-01-02", "2013-01-03","2013-01-04")
Which gives me this:
fluffy misterCuddles
2013-01-01 1.1 0.9
2013-01-02 1.2 NA
2013-01-03 1.3 1.1
2013-01-04 1.4 1.0
On every date, I'd like to know how many days in a row each cat has gone number 2. So the resulting matrix should look like this:
fluffy misterCuddles
2013-01-01 1 1
2013-01-02 2 0
2013-01-03 3 1
2013-01-04 4 2
Is there a way to do this efficiently? The cumsum function does something similar, but that's a primitive so I can't modify it to suit my dirty, dirty needs.
I could run a for loop and store a count like so:
m.output <- matrix(nrow=nrow(m), ncol=ncol(m))
for (column in 1:ncol(m)) {
  sum <- 0
  for (row in 1:nrow(m)) {
    if (is.na(m[row, column])) sum <- 0
    else sum <- sum + 1
    m.output[row, column] <- sum
  }
}
Is this the most efficient way to do this? I have a lot of cats, and I've recorded years' worth of poop data. Can I parallelize this by column somehow?
All of the answers here are actually too complicated (including my own, from earlier, copied below). The Reduce family of answers is just masking a for-loop in a single function call. I like Roland's and Ananda's, but both I think have a little too much going on.
Thus, here's a simple vectorized solution:
reset <- function(x) {
  s <- seq_along(x)
  s[!is.na(x)] <- 0          # s holds the index at NA positions, 0 elsewhere
  seq_along(x) - cummax(s)   # position minus index of the most recent NA
}
> apply(m, 2, reset)
fluffy misterCuddles
[1,] 1 1
[2,] 2 0
[3,] 3 1
[4,] 4 2
It also works on Roland's example:
m2 <- cbind(fluffy=c(NA,1.1,1.2,1.3,1.4,1.0,2),
            misterCuddles=c(NA,1.3,2,NA,NA,1.1,NA))
> apply(m2, 2, reset)
fluffy misterCuddles
[1,] 0 0
[2,] 1 1
[3,] 2 2
[4,] 3 0
[5,] 4 0
[6,] 5 1
[7,] 6 0
From earlier: this is not vectorized, but also works:
pooprun <- function(x) {
  z <- numeric(length = length(x))
  count <- 0
  for (i in 1:length(x)) {
    if (is.na(x[i]))
      count <- 0
    else
      count <- count + 1
    z[i] <- count
  }
  return(z)
}
apply(m, 2, pooprun)
> apply(m, 2, pooprun)
fluffy misterCuddles
[1,] 1 1
[2,] 2 0
[3,] 3 1
[4,] 4 2
THE BENCHMARKING
Here I simply wrap everyone's answers in a function call (based on their name).
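The wrapper definitions themselves are not shown in the post; presumably each one fixes the example matrix as input, along these hypothetical lines:
# hypothetical wrapper shape; the actual definitions were not posted
wrapOne <- function() apply(m, 2, reset)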
> library(microbenchmark)
> microbenchmark(alexis(), hadley(), thomas(), matthew(), thomasloop(), usobi(), ananda(), times=1000)
Unit: microseconds
expr min lq median uq max neval
alexis() 1.540 4.6200 5.3890 6.1590 372.185 1000
hadley() 87.755 92.758 94.298 96.6075 1767.012 1000
thomas() 92.373 99.6860 102.7655 106.6140 315.223 1000
matthew() 128.168 136.2505 139.7150 145.4880 5196.344 1000
thomasloop() 133.556 141.6390 145.1030 150.4920 84131.427 1000
usobi() 148.182 159.9210 164.7320 174.1620 5010.445 1000
ananda() 720.507 742.4460 763.6140 801.3335 5858.733 1000
And here are the results for Roland's example data:
> microbenchmark(alexis(), hadley(), thomas(), matthew(), thomasloop(), usobi(), ananda(), times=1000)
Unit: microseconds
expr min lq median uq max neval
alexis() 2.310 5.3890 6.1590 6.9290 75.438 1000
hadley() 75.053 78.902 80.058 83.136 1747.767 1000
thomas() 90.834 97.3770 100.2640 104.3050 358.329 1000
matthew() 139.715 149.7210 154.3405 161.2680 5084.728 1000
thomasloop() 144.718 155.4950 159.7280 167.4260 5182.103 1000
usobi() 177.048 188.5945 194.3680 210.9180 5360.306 1000
ananda() 705.881 729.9370 753.4150 778.8175 8226.936 1000
Note: Alexis's and Hadley's solutions took quite a while to actually define as functions on my machine, whereas the others work out-of-the-box, but Alexis's is otherwise the clear winner.
This should work. Note that each of your cats is an independent individual, so you can turn your data frame into a list and use mclapply (from the parallel package), which takes a parallel approach.
count <- function(y, x) {
  if (is.na(x)) return(0)
  return(y + 1)
}
oneCat = m[, 1]
Reduce(count, oneCat, init = 0, accumulate = TRUE)[-1]
EDIT: Here is the full answer:
count <- function(x, y) {
  if (is.na(y)) return(0)
  return(x + 1)
}
library(parallel) # mclapply lives in the parallel package
mclapply(as.data.frame(m), Reduce, f = count, init = 0, accumulate = TRUE)
EDIT 2: The main problem is that I get an extra 0 at the beginning of each column, so...
result = mclapply(as.data.frame(m),Reduce,f=count,init=0,accumulate=TRUE)
finalResult = do.call('cbind',result)[-1,]
rownames(finalResult) = rownames(m)
does the job.
Another option, similar to @Usobi's in that it uses Reduce, but with a slightly different approach:
apply(!is.na(m), 2, Reduce, f=function(x,y) if (y) x + y else y, accumulate=TRUE)
# fluffy misterCuddles
# [1,] 1 1
# [2,] 2 0
# [3,] 3 1
# [4,] 4 2
I had saved a snippet from here that translates almost exactly for a problem like this:
countReset <- function(x) {
  x[!is.na(x)] <- 1
  y <- ave(x, rev(cumsum(rev(is.na(x)))), FUN = cumsum)
  y[is.na(y)] <- 0
  y
}
apply(m, 2, countReset)
# fluffy misterCuddles
# 2013-01-01 1 1
# 2013-01-02 2 0
# 2013-01-03 3 1
# 2013-01-04 4 2
Since I'm in a period where I'm trying to get used to .Call, here's another idea that seems to work and -probably- is fast. (Don't take my word for it, though, my skills are not trustworthy!!):
library(inline) #use "inline" package for convenience
f <- cfunction(sig = c(R_mat = "numeric", R_dims = "integer"), body = '
R_len_t *dims = INTEGER(R_dims);
R_len_t rows = dims[0], cols = dims[1];
double *mat = REAL(R_mat);
SEXP ans;
PROTECT(ans = allocMatrix(INTSXP, rows, cols));
R_len_t *pans = INTEGER(ans);
for(int ic = 0; ic < cols; ic++)
{
pans[0 + ic*rows] = ISNA(mat[0 + ic*rows]) ? 0 : 1;
for(int ir = 1; ir < rows; ir++)
{
if(ISNA(mat[ir + ic*rows]))
{
pans[ir + ic*rows] = 0;
}else
{
if(!ISNA(mat[(ir - 1) + ic*rows]))
{
pans[ir + ic*rows] = pans[(ir - 1) + ic*rows] + 1;
}else
{
pans[ir + ic*rows] = 1;
}
}
}
}
UNPROTECT(1);
return(ans);
')
f(m, dim(m))
# [,1] [,2]
#[1,] 1 1
#[2,] 2 0
#[3,] 3 1
#[4,] 4 2
f(mm, dim(mm)) #I named Roland's matrix, mm ; I felt that I had to pass this test!
# [,1] [,2]
#[1,] 0 0
#[2,] 1 1
#[3,] 2 2
#[4,] 3 0
#[5,] 4 0
#[6,] 5 1
#[7,] 6 0
So the solution to this problem has two parts:
A function that accepts a vector per cat and returns a vector telling me at each date, how many days since the last NA
A function that accepts an NxM matrix and returns an NxM matrix, applying function (1) to each column
For (2), I adapted this from @Usobi's answer:
library(parallel) # for mclapply
daysSinceLastNA <- function(matrix, vectorFunction, cores = 1) {
  listResult <- mclapply(as.data.frame(matrix), vectorFunction, mc.cores = cores)
  result <- do.call('cbind', listResult)
  rownames(result) <- rownames(matrix)
  result
}
For (1), I have two solutions:
# @ananda-mahto's solution:
daysSinceLastNA_1 <- function(vector) {
  vector[!is.na(vector)] <- 1
  result <- ave(vector, rev(cumsum(rev(is.na(vector)))), FUN = cumsum)
  result[is.na(result)] <- 0
  result
}
# @Usobi's solution:
daysSinceLastNA_2 <- function(vector) {
  reduction <- function(total, additional) ifelse(is.na(additional), 0, total + 1)
  Reduce(reduction, vector, init = 0, accumulate = TRUE)[-1]
}
Then I call them like this:
> system.time(result1 <- daysSinceLastNA (test, daysSinceLastNA_1 ))
user system elapsed
5.40 0.01 5.42
> system.time(result2 <- daysSinceLastNA (test, daysSinceLastNA_2 ))
user system elapsed
58.02 0.00 58.03
On my test dataset, which is roughly a 2500x2500 matrix, the first approach is an order of magnitude faster.
If I run on Linux with 64 cores, solution (1) runs in 2 seconds, and solution (2) runs in 6 seconds.
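Presumably the parallel runs were invoked along these lines (a hypothetical call; test is the poster's 2500x2500 matrix, which is not shown):
library(parallel)
# hypothetical: `test` is not included in the post
result1 <- daysSinceLastNA(test, daysSinceLastNA_1, cores = 64)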
For this sort of problem, which is easily solved with a for loop, I find Rcpp a very natural answer.
library(Rcpp)
cppFunction("NumericVector cumsum2(NumericVector x) {
  int n = x.length();
  NumericVector out(n);
  for (int i = 0; i < n; ++i) {
    if (NumericVector::is_na(x[i])) {
      out[i] = 0;                           // an NA resets the run
    } else {
      out[i] = (i == 0) ? 1 : out[i - 1] + 1;
    }
  }
  return out;
}")
The code requires a little more bookkeeping than the equivalent R code, but the bulk of the function is a very simple for loop.
You can then apply it in R like any other vectorised function:
m2 <- cbind(
fluffy=c(NA,1.1,1.2,1.3,1.4,1.0,2),
misterCuddles=c(NA,1.3,2,NA,NA,1.1,NA)
)
apply(m2, 2, cumsum2)
You could of course make the C++ code iterate over the columns of the matrix, but I think that since this is already easily expressed in R, you might as well use the built in tools.
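For completeness, a sketch of that column-iterating variant (cumsum2_mat is a made-up name; same reset logic, just looping over columns in C++):
cppFunction("NumericMatrix cumsum2_mat(NumericMatrix x) {
  int nr = x.nrow(), nc = x.ncol();
  NumericMatrix out(nr, nc);
  for (int j = 0; j < nc; ++j) {
    for (int i = 0; i < nr; ++i) {
      if (NumericVector::is_na(x(i, j))) {
        out(i, j) = 0;                      // an NA resets the run
      } else {
        out(i, j) = (i == 0) ? 1 : out(i - 1, j) + 1;
      }
    }
  }
  return out;
}")
cumsum2_mat(m2) # no apply() needed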

How could I make this R snippet faster and more R-ish?

Coming from various other languages, I find R powerful and intuitive, but I am not thrilled with its performance. So I decided to try to improve a snippet I wrote and learn how to code better in R.
Here's a function I wrote, trying to determine if a vector is binary-valued (two distinct values or just one value) or not:
isBinaryVector <- function(v) {
  if (length(v) == 0) {
    return (c(0, 1))
  }
  a <- v[1]
  b <- a
  lapply(v, function(x) { if (x != a && x != b) { if (a != b) { return (c()) } else { b = x } } })
  if (a < b) {
    return (c(a, b))
  } else {
    return (c(b, a))
  }
}
EDIT: This function is expected to look through a vector, then return c() if it is not binary-valued, and return c(a, b) if it is, a being the smaller value and b the larger one (if a == b, then just c(a, a)). E.g., for
A B C
1 1 1 0
2 2 2 0
3 3 1 0
I will lapply this isBinaryVector and get:
$A
[1] 1 1
$B
[1] 1 1
$C
[1] 0 0
The time it took on a moderate-sized dataset (about 1800 * 3500, with 2/3 of the columns binary-valued) is about 15 seconds. The set contains only floating-point numbers.
Is there any way I could do this faster?
Thanks for any input!
You are essentially trying to write a function that returns TRUE if a vector has exactly two unique values, and FALSE otherwise.
Try this:
> dat <- data.frame(
+ A = 1:3,
+ B = c(1, 2, 1),
+ C = 0
+ )
>
> sapply(dat, function(x)length(unique(x))==2)
A B C
FALSE TRUE FALSE
Next, you want to get the min and max value. The function range does this. So:
> sapply(dat, range)
A B C
[1,] 1 1 0
[2,] 3 2 0
And there you have all the ingredients to make a small function that is easy to understand and should be extremely quick, even on large amounts of data:
isBinary <- function(x) length(unique(x)) == 2
binaryValues <- function(x) {
  if (isBinary(x)) range(x) else NA
}
sapply(dat, binaryValues)
sapply(dat, binaryValues)
$A
[1] NA
$B
[1] 1 2
$C
[1] NA
This function returns true or false for vectors (or columns of a data frame):
is.binary <- function(v) {
  x <- unique(v)
  length(x) - sum(is.na(x)) == 2L
}
Also take a look at this post.
I'd use something like that to get column indices:
bivalued <- apply(my.data.frame, 2, is.binary)
nominal <- my.data.frame[,!bivalued]
binary <- my.data.frame[,bivalued]
Sample data:
my.data.frame <- data.frame(c(0,1), rnorm(100), c(5, 19), letters[1:5], c('a', 'b'))
> apply(my.data.frame, 2, is.binary)
c.0..1. rnorm.100. c.5..19. letters.1.5. c..a....b..
TRUE FALSE TRUE FALSE TRUE
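As a side note, sapply() works column-by-column and avoids the character-matrix coercion that apply() performs on data frames:
sapply(my.data.frame, is.binary)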

Aggregate rows in a large matrix by rowname

I would like to aggregate the rows of a matrix by adding the values in rows that have the same rowname. My current approach is as follows:
> M
a b c d
1 1 1 2 0
1 2 3 4 2
2 3 0 1 2
3 4 2 5 2
> index <- as.numeric(rownames(M))
> M <- cbind(M,index)
> Dfmat <- data.frame(M)
> Dfmat <- aggregate(. ~ index, data = Dfmat, sum)
> M <- as.matrix(Dfmat)
> rownames(M) <- M[,"index"]
> M <- subset(M, select= -index)
> M
a b c d
1 3 4 6 2
2 3 0 1 2
3 4 2 5 2
The problem with this approach is that I need to apply it to a number of very large matrices (up to 1,000 rows and 30,000 columns). In these cases the computation time is very high (same problem when using ddply). Is there a more efficient way to come up with the solution? Does it help that the original input matrices are DocumentTermMatrix objects from the tm package? As far as I know, they are stored in a sparse matrix format.
Here's a solution using by and colSums, but requires some fiddling due to the default output of by.
M <- matrix(1:9,3)
rownames(M) <- c(1,1,2)
t(sapply(by(M,rownames(M),colSums),identity))
V1 V2 V3
1 3 9 15
2 3 6 9
There is now an aggregate function in Matrix.utils. This can accomplish what you want with a single line of code and is about 10x faster than the combineByRow solution and 100x faster than the by solution:
N <- 10000
m <- matrix( runif(N*100), nrow=N)
rownames(m) <- sample(1:(N/2),N,replace=T)
> microbenchmark(a<-t(sapply(by(m,rownames(m),colSums),identity)),b<-combineByRow(m),c<-aggregate.Matrix(m,row.names(m)),times = 10)
Unit: milliseconds
expr min lq mean median uq max neval
a <- t(sapply(by(m, rownames(m), colSums), identity)) 6000.26552 6173.70391 6660.19820 6419.07778 7093.25002 7723.61642 10
b <- combineByRow(m) 634.96542 689.54724 759.87833 732.37424 866.22673 923.15491 10
c <- aggregate.Matrix(m, row.names(m)) 42.26674 44.60195 53.62292 48.59943 67.40071 70.40842 10
> identical(as.vector(a),as.vector(c))
[1] TRUE
EDIT: Frank is right, rowsum is somewhat faster than any of these solutions. You would want to consider one of the other functions only if you were using a Matrix, especially a sparse one, or if you were performing an aggregation other than sum.
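For reference, rowsum() is in base R and sums rows by group directly; on the question's matrix M:
rowsum(M, group = rownames(M))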
The answer by James works as expected, but is quite slow for large matrices. Here is a version that avoids creating new objects:
combineByRow <- function(m) {
  m <- m[order(rownames(m)), ]
  ## keep track of previous row name
  prev <- rownames(m)[1]
  i.start <- 1
  i.end <- 1
  ## cache the rownames -- profiling shows that it takes
  ## forever to look at them
  m.rownames <- rownames(m)
  stopifnot(all(!is.na(m.rownames)))
  ## go through matrix in a loop, as we need to combine some unknown
  ## set of rows
  for (i in 2:(1 + nrow(m))) {
    curr <- m.rownames[i]
    ## if we found a new row name (or are at the end of the matrix),
    ## combine all rows and mark invalid rows
    if (prev != curr || is.na(curr)) {
      if (i.start < i.end) {
        m[i.start, ] <- apply(m[i.start:i.end, ], 2, max)
        m.rownames[(1 + i.start):i.end] <- NA
      }
      prev <- curr
      i.start <- i
    } else {
      i.end <- i
    }
  }
  m[which(!is.na(m.rownames)), ]
}
Testing it shows that it is about 10x faster than the answer using by (2 vs. 20 seconds in this example):
N <- 10000
m <- matrix( runif(N*100), nrow=N)
rownames(m) <- sample(1:(N/2),N,replace=T)
start <- proc.time()
m1 <- combineByRow(m)
print(proc.time()-start)
start <- proc.time()
m2 <- t(sapply(by(m,rownames(m),function(x) apply(x, 2, max)),identity))
print(proc.time()-start)
all(m1 == m2)
