What's a fast way to match the rows of two matrices (one and two) and extract the row indices of matrix two where they match? Matrix two is large (hundreds to thousands of rows).
one
     [,1] [,2]
[1,]    9   11
[2,]   13    2
head(two)
     [,1] [,2]
[1,]    9   11
[2,]   11    9
[3,]    2    3
[4,]   13    2
[5,]    2    4
[6,]    3    3
The output should be (notice that index 2 is not an output value):
1 4
One way of doing this:
a = apply(one, 1, paste0, collapse = "-")
b = apply(two, 1, paste0, collapse = "-")
match(a, b)
#[1] 1 4
We paste all the columns together row-wise for both the matrices and then match them to get the rows which are same.
Just for reference,
a
#[1] "9-11" "13-2"
b
#[1] "9-11" "11-9" "2-3" "13-2" "2-4" "3-3"
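The separator in the paste step matters. Here is a small sketch, using two made-up rows (`r1` and `r2` are not from the question), of how pasting without a separator can make different rows look identical:

```r
# Hypothetical rows, chosen only to show the collision
r1 <- c(1, 23)
r2 <- c(12, 3)

no_sep   <- c(paste0(r1, collapse = ""),  paste0(r2, collapse = ""))
with_sep <- c(paste0(r1, collapse = "-"), paste0(r2, collapse = "-"))

no_sep     # both keys are "123", so match() could not tell the rows apart
with_sep   # "1-23" and "12-3" stay distinct
```

Any separator that cannot occur inside the values themselves should work equally well.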
You could write a C++ loop to do it fairly quickly:
library(Rcpp)
cppFunction('NumericVector matrixIndex(NumericMatrix m1, NumericMatrix m2){
int m1Rows = m1.nrow();
int m2Rows = m2.nrow();
NumericVector out;
for (int i = 0; i < m1Rows; i++){
for (int j = 0; j < m2Rows; j++){
if(m1(i, 0) == m2(j, 0) && m1(i, 1) == m2(j, 1)){
//out[j] = (j+1);
out.push_back(j + 1);
}
}
}
return out;
}')
matrixIndex(one, two)
[1] 1 4
Although I suspect it would be faster to pre-allocate the result vector first, something like
cppFunction('NumericVector matrixIndex(NumericMatrix m1, NumericMatrix m2){
int m1Rows = m1.nrow();
int m2Rows = m2.nrow();
NumericVector out(m2Rows);
for (int i = 0; i < m1Rows; i++){
for (int j = 0; j < m2Rows; j++){
if(m1(i, 0) == m2(j, 0) && m1(i, 1) == m2(j, 1)){
out[j] = (j+1);
//out.push_back(j + 1);
}
}
}
return out;
}')
matrixIndex(one, two)
[1] 1 0 0 4 0 0
## 0 == nomatch.
You don't say whether by "fast" you mean compute time or person time. If this only needs doing once, the overall time is probably shortest if you optimize person time, and Ronak's answer is going to be hard to beat: it's clear and robust.
If the numbers are all less than a certain number (say, 100, as in your example data), you can do a similar thing but use arithmetic to combine the two columns together and then match. I suspect (but haven't tested) that this would be faster than converting to character vectors. There are of course other arithmetic options too depending on your circumstance.
a <- one[,1]*100 + one[,2]
b <- two[,1]*100 + two[,2]
match(a, b)
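The multiplier has to be larger than every value in the second column, otherwise two different rows can produce the same number. A minimal sketch, assuming non-negative whole numbers that are small enough to stay exact in double precision; `base` and `key` are names introduced here for illustration:

```r
one <- cbind(c(9, 13), c(11, 2))
two <- cbind(c(9, 11, 2, 13, 2, 3), c(11, 9, 3, 2, 4, 3))

# Assumption: values are non-negative integers, so max + 1 is a safe multiplier
base <- max(one, two) + 1

# Hypothetical helper: collapse each row into one number
key <- function(m) m[, 1] * base + m[, 2]

idx <- match(key(one), key(two))
idx
```

Deriving the multiplier from the data avoids having to guess an upper bound such as 100 in advance.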
We can use %in%
which(do.call(paste, as.data.frame(two)) %in% do.call(paste, as.data.frame(one)))
#[1] 1 4
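One difference between the two approaches is worth keeping in mind: `match` returns positions in the row order of `one`, while `which(... %in% ...)` returns them in the row order of `two` and reports every matching row. A sketch using the question's data with the rows of `one` swapped (`one_rev` is a name made up here):

```r
two <- cbind(c(9, 11, 2, 13, 2, 3), c(11, 9, 3, 2, 4, 3))
one_rev <- rbind(c(13, 2), c(9, 11))   # rows of `one` in reverse order

key_one <- do.call(paste, as.data.frame(one_rev))
key_two <- do.call(paste, as.data.frame(two))

by_match <- match(key_one, key_two)      # follows the row order of one_rev
by_in    <- which(key_two %in% key_one)  # follows the row order of two

by_match
by_in
```

Which of the two is appropriate depends on whether the result needs to line up with the rows of `one`.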
I need to create a function cprod that takes a numeric vector as an argument and returns a cumulative product vector of the same length. So, for cprod(c(1,2,3)), the returned vector should be c(1, 1 * 2, 1 * 2 * 3) = c(1, 2, 6).
Can this be done without cumprod? Maybe with prod or for-loop?
One option could be:
Reduce(`*`, x, accumulate = TRUE)
[1] 1 2 6
It doesn't use cumprod...
x <- c(1,2,3)
exp(cumsum(log(x)))
#> [1] 1 2 6
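Note that the log trick only works as written for positive numbers (the log of a negative number is NaN), and the result is subject to floating-point rounding. A sketch of one way to extend it, assuming the sign can be tracked separately by counting negatives; `cprod_log` is a hypothetical name:

```r
# Hypothetical helper: cumulative product via logs, tracking the sign separately
cprod_log <- function(x) {
  negatives <- cumsum(x < 0)                 # how many negative values so far
  sgn <- ifelse(negatives %% 2 == 1, -1, 1)  # odd count -> negative product
  sgn * exp(cumsum(log(abs(x))))
}

x <- c(2, -3, 4, -1)
cprod_log(x)

# Compare with a tolerance rather than ==, because of rounding in exp/log
all.equal(cprod_log(x), c(2, -6, -24, 24))
```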
Try this with a loop:
#Code
v1 <- c(1,2,3)
#Empty vector
v2 <- numeric(length(v1))
#Loop
for(i in seq_along(v1))
{
#Move around each element
e1 <- v1[1:i]
#Compute prod
vp <- prod(e1)
#Save
v2[i] <- vp
}
Output:
v2
[1] 1 2 6
Something like this?
> x <- 1:3
> cumprod(x)
[1] 1 2 6
> for (i in 2:length(x)) {
+ x[i] <- x[i-1] * x[i]
+ }
> x
[1] 1 2 6
purrr may also help
x <- 1:5
purrr::accumulate(x, ~ .x*.y)
[1] 1 2 6 24 120
An Rcpp variant.
library(Rcpp)
cppFunction('
NumericVector cumProd(NumericVector x) {
int n = x.size();
NumericVector out(n);
if (n == 0) return out;  // empty input: nothing to accumulate
out[0] = x[0];
for(int i = 1; i < n; ++i) {
out[i] = out[i - 1] * x[i];
}
return out;
}
')
cumProd(1:10)
# [1] 1 2 6 24 120 720 5040 40320 362880 3628800
stopifnot(all.equal(cumProd(1:10), cumprod(1:10)))
My data looks like this:
S
0101001010000000000000000100111100000000000011101100010101010
1001010000000001100000000100000000000100000010101110101010010
1101010101010010000000000100000000100101010010110101010101011
0000000000000000001000000111000110000000000000000000000000000
The S marks the column I am talking about, which is column 26. All four rows share a 1 at that position.
I would need to be able to count for each row from 2 to 4:
How many columns to the left and right are the same as row 1?
For row 2 it would be 3 to the right (as it reaches 1/0) and 8 to the left (as it reaches 0/1).
The result for every row should be entered into a matrix like this:
row2 8 3
row3 11 9
Is there a fast and efficient way to do that? The matrix I am dealing with is very large.
If you need something fast, you could use Rcpp:
mat <- as.matrix(read.fwf(textConnection("0101001010000000000000000100111100000000000011101100010101010
1001010000000001100000000100000000000100000010101110101010010
1101010101010010000000000100000000100101010010110101010101011
0000000000000000001000000111000110000000000000000000000000000"), widths = rep(1, 61)))
library(Rcpp)
cppFunction('
IntegerMatrix countLR(const LogicalMatrix& mat, const int S) {
const int nr(mat.nrow()), nc(mat.ncol());
IntegerMatrix res(nr - 1, 2);
for(int i=1; i<nr;i++){
for(int j=S-2; j>=0;j--) {
if (mat(0,j) != mat(i,j)) break;
else res(i-1,0)++;
}
for(int j=S; j<nc;j++) {
if (mat(0,j) != mat(i,j)) break;
else res(i-1,1)++;
}
}
return(res);
}' )
countLR(mat, 26)
# [,1] [,2]
#[1,] 8 2
#[2,] 10 2
#[3,] 6 0
I assumed that column 26 itself doesn't count for the result. I also assumed that the matrix can only contain 0/1 (i.e., boolean) values. Adjust as needed.
It's pretty easy with strsplit and rle to pull apart and assemble this data:
> S <- scan(what="") #input of character mode
1: 0101001010000000000000000100111100000000000011101100010101010
2: 1001010000000001100000000100000000000100000010101110101010010
3: 1101010101010010000000000100000000100101010010110101010101011
4: 0000000000000000001000000111000110000000000000000000000000000
5:
s2 <- strsplit(S, split="")
sapply(s2, "[[", 26) # verify the 26th position is all ones
#[1] "1" "1" "1" "1"
#length of strings from 26th position to right
rtlen <- length(s2[[1]])-(26-1)
# Pick from the `rle` $values where values TRUE
rle( tail( s2[[1]] == s2[[2]], rtlen) )
Run Length Encoding
lengths: int [1:11] 3 4 5 1 7 1 4 1 1 6 ...
values : logi [1:11] TRUE FALSE TRUE FALSE TRUE FALSE ...
Now that you have an algorithm for a single instance, you can iterate over the rest of the items in s2. To look backwards, I just did the same operation on a reversed section of the strings.
m<-matrix(NA, 3,2);
for (i in 2:4) { m[i-1,2] <- rle(tail( s2[[1]] == s2[[i]], rtlen) )$lengths[1]
m[i-1, 1] <- rle( rev( head( s2[[1]] == s2[[i]], 26)) )$lengths[1] }
m
[,1] [,2]
[1,] 9 3 # I think you counted wrong
[2,] 11 3
[3,] 7 1
Notice that I was comparing each one to the first row and your results suggest you were doing something else...perhaps comparing to the row above. That could easily be done instead with only a very small mod to the code indices for choice of the comparison vector:
m<-matrix(NA, 3,2);
for (i in 2:4) { m[i-1,2] <- rle(tail( s2[[i-1]] == s2[[i]], rtlen) )$lengths[1]
m[i-1, 1] <- rle( rev( head( s2[[i-1]] == s2[[i]], 26)) )$lengths[1] }
m
[,1] [,2]
[1,] 9 3
[2,] 9 9 #Again I think you may have miscounted. Easy to do, eh?
[3,] 7 1
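A detail about `$lengths[1]`: it is the length of the first run whatever its value. In the loops above the compared slice always starts at position 26, where every row holds a 1, so the first run is guaranteed to be a run of matches. If the slice were changed to exclude column 26, a guard would be needed. A minimal sketch, with `first_true_run` as a made-up helper name and assuming a non-empty logical vector:

```r
# Hypothetical helper: length of the leading run of TRUEs, 0 if the vector starts with FALSE
first_true_run <- function(same) {
  r <- rle(same)
  if (r$values[1]) r$lengths[1] else 0L
}

first_true_run(c(TRUE, TRUE, TRUE, FALSE, TRUE))   # leading run of matches
first_true_run(c(FALSE, TRUE, TRUE))               # immediate mismatch
```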
This problem intrigued me. Since the matrix is binary, it's far more efficient to pack the matrix into a raw matrix than it is to use sparse matrices. It means that the storage for a 1,000 x 21,000,000 pattern matrix is approx. 2.4 GiB (print(object.size(raw(1000 * 21000000 / 8)), units = "GB")).
The following should be a relatively efficient way to tackle the problem. The Rcpp code takes a raw matrix which indicates the differences between the first row of the original matrix and the other rows. For efficiency in the R code, it's actually arranged with the patterns in columns rather than rows. The other functions help to convert existing sparse or regular matrices into packed ones and to read a matrix directly from a file.
library("Rcpp")
library("Matrix")
writeLines("0101001010000000000000000100111100000000000011101100010101010
1001010000000001100000000100000000000100000010101110101010010
1101010101010010000000000100000000100101010010110101010101011
0000000000000000001000000111000110000000000000000000000000000", "example.txt")
cppFunction('
IntegerMatrix countLRPacked(IntegerMatrix mat, long S) {
long l = S - 2;
long r = S;
long i, cl, cr;
int nr(mat.nrow()), nc(mat.ncol());
IntegerMatrix res(nc, 2);
for(int i=0; i<nc;i++){
// First the left side
// Work out which byte is the first to have a 1 in it
long j = l >> 3;
int x = mat(j, i) & ((1 << ((l & 7) + 1)) - 1);
long cl = l & 7;
while(j > 0 && !x) {
j --;
x = mat(j, i);
cl += 8;
}
// Then work out where the 1 is in the byte
while (x >>= 1) --cl;
// Now the right side
j = r >> 3;
x = mat(j, i) & ~((1 << ((r & 7))) - 1);
cr = 8 - (r & 7);
while(j < (nr-1) && !x) {
j ++;
x = mat(j, i);
cr += 8;
}
cr--;
while (x = (x << 1) & 0xff) --cr;
res(i, 0) = cl;
res(i, 1) = cr;
}
return(res);
}')
# Reads a binary matrix from file or character vector
# Borrows the first bit of code from read.table
readBinaryMatrix <- function(file = NULL, text = NULL) {
if (missing(file) && !missing(text)) {
file <- textConnection(text)
on.exit(close(file))
}
if (is.character(file)) {
file <- file(file, "rt")
on.exit(close(file))
}
if (!inherits(file, "connection"))
stop("'file' must be a character string or connection")
if (!isOpen(file, "rt")) {
open(file, "rt")
on.exit(close(file))
}
lst <- list()
i <- 1
while(length(line <- readLines(file, n = 1)) > 0) {
lst[[i]] <- packRow(as.integer(strsplit(line, "", fixed = TRUE)[[1]]))
i <- i + 1
}
do.call("cbind", lst)
}
# Converts a binary integer vector into a packed raw vector,
# padding out at the end to make the input length a multiple of 8
packRow <- function(row) {
packBits(as.raw(c(row, rep(0, (8 - length(row)) %% 8 ))))
}
# Converts a binary integer matrix to a packed raw matrix
# Note the matrix is transposed (makes the subsequent xor more efficient)
packMatrix <- function(mat) {
stopifnot(is.matrix(mat) || inherits(mat, "dgCMatrix"))
apply(mat, 1, packRow)
}
# Takes either a packed raw matrix or a binary integer matrix, uses xor to compare the first row
# with all the others, and then hands the result over to the Rcpp code for processing
countLR <- function(mat, S) {
stopifnot(is.matrix(mat) || inherits(mat, "dgCMatrix"))
if (storage.mode(mat) != "raw") {
mat <- packMatrix(mat)
}
stopifnot(8 * nrow(mat) > S)
y <- xor(mat[, -1, drop = FALSE], mat[, 1, drop = TRUE])
countLRPacked(y, S)
}
sMat <- Matrix(as.matrix(read.fwf("example.txt", widths = rep(1, 61))))
pMat <- readBinaryMatrix("example.txt")
countLR(sMat, 26)
countLR(pMat, 26)
You should note that the width of the pattern matrix is right-padded to a multiple of 8, so if the patterns match all the way to the right hand side this will result in the right hand count being possibly a bit high. This could be corrected if need be.
Slow R version to do this (moved from duplicate):
countLR <- function(mat, S) {
mat2 <- mat[1, ] != t(mat[-1, , drop = FALSE])
l <- apply(mat2[(S - 1):1, ], 2, function(x) which(x)[1] - 1)
l[is.na(l)] <- S - 1
r <- apply(mat2[(S + 1):nrow(mat2), ], 2, function(x) which(x)[1] - 1)
r[is.na(r)] <- ncol(mat) - S
cbind(l, r)
}
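For checking results on small inputs, here is a self-contained base R sketch of the same idea. It follows the Rcpp answer's assumption that column S itself is not counted; `count_lr_base` and `run_len` are names invented here, and the edge cases S = 1 and S = ncol(mat) are not handled:

```r
rows <- c("0101001010000000000000000100111100000000000011101100010101010",
          "1001010000000001100000000100000000000100000010101110101010010",
          "1101010101010010000000000100000000100101010010110101010101011",
          "0000000000000000001000000111000110000000000000000000000000000")
mat <- do.call(rbind, lapply(strsplit(rows, ""), as.integer))

count_lr_base <- function(mat, S) {
  # One column per compared row (rows 2..n), one row per original column
  differs <- t(mat[-1, , drop = FALSE]) != mat[1, ]
  # Number of positions before the first mismatch; all of them if there is none
  run_len <- function(d) {
    k <- match(TRUE, d)
    if (is.na(k)) length(d) else k - 1L
  }
  left  <- apply(differs[rev(seq_len(S - 1)), , drop = FALSE], 2, run_len)
  right <- apply(differs[S + seq_len(nrow(differs) - S), , drop = FALSE], 2, run_len)
  cbind(left, right)
}

res <- count_lr_base(mat, 26)
res
```

On the example data this should reproduce the counts printed by the Rcpp version earlier in the thread.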
Inside a function of type SEXP myfun(SEXP n, SEXP d) I allocate an (n, d)-matrix which contains the result of a function computed in C:
SEXP res = PROTECT(allocMatrix(REALSXP, n, d));
I would like to 'fill' this matrix (it can then be returned by myfun) and thus would like to convert it to an (n, d)-matrix in C (or pointer or so). How can this be done?
If res was a vector, I could do:
double *res_ = REAL(res);
and then walk through res_ with a for loop. However, when I use that for the above matrix, it loses its dimension, i.e., indexing via res_[i][j] for computing the result fails. Of course one could work with a vector and keep track of the row/col indices oneself, but ideally I would like to simply write res_[i][j]. Is this doable without significant amount of extra code?
The matrix is stored internally as a vector (in column-major order), so [i][j] indexing is not available in C; you compute the offset yourself. Here is an example from Writing R Extensions showing how to do this:
#include <R.h>
#include <Rinternals.h>
SEXP out(SEXP x, SEXP y)
{
R_len_t i, j, nx = length(x), ny = length(y);
double tmp, *rx = REAL(x), *ry = REAL(y), *rans;
SEXP ans;
PROTECT(ans = allocMatrix(REALSXP, nx, ny));
rans = REAL(ans);
for(i = 0; i < nx; i++) {
tmp = rx[i];
for(j = 0; j < ny; j++)
rans[i + nx*j] = tmp * ry[j];
}
UNPROTECT(1);
return(ans);
}
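The offset `i + nx*j` in the C code reflects R's column-major storage. The same layout can be checked from R itself; this is only an illustration, with `flat`, `i` and `j` chosen here for the example:

```r
nx <- 3
ny <- 4
m <- outer(1:nx, 1:ny)   # a 3 x 4 outer product, like the one the C function builds

flat <- as.vector(m)     # the underlying storage, column by column

i <- 2
j <- 3
# R is 1-based, so the C offset i + nx*j becomes i + nx*(j - 1)
flat[i + nx * (j - 1)]
m[i, j]
```

Both expressions pick out the same element, which is why a single `double *` is enough to fill the matrix in C.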
Speed is of course very important, but I also like code clarity. The solution by BrodieG could be written as a much shorter RcppArmadillo function; it really is just a single outer product:
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
arma::mat outCpp(arma::colvec x, arma::rowvec y) {
return x * y;
}
If we test this against Brodie's (renamed to outC()) we get this:
R> sourceCpp("/tmp/marius.cpp")
R> library(rbenchmark)
R> a <- as.numeric(1:3)
R> b <- as.numeric(1:4)
R> outC(a, b)
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 2 4 6 8
[3,] 3 6 9 12
R> outCpp(a, b)
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 2 4 6 8
[3,] 3 6 9 12
R> benchmark(outC(a,b), outCpp(a,b), replications=1e5)[,1:4]
test replications elapsed relative
1 outC(a, b) 100000 0.382 1.000
2 outCpp(a, b) 100000 0.484 1.267
R>
So running 100,000 replications takes 380 vs 480 msec.
That means the difference per call is about one millionth of a second. I think I'll take that in exchange for a single line of code that is easier to read and maintain.
Suppose a data frame like this:
> n <- 3
> a <- data.frame(x=1:n,y=sample(letters,n,replace = T),stringsAsFactors = F)
> rownames(a) <- paste0("p",1:n)
> a
x y
p1 1 a
p2 2 e
p3 3 b
I want to transform the data frame to a list like this:
$p1
$p1$x
[1] 1
$p1$y
[1] "a"
$p2
$p2$x
[1] 2
$p2$y
[1] "e"
$p3
$p3$x
[1] 3
$p3$y
[1] "b"
One intuitive way to perform this transformation is to use lapply to iterate over all rows, but it is really slow. If it were a matrix, another way would be apply(a,1,as.list). My benchmarks show that the apply approach is 5 times faster than the lapply approach. Furthermore, I also tested apply(a,1,as.vector,mode="list"), and it is 4 times faster than the as.list approach. Unfortunately, this is a data frame with heterogeneous column types.
When the number of rows of the data frame is larger, all methods seem to work slowly. Is there a way to do this even faster? (Use Rcpp? and how?)
For the record (and since you've mentioned "Rcpp"), I'm adding an approach at the C level. The speedup is about 7x; there could be better or faster solutions. That said, I agree with the comments: it may be more suitable to plan a different approach than to make one specific part as fast as possible, especially if significant speedups are hard to get.
library(inline)
ff <- cfunction(sig = c(R_df = "data.frame"), body = '
R_len_t nr = LENGTH(VECTOR_ELT(R_df, 0)), nc = LENGTH(R_df);
SEXP ans;
PROTECT(ans = allocVector(VECSXP, nr));
for(int i = 0; i < nr; i++) {
SET_VECTOR_ELT(ans, i, allocVector(VECSXP, nc));
setAttrib(VECTOR_ELT(ans, i), R_NamesSymbol,
getAttrib(R_df, R_NamesSymbol));
}
setAttrib(ans, R_NamesSymbol, getAttrib(R_df, R_RowNamesSymbol));
for(int i = 0; i < nc; i++) {
SEXP tmp;
PROTECT(tmp = coerceVector(VECTOR_ELT(R_df, i),
TYPEOF(VECTOR_ELT(R_df, i))));
switch(TYPEOF(tmp)) {
case LGLSXP:
case INTSXP: {
R_len_t *ptmp = INTEGER(tmp);
for(int j = 0; j < nr; j++)
SET_VECTOR_ELT(VECTOR_ELT(ans, j), i,
ScalarInteger(ptmp[j]));
break;
}
case REALSXP: {
double *ptmp = REAL(tmp);
for(int j = 0; j < nr; j++)
SET_VECTOR_ELT(VECTOR_ELT(ans, j), i,
ScalarReal(ptmp[j]));
break;
}
case STRSXP: {
for(int j = 0; j < nr; j++)
SET_VECTOR_ELT(VECTOR_ELT(ans, j), i,
ScalarString(STRING_ELT(tmp, j)));
break;
}
}
UNPROTECT(1);
}
UNPROTECT(1);
return(ans);
')
ff(a)
#$p1
#$p1$x
#[1] 1
#
#$p1$y
#[1] "k"
#
#
#$p2
#$p2$x
#[1] 2
#
#$p2$y
#[1] "o"
#
#
#$p3
#$p3$x
#[1] 3
#
#$p3$y
#[1] "l"
And comparing with the approach of yours (mentioned in the comments) that proved to be fast:
identical(setNames(do.call(Map,
c(function(...)
"names<-"(list(...), colnames(a)), a)),
row.names(a)),
ff(a))
#[1] TRUE
And on a larger "data.frame":
set.seed(101)
DF = do.call(cbind.data.frame,
replicate(4, cbind.data.frame(x = I(sample(letters, 1e5, T)),
y = runif(1e5),
z = sample(1e5)), simplify = F))
names(DF) = make.unique(names(DF), "")
identical(setNames(do.call(Map,
c(function(...)
"names<-"(list(...), colnames(DF)), DF)),
row.names(DF)),
ff(DF))
#[1] TRUE
library(microbenchmark)
microbenchmark(ans1 = setNames(do.call(Map,
c(function(...)
"names<-"(list(...), colnames(DF)),
DF)),
row.names(DF)),
ff(DF),
times = 10)
#Unit: milliseconds
# expr min lq median uq max neval
# ans1 3504.1825 3862.4333 3931.0853 4063.691 4162.9370 10
# ff(DF) 143.0398 340.6897 365.5144 404.475 498.3854 10
It looks like you want the rows split into a list and then, within each of these, the row split into a list of all its elements. Here's an approach that matches the OP's output, but I think @Roland's is more useful. The sprintf call addresses the reordering done by split. This has the advantage over the apply(a, 1, as.list) solution that the individual elements of the nested lists stay numeric and character, whereas apply coerces everything to character (it forms a matrix first).
rows <- 1:nrow(a)
breaks <- paste0("p", sprintf(paste0("%0", nchar(max(rows)), "d"), rows))
lapply(split(a, breaks), as.list)
## $p1
## $p1$x
## [1] 1
##
## $p1$y
## [1] "g"
##
##
## $p2
## $p2$x
## [1] 2
##
## $p2$y
## [1] "c"
##
##
## $p3
## $p3$x
## [1] 3
##
## $p3$y
## [1] "t"
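The coercion point can be checked directly. A small sketch, rebuilding a data frame like the question's with fixed letters instead of `sample()` so the result is reproducible; `via_apply` and `via_split` are names introduced here:

```r
a <- data.frame(x = 1:3, y = c("a", "e", "b"), stringsAsFactors = FALSE)
rownames(a) <- paste0("p", 1:3)

via_apply <- apply(a, 1, as.list)                    # goes through as.matrix(a)
via_split <- lapply(split(a, rownames(a)), as.list)  # keeps each column's type

class(via_apply$p1$x)   # character: the number was turned into a string
class(via_split$p1$x)   # integer: the original type survives
```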
From your comments I'd suggest to either use a real database or to use package data.table:
library(data.table)
DT <- data.table(name=c("Ken","Ashley"),type=c("A","B"),score=c(9,8))
setkey(DT, name)
interests <- data.table(name=c("Ken", "Ashley"),
interests=list(c("reading","music"), c("dancing","swimming")))
DT[interests]
# name type score interests
#1: Ken A 9 reading,music
#2: Ashley B 8 dancing,swimming
Note that at its core this is a list:
unclass(DT[interests])
$name
[1] "Ken" "Ashley"
$type
[1] "A" "B"
$score
[1] 9 8
$interests
$interests[[1]]
[1] "reading" "music"
$interests[[2]]
[1] "dancing" "swimming"
attr(,"row.names")
[1] 1 2
attr(,".internal.selfref")
<pointer: 0x7fc7c4007978>
I'm gathering data on how much my cats poop into a matrix:
m <- cbind(fluffy=c(1.1,1.2,1.3,1.4),misterCuddles=c(0.9,NA,1.1,1.0))
row.names(m) <- c("2013-01-01", "2013-01-02", "2013-01-03","2013-01-04")
Which gives me this:
fluffy misterCuddles
2013-01-01 1.1 0.9
2013-01-02 1.2 NA
2013-01-03 1.3 1.1
2013-01-04 1.4 1.0
On every date, I'd like to know how many days in a row each cat has gone number 2. So the resulting matrix should look like this:
fluffy misterCuddles
2013-01-01 1 1
2013-01-02 2 0
2013-01-03 3 1
2013-01-04 4 2
Is there a way to do this efficiently? The cumsum function does something similar, but that's a primitive so I can't modify it to suit my dirty, dirty needs.
I could run a for loop and store a count like so:
m.output <- matrix(nrow=nrow(m),ncol=ncol(m))
for (column in 1:ncol(m)) {
sum <- 0
for (row in 1:nrow(m)) {
if (is.na(m[row,column])) sum <- 0
else sum <- sum + 1
m.output[row,column] <- sum
}
}
Is this the most efficient way to do this? I have a lot of cats, and I've recorded years' worth of poop data. Can I parallelize this by column somehow?
All of the answers here are actually too complicated (including my own, from earlier, copied below). The Reduce family of answers is just masking a for-loop in a single function call. I like Roland's and Ananda's, but both I think have a little too much going on.
Thus, here's a simple vectorized solution:
reset <- function(x) {
s <- seq_along(x)
s[!is.na(x)] <- 0
seq_along(x) - cummax(s)
}
> apply(m, 2, reset)
fluffy misterCuddles
[1,] 1 1
[2,] 2 0
[3,] 3 1
[4,] 4 2
It also works on Roland's example:
m2 <- cbind(fluffy=c(NA,1.1,1.2,1.3,1.4,1.0,2),
misterCuddles=c(NA,1.3,2,NA,NA,1.1,NA))
> apply(m2, 2, reset)
fluffy misterCuddles
[1,] 0 0
[2,] 1 1
[3,] 2 2
[4,] 3 0
[5,] 4 0
[6,] 5 1
[7,] 6 0
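To see why this works, it may help to trace the intermediate vectors on the misterCuddles column of Roland's example. The variable `last_na` is introduced here only to label the step:

```r
x <- c(NA, 1.3, 2, NA, NA, 1.1, NA)

s <- seq_along(x)        # 1 2 3 4 5 6 7   positions
s[!is.na(x)] <- 0        # 1 0 0 4 5 0 7   keep only the positions of NAs
last_na <- cummax(s)     # 1 1 1 4 5 5 7   position of the most recent NA so far

streak <- seq_along(x) - last_na   # distance from the most recent NA
streak                   # 0 1 2 0 0 1 0
```

The cumulative maximum carries the position of the latest NA forward, so subtracting it from the current position gives the length of the current run. Before the first NA, `last_na` is 0 and the count is simply the position.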
From earlier: this is not vectorized, but also works:
pooprun <- function(x){
z <- numeric(length=length(x))
count <- 0
for(i in seq_along(x)){
if(is.na(x[i]))
count <- 0
else
count <- count + 1
z[i] <- count
}
return(z)
}
> apply(m, 2, pooprun)
fluffy misterCuddles
[1,] 1 1
[2,] 2 0
[3,] 3 1
[4,] 4 2
THE BENCHMARKING
Here I simply wrap everyone's answers in a function call (based on their name).
> library(microbenchmark)
> microbenchmark(alexis(), hadley(), thomas(), matthew(), thomasloop(), usobi(), ananda(), times=1000)
Unit: microseconds
expr min lq median uq max neval
alexis() 1.540 4.6200 5.3890 6.1590 372.185 1000
hadley() 87.755 92.758 94.298 96.6075 1767.012 1000
thomas() 92.373 99.6860 102.7655 106.6140 315.223 1000
matthew() 128.168 136.2505 139.7150 145.4880 5196.344 1000
thomasloop() 133.556 141.6390 145.1030 150.4920 84131.427 1000
usobi() 148.182 159.9210 164.7320 174.1620 5010.445 1000
ananda() 720.507 742.4460 763.6140 801.3335 5858.733 1000
And here are the results for Roland's example data:
> microbenchmark(alexis(), hadley(), thomas(), matthew(), thomasloop(), usobi(), ananda(), times=1000)
Unit: microseconds
expr min lq median uq max neval
alexis() 2.310 5.3890 6.1590 6.9290 75.438 1000
hadley() 75.053 78.902 80.058 83.136 1747.767 1000
thomas() 90.834 97.3770 100.2640 104.3050 358.329 1000
matthew() 139.715 149.7210 154.3405 161.2680 5084.728 1000
thomasloop() 144.718 155.4950 159.7280 167.4260 5182.103 1000
usobi() 177.048 188.5945 194.3680 210.9180 5360.306 1000
ananda() 705.881 729.9370 753.4150 778.8175 8226.936 1000
Note: Alexis's and Hadley's solutions took quite a while to actually define as functions on my machine, whereas the others work out-of-the-box, but Alexis's is otherwise the clear winner.
This should work. Note that each of your cats is an independent individual, so you can turn your data frame into a list and use mclapply (from the parallel package), which processes the columns in parallel.
count <- function(y,x){
if(is.na(x)) return(0)
return (y + 1)
}
oneCat = m[,1]
Reduce(count,oneCat,init=0,accumulate=TRUE)[-1]
EDIT: here is the full answer
count <- function(x,y){
if(is.na(y)) return(0)
return (x + 1)
}
mclapply(as.data.frame(m),Reduce,f=count,init=0,accumulate=TRUE)
EDIT2: The main problem is that I get an extra 0 at the beginning of each result, so...
result = mclapply(as.data.frame(m),Reduce,f=count,init=0,accumulate=TRUE)
finalResult = do.call('cbind',result)[-1,]
rownames(finalResult) = rownames(m)
does the job.
Another option, similar to @Usobi's in that it uses Reduce, but with a slightly different approach:
apply(!is.na(m), 2, Reduce, f=function(x,y) if (y) x + y else y, accumulate=TRUE)
# fluffy misterCuddles
# [1,] 1 1
# [2,] 2 0
# [3,] 3 1
# [4,] 4 2
I had saved a snippet from here that translates almost exactly for a problem like this:
countReset <- function(x) {
x[!is.na(x)] <- 1
y <- ave(x, rev(cumsum(rev(is.na(x)))), FUN=cumsum)
y[is.na(y)] <- 0
y
}
apply(m, 2, countReset)
# fluffy misterCuddles
# 2013-01-01 1 1
# 2013-01-02 2 0
# 2013-01-03 3 1
# 2013-01-04 4 2
Since I'm in a period where I'm trying to get used to .Call, here's another idea that seems to work and -probably- is fast. (Don't take my word for it, though, my skills are not trustworthy!!):
library(inline) #use "inline" package for convenience
f <- cfunction(sig = c(R_mat = "numeric", R_dims = "integer"), body = '
R_len_t *dims = INTEGER(R_dims);
R_len_t rows = dims[0], cols = dims[1];
double *mat = REAL(R_mat);
SEXP ans;
PROTECT(ans = allocMatrix(INTSXP, rows, cols));
R_len_t *pans = INTEGER(ans);
for(int ic = 0; ic < cols; ic++)
{
pans[0 + ic*rows] = ISNA(mat[0 + ic*rows]) ? 0 : 1;
for(int ir = 1; ir < rows; ir++)
{
if(ISNA(mat[ir + ic*rows]))
{
pans[ir + ic*rows] = 0;
}else
{
if(!ISNA(mat[(ir - 1) + ic*rows]))
{
pans[ir + ic*rows] = pans[(ir - 1) + ic*rows] + 1;
}else
{
pans[ir + ic*rows] = 1;
}
}
}
}
UNPROTECT(1);
return(ans);
')
f(m, dim(m))
# [,1] [,2]
#[1,] 1 1
#[2,] 2 0
#[3,] 3 1
#[4,] 4 2
f(mm, dim(mm)) #I named Roland's matrix, mm ; I felt that I had to pass this test!
# [,1] [,2]
#[1,] 0 0
#[2,] 1 1
#[3,] 2 2
#[4,] 3 0
#[5,] 4 0
#[6,] 5 1
#[7,] 6 0
So the solution to this problem has two parts:
(1) A function that accepts a vector per cat and returns a vector telling me, at each date, how many days have passed since the last NA
(2) A function that accepts an NxM matrix and returns an NxM matrix, applying function (1) to each column
For (2), I adapted this from @Usobi's answer:
daysSinceLastNA <- function(matrix, vectorFunction, cores=1) {
listResult <- mclapply(as.data.frame(matrix), vectorFunction, mc.cores=cores)
result <- do.call('cbind', listResult)
rownames(result) <- rownames(matrix)
result
}
For (1), I have two solutions:
#ananda-mahto's solution:
daysSinceLastNA_1 <- function(vector) {
vector[!is.na(vector)] <- 1
result <- ave(vector, rev(cumsum(rev(is.na(vector)))), FUN=cumsum)
result[is.na(result)] <- 0
result
}
#Usobi's solution:
daysSinceLastNA_2 <- function(vector) {
reduction <- function(total, additional) ifelse(is.na(additional), 0, total + 1)
Reduce(reduction, vector, init=0, accumulate=TRUE)[-1]
}
Then I call them like this:
> system.time(result1 <- daysSinceLastNA (test, daysSinceLastNA_1 ))
user system elapsed
5.40 0.01 5.42
> system.time(result2 <- daysSinceLastNA (test, daysSinceLastNA_2 ))
user system elapsed
58.02 0.00 58.03
On my test dataset, which is roughly a 2500x2500 matrix, the first approach is an order of magnitude faster.
If I run on linux with 64 cores, solution (1) runs in 2 seconds, and solution (2) runs in 6 seconds.
For this sort of problem, which is easily solved with a for loop, I find Rcpp a very natural answer.
library(Rcpp)
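cppFunction("NumericVector cumsum2(NumericVector x) {
  int n = x.length();
  // Allocate a fresh result; NumericVector out(x) would alias x and modify the input
  NumericVector out(n);
  for(int i = 0; i < n; ++i) {
    if (NumericVector::is_na(x[i])) {
      out[i] = 0;  // an NA resets the run
    } else {
      out[i] = (i == 0 ? 0 : out[i - 1]) + 1;  // extend the current run
    }
  }
  return out;
}")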
The code requires a little more bookkeeping than the equivalent R code, but the bulk of the function is a very simple for loop.
You can then apply in R like any other vectorised function:
m2 <- cbind(
fluffy=c(NA,1.1,1.2,1.3,1.4,1.0,2),
misterCuddles=c(NA,1.3,2,NA,NA,1.1,NA)
)
apply(m2, 2, cumsum2)
You could of course make the C++ code iterate over the columns of the matrix, but I think that since this is already easily expressed in R, you might as well use the built in tools.