I would like to save the output of every function call of the following recursive function to a list. Moreover, I need to know which (j,l)-pair corresponds to which entry of the resulting list.
I have created a stripped down version to reproduce the problem. Please let me know if I should provide more information to help solve the problem. Any help is highly appreciated. Thank you.
#the recursive function
phi <- function(phik,j,l,k,d) {
if(j==0) {
diag(d)
}
else{
if(j==1) {
if(l>k) {
0 * diag(d)
}
else{
phik[[l]]
}
}
else {
if(l>k) {
0 + phi(phik,j-1,l,k,d) %*% phik[[1]]
}
else {
phi(phik,j-1,l+1,k,d) + phi(phik,j-1,l,k,d) %*% phik[[1]]
}
}
}
}
#related stuff
set.seed(123456)
phik <- vector(mode="list", length=3)
phik[[1]] <- matrix(rnorm(n=16,mean=0,s=1),nrow=4)
phik[[2]] <- matrix(rnorm(n=16,mean=0,s=1),nrow=4)
phik[[3]] <- matrix(rnorm(n=16,mean=0,s=1),nrow=4)
d <- nrow(phik[[1]])
k <- length(phik)
#function call
phiout <- phi(phik,j=10,l=1,k=k,d=d)
Recursive functions are a little tricky here: if you want the results of the intermediate steps, you have to glue them together in a list. That also means that when you use the result of a recursive call inside the function, you have to dig out the value you want. In your case, every call returns a small list of phi, j, and l, and you pull out just phi when you do the multiplications. Here is an example:
#the recursive function
phi <- function(phik,j,l,k,d) {
if(j==0)
list(list(phi=diag(d),j=j,l=l))
else{
if(j==1) {
if(l>k)
list(list(phi=0 * diag(d),j=j,l=l))
else
list(list(phi=phik[[l]],j=j,l=l))
}
else {
if(l>k) {
first<-phi(phik,j-1,l,k,d)
second<-list(list(phi=0 + first[[1]]$phi %*% phik[[1]], j=j,l=l))
c(second,first)
}
else {
first<-phi(phik,j-1,l+1,k,d)
second<-phi(phik,j-1,l,k,d)
third<-list(list(phi=first[[1]]$phi+(second[[1]]$phi %*% phik[[1]]), j=j, l=l))
c(third,first,second)
}
}
}
}
You might wonder why I nested the results in the first three cases (when j is 0 or 1). The other cases make it clear. When j is greater than 1, the function returns its own entry together with the entries from its recursive calls (one call when l>k, two otherwise), so the result is necessarily a list of lists, each holding phi, j, and l. When I want to pull phi out of a returned value, it would be hard to tell whether that value is a single list or a list of lists, so I standardized them all to the same shape. The first element is always the entry for the call itself, which is why first[[1]]$phi works.
I think return statements are ugly, but others disagree. You can add them in if you like, but they are strictly unnecessary (in this case).
Some sample output:
set.seed(123456)
phik <- vector(mode="list", length=3)
phik[[1]] <- matrix(rnorm(n=16,mean=0,s=1),nrow=4)
phik[[2]] <- matrix(rnorm(n=16,mean=0,s=1),nrow=4)
phik[[3]] <- matrix(rnorm(n=16,mean=0,s=1),nrow=4)
d <- nrow(phik[[1]])
k <- length(phik)
phi(phik,j=2,l=3,k,d)
# [[1]]
# [[1]]$phi
# [,1] [,2] [,3] [,4]
# [1,] -0.9087417 -2.064341 -0.9962198 0.7713081
# [2,] -2.9595280 -5.330120 -4.0488408 2.3357631
# [3,] -1.3754167 -3.866457 -0.8592336 1.4135614
# [4,] -0.1021518 -4.332802 0.4883886 -2.2130314
#
# [[1]]$j
# [1] 2
#
# [[1]]$l
# [1] 3
#
#
# [[2]]
# [[2]]$phi
# [,1] [,2] [,3] [,4]
# [1,] 0 0 0 0
# [2,] 0 0 0 0
# [3,] 0 0 0 0
# [4,] 0 0 0 0
#
# [[2]]$j
# [1] 1
#
# [[2]]$l
# [1] 4
#
#
# [[3]]
# [[3]]$phi
# [,1] [,2] [,3] [,4]
# [1,] -1.0461983 1.560074 -1.0713045 0.1582893
# [2,] -2.7488684 1.015088 0.9678209 -0.5019485
# [3,] -1.1298596 1.043994 0.1710325 -0.9659226
# [4,] -0.8616848 -1.115905 -0.8962503 -0.1137341
#
# [[3]]$j
# [1] 1
#
# [[3]]$l
# [1] 3
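To see which (j,l) pair belongs to which entry, one option is to build a small lookup table from the returned list. The sketch below uses a stand-in list `res` with the same shape as the output above (the matrices are placeholders, not the real values); `index` and `pick` are names introduced only for illustration.

```r
# Stand-in for the list returned by phi(); the matrices are placeholders.
res <- list(
  list(phi = diag(2),        j = 2, l = 3),
  list(phi = 0 * diag(2),    j = 1, l = 4),
  list(phi = matrix(1:4, 2), j = 1, l = 3)
)

# One row per list entry, recording its (j, l) pair.
index <- data.frame(
  entry = seq_along(res),
  j = sapply(res, `[[`, "j"),
  l = sapply(res, `[[`, "l")
)

# Hypothetical helper: all entries stored for a given (j, l) pair.
pick <- function(res, j, l) {
  Filter(function(e) e$j == j && e$l == l, res)
}

index
pick(res, j = 1, l = 3)[[1]]$phi
```

Because the same (j,l) pair can be reached along more than one branch of the recursion, pick() may return several entries for one pair.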
I want to compute the cumulative sum over the first (n-1) columns (for a matrix with n columns) and then average the values. I created a sample matrix for this task. I have the following matrix
ma = matrix(c(1:10), nrow = 2, ncol = 5)
ma
[,1] [,2] [,3] [,4] [,5]
[1,] 1 3 5 7 9
[2,] 2 4 6 8 10
I wanted to find the following
ans = matrix(c(1,2,2,3,3,4,4,5), nrow = 2, ncol = 4)
ans
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 2 3 4 5
The following is my R function.
ColCumSumsAve <- function(y){
for(i in seq_len(dim(y)[2]-1)) {
y[,i] <- cumsum(y[,i])/i
}
}
ColCumSumsAve(ma)
However, when I run the above function it's not producing any output. Are there any mistakes in the code?
Thanks.
There were several mistakes.
Solution
This is what I tested and what works:
colCumSumAve <- function(m) {
csum <- t(apply(X=m, MARGIN=1, FUN=cumsum))
res <- t(Reduce(`/`, list(t(csum), 1:ncol(m))))
res[, 1:(ncol(m)-1)]
}
Test it with:
> colCumSumAve(ma)
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 2 3 4 5
which is correct.
Explanation:
colCumSumAve <- function(m) {
  csum <- t(apply(X=m, MARGIN=1, FUN=cumsum)) # row-wise cumulative sum
  res <- t(Reduce(`/`, list(t(csum), 1:ncol(m))))
  # This is the trickiest part.
  # Dividing a matrix by a vector treats the matrix like a vector
  # (column by column), so `csum / 1:ncol(m)` would recycle the
  # divisors down each column instead of along each row.
  # To get row-wise treatment, I transpose the matrix with `t()`,
  # divide, and then apply `t()` to the result to get back the
  # original orientation.
  # `Reduce` with `/` over the list of `t(csum)` and `1:ncol(m)`
  # therefore divides each `csum` value by its column position.
  res[, 1:(ncol(m)-1)] # removes the last column for the answer.
  # This could of course be done right at the beginning,
  # which saves computing the values in the last column.
  # That computation is vectorized and not the bottleneck, though;
  # the `apply` and `Reduce` calls are the slower steps.
}
Well, okay, I could then do:
colCumSumAve <- function(m) {
csum <- t(apply(X=m[, 1:(ncol(m)-1)], MARGIN=1, FUN=cumsum))
t(Reduce(`/`, list(t(csum), 1:ncol(csum)))) # divide by the positions of the kept columns
}
or:
colCumSumAve <- function(m) {
m <- m[, 1:(ncol(m)-1)] # remove last column
csum <- t(apply(X=m, MARGIN=1, FUN=cumsum))
t(Reduce(`/`, list(t(csum), 1:ncol(m))))
}
This is actually the more optimized solution, then.
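The transposition in these functions is needed because R recycles a vector down the columns of a matrix. A minimal sketch of that behaviour, using a small made-up matrix of row-wise cumulative sums (the names `wrong` and `right` are only for illustration):

```r
# Row-wise cumulative sums of a 2 x 3 example matrix.
csum <- matrix(c(1, 2, 4, 6, 9, 12), nrow = 2)

# Plain division recycles 1:3 down the columns, so the divisors
# do not line up with the column positions.
wrong <- csum / 1:3

# Transpose, divide, transpose back: column j is divided by j.
right <- t(t(csum) / 1:3)

wrong
right
```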
Original Function
Your original function only makes assignments in the for-loop and doesn't return anything.
So I first copied your input into res, processed it with your for-loop, and then returned res.
ColCumSumsAve <- function(y){
res <- y
for(i in seq_len(dim(y)[2]-1)) {
res[,i] <- cumsum(y[,i])/i
}
res
}
However, this gives:
> ColCumSumsAve(ma)
[,1] [,2] [,3] [,4] [,5]
[1,] 1 1.5 1.666667 1.75 9
[2,] 3 3.5 3.666667 3.75 10
The problem is that the cumsum here runs in the column direction instead of row-wise: cumsum(y[,i]) accumulates down column i, whereas you need the running sum across the columns of each row.
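A short sketch of the difference, using the sample matrix from the question (`row_csum` is a name chosen here for illustration):

```r
ma <- matrix(1:10, nrow = 2, ncol = 5)

# cumsum() on a whole matrix flattens it column by column
# and returns a plain vector.
cumsum(ma)

# cumsum() on one column accumulates down that column (over the rows).
cumsum(ma[, 2])

# Applying cumsum() to each row gives the running sum across columns.
# apply() returns the per-row results as columns, hence the t().
row_csum <- t(apply(ma, 1, cumsum))
row_csum
```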
Corrected Original Function
After some fiddling, I realized that the correct solution is:
ColCumSumsAve <- function(y){
res <- matrix(NA, nrow(y), ncol(y)-1)
# create empty matrix with the dimensions of y minus last column
for (i in 1:(nrow(y))) { # go through rows
for (j in 1:(ncol(y)-1)) { # go through columns
res[i, j] <- sum(y[i, 1:j])/j # for each position do this
}
}
res # return `res`ult by calling it at the end!
}
with the testing:
> ColCumSumsAve(ma)
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 2 3 4 5
Note: dim(y)[2] is ncol(y), and dim(y)[1] is nrow(y).
Instead of seq_len(), 1:n is shorter and I guess even slightly faster, although seq_len() is the safer choice when n can be 0, because 1:0 counts down (1, 0) instead of giving an empty sequence.
Note: My solution given first will be faster, since it uses apply, vectorized cumsum and Reduce; for-loops in R are slower.
Late note: I am not so sure that the first solution is faster. Since R 3.x it seems that for-loops are faster. Reduce will be the speed-limiting function and can sometimes be incredibly slow.
k <- t(apply(ma,1,cumsum))[,-ncol(ma)] # k does not exist yet, so index with ncol(ma)
for (i in 1:ncol(k)){
k[,i] <- k[,i]/i
}
k
This should work.
All you need is rowMeans:
nc <- 4
cbind(ma[,1],sapply(2:nc,function(x) rowMeans(ma[,1:x])))
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 2 3 4 5
Here's how I did it
> t(apply(ma, 1, function(x) cumsum(x) / 1:length(x)))[,-NCOL(ma)]
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 2 3 4 5
This applies the cumsum function row-wise to the matrix ma and then divides by the correct length to get the average (cumsum(x) and 1:length(x) will have the same length). Then simply transpose with t and remove the last column with [,-NCOL(ma)].
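The division by column position can also be written with sweep(), which makes the "divide column j by j" step explicit. This is only a sketch of an alternative, not the method used in the answers above; `row_csum`, `row_avg` and `res` are illustrative names.

```r
ma <- matrix(1:10, nrow = 2, ncol = 5)

row_csum <- t(apply(ma, 1, cumsum))                     # running sum along each row
row_avg  <- sweep(row_csum, 2, seq_len(ncol(ma)), "/")  # divide column j by j
res      <- row_avg[, -ncol(ma), drop = FALSE]          # drop the last column
res
```

The drop = FALSE keeps the result a matrix even when only one column remains.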
The reason why there is no output from your function is because you aren't returning anything. You should end the function with return(y) or simply y as Marius suggested. Regardless, your function doesn't seem to give you the correct response anyway.
The trace of a matrix can be computed as shown below:
sum(diag(matrix))
However, this may silently give a wrong result if the matrix is not square (i.e. not of size n x n). Are there any other built-in functions for computing the trace of a matrix?
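To make the concern concrete, here is a small sketch showing that sum(diag()) returns a number for a non-square matrix without any warning, followed by a guarded version. `trace_checked` is a hypothetical helper name, not a function from base R or any package.

```r
x <- matrix(1, nrow = 3, ncol = 4)   # not square

# diag() extracts min(nrow, ncol) elements, so this returns a number
# without complaint even though the trace is not defined here.
sum(diag(x))

# Hypothetical helper that refuses non-square input.
trace_checked <- function(m) {
  stopifnot(is.matrix(m), nrow(m) == ncol(m))
  sum(diag(m))
}

trace_checked(diag(3))
```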
Package: psych
Function: tr()
Example:
> x <- matrix(replicate(9,1), ncol = 3, nrow = 3)
> x
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 1 1 1
[3,] 1 1 1
> tr(x)
[1] 3
> x <- matrix(replicate(12,1), ncol = 4, nrow = 3)
> x
[,1] [,2] [,3] [,4]
[1,] 1 1 1 1
[2,] 1 1 1 1
[3,] 1 1 1 1
> tr(x)
Error in tr(x) : m must be a square matrix
Moreover
Package: matrixcalc
Function: matrix.trace
Below is a quick function to test if the object is a matrix and then test if it is also square.
tr <- function (m)
{
total_sum <- 0
if(is.matrix(m))
{
row_count <- nrow(m)
col_count <- ncol(m)
if(row_count == col_count)
{
total_sum <-sum(diag(m))
total_sum
}
else
{
message ('Matrix is not square')
}
}
else
{
message( 'Object is not a matrix')
}
}
I also found the following package, which provides a matrix.trace function:
matrixcalc
You can try using eigenvalues
# H is assumed to be a square numeric matrix
# first find the eigenvalues
e <- eigen(H)
# function to calculate the trace as the sum of the diagonal
# (named tr_diag here so that it does not mask base::trace)
tr_diag <- function(data) sum(diag(data))
tr_diag(H)
# the sum of the eigenvalues should give the same value
sum(e$values)
Hope it helps.
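A self-contained sketch of that comparison on a small symmetric matrix; the matrix and the names `by_diag` and `by_eigen` are made up for illustration. For non-symmetric matrices eigen() may return complex eigenvalues, so this check is shown only for the symmetric case.

```r
H <- matrix(c(2, 1, 1, 3), nrow = 2)   # small symmetric example

by_diag  <- sum(diag(H))
by_eigen <- sum(eigen(H)$values)

# Compare with a tolerance, since the eigenvalues are computed numerically.
isTRUE(all.equal(by_diag, by_eigen))
```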
I'm trying to figure out how I can run a loop over variables that have consecutive labels.
I want to compute Matrix.2-Matrix.1 and store it in matrix x.1, then Matrix.3-Matrix.2 and store it in matrix x.2. There are 300 matrices (Matrix.1, Matrix.2, ..., Matrix.300), but for this example I would like to work on just matrices 1, 2 and 3.
I first tried an approach that involved the list function, but it didn't work, and then I thought about using a macro as in SAS (the % symbol). But the macro approach did not seem to work in R.
My code is below:
(The list approach)
> Matrix.1=matrix(c(1:6),nrow=2,ncol=3,byrow=TRUE)
> Matrix.2=matrix(c(1,8,9,17,15,2),nrow=2,ncol=3,byrow=TRUE)
> Matrix.3=matrix(c(0,1,2,3,6,0),nrow=2,ncol=3,byrow=TRUE)
> x.1=matrix(rep(0,6),nrow=2,ncol=3,byrow=TRUE)
> x.2=matrix(rep(0,6),nrow=2,ncol=3,byrow=TRUE)
> m=list(Matrix.1=Matrix.1,Matrix.2=Matrix.2,Matrix.3=Matrix.3)
> x=list(x.1=x.1,x.2=x.2)
> m[1]
$Matrix.1
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
> m[2]
$Matrix.2
[,1] [,2] [,3]
[1,] 1 8 9
[2,] 17 15 2
> m[3]
$Matrix.3
[,1] [,2] [,3]
[1,] 0 1 2
[2,] 3 6 0
> x[1]
$x.1
[,1] [,2] [,3]
[1,] 0 0 0
[2,] 0 0 0
> x[2]
$x.2
[,1] [,2] [,3]
[1,] 0 0 0
[2,] 0 0 0
> for (i in 1:2){
+ x[i]=m[i+1]-m[i]
+ print(x[i])
+ }
Error in m[i + 1] - m[i] : non-numeric argument to binary operator
>
How can I make operations on list?
> #Other approach inspired from SAS
> for (i in i:2){
+ x.i=Matrix.i+1-Matrix.i
+ print(R.i)
+ }
Error: object 'Matrix.i' not found
This second approach isn't even doable in R.
What is the best way of dealing loops involving consecutively labelled variables?
Since m and x are both lists, you need to use m[[i]] and x[[i]] to extract their elements; m[i] returns a one-element list, which is why the subtraction fails.
for (i in 1:2){
x[[i]] <- m[[i+1]]-m[[i]]
print(x[[i]])
}
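The difference between the two kinds of brackets can be checked directly. A minimal sketch with two of the matrices from the question:

```r
m <- list(Matrix.1 = matrix(1:6, nrow = 2, byrow = TRUE),
          Matrix.2 = matrix(c(1, 8, 9, 17, 15, 2), nrow = 2, byrow = TRUE))

class(m[1])     # single brackets keep the list wrapper
class(m[[1]])   # double brackets return the matrix itself

# Arithmetic works on the extracted matrices.
d <- m[[2]] - m[[1]]
d
```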
On the other hand, if you have 300 matrices (Matrix.1, Matrix.2, ... Matrix.300), you could use get and assign to deal with the numerical labels. Here I first assign values to 300 matrices with names Matrix.1 through Matrix.300. Then I use get function to extract these matrices and generate list x.
for (i in 1:300) {
assign(paste("Matrix.", i, sep = ""), matrix(rnorm(9), 3, 3))
}
x <- list()
for (i in 2:300) {
x[[i-1]] <- get(paste("Matrix.", i, sep = "")) - get(paste("Matrix.", i-1, sep = ""))
}
It is the preferred method in R to use the apply family of functions to loop through objects. For lists, you can use lapply which returns a list, or sapply which returns the most simplified object it can without losing any information. With these functions, your output is stored in the same order as the input, which makes comparisons or additional steps much easier.
myProcessedList <- lapply(x, FUN=<some function>)
This is a lot simpler and more straightforward than using assign and get and is worth the investment to learn. SO has many useful examples.
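As a sketch of how this could look for the matrices in the question: collect the numbered objects into one list with mget(), then let lapply() compute the successive differences. Only three matrices are used here, and the result names x.1 and x.2 follow the question's naming.

```r
Matrix.1 <- matrix(1:6, nrow = 2, byrow = TRUE)
Matrix.2 <- matrix(c(1, 8, 9, 17, 15, 2), nrow = 2, byrow = TRUE)
Matrix.3 <- matrix(c(0, 1, 2, 3, 6, 0), nrow = 2, byrow = TRUE)

# Collect the numbered objects into one list (use 1:300 for the full set).
m <- mget(paste0("Matrix.", 1:3), envir = environment())

# Subtract each matrix from its successor: m[[i + 1]] - m[[i]].
x <- lapply(seq_len(length(m) - 1), function(i) m[[i + 1]] - m[[i]])
names(x) <- paste0("x.", seq_along(x))
x
```

Once the matrices live in a list from the start, the mget() step is unnecessary.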
I am trying to do something that I am sure should be quite simple: I want a function that turns a list of number pairs (pairedList) and a vector (botList) into a series of vectors, one for each pair, each of length(botList). Every element of such a vector is zero except at the index positions identified by the pair, where it is 1.
#generating mock data to simulate my application:
pair1 <- c(2,4)
pair2 <- c(1,3)
pair3 <- c(5,6)
pairedList <- c(pair1, pair2, pair3)
botList <- c(1:length(pairedList))
Here is what the output should ultimately look like:
[1] 0 1 0 1 0 0
[1] 1 0 1 0 0 0
[1] 0 0 0 0 1 1
The code below allows me to print the vectors in the right manner (by replacing the line in the if block with print(prob) and commenting out the final print statement):
library(gtools)
test <- function() {
#initialising empty list
output <- list()
for (i in botList) {
x <- rep(0, length(pairedList))
ind <- pairedList[i:(i+1)]
ind.inv <- sort(ind, decreasing=T)
val <- rep(1,length(ind))
new.x <- vector(mode="numeric",length(x)+length(val))
new.x <- new.x[-ind]
new.x[ind] <- val
prob <- new.x
if (odd(i)) {
output[i] <- prob
}
print(output)
}
}
However I need to return this list of vectors from my function rather than printing it and when I do so, I get the following output and am met with an error and a number of warnings:
[[1]]
[1] 0
[[1]]
[1] 0
[[1]]
[1] 0
[[2]]
NULL
[[3]]
[1] 1
[[1]]
[1] 0
[[2]]
NULL
[[3]]
[1] 1
[[1]]
[1] 0
[[2]]
NULL
[[3]]
[1] 1
[[4]]
NULL
[[5]]
[1] 0
Error in new.x[-ind] : only 0's may be mixed with negative subscripts
In addition: Warning messages:
1: In output[i] <- prob :
number of items to replace is not a multiple of replacement length
2: In output[i] <- prob :
number of items to replace is not a multiple of replacement length
3: In output[i] <- prob :
number of items to replace is not a multiple of replacement length
My question is:
How can I change my code to output what I need from this function? I thought this was going to be a five minute job, and after hours on this one little thing I am stuck!
Thanks in advance
Something you can try, although there must be nicer ways:
# create a list with all the "pair1", "pair2", ... objects
l_pairs <- mget(ls(pattern="^pair\\d+"))
# compute maximum number among the values of pair., it determines the number of columns of the results
n_max <- max(unlist(l_pairs))
# finally, create for each pair. a vector of 0s and put 1s at the positions specified in pair.
res <- t(sapply(l_pairs, function(x){y <- rep(0, n_max); y[x]<-1; y}))
res
# [,1] [,2] [,3] [,4] [,5] [,6]
#pair1 0 1 0 1 0 0
#pair2 1 0 1 0 0 0
#pair3 0 0 0 0 1 1
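If the pairs are stored in a list from the start, the mget() step is not needed. A minimal sketch under that assumption; the names `pairs`, `indicator` and `result` are illustrative:

```r
pairs <- list(pair1 = c(2, 4), pair2 = c(1, 3), pair3 = c(5, 6))
n <- max(unlist(pairs))   # length of each indicator vector

# One 0/1 vector per pair, with 1s at the positions named in the pair.
indicator <- lapply(pairs, function(p) replace(numeric(n), p, 1))

# Stack the vectors into a matrix, one row per pair.
result <- do.call(rbind, indicator)
result
```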
You could use row/col indexing
m1 <- matrix(0, ncol=max(pairedList), nrow=3)
m1[cbind(rep(1:nrow(m1),each=2), pairedList)] <- 1
m1
# [,1] [,2] [,3] [,4] [,5] [,6]
#[1,] 0 1 0 1 0 0
#[2,] 1 0 1 0 0 0
#[3,] 0 0 0 0 1 1
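This works because a two-column numeric matrix used as an index addresses individual cells, one (row, column) coordinate per row. The sketch below spells out the intermediate index matrix; `idx` is a name introduced here.

```r
pairedList <- c(2, 4, 1, 3, 5, 6)

# Each row of idx is one (row, column) coordinate to set.
idx <- cbind(row = rep(1:3, each = 2), col = pairedList)
idx

m1 <- matrix(0, nrow = 3, ncol = max(pairedList))
m1[idx] <- 1   # sets exactly the six addressed cells
m1
```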
James, the following should work. I've just tested it.
pair1 <- c(2,4)
pair2 <- c(1,3)
pair3 <- c(5,6)
pairedList <- c(pair1, pair2, pair3)
botList <- c(1:(length(pairedList)/2))
library(gtools)
test <- function(pairedList, botList) {
#initialising empty list
output <- list()
for (i in botList) {
x <- rep(0, length(pairedList))
ind <- pairedList[(2*i-1):(2*i)] # elements 2i-1 and 2i form the i-th pair
ind.inv <- sort(ind, decreasing=T)
val <- rep(1,length(ind))
new.x <- vector(mode="numeric",length(x)+length(val))
new.x <- new.x[-ind]
new.x[ind] <- val
prob <- new.x
output[[i]] <- prob
print(prob)
}
return(output)
}
The reason for the strange error is that botList was being created with length 6 rather than length 3. Also, if you want to store a whole vector as a single element of a list, you need to use double brackets [[ ]] rather than single brackets [ ].
Once the function has returned the list, rbind its elements together as follows:
output <- test(pairedList, botList)
result <- do.call(rbind,output)
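The bracket difference behind the warnings in the question can be reproduced in isolation. A minimal sketch (the warning is suppressed here only to keep the example quiet; `out_single` and `out_double` are illustrative names):

```r
out_single <- list()
# Single brackets replace a one-element sub-list; given a length-2 vector,
# R warns and keeps only the first value.
suppressWarnings(out_single[1] <- c(0, 1))

out_double <- list()
# Double brackets store the whole vector as one list element.
out_double[[1]] <- c(0, 1)

out_single[[1]]
out_double[[1]]
```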
I'm gathering data on how much my cats poop into a matrix:
m <- cbind(fluffy=c(1.1,1.2,1.3,1.4),misterCuddles=c(0.9,NA,1.1,1.0))
row.names(m) <- c("2013-01-01", "2013-01-02", "2013-01-03","2013-01-04")
Which gives me this:
fluffy misterCuddles
2013-01-01 1.1 0.9
2013-01-02 1.2 NA
2013-01-03 1.3 1.1
2013-01-04 1.4 1.0
On every date, I'd like to know how many days in a row each cat has gone number 2. So the resulting matrix should look like this:
fluffy misterCuddles
2013-01-01 1 1
2013-01-02 2 0
2013-01-03 3 1
2013-01-04 4 2
Is there a way to do this efficiently? The cumsum function does something similar, but that's a primitive so I can't modify it to suit my dirty, dirty needs.
I could run a for loop and store a count like so:
m.output <- matrix(nrow=nrow(m),ncol=ncol(m))
for (column in 1:ncol(m)) {
sum <- 0
for (row in 1:nrow(m)) {
if (is.na(m[row,column])) sum <- 0
else sum <- sum + 1
m.output[row,column] <- sum
}
}
Is this the most efficient way to do this? I have a lot of cats, and I've recorded years' worth of poop data. Can I parallelize this by column somehow?
All of the answers here are actually too complicated (including my own, from earlier, copied below). The Reduce family of answers is just masking a for-loop in a single function call. I like Roland's and Ananda's, but both I think have a little too much going on.
Thus, here's a simple vectorized solution:
reset <- function(x) {
s <- seq_along(x)
s[!is.na(x)] <- 0
seq_along(x) - cummax(s)
}
> apply(m, 2, reset)
fluffy misterCuddles
[1,] 1 1
[2,] 2 0
[3,] 3 1
[4,] 4 2
It also works on Roland's example:
m2 <- cbind(fluffy=c(NA,1.1,1.2,1.3,1.4,1.0,2),
misterCuddles=c(NA,1.3,2,NA,NA,1.1,NA))
> apply(m2, 2, reset)
fluffy misterCuddles
[1,] 0 0
[2,] 1 1
[3,] 2 2
[4,] 3 0
[5,] 4 0
[6,] 5 1
[7,] 6 0
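To see why reset() works, it helps to trace the intermediate vectors for one column. The sketch below does this for the misterCuddles column of the second example; `last_na` is a name added for illustration.

```r
x <- c(NA, 1.3, 2, NA, NA, 1.1, NA)

s <- seq_along(x)          # positions:                  1 2 3 4 5 6 7
s[!is.na(x)] <- 0          # keep positions of NAs only: 1 0 0 4 5 0 7
last_na <- cummax(s)       # position of the latest NA:  1 1 1 4 5 5 7

# Days since the latest NA (0 on the NA days themselves).
seq_along(x) - last_na     # 0 1 2 0 0 1 0
```

Before the first NA, cummax() is 0, so the count is simply the position, which is the run length from the start.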
From earlier: this is not vectorized, but also works:
pooprun <- function(x){
  z <- numeric(length=length(x))
  count <- 0
  for(i in seq_along(x)){
    if(is.na(x[i]))
      count <- 0        # an NA resets the run
    else
      count <- count + 1
    z[i] <- count
  }
  return(z)
}
> apply(m, 2, pooprun)
fluffy misterCuddles
[1,] 1 1
[2,] 2 0
[3,] 3 1
[4,] 4 2
THE BENCHMARKING
Here I simply wrap everyone's answers in a function call (based on their name).
> library(microbenchmark)
> microbenchmark(alexis(), hadley(), thomas(), matthew(), thomasloop(), usobi(), ananda(), times=1000)
Unit: microseconds
expr min lq median uq max neval
alexis() 1.540 4.6200 5.3890 6.1590 372.185 1000
hadley() 87.755 92.758 94.298 96.6075 1767.012 1000
thomas() 92.373 99.6860 102.7655 106.6140 315.223 1000
matthew() 128.168 136.2505 139.7150 145.4880 5196.344 1000
thomasloop() 133.556 141.6390 145.1030 150.4920 84131.427 1000
usobi() 148.182 159.9210 164.7320 174.1620 5010.445 1000
ananda() 720.507 742.4460 763.6140 801.3335 5858.733 1000
And here are the results for Roland's example data:
> microbenchmark(alexis(), hadley(), thomas(), matthew(), thomasloop(), usobi(), ananda(), times=1000)
Unit: microseconds
expr min lq median uq max neval
alexis() 2.310 5.3890 6.1590 6.9290 75.438 1000
hadley() 75.053 78.902 80.058 83.136 1747.767 1000
thomas() 90.834 97.3770 100.2640 104.3050 358.329 1000
matthew() 139.715 149.7210 154.3405 161.2680 5084.728 1000
thomasloop() 144.718 155.4950 159.7280 167.4260 5182.103 1000
usobi() 177.048 188.5945 194.3680 210.9180 5360.306 1000
ananda() 705.881 729.9370 753.4150 778.8175 8226.936 1000
Note: Alexis's and Hadley's solutions took quite a while to actually define as functions on my machine, whereas the others work out-of-the-box, but Alexis's is otherwise the clear winner.
This should work. Note that each of your cats is an independent individual, so you can turn your matrix into a list (via as.data.frame) and use mclapply from the parallel package, which processes the columns in parallel.
count <- function(y,x){
if(is.na(x)) return(0)
return (y + 1)
}
oneCat = m[,1]
Reduce(count,oneCat,init=0,accumulate=TRUE)[-1]
EDIT: here is the full answer
library(parallel) # provides mclapply
count <- function(x,y){
if(is.na(y)) return(0)
return (x + 1)
}
mclapply(as.data.frame(m),Reduce,f=count,init=0,accumulate=TRUE)
EDIT2: The main problem is that I get an extra 0 at the beginning of each column (the init value), so...
result = mclapply(as.data.frame(m),Reduce,f=count,init=0,accumulate=TRUE)
finalResult = do.call('cbind',result)[-1,]
rownames(finalResult) = rownames(m)
does the job.
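The role of init and accumulate is easier to see on a single column without the parallel wrapper. A minimal sketch using misterCuddles' values; `step` and `runs` are illustrative names.

```r
x <- c(0.9, NA, 1.1, 1.0)   # one cat's column

# State update: reset on NA, otherwise extend the run by one day.
step <- function(run, value) if (is.na(value)) 0 else run + 1

# accumulate = TRUE keeps every intermediate state; init = 0 is the state
# before the first day, so the result has one extra leading element.
runs <- Reduce(step, x, accumulate = TRUE, init = 0)
runs        # 0 1 0 1 2
runs[-1]    # 1 0 1 2, lined up with the dates
```

Note that mclapply relies on forking, so on Windows it only runs with mc.cores = 1.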
Another option, similar to @Usobi's in that it uses Reduce, but with a slightly different approach:
apply(!is.na(m), 2, Reduce, f=function(x,y) if (y) x + y else y, accumulate=TRUE)
# fluffy misterCuddles
# [1,] 1 1
# [2,] 2 0
# [3,] 3 1
# [4,] 4 2
I had saved a snippet from here that translates almost exactly for a problem like this:
countReset <- function(x) {
x[!is.na(x)] <- 1
y <- ave(x, rev(cumsum(rev(is.na(x)))), FUN=cumsum)
y[is.na(y)] <- 0
y
}
apply(m, 2, countReset)
# fluffy misterCuddles
# 2013-01-01 1 1
# 2013-01-02 2 0
# 2013-01-03 3 1
# 2013-01-04 4 2
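The grouping expression in countReset() is the non-obvious part. Tracing it for one column shows how each NA closes a group; `grp` is a name added here for illustration.

```r
x <- c(0.9, NA, 1.1, 1.0)
x[!is.na(x)] <- 1                    # 1 NA 1 1

# Counting NAs from the end gives a group id that changes right after
# each NA, so every NA is the last element of its group.
grp <- rev(cumsum(rev(is.na(x))))    # 1 1 0 0

# Running count within each group; the NA positions stay NA.
y <- ave(x, grp, FUN = cumsum)       # 1 NA 1 2
y[is.na(y)] <- 0
y                                    # 1 0 1 2
```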
Since I'm in a period where I'm trying to get used to .Call, here's another idea that seems to work and -probably- is fast. (Don't take my word for it, though, my skills are not trustworthy!!):
library(inline) #use "inline" package for convenience
f <- cfunction(sig = c(R_mat = "numeric", R_dims = "integer"), body = '
R_len_t *dims = INTEGER(R_dims);
R_len_t rows = dims[0], cols = dims[1];
double *mat = REAL(R_mat);
SEXP ans;
PROTECT(ans = allocMatrix(INTSXP, rows, cols));
R_len_t *pans = INTEGER(ans);
for(int ic = 0; ic < cols; ic++)
{
pans[0 + ic*rows] = ISNA(mat[0 + ic*rows]) ? 0 : 1;
for(int ir = 1; ir < rows; ir++)
{
if(ISNA(mat[ir + ic*rows]))
{
pans[ir + ic*rows] = 0;
}else
{
if(!ISNA(mat[(ir - 1) + ic*rows]))
{
pans[ir + ic*rows] = pans[(ir - 1) + ic*rows] + 1;
}else
{
pans[ir + ic*rows] = 1;
}
}
}
}
UNPROTECT(1);
return(ans);
')
f(m, dim(m))
# [,1] [,2]
#[1,] 1 1
#[2,] 2 0
#[3,] 3 1
#[4,] 4 2
f(mm, dim(mm)) #I named Roland's matrix, mm ; I felt that I had to pass this test!
# [,1] [,2]
#[1,] 0 0
#[2,] 1 1
#[3,] 2 2
#[4,] 3 0
#[5,] 4 0
#[6,] 5 1
#[7,] 6 0
So the solution to this problem has two parts:
1. A function that accepts a vector per cat and returns a vector telling me at each date, how many days since the last NA
2. A function that accepts an NxM matrix and returns an NxM matrix, applying function (1) to each column
For (2), I adapted this from @Usobi's answer:
library(parallel) # provides mclapply
daysSinceLastNA <- function(matrix, vectorFunction, cores=1) {
listResult <- mclapply(as.data.frame(matrix), vectorFunction, mc.cores=cores)
result <- do.call('cbind', listResult)
rownames(result) <- rownames(matrix)
result
}
For (1), I have two solutions:
#ananda-mahto's solution:
daysSinceLastNA_1 <- function(vector) {
vector[!is.na(vector)] <- 1
result <- ave(vector, rev(cumsum(rev(is.na(vector)))), FUN=cumsum)
result[is.na(result)] <- 0
result
}
#Usobi's solution:
daysSinceLastNA_2 <- function(vector) {
reduction <- function(total, additional) ifelse(is.na(additional), 0, total + 1)
Reduce(reduction, vector, init=0, accumulate=TRUE)[-1]
}
Then I call them like this:
> system.time(result1 <- daysSinceLastNA (test, daysSinceLastNA_1 ))
user system elapsed
5.40 0.01 5.42
> system.time(result2 <- daysSinceLastNA (test, daysSinceLastNA_2 ))
user system elapsed
58.02 0.00 58.03
On my test dataset, which is roughly a 2500x2500 matrix, the first approach is an order of magnitude faster.
If I run on linux with 64 cores, solution (1) runs in 2 seconds, and solution (2) runs in 6 seconds.
For this sort of problem, which is easily solved with a for loop, I find Rcpp a very natural answer.
library(Rcpp)
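cppFunction("NumericVector cumsum2(NumericVector x) {
  int n = x.length();
  // Separate output vector; NumericVector out(x) would share memory with x.
  NumericVector out(n);
  for(int i = 0; i < n; ++i) {
    if (NumericVector::is_na(x[i])) {
      out[i] = 0;               // an NA resets the run
    } else if (i == 0) {
      out[i] = 1;               // a non-NA first value starts a run of length 1
    } else {
      out[i] = out[i - 1] + 1;  // extend the current run
    }
  }
  return out;
}")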
The code requires a little more bookkeeping than the equivalent R code, but the bulk of the function is a very simple for loop.
You can then apply in R like any other vectorised function:
m2 <- cbind(
fluffy=c(NA,1.1,1.2,1.3,1.4,1.0,2),
misterCuddles=c(NA,1.3,2,NA,NA,1.1,NA)
)
apply(m2, 2, cumsum2)
You could of course make the C++ code iterate over the columns of the matrix, but I think that since this is already easily expressed in R, you might as well use the built in tools.