I am working with a RasterBrick "a" with thousands of layers; a closer description is not necessary for my problem. I am using the following function to create a RasterLayer of the total number of runs of at least 5 days with values greater than 1 (one layer in the brick is one day):
indices<-rep(1:69,each=90)
ff<-function(x,na.rm=TRUE){
y<-x > 1
n<- ave(y,cumsum(y == 0), FUN = cumsum)
sum(n==5)
}
Y<-stackApply(a,indices,fun=ff)
This works great, I tested that. In a similar manner, I wrote a new function:
fff<-function(x,na.rm = TRUE){
y <- x > 1
n <- ave(y, cumsum(y == 0), FUN = cumsum)
mean(n[n >= 5])
}
X<-stackApply(a,indices,fun=fff)
Using this function, I wanted to create a RasterLayer of the average lengths of those runs of at least 5 days. It seems reasonable, but it does not work correctly. For example, when there is a run of 6 days (satisfying my criterion of value > 1), it counts two runs, one of 5 and another one of 6, and thus the average is 5.5 instead of 6. I am not sure how to adjust my function fff. If there is a way to do it, that would be great; otherwise I would be grateful if anyone could share another way to calculate the means of those runs. Thanks!
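A minimal sketch that reproduces the problem on a plain vector. The vector x below is made up for illustration (a single 6-day run above 1); fff is the function from above.

```r
# fff as defined in the question
fff <- function(x, na.rm = TRUE) {
  y <- x > 1
  n <- ave(y, cumsum(y == 0), FUN = cumsum)  # running length within each run
  mean(n[n >= 5])                            # averages 5 and 6, not just 6
}

x <- c(0, 2, 2, 2, 2, 2, 2, 0)  # hypothetical data: one run of 6 days
res <- fff(x)
res
# [1] 5.5
```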
In the future, please include a minimal, reproducible, self-contained example. Do not describe the behavior of your code, but show it. Also, be very clear about the question. As-is it is hard to see that your question is not about raster data at all, as you are looking for a function that works on any numeric vector (that you may then apply to raster data).
You are looking for a function that finds the local maxima of at least 5 in the cumulative sum of neighbors that are > 1, and then averages these local maxima.
You have this function
ff<-function(x,na.rm=TRUE){
y<-x > 1
n <- ave(y,cumsum(y == 0), FUN = cumsum)
sum(n==5)
}
Example data
x <- c(-1:10, -1:3, -1:6)
x
# [1] -1 0 1 2 3 4 5 6 7 8 9 10 -1 0 1 2 3 -1 0 1 2 3 4 5 6
ff(x)
# [1] 2
(two runs reach a length of at least 5)
To write the function you want we can start with what we have
y <-x > 1
n <- ave(y,cumsum(y == 0), FUN = cumsum)
n
# [1] 0 0 0 1 2 3 4 5 6 7 8 9 0 0 0 1 2 0 0 0 1 2 3 4 5
In this case, you need to find the numbers 9 and 5. You can start with
n[n<5] <- 0
n
# [1] 0 0 0 0 0 0 0 5 6 7 8 9 0 0 0 0 0 0 0 0 0 0 0 0 5
And now we can use diff to find the local maxima. These are the values for which the difference with the previous value is negative. Note the zero added to n to consider the last element of the vector.
i <- which(diff(c(n, 0)) < 0)
i
# [1] 12 25
n[i]
# [1] 9 5
Such that we can put the above together in a function like this
f <- function(x) {
y <- x > 1
n <- ave(y,cumsum(y == 0), FUN = cumsum)
n[n<5] <- 0
i <- which(diff(c(n, 0)) < 0)
mean(n[i])
}
f(x)
# [1] 7
If you have NAs you may do
f <- function(x) {
y <- x > 1
y[is.na(y)] <- FALSE
n <- ave(y,cumsum(y == 0), FUN = cumsum)
n[n<5] <- 0
i <- which(diff(c(n, 0)) < 0)
mean(n[i])
}
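As an alternative, the run lengths can be read directly from rle() instead of locating local maxima with diff(). This is only a sketch: the name mean_run_length and its arguments threshold and min_len are introduced here for illustration, and NA is treated as "not above the threshold".

```r
# Mean length of runs of values > threshold that last at least min_len steps
mean_run_length <- function(x, threshold = 1, min_len = 5) {
  above <- !is.na(x) & x > threshold          # NA counts as not above
  r <- rle(above)                             # run-length encoding
  runs <- r$lengths[r$values & r$lengths >= min_len]
  if (length(runs) == 0) NA_real_ else mean(runs)
}

x <- c(-1:10, -1:3, -1:6)
mean_run_length(x)
# [1] 7
```

It returns NA when no run is long enough, so it can be passed to stackApply in the same way as f.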
Related
I have a time series (or simply a vector) that is binary, returning 0 or 1's depending on some condition (generated with ifelse). I would like to be able to return the counts (in this case corresponds to time series, so days) in between the 1's.
I can do this very easily in Excel, by simply calling the Column I am trying to calculate and then adding the row above (if working with Ascending data, or calling row below if working with descending). See below
I tried doing something similar in R but I am getting an error.
DaysBetweenCondition1 = as.numeric(ifelse((Condition1 ==0 ),0,lag(DaysBetweenCondition1)+1))
Is there an easier way to do this besides making a function?
Row# Date Condition1 DaysBetweenCondition1
1 5/2/2007 NA NA
2 5/3/2007 NA NA
3 5/4/2007 NA NA
4 5/5/2007 NA NA
5 5/6/2007 0 NA
6 5/7/2007 0 NA
7 5/8/2007 0 NA
8 5/9/2007 0 NA
9 5/10/2007 0 NA
10 5/11/2007 0 NA
11 5/12/2007 0 NA
12 5/13/2007 0 NA
13 5/14/2007 1 0
14 5/15/2007 0 1
15 5/16/2007 0 2
16 5/17/2007 0 3
17 5/18/2007 0 4
18 5/19/2007 0 5
19 5/20/2007 0 6
20 5/21/2007 0 7
21 5/22/2007 1 0
22 5/23/2007 0 1
23 5/24/2007 0 2
24 5/25/2007 0 3
25 5/26/2007 0 4
26 5/27/2007 1 0
27 5/28/2007 0 1
28 5/29/2007 0 2
29 5/30/2007 1 0
(fwiw, the dates in this example are made up; in the real data I am using business days, so it is a bit different, and I don't want to reference them. They are just put in for clarity.)
This gets the counting done in one line. Borrowing PhiSeu's code and a line from How to reset cumsum at end of consecutive string and modifying it to count zeros:
# Example
df_date <- cbind.data.frame(c(1:20),
c(rep("18/08/2016",times=20)),
c(rep(NA,times=5),0,1,0,0,1,0,0,0,0,1,1,0,1,0,0)
,stringsAsFactors=FALSE)
colnames(df_date) <- c("Row#","Date","Condition1")
# add the new column with 0 as default value
DaysBetweenCondition1 <- c(rep(0,nrow(df_date)))
# bind column to dataframe
df_date <- cbind(df_date,DaysBetweenCondition1)
df_date$DaysBetweenCondition1<-sequence(rle(!df_date$Condition1)$lengths) * !df_date$Condition1
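To see what the one-liner does, here is a small walk-through on a made-up vector (cond is hypothetical, not the question's data). Note that in this sketch zeros before the first 1 are counted starting from 1, which differs from the NA values the question shows for that stretch.

```r
cond <- c(0, 1, 0, 0, 1, 0, 0, 0)   # hypothetical condition vector
is_zero <- !cond                    # TRUE where cond == 0
runs <- rle(is_zero)                # runs of zeros and of ones
runs$lengths
# [1] 1 1 2 1 3
# count 1, 2, 3, ... within each run, then blank out the positions of the 1s
counts <- sequence(runs$lengths) * is_zero
counts
# [1] 1 0 1 2 0 1 2 3
```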
R is very good when working with rows that don't depend on each other. Therefore a lot of functions are vectorized. When working with functions that depend on the value of other rows it is not so easy.
At the moment I can only provide you with a solution using a loop. I assume there is a better solution without a loop.
# Example
df_date <- cbind.data.frame(c(1:20),
c(rep("18/08/2016",times=20)),
c(rep(NA,times=5),0,1,0,0,1,0,0,0,0,1,1,0,1,0,0)
,stringsAsFactors=FALSE)
colnames(df_date) <- c("Row#","Date","Condition1")
# add the new column with 0 as default value
DaysBetweenCondition1 <- c(rep(0,nrow(df_date)))
# bind column to dataframe
df_date <- cbind(df_date,DaysBetweenCondition1)
# loop over rows
for(i in 1:nrow(df_date)){
if(is.na(df_date$Condition1[i])) {
df_date$DaysBetweenCondition1[i] <- NA
} else if(df_date$Condition1[i]==0 & is.na(df_date$Condition1[i-1])) {
df_date$DaysBetweenCondition1[i] <- NA
} else if(df_date$Condition1[i]==0) {
df_date$DaysBetweenCondition1[i] <- df_date$DaysBetweenCondition1[i-1]+1
} else {
df_date$DaysBetweenCondition1[i] <- 0
}
}
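The loop can probably be replaced by grouping on the cumulative number of 1s. A rough sketch, assuming the condition vector only contains 0, 1 and NA; days_since is a name made up here.

```r
# Count steps since the most recent 1; NA before the first 1 and where cond is NA
days_since <- function(cond) {
  out <- rep(NA_real_, length(cond))
  ok <- !is.na(cond)
  grp <- cumsum(cond[ok] == 1)                     # group id grows at every 1
  cnt <- ave(seq_along(grp), grp, FUN = seq_along) - 1
  cnt[grp == 0] <- NA                              # nothing to count from yet
  out[ok] <- cnt
  out
}

res <- days_since(c(NA, NA, 0, 0, 1, 0, 0, 1, 0))
res
# [1] NA NA NA NA  0  1  2  0  1
```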
Here's a solution that should be relatively fast
f0 = function(x) {
y = x # template for return value
isna = is.na(x) # used a couple of times
grp = cumsum(x[!isna]) # use '1' to mark start of each group
lag = lapply(tabulate(grp + 1), function(len) {
seq(0, length.out=len) # sequence from 0 to len-1
})
split(y[!isna], grp) <- lag # split y, set to lag element, unsplit
data.frame(x, y)
}
A faster version avoids the lapply() loop; it creates a vector along x (seq_along(x)) and an offset vector describing how the vector along x should be corrected based on the start value of the original vector
f1 = function(x0) {
y0 = x0
x = x0[!is.na(x0)]
y = seq_along(x)
offset = rep(c(1, y[x==1]), tabulate(cumsum(x) + 1))
y0[!is.na(y0)] = y - offset
data.frame(x0, y0)
}
Walking through the first solution, here's some data
> set.seed(123)
> x = c(rep(NA, 5), rbinom(30, 1, .15))
> x
[1] NA NA NA NA NA 0 0 0 1 1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1
[26] 1 0 0 1 0 0 0 0 0 0
use cumsum() to figure out the group the non-NA data belong to
> isna = is.na(x)
> grp = cumsum(x[!isna])
> grp
[1] 0 0 0 1 2 2 2 3 3 3 4 4 4 4 4 5 5 5 5 6 7 7 7 8 8 8 8 8 8 8
use tabulate() to figure out the number of elements in each group, lapply() to generate the relevant sequences
> lag = lapply(tabulate(grp + 1), function(len) seq(0, length.out=len))
finally, create a vector to hold the result, and use split<- to update with the lag
> y = x
> split(y[!isna], grp) <- lag
> data.frame(x, y)
x y
1 NA NA
2 NA NA
3 NA NA
4 NA NA
5 NA NA
6 0 0
7 0 1
8 0 2
9 1 0
10 1 0
11 0 1
12 0 2
13 1 0
14 0 1
15 0 2
16 1 0
17 0 1
...
The key to the second solution is the calculation of the offset. The goal is to be able to 'correct' y = seq_along(x) by the value of y at the most recent 1 in x, kind of like 'fill down' in Excel. The starting values are c(1, y[x==1]) and each needs to be replicated by the number of elements in the group tabulate(cumsum(x) + 1).
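The offset calculation is easier to follow on a tiny vector. This is just an illustration on made-up data without NAs; the intermediate names starts and sizes are introduced here.

```r
x <- c(0, 0, 1, 0, 0, 1, 0)          # hypothetical data
y <- seq_along(x)
starts <- c(1, y[x == 1])            # position each group is measured from
starts
# [1] 1 3 6
sizes <- tabulate(cumsum(x) + 1)     # number of elements in each group
sizes
# [1] 2 3 2
offset <- rep(starts, sizes)         # 'fill down' the start positions
offset
# [1] 1 1 3 3 3 6 6
lagged <- y - offset
lagged
# [1] 0 1 0 1 2 0 1
```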
I have the following df and I would like to extract all rows based on the following start and end signals.
Start signal : When status changes from 1 to 0
End signal : When status changes from 0 to -1.
df <- data.frame(time = rep(1:14), status = c(0,1,1,0,0,0,-1,0,1,0,0,0,-1,0))
time status
1 1 0
2 2 1
3 3 1
4 4 0
5 5 0
6 6 0
7 7 -1
8 8 0
9 9 1
10 10 0
11 11 0
12 12 0
13 13 -1
14 14 0
Desire:
time status
4 4 0
5 5 0
6 6 0
10 10 0
11 11 0
12 12 0
Here's a possible solution using the data.table package. I'm basically first grouping by status == 1 appearances and then checking per group if there was also a status == -1, if so, I'm sub-setting the group from the second incident until the -1 incident minus 1
library(data.table)
setDT(df)[, indx := cumsum(status == 1)]
df[, if(any(status == -1)) .SD[2:(which(status == -1) - 1)], by = indx]
# indx time status
# 1: 2 4 0
# 2: 2 5 0
# 3: 2 6 0
# 4: 3 10 0
# 5: 3 11 0
# 6: 3 12 0
We count start and end markers, then use those values and the cumulative-sum of (start - end) to filter rows. The (cumsum(start)-cumsum(end)>1) is a slight fiddle to avoid the cumulative counts being upset by row 2 which starts but doesn't end; otherwise row 14 would unwantedly get included.
require(dplyr)
df %>% mutate(start=(status==1), end=(status==-1)) %>%
filter(!start & !end & (cumsum(start)-cumsum(end)>1) ) %>%
select(-start, -end)
# time status
# 1 4 0
# 2 5 0
# 3 6 0
# 4 10 0
# 5 11 0
# 6 12 0
A little ugly, but you can always just loop over the values and keep a flag for determining whether the element should be kept or not.
keepers <- rep(FALSE, nrow(df))
flag <- FALSE
for(i in 1:(nrow(df)-1)) {
if(df$status[i] == 1 && df$status[i+1] == 0) {
flag <- TRUE
next # keep signal index false
}
if(df$status[i] == -1 && df$status[i+1] == 0) {
flag <- FALSE
next # keep signal index false
}
keepers[i] <- flag
}
keepers[nrow(df)] <- flag # Set the last element to final flag value
newdf <- df[keepers, ] # subset based on the T/F values determined
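The flag idea can also be written without an explicit loop by carrying the most recent and the next non-zero status along each row. This is only a sketch; fill_first is a helper made up for this example, and it assumes status only takes the values -1, 0 and 1.

```r
df <- data.frame(time = 1:14,
                 status = c(0, 1, 1, 0, 0, 0, -1, 0, 1, 0, 0, 0, -1, 0))
s <- df$status

# hypothetical helper: copy the first value of each group to the whole group
fill_first <- function(v, g) ave(v, g, FUN = function(z) rep(z[1], length(z)))

prev_sig <- fill_first(s, cumsum(s != 0))                 # last non-zero seen so far
next_sig <- rev(fill_first(rev(s), cumsum(rev(s) != 0)))  # next non-zero ahead

# keep zeros that sit between a 1 and a following -1
keep <- s == 0 & prev_sig == 1 & next_sig == -1
newdf <- df[keep, ]
newdf$time
# [1]  4  5  6 10 11 12
```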
Do you have some more data (or can you gen some more data you know the outcome of) to see if this/these generalize?
Two similar approaches:
library(stringr)
df <- data.frame(time = rep(1:14), status = c(0,1,1,0,0,0,-1,0,1,0,0,0,-1,0))
dfr <- rle(df$status)
# first approach
find_seq_str <- function() {
str_locate_all(paste(gsub("-1", "x", dfr$values), collapse=""), "10")[[1]][,2]
}
df[as.vector(sapply(find_seq_str(),
function(n) {
i <- sum(dfr$lengths[1:(n-1)])
tail(i:(i+dfr$lengths[n]), -1)
})),]
# second approach
find_seq_ts <- function() {
which(apply(embed(dfr$values, 2), 1, function(x) all(x == c(0, 1))))
}
df[as.vector(sapply(find_seq_ts(),
function(n) {
i <- sum(dfr$lengths[1:(n)])+1
head(i:(i+dfr$lengths[n+1]), -1)
})),]
Both approaches need a run length encoding of the status vector.
The first does a single character replacement for -1 so we can make an unambiguous, contiguous string to then use str_locate to find the pairs that tell us when the target sequence starts then rebuilds the ranges of zeroes from the rle lengths.
If it needs to be base R I can try to whip up something with regexpr.
The second builds a paired matrix and compares for the same target sequence.
Caveats:
I did no benchmarking
Both create potentially big things if status is big.
I'm not completely positive it generalizes (hence my initial q).
David's is far more readable, maintainable & transferrable code but you get to deal with all the "goodness" that comes with using data.table ;-)
I wrapped the approaches in functions as they could potentially then be parameterized, but you could just as easily just assign the value to a variable or shove it into the sapply (ugh, tho).
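For what it's worth, a base R version that works on the run-length encoding directly (no stringr, no regexpr) might look like the following. It is a sketch under the same assumptions as above; prev_val, next_val and hit are names made up here.

```r
df <- data.frame(time = 1:14,
                 status = c(0, 1, 1, 0, 0, 0, -1, 0, 1, 0, 0, 0, -1, 0))
r <- rle(df$status)
k <- length(r$values)
prev_val <- c(NA, r$values[-k])    # value of the preceding run
next_val <- c(r$values[-1], NA)    # value of the following run

# runs of zeros that follow a run of 1s and precede a run of -1s
hit <- r$values == 0 & prev_val %in% 1 & next_val %in% -1

ends <- cumsum(r$lengths)          # last row of each run
starts <- ends - r$lengths + 1     # first row of each run
rows <- unlist(Map(seq, starts[hit], ends[hit]))
res <- df[rows, ]
res$time
# [1]  4  5  6 10 11 12
```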
I am trying to randomly sample 7 numbers from 0 to 7 (with replacement), but subject to the constraint that the numbers chosen add up to 7. So for instance, the output 0 1 1 2 3 0 0 is okay, but the output 1 2 3 4 5 6 7 is not. Is there a way to use the sample command with added constraints?
I intend to use the replicate() function with the sample command as an argument, to return a list of N different vectors from the sample command. The way I am currently using the sample command (without any constraints), I need N to be very large in order to get as many possible vectors that sum to exactly 7 as possible. I figure there must be an easier way to do this!
Here is my code for that part:
x <- replicate(100000, sample(0:7, 7, replace=T))
Ideally, I want 10,000 or 100,000 vectors in x to sum to 7, but would need an enormous N value to do this. Thanks for any help.
To make sure you're sampling uniformly, you could just generate all the permutations and limit to those that sum to 7:
library(gtools)
perms <- permutations(8, 7, 0:7, repeats.allowed=T)
perms7 <- perms[rowSums(perms) == 7,]
From nrow(perms7), we see there are only 1716 possible permutations that sum to 7. Now you can uniformly sample from the permutations:
set.seed(144)
my.perms <- perms7[sample(nrow(perms7), 100000, replace=T),]
head(my.perms)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] 0 0 0 2 5 0 0
# [2,] 1 3 0 1 2 0 0
# [3,] 1 4 1 1 0 0 0
# [4,] 1 0 0 3 0 3 0
# [5,] 0 2 0 0 0 5 0
# [6,] 1 1 2 0 0 2 1
An advantage of this approach is that it's easy to see that we're sampling uniformly at random. Also, it's quite quick -- building perms7 took 0.3 seconds on my computer and building a 1 million-row my.perms took 0.04 seconds. If you need to draw many vectors this will be quite a bit quicker than a recursive approach because you're just using matrix indexing into perms7 instead of generating each vector separately.
Here's a distribution of counts of numbers in the sample:
# 0 1 2 3 4 5 6 7
# 323347 188162 102812 51344 22811 8629 2472 423
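If gtools is not available, the same 1716 vectors can be enumerated in base R without first building all 8^7 candidates. The sketch below uses the "stars and bars" correspondence (choose 6 bar positions among 13 slots; the gaps between bars are the seven values); this is a swapped-in technique, not the method used above, and comps is a name introduced here.

```r
# each column of combn(13, 6) is one placement of 6 bars among 13 slots;
# the gap sizes between consecutive bars give 7 non-negative values summing to 7
comps <- t(combn(13, 6, FUN = function(b) diff(c(0, b, 14)) - 1))
dim(comps)
# [1] 1716    7

set.seed(144)
my.comps <- comps[sample(nrow(comps), 10, replace = TRUE), ]  # uniform draws
```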
Start with all zeroes, add one to any element, do 7 times:
sumTo = function(){
v = rep(0,7)
for(i in 1:7){
addTo=sample(7)[1]
v[addTo]=v[addTo]+1
}
v
}
Or equivalently, just choose which of the 7 elements you are going to increment in one sample of length 7, then tabulate those, making sure you tabulate up to 7:
sumTo = function(){tabulate(sample(7, 7, replace = TRUE), 7)}
> sumTo()
[1] 2 1 0 0 4 0 0
> sumTo()
[1] 1 3 1 0 1 0 1
> sumTo()
[1] 1 1 0 2 1 0 2
I don't know if this will produce a uniform sample from all possible combinations...
The distribution of individual elements over 100,000 reps is:
> X = replicate(100000,sumTo())
> table(X)
X
0 1 2 3 4 5 6
237709 277926 138810 38465 6427 627 36
Didn't hit a 0,0,0,0,0,7 that time!
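On the uniformity question: adding 1 to a randomly chosen slot seven times is a multinomial draw with equal cell probabilities, so the vectors are not equally likely. A quick check with dmultinom(), compared with the 1/1716 a uniform sampler would give each vector (the variable names are made up here):

```r
# probability of drawing c(1,1,1,1,1,1,1) with the increment scheme
p_all_ones <- dmultinom(rep(1, 7), prob = rep(1 / 7, 7))   # about 0.0061
# probability of drawing c(7,0,0,0,0,0,0)
p_one_seven <- dmultinom(c(7, rep(0, 6)), prob = rep(1 / 7, 7))
# probability of any single vector under uniform sampling
p_uniform <- 1 / choose(13, 6)                             # about 0.00058
c(p_all_ones, p_one_seven, p_uniform)
```

So balanced vectors are over-represented and extreme ones such as a single 7 are very rare, which matches the table above.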
This recursive algorithm will output a distribution with a higher probability for large numbers than the other solutions. The idea is to throw a random number y in 0:7 in any of the seven available slots, then repeat with a random number in 0:(7-y), etc:
sample.sum <- function(x = 0:7, n = 7L, s = 7L) {
if (n == 1) return(s)
x <- x[x <= s]
y <- sample(x, 1)
sample(c(y, Recall(x, n - 1L, s - y)))
}
set.seed(123L)
sample.sum()
# [1] 0 4 0 2 0 0 1
Drawing 100,000 vectors took 11 seconds on my machine and here is the distribution I get:
# 0 1 2 3 4 5 6 7
# 441607 98359 50587 33364 25055 20257 16527 14244
There may be an easier and/or more elegant way, but here's a brute-force method using the LSPM:::.nPri function. The link includes the definition for an R-only version of the algorithm, for those interested.
#install.packages("LSPM", repos="http://r-forge.r-project.org")
library(LSPM)
# generate all possible permutations, since there are only ~2.1e6 of them
# (this takes < 40s on my 2.2Ghz laptop)
x <- lapply(seq_len(8^7), nPri, n=8, r=7, replace=TRUE)
# set each permutation that doesn't sum to 7 to NULL
y <- lapply(x, function(p) if(sum(p-1) != 7) NULL else p-1)
# subset all non-NULL permutations
z <- y[which(!sapply(y, is.null))]
Now you can sample from z and be assured that you're getting a permutation that sums to 7.
I find this question intriguing and gave it some extra thought. Another (more general) approach to sample (approximately) uniformly from all feasible solutions in R with sample(), without generating and storing all permutations (which is clearly not possible with many more than 7 numbers), could be a simple MCMC implementation:
S <- c(0, 1, 1, 2, 3, 0, 0) #initial solution
N <- 100 #number of dependent samples (or burn in period)
series <- numeric(N)
for(i in 1:N){
b <- sample(1:length(S), 2, replace=FALSE) #pick 2 elements at random
opt <- sum(S[-b]) #sum of complementary elements
a <- sample(0:(7-opt), 1) #sample a substitute
S[b[1]] <- a #change elements
S[b[2]] <- 7 - opt - a
}
S #new sample
This is of course really fast for a few samples. The "distribution":
#"distribution" N=100.000: 0 1 2 3 4 5 6 7
# 321729 189647 103206 52129 22287 8038 2532 432
Of course in this case, where it's actually possible to find and store all combinations, and if you want a huge sample from all feasible outcomes, just use partitions::compositions(7, 7), as also suggested by Josh O'Brien in the comments, to avoid calculating all the permutations, when only a small fraction is needed:
perms7 <- partitions::compositions(7, 7)
>tabulate(perms7[, sample(ncol(perms7), 100000, TRUE)]+1, 8)
#"distribution" N=100.000: 0 1 2 3 4 5 6 7
# 323075 188787 102328 51511 22754 8697 2413 435
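If storing all combinations is not an option, a single vector can also be drawn directly with the "stars and bars" idea: pick 6 distinct bar positions among 13 slots and read off the gaps. Every set of bar positions corresponds to exactly one vector, so this should be uniform over the 1716 possibilities. A sketch; r_composition and its arguments are made up for illustration.

```r
# draw one vector of `parts` non-negative integers summing to `total`
r_composition <- function(total = 7, parts = 7) {
  bars <- sort(sample(total + parts - 1, parts - 1))  # distinct bar positions
  diff(c(0, bars, total + parts)) - 1                 # gap sizes between bars
}

set.seed(42)
v <- r_composition()
X <- replicate(1000, r_composition())  # 7 x 1000 matrix, one draw per column
```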
How do I apply a function that can "see" the preceding result when operating by rows?
This comes up a lot, but my current problem requires a running total by student that resets if the total doesn't get to 5.
Example Data:
> df
row Student Absent Consecutive.Absences
1 A 0 0
2 A 1 1
3 A 1 2
4 A 0 0 <- resets to zero if under 5
5 A 0 0
6 A 1 1
7 A 1 2
8 A 1 3
9 B 1 1 <- starts over for new factor (Student)
10 B 1 2
11 B 0 0
12 B 1 1
13 B 1 2
14 B 1 3
15 B 1 4
16 B 0 0
17 B 1 1
18 B 1 2
19 B 1 3
20 B 1 4
21 B 1 5
22 B 0 5 <- gets locked at 5
23 B 0 5
24 B 1 6
25 B 1 7
I've tried doing this with a huge matrix of shifted vectors.
I've tried doing this with the apply family of functions and half of them do nothing, the other half hit 16GB of RAM and crash my computer.
I've tried straight looping and it takes 4+ hours (it's a big data set)
What bothers me is how easy this is in Excel. Usually R runs circles around Excel both in speed and writability, which leads me to believe I'm missing something elementary here.
Forgetting even the more challenging ("lock at 5") feature of this, I can't even get a cumsum that resets. There is no combination of factors I can think of to group for ave like this:
Consecutive.Absences = ave(Absent, ..., cumsum)
Obviously, grouping on Student will just give the Total Cumulative Absences -- it "remembers" the kid's absence over the gaps because of the split and recombine in ave.
So as I said, the core of what I don't know how to do in R is this:
How do I apply a function that can "see" the preceding result when operating by rows?
In Excel it would be easy:
C3 = IF($A3=$A2,$B3+$C2,$B3)*$B3
This excel function is displayed without the 5-absence lock for easy readability.
Once I figure out how to apply a function that looks at previous results of the same function in R, I'll be able to figure out the rest.
Thank you in advance for your help--this will be very useful in a lot of my applications!
Genuinely,
Sam
UPDATE:
Thank you everyone for the ideas on how to identify if a student has 5 consecutive absences!
However, that's easy enough to do in the database at the STUDENTS table. What I need to know is the number of consecutive absences by student in the attendance record itself for things like, "Do we count this particular attendance record when calculating other summary statistics?"
If you're looking to apply a function to every element in a vector while making use of the previous element's result, you might want to check out Reduce, with the accumulate parameter set to TRUE.
Here's an example:
##define your function that takes two parameters
##these are the 'previous' and the 'current' elements
runSum <- function(sum, x){
res = 0
if (x == 1){
res = sum + 1
}
else if (x == 0 & sum < 5){
res = 0
}
else{
res = sum
}
res
}
#lets look at the absent values from subject B
x = c(1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1)
Reduce(x=x, f=runSum, accumulate=T)
# [1] 1 2 0 1 2 3 4 0 1 2 3 4 5 5 5 6 7
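For the simpler reset-only running total mentioned in the question (no lock at 5), ave() can do it if the runs themselves are used as a second grouping factor. A sketch on the example data; the column name run is made up here.

```r
df <- data.frame(
  Student = rep(c("A", "B"), times = c(8, 17)),
  Absent  = c(0, 1, 1, 0, 0, 1, 1, 1,
              1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1)
)
# a new group starts at every 0, and at every change of Student
df$run <- ave(df$Absent, df$Student, cumsum(df$Absent == 0), FUN = cumsum)
df$run
#  [1] 0 1 2 0 0 1 2 3 1 2 0 1 2 3 4 0 1 2 3 4 5 0 0 1 2
```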
It's fairly easy to identify the students with one or more runs of 5:
tapply(dfrm$Absent, dfrm$Student, function(x) rle(x)$value[rle(x)$length >=5] )
$A
integer(0)
$B
[1] 1
Look for any values of "1" in the result:
tapply(dfrm$Absent, dfrm$Student, function(x) 1 %in% rle(x)$value[rle(x)$length >=5] )
A B
FALSE TRUE
I also struggled through to a Reduce solution (but am second in priority to #kithpradhan):
ave(dfrm$Absent, dfrm$Student,
FUN= function(XX)
Reduce(function(x,y) if( x[1] >= 5){ y+x[1]
} else{ x[1]*y+y } , #Resets to 0 if y=0
XX, accumulate=TRUE)
)
#[1] 0 1 2 0 0 1 2 3 1 2 0 1 2 3 4 0 1 2 3 4 5 5 5 6 7
For the record, you can also create your own Reduce-derivative which receives f and x, and applies f(x) on its output until x == f(x) or maxiter is reached:
ireduce = function(f, x, maxiter = 50){
i = 1
while(!identical(f(x), x) && i <= maxiter) {x = f(x); i = i+1}; x
}
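A small usage sketch for ireduce. The function is restated here so the example runs on its own, and halve is a made-up f whose fixed point is 0.

```r
# apply f repeatedly until the value stops changing or maxiter is reached
ireduce <- function(f, x, maxiter = 50) {
  i <- 1
  while (!identical(f(x), x) && i <= maxiter) {
    x <- f(x)
    i <- i + 1
  }
  x
}

halve <- function(v) v %/% 2   # hypothetical f: integer halving
res <- ireduce(halve, 40)      # 40, 20, 10, 5, 2, 1, 0
res
# [1] 0
```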