Performing two actions based on two conditions with dplyr - r

In the below dataset, for each id, I have flagged (m_flag for column m and f_flag for column w) the first occurrence of a 1 or 2 following a 3 in columns m OR w.
I am trying to:
1) set m_flag to 1 in the row preceding a 3 in m if m is missing but var1 is not
Then, convert the previous 1 in m_flag to 0
2) set f_flag to 1 in the row preceding a 3 in w if w is missing but var2 is not (e.g., row 7)
Then, convert the previous 1 in f_flag to 0 (e.g., row 6)
df <- data.frame(id=c(1,1,1, 2,2, 3,3,3, 4,4,4),
m=c(2,NA,NA, 2,3, 2,2,3, 2,2,3),
w=c(2,NA,3, 2,NA, 2,NA,3, 2,NA,3),
var1=c(5,NA,NA, 6,6,7,7,7, 8,8,8),
var2=c(3,3,3, 4,NA, 5,5,5, 6,NA,6),
m_flag=c(1,0,NA, 1,NA, 0,1,NA, 0,1,NA),
f_flag=c(1,0,NA, 1,NA, 1,0,NA, 1,0,NA))
> df
id m w var1 var2 m_flag f_flag
1 1 2 2 5 3 1 1
2 1 NA NA NA 3 0 0
3 1 NA 3 NA 3 NA NA
4 2 2 2 6 4 1 1
5 2 3 NA 6 NA NA NA
6 3 2 2 7 5 0 1
7 3 2 NA 7 5 1 0
8 3 3 3 7 5 NA NA
9 4 2 2 8 6 0 1
10 4 2 NA 8 NA 1 0
11 4 3 3 8 6 NA NA
Desired output (note: only f_flag changes; row 7 goes from 0 to 1 and row 6 goes from 1 to 0)
output <- data.frame(id=c(1,1,1, 2,2, 3,3,3, 4,4,4),
m=c(2,NA,NA, 2,3, 2,2,3, 2,2,3),
w=c(2,NA,3, 2,NA, 2,NA,3, 2,NA,3),
var1=c(5,NA,NA, 6,6,7,7,7, 8,8,8),
var2=c(3,3,3, 4,NA, 5,5,5, 6,NA,6),
m_flag=c(1,0,NA, 1,NA, 0,1,NA, 0,1,NA),
f_flag=c(1,0,NA, 1,NA, 0,1,NA, 1,0,NA))
> output
id m w var1 var2 m_flag f_flag
1 1 2 2 5 3 1 1
2 1 NA NA NA 3 0 0
3 1 NA 3 NA 3 NA NA
4 2 2 2 6 4 1 1
5 2 3 NA 6 NA NA NA
6 3 2 2 7 5 0 **0**
7 3 2 NA 7 5 1 **1**
8 3 3 3 7 5 NA NA
9 4 2 2 8 6 0 1
10 4 2 NA 8 NA 1 0
11 4 3 3 8 6 NA NA
Thank you

First, create indicator columns for the condition in the first part of each task. We'll call these meet_condition_m and meet_condition_f. Then use lead() to look at the value of the condition in the next row: if it is TRUE, reset the corresponding flag to 0 (the "convert the previous 1 to 0" part). Finally, for rows where the condition itself is TRUE, set the flag to 1 (the first part of each task).
If you need to do this by group, add group_by(id) before the mutate() so that lead() does not look across ids, and don't forget to ungroup() afterwards.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- data.frame(id=c(1,1,1, 2,2, 3,3,3, 4,4,4),
m=c(2,NA,NA, 2,3, 2,2,3, 2,2,3),
w=c(2,NA,3, 2,NA, 2,NA,3, 2,NA,3),
var1=c(5,NA,NA, 6,6,7,7,7, 8,8,8),
var2=c(3,3,3, 4,NA, 5,5,5, 6,NA,6),
m_flag=c(1,0,NA, 1,NA, 0,1,NA, 0,1,NA),
f_flag=c(1,0,NA, 1,NA, 1,0,NA, 1,0,NA))
df %>% mutate(
# Create an indicator column for the condition specified.
# `lead` looks at the "m" value for the next row.
# `if_else` takes a logical condition and returns the result
# from true/false/missing depending which criteria each one meets.
meet_condition_m = if_else(
is.na(m) &
lead(m) == 3 &
!is.na(var1),
true = TRUE,
false = FALSE,
missing = NA),
meet_condition_f = if_else(
is.na(w) &
lead(w) == 3 &
!is.na(var2),
true = TRUE,
false = FALSE,
missing = NA
),
# First, perform the second part of the task: convert the previous 1 to 0
m_flag = if_else(lead(meet_condition_m) & m_flag == 1, 0, m_flag, m_flag),
# Then execute the first step
m_flag = if_else(meet_condition_m, 1, m_flag, m_flag),
# Repeat for f
f_flag = if_else(lead(meet_condition_f) & f_flag == 1, 0, f_flag, f_flag),
f_flag = if_else(meet_condition_f, 1, f_flag, f_flag)) %>%
# Drop intermediate columns.
select(-meet_condition_m, -meet_condition_f)
#> id m w var1 var2 m_flag f_flag
#> 1 1 2 2 5 3 1 0
#> 2 1 NA NA NA 3 0 1
#> 3 1 NA 3 NA 3 NA NA
#> 4 2 2 2 6 4 1 1
#> 5 2 3 NA 6 NA NA NA
#> 6 3 2 2 7 5 0 0
#> 7 3 2 NA 7 5 1 1
#> 8 3 3 3 7 5 NA NA
#> 9 4 2 2 8 6 0 1
#> 10 4 2 NA 8 NA 1 0
#> 11 4 3 3 8 6 NA NA
Created on 2019-11-20 by the reprex package (v0.3.0)
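The grouping advice above can be sketched as follows. This is only an illustration, not the answerer's code: the helper name shift_flag is hypothetical, and it uses %in% instead of == so that the condition is never NA. Within each id, lead() can no longer see the first row of the next id.

```r
library(dplyr)

df <- data.frame(id = c(1,1,1, 2,2, 3,3,3, 4,4,4),
                 m = c(2,NA,NA, 2,3, 2,2,3, 2,2,3),
                 w = c(2,NA,3, 2,NA, 2,NA,3, 2,NA,3),
                 var1 = c(5,NA,NA, 6,6,7,7,7, 8,8,8),
                 var2 = c(3,3,3, 4,NA, 5,5,5, 6,NA,6),
                 m_flag = c(1,0,NA, 1,NA, 0,1,NA, 0,1,NA),
                 f_flag = c(1,0,NA, 1,NA, 1,0,NA, 1,0,NA))

# Hypothetical helper (not part of the original answer).
# hit: x is missing, var is present, and the next x in the group is 3.
shift_flag <- function(flag, x, var) {
  hit <- is.na(x) & !is.na(var) & lead(x) %in% 3
  # Clear a 1 that sits directly before a hit row ...
  flag <- if_else(lead(hit, default = FALSE) & flag %in% 1, 0, flag)
  # ... then set the flag on the hit row itself.
  if_else(hit, 1, flag)
}

result <- df %>%
  group_by(id) %>%
  mutate(m_flag = shift_flag(m_flag, m, var1),
         f_flag = shift_flag(f_flag, w, var2)) %>%
  ungroup()
```

On this sample data the grouped result matches the ungrouped output shown above, because no id ends in a row that meets the condition; grouping matters once one id's last row is followed by another id that starts with a 3.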

Related

Creating an indexed column in R, grouped by user_id, and not increase when NA

I want to create a column (in R) that indexes the presence of a number in another column grouped by a user_id column. And when the other column is NA, the new desired column should not increase.
The example should bring clarity.
I have this df:
data <- data.frame(user_id = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
one=c(1,NA,3,2,NA,0,NA,4,3,4,NA))
user_id tobeindexed
1 1 1
2 1 NA
3 1 3
4 2 2
5 2 NA
6 2 0
7 2 NA
8 3 4
9 3 3
10 3 4
11 3 NA
I want to make a new column looking like "desired" in the following df:
> cbind(data,data.frame(desired = c(1,1,2,1,1,2,2,1,2,3,3)))
user_id tobeindexed desired
1 1 1 1
2 1 NA 1
3 1 3 2
4 2 2 1
5 2 NA 1
6 2 0 2
7 2 NA 2
8 3 4 1
9 3 3 2
10 3 4 3
11 3 NA 3
How can I solve this?
Using cumsum and group_by gets me close, but the count does not start over from 1 when the user_id changes...
> data %>% group_by(user_id) %>% mutate(desired = cumsum(!is.na(tobeindexed)))
user_id tobeindexed desired
<dbl> <dbl> <int>
1 1 1 1
2 1 NA 1
3 1 3 2
4 2 2 3
5 2 NA 3
6 2 0 4
7 2 NA 4
8 3 4 5
9 3 3 6
10 3 4 7
11 3 NA 7
Given the sample data you provided (with the column named one), this works unchanged. The code is retained below for demonstration.
base R
data$out <- ave(data$one, data$user_id, FUN = function(z) cumsum(!is.na(z)))
data
# user_id one out
# 1 1 1 1
# 2 1 NA 1
# 3 1 3 2
# 4 2 2 1
# 5 2 NA 1
# 6 2 0 2
# 7 2 NA 2
# 8 3 4 1
# 9 3 3 2
# 10 3 4 3
# 11 3 NA 3
dplyr
library(dplyr)
data %>%
group_by(user_id) %>%
mutate(out = cumsum(!is.na(one))) %>%
ungroup()
# # A tibble: 11 × 3
# user_id one out
# <dbl> <dbl> <int>
# 1 1 1 1
# 2 1 NA 1
# 3 1 3 2
# 4 2 2 1
# 5 2 NA 1
# 6 2 0 2
# 7 2 NA 2
# 8 3 4 1
# 9 3 3 2
# 10 3 4 3
# 11 3 NA 3
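As a small illustration of why cumsum(!is.na(...)) produces the desired index, here is the same expression on a single group's values (the vector z is made up for this sketch): !is.na() turns each value into TRUE/FALSE, and cumsum() treats those as 1/0, so the running total only increases on non-missing rows.

```r
# Values for one hypothetical user; NA rows must not advance the index
z <- c(2, NA, 0, NA)

present <- !is.na(z)   # TRUE FALSE TRUE FALSE
idx <- cumsum(present) # 1 1 2 2
```

Applying this per user_id (via ave() or group_by()) is what restarts the count at 1 for each user.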

Filter to remove all rows before a particular value in a specific column, while this particular value occurs several times

I would like to filter to remove all rows before a particular value in a specific column. For example, in the data frame below, I would like to remove all rows before the "1" that appears in column x, and do so every time a "1" occurs. Please note that the value "1" repeats many times, and I want to remove the "NA" rows before the "1" in column x within each value of column a.
Thanks
a b x
1 1 NA
1 2 NA
1 3 1
1 4 0
1 5 0
1 6 NA
1 7 NA
2 1 NA
2 2 NA
2 3 1
2 4 NA
2 5 0
2 6 0
2 7 NA
3 1 NA
3 2 NA
3 3 NA
3 4 NA
3 5 1
3 6 0
3 7 NA
the desired output would be like this:
a b x
1 3 1
1 4 0
1 5 0
1 6 NA
1 7 NA
2 3 1
2 4 NA
2 5 0
2 6 0
2 7 NA
3 5 1
3 6 0
3 7 NA
Does this solve your problem?
library(tidyverse)
dat <- read.table(text = "a b x
1 1 NA
1 2 NA
1 3 1
1 4 0
1 5 0
1 6 NA
1 7 NA
2 1 NA
2 2 NA
2 3 1
2 4 NA
2 5 0
2 6 0
2 7 NA
3 1 NA
3 2 NA
3 3 NA
3 4 NA
3 5 1
3 6 0
3 7 NA", header = TRUE)
dat %>%
group_by(a) %>%
filter(cummax(!is.na(x)) == 1)
#> # A tibble: 13 × 3
#> # Groups: a [3]
#> a b x
#> <int> <int> <int>
#> 1 1 3 1
#> 2 1 4 0
#> 3 1 5 0
#> 4 1 6 NA
#> 5 1 7 NA
#> 6 2 3 1
#> 7 2 4 NA
#> 8 2 5 0
#> 9 2 6 0
#> 10 2 7 NA
#> 11 3 5 1
#> 12 3 6 0
#> 13 3 7 NA
Created on 2021-12-07 by the reprex package (v2.0.1)
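Note that the filter above keeps rows from the first non-NA value of x onward, which coincides with the first 1 in this data. If a 0 could appear before the first 1, a variant keyed on the value itself may be closer to the question. The sketch below uses dplyr's cumany() on made-up data; the small data frame is an assumption for illustration only.

```r
library(dplyr)

# Made-up data: in group 1 a 0 appears before the first 1
dat2 <- data.frame(a = c(1, 1, 1, 2, 2, 2),
                   b = c(1, 2, 3, 1, 2, 3),
                   x = c(NA, 0, 1, NA, 1, NA))

kept <- dat2 %>%
  group_by(a) %>%
  # x %in% 1 is FALSE for NA, so cumany() turns TRUE at the first 1 and stays TRUE
  filter(cumany(x %in% 1)) %>%
  ungroup()
```

Here kept contains row b = 3 for group 1 and rows b = 2 and 3 for group 2, whereas the cummax(!is.na(x)) version would also keep the 0 in group 1.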

Applying custom function to each row uses only first value of argument

I am trying to recode NA values to 0 in a subset of columns using the following dataset:
set.seed(1)
df <- data.frame(
id = c(1:10),
trials = sample(1:3, 10, replace = T),
t1 = c(sample(c(1:9, NA), 10)),
t2 = c(sample(c(1:7, rep(NA, 3)), 10)),
t3 = c(sample(c(1:5, rep(NA, 5)), 10))
)
Each row has a certain number of trials associated with it (between 1 and 3), specified by the trials column. Columns t1-t3 represent scores for each trial.
The number of trials indicates the subset of columns in which NAs should be recoded to 0: NAs that are within the number of trials represent missing data, and should be recoded as 0, while NAs outside the number of trials are not meaningful, and should remain NAs. So, for a row where trials == 3, an NA in column t3 would be recoded as 0, but in a row where trials == 2, an NA in t3 would remain an NA.
So, I tried using this function:
replace0 <- function(x, num.sun) {
x[which(is.na(x[1:(num.sun + 2)]))] <- 0
return(x)
}
This works well for single vectors. When I try applying the same function to a data frame with apply(), though:
apply(df, 1, replace0, num.sun = df$trials)
I get a warning saying:
In 1:(num.sun + 2) :
numerical expression has 10 elements: only the first used
The result is that instead of having the value of num.sun change every row according to the value in trials, apply() simply uses the first value in the trials column for every single row. How could I apply the function so that the num.sun argument changes according to the value of df$trials?
Thanks!
Edit: as some have commented, the original example data had some non-NA scores that didn't make sense according to the trials column. Here's a corrected dataset:
df <- data.frame(
id = c(1:5),
trials = c(rep(1, 2), rep(2, 1), rep(3, 2)),
t1 = c(NA, 7, NA, 6, NA),
t2 = c(NA, NA, 3, 7, 12),
t3 = c(NA, NA, NA, 4, NA)
)
Another approach:
# create an index of the NA values
w <- which(is.na(df), arr.ind = TRUE)
# create an index with the max column by row where an NA is allowed to be replaced by a zero
m <- matrix(c(1:nrow(df), (df$trials + 2)), ncol = 2)
# subset 'w' such that only the NA's which fall in the scope of 'm' remain
i <- w[w[,2] <= m[,2][match(w[,1], m[,1])],]
# use 'i' to replace the allowed NA's with a zero
df[i] <- 0
which gives:
> df
id trials t1 t2 t3
1 1 1 3 NA 5
2 2 2 2 2 NA
3 3 2 6 6 4
4 4 3 0 1 2
5 5 1 5 NA NA
6 6 3 7 0 0
7 7 3 8 7 0
8 8 2 4 5 1
9 9 2 1 3 NA
10 10 1 9 4 3
You could easily wrap this in a function:
replace.NA.with.0 <- function(df) {
w <- which(is.na(df), arr.ind = TRUE)
m <- matrix(c(1:nrow(df), (df$trials + 2)), ncol = 2)
i <- w[w[,2] <= m[,2][match(w[,1], m[,1])],]
df[i] <- 0
return(df)
}
Now, using replace.NA.with.0(df) will produce the above result.
As noted by others, some rows (1, 3 & 10) have more values than trials. You could tackle that problem by rewriting the above function to:
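replace.with.NA.or.0 <- function(df) {
# index of all NA values; set them to 0 first
w <- which(is.na(df), arr.ind = TRUE)
df[w] <- 0
# 'm' must be defined inside the function: row number and last allowed column
m <- matrix(c(1:nrow(df), (df$trials + 2)), ncol = 2)
# for each row, the columns after the last allowed one (5 = number of columns in df)
v <- tapply(m[,2], m[,1], FUN = function(x) tail(x:5,-1))
ina <- matrix(as.integer(unlist(stack(v)[2:1])), ncol = 2)
# set everything outside the number of trials back to NA
df[ina] <- NA
return(df)
}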
Now, using replace.with.NA.or.0(df) produces the following result:
id trials t1 t2 t3
1 1 1 3 NA NA
2 2 2 2 2 NA
3 3 2 6 6 NA
4 4 3 0 1 2
5 5 1 5 NA NA
6 6 3 7 0 0
7 7 3 8 7 0
8 8 2 4 5 NA
9 9 2 1 3 NA
10 10 1 9 NA NA
Here I just rewrite your function using double subsetting, x[paste0('t',x['trials'])], which overcomes the problem that the other two solutions have with row 6:
replace0 <- function(x){
#browser()
x_na <- x[paste0('t',x['trials'])]
if(is.na(x_na)){x[paste0('t',x['trials'])] <- 0}
return(x)
}
t(apply(df, 1, replace0))
id trials t1 t2 t3
[1,] 1 1 3 NA 5
[2,] 2 2 2 2 NA
[3,] 3 2 6 6 4
[4,] 4 3 NA 1 2
[5,] 5 1 5 NA NA
[6,] 6 3 7 NA 0
[7,] 7 3 8 7 0
[8,] 8 2 4 5 1
[9,] 9 2 1 3 NA
[10,] 10 1 9 4 3
Here is a way to do it:
x <- is.na(df)
df[x & t(apply(x, 1, cumsum)) > 3 - df$trials] <- 0
The output looks like this:
> df
id trials t1 t2 t3
1 1 1 3 NA 5
2 2 2 2 2 NA
3 3 2 6 6 4
4 4 3 0 1 2
5 5 1 5 NA NA
6 6 3 7 0 0
7 7 3 8 7 0
8 8 2 4 5 1
9 9 2 1 3 NA
10 10 1 9 4 3
Note: rows 1, 3 and 10 are problematic, since they have more non-NA values than trials.
Here's a tidyverse way; note that it doesn't give the same output as the other solutions.
Your example data shows results for trials that "didn't happen"; I assumed your real data doesn't.
library(tidyverse)
df %>%
nest(matches("^t\\d")) %>%
mutate(data = map2(data,trials,~mutate_all(.,replace_na,0) %>% select(.,1:.y))) %>%
unnest
# id trials t1 t2 t3
# 1 1 1 3 NA NA
# 2 2 2 2 2 NA
# 3 3 2 6 6 NA
# 4 4 3 0 1 2
# 5 5 1 5 NA NA
# 6 6 3 7 0 0
# 7 7 3 8 7 0
# 8 8 2 4 5 NA
# 9 9 2 1 3 NA
# 10 10 1 9 NA NA
Using the more commonly used gather strategy this would be:
df %>%
gather(k,v,matches("^t\\d")) %>%
arrange(id) %>%
group_by(id) %>%
slice(1:first(trials)) %>%
mutate_at("v",~replace(.,is.na(.),0)) %>%
spread(k,v)
# # A tibble: 10 x 5
# # Groups: id [10]
# id trials t1 t2 t3
# <int> <int> <dbl> <dbl> <dbl>
# 1 1 1 3 NA NA
# 2 2 2 2 2 NA
# 3 3 2 6 6 NA
# 4 4 3 0 1 2
# 5 5 1 5 NA NA
# 6 6 3 7 0 0
# 7 7 3 8 7 0
# 8 8 2 4 5 NA
# 9 9 2 1 3 NA
# 10 10 1 9 NA NA
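The warning in the question arises because apply() hands the whole df$trials vector to every call, so 1:(num.sun + 2) only ever sees its first element. One way to avoid per-row function calls altogether is to compare each score's column position with that row's trials value. The sketch below is an assumption-laden illustration on the corrected dataset from the question; the names score_cols, scores and within_trials are made up here.

```r
df <- data.frame(
  id = 1:5,
  trials = c(1, 1, 2, 3, 3),
  t1 = c(NA, 7, NA, 6, NA),
  t2 = c(NA, NA, 3, 7, 12),
  t3 = c(NA, NA, NA, 4, NA)
)

score_cols <- c("t1", "t2", "t3")
scores <- as.matrix(df[score_cols])

# col(scores) gives each cell's column position (1, 2 or 3);
# df$trials is recycled down the rows, so cell [i, j] is compared with trials[i]
within_trials <- col(scores) <= df$trials

# Only NAs inside the number of trials become 0
scores[is.na(scores) & within_trials] <- 0
df[score_cols] <- scores
```

This relies on the score columns being in trial order and on trials having one entry per row, which holds for the data shown.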

R - Replace missing values with highest of 4 previous values

This is a variation of the last-observation-carried-forward problem in a vector with some missing values. Instead of filling in NA values with the last non-NA observation, I would like to fill in each NA with the highest value among the 4 observations preceding it. If all 4 preceding observations are also NA, the NA should be retained. I would also appreciate it if this could be done by groups in a data frame/data table.
Example:
Original DF:
ID Week Value
a 1 5
a 2 1
a 3 NA
a 4 NA
a 5 3
a 6 4
a 7 NA
b 1 NA
b 2 NA
b 3 NA
b 4 NA
b 5 NA
b 6 1
b 7 NA
Output DF:
ID Week Value
a 1 5
a 2 1
a 3 5
a 4 5
a 5 3
a 6 4
a 7 4
b 1 NA
b 2 NA
b 3 NA
b 4 NA
b 5 NA
b 6 1
b 7 1
lag shifts the column by n steps and lets you peek at previous values. pmax is the element-wise maximum and lets you pick the highest value across each set of lagged observations.
To abstract away the hard-coded 4 while keeping vectorized performance, you can use quasiquotation from rlang: http://dplyr.tidyverse.org/articles/programming.html#quasiquotation
It can look a little cryptic at first but is very precise and expressive.
df <- readr::read_table(
" ID Week Value
a 1 5
a 2 1
a 3 NA
a 4 NA
a 5 3
a 6 4
a 7 NA
b 1 NA
b 2 NA
b 3 NA
b 4 NA
b 5 NA
b 6 1
b 7 NA")
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df %>%
group_by(ID) %>%
mutate(
Value = if_else(is.na(Value), pmax(lag(Value, 1), lag(Value, 2), lag(Value, 3), lag(Value, 4), na.rm = TRUE), Value)
)
#> # A tibble: 14 x 3
#> # Groups: ID [2]
#> ID Week Value
#> <chr> <int> <int>
#> 1 a 1 5
#> 2 a 2 1
#> 3 a 3 5
#> 4 a 4 5
#> 5 a 5 3
#> 6 a 6 4
#> 7 a 7 4
#> 8 b 1 NA
#> 9 b 2 NA
#> 10 b 3 NA
#> 11 b 4 NA
#> 12 b 5 NA
#> 13 b 6 1
#> 14 b 7 1
# or if you are an rlang ninja
library(purrr)
pmax_lag_n <- function(column, n) {
column <- enquo(column)
1:n %>%
map(~quo(lag(!!column, !!.x))) %>%
{ quo(pmax(!!!., na.rm = TRUE)) }
}
df %>%
group_by(ID) %>%
mutate(Value = if_else(is.na(Value), !!pmax_lag_n(Value, 4), Value))
#> # A tibble: 14 x 3
#> # Groups: ID [2]
#> ID Week Value
#> <chr> <int> <int>
#> 1 a 1 5
#> 2 a 2 1
#> 3 a 3 5
#> 4 a 4 5
#> 5 a 5 3
#> 6 a 6 4
#> 7 a 7 4
#> 8 b 1 NA
#> 9 b 2 NA
#> 10 b 3 NA
#> 11 b 4 NA
#> 12 b 5 NA
#> 13 b 6 1
#> 14 b 7 1
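To see what the lag()/pmax() combination does step by step, here is a minimal sketch on a plain vector holding group a's values from the example (the names v, prev_max and filled are made up for this illustration):

```r
library(dplyr)

v <- c(5, 1, NA, NA, 3, 4, NA)

# Element-wise maximum of the 4 preceding values, ignoring NAs;
# stays NA only where all 4 lagged values are NA
prev_max <- pmax(lag(v, 1), lag(v, 2), lag(v, 3), lag(v, 4), na.rm = TRUE)
# prev_max is NA 5 5 5 5 3 4

# Use prev_max only where the original value is missing
filled <- if_else(is.na(v), prev_max, v)
# filled is 5 1 5 5 3 4 4
```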
Define function Max which accepts a vector x and returns NA if all its elements are NA. Otherwise, if the last value is NA it returns the maximum of all non-NA elements and if the last value is not NA then it returns it.
Also define na.max which runs Max on a rolling window of length n (given by the second argument to na.max -- default 5).
Finally apply na.max to Value by ID using ave.
library(zoo)
Max <- function(x) {
last <- tail(x, 1)
if (all(is.na(x))) NA
else if (is.na(last)) max(x, na.rm = TRUE)
else last
}
na.max <- function(x, n = 5) rollapplyr(x, n, Max, partial = TRUE)
transform(DF, Value = ave(Value, ID, FUN = na.max))
giving:
ID Week Value
1 a 1 5
2 a 2 1
3 a 3 5
4 a 4 5
5 a 5 3
6 a 6 4
7 a 7 4
8 b 1 NA
9 b 2 NA
10 b 3 NA
11 b 4 NA
12 b 5 NA
13 b 6 1
14 b 7 1
Note: Input DF in reproducible form:
Lines <- "
ID Week Value
a 1 5
a 2 1
a 3 NA
a 4 NA
a 5 3
a 6 4
a 7 NA
b 1 NA
b 2 NA
b 3 NA
b 4 NA
b 5 NA
b 6 1
b 7 NA"
DF <- read.table(text = Lines, header = TRUE)

Count consecutive strings of zeroes and ones over multiple groups

There have been several discussions about counting consecutive strings of zeroes and ones (or other values) using functions like rle or cumsum. I have played around with these functions, but I can't easily figure out how to get them to apply to my specific problem.
I am working with ecological presence/absence data ("pres.abs" = 1 or 0) organized by time ("year") and location ("id"). For each location id, I would like to separately calculate the length of consecutive ones and zeroes through time. Where these cannot be calculated, I want to return "NA".
Below is a sample of what the data looks like (first 3 columns) and the output I am hoping to achieve (last 2 columns). Ideally, this would be a pretty fast function avoiding for-loops since the real data frame contains ~15,000 rows.
year = rep(1:10, times=3)
id = c(rep(1, times=10), rep(2, times=10), rep(3, times=10))
pres.abs.id.1 = c(0, 0, 0, 1, 1, 1, 0, 0, 1, 1) #Pres/abs data at site 1 across time
pres.abs.id.2 = c(1, 1, 0, 1, 0, 0, 1, 0, 0, 0) #Pres/abs data at site 2 across time
pres.abs.id.3 = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1) #Pres/abs data at site 3 across time
pres.abs = c(pres.abs.id.1, pres.abs.id.2, pres.abs.id.3)
dat = data.frame(id, year, pres.abs)
dat$cumul.zeroes = c(1,2,3,NA,NA,NA,1,2,NA,NA,NA,NA,1,NA,1,2,NA,1,2,3,1,2,3,4,5,NA,NA,NA,NA,NA)
dat$cumul.ones = c(NA,NA,NA,1,2,3,NA,NA,1,2,1,2,NA,1,NA,NA,1,NA,NA,NA,NA,NA,NA,NA,NA,1,2,3,4,5)
> dat
id year pres.abs cumul.zeroes cumul.ones
1 1 1 0 1 NA
2 1 2 0 2 NA
3 1 3 0 3 NA
4 1 4 1 NA 1
5 1 5 1 NA 2
6 1 6 1 NA 3
7 1 7 0 1 NA
8 1 8 0 2 NA
9 1 9 1 NA 1
10 1 10 1 NA 2
11 2 1 1 NA 1
12 2 2 1 NA 2
13 2 3 0 1 NA
14 2 4 1 NA 1
15 2 5 0 1 NA
16 2 6 0 2 NA
17 2 7 1 NA 1
18 2 8 0 1 NA
19 2 9 0 2 NA
20 2 10 0 3 NA
21 3 1 0 1 NA
22 3 2 0 2 NA
23 3 3 0 3 NA
24 3 4 0 4 NA
25 3 5 0 5 NA
26 3 6 1 NA 1
27 3 7 1 NA 2
28 3 8 1 NA 3
29 3 9 1 NA 4
30 3 10 1 NA 5
Thanks very much for your help.
Here's a base R way using rle and sequence:
dat <- within(dat, {
cumul.counts <- unlist(lapply(split(pres.abs, id), function(x) sequence(rle(x)$lengths)))
cumul.zeroes <- replace(cumul.counts, pres.abs == 1, NA)
cumul.ones <- replace(cumul.counts, pres.abs == 0, NA)
rm(cumul.counts)
})
# id year pres.abs cumul.ones cumul.zeroes
# 1 1 1 0 NA 1
# 2 1 2 0 NA 2
# 3 1 3 0 NA 3
# 4 1 4 1 1 NA
# 5 1 5 1 2 NA
# 6 1 6 1 3 NA
# 7 1 7 0 NA 1
# 8 1 8 0 NA 2
# 9 1 9 1 1 NA
# 10 1 10 1 2 NA
# 11 2 1 1 1 NA
# 12 2 2 1 2 NA
# 13 2 3 0 NA 1
# 14 2 4 1 1 NA
# 15 2 5 0 NA 1
# 16 2 6 0 NA 2
# 17 2 7 1 1 NA
# 18 2 8 0 NA 1
# 19 2 9 0 NA 2
# 20 2 10 0 NA 3
# 21 3 1 0 NA 1
# 22 3 2 0 NA 2
# 23 3 3 0 NA 3
# 24 3 4 0 NA 4
# 25 3 5 0 NA 5
# 26 3 6 1 1 NA
# 27 3 7 1 2 NA
# 28 3 8 1 3 NA
# 29 3 9 1 4 NA
# 30 3 10 1 5 NA
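The core of the rle/sequence approach can be seen on a single site's vector. This sketch uses site 1's values from the question; the variable names are illustrative only.

```r
# Presence/absence at site 1 across time
x <- c(0, 0, 0, 1, 1, 1, 0, 0, 1, 1)

runs <- rle(x)                    # run lengths: 3 3 2 2
counts <- sequence(runs$lengths)  # 1 2 3 1 2 3 1 2 1 2

# Blank out the counter where it belongs to the other value
zeroes <- replace(counts, x == 1, NA)  # 1 2 3 NA NA NA 1 2 NA NA
ones   <- replace(counts, x == 0, NA)  # NA NA NA 1 2 3 NA NA 1 2
```

rle() collapses the vector into runs, and sequence() expands each run length back into 1, 2, ..., n, which is the within-run counter.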
Here's one option with dplyr:
require(dplyr)
dat %>%
group_by(id, x = cumsum(c(0,diff(pres.abs)) != 0)) %>%
mutate(cumul.zeros = ifelse(pres.abs, NA_integer_, row_number()),
cumul.ones = ifelse(!pres.abs, NA_integer_, row_number())) %>%
ungroup() %>% select(-x)
#Source: local data frame [30 x 5]
#
# id year pres.abs cumul.zeros cumul.ones
#1 1 1 0 1 NA
#2 1 2 0 2 NA
#3 1 3 0 3 NA
#4 1 4 1 NA 1
#5 1 5 1 NA 2
#6 1 6 1 NA 3
#7 1 7 0 1 NA
#8 1 8 0 2 NA
#9 1 9 1 NA 1
#10 1 10 1 NA 2
#11 2 1 1 NA 1
#12 2 2 1 NA 2
#13 2 3 0 1 NA
#14 2 4 1 NA 1
#15 2 5 0 1 NA
#16 2 6 0 2 NA
#17 2 7 1 NA 1
#18 2 8 0 1 NA
#19 2 9 0 2 NA
#20 2 10 0 3 NA
#21 3 1 0 1 NA
#22 3 2 0 2 NA
#23 3 3 0 3 NA
#24 3 4 0 4 NA
#25 3 5 0 5 NA
#26 3 6 1 NA 1
#27 3 7 1 NA 2
#28 3 8 1 NA 3
#29 3 9 1 NA 4
#30 3 10 1 NA 5
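The grouping variable x in the dplyr answer is a run identifier: it increases by one every time pres.abs changes. A minimal sketch on site 1's values, using base R's ave() in place of the grouped row_number() purely for illustration:

```r
pres <- c(0, 0, 0, 1, 1, 1, 0, 0, 1, 1)

# diff() is non-zero exactly where the value changes; cumsum() numbers the runs
run_id <- cumsum(c(0, diff(pres)) != 0)   # 0 0 0 1 1 1 2 2 3 3

# Counting rows within each run gives the consecutive-length counter
counter <- ave(pres, run_id, FUN = seq_along)  # 1 2 3 1 2 3 1 2 1 2
```

In the answer, id is part of the grouping as well, so a run never continues from one site into the next even when the run identifier happens to be the same on both sides of the boundary.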
