I have a dataframe in R:
Subject T O E P Score
1 0 1 0 1 256
2 1 0 1 0 325
2 0 1 0 1 125
3 0 1 0 1 27
4 0 0 0 1 87
5 0 1 0 1 125
6 0 1 1 1 100
This is just a sample display of the dataframe; in reality there are many rows for each subject, but the subjects only range from 1 to 6.
For each Subject, the possible values are:
T : 0 or 1
O : 0 or 1
E : 0 or 1
P : 0 or 1
Score : Numeric value
I want to create a new dataframe with 6 rows (one per subject) and the calculated MEAN score for each of these combinations:
T, O, E, P, TO, TE, TP, OE, OP, PE, TOP, TOE, POE, PET
These will be the columns of the new dataframe.
The final output should look like this
Subject T O E P TO TE TP OE OP PE TOP TOE POE PET
1
2
3
4
5
6
For each of these row/column cells, the value should be the MEAN Score.
I tried aggregate and table but I can't seem to get what I want.
Sorry I am new to R
Thanks
I had to rebuild sample data to answer the question as I understood it; tell me if this works for you:
set.seed(2)
df <- data.frame(subject = sample(1:3, 9, TRUE),
                 T = sample(c(0, 1), 9, TRUE),
                 O = sample(c(0, 1), 9, TRUE),
                 E = sample(c(0, 1), 9, TRUE),
                 P = sample(c(0, 1), 9, TRUE),
                 score = round(rnorm(9, 10, 3)))
# subject T O E P score
# 1 1 1 0 0 1 12
# 2 3 1 0 1 0 9
# 3 2 0 1 0 1 13
# 4 1 1 0 0 0 3
# 5 3 0 1 0 1 14
# 6 3 0 0 1 0 13
# 7 1 1 0 1 0 17
# 8 3 1 0 1 0 12
# 9 2 0 0 1 1 14
cols1 <- c("T","O","E","P")
df$comb <- apply(df[cols1],1,function(x) paste(names(df[cols1])[as.logical(x)],collapse=""))
# subject T O E P score comb
# 1 1 1 0 0 1 12 TP
# 2 3 1 0 1 0 9 TE
# 3 2 0 1 0 1 13 OP
# 4 1 1 0 0 0 3 T
# 5 3 0 1 0 1 14 OP
# 6 3 0 0 1 0 13 E
# 7 1 1 0 1 0 17 TE
# 8 3 1 0 1 0 12 TE
# 9 2 0 0 1 1 14 EP
library(tidyverse)
df %>%
  group_by(subject, comb) %>%
  summarize(score = mean(score)) %>%
  spread(comb, score) %>%
  ungroup()
# # A tibble: 3 x 7
# subject E EP OP T TE TP
# * <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 NA NA NA 3 17.0 12
# 2 2 NA 14 13 NA NA NA
# 3 3 13 NA 14 NA 10.5 NA
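Note that spread() has since been superseded in tidyr; on tidyr >= 1.0 (and dplyr >= 1.0 for the .groups argument) the same step could be written with pivot_wider(). A sketch, using the same df and comb column as above:
df %>%
  group_by(subject, comb) %>%
  summarize(score = mean(score), .groups = "drop") %>%
  pivot_wider(names_from = comb, values_from = score)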
The second step in base R:
means <- aggregate(score ~ subject + comb,df,mean)
means2 <- reshape(means,timevar="comb",idvar="subject",direction="wide")
setNames(means2,c("subject",sort(unique(df$comb))))
# subject E EP OP T TE TP
# 1 3 13 NA 14 NA 10.5 NA
# 2 2 NA 14 13 NA NA NA
# 5 1 NA NA NA 3 17.0 12
I'd do it like this:
# using your table data
df = read.table(text =
"Subject T O E P Score
1 0 1 0 1 256
2 1 0 1 0 325
2 0 1 0 1 125
3 0 1 0 1 27
4 0 0 0 1 87
5 0 1 0 1 125
6 0 1 1 1 100", stringsAsFactors = FALSE, header=TRUE)
# your desired column names
new_names <- c("T", "O", "E", "P", "TO", "TE", "TP", "OE",
               "OP", "PE", "TOP", "TOE", "POE", "PET")
# assigning each of your scores to one of the desired column names
assign_comb <- function(dfrow) {
  selection <- c("T", "O", "E", "P")[as.logical(dfrow[2:5])]
  do.call(paste, as.list(c(selection, sep = "")))
}
df$comb <- apply(df, 1, assign_comb)
# aggregate all the means together
df_agg <- aggregate(df$Score ~ df$comb + df$Subject, FUN = mean)
# reshape the data to wide format
df_new <- reshape(df_agg, v.names = "df$Score", idvar = "df$Subject",
                  timevar = "df$comb", direction = "wide")
# clean up the column names to match your desired output
# any column names not found will be added as NA
colnames(df_new) <- gsub("df\\$|Score\\.", "", colnames(df_new))
df_new[, new_names[!new_names %in% colnames(df_new)]] <- NA
df_new <- df_new[, c("Subject", new_names)]
With the result:
> df_new
Subject T O E P TO TE TP OE OP PE TOP TOE POE PET
1 1 NA NA NA NA NA NA NA NA 256 NA NA NA NA NA
2 2 NA NA NA NA NA 325 NA NA 125 NA NA NA NA NA
4 3 NA NA NA NA NA NA NA NA 27 NA NA NA NA NA
5 4 NA NA NA 87 NA NA NA NA NA NA NA NA NA NA
6 5 NA NA NA NA NA NA NA NA 125 NA NA NA NA NA
7 6 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
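Note that Subject 6 shows only NA even though it has a score of 100: its combination is generated in T-O-E-P order as "OEP", which is not one of the requested labels ("POE"), so that column is dropped in the final selection (the same would happen to "EP" vs "PE"). One possible fix, a sketch to run right after df$comb is created, is to map each generated combination onto the requested label that contains the same letters:
# canonical (sorted-letter) form of each requested label, e.g. "POE" -> "EOP"
canon <- sapply(strsplit(new_names, ""), function(x) paste(sort(x), collapse = ""))
# replace each generated combination by the requested label with the same letters;
# combinations outside the requested list (e.g. none or all four letters) become NA
df$comb <- new_names[match(
  sapply(strsplit(df$comb, ""), function(x) paste(sort(x), collapse = "")),
  canon)]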
I have a data frame that looks like this;
df <- data.frame(Trip =c(rep("A",10),rep("B",10)),
State =c(0,0,0,1,1,1,0,0,1,0,0,1,1,0,0,0,1,1,1,0),
Distance = c(0,2,9,4,3,1,4,5,6,3,2,6,1,5,3,3,6,1,8,2),
DistanceToNext = c(NA,NA,NA,3,1,15,NA,NA,NA,NA,NA,1,17,NA,NA,NA,1,8,NA,NA))
Trip State Distance DistanceToNext
1 A 0 0 NA
2 A 0 2 NA
3 A 0 9 NA
4 A 1 4 3
5 A 1 3 1
6 A 1 1 15
7 A 0 4 NA
8 A 0 5 NA
9 A 1 6 NA
10 A 0 3 NA
11 B 0 2 NA
12 B 1 6 1
13 B 1 1 17
14 B 0 5 NA
15 B 0 3 NA
16 B 0 3 NA
17 B 1 6 1
18 B 1 1 8
19 B 1 8 NA
20 B 0 2 NA
The State column indicates whether a fishing boat is fishing (State = 1) or not fishing (State = 0). I want to calculate the distance travelled between each fishing event (State = 1).
The Distance column gives the distance between that row's location and the previous row's (i.e. it is the lag distance).
The DistanceToNext column is the answer I am trying to generate. It should be NA for rows where State = 0; for each State = 1 row, DistanceToNext should equal the sum of the Distance column over the subsequent rows up to and including the next State = 1 row.
For example, row 4 is the first fishing event (State = 1) in Trip A; its DistanceToNext should be the distance travelled before the next fishing event, which in this case is the very next row (row 5) with a Distance of 3.
For row 5 the next fishing event is again the very next row (row 6), which has a Distance of 1. For row 6, however, there isn't another fishing event until row 9, so I want the cumulative sum of the Distance column over rows 7 to 9, which is 15.
If it is the last State = 1 row in its Trip (A or B), there isn't another fishing event, so there is no distance to calculate and the result should be NA.
Here is another solution you could use. It applies a custom function to the State/Distance vectors within each group and produces the desired output:
library(dplyr)

fn <- function(State, Distance) {
  out <- rep(NA, length(State))
  inds <- which(State == 1)
  for (i in inds) {
    if (State[i] == 1 & State[i + 1] == 1) {
      # the next row is also a fishing event: take its distance directly
      out[i] <- Distance[i + 1]
    } else if (State[i] == 1 & State[i + 1] == 0 & i != inds[length(inds)]) {
      # sum the distances up to and including the next fishing event
      nx <- which(inds == i)
      out[i] <- sum(Distance[(i + 1):(inds[nx + 1])])
    } else {
      # the last fishing event in the group (no later event): leave NA
      NA
    }
  }
  out
}

df %>%
  group_by(Trip) %>%
  mutate(MyDistance = fn(State, Distance))
# A tibble: 20 x 5
# Groups: Trip [2]
Trip State Distance DistanceToNext MyDistance
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 0 0 NA NA
2 A 0 2 NA NA
3 A 0 9 NA NA
4 A 1 4 3 3
5 A 1 3 1 1
6 A 1 1 15 15
7 A 0 4 NA NA
8 A 0 5 NA NA
9 A 1 6 NA NA
10 A 0 3 NA NA
11 B 0 2 NA NA
12 B 1 6 1 1
13 B 1 1 17 17
14 B 0 5 NA NA
15 B 0 3 NA NA
16 B 0 3 NA NA
17 B 1 6 1 1
18 B 1 1 8 8
19 B 1 8 NA NA
20 B 0 2 NA NA
In base R you would do:
fun <- function(df) {
  a <- which(df$State == 1)              # row indices of fishing events
  b <- rep(NA, nrow(df))
  # sum Distance between each fishing event and the next one (pairs of indices)
  d <- mapply(function(x, y) sum(df$Distance[(x + 1):y]), head(a, -1), tail(a, -1))
  b[a] <- c(d, NA)                       # the last fishing event gets NA
  transform(df, DisttoNext = b)
}
do.call(rbind, by(df, df$Trip, fun))
Trip State Distance DistanceToNext DisttoNext
A.1 A 0 0 NA NA
A.2 A 0 2 NA NA
A.3 A 0 9 NA NA
A.4 A 1 4 3 3
A.5 A 1 3 1 1
A.6 A 1 1 15 15
A.7 A 0 4 NA NA
A.8 A 0 5 NA NA
A.9 A 1 6 NA NA
A.10 A 0 3 NA NA
B.11 B 0 2 NA NA
B.12 B 1 6 1 1
B.13 B 1 1 17 17
B.14 B 0 5 NA NA
B.15 B 0 3 NA NA
B.16 B 0 3 NA NA
B.17 B 1 6 1 1
B.18 B 1 1 8 8
B.19 B 1 8 NA NA
B.20 B 0 2 NA NA
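To see how the index pairing inside fun works: head(a, -1) and tail(a, -1) line up each State == 1 row with the following one, and mapply then sums Distance over each such pair. A small illustration for Trip A, where the fishing events sit in rows 4, 5, 6 and 9:
a <- c(4, 5, 6, 9)                       # which(df$State == 1) within Trip A
rbind(from = head(a, -1), to = tail(a, -1))
#      [,1] [,2] [,3]
# from    4    5    6
# to      5    6    9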
A data.table alternative.
library(data.table)
setDT(df)
df[, `:=`(next_dist = shift(Distance, type = "lead"), g = cumsum(State), ri = .I),
   by = Trip]
d = df[ , .(ri = ri[1], State = State[1], s = sum(next_dist)), by = .(Trip, g)]
df[d[State == 1, .SD[-.N], by = Trip], on = .(ri), s := s]
df[ , `:=`(ri = NULL, next_dist = NULL, g = NULL)]
# Trip State Distance DistanceToNext s
# 1: A 0 0 NA NA
# 2: A 0 2 NA NA
# 3: A 0 9 NA NA
# 4: A 1 4 3 3
# 5: A 1 3 1 1
# 6: A 1 1 15 15
# 7: A 0 4 NA NA
# 8: A 0 5 NA NA
# 9: A 1 6 NA NA
# 10: A 0 3 NA NA
# 11: B 0 2 NA NA
# 12: B 1 6 1 1
# 13: B 1 1 17 17
# 14: B 0 5 NA NA
# 15: B 0 3 NA NA
# 16: B 0 3 NA NA
# 17: B 1 6 1 1
# 18: B 1 1 8 8
# 19: B 1 8 NA NA
# 20: B 0 2 NA NA
Explanation:
Convert data to data.table (setDT(df)).
For each 'Trip' (by = Trip), create new variables by reference (:=): the next distance (shift(Distance, type = "lead")), a grouping variable which increases every time 'State' is 1 (cumsum(State)), and a row index used to join the result back (.I; this could also be created first, without the grouping).
For each 'Trip' and 'State group' (by = .(Trip, g)), select the first row index (ri[1]), the first 'State' (State = State[1]), and sum the lead distances (sum(next_dist)).
From the result above, select the rows where 'State' is 1 (State == 1). Then, for each 'Trip' (by = Trip), take the subset of data (.SD) except the last row (-.N). Join to the original data on the row index (on = .(ri)) and create a new column 's', the sum of distances, by reference (:=). If desired, remove the temporary variables.
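As a small illustration of that grouping variable, here is what g = cumsum(State) looks like for Trip A: a new value starts at every State == 1 row, so each fishing event is grouped together with the non-fishing rows that follow it.
State <- c(0, 0, 0, 1, 1, 1, 0, 0, 1, 0)   # Trip A
cumsum(State)
# [1] 0 0 0 1 2 3 3 3 4 4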
I have a fairly complex problem and I hope someone can help: I want to copy a row value (i.e. from Player 1 or Player 2) into two other rows (for Players 3 and 4), if and only if these players are in the same Treatment, Group and Period AND that player was indeed picked (see column Player.Picked).
I know that with tidyverse I can group_by my columns of interest: Treatment, Group, and Period.
However, I am unsure how to proceed with the condition that Player.Picked is fulfilled, and then how to extract this value appropriately for Players 3 and 4 in the same Treatment, Group and Period.
The column "extracted.Player1/2Value" should be the output (I have manually provided the first four correct solutions).
Any ideas? Help would be very much appreciated. Thanks a lot in advance!
df
T Player Group Player.Picked Period Player1/2Value extracted.Player1/2Value
1 1 6 1 1 10
1 2 6 1 1 9
1 3 5 2 1 NA -> 4
1 4 6 1 1 NA -> 10
1 5 3 1 1 NA
1 1 5 2 1 8
1 2 1 0 1 7
1 3 6 1 1 NA -> 10
1 4 2 2 1 NA
1 5 2 2 1 NA
1 1 1 0 1 7
1 2 2 2 1 11
1 3 3 1 1 NA
1 4 4 1 1 NA
1 5 4 1 1 NA
1 1 2 2 1 21
1 2 4 1 1 17
1 3 1 0 1 NA
1 4 5 2 1 NA -> 4
1 5 6 1 1 NA
1 1 3 1 1 12
1 2 3 1 1 15
1 3 4 1 1 NA
1 4 1 0 1 NA
1 5 1 0 1 NA
1 1 4 1 1 11
1 2 5 2 1 4
1 3 2 2 1 NA
1 4 3 1 1 NA
1 5 5 2 1 NA
I'm not sure if I understood the required logic; here I'm assuming that Player 5 always picks Player 1 or 2 per Group.
So, here is my go at this using library(data.table):
library(data.table)
DT <- data.table::data.table(
check.names = FALSE,
T = c(1L,1L,1L,
1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,
1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,
1L,1L,1L,1L),
Player = c(1L,2L,3L,
4L,5L,1L,2L,3L,4L,5L,1L,2L,3L,4L,5L,
1L,2L,3L,4L,5L,1L,2L,3L,4L,5L,1L,
2L,3L,4L,5L),
Group = c(6L,6L,5L,
6L,3L,5L,1L,6L,2L,2L,1L,2L,3L,4L,4L,
2L,4L,1L,5L,6L,3L,3L,4L,1L,1L,4L,
5L,2L,3L,5L),
Player.Picked = c(1L,1L,2L,
1L,1L,2L,0L,1L,2L,2L,0L,2L,1L,1L,1L,
2L,1L,0L,2L,1L,1L,1L,1L,1L,0L,0L,
1L,2L,2L,2L),
Period = c(1L,1L,1L,
1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,
1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,
1L,1L,1L,1L),
`Player1/2Value` = c(10L,9L,NA,
NA,NA,8L,7L,NA,NA,NA,7L,11L,NA,NA,
NA,21L,17L,NA,NA,NA,12L,15L,NA,NA,NA,
11L,4L,NA,NA,NA),
`extracted.Player1/2Value` = c(NA,NA,4L,
10L,NA,NA,NA,10L,NA,NA,NA,NA,NA,NA,
NA,NA,NA,NA,4L,NA,NA,NA,NA,NA,NA,NA,
NA,NA,NA,NA)
)
setorderv(DT, cols = c("T", "Group", "Period", "Player"))
Player5PickedDT <- DT[Player == 5, Player.Picked, by = c("T", "Group", "Period")]
setnames(Player5PickedDT, old = "Player.Picked", new = "Player5Picked")
DT <- DT[Player5PickedDT, on = c("T", "Group", "Period")]
extractedDT <- DT[Player == Player5Picked & Player5Picked > 0, `Player1/2Value`, by = c("T", "Group", "Period")]
setnames(extractedDT, old = "Player1/2Value", new = "extractedValue")
DT[, "Player5Picked" := NULL]
DT <- extractedDT[DT, on = c("T", "Group", "Period")]
DT[, extractedValue := fifelse(Player %in% c(3, 4), yes = extractedValue, no = NA_real_)]
setcolorder(DT, c("T", "Group", "Period", "Player", "Player.Picked", "Player1/2Value", "extracted.Player1/2Value", "extractedValue"))
DT
The resulting table differs from your expected result (extracted.Player1/2Value vs. extractedValue), but in my eyes it follows the explained logic:
T Group Period Player Player.Picked Player1/2Value extracted.Player1/2Value extractedValue
1: 1 1 1 1 0 7 NA NA
2: 1 1 1 2 0 7 NA NA
3: 1 1 1 3 0 NA NA NA
4: 1 1 1 4 1 NA NA NA
5: 1 1 1 5 0 NA NA NA
6: 1 2 1 1 2 21 NA NA
7: 1 2 1 2 2 11 NA NA
8: 1 2 1 3 2 NA NA 11
9: 1 2 1 4 2 NA NA 11
10: 1 2 1 5 2 NA NA NA
11: 1 3 1 1 1 12 NA NA
12: 1 3 1 2 1 15 NA NA
13: 1 3 1 3 1 NA NA 12
14: 1 3 1 4 2 NA NA 12
15: 1 3 1 5 1 NA NA NA
16: 1 4 1 1 0 11 NA NA
17: 1 4 1 2 1 17 NA NA
18: 1 4 1 3 1 NA NA 11
19: 1 4 1 4 1 NA NA 11
20: 1 4 1 5 1 NA NA NA
21: 1 5 1 1 2 8 NA NA
22: 1 5 1 2 1 4 NA NA
23: 1 5 1 3 2 NA 4 4
24: 1 5 1 4 2 NA 4 4
25: 1 5 1 5 2 NA NA NA
26: 1 6 1 1 1 10 NA NA
27: 1 6 1 2 1 9 NA NA
28: 1 6 1 3 1 NA 10 10
29: 1 6 1 4 1 NA 10 10
30: 1 6 1 5 1 NA NA NA
T Group Period Player Player.Picked Player1/2Value extracted.Player1/2Value extractedValue
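Since the question mentions the tidyverse, the same assumed logic (Player 5's Player.Picked selects Player 1 or 2 within each Treatment/Group/Period, and that player's value is copied to Players 3 and 4) could also be sketched with dplyr. This is only a sketch under that assumption, using the column names from the data above:
library(dplyr)
DT %>%
  group_by(T, Group, Period) %>%
  mutate(extractedValue = ifelse(
    Player %in% c(3, 4),
    # value of the player picked by Player 5 (NA when nobody was picked)
    `Player1/2Value`[match(Player.Picked[Player == 5], Player)],
    NA_integer_)) %>%
  ungroup()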
I have the following question.
I have the data frame db below and I want to rearrange the values within each row so that the NA entries sit at both ends of the row (as in db2).
How can I do it dynamically?
Thank you
db<-data.frame(N=c(2,4,6,8),
a=c(1,1,1,1),
b=c(1,1,1,1),
c=c(NA,1,1,1),
d=c(NA,1,1,1),
e=c(NA,NA,1,1),
f=c(NA,NA,1,1),
g=c(NA,NA,NA,1),
h=c(NA,NA,NA,1))
db2<-data.frame(N=c(2,4,6,8),
a=c(NA,NA,NA,1),
b=c(NA,NA,1,1),
c=c(NA,1,1,1),
d=c(1,1,1,1),
e=c(1,1,1,1),
f=c(NA,1,1,1),
g=c(NA,NA,1,1),
h=c(NA,NA,NA,1))
N a b c d e f g h
1 2 NA NA NA 1 1 NA NA NA
2 4 NA NA 1 1 1 1 NA NA
3 6 NA 1 1 1 1 1 1 NA
4 8 1 1 1 1 1 1 1 1
If the number of NAs per row is always even, loop through the rows and rearrange each one by placing half of the NAs at the start and half at the end:
db[-1] <- t(apply(db[-1], 1, function(x) {
  i1 <- is.na(x)
  if (sum(i1) > 0) setNames(c(rep(NA, sum(i1) / 2), x[!i1],
                              rep(NA, sum(i1) / 2)), names(x)) else x
}))
db
# N a b c d e f g h
#1 2 NA NA NA 1 1 NA NA NA
#2 4 NA NA 1 1 1 1 NA NA
#3 6 NA 1 1 1 1 1 1 NA
#4 8 1 1 1 1 1 1 1 1
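If a row could also contain an odd number of NAs, a slight variant of the same idea (a sketch) puts the extra NA at the end:
db[-1] <- t(apply(db[-1], 1, function(x) {
  n <- sum(is.na(x))
  setNames(c(rep(NA, floor(n / 2)), x[!is.na(x)], rep(NA, ceiling(n / 2))), names(x))
}))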
I have this data set:
Defects.I Defects.D Treatment
1 2 A
1 3 B
I'm trying to produce descriptive statistics for defects detected and isolated, grouped by treatment.
After searching for a while, I found a nice function in the psych library called describeBy().
With the following code:
describeBy(myData[1:2],myData$Treatment)
I got this output:
Treatment A
Mean. Median. Trimmed.
Defects.I x x x
Defects.D x x x
Treatment B
Mean. Median. Trimmed.
Defects.I x x x
Defects.D x x x
But what I was really looking for was something like:
Mean. Median. Trimmed.
A B A B A B
Defects.I x x x x x x
Defects.D x x x x x x
Data
myData <- structure(list(Defects.I = c(1L, 1L), Defects.D = 2:3, Treatment = c("A",
"B")), .Names = c("Defects.I", "Defects.D", "Treatment"), class = "data.frame", row.names = c(NA,
-2L))
Since describeBy returns a list of data frames, we could just cbind them all, but that doesn't give the right column order. Instead we can interleave the columns.
myData <- structure(list(Defects.I = c(1L, 1L), Defects.D = 2:3,
Treatment = c("A", "B")),
.Names = c("Defects.I", "Defects.D", "Treatment"),
class = "data.frame", row.names = c(NA, -2L))
l <- psych::describeBy(myData[1:2], myData$Treatment)
So interleave using this order
order(sequence(c(ncol(l$A), ncol(l$B))))
# [1] 1 14 2 15 3 16 4 17 5 18 6 19 7 20 8 21 9 22 10 23 11 24 12 25 13 26
rather than what cbind alone would do
c(1:13, 1:13)
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 1 2 3 4 5 6 7 8 9 10 11 12 13
so this
do.call('cbind', l)[, order(sequence(lengths(l)))]
# A.vars B.vars A.n B.n A.mean B.mean A.sd B.sd A.median B.median A.trimmed B.trimmed A.mad B.mad
# Defects.I 1 1 1 1 1 1 NA NA 1 1 1 1 0 0
# Defects.D 2 2 1 1 2 3 NA NA 2 3 2 3 0 0
# A.min B.min A.max B.max A.range B.range A.skew B.skew A.kurtosis B.kurtosis A.se B.se
# Defects.I 1 1 1 1 0 0 NA NA NA NA NA NA
# Defects.D 2 3 2 3 0 0 NA NA NA NA NA NA
or as a function
interleave <- function(l, how = c('cbind', 'rbind')) {
  how <- match.arg(how)
  if (how %in% 'rbind')
    do.call(how, l)[order(sequence(sapply(l, nrow))), ]
  else
    do.call(how, l)[, order(sequence(sapply(l, ncol)))]
}
interleave(l)
# A.vars B.vars A.n B.n
# Defects.I 1 1 1 1
# Defects.D 2 2 1 1 ...
# ...
interleave(l, 'r')
# vars n mean sd median trimmed mad min max range skew kurtosis se
# A.Defects.I 1 1 1 NA 1 1 0 1 1 0 NA NA NA
# B.Defects.I 1 1 1 NA 1 1 0 1 1 0 NA NA NA
# A.Defects.D 2 1 2 NA 2 2 0 2 2 0 NA NA NA
# B.Defects.D 2 1 3 NA 3 3 0 3 3 0 NA NA NA
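If only the statistics asked about (mean, median, trimmed) are wanted, the interleaved result can then be subset by column name; the regular expression below is just an assumption about which columns to keep:
res <- interleave(l)
res[, grep("mean|median|trimmed", colnames(res))]
# keeps A.mean, B.mean, A.median, B.median, A.trimmed, B.trimmed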
You can try the mat = TRUE argument. It's not exactly what you're looking for, but it's closer:
library(psych)
mydata = data.frame(Defects.I = c(1,1), Defects.D = c(2,3), Treatment = c('A','B'))
describeBy(mydata[1:2], mydata$Treatment, mat = TRUE)
gives
item group1 vars n mean sd median trimmed mad min max range skew kurtosis se
Defects.I1 1 A 1 1 1 NA 1 1 0 1 1 0 NA NA NA
Defects.I2 2 B 1 1 1 NA 1 1 0 1 1 0 NA NA NA
Defects.D1 3 A 2 1 2 NA 2 2 0 2 2 0 NA NA NA
Defects.D2 4 B 2 1 3 NA 3 3 0 3 3 0 NA NA NA
There have been several discussions about counting consecutive strings of zeroes and ones (or other values) using functions like rle or cumsum. I have played around with these functions, but I can't easily figure out how to get them to apply to my specific problem.
I am working with ecological presence/absence data ("pres.abs" = 1 or 0) organized by time ("year") and location ("id"). For each location id, I would like to separately calculate the length of consecutive ones and zeroes through time. Where these cannot be calculated, I want to return "NA".
Below is a sample of what the data looks like (first 3 columns) and the output I am hoping to achieve (last 2 columns). Ideally, this would be a pretty fast function avoiding for-loops since the real data frame contains ~15,000 rows.
year = rep(1:10, times=3)
id = c(rep(1, times=10), rep(2, times=10), rep(3, times=10))
pres.abs.id.1 = c(0, 0, 0, 1, 1, 1, 0, 0, 1, 1) #Pres/abs data at site 1 across time
pres.abs.id.2 = c(1, 1, 0, 1, 0, 0, 1, 0, 0, 0) #Pres/abs data at site 2 across time
pres.abs.id.3 = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1) #Pres/abs data at site 3 across time
pres.abs = c(pres.abs.id.1, pres.abs.id.2, pres.abs.id.3)
dat = data.frame(id, year, pres.abs)
dat$cumul.zeroes = c(1,2,3,NA,NA,NA,1,2,NA,NA,NA,NA,1,NA,1,2,NA,1,2,3,1,2,3,4,5,NA,NA,NA,NA,NA)
dat$cumul.ones = c(NA,NA,NA,1,2,3,NA,NA,1,2,1,2,NA,1,NA,NA,1,NA,NA,NA,NA,NA,NA,NA,NA,1,2,3,4,5)
> dat
id year pres.abs cumul.zeroes cumul.ones
1 1 1 0 1 NA
2 1 2 0 2 NA
3 1 3 0 3 NA
4 1 4 1 NA 1
5 1 5 1 NA 2
6 1 6 1 NA 3
7 1 7 0 1 NA
8 1 8 0 2 NA
9 1 9 1 NA 1
10 1 10 1 NA 2
11 2 1 1 NA 1
12 2 2 1 NA 2
13 2 3 0 1 NA
14 2 4 1 NA 1
15 2 5 0 1 NA
16 2 6 0 2 NA
17 2 7 1 NA 1
18 2 8 0 1 NA
19 2 9 0 2 NA
20 2 10 0 3 NA
21 3 1 0 1 NA
22 3 2 0 2 NA
23 3 3 0 3 NA
24 3 4 0 4 NA
25 3 5 0 5 NA
26 3 6 1 NA 1
27 3 7 1 NA 2
28 3 8 1 NA 3
29 3 9 1 NA 4
30 3 10 1 NA 5
Thanks very much for your help.
Here's a base R way using rle and sequence:
dat <- within(dat, {
  cumul.counts <- unlist(lapply(split(pres.abs, id), function(x) sequence(rle(x)$lengths)))
  cumul.zeroes <- replace(cumul.counts, pres.abs == 1, NA)
  cumul.ones <- replace(cumul.counts, pres.abs == 0, NA)
  rm(cumul.counts)
})
# id year pres.abs cumul.ones cumul.zeroes
# 1 1 1 0 NA 1
# 2 1 2 0 NA 2
# 3 1 3 0 NA 3
# 4 1 4 1 1 NA
# 5 1 5 1 2 NA
# 6 1 6 1 3 NA
# 7 1 7 0 NA 1
# 8 1 8 0 NA 2
# 9 1 9 1 1 NA
# 10 1 10 1 2 NA
# 11 2 1 1 1 NA
# 12 2 2 1 2 NA
# 13 2 3 0 NA 1
# 14 2 4 1 1 NA
# 15 2 5 0 NA 1
# 16 2 6 0 NA 2
# 17 2 7 1 1 NA
# 18 2 8 0 NA 1
# 19 2 9 0 NA 2
# 20 2 10 0 NA 3
# 21 3 1 0 NA 1
# 22 3 2 0 NA 2
# 23 3 3 0 NA 3
# 24 3 4 0 NA 4
# 25 3 5 0 NA 5
# 26 3 6 1 1 NA
# 27 3 7 1 2 NA
# 28 3 8 1 3 NA
# 29 3 9 1 4 NA
# 30 3 10 1 5 NA
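The key step is sequence(rle(x)$lengths): rle() measures each run of identical values and sequence() expands every run length into a 1, 2, ... counter that restarts at each new run. A tiny illustration using site 1's data:
x <- c(0, 0, 0, 1, 1, 1, 0, 0, 1, 1)
rle(x)$lengths
# [1] 3 3 2 2
sequence(rle(x)$lengths)
# [1] 1 2 3 1 2 3 1 2 1 2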
Here's one option with dplyr:
require(dplyr)
dat %>%
  group_by(id, x = cumsum(c(0, diff(pres.abs)) != 0)) %>%
  mutate(cumul.zeros = ifelse(pres.abs, NA_integer_, row_number()),
         cumul.ones = ifelse(!pres.abs, NA_integer_, row_number())) %>%
  ungroup() %>%
  select(-x)
#Source: local data frame [30 x 5]
#
# id year pres.abs cumul.zeros cumul.ones
#1 1 1 0 1 NA
#2 1 2 0 2 NA
#3 1 3 0 3 NA
#4 1 4 1 NA 1
#5 1 5 1 NA 2
#6 1 6 1 NA 3
#7 1 7 0 1 NA
#8 1 8 0 2 NA
#9 1 9 1 NA 1
#10 1 10 1 NA 2
#11 2 1 1 NA 1
#12 2 2 1 NA 2
#13 2 3 0 1 NA
#14 2 4 1 NA 1
#15 2 5 0 1 NA
#16 2 6 0 2 NA
#17 2 7 1 NA 1
#18 2 8 0 1 NA
#19 2 9 0 2 NA
#20 2 10 0 3 NA
#21 3 1 0 1 NA
#22 3 2 0 2 NA
#23 3 3 0 3 NA
#24 3 4 0 4 NA
#25 3 5 0 5 NA
#26 3 6 1 NA 1
#27 3 7 1 NA 2
#28 3 8 1 NA 3
#29 3 9 1 NA 4
#30 3 10 1 NA 5
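Here the helper grouping variable x = cumsum(c(0, diff(pres.abs)) != 0) starts a new group whenever pres.abs changes, so row_number() restarts its count at every run. For site 1's data it looks like this:
x <- c(0, 0, 0, 1, 1, 1, 0, 0, 1, 1)
cumsum(c(0, diff(x)) != 0)
# [1] 0 0 0 1 1 1 2 2 3 3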