Calculate run length aggregated by subject ID conditional on observation == 1 - r

I am trying to use the rle function in R to calculate the run lengths for the variable positive in the example below, aggregated by the variable id.
Here is a toy dataset (that admittedly has a few quirks):
test <- c('id', 'positive')
test$id <- rep(1:3, c(24, 24, 24))
set.seed(123456)
test$positive <- round(runif(72, 0, 1))
test <- data.frame(test)
test <- subset(test, select = -X.id.)
test <- subset(test, select = -X.positive.)
result <- aggregate(positive ~ id, data = test, FUN = rle)
As currently set up, this computes the run lengths for every value (0 and 1) of the variable positive. Is it possible to condition this function so that it only evaluates the run lengths when positive == 1?
At the end of the day, I ultimately want to figure out how to count the number of instances in which two or more consecutive months were positive (positive == 1) for each subject.
UPDATE:
I have a variable called event that has values of 0 or 1. For each of the occurrences of two or more positives that were developed from the code featured in the suggestions below, is it possible to stratify our results such that if event == 1 occurs during any of the positive months it would be classified differently than a run of positives in which event == 0 for all of the months?
The toy dataset looks like this:
set.seed(123456)
x <- c(1, 2, 1)
test <- data.frame(id = rep(1:3, each = 24), positive = round(runif(72, 0, 1)), event = round(runif(72, 0, 1)))
results <- aggregate(positive ~ id + event, data = test, FUN=function(x) with(rle(x), sum(lengths > 1 & values == 1)))
aggregate(positive ~ event, data = results, FUN=sum)
However, this code gives all possible permutations of event and positive, while I would like to delimit the results to counting only those occurrences of two or more consecutive positive months for which any event == 1. Alternatively, if it is easier to evaluate only the number of consecutive positive months for which all event == 0 that would be a fine solution too.

To count occurrences of two or more consecutive positives, use this:
aggregate(positive ~ id, data=test, FUN=function(x) with(rle(x), sum(lengths>=2 & values==1)))
(Inspired by @sgibb's answer.)
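As a minimal sketch of what rle() returns and why the filter on lengths and values works, here is the same counting logic applied to a short hand-made vector. The vector x and the helper name count_runs are illustrative assumptions, not part of the question's data:

```r
# Hypothetical example vector: runs are 1,1 | 0 | 1 | 0,0 | 1,1,1
x <- c(1, 1, 0, 1, 0, 0, 1, 1, 1)

r <- rle(x)
r$lengths  # 2 1 1 2 3
r$values   # 1 0 1 0 1

# Count runs of 1s that are at least `min_len` long
count_runs <- function(x, min_len = 2) {
  with(rle(x), sum(lengths >= min_len & values == 1))
}

count_runs(x)     # 2
count_runs(x, 3)  # 1
```

Passing a function like this to aggregate() applies it once per id, which is what the one-liner above does.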
EDIT: To count the runs of two or more consecutive positives in which at least one record has event==1, separately for each id:
Calculate the run to which each record belongs:
tmp <- within(test, run <- ave(positive, by=id, FUN=function(x)cumsum(c(1,diff(x)!=0))))
# id positive event run
# 1 1 1 1
# 1 1 0 1
# 1 0 1 2
# 1 0 0 2
# 1 0 1 2
# 1 0 0 2
For each id and each run, mark whether the run contains at least one record with event==1 and has length >= 2:
tmp2 <- aggregate(event~id+positive+run, data=tmp, function(x)any(x>0) && length(x)>=2)
# id positive run event
# 2 0 1 FALSE
# 1 1 1 TRUE
# 3 1 1 FALSE
# 1 0 2 TRUE
# 3 0 2 TRUE
# 2 1 2 TRUE
Now count how many marked runs there are for each id and each kind of run (positive==1 or positive==0):
aggregate(event~positive+id, tmp2, sum)
# positive id event
# 0 1 1
# 1 1 2
# 0 2 1
# 1 2 3
# 0 3 3
# 1 3 1
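The same idea can be sketched for a single subject using rle() directly, which may make the stratification easier to follow. The vectors below are hypothetical and the variable names (run, any_event, keep) are assumptions made for illustration only:

```r
# Hypothetical data for one subject
positive <- c(1, 1, 0, 1, 1, 1, 0, 1)
event    <- c(0, 0, 1, 0, 1, 0, 0, 1)

# Label every record with the index of the run it belongs to
r   <- rle(positive)
run <- rep(seq_along(r$lengths), r$lengths)
run  # 1 1 2 3 3 3 4 5

# For each run: did any record in it have event == 1?
any_event <- as.vector(tapply(event == 1, run, any))

# Positive runs of length two or more
keep <- r$values == 1 & r$lengths >= 2

sum(keep & any_event)   # runs with at least one event:  1
sum(keep & !any_event)  # runs with no event at all:     1
```

For several subjects, one option is to wrap these steps in a function and apply it to each piece of split(test, test$id).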

Do you mean something like this?
aggregate(positive ~ id, data=test, FUN=function(x) {
  r <- rle(x)
  # keep only the lengths of the runs whose value is 1
  return(r$lengths[r$values == 1])
})
# id positive
# 1 1 2, 1, 1, 7, 1
# 2 2 4, 2, 1, 4, 2, 1, 2
# 3 3 1, 7, 1, 1, 1

A ddply version for the 'at the end of the day' part:
library(plyr)
set.seed(123456)
test <- data.frame(id = rep(1:3, each = 24), positive = round(runif(72, 0, 1)))
ddply(.data = test, .variables = .(id), function(x){
rl <- rle(x$positive)
# count the runs of 1s that are longer than one month
sum(rl$lengths[rl$values == 1] > 1)
}
)
# id V1
# 1 1 2
# 2 2 5
# 3 3 1

Related

Find all subsequences with specific length in sequence of numbers in R

I want to find all subsequences within a sequence with a (minimum) length of n. Let's assume I have this sequence
sequence <- c(1,2,3,2,5,3,2,6,7,9)
and I want to find the increasing subsequences with a minimum length of 3. The output should be a data frame with the start and end position of each subsequence found.
df =data.frame(c(1,7),c(3,10))
colnames(df) <- c("start", "end")
Can somebody give a hint how to solve my problem?
Thanks in advance!
One way using only base R
n <- 3
do.call(rbind, sapply(split(1:length(sequence), cumsum(c(0, diff(sequence)) < 1)),
function(x) if (length(x) >= n) c(start = x[1], end = x[length(x)])))
# start end
#1 1 3
#4 7 10
We split the index of sequence into groups of consecutive increasing elements; if the length of a group is greater than or equal to n, we return the start and end index of that group.
To understand this, let's break it down step by step.
Using diff, we can find the difference between consecutive elements (a 0 is prepended so that the result has the same length as sequence)
c(0, diff(sequence))
#[1] 0 1 1 -1 3 -2 -1 4 1 2
We check which positions do not continue an increasing subsequence
c(0, diff(sequence)) < 1
#[1] TRUE FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE
and take cumulative sum over them to create groups
cumsum(c(0, diff(sequence)) < 1)
#[1] 1 1 1 2 2 3 4 4 4 4
Based on these groups, we split the index from 1:length(sequence)
split(1:length(sequence), cumsum(c(0, diff(sequence)) < 1))
#$`1`
#[1] 1 2 3
#$`2`
#[1] 4 5
#$`3`
#[1] 6
#$`4`
#[1] 7 8 9 10
Using sapply we loop over this list and return the start and end index of the list if the length of the list is >= n (3 in this case)
sapply(split(1:length(sequence), cumsum(c(0, diff(sequence)) < 1)),
function(x) if (length(x) >= n) c(start = x[1], end = x[length(x)]))
#$`1`
#start end
# 1 3
#$`2`
# NULL
#$`3`
#NULL
#$`4`
#start end
# 7 10
Finally, rbind all of them together using do.call. NULL elements are automatically ignored.
do.call(rbind, sapply(split(1:length(sequence), cumsum(c(0, diff(sequence)) < 1)),
function(x) if (length(x) >= n) c(start = x[1], end = x[length(x)])))
# start end
#1 1 3
#4 7 10
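The steps above can be collected into a small reusable function. This is only a sketch: the name increasing_runs is made up for illustration, and it uses Filter() to drop the short groups explicitly instead of relying on rbind skipping NULL elements:

```r
# Hypothetical helper: start/end positions of strictly increasing runs
# that contain at least `n` elements
increasing_runs <- function(x, n = 3) {
  # A new group starts wherever the value does not increase
  grp    <- cumsum(c(0, diff(x)) < 1)
  pieces <- split(seq_along(x), grp)
  # Keep only the groups that are long enough, and drop the group names
  pieces <- unname(Filter(function(idx) length(idx) >= n, pieces))
  data.frame(
    start = vapply(pieces, function(idx) idx[1], integer(1)),
    end   = vapply(pieces, function(idx) idx[length(idx)], integer(1))
  )
}

res <- increasing_runs(c(1, 2, 3, 2, 5, 3, 2, 6, 7, 9))
res
#   start end
# 1     1   3
# 2     7  10
```

Because vapply() is told to expect one integer per group, the function returns a zero-row data frame rather than NULL when no run is long enough.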
Here is another solution using base R. I tried to comment it well but it may still be hard to follow. It seems like you wanted direction / to learn, more than an outright answer so definitely follow up with questions if anything is unclear (or doesn't work for your actual application).
Also, for your data, I added a 12 on the end to make sure it was returning the correct position for repeated increases greater than n (3 in this case):
# Data (I added 12 on the end)
sequence <- c(1,2,3,2,5,3,2,6,7,9, 12)
# Create indices for whether or not the numbers in the sequence increased
indices <- c(1, diff(sequence) >= 1)
indices
[1] 1 1 1 0 1 0 0 1 1 1 1
Now that we have the indices, we need to get the start and end positions for repeats >= 3
# Finding increasing sequences of n length using rle
n <- 3
n <- n - 1
# Examples
rle(indices)$lengths
[1] 3 1 1 2 4
rle(indices)$values
[1] 1 0 1 0 1
# Finding repeated TRUE (1) in our indices vector
reps <- rle(indices)$lengths >= n & rle(indices)$values == 1
reps
[1] TRUE FALSE FALSE FALSE TRUE
# Creating a vector of positions for the end of a sequence
# Because our indices are true false, we can use cumsum along
# with rle to create the positions of the end of the sequences
rle_positions <- cumsum(rle(indices)$lengths)
rle_positions
[1] 3 4 5 7 11
# Creating start sequence vector and subsetting start / end using reps
start <- c(1, head(rle_positions, -1))[reps]
end <- rle_positions[reps]
data.frame(start, end)
start end
1 1 3
2 7 11
Or, concisely:
n <- 3
n <- n-1
indices <- c(1, diff(sequence) >= 1)
reps <- rle(indices)$lengths >= n & rle(indices)$values == 1
rle_positions <- cumsum(rle(indices)$lengths)
data.frame(start = c(1, head(rle_positions, -1))[reps],
end = rle_positions[reps])
start end
1 1 3
2 7 11
EDIT: @Ronak's update made me realize I should be using diff instead of sapply with an anonymous function for my first step. I updated the answer because it was not catching an increase at the end of the vector (e.g., sequence <- c(1,2,3,2,5,3,2,6,7,9,12, 11, 11, 20, 100)); I also needed to add one more line under n <- 3. This should work as intended now.

Take first non-0 value or last 0 value if that's all there is

Ciao,
Here is my replicating example.
HAVE <- data.frame(ID=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6),
ABSENCE=c(NA,NA,NA,0,0,0,0,0,1,NA,0,NA,0,1,2,0,0,0),
TIME=c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3))
WANT <- data.frame(ID=c(1,2,3,4,5,6),
ABSENCE=c(NA,0,1,0,1,0),
TIME=c(NA,3,3,2,2,3))
The tall data file HAVE is the one I need to convert to WANT. Essentially, for each ID I need to identify the first non-zero value of ABSENCE, and that row goes into WANT. If all values of ABSENCE are NA, then TIME is NA. If all values of ABSENCE are 0, then I report the last available row in WANT (as reflected in the TIME variable).
This is my attempt:
WANT <- group_by(HAVE,ID) %>% slice(seq_len(min(which(ABSENCE > 0), n())))
but I do not know how to take the last of the 0 rows if there are only 0s.
library(data.table)
setDT(HAVE)
res = unique(HAVE[, .(ID)])
# look up first ABSENCE > 0
res[, c("ABSENCE", "TIME") := unique(HAVE[ABSENCE > 0], by="ID")[.SD, on=.(ID), .(ABSENCE, TIME)]]
# if nothing was found, look up last ABSENCE == 0
res[is.na(ABSENCE), c("ABSENCE", "TIME") := unique(HAVE[ABSENCE == 0], by="ID", fromLast=TRUE)[.SD, on=.(ID), .(ABSENCE, TIME)]]
# check
all.equal(as.data.frame(res), WANT)
# [1] TRUE
ID ABSENCE TIME
1: 1 NA NA
2: 2 0 3
3: 3 1 3
4: 4 0 2
5: 5 1 2
6: 6 0 3
I'm using data.table since the tidyverse does not and never will support sub-assignment / modifying only rows selected by a condition (like the is.na(ABSENCE) here).
If these two rules can be made more consistent with each other, this should be doable with a left join or a single group_by + slice, as the OP attempted. Okay, here's one way, though it looks impossible to debug:
HAVE %>%
arrange(ID, -(ABSENCE > 0), TIME*(ABSENCE > 0), -TIME) %>%
distinct(ID, .keep_all = TRUE)
ID ABSENCE TIME
1 1 NA 3
2 2 0 3
3 3 1 3
4 4 0 2
5 5 1 2
6 6 0 3
Using data.table as well, based on subsetting the .I row counter:
WANT <- HAVE[
HAVE[,
if(all(is.na(ABSENCE))) .I[1] else
if(!any(ABSENCE > 0, na.rm=TRUE)) max(.I[ABSENCE==0], na.rm=TRUE) else
min(.I[ABSENCE > 0], na.rm=TRUE),
by=ID
]$V1,
]
WANT[is.na(ABSENCE), TIME := NA_integer_]
# ID ABSENCE TIME
#1: 1 NA NA
#2: 2 0 3
#3: 3 1 3
#4: 4 0 2
#5: 5 1 2
#6: 6 0 3
Here are two approaches using dplyr and custom functions. Both rely on the data being sorted by TIME.
Filter Approach
# We'll use this function inside filter() to keep only the desired rows
flag_wanted <- function(absence){
flags <- rep(FALSE, length(absence))
if (any(absence > 0, na.rm = TRUE)) {
# There's a positive value somewhere in absence; we want the first one.
flags[which.max(absence > 0)] <- TRUE
} else if (any(absence == 0, na.rm = TRUE)) {
# There's a zero value somewhere in absence; we want the last one.
flags[max(which(absence == 0))] <- TRUE
} else {
# All values are NA; we want the last row
flags[length(absence)] <- TRUE
}
return(flags)
}
# After filtering, we have to flip TIME to NA if ABSENCE is NA
HAVE %>%
arrange(ID, TIME) %>%
group_by(ID) %>%
filter(flag_wanted(ABSENCE)) %>%
mutate(TIME = ifelse(is.na(ABSENCE), NA, TIME)) %>%
ungroup()
# A tibble: 6 x 3
ID ABSENCE TIME
<dbl> <dbl> <dbl>
1 1. NA NA
2 2. 0. 3.
3 3. 1. 3.
4 4. 0. 2.
5 5. 1. 2.
6 6. 0. 3.
The filter() step reduces the dataframe to the rows you need. Since it doesn't modify the TIME values, we need to mutate() as well.
Summarize Approach
# This function captures the general logic of getting the value of one variable
# based on the value of another
get_wanted <- function(of_this, by_this){
# If there are any positive values of `by_this`, use the first
if (any(by_this > 0, na.rm = TRUE)) {
return( of_this[ which.max(by_this > 0) ] )
}
# If there are any zero values of `by_this`, use the last
if (any(by_this == 0, na.rm = TRUE)) {
return( of_this[ max(which(by_this == 0)) ] )
}
# Otherwise, use NA
return(NA)
}
HAVE %>%
arrange(ID, TIME) %>%
group_by(ID) %>%
summarize(TIME = get_wanted(of_this = TIME, by_this = ABSENCE),
ABSENCE = get_wanted(of_this = ABSENCE, by_this = ABSENCE))
# A tibble: 6 x 3
ID TIME ABSENCE
<dbl> <dbl> <dbl>
1 1. NA NA
2 2. 3. 0.
3 3. 3. 1.
4 4. 2. 0.
5 5. 2. 1.
6 6. 3. 0.
The order of summarization matters because we're overwriting variables, so this approach is risky. It only produces the output WANT if you summarize TIME and then ABSENCE.
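The selection rule shared by both approaches can be checked on plain vectors before wiring it into dplyr. The sketch below uses a hypothetical helper, pick_index, that returns the position of the row to keep; it assumes the values are already sorted by time:

```r
# Hypothetical helper: position of the row to keep for one ID
pick_index <- function(absence) {
  # First positive value, if there is one
  if (any(absence > 0, na.rm = TRUE)) return(which.max(absence > 0))
  # Otherwise the last zero, if there is one
  if (any(absence == 0, na.rm = TRUE)) return(max(which(absence == 0)))
  # Otherwise everything is NA
  NA_integer_
}

pick_index(c(0, 1, 2))     # 2  (like ID 5)
pick_index(c(0, 0, 0))     # 3  (like ID 2)
pick_index(c(NA, 0, NA))   # 2  (like ID 4)
pick_index(c(NA, NA, NA))  # NA (like ID 1)
```

Returning a position rather than a value avoids the ordering problem described above, since the same index can be used for both TIME and ABSENCE.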

Vectorize for loop over two rows with condition

I want to perform some operations on my dataframe, but I have some problems with performance, so I was wondering how I could speed up the performance of my code.
My data has several columns and if the column X is 0, I want to do some operations on other columns (adding and max). If X is 1, do nothing (X can only be 1 or 0)
df <- data.frame(X = c(0,0,1,0,1),Y = c(10,0,0,3,7),Z = c(2,2,0,4,5))
df
X Y Z
1 0 10 2
2 0 0 2
3 1 0 0
4 0 3 4
5 1 7 5
Right now my code looks like:
for(i in 1:(nrow(df)-1)){
if(df$X[i] == 0){
df$Y[i+1] <- df$Y[i]+df$Y[i+1]
df$Z[i+1] <- max(df$Z[i],df$Z[i+1])
}
}
The result should look like:
df
X Y Z
1 0 10 2
2 0 10 2
3 1 10 2
4 0 3 4
5 1 10 5
Is there a way to write this more efficiently?
Additionally, a lot of the rows contain only 0's, so I was wondering if there is an efficient way to skip the operations for these rows, as the value won't change.
Edit:
As I was a bit unspecific about the rules, here they are in greater detail:
Y should be summed up until there is a 1 again (the sum, including the value of the row where the 1 is, should replace the value in the row with the 1). The same principle should be applied to the Z variable, but this time with the max() function.
Many thanks!
How about something like this? This reproduces your expected output:
library(dplyr)
df <- data.frame(X = c(0,0,1,0,1),Y = c(10,0,0,3,7),Z = c(2,2,0,4,5))
df %>%
mutate(
group = cumsum(c(0, diff(X) == -1))) %>%
group_by(group) %>%
mutate(
n = 1:n(),
Y = cumsum(Y),
Z = ifelse(n > 1, max(Z, lead(Z, default = 0)), Z)) %>%
ungroup() %>%
select(X, Y, Z)
# # A tibble: 5 x 3
# X Y Z
# <dbl> <dbl> <dbl>
#1 0. 10. 2.
#2 0. 10. 2.
#3 1. 10. 2.
#4 0. 3. 4.
#5 1. 10. 5.
Explanation: Group entries based on 0-series terminated by 1; replace Y with the cumsum of Y; replace Z with the maximum of entries in that row and from the next row, starting from the second row (n > 1).
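For comparison, the grouping idea can also be sketched in base R with ave(), using cumsum for Y and cummax for Z. This is a swapped-in alternative rather than the code from the answer, and it assumes X contains only 0 and 1, as the question states:

```r
df <- data.frame(X = c(0, 0, 1, 0, 1), Y = c(10, 0, 0, 3, 7), Z = c(2, 2, 0, 4, 5))

# A new group starts on the row after each X == 1
# (assumes X holds only 0s and 1s)
group <- cumsum(c(0, head(df$X, -1)))
group  # 0 0 0 1 1

out <- df
out$Y <- ave(df$Y, group, FUN = cumsum)  # running sum within each group
out$Z <- ave(df$Z, group, FUN = cummax)  # running maximum within each group
out
#   X  Y Z
# 1 0 10 2
# 2 0 10 2
# 3 1 10 2
# 4 0  3 4
# 5 1 10 5
```

Within each group the running sum and running maximum match what the original for loop computes row by row, so no explicit loop is needed.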

R: Count of consecutive identical values across columns using RLE

I'm using R. I have a data frame that consists of a row for each player and then columns representing each month and a number of points they earned (illustrative data with random values below). I would like to add a new column (Points$ConsecutiveShutouts) that contains the longest consecutive streak for a specified point total over say the past 5 months.
Points <- data.frame("Player" = c("Alpha", "Beta", "Charlie", "Delta", "Echo", "Foxtrot", "Gamma"), "MayPts" = c(floor(runif(7, 0, 3))), "JunPts" = c(floor(runif(7, 0, 3))), "JulPts" = c(floor(runif(7, 0, 3))), "AugPts" = c(floor(runif(7, 0, 3))), "SepPts" = c(floor(runif(7, 0, 3))), "OctPts" = c(floor(runif(7, 0, 3))), "NovPts" = c(floor(runif(7, 0, 3))),"DecPts" = c(floor(runif(7, 0, 3))))
Player MayPts JunPts JulPts AugPts SepPts OctPts NovPts DecPts
Alpha 0 0 1 0 2 2 2 0
Beta 1 0 1 1 1 1 1 2
Charlie 1 2 2 0 2 1 1 0
Delta 0 1 1 2 2 2 0 0
Echo 1 1 0 2 1 2 0 1
Foxtrot 1 0 0 0 0 0 2 1
Gamma 2 0 1 1 0 2 0 1
I have tried using rle(points):
# Establish the start and end months
StartMonth <- which(colnames(Points) == "SepPts")
EndMonth <- which(colnames(Points) == "DecPts")
# Find total of consecutive months with 0 points
Points$ConsecutiveShutOuts <- max(rle(Points[ ,StartMonth:EndMonth] == 0), lengths[!values])
Doing this, I end up with the error "'x' must be a vector of an atomic type"
Any advice on what I am doing wrong and how I can fix? Or alternative approaches?
Thanks in advance! [Beginner here, so hopefully I followed the correct approach to question asking :)]
I would use long form as well. I would first create a function like this.
myfun <- function(series,value){
tmp <- rle(series); runs <- tmp$lengths[tmp$values == value]
if (length(runs)==0) return(0)
else return(max(runs))
}
Using tidyr/dplyr, you can proceed as
library(dplyr)
library(tidyr)
Points %>%
gather(months,Pts,MayPts:DecPts) %>%
group_by(Player) %>%
summarise(x=myfun(tail(Pts,5),0))
# Past 5 month, number of consecutive zeros for each player.
Of course, you can join the result to the original wide-form data frame if you'd like to.
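If you would rather stay in wide form, a base R sketch is to run the same kind of helper over each row with apply(). The data below is a hypothetical three-player subset of the illustrative table, and longest_run is an assumed name that mirrors myfun:

```r
# Hypothetical subset of the illustrative table (three players, four months)
Points <- data.frame(
  Player = c("Beta", "Delta", "Foxtrot"),
  SepPts = c(1, 2, 0),
  OctPts = c(1, 2, 0),
  NovPts = c(1, 0, 2),
  DecPts = c(2, 0, 1)
)

# Longest run of `value` in a plain vector; 0 if it never occurs
longest_run <- function(series, value) {
  r <- rle(series)
  runs <- r$lengths[r$values == value]
  if (length(runs) == 0) 0L else max(runs)
}

# rle() needs a plain vector, so convert the month columns to a matrix
# and hand the helper one row at a time
month_cols <- c("SepPts", "OctPts", "NovPts", "DecPts")
Points$ConsecutiveShutouts <- apply(as.matrix(Points[, month_cols]), 1,
                                    longest_run, value = 0)
Points$ConsecutiveShutouts  # 0 2 2
```

The error in the question comes from passing a whole matrix to rle(); applying the helper row by row gives it one vector per player.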
If you want to sum based upon some condition (e.g., only summing points higher than 1), you can melt and restrict the summation only to rows greater than that value.
library(data.table)
Points <- as.data.table(Points)
Points <- melt(Points, id="Player", variable.name = "Month", value.name = "PTs")
Points <- Points[PTs>1, list(PTs = sum(PTs, na.rm=TRUE)), by="Player"] #change ">1" if you prefer a different value

For Loop in R - deleting all rows which match by one variable

I'm trying to completely delete rows in a dataset for cases with matching variables (case ID) with the help of this function I wrote:
del_row_func <- function(x){
for(i in 1:length(x$FALL_ID)){
for(j in 1:length(x$FALL_ID)){
if(x$FALL_ID[i] == x$FALL_ID[j] & i != j){
x[-i, ]
}
}
}
}
Does anybody have an idea why it doesn't work?
The reason your code didn't work was that you weren't modifying or returning x. However, there is a better way to remove all rows with a duplicated ID:
dat = data.frame(FALL_ID = c(1, 2, 2, 3), y = 1:4)
dat
# FALL_ID y
# 1 1 1
# 2 2 2
# 3 2 3
# 4 3 4
dat[!duplicated(dat$FALL_ID) & !duplicated(dat$FALL_ID, fromLast=TRUE),]
# FALL_ID y
# 1 1 1
# 4 3 4
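To make the first point concrete: x[-i, ] inside the loop builds a new data frame and immediately discards it, so nothing is ever changed or returned. A minimal sketch of a version that does return its result, reusing the duplicated() idea (this rewrite of del_row_func is illustrative, not the asker's code):

```r
# Hypothetical rewrite of del_row_func: returns a new data frame
# without any row whose FALL_ID occurs more than once
del_row_func <- function(x) {
  dup <- duplicated(x$FALL_ID) | duplicated(x$FALL_ID, fromLast = TRUE)
  x[!dup, , drop = FALSE]  # the value of the last expression is returned
}

dat <- data.frame(FALL_ID = c(1, 2, 2, 3), y = 1:4)
cleaned <- del_row_func(dat)  # assign the result; `dat` itself is unchanged
cleaned$FALL_ID  # 1 3
nrow(dat)        # 4
```

R functions work on copies of their arguments, so the caller has to assign the returned value to keep the filtered data.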
