R: Count of consecutive identical values across columns using RLE

I'm using R. I have a data frame with one row per player and one column per month holding the number of points earned that month (illustrative data with random values below). I would like to add a new column (Points$ConsecutiveShutOuts) that contains the longest consecutive streak of a specified point total over, say, the past 5 months.
Points <- data.frame("Player" = c("Alpha", "Beta", "Charlie", "Delta", "Echo", "Foxtrot", "Gamma"), "MayPts" = c(floor(runif(7, 0, 3))), "JunPts" = c(floor(runif(7, 0, 3))), "JulPts" = c(floor(runif(7, 0, 3))), "AugPts" = c(floor(runif(7, 0, 3))), "SepPts" = c(floor(runif(7, 0, 3))), "OctPts" = c(floor(runif(7, 0, 3))), "NovPts" = c(floor(runif(7, 0, 3))),"DecPts" = c(floor(runif(7, 0, 3))))
Player MayPts JunPts JulPts AugPts SepPts OctPts NovPts DecPts
Alpha 0 0 1 0 2 2 2 0
Beta 1 0 1 1 1 1 1 2
Charlie 1 2 2 0 2 1 1 0
Delta 0 1 1 2 2 2 0 0
Echo 1 1 0 2 1 2 0 1
Foxtrot 1 0 0 0 0 0 2 1
Gamma 2 0 1 1 0 2 0 1
I have tried using rle(points):
# Establish the start and end months
StartMonth <- which(colnames(Points) == "SepPts")
EndMonth <- which(colnames(Points) == "DecPts")
# Find total of consecutive months with 0 points
Points$ConsecutiveShutOuts <- max(rle(Points[ ,StartMonth:EndMonth] == 0), lengths[!values])
Doing this, I end up with the error "'x' must be a vector of an atomic type".
Any advice on what I am doing wrong and how I can fix it? Or alternative approaches?
Thanks in advance! [Beginner here, so hopefully I followed the correct approach to question asking :)]

I would use long form as well. I would first create a function like this.
myfun <- function(series,value){
tmp <- rle(series); runs <- tmp$lengths[tmp$values == value]
if (length(runs)==0) return(0)
else return(max(runs))
}
Using tidyr/dplyr, you can proceed as
library(dplyr)
library(tidyr)
Points %>%
gather(months,Pts,MayPts:DecPts) %>%
group_by(Player) %>%
summarise(x=myfun(tail(Pts,5),0))
# Past 5 months: longest run of consecutive zeros for each player.
Of course, you can join the result to the original wide-form data frame if you'd like to.
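If you would rather stay in wide form, here is a minimal sketch of the same idea applied row by row. The error in the question arises because rle() expects a single atomic vector, while Points[, StartMonth:EndMonth] == 0 is a whole logical matrix; apply() hands rle() one row at a time instead. The helper name longest_run and the three-player sample data are made up for illustration.

```r
# Longest run of `value` in a vector (same idea as myfun above)
longest_run <- function(series, value) {
  r <- rle(series)
  runs <- r$lengths[r$values == value]
  if (length(runs) == 0) 0L else max(runs)
}

# Small fixed sample (three of the players, values chosen by hand)
Points <- data.frame(
  Player = c("Alpha", "Beta", "Foxtrot"),
  MayPts = c(0, 1, 1), JunPts = c(0, 0, 0), JulPts = c(1, 1, 0),
  AugPts = c(0, 1, 0), SepPts = c(2, 1, 0), OctPts = c(2, 1, 0),
  NovPts = c(2, 1, 2), DecPts = c(0, 2, 1)
)

# Names of the last five month columns (Aug to Dec here)
month_cols <- tail(names(Points)[-1], 5)

# One rle() call per row, restricted to those columns
Points$ConsecutiveShutOuts <- apply(as.matrix(Points[month_cols]), 1,
                                    longest_run, value = 0)
Points[c("Player", "ConsecutiveShutOuts")]
```

Changing value = 0 to another number gives the streak for a different point total.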

If you want to sum based upon some condition (e.g., only summing points higher than 1), you can melt and restrict the summation only to rows greater than that value.
library(data.table)
Points <- as.data.table(Points)
Points <- melt(Points, id="Player", variable.name = "Month", value.name = "PTs")
Points <- Points[PTs>1, list(PTs = sum(PTs, na.rm=TRUE)), by="Player"] #change ">1" if you prefer a different value

Related

How to mutate row-wise and replace first 'n' values with 0

I'm trying to build a measurement tracker form. I'd like to populate 0 into the "Measurement" columns for each row, based on the Quantity value for that row. (For Quantity = 2, first 2 measurements = 0, the rest of the row = NA). (For Quantity = 4, all measurements = 0).
I'm wondering how to mutate these rows and replace with the proper number of 0s in the correctly indexed positions, like so:
Feature Tool MIN MAX Quantity Measurement_1 Measurement_2 Measurement_3 Measurement_4
1 a m 0.5 1.0 2 0 0 NA NA
2 b n 0.4 1.2 4 0 0 0 0
The sample code to generate the dataframe is here:
#sample data
A1 <- data.frame(Feature = c("a","b"), Tool = c("m","n"), MIN = c(0.5,0.4), MAX = c(1.0,1.2), Quantity = c(2,4))
# Create empty data frame of NA
df <- data.frame(matrix(NA,
nrow = 1,
ncol = max(A1$Quantity)))
#create list of sequential measurements based on maximum quantity of measurements
M <- c(sprintf("Measurement_%01d", seq(1,max(A1$Quantity))))
#set column names to these measurements
colnames(df) <- M
#combine sample data with measurements
new_dat <- cbind(A1, df)
My first attempts to accomplish this are something like this:
new_dat %>% rowwise() %>% mutate(new_dat[,6:new_dat$Quantity] <- 0)
But it's clear I'm missing something here.
Thanks!
A vectorized approach in base R is to use row/column indexing. Build a two-column matrix holding the row index (replicated Quantity times) and the matching column index for each row, then do the assignment in one step:
j1 <- grep("Measurement", names(new_dat))
new_dat[cbind(rep(seq_len(nrow(new_dat)), new_dat$Quantity),
j1[sequence(new_dat$Quantity)])] <- 0
-output
> new_dat
Feature Tool MIN MAX Quantity Measurement_1 Measurement_2 Measurement_3 Measurement_4
1 a m 0.5 1.0 2 0 0 NA NA
2 b n 0.4 1.2 4 0 0 0 0
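To see what the index matrix contains, here is a minimal sketch on a bare matrix. The names qty, rows, cols and m are illustrative only; the real code above additionally offsets the column index through j1.

```r
qty <- c(2, 4)                       # the Quantity column from the sample data

rows <- rep(seq_along(qty), qty)     # row index, repeated Quantity times
cols <- sequence(qty)                # 1..Quantity for each row, concatenated
cbind(rows, cols)                    # one (row, column) pair per cell to fill

m <- matrix(NA_real_, nrow = 2, ncol = 4)
m[cbind(rows, cols)] <- 0            # matrix indexing assigns all cells at once
m
```

Indexing with a two-column matrix addresses individual cells rather than whole rows or columns, which is what makes the single assignment possible.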
Or with dplyr, we could do
library(dplyr)
new_dat %>%
mutate(across(starts_with("Measurement"),
~ replace(.x, readr::parse_number(cur_column()) <= Quantity, 0)))
-output
Feature Tool MIN MAX Quantity Measurement_1 Measurement_2 Measurement_3 Measurement_4
1 a m 0.5 1.0 2 0 0 NA NA
2 b n 0.4 1.2 4 0 0 0 0

Calculate the difference between a row in a dataset and all rows in another dataset in R

I have 2 datasets. For each row in dataset1, I want to calculate the difference from every row in dataset2, replacing any negative difference with 0. Here is a simple example of my 2 datasets (my real datasets are around 1000*1000).
df1 <- data.frame(ID = c(1, 2), Obs = c(1.0, 2.0), var=c(2.0,5.0))
df2 <- data.frame(ID = c(2, 1), Obs = c(3.0, 2.0),var=c(7.0,3.0))
df1
ID Obs var
1 1 1 2
2 2 2 5
df2
ID Obs var
1 2 3 7
2 1 2 3
for(i in 1:nrow(df1)){
b1=as.matrix(df1)
b2=as.matrix(df2)
diff= b1-b2
diff[which(diff < 0 )] <- 0
diff.data= data.frame(cbind(diff, total = rowSums(diff)))
}
diff.data
ID Obs var total
1 0 0 0 0
2 1 0 2 3
This is what I have been able to do: I took the difference between the 2 datasets, replaced the negative values with 0, and summed the columns of each row afterwards. What I want instead is, for the first row in df1, the difference from each row in df2, then for the second row in df1 the difference from each row in df2, and so on. Note that I should not calculate the difference between the IDs (I don't know how to do that, maybe by changing diff= b1-b2 to diff= b1[,-1]-b2[,-1]?). I want to keep the ID from df1 to keep track of my patients (which is what the rows of my dataset are). I would like to have something like this:
diff.data
ID Obs var total
1 0 0 0
1 0 0 0
2 0 0 0
2 0 2 2
I thank you in advance for your help.
Here is what I have using your answer; I wanted to wrap it in a simple function. I would like the function to accept either matrices or data frames, but so far I was only able to raise an error if the inputs are not data frames:
difference=function(df1,df2){
if(class(df1) != "data.frame" || class(df2) != "data.frame") stop(" df1 or df2 is not a dataframe!")
df1=data.frame(df1)
df2=data.frame(df2)
ID1=seq(nrow(df1))
ID2=seq(nrow(df2))
new_df1 = df1[rep(ID1, each = nrow(df2)), ]
new_df1[-1] = new_df1[-1] - df2[rep(seq(nrow(df2)), nrow(df1)), -1]
new_df1[new_df1 < 0] = 0
new_df1$total = rowSums(new_df1[-1])
rownames(new_df1) = NULL
output=new_df1
return(output)
}
I know that, because I specified df1=data.frame(df1), the input ends up as a data frame; I just don't know how to also allow it to be a matrix.
Thank you again in advance for your help.
You can repeat each row in df1 nrow(df2) times and cycle through the rows of df2 nrow(df1) times, so that both data frames have the same number of rows and the values can be subtracted directly.
#Repeat each row of df1 nrow(df2) times
new_df1 <- df1[rep(seq_len(nrow(df1)), each = nrow(df2)), ]
#Repeat rows of df2 and subtract
new_df1[-1] <- new_df1[-1] - df2[rep(seq(nrow(df2)), nrow(df1)), -1]
#Replace negative values with 0
new_df1[new_df1 < 0] <- 0
#Add row-wise sum
new_df1$total <- rowSums(new_df1[-1])
#Remove rownames
rownames(new_df1) <- NULL
new_df1
# ID Obs var total
#1 1 0 0 0
#2 1 0 0 0
#3 2 0 0 0
#4 2 0 2 2
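For the follow-up about accepting either matrices or data frames, here is a minimal sketch. The function name pairwise_diff is hypothetical, and it assumes the first column of each input is the ID and the remaining columns are numeric. It checks the inputs with is.data.frame() and is.matrix() rather than comparing class() to a string, because class() of a matrix has more than one element in recent versions of R.

```r
# Hypothetical helper: x and y may each be a data frame or a matrix;
# column 1 is treated as the ID, all other columns are differenced.
pairwise_diff <- function(x, y) {
  ok <- function(z) is.data.frame(z) || is.matrix(z)
  if (!ok(x) || !ok(y)) stop("x and y must each be a data frame or a matrix")
  x <- as.data.frame(x)
  y <- as.data.frame(y)

  i <- rep(seq_len(nrow(x)), each = nrow(y))    # each row of x, nrow(y) times
  j <- rep(seq_len(nrow(y)), times = nrow(x))   # all rows of y, cycled

  d <- x[i, -1, drop = FALSE] - y[j, -1, drop = FALSE]
  d[d < 0] <- 0                                 # clamp negative differences

  out <- cbind(x[i, 1, drop = FALSE], d)        # keep the ID from x
  out$total <- rowSums(d)
  rownames(out) <- NULL
  out
}

df1 <- data.frame(ID = c(1, 2), Obs = c(1.0, 2.0), var = c(2.0, 5.0))
df2 <- data.frame(ID = c(2, 1), Obs = c(3.0, 2.0), var = c(7.0, 3.0))

res_df  <- pairwise_diff(df1, df2)                        # data frame input
res_mat <- pairwise_diff(as.matrix(df1), as.matrix(df2))  # matrix input
res_df
```

Converting with as.data.frame() up front means the rest of the function only has to handle one input type.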

R How to loop over paired columns to create new columns

I am trying to loop over specific pair columns (they have similar names) and create columns based on a conditional statement.
Example dataset:
set.seed(2)
df <- data.frame (id=rep(1:5),
s1=rnorm(5, 0, 3),
s2=rnorm(5, 0, 3),
s2a=rnorm(5, 0, 3),
st1=rnorm(5, 3, 3),
st2=rnorm(5, 3, 3),
st2a=rnorm(5, 3, 3))
> df
id s1 s2 s2a st1 st2 st2a
1 1 -2.6907436 0.3972609 1.252952 -3.933207 9.2724576 -4.355119
2 2 0.5545476 2.1238642 2.945258 5.635814 -0.5997775 4.431712
3 3 4.7635360 -0.7190941 -1.178086 3.107420 7.7689146 1.210325
4 4 -3.3911270 5.9534218 -3.119007 6.038486 8.8639549 5.376610
5 5 -0.2407553 -0.4163610 5.346687 4.296795 3.0148133 3.868910
Column s1 is paired with column st1, and so on. For each pair I want a 1/0 indicator of whether the s column is at most -3 and the paired st column is at least 0, e.g. df$ys1<-ifelse(df$s1<=-3 & df$st1>=0, 1, 0). The ultimate aim is to create the final variable yes_no (1/0) to indicate whether any of these pairwise indicators is 1, e.g. df$yes_no<-ifelse(df$ys1==1 | df$ys2==1 | df$ys2a==1, 1, 0)
The new dataset should look like this:
> df
id s1 s2 s2a st1 st2 st2a ys1 ys2 ys2a yes_no
1 1 -2.6907436 0.3972609 1.252952 -3.933207 9.2724576 -4.355119 0 0 0 0
2 2 0.5545476 2.1238642 2.945258 5.635814 -0.5997775 4.431712 0 0 0 0
3 3 4.7635360 -0.7190941 -1.178086 3.107420 7.7689146 1.210325 0 0 0 0
4 4 -3.3911270 5.9534218 -3.119007 6.038486 8.8639549 5.376610 1 0 1 1
5 5 -0.2407553 -0.4163610 5.346687 4.296795 3.0148133 3.868910 0 0 0 0
I'm sure there is a way of doing a loop without actually creating all additional columns (i.e. just create the final column, yes_no ) but I would be interested in how to create these just to know how to do it, in addition to a neater method.
I think a way of doing it would be to break up the dataset into two sets based on the pairs and then use in a loop:
firstt<-(df[,c(2:4)])
final<-(df[,c(5:7)])
or skip that and try directly in a loop
for(i in names(df[,c(2:4)])){
r<-(df[,c(5:7)])
df[i] <-ifelse(df$[i]<=-3 & df$[r]>=0, 1, 0)
}
Obviously that won't work, but that is the idea of what I was trying.
Any help would be appreciated.
Here is a solution in base R:
df$yes_no <-
rowSums(mapply(function(i,r)
ifelse(df[[r]]<=-3 & df[[i]]>=0, 1, 0)
, grep("st",names(df),value=TRUE),
gsub("t","",grep("st",names(df),value=TRUE)))) >0
1- I am using a regex to extract the column names. You can also use indices here.
X = "st1" "st2" "st2a"
Y = "s1" "s2" "s2a"
2- I am using mapply to apply the function to the paired elements (first element of X with first element of Y, and so on).
3- rowSums aggregates the 3 columns into one, and >0 converts the result to a logical vector.
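If you also want to keep the intermediate indicator columns, the same mapply call can be split into steps. This is a minimal sketch on a small hand-made data frame (two pairs only, values chosen for illustration); the names s_cols, st_cols and flags are not from the original answer.

```r
df <- data.frame(id  = 1:3,
                 s1  = c(-3.4, 0.5, -0.2), s2  = c(6.0, 2.1, -4.0),
                 st1 = c( 6.0, 5.6,  4.3), st2 = c(8.9, -0.6, -1.0))

s_cols  <- c("s1", "s2")
st_cols <- paste0("st", sub("^s", "", s_cols))   # "st1" "st2"

# One 0/1 column per (s, st) pair; mapply walks both name vectors in step
flags <- mapply(function(s, st) as.integer(df[[s]] <= -3 & df[[st]] >= 0),
                s_cols, st_cols)
colnames(flags) <- paste0("y", s_cols)           # "ys1" "ys2"

df <- cbind(df, flags)                           # keep the per-pair columns
df$yes_no <- as.integer(rowSums(flags) > 0)      # 1 if any pair matched
df
```

Wrapping the comparison in as.integer() gives the 1/0 coding asked for in the question instead of TRUE/FALSE.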
Here is another solution with for loops
a <- names(df[,c(2:4)])
b <- names(df[,c(5:7)])
for(i in seq_along(a)){
df$temp<-ifelse(df[,names(df)[names(df)==a[i]]]<=-3 & df[,names(df)[names(df)==b[i]]]>=0, 1, 0)
names(df)[names(df)=="temp"] <- paste0("ys", i)
}
df$yes_no <- apply(df[grep("ys", names(df))]==1,1, function(k) ifelse(TRUE %in% k, 1, 0) )
print(df)
id s1 s2 s2a st1 st2 st2a ys1 ys2 ys3 yes_no
1 1 -2.6907436 0.3972609 1.252952 -3.933207 9.2724576 -4.355119 0 0 0 0
2 2 0.5545476 2.1238642 2.945258 5.635814 -0.5997775 4.431712 0 0 0 0
3 3 4.7635360 -0.7190941 -1.178086 3.107420 7.7689146 1.210325 0 0 0 0
4 4 -3.3911270 5.9534218 -3.119007 6.038486 8.8639549 5.376610 1 0 1 1
5 5 -0.2407553 -0.4163610 5.346687 4.296795 3.0148133 3.868910 0 0 0 0

Count consecutive occurrences of a specific value in every row of a data frame in R

I've got a data.frame of monthly values of a variable for many locations (so many rows) and I want to count the numbers of consecutive months (i.e consecutive cells) that have a value of zero. This would be easy if it was just being read left to right, but the added complication is that the end of the year is consecutive to the start of the year.
For example, in the shortened example dataset below (with seasons instead of months), location 1 has 3 consecutive '0' months, location 2 has 2, and location 3 has none.
df<-cbind(location= c(1,2,3),
Winter=c(0,0,3),
Spring=c(0,2,4),
Summer=c(0,2,7),
Autumn=c(3,0,4))
How can I count these consecutive zero values? I've looked at rle but I'm still none the wiser currently!
Many thanks for any help :)
You've identified the two forms the longest run can take: (1) somewhere in the middle of the row, or (2) split between the end and the beginning of the row. Hence you want to calculate each case and take the max, like so:
df<-cbind(
Winter=c(0,0,3),
Spring=c(0,2,4),
Summer=c(0,2,7),
Autumn=c(3,0,4))
#> Winter Spring Summer Autumn
#> [1,] 0 0 0 3
#> [2,] 0 2 2 0
#> [3,] 3 4 7 4
# calculate the number of consecutive zeros at the start and end
startZeros <- apply(df,1,function(x)which.min(x==0)-1)
#> [1] 3 1 0
endZeros <- apply(df,1,function(x)which.min(rev(x==0))-1)
#> [1] 0 1 0
# calculate the longest run of zeros
longestRun <- apply(df,1,function(x){
y = rle(x);
max(y$lengths[y$values==0],0)})
#> [1] 3 1 0
# take the max of the two values
pmax(longestRun,startZeros +endZeros )
#> [1] 3 2 0
Of course an even easier solution is:
longestRun <- apply(cbind(df,df),# tricky way to wrap the zeros from the start to the end
1,# the margin over which to apply the summary function
function(x){# the summary function
y = rle(x);
max(y$lengths[y$values==0],
0)# include zero in case there are no zeros in y$values
})
Note that the above solution works because my df does not include the location field (column).
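One edge case of the doubling trick is a row made up entirely of zeros: after doubling, the run is twice as long as the row. Here is a minimal sketch that caps the result at the row length. The function name circular_zero_run and the extra all-zero fourth row are added for illustration.

```r
circular_zero_run <- function(x) {
  r <- rle(c(x, x) == 0)                   # doubled row, so runs can wrap around
  longest <- max(r$lengths[r$values], 0)   # 0 if the row has no zeros at all
  min(longest, length(x))                  # an all-zero row would otherwise count twice
}

m <- cbind(Winter = c(0, 0, 3, 0),
           Spring = c(0, 2, 4, 0),
           Summer = c(0, 2, 7, 0),
           Autumn = c(3, 0, 4, 0))

res <- apply(m, 1, circular_zero_run)
res
```

Without the min() cap, the fourth row would report 8 consecutive zeros for a 4-season year.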
Try this:
df <- data.frame(location = c(1, 2, 3),
Winter = c(0, 0, 3),
Spring = c(0, 2, 4),
Summer = c(0, 2, 7),
Autumn = c(3, 0, 4))
maxcumzero <- function(x) {
l <- x == 0
max(cumsum(l) - cummax(cumsum(l) * !l))
}
df$N.Consec <- apply(cbind(df[, -1], df[, -1]), 1, maxcumzero)
df
# location Winter Spring Summer Autumn N.Consec
# 1 1 0 0 0 3 3
# 2 2 0 2 2 0 2
# 3 3 3 4 7 4 0
This adds a column to the data frame specifying the maximum number of times zero has occurred consecutively in each row of the data frame. The data frame is column bound to itself to be able to detect consecutive zeroes between autumn and winter.
The method used here is based on that of Martin Morgan in his answer to this similar question.
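To see why the cumsum/cummax expression works, here is a minimal trace on a short hand-made vector. The intermediate names cs and reset are only for illustration.

```r
x <- c(0, 2, 2, 0, 0, 2)
l <- x == 0                 # TRUE where the value is zero

cs    <- cumsum(l)          # running count of zeros seen so far
reset <- cummax(cs * !l)    # count of zeros at the most recent non-zero
streak <- cs - reset        # zeros since the last non-zero = current run length

rbind(x, cs, reset, streak)
max(streak)
```

At every non-zero position cs * !l records how many zeros had been seen up to that point, and cummax carries that figure forward, so subtracting it from cs leaves the length of the current run of zeros.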

Calculate run length aggregated by subject ID conditional on observation == 1

I am trying to use the rle function in R to calculate the run lengths for the variable positive in the example below, aggregated by the variable id.
Here is a toy dataset (that admittedly has a few quirks):
test <- c('id', 'positive')
test$id <- rep(1:3, c(24, 24, 24))
set.seed(123456)
test$positive <- round(runif(72, 0, 1))
test <- data.frame(test)
test <- subset(test, select = -X.id.)
test <- subset(test, select = -X.positive.)
result <- aggregate(positive ~ id, data = test, FUN = rle)
The way this currently is set up it reads the run lengths for all possible values (0 and 1) of the variable positive. Is it possible to condition this function such that it only evaluates the run lengths when positive == 1?
At the end of the day, I ultimately want to figure out how to count the number of instances in which two or more consecutive months were positive (positive == 1) for each subject.
UPDATE:
I have a variable called event that has values of 0 or 1. For each of the occurrences of two or more positives that were developed from the code featured in the suggestions below, is it possible to stratify our results such that if event == 1 occurs during any of the positive months it would be classified differently than a run of positives in which event == 0 for all of the months?
The toy dataset looks like this:
set.seed(123456)
x <- c(1, 2, 1)
test <- data.frame(id = rep(1:3, each = 24), positive = round(runif(72, 0, 1)), event = round(runif(72, 0, 1)))
results <- aggregate(positive ~ id + event, data = test, FUN=function(x) with(rle(x), sum(lengths > 1 & values == 1)))
aggregate(positive ~ event, data = results, FUN=sum)
However, this code gives all possible permutations of event and positive, while I would like to delimit the results to counting only those occurrences of two or more consecutive positive months for which any event == 1. Alternatively, if it is easier to evaluate only the number of consecutive positive months for which all event == 0 that would be a fine solution too.
To count occurrences of two or more consecutive positives, use this:
aggregate(positive ~ id, data=test, FUN=function(x) with(rle(x), sum(lengths>=2 & values==1)))
(Inspired by @sgibb's answer.)
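As a small illustration of what the function inside aggregate() computes for one subject, here is a minimal sketch on a hand-made vector (the values are not taken from the toy dataset).

```r
x <- c(1, 1, 0, 1, 0, 0, 1, 1, 1)

r <- rle(x)
r$lengths    # length of each run
r$values     # value repeated in each run

# Runs of 1s lasting at least two consecutive months
n_runs <- sum(r$lengths >= 2 & r$values == 1)
n_runs
```

The single 1 in the middle is a run of length one, so it is not counted; only the first and last runs qualify.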
EDIT: Counting the number of 2 or more consecutive positives such that any of them has event==1, separated by id:
Calculate the run to which each record belongs:
tmp <- within(test, run <- ave(positive, by=id, FUN=function(x)cumsum(c(1,diff(x)!=0))))
# id positive event run
# 1 1 1 1
# 1 1 0 1
# 1 0 1 2
# 1 0 0 2
# 1 0 1 2
# 1 0 0 2
For each id and each run mark if there was at least one record with event==1 and run length >= 2:
tmp2 <- aggregate(event~id+positive+run, data=tmp, function(x)any(x>0) && length(x)>=2)
# id positive run event
# 2 0 1 FALSE
# 1 1 1 TRUE
# 3 1 1 FALSE
# 1 0 2 TRUE
# 3 0 2 TRUE
# 2 1 2 TRUE
Now simply count how many marked runs are there in each id and each kind of run (positive==1 or positive==0):
aggregate(event~positive+id, tmp2, sum)
# positive id event
# 0 1 1
# 1 1 2
# 0 2 1
# 1 2 3
# 0 3 3
# 1 3 1
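The steps above can be traced on a single subject with a few records. This is a minimal sketch with hand-made values; tapply() stands in for the grouped aggregate() call, and the names run and flagged are illustrative.

```r
positive <- c(1, 1, 0, 0, 1)
event    <- c(0, 1, 0, 0, 0)

# A new run starts wherever the value changes from the previous record
run <- cumsum(c(1, diff(positive) != 0))
run

# Per run: at least one event and at least two records
flagged <- tapply(event, run, function(e) any(e > 0) && length(e) >= 2)
flagged
```

Only the first run qualifies here: the second has no event, and the third is a single record.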
Do you mean something like this?:
aggregate(positive ~ id, data=test, FUN=function(x) {
r <- rle(x);
return(r$lengths[r$values == 1])
})
# id positive
# 1 1 2, 1, 1, 7, 1
# 2 2 4, 2, 1, 4, 2, 1, 2
# 3 3 1, 7, 1, 1, 1
A ddply version for the 'at the end of the day' part:
library(plyr)
set.seed(123456)
test <- data.frame(id = rep(1:3, each = 24), positive = round(runif(72, 0, 1)))
ddply(.data = test, .variables = .(id), function(x){
rl <- rle(x$positive)
sum(rl$lengths[rl$values == 1] > 1)
}
)
# id V1
# 1 1 2
# 2 2 5
# 3 3 1
