Calculate Run Length Sequence and Maximum by Subject ID - r

We have time series data in which repeated observations were measured for several subjects. I would like to calculate the number of occasions in which the variable positive == 1 occurs for each subject (variable id).
A second aim is to identify the maximum length of these runs of consecutive observations in which positive == 1. For each subject there are likely to be multiple runs within the study period. Rather than calculating the maximum number of consecutive positive observations per subject, I would like to calculate the maximum run length within an individual run.
Here is a toy data set that illustrates the problem:
set.seed(1234)
test <- data.frame(id = rep(1:3, each = 10), positive = round(runif(30,0,1)))
test$run <- sequence(rle(test$positive)$lengths)
test$run_positive <- ifelse(test$positive == '0', '0', test$run)
test$episode <- ifelse(test$run_positive == '1', '1', '0')
plyr::count(test$episode) # count() comes from the plyr package
x freq
1 0 25
2 1 5
The code above gets close to answering my first question, which is to count the number of positive episodes; however, it is not conditioned by subject. This has the unfortunate effect of counting the last observation of Subject #1 and the first observation of Subject #2 in the same run. Can anyone help me develop code to condition this run length encoding by subject?
Secondly, how can one extract only the maximum run length for each run in which positive == 1? I would like to add a column that records the run length only on the observation where each positive run reaches its maximum (the last observation of the run) and is 0 everywhere else. For Subject #1, this would look like:
id positive run run_positive episode max_run
1 1 0 1 0 0 0
2 1 1 1 1 1 0
3 1 1 2 2 0 0
4 1 1 3 3 0 0
5 1 1 4 4 0 0
6 1 1 5 5 0 5
7 1 0 1 0 0 0
8 1 0 2 0 0 0
9 1 1 1 1 1 0
10 1 1 2 2 0 2
If anyone can come up with a method to do this I would be extremely grateful.

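I think this answers your first question, if what you want is the total number of positive observations per subject (it counts rows where positive == 1, not separate runs):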
aggregate(positive ~ id, data = test, FUN = sum)
id positive
1 1 7
2 2 4
3 3 4
This might answer your second question, but I would need to see the desired result for each id to check:
set.seed(1234)
test <- data.frame(id = rep(1:3, each = 10), positive = round(runif(30,0,1)))
test$run <- sequence(rle(test$positive)$lengths)
test$run_positive <- ifelse(test$positive == '0', '0', test$run)
test$episode <- ifelse(test$run_positive == '1', '1', '0')
test$group <- paste(test$id*10, test$positive, sep='')
my.seq <- data.frame(rle(test$group)$lengths)
test$first <- unlist(apply(my.seq, 1, function(x) seq(1,x)))
test$last <- unlist(apply(my.seq, 1, function(x) seq(x,1,-1)))
test$max <- ifelse(test$last == 1 & test$positive == 1, test$first, 0) # test$first restarts at each subject; test$run does not
test
id positive run run_positive episode group first last max
1 1 0 1 0 0 100 1 1 0
2 1 1 1 1 1 101 1 5 0
3 1 1 2 2 0 101 2 4 0
4 1 1 3 3 0 101 3 3 0
5 1 1 4 4 0 101 4 2 0
6 1 1 5 5 0 101 5 1 5
7 1 0 1 0 0 100 1 2 0
8 1 0 2 0 0 100 2 1 0
9 1 1 1 1 1 101 1 2 0
10 1 1 2 2 0 101 2 1 2
11 2 1 3 3 0 201 1 2 0
12 2 1 4 4 0 201 2 1 2
13 2 0 1 0 0 200 1 1 0
14 2 1 1 1 1 201 1 1 1
15 2 0 1 0 0 200 1 1 0
16 2 1 1 1 1 201 1 1 1
17 2 0 1 0 0 200 1 4 0
18 2 0 2 0 0 200 2 3 0
19 2 0 3 0 0 200 3 2 0
20 2 0 4 0 0 200 4 1 0
21 3 0 5 0 0 300 1 5 0
22 3 0 6 0 0 300 2 4 0
23 3 0 7 0 0 300 3 3 0
24 3 0 8 0 0 300 4 2 0
25 3 0 9 0 0 300 5 1 0
26 3 1 1 1 1 301 1 4 0
27 3 1 2 2 0 301 2 3 0
28 3 1 3 3 0 301 3 2 0
29 3 1 4 4 0 301 4 1 4
30 3 0 1 0 0 300 1 1 0
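
A minimal base R sketch of conditioning the run-length encoding by subject, using `ave` and `rle`. Note that `test$run` in the answer above is still computed across subject boundaries (rows 11, 12 and 21 to 25 continue the count from the previous subject). Assumptions: the `positive` values are typed in from the printed output rather than regenerated with `runif`, the column names `episode` and `max_run` follow the question, and the names `episodes` and `longest` are only illustrative.

```r
# positive values copied from the printed output above (assumption: same data)
test <- data.frame(
  id = rep(1:3, each = 10),
  positive = c(0, 1, 1, 1, 1, 1, 0, 0, 1, 1,
               1, 1, 0, 1, 0, 1, 0, 0, 0, 0,
               0, 0, 0, 0, 0, 1, 1, 1, 1, 0)
)

# Run counter that restarts at every subject boundary
test$run <- ave(test$positive, test$id,
                FUN = function(x) sequence(rle(x)$lengths))

# 1 on the first row of each positive run, 0 elsewhere
test$episode <- as.integer(test$positive == 1 & test$run == 1)

# Run length written on the last row of each positive run, 0 elsewhere
test$max_run <- ave(test$positive, test$id, FUN = function(x) {
  r <- rle(x)
  ends <- cumsum(r$lengths)            # index of the last row of each run
  out <- numeric(length(x))
  out[ends[r$values == 1]] <- r$lengths[r$values == 1]
  out
})

# Per-subject summaries: number of positive episodes and longest positive run
episodes <- aggregate(episode ~ id, data = test, FUN = sum)
longest  <- aggregate(max_run ~ id, data = test, FUN = max)
```

For this data the sketch gives 2, 3 and 1 positive episodes and longest positive runs of 5, 2 and 4 for subjects 1 to 3.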

Related

group data which are either 0 or 1

I have a vector Blinks whose values are either 0 or 1:
df <- data.frame(
Blinks = c(0,0,1,1,1,0,0,1,1,1,1,0,0,1,1)
)
I want to insert a grouping variable for when Blinks == 1. I'm using rleid for this, but the group numbers also count the runs where Blinks == 0:
library(dplyr)
library(data.table)
df %>%
mutate(Blinks_grp = ifelse(Blinks > 0, rleid(Blinks), Blinks))
Blinks Blinks_grp
1 0 0
2 0 0
3 1 2
4 1 2
5 1 2
6 0 0
7 0 0
8 1 4
9 1 4
10 1 4
11 1 4
12 0 0
13 0 0
14 1 6
15 1 6
How can I obtain the correct result:
1 0 0
2 0 0
3 1 1
4 1 1
5 1 1
6 0 0
7 0 0
8 1 2
9 1 2
10 1 2
11 1 2
12 0 0
13 0 0
14 1 3
15 1 3
One option could be:
df %>%
mutate(Blinks_grp = with(rle(Blinks), rep(cumsum(values) * values, lengths)))
Blinks Blinks_grp
1 0 0
2 0 0
3 1 1
4 1 1
5 1 1
6 0 0
7 0 0
8 1 2
9 1 2
10 1 2
11 1 2
12 0 0
13 0 0
14 1 3
15 1 3
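
To see why the `rle` one-liner works, it can help to unpack it step by step. This is only an illustrative breakdown in base R; the intermediate names `r` and `run_id` are not part of the original answer.

```r
Blinks <- c(0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1)

r <- rle(Blinks)
# r$values  is 0 1 0 1 0 1  (the value of each run)
# r$lengths is 2 3 2 4 2 2  (how long each run is)

# cumsum(values) numbers the runs of ones 1, 2, 3; multiplying by values
# resets the runs of zeros back to 0
run_id <- cumsum(r$values) * r$values

# Expand one id per run back to one id per row
Blinks_grp <- rep(run_id, r$lengths)
```

This relies on `Blinks` containing only 0 and 1; with other values, `cumsum(values)` would no longer count runs.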

R how to generate a descending sequence by subject measuring the distance from the next uninterrupted series of a given value

I have spent a lot of time trying to figure out how to create a descending sequence which is subject specific and measures the distance from the next uninterrupted series of a given value in another column. Do you have any suggestions?
Here is an example of the problem:
Given the following data, where the "id" column is the subject unique identifier and the column "dummy" is an attribute
mydata<-data.frame(id=rep(seq(1,3),each=5), dummy=c(0,0,0,1,1,0,0,1,0,1,0,0,0,0,0))
id dummy
1 1 0
2 1 0
3 1 0
4 1 1
5 1 1
6 2 0
7 2 0
8 2 1
9 2 0
10 2 1
11 3 0
12 3 0
13 3 0
14 3 0
15 3 0
Generate a new column measuring the distance from the next uninterrupted series of the value 1 in the "dummy" column (note: I am treating a single occurrence of the value 1 as an uninterrupted series). Here is an example of the output:
id dummy output
1 1 0 3
2 1 0 2
3 1 0 1
4 1 1 0
5 1 1 0
6 2 0 2
7 2 0 1
8 2 1 0
9 2 0 1
10 2 1 0
11 3 0 0
12 3 0 0
13 3 0 0
14 3 0 0
15 3 0 0
Thanks,
H
Here's an attempt using the data.table package in two steps.
The first step is to shift the dummy column by one position (a lead), so that each row can see the next row's value; this lets us check whether a run of zeros is followed by a one.
The second step is to number the rows of each run in descending order, keeping the count only where the rows are zeros and the run is followed by a one.
I'm using the shift function from data.table v1.9.6+ for this task, but you could use indx := c(dummy[-1L], 0L) instead.
library(data.table) # V1.9.6+
setDT(mydata)[, indx := shift(dummy, type = "lead", fill = 0L)]
mydata[, output := .N:1L*(dummy == 0L)*(indx[.N] == 1L), by = .(id, cumsum(dummy == 1L))]
# id dummy indx output
# 1: 1 0 0 3
# 2: 1 0 0 2
# 3: 1 0 1 1
# 4: 1 1 1 0
# 5: 1 1 0 0
# 6: 2 0 0 2
# 7: 2 0 1 1
# 8: 2 1 0 0
# 9: 2 0 1 1
# 10: 2 1 0 0
# 11: 3 0 0 0
# 12: 3 0 0 0
# 13: 3 0 0 0
# 14: 3 0 0 0
# 15: 3 0 0 0
Here is an option with base R. First we label the number of consecutive identical entries (with rle) in the dummy column in reverse order:
mydata$output<- unlist(sapply(rle(mydata$dummy)$lengths,function(x) rev(seq(x))))
Then we set the values of the output column to zero for all rows in which dummy is not equal to zero:
mydata$output[mydata$dummy!=0] <- 0
In a last step, we identify the sets of id which only contain zeros as values for dummy and set their entries of the output column to zero, too:
zero_ids <- with(aggregate(dummy ~ id, mydata, sum), id[dummy == 0]) # ids whose dummy values are all zero
mydata$output[mydata$id %in% zero_ids] <- 0
#> mydata
# id dummy output
#1 1 0 3
#2 1 0 2
#3 1 0 1
#4 1 1 0
#5 1 1 0
#6 2 0 2
#7 2 0 1
#8 2 1 0
#9 2 0 1
#10 2 1 0
#11 3 0 0
#12 3 0 0
#13 3 0 0
#14 3 0 0
#15 3 0 0
This solution assumes that there are no negative values in the dummy column.
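
As a hedged variation on the base R idea above, the countdown can be computed separately for each id, so that runs never span two subjects and zeros that are not followed by a 1 within the same subject get 0. The helper name `dist_to_next_one` is hypothetical, and the sketch assumes `dummy` contains only 0 and 1.

```r
mydata <- data.frame(id = rep(1:3, each = 5),
                     dummy = c(0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0))

# Hypothetical helper: distance to the next run of ones within one subject
dist_to_next_one <- function(x) {
  r <- rle(x)
  # Countdown inside each run: n, n - 1, ..., 1
  countdown <- unlist(lapply(r$lengths, function(n) rev(seq_len(n))))
  # A run counts only if it is a run of zeros followed by a run of ones
  next_is_one <- c(r$values[-1] == 1, FALSE)
  keep <- rep(r$values == 0 & next_is_one, r$lengths)
  countdown * keep
}

# Apply the helper subject by subject
mydata$output <- ave(mydata$dummy, mydata$id, FUN = dist_to_next_one)
```

On the example data this reproduces the requested output column: 3 2 1 0 0 for id 1, 2 1 0 1 0 for id 2, and all zeros for id 3.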

Resetting TIME column when AMT > 0

I have a data frame that looks like this:
ID TIME AMT
1 0 50
1 1 0
1 2 0
1 3 0
1 4 0
1 4 50
1 5 0
1 7 0
1 9 0
1 10 0
1 10 50
The TIME column in the above data frame is continuous. I want to add another time column that resets time from zero when AMT>0. So, my output data frame should look like this:
ID TIME AMT TIME2
1 0 50 0
1 1 0 1
1 2 0 2
1 3 0 3
1 4 0 4
1 4 50 0
1 5 0 1
1 7 0 3
1 9 0 5
1 10 0 6
1 10 50 0
This is basically achieved by subtracting a "fixed" reference TIME (the TIME of the most recent row where AMT>0) from each TIME. For example, the reference time for the second AMT>0 is 4, so TIME2 is calculated as 5-4=1, 7-4=3, 9-4=5, etc. How can I do this automatically in R?
A data.table solution :
library(data.table)
setDT(DT)[,TIME2 := TIME-TIME[1],cumsum(AMT>0)]
# ID TIME AMT TIME2
# 1: 1 0 50 0
# 2: 1 1 0 1
# 3: 1 2 0 2
# 4: 1 3 0 3
# 5: 1 4 0 4
# 6: 1 4 50 0
# 7: 1 5 0 1
# 8: 1 7 0 3
# 9: 1 9 0 5
# 10: 1 10 0 6
# 11: 1 10 50 0
I was originally going to post the same answer as @agstudy, so here is a possible base R solution instead:
with(df, ave(TIME, cumsum(AMT > 0L), ID, FUN = function(x) x - x[1L]))
## [1] 0 1 2 3 4 0 1 3 5 6 0
Or
library(dplyr)
df %>%
group_by(cumsum(AMT > 0), ID) %>%
mutate(TIME2 = TIME - first(TIME))
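
For completeness, here is a self-contained sketch of the base R approach with the example data typed in from the question. The intermediate name `dose` is illustrative only; it labels each block of rows that starts where AMT > 0.

```r
df <- data.frame(
  ID   = 1,
  TIME = c(0, 1, 2, 3, 4, 4, 5, 7, 9, 10, 10),
  AMT  = c(50, 0, 0, 0, 0, 50, 0, 0, 0, 0, 50)
)

# Block counter: increases by one on every row where AMT > 0
dose <- cumsum(df$AMT > 0)

# Within each ID and block, subtract the block's first TIME
df$TIME2 <- ave(df$TIME, df$ID, dose, FUN = function(x) x - x[1])
```

Grouping by ID as well as by the block counter keeps the reset from leaking across subjects when the data contain more than one ID.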

How to count number of particular values

My data looks like this:
ID CO MV
1 0 1
1 5 0
1 0 1
1 9 0
1 8 0
1 0 1
2 69 0
2 0 1
2 8 0
2 0 1
2 78 0
2 53 0
2 0 1
2 3 0
3 54 0
3 0 1
3 8 0
3 90 0
3 0 1
3 56 0
4 0 1
4 56 0
4 0 1
4 45 0
4 0 1
4 34 0
4 31 0
4 0 1
4 45 0
5 0 1
5 0 1
5 67 0
I want it to look like this:
ID CO MV CONUM
1 0 1 3
1 5 0 3
1 0 1 3
1 9 0 3
1 8 0 3
1 0 1 3
2 69 0 5
2 0 1 5
2 8 0 5
2 0 1 5
2 78 0 5
2 53 0 5
2 0 1 5
2 3 0 5
3 54 0 4
3 0 1 4
3 8 0 4
3 90 0 4
3 0 1 4
3 56 0 4
4 0 1 5
4 56 0 5
4 0 1 5
4 45 0 5
4 0 1 5
4 34 0 5
4 31 0 5
4 0 1 5
4 45 0 5
5 0 1 1
5 0 1 1
5 67 0 1
I want to create a column CONUM which is the total number of values other than zero in the CO column for each value in the ID column. So, for example, the CO column for ID 1 has 3 values other than zero, therefore the corresponding values in the CONUM column are 3. The MV column is 0 if the CO column has a value and 1 if the CO column is 0. So another way to create the CONUM column would be to count the number of zeros in the MV column per ID. It would be great if you could help me with the R code to accomplish this. Thanks.
Here is an option with data.table
library(data.table)
setDT(df)[,CONUM:=sum(CO!=0) ,ID][]
You can use ave in base R:
dat <- transform(dat, CONUM = ave(as.logical(CO), ID, FUN = sum))
and an option with dplyr
# install.packages("dplyr")
library(dplyr)
dat <- dat %>%
group_by(ID) %>%
mutate(CONUM = sum(CO != 0))
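
A small self-contained check of the `ave` approach, using only the rows for ID 1 and ID 5 from the question. The column `CONUM_from_MV` is a hypothetical name added to show that counting zeros in MV gives the same answer.

```r
# Subset of the question's data: IDs 1 and 5 only
dat <- data.frame(
  ID = c(1, 1, 1, 1, 1, 1, 5, 5, 5),
  CO = c(0, 5, 0, 9, 8, 0, 0, 0, 67)
)
dat$MV <- as.integer(dat$CO == 0)

# Count non-zero CO values per ID and repeat the count on every row
dat$CONUM <- ave(as.numeric(dat$CO != 0), dat$ID, FUN = sum)

# Equivalent count based on MV: zeros in MV mark rows where CO has a value
dat$CONUM_from_MV <- ave(as.numeric(dat$MV == 0), dat$ID, FUN = sum)
```

Both columns come out as 3 for every row of ID 1 and 1 for every row of ID 5, matching the desired output.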

In R, how to replace values in multiple columns with a vector of values equal to the same width?

I am trying to replace every row's values in 2 columns with a vector of length 2. It is easier to show you.
First, here is some data.
set.seed(1234)
x<-data.frame(x=sample(c(0:3), 10, replace=T))
x$ab<-0 #column that will be replaced
x$cd<-0 #column that will be replaced
The data looks like this:
x ab cd
1 0 0 0
2 2 0 0
3 2 0 0
4 2 0 0
5 3 0 0
6 2 0 0
7 0 0 0
8 0 0 0
9 2 0 0
10 2 0 0
Every time x=2 or x=3, I want to set ab=0 and cd=1.
My attempt is this:
x[with(x, which(x==2|x==3)), c(2:3)] <- c(0,1)
Which does not have the intended results:
x ab cd
1 0 0 0
2 2 0 1
3 2 1 0
4 2 0 1
5 3 1 0
6 2 0 1
7 0 0 0
8 0 0 0
9 2 1 0
10 2 0 1
Can you help me?
It doesn't work as you want because R stores matrices and arrays in column-major layout, and when you assign a shorter vector to a longer one, R recycles the shorter vector. For example, if you have
x<-rep(0,20)
x[1:10]<-c(2,3)
then you end up with
[1] 2 3 2 3 2 3 2 3 2 3 0 0 0 0 0 0 0 0 0 0
What is happening in your case is that the sub-array where x is equal to 2 or 3 is being filled in column-wise by cycling through the vector c(0,1). I don't know of any simple way to change this behavior.
Probably the easiest thing to do here is simply fill in the columns one at a time. Or, you could do something like this:
indices<-with(x, which(x==2|x==3))
x[indices,c(2,3)]<-rep(c(0,1),each=length(indices))
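
The column-wise filling described above can be seen on a small matrix, and the `rep(..., each = )` fix can then be applied to the data frame. This is a sketch only: the names `m_cycled` and `m_blocked` are illustrative, and the `x` values are typed in from the printed data rather than regenerated with `sample`.

```r
# Recycling c(0, 1) fills down column 1 first, then column 2
m_cycled <- matrix(NA_real_, nrow = 3, ncol = 2)
m_cycled[] <- c(0, 1)
# column 1 is 0 1 0, column 2 is 1 0 1 (values alternate within rows)

# Repeating each value once per row gives one constant per column
m_blocked <- matrix(NA_real_, nrow = 3, ncol = 2)
m_blocked[] <- rep(c(0, 1), each = 3)
# column 1 is 0 0 0, column 2 is 1 1 1

# The same idea on the data frame from the question
x <- data.frame(x = c(0, 2, 2, 2, 3, 2, 0, 0, 2, 2), ab = 0, cd = 0)
indices <- which(x$x %in% c(2, 3))
x[indices, c("ab", "cd")] <- rep(c(0, 1), each = length(indices))
```

Because the replacement vector is laid out column by column, every selected row ends up with ab = 0 and cd = 1.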
Another alternative: Using a data.table, this is a one-liner:
require(data.table)
DT <- data.table(x)
DT[x%in%2:3,`:=`(ab=0,cd=1)]
Original answer: You can pass a matrix of row-column pairs:
ijs <- expand.grid(with(x, which(x==2|x==3)),c(2:3))
ijs <- ijs[order(ijs$Var1),]
x[as.matrix(ijs)] <- c(0,1)
which yields
x ab cd
1 0 0 0
2 2 0 1
3 2 0 1
4 2 0 1
5 3 0 1
6 2 0 1
7 0 0 0
8 0 0 0
9 2 0 1
10 2 0 1
My original answer worked on my computer, but not a commenter's.
Generalized for multi-columns and multi-values:
mycol<-as.list(names(x)[-1])
myvalue<-as.list(c(0,1))
kk<-Map(function(y,z) list(x[x[,1] %in% c(2,3),y]<-z,x),mycol, myvalue)
myresult<-data.frame(kk[[2]][[2]])
x ab cd
1 1 0 0
2 1 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 3 0 1
7 2 0 1
8 3 0 1
9 3 0 1
10 0 0 0
You could use ifelse:
> set.seed(1234)
> dat<-data.frame(x=sample(c(0:3), 10, replace=T))
> dat$ab <- 0
> dat$cd <- ifelse(dat$x==2 | dat$x==3, 1, 0)
x ab cd
1 0 0 0
2 2 0 1
3 2 0 1
4 2 0 1
5 3 0 1
6 2 0 1
7 0 0 0
8 0 0 0
9 2 0 1
10 2 0 1
What about that?
x[x$x%in%c(2,3),c(2,3)]=matrix(rep(c(0,1),sum(x$x%in%c(2,3))),ncol=2,byrow=TRUE)
x$ab[x$x==2 | x$x==3] <- 0
x$cd[x$x==2 | x$x==3] <- 1
EDIT
Here is a general approach that would work with lots of columns. You simply create a vector of the replacement values you wish to use for each column.
set.seed(1234)
y<-data.frame(x=sample(c(0:3), 10, replace=T))
y$ab<-4 #column that will be replaced
y$cd<-2 #column that will be replaced
y$ef<-0 #column that will be replaced
y
# x ab cd ef
#1 0 4 2 0
#2 2 4 2 0
#3 2 4 2 0
#4 2 4 2 0
#5 3 4 2 0
#6 2 4 2 0
#7 0 4 2 0
#8 0 4 2 0
#9 2 4 2 0
#10 2 4 2 0
replacement.values <- c(10,20,30)
y2 <- y
y2[,2:ncol(y)] <- sapply(2:ncol(y), function(j) {
apply(y, 1, function(i) {
ifelse((i[1] %in% c(2,3)), replacement.values[j-1], i[j])
})
})
y2
# x ab cd ef
#1 0 4 2 0
#2 2 10 20 30
#3 2 10 20 30
#4 2 10 20 30
#5 3 10 20 30
#6 2 10 20 30
#7 0 4 2 0
#8 0 4 2 0
#9 2 10 20 30
#10 2 10 20 30
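
The nested sapply/apply above visits every cell of the data frame. A possibly simpler sketch, under the assumption that each column gets a single replacement value, is to compute the row mask once and loop over the columns. Here `replacement.values` is given names (an assumption, the original vector was unnamed) so each value is matched to its column explicitly, and the `x` values are typed in from the printed data.

```r
y <- data.frame(x = c(0, 2, 2, 2, 3, 2, 0, 0, 2, 2), ab = 4, cd = 2, ef = 0)

# One replacement value per target column (names are an assumption)
replacement.values <- c(ab = 10, cd = 20, ef = 30)

# Rows to overwrite, computed once
rows <- y$x %in% c(2, 3)

y2 <- y
for (col in names(replacement.values)) {
  # Scalar assignment: the single value is recycled over the selected rows
  y2[rows, col] <- replacement.values[[col]]
}
```

Assigning one column at a time avoids the column-major recycling problem entirely, since each assignment involves a single value.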
