Calculate recovery time to a condition in time series in R

I would like to calculate the number of days from the time a condition is not met, to when it is met again, in a time series of daily data in R.
Toy data:
day <- data.frame(
  date = seq.POSIXt(
    from = ISOdatetime(2017, 07, 01, 0, 0, 0),
    to = ISOdatetime(2017, 08, 26, 0, 0, 0),
    by = "1 day"))
var <- c(5,6,5,5,0,0,0,0,0,1,1,2,3,3,4,3,4,5,4,5,5,4,5,4,0,1,1,2,3,4,5,5,5,4,4,4,4,5,3,3,3,2,2,2,2,2,2,2,2,2,2,2,2,1,1,0,0)
ts = cbind(day, var)
The condition is var > 3.
I'd like to identify each "recovery" period as the time where var > 0 but <= 3, but only following var going to zero. Then, I'd like the number of days to recovery for each period.
So, for the example data given here, I'd expect this output:
period 1 6
period 2 5
Since var never "recovers" at the end of the dataset, I would either want it not identified as a recovery period, or given a recovery time of 0 days.
I tried this:
ifelse(ts$var > 3, 0, ifelse(ts$var > 0 & ts$var < 4, 1, 0))
and I think I could pair this ifelse statement with something that only counts sequential 1s, and that would mostly do it. The only problem is that it identifies the end period with the slow drop-off as a "recovery period", and it shouldn't: only periods that follow a zero should be identified as recovery periods.
Here is what the example data look like: plot of var over time. I think it's the minimal data I can provide that shows the realistic issues I've had with making counts of data outside of recovery periods.
I need to do this over a long and much more dynamic time series, so an efficient way to do this would be greatly appreciated.

edit
- I don't think this will behave the way you expect it to if var does something like this
[... 0, 1, 2, 1, 0, 2, 4, ...]
but it may be possible to adapt it to handle this case.
original answer
I haven't tested this much, so I'd suggest checking that it works with weirder cases (e.g. var is all zeros, starts or ends at a period boundary, other corner cases...)
# ignore zeroes if they precede another zero
s <- which(var == 0 & c(tail(var, -1), NA) != 0)
e <- which(var > 3)
sapply(s, function(x) head(e[e > x], 1) - x)
The approach here is to identify all possible start and end points of periods, then find the first end point that occurs after each start point and take the difference. A simple loop or maybe even a clever regex could be a good alternative.
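For example, the simple-loop alternative might look roughly like this (a sketch, only checked against the toy data above):
# start points: zeros whose next value is non-zero
starts <- which(var == 0 & c(tail(var, -1), NA) != 0)
recovery <- integer(0)
for (x in starts) {
  # first index after x where var exceeds 3
  nxt <- which(var > 3 & seq_along(var) > x)
  if (length(nxt) > 0) recovery <- c(recovery, nxt[1] - x)
}
recovery
# [1] 6 5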

Here is an alternative approach which uses the rleid() function from the data.table package to group by contiguous streaks of zero and non-zero values. It then finds the position within each group of the first occurrence of a value > 3:
library(data.table)
setDT(ts)[, if (.GRP > 1) first(which(var > 3)), rleid(var == 0)]
   rleid V1
1:     3  6
2:     5  5
The first group is skipped because it is either a streak of zeros or has no preceding zero value.
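To see what that grouping looks like, rleid(var == 0) numbers each streak of zeros and non-zeros; on the first twelve values of the original var (5 6 5 5 0 0 0 0 0 1 1 2) it gives:
rleid(var[1:12] == 0)
# [1] 1 1 1 1 2 2 2 2 2 3 3 3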
This approach works even in the case Callum Webb has described in the edit of his answer:
# append data
var <- c(var, 0, 1, 2, 1, 0, 2, 4)
date <- seq.POSIXt(
  from = ISOdatetime(2017, 07, 01, 0, 0, 0),
  along.with = var,
  by = "1 day")
ts <- data.frame(date, var)
setDT(ts)[, if (.GRP > 1) first(which(var > 3)), rleid(var == 0)]
   rleid V1
1:     3  6
2:     5  5
3:     9  2
So, it has recognized that there is a recovery period of 2 days after the final zero.
For the sake of completeness: in case the sequence 0, 1, 2, 1, 0 is considered to also include a recovery period of 3 days, even though it never reaches a value greater than 3:
setDT(ts)[, if (.GRP > 1) if (all(var %between% c(1, 3))) .N else first(which(var > 3)),
rleid(var == 0)]
   rleid V1
1:     3  6
2:     5  5
3:     7  3
4:     9  2
Here all days between two zeros are counted if all values lie between 1 and 3.
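For comparison, roughly the same grouping can be reproduced in base R with rle() and tapply() (a sketch, assuming the appended var from above; the group numbers stand in for rleid):
grp <- with(rle(var == 0), rep(seq_along(lengths), lengths))
recovery <- tapply(var, grp, function(v)
  if (all(v >= 1 & v <= 3)) length(v) else which(v > 3)[1])
recovery <- recovery[-1]        # skip the first group, like .GRP > 1
recovery[!is.na(recovery)]      # drop the all-zero groups
# 3 5 7 9
# 6 5 3 2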

Related

Finding the first trigger in a series and taking average from that point on

I was hoping to get some help with code that identifies a series of trials based on the start trigger (while ignoring the triggers that immediately follow it). In the example below, I would like to find the first 1 in a series of 1's and take the average across the next three numbers in Value_1 and Value_2. It should then find the next start period (the 8th value, with the next set of 1's) and again take the average of the following 3 values, and so on.
df <- data.frame(
  Value_1 = c(1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10),
  Value_2 = c(10,2,3,4,5,6,7,8,10,10,1,2,3,4,5,6,7,8,9,10),
  Triggers = c(0,1,1,1,0,0,0,1,1,1,0,0,0,0,0,1,1,1,0,0)
)
In the updated_df example below, I would like the code to be able to work through possible interruptions in the Trigger value (e.g., a 0 within a run of 1's), find the first 1 in each group of 1's and possible zeros, and take the average across the next four numbers in Value_1 and Value_2. It should then find the next start period (the 9th value, with the next set of 1's and 0's) and again take the average of the following 4 values, and so on.
updated_df <- data.frame(
Value_1 = c(1,2,3,3,4,5,6,7,8,9,9,10,1,2,3,4,5,6,7,8,9,9,10),
Value_2 = c(10,2,3,3,4,5,6,7,8,10,10,10,1,2,3,4,5,6,6,7,8,9,10),
Triggers = c(0,1,1,0,1,0,0,0,1,1,0,1,0,0,0,0,0,1,0,1,1,0,0)
)
Here's a base R solution that handles the updated question ("interruptions in trigger values"). It includes a lag function based on this SO answer.
updated_df <- data.frame(
Value_1 = c(1,2,3,3,4,5,6,7,8,9,9,10,1,2,3,4,5,6,7,8,9,9,10),
Value_2 = c(10,2,3,3,4,5,6,7,8,10,10,10,1,2,3,4,5,6,6,7,8,9,10),
Triggers = c(0,1,1,0,1,0,0,0,1,1,0,1,0,0,0,0,0,1,0,1,1,0,0)
)
# lag function, based on #Andrew's answer at
# https://stackoverflow.com/a/13128713/17303805
lag_fx <- function(x, by = 1L, default = NA) {
  if (by < 0 || !isTRUE(all.equal(by, round(by)))) {
    stop("`by` should be a whole number >= 0")
  }
  c(rep(default, by), x)[1:length(x)]
}
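# quick check: lag_fx() shifts values right and pads with `default`, e.g.
# lag_fx(1:5, by = 2, default = 0)  returns  0 0 1 2 3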
# number of trials per set
set_k <- 4
### to find index of each start trigger:
# (1) make matrix to "look back" at previous k - 1 trials
lagged <- sapply(
  1:(set_k - 1),
  \(x) lag_fx(updated_df$Triggers, by = x, default = 0)
)
# (2) then find rows where trigger == 1, but no 1s in previous k - 1 trials
starts <- which(updated_df$Triggers == 1 & rowSums(lagged) == 0)
# indices of each trigger and following k - 1 rows
sets <- lapply(starts, \(x) x + 0:(set_k - 1))
# means of each set of trials
Value_1 <- sapply(sets, \(x) mean(updated_df$Value_1[x]))
Value_2 <- sapply(sets, \(x) mean(updated_df$Value_2[x]))
# back to a data.frame
data.frame(Value_1, Value_2)
# Value_1 Value_2
# 1 3.0 3.00
# 2 9.0 9.50
# 3 7.5 6.75
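The same pipeline should also cover the original question (uninterrupted triggers) by setting set_k to 3, reusing lag_fx and the df from the question, and assuming, as above, that the trigger row counts as the first of the averaged values; a quick sketch:
set_k <- 3
lagged <- sapply(1:(set_k - 1), \(x) lag_fx(df$Triggers, by = x, default = 0))
starts <- which(df$Triggers == 1 & rowSums(lagged) == 0)
sets <- lapply(starts, \(x) x + 0:(set_k - 1))
data.frame(Value_1 = sapply(sets, \(x) mean(df$Value_1[x])),
           Value_2 = sapply(sets, \(x) mean(df$Value_2[x])))
#   Value_1  Value_2
# 1       3 3.000000
# 2       9 9.333333
# 3       7 7.000000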

Generate id for durations with dplyr

I have a column of timestamps (Time), and I consider one duration to be a run of consecutive times (1 second apart):
data <- data.frame(Time = c("2021-12-01 01:01:01","2021-12-01 01:01:02","2021-12-01 01:01:03","2021-12-01 01:01:05","2021-12-01 01:01:06"))
I would like to generate an Id for each duration like this:
data <- data.frame(Time = c("2021-12-01 01:01:01","2021-12-01 01:01:02","2021-12-01 01:01:03","2021-12-01 01:01:05","2021-12-01 01:01:06"),Id = c(1,1,1,2,2))
With dplyr...
Thank you
Up front:
cumsum(c(TRUE, as.numeric(diff(as.POSIXct(data$Time)), units = "secs") > 1L))
# [1] 1 1 1 2 2
First, you should really be working with real timestamps and not strings. If you're doing anything else with your Time field, it is almost certainly going to be a number-like operation, so you should do this up-front with
data$Time <- as.POSIXct(data$Time)
This works easily here because the strings are well-formed according to the default format of "%Y-%m-%d %H:%M:%S"; see ?strptime for the %-codes.
From here, you want to keep track of when a difference in time is more than 1 second. The differencing is easy enough with:
as.numeric(diff(data$Time), units = "secs")
# [1] 1 1 2 1
Really, the key operator is diff, but it can report minutes or hours or such if the data is widely-enough spaced; there's an internal heuristic for that. Wrapping it in as.numeric(., units="secs") forces it to always be in seconds.
From here, we need a cumulative sum of when it is above 1, ergo > 1L, so cumsum(. > 1L).
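Applied to the differences above (1 1 2 1), that intermediate step gives:
cumsum(c(1, 1, 2, 1) > 1L)
# [1] 0 0 1 1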
Note that the input has length 5 but the output of diff has length 4; this makes sense once you realize that each difference is between two elements. We therefore force the first difference-test to be TRUE. If you have already converted to the POSIXt class, the original code reduces slightly to
cumsum(c(TRUE, as.numeric(diff(data$Time), units = "secs") > 1L))
and therefore to store it as Id,
data$Id <- cumsum(c(TRUE, as.numeric(diff(data$Time), units = "secs") > 1L))
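Since the question asked for dplyr, the same logic can sit inside a mutate() call (a minimal sketch, assuming the dplyr package is attached):
library(dplyr)
data %>%
  mutate(Time = as.POSIXct(Time),
         Id = cumsum(c(TRUE, as.numeric(diff(Time), units = "secs") > 1L)))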

Is there a way to determine how many rows in a dataset have the same categorical variable for multiple conditions (columns)?

For example, I have the dataset below where 1 = yes and 0 = no, and I need to figure out how many calls were made by landline that lasted under 10 minutes.
Image of example dataset
You can also specifically define the values you're looking for in each column when finding the sum. (This will help if you need to count rows with values other than 1 in a column.)
sum(df$landline == 1 & df$`under 10 minutes` == 1)
We can use sum
sum(df1[, "under 10 minutes"])
If two columns are needed
colSums(df1[, c("landline", "under 10 minutes")])
If we are checking both columns, use rowSums
sum(rowSums(df1[, c("landline", "under 10 minutes")], na.rm = TRUE) == 2)
The grep function finds the rows where landline = 1. We then take only those rows and sum the under-10-minutes column.
sum(df[grep(1, df[, 1]), 4])
R will conveniently treat 1 and 0 as if they mean TRUE and FALSE, so we can apply logical Boolean operations like AND (&) and OR (|) on them.
df <- data.frame(x = c(1, 0, 1, 0),
y = c(0, 0, 1, 1))
> sum(df$x & df$y)
[1] 1
> sum(df$x | df$y)
[1] 3
For future questions, you should look up how to use functions like dput or other ways to give an example data set instead of using an image.
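For instance, dput() prints a copy-pasteable definition of an object, so something like the following (where df stands in for your real data frame) gives answerers a reproducible sample:
dput(head(df, 10))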

Converting relative observations into numerical values

This is my first project in R, after having just learned Java.
I have a (large) data set that I have imported from a csv file into data frame.
I have identified the two relevant columns for this question: the first has the name of the patient, and the second asks the patient about the level of swelling.
The level of swelling is relative, i.e. better, worse or about the same.
Not all patients have the same number of observations.
I am having difficulty converting these relative values into numerical values that can be used as part of a greater analysis.
Below is pseudocode for what I think could be an appropriate solution:
for row in 'patientname'
    patientcounter = dtfr1[row, 'patientname']
    if dtfr1[row, 'patientname'] == patientcounter
        if dtfr1[row, 'Does.you.swelling.seem.better.or.worse'] == 'better'
            conditioncounter--
            dtfr1[row, 'Does.you.swelling.seem.better.or.worse'] = conditioncounter
        elseif dtfr1[row, 'Does.you.swelling.seem.better.or.worse'] == 'worse'
            conditioncounter++
            dtfr1[row, 'Does.you.swelling.seem.better.or.worse'] = conditioncounter
        else
            dtfr1[row, 'Does.you.swelling.seem.better.or.worse'] = conditioncounter
    if dtfr1[row, 'patientname'] != patientcounter
        patientcounter = dtfr1[row, 'patientname']
What would your advice be for a good solution to this problem? Thanks!
If I'm understanding correctly, you want the difference in the counts of worse and better, by patient? If so, something like this would work.
# Simulated data
dtfr1 <- data.frame(patient = sample(letters[1:3], 100, replace = TRUE),
                    condition = sample(c("better", "worse"), 100, replace = TRUE))
head(dtfr1)
# patient condition
# 1 a worse
# 2 b better
# 3 b worse
# 4 a better
# 5 c worse
# 6 a better
better_count <- tapply(dtfr1$condition, dtfr1$patient, function(x) sum(x == "better"))
worse_count <- tapply(dtfr1$condition, dtfr1$patient, function(x) sum(x == "worse"))
worse_count - better_count
# a b c
# 5 0 -1
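The same per-patient difference can also be read off a contingency table, which may be easier to extend if there are more categories (a sketch using the simulated dtfr1 above):
counts <- table(dtfr1$patient, dtfr1$condition)
counts[, "worse"] - counts[, "better"]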

Reducing Row Sequences In R Lengths

I'm looking for a nice way to count the longest run of consecutive reductions in a row of a data.table (package version 1.9.2) in R. I am horribly lost and any help is much appreciated. For the example I am trying to do, a reduction is where a value is less than or equal to the previous value (<=).
Below is a toy sample of the data I am dealing with. I have also put down my best attempt so far, which to be honest went horribly wrong and returned an error. My attempt also uses 2 for loops, which I'm not hugely keen on since I have been advised that apply-style functions are more often used in R. I have tried searching this site and the web for a similar solution but haven't had any luck. The number of rows I actually have in my full data table is just over 1 million, and the number of columns is 17.
library(data.table)
TEST_DF <- data.table(COL_1 = c(5,2,3,1), COL_2 = c(1,0,4,2),
COL_3 = c(0,1,6,3), COL_4 = c(0,0,0,4))
TEST_DF$COUNT <- as.numeric(0)
for (i in 1:NROW(TEST_DF)) {
  for (j in 1:(NCOL(TEST_DF) - 1)) {
    TEST_DF$COUNT[j] <- if (TEST_DF[i, j, with = FALSE] >=
                            TEST_DF[i, j + 1, with = FALSE]) {
      TEST_DF$COUNT[j] + 2
    }
  }
}
DESIRED <- data.table(COL_1 = c(5,2,3,1), COL_2 = c(1,0,4,2),
COL_3 = c(0,1,6,3), COL_4 = c(0,0,0,4),
COUNT = c(4,2,1,0))
The desired output appears at the bottom of the code. As all four "COL" columns appear in the longest reduction sequence, the COUNT column for the first row gets a value of 4. In the second row, there is a reduction in the first 2 columns and the last two, but none in between, so COUNT gets a value of 2. In the third row, there is a reduction from COL_3 to COL_4, so COUNT would get a value of 2 for this row. In any row where there is no reduction, such as the last, COUNT would be 0.
Let me know if any further clarification or information is needed.
Thank you so much in advance.
You can use the functions diff() and rle() to build a function to extract the run lengths. Then use apply() across the rows of your data:
foo <- function(x) {
  runs <- rle(c(x[2] <= x[1], diff(x) <= 0))
  if (all(runs$value == 0)) 0 else max(runs$lengths[runs$value == 1])
}
apply(TEST_DF, 1, foo)
[1] 4 2 1 0
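To see what foo() works with, here is the run-length encoding for the second row (2, 0, 1, 0); the longest run of TRUE has length 2, which is the value returned above:
x <- c(2, 0, 1, 0)
rle(c(x[2] <= x[1], diff(x) <= 0))
# Run Length Encoding
#   lengths: int [1:3] 2 1 1
#   values : logi [1:3] TRUE FALSE TRUE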
I used apply with one for loop to accomplish what you're looking for. The apply acts on each row, and the for loop compares consecutive columns.
COUNT <- rep(0, 4)
for (i in 1:(ncol(TEST_DF) - 1)) {
  COUNT <- COUNT + apply(TEST_DF, 1, function(x) ifelse(x[i] >= x[i + 1], 1, 0))
}
This produces: 3, 2, 1, 0, as there are 3 reductions in the first row. The last column has nothing to compare to, so there can only be three comparisons. I'm not sure why you want it to be 4?
If you want count to be part of your original table:
TEST_DF$COUNT<-COUNT
