Splitting days into episode identifiers and start days - r

I have a dataset of event days (in a date format), and each event belongs to an (unknown) episode. I want to categorize the events into episodes in such a way that all events within 180 days of the first day of the episode are considered part of the same episode, and the first day of the episode is assigned to all events part of that episode. For example, given a vector of event dates
event_dates <- c(34, 102, 190, 202, 245, 460, 500, 517)
I'm hoping to get a vector of episode IDs
c(1, 1, 1, 1, 2, 3, 3, 3)
and a vector of episode start days
c(34, 34, 34, 34, 245, 460, 460, 460)
The 5th entry begins a new episode because it is more than 180 days after the first date of the first episode; the 6th entry begins a new episode because it is more than 180 days after the first date of the second episode, etc.
I have to perform this operation on millions of separate patients, so I would prefer a vectorized solution that works with by in a data.table or in a grouped tibble, even if it is a bit opaque, over the readable but slow solution I currently have. Thanks!

Using event_dates from the Note at the end (copied from the question), here are three approaches.
1) Reduce Use Reduce to loop through the events:
f <- function(base, x) if (x > base + 180) x else base
st <- Reduce(f, init = -Inf, event_dates, acc = TRUE)[-1]; st
## [1] 34 34 34 34 245 460 460 460
as.numeric(factor(st))
## [1] 1 1 1 1 2 3 3 3
2) for loop Loop through the values, maintaining a variable base that holds the most recent episode start day.
base <- -Inf
st <- event_dates
for(i in seq_along(event_dates)) {
  if (st[i] > base + 180) base <- st[i]
  st[i] <- base
}
st
## [1] 34 34 34 34 245 460 460 460
as.numeric(factor(st))
## [1] 1 1 1 1 2 3 3 3
3) C++
Create a file called event_dates.cpp in the current directory containing:
// To build & load: library(Rcpp); sourceCpp("event_dates.cpp")
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector grouper(NumericVector x) {
  NumericVector y(clone(x));
  int n = y.size();
  if (n == 0) return y;  // nothing to group; also avoids reading y[0]
  double base = y[0] - 200;  // anything more than 180 below the first day
  for(int i = 0; i < n; i++) {
    if (y[i] > base + 180.0) base = y[i];
    y[i] = base;
  }
  return y;
}
and then run this.
library(Rcpp)
sourceCpp("event_dates.cpp")
st <- grouper(event_dates); st
## [1] 34 34 34 34 245 460 460 460
as.numeric(factor(st))
## [1] 1 1 1 1 2 3 3 3
Note
event_dates <- c(34, 102, 190, 202, 245, 460, 500, 517)
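Since the question asks for something that runs per patient, here is a minimal base R sketch of how approach 1 might be applied by group. The helper name episode_start, the window argument, and the two-patient data frame are made up for illustration, and it assumes the days are already sorted within each patient. It has not been benchmarked; for millions of patients the Rcpp version inside a grouped data.table call is likely the better fit.

```r
# Hypothetical helper: episode start day for each event of ONE patient.
# Assumes `days` is sorted in increasing order.
episode_start <- function(days, window = 180) {
  Reduce(function(base, x) if (x > base + window) x else base,
         days, accumulate = TRUE, init = -Inf)[-1]
}

# Hypothetical data: patient 1 reuses event_dates from the question,
# patient 2 is invented.
d <- data.frame(
  patient = c(rep(1, 8), rep(2, 3)),
  day     = c(34, 102, 190, 202, 245, 460, 500, 517, 10, 150, 200)
)

# ave() applies the function within each patient and keeps row order
d$start   <- ave(d$day, d$patient, FUN = episode_start)
d$episode <- ave(d$start, d$patient,
                 FUN = function(s) as.numeric(factor(s)))
d
```

The same episode_start() call should also work as the j expression of a data.table with by = patient, or inside mutate() on a grouped tibble.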

Using a while loop
event_grp <- event_dates
tmp <- event_dates
index <- rep(1, length(event_dates))
i <- 1
while(TRUE) {
  to_compare <- event_dates[i]
  i1 <- which((tmp - to_compare) > 180)[1] - 1
  if(is.na(i1)) i1 <- length(event_dates)
  event_grp[i:i1] <- to_compare
  if(i > 1) index[i:i1] <- index[i-1] + 1
  tmp[i:i1] <- NA
  if(i1 == length(event_dates)) break
  i <- i1 + 1
}
-output
> event_grp
[1] 34 34 34 34 245 460 460 460
> index
[1] 1 1 1 1 2 3 3 3

Related

In a series of series, how to subtract every 1st number in each sub-series event from every nth number in those events?

I have multiple series of timepoints. Some series have five timepoints, others have ten or fifteen timepoints. The series are in multiples of five because the event I am measuring is always five timepoints long; some recordings have multiple events in succession. For instance:
Series 1:
0
77
98
125
174
Series 2:
0
69
95
117
179
201
222
246
277
293
0 marks the beginning of each series. Series 1 is a single event, but Series 2 is two events in succession. The 6th timepoint in Series 2 is the start of the second event in that series.
I have an R dataframe that contains every timepoint in one column:
dd <- data.frame(
timepoint=c(0, 77, 98, 125, 174,
0, 69, 95, 117, 179, 201, 222, 246, 277, 293)
)
I need to know the duration from the start of each event to the 4th timepoint in each event. For the above data, that means:
Duration 1: 125 - 0 = 125
Duration 2: 117 - 0 = 117
Duration 3: 277 - 201 = 76
How can I write a simple piece of R code that will tell me the duration of that interval regardless of how many series or events there are, i.e. regardless of how many numbers are in the column?
I tried using diff() and seq_along(), but that seems only useful for every nth number, which doesn't work in this case.
diff(vec[seq_along(vec) %% 4 == 1])
This is maybe one way to do it with dplyr. We break up the data into "runs", which reset at each 0, and then we have the "sequences", which reset every 5 values.
library(dplyr)
dd %>%
  group_by(run = cumsum(timepoint == 0)) %>%
  mutate(seq = (row_number() - 1) %/% 5 + 1) %>%
  group_by(run, seq) %>%
  summarize(diff = timepoint[4] - timepoint[1])
# run seq diff
# <int> <dbl> <dbl>
# 1 1 1 125
# 2 2 1 117
# 3 2 2 76
It makes it somewhat easy to tie the value back to where it came from.
If you just wanted to use indexing, here's a helper function
diff4v1 <- function(x) {
idx <- (seq_along(x)-1) %% 5+1;
x[idx==4] - x[idx==1]
}
diff4v1(dd$timepoint)
# [1] 125 117 76
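If you want the run/sequence labels without dplyr, the same grouping idea can be sketched in base R. The variable names run, pos, event and durations are my own, and the sketch assumes, as the question states, that every series starts at 0 and every event is exactly five timepoints long.

```r
dd <- data.frame(
  timepoint = c(0, 77, 98, 125, 174,
                0, 69, 95, 117, 179, 201, 222, 246, 277, 293)
)

run   <- cumsum(dd$timepoint == 0)                 # series number, resets at each 0
pos   <- ave(seq_along(run), run, FUN = seq_along) # position within the series
event <- (pos - 1) %/% 5 + 1                       # event number within the series

# One group per (series, event); names look like "series.event"
key <- interaction(run, event, drop = TRUE)
durations <- sapply(split(dd$timepoint, key), function(v) v[4] - v[1])
durations
```

The names on the result ("1.1", "2.1", "2.2") tie each duration back to its series and event.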
This is your data frame (hypothetical)
df = data.frame(series = round(rnorm(40, 100, 50)))
head(df)
series
1 16
2 35
3 75
4 125
5 190
6 85
And these are your differences
idx = c(1:nrow(df))
df[which(idx %% 5 == 4), "series"] - df[which(idx %% 5 == 1), "series"]
[1] 109 -38 -101 -47 34 -52 -63 -5

If() statement in R

I am not very experienced in if statements and loops in R.
Probably you can help me to solve my problem.
My task is to add +1 to df$fz if sum(df$fz) < 450, but at the same time I have to add +1 only to the max values in df$fz, and only for as long as sum(df$fz) is lower than 450.
Here is my df
ID_PP <- c(3,6, 22, 30, 1234456)
z <- c(12325, 21698, 21725, 8378, 18979)
fz <- c(134, 67, 70, 88, 88)
df <- data.frame(ID_PP,z,fz)
After mutating the new column df$new_value, it should look like 134 68 71 88 89
At this moment I have this code, but it adds +1 to all values.
if (sum(df$fz ) < 450) {
mutate(df, new_value=fz+1)
}
I know that I can pick top_n(3, z) and add +1 only to this top, but it is not what I want, because in that case I have to pick a top manually after checking sum(df$fz)
From what I understood from @Oksana's question and comments, we probably can do it this way:
library(tidyverse)
# data
vru <- data.frame(
id = c(3, 6, 22, 30, 1234456),
z = c(12325, 21698, 21725, 8378, 18979),
fz = c(134, 67, 70, 88, 88)
)
# solution
vru %>% #
top_n(450 - sum(fz), z) %>% # subset by top z, if sum(fz) == 450 -> NULL
mutate(fz = fz + 1) %>% # increase fz by 1 for the subset
bind_rows( #
anti_join(vru, ., by = "id"), # take rows from vru which are not in subset
. # take subset with transformed fz
) %>% # bind those subsets
arrange(id) # sort rows by id
# output
id z fz
1 3 12325 134
2 6 21698 68
3 22 21725 71
4 30 8378 88
5 1234456 18979 89
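For comparison, here is a rough base R sketch of the same idea. It rests on my reading of the expected output, namely that the +1 goes to the rows with the largest z, at most once per row, until the total reaches 450; the names target, shortfall, k and top are made up for the example.

```r
df <- data.frame(
  ID_PP = c(3, 6, 22, 30, 1234456),
  z     = c(12325, 21698, 21725, 8378, 18979),
  fz    = c(134, 67, 70, 88, 88)
)

target    <- 450
shortfall <- max(0, target - sum(df$fz))  # how many +1s are still needed
k         <- min(shortfall, nrow(df))     # assumption: at most one +1 per row

# Row positions of the k largest z values (empty when k == 0)
top <- order(df$z, decreasing = TRUE)[seq_len(k)]

df$new_value <- df$fz
df$new_value[top] <- df$new_value[top] + 1
df
```

With the data above the shortfall is 3, so the three rows with the largest z are incremented and the others are left alone.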
The clarifications in the comments helped. Let me know if this works for you. Of course, you can drop the cumsum_fz and leftover columns.
# Making variables to use in the calculation
df <- df %>%
arrange(fz) %>%
mutate(cumsum_fz = cumsum(fz),
leftover = 450 - cumsum_fz)
# Find the minimum, non-negative value to use for select values that need +1
min_pos <- min(df$leftover[df$leftover > 0])
# Creating a vector that adds 1 using the min_pos value and keeps
# the other values the same
df$new_value <- c((head(sort(df$fz), min_pos) + 1), tail(sort(df$fz), length(df$fz) - min_pos))
# Checking the sum of the new value
> sum(df$new_value)
[1] 450
>
> df
ID_PP z fz cumsum_fz leftover new_value
1 6 21698 67 67 383 68
2 22 21725 70 137 313 71
3 30 8378 88 225 225 89
4 1234456 18979 88 313 137 88
5 3 12325 134 447 3 134
EDIT:
Because utubun already posted a great tidyverse solution, I am going to translate my first one completely to base (it was a bit sloppy to mix the two anyway). Same logic as above, and using the data OP provided.
> # Using base
> df <- df[order(fz),]
>
> leftover <- 450 - cumsum(fz)
> min_pos <- min(leftover[leftover > 0])
> df$new_value <- c((head(sort(df$fz), min_pos) + 1), tail(sort(df$fz), length(df$fz) - min_pos))
>
> sum(df$new_value)
[1] 450
> df
ID_PP z fz new_value
2 6 21698 67 68
3 22 21725 70 71
4 30 8378 88 89
5 1234456 18979 88 88
1 3 12325 134 134

R: Formatting youtube video duration into proper time (seconds)

I have vector (column data) which contains youtube playback duration in a character string format in R.
x <- c("PT1H8S", "PT9M55S", "PT13M57S", "PT1M5S", "PT30M12S", "PT1H21M5S", "PT6M48S", "PT31S", "PT2M")
How do I get rid of PT then get the overall duration in seconds format?
Resultant vector should be c(3608, 595, 837, 65, 1812, 4865, 408, 31, 120)
example: PT1H21M5S in the form of seconds = 4865.
(calculated as 1H = 1*3600, 21M = 21*60, 5S = 5*1)
I wrote a little apply loop with regex commands, deleting everything but the seconds, minutes, or hours and then converting everything into seconds.
x <- c("PT1H8S", "PT9M55S", "PT13M57S", "PT1M5S", "PT30M12S", "PT1H21M5S", "PT6M48S")
x2 <- sapply(x, function(i){
t <- as.numeric(gsub("^(.*)M|^(.*)H|S$", "", i))
if(grepl("M", i)) t <- t + as.numeric(gsub("^(.*)PT|^(.*)H|M(.*)$", "",i)) * 60
if(grepl("H", i)) t <- t + as.numeric(gsub("^(.*)PT|H(.*)$", "",i)) * 3600
t
})
x2
PT1H8S PT9M55S PT13M57S PT1M5S PT30M12S PT1H21M5S PT6M48S
3608 595 837 65 1812 4865 408
EDIT: Per request
x <- c("PT1H8S", "PT9M55S", "PT13M57S", "PT1M5S", "PT30M12S", "PT1H21M5S", "PT6M48S", "PT31S", "PT2M")
x2 <- sapply(x, function(i){
t <- 0
if(grepl("S", i)) t <- t + as.numeric(gsub("^(.*)PT|^(.*)M|^(.*)H|S$", "", i))
if(grepl("M", i)) t <- t + as.numeric(gsub("^(.*)PT|^(.*)H|M(.*)$", "",i)) * 60
if(grepl("H", i)) t <- t + as.numeric(gsub("^(.*)PT|H(.*)$", "",i)) * 3600
t
})
x2
PT1H8S PT9M55S PT13M57S PT1M5S PT30M12S PT1H21M5S PT6M48S PT31S PT2M
3608 595 837 65 1812 4865 408 31 120
This should cover all the cases. If there are more, the trick is to alter the regex: ^ anchors the beginning of the string, $ anchors the end, and (.*) matches any run of characters. So ^(.*)H matches everything from the beginning through the last H, which we replace with nothing.
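The regex idea above can also be written so that it is vectorized over the whole input rather than looping with sapply. This is only a sketch: unit_value and iso_to_seconds are hypothetical helper names, and it assumes the strings contain only whole-number H, M and S parts (no days, no fractional seconds).

```r
# Hypothetical helper: the integer directly in front of a unit letter,
# or 0 when that unit does not appear in the string.
unit_value <- function(x, unit) {
  out <- numeric(length(x))
  hit <- grepl(paste0("[0-9]+", unit), x)
  # Lazy .*? stops at the first digit run that is followed by `unit`
  out[hit] <- as.numeric(
    sub(paste0("^.*?([0-9]+)", unit, ".*$"), "\\1", x[hit], perl = TRUE)
  )
  out
}

iso_to_seconds <- function(x) {
  3600 * unit_value(x, "H") + 60 * unit_value(x, "M") + unit_value(x, "S")
}

x <- c("PT1H8S", "PT9M55S", "PT13M57S", "PT1M5S", "PT30M12S",
       "PT1H21M5S", "PT6M48S", "PT31S", "PT2M")
iso_to_seconds(x)
```

Extracting each unit separately means a missing part simply contributes 0, so no special cases are needed for strings such as "PT31S" or "PT2M".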
Here's a dplyr and stringr solution:
library(dplyr)
library(stringr)
library(tidyr)
df %>%
  # extract hours, minutes, and seconds and convert to numeric:
  mutate(
    h = as.numeric(str_extract(x, "(?<=PT)\\d+(?=H)")),
    m = as.numeric(str_extract(x, "(?<=PT|H)\\d+(?=M)")),
    s = as.numeric(str_extract(x, "(?<=PT|H|M)\\d+(?=S)"))
  ) %>%
  # replace NA with 0 in the numeric columns only:
  mutate(
    across(c(h, m, s), ~ replace_na(.x, 0))
  ) %>%
  # calculate time in seconds:
  mutate(sec = h*3600 + m*60 + s)
x h m s sec
1 PT1H8S 1 0 8 3608
2 PT9M55S 0 9 55 595
3 PT13M57S 0 13 57 837
4 PT1M5S 0 1 5 65
5 PT30M12S 0 30 12 1812
6 PT1H21M5S 1 21 5 4865
7 PT6M48S 0 6 48 408
8 PT31S 0 0 31 31
9 PT2M 0 2 0 120
Data:
df <- data.frame(x = c("PT1H8S", "PT9M55S", "PT13M57S", "PT1M5S", "PT30M12S", "PT1H21M5S", "PT6M48S", "PT31S", "PT2M"))
You can use the lubridate package:
library(lubridate)
x <- c("PT1H8S", "PT9M55S", "PT13M57S", "PT1M5S", "PT30M12S", "PT1H21M5S", "PT6M48S")
x2 <- as.numeric(duration(x))
x2
[1] 3608 595 837 65 1812 4865 408

Rolling queue size

I want to calculate the number of items waiting in a queue. Let's say I have a fixed capacity of 102 items/hour and a different number of incoming items in each of 9 hours.
As a data.table:
dt<-data.table(hour = c(1,2,3,4,5,6,7,8,9),
incoming = c(78,102,115,117,105,99,91,80,71),
capacity = rep(102,9))
I want to calculate queued items in each period.
In hours 1 and 2, capacity is enough and the queue is 0.
In hour 3, 13 items are queued.
In hour 4, 15 new items plus the 13 backlogged items are queued.
In hour 6, there were 31 backlogged items and 3 items are deducted, so 28 are queued.
I have tried several options but could not figure out how to calculate.
Result should be:
Explicit looping in R won't get you far, and I don't see a vectorized solution for this, but this is trivial to solve using Rcpp:
library(Rcpp)
cppFunction("NumericVector queue(NumericVector x) {
NumericVector res(x.size());
res[0] = std::max<double>(0, x[0]);
for (int i = 1, size = x.size(); i < size; ++i) {
res[i] = std::max<double>(0, res[i-1] + x[i]);
}
return res;
}")
dt[, queued := queue(incoming - capacity)][]
# hour incoming capacity queued
#1: 1 78 102 0
#2: 2 102 102 0
#3: 3 115 102 13
#4: 4 117 102 28
#5: 5 105 102 31
#6: 6 99 102 28
#7: 7 91 102 17
#8: 8 80 102 0
#9: 9 71 102 0
I'd create a separate function to get the queued number like @sebastian-c did, but with @R.S.'s logic. Like this
get_queue <- function(x){
  n <- length(x)
  y <- c(max(0, x[[1]]), rep(0, n - 1))
  # seq_len(n)[-1] is empty when n == 1, unlike 2:n
  for(i in seq_len(n)[-1]){
    y[i] <- max(0, y[i - 1] + x[i])
  }
  y
}
And then
dt[,incoming_capacity := incoming - capacity]
dt[,queued := get_queue(incoming_capacity)]
Another alternative:
require(data.table)
dt<-data.table(hour = c(1,2,3,4,5,6,7,8,9),
incoming = c(78,102,115,117,105,99,91,80,71),
capacity = rep(102,9))
dt$incoming_capacity <- dt$incoming - dt$capacity
dt$carriedover <- 0
dt$carriedover[1] <- max(0, dt$incoming_capacity[1]) # added
for (i in 2:length(dt$carriedover)) {
  dt$carriedover[i] <- max(0, dt$incoming_capacity[i] + dt$carriedover[i-1])
}
dt
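All of the answers above compute the same recurrence: the queue in an hour is the previous queue plus incoming minus capacity, floored at zero. As a compact illustration (not a faster alternative), that recurrence can be sketched in base R with Reduce; the name queued is just for this example.

```r
incoming <- c(78, 102, 115, 117, 105, 99, 91, 80, 71)
capacity <- rep(102, 9)

# q is the queue carried over from the previous hour,
# d is this hour's incoming minus capacity
queued <- Reduce(function(q, d) max(0, q + d),
                 incoming - capacity,
                 accumulate = TRUE, init = 0)[-1]  # drop the initial 0
queued
```

Starting from init = 0 means the first hour needs no special handling, which is what the separate first-element line does in the loop versions.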

replacing specific elements of a vector

I am trying to write the user-defined function below in R:
wrkexpcode.into.month <- function(vec) {
tmp.vec <- vec
tmp.vec[tmp.vec == 0 | tmp.vec == 9] <- NA
tmp.vec[tmp.vec == 1] <- 4
tmp.vec[tmp.vec == 2] <- 13
tmp.vec[tmp.vec == 3] <- 31
tmp.vec[tmp.vec == 4] <- 78
tmp.vec[tmp.vec == 5] <- 174
tmp.vec[tmp.vec == 6] <- 240
return (tmp.vec)
}
but when I execute with a simple command like
wrkexpcode.into.month(c(3,2,2,3,1,3,5,6,4))
the result comes like
[1] 31 13 13 31 78 31 174 240 78
but I expect the result like
[1] 31 13 13 31 **4** 31 174 240 78
How can I fix this?
You have to carefully follow the flow of your function and check what the values are at each step. You expect 1 to be replaced by 4 by tmp.vec[tmp.vec == 1] <- 4, but the later line tmp.vec[tmp.vec == 4] <- 78 then replaces that 4 with 78. This happens because you both modify tmp.vec and use tmp.vec to decide what needs replacing. Like @MattewPlourde said, you need to base the replacement on vec:
tmp.vec[vec == 1] <- 4
Although I would simply replace the code by:
wrkexpcode.into.month <- function(vec) {
translation_vector = c('0' = NA, '1' = 4, '2' = 13, '3' = 31,
'4' = 78, '5' = 174, '6' = 240, '9' = NA)
return(translation_vector[as.character(vec)])
}
wrkexpcode.into.month(c(3,2,2,3,1,3,5,6,4))
# 3 2 2 3 1 3 5 6 4
# 31 13 13 31 4 31 174 240 78
See also a blogpost I wrote recently about this kind of operation.
I think it will be much easier to use one of the many recode functions that are designed for such purposes instead of hard-coding it. It's just a one-liner then, e.g.
library(likert)
x <- c(3,2,2,3,1,3,5,6,4)
recode(x, from=c(0:6, 9), to=c(NA, 4,13,31,78,174,240,NA))
[1] 31 13 13 31 4 31 174 240 78
And if desired, wrap it into a function, e.g.
wrkexpcode.into.month <- function(x)
recode(x, from=c(0:6, 9), to=c(NA, 4,13,31,78,174,240,NA))
wrkexpcode.into.month(x)
[1] 31 13 13 31 4 31 174 240 78
You could create a matrix mapping each input value (column 1) to the desired output value (column 2)
table=matrix(c(0,1,2,3,4,5,6,9,NA,4,13,31,78,174,240,NA),ncol=2)
And using sapply on the vector c(3,2,2,3,1,3,5,6,4)
sapply(c(3,2,2,3,1,3,5,6,4), function(x) table[which(table[,1] == x),2] )
to give you the desired output too
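A lookup with match() is one more way to avoid the overwrite problem, because every replacement is decided from the original values in a single step. This is a small sketch; the names codes, months and wrkexpcode_lookup are invented for the example.

```r
codes  <- c(0, 1, 2, 3, 4, 5, 6, 9)
months <- c(NA, 4, 13, 31, 78, 174, 240, NA)

# match() gives the position of each input value in `codes`;
# values that are not listed come back as NA
wrkexpcode_lookup <- function(vec) months[match(vec, codes)]

wrkexpcode_lookup(c(3, 2, 2, 3, 1, 3, 5, 6, 4))
```

Unlike the named-vector version, the result carries no names, and codes that are not in the table map to NA rather than raising an error.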

Resources