Number of days spent in each STATE in R

I'm trying to calculate the number of days that a patient spent in a given state in R.
An image of example data is included below. I only have columns 1 to 3 and I want to get the answer in column 5. My idea is that if I can create a date column in column 4 holding the first recorded date for each state, I can subtract it from column 2 and get the days I am looking for.
I tried group_by(MRN, STATE), but the problem is that it groups the second run of 1's together with the first run of 1's, and likewise for the 2's, which is not what I want.
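For reference, a minimal sketch of what data like that in the image might look like (the column names MRN, OBS_DTM and STATE match the code below; the values themselves are invented):
# Hypothetical stand-in for the pictured data: one patient with two separate runs of STATE 1
example <- data.frame(
  MRN     = "A001",
  OBS_DTM = c("7/27/2020 8:44", "7/27/2020 8:56", "8/8/2020 20:12",
              "8/14/2020 10:13", "8/15/2020 13:32", "8/16/2020 9:00"),
  STATE   = c(1, 1, 2, 2, 1, 1)
)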

Use mdy_hm to convert OBS_DTM to POSIXct, then group_by MRN and the rleid of STATE so that the first run of 1's is handled separately from the second. Use difftime to calculate the difference, in days, between each OBS_DTM and the minimum value in its group.
If your data is called data:
library(dplyr)
data %>%
  mutate(OBS_DTM = lubridate::mdy_hm(OBS_DTM)) %>%
  group_by(MRN, grp = data.table::rleid(STATE)) %>%
  mutate(Answer = as.numeric(difftime(OBS_DTM, min(OBS_DTM), units = 'days'))) %>%
  ungroup() %>%
  select(-grp) -> result
result
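If you only need one total per run of a state, rather than a running value on every row, a possible variation of the same idea (a sketch; days_in_state is an invented column name):
data %>%
  mutate(OBS_DTM = lubridate::mdy_hm(OBS_DTM)) %>%
  group_by(MRN, grp = data.table::rleid(STATE)) %>%
  summarise(STATE = first(STATE),
            days_in_state = as.numeric(difftime(max(OBS_DTM), min(OBS_DTM), units = 'days')),
            .groups = 'drop') %>%
  select(-grp)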

You could try the following:
library(dplyr)
df %>%
  group_by(ID, State) %>%
  mutate(priorObsDTM = lag(OBS_DTM)) %>%
  filter(!is.na(priorObsDTM)) %>%
  ungroup() %>%
  mutate(Answer = as.numeric(OBS_DTM - priorObsDTM, units = 'days'))
The dataframe I used for this example:
df <- data.frame(
  ID = 1,
  OBS_DTM = as.POSIXct(
    c('2020-07-27 8:44', '2020-07-27 8:56', '2020-08-08 20:12',
      '2020-08-14 10:13', '2020-08-15 13:32')
  ),
  State = c(1, 1, 2, 2, 2),
  stringsAsFactors = FALSE
)
The result of the pipeline above:
# A tibble: 3 x 5
# ID OBS_DTM State priorObsDTM Answer
# <dbl> <dttm> <dbl> <dttm> <dbl>
# 1 1 2020-07-27 08:56:00 1 2020-07-27 08:44:00 0.00833
# 2 1 2020-08-14 10:13:00 2 2020-08-08 20:12:00 5.58
# 3 1 2020-08-15 13:32:00 2 2020-08-14 10:13:00 1.14


Addition of missing data after floor_date / detect and fill in missing data gaps

I would like to sum up a larger data set per month. floor_date offers the right functionality to aggregate the data from individual days to a monthly level. Unfortunately, I need to make sure that all months are included in the final table. The initial data does not always cover every month, so after floor_date the missing months must appear with a value of 0; the rows/months must not simply be absent. How can I ensure this automatically?
The following exemplary code clarifies my problem:
library(dplyr)
library(lubridate)

df <- data.frame(
  time = c(as.Date("01-01-2020", format = "%d-%m-%Y"),
           as.Date("02-01-2020", format = "%d-%m-%Y"),
           as.Date("01-03-2020", format = "%d-%m-%Y")),
  text = c("A", "A", "B")
)
df2 <- df %>%
  mutate(month = floor_date(time, unit = "month")) %>%
  select(text, month) %>%
  group_by(month, text) %>%
  summarise(n = n())
df2
# A tibble: 2 x 3
# Groups: month [2]
month text n
<date> <fct> <int>
1 2020-01-01 A 2
2 2020-03-01 B 1
Note that there is no data for B in month 2020-01, no data for A and B in month 2020-02, and no data for A in month 2020-03: these rows should be added with the value 0.
Unfortunately, so far I have not found an automated way to solve this.
Thanks in advance!
I don't see the need for using format when mutating the month variable (floor_date): the formatting turns the variable into character type, so no further date calculations can be performed on it.
Remove that step and use tidyr::complete to fill in the missing months, as shown below:
df <- data.frame(
  time = c(as.Date("01-01-2020", format = "%d-%m-%Y"),
           as.Date("02-01-2020", format = "%d-%m-%Y"),
           as.Date("01-03-2020", format = "%d-%m-%Y")),
  text = c("A", "A", "B")
)
library(lubridate, warn.conflicts = F)
library(tidyverse, warn.conflicts = F)
df %>%
  mutate(month = floor_date(time, unit = "month")) %>%
  group_by(text, month) %>%
  summarise(n = n(), .groups = 'drop') %>%
  complete(nesting(text),
           month = seq.Date(from = min(month), to = max(month), by = '1 month'),
           fill = list(n = 0))
# A tibble: 6 x 3
text month n
<chr> <date> <dbl>
1 A 2020-01-01 2
2 A 2020-02-01 0
3 A 2020-03-01 0
4 B 2020-01-01 0
5 B 2020-02-01 0
6 B 2020-03-01 1
Created on 2021-07-06 by the reprex package (v2.0.0)
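If the months should come from a fixed reporting window rather than from the data (say, all of 2020), a possible variation (a sketch; the explicit date range is an assumption):
df %>%
  mutate(month = floor_date(time, unit = "month")) %>%
  count(text, month, name = "n") %>%
  complete(text,
           month = seq.Date(as.Date("2020-01-01"), as.Date("2020-12-01"), by = "1 month"),
           fill = list(n = 0))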
Base R option using cut -
stack(table(cut(df$time,'month')))[2:1]
# ind values
#1 2020-01-01 2
#2 2020-02-01 0
#3 2020-03-01 1
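Note that this collapses over text; if you also need the zero-filled counts per text group in base R, a possible variation along the same lines (a sketch):
as.data.frame(table(month = cut(df$time, 'month'), text = df$text))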

Filtering multiple time series values in R

I have a problem with a time series which I don't know how to solve.
I have a tibble with 4 different variables. In my real dataset there are over 10,000 documents.
document date author label
1 2018-04-05 Mr.X 1
2 2018-02-05 Mr.Y 0
3 2018-04-17 Mr.Z 1
As a first step, I want to count the articles that occur in a specific month of a specific year, for every month in my time series. I know that I can filter for a specific month in a year like this:
tibble %>%
  filter(date > "2018-02-01" & date < "2018-02-28")
The result would be a tibble with one observation, but my problem is that I have 360 different time periods in my data. Can I write a function to solve this, or do I need to make 360 separate calculations?
The best solution for me would be a table with 360 columns, one per month, each containing the number of articles counted in that month. Is this possible?
Thank you so much in advance.
If you want each month's result as a separate list element, you can do something like this:
suppressMessages(library(dplyr))
df %>%
  mutate(date = as.Date(date)) %>%
  group_split(substr(date, 1, 7), .keep = FALSE)
<list_of<
tbl_df<
document: integer
date : date
author : character
label : integer
>
>[2]>
[[1]]
# A tibble: 1 x 4
document date author label
<int> <date> <chr> <int>
1 2 2018-02-05 Mr.Y 0
[[2]]
# A tibble: 2 x 4
document date author label
<int> <date> <chr> <int>
1 1 2018-04-05 Mr.X 1
2 3 2018-04-17 Mr.Z 1
You can further use list2env() to save each item of this list as a separate object in your environment.
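A rough sketch of that step (the list needs names first; the monthly_ prefix and naming scheme here are just illustrative choices):
lst <- df %>%
  mutate(date = as.Date(date)) %>%
  group_split(substr(date, 1, 7), .keep = FALSE)
names(lst) <- paste0("monthly_", sapply(lst, function(x) format(x$date[1], "%Y_%m")))
list2env(lst, envir = .GlobalEnv)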
To count the number of rows for each month-year combination, in tidyverse you can do:
library(dplyr)
library(tidyr)
df %>%
  mutate(date = as.Date(date),
         year_mon = format(date, '%Y-%m')) %>%
  select(year_mon) %>%
  pivot_wider(names_from = year_mon, values_from = year_mon,
              values_fn = length, values_fill = 0)
# `2018-04` `2018-02`
# <int> <int>
#1 2 1
and in base R:
df$date <- as.Date(df$date)
table(format(df$date, '%Y-%m'))
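If a data frame is more convenient than a named table (e.g. with 360 periods), a possible follow-up (a sketch; year_mon is an invented name):
monthly_counts <- as.data.frame(table(year_mon = format(df$date, '%Y-%m')))
monthly_counts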

How to get the difference of a lagged variable by date?

Consider the following example:
library(tidyverse)
library(lubridate)
df = tibble(client_id = rep(1:3, each = 24),
            date = rep(seq(ymd("2016-01-01"), ymd("2016-12-01") + years(1), by = 'month'), 3),
            expenditure = runif(72))
In df you have stored information on monthly expenditure from a bunch of clients for the past 2 years. Now you want to calculate the monthly difference between this year and the previous year for each client.
Is there any way of doing this while maintaining the "long" format of the dataset? Here is how I am doing it now, which involves going wide:
df2 = df %>%
  mutate(date2 = paste0('val_',
                        year(date),
                        formatC(month(date), width = 2, flag = "0"))) %>%
  select(client_id, date2, expenditure) %>%
  pivot_wider(names_from = date2,
              values_from = expenditure)
df3 = (df2[, 2:13] - df2[, 14:25])
However, I find this unnecessarily complex, and on large datasets going from long to wide can take quite a lot of time, so I think there must be a better way of doing it.
If you want to keep the data in long format, one way would be to group by the month-day part of the date within each client_id and calculate the difference using diff.
library(dplyr)
df %>%
  group_by(client_id, month_date = format(date, "%m-%d")) %>%
  summarise(diff = -diff(expenditure))
# client_id month_date diff
# <int> <chr> <dbl>
# 1 1 01-01 0.278
# 2 1 02-01 -0.0421
# 3 1 03-01 0.0117
# 4 1 04-01 -0.0440
# 5 1 05-01 0.855
# 6 1 06-01 0.354
# 7 1 07-01 -0.226
# 8 1 08-01 0.506
# 9 1 09-01 0.119
#10 1 10-01 0.00819
# … with 26 more rows
An option with data.table
library(data.table)
setDT(df)[, .(diff = -diff(expenditure)), .(client_id, month_date = format(date, "%m-%d"))]
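If you would rather keep every monthly row (one per client and date) instead of collapsing to one difference per month, a lag-based sketch along these lines might work (yoy_diff is an invented column name; the first year's rows come out NA):
df %>%
  group_by(client_id, month = format(date, "%m")) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(yoy_diff = expenditure - lag(expenditure)) %>%
  ungroup()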

Is there a way of splitting time and duration variable while leaving other variables unchanged in R?

I have a dataset whose rows I'd now like to split in two at 12:00 pm (midday), i.e. if an interval goes from 08:00-13:00 it becomes 08:00-12:00 and 12:00-13:00 across two rows. The duration and cumulative-time variables would need to change accordingly, but the other variables should stay as in the original (unchanged).
This should be applicable across different id values.
id = unchanged from row 1, just repeated
start = changed in both rows
end = changed in both rows
day = unchanged from row 1, just repeated
duration = changed in both rows
cumulative time = changed in both rows
ORIGINAL DATAFILE
# Current dataframe
id <- c("m1", "m1")
x <- c("2020-01-03 10:00:00", "2020-01-03 19:20:00")
start <- strptime(x, "%Y-%m-%d %H:%M:%S")
y <- c("2020-01-03 16:00:00", "2020-01-03 20:50:00")
end <- strptime(y, "%Y-%m-%d %H:%M:%S")
day <- c(1, 1)
mydf <- data.frame(id, start, end, day)
# calculate duration and time
mydf$duration <- as.numeric(difftime(mydf$end, mydf$start, units = "hours"))
mydf$time <- cumsum(mydf$duration)
REQUIRED DATAFILE
# Required dataframe
id2 <- c("m1", "m1", "m1")
x2 <- c("2020-01-03 10:00:00", "2020-01-03 12:00:00", "2020-01-03 19:20:00")
start2 <- strptime(x2, "%Y-%m-%d %H:%M:%S")
y2 <- c("2020-01-03 12:00:00", "2020-01-03 16:00:00", "2020-01-03 20:50:00")
end2 <- strptime(y2, "%Y-%m-%d %H:%M:%S")
day2 <- c(1, 1, 1)
mydf2 <- data.frame(id2, start2, end2, day2)
# calculate duration and time
mydf2$duration <- c(2, 4, 1.5)
mydf2$time <- c(2, 6, 7.5)
Good question. Each line implicitly contains either one or two intervals, so you should be able to just define those interval(s) on each line and then pivot to long, but you can't pivot with interval values (yet?). Here's my approach, which computes up to two shift start times for each line and then, after pivoting, infers each shift's end from the start of the next one. Comments inline.
library(lubridate, warn.conflicts = FALSE)
library(tidyverse)
library(magrittr, warn.conflicts = FALSE)
library(hablar, warn.conflicts = FALSE)
(mydf <- tibble(
  id = "m1",
  start = as_datetime(c("2020-01-03 10:00:00", "2020-01-03 19:20:00")),
  end = as_datetime(c("2020-01-03 16:00:00", "2020-01-03 20:50:00")),
  day = 1
))
#> # A tibble: 2 x 4
#> id start end day
#> <chr> <dttm> <dttm> <dbl>
#> 1 m1 2020-01-03 10:00:00 2020-01-03 16:00:00 1
#> 2 m1 2020-01-03 19:20:00 2020-01-03 20:50:00 1
(mydf2 <-
  mydf %>%
  # Assume the relevant noontime cutoff is on the same day as the start
  mutate(midday =
           start %>% as_date() %>%
           add(12 %>% hours()) %>%
           fit_to_timeline() %>%
           # No relevant midday if the shift doesn't include noon
           na_if(not(. %within% interval(start, end)))) %>%
  # Make an original row ID since there doesn't seem to be one, and we will need
  # to build intervals within the data stemming from each original row
  rownames_to_column("orig_shift") %>%
  pivot_longer(cols = c(start, midday, end),
               # The timestamps we have here will be treated as start times
               values_to = "start",
               # Drop rows that would exist due to irrelevant middays
               values_drop_na = TRUE) %>%
  select(-name) %>%
  # Infer shift end times as the start of the next shift, within lines defined
  # by the original shifts
  group_by(orig_shift) %>%
  arrange(start) %>%
  mutate(end = lead(start)) %>%
  ungroup() %>%
  # Drop lines that represent the end of the last shift and not a full one
  drop_na() %>%
  # Compute those durations and times (should times really be globally
  # cumulative? Also, your specified mydf2 seems to have an incorrect first time
  # value)
  mutate(duration = start %--% end %>% as.numeric("hours"),
         time = cumsum(duration)) %>%
  select(id, start, end, day, duration, time))
#> # A tibble: 3 x 6
#> id start end day duration time
#> <chr> <dttm> <dttm> <dbl> <dbl> <dbl>
#> 1 m1 2020-01-03 10:00:00 2020-01-03 12:00:00 1 2 2
#> 2 m1 2020-01-03 12:00:00 2020-01-03 16:00:00 1 4 6
#> 3 m1 2020-01-03 19:20:00 2020-01-03 20:50:00 1 1.5 7.5
Created on 2019-10-23 by the reprex package (v0.3.0)
Here is my solution for a more general case where you have many observations with different dates. The logic is the following.
Firstly, I create a data frame of 12:00 pm (midday) splitters.
Next, I identify the rows which should be split by joining that data frame to the initial one, and save them in a separate data frame.
Next, I duplicate those rows and create the split rows.
Finally, from the original dataset I delete the rows which were split and join in the corrected, doubled rows.
library(dplyr)
split_time_data =
  tibble(split_time = as.POSIXct(seq(0, 365*60*60*24, 60*60*24),
                                 origin = "2020-01-01 17:00:00")) %>%
  mutate(key = TRUE)  # I use 17:00 to make it 12:00 EST, adjust for your purposes
data_to_split =
  mydf %>%
  mutate(key = TRUE) %>%
  left_join(split_time_data) %>%
  filter(between(split_time, start, end)) %>%
  select(-key)
library(lubridate)
split_rows =
  data_to_split %>%
  rbind(data_to_split) %>%
  arrange(start) %>%
  group_by(start) %>%
  mutate(row_number = row_number()) %>%
  ungroup() %>%
  mutate(start = if_else(row_number == 1, start, split_time),
         end = if_else(row_number == 1, split_time, end)) %>%
  select(-row_number, -split_time) %>%
  mutate(duration = hour(end) - hour(start))
mydf %>%
  anti_join(data_to_split) %>%
  full_join(split_rows) %>%
  arrange(start) %>%
  mutate(time = cumsum(duration))
The output
id start end day duration time
1 m1 2020-01-03 10:00:00 2020-01-03 12:00:00 1 2.0 2.0
2 m1 2020-01-03 12:00:00 2020-01-03 16:00:00 1 4.0 6.0
3 m1 2020-01-03 19:20:00 2020-01-03 20:50:00 1 1.5 7.5

Using lag in mutate() on arranged data

I am working on a data set which is similar to
library(tidyverse)
library(lubridate)

data <- tribble(
  ~id, ~dates,   ~days_prior,
  1,   20190101, NA,
  1,   NA,       15,
  1,   NA,       20,
  2,   20190103, NA,
  2,   NA,       3,
  2,   NA,       4)
I have the first date for each ID and I am trying to calculate the next date by adding days_prior to the previous date. I am using the lag function to refer to the previous date.
data <- data %>% mutate(dates = as.Date(ymd(dates)), days_prior = as.integer(days_prior))
data <- data %>% mutate(dates =
                          as.Date(ifelse(is.na(days_prior), dates, days_prior + lag(dates)),
                                  origin = "1970-01-01"))
This works, but only for the next row, as you can see in the attached data.
What am I doing wrong? I would like all the dates to be calculated by mutate(). What different approach should I take to calculate this?
I don't really see how lag would help here: lag() reads the original dates column, so values computed for earlier rows within the same mutate() are not visible to later rows. Unless I misunderstood, here is an option using tidyr::fill:
data %>%
  group_by(id) %>%
  mutate(dates = as.Date(ymd(dates))) %>%
  fill(dates) %>%
  mutate(dates = dates + if_else(is.na(days_prior), 0L, as.integer(days_prior))) %>%
  ungroup()
## A tibble: 6 x 3
# id dates days_prior
# <dbl> <date> <dbl>
#1 1 2019-01-01 NA
#2 1 2019-01-16 15
#3 1 2019-01-21 20
#4 2 2019-01-03 NA
#5 2 2019-01-06 3
#6 2 2019-01-07 4
Or a slight variation, replacing the NA entries in days_prior with 0
data %>%
  group_by(id) %>%
  mutate(
    dates = as.Date(ymd(dates)),
    days_prior = replace(days_prior, is.na(days_prior), 0)) %>%
  fill(dates) %>%
  mutate(dates = dates + as.integer(days_prior)) %>%
  ungroup()
Update
In response to your clarifications in the comments, here is what you can do
data %>%
  group_by(id) %>%
  mutate(
    dates = as.Date(ymd(dates)),
    days_prior = replace(days_prior, is.na(days_prior), 0)) %>%
  fill(dates) %>%
  mutate(dates = dates + cumsum(days_prior)) %>%
  ungroup()
## A tibble: 6 x 3
# id dates days_prior
# <dbl> <date> <dbl>
#1 1 2019-01-01 0
#2 1 2019-01-16 15
#3 1 2019-02-05 20
#4 2 2019-01-03 0
#5 2 2019-01-06 3
#6 2 2019-01-10 4
You can use na.locf from the zoo package to carry forward the last observed date before adding the prior days.
library("tidyverse")
library("zoo")
data %>%
  # Fill in NA dates with the previous non-NA date
  # The `locf` stands for "last observation carried forward"
  # Fill in NA days_prior with 0
  mutate(dates = zoo::na.locf(dates),
         days_prior = replace_na(days_prior, 0)) %>%
  mutate(dates = lubridate::ymd(dates) + days_prior)
This solution makes two assumptions:
1. The rows are sorted by id. You can get around this assumption with a group_by(id) followed by an ungroup() statement, as shown in the solution by Maurits Evers.
2. For each id, the row with the observed date comes first in the group. This needs to be true in any case with either na.locf or fill, because both functions fill in NAs using the previous non-NA entry.
If you don't want to make any assumptions about the ordering, you can sort the rows at the start with data %>% arrange(id, dates).
