In R, use Lubridate to get a conditional average duration between events - r

Background
I've got a dataframe d:
d <- data.frame(ID = c("a","a","a","a","a","a","a","b","b","c","c"),
treatment = c(0,1,0,0,0,1,0,1,0,1,1),
#event = c(0,0,1,1,1,1,1,0,1,1,1),
service_date = as.Date(c("2011-01-01",
"2011-08-21",
"2011-12-23",
"2012-02-23",
"2013-09-14",
"2013-04-07",
"2014-10-14",
"2013-01-01",
"2013-12-12",
"2014-06-17",
"2015-09-29")),
stringsAsFactors=FALSE)
> d
ID treatment service_date
1 a 0 2011-01-01
2 a 1 2011-08-21
3 a 0 2011-12-23
4 a 0 2012-02-23
5 a 0 2013-09-14
6 a 1 2013-04-07
7 a 0 2014-10-14
8 b 1 2013-01-01
9 b 0 2013-12-12
10 c 1 2014-06-17
11 c 1 2015-09-29
It describes some people (ID), whether or not they had a treatment, and the date of each entry (row).
The Problem
I want to calculate the mean duration between the first and last treatment==1 for IDs who have more than 1 row where treatment==1.
To make that more clear, let's lay out the steps as if we were doing this manually, and also see what answer I want:
Take ID a. Mr. A has 7 rows of data, but only two rows in which treatment==1: one from 2011-08-21 (row 2) and another from 2013-04-07 (row 6). If you hand-calculate the difference, you find that there are 595 days between the two.
For ID b, we do nothing, as they only have 1 treatment==1. (We'll use filter to skip people like b in the code.)
For Mr. c, we get a difference of 469 days.
Average duration of treatment in this group: (595 days + 469 days) / 2 people = 532 days. This is the desired result.
(It's entirely possible I've done this hand-calculation wrong, and that's fine, as long as it suffices to understand what I'm trying to do. Happy to clarify further if needed; let me know!)
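The hand calculation can be checked with plain date subtraction. This is only a base-R sketch; the variable names (gap_a, gap_c) are made up for illustration and the dates are the treatment==1 rows from d:

```r
# Days between first and last treated row for each person with more than one treatment
gap_a <- as.numeric(as.Date("2013-04-07") - as.Date("2011-08-21"))  # Mr. a
gap_c <- as.numeric(as.Date("2015-09-29") - as.Date("2014-06-17"))  # Mr. c

# Average over the two people who qualify
avg_gap <- mean(c(gap_a, gap_c))
print(c(a = gap_a, c = gap_c, avg = avg_gap))
```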
What I've tried
I'm trying to adapt some old code from a similar query to work for this:
d %>%
group_by(ID) %>%
filter(sum(treatment) >1) %>%
mutate(treatment_years = lubridate::time_length(max(service_date) - min(service_date), unit = "year")) %>%
ungroup() %>%
summarise(avg = mean(treatment_years),
sd = sd(treatment_years))
This code runs, and gets me nearly there. It's filtering out the unwanted IDs and making a mean (and SD) calculation for a defined interval of time for each person.
But it's not quite correct: in lubridate::time_length, it isn't specifying the condition "max service date where treatment==1" minus "min service date where treatment==1". (The "where treatment==1" parts are what's missing, and needed.)
How do I get it to do that?
I've tried something like this but it just throws an error:
d %>%
group_by(ID) %>%
filter(sum(treatment) >1) %>%
mutate(treatment_years = lubridate::time_length(max(service_date) & treatment==1 - min(service_date) & treatment==1, unit = "year")) %>%
ungroup() %>%
summarise(avg = mean(treatment_years),
sd = sd(treatment_years))

We may subset the service_date with a logical vector treatment == 1 i.e. service_date[treatment == 1] (assuming there is at least one 'treatment' level 1)
library(dplyr)
library(lubridate)
d %>%
group_by(ID) %>%
filter(sum(treatment) >1) %>%
summarise(treatment_days = lubridate::time_length(max(service_date[treatment == 1]) -
min(service_date[treatment == 1]), unit = "day"), .groups = 'drop') %>%
summarise(avg = mean(treatment_days),
sd = sd(treatment_days))
-output
# A tibble: 1 × 2
avg sd
<dbl> <dbl>
1 532 89.1

An option using by and just subtracting the treated dates.
by(d, d$ID, \(x) {
if (all(x$treatment == 0)) NA_real_
else diff(x$service_date[x$treatment == 1]) |> as.numeric()
}) |> unlist() |> {\(x) c(mean=mean(x, na.rm=TRUE), sd=sd(x, na.rm=TRUE))}()
# mean sd
# 532.00000 89.09545
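One caveat, as far as I can tell: diff() returns every consecutive gap, so for an ID with three or more treated rows it would contribute several gaps instead of one first-to-last span. A minimal base-R sketch that uses range() instead, assuming the same d as in the question (the names treated, by_id and spans are illustrative only):

```r
d <- data.frame(
  ID = c("a","a","a","a","a","a","a","b","b","c","c"),
  treatment = c(0,1,0,0,0,1,0,1,0,1,1),
  service_date = as.Date(c("2011-01-01", "2011-08-21", "2011-12-23",
                           "2012-02-23", "2013-09-14", "2013-04-07",
                           "2014-10-14", "2013-01-01", "2013-12-12",
                           "2014-06-17", "2015-09-29")),
  stringsAsFactors = FALSE
)

# Keep only treated rows, then split their dates by ID
treated <- d[d$treatment == 1, ]
by_id <- split(treated$service_date, treated$ID)

# Drop IDs with a single treated row (like b)
by_id <- by_id[lengths(by_id) > 1]

# First-to-last span in days, regardless of how many treated rows there are
spans <- sapply(by_id, function(x) as.numeric(diff(range(x))))
print(c(mean = mean(spans), sd = sd(spans)))
```

With only two treated rows per ID, as in this data, both approaches should agree.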

Related

Calculating average rle$lengths over grouped data

I would like to calculate duration of state using rle() on grouped data. Here is test data frame:
DF <- read.table(text="Time,x,y,sugar,state,ID
0,31,21,0.2,0,L0
1,31,21,0.65,0,L0
2,31,21,1.0,0,L0
3,31,21,1.5,1,L0
4,31,21,1.91,1,L0
5,31,21,2.3,1,L0
6,31,21,2.75,0,L0
7,31,21,3.14,0,L0
8,31,22,3.0,2,L0
9,31,22,3.47,1,L0
10,31,22,3.930,0,L0
0,37,1,0.2,0,L1
1,37,1,0.65,0,L1
2,37,1,1.089,0,L1
3,37,1,1.5198,0,L1
4,36,1,1.4197,2,L1
5,36,1,1.869,0,L1
6,36,1,2.3096,0,L1
7,36,1,2.738,0,L1
8,36,1,3.16,0,L1
9,36,1,3.5703,0,L1
10,36,1,3.970,0,L1
", header = TRUE, sep =",")
I want to know the average length for state == 1, grouped by ID. I have created a function inspired by: https://www.reddit.com/r/rstats/comments/brpzo9/tidyverse_groupby_and_rle/
to calculate the rle average portion:
rle_mean_lengths = function(x, value) {
r = rle(x)
cond = r$values == value
data.frame(count = sum(cond), avg_length = mean(r$lengths[cond]))
}
And then I add in the grouping aspect:
DF %>% group_by(ID) %>% do(rle_mean_lengths(DF$state,1))
However, the values that are generated are incorrect:
ID count avg_length
1 L0 2 2
2 L1 2 2
L0 is correct, L1 has no instances of state == 1 so the average should be zero or NA.
I isolated the problem in terms of breaking it down into just summarize:
DF %>% group_by(ID) %>% summarize_at(vars(state),list(name=mean)) # This works but if I use summarize it gives me weird values again.
How do I do the equivalent summarize_at() for do()? Or is there another fix? Thanks
As it is a data.frame column, we may need to unnest afterwards
library(dplyr)
library(tidyr)
DF %>%
group_by(ID) %>%
summarise(new = list(rle_mean_lengths(state, 1)), .groups = "drop") %>%
unnest(new)
Or remove the list and unpack
DF %>%
group_by(ID) %>%
summarise(new = rle_mean_lengths(state, 1), .groups = "drop") %>%
unpack(new)
# A tibble: 2 × 3
ID count avg_length
<chr> <int> <dbl>
1 L0 2 2
2 L1 0 NaN
In the OP's do code, the column should be extracted not from the whole data (DF$state) but from the data coming from the lhs, i.e. the . placeholder. (Note that do is superseded, so it may be better to use summarise with unnest/unpack.)
DF %>%
group_by(ID) %>%
do(rle_mean_lengths(.$state,1))
# A tibble: 2 × 3
# Groups: ID [2]
ID count avg_length
<chr> <int> <dbl>
1 L0 2 2
2 L1 0 NaN
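To see where these numbers come from, here is a small base-R sketch of what rle() returns for each group. The state vectors are copied by hand from DF; the names state_L0, state_L1 and runs_of_1 are just for illustration:

```r
# state column for each ID, copied from the test data
state_L0 <- c(0, 0, 0, 1, 1, 1, 0, 0, 2, 1, 0)
state_L1 <- c(0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0)

# rle() collapses consecutive repeats into (length, value) pairs
r0 <- rle(state_L0)
print(r0$lengths)
print(r0$values)

# Lengths of the runs whose value is 1
runs_of_1 <- r0$lengths[r0$values == 1]
print(mean(runs_of_1))

# L1 has no run of 1s, so the mean of an empty vector is NaN
r1 <- rle(state_L1)
avg_L1 <- mean(r1$lengths[r1$values == 1])
print(avg_L1)
```

Passing DF$state inside do() hands the full 22-row column to every group, which would explain why both IDs showed the same result.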

Calculate cumulative sum after a set period of time

I have a data frame with COVID data and I'm trying to make a column calculating the number of recovered people based off of the number of positive tests.
My data has a location, a date, and the number of tests administered/positive results/negative results each day. Here's a few lines using one location as an example (the real data has several months worth of dates):
loc date tests pos neg active
spot1 2020-04-10 1 1 0 5
spot1 2020-04-11 2 1 1 6
spot1 2020-04-12 0 0 0 6
spot1 2020-04-13 11 1 10 7
I want to make a new column that cumulatively counts each positive test in each location 14 days after it is recorded. On 2020-04-24, the 5 active cases are not active anymore, so I want a recovered column with 5. For each date I want the newly nonactive cases to be added.
My first thought was to try it in a loop:
df1 <- df %>%
mutate(date = as.Date(date)) %>%
group_by(loc) %>%
mutate(rec = for (i in 1:nrow(df)) {
#getting number of new cases
x <- df$pos[i]
#add 14 days to the date
d <- df$date + 14
df$rec <- sum(x)
})
As you can see, I'm not the best at writing for loops. That gives me a bunch of numbers, but they bear very little meaningful relationship to the data.
Also tried it with map_dbl:
df1 <- df %>%
mutate(date = as.Date(date)) %>%
group_by(loc) %>%
mutate(rec = map_dbl(date, ~sum(pos[(date <= . + 14) & date >= .])))
Which resulted in the same number printed on the entire rec column.
Any suggestions? (Sorry for the lengthy explanation, just want to make sure this all makes sense)
Your sample data shows that:
you have continuous dates, even on days with 0 tests (12 April)
the active column already looks like a cumulative sum
Therefore I think you can simply use the lag function with n = 14.
Example code:
df %>% group_by(loc) %>% mutate(recovered = lag(active, 14)) %>% ungroup()
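If instead the recovered count should be built from the daily positives, the same lag idea can be applied to a running total. The following is only a base-R sketch under stated assumptions: pos holds new positives per day, dates are continuous within each location, and the helper shift_by() and the sample values beyond the four rows shown in the question are made up for illustration.

```r
# Hypothetical sample: one location, 20 consecutive days
df <- data.frame(
  loc  = "spot1",
  date = seq(as.Date("2020-04-10"), by = "day", length.out = 20),
  pos  = c(1, 1, 0, 1, rep(0, 16)),
  stringsAsFactors = FALSE
)

# Shift a vector forward by n positions, padding the start with zeros
shift_by <- function(x, n) head(c(rep(0, n), x), length(x))

# Running total of positives per location, delayed by 14 days
df$rec <- ave(df$pos, df$loc, FUN = function(p) shift_by(cumsum(p), 14))
print(df[df$date >= as.Date("2020-04-23"), ])
```

Here the first positive from 2020-04-10 shows up as recovered on 2020-04-24. If dates can be missing, the data would need to be completed first, since this shifts by rows rather than by calendar days.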
You could use aggregate to sum the specific column and then apply cut to set a 14-day time frame for each sum:
df <- data.frame(loc = rep("spot1", 30),
date = seq(as.Date('2020-04-01'), as.Date('2020-04-30'),by = 1),
test = seq(1:30),
positive = seq(1:30),
active = seq(1:30))
output <- aggregate(positive ~ cut(date, "14 days"), df, sum)
output
Console output:
cut(date, "14 days") positive
1 2020-04-01 105
2 2020-04-15 301
3 2020-04-29 59
my solution:
library(dplyr)
date_seq <- seq(as.Date("2020/04/01"), by = "day", length.out = 30)
pos <- rpois(n = 60, lambda = 10)
mydf <-
data.frame(loc = c(rep('loc1', 30), rep('loc2', 30)),
date = date_seq,
pos = pos)
head(mydf)
getPosSum <- function(max, tbl, myloc, daysBack = 14) {
max.Date <- as.Date(max)
sum(tbl %>%
filter(date >= max.Date - (daysBack - 1) &
date <= max.Date & loc == myloc) %>%
select(pos))
}
result <-
mydf %>%
group_by(date, loc) %>%
mutate(rec = getPosSum(max = date, tbl = mydf, myloc = loc))
library(tidyverse)
library(lubridate)
data %>%
mutate(date = as_date(date),
cut = cut(date, '14 days')) %>%
group_by(loc) %>%
arrange(cut) %>%
mutate(cum_pos = accumulate(pos, `+`)) # accumulate(pos, sum) should also work
As a general rule of thumb, avoid loops, especially within mutate: that won't work. Instead of map_dbl you should check out purrr::accumulate. Base R also has specialized functions for this, such as cumsum and cummin; for a plain running sum, cumsum(pos) is equivalent to accumulate(pos, `+`).

R - manipulate last rows depending on group and previous elements

I'm fairly new to R and struggling to find a solution for the following problem:
I have a tibble consisting of 3 columns. The first column holds the IDs of stocks (e.g. ID1, ID2, ...), the second the date of observation, and the third the corresponding return. (ID | Date | Return)
To tidy my dataset I need to delete all zero returns, starting from the end of the sample period, until I reach the first non-zero return.
The following picture further visualises my issue.
[Image: DatasetExample]
In case of the example Dataset depicted above, I need to delete the yellow coloured elements.
Hence, one needs to first group by ID and second iterate over the table from bottom to top until reaching a non zero return.
I already found a way by converting the tibble into a matrix and then looping over each element, but this approach is rather naive and does not perform well on large datasets (more than 2 million observations), which is exactly my case.
Is there a more efficient way to achieve this? Solutions using dplyr would be highly appreciated.
Thanks in advance.
Here is a dplyr solution. I believe it's a bit complicated, but it works.
library(dplyr)
df1 %>%
mutate(Date = as.Date(Date, format = "%d.%m.%Y")) %>%
group_by(ID) %>%
arrange(desc(Date), .by_group = TRUE) %>%
mutate(flag = min(which(Return == 0)),
flag = cumsum(Return != 0 & flag <= row_number())) %>%
filter(flag > 0) %>%
select(-flag) %>%
arrange(Date, .by_group = TRUE)
## A tibble: 7 x 3
## Groups: ID [2]
# ID Date Return
# <int> <date> <dbl>
#1 1 2020-09-20 0.377
#2 1 2020-09-21 0
#3 1 2020-09-22 -1.10
#4 2 2020-09-20 0.721
#5 2 2020-09-21 0
#6 2 2020-09-22 0
#7 2 2020-09-23 1.76
Test data creation code
set.seed(2020)
df1 <- data.frame(ID = rep(1:2, each = 5), Date = Sys.Date() - 5:1, Return = rnorm(10))
df1$Date <- format(df1$Date, "%d.%m.%Y")
df1$Return[sample(1:5, 2)] <- 0
df1$Return[sample(6:10, 2)] <- 0
df1$Return[10] <- 0
There might be a more elegant way but this could work:
split_data <- split(data,data$ID)
split_tidy_data <- lapply(split_data,function(x) x[seq_len(max(which(x$Return != 0))),]) # keep rows up to the last non-zero Return; assumes rows are sorted by Date within each ID
tidy_data <- do.call(rbind,split_tidy_data)
Note: This only works if there is at least 1 "Return" which is not equal 0
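Another way to express "drop trailing zeros per ID" is a reversed cumulative sum: a row is kept if any non-zero return occurs at or after it. The following is a base-R sketch on made-up data (the values and the names df1, keep and tidy are illustrative, not the OP's data), assuming rows are sorted by Date within each ID. An ID whose returns are all zero would be dropped entirely.

```r
# Hypothetical example data: ID 1 ends in two zeros, ID 2 ends in a non-zero return
df1 <- data.frame(
  ID     = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  Date   = as.Date("2020-09-20") + c(0:4, 0:3),
  Return = c(0.4, 0, -1.1, 0, 0, 0.7, 0, 0, 1.8)
)
df1 <- df1[order(df1$ID, df1$Date), ]

# For each ID: count non-zero returns from the end backwards;
# a count above zero means a non-zero return still lies at or after this row
keep <- as.logical(ave(df1$Return != 0, df1$ID,
                       FUN = function(nz) rev(cumsum(rev(nz))) > 0))

tidy <- df1[keep, ]
print(tidy)
```

Zeros in the middle of a series are kept; only the run of zeros at the end of each ID is removed. Because this is vectorised per group, it should scale better than looping over rows, though I have not benchmarked it on millions of observations.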

Summarise but keep length variable (dplyr)

Basic dplyr question... Respondents could select multiple companies that they use. For example:
library(dplyr)
test <- tibble(
CompanyA = rep(c(0:1),5),
CompanyB = rep(c(1),10),
CompanyC = c(1,1,1,1,0,0,1,1,1,1)
)
test
If it were a forced-choice question - i.e., respondents could make only one selection - I would do the following for a basic summary table:
test %>%
summarise_all(funs(sum), na.rm = TRUE) %>%
gather(Response, n) %>%
arrange(desc(n)) %>%
mutate("%" = round(100*n/sum(n)))
Note, however, that the "%" column is not what I want. I'm instead looking for the proportion of total respondents for each individual response option (since they could make multiple selections).
I've tried adding mutate(totalrows = nrow(.)) %>% prior to the summarise_all command. This would allow me to use that variable as the denominator in a later mutate command. However, summarise_all eliminates the "totalrows" var.
Also, if there's a better way to do this, I'm open to ideas.
To get the proportion of respondents who chose an option when that variable is binary, you can take the mean. To do this with your test data, you can use sapply:
sapply(test, mean)
CompanyA CompanyB CompanyC
0.5 1.0 0.8
If you wanted to do this in a more complicated fashion (say your data is not binary encoded, but is stored as 1 and 2 instead), you could do that with the following:
test %>%
gather(key='Company') %>%
group_by(Company) %>%
summarise(proportion = sum(value == 1) / n())
# A tibble: 3 x 2
Company proportion
<chr> <dbl>
1 CompanyA 0.5
2 CompanyB 1
3 CompanyC 0.8
If you put all functions in a list within summarise, then this will work. You'll need to do some quick tidying up after though.
test %>%
summarise_all(
list(
rows = length,
n = function(x){sum(x, na.rm = T)},
perc = function(x){sum(x,na.rm = T)/length(x)}
)) %>%
tidyr::gather(Response, n) %>%
tidyr::separate(Response, c("Company", "Metric"), '_') %>%
tidyr::spread(Metric, n)
And you'll get this
Company n perc rows
<chr> <dbl> <dbl> <dbl>
1 CompanyA 5 0.5 10
2 CompanyB 10 1 10
3 CompanyC 8 0.8 10
Here is a solution using tidyr::gather:
test %>%
gather(Company, response) %>%
group_by(Company) %>%
summarise(`%` = 100 * sum(response) / n())
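For comparison, the same table can be sketched in base R with colSums and colMeans, since the mean of a 0/1 column is the share of respondents who selected it. This assumes the columns are coded 0/1 with no missing values; a plain data.frame stands in for the tibble and the name result is illustrative:

```r
test <- data.frame(
  CompanyA = rep(c(0, 1), 5),
  CompanyB = rep(1, 10),
  CompanyC = c(1, 1, 1, 1, 0, 0, 1, 1, 1, 1)
)

# n = number of respondents selecting each company,
# pct = share of all respondents (not of all selections)
result <- data.frame(
  Response = names(test),
  n        = unname(colSums(test)),
  pct      = unname(100 * colMeans(test))
)

# Sort by count, largest first
result <- result[order(-result$n), ]
print(result)
```

The percentages do not sum to 100 because respondents could pick several companies.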

R: sum row based on several conditions

I am working on my thesis with little knowledge of R, so the answer to this question may be pretty obvious.
I have a dataset looking like this:
county<-c('1001','1001','1001','1202','1202','1303','1303')
naics<-c('423620','423630','423720','423620','423720','423550','423720')
employment<-c(5,6,5,5,5,6,5)
data<-data.frame(county,naics,employment)
For every county, I want to sum the value of employment of rows with naics '423620' and '423720'. (So two conditions: 1. same county code 2. those two naics codes) The row in which they are added should be the first one ('423620'), and the second one ('423720') should be removed
The final dataset should look like this:
county2<-c('1001','1001','1202','1303','1303')
naics2<-c('423620','423630','423620','423550','423720')
employment2<-c(10,6,10,6,5)
data2<-data.frame(county2,naics2,employment2)
I have tried to do it myself with aggregate and rowSum, but because of the two conditions, I have failed thus far. Thank you very much.
We can do
library(dplyr)
data$naics <- as.character(data$naics)
data %>%
filter(naics %in% c(423620, 423720)) %>% group_by(county) %>%
summarise(naics = "423620", employment = sum(employment)) %>%
bind_rows(., filter(data, !naics %in% c(423620, 423720)))
# A tibble: 5 x 3
# county naics employment
# <fctr> <chr> <dbl>
#1 1001 423620 10
#2 1202 423620 10
#3 1303 423620 5
#4 1001 423630 6
#5 1303 423550 6
With such a condition, I'd first write a small helper and then pass it on to dplyr mutate:
# replace 423720 by 423620 only if both exist
onlyThoseNAICS <- function(v){
if( ("423620" %in% v) && ("423720" %in% v) ) v[v == "423720"] <- "423620"
v
}
data %>%
dplyr::group_by(county) %>%
dplyr::mutate(naics = onlyThoseNAICS(naics)) %>%
dplyr::group_by(county, naics) %>%
dplyr::summarise(employment = sum(employment)) %>%
dplyr::ungroup()
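Since the question mentions aggregate, here is a base-R sketch of the same idea: recode 423720 to 423620 only in counties that have both codes, then sum by county and code. The helper name merge_codes and the object names recoded and res are made up for illustration.

```r
data <- data.frame(
  county     = c('1001','1001','1001','1202','1202','1303','1303'),
  naics      = c('423620','423630','423720','423620','423720','423550','423720'),
  employment = c(5, 6, 5, 5, 5, 6, 5),
  stringsAsFactors = FALSE
)

# Replace 423720 by 423620 only when a county contains both codes
merge_codes <- function(v) {
  if (("423620" %in% v) && ("423720" %in% v)) v[v == "423720"] <- "423620"
  v
}

recoded <- data
recoded$naics <- ave(data$naics, data$county, FUN = merge_codes)

# Sum employment within each county/naics combination
res <- aggregate(employment ~ county + naics, data = recoded, FUN = sum)
res <- res[order(res$county, res$naics), ]
print(res)
```

County 1303 keeps its 423720 row because it has no 423620 row to merge into, which matches the desired data2 in the question.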

Resources