My idea is to count observations (grouped by id) within 30-day windows. My problem is that I want to introduce an exception in the counting process: if, during the 30 days analyzed, there is an observation that will be discarded (because n > 1), the count should be built only from the observations that were not discarded. (n is the variable that counts the number of observations within each 30-day window.)
Example
id date
1 1/1/2021
1 22/1/2021
1 1/2/2021
Code:
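library(dplyr)
library(lubridate)
test <- test %>%
  group_by(id) %>%
  # for each row x, count the rows up to x whose date falls in the 30 days before date[x]
  mutate(n = sapply(seq_along(date),
    function(x) sum(between(date[1:x], date[x] - days(30), date[x]))))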
id date n
1 1/1/2021 1
1 22/1/2021 2
1 1/2/2021 2
1 3/3/2021 2
1 2/2/2021 3
1 7/7/2021 1
Expected result:
id date n nexpected
1 1/1/2021 1 1
1 22/1/2021 2 2
1 1/2/2021 2 1
1 3/3/2021 2 2
1 2/2/2021 3 1
1 7/7/2021 1 1
Alternative explanation
I just want to keep one observation (grouped by id) for every 30 days. I want to do this by creating a variable that tells me which observations pass the filter (1) and which ones do not (0).
Not sure this is what you want, but lubridate::floor_date is often useful in those situations:
library(tidyverse)
library(lubridate)
test %>%
mutate(date = dmy(date)) %>%
group_by(id, floor = floor_date(date, 'month')) %>%
mutate(n = row_number())
id date floor n
<int> <date> <date> <int>
1 1 2021-01-01 2021-01-01 1
2 1 2021-01-22 2021-01-01 2
3 1 2021-02-01 2021-02-01 1
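If the 30 days should instead be counted from the last observation that was kept (rather than by calendar month), one possible sketch is a small loop per id. The helper name flag_keep, the keep column, and the rule that an observation is kept only when it falls more than 30 days after the last kept one are assumptions made for illustration; they are not part of the question.

```r
library(dplyr)
library(lubridate)

# Hypothetical helper: 1 = observation is kept, 0 = observation is discarded.
# Assumption: a date is kept only if it is more than `window` days after
# the most recently kept date.
flag_keep <- function(dates, window = 30) {
  keep <- integer(length(dates))
  last_kept <- as.Date(NA)
  for (i in order(dates)) {
    if (is.na(last_kept) || as.numeric(dates[i] - last_kept) > window) {
      keep[i] <- 1L
      last_kept <- dates[i]
    }
  }
  keep
}

test <- tibble(
  id = 1,
  date = dmy(c("1/1/2021", "22/1/2021", "1/2/2021",
               "2/2/2021", "3/3/2021", "7/7/2021"))
)

flagged <- test %>%
  group_by(id) %>%
  mutate(keep = flag_keep(date)) %>%
  ungroup()

flagged
```

Because the loop walks the dates in sorted order, the rows do not need to be arranged beforehand. Whether the boundary should be "more than 30 days" or "30 days or more" depends on the intended rule.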
Related
Thank you, experts for previous answers (How to filter by range of dates in R?)
I am still having some problems dealing with the data.
Example:
id q date
a 1 01/01/2021
a 1 01/01/2021
a 1 21/01/2021
a 1 21/01/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021
My idea is to eliminate the observations that have more than 3 "units" in a period of 30 days. That is, if "a" has a unit "q" on "12/02/2021" [dd/mm/yyyy]: (a) if there are already 3 observations between 12/01/2021 and 12/02/2021, it must be deleted; (b) if there are fewer than 3, it must remain.
My expected result is:
id q date
a 1 01/01/2021
a 1 01/01/2021
a 1 21/01/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021
With this code:
df <- df %>%
  mutate(day = dmy(date)) %>%           # parse the dd/mm/yyyy strings
  group_by(id) %>%
  arrange(day, .by_group = TRUE) %>%
  mutate(diff = day - first(day)) %>%   # days since the first observation of the group
  mutate(row = row_number()) %>%
  filter(row <= 3 | !diff < 30)
But the result is:
P Q DATE DAY DIFF ROW
a 1 1/1/2021 1/1/2021 0 1
a 1 1/1/2021 1/1/2021 0 2
a 1 21/1/2021 21/1/2021 20 3
a 1 12/2/2021 12/2/2021 42 5
a 1 12/2/2021 12/2/2021 42 6
a 1 12/2/2021 12/2/2021 42 7
a 1 12/2/2021 12/2/2021 42 8
The main problem is that the diff variable must count days in periods of 30 days from the last day of the previous 30-days period - not since the first observation day.
Any help? Thanks
Using floor_date it is quite straightforward:
library(lubridate)
library(dplyr)
df %>%
group_by(id, floor = floor_date(date, '30 days')) %>%
slice_head(n = 3) %>%
ungroup() %>%
select(-floor)
# A tibble: 6 x 3
id q date
<chr> <int> <date>
1 a 1 2021-01-01
2 a 1 2021-01-01
3 a 1 2021-01-21
4 a 1 2021-02-12
5 a 1 2021-02-12
6 a 1 2021-02-12
data
df <- read.table(header = T, text = "id q date
a 1 01/01/2021
a 1 01/01/2021
a 1 21/01/2021
a 1 21/01/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021
a 1 12/02/2021")
df$date<-as.Date(df$date, format = "%d/%m/%Y")
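If the 30-day periods should start at the first observation and restart at the first observation after each period ends (as the question describes), floor_date may not line up with those boundaries. A possible sketch follows; the helper period_id and the period column are hypothetical names, and the rule that a new period begins once a date is 30 or more days past the current period start is an assumption.

```r
library(dplyr)

# Hypothetical helper: label each date with the 30-day period it falls in.
# Assumes `dates` is sorted in ascending order.
period_id <- function(dates, window = 30) {
  id <- integer(length(dates))
  if (length(dates) == 0) return(id)
  start <- dates[1]
  current <- 1L
  for (i in seq_along(dates)) {
    if (as.numeric(dates[i] - start) >= window) {
      start <- dates[i]        # this date opens a new period
      current <- current + 1L
    }
    id[i] <- current
  }
  id
}

df <- data.frame(
  id = "a", q = 1L,
  date = as.Date(c("01/01/2021", "01/01/2021", "21/01/2021", "21/01/2021",
                   "12/02/2021", "12/02/2021", "12/02/2021", "12/02/2021"),
                 format = "%d/%m/%Y")
)

result <- df %>%
  arrange(id, date) %>%
  group_by(id) %>%
  mutate(period = period_id(date)) %>%
  group_by(id, period) %>%
  slice_head(n = 3) %>%       # keep at most 3 rows per id and period
  ungroup() %>%
  select(-period)

result
```

On the sample data this keeps the two rows from 01/01/2021, one row from 21/01/2021, and three rows from 12/02/2021.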
In R, I need to find which treatments are occurring concurrently and work out what the dose for that day would be. I need to do this by patient, so presumably using a group_by statement in dplyr.
user_id treatment dosage treatment_start treatment_end
1 1 3 01/28/2019 07/30/2019
1 1 2 05/26/2019 11/25/2019
1 2 1 08/13/2019 02/12/2020
1 1 2 12/06/2019 04/07/2020
1 2 1 12/09/2019 06/10/2020
Ideally the final form of it will be the user id, the treatments they're on, the sum of the dosage of all treatments, and the dates that they're on all of those treatments. I've made an example results table with a few rows below.
user_id treatments total_dosage treatment_start treatment_end
1 1 3 01/28/2019 05/25/2019
1 1 5 05/26/2019 07/30/2019
1 1 2 07/31/2019 08/12/2019
1 1,2 3 08/13/2019 11/25/2019
I worked out how to find if an event overlaps with other events but it doesn't get the resulting dates, and doesn't sum the dosages so I don't know if it's usable. In this case, course is a combination of the treatment and dosage column.
DF %>% group_by(user_id ) %>%
mutate(overlap = purrr::map2_chr(treatment_start, treatment_end,
~toString(course[.x >= treatment_start & .x < treatment_end| .y > treatment_start & .y < treatment_end]))) %>%
ungroup()
This is an interesting question. One way is to expand the dataframe to have one row for each day, and then summarise the data by date:
library(tidyverse)
library(lubridate)
dat %>%
# Convert dates to date format
mutate(across(treatment_start:treatment_end, ~ mdy(.x))) %>%
# Expand the dataframe
group_by(user_id, treatment_start, treatment_end) %>%
mutate(date = list(seq(treatment_start, treatment_end, by = "day"))) %>%
unnest(date) %>%
# Summarise by day
group_by(user_id, date) %>%
summarise(dosage = sum(dosage),
treatment = toString(unique(treatment))) %>%
# Summarise by different dosage (and create periods)
group_by(user_id, treatment, dosage) %>%
summarise(treatment_start = min(date),
treatment_ends = max(date)) %>%
arrange(treatment_start)
output:
user_id treatment dosage treatment_start treatment_ends
<int> <chr> <int> <date> <date>
1 1 1 3 2019-01-28 2019-05-25
2 1 1 5 2019-05-26 2019-07-30
3 1 1 2 2019-07-31 2019-08-12
4 1 1, 2 3 2019-08-13 2020-04-07
5 1 2 1 2019-11-26 2020-06-10
6 1 2, 1 3 2019-12-06 2019-12-08
7 1 2, 1 4 2019-12-09 2020-02-12
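Note that in the output above some periods overlap (row 4 runs until 2020-04-07 while rows 5 to 7 start earlier), because days with the same treatment and dosage are pooled even when they are not consecutive. One way to get non-overlapping periods is to add a run identifier that increases whenever the treatment set or dosage changes or a day is skipped. The following is only a sketch: the column name run, the reconstruction of dat from the question's table, and the sort() that makes the treatment label order-independent are my assumptions.

```r
library(dplyr)
library(tidyr)
library(lubridate)

dat <- tibble(
  user_id = 1L,
  treatment = c(1L, 1L, 2L, 1L, 2L),
  dosage = c(3L, 2L, 1L, 2L, 1L),
  treatment_start = mdy(c("01/28/2019", "05/26/2019", "08/13/2019",
                          "12/06/2019", "12/09/2019")),
  treatment_end = mdy(c("07/30/2019", "11/25/2019", "02/12/2020",
                        "04/07/2020", "06/10/2020"))
)

# One row per user and day, with the summed dosage and the treatments in use
daily <- dat %>%
  rowwise() %>%
  mutate(date = list(seq(treatment_start, treatment_end, by = "day"))) %>%
  ungroup() %>%
  unnest(date) %>%
  group_by(user_id, date) %>%
  summarise(total_dosage = sum(dosage),
            treatments = toString(sort(unique(treatment))),
            .groups = "drop")

# Start a new run when the treatments or dosage change, or a day is skipped
periods <- daily %>%
  arrange(user_id, date) %>%
  group_by(user_id) %>%
  mutate(run = cumsum(
    treatments != lag(treatments, default = first(treatments)) |
      total_dosage != lag(total_dosage, default = first(total_dosage)) |
      as.numeric(date - lag(date, default = first(date))) > 1
  )) %>%
  group_by(user_id, run, treatments, total_dosage) %>%
  summarise(treatment_start = min(date),
            treatment_end = max(date),
            .groups = "drop") %>%
  arrange(user_id, treatment_start) %>%
  select(-run)

periods
```

Expanding to one row per day is simple but can become large for long treatments or many users; an interval-based approach would scale better.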
If I had:
person_ID visit date
1 2/25/2001
1 2/27/2001
1 4/2/2001
2 3/18/2004
3 9/22/2004
3 10/27/2004
3 5/15/2008
and I wanted another column to indicate the earliest recurring observation within 90 days, grouped by patient ID, with the desired output:
person_ID visit date date
1 2/25/2001 2/27/2001
1 2/27/2001 4/2/2001
1 4/2/2001 NA
2 3/18/2004 NA
3 9/22/2004 10/27/2004
3 10/27/2004 NA
3 5/15/2008 NA
Thank you!
We convert 'visit_date' to Date class and group by 'person_ID'. We then create a binary column that is 1 if the difference between the next and the current 'visit_date' is less than 90 days and 0 otherwise. Using this column, we take the corresponding next 'visit_date' where the value is 1:
library(dplyr)
library(lubridate)
library(tidyr)
df1 %>%
mutate(visit_date = mdy(visit_date)) %>%
group_by(person_ID) %>%
mutate(i1 = replace_na(+(difftime(lead(visit_date),
visit_date, units = 'day') < 90), 0),
date = case_when(as.logical(i1)~ lead(visit_date)), i1 = NULL ) %>%
ungroup
-output
# A tibble: 7 x 3
# person_ID visit_date date
# <int> <date> <date>
#1 1 2001-02-25 2001-02-27
#2 1 2001-02-27 2001-04-02
#3 1 2001-04-02 NA
#4 2 2004-03-18 NA
#5 3 2004-09-22 2004-10-27
#6 3 2004-10-27 NA
#7 3 2008-05-15 NA
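The same idea can be written without the intermediate binary column. This is a sketch only: df1 is rebuilt here from the question's table, and the helper column next_visit is a name introduced for illustration.

```r
library(dplyr)
library(lubridate)

df1 <- tibble(
  person_ID = c(1L, 1L, 1L, 2L, 3L, 3L, 3L),
  visit_date = mdy(c("2/25/2001", "2/27/2001", "4/2/2001", "3/18/2004",
                     "9/22/2004", "10/27/2004", "5/15/2008"))
)

res <- df1 %>%
  group_by(person_ID) %>%
  mutate(
    next_visit = lead(visit_date),   # NA for the last visit of each person
    # keep the next visit only if it exists and is less than 90 days away
    date = if_else(!is.na(next_visit) &
                     as.numeric(next_visit - visit_date) < 90,
                   next_visit, as.Date(NA))
  ) %>%
  ungroup() %>%
  select(-next_visit)

res
```

This assumes the visits are already sorted by date within each person; otherwise add arrange(person_ID, visit_date) first.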
I am currently working with a dataset containing sensor data, and I wish to get some summary statistics. More precisely, I wish to get the number of visits and the total occupancy length. A visit is counted when a timestamp with value 1 is followed by several 0 values spanning X minutes.
my data looks like this
SensorId timestamp value
1 10:10:10 1
1 10:12:10 1
1 10:14:00 1
1 10:16:00 0
1 10:18:00 0
1 10:20:00 0
2 13:10:10 1
2 13:12:10 1
2 13:14:00 1
2 13:20:00 1
2 13:22:00 0
this is my desired result:
SensorId total time in use Number of visits
1 4 1
2 10 1
there are quite a lot of rows, so I wish for the total time in use, and number of visits to update each time.
We can convert timestamp to POSIXct class, arrange the rows, group them by SensorId and by runs of consecutive identical values, and subtract the first timestamp of each run from the last one.
library(dplyr)
df %>%
mutate(timestamp = as.POSIXct(timestamp, format = "%T")) %>%
arrange(SensorId, timestamp) %>%
group_by(SensorId, grp = data.table::rleid(value)) %>%
summarise(total_time = round(last(timestamp) - first(timestamp)),
number_of_visit = first(value)) %>%
filter(number_of_visit == 1) %>%
select(-grp)
# SensorId total_time number_of_visit
# <int> <drtn> <int>
#1 1 4 mins 1
#2 2 10 mins 1
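The result above has one row per run of 1s. To get one row per sensor, with the total time summed and the visits counted, the runs can be aggregated once more. The sketch below builds the run id with dplyr only; the column names run, minutes, total_minutes, and visits are assumptions, and the "X minutes of 0 values" threshold from the question is not implemented (every run of 1s counts as a visit).

```r
library(dplyr)

df <- tibble(
  SensorId = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
  timestamp = c("10:10:10", "10:12:10", "10:14:00", "10:16:00", "10:18:00",
                "10:20:00", "13:10:10", "13:12:10", "13:14:00", "13:20:00",
                "13:22:00"),
  value = c(1L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L)
)

# One row per run of consecutive identical values within each sensor
runs <- df %>%
  mutate(timestamp = as.POSIXct(timestamp, format = "%H:%M:%S", tz = "UTC")) %>%
  arrange(SensorId, timestamp) %>%
  group_by(SensorId) %>%
  mutate(run = cumsum(value != lag(value, default = first(value)))) %>%
  group_by(SensorId, run) %>%
  summarise(value = first(value),
            minutes = as.numeric(difftime(last(timestamp), first(timestamp),
                                          units = "mins")),
            .groups = "drop")

# Aggregate the runs of 1s per sensor
per_sensor <- runs %>%
  filter(value == 1) %>%
  group_by(SensorId) %>%
  summarise(total_minutes = sum(minutes),
            visits = n())

per_sensor
```

The totals are measured from the first to the last timestamp with value 1, so they come out slightly under 4 and 10 minutes before rounding.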
I would like to get the time elapsed between events in my dataframe, for each grouping of data by the ID. The dates I want to use are in their own columns. I have done the following already using dplyr:
Grouped my data by the ID
Ordered by the ID
This is how the data looks. I would like the output to be the time_diff column. Any help would be very much appreciated!
ID: Status: Start-time: End-time: time-diff:
1 Active 01/01/2018 NA 0
1 Complete NA 01/02/2018 1
2 Active 03/02/2018 0
2 Active NA 0
2 Complete NA 03/06/2018 4
Taking the time difference between a time and an NA value will just return NA. A more meaningful approach would be to take the individual time difference of each event, and then summarize over each group (id).
library(dplyr)
library(lubridate)
d <- tibble(id = c(1,1,2,2),
st = ymd(c("2019-05-03", "2019-02-06", "2019-07-11","2019-05-13")),
et = ymd(c("2019-05-10", "2019-02-16", "2019-07-04","2019-05-09")))
d2 <- d %>%
mutate(td = et - st, # calculate the time difference (td)
atd = abs(td)) # calculate the absolute td (atd)
d2
# A tibble: 4 x 5
id st et td atd
<dbl> <date> <date> <time> <time>
1 1 2019-05-03 2019-05-10 7 days 7 days
2 1 2019-02-06 2019-02-16 10 days 10 days
3 2 2019-07-11 2019-07-04 -7 days 7 days
4 2 2019-05-13 2019-05-09 -4 days 4 days
Then you can take the mean of the absolute differences for example and get:
d2 %>%
group_by(id) %>% # for each group (id)
summarise(mtd = mean(atd)) # calculate the mean time difference (mtd)
# A tibble: 2 x 2
id mtd
<dbl> <time>
1 1 8.5 days
2 2 5.5 days
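If the start and end times sit on different rows, as in the question, another option is to collapse each ID to its earliest start and latest end while ignoring the NA values. This is a sketch under assumptions: the dates are read as month/day/year (which matches the time-diff values 1 and 4 in the question), and the names events, start, and end are made up for the example.

```r
library(dplyr)

events <- tibble(
  ID = c(1, 1, 2, 2, 2),
  Status = c("Active", "Complete", "Active", "Active", "Complete"),
  start = as.Date(c("2018-01-01", NA, "2018-03-02", NA, NA)),
  end = as.Date(c(NA, "2018-01-02", NA, NA, "2018-03-06"))
)

elapsed <- events %>%
  group_by(ID) %>%
  summarise(
    # days between the earliest start and the latest end, skipping NAs
    time_diff = as.numeric(max(end, na.rm = TRUE) - min(start, na.rm = TRUE))
  )

elapsed
```

An ID with no start or no end at all would need separate handling, since min() and max() warn when every value is NA.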