Calculate overlapping months between two start and end dates in R

I have a dataframe with two pairs of start and stop dates that looks like this:

ID  G1_START   G1_END     G2_START   G2_END     LOCATION
1   1/1/2021   5/31/2021  2/1/2021   5/31/2021  A
2   12/1/2020  3/31/2021  10/1/2020  5/31/2021  B
What I would like to do is create one row per month per patient, for the months where the two date ranges overlap. For example:

ID  MONTH      ACTIVE  LOCATION
1   2/1/2021   1       A
1   3/1/2021   1       A
1   4/1/2021   1       A
1   5/1/2021   1       A
2   12/1/2020  1       B
2   1/1/2021   1       B
2   2/1/2021   1       B
2   3/1/2021   1       B

Where ACTIVE means the ID was on both G1 and G2 during that month.

Here is a tidyverse method:
- Reshape the data from wide to long format with pivot_longer
- Convert the date columns 'START' and 'END' to Date class (mdy)
- Loop over 'START' and 'END' with map2 and build a sequence of dates by '1 month'
- Floor each date to the month with floor_date
- Grouped by ID, LOCATION, and MONTH, keep the groups where 'Categ' has 2 distinct elements
- Create the 'ACTIVE' column of 1 after returning the distinct rows
library(dplyr)
library(tidyr)
library(lubridate)
library(purrr)

pivot_longer(df1, cols = contains("_"),
             names_to = c("Categ", ".value"), names_sep = "_") %>%
  transmute(ID, LOCATION, Categ,
            MONTH = map2(mdy(START), mdy(END),
                         ~ floor_date(seq(.x, .y, by = '1 month'), 'month'))) %>%
  unnest(MONTH) %>%
  group_by(ID, LOCATION, MONTH) %>%
  filter(n_distinct(Categ) == 2) %>%
  ungroup() %>%
  distinct(ID, LOCATION, MONTH) %>%
  mutate(ACTIVE = 1) %>%
  select(ID, MONTH, ACTIVE, LOCATION)
Output:

# A tibble: 8 x 4
     ID MONTH      ACTIVE LOCATION
  <int> <date>      <dbl> <chr>
1     1 2021-02-01      1 A
2     1 2021-03-01      1 A
3     1 2021-04-01      1 A
4     1 2021-05-01      1 A
5     2 2020-12-01      1 B
6     2 2021-01-01      1 B
7     2 2021-02-01      1 B
8     2 2021-03-01      1 B
Data:
df1 <- structure(list(ID = 1:2, G1_START = c("1/1/2021", "12/1/2020"
), G1_END = c("5/31/2021", "3/31/2021"), G2_START = c("2/1/2021",
"10/1/2020"), G2_END = c("5/31/2021", "5/31/2021"), LOCATION = c("A",
"B")), class = "data.frame", row.names = c(NA, -2L))

Related

R Count monthly frequency of occurrence between date range

x <- data.frame(ID = c(1, 2, 3, 4),
                Line_name = c("AB", "CD", "AB", "CD"),
                start_dt = c("12/1/2020", "2/1/2021", "2/1/2021", "3/1/2021"),
                end_dt = c("4/1/2021", "4/1/2021", "3/1/2021", "4/1/2021"))
  ID Line_name  start_dt   end_dt
1  1        AB 12/1/2020 4/1/2021
2  2        CD  2/1/2021 4/1/2021
3  3        AB  2/1/2021 3/1/2021
4  4        CD  3/1/2021 4/1/2021
I have a dataframe that looks like this. It has items that are used within a date range (start date to end date). I need to count the frequency of use of each item for every month. The resulting output would look something like this:
Line_name Jan2021 Feb2021 Mar2021 Apr2021
1 AB 1 2 2 1
2 CD 0 1 2 2
In Jan, only AB was used. For ID 1, the date range runs from Dec 2020 to April, so that row needs to be counted for every month it covers.
I am not sure how I can do it. I was thinking that for, say, January, I would check whether 1/1/2021 falls between start_dt and end_dt, and count the row if that condition is true:
(date %within% interval(start_dt, end_dt))
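That %within% idea does work as sketched: for each candidate month start, test which rows' intervals contain it and sum per Line_name. A rough base-R-plus-lubridate version (the Jan-Apr month grid here is an assumption taken from the desired output):

```r
library(lubridate)

x <- data.frame(ID = c(1, 2, 3, 4),
                Line_name = c("AB", "CD", "AB", "CD"),
                start_dt = c("12/1/2020", "2/1/2021", "2/1/2021", "3/1/2021"),
                end_dt = c("4/1/2021", "4/1/2021", "3/1/2021", "4/1/2021"))

months <- seq(as.Date("2021-01-01"), as.Date("2021-04-01"), by = "1 month")
ivl <- interval(mdy(x$start_dt), mdy(x$end_dt))

# one column per month: TRUE where the month start falls inside a row's range
res <- sapply(seq_along(months), function(i)
  tapply(months[i] %within% ivl, x$Line_name, sum))
colnames(res) <- format(months, "%b%Y")
res
#    Jan2021 Feb2021 Mar2021 Apr2021
# AB       1       2       2       1
# CD       0       1       2       2
```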
An option is to get a sequence of dates by month between the 'start_dt' and 'end_dt' columns with map2 into a list, then unnest the list column, get the count, and reshape back from 'long' to 'wide' with pivot_wider.
library(lubridate)
library(dplyr)
library(tidyr)
library(purrr)

x %>%
  transmute(Line_name,
            Year_month = map2(mdy(start_dt), mdy(end_dt),
                              ~ format(seq(.x, .y, by = '1 month'), '%b%Y'))) %>%
  unnest(Year_month) %>%
  count(Line_name,
        Year_month = factor(Year_month, levels = unique(Year_month))) %>%
  pivot_wider(names_from = Year_month, values_from = n, values_fill = 0)
Output:

# A tibble: 2 x 6
  Line_name Dec2020 Jan2021 Feb2021 Mar2021 Apr2021
  <chr>       <int>   <int>   <int>   <int>   <int>
1 AB              1       1       2       2       1
2 CD              0       0       1       2       2

(Note the Dec2020 column: ID 1's range starts in December 2020, which the desired output in the question omits.)
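The same expand-and-count can be sketched in base R with lapply() and table(); note that, like the tidyverse code, this also yields a Dec2020 column, because ID 1's range starts in December 2020:

```r
x <- data.frame(ID = c(1, 2, 3, 4),
                Line_name = c("AB", "CD", "AB", "CD"),
                start_dt = c("12/1/2020", "2/1/2021", "2/1/2021", "3/1/2021"),
                end_dt = c("4/1/2021", "4/1/2021", "3/1/2021", "4/1/2021"))

s <- as.Date(x$start_dt, format = "%m/%d/%Y")
e <- as.Date(x$end_dt, format = "%m/%d/%Y")

# one month-start date per row per covered month
seqs <- lapply(seq_len(nrow(x)), function(i) seq(s[i], e[i], by = "1 month"))

long <- data.frame(Line_name = rep(x$Line_name, lengths(seqs)),
                   Year_month = format(do.call(c, seqs), "%b%Y"))

tab <- table(long$Line_name,
             factor(long$Year_month, levels = unique(long$Year_month)))
tab
#      Dec2020 Jan2021 Feb2021 Mar2021 Apr2021
#   AB       1       1       2       2       1
#   CD       0       0       1       2       2
```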

Getting counts grouped by hour

I would like to get the counts per hour for each type (version1 and version2).
Sample data:
type <- c('version1', 'version1', 'version1', 'version2', 'version2')
startdate <- as.POSIXct(c('2017-11-1 02:11:02.000', '2018-3-25 02:13:02.000',
                          '2019-3-14 03:45:02.000', '2017-3-14 02:55:02.000',
                          '2018-3-14 03:45:02.000'))
df <- data.frame(type, startdate)
df
type startdate
1 version1 2017-11-01 02:11:02
2 version1 2018-03-25 02:13:02
3 version1 2019-03-14 03:45:02
4 version2 2017-03-14 02:55:02
5 version2 2018-03-14 03:45:02
In this df we see that version1 has two counts for 02h and one count for 03h, and version2 has one count each for 02h and 03h.
Desired output:
hour version1 version2
1 00:00 0 0
2 01:00 0 0
3 02:00 2 1
4 03:00 1 1
We can first extract the hour from startdate and count the number of rows for each hour and type, then complete() the missing hours, fill their count with 0, and use pivot_wider to get the data in wide format.
library(dplyr)
library(tidyr)
df %>%
  mutate(hr = lubridate::hour(startdate)) %>%
  count(hr, type) %>%
  complete(type, hr = seq(0, max(hr)), fill = list(n = 0)) %>%
  pivot_wider(names_from = type, values_from = n)
# A tibble: 4 x 3
# hr version1 version2
# <int> <dbl> <dbl>
#1 0 0 0
#2 1 0 0
#3 2 2 1
#4 3 1 1
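The pivotal step is complete(), which adds the missing type/hour combinations before reshaping; in isolation it behaves like this (a minimal sketch on toy counts):

```r
library(tidyr)

counts <- data.frame(type = c("version1", "version1", "version2", "version2"),
                     hr = c(2, 3, 2, 3),
                     n = c(2, 1, 1, 1))

# expand to every type x hour combination from 0 up to max(hr),
# filling the absent counts with 0
filled <- complete(counts, type, hr = seq(0, max(hr)), fill = list(n = 0))
filled   # 8 rows: hours 0-3 for each type, with n = 0 where nothing was observed
```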
Something was wrong with your start date variable, so I rebuilt it with the lubridate package:
library(dplyr)
library(tidyr)

type <- c('version1', 'version1', 'version1', 'version2', 'version2')
startdate <- lubridate::ymd_hms(c('2017-11-1T02:11:02.000', '2018-3-25T02:13:02.000',
                                  '2019-3-14T03:45:02.000', '2017-3-14T02:55:02.000',
                                  '2018-3-14T03:45:02.000'))

tibble(type = type, startdate = startdate) %>%
  count(type, hour = lubridate::hour(startdate)) %>%
  spread(type, n)
# A tibble: 2 x 3
hour version1 version2
<int> <int> <int>
1 2 2 1
2 3 1 1
Base R solution:
# Extract the hour and store it as a vector:
df$hour <- gsub(".* ", "", trunc(df$startdate, units = "hours"))

# Count the number of observations of each type in each hour:
tab <- xtabs(~ hour + type, df)

# Reshape into a dataframe, keeping the hours as a column rather than row names:
df <- data.frame(hour = row.names(tab), as.data.frame.matrix(tab), row.names = NULL)

How to find observations within a certain time range of each other in R

I have a dataset with ID, date, days of life, and medication variables. Each ID has multiple observations indicating different administrations of a certain drug. I want to find UNIQUE meds that were administered within 365 days of each other. A sample of the data frame is as follows:
ID date dayoflife meds
1 2003-11-24 16361 lasiks
1 2003-11-24 16361 vigab
1 2004-01-09 16407 lacos
1 2013-11-25 20015 pheno
1 2013-11-26 20016 vigab
1 2013-11-26 20016 lasiks
2 2008-06-05 24133 pheno
2 2008-04-07 24074 vigab
3 2014-11-25 8458 pheno
3 2014-12-22 8485 pheno
I expect the outcome to be:
ID N
1 3
2 2
3 1
indicating that individual 1 had a max of 3 different types of medications administered within 365 days of each other. I am not sure if it is best to use days of life or the date to get to this expected outcome. Any help is appreciated.
An option would be to convert 'date' to Date class, then, grouped by 'ID', take the absolute difference between 'date' and its lag, check whether it is greater than 365, create a grouping index with cumsum, and get the number of distinct 'meds' per group in summarise.
library(dplyr)

df1 %>%
  mutate(date = as.Date(date)) %>%
  group_by(ID) %>%
  mutate(diffd = abs(as.numeric(difftime(date, lag(date, default = first(date)),
                                         units = 'days')))) %>%
  group_by(grp = cumsum(diffd > 365), .add = TRUE) %>%
  summarise(N = n_distinct(meds)) %>%
  group_by(ID) %>%
  summarise(N = max(N))
# A tibble: 3 x 2
# ID N
# <int> <int>
#1 1 2
#2 2 2
#3 3 1
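The grouping trick above, cumsum(diffd > 365), is worth seeing on its own: every gap of more than 365 days bumps the group index, splitting each ID's rows into "episodes". A minimal sketch on ID 1's dates:

```r
d <- as.Date(c("2003-11-24", "2003-11-24", "2004-01-09",
               "2013-11-25", "2013-11-26", "2013-11-26"))

gap <- c(0, diff(as.numeric(d)))   # days since the previous row
grp <- cumsum(gap > 365)           # a gap of more than 365 days starts a new episode
grp
# [1] 0 0 0 1 1 1
```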
You can try:
library(dplyr)

df %>%
  group_by(ID) %>%
  mutate(date = as.Date(date),
         lag_date = abs(date - lag(date)) <= 365,
         lead_date = abs(date - lead(date)) <= 365) %>%
  mutate_at(vars(lag_date, lead_date), ~ ifelse(., ., NA)) %>%
  filter(coalesce(lag_date, lead_date)) %>%
  summarise(N = n_distinct(meds))
Output:
# A tibble: 3 x 2
ID N
<int> <int>
1 1 2
2 2 2
3 3 1

R dplyr count observations within groups

I have a data frame with yes/no values for different days and hours. For each day, I want to get a total number of hours where I have data, as well as the total number of hours where there is a value of Y.
df <- data.frame(day = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4),
                 hour = c(1, 2, 3, 1, 2, 1, 2, 3, 4, 1),
                 YN = c("Y", "Y", "Y", "Y", "Y", "Y", "N", "N", "N", "N"))

df %>%
  group_by(day) %>%
  summarise(tot.hour = n(),
            totY = WHAT DO I PUT HERE?)
Use a logical comparison, then add it up:
df %>%
  group_by(day) %>%
  dplyr::summarise(tot.hour = n(),
                   totY = sum(YN == 'Y'))
# A tibble: 4 x 3
day tot.hour totY
<dbl> <int> <int>
1 1 3 3
2 2 2 2
3 3 4 1
4 4 1 0
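For comparison, a base R sketch of the same per-day counts, using split() and the same sum(YN == 'Y') idea:

```r
df <- data.frame(day = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4),
                 hour = c(1, 2, 3, 1, 2, 1, 2, 3, 4, 1),
                 YN = c("Y", "Y", "Y", "Y", "Y", "Y", "N", "N", "N", "N"))

res <- do.call(rbind, lapply(split(df, df$day), function(g)
  data.frame(day = g$day[1],
             tot.hour = nrow(g),            # n() per day
             totY = sum(g$YN == "Y"))))     # count of "Y" rows per day
res
```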

How to get date difference from two dataframe in R

I have two dataframes, as shown below:
DF_1
ID Date1
1 12/01/2017
2 15/02/2017
3 18/03/2017
DF_2
ID Date1
1 05/01/2017
1 15/01/2017
1 18/01/2017
2 10/02/2017
2 13/02/2017
2 15/02/2017
3 22/03/2017
For each ID, I want to calculate the difference between the date in DF_1 and the most recent earlier date for the same ID in DF_2.
For example: for ID = 1 the DF_1 date is 12-01-2017, and the most recent earlier date in DF_2 is 05-01-2017, because 15-01 and 18-01 are both later than the DF_1 date.
Required Output:
ID Date1 Count
1 12/01/2017 7
2 15/02/2017 0
3 18/03/2017 -4
The following reproduces your expected output:
library(tidyverse)

df1 <- read.table(text = "ID Date1
1 12/01/2017
2 15/02/2017
3 18/03/2017", header = TRUE) %>%
  mutate(Date1 = as.Date(Date1, format = "%d/%m/%Y"))

df2 <- read.table(text = "ID Date1
1 05/01/2017
1 15/01/2017
1 18/01/2017
2 10/02/2017
2 13/02/2017
2 15/02/2017
3 22/03/2017", header = TRUE) %>%
  mutate(Date1 = as.Date(Date1, format = "%d/%m/%Y"))

left_join(df1, df2, by = "ID") %>%
  mutate(Count = Date1.x - Date1.y) %>%
  group_by(ID) %>%
  # keep the smallest non-negative difference; if all differences are
  # negative, fall back to the smallest absolute difference
  slice(if (all(Count < 0)) which.min(abs(Count))
        else which(Count >= 0)[which.min(Count[Count >= 0])]) %>%
  select(ID, Date1.x, Count)
## A tibble: 3 x 3
## Groups: ID [3]
# ID Date1.x Count
# <int> <date> <time>
#1 1 2017-01-12 7
#2 2 2017-02-15 0
#3 3 2017-03-18 -4
Explanation: calculate the time difference between df1$Date1 and each df2$Date1 for the same ID, group the entries by ID, and keep only the row with the smallest non-negative time difference; if all time differences are negative, report the smallest absolute time difference instead.
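For completeness, this "most recent date at or before" lookup is exactly what a data.table rolling join does; a sketch of that route (frames renamed dt1/dt2 here, and note that ID 3, which has no earlier DF_2 date, comes back NA rather than -4):

```r
library(data.table)

dt1 <- data.table(ID = 1:3,
                  Date1 = as.Date(c("12/01/2017", "15/02/2017", "18/03/2017"),
                                  format = "%d/%m/%Y"))
dt2 <- data.table(ID = c(1, 1, 1, 2, 2, 2, 3),
                  Date1 = as.Date(c("05/01/2017", "15/01/2017", "18/01/2017",
                                    "10/02/2017", "13/02/2017", "15/02/2017",
                                    "22/03/2017"), format = "%d/%m/%Y"))

dt2[, matched := Date1]   # keep a copy of the DF_2 date so it survives the join
setkey(dt2, ID, Date1)

# roll = TRUE: for each dt1 row, match the dt2 row with the
# latest Date1 at or before dt1's Date1
res <- dt2[dt1, roll = TRUE][, .(ID, Date1, Count = as.integer(Date1 - matched))]
res   # Count: 7, 0, NA
```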
I think your last row is wrong, as for ID=3 the df2 value is not before the df1 value. Assuming that is correct, you can do this...
df3 <- df2 %>%
  rename(Date2 = Date1) %>%
  left_join(df1, by = "ID") %>%
  mutate(datediff = as.Date(Date1, format = "%d/%m/%Y") -
           as.Date(Date2, format = "%d/%m/%Y")) %>%
  filter(datediff >= 0) %>%
  group_by(ID) %>%
  summarise(Date1 = first(Date1), Count = min(datediff))
df3
ID Date1 Count
1 1 12/01/2017 7
2 2 15/02/2017 0
