Grouping dates by periods and ID using R

I have a list of events by ID and would like to group them into two-week periods. The two weeks should start whenever the first event occurs for each ID. The grouped event data should look something like the following:
ID Date Group
<dbl> <date> <dbl>
1 2018-01-01 1
1 2018-01-02 1
1 2018-01-02 1
1 2018-02-01 2
1 2018-03-01 3
2 2018-01-01 4
2 2018-04-01 5
dat = structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L), Date = structure(c(17532,
17533, 17533, 17563, 17591, 17532, 17622), class = "Date"), Group = c(1L,
1L, 1L, 2L, 3L, 4L, 5L)), .Names = c("ID", "Date", "Group"), row.names = c(NA,
-7L), class = c("tbl_df", "tbl", "data.frame"))
I was originally thinking of lagging by ID and filtering for events that happen within a two-week period, but there may be many events that correspond to a single two-week period.

You can use cut and seq to assign each date to its two-week window (anchored at each ID's first date), then group_indices to make an increasing index:
dat %>%
group_by(ID) %>%
mutate(g = cut(Date, seq(first(Date), max(Date) + 14, by="2 weeks")) %>% as.character) %>%
ungroup %>%
mutate(g = group_indices(., ID, g))
# A tibble: 7 x 4
ID Date Group g
<int> <date> <int> <int>
1 1 2018-01-01 1 1
2 1 2018-01-02 1 1
3 1 2018-01-02 1 1
4 1 2018-02-01 2 2
5 1 2018-03-01 3 3
6 2 2018-01-01 4 4
7 2 2018-04-01 5 5
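For comparison, the same fixed-window binning can be sketched in base R, assuming the dat from the question; the %/% 14 bins each date by how many complete two-week blocks it falls after that ID's first event:

```r
# Recreate the question's data
dat <- data.frame(
  ID = c(1, 1, 1, 1, 1, 2, 2),
  Date = as.Date(c("2018-01-01", "2018-01-02", "2018-01-02",
                   "2018-02-01", "2018-03-01", "2018-01-01", "2018-04-01"))
)
# For each ID, bin the dates into 14-day windows starting at the ID's first date
bin <- ave(as.numeric(dat$Date), dat$ID,
           FUN = function(x) (x - x[1]) %/% 14)
# Turn the (ID, bin) pairs into a single increasing group index
dat$Group <- match(paste(dat$ID, bin), unique(paste(dat$ID, bin)))
dat$Group
# [1] 1 1 1 2 3 4 5
```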

Get the difference between adjacent 'Date's with difftime, specifying the units as "weeks", check whether the absolute difference is greater than 2, and take the cumulative sum (the abs happens to absorb the jump back in dates at the boundary between IDs in this data; in general you would group by ID first):
dat %>%
mutate(GroupNew = cumsum(abs(difftime(Date, lag(Date,
default = first(Date)), units = "weeks")) > 2) + 1)
# A tibble: 7 x 4
# ID Date Group GroupNew
# <int> <date> <int> <dbl>
#1 1 2018-01-01 1 1
#2 1 2018-01-02 1 1
#3 1 2018-01-02 1 1
#4 1 2018-02-01 2 2
#5 1 2018-03-01 3 3
#6 2 2018-01-01 4 4
#7 2 2018-04-01 5 5
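The same gap-based idea can be written in base R with ave, computing the gap within each ID explicitly rather than relying on abs() at the ID boundary. Note that gap-based runs differ from fixed two-week windows in general (e.g. events at days 0, 13, and 20 form one run by gaps but two fixed windows); they agree on this data:

```r
# Recreate the question's data
dat <- data.frame(
  ID = c(1, 1, 1, 1, 1, 2, 2),
  Date = as.Date(c("2018-01-01", "2018-01-02", "2018-01-02",
                   "2018-02-01", "2018-03-01", "2018-01-01", "2018-04-01"))
)
# Within each ID, start a new run whenever the gap to the previous event
# exceeds 14 days
run <- ave(as.numeric(dat$Date), dat$ID,
           FUN = function(x) cumsum(c(TRUE, diff(x) > 14)))
# Convert the per-ID runs into one increasing index across IDs
dat$Group <- match(paste(dat$ID, run), unique(paste(dat$ID, run)))
dat$Group
# [1] 1 1 1 2 3 4 5
```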

Identify unique values within a multivariable subset

I have data that look like this:
Subject Site Date
1 2 '2020-01-01'
1 2 '2020-01-01'
1 2 '2020-01-02'
2 1 '2020-01-02'
2 1 '2020-01-03'
2 1 '2020-01-03'
And I'd like to create an order variable for unique dates by Subject and Site. i.e.
Want
1
1
2
1
2
2
I define a little wrapper:
rle <- function(x) cumsum(!duplicated(x))
and I notice inconsistent behavior when I supply:
have1 <- unlist(tapply(val$Date, val[, c( 'Site', 'Subject')], rle))
versus
have2 <- unlist(tapply(val$Date, val[, c('Subject', 'Site')], rle))
> have1
[1] 1 1 2 1 2 2
> have2
[1] 1 2 2 1 1 2
Is there any way to ensure that the natural ordering of the dataset is followed regardless of the specific columns supplied to the INDEX argument?
library(dplyr)
val %>%
group_by(Subject, Site) %>%
mutate(Want = match(Date, unique(Date))) %>%
ungroup
-output
# A tibble: 6 × 4
Subject Site Date Want
<int> <int> <chr> <int>
1 1 2 2020-01-01 1
2 1 2 2020-01-01 1
3 1 2 2020-01-02 2
4 2 1 2020-01-02 1
5 2 1 2020-01-03 2
6 2 1 2020-01-03 2
val$Want <- with(val, ave(as.integer(as.Date(Date)), Subject, Site,
FUN = \(x) match(x, unique(x))))
val$Want
[1] 1 1 2 1 2 2
data
val <- structure(list(Subject = c(1L, 1L, 1L, 2L, 2L, 2L), Site = c(2L,
2L, 2L, 1L, 1L, 1L), Date = c("2020-01-01", "2020-01-01", "2020-01-02",
"2020-01-02", "2020-01-03", "2020-01-03")),
class = "data.frame", row.names = c(NA,
-6L))
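As for why the two tapply calls disagree: tapply returns an array ordered by the sorted levels of the INDEX factors, and unlist then flattens that array in column-major order, so swapping the INDEX columns changes which group comes out first. A small demonstration with the val data (using ord_fun instead of rle, to avoid masking base::rle):

```r
val <- data.frame(
  Subject = c(1L, 1L, 1L, 2L, 2L, 2L),
  Site = c(2L, 2L, 2L, 1L, 1L, 1L),
  Date = c("2020-01-01", "2020-01-01", "2020-01-02",
           "2020-01-02", "2020-01-03", "2020-01-03")
)
ord_fun <- function(x) cumsum(!duplicated(x))
# INDEX = (Site, Subject): in column-major order the non-empty cell
# (Site=2, Subject=1) precedes (Site=1, Subject=2), so Subject 1 comes first
have1 <- unlist(tapply(val$Date, val[, c("Site", "Subject")], ord_fun),
                use.names = FALSE)
# INDEX = (Subject, Site): now (Subject=2, Site=1) comes out first
have2 <- unlist(tapply(val$Date, val[, c("Subject", "Site")], ord_fun),
                use.names = FALSE)
have1
# [1] 1 1 2 1 2 2
have2
# [1] 1 2 2 1 1 2
```

This is why grouped approaches that preserve row order, like the dplyr and ave answers above, sidestep the problem entirely.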

how to create aggregate data from repeated data based on a date in R

I have longitudinal patient data in R. I would like to create an aggregate table like Table 2 below from Table 1, so Table 2 would have one row per patient with total counts of consultations before the registration date (column 3 in Table 1) and after it.
Table 1:
patid consultation_date registration_date consultation_count
1 07/07/2016 07/07/2018 1
1 07/07/2019 07/07/2018 1
1 07/07/2020 07/07/2018 1
2 14/08/2016 07/09/2016 1
2 07/05/2015 07/09/2016 1
2 02/12/2016 07/09/2016 1
Table 2:
patid consultation_count_pre_registration consultation_count_post_registration
1 1 2
2 2 1
Similar to akrun in using the tidyverse, but with a slightly different approach:
library(dplyr)
library(tidyr)
library(lubridate)
consultations |>
  # parse the dd/mm/yyyy strings as dates first, so the comparison
  # is chronological rather than alphabetical
  mutate(across(ends_with("date"), dmy)) |>
  mutate(period = ifelse(
    registration_date <= consultation_date,
    "after registration",
    "before registration"
  )) |>
  group_by(patid, period) |>
  summarise(n = n()) |>
  pivot_wider(
    names_from = period,
    values_from = n
  )
# A tibble: 2 x 3
# Groups: patid [2]
# patid `after registration` `before registration`
# <int> <int> <int>
# 1 1 2 1
# 2 2 1 2
Data
consultations <- read.table(text = "patid consultation_date registration_date consultation_count
1 07/07/2016 07/07/2018 1
1 07/07/2019 07/07/2018 1
1 07/07/2020 07/07/2018 1
2 14/08/2016 07/09/2016 1
2 07/05/2015 07/09/2016 1
2 02/12/2016 07/09/2016 1", h=T)
We could convert the date columns to Date class, then group by 'patid' and take the sum of the logical comparisons between 'consultation_date' and 'registration_date':
library(dplyr)
library(lubridate)
df1 %>%
mutate(across(ends_with('date'), dmy)) %>%
group_by(patid) %>%
summarise(
count_pre = sum(consultation_date < registration_date, na.rm = TRUE),
count_post = sum(consultation_date > registration_date, na.rm = TRUE),
.groups = 'drop')
-output
# A tibble: 2 × 3
patid count_pre count_post
<int> <int> <int>
1 1 1 2
2 2 2 1
data
df1 <- structure(list(patid = c(1L, 1L, 1L, 2L, 2L, 2L),
consultation_date = c("07/07/2016",
"07/07/2019", "07/07/2020", "14/08/2016", "07/05/2015", "02/12/2016"
), registration_date = c("07/07/2018", "07/07/2018", "07/07/2018",
"07/09/2016", "07/09/2016", "07/09/2016"), consultation_count = c(1L,
1L, 1L, 1L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-6L))
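The same counts can also be produced in base R with aggregate, assuming the df1 from the dput above:

```r
df1 <- data.frame(
  patid = c(1L, 1L, 1L, 2L, 2L, 2L),
  consultation_date = c("07/07/2016", "07/07/2019", "07/07/2020",
                        "14/08/2016", "07/05/2015", "02/12/2016"),
  registration_date = c("07/07/2018", "07/07/2018", "07/07/2018",
                        "07/09/2016", "07/09/2016", "07/09/2016")
)
# Parse the dd/mm/yyyy strings as dates
df1[2:3] <- lapply(df1[2:3], as.Date, format = "%d/%m/%Y")
# Sum the pre/post logical comparisons per patient
out <- aggregate(cbind(count_pre  = consultation_date < registration_date,
                       count_post = consultation_date > registration_date) ~ patid,
                 data = df1, FUN = sum)
# patid 1: pre = 1, post = 2; patid 2: pre = 2, post = 1
```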

add column to df according to rule

My dataset is:
unit date total
1 2019-04-02 7
1 2020-01-01 5
2 2019-12-01 10
2 2020-01-03 2
3 2019-09-01 3
3 2020-03-03 3
I would like to add a 'category' column that is 'high' for every row of a 'unit' if any value in 'total' for that unit is greater than or equal to 10, and 'low' otherwise:
unit date total category
1 2019-04-02 7 low
1 2020-01-01 5 low
2 2019-12-01 10 high
2 2020-01-03 2 high
3 2019-09-01 3 low
3 2020-03-03 3 low
I have tried many things such as:
df$category <- "low"
for (i in df$unit){
if (rowSums(df$total >= 10) > 0){
df$category <- "high"
}
}
but none worked. Can you please advise?
Try computing the max value in each group and then assigning the category. Here is the code:
library(dplyr)
#Code
dfnew <- df %>%
  group_by(unit) %>%
  mutate(category = ifelse(max(total, na.rm = TRUE) >= 10, 'High', 'Low'))
Output:
# A tibble: 6 x 4
# Groups: unit [3]
unit date total category
<int> <chr> <int> <chr>
1 1 2019-04-02 7 Low
2 1 2020-01-01 5 Low
3 2 2019-12-01 10 High
4 2 2020-01-03 2 High
5 3 2019-09-01 3 Low
6 3 2020-03-03 3 Low
Some data used:
#Data
df <- structure(list(unit = c(1L, 1L, 2L, 2L, 3L, 3L), date = c("2019-04-02",
"2020-01-01", "2019-12-01", "2020-01-03", "2019-09-01", "2020-03-03"
), total = c(7L, 5L, 10L, 2L, 3L, 3L)), class = "data.frame", row.names = c(NA,
-6L))
Does this work:
> library(dplyr)
> df %>% group_by(unit) %>% mutate(category = case_when(max(total) >= 10 ~ 'high', TRUE ~ 'low'))
# A tibble: 6 x 4
# Groups: unit [3]
unit date total category
<dbl> <dttm> <dbl> <chr>
1 1 2019-04-02 00:00:00.000 7 low
2 1 2020-01-01 00:00:00.000 5 low
3 2 2019-12-01 00:00:00.000 10 high
4 2 2020-01-03 00:00:00.000 2 high
5 3 2019-09-01 00:00:00.000 3 low
6 3 2020-03-03 00:00:00.000 3 low
>
One base R option using ave, e.g.,
transform(
df,
category = c("Low","High")[ave(total>=10,unit,FUN = any)+1]
)
which gives
unit date total category
1 1 2019-04-02 7 Low
2 1 2020-01-01 5 Low
3 2 2019-12-01 10 High
4 2 2020-01-03 2 High
5 3 2019-09-01 3 Low
6 3 2020-03-03 3 Low
Data
> dput(df)
structure(list(unit = c(1L, 1L, 2L, 2L, 3L, 3L), date = c("2019-04-02",
"2020-01-01", "2019-12-01", "2020-01-03", "2019-09-01", "2020-03-03"
), total = c(7L, 5L, 10L, 2L, 3L, 3L)), class = "data.frame", row.names = c(NA,
-6L))
For each unit you can check if any value is greater than or equal to 10 and assign the category value accordingly.
library(dplyr)
df %>%
group_by(unit) %>%
mutate(category = if(any(total >= 10)) 'high' else 'low')
# unit date total category
# <int> <chr> <int> <chr>
#1 1 2019-04-02 7 low
#2 1 2020-01-01 5 low
#3 2 2019-12-01 10 high
#4 2 2020-01-03 2 high
#5 3 2019-09-01 3 low
#6 3 2020-03-03 3 low
The same logic can be implemented in base R
df$category <- with(df, ave(total, unit, FUN = function(x)
if(any(x >= 10)) 'high' else 'low'))
and data.table :
library(data.table)
setDT(df)[, category := if(any(total >= 10)) 'high' else 'low', unit]

Trying to find occurrences of an ID that meets sequential conditions in R

I'm trying to return a logical vector based on whether a person meets one set of conditions and ALSO meets another set of conditions later on. I'm using a data frame that looks like so:
Person.Id Year Term
250 1 3
250 1 1
250 2 3
300 1 3
511 2 1
300 1 5
700 2 3
What I want to return is a logical vector that indicates true/false if person ID 250 has year 1 and term 3, AND later has year 2 term 3. So a person that only has year 1 term 3 or year 1 term 5 will return false. Solutions in dplyr preferred! I feel like this is simple and I'm just missing something. I initially tried this code but all it returned was a blank df:
df2 <- df1 %>%
group_by(Person.Id) %>%
filter((year==1 & term==3) & (year==2 & term==3))
Are you looking for something like this?
require(dplyr)
df %>%
  group_by(Person.Id) %>%
  mutate(count = sum((year == 1 & term == 3) | (year == 2 & term == 3))) %>%
  mutate(count2 = count == 2)
# A tibble: 7 x 5
# Groups: Person.Id [4]
Person.Id year term count count2
<int> <int> <int> <int> <lgl>
1 250 1 3 2 TRUE
2 250 1 1 2 TRUE
3 250 2 3 2 TRUE
4 300 1 3 1 FALSE
5 511 2 1 0 FALSE
6 300 1 5 1 FALSE
7 700 2 3 1 FALSE
Maybe this can help:
#Data
Data <- structure(list(Person.Id = c(250L, 250L, 250L, 300L, 511L, 300L,
700L), Year = c(1L, 1L, 2L, 1L, 2L, 1L, 2L), Term = c(3L, 1L,
3L, 3L, 1L, 5L, 3L)), row.names = c(NA, -7L), class = "data.frame")
#Flags
cond1 <- Data$Year==1 & Data$Term==3
cond2 <- Data$Year==2 & Data$Term==3
#Replace
Data$Flag1 <- 0
Data$Flag1[cond1]<-1
Data$Flag2 <- 0
Data$Flag2[cond2]<-1
#Filter
Data %>% group_by(Person.Id) %>% filter(Flag1==1 | Flag2==1)
# A tibble: 4 x 5
# Groups: Person.Id [3]
Person.Id Year Term Flag1 Flag2
<int> <int> <int> <dbl> <dbl>
1 250 1 3 1 0
2 250 2 3 0 1
3 300 1 3 1 0
4 700 2 3 0 1
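Both answers above go through intermediate counts or flags; a base R sketch that returns the requested per-row logical vector directly, reusing the Data from the dput, checks each condition per person with ave(..., FUN = any) and combines them:

```r
Data <- data.frame(
  Person.Id = c(250L, 250L, 250L, 300L, 511L, 300L, 700L),
  Year = c(1L, 1L, 2L, 1L, 2L, 1L, 2L),
  Term = c(3L, 1L, 3L, 3L, 1L, 5L, 3L)
)
# TRUE for every row of a person who has (Year 1, Term 3) somewhere
# AND (Year 2, Term 3) somewhere
meets_both <- with(Data,
  ave(Year == 1 & Term == 3, Person.Id, FUN = any) &
  ave(Year == 2 & Term == 3, Person.Id, FUN = any))
# only the rows of person 250 are TRUE; everyone else is FALSE
```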

subsetting duplicates per individual

dfin <-
STUDY ID CYCLE TIME VALUE
1 1 0 10 50
1 1 0 20 20
1 2 1 20 20
Per study and ID, for those who have duplicate CYCLE == 0 values, remove the row that had the higher TIME.
dfout <-
STUDY ID CYCLE TIME VALUE
1 1 0 10 50
1 2 1 20 20
Using RStudio.
An option is to do a group by 'STUDY', 'ID' and filter out the duplicated 0 values in 'CYCLE'
library(dplyr)
dfin %>%
arrange(STUDY, ID, TIME) %>%
group_by(STUDY, ID) %>%
filter(!(duplicated(CYCLE) & CYCLE == 0))
# A tibble: 2 x 5
# Groups: STUDY, ID [2]
# STUDY ID CYCLE TIME VALUE
# <int> <int> <int> <int> <int>
#1 1 1 0 10 50
#2 1 2 1 20 20
Also, if there are many duplicates for 0 and want to remove only the row where 'TIME' is also max
dfin %>%
group_by(STUDY, ID) %>%
filter(!(TIME == max(TIME) & CYCLE == 0))
Or using base R
dfin1 <- dfin[do.call(order, dfin[c("STUDY", "ID", "TIME")]), ]
dfin1[!(duplicated(dfin1[c("STUDY", "ID", "CYCLE")]) & dfin1$CYCLE == 0), ]
# STUDY ID CYCLE TIME VALUE
#1 1 1 0 10 50
#3 1 2 1 20 20
data
dfin <- structure(list(STUDY = c(1L, 1L, 1L), ID = c(1L, 1L, 2L), CYCLE = c(0L,
0L, 1L), TIME = c(10L, 20L, 20L), VALUE = c(50L, 20L, 20L)),
class = "data.frame", row.names = c(NA,
-3L))
