I have a dataset in long format. Every subject in the dataset was observed five times during the week. One column holds the number of the day of the week on which each observation was supposed to happen (or did happen), and another column holds the actual observation dates; the latter has some missing values. I would like to use the day column to fill the missing values in the date column. Here is a toy dataset:
df <- data.frame(case = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
day = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
date = as.Date(c("2023-01-02", "2023-01-03", NA, NA, "2023-01-06",
NA, "2021-05-11", "2021-05-12", "2021-05-13", NA)))
df
# case day date
# 1 1 2023-01-02
# 1 2 2023-01-03
# 1 3 <NA>
# 1 4 <NA>
# 1 5 2023-01-06
# 2 1 <NA>
# 2 2 2021-05-11
# 2 3 2021-05-12
# 2 4 2021-05-13
# 2 5 <NA>
And here is the desired output:
# case day date
#1 1 1 2023-01-02
#2 1 2 2023-01-03
#3 1 3 2023-01-04
#4 1 4 2023-01-05
#5 1 5 2023-01-06
#6 2 1 2021-05-10
#7 2 2 2021-05-11
#8 2 3 2021-05-12
#9 2 4 2021-05-13
#10 2 5 2021-05-14
Does this work for you? No linear models are used.
library(tidyverse)
df2 <-
df %>%
mutate(
ref_date = case_when(
case == 1 ~ as.Date("2023-01-01"),
case == 2 ~ as.Date("2021-05-09")
),
date2 = as.Date(day, origin = ref_date)
)
Output:
> df2
case day date ref_date date2
1 1 1 2023-01-02 2023-01-01 2023-01-02
2 1 2 2023-01-03 2023-01-01 2023-01-03
3 1 3 <NA> 2023-01-01 2023-01-04
4 1 4 <NA> 2023-01-01 2023-01-05
5 1 5 2023-01-06 2023-01-01 2023-01-06
6 2 1 <NA> 2021-05-09 2021-05-10
7 2 2 2021-05-11 2021-05-09 2021-05-11
8 2 3 2021-05-12 2021-05-09 2021-05-12
9 2 4 2021-05-13 2021-05-09 2021-05-13
10 2 5 <NA> 2021-05-09 2021-05-14
I concede that G.G.'s answer has the advantage that you don't need to hardcode the reference date.
P.S. here is a pure tidyverse solution without any hardcoding:
df2 <-
df %>%
mutate(ref_date = date - day) %>%
group_by(case) %>%
fill(ref_date, .direction = "downup") %>%
ungroup() %>%
mutate(date2 = as.Date(day, origin = ref_date))
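For completeness, the same no-hardcoding idea can be sketched in base R (my own illustration, not part of the original answer): `date - day` is constant within a case, so a group-wise mean that ignores NAs recovers the reference date, and adding `day` back fills every row.

```r
# Base-R sketch: date - day is the day-0 reference date within each case,
# so ave() with a NA-ignoring mean propagates it to the missing rows.
df <- data.frame(case = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
                 day  = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
                 date = as.Date(c("2023-01-02", "2023-01-03", NA, NA, "2023-01-06",
                                  NA, "2021-05-11", "2021-05-12", "2021-05-13", NA)))

ref <- ave(as.numeric(df$date - df$day), df$case,
           FUN = function(x) mean(x, na.rm = TRUE))
df$date2 <- as.Date(ref, origin = "1970-01-01") + df$day
```

This fills row 3 with 2023-01-04, row 4 with 2023-01-05, and so on, matching the tidyverse version.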
1) Convert case to factor and then use predict with lm to fill in the NA's. No packages are used.
within(df, {
case <- factor(case)
date <- .Date(predict(lm(date ~ case/day), data.frame(case, day)))
})
giving
case day date
1 1 1 2023-01-02
2 1 2 2023-01-03
3 1 3 2023-01-04
4 1 4 2023-01-05
5 1 5 2023-01-06
6 2 1 2021-05-10
7 2 2 2021-05-11
8 2 3 2021-05-12
9 2 4 2021-05-13
10 2 5 2021-05-14
2) Find the mean day and date and then use day to appropriately offset each row.
library(dplyr) # version 1.1.0 or later
df %>%
mutate(date = {
Mean <- Map(mean, na.omit(pick(date, day)))
Mean$date + day - Mean$day
}, .by = case)
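To see why the mean-offset in (2) works (my own illustration, using only the complete rows of case 2, which is what na.omit() keeps): mean(date) - mean(day) equals the day-0 reference date whenever date - day is constant, so adding day back reproduces the full sequence.

```r
# Complete rows of case 2 only (the rows na.omit() keeps)
d <- data.frame(day  = c(2, 3, 4),
                date = as.Date(c("2021-05-11", "2021-05-12", "2021-05-13")))
Mean <- Map(mean, d)          # per-column means: day = 3, date = 2021-05-12
Mean$date + 1:5 - Mean$day    # 2021-05-10 ... 2021-05-14
```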
I'm looking to track the order in which an item falls in a sequence. For example, the final product should look something like this.
library(lubridate)  # for mdy()
df <- data.frame(paitent = c('Sally', 'Josh', 'Josh', 'Abram', 'Sally', 'Josh'),
                 visit = mdy(c('2/10/2022', '2/11/2022', '2/12/2022', '2/13/2022', '2/14/2022', '2/15/2022')),
                 visit_count = c(1, 1, 2, 1, 2, 3))
paitent visit
1 Sally 2022-02-10
2 Josh 2022-02-11
3 Josh 2022-02-12
4 Abram 2022-02-13
5 Sally 2022-02-14
6 Josh 2022-02-15
The 'visit_count' column would be automatically populated based on the patient name and where the visit falls in the date sequence.
I'm not exactly sure where to go. I've looked into using mutate() and nrow() to count rows, but I'm having trouble filtering for the specific name and then counting only the dates that are earlier than the current record's date.
dplyr
library(dplyr)
df %>%
group_by(paitent) %>%
mutate(visit_count2 = rank(visit, ties.method = "first")) %>%
ungroup()
# # A tibble: 6 x 4
# paitent visit visit_count visit_count2
# <chr> <date> <dbl> <int>
# 1 Sally 2022-02-10 1 1
# 2 Josh 2022-02-11 1 1
# 3 Josh 2022-02-12 2 2
# 4 Abram 2022-02-13 1 1
# 5 Sally 2022-02-14 2 2
# 6 Josh 2022-02-15 3 3
base R
df$visit_count2 <- ave(as.numeric(df$visit), df$paitent, FUN = function(z) rank(z, ties.method = "first"))
df
# paitent visit visit_count visit_count2
# 1 Sally 2022-02-10 1 1
# 2 Josh 2022-02-11 1 1
# 3 Josh 2022-02-12 2 2
# 4 Abram 2022-02-13 1 1
# 5 Sally 2022-02-14 2 2
# 6 Josh 2022-02-15 3 3
data.table
library(data.table)
as.data.table(df)[, visit_count2 := rank(visit, ties.method = "first"), by = .(paitent)]
Data
df <- structure(list(paitent = c("Sally", "Josh", "Josh", "Abram", "Sally", "Josh"), visit = structure(c(19033, 19034, 19035, 19036, 19037, 19038), class = "Date"), visit_count = c(1, 1, 2, 1, 2, 3), visit_count2 = c(1, 1, 2, 1, 2, 3)), row.names = c(NA, -6L), class = "data.frame")
We can group by patient, sort visit in ascending order and then create visit_count
df%>%
group_by(paitent)%>%
arrange(visit)%>%
mutate(visit_count=row_number())
# A tibble: 6 x 3
# Groups: paitent [3]
paitent visit visit_count
<fct> <date> <int>
1 Sally 2022-02-10 1
2 Josh 2022-02-11 1
3 Josh 2022-02-12 2
4 Abram 2022-02-13 1
5 Sally 2022-02-14 2
6 Josh 2022-02-15 3
library(dplyr)
df %>%
group_by(paitent) %>%
mutate(visit_count = 1:n())
paitent visit visit_count
<chr> <date> <int>
1 Sally 2022-02-10 1
2 Josh 2022-02-11 1
3 Josh 2022-02-12 2
4 Abram 2022-02-13 1
5 Sally 2022-02-14 2
6 Josh 2022-02-15 3
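One caveat on the `1:n()` variant (my note, not from the original answer): it numbers rows in their current order, so it only matches a date-based count when visits are already sorted within each patient. Sorting first makes that explicit:

```r
library(dplyr)

df <- data.frame(paitent = c("Sally", "Josh", "Josh", "Abram", "Sally", "Josh"),
                 visit = as.Date(c("2022-02-10", "2022-02-11", "2022-02-12",
                                   "2022-02-13", "2022-02-14", "2022-02-15")))

res <- df %>%
  arrange(paitent, visit) %>%          # guarantee date order within patient
  group_by(paitent) %>%
  mutate(visit_count = row_number()) %>%
  ungroup()
```

Note this reorders the rows, unlike the rank() approach, which preserves the original order.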
I'm both new to coding in R (always used SPSS but have to use R for a project) and this website, so bear with me. Hopefully I'm both able to explain the problem and what I've tried.
My data looks somewhat like this:
df <- data.frame (
ID = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
measurement = c (1, 2, 3, 1, 2, 3, 4, 1, 2),
date_event1 = c(NA, NA, "2021-02-15", NA, NA, NA, "2021-03-01", NA, NA),
date_event2 = c(NA, NA, NA, NA, "2021-03-06", NA, NA, "2022-02-02", "2022-02-02")
)
df
ID measurement date_event1 date_event2
1 1 <NA> <NA>
1 2 <NA> <NA>
1 3 2021-02-15 <NA>
2 1 <NA> <NA>
2 2 <NA> 2021-03-06
2 3 <NA> <NA>
2 4 2021-03-01 <NA>
3 1 <NA> 2022-02-02
3 2 <NA> 2022-02-02
I have patients (identified by ID) with a variable number of measurements (identified by measurement number and their date, so long-format data) and events (coded here as 'event1' and 'event2'). Events can be present for a particular patient and measurement (then coded with the date it occurred) or absent (then coded as NA).
Ultimately, my goal is to calculate intervals (in days) between two first events, if two are present. If no or only 1 event took place, the result should be NA. Desired output should look something like this:
ID measurement date_event1 date_event2 interval
1 1 <NA> <NA> NA
1 2 <NA> <NA> NA
1 3 2021-02-15 <NA> NA
2 1 <NA> <NA> **5**
2 2 <NA> 2021-03-06 **5**
2 3 <NA> <NA> **5**
2 4 2021-03-01 <NA> 5
3 1 <NA> 2022-02-02 **NA**
3 2 <NA> 2022-02-02 NA
Main issues here are:
finding the first event using functions such as 'min' returns NA when NA's are present.
using 'min(x, na.rm=TRUE)' as a workaround doesn't help for IDs whose events are all NA: min then warns and returns Inf.
What I've tried:
df <- df %>%
group_by(ID) %>%
arrange(ID, measurement) %>%
# creating 2 identifier variables if all rows for event1/event2 are NA
mutate(allNA1 = ifelse(all(is.na(date_event1)), 1, 0)) %>%
mutate(allNA2 = ifelse(all(is.na(date_event2)), 1, 0)) %>%
ungroup()
# for simplicity, combining these two identifier variables into 1
df$test <- ifelse(df$allNA1 == 0 & df$allNA2 == 0, 1, NA)
# then using this combined identifier variable to only use mindate on IDs that have at least 1 event
df <- df %>%
group_by(ID) %>%
mutate(mindate = if_else(test == 1, min(date_event1, na.rm=TRUE), NA_real_)) %>%
ungroup ()
I haven't gotten to the comparing-dates step as finding the first date still produces the 'no non-missing arguments to min; returning Inf' warnings, even though I'm only mutating if test==1. What am I missing here? Are there easier solutions to my main problem? Thank you in advance!
Edit: forgot to add that the FIRST event should be used. Changes highlighted in bold.
Edit 2: made an error in the example, changed dates and intervals, also removed a column for simplicity.
Edit 3: Harre suggested to suppress warnings. Using the following did not work:
options(warn = -1)
df <- df %>%
group_by(ID) %>%
mutate(interval = abs(min(date_event1, na.rm=TRUE) - min(date_event2, na.rm=TRUE))) %>%
ungroup()
options(warn = 1)
Edit 6/8/22:
Okay, so I managed to circumvent the problem:
df <- df %>%
group_by(ID) %>%
# first converting to Date and recoding all missing values to an extremely late date
mutate(date_event1 = if_else(is.na(date_event1), as.Date("2099-09-09"), as.Date(date_event1))) %>%
# then finding the earliest date (which is a non-2099 date, if one was present)
mutate(min_date_event1 = min(date_event1)) %>%
# then recoding the 2099 dates back to NA
mutate(min_date_event1 = if_else(min_date_event1 == as.Date("2099-09-09"), as.Date(NA), min_date_event1)) %>%
ungroup()
Which feels really inefficient for something I could do with 1 function in SPSS (AGGREGATE > FIRST). I'll checkmark my question but if anyone has an easier solution, feel free to add suggestions!
Given that your "goal is to calculate intervals (in days) between two events, if two are present. If no or only 1 event took place, the result should be NA", there is no need for anything other than converting to the Date type:
library(dplyr)
df |>
mutate(interval = abs(as.Date(date_event2) - as.Date(date_event1)))
or
library(dplyr)
df |>
mutate(across(starts_with("date"), as.Date),
interval = abs(date_event2 - date_event1))
Output:
ID measurement date_measurement date_event1 date_event2 interval
1 1 1 2020-01-01 <NA> <NA> NA days
2 1 2 2020-01-05 <NA> <NA> NA days
3 1 3 2020-01-10 2021-02-15 <NA> NA days
4 2 1 2021-02-01 <NA> 2021-03-01 NA days
5 2 2 2021-02-15 <NA> 2021-03-05 NA days
6 2 3 2021-03-01 <NA> <NA> NA days
7 2 4 2021-04-01 2021-03-01 2021-03-06 5 days
8 3 1 2022-01-01 <NA> 2022-02-02 NA days
9 3 2 2022-03-01 <NA> <NA> NA days
Update:
In this case you'll just want to suppress the warning, as it doesn't influence the result (shown below). Alternatively, you could arrange by the date and pick the first value using first(), whether it's NA or not.
df |>
group_by(ID) |>
mutate(across(starts_with("date"), as.Date),
across(starts_with("date_event"), ~ suppressWarnings(min(., na.rm = TRUE)), .names = "{.col}_min"),
interval = na_if(abs(date_event2_min - date_event1_min), Inf)) |>
ungroup()
Output:
# A tibble: 9 × 7
ID measurement date_event1 date_event2 date_event1_min date_event2_min interval
<dbl> <dbl> <date> <date> <date> <date> <drtn>
1 1 1 NA NA 2021-02-15 NA NA days
2 1 2 NA NA 2021-02-15 NA NA days
3 1 3 2021-02-15 NA 2021-02-15 NA NA days
4 2 1 NA NA 2021-03-01 2021-03-06 5 days
5 2 2 NA 2021-03-06 2021-03-01 2021-03-06 5 days
6 2 3 NA NA 2021-03-01 2021-03-06 5 days
7 2 4 2021-03-01 NA 2021-03-01 2021-03-06 5 days
8 3 1 NA 2022-02-02 NA 2022-02-02 NA days
9 3 2 NA 2022-02-02 NA 2022-02-02 NA days
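Another way to avoid the min() warning entirely (my sketch, assuming dplyr >= 1.1 so that first() on an empty vector returns a typed NA, and assuming rows are ordered by measurement so the first non-missing date is the earliest): take the first non-NA date per ID with first(na.omit(.)).

```r
library(dplyr)

df <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
  measurement = c(1, 2, 3, 1, 2, 3, 4, 1, 2),
  date_event1 = c(NA, NA, "2021-02-15", NA, NA, NA, "2021-03-01", NA, NA),
  date_event2 = c(NA, NA, NA, NA, "2021-03-06", NA, NA, "2022-02-02", "2022-02-02")
)

res <- df %>%
  mutate(across(starts_with("date"), as.Date)) %>%
  group_by(ID) %>%
  # first() on an all-NA group yields NA, so the interval is NA there
  mutate(interval = abs(first(na.omit(date_event2)) - first(na.omit(date_event1)))) %>%
  ungroup()
```

This gives 5 days for every row of ID 2 and NA for IDs 1 and 3, matching the desired output.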
I am trying to find the first date of each category and subtract 5 days, AND I want to keep the days in between; this is where I am struggling. I tried seq() but it gave me an error, so I'm not sure if it is the right tool.
I am able to get 5 days prior to my start date for each category, but I can't figure out how to get 0, 1, 2, 3, 4 AND 5 days prior to my start date as separate rows!
The error I got is this (for the commented out part of the code):
Error in seq.default(., as.Date(first_day), by = "day", length.out = 5) :
'from' must be of length 1
Any help would be greatly appreciated!
library("lubridate")
library("dplyr")
library("tidyr")
data <- data.frame(date = c("2020-06-08",
"2020-06-09",
"2020-06-10",
"2020-06-11",
"2020-06-12",
"2021-07-13",
"2021-07-14",
"2021-07-15",
"2021-08-16",
"2021-08-17",
"2021-08-18",
"2021-09-19",
"2021-09-20"),
value = c(2,1,7,1,0,1,2,3,4,7,6,5,10),
category = c(1,1,1,1,1,2,2,2,3,3,3,4,4))
data$date <- as.Date(data$date)
View(data)
test_dates <- data %>%
group_by(category) %>%
arrange(date) %>%
slice(1L) %>% #takes first date
mutate(first_day = as.Date(date) - 5)#%>%
#seq(as.Date(first_day),by="day",length.out=5)
#error for seq(): Error in seq.default(., as.Date(first_day), by = "day", length.out = 5) : 'from' must be of length 1
head(test_dates)
The answer I'm looking for should include these dates, but in column format! I'm also trying to input NA in the value column if a value doesn't already exist. I want to keep all possible columns, as the dataframe I need to use this on has about 20 columns.
Dates: "2020-06-03", "2020-06-04", "2020-06-05", "2020-06-06", "2020-06-07", "2020-06-08", "2021-07-08", "2021-07-09", "2021-07-10", "2021-07-11", "2021-07-12", "2021-07-13", "2021-08-11", "2021-08-12", "2021-08-13", "2021-08-14", "2021-08-15", "2021-08-16", "2021-09-14", "2021-09-15", "2021-09-16", "2021-09-17", "2021-09-18", "2021-09-19"
Related question here: How do I subset my df for the minimum date based on one category and including x days before that?
Here's one approach but kinda clunky:
bind_rows(
data,
data %>%
group_by(category) %>%
slice_min(date) %>%
uncount(6, .id = "id") %>%
mutate(date = date - id + 1) %>%
select(-id)) %>%
arrange(category, date)
Result
# A tibble: 37 × 3
date value category
<date> <dbl> <dbl>
1 2020-06-03 2 1
2 2020-06-04 2 1
3 2020-06-05 2 1
4 2020-06-06 2 1
5 2020-06-07 2 1
6 2020-06-08 2 1
7 2020-06-08 2 1
8 2020-06-09 1 1
9 2020-06-10 7 1
10 2020-06-11 1 1
# … with 27 more rows
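A small variation on the above (my sketch, not part of the original answer) avoids the duplicated first-day row by generating only the five strictly earlier dates, and sets value to NA on the padded rows as the question requests:

```r
library(dplyr)
library(tidyr)

data <- data.frame(date = as.Date(c("2020-06-08", "2020-06-09", "2020-06-10",
                                    "2020-06-11", "2020-06-12", "2021-07-13",
                                    "2021-07-14", "2021-07-15", "2021-08-16",
                                    "2021-08-17", "2021-08-18", "2021-09-19",
                                    "2021-09-20")),
                   value = c(2, 1, 7, 1, 0, 1, 2, 3, 4, 7, 6, 5, 10),
                   category = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4))

res <- bind_rows(
  data,
  data %>%
    group_by(category) %>%
    slice_min(date) %>%                  # first row of each category
    ungroup() %>%
    uncount(5, .id = "id") %>%           # five copies of each first row
    mutate(date = date - id,             # the 5 days strictly before it
           value = NA_real_) %>%
    select(-id)) %>%
  arrange(category, date)
```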
This approach provides the row from each category with the minimum date, plus the five dates prior for each category (with value set to NA for these rows)
library(data.table)
setDT(data)[data[, .(date=seq(min(date)-5,by="day", length.out=6)), category], on=.(category,date)]
Output:
date value category
1: 2020-06-03 NA 1
2: 2020-06-04 NA 1
3: 2020-06-05 NA 1
4: 2020-06-06 NA 1
5: 2020-06-07 NA 1
6: 2020-06-08 2 1
7: 2021-07-08 NA 2
8: 2021-07-09 NA 2
9: 2021-07-10 NA 2
10: 2021-07-11 NA 2
11: 2021-07-12 NA 2
12: 2021-07-13 1 2
13: 2021-08-11 NA 3
14: 2021-08-12 NA 3
15: 2021-08-13 NA 3
16: 2021-08-14 NA 3
17: 2021-08-15 NA 3
18: 2021-08-16 4 3
19: 2021-09-14 NA 4
20: 2021-09-15 NA 4
21: 2021-09-16 NA 4
22: 2021-09-17 NA 4
23: 2021-09-18 NA 4
24: 2021-09-19 5 4
date value category
Note: The above uses a join; an identical result can be achieved without a join by row-binding the first row for each category with the data.table generated similarly as above:
rbind(
setDT(data)[order(date), .SD[1],category],
data[,.(date=seq(min(date)-5,by="day",length.out=5),value=NA),category]
)
You indicate you have many columns, so if you are going to take this second approach, rather than explicitly setting value=NA in the second input to rbind, you can also just leave it out, and add fill=TRUE within the rbind()
A dplyr version of the same is:
bind_rows(
data %>%
group_by(category) %>%
slice_min(date) %>%
ungroup() %>%
mutate(date=as.Date(date)),
data %>%
group_by(category) %>%
summarize(date=seq(min(as.Date(date))-5,by="day", length.out=5), .groups="drop")
)
Output:
# A tibble: 24 x 3
date value category
<date> <dbl> <dbl>
1 2020-06-08 2 1
2 2021-07-13 1 2
3 2021-08-16 4 3
4 2021-09-19 5 4
5 2020-06-03 NA 1
6 2020-06-04 NA 1
7 2020-06-05 NA 1
8 2020-06-06 NA 1
9 2020-06-07 NA 1
10 2021-07-08 NA 2
# ... with 14 more rows
Update (9/21/22) -
If you want the NA values to be filled, simply add this to the end of either data.table pipeline:
...[,value:=max(value, na.rm=T), category]
or add this to the dplyr pipeline
... %>%
group_by(category) %>%
mutate(value=max(value, na.rm=T))
Jon Spring's answer inspired this alternative approach:
Here we first get the first days minus 5, as already presented in the question. Then we use bind_rows as Jon Spring does in his answer. Next we identify the original first dates within the dates column (using !duplicated inside filter). The last main step is to use coalesce:
library(lubridate)
library(dplyr)
data %>%
group_by(category) %>%
mutate(x = min(ymd(date))-5) %>%
slice(1) %>%
bind_rows(data) %>%
mutate(date = ymd(date)) %>%
filter(!duplicated(date)) %>%
mutate(x = coalesce(x, date)) %>%
arrange(category) %>%
select(date = x, value)
category date value
<dbl> <date> <dbl>
1 1 2020-06-03 2
2 1 2020-06-09 1
3 1 2020-06-10 7
4 1 2020-06-11 1
5 1 2020-06-12 0
6 2 2021-07-08 1
7 2 2021-07-14 2
8 2 2021-07-15 3
9 3 2021-08-11 4
10 3 2021-08-17 7
11 3 2021-08-18 6
12 4 2021-09-14 5
13 4 2021-09-20 10
I have the following data:
library(tidyverse)
library(lubridate)
df <- tibble(date = as_date(c("2019-11-20", "2019-11-27", "2020-04-01", "2020-04-15", "2020-09-23", "2020-11-25", "2021-03-03")))
# A tibble: 7 x 1
date
<date>
1 2019-11-20
2 2019-11-27
3 2020-04-01
4 2020-04-15
5 2020-09-23
6 2020-11-25
7 2021-03-03
I also have an ordered comparison vector of dates:
comparison <- seq(as_date("2019-12-01"), today(), by = "months") - 1
I now want to compare my dates in df to those comparison dates and so something like:
if date in df is < comparison[1], then assign a 1
if date in df is < comparison[2], then assign a 2
and so on.
I know I could do it with a case_when, e.g.
df %>%
mutate(new_var = case_when(date < comparison[1] ~ 1,
date < comparison[2] ~ 2))
(of course filling this up with all comparisons).
However, this would require to manually write out all sequential conditions and I'm wondering if I couldn't just automate it. I though about creating a match lookup first (i.e. take the comparison vector, then add the respective new_var number (i.e. 1, 2, and so on)) and then match it against my data, but I only know how to do that for exact matches and don't know how I can add the "smaller than" condition.
Expected result:
# A tibble: 7 x 2
date new_var
<date> <dbl>
1 2019-11-20 1
2 2019-11-27 1
3 2020-04-01 6
4 2020-04-15 6
5 2020-09-23 11
6 2020-11-25 13
7 2021-03-03 17
You can use findInterval as follows:
df %>% mutate(new_var = findInterval(date, comparison) + 1)
# A tibble: 7 x 2
date new_var
<date> <dbl>
1 2019-11-20 1
2 2019-11-27 1
3 2020-04-01 6
4 2020-04-15 6
5 2020-09-23 11
6 2020-11-25 13
7 2021-03-03 17
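The logic behind that one-liner, on a tiny example of my own (not the question's data): findInterval(x, vec) returns, for each x, how many values in the sorted vec are less than or equal to it, so adding 1 gives the index of the first comparison date that x still falls below.

```r
comparison <- as.Date(c("2019-11-30", "2019-12-31", "2020-01-31"))
dates      <- as.Date(c("2019-11-20", "2019-12-15", "2020-02-01"))

# 0, 1, and 3 comparison dates are <= each input date, respectively
findInterval(dates, comparison) + 1
#> [1] 1 2 4
```

Note that findInterval requires the comparison vector to be sorted in increasing order, which the monthly seq() in the question guarantees.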
My code is written in R, where I have a table with 3 variables: a date, an ID and a path. The table is sorted by ID first, then by date. When Path is 0, I need to group all previous path numbers for that ID into one line and record the first date (Date_Start) and the date on which Path = 0 occurred (Date_End). This needs to be done per ID.
For example, the second row in the desired result table: Path = 0 occurred on 2018-10-08 for ID 5, meaning that the paths of the previous dates need to be grouped together as Path = 1,0,3,0, with Date_Start = 2018-10-05 and Date_End = 2018-10-08.
Source table
Date ID Path
2018-10-05 5 1
2018-10-06 5 0
2018-10-07 5 3
2018-10-08 5 0
2018-10-06 5 4
2018-10-08 7 5
2018-10-07 8 2
2018-10-08 8 1
2018-10-09 8 0
Desired result:
Date_Start Date_End ID Index Path
2018-10-05 2018-10-06 5 1 1,0
2018-10-05 2018-10-08 5 2 1,0,3,0
2018-10-06 2018-10-06 5 3 4
2018-10-08 2018-10-08 7 4 5
2018-10-07 2018-10-09 8 5 2,1,0
Thank you in advance!
Along with ID, we can create another group where Path becomes 0, get the first and last Date of each group. To get all the previous Path numbers we check if the last value ends with 0 and also replace their Date_Start with the first value.
library(dplyr)
df %>%
group_by(ID) %>%
mutate(group = lag(cumsum(Path == 0), default = 0)) %>%
group_by(ID, group) %>%
summarise(Date_Start = first(Date),
Date_End = last(Date),
Path = toString(Path)) %>%
mutate(Path = paste_values(Path),
Date_Start = replace(Date_Start,endsWith(Path,"0"),first(Date_Start))) %>%
ungroup %>%
dplyr::select(-group) %>%
mutate(Index = row_number())
# A tibble: 5 x 5
# ID Date_Start Date_End Path Index
# <int> <fct> <fct> <chr> <int>
#1 5 2018-10-05 2018-10-06 1, 0 1
#2 5 2018-10-05 2018-10-08 1, 0, 3, 0 2
#3 5 2018-10-06 2018-10-06 4 3
#4 7 2018-10-08 2018-10-08 5 4
#5 8 2018-10-07 2018-10-09 2, 1, 0 5
where I define paste_values function as
paste_values <- function(value) {
sapply(seq_along(value), function(x) {
if (endsWith(value[x], "0")) toString(value[1:x])
else value[x]
})
}
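To see what the helper does with the per-group strings (example values mine): each element that ends in "0" is replaced by the running concatenation of everything up to that point, while other elements pass through unchanged.

```r
paste_values <- function(value) {
  sapply(seq_along(value), function(x) {
    if (endsWith(value[x], "0")) toString(value[1:x])
    else value[x]
  })
}

# The two "closed" groups of ID 5 plus its trailing open group:
paste_values(c("1, 0", "3, 0", "4"))
# returns c("1, 0", "1, 0, 3, 0", "4")
```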