I am trying to find the first date of each category, then subtract 5 days, AND I want to keep the days in between! This is where I am struggling. I tried seq() but it gave me an error, so I'm not sure if this is the right way to do it.
I am able to get 5 days prior to my start date for each category, but I can't figure out how to get 0, 1, 2, 3, 4 AND 5 days prior to my start date!
The error I got is this (for the commented-out part of the code):
Error in seq.default(., as.Date(first_day), by = "day", length.out = 5) :
'from' must be of length 1
Any help would be greatly appreciated!
library ("lubridate")
library("dplyr")
library("tidyr")
data <- data.frame(date = c("2020-06-08",
"2020-06-09",
"2020-06-10",
"2020-06-11",
"2020-06-12",
"2021-07-13",
"2021-07-14",
"2021-07-15",
"2021-08-16",
"2021-08-17",
"2021-08-18",
"2021-09-19",
"2021-09-20"),
value = c(2,1,7,1,0,1,2,3,4,7,6,5,10),
category = c(1,1,1,1,1,2,2,2,3,3,3,4,4))
data$date <- as.Date(data$date)
View(data)
test_dates <- data %>%
group_by(category) %>%
arrange(date) %>%
slice(1L) %>% #takes first date
mutate(first_day = as.Date(date) - 5)#%>%
#seq(as.Date(first_day),by="day",length.out=5)
#error for seq(): Error in seq.default(., as.Date(first_day), by = "day", length.out = 5) : 'from' must be of length 1
head(test_dates)
The answer I'm looking for should include these dates, but in a column format! I'm also trying to input NA in the value column if the value doesn't already exist. I want to keep all possible columns, as the dataframe I need to use this on has about 20 columns.
Dates: "2020-06-03 ", "2020-06-04", "2020-06-05", "2020-06-06", "2020-06-07", "2020-06-08", "2020-07-08 ", "2020-07-09", "2020-07-10", "2020-07-11", "2020-07-12", "2021-07-13", "2020-08-11 ", "2020-08-12", "2020-08-13", "2020-08-14", "2020-08-15", "2021-08-16", "2020-09-14 ", "2020-09-15", "2020-09-16", "2020-09-17", "2020-09-18", "2021-09-19",
Related question here: How do I subset my df for the minimum date based on one category and including x days before that?
Here's one approach, but it's kinda clunky:
bind_rows(
data,
data %>%
group_by(category) %>%
slice_min(date) %>%
uncount(6, .id = "id") %>%
mutate(date = date - id + 1) %>%
select(-id)) %>%
arrange(category, date)
Result:
# A tibble: 37 × 3
date value category
<date> <dbl> <dbl>
1 2020-06-03 2 1
2 2020-06-04 2 1
3 2020-06-05 2 1
4 2020-06-06 2 1
5 2020-06-07 2 1
6 2020-06-08 2 1
7 2020-06-08 2 1
8 2020-06-09 1 1
9 2020-06-10 7 1
10 2020-06-11 1 1
# … with 27 more rows
The following data.table approach provides the row from each category with the minimum date, plus the five dates prior for each category (with value set to NA for these added rows).
library(data.table)
setDT(data)[data[, .(date = seq(min(date) - 5, by = "day", length.out = 6)), category], on = .(category, date)]
Output:
date value category
1: 2020-06-03 NA 1
2: 2020-06-04 NA 1
3: 2020-06-05 NA 1
4: 2020-06-06 NA 1
5: 2020-06-07 NA 1
6: 2020-06-08 2 1
7: 2021-07-08 NA 2
8: 2021-07-09 NA 2
9: 2021-07-10 NA 2
10: 2021-07-11 NA 2
11: 2021-07-12 NA 2
12: 2021-07-13 1 2
13: 2021-08-11 NA 3
14: 2021-08-12 NA 3
15: 2021-08-13 NA 3
16: 2021-08-14 NA 3
17: 2021-08-15 NA 3
18: 2021-08-16 4 3
19: 2021-09-14 NA 4
20: 2021-09-15 NA 4
21: 2021-09-16 NA 4
22: 2021-09-17 NA 4
23: 2021-09-18 NA 4
24: 2021-09-19 5 4
date value category
Note: The above uses a join; an identical result can be achieved without a join by row-binding the first row for each category with a data.table of prior dates generated similarly to the above:
rbind(
setDT(data)[order(date), .SD[1],category],
data[,.(date=seq(min(date)-5,by="day",length.out=5),value=NA),category]
)
You indicate you have many columns, so if you are going to take this second approach, then rather than explicitly setting value=NA in the second input to rbind(), you can leave it out and add fill=TRUE within the rbind(), as sketched below.
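A minimal sketch of that fill=TRUE variant (assuming the same data as above, already converted with setDT()):
rbind(
  setDT(data)[order(date), .SD[1], category],
  data[, .(date = seq(min(date) - 5, by = "day", length.out = 5)), category],
  fill = TRUE  # pads value (and any other missing columns) with NA
)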
A dplyr version of the same is:
bind_rows(
data %>%
group_by(category) %>%
slice_min(date) %>%
ungroup() %>%
mutate(date=as.Date(date)),
data %>%
group_by(category) %>%
summarize(date=seq(min(as.Date(date))-5,by="day", length.out=5), .groups="drop")
)
Output:
# A tibble: 24 x 3
date value category
<date> <dbl> <dbl>
1 2020-06-08 2 1
2 2021-07-13 1 2
3 2021-08-16 4 3
4 2021-09-19 5 4
5 2020-06-03 NA 1
6 2020-06-04 NA 1
7 2020-06-05 NA 1
8 2020-06-06 NA 1
9 2020-06-07 NA 1
10 2021-07-08 NA 2
# ... with 14 more rows
Update (9/21/22) -
If you want the NA values to be filled, simply add this to the end of either data.table pipeline:
...[,value:=max(value, na.rm=T), category]
or add this to the dplyr pipeline
... %>%
group_by(category) %>%
mutate(value=max(value, na.rm=T))
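For example, the dplyr pipeline from above with the fill step appended (a sketch; each category carries exactly one non-NA value among these rows, so max() simply propagates it):
bind_rows(
  data %>%
    group_by(category) %>%
    slice_min(date) %>%
    ungroup() %>%
    mutate(date = as.Date(date)),
  data %>%
    group_by(category) %>%
    summarize(date = seq(min(as.Date(date)) - 5, by = "day", length.out = 5),
              .groups = "drop")
) %>%
  group_by(category) %>%
  mutate(value = max(value, na.rm = TRUE)) %>%  # fill the NA rows per category
  ungroup() %>%
  arrange(category, date)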
@Jon Spring's answer inspired this alternative approach:
Here we first get the first day minus 5 for each category, as already attempted in the question. Then we use bind_rows as Jon Spring does in his answer. The next step is to identify the original first dates within the date column (we use !duplicated within filter). The last main step is to use coalesce:
library(lubridate)
library(dplyr)
data %>%
group_by(category) %>%
mutate(x = min(ymd(date))-5) %>%
slice(1) %>%
bind_rows(data) %>%
mutate(date = ymd(date)) %>%
filter(!duplicated(date)) %>%
mutate(x = coalesce(x, date)) %>%
arrange(category) %>%
select(date = x, value)
category date value
<dbl> <date> <dbl>
1 1 2020-06-03 2
2 1 2020-06-09 1
3 1 2020-06-10 7
4 1 2020-06-11 1
5 1 2020-06-12 0
6 2 2021-07-08 1
7 2 2021-07-14 2
8 2 2021-07-15 3
9 3 2021-08-11 4
10 3 2021-08-17 7
11 3 2021-08-18 6
12 4 2021-09-14 5
13 4 2021-09-20 10
Related
I have a dataset in long format. Every subject in the dataset was observed five times during the week. I have a column with the number of the day of the week on which the observation was supposed to happen/happened, and another column with the actual dates of the observations. The latter column has some missing values. I would like to use the information in the first column to fill the missing values in the second column. Here is a toy dataset:
df <- data.frame(case = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
day = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
date = as.Date(c("2023-01-02", "2023-01-03", NA, NA, "2023-01-06",
NA, "2021-05-11", "2021-05-12", "2021-05-13", NA)))
df
# case day date
# 1 1 2023-01-02
# 1 2 2023-01-03
# 1 3 <NA>
# 1 4 <NA>
# 1 5 2023-01-06
# 2 1 <NA>
# 2 2 2021-05-11
# 2 3 2021-05-12
# 2 4 2021-05-13
# 2 5 <NA>
And here is the desired output:
# case day date
#1 1 1 2023-01-02
#2 1 2 2023-01-03
#3 1 3 2023-01-04
#4 1 4 2023-01-05
#5 1 5 2023-01-06
#6 2 1 2021-05-10
#7 2 2 2021-05-11
#8 2 3 2021-05-12
#9 2 4 2021-05-13
#10 2 5 2021-05-14
Does this work for you? No linear models are used.
library(tidyverse)
df2 <-
df %>%
mutate(
ref_date = case_when(
case == 1 ~ as.Date("2023-01-01"),
case == 2 ~ as.Date("2021-05-09")
),
date2 = as.Date(day, origin = ref_date)
)
Output:
> df2
case day date ref_date date2
1 1 1 2023-01-02 2023-01-01 2023-01-02
2 1 2 2023-01-03 2023-01-01 2023-01-03
3 1 3 <NA> 2023-01-01 2023-01-04
4 1 4 <NA> 2023-01-01 2023-01-05
5 1 5 2023-01-06 2023-01-01 2023-01-06
6 2 1 <NA> 2021-05-09 2021-05-10
7 2 2 2021-05-11 2021-05-09 2021-05-11
8 2 3 2021-05-12 2021-05-09 2021-05-12
9 2 4 2021-05-13 2021-05-09 2021-05-13
10 2 5 <NA> 2021-05-09 2021-05-14
I concede that G.G.'s answer has the advantage that you don't need to hardcode the reference date.
P.S. Here is a pure tidyverse solution without any hardcoding:
df2 <-
df %>%
mutate(ref_date = date - day) %>%
group_by(case) %>%
fill(ref_date, .direction = "downup") %>%
ungroup() %>%
mutate(date2 = as.Date(day, origin = ref_date))
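To arrive at the desired three-column output from here, one might simply drop the helper columns (purely illustrative):
df2 %>%
  select(case, day, date = date2)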
1) Convert case to factor and then use predict with lm to fill in the NA's. No packages are used.
within(df, {
  case <- factor(case)
  date <- .Date(predict(lm(date ~ case/day), data.frame(case, day)))
})
giving
case day date
1 1 1 2023-01-02
2 1 2 2023-01-03
3 1 3 2023-01-04
4 1 4 2023-01-05
5 1 5 2023-01-06
6 2 1 2021-05-10
7 2 2 2021-05-11
8 2 3 2021-05-12
9 2 4 2021-05-13
10 2 5 2021-05-14
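Why the fit is exact: date ~ case/day expands to date ~ case + case:day, i.e. a separate intercept and daily slope per case, and each case's observed dates advance by exactly one day, so the regression reproduces them without error. To inspect the expanded design (purely illustrative):
# one intercept column and one day-slope column per case
model.matrix(~ case/day, transform(df, case = factor(case)))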
2) Find the mean day and mean date over the complete rows, then use each row's day to offset from those means: since the dates advance one day at a time, date = Mean$date + day - Mean$day.
library(dplyr) # version 1.1.0 or later
df %>%
mutate(date = {
Mean <- Map(mean, na.omit(pick(date, day)))
Mean$date + day - Mean$day
}, .by = case)
Let's say I have this example dataframe (but a lot bigger):
df = data.frame(ID_number = c(111,111,111,22,22,33,33),
date = c('2021-06-14','2021-06-12','2021-03-11',
'2021-05-20','2021-05-14',
'2018-04-20','2017-03-14'),
answers = 1:7,
sex = c('F','M','F','M','M','M','F') )
The output:
ID_number date answers sex
1 111 2021-06-14 1 F
2 111 2021-06-12 2 M
3 111 2021-03-11 3 F
4 22 2021-05-20 4 M
5 22 2021-05-14 5 M
6 33 2018-04-20 6 M
7 33 2017-03-14 7 F
We can see that there are 7 different members, but whoever created the dataframe made a mistake and assigned the same ID_number to members 1, 2, and 3, the same ID_number to members 4 and 5, and so on.
The dataframe holds the collection date for each member, and I wish to keep only the member with the earliest date. The resulting dataframe would look like this:
ID_number date answers sex
1 111 2021-03-11 3 F
2 22 2021-05-14 5 M
3 33 2017-03-14 7 F
Appreciate the help.
You could filter on the min date per group with group_by like this:
library(dplyr)
df %>%
group_by(ID_number) %>%
filter(date == min(date))
#> # A tibble: 3 × 4
#> # Groups: ID_number [3]
#> ID_number date answers sex
#> <dbl> <chr> <int> <chr>
#> 1 111 2021-03-11 3 F
#> 2 22 2021-05-14 5 M
#> 3 33 2017-03-14 7 F
Created on 2023-01-04 with reprex v2.0.2
With slice_min:
library(dplyr)
df %>%
group_by(ID_number) %>%
slice_min(date)
In the dev. version, you can use inline grouping with .by:
devtools::install_github("tidyverse/dplyr")
df %>%
slice_min(date, .by = ID_number)
Using base R
# date is character in this df, so convert with as.Date();
# as.numeric() on a date string would return NA
subset(df, as.Date(date) == ave(as.Date(date), ID_number, FUN = min))
ID_number date answers sex
3 111 2021-03-11 3 F
5 22 2021-05-14 5 M
7 33 2017-03-14 7 F
I have two data frames.
't_dates' has sequence of dates.
'client_for_gg' has client_id and a start and end date for each client.
The output should tell me, for each day in t_dates$date, how many clients had that date fall within their start and end dates.
library(tidyverse)
library(lubridate)
t_dates <- seq.Date(from = as.Date('2022-11-01'),
to = as.Date('2022-11-15'),
by = "day") %>%
data.frame(date = .)
client_for_gg <- data.frame(client_id = c("x_555", "x_666", "x_777", "x_888", "x_999")
, start = c("2022-01-01", "2022-01-01", "2022-11-05", "2022-11-07", "2022-11-10")
, end = c("2022-11-03", "2022-11-12", "2022-12-01", "2022-12-01", "2022-12-01")) %>%
mutate(start = as.Date(start)
, end = as.Date(end))
df <- t_dates %>%
mutate(count = sum(as.Date(t_dates$date) %within%
lubridate::interval(client_for_gg$start, client_for_gg$end)))
However, in my output my counts all come to 10 on each day. Please help - thank you.
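The problem is that the %within% expression is evaluated only once for the whole column: the 5 client intervals get recycled against the 15 dates, sum() collapses the result to a single grand total, and mutate() recycles that total (10) to every row. Each row has to be compared against the whole other table instead, e.g. with rowwise():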
library(dplyr)
client_for_gg %>%
rowwise() %>%
mutate(count = sum(t_dates$date >= start & t_dates$date <= end ))
# A tibble: 5 x 4
# Rowwise:
client_id start end count
<chr> <date> <date> <int>
1 x_555 2022-01-01 2022-11-03 3
2 x_666 2022-01-01 2022-11-12 12
3 x_777 2022-11-05 2022-12-01 11
4 x_888 2022-11-07 2022-12-01 9
5 x_999 2022-11-10 2022-12-01 6
This worked.
> t_dates %>%
+ rowwise() %>%
+ mutate(count = sum(client_for_gg$start <= date & client_for_gg$end >= date))
# A tibble: 15 x 2
# Rowwise:
date count
<date> <int>
1 2022-11-01 2
2 2022-11-02 2
3 2022-11-03 2
4 2022-11-04 1
5 2022-11-05 2
6 2022-11-06 2
7 2022-11-07 3
8 2022-11-08 3
9 2022-11-09 3
10 2022-11-10 4
11 2022-11-11 4
12 2022-11-12 4
13 2022-11-13 3
14 2022-11-14 3
15 2022-11-15 3
library(data.table)
setDT(client_for_gg)[setDT(t_dates)[, d := date], on = .(start <= d, end >= d)][, .N, date]
Output:
date N
1: 2022-11-01 2
2: 2022-11-02 2
3: 2022-11-03 2
4: 2022-11-04 1
5: 2022-11-05 2
6: 2022-11-06 2
7: 2022-11-07 3
8: 2022-11-08 3
9: 2022-11-09 3
10: 2022-11-10 4
11: 2022-11-11 4
12: 2022-11-12 4
13: 2022-11-13 3
14: 2022-11-14 3
15: 2022-11-15 3
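The same join, unrolled for readability (a sketch, assuming the data frames as defined in the question):
td <- setDT(t_dates)[, d := date]  # copy date so it survives the join
cf <- setDT(client_for_gg)
# non-equi join: for each day d, all client rows whose [start, end] contains d
joined <- cf[td, on = .(start <= d, end >= d)]
joined[, .N, by = date]  # count matching clients per day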
In the devel version of dplyr, we can use join_by, which can also do non-equi joins:
library(dplyr)
t_dates %>%
full_join(client_for_gg, by =join_by(date >= start, date <= end)) %>%
count(client_id, start, end, name = 'count')
Output:
client_id start end count
1 x_555 2022-01-01 2022-11-03 3
2 x_666 2022-01-01 2022-11-12 12
3 x_777 2022-11-05 2022-12-01 11
4 x_888 2022-11-07 2022-12-01 9
5 x_999 2022-11-10 2022-12-01 6
I have the following data:
library(tidyverse)
library(lubridate)
df <- tibble(date = as_date(c("2019-11-20", "2019-11-27", "2020-04-01", "2020-04-15", "2020-09-23", "2020-11-25", "2021-03-03")))
# A tibble: 7 x 1
date
<date>
1 2019-11-20
2 2019-11-27
3 2020-04-01
4 2020-04-15
5 2020-09-23
6 2020-11-25
7 2021-03-03
I also have an ordered comparison vector of dates:
comparison <- seq(as_date("2019-12-01"), today(), by = "months") - 1
I now want to compare my dates in df to those comparison dates and do something like:
if date in df is < comparison[1], then assign a 1
if date in df is < comparison[2], then assign a 2
and so on.
I know I could do it with a case_when, e.g.
df %>%
mutate(new_var = case_when(date < comparison[1] ~ 1,
date < comparison[2] ~ 2))
(of course filling this up with all comparisons).
However, this would require manually writing out all sequential conditions, and I'm wondering if I couldn't just automate it. I thought about creating a match lookup first (i.e., take the comparison vector, then add the respective new_var number (i.e., 1, 2, and so on)) and then matching it against my data, but I only know how to do that for exact matches and don't know how I can add the "smaller than" condition.
Expected result:
# A tibble: 7 x 2
date new_var
<date> <dbl>
1 2019-11-20 1
2 2019-11-27 1
3 2020-04-01 6
4 2020-04-15 6
5 2020-09-23 11
6 2020-11-25 13
7 2021-03-03 17
You can use findInterval as follows:
df %>% mutate(new_var = df$date %>% findInterval(comparison) + 1)
# A tibble: 7 x 2
date new_var
<date> <dbl>
1 2019-11-20 1
2 2019-11-27 1
3 2020-04-01 6
4 2020-04-15 6
5 2020-09-23 11
6 2020-11-25 13
7 2021-03-03 17
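For intuition: findInterval(x, vec) returns, for each element of x, how many elements of the sorted vec are <= it, so adding 1 turns "number of month-end cut-points already passed" into the desired 1-based index. A quick check (assuming the comparison vector from the question):
# comparison[5] is 2020-03-31, so 2020-04-01 has passed 5 cut-points
findInterval(as.Date("2020-04-01"), comparison) + 1  # 6, matching new_var above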
Let's suppose I have this dataframe:
Date A B
2010-01-01 NA 1
2010-01-02 2 NA
2010-01-05 3 NA
2010-01-07 NA 4
2010-01-20 5 NA
2010-01-25 6 7
I want to collapse rows, merging the non-NA values into the closest Date. So the result would be:
Date A B
2010-01-02 2 1
2010-01-07 3 4
2010-01-20 5 NA
2010-01-25 6 7
I saw this Stack Overflow question that solves collapsing using a key value, but I could not find a similar case that collapses using close date values.
Obs1: It would be good if there were a way to not collapse the rows if the dates are too far apart (example: more than 15 days apart).
Obs2: It would be good if the collapsing lines kept the latter date rather than the earlier, as shown in the example above.
Using the dplyr package, an option could be to group_by on a combination of A and B in such a way that they form complete values.
Considering Obs2, the max of Date is taken for the combined row.
library(dplyr)
library(lubridate)
df %>% mutate(Date = ymd(Date)) %>%
mutate(GrpA = cumsum(!is.na(A)), GrpB = cumsum(!is.na(B))) %>%
rowwise() %>%
mutate(Grp = max(GrpA, GrpB)) %>%
ungroup() %>%
select(-GrpA, -GrpB) %>%
group_by(Grp) %>%
summarise(Date = max(Date), A = A[!is.na(A)][1], B = B[!is.na(B)][1])
# # A tibble: 4 x 4
# Grp Date A B
# <int> <date> <int> <int>
# 1 1 2010-01-02 2 1
# 2 2 2010-01-07 3 4
# 3 3 2010-01-20 5 NA
# 4 4 2010-01-25 6 7
Data:
df <- read.table(text =
"Date A B
2010-01-01 NA 1
2010-01-02 2 NA
2010-01-05 3 NA
2010-01-07 NA 4
2010-01-20 5 NA
2010-01-25 6 7",
stringsAsFactors = FALSE, header = TRUE)