My code is written in R, where I have a table consisting of 3 variables: a date, an ID and a path. The table is sorted by ID first, then by date. When Path is 0, I need to group all previous path numbers for that ID into one line and record the first date (Date_Start) and the date on which Path = 0 occurred (Date_End). This needs to be done per ID.
For example, the second row in the desired result table: Path = 0 occurred on 2018-10-08 for ID 5, meaning that all the paths of the previous dates need to be grouped together as Path = 1,0,3,0, with Date_Start = 2018-10-05 and Date_End = 2018-10-08.
Source table
Date ID Path
2018-10-05 5 1
2018-10-06 5 0
2018-10-07 5 3
2018-10-08 5 0
2018-10-06 5 4
2018-10-08 7 5
2018-10-07 8 2
2018-10-08 8 1
2018-10-09 8 0
Desired result:
Date_Start Date_End ID Index Path
2018-10-05 2018-10-06 5 1 1,0
2018-10-05 2018-10-08 5 2 1,0,3,0
2018-10-06 2018-10-06 5 3 4
2018-10-08 2018-10-08 7 4 5
2018-10-07 2018-10-09 8 5 2,1,0
Thank you in advance!
Along with ID, we can create another group that advances each time Path becomes 0, then get the first and last Date of each group. To include all the previous Path numbers, we check whether the collapsed value ends with 0 and, for those rows, also replace Date_Start with the first group's Date_Start.
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(group = lag(cumsum(Path == 0), default = 0)) %>%
  group_by(ID, group) %>%
  summarise(Date_Start = first(Date),
            Date_End = last(Date),
            Path = toString(Path)) %>%
  mutate(Path = paste_values(Path),
         Date_Start = replace(Date_Start, endsWith(Path, "0"), first(Date_Start))) %>%
  ungroup() %>%
  dplyr::select(-group) %>%
  mutate(Index = row_number())
# A tibble: 5 x 5
# ID Date_Start Date_End Path Index
# <int> <fct> <fct> <chr> <int>
#1 5 2018-10-05 2018-10-06 1, 0 1
#2 5 2018-10-05 2018-10-08 1, 0, 3, 0 2
#3 5 2018-10-06 2018-10-06 4 3
#4 7 2018-10-08 2018-10-08 5 4
#5 8 2018-10-07 2018-10-09 2, 1, 0 5
where I define the paste_values function as
paste_values <- function(value) {
  sapply(seq_along(value), function(x) {
    if (endsWith(value[x], "0")) toString(value[1:x])
    else value[x]
  })
}
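To see what the grouping step does, here is the group column computed for ID 5's rows in isolation, as a base-R sketch of `lag(cumsum(Path == 0), default = 0)`:

```r
# Paths for ID 5 in the order they appear in the source table
Path <- c(1, 0, 3, 0, 4)

# cumsum(Path == 0) increments at each 0; shifting it down one row
# (with 0 in front) makes the row *after* each 0 start a new group
grp <- c(0, head(cumsum(Path == 0), -1))
grp
# [1] 0 0 1 1 2
```

Rows with the same grp value are then summarised together per ID.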
Related
I have a dataset in long format. Every subject in the dataset was observed five times during the week. I have a column with the number of the day of the week in which the observation was supposed to happen/happened and another column with the actual dates of the observations. The latter column has some missing values. I would like to use the information on the first column to fill the missing values in the second column. Here is a toy dataset:
df <- data.frame(case = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
                 day = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
                 date = as.Date(c("2023-01-02", "2023-01-03", NA, NA, "2023-01-06",
                                  NA, "2021-05-11", "2021-05-12", "2021-05-13", NA)))
df
# case day date
# 1 1 2023-01-02
# 1 2 2023-01-03
# 1 3 <NA>
# 1 4 <NA>
# 1 5 2023-01-06
# 2 1 <NA>
# 2 2 2021-05-11
# 2 3 2021-05-12
# 2 4 2021-05-13
# 2 5 <NA>
And here is the desired output:
# case day date
#1 1 1 2023-01-02
#2 1 2 2023-01-03
#3 1 3 2023-01-04
#4 1 4 2023-01-05
#5 1 5 2023-01-06
#6 2 1 2021-05-10
#7 2 2 2021-05-11
#8 2 3 2021-05-12
#9 2 4 2021-05-13
#10 2 5 2021-05-14
Does this work for you? No linear models are used.
library(tidyverse)
df2 <- df %>%
  mutate(
    ref_date = case_when(
      case == 1 ~ as.Date("2023-01-01"),
      case == 2 ~ as.Date("2021-05-09")
    ),
    date2 = as.Date(day, origin = ref_date)
  )
Output:
> df2
case day date ref_date date2
1 1 1 2023-01-02 2023-01-01 2023-01-02
2 1 2 2023-01-03 2023-01-01 2023-01-03
3 1 3 <NA> 2023-01-01 2023-01-04
4 1 4 <NA> 2023-01-01 2023-01-05
5 1 5 2023-01-06 2023-01-01 2023-01-06
6 2 1 <NA> 2021-05-09 2021-05-10
7 2 2 2021-05-11 2021-05-09 2021-05-11
8 2 3 2021-05-12 2021-05-09 2021-05-12
9 2 4 2021-05-13 2021-05-09 2021-05-13
10 2 5 <NA> 2021-05-09 2021-05-14
I concede that G.G.'s answer has the advantage that you don't need to hardcode the reference date.
P.S. here is a pure tidyverse solution without any hardcoding:
df2 <- df %>%
  mutate(ref_date = date - day) %>%
  group_by(case) %>%
  fill(ref_date, .direction = "downup") %>%
  ungroup() %>%
  mutate(date2 = as.Date(day, origin = ref_date))
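The same no-hardcoding idea can be sketched in base R with ave(): date - day is constant within a case wherever date is known, so its per-case mean (ignoring NAs) recovers the reference date. Using the toy data from the question:

```r
df <- data.frame(case = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
                 day = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
                 date = as.Date(c("2023-01-02", "2023-01-03", NA, NA, "2023-01-06",
                                  NA, "2021-05-11", "2021-05-12", "2021-05-13", NA)))

# per-case reference date: mean of (date - day) over the non-missing rows
ref <- ave(as.numeric(df$date - df$day), df$case,
           FUN = function(x) mean(x, na.rm = TRUE))
df$date2 <- as.Date(ref + df$day, origin = "1970-01-01")
```

This fills the NAs the same way as the tidyverse pipeline above, with no packages.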
1) Convert case to factor and then use predict with lm to fill in the NA's. No packages are used.
within(df, {
  case <- factor(case)
  date <- .Date(predict(lm(date ~ case/day), data.frame(case, day)))
})
giving
case day date
1 1 1 2023-01-02
2 1 2 2023-01-03
3 1 3 2023-01-04
4 1 4 2023-01-05
5 1 5 2023-01-06
6 2 1 2021-05-10
7 2 2 2021-05-11
8 2 3 2021-05-12
9 2 4 2021-05-13
10 2 5 2021-05-14
2) Find the mean day and date and then use day to appropriately offset each row.
library(dplyr) # version 1.1.0 or later
df %>%
  mutate(date = {
    Mean <- Map(mean, na.omit(pick(date, day)))
    Mean$date + day - Mean$day
  }, .by = case)
I am trying to find the first date of each category, then subtract 5 days, AND I want to keep the days in between! This is where I am struggling. I tried seq(), but it gave me an error, so I'm not sure if this is the right way to do it.
I am able to get 5 days prior to my start date for each category, but I can't figure out how to get 0, 1, 2, 3, 4 AND 5 days prior to my start date!
The error I got is this (for the commented out part of the code):
Error in seq.default(., as.Date(first_day), by = "day", length.out = 5) :
'from' must be of length 1
Any help would be greatly appreciated!
library("lubridate")
library("dplyr")
library("tidyr")
data <- data.frame(date = c("2020-06-08", "2020-06-09", "2020-06-10", "2020-06-11",
                            "2020-06-12", "2021-07-13", "2021-07-14", "2021-07-15",
                            "2021-08-16", "2021-08-17", "2021-08-18", "2021-09-19",
                            "2021-09-20"),
                   value = c(2, 1, 7, 1, 0, 1, 2, 3, 4, 7, 6, 5, 10),
                   category = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4))
data$date <- as.Date(data$date)
View(data)
test_dates <- data %>%
  group_by(category) %>%
  arrange(date) %>%
  slice(1L) %>%                           # takes first date
  mutate(first_day = as.Date(date) - 5)   # %>%
  # seq(as.Date(first_day), by = "day", length.out = 5)
  # error for seq(): Error in seq.default(., as.Date(first_day), by = "day", length.out = 5) :
  #   'from' must be of length 1
head(test_dates)
The answer I'm looking for should include these dates, but in column format! I'm also trying to input NA in the value column if the value doesn't already exist. I want to keep all other columns, as the data frame I need to use this on has about 20 columns.
Dates: "2020-06-03", "2020-06-04", "2020-06-05", "2020-06-06", "2020-06-07", "2020-06-08", "2021-07-08", "2021-07-09", "2021-07-10", "2021-07-11", "2021-07-12", "2021-07-13", "2021-08-11", "2021-08-12", "2021-08-13", "2021-08-14", "2021-08-15", "2021-08-16", "2021-09-14", "2021-09-15", "2021-09-16", "2021-09-17", "2021-09-18", "2021-09-19"
Related question here: How do I subset my df for the minimum date based on one category and including x days before that?
Here's one approach, but it's kinda clunky:
bind_rows(
  data,
  data %>%
    group_by(category) %>%
    slice_min(date) %>%
    uncount(6, .id = "id") %>%
    mutate(date = date - id + 1) %>%
    select(-id)) %>%
  arrange(category, date)
Result
# A tibble: 37 × 3
date value category
<date> <dbl> <dbl>
1 2020-06-03 2 1
2 2020-06-04 2 1
3 2020-06-05 2 1
4 2020-06-06 2 1
5 2020-06-07 2 1
6 2020-06-08 2 1
7 2020-06-08 2 1
8 2020-06-09 1 1
9 2020-06-10 7 1
10 2020-06-11 1 1
# … with 27 more rows
This approach provides the row from each category with the minimum date, plus the five dates prior for each category (with value set to NA for these rows)
library(data.table)
setDT(data)[data[, .(date = seq(min(date) - 5, by = "day", length.out = 6)), category],
            on = .(category, date)]
Output:
date value category
1: 2020-06-03 NA 1
2: 2020-06-04 NA 1
3: 2020-06-05 NA 1
4: 2020-06-06 NA 1
5: 2020-06-07 NA 1
6: 2020-06-08 2 1
7: 2021-07-08 NA 2
8: 2021-07-09 NA 2
9: 2021-07-10 NA 2
10: 2021-07-11 NA 2
11: 2021-07-12 NA 2
12: 2021-07-13 1 2
13: 2021-08-11 NA 3
14: 2021-08-12 NA 3
15: 2021-08-13 NA 3
16: 2021-08-14 NA 3
17: 2021-08-15 NA 3
18: 2021-08-16 4 3
19: 2021-09-14 NA 4
20: 2021-09-15 NA 4
21: 2021-09-16 NA 4
22: 2021-09-17 NA 4
23: 2021-09-18 NA 4
24: 2021-09-19 5 4
date value category
Note: The above uses a join; an identical result can be achieved without a join by row-binding the first row for each category with the data.table generated similarly as above:
rbind(
  setDT(data)[order(date), .SD[1], category],
  data[, .(date = seq(min(date) - 5, by = "day", length.out = 5), value = NA), category]
)
You indicate you have many columns, so if you are going to take this second approach, rather than explicitly setting value = NA in the second input to rbind, you can also just leave it out and add fill = TRUE within rbind().
A dplyr version of the same is:
bind_rows(
  data %>%
    group_by(category) %>%
    slice_min(date) %>%
    ungroup() %>%
    mutate(date = as.Date(date)),
  data %>%
    group_by(category) %>%
    summarize(date = seq(min(as.Date(date)) - 5, by = "day", length.out = 5),
              .groups = "drop")
)
Output:
# A tibble: 24 x 3
date value category
<date> <dbl> <dbl>
1 2020-06-08 2 1
2 2021-07-13 1 2
3 2021-08-16 4 3
4 2021-09-19 5 4
5 2020-06-03 NA 1
6 2020-06-04 NA 1
7 2020-06-05 NA 1
8 2020-06-06 NA 1
9 2020-06-07 NA 1
10 2021-07-08 NA 2
# ... with 14 more rows
Update (9/21/22) -
If you want the NA values to be filled, simply add this to the end of either data.table pipeline:
...[,value:=max(value, na.rm=T), category]
or add this to the dplyr pipeline
... %>%
group_by(category) %>%
mutate(value=max(value, na.rm=T))
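For reference, the same per-group fill can be sketched in base R with ave() (toy vectors for illustration, not the question's data):

```r
value    <- c(NA, NA, 2, NA, 7)   # hypothetical toy input: one known value per group
category <- c(1, 1, 1, 2, 2)

# broadcast each group's maximum (ignoring NAs) to all of its rows
filled <- ave(value, category, FUN = function(x) max(x, na.rm = TRUE))
filled
# [1] 2 2 2 7 7
```

Like the pipelines above, this assumes every category has at least one non-NA value.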
@Jon Spring's answer inspired this alternative approach:
Here we first get the first day minus 5, as already presented in the question. Then we use bind_rows, as Jon Spring does in his answer. The next step is to identify the original first dates within the date column (we use !duplicated inside filter). The last main step is to use coalesce:
library(lubridate)
library(dplyr)
data %>%
  group_by(category) %>%
  mutate(x = min(ymd(date)) - 5) %>%
  slice(1) %>%
  bind_rows(data) %>%
  mutate(date = ymd(date)) %>%
  filter(!duplicated(date)) %>%
  mutate(x = coalesce(x, date)) %>%
  arrange(category) %>%
  select(date = x, value)
category date value
<dbl> <date> <dbl>
1 1 2020-06-03 2
2 1 2020-06-09 1
3 1 2020-06-10 7
4 1 2020-06-11 1
5 1 2020-06-12 0
6 2 2021-07-08 1
7 2 2021-07-14 2
8 2 2021-07-15 3
9 3 2021-08-11 4
10 3 2021-08-17 7
11 3 2021-08-18 6
12 4 2021-09-14 5
13 4 2021-09-20 10
I have a dataset with ID numbers, dates, and test results, and need to create a final dataset where each row consists of a unique ID, date, and test result value. How can I find duplicates based on ID and date, and then keep rows based on a specific test result value?
df <- data.frame(id_number = c(1, 1, 2, 2, 3, 3, 3, 4),
                 date = c('2021-11-03', '2021-11-19', '2021-11-11', '2021-11-11',
                          '2021-11-05', '2021-11-05', '2021-11-16', '2021-11-29'),
                 result = c(0, 1, 0, 0, 0, 9, 0, 9))
id_number date result
<dbl> <chr> <dbl>
1 1 2021-11-03 0
2 1 2021-11-19 1
3 2 2021-11-11 0
4 2 2021-11-11 0
5 3 2021-11-05 0
6 3 2021-11-05 9
7 3 2021-11-16 0
8 4 2021-11-29 9
df <- unique(df)
After using the unique function, I am still left with rows that have duplicate id_number and date, and different test results. Of these, I need to keep only the row that equals 0 or 1, and exclude any 9s.
In the example below, I'd want to keep row 4 and exclude row 5. I can't simply exclude rows where result = 9 because I want to keep those for any non-duplicate observations.
id_number date result
<dbl> <chr> <dbl>
1 1 2021-11-03 0
2 1 2021-11-19 1
3 2 2021-11-11 0
4 3 2021-11-05 0
5 3 2021-11-05 9
6 3 2021-11-16 0
7 4 2021-11-29 9
You can do:
library(tidyverse)
df %>%
  group_by(id_number, date) %>%
  filter(!(result == 9 & row_number() > 1)) %>%
  ungroup()
# A tibble: 6 x 3
id_number date result
<dbl> <chr> <dbl>
1 1 2021-11-03 0
2 1 2021-11-19 1
3 2 2021-11-11 0
4 3 2021-11-05 0
5 3 2021-11-16 0
6 4 2021-11-29 9
For simplicity of understanding, use:
a) keep only the rows different from 9:
df <- subset(df, result != 9)
And then
b) remove the duplicates:
df <- subset(df, !duplicated(df))
So if you want specific columns:
df <- subset(df, !duplicated(result))
Or:
df <- subset(df, !duplicated(df[, 2:3]))
This may be complicated and hard to explain. Let's say I have a data frame with 4 columns: date, id, response_1, and response_2. The id column holds unique subject identifiers, response_1 contains values of 1 and 0, and response_2 is derived from response_1: as long as an id has only had 0s in response_1, response_2 is 0, but once the id has had a 1 in response_1, response_2 stays 1 regardless of later response_1 values (please see id 1 and 3).
sample <- data.frame(date = c("2020-04-17", "2020-04-17", "2020-04-17",
                              "2020-05-13", "2020-05-13", "2020-05-13",
                              "2020-06-12", "2020-06-12", "2020-06-12",
                              "2020-06-19", "2020-06-19"),
                     id = c(1, 2, 3, 1, 2, 3, 1, 3, 4, 5, 1),
                     response_1 = c(0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1),
                     response_2 = c(0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1))
date id response_1 response_2
1 2020-04-17 1 0 0
2 2020-04-17 2 1 1
3 2020-04-17 3 0 0
4 2020-05-13 1 1 1
5 2020-05-13 2 0 1
6 2020-05-13 3 1 1
7 2020-06-12 1 0 1
8 2020-06-12 3 0 1
9 2020-06-12 4 0 0
10 2020-06-19 5 1 1
11 2020-06-19 1 1 1
What I want to calculate from this dataset is, for each day, how many unique ids we had seen so far and how many had turned into 1 since the beginning of the dataset. For instance, by June 12 we had a total of 4 unique ids (1, 2, 3, and 4) in the whole dataset, and 3 of them had turned into 1 (ids 1, 2, and 3); id 4 was still 0.
Like this:
result <- data.frame(date = c("04-17-2020", "05-13-2020", "06-12-2020", "06-19-2020"),
                     count_id = c(3, 3, 4, 5),
                     total = c(1, 3, 3, 4))
date count_id total
1 04-17-2020 3 1
2 05-13-2020 3 3
3 06-12-2020 4 3
4 06-19-2020 5 4
What will be the best way to accomplish this in R?
You can use duplicated with cumsum to get the cumulative count of unique ids, and take the cumsum of response_1 after keeping only each id's first 1. For each date we then select the last row to get the final counts.
library(dplyr)
sample %>%
  group_by(id) %>%
  mutate(response_11 = response_1 * as.integer(!duplicated(response_1))) %>%
  ungroup() %>%
  mutate(count_id = cumsum(!duplicated(id)),
         total = cumsum(response_11)) %>%
  group_by(date) %>%
  slice(n()) %>%
  select(date, count_id, total)
# date count_id total
# <chr> <int> <dbl>
#1 2020-04-17 3 1
#2 2020-05-13 3 3
#3 2020-06-12 4 3
#4 2020-06-19 5 4
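As a side note, response_2 itself can be reconstructed from response_1 with a grouped cumulative maximum, assuming rows are already in date order. A base-R sketch using the sample data from the question:

```r
sample <- data.frame(date = c("2020-04-17", "2020-04-17", "2020-04-17",
                              "2020-05-13", "2020-05-13", "2020-05-13",
                              "2020-06-12", "2020-06-12", "2020-06-12",
                              "2020-06-19", "2020-06-19"),
                     id = c(1, 2, 3, 1, 2, 3, 1, 3, 4, 5, 1),
                     response_1 = c(0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1),
                     response_2 = c(0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1))

# once an id has seen a 1 in response_1, cummax keeps it at 1 from then on
chk <- ave(sample$response_1, sample$id, FUN = cummax)
identical(chk, sample$response_2)
# [1] TRUE
```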
I have a data frame in the below format and I'm trying to find the difference in time between the Event 'ASSIGNED' and the last time the Event is 'CREATED' that comes before it.
**AccountID** **TIME** **EVENT**
1 2016-11-08T01:54:15.000Z CREATED
1 2016-11-09T01:54:15.000Z ASSIGNED
1 2016-11-10T01:54:15.000Z CREATED
1 2016-11-11T01:54:15.000Z CALLED
1 2016-11-12T01:54:15.000Z ASSIGNED
1 2016-11-12T01:54:15.000Z SLEEP
Currently my code is as follows; my difficulty is selecting the CREATED that comes just before the ASSIGNED event:
test <- timetable.filter %>%
  group_by(AccountID) %>%
  mutate(timeToAssign = ifelse(EVENT == 'ASSIGNED',
                               interval(ymd_hms(TIME),
                                        max(ymd_hms(TIME[EVENT == 'CREATED']))) %/% hours(1),
                               NA))
I'm looking for the output to be
**AccountID** **TIME** **EVENT** **timeToAssign**
1 2016-11-08T01:54:15.000Z CREATED NA
1 2016-11-09T01:54:15.000Z ASSIGNED 24
1 2016-11-10T01:54:15.000Z CREATED NA
1 2016-11-11T01:54:15.000Z CALLED NA
1 2016-11-12T01:54:15.000Z ASSIGNED 48
1 2016-11-12T01:54:15.000Z SLEEP NA
With dplyr and tidyr:
library(dplyr); library(tidyr); library(anytime)
df %>%
  group_by(AccountID) %>%
  mutate(CREATED_INDEX = if_else(EVENT == 'CREATED', row_number(), NA_integer_),
         TIME = anytime(TIME)) %>%
  fill(CREATED_INDEX) %>%
  mutate(TimeToAssign = if_else(EVENT == 'ASSIGNED',
                                as.numeric(TIME - TIME[CREATED_INDEX], units = 'hours'),
                                NA_real_)) %>%
  select(-CREATED_INDEX)
# A tibble: 6 x 4
# Groups: AccountID [1]
# AccountID TIME EVENT TimeToAssign
# <int> <dttm> <fctr> <dbl>
#1 1 2016-11-08 01:54:15 CREATED NA
#2 1 2016-11-09 01:54:15 ASSIGNED 24
#3 1 2016-11-10 01:54:15 CREATED NA
#4 1 2016-11-11 01:54:15 CALLED NA
#5 1 2016-11-12 01:54:15 ASSIGNED 48
#6 1 2016-11-12 01:54:15 SLEEP NA
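The same carry-forward-the-last-CREATED-index idea can be sketched in base R for a single AccountID (toy data mirroring the question; this is an illustration, not the answer's code):

```r
df <- data.frame(
  TIME  = as.POSIXct(c("2016-11-08 01:54:15", "2016-11-09 01:54:15", "2016-11-10 01:54:15",
                       "2016-11-11 01:54:15", "2016-11-12 01:54:15", "2016-11-12 01:54:15"),
                     tz = "UTC"),
  EVENT = c("CREATED", "ASSIGNED", "CALLED", "CREATED", "ASSIGNED", "SLEEP"))
df$EVENT <- c("CREATED", "ASSIGNED", "CREATED", "CALLED", "ASSIGNED", "SLEEP")

# row index of the most recent CREATED seen so far (0 before any CREATED)
idx <- cummax(ifelse(df$EVENT == "CREATED", seq_len(nrow(df)), 0L))

df$TimeToAssign <- ifelse(df$EVENT == "ASSIGNED" & idx > 0,
                          as.numeric(difftime(df$TIME, df$TIME[pmax(idx, 1)],
                                              units = "hours")),
                          NA_real_)
df$TimeToAssign
# [1] NA 24 NA NA 48 NA
```

This reproduces the 24- and 48-hour gaps from the dplyr/tidyr answer above; for multiple accounts the cummax step would need to run per group.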