Shifting a date-based analysis by a year: process questions

I am doing a migration study indexed to a specific event. To create the dataset, I basically subset a larger dataset to a specific date, then made flags based on additional dates, then added in information. In total, this takes 7 scripts. Now I want to create a comparison dataset with the same information but indexed to two years earlier.
My question is: is there an easy way to use the same scripts and tell R to treat all the dates as two years earlier, or do I have to duplicate the code and edit each date in line? Here's a very basic example of some of the code I'm using to generate the dataset from a larger framework:
#example of things I'd want shifted 2 years
df <- subset(df, DATE_AFTER > as.Date("2016-09-27"))
df$flag <- with(df,
                as.numeric(DATE_BEFORE < as.Date("2016-09-28") &
                           DATE_AFTER > as.Date("2016-09-27")))
df
# ID DATE_BEFORE DATE_AFTER flag
# 1 A 2013-01-23 2018-01-23 1
# 3 C 2018-01-23 2020-01-23 0
# 5 E 2011-01-23 2019-01-23 1
# 6 F 2010-01-23 2019-01-23 1
# 7 G 2017-01-23 2018-01-23 0
Dummy data:
df <- data.frame(ID=c("A", "B", "C", "D", "E", "F", "G"),
DATE_BEFORE=as.Date(c("2013-01-23", "2010-01-23", "2018-01-23",
"2014-01-23", "2011-01-23", "2010-01-23",
"2017-01-23")),
DATE_AFTER=as.Date(c("2018-01-23", "2016-01-23", "2020-01-23",
"2016-01-23", "2019-01-23", "2019-01-23",
"2018-01-23")))

Just wrap it in a function. To subtract years we can use as.POSIXlt, as shown in this answer.
my_df_subset <- \(date, subtract_yr = 0L) {
  dt <- as.POSIXlt(paste0(date, '-01'))  # parse the index date
  dt$year <- dt$year - subtract_yr       # shift back by whole years
  dt <- as.Date(dt)
  transform(subset(df, DATE_AFTER > dt),
            flag = as.numeric(DATE_BEFORE < dt + 1L &
                              DATE_AFTER > dt))
}
my_df_subset("2016-09-27")
# ID DATE_BEFORE DATE_AFTER flag
# 1 A 2013-01-23 2018-01-23 1
# 3 C 2018-01-23 2020-01-23 0
# 5 E 2011-01-23 2019-01-23 1
# 6 F 2010-01-23 2019-01-23 1
# 7 G 2017-01-23 2018-01-23 0
my_df_subset("2016-09-27", 2L) ## two years earlier
# ID DATE_BEFORE DATE_AFTER flag
# 1 A 2013-01-23 2018-01-23 1
# 2 B 2010-01-23 2016-01-23 1
# 3 C 2018-01-23 2020-01-23 0
# 4 D 2014-01-23 2016-01-23 1
# 5 E 2011-01-23 2019-01-23 1
# 6 F 2010-01-23 2019-01-23 1
# 7 G 2017-01-23 2018-01-23 0
Note: R >= 4.1 is required (for the \(...) lambda syntax).
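If lubridate is available, period arithmetic is an alternative sketch for the year shift (this is an addition, not part of the answer above; note that subtracting years() from Feb 29 gives NA in a non-leap year):
library(lubridate)
# Subtracting a whole-year period shifts the index date directly.
as.Date("2016-09-27") - years(2)
# [1] "2014-09-27"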
Data:
df <- structure(list(ID = c("A", "B", "C", "D", "E", "F", "G"), DATE_BEFORE = structure(c(15728,
14632, 17554, 16093, 14997, 14632, 17189), class = "Date"), DATE_AFTER = structure(c(17554,
16823, 18284, 16823, 17919, 17919, 17554), class = "Date")), class = "data.frame", row.names = c(NA,
-7L))

Related

How to calculate days between occurrences by groups

I'm struggling with how to calculate the number of days between occurrences: I need to work out how many days pass between maintenances on a piece of equipment.
I have a data frame with many pieces of equipment and the dates of their maintenance, and I need to calculate the days between maintenances for each piece of equipment. Here is a toy example:
test = data.frame(car = c("A", "A", "B", "B", "B", "C", "C", "D", "D", "D", "E"),
                  maintenance_date = c("20-09-2020", "25-09-2020", "14-05-2020", "20-05-2020",
                                       "20-05-2021", "11-01-2021", "13-01-2021", "13-01-2021",
                                       "15-01-2021", "15-01-2021", "13-01-2021"))
#test
# car maintenance_date
#1 A 20-09-2020
#2 A 25-09-2020
#3 B 14-05-2020
#4 B 20-05-2020
#5 B 20-05-2021
#6 C 11-01-2021
#7 C 13-01-2021
#8 D 13-01-2021
#9 D 15-01-2021
#10 D 15-01-2021
#11 E 13-01-2021
#for result, I'd like something like:
result
# car maintenance_date
#1 A 5
#2 B 6
#3 B 365
#4 C 2
#5 D 2
#6 D 0
I thought of using something like test %>% arrange(maintenance_date) %>% group_by(car) %>% ....
Any hint on how I can do that?
We need to convert to Date class before doing the arrange, then group_by 'car' and take the difference:
library(dplyr)
library(lubridate)
test %>%
  mutate(maintenance_date = dmy(maintenance_date)) %>%
  arrange(maintenance_date) %>%
  group_by(car) %>%
  summarise(maintenance_date = diff(maintenance_date), .groups = 'drop')
Output:
# A tibble: 6 × 2
car maintenance_date
<chr> <drtn>
1 A 5 days
2 B 6 days
3 B 365 days
4 C 2 days
5 D 2 days
6 D 0 days
data.table
library(data.table)
setDT(test)
test[, maintenance_date := as.Date(maintenance_date, format="%d-%m-%Y")
][, .(ndays = diff(maintenance_date)), by = car]
# car ndays
# <char> <difftime>
# 1: A 5 days
# 2: B 6 days
# 3: B 365 days
# 4: C 2 days
# 5: D 2 days
# 6: D 0 days
Another solution, tidyverse-based:
library(tidyverse)
library(lubridate)
test = data.frame(car = c("A", "A", "B", "B", "B", "C", "C", "D", "D", "D", "E"),
                  maintenance_date = c("20-09-2020", "25-09-2020", "14-05-2020", "20-05-2020",
                                       "20-05-2021", "11-01-2021", "13-01-2021", "13-01-2021",
                                       "15-01-2021", "15-01-2021", "13-01-2021"))
test %>%
  group_by(car) %>%
  mutate(maintenance_date = c(-1, diff(dmy(maintenance_date)))) %>%
  filter(maintenance_date >= 0) %>%
  ungroup()
#> # A tibble: 6 × 2
#> # Groups: car [4]
#> car maintenance_date
#> <chr> <dbl>
#> 1 A 5
#> 2 B 6
#> 3 B 365
#> 4 C 2
#> 5 D 2
#> 6 D 0
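For completeness, a base R sketch of the same per-car differences (no packages; this is an addition and assumes the test data frame from the question):
# Convert to Date, order within each car, then take per-car differences.
test$maintenance_date <- as.Date(test$maintenance_date, format = "%d-%m-%Y")
test <- test[order(test$car, test$maintenance_date), ]
diffs <- lapply(split(test$maintenance_date, test$car),
                function(d) as.numeric(diff(d)))
data.frame(car = rep(names(diffs), lengths(diffs)),
           ndays = unlist(diffs, use.names = FALSE))
#   car ndays
# 1   A     5
# 2   B     6
# 3   B   365
# 4   C     2
# 5   D     2
# 6   D     0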

How to add a boolean value to a column when 2 different dataframes match on 2 columns in R?

I have 2 different dataframes. I want to add a column to my second dataframe and have it assigned a value 0 or 1. In the case where df1$code == df2$code & df1$date == df2$date I want a 1 for these rows, and 0 otherwise. A visual and reproducible example maybe makes it easier to understand.
df1 <- data.frame(code = c("A", "B", "C", "D"), date = c(1,2,3,4))
df2 <- data.frame(code = c("A", "B", "E", "R", "V", "F"), date = c(1,2,3,4,5,6))
df3 <- data.frame(code = c("A", "B", "E", "R", "V", "F"), date = c(1,2,3,4,5,6), value =c(1,1,0,0,0,0))
DF1
code date
1 A 1
2 B 2
3 C 3
4 D 4
DF2
code date
1 A 1
2 B 2
3 E 3
4 R 4
5 V 5
6 F 6
The resulting DF I want
code date value
1 A 1 1
2 B 2 1
3 E 3 0
4 R 4 0
5 V 5 0
6 F 6 0
We can use %in% to create a logical vector and then coerce it to binary with as.integer or unary +:
df2$value <- +(df2$code %in% df1$code)
df2
# code date value
#1 A 1 1
#2 B 2 1
#3 E 3 0
#4 R 4 0
#5 V 5 0
#6 F 6 0
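Note that %in% here matches on code alone; in this example that is enough because matching codes also have matching dates. If a row should only get a 1 when both code and date match, one base R sketch (an addition, not part of the answer above) is to compare pasted keys:
# Build a combined code-date key per row and test membership against df1.
df2$value <- +(paste(df2$code, df2$date) %in% paste(df1$code, df1$date))
df2
#  code date value
#1    A    1     1
#2    B    2     1
#3    E    3     0
#4    R    4     0
#5    V    5     0
#6    F    6     0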
I would do it like this:
library(dplyr)
df2 %>%
  left_join(mutate(df1, value = 1)) %>%
  mutate(value = coalesce(value, 0))
# Joining, by = c("code", "date")
# code date value
# 1 A 1 1
# 2 B 2 1
# 3 E 3 0
# 4 R 4 0
# 5 V 5 0
# 6 F 6 0

Compare date by group in two data frames in R

I have one data frame containing event date by id:
data.frame(id = c("a", "a", "a", "d", "d"),
date = as.Date(c("2018-01-03", "2018-02-02", "2018-02-22", "2018-02-13", "2018-05-01")))
id date
1 a 2018-01-03
2 a 2018-02-02
3 a 2018-02-22
4 d 2018-02-13
5 d 2018-05-01
And another one containing start and end of periods by id:
data.frame(id = c("a", "a", "d", "d", "d", "d"),
start = as.Date(c("2018-01-15", "2018-01-30", "2018-03-01", "2018-02-01", "2018-04-02", "2018-03-19")),
end = as.Date(c("2018-01-18", "2018-02-10", "2018-03-03", "2018-04-22", "2018-05-23", "2018-08-29")))
id start end
1 a 2018-01-15 2018-01-18
2 a 2018-01-30 2018-02-10
3 d 2018-03-01 2018-03-03
4 d 2018-02-01 2018-04-22
5 d 2018-04-02 2018-05-23
6 d 2018-03-19 2018-08-29
For each id, I need to count the number of periods from the second data frame to which each date in the first data frame belongs.
My desired dataframe would be:
id date n
1 a 2018-01-03 0 # does not belong to any period
2 a 2018-02-02 1 # belongs to [2018-01-30,2018-02-10]
3 a 2018-02-22 0 # does not belong to any period
4 d 2018-02-13 1 # belongs to [2018-02-01,2018-04-22]
5 d 2018-05-01 2 # belongs to [2018-04-02,2018-05-23] and [2018-03-19,2018-08-29]
My problem is not about the date comparison or summing the results. My problem is about performing that analysis within each id group. I guess there is a way using split and/or the apply family, but I have not found it.
How can I do it in base R? I work in a restrictive environment where I only have access to base R.
A base R approach (with the events data frame as df1 and the periods data frame as df2):
temp <- subset(merge(df1, df2), date >= start & date <= end, select = "date")
df1$n <- sapply(df1$date, function(x) length(temp$date[temp$date == x]))
# id date n
# 1 a 2018-01-03 0
# 2 a 2018-02-02 1
# 3 a 2018-02-22 0
# 4 d 2018-02-13 1
# 5 d 2018-05-01 2
Another base R approach:
dates <- data.frame(id = c("a", "a", "a", "d", "d"),
date = as.Date(c("2018-01-03", "2018-02-02", "2018-02-22", "2018-02-13", "2018-05-01")))
periods <- data.frame(id = c("a", "a", "d", "d", "d", "d"),
start = as.Date(c("2018-01-15", "2018-01-30", "2018-03-01", "2018-02-01", "2018-04-02", "2018-03-19")),
end = as.Date(c("2018-01-18", "2018-02-10", "2018-03-03", "2018-04-22", "2018-05-23", "2018-08-29")))
df <- transform(merge(dates, periods), belongs = date >= start & date <= end)
aggregate(belongs ~ date + id, data = df, sum)
# date id belongs
# 1 2018-01-03 a 0
# 2 2018-02-02 a 1
# 3 2018-02-22 a 0
# 4 2018-02-13 d 1
# 5 2018-05-01 d 2
Or using data.table:
library(data.table)
dt <- as.data.table(merge(dates, periods))
dt[, .(n = sum(date >= start & date <= end)), by=c("id","date")]
# id date n
# 1: a 2018-01-03 0
# 2: a 2018-02-02 1
# 3: a 2018-02-22 0
# 4: d 2018-02-13 1
# 5: d 2018-05-01 2
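If the cartesian merge is a concern on larger data, a minimal base R sketch that counts matches row by row (an addition, using the dates/periods names from the answer above):
# For each event row, count the periods of the same id that contain the date.
dates$n <- vapply(seq_len(nrow(dates)), function(i) {
  sum(periods$id == dates$id[i] &
      periods$start <= dates$date[i] &
      periods$end >= dates$date[i])
}, integer(1))
dates
#   id       date n
# 1  a 2018-01-03 0
# 2  a 2018-02-02 1
# 3  a 2018-02-22 0
# 4  d 2018-02-13 1
# 5  d 2018-05-01 2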

How to bind additional rows to dataframe for column totals? [duplicate]

I'm trying to add additional rows to my data table with the column totals, so that when I display it with ggplot I am able to filter by "Total" for my selectInput in my Shiny app. However, because I have various data types (i.e. date, string and numeric), it is a bit more complicated.
Here's a sample df:
data.frame(
Date = rep(seq(as.Date("2018-01-01"), by= "1 day", length.out= 3), 3),
Company = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
Attr_1 = c("AB", "AC", "AD", "AB", "AC", "AD", "AB", "AC", "AD"),
Attr_2 = c(1,2,3,4,5,6,7,8,9)
)
Here's what I'm hoping to achieve:
Date Company Attr_1 Attr_2
2018-01-01 A AB 1
2018-01-02 A AC 2
2018-01-03 A AD 3
2018-01-01 B AB 4
2018-01-02 B AC 5
2018-01-03 B AD 6
2018-01-01 C AB 7
2018-01-02 C AC 8
2018-01-03 C AD 9
2018-01-01 Total AB 12
2018-01-02 Total AC 15
2018-01-03 Total AD 18
Does anyone have an easy solution for this? All I can think of is to calculate the colSums manually and then rbind back into this data frame. Is there a simpler solution?
df = data.frame(
Company = c("A", "B", "C", "D", "A", "B"),
Attr_1 = c(12,13,14,14,3,5),
Attr_2 = c(1,2,3,4,5,4)
)
library(dplyr)
bind_rows(df, df %>%
            summarise_at(vars(matches("Attr")), funs(sum)) %>%
            mutate(Company = "Total"))
# Company Attr_1 Attr_2
# 1 A 12 1
# 2 B 13 2
# 3 C 14 3
# 4 D 14 4
# 5 A 3 5
# 6 B 5 4
# 7 Total 61 19
Solution to your edit:
df %>%
  group_by(Date, Attr_1) %>%
  summarise(Attr_2 = sum(Attr_2),
            Company = "Total") %>%
  ungroup() %>%
  bind_rows(df, .)
A solution that works even if there is a 'W' company.
data.frame(
Company = c("A", "B", "W", "D", "A", "B"),
Attr_1 = c(12,13,14,14,3,5),
Attr_2 = c(1,2,3,4,5,4),
stringsAsFactors=FALSE
) -> df
df %>%
  summarise_if(is.numeric, sum) %>%
  mutate(Company = 'Total') %>%
  bind_rows(df, .)
# Company Attr_1 Attr_2
#1 A 12 1
#2 B 13 2
#3 W 14 3
#4 D 14 4
#5 A 3 5
#6 B 5 4
#7 Total 61 19
Here's a base R solution:
df <- data.frame(
Company = c("A", "B", "C", "D", "A", "B"),
Attr_1 = c(12,13,14,14,3,5),
Attr_2 = c(1,2,3,4,5,4)
)
rbind(df, data.frame(Company = "Total", Attr_1 = sum(df$Attr_1), Attr_2 = sum(df$Attr_2)))
Output:
Company Attr_1 Attr_2
1 A 12 1
2 B 13 2
3 C 14 3
4 D 14 4
5 A 3 5
6 B 5 4
7 Total 61 19
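For the per-Date totals asked for in the question, a base R sketch (an addition; it assumes the question's sample data with Date, Company, Attr_1 and Attr_2 is stored as df):
# Sum Attr_2 per Date/Attr_1 combination, label it as Company "Total",
# and bind it back onto the original rows.
totals <- aggregate(Attr_2 ~ Date + Attr_1, data = df, FUN = sum)
totals$Company <- "Total"
rbind(df, totals[names(df)])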
I find adorn_totals from the janitor package very useful for this (and other) tasks:
library( janitor )
df %>% adorn_totals()
# Company Attr_1 Attr_2
# A 12 1
# B 13 2
# C 14 3
# D 14 4
# A 3 5
# B 5 4
# Total 61 19

Sliding window in R for dates

I am working on a project that requires me to create a flag field if an item has n occurrences (on different dates) within a 30-day window (for each TYPE). If it did occur n times within the window, all of the relevant dates should be flagged with 1; if not, they should be 0.
My data looks something like this:
a <- data.frame("TYPE" = c("A", "A", "B", "B",
"C", "C", "C", "C",
"D", "D", "D", "D"),
"DATE" = c("4/20/2018 11:47",
"4/25/2018 7:21",
"4/15/2018 6:11",
"4/19/2018 4:22",
"4/15/2018 17:46",
"4/16/2018 11:59",
"4/20/2018 7:50",
"4/26/2018 2:55",
"4/27/2018 11:46",
"4/27/2018 13:03",
"4/20/2018 7:31",
"4/22/2018 9:45"))
After sorting first by TYPE and then by DATE, and setting n <- 4, this would be my expected output:
It is important to note that there will be multiple dates that are the same date for a type, and this will need a rolling/moving window for each TYPE.
I am looking for assistance with how to approach this problem.
Would I be able to use the SlidingWindow function in R and then define my own function?
A dplyr::inner_join-based approach could be:
library(dplyr)
a %>%
  mutate(DATE = as.POSIXct(DATE, format = "%m/%d/%Y %H:%M")) %>%
  inner_join(., ., by = "TYPE") %>%
  group_by(TYPE, DATE.x) %>%
  summarise(FLAG = as.integer(sum(abs((DATE.x - DATE.y) / (24*60*60)) <= 30) >= 4))
# # A tibble: 12 x 3
# # Groups: TYPE [?]
# TYPE DATE.x FLAG
# <fctr> <dttm> <int>
# 1 A 2018-04-20 11:47:00 0
# 2 A 2018-04-25 07:21:00 0
# 3 B 2018-04-15 06:11:00 0
# 4 B 2018-04-19 04:22:00 0
# 5 C 2018-04-15 17:46:00 1
# 6 C 2018-04-16 11:59:00 1
# 7 C 2018-04-20 07:50:00 1
# 8 C 2018-04-26 02:55:00 1
# 9 D 2018-04-20 07:31:00 1
# 10 D 2018-04-22 09:45:00 1
# 11 D 2018-04-27 11:46:00 1
# 12 D 2018-04-27 13:03:00 1
Note: the 4 records for TYPE = D are also within the 30-day range, hence FLAG should be set to 1 for TYPE D as well.
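A base R sketch of the same logic without the self-join, using ave (an addition; it assumes n <- 4 and a 30-day window, as in the question):
a$DATE <- as.POSIXct(a$DATE, format = "%m/%d/%Y %H:%M")
n <- 4
window <- 30 * 24 * 60 * 60   # 30 days in seconds
# For each TYPE, flag a record when at least n records (itself included)
# fall within the window around its date.
a$FLAG <- ave(as.numeric(a$DATE), a$TYPE, FUN = function(d)
  as.integer(sapply(d, function(x) sum(abs(d - x) <= window)) >= n))
a[order(a$TYPE, a$DATE), ]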
