I am working on a project that requires me to create a flag field if an item has n number of occurrences (on different dates) within a 30-day window (for each TYPE). If it did occur n times within the window then it would flag all of the relevant dates with 1 and if it did not occur then the dates would be 0.
My data looks something like this:
a <- data.frame("TYPE" = c("A", "A", "B", "B",
"C", "C", "C", "C",
"D", "D", "D", "D"),
"DATE" = c("4/20/2018 11:47",
"4/25/2018 7:21",
"4/15/2018 6:11",
"4/19/2018 4:22",
"4/15/2018 17:46",
"4/16/2018 11:59",
"4/20/2018 7:50",
"4/26/2018 2:55",
"4/27/2018 11:46",
"4/27/2018 13:03",
"4/20/2018 7:31",
"4/22/2018 9:45"))
After sorting first by TYPE and then by DATE and then saying n <- 4 this would be my expected output:
It is important to note that there will be multiple dates that are the same date for a type, and this will need a rolling/moving window for each TYPE.
I am looking for assistance with how to approach this problem.
Would I be able to use the SlidingWindow function in R and then define my own function?
A dplyr::inner_join-based approach could be:
library(dplyr)
a %>% mutate(DATE = as.POSIXct(DATE, format = "%m/%d/%Y %H:%M")) %>%
inner_join(.,., by="TYPE") %>%
group_by(TYPE, DATE.x) %>%
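# compare in explicit day units instead of relying on difftime's automatic unit
summarise(FLAG = as.integer(sum(abs(as.numeric(difftime(DATE.x, DATE.y, units = "days"))) <= 30) >= 4))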
# # A tibble: 12 x 3
# # Groups: TYPE [?]
# TYPE DATE.x FLAG
# <fctr> <dttm> <int>
# 1 A 2018-04-20 11:47:00 0
# 2 A 2018-04-25 07:21:00 0
# 3 B 2018-04-15 06:11:00 0
# 4 B 2018-04-19 04:22:00 0
# 5 C 2018-04-15 17:46:00 1
# 6 C 2018-04-16 11:59:00 1
# 7 C 2018-04-20 07:50:00 1
# 8 C 2018-04-26 02:55:00 1
# 9 D 2018-04-20 07:31:00 1
# 10 D 2018-04-22 09:45:00 1
# 11 D 2018-04-27 11:46:00 1
# 12 D 2018-04-27 13:03:00 1
Note: The 4 records for TYPE = D also fall within a 30-day range. Hence, FLAG is set to 1 for TYPE D as well.
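If repeated records on the same calendar day should count only once (the question asks for occurrences "on different dates"), a base R variant can count distinct days rather than rows. This is only a sketch under that reading: the 30-day comparison mirrors the join above, and the helper name flag_type is made up for illustration. Under this interpretation TYPE D has only three distinct days, so it would not be flagged.

```r
# Sample data from the question
a <- data.frame(
  TYPE = c("A", "A", "B", "B", "C", "C", "C", "C", "D", "D", "D", "D"),
  DATE = c("4/20/2018 11:47", "4/25/2018 7:21", "4/15/2018 6:11", "4/19/2018 4:22",
           "4/15/2018 17:46", "4/16/2018 11:59", "4/20/2018 7:50", "4/26/2018 2:55",
           "4/27/2018 11:46", "4/27/2018 13:03", "4/20/2018 7:31", "4/22/2018 9:45"),
  stringsAsFactors = FALSE
)
a$DATE <- as.POSIXct(a$DATE, format = "%m/%d/%Y %H:%M", tz = "UTC")
a <- a[order(a$TYPE, a$DATE), ]
n <- 4

# Hypothetical helper: for each day in one TYPE, count the distinct days
# lying within 30 days of it and compare that count with n.
flag_type <- function(d) {
  vapply(d, function(x) as.numeric(length(unique(d[abs(d - x) <= 30])) >= n),
         numeric(1))
}

day <- as.numeric(as.Date(a$DATE, tz = "UTC"))  # calendar day as a day count
a$FLAG <- ave(day, a$TYPE, FUN = flag_type)
a
```

Swapping length(unique(...)) for a plain row count would reproduce the row-based behaviour of the join answer.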
I am doing a migration study indexed to a specific event. To create the dataset, I basically subset a larger dataset to a specific date, and then made flags based on additional dates, then added in information. In total, this takes 7 scripts. Now, I want to create a comparison dataset, with the same information but indexed to two years earlier.
My question is: is there an easy way to reuse the same scripts and just tell R to treat all the dates as two years earlier, or do I have to duplicate the code and edit each date in line? Here's a very basic example of some of the code I'm using to generate the dataset from a larger framework:
#example of things I'd want shifted 2 years
df <- subset(df, DATE_AFTER > as.Date("2016-09-27"))
df$flag <- with(df,
as.numeric(DATE_BEFORE < as.Date("2016-09-28") &
DATE_AFTER > as.Date("2016-09-27")))
df
# ID DATE_BEFORE DATE_AFTER flag
# 1 A 2013-01-23 2018-01-23 1
# 3 C 2018-01-23 2020-01-23 0
# 5 E 2011-01-23 2019-01-23 1
# 6 F 2010-01-23 2019-01-23 1
# 7 G 2017-01-23 2018-01-23 0
Dummy data
df <- data.frame(ID=c("A", "B", "C", "D", "E", "F", "G"),
DATE_BEFORE=as.Date(c("2013-01-23", "2010-01-23", "2018-01-23",
"2014-01-23", "2011-01-23", "2010-01-23",
"2017-01-23")),
DATE_AFTER=as.Date(c("2018-01-23", "2016-01-23", "2020-01-23",
"2016-01-23", "2019-01-23", "2019-01-23",
"2018-01-23")))
Just wrap it in a function. To subtract years, we can use as.POSIXlt and adjust its year component.
my_df_subset <- \(date, subtract_yr=0L) {
dt <- as.POSIXlt(date)
dt$year <- dt$year - subtract_yr
dt <- as.Date(dt)
transform(subset(df, DATE_AFTER > dt),
flag=as.numeric(DATE_BEFORE < dt + 1L &
DATE_AFTER > dt))
}
my_df_subset("2016-09-27")
# ID DATE_BEFORE DATE_AFTER flag
# 1 A 2013-01-23 2018-01-23 1
# 3 C 2018-01-23 2020-01-23 0
# 5 E 2011-01-23 2019-01-23 1
# 6 F 2010-01-23 2019-01-23 1
# 7 G 2017-01-23 2018-01-23 0
my_df_subset("2016-09-27", 2L) ## two years earlier
# ID DATE_BEFORE DATE_AFTER flag
# 1 A 2013-01-23 2018-01-23 1
# 2 B 2010-01-23 2016-01-23 1
# 3 C 2018-01-23 2020-01-23 0
# 4 D 2014-01-23 2016-01-23 1
# 5 E 2011-01-23 2019-01-23 1
# 6 F 2010-01-23 2019-01-23 1
# 7 G 2017-01-23 2018-01-23 0
Note: R >= 4.1 is required for the \(x) lambda shorthand used here.
Data:
df <- structure(list(ID = c("A", "B", "C", "D", "E", "F", "G"), DATE_BEFORE = structure(c(15728,
14632, 17554, 16093, 14997, 14632, 17189), class = "Date"), DATE_AFTER = structure(c(17554,
16823, 18284, 16823, 17919, 17919, 17554), class = "Date")), class = "data.frame", row.names = c(NA,
-7L))
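The same idea can be pushed one step further by making the index date the only input of each script step, so the comparison dataset is produced by calling the same code with a shifted date. The following is a sketch, not the poster's pipeline: shift_years and build_dataset are hypothetical names, and seq() on a Date with by = "-2 years" is used as an alternative to editing the POSIXlt year field.

```r
# Dummy data from the question
df <- data.frame(
  ID = c("A", "B", "C", "D", "E", "F", "G"),
  DATE_BEFORE = as.Date(c("2013-01-23", "2010-01-23", "2018-01-23", "2014-01-23",
                          "2011-01-23", "2010-01-23", "2017-01-23")),
  DATE_AFTER = as.Date(c("2018-01-23", "2016-01-23", "2020-01-23", "2016-01-23",
                         "2019-01-23", "2019-01-23", "2018-01-23")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: shift a Date by whole years (negative = earlier)
shift_years <- function(d, years) {
  seq(d, by = paste(years, "years"), length.out = 2)[2]
}

# Hypothetical pipeline step that depends only on its index_date argument
build_dataset <- function(data, index_date) {
  out <- subset(data, DATE_AFTER > index_date)
  out$flag <- as.numeric(out$DATE_BEFORE < index_date + 1 &
                           out$DATE_AFTER > index_date)
  out
}

index_date <- as.Date("2016-09-27")
main_set <- build_dataset(df, index_date)
comparison_set <- build_dataset(df, shift_years(index_date, -2))
```

With seven scripts, one option is to define index_date once at the top of a driver script and have every later step refer to that variable instead of a literal date.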
I'm struggling to calculate the number of days between occurrences: I need to know how many days pass between maintenances of a piece of equipment.
I have a data frame with many pieces of equipment and the dates of their maintenance, and I need to calculate the days between maintenances for each piece of equipment. Here is a toy example:
test = data.frame(car = c("A", "A", "B", "B", "B", "C", "C", "D", "D", "D", "E"),
maintenance_date= c("20-09-2020", "25-09-2020", "14-05-2020", "20-05-2020", "20-05-2021", "11-01-2021", "13-01-2021", "13-01-2021", "15-01-2021", "15-01-2021", "13-01-2021"))
#test
# car maintenance_date
#1 A 20-09-2020
#2 A 25-09-2020
#3 B 14-05-2020
#4 B 20-05-2020
#5 B 20-05-2021
#6 C 11-01-2021
#7 C 13-01-2021
#8 D 13-01-2021
#9 D 15-01-2021
#10 D 15-01-2021
#11 E 13-01-2021
#for result, I'd like something like:
result
# car maintenance_date
#1 A 5
#2 B 6
#3 B 365
#4 C 2
#5 D 2
#6 D 0
I thought of using something like test %>% arrange(maintenance_date) %>% group_by(car) %>% ....
Any hints on how I can do that?
We need to convert the column to Date class before the arrange step, then group by 'car' and take the difference between consecutive dates:
library(dplyr)
library(lubridate)
test %>%
mutate(maintenance_date = dmy(maintenance_date)) %>%
arrange(maintenance_date) %>%
group_by(car) %>%
summarise(maintenance_date = diff(maintenance_date), .groups = 'drop')
Output:
# A tibble: 6 × 2
car maintenance_date
<chr> <drtn>
1 A 5 days
2 B 6 days
3 B 365 days
4 C 2 days
5 D 2 days
6 D 0 days
data.table
library(data.table)
setDT(test)
test[, maintenance_date := as.Date(maintenance_date, format="%d-%m-%Y")
][, .(ndays = diff(maintenance_date)), by = car]
# car ndays
# <char> <difftime>
# 1: A 5 days
# 2: B 6 days
# 3: B 365 days
# 4: C 2 days
# 5: D 2 days
# 6: D 0 days
Another solution, tidyverse-based, can be:
library(tidyverse)
library(lubridate)
test = data.frame(car = c("A", "A", "B", "B", "B", "C", "C", "D", "D", "D", "E"), maintenance_date= c("20-09-2020", "25-09-2020", "14-05-2020", "20-05-2020", "20-05-2021", "11-01-2021", "13-01-2021", "13-01-2021", "15-01-2021", "15-01-2021", "13-01-2021"))
test %>%
group_by(car) %>%
mutate(maintenance_date = c(-1,diff(dmy(maintenance_date)))) %>%
filter(maintenance_date >= 0) %>% ungroup()
#> # A tibble: 6 × 2
#> car maintenance_date
#> <chr> <dbl>
#> 1 A 5
#> 2 B 6
#> 3 B 365
#> 4 C 2
#> 5 D 2
#> 6 D 0
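For completeness, the same per-car gaps can be computed without any packages. This is a minimal base R sketch on the toy data; the names gaps and result are illustrative only, and cars with a single maintenance (such as E) simply contribute no rows.

```r
test <- data.frame(
  car = c("A", "A", "B", "B", "B", "C", "C", "D", "D", "D", "E"),
  maintenance_date = c("20-09-2020", "25-09-2020", "14-05-2020", "20-05-2020",
                       "20-05-2021", "11-01-2021", "13-01-2021", "13-01-2021",
                       "15-01-2021", "15-01-2021", "13-01-2021"),
  stringsAsFactors = FALSE
)

# Parse day-month-year strings and sort within each car
test$maintenance_date <- as.Date(test$maintenance_date, format = "%d-%m-%Y")
test <- test[order(test$car, test$maintenance_date), ]

# One numeric vector of day gaps per car (empty for a single visit)
gaps <- lapply(split(test$maintenance_date, test$car),
               function(d) as.numeric(diff(d)))

result <- data.frame(
  car = rep(names(gaps), lengths(gaps)),
  ndays = unlist(gaps, use.names = FALSE)
)
result
```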
How can I use R to create a rank column? Below is an example.
This is what I have:
Date group
12/5/2020 A
12/5/2020 A
11/7/2020 A
11/7/2020 A
11/9/2020 B
11/9/2020 B
10/8/2020 B
This is what I want:
Date group rank
12/5/2020 A 2
12/5/2020 A 2
11/7/2020 A 1
11/7/2020 A 1
11/9/2020 B 2
11/9/2020 B 2
10/8/2020 B 1
tidyverse
(I'm using dplyr here since I think it is easy to see the steps being done.)
A first approach might be to capitalize on R's factor function, which assigns an integer to each distinct value, so that operations on this factor are faster (when compared with strings). That is, it takes a (possibly looooong) vector of strings and converts it into a just-as-long vector of integers (much smaller and faster) and a very short vector of strings, where the integers are indices into the small vector of strings. This small vector is called the factor's "levels".
library(dplyr)
group_by(dat, group) %>%
mutate(rank = as.integer(factor(Date))) %>%
ungroup()
# # A tibble: 7 x 3
# Date group rank
# <chr> <chr> <int>
# 1 12/5/2020 A 2
# 2 12/5/2020 A 2
# 3 11/7/2020 A 1
# 4 11/7/2020 A 1
# 5 11/9/2020 B 2
# 6 11/9/2020 B 2
# 7 10/8/2020 B 1
This "sorta" works, but there are two problems:
First, it relies on the lexicographic sorting of the Date column, which happens to work for this data sample but will fail in general. A better way is to convert to something more appropriately sortable, such as a Date object.
Failing sorts:
sort(c("12/9/2020", "11/9/2020", "2/9/2020"))
# [1] "11/9/2020" "12/9/2020" "2/9/2020"
dat %>%
mutate(Date = as.Date(Date, format = "%m/%d/%Y")) %>%
group_by(group) %>%
mutate(rank = as.integer(factor(Date))) %>%
ungroup()
# # A tibble: 7 x 3
# Date group rank
# <date> <chr> <int>
# 1 2020-12-05 A 2
# 2 2020-12-05 A 2
# 3 2020-11-07 A 1
# 4 2020-11-07 A 1
# 5 2020-11-09 B 2
# 6 2020-11-09 B 2
# 7 2020-10-08 B 1
Second, there really are better functions for ranking, such as dplyr::dense_rank (which @akrun put in an answer first ... I was building to it, honestly):
dat %>%
mutate(Date = as.Date(Date, format = "%m/%d/%Y")) %>%
group_by(group) %>%
mutate(rank = dense_rank(Date)) %>%
ungroup()
# # A tibble: 7 x 3
# Date group rank
# <date> <chr> <int>
# 1 2020-12-05 A 2
# 2 2020-12-05 A 2
# 3 2020-11-07 A 1
# 4 2020-11-07 A 1
# 5 2020-11-09 B 2
# 6 2020-11-09 B 2
# 7 2020-10-08 B 1
We can use dense_rank after converting 'Date' to the Date class:
library(dplyr)
library(lubridate)
df1 %>%
group_by(group) %>%
mutate(rank = dense_rank(mdy(Date)))
# A tibble: 7 x 3
# Groups: group [2]
# Date group rank
# <chr> <chr> <int>
#1 12/5/2020 A 2
#2 12/5/2020 A 2
#3 11/7/2020 A 1
#4 11/7/2020 A 1
#5 11/9/2020 B 2
#6 11/9/2020 B 2
#7 10/8/2020 B 1
data
df1 <- structure(list(Date = c("12/5/2020", "12/5/2020", "11/7/2020",
"11/7/2020", "11/9/2020", "11/9/2020", "10/8/2020"), group = c("A",
"A", "A", "A", "B", "B", "B")), class = "data.frame", row.names = c(NA,
-7L))
Convert the Date column to an actual date object, arrange the data by Date, and use match with unique to get the rank column.
library(dplyr)
df %>%
mutate(Date = lubridate::mdy(Date)) %>%
arrange(group, Date) %>%
group_by(group) %>%
mutate(rank = match(Date, unique(Date)))
# Date group rank
# <date> <chr> <int>
#1 2020-11-07 A 1
#2 2020-11-07 A 1
#3 2020-12-05 A 2
#4 2020-12-05 A 2
#5 2020-10-08 B 1
#6 2020-11-09 B 2
#7 2020-11-09 B 2
data
df <- structure(list(Date = c("12/5/2020", "12/5/2020", "11/7/2020",
"11/7/2020", "11/9/2020", "11/9/2020", "10/8/2020"), group = c("A",
"A", "A", "A", "B", "B", "B")), class = "data.frame", row.names = c(NA, -7L))
I have one data frame containing event date by id:
data.frame(id = c("a", "a", "a", "d", "d"),
date = as.Date(c("2018-01-03", "2018-02-02", "2018-02-22", "2018-02-13", "2018-05-01")))
id date
1 a 2018-01-03
2 a 2018-02-02
3 a 2018-02-22
4 d 2018-02-13
5 d 2018-05-01
And another one containing start and end of periods by id:
data.frame(id = c("a", "a", "d", "d", "d", "d"),
start = as.Date(c("2018-01-15", "2018-01-30", "2018-03-01", "2018-02-01", "2018-04-02", "2018-03-19")),
end = as.Date(c("2018-01-18", "2018-02-10", "2018-03-03", "2018-04-22", "2018-05-23", "2018-08-29")))
id start end
1 a 2018-01-15 2018-01-18
2 a 2018-01-30 2018-02-10
3 d 2018-03-01 2018-03-03
4 d 2018-02-01 2018-04-22
5 d 2018-04-02 2018-05-23
6 d 2018-03-19 2018-08-29
For each id, I need to count the number of periods from the second data frame to which each date in the first data frame belongs.
My desired dataframe would be:
id date n
1 a 2018-01-03 0 # does not belong to any period
2 a 2018-02-02 1 # belongs to [2018-01-30,2018-02-10]
3 a 2018-02-22 0 # does not belong to any period
4 d 2018-02-13 1 # belongs to [2018-02-01,2018-04-22]
5 d 2018-05-01 2 # belongs to [2018-04-02,2018-05-23] and [2018-03-19,2018-08-29]
My problem is not about date comparison and summing the results. My problem is about performing that analysis inside each id group. I guess there is a way using split and/or the apply family, but I did not find how.
How can I do it in base R? I work in a restrictive environment where I only have access to base R.
Base R approach:
temp <- subset( merge(df1, df2), date >= start & date <= end, select = "date" )
df1$n <- sapply( df1$date, function(x) length( temp$date[ temp$date == x ] ))
# id date n
# 1 a 2018-01-03 0
# 2 a 2018-02-02 1
# 3 a 2018-02-22 0
# 4 d 2018-02-13 1
# 5 d 2018-05-01 2
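Note that the sapply step above matches on date alone, so the counts could mix ids if the same date ever appeared under two different ids. A variant that keeps the id in the comparison might look like the sketch below; the names dates and periods are chosen here for illustration and hold the two data frames from the question.

```r
dates <- data.frame(
  id = c("a", "a", "a", "d", "d"),
  date = as.Date(c("2018-01-03", "2018-02-02", "2018-02-22",
                   "2018-02-13", "2018-05-01")),
  stringsAsFactors = FALSE
)
periods <- data.frame(
  id = c("a", "a", "d", "d", "d", "d"),
  start = as.Date(c("2018-01-15", "2018-01-30", "2018-03-01",
                    "2018-02-01", "2018-04-02", "2018-03-19")),
  end = as.Date(c("2018-01-18", "2018-02-10", "2018-03-03",
                  "2018-04-22", "2018-05-23", "2018-08-29")),
  stringsAsFactors = FALSE
)

# For each row of dates, look only at the periods of the same id
# and count how many of them contain the date
dates$n <- vapply(seq_len(nrow(dates)), function(i) {
  p <- periods[periods$id == dates$id[i], ]
  sum(dates$date[i] >= p$start & dates$date[i] <= p$end)
}, integer(1))
dates
```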
Another base R approach:
dates <- data.frame(id = c("a", "a", "a", "d", "d"),
date = as.Date(c("2018-01-03", "2018-02-02", "2018-02-22", "2018-02-13", "2018-05-01")))
periods <- data.frame(id = c("a", "a", "d", "d", "d", "d"),
start = as.Date(c("2018-01-15", "2018-01-30", "2018-03-01", "2018-02-01", "2018-04-02", "2018-03-19")),
end = as.Date(c("2018-01-18", "2018-02-10", "2018-03-03", "2018-04-22", "2018-05-23", "2018-08-29")))
df <- transform(merge(dates, periods), belongs = date >= start & date <= end)
aggregate(belongs ~ date + id, data = df, sum)
# date id belongs
# 1 2018-01-03 a 0
# 2 2018-02-02 a 1
# 3 2018-02-22 a 0
# 4 2018-02-13 d 1
# 5 2018-05-01 d 2
Or using data.table:
library(data.table)
dt <- as.data.table(merge(dates, periods))
dt[, .(n = sum(date >= start & date <= end)), by=c("id","date")]
# id date n
# 1: a 2018-01-03 0
# 2: a 2018-02-02 1
# 3: a 2018-02-22 0
# 4: d 2018-02-13 1
# 5: d 2018-05-01 2
I'm trying to add additional rows to my data table with the column totals so that when I display on ggplot, I am able to filter by "Total" for my selectInput in my Shiny app. However, because I have various data types (i.e. date, string and numeric), it makes it more complicated.
Here's a sample df:
data.frame(
Date = rep(seq(as.Date("2018-01-01"), by= "1 day", length.out= 3), 3),
Company = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
Attr_1 = c("AB", "AC", "AD", "AB", "AC", "AD", "AB", "AC", "AD"),
Attr_2 = c(1,2,3,4,5,6,7,8,9)
)
Here's what I'm hoping to achieve:
Date Company Attr_1 Attr_2
2018-01-01 A AB 1
2018-01-02 A AC 2
2018-01-03 A AD 3
2018-01-01 B AB 4
2018-01-02 B AC 5
2018-01-03 B AD 6
2018-01-01 C AB 7
2018-01-02 C AC 8
2018-01-03 C AD 9
2018-01-01 Total AB 12
2018-01-02 Total AC 15
2018-01-03 Total AD 18
Does anyone have an easy solution for this? What I can think of is to calculate the colSums manually and then rbind back into this dataframe. But is there a simpler solution?
df = data.frame(
Company = c("A", "B", "C", "D", "A", "B"),
Attr_1 = c(12,13,14,14,3,5),
Attr_2 = c(1,2,3,4,5,4)
)
library(dplyr)
bind_rows(df, df %>%
summarise_at(vars(matches("Attr")), sum) %>%
mutate(Company = "Total"))
# Company Attr_1 Attr_2
# 1 A 12 1
# 2 B 13 2
# 3 C 14 3
# 4 D 14 4
# 5 A 3 5
# 6 B 5 4
# 7 Total 61 19
Solution to your edit:
df %>%
group_by(Date, Attr_1) %>%
summarise(Attr_2 = sum(Attr_2),
Company = "Total") %>%
ungroup() %>%
bind_rows(df, .)
A solution that works even if there is a 'W' company.
data.frame(
Company = c("A", "B", "W", "D", "A", "B"),
Attr_1 = c(12,13,14,14,3,5),
Attr_2 = c(1,2,3,4,5,4),
stringsAsFactors=FALSE
) -> df
df %>% summarise_if(is.numeric,sum) %>%
mutate(Company='Total') %>%
bind_rows(df,.)
# Company Attr_1 Attr_2
#1 A 12 1
#2 B 13 2
#3 W 14 3
#4 D 14 4
#5 A 3 5
#6 B 5 4
#7 Total 61 19
Here's a base R solution:
df <- data.frame(
Company = c("A", "B", "C", "D", "A", "B"),
Attr_1 = c(12,13,14,14,3,5),
Attr_2 = c(1,2,3,4,5,4)
)
rbind(df, data.frame(Company = "Total", Attr_1 = sum(df$Attr_1), Attr_2 = sum(df$Attr_2)))
Output:
Company Attr_1 Attr_2
1 A 12 1
2 B 13 2
3 C 14 3
4 D 14 4
5 A 3 5
6 B 5 4
7 Total 61 19
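For the layout in the question itself (one Total row per Date and Attr_1), a base R sketch can combine aggregate() with rbind(). The variable names totals and out are illustrative; the column order of the totals is aligned with the original data frame before binding.

```r
df <- data.frame(
  Date = rep(seq(as.Date("2018-01-01"), by = "1 day", length.out = 3), 3),
  Company = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
  Attr_1 = c("AB", "AC", "AD", "AB", "AC", "AD", "AB", "AC", "AD"),
  Attr_2 = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  stringsAsFactors = FALSE
)

# Sum Attr_2 across companies for every Date / Attr_1 combination
totals <- aggregate(Attr_2 ~ Date + Attr_1, data = df, FUN = sum)
totals$Company <- "Total"

# Reorder the columns to match df, then append the total rows
out <- rbind(df, totals[names(df)])
out
```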
I find adorn_totals from the janitor package very useful for this (and other) tasks:
library( janitor )
df %>% adorn_totals()
# Company Attr_1 Attr_2
# A 12 1
# B 13 2
# C 14 3
# D 14 4
# A 3 5
# B 5 4
# Total 61 19