How to calculate days between occurrences by group in R

I'm struggling with how to calculate the number of days between occurrences, since I need to work out how many days pass between maintenances on a piece of equipment.
I have a data frame with many pieces of equipment and the dates of their maintenances, and I need to calculate the days between maintenances for each piece of equipment. Here is a toy example:
test = data.frame(car = c("A", "A", "B", "B", "B", "C", "C", "D", "D", "D", "E"),
                  maintenance_date = c("20-09-2020", "25-09-2020", "14-05-2020", "20-05-2020", "20-05-2021", "11-01-2021", "13-01-2021", "13-01-2021", "15-01-2021", "15-01-2021", "13-01-2021"))
#test
# car maintenance_date
#1 A 20-09-2020
#2 A 25-09-2020
#3 B 14-05-2020
#4 B 20-05-2020
#5 B 20-05-2021
#6 C 11-01-2021
#7 C 13-01-2021
#8 D 13-01-2021
#9 D 15-01-2021
#10 D 15-01-2021
#11 E 13-01-2021
#for result, I'd like something like:
result
# car maintenance_date
#1 A 5
#2 B 6
#3 B 365
#4 C 2
#5 D 2
#6 D 0
I thought of using something like test %>% arrange(maintenance_date) %>% group_by(car) %>% ....
Any hint on how I can do that?

We need to convert maintenance_date to Date class before arranging, then group_by 'car' and take the difference:
library(dplyr)
library(lubridate)
test %>%
  mutate(maintenance_date = dmy(maintenance_date)) %>%
  arrange(maintenance_date) %>%
  group_by(car) %>%
  summarise(maintenance_date = diff(maintenance_date), .groups = 'drop')
Output:
# A tibble: 6 × 2
car maintenance_date
<chr> <drtn>
1 A 5 days
2 B 6 days
3 B 365 days
4 C 2 days
5 D 2 days
6 D 0 days
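If you would rather keep one row per maintenance (with NA for each car's first visit) instead of collapsing to the gaps only, here is a lag()-based sketch of the same idea; the column name days_since_last is just illustrative:
library(dplyr)
library(lubridate)
test %>%
  mutate(maintenance_date = dmy(maintenance_date)) %>%
  arrange(car, maintenance_date) %>%
  group_by(car) %>%
  # gap to the previous maintenance of the same car; NA for its first visit
  mutate(days_since_last = as.numeric(maintenance_date - lag(maintenance_date))) %>%
  ungroup()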

data.table
library(data.table)
setDT(test)
test[, maintenance_date := as.Date(maintenance_date, format="%d-%m-%Y")
][, .(ndays = diff(maintenance_date)), by = car]
# car ndays
# <char> <difftime>
# 1: A 5 days
# 2: B 6 days
# 3: B 365 days
# 4: C 2 days
# 5: D 2 days
# 6: D 0 days
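Note that, unlike the dplyr answer, the snippet above does not sort, so it relies on the rows already being ordered by date within each car. To make that explicit you can call setorder() first; a small sketch:
library(data.table)
setDT(test)
test[, maintenance_date := as.Date(maintenance_date, format = "%d-%m-%Y")]
# make the within-car date order explicit before taking the differences
setorder(test, car, maintenance_date)
test[, .(ndays = diff(maintenance_date)), by = car]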

Another, tidyverse-based, solution:
library(tidyverse)
library(lubridate)
test = data.frame(car = c("A", "A", "B", "B", "B", "C", "C", "D", "D", "D", "E"), maintenance_date= c("20-09-2020", "25-09-2020", "14-05-2020", "20-05-2020", "20-05-2021", "11-01-2021", "13-01-2021", "13-01-2021", "15-01-2021", "15-01-2021", "13-01-2021"))
test %>%
  group_by(car) %>%
  mutate(maintenance_date = c(-1, diff(dmy(maintenance_date)))) %>%
  filter(maintenance_date >= 0) %>%
  ungroup()
#> # A tibble: 6 × 2
#> car maintenance_date
#> <chr> <dbl>
#> 1 A 5
#> 2 B 6
#> 3 B 365
#> 4 C 2
#> 5 D 2
#> 6 D 0

Related

Reorder one row in tibble - move it to the last row

How do I rearrange the rows in a tibble?
I wish to reorder the rows such that the row with x = "c" goes to the bottom of the tibble and everything else stays the same.
library(dplyr)
tbl <- tibble(x = c("a", "b", "c", "d", "e", "f", "g", "h"),
              y = 1:8)
An alternative to dplyr::arrange(), using base R:
tbl[order(tbl$x == "c"), ] # Thanks to Merijn van Tilborg
Output:
# x y
# <chr> <int>
# 1 a 1
# 2 b 2
# 3 d 4
# 4 e 5
# 5 f 6
# 6 g 7
# 7 h 8
# 8 c 3
tbl |> dplyr::arrange(x == "c")
Using forcats, convert x to a factor with "c" last, then arrange. This doesn't change the class of the column x.
library(forcats)
tbl %>%
  arrange(fct_relevel(x, "c", after = Inf))
# # A tibble: 8 x 2
# x y
# <chr> <int>
# 1 a 1
# 2 b 2
# 3 d 4
# 4 e 5
# 5 f 6
# 6 g 7
# 7 h 8
# 8 c 3
If the order of x is important, it is better to keep it as a factor. The code below changes the class of x from character to factor, with "c" last:
tbl %>%
  mutate(x = fct_relevel(x, "c", after = Inf)) %>%
  arrange(x)
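If the logical-sort trick feels opaque, another way to get the same result is simply to split the tibble and stack the pieces; a minimal sketch using dplyr on the same tbl:
library(dplyr)
bind_rows(filter(tbl, x != "c"),
          filter(tbl, x == "c"))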

Enumerate a grouping variable in a tibble

I would like to know how to use row_number() (or anything else) to transform the variable 'group' into an integer.
tibble_test <- tibble(A = letters[1:10], group = c("A", "A", "A", "B", "B", "C", "C", "C", "C", "D"))
# to get the enumeration inside each group of 'group'
tibble_test %>%
  group_by(group) %>%
  mutate(G1 = row_number())
But I would like to have this output:
# A tibble: 10 x 4
A group G1 G2
<chr> <chr> <dbl> <dbl>
1 a A 1 1
2 b A 2 1
3 c A 3 1
4 d B 1 2
5 e B 2 2
6 f C 1 3
7 g C 2 3
8 h C 3 3
9 i C 4 3
10 j D 1 4
My question is: how do I get the column G2? I know I could transform the 'group' variable into a factor and then an integer (after the tibble is arranged), but I would like to know if it can be done with a count.
You just need one more step: include the group indices with group_indices(). Be aware that how your data is arranged/sorted will affect the indices.
library(dplyr)
tibble_test <- tibble(A = letters[1:10], group = c("A", "A", "A", "B", "B", "C", "C", "C", "C", "D"))
# to get the enumeration inside each group of 'group'
tibble_test %>%
  group_by(group) %>%
  mutate(G1 = row_number(),
         G2 = group_indices())
# A tibble: 10 x 4
# Groups: group [4]
A group G1 G2
<chr> <chr> <int> <int>
1 a A 1 1
2 b A 2 1
3 c A 3 1
4 d B 1 2
5 e B 2 2
6 f C 1 3
7 g C 2 3
8 h C 3 3
9 i C 4 3
10 j D 1 4
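If you are on dplyr 1.0.0 or later, cur_group_id() is the documented way to get the current group's index from inside mutate() and gives the same result as group_indices() here; a sketch with the same data:
library(dplyr)
tibble_test %>%
  group_by(group) %>%
  mutate(G1 = row_number(),
         G2 = cur_group_id()) %>%
  ungroup()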

Keep consecutive duplicates

I have a data frame where one column contains some consecutive duplicates. I want to keep the rows with consecutive duplicates (any length >1). I would prefer a solution in dplyr or data.table.
Example data :
a <- seq(10,150,10)
b <- c("A", "A", "B", "C", "C", "A", "B", "B", "B", "C", "A", "C", "D", "E", "E")
df <- tibble(a, b)
Data:
# A tibble: 15 x 2
a b
<dbl> <chr>
1 10 A
2 20 A
3 30 B
4 40 C
5 50 C
6 60 A
7 70 B
8 80 B
9 90 B
10 100 C
11 110 A
12 120 C
13 130 D
14 140 E
15 150 E
So I would like to keep the rows with consecutive duplicates in column b.
Expected outcome:
# A tibble: 9 x 2
a b
<dbl> <chr>
1 10 A
2 20 A
4 40 C
5 50 C
7 70 B
8 80 B
9 90 B
14 140 E
15 150 E
Thanks!
Using the data.table input shown in the Note at the end, set N to be the number of elements in each group of consecutive elements and then keep groups for which it is greater than 1.
DT[, N :=.N, by = rleid(b)][N > 1, .(a, b)]
giving:
a b
1: 10 A
2: 20 A
3: 40 C
4: 50 C
5: 70 B
6: 80 B
7: 90 B
8: 140 E
9: 150 E
Note
We assume the input in reproducible form is:
library(data.table)
a <- seq(10,150,10)
b <- c("A", "A", "B", "C", "C", "A", "B", "B", "B", "C", "A", "C", "D", "E", "E")
DT <- data.table(a, b)
In dplyr we can use lag to create groups and select groups with more than 1 row.
library(dplyr)
df %>%
  group_by(group = cumsum(b != lag(b, default = first(b)))) %>%
  filter(n() > 1) %>%
  ungroup() %>%
  select(-group)
# a b
# <dbl> <chr>
#1 10 A
#2 20 A
#3 40 C
#4 50 C
#5 70 B
#6 80 B
#7 90 B
#8 140 E
#9 150 E
In base R, we can use rle and ave to subset rows from df
subset(df, ave(b, with(rle(b), rep(seq_along(values), lengths)), FUN = length) > 1)
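Another base R option that some find easier to read: rep() the run lengths into a logical mask marking rows that sit in a run longer than 1; a sketch on the same df:
# TRUE for every row that belongs to a run of length > 1
keep <- with(rle(df$b), rep(lengths > 1, lengths))
df[keep, ]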
Since you also have the data.table tag, I like using the data.table::rleid function for such tasks, i.e.
library(dplyr)
df %>%
  group_by(grp = data.table::rleid(b), b) %>%
  filter(n() > 1)
which gives,
# A tibble: 9 x 3
# Groups: grp, b [4]
a b grp
<dbl> <chr> <int>
1 10 A 1
2 20 A 1
3 40 C 3
4 50 C 3
5 70 B 5
6 80 B 5
7 90 B 5
8 140 E 10
9 150 E 10
You want to remove duplicates except when they are consecutive: the following code flags duplicated values and consecutive values, then keeps only the rows that are not duplicates or that are part of a consecutive run of duplicates.
df %>%
  mutate(duplicate = duplicated(b),
         consecutive = c(NA, diff(as.integer(factor(b)))) == 0) %>%
  filter(!duplicate | consecutive) %>%
  select(-duplicate, -consecutive)
Use rle to get the run lengths.
Assuming df <- data.frame(a = a, b = b), the following does the job (with one caveat: if every run has length > 1, the index vector is empty and df[-integer(0), ] would drop every row; see the guarded sketch just below):
df[-cumsum(rle(b)$lengths)[rle(b)$lengths==1],]
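A small sketch of the same expression with that guard, using the same df:
drop <- with(rle(df$b), cumsum(lengths)[lengths == 1])
if (length(drop)) df[-drop, ] else df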
Another solution utilizes both lead() and lag():
library(tidyverse)
a <- seq(10,150,10)
b <- c("A", "A", "B", "C", "C", "A", "B", "B", "B", "C", "A", "C", "D", "E", "E")
df <- tibble(a, b)
df %>% filter(b == lead(b) | b == lag(b))
#> # A tibble: 9 x 2
#> a b
#> <dbl> <chr>
#> 1 10 A
#> 2 20 A
#> 3 40 C
#> 4 50 C
#> 5 70 B
#> 6 80 B
#> 7 90 B
#> 8 140 E
#> 9 150 E
Created on 2019-10-21 by the reprex package (v0.3.0)
Here is another option (which should be faster):
DT[-DT[, {
  x <- rowid(rleid(b)) < 2
  .I[x & shift(x, -1L, fill = TRUE)]
}]]
timing code:
library(data.table)
set.seed(0L)
nr <- 1e7
nb <- 1e4
DT <- data.table(b=sample(nb, nr, TRUE))
#DT <- data.table(b=c("A", "A", "B", "C", "C", "A", "B", "B", "B", "C", "A", "C", "D", "E", "E"))
DT2 <- copy(DT)
mtd1 <- function(df) {
  df[-cumsum(rle(b)$lengths)[rle(b)$lengths == 1], ]
}
mtd2 <- function(D) {
  D[, N := .N, by = rleid(b)][N > 1, .(b)]
}
mtd3 <- function(D) {
  D[-D[, {
    x <- rowid(rleid(b)) < 2
    .I[x & shift(x, -1L, fill = TRUE)]
  }]]
}
bench::mark(mtd1(DT), mtd2(DT2), mtd3(DT), check=FALSE)
timings:
# A tibble: 3 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 mtd1(DT) 1.1s 1.1s 0.908 1.98GB 10.9 1 12 1.1s <df[,1] [2,014 x ~ <df[,3] [59 x ~ <bch:t~ <tibble [1 x ~
2 mtd2(DT2) 2.88s 2.88s 0.348 267.12MB 0 1 0 2.88s <df[,1] [2,014 x ~ <df[,3] [23 x ~ <bch:t~ <tibble [1 x ~
3 mtd3(DT) 639.91ms 639.91ms 1.56 505.48MB 4.69 1 3 639.91ms <df[,1] [2,014 x ~ <df[,3] [24 x ~ <bch:t~ <tibble [1 x ~

How to add a row to data frame based on a condition

I have a data frame to which I want to add rows on the basis of the following conditions: column a is equal to "C" and column b is equal to 3 or 5.
Here is my data frame:
df <- data.frame(a = c("A", "B", "C", "D", "C", "A", "C", "E"),
b = c(seq(8)), stringsAsFactors = TRUE)
Whenever the condition is TRUE I want to add a row (with a = "add" and b = 3) below the row where the condition is met. I have tried the following:
rbind(df, data.frame(a="add", b = "3"))
# a b
# 1 A 1
# 2 B 2
# 3 C 3
# 4 D 4
# 5 C 5
# 6 A 6
# 7 C 7
# 8 E 8
# 9 add 3
This is not the output I want. The output I want is
# a b
# 1 A 1
# 2 B 2
# 3 C 3
# 4 add 3
# 5 D 4
# 6 C 5
# 7 add 3
# 8 A 6
# 9 C 7
# 10 E 8
How can I do that? I am new to R and thank you for your help.
# rows that meet the condition are repeated twice, all others once
lens = ifelse(df$b %in% c(3, 5) & df$a == "C", 2, 1)
ind = rep(1:NROW(df), lens)
df2 = df[ind, ]
# turn the second copy of each repeated row into the new "add" row
df2$a = as.character(df2$a)
df2$a[cumsum(lens)[which(lens == 2)]] = "add"
df2$b[cumsum(lens)[which(lens == 2)]] = 3
df2
# a b
#1 A 1
#2 B 2
#3 C 3
#3.1 add 3
#4 D 4
#5 C 5
#5.1 add 3
#6 A 6
#7 C 7
#8 E 8
A solution using the tidyverse package.
library(tidyverse)
df2 <- df %>%
  mutate(Group = lag(cumsum(a == "C" & b %in% c(3, 5)), default = FALSE)) %>%
  group_split(Group) %>%
  map_dfr(~ .x %>% bind_rows(tibble(a = "add", b = 3))) %>%
  slice(-n()) %>%
  select(-Group)
df2
# # A tibble: 10 x 2
# a b
# <chr> <dbl>
# 1 A 1
# 2 B 2
# 3 C 3
# 4 add 3
# 5 D 4
# 6 C 5
# 7 add 3
# 8 A 6
# 9 C 7
# 10 E 8
In base R, we can find the positions where a == "C" and b is 3 or 5, repeat those rows in the data frame, and then replace the repeated rows with the required values.
pos <- which(df$a == "C" & df$b %in% c(3, 5))
df <- df[sort(c(seq(nrow(df)), pos)), ]
df[seq_along(pos) + pos, ] <- list("add", 3)
row.names(df) <- NULL
df
# a b
#1 A 1
#2 B 2
#3 C 3
#4 add 3
#5 D 4
#6 C 5
#7 add 3
#8 A 6
#9 C 7
#10 E 8
data
df <- data.frame(a = c("A", "B", "C", "D", "C", "A", "C", "E"),
b = c(seq(8)), stringsAsFactors = FALSE)
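If you prefer to stay in dplyr, another way to get the same result is to build the extra rows separately, tag every row with the position it should follow, and sort. A minimal sketch, where the helper columns .ord and .sub are just illustrative names and the stringsAsFactors = FALSE version of df from the data note above is assumed:
library(dplyr)
hit <- which(df$a == "C" & df$b %in% c(3, 5))
extra <- data.frame(a = "add", b = 3)[rep(1, length(hit)), ]
bind_rows(df %>% mutate(.ord = row_number(), .sub = 0),
          extra %>% mutate(.ord = hit, .sub = 1)) %>%
  arrange(.ord, .sub) %>%
  select(a, b)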

How to bind additional rows to dataframe for column totals? [duplicate]

This question already has answers here:
Add row to a data frame with total sum for each column
(12 answers)
Closed 4 years ago.
I'm trying to add additional rows to my data table with the column totals, so that when I display the data with ggplot I can filter by "Total" for my selectInput in my Shiny app. However, because I have various data types (i.e. date, string and numeric), this is more complicated.
Here's a sample df:
data.frame(
Date = rep(seq(as.Date("2018-01-01"), by= "1 day", length.out= 3), 3),
Company = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
Attr_1 = c("AB", "AC", "AD", "AB", "AC", "AD", "AB", "AC", "AD"),
Attr_2 = c(1,2,3,4,5,6,7,8,9)
)
Here's what I'm hoping to achieve:
Date Company Attr_1 Attr_2
2018-01-01 A AB 1
2018-01-02 A AC 2
2018-01-03 A AD 3
2018-01-01 B AB 4
2018-01-02 B AC 5
2018-01-03 B AD 6
2018-01-01 C AB 7
2018-01-02 C AC 8
2018-01-03 C AD 9
2018-01-01 Total AB 12
2018-01-02 Total AC 15
2018-01-03 Total AD 18
Does anyone have an easy solution for this? All I can think of is to calculate the column sums manually and then rbind them back into this data frame, but is there a simpler solution?
df = data.frame(
Company = c("A", "B", "C", "D", "A", "B"),
Attr_1 = c(12,13,14,14,3,5),
Attr_2 = c(1,2,3,4,5,4)
)
library(dplyr)
bind_rows(df, df %>%
            summarise_at(vars(matches("Attr")), funs(sum)) %>%
            mutate(Company = "Total"))
# Company Attr_1 Attr_2
# 1 A 12 1
# 2 B 13 2
# 3 C 14 3
# 4 D 14 4
# 5 A 3 5
# 6 B 5 4
# 7 Total 61 19
Solution to your edit:
df %>%
  group_by(Date, Attr_1) %>%
  summarise(Attr_2 = sum(Attr_2),
            Company = "Total") %>%
  ungroup() %>%
  bind_rows(df, .)
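As an aside, funs() used above has since been deprecated; with dplyr 1.0.0 or later the same total row can be built with across() and where(). A sketch using the simplified Company/Attr_1/Attr_2 df from this answer:
library(dplyr)
bind_rows(df, df %>%
            summarise(across(where(is.numeric), sum)) %>%
            mutate(Company = "Total"))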
A solution that works even if there is a 'W' company.
data.frame(
Company = c("A", "B", "W", "D", "A", "B"),
Attr_1 = c(12,13,14,14,3,5),
Attr_2 = c(1,2,3,4,5,4),
stringsAsFactors=FALSE
) -> df
df %>%
  summarise_if(is.numeric, sum) %>%
  mutate(Company = 'Total') %>%
  bind_rows(df, .)
# Company Attr_1 Attr_2
#1 A 12 1
#2 B 13 2
#3 W 14 3
#4 D 14 4
#5 A 3 5
#6 B 5 4
#7 Total 61 19
Here's a base R solution:
df <- data.frame(
Company = c("A", "B", "C", "D", "A", "B"),
Attr_1 = c(12,13,14,14,3,5),
Attr_2 = c(1,2,3,4,5,4)
)
rbind(df, data.frame(Company = "Total", Attr_1 = sum(df$Attr_1), Attr_2 = sum(df$Attr_2)))
Output:
Company Attr_1 Attr_2
1 A 12 1
2 B 13 2
3 C 14 3
4 D 14 4
5 A 3 5
6 B 5 4
7 Total 61 19
I find adorn_totals from the janitor package very useful for this (and other) tasks:
library( janitor )
df %>% adorn_totals()
# Company Attr_1 Attr_2
# A 12 1
# B 13 2
# C 14 3
# D 14 4
# A 3 5
# B 5 4
# Total 61 19
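adorn_totals can also add a column of row totals via its where argument, if you need those as well; a quick sketch on the same df that adds both the totals row and a totals column:
library(janitor)
df %>% adorn_totals(where = c("row", "col"))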
