I have a data frame where one column contains some consecutive duplicates. I want to keep the rows with consecutive duplicates (any length >1). I would prefer a solution in dplyr or data.table.
Example data:
a <- seq(10,150,10)
b <- c("A", "A", "B", "C", "C", "A", "B", "B", "B", "C", "A", "C", "D", "E", "E")
df <- tibble(a, b)
Data:
# A tibble: 15 x 2
a b
<dbl> <chr>
1 10 A
2 20 A
3 30 B
4 40 C
5 50 C
6 60 A
7 70 B
8 80 B
9 90 B
10 100 C
11 110 A
12 120 C
13 130 D
14 140 E
15 150 E
So I would like to keep the rows with consecutive duplicates in column b.
Expected outcome:
# A tibble: 9 x 2
a b
<dbl> <chr>
1 10 A
2 20 A
4 40 C
5 50 C
7 70 B
8 80 B
9 90 B
14 140 E
15 150 E
Thanks!
Using the data.table input shown in the Note at the end, set N to be the number of elements in each group of consecutive elements and then keep groups for which it is greater than 1.
DT[, N :=.N, by = rleid(b)][N > 1, .(a, b)]
giving:
a b
1: 10 A
2: 20 A
3: 40 C
4: 50 C
5: 70 B
6: 80 B
7: 90 B
8: 140 E
9: 150 E
Note
We assume the input in reproducible form is:
library(data.table)
a <- seq(10,150,10)
b <- c("A", "A", "B", "C", "C", "A", "B", "B", "B", "C", "A", "C", "D", "E", "E")
DT <- data.table(a, b)
In dplyr we can use lag to create groups and select groups with more than 1 row.
library(dplyr)
df %>%
group_by(group = cumsum(b != lag(b, default = first(b)))) %>%
filter(n() > 1) %>%
ungroup() %>%
select(-group)
# a b
# <dbl> <chr>
#1 10 A
#2 20 A
#3 40 C
#4 50 C
#5 70 B
#6 80 B
#7 90 B
#8 140 E
#9 150 E
In base R, we can use rle and ave to subset rows from df
subset(df, ave(b, with(rle(b), rep(seq_along(values), lengths)), FUN = length) > 1)
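The one-liner above packs several steps together. A minimal base R sketch that spells them out on the example data from the question; the intermediate names runs, run_id and run_len are only illustrative:

```r
a <- seq(10, 150, 10)
b <- c("A", "A", "B", "C", "C", "A", "B", "B", "B", "C", "A", "C", "D", "E", "E")
df <- data.frame(a, b)

# rle() compresses b into runs of identical consecutive values
runs <- rle(b)

# One id per run, repeated for every row in that run
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# The length of the run each row belongs to
run_len <- rep(runs$lengths, runs$lengths)

# Keep only rows that sit in a run longer than 1
result <- df[run_len > 1, ]
result
```

The run_id vector plays the same role as data.table::rleid(b) in the other answers.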
Since you also have the data.table tag, I like using the data.table::rleid function for such tasks, i.e.
library(dplyr)
df %>%
group_by(grp = data.table::rleid(b), b) %>%
filter(n() > 1)
which gives,
# A tibble: 9 x 3
# Groups: grp, b [4]
a b grp
<dbl> <chr> <int>
1 10 A 1
2 20 A 1
3 40 C 3
4 50 C 3
5 70 B 5
6 80 B 5
7 90 B 5
8 140 E 10
9 150 E 10
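You want to remove duplicates except when they are consecutive. The following code flags duplicated values and consecutive repeats, then keeps only rows that are either not duplicates or part of a consecutive run. Note that on the example data this keeps first occurrences such as row 3 (B) and row 13 (D) and drops row 7, so the result is not identical to the expected outcome in the question.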
df %>%
mutate(duplicate = duplicated(b),
consecutive = c(NA, diff(as.integer(factor(b)))) == 0) %>%
filter(!duplicate | consecutive) %>%
select(-duplicate, -consecutive)
Use rle to get the run lengths.
Assuming df <- data.frame(a=a,b=b), the following works:
df[-cumsum(rle(b)$lengths)[rle(b)$lengths==1],]
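One caveat with the negative indexing above: if no run has length 1, the index vector is empty, and df[-integer(0), ] returns zero rows rather than all rows. A small guard avoids that. This is only a sketch; drop_singleton_runs and all_paired are hypothetical names introduced here:

```r
# Remove rows that form a run of length 1 in column `col`.
drop_singleton_runs <- function(df, col) {
  runs <- rle(df[[col]])
  run_ends <- cumsum(runs$lengths)          # row index where each run ends
  singles <- run_ends[runs$lengths == 1]    # rows that are a run on their own
  if (length(singles) == 0) {
    return(df)                              # nothing to drop: keep every row
  }
  df[-singles, , drop = FALSE]
}

a <- seq(10, 150, 10)
b <- c("A", "A", "B", "C", "C", "A", "B", "B", "B", "C", "A", "C", "D", "E", "E")
df <- data.frame(a = a, b = b)
drop_singleton_runs(df, "b")

# Every value here is part of a run, so all rows should survive
all_paired <- data.frame(a = 1:4, b = c("A", "A", "B", "B"))
drop_singleton_runs(all_paired, "b")
```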
Another solution utilizes both lead() and lag():
library(tidyverse)
a <- seq(10,150,10)
b <- c("A", "A", "B", "C", "C", "A", "B", "B", "B", "C", "A", "C", "D", "E", "E")
df <- tibble(a, b)
df %>% filter(b == lead(b) | b == lag(b))
#> # A tibble: 9 x 2
#> a b
#> <dbl> <chr>
#> 1 10 A
#> 2 20 A
#> 3 40 C
#> 4 50 C
#> 5 70 B
#> 6 80 B
#> 7 90 B
#> 8 140 E
#> 9 150 E
Created on 2019-10-21 by the reprex package (v0.3.0)
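The lead()/lag() filter keeps a row when it equals either neighbour. The same comparison can be sketched in base R, assuming b is non-empty and has no NA values (the names adjacent_equal, same_as_next and same_as_prev are illustrative):

```r
b <- c("A", "A", "B", "C", "C", "A", "B", "B", "B", "C", "A", "C", "D", "E", "E")
n <- length(b)

# TRUE at position i when b[i] equals b[i + 1]
adjacent_equal <- b[-n] == b[-1]

# The last row has no next neighbour and the first has no previous one
same_as_next <- c(adjacent_equal, FALSE)
same_as_prev <- c(FALSE, adjacent_equal)

keep <- same_as_next | same_as_prev
which(keep)
```

In the dplyr version the boundary rows get NA from lead() and lag() instead of FALSE; filter() drops rows where the condition is NA, so the outcome is the same on this data.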
Here is another option (which should be faster):
DT[-DT[, {
x <- rowid(rleid(b)) < 2
.I[x & shift(x, -1L, fill=TRUE)]
}]]
timing code:
library(data.table)
set.seed(0L)
nr <- 1e7
nb <- 1e4
DT <- data.table(b=sample(nb, nr, TRUE))
#DT <- data.table(b=c("A", "A", "B", "C", "C", "A", "B", "B", "B", "C", "A", "C", "D", "E", "E"))
DT2 <- copy(DT)
mtd1 <- function(df) {
df[-cumsum(rle(b)$lengths)[rle(b)$lengths==1],]
}
mtd2 <- function(D) {
D[, N :=.N, by = rleid(b)][N > 1, .(b)]
}
mtd3 <- function(D) {
D[-D[, {
x <- rowid(rleid(b)) < 2
.I[x & shift(x, -1L, fill=TRUE)]
}]]
}
bench::mark(mtd1(DT), mtd2(DT2), mtd3(DT), check=FALSE)
timings:
# A tibble: 3 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
1 mtd1(DT) 1.1s 1.1s 0.908 1.98GB 10.9 1 12 1.1s <df[,1] [2,014 x ~ <df[,3] [59 x ~ <bch:t~ <tibble [1 x ~
2 mtd2(DT2) 2.88s 2.88s 0.348 267.12MB 0 1 0 2.88s <df[,1] [2,014 x ~ <df[,3] [23 x ~ <bch:t~ <tibble [1 x ~
3 mtd3(DT) 639.91ms 639.91ms 1.56 505.48MB 4.69 1 3 639.91ms <df[,1] [2,014 x ~ <df[,3] [24 x ~ <bch:t~ <tibble [1 x ~
Related
How do I rearrange the rows in tibble?
I wish to reorder rows such that: row with x = "c" goes to the bottom of the tibble, everything else remains same.
library(dplyr)
tbl <- tibble(x = c("a", "b", "c", "d", "e", "f", "g", "h"),
y = 1:8)
An alternative to dplyr::arrange(), using base R:
tbl[order(tbl$x == "c"), ] # Thanks to Merijn van Tilborg
Output:
# x y
# <chr> <int>
# 1 a 1
# 2 b 2
# 3 d 4
# 4 e 5
# 5 f 6
# 6 g 7
# 7 h 8
# 8 c 3
The dplyr equivalent:
tbl |> dplyr::arrange(x == "c")
Using forcats, convert to factor having c the last, then arrange. This doesn't change the class of the column x.
library(forcats)
tbl %>%
arrange(fct_relevel(x, "c", after = Inf))
# # A tibble: 8 x 2
# x y
# <chr> <int>
# 1 a 1
# 2 b 2
# 3 d 4
# 4 e 5
# 5 f 6
# 6 g 7
# 7 h 8
# 8 c 3
If the order of x is important, it is better to keep it as a factor. The code below changes the class from character to factor, with c as the last level:
tbl %>%
mutate(x = fct_relevel(x, "c", after = Inf)) %>%
arrange(x)
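Both order(tbl$x == "c") and arrange(x == "c") rely on the same idea: the comparison gives a logical vector, FALSE sorts before TRUE, and ties keep their original order. A short base R sketch of that mechanism (is_c, ord and moved are illustrative names):

```r
tbl <- data.frame(x = c("a", "b", "c", "d", "e", "f", "g", "h"),
                  y = 1:8,
                  stringsAsFactors = FALSE)

# TRUE only for the row that should move to the bottom
is_c <- tbl$x == "c"

# order() puts FALSE first and leaves tied rows in their original order
ord <- order(is_c)

moved <- tbl[ord, ]
moved
```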
I want to calculate the percentage difference based on groups in b (A,A;B,B;C,C) of a data.frame. The new variable rate should contain the change in c based on the time stamps a (i.e. 3 and 7, while the smaller number is always the previous). It should be NA, when there is just one time stamp for a given group.
I tried to use the common methods, but I always get a vector of zeros. Example data:
a <- c(3,7,3,7,3,7,3,3,7,3,7,3,7,7)
b <- c("a", "a", "b", "b", "c", "c", "d", "e", "e", "f","f", "g", "g", "h")
c <- runif(14, min=80, max=100)
df <- data.frame(a,b,c)
I tried the common approach using group_by, mutate and lag, i.e.
df %>% group_by(a,b) %>%
mutate(rate = 100 * (c - lag(c))/lag(c))
I also tried arrange before using lag, but the result I get is always 0. I think it's because this is not a time series, which is why we can't use lag. There must be an easy way to achieve this using regular functions.
If you use lag() to define your rate and group_by "b", you will end up with NA for every a == 3 row and the correct value for every a == 7 row. Also, c() is a primitive function in R, so it's best not to use "c" as the name of a variable.
Is this the outcome you're after?
library(tidyverse)
a <- c(3,7,3,7,3,7,3,3,7,3,7,3,7,7)
b <- c("a", "a", "b", "b", "c", "c", "d", "e", "e", "f","f", "g", "g", "h")
d <- c(80, 100, runif(12, min=80, max=100))
df <- data.frame(a,b,d)
df %>% group_by(b) %>%
mutate(rate = 100 * (d - lag(d, default = NA))/lag(d, default = NA))
#> # A tibble: 14 × 4
#> # Groups: b [8]
#> a b d rate
#> <dbl> <chr> <dbl> <dbl>
#> 1 3 a 80 NA
#> 2 7 a 100 25
#> 3 3 b 88.0 NA
#> 4 7 b 91.1 3.54
#> 5 3 c 95.1 NA
#> 6 7 c 82.7 -13.1
#> 7 3 d 92.6 NA
#> 8 3 e 84.1 NA
#> 9 7 e 91.8 9.20
#> 10 3 f 81.9 NA
#> 11 7 f 93.6 14.4
#> 12 3 g 88.7 NA
#> 13 7 g 80.6 -9.11
#> 14 7 h 99.2 NA
Created on 2021-12-20 by the reprex package (v2.0.1)
You can be more flexible with ifelse() too, e.g. if you want NA for groups that have a single row but zeros for rows where a == 3:
library(tidyverse)
a <- c(3,7,3,7,3,7,3,3,7,3,7,3,7,7)
b <- c("a", "a", "b", "b", "c", "c", "d", "e", "e", "f","f", "g", "g", "h")
d <- c(80, 100, runif(12, min=80, max=100))
df <- data.frame(a,b,d)
df %>% group_by(b) %>%
mutate(group_number = n()) %>%
mutate(rate = ifelse(group_number == 1, NA, ifelse(a == 7, 100 * (d - lag(d, default = NA))/lag(d, default = NA), 0))) %>%
select(-group_number) %>%
ungroup()
#> # A tibble: 14 × 4
#> a b d rate
#> <dbl> <chr> <dbl> <dbl>
#> 1 3 a 80 0
#> 2 7 a 100 25
#> 3 3 b 95.8 0
#> 4 7 b 83.9 -12.5
#> 5 3 c 87.0 0
#> 6 7 c 81.5 -6.26
#> 7 3 d 97.0 NA
#> 8 3 e 99.1 0
#> 9 7 e 82.6 -16.6
#> 10 3 f 82.3 0
#> 11 7 f 96.0 16.7
#> 12 3 g 99.5 0
#> 13 7 g 93.4 -6.09
#> 14 7 h 86.8 NA
Created on 2021-12-20 by the reprex package (v2.0.1)
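For comparison, the same per-group percentage change can be sketched in base R with ave(). This assumes the rows are already sorted by time stamp within each group; the data are a small fixed subset rather than runif() draws so the numbers are reproducible, and pct_change is a hypothetical helper name:

```r
df <- data.frame(a = c(3, 7, 3, 7, 3),
                 b = c("a", "a", "b", "b", "c"),
                 d = c(80, 100, 50, 40, 90))

# Percentage change relative to the previous element; the first element
# has no predecessor, so it gets NA (also covers single-row groups)
pct_change <- function(x) c(NA, 100 * diff(x) / head(x, -1))

# Apply the helper separately within each level of b
df$rate <- ave(df$d, df$b, FUN = pct_change)
df
```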
I'm struggling to calculate the number of days between occurrences: I need to know how many days pass between maintenances of a piece of equipment.
I have a data frame with many pieces of equipment and the dates of their maintenance, and I need to calculate the days between maintenances for each one. Here is a toy example:
test = data.frame(car = c("A", "A", "B", "B", "B", "C", "C", "D", "D", "D", "E"),
maintenance_date= c("20-09-2020", "25-09-2020", "14-05-2020", "20-05-2020", "20-05-2021", "11-01-2021", "13-01-2021", "13-01-2021", "15-01-2021", "15-01-2021", "13-01-2021"))
#test
# car maintenance_date
#1 A 20-09-2020
#2 A 25-09-2020
#3 B 14-05-2020
#4 B 20-05-2020
#5 B 20-05-2021
#6 C 11-01-2021
#7 C 13-01-2021
#8 D 13-01-2021
#9 D 15-01-2021
#10 D 15-01-2021
#11 E 13-01-2021
#for result, I'd like something like:
result
# car maintenance_date
#1 A 5
#2 B 6
#3 B 365
#4 C 2
#5 D 2
#6 D 0
I thought of using something like test %>% arrange(maintenance_date) %>% group_by(car) %>% ....
Any hint on how I can do that?
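We need to convert to Date class before doing the arrange, then group_by 'car' and take the differences. (In dplyr 1.1.0 and later, summarise() gives a deprecation warning when a group returns more or fewer than one row; reframe() is the replacement for that case.)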
library(dplyr)
library(lubridate)
test %>%
mutate(maintenance_date = dmy(maintenance_date)) %>%
arrange(maintenance_date) %>%
group_by(car) %>%
summarise(maintenance_date = diff(maintenance_date), .groups = 'drop')
Output:
# A tibble: 6 × 2
car maintenance_date
<chr> <drtn>
1 A 5 days
2 B 6 days
3 B 365 days
4 C 2 days
5 D 2 days
6 D 0 days
data.table
library(data.table)
setDT(test)
test[, maintenance_date := as.Date(maintenance_date, format="%d-%m-%Y")
][, .(ndays = diff(maintenance_date)), by = car]
# car ndays
# <char> <difftime>
# 1: A 5 days
# 2: B 6 days
# 3: B 365 days
# 4: C 2 days
# 5: D 2 days
# 6: D 0 days
Another solution, tidyverse-based, can be:
library(tidyverse)
library(lubridate)
test = data.frame(car = c("A", "A", "B", "B", "B", "C", "C", "D", "D", "D", "E"), maintenance_date= c("20-09-2020", "25-09-2020", "14-05-2020", "20-05-2020", "20-05-2021", "11-01-2021", "13-01-2021", "13-01-2021", "15-01-2021", "15-01-2021", "13-01-2021"))
test %>%
group_by(car) %>%
mutate(maintenance_date = c(-1,diff(dmy(maintenance_date)))) %>%
filter(maintenance_date >= 0) %>% ungroup
#> # A tibble: 6 × 2
#> car maintenance_date
#> <chr> <dbl>
#> 1 A 5
#> 2 B 6
#> 3 B 365
#> 4 C 2
#> 5 D 2
#> 6 D 0
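The same calculation can also be sketched without any packages: parse the dates, sort within each car, and take diff() per car. The names gaps and result are illustrative only:

```r
test <- data.frame(
  car = c("A", "A", "B", "B", "B", "C", "C", "D", "D", "D", "E"),
  maintenance_date = c("20-09-2020", "25-09-2020", "14-05-2020", "20-05-2020",
                       "20-05-2021", "11-01-2021", "13-01-2021", "13-01-2021",
                       "15-01-2021", "15-01-2021", "13-01-2021"),
  stringsAsFactors = FALSE
)

# Day-month-year strings must be parsed before any date arithmetic
test$maintenance_date <- as.Date(test$maintenance_date, format = "%d-%m-%Y")
test <- test[order(test$car, test$maintenance_date), ]

# diff() on Dates gives the gap in days; a car with one date yields no gaps
gaps <- lapply(split(test$maintenance_date, test$car),
               function(d) as.numeric(diff(d)))

result <- data.frame(car = rep(names(gaps), lengths(gaps)),
                     ndays = unlist(gaps, use.names = FALSE))
result
```

Car E has a single maintenance date, so it contributes no rows, matching the expected result in the question.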
I would like to know how to use row_number or anything else to transform a group variable into an integer
tibble_test <- tibble(A = letters[1:10], group = c("A", "A", "A", "B", "B", "C", "C", "C", "C", "D"))
# to get the enumeration inside each group of 'group'
tibble_test %>%
group_by(group) %>%
mutate(G1 = row_number())
But I would like to have this output:
# A tibble: 10 x 4
A group G1 G2
<chr> <chr> <dbl> <dbl>
1 a A 1 1
2 b A 2 1
3 c A 3 1
4 d B 1 2
5 e B 2 2
6 f C 1 3
7 g C 2 3
8 h C 3 3
9 i C 4 3
10 j D 1 4
My question is: how do I get this column G2? I know I could transform the 'group' variable into a factor and then an integer (after the tibble is arranged), but I would like to know if it can be done by counting.
You just need one more step: add the group index with cur_group_id() (in dplyr versions before 1.0.0 this was done with group_indices(), which is deprecated when called without arguments). Be aware that the index follows the sorted order of the group values, not the order in which they first appear.
library(dplyr)
tibble_test <- tibble(A = letters[1:10], group = c("A", "A", "A", "B", "B", "C", "C", "C", "C", "D"))
# to get the enumeration inside each group of 'group'
tibble_test %>%
group_by(group) %>%
mutate(G1 = row_number(),
G2 = cur_group_id())
# A tibble: 10 x 4
# Groups: group [4]
A group G1 G2
<chr> <chr> <int> <int>
1 a A 1 1
2 b A 2 1
3 c A 3 1
4 d B 1 2
5 e B 2 2
6 f C 1 3
7 g C 2 3
8 h C 3 3
9 i C 4 3
10 j D 1 4
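Since the question asks whether this can be done by counting, here is a base R sketch of both columns. It also shows how numbering by first appearance differs from numbering by sorted value when the groups are not already in order; the shuffled vector is made up for illustration:

```r
group <- c("A", "A", "A", "B", "B", "C", "C", "C", "C", "D")

# G1: running count within each group
G1 <- ave(seq_along(group), group, FUN = seq_along)

# G2: position of each value among the distinct values, in order of appearance
G2 <- match(group, unique(group))

# With unsorted groups the two numbering schemes disagree
shuffled <- c("B", "B", "A", "C", "C", "A")
by_appearance <- match(shuffled, unique(shuffled))
by_sorted_value <- as.integer(factor(shuffled))
```

On the question's data the two schemes coincide because the groups already appear in sorted order.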
I am trying to remove rows that have offsetting values.
library(dplyr)
a <- c(1, 1, 1, 1, 2, 2, 2, 2,2,2)
b <- c("a", "b", "b", "b", "c", "c","c", "d", "d", "d")
d <- c(10, 10, -10, 10, 20, -20, 20, 30, -30, 30)
o <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
df <- tibble(ID = a, SEQ = b, VALUE = d, OTHER = o)
Generates this ordered table that is grouped by ID and SEQ.
> df
# A tibble: 10 x 4
ID SEQ VALUE OTHER
<dbl> <chr> <dbl> <chr>
1 1 a 10 A
2 1 b 10 B
3 1 b -10 C
4 1 b 10 D
5 2 c 20 E
6 2 c -20 F
7 2 c 20 G
8 2 d 30 H
9 2 d -30 I
10 2 d 30 J
I want to drop the row pairs (2,3), (5,6), (8,9) because VALUE negates the VALUE in the matching previous row.
I want the resulting table to be
> df2
# A tibble: 4 x 4
ID SEQ VALUE OTHER
<dbl> <chr> <dbl> <chr>
1 1 a 10 A
2 1 b 10 D
3 2 c 20 G
4 2 d 30 J
I know that I can't use group_by %>% summarize, because I need to keep the value that is in OTHER. I've looked at the dplyr::lag() function but I don't see how that can help. I believe that I could loop through the table with some type of for each loop and generate a logical vector that can be used to drop the rows, but I was hoping for a more elegant solution.
What about:
vec <- cbind(
c(head(df$VALUE,-1) + df$VALUE[-1], 9999) ,
df$VALUE + c(9999, head(df$VALUE,-1))
)
vec <- apply(vec,1,prod)
vec <- vec!=0
df[vec,]
# A tibble: 4 x 4
ID SEQ VALUE OTHER
<dbl> <chr> <dbl> <chr>
1 1 a 10 A
2 1 b 50 D
3 2 c 60 G
4 2 d 70 J
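The idea is to take your VALUE field and add it to a copy of itself shifted by one position, once forward and once backward. When either sum is 0, the row is removed. Note that the output above shows VALUE 50, 60 and 70 for rows D, G and J, so it appears to come from a different version of the data; with the data as posted in the question (10, -10, 10 within a group), rows D, G and J are flagged as well, because each one also cancels against the row before it.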
Here's another solution with dplyr. Not sure about the edge case you mentioned in the comments, but feel free to test it with my solution:
library(dplyr)
df %>%
group_by(ID, SEQ) %>%
mutate(diff = VALUE + lag(VALUE),
diff2 = VALUE + lead(VALUE)) %>%
mutate(across(diff:diff2, ~ coalesce(.x, 1))) %>%
filter((diff != 0 & diff2 != 0)) %>%
select(-diff, -diff2)
Result:
# A tibble: 4 x 4
# Groups: ID, SEQ [4]
ID SEQ VALUE OTHER
<dbl> <chr> <dbl> <chr>
1 1 a 10 A
2 1 b 50 D
3 2 c 60 G
4 2 d 70 J
Note:
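This solution first creates two diff columns: one adds the lag of VALUE to each VALUE, the other adds the lead. Only offsetting rows have a zero in diff or in diff2, so filtering those rows out gives the output shown. That output has VALUE 50, 60 and 70 in the last row of each group; with the data as posted in the question, where the last value repeats the first (10, -10, 10), every row of such a group has a zero in diff or diff2 and the whole group is dropped.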
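For the edge case where a value is cancelled and then appears again (10, -10, 10), one possible approach is to walk through each group and cancel pairs one at a time, so that a row already used in a pair cannot cancel a second row. This is only a sketch of that idea, not one of the methods above; keep_after_cancelling is a hypothetical helper, and it treats two consecutive zeros as an offsetting pair:

```r
df <- data.frame(ID = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
                 SEQ = c("a", "b", "b", "b", "c", "c", "c", "d", "d", "d"),
                 VALUE = c(10, 10, -10, 10, 20, -20, 20, 30, -30, 30),
                 OTHER = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J"),
                 stringsAsFactors = FALSE)

# Returns TRUE for rows that survive after offsetting pairs are removed.
keep_after_cancelling <- function(value) {
  kept <- integer(0)                      # indices still kept, used as a stack
  for (i in seq_along(value)) {
    n <- length(kept)
    if (n > 0 && value[i] == -value[kept[n]]) {
      kept <- kept[-n]                    # row i cancels the last kept row: drop both
    } else {
      kept <- c(kept, i)                  # no cancellation: keep row i for now
    }
  }
  seq_along(value) %in% kept
}

# Apply the helper within each ID/SEQ group (ave() returns 0/1 here)
keep <- as.logical(ave(df$VALUE, df$ID, df$SEQ, FUN = keep_after_cancelling))
df2 <- df[keep, ]
df2
```

With dplyr, the same helper could be used as group_by(ID, SEQ) followed by filter(keep_after_cancelling(VALUE)).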