Is there a way in R to group by 'runs'?

Say I have:
df <- data.frame(group = c(1, 1, 1, 1, 2, 2, 2, 2),
                 date = c("2000-01-01", "2000-01-02", "2000-01-04", "2000-01-05", "2000-01-09", "2000-01-10", "2000-01-11", "2000-01-13"),
                 want_group = c(1, 1, 2, 2, 3, 3, 3, 4))
I want to create a want_group variable that groups by group and date, starting a new id whenever the dates within a group stop being consecutive ("daily"). So, for example, within group 1 I want one unique id for the 1st and 2nd, then a new unique id for the 4th and 5th, and similarly within group 2 one id for the 9th, 10th, and 11th and another for the 13th.
group date want_group
1 1 2000-01-01 1
2 1 2000-01-02 1
3 1 2000-01-04 2
4 1 2000-01-05 2
5 2 2000-01-09 3
6 2 2000-01-10 3
7 2 2000-01-11 3
8 2 2000-01-13 4
Thanks,

We can use diff and cumsum to calculate the runs. The counter increments every time the difference between consecutive dates is more than 1 day.
df$new <- cumsum(c(TRUE, diff(as.Date(df$date)) > 1))
df
# group date want_group new
#1 1 2000-01-01 1 1
#2 1 2000-01-02 1 1
#3 1 2000-01-04 2 2
#4 1 2000-01-05 2 2
#5 2 2000-01-09 3 3
#6 2 2000-01-10 3 3
#7 2 2000-01-11 3 3
#8 2 2000-01-13 4 4
We prepend TRUE because diff returns a vector one element shorter than its input.
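For reference, the intermediate pieces on the example data look like this (using the df defined above):
diff(as.Date(df$date))
# Time differences in days
# [1] 1 2 1 4 1 1 2
cumsum(c(TRUE, diff(as.Date(df$date)) > 1))
# [1] 1 1 2 2 3 3 3 4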
To handle this by group, we prepend TRUE once per group (so every group starts a new run) and take the cumsum over the ungrouped column, which keeps the ids unique across groups:
library(dplyr)
df %>%
  mutate(date = as.Date(date)) %>%
  group_by(group) %>%
  mutate(new = c(TRUE, diff(date) > 1)) %>%
  ungroup() %>%
  mutate(new = cumsum(new))
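For completeness, the same per-group logic ported to data.table (a sketch, not from the original answers): flag run starts within each group, then take the cumulative sum over the whole table so the ids stay unique.
library(data.table)
setDT(df)[, new := c(TRUE, diff(as.Date(date)) > 1), by = group]
df[, new := cumsum(new)]
df$new
# [1] 1 1 2 2 3 3 3 4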

With base R, we can also do
df$date <- as.Date(df$date)
df$new <- with(df, cumsum(c(TRUE, date[-1] - date[-length(date)] > 1)))
df$new
#[1] 1 1 2 2 3 3 3 4
Or use the difference with lag in dplyr; the trailing + 1 makes the ids start at 1 (the cumsum alone numbers the first run 0):
library(dplyr)
df %>%
  mutate(date = as.Date(date),
         want_group = cumsum(date - lag(date, default = first(date)) > 1) + 1)
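which reproduces the requested grouping:
# group date want_group
#1 1 2000-01-01 1
#2 1 2000-01-02 1
#3 1 2000-01-04 2
#4 1 2000-01-05 2
#5 2 2000-01-09 3
#6 2 2000-01-10 3
#7 2 2000-01-11 3
#8 2 2000-01-13 4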

Related

A computationally efficient way to find the IDs of the Type 1 rows just above and below each Type 2 row?

I have the following data
library(tibble)
df <- tibble(Type = c(1, 2, 2, 1, 1, 2), ID = c(6, 4, 3, 2, 1, 5))
Type ID
1 6
2 4
2 3
1 2
1 1
2 5
For each of the type 2 rows, I want to find the IDs of the type 1 rows just below and above them. For the above dataset, the output will be:
Type ID IDabove IDbelow
1 6 NA NA
2 4 6 2
2 3 6 2
1 2 NA NA
1 1 NA NA
2 5 1 NA
Naively, I could write a for loop to achieve this, but that would be too time-consuming for the dataset I am dealing with.
One approach uses dplyr's lead and lag to get the next and previous values, respectively, and data.table's rleid to create groups of consecutive Type values.
library(dplyr)
library(data.table)
df %>%
  mutate(IDabove = ifelse(Type == 2, lag(ID), NA),
         IDbelow = ifelse(Type == 2, lead(ID), NA),
         grp = rleid(Type)) %>%
  group_by(grp) %>%
  mutate(IDabove = first(IDabove),
         IDbelow = last(IDbelow)) %>%
  ungroup() %>%
  select(-grp)
# Type ID IDabove IDbelow
# <dbl> <dbl> <dbl> <dbl>
#1 1 6 NA NA
#2 2 4 6 2
#3 2 3 6 2
#4 1 2 NA NA
#5 1 1 NA NA
#6 2 5 1 NA
A dplyr-only solution:
You could create your own rleid function and then apply the logic provided by Ronak (many thanks, upvoted).
library(dplyr)
my_func <- function(x) {
  # lengths of runs of consecutive identical values
  x <- rle(x)$lengths
  # repeat a run counter once per element: a minimal stand-in for rleid
  rep(seq_along(x), times = x)
}
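As a quick check, the helper reproduces data.table::rleid on the example Type column:
my_func(c(1, 2, 2, 1, 1, 2))
# [1] 1 2 2 3 3 4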
# this part is the same as provided by Ronak.
df %>%
  mutate(IDabove = ifelse(Type == 2, lag(ID), NA),
         IDbelow = ifelse(Type == 2, lead(ID), NA),
         grp = my_func(Type)) %>%
  group_by(grp) %>%
  mutate(IDabove = first(IDabove),
         IDbelow = last(IDbelow)) %>%
  ungroup() %>%
  select(-grp)
Output:
Type ID IDabove IDbelow
<dbl> <dbl> <dbl> <dbl>
1 1 6 NA NA
2 2 4 6 2
3 2 3 6 2
4 1 2 NA NA
5 1 1 NA NA
6 2 5 1 NA

Create a count-consecutive variable which resets to 1

I have a dataset like the following, where "group" is a group variable. I want to count consecutive ('next day') rows within each group, resetting the count to 1 whenever a date is not the day after the previous one (as shown in the "want" column). Then I want to return the max of that count per group (as in "want2"). Suggestions would be appreciated!
df <- data.frame(group = c(1, 1, 1, 1, 2, 2, 2),
                 date = c("2000-01-01", "2000-01-03", "2000-01-04", "2000-01-05", "2000-01-09", "2000-01-10", "2000-01-12"),
                 want = c(1, 1, 2, 3, 1, 2, 1),
                 want2 = c(3, 3, 3, 3, 2, 2, 2))
Bonus part 2: thank you for all the feedback, it was extremely helpful. Is there a way to do the same with an added condition? I have a binary variable, and I also want my count to reset when that variable == 0. Like so:
# group date binary want
#1 1 2000-01-01 1 1
#2 1 2000-01-03 1 1
#3 1 2000-01-04 1 2
#4 1 2000-01-05 0 1
#5 2 2000-01-09 1 1
#6 2 2000-01-10 0 1
#7 2 2000-01-12 1 1
#8 3 2000-01-05 1 1
#9 3 2000-01-06 1 2
#10 3 2000-01-07 1 3
#11 3 2000-01-08 1 4
I have tried akrun's suggestion, which worked very well without the binary variable. I tried to modify it by adding the binary variable inside the cumsum, but I get errors:
df %>%
  group_by(group) %>%
  mutate(wantn = rowid(cumsum(c(TRUE, diff(as.Date(date)) != 1 & binary == 1))))
Thanks!
An option is to group by 'group', use diff on the Date-converted 'date' to create a logical vector, and take its cumsum to replicate the results in 'want' ('wantn'); then apply max on 'wantn' to get 'want2n'.
library(dplyr)
library(data.table)
df %>%
  group_by(group) %>%
  mutate(wantn = rowid(cumsum(c(TRUE, diff(as.Date(date)) != 1))),
         want2n = max(wantn))
# A tibble: 7 x 6
# Groups: group [2]
# group date want want2 wantn want2n
# <dbl> <fct> <dbl> <dbl> <int> <int>
#1 1 2000-01-01 1 3 1 3
#2 1 2000-01-03 1 3 1 3
#3 1 2000-01-04 2 3 2 3
#4 1 2000-01-05 3 3 3 3
#5 2 2000-01-09 1 2 1 2
#6 2 2000-01-10 2 2 2 2
#7 2 2000-01-12 1 2 1 2
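Here rowid (from data.table) builds the within-run counter: it numbers rows within each distinct value of the cumsum, e.g.:
rowid(c(1, 1, 2, 2, 2, 3))
# [1] 1 2 1 2 3 1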
Or, if we prefer not to use rowid, create the grouping variable with cumsum and take the row sequence:
df %>%
  group_by(group) %>%
  group_by(group2 = cumsum(c(TRUE, diff(as.Date(date)) != 1)), add = TRUE) %>%
  mutate(wantn = row_number()) %>%
  group_by(group) %>%
  mutate(want2n = max(wantn)) %>%
  select(-group2)
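For the bonus part with the binary variable, a minimal sketch along the same lines (assuming df holds the bonus data shown above, including the binary column, and assuming one reading of the rule: a run breaks either on a day gap or on any row where binary == 0):
df %>%
  group_by(group) %>%
  # reset the counter on a non-consecutive date OR on binary == 0
  mutate(wantn = rowid(cumsum(c(TRUE, diff(as.Date(date)) != 1) | binary == 0)))
This reproduces the 'want' column of the bonus example.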

Add fixed number of rows for each group with values based on another column

I have a large dataframe containing IDs and a start date of intervention for each ID:
ID Date
1 1 17228
2 2 17226
3 3 17230
And I would like to add 2 rows to each ID with subsequent dates as the values in those rows:
ID Date
1 1 17228
2 1 17229
3 1 17230
4 2 17226
5 2 17227
6 2 17228
7 3 17230
8 3 17231
9 3 17232
Is there any way using dplyr if possible? Other ways are also fine!
We expand the data with uncount, then, grouped by 'ID', build the sequence starting at the first 'Date' with length n(), incrementing by 1:
library(tidyverse)
df1 %>%
  uncount(3) %>%
  group_by(ID) %>%
  mutate(Date = seq(Date[1], length.out = n(), by = 1))
# A tibble: 9 x 2
# Groups: ID [3]
# ID Date
# <int> <dbl>
#1 1 17228
#2 1 17229
#3 1 17230
#4 2 17226
#5 2 17227
#6 2 17228
#7 3 17230
#8 3 17231
#9 3 17232
Or another option is to unnest a list column
df1 %>%
  group_by(ID) %>%
  mutate(Date = list(Date[1] + 0:2)) %>%
  unnest(Date)
Or with complete
df1 %>%
  group_by(ID) %>%
  complete(Date = first(Date) + 0:2)
Or using base R (pasting from the comments); 0:2 recycles across the three repeated rows of each ID:
within(df1[rep(seq_len(nrow(df1)), each = 3),], Date <- Date + 0:2)
Or more compactly in data.table
library(data.table)
setDT(df1)[, .(Date = Date + 0:2), ID]
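which returns the same nine rows:
#    ID  Date
# 1:  1 17228
# 2:  1 17229
# 3:  1 17230
# 4:  2 17226
# 5:  2 17227
# 6:  2 17228
# 7:  3 17230
# 8:  3 17231
# 9:  3 17232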
Another base R option splits by ID and appends two rows per group:
do.call(rbind, lapply(split(d, d$ID), function(x) {
  rbind(x, data.frame(ID = rep(tail(x$ID, 1), 2),
                      Date = tail(x$Date, 1) + 1:2))
}))
# ID Date
#1.1 1 17228
#1.11 1 17229
#1.2 1 17230
#2.2 2 17226
#2.1 2 17227
#2.21 2 17228
#3.3 3 17230
#3.1 3 17231
#3.2 3 17232
Data
d = structure(list(ID = 1:3, Date = c(17228L, 17226L, 17230L)),
class = "data.frame",
row.names = c("1", "2", "3"))
Using dplyr, we can repeat every row 3 times, group_by ID, and increment the dates by 0 to n() - 1 within each ID.
library(dplyr)
df %>%
  slice(rep(seq_len(n()), each = 3)) %>%
  group_by(ID) %>%
  mutate(Date = Date + 0:(n() - 1))
# ID Date
# <int> <int>
#1 1 17228
#2 1 17229
#3 1 17230
#4 2 17226
#5 2 17227
#6 2 17228
#7 3 17230
#8 3 17231
#9 3 17232
A base R one-liner using the same logic above would be
transform(df[rep(seq_len(nrow(df)), each = 3),], Date = Date + 0:2)

Delete a row where the order is wrong within a group

I have a dataset with about 1000 groups; each group is ordered from 1-100 (the values can be any numbers within 100). As I was looking through the data, I found that some groups had bad orders, i.e., the order would climb toward 100 and then suddenly a 24 would show up. How can I delete all of this error data?
I would like to find every row that doesn't follow the increasing order within its group and delete it.
Any help would be great!
lag gives the previous value, so order - lag(order) computes the difference between the current and previous value; the filter then keeps only non-negative differences, i.e. rows where the current value is at least the previous one. The order == min(order) condition is needed because lag gives NA for the first value of each group. I keep the helper column diff to check, but you can drop it using %>% select(-diff).
library(dplyr)
df1 %>%
  group_by(group) %>%
  mutate(diff = order - lag(order)) %>%
  filter(diff >= 0 | order == min(order))
# A tibble: 8 x 3
# Groups: group [2]
group order diff
<int> <int> <int>
1 1 1 NA
2 1 3 2
3 1 5 2
4 1 10 5
5 2 1 NA
6 2 4 3
7 2 4 0
8 2 8 4
Data
df1 <- read.table(text="
group order
1 1
1 3
1 5
1 10
1 2
2 1
2 4
2 4
2 8
2 3
",header=T, stringsAsFactors = F)
Assuming the order column increments by 1 every time, we can use ave to remove, by group, the rows that do not have a difference of 1 from the previous row.
df[ave(df$order, df$group, FUN = function(x) c(1, diff(x))) == 1, ]
# group order
#1 1 1
#2 1 2
#3 1 3
#4 1 4
#6 2 1
#7 2 2
#8 2 3
#9 2 4
EDIT
For the updated example, we can just change the comparison
df[ave(df$order, df$group, FUN = function(x) c(1, diff(x))) >= 0, ]
Playing with data.table:
library(data.table)
setDT(df1)[, diffo := c(1, diff(order)), group][diffo == 1, .(group, order)]
group order
1: 1 1
2: 1 2
3: 1 3
4: 1 4
5: 2 1
6: 2 2
7: 2 3
8: 2 4
Where df1 is:
df1 <- data.frame(
group = rep(1:2, each = 5),
order = c(1:4, 2, 1:4, 3)
)
EDIT
If you only need increasing order, and not steps of one, you can do:
df3 <- transform(df1, order = c(1,3,5,10,2,1,4,7,9,3))
setDT(df3)[, diffo := c(1, diff(order)), group][diffo >= 1, .(group, order)]
group order
1: 1 1
2: 1 3
3: 1 5
4: 1 10
5: 2 1
6: 2 4
7: 2 7
8: 2 9

Update or add value to aggregate in data.frame

Let's say I have the following simple data.frame:
ID value
1 1 3
2 2 4
3 1 5
4 3 3
My desired output is below: we either add a new value to the running cumsum, or update it according to the latest value of an already-used ID.
ID value cumsum
1 1 3 3
2 2 4 7
3 1 5 9
4 3 3 12
In row 3, the new value updates the cumsum (7 - 3 + 5 = 9). Row 4 adds a new value to the cumsum because the ID was not used before (4 + 5 + 3 = 12).
This produces the desired outcome for your example. The trick is to replace each value with its increment over the previous value within the same ID; an ordinary cumulative sum over those increments then updates automatically:
df<-read.table(header=T, text="ID value
1 1 3
2 2 4
3 1 5
4 3 3")
library(tidyverse)
df %>%
  group_by(ID) %>%
  mutate(value = value - lag(value, default = 0L)) %>%
  ungroup() %>%
  mutate(cumsum = cumsum(value))
# # A tibble: 4 x 3
# ID value cumsum
# <int> <int> <int>
# 1 1 3 3
# 2 2 4 7
# 3 1 2 9
# 4 3 3 12
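The same increment trick also works in base R (a sketch, not part of the original answers), using ave to compute the within-ID increments on the df defined above:
df$cumsum <- cumsum(ave(df$value, df$ID, FUN = function(x) c(x[1], diff(x))))
df$cumsum
# [1] 3 7 9 12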
I used data.table for the cumsum. Calculating the cumulative mean is a bit more tricky because the number of observations is not adjusted by just using cummean.
library(data.table)
library(zoo)  # for na.locf
dt <- data.table(id = c(1, 2, 1, 3), value = c(3, 4, 5, 3))
# increment over the previous value within each id
dt[, tmp := value - shift(value, n = 1L, type = "lag", fill = 0), by = c("id")]
# CUMSUM
dt[, cumsum := cumsum(tmp)]
# CUMMEAN WITH UPDATED N: count rows whose value was adjusted (id seen before)
dt[value != tmp, skip := 1:.N]
dt[, skip := na.locf(skip, na.rm = FALSE)]
dt[is.na(skip), skip := 0]
dt[, cummean := cumsum(tmp) / (seq_along(tmp) - skip)]
Output is:
id value tmp cumsum skip cummean
1: 1 3 3 3 0 3.0
2: 2 4 4 7 0 3.5
3: 1 5 2 9 1 4.5
4: 3 3 3 12 1 4.0
Edit: Changed lag function to data.table's shift function.
