Efficient way to find repeated runs of rows, remove, & count - r

I have a data set with repeating rows. I want to remove consecutively repeated rows and count them, but only if they're consecutive. I'm looking for an efficient way to do this; I can't think of how to do it in dplyr or data.table.
MWE
dat <- data.frame(
x = c(6, 2, 3, 3, 3, 1, 1, 6, 5, 5, 6, 6, 5, 4),
y = c(7, 5, 7, 7, 7, 5, 5, 7, 1, 2, 7, 7, 1, 7),
z = c(rep(LETTERS[1:2], each=7))
)
## x y z
## 1 6 7 A
## 2 2 5 A
## 3 3 7 A
## 4 3 7 A
## 5 3 7 A
## 6 1 5 A
## 7 1 5 A
## 8 6 7 B
## 9 5 1 B
## 10 5 2 B
## 11 6 7 B
## 12 6 7 B
## 13 5 1 B
## 14 4 7 B
Desired output
x y z n
1 6 7 A 1
2 2 5 A 1
3 3 7 A 3
4 1 5 A 2
5 6 7 B 1
6 5 1 B 1
7 5 2 B 1
8 6 7 B 2
9 5 1 B 1
10 4 7 B 1

With data.table:
library(data.table)
setDT(dat)
dat[, c(.SD[1L], .N), by=.(g = rleidv(dat))][, g := NULL]
x y z N
1: 6 7 A 1
2: 2 5 A 1
3: 3 7 A 3
4: 1 5 A 2
5: 6 7 B 1
6: 5 1 B 1
7: 5 2 B 1
8: 6 7 B 2
9: 5 1 B 1
10: 4 7 B 1
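For intuition: rleidv(dat) gives every run of identical consecutive rows its own id, so grouping by it isolates the runs. On the MWE:
data.table::rleidv(dat)
# [1]  1  2  3  3  3  4  4  5  6  7  8  8  9 10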

Similar to Ricky's answer, here's another base solution:
with(rle(do.call(paste, dat)), cbind(dat[ cumsum(lengths), ], lengths))
In case paste doesn't cut it for the column classes you have, you can do
ud = unique(dat)
ud$r = seq_len(nrow(ud))
dat$r0 = seq_len(nrow(dat))
newdat = merge(dat, ud)
with(rle(newdat[order(newdat$r0), ]$r), cbind(dat[cumsum(lengths), c("x", "y", "z")], lengths))
... though I'm guessing there's some better way.

With dplyr, you can borrow data.table::rleid to make a run ID column, then use n() to count rows and distinct() to chop out the repeats:
library(dplyr)
dat %>%
  group_by(run = data.table::rleid(x, y, z)) %>%
  mutate(n = n()) %>%
  distinct() %>%
  ungroup() %>%
  select(-run)
You can replace rleid with just base R, if you like, but it's not as pretty:
dat %>%
  group_by(run = rep(seq_along(rle(paste(x, y, z))$lengths),
                     times = rle(paste(x, y, z))$lengths)) %>%
  mutate(n = n()) %>%
  distinct() %>%
  ungroup() %>%
  select(-run)
Either way, you get:
Source: local data frame [10 x 4]
x y z n
(dbl) (dbl) (fctr) (int)
1 6 7 A 1
2 2 5 A 1
3 3 7 A 3
4 1 5 A 2
5 6 7 B 1
6 5 1 B 1
7 5 2 B 1
8 6 7 B 2
9 5 1 B 1
10 4 7 B 1
Edit
Per @Frank's comment, you can also use summarise to compute n and collapse the runs instead of mutate and distinct, provided you group_by all the variables you want to keep before run, as summarise collapses the last grouping level. One advantage of this approach is that you don't have to ungroup to get rid of run, as summarise does that for you:
dat %>%
  group_by(x, y, z, run = data.table::rleid(x, y, z)) %>%
  summarise(n = n()) %>%
  select(-run)

A base solution below; note that the strsplit/rbind round trip returns every column as character:
idx <- rle(with(dat, paste(x, y, z)))
d <- cbind(do.call(rbind, strsplit(idx$values, " ")), idx$lengths)
as.data.frame(d)
V1 V2 V3 V4
1 6 7 A 1
2 2 5 A 1
3 3 7 A 3
4 1 5 A 2
5 6 7 B 1
6 5 1 B 1
7 5 2 B 1
8 6 7 B 2
9 5 1 B 1
10 4 7 B 1

If you have a large dataset, you could use a similar idea to Frank's data.table solution, but avoid using .SD like this:
dat[, g := rleidv(dat)][, N := .N, keyby = g
][J(unique(g)), mult = "first"
][, g := NULL
][]
It's less readable, and it turns out it's slower too, so Frank's solution wins on both counts.
# benchmark on 14 million rows
dat <- data.frame(
x = rep(c(6, 2, 3, 3, 3, 1, 1, 6, 5, 5, 6, 6, 5, 4), 1e6),
y = rep(c(7, 5, 7, 7, 7, 5, 5, 7, 1, 2, 7, 7, 1, 7), 1e6),
z = rep(c(rep(LETTERS[1:2], each=7)), 1e6)
)
setDT(dat)
d1 <- copy(dat)
d2 <- copy(dat)
With R 3.2.4 and data.table 1.9.7 (on Frank's computer):
system.time(d1[, c(.SD[1L], .N), by=.(g = rleidv(d1))][, g := NULL])
# user system elapsed
# 0.42 0.10 0.52
system.time(d2[, g := rleidv(d2)][, N := .N, keyby = g][J(unique(g)), mult = "first"][, g := NULL][])
# user system elapsed
# 2.48 0.25 2.74

Not much different from the other answers, but (1) having ordered data and (2) looking for consecutive runs seems a good candidate for just ORing x[-1L] != x[-length(x)] across columns instead of pasting or other complex operations. I guess this is, in effect, equivalent to data.table::rleid.
ans = logical(nrow(dat) - 1L)  # ans[i] is TRUE when row i+1 differs from row i
for (j in seq_along(dat)) ans[dat[[j]][-1L] != dat[[j]][-nrow(dat)]] = TRUE
ans = c(TRUE, ans)             # the first row always starts a run
# or, in two passes: c(TRUE, Reduce("|", lapply(dat, function(x) x[-1L] != x[-length(x)])))
cbind(dat[ans, ], n = tabulate(cumsum(ans)))  # keep run starts; count rows per run
# x y z n
#1 6 7 A 1
#2 2 5 A 1
#3 3 7 A 3
#6 1 5 A 2
#8 6 7 B 1
#9 5 1 B 1
#10 5 2 B 1
#11 6 7 B 2
#13 5 1 B 1
#14 4 7 B 1
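A quick sanity check that this marking really matches rleid (assuming the original MWE dat and that data.table is installed):
all(cumsum(ans) == data.table::rleidv(dat))
# [1] TRUE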

Another base attempt using ave, just because:
dat$grp <- ave(
seq_len(nrow(dat)),
dat[c("x","y","z")],
FUN=function(x) cumsum(c(1,diff(x))!=1)
)
dat$count <- ave(dat$grp, dat, FUN=length)
dat[!duplicated(dat[1:4]),]
# x y z grp count
#1 6 7 A 0 1
#2 2 5 A 0 1
#3 3 7 A 0 3
#6 1 5 A 0 2
#8 6 7 B 0 1
#9 5 1 B 0 1
#10 5 2 B 0 1
#11 6 7 B 1 2
#13 5 1 B 1 1
#14 4 7 B 0 1
And a data.table conversion attempt:
d1[, .(sq = .I, grp = cumsum(c(1, diff(.I)) != 1)), by = list(x, y, z)][order(sq), .N, by = list(x, y, z, grp)]

Related

Left join by group and condition (`tidyverse` or `data.table`)

I have a very large data frame that includes the integer columns state and state_cyclen. Every row is a gameframe: state describes the state the game is in at that frame, and state_cyclen is coded to indicate the nth occurrence of that state (it is basically data.table::rleid(state)). Conditioning on state and cycling by state_cyclen, I need to import several columns from other definition data frames. Definition data frames store properties about each state, and their row ordering determines the way these properties are cycled throughout the game (players encounter each game state many times).
A minimal example of the long data that should be left joined:
data <- data.frame(
state = c(1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3, 2, 2, 3, 3, 3, 4, 4, 3, 3),
state_cyclen = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 1, 1, 4, 4)
)
data
#> state state_cyclen
#> 1 1 1
#> 2 1 1
#> 3 2 1
#> 4 2 1
#> 5 3 1
#> 6 3 1
#> 7 1 2
#> 8 1 2
#> 9 2 2
#> 10 2 2
#> 11 3 2
#> 12 3 2
#> 13 2 3
#> 14 2 3
#> 15 3 3
#> 16 3 3
#> 17 3 3
#> 18 4 1
#> 19 4 1
#> 20 3 4
#> 21 3 4
Minimal example for definition data frames storing the ordering:
def_one <- data.frame(
prop = letters[1:3],
others = LETTERS[1:3]
)
def_two <- data.frame(
prop = letters[4:10],
others = LETTERS[4:10]
)
def_three <- data.frame(
prop = letters[11:12],
others = LETTERS[11:12]
)
I have a solution written in base R that gives the desired output, but it's neither very readable, nor probably very efficient.
# Add empty columns
data$prop <- NA
data$others <- NA
# Function that recycles a numeric vector bounded by an upper limit
bounded_vec_recyc <- function(vec, n) if(n == 1) vec else (vec - 1) %% n + 1
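# For intuition (illustrative): bounded_vec_recyc(1:5, 2) returns 1 2 1 2 1,
# i.e. positions recycle within the n rows of a definition table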
# My solution
vec_pos_one <- data[data[, "state"] == 1, ]$state_cyclen
vec_pos_one <- bounded_vec_recyc(vec_pos_one, n = nrow(def_one))
data[data[, "state"] == 1, ][, c("prop", "others")] <- def_one[vec_pos_one,]
vec_pos_two <- data[data[, "state"] == 2, ]$state_cyclen
vec_pos_two <- bounded_vec_recyc(vec_pos_two, n = nrow(def_two))
data[data[, "state"] == 2, ][, c("prop", "others")] <- def_two[vec_pos_two,]
vec_pos_three <- data[data[, "state"] == 3, ]$state_cyclen
vec_pos_three <- bounded_vec_recyc(vec_pos_three, n = nrow(def_three))
data[data[, "state"] == 3, ][, c("prop", "others")] <- def_three[vec_pos_three,]
data
#> state state_cyclen prop others
#> 1 1 1 a A
#> 2 1 1 a A
#> 3 2 1 d D
#> 4 2 1 d D
#> 5 3 1 k K
#> 6 3 1 k K
#> 7 1 2 b B
#> 8 1 2 b B
#> 9 2 2 e E
#> 10 2 2 e E
#> 11 3 2 l L
#> 12 3 2 l L
#> 13 2 3 f F
#> 14 2 3 f F
#> 15 3 3 k K
#> 16 3 3 k K
#> 17 3 3 k K
#> 18 4 1 <NA> <NA>
#> 19 4 1 <NA> <NA>
#> 20 3 4 l L
#> 21 3 4 l L
Created on 2022-08-30 with reprex v2.0.2
TLDR: As you can see, I am basically trying to merge these definition data frames one by one into the main data frame on the corresponding state, recycling the rows of the definition data frame while retaining their order, and using the state_cyclen column to keep track of occurrences of each state throughout the game.
Is there a way to do this within the tidyverse or data.table that is faster or at least easier to read? I need this to be quite fast as I have many such gameframe files (in the hundreds) and they are lengthy (hundreds of thousands of rows).
P.S. Not sure if title is adequate for the operations I am doing, as I can imagine multiple ways of implementation. Edits on it are welcome.
Here, I make a lookup table combining the three sources. Then I join the data with the number of rows for each state, modify the state_cyclen in data using modulo with that number to be within the lookup range, then join.
library(tidyverse)
def <- bind_rows(def_one, def_two, def_three, .id = "state") %>%
mutate(state = as.numeric(state)) %>%
group_by(state) %>%
mutate(state_cyclen_adj = row_number()) %>%
ungroup()
data %>%
left_join(def %>% count(state)) %>%
# eg for row 15 we change 3 to 1 since the lookup table only has 2 rows
mutate(state_cyclen_adj = (state_cyclen - 1) %% n + 1) %>%
left_join(def)
Joining, by = "state"
Joining, by = c("state", "state_cyclen_adj")
state state_cyclen n state_cyclen_adj prop others
1 1 1 3 1 a A
2 1 1 3 1 a A
3 2 1 7 1 d D
4 2 1 7 1 d D
5 3 1 2 1 k K
6 3 1 2 1 k K
7 1 2 3 2 b B
8 1 2 3 2 b B
9 2 2 7 2 e E
10 2 2 7 2 e E
11 3 2 2 2 l L
12 3 2 2 2 l L
13 2 3 7 3 f F
14 2 3 7 3 f F
15 3 3 2 1 k K
16 3 3 2 1 k K
17 3 3 2 1 k K
18 4 1 NA NA <NA> <NA>
19 4 1 NA NA <NA> <NA>
20 3 4 2 2 l L
21 3 4 2 2 l L
Here is a data.table solution. Not sure it is easier to read, but pretty sure it is more efficient:
library(data.table)
dt <- rbind(setDT(def_one)[,state := 1],
setDT(def_two)[,state := 2],
setDT(def_three)[,state := 3])
dt[,state_cyclen := 1:.N,by = state]
data <- setDT(data)
data[dt[,.N,by = state],
state_cyclen := bounded_vec_recyc(state_cyclen,i.N),
on = "state",
by = .EACHI]
dt[data,on = c("state","state_cyclen")]
prop others state state_cyclen
1: a A 1 1
2: a A 1 1
3: d D 2 1
4: d D 2 1
5: k K 3 1
6: k K 3 1
7: b B 1 2
8: b B 1 2
9: e E 2 2
10: e E 2 2
11: l L 3 2
12: l L 3 2
13: f F 2 3
14: f F 2 3
15: k K 3 1
16: k K 3 1
17: k K 3 1
18: <NA> <NA> 4 1
19: <NA> <NA> 4 1
20: l L 3 2
21: l L 3 2
prop others state state_cyclen
By step:
I bind the def_one, def_two and def_three data frames to create a data.table with the state variable needed for the merge:
dt <- rbind(setDT(def_one)[,state := 1],
setDT(def_two)[,state := 2],
setDT(def_three)[,state := 3])
dt[,state_cyclen := 1:.N,by = state]
In case you want to merge a lot of dataframes, you can use rbindlist and a list of data.tables.
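For instance, a sketch of that variant, assuming the raw definition frames (before any state column has been added to them) are collected in a list in state order:
defs <- list(def_one, def_two, def_three)
dt <- rbindlist(defs, idcol = "state")   # idcol records each table's position in the list
dt[, state := as.numeric(state)]         # match the type of data$state
dt[, state_cyclen := seq_len(.N), by = state]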
I then modify state_cyclen in data to do the same recycling as you do:
dt[,.N,by = state]
state N
1: 1 3
2: 2 7
3: 3 2
gives the lengths you use to define your recycling.
data[dt[,.N,by = state],
state_cyclen := bounded_vec_recyc(state_cyclen,i.N),
on = "state",
by = .EACHI]
I use by = .EACHI to modify the variable for each group during the merge, using the N column from dt[, .N, by = state].
Then I just have to do the left join:
dt[data,on = c("state","state_cyclen")]
An option with nest/unnest
library(dplyr)
library(tidyr)
data %>%
nest_by(state) %>%
left_join(tibble(state = 1:3, dat = list(def_one, def_two, def_three))) %>%
mutate(data = list(bind_cols(data, if(!is.null(dat))
dat[data %>%
pull(state_cyclen) %>%
bounded_vec_recyc(., nrow(dat)),] else NULL)), dat = NULL) %>%
ungroup %>%
unnest(data)
-output
# A tibble: 21 × 4
state state_cyclen prop others
<dbl> <dbl> <chr> <chr>
1 1 1 a A
2 1 1 a A
3 1 2 b B
4 1 2 b B
5 2 1 d D
6 2 1 d D
7 2 2 e E
8 2 2 e E
9 2 3 f F
10 2 3 f F
# … with 11 more rows

How to replace repeating entries in a data frame with n-(number of times it's repeated) in R?

In my data I have repeating entries in a column. If an entry n is repeated more than 2 times within a column, I want to replace that entry with n - (number_of_times_it_has_repeated - 2). For example, if my data looks like this:
df <- data.frame(
A = c(1,2,2,4,5,7,7,7,7,2,8,8),
B = c(2,3,4,5,6,7,8,9,10,11,12,13)
)
> df
A B
1 2
2 3
2 4
4 5
5 6
7 7
7 8
7 9
7 10
2 11
8 12
8 13
we can see that in df$A 7 is repeated 4 times. Since it is repeated more than 2 times, the 1st and 2nd entries of the number 7 remain unchanged, the 3rd instance of the number 7 is replaced by 7 - (3-2), and the 4th instance by 7 - (4-2).
We can also see that in df$A the number 2 is repeated 3 times. Using the same method, the 3rd instance of number 2 would be replaced with 2 - (3-2).
As there are no repeating values in df$B, that column would remain unchanged.
For clarity, my expected result would be:
dfNew <- data.frame(
A = c(1,2,2,4,5,7,7,6,5,1,8,8),
B = c(2,3,4,5,6,7,8,9,10,11,12,13)
)
> dfNew
A B
1 2
2 3
2 4
4 5
5 6
7 7
7 8
6 9
5 10
1 11
8 12
8 13
Here's how you can do it for one column -
library(dplyr)
df %>%
group_by(A) %>%
transmute(A = A - c(rep(0, 2), row_number())[row_number()]) %>%
ungroup
# A
# <dbl>
# 1 1
# 2 2
# 3 2
# 4 4
# 5 5
# 6 7
# 7 7
# 8 6
# 9 5
#10 1
#11 8
#12 8
To do it for all the columns you can use map_dfc -
purrr::map_dfc(names(df), ~ {
  df %>%
    group_by(.data[[.x]]) %>%
    transmute(!!.x := .data[[.x]] - c(rep(0, 2), row_number())[row_number()]) %>%
    ungroup
})
# A B
# <dbl> <dbl>
# 1 1 2
# 2 2 3
# 3 2 4
# 4 4 5
# 5 5 6
# 6 7 7
# 7 7 8
# 8 6 9
# 9 5 10
#10 1 11
#11 8 12
#12 8 13
The logic here is that for each number we subtract 0 from the first 2 values and after that we subtract 1, 2 and so on.
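To make that concrete, here is the offset vector built for a value occurring 4 times (like the 7s), written in base terms:
n <- 4
c(rep(0, 2), seq_len(n))[seq_len(n)]
# [1] 0 0 1 2
# so 7 7 7 7 becomes 7 7 6 5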
You can skip the ordering if you don't want it. Here is my approach; if you have data where duplicates remain after the changes, I can rework the answer into a function or something.
library(dplyr)
my_df <- data.frame(A = c(1, 2, 2, 4, 5, 7, 7, 7, 7, 2, 8, 8),
                    B = c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13),
                    stringsAsFactors = FALSE)
my_df <- my_df[order(my_df$A, my_df$B), ]
my_df$Id <- seq.int(from = 1, to = nrow(my_df), by = 1)
my_temp <- my_df %>%
  group_by(A) %>%
  filter(n() > 2) %>%
  mutate(Count = seq.int(from = 1, to = n(), by = 1)) %>%
  filter(Count > 2) %>%
  mutate(A = A - (Count - 2))
my_var <- which(my_df$Id %in% my_temp$Id)
if (length(my_var)) {
my_df <- my_df[-my_var,]
my_df <- rbind(my_df, my_temp[, c("A", "B", "Id")])
}
my_df <- my_df[order(my_df$A, my_df$B),]
A base R option using ave + pmax + seq_along
list2DF(
lapply(
df,
function(x) {
x - ave(x, x, FUN = function(v) pmax(seq_along(v) - 2, 0))
}
)
)
gives
A B
1 1 2
2 2 3
3 2 4
4 4 5
5 5 6
6 7 7
7 7 8
8 6 9
9 5 10
10 1 11
11 8 12
12 8 13

Adding a column conditional on the other columns

I want to create a new column z based on the values of x and y: if x > y then z = y, otherwise z = x.
x y
3 4
5 2
6 6
1 7
9 4
Output required:
x y z
3 4 3
5 2 2
6 6 6
1 7 1
9 4 4
You can use ifelse:
df$z <- with(df, ifelse(x > y, y, x))
#Or without with
#df$z <- ifelse(df$x > df$y, df$y, df$x)
df
# x y z
#1 3 4 3
#2 5 2 2
#3 6 6 6
#4 1 7 1
#5 9 4 4
In dplyr, you can use if_else, which works the same as above, or case_when, which is helpful when you have to list multiple conditions.
library(dplyr)
df %>%
mutate(z = case_when(x > y ~ y,
TRUE ~x))
If I get it correctly, you are looking for the minimum value across several columns. You can use the pmin function:
library(dplyr)
df <- data.frame(x = c(3,5,6,1,9),
y = c(4,2,6,7,4))
df <- df %>% mutate(z = pmin(x, y))
result:
> df
x y z
1 3 4 3
2 5 2 2
3 6 6 6
4 1 7 1
5 9 4 4
It computes the minimum row-wise across the data frame and simplifies the syntax if you would like to include more than 2 columns:
df <- data.frame(x = c(3, 5, 6, 1, 9),
y = c(4, 2, 6, 7, 4),
a = c(2, 5, 7, 3, 3))
df <- df %>% mutate(z = pmin(x, y, a))
result:
> df
x y a z
1 3 4 2 2
2 5 2 5 2
3 6 6 7 6
4 1 7 3 1
5 9 4 3 3
Similar to another answer but using data.table and pmin:
library(data.table)
dt <- data.table(x = c(3,5,6,1,9),
y = c(4,2,6,7,4))
dt[, z:= pmin(x,y)]
dt
# x y z
# 1: 3 4 3
# 2: 5 2 2
# 3: 6 6 6
# 4: 1 7 1
# 5: 9 4 4
Function pmin returns the parallel minima (https://www.rdocumentation.org/packages/mc2d/versions/0.1-17/topics/pmin)
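For a quick illustration of the difference from min, which collapses everything to a single value:
pmin(c(3, 5, 6, 1, 9), c(4, 2, 6, 7, 4))
# [1] 3 2 6 1 4
min(c(3, 5, 6, 1, 9), c(4, 2, 6, 7, 4))
# [1] 1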
Another option with fifelse in data.table
library(data.table)
setDT(dt)[, z := fifelse(x > y, y, x)]

Conditional update similar to SQL

I have the following dataframe
library(tidyverse)
x <- c(1,2,3,NA,NA,4,5)
y <- c(1,2,3,5,5,4,5)
z <- c(1,1,1,6,7,7,8)
df <- data.frame(x,y,z)
df
x y z
1 1 1 1
2 2 2 1
3 3 3 1
4 NA 5 6
5 NA 5 7
6 4 4 7
7 5 5 8
I would like to update the dataframe according to the following conditions
If z==1, update to x=1, else leave the current value for x
If z==1, update to y=2, else leave the current value for y
The following code does the job fine
df %>% mutate(x=if_else(z==1,1,x),y=if_else(z==1,2,y))
x y z
1 1 2 1
2 1 2 1
3 1 2 1
4 NA 5 6
5 NA 5 7
6 4 4 7
7 5 5 8
However, I have to add an if_else statement to the mutate call for each of x and y. This has the potential to make my code complicated and hard to read. To give you a SQL analogy, consider the following code:
UPDATE df
SET x= 1, y= 2
WHERE z = 1;
I would like to achieve the following:
Specify the update condition ahead of time, so I don't have to repeat it for every mutate function
I would like to avoid using data.table or base R. I am using dplyr so I would like to stick to it for consistency
Using mutate_cond, posted at dplyr mutate/replace several columns on a subset of rows, we can do this:
df %>% mutate_cond(z == 1, x = 1, y = 2)
giving:
x y z
1 1 2 1
2 1 2 1
3 1 2 1
4 NA 5 6
5 NA 5 7
6 4 4 7
7 5 5 8
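For reference, a version of mutate_cond along the lines of the linked post (a sketch; see that question for the original and its variants):
mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
  condition <- eval(substitute(condition), .data, envir)     # evaluate the condition inside the data
  .data[condition, ] <- .data[condition, ] %>% mutate(...)   # mutate only the matching rows
  .data
}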
sqldf
Of course you can implement it directly in SQL with sqldf; ignore the warning message that the RSQLite backend issues.
library(sqldf)
sqldf(c("update df set x = 1, y = 2 where z = 1", "select * from df"))
base R
It's straightforward in base R:
df[df$z == 1, c("x", "y")] <- list(1, 2)
library(dplyr)
df %>%
mutate(x = replace(x, z == 1, 1),
y = replace(y, z == 1, 2))
# x y z
#1 1 2 1
#2 1 2 1
#3 1 2 1
#4 NA 5 6
#5 NA 5 7
#6 4 4 7
#7 5 5 8
In base R
transform(df,
x = replace(x, z == 1, 1),
y = replace(y, z == 1, 2))
If you store the condition in a variable, you don't have to type it multiple times
condn = (df$z == 1)
transform(df,
x = replace(x, condn, 1),
y = replace(y, condn, 2))
Here is one option with map2. Loop through the 'x' and 'y' columns of the dataset along with the values to assign, apply case_when based on the values of 'z' (if TRUE, return the new value, or else return the existing column), and bind the result back to the original dataset:
library(dplyr)
library(purrr)
map2_df(df %>% select(x, y), c(1, 2),
        ~ case_when(df$z == 1 ~ .y, TRUE ~ .x)) %>%
  bind_cols(df %>% select(z), .) %>%
  select(names(df))
Or using base R: create a logical vector, use it to subset the rows of columns 'x' and 'y', and update by assigning a list of values:
i1 <- df$z == 1
df[i1, c('x', 'y')] <- list(1, 2)
df
# x y z
#1 1 2 1
#2 1 2 1
#3 1 2 1
#4 NA 5 6
#5 NA 5 7
#6 4 4 7
#7 5 5 8
The advantage of both solutions is that we can pass any number of columns with corresponding values without repeating the code.
If you have an SQL background, you should really check out data.table:
library(data.table)
dt <- as.data.table(df)
set(dt, which(dt$z == 1), c('x', 'y'), list(1, 2))
dt
# or perhaps more classic syntax
dt <- as.data.table(df)
dt
# x y z
#1: 1 1 1
#2: 2 2 1
#3: 3 3 1
#4: NA 5 6
#5: NA 5 7
#6: 4 4 7
#7: 5 5 8
dt[z == 1, `:=`(x = 1, y = 2)]
dt
# x y z
#1: 1 2 1
#2: 1 2 1
#3: 1 2 1
#4: NA 5 6
#5: NA 5 7
#6: 4 4 7
#7: 5 5 8
The last option is an update join. This is great if you have the lookup data already done upfront:
# update join:
dt <- as.data.table(df)
dt_lookup <- data.table(x = 1, y = 2, z = 1)
dt[dt_lookup, on = .(z), `:=`(x = i.x, y = i.y)]
dt
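Since the lookup is an ordinary table, it scales to several rules at once; a sketch, where the values for a second condition z == 7 are made up for illustration:
dt <- as.data.table(df)
dt_lookup <- data.table(z = c(1, 7), x = c(1, 0), y = c(2, 0))  # second row is hypothetical
dt[dt_lookup, on = .(z), `:=`(x = i.x, y = i.y)]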

Carry / use value from previous group

data:
structure(list(id = c(1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 5),
ax = c("a", "a", "b", "b", "b", "b", "b", "b", "c", "c",
"d", "d", "e"), time = c(1, 3, 0, 2, 4, 5, 6, 8, 7, 9, 10,
11, 12)), .Names = c("id", "ax", "time"), class = c("data.table",
"data.frame"), row.names = c(NA, -13L))
looks like:
id ax time
1: 1 a 1
2: 1 a 3
3: 2 b 0
4: 2 b 2
5: 2 b 4
6: 2 b 5
7: 2 b 6
8: 2 b 8
9: 3 c 7
10: 3 c 9
11: 4 d 10
12: 4 d 11
13: 5 e 12
I want to have the max of the previous group next to the current group:
desired output:
id ax time newCol
1: 1 a 1 NA
2: 1 a 3 NA
3: 2 b 0 3
4: 2 b 2 3
5: 2 b 4 3
6: 2 b 5 3
7: 2 b 6 3
8: 2 b 8 3
9: 3 c 7 8
10: 3 c 9 8
11: 4 d 10 9
12: 4 d 11 9
13: 5 e 12 11
Is it also possible to have the value of the "previous-previous" group?
Interested in base R, data.table and tidyverse solutions.
note:
Can be grouped by EITHER id or ax. The example is a little redundant here.
A data.table solution:
dtt.max <- dtt[, .(max = max(time)), by = ax]
dtt.max[, max.prev := shift(max)]
dtt[dtt.max, newCol := i.max.prev, on = 'ax']
# > dtt
# id ax time newCol
# 1: 1 a 1 NA
# 2: 1 a 3 NA
# 3: 2 b 0 3
# 4: 2 b 2 3
# 5: 2 b 4 3
# 6: 2 b 5 3
# 7: 2 b 6 3
# 8: 2 b 8 3
# 9: 3 c 7 8
# 10: 3 c 9 8
# 11: 4 d 10 9
# 12: 4 d 11 9
# 13: 5 e 12 11
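The "previous-previous" group the OP asks about only needs a larger shift; a sketch reusing dtt.max from above:
dtt.max[, max.prev2 := shift(max, 2)]
dtt[dtt.max, newCol2 := i.max.prev2, on = 'ax']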
data.table solution using id + 1
library(data.table)
merge(d, setDT(d)[, max(time), id + 1], all.x = TRUE)
Here is a dplyr approach. The key here is to group and ungroup when necessary:
df %>%
group_by(ax) %>%
mutate(new = time[n()]) %>%
ungroup() %>%
mutate(new = lag(new)) %>%
group_by(ax) %>%
mutate(new = new[1])
# A tibble: 13 x 4
# Groups: ax [5]
id ax time new
<dbl> <chr> <dbl> <dbl>
1 1. a 1. NA
2 1. a 3. NA
3 2. b 0. 3.
4 2. b 2. 3.
5 2. b 4. 3.
6 2. b 5. 3.
7 2. b 6. 3.
8 2. b 8. 3.
9 3. c 7. 8.
10 3. c 9. 8.
11 4. d 10. 9.
12 4. d 11. 9.
13 5. e 12. 11.
Assuming id is the same as group:
dfr <- dfr %>% group_by(id) %>% mutate(groupmax = max(time))
dfr$old_group_max <- dfr$groupmax[match(dfr$id - 1, dfr$id)]
The antepenultimate group is left as an exercise :-)
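For completeness, a sketch of that exercise with the same match trick (assuming consecutive integer ids as above):
dfr$prev2_group_max <- dfr$groupmax[match(dfr$id - 2, dfr$id)]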
1) This uses no packages. It computes the maximum for each group, giving Ag, and then lags it, giving LagMax. Finally it left joins that back into the original data frame DF using merge:
Ag <- aggregate(time ~ id, DF, max)
LagMax <- transform(Ag, lagmax = c(NA, head(time, -1)), time = NULL)
merge(DF, LagMax, by = "id", all.x = TRUE)
giving:
id ax time lagmax
1 1 a 1 NA
2 1 a 3 NA
3 2 b 0 3
4 2 b 2 3
5 2 b 4 3
6 2 b 5 3
7 2 b 6 3
8 2 b 8 3
9 3 c 7 8
10 3 c 9 8
11 4 d 10 9
12 4 d 11 9
13 5 e 12 11
2) This sorts time within id so that we know that the maximum is the last value in each id group.
o <- order(factor(DF$id, levels = unique(DF$id)), DF$time)
Time <- DF$time[o]
lagmax <- function(r) if (r[1] == 1) NA else Time[r[1] - 1]
transform(DF, lagmax = ave(seq_along(id), id, FUN = lagmax))
In the question the time values are already sorted within id and if that is known to be the case the above could be shortened to:
lagmax <- function(r) if (r[1] == 1) NA else DF$time[r[1] - 1]
transform(DF, lagmax = ave(seq_along(id), id, FUN = lagmax))
3) This one-liner is a data.table translation of (2):
library(data.table)
DT <- copy(DF) # don't overwrite DF
setDT(DT)[, g:=rleid(id)][, lagmax := DT$time[.I[1]-1], keyby = c("g", "id")]
In the sample data in the question time is sorted within id and if that were known to be the case we could use the following shorter code in place of the last line above
setDT(DT)[, lagmax := DT$time[.I[1]-1], by = id]
