This is a follow-up question to one I already posted here: Count occurrence of IDs within the last x days in R
I'm trying to do another rolling count. I have the following data:
date = c("2014-04-01", "2014-04-12", "2014-04-07", "2014-05-03", "2014-04-14", "2014-05-04", "2014-03-31", "2014-04-18", "2014-04-23", "2014-04-01")
group = c("G","G","F","G","E","E","H","H","H","A")
ID = c(2, 3, 4, 2, 3, 1, 2, 4, 2, 1)
group date ID
1: G 2014-04-01 2
2: G 2014-04-12 3
3: F 2014-04-07 4
4: G 2014-05-03 2
5: E 2014-04-14 3
6: E 2014-05-04 1
7: H 2014-03-31 2
8: H 2014-04-18 4
9: H 2014-04-23 2
10: A 2014-04-01 1
For each group, I would like to count the number of unique IDs within the past 30 days of the current date. The desired count column would look like this:
group date ID count
1: G 2014-04-01 2 1
2: G 2014-04-12 3 2
3: F 2014-04-07 4 1
4: G 2014-05-03 2 2
5: E 2014-04-14 3 1
6: E 2014-05-04 1 2
7: H 2014-03-31 2 1
8: H 2014-04-18 4 2
9: H 2014-04-23 2 3
10: A 2014-04-01 1 1
In my previous thread, a solution by @ThomasIsCoding was provided. I tried to modify his code to perform my new task as follows:
dt[date <= first(date) + 30, date := as.Date(date)][, count := uniqueN(ID), group]
group date ID count
1: G 2014-04-01 2 2
2: G 2014-04-12 3 2
3: F 2014-04-07 4 1
4: G 2014-05-03 2 2
5: E 2014-04-14 3 2
6: E 2014-05-04 1 2
7: H 2014-03-31 2 2
8: H 2014-04-18 4 2
9: H 2014-04-23 2 2
10: A 2014-04-01 1 1
But it doesn't seem to take the time-range condition into account. Any suggestions are greatly appreciated!
An option using non-equi join:
# lower bound of each row's 30-day window
DT[, onemthago := date - 30L]
# non-equi self-join: for each row of .SD, count the unique IDs in its
# group whose date falls within [onemthago, date]
DT[, count :=
     DT[.SD, on = .(group, date >= onemthago, date <= date),
        by = .EACHI, length(unique(ID))]$V1
]
output:
group date ID onemthago count
1: G 2014-04-01 2 2014-03-02 1
2: G 2014-04-12 3 2014-03-13 2
3: F 2014-04-07 4 2014-03-08 1
4: G 2014-05-03 2 2014-04-03 2
5: E 2014-04-14 3 2014-03-15 1
6: E 2014-05-04 1 2014-04-04 2
7: H 2014-03-31 2 2014-03-01 1
8: H 2014-04-18 4 2014-03-19 2
9: H 2014-04-23 2 2014-03-24 2
10: A 2014-04-01 1 2014-03-02 1
data:
date = as.Date(c("2014-04-01", "2014-04-12", "2014-04-07", "2014-05-03", "2014-04-14", "2014-05-04", "2014-03-31", "2014-04-18", "2014-04-23", "2014-04-01"))
group = c("G","G","F","G","E","E","H","H","H","A")
ID = c(2, 3, 4, 2, 3, 1, 2, 4, 2, 1)
library(data.table)
DT <- data.table(group, date, ID)
Edit to address the comment on multiple lookback periods. You can try something like:
for (x in c(30L, 90L)) {
  DT[, daysago := date - x]
  DT[, paste0("count", x) :=
       .SD[.SD, on = .(group, date >= daysago, date <= date),
           by = .EACHI, length(unique(ID))]$V1
  ][]
}
DT
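Note that daysago is overwritten on each pass of the loop (it ends up holding the values for the last x), so once the counts are in place the helper columns can be dropped:
DT[, c("onemthago", "daysago") := NULL]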
If I understood your problem correctly, one alternative way within the tidyverse would be this:
library(tidyverse)
tb <- dplyr::tibble(date = c("2014-04-01", "2014-04-12", "2014-04-07", "2014-05-03", "2014-04-14", "2014-05-04", "2014-03-31", "2014-04-18", "2014-04-23", "2014-04-01"),
group = c("G","G","F","G","E","E","H","H","H","A"),
ID = c(2, 3, 4, 2, 3, 1, 2, 4, 2, 1))
tb %>%
dplyr::group_by(group) %>%
dplyr::mutate(as.numeric(difftime(Sys.Date(), date)) < 31) %>%
dplyr::distinct(ID) %>%
dplyr::count(group) %>%
dplyr::right_join(tb) %>%
dplyr::select(group, date, ID, Count = n)
group date ID Count
<chr> <chr> <dbl> <int>
1 A 2014-04-01 1 1
2 E 2014-04-14 3 2
3 E 2014-05-04 1 2
4 F 2014-04-07 4 1
5 G 2014-04-01 2 2
6 G 2014-04-12 3 2
7 G 2014-05-03 2 2
8 H 2014-03-31 2 2
9 H 2014-04-18 4 2
10 H 2014-04-23 2 2
For the rolling-window type of calculation, this should be a solution:
tb %>%
dplyr::full_join(tb, by = "group") %>%
dplyr::filter(as.numeric(difftime(as.Date(date.x), as.Date(date.y), units = "days")) >= 0 & as.numeric(difftime(date.x, date.y, units = "days")) < 31) %>%
dplyr::distinct(group, date.x, ID.y) %>%
dplyr::count(group, date.x) %>%
# you might want to cut the pipe here and look at the result (do not forget to delete the %>% in the line above when removing the part below)
dplyr::right_join(tb, by = c("group", "date.x" = "date")) %>%
dplyr::select(group, date = date.x, ID, count = n)
group date ID count
<chr> <chr> <dbl> <int>
1 A 2014-04-01 1 1
2 E 2014-04-14 3 1
3 E 2014-05-04 1 2
4 F 2014-04-07 4 1
5 G 2014-04-01 2 1
6 G 2014-04-12 3 2
7 G 2014-05-03 2 2
8 H 2014-03-31 2 1
9 H 2014-04-18 4 2
10 H 2014-04-23 2 2
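The two difftime conditions in the filter can also be collapsed with dplyr::between, a small variation on the pipeline above (between(x, 0, 30) is equivalent to x >= 0 & x < 31 for whole-day differences):
tb %>%
  dplyr::full_join(tb, by = "group") %>%
  dplyr::filter(dplyr::between(as.numeric(difftime(as.Date(date.x), as.Date(date.y), units = "days")), 0, 30)) %>%
  dplyr::distinct(group, date.x, ID.y) %>%
  dplyr::count(group, date.x) %>%
  dplyr::right_join(tb, by = c("group", "date.x" = "date")) %>%
  dplyr::select(group, date = date.x, ID, count = n)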
See how well this proposed solution works for your case.
date = c("2014-04-01", "2014-04-12", "2014-04-07", "2014-05-03", "2014-04-14", "2014-05-04", "2014-03-31", "2014-04-18", "2014-04-23", "2014-04-01")
group = c("G","G","F","G","E","E","H","H","H","A")
ID = c(2, 3, 4, 2, 3, 1, 2, 4, 2, 1)
dt <- data.table( date=as.Date(date), group, ID )
setkey( dt, group, date )
calc.id <- function(ID,date) {
## there is always going to be at least 1
id.count <- 1
if( length(ID) > 1 ) {
v <- sapply( 2:length(ID), function(i) {
j <- date[ 1:i ] >= date[i] - 30
uniqueN( ID[j] )
})
id.count <- c( id.count, v )
}
return( id.count )
}
dt[ , count := calc.id(ID,date), by=group ]
dt
I didn't find a way around using sapply for the rolling calculation, and I doubt it will outperform the solution above.
It produces this:
date group ID count
1: 2014-04-01 A 1 1
2: 2014-04-14 E 3 1
3: 2014-05-04 E 1 2
4: 2014-04-07 F 4 1
5: 2014-04-01 G 2 1
6: 2014-04-12 G 3 2
7: 2014-05-03 G 2 2
8: 2014-03-31 H 2 1
9: 2014-04-18 H 4 2
10: 2014-04-23 H 2 2
I have a very large data frame that includes the integer columns state and state_cyclen. Every row is a game frame: state describes the state the game is in at that frame, and state_cyclen indicates the nth occurrence of that state (it is basically data.table::rleid(state)). Conditioning on state and cycling by state_cyclen, I need to import several columns from other definition data frames. The definition data frames store properties of each state, and their row ordering encodes the way these properties are cycled throughout the game (players encounter each game state many times).
A minimal example of the long data that should be left joined:
data <- data.frame(
state = c(1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3, 2, 2, 3, 3, 3, 4, 4, 3, 3),
state_cyclen = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 1, 1, 4, 4)
)
data
#> state state_cyclen
#> 1 1 1
#> 2 1 1
#> 3 2 1
#> 4 2 1
#> 5 3 1
#> 6 3 1
#> 7 1 2
#> 8 1 2
#> 9 2 2
#> 10 2 2
#> 11 3 2
#> 12 3 2
#> 13 2 3
#> 14 2 3
#> 15 3 3
#> 16 3 3
#> 17 3 3
#> 18 4 1
#> 19 4 1
#> 20 3 4
#> 21 3 4
Minimal example for definition data frames storing the ordering:
def_one <- data.frame(
prop = letters[1:3],
others = LETTERS[1:3]
)
def_two <- data.frame(
prop = letters[4:10],
others = LETTERS[4:10]
)
def_three <- data.frame(
prop = letters[11:12],
others = LETTERS[11:12]
)
I have a solution written in base R that gives the desired output, but it is neither very readable nor, probably, very efficient.
# Add empty columns
data$prop <- NA
data$others <- NA
# Function that recycles a numeric vector bounded by an upper limit
bounded_vec_recyc <- function(vec, n) (vec - 1) %% n + 1
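# For instance, positions are recycled within the bound
# (a quick sanity check, not needed for the solution):
bounded_vec_recyc(1:5, 3)
#> [1] 1 2 3 1 2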
# My solution
vec_pos_one <- data[data[, "state"] == 1, ]$state_cyclen
vec_pos_one <- bounded_vec_recyc(vec_pos_one, n = nrow(def_one))
data[data[, "state"] == 1, ][, c("prop", "others")] <- def_one[vec_pos_one,]
vec_pos_two <- data[data[, "state"] == 2, ]$state_cyclen
vec_pos_two <- bounded_vec_recyc(vec_pos_two, n = nrow(def_two))
data[data[, "state"] == 2, ][, c("prop", "others")] <- def_two[vec_pos_two,]
vec_pos_three <- data[data[, "state"] == 3, ]$state_cyclen
vec_pos_three <- bounded_vec_recyc(vec_pos_three, n = nrow(def_three))
data[data[, "state"] == 3, ][, c("prop", "others")] <- def_three[vec_pos_three,]
data
#> state state_cyclen prop others
#> 1 1 1 a A
#> 2 1 1 a A
#> 3 2 1 d D
#> 4 2 1 d D
#> 5 3 1 k K
#> 6 3 1 k K
#> 7 1 2 b B
#> 8 1 2 b B
#> 9 2 2 e E
#> 10 2 2 e E
#> 11 3 2 l L
#> 12 3 2 l L
#> 13 2 3 f F
#> 14 2 3 f F
#> 15 3 3 k K
#> 16 3 3 k K
#> 17 3 3 k K
#> 18 4 1 <NA> <NA>
#> 19 4 1 <NA> <NA>
#> 20 3 4 l L
#> 21 3 4 l L
Created on 2022-08-30 with reprex v2.0.2
TLDR: As you can see, I am basically trying to merge these definition data frames one by one into the main data frame on the corresponding state, recycling the rows of each definition data frame while retaining their order and using the state_cyclen column to keep track of the occurrences of each state throughout the game.
Is there a way to do this within the tidyverse or data.table that is faster or at least easier to read? I need this to be quite fast as I have many such gameframe files (in the hundreds) and they are lengthy (hundreds of thousands of rows).
P.S. Not sure if the title is adequate for the operations I am doing, as I can imagine multiple ways of implementing this. Edits to it are welcome.
Here, I make a lookup table combining the three sources. Then I join the data with the number of rows for each state, adjust state_cyclen using modulo arithmetic with that number so it stays within the lookup range, and then join the lookup table.
library(tidyverse)
def <- bind_rows(def_one, def_two, def_three, .id = "state") %>%
mutate(state = as.numeric(state)) %>%
group_by(state) %>%
mutate(state_cyclen_adj = row_number()) %>%
ungroup()
data %>%
left_join(def %>% count(state)) %>%
# eg for row 15 we change 3 to 1 since the lookup table only has 2 rows
mutate(state_cyclen_adj = (state_cyclen - 1) %% n + 1) %>%
left_join(def)
Joining, by = "state"
Joining, by = c("state", "state_cyclen_adj")
state state_cyclen n state_cyclen_adj prop others
1 1 1 3 1 a A
2 1 1 3 1 a A
3 2 1 7 1 d D
4 2 1 7 1 d D
5 3 1 2 1 k K
6 3 1 2 1 k K
7 1 2 3 2 b B
8 1 2 3 2 b B
9 2 2 7 2 e E
10 2 2 7 2 e E
11 3 2 2 2 l L
12 3 2 2 2 l L
13 2 3 7 3 f F
14 2 3 7 3 f F
15 3 3 2 1 k K
16 3 3 2 1 k K
17 3 3 2 1 k K
18 4 1 NA NA <NA> <NA>
19 4 1 NA NA <NA> <NA>
20 3 4 2 2 l L
21 3 4 2 2 l L
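As a side note, supplying by explicitly silences the "Joining, by = ..." messages; the same pipeline spelled out:
data %>%
  left_join(count(def, state), by = "state") %>%
  mutate(state_cyclen_adj = (state_cyclen - 1) %% n + 1) %>%
  left_join(def, by = c("state", "state_cyclen_adj"))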
Here is a data.table solution. Not sure it is easier to read, but pretty sure it is more efficient:
library(data.table)
dt <- rbind(setDT(def_one)[,state := 1],
setDT(def_two)[,state := 2],
setDT(def_three)[,state := 3])
dt[,state_cyclen := 1:.N,by = state]
data <- setDT(data)
data[dt[,.N,by = state],
state_cyclen := bounded_vec_recyc(state_cyclen,i.N),
on = "state",
by = .EACHI]
dt[data,on = c("state","state_cyclen")]
prop others state state_cyclen
1: a A 1 1
2: a A 1 1
3: d D 2 1
4: d D 2 1
5: k K 3 1
6: k K 3 1
7: b B 1 2
8: b B 1 2
9: e E 2 2
10: e E 2 2
11: l L 3 2
12: l L 3 2
13: f F 2 3
14: f F 2 3
15: k K 3 1
16: k K 3 1
17: k K 3 1
18: <NA> <NA> 4 1
19: <NA> <NA> 4 1
20: l L 3 2
21: l L 3 2
prop others state state_cyclen
Step by step:
I bind the def_one, def_two and def_three data frames to create a data.table with the variables needed for the merge:
dt <- rbind(setDT(def_one)[,state := 1],
setDT(def_two)[,state := 2],
setDT(def_three)[,state := 3])
dt[,state_cyclen := 1:.N,by = state]
In case you want to merge a lot of data frames, you can use rbindlist on a list of data.tables.
I then modify state_cyclen in data to apply the same recycling as you do:
dt[,.N,by = state]
state N
1: 1 3
2: 2 7
3: 3 2
gives the lengths you use to define your recycling.
data[dt[,.N,by = state],
state_cyclen := bounded_vec_recyc(state_cyclen,i.N),
on = "state",
by = .EACHI]
I use by = .EACHI to modify the variable for each group during the merge, using the N variable from dt[,.N,by = state].
Then I just have to do the left join:
dt[data,on = c("state","state_cyclen")]
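If you prefer to keep data's column order and update it by reference instead of materializing a new table, an update join is a possible variation on the last step:
data[dt, on = c("state", "state_cyclen"), `:=`(prop = i.prop, others = i.others)]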
An option with nest/unnest
library(dplyr)
library(tidyr)
data %>%
nest_by(state) %>%
left_join(tibble(state = 1:3, dat = list(def_one, def_two, def_three))) %>%
mutate(data = list(bind_cols(data, if(!is.null(dat))
dat[data %>%
pull(state_cyclen) %>%
bounded_vec_recyc(., nrow(dat)),] else NULL)), dat = NULL) %>%
ungroup %>%
unnest(data)
Output:
# A tibble: 21 × 4
state state_cyclen prop others
<dbl> <dbl> <chr> <chr>
1 1 1 a A
2 1 1 a A
3 1 2 b B
4 1 2 b B
5 2 1 d D
6 2 1 d D
7 2 2 e E
8 2 2 e E
9 2 3 f F
10 2 3 f F
# … with 11 more rows
I have this dataset:
library(data.table)
dt <- data.table(
record=c(1:20),
area=rep(LETTERS[1:4], c(4, 6, 3, 7)),
score=c(1,1:3,2:3,1,1,1,2,2,1,2,1,1,1,1,1:3),
cluster=c("X", "Y", "Z")[c(1,1:3,3,2,1,1:3,1,1:3,3,3,3,1:3)]
)
and I have used the solution from this post to create this summary:
dt_summary =
dt[ , .N, keyby = .(area, score, cluster)
][ , {
idx = frank(-N, ties.method = 'min') == 1
NN = sum(N)
.(
cluster_mode = cluster[idx],
cluster_pct = 100*N[idx]/NN,
cluster_freq = N[idx],
record_freq = NN
)
}, by = .(area, score)]
dt_score_1 <- dt_summary[score == 1]
setnames(dt_score_1, "area", "zone")
I would like to use the results from dt_score_1 to filter dt based on the area/zone and cluster/cluster_mode. So in a new data.table, the only rows taken from dt for area A should belong to cluster X, for area D they should be cluster Z etc.
If I'm understanding the question correctly, this is a merge of dt with dt_score_1 with the conditions area = zone and cluster = cluster_mode.
dt[dt_score_1, on = .(area = zone, cluster = cluster_mode)]
# record area score cluster i.score cluster_pct cluster_freq record_freq
# 1: 1 A 1 X 1 100.00000 2 2
# 2: 2 A 1 X 1 100.00000 2 2
# 3: 7 B 1 X 1 66.66667 2 3
# 4: 8 B 1 X 1 66.66667 2 3
# 5: 11 C 2 X 1 100.00000 1 1
# 6: 12 C 1 X 1 100.00000 1 1
# 7: 14 D 1 Z 1 80.00000 4 5
# 8: 15 D 1 Z 1 80.00000 4 5
# 9: 16 D 1 Z 1 80.00000 4 5
# 10: 17 D 1 Z 1 80.00000 4 5
# 11: 20 D 3 Z 1 80.00000 4 5
For a more detailed explanation of the join-as-filter idiom, see the link below posted by @Frank:
Perform a semi-join with data.table
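For instance, a true semi-join that keeps only dt's rows and columns could look like this (unique() guards against duplicated key pairs; use nomatch = 0L on older data.table versions):
dt[unique(dt_score_1[, .(zone, cluster_mode)]),
   on = .(area = zone, cluster = cluster_mode),
   nomatch = NULL]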
I have a factor variable that occurs in two columns, and now I want the first lag, no matter which column the factor last appeared in.
Consider the following data.table.
require(data.table)
set.seed(21)
dt <- data.table(item1 = c(rep(sample(letters[1:5]), 2), sample(letters[6:10])),
item2 = c(rep(sample(letters[6:10]), 2), sample(letters[1:5])),
value1 = rnorm(15, 5, 2),
value2 = rnorm(15, 5, 2),
iteration = rep(1:3, each = 5))
> dt
item1 item2 value1 value2 iteration
1: d i 0.4464375 6.491179 1
2: b j 6.5148245 5.665638 1
3: c f 3.9031889 2.751919 1
4: a g 5.3450990 3.587738 1
5: e h 6.1257061 3.544912 1
6: d i 8.0236359 1.331371 2
7: b j 6.3180503 4.184624 2
8: c f 7.2440561 5.053722 2
9: a g 3.4307173 6.823257 2
10: e h 4.1486154 8.268693 2
11: j a 5.7859952 5.121371 3
12: f c 5.0735143 8.695145 3
13: i e 2.9358327 5.160250 3
14: g d 2.4702771 7.837112 3
15: h b 4.5460694 7.917232 3
I have tried to solve this with the data.table package.
dt[, lag1 := shift(value1), by = item1]
dt[, lag2 := shift(value2), by = item2]
dt
item1 item2 value1 value2 iteration lag1 lag2
1: d i 0.4464375 6.491179 1 NA NA
2: b j 6.5148245 5.665638 1 NA NA
3: c f 3.9031889 2.751919 1 NA NA
4: a g 5.3450990 3.587738 1 NA NA
5: e h 6.1257061 3.544912 1 NA NA
6: d i 8.0236359 1.331371 2 0.4464375 6.491179
7: b j 6.3180503 4.184624 2 6.5148245 5.665638
8: c f 7.2440561 5.053722 2 3.9031889 2.751919
9: a g 3.4307173 6.823257 2 5.3450990 3.587738
10: e h 4.1486154 8.268693 2 6.1257061 3.544912
11: j a 5.7859952 5.121371 3 NA NA
12: f c 5.0735143 8.695145 3 NA NA
13: i e 2.9358327 5.160250 3 NA NA
14: g d 2.4702771 7.837112 3 NA NA
15: h b 4.5460694 7.917232 3 NA NA
I could probably solve this by creating one column for item and one for value, but is there a better solution?
And just to be clear, my expected value on lag1 on row 11 is 4.184624.
I will also need the lag for item2, and did this with dplyr:
dt %>%
mutate(nr = 1:nrow(dt)) %>%
gather(key, value, -nr, -iteration) %>%
mutate(key = ifelse(key == "item1" | key == "item2", "item", "value"),
variabel = rep(c(1, 2), 2, each = nrow(dt))) %>%
spread(key, value) %>%
group_by(item) %>%
arrange(nr) %>%
mutate(lag = lag(value)) %>%
gather(key, value, -iteration, -nr, -variabel) %>%
unite(key, c("key", "variabel"), sep = "") %>%
spread(key, value)
iteration nr item1 item2 lag1 lag2 value1 value2
* <int> <int> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 1 e f <NA> <NA> 4.48327811883486 5.98823833422944
2 1 2 b i <NA> <NA> 6.21252978898878 3.6803830789734
3 1 3 d g <NA> <NA> 5.62689643314086 7.00228385274896
4 1 4 c h <NA> <NA> 5.10720616395708 7.14416894881173
5 1 5 a j <NA> <NA> 7.25650757535391 6.51153141154262
6 2 6 e f 4.48327811883486 5.98823833422944 3.88373308164829 2.08907058913021
7 2 7 b i 6.21252978898878 3.6803830789734 8.07191789162847 6.88574195362948
8 2 8 d g 5.62689643314086 7.00228385274896 4.87510729533042 1.25944984673148
9 2 9 c h 5.10720616395708 7.14416894881173 5.0431504307243 4.4934555124612
10 2 10 a j 7.25650757535391 6.51153141154262 0.820345123625779 4.41487625686153
11 3 11 g d 1.25944984673148 4.87510729533042 3.37822264689098 5.43753611910662
12 3 12 j a 4.41487625686153 0.820345123625779 -0.88757977661203 2.28986114731552
13 3 13 i e 6.88574195362948 3.88373308164829 4.96240860503556 4.75454561215201
14 3 14 h b 4.4934555124612 8.07191789162847 4.29063975464589 4.09626986248512
15 3 15 f c 2.08907058913021 5.0431504307243 5.07114037497055 5.19449624162733
A solution can be found by using melt after adding a column for row number.
library(data.table)
#Add a column to represent row number in current table. This will be used
#later to sort data.table to find correct `lag`
dt[,rn:=.I]
#Use melt to transfer values in long format. value1 contains "items" and
#value2 contains "values"
dt<-melt(dt, id=c("iteration","rn"), measure=list(1:2,c("value1","value2")),
value.name = c("item","value"))
#The order in original table is row number, item1 and item2. The same can be
#achieved by sorting on "rn" and "variable"
dt[order(rn,variable), lag := shift(value), by = item]
#Join the item1 rows (variable==1) with the item2 rows (variable==2) on
#row number to cast the result back to the original wide format
dt[variable==1,][dt[variable==2,],.(item1 = item, item2 = i.item, value1 = value,
value2=i.value, iteration, lag1 = lag, lag2 = i.lag), on=("rn")]
# item1 item2 value1 value2 iteration lag1 lag2
# 1: d i 0.4464375 6.491179 1 NA NA
# 2: b j 6.5148245 5.665638 1 NA NA
# 3: c f 3.9031889 2.751919 1 NA NA
# 4: a g 5.3450990 3.587738 1 NA NA
# 5: e h 6.1257061 3.544912 1 NA NA
# 6: d i 8.0236359 1.331371 2 0.4464375 6.491179
# 7: b j 6.3180503 4.184624 2 6.5148245 5.665638
# 8: c f 7.2440561 5.053722 2 3.9031889 2.751919
# 9: a g 3.4307173 6.823257 2 5.3450990 3.587738
# 10: e h 4.1486154 8.268693 2 6.1257061 3.544912
# 11: j a 5.7859952 5.121371 3 4.1846241 3.430717
# 12: f c 5.0735143 8.695145 3 5.0537224 7.244056
# 13: i e 2.9358327 5.160250 3 1.3313712 4.148615
# 14: g d 2.4702771 7.837112 3 6.8232573 8.023636
# 15: h b 4.5460694 7.917232 3 8.2686930 6.318050
Posting another similar approach: similar in using an elongated version of item1 + item2 in a long data.table, but different in using joins.
There are 2 possible situations:
1) If the lag is always in the immediately previous iteration, then the following code using a normal join should work:
library(data.table)
set.seed(21)
dt <- data.table(item1 = c(rep(sample(letters[1:5]), 2), sample(letters[6:10])),
item2 = c(rep(sample(letters[6:10]), 2), sample(letters[1:5])),
value1 = rnorm(15, 5, 2),
value2 = rnorm(15, 5, 2),
iteration = rep(1:3, each = 5))
#if that first lag can always be found in the previous iteration
dt[.(iitem=c(item1, item2), ivalue=c(value1, value2), iiteration=c(iteration + 1, iteration + 1)),
lag1 := ivalue,
on=c(item1="iitem", iteration="iiteration")]
dt[.(iitem=c(item1, item2), ivalue=c(value1, value2), iiteration=c(iteration + 1, iteration + 1)),
lag2 := ivalue,
on=c(item2="iitem", iteration="iiteration")]
dt
# item1 item2 value1 value2 iteration lag1 lag2
# 1: d i 0.4464375195067456 6.491178609416053 1 NA NA
# 2: b j 6.5148244502509627 5.665638360665036 1 NA NA
# 3: c f 3.9031888919439428 2.751919085284464 1 NA NA
# 4: a g 5.3450989557007524 3.587738435542055 1 NA NA
# 5: e h 6.1257061355108435 3.544912270783058 1 NA NA
# 6: d i 8.0236359188753603 1.331371229451156 2 0.4464375195067456 6.491178609416053
# 7: b j 6.3180503383288116 4.184624119479032 2 6.5148244502509627 5.665638360665036
# 8: c f 7.2440561493491140 5.053722389597528 2 3.9031888919439428 2.751919085284464
# 9: a g 3.4307172617070858 6.823257275121762 2 5.3450989557007524 3.587738435542055
# 10: e h 4.1486154223793710 8.268692951017332 2 6.1257061355108435 3.544912270783058
# 11: j a 5.7859951827443368 5.121371228719468 3 4.1846241194790323 3.430717261707086
# 12: f c 5.0735142596491132 8.695145055731583 3 5.0537223895975281 7.244056149349114
# 13: i e 2.9358326775434151 5.160249909302514 3 1.3313712294511557 4.148615422379371
# 14: g d 2.4702770572371642 7.837111765957783 3 6.8232572751217617 8.023635918875360
# 15: h b 4.5460694295527579 7.917231870893728 3 8.2686929510173321 6.318050338328812
2) If the lag might be in earlier iterations, then the following code using non-equi joins should help:
library(data.table)
set.seed(21)
dt <- data.table(item1 = c(rep(sample(letters[1:5]), 2), sample(letters[6:10])),
item2 = c(rep(sample(letters[6:10]), 2), sample(letters[1:5])),
value1 = rnorm(15, 5, 2),
value2 = rnorm(15, 5, 2),
iteration = rep(1:3, each = 5))
#remove iteration=2, item1=c, item2=f to show finding lag from earlier iterations
dt <- dt[-8,]
#if that first lag can be found in an even earlier iteration, use non-equi joins as follows:
elongated <- dt[,.(item=c(item1, item2), ivalue=c(value1, value2), iteration=c(iteration, iteration), cpyalliter=c(iteration, iteration))]
dt[, lag1 := elongated[.SD, on=.(item=item1, iteration < iteration)][,
last(ivalue), by=.(item1=item, item2, value1, value2, iteration)]$V1 ]
dt[, lag2 := elongated[.SD, on=.(item=item2, iteration < iteration)][,
last(ivalue), by=.(item1, item2=item, value1, value2, iteration)]$V1 ]
dt
# item1 item2 value1 value2 iteration lag1 lag2
# 1: d i 0.4464375195067456 6.491178609416053 1 NA NA
# 2: b j 6.5148244502509627 5.665638360665036 1 NA NA
# 3: c f 3.9031888919439428 2.751919085284464 1 NA NA
# 4: a g 5.3450989557007524 3.587738435542055 1 NA NA
# 5: e h 6.1257061355108435 3.544912270783058 1 NA NA
# 6: d i 8.0236359188753603 1.331371229451156 2 0.4464375195067456 6.491178609416053
# 7: b j 6.3180503383288116 4.184624119479032 2 6.5148244502509627 5.665638360665036
# 8: a g 3.4307172617070858 6.823257275121762 2 5.3450989557007524 3.587738435542055
# 9: e h 4.1486154223793710 8.268692951017332 2 6.1257061355108435 3.544912270783058
# 10: j a 5.7859951827443368 5.121371228719468 3 4.1846241194790323 3.430717261707086
# 11: f c 5.0735142596491132 8.695145055731583 3 2.7519190852844644 3.903188891943943
# 12: i e 2.9358326775434151 5.160249909302514 3 1.3313712294511557 4.148615422379371
# 13: g d 2.4702770572371642 7.837111765957783 3 6.8232572751217617 8.023635918875360
# 14: h b 4.5460694295527579 7.917231870893728 3 8.2686929510173321 6.318050338328812
I wonder if there is a way to write the 2nd case more succinctly (i.e. with a little less chaining).
data:
structure(list(id = c(1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 5),
ax = c("a", "a", "b", "b", "b", "b", "b", "b", "c", "c",
"d", "d", "e"), time = c(1, 3, 0, 2, 4, 5, 6, 8, 7, 9, 10,
11, 12)), .Names = c("id", "ax", "time"), class = c("data.table",
"data.frame"), row.names = c(NA, -13L))
looks like:
id ax time
1: 1 a 1
2: 1 a 3
3: 2 b 0
4: 2 b 2
5: 2 b 4
6: 2 b 5
7: 2 b 6
8: 2 b 8
9: 3 c 7
10: 3 c 9
11: 4 d 10
12: 4 d 11
13: 5 e 12
I want to have the max of the previous group next to the current group:
desired output:
id ax time newCol
1: 1 a 1 NA
2: 1 a 3 NA
3: 2 b 0 3
4: 2 b 2 3
5: 2 b 4 3
6: 2 b 5 3
7: 2 b 6 3
8: 2 b 8 3
9: 3 c 7 8
10: 3 c 9 8
11: 4 d 10 9
12: 4 d 11 9
13: 5 e 12 11
Is it also possible to have the value of the "previous-previous" group?
Interested in base R, data.table and tidyverse solutions.
Note:
The data can be grouped by EITHER id or ax; the example is a little redundant here.
A data.table solution:
dtt.max <- dtt[, .(max = max(time)), by = ax]
dtt.max[, max.prev := shift(max)]
dtt[dtt.max, newCol := i.max.prev, on = 'ax']
# > dtt
# id ax time newCol
# 1: 1 a 1 NA
# 2: 1 a 3 NA
# 3: 2 b 0 3
# 4: 2 b 2 3
# 5: 2 b 4 3
# 6: 2 b 5 3
# 7: 2 b 6 3
# 8: 2 b 8 3
# 9: 3 c 7 8
# 10: 3 c 9 8
# 11: 4 d 10 9
# 12: 4 d 11 9
# 13: 5 e 12 11
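For the "previous-previous" group asked about in the question, shift takes an n argument, so the same approach extends naturally:
dtt.max[, max.prev2 := shift(max, 2L)]
dtt[dtt.max, newCol2 := i.max.prev2, on = 'ax']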
data.table solution using id + 1
library(data.table)
merge(d, setDT(d)[, max(time), id + 1], all.x = TRUE)
Here is a dplyr approach. The key here is to group and ungroup when necessary:
df %>%
group_by(ax) %>%
mutate(new = time[n()]) %>%
ungroup() %>%
mutate(new = lag(new)) %>%
group_by(ax) %>%
mutate(new = new[1])
# A tibble: 13 x 4
# Groups: ax [5]
id ax time new
<dbl> <chr> <dbl> <dbl>
1 1. a 1. NA
2 1. a 3. NA
3 2. b 0. 3.
4 2. b 2. 3.
5 2. b 4. 3.
6 2. b 5. 3.
7 2. b 6. 3.
8 2. b 8. 3.
9 3. c 7. 8.
10 3. c 9. 8.
11 4. d 10. 9.
12 4. d 11. 9.
13 5. e 12. 11.
Assuming id is the same as group:
dfr <- dfr %>% group_by(id) %>% mutate(groupmax = max(time))
dfr$old_group_max <- dfr$groupmax[match(dfr$id - 1, dfr$id)]
The antepenultimate group is left as an exercise :-)
1) This uses no packages. It computes the maximum for each group, giving Ag, and then lags it, giving LagMax. Finally it left joins LagMax back into the original data frame DF using merge:
Ag <- aggregate(time ~ id, DF, max)
LagMax <- transform(Ag, lagmax = c(NA, head(time, -1)), time = NULL)
merge(DF, LagMax, by = "id", all.x = TRUE)
giving:
id ax time lagmax
1 1 a 1 NA
2 1 a 3 NA
3 2 b 0 3
4 2 b 2 3
5 2 b 4 3
6 2 b 5 3
7 2 b 6 3
8 2 b 8 3
9 3 c 7 8
10 3 c 9 8
11 4 d 10 9
12 4 d 11 9
13 5 e 12 11
2) This sorts time within id so that we know the maximum is the last value in each id group.
o <- order(factor(DF$id, levels = unique(DF$id)), DF$time)
Time <- DF$time[o]
lagmax <- function(r) if (r[1] == 1) NA else Time[r[1] - 1]
transform(DF, lagmax = ave(seq_along(id), id, FUN = lagmax))
In the question the time values are already sorted within id, and if that is known to be the case, the above can be shortened to:
lagmax <- function(r) if (r[1] == 1) NA else DF$time[r[1] - 1]
transform(DF, lagmax = ave(seq_along(id), id, FUN = lagmax))
3) This one-liner is a data.table translation of (2):
library(data.table)
DT <- copy(DF) # don't overwrite DF
setDT(DT)[, g:=rleid(id)][, lagmax := DT$time[.I[1]-1], keyby = c("g", "id")]
In the sample data in the question, time is sorted within id; if that were known to be the case, we could use the following shorter code in place of the last line above:
setDT(DT)[, lagmax := DT$time[.I[1]-1], by = id]
I have a problem that I can't manage to solve properly in data.table. I have the following data:
plouf <- data.table( ID = rep(LETTERS[1:10],each = 10) )
plouf[,c(paste0("X",1:10)) := lapply(1:10,function(x){sample(10,100,replace = T)})]
There are two things that block me from time to time:
col <- "X1"
plouf[get(col) > 5, .(col = get(col)[1]) ,by = ID]
ID col
1: A 7
2: B 7
3: C 9
4: D 6
5: E 8
6: F 7
7: G 6
8: H 7
9: I 6
10: J 7
The column is named "col" instead of "X1". I tried with eval and get, but didn't get it to work.
And the same kind of issue:
col <- 1
plouf[get(paste0("X",col)) > 5, .(paste0("X",col) = get(paste0("X",col))[1]) ,by = ID]
Error: unexpected '=' in "plouf[get(paste0("X",col)) > 5, .(paste0("X",col) ="
I tried this, from Using paste when naming a list (the error above occurs because the left-hand side of = inside .() must be a literal name, not an expression like paste0("X", col)):
plouf[get(paste0("X",col)) > 5,setNames( get(paste0("X",col))[1],paste0("X",col)) ,by = ID]
ID V1
1: A 7
2: B 7
3: C 9
4: D 6
5: E 8
6: F 7
7: G 6
8: H 7
9: I 6
10: J 7
but it is not the desired result. Could someone explain to me how this works?
We can use setNames
plouf[get(col) > 5, setNames(list(get(col)[1]), col) ,by = ID]
or another option is setnames after getting the result
setnames(plouf[get(col) > 5, .(get(col)[1]) ,by = ID], 'V1', col)[]
# ID X1
#1: A 8
#2: B 7
#3: C 6
#4: D 10
#5: F 9
#6: G 8
#7: H 10
#8: I 6
#9: J 8
If we are using dplyr, then the option would be
library(dplyr)
plouf %>%
filter_at(col, any_vars(.>5)) %>%
group_by(ID) %>%
summarise_at(col, first)
# A tibble: 9 x 2
# ID X1
# <chr> <int>
#1 A 8
#2 B 7
#3 C 6
#4 D 10
#5 F 9
#6 G 8
#7 H 10
#8 I 6
#9 J 8
Or with := and sym from rlang
plouf %>%
filter(!! rlang::sym(col) > 5) %>%
group_by(ID) %>%
summarise(!! col := first(!!rlang::sym(col)))
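For completeness, with dplyr 1.0 or later the .data pronoun and glue-style names give the same result without explicit rlang calls (a sketch assuming a recent dplyr):
plouf %>%
  filter(.data[[col]] > 5) %>%
  group_by(ID) %>%
  summarise("{col}" := first(.data[[col]]))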