I'm trying to do a rolling count for Var1 or Var2 within each group.
Essentially, I want to get back new_var1 and new_var2 in this example: every time Var1 or Var2 contains the combination of a and group f it counts, or b and group f, and so on. So the overall appearances of a within each group are counted, regardless of whether a appears in column Var1 or Var2. However, the counts must be assigned to the proper column: if a appears in column Var1, the current count must be assigned to column new_var1; accordingly, for a in Var2 the current count goes in new_var2.
x <- expand.grid(letters[1:5],letters[1:5],KEEP.OUT.ATTRS = FALSE)
x <- x[x[,1]!=x[,2],c(2,1)]
x <- data.frame(x,group=as.character(rep(letters[c(1,2,1,4,1)+5],each=4)))
x <- data.frame(x,new_var1 = c(1,2,3,4,1,2,3,4,2,3,4,5,1,2,3,4,3,4,5,6))
x <- data.frame(x,new_var2 = c(1,1,1,1,1,1,1,1,5,2,2,2,1,1,1,1,6,3,6,3))
Var2 Var1 group new_var2 new_var1
a b f 1 1
a c f 2 1
a d f 3 1
a e f 4 1
b a g 1 1
b c g 2 1
b d g 3 1
b e g 4 1
c a f 2 5
c b f 3 2
c d f 4 2
c e f 5 2
d a i 1 1
d b i 2 1
d c i 3 1
d e i 4 1
e a f 3 6
e b f 4 3
e c f 5 6
e d f 6 3
Any help would be greatly appreciated.
I managed to get this working:
x <- data.table(x)
x[, new_var1a := seq(.N) , by = c('Var1','group')]
x[, new_var2a := seq(.N) , by = c('Var2','group')]
Var2 Var1 group new_var2 new_var1 new_var1a new_var2a
1: a b f 1 1 1 1
2: a c f 2 1 1 2
3: a d f 3 1 1 3
4: a e f 4 1 1 4
5: b a g 1 1 1 1
6: b c g 2 1 1 2
7: b d g 3 1 1 3
8: b e g 4 1 1 4
9: c a f 2 5 1 1
10: c b f 3 2 2 2
11: c d f 4 2 2 3
12: c e f 5 2 2 4
13: d a i 1 1 1 1
14: d b i 2 1 1 2
15: d c i 3 1 1 3
16: d e i 4 1 1 4
17: e a f 3 6 2 1
18: e b f 4 3 3 2
19: e c f 5 6 2 3
20: e d f 6 3 3 4
But it treats Var1 and Var2 independently, which I do not want.
So, your problem is more of an algorithm problem, and we'll use a loop instead of dplyr or data.table. For me, using loops in R often means using Rcpp. So this is my answer:
// [[Rcpp::depends(BH)]]
#include <Rcpp.h>
#include <boost/foreach.hpp>
using namespace Rcpp;
// the C-style upper-case macro name is a bit ugly
#define foreach BOOST_FOREACH
// [[Rcpp::export]]
ListOf<IntegerVector> new_vars(const IntegerVector& Var1,
const IntegerVector& Var2,
int n_Var,
ListOf<IntegerVector> ind_groups) {
int nrow = Var1.size();
IntegerVector new_var1a(nrow, NA_INTEGER);
IntegerVector new_var2a(nrow, NA_INTEGER);
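// one shared tally per Var level, reset for each group, so that
// Var1 and Var2 draw from the same running count within a group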
for (int i = 0; i < ind_groups.size(); i++) {
IntegerVector counts(n_Var);
foreach(const int& j, ind_groups[i]) {
new_var1a[j] = ++counts[Var1[j]];
new_var2a[j] = ++counts[Var2[j]];
}
}
return List::create(Named("new_var1a") = new_var1a,
Named("new_var2a") = new_var2a);
}
/*** R
x <- expand.grid(letters[1:5],letters[1:5],
KEEP.OUT.ATTRS = FALSE,
stringsAsFactors = FALSE)
x <- x[x[,1]!=x[,2],c(2,1)]
x <- data.frame(x,group=as.character(rep(letters[c(1,2,1,4,1)+5],each=4)))
x <- data.frame(x,new_var1 = c(1,2,3,4,1,2,3,4,2,3,4,5,1,2,3,4,3,4,5,6))
x <- data.frame(x,new_var2 = c(1,1,1,1,1,1,1,1,5,2,2,2,1,1,1,1,6,3,6,3))
getNewVars <- function(x) {
Vars.levels <- unique(c(x$Var2, x$Var1))
new_vars <- new_vars(
Var1 = match(x$Var1, Vars.levels) - 1,
Var2 = match(x$Var2, Vars.levels) - 1,
n_Var = length(Vars.levels),
ind_groups = split(seq_along(x$group) - 1, x$group)
)
cbind(x, new_vars)
}
getNewVars(x)
*/
Put this in a ".cpp" file and source it.
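For example (a minimal sketch; the file name is my own choice):
Rcpp::sourceCpp("new_vars.cpp") # compiles the C++ code and runs the embedded /*** R ... */ block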
PS: Make sure to use stringsAsFactors = FALSE.
A dplyr solution: first reshape the data from wide to long format, keeping the row id so the results can be merged back later.
Sample data
df = read.table(text=" Var2 Var1 group new_var2 new_var1
a b f 1 1
a c f 2 1
a d f 3 1
a e f 4 1
b a g 1 1
b c g 2 1
b d g 3 1
b e g 4 1
c a f 2 5
c b f 3 2
c d f 4 2
c e f 5 2
d a i 1 1
d b i 2 1
d c i 3 1
d e i 4 1
e a f 3 6
e b f 4 3
e c f 5 6
e d f 6 3",header=T)
df = df[,c("Var2","Var1","group")]
Code
library(reshape2)
library(dplyr)
df$id = seq(1,nrow(df))
df2 = melt(df, id.vars=c("id", "group")) %>% arrange(id)
df2 = df2 %>% group_by(group,value) %>% mutate(n= row_number())
df = df %>% left_join(df2[df2$variable=="Var1",c("id","n")], by="id")
df = df %>% left_join(df2[df2$variable=="Var2",c("id","n")], by="id")
colnames(df)[colnames(df)=="n.x"]="new_var1"
colnames(df)[colnames(df)=="n.y"]="new_var2"
Optionally add df2 = df2 %>% group_by(group,value,id) %>% mutate(n=max(n)) if a single row can contain the same value in both variables (which is not the case in your example).
Output
Var2 Var1 group id new_var1 new_var2
1 a b f 1 1 1
2 a c f 2 1 2
3 a d f 3 1 3
4 a e f 4 1 4
5 b a g 5 1 1
6 b c g 6 1 2
7 b d g 7 1 3
8 b e g 8 1 4
9 c a f 9 5 2
10 c b f 10 2 3
11 c d f 11 2 4
12 c e f 12 2 5
13 d a i 13 1 1
14 d b i 14 1 2
15 d c i 15 1 3
16 d e i 16 1 4
17 e a f 17 6 3
18 e b f 18 3 4
19 e c f 19 6 5
20 e d f 20 3 6
Hope this helps!
The dcast() function from the data.table package allows us to reshape multiple value variables simultaneously. This can be used to avoid the double left join in Florian's answer:
library(data.table)
long <- melt(setDT(x)[, rn := .I], id.vars = c("rn", "group"),
measure.vars = c("Var1", "Var2"), value.name = "Var")[
, variable := rleid(variable)][
order(rn), new_var := rowid(group, Var)][]
dcast(long, rn + group ~ ..., value.var = c("Var", "new_var"))[, rn := NULL][]
group Var_1 Var_2 new_var_1 new_var_2
1: f b a 1 1
2: f c a 1 2
3: f d a 1 3
4: f e a 1 4
5: g a b 1 1
6: g c b 1 2
7: g d b 1 3
8: g e b 1 4
9: f a c 5 2
10: f b c 2 3
11: f d c 2 4
12: f e c 2 5
13: i a d 1 1
14: i b d 1 2
15: i c d 1 3
16: i e d 1 4
17: f a e 6 3
18: f b e 3 4
19: f c e 6 5
20: f d e 3 6
Explanation
setDT(x) coerces x to a data.table; then a column of row numbers is added before reshaping from wide to long format. Just to get nicer-looking column names from the subsequent dcast(), the variables are renamed (here [, variable := sub("Var", "", variable)] could be used as an alternative to [, variable := rleid(variable)]).
The important step is the numbering of appearances of each Var within each group using rowid() grouped by group and Var.
Now, the result has two value columns. Finally, it is reshaped back from long to wide format again, and the rn column is removed as no longer needed.
Data
x <- expand.grid(letters[1:5], letters[1:5], KEEP.OUT.ATTRS = FALSE)
x <- x[x[, 1] != x[, 2], c(2, 1)]
x <- data.frame(
x,
group = as.character(rep(letters[c(1, 2, 1, 4, 1) + 5], each = 4)),
new_var1 = c(1, 2, 3, 4, 1, 2, 3, 4, 2, 3, 4, 5, 1, 2, 3, 4, 3, 4, 5, 6),
new_var2 = c(1, 1, 1, 1, 1, 1, 1, 1, 5, 2, 2, 2, 1, 1, 1, 1, 6, 3, 6, 3))
I am trying to concatenate certain row values (Strings) given varying conditions in R. I have flagged the row values in Flag (the flagging criteria are irrelevant in this example).
Notation: B marks the beginning of a run and E its end. 0 is outside any run. 1 denotes any string in the run other than B and E. Your solution does not need to follow my convention.
Rules: Every run must begin with B and end with E. There can be any number of 1s in the run. Any strings positioned between B and E (both inclusive) are concatenated in the order in which they appear in the run, and the result replaces the B-string. 0-strings remain in the dataframe; 1- and E-strings are removed after concatenation.
I haven't come up with anything close to the desired output.
set.seed(128)
df2 <- data.frame(Strings = sample(letters, 17, replace = T),
Flag = c(0,"B",1,1,"E","B","E","B","E",0,"B",1,1,1,"E",0,0))
Strings Flag
1 d 0
2 r B
3 q 1
4 r 1
5 v E
6 f B
7 y E
8 u B
9 c E
10 x 0
11 h B
12 w 1
13 x 1
14 t 1
15 j E
16 d 0
17 j 0
Intermediate output.
Strings Flag Result
1 d 0 d
2 r B r q r v
3 q 1 q
4 r 1 r
5 v E v
6 f B f y
7 y E y
8 u B u c
9 c E c
10 x 0 x
11 h B h w x t j
12 w 1 w
13 x 1 x
14 t 1 t
15 j E j
16 d 0 d
17 j 0 j
Desired output.
Result
1 d
2 r q r v
3 f y
4 u c
5 x
6 h w x t j
7 d
8 j
Here is a solution that might help you; however, I am still not sure whether I understood your point correctly. Since every run starts at a B and every 0 stands on its own, cumsum(Flag == 'B' | Flag == '0') assigns a group id to each row, and the strings can then be concatenated within each group:
library(dplyr)
df2 %>%
mutate(Flag2 = cumsum(Flag == 'B' | Flag == '0')) %>%
group_by(Flag2) %>%
summarise(Result = paste0(Strings, collapse = ' '))
# A tibble: 8 × 2
Flag2 Result
<int> <chr>
1 1 d
2 2 r q r v
3 3 f y
4 4 u c
5 5 x
6 6 h w x t j
7 7 d
8 8 j
Using dplyr:
library(dplyr)
set.seed(128)
df2 <- data.frame(Strings = sample(letters, 17, replace = T),
Flag = c(0,"B",1,1,"E","B","E","B","E",0,"B",1,1,1,"E",0,0))
df2 %>%
group_by(group = cumsum( (Flag=="B") + (lag(Flag,1,"0")=="E"))) %>%
mutate(Result=if_else(Flag=="B", paste0(Strings,collapse = " "),Strings)) %>%
filter(!(Flag %in% c("1", "E"))) %>% ungroup() %>%
select(-group, -Strings, -Flag)
#> # A tibble: 8 × 1
#> Result
#> <chr>
#> 1 d
#> 2 r q r v
#> 3 f y
#> 4 u c
#> 5 x
#> 6 h w x t j
#> 7 d
#> 8 j
If I take the following two-column dataframe, with one column being a factor and the other being a numeric vector:
data <- data.frame(x=c("a","b","c","d","e","f","g","h"), y = c(1,2,3,3,2,1,5,6))
data$x <- as.factor(data$x)
How can I turn it into a new dataframe data2 where the factor levels of data$x are columns and the rows contain the corresponding numeric values from data$y, like so?
structure(list(a = 1, b = 2, c = 3, d = 3, e = 2, f = 1, g = 5,
h = 6), class = "data.frame", row.names = c(NA, -1L))
With base R, use rbind.data.frame:
d <- rbind.data.frame(data$y)
colnames(d) <- data$x
a b c d e f g h
1 1 2 3 3 2 1 5 6
With pivot_wider:
tidyr::pivot_wider(data, names_from = x, values_from = y)
a b c d e f g h
1 1 2 3 3 2 1 5 6
or with xtabs:
xtabs(y ~ ., data = data) |>
as.data.frame.list()
a b c d e f g h
1 1 2 3 3 2 1 5 6
Another possible solution, using data.table::transpose:
data.table::transpose(data, make.names = 1)
#> a b c d e f g h
#> 1 1 2 3 3 2 1 5 6
Another option using the sjmisc package:
library(sjmisc)
data %>%
rotate_df(cn = T)
# a b c d e f g h
#y 1 2 3 3 2 1 5 6
I am new to R. I would like to calculate the mean for each row of a dataframe, but using a different subset of columns for each row. I have two extra columns providing the names of the columns that represent the "start" and the "end" I should use to calculate each mean, respectively.
Let's take this example
dframe <- data.frame(a=c("2","3","4", "2"), b=c("1","3","6", "2"), c=c("4","5","6", "3"), d=c("4","2","8", "5"), e=c("a", "c", "a", "b"), f=c("c", "d", "d", "c"))
dframe
Which provides the following dataframe:
a b c d e f
1 2 1 4 4 a c
2 3 3 5 2 c d
3 4 6 6 8 a d
4 2 2 3 5 b c
The columns e and f represent the first and last column I use to calculate the mean for each row.
For example, on line 1, the mean would be calculated from columns a, b and c ((2+1+4)/3 -> 2.3).
So I would like to obtain the following output:
a b c d e f mean
1 2 1 4 4 a c 2.3
2 3 3 5 2 c d 3.5
3 4 6 6 8 a d 6
4 2 2 3 5 b c 2.5
I learnt how to create the indices, and I then want to use rowMeans(), but I cannot find the correct arguments.
dframe %>%
mutate(e_indice = match(e, colnames(dframe)))%>%
mutate(f_indice = match(f, colnames(dframe)))%>%
mutate(mean = RowMeans(????, na.rm = TRUE))
Thanks a lot for your help
One dplyr option could be:
dframe %>%
rowwise() %>%
mutate(mean = rowMeans(cur_data()[match(e, names(.)):match(f, names(.))]))
a b c d e f mean
<dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
1 2 1 4 4 a c 2.33
2 3 3 5 2 c d 3.5
3 4 6 6 8 a d 6
4 2 2 3 5 b c 2.5
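Note: in dplyr 1.1.0 and later, cur_data() is superseded by pick(); a minimal adaptation of the same idea (assuming the magrittr pipe, since it relies on the . placeholder):
dframe %>%
rowwise() %>%
mutate(mean = rowMeans(pick(everything())[match(e, names(.)):match(f, names(.))]))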
I would define a helper function that lets you slice the indices you want
from a matrix.
rowSlice <- function(x, start, stop) {
replace(x, col(x) < start | col(x) > stop, NA)
}
rowSlice(matrix(1, 4, 4), c(1, 3, 1, 2), c(3, 4, 4, 3))
#> [,1] [,2] [,3] [,4]
#> [1,] 1 1 1 NA
#> [2,] NA NA 1 1
#> [3,] 1 1 1 1
#> [4,] NA 1 1 NA
Then use across() to select the relevant columns, slice them, and take the rowMeans().
library(dplyr)
dframe <- data.frame(
a = c(2, 3, 4, 2),
b = c(1, 3, 6, 2),
c = c(4, 5, 6, 3),
d = c(4, 2, 8, 5),
e = c("a", "c", "a", "b"),
f = c("c", "d", "d", "c")
)
dframe %>%
mutate(ei = match(e, colnames(dframe))) %>%
mutate(fi = match(f, colnames(dframe))) %>%
mutate(
mean = across(a:d) %>%
rowSlice(ei, fi) %>%
rowMeans(na.rm = TRUE)
)
#> a b c d e f ei fi mean
#> 1 2 1 4 4 a c 1 3 2.333333
#> 2 3 3 5 2 c d 3 4 3.500000
#> 3 4 6 6 8 a d 1 4 6.000000
#> 4 2 2 3 5 b c 2 3 2.500000
A base R solution. First, convert the value columns to numeric (they are character in the sample data). Then create a list of the column indices on which to apply the mean, and apply rowMeans on the selected columns of each row.
dframe[1:4] <- lapply(dframe[1:4], as.numeric)
s <- mapply(seq, match(dframe$e, colnames(dframe)), match(dframe$f, colnames(dframe)))
dframe$mean <- sapply(seq(nrow(dframe)), function(x) rowMeans(dframe[x, s[[x]]]))
a b c d e f mean
1 2 1 4 4 a c 2.333333
2 3 3 5 2 c d 3.5
3 4 6 6 8 a d 6
4 2 2 3 5 b c 2.5
A base R approach using apply. Note that apply() coerces each row to a character vector here (the rows mix numbers and strings), hence the as.numeric():
dframe$mean <- apply(dframe, 1, function(x)
mean(as.numeric(x[which(names(x) == x["e"]) : which(names(x) == x["f"])])))
dframe
a b c d e f mean
1 2 1 4 4 a c 2.333333
2 3 3 5 2 c d 3.500000
3 4 6 6 8 a d 6.000000
4 2 2 3 5 b c 2.500000
I have a factor variable that occurs in two columns, and now I want the first lag, no matter which column the factor last appeared in.
Consider the following data.table:
require(data.table)
set.seed(21)
dt <- data.table(item1 = c(rep(sample(letters[1:5]), 2), sample(letters[6:10])),
item2 = c(rep(sample(letters[6:10]), 2), sample(letters[1:5])),
value1 = rnorm(15, 5, 2),
value2 = rnorm(15, 5, 2),
iteration = rep(1:3, each = 5))
> dt
item1 item2 value1 value2 iteration
1: d i 0.4464375 6.491179 1
2: b j 6.5148245 5.665638 1
3: c f 3.9031889 2.751919 1
4: a g 5.3450990 3.587738 1
5: e h 6.1257061 3.544912 1
6: d i 8.0236359 1.331371 2
7: b j 6.3180503 4.184624 2
8: c f 7.2440561 5.053722 2
9: a g 3.4307173 6.823257 2
10: e h 4.1486154 8.268693 2
11: j a 5.7859952 5.121371 3
12: f c 5.0735143 8.695145 3
13: i e 2.9358327 5.160250 3
14: g d 2.4702771 7.837112 3
15: h b 4.5460694 7.917232 3
I have tried to solve this with the data.table package.
dt[, lag1 := c(NA, value1), by = item1]
dt[, lag2 := c(NA, value2), by = item2]
dt
item1 item2 value1 value2 iteration lag1 lag2
1: d i 0.4464375 6.491179 1 NA NA
2: b j 6.5148245 5.665638 1 NA NA
3: c f 3.9031889 2.751919 1 NA NA
4: a g 5.3450990 3.587738 1 NA NA
5: e h 6.1257061 3.544912 1 NA NA
6: d i 8.0236359 1.331371 2 0.4464375 6.491179
7: b j 6.3180503 4.184624 2 6.5148245 5.665638
8: c f 7.2440561 5.053722 2 3.9031889 2.751919
9: a g 3.4307173 6.823257 2 5.3450990 3.587738
10: e h 4.1486154 8.268693 2 6.1257061 3.544912
11: j a 5.7859952 5.121371 3 NA NA
12: f c 5.0735143 8.695145 3 NA NA
13: i e 2.9358327 5.160250 3 NA NA
14: g d 2.4702771 7.837112 3 NA NA
15: h b 4.5460694 7.917232 3 NA NA
I could probably solve this by creating one column for the item and one for the value, but is there a better solution?
And just to be clear, my expected value for lag1 on row 11 is 4.184624.
I will also need the lag for item2, and I did this with dplyr:
dt %>%
mutate(nr = 1:nrow(dt)) %>%
gather(key, value, -nr, -iteration) %>%
mutate(key = ifelse(key == "item1" | key == "item2", "item", "value"),
variabel = rep(c(1, 2), 2, each = nrow(dt))) %>%
spread(key, value) %>%
group_by(item) %>%
arrange(nr) %>%
mutate(lag = lag(value)) %>%
gather(key, value, -iteration, -nr, -variabel) %>%
unite(key, c("key", "variabel"), sep = "") %>%
spread(key, value)
iteration nr item1 item2 lag1 lag2 value1 value2
* <int> <int> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 1 e f <NA> <NA> 4.48327811883486 5.98823833422944
2 1 2 b i <NA> <NA> 6.21252978898878 3.6803830789734
3 1 3 d g <NA> <NA> 5.62689643314086 7.00228385274896
4 1 4 c h <NA> <NA> 5.10720616395708 7.14416894881173
5 1 5 a j <NA> <NA> 7.25650757535391 6.51153141154262
6 2 6 e f 4.48327811883486 5.98823833422944 3.88373308164829 2.08907058913021
7 2 7 b i 6.21252978898878 3.6803830789734 8.07191789162847 6.88574195362948
8 2 8 d g 5.62689643314086 7.00228385274896 4.87510729533042 1.25944984673148
9 2 9 c h 5.10720616395708 7.14416894881173 5.0431504307243 4.4934555124612
10 2 10 a j 7.25650757535391 6.51153141154262 0.820345123625779 4.41487625686153
11 3 11 g d 1.25944984673148 4.87510729533042 3.37822264689098 5.43753611910662
12 3 12 j a 4.41487625686153 0.820345123625779 -0.88757977661203 2.28986114731552
13 3 13 i e 6.88574195362948 3.88373308164829 4.96240860503556 4.75454561215201
14 3 14 h b 4.4934555124612 8.07191789162847 4.29063975464589 4.09626986248512
15 3 15 f c 2.08907058913021 5.0431504307243 5.07114037497055 5.19449624162733
A solution can be found by using melt after adding a column for row number.
library(data.table)
#Add a column to represent row number in current table. This will be used
#later to sort data.table to find correct `lag`
dt[,rn:=.I]
#Use melt to transfer values in long format. value1 contains "items" and
#value2 contains "values"
dt<-melt(dt, id=c("iteration","rn"), measure=list(1:2,c("value1","value2")),
value.name = c("item","value"))
#The order in original table is row number, item1 and item2. The same can be
#achieved by sorting on "rn" and "variable"
dt[order(rn,variable), lag := shift(value), by = item]
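#Finally, join the variable==1 rows (item1/value1) with the variable==2 rows
#(item2/value2) on "rn" to go back to wide format; the i. prefix refers to
#columns of the joined (variable==2) table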
dt[variable==1,][dt[variable==2,],.(item1 = item, item2 = i.item, value1 = value,
value2=i.value, iteration, lag1 = lag, lag2 = i.lag), on=("rn")]
# item1 item2 value1 value2 iteration lag1 lag2
# 1: d i 0.4464375 6.491179 1 NA NA
# 2: b j 6.5148245 5.665638 1 NA NA
# 3: c f 3.9031889 2.751919 1 NA NA
# 4: a g 5.3450990 3.587738 1 NA NA
# 5: e h 6.1257061 3.544912 1 NA NA
# 6: d i 8.0236359 1.331371 2 0.4464375 6.491179
# 7: b j 6.3180503 4.184624 2 6.5148245 5.665638
# 8: c f 7.2440561 5.053722 2 3.9031889 2.751919
# 9: a g 3.4307173 6.823257 2 5.3450990 3.587738
# 10: e h 4.1486154 8.268693 2 6.1257061 3.544912
# 11: j a 5.7859952 5.121371 3 4.1846241 3.430717
# 12: f c 5.0735143 8.695145 3 5.0537224 7.244056
# 13: i e 2.9358327 5.160250 3 1.3313712 4.148615
# 14: g d 2.4702771 7.837112 3 6.8232573 8.023636
# 15: h b 4.5460694 7.917232 3 8.2686930 6.318050
Posting another, similar approach. It is similar in elongating item1 + item2 into a long data.table; the difference is in using joins.
There are 2 possible situations:
1) If the lag is always in the immediate previous iteration, the following code using a normal join should work:
library(data.table)
set.seed(21)
dt <- data.table(item1 = c(rep(sample(letters[1:5]), 2), sample(letters[6:10])),
item2 = c(rep(sample(letters[6:10]), 2), sample(letters[1:5])),
value1 = rnorm(15, 5, 2),
value2 = rnorm(15, 5, 2),
iteration = rep(1:3, each = 5))
#if that first lag can always be found in previous iteration
dt[.(iitem=c(item1, item2), ivalue=c(value1, value2), iiteration=c(iteration + 1, iteration + 1)),
lag1 := ivalue,
on=c(item1="iitem", iteration="iiteration")]
dt[.(iitem=c(item1, item2), ivalue=c(value1, value2), iiteration=c(iteration + 1, iteration + 1)),
lag2 := ivalue,
on=c(item2="iitem", iteration="iiteration")]
dt
# item1 item2 value1 value2 iteration lag1 lag2
# 1: d i 0.4464375195067456 6.491178609416053 1 NA NA
# 2: b j 6.5148244502509627 5.665638360665036 1 NA NA
# 3: c f 3.9031888919439428 2.751919085284464 1 NA NA
# 4: a g 5.3450989557007524 3.587738435542055 1 NA NA
# 5: e h 6.1257061355108435 3.544912270783058 1 NA NA
# 6: d i 8.0236359188753603 1.331371229451156 2 0.4464375195067456 6.491178609416053
# 7: b j 6.3180503383288116 4.184624119479032 2 6.5148244502509627 5.665638360665036
# 8: c f 7.2440561493491140 5.053722389597528 2 3.9031888919439428 2.751919085284464
# 9: a g 3.4307172617070858 6.823257275121762 2 5.3450989557007524 3.587738435542055
# 10: e h 4.1486154223793710 8.268692951017332 2 6.1257061355108435 3.544912270783058
# 11: j a 5.7859951827443368 5.121371228719468 3 4.1846241194790323 3.430717261707086
# 12: f c 5.0735142596491132 8.695145055731583 3 5.0537223895975281 7.244056149349114
# 13: i e 2.9358326775434151 5.160249909302514 3 1.3313712294511557 4.148615422379371
# 14: g d 2.4702770572371642 7.837111765957783 3 6.8232572751217617 8.023635918875360
# 15: h b 4.5460694295527579 7.917231870893728 3 8.2686929510173321 6.318050338328812
2) If the lag might be in earlier iterations, the following code using non-equi joins should help:
library(data.table)
set.seed(21)
dt <- data.table(item1 = c(rep(sample(letters[1:5]), 2), sample(letters[6:10])),
item2 = c(rep(sample(letters[6:10]), 2), sample(letters[1:5])),
value1 = rnorm(15, 5, 2),
value2 = rnorm(15, 5, 2),
iteration = rep(1:3, each = 5))
#remove iteration=2, item1=c, item2=f to show finding lag from earlier iterations
dt <- dt[-8,]
#if that first lag can be found in even earlier iteration, using non-equi joins as follows:
elongated <- dt[,.(item=c(item1, item2), ivalue=c(value1, value2), iteration=c(iteration, iteration), cpyalliter=c(iteration, iteration))]
dt[, lag1 := elongated[.SD, on=.(item=item1, iteration < iteration)][,
last(ivalue), by=.(item1=item, item2, value1, value2, iteration)]$V1 ]
dt[, lag2 := elongated[.SD, on=.(item=item2, iteration < iteration)][,
last(ivalue), by=.(item1, item2=item, value1, value2, iteration)]$V1 ]
dt
# item1 item2 value1 value2 iteration lag1 lag2
# 1: d i 0.4464375195067456 6.491178609416053 1 NA NA
# 2: b j 6.5148244502509627 5.665638360665036 1 NA NA
# 3: c f 3.9031888919439428 2.751919085284464 1 NA NA
# 4: a g 5.3450989557007524 3.587738435542055 1 NA NA
# 5: e h 6.1257061355108435 3.544912270783058 1 NA NA
# 6: d i 8.0236359188753603 1.331371229451156 2 0.4464375195067456 6.491178609416053
# 7: b j 6.3180503383288116 4.184624119479032 2 6.5148244502509627 5.665638360665036
# 8: a g 3.4307172617070858 6.823257275121762 2 5.3450989557007524 3.587738435542055
# 9: e h 4.1486154223793710 8.268692951017332 2 6.1257061355108435 3.544912270783058
# 10: j a 5.7859951827443368 5.121371228719468 3 4.1846241194790323 3.430717261707086
# 11: f c 5.0735142596491132 8.695145055731583 3 2.7519190852844644 3.903188891943943
# 12: i e 2.9358326775434151 5.160249909302514 3 1.3313712294511557 4.148615422379371
# 13: g d 2.4702770572371642 7.837111765957783 3 6.8232572751217617 8.023635918875360
# 14: h b 4.5460694295527579 7.917231870893728 3 8.2686929510173321 6.318050338328812
I wonder if there is a way to write the 2nd case more succinctly (i.e. with a little less chaining).
data:
structure(list(id = c(1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 5),
ax = c("a", "a", "b", "b", "b", "b", "b", "b", "c", "c",
"d", "d", "e"), time = c(1, 3, 0, 2, 4, 5, 6, 8, 7, 9, 10,
11, 12)), .Names = c("id", "ax", "time"), class = c("data.table",
"data.frame"), row.names = c(NA, -13L))
looks like:
id ax time
1: 1 a 1
2: 1 a 3
3: 2 b 0
4: 2 b 2
5: 2 b 4
6: 2 b 5
7: 2 b 6
8: 2 b 8
9: 3 c 7
10: 3 c 9
11: 4 d 10
12: 4 d 11
13: 5 e 12
I want to have the max of the previous group next to the current group:
desired output:
id ax time newCol
1: 1 a 1 NA
2: 1 a 3 NA
3: 2 b 0 3
4: 2 b 2 3
5: 2 b 4 3
6: 2 b 5 3
7: 2 b 6 3
8: 2 b 8 3
9: 3 c 7 8
10: 3 c 9 8
11: 4 d 10 9
12: 4 d 11 9
13: 5 e 12 11
Is it also possible to have the value of the "previous-previous" group?
Interested in base R, data.table and tidyverse solutions.
Note: the data can be grouped by EITHER id or ax; the example is a little redundant here.
A data.table solution:
dtt.max <- dtt[, .(max = max(time)), by = ax]
dtt.max[, max.prev := shift(max)]
dtt[dtt.max, newCol := i.max.prev, on = 'ax']
# > dtt
# id ax time newCol
# 1: 1 a 1 NA
# 2: 1 a 3 NA
# 3: 2 b 0 3
# 4: 2 b 2 3
# 5: 2 b 4 3
# 6: 2 b 5 3
# 7: 2 b 6 3
# 8: 2 b 8 3
# 9: 3 c 7 8
# 10: 3 c 9 8
# 11: 4 d 10 9
# 12: 4 d 11 9
# 13: 5 e 12 11
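The question also asks for the value of the "previous-previous" group; the same pattern extends with a larger shift. A minimal sketch, reusing the dtt.max helper table from above (the new column names are my own choice):
dtt.max[, max.prev2 := shift(max, 2L)]
dtt[dtt.max, newCol2 := i.max.prev2, on = 'ax']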
A data.table solution grouping by the expression id + 1: this yields a column named id whose values are shifted up by one, so the merge on id attaches each group's maximum to the following group.
library(data.table)
merge(d, setDT(d)[, max(time), id + 1], all.x = TRUE)
Here is a dplyr approach. The key here is to group and ungroup when necessary:
df %>%
group_by(ax) %>%
mutate(new = time[n()]) %>%
ungroup() %>%
mutate(new = lag(new)) %>%
group_by(ax) %>%
mutate(new = new[1])
# A tibble: 13 x 4
# Groups: ax [5]
id ax time new
<dbl> <chr> <dbl> <dbl>
1 1. a 1. NA
2 1. a 3. NA
3 2. b 0. 3.
4 2. b 2. 3.
5 2. b 4. 3.
6 2. b 5. 3.
7 2. b 6. 3.
8 2. b 8. 3.
9 3. c 7. 8.
10 3. c 9. 8.
11 4. d 10. 9.
12 4. d 11. 9.
13 5. e 12. 11.
Assuming id is the same as group:
dfr <- dfr %>% group_by(id) %>% mutate(groupmax = max(time))
dfr$old_group_max <- dfr$groupmax[match(dfr$id - 1, dfr$id)]
The antepenultimate group is left as an exercise :-)
1) This uses no packages. It computes the maximum for each group, giving Ag, and then lags it, giving LagMax. Finally it left joins that back into the original data frame DF using merge:
Ag <- aggregate(time ~ id, DF, max)
LagMax <- transform(Ag, lagmax = c(NA, head(time, -1)), time = NULL)
merge(DF, LagMax, by = "id", all.x = TRUE)
giving:
id ax time lagmax
1 1 a 1 NA
2 1 a 3 NA
3 2 b 0 3
4 2 b 2 3
5 2 b 4 3
6 2 b 5 3
7 2 b 6 3
8 2 b 8 3
9 3 c 7 8
10 3 c 9 8
11 4 d 10 9
12 4 d 11 9
13 5 e 12 11
2) This sorts time within id so that we know that the maximum is the last value in each id group.
o <- order(factor(DF$id, levels = unique(DF$id)), DF$time)
Time <- DF$time[o]
lagmax <- function(r) if (r[1] == 1) NA else Time[r[1] - 1]
transform(DF, lagmax = ave(seq_along(id), id, FUN = lagmax))
In the question the time values are already sorted within id, and if that is known to be the case, the above could be shortened to:
lagmax <- function(r) if (r[1] == 1) NA else DF$time[r[1] - 1]
transform(DF, lagmax = ave(seq_along(id), id, FUN = lagmax))
3) This one-liner is a data.table translation of (2):
library(data.table)
DT <- copy(DF) # don't overwrite DF
setDT(DT)[, g:=rleid(id)][, lagmax := DT$time[.I[1]-1], keyby = c("g", "id")]
In the sample data in the question, time is sorted within id; if that were known to be the case, we could use the following shorter code in place of the last line above:
setDT(DT)[, lagmax := DT$time[.I[1]-1], by = id]