average by group, removing current row - r

I want to compute group means of a variable but excluding the focal respondent:
library(data.table)
set.seed(1)
dat <- data.table(id = 1:30, y = runif(30), grp = rep(1:3, each = 10))
For the first respondent the average should be computed over their group with their own row removed, for the second likewise, and so on:
mean(dat[grp == 1, y][-1])
mean(dat[grp == 1, y][-2])
mean(dat[grp == 1, y][-3])
The same for the second group:
mean(dat[grp == 2, y][-1])
mean(dat[grp == 2, y][-2])
mean(dat[grp == 2, y][-3])
I tried this, but it didn't work:
dat[, avg := mean(dat[, y][-.I]), by = grp]
Any ideas?

You can try this solution; note it uses a smaller 9-row example with an NA to show how missing values are handled:
set.seed(1)
dat <- data.table(id = 1:9, y = c(NA, runif(8)), grp = rep(1:3, each = 3))
dat[, avg2 := sapply(seq_along(y), function(i) mean(y[-i], na.rm = TRUE)), by = grp]
dat
#    id         y grp      avg2
# 1:  1        NA   1 0.3188163
# 2:  2 0.2655087   1 0.3721239
# 3:  3 0.3721239   1 0.2655087
# 4:  4 0.5728534   2 0.5549449
# 5:  5 0.9082078   2 0.3872676
# 6:  6 0.2016819   2 0.7405306
# 7:  7 0.8983897   3 0.8027365
# 8:  8 0.9446753   3 0.7795937
# 9:  9 0.6607978   3 0.9215325

Seems like you're most of the way there and just need to account for NAs:
dat[, avg := (sum(y, na.rm = TRUE) - ifelse(is.na(y), 0, y)) /
             (sum(!is.na(y)) + is.na(y) - 1),
    by = grp]
No double loops or extra memory required.
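A quick sanity check (a sketch, assuming the 9-row dat with the avg2 column from the answer above is still in scope): the closed-form avg should match the sapply-based avg2 on every row, including the NA one.
dat[, all.equal(avg, avg2)]
# should be TRUE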

If I'm understanding correctly, I think this does the job:
dat[, .(y2 = rep(y, .N), id2 = rep(id, .N), id3 = rep(id, each = .N)), by = grp
    ][id2 != id3, .(avg = mean(y2)), by = .(id3, grp)]
First step is to duplicate the whole group data for each id, and to mark which row we want to exclude from the mean. Second step is to exclude the rows, and then group back by group/id. Obviously this isn't super memory efficient, but should work so long as you're not memory constrained.
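The same expand-and-exclude idea can also be written as a self join, which saves building the rep() columns by hand. A sketch against the 30-row dat from the question (for data with NAs, mean() would need na.rm = TRUE):
# join dat to itself within grp, then drop each focal row and average the rest
dat[dat, on = .(grp), allow.cartesian = TRUE
    ][id != i.id, .(avg = mean(y)), by = .(grp, id = i.id)]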

Related

R data.table: keep column when grouping by expression

When grouping by an expression involving a column (e.g. DT[...,.SD[c(1,.N)],by=expression(col)]), I want to keep the value of col in .SD.
For example, in the following I am grouping by the remainder of a divided by 3, and keeping the first and last observation in each group. However, a is no longer present in .SD:
f <- function(x) x %% 3
Q <- data.table(a = 1:20, x = rnorm(20), y = rnorm(20))
Q[, .SD[c(1, .N)], by = f(a)]
   f         x          y
1: 1 0.2597929  1.0256259
2: 1 2.1106619 -1.4375193
3: 2 1.2862501  0.7918292
4: 2 0.6600591 -0.5827745
5: 0 1.3758503  1.3122561
6: 0 2.6501140  1.9394756
The desired output is as if I had done the following
Q[, f := f(a)]
tmp <- Q[, .SD[c(1, .N)], by=f]
Q[, f := NULL]
tmp[, f := NULL]
tmp
    a         x          y
1:  1 0.2597929  1.0256259
2: 19 2.1106619 -1.4375193
3:  2 1.2862501  0.7918292
4: 20 0.6600591 -0.5827745
5:  3 1.3758503  1.3122561
6: 18 2.6501140  1.9394756
Is there a way to do this directly, without creating a new variable and creating a new intermediate data.table?
Instead of .SD, use .I to get the row indices, extract that column ($V1), and subset the original dataset:
library(data.table)
Q[Q[, .I[c(1, .N)], by = f(a)]$V1]
#     a          x          y
# 1:  1  0.7265238  0.5631753
# 2: 19  1.7110611 -0.3141118
# 3:  2  0.1643566 -0.4704501
# 4: 20  0.5182394 -0.1309016
# 5:  3 -0.6039137  0.1349981
# 6: 18  0.3094155 -1.1892190
NOTE: The values in columns 'x' and 'y' differ from the question's because no seed was set before generating the data.
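To verify that the two approaches agree, one can seed the data first (a sketch; set.seed(1) is an arbitrary choice, f is the function from the question, and copy() keeps the helper f column off Q):
set.seed(1)
Q <- data.table(a = 1:20, x = rnorm(20), y = rnorm(20))
res1 <- Q[Q[, .I[c(1, .N)], by = f(a)]$V1]
# the two-step approach from the question, on a copy so Q is untouched
tmp <- copy(Q)[, f := f(a)]
res2 <- tmp[, .SD[c(1, .N)], by = f][, f := NULL][]
all.equal(res1, res2)  # should be TRUE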

Keeping only the x largest groups with data.table

I have recently started using the data.table package in R, but I've now stumbled into an issue that I do not know how to tackle with data.table.
Sample data:
set.seed(1)
library(data.table)
dt = data.table(group=c("A","A","A","B","B","B","C","C"),value = runif(8))
I can add a group count with the statement
dt[, groupcount := .N, by = group]
but now I only want to keep the x groups with the largest value for groupcount. Let's assume x=1 for the example.
I tried chaining as follows:
dt[, groupcount := .N, by = group][groupcount %in% head(sort(unique(groupcount), decreasing = TRUE), 1)]
But since group A and B both have three elements, they both remain in the data.table. I only want the x largest groups where x=1, so I only want one of the groups (A or B) to remain. I assume this can be done in a single line with data.table. Is this true, and if yes, how?
To clarify: x is an arbitrarily chosen number here. The function should also work with x=3, where it would return the 3 largest groups.
Here is a method that uses a join.
x <- 1
dt[dt[, .N, by=group][order(-N)[1:x]], on="group"]
group value N
1: A 0.2655087 3
2: A 0.3721239 3
3: A 0.5728534 3
The inner data.table is aggregated to count the observations per group, the positions of the x largest groups are retrieved by ordering on -N and taking the first x, and the result is then joined back onto the original table by group.
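Note the behaviour with ties: order() is stable, so with x <- 1 group A wins over the equally large group B simply because it appears first. Setting x <- 2 keeps both three-row groups:
x <- 2
dt[dt[, .N, by = group][order(-N)[1:x]], on = "group"]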
How about making use of the ordering of groupcount?
setorder(dt, -groupcount)
x <- 1
dt[group %in% dt[, unique(group)][1:x]]
#    group     value groupcount
# 1:     A 0.2655087          3
# 2:     A 0.3721239          3
# 3:     A 0.5728534          3
x <- 3
dt[group %in% dt[, unique(group)][1:x]]
#    group     value groupcount
# 1:     A 0.2655087          3
# 2:     A 0.3721239          3
# 3:     A 0.5728534          3
# 4:     B 0.9082078          3
# 5:     B 0.2016819          3
# 6:     B 0.8983897          3
# 7:     C 0.9446753          2
# 8:     C 0.6607978          2
## alternative syntax
# dt[group %in% unique(dt$group)[1:x] ]
We can do
x <- 1
dt[dt[, {
  tbl <- table(group)
  nm <- names(tbl)[tbl == max(tbl)]
  if (length(nm) < x) rep(TRUE, .N)
  else group %in% sample(nm, x)
}]]
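If the random tie-breaking from sample() is not wanted, a deterministic variant of the join answer above breaks ties by group name instead (a sketch):
x <- 1
# order on -N first, then alphabetically on group to break ties
dt[dt[, .N, by = group][order(-N, group)][1:x], on = "group"]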

How to avoid recycling when using roll_mean in R?

I have been trying to apply a rolling mean to several columns in a dataframe, where each column contains data from multiple individuals. I have succeeded using roll_mean from the RcppRoll package and lapply. I've included below an example using a dummy dataframe and the output.
library(data.table)
library(RcppRoll)
x <- rnorm(20, 1)
y <- rnorm(20, 2)
z <- rnorm(20, 3)
ID <- rep(1:2, each = 10)
mydf <- data.frame(ID, x, y, z)
vars <- c("x", "y", "z")
setDT(mydf)[, paste0(vars, "_", "mean") := lapply(.SD, function(x) roll_mean(x, n = 3, na.rm = TRUE)), .SDcols = vars, by = ID]
mydf
ID x y z x_mean y_mean z_mean
1: 1 0.34457704 1.9580361 2.6458335 1.2515642 1.8307447 2.569645
2: 1 1.41839352 2.0697324 1.8495358 1.7012511 1.7248261 2.988908
3: 1 1.99172192 1.4644657 3.2135652 1.8455087 1.7165419 3.184736
4: 1 1.69363783 1.6402801 3.9036227 1.5002658 2.1512764 3.289555
5: 1 1.85116646 2.0448798 2.4370206 0.9775842 3.1215589 2.563110
6: 1 0.95599300 2.7686692 3.5280206 0.8477701 3.4576141 3.106095
7: 1 0.12559300 4.5511275 1.7242892 0.9450234 3.5134499 3.020176
8: 1 1.46172438 3.0530454 4.0659766 0.9080677 3.0100022 3.371839
9: 1 1.24775283 2.9361768 3.2702614 1.2515642 1.8307447 2.569645
10: 1 0.01472603 3.0407845 2.7792776 1.7012511 1.7248261 2.988908
11: 2 -0.91146047 2.5898074 2.0328348 0.4314443 1.2688530 2.477879
12: 2 0.48183559 1.8230335 2.6910075 1.2689767 0.9650435 2.544006
13: 2 1.72395769 -0.6062819 2.7097949 0.8747931 1.2273766 1.974265
14: 2 1.60113680 1.6783790 2.2312143 0.2579207 1.6945497 2.233321
15: 2 -0.70071522 2.6100328 0.9817857 0.1162224 2.0928536 2.606608
16: 2 -0.12665946 0.7952374 3.4869635 1.3884888 2.1063817 2.986786
17: 2 1.17604187 2.8732906 3.3510742 2.0557599 2.2701173 3.178248
18: 2 3.11608400 2.6506171 2.1223190 1.5553274 2.3987061 3.015501
19: 2 1.87515393 1.2864441 4.0613513 0.4314443 1.2688530 2.477879
20: 2 -0.32525560 3.2590570 2.8628313 1.2689767 0.9650435 2.544006
As you can see from the output table (mydf), the mean columns have been created by the lapply statement, and the rolling means have been calculated for each individual ID. However, roll_mean generates only 8 values from the 10 raw values per ID, so the results have been recycled to fill the data frame: the last 2 rows for each ID repeat the first 2 rolling means.
My actual data is time series data and I don't want the results recycled. I want to avoid recycling by adding the raw x values to the start of the x_mean column up until the point where there are sufficient raw data to produce the 3-point rolling mean.
I've tried searching (on SO and google) for posts about avoiding recycling in roll_mean or similar functions with no success.
Does anyone know how to pad the first 2 rows in my example to avoid recycling in the roll_mean function?
Thanks.
The whole solution:
library(data.table)
x <- rnorm(20, 1)
y <- rnorm(20, 2)
z <- rnorm(20, 3)
ID <- rep(1:2, each = 10)
mydf <- data.table(ID, x, y, z) # created directly as a data.table here
vars <- c("x", "y", "z")
# fill = NA pads the short result with NAs; align = 'right' places each mean
# on the last row of its window, so the NAs land in the first n - 1 rows
mydf[, paste0(vars, "_", "mean") := lapply(.SD, function(x) RcppRoll::roll_mean(x, n = 3, na.rm = TRUE, fill = NA, align = 'right')), .SDcols = vars, by = ID]
mydf
# ID x y z x_mean y_mean z_mean
# 1: 1 0.3735462 2.9189774 2.835476 NA NA NA
# 2: 1 1.1836433 2.7821363 2.746638 NA NA NA
# 3: 1 0.1643714 2.0745650 3.696963 0.5738536 2.591893 3.093026
# 4: 1 2.5952808 0.0106483 3.556663 1.3144318 1.622450 3.333422
# 5: 1 1.3295078 2.6198257 2.311244 1.3630533 1.568346 3.188290
# ...
mydf[is.na(x_mean), c(paste0(vars, "_", "mean")) := mget(vars)]
mydf
# ID x y z x_mean y_mean z_mean
# 1: 1 0.3735462 2.9189774 2.835476 0.3735462 2.918977 2.835476
# 2: 1 1.1836433 2.7821363 2.746638 1.1836433 2.782136 2.746638
# 3: 1 0.1643714 2.0745650 3.696963 0.5738536 2.591893 3.093026
# 4: 1 2.5952808 0.0106483 3.556663 1.3144318 1.622450 3.333422
# 5: 1 1.3295078 2.6198257 2.311244 1.3630533 1.568346 3.188290
# ...
Edit:
Missing parts of mydf can also be filled in a bit "smarter" way, i.e. by re-running the rolling mean on the still-incomplete rows with a window smaller by 1 in each iteration (n_roll is the original window size, 3 in this example):
n_roll <- 3
for (n_inner in n_roll:1) {
  mydf[!complete.cases(mydf),
       paste0(vars, "_", "mean") := lapply(
         .SD, function(x) RcppRoll::roll_mean(x, n = n_inner, na.rm = TRUE, fill = NA, align = 'right')),
       .SDcols = vars, by = ID]
}
# ID x y z x_mean y_mean z_mean
# 1: 1 0.3735462 2.9189774 2.835476 0.3735462 2.918977 2.835476 <- Values from x, y and z
# 2: 1 1.1836433 2.7821363 2.746638 0.7785948 2.850557 2.791057 <- roll_mean with window 2
# 3: 1 0.1643714 2.0745650 3.696963 0.5738536 2.591893 3.093026 <- roll_mean with window 3
# 4: 1 2.5952808 0.0106483 3.556663 1.3144318 1.622450 3.333422 <- as above
# 5: 1 1.3295078 2.6198257 2.311244 1.3630533 1.568346 3.188290
# ...
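As an alternative, assuming a recent data.table (frollmean() was added in 1.12.0), an adaptive window that grows 1, 2, 3, 3, ... reproduces this "smarter" fill in a single call, with no loop and no RcppRoll dependency. A sketch, using mydf, vars and ID from above:
mydf[, paste0(vars, "_", "mean") := lapply(.SD, function(v)
       # window widths 1, 2, 3, 3, ... so early rows average whatever is available
       frollmean(v, n = pmin(seq_len(.N), 3L), adaptive = TRUE, na.rm = TRUE)),
     .SDcols = vars, by = ID]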

R Given a list of same dimension data tables, produce a summary of the means of each cell

I'm finding it hard to put what I want into words so I will try to run through an example to explain it. Let's say I've repeated an experiment twice and have two tables:
[df1]      [df2]
 X  Y       X  Y
 2  3       4  1
 5  2       2  4
These tables are stored in a list (the list can contain more than two elements if necessary), and what I want to do is create an average of each cell across the tables in the list (or, for a generalised version, apply any function I choose to the cells, e.g. mad, sd, etc.):
[df1]      [df2]      [dfMeans]
 X  Y       X  Y       X          Y
 2  3       4  1       mean(2,4)  mean(3,1)
 5  2       2  4       mean(5,2)  mean(2,4)
I have a code solution to my problem, but since this is in R there is most likely a cleaner way to do things:
df1 <- data.frame(X=c(2,3,4),Y=c(3,2,1))
df2 <- data.frame(X=c(5,1,3),Y=c(4,1,4))
df3 <- data.frame(X=c(2,7,4),Y=c(1,7,6))
dfList <- list(df1,df2,df3)
dfMeans <- data.frame(MeanX=c(NA,NA,NA),MeanY=c(NA,NA,NA))
for (rowIndex in 1:nrow(df1)) {
  for (colIndex in 1:ncol(df1)) {
    valuesAtCell <- c()
    for (tableIndex in 1:length(dfList)) {
      valuesAtCell <- c(valuesAtCell, dfList[[tableIndex]][rowIndex, colIndex])
    }
    dfMeans[rowIndex, colIndex] <- mean(valuesAtCell)
  }
}
print(dfMeans)
Here is a data.table solution where the mean is applied row-wise across the data frames:
library(data.table)
dtList <- rbindlist(dfList, use.names = TRUE, idcol = TRUE)
dtList
   .id X Y
1:   1 2 3
2:   1 3 2
3:   1 4 1
4:   2 5 4
5:   2 1 1
6:   2 3 4
7:   3 2 1
8:   3 7 7
9:   3 4 6
dtList[, rn := 1:.N, by = .id][, .(X = mean(X), Y = mean(Y)), by = rn]
   rn        X        Y
1:  1 3.000000 2.666667
2:  2 3.666667 3.333333
3:  3 3.666667 3.666667
You can replace mean with another aggregation function, e.g., median. The .id column records which original data frame each row came from.
Edit
The solution can be extended to an arbitrary number of columns (provided column names and column order are identical in all data frames):
cn <- colnames(df1)
cn
[1] "X" "Y"
dtList[, rn := 1:.N, by = .id][, lapply(.SD, mean), by = rn, .SDcols = cn][, rn := NULL][]
          X        Y
1: 3.000000 2.666667
2: 3.666667 3.333333
3: 3.666667 3.666667
The column names are taken from one of the original data frames, which adds to the flexibility of the solution. [, rn := NULL] removes the row numbers from the result, and the final [] ensures the result is printed.
You could simply sum all data.frames in your list using Reduce() and divide by length(dfList), i.e. the number of data frames the list contains.
Reduce(`+`, dfList) / length(dfList)
#          X        Y
# 1 3.000000 2.666667
# 2 3.666667 3.333333
# 3 3.666667 3.666667
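Reduce() works here because the mean decomposes into a cell-wise sum followed by a division. For an arbitrary cell-wise function such as sd or mad, one option (a sketch, assuming all data frames in dfList share the same dimensions) is to stack the list into a 3-d array and apply over the row/column margins:
# stack the data frames into an nrow x ncol x length(dfList) array
arr <- simplify2array(lapply(dfList, as.matrix))
# apply sd across the third dimension, i.e. cell-wise across the tables
as.data.frame(apply(arr, c(1, 2), sd))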

data.table aggregating over a subset of keys and joining with the subset

I have a table, Y, which contains a subset of unique keys from a much larger table, X, which has many duplicate keys. For each key in Y, I want to aggregate the same keys in X and add the aggregated variables to Y. I've been playing around with data.table and I've come up with a way that works without having to make a copy, but I'm hoping there is a faster and less syntactically dizzying solution. As more variables are added the syntax gets harder and harder to read and more helper references are made to table X when I really only care about them in table Y.
My question, just to clarify, is whether there is a more efficient and/or syntactically simpler way to do this operation.
My solution:
Y[X[Y, b := sum(a)], b := i.b, nomatch=0]
For example:
set.seed(1)
X = data.table(id = sample.int(10,30, replace=TRUE), a = runif(30))
Y = data.table(id = seq(1,5))
setkey(X,id)
setkey(Y,id)
#head(X)
#    id         a
# 1:  1 0.4112744
# 2:  1 0.3162717
# 3:  2 0.6470602
# 4:  2 0.2447973
# 5:  3 0.4820801
# 6:  3 0.8273733
Y[X[Y, b := sum(a)], b := i.b, nomatch=0]
#head(Y)
#    id         b
# 1:  1 0.7275461
# 2:  2 0.8918575
# 3:  3 3.0622883
# 4:  4 2.9098465
# 5:  5 0.7893562
IIUC, we could use data.table's by-without-by feature here...
## <= 1.9.2
X[Y, list(b=sum(a))] ## implicit by-without-by
## 1.9.3
X[Y, list(b=sum(a)), by=.EACHI] ## explicit by
#    id         b
# 1:  1 0.7275461
# 2:  2 0.8918575
# 3:  3 3.0622883
# 4:  4 2.9098465
# 5:  5 0.7893562
In 1.9.3, by-without-by has been changed to require an explicit by. You can read more about it here under 1.9.3 new features, points (1) and (2), and the links from there.
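Since the goal is to add the aggregate to Y itself, the by-without-by result can also be assigned into Y by reference, avoiding an extra copy. A sketch using the 1.9.3+ syntax, relying on the keys set in the question so the join result lines up row-for-row with Y:
Y[, b := X[Y, sum(a), by = .EACHI]$V1]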
Is this what you had in mind?
# set up a reproducible example
library(data.table)
set.seed(1) # for reproducible example
X = data.table(id = sample.int(10,30, replace=TRUE), a = runif(30))
Y = data.table(id = seq(1,5))
setkey(X,id)
setkey(Y,id)
# this statement does the work
result <- X[,list(b=sum(a)),keyby=id][Y]
result
#    id         b
# 1:  1 0.7275461
# 2:  2 0.8918575
# 3:  3 3.0622883
# 4:  4 2.9098465
# 5:  5 0.7893562
This might be faster, as it subsets X first.
result.2 <- X[Y][,list(b=sum(a)),by=id]
identical(result, result.2)
# [1] TRUE
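Which ordering wins depends on the data, so it is worth measuring on something of realistic size. A quick benchmark sketch, assuming the microbenchmark package is installed:
library(microbenchmark)
microbenchmark(
  aggregate_then_join = X[, list(b = sum(a)), keyby = id][Y],
  join_then_aggregate = X[Y][, list(b = sum(a)), by = id],
  times = 100L
)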
