Keeping only the x largest groups with data.table - r

I have recently started using the data.table package in R, but I stumbled into an issue that I do not know how to tackle with data.table.
Sample data:
set.seed(1)
library(data.table)
dt = data.table(group=c("A","A","A","B","B","B","C","C"),value = runif(8))
I can add a group count with the statement
dt[,groupcount := .N ,group]
but now I only want to keep the x groups with the largest value for groupcount. Let's assume x=1 for the example.
I tried chaining as follows:
dt[,groupcount := .N ,group][groupcount %in% head(sort(unique(groupcount),decreasing=TRUE),1)]
But since groups A and B both have three elements, they both remain in the data.table. I only want the x largest groups where x=1, so only one of the groups (A or B) should remain. I assume this can be done in a single line with data.table. Is this true, and if yes, how?
To clarify: x is an arbitrarily chosen number here. The function should also work with x=3, where it would return the 3 largest groups.

Here is a method that uses a join.
x <- 1
dt[dt[, .N, by=group][order(-N)[1:x]], on="group"]
group value N
1: A 0.2655087 3
2: A 0.3721239 3
3: A 0.5728534 3
The inner data.table is aggregated to count the observations per group, and the positions of the x largest groups are retrieved by subsetting order(-N) with the value of x. The result is then joined back onto the original by group.
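The same expression generalises to any x; for example, with x = 3 all three groups survive (a sketch run on the seeded data from the question, assuming a fresh dt without the groupcount column):
x <- 3
dt[dt[, .N, by=group][order(-N)[1:x]], on="group"]
group value N
1: A 0.2655087 3
2: A 0.3721239 3
3: A 0.5728534 3
4: B 0.9082078 3
5: B 0.2016819 3
6: B 0.8983897 3
7: C 0.9446753 2
8: C 0.6607978 2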

How about making use of the order of groupcount?
setorder(dt, -groupcount)
x <- 1
dt[group %in% dt[ , unique(group)][1:x] ]
# group value groupcount
# 1: A 0.2655087 3
# 2: A 0.3721239 3
# 3: A 0.5728534 3
x <- 3
dt[group %in% dt[ , unique(group)][1:x] ]
# group value groupcount
# 1: A 0.2655087 3
# 2: A 0.3721239 3
# 3: A 0.5728534 3
# 4: B 0.9082078 3
# 5: B 0.2016819 3
# 6: B 0.8983897 3
# 7: C 0.9446753 2
# 8: C 0.6607978 2
## alternative syntax
# dt[group %in% unique(dt$group)[1:x] ]
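One caveat: groups A and B are tied at three rows each, so with x = 1 whichever group happens to sort to the top wins. If a deterministic tie-break matters, an explicit secondary sort key settles it (a small sketch):
## break ties between equal-sized groups alphabetically by group name
setorder(dt, -groupcount, group)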

We can do
x <- 1
dt[dt[, {
  tbl <- table(group)
  nm <- names(tbl)[tbl == max(tbl)]   # groups tied for the largest count
  if (length(nm) < x) rep(TRUE, .N)   # fewer tied groups than x: keep all rows
  else group %in% sample(nm, x)       # otherwise sample x of the tied groups
}]]

Related

R Given a list of same dimension data tables, produce a summary of the means of each cell

I'm finding it hard to put what I want into words so I will try to run through an example to explain it. Let's say I've repeated an experiment twice and have two tables:
[df1] [df2]
X Y X Y
2 3 4 1
5 2 2 4
These tables are stored in a list (which may contain more than two elements if necessary), and what I want to do is compute the average of each cell across the list (or, in a generalised version, apply any function I choose to the cells, e.g. mad, sd, etc.):
[df1] [df2] [dfMeans]
X Y X Y X Y
2 3 4 1 mean(2,4) mean(3,1)
5 2 2 4 mean(5,2) mean(2,4)
I have a code solution to my problem, but since this is in R there is most likely a cleaner way to do things:
df1 <- data.frame(X=c(2,3,4),Y=c(3,2,1))
df2 <- data.frame(X=c(5,1,3),Y=c(4,1,4))
df3 <- data.frame(X=c(2,7,4),Y=c(1,7,6))
dfList <- list(df1,df2,df3)
dfMeans <- data.frame(MeanX=c(NA,NA,NA),MeanY=c(NA,NA,NA))
for (rowIndex in 1:nrow(df1)) {
  for (colIndex in 1:ncol(df1)) {
    valuesAtCell <- c()
    for (tableIndex in 1:length(dfList)) {
      valuesAtCell <- c(valuesAtCell, dfList[[tableIndex]][rowIndex, colIndex])
    }
    dfMeans[rowIndex, colIndex] <- mean(valuesAtCell)
  }
}
print(dfMeans)
Here is a data.table solution where the mean is applied row-wise across the data frames:
library(data.table)
dtList <- rbindlist(dfList, use.names = TRUE, idcol = TRUE)
dtList
.id X Y
1: 1 2 3
2: 1 3 2
3: 1 4 1
4: 2 5 4
5: 2 1 1
6: 2 3 4
7: 3 2 1
8: 3 7 7
9: 3 4 6
dtList[, rn := 1:.N, by = .id][][, .(X = mean(X), Y = mean(Y)), by = rn]
rn X Y
1: 1 3.000000 2.666667
2: 2 3.666667 3.333333
3: 3 3.666667 3.666667
You can replace the mean by another aggregation function, e.g., the median. The .id column records which original data frame each row was sourced from.
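For instance, swapping in the median is a one-word change (a sketch on the same dtList):
dtList[, rn := 1:.N, by = .id][, .(X = median(X), Y = median(Y)), by = rn]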
Edit
The solution can be extended to an arbitrary number of columns (provided column names and column order are identical in all data frames):
cn <- colnames(df1)
cn
[1] "X" "Y"
dtList[, rn := 1:.N, by = .id][, lapply(.SD, mean), by = rn, .SDcols = cn][, rn := NULL][]
X Y
1: 3.000000 2.666667
2: 3.666667 3.333333
3: 3.666667 3.666667
The column names are taken from one of the original data frames, which adds to the flexibility of the solution. [, rn := NULL] removes the row numbers from the result, and the final [] ensures the result is printed.
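The same pattern accepts any column-wise aggregate. For example, the per-cell standard deviation (a sketch reusing cn and dtList from above):
dtList[, rn := 1:.N, by = .id][, lapply(.SD, sd), by = rn, .SDcols = cn][, rn := NULL][]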
You could simply sum all data.frames in your list using Reduce() and divide by the length of dfList, which equals the number of data frames it contains.
Reduce(`+`, dfList) / length(dfList)
# X Y
#1 3.000000 2.666667
#2 3.666667 3.333333
#3 3.666667 3.666667
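Note that the Reduce() trick only works for aggregates that decompose into elementwise sums, such as the mean. For an arbitrary cellwise function, one base-R alternative is to stack the frames into a 3-d array first (a sketch, assuming all frames have identical dimensions):
arr <- simplify2array(lapply(dfList, as.matrix)) # rows x cols x tables
apply(arr, c(1, 2), median) # or mad, sd, any cellwise function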

What does ".N" mean in data.table?

I have a data.table dt:
library(data.table)
dt = data.table(a=LETTERS[c(1,1:3)],b=4:7)
a b
1: A 4
2: A 5
3: B 6
4: C 7
The result of dt[, .N, by=a] is
a N
1: A 2
2: B 1
3: C 1
I know that by=a or by="a" means grouping by column a, and that the N column counts how many times each value of a occurs. However, I get this result without using nrow(). Is .N not just a column name? I can't find the documentation via ??".N" in R. I tried .K, but it doesn't work. What does .N mean?
Think of .N as a variable for the number of instances. For example:
dt <- data.table(a = LETTERS[c(1,1:3)], b = 4:7)
dt[.N] # returns the last row
# a b
# 1: C 7
Your example returns a new variable with the number of rows per case:
dt[, new_var := .N, by = a]
dt
# a b new_var
# 1: A 4 2 # 2 'A's
# 2: A 5 2
# 3: B 6 1 # 1 'B'
# 4: C 7 1 # 1 'C'
For a list of all special symbols of data.table, see also https://www.rdocumentation.org/packages/data.table/versions/1.10.0/topics/special-symbols
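A few minimal demonstrations of the symbol in its different contexts:
dt[, .N] # in j: total row count, here 4
dt[, .N, by = a] # in j with by: row count per group, as in the question
dt[.N] # in i: the number of the last row, so this selects it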

average by group, removing current row

I want to compute group means of a variable but excluding the focal respondent:
set.seed(1)
dat <- data.table(id = 1:30, y = runif(30), grp = rep(1:3, each=10))
The first record (respondent) should have an average of... the second... and so on:
mean(dat[grp==1, y][-1])
mean(dat[grp==1, y][-2])
mean(dat[grp==1, y][-3])
For the second group, the same:
mean(dat[grp==2, y][-1])
mean(dat[grp==2, y][-2])
mean(dat[grp==2, y][-3])
I tried this, but it didn't work:
dat[, avg := mean(dat[, y][-.I]), by=grp]
Any ideas?
You can try this solution:
set.seed(1)
dat <- data.table(id = 1:9, y = c(NA,runif(8)), grp = rep(1:3, each=3))
dat[, avg2 := sapply(seq_along(y),function(i) mean(y[-i],na.rm=T)), by=grp]
dat
# id y grp avg2
# 1: 1 NA 1 0.3188163
# 2: 2 0.2655087 1 0.3721239
# 3: 3 0.3721239 1 0.2655087
# 4: 4 0.5728534 2 0.5549449
# 5: 5 0.9082078 2 0.3872676
# 6: 6 0.2016819 2 0.7405306
# 7: 7 0.8983897 3 0.8027365
# 8: 8 0.9446753 3 0.7795937
# 9: 9 0.6607978 3 0.9215325
Seems like you're most of the way there and just need to account for NA's:
dat[, avg := (sum(y, na.rm=T) - ifelse(is.na(y), 0, y)) / (sum(!is.na(y)) + is.na(y) - 1)
, by = grp]
For a non-missing y this computes (sum of the group's non-missing y minus y) divided by (count of non-missing y minus 1); for a missing y it reduces to the plain group mean of the non-missing values. No double loops or extra memory required.
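As a quick sanity check, this closed form can be compared against the sapply() answer above (assuming avg and avg2 were both computed on the same dat):
dat[, all.equal(avg, avg2)]
# [1] TRUE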
If I'm understanding correctly, I think this does the job:
dat[, .(id, y2 = rep(y, .N), id2 = rep(id, .N), id3 = rep(id, each = .N)), by = grp
  ][!(id2 == id3), mean(y2), by = .(id3, grp)]
First step is to duplicate the whole group data for each id, and to mark which row we want to exclude from the mean. Second step is to exclude the rows, and then group back by group/id. Obviously this isn't super memory efficient, but should work so long as you're not memory constrained.

data.table aggregating over a subset of keys and joining with the subset

I have a table, Y, which contains a subset of the unique keys from a much larger table, X, which has many duplicate keys. For each key in Y, I want to aggregate the matching rows in X and add the aggregated variables to Y. I've been playing around with data.table and have come up with a way that works without making a copy, but I'm hoping there is a faster and less syntactically dizzying solution. As more variables are added, the syntax gets harder and harder to read, and more helper references are made to table X when I really only care about them in table Y.
My question, just to clarify, is whether there is a more efficient and/or syntactically simpler way to do this operation.
My solution:
Y[X[Y, b:= sum(a)], b := i.b, nomatch=0]
For example:
set.seed(1)
X = data.table(id = sample.int(10,30, replace=TRUE), a = runif(30))
Y = data.table(id = seq(1,5))
setkey(X,id)
setkey(Y,id)
#head(X)
#id a
#1: 1 0.4112744
#2: 1 0.3162717
#3: 2 0.6470602
#4: 2 0.2447973
#5: 3 0.4820801
#6: 3 0.8273733
Y[X[Y, b := sum(a)], b := i.b, nomatch=0]
#head(Y)
# id b
#1: 1 0.7275461
#2: 2 0.8918575
#3: 3 3.0622883
#4: 4 2.9098465
#5: 5 0.7893562
IIUC, we could use data.table's by-without-by feature here...
## <= 1.9.2
X[Y, list(b=sum(a))] ## implicit by-without-by
## 1.9.3
X[Y, list(b=sum(a)), by=.EACHI] ## explicit by
# id b
# 1: 1 0.7275461
# 2: 2 0.8918575
# 3: 3 3.0622883
# 4: 4 2.9098465
# 5: 5 0.7893562
In 1.9.3, by-without-by has been changed to require an explicit by. You can read more about it in the 1.9.3 new features list, points (1) and (2), and the links from there.
Is this what you had in mind?
# set up a reproducible example
library(data.table)
set.seed(1) # for reproducible example
X = data.table(id = sample.int(10,30, replace=TRUE), a = runif(30))
Y = data.table(id = seq(1,5))
setkey(X,id)
setkey(Y,id)
# this statement does the work
result <- X[,list(b=sum(a)),keyby=id][Y]
result
# id b
# 1: 1 0.7275461
# 2: 2 0.8918575
# 3: 3 3.0622883
# 4: 4 2.9098465
# 5: 5 0.7893562
This might be faster, as it subsets X first.
result.2 <- X[Y][,list(b=sum(a)),by=id]
identical(result, result.2)
# [1] TRUE
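Which order wins depends on the relative sizes of X and Y and on how many rows the join discards. A rough timing sketch, assuming the microbenchmark package is installed:
library(microbenchmark)
microbenchmark(
  agg_first = X[, .(b = sum(a)), keyby = id][Y],
  join_first = X[Y][, .(b = sum(a)), by = id]
)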

R data.table subsetting within a group and splitting a data table into two

I have the following data.table.
ts,id
1,a
2,a
3,a
4,a
5,a
6,a
7,a
1,b
2,b
3,b
4,b
I want to subset this data.table into two. The criterion is to put approximately the first half of each group (grouped by the id column) into one data.table and the remainder into another. So the expected result is two data.tables, as follows:
ts,id
1,a
2,a
3,a
4,a
1,b
2,b
and
ts,id
5,a
6,a
7,a
3,b
4,b
I tried the following,
z1 = x[, .SD[.I < .N/2], by=id]
z1
and got just the following
id ts
a 1
a 2
a 3
Somehow, .I within the .SD isn't working the way I think it should. Any help appreciated.
Thanks in advance.
.I gives the row locations with respect to the whole data.table. Thus it can't be used like that within .SD.
Something like
DT[, subset := seq_len(.N) > .N/2,by='id']
subset1 <- DT[(subset)][,subset:=NULL]
subset2 <- DT[!(subset)][,subset:=NULL]
subset1
# ts id
# 1: 4 a
# 2: 5 a
# 3: 6 a
# 4: 7 a
# 5: 3 b
# 6: 4 b
subset2
# ts id
# 1: 1 a
# 2: 2 a
# 3: 3 a
# 4: 1 b
# 5: 2 b
Should work
For more than 2 groups, you could use cut to create a factor with the appropriate number of levels
Something like
DT[, subset := cut(seq_len(.N), 3, labels= FALSE),by='id']
# you could copy to the global environment a subset for each, but this
# will not be memory efficient!
list2env(setattr(split(DT, DT[['subset']]),'names', paste0('s',1:3)), .GlobalEnv)
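If polluting the global environment is a concern, keeping the pieces in a named list may be preferable (a sketch):
pieces <- split(DT, DT$subset) # a named list with one data.table per level
pieces[["1"]] # the first subset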
Here's the corrected version of your expression:
dt[, .SD[, .SD[.I <= .N/2]], by = id]
# id ts
#1: a 1
#2: a 2
#3: a 3
#4: b 1
#5: b 2
The reason yours is not working is that .I and .N are not available in the i-expression (i.e. the first argument of [), so the parent data.table's .I and .N are used (i.e. dt's).
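An equivalent form avoids the nested .SD by doing the subsetting in j, where .N and seq_len() refer to the current group (a sketch reproducing the same output):
dt[, .SD[seq_len(.N) <= .N/2], by = id]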
