R data.table .SD unexpected behaviour

I was trying to calculate some basic statistics on a data table and came across this (for me) unexpected behaviour.
If I select the columns using "explicit" indexes, everything works as expected, as in the following example:
library(data.table)
n <- 100; reps <- 6; n1 <- 2
df <- as.data.frame(cbind(matrix(seq_len(n*n1), ncol=n1),
matrix(sample(0:1000, n*reps, replace=TRUE), ncol=reps)))
dt <- data.table(df)
dtmean <- dt[, lapply(.SD[,c(seq(2,5))], mean, na.rm=TRUE), by=c("V1")]
but if I use
a=2
b=5
dtmean <- dt[, lapply(.SD[,c(seq(a,b))], mean, na.rm=TRUE), by=c("V1")]
the result is not what I expect (compare the previous lines).
Is this how data.table is intended to work?
So the first part of the code, for n = 10, gives
V1 V3 V4 V5 V6
1: 1 504 399 430 564
2: 2 547 294 274 700
3: 3 555 305 781 326
4: 4 144 840 983 221
5: 5 894 659 169 38
6: 6 788 289 598 433
7: 7 810 378 86 22
8: 8 848 212 701 565
9: 9 412 707 890 160
10: 10 82 580 927 607
while the second
V1 V1 V2 V3 V4
1: 1 2 3 4 5
2: 2 2 3 4 5
3: 3 2 3 4 5
4: 4 2 3 4 5
5: 5 2 3 4 5
6: 6 2 3 4 5
7: 7 2 3 4 5
8: 8 2 3 4 5
9: 9 2 3 4 5
10: 10 2 3 4 5
Shouldn't they give the same results?
The function mean does not really aggregate anything here, since V1 has all distinct values; the question is about how the indexes are selected. I don't understand why the two versions behave differently.

You should use .SDcols to control what's included in .SD in this case:
dtmean <- dt[, lapply(.SD, mean, na.rm=TRUE), by="V1", .SDcols=seq(a,b)]
To do it your way, use with=FALSE in the inner .SD call:
dtmean <- dt[, lapply(.SD[, seq(a,b), with=FALSE], mean, na.rm=TRUE), by=c("V1")]
.SD is itself a data.table, so [ has the same semantics, i.e., the issue is the same as the difference between
dt[ , seq(a,b)]
and
dt[ , seq(a,b), with=FALSE]
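To see the difference concretely, here is a minimal sketch on a toy table (toy data, not the question's dt). It also explains the constants 2 3 4 5 in the second result above: lapply() there was mapping over the plain vector 2:5, not over columns.
library(data.table)
dt <- data.table(V1 = 1:3, V2 = 4:6, V3 = 7:9, V4 = 10:12, V5 = 13:15)
a <- 2; b <- 5
# j is evaluated as an expression, so seq(a, b) just returns the vector 2:5
dt[, seq(a, b)]
#> [1] 2 3 4 5
# with = FALSE makes j be interpreted as column positions instead
dt[, seq(a, b), with = FALSE]
#>    V2 V3 V4 V5
#> 1:  4  7 10 13
#> 2:  5  8 11 14
#> 3:  6  9 12 15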
As an addendum, note that .SDcols can also be used to determine a:b inline in some cases; e.g., if a:b are just the numeric columns of the table, we can use:
dt[ , lapply(.SD, mean, na.rm=TRUE), by=V1, .SDcols=is.numeric]
Or if the columns a:b share a naming pattern, e.g.:
dt[ , lapply(.SD, mean, na.rm=TRUE), by=V1, .SDcols=patterns("ends_with_x$")]

Calculate mean of all groups except the current group

I have a data frame with two grouping variables, 'mkt' and 'mdl', and some values 'pr':
df <- data.frame(mkt = c(1,1,1,1,2,2,2,2,2),
mdl = c('a','a','b','b','b','a','b','a','b'),
pr = c(120,120,110,110,145,130,145,130, 145))
df
mkt mdl pr
1 1 a 120
2 1 a 120
3 1 b 110
4 1 b 110
5 2 b 145
6 2 a 130
7 2 b 145
8 2 a 130
9 2 b 145
Within each 'mkt', the mean 'pr' for each 'mdl' should be calculated as the mean of 'pr' of all other 'mdl' in the same 'mkt', except the current 'mdl'.
For example, for the group defined by mkt == 1 and mdl == a, the 'avgother' is calculated as the average of 'pr' for mkt == 1 (same 'mkt') and mdl == b (all other 'mdl' than the current group a).
Desired result:
# mkt mdl pr avgother
# 1 1 a 120 110
# 2 1 a 120 110
# 3 1 b 110 120
# 4 1 b 110 120
# 5 2 b 145 130
# 6 2 a 130 145
# 7 2 b 145 130
# 8 2 a 130 145
# 9 2 b 145 130
First get the average for each mkt/mdl combination; then, within each mkt, exclude the current group's value and average the remaining values.
library(dplyr)
library(purrr)
df %>%
group_by(mkt, mdl) %>%
summarise(avgother = mean(pr)) %>%
mutate(avgother = map_dbl(row_number(), ~mean(avgother[-.x]))) %>%
ungroup %>%
inner_join(df, by = c('mkt', 'mdl'))
# mkt mdl avgother pr
# <dbl> <chr> <dbl> <dbl>
#1 1 a 110 120
#2 1 a 110 120
#3 1 b 120 110
#4 1 b 120 110
#5 2 a 145 130
#6 2 a 145 130
#7 2 b 130 145
#8 2 b 130 145
#9 2 b 130 145
Using data.table, calculate the sum and length of 'pr' by 'mkt'. Then, within each mkt-mdl group, calculate the mean as (mkt sum - group sum) / (mkt length - group length). For example, for mkt == 1 and mdl == 'a': (460 - 240) / (4 - 2) = 110.
library(data.table)
setDT(df)[ , `:=`(s = sum(pr), n = .N), by = mkt]
df[ , avgother := (s - sum(pr)) / (n - .N), by = .(mkt, mdl)]
df[ , `:=`(s = NULL, n = NULL)]
# mkt mdl pr avgother
# 1: 1 a 120 110
# 2: 1 a 120 110
# 3: 1 b 110 120
# 4: 1 b 110 120
# 5: 2 b 145 130
# 6: 2 a 130 145
# 7: 2 b 145 130
# 8: 2 a 130 145
# 9: 2 b 145 130
Consider base R with multiple ave calls grouping at different levels, using the decomposed version of mean as sum / length:
df <- within(df, {
avgoth <- (ave(pr, mkt, FUN=sum) - ave(pr, mkt, mdl, FUN=sum)) /
(ave(pr, mkt, FUN=length) - ave(pr, mkt, mdl, FUN=length))
})
df
# mkt mdl pr avgoth
# 1 1 a 120 110
# 2 1 a 120 110
# 3 1 b 110 120
# 4 1 b 110 120
# 5 2 b 145 130
# 6 2 a 130 145
# 7 2 b 145 130
# 8 2 a 130 145
# 9 2 b 145 130
For the sake of completeness, here is another data.table approach which uses grouping by each i, i.e., join and aggregate simultaneously.
For demonstration, an enhanced sample dataset is used which has a third market with 3 products:
df <- data.frame(mkt = c(1,1,1,1,2,2,2,2,2,3,3,3),
mdl = c('a','a','b','b','b','a','b','a','b', letters[1:3]),
pr = c(120,120,110,110,145,130,145,130, 145, 1:3))
library(data.table)
mdt <- setDT(df)[, .(mdl, s = sum(pr), .N), by = .(mkt)]
df[mdt, on = .(mkt, mdl), avgother := (sum(pr) - s) / (.N - N), by = .EACHI][]
mkt mdl pr avgother
1: 1 a 120 110.0
2: 1 a 120 110.0
3: 1 b 110 120.0
4: 1 b 110 120.0
5: 2 b 145 130.0
6: 2 a 130 145.0
7: 2 b 145 130.0
8: 2 a 130 145.0
9: 2 b 145 130.0
10: 3 a 1 2.5
11: 3 b 2 2.0
12: 3 c 3 1.5
The temporary table mdt contains the sum and count of prices within each mkt, replicated for each product mdl within the market:
mdt
mkt mdl s N
1: 1 a 460 4
2: 1 a 460 4
3: 1 b 460 4
4: 1 b 460 4
5: 2 b 695 5
6: 2 a 695 5
7: 2 b 695 5
8: 2 a 695 5
9: 2 b 695 5
10: 3 a 6 3
11: 3 b 6 3
12: 3 c 6 3
Having mkt and mdl in mdt is what allows grouping by each i (by = .EACHI).
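As a minimal illustration of what by = .EACHI does (toy data, not the market data above): j is evaluated once for each row of i, using the rows of x that the join matches.
library(data.table)
x <- data.table(g = c(1, 1, 2, 2, 2), v = 1:5)
i <- data.table(g = c(1, 2))
# one result per row of i: rows of x with g == 1, then rows with g == 2
x[i, on = "g", .(total = sum(v)), by = .EACHI]
#>    g total
#> 1: 1     3
#> 2: 2    12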
Here is an approach which computes avgother directly by subsetting pr values which do not belong to the actual value of mdl before computing the averages.
This is quite different from the other answers posted so far, which justifies posting it as a separate answer, IMHO.
# enhanced sample dataset covering more corner cases
df <- data.frame(mkt = c(1,1,1,1,2,2,2,2,2,3,3,3,4),
mdl = c('a','a','b','b','b','a','b','a','b', letters[1:3],'d'),
pr = c(120,120,110,110,145,130,145,130, 145, 1:3, 9))
library(data.table)
setDT(df)[, avgother := sapply(mdl, function(m) mean(pr[m != mdl])), by = mkt][]
mkt mdl pr avgother
1: 1 a 120 110.0
2: 1 a 120 110.0
3: 1 b 110 120.0
4: 1 b 110 120.0
5: 2 b 145 130.0
6: 2 a 130 145.0
7: 2 b 145 130.0
8: 2 a 130 145.0
9: 2 b 145 130.0
10: 3 a 1 2.5
11: 3 b 2 2.0
12: 3 c 3 1.5
13: 4 d 9 NaN
Difference between approaches
The other answers share more or less the same approach (although implemented in different manners):
compute sums and counts of pr for each mkt
compute sums and counts of pr for each mkt and mdl
subtract mkt/mdl sums and counts from mkt sums and counts
compute avgother
This approach:
groups by mkt
loops through mdl within each mkt,
subsets pr to keep only the values which do not belong to the current mdl,
and computes mean() on that subset directly.
Caveat concerning performance: although the code is essentially a one-liner, that does not imply it is the fastest approach.
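To quantify that caveat, here is a rough microbenchmark sketch comparing the sums-and-counts approach with this direct subsetting approach on larger simulated data (illustrative only; timings will vary by machine and data shape):
library(data.table)
library(microbenchmark)
set.seed(1)
n <- 1e4
big <- data.frame(mkt = sample(1:50, n, replace = TRUE),
                  mdl = sample(letters[1:5], n, replace = TRUE),
                  pr  = runif(n, 1, 200))
dt1 <- as.data.table(big)
dt2 <- as.data.table(big)
microbenchmark(
  sums_counts = {
    dt1[, `:=`(s = sum(pr), n = .N), by = mkt]
    dt1[, avgother := (s - sum(pr)) / (n - .N), by = .(mkt, mdl)]
    dt1[, c("s", "n") := NULL]
  },
  subset_mean = dt2[, avgother := sapply(mdl, function(m) mean(pr[m != mdl])),
                    by = mkt],
  times = 10L
)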

Applying a function with more than one argument [duplicate]

I would like to apply a function that takes more than one argument to a data table.
Assume:
library(data.table)
dt <- as.data.table(matrix(c(201,202,201,201,202,202,4,6,9,2,4,5,6,9,7,3,2,1), nrow = 6, ncol = 3, byrow = FALSE))
V1 V2 V3
1: 201 4 6
2: 202 6 9
3: 201 9 7
4: 201 2 3
5: 202 4 2
6: 202 5 1
I would like to apply a function with 3 arguments. For the sake of simplicity let's take a sum of them.
Obviously, the solution is not dt[, sum := V1 + V2 + V3].
If I pass the 2nd and 3rd arguments in the following way, it does not work:
dt[,sum:=lapply(V1,function(x,y,z) x+y+z,y=V2,z=V3)]
What is the proper way of applying a function with more than one argument?
mapply() lets you loop over multiple vectors in parallel, passing the elements at each position to the function as its arguments.
dt[,sum:=mapply(function(x,y,z) x+y+z, V1, V2, V3)]
V1 V2 V3 sum
1: 201 4 6 211
2: 202 6 9 217
3: 201 9 7 217
4: 201 2 3 206
5: 202 4 2 208
6: 202 5 1 208
We can use Reduce with +
dt[, Sum := Reduce(`+`, .SD)]
dt
# V1 V2 V3 Sum
#1: 201 4 6 211
#2: 202 6 9 217
#3: 201 9 7 217
#4: 201 2 3 206
#5: 202 4 2 208
#6: 202 5 1 208
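A brief aside: for plain addition specifically, base rowSums() over .SD achieves the same thing (a sketch, assuming the columns of interest are V1:V3):
dt[, Sum := rowSums(.SD), .SDcols = V1:V3]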
If there are multiple arguments, one option is Map with do.call: create the function of interest ('f1'), specify the columns that will go in as its arguments via .SDcols, call do.call with Map and the function 'f1', then unlist the output and assign (:=) it to 'Sum'.
f1 <- function(x, y, z) x + y + z
dt[, Sum := unlist(do.call(Map, c(f=f1, unname(.SD)))), .SDcols = V1:V3]
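A note on the unname() call (my reading, not stated in the original answer): Map passes the columns on by name, so without unname() the function would receive arguments named V1, V2, V3, which do not match f1's parameters x, y, z, and the call would error with "unused arguments". Unnaming forces positional matching:
# fails: f1 receives arguments named V1, V2, V3
# dt[, Sum := unlist(do.call(Map, c(f = f1, .SD))), .SDcols = V1:V3]
# works: positional matching after unname()
dt[, Sum := unlist(do.call(Map, c(f = f1, unname(.SD)))), .SDcols = V1:V3]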

R ave or apply? Adding columns by a function with a strata variable, but fast and for multiple columns

In R:
I am not sure what the proper title for this question is, so maybe someone can help me out. It would be greatly appreciated. I'm sorry if this is called something easily searchable.
So I have a ragged array matrix (multiple UPCs):
[upc] [quantity1] [quantity2] [sum1] [sum2]
[1] 123 11 3 NA NA
[2] 123 2 1 NA ...
[3] 789 5 3 NA
[4] 456 10 6 NA
[5] 789 6 2 NA NA
I want the matrix to be summed by UPC, for example:
[upc] [quantity1] [quantity2] [sum1] [sum2]
[1] 123 11 3 13 4
[2] 123 2 1 13 4
[3] 789 5 3 11 5
[4] 456 10 6 10 6
[5] 789 6 2 11 5
Thank you for your time and help.
The trick is that it needs to be done as efficiently as possible, since it will be done many times.
If these are in a data.frame, the usual way to apply the same function within grouping variables is the aggregate function, but it doesn't behave like ave: it returns a shorter result. I suspect it's not as fast as data.table, dplyr, or Rcpp approaches, either.
aggregate( df[ , 2:3], df[1], sum)
This would be the way with a data.table:
library(data.table)
dt <- data.table(df)
setkey(dt, upc)
# To show you what the inner expression would return
dt[, lapply(.SD, sum), by="upc"]
#-----------
upc quantity1 quantity2
1: 123 13 4
2: 456 10 6
3: 789 11 5
#-----------
dt[ dt[, lapply(.SD, sum), by="upc"] ] # It is a self join operation
upc quantity1 quantity2 quantity1.1 quantity2.1
1: 123 11 3 13 4
2: 123 2 1 13 4
3: 456 10 6 10 6
4: 789 5 3 11 5
5: 789 6 2 11 5
But then, returning to the less efficient data.frame methods: to get those group sums to sit alongside the source columns, you would merge by upc:
> merge(df, aggregate( df[ , 2:3], df[1], sum), by="upc")
upc quantity1.x quantity2.x quantity1.y quantity2.y
1 123 11 3 13 4
2 123 2 1 13 4
3 456 10 6 10 6
4 789 5 3 11 5
5 789 6 2 11 5
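Since the question title mentions ave: ave() already returns a vector the same length as its input, so the per-upc sums can be added without any merge at all (a sketch using the question's data):
df <- data.frame(upc = c(123, 123, 789, 456, 789),
                 quantity1 = c(11, 2, 5, 10, 6),
                 quantity2 = c(3, 1, 3, 6, 2))
# ave() lines up one value per input row, so no merge is needed
df$sum1 <- ave(df$quantity1, df$upc, FUN = sum)
df$sum2 <- ave(df$quantity2, df$upc, FUN = sum)
df
#   upc quantity1 quantity2 sum1 sum2
# 1 123        11         3   13    4
# 2 123         2         1   13    4
# 3 789         5         3   11    5
# 4 456        10         6   10    6
# 5 789         6         2   11    5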

Sample random rows within each group in a data.table

How would you use data.table to efficiently take a sample of rows within each group in a data frame?
DT = data.table(a = sample(1:2), b = sample(1:1000,20))
DT
a b
1: 2 562
2: 1 183
3: 2 180
4: 1 874
5: 2 533
6: 1 21
7: 2 57
8: 1 20
9: 2 39
10: 1 948
11: 2 799
12: 1 893
13: 2 993
14: 1 69
15: 2 906
16: 1 347
17: 2 969
18: 1 130
19: 2 118
20: 1 732
I was thinking of something like: DT[ , sample(??, 3), by = a] that would return a sample of three rows for each "a" (the order of the returned rows isn't significant):
a b
1: 2 180
2: 2 57
3: 2 799
4: 1 69
5: 1 347
6: 1 732
Maybe something like this?
> DT[,.SD[sample(.N, min(3,.N))],by = a]
a b
1: 1 744
2: 1 497
3: 1 167
4: 2 888
5: 2 950
6: 2 343
(Thanks to Josh for the correction, below.)
I believe joran's answer can be further generalized. The details are here (How do you sample groups in a data.table with a caveat), but the solution below accounts for cases where there aren't 3 rows to sample from.
Without the min() guard, the solution errors out when it tries to sample x times from groups that have fewer than x rows; in the case below, x = 3. The version below takes that caveat into consideration. (Solution by nrussell.)
set.seed(123)
##
DT <- data.table(
a=c(1,1,1,1:15,1,1),
b=sample(1:1000,20))
##
R> DT[,.SD[sample(.N,min(.N,3))],by = a]
a b
1: 1 288
2: 1 881
3: 1 409
4: 2 937
5: 3 46
6: 4 525
7: 5 887
8: 6 548
9: 7 453
10: 8 948
11: 9 449
12: 10 670
13: 11 566
14: 12 102
15: 13 993
16: 14 243
17: 15 42
There are two subtle considerations that impact the answer to this question, and these are mentioned by Josh O'Brien and Valentin in comments. The first is that subsetting via .SD is very inefficient, and it is better to sample .I directly (see the benchmark below).
The second consideration, if we do sample from .I, is that calling sample(.I, size = 1) leads to unexpected behaviour when a group consists of a single row whose index is greater than 1. In that case, sample() behaves as if we had called sample(1:.I, size = 1), which is surely not what we want. As Valentin notes, it's better to use the construct .I[sample(.N, size = 1)] in this case.
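A tiny demonstration of that pitfall (toy data; the behaviour comes from base R's sample(), which treats a single number n as shorthand for 1:n):
library(data.table)
DT <- data.table(id = c("a", "a", "b"), x = 1:3)
# group "b" is the single row index 3, so sample(.I, size = 1)
# behaves like sample(1:3, size = 1) and can escape the group
DT[, sample(.I, size = 1), by = id]
# the safe construct: sample a position within the group, then index into .I
DT[, .I[sample(.N, size = 1)], by = id]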
As a benchmark, we build a simple 1,000 x 1 data.table and sample randomly per group. Even with such a small data.table the .I method is roughly 20x faster.
library(microbenchmark)
library(data.table)
set.seed(1L)
DT <- data.table(id = sample(1e3, 1e3, replace = TRUE))
microbenchmark(
`.I` = DT[DT[, .I[sample(.N, 1)], by = id][[2]]],
`.SD` = DT[, .SD[sample(.N, 1)], by = id]
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> .I 2.396166 2.588275 3.22504 2.794152 3.118135 19.73236 100
#> .SD 55.798177 59.152000 63.72131 61.213650 64.205399 102.26781 100
Created on 2020-12-02 by the reprex package (v0.3.0)
Inspired by this answer by David Arenburg, another method to avoid the .SD allocation would be to sample the groups, then join back onto the original data using .EACHI
DT[ DT[, sample(.N, 3), by=a], b[i.V1], on="a", by=.EACHI]
# a V1
# 1: 2 42
# 2: 2 498
# 3: 2 179
# 4: 1 469
# 5: 1 93
# 6: 1 898
where the DT[, sample(.N, 3), by=a] line gives us a sample for each group
# a V1
# 1: 1 9
# 2: 1 3
# 3: 1 2
# 4: 2 4
# 5: 2 9
# ---
so we can then use V1 to give us the b it corresponds to.
Stratified sampling (oversampling):
size <- don[y == 1, .(strata = length(iden)), by = .(y, x)]  # count of iden by stratum
table(don$x, don$y)
don <- merge(don, size[, .(x, strata)], by = "x")  # merge strata sizes (the key column must be present in both tables)
don_strata <- don[, .SD[sample(.N, strata)], by = .(y, x)]
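The snippet above is fragmentary (don, y, x, and iden are never defined in it). A self-contained sketch of the same per-stratum sampling idea, with hypothetical column names, might look like this:
library(data.table)
set.seed(42)
# hypothetical data: a binary outcome y, a stratum variable x, and an id column
don <- data.table(y = rbinom(100, 1, 0.3),
                  x = sample(letters[1:4], 100, replace = TRUE),
                  iden = 1:100)
# sample up to 5 rows within each (y, x) stratum
don_strata <- don[, .SD[sample(.N, min(.N, 5))], by = .(y, x)]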

In R, how can one trim or winsorize data by a factor

I'm trying to apply the winsor function at each level of a factor (subjects) in order to remove extreme cases. I can apply the winsor function to the entire column, but would like to do it within subject.
Subject RT
1 402
1 422
1 155
1 460
2 283
2 224
2 346
2 447
3 415
3 161
3 1
3 343
Ideally, I'd like the output to be a vector containing the same number of rows as the input, but with outliers (e.g. the second-to-last value of Subject 3) removed and replaced as per the winsor function.
You are looking for the by function (see ?by). Note that by expects a function as its third argument, and winsor() comes from the psych package:
# for example:
library(psych)
by(myDF$RT, myDF$Subject, winsor)
However, using data.table (instead of data.frame) might be better suited to your needs:
### broken down step by step:
library(data.table)
myDT <- data.table(myDF)
myDT[, winsorResult := winsor(RT), by=Subject]
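Putting that together as a runnable sketch, assuming winsor() is psych::winsor with its default trim and using the question's data:
library(data.table)
library(psych)  # provides winsor()
myDF <- data.frame(Subject = rep(1:3, each = 4),
                   RT = c(402, 422, 155, 460,
                          283, 224, 346, 447,
                          415, 161, 1, 343))
myDT <- as.data.table(myDF)
myDT[, winsorResult := winsor(RT), by = Subject]
myDT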
library(psych)
transform(dat, win = ave(RT, Subject, FUN = winsor))  # dat is the question's data frame
Subject RT win
1 1 402 402.0
2 1 422 422.0
3 1 155 303.2
4 1 460 437.2
5 2 283 283.0
6 2 224 259.4
7 2 346 346.0
8 2 447 386.4
9 3 415 371.8
10 3 161 161.0
11 3 1 97.0
12 3 343 343.0
