I cannot seem to find any documentation on what exactly .EACHI does in data.table. I see a brief mention of it in the documentation:
Aggregation for a subset of known groups is particularly efficient
when passing those groups in i and setting by=.EACHI. When i is a
data.table, DT[i,j,by=.EACHI] evaluates j for the groups of DT that
each row in i joins to. We call this grouping by each i.
But what does "groups" in the context of DT mean? Is a group determined by the key that is set on DT? Is the group every distinct row that uses all the columns as the key? I fully understand how to run something like DT[i,j,by=my_grouping_variable] but am confused as to how .EACHI would work. Could someone explain please?
I've added this to the list here. And hopefully we'll be able to deliver as planned.
The reason is most likely that by=.EACHI is a recent feature (since 1.9.4), but what it does isn't. Let me explain with an example. Suppose we have two data.tables X and Y:
X = data.table(x = c(1,1,1,2,2,5,6), y = 1:7, key = "x")
Y = data.table(x = c(2,6), z = letters[2:1], key = "x")
We know that we can join by doing X[Y]. This is similar to a subset operation, but using a data.table (instead of integers, row names, or logical values). For each row in Y, taking Y's key columns, it finds and returns the matching rows in X's key columns (plus the remaining columns of Y).
X[Y]
# x y z
# 1: 2 4 b
# 2: 2 5 b
# 3: 6 7 a
Now let's say that, for each row of Y's key columns (here only one key column), we'd like to get the count of matches in X. In versions of data.table < 1.9.4, we could do this by simply specifying .N in j as follows:
# < 1.9.4
X[Y, .N]
# x N
# 1: 2 2
# 2: 6 1
What this implicitly does is, in the presence of j, evaluate the j-expression on the matching rows of X for each row in Y. This was called by-without-by or implicit-by, because it's as if there's a hidden by.
The issue was that this would always perform a by operation. So, if we wanted to know the number of rows after a join, we'd have to do X[Y][, .N] (or simply nrow(X[Y]) in this case). That is, we couldn't have the j expression in the same call without triggering a by-without-by. As a result, when we did, for example, X[Y, list(z)], it evaluated list(z) using by-without-by and was therefore slightly slower.
Additionally data.table users requested this to be explicit - see this and this for more context.
Hence by=.EACHI was added. Now, when we do:
X[Y, .N]
# [1] 3
it does what it's meant to do (avoids confusion). It returns the number of rows resulting from the join.
And,
X[Y, .N, by=.EACHI]
evaluates the j-expression on the matching rows of X for each row in Y (corresponding to the value in Y's key column here). It's easier to see this by using which=TRUE.
X[.(2), which=TRUE] # [1] 4 5
X[.(6), which=TRUE] # [1] 7
If we run .N for each, then we should get 2,1.
X[Y, .N, by=.EACHI]
# x N
# 1: 2 2
# 2: 6 1
So we now have both functionalities.
Related
I want to use data.table to create a function that only keeps rows where the ID column(s) - stored as a vector of strings - are duplicated. Note that where there are multiple ID columns, I only want to keep rows where the combination of ID columns is duplicated.
library(data.table)
dt <- data.table(x = c(1:5,5), y = rep(c(1,3,5), each = 2), z = rep(1:3, 2))
get_duplicate_id_rows1 <- function(dt_in, id_str) {
  dt_in[, if (.N > 1) .SD, by = id_str]
}
get_duplicate_id_rows1(dt, c("x", "y"))
#> x y z
#> 1: 5 5 2
#> 2: 5 5 3
get_duplicate_id_rows1(dt[, .(x,y)], c("x", "y"))
#> Empty data.table (0 rows and 2 cols): x,y
As above, my first attempt works when the data table has one non-ID column. However, when all of the columns are ID columns, the resulting data table has no rows. I think this is because, as per ?data.table, .SD includes all variables of the original data table, except the grouping columns. Consequently, .SD has zero columns, which seems to be causing my issue.
get_duplicate_id_rows2 <- function(dt_in, id_str) {
  dt_in[, if (.N > 1) .SD, by = id_str, .SDcols = names(dt_in)]
}
get_duplicate_id_rows2(dt, c("x", "y"))
#> x y x y z
#> 1: 5 5 5 5 2
#> 2: 5 5 5 5 3
get_duplicate_id_rows2(dt[, .(x,y)], c("x", "y"))
#> x y x y
#> 1: 5 5 5 5
#> 2: 5 5 5 5
My second attempt tries to circumvent my issues with my first attempt by using .SDcols. This does resolve the issue where all the columns in my data table are ID columns. However, here the column names in id_str are duplicated.
I think this is because one set of column names comes from the by argument and the other set of column names comes from .SDcols, although I'm not certain about this, because in my first attempt, the resultant data table had zero rows, not zero columns.
Consequently, I'd love to understand what's going on here, and what the most efficient solution to my problem is - particularly for large data sets, which is why I'm moving from tidyverse to data.table.
Created on 2020-04-09 by the reprex package (v0.3.0)
We can use .I to get the row indices of groups with a frequency count greater than 1, extract that column, and subset the data.table:
dt[dt[, .I[.N > 1], .(x, y)]$V1]
NOTE: This should be faster than the .SD approach.
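If you want this wrapped in the function from the question, here is a minimal sketch based on the .I approach (the name get_duplicate_id_rows3 is mine; id_str is a character vector of column names, which by accepts directly):
get_duplicate_id_rows3 <- function(dt_in, id_str) {
  # .I gives the original row numbers; keep those belonging to groups with .N > 1
  dt_in[dt_in[, .I[.N > 1], by = id_str]$V1]
}
get_duplicate_id_rows3(dt, c("x", "y"))
#>    x y z
#> 1: 5 5 2
#> 2: 5 5 3
get_duplicate_id_rows3(dt[, .(x, y)], c("x", "y"))
#>    x y
#> 1: 5 5
#> 2: 5 5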
Here is another option:
dt[dt[rowid(x, y) > 1], on=.(x, y), .SD]
In your example, your explanation for the 0-row result is correct. As the grouping columns are used for grouping, their values are identical within each group and can be accessed via .BY, and hence .SD need not contain these columns (to prevent duplication).
By default, when by is used, these are also returned as the leftmost columns in the output, hence in get_duplicate_id_rows2, you see x, y and then columns from .SD as specified in .SDcols.
Lastly, regarding efficiency, you can time the various options posted here using microbenchmark with your actual dataset and share your results.
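For instance, a rough sketch of such a timing run on the toy dt from the question (the microbenchmark package is assumed to be installed; with your real data, substitute your actual table and ID columns):
library(microbenchmark)
microbenchmark(
  sd_based   = dt[, if (.N > 1) .SD, by = c("x", "y")],
  i_based    = dt[dt[, .I[.N > 1], by = .(x, y)]$V1],
  join_based = dt[dt[rowid(x, y) > 1], on = .(x, y)],
  times = 100L
)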
I have a data.table and need to know the index of the row containing a minimal value under a given condition. Simple example:
dt <- data.table(i=11:13, val=21:23)
# i val
# 1: 11 21
# 2: 12 22
# 3: 13 23
Now, suppose I'd like to know in which row val is minimal under the condition i>=12, which is 2 in this case.
What didn't work:
dt[i>=12, which.min(val)]
# [1] 1
returns 1, because within dt[i>=12] it is the first row.
Also
dt[i>=12, .I[which.min(val)]]
# [1] 1
returned 1, because .I is only supposed to be used with grouping.
What did work:
To apply .I correctly, I added a grouping column:
dt[i>=12, g:=TRUE]
dt[i>=12, .I[which.min(val)], by=g][, V1]
# [1] 2
Note that g is NA for i<12, so which.min excludes that group from the result.
But, this requires extra computational power to add the column and perform the grouping. My productive data.table has several millions of rows and I have to find the minimum very often, so I'd like to avoid any extra computations.
Do you have any idea, how to efficiently solve this?
But, this requires extra computational power to add the column and perform the grouping.
So, keep the data sorted by val if it's so important:
setorder(dt, val)
dt[.(i_min = 12), on=.(i >= i_min), mult="first", which = TRUE]
# 2
This can also be extended to check more threshold i values. Just give a vector in i_min =:
dt[.(i_min = 9:14), on=.(i >= i_min), mult="first", which = TRUE]
# [1] 1 1 1 2 3 NA
How it works
x[i, on=, ...] is the syntax for a join.
i can be another table or equivalently a list of equal-length vectors.
.() is a shorthand for list().
on= can have inequalities for a "non-equi join".
mult= can determine what happens when a row of i has more than one match in x.
which=TRUE will return row numbers of x instead of the full joined table.
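For contrast, a small sketch of what the same query returns when which=TRUE is dropped (the joined row itself rather than its row number):
dt[.(i_min = 12), on=.(i >= i_min), mult="first"]
# one row: the first match in dt's (val-sorted) order, i.e. the row with val == 22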
You can use the fact that which.min will ignore NA values to "mask" the values you don't want to consider:
dt[,which.min(ifelse(i>=12, val, NA))]
As a simple example of this behavior, which.min(c(NA, 2, 1)) returns 3, because the 3rd element is the min among all the non-NA values.
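Since the question mentions having to find this minimum very often, the masking trick can be wrapped in a small helper (a sketch; the function name is mine):
min_row_where <- function(val, cond) which.min(ifelse(cond, val, NA))
dt[, min_row_where(val, i >= 12)]
# [1] 2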
This seems like such an obvious question but I feel like I am doing it wrong. I have a vector of strings, and I just want to find the matching row indices in a data.table. The data.table is keyed by the column I want to match against, so, I think I should be able to use a binary search to find the matching indices.
Example to follow:
Here, I have a data.table, keyed by column c2 and a vector of strings, new_dat, for which I would like to find the row indices.
library(data.table)
## Example data, a keyed data.table
dat <- data.table(c1=1:10, c2=letters[1:10], key='c2')
## Match at some indices (keyed column so should be binary search?)
new_dat <- c('d', 'j')
## This doesn't feel right -- I don't think this is taking advantage of the
## data.table ordering at all
## Tried some dumb stuff like dat[match(new_dat, c2, 0L), .I]
dat[match(new_dat, c2, 0L), ] # only want the index of the matches
# c1 c2
# 1: 4 d
# 2: 10 j
## So, this is the desired result,
## but this is just doing ordinary linear search (I say w/o actually looking at the code)
match(new_dat, dat[['c2']], 0L)
# [1] 4 10
Edit
I just realized I could do,
dat[, ind := 1:.N][match(new_dat, c2, 0L), ind]
to get the indices, but this still doesn't solve the problem I was trying to portray.
In order to find row indices (while not grouping using the by argument), you can specify which = TRUE while performing the binary join:
options(datatable.verbose = TRUE) # Setting to TRUE so we can see the binary join being triggered
dat[new_dat, which = TRUE]
# Starting bmerge ...done in 0 secs <~~ binary join triggered
# [1] 4 10
(This will also work without creating c1 because it doesn't use that column at all)
And, if you just want to perform a normal binary join and see all the values in all the columns, you don't need to use match or create an index, just do
dat[new_dat]
# Starting bmerge ...done in 0 secs <~~ binary join triggered
# c1 c2
# 1: 4 d
# 2: 10 j
In general, as long as your data is keyed (or you are using the on argument in order to join) and new_dat is not of class integer or numeric, data.table will automatically trigger the binary join even if new_dat is passed to the i argument without being wrapped in .() or J(). However, if new_dat is one of the above-mentioned classes, data.table will try to perform row indexing instead. In that case, you will need to use dat[.(new_dat), which = TRUE] or dat[J(new_dat), which = TRUE] in order to force a bmerge instead of row indexing.
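A small sketch of that distinction, using a hypothetical table keyed on a numeric column:
num_dat <- data.table(k = c(2, 4, 6), v = c("a", "b", "c"), key = "k")
num_dat[2]                   # row indexing: returns the 2nd row (k == 4)
num_dat[.(2), which = TRUE]  # binary join on the key: returns 1 (the row where k == 2)
num_dat[J(2), which = TRUE]  # J() is an alias for .()/list(); same result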
The data.table package in R provides the option:
which: ‘TRUE’ returns the integer row numbers of ‘x’ that ‘i’
matches to.
However, I see no way of obtaining, within j, the integer row numbers of 'x' within the groups established using by.
For example, given...
DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6))
...I would like to know the indices into DT for each value of y.
The value to me is that I am using a data.table in parallel with Another Data Structure (ADS), on which I intend to perform groupwise computations based on the efficiently computed groupings of the data.table.
For example, assuming ADS is a vector with a value for each row in DT:
ADS<-sample(100,nrow(DT))
I can, as a workaround, compute the groupwise mean of ADS, grouped by DT$y, if I first add a new sequence column to the data.table.
DT[,seqNum:=seq_len(nrow(DT))]
DT[,mean(ADS[seqNum]),by=y]
Which gives the result I want at the cost of adding a new column.
I realize that in this example I can get the same answer using tapply:
tapply(ADS,DT$y,mean)
However, I will not then get the performance benefit of data.table's efficient grouping (especially when the 'by' columns are indexed).
Perhaps there is some syntax I am overlooking???
Perhaps this is an easy feature to add to data.table and I should request it (wink, wink)???
Proposed syntax: optionally set '.which' to the group indices, allowing one to write:
DT[,mean(ADS[.which]),by=y,which=TRUE]
Available since data.table 1.8.3, you can use .I in the j of a data.table to get the row indices by groups...
DT[ , list( yidx = list(.I) ) , by = y ]
# y yidx
#1: 1 1,4,7
#2: 3 2,5,8
#3: 6 3,6,9
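Applied to the ADS example from the question, a sketch (no seqNum column needed, since .I already holds each group's original row numbers):
ADS <- sample(100, nrow(DT))
DT[ , mean(ADS[.I]), by = y]   # groupwise mean of the external vector ADS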
A keyed data.table will be sorted so that groups are stored in contiguous blocks. In that case, you could use .N to extract the group-wise indexing information:
DT <- data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6))
setkey(DT, y)
ii <- DT[,.N, by=y]
ii[, start := cumsum(N) - N + 1][, end := cumsum(N)][, N := NULL]
# y start end
# 1: 1 1 3
# 2: 3 4 6
# 3: 6 7 9
(Personally, I'd probably just add an indexing column like your suggested seqNum. Seems simpler, I don't think it will affect performance too much unless you are really pushing the limits.)
.SD looks useful but I do not really know what I am doing with it. What does it stand for? Why is there a preceding period (full stop)? What is happening when I use it?
I read:
.SD is a data.table containing the subset of x's data for each group, excluding the group column(s). It can be used when grouping by i, when grouping by by, keyed by, and ad hoc by
Does that mean that the daughter data.tables are held in memory for the next operation?
.SD stands for something like "Subset of Data.table". There's no significance to the initial ".", except that it makes it even more unlikely that there will be a clash with a user-defined column name.
If this is your data.table:
DT = data.table(x=rep(c("a","b","c"),each=2), y=c(1,3), v=1:6)
setkey(DT, y)
DT
# x y v
# 1: a 1 1
# 2: b 1 3
# 3: c 1 5
# 4: a 3 2
# 5: b 3 4
# 6: c 3 6
Doing this may help you see what .SD is:
DT[ , .SD[ , paste(x, v, sep="", collapse="_")], by=y]
# y V1
# 1: 1 a1_b3_c5
# 2: 3 a2_b4_c6
Basically, the by=y statement breaks the original data.table into these two sub-data.tables
DT[ , print(.SD), by=y]
# <1st sub-data.table, called '.SD' while it's being operated on>
# x v
# 1: a 1
# 2: b 3
# 3: c 5
# <2nd sub-data.table, ALSO called '.SD' while it's being operated on>
# x v
# 1: a 2
# 2: b 4
# 3: c 6
# <final output, since print() doesn't return anything>
# Empty data.table (0 rows) of 1 col: y
and operates on them in turn.
While it is operating on either one, it lets you refer to the current sub-data.table by using the nick-name/handle/symbol .SD. That's very handy, as you can access and operate on the columns just as if you were sitting at the command line working with a single data.table called .SD ... except that here, data.table will carry out those operations on every single sub-data.table defined by combinations of the key, "pasting" them back together and returning the results in a single data.table!
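For instance, one very common pattern that leans on exactly this machinery is applying the same function to every column of each sub-data.table and letting data.table paste the per-group results back together:
DT[ , lapply(.SD, paste, collapse="_"), by=y]
#    y     x     v
# 1: 1 a_b_c 1_3_5
# 2: 3 a_b_c 2_4_6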
Edit:
Given how well-received this answer was, I've converted it into a package vignette, now available here.
Given how often this comes up, I think this warrants a bit more exposition, beyond the helpful answer given by Josh O'Brien above.
In addition to the Subset of the Data acronym usually cited/created by Josh, I think it's also helpful to consider the "S" to stand for "Selfsame" or "Self-reference" -- .SD is in its most basic guise a reflexive reference to the data.table itself -- as we'll see in examples below, this is particularly helpful for chaining together "queries" (extractions/subsets/etc using [). In particular, this also means that .SD is itself a data.table (with the caveat that it does not allow assignment with :=).
The simpler usage of .SD is for column subsetting (i.e., when .SDcols is specified); I think this version is much more straightforward to understand, so we'll cover that first below. The interpretation of .SD in its second usage, grouping scenarios (i.e., when by = or keyby = is specified), is slightly different, conceptually (though at core it's the same, since, after all, a non-grouped operation is an edge case of grouping with just one group).
Here are some illustrative examples and some other examples of usages that I myself implement often:
Loading Lahman Data
To give this a more real-world feel, rather than making up data, let's load some data sets about baseball from Lahman:
library(data.table)
library(magrittr) # some piping can be beautiful
library(Lahman)
Teams = as.data.table(Teams)
# *I'm selectively suppressing the printed output of tables here*
Teams
Pitching = as.data.table(Pitching)
# subset for conciseness
Pitching = Pitching[ , .(playerID, yearID, teamID, W, L, G, ERA)]
Pitching
Naked .SD
To illustrate what I mean about the reflexive nature of .SD, consider its most banal usage:
Pitching[ , .SD]
# playerID yearID teamID W L G ERA
# 1: bechtge01 1871 PH1 1 2 3 7.96
# 2: brainas01 1871 WS3 12 15 30 4.50
# 3: fergubo01 1871 NY2 0 0 1 27.00
# 4: fishech01 1871 RC1 4 16 24 4.35
# 5: fleetfr01 1871 NY2 0 1 1 10.00
# ---
# 44959: zastrro01 2016 CHN 1 0 8 1.13
# 44960: zieglbr01 2016 ARI 2 3 36 2.82
# 44961: zieglbr01 2016 BOS 2 4 33 1.52
# 44962: zimmejo02 2016 DET 9 7 19 4.87
# 44963: zychto01 2016 SEA 1 0 12 3.29
That is, we've just returned Pitching, i.e., this was an overly verbose way of writing Pitching or Pitching[]:
identical(Pitching, Pitching[ , .SD])
# [1] TRUE
In terms of subsetting, .SD is still a subset of the data, it's just a trivial one (the set itself).
Column Subsetting: .SDcols
The first way to impact what .SD is is to limit the columns contained in .SD using the .SDcols argument to [:
Pitching[ , .SD, .SDcols = c('W', 'L', 'G')]
# W L G
# 1: 1 2 3
# 2: 12 15 30
# 3: 0 0 1
# 4: 4 16 24
# 5: 0 1 1
# ---
# 44959: 1 0 8
# 44960: 2 3 36
# 44961: 2 4 33
# 44962: 9 7 19
# 44963: 1 0 12
This is just for illustration and was pretty boring. But even this simple usage lends itself to a wide variety of highly beneficial / ubiquitous data manipulation operations:
Column Type Conversion
Column type conversion is a fact of life for data munging -- as of this writing, fread cannot automatically read Date or POSIXct columns, and conversions back and forth among character/factor/numeric are common. We can use .SD and .SDcols to batch-convert groups of such columns.
We notice that the following columns are stored as character in the Teams data set:
# see ?Teams for explanation; these are various IDs
# used to identify the multitude of teams from
# across the long history of baseball
fkt = c('teamIDBR', 'teamIDlahman45', 'teamIDretro')
# confirm that they're stored as `character`
Teams[ , sapply(.SD, is.character), .SDcols = fkt]
# teamIDBR teamIDlahman45 teamIDretro
# TRUE TRUE TRUE
If you're confused by the use of sapply here, note that it's the same as for base R data.frames:
setDF(Teams) # convert to data.frame for illustration
sapply(Teams[ , fkt], is.character)
# teamIDBR teamIDlahman45 teamIDretro
# TRUE TRUE TRUE
setDT(Teams) # convert back to data.table
The key to understanding this syntax is to recall that a data.table (as well as a data.frame) can be considered as a list where each element is a column -- thus, sapply/lapply applies FUN to each column and returns the result as sapply/lapply usually would (here, FUN == is.character returns a logical of length 1, so sapply returns a vector).
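A quick way to convince yourself of this list-of-columns view:
is.list(Teams)                     # TRUE -- a data.table is built on a list of columns
length(Teams) == ncol(Teams)       # TRUE -- its length is the number of columns
identical(Teams[['W']], Teams$W)   # TRUE -- a column is just a list element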
The syntax to convert these columns to factor is very similar -- simply add the := assignment operator
Teams[ , (fkt) := lapply(.SD, factor), .SDcols = fkt]
Note that we must wrap fkt in parentheses () to force R to interpret this as column names, instead of trying to assign the name fkt to the RHS.
The flexibility of .SDcols (and :=) to accept a character vector or an integer vector of column positions can also come in handy for pattern-based conversion of column names*. We could convert all factor columns to character:
fkt_idx = which(sapply(Teams, is.factor))
Teams[ , (fkt_idx) := lapply(.SD, as.character), .SDcols = fkt_idx]
And then convert all columns which contain team back to factor:
team_idx = grep('team', names(Teams), value = TRUE)
Teams[ , (team_idx) := lapply(.SD, factor), .SDcols = team_idx]
* Explicitly using column numbers (like DT[ , (1) := rnorm(.N)]) is bad practice and can lead to silently corrupted code over time if column positions change. Even implicitly using numbers can be dangerous if we don't keep smart/strict control over the ordering of when we create the numbered index and when we use it.
Controlling a Model's RHS
Varying model specification is a core feature of robust statistical analysis. Let's try and predict a pitcher's ERA (Earned Runs Average, a measure of performance) using the small set of covariates available in the Pitching table. How does the (linear) relationship between W (wins) and ERA vary depending on which other covariates are included in the specification?
Here's a short script leveraging the power of .SD which explores this question:
# this generates a list of the 2^k possible extra variables
# for models of the form ERA ~ G + (...)
extra_var = c('yearID', 'teamID', 'G', 'L')
models =
lapply(0L:length(extra_var), combn, x = extra_var, simplify = FALSE) %>%
unlist(recursive = FALSE)
# here are 16 visually distinct colors, taken from the list of 20 here:
# https://sashat.me/2017/01/11/list-of-20-simple-distinct-colors/
col16 = c('#e6194b', '#3cb44b', '#ffe119', '#0082c8', '#f58231', '#911eb4',
'#46f0f0', '#f032e6', '#d2f53c', '#fabebe', '#008080', '#e6beff',
'#aa6e28', '#fffac8', '#800000', '#aaffc3')
par(oma = c(2, 0, 0, 0))
sapply(models, function(rhs) {
# using ERA ~ . and data = .SD, then varying which
# columns are included in .SD allows us to perform this
# iteration over 16 models succinctly.
# coef(.)['W'] extracts the W coefficient from each model fit
Pitching[ , coef(lm(ERA ~ ., data = .SD))['W'], .SDcols = c('W', rhs)]
}) %>% barplot(names.arg = sapply(models, paste, collapse = '/'),
main = 'Wins Coefficient with Various Covariates',
col = col16, las = 2L, cex.names = .8)
The coefficient always has the expected sign (better pitchers tend to have more wins and fewer runs allowed), but the magnitude can vary substantially depending on what else we control for.
Conditional Joins
data.table syntax is beautiful for its simplicity and robustness. The syntax x[i] flexibly handles two common approaches to subsetting -- when i is a logical vector, x[i] will return those rows of x corresponding to where i is TRUE; when i is another data.table, a join is performed (in the plain form, using the keys of x and i, otherwise, when on = is specified, using matches of those columns).
This is great in general, but falls short when we wish to perform a conditional join, wherein the exact nature of the relationship among tables depends on some characteristics of the rows in one or more columns.
This example is a tad contrived, but illustrates the idea; see here (1, 2) for more.
The goal is to add a column team_performance to the Pitching table that records the team's performance (rank) of the best pitcher on each team (as measured by the lowest ERA, among pitchers with at least 6 recorded games).
# to exclude pitchers with exceptional performance in a few games,
# subset first; then define rank of pitchers within their team each year
# (in general, we should put more care into the 'ties.method')
Pitching[G > 5, rank_in_team := frank(ERA), by = .(teamID, yearID)]
Pitching[rank_in_team == 1, team_performance :=
# this should work without needing copy();
# that it doesn't appears to be a bug:
# https://github.com/Rdatatable/data.table/issues/1926
Teams[copy(.SD), Rank, on = .(teamID, yearID)]]
Note that the x[y] syntax returns nrow(y) values, which is why .SD is on the right in Teams[.SD] (since the RHS of := in this case requires nrow(Pitching[rank_in_team == 1]) values).
Grouped .SD operations
Often, we'd like to perform some operation on our data at the group level. When we specify by = (or keyby = ), the mental model for what happens when data.table processes j is to think of your data.table as being split into many component sub-data.tables, each of which corresponds to a single value of your by variable(s):
In this case, .SD is multiple in nature -- it refers to each of these sub-data.tables, one-at-a-time (slightly more accurately, the scope of .SD is a single sub-data.table). This allows us to concisely express an operation that we'd like to perform on each sub-data.table before the re-assembled result is returned to us.
This is useful in a variety of settings, the most common of which are presented here:
Group Subsetting
Let's get the most recent season of data for each team in the Lahman data. This can be done quite simply with:
# the data is already sorted by year; if it weren't
# we could do Teams[order(yearID), .SD[.N], by = teamID]
Teams[ , .SD[.N], by = teamID]
Recall that .SD is itself a data.table, and that .N refers to the total number of rows in a group (it's equal to nrow(.SD) within each group), so .SD[.N] returns the entirety of .SD for the final row associated with each teamID.
Another common version of this is to use .SD[1L] instead to get the first observation for each group.
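For example, since the data is sorted by year, a sketch of getting each team's earliest season:
Teams[ , .SD[1L], by = teamID]   # first (earliest) season on record for each team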
Group Optima
Suppose we wanted to return the best year for each team, as measured by their total number of runs scored (R; we could easily adjust this to refer to other metrics, of course). Instead of taking a fixed element from each sub-data.table, we now define the desired index dynamically as follows:
Teams[ , .SD[which.max(R)], by = teamID]
Note that this approach can of course be combined with .SDcols to return only portions of the data.table for each .SD (with the caveat that .SDcols should be fixed across the various subsets)
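For instance, a sketch returning only the year and run total of each team's best season:
Teams[ , .SD[which.max(R)], by = teamID, .SDcols = c('yearID', 'R')]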
NB: .SD[1L] is currently optimized by GForce (see also), data.table internals which massively speed up the most common grouped operations like sum or mean -- see ?GForce for more details and keep an eye on/voice support for feature improvement requests for updates on this front: 1, 2, 3, 4, 5, 6
Grouped Regression
Returning to the inquiry above regarding the relationship between ERA and W, suppose we expect this relationship to differ by team (i.e., there's a different slope for each team). We can easily re-run this regression to explore the heterogeneity in this relationship as follows (noting that the standard errors from this approach are generally incorrect -- the specification ERA ~ W*teamID will be better -- this approach is easier to read and the coefficients are OK):
# use the .N > 20 filter to exclude teams with few observations
Pitching[ , if (.N > 20) .(w_coef = coef(lm(ERA ~ W))['W']), by = teamID
][ , hist(w_coef, 20, xlab = 'Fitted Coefficient on W',
ylab = 'Number of Teams', col = 'darkgreen',
main = 'Distribution of Team-Level Win Coefficients on ERA')]
While there is a fair amount of heterogeneity, there's a distinct concentration around the observed overall value
Hopefully this has elucidated the power of .SD in facilitating beautiful, efficient code in data.table!
I did a video about this after talking with Matt Dowle about .SD, you can see it on YouTube: https://www.youtube.com/watch?v=DwEzQuYfMsI