Aggregate calculations with and without grouping variable in data.table - r

I'm producing some summary statistics at the by-group and overall levels.
(Note: the overall statistic cannot necessarily be derived from the group-level stats. A weighted average could work, but not a median.)
Thus far my workarounds use rbindlist on either summary stats or copies of the original data, as in:
library(data.table)
data(iris)
d <- data.table(iris)
# Approach 1)
rbindlist(list(d[, lapply(.SD, median), by=Species, .SDcols=c('Sepal.Length','Petal.Length')],
d[, lapply(.SD, median), .SDcols=c('Sepal.Length', 'Petal.Length')]),
fill=TRUE)
# Species Sepal.Length Petal.Length
# 1: setosa 5.0 1.50
# 2: versicolor 5.9 4.35
# 3: virginica 6.5 5.55
# 4: NA 5.8 4.35
# Approach 2)
d2 <- rbindlist(list(copy(d), copy(d[,Species:="Overall"]) ) )
d2[, lapply(.SD, median), by=Species, .SDcols=c('Sepal.Length', 'Petal.Length')]
# Species Sepal.Length Petal.Length
# 1: setosa 5.0 1.50
# 2: versicolor 5.9 4.35
# 3: virginica 6.5 5.55
# 4: Overall 5.8 4.35
The first approach seems to be faster (avoids copies).
The second approach allows me to use a label "Overall" instead of the NA fill, which is more intelligible if some records were missing the "Species" value (which in the first approach would result in two rows of NA Species.)
Are there any other solutions I should consider?

I think I normally do it like this:
cols = c('Sepal.Length','Petal.Length')
rbind(d[, lapply(.SD, median), by=Species, .SDcols=cols],
d[, lapply(.SD, median), .SDcols=cols][, Species := 'Overall'])
# Species Sepal.Length Petal.Length
#1: setosa 5.0 1.50
#2: versicolor 5.9 4.35
#3: virginica 6.5 5.55
#4: Overall 5.8 4.35

I accepted #Eddi's answer but wanted to incorporate the good comment from #Frank. This approach IMO makes the most sense.
library(data.table)
d <- data.table(iris)
cols = c('Sepal.Length','Petal.Length')
rbind(d[, lapply(.SD, median), by=Species, .SDcols=cols],
d[, c(Species = 'Overall', lapply(.SD, median) ), .SDcols=cols])
# Species Sepal.Length Petal.Length
# 1: setosa 5.0 1.50
# 2: versicolor 5.9 4.35
# 3: virginica 6.5 5.55
# 4: Overall 5.8 4.35
It may also be slightly faster (1.54 vs. 1.73 millis on microbenchmark) than applying the secondary calculation.

Related

Is there a way to specify a column name as an argument?

Let's say I want to create a function that replicates a column of choice, for the sake of an example.
testdata <- data.frame(
"diff1" = c(seq(1:10)),
"diff2" = c(seq(21:30))
)
goal <- testdata %>%
mutate(newdiff1 = diff1)
So I create a function
funtest <- function(dat,var,newvar){
dat %>%
mutate(newvar = var)
}
however,
test2 <- funtest(testdata,diff1,newdiff1)
would return an error:
Error: object 'diff1' not found
This format works
nondesiredformat <- funtest(testdata,testdata$diff1,newdiff1)
but this will cause the new variable to be always called "newvar", instead of our third argument.
is there a way to change the function so the arguments in test2 may work?
Thank you
In the function, we can use {{}} for doing the evaluation i.e. !! + enquo for unquoted variable names passed into function and for assignment, use the := instead of =
funtest <- function(dat,var,newvar){
dat %>%
mutate({{newvar}} := {{var}})
}
funtest(testdata, diff1, newdiff1)
# diff1 diff2 newdiff1
#1 1 1 1
#2 2 2 2
#3 3 3 3
#4 4 4 4
#5 5 5 5
#6 6 6 6
#7 7 7 7
#8 8 8 8
#9 9 9 9
#10 10 10 10
you can use bquote for this:
eval(bquote(
dat %>%
mutate(.(newvar) := .(var))
))
you could also update old school in your particular case
dat[[newvar]] = dat[[var]]
If you start to write functions with variable names with arguments, you might find data.table more convenient than dplyr. I recently wrote a post on the subject. Standard evaluation is easier to handle with data.table than dplyr, in my opinion.
With data.table, you have several ways to use column names as argument
Using get
You can use get that maps a name with a value in a certain scope. Here the scope is your data.table:
library(data.table)
funtest <- function(dat,var,newvar){
dat[, (newvar) := get(var)]
}
:= is an update-by-reference operator. If you want to know more about it, data.table vignettes are a good place to start. Calling the function:
dt = data.table(iris)
funtest(dt, "Species","x")[]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species x
1: 5.1 3.5 1.4 0.2 setosa setosa
2: 4.9 3.0 1.4 0.2 setosa setosa
3: 4.7 3.2 1.3 0.2 setosa setosa
4: 4.6 3.1 1.5 0.2 setosa setosa
5: 5.0 3.6 1.4 0.2 setosa setosa
---
146: 6.7 3.0 5.2 2.3 virginica virginica
147: 6.3 2.5 5.0 1.9 virginica virginica
148: 6.5 3.0 5.2 2.0 virginica virginica
149: 6.2 3.4 5.4 2.3 virginica virginica
150: 5.9 3.0 5.1 1.8 virginica virginica
Using .SD
You can also use .SD that means Subset of Data. This is more convenient when you have several variables quoted. It avoids the !!!rlang::sym necessary for dplyr.
You can find yourself making complicated computations with a very concise syntax:
df[, newcolnames := lapply(.SD, mean), by = grouping_var, .SDcols = xvars]

filter by using %like% between two columns of the data table

Hello stackoverflowers,
I wonder if I could use the %like% operator row-wise in the datatable between two columns of the same datatable.
The following reproducible example will make it more clear.
First prepare the data
library(data.table)
iris <- as.data.table(iris)
iris <- iris[seq.int(from = 1, to = 150,length.out = 5)]
iris[, Species2 := c('set', "set|vers", "setosa", "nothing" , "virginica")]
Hence the dataset looks as follows.
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species2
1: 5.1 3.5 1.4 0.2 setosa set
2: 4.9 3.6 1.4 0.1 setosa set|vers
3: 6.4 2.9 4.3 1.3 versicolor setosa
4: 6.4 2.7 5.3 1.9 virginica nothing
5: 5.9 3.0 5.1 1.8 virginica virginica
I would like to use something like the following command row-wise.
iris[Species%like%Species2]
but it does not understand that I want it row-wise. Is that possible?
The result should be the 1,2,5 rows.
One way would be to group by row:
iris[, .SD[Species %like% Species2], by = 1:5]
# : Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species2
#1: 1 5.1 3.5 1.4 0.2 setosa set
#2: 2 4.9 3.6 1.4 0.1 setosa set|vers
#3: 5 5.9 3.0 5.1 1.8 virginica virginica
Or as per #docendodiscimus 's comment, in case there are duplicate entries, you can do:
iris[, .SD[Species[1L] %like% Species2[1L]], by = .(Species, Species2)]
%like% is just a wrapper around grepl, so the pattern (right-hand side) can only be length 1. You should be seeing a warning about this.
The stringi package lets you vectorize the pattern argument.
library(stringi)
iris[stri_detect_regex(Species, Species2)]
If you like the operator style instead of the function, you can make your own:
`%vlike%` <- function(x, y) {
stri_detect_regex(x, y)
}
iris[Species %vlike% Species2]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species2
# 1: 5.1 3.5 1.4 0.2 setosa set
# 2: 4.9 3.6 1.4 0.1 setosa set|vers
# 3: 5.9 3.0 5.1 1.8 virginica virginica
The various answers have the right bits and pieces - do it by Species2:
iris[, .SD[Species %like% Species2], by = Species2]
And if that's slow - use .I instead.
You can't pass a vector to the pattern argument of %like% since it calls upon grepl/grep and these aren't vectorized. You could use mapply to call %like% for each row to get what you want:
iris[mapply(function(x,y) x %like% y, Species, Species2) ]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species2
#1: 5.1 3.5 1.4 0.2 setosa set
#2: 4.9 3.6 1.4 0.1 setosa set|vers
#3: 5.9 3.0 5.1 1.8 virginica virginica
Microbenchmark mainly for my own curiosity, but for anyone else interested:
set.seed(1)
dt <- data.table(Species = replicate(100000, paste0(sample(LETTERS, 6), collapse = "")),
Species2 = replicate(100000, paste0(sample(LETTERS, 3), collapse = "")))
microbenchmark::microbenchmark( mapply = dt[mapply(function(x,y) x %like% y, Species, Species2) ],
by_group1 = dt[, .SD[Species[1L] %like% Species2[1L]], by = .(Species, Species2)],
by_group2 = dt[, .SD[Species %like% Species2], by = 1:nrow(dt)],
str_detect = dt[stri_detect_regex(Species, Species2)],
by_species2 = dt[,.SD[Species %like% Species2], by = Species2],
by_species2I = dt[dt[, .I[Species %like% Species2], by = Species2]$V1],
times = 5)
Unit: milliseconds
expr min lq mean median uq max neval
mapply 669.9691 680.2241 700.3758 685.8262 715.8373 750.0224 5
by_group1 10906.2179 10908.0985 10951.5651 10914.7002 11009.0683 11019.7408 5
by_group2 16738.4390 16826.4793 16907.8428 16902.9490 16970.6143 17100.7324 5
str_detect 430.7768 431.1002 432.2279 431.9284 433.3488 433.9855 5
by_species2 2482.7583 2518.6858 2547.5882 2531.4913 2599.0159 2605.9899 5
by_species2I 110.1486 114.6775 115.9223 117.5270 118.5033 118.7553 5
Only ran it 5 times since the by_group* operations were so slow. Looks like #eddi's method using .I is that fastest (assuming I have his intended method correct).
Also, re-ran the benchmark using fewer groups, it seems in this case the by_species2I is still the fastest, and the other by_group* are still slowest by a lot (makes sense since the # of groups for by_group2 is always the data size and for by_group1 it's going to be close to the data size).
set.seed(1)
dt <- data.table(Species = replicate(100000, paste0(sample(LETTERS, 3), collapse = "")),
Species2 = replicate(100000, paste0(sample(LETTERS, 2), collapse = "")))
Unit: milliseconds
expr min lq mean median uq max neval
mapply 611.83085 617.60180 639.7778 638.49061 652.80619 678.15932 5
by_group1 10021.48177 10121.00419 10145.6305 10123.01354 10213.37976 10249.27339 5
by_group2 15828.21224 15997.56034 16018.9583 16066.07284 16101.40961 16101.53651 5
str_detect 416.44549 419.83585 420.6042 421.69423 421.85359 423.19194 5
by_species2 106.06793 114.02764 115.5364 117.62331 118.04524 121.91770 5
by_species2I 14.22369 14.72001 15.2137 15.24514 15.38371 16.49597 5

Nesting a SUM and a MEAN in an aggregate to get means of scores per group

I can't find a dataset similar to my problem so I changed the dataset Iris (dataset in R) to look similar - it's close enough!
data = iris
data$type = gl(5,30,150,labels=c("group1","group2","group3","group4","group5"))
data$ID = gl(30,5,150)
Then I used the following code
xtabs(Sepal.Length ~ Species + type, aggregate(Sepal.Length ~ Species + type + ID, data, mean))
which results in
type
Species group1 group2 group3 group4 group5
setosa 30.16 19.90 0.00 0.00 0.00
versicolor 0.00 12.20 35.88 11.28 0.00
virginica 0.00 0.00 0.00 26.24 39.64
My understanding is that what my code is doing is adding together Sepal.Length for each ID then taking the mean of those values by each of Species and type.
Is this correct?
If not, how would I get this?
Additionally, how would I get this if my data is such that each ID has multiple types? (can't figure out how to construct this in R)
Actually, just to be perfectly clear
What I want is a code that sums together the Sepal.Length for each ID AND type then it will take the average of those sums over all IDs and post an average Sepal.Length by type and species/
With data.table:
library(data.table)
setDT(data)
#sum of Sepal.Length for each ID AND type
data[, id_type_sum := sum(Sepal.Length), by = .(ID, type)]
# mean of this variable by type and species
data[, mean(id_type_sum), by = .(type, Species)]
# type Species V1
# 1: group1 setosa 25.13333
# 2: group2 setosa 24.87500
# 3: group2 versicolor 30.50000
# 4: group3 versicolor 29.90000
# 5: group4 versicolor 28.20000
# 6: group4 virginica 32.80000
# 7: group5 virginica 33.03333
And if you want this in table format, you can use data.table's dcast method:
library(magrittr) # for the %>% operator
data[, mean(id_type_sum), by = .(type, Species)] %>%
dcast(Species ~ type)
Result:
Species group1 group2 group3 group4 group5
1: setosa 25.13333 24.875 NA NA NA
2: versicolor NA 30.500 29.9 28.2 NA
3: virginica NA NA NA 32.8 33.03333

Why is using update on a lm inside a grouped data.table losing its model data?

Ok, this is a weird one. I suspect this is a bug inside data.table, but it would be useful if anyone can explain why this is happening - what is update doing exactly?
I'm using the list(list()) trick inside data.table to store fitted models. When you create a sequence of lm objects each for different groupings, and then update those models, the model data for all models becomes that of the last grouping. This seems like a reference is hanging around somewhere where a copy should have been made, but I can't find where and I can't reproduce this outside of lm and update.
Concrete example:
Starting with the iris data, first make the three species different sample sizes, then fit an lm model to each species, the update those models:
set.seed(3)
DT = data.table(iris)
DT = DT[rnorm(150) < 0.9]
fit = DT[, list(list(lm(Sepal.Length ~ Sepal.Width + Petal.Length))),
by = Species]
fit2 = fit[, list(list(update(V1[[1]], ~.-Sepal.Length))), by = Species]
The original data table has different numbers of each species
DT[,.N, by = Species]
# Species N
# 1: setosa 41
# 2: versicolor 39
# 3: virginica 42
And the first fit confirms thsi:
fit[, nobs(V1[[1]]), by = Species]
# Species V1
# 1: setosa 41
# 2: versicolor 39
# 3: virginica 42
But the updated second fit is showing 42 for all models
fit2[, nobs(V1[[1]]), by = Species]
# Species V1
# 1: setosa 42
# 2: versicolor 42
# 3: virginica 42
We can also look at the model attribute which contains the data used for fitting, and see that all the model are indeed using the final groups data. The question is how has this happened?
head(fit$V1[[1]]$model)
# Sepal.Length Sepal.Width Petal.Length
# 1 5.1 3.5 1.4
# 2 4.9 3.0 1.4
# 3 4.7 3.2 1.3
# 4 4.6 3.1 1.5
# 5 5.0 3.6 1.4
# 6 5.4 3.9 1.7
head(fit$V1[[3]]$model)
# Sepal.Length Sepal.Width Petal.Length
# 1 6.3 3.3 6.0
# 2 5.8 2.7 5.1
# 3 6.3 2.9 5.6
# 4 7.6 3.0 6.6
# 5 4.9 2.5 4.5
# 6 7.3 2.9 6.3
head(fit2$V1[[1]]$model)
# Sepal.Length Sepal.Width Petal.Length
# 1 6.3 3.3 6.0
# 2 5.8 2.7 5.1
# 3 6.3 2.9 5.6
# 4 7.6 3.0 6.6
# 5 4.9 2.5 4.5
# 6 7.3 2.9 6.3
head(fit2$V1[[3]]$model)
# Sepal.Length Sepal.Width Petal.Length
# 1 6.3 3.3 6.0
# 2 5.8 2.7 5.1
# 3 6.3 2.9 5.6
# 4 7.6 3.0 6.6
# 5 4.9 2.5 4.5
# 6 7.3 2.9 6.3
This is not an answer, but is too long for a comment
The .Environment for the terms component is identical for each resulting model
e1 <- attr(fit[['V1']][[1]]$terms, '.Environment')
e2 <- attr(fit[['V1']][[2]]$terms, '.Environment')
e3 <- attr(fit[['V1']][[3]]$terms, '.Environment')
identical(e1,e2)
## TRUE
identical(e2, e3)
## TRUE
It appears that data.table is using the same bit of memory (my non-technical term) for
each evaluation of j by group (which is efficient). However when update is called, it is using this to refit the model. This will contain the values from the last group.
So, if you fudge this, it will work
fit = DT[, { xx <-list2env(copy(.SD))
mymodel <-lm(Sepal.Length ~ Sepal.Width + Petal.Length)
attr(mymodel$terms, '.Environment') <- xx
list(list(mymodel))}, by= 'Species']
lfit2 <- fit[, list(list(update(V1[[1]], ~.-Sepal.Width))), by = Species]
lfit2[,lapply(V1,nobs)]
V1 V2 V3
1: 41 39 42
# using your exact diagnostic coding.
lfit2[,nobs(V1[[1]]),by = Species]
Species V1
1: setosa 41
2: versicolor 39
3: virginica 42
not a long term solution, but at least a workaround.

How can I use functions returning vectors (like fivenum) with ddply or aggregate?

I would like to split my data frame using a couple of columns and call let's say fivenum on each group.
aggregate(Petal.Width ~ Species, iris, function(x) summary(fivenum(x)))
The returned value is a data.frame with only 2 columns and the second being a matrix. How can I turn it into normal columns of a data.frame?
Update
I want something like the following with less code using fivenum
ddply(iris, .(Species), summarise,
Min = min(Petal.Width),
Q1 = quantile(Petal.Width, .25),
Med = median(Petal.Width),
Q3 = quantile(Petal.Width, .75),
Max = max(Petal.Width)
)
Here is a solution using data.table (while not specifically requested, it is an obvious compliment or replacement for aggregate or ddply. As well as being slightly long to code, repeatedly calling quantile will be inefficient, as for each call you will be sorting the data
library(data.table)
Tukeys_five <- c("Min","Q1","Med","Q3","Max")
IRIS <- data.table(iris)
# this will create the wide data.table
lengthBySpecies <- IRIS[,as.list(fivenum(Sepal.Length)), by = Species]
# and you can rename the columns from V1, ..., V5 to something nicer
setnames(lengthBySpecies, paste0('V',1:5), Tukeys_five)
lengthBySpecies
Species Min Q1 Med Q3 Max
1: setosa 4.3 4.8 5.0 5.2 5.8
2: versicolor 4.9 5.6 5.9 6.3 7.0
3: virginica 4.9 6.2 6.5 6.9 7.9
Or, using a single call to quantile using the appropriate prob argument.
IRIS[,as.list(quantile(Sepal.Length, prob = seq(0,1, by = 0.25))), by = Species]
Species 0% 25% 50% 75% 100%
1: setosa 4.3 4.800 5.0 5.2 5.8
2: versicolor 4.9 5.600 5.9 6.3 7.0
3: virginica 4.9 6.225 6.5 6.9 7.9
Note that the names of the created columns are not syntactically valid, although you could go through a similar renaming using setnames
EDIT
Interestingly, quantile will set the names of the resulting vector if you set names = TRUE, and this will copy (slow down the number crunching and consume memory - it even warns you in the help, fancy that!)
Thus, you should probably use
IRIS[,as.list(quantile(Sepal.Length, prob = seq(0,1, by = 0.25), names = FALSE)), by = Species]
Or, if you wanted to return the named list, without R copying internally
IRIS[,{quant <- as.list(quantile(Sepal.Length, prob = seq(0,1, by = 0.25), names = FALSE))
setattr(quant, 'names', Tukeys_five)
quant}, by = Species]
You can use do.call to call data.frame on each of the matrix elements recursively to get a data.frame with vector elements:
dim(do.call("data.frame",dfr))
[1] 3 7
str(do.call("data.frame",dfr))
'data.frame': 3 obs. of 7 variables:
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 2 3
$ Petal.Width.Min. : num 0.1 1 1.4
$ Petal.Width.1st.Qu.: num 0.2 1.2 1.8
$ Petal.Width.Median : num 0.2 1.3 2
$ Petal.Width.Mean : num 0.28 1.36 2
$ Petal.Width.3rd.Qu.: num 0.3 1.5 2.3
$ Petal.Width.Max. : num 0.6 1.8 2.5
As far as I know, there isn't an exact way to do what you're asking, because the function you're using (fivenum) doesn't return data in a way that can be easily bound to columns from within the 'ddply' function. This is easy to clean up, though, in a programmatic way.
Step 1: Perform the fivenum function on each 'Species' value using the 'ddply' function.
data <- ddply(iris, .(Species), summarize, value=fivenum(Petal.Width))
# Species value
# 1 setosa 0.1
# 2 setosa 0.2
# 3 setosa 0.2
# 4 setosa 0.3
# 5 setosa 0.6
# 6 versicolor 1.0
# 7 versicolor 1.2
# 8 versicolor 1.3
# 9 versicolor 1.5
# 10 versicolor 1.8
# 11 virginica 1.4
# 12 virginica 1.8
# 13 virginica 2.0
# 14 virginica 2.3
# 15 virginica 2.5
Now, the 'fivenum' function returns a list, so we end up with 5 line entries for each species. That's the part where the 'fivenum' function is fighting us.
Step 2: Add a label column. We know what Tukey's five numbers are, so we just call them out in the order that the 'fivenum' function returns them. The list will repeat until it hits the end of the data.
Tukeys_five <- c("Min","Q1","Med","Q3","Max")
data$label <- Tukeys_five
# Species value label
# 1 setosa 0.1 Min
# 2 setosa 0.2 Q1
# 3 setosa 0.2 Med
# 4 setosa 0.3 Q3
# 5 setosa 0.6 Max
# 6 versicolor 1.0 Min
# 7 versicolor 1.2 Q1
# 8 versicolor 1.3 Med
# 9 versicolor 1.5 Q3
# 10 versicolor 1.8 Max
# 11 virginica 1.4 Min
# 12 virginica 1.8 Q1
# 13 virginica 2.0 Med
# 14 virginica 2.3 Q3
# 15 virginica 2.5 Max
Step 3: With the labels in place, we can quickly cast this data into a new shape using the 'dcast' function from the 'reshape2' package.
library(reshape2)
dcast(data, Species ~ label)[,c("Species",Tukeys_five)]
# Species Min Q1 Med Q3 Max
# 1 setosa 0.1 0.2 0.2 0.3 0.6
# 2 versicolor 1.0 1.2 1.3 1.5 1.8
# 3 virginica 1.4 1.8 2.0 2.3 2.5
All that junk at the end are just specifying the column order, since the 'dcast' function automatically puts things in alphabetical order.
Hope this helps.
Update: I decided to return, because I realized there is one other option available to you. You can always bind a matrix as part of a data frame definition, so you could resolve your 'aggregate' function like so:
data <- aggregate(Petal.Width ~ Species, iris, function(x) summary(fivenum(x)))
result <- data.frame(Species=data[,1],data[,2])
# Species Min. X1st.Qu. Median Mean X3rd.Qu. Max.
# 1 setosa 0.1 0.2 0.2 0.28 0.3 0.6
# 2 versicolor 1.0 1.2 1.3 1.36 1.5 1.8
# 3 virginica 1.4 1.8 2.0 2.00 2.3 2.5
This is my solution:
ddply(iris, .(Species), summarize, value=t(fivenum(Petal.Width)))

Resources