Computing statistics for variables passed on to setDT

Is there a way to pass on a variable, for which a statistic needs to be computed, to setDT?
The example below illustrates my issue. Only A yields the desired result. As I would like to turn var into a vector and pass its elements to setDT in a loop, A is not an option.
I would also prefer not to use sqldf.
col1 <- c('Group 1','Group 1','Group 2','Group 2')
col2 <- c(0.2,0.3,0.5,0.6)
col3 <- c(0.1,0.2,0.3,0.4)
X <- data.frame(col1,col2,col3)
var <- "col2"
A <- setDT(X)[, list(nbrObs = .N, average = mean(col2)), by = .(col1)]
B <- setDT(X)[, list(nbrObs = .N, average = mean(X[[var]])), by = .(col1)]
C <- setDT(X)[, list(nbrObs = .N, average = mean(var)), by = .(col1)]

We can pass on the variables by specifying them in .SDcols and then apply the function to the Subset of Data.table (.SD). If there are multiple variables, loop through .SD, i.e. lapply(.SD, mean). (As an aside, B yields the grand mean because X[[var]] refers to the entire column rather than the per-group subset, while C attempts to take the mean of the character string "col2" itself.)
setDT(X)[, list(nbrObs = .N, average = mean(.SD[[1L]])), by = .(col1), .SDcols= var]
Or another option would be convert to symbol with as.name or as.symbol and evaluate it (eval).
setDT(X)[, list(nbrObs = .N, average = mean(eval(as.name(var)))), by = .(col1)]
Or yet another option is using get to return the value.
setDT(X)[, list(nbrObs = .N, average = mean(get(var))), by = .(col1)]
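For the multi-variable case raised in the question, here is a minimal sketch (vars is a hypothetical vector of target columns): with .SDcols set to the vector, lapply() over .SD computes the statistic for every listed column within each group, so no explicit loop over separate calls is needed.
vars <- c("col2", "col3")  # hypothetical vector of columns to summarise
setDT(X)[, c(list(nbrObs = .N), lapply(.SD, mean)), by = .(col1), .SDcols = vars]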

Related

R data.table - How to modify by reference when using .SD?

So I'm new to data.table and don't understand how I can modify by reference while performing an operation on chosen columns using the .SD symbol. I have two examples.
Example 1
> DT <- data.table("group1:1" = 1, "group1:2" = 1, "group2:1" = 1)
> DT
   group1:1 group1:2 group2:1
1:        1        1        1
Let's say, for example, I simply want to choose only the columns whose names contain "group1:". I know it's pretty straightforward to just reassign the result of the operation to the same object, like so:
cols1 <- names(DT)[grep("group1:", names(DT))]
DT <- DT[, .SD, .SDcols = cols1]
From reading the data.table vignette on reference semantics, my understanding is that the above does not modify by reference, whereas a similar operation using := would. Is this accurate? If so, is there a better way to do this operation that does modify by reference? In trying to figure this out I got stuck on how to combine the .SD symbol and the := operator. I tried
DT[, c(cols1) := .SD, .SDcols = cols1]
DT[, c(cols1) := lapply(.SD,function(x)x), .SDcols = cols1]
neither of which gave the result I wanted.
Example 2
Say I want to perform a different operation, dcast, that uses .SD as input. Example data table:
> DT <- data.table(x = c(1,2,1,2), y = c("A","A","B","B"), z = 5:8)
> DT
x y z
1: 1 A 5
2: 2 A 6
3: 1 B 7
4: 2 B 8
Again, I know I can just reassign like so:
> DT <- dcast(DT, x ~ y, value.var = "z")
> DT
x A B
1: 1 5 7
2: 2 6 8
But I don't understand why the following does not work (or whether it would be preferable in some circumstances):
> DT <- data.table(x = c(1,2,1,2), y = c("A","A","B","B"), z = 5:8)
> cols <- c("x", unique(DT$y))
> DT[, cols := dcast(.SD, x ~ y, value.var = "z")]
In your example,
cols1 <- names(DT)[grep("group1:", names(DT))]
DT[, c(cols1) := .SD, .SDcols = cols1] # not this
DT[, (cols1) := .SD, .SDcols = cols1] # this will work
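Regarding Example 2: := adds or modifies columns in place without changing the number of rows, whereas dcast() returns a table of a different shape (2 rows instead of 4 here), so its result cannot be assigned back by reference; plain reassignment, DT <- dcast(DT, x ~ y, value.var = "z"), is the idiomatic way there.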
Below is another example: filling NA values with 0 in the numeric columns, selected via .SDcols, by reference.
The trick is to wrap the vector of column names in parentheses on the left of :=.
colnames = DT[, names(.SD), .SDcols = is.numeric] # column name vector
DT[, (colnames) := lapply(.SD, nafill, fill = 0), .SDcols= is.numeric]

shift() and list() inside DT don't work with rowMeans

Trying to calculate the mean over the last 4 values in a data.table. I thought the following would work:
library(data.table)
dt <- data.table(a = 1:10)
dt[, means := rowMeans(shift(a, 0:3), na.rm = TRUE)]
but it returns 'x' must be an array of at least two dimensions
so I tested with
lags <- paste0("a.lag", c(1,2,3))
dt[, (lags) := shift(a, 1:3)]
dt[, means := rowMeans(c("a", lags), na.rm = TRUE)]
same error. Surprisingly, the following works:
dt[, means := rowMeans(.SD, na.rm = TRUE), .SDcols = c("a", lags)]
Why does .SD produce a two-dimensional input here while the other attempts don't? Is it a bug or am I missing something? Using data.table 1.11.9.
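The error message is a hint: shift(a, 0:3) returns a plain list of lagged vectors and c("a", lags) is just a character vector, while rowMeans() needs a two-dimensional input, which is exactly what .SD supplies. A minimal sketch of a workaround under that reading, converting the list before taking row means:
dt <- data.table(a = 1:10)
# shift() returns a list of lagged vectors; as.data.table() turns it
# into the two-dimensional input rowMeans() expects
dt[, means := rowMeans(as.data.table(shift(a, 0:3)), na.rm = TRUE)]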

Assign by reference a list of results to a number of columns of a data.table

Imagine you have 2 distributions resulting from two simulations stored in a data.frame:
sim1 = 1:10
sim2 = 91:100
sim = data.frame(sim1, sim2)
Now we want to find the 10th and 90th percentiles of each distribution. This can be done by:
diffSim = ncol(sim)
confidenceInterval = c(0.1, 0.9)
results = lapply(1:diffSim, function(j) {
  quantile(sim[, j], confidenceInterval, names = FALSE, type = 3)
})
I would like to store these results in a data.table by assigning by reference (:=). However, I first need to get results into the appropriate shape (i.e. a data.table of 1 row and 4 columns). To do so, I subsequently apply unlist, matrix and as.data.table to results:
DT = data.table(Col1 = "Result")
DT[, c("col2", "col3", "col4", "col5") := as.data.table(matrix(unlist(results), nrow = 1))]
I don't like this at all. Is there a shorter way of doing this?
Not necessarily shorter, but everything in data.table:
library(data.table)
setDT(sim)[, .(col1 = 'Result',
               cols = paste0('col', 2:5),
               vals = unlist(lapply(.SD, quantile, probs = confidenceInterval, type = 3)))
           ][, dcast(.SD, col1 ~ cols, value.var = 'vals')]
which gives:
     col1 col2 col3 col4 col5
1: Result    1    9   91   99
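Since := also accepts a list on the right-hand side, a slightly shorter route (a sketch, not from the original answers) is to skip matrix() altogether: unlist() flattens the four quantiles and as.list() hands them to := one per column.
DT = data.table(Col1 = "Result")
DT[, paste0("col", 2:5) := as.list(unlist(results))]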

Subset by group with data.table compared to aggregate a data.table

This is a follow-up question to Subset by group with data.table, using the same data.table:
library(data.table)
bdt <- as.data.table(baseball)  # the baseball dataset ships with the plyr package
# Aggregating and losing information on other columns
dt1 <- bdt[ , .(max_g = max(g)), by = id]
# Aggregating and keeping information on other columns
dt2 <- bdt[bdt[, .I[g == max(g)], by = id]$V1]
Why do dt1 and dt2 differ in number of rows?
Isn't dt2 supposed to give the same result, just without losing the respective information in the other columns?
As #Frank pointed out:
bdt[ , .(max_g = max(g)), by = id] provides you with the maximum value, while
bdt[bdt[ , .I[g == max(g)], by = id]$V1] identifies all rows that have this maximum.
See What is the difference between arg max and max? for a mathematical explanation and try this slim version in R:
library(data.table)
bdt <- as.data.table(baseball)
dt <- bdt[id == "woodge01"][order(-g)]
dt[ , .(max = max(g)), by = id]
dt[ dt[ , .I[g == max(g)], by = id]$V1 ]
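A tiny made-up illustration of the difference: when the maximum is tied within a group, the .I[g == max(g)] form keeps every row attaining it, which is exactly how the row counts diverge.
d <- data.table(id = c("a", "a", "b"), g = c(2, 2, 1))
d[ , .(max_g = max(g)), by = id]        # one row per id
d[ d[ , .I[g == max(g)], by = id]$V1 ]  # both tied rows for id "a"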

Loop through data.table and create new columns based on some condition

I have a data.table with quite a few columns. I need to loop through them and create new columns based on some condition. Currently I am writing a separate conditional line for each column. Let me explain with an example. Consider this sample data:
set.seed(71)
DT <- data.table(town = rep(c('A','B'), each = 10),
                 tc = rep(c('C','D'), 10),
                 one = rnorm(20, 1, 1),
                 two = rnorm(20, 2, 1),
                 three = rnorm(20, 3, 1),
                 four = rnorm(20, 4, 1),
                 five = rnorm(20, 5, 2),
                 six = rnorm(20, 6, 2),
                 seven = rnorm(20, 7, 2),
                 total = rnorm(20, 28, 3))
For each of the columns from one to total, I need to create 4 new columns, i.e. mean, sd, uplimit, lowlimit for 2 sigma outlier calculation. I am doing this by -
DTnew <- DT[, as.list(unlist(lapply(.SD, function(x)
  list(mean = mean(x), sd = sd(x),
       uplimit = mean(x) + 1.96*sd(x),
       lowlimit = mean(x) - 1.96*sd(x))))),
  by = .(town, tc)]
I then merge this DTnew data.table with my DT:
DTmerge <- merge(DT, DTnew, by= c('town','tc'))
Now, to flag the outliers, I am writing a separate line of code for each variable:
DTAoutlier <- DTmerge[ ,one.Aoutlier := ifelse (one >= one.lowlimit & one <= one.uplimit,0,1)]
DTAoutlier <- DTmerge[ ,two.Aoutlier := ifelse (two >= two.lowlimit & two <= two.uplimit,0,1)]
DTAoutlier <- DTmerge[ ,three.Aoutlier := ifelse (three >= three.lowlimit & three <= three.uplimit,0,1)]
Can someone help simplify this code so that:
1. I don't have to write a separate line of code for each outlier column. In this example we have only 8 variables, but what if we had 100? Would we end up writing 100 lines of code? Can this be done using a for loop? How?
2. In general, how can we add new columns to a data.table while retaining the original columns? For example, below I take the log of columns 3 to 10. If I don't create a new DTlog it overwrites the original columns in DT. How can I retain the original columns in DT and have the new columns as well?
DTlog <- DT[,(lapply(.SD,log)),by = .(town,tc),.SDcols=3:10]
Look forward to some expert suggestions.
We can do this using :=. We subset the column names that are not the grouping variables ('nm'). Create a vector of names to assign for the new columns using outer ('nm1'). Then, we use the OP's code, unlist the output and assign (:=) it to 'nm1' to create the new columns.
nm <- names(DT)[-(1:2)]
nm1 <- c(t(outer(c("Mean", "SD", "uplimit", "lowlimit"), nm, paste, sep="_")))
DT[, (nm1) := unlist(lapply(.SD, function(x) {
       Mean = mean(x)
       SD = sd(x)
       uplimit = Mean + 1.96*SD
       lowlimit = Mean - 1.96*SD
       list(Mean, SD, uplimit, lowlimit)
     }), recursive = FALSE),
   by = .(town, tc)]
The second part of the question involves a logical comparison between columns. One option is to subset the initial columns and the 'lowlimit' and 'uplimit' columns separately and compare them (as these have the same dimensions) to get a logical matrix, which can be coerced to binary with +. Then assign it to the original dataset to create the outlier columns.
m1 <- +(DT[, nm, with = FALSE] >= DT[, paste("lowlimit", nm, sep="_"), with = FALSE] &
        DT[, nm, with = FALSE] <= DT[, paste("uplimit", nm, sep="_"), with = FALSE])
DT[, paste(nm, "Aoutlier", sep=".") := as.data.frame(m1)]
Or, instead of comparing data.tables, we can use a for loop with set (which would be more efficient):
nm2 <- paste(nm, "Aoutlier", sep=".")
DT[, (nm2) := NA_integer_]
for(j in nm){
  set(DT, i = NULL, j = paste(j, "Aoutlier", sep="."),
      value = as.integer(DT[[j]] >= DT[[paste("lowlimit", j, sep="_")]] &
                         DT[[j]] <= DT[[paste("uplimit", j, sep="_")]]))
}
The 'log' columns can also be created with :=
DT[,paste(nm, "log", sep=".") := lapply(.SD,log),by = .(town,tc),.SDcols=nm]
Your data should probably be in long format:
m = melt(DT, id=c("town","tc"))
Then just write your test once
m[, is_outlier := +(abs(value - mean(value)) > 1.96*sd(value)),
  by = .(town, tc, variable)]
I see no outliers in this data (according to the given definition of outlier):
m[, .N, by=is_outlier] # this is a handy alternative to table()
# is_outlier N
# 1: 0 160
How it works:
- melt keeps the id columns and stacks all the rest into variable (the column names) and value (the column contents).
- +x does the same thing as as.integer(x), coercing TRUE/FALSE to 1/0.
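To make the reshape concrete: the long table has one row per (original row, measured column) pair, so 20 rows times 8 measure columns give the 160 rows counted above (the numeric values depend on the rnorm draws):
str(m)
# Classes 'data.table' and 'data.frame': 160 obs. of 4 variables:
#  $ town    : chr  "A" "A" "A" "A" ...
#  $ tc      : chr  "C" "D" "C" "D" ...
#  $ variable: Factor w/ 8 levels "one","two",..: 1 1 1 1 ...
#  $ value   : num  ...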
If you really like your data in wide format, though:
vjs = setdiff(names(DT), c("town","tc"))
DT[, paste0(vjs, ".out") := lapply(.SD, function(x) +(abs(x - mean(x)) > 1.96*sd(x))),
   by = .(town, tc), .SDcols = vjs]
For completeness, it should be noted that dplyr's mutate_each provides a handy way of tackling such problems:
library(dplyr)
result <- DT %>%
  group_by(town, tc) %>%
  mutate_each(funs(mean, sd,
                   uplimit = (mean(.) + 1.96*sd(.)),
                   lowlimit = (mean(.) - 1.96*sd(.)),
                   Aoutlier = as.integer(. >= mean(.) - 1.96*sd(.) &
                                         . <= mean(.) + 1.96*sd(.))),
              -town, -tc)
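mutate_each() has since been deprecated in dplyr; a rough modern equivalent is sketched below, assuming dplyr >= 1.0.0 for across() (column names and the Aoutlier coding mirror the funs() call above):
library(dplyr)
result <- DT %>%
  group_by(town, tc) %>%
  mutate(across(one:total,
                list(mean = mean, sd = sd,
                     uplimit = ~ mean(.x) + 1.96 * sd(.x),
                     lowlimit = ~ mean(.x) - 1.96 * sd(.x),
                     # 1 when the value lies inside the 1.96-sigma band, 0 otherwise
                     Aoutlier = ~ as.integer(abs(.x - mean(.x)) <= 1.96 * sd(.x)))))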
