Is there a way to name columns created from the output of a function in data.table?
I want to get regression estimates from some data, so my code looks like this:
library(data.table)
DT <- as.data.table(expand.grid(x = 1:10, t = letters[1:4]))
DT[, y := rnorm(10), by = t]
# Function to get estimates and standard error for
# each regressor.
ExtractLm <- function(model) {
a <- summary(model)
return(list(rownames(a$coefficients),
a$coefficients[, 1],
a$coefficients[, 2]))
}
DT.regression <- DT[, ExtractLm(lm(x ~ y)), by = t]
The problem here is that DT.regression has ugly names (V1, V2, V3). One solution would be to modify the function ExtractLm() so that it returns a named list, but I think this is not very flexible since it wouldn't allow me to use different names in another situation. Of course, I can always rename columns after the fact, but that's not very elegant.
I tried to use DT[, c("regressor", "estimate", "se") = ExtractLm(lm(x ~ y)), by = t] and many variants, and it doesn't work. I've been re-reading the docs but I've only found that kind of syntax when creating new columns with :=, which is not applicable to this case. Is there a way to use a similar syntax to do this?
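One option that seems to fit, sketched under the assumption that ExtractLm() keeps returning an unnamed list: wrap the call in base R's setNames() inside j, so the names are chosen at the call site rather than inside the function. The set.seed() call is only there to make the sketch reproducible.

```r
library(data.table)

set.seed(1)  # assumption: added only so the example is reproducible
DT <- as.data.table(expand.grid(x = 1:10, t = letters[1:4]))
DT[, y := rnorm(10), by = t]

# Same function as in the question: returns an unnamed list of three vectors.
ExtractLm <- function(model) {
  a <- summary(model)
  list(rownames(a$coefficients),
       a$coefficients[, 1],
       a$coefficients[, 2])
}

# j may return a named list; setNames() attaches the names at the call site.
DT.regression <- DT[, setNames(ExtractLm(lm(x ~ y)),
                               c("regressor", "estimate", "se")),
                    by = t]
names(DT.regression)
```

Because the names live in the call rather than in ExtractLm(), a different call can use different names without touching the function.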
Related
I have this command in R which will scale a group of variables at once:
preds <- colnames(d[, 2:34])
d <- d[, (preds) := lapply(.SD, scale), .SDcols=preds]
I would like to modify it in two ways:
I would like it to scale each participant's scores separately. I know that if I were not doing this to all columns at once I could add by = Subject, but I'm not sure how to do this given the current command.
I would like to only mean-center, not standardize the variables.
The subsetting of column names should be more direct instead of subsetting the data
preds <- colnames(d)[2:34]
Then we mean-center by subtracting each column's mean from the columns specified in .SDcols; if grouping is needed, specify it in by= (Subject is the grouping column named in the question):
d[, (preds) := lapply(.SD, function(x) x - mean(x)),
  .SDcols = preds, by = Subject]
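A small worked example of the per-group mean-centering, on made-up data. The table d, its columns a and b, and the grouping column Subject are assumptions for illustration only.

```r
library(data.table)

# Hypothetical data: two participants, three rows each.
d <- data.table(Subject = rep(c("s1", "s2"), each = 3),
                a = c(1, 2, 3, 10, 20, 30),
                b = c(2, 4, 6, 1, 1, 4))
preds <- c("a", "b")

# Subtract the within-Subject mean from each column listed in .SDcols.
d[, (preds) := lapply(.SD, function(x) x - mean(x)),
  .SDcols = preds, by = Subject]

# After centering, every within-Subject mean should be (numerically) zero.
d[, lapply(.SD, mean), by = Subject, .SDcols = preds]
```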
I am trying to impute missing values in all numeric columns using this loop:
for(i in 1:ncol(df)){
if (is.numeric(df[,i])){
df[is.na(df[,i]), i] <- mean(df[,i], na.rm = TRUE)
}
}
When the data.table package is not attached, the code above works as it should. Once I attach data.table, the behaviour changes and I get this error:
Error in `[.data.table`(df, , i) :
j (the 2nd argument inside [...]) is a single symbol but column name 'i'
is not found. Perhaps you intended DT[,..i] or DT[,i,with=FALSE]. This
difference to data.frame is deliberate and explained in FAQ 1.1.
I tried '..i' and 'with=FALSE' everywhere but with no success. In fact, it does not even get past the first is.numeric condition.
The data.table syntax is a little different in such a case. You can do it as follows:
num_cols <- names(df)[sapply(df, is.numeric)]
for(col in num_cols) {
set(df, i = which(is.na(df[[col]])), j = col, value = mean(df[[col]], na.rm=TRUE))
}
Or, if you want to keep using your existing loop, you can just turn the data back to data.frame using
setDF(df)
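If you would rather keep looping over column positions, a minimal sketch (on made-up data) is to extract with [[i]], which behaves the same for data.frame and data.table, and to assign with set(), which accepts a column number in j. The table and its column names below are assumptions for illustration.

```r
library(data.table)

# Hypothetical data: one character column, two numeric columns with NAs.
df <- data.table(id = c("a", "b", "c"),
                 v1 = c(1, NA, 3),
                 v2 = c(NA, 4, 6))

for (i in seq_along(df)) {
  # df[[i]] extracts column i as a vector, unlike df[, i] on a data.table.
  if (is.numeric(df[[i]])) {
    set(df,
        i = which(is.na(df[[i]])),
        j = i,
        value = mean(df[[i]], na.rm = TRUE))
  }
}
df
```

Note that the columns here are doubles; imputing a fractional mean into an integer column would need a type conversion first.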
An alternative answer to this question, which I came up with while facing a similar problem on a large scale: one might be interested in avoiding for loops by using the [.data.table method.
DF[i, j, by, on, ...]
First we'll create a function that can perform the imputation
impute_na <- function(x, val = mean, ...){
if(!is.numeric(x))return(x)
na <- is.na(x)
if(is.function(val))
val <- val(x[!na])
if(!is.numeric(val)||length(val)>1)
stop("'val' needs to be either a function or a single numeric value!")
x[na] <- val
x
}
To perform the imputation on the data frame, one could create and evaluate an expression in the data.table environment, but for simplicity of example here we'll overwrite using <-
DF <- DF[, lapply(.SD, impute_na)]
This imputes the mean across all numeric columns and keeps any non-numeric columns as is. If we wish to impute another value (like... 42 or whatever), or we have a grouping variable and want the mean computed within each group, this can be included as well:
DF <- DF[, lapply(.SD, impute_na, val = 42)]
DF <- DF[, lapply(.SD, impute_na), by = group]
Which would impute 42, and the within-group mean respectively.
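To make the three variants concrete, here is a small sketch on made-up data. The table DF and its columns group, v and label are assumptions for illustration; impute_na() is the function defined above, repeated so the block runs on its own.

```r
library(data.table)

impute_na <- function(x, val = mean, ...) {
  if (!is.numeric(x)) return(x)
  na <- is.na(x)
  if (is.function(val))
    val <- val(x[!na])
  if (!is.numeric(val) || length(val) > 1)
    stop("'val' needs to be either a function or a single numeric value!")
  x[na] <- val
  x
}

# Hypothetical data: one numeric column with NAs, one character column.
DF <- data.table(group = c("a", "a", "b", "b"),
                 v     = c(1, NA, NA, 5),
                 label = c("p", "q", "r", "s"))

overall  <- DF[, lapply(.SD, impute_na)]              # overall mean of v
fixed    <- DF[, lapply(.SD, impute_na, val = 42)]    # constant value
by_group <- DF[, lapply(.SD, impute_na), by = group]  # within-group mean
```

Each call returns a new table, so DF itself is left unchanged until it is overwritten.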
When writing functions, the following function would work if it is given a data.table by name:
myDelta <- function(DT, col.a = "Sepal.Length", baseline = 5){
DT[, delta := get(col.a) - baseline]
return(DT[])
}
It could be called like this:
library(data.table)
irisDT <- data.table(iris)
myDelta(irisDT)
However this has a few problems:
Assigning the output to a new object will work, but the original is modified in place, so this can be an awkward side effect
I assume (though I haven't tested) that this is not making the best use of data.table's fancy fastness
This is not the 'data.table way', which would be more like irisDT[, myDelta()], but because it expects a DT argument which is a data.table, I end up repeating myself by writing irisDT[, myDelta(irisDT)].
Explicitly, I would like to know:
What I am missing about writing functions which allows them to inherit from the data.table object they are called in without the data.table object having to be provided from the function arguments
Additionally I am curious about:
What best practice would be for writing a function which can be called from inside, or outside, a data.table object in this kind of use case, where the goal is to calculate an output column from existing columns in the object. Do you write for just one or the other?
I may have this entirely backwards though, if so please let me know.
You apply a function on a subset of the data.table selected by [i, j, by, .SDcols]. Example:
myDelta2 <- function(x, baseline = 5) {
  # use the baseline argument rather than a hard-coded 5
  return(x - baseline)
}
library(data.table)
irisDT <- data.table(iris)
irisDT[, lapply(.SD, myDelta2), .SDcols = c("Sepal.Length", "Sepal.Width")]
Of course, this can also be written simply as:
irisDT[, .SD - 5, .SDcols = c("Sepal.Length", "Sepal.Width")]
or, in place:
irisDT[, c(paste0("delta", c("Sepal.Length", "Sepal.Width"))) := .SD - 5, .SDcols = c("Sepal.Length", "Sepal.Width")]
Suggest you check out this excellent resource
PS: if you are wondering about .SD then read this
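On the side-effect point raised in the question, a possible sketch: keep the computation in a plain vector function that can be called inside j, and, when a table-in/table-out wrapper is wanted, work on a copy() so the caller's table is not modified by reference. The names addDelta and irisDT2 are hypothetical, introduced only for this example.

```r
library(data.table)

# Plain vector function: knows nothing about data.table.
myDelta2 <- function(x, baseline = 5) x - baseline

# Hypothetical wrapper: copies first, so the input is left untouched.
addDelta <- function(DT, col = "Sepal.Length", baseline = 5) {
  out <- copy(DT)
  out[, delta := myDelta2(get(col), baseline)]
  out[]
}

irisDT <- data.table(iris)
res <- addDelta(irisDT)          # new table with a delta column

# The same vector function also works directly inside j, by reference.
irisDT2 <- data.table(iris)
irisDT2[, delta := myDelta2(Sepal.Length)]
```

The trade-off is memory: copy() duplicates the whole table, which is exactly what := avoids, so the wrapper is mainly useful when the table is small or the side effect is unacceptable.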
How can one, having a data.table with mostly numeric values, transform just a subset of columns and put them back to the original data table? Generally, I don't want to add any summary statistic as a separate column, just exchange the transformed ones.
Assume we have a DT. It has 1 column with names and 10 columns with numeric values. I am interested in using "scale" function of base R for each row of that data table, but only applied to those 10 numeric columns.
And to expand on this. What if I have a data table with more columns and I need to use column names to tell the scale function on which datapoints to apply the function?
With regular data.frame I would just do:
df[,grep("keyword",colnames(df))] <- t(apply(df[,grep("keyword",colnames(df))],1,scale))
I know this looks cumbersome but always worked for me. However, I can't figure out a simple way to do it in data.tables.
I would imagine something like this to work for data.tables:
dt[,grep("keyword",colnames(dt)) := scale(grep("keyword",colnames(dt)),center=F)]
But it doesn't.
EDIT:
Another example, updating columns with their per-row-scaled versions (dt is a data.table object):
dt[,grep("keyword",colnames(dt),value=T) := as.data.table(t(apply(dt[,grep("keyword",colnames(dt)),with=F],1,scale)))]
Too bad it needs the "as.data.table" part inside, as the transposed value from apply function is a matrix. Maybe data.table should automatically coerce matrices into data.tables upon updating of columns?
If what you need is really to scale by row, you can try doing it in 2 steps:
# compute mean/sd:
mean_sd <- DT[, .(mean(unlist(.SD)), sd(unlist(.SD))), by=1:nrow(DT), .SDcols=grep("keyword",colnames(DT))]
# scale
DT[, grep("keyword",colnames(DT), value=TRUE) := lapply(.SD, function(x) (x-mean_sd$V1)/mean_sd$V2), .SDcols=grep("keyword",colnames(DT))]
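As a cross-check of row-wise scaling, here is a small sketch on made-up data that goes through a matrix instead of a helper table. It relies on base R recycling: subtracting or dividing a matrix by a vector of length nrow applies element i to row i. The table and its column names are assumptions for illustration.

```r
library(data.table)

# Hypothetical data: one name column and three "keyword" columns.
DT <- data.table(name = c("a", "b"),
                 keyword1 = c(1, 2),
                 keyword2 = c(2, 4),
                 keyword3 = c(3, 6))

cols <- grep("keyword", colnames(DT), value = TRUE)
m <- as.matrix(DT[, .SD, .SDcols = cols])

# Center and scale each row; vectors of length nrow(m) recycle down columns,
# so row i is centered by rowMeans(m)[i] and divided by its own sd.
scaled <- (m - rowMeans(m)) / apply(m, 1, sd)

DT[, (cols) := as.data.table(scaled)]
DT
```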
PART 1: The one line solution you requested:
# First let's take a look at the data in the columns:
DT[, .SD, .SDcols = grep("keyword", colnames(DT))]
One-line Solution Version 1: Use magrittr and the pipe operator:
DT[, (grep("keyword", colnames(DT))) := (lapply(.SD, . %>% scale(., center = FALSE))),
   .SDcols = grep("keyword", colnames(DT))]
One-line Solution Version 2: Explicitly defines the function for the lapply:
DT[, (grep("keyword", colnames(DT))) :=
     (lapply(.SD, function(x){scale(x, center = FALSE)})),
   .SDcols = grep("keyword", colnames(DT))]
Modification - If you want to do it by group, just use the by =
DT[ , (grep("keyword", colnames(DT))) :=
      (lapply(.SD, function(x){scale(x, center = FALSE)}))
    , .SDcols = grep("keyword", colnames(DT))
    , by = Grouping.Variable]
You can verify:
# Verify that the columns have updated values:
DT[, .SD, .SDcols = grep("keyword", colnames(DT))]
PART 2: A Step-by-Step Solution: (more general and easier to follow)
The above solution works clearly for the narrow example given.
As a public service, I am posting this for anyone that is still searching for a way that
feels a bit less condensed;
easier to understand;
more general, in the sense that you can apply any function you wish without having to compute the values into a separate data table first (which, n.b. does work perfectly here)
Here's the step-by-step way of doing the same:
Get the data into data.table format:
# You get a data.table called DT
DT <- as.data.table(df)
Then, Handle the Column Names:
# Get the vector of column names (value = TRUE returns names, not positions)
Reference.Cols <- grep("keyword", colnames(df), value = TRUE)
# FOR PEOPLE who want to store both transformed and untransformed values.
# Create new column names
Reference.Cols.normalized <- paste0(Reference.Cols, ".normalized")
Define the function you want to apply
#Define the function you wish to apply
# Where, normalize is just a function as defined in the question:
normalize <- function(X,
X.mean = mean(X, na.rm = TRUE),
X.sd = sd(X, na.rm = TRUE))
{
X <- (X - X.mean) / X.sd
return(X)
}
After that, it is trivial in data.table syntax:
# Voila, the newly created set of columns that contain the transformed values
DT[, (Reference.Cols.normalized) := lapply(.SD, normalize), .SDcols = Reference.Cols]
Verify:
new values stored in columns with names stored in:
DT[, .SD, .SDcols = Reference.Cols.normalized]
Untransformed values left unharmed
DT[, .SD, .SDcols = Reference.Cols]
Hopefully, for those of you who return to look at code after some interval, this more step-by-step / general approach can be helpful.
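Put together on a tiny made-up data frame, the steps above might look like the following sketch. The data frame df and its columns are assumptions for illustration; normalize() is the function from the step-by-step answer.

```r
library(data.table)

# Hypothetical input: an id column plus two "keyword" columns.
df <- data.frame(id = 1:4,
                 keyword_a = c(1, 2, 3, 4),
                 keyword_b = c(10, 20, 30, 40))
DT <- as.data.table(df)

# Column names to transform, and names for the new columns.
Reference.Cols <- grep("keyword", colnames(DT), value = TRUE)
Reference.Cols.normalized <- paste0(Reference.Cols, ".normalized")

normalize <- function(X,
                      X.mean = mean(X, na.rm = TRUE),
                      X.sd = sd(X, na.rm = TRUE)) {
  (X - X.mean) / X.sd
}

# Add the normalized columns by reference; the originals stay as they were.
DT[, (Reference.Cols.normalized) := lapply(.SD, normalize),
   .SDcols = Reference.Cols]
DT
```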
Data tables in R have three (main) components: DT[i, j, by].
I am creating subsets of my data.table DT using the by functionality, which passes subsets of my data to j, where I can perform operations on them. Within each of the new subsets, I can specify the columns I want to use in j.
From the documentation (slightly altered by me):
DT[, lapply(.SD, mean), by=., .SDcols=...] - applies fun (=mean) to all
columns specified in .SDcols while grouping by the columns specified
in by.
This is great functionality!
I would like to know if it is possible to supply arguments to the function being used in j - in this case: mean?
The function mean can take the following inputs:
mean(x, trim = 0, na.rm = FALSE, ...)
How can I use mean within the j section AND apply, for example, na.rm = TRUE?
On a side note, I had a similar problem with the Reduce function, which applies a function to a data set recursively. The best idea I found was to create a custom version of the function to apply, something like:
my_mean <- function(Data) {
output <- mean(Data, na.rm = TRUE)
return(output)
}
then using the example above, I would perform:
DT[, lapply(.SD, my_mean), by=., .SDcols=...]
You can pass the extra arguments into lapply, which forwards them to the function:
DT = data.table(x=c(1,2,3,4,NA),y=runif(5),z=c(1,1,1,2,2))
DT[, lapply(.SD, mean,na.rm=T), by=z]
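A deterministic variant of the same idea, for checking the result by hand. The values in y are made up for illustration. The second call shows the equivalent anonymous-function form, which can be handy when several arguments are involved or when the column is not the first argument.

```r
library(data.table)

DT <- data.table(x = c(1, 2, 3, 4, NA),
                 y = c(10, 20, 30, 40, 50),
                 z = c(1, 1, 1, 2, 2))

# Extra arguments after the function name are forwarded by lapply() to mean().
res1 <- DT[, lapply(.SD, mean, na.rm = TRUE), by = z]

# Equivalent form with an anonymous function wrapping the call.
res2 <- DT[, lapply(.SD, function(col) mean(col, na.rm = TRUE)), by = z]

res1
```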