There are many posts which discuss applying a function over many columns when using data.table. However, I need to calculate a function which depends on many columns. As an example:
# Create a data table with 26 columns. Variable names are var1, ..., var26
library(data.table)
data.mat = matrix(sample(letters, 26*26, replace=TRUE), ncol=26)
colnames(data.mat) = paste("var", 1:26, sep="")
data.dt <- data.table(data.mat)
Now, say I would like to count the number of 'a's in columns 5, 6, 7 and 8.
I cannot see how to do this with .SDcols and end up doing:
data.dt[, numberOfAs := (var5=='a') + (var6=='a') + (var7=='a') + (var8=='a')]
which is very tedious. Is there a more sensible way to do this?
Thanks
I really suggest going through the vignettes linked here. Section 2e from the Introduction to data.table vignette explains .SD and .SDcols.
.SD is just a data.table containing the data for the current group, and .SDcols specifies which columns .SD should contain. A useful way to see its contents is to print it.
# .SD contains cols 5:8
data.dt[, print(.SD), .SDcols=5:8]
Since there is no by here, .SD contains all the rows of data.dt, corresponding to the columns specified in .SDcols.
Once you understand this, the task reduces to your knowledge of base R really. You can accomplish this in more than one way.
data.dt[, numberOfAs := rowSums(.SD == "a"), .SDcols=5:8]
Comparing all the columns in .SD to "a" returns a logical matrix; rowSums then adds up the TRUE values in each row.
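To peek at that intermediate logical matrix yourself (an illustration only, not part of the solution):
head(data.dt[, .SD, .SDcols = 5:8] == "a")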
Another way using Reduce:
data.dt[, numberOfAs := Reduce(`+`, lapply(.SD, function(x) x == "a")), .SDcols=5:8]
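Here Reduce folds `+` over the list of logical vectors returned by lapply, summing the TRUE/FALSE values elementwise. A tiny standalone illustration of that folding:
Reduce(`+`, list(c(TRUE, FALSE), c(TRUE, TRUE), c(FALSE, FALSE)))
## [1] 2 1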
I am doing a practice exercise based on the problems and solutions for data.table in R. The problem was: get the row and column positions of missing values in a data.table. The solution code uses [..., with=F][[1]]. I am not understanding this part of the code and am hoping an expert opinion can make the concept clear.
for(i in 1:NROW(DT)){
  for(j in 1:NCOL(DT)){
    curr_value <- DT[i, j, with=F][[1]]
    # ... (the rest of the loop body checks curr_value for NA)
  }
}
I can understand the first two lines, but not with=F and then [[1]].
What does with=F mean, and why is [[1]] used after it? Why the double brackets around the 1?
Generally in data.table, with = FALSE allows you to select columns named in a variable.
Consider the following minimal example,
library(data.table)
dt <- data.table(mtcars)
Let's select the following columns from dt
cols <- c(1, 7)
The following command will produce an error
dt[, cols]
Instead, you can use with = F:
dt[, cols, with = F]
From ?data.table
When with=TRUE (default), j is evaluated within the frame of the data.table;
i.e., it sees column names as if they are variables.
A shorter alternative is to use
dt[, ..cols]
See also Why does “..” work to pass column names in a character vector variable?
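As for the [[1]] part: DT[i, j, with=FALSE] returns a one-column data.table, not a bare value. Because a data.table is internally a list of columns, [[1]] extracts its first (and only) column as a plain vector. A minimal sketch:
library(data.table)
DT <- data.table(x = c(1, NA, 3))
DT[2, 1, with = FALSE]        # a 1x1 data.table
DT[2, 1, with = FALSE][[1]]   # NA, the bare value, which is.na() can test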
This is similar to Update values in data.table with values from another data.table and R data.table replacing an index of values from another data.table, except in my situation the number of variables is very large so I do not want to list them explicitly.
What I have is a large data.table (let's call it dt_original) and a smaller data.table (let's call it dt_newdata) whose IDs are a subset of the first and it has only some of the variables of the first. I would like to update the values in dt_original with the values from dt_newdata. For an added twist, I only want to update the values conditionally - in this case, only if the values in dt_newdata are larger than the corresponding values in dt_original.
For a reproducible example, here are the data. In the real world the tables are much larger:
library(data.table)
set.seed(0)
## This data.table with 20 rows and many variables is the existing data set
dt_original <- data.table(id = 1:20)
setkey(dt_original, id)
for(i in 2015:2017) {
varA <- paste0('varA_', i)
varB <- paste0('varB_', i)
varC <- paste0('varC_', i)
dt_original[, (varA) := rnorm(20)]
dt_original[, (varB) := rnorm(20)]
dt_original[, (varC) := rnorm(20)]
}
## This table with a strict subset of IDs from dt_original and only a part of
## the variables is our potential replacement data
dt_newdata <- data.table(id = sample(1:20, 3))
setkey(dt_newdata, id)
newdata_vars <- sample(names(dt_original)[-1], 4)
for(var in newdata_vars) {
dt_newdata[, (var) := rnorm(3)]
}
Here is a way of doing it using a loop and pmax, but there has to be a better way, right?
for(var in newdata_vars) {
k <- pmax(dt_newdata[, (var), with = FALSE], dt_original[id %in% dt_newdata$id, (var), with = FALSE])
dt_original[id %in% dt_newdata$id, (var) := k]
}
It seems like there should be a way using join syntax, and maybe the prefix i. and/or .SD or something like that, but nothing I've tried comes close enough to warrant repeating here.
This should work given your criteria:
dt_original[dt_newdata, names(dt_newdata) := Map(pmax, mget(names(dt_newdata)), dt_newdata)]
It joins on the IDs that match between the two data.tables and then performs an assignment using :=. Because we want to return a list, I use Map to run pmax pairwise over the columns, matched by the names of dt_newdata. Note that all names of dt_newdata must be present in dt_original.
Following Frank's comment, you can drop the first element of the Map arguments and of the column names with [-1], since the id column doesn't need to be computed. Dropping it avoids one pass of pmax and also preserves the key on id. Thanks to @brian-stamper for pointing out the key preservation in the comments.
dt_original[dt_newdata,
names(dt_newdata)[-1] := Map(pmax,
mget(names(dt_newdata)[-1]),
dt_newdata[, .SD, .SDcols=-1])]
Note that the use of [-1] assumes that the ID variable is located in the first position of dt_newdata. If it is elsewhere, you could change the index manually or use grep.
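As a quick sanity check (illustration only, reusing the objects from the example above), the updated rows of dt_original should now hold the pairwise maximum of the old and new values:
dt_original[id %in% dt_newdata$id, c("id", newdata_vars), with = FALSE]
dt_newdata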
How can one, having a data.table with mostly numeric values, transform just a subset of columns and put them back into the original data.table? Generally, I don't want to add any summary statistic as a separate column, just replace the existing columns with their transformed versions.
Assume we have a DT. It has 1 column with names and 10 columns with numeric values. I am interested in using the scale function from base R on each row of that data table, but only applied to those 10 numeric columns.
And to expand on this. What if I have a data table with more columns and I need to use column names to tell the scale function on which datapoints to apply the function?
With regular data.frame I would just do:
df[,grep("keyword",colnames(df))] <- t(apply(df[,grep("keyword",colnames(df))],1,scale))
I know this looks cumbersome but always worked for me. However, I can't figure out a simple way to do it in data.tables.
I would imagine something like this to work for data.tables:
dt[,grep("keyword",colnames(dt)) := scale(grep("keyword",colnames(dt)),center=F)]
But it doesn't.
EDIT:
Another example of updating the columns with their per-row-scaled versions:
## dt is a data.table object
dt[,grep("keyword",colnames(dt),value=T) := as.data.table(t(apply(dt[,grep("keyword",colnames(dt)),with=F],1,scale)))]
Too bad it needs the "as.data.table" part inside, as the transposed value from apply function is a matrix. Maybe data.table should automatically coerce matrices into data.tables upon updating of columns?
If what you need is really to scale by row, you can try doing it in 2 steps:
# compute mean/sd:
mean_sd <- DT[, .(mean(unlist(.SD)), sd(unlist(.SD))), by=1:nrow(DT), .SDcols=grep("keyword",colnames(DT))]
# scale
DT[, grep("keyword",colnames(DT), value=TRUE) := lapply(.SD, function(x) (x-mean_sd$V1)/mean_sd$V2), .SDcols=grep("keyword",colnames(DT))]
PART 1: The one line solution you requested:
# First let's take a look at the data in the columns:
DT[, .SD, .SDcols = grep("keyword", colnames(DT))]
One-line Solution Version 1: Use magrittr and the pipe operator:
DT[, (grep("keyword", colnames(DT))) := (lapply(.SD, . %>% scale(., center = F))),
.SDcols = grep("corrupt", colnames(DT))]
One-line Solution Version 2: Explicitly defines the function for the lapply:
DT[, (grep("keyword", colnames(DT))) :=
(lapply(.SD, function(x){scale(x, center = F)})),
.SDcols = grep("corrupt", colnames(DT))]
Modification - If you want to do it by group, just use the by =
DT[ , (grep("keyword", colnames(DT))) :=
(lapply(.SD, function(x){scale(x, center = F)}))
, .SDcols = grep("corrupt", colnames(DT))
, by = Grouping.Variable]
You can verify:
# Verify that the columns have updated values:
DT[, .SD, .SDcols = grep("keyword", colnames(DT))]
PART 2: A Step-by-Step Solution: (more general and easier to follow)
The above solution works clearly for the narrow example given.
As a public service, I am posting this for anyone that is still searching for a way that
feels a bit less condensed;
is easier to understand; and
is more general, in the sense that you can apply any function you wish without having to compute the values into a separate data table first (which, n.b., does work perfectly well here).
Here's the step-by-step way of doing the same:
Get the data into Data.Table format:
# You get a data.table called DT
DT <- as.data.table(df)
Then, Handle the Column Names:
# Get the vector of matching column names
Reference.Cols <- grep("keyword", colnames(df), value = TRUE)
# FOR PEOPLE who want to store both transformed and untransformed values.
# Create new column names
Reference.Cols.normalized <- paste0(Reference.Cols, ".normalized")
Define the function you want to apply
# Define the function you wish to apply;
# here, normalize is a hand-rolled equivalent of base R's scale():
normalize <- function(X,
X.mean = mean(X, na.rm = TRUE),
X.sd = sd(X, na.rm = TRUE))
{
X <- (X - X.mean) / X.sd
return(X)
}
After that, it is trivial in Data.Table syntax:
# Voila, a newly created set of columns that contain the transformed values:
DT[, (Reference.Cols.normalized) := lapply(.SD, normalize), .SDcols = Reference.Cols]
Verify:
The new values are stored in the columns named in Reference.Cols.normalized:
DT[, .SD, .SDcols = Reference.Cols.normalized]
The untransformed values are left unharmed:
DT[, .SD, .SDcols = Reference.Cols]
Hopefully, for those of you who return to look at code after some interval, this more step-by-step / general approach can be helpful.
Having data in a data.frame, I would like to aggregate some columns (using any general function) grouping by some others, keeping the remaining ones as they are (or even omitting them). This is meant to recall the GROUP BY clause in SQL. As an example, let us assume we have
df <- data.frame(a=rnorm(4), b=rnorm(4), c=c("A", "B", "C", "A"))
and I want to sum (say) the values in column a and average (say) the values in column b, grouping by the symbols in column c. I am aware it is possible to achieve this using apply, cbind, or similar, specifying the functions you want to use, but I was wondering if there were a smarter (one-line) way (especially using the aggregate function) to do so.
Sorry but I don't follow how dealing with more than one column complicates things.
library(data.table)
dt <- data.table(df)
dt[, .(sum_a = sum(a), mean_b = mean(b)), by = c]
like this?
mapply(Vectorize(function(x, y) aggregate(
         df[, x], by = list(df[, 3]), FUN = y), SIMPLIFY = FALSE),
       1:2, c('sum', 'mean'))
I'm trying to use data.table to speed up processing of a large data.frame (300k x 60) made of several smaller merged data.frames. I'm new to data.table. The code so far is as follows
library(data.table)
a = data.table(index=1:5,a=rnorm(5,10),b=rnorm(5,10),z=rnorm(5,10))
b = data.table(index=6:10,a=rnorm(5,10),b=rnorm(5,10),c=rnorm(5,10),d=rnorm(5,10))
dt = merge(a,b,by=intersect(names(a),names(b)),all=T)
dt$category = sample(letters[1:3],10,replace=T)
and I wondered if there was a more efficient way than the following to summarize the data.
summ = dt[i=T,j=list(a=sum(a,na.rm=T),b=sum(b,na.rm=T),c=sum(c,na.rm=T),
d=sum(d,na.rm=T),z=sum(z,na.rm=T)),by=category]
I don't really want to type all 50 column calculations by hand, and an eval(paste(...)) seems clunky somehow.
I had a look at the example below, but it seems a bit complicated for my needs. Thanks.
how to summarize a data.table across multiple columns
You can use a simple lapply statement with .SD
dt[, lapply(.SD, sum, na.rm=TRUE), by=category ]
category index a b z c d
1: c 19 51.13289 48.49994 42.50884 9.535588 11.53253
2: b 9 17.34860 20.35022 10.32514 11.764105 10.53127
3: a 27 25.91616 31.12624 0.00000 29.197343 31.71285
If you only want to summarize over certain columns, you can add the .SDcols argument
# note that .SDcols also allows reordering of the columns
dt[, lapply(.SD, sum, na.rm=TRUE), by=category, .SDcols=c("a", "c", "z") ]
category a c z
1: c 51.13289 9.535588 42.50884
2: b 17.34860 11.764105 10.32514
3: a 25.91616 29.197343 0.00000
This, of course, is not limited to sum; you can use any function with lapply, including anonymous functions (i.e., it's a regular lapply statement).
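For example, an anonymous function computing each group's per-column range (a small illustration on the same dt; columns a and b are chosen because they have no missing values after the merge):
dt[, lapply(.SD, function(x) max(x, na.rm=TRUE) - min(x, na.rm=TRUE)), by=category, .SDcols=c("a", "b")]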
Lastly, there is no need to use i=T and j= <..>. Personally, I think that makes the code less readable, but it is just a style preference.
Documentation
See ?.SD, ?data.table and its .SDcols argument, and the vignette Using .SD for Data Analysis.
Also have a look at data.table FAQ 2.1.