Summing many columns with data.table in R, remove NA [duplicate]

This question already has an answer here:
Summarizing multiple columns with data.table
(1 answer)
Closed 9 years ago.
This is really two questions I guess. I'm trying to use the data.table package to summarize a large dataset. Say my original large dataset is df1 and unfortunately df1 has 50 columns (y0... y49) that I want the sum of by 3 fields (segmentfield1, segmentfield2, segmentfield3). Is there a simpler way to do this than typing every y0...y49 column out? Related to this, is there a generic na.rm=T for the data.table instead of typing that with each sum too?
dt1 <- data.table(df1)
setkey(dt1, segmentfield1, segmentfield2, segmentfield3)
dt2 <- dt1[,list( y0=sum(y0,na.rm=T), y1=sum(y1,na.rm=T), y2=sum(y2,na.rm=T), ...
y49=sum(y49,na.rm=T) ),
by=list(segmentfield1, segmentfield2, segmentfield3)]

First, create the object variables for the names in use:
colsToSum <- names(dt1) # or whatever you need
summedNms <- paste0( "y", seq_along(colsToSum) )
If you'd like to copy it to a new data.table
dt2 <- dt1[, lapply(.SD, sum, na.rm=TRUE), .SDcols=colsToSum]
setnames(dt2, summedNms)
Alternatively, if you'd like to append the columns to the original:
dt1[, c(summedNms) := lapply(.SD, sum, na.rm=TRUE), .SDcols=colsToSum]
As far as a general na.rm process, there is not one specific to data.table, but have a look at ?na.omit and ?na.exclude
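The question asks for sums within the three segment fields, which the snippets above leave out. A minimal sketch of the grouped version follows, assuming the columns to sum are exactly those named y followed by digits; the toy table and its values are made up for illustration.

```r
library(data.table)

# Toy stand-in for dt1 (values are invented for illustration)
dt1 <- data.table(
  segmentfield1 = c("a", "a", "b", "b"),
  segmentfield2 = c("x", "x", "y", "y"),
  segmentfield3 = c(1L, 1L, 2L, 2L),
  y0 = c(1, NA, 3, 4),
  y1 = c(NA, 2, 5, NA)
)

# Assumption: the columns to sum are the ones named y0, y1, ...
colsToSum <- grep("^y[0-9]+$", names(dt1), value = TRUE)

# One sum per column and per group; na.rm = TRUE is passed on to sum() once
dt2 <- dt1[, lapply(.SD, sum, na.rm = TRUE),
           by = .(segmentfield1, segmentfield2, segmentfield3),
           .SDcols = colsToSum]
print(dt2)
```

Because lapply() forwards na.rm = TRUE to every call of sum(), it only has to be written once, however many columns are listed in .SDcols.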

Related

efficiently locf by groups in a single R data.table - FromLast

I am wondering how to efficiently locf by groups in a single R data.table from the last, i.e. filling in NA values backward from the last known value.
There is code for the forward direction (see "efficiently locf by groups in a single R data.table"), but I am looking for the opposite direction. Any idea how to adjust the code?
This is a bit of a workaround, but anyway:
first sort the data in reverse order, apply the code to replace the NAs, then sort back (arrange() and desc() below come from dplyr).
DT <- arrange(DT, desc(id))
id_change = DT[, c(TRUE, id[-1] != id[-.N])]
DT <- DT[, lapply(.SD, function(x) x[cummax(((!is.na(x)) | id_change) * .I)])]
DT <- arrange(DT, id)
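As a possible alternative for numeric columns, recent data.table versions ship nafill(), whose type = "nocb" option ("next observation carried backward") fills each NA from the next non-missing value. The sketch below assumes a grouping column called id and a numeric column called x; both names and the data are illustrative.

```r
library(data.table)

# Toy data: two groups with gaps (invented for illustration)
DT <- data.table(
  id = c(1, 1, 1, 2, 2, 2),
  x  = c(NA, 2, NA, NA, NA, 5)
)

# Fill backward within each group; an NA with no later value in its group stays NA
DT[, x_filled := nafill(x, type = "nocb"), by = id]
print(DT)
```

Here the trailing NA in group 1 is left as NA because there is no later observation in that group to carry backward.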

best way to select columns from data.table by type [duplicate]

This question already has answers here:
Select columns by class (e.g. numeric) from a data.table
(5 answers)
Closed 1 year ago.
I am looking for an elegant or efficient way to select columns in R's data.table.
Personally I value a flexible approach.
Therefore I tend to refer to columns by their characteristics rather than their names.
For example, I want to set the values of all columns to lower case.
If I include all columns in this operation, like so
dt[, lapply(.SD, tolower),.SDcols = names(dt)]
numeric and integer columns, too, will be converted to (lower case) character.
This is undesirable, and hence I first identify all character columns as follows:
char_cols <- as.character(names(dt[ , lapply(.SD, function(x) which(is.character(x)))]))
and subsequently pass char_cols to .SDcols
dt[ , lapply(.SD, tolower), .SDcols = char_cols ]
If, instead, all your columns are character (for example, to avoid type conversion issues while reading the data), I would go about it like this:
char_cols <- as.character(names(dt[ , lapply(.SD, function(x) which(all(is.na(as.numeric(x)))))]))
One should be certain however, that no column is of mixed type: i.e. contains some character strings and some numeric values.
Does anyone have a suggestion to approach this more elegantly, or more efficiently?
You can pass a logical/character vector to .SDcols.
For character columns, we can do
library(data.table)
cols <- names(Filter(is.character, dt))
dt[, (cols) := lapply(.SD, tolower), .SDcols = cols]
Alternatively, we can use sapply:
library(data.table)
cols <- names(which(sapply(dt, is.character)))
dt[, (cols) := lapply(.SD, tolower), .SDcols = cols]
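To see the effect on a concrete table, here is a small sketch with made-up data. It uses vapply() instead of sapply() only to make the logical return type explicit; the column names are illustrative.

```r
library(data.table)

# Toy table with mixed column types (invented for illustration)
dt <- data.table(
  name  = c("Alice", "BOB"),
  city  = c("Paris", "ROME"),
  n     = c(1L, 2L),
  score = c(0.5, 1.5)
)

# Names of the character columns only
cols <- names(dt)[vapply(dt, is.character, logical(1))]

# Lower-case those columns by reference; numeric columns are untouched
dt[, (cols) := lapply(.SD, tolower), .SDcols = cols]
print(dt)
```

The integer and double columns keep their types because they never enter .SD.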

data.table: transforming subset of columns with a function, row by row

How can one, having a data.table with mostly numeric values, transform just a subset of columns and put them back to the original data table? Generally, I don't want to add any summary statistic as a separate column, just exchange the transformed ones.
Assume we have a DT. It has 1 column with names and 10 columns with numeric values. I am interested in using "scale" function of base R for each row of that data table, but only applied to those 10 numeric columns.
And to expand on this. What if I have a data table with more columns and I need to use column names to tell the scale function on which datapoints to apply the function?
With regular data.frame I would just do:
df[,grep("keyword",colnames(df))] <- t(apply(df[,grep("keyword",colnames(df))],1,scale))
I know this looks cumbersome but always worked for me. However, I can't figure out a simple way to do it in data.tables.
I would imagine something like this to work for data.tables:
dt[,grep("keyword",colnames(dt)) := scale(grep("keyword",colnames(dt)),center=F)]
But it doesn't.
EDIT:
Another example, which updates the columns with their per-row-scaled versions:
# dt is a data.table object
dt[,grep("keyword",colnames(dt),value=T) := as.data.table(t(apply(dt[,grep("keyword",colnames(dt)),with=F],1,scale)))]
Too bad it needs the "as.data.table" part inside, as the transposed value from apply function is a matrix. Maybe data.table should automatically coerce matrices into data.tables upon updating of columns?
If what you need is really to scale by row, you can try doing it in 2 steps:
# compute mean/sd:
mean_sd <- DT[, .(mean(unlist(.SD)), sd(unlist(.SD))), by=1:nrow(DT), .SDcols=grep("keyword",colnames(DT))]
# scale
DT[, grep("keyword",colnames(DT), value=TRUE) := lapply(.SD, function(x) (x-mean_sd$V1)/mean_sd$V2), .SDcols=grep("keyword",colnames(DT))]
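A self-contained sketch of the same two-step idea on toy data follows. It computes the per-row mean and standard deviation from a matrix copy of the selected columns rather than with by = 1:nrow(DT); that swap, the column names, and the values are assumptions made for illustration.

```r
library(data.table)

# Toy table: one name column and three "keyword" columns (invented values)
DT <- data.table(
  name      = c("a", "b", "c"),
  keyword_1 = c(1, 4, 7),
  keyword_2 = c(2, 5, 8),
  keyword_3 = c(3, 6, 12)
)

cols <- grep("keyword", names(DT), value = TRUE)

# Step 1: per-row mean and sd over the selected columns
m        <- as.matrix(DT[, .SD, .SDcols = cols])
row_mean <- rowMeans(m)
row_sd   <- apply(m, 1, sd)

# Step 2: overwrite each selected column with its row-scaled values.
# Each column x has one entry per row, so (x - row_mean) / row_sd lines up by row.
DT[, (cols) := lapply(.SD, function(x) (x - row_mean) / row_sd), .SDcols = cols]
print(DT)
```

After the update, each row of the selected columns has mean 0 and standard deviation 1, while the name column is unchanged.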
PART 1: The one-line solution you requested:
# First, let's take a look at the data in the columns:
DT[, .SD, .SDcols = grep("keyword", colnames(DT))]
One-line Solution Version 1: Use magrittr and the pipe operator:
# Note: lapply(.SD, ...) applies scale() to each column, not to each row
DT[, (grep("keyword", colnames(DT))) := (lapply(.SD, . %>% scale(., center = FALSE))),
.SDcols = grep("keyword", colnames(DT))]
One-line Solution Version 2: Explicitly define the function for the lapply:
DT[, (grep("keyword", colnames(DT))) :=
(lapply(.SD, function(x){scale(x, center = FALSE)})),
.SDcols = grep("keyword", colnames(DT))]
Modification - If you want to do it by group, just use the by =
DT[ , (grep("keyword", colnames(DT))) :=
(lapply(.SD, function(x){scale(x, center = FALSE)}))
, .SDcols = grep("keyword", colnames(DT))
, by = Grouping.Variable]
You can verify:
# Verify that the columns have updated values:
DT[, .SD, .SDcols = grep("keyword", colnames(DT))]
PART 2: A Step-by-Step Solution: (more general and easier to follow)
The above solution works clearly for the narrow example given.
As a public service, I am posting this for anyone that is still searching for a way that
feels a bit less condensed;
easier to understand;
more general, in the sense that you can apply any function you wish without having to compute the values into a separate data table first (which, n.b. does work perfectly here)
Here's the step-by-step way of doing the same:
Get the data into Data.Table format:
# You get a data.table called DT
DT <- as.data.table(df)
Then, handle the column names:
# Get the names of the columns to transform (value = TRUE returns names, not positions)
Reference.Cols <- grep("keyword", colnames(DT), value = TRUE)
# FOR PEOPLE who want to store both transformed and untransformed values.
# Create new column names
Reference.Cols.normalized <- paste0(Reference.Cols, ".normalized")
Define the function you want to apply
#Define the function you wish to apply
# Where, normalize is just a function as defined in the question:
normalize <- function(X,
X.mean = mean(X, na.rm = TRUE),
X.sd = sd(X, na.rm = TRUE))
{
X <- (X - X.mean) / X.sd
return(X)
}
After that, it is trivial in Data.Table syntax:
# Voila: the newly created set of columns that contain the transformed values
DT[, (Reference.Cols.normalized) := lapply(.SD, normalize), .SDcols = Reference.Cols]
Verify:
# New values are stored in the columns named in Reference.Cols.normalized:
DT[, .SD, .SDcols = Reference.Cols.normalized]
# Untransformed values are left unharmed:
DT[, .SD, .SDcols = Reference.Cols]
Hopefully, for those of you who return to look at code after some interval, this more step-by-step / general approach can be helpful.

data.table index of subset [duplicate]

This question already has answers here:
Create group number for contiguous runs of equal values
(4 answers)
Closed 6 days ago.
Working with data.table package in R, I'm trying to get the 'group number' of some data points.
Specifically, my data is trajectories: I have many rows describing a specific observation of the particle I'm tracking, and I want to generate a specific index for the trajectory based on other identifying information I have.
If I do a [, , by] command, I can group my data by this identifying information and isolate each trajectory.
Is there a way, similar to .I or .N, which gives what I would call the index of the subset?
Here's an example with toy data:
dt <- data.table(x1 = c(rep(1,4), rep(2,4)),
x2 = c(1,1,2,2,1,1,2,2),
z = runif(8))
I need a fast way to get the trajectories (here it should be c(1,1,2,2,3,3,4,4)) for each observation -- my real data set is moderately large.
If we need the trajectories (I don't know what that means) based on 'x2', we can use rleid
dt[, Grp := rleid(x2)]
Or if we need the group numbers based on 'x1' and 'x2', .GRP can be used.
dt[, Grp := .GRP,.(x1, x2)]
Or this can be done using rleid alone without the by (as @Frank mentioned)
dt[, Grp := rleid(x1,x2)]
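On the toy data in the question these approaches agree, but they differ once a value reappears after a gap: rleid() numbers contiguous runs, while .GRP numbers distinct groups. A small sketch with made-up data (the column names run_id and grp_id are illustrative):

```r
library(data.table)

# "a" appears in two separate runs (invented for illustration)
dt <- data.table(x = c("a", "a", "b", "a"))

# Run-length id: a new number every time the value changes
dt[, run_id := rleid(x)]

# Group counter: one number per distinct value of x
dt[, grp_id := .GRP, by = x]
print(dt)
```

Which one is appropriate depends on whether a trajectory is defined by contiguity of rows or only by the identifying values.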

R data.table multi column conversion by names [duplicate]

This question already has answers here:
Apply a function to every specified column in a data.table and update by reference
(7 answers)
Closed 7 years ago.
Let DT be a data.table:
DT<-data.table(V1=factor(1:10),
V2=factor(1:10),
...
V9=factor(1:10))
Is there a better/simpler method to do multicolumn factor conversion like this:
DT[,`:=`(
Vn1=as.numeric(V1),
Vn2=as.numeric(V2),
Vn3=as.numeric(V3),
Vn4=as.numeric(V4),
Vn5=as.numeric(V5),
Vn6=as.numeric(V6),
Vn7=as.numeric(V7),
Vn8=as.numeric(V8),
Vn9=as.numeric(V9)
)]
Column names are totally arbitrary.
Yes, the most efficient way would probably be to run set in a for loop.
Set the desired columns to modify (you can choose all the columns, too, by using names(DT) instead)
cols <- c("V1", "V2", "V3")
Then just run the loop
for (j in cols) set(DT, i = NULL, j = j, value = as.numeric(DT[[j]]))
Or a slightly less efficient but more readable way would be just this (note the parentheses around cols, which make data.table evaluate the variable rather than treat it as a column name)
## if you choose all the names in DT, you don't need to specify the `.SDcols` parameter
DT[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]
Both should be efficient even for a big data set. You can read some more about data.table basics here
Though beware of converting factors to numeric classes in such a way, see here for more details
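The caveat is that as.numeric() applied to a factor returns the internal level codes, not the values shown in the labels. A minimal sketch of the difference, assuming the labels really are numbers stored as text (the toy data is made up):

```r
library(data.table)

# Factors whose labels look like numbers (invented for illustration)
DT <- data.table(
  V1 = factor(c("10", "20", "10")),
  V2 = factor(c("3.5", "1.5", "3.5"))
)
cols <- c("V1", "V2")

# as.numeric() on a factor gives the level codes (positions in levels())
codes <- DT[, lapply(.SD, as.numeric), .SDcols = cols]

# Going through as.character() first recovers the labelled values
DT[, (cols) := lapply(.SD, function(f) as.numeric(as.character(f))), .SDcols = cols]

print(codes)
print(DT)
```

For factors built from 1:10, as in the question, codes and labels happen to coincide, which is why the problem is easy to miss.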
