R data.table multi column conversion by names [duplicate] - r

This question already has answers here:
Apply a function to every specified column in a data.table and update by reference
(7 answers)
Closed 7 years ago.
Let DT be a data.table:
Is there a better/simpler method to do multicolumn factor conversion like this:
Column names are totally arbitrary.

Yes. The most efficient way is probably to run set in a for loop.
First choose the columns to modify (to modify every column, use names(DT) instead):
cols <- c("V1", "V2", "V3")
Then just run the loop
for (j in cols) set(DT, i = NULL, j = j, value = as.numeric(DT[[j]]))
A slightly less efficient but more readable alternative is the following (note the parentheses around cols, which make data.table evaluate the variable instead of creating a column named cols):
## if you choose all the names in DT, you don't need to specify the `.SDcols` parameter
DT[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]
Both should be efficient even for a big data set. You can read some more about data.table basics here
Beware, though, of converting factors to numeric classes this way: as.numeric on a factor returns the internal level codes, not the numbers spelled out by the labels. See here for more details
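As a minimal sketch of the set loop applied to the factor conversion the question asks about, and of the factor-to-numeric caveat mentioned above. The table contents and the vector f are made-up illustrations, not data from the question.

```r
library(data.table)

# Hypothetical sample data; the column names are arbitrary.
DT <- data.table(V1 = c("a", "b", "a"), V2 = c("x", "y", "x"), V3 = 1:3)
cols <- c("V1", "V2")

# Convert the chosen columns to factors by reference, one column at a time.
for (j in cols) set(DT, j = j, value = as.factor(DT[[j]]))

# The caveat: as.numeric() on a factor returns level codes, not the labels.
f <- factor(c("10", "20", "10"))
codes  <- as.numeric(f)                # level codes
values <- as.numeric(as.character(f))  # the numbers the labels spell out
```

Going through as.character first is the usual way to recover the original numbers from a factor.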


Insert or print a column name inside a data table call [duplicate]

This question already has answers here:
Converting multiple data.table columns to factors in R
(2 answers)
Closed 2 years ago.
I have a rather simple problem as it seems, which I cannot solve myself however.
Can I somehow insert or print a column name within a data table call? I have something like this in mind:
col_names = c("column1","column2")
for (col in col_names){
datatable$col ...
}
or
col_names = c("column1","column2")
for (col in col_names){
datatable[,col] ...
}
What I eventually would like to do is transform the variables of certain columns into ordered factors. Since there are many columns, I'm looking for a neater way as an alternative of just coding the same line 20 times with the only difference being the column name.
Are you trying to print just the column name or the entire column within the datatable?
You could try something like this
col_names = c("column1","column2")
for (i in seq_along(col_names)){
Or if you just want the names printed.
col_names = c("column1","column2")
for (i in seq_along(col_names)){
Also, you might want to check out the iteration chapter in R for Data Science.
Perhaps you can use lapply with .SDcols to apply a function over col_names and assign the result with :=. You can try something like this:
datatable[, (col_names) := lapply(.SD, function(x) factor(x, ordered = TRUE)),
.SDcols = col_names]
Here we apply factor(x, ordered = TRUE) to each column named in col_names, where x is the vector of values in that column (not the column name).
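For readers who prefer the explicit loop from the question, here is a hedged sketch of how a column name held in a variable can be used inside the loop: datatable[[col]] looks the column up by the value of col, whereas datatable$col looks for a column literally named "col". The sample data and the level order in lvls are assumptions for illustration only.

```r
library(data.table)

# Hypothetical sample data and level order.
datatable <- data.table(column1 = c("low", "high", "mid"),
                        column2 = c("mid", "mid", "low"),
                        other   = 1:3)
col_names <- c("column1", "column2")
lvls <- c("low", "mid", "high")

for (col in col_names) {
  print(col)  # prints the column name itself
  # [[col]] fetches the column whose name is stored in col
  set(datatable, j = col,
      value = factor(datatable[[col]], levels = lvls, ordered = TRUE))
}
```

Passing levels explicitly matters for ordered factors; without it, the order defaults to the sorted unique values.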

best way to select columns from data.table by type [duplicate]

This question already has answers here:
Select columns by class (e.g. numeric) from a data.table
(5 answers)
Closed 1 year ago.
I am looking for an elegant or efficient way to select columns in R's data.table.
Personally I value a flexible approach.
Therefore I tend to refer to columns by their characteristics rather than their names.
For example, I want to set the values of all columns to lower case.
If I include all columns in this operation, like so
dt[, lapply(.SD, tolower),.SDcols = names(dt)]
numeric and integer columns, too, will be converted to (lower case) character.
This is undesirable, and hence I first identify all character columns as follows:
char_cols <- as.character(names(dt[ , lapply(.SD, function(x) which(is.character(x)))]))
and subsequently pass char_cols to .SDcols
dt[ , lapply(.SD, tolower), .SDcols = char_cols ]
If instead, all your columns are character (for example to avoid type conversion issues while reading the data) I would go about it like this
char_cols <- as.character(names(dt[ , lapply(.SD, function(x) which(all(is.na(as.numeric(x)))))]))
One should be certain however, that no column is of mixed type: i.e. contains some character strings and some numeric values.
Does anyone have a suggestion to approach this more elegantly, or more efficiently?
You can pass a logical/character vector to .SDcols.
For character columns, we can do
cols <- names(Filter(is.character, dt))
dt[, (cols) := lapply(.SD, tolower), .SDcols = cols]
Alternatively, we can use sapply:
cols <- names(which(sapply(dt, is.character)))
dt[, (cols) := lapply(.SD, tolower), .SDcols = cols]
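A small self-contained run of the approach above, on made-up data, showing that only the character columns are touched while the integer column keeps its type:

```r
library(data.table)

# Hypothetical sample data with one integer and two character columns.
dt <- data.table(id   = 1:3,
                 name = c("Ann", "BOB", "Cy"),
                 city = c("Rome", "OSLO", "Lima"))

# Names of the columns for which is.character() is TRUE.
cols <- names(which(sapply(dt, is.character)))

# Lower-case those columns by reference; other columns are left alone.
dt[, (cols) := lapply(.SD, tolower), .SDcols = cols]
```

The same pattern works with any predicate, for example is.numeric or is.factor in place of is.character.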

Understand the meaning of [.... with=F][[1]]

I am working through practice exercises based on problems and solutions for data.table in R. The problem was: get the row and column positions of missing values in a data table. The solution code uses [.....with=F][[1]]. I do not understand this part of the code and would appreciate an explanation.
for(i in 1:NROW(DT)){
for(j in 1:NCOL(DT)){
curr_value <- DT[i, j,with=F][[1]]
I understand the first two lines, but not with=F followed by [[1]].
What does with=F mean, and why is [[1]] used after it? Why is there a double bracket around 1?
Generally in data.table, with = FALSE allows you to select columns named in a variable.
Consider the following minimal example,
dt <- data.table(mtcars)
Let's select the following columns from dt
cols <- c(1, 7)
The following command will produce an error
dt[, cols]
Instead you can use with = F
dt[, cols, with = F]
From ?data.table
When with=TRUE (default), j is evaluated within the frame of the data.table;
i.e., it sees column names as if they are variables.
A shorter alternative is to use
dt[, ..cols]
See also Why does “..” work to pass column names in a character vector variable?
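To cover the [[1]] part of the question as well: DT[i, j, with = FALSE] returns a one-row, one-column data.table, and [[1]] extracts its first column as a plain vector, which here is the single cell value. The sketch below uses made-up data, and the final which(..., arr.ind = TRUE) call is offered only as one possible loop-free way to get the same positions, not as the exercise's official solution.

```r
library(data.table)

# Hypothetical sample data with two missing values.
DT <- data.table(a = c(1, NA, 3), b = c("x", "y", NA))

cell <- DT[2, 1, with = FALSE]  # still a data.table (1 row, 1 column)
value <- cell[[1]]              # the bare value stored in that cell

# Loop-free alternative: row/column positions of every NA.
pos <- which(sapply(DT, is.na), arr.ind = TRUE)
```

Because value is a plain vector of length one, it can be passed straight to is.na() inside the nested loops.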

data.table: transforming subset of columns with a function, row by row

How can one, having a data.table with mostly numeric values, transform just a subset of columns and put them back to the original data table? Generally, I don't want to add any summary statistic as a separate column, just exchange the transformed ones.
Assume we have a DT. It has 1 column with names and 10 columns with numeric values. I am interested in using "scale" function of base R for each row of that data table, but only applied to those 10 numeric columns.
And to expand on this. What if I have a data table with more columns and I need to use column names to tell the scale function on which datapoints to apply the function?
With regular data.frame I would just do:
df[,grep("keyword",colnames(df))] <- t(apply(df[,grep("keyword",colnames(df))],1,scale))
I know this looks cumbersome but always worked for me. However, I can't figure out a simple way to do it in data.tables.
I would imagine something like this would work for data.tables:
dt[,grep("keyword",colnames(dt)) := scale(grep("keyword",colnames(dt)),center=F)]
But it doesn't.
Another way of updating the columns with their per-row-scaled versions (here dt is a data.table object):
dt[,grep("keyword",colnames(dt),value=T) := as.data.table(t(apply(dt[,grep("keyword",colnames(dt)),with=F],1,scale)))]
Too bad it needs the "as.data.table" part inside, as the transposed value from apply function is a matrix. Maybe data.table should automatically coerce matrices into data.tables upon updating of columns?
If what you need is really to scale by row, you can try doing it in 2 steps:
# compute mean/sd:
mean_sd <- DT[, .(mean(unlist(.SD)), sd(unlist(.SD))), by=1:nrow(DT), .SDcols=grep("keyword",colnames(DT))]
# scale
DT[, grep("keyword",colnames(DT), value=TRUE) := lapply(.SD, function(x) (x-mean_sd$V1)/mean_sd$V2), .SDcols=grep("keyword",colnames(DT))]
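The row-wise scaling discussed above can also be sketched through a matrix round trip: scale works column by column, so the selected columns are transposed, scaled, and transposed back. The table and its keyword_* column names are made up for illustration.

```r
library(data.table)

# Hypothetical sample data: one name column and three numeric columns.
DT <- data.table(name      = c("a", "b"),
                 keyword_1 = c(1, 2),
                 keyword_2 = c(3, 6),
                 keyword_3 = c(5, 10))

cols <- grep("keyword", names(DT), value = TRUE)

# scale() centers and scales each column, so transpose to make rows into
# columns, scale, then transpose back to the original orientation.
m <- as.matrix(DT[, ..cols])
scaled <- t(scale(t(m)))

# Write the scaled values back over the original columns by reference.
DT[, (cols) := as.data.table(scaled)]
```

This still needs as.data.table on the right-hand side, but it avoids calling a function once per row.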
PART 1: The one line solution you requested:
# First let's take a look at the data in the columns:
DT[, .SD, .SDcols = grep("keyword", colnames(DT))]
One-line Solution Version 1: Use magrittr and the pipe operator:
DT[, (grep("keyword", colnames(DT))) := (lapply(.SD, . %>% scale(., center = F))),
.SDcols = grep("keyword", colnames(DT))]
One-line Solution Version 2: Explicitly define the function for the lapply:
DT[, (grep("keyword", colnames(DT))) :=
(lapply(.SD, function(x){scale(x, center = F)})),
.SDcols = grep("keyword", colnames(DT))]
Modification - If you want to do it by group, just use the by =
DT[ , (grep("keyword", colnames(DT))) :=
(lapply(.SD, function(x){scale(x, center = F)}))
, .SDcols = grep("keyword", colnames(DT))
, by = Grouping.Variable]
You can verify:
# Verify that the columns have updated values:
DT[, .SD, .SDcols = grep("keyword", colnames(DT))]
PART 2: A Step-by-Step Solution: (more general and easier to follow)
The above solution works clearly for the narrow example given.
As a public service, I am posting this for anyone that is still searching for a way that
feels a bit less condensed;
easier to understand;
more general, in the sense that you can apply any function you wish without having to compute the values into a separate data table first (which, n.b. does work perfectly here)
Here's the step-by-step way of doing the same:
Get the data into data.table format:
# You get a data.table called DT
DT <- as.data.table(df)
Then, Handle the Column Names:
# Get the vector of column names (value = TRUE returns names, not positions)
Reference.Cols <- grep("keyword", colnames(DT), value = TRUE)
# FOR PEOPLE who want to store both transformed and untransformed values.
# Create new column names
Reference.Cols.normalized <- Reference.Cols %>% paste(., ".normalized", sep = "")
Define the function you want to apply
# Define the function you wish to apply.
# Here, normalize is just a function as defined in the question:
normalize <- function(X,
                      X.mean = mean(X, na.rm = TRUE),
                      X.sd = sd(X, na.rm = TRUE)) {
  (X - X.mean) / X.sd
}
After that, it is trivial in data.table syntax:
# Voila, a newly created set of columns that contain the transformed values
DT[, (Reference.Cols.normalized) := lapply(.SD, normalize), .SDcols = Reference.Cols]
new values stored in columns with names stored in:
DT[, .SD, .SDcols = Reference.Cols.normalized]
Untransformed values left unharmed
DT[, .SD, .SDcols = Reference.Cols]
Hopefully, for those of you who return to look at code after some interval, this more step-by-step / general approach can be helpful.

Summing many columns with data.table in R, remove NA [duplicate]

This question already has an answer here:
Summarizing multiple columns with data.table
(1 answer)
Closed 9 years ago.
This is really two questions I guess. I'm trying to use the data.table package to summarize a large dataset. Say my original large dataset is df1 and unfortunately df1 has 50 columns (y0... y49) that I want the sum of by 3 fields (segmentfield1, segmentfield2, segmentfield3). Is there a simpler way to do this than typing every y0...y49 column out? Related to this, is there a generic na.rm=T for the data.table instead of typing that with each sum too?
dt1 <- data.table(df1)
setkey(dt1, segmentfield1, segmentfield2, segmentfield3)
dt2 <- dt1[,list( y0=sum(y0,na.rm=T), y1=sum(y1,na.rm=T), y2=sum(y2,na.rm=T), ...
y49=sum(y49,na.rm=T) ),
by=list(segmentfield1, segmentfield2, segmentfield3)]
First, create the object variables for the names in use:
colsToSum <- names(dt1) # or whatever you need
summedNms <- paste0( "y", seq_along(colsToSum) )
If you'd like to copy it to a new data.table
dt2 <- dt1[, lapply(.SD, sum, na.rm=TRUE), .SDcols=colsToSum]
setnames(dt2, summedNms)
Alternatively, if you'd like to append the columns to the original
dt1[, c(summedNms) := lapply(.SD, sum, na.rm=TRUE), .SDcols=colsToSum]
As far as a general na.rm process, there is not one specific to data.table, but have a look at ?na.omit and ?na.exclude
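A hedged sketch that combines the .SDcols idea with the grouping from the question, so that na.rm = TRUE is written once rather than once per column. The data and the shortened names seg1 and seg2 are hypothetical stand-ins for the segment fields, and the regular expression assumes the columns to sum are named y followed by digits.

```r
library(data.table)

# Hypothetical sample data with two grouping fields and two y columns.
dt1 <- data.table(seg1 = c("a", "a", "b"),
                  seg2 = c("x", "x", "y"),
                  y0   = c(1, NA, 3),
                  y1   = c(4, 5, NA))

# Pick up every column named y<digits> without typing them out.
colsToSum <- grep("^y[0-9]+$", names(dt1), value = TRUE)

# Sum each selected column within each group, ignoring NAs.
dt2 <- dt1[, lapply(.SD, sum, na.rm = TRUE),
           by = .(seg1, seg2), .SDcols = colsToSum]
```

Extra arguments given to lapply, such as na.rm = TRUE here, are passed on to sum for every column.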
