Remove multiple columns from data.table - r

What's the correct way to remove multiple columns from a data.table? I'm currently using the code below, but was getting unexpected behavior when I accidentally repeated one of the column names. I wasn't sure if this was a bug, or if I shouldn't be removing columns this way.
library(data.table)
DT <- data.table(x = letters, y = letters, z = letters)
DT[ ,c("x","y") := NULL]
names(DT)
[1] "z"
The above works fine, but
DT <- data.table(x = letters, y = letters, z = letters)
DT[ ,c("x","x") := NULL]
names(DT)
[1] "z"

This looks like a solid, reproducible bug. It's been filed as Bug #2791.
It appears that repeating the column attempts to delete the subsequent columns.
If no columns remain, then R crashes.
UPDATE : Now fixed in v1.8.11. From NEWS :
Assigning to the same column twice in the same query is now an error rather than a crash in some circumstances; e.g., DT[,c("B","B"):=NULL] (delete by reference the same column twice). Thanks to Ricardo (#2751) and matt_k (#2791) for reporting. Tests added.

This Q has been answered but regard this as a side note.
I prefer the following syntax to drop multiple columns
DT[ ,`:=`(x = NULL, y = NULL)]
because it matches the one to add multiple columns (variables)
DT[ ,`:=`(x = letters, y = "Male")]
This also check for duplicated column names. So trying to drop x twice will throw an error message.

Related

Understand the meaning of[.... with=F][[1]]

I am doing practicing exercise based on the problems and solutions for data.table in R. The problem was: get the row and column positions of missing values in a data table. The solution code is used " [.....with=F][[1]]. I am not understanding this part of that code and expecting expert opinion to make my concept clear on that.
for(i in 1:NROW(DT)){
for(j in 1:NCOL(DT)){
curr_value <- DT[i, j,with=F][[1]]
I can understand first two lines, but not understanding ,with=F and then [[1]]
What the meaning of with=F and why has been used [[1]] after than that. Why there is double bracket with 1?
Generally in data.table, with = FALSE allows you to select columns named in a variable.
Consider the following minimal example,
library(data.table)
dt <- data.table(mtcars)
Let's select the following columns from dt
cols <- c(1, 7)
The following command will produce an error
dt[, cols]
Instead you can use with = F
dt[, cols, with = F]
From ?data.table
When with=TRUE (default), j is evaluated within the frame of the data.table;
i.e., it sees column names as if they are variables.
A shorter alternative is to use
dt[, ..cols]
See also Why does “..” work to pass column names in a character vector variable?

datatables in R "," and keyby, and "." confusion

I'm rewriting someone else's R code into python, and I don't know R.
So I'm trying to decipher what things mean.
What does this line mean?
kable(DT[, .N, keyby=.(target=get(y))], format="html")
So DT is the datatable itself, and y is a column within DT. But I think it's trying to create a table wherever y exists?
There's also this follow up line:
id_bady1= DT[! get(y) %in% c(0,1), get(id)]
Documentation for R says that get returns the object the matches the input, but how does that work when there are multiple matches?
The content of y is the name of a column of the datatable, see:
library("data.table")
DT <- mtcars
setDT(DT)
y <- "cyl"
DT[, .N, keyby=.(target=get(y))]
IMHO it is here complete matching (not partial matching):
DT[, cylA:=7] # construct a second column that begins with "cyl"
DT[, .N, keyby=.(target=get(y))]
y <- "cy" ## no complete matching possible
DT[, .N, keyby=.(target=get(y))]
### Error in get(y) : object 'cy' not found

Arithmetic with unknown number of data.table columns of non-standard type

I'd like to create a new data.table column by adding columns whose names match a certain expression. I don't know how many columns I'll be adding. The catch is, the columns are of class 'integer64' so rowSums(.) does not appear to work on them.
For instance, this works for two (known) integer64 columns:
DT <- data.table(a=as.integer64(1:4),b=as.integer64(5:8),c=as.integer64(9:12))
DT[, y := .SD[, 1] + .SD[, 2], .SDcols=c("a", "b")]
And this works for my case, any number of columns, but not if their class is integer64:
DT[, y := rowSums(.SD), .SDcols=c("a", "b")] # gives incomprehensible data if class of a and b is integer64
One way I can work around it is by defining the data type of the column y beforehand. Is there a simpler way to do it? I may be missing something simple here and I apologize if that's so.

Assign a vector to a specific existing row of data table in R

I've been looking through tutorials and documentation, but have not figured out how to assign a vector of values for all columns to one existing row in a data.table.
I start with an empty data.table that has already the correct number of columns and rows:
dt <- data.table(matrix(nrow=10, ncol=5))
Now I calculate some values for one row outside of the data.table and place them in a vector vec, e. g.:
vec <- rnorm(5)
How could I assign the values of vec to e. g. the first row of the data.table while achieving a good performance (since I also want to fill the other rows step by step)?
First you need to get the correct column types, as the NA matrix you've created is logical. The column types won't be magically changed by assigning numerics to them.
dt[, names(dt) := lapply(.SD, as.numeric)]
Then you can change the first row's values with
dt[1, names(dt) := as.list(vec)]
That said, if you begin with a numeric matrix you wouldn't have to change the column types.
dt <- data.table(matrix(numeric(), 10, 5))
dt[1, names(dt) := as.list(vec)]

data.table: transforming subset of columns with a function, row by row

How can one, having a data.table with mostly numeric values, transform just a subset of columns and put them back to the original data table? Generally, I don't want to add any summary statistic as a separate column, just exchange the transformed ones.
Assume we have a DT. It has 1 column with names and 10 columns with numeric values. I am interested in using "scale" function of base R for each row of that data table, but only applied to those 10 numeric columns.
And to expand on this. What if I have a data table with more columns and I need to use column names to tell the scale function on which datapoints to apply the function?
With regular data.frame I would just do:
df[,grep("keyword",colnames(df))] <- t(apply(df[,grep("keyword",colnames(df))],1,scale))
I know this looks cumbersome but always worked for me. However, I can't figure out a simple way to do it in data.tables.
I would image something like this to work for data.tables:
dt[,grep("keyword",colnames(dt)) := scale(grep("keyword",colnames(dt)),center=F)]
But it doesn't.
EDIT:
Another example of doing that updating columns with their per-row-scaled version:
dt = data.table object
dt[,grep("keyword",colnames(dt),value=T) := as.data.table(t(apply(dt[,grep("keyword",colnames(dt)),with=F],1,scale)))]
Too bad it needs the "as.data.table" part inside, as the transposed value from apply function is a matrix. Maybe data.table should automatically coerce matrices into data.tables upon updating of columns?
If what you need is really to scale by row, you can try doing it in 2 steps:
# compute mean/sd:
mean_sd <- DT[, .(mean(unlist(.SD)), sd(unlist(.SD))), by=1:nrow(DT), .SDcols=grep("keyword",colnames(DT))]
# scale
DT[, grep("keyword",colnames(DT), value=TRUE) := lapply(.SD, function(x) (x-mean_sd$V1)/mean_sd$V2), .SDcols=grep("keyword",colnames(DT))]
PART 1: The one line solution you requested:
# First lets take a look at the data in the columns:
DT[,.SD, .SDcols = grep("corrupt", colnames(DT))]`
One-line Solution Version 1: Use magrittR and the pipe operator:
DT[, (grep("keyword", colnames(DT))) := (lapply(.SD, . %>% scale(., center = F))),
.SDcols = grep("corrupt", colnames(DT))]
One-line Solution Version 2: Explicitly defines the function for the lapply:
DT[, (grep("keyword", colnames(DT))) :=
(lapply(.SD, function(x){scale(x, center = F)})),
.SDcols = grep("corrupt", colnames(DT))]
Modification - If you want to do it by group, just use the by =
DT[ , (grep("keyword", colnames(DT))) :=
(lapply(.SD, function(x){scale(x, center = F)}))
, .SDcols = grep("corrupt", colnames(DT))
, by = Grouping.Variable]
You can verify:
# Verify that the columns have updated values:
DT[,.SD, .SDcols = grep("corrupt", colnames(DT))]
PART 2: A Step-by-Step Solution: (more general and easier to follow)
The above solution works clearly for the narrow example given.
As a public service, I am posting this for anyone that is still searching for a way that
feels a bit less condensed;
easier to understand;
more general, in the sense that you can apply any function you wish without having to compute the values into a separate data table first (which, n.b. does work perfectly here)
Here's the step-by-step way of doing the same:
Get the data into Data.Table format:
# You get a data.table called DT
DT <- as.data.table(df)
Then, Handle the Column Names:
# Get the list of names
Reference.Cols <- grep("keyword",colnames(df))
# FOR PEOPLE who want to store both transformed and untransformed values.
# Create new column names
Reference.Cols.normalized <- Reference.Cols %>% paste(., ".normalized", sep = "")
Define the function you want to apply
#Define the function you wish to apply
# Where, normalize is just a function as defined in the question:
normalize <- function(X,
X.mean = mean(X, na.rm = TRUE),
X.sd = sd(X, na.rm = TRUE))
{
X <- (X - X.mean) / X.sd
return(X)
}
After that, it is trivial in Data.Table syntax:
# Voila, the newly created set of columns the contain the transformed value,
DT[, (Reference.Cols.normalized) := lapply(.SD, normalize), .SDcols = Reference.Cols]
Verify:
new values stored in columns with names stored in:
DT[, .SD, .SDcols = Reference.Cols.normalized]
Untransformed values left unharmed
DT[, .SD, .SDcols = Reference.Cols]
Hopefully, for those of you who return to look at code after some interval, this more step-by-step / general approach can be helpful.

Resources