Error deleting factor column in empty data.table - r

If I have an empty data.table with a factor column, the factor column can't be removed with the := NULL operator. Integer and character columns have no problems.
library(data.table)
DT <- data.table(numbers = integer(0),
char.letters = character(0),
factor.letters = factor(character(0)))
DT[, factor.letters := NULL]
I get the following error:
Error in `[.data.table`(DT, , `:=`(factor.letters, NULL)) :
Can't assign to column 'factor.letters' (type 'factor') a value of type 'NULL' (not character, factor, integer or numeric)
Note that DT[, char.letters := NULL] and DT[, numbers := NULL] do not produce errors.
Since factor columns behave differently from character and integer columns, I suspect this is a problem with data.table, but am I doing anything incorrectly?
Edit: Previous example used join to create the empty data.table (which was then called join), but it can be reproduced just as easily by creating it directly.

Thanks for reporting. Now fixed in v1.8.9
Deleting a (0-length) factor column using :=NULL on an empty data.table now works, #4809. Thanks to Frank Pinter for reporting. Test added.

Related

Understand the meaning of [.... with=F][[1]]

I am working through practice exercises based on problems and solutions for data.table in R. The problem was: get the row and column positions of missing values in a data table. The solution code uses [.....with=F][[1]]. I do not understand this part of the code and would appreciate an expert explanation.
for(i in 1:NROW(DT)){
for(j in 1:NCOL(DT)){
curr_value <- DT[i, j,with=F][[1]]
I understand the first two lines, but not the with=F argument or the [[1]] that follows it.
What does with=F mean, and why is [[1]] used after it? Why the double brackets around 1?
Generally in data.table, with = FALSE allows you to select columns named in a variable.
Consider the following minimal example,
library(data.table)
dt <- data.table(mtcars)
Let's select the following columns from dt
cols <- c(1, 7)
The following command will produce an error
dt[, cols]
Instead you can use with = F
dt[, cols, with = F]
From ?data.table
When with=TRUE (default), j is evaluated within the frame of the data.table;
i.e., it sees column names as if they are variables.
A shorter alternative is to use
dt[, ..cols]
See also Why does “..” work to pass column names in a character vector variable?
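The answer above covers with = FALSE but not the trailing [[1]]. A minimal sketch of what that part does, assuming the built-in mtcars data and the hypothetical index names row_idx and col_idx: selecting a single cell with with = FALSE still returns a one-row, one-column data.table, and [[1]] pulls its first column out as a plain vector so it can be tested with functions such as is.na().

```r
library(data.table)

dt <- data.table(mtcars)

# Hypothetical loop indices standing in for i and j in the question.
row_idx <- 1
col_idx <- 1

# With with = FALSE, col_idx is treated as a column position, and the
# result is still a data.table (here 1 row by 1 column).
cell <- dt[row_idx, col_idx, with = FALSE]

# [[1]] extracts the first (and only) column as an ordinary vector.
value <- cell[[1]]

is.data.table(cell)   # the selection is a data.table
is.numeric(value)     # the extracted value is a bare numeric
is.na(value)          # now a simple scalar test is possible
```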

lag row value in R by groups

I want the previous row value for different groups. I have gone through the solution given here and also tried the code shown below.
new_data[,avg_week := shift(.(avg_travel_time),type = "lag"), by = identifier]
This is the error that I am getting.
Error in `[.data.frame`(new_data, , `:=`(avg_week, c(NA, avg_travel_time[-.N])), :
unused argument (by = identifier)
There are two problems in the OP's code: 1) the dataset is a data.frame and not a data.table, and 2) the argument to shift is wrapped in .(), which is not required. We first need to convert to data.table with setDT(new_data) before applying the data.table syntax:
setDT(new_data)[,avg_week := shift(avg_travel_time, type = "lag"), by = identifier]
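As a small self-contained illustration of the corrected call, here is a sketch on made-up data; only the column names identifier and avg_travel_time come from the question, and the values are invented for the example.

```r
library(data.table)

# Hypothetical sample data; values are invented for illustration.
new_data <- data.frame(
  identifier      = c("a", "a", "a", "b", "b"),
  avg_travel_time = c(10, 12, 14, 20, 25)
)

# Convert the data.frame to a data.table by reference.
setDT(new_data)

# Lag avg_travel_time by one row within each identifier group;
# the first row of every group has no previous value and gets NA.
new_data[, avg_week := shift(avg_travel_time, type = "lag"), by = identifier]

print(new_data)
```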

Arithmetic with unknown number of data.table columns of non-standard type

I'd like to create a new data.table column by adding columns whose names match a certain expression. I don't know how many columns I'll be adding. The catch is, the columns are of class 'integer64' so rowSums(.) does not appear to work on them.
For instance, this works for two (known) integer64 columns:
library(data.table)
library(bit64) # provides as.integer64
DT <- data.table(a=as.integer64(1:4),b=as.integer64(5:8),c=as.integer64(9:12))
DT[, y := .SD[, 1] + .SD[, 2], .SDcols=c("a", "b")]
And this works for my case, any number of columns, but not if their class is integer64:
DT[, y := rowSums(.SD), .SDcols=c("a", "b")] # gives incomprehensible data if class of a and b is integer64
One way I can work around it is by defining the data type of the column y beforehand. Is there a simpler way to do it? I may be missing something simple here and I apologize if that's so.
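No answer is recorded for this question here. One possible approach, offered as a sketch rather than as the thread's solution, is to fold the selected columns together with Reduce() and `+`, which dispatches to bit64's integer64 addition instead of going through rowSums(). The pattern "^[ab]$" and the variable sum_cols are assumptions made for the example.

```r
library(data.table)
library(bit64)

DT <- data.table(a = as.integer64(1:4),
                 b = as.integer64(5:8),
                 c = as.integer64(9:12))

# Hypothetical pattern: pick the columns to add by matching their names.
sum_cols <- grep("^[ab]$", names(DT), value = TRUE)

# Reduce applies `+` pairwise across the columns of .SD, so the
# integer64 method is used and the result stays integer64.
DT[, y := Reduce(`+`, .SD), .SDcols = sum_cols]

print(DT)
```

Because Reduce() only needs a binary `+`, the same call works for any number of matched columns.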

How to transform data table columns, indexed by position, by reference?

I have a data.table that houses several columns of factors. I'd like to convert 2 columns originally read as factors to their original numeric values. Here's what I've tried:
data[, c(4,5):=c(as.numeric(as.character(4)), as.numeric(as.character(5))), with=FALSE]
This gives me the following warnings:
Warning messages:
1: In `[.data.table`(data, , `:=`(c(4, 5), c(as.numeric(as.character(4)), :
Supplied 2 items to be assigned to 7 items of column 'Bentley (R)' (recycled leaving remainder of 1 items).
2: In `[.data.table`(data, , `:=`(c(4, 5), c(as.numeric(as.character(4)), :
Supplied 2 items to be assigned to 7 items of column 'Sparks (D)' (recycled leaving remainder of 1 items).
3: In `[.data.table`(data, , `:=`(c(4, 5), c(as.numeric(as.character(4)), :
Coerced 'double' RHS to 'integer' to match the factor column's underlying type. Character columns are now recommended (can be in keys), or coerce RHS to integer or character first.
4: In `[.data.table`(data, , `:=`(c(4, 5), c(as.numeric(as.character(4)), :
Coerced 'double' RHS to 'integer' to match the factor column's underlying type. Character columns are now recommended (can be in keys), or coerce RHS to integer or character first.
Also I can tell the conversion has not succeeded because the 4th and 5th columns persist in being factors after this code has run.
As an alternate, I tried this code, which won't run at all:
data[, ':=' (4=c(as.numeric(as.character(4)), 5 = as.numeric(as.character(5)))), with=FALSE]
Finally, I tried referencing the column names via colnames:
data[ , (colnames(data)[4]) := as.numeric(as.character(colnames(data)[4]))]
This runs but results in a column of NAs as well as the following warnings:
Warning messages:
1: In eval(expr, envir, enclos) : NAs introduced by coercion
2: In `[.data.table`(data, , `:=`((colnames(data)[4]), as.numeric(as.character(colnames(data)[4])))) :
Coerced 'double' RHS to 'integer' to match the factor column's underlying type. Character columns are now recommended (can be in keys), or coerce RHS to integer or character first.
3: In `[.data.table`(data, , `:=`((colnames(data)[4]), as.numeric(as.character(colnames(data)[4])))) :
RHS contains -2147483648 which is outside the levels range ([1,6]) of column 1, NAs generated
I need to do this by position and not by column name, since the column name will depend on the URL. What's the proper way to transform columns by position using data.table?
I also have a related query, which is how to transform numbered columns relative to other numbered columns. For example, if I want to set the 3rd column to be equal to 45 minus the value of the 3rd column plus the value of the 4th column, how would I do that? Is there some way to distinguish between a real # vs a column number? I know something like this is not the way to go:
dt[ , .(4) = 45 - .(3) + .(4), with = FALSE]
So then how can this be done?
If you want to assign by reference and by position, supply the columns to assign to either as a character vector of names or as an integer vector of positions, and use .SDcols to select the same columns on the right-hand side (at least in data.table 1.9.4).
First a reproducible example:
library(data.table)
DT <- data.table(iris)
DT[, c("Sepal.Length", "Petal.Length") := list(factor(Sepal.Length), factor(Petal.Length))]
str(DT)
Now let's convert the columns:
DT[, names(DT)[c(1, 3)] := lapply(.SD, function(x) as.numeric(as.character(x))),
.SDcols = c(1, 3)]
str(DT)
Alternatively:
DT[, c(1,3) := lapply(.SD, function(x) as.numeric(as.character(x))), .SDcols=c(1,3)]
str(DT)
Note that := expects a vector of column names or positions on the left side and a list on the right side.
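The related query, computing one numbered column from other numbered columns, is not addressed above. One way to sketch it, assuming a small made-up table, is set(), which takes the target column position in j, while DT2[[k]] reads column k by position. Literal numbers such as 45 then stay ordinary numbers and are never confused with column positions. The name DT2 and its data are hypothetical.

```r
library(data.table)

# Hypothetical data: four columns, referred to only by position below.
DT2 <- data.table(a = 1:3, b = 4:6, c = c(1, 2, 3), d = c(10, 20, 30))

# Set column 3 to 45 minus column 3 plus column 4, by reference.
# DT2[[3]] and DT2[[4]] fetch columns by position; 45 is a plain number.
set(DT2, j = 3L, value = 45 - DT2[[3]] + DT2[[4]])

print(DT2)
```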

Remove multiple columns from data.table

What's the correct way to remove multiple columns from a data.table? I'm currently using the code below, but was getting unexpected behavior when I accidentally repeated one of the column names. I wasn't sure if this was a bug, or if I shouldn't be removing columns this way.
library(data.table)
DT <- data.table(x = letters, y = letters, z = letters)
DT[ ,c("x","y") := NULL]
names(DT)
[1] "z"
The above works fine, but
DT <- data.table(x = letters, y = letters, z = letters)
DT[ ,c("x","x") := NULL]
names(DT)
[1] "z"
This looks like a solid, reproducible bug. It's been filed as Bug #2791.
It appears that repeating the column attempts to delete the subsequent columns.
If no columns remain, then R crashes.
UPDATE : Now fixed in v1.8.11. From NEWS :
Assigning to the same column twice in the same query is now an error rather than a crash in some circumstances; e.g., DT[,c("B","B"):=NULL] (delete by reference the same column twice). Thanks to Ricardo (#2751) and matt_k (#2791) for reporting. Tests added.
This question has already been answered, but consider this a side note.
I prefer the following syntax to drop multiple columns
DT[ ,`:=`(x = NULL, y = NULL)]
because it matches the one to add multiple columns (variables)
DT[ ,`:=`(x = letters, y = "Male")]
This also checks for duplicated column names, so trying to drop x twice will throw an error message.
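When the names to drop arrive in a character vector that might contain accidental repeats, a defensive sketch is to deduplicate before deleting. The variable drop_cols is hypothetical and not part of the answers above.

```r
library(data.table)

DT <- data.table(x = letters, y = letters, z = letters)

# Hypothetical vector of names to drop; "x" is repeated by accident.
drop_cols <- c("x", "x")

# Remove repeats and keep only names that actually exist in DT.
drop_cols <- intersect(unique(drop_cols), names(DT))

# The parentheses make := treat drop_cols as a variable holding names.
DT[, (drop_cols) := NULL]

names(DT)
```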
