I have a very simple piece of code that reproduces my issue in R (with the data.table package).
If I save the column names into a variable without copy(), that variable gets overwritten when I use the := operator later in the code:
library(data.table)
input_table = data.table(A = 1:3, B = 2:4)
some_value = 0.025
original_colnames <- names(input_table)
original_colnames_copy <- copy(names(input_table))
identical(original_colnames, c("A", "B")) # returns TRUE
# create a new column
input_table[, C := A + some_value]
identical(original_colnames, c("A", "B")) # returns FALSE, original_colnames contains "C" as well
identical(original_colnames_copy, c("A", "B")) # returns TRUE
This problem does not occur if I create the new column with dplyr's mutate() instead:
input_table = mutate(input_table, C = A + some_value)
Is this intended (am I missing some deeper knowledge in R / data.table)?
R.version is 4.1.0, data.table version is 1.14.0.
Answered by the comments below the question:
Why does data.table update names(DT) by reference, even if I assign to another variable?
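The behaviour described in the question can be checked with a short sketch like the one below. It assumes, as the linked question's title suggests, that names(DT) can hand back the very character vector that data.table later modifies in place, whereas copy() produces an independent vector. The variable names nm_ref and nm_copy are made up for illustration.

```r
library(data.table)

DT <- data.table(A = 1:3, B = 2:4)

nm_ref  <- names(DT)        # may share memory with DT's names attribute
nm_copy <- copy(names(DT))  # independent copy, detached from DT

# := adds the column by reference, without copying DT
DT[, C := A + 0.025]

print(nm_ref)   # the question reports that "C" shows up here as well
print(nm_copy)  # still only the original names
```

Taking the snapshot with copy(names(DT)) before any := call is the safe option when the original names are needed later.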
I have a data.table with two text columns. I need to use column b to determine which letters in column a to replace with an "x".
I can do it with a for loop, as in the code below. However, my actual data set has 250,000+ rows, so the script takes ages. Is there a more efficient way to do this? I considered lapply but couldn't get my head round it.
DT <- data.table(a = c("ABCD","ABCD","ABCD","ABCD"), b = c("A","B","C", "D"))
DT$c <- ""
for (i in 1 : NROW(DT)){
DT[i]$c <- sub(DT[i,b], "x", DT[i,a])
}
Here is one approach using the tidyverse:
library(tidyverse)
DT <- data.table::data.table(a = c("ABCD","ABCD","ABCD","ABCD"), b = c("A","B","C", "D"))
DT %>%
  mutate(new_vec = str_replace_all(string = a, pattern = b, replacement = "x"))
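For readers who would rather stay within data.table, the loop from the question can be replaced by a vectorised call. This is a minimal sketch, not taken from the answers above: option 1 uses base R's mapply() to pair each pattern with the string in the same row, and option 2 groups by b so that sub() runs once per distinct pattern. The column name c2 is made up for illustration.

```r
library(data.table)

DT <- data.table(a = c("ABCD", "ABCD", "ABCD", "ABCD"),
                 b = c("A", "B", "C", "D"))

# Option 1: call sub(pattern = b[i], replacement = "x", x = a[i]) for each row
DT[, c := mapply(sub, b, "x", a, USE.NAMES = FALSE)]

# Option 2: within each group b is a single value, so sub() is vectorised over a
DT[, c2 := sub(b, "x", a), by = b]

print(DT)
```

Like the original loop, both options use sub(), so only the first match in each string is replaced and b is interpreted as a regular expression. Option 2 is likely to pay off when b has few distinct values relative to the number of rows.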
In a data.table, for non-by-reference j-expressions, I would like to be able to set the resulting column name(s) using a character vector.
As I am in a group-by setting, I cannot use the by-reference syntax, because this would introduce many duplicated rows. Since a j-expression in this case can be specified as a list, my solution uses stats::setNames(). This, however, gives me a message (which might even be turned into a warning in the future). How do I achieve my goal without data.table complaining about efficiency?
my_fun <- function(tbl, new_names = c("mean", "var")) {
tbl[, setNames(list(mean(b), var(b)), new_names), by = "a", verbose = TRUE]
}
dt <- data.table::data.table(
a = sample(letters[1:5], 1e3, replace = TRUE),
b = rnorm(1e3)
)
my_fun(dt)
While the results are as intended:
   a        mean       var
1: a -0.04117688 1.1080222
2: e  0.00158758 1.1629461
3: c -0.04328856 0.9848994
4: d -0.04832948 0.8760644
5: b  0.10856561 0.9313874
I would like to get rid of the following message:
Making each group and running j (GForce FALSE) ... The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.
Moving setNames() outside the grouped call changed the message to "Making each group and running j (GForce TRUE) ... 0.000sec":
my_fun <- function(tbl, new_names = c("a", "mean", "var")) {
setNames(tbl[, list(mean(b), var(b)), by = "a", verbose = TRUE], new_names)
}
my_fun(dt)
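A variation on the same idea, sketched here as an assumption rather than as part of the answer: keep j as an unnamed list, so that data.table labels the results V1 and V2, and then rename those columns by reference with data.table's setnames(). The intermediate name res is made up for illustration.

```r
library(data.table)

my_fun <- function(tbl, new_names = c("mean", "var")) {
  # Unnamed list in j: the result columns come back as V1, V2
  res <- tbl[, list(mean(b), var(b)), by = "a"]
  # Rename in place; the grouping column keeps its own name
  setnames(res, c("V1", "V2"), new_names)
  res[]
}

set.seed(1)
dt <- data.table(
  a = sample(letters[1:5], 1e3, replace = TRUE),
  b = rnorm(1e3)
)
out <- my_fun(dt)
print(out)
```

With this variant new_names only has to cover the computed columns, so the grouping column does not need to be repeated in the argument.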
Is there any way to fill a completely empty data.table in R? I have to fill a data.table whose columns are produced by a function called in a loop. Before running the function I don't know how many columns will be created or how long they will be, but I do know that they will all be the same length.
My approach is to create an empty data.table and fill it within the loop. This does not work, either because I cannot append columns to an empty data.table or because I cannot insert rows properly. See the toy example below (for the sake of simplicity, let's leave out the for loop):
f <- function() c(1, 2, 3) # stand-in for the real function; just imagine that it returns c(1,2,3)
# this does not work; it fails with the error: Cannot use := to add columns to a null data.table (no columns), currently
dt <- data.table()
dt[, c("b", "c") := list(f(), f())]
# this actually works, but it creates a 0-row data.table which cannot be filled later
dt <- data.table(a = numeric() )
dt[, c("b", "c") := list(numeric(), numeric())]
dt[, c("b", "c") := list(f(), f())] # no rows are added
# this workaround works but is ugly as hell
dt <- data.table(a = rep(NA, length(f())) )
dt[, c("b", "c") := list(f(), f())]
dt[, a := NULL]
So is there an elegant/efficient way of approaching this?
You can use something like this:
library("data.table")
f <- function(x) c(1,2,3)
dt <- as.data.table(lapply(11:13, f))
setnames(dt, c("a", "b", "c"))
The lapply() is doing the loop you mentioned in your question.
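If an explicit loop is preferred, the same idea can be sketched by collecting the columns in a plain named list and converting it once at the end. The function f() and the column names below are hypothetical stand-ins for whatever the real loop produces.

```r
library(data.table)

f <- function() c(1, 2, 3)  # hypothetical stand-in for the real function

cols <- list()
for (nm in c("b", "c")) {
  # Each iteration contributes one full column of equal length
  cols[[nm]] <- f()
}

# Build the data.table once, after the loop
dt <- as.data.table(cols)
print(dt)
```

This avoids the empty data.table altogether, so neither the dummy column nor the := call on a null table is needed.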
I often need to store complex objects in a data.table, so I use list columns a lot. Sometimes I need to initialize a column with an empty list, and sometimes I need to copy a list column to another column.
At first there seemed to be nothing wrong with this:
# the example at https://stackoverflow.com/a/22536321/3718827
dt = data.table(id = 1:2, comment = vector("list", 2L))
# I need to add the column to an existing data.table rather than create one from scratch, so I modified it slightly
dt <- data.table(id = 1:2)
dt[, comment := vector("list", nrow(dt))]
dt[1, comment := .(list(list(a = "a", b = "b")))]
dt[, edited_comment := comment]
However, if there is only one row, the code above no longer works:
> dt <- data.table(id = 1)
> dt[, comment := vector("list", nrow(dt))]
Warning message:
In `[.data.table`(dt, , `:=`(comment, vector("list", nrow(dt)))) :
Adding new column 'comment' then assigning NULL (deleting it).
> dt[1, comment := .(list(list(a = "a", b = "b")))]
> dt[, edited_comment := comment]
Warning message:
In `[.data.table`(dt, , `:=`(edited_comment, comment)) :
Supplied 2 items to be assigned to 1 items of column 'edited_comment' (1 unused)
I tried a different syntax, and this seemed to work:
> dt <- data.table(id = 1)
> dt[, comment := .(list(NULL))]
> dt[1, comment := .(list(list(a = "a", b = "b")))]
> dt[, edited_comment := list(comment)]
For the list column initialization, perhaps the example I found is only meant for the data.table() call and should not be used to add a column to an existing table.
For copying a column, the behavior is understandable, but it is surprising that it works when nrow > 1 and fails when nrow = 1.
My question is: what is the proper syntax for these two tasks?
And is the difference in behavior of the first version between nrow > 1 and nrow = 1 expected, or is it a bug?
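One reading of the warnings above, offered as an assumption rather than a confirmed explanation: := treats a list on the right-hand side as a list of columns, so a length-1 list is unpacked as a single column instead of being stored as a one-row list column. Under that assumption, wrapping the value in an extra list() keeps the syntax the same for any number of rows. The sketch below extends the second version from the question and uses .N for the row count.

```r
library(data.table)

dt <- data.table(id = 1)

# Outer list() = list of columns; inner vector("list", .N) = the list column itself
dt[, comment := list(vector("list", .N))]
dt[1, comment := .(list(list(a = "a", b = "b")))]

# Wrap the existing list column the same way when copying it
dt[, edited_comment := list(comment)]

str(dt$edited_comment)
```

The same two wrapped calls should also work when dt has more than one row, which would remove the need for separate code paths.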
I want to substitute parts of a transform() call with variable inputs.
I have created a data frame df by subsetting an existing table to the columns named in col1:
col1 = c('A','B','C')
The df looks something like this:
A = c(1, 3)
B = c(3, 1)
C = c(5, 2)
df = data.frame(A, B, C)
I now want to automate calculations which manually would look like this:
df <- transform(df, 'ABC' = (A + B + C))
where (A + B + C) refers to the columns of the df. Because I have hundreds of 'col1's, I can't do it by hand. I was trying to use something similar to %s (as available in Python 2.x), yet so far nothing has really worked, and I understand too little of R (is this related to eval()?) to get things working (I tried paste, as.formula, sprintf, substitute etc.).
Using cv(col1), I'm trying to paste the output inside the transform function, yet the furthest I got was transform trying to grab values from the environment (not columns) when using as.formula.
cv = function(var){
output = paste('(', paste(var, collapse = ' + '), ')', sep = '')
return(output)
}
Would appreciate any hints or ideas!
You have maneuvered yourself into a strange corner. This is easy with R:
cols <- c("A", "B", "C")
df[, paste(cols, collapse = "")] <- rowSums(df[, cols])
#alternatively for other binary functions:
#Reduce("+", df[, cols])
# A B C ABC
#1 1 3 5 9
#2 3 1 2 6
You can get a similar effect using mutate from dplyr:
library(dplyr)
cols <- c("A", "B", "C")
df %>% mutate_(.dots = setNames(paste(cols, collapse = '+'),
'new_column_name'))
Here we tell mutate_ (note the trailing _) what to compute via paste(), which yields "A+B+C", and use setNames to name the new column.
I acknowledge the syntax is somewhat convoluted; this is a consequence of non-standard evaluation in dplyr. But if you want to do this in the dplyr ecosystem, this is the way to do it.
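The underscore verbs such as mutate_() have since been deprecated in dplyr. A minimal sketch of the same calculation with current tidy-evaluation tools follows; it assumes dplyr 1.0.0 or later (for across()) and reuses the example data from the question. The helper variable new_name is made up for illustration.

```r
library(dplyr)

df <- data.frame(A = c(1, 3), B = c(3, 1), C = c(5, 2))
cols <- c("A", "B", "C")
new_name <- paste(cols, collapse = "")  # "ABC"

# across(all_of(cols)) selects the columns by name;
# !!new_name := sets the new column's name from a string
df <- df %>%
  mutate(!!new_name := rowSums(across(all_of(cols))))

print(df)
```

Because the column names stay in an ordinary character vector, the same call works unchanged for hundreds of columns.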