I'm trying to manipulate a number of data.tables in similar ways, and would like to write a function to accomplish this. I would like to pass in a parameter containing a list of columns on which the operations will be performed. This works fine when the vector of column names is declared on the left-hand side of the := operator, but not when it is declared earlier (or passed into the function). The following code shows the issue.
library(data.table)
dt = data.table(a = letters, b = 1:2, c = 1:13)
colsToDelete = c('b', 'c')
dt[,colsToDelete := NULL] # doesn't work but I don't understand why not.
dt[,c('b', 'c') := NULL] # works fine, but doesn't allow passing in of columns
The error is "Adding new column 'colsToDelete' then assigning NULL (deleting it)." So clearly, it's interpreting 'colsToDelete' as a new column name.
The same issue occurs when doing something along these lines
dt[, colNames := lapply(.SD, adjustValue, y=factor), .SDcols = colNames]
I'm new to R, but rather more experienced with some other languages, so this may be a silly question.
It's basically because we allow symbols on the LHS of := to add new columns, for convenience, e.g. DT[, col := val]. So, to distinguish col itself being the column name from the column names being whatever is stored in col, we check whether the LHS is a name or an expression.
If it's a name, the column is added with that name; if it's an expression, it gets evaluated and its value is used as the column name(s).
DT[, col := val] # col is the column name.
DT[, (col) := val] # col gets evaluated and replaced with its value
DT[, c(col) := val] # same as above
The preferred idiom is: dt[, (colsToDelete) := NULL]
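For completeness, a minimal runnable sketch of the idiom applied to the question's examples (adjustValue and factor are names from the question, assumed to be defined):

library(data.table)
dt <- data.table(a = letters[1:4], b = 1:4, c = 5:8)
colsToDelete <- c('b', 'c')
dt[, (colsToDelete) := NULL]  # parentheses force evaluation; deletes b and c by reference
# the same idiom fixes the .SDcols example from the question:
# dt[, (colNames) := lapply(.SD, adjustValue, y = factor), .SDcols = colNames]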
HTH
I am surprised no answer provided uses the set() function.
set(DT, , colsToDelete, NULL)
This should be the easiest.
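For example, a small sketch (set() deletes the columns by reference and returns the data.table invisibly):

library(data.table)
DT <- data.table(a = 1:3, b = 4:6, c = 7:9)
colsToDelete <- c('b', 'c')
set(DT, j = colsToDelete, value = NULL)  # DT now contains only column a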
To extend on the previous answer, you can delete columns by reference by doing:
# delete columns 10 to 15
dt[ , (10:15) := NULL ]
or
# delete columns 3, 5 and 10 to 15
dt[ , (c(3,5,10:15)) := NULL ]
This code did the job for me. You need the positions of the columns to be deleted, e.g. posvec, as mentioned in ?set:
j: Column name(s) (character) or number(s) (integer) to be assigned value
when column(s) already exist, and only column name(s) if they are to
be created.
DT_removed_selected_col = set(DT, j = posvec, value = NULL)  # set() modifies DT by reference; the assignment just names the same object
Also if you want to get the posvec you can try this:
selected_col = c('col_a','col_b',...)
# note: grep() does partial matching; use fixed = TRUE or an anchored pattern for exact names
selected_col = unlist(sapply(selected_col, function(x) grep(x, names(DT))))
namvec = names(selected_col) #col names
posvec = unname(selected_col) #col positions
Given a data.table DT with a column Col1, select the rows of DT where the values x in Col1 satisfy some boolean expression, for example f(x) == TRUE or f(x) <= 4, and then perform further data.table operations.
For example, I tried something like
DT[f(Col1) == TRUE, Col2 := 2]
which does not work because f() acts on single values, not vectors. Using lapply() seems to work, but it takes a long time to run with a very large DT.
A workaround would be to create a column and use it to select the rows:
DT[, fvalues := f(Col1)][fvalues == TRUE, Col2 := 2]
but it would be better not to increase the size of DT.
EDIT: Here is an example.
map1<-data.table(k1=c("A","B","C"), v=c(-1,2,3))
map2<-data.table(k2=c("A","B","A","A","C","B"), k3=c("A","B","C","B","C","B"))
f <- function(x) map1[k1 == x, v]
To flag the rows in map2 using the corresponding value in map1, these do not work (they return an error about a length mismatch around ==):
map2[f(k2) == 2, flag1 := TRUE]
map2[f(k2) + f(k3) == 2, flag2 := TRUE]
Using lapply(), the first one works, but it is somehow slower (for a large data table) than adding a column to map2 with the values of f and selecting based on that new column:
map2[lapply(k2,f) == 2, flag1 := TRUE]
and the second one
map2[lapply(k2,f) + lapply(k3,f) == 2, flag2 := TRUE]
returns an error (non-numeric argument).
The question would be how to do this most efficiently, particularly without making the data table larger.
I think the problem is perhaps in the parts that aim to modify the data.table in place (i.e. your := parts). I don't think you can filter in place, since filtering really requires writing to a new memory location.
This works for filtering, whilst creating a new object:
library(data.table)
f <- function(x) x > 0.5
DT <- data.table(Col1 = runif(10))
DT[f(Col1),]
#> Col1
#> 1: 0.7916055
#> 2: 0.5391773
#> 3: 0.6855657
#> 4: 0.5250881
#> 5: 0.9089948
#> 6: 0.6639571
To do more data.table operations on a filtered table, assign to a new object and work with that one:
DT2 <- DT[f(Col1),]
DT2[, Col2 := 2]
Perhaps I've misunderstood your problem though - what function are you using? Could you post more code so we can replicate your problem more precisely?
If f is a function working only on scalar values, you could Vectorize it:
DT[Vectorize(f)(Col1)]
Not sure this fully answers your question, because Vectorize uses mapply under the hood.
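Applied to the map1/map2 example from the question, a sketch of this would be:

map2[Vectorize(f)(k2) == 2, flag1 := TRUE]
map2[Vectorize(f)(k2) + Vectorize(f)(k3) == 2, flag2 := TRUE]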
I've edited this question based on comments by #akrun (thank you!), realizing I didn't accurately ask my question.
I'm confused why the following doesn't return the contents of the last column in a data table.
> dt <- data.table(A=c(10,10,10),B=c(20,20,20),C=c(30,30,30))
> dt[,ncol(dt)]
[1] 3
If I use with=F it behaves as I would expect, returning the last column as a data.table:
> dt[,ncol(dt),with=F]
C
1: 30
2: 30
3: 30
This returns the same result as dt[,3], which makes sense. But why is dt[,ncol(dt)] not equivalent to dt[,3]? From ?data.table:
When j is a vector of column names or positions to select (as in data.frame). There is no need to use with=FALSE anymore.
Doesn't ncol(dt) return a vector of column positions, a vector of length one? Why doesn't dt[,ncol(dt)] return the contents of the last column?
Thanks for your help!
We need drop = FALSE for data.frame
df[,ncol(df), drop = FALSE]
as by default it is TRUE if we check the ?Extract
x[i, j, ... , drop = TRUE]
For the data.table, we need with = FALSE. (The reason dt[, ncol(dt)] returns 3 is that j here is a call, not a literal vector of positions: data.table evaluates the expression ncol(dt) and returns its value, the number 3. with = FALSE makes the evaluated result be treated as column positions instead.)
dt[, ncol(dt), with = FALSE]
and it is mentioned in the ?data.table help
When j is a character vector of column names, a numeric vector of column positions to select or of the form startcol:endcol, and the value returned is always a data.table. with=FALSE is not necessary anymore to select columns dynamically. Note that x[, cols] is equivalent to x[, ..cols] and to x[, cols, with=FALSE] and to x[, .SD, .SDcols=cols].
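As a sketch of the idioms that help passage mentions, applied to the last column:

cols <- ncol(dt)
dt[, ..cols]               # same as dt[, cols, with = FALSE]
dt[, .SD, .SDcols = cols]  # also returns the last column as a data.table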
Is it possible to create a new column and keep (a few) existing columns in the same statement? E.g. creating an "x" column and then keeping only "x" and "mpg":
dt <- data.table(mtcars)
dt[,x:=mpg]
dt[,.(x,mpg)]
If you want to do the replacement by reference using :=, then you can do
dt[, x:=mpg][, setdiff(colnames(dt), c('x', 'mpg')) := NULL]
If we need it in a single step, instead of using := to modify the original dataset, specify it with = inside list() or .():
dt[,.(x = mpg, mpg)]
Or if it is necessary to create the column in the original dataset, the calls can be chained:
dt[, x := mpg][, .(x, mpg)]
If we want to update the columns in the original object, another option is set
set(dt[, x:= mpg], i = NULL, j = names(dt)[!names(dt) %in% c('x', 'mpg')], value = NULL)
I've been getting used to data.tables and just cannot seem to find the answer to something that feels so simple (or at least it is with data frames).
I want to use data.table to aggregate, however, I don't always know which column to aggregate ahead of time (it takes input from the user). I want to define what column to use based off of a character vector. Here's a short example of what I want to do:
require(data.table)
myDT <- data.table(a = 1:10, b = 11:20, n1 = c("first", "second"))
aggWith <- "a"
Now I want to use the aggWith object to define what column to sum on. This does not work:
> myDT.Agg <- myDT[, .(Agg = sum(aggWith)), by = .(n1)]
Error in sum(aggWith) : invalid 'type' (character) of argument
Nor does this:
> myDT.Agg <- myDT[, .(Agg = sum(aggWith)), by = .(n1), with = FALSE]
Error in sum(aggWith) : invalid 'type' (character) of argument
This does:
myDT.Agg <- myDT[, .(Agg = sum(a)), by = .(n1)]
However, I want to be able to define which column "a" is, arbitrarily, based off a character vector. I've looked through ?data.table, but am just not seeing what I need. Sorry in advance if this is really simple and I'm just overlooking something.
We could specify 'aggWith' as .SDcols and then take the sum of the first column of .SD:
myDT[, list(Agg= sum(.SD[[1L]] )), by = n1, .SDcols=aggWith]
If there are multiple columns, then loop with lapply
myDT[, lapply(.SD, sum), by = n1, .SDcols= aggWith]
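For instance, with the example data above and both numeric columns (expected output shown as comments):

myDT[, lapply(.SD, sum), by = n1, .SDcols = c('a', 'b')]
#        n1  a  b
# 1:  first 25 75
# 2: second 30 80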
Another option would be to use eval(as.name()):
myDT[, list(Agg= sum(eval(as.name(aggWith)))), by = n1]
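A closely related idiom (not part of the original answer, but standard data.table usage) is get():

myDT[, .(Agg = sum(get(aggWith))), by = n1]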
While reading a data set using fread, I've noticed that sometimes I'm getting duplicated column names, for example (fread doesn't have a check.names argument):
> data.table( x = 1, x = 2)
x x
1: 1 2
The question is: is there any way to remove 1 of 2 columns if they have the same name?
How about
dt[, .SD, .SDcols = unique(names(dt))]
This selects the first occurrence of each name (I'm not sure how you want to handle this).
As #DavidArenburg suggests in comments above, you could use check.names=TRUE in data.table() or fread()
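For instance (check.names = TRUE applies make.names() to deduplicate the names):

dt <- data.table(x = 1, x = 2, check.names = TRUE)
names(dt)
# [1] "x"   "x.1"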
.SDcols approaches would return a copy of the columns you're selecting. Instead just remove those duplicated columns using :=, by reference.
dt[, which(duplicated(names(dt))) := NULL]
# x
# 1: 1
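If you would rather keep the last occurrence of each name instead of the first, a small variation using duplicated()'s fromLast argument should work:

dt[, which(duplicated(names(dt), fromLast = TRUE)) := NULL]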
Different approaches:
Indexing
my.data.table <- my.data.table[ ,-2]
Subsetting
my.data.table <- subset(my.data.table, select = -2)
Making unique names if 1. and 2. are not ideal (when having hundreds of columns, for instance)
setnames(my.data.table, make.names(names = names(my.data.table), unique=TRUE))
Optionally, systematize the deletion of variables whose names meet some criterion (here, we'll get rid of all variables whose name ends with ".X", X being a number starting at 2 when using make.names):
my.data.table <- subset(my.data.table,
select = !grepl(pattern = "\\.\\d$", x = names(my.data.table)))