To get rid of a column named "foo" in a data.frame, I can do:
df <- df[-grep('foo', colnames(df))]
However, once df is converted to a data.table object, the same expression no longer removes the column.
Example:
df <- data.frame(id = 1:100, foo = rnorm(100))
df2 <- df[-grep('foo', colnames(df))] # works
df3 <- data.table(df)
df3[-grep('foo', colnames(df3))]
But once it is converted to a data.table object, this no longer works: with a single argument inside [], data.table interprets the expression as a row subset rather than a column selection.
Any of the following will remove column foo from the data.table df3:
# Method 1 (and preferred as it takes 0.00s even on a 20GB data.table)
df3[,foo:=NULL]
df3[, c("foo","bar"):=NULL] # remove two columns
myVar = "foo"
df3[, (myVar):=NULL] # lookup myVar contents
# Method 2a -- A safe idiom for excluding (possibly multiple)
# columns matching a regex
df3[, grep("^foo$", colnames(df3)):=NULL]
# Method 2b -- An alternative to 2a, also "safe" in the sense described below
df3[, which(grepl("^foo$", colnames(df3))):=NULL]
data.table also supports the following syntax:
## Method 3 (the result could then be assigned back to df3)
df3[, !"foo"]
Though if you actually want to remove column "foo" from df3 (as opposed to just printing a view of df3 minus column "foo"), you'd really want to use Method 1 instead.
(Do note that if you use a method relying on grep() or grepl(), you need to use pattern = "^foo$" rather than "foo" if you don't want columns with names like "fool" and "buffoon" (i.e. those containing foo as a substring) to also be matched and removed.)
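For instance, a quick illustration with a few hypothetical column names:
cn <- c("foo", "fool", "buffoon", "bar")
grep("foo", cn)    # 1 2 3 -- also matches "fool" and "buffoon"
grep("^foo$", cn)  # 1     -- matches only the exact name "foo"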
Less safe options, fine for interactive use:
The remaining idioms (Method 4 and the with=FALSE variants in Method 5 below) will also work -- if df3 contains a column matching "foo" -- but will fail in a probably-unexpected way if it does not. If, for instance, you use any of them to search for the non-existent column "bar", you'll end up with a zero-row data.table.
As a consequence, they are really best suited for interactive use where one might, e.g., want to display a data.table minus any columns with names containing the substring "foo". For programming purposes (or if you are wanting to actually remove the column(s) from df3 rather than from a copy of it), Methods 1, 2a, and 2b are really the best options.
# Method 4:
df3[, .SD, .SDcols = !patterns("^foo$")]
Lastly, there are approaches using with=FALSE, though data.table is gradually moving away from this argument, so it's now discouraged where you can avoid it; they're shown here so you know the option exists in case you really do need it:
# Method 5a (like Method 3)
df3[, !"foo", with=FALSE]
# Method 5b (like Method 4)
df3[, !grep("^foo$", names(df3)), with=FALSE]
# Method 5c (another like Method 4)
df3[, !grepl("^foo$", names(df3)), with=FALSE]
You can also use set for this, which avoids the overhead of [.data.table in loops:
dt <- data.table( a=letters, b=LETTERS, c=seq(26), d=letters, e=letters )
set( dt, j=c(1L,3L,5L), value=NULL )
> dt[1:5]
b d
1: A a
2: B b
3: C c
4: D d
5: E e
If you want to do it by column name, which(colnames(dt) %in% c("a","c","e")) should work for j.
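For example, a small sketch of that (using a fresh table, since dt above has already had columns removed):
dt2 <- data.table(a = letters[1:3], b = 1:3, c = LETTERS[1:3], e = 4:6)
set(dt2, j = which(colnames(dt2) %in% c("a", "c", "e")), value = NULL)
names(dt2)
# [1] "b"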
I simply do it in the data.frame way:
DT$col = NULL
Works fast and as far as I could see doesn't cause any problems.
UPDATE: not the best method if your DT is very large, as using the $<- operator will lead to object copying. So better use:
DT[, col:=NULL]
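A small sketch illustrating the difference, using data.table's address() to check that := NULL modifies the object in place (column names here are hypothetical):
library(data.table)
DT <- data.table(a = 1:3, col = 4:6)
address(DT)        # address before
DT[, col := NULL]  # removes 'col' by reference
address(DT)        # same address afterwards: no copy was made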
A very simple option in case you have many individual columns to delete in a data.table and you want to avoid typing in all the column names (#careadvised):
dt <- dt[, -c(1,4,6,17,83,104)]
This will remove columns based on column number instead.
It's obviously not as efficient because it bypasses data.table's advantages, but if you're working with fewer than, say, 500,000 rows it works fine.
Suppose your dt has columns col1, col2, col3, col4, col5, ..., coln.
To delete a subset of them:
vx <- as.character(bquote(c(col1, col2, col3, coln)))[-1]
DT[, paste0(vx):=NULL]
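For reference, bquote() captures the unevaluated call c(col1, col2, col3, coln), and as.character() turns that call into a character vector whose first element is the function name "c", which the [-1] drops:
as.character(bquote(c(col1, col2, col3, coln)))
# [1] "c"    "col1" "col2" "col3" "coln"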
Here is a way to set a number of columns to NULL given their column names, wrapped in a function for your usage :)
deleteColsFromDataTable <- function(train, toDeleteColNames) {
    for (myNm in toDeleteColNames)
        train <- train[, (myNm) := NULL]
    return(train)
}
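A hypothetical usage sketch:
DT <- data.table(a = 1:3, b = 4:6, c = 7:9)
DT <- deleteColsFromDataTable(DT, c("a", "c"))
names(DT)
# [1] "b"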
DT[,c:=NULL] # remove column c
For a data.table, assigning the column to NULL removes it:
DT[,c("col1", "col1", "col2", "col2")] <- NULL
^
|---- Notice the extra comma if DT is a data.table
... which is the equivalent of:
DT$col1 <- NULL
DT$col2 <- NULL
DT$col3 <- NULL
DT$col4 <- NULL
The equivalent for a data.frame is:
DF[c("col1", "col1", "col2", "col2")] <- NULL
^
|---- Notice the missing comma if DF is a data.frame
Q. Why is there a comma in the version for data.table, and no comma in the version for data.frame?
A. As data.frames are stored as a list of columns, you can skip the comma. You could also add it in, however then you will need to assign them to a list of NULLs, DF[, c("col1", "col2", "col3")] <- list(NULL).
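A minimal sketch of both forms on a toy data.frame (hypothetical column names):
DF <- data.frame(col1 = 1:3, col2 = 4:6, col3 = 7:9)
DF[c("col1", "col2")] <- NULL            # no comma: list-style deletion
# DF[, c("col1", "col2")] <- list(NULL)  # with the comma, as described above
names(DF)
# [1] "col3"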
I have a dataset imported from a MongoDB database as a data.table, where some of the columns are formatted as lists and contain some NULL values. The NULL values were causing me some issues when trying to fill a column in another data.table by reference to the first table, as the destination column was not in list format (and therefore can't hold NULL values).
I found a solution below, which works fine for now, but my test dataset is only 6 records and I'm wondering if this would struggle when working with larger datasets or if there is a more efficient way to do this (in data.table)?
Here is some example data:
library(data.table)
dt <- data.table(id = c(1,2,3), age = list(12, NULL, 15), sex = list("F", "M", NULL))
And here is the solution I applied:
# Function to change NULL to NA in a data.table with lists:
null2na <- function(dtcol){
    nowna = lapply(dtcol, function(x) ifelse(is.null(x), NA_real_, x))
    return(nowna)
}
# Apply the function to the data.table to replace NULLs with NAs:
dt[, c(names(dt)) := lapply(.SD, null2na), .SDcols = names(dt)]
You can save one lapply call by using the lengths function.
library(data.table)
null2na <- function(dtcol){
    dtcol[lengths(dtcol) == 0] <- NA
    return(dtcol)
}
dt[, names(dt) := lapply(.SD, null2na)]
dt
# id age sex
#1: 1 12 F
#2: 2 NA M
#3: 3 15 NA
The age and sex columns are still lists. If you want them as simple vectors, return unlist(dtcol) from the function; see the sketch below.
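For instance, a sketch of that variant returning plain vectors (null2na_vec is just a hypothetical name):
null2na_vec <- function(dtcol){
    dtcol[lengths(dtcol) == 0] <- NA
    unlist(dtcol)
}
dt[, names(dt) := lapply(.SD, null2na_vec)]
str(dt)  # age and sex are now atomic vectors rather than lists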
Here is another way to solve your problem:
cols <- names(dt)[sapply(dt, is.list)] # get names of list columns
dt[, (cols) := lapply(.SD, function(x) replace(x, lengths(x)==0L, NA)), .SDcols=cols]
My toy example is too small to compare timings, but combining both solutions suggested by @B. Christian Kamgang and @Ronak Shah works well for me:
# Function to replace NULL with NA in lists:
null2na <- function(dtcol){
    fullcol = replace(dtcol, lengths(dtcol) == 0L, NA)
    return(fullcol)
}

# Apply function to dataset:
dt[, names(dt) := lapply(.SD, null2na)]
Two things I found advantageous with this approach (thanks to both respondents for suggesting):
It avoids base R ifelse, dplyr::if_else and data.table::fifelse: base R ifelse converts all columns to lists unless you specify them beforehand, and the dplyr and data.table versions, while they respect the original column classes, don't work in this scenario because NA is interpreted as differing in type from the other values in the list.
The condition lengths(dtcol) == 0L targets specifically the list elements that are NULL and doesn't do anything to the other columns or values. This means it is not necessary to specify the subset of list columns beforehand, as it inherently deals only with those.
I've gone with replace() rather than subsetting dtcol in the function as I think with larger datasets the former might be slightly faster (but have yet to test that).
Let's say I have the following data.table.
dt = data.table(one=rep(2,4), two=rnorm(4))
dt
Now I have created a variable with a name of one column.
col_name = "one"
If I want to return that column as a data.table, I can do one of the following. The first option will return the column name as V1 and the second will actually set the column name to "one".
dt[,.(get(col_name))]
dt[,col_name, with=FALSE]
I'm wondering if there is a way to specify the column name while using the get command. Something like the following, which doesn't work.
dt[,as.symbol(col_name) = .(get(col_name))]
The reason that I need the column names with get is that I have a pretty extensive loop whereby I'm filling in empty columns. So it could end up looking like this, whereby I loop through and replace imp_val with the median by the columns in cols.
dat2[is.na(get(imp_val)),
as.symbol(imp_val) := dat2[.BY, median(get(imp_val), na.rm=TRUE), on=get(cols)], by=c(get(cols))]
We can specify it in .SDcols:
dt[, .SD,.SDcols = col_name]
Or with ..
dt[, ..col_name]
If the intention is to rename the column as 'col_name':
setnames(dt[, ..col_name], deparse(substitute(col_name)))[]
# col_name
#1: 2
#2: 2
#3: 2
#4: 2
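For the second part of the question (filling NAs with a group-wise median while the column name is stored in a variable), a simplified sketch along the same lines might be (toy data; the table and column names are made up):
dat2 <- data.table(grp = c("a", "a", "b", "b"), val = c(1, NA, 3, NA))
imp_val <- "val"
cols <- "grp"
dat2[, (imp_val) := {
    x <- get(imp_val)
    x[is.na(x)] <- median(x, na.rm = TRUE)
    x
}, by = cols]
dat2
#    grp val
# 1:   a   1
# 2:   a   1
# 3:   b   3
# 4:   b   3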
You could also use the tidyverse approach for this. Setup:
library(data.table)
library(magrittr)
library(dplyr)
dt = data.table(one=rep(2,4), two=rnorm(4))
col_name = "one"
Then use select with the non-standard evaluation operator !! (pronounced bang-bang):
> dt %>% dplyr::select(!!col_name)
one
1: 2
2: 2
3: 2
4: 2
The returned object is still a data.table:
> dt %>%
dplyr::select(!!col_name) %>%
class
[1] "data.table" "data.frame"
I'm not sure what you mean with the second part of your question on replacing NAs with the median. Maybe you could update your answer with a small example?
I want to move two columns to the front of my data.table (id and time in my case). Say I have:
library(data.table)
Data <- as.data.table(iris)
and say I want the order of the columns to be:
example <- Data
setcolorder(example,c("Species","Petal.Length","Sepal.Length",
"Sepal.Width","Petal.Length","Petal.Width"))
but my actual data.table has many more variables, so I would like to address this as:
setcolorder(Data, c("Species","Petal.Length",
...all other variables in their original order...))
I played around with something like:
setcolorder(Data,c("Species","Petal.Length",
names(Data)[!c("Species","Petal.Length")]))
but I have a problem subsetting the character vector names(Data) by name reference. Also I'm sure I can avoid this workaround with some neat data.table function, no?
We can use setdiff to get all the column names that are not in the subset of names, i.e. 'nm1', and concatenate that with 'nm1' in setcolorder:
nm1 <- c("Species", "Petal.Length")
setcolorder(Data, c(nm1, setdiff(names(Data), nm1)))
names(Data)
#[1] "Species" "Petal.Length" "Sepal.Length" "Sepal.Width" "Petal.Width"
A convenience function for this is:
setcolfirst = function(DT, ...){
    nm = as.character(substitute(c(...)))[-1L]
    setcolorder(DT, c(nm, setdiff(names(DT), nm)))
}
setcolfirst(Data, Species, Petal.Length)
The columns are passed without quotes here, but extension to a character vector is easy.
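An analogous character-vector version might look like this (a sketch, not from the original answer):
setcolfirst_chr = function(DT, cols){
    setcolorder(DT, c(cols, setdiff(names(DT), cols)))
}
setcolfirst_chr(Data, c("Species", "Petal.Length"))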
You can just do
setcolorder(Data,c("Species","Petal.Length"))
similar to using xcols in kdb+/q. ?setcolorder says:
If ‘length(neworder) < length(x)’, the
specified columns are moved in order to the "front" of ‘x’.
My version of data.table is 1.11.4, but it might have been available for earlier versions too.
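A quick check of that behaviour on the iris example (assuming a sufficiently recent data.table):
library(data.table)
Data <- as.data.table(iris)
setcolorder(Data, c("Species", "Petal.Length"))
names(Data)
# [1] "Species" "Petal.Length" "Sepal.Length" "Sepal.Width" "Petal.Width"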
This is totally a riff off of akrun's solution, using a bit more functional decomposition and an anaphoric macro because, well, why not.
I'm no expert in writing R macros, so this is probably a naive solution.
> toFront <- function(vect, ...) {
c(..., setdiff(vect, c(...)))
}
> withColnames <- function(tbl, thunk) {
.CN = colnames(tbl)
eval(substitute(thunk))
}
> vect = c('c', 'd', 'e', 'a', 'b')
> tbl = data.table(1,2,3,4,5)
> setnames(tbl, vect)
> tbl
c d e a b
1: 1 2 3 4 5
> withColnames(tbl, setcolorder(tbl, toFront(.CN, 'a', 'b') ))
> tbl
a b c d e
1: 4 5 1 2 3
While reading a data set using fread, I've noticed that sometimes I'm getting duplicated column names, for example (fread doesn't have a check.names argument):
> data.table( x = 1, x = 2)
x x
1: 1 2
The question is: is there any way to remove 1 of 2 columns if they have the same name?
How about
dt[, .SD, .SDcols = unique(names(dt))]
This selects the first occurrence of each name (I'm not sure how you want to handle this).
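A quick check with the duplicated-name example from the question:
dt <- data.table(x = 1, x = 2)
dt[, .SD, .SDcols = unique(names(dt))]
#    x
# 1: 1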
As @DavidArenburg suggests in comments above, you could use check.names=TRUE in data.table() or fread().
.SDcols approaches would return a copy of the columns you're selecting. Instead just remove those duplicated columns using :=, by reference.
dt[, which(duplicated(names(dt))) := NULL]
# x
# 1: 1
Different approaches:
Indexing
my.data.table <- my.data.table[ ,-2]
Subsetting
my.data.table <- subset(my.data.table, select = -2)
Making unique names if 1. and 2. are not ideal (when having hundreds of columns, for instance)
setnames(my.data.table, make.names(names = names(my.data.table), unique=TRUE))
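For reference, a quick illustration of what make.names does with duplicated names (hypothetical names):
make.names(c("x", "x", "y"), unique = TRUE)
# [1] "x"   "x.1" "y"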
Optionally, systematize the deletion of variables whose names meet some criterion (here, we'll get rid of all variables having a name ending with ".X", X being the number that make.names appends to duplicated names, starting at 1):
my.data.table <- subset(my.data.table,
select = !grepl(pattern = "\\.\\d$", x = names(my.data.table)))
I am trying to figure out an elegant way to use := assignment to replace many columns at once in a data.table by applying a shared function. A typical use of this might be to apply a string function (e.g., gsub) to all character columns in a table. It is not difficult to extend the data.frame way of doing this to a data.table, but I'm looking for a method consistent with the data.table way of doing things.
For example:
library(data.table)
m <- matrix(runif(10000), nrow = 100)
df <- df1 <- df2 <- df3 <- as.data.frame(m)
dt <- as.data.table(df)
head(names(df))
head(names(dt))
## replace V20-V100 with sqrt
# data.frame approach
# by column numbers
df1[20:100] <- lapply(df1[20:100], sqrt)
# by reference to column numbers
v <- 20:100
df2[v] <- lapply(df2[v], sqrt)
# by reference to column names
n <- paste0("V", 20:100)
df3[n] <- lapply(df3[n], sqrt)
# data.table approach
# by reference to column names
n <- paste0("V", 20:100)
dt[, n] <- lapply(dt[, n, with = FALSE], sqrt)
I understand it is more efficient to loop over a vector of column names using := to assign:
for (col in paste0("V", 20:100)) dt[, col := sqrt(dt[[col]]), with = FALSE]
I don't like this because I don't like referencing the data.table in a j expression. I also know that I can use := to assign with lapply given that I know the column names:
dt[, c("V20", "V30", "V40", "V50", "V60") := lapply(list(V20, V30, V40, V50, V60), sqrt)]
(You could extend this by building an expression with unknown column names.)
Below are the ideas I tried on this, but I wasn't able to get them to work. Am I making a mistake, or is there another approach I'm missing?
# possible data.table approaches?
# by reference to column names; assignment works, but not lapply
n <- paste0("V", 20:100)
dt[, n := lapply(n, sqrt), with = FALSE]
# by (smaller for example) list; lapply works, but not assignment
dt[, list(list(V20, V30, V40, V50, V60)) := lapply(list(V20, V30, V40, V50, V60), sqrt)]
# by reference to list; neither assignment nor lapply work
l <- parse(text = paste("list(", paste(paste0("V", 20:100), collapse = ", "), ")"))
dt[, eval(l) := lapply(eval(l), sqrt)]
Yes, you're right in the question here:
I understand it is more efficient to loop over a vector of column names using := to assign:
for (col in paste0("V", 20:100))
dt[, col := sqrt(dt[[col]]), with = FALSE]
Aside: note that the new way of doing that is :
for (col in paste0("V", 20:100))
dt[ , (col) := sqrt(dt[[col]])]
because with = FALSE wasn't easy to read: it wasn't clear whether it referred to the LHS or the RHS of :=. End aside.
As you know, that's efficient because it does each column one by one, so working memory is only needed for one column at a time. That can make the difference between it working and it failing with the dreaded out-of-memory error.
The problem with lapply on the RHS of := is that the RHS (the lapply) is evaluated first; i.e., the result for the 80 columns is created. That's 80 columns' worth of new memory which has to be allocated and populated. So you need 80 columns' worth of free RAM for that operation to succeed. That RAM usage dominates vs the subsequently instant operation of assigning (plonking) those 80 new columns into the data.table's column pointer slots.
As @Frank pointed out, if you have a lot of columns (say 10,000 or more) then the small overhead of dispatching to the [.data.table method starts to add up. To eliminate that overhead there is data.table::set which, under ?set, is described as a "loopable" :=. I use a for loop for this type of operation. It's the fastest way and is fairly easy to write and read.
for (col in paste0("V", 20:100))
set(dt, j = col, value = sqrt(dt[[col]]))
Although with just 80 columns, it's unlikely to matter. (Note it may be more common to loop set over a large number of rows than a large number of columns.) However, looped set doesn't solve the problem of the repeated reference to the dt symbol name that you mentioned in the question:
I don't like this because I don't like referencing the data.table in a j expression.
Agreed. So the best I can do is revert to your looping of := but use get instead.
for (col in paste0("V", 20:100))
dt[, (col) := sqrt(get(col))]
However, I fear that using get in j carries an overhead. Benchmarking of that is in #1380. Also, perhaps it is confusing to use get() on the RHS but not on the LHS. To address that we could sugar the LHS and allow get() as well, #1381:
for (col in paste0("V", 20:100))
dt[, get(col) := sqrt(get(col))]
Also, maybe the value argument of set could be run within the scope of DT, #1382:
for (col in paste0("V", 20:100))
set(dt, j = col, value = sqrt(get(col)))
These should work if you want to refer to the columns by string name:
n = paste0("V", 20:100)
dt[, (n) := lapply(n, function(x) {sqrt(get(x))})]
or
dt[, (n) := lapply(n, function(x) {sqrt(dt[[x]])})]
Is this what you are looking for?
dt[, names(dt)[20:100] := lapply(.SD, function(x) sqrt(x)), .SDcols = 20:100]
I have heard tell that using .SD is not so efficient because it makes a copy of the table beforehand, but if your table isn't huge (obviously that's relative depending on your system specs) I doubt it will make much of a difference.