Referring to a column through a character variable in data.table R - r

I am using data.table and I want to filter a data.table within a function where I am passing the name of the column as a character vector.
For a simple reproducible example let's take the mtcars dataset of base R.
I can write using data.table syntax:
mtcars[am == 1, .N ]
But what if the name of the variable of interest -- i.e. am -- is stored as a character vector, i.e. "am"?
Your advice will be appreciated.

One option is to use get (?get search object by name):
mtcars[get('am') == 1, .N]
# [1] 13

Another option is to specify it in the .SDcols
mtcars[, sum(.SD==1) ,.SDcols = 'am']
#[1] 13
We can also include multiple variables
mtcars[, sum(Reduce(`&`, lapply(.SD, `==`, 1))), .SDcols = c('am', 'carb')]
#[1] 4

Related

Replace NULL with NA in r data.table with lists

I have a dataset imported from a MongoDb database as a data.table, where some of the columns are formated as lists and contain some NULL values. The NULL values were causing me some issues when trying to fill a column in another data.table by reference to the first table, as the destination column was not in list format (and therefore can't have NULL values).
I found a solution below, which works fine for now, but my test dataset is only 6 records and I'm wondering if this would struggle when working with larger datasets or if there is a more efficient way to do this (in data.table)?
Here is some example data:
library(data.table)
dt <- data.table(id = c(1,2,3), age = list(12, NULL, 15), sex = list("F", "M", NULL))
And here is the solution I applied:
# Function to change NULL to NA in a data.table with lists:
null2na <- function(dtcol){
nowna = lapply(dtcol, function(x) ifelse(is.null(x), NA_real_, x))
return(nowna)
}
# Apply the function to the data.table to replace NULLs with NAs:
dt[, c(names(dt)) := lapply(.SD, null2na), .SDcols = names(dt)]
You can save one lapply call by using the lengths function.
library(data.table)
null2na <- function(dtcol){
dtcol[lengths(dtcol) == 0] <- NA
return(dtcol)
}
dt[, names(dt) := lapply(.SD, null2na)]
dt
# id age sex
#1: 1 12 F
#2: 2 NA M
#3: 3 15 NA
The age and sex column are still lists. If you want them as a simple vector return unlist(dtcol) from the function.
Here another way to solve your problem:
cols <- names(dt)[sapply(dt, is.list)] # get names of list columns
dt[, (cols) := lapply(.SD, function(x) replace(x, lengths(x)==0L, NA)), .SDcols=cols]
My toy example is too small to compare timings, but combining both solutions suggested by #B. Christian Kamgang and #Ronak Shah works well for me:
# Function to replace NULL with NA in lists:
null2na <- function(dtcol){
fullcol = replace(dtcol, lengths(dtcol) == 0L, NA)
return(fullcol)
# Apply function to dataset:
dt[, names(dt) := lapply(.SD, null2na)]
Two things I found advantageous with this approach (thanks to both respondants for suggesting):
Avoiding use of base r ifelse, dplyr::if_else and data.table::fifelse; base r ifelse converts all columns to a list unless you specify them before-hand, and the dplyr and data.table versions of ifelse, while they respect the original column classes don't work in this scenario because NA is interpreted as differing in type from the other values in the list.
The use of the function lengths(dtcol) == 0L targets specifically only the list elements that are null and doesn't do anything to the other columns or values. This means that it is not necessary to specify the subset of columns that are lists before-hand, as inherently it deals only with those.
I've gone with replace() rather than subsetting dtcol in the function as I think with larger datasets the former might be slightly faster (but have yet to test that).

Why do I have to use `[[1]]` with data.table everywhere? [duplicate]

This question already has answers here:
Select column of data.table and return vector
(2 answers)
Closed 2 years ago.
After indexing DT column with a variable name, the data is returned as type data.table data.frame, and the column is not an accessible vector, I have to unlist it first. Am I doing everything as intended?
Consider this example:
require(data.table)
DT <- data.table(a=seq(1.001, 10.999, length=100), b=factor(c(rep('a', 55), rep('b', 45))))
col.name <- 'a'
diff(DT[, col.name]) #column name not found error
diff(DT[, col.name, with=FALSE]) #null data table
diff(DT[, col.name, with=FALSE][[1]]) #works
The second example is what question is about.
You have many options to retrieve single columns. In my opinion, the most readable option, is using .SD, though not the fastest. It's also often desired that single column data.tables are not converted to vectors.
require(data.table)
DT <- data.table(a=seq(1.001, 10.999, length=100), b=factor(c(rep('a', 55), rep('b', 45))))
DT[ , get(col.name) ] # vector
DT[[ col.name ]] # vecotr
DT[ , col.name, with = FALSE ] # data.table
DT[ , .SD, .SDcols = col.name ] # data.table
Two (straight-forward) ways to to get just the vector and not the data.table, given your inputs:
DT[, get(col.name)]
DT[[col.name]] # as per comment by ismirsehregal
And some equally convoluted as DT[, col.name, with=FALSE][[1]]
with(DT, eval(parse(text=col.name)))
DT[, ..col.name][[1]] # as per comment by ismirsehregal

dynamically subseting data.table in R

my data.table contain K columns called claims, among other 30 columns. I want to subset the data.table, such that only rows remain which do not have 0 claims.
So, firstly i get all the column names i need for filtering. For the purpose of this example, i have chosen K = 2
> claimsCols = c("claimsnext", paste0("claims" , 1:K))
> claimsCols
[1] "claimsnext" "claims1" "claims2"
i have tried subsetting like:
for(i in claimsCols){
BTplan <- BTplan[ claimsCols[i] == 0, ]
i+1
}
this doent work:
Error in i + 1 : non-numeric argument to binary operator
I am sure there is a better way to do this?
I would basically do what akrun does
idx = BTplan[ , Reduce(`&`, .SD), .SDcols = patterns('claims')]
BTplan = BTplan[idx]
The innovations are:
Use patterns in .SDcols to specify the columns to include by pattern
& automatically converts numeric to logical, i.e. 1.1 & 2.2 is TRUE, and becomes FALSE as soon as there's a 0 anywhere (hence filtering the corresponding row)
In a future version of data.table this will be slightly more efficient and concise (and hopefully more readable):
idx = BTplan[ , pall(.SD), .SDcols = patterns('claims')]
BTplan = BTplan[idx]
Keep an eye on this pull request:
https://github.com/Rdatatable/data.table/pull/4448
In the OP's code, the i is each of the elements of 'claimsCols' which is character, so i +1 won't work and in fact, it is not needed
for(colnm in claimsCols) {
BTplan <- BTplan[BTplan[[colnm]] != 0,]
}
Or using data.table syntax
library(data.table)
setDT(BTplan)
BTplan[BTplan[, Reduce(`&`, lapply(.SD, `!=`, 0)),.SDcols = claimsCols]]

how to deal with data.table column as.character in R?

I'm trying to use data.table rather data.frame(for a faster code). Despite the syntax difference between than, I'm having problems when I need to extract a specific character column and use it as character vector. When I call:
library(data.table)
DT <- fread("file.txt")
vect <- as.character(DT[, 1, with = FALSE])
class(vect)
###[1] "character"
head(vect)
It returns:
[1] "c(\"uc003hzj.4\", \"uc021ofx.1\", \"uc021olu.1\", \"uc021ome.1\", \"uc021oov.1\", \"uc021opl.1\", \"uc021osl.1\", \"uc021ovd.1\", \"uc021ovp.1\", \"uc021pdq.1\", \"uc021pdv.1\", \"uc021pdw.1\")
Any ideas of how to avoid these "\" in the output?
The as.character works on vectors and not on data.frame/data.table objects in the way the OP expected. So, if we need to get the first column as character class, subset with .SD[[1L]] and apply the as.character
DT[, as.character(.SD[[1L]])]
If there are multiple columns, we can specify the column index with .SDcols and loop over the .SD to convert to character and assign (:=) the output back to the particular columns.
DT[, (1:2) := lapply(.SD, as.character), .SDcols= 1:2]
data
DT <- data.table(Col1 = 1:5, Col2= 6:10, Col3= LETTERS[1:5])

How to remove duplicated (by name) column in data.tables in R?

While reading a data set using fread, I've noticed that sometimes I'm getting duplicated column names, for example (fread doesn't have check.names argument)
> data.table( x = 1, x = 2)
x x
1: 1 2
The question is: is there any way to remove 1 of 2 columns if they have the same name?
How about
dt[, .SD, .SDcols = unique(names(dt))]
This selects the first occurrence of each name (I'm not sure how you want to handle this).
As #DavidArenburg suggests in comments above, you could use check.names=TRUE in data.table() or fread()
.SDcols approaches would return a copy of the columns you're selecting. Instead just remove those duplicated columns using :=, by reference.
dt[, which(duplicated(names(dt))) := NULL]
# x
# 1: 1
Different approaches:
Indexing
my.data.table <- my.data.table[ ,-2]
Subsetting
my.data.table <- subset(my.data.table, select = -2)
Making unique names if 1. and 2. are not ideal (when having hundreds of columns, for instance)
setnames(my.data.table, make.names(names = names(my.data.table), unique=TRUE))
Optionnaly systematize deletion of variables which names meet some criterion (here, we'll get rid of all variables having a name ending with ".X" (X being a number, starting at 2 when using make.names)
my.data.table <- subset(my.data.table,
select = !grepl(pattern = "\\.\\d$", x = names(my.data.table)))

Resources