data.table assignment operator with lists in R

I have a data.table containing a name column, and I'm trying to extract a regular expression from this name. The most obvious way to do it in this case is with the := operator, as I'm assigning this extracted string as the actual name of the data. In doing so, I find that this doesn't actually apply the function in the way that I would expect. I'm not sure if it's intentional, and I was wondering if there's a reason it does what it does or if it's a bug.
library(data.table)
dt <- data.table(name = c('foo123', 'bar234'))
Searching for the desired expression in a simple character vector behaves as expected:
name <- dt[1, name]
pattern <- '(.*?)\\d+'
regmatches(name, regexec(pattern, name))
[[1]]
[1] "foo123" "foo"
I can easily subset this to get what I want
regmatches(name, regexec(pattern, name))[[1]][2]
[1] "foo"
However, I run into issues when I try to apply this to the entire data.table:
dt[, name_final := regmatches(name, regexec(pattern, name))[[1]][2]]
dt
name name_final
1: foo123 foo
2: bar234 foo
I don't know how data.table works internally, but I would guess that the function was applied to the entire name column first, and that the result was then coerced into a vector somehow and assigned to the new name_final column. However, the behavior I would expect here is row-by-row evaluation. I can emulate this behavior by adding a dummy id column:
dt[, id := seq_along(name)]
dt[, name_final := regmatches(name, regexec(pattern, name))[[1]][2], by = list(id)]
dt
name name_final id
1: foo123 foo 1
2: bar234 bar 2
Is there a reason that this isn't the default behavior? If so, I would guess that it had to do with columns being atomic to the data.table rather than the rows, but I'd like to understand what's going on there.

Pretty much nothing in R runs on a row-by-row basis. It's almost always better to work with a whole column of data at a time, so you can assume that the entire column vector will be passed as the argument to your function. Here's a way to extract the second element of each item in the regmatches list:
dt[, name_final := sapply(regmatches(name, regexec(pattern, name)), `[`, 2)]
Functions like sapply() or Vectorize() can "fake" a per-row type call for functions that aren't meant to be run on a vector/list of data at a time.
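To make the recycling visible, here is a minimal sketch using the question's data. The intermediate names matches, first_only and per_row are only illustrative, and vapply() is used here as a stricter stand-in for sapply().

```r
library(data.table)

dt <- data.table(name = c("foo123", "bar234"))
pattern <- "(.*?)\\d+"

# j receives the whole column, so regmatches() returns one list item per row.
matches <- regmatches(dt$name, regexec(pattern, dt$name))

# [[1]][2] keeps only the first row's capture group: a single string,
# which := then recycles down the entire column.
first_only <- matches[[1]][2]

# Taking element 2 of every list item yields one value per row instead.
per_row <- vapply(matches, `[`, character(1), 2)
dt[, name_final := per_row]
```

With by = list(id) each group holds exactly one row, so [[1]][2] happens to be the right element every time; the vectorised form gets the same result without a dummy column.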

Related

Select column of data.table and return vector

Is it possible to select a column of a data.table and get back a vector? In base R, the argument drop=TRUE would do the trick. For example,
library(data.table)
dat <- as.data.table(iris)
dat[,"Species"] # returns data.table
dat[,"Species", drop=TRUE] # same
iris[, "Species", drop=TRUE] # a factor, wanted result
Is there a way to do this with data.table?
EDIT: the dat[,Species] method is fine, however I need a method where I can pass the column name in a variable:
x <- "Species"
dat[,x, drop=TRUE]
With data.frame, drop = TRUE is the default when selecting a single column this way, whereas data.table never drops to a vector and ignores the drop argument. According to ?data.table:
drop - Never used by data.table. Do not use. It needs to be here because data.table inherits from data.frame.
In order to get the same behavior, we can use [[ to extract the column by passing a string
identical(dat[["Species"]], iris[, "Species"])
#[1] TRUE
Or
dat$Species
Using [[ or $ extracts the column as a vector and also bypasses the overhead of [.data.table.
See data.table FAQ #1.1. This has been a feature since 2006.
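For the case raised in the question's edit, where the column name is held in a variable, a short sketch along the same lines (the names v1 and v2 are only illustrative):

```r
library(data.table)

dat <- as.data.table(iris)
x <- "Species"           # column name stored in a variable

v1 <- dat[[x]]           # [[ accepts a string, so a variable works directly
v2 <- dat[, get(x)]      # get() looks the column up by name inside j

is.factor(v1)            # TRUE: a plain factor, not a data.table
```

$ cannot take a variable, so [[ is usually the simplest choice when the column name is not known in advance.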

Understand the meaning of [.... with=F][[1]]

I am working through practice exercises based on problems and solutions for data.table in R. The problem was: get the row and column positions of missing values in a data table. The solution code uses [..., with=F][[1]]. I do not understand this part of the code and would appreciate an expert explanation.
for(i in 1:NROW(DT)){
for(j in 1:NCOL(DT)){
curr_value <- DT[i, j,with=F][[1]]
I can understand the first two lines, but I do not understand with=F and the [[1]] that follows.
What does with=F mean, and why is [[1]] used after it? Why is there a double bracket around the 1?
Generally in data.table, with = FALSE allows you to select columns named in a variable.
Consider the following minimal example,
library(data.table)
dt <- data.table(mtcars)
Let's select the following columns from dt
cols <- c(1, 7)
The following command will produce an error
dt[, cols]
Instead you can use with = F
dt[, cols, with = F]
From ?data.table
When with=TRUE (default), j is evaluated within the frame of the data.table;
i.e., it sees column names as if they are variables.
A shorter alternative is to use
dt[, ..cols]
See also Why does “..” work to pass column names in a character vector variable?
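As for the trailing [[1]]: selecting with a row index, a column index and with = FALSE still returns a data.table, even for a single cell, and [[1]] pulls out its first (and only) column as a plain vector. A small sketch on made-up data (the table DT and the names cell and value are assumptions for illustration):

```r
library(data.table)

DT <- data.table(a = c(1, NA, 3), b = c("x", "y", NA))

i <- 2
j <- 1

cell <- DT[i, j, with = FALSE]   # still a data.table with 1 row and 1 column
value <- cell[[1]]               # the bare value stored in that cell

is.data.table(cell)              # TRUE
is.na(value)                     # TRUE: row 2 of column a is missing
```

Inside a loop like the one in the question, the extracted value can then be tested with is.na() to record the positions i and j.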

Looking for name of R function for subsetting observation

I'm unsure of what to Google.
I have a column, let's call it x. Within this column, each row is a comma-separated list of strings. For example
1: A,B,C,D,E
2: A,B,C,D,E
What is the name of the R function for selecting or processing items within each row? E.g. I may wish to extract only B from each row, or perhaps delete all C's.
Assuming that it is a data.table, we extract the character 'B' with str_extract
library(data.table)
library(stringr)
dt[, x:= str_extract(x, "B")]
and if we want to delete all 'C's, it can be done with gsub from base R or str_replace_all from stringr
dt[, x := gsub(",*C", "", x)]
data
dt <- data.table(x = c('A,B,C,D,E', 'A,C,D', 'B,C,C,D'))
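If it helps to treat each row as a set of items rather than as one string, another option is to split on the commas first. This is only a sketch of that alternative, not part of the answer above; the column names has_B and x_no_C are made up for illustration.

```r
library(data.table)

dt <- data.table(x = c("A,B,C,D,E", "A,C,D", "B,C,C,D"))

# One character vector of items per row.
parts <- strsplit(dt$x, ",", fixed = TRUE)

# Does the row contain the item "B"?
dt[, has_B := vapply(parts, function(p) "B" %in% p, logical(1))]

# Drop every "C" item and glue the remaining items back together.
dt[, x_no_C := vapply(parts,
                      function(p) paste(p[p != "C"], collapse = ","),
                      character(1))]
```

Working on the split items avoids having to handle the separators in the pattern, at the cost of an extra pass over the column.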

data.table replace data using values from another data.table, conditionally

This is similar to Update values in data.table with values from another data.table and R data.table replacing an index of values from another data.table, except in my situation the number of variables is very large so I do not want to list them explicitly.
What I have is a large data.table (let's call it dt_original) and a smaller data.table (let's call it dt_newdata) whose IDs are a subset of the first and it has only some of the variables of the first. I would like to update the values in dt_original with the values from dt_newdata. For an added twist, I only want to update the values conditionally - in this case, only if the values in dt_newdata are larger than the corresponding values in dt_original.
For a reproducible example, here are the data. In the real world the tables are much larger:
library(data.table)
set.seed(0)
## This data.table with 20 rows and many variables is the existing data set
dt_original <- data.table(id = 1:20)
setkey(dt_original, id)
for(i in 2015:2017) {
varA <- paste0('varA_', i)
varB <- paste0('varB_', i)
varC <- paste0('varC_', i)
dt_original[, (varA) := rnorm(20)]
dt_original[, (varB) := rnorm(20)]
dt_original[, (varC) := rnorm(20)]
}
## This table with a strict subset of IDs from dt_original and only a part of
## the variables is our potential replacement data
dt_newdata <- data.table(id = sample(1:20, 3))
setkey(dt_newdata, id)
newdata_vars <- sample(names(dt_original)[-1], 4)
for(var in newdata_vars) {
dt_newdata[, (var) := rnorm(3)]
}
Here is a way of doing it using a loop and pmax, but there has to be a better way, right?
for(var in newdata_vars) {
k <- pmax(dt_newdata[, (var), with = FALSE], dt_original[id %in% dt_newdata$id, (var), with = FALSE])
dt_original[id %in% dt_newdata$id, (var) := k, with = FALSE]
}
It seems like there should be a way using join syntax, and maybe the prefix i. and/or .SD or something like that, but nothing I've tried comes close enough to warrant repeating here.
This code should work in the current format given your criteria.
dt_original[dt_newdata, names(dt_newdata) := Map(pmax, mget(names(dt_newdata)), dt_newdata)]
It joins on the IDs that match between the two data.tables and then performs an assignment using :=. Because := expects a list on the right-hand side here, I use Map to run pmax over the columns of both data.tables, matched by the names of dt_newdata. Note that all names of dt_newdata must be present in dt_original.
Following Frank's comment, you can remove the first column of the Map list items and the column names using [-1] because they are IDs, which don't need to be computed. Removing the first column from Map avoids one pass of pmax and also preserves the key on id. Thanks to @brian-stamper for pointing out the key preservation in the comments.
dt_original[dt_newdata,
names(dt_newdata)[-1] := Map(pmax,
mget(names(dt_newdata)[-1]),
dt_newdata[, .SD, .SDcols=-1])]
Note that the use of [-1] assumes that the ID variable is located in the first position of dt_newdata. If it is elsewhere, you could change the index manually or use grep.
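As a small, checkable illustration of the same join-and-pmax idea, here is a sketch on toy tables that picks the columns by name instead of by position. The toy values and the helper name upd_cols are assumptions for illustration; setdiff() is swapped in for the [-1] index so the ID column can sit anywhere.

```r
library(data.table)

dt_original <- data.table(id = 1:4,
                          a = c(1, 2, 3, 4),
                          b = c(10, 20, 30, 40),
                          c = c(7, 7, 7, 7))
dt_newdata <- data.table(id = c(2L, 4L),
                         a = c(5, 0),
                         b = c(15, 45))

# Every column of the new data except the ID, wherever the ID sits.
upd_cols <- setdiff(names(dt_newdata), "id")

# Join on id; for the matched rows keep the larger of the old and new value.
dt_original[dt_newdata, on = "id",
            (upd_cols) := Map(pmax,
                              mget(upd_cols),
                              dt_newdata[, upd_cols, with = FALSE])]
```

Only rows 2 and 4 are touched, and column c is left alone because it does not appear in dt_newdata. The sketch relies on every id in dt_newdata matching exactly one row of dt_original, as in the question.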

Use syntactically invalid names in j data.table

I have a data.table with column names that are not valid R names
DT = data.table(a = c(1, 2), `0b` = c(4, 5))
And I want to use something like this
my_column <- "0b"
DT[, mean(eval(parse(text = my_column)))]
but I get an error
Error in parse(text = my_column) : <text>:1:2: unexpected symbol
1: 0b
^
Is there any way to do this, i.e. to use a syntactically invalid column name held in a variable inside j?
We can either specify the column in .SDcols and get the mean using .SD
DT[, mean(.SD[[1L]]),.SDcols=my_column]
Or we can subset the column using [[ and then get the mean.
mean(DT[[my_column]])
In R, syntactically invalid names need back-ticks in order to be evaluated. Although .SDcols is probably the proper way to go, you can use as.name() or as.symbol() to turn the character my_column into a back-ticked name.
DT[, mean(eval(as.name(my_column)))]
# [1] 4.5
A somewhat clunkier alternative would be
with(DT, do.call(mean, list(as.name(my_column))))
# [1] 4.5
As you specified in declaring your example, using backticks (`) is a common way to handle strange column names:
DT[ , mean(`0b`)]
Though get also works:
DT[ , mean(get("0b"))]
We can also do it the data.frame way
sapply(DT[ , "0b"], mean)
Though you may just want to setnames to get rid of the pesky column names altogether (by reference)
setnames(DT, "0b", "something_digestible")
