I have the following function:
library(data.table)
DT <- data.table(col1 = 1:4, col2 = c(2:5))
fun <- function(DT, fct){
  DT_out <- DT[, new_col := fct]
  return(DT_out)
}
fun(DT, fct = function(x = col1, y = col2){y - x})
In reality I have some processing before and after this snippet, so I do not want to use the statement DT[, new_col := fct] directly with a fixed fct (fct should remain flexible). I know this question is very similar to this one, but I cannot figure out how to reformulate the code so that two columns are allowed as arguments for the function. The code above gives the error:
Error in `[.data.table`(DT, , `:=`(new_col, fct)) :
RHS of assignment is not NULL, not an an atomic vector (see ?is.atomic) and not a list column.
The error occurs because fct arrives as a function object, which := cannot assign as a column; the function has to actually be called on the chosen columns. One option, if you don't mind adding quotes around the variable names:
fun <- function(DT, fun, ...){
  # ... is a named character vector, e.g. c(x = 'col1', y = 'col2')
  fun_args <- c(...)
  # look up the columns by name and pass them to `fun` under the
  # argument names the caller chose
  DT[, new_col := do.call(fun, setNames(mget(fun_args), names(fun_args)))]
}
fun(DT, fun = function(x, y){y - x}, x = 'col1', y = 'col2')
DT
# col1 col2 new_col
# 1: 1 2 1
# 2: 2 3 1
# 3: 3 4 1
# 4: 4 5 1
Or use .SDcols (same result as above):
fun <- function(DT, fun, ...){
  fun_args <- c(...)
  DT[, new_col := do.call(fun, setNames(.SD, names(fun_args))),
     .SDcols = fun_args]
}
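Called the same way as before, this gives the same new_col:
fun(DT, fun = function(x, y){y - x}, x = 'col1', y = 'col2')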
:= used for multiple simultaneous assignments in a data.table does not respect updated values. The column x is incremented, and then I intend to assign the updated value of x to y. Why is the value not as intended?
> z = data.table(x = 1:5, y= 1:5)
> z[, `:=` (x = x + 1, y = x)]
> # Actual
> z
x y
1: 2 1
2: 3 2
3: 4 3
4: 5 4
5: 6 5
> # Expected
> z
x y
1: 2 2
2: 3 3
3: 4 4
4: 5 5
5: 6 6
Here are two more alternatives for you to consider. As noted, data.table doesn't do dynamic scoping the way dplyr::mutate does, so y = x still refers to z$x in the second part of your statement. You can consider filing an issue if you feel strongly that it should.
Explicitly assign the new x inline:
z[, `:=` (x = (x <- x + 1), y = x)]
In the environment where j is evaluated, now an object x is created to overwrite z$x temporarily. This should be very similar to what dplyr is doing internally -- evaluating the arguments of mutate sequentially and updating the column values iteratively.
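Running this on a fresh z gives the expected output:
z <- data.table(x = 1:5, y = 1:5)
z[, `:=` (x = (x <- x + 1), y = x)]
z
#    x y
# 1: 2 2
# 2: 3 3
# 3: 4 4
# 4: 5 5
# 5: 6 6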
Switch to LHS := RHS form (see ?set):
z[, c('x', 'y') := {
  x = x + 1
  .(x, x)
}]
. is shorthand in data.table for list. In LHS := RHS form, RHS must evaluate to a list; each element of that list will be one column in the assignment.
More compactly:
z[ , c('x', 'y') := {x = x + 1; .(x, x)}]
; allows you to write multiple statements on the same line (e.g. 3+4; 4+5 will run 3+4 then 4+5). { creates a way to wrap multiple statements and return the final value, see ?"{". Implicitly you're using this whenever you write if (x) { do_true } else { do_false } or function(x) { function_body }.
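A standalone illustration of both:
res <- {3 + 4; 4 + 5}  # evaluates 3 + 4, then 4 + 5; the block's value is the last expression
res
# [1] 9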
The value of x is not updated while the calculation for y is being done. You could repeat for y the same expression you used for x:
library(data.table)
z[, `:=` (x = x + 1, y = x + 1)]
Or update y separately, chaining the two assignments:
z[, x := x + 1][, y:= x]
This behavior differs from dplyr's mutate, where the following works:
library(dplyr)
z %>% mutate(x = x + 1, y = x)
I am a bit surprised by the behaviour of data.table. I want to select all non-NA values from one row of a data.table.
With an NA value present it works:
t = data.table(a=1,b=NA)
t[, !is.na(t), with=F]
Without NA values it doesn't work:
t = data.table(a=1, b=2)
t[, !is.na(t), with=F]
The basic difference is that t[, !c(F, F), with=F] doesn't work, while interestingly t[, c(T, T), with=F] does fine.
I know there are many ways to achieve the desired output, but I am only interested in this - for me strange - behaviour of data.table.
I've investigated the data.table:::`[.data.table` source code, and it indeed looks like a bug to me. What basically happens is that the !is.na() call is split into the ! and the is.na() parts. The is.na() part alone is then evaluated, the resulting logical vector is reduced to column indices with which(), and if that index vector has length zero, null.data.table() is returned before the negation is ever taken into account. The issue is that for dt <- data.table(a = 1, b = 2), is.na(dt) is all FALSE, so which() always returns an empty vector.
Below is shortened code to illustrate what goes on under the hood:
sim_dt <- function(...) {
  ## data.table catches the call
  jsub <- substitute(...)
  cat("This is your call:", paste0(jsub, collapse = ""))
  ## data.table separates the `!` from the call, sets notj = TRUE instead,
  ## and saves `is.na(t)` into `jsub`
  if (is.call(jsub) && deparse(jsub[[1L]], 500L, backtick=FALSE) %in% c("!", "-")) { # TODO is deparse avoidable here?
    notj = TRUE
    jsub = jsub[[2L]]
  } else notj = FALSE
  cat("\nnotj:", notj)
  cat("\nThis is the new jsub: ", paste0(jsub, collapse = "("), ")", sep = "")
  ## data.table evaluates just the `jsub` part, which obviously returns a vector
  ## of `FALSE`s (because the `!` was removed)
  cat("\nevaluated j:", j <- eval(jsub, setattr(as.list(seq_along(dt)), 'names', names(dt)), parent.frame())) # else j will be evaluated for the first time on next line
  ## data.table checks if `j` is a logical vector and keeps only the positions
  ## of the TRUEs, here getting an empty vector
  if (is.logical(j)) cat("\nj after `which`:", j <- which(j))
  cat("\njs length:", length(j), "\n\n")
  ## data.table checks if `j` is empty (and it obviously is) and returns a null.data.table
  if (!length(j)) return(data.table:::null.data.table()) else return(dt[, j, with = FALSE])
}
## Your data.table
dt <- data.table(a = 1, b = 2)
sim_dt(!is.na(dt))
# This is your call: !is.na(dt)
# notj: TRUE
# This is the new jsub: is.na(dt)
# evaluated j: FALSE FALSE
# j after `which`:
# js length: 0
#
# Null data.table (0 rows and 0 cols)
dt <- data.table(a = 1, b = NA)
sim_dt(!is.na(dt))
# This is your call: !is.na(dt)
# notj: TRUE
# This is the new jsub: is.na(dt)
# evaluated j: FALSE TRUE
# j after `which`: 2
# js length: 1
#
# b
# 1: NA
As #Roland has already mentioned, is.na(t) returns a matrix, whereas you need a vector to select columns.
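You can check this directly:
t = data.table(a = 1, b = 2)
is.na(t)   # a 1 x 2 logical matrix, not a vector
#       a     b
# 1 FALSE FALSE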
But column selection should still work in the example given by the OP, as the data.table has only a single row. All we need to do is wrap the expression in () so that it gets evaluated first, e.g.:
library(data.table)
t = data.table(a=1, b=2)
t[,(!c(FALSE,FALSE)),with=FALSE]
# a b
# 1: 1 2
t[,(!is.na(t)),with=FALSE]
# a b
# 1: 1 2
I'm trying to create a variant of pmax / pmin that works with an additional filter_value parameter across an arbitrary set of columns that would be defined using .SD / .SDcols. The first version of the function below hard-codes the filter value, but works with .SD:
testFuncV1 <- function(...) {
  cols <- list(...)
  num_cols <- length(cols)
  num_records <- length(cols[[1]])
  max_records <- c()
  for (record_num in 1:num_records) {
    v <- c()
    for (l in cols) {
      v <- c(v, l[[record_num]])
    }
    filt_v <- Filter(function(x) { x <= 1 }, v)
    if (length(filt_v) == 0) {
      max_records <- c(max_records, NA)
    } else {
      max_records <- c(max_records, max(filt_v))
    }
  }
  max_records
}
test_dt_v1 <- data.table(a = c(1,3,5), b = c(2,3,-1), c = c(-3, 5, 2))
test_dt_v1[, max_with_filter := do.call(testFuncV1, .SD), .SDcols = c('a', 'b', 'c')]
returns:
a b c max_with_filter
1: 1 2 -3 1
2: 3 3 5 NA
3: 5 -1 2 -1
The second version of the function below takes the filter as a second parameter, but I was not able to get it to work with .SD; instead, I had to pass the individual column vectors in as a list:
testFuncV2 <- function(cols, filter) {
  num_cols <- length(cols)
  num_records <- length(cols[[1]])
  max_records <- c()
  for (record_num in 1:num_records) {
    v <- c()
    for (l in cols) {
      v <- c(v, l[[record_num]])
    }
    filt_v <- Filter(function(x) { x <= filter }, v)
    if (length(filt_v) == 0) {
      max_records <- c(max_records, NA)
    } else {
      max_records <- c(max_records, max(filt_v))
    }
  }
  max_records
}
test_dt_v2 <- data.table(a = c(1,3,5), b = c(2,3,-1), c = c(-3, 5, 2))
test_dt_v2[, max_with_filter := do.call(testFuncV2, list(list(test_dt_v2$a, test_dt_v2$b, test_dt_v2$c), 1))]
also returns:
a b c max_with_filter
1: 1 2 -3 1
2: 3 3 5 NA
3: 5 -1 2 -1
Ideally, I'd be able to either figure out an approach that works with .SD using do.call, or substitute in something that works with lapply (which I also experimented with, to no avail). Thanks in advance!
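Note that .SD is itself just a list of the selected columns, so testFuncV2 as written should also accept it directly, without any manual wrapping:
test_dt_v2[, max_with_filter := testFuncV2(.SD, 1), .SDcols = c('a', 'b', 'c')]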
Here is an option using apply(MARGIN=1, ...). Note that apply coerces .SD to a matrix, which is fine here because all the columns are numeric:
func <- function(x, threshold) {
  if (any(x <= threshold)) return(max(x[x <= threshold]))
  NA
}
test_dt_v1[, max_with_filter := apply(.SD, 1, func, threshold = 1),
           .SDcols = c("a", "b", "c")]
Another option using do.call and pmax, by first converting values above 1 to NA (the idea came from rowwise maximum for R):
test_dt_v1[, max_with_filter := do.call(pmax, c(`is.na<-`(.SD, .SD>1), na.rm=T))]
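To see why this works: the `is.na<-`(.SD, .SD > 1) step returns a copy of the selected columns in which every value above the threshold of 1 has been replaced by NA, and pmax(..., na.rm = TRUE) then skips those NAs row-wise. On the example data the masked intermediate looks like this:
`is.na<-`(test_dt_v1[, .(a, b, c)], test_dt_v1[, .(a, b, c)] > 1)
#     a  b  c
# 1:  1 NA -3
# 2: NA NA NA
# 3: NA -1 NA
Taking pmax over the rows of this table gives 1, NA and -1, matching the output above.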
Let's say I have two columns of strings:
library(data.table)
DT <- data.table(x = c("a","aa","bb"), y = c("b","a","bbb"))
For each row, I want to know whether the string in x is present in column y. A looping approach would be:
for (i in 1:length(DT$x)){
  DT$test[i] <- DT[i, grepl(x, y) + 0]
}
DT
x y test
1: a b 0
2: aa a 0
3: bb bbb 1
Is there a vectorized implementation of this? Using grep(DT$x,DT$y) only uses the first element of x.
You can simply do
DT[, test := grepl(x, y), by = x]
Grouping by x makes x a length-one vector within each group, so grepl gets a single pattern and vectorizes over that group's y values.
Or use mapply (Vectorize is really just a wrapper for mapply):
DT$test <- mapply(grepl, pattern=DT$x, x=DT$y)
Thank you all for your responses. I've benchmarked them all, and come up with the following:
library(data.table)
library(microbenchmark)
DT <- data.table(x = rep(c("a","aa","bb"),1000), y = rep(c("b","a","bbb"),1000))
DT1 <- copy(DT)
DT2 <- copy(DT)
DT3 <- copy(DT)
DT4 <- copy(DT)
microbenchmark(
  DT1[, test := grepl(x, y), by = x],
  DT2$test <- apply(DT2, 1, function(x) grepl(x[1], x[2])),
  DT3$test <- mapply(grepl, pattern = DT3$x, x = DT3$y),
  {vgrepl <- Vectorize(grepl)
   DT4[, test := as.integer(vgrepl(x, y))]}
)
Results
Unit: microseconds
expr min lq mean median uq max neval
DT1[, `:=`(test, grepl(x, y)), by = x] 758.339 908.106 982.1417 959.6115 1035.446 1883.872 100
DT2$test <- apply(DT2, 1, function(x) grepl(x[1], x[2])) 16840.818 18032.683 18994.0858 18723.7410 19578.060 23730.106 100
DT3$test <- mapply(grepl, pattern = DT3$x, x = DT3$y) 14339.632 15068.320 16907.0582 15460.6040 15892.040 117110.286 100
{ vgrepl <- Vectorize(grepl) DT4[, `:=`(test, as.integer(vgrepl(x, y)))] } 14282.233 15170.003 16247.6799 15544.4205 16306.560 26648.284 100
Along with being the most syntactically simple, the data.table solution is also the fastest, largely because grouping by x means grepl is called only once per unique pattern rather than once per row.
You can pass the grepl function into an apply function to operate on each row of your data.table, where the first column contains the string to search for and the second column contains the string to search in. This gives you a row-wise solution to your problem:
> DT$test <- apply(DT, 1, function(x) as.integer(grepl(x[1], x[2])))
> DT
x y test
1: a b 0
2: aa a 0
3: bb bbb 1
You can use Vectorize:
vgrepl <- Vectorize(grepl)
DT[, test := as.integer(vgrepl(x, y))]
DT
x y test
1: a b 0
2: aa a 0
3: bb bbb 1
I have a data.table with columns of different data types. My goal is to select only numeric columns and replace NA values within these columns by 0.
I am aware that replacing NA values with zero goes like this:
DT[is.na(DT)] <- 0
To select only numeric columns, I found this solution, which works fine:
DT[, as.numeric(which(sapply(DT,is.numeric))), with = FALSE]
I can achieve what I want by assigning
DT2 <- DT[, as.numeric(which(sapply(DT,is.numeric))), with = FALSE]
and then do:
DT2[is.na(DT2)] <- 0
But of course I would like to have my original DT modified by reference. With the following, however:
DT[, as.numeric(which(sapply(DT, is.numeric))), with = FALSE][
  is.na(DT[, as.numeric(which(sapply(DT, is.numeric))), with = FALSE])] <- 0
I get
"Error in [.data.table([...] i is invalid type (matrix)"
What am I missing?
Any help is much appreciated!!
We can use set:
for(j in seq_along(DT)){
  set(DT, i = which(is.na(DT[[j]]) & is.numeric(DT[[j]])), j = j, value = 0)
}
Or create an index of the numeric columns, loop through it, and set the NA values to 0:
ind <- which(sapply(DT, is.numeric))
for(j in ind){
  set(DT, i = which(is.na(DT[[j]])), j = j, value = 0)
}
data
set.seed(24)
DT <- data.table(v1= c(NA, 1:4), v2 = c(NA, LETTERS[1:4]), v3=c(rnorm(4), NA))
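Applied to this data, both loops fill only the numeric NAs; the character NA in v2 is left untouched:
DT
#    v1   v2         v3
# 1:  0 <NA> -0.5458808
# 2:  1    A  0.5365853
# 3:  2    B  0.4196231
# 4:  3    C -0.5836272
# 5:  4    D  0.0000000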
I wanted to explore and possibly improve on the excellent answer given above by #akrun. Here's the data he used in his example:
library(data.table)
set.seed(24)
DT <- data.table(v1= c(NA, 1:4), v2 = c(NA, LETTERS[1:4]), v3=c(rnorm(4), NA))
DT
#> v1 v2 v3
#> 1: NA <NA> -0.5458808
#> 2: 1 A 0.5365853
#> 3: 2 B 0.4196231
#> 4: 3 C -0.5836272
#> 5: 4 D NA
And the two methods he suggested:
fun1 <- function(x){
  for(j in seq_along(x)){
    set(x, i = which(is.na(x[[j]]) & is.numeric(x[[j]])), j = j, value = 0)
  }
}
fun2 <- function(x){
  ind <- which(sapply(x, is.numeric))
  for(j in ind){
    set(x, i = which(is.na(x[[j]])), j = j, value = 0)
  }
}
I think the first method above is really genius as it exploits the fact that NAs are typed.
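A quick illustration of what "typed" means here: a bare NA is logical, but inside an atomic vector it is coerced to that vector's type, so is.numeric() on a column containing NAs still answers correctly:
typeof(NA)          # "logical"
typeof(c(NA, 1:4))  # "integer" -- the NA becomes an integer NA
is.numeric(c(NA, 1:4))
# [1] TRUE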
First of all, even though .SD is not available in the i argument, it is possible to pull the column name with get(), so I thought I could sub-assign the data.table this way:
fun3 <- function(x){
  nms <- names(x)[sapply(x, is.numeric)]
  for(j in nms){
    x[is.na(get(j)), (j) := 0]
  }
}
The generic case, of course, would be to rely on .SD and .SDcols to work only on the numeric columns:
fun4 <- function(x){
  nms <- names(x)[sapply(x, is.numeric)]
  x[, (nms) := lapply(.SD, function(i) replace(i, is.na(i), 0)), .SDcols = nms]
}
But then I thought to myself: "Hey, who says we can't go all the way to base R for this sort of operation?" Here's a simple lapply() with a conditional statement, wrapped into setDT():
fun5 <- function(x){
  setDT(
    lapply(x, function(i){
      if(is.numeric(i))
        i[is.na(i)] <- 0
      i
    })
  )
}
Finally, we could use the same conditional idea to limit the columns on which we apply set():
fun6 <- function(x){
  for(j in seq_along(x)){
    if (is.numeric(x[[j]]))
      set(x, i = which(is.na(x[[j]])), j = j, value = 0)
  }
}
Here are the benchmarks:
microbenchmark::microbenchmark(
  for.set.2cond = fun1(copy(DT)),
  for.set.ind   = fun2(copy(DT)),
  for.get       = fun3(copy(DT)),
  for.SDcol     = fun4(copy(DT)),
  for.list      = fun5(copy(DT)),
  for.set.if    = fun6(copy(DT))
)
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> for.set.2cond 59.812 67.599 131.6392 75.5620 114.6690 4561.597 100 a
#> for.set.ind 71.492 79.985 142.2814 87.0640 130.0650 4410.476 100 a
#> for.get 553.522 569.979 732.6097 581.3045 789.9365 7157.202 100 c
#> for.SDcol 376.919 391.784 527.5202 398.3310 629.9675 5935.491 100 b
#> for.list 69.722 81.932 137.2275 87.7720 123.6935 3906.149 100 a
#> for.set.if 52.380 58.397 116.1909 65.1215 72.5535 4570.445 100 a
You can use the tidyverse: purrr's map_if, along with ifelse, does the job in a single line of code.
library(tidyverse)
set.seed(24)
DT <- data.table(v1= sample(c(1:3,NA),20,replace = T), v2 = sample(c(LETTERS[1:3],NA),20,replace = T), v3=sample(c(1:3,NA),20,replace = T))
The single line of code below takes a DT with numeric and non-numeric columns and operates only on the numeric columns, replacing NAs with 0:
DT %>% map_if(is.numeric,~ifelse(is.na(.x),0,.x)) %>% as.data.table
So, tidyverse can be less verbose than data.table sometimes :-) (note, though, that this builds a new data.table rather than modifying DT by reference, so you need to assign the result).