data.table behaviour with booleans as column selector

I am a bit surprised by the behaviour of data.table. I want to select, from a one-row data.table, all the non-NA values.
With NA values present it works:
t = data.table(a=1,b=NA)
t[, !is.na(t), with=F]
Without NA values it doesn't work:
t = data.table(a=1, b=2)
t[, !is.na(t), with=F]
The basic difference is that t[, !c(F, F), with=F] doesn't work, while t[, c(T, T), with=F], interestingly, works fine.
I know there are many ways to achieve the desired output, but I am only interested in this (to me strange) behaviour of data.table.

I've investigated the data.table:::`[.data.table` source code, and it does indeed look like a bug to me. What basically happens is that the !is.na(t) call is split into its ! and is.na(t) parts. The is.na(t) part is evaluated on its own, the result is passed through which(), and if the resulting index vector has length zero, null.data.table() is returned. The issue is that for dt <- data.table(a = 1, b = 2), is.na(dt) contains no TRUE values, so the index vector is always empty.
Below is shortened code to illustrate what goes on under the hood:
sim_dt <- function(...) {
  ## data.table catches the call
  jsub <- substitute(...)
  cat("This is your call:", paste0(jsub, collapse = ""))
  ## data.table separates the `!` from the call, sets notj = TRUE
  ## and saves `is.na(t)` into `jsub`
  if (is.call(jsub) && deparse(jsub[[1L]], 500L, backtick=FALSE) %in% c("!", "-")) {  # TODO is deparse avoidable here?
    notj = TRUE
    jsub = jsub[[2L]]
  } else notj = FALSE
  cat("\nnotj:", notj)
  cat("\nThis is the new jsub: ", paste0(jsub, collapse = "("), ")", sep = "")
  ## data.table evaluates just the `jsub` part, which obviously returns a vector of `FALSE`s (because `!` was removed)
  cat("\nevaluated j:", j <- eval(jsub, setattr(as.list(seq_along(dt)), 'names', names(dt)), parent.frame()))  # else j will be evaluated for the first time on next line
  ## data.table checks if `j` is a logical vector and keeps only the TRUE positions, here an empty vector
  if (is.logical(j)) cat("\nj after `which`:", j <- which(j))
  cat("\njs length:", length(j), "\n\n")
  ## data.table checks if `j` is empty (and it obviously is) and returns a null.data.table
  if (!length(j)) return(data.table:::null.data.table()) else return(dt[, j, with = FALSE])
}
## Your data.table
dt <- data.table(a = 1, b = 2)
sim_dt(!is.na(dt))
# This is your call: !is.na(dt)
# notj: TRUE
# This is the new jsub: is.na(dt)
# evaluated j: FALSE FALSE
# j after `which`:
# js length: 0
#
# Null data.table (0 rows and 0 cols)
dt <- data.table(a = 1, b = NA)
sim_dt(!is.na(dt))
# This is your call: !is.na(dt)
# notj: TRUE
# This is the new jsub: is.na(dt)
# evaluated j: FALSE TRUE
# j after `which`: 2
# js length: 1
#
# b
# 1: NA

As @Roland has already mentioned, is.na(t) returns a matrix, whereas you need a vector to select columns.
But column selection should work in the OP's example, as the data.table has only a single row. All we need to do is wrap the expression in () to get it evaluated as a whole, e.g.:
library(data.table)
t = data.table(a=1, b=2)
t[,(!c(FALSE,FALSE)),with=FALSE]
# a b
# 1: 1 2
t[,(!is.na(t)),with=FALSE]
# a b
# 1: 1 2
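Tying this back to the source walk-through above, my reading of why the parentheses help is that they change the head of the captured call from ! to (, so the notj branch is never entered and the whole expression is evaluated in one go. A minimal sketch of that difference:
jsub <- quote(!is.na(t));   deparse(jsub[[1L]], backtick = FALSE)  # "!" -> notj branch strips the `!`
jsub <- quote((!is.na(t))); deparse(jsub[[1L]], backtick = FALSE)  # "(" -> nothing stripped, evaluated whole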

Related

Function argument as value or column name for data.table

I want my function to be able to take a value or a column name. How can I do this with data.table?
library(data.table)
df <- data.table(a = c(1:5),
                 b = c(5:1),
                 c = c(1, 3, 5, 3, 1))
myfunc <- function(val) {
  df[a >= val]
}
# This works:
myfunc(2)
# This does not work:
myfunc("c")
If I define my function as:
myfunc <- function(val) {
  df[a >= get(val)]
}
# This doesn't work:
myfunc(2)
# This works:
myfunc("c")
What is the best way to resolve this?
Edit: To be clear, I want the results to be the same as:
# myfunc(2)
df %>%
  filter(a >= 2)
# myfunc("c")
df %>%
  filter(a >= c)
EDIT:
Thanks all for the responses, I think I like dww's answer the best.
I wish it were as easy as in dplyr, where I can do:
myfunc <- function(val) {
  df %>%
    filter(a >= {{val}})
}
# Both work:
myfunc(2)
myfunc(c)
If you build and parse the whole expression, then you can evaluate it in its entirety. For example
myfunc <- function(val) {
  df[eval(parse(text = paste("a >= ", val)))]
}
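For instance (my sketch, using the OP's df), both call styles now go through the same parse-and-eval route:
myfunc(2)    # builds and evaluates the expression a >= 2
myfunc("c")  # builds a >= c, i.e. compares against column c, not the string "c"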
Relying on a function that lets you mix values and variable names in the same parameter can be dangerous, though, especially in the case where you actually wanted to match on character values rather than variable names. If you passed in the whole expression, you could do:
myfunc <- function(expr) {
  expr <- substitute(expr)
  df[eval(expr)]
}
myfunc(a>=3)
myfunc(a>=c)
The question did not actually define the desired behaviour, so we assume that df must be a data.table, that passing a character string should return the column of that name, and that passing a number should return the rows whose a column exceeds that number.
Define an S3 generic and methods for character and default.
myfunc <- function(x, data = df) UseMethod("myfunc")
myfunc.character <- function(x, data = df) data[[x]]
myfunc.default <- function(x, data = df) data[a > x]
myfunc(2)
## a b c
## 1: 3 3 5
## 2: 4 2 3
## 3: 5 1 1
myfunc("c")
## [1] 1 3 5 3 1

lapply conditional on variable value

I want to lapply two functions over a data set, conditional on the value of a specific variable.
first_function <- function(x) {return (x + 0)}
second_function <- function(x) {return (x + 1)}
df <- data.frame(Letters = c("A","B","B"), Numbers = 1:3)
Something like:
df <- lapply(df, if(df$Letters=="A") first_function else second_function )
To produce:
df_desired <- data.frame(Letters = c("A","B","B"), Numbers = c(1,3,4))
You can do it with dplyr and purrr. Obviously this is a basic function, but you should be able to build on it for your needs:
library(dplyr)
library(purrr)
calc <- function(y, x){
  first_function <- function(x) {return (x + 0)}
  second_function <- function(x) {return (x + 1)}
  if(y == "A")
    return(first_function(x))
  return(second_function(x))
}
df <- data.frame(Letters = c("A","B","B"), Numbers = 1:3)
df %>%
  mutate(Numbers = map2_dbl(Letters, Numbers, ~calc(.x, .y)))
Letters Numbers
1 A 1
2 B 3
3 B 4
> (df_desired <- data.frame(Letters = c("A","B","B"), Numbers = c(1,3,4)))
Letters Numbers
1 A 1
2 B 3
3 B 4
BENCHMARKING
I am not a data.table expert (feel free to add one), so I did not incorporate it here. But @R Yoda is correct: although it reads nicely and future you will find the function easier to read and extend, the purrr solution is not that fast. I liked the ifelse approach, so I added case_when, which is easier to scale when dealing with multiple functions. Here are a couple of solutions:
library(dplyr)
library(purrr)
library(microbenchmark)
first_function <- function(x) {return (x + 0)}
second_function <- function(x) {return (x + 1)}
calc <- function(y, x){
  if(y == "A")
    return(first_function(x))
  return(second_function(x))
}
df <- data.frame(Letters = rep(c("A","B","B"), 1000), Numbers = 1:3)
basic <- function(){
  data.frame(df$Letters, apply(df, 1, function(row) {
    num <- as.numeric(row['Numbers'])
    if (row['Letters'] == 'A') first_function(num) else second_function(num)
  }))
}
dplyr_purrr <- function(){
  df %>%
    mutate(Numbers = map2_dbl(Letters, Numbers, ~calc(.x, .y)))
}
dplyr_case_when <- function(){
  df %>%
    mutate(Numbers = case_when(
      Letters == "A" ~ first_function(Numbers),
      TRUE ~ second_function(Numbers)))
}
map_list <- function(){
  data.frame(df$Letters, map2_dbl(df$Letters, df$Numbers, ~calc(.x, .y)))
}
within_mapply <- function(){
  within(df, Numbers <- mapply(Letters, Numbers,
                               FUN = function(x, y){
                                 switch(x,
                                        "A" = first_function(y),
                                        "B" = second_function(y))
                               }))
}
within_ifelse <- function(){
  within(df, Numbers <- ifelse(Letters == "A",
                               first_function(Numbers),
                               second_function(Numbers)))
}
within_case_when <- function(){
  within(df, Numbers <- case_when(
    Letters == "A" ~ first_function(Numbers),
    TRUE ~ second_function(Numbers)))
}
(mbm <- microbenchmark(
  basic(),
  dplyr_purrr(),
  dplyr_case_when(),
  map_list(),
  within_mapply(),
  within_ifelse(),
  within_case_when(),
  times = 1000
))
Unit: microseconds
expr min lq mean median uq max neval cld
basic() 12816.427 24028.3375 27719.8182 26741.7770 29417.267 277756.650 1000 f
dplyr_purrr() 9682.884 17817.0475 20072.2752 19736.8445 21767.001 48344.265 1000 e
dplyr_case_when() 1098.258 2096.2080 2426.7183 2325.7470 2625.439 9039.601 1000 b
map_list() 8764.319 16873.8670 18962.8540 18586.2790 20599.000 41524.564 1000 d
within_mapply() 6718.368 12397.1440 13806.1752 13671.8120 14942.583 24958.390 1000 c
within_ifelse() 279.796 586.6675 690.1919 653.3345 737.232 8131.292 1000 a
within_case_when() 470.155 955.8990 1170.4641 1070.5655 1219.284 46736.879 1000 a
The simple way to do this with *apply would be to put the whole logic (with the conditional and the two functions) into another function and use apply with MARGIN=1 to pass the data in row by row (lapply will pass in the data by column):
apply(df, 1, function(row) {
num <- as.numeric(row['Numbers'])
if (row['Letters'] == 'A') first_function(num) else second_function(num)
})
[1] 1 3 4
The problem with this approach, as @r2evans points out in the comment below, is that when you use apply with a heterogeneous data.frame (in this case, Letters is type factor while Numbers is type integer), each row passed into the applied function arrives as a vector, which can only have a single type, so everything in the row is coerced to the same type (here, character). This is why it's necessary to use as.numeric(row['Numbers']) to turn Numbers back into type numeric. Depending on your data, this could be a simple fix (as above) or it could make things much more complicated and bug-prone. Either way, @akrun's solution is much better, since it preserves each variable's original data type.
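A quick sketch of that coercion: apply() calls as.matrix() on the data.frame first, so every row it hands to the function is a character vector:
df <- data.frame(Letters = c("A","B","B"), Numbers = 1:3)
apply(df, 1, function(row) class(row))
# [1] "character" "character" "character"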
lapply has difficulty in this case because it's column-based. However, you can try transposing your data with t() and using lapply if you insist. Here I provide two ways, using mapply and ifelse:
df$Letters <- as.character(df$Letters)
# Method 1
within(df, Numbers <- mapply(Letters, Numbers, FUN = function(x, y){
  switch(x, "A" = first_function(y),
         "B" = second_function(y))
}))
# Method 2
within(df, Numbers <- ifelse(Letters == "A",
                             first_function(Numbers),
                             second_function(Numbers)))
Both of the above give the same output:
# Letters Numbers
# 1 A 1
# 2 B 3
# 3 B 4
Here is a data.table variant for better performance in the case of many data rows (but also showing an implicit conversion problem):
library(data.table)
setDT(df) # fast conversion from data.frame to data.table
df[ Letters == "A", Numbers := first_function(Numbers) ]
df[!(Letters == "A"), Numbers := second_function(Numbers)] # issues a warning, see below
df
# Letters Numbers
# 1: A 1
# 2: B 3
# 3: B 4
The issued warning is:
Warning message:
In `[.data.table`(df, !(Letters == "A"), `:=`(Numbers, second_function(Numbers))) :
  Coerced 'double' RHS to 'integer' to match the column's type; may
  have truncated precision. Either change the target column ['Numbers']
  to 'double' first (by creating a new 'double' vector length 3 (nrows
  of entire table) and assign that; i.e. 'replace' column), or coerce
  RHS to 'integer' (e.g. 1L, NA_[real|integer]_, as.*, etc) to make
  your intent clear and for speed. Or, set the column type correctly
  up front when you create the table and stick to it, please.
The reason is that the data.frame column Numbers is an integer:
> str(df)
'data.frame': 3 obs. of 2 variables:
$ Letters: Factor w/ 2 levels "A","B": 1 2 2
$ Numbers: int 1 2 3
but the functions return a double (for whatever reason):
> typeof(first_function(df$Numbers))
[1] "double"
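Assuming the results really are whole numbers (as they are here), one sketch of a fix is to coerce the RHS back to integer, or to make the column double before updating; both are options the warning itself suggests:
df[!(Letters == "A"), Numbers := as.integer(second_function(Numbers))]
## or: make the column double once, then update without a warning
df[, Numbers := as.numeric(Numbers)]
df[!(Letters == "A"), Numbers := second_function(Numbers)]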

Compare every n rows and show boolean vector

I have a similar issue to the one in this question: Compare every 2 rows and show mismatches in R.
I would like to compare not only 2 rows but, for example, 3, 4, etc.
I have a data.table here:
DT <- data.table(A = rep(1:2, 2), B = rep(1:4, 2),
                 C = rep(1:2, 1), key = "A")
Then I use
dfs <- split(DT, DT$A)
comp <- function(x) sapply(x, function(u) u[1]==u[2])
matches <- sapply(dfs, comp)
For 3 rows:
comp <- function(x) sapply(x, function(u) u[1]==u[2] & u[1]==u[3])
Is that correct? How can I generalize it in a more elegant way?
try this:
comp2 <- function(dt, i, rws){
  k <- length(rws)
  tmp <- as.numeric(dt[i])
  tmp <- as.data.table(matrix(rep(tmp, k), nrow = k, byrow = TRUE,
                              dimnames = list(NULL, colnames(dt))))
  ans <- (dt[rws] == tmp)
  ans
}
This function takes three arguments:
- dt: your data.table (or the sub-data.tables obtained from splitting your original one, up to you)
- i: the row you want to compare
- rws: a vector of the row numbers you want to compare i with (e.g. c(2, 3, 4) would compare i with rows 2, 3 and 4)
It then creates a new data.table that consists of row i stacked k times, so a data.frame-to-data.frame comparison is possible.
example:
comp2(DT, 1, c(2, 3, 4))
# A B C
#[1,] TRUE FALSE TRUE
#[2,] FALSE FALSE FALSE
#[3,] FALSE FALSE FALSE
compares row 1 of your data.table DT to rows 2, 3 and 4.
If you want your output to tell you, per column, whether your chosen row matches all of the rows you are comparing it to, apply the extra operation colSums(ans) == k to the result instead of returning ans; FALSE then means row i differs from at least one of them.
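For example (my addition), with the comparison above, where k = 3:
ans <- comp2(DT, 1, c(2, 3, 4))
colSums(ans) == 3
#     A     B     C
# FALSE FALSE FALSE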

Updating a data.table in R

(Edited)
I am using the following code to create two columns in a data.table and update them with some numbers:
T <- data.table(Init_1 = rep(0, 100), Init_2 = rep(0, 100))
for (i in 1:100){
  T[, Init_1 := i]
  T[, Init_2 := 2*i]
}
I expected that this code would add two columns to the data.table T (Init_1 and Init_2) and fill them with the numbers 1:100 and (2, 4, ..., 200) respectively.
However, the code returns constant values:
> T
Init_1 Init_2
1: 100 200
2: 100 200
3: 100 200
4: 100 200
5: 100 200
6: 100 200
7: 100 200
8: 100 200
.................
Could you explain why my code is not working as expected and how it could be fixed?
Your advice will be appreciated.
Edit:
In relation to answer 2, eventually I want to use a function inside the for loop. More specifically:
# A FUNCTION THAT RETURNS THE TRANSITION PROBABILITIES AFTER N STEPS IN A MARKOV CHAIN
#-------------------------------------------------------------------------------------
R <- function(P, n){
  if (n == 1) return(P)
  R(P, n-1) %*% P
}
# A ONE-STEP PROBABILITY MATRIX
#---------------------------------------------------------------------------------------
P = matrix(c(0.6, 0.1, 0.3, 0.2, 0.7, 0.1, 0.3, 0.3, 0.4), nrow = 3, byrow = TRUE)
# EXAMINING THE CONVERGENCE PROCESS OF THE PROBABILITIES OVER TIME
#########################################################################
T <- data.table(Init_1 = rep(0, 100), Init_2 = rep(0, 100))
for (i in 1:100){
  T[, Init_1 := R(P, i)[1,1]]
  T[, Init_2 := R(P, i)[2,1]]
}
or
for (i in 1:100){
  T[, ':=' (Init_1 = R(P, i)[1,1],
            Init_2 = R(P, i)[2,1]) ]
}
I'm no data.table expert, but I know it throws helpful error messages. If you e.g. create an empty data.table and try to use := to add columns, it says:
T <- data.table()
T[,a:=1]
# Error in `[.data.table`(T, , `:=`(a, 1)) :
# Cannot use := to add columns to a null data.table (no columns), currently.
# You can use := to add (empty) columns to a 0-row data.table (1 or more empty columns),
# though.
Your problem might be related, because data.table(numeric()), or rather T <- data.table(numeric(length = 0)), creates a 0-row data.table. The empty column gets named V1 by default. Here you could use := to add empty columns. However, that's not what you want.
Instead you could do
T <- data.table(numeric(0))
for (i in 1:5){
  T <- T[, .(
    Init_1 = if (exists("Init_1")) c(Init_1, i) else i,
    Init_2 = if (exists("Init_2")) c(Init_2, 2*i) else 2*i )]
}
T
# Init_1 Init_2
# 1: 1 2
# 2: 2 4
# 3: 3 6
# 4: 4 8
# 5: 5 10
Although that's pretty ugly and probably super inefficient.
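For completeness, a sketch of the direct fix (my addition, not part of the original answer): := without an i assigns the whole column on every pass, so only the last i survives. Index the row inside the loop, or drop the loop entirely:
T <- data.table(Init_1 = rep(0, 100), Init_2 = rep(0, 100))
for (i in 1:100){
  T[i, `:=`(Init_1 = i, Init_2 = 2*i)]  # write only row i
}
## or, vectorized, without a loop:
T <- data.table(Init_1 = 1:100, Init_2 = 2*(1:100))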
First, you should not use T as a variable name; T is the built-in shorthand for TRUE. Also, it is not recommended to use i for iteration, since it is also used for complex numbers, for example:
> (2i)^2
[1] -4+0i
Third, iteration is slow in R. We should avoid using iteration whenever possible.
Here is simple code to generate such a matrix. Hope this helps.
T.data <- matrix(NA, nrow = 100, ncol = 2)
T.data[,1] <- 1:100
T.data[,2] <- 2*T.data[,1]

Replace NA with 0, only in numeric columns in data.table

I have a data.table with columns of different data types. My goal is to select only numeric columns and replace NA values within these columns by 0.
I am aware that replacing NA values with zero goes like this:
DT[is.na(DT)] <- 0
To select only numeric columns, I found this solution, which works fine:
DT[, as.numeric(which(sapply(DT,is.numeric))), with = FALSE]
I can achieve what I want by assigning
DT2 <- DT[, as.numeric(which(sapply(DT,is.numeric))), with = FALSE]
and then do:
DT2[is.na(DT2)] <- 0
But of course I would like to have my original DT modified by reference. With the following, however:
DT[, as.numeric(which(sapply(DT,is.numeric))), with = FALSE]
[is.na(DT[, as.numeric(which(sapply(DT,is.numeric))), with = FALSE])]<- 0
I get:
Error in `[.data.table`([...] i is invalid type (matrix)
What am I missing?
Any help is much appreciated!!
We can use set:
for(j in seq_along(DT)){
  set(DT, i = which(is.na(DT[[j]]) & is.numeric(DT[[j]])), j = j, value = 0)
}
Or create an index of the numeric columns, loop through it and set the NA values to 0:
ind <- which(sapply(DT, is.numeric))
for(j in ind){
  set(DT, i = which(is.na(DT[[j]])), j = j, value = 0)
}
data
set.seed(24)
DT <- data.table(v1= c(NA, 1:4), v2 = c(NA, LETTERS[1:4]), v3=c(rnorm(4), NA))
I wanted to explore and possibly improve on the excellent answer given above by @akrun. Here's the data he used in his example:
library(data.table)
set.seed(24)
DT <- data.table(v1= c(NA, 1:4), v2 = c(NA, LETTERS[1:4]), v3=c(rnorm(4), NA))
DT
#> v1 v2 v3
#> 1: NA <NA> -0.5458808
#> 2: 1 A 0.5365853
#> 3: 2 B 0.4196231
#> 4: 3 C -0.5836272
#> 5: 4 D NA
And the two methods he suggested to use:
fun1 <- function(x){
  for(j in seq_along(x)){
    set(x, i = which(is.na(x[[j]]) & is.numeric(x[[j]])), j = j, value = 0)
  }
}
fun2 <- function(x){
  ind <- which(sapply(x, is.numeric))
  for(j in ind){
    set(x, i = which(is.na(x[[j]])), j = j, value = 0)
  }
}
I think the first method above is really genius as it exploits the fact that NAs are typed.
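A quick sketch of what that buys us (my reading): is.numeric() on a whole column is a single flag, so the & recycles it and which() comes back empty for non-numeric columns, which set() then treats as "no rows to touch":
v2 <- c(NA, LETTERS[1:4])          # character column
which(is.na(v2) & is.numeric(v2))  # integer(0): column skipped entirely
v1 <- c(NA, 1:4)                   # numeric column
which(is.na(v1) & is.numeric(v1))  # 1: just the NA row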
First of all, even though .SD is not available in the i argument, it is possible to pull the column name with get(), so I thought I could sub-assign the data.table this way:
fun3 <- function(x){
  nms <- names(x)[sapply(x, is.numeric)]
  for(j in nms){
    x[is.na(get(j)), (j) := 0]
  }
}
The generic case, of course, would be to rely on .SD and .SDcols to work only on the numeric columns:
fun4 <- function(x){
  nms <- names(x)[sapply(x, is.numeric)]
  x[, (nms) := lapply(.SD, function(i) replace(i, is.na(i), 0)), .SDcols = nms]
}
But then I thought to myself, "Hey, who says we can't go all the way to base R for this sort of operation?" Here's a simple lapply() with a conditional statement, wrapped into setDT():
fun5 <- function(x){
  setDT(
    lapply(x, function(i){
      if(is.numeric(i))
        i[is.na(i)] <- 0
      i
    })
  )
}
Finally, we could use the same conditional idea to limit the columns on which we apply set():
fun6 <- function(x){
  for(j in seq_along(x)){
    if (is.numeric(x[[j]]))
      set(x, i = which(is.na(x[[j]])), j = j, value = 0)
  }
}
Here are the benchmarks:
microbenchmark::microbenchmark(
  for.set.2cond = fun1(copy(DT)),
  for.set.ind = fun2(copy(DT)),
  for.get = fun3(copy(DT)),
  for.SDcol = fun4(copy(DT)),
  for.list = fun5(copy(DT)),
  for.set.if = fun6(copy(DT))
)
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> for.set.2cond 59.812 67.599 131.6392 75.5620 114.6690 4561.597 100 a
#> for.set.ind 71.492 79.985 142.2814 87.0640 130.0650 4410.476 100 a
#> for.get 553.522 569.979 732.6097 581.3045 789.9365 7157.202 100 c
#> for.SDcol 376.919 391.784 527.5202 398.3310 629.9675 5935.491 100 b
#> for.list 69.722 81.932 137.2275 87.7720 123.6935 3906.149 100 a
#> for.set.if 52.380 58.397 116.1909 65.1215 72.5535 4570.445 100 a
You can use the tidyverse (purrr) function map_if along with ifelse to do the job in a single line of code.
library(tidyverse)
set.seed(24)
DT <- data.table(v1= sample(c(1:3,NA),20,replace = T), v2 = sample(c(LETTERS[1:3],NA),20,replace = T), v3=sample(c(1:3,NA),20,replace = T))
The single line of code below takes a DT with numeric and non-numeric columns and operates just on the numeric columns, replacing NAs with 0:
DT %>% map_if(is.numeric,~ifelse(is.na(.x),0,.x)) %>% as.data.table
So, tidyverse can be less verbose than data.table sometimes :-)
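One caveat worth adding: map_if() returns a plain list, so the as.data.table() at the end builds a new table rather than updating DT by reference. Assign the result back if you want DT itself changed:
DT <- DT %>% map_if(is.numeric, ~ifelse(is.na(.x), 0, .x)) %>% as.data.table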
