Strange behaviour matching strings in data.table

Let's say I have two columns of strings:
library(data.table)
DT <- data.table(x = c("a","aa","bb"), y = c("b","a","bbb"))
For each row, I want to know whether the string in x is present in column y. A looping approach would be:
for (i in 1:length(DT$x)){
  DT$test[i] <- DT[i, grepl(x, y) + 0]
}
DT
x y test
1: a b 0
2: aa a 0
3: bb bbb 1
Is there a vectorized implementation of this? Using grep(DT$x,DT$y) only uses the first element of x.

You can simply do (grouping by x, since grepl's pattern argument is not vectorized and by = x gives it a single pattern per group):
DT[, test := grepl(x, y), by = x]

Or mapply (Vectorize is really just a wrapper for mapply)
DT$test <- mapply(grepl, pattern=DT$x, x=DT$y)
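To see that claim in action (a quick illustrative check, not part of the original answers): Vectorize(grepl) builds a wrapper whose body ultimately calls mapply(), so the two approaches agree up to names.
vgrepl <- Vectorize(grepl)
# Vectorize() just dispatches to mapply() under the hood, so these give the same result:
all.equal(unname(vgrepl(DT$x, DT$y)),
          unname(mapply(grepl, pattern = DT$x, x = DT$y)))
# [1] TRUE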

Thank you all for your responses. I've benchmarked them all, and come up with the following:
library(data.table)
library(microbenchmark)
DT <- data.table(x = rep(c("a","aa","bb"),1000), y = rep(c("b","a","bbb"),1000))
DT1 <- copy(DT)
DT2 <- copy(DT)
DT3 <- copy(DT)
DT4 <- copy(DT)
microbenchmark(
  DT1[, test := grepl(x, y), by = x],
  DT2$test <- apply(DT, 1, function(x) grepl(x[1], x[2])),
  DT3$test <- mapply(grepl, pattern = DT3$x, x = DT3$y),
  {vgrepl <- Vectorize(grepl)
   DT4[, test := as.integer(vgrepl(x, y))]}
)
Results
Unit: microseconds
expr min lq mean median uq max neval
DT1[, `:=`(test, grepl(x, y)), by = x] 758.339 908.106 982.1417 959.6115 1035.446 1883.872 100
DT2$test <- apply(DT, 1, function(x) grepl(x[1], x[2])) 16840.818 18032.683 18994.0858 18723.7410 19578.060 23730.106 100
DT3$test <- mapply(grepl, pattern = DT3$x, x = DT3$y) 14339.632 15068.320 16907.0582 15460.6040 15892.040 117110.286 100
{ vgrepl <- Vectorize(grepl) DT4[, `:=`(test, as.integer(vgrepl(x, y)))] } 14282.233 15170.003 16247.6799 15544.4205 16306.560 26648.284 100
Along with being the most syntactically simple, the data.table solution is also the fastest.
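If you want something vectorized over both arguments without grouping, stringi is another option (a sketch not included in the benchmark above; it assumes the stringi package, which also appears in a later answer, and uses fixed matching since the patterns here are literal strings):
library(stringi)
# stri_detect_fixed() recycles str and pattern element-wise, so no by/apply/mapply is needed
DT[, test := as.integer(stri_detect_fixed(y, x))]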

You can pass grepl into an apply call to operate on each row of your data.table, where the first column contains the string to search for and the second contains the string to search in. This gives a one-liner, though note that apply loops over rows internally rather than being truly vectorized.
> DT$test <- apply(DT, 1, function(x) as.integer(grepl(x[1], x[2])))
> DT
x y test
1: a b 0
2: aa a 0
3: bb bbb 1

You can use Vectorize:
vgrepl <- Vectorize(grepl)
DT[, test := as.integer(vgrepl(x, y))]
DT
x y test
1: a b 0
2: aa a 0
3: bb bbb 1

Related

Performance considerations of get() in data.table

I've been using get() in a loop to manipulate a column j by i with reference to multiple other columns.
I wonder if there is a faster/more efficient way? Any performance considerations?
Here is a minimal example of the type of operation I have in mind:
require(data.table) # version 1.12.8
dt = data.table(v1=c(1,2,NA),v2=c(0,0,1),v3=c(0,0,0))
for (i in 1:2){
  dt[ is.na(get(paste0('v',i))), (paste0('v',i)) := get(paste0('v',i+1)) + 2 ][]
}
The actual tables I do this with are much larger (~5 million rows, ~300 columns).
I'd highly appreciate any thoughts.
We can use set, which assigns in place:
library(data.table)
for(j in 1:2) {
  i1 <- which(is.na(dt[[j]]))
  set(dt, i = i1, j = j, value = dt[[j+1]][i1] + 2)
}
dt
# v1 v2 v3
#1: 1 0 0
#2: 2 0 0
#3: 3 1 0
There is not much difference between a for loop and lapply if both use get. For a performance improvement, it is better to use set.
In base R, we can also do
setDF(dt)
i1 <- is.na(dt[-length(dt)])
dt[-length(dt)][i1] <- dt[-1][i1] + 2
dt
# v1 v2 v3
#1 1 0 0
#2 2 0 0
#3 3 1 0
Yes, your for loop slows you down considerably. Even a simple lapply (and there are probably more elegant ways to do this) brings significant performance gains:
library(data.table)
dt <- data.table(v1 = rnorm(100), v2 = sample(c(NA,1:5)), v3 = sample(c(NA,1:5)), v4 = sample(c(NA,1:5)))
dt2 <- copy(dt)
dt3 <- copy(dt)
dt4 <- copy(dt)
microbenchmark::microbenchmark(
  for (i in 1:2){
    dt[ is.na(get(paste0('v',i))), (paste0('v',i)) := get(paste0('v',i+1)) + 2 ]
  },
  for (i in 1:2){
    dt2[ is.na(get(paste0('v',i))), (paste0('v',i)) := get(paste0('v',i+1)) + 2 ][]
  },
  lapply(1:2, function(i) dt3[ is.na(get(paste0('v',i))), (paste0('v',i)) := get(paste0('v',i+1)) + 2 ]),
  for(j in 1:2) {
    i1 <- which(is.na(dt4[[j]]))
    set(dt4, i = i1, j = j, value = dt[[j+1]][i1] + 2)
  }
)
Unit: milliseconds
                                                                                                          expr      min       lq      mean   median       uq       max neval
        for (i in 1:2) { dt[is.na(get(paste0("v", i))), `:=`((paste0("v", i)), get(paste0("v", i + 1)) + 2)] } 8.439924 8.651308 10.257670 8.900500 9.588571 35.259060   100
     for (i in 1:2) { dt2[is.na(get(paste0("v", i))), `:=`((paste0("v", i)), get(paste0("v", i + 1)) + 2)][] } 8.902435 9.098875 10.469305 9.262659 9.729876 23.245224   100
 lapply(1:2, function(i) dt3[is.na(get(paste0("v", i))), `:=`((paste0("v", i)), get(paste0("v", i + 1)) + 2)]) 1.032788 1.144117  1.561741 1.224858 1.349337  9.467026   100
        for (j in 1:2) { i1 <- which(is.na(dt4[[j]])) set(dt4, i = i1, j = j, value = dt[[j + 1]][i1] + 2) } 6.216452 6.392754  7.970074 6.502356 7.046646 30.857044   100
Checking results are equivalent:
identical(dt,dt2)
# [1] TRUE
identical(dt,dt3)
# [1] TRUE
identical(dt,dt4)
# [1] TRUE
There's probably a more elegant way to do it, but cutting the mean computation time by a factor of ten for something that only took a few seconds to write is a good return ;)

data.table bug: lapply on .SD reorders columns when using get(). Possible workaround?

I found a strange behavior in data.table, and I would like to know if there is a way to avoid it, or a workaround.
In my data management I often use lapply with .SD to assign new values to columns. To assign several columns correctly, the column order of the lapply output must be preserved. I found a situation where it is not.
Here is the normal behavior:
library(data.table)
plouf <- data.table(x = 1, y = 2, z = 3)
cols <- c("y","x")
plouf[,.SD,.SDcols = cols ,by = z]
plouf[,lapply(.SD,function(x){x}),.SDcols = cols ,by = z]
plouf[,lapply(.SD[x == 1],function(x){x}),.SDcols = cols ,by = z]
All these lines give:
z y x
1: 3 2 1
which I need, for example, to reassign to c("y","x"). But if I do:
plouf[,lapply(.SD[get("x") == 1],function(x){x}),.SDcols = c("y","x"),by = z]
z x y
1: 3 1 2
Here the order of x and y changed for no reason, when it should yield the same result as the last "working" example above. I then end up assigning the wrong values to c("y","x") when I assign the output of the lapply to a new vector of columns. It seems that using get() in the i part of .SD triggers this bug.
Example of the effect of this on assignment:
plouf[, c(cols) := lapply(.SD[get("x") == 1], function(x){x}),
      .SDcols = cols, by = z][]
# x y z
# 1: 2 1 3
Does anyone have a workaround? The code I am actually using looks more like:
plouf[, c(cols) := lapply(.SD[get("x") >= 1 & get("x") <= 3], function(x){ mean(x) }),
      .SDcols = cols, by = z]
The issue on GitHub: https://github.com/Rdatatable/data.table/issues/4089
Instead of subsetting .SD, you could do the subsetting in your lapply function. If the logical vector used for subsetting is passed as a third argument to lapply it isn't re-evaluated at each lapply pass.
Note: I changed the function to multiply by 10 since otherwise I couldn't tell if the code was doing anything at all
plouf[, (cols) := lapply(.SD, function(x, i) 10*mean(x[i]),
                         get("x") %between% c(1, 3)),
      .SDcols = cols, by = z][]
# x y z
# 1: 10 20 3
There are other workarounds that would allow you to subset .SD, but I think subsetting .SD by group is slower than subsetting each column individually.
library(magrittr)  # for the %>% pipe
set.seed(0)
df <- rep(1:50000, sample(500:1000, 50000, T)) %>%
  data.table(a = runif(length(.)), b = .)
library(microbenchmark)
microbenchmark(
subSD = df[, lapply(.SD[a < .2], sum), b]
, in_func = df[, lapply(.SD, function(x, i) sum(x[i]), a < .2), b]
, times = 10L)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# subSD 19323.19 20398.3666 21289.345 20708.4346 22466.010 23738.467 10 b
# in_func 972.64 987.7891 1016.252 995.4236 1038.069 1125.709 10 a
Edit: bigger benchmark
set.seed(0)
rm(df)
df <- rep(1:5e5, sample(50:100, 5e5, T)) %>%
  data.table(a = runif(length(.)), b = .)
library(microbenchmark)
microbenchmark(
subSD = df[, lapply(.SD[a < .2], sum), b]
, in_func = df[, lapply(.SD, function(x, i) sum(x[i]), a < .2), b]
, times = 2L)
# Unit: seconds
# expr min lq mean median uq max neval cld
# subSD 207.111290 207.111290 214.147649 214.147649 221.18401 221.18401 2 b
# in_func 3.560467 3.560467 3.651359 3.651359 3.74225 3.74225 2 a
In the bug report on GitHub, @jangoreki suggested:
As a workaround, you can now use substitute rather than get:
var = "x"
expr = substitute(
  plouf[, c(cols) := lapply(.SD[.var == 1], function(x){x}), .SDcols = cols, by = z][],
  list(.var = as.name(var))
)
print(expr)
#plouf[, `:=`(c(cols), lapply(.SD[x == 1], function(x) {
# x
#})), .SDcols = cols, by = z][]
eval(expr)
# x y z
#1: 2 1 3
Personally, I would use this regularly, not just as a workaround; I find R's metaprogramming features superior.
Also be aware that some day, instead of get(var), we should be able to use ..var, see #2816 and #3199.
R metaprogramming has always worked and, I assume, always will, thanks to R's conservative, backward-compatible development.
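A related, more recent option (a sketch, not from the thread: it assumes a data.table version of roughly 1.14.2 or later, where the experimental env argument was added) is to let data.table do the substitution itself; env = list(v = var) replaces v with the column name held in var before evaluation, producing the same call as the plain-name version that works above:
var <- "x"
# v is replaced by the symbol x before the query is evaluated, so no get() is needed
plouf[, c(cols) := lapply(.SD[v == 1], function(x) x),
      .SDcols = cols, by = z, env = list(v = var)][]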

Count number of 1's from right to left, stopping at the first 0

I want to count the number of 1's that occur from RIGHT to LEFT across multiple columns, which stops when encountering the first 0.
Example DF:
df<-data.frame(replicate(7,sample(0:1,30,rep=T)))
colnames(df)<-seq(1950,2010,10)
I've manually entered the desired result under a new column "condition" as an example (table not shown).
Thanks in advance for your help,
Cai
Here's a fully vectorized attempt
indx <- rowSums(df) == ncol(df) # Per Jaap's comment
df$condition <- ncol(df) - max.col(-df, ties = "last")
df$condition[indx] <- ncol(df) - 1
This basically finds the first zero from the right and counts how many columns lie to its right (which, for 0/1 data, are all 1s).
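A tiny worked example of that max.col() trick (toy rows, for illustration only; note that the answer's ties = "last" relies on R's partial matching of the ties.method argument):
m <- rbind(c(1, 0, 1, 1),   # two trailing 1s
           c(0, 1, 0, 0))   # zero trailing 1s
max.col(-m, ties.method = "last")            # column of the last 0 in each row: 2 4
ncol(m) - max.col(-m, ties.method = "last")  # trailing 1s: 2 0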
EDIT
Had to add handling for the special case where a row is all ones.
df$condition <- apply(df, 1, function(x) {
  y <- rev(x)
  sum(cumprod(y))
})
[Edit: now works]
Try this
df$condition <- apply(df,1,function(x){x<- rev(x);m <- match(0,x)[1]; if (is.na(m)) sum(x) else sum(x[1:m])})
We match the first 0 in the reversed row, then sum up to that element.
If there is no zero, we sum the full row.
Here's a benchmark of all the solutions (the output below also includes two variants, akrun3 and David_Arenburg_corrected, whose code is not shown):
library(stringr)
library(microbenchmark)
microbenchmark(
  Moody_Mudskipper = apply(df, 1, function(x){x <- rev(x); m <- match(0, x)[1]; if (is.na(m)) sum(x) else sum(x[1:m])}),
  akrun = apply(df, 1, function(x) {x1 <- rle(x)
    x2 <- tail(x1$lengths, 1)[tail(x1$values, 1) == 1]
    if(length(x2) == 0) 0 else x2}),
  akrun2 = str_count(do.call(paste0, df), "[1]+$"),
  roland = apply(df, 1, function(x) {y <- rev(x); sum(y * cumprod(y != 0L))}),
  David_Arenburg = ncol(df) - max.col(-df, ties = "last"),
  times = 10)
# Unit: microseconds
# expr min lq mean median uq max neval
# Moody_Mudskipper 1437.948 1480.417 1677.1929 1536.159 1597.209 3009.320 10
# akrun 6985.174 7121.078 7718.2696 7691.053 7856.862 9289.146 10
# akrun2 1101.731 1188.793 1290.8971 1226.486 1343.099 1790.091 10
# akrun3 693.315 791.703 830.3507 820.371 884.782 1030.240 10
# roland 1197.995 1270.901 1708.5143 1332.305 1727.802 4568.660 10
# David_Arenburg 2845.459 3060.638 3406.3747 3167.519 3495.950 5408.494 10
# David_Arenburg_corrected 3243.964 3341.644 3757.6330 3384.645 4195.635 4943.099 10
For a bigger example, David's solution is indeed the fastest, as noted in the accepted answer's comments:
df<-data.frame(replicate(7,sample(0:1,1000,rep=T)))
# Unit: milliseconds
# expr min lq mean median uq max neval
# Moody_Mudskipper 31.324456 32.155089 34.168533 32.827345 33.848560 44.952570 10
# akrun 225.592061 229.055097 238.307506 234.761584 241.266853 271.000470 10
# akrun2 28.779824 29.261499 33.316700 30.118144 38.026145 46.711869 10
# akrun3 14.184466 14.334879 15.528201 14.633227 17.237317 18.763742 10
# roland 27.946005 28.341680 29.328530 28.497224 29.760516 33.692485 10
# David_Arenburg 3.149823 3.282187 3.630118 3.455427 3.727762 5.240031 10
# David_Arenburg_corrected 3.464098 3.534527 4.103335 3.833937 4.187141 6.165159 10
We can loop through the rows and use rle:
df$condition <- apply(df, 1, function(x) {
  x1 <- rle(x)
  x2 <- tail(x1$lengths, 1)[tail(x1$values, 1) == 1]
  if(length(x2) == 0) 0 else x2
})
Another option is str_extract:
library(stringr)
v1 <- str_extract(do.call(paste0, df), "1+$")
df$condition <- ifelse(is.na(v1), 0, nchar(v1))
Or with the slightly more efficient stringi:
library(stringi)
v1 <- stri_count(stri_extract(do.call(paste0, df), regex = "1+$"), regex = ".")
v1[is.na(v1)] <- 0
df$condition <- v1
Or with a more compact option, counting the positions at which the lookahead "1+$" matches (one match per trailing 1):
stri_count(do.call(paste0, df), regex = '(?=1+$)')

Replace NA with 0, only in numeric columns in data.table

I have a data.table with columns of different data types. My goal is to select only numeric columns and replace NA values within these columns by 0.
I am aware that replacing NA values with zero across the whole table goes like this:
DT[is.na(DT)] <- 0
To select only numeric columns, I found this solution, which works fine:
DT[, as.numeric(which(sapply(DT,is.numeric))), with = FALSE]
I can achieve what I want by assigning
DT2 <- DT[, as.numeric(which(sapply(DT,is.numeric))), with = FALSE]
and then do:
DT2[is.na(DT2)] <- 0
But of course I would like to have my original DT modified by reference. With the following, however:
DT[, as.numeric(which(sapply(DT,is.numeric))), with = FALSE]
[is.na(DT[, as.numeric(which(sapply(DT,is.numeric))), with = FALSE])]<- 0
I get
"Error in [.data.table([...] i is invalid type (matrix)"
What am I missing?
Any help is much appreciated!!
We can use set:
for(j in seq_along(DT)){
  set(DT, i = which(is.na(DT[[j]]) & is.numeric(DT[[j]])), j = j, value = 0)
}
Or create an index of the numeric columns, loop through it, and set the NA values to 0:
ind <- which(sapply(DT, is.numeric))
for(j in ind){
  set(DT, i = which(is.na(DT[[j]])), j = j, value = 0)
}
data
set.seed(24)
DT <- data.table(v1= c(NA, 1:4), v2 = c(NA, LETTERS[1:4]), v3=c(rnorm(4), NA))
I wanted to explore and possibly improve on the excellent answer given above by @akrun. Here's the data he used in his example:
library(data.table)
set.seed(24)
DT <- data.table(v1= c(NA, 1:4), v2 = c(NA, LETTERS[1:4]), v3=c(rnorm(4), NA))
DT
#> v1 v2 v3
#> 1: NA <NA> -0.5458808
#> 2: 1 A 0.5365853
#> 3: 2 B 0.4196231
#> 4: 3 C -0.5836272
#> 5: 4 D NA
And the two methods he suggested to use:
fun1 <- function(x){
  for(j in seq_along(x)){
    set(x, i = which(is.na(x[[j]]) & is.numeric(x[[j]])), j = j, value = 0)
  }
}
fun2 <- function(x){
  ind <- which(sapply(x, is.numeric))
  for(j in ind){
    set(x, i = which(is.na(x[[j]])), j = j, value = 0)
  }
}
I think the first method above is really genius as it exploits the fact that NAs are typed.
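My reading of that remark, illustrated on throwaway vectors (not from the original answer): an NA always carries the type of its column, so is.numeric() on the column is enough to make the which(...) index empty for non-numeric columns, and set() then touches nothing.
typeof(c(1, NA))        # "double"    -> this NA is NA_real_
typeof(c("A", NA))      # "character" -> this NA is NA_character_
is.numeric(c("A", NA))  # FALSE, so which(is.na(.) & FALSE) is integer(0)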
First of all, even though .SD is not available in the i argument, it is possible to pull the column by name with get(), so I thought I could sub-assign the data.table this way:
fun3 <- function(x){
  nms <- names(x)[sapply(x, is.numeric)]
  for(j in nms){
    x[is.na(get(j)), (j) := 0]
  }
}
The generic case, of course, would be to rely on .SD and .SDcols to work only on the numeric columns:
fun4 <- function(x){
  nms <- names(x)[sapply(x, is.numeric)]
  x[, (nms) := lapply(.SD, function(i) replace(i, is.na(i), 0)), .SDcols = nms]
}
But then I thought to myself, "Hey, who says we can't go all the way to base R for this sort of operation?" Here's a simple lapply() with a conditional statement, wrapped into setDT():
fun5 <- function(x){
  setDT(
    lapply(x, function(i){
      if(is.numeric(i))
        i[is.na(i)] <- 0
      i
    })
  )
}
Finally, we could use the same conditional idea to limit the columns on which we apply set():
fun6 <- function(x){
  for(j in seq_along(x)){
    if (is.numeric(x[[j]]))
      set(x, i = which(is.na(x[[j]])), j = j, value = 0)
  }
}
Here are the benchmarks:
microbenchmark::microbenchmark(
  for.set.2cond = fun1(copy(DT)),
  for.set.ind = fun2(copy(DT)),
  for.get = fun3(copy(DT)),
  for.SDcol = fun4(copy(DT)),
  for.list = fun5(copy(DT)),
  for.set.if = fun6(copy(DT))
)
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> for.set.2cond 59.812 67.599 131.6392 75.5620 114.6690 4561.597 100 a
#> for.set.ind 71.492 79.985 142.2814 87.0640 130.0650 4410.476 100 a
#> for.get 553.522 569.979 732.6097 581.3045 789.9365 7157.202 100 c
#> for.SDcol 376.919 391.784 527.5202 398.3310 629.9675 5935.491 100 b
#> for.list 69.722 81.932 137.2275 87.7720 123.6935 3906.149 100 a
#> for.set.if 52.380 58.397 116.1909 65.1215 72.5535 4570.445 100 a
You can use the purrr function map_if from the tidyverse, along with ifelse, to do the job in a single line of code.
library(tidyverse)
set.seed(24)
DT <- data.table(v1= sample(c(1:3,NA),20,replace = T), v2 = sample(c(LETTERS[1:3],NA),20,replace = T), v3=sample(c(1:3,NA),20,replace = T))
The single line below takes a DT with numeric and non-numeric columns and operates only on the numeric columns, replacing the NAs with 0:
DT %>% map_if(is.numeric,~ifelse(is.na(.x),0,.x)) %>% as.data.table
So, tidyverse can be less verbose than data.table sometimes :-) (note that, unlike the set() approaches above, this builds a new table rather than modifying DT by reference).
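As an aside (a hedged note, assuming data.table >= 1.12.4, which as far as I recall added nafill()/setnafill()): current data.table can also fill numeric columns by reference directly, in the spirit of the set() answers above.
# setnafill() only accepts numeric-ish columns, so select them first
num_cols <- names(DT)[sapply(DT, is.numeric)]
setnafill(DT, type = "const", fill = 0, cols = num_cols)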

Equivalent for dlply in data.table

I am trying to achieve with data.table what dlply does. As a very simple example:
library(plyr)
library(data.table)
dt <- data.table( p = c("A", "B"), q = 1:2 )
dlply( dt, "p", identity )
$A
p q
1 A 1
$B
p q
1 B 2
dt[ , identity(.SD), by = p ]
p q
1: A 1
2: B 2
foo <- function(x) as.list(x)
dt[ , foo(.SD), by = p ]
p q
1: A 1
2: B 2
Obviously the return values of foo are collapsed into one data.table. And I don't want to use dlply because it passes the split pieces to foo as data.frames, which makes further data.table operations within foo inefficient.
Here's a more data.table oriented approach:
setkey(dt, p)
dt[, list(list(dt[J(.BY[[1]])])), by = p]$V1
#[[1]]
# p q
#1: A 1
#
#[[2]]
# p q
#1: B 2
There are more data.table style alternatives to the above but that seems to be the fastest - here's a comparison with lapply:
dt <- data.table( p = rep( LETTERS[1:25], 1E6), q = 25*1E6, key = "p" )
microbenchmark(dt[, list(list(dt[J(.BY[[1]])])), by = p]$V1, lapply(unique(dt$p), function(x) dt[x]), times = 10)
#Unit: seconds
# expr min lq median uq max neval
#dt[, list(list(dt[J(.BY[[1]])])), by = p]$V1 1.111385 1.508594 1.717357 1.966694 2.108188 10
# lapply(unique(dt$p), function(x) dt[x]) 1.871054 1.934865 2.216192 2.282428 2.367505 10
Try this:
> split(dt, dt[["p"]])
$A
p q
1: A 1
$B
p q
1: B 2
Regarding G. Grothendieck's answer, I was curious how well split performs:
dt <- data.table( p = rep( LETTERS[1:25], 1E6), q = 25*1E6, key = "p" )
system.time(
ll <- split(dt, dt[ ,p ] )
)
user system elapsed
5.237 1.340 6.563
system.time(
ll <- lapply( unique(dt[,p]), function(x) dt[x] )
)
user system elapsed
1.179 0.363 1.541
So if there is no better answer, I'd stick with
lapply( unique(dt[,p]), function(x) dt[x] )
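For reference (a hedged note: split gained a data.table method with a by argument in later releases, around 1.9.8), newer data.table versions make this a one-liner while keeping each piece a data.table:
# returns a named list of data.tables, one per value of p
split(dt, by = "p")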
