data.table ifelse taking a long time [duplicate] - r

I am writing a function that uses multiple ifelse calls in data.table operations. Although I am using data.table for speed, the many ifelse calls make the code slow, and this function is meant for a large data set. So I was wondering whether there is an alternative to ifelse.
One example ifelse from the function (there are close to 15 of them); here flag is set to 1 if x is blank and 0 otherwise:
dt<-dt[,flag:=ifelse(is.na(x)|!nzchar(x),1,0)]

The fastest approach will probably depend on what your data looks like. The approaches mentioned in the comments are all comparable for this example:
(twice was suggested by @DavidArenburg and onceadd by @akrun. I'm not sure how to benchmark these with replications > 1, since the objects are actually modified in place during the benchmark.)
DT <- data.table(x=sample(c(NA,"",letters),1e8,replace=TRUE))
DT0 <- copy(DT)
DT1 <- copy(DT)
DT2 <- copy(DT)
DT3 <- copy(DT)
DT4 <- copy(DT)
DT5 <- copy(DT)
DT6 <- copy(DT)
DT7 <- copy(DT)
library(rbenchmark)
benchmark(
  ifelse  = DT0[, flag := ifelse(is.na(x) | !nzchar(x), 1L, 0L)],
  keyit   = {
    setkey(DT1, x)
    DT1[, flag := 0L]
    DT1[J(c(NA_character_, "")), flag := 1L]
  },
  twiceby = DT2[, flag := 0L][is.na(x) | !nzchar(x), flag := 1L, by = x],
  twice   = DT3[, flag := 0L][is.na(x) | !nzchar(x), flag := 1L],
  onceby  = DT4[, flag := +(is.na(x) | !nzchar(x)), by = x],
  once    = DT5[, flag := +(is.na(x) | !nzchar(x))],
  onceadd = DT6[, flag := (is.na(x) | !nzchar(x)) + 0L],
  oncebyk = {setkey(DT7, x); DT7[, flag := +(is.na(x) | !nzchar(x)), by = x]},
  replications = 1
)[1:5]
# test replications elapsed relative user.self
# 1 ifelse 1 19.61 31.127 17.32
# 2 keyit 1 0.63 1.000 0.47
# 6 once 1 3.26 5.175 2.68
# 7 onceadd 1 3.24 5.143 2.88
# 5 onceby 1 1.81 2.873 1.75
# 8 oncebyk 1 0.91 1.444 0.82
# 4 twice 1 3.17 5.032 2.79
# 3 twiceby 1 3.45 5.476 3.16
Discussion. In this example, keyit is the fastest. However, it is also the most verbose and it changes the sorting of your table. Also, keyit is very specific to the OP's question (it takes advantage of the fact that exactly two character values satisfy the condition is.na(x)|!nzchar(x)), so it might not be as good for other applications, where it would need to be written as something like
keyit = {
  setkey(DT1, x)
  flagem = DT1[, some_other_condition(x), by = x][(V1)]$x
  DT1[, flag := 0L]
  DT1[J(flagem), flag := 1L]
}
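As an aside, the +( ) in the once* variants is just unary plus coercing the logical vector to integer (0/1), which is why those variants avoid the overhead of ifelse. A quick illustration in plain R:
x <- c(NA, "", "a")
+(is.na(x) | !nzchar(x))
# [1] 1 1 0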

Related

Take column-wise differences across a data.table

How can I use data.table syntax to produce a data.table where each column contains the differences between the column of the original data.table and the next column?
Example: I have a data.table where each row is a group, and each column is surviving population after year 0, after year 1, 2, etc. Such as:
pop <- data.table(group_id = c(1, 2, 3),
N = c(4588L, 4589L, 4589L),
N_surv_1 = c(4213, 4243, 4264),
N_surv_2 = c(3703, 3766, 3820),
N_surv_3 = c(2953, 3054, 3159) )
# group_id N N_surv_1 N_surv_2 N_surv_3
# 1 4588 4213 3703 2953
# 2 4589 4243 3766 3054
# 3 4589 4264 3820 3159
(Data types differ because N is a true integer count and N_surv_1, etc. are projections that could be fractional.)
What I have done: using the base diff and matrix transposition, we can:
diff <- data.table(t(diff(t(as.matrix(pop[,-1,with=FALSE])))))
setnames(diff, paste0("deaths_",1:ncol(diff)))
cbind(group_id = pop[,group_id],diff)
# produces desired output:
# group_id deaths_1 deaths_2 deaths_3
# 1 -375 -510 -750
# 2 -346 -477 -712
# 3 -325 -444 -661
I know that I can use base diff by group on a single column produced by melt.data.table, so this works but ain't pretty:
melt(pop,
     id.vars = "group_id"
)[order(group_id)][, setNames(as.list(diff(value)),
                              paste0("deaths_", 1:(ncol(pop)-2))),
                   keyby = group_id]
Is that the most data.table-riffic way to do this, or is there a way to do it as a multi-column operation in data.table?
Well, you could subtract the subsets:
ncols = grep("^N(_surv_[0-9]+)?", names(pop), value=TRUE)
pop[, Map(
`-`,
utils:::tail.default(.SD, -1),
utils:::head.default(.SD, -1)
), .SDcols=ncols]
# N_surv_1 N_surv_2 N_surv_3
# 1: -375 -510 -750
# 2: -346 -477 -712
# 3: -325 -444 -661
You could assign these values to new columns with :=. I have no idea why tail and head are not made more easily available... As pointed out by @akrun, you could use with=FALSE instead, like pop[, .SD[, -1, with=FALSE] - .SD[, -ncol(.SD), with=FALSE], .SDcols=ncols].
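For instance, a minimal sketch of that := assignment (the deaths_* names and the copy() are my own additions, to keep pop unchanged for the examples that follow):
newcols <- paste0("deaths_", seq_len(length(ncols) - 1))   # illustrative names
pop2 <- copy(pop)
pop2[, (newcols) := Map(`-`,
                        utils:::tail.default(.SD, -1),
                        utils:::head.default(.SD, -1)),
     .SDcols = ncols]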
Anyway, this is pretty convoluted compared to simply reshaping:
melt(pop, id="group_id")[, tail(value, -1) - head(value, -1), by=group_id]
# group_id V1
# 1: 1 -375
# 2: 1 -510
# 3: 1 -750
# 4: 2 -346
# 5: 2 -477
# 6: 2 -712
# 7: 3 -325
# 8: 3 -444
# 9: 3 -661
Without reshaping, and since each row has a unique id, you can group by the id column and then calculate the differences across each row with diff, i.e. diff(unlist(.SD)):
pop[, setNames(as.list(diff(unlist(.SD))), paste0("deaths_", 1:(ncol(pop)-2))), group_id]
# group_id deaths_1 deaths_2 deaths_3
# 1: 1 -375 -510 -750
# 2: 2 -346 -477 -712
# 3: 3 -325 -444 -661
Essentially, something like this if you ignore setting up the column names:
pop[, as.list(diff(unlist(.SD))), group_id]
Here's another way to do it without reshaping or grouping, which might make it faster. If there's a small number of rows, the difference probably won't be noticeable.
cols <- names(pop)[-1]
combs <- list()
for (i in 2:length(cols)) {
  combs[[length(combs) + 1]] <- c(cols[i-1], cols[i])
}
newnames <- sapply(combs, function(x) gsub('N_surv', 'death', x[2]))
deathpop <- copy(pop)
deathpop[, (newnames) := lapply(combs, function(x) get(x[2]) - get(x[1]))]
deathpop[, (cols) := NULL]
I did some benchmarking
rows<-10000000
pop <- data.table(group_id = 1:rows,
N = runif(rows,3000,4000),
N_surv_1 = runif(rows,3000,4000),
N_surv_2 = runif(rows,3000,4000),
N_surv_3 = runif(rows,3000,4000))
system.time({
  cols <- names(pop)[-1]
  combs <- list()
  for (i in 2:length(cols)) {
    combs[[length(combs) + 1]] <- c(cols[i-1], cols[i])
  }
  newnames <- sapply(combs, function(x) gsub('N_surv', 'death', x[2]))
  deathpop <- copy(pop)
  deathpop[, (newnames) := lapply(combs, function(x) get(x[2]) - get(x[1]))]
  deathpop[, (cols) := NULL]
})
and it returned
user system elapsed
0.192 0.808 1.003
In contrast I did
system.time(pop[, as.list(diff(unlist(.SD))), group_id])
and it returned
user system elapsed
169.836 0.428 170.469
I also did
system.time({
ncols = grep("^N(_surv_[0-9]+)?", names(pop), value=TRUE)
pop[, Map(
`-`,
utils:::tail.default(.SD, -1),
utils:::head.default(.SD, -1)
), .SDcols=ncols]
})
which returned
user system elapsed
0.044 0.044 0.089
Finally, doing
system.time(melt(pop, id="group_id")[, tail(value, -1) - head(value, -1), by=group_id])
returns
user system elapsed
223.360 1.736 225.315
Frank's Map solution is the fastest. If you take the copying out of mine, it gets a lot closer to Frank's time, but his still wins for this test case.
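For reference, "taking the copying out" just means building the death columns on pop by reference instead of on a copy; a rough sketch of the same approach (note this modifies pop in place, and the relative-speed claim above is the original author's, not re-benchmarked here):
cols <- names(pop)[-1]
combs <- lapply(2:length(cols), function(i) c(cols[i-1], cols[i]))
newnames <- sapply(combs, function(x) gsub('N_surv', 'death', x[2]))
pop[, (newnames) := lapply(combs, function(x) get(x[2]) - get(x[1]))]
pop[, (cols) := NULL]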


Assigning value to a huge data table element too slow

Given a data.table defined as follows:
n <- 34916339
ds2 <- data.table(X=rep(as.integer(NA),n), Y=rep(as.integer(NA),n),
LAT=rep(as.numeric(NA),n), LON=rep(as.numeric(NA),n),
FCT_DATE=rep(as.Date(NA),n), VAR=rep(as.character(NA),n),
TYPE=rep(as.character(NA),n), VALUE=rep(as.numeric(NA),n))
The following code is too slow.
# ds is a data.table with dim 572399 x 66
colNames <- rep("any_string", 66)                # only an example
for (i in 1:nrow(ds)) {
  for (j in 6:66) {
    colName <- colNames[j]
    colName.split <- strsplit(colName, "_")      # split the elements by "_"
    # colIndex is not defined in the snippet; from the comments, length(colIndex) is presumably 61
    k <- ((i-1) * length(colIndex)) + (j-5)      # creates 61 rows for each complete inner loop
    ds2[k, 6] <- colName.split[[1]][1]
    ds2[k, 7] <- colName.split[[1]][2]           # so, it reads 61 cols from ds
    ds2[k, 8] <- ds[i, get(colName)]             # and creates 61 rows in ds2
  }
}
Does anyone know how I can improve this code? In particular, the assignments to columns 6, 7 and 8 are slow. I'm trying to convert the 66 columns of the data.table ds into an 8-column data.table.
Thanks in advance.
EDITED:
# Building an example of the data.table ds (the fastest way I know for the moment)
nds <- 1e6   # number of rows for the example (the second answer below uses nds <- 1e6)
ds <- data.table(1:nds, 1:nds, rep(3.3, nds), rep(4.4, nds), rep(as.Date("2014-08-16"), nds))
for (i in 1:61) {
  ds <- cbind(ds, rep(i + i/10, nds))
}
# setting the real names
names.ds <- c("X","Y","LAT","LON","FCT_DATE",
"UVES_01N","VVES_01N","PSNM_01N","PREC_01N","UVES_01P","VVES_01P","PSNM_01P","PREC_01P",
"UVES_02N","VVES_02N","PSNM_02N","PREC_02N","UVES_02P","VVES_02P","PSNM_02P","PREC_02P",
"UVES_03N","VVES_03N","PSNM_03N","PREC_03N","UVES_03P","VVES_03P","PSNM_03P","PREC_03P",
"UVES_04N","VVES_04N","PSNM_04N","PREC_04N","UVES_04P","VVES_04P","PSNM_04P","PREC_04P",
"UVES_05N","VVES_05N","PSNM_05N","PREC_05N","UVES_05P","VVES_05P","PSNM_05P","PREC_05P",
"UVES_06N","VVES_06N","PSNM_06N","PREC_06N","UVES_06P","VVES_06P","PSNM_06P","PREC_06P",
"UVES_07N","VVES_07N","PSNM_07N","PREC_07N","UVES_07P","VVES_07P","PSNM_07P","PREC_07P",
"UVES_AVN","VVES_AVN","PSNM_AVN","PREC_AVN","PREC_OBS")
setnames(ds, old=1:66, new=names.ds)
My goal is to convert it to a data.table like this:
X Y LAT LON FCT_DATE VAR TYPE VALUE
1: 312 54 -39.7401 -68.4375 2009-01-02 UVES 01N 0.63
2: 312 54 -39.7401 -68.4375 2009-01-02 VVES 01N -3.17
3: 312 54 -39.7401 -68.4375 2009-01-02 PSNM 01N 1019.52
...
34916339: 341 83 -39.7401 -68.4375 2009-01-02 PREC OBS 0.50
I think you're trying to reinvent the wheel. This works:
library(reshape2)
ds2 <- melt(ds, 1:5, variable.name = "VAR", value.name = "VALUE")
ds2[, VAR := as.character(VAR)]
ds2[, `:=`(TYPE = sub(".*_", "", VAR), VAR = sub("_.*", "", VAR))]
It was fairly slow on just 1,000,000 rows (on a MacBook Pro w/ OS 10.9, 2.8 GHz i7):
# user system elapsed
# 73.373 1.398 74.809
but at least it's parsimonious and readable. You also didn't say how slow "too slow" was, so I have no idea if this is an improvement. A strsplit-based solution took even longer (> 100 seconds), and stringr::str_match_all longer than that.
Here's a faster way. The other answer calls sub(...) twice for each row. There's no need to do that since these are just the column names, and there are only 66 of them. Using your code with nds <- 1e6 to create ds, the code below runs about 20X faster.
library(reshape2)
# code from other answer
system.time({
ds2 <- melt(ds, 1:5, variable.name = "VAR", value.name = "VALUE")
ds2[, VAR := as.character(VAR)]
ds2[, `:=`(TYPE = sub(".*_", "", VAR), VAR = sub("_.*", "", VAR))]
})
# user system elapsed
# 239.43 1.05 240.78
# this code does not call sub(...) 2 million times
system.time({
cn <- strsplit(colnames(ds)[6:66],"_")
ds3 <- melt(ds,1:5,variable.name="VAR",value.name="VALUE")
ds3[,":="(VAR =rep(sapply(cn,"[",1),each=nrow(ds)),
TYPE=rep(sapply(cn,"[",2),each=nrow(ds)))]
})
# user system elapsed
# 13.87 8.96 22.83
identical(ds2,ds3)
# [1] TRUE

R apply conversion to multiple columns of data.frame

I want to convert several columns in a data.frame from character to numeric, and I would like to do it in a single line. Here is what I am trying to do:
items[,2:4] <- as.numeric(sub("\\$","",items[,2:4]))
But I get a warning (and the result is all NAs):
Warning message:
NAs introduced by coercion
If I do it column by column though it works:
items[,2:2] <- as.numeric(sub("\\$","",items[,2:2]))
items[,3:3] <- as.numeric(sub("\\$","",items[,3:3]))
items[,4:4] <- as.numeric(sub("\\$","",items[,4:4]))
What am I missing here? Why can't I apply this to multiple columns at once? Is this some odd R idiosyncrasy that I am not aware of?
Example Data:
Name, Cost1, Cost2, Cost3, Cost4
A, $10.00, $15.50, $13.20, $45.45
B, $45.23, $34.23, $34.24, $23.34
C, $23.43, $45.23, $65.23, $34.23
D, $76.34, $98.34, $90.34, $45.09
Your problem is that sub (and gsub) converts its x argument to character. If a list (and a data.frame is in fact a list) is converted to character, something weird happens:
as.character(list(a=c("1", "1"), b="1"))
# "c(\"1\", \"1\")" "1"
# and "c(\"1\", \"1\")" can not convert into a numeric
as.numeric("c(\"1\", \"1\")")
# NA
A one line solution would be to unlist the x argument:
items[, 2:5] <- as.numeric(gsub("\\$", "", unlist(items[, 2:5])))
Yes there is: apply is the command you are looking for:
items<-read.table(text="Name, Cost1, Cost2, Cost3, Cost4
A, $10.00, $15.50, $13.20, $45.45
B, $45.23, $34.23, $34.24, $23.34
C, $23.43, $45.23, $65.23, $34.23
D, $76.34, $98.34, $90.34, $45.09", header=TRUE,sep=",")
items[,2:4]<-apply(items[,2:4],2,function(x){as.numeric(gsub("\\$","",x))})
items
Name Cost1 Cost2 Cost3 Cost4
1 A 10.00 15.50 13.20 $45.45
2 B 45.23 34.23 34.24 $23.34
3 C 23.43 45.23 65.23 $34.23
4 D 76.34 98.34 90.34 $45.09
A more efficient approach would be:
items[-1] <- lapply(items[-1], function(x) as.numeric(gsub("$", "", x, fixed = TRUE)))
items
# Name Cost1 Cost2 Cost3 Cost4
# 1 A 10.00 15.50 13.20 45.45
# 2 B 45.23 34.23 34.24 23.34
# 3 C 23.43 45.23 65.23 34.23
# 4 D 76.34 98.34 90.34 45.09
Some benchmarks of the answers so far
fun1 <- function() {
  A[-1] <- lapply(A[-1], function(x) as.numeric(gsub("$", "", x, fixed = TRUE)))
  A
}
fun2 <- function() {
  A[, 2:ncol(A)] <- as.numeric(gsub("\\$", "", unlist(A[, 2:ncol(A)])))
  A
}
fun3 <- function() {
  A[, 2:ncol(A)] <- apply(A[, 2:ncol(A)], 2, function(x) { as.numeric(gsub("\\$", "", x)) })
  A
}
Here's some sample data and processing times
set.seed(1)
A <- data.frame(Name = sample(LETTERS, 10000, TRUE),
matrix(paste0("$", sample(99, 10000*100, TRUE)),
ncol = 100))
system.time(fun1())
# user system elapsed
# 0.72 0.00 0.72
system.time(fun2())
# user system elapsed
# 5.84 0.00 5.85
system.time(fun3())
# user system elapsed
# 4.14 0.00 4.14
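Since the rest of this page is data.table-centric, here is a hedged sketch of the same cleanup done by reference with data.table (setDT and the cost_cols name are my additions, not from the answers above):
library(data.table)
setDT(items)                    # convert the data.frame to a data.table by reference
cost_cols <- names(items)[-1]   # the $-prefixed cost columns
items[, (cost_cols) := lapply(.SD, function(x) as.numeric(gsub("$", "", x, fixed = TRUE))),
      .SDcols = cost_cols]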

fast subsetting in data.table in R

Given a data.table, I would like to subset the items in there quickly. For example:
dt = data.table(a=1:10, key="a")
dt[a > 3 & a <= 7]
This is still pretty slow. I know I can do joins to get individual rows, but is there a way to use the fact that the data.table is sorted to get quick subsets of this kind?
This is what I'm doing:
dt1 = data.table(id = 1, ym = c(199001, 199006, 199009, 199012), last_ym = c(NA, 199001, 199006, 199009), v = 1:4, key=c("id", "ym"))
dt2 = data.table(id = 1, ym = c(199001, 199002, 199003, 199004, 199005, 199006, 199007, 199008, 199009, 199010, 199011, 199012), v2 = 1:12, key=c("id","ym"))
For each id (here there is only one) and ym in dt1, I would like to sum the values of v2 in dt2 between the current ym and the previous ym (last_ym) in dt1. That is, for ym == 199006 in dt1 I would like to return list(v2 = 2 + 3 + 4 + 5 + 6): the values of v2 in dt2 that are less than or equal to the current ym and greater than the previous ym. In code:
expr = expression({ # browser();
  cur_id = id;
  cur_ym = ym;
  cur_dtb = dt2[J(cur_id)][ym <= cur_ym & ym > last_ym];
  setkey(cur_dtb, ym);
  list(r = sum(cur_dtb$v2))
})
dt1[, eval(expr), by = list(id, ym)]
To avoid the logical condition, perform a rolling join of dt1 and dt2. Then shift ym forward by one position within id. Finally, sum over v2 by id and ym:
setkey(dt1, id, last_ym)   # key on last_ym so the rolling join rolls forward from it
setkey(dt2, id, ym)
dt1[dt2, , roll = TRUE][                                        # rolling join of dt1 and dt2
  , list(v2 = v2, ym = c(last_ym[1], head(ym, -1))), by = id][  # shift ym forward by one within id
  , list(v2 = sum(v2)), by = list(id, ym)]                      # sum v2 by id and ym
Note that we want to sum everything since the last_ym so the key on dt1 must be last_ym rather than ym.
The result is:
id ym v2
1: 1 199001 1
2: 1 199006 20
3: 1 199009 24
4: 1 199012 33
Regardless of the fact that data.table is sorted, you will be limited to the amount of time it takes to evaluate a > 3 & a <= 7 in the first place:
> dt = data.table(a=1:10000000, key="a")
> system.time(dt$a > 3 & dt$a <= 7)
user system elapsed
0.18 0.01 0.20
> system.time(dt[,a > 3 & a <= 7])
user system elapsed
0.18 0.05 0.24
> system.time(dt[a > 3 & a <= 7])
user system elapsed
0.25 0.07 0.31
Alternative approach:
> system.time({Indices = dt$a > 3 & dt$a <= 7 ; dt[Indices]})
user system elapsed
0.28 0.03 0.31
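For completeness, since a here is an integer key, the same range can also be written as a keyed join, which uses binary search on the sorted key rather than a vector scan (a sketch using the dt defined above; whether it helps will depend on your data):
system.time(dt[J(4:7)])   # a > 3 & a <= 7 is exactly the keyed values 4:7 for integer a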
Multiple Subsets
There can be a speed issue here if you break up factors on an ad hoc basis rather than doing it all at once first:
> dt <- data.table(A=sample(letters, 10000, replace=T))
> system.time(for(i in unique(dt$A)) dt[A==i])
user system elapsed
5.16 0.42 5.59
> system.time(dt[,.SD,by=A])
user system elapsed
0.32 0.03 0.36
