Speed up `strsplit` when the possible outputs are known - r

I have a large data frame with a factor column that I need to divide into three factor columns by splitting up the factor names by a delimiter. Here is my current approach, which is very slow with a large data frame (sometimes several million rows):
data <- readRDS("data.rds")
data.df <- reshape2:::melt.array(data)
head(data.df)
##   Time Location    Class Replicate Population
## 1    1        1 LIDE.1.S         1 0.03859605
## 2    2        1 LIDE.1.S         1 0.03852957
## 3    3        1 LIDE.1.S         1 0.03846853
## 4    4        1 LIDE.1.S         1 0.03841260
## 5    5        1 LIDE.1.S         1 0.03836147
## 6    6        1 LIDE.1.S         1 0.03831485
Rprof("str.out")
cl <- which(names(data.df) == "Class")
Classes <- do.call(rbind, strsplit(as.character(data.df$Class), "\\."))
colnames(Classes) <- c("Species", "SizeClass", "Infected")
data.df <- cbind(data.df[, 1:(cl-1)], Classes, data.df[(cl+1):(ncol(data.df))])
Rprof(NULL)
head(data.df)
##   Time Location Species SizeClass Infected Replicate Population
## 1    1        1    LIDE         1        S         1 0.03859605
## 2    2        1    LIDE         1        S         1 0.03852957
## 3    3        1    LIDE         1        S         1 0.03846853
## 4    4        1    LIDE         1        S         1 0.03841260
## 5    5        1    LIDE         1        S         1 0.03836147
## 6    6        1    LIDE         1        S         1 0.03831485
summaryRprof("str.out")
$by.self
                 self.time self.pct total.time total.pct
"strsplit"            1.34    50.00       1.34     50.00
"<Anonymous>"         1.16    43.28       1.16     43.28
"do.call"             0.04     1.49       2.54     94.78
"unique.default"      0.04     1.49       0.04      1.49
"data.frame"          0.02     0.75       0.12      4.48
"is.factor"           0.02     0.75       0.02      0.75
"match"               0.02     0.75       0.02      0.75
"structure"           0.02     0.75       0.02      0.75
"unlist"              0.02     0.75       0.02      0.75

$by.total
                       total.time total.pct self.time self.pct
"do.call"                    2.54     94.78      0.04     1.49
"strsplit"                   1.34     50.00      1.34    50.00
"<Anonymous>"                1.16     43.28      1.16    43.28
"cbind"                      0.14      5.22      0.00     0.00
"data.frame"                 0.12      4.48      0.02     0.75
"as.data.frame.matrix"       0.08      2.99      0.00     0.00
"as.data.frame"              0.08      2.99      0.00     0.00
"as.factor"                  0.08      2.99      0.00     0.00
"factor"                     0.06      2.24      0.00     0.00
"unique.default"             0.04      1.49      0.04     1.49
"unique"                     0.04      1.49      0.00     0.00
"is.factor"                  0.02      0.75      0.02     0.75
"match"                      0.02      0.75      0.02     0.75
"structure"                  0.02      0.75      0.02     0.75
"unlist"                     0.02      0.75      0.02     0.75
"[.data.frame"               0.02      0.75      0.00     0.00
"["                          0.02      0.75      0.00     0.00

$sample.interval
[1] 0.02

$sampling.time
[1] 2.68
Is there any way to speed up this operation? I note that there is only a small number (<5) of each of the categories "Species", "SizeClass", and "Infected", and I know in advance what these are.
Notes:
- stringr::str_split_fixed performs this task, but is not any faster
- The data frame is actually generated by calling reshape::melt on an array in which Class and its associated levels are a dimension. If there's a faster way to get from there to here, great.
- data.rds at http://dl.getdropbox.com/u/3356641/data.rds

This should probably offer quite an increase:
library(data.table)
DT <- data.table(data.df)
DT[, c("Species", "SizeClass", "Infected")
     := as.list(strsplit(as.character(Class), "\\.")[[1]]), by = Class]
The reasons for the increase:
- data.table pre-allocates memory for its columns
- every column assignment on a data.frame copies the entire object (data.table, in contrast, modifies by reference)
- the by statement lets you run the strsplit task once per unique value of Class rather than once per row (a quick check of this follows below)
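A quick check of the once-per-group point (my sketch, using the DT built above): under by=Class the j-expression is evaluated once per group, so the number of strsplit calls equals the number of unique Class values rather than the number of rows:
length(unique(DT$Class))  # strsplit runs this many times with by=Class
nrow(DT)                  # ... instead of this many times when done row-by-row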
Here is a nice quick method for the whole process.
# Save the new col names as a character vector
newCols <- c("Species", "SizeClass", "Infected")
# split the string, then convert the new cols to columns
DT[, c(newCols) := as.list(strsplit(as.character(Class), "\\.")[[1]]), by=Class ]
DT[, c(newCols) := lapply(.SD, factor), .SDcols=newCols]
# remove the old column. This is instantaneous.
DT[, Class := NULL]
## Have a look:
DT[, lapply(.SD, class)]
#       Time Location Replicate Population Species SizeClass Infected
# 1: integer  integer   integer    numeric  factor    factor   factor
DT
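As an aside (my addition, not part of the original answer): later versions of data.table (>= 1.9.6) ship tstrsplit(), which transposes the strsplit() result so that all three columns can be assigned in a single call, without the by trick. Starting again from a DT that still has the Class column:
# tstrsplit() forwards fixed = TRUE to strsplit(), so the literal "." needs no escaping.
DT[, c("Species", "SizeClass", "Infected")
     := tstrsplit(as.character(Class), ".", fixed = TRUE)]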

You could get a decent increase in speed by just extracting the parts of the string you need using gsub instead of splitting everything up and trying to put it back together:
data <- readRDS("~/Downloads/data.rds")
data.df <- reshape2:::melt.array(data)

# using `strsplit`
system.time({
  cl <- which(names(data.df) == "Class")
  Classes <- do.call(rbind, strsplit(as.character(data.df$Class), "\\."))
  colnames(Classes) <- c("Species", "SizeClass", "Infected")
  data.df <- cbind(data.df[, 1:(cl-1)], Classes, data.df[(cl+1):(ncol(data.df))])
})
   user  system elapsed
  3.349   0.062   3.411

# using `gsub`
system.time({
  data.df$Class <- as.character(data.df$Class)
  data.df$SizeClass <- gsub("(\\w+)\\.(\\d+)\\.(\\w+)", "\\2", data.df$Class,
                            perl = TRUE)
  data.df$Infected <- gsub("(\\w+)\\.(\\d+)\\.(\\w+)", "\\3", data.df$Class,
                           perl = TRUE)
  data.df$Class <- gsub("(\\w+)\\.(\\d+)\\.(\\w+)", "\\1", data.df$Class,
                        perl = TRUE)
})
   user  system elapsed
  0.812   0.037   0.848
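A hedged refinement of the same idea (my addition, not benchmarked): if Class is still a factor (i.e. before the as.character conversion above), the three gsub calls can run over levels(data.df$Class), a small set of strings, and the results can be expanded back through the factor's integer codes:
# Each gsub now touches only the unique level strings instead of millions of rows;
# indexing a vector by a factor uses the factor's integer codes.
lv <- levels(data.df$Class)
data.df$Species   <- gsub("(\\w+)\\.(\\d+)\\.(\\w+)", "\\1", lv, perl = TRUE)[data.df$Class]
data.df$SizeClass <- gsub("(\\w+)\\.(\\d+)\\.(\\w+)", "\\2", lv, perl = TRUE)[data.df$Class]
data.df$Infected  <- gsub("(\\w+)\\.(\\d+)\\.(\\w+)", "\\3", lv, perl = TRUE)[data.df$Class]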

Looks like you have a factor, so work on the levels and then map back. Use fixed = TRUE in strsplit; since the pattern is then taken literally, split is just "." rather than "\\.".
Classes <- do.call(rbind, strsplit(levels(data.df$Class), ".", fixed=TRUE))
colnames(Classes) <- c("Species", "SizeClass", "Infected")
df0 <- as.data.frame(Classes[data.df$Class,], row.names=NA)
cbind(data.df, df0)
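Why the Classes[data.df$Class, ] step works (a small illustration I've added): a factor used as a subscript is treated as its underlying integer codes, which index the rows of Classes in the same order as levels():
# Factors subset by their integer codes, so a levels-order lookup table is
# expanded to the full data length in one vectorised step.
f <- factor(c("b", "a", "b"))          # codes: 2, 1, 2 (levels are a, b)
lookup <- cbind(part = c("A1", "B1"))  # one row per level, in levels() order
lookup[f, ]                            # "B1" "A1" "B1"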

Related

How to reorder letters in string efficiently in R?

I have the following function to reorder letters in a character vector.
reorder_letter <- function(x) {
  sapply(strsplit(x, split = ""), function(x) paste(sort(toupper(x)), collapse = ""))
}
reorder_letter(c("trErty","Bca","def"))
#> [1] "ERRTTY" "ABC" "DEF"
Created on 2020-04-29 by the reprex package (v0.3.0)
Basically I want to return the same letters of each string, but upper-cased and in sorted order.
Currently it takes around 1 minute to run on a vector of length 1.5 million.
EDIT: I also tried parallelising with the future.apply package, which is about 3x faster than the base R solution (and required only a small change to the current code):
library(future.apply)
plan(multisession)  # a plan must be set for the workers; multisession is one option
reorder_letter <- function(x) {
  future_sapply(strsplit(x, split = ""), function(x) paste(sort(toupper(x)), collapse = ""))
}
I just wonder:
- how can I efficiently achieve my purpose?
- what is the best approach to find the bottleneck of a function? For example, now that I have this function finished, what is the next step?
Maybe utf8ToInt and intToUtf8 are faster than strsplit and paste.
x <- c("trErty","Bca","def")
unlist(lapply(x, function(y) {intToUtf8(sort(utf8ToInt(toupper(y))))}))
#[1] "ERRTTY" "ABC" "DEF"
Times: (it is not faster ... sorry)
But stringi is faster, and writing a C++ function is faster still (it could probably be improved further, but it is already about 10 times faster).
FrankZhang <- function(x) {
  unlist(lapply(strsplit(toupper(x), NULL), function(x) paste(sort(x), collapse = "")))
}
GKi <- function(x) {
  unlist(lapply(toupper(x), function(y) intToUtf8(sort(utf8ToInt(y)))))
}
library(stringi)
stringi <- function(y) {
  vapply(stri_split_boundaries(toupper(y), type = "character"),
         function(x) stri_c(x[stri_order(x)], collapse = ""), "")
}
library(Rcpp)
cppFunction("std::string GKiC(std::string &str) {
  std::sort(str.begin(), str.end());
  return(str);
}")
GKi2 <- function(x) { unlist(lapply(toupper(x), GKiC)) }
x <- apply(expand.grid(letters, LETTERS), 1, paste, collapse = "")
microbenchmark::microbenchmark(FrankZhang(x), GKi(x), stringi(x), GKi2(x),
                               control = list(order = "block"))
# Unit: milliseconds
#           expr       min        lq      mean    median        uq       max neval cld
#  FrankZhang(x) 17.533428 18.686879 20.380002 19.719311 21.014381 33.836692   100    d
#         GKi(x) 16.551358 17.665436 18.656223 18.271688 19.343088 23.225199   100   c
#     stringi(x)  4.644196  4.844622  5.082298  5.011344  5.237714  7.355251   100  b
#        GKi2(x)  1.527124  1.624337  1.997725  1.691099  2.242797  5.593543   100 a
To find out what consumes the most computation time, you can use Rprof, e.g.:
reorder_letter <- function(x) { #Function
  sapply(strsplit(x, split = ""), function(x) paste(sort(toupper(x)), collapse = ""))
}
x <- apply(expand.grid(letters, LETTERS, letters), 1, paste, collapse = "") #Data
Rprof()
y <- reorder_letter(x)
Rprof(NULL)
summaryRprof()
#$by.self
#               self.time self.pct total.time total.pct
#"FUN"               0.12    20.69       0.54     93.10
#"sort.int"          0.10    17.24       0.22     37.93
#"paste"             0.08    13.79       0.42     72.41
#"sort"              0.06    10.34       0.34     58.62
#"sort.default"      0.06    10.34       0.28     48.28
#"match.arg"         0.04     6.90       0.10     17.24
#"eval"              0.04     6.90       0.04      6.90
#"sapply"            0.02     3.45       0.58    100.00
#"lapply"            0.02     3.45       0.56     96.55
#".doSortWrap"       0.02     3.45       0.02      3.45
#"formals"           0.02     3.45       0.02      3.45
#
#$by.total
#                 total.time total.pct self.time self.pct
#"sapply"               0.58    100.00      0.02     3.45
#"reorder_letter"       0.58    100.00      0.00     0.00
#"lapply"               0.56     96.55      0.02     3.45
#"FUN"                  0.54     93.10      0.12    20.69
#"paste"                0.42     72.41      0.08    13.79
#"sort"                 0.34     58.62      0.06    10.34
#"sort.default"         0.28     48.28      0.06    10.34
#"sort.int"             0.22     37.93      0.10    17.24
#"match.arg"            0.10     17.24      0.04     6.90
#"eval"                 0.04      6.90      0.04     6.90
#".doSortWrap"          0.02      3.45      0.02     3.45
#"formals"              0.02      3.45      0.02     3.45
#
#$sample.interval
#[1] 0.02
#
#$sampling.time
#[1] 0.58
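On the second question (finding the bottleneck), a hedged pointer beyond Rprof: the profvis package wraps the same sampling profiler in an interactive flame graph, which is often easier to read than the summaryRprof tables:
# Assumes the profvis package is installed; it runs Rprof under the hood.
library(profvis)
profvis(y <- reorder_letter(x))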

apply a function on columns with specific names

I am new to R.
I have hundreds of data frames like this
ID NAME Ratio_A Ratio_B Ratio_C Ratio_D
AA ABCD    0.09    0.67    0.10    0.14
AB ABCE    0.04    0.85    0.04    0.06
AC ABCG    0.43    0.21    0.54    0.14
AD ABCF    0.16    0.62    0.25    0.97
AF ABCJ    0.59    0.37    0.66    0.07
This is just an example. The number and names of the Ratio_ columns differ between data frames, but all of them start with Ratio_. I want to apply a function (for example, log(x)) to the Ratio_ columns without specifying the column numbers or their full names.
I know how to do it df by df; for the one in the example:
A <- function(x) log(x)
df_log <- data.frame(df[1:2], lapply(df[3:6], A))
but I have a lot of them, and as I said the number of columns is different in each.
Any suggestion?
Thanks
Place the datasets in a list and then loop over the list elements:
lapply(lst, function(x) {
  i1 <- grep("^Ratio_", names(x))
  x[i1] <- lapply(x[i1], A)
  x
})
NOTE: No external packages are used.
data
lst <- mget(paste0("df", 1:100))
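If the transformed tables must end up back in the global environment under their original names (a sketch I've added; keeping them in the list is usually the better workflow), list2env reverses the mget step:
# Capture the result of the lapply call above, then write df1..df100 back.
lst2 <- lapply(lst, function(x) {
  i1 <- grep("^Ratio_", names(x))
  x[i1] <- lapply(x[i1], A)
  x
})
list2env(lst2, envir = .GlobalEnv)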
This type of problem is very easily dealt with using the dplyr package. For example,
df <- read.table(text = 'ID NAME Ratio_A Ratio_B Ratio_C Ratio_D
AA ABCD 0.09 0.67 0.10 0.14
AB ABCE 0.04 0.85 0.04 0.06
AC ABCG 0.43 0.21 0.54 0.14
AD ABCF 0.16 0.62 0.25 0.97
AF ABCJ 0.59 0.37 0.66 0.07',
header = TRUE)
library(dplyr)
df_transformed <- mutate_each(df, funs(log(.)), starts_with("Ratio_"))
df_transformed
# > df_transformed
#   ID NAME    Ratio_A    Ratio_B    Ratio_C     Ratio_D
# 1 AA ABCD -2.4079456 -0.4004776 -2.3025851 -1.96611286
# 2 AB ABCE -3.2188758 -0.1625189 -3.2188758 -2.81341072
# 3 AC ABCG -0.8439701 -1.5606477 -0.6161861 -1.96611286
# 4 AD ABCF -1.8325815 -0.4780358 -1.3862944 -0.03045921
# 5 AF ABCJ -0.5276327 -0.9942523 -0.4155154 -2.65926004
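A hedged modernisation note (my addition): mutate_each()/funs() were deprecated in later dplyr releases; in current dplyr (>= 1.0.0) the same transformation is written with across():
library(dplyr)
df_transformed <- df %>%
  mutate(across(starts_with("Ratio_"), log))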

reshape unique strings in rows into columns in R

I would like to reshape my data based on the unique strings in the "Bulls" column (the whole data frame is called all):
  EBV    Bulls
 0.13 NE001362
 0.17 NE001361
 0.05 NE001378
-0.12 NE001359
-0.14 NE001379
 0.13 NE001380
-0.46 NE001379
-0.46 NE001359
-0.68 NE001394
 0.28 NE001391
 0.84 NE001394
-0.43 NE001393
-0.18 NE001707
My expected output:
NE001362 NE001361 NE001378 NE001359 NE001379 NE001380 NE001394 NE001391 NE001393 NE001707
    0.13     0.17     0.05    -0.12    -0.14     0.13    -0.68     0.28    -0.43    -0.18
                              -0.46    -0.46              0.84
I tried dat2 <- dcast(all, EBV ~ variable, value.var = "Bulls") but it does not work.
You have two options. Indexing the multiple occurrences for each level of Bulls or using a list to hold the different levels of EBV.
Option 1: Indexing multiple occurrences
You can use data.table to generate an index that numbers multiple occurrences of EBV:
require(data.table)
setDT(all) ## convert to data.table
all[, index:=1:.N, by=Bulls] ## generate index
dcast.data.table(all, formula=index ~ Bulls, value.var='EBV')
Option 2: Using a list to store multiple values
You could use a list as a value with data.table (I'm not sure if plain data.frame supports it).
require(data.table)
setDT(all) ## convert to data.table
all[, list(list(EBV)), by=Bulls] ## multiple values stored as list
Just to make sure that base R gets some acknowledgement:
## Add an ID, like ilir did, but with base R functions
mydf$ID <- with(mydf, ave(rep(1, nrow(mydf)), Bulls, FUN = seq_along))
Here's reshape:
reshape(mydf, direction = "wide", idvar="ID", timevar="Bulls")
#   ID EBV.NE001362 EBV.NE001361 EBV.NE001378 EBV.NE001359 EBV.NE001379
# 1  1         0.13         0.17         0.05        -0.12        -0.14
# 7  2           NA           NA           NA        -0.46        -0.46
#   EBV.NE001380 EBV.NE001394 EBV.NE001391 EBV.NE001393 EBV.NE001707
# 1         0.13        -0.68         0.28        -0.43        -0.18
# 7           NA         0.84           NA           NA           NA
And xtabs. Note: This is a table-like matrix, so if you want a data.frame, you'll have to use as.data.frame.matrix on the output.
xtabs(EBV ~ ID + Bulls, mydf)
#    Bulls
# ID  NE001359 NE001361 NE001362 NE001378 NE001379 NE001380 NE001391
#   1    -0.12     0.17     0.13     0.05    -0.14     0.13     0.28
#   2    -0.46     0.00     0.00     0.00    -0.46     0.00     0.00
#    Bulls
# ID  NE001393 NE001394 NE001707
#   1    -0.43    -0.68    -0.18
#   2     0.00     0.84     0.00
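For completeness, a hedged tidyr (>= 1.0.0) sketch of the same reshape, building the per-bull occurrence index the same way the answers above do:
library(dplyr)
library(tidyr)
mydf %>%
  group_by(Bulls) %>%
  mutate(ID = row_number()) %>%  # same occurrence index as ave()/.N above
  ungroup() %>%
  pivot_wider(names_from = Bulls, values_from = EBV)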

Fast(er) way of indexing matrix in R

Foremost, I am looking for a fast(er) way of subsetting/indexing a matrix many, many times over:
for (i in 1:99000) {
  subset.data <- data[index[, i], ]
}
Background:
I'm implementing a sequential testing procedure involving the bootstrap in R. Wanting to replicate some simulation results, I came upon
this bottleneck where lots of indexing needs to be done. For implementation of the block-bootstrap I created an index matrix with which I subset
the original data matrix to draw resamples of the data.
# The basic setup
B <- 1000 # no. of bootstrap replications
n <- 250 # no. of observations
m <- 100 # no. of models/data series
# Create index matrix with B columns and n rows.
# Each column represents a resampling of the data.
# (actually block resamples, but doesn't matter here).
boot.index <- matrix(sample(1:n, n * B, replace=T), nrow=n, ncol=B)
# Make matrix with m data series of length n.
sample.data <- matrix(rnorm(n * m), nrow=n, ncol=m)
subsetMatrix <- function(data, index) {  # fn definition for timing
  subset.data <- data[index, ]
  return(subset.data)
}
# check how long it takes.
Rprof("subsetMatrix.out")
for (i in 1:(m - 1)) {
  for (b in 1:B) {  # B * (m - 1) = 1000 * 99 = 99000
    boot.data <- subsetMatrix(sample.data, boot.index[, b])
    # do some other stuff
  }
  # do some more stuff
}
Rprof(NULL)  # stop profiling
summaryRprof("subsetMatrix.out")
# > summaryRprof("subsetMatrix.out")
# $by.self
#              self.time self.pct total.time total.pct
# subsetMatrix      9.96       100       9.96       100

# In the actual application:
#########
# > summaryRprof("seq_testing.out")
# $by.self
#               self.time self.pct total.time total.pct
# subsetMatrix       6.78     53.98       6.78     53.98
# colMeans           1.98     15.76       2.20     17.52
# makeIndex          1.08      8.60       2.12     16.88
# makeStats          0.66      5.25       9.66     76.91
# runif              0.60      4.78       0.72      5.73
# apply              0.30      2.39       0.42      3.34
# is.data.frame      0.22      1.75       0.22      1.75
# ceiling            0.18      1.43       0.18      1.43
# aperm.default      0.14      1.11       0.14      1.11
# array              0.12      0.96       0.12      0.96
# estimateMCS        0.10      0.80      12.56    100.00
# as.vector          0.10      0.80       0.10      0.80
# matrix             0.08      0.64       0.08      0.64
# lapply             0.06      0.48       0.06      0.48
# /                  0.04      0.32       0.04      0.32
# :                  0.04      0.32       0.04      0.32
# rowSums            0.04      0.32       0.04      0.32
# -                  0.02      0.16       0.02      0.16
# >                  0.02      0.16       0.02      0.16
#
# $by.total
#               total.time total.pct self.time self.pct
# estimateMCS        12.56    100.00      0.10      0.80
# makeStats           9.66     76.91      0.66      5.25
# subsetMatrix        6.78     53.98      6.78     53.98
# colMeans            2.20     17.52      1.98     15.76
# makeIndex           2.12     16.88      1.08      8.60
# runif               0.72      5.73      0.60      4.78
# doTest              0.68      5.41      0.00      0.00
# apply               0.42      3.34      0.30      2.39
# aperm               0.26      2.07      0.00      0.00
# is.data.frame       0.22      1.75      0.22      1.75
# sweep               0.20      1.59      0.00      0.00
# ceiling             0.18      1.43      0.18      1.43
# aperm.default       0.14      1.11      0.14      1.11
# array               0.12      0.96      0.12      0.96
# as.vector           0.10      0.80      0.10      0.80
# matrix              0.08      0.64      0.08      0.64
# lapply              0.06      0.48      0.06      0.48
# unlist              0.06      0.48      0.00      0.00
# /                   0.04      0.32      0.04      0.32
# :                   0.04      0.32      0.04      0.32
# rowSums             0.04      0.32      0.04      0.32
# -                   0.02      0.16      0.02      0.16
# >                   0.02      0.16      0.02      0.16
# mean                0.02      0.16      0.00      0.00
#
# $sample.interval
# [1] 0.02
#
# $sampling.time
# [1] 12.56
Doing the sequential testing procedure once takes about 10 seconds. Using this in simulations with 2500 replications and several parameter constellations, it would take something like 40 days. Using parallel processing and more CPU power it's possible to go faster, but still not very pleasing :/
Is there a better way to resample the data / get rid of the loop?
Can apply, Vectorize, replicate etc. come in anywhere?
Would it make sense to implement the subsetting in C (e.g. manipulate some pointers)?
Even though every single step is already done incredibly fast by R, it's just not quite fast enough.
I'd be very glad indeed for any kind of response/help/advice!
related Qs:
- Fast matrix subsetting via '[': by rows, by columns or doesn't matter?
- fast function for generating bootstrap samples in matrix forms in R
- random sampling - matrix
From those threads, I tried
mapply(function(row) return(sample.data[row, ]), row = boot.index)
replicate(B, apply(sample.data, 2, sample, replace = TRUE))
but these didn't really do it for me.
I rewrote makeStats and makeIndex as they were two of the biggest bottlenecks:
makeStats <- function(data, index) {
  data.mean <- colMeans(data)
  m <- nrow(data)
  n <- ncol(index)
  tabs <- lapply(1L:n, function(j) tabulate(index[, j], nbins = m))
  weights <- matrix(unlist(tabs), m, n) * (1 / nrow(index))
  boot.data.mean <- t(data) %*% weights - data.mean
  return(list(data.mean = data.mean,
              boot.data.mean = boot.data.mean))
}

makeIndex <- function(B, blocks) {
  n <- ncol(blocks)
  l <- nrow(blocks)
  z <- ceiling(n / l)
  start.points <- sample.int(n, z * B, replace = TRUE)
  index <- blocks[, start.points]
  keep <- c(rep(TRUE, n), rep(FALSE, z * l - n))
  boot.index <- matrix(as.vector(index)[keep],
                       nrow = n, ncol = B)
  return(boot.index)
}
This brought the computation time down from 28 to 6 seconds on my machine. I bet there are other parts of the code that can be improved (including my use of lapply/tabulate above).
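The core trick in makeStats, verified on toy sizes (my addition): the column means of a bootstrap resample equal a cross-product of the data with the normalised index counts, so the data matrix never has to be subset at all:
# colMeans(data[idx, ]) == t(data) %*% (tabulate(idx) / n), up to numeric noise.
set.seed(1)
n <- 5
dat <- matrix(rnorm(n * 2), n, 2)
idx <- sample.int(n, n, replace = TRUE)
w <- tabulate(idx, nbins = n) / n
all.equal(colMeans(dat[idx, ]), drop(t(dat) %*% w))  # TRUE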

Passing an expression to a nested grouping in data.table

I have a data.table object similar to this one
library(data.table)
c <- data.table(CO  = c(10000, 10000, 10000, 20000, 20000, 20000, 20000),
                SH  = c(1427, 1333, 1333, 1000, 1000, 300, 350),
                PRC = c(6.5, 6.125, 6.2, 0.75, 0.5, 3, 3.5),
                DAT = c(0.5, -0.5, 0, -0.1, NA_real_, 0.2, 0.5),
                MM  = c("A", "A", "A", "A", "A", "B", "B"))
and I am trying to perform calculations using nested grouping, passing an expression as an argument. Here is a simplified version of what I have:
setkey(c, MM)
mycalc <- quote({
  nobscc <- length(DAT[complete.cases(DAT)])
  list(MKTCAP  = tail(SH, n = 1) * tail(PRC, n = 1),
       SQSUM   = ifelse(nobscc >= 2, sum(DAT^2, na.rm = TRUE), NA_real_),
       COVCOMP = ifelse(nobscc >= 2, head(DAT, n = 1), NA_real_),
       NOBS    = nobscc)
})
myresults <- c[, .SD[, {setkey = CO; eval(mycalc)}, by = CO], by = MM]
which produces
     MM    CO MKTCAP SQSUM COVCOMP NOBS
[1,]  A 10000 8264.6  0.50     0.5    3
[2,]  A 20000  500.0    NA      NA    1
[3,]  B 20000 1225.0  0.29     0.2    2
In the example above, two elements of the list use the ifelse construct (in the actual code there are three), all performing the same test: if the number of observations is at least 2, a certain calculation (different for each element of the list; each could be written as a function) is performed; otherwise I want the value of these elements to be NA. Another thing these elements have in common is that they all use the same column of my data.table, the one called DAT.
So my question is: is there any way I can do the ifelse test only once, and if it is FALSE, pass the value NA to the respective elements of the list, and if TRUE, evaluate a different expression for each of the elements of the list?
NOTE: My goal is to reduce the system.time (system and elapsed). If this modification will not reduce time and calculations, bearing in mind I have 72 million observations, that's an acceptable answer. I also welcome suggestions to change other parts of the code.
EDIT: Results of summaryRprof()
$by.total
                      total.time total.pct self.time self.pct
"system.time"              18.94     99.79      0.00     0.00
".Call"                    18.92     99.68      0.10     0.53
"["                        18.92     99.68      0.04     0.21
"[.data.table"             18.92     99.68      0.02     0.11
"eval"                     18.80     99.05      0.24     1.26
"ifelse"                   18.30     96.42      0.46     2.42
"lm"                       17.70     93.26      0.58     3.06
"sapply"                    8.06     42.47      0.36     1.90
"model.frame"               7.74     40.78      0.16     0.84
"model.frame.default"       7.58     39.94      0.98     5.16
"lapply"                    6.62     34.88      0.70     3.69
"FUN"                       4.24     22.34      1.10     5.80
"model.matrix"              4.04     21.29      0.02     0.11
"model.matrix.default"      4.02     21.18      0.26     1.37
"match"                     3.66     19.28      0.86     4.53
".getXlevels"               3.12     16.44      0.12     0.63
"na.omit"                   2.40     12.64      0.24     1.26
"%in%"                      2.30     12.12      0.34     1.79
"simplify2array"            2.24     11.80      0.12     0.63
"na.omit.data.frame"        2.16     11.38      0.14     0.74
"[.data.frame"              2.12     11.17      1.18     6.22
"deparse"                   1.80      9.48      0.66     3.48
"unique"                    1.80      9.48      0.54     2.85
"[["                        1.52      8.01      0.12     0.63
"[[.data.frame"             1.40      7.38      0.54     2.85
".deparseOpts"              1.34      7.06      0.96     5.06
"paste"                     1.32      6.95      0.16     0.84
"lm.fit"                    1.20      6.32      0.64     3.37
"mode"                      1.14      6.01      0.14     0.74
"unlist"                    1.12      5.90      0.56     2.95
Instead of forming and operating on data subsets like this:
setkey(c, MM)
myresults <- c[, .SD[, {setkey = CO; eval(mycalc)}, by = CO], by = MM]
You could try doing this:
setkeyv(c, c("MM", "CO"))
myresults <- c[, eval(mycalc), by=key(c)]
This should speed up your code, since it avoids all of the nested subsetting of .SD objects, each of which requires its own call to [.data.table.
On your original question, I doubt the ifelse evaluations are taking much time, but if you want to avoid them, you could take them out of mycalc and use := to overwrite the desired values with NA:
mycalc <- quote(list(MKTCAP  = tail(SH, n = 1) * tail(PRC, n = 1),
                     SQSUM   = sum(DAT^2, na.rm = TRUE),
                     COVCOMP = head(DAT, n = 1),
                     NOBS    = length(DAT[complete.cases(DAT)])))
setkeyv(c, c("MM", "CO"))
myresults <- c[, eval(mycalc), by = key(c)]
myresults[NOBS < 2, c("SQSUM", "COVCOMP") := NA_real_]
## Or, alternatively
# myresults[NOBS < 2, SQSUM := NA_real_]
# myresults[NOBS < 2, COVCOMP := NA_real_]
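As a quick sanity check on the toy table from the question (values as shown in its output above), the single := call blanks both derived columns for the group with too few complete observations:
myresults[NOBS < 2]
#    MM    CO MKTCAP SQSUM COVCOMP NOBS
# 1:  A 20000    500    NA      NA    1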
