I have a data.table object similar to this one:
library(data.table)
c <- data.table(CO = c(10000,10000,10000,20000,20000,20000,20000),
SH = c(1427,1333,1333,1000,1000,300,350),
PRC = c(6.5,6.125,6.2,0.75,0.5,3,3.5),
DAT = c(0.5,-0.5,0,-0.1,NA_real_,0.2,0.5),
MM = c("A","A","A","A","A","B","B"))
and I am trying to perform calculations using nested grouping, passing an expression as an argument. Here is a simplified version of what I have:
setkey(c,MM)
mycalc <- quote({nobscc <- length(DAT[complete.cases(DAT)]);
list(MKTCAP = tail(SH,n=1)*tail(PRC,n=1),
SQSUM = ifelse(nobscc>=2, sum(DAT^2,na.rm=TRUE), NA_real_),
COVCOMP = ifelse(nobscc >= 2, head(DAT,n=1), NA_real_),
NOBS = nobscc)})
myresults <- c[,.SD[,{setkey=CO; eval(mycalc)},by=CO],by=MM]
which produces
MM CO MKTCAP SQSUM COVCOMP NOBS
[1,] A 10000 8264.6 0.50 0.5 3
[2,] A 20000 500.0 NA NA 1
[3,] B 20000 1225.0 0.29 0.2 2
In the example above, two elements of the list use the ifelse construct (in the actual code there are 3), all doing the same test: if the number of non-missing observations is at least 2, a certain calculation (which is different for each element of the list, and each could be written as a function) is performed; otherwise I want the value of these elements to be NA. Another thing these elements have in common is that they all use the same column of my data.table: the one called DAT.
So my question is: is there any way I can do the ifelse test only once, and if it is FALSE, pass the value NA to the respective elements of the list, and if TRUE, evaluate a different expression for each of the elements of the list?
NOTE: My goal is to reduce the system.time (system and elapsed). If this modification will not reduce time and calculations, bearing in mind I have 72 million observations, that's an acceptable answer. I also welcome suggestions to change other parts of the code.
EDIT: Results of summaryRprof()
$by.total
total.time total.pct self.time self.pct
"system.time" 18.94 99.79 0.00 0.00
".Call" 18.92 99.68 0.10 0.53
"[" 18.92 99.68 0.04 0.21
"[.data.table" 18.92 99.68 0.02 0.11
"eval" 18.80 99.05 0.24 1.26
"ifelse" 18.30 96.42 0.46 2.42
"lm" 17.70 93.26 0.58 3.06
"sapply" 8.06 42.47 0.36 1.90
"model.frame" 7.74 40.78 0.16 0.84
"model.frame.default" 7.58 39.94 0.98 5.16
"lapply" 6.62 34.88 0.70 3.69
"FUN" 4.24 22.34 1.10 5.80
"model.matrix" 4.04 21.29 0.02 0.11
"model.matrix.default" 4.02 21.18 0.26 1.37
"match" 3.66 19.28 0.86 4.53
".getXlevels" 3.12 16.44 0.12 0.63
"na.omit" 2.40 12.64 0.24 1.26
"%in%" 2.30 12.12 0.34 1.79
"simplify2array" 2.24 11.80 0.12 0.63
"na.omit.data.frame" 2.16 11.38 0.14 0.74
"[.data.frame" 2.12 11.17 1.18 6.22
"deparse" 1.80 9.48 0.66 3.48
"unique" 1.80 9.48 0.54 2.85
"[[" 1.52 8.01 0.12 0.63
"[[.data.frame" 1.40 7.38 0.54 2.85
".deparseOpts" 1.34 7.06 0.96 5.06
"paste" 1.32 6.95 0.16 0.84
"lm.fit" 1.20 6.32 0.64 3.37
"mode" 1.14 6.01 0.14 0.74
"unlist" 1.12 5.90 0.56 2.95
Instead of forming and operating on data subsets like this:
setkey(c,MM)
myresults <- c[, .SD[,{setkey=CO; eval(mycalc)},by=CO], by=MM]
You could try doing this:
setkeyv(c, c("MM", "CO"))
myresults <- c[, eval(mycalc), by=key(c)]
This should speed up your code, since it avoids all of the nested subsetting of .SD objects, each of which requires its own call to [.data.table.
On your original question, I doubt the ifelse evaluations are taking much time, but if you want to avoid them, you could take them out of mycalc and use := to overwrite the desired values with NA:
mycalc <- quote(list(MKTCAP = tail(SH,n=1)*tail(PRC,n=1),
SQSUM = sum(DAT^2,na.rm=TRUE),
COVCOMP = head(DAT,n=1),
NOBS = length(DAT[complete.cases(DAT)])))
setkeyv(c, c("MM", "CO"))
myresults <- c[, eval(mycalc), by=key(c)]
myresults[NOBS<2, c("SQSUM", "COVCOMP"):=NA_real_]
## Or, alternatively
# myresults[NOBS<2, SQSUM:=NA_real_]
# myresults[NOBS<2, COVCOMP:=NA_real_]
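If you would rather keep the NA logic inside the expression, one option is to compute the test once per group and branch with scalar if/else rather than the vectorised ifelse. The following is only a sketch under stated assumptions: the names dt, mycalc2 and enough are hypothetical, and whether it saves measurable time on 72 million rows would have to be profiled.

```r
library(data.table)

# Same toy data as in the question; `dt` is a hypothetical name chosen
# here so that the table does not share its name with base::c().
dt <- data.table(CO  = c(10000, 10000, 10000, 20000, 20000, 20000, 20000),
                 SH  = c(1427, 1333, 1333, 1000, 1000, 300, 350),
                 PRC = c(6.5, 6.125, 6.2, 0.75, 0.5, 3, 3.5),
                 DAT = c(0.5, -0.5, 0, -0.1, NA_real_, 0.2, 0.5),
                 MM  = c("A", "A", "A", "A", "A", "B", "B"))

mycalc2 <- quote({
  nobscc <- sum(!is.na(DAT))   # number of non-missing observations
  enough <- nobscc >= 2        # the test, evaluated once per group
  list(MKTCAP  = tail(SH, n = 1) * tail(PRC, n = 1),
       SQSUM   = if (enough) sum(DAT^2, na.rm = TRUE) else NA_real_,
       COVCOMP = if (enough) head(DAT, n = 1) else NA_real_,
       NOBS    = nobscc)
})

# Group on both columns directly instead of nesting .SD calls
res <- dt[, eval(mycalc2), by = c("MM", "CO")]
print(res)
```

Because each group yields a single test result, a plain if/else is enough here; every branch returns a double, so the column types stay consistent across groups.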
I have to do an operation that involves two matrices, matrix #1 with data and matrix #2 with coefficients to multiply columns of matrix #1
matrix #1 is:
dim(dat)
[1] 612 2068
dat[1:6,1:8]
X0005 X0010 X0011 X0013 X0015 X0016 X0017 X0018
1 1.96 1.82 8.80 1.75 2.95 1.10 0.46 0.96
2 1.17 0.94 2.74 0.59 0.86 0.63 0.16 0.31
3 2.17 2.53 10.40 4.19 4.79 2.22 0.31 3.32
4 3.62 1.93 6.25 2.38 2.25 0.69 0.16 1.01
5 2.32 1.93 3.74 1.97 1.31 0.44 0.28 0.98
6 1.30 2.04 1.47 1.80 0.43 0.33 0.18 0.46
and matrix #2 is:
dim(lo)
[1] 2068 8
head(lo)
i1 i2 i3 i4 i5 i6
X0005 -0.11858852 0.10336788 0.62618771 0.08706041 -0.02733101 0.006287923
X0010 0.06405406 0.13692216 0.64813610 0.15750302 -0.13503956 0.139280709
X0011 -0.06789727 0.30473549 0.07727417 0.24907723 -0.05345123 0.141591330
X0013 0.20909664 0.01275553 0.21067894 0.12666704 -0.02836527 0.464548147
X0015 -0.07690560 0.18788859 -0.03551084 0.19120773 -0.10196578 0.234037820
X0016 -0.06442454 0.34993481 -0.04057001 0.20258195 -0.09318325 0.130669546
i7 i8
X0005 0.08571777 0.031531478
X0010 0.31170850 -0.003127279
X0011 0.52527759 -0.065002026
X0013 0.27858049 -0.032178156
X0015 0.50693977 -0.058003429
X0016 0.53162596 -0.052091767
I want to multiply each column of matrix #1 by its corresponding coefficient in the first column of matrix #2, and sum up all the resulting columns. Then repeat the operation with the coefficients in the second column of matrix #2, then the third column, and so on...
The result is then a matrix with 8 columns, which are linear combinations of the data in matrix #1.
My attempt uses nested for-loops. It works, but takes about 30' to execute. Is there any way to avoid these loops and reduce the computational effort?
Here is my attempt:
r=nrow(dat)
n=ncol(dat)
m=ncol(lo)
eme<-matrix(NA,r,m)
for (i in(1:m)){
SC<-matrix(NA,r,n)
for (j in(1:n)){
nom<-rownames(lo)
x<-dat[ , colnames(dat) == nom[j]]
SC[,j]<-x*lo[j,i]
SC1<-rowSums(SC)
}
eme[,i]<-SC1
}
Thanks for your help
It looks like you are just doing matrix-vector multiplication. In R, use the %*% operator, so all the looping is delegated to a Fortran routine. I think it equates to the following:
apply(lo, 2, function(x) dat %*% x)
Your code could be improved by moving the nom <- assignment outside the loops since it recalculates the same thing every iteration. Also, what is the point of SC1 being computed during each iteration?
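As a small check of that idea, here is a sketch on made-up data (the values and the reduced dimensions are assumptions; only the object names dat and lo come from the question). It assumes dat may be a data.frame, so it is converted with as.matrix, and it reorders the columns of dat to match rownames(lo), which is what the name lookup in the original loop does.

```r
set.seed(1)

# Made-up stand-ins for the real objects: 5 rows, 4 variables, 3 coefficient sets
dat <- as.data.frame(matrix(runif(5 * 4), nrow = 5,
                            dimnames = list(NULL, c("X0005", "X0010", "X0011", "X0013"))))
lo <- matrix(rnorm(4 * 3), nrow = 4,
             dimnames = list(c("X0010", "X0005", "X0013", "X0011"),
                             c("i1", "i2", "i3")))

# Single matrix product; columns of dat are aligned to the row names of lo
eme_fast <- as.matrix(dat[, rownames(lo)]) %*% lo

# Reference result using explicit loops, mirroring the original approach
eme_loop <- matrix(NA_real_, nrow(dat), ncol(lo))
for (i in seq_len(ncol(lo))) {
  acc <- numeric(nrow(dat))
  for (j in seq_len(nrow(lo))) {
    acc <- acc + dat[[rownames(lo)[j]]] * lo[j, i]
  }
  eme_loop[, i] <- acc
}

# Both approaches agree up to floating-point tolerance
print(all.equal(unname(eme_fast), eme_loop))
```

With a single product there is no need for apply over the columns of lo; the whole 612 x 8 result comes out of one call.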
Data
v1 <- c(52.9799999999814, 53.4200000000128, 52.0899999999965, 57.9700000000012,
60.679999999993, 0.300000000017462, 1.76999999998952, 61.1900000000023,
58.9599999999919, 1.73000000001048, 0.269999999989523, 6.92000000001281,
60.5299999999988, 60.859999999986, 59.5599999999977, 61.0600000000268,
60.6299999999756, 60.9700000000012, 60.1600000000035, 60.4599999999919,
60.0900000000256)
v2 <- c(52.679999999993, 53.140000000014, 52.8899999999849, 57.6700000000128,
60.5199999999895, 2.04000000000815, 61.890000000014, 59.5699999999779,
2.05999999999767, 6.98000000001048, 60.7399999999907, 60.7799999999988,
59.7300000000105, 60.9100000000035, 60.3299999999872, 60.5500000000175,
60.6600000000035, 60.3499999999767, 60.7300000000105, 60.6700000000128,
60.3799999999756)
tv3 <- data.frame(v1,v2)
tv3$v5 <- tv3$v2 - tv3$v1
tv3$v5
[1] -0.30 -0.28 0.80 -0.30 -0.16 1.74 60.12 -1.62 -56.90 5.25 60.47 53.86 -0.80
[14] 0.05 0.77 -0.51 0.03 -0.62 0.57 0.21 0.29
As you can see, the difference should stay small, but at particular rows it jumps to around 60, as it does here.
Basically, if we remove the 0.30 value from v1 only and shift the rest of v1 up by one cell, the difference would not jump to 60.
So 0.30 is a noise value; that is what I need to detect and put in V3.
My desired results are the following.
v1 v2 V3
52.98 52.68 0.3
53.42 53.14 0.27
52.09 52.89
57.97 57.67
60.68 60.52
1.77 2.04
61.19 61.89
58.96 59.57
1.73 2.06
6.92 6.98
60.53 60.74
60.86 60.78
59.56 59.73
61.06 60.91
60.63 60.33
60.97 60.55
60.16 60.66
60.46 60.35
60.09 60.73
Notice that the two columns are now in sync, differing by only a few points.
Maybe my case requires an implementation of the Needleman-Wunsch algorithm.
I would like to reshape my data based on the unique strings in the "Bulls" column (the data frame is called all):
EBV Bulls
0.13 NE001362
0.17 NE001361
0.05 NE001378
-0.12 NE001359
-0.14 NE001379
0.13 NE001380
-0.46 NE001379
-0.46 NE001359
-0.68 NE001394
0.28 NE001391
0.84 NE001394
-0.43 NE001393
-0.18 NE001707
My expected output:
NE001362 NE001361 NE001378 NE001359 NE001379 NE001380 NE001394 NE001391 NE001393 NE001707
0.13 0.17 0.05 -0.12 -0.14 0.13 -0.68 0.28 -0.43 -0.18
-0.46 -0.46 0.84
I tried dat2 <- dcast(all, EBV~variable, value.var = "Bulls") but it does not work.
You have two options: indexing the multiple occurrences for each level of Bulls, or using a list to hold the different values of EBV.
Option 1: Indexing multiple occurrences
You can use data.table to generate an index that numbers multiple occurrences of EBV:
require(data.table)
setDT(all) ## convert to data.table
all[, index:=1:.N, by=Bulls] ## generate index
dcast.data.table(all, formula=index ~ Bulls, value.var='EBV')
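For a version that can be run as-is, here is a sketch of Option 1 on the data from the question. The table name ebv is hypothetical (chosen to avoid masking base::all), and the sketch assumes a data.table version in which dcast() dispatches to the data.table method; dcast.data.table() can be used instead on older versions.

```r
library(data.table)

# Data copied from the question; `ebv` is a hypothetical name
ebv <- data.table(
  EBV   = c(0.13, 0.17, 0.05, -0.12, -0.14, 0.13, -0.46,
            -0.46, -0.68, 0.28, 0.84, -0.43, -0.18),
  Bulls = c("NE001362", "NE001361", "NE001378", "NE001359", "NE001379",
            "NE001380", "NE001379", "NE001359", "NE001394", "NE001391",
            "NE001394", "NE001393", "NE001707")
)

# Number the occurrences within each bull: 1 for the first record, 2 for the second
ebv[, index := seq_len(.N), by = Bulls]

# One row per occurrence index, one column per bull; missing cells become NA
wide <- dcast(ebv, index ~ Bulls, value.var = "EBV")
print(wide)
```

Note that the bull columns come out sorted alphabetically rather than in order of first appearance.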
Option 2: Using a list to store multiple values
You could use a list as a value with data.table (I'm not sure if plain data.frame supports it).
require(data.table)
setDT(all) ## convert to data.table
all[, list(list(EBV)), by=Bulls] ## multiple values stored as list
Just to make sure that base R gets some acknowledgement:
## Add an ID, like ilir did, but with base R functions
mydf$ID <- with(mydf, ave(rep(1, nrow(mydf)), Bulls, FUN = seq_along))
Here's reshape:
reshape(mydf, direction = "wide", idvar="ID", timevar="Bulls")
# ID EBV.NE001362 EBV.NE001361 EBV.NE001378 EBV.NE001359 EBV.NE001379
# 1 1 0.13 0.17 0.05 -0.12 -0.14
# 7 2 NA NA NA -0.46 -0.46
# EBV.NE001380 EBV.NE001394 EBV.NE001391 EBV.NE001393 EBV.NE001707
# 1 0.13 -0.68 0.28 -0.43 -0.18
# 7 NA 0.84 NA NA NA
And xtabs. Note: This is a table-like matrix, so if you want a data.frame, you'll have to use as.data.frame.matrix on the output.
xtabs(EBV ~ ID + Bulls, mydf)
# Bulls
# ID NE001359 NE001361 NE001362 NE001378 NE001379 NE001380 NE001391
# 1 -0.12 0.17 0.13 0.05 -0.14 0.13 0.28
# 2 -0.46 0.00 0.00 0.00 -0.46 0.00 0.00
# Bulls
# ID NE001393 NE001394 NE001707
# 1 -0.43 -0.68 -0.18
# 2 0.00 0.84 0.00
The product of one simulation is a large data.frame, with fixed columns and rows. I ran several hundreds of simulations, with each result stored in a separate RData file (for efficient reading).
Now I want to gather all those files together and create statistics for each field of this data.frame in the "cells" structure, which is basically a list (one element per column) of lists (one element per row) of vectors, each holding one value per simulation. This is how I do it:
#colscount, rowscount - number of columns and rows from each simulation
#simcount - number of simulation.
#colnames - names of columns of simulation's data frame.
#simfilenames - vector with filenames with each simulation
cells<-as.list(rep(NA, colscount))
for(i in 1:colscount)
{
cells[[i]]<-as.list(rep(NA,rowscount))
for(j in 1:rowscount)
{
cells[[i]][[j]]<-rep(NA,simcount)
}
}
names(cells)<-colnames
addcells<-function(simnr)
# This function reads and appends simdata to "simnr" position in each cell in the "cells" structure
{
simdata<-readRDS(simfilenames[[simnr]])
for(i in 1:colscount)
{
for(j in 1:rowscount)
{
if (!is.na(simdata[j,i]))
{
cells[[i]][[j]][simnr]<-simdata[j,i]
}
}
}
}
library(plyr)
a_ply(1:simcount,1,addcells)
The problem is that reading a single file is fast:
> system.time(dane<-readRDS(path.cat(args$rdatapath,pliki[[simnr]]))$dane)
user system elapsed
0.088 0.004 0.093
While
> system.time(addcells(1))
user system elapsed
147.328 0.296 147.644
I would expect both commands to have comparable execution times (or at least the latter be max 10 x slower). I guess I am doing something very inefficient there, but what? The whole cells data structure is rather big, it takes around 1GB of memory.
I need to transpose data in this way, because later I do many descriptive statistics on the results (like computing means, sd, quantiles, and maybe histograms), so it is important, that the data for each cell is stored as a (single-dimensional) vector.
Here is profiling output:
> summaryRprof('/tmp/temp/rprof.out')
$by.self
self.time self.pct total.time total.pct
"[.data.frame" 71.98 47.20 129.52 84.93
"names" 11.98 7.86 11.98 7.86
"length" 10.84 7.11 10.84 7.11
"addcells" 10.66 6.99 151.52 99.36
".subset" 10.62 6.96 10.62 6.96
"[" 9.68 6.35 139.20 91.28
"match" 6.06 3.97 11.36 7.45
"sys.call" 4.68 3.07 4.68 3.07
"%in%" 4.50 2.95 15.86 10.40
"all" 4.28 2.81 4.28 2.81
"==" 2.34 1.53 2.34 1.53
".subset2" 1.28 0.84 1.28 0.84
"is.na" 1.06 0.70 1.06 0.70
"nargs" 0.62 0.41 0.62 0.41
"gc" 0.54 0.35 0.54 0.35
"!" 0.42 0.28 0.42 0.28
"dim" 0.34 0.22 0.34 0.22
".Call" 0.12 0.08 0.12 0.08
"readRDS" 0.10 0.07 0.12 0.08
"cat" 0.10 0.07 0.10 0.07
"readLines" 0.04 0.03 0.04 0.03
"strsplit" 0.04 0.03 0.04 0.03
"addParaBreaks" 0.02 0.01 0.04 0.03
It looks like indexing the list structure takes a lot of time. But I can't make it an array, because not all cells are numeric, and R doesn't easily support hash maps...
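The profile above attributes roughly 85% of the total time to "[.data.frame", i.e. to the per-cell simdata[j,i] lookups, rather than to the list itself. A possible alternative, sketched here under assumptions, is to keep one rows-by-simulations matrix per column and copy a whole column per simulation. The in-memory list sims is a hypothetical stand-in for the RDS files, and the sketch assumes each column has a single atomic type (no factors).

```r
set.seed(42)
rowscount <- 4
simcount  <- 3

# Hypothetical stand-ins for the files; with real data each element
# would come from readRDS(simfilenames[[s]])
sims <- lapply(seq_len(simcount), function(s) {
  data.frame(x = rnorm(rowscount),
             y = sample(letters, rowscount),
             stringsAsFactors = FALSE)
})

# One NA-filled matrix per column, typed like that column
# (col[NA_integer_] is a single NA of the column's own type)
cells <- lapply(sims[[1]], function(col) {
  matrix(col[NA_integer_], nrow = rowscount, ncol = simcount)
})

# Fill one matrix column per simulation: whole-vector copies,
# no per-cell data.frame indexing
for (s in seq_len(simcount)) {
  d <- sims[[s]]
  for (cn in names(cells)) {
    cells[[cn]][, s] <- d[[cn]]
  }
}

# The values of cell (row 2, column "x") across all simulations
print(cells$x[2, ])
```

Each row of cells$x is then the single-dimensional vector for one cell, so means or quantiles per cell can be computed with apply(cells$x, 1, ...) or rowMeans. The loop runs at top level because an assignment with <- inside a function such as addcells would only modify a local copy of cells.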
Foremost, I am looking for a fast(er) way of subsetting/indexing a matrix many, many times over:
for (i in 1:99000) {
subset.data <- data[index[, i], ]
}
Background:
I'm implementing a sequential testing procedure involving the bootstrap in R. Wanting to replicate some simulation results, I came upon
this bottleneck where lots of indexing needs to be done. For implementation of the block-bootstrap I created an index matrix with which I subset
the original data matrix to draw resamples of the data.
# The basic setup
B <- 1000 # no. of bootstrap replications
n <- 250 # no. of observations
m <- 100 # no. of models/data series
# Create index matrix with B columns and n rows.
# Each column represents a resampling of the data.
# (actually block resamples, but doesn't matter here).
boot.index <- matrix(sample(1:n, n * B, replace=T), nrow=n, ncol=B)
# Make matrix with m data series of length n.
sample.data <- matrix(rnorm(n * m), nrow=n, ncol=m)
subsetMatrix <- function(data, index) { # fn definition for timing
subset.data <- data[index, ]
return(subset.data)
}
# check how long it takes.
Rprof("subsetMatrix.out")
for (i in 1:(m - 1)) {
for (b in 1:B) { # B * (m - 1) = 1000 * 99 = 99000
boot.data <- subsetMatrix(sample.data, boot.index[, b])
# do some other stuff
}
# do some more stuff
}
Rprof()
summaryRprof("subsetMatrix.out")
# > summaryRprof("subsetMatrix.out")
# $by.self
# self.time self.pct total.time total.pct
# subsetMatrix 9.96 100 9.96 100
# In the actual application:
#########
# > summaryRprof("seq_testing.out")
# $by.self
# self.time self.pct total.time total.pct
# subsetMatrix 6.78 53.98 6.78 53.98
# colMeans 1.98 15.76 2.20 17.52
# makeIndex 1.08 8.60 2.12 16.88
# makeStats 0.66 5.25 9.66 76.91
# runif 0.60 4.78 0.72 5.73
# apply 0.30 2.39 0.42 3.34
# is.data.frame 0.22 1.75 0.22 1.75
# ceiling 0.18 1.43 0.18 1.43
# aperm.default 0.14 1.11 0.14 1.11
# array 0.12 0.96 0.12 0.96
# estimateMCS 0.10 0.80 12.56 100.00
# as.vector 0.10 0.80 0.10 0.80
# matrix 0.08 0.64 0.08 0.64
# lapply 0.06 0.48 0.06 0.48
# / 0.04 0.32 0.04 0.32
# : 0.04 0.32 0.04 0.32
# rowSums 0.04 0.32 0.04 0.32
# - 0.02 0.16 0.02 0.16
# > 0.02 0.16 0.02 0.16
#
# $by.total
# total.time total.pct self.time self.pct
# estimateMCS 12.56 100.00 0.10 0.80
# makeStats 9.66 76.91 0.66 5.25
# subsetMatrix 6.78 53.98 6.78 53.98
# colMeans 2.20 17.52 1.98 15.76
# makeIndex 2.12 16.88 1.08 8.60
# runif 0.72 5.73 0.60 4.78
# doTest 0.68 5.41 0.00 0.00
# apply 0.42 3.34 0.30 2.39
# aperm 0.26 2.07 0.00 0.00
# is.data.frame 0.22 1.75 0.22 1.75
# sweep 0.20 1.59 0.00 0.00
# ceiling 0.18 1.43 0.18 1.43
# aperm.default 0.14 1.11 0.14 1.11
# array 0.12 0.96 0.12 0.96
# as.vector 0.10 0.80 0.10 0.80
# matrix 0.08 0.64 0.08 0.64
# lapply 0.06 0.48 0.06 0.48
# unlist 0.06 0.48 0.00 0.00
# / 0.04 0.32 0.04 0.32
# : 0.04 0.32 0.04 0.32
# rowSums 0.04 0.32 0.04 0.32
# - 0.02 0.16 0.02 0.16
# > 0.02 0.16 0.02 0.16
# mean 0.02 0.16 0.00 0.00
#
# $sample.interval
# [1] 0.02
#
# $sampling.time
# [1] 12.56
Doing the sequential testing procedure once takes about 10 seconds. Using this in simulations with 2500 replications and several
parameter constellations, it would take something like 40 days. Using parallel processing and better CPU power it's possible to do faster, but
still not very pleasing :/
Is there a better way to resample the data / get rid of the loop?
Can apply, Vectorize, replicate etc. come in anywhere?
Would it make sense to implement the subsetting in C (e.g. manipulate some pointers)?
Even though every single step is already done incredibly fast by R, it's just not quite fast enough.
I'd be very glad indeed for any kind of response/help/advice!
related Qs:
- Fast matrix subsetting via '[': by rows, by columns or doesn't matter?
- fast function for generating bootstrap samples in matrix forms in R
- random sampling - matrix
from there
mapply(function(row) return(sample.data[row,]), row = boot.index)
replicate(B, apply(sample.data, 2, sample, replace = TRUE))
didn't really do it for me.
I rewrote makeStats and makeIndex as they were two of the biggest bottlenecks:
makeStats <- function(data, index) {
data.mean <- colMeans(data)
m <- nrow(data)
n <- ncol(index)
tabs <- lapply(1L:n, function(j)tabulate(index[, j], nbins = m))
weights <- matrix(unlist(tabs), m, n) * (1 / nrow(index))
boot.data.mean <- t(data) %*% weights - data.mean
return(list(data.mean = data.mean,
boot.data.mean = boot.data.mean))
}
makeIndex <- function(B, blocks){
n <- ncol(blocks)
l <- nrow(blocks)
z <- ceiling(n/l)
start.points <- sample.int(n, z * B, replace = TRUE)
index <- blocks[, start.points]
keep <- c(rep(TRUE, n), rep(FALSE, z*l - n))
boot.index <- matrix(as.vector(index)[keep],
nrow = n, ncol = B)
return(boot.index)
}
This brought down the computation times from 28 to 6 seconds on my machine. I bet there are other parts of the code that can be improved (including my use of lapply/tabulate above.)
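The idea behind the rewritten makeStats is that the column means of a resampled matrix are a weighted average of the original rows, with weights equal to how often each row was drawn. Here is a small check on made-up data that the weight-based product matches explicit subsetting (the sizes n, m and B below are arbitrary assumptions, and the centering by data.mean is left out):

```r
set.seed(1)
n <- 50   # observations
m <- 4    # data series
B <- 20   # bootstrap replications

sample.data <- matrix(rnorm(n * m), nrow = n, ncol = m)
boot.index  <- matrix(sample.int(n, n * B, replace = TRUE), nrow = n, ncol = B)

# Explicit approach: subset the rows for each replication, then take column means
means_subset <- sapply(seq_len(B), function(b) {
  colMeans(sample.data[boot.index[, b], ])
})

# Weight approach: count how often each row was drawn, divide by n,
# and obtain all replications from a single matrix product
weights <- sapply(seq_len(B), function(b) {
  tabulate(boot.index[, b], nbins = n)
}) / n
means_weights <- crossprod(sample.data, weights)   # same as t(sample.data) %*% weights

print(all.equal(means_subset, means_weights))
```

This only helps when the statistic of interest is a mean (or another linear function of the rows); statistics that need the resampled rows themselves still require the subsetting.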