How to efficiently grow large data in R

The product of one simulation is a large data.frame with fixed columns and rows. I ran several hundred simulations, with each result stored in a separate RData file (for efficient reading).
Now I want to gather all those files together and compute statistics for each field of this data.frame, collected into the "cells" structure, which is basically a list of lists of vectors (one vector per cell, indexed by simulation). This is how I do it:
#colscount, rowscount - number of columns and rows in each simulation's data frame
#simcount - number of simulations
#colnames - names of the columns of each simulation's data frame
#simfilenames - vector of filenames, one per simulation
cells <- as.list(rep(NA, colscount))
for(i in 1:colscount)
{
  cells[[i]] <- as.list(rep(NA, rowscount))
  for(j in 1:rowscount)
  {
    cells[[i]][[j]] <- rep(NA, simcount)
  }
}
names(cells)<-colnames
addcells <- function(simnr)
{
  # Reads one simulation and writes its values into position "simnr" of each cell in the "cells" structure
  simdata <- readRDS(simfilenames[[simnr]])
  for(i in 1:colscount)
  {
    for(j in 1:rowscount)
    {
      if (!is.na(simdata[j,i]))
      {
        cells[[i]][[j]][simnr] <- simdata[j,i]
      }
    }
  }
}
library(plyr)
a_ply(1:simcount,1,addcells)
The problem is the execution time. Reading the data for a single simulation is fast:
> system.time(dane<-readRDS(path.cat(args$rdatapath,pliki[[simnr]]))$dane)
user system elapsed
0.088 0.004 0.093
while running addcells for a single simulation takes much longer:
> system.time(addcells(1))
user system elapsed
147.328 0.296 147.644
I would expect both commands to have comparable execution times (or at least the latter to be at most 10x slower). I guess I am doing something very inefficient there, but what? The whole cells data structure is rather big; it takes around 1 GB of memory.
I need to transpose the data in this way because later I compute many descriptive statistics on the results (means, sd, quantiles, and maybe histograms), so it is important that the data for each cell is stored as a (single-dimensional) vector.
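Just for illustration (this is not part of the original question), a per-cell summary over simulations for one numeric column could then look like the sketch below; "somecolumn" is a placeholder name.
# mean across simulations for every cell of one (numeric) column, ignoring missing runs
cellmeans <- vapply(cells[["somecolumn"]], mean, numeric(1), na.rm = TRUE)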
Here is profiling output:
> summaryRprof('/tmp/temp/rprof.out')
$by.self
self.time self.pct total.time total.pct
"[.data.frame" 71.98 47.20 129.52 84.93
"names" 11.98 7.86 11.98 7.86
"length" 10.84 7.11 10.84 7.11
"addcells" 10.66 6.99 151.52 99.36
".subset" 10.62 6.96 10.62 6.96
"[" 9.68 6.35 139.20 91.28
"match" 6.06 3.97 11.36 7.45
"sys.call" 4.68 3.07 4.68 3.07
"%in%" 4.50 2.95 15.86 10.40
"all" 4.28 2.81 4.28 2.81
"==" 2.34 1.53 2.34 1.53
".subset2" 1.28 0.84 1.28 0.84
"is.na" 1.06 0.70 1.06 0.70
"nargs" 0.62 0.41 0.62 0.41
"gc" 0.54 0.35 0.54 0.35
"!" 0.42 0.28 0.42 0.28
"dim" 0.34 0.22 0.34 0.22
".Call" 0.12 0.08 0.12 0.08
"readRDS" 0.10 0.07 0.12 0.08
"cat" 0.10 0.07 0.10 0.07
"readLines" 0.04 0.03 0.04 0.03
"strsplit" 0.04 0.03 0.04 0.03
"addParaBreaks" 0.02 0.01 0.04 0.03
It looks like indexing the list structure takes a lot of time. But I can't make it an array, because not all cells are numeric, and R doesn't easily support hash maps...
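As a hedged sketch (not from the original post), most of the [.data.frame time could likely be avoided by extracting each column once as a plain vector and only then indexing by row; the function name addcells_fast and the use of <<- to make the update visible outside the function are assumptions for this illustration:
addcells_fast <- function(simnr)
{
  simdata <- readRDS(simfilenames[[simnr]])
  for(i in 1:colscount)
  {
    colvec <- simdata[[i]]            # one cheap column extraction instead of rowscount data.frame lookups
    notna  <- which(!is.na(colvec))
    for(j in notna)
    {
      cells[[i]][[j]][simnr] <<- colvec[j]   # <<- assumed so the update persists outside the function
    }
  }
}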

Related

avoid nested for-loops for tricky operation R

I have to do an operation that involves two matrices: matrix #1 with data and matrix #2 with coefficients by which to multiply the columns of matrix #1.
matrix #1 is:
dim(dat)
[1] 612 2068
dat[1:6,1:8]
X0005 X0010 X0011 X0013 X0015 X0016 X0017 X0018
1 1.96 1.82 8.80 1.75 2.95 1.10 0.46 0.96
2 1.17 0.94 2.74 0.59 0.86 0.63 0.16 0.31
3 2.17 2.53 10.40 4.19 4.79 2.22 0.31 3.32
4 3.62 1.93 6.25 2.38 2.25 0.69 0.16 1.01
5 2.32 1.93 3.74 1.97 1.31 0.44 0.28 0.98
6 1.30 2.04 1.47 1.80 0.43 0.33 0.18 0.46
and matrix #2 is:
dim(lo)
[1] 2068 8
head(lo)
i1 i2 i3 i4 i5 i6
X0005 -0.11858852 0.10336788 0.62618771 0.08706041 -0.02733101 0.006287923
X0010 0.06405406 0.13692216 0.64813610 0.15750302 -0.13503956 0.139280709
X0011 -0.06789727 0.30473549 0.07727417 0.24907723 -0.05345123 0.141591330
X0013 0.20909664 0.01275553 0.21067894 0.12666704 -0.02836527 0.464548147
X0015 -0.07690560 0.18788859 -0.03551084 0.19120773 -0.10196578 0.234037820
X0016 -0.06442454 0.34993481 -0.04057001 0.20258195 -0.09318325 0.130669546
i7 i8
X0005 0.08571777 0.031531478
X0010 0.31170850 -0.003127279
X0011 0.52527759 -0.065002026
X0013 0.27858049 -0.032178156
X0015 0.50693977 -0.058003429
X0016 0.53162596 -0.052091767
I want to multiply each column of matrix #1 by its corresponding coefficient from the first column of matrix #2, and sum up all the resulting columns. Then repeat the operation with the coefficients from matrix #2's second column, then its third column, and so on...
The result is then a matrix with 8 columns, each of which is a linear combination of the data in matrix #1.
My attempt uses nested for-loops. It works, but takes about 30 minutes to execute. Is there any way to avoid these loops and reduce the computational effort?
here is my attempt:
r = nrow(dat)
n = ncol(dat)
m = ncol(lo)
eme <- matrix(NA, r, m)
for (i in 1:m) {
  SC <- matrix(NA, r, n)
  for (j in 1:n) {
    nom <- rownames(lo)
    x <- dat[, colnames(dat) == nom[j]]
    SC[, j] <- x * lo[j, i]
    SC1 <- rowSums(SC)
  }
  eme[, i] <- SC1
}
Thanks for your help
It looks like you are just doing matrix-vector multiplication. In R, use the %*% operator, so all the looping is delegated to a Fortran routine. I think it equates to the following:
apply(lo, 2, function(x) dat %*% x)
Your code could also be improved by moving the nom <- assignment outside the loops, since it recalculates the same thing every iteration. Also, what is the point of SC1 being computed during each iteration of the inner loop?
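For illustration (this is not in the original answer), the whole computation should collapse to a single matrix product once dat's columns are reordered to match lo's rows; this sketch assumes every rowname of lo occurs among dat's column names and that those columns are all numeric:
eme <- as.matrix(dat[, rownames(lo)]) %*% as.matrix(lo)  # 612 x 8 result, one linear combination per column of lo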

Save unequal output to a csv or txt file

I want to save the following output I get in the R console into a csv or txt file.
Discordancy measures (critical value 3.00)
0.17 3.40 1.38 0.90 1.62 0.13 0.15 1.69 0.34 0.39 0.36 0.68 0.39
0.54 0.70 0.70 0.79 2.08 1.14 1.23 0.60 2.00 1.81 0.77 0.35 0.15
1.55 0.78 2.87 0.34
Heterogeneity measures (based on 100 simulations)
30.86 14.23 3.75
Goodness-of-fit measures (based on 100 simulations)
glo gev gno pe3 gpa
-3.72 -12.81 -19.80 -32.06 -37.66
This is the outcome I get when I run the following
Heter<-regtst(regsamlmu(-extremes), nsim=100)
where Heter is a list (i.e., is.list(Heter) returns TRUE)
You could use capture.output:
capture.output(regtst(regsamlmu(-extremes), nsim=100), file="myoutput.txt")
Or, for capturing output coming from several consecutive commands:
sink("myfile.txt")
#
# [commands generating desired output]
#
sink()
You could make a character vector which you write to a file. Each entry in the vector will be separated by a newline character.
out <- capture.output(regtst(regsamlmu(-extremes), nsim=100))
write(out, "output.txt", sep="\n")
If you would like to add more lines, just do something like c(out, "hello Kostas").

Profiling data.table's setkey operation with Rprof

I am working with a relatively large data.table dataset and trying to profile/optimize the code. I am using Rprof, but I'm noticing that the majority of time spent within a setkey operation is not included in the Rprof summary. Is there a way to include this time spent?
Here is a small test to show how time spent setting the key for a data table is not represented in the Rprof summary:
Create a test function that runs a profiled setkey operation on a data table:
testFun <- function(testTbl) {
    Rprof()
    setkey(testTbl, x, y, z)
    Rprof(NULL)
    print(summaryRprof())
}
Then create a test data table that is large enough to feel the weight of the setkey operation:
testTbl = data.table(x=sample(1:1e7, 1e7), y=sample(1:1e7,1e7), z=sample(1:1e7,1e7))
Then run the code, wrapping it within a system.time call to show the difference between the system.time total and the Rprof total:
> system.time(testFun(testTbl))
$by.self
self.time self.pct total.time total.pct
"sort.list" 0.88 75.86 0.88 75.86
"<Anonymous>" 0.08 6.90 1.00 86.21
"regularorder1" 0.08 6.90 0.92 79.31
"radixorder1" 0.08 6.90 0.12 10.34
"is.na" 0.02 1.72 0.02 1.72
"structure" 0.02 1.72 0.02 1.72
$by.total
total.time total.pct self.time self.pct
"setkey" 1.16 100.00 0.00 0.00
"setkeyv" 1.16 100.00 0.00 0.00
"system.time" 1.16 100.00 0.00 0.00
"testFun" 1.16 100.00 0.00 0.00
"fastorder" 1.14 98.28 0.00 0.00
"tryCatch" 1.14 98.28 0.00 0.00
"tryCatchList" 1.14 98.28 0.00 0.00
"tryCatchOne" 1.14 98.28 0.00 0.00
"<Anonymous>" 1.00 86.21 0.08 6.90
"regularorder1" 0.92 79.31 0.08 6.90
"sort.list" 0.88 75.86 0.88 75.86
"radixorder1" 0.12 10.34 0.08 6.90
"doTryCatch" 0.12 10.34 0.00 0.00
"is.na" 0.02 1.72 0.02 1.72
"structure" 0.02 1.72 0.02 1.72
"is.unsorted" 0.02 1.72 0.00 0.00
"simpleError" 0.02 1.72 0.00 0.00
$sample.interval
[1] 0.02
$sampling.time
[1] 1.16
user system elapsed
31.112 0.211 31.101
Note the difference between the 1.16 and 31.101 second totals.
Reading ?Rprof, I see why this difference might have occurred:
Functions will only be recorded in the profile log if they put a
context on the call stack (see sys.calls). Some primitive functions do
not do so: specifically those which are of type "special" (see the ‘R
Internals’ manual for more details).
So is this the reason why time spent within the setkey operation isn't represented in Rprof? Is there a workaround to have Rprof watch all of data.table's operations (including setkey, and maybe others that I haven't noticed)? I essentially want the system.time and Rprof times to match up.
Here is the most-likely relevant sessionInfo():
> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
data.table_1.8.11
I still observe this issue when Rprof() isn't within a function call:
> testFun <- function(testTbl) {
+ setkey(testTbl, x, y, z)
+ }
> Rprof()
> system.time(testFun(testTbl))
user system elapsed
28.855 0.191 28.854
> Rprof(NULL)
> summaryRprof()
$by.self
self.time self.pct total.time total.pct
"sort.list" 0.86 71.67 0.88 73.33
"regularorder1" 0.08 6.67 0.92 76.67
"<Anonymous>" 0.06 5.00 0.98 81.67
"radixorder1" 0.06 5.00 0.10 8.33
"gc" 0.06 5.00 0.06 5.00
"proc.time" 0.04 3.33 0.04 3.33
"is.na" 0.02 1.67 0.02 1.67
"sys.function" 0.02 1.67 0.02 1.67
$by.total
total.time total.pct self.time self.pct
"system.time" 1.20 100.00 0.00 0.00
"setkey" 1.10 91.67 0.00 0.00
"setkeyv" 1.10 91.67 0.00 0.00
"testFun" 1.10 91.67 0.00 0.00
"fastorder" 1.08 90.00 0.00 0.00
"tryCatch" 1.08 90.00 0.00 0.00
"tryCatchList" 1.08 90.00 0.00 0.00
"tryCatchOne" 1.08 90.00 0.00 0.00
"<Anonymous>" 0.98 81.67 0.06 5.00
"regularorder1" 0.92 76.67 0.08 6.67
"sort.list" 0.88 73.33 0.86 71.67
"radixorder1" 0.10 8.33 0.06 5.00
"doTryCatch" 0.10 8.33 0.00 0.00
"gc" 0.06 5.00 0.06 5.00
"proc.time" 0.04 3.33 0.04 3.33
"is.na" 0.02 1.67 0.02 1.67
"sys.function" 0.02 1.67 0.02 1.67
"formals" 0.02 1.67 0.00 0.00
"is.unsorted" 0.02 1.67 0.00 0.00
"match.arg" 0.02 1.67 0.00 0.00
$sample.interval
[1] 0.02
$sampling.time
[1] 1.2
EDIT2: Same issue with 1.8.10 on my machine with only the data.table package loaded. Times are not equal even when the Rprof() call is not within a function:
> library(data.table)
data.table 1.8.10 For help type: help("data.table")
> base::source("/tmp/r-plugin-claytonstanley/Rsource-86075-preProcess.R", echo=TRUE)
> testFun <- function(testTbl) {
+ setkey(testTbl, x, y, z)
+ }
> testTbl = data.table(x=sample(1:1e7, 1e7), y=sample(1:1e7,1e7), z=sample(1:1e7,1e7))
> Rprof()
> system.time(testFun(testTbl))
user system elapsed
29.516 0.281 29.760
> Rprof(NULL)
> summaryRprof()
EDIT3: Doesn't work even if setkey is not within a function:
> library(data.table)
data.table 1.8.10 For help type: help("data.table")
> testTbl = data.table(x=sample(1:1e7, 1e7), y=sample(1:1e7,1e7), z=sample(1:1e7,1e7))
> Rprof()
> setkey(testTbl, x, y, z)
> Rprof(NULL)
> summaryRprof()
EDIT4: Doesn't work even when R is called from a --vanilla bare-bones terminal prompt.
EDIT5: Does work when tested on a Linux VM. But still does not work on darwin machine for me.
EDIT6: Doesn't work after watching the Rprof.out file get created, so it isn't a write access issue.
EDIT7: Doesn't work after compiling data.table from source and creating a new temp user and running on that account.
EDIT8: Doesn't work when compiling R 3.0.2 from source for darwin via MacPorts.
EDIT9: Does work on a different darwin machine, a Macbook Pro laptop running the same OS version (10.6.8). Still doesn't work on a MacPro desktop machine running same OS version, R version, data.table version, etc.
I'm thinking it's b/c the desktop machine is running in 64-bit kernel mode (not default), and the laptop is 32-bit (default). Confirmed.
Great question. Given the edits, I'm not sure then; I can't reproduce it. Leaving the remainder of the answer here for now.
I've tested on my (very slow) netbook and it works fine, see output below.
I can tell you right now why setkey is so slow on that test case. When the number of levels is large (greater than 100,000, as here) it reverts to comparison sort rather than counting sort. Yes, pretty poor if you have data like that in practice. Typically we have under 100,000 unique values in the first column and then, say, dates in the 2nd column. Both columns can be sorted using counting sort and performance is ok.
It's a known issue and we've been working hard on it. Arun has implemented radix sort for integers with range > 100,000 to solve this problem and that's in the next release. But we are still tidying up v1.8.11. See our presentation in Cologne which goes into more detail and gives some idea of speedups.
Introduction to data.table and news from v1.8.11
Here is the output with v1.8.10, along with R version and lscpu info (for your entertainment). I like to test on a very poor machine with small cache so that in development I can see what's likely to bite when the data is scaled up on larger machines with larger cache.
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 20
Model: 2
Stepping: 0
CPU MHz: 800.000
BogoMIPS: 1995.01
Virtualisation: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
NUMA node0 CPU(s): 0,1
$ R
R version 3.0.2 (2013-09-25) -- "Frisbee Sailing"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
> require(data.table)
Loading required package: data.table
data.table 1.8.10 For help type: help("data.table")
> testTbl = data.table(x=sample(1:1e7, 1e7), y=sample(1:1e7,1e7), z=sample(1:1e7,1e7))
> testTbl
x y z
1: 1748920 6694402 7501082
2: 4571252 565976 5695727
3: 1284455 8282944 7706392
4: 8452994 8765774 6541097
5: 6429283 329475 5271154
---
9999996: 2019750 5956558 1735214
9999997: 1096888 1657401 3519573
9999998: 1310171 9002746 350394
9999999: 5393125 5888350 7657290
10000000: 2210918 7577598 5002307
> Rprof()
> setkey(testTbl, x, y, z)
> Rprof(NULL)
> summaryRprof()
$by.self
self.time self.pct total.time total.pct
"sort.list" 195.44 91.34 195.44 91.34
".Call" 5.38 2.51 5.38 2.51
"<Anonymous>" 4.32 2.02 203.62 95.17
"radixorder1" 4.32 2.02 4.74 2.22
"regularorder1" 4.28 2.00 199.30 93.15
"is.na" 0.12 0.06 0.12 0.06
"any" 0.10 0.05 0.10 0.05
$by.total
total.time total.pct self.time self.pct
"setkey" 213.96 100.00 0.00 0.00
"setkeyv" 213.96 100.00 0.00 0.00
"fastorder" 208.36 97.38 0.00 0.00
"tryCatch" 208.36 97.38 0.00 0.00
"tryCatchList" 208.36 97.38 0.00 0.00
"tryCatchOne" 208.36 97.38 0.00 0.00
"<Anonymous>" 203.62 95.17 4.32 2.02
"regularorder1" 199.30 93.15 4.28 2.00
"sort.list" 195.44 91.34 195.44 91.34
".Call" 5.38 2.51 5.38 2.51
"radixorder1" 4.74 2.22 4.32 2.02
"doTryCatch" 4.74 2.22 0.00 0.00
"is.unsorted" 0.22 0.10 0.00 0.00
"is.na" 0.12 0.06 0.12 0.06
"any" 0.10 0.05 0.10 0.05
$sample.interval
[1] 0.02
$sampling.time
[1] 213.96
>
The problem was that the darwin machine was running Snow Leopard with a 64-bit kernel, which is not the default for that OS X version.
I also verified that this is not a problem for another darwin machine running Mountain Lion which uses a 64-bit kernel by default. So it's an interaction between Snow Leopard and running a 64-bit kernel specifically.
As a side note, the official OS X binary installer for R is still built with Snow Leopard, so I do think that this issue is still relevant, as Snow Leopard is still a widely-used OS X version.
When the 64-bit kernel in Snow Leopard is enabled, no kernel extensions that are compatible only with the 32-bit kernel are loaded. After booting into the default 32-bit kernel for Snow Leopard, kextfind shows that these 32-bit only kernel extensions are on the machine and (most likely) loaded:
$ kextfind -not -arch x86_64
/System/Library/Extensions/ACard6280ATA.kext
/System/Library/Extensions/ACard62xxM.kext
/System/Library/Extensions/ACard67162.kext
/System/Library/Extensions/ACard671xSCSI.kext
/System/Library/Extensions/ACard6885M.kext
/System/Library/Extensions/ACard68xxM.kext
/System/Library/Extensions/AppleIntelGMA950.kext
/System/Library/Extensions/AppleIntelGMAX3100.kext
/System/Library/Extensions/AppleIntelGMAX3100FB.kext
/System/Library/Extensions/AppleIntelIntegratedFramebuffer.kext
/System/Library/Extensions/AppleProfileFamily.kext/Contents/PlugIns/AppleIntelYonahProfile.kext
/System/Library/Extensions/IO80211Family.kext/Contents/PlugIns/AirPortAtheros.kext
/System/Library/Extensions/IONetworkingFamily.kext/Contents/PlugIns/AppleRTL8139Ethernet.kext
/System/Library/Extensions/IOSerialFamily.kext/Contents/PlugIns/InternalModemSupport.kext
/System/Library/Extensions/IOSerialFamily.kext/Contents/PlugIns/MotorolaSM56KUSB.kext
/System/Library/Extensions/JMicronATA.kext
/System/Library/Extensions/System.kext/PlugIns/BSDKernel6.0.kext
/System/Library/Extensions/System.kext/PlugIns/IOKit6.0.kext
/System/Library/Extensions/System.kext/PlugIns/Libkern6.0.kext
/System/Library/Extensions/System.kext/PlugIns/Mach6.0.kext
/System/Library/Extensions/System.kext/PlugIns/System6.0.kext
/System/Library/Extensions/ufs.kext
So it could be any one of those loaded extensions that is enabling something that Rprof relies on, so that the setkey operation in data.table is profiled correctly.
If anyone wants to investigate this further, dig a bit deeper, and get to the root cause of the problem, please post an answer and I'll happily accept that one.

Passing an expression to a nested grouping in data.table

I have a data.table object similar to this one
library(data.table)
c <- data.table(CO = c(10000,10000,10000,20000,20000,20000,20000),
                SH = c(1427,1333,1333,1000,1000,300,350),
                PRC = c(6.5,6.125,6.2,0.75,0.5,3,3.5),
                DAT = c(0.5,-0.5,0,-0.1,NA_real_,0.2,0.5),
                MM = c("A","A","A","A","A","B","B"))
and I am trying to perform calculations using nested grouping, passing an expression as an argument. Here is a simplified version of what I have:
setkey(c,MM)
mycalc <- quote({nobscc <- length(DAT[complete.cases(DAT)]);
                 list(MKTCAP = tail(SH,n=1)*tail(PRC,n=1),
                      SQSUM = ifelse(nobscc>=2, sum(DAT^2,na.rm=TRUE), NA_real_),
                      COVCOMP = ifelse(nobscc >= 2, head(DAT,n=1), NA_real_),
                      NOBS = nobscc)})
myresults <- c[,.SD[,{setkey=CO; eval(mycalc)},by=CO],by=MM]
which produces
MM CO MKTCAP SQSUM COVCOMP NOBS
[1,] A 10000 8264.6 0.50 0.5 3
[2,] A 20000 500.0 NA NA 1
[3,] B 20000 1225.0 0.29 0.2 2
In the example above I have two elements of the list which use the ifelse construct (in the actual code there are 3), all doing the same test: if the number of observations is at least 2, then a certain calculation (which is different for each element of the list, and each could be written as a function) is performed; otherwise I want the value of these elements to be NA. Another thing these elements have in common is that they use one and the same column of my data.table, the one called DAT.
So my question is: is there any way I can do the ifelse test only once, and if it is FALSE, pass the value NA to the respective elements of the list, and if TRUE, evaluate a different expression for each of the elements of the list?
NOTE: My goal is to reduce the system.time (system and elapsed). If this modification will not reduce time and calculations, bearing in mind I have 72 million observations, that's an acceptable answer. I also welcome suggestions to change other parts of the code.
EDIT: Results of summaryRprof()
$by.total
total.time total.pct self.time self.pct
"system.time" 18.94 99.79 0.00 0.00
".Call" 18.92 99.68 0.10 0.53
"[" 18.92 99.68 0.04 0.21
"[.data.table" 18.92 99.68 0.02 0.11
"eval" 18.80 99.05 0.24 1.26
"ifelse" 18.30 96.42 0.46 2.42
"lm" 17.70 93.26 0.58 3.06
"sapply" 8.06 42.47 0.36 1.90
"model.frame" 7.74 40.78 0.16 0.84
"model.frame.default" 7.58 39.94 0.98 5.16
"lapply" 6.62 34.88 0.70 3.69
"FUN" 4.24 22.34 1.10 5.80
"model.matrix" 4.04 21.29 0.02 0.11
"model.matrix.default" 4.02 21.18 0.26 1.37
"match" 3.66 19.28 0.86 4.53
".getXlevels" 3.12 16.44 0.12 0.63
"na.omit" 2.40 12.64 0.24 1.26
"%in%" 2.30 12.12 0.34 1.79
"simplify2array" 2.24 11.80 0.12 0.63
"na.omit.data.frame" 2.16 11.38 0.14 0.74
"[.data.frame" 2.12 11.17 1.18 6.22
"deparse" 1.80 9.48 0.66 3.48
"unique" 1.80 9.48 0.54 2.85
"[[" 1.52 8.01 0.12 0.63
"[[.data.frame" 1.40 7.38 0.54 2.85
".deparseOpts" 1.34 7.06 0.96 5.06
"paste" 1.32 6.95 0.16 0.84
"lm.fit" 1.20 6.32 0.64 3.37
"mode" 1.14 6.01 0.14 0.74
"unlist" 1.12 5.90 0.56 2.95
Instead of forming and operating on data subsets like this:
setkey(c,MM)
myresults <- c[, .SD[,{setkey=CO; eval(mycalc)},by=CO], by=MM]
You could try doing this:
setkeyv(c, c("MM", "CO"))
myresults <- c[, eval(mycalc), by=key(c)]
This should speed up your code, since it avoids all of the nested subsetting of .SD objects, each of which requires its own call to [.data.table.
On your original question, I doubt the ifelse evaluations are taking much time, but if you want to avoid them, you could take them out of mycalc and use := to overwrite the desired values with NA:
mycalc <- quote(list(MKTCAP = tail(SH,n=1)*tail(PRC,n=1),
                     SQSUM = sum(DAT^2,na.rm=TRUE),
                     COVCOMP = head(DAT,n=1),
                     NOBS = length(DAT[complete.cases(DAT)])))
setkeyv(c, c("MM", "CO"))
myresults <- c[, eval(mycalc), by=key(c)]
myresults[NOBS<2, c("SQSUM", "COVCOMP"):=NA_real_]
## Or, alternatively
# myresults[NOBS<2, SQSUM:=NA_real_]
# myresults[NOBS<2, COVCOMP:=NA_real_]

transpose 250,000 rows into columns in R

I always transpose by using the t(file) command in R.
But it is not running properly (not running at all) on a big data file (250,000 rows and 200 columns). Any ideas?
I need to calculate the correlation between the 2nd row (PTBP1) and all other rows (except 8 rows, including the header). In order to do this I transpose the rows to columns and then use the cor function.
But I am stuck at the transpose step. Any help would be really appreciated!
I copied this example from another Stack Overflow post (it discusses almost the same problem, but there seems to be no answer yet):
ID A B C D E F G H I [200 columns]
Row0$-1 0.08 0.47 0.94 0.33 0.08 0.93 0.72 0.51 0.55
Row02$1 0.37 0.87 0.72 0.96 0.20 0.55 0.35 0.73 0.44
Row03$ 0.19 0.71 0.52 0.73 0.03 0.18 0.13 0.13 0.30
Row04$- 0.08 0.77 0.89 0.12 0.39 0.18 0.74 0.61 0.57
Row05$- 0.09 0.60 0.73 0.65 0.43 0.21 0.27 0.52 0.60
Row06-$ 0.60 0.54 0.70 0.56 0.49 0.94 0.23 0.80 0.63
Row07$- 0.02 0.33 0.05 0.90 0.48 0.47 0.51 0.36 0.26
Row08$_ 0.34 0.96 0.37 0.06 0.20 0.14 0.84 0.28 0.47
........
250,000 rows
Use a matrix instead. The only advantage of a data.frame over a matrix is the capacity to have different classes in the columns, and you clearly do not have that situation, since a transposed data.frame could not support such a result.
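A minimal sketch of that suggestion (the object name dat is a placeholder, and it assumes the ID column has been dropped so that everything left is numeric):
m  <- as.matrix(dat[, -1])  # coerce the numeric columns once
tm <- t(m)                  # transposing a 250,000 x 200 numeric matrix is then a single fast copy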
I don't get why you want to transpose the data.frame. If you just use cor it doesn't matter whether your data is in rows or columns.
Actually, it is one of the major advantages of R that it doesn't matter whether your data fits the classical row-column pattern that SPSS and other programs require.
There are numerous ways to correlate the first row with all other rows (I don't get which rows you want to exclude). One is using a loop (here the loop is implicit in the call to one of the *apply family functions):
lapply(2:(dim(fn)[1]), function(x) cor(fn[1,],fn[x,]))
Note that I expect your data.frame to be called fn. To skip some rows, change the 2 to the number you want. Furthermore, I would probably use vapply here.
I hope this answer points you in the correct direction, which is to not use t() if you absolutely don't need it.
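To make that concrete, a hedged sketch of correlating one row against all the others without any transpose; the object name dat, the row index 2 for PTBP1, and the exact rows excluded are assumptions for illustration:
m      <- as.matrix(dat[, -1])               # numeric part only (ID column dropped)
target <- m[2, ]                             # e.g. the PTBP1 row
keep   <- setdiff(seq_len(nrow(m)), 1:8)     # drop whichever 8 rows should be excluded
cors   <- apply(m[keep, , drop = FALSE], 1, cor, y = target)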
