I have a list with hundreds of columns and rows. What I'm doing is looping through nearly every possible iteration of taking the difference between two columns. For example take the difference between 1st and 2nd column, 1st and 3rd column..1st and 500th column... 499th column and 500th column. Once I have those differences I compute some descriptive statistics (ie. mean, st dev, kurtosis, skewness, etc) for output. I know I can use lapply to calculate those statistics for each column individually but sd(x)-sd(y) <> sd(x-y) so it doesn't really cut down much on my looping. I can use avg(x)-avg(y)=avg(x-y) but that's the only statistic where I can use this property.
Here's some pseudo code that I have:
for (n1 in 1:(number of columns)) {
for (n2 in n1:(number of columns)) {
temp<-bigdata[n1]-bigdata[n2]
results[abc]<-(maxdrawdown,mean,skewness,kurtosis,count,st dev,
median, downsidedeviation)
}
}
Doing it this way can take literally days, so I'm looking for some improvements. I'm already using the compiler package with enableJIT(3), which does make it noticeably faster. I had a couple of other ideas, and any insights would be helpful. One is to use the snowfall package (I'm still trying to get my head around how to implement it), with the thought that one core could compute skew and kurtosis while the other computes the other statistics. The other idea is to build big chunks of temp (i.e. 1-2, 1-3, 1-4) as another data.frame (or list) and use lapply on it to knock out many iterations at once. Would this make much of a difference? Is there anything else I can do that I'm not even thinking of?
A reproducible example would really help, because the way you describe your problem is confusing (e.g. lists don't have rows/columns). My guess is that bigdata and results are data.frames, in which case converting each of them to a matrix will make your loops appreciably faster.
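Once the columns are in a numeric matrix, at least one statistic from the question does not need the pairwise loop at all. Since var(x - y) = var(x) + var(y) - 2 cov(x, y), the standard deviation of every column difference can be read off the covariance matrix. A minimal sketch, assuming bigdata is an all-numeric matrix without NAs (the simulated data below is only a stand-in):

```r
set.seed(1)
bigdata <- matrix(rnorm(200 * 5), ncol = 5)  # stand-in for the real data

S <- cov(bigdata)   # all pairwise covariances in one call
v <- diag(S)        # per-column variances

# sd_diff[i, j] is sd(bigdata[, i] - bigdata[, j])
sd_diff <- sqrt(outer(v, v, "+") - 2 * S)

# spot check against the direct computation
all.equal(sd_diff[2, 4], sd(bigdata[, 2] - bigdata[, 4]))
```

As far as I know, skewness, kurtosis and drawdown have no equally simple shortcut, so they still need the differences themselves.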
I don't know how much faster it will be, but the following should at least make the code easier to read, and it may also run a bit faster because the explicit for() loops are gone.
Try using expand.grid(), which I tend to use less often than I probably should.
For instance:
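nC <- 3 # Num of cols
nR <- 4 # Num of rows
# one row per (column, column) pair; seq_len(nC) expands to 1, 2, ..., nC
indices <- expand.grid(seq_len(nC), seq_len(nC))
# Now you can use apply cleanly
yourResult <- apply(indices, 1,
function(x) {
c1 <- x[1]; c2 <- x[2]
# return the value; assigning to yourResult[c1,c2] in here would only change a local copy
doYourThing(bigData[,c1], bigData[,c2])
}
)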
Well, you get the idea. :-)
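The same idea can be sketched in runnable form. This is only an illustration under assumptions: bigData is simulated, the three statistics stand in for the longer list in the question, and combn() is swapped in for expand.grid() so that each unordered pair of columns is visited exactly once.

```r
set.seed(1)
bigData <- matrix(rnorm(50 * 4), ncol = 4)  # simulated stand-in data

# 2 x choose(4, 2) matrix; each column holds one pair of column indices
pairs <- combn(ncol(bigData), 2)

# one column of statistics per pair of data columns
stats <- apply(pairs, 2, function(p) {
  d <- bigData[, p[1]] - bigData[, p[2]]
  c(mean = mean(d), sd = sd(d), median = median(d))
})
colnames(stats) <- paste(pairs[1, ], pairs[2, ], sep = "-")
stats
```

The result has one row per statistic and one column per pair, so stats["sd", "1-2"] is the standard deviation of column 1 minus column 2.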
I was trying to do an extensive computation in R. Eighteen hours have passed but my RStudio seems to continue to work. I'm not sure if I could have written the script in a different way to make it faster. I was trying to implement a Crank-Nicolson type method over a 50000 by 350 matrix as shown below:
#defining the discretization of cells
dt<-1
t<-50000
dz<-0.0075
z<-350*dz
#velocity & diffusion
v<-2/(24*60*60)
D<-0.02475/(24*60*60)
#make the big matrix (all filled with zeros)
m <- as.data.frame(matrix(0, t/dt+1, z/dz+2)) #extra columns/rows for boundary conditions
#fill the first and last columns with constant boundary values
m[,1]<-400
m[,length(m)]<-0
#implement the calculation
for(j in 2:(length(m[1,])-1)){
for(i in 2:length(m[[1]])){
m[i,][2:length(m)-1][[j]]<-m[i-1,][[j]]+
D*dt*(m[i-1,][[j+1]]-2*m[i-1,][[j]]+m[i-1,][[j-1]])/(dz^2)-
v*dt*(m[i-1,][[j+1]]-m[i-1,][[j-1]])/(2*dz)
}}
Is there a way to know how long would it take for R to implement it? Is there a better way of constructing the numerical calculation? At this point, I feel like excel could have been faster!!
Just making a few simple optimisations really helps here. The original version of your code would take ~5 days on my laptop. By using a matrix and calculating values that are reused in the loop just once, we bring this down to around 7 minutes.
And think about messy constructions like
m[i,][2:length(m)-1][[j]]
This is equivalent to
m[[i, j]]
which would be faster (as well as much easier to understand). Making this change reduces the runtime by a further factor of over 2, to around 3 minutes.
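The claimed equivalence is easy to check on a small case. A minimal sketch, using a made-up 3 x 4 data frame rather than the 50001-row one from the question:

```r
# small stand-in for the big data frame
m <- as.data.frame(matrix(1:12, nrow = 3))
i <- 2
j <- 3

# the original construction: 2:length(m)-1 parses as (2:length(m)) - 1,
# i.e. columns 1 to length(m) - 1, from which element j is then picked
slow_way <- m[i, ][2:length(m) - 1][[j]]

# direct row/column indexing
fast_way <- m[[i, j]]

identical(slow_way, fast_way)
```

The two forms agree for any j up to length(m) - 1, which covers every j the loop uses.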
Putting this together we have
dt<-1
t<-50000
dz<-0.0075
z<-350*dz
#velocity & diffusion
v<-2/(24*60*60)
D<-0.02475/(24*60*60)
#make the big matrix (all filled with zeros)
m <- (matrix(0, t/dt+1, z/dz+2)) #extra columns/rows for boundary conditions
# cache a few values that get reused many times
NC = NCOL(m)
NR = NROW(m)
C1 = D*dt / dz^2
C2 = v*dt / (2*dz)
#fill the first and last columns with constant boundary values
m[,1]<-400
m[,NC]<-0
#implement the calculation
for(j in 2:(NC-1)){
for(i in 2:NR){
ma = m[i-1,]
ma.1 = ma[[j+1]]
ma.2 = ma[[j-1]]
m[[i,j]] <- ma[[j]] + C1*(ma.1 - 2*ma[[j]] + ma.2) - C2*(ma.1 - ma.2)
}
}
If you need to go even faster than this, you can try out some more optimisations. For example see here for how different ways of indexing the same element can have very different execution times. In general it is better to refer to column first, then row.
If all the optimisations you can do in R are not enough for your speed requirements, then you might implement the loop with Rcpp instead.
I need to do a quality control in a dataset with more than 3000 variables (columns). However, I only want to apply some conditions in a couple of them. A first step would be to replace outliers by NA. I want to replace the observations that are greater or smaller than 3 standard deviations from the mean by NA. I got it, doing column by column:
height = ifelse(abs(height-mean(height,na.rm=TRUE)) <
3*sd(height,na.rm=TRUE),height,NA)
And I also want to create other variables based on different columns. For example:
data$CGmark = ifelse(!is.na(data$mark) & !is.na(data$height) ,
paste(data$age, data$mark,sep=""),NA)
An example of my dataset would be:
name = factor(c("A","B","C","D","E","F","G","H","H"))
height = c(120,NA,150,170,NA,146,132,210,NA)
age = c(10,20,0,30,40,50,60,NA,130)
mark = c(100,0.5,100,50,90,100,NA,50,210)
data = data.frame(name=name,mark=mark,age=age,height=height)
data
I have tried this (for one condition):
d1=names(data)
list = c("age","height","mark")
ntraits=length(list)
nrows=dim(data)[1]
for(i in 1:ntraits){
a=list[i]
b=which(d1==a)
d2=data[,b]
for (j in 1:nrows){
d2[j] = ifelse(abs(d2[j]-mean(d2,na.rm=TRUE)) < 3*sd(d2,na.rm=TRUE),d2[j],NA)
}
}
Someone told me that I am not storing d2. How can I create for loops to apply the conditions I want? I know that there are similar questions, but I didn't get it yet. Thanks in advance.
You pretty much wrote the answer in your first line. You're overthinking this one.
First, it's good practice to encapsulate this kind of operation in a function. Yes, function dispatch is a tiny bit slower than otherwise, but the code is often easier to read and debug. Same goes for assigning "helper" variables like mean_x: the cost of assigning the variable is very, very small and absolutely not worth worrying about.
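NA_outside_3s <- function(x) {
mean_x <- mean(x,na.rm=TRUE) # na.rm is needed here too, otherwise any NA makes the mean NA
sd_x <- sd(x,na.rm=TRUE)
x_outside_3s <- abs(x - mean_x) > 3 * sd_x # TRUE for values more than 3 sd from the mean
x[which(x_outside_3s)] <- NA # no need for ifelse here; which() skips existing NAs
x
}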
Of course, you can choose any function name you want. More descriptive is better.
Then if you want to apply the function to every column, just loop over the columns. The function NA_outside_3s is already vectorized, i.e. it takes a numeric vector as an argument and returns a vector of the same length.
cols_to_loop_over <- c("age", "height", "mark") # or 1:ncol(my_data) if every column is numeric
for (j in cols_to_loop_over) {
my_data[, j] <- NA_outside_3s(my_data[, j])
}
I'm not sure why you wrote your code the way you did (and it took me a minute to even understand what you were trying to do), but looping over columns is usually straightforward.
In my comment I said not to worry about efficiency, but once you understand how the loop works, you should rewrite it using lapply:
my_data[cols_to_loop_over] <- lapply(my_data[cols_to_loop_over], NA_outside_3s)
Once you know how the apply family of functions works, they are very easy to read if written properly. And yes, they are somewhat faster than looping, but not as much as they used to be. It's more a matter of style and readability.
Also: do NOT name a variable list! This masks the function list, which is an R built-in function and a fairly important one at that. You also shouldn't generally name variables data because there is also a data function for loading built-in data sets.
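Putting the pieces together, here is a self-contained sketch of the whole approach. The data frame is hypothetical; the nine-row example in the question is too small for anything to be flagged, because with 9 values no point can lie more than (9 - 1) / sqrt(9), about 2.67, sample standard deviations from the mean.

```r
# replace values more than 3 standard deviations from the mean by NA
NA_outside_3s <- function(x) {
  mean_x <- mean(x, na.rm = TRUE)
  sd_x <- sd(x, na.rm = TRUE)
  x[which(abs(x - mean_x) > 3 * sd_x)] <- NA
  x
}

# hypothetical data: 20 ordinary values, one NA and one gross outlier per column
my_data <- data.frame(
  name   = factor(sprintf("id%02d", 1:22)),
  height = c(rep(c(150, 160), 10), NA, 1000),
  mark   = c(rep(c(50, 60), 10), NA, -900)
)

# only the numeric columns that need the check
num_cols <- c("height", "mark")
my_data[num_cols] <- lapply(my_data[num_cols], NA_outside_3s)

tail(my_data, 3)
```

Assigning the lapply() result back into my_data[num_cols] is the step that stores the cleaned columns, which is what the loop in the question was missing.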
I am trying to write R code which acts as a "moving window", just with memory (state). I have figured out (thanks to this question) how to apply a function to subsequent tuples of elements. For example, if I wish to write a (simple) moving average with a typical period 4, I would do the following:
mapply(myfunc, x[1:(length(x)-4)], x[2:(length(x)-3)], x[3:(length(x)-2)], x[4:(length(x)-1)])
Where myfunc is a function with 4 arguments, which calculates their mean (I cannot use mean, as it expects only 1 argument, and I don't know how to make the 4 arguments a single vector).
That's quite cumbersome, though, and if the typical period is 100, say, I am not sure how to do it.
So here's my first question: how do I generalize this?
But here's another issue: suppose I wish the applied function to be able to save state. A simple example would be to keep record of how many values it was applied on so far. Another example is the exponential moving average (EMA), which is not really a window function, but instead a function which works on single values but which keeps state (the last resulted mean).
How can I write a function which when applied to a vector, works on its values one by one, returning a vector of the same length, which is able to retain its last output every time, or save any other "state" during its calculations? In Python, for example, I'd use classes for that, but that's quite difficult in R.
Important note: I am not interested in auxiliary R packages like zoo or TTR to do the work for me. I am trying to learn R, and in any case the functions I wish to write, while having similarities with MA or EMA, are custom, and do not exist in any of these packages.
Regarding your first question,
n <- length(x)
k <- 4
r <- embed(x, n-k)[1:k, seq(n-k, 1)]
do.call("mapply", c("myfunc", split(r, 1:k)))
Regarding the second question, Reduce can be used to iterate over a vector saving state.
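To illustrate the Reduce suggestion, here is a minimal sketch of an exponential moving average in which the previous output is the carried state. The function name ema and the smoothing parameter alpha are my own labels, not anything from the question.

```r
# exponential moving average: each output depends on the previous output
ema <- function(x, alpha) {
  Reduce(
    function(prev, cur) alpha * cur + (1 - alpha) * prev,
    x,
    accumulate = TRUE  # keep every intermediate state, not just the last
  )
}

ema(c(1, 2, 3, 4), alpha = 0.5)
# returns 1.000 1.500 2.250 3.125
```

With accumulate = TRUE the result has the same length as the input, and the first element is passed through unchanged as the initial state.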
For things like this you should consider using a plain for loop:
x <- runif(10000)
k <- 100
n <- length(x)
res <- numeric(n - k)
library(microbenchmark)
microbenchmark(times=5,
for(i in seq_len(n - k)) res[i] <- sum(x[i:(i + k - 1)]),
{
r <- embed(x, n-k)[1:k, seq(n-k, 1)]
gg <- do.call("mapply", c("sum", split(r, 1:k)))
},
flt <- filter(x, rep(1, k))
)
Produces:
Unit: milliseconds
min lq median uq max neval
for 163.5403 164.4929 165.2543 166.6315 167.0608 5
embed/mapply 1255.2833 1307.3708 1338.2748 1341.5719 1405.1210 5
filter 6.7101 6.7971 6.8073 6.8161 6.8991 5
Now, the results are not identical and I don't pretend to understand exactly what GGrothendieck is doing with embed, but generally speaking for loops are just as fast as *pply functions so long as you initialize your result vectors first. Windowed calculations don't lend themselves well to vectorization, so might as well use a for loop.
EDIT: as several have pointed out in the comments, there appears to be an internally implemented function (filter) that does this and is quite a bit faster, so that seems to be the best option. You should confirm it actually does what you want, though: again, the results are not exactly identical, and I am not personally familiar with the function. In its default configuration it appears to compute a rolling weighted sum (a plain sum if the weights are all 1) over a centered window.
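For a trailing rather than centered window, stats::filter takes sides = 1. A small sketch on a toy vector, checking the result against a plain loop (the variable names are mine):

```r
x <- c(1, 2, 3, 4, 5, 6)
k <- 3
n <- length(x)

# reference: explicit loop over the n - k + 1 full windows
loop_sum <- numeric(n - k + 1)
for (i in seq_along(loop_sum)) loop_sum[i] <- sum(x[i:(i + k - 1)])

# sides = 1 uses only current and past values; the first k - 1 outputs are NA
flt <- stats::filter(x, rep(1, k), sides = 1)
filter_sum <- as.numeric(flt)[k:n]

all.equal(loop_sum, filter_sum)
```

Calling it as stats::filter avoids surprises if another attached package (dplyr, for example) masks filter.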
I come from a Java/Python comp sci theory background so I am still getting used to the various R packages and how they can save run time in functions.
Basically, I am working on a few projects and all of them involve taking individual factors in a long-list data set (15,000 to 200,000 factors) and performing calculations on individual factors in an equally-large data set, and concurrently storing the results of those calculations in an exponentially-longer data frame.
So far I have been using nested while loops and concatenating into a growing list, but that is taking days. I've recently learned about 'lapply' and the 'data.frame' options in R, and I would love to see an example of how to apply (no pun intended) them to the following basic correlation function:
Corr<-function(miRdf, mRNAdf)
{
j=1
k=1
m=1
n=1
c=0
corrList=NULL
while(n<=71521)
{
while(m<=1477)
{
corr=cor(as.numeric(miRdf[k,2:13]), as.numeric(mRNAdf[j,2:13]), use ="complete.obs")
corrList<-c(corrList, corr)
j=j+1
c=c+1
print(c) #just a counter to see how far the function has run
m=m+1
}
k=k+1
n=n+1
j=1
m=1 #to reset the inner while loop
}
corrList<-matrix(unlist(corrList), ncol=1477, byrow=FALSE)
colnames(corrList)<-miRdf[,1]
rownames(corrList)<-mRNAdf[,1]
write.csv(corrList, "testCorrWhole.csv")
}
As you can see, the nested while loop results in 105,636,517 (71521x1477) miRNA vs mRNA expression-value correlation scores that need to be performed and stored in a data frame that is 1477 cols x 71521 rows in order to generate a scoring matrix.
My question is, can anyone shed light on how to turn the above monstrosity into an efficient function that utilizes 'lapply' instead of the while loops, and uses the 'data.table' set() function to do away with the inefficiency of concatenating a list during every pass through the loop?
Thank you in advance!
Your names end with 'df', which makes it seem like your data are a data.frame. But @Troy's answer uses a matrix. A matrix is appropriate when the data are homogeneous, and generally matrix operations are much faster than data.frame operations. So you can see already that if you'd provided a small example of your data set (e.g., dput(mRNAdf[1:10,])), people might be in a better position to help you; this is what they're asking for.
In large numerical calculations it makes sense to 'hoist' any repeated calculations outside the loop, so they are performed only once. Repeated calculations in your case include sub-setting to columns 2:13, and coercion to numeric. With this idea, and guessing that you actually have a data.frame where each column is already a numeric vector, I'd start with
mRNAmatrix <- as.matrix(mRNAdf[,2:13])
miRmatrix <- as.matrix(miRdf[,2:13])
From the help page ?cor we see that the arguments can be a matrix, and if so the correlation is calculated between columns. You're interested in the result when the arguments are transposed relative to your current representation. So
result <- cor(t(mRNAmatrix), t(miRmatrix), use="complete.obs")
This is fast enough for your purposes
> m1 = matrix(rnorm(71521 * 12), 71521)
> m2 = matrix(rnorm(1477 * 12), 1477)
> system.time(ans <- cor(t(m1), t(m2)))
user system elapsed
9.124 0.200 9.340
> dim(ans)
[1] 71521 1477
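result is the same as your corrList, except that it is a matrix rather than a list; the row and column names have probably been carried forward. You'd write this to a file as you do above, with write.csv(result, "testCorrWhole.csv"). One caveat if your data contain NAs: with matrix arguments, use="complete.obs" drops one of the 12 value columns from every correlation as soon as it holds an NA in any row, whereas your loop discarded NAs one pair at a time; use="pairwise.complete.obs" is the closer match.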
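To see that the single cor() call really reproduces the nested loop, here is a small check on simulated matrices (A and B are made-up stand-ins with 12 value columns and no NAs):

```r
set.seed(1)
A <- matrix(rnorm(5 * 12), nrow = 5)  # 5 rows to correlate
B <- matrix(rnorm(3 * 12), nrow = 3)  # against 3 rows

# one call: correlations between columns of t(A) and columns of t(B)
fast <- cor(t(A), t(B))

# reference: the explicit row-by-row double loop
slow <- matrix(NA_real_, nrow(A), nrow(B))
for (i in seq_len(nrow(A))) {
  for (j in seq_len(nrow(B))) {
    slow[i, j] <- cor(A[i, ], B[j, ])
  }
}

all.equal(fast, slow)
dim(fast)
```

The result has one row per row of A and one column per row of B, so fast[i, j] is the correlation between A[i, ] and B[j, ].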
UPDATED BELOW TO SHOW PARALLEL PROCESSING - ABOUT A 60% SAVING
Using apply() might not be quick enough for you. Here's how to do it, though. I will have a think about performance, since this example (1M output correlations in a 1000x1000 grid) takes over a minute on my laptop.
miRdf=matrix(rnorm(13000,10,1),ncol=13)
mRNAdf=matrix(rnorm(13000,10,1),ncol=13)
miRdf[,1]<-1:nrow(miRdf) # using column 1 as indices since they're not in the calc.
mRNAdf[,1]<-1:nrow(mRNAdf)
corRow<-function(y){
apply(miRdf,1,function(x)cor(as.numeric(x[2:13]), as.numeric(mRNAdf[y,2:13]), use ="complete.obs"))
}
system.time(apply(mRNAdf,1,function(x)corRow(x[1])))
# user system elapsed
# 72.94 0.00 73.39
And with parallel::parApply on a 4 core Win64 laptop
require(parallel) ## Library to allow parallel processing
miRdf=matrix(rnorm(13000,10,1),ncol=13)
mRNAdf=matrix(rnorm(13000,10,1),ncol=13)
miRdf[,1]<-1:nrow(miRdf) # using column 1 as indices since they're not in the calc.
mRNAdf[,1]<-1:nrow(mRNAdf)
corRow<-function(y){
apply(miRdf,1,function(x)cor(as.numeric(x[2:13]), as.numeric(mRNAdf[y,2:13]), use ="complete.obs"))
}
# Make a cluster from all available cores
cl=makeCluster(detectCores())
# Use clusterExport() to distribute the function and data.frames needed in the apply() call
clusterExport(cl,c("corRow","miRdf","mRNAdf"))
# time the call
system.time(parApply(cl,mRNAdf,1,function(x)corRow(x[[1]])))
# Stop the cluster
stopCluster(cl)
# time the call without clustering
system.time(apply(mRNAdf,1,function(x)corRow(x[[1]])))
## WITH CLUSTER (4)
user system elapsed
0.04 0.03 29.94
## WITHOUT CLUSTER
user system elapsed
73.96 0.00 74.46
A recurring analysis paradigm I encounter in my research is the need to subset based on all different group id values, performing statistical analysis on each group in turn, and putting the results in an output matrix for further processing/summarizing.
How I typically do this in R is something like the following:
data.mat <- read.csv("...")
groupids <- unique(data.mat$ID) #Assume there are then 100 unique groups
results <- matrix(rep("NA",300),ncol=3,nrow=100)
for(i in 1:100) {
tempmat <- subset(data.mat,ID==groupids[i])
# Run various stats on tempmat (correlations, regressions, etc), checking to
# make sure this specific group doesn't have NAs in the variables I'm using
# and assign results to x, y, and z, for example.
results[i,1] <- x
results[i,2] <- y
results[i,3] <- z
}
This ends up working for me, but depending on the size of the data and the number of groups I'm working with, this can take up to three days.
Besides branching out into parallel processing, is there any "trick" for making something like this run faster? For instance, converting the loops into something else (something like an apply with a function containing the stats I want to run inside the loop), or eliminating the need to actually assign the subset of data to a variable?
Edit:
Maybe this is just common knowledge (or sampling error), but I tried subsetting with brackets in some of my code rather than using the subset command, and it seemed to provide a slight performance gain which surprised me. I have some code I used and output below using the same object names as above:
system.time(for(i in 1:1000){data.mat[data.mat$ID==groupids[i],]})
user system elapsed
361.41 92.62 458.32
system.time(for(i in 1:1000){subset(data.mat,ID==groupids[i])})
user system elapsed
378.44 102.03 485.94
Update:
In one of the answers, jorgusch suggested that I use the data.table package to speed up my subsetting. So, I applied it to a problem I ran earlier this week. In a dataset with a little over 1,500,000 rows, and 4 columns (ID,Var1,Var2,Var3), I wanted to calculate two correlations in each group (indexed by the "ID" variable). There are slightly more than 50,000 groups. Below is my initial code (which is very similar to the above):
data.mat <- read.csv("//home....")
groupids <- unique(data.mat$ID)
results <- matrix(rep("NA",(length(groupids) * 3)),ncol=3,nrow=length(groupids))
for(i in 1:length(groupids)) {
tempmat <- data.mat[data.mat$ID==groupids[i],]
results[i,1] <- groupids[i]
results[i,2] <- cor(tempmat$Var1,tempmat$Var2,use="pairwise.complete.obs")
results[i,3] <- cor(tempmat$Var1,tempmat$Var3,use="pairwise.complete.obs")
}
I'm re-running that right now for an exact measure of how long that took, but from what I remember, I started it running when I got into the office in the morning and it finished sometime in the mid-afternoon. Figure 5-7 hours.
Restructuring my code to use data.table....
data.mat <- read.csv("//home....")
data.mat <- data.table(data.mat)
testfunc <- function(x,y,z) {
temp1 <- cor(x,y,use="pairwise.complete.obs")
temp2 <- cor(x,z,use="pairwise.complete.obs")
res <- list(temp1,temp2)
res
}
system.time(test <- data.mat[,testfunc(Var1,Var2,Var3),by="ID"])
user system elapsed
16.41 0.05 17.44
Comparing the results using data.table to the ones I got from using a for loop to subset all IDs and record results manually, they seem to have given me the same answers (though I'll have to check that a bit more thoroughly). That looks to be a pretty big speed increase.
Update 2:
Running the code using subsets finally finished up again:
user system elapsed
17575.79 4247.41 23477.00
Update 3:
I wanted to see if anything worked out differently using the plyr package that was also recommended. This is my first time using it, so I may have done things somewhat inefficiently, but it still helped substantially compared to the for loop with subsetting.
Using the same variables and setup as before...
data.mat <- read.csv("//home....")
system.time(hmm <- ddply(data.mat,"ID",function(df)c(cor(df$Var1,df$Var2, use="pairwise.complete.obs"),cor(df$Var1,df$Var3,use="pairwise.complete.obs"))))
user system elapsed
250.25 7.35 272.09
This is pretty much exactly what the plyr package is designed to make easier. However it's unlikely that it will make things much faster - most of the time is probably spent doing the statistics.
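For comparison, the same split-apply-combine pattern can be sketched in base R. This is not plyr's implementation, just an illustration on simulated data with the column names from the question; the key point is that split() partitions the data once instead of rescanning the ID column for every group.

```r
set.seed(42)
data.mat <- data.frame(
  ID   = rep(1:5, each = 20),
  Var1 = rnorm(100),
  Var2 = rnorm(100),
  Var3 = rnorm(100)
)

# split once into a list of per-group data frames
groups <- split(data.mat, data.mat$ID)

# apply the statistics to each group and combine into a matrix
group_cors <- t(vapply(groups, function(g) {
  c(cor12 = cor(g$Var1, g$Var2, use = "pairwise.complete.obs"),
    cor13 = cor(g$Var1, g$Var3, use = "pairwise.complete.obs"))
}, numeric(2)))

group_cors
```

The result has one row per ID and one column per correlation, with the group ids as row names.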
Besides plyr, you can try the foreach package to get rid of the explicit loop counter, but I don't know if it will give you any performance benefit.
Nevertheless, foreach gives you a quite simple interface to parallel chunk processing if you have a multicore workstation (with the doMC/multicore packages; check Getting Started with doMC and foreach for details), in case you are ruling out parallel processing only because it is not very easy for students to understand. If that is not the only reason, plyr is a very good solution IMHO.
Personally, I find plyr not very easy to understand. I prefer data.table, which is also faster. For instance, say you want the standard deviation of column my_column for each ID.
dt <- data.table(df) # one time operation...changing format of df to table
result.sd <- dt[,sd(my_column),by="ID"] # result with each ID and SD in second column
Three statements of this kind and a cbind at the end - that is all you need.
You can also use dt to do some action for only one ID, without a subset command, in a new syntax:
result.sd.oneiD<- dt[ID=="oneID",sd(my_column)]
The first statement refers to rows (i), the second to columns (j).
I find it easier to read than plyr, and it is more flexible, as you can also do sub domains within a "subset"...
The documentation describes that it uses SQL-like methods. For instance, the by is pretty much "group by" in SQL. Well, if you know SQL, you can probably do much more, but it is not necessary to make use of the package.
Finally, it is extremely fast, as each operation is not only parallel, but data.table also grabs only the data needed for the calculation. Subset, however, maintains the levels of the whole matrix and drags them through memory.
You have already suggested vectorizing and avoiding making unnecessary copies of intermediate results, so you are certainly on the right track. Let me caution you not to do what i did and just assume that vectorizing will always give you a performance boost (like it does in other languages, e.g., Python + NumPy, MATLAB).
An example:
# small function to time the results:
time_this = function(...) {
start.time = Sys.time(); eval(..., sys.frame(sys.parent(sys.parent())));
end.time = Sys.time(); print(end.time - start.time)
}
# data for testing: a 10000 x 1000 matrix of random doubles
a = matrix(rnorm(1e7, mean=5, sd=2), nrow=10000)
# two versions doing the same thing: calculating the mean for each row
# in the matrix
x = time_this( for (i in 1:nrow(a)){ mean( a[i,] ) } )
y = time_this( apply(X=a, MARGIN=1, FUN=mean) )
print(x) # returns => 0.5312099
print(y) # returns => 0.661242
The 'apply' version is actually slower than the 'for' version. (According to the Inferno author, if you are doing this you are not vectorizing, you are 'loop hiding'.)
But where you can get a performance boost is by using built-ins. Below, i've timed the same operation as the two above, just using the built-in function, 'rowMeans':
z = time_this(rowMeans(a))
print(z) # returns => 0.03679609
An order of magnitude improvement versus the 'for' loop (and the 'apply' version).
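A compact way to rerun this comparison on your own machine is sketched below. It uses system.time() instead of the time_this helper and a smaller matrix than the one above, both my own choices; it also checks that the three approaches agree. Timings vary by machine, so none are quoted here.

```r
set.seed(1)
a <- matrix(rnorm(1e6, mean = 5, sd = 2), nrow = 1000)

# explicit loop, with the result vector allocated up front
loop_means <- numeric(nrow(a))
t_loop <- system.time(
  for (i in seq_len(nrow(a))) loop_means[i] <- mean(a[i, ])
)

# apply(): same work, with the loop hidden inside apply
t_apply <- system.time(apply_means <- apply(a, 1, mean))

# built-in: the loop runs in compiled code
t_builtin <- system.time(builtin_means <- rowMeans(a))

all.equal(loop_means, apply_means)
all.equal(loop_means, builtin_means)

c(loop     = t_loop[["elapsed"]],
  apply    = t_apply[["elapsed"]],
  rowMeans = t_builtin[["elapsed"]])
```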
The other members of the apply family are not just wrappers over a native 'for' loop.
a = abs(floor(10*rnorm(1e6)))
time_this(sapply(a, sqrt))
# returns => 6.64 secs
time_this(for (i in 1:length(a)){ sqrt(a[i])})
# returns => 1.33 secs
'sapply' is about 5x slower compared with a 'for' loop.
Finally, w/r/t vectorized versus 'for' loops, i don't think i ever use a loop if i can use a vectorized function--the latter is usually fewer keystrokes and it's a more natural way (for me) to code, which is a different kind of performance boost, i suppose.
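For the sqrt example in particular, the vectorized call is simply sqrt(a) on the whole vector. A short sketch (with a smaller vector than above, my own choice) confirming that all three forms give the same answer:

```r
set.seed(1)
a <- abs(floor(10 * rnorm(1e5)))

# vectorized: one call on the whole vector
r_vec <- sqrt(a)

# sapply: one R-level function call per element
r_sapply <- sapply(a, sqrt)

# explicit loop with a preallocated result
r_loop <- numeric(length(a))
for (i in seq_along(a)) r_loop[i] <- sqrt(a[i])

identical(r_vec, r_sapply)
identical(r_vec, r_loop)
```

Wrapping each of the three in system.time() shows the relative cost on your own machine.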