I have following code.
for(i in 1:100)
{
for(j in 1:100)
R[i,j]=gcm(i,j)
}
gcm() is some function which returns a number based on the values of i and j and so, R has all values. But this calculation takes a lot of time. My machine's power was interrupted several times due to which I had to start over. Can somebody please help, how can I save R somewhere after every iteration, so as to be safe? Any help is highly appreciated.
You can use the saveRDS() function to save the result of each calculation in a file.
To understand the difference between save and saveRDS, here is a link I found useful. http://www.fromthebottomoftheheap.net/2012/04/01/saving-and-loading-r-objects/
If you want to save the R-workspace have a look at ?save or ?save.image (use the first to save a subset of your objects, the second one to save your workspace in toto).
Your edited code should look like
for(i in 1:100)
{
for(j in 1:100)
R[i,j]=gcm(i,j)
save.image(file="path/to/your/file.RData")
}
About your code taking a lot of time I would advise trying the ?apply function, which
Returns a vector or array or list of values obtained by applying a function to margins of an array or matrix
You want gmc to be run for-each cell, which means you want to apply it for each combination of row and column coordinates
R = 100; # number of rows
C = 100; # number of columns
M = expand.grid(1:R, 1:C); # Cartesian product of the coordinates
# each row of M contains the indexes of one of R's cells
# head(M); # just to see it
# To use apply we need gmc to take into account one variable only (that' not entirely true, if you want to know how it really works have a look how at ?apply)
# thus I create a function which takes into account one row of M and tells gmc the first cell is the row index, the second cell is the column index
gmcWrapper = function(x) { return(gmc(x[1], x[2])); }
# run apply which will return a vector containing *all* the evaluated expressions
R = apply(M, 1, gmcWrapper);
# re-shape R into a matrix
R = matrix(R, nrow=R, ncol=C);
If the apply-approach is again slow try considering the snowfall package which will allow you to follow the apply-approach using parallel computing. An introduction to snowfall usage can be found in this pdf, look at page 5 and 6 in particular
Related
I have loaded two source files, performed some iterative calculations, and then i need to display/export the results. There are hundreds of iterative calculations, hence hundreds of results. However, only results of the final calculation is displayed.
In this example, i have shortened the list of calculations to only 3. Please refer to line 7 (k in 1:3). How do i get R to display result of all calculations?
Many thanks in advance to those who can offer help. If this question has already been asked before, a link would be great. I could not find this probably because i do not know the right terms to search for.
# Load files
d1<-read.csv('testhourly.csv',sep=",",header=F)
names(d1)<-c("elapsedtime","units")
d2<-read.csv('testevent.csv',sep=",",header=F)
names(d2)<-c("eventno","starttime","endtime","starttemp","endtemp")
# Perform for calculations 1 to 3
for(k in 1:3){
a<-d2[k,2]
b<-d2[k,3]
x<-d1[a:b,]$q
a2<-d2[k,2]-1
b2<-d2[k,3]-1
y<-d1[a2:b2,]$q
z <- (x-y)}
results <- sum(z)
# Export results
write.csv(results, file = "results.csv")
You are not saving your output inside the loop for every iteration, so your loop only returns the final value of the last iteration.
temp=vector("list",3)
for(k in 1:3) {
a<-d2[k,2]
b<-d2[k,3]
x<-d1[a:b,]$q
a2<-d2[k,2]-1
b2<-d2[k,3]-1
y<-d1[a2:b2,]$q
temp[[k]] <- (x-y)
}
results <- sum(unlist(temp))
first time asker here. I have recently strated working with R and I hope I could get some help with an issue. The problem is probably easy to solve but I haven't been able to find an answer by myself and my research hasn't been succesful either.
Basically I need to create a single object based on the input of a loop. I have 7 simulated asset returns, these objects contain the results from a simulation I ran. I want to match the columns from every object and form a combined one (i.e. every column 1 forms an object), which will be used for some calculations.
Finally, the result from each iteration should be stored on a single object that has to be available outside the loop for further analysis.
I have created the following loop, the problem is that only the result from the last iteration is being written in the final object.
# Initial xts object definition
iteration_returns_combined <- iteration_returns_draft_1
for (i in 2:10){
# Compose object by extracting the i element of every simulation serie
matrix_daily_return_iteration <- cbind(xts_simulated_return_asset_1[,i],
xts_simulated_return_asset_2[,i],
xts_simulated_return_asset_3[,i],
xts_simulated_return_asset_4[,i],
xts_simulated_return_asset_5[,i],
xts_simulated_return_asset_6[,i],
xts_simulated_return_asset_7[,i])
# Transform the matrix to an xts object
daily_return_iteration_xts <- as.xts(matrix_daily_return_iteration,
order.by = index(optimization_returns))
# Calculate the daily portfolio returns using the iteration return object
iteration_returns <- Return.portfolio(daily_return_iteration_xts,
extractWeights(portfolio_optimization))
# Create a combined object for each iteration of portfolio return
# This is the object that is needed in the end
iteration_returns_combined <<- cbind(iteration_returns_draft_combined,
iteration_returns_draft)
}
iteration_returns_combined_after_loop_view
Could somebody please help me to fix this issue, I would be extremely grateful for any information anyone can provide.
Thanks,
R-Rookie
By looking at the code, I surmise that the error is in the last line of your for loop.
iteration_returns_draft_combined
was never defined, so it is assumed to be NULL. Essentially, you only bind columns of the results from each iteration to a NULL object. Hence the output of your last loop is also bound by column to a NULL object, which is what you observe. Try the following:
iteration_returns_combined <- cbind(iteration_returns_combined,
iteration_returns)
This should work, hopefully!
Consider sapply and avoid expanding an object within a loop:
iteration_returns_combined <- sapply(2:10, function(i) {
# Compose object by extracting the i element of every simulation serie
matrix_daily_return_iteration <- cbind(xts_simulated_return_asset_1[,i],
xts_simulated_return_asset_2[,i],
xts_simulated_return_asset_3[,i],
xts_simulated_return_asset_4[,i],
xts_simulated_return_asset_5[,i],
xts_simulated_return_asset_6[,i],
xts_simulated_return_asset_7[,i])
# Transform the matrix to an xts object
daily_return_iteration_xts <- as.xts(matrix_daily_return_iteration,
order.by = index(optimization_returns))
# Calculate the daily portfolio returns using the iteration return object
iteration_returns <- Return.portfolio(daily_return_iteration_xts,
extractWeights(portfolio_optimization))
})
And if needed to column bind first vector/matrix, do so afterwards:
# CBIND INITIAL RUN
iteration_returns_combined <- cbind(iteration_returns_draft_1, iteration_returns_combined)
I come from a Java/Python comp sci theory background so I am still getting used to the various R packages and how they can save run time in functions.
Basically, I am working on a few projects and all of them involve taking individual factors in a long-list data set (15,000 to 200,000 factors) and performing calculations on individual factors in an equally-large data set, and concurrently storing the results of those calculations in an exponentially-longer data frame.
So far I have been using nested while loops and concatenating into a growing list, but that is taking days. Ive recently learned about 'lapply' and the 'data.frame' options in R, and I would love to see an example of how to apply (no pun intended) them to the following basic correlation function:
Corr<-function(miRdf, mRNAdf)
{
j=1
k=1
m=1
n=1
c=0
corrList=NULL
while(n<=71521)
{
while(m<=1477)
{
corr=cor(as.numeric(miRdf[k,2:13]), as.numeric(mRNAdf[j,2:13]), use ="complete.obs")
corrList<-c(corrList, corr)
j=j+1
c=c+1
print(c) #just a counter to see how far the function has run
m=m+1
}
k=k+1
n=n+1
j=1
m=1 #to reset the inner while loop
}
corrList<-matrix(unlist(corrList), ncol=1477, byrow=FALSE)
colnames(corrList)<-miRdf[,1]
rownames(corrList)<-mRNAdf[,1]
write.csv(corrList, "testCorrWhole.csv")
}
As you can see, the nested while loop results in 105,636,517 (71521x1477) miRNA vs mRNA expression-value correlation scores that need to be performed and stored in a data frame that is 1477 cols x 71521 rows in order to generate a scoring matrix.
My question is, can anyone shed light on how to turn the above monstrosity into an efficient function that utilizes 'lapply' instead of the while loops, and uses the 'data.table' set() function to do away with the inefficiency of concatenating a list during every pass through the loop?
Thank you in advance!
Your names end with 'df', which makes it seem like your data are a data.frame. But #Troy's answer uses a matrix. A matrix is appropriate when the data are homogeneous, and generally matrix operations are much faster than data.frame operations. So you can see already that if you'd provided a small example of your data set (e.g., dput(mRNAdf[1:10,]) that people might be in a better position to help you; this is what they're asking for.
In large numerical calculations it makes sense to 'hoist' any repeated calculations outside the loop, so they are performed only once. Repeated calculations in your case include sub-setting to columns 2:13, and coercion to numeric. With this idea, and guessing that you actually have a data.frame where each column is already a numeric vector, I'd start with
mRNAmatrix <- as.matrix(mRNAdf[,2:13])
miRmatrix <- as.matrix(miRdf[,2:13])
From the help page ?cor we see that the arguments can be a matrix, and if so the correlation is calculated between columns. You're interested in the result when the arguments are transposed relative to your current representation. So
result <- cor(t(mRNAmatrix), t(miRmatrix), use="complete.obs")
This is fast enough for your purposes
> m1 = matrix(rnorm(71521 * 12), 71521)
> m2 = matrix(rnorm(1477 * 12), 1477)
> system.time(ans <- cor(t(m1), t(m2)))
user system elapsed
9.124 0.200 9.340
> dim(ans)
[1] 71521 1477
result is the same as your corrList -- it's not a list, but a matrix; probably the row and column names have been carried forward. You'd write this to a file as you do above, write.csv(result, "testCorrWhole.csv")
UPDATED BELOW TO SHOW PARALLEL PROCESSING - ABOUT A 60% SAVING
Using apply() might not be quick enough for you. Here's how to do it, though. Will have a think about performance since this example (1M output correlations in 1000x1000 grid) takes over a minute on laptop.
miRdf=matrix(rnorm(13000,10,1),ncol=13)
mRNAdf=matrix(rnorm(13000,10,1),ncol=13)
miRdf[,1]<-1:nrow(miRdf) # using column 1 as indices since they're not in the calc.
mRNAdf[,1]<-1:nrow(mRNAdf)
corRow<-function(y){
apply(miRdf,1,function(x)cor(as.numeric(x[2:13]), as.numeric(mRNAdf[y,2:13]), use ="complete.obs"))
}
system.time(apply(mRNAdf,1,function(x)corRow(x[1])))
# user system elapsed
# 72.94 0.00 73.39
And with parallel::parApply on a 4 core Win64 laptop
require(parallel) ## Library to allow parallel processing
miRdf=matrix(rnorm(13000,10,1),ncol=13)
mRNAdf=matrix(rnorm(13000,10,1),ncol=13)
miRdf[,1]<-1:nrow(miRdf) # using column 1 as indices since they're not in the calc.
mRNAdf[,1]<-1:nrow(mRNAdf)
corRow<-function(y){
apply(miRdf,1,function(x)cor(as.numeric(x[2:13]), as.numeric(mRNAdf[y,2:13]), use ="complete.obs"))
}
# Make a cluster from all available cores
cl=makeCluster(detectCores())
# Use clusterExport() to distribute the function and data.frames needed in the apply() call
clusterExport(cl,c("corRow","miRdf","mRNAdf"))
# time the call
system.time(parApply(cl,mRNAdf,1,function(x)corRow(x[[1]])))
# Stop the cluster
stopCluster(cl)
# time the call without clustering
system.time(apply(mRNAdf,1,function(x)corRow(x[[1]])))
## WITH CLUSTER (4)
user system elapsed
0.04 0.03 29.94
## WITHOUT CLUSTER
user system elapsed
73.96 0.00 74.46
I am confused by the behavior of is.na() in a for loop in R.
I am trying to make a function that will create a sequence of numbers, do something to a matrix, summarize the resulting matrix based on the sequence of numbers, then modify the sequence of numbers based on the summary and repeat. I made a simple version of my function because I think it still gets at my problem.
library(plyr)
test <- function(desired.iterations, max.iterations)
{
rich.seq <- 4:34 ##make a sequence of numbers
details.table <- matrix(nrow=length(rich.seq), ncol=1, dimnames=list(rich.seq))
##generate a table where the row names are those numbers
print(details.table) ##that's what it looks like
temp.results <- matrix(nrow=10, ncol=2, dimnames=list(1:10))
##generate some sample data to summarize and fill into details.table
temp.results[,1] <- rep(5:6, 5)
temp.results[,2] <- rnorm(10)
print(temp.results) ##that's what it looks like
details.table[,1][row.names(details.table) %in% count(temp.results[,1])$x] <-
count(temp.results[,1])$freq
##summarize, subset to the appropriate rows in details.table, and fill in the summary
print(details.table)
for (i in 1:max.iterations)
{
rich.seq <- rich.seq[details.table < desired.iterations | is.na(details.table)]
## the idea would be to keep cutting this sequence of numbers down with
## successive iterations until the desired number of iterations per row in
## details.table was reached. in other words, in the real code i'd do
## something to details.table in the next line
print(rich.seq)
}
}
##call the function
test(desired.iterations=4, max.iterations=2)
On the first run through the for loop the rich.seq looks like I'd expect it to, where 5 & 6 are no longer in the sequence because both ended up with more than 4 iterations. However, on the second run, it spits out something unexpected.
UPDATE
Thanks for your help and also my apologies. After re-reading my original post it is not only less than clear, but I hadn't realized count was part of the plyr package, which I call in my full function but wasn't calling here. I'll try and explain better.
What I have working at the moment is a function that takes a matrix, randomizes it (in any of a number of different ways), then calculates some statistics on it. These stats are temporarily stored in a table--temp.results--where temp.results[,1] is the sum of the non zero elements in each column, and temp.results[,2] is a different summary statistic for that column. I save these results to a csv file (and append them to the same file at subsequent iterations), because looping through it and rbinding hogs a lot of memory.
The problem is that certain column sums (temp.results[,1]) are sampled very infrequently. In order to sample those sufficiently requires many many iterations, and the resulting .csv files would stretch into the hundreds of gigabytes.
What I want to do is create and then update a table (details.table) at each iteration that keeps track of how many times each column sum actually got sampled. When a given element in the table reaches the desired.iterations, I want it to be excluded from the vector rich.seq, so that only columns that haven't received the desired.iterations are actually saved to the csv file. The max.iterations argument will be used in a break() statement in case things are taking too long.
So, what I was expecting in the example case is the exact same line for rich.seq for both iterations, since I didn't actually do anything to change it. I believe that flodel is definitely right that my problem lies in comparing a matrix (details.table) of length longer than rich.seq, leading to unexpected results. However, I don't want the dimensions of details.table to change. Perhaps I can solve the problem implementing %in% somehow when I redefine rich.seq in the for loop?
I agree you should improve your question. However, I think I can spot what is going wrong.
You compute details.table before the for loop. It is a matrix with same length as rich.seq when it was first initialized (length(4:34), i.e. 31).
Inside the for loop, details.table < desired.iterations | is.na(details.table) is then a logical vector of length 31. On the first loop iteration,
rich.seq <- rich.seq[details.table < desired.iterations | is.na(details.table)]
will result in reducing the length of rich.seq. But on the second loop iteration, unless details.table is redefined (not the case), you are trying to subset rich.seq by a logical vector of longer length than rich.seq. This will certainly lead to unexpected results.
You probably meant to redefine details.table as part of your for loop.
(Also I am surprised to see you never used temp.results[,2].)
Thanks to flodel for setting me off on the right track. It had nothing to do with is.na but rather the lengths of vectors I was comparing.
That said, I set the initial values of the details.table to zero to avoid the added complexity of the is.na statement.
This code works, and can be modified to do what I described above.
library(plyr)
test <- function(desired.iterations, max.iterations)
{
rich.seq <- 4:34 ##make a sequence of numbers
details.table <- matrix(nrow=length(rich.seq), ncol=1, dimnames=list(rich.seq)) ##generate a table where the row names are those numbers
details.table[,1] <- 0
print(details.table) ##that's what it looks like
temp.results <- matrix(nrow=10, ncol=2, dimnames=list(1:10)) ##generate some sample data to summarize and fill into details.table
temp.results[,1] <- rep(5:6, 5)
temp.results[,2] <- rnorm(10)
print(temp.results) ##that's what it looks like
details.table[,1][row.names(details.table) %in% count(temp.results[,1])$x] <- count(temp.results[,1])$freq ##summarize, subset to the appropriate rows in details.table, and fill in the summary
print(details.table)
for (i in 1:max.iterations)
{
rich.seq <- row.names(details.table)[details.table[,1] < desired.iterations]
print(rich.seq)
}
}
Rather than trying to cut down the rich.seq I just redefine it every iteration based on whatever happens with details.table during the previous iteration.
I have code for nested for loops here. The output I would like to receive is a matrix of the means of the columns of the matrix produced by the nested loop. So, the interior loop should run 1000 simulations of a randomized vector, and run a function each time. This works fine on its own, and spits the output into R. But I want to save the output from the nested loop to an object (a matrix of 1000 rows and 11 columns), and then print only the colMeans of that matrix, to be performed by the outer loop.
I think the problem lies in the step where I assign the results of the inner loop to the obj matrix. I have tried every variation on obj[i,],obj[i],obj[[i]], etc. with no success. R tells me that it is an object of only one dimension.
x=ACexp
obj=matrix(nrow=1000,ncol=11,byrow=T) #create an empty matrix to dump results into
for(i in 1:ncol(x)){ #nested for loops
a=rep(1,times=i) #repeat 1 for 1:# columns in x
b=rep(0,times=(ncol(x)-length(a))) #have the rest of the vector be 0
Inv=append(a,b) #append these two for the Inv vector
for (i in 1:1000){ #run this vector through the simulations
Inv2=sample(Inv,replace=FALSE) #randomize interactions
temp2=rbind(x,Inv2)
obj[i]<-property(temp2) #print results to obj matrix
}
print.table(colMeans(obj)) #get colMeans and print to excel file
}
Any ideas how this can be fixed?
You're repeatedly printing the whole matrix to the screen as it gets modified but your comment says "print to excel file". I'm guessing you actually want to save your data out to a file. Remove print.table command all together and after your loops are completed use write.table()
write.table(colMeans(obj), 'myNewMatrixFile.csv', quote = FALSE, sep = ',', row.names = FALSE)
(my preferred options... see ?write.table to select the ones you like)
Since your code isn't reproducible, we can't quite tell what you want. However, I guess that property is returning a single number that you want to place in the right row/column place of the obj matrix, which you would refer to as obj[row,col]. But you'll have trouble with that as is, because both your loops are using the same index i. Maybe something like this will work for you.
obj <- matrix(nrow=1000,ncol=11,byrow=T) #create an empty matrix to dump results into
for(i in 1:ncol(x)){ #nested for loops
Inv <- rep(c(1,0), times=c(i, ncol(x)-i)) #repeat 1 for 1:# columns in x, then 0's
for (j in 1:nrow(obj)){ #run this vector through the simulations
Inv2 <- sample(Inv,replace=FALSE) #randomize interactions
temp2 <- rbind(x,Inv2)
obj[j,i] <- property(temp2) #save results in obj matrix
}
}
write.csv(colMeans(obj), 'myFile.csv') #get colMeans and print to csv file