How do I calculate cor() between all members of a large dataset using apply instead of for loops?

I have a large set of 10,000 vectors, each of length about 100, stored in a matrix, and I want to calculate the correlations between all of them. Unfortunately, on my current computer a simple double for loop to produce the correlations is taking forever! Is there a more efficient way to go about this?
I think I have something like an apply function in mind, but I'm not sure how to implement it with cor().

Put your data into a data frame or matrix and use the built-in cor() function. Generally, you want to avoid using loops in R.
cor(yourData)
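For illustration, a minimal sketch, assuming the 10,000 vectors are the columns of a 100 x 10000 matrix (cor() correlates columns). The data here are hypothetical and scaled down to 1,000 columns, since the full 10000 x 10000 result is a dense matrix of roughly 800 MB:
set.seed(1)
yourData <- matrix(rnorm(100 * 1000), nrow = 100)  # hypothetical: 1,000 of the 10,000 vectors
corMat <- cor(yourData)  # one vectorized call, no loops
dim(corMat)  # 1000 x 1000 correlation matrix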

Related

R: how to create a large matrix by combining small matrix blocks

I'm working on a constrained optimization problem using the Lagrange multiplier method, and I'm trying to build a huge sparse matrix in R in order to calculate the values.
Here's what the matrices would look like; see the link below for the details of the problem if needed.
Implementation of Lagrange Multiplier to solve constrained optimization problem.
Here's the code I've come up with; sorry if my approach seems clumsy to you, since I'm new to matrix manipulation and programming.
First, I imported the 3154 by 30 matrix from a csv file, and then combined all its columns into one. Then I created a diagonal matrix to imitate the upper left corner of the big matrix.
Then, to imitate the lower left corner, I created a 3154x3154 identity matrix and tried to replicate it 30 times.
I have two questions here:
When I tried to cbind the diagonal sparse matrix, it returned a combination of two lists instead of a matrix, so I had to convert it to a dense matrix, but this takes too much memory. I'd like to know if there's a better way to accomplish this.
I want to know if there's a formula for cbinding a matrix multiple times, since I need to replicate the matrix 30 times. I'm curious if there's a cleaner way to get around all the typing. (This was solved thanks to #Jthorpe.)
I was going to do the same thing for the rest of the matrices. I know this is not the best approach to tackle this problem, so please feel free to suggest any smarter way of doing this. Thanks!
library(Matrix)
dist_data <- read.csv("/Users/xxxxx/dist_mat.csv", header = TRUE)
c <- ncol(dist_data)  # number of clusters - 30
n <- nrow(dist_data)  # number of observations - 3153
# Create a c*n + c + n = 3153*30 + 3153 + 30 = 97,773 coefficient matrix
dist_list <- cbind(unlist(dist_data))  # stack all columns into a single column
Coeff_mat <- 2 * .sparseDiagonal(c * n, x = c(dist_list))
diag <- .sparseDiagonal(n)  # sparse n x n identity (note: shadows base::diag)
Uin <- do.call(cbind, rep(list(as.matrix(diag)), 30))  # as.matrix() densifies - memory heavy
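For the cbind question, a hedged sketch (not the original poster's code): staying inside the Matrix package's sparse classes avoids the as.matrix() conversion entirely, since cbind() of sparse matrices stays sparse, and the 30-fold replication can also be written as a single Kronecker product:
# Option 1: cbind the sparse identity blocks directly; the result stays sparse
Uin <- do.call(cbind, rep(list(.sparseDiagonal(n)), 30))
# Option 2: the same replication as one Kronecker product
Uin2 <- kronecker(Matrix(1, nrow = 1, ncol = 30, sparse = TRUE), .sparseDiagonal(n))
dim(Uin)  # n x (30 * n), stored sparsely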

dist function with large number of points

I am using the dist {stats} function to calculate the distances between points. My problem is that I have 24469 points, and dist gives me a vector of length 18705786 instead of a matrix. I already tried converting it with as.matrix, but the result is too large.
How can I find out which points each distance corresponds to?
For example, which(distance <= 700) gives me positions in the vector, but how can I tell which pair of points each of those distances corresponds to?
There are some things you could try, depending on what exactly you need:
Calculate the distances in a loop, and only keep those that match the criterion. Especially when the number of matches is much smaller than the total size of the distance matrix, this saves a lot of RAM. Such a loop is probably very slow if implemented in pure R; that is also why dist performs its calculations in C rather than in R, I believe. This could mean that you get your results but have to wait a while. Alternatively, the excellent Rcpp package would allow you to write the loop in C/C++, probably making it much, much faster.
Start using packages like bigmemory to store the distance matrix. You then build it in a loop and store it iteratively in the bigmemory object (I have not worked with bigmemory before, so I don't know the exact details). After building the matrix, you can access it to extract your desired results. Effectively, all tricks for handling large data in R apply to this bullet. See e.g. R SO posts on big data.
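As a direct answer to the index question, a minimal sketch, assuming the default storage that dist() uses (the lower triangle, stored column by column), which lets a vector position k be inverted back to its pair of points:
# Recover the point pair (i, j) behind index k of a dist object on n points.
# dist() stores d(i, j) for i > j, column by column, so k can be inverted.
dist_pair <- function(k, n) {
  j <- ceiling(((2 * n - 1) - sqrt((2 * n - 1)^2 - 8 * k)) / 2)  # first point (watch floating point at column boundaries)
  i <- k - (j - 1) * n + j * (j - 1) / 2 + j                     # second point
  cbind(i = i, j = j)
}
# Hypothetical usage with the example from the question:
# d <- dist(coords)
# close_pairs <- dist_pair(which(d <= 700), attr(d, "Size"))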
Some interesting links (found by googling for r distance matrix for large vector):
Efficient (memory-wise) function for repeated distance matrix calculations AND chunking of extra large distance matrices
(lucky you!) http://stevemosher.wordpress.com/2012/04/08/using-bigmemory-for-a-distance-matrix/

how to use the princomp() or prcomp() functions in R with large datasets, without transposing the data?

I have just started learning PCA and I wish to use it for a huge microarray dataset with more than 400,000 rows. My columns are samples and my rows are genes/loci. I went through some tutorials on using PCA and came across princomp(), prcomp(), and a few others.
Now, as I learned here, in order to plot "samples" in the biplot, I would need to have them in the rows and the genes/loci in the columns, and hence I will have to transpose my data before using it for PCA.
However, since there are more than 400,000 rows, I am not really able to turn them into columns, because the number of columns is limited. So my question is: is there any way to perform a PCA on my data without transposing it, using these R functions? If not, can anyone suggest another way or method to do so?
Why do you want to avoid transposing your data? It's easy!
If you read your data into R (for example as the matrix microarray.data), you can transpose it with a single command:
transposed.microarray.data <- t(microarray.data)
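As a hedged follow-up sketch (reusing the names above): once the samples are in the rows, prcomp() runs directly on the transposed matrix. prcomp() is generally preferred over princomp() when variables far outnumber observations, since it works via SVD:
# scale. = TRUE standardizes each gene/locus; drop zero-variance rows first if you use it
pca <- prcomp(transposed.microarray.data, center = TRUE, scale. = TRUE)
head(pca$x)  # sample coordinates on the principal components
biplot(pca)  # samples now plot as observations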

Trying to apply formula to each column in R, how to feed data to formula?

So I'm trying to apply an exponential smoothing model to each column in a data frame called 'cities'. I used apply to specify the data frame, to go by columns, and, I thought, to run the model. However, when I try to do so, it tells me that I need to specify data for the exponential smoothing model... I thought I had already done that by putting it in the apply call.
apply(x=cities,2,FUN=HoltWinters(x=x,gamma=FALSE))
Also, eventually I'd like to predict the next 4 periods using the fitted HW models with forecast.predict. Do I need to use a different loop, or can I combine it all in this one?
FUN takes a function, but you're trying to give it the output of a function.
Try this:
apply(cities, 2, FUN=function(x) HoltWinters(x=x,gamma=FALSE))
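And for the follow-up about forecasting, a hedged sketch: lapply() keeps the fitted model objects, so predict() can then produce the next four periods for each column (this assumes cities holds plain numeric columns and leaves the time series frequency at its default):
fits <- lapply(cities, function(x) HoltWinters(ts(x), gamma = FALSE))  # one model per column
forecasts <- lapply(fits, predict, n.ahead = 4)  # next 4 periods for each fit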

Generating variable names for dataframes based on the loop number in a loop in R

I am working on developing and optimizing a linear model using the lm() function and subsequently the step() function for optimization. I have added a variable to my data frame using a random generator of 0s and 1s (50% chance each). I use this variable to subset the data frame into a training set and a validation set; if a record is not assigned to the training set, it is assigned to the validation set. Using these subsets, I am able to estimate how good the fit of the model is (by using the predict function on the records in the validation set and comparing the predictions to the original values). I am interested in the coefficients of the optimized model and in the results of the KS test between the distributions of the predicted and actual results.
All of my code was working fine, but when I wanted to test whether my model is sensitive to the subset that I chose I ran into some problems. To do this I wanted to create a for (i in 1:10) loop, each time using a different random subset. This turned out to be quite a challenge for me (I have never used a for loop in R before).
Here's the problem (well actually there are many problems, but here is one of them):
I would like to have a separate data frame for each run in the loop, with a unique name (for example: Run1, Run2, Run3). I have been able to create the different strings using paste("Run", 1:10, sep=""), but that just gives me a character vector. How do I use these strings as names for my (subsetted) data frames?
Another problem that I expect to encounter:
Subsequently I want to take the fitted coefficients from each run and export them to Excel. Using coef() I have been able to retrieve the coefficients; however, the number of coefficients included in the model may change per simulation run because of the optimization algorithm. This will almost certainly give me some trouble with pasting them into the same data frame; any thoughts on that?
Thanks for helping me out.
For your first question:
You can create the strings as before, using
df.names <- paste("Run", 1:10, sep = "")
Then, create your for loop and do the following to give the data frames the names you want:
for (i in 1:10){
d.frame <- # create your data frame here
assign(df.names[i], d.frame)
}
Now you will end up with ten data frames with ten different names.
For your second question about the coefficients:
As far as I can tell, these don't naturally fit into your data frame structure. You should consider using lists, as they allow different classes - in other words, for each run, create a list containing a data frame and a numeric vector with your coefficients.
Don't create objects with numbers in their names and then try to access them in a loop later using get, paste, and assign. The right way to do this is to store your elements in an R list object.
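A minimal sketch of that list-based approach; mydata and the formula y ~ . are hypothetical stand-ins for your own data and model:
runs <- vector("list", 10)
for (i in 1:10) {
  in.train <- runif(nrow(mydata)) < 0.5              # fresh 50/50 random split each run
  fit <- step(lm(y ~ ., data = mydata[in.train, ]))  # fit, then optimize
  runs[[i]] <- list(model = fit,
                    coefficients = coef(fit),        # lengths may differ across runs
                    validation = mydata[!in.train, ])
}
runs[[3]]$coefficients  # e.g. the coefficients of the third run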
