How to use the princomp() or prcomp() functions in R with large datasets, without transposing the data?

I have just started learning about PCA and I wish to use it for a huge microarray dataset with more than 400,000 rows. My columns are samples and my rows are genes/loci. I went through some tutorials on PCA and came across princomp() and prcomp(), among a few others.
Now, as I understand it, in order to plot "samples" in the biplot I would need to have them in the rows and the genes/loci in the columns, so I would have to transpose my data before using it for PCA.
However, since there are more than 400,000 rows, I am not really able to transpose them into columns, because the number of columns is limited. So my question is: is there any way to perform a PCA on my data with these R functions without transposing it? If not, can anyone suggest another way or method to do so?

Why are you reluctant to transpose your data? It's easy!
If you read your data into R (for example as the matrix microarray.data), you can transpose it with a single command:
transposed.microarray.data <- t(microarray.data)
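A toy illustration (the matrix and its dimensions here are made up, not from the question): t() simply swaps rows and columns no matter how many rows there are, so a genes-in-rows matrix becomes a samples-in-rows matrix.
# toy matrix: 4 genes (rows) x 3 samples (columns)
microarray.data <- matrix(rnorm(12), nrow = 4,
                          dimnames = list(paste0("gene", 1:4),
                                          paste0("sample", 1:3)))
transposed.microarray.data <- t(microarray.data)
dim(transposed.microarray.data)   # 3 samples x 4 genes, ready for PCA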

Related

Keras predict repeated columns

I have a question about keras model code in R. I have finished training the model and need to predict. Predicting a single row is very fast, but my data has 2,000,000,000 rows and nearly 200 columns, with a structure like the attached image.
[Image: data structure]
I don't know if anyone has suggestions on a method that lets predict run quickly and use less memory. To predict, I build matrices according to the table as shown; each matrix is 200,000 x 200. Then I use sapply to predict over all of the remaining matrices. However, even though predict is fast for each matrix, building the matrices is slow, so the model takes two or three times as long to run, and that is without counting the sapply step. I wonder whether keras has a "smart" way to know that, in each of these matrices, the last N columns are exactly the same? I googled and saw someone mention RepeatVector, but I don't quite understand it, and it seems to be used only for training. I already have the model and just need to predict.
Thank you so much, everyone!
One of the most performant ways to feed keras models locally is by creating a tf.data.Dataset object. Please take a look at the tfdatasets R package for guides and example usage.
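A minimal sketch of that suggestion, assuming a trained keras model called model and a numeric predictor matrix x (both names are placeholders, not from the question); the batch size is only an example value to tune:
library(keras)
library(tfdatasets)

# stream the rows in batches instead of materialising huge matrices in R
ds <- tensor_slices_dataset(x) %>%
  dataset_batch(8192) %>%
  dataset_prefetch(buffer_size = 2)

preds <- predict(model, ds)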

Operating on Spark data frames with SparkR and Sparklyr - unrealistic settings?

I am currently working with the SparkR and sparklyr packages and I think that they are not suitable for high-dimensional sparse data sets.
Both packages follow the paradigm that you select/filter columns and rows of data frames by simple logical conditions on a few columns or rows. But this is often not what you need on such large data sets: there you have to select rows and columns based on the values of hundreds of row or column entries. Often you first have to calculate statistics on each row/column and then use these values for the selection. Or you want to address only certain values in the data frame.
For example,
1. How can I select all rows or columns that have less than 75% missing values?
2. How can I impute missing values with column- or row-specific values derived from each column or row?
To solve (#2) I need to execute functions on each row or column of a data frame separately (a sketch of the kind of column-wise computation I mean follows this question). However, even functions like dapplyCollect in SparkR do not really help, as they are far too slow.
Maybe I am missing something, but I would say that SparkR and sparklyr do not really help in these situations. Am I wrong?
As a side note, I do not understand how libraries like MLlib or H2O could be integrated with sparklyr if there are such severe limitations, e.g. in handling missing values.
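Not part of the original post, but to make example (#1) concrete, here is a rough sketch of the column-wise bookkeeping meant above, written against a toy sparklyr table; whether this stays fast with hundreds of columns on real data is exactly the question being raised.
library(sparklyr)
library(dplyr)

sc  <- spark_connect(master = "local")   # toy local connection for the sketch
sdf <- copy_to(sc, data.frame(a = c(1, NA, NA, NA), b = 1:4), "toy")

# fraction of missing values per column, computed inside Spark
na_fraction <- sdf %>%
  summarise_all(~ sum(as.integer(is.na(.))) / n()) %>%
  collect()

# keep only columns with less than 75% missing values
keep     <- names(na_fraction)[unlist(na_fraction) < 0.75]
sdf_kept <- sdf %>% select(all_of(keep))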

Princomp error in R : covariance matrix is not non-negative definite

I have this script which does a simple PCA on a number of variables and at the end attaches two coordinate columns and two other columns (presence, NZ_Field) to the output file. I have done this many times before, but now it is giving me this error:
covariance matrix is not non-negative definite
I understand that this means there are negative eigenvalues. I looked at similar posts which suggest using na.omit, but that didn't work.
I have uploaded the "biodata.rdata" file here:
https://www.dropbox.com/s/1ex2z72lilxe16l/biodata.rdata?dl=0
I am pretty sure it is not because of missing values in the data, because I have used the same data with different "presence" and "NZ_Field" columns.
Any help is highly appreciated.
load("biodata.rdata")
#save data separately
coords=biodata[,1:2]
biovars=biodata[,3:21]
presence=biodata[,22]
NZ_Field=biodata[,23]
#Do PCA
bpc=princomp(biovars ,cor=TRUE)
#re-attach data with auxiliary data..coordinates, presence and NZ location data
PCresults=cbind(coords, bpc$scores[,1:3], presence, NZ_Field)
write.table(PCresults,file= "hlb_pca_all.txt", sep= ",",row.names=FALSE)
This does appear to be an issue with missing data, so there are a few ways to deal with it. One way is to manually do listwise deletion on the data before running the PCA, which in your case would be:
biovars <- biovars[complete.cases(biovars), ]
The other option is to use another package; specifically, psych seems to work well here and you can use principal(biovars). While the output is a bit different, it works using pairwise deletion, so it basically comes down to whether you want pairwise or listwise deletion. Thanks!
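A hedged sketch of that psych route (the argument choices here are illustrative, not from the original answer): principal() builds the correlation matrix from pairwise-complete observations, and nfactors = 3 mirrors the three components kept in the script above.
library(psych)

# unrotated components to stay close to princomp()'s behaviour
bpc_psych <- principal(biovars, nfactors = 3, rotate = "none")
head(bpc_psych$scores)   # roughly comparable to bpc$scores[, 1:3]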

What could be the best tool or package to perform PCA on very large datasets?

This might seem similar to the question asked at this URL (Apply PCA on very large sparse matrix).
But I am still not able to get my answer, and I need some help. I am trying to perform a PCA on a very large dataset of about 700 samples (columns) and more than 400,000 loci (rows). I wish to plot "samples" in the biplot and hence want to use all 400,000 loci to calculate the principal components.
I tried using princomp(), but I get the following error:
Error in princomp.default(transposed.data, cor = TRUE) :
  'princomp' can only be used with more units than variables
I checked the forums and saw that in cases where there are fewer units than variables it is better to use prcomp() than princomp(), so I tried that as well, but I again get the following error:
Error in cor(transposed.data) : allocMatrix: too many elements specified
So I want to know whether any of you could suggest another option that would be well suited to my very large data. I am a beginner in statistics, but I did read about how PCA works. Are there any other easy-to-use R packages or tools for this?
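Not from the original thread, but to make the prcomp() route concrete: prcomp() works from an SVD of the (centred, optionally scaled) data rather than from an explicit correlation matrix, and its rank. argument (R >= 3.4.0) limits how many components are returned. A hedged sketch, assuming transposed.data is the 700-samples-by-400,000-loci matrix described above:
# scale. = TRUE standardises each locus; it fails for zero-variance columns,
# which would need to be dropped first
pca <- prcomp(transposed.data, center = TRUE, scale. = TRUE, rank. = 3)

summary(pca)         # variance explained by the retained components
head(pca$x[, 1:2])   # sample coordinates for a 2-D plot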

How do I calculate all the cor() between all members of a large dataset using apply instead of for loops?

I have a large set of 10,000 vectors, each of length about 100, stored in a matrix, and I want to calculate the correlations between all of them. Unfortunately, on my current computer a simple double for loop to produce the correlations is taking forever! Is there a more efficient way to go about this?
I have something like an apply function in mind, but I'm not sure how to implement it with cor().
Put your data into a data frame or matrix and use the built-in cor() function. Generally, you want to avoid using loops in R.
cor(yourData)
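One detail worth spelling out (an addition, not from the original answer): cor() correlates the columns of its input, so the 10,000 vectors need to be the columns; transpose first if they are stored as rows. A small made-up example:
set.seed(1)
m <- matrix(rnorm(100 * 10000), nrow = 100, ncol = 10000)  # vectors as columns
cor_mat <- cor(m)   # 10,000 x 10,000 correlation matrix (about 800 MB) in one call
dim(cor_mat)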
