what could be the best tool or package to perform PCA on very large datasets? - r

This might seem like a similar question which was asked in this URL (Apply PCA on very large sparse matrix).
But I am still not able to get my answer for which i need some help. I am trying to perform a PCA for a very large dataset of about 700 samples (columns) and > 4,00,000 locus (rows). I wish to plot "samples" in the biplot and hence want to consider all of the 4,00,000 locus to calculate the principal components.
I did try using princomp(), but I get the following error which says,
Error in princomp.default(transposed.data, cor = TRUE) :
'`princomp`' can only be used with more units than variables
I checked with the forums and i saw that in the cases where there are less units than variables, it is better to use prcomp() than princomp(), so i tried that as well, but i again get the following error,
Error in cor(transposed.data) : allocMatrix: too many elements specified
So I want to know if any of you could suggest me any other good option which could be best suited for my very large data. I am a beginner for statistics, but I did read about how PCA works. I want to know if there are any other easy-to-use R packages or tools to perform this?

Related

R: [Indicspecies package] multipatt function: extract values from summary.multipatt

I am working with the 'indicspecies' package - multipatt function and am unable to extract summary values of the package. Unfortunately I can't print all the summary and am left with impartial information for my model. The reason is the huge amount of data that needs to be printed from the summary (300.000 different species, 3 groups, 6 comparable combinations).
This is what happens with summary being saved (pre-code incl.):
x <- multipatt(data, ...)
sumx <-summary(x)
sumx
NULL
str(sumx)
NULL
So, the summary does not work exactly like a generic summary. It seems that the function is based around the older indval function from the 'labdsv' package (which is mentioned in the documentation). I found an archived thread where a similar problem is discussed: http://r.789695.n4.nabble.com/extract-values-from-summary-of-function-indval-of-the-package-labdsv-td4637466.html
but it seems not resolved (and is not exactly about the same function, rather the base function indval).
I was wondering if anyone has experience with the indicspecies package and knows a way to either extract the info from the summary.
It is possible to extract significance and other information from the other saved data from the model, but it might be nice to just get a quick complete overview from the data.
ps. I tried
options(max.print=1000000)
but this didn't solve it for me.
I use to capture the summary output for a multipatt object, but don't any more because the p-values reported are not corrected for multiple testing. To answer the OP's question you can capture the summary output using capture.output
ex.
dat.multipatt.summary<-capture.output(summary(dat.multipatt, indvalcomp=TRUE))
Again, I do not recommend this. It is very important to correct the p-values for multiple testing, so the summary output actually isn't helpful. To be clear ?multipatt states:
"sign Data table with results of the best matching pattern, the association value and the degree of statistical significance of the association (i.e. p-values from permutation test). Note that p-values are not corrected for multiple testing."
I just posted an answer for how to correct the p-values here https://stats.stackexchange.com/questions/370724/indiscpecies-multipatt-and-overcoming-multi-comparrisons/401277#401277
I don't have any experience with this package and since you haven't provided the data, it's difficult to reproduce. But since summary is returning NULL, are you sure your x is computed properly? Check the object.size or class or something else of x to see if it indeed has any content.
Also instead of accessing all the contents of summary(x) together, you can use # to access slots of it (similar to $ in dataframe).
If you need further assistance, it'd be better t provide atleast a small subset or some other sample data so that the community can work with it.

Which technique is best used to find optimal split on numeric data to reduced error on group?

I have a dataset that contains a numeric variable and a binary categorical variable. I want to find the optimal split on the numeric variable that can be used to quickly classify the categories and limit the amount of error.
I have used a decision tree to do this but am wondering if there are better optimization methods out there to do this?
I would like to be able to do this in R but am having trouble how to write the function for this.
Please help me understand the simple optimisation problem. Thanks!

Determine number of factors in EFA (R) using Comparison Data

I am looking for ways to determine number of optimal factors in R factanal function. The most used method (conduct a pca and use scree plot to determine the number of factors) is already known to me. I have found a method described here to be easier for non technical folks like me. Unfortunately the R script is no longer accessible in which the method was implemented. I was wondering if there is a package available in R that does the same?
The method was originally proposed in this study: Determining the number of factors to retain in an exploratory factor analysis using comparison data of known factorial structure.
The R code is now moved here as per the author.
EFA.dimensions ist also a nice and easy to use package for that

Princomp error in R : covariance matrix is not non-negative definite

I have this script which does a simple PCA analysis on number of variables and at the end attaches two coordinates and two other columns(presence, NZ_Field) to the output file. I have done this many times before but now its giving me this error:
I understand that it means there are negative eigenvalues. I looked at similar posts which suggest to use na.omit but it didn't work.
I have uploaded the "biodata.Rdata" file here:
covariance matrix is not non-negative definite
https://www.dropbox.com/s/1ex2z72lilxe16l/biodata.rdata?dl=0
I am pretty sure it is not because of missing values in data because I have used the same data with different "presence" and "NZ_Field" column.
Any help is highly appreciated.
load("biodata.rdata")
#save data separately
coords=biodata[,1:2]
biovars=biodata[,3:21]
presence=biodata[,22]
NZ_Field=biodata[,23]
#Do PCA
bpc=princomp(biovars ,cor=TRUE)
#re-attach data with auxiliary data..coordinates, presence and NZ location data
PCresults=cbind(coords, bpc$scores[,1:3], presence, NZ_Field)
write.table(PCresults,file= "hlb_pca_all.txt", sep= ",",row.names=FALSE)
This does appear to be an issue with missing data so there are a few ways to deal with it. One way is to manually do listwise deletion on the data before running the PCA which in your case would be:
biovars<-biovars[complete.cases(biovars),]
The other option is to use another package, specifically psych seems to work well here and you can use principal(biovars), and while the output is bit different it does work using pairwise deletion, so basically it comes down to whether or not you want to use pairwise or listwise deletion. Thanks!

R - 'princomp' can only be used with more units than variables

I am using R software (R commander) to cluster my data. I have a smaller subset of my data containing 200 rows and about 800 columns. I am getting the following error when trying kmeans cluster and plot on a graph.
"'princomp' can only be used with more units than variables"
I then created a test doc of 10 row and 10 columns whch plots fine but when I add an extra column I get te error again.
Why is this? I need to be able to plot my cluster. When I view my data set after performing kmeans on it I can see the extra results column which shows which clusters they belong to.
IS there anything I am doing wrong, can I ger rid of this error and plot my larger sample???
Please help, been wrecking my head for a week now.
Thanks guys.
The problem is that you have more variables than sample points and the principal component analysis that is being done is failing.
In the help file for princomp it explains (read ?princomp):
‘princomp’ only handles so-called R-mode PCA, that is feature
extraction of variables. If a data matrix is supplied (possibly
via a formula) it is required that there are at least as many
units as variables. For Q-mode PCA use ‘prcomp’.
Principal component analysis is underspecified if you have fewer samples than data point.
Every data point will be it's own principal component. For PCA to work, the number of instances should be significantly larger than the number of dimensions.
Simply speaking you can look at the problems like this:
If you have n dimensions, you can encode up to n+1 instances using vectors that are all 0 or that have at most one 1. And this is optimal, so PCA will do this! But it is not very helpful.
you can use prcomp instead of princomp

Resources