Alternatives to LDA for big datasets - r

I'm analyzing a large gene expression dataset in R, with 100 samples and 50,000 genes.
I have already made some very informative PCA projections of inter-sample patterns. Now I want to produce projections of the data that maximize the differences between the labels I have for the samples.
Normally I would do this with the lda() function from the MASS package. However, this is far too slow and memory-intensive.
If the goal is to produce a projection of the samples maximizing the difference between known labels, what are some good alternatives to lda()?
Thanks!

Summary of our discussion in the comments to the question
Linear discriminant analysis does not work on data sets with more features than observations, so you need some form of regularization. If you want to do classification but are mainly interested in the predictive patterns, rather than the predictions themselves, you can use partial least squares discriminant analysis (PLSDA).
However, in your case the components of PLSDA might be hard to interpret, since they will contain one coefficient per gene, and it seems unrealistic to believe that all 50,000 genes are relevant to the phenotype you are studying. An alternative approach I prefer is to use nearest shrunken centroids or the elastic net, which produce sparse models (i.e. they keep only the most informative genes and discard those of little importance).
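As a rough illustration of the sparse route, here is a minimal sketch using glmnet to fit an elastic-net penalized logistic regression; the objects expr (a 100 x 50,000 numeric matrix, samples in rows) and labels (a two-level factor of sample labels) are placeholders, and alpha = 0.5 is just one possible mix of the lasso and ridge penalties:
library(glmnet)
set.seed(1)
# cross-validated elastic net; use family = "multinomial" for more than two labels
fit <- cv.glmnet(x = expr, y = labels, family = "binomial", alpha = 0.5)
# genes with non-zero coefficients at the cross-validated penalty
cf <- as.matrix(coef(fit, s = "lambda.min"))
rownames(cf)[cf != 0]
For PLS-DA itself, the mixOmics package provides a plsda() function whose component loadings can be plotted much like your PCA projections.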

You could run your LDA model on a sample of the data set.

Related

Extracting normal-distributed subset from a dataset in R

I am working with a dataset of ~200 observations and a number of variables. Unfortunately, none of the variables are normally distributed. Is it possible to extract a data subset in which at least one desired variable is normally distributed? I want to do some statistics afterwards (at least logistic regression).
Any help will be much appreciated,
Phil
If there are just a few observations that skew the distribution of individual variables, and no other reasons speak against using a particular method (such as logistic regression) on your data, you might want to study the nature of these "weird" observations before deciding which analysis method to use.
I would:
carry out the desired regression analysis (e.g. logistic regression) and, as is always required, carry out a residual analysis (Q-Q normal plot, Tukey-Anscombe plot, leverage plot; also see here) to check the model assumptions. Check whether the residuals are normally distributed (normality of the model residuals is the actual assumption in linear regression, not normality of each variable; you might of course have e.g. bimodally distributed data if there are differences between groups), see whether there are observations that could be regarded as outliers, study them (see e.g. here), and if justified remove them from the final dataset before re-fitting the model without the outliers. A minimal sketch of these diagnostics follows this list.
However, you always have to state which observations were removed, and on what grounds. Maybe the outliers can be explained as errors in data collection?
The issue of whether it's a good idea to remove outliers, or a better idea to use robust methods was discussed here.
As suggested by GuedesBF, you may want to find a test or modelling method that does not assume normality.
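A minimal sketch of these diagnostics, assuming a data frame dat with a binary outcome y and predictors x1 and x2 (all names are placeholders):
fit <- glm(y ~ x1 + x2, data = dat, family = binomial)
# residuals vs. fitted (Tukey-Anscombe), Q-Q plot, scale-location, leverage
par(mfrow = c(2, 2))
plot(fit)
# a common rule of thumb for flagging influential observations
which(cooks.distance(fit) > 4 / nrow(dat))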
Before modelling anything or removing any data, I would always plot the data by treatment / outcome group and inspect the presence of missing values. After a quick look at your dataset, it seems that quite a few variables have high levels of missingness, and your variable 15 has a lot of zeros. This can be quite problematic for e.g. linear regression.
Understanding and describing your data in a model-free way (with clever plots, e.g. using ggplot2 and multiple aesthetics) is much better than fitting a model and interpreting p-values when violating model assumptions.
A good start to get an overview of all data, their distribution and pairwise correlation (and if you don't have more than around 20 variables) is to use the psych library and pairs.panels.
dat <- read.delim("~/Downloads/dput.txt", header = F)
library(psych)
psych::pairs.panels(dat[,1:12])
psych::pairs.panels(dat[,13:23])
You can then quickly see the distribution of each variable, and the presence of correlations among each pair of variables. You can tune arguments of that function to use different correlation methods, and different displays. Happy exploratory data analysis :)

R: training random forest using PCA data

I have a data set called Data, with 30 scaled and centered features and 1 outcome with column name OUTCOME, covering 700k records, stored in data.table format. I computed its PCA and observed that the first 8 components account for 95% of the variance. I want to train a random forest in h2o, so this is what I do:
Data.pca=prcomp(Data,retx=TRUE) # compute the PCA of Data
Data.rotated=as.data.table(Data.pca$x)[,c(1:8)] # keep only first 8 components
Data.dump=cbind(Data.rotated,subset(Data,select=c(OUTCOME))) # PCA dataset plus outcomes for training
This way I have a dataset Data.dump with 8 features (the data rotated onto the first 8 principal components), with each record's outcome attached.
First question: is this reasonable? Or do I have to somehow permute the outcomes vector? Or are the two things unrelated?
Then I split Data.dump into two sets, Data.train for training and Data.test for testing, both converted with as.h2o. Then I feed them to a random forest:
rf=h2o.randomForest(training_frame=Data.train,x=1:8,y=9,stopping_rounds=2,
ntrees=200,score_each_iteration=T,seed=1000000)
rf.pred=as.data.table(h2o.predict(rf,Data.test))
What happens is that rf.pred does not look very similar to the original outcomes Data.test$OUTCOME. I tried to train a neural network as well, but it did not even converge and crashed R.
Second question: is this because I am carrying some mistake over from the PCA treatment? Or because I set up the random forest badly? Or am I just dealing with annoying data?
I do not know where to start, as I am new to data science, but the workflow seems correct to me.
Thanks a lot in advance.
The answer to your second question (i.e. "is it the data, or did I do something wrong") is hard to know. This is why you should always try to make a baseline model first, so you have an idea of how learnable the data is.
The baseline could be h2o.glm(), and/or it could be h2o.randomForest(), but either way without the PCA step. (You didn't say if you are doing a regression or a classification, i.e. if OUTCOME is a number or a factor, but both glm and random forest will work either way.)
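A minimal baseline sketch along those lines, assuming Data.train and Data.test are already H2OFrames with the 30 original (un-rotated) features in columns 1:30 and OUTCOME in column 31 (adjust the indices to your frame):
library(h2o)
h2o.init()
# baseline 1: a (generalized) linear model on the raw features
glm_base <- h2o.glm(x = 1:30, y = "OUTCOME", training_frame = Data.train)
# baseline 2: a random forest on the raw features, no PCA
rf_base <- h2o.randomForest(x = 1:30, y = "OUTCOME", training_frame = Data.train,
                            ntrees = 200, seed = 1000000)
h2o.performance(glm_base, newdata = Data.test)
h2o.performance(rf_base, newdata = Data.test)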
Turning to your first question: yes, it is a reasonable thing to do, and no, you don't have to (in fact, should not) involve the outcomes vector.
Another way to answer your first question is: no, it is unreasonable. It may be that a random forest can see all the relations by itself without needing you to use a PCA. Remember that when you use PCA to reduce the number of input dimensions you are also throwing away a bit of signal. You said that the 8 components capture only 95% of the variance. So you are throwing away some signal in return for having fewer inputs, which means you are trading prediction quality for lower complexity.
By the way, concatenating the original inputs and your 8 PCA components is another approach: you might get a better model by giving it this hint about the data. (But you might not, which is why getting some baseline models first is essential before trying these more exotic ideas.)
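A rough sketch of that concatenation, reusing the objects from the question (and assuming Data still contains the OUTCOME column, as your cbind() implies):
# original 30 features, the 8 rotated components, and the outcome in one table
Data.full <- cbind(subset(Data, select = -OUTCOME),  # raw features
                   Data.rotated,                     # first 8 principal components
                   subset(Data, select = OUTCOME))   # outcome for training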

K-Means Distance Measure - Large Data and mixed Scales

I've a question regarding k-means clustering. We have a dataset with 120,000 observations and need to compute a k-means cluster solution in R. The problem is that k-means usually uses Euclidean distance. Our dataset consists of 3 continuous variables, 11 ordinal (Likert 0-5; I think it would be okay to treat them as continuous) and 5 binary variables. Do you have any suggestions for a distance measure that we can use for our k-means approach, given the "large" dataset? We are sticking to k-means, so I really hope one of you has a good idea.
Cheers,
Martin
One approach would be to normalize the features and then just use the full 19-dimensional Euclidean distance. Cast the binary values to 0/1 (well, it's R, so it does that anyway) and go from there.
I don't see an immediate problem with this method, other than that k-means in 19 dimensions will definitely be hard to interpret. You could try a dimensionality reduction technique to hopefully make the k-means output easier to read, but you know far more about the data set than we ever could, so our ability to help you is limited. A minimal sketch of this approach follows.
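A minimal sketch, assuming all variables sit in a data frame dat with the binary columns already encoded as 0/1 (names are placeholders):
# scale everything so no single variable dominates the Euclidean distance
dat_scaled <- scale(as.matrix(dat))
set.seed(42)
km <- kmeans(dat_scaled, centers = 5, nstart = 25)
table(km$cluster)  # cluster sizes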
You can certainly encode these binary variables as 0/1 too.
It is best practice in statistics not to treat Likert-scale variables as numeric, because of their uneven distribution.
But I don't think you will get meaningful k-means clusters. That algorithm is all about computing means, which makes sense for continuous variables. Discrete variables usually lack the "resolution" for this to work well. The mean then degrades to a "frequency", and such data should be handled very differently.
Do not choose the problem to fit the hammer. Maybe your data is not a nail, and even if you'd like to do it with k-means, it won't solve your problem... Instead, formulate your problem, then choose the right tool. So, given your data, what is a good cluster? Until you have a criterion that measures this, hammering the data won't solve anything.
Encoding the variables as binary will not solve the underlying problem. Rather, it will only increase the data dimensionality, an added burden. It is best practice in statistics not to convert the original data into another form, e.g. continuous to categorical or vice versa. If you do convert the data, however, the conversion must be in line with the question you want to answer, and you must provide a valid justification for it.
Continuing further, as others have stated, try to reduce the dimensionality of the dataset first. Check for issues such as missing values, outliers and zero-variance variables, and consider principal component analysis (for continuous variables) or correspondence analysis (for categorical variables). This can help you reduce the dimensionality. After all, data preprocessing tasks constitute around 80% of an analysis.
Regarding the distance measure for mixed data types: the mean in k-means only makes sense for continuous variables, so I do not see the logic of using k-means for mixed data types.
Consider choosing another algorithm, such as k-modes, which is an extension of k-means. Instead of distances it uses dissimilarities (that is, a quantification of the total mismatches between two objects: the smaller this number, the more similar the two objects), and instead of means it uses modes. A mode is a vector of elements that minimizes the dissimilarities between the vector itself and each object of the data.
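As a hedged sketch of that route, assuming the klaR package's kmodes() implementation and a data frame dat_cat holding the ordinal and binary variables as factors (the 3 continuous variables would have to be binned or handled separately):
library(klaR)
set.seed(42)
km_modes <- kmodes(dat_cat, modes = 5, iter.max = 10)
km_modes$size   # cluster sizes
km_modes$modes  # most frequent category of each variable per cluster
For genuinely mixed data, the clustMixType package's kproto() (k-prototypes) combines Euclidean distance on the numeric columns with simple matching on the categorical ones.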
Mixture models can be used to cluster mixed data.
You can use the R package VarSelLCM, which models, within each cluster, the continuous variables with Gaussian distributions and the ordinal/binary variables with multinomial distributions.
Moreover, missing values can be managed by the model at hand.
A tutorial is available at: http://varsellcm.r-forge.r-project.org/
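A heavily hedged sketch of this route, assuming the VarSelCluster() interface described in that tutorial, with dat again a placeholder data frame mixing numeric and factor columns (NAs allowed):
library(VarSelLCM)
res <- VarSelCluster(dat, gvals = 2:6)  # fit mixture models with 2 to 6 clusters
summary(res)                            # chosen model and selected variables
# see the linked tutorial for extracting the partition and per-cluster parameters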

R linear regression with lm - how to deal with categorical variables with thousands of values (like city or zip code)?

I am using R and the linear regression function lm() to build a prediction model for retail store sales. Among the many predictor (feature) variables in my dataset, there are some categorical (factor) features that can take on thousands of different values, such as zip code (and/or city name). For example, there are over 6,000 different zip codes for California alone; if I instead use city, there are over 400 cities.
I understand that lm() creates a dummy variable for each level of a categorical feature. The problem is that when I run lm(), this explosion of variables takes a lot of memory and a really long time. How can I avoid or handle this situation with my categorical variables?
Your intuition to move from zip codes to cities is good. However, the question is: is there a further level of spatial aggregation that will capture the important spatial variation but result in fewer categorical (i.e. dummy) variables? Probably. Depending on your question, simply including a dummy for rural/suburban/urban may be all you need.
In your case, geographic region is likely a proxy meant to capture variation in socio-economic data. If so, why not include the socio-economic data directly? To do this you could use your city/zip data to link to US census data.
However, if you really need/want to include cities, try estimating a fixed-effects model. The resulting within-estimator differences out time-invariant categorical effects such as your city effects.
Even if you find a way to obtain an OLS estimate with 400 cities in R, I would strongly encourage you not to use an OLS estimator; use a ridge or lasso estimator instead. Unless your data is massive (it can't be too big, since you're using R), the inclusion of so many dummy variables is going to dramatically reduce the degrees of freedom, which can lead to over-fitting and generally poorly estimated coefficients and standard errors.
In slightly more sophisticated language: when the degrees of freedom are low, the minimization problem you solve when estimating OLS is "ill-posed", so you should use regularization. For example, ridge regression (i.e. Tikhonov regularization) would be a good solution. Remember, however, that ridge regression is a biased estimator, so you should consider a bias correction.
My solutions in order of my preference:
Aggregate up to a coarser spatial area (e.g. regions instead of cities).
Fixed effect estimator.
Ridge regression (a sketch follows this list).
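For the ridge option, a minimal sketch using glmnet; sales, sqft and zip are placeholder variable names in a data frame stores, and the sparse design matrix keeps the thousands of zip-code dummies from exhausting memory:
library(glmnet)
library(Matrix)
# sparse design matrix: the zip-code dummies stay memory-friendly
X <- sparse.model.matrix(sales ~ sqft + zip, data = stores)[, -1]
y <- stores$sales
# alpha = 0 gives ridge; cross-validation picks the penalty strength
fit <- cv.glmnet(X, y, alpha = 0)
coef(fit, s = "lambda.min")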
If you don't like my suggestions, I would suggest you pose this question on cross validated. IMO your question is closer to a statistics question than a programming question.

Is there a way to input a covariance matrix (or something like that) into lme4 in R?

I have a very large data set that I extract from a data warehouse. Downloading the data set to the box where I want to run lme4 takes a long time. I would like to know if I could process the data into a covariance matrix, download that (much smaller) matrix, and use it as the data input to lme4. I have done something similar for multiple regression models using SAS, and am hoping I can create this type of input for lme4.
Thanks.
I don't know of any way to use the observed covariance matrix to fit an lmer model. But if the goal is to reduce data set size in order to speed up analysis, there may be simpler approaches. For example, if you don't need the conditional modes of the random effects, and you have a very large sample size, then you might try fitting the model to progressively larger subsets of the data until the estimates of the fixed effects and the covariance matrix of the random effects 'stabilize'. This approach has worked well in my experience, and has been discussed by others:
http://andrewgelman.com/2012/04/hierarchicalmultilevel-modeling-with-big-data/
Here is a relevant quotation:
"Related to the “multiple model” approach are simple approximations that speed the computations. Computers are getting faster and faster—but models are getting more and more complicated! And so these general tricks might remain important. A simple and general trick is to break the data into subsets and analyze each subset separately. For example, break the 85 counties of radon data randomly into three sets of 30, 30, and 25 counties, and analyze each set separately." Gelman and Hill (2007), p.547.

Resources