I have a problem where I'm trying to use supervised learning in Python. I have a series of x,y coordinates which I know belong to a label in one data set; in the other I have only the x,y coordinates. I am going to use the labelled set to train a model and then predict labels for the other set. My approach is supervised learning with a classification algorithm (linear discriminant analysis), since the labels are discrete. Although they are discrete, they are large in number (n ≈ 80,000). My question: at what number of labels should I consider regression over classification, given that regression is better suited to continuous labels? I'm using scikit-learn as my machine learning package and astroML.org's excellent tutorial as a guide.
It is not about numbers; it is about whether the labels are continuous or not. It does not matter if you have 80,000 classes or even more: as long as there is no correlation between neighbouring classes (e.g. class i and class i+1), you should use classification, not regression.
Regression only makes sense when the labels are continuous (e.g. real numbers), or at least when there is a correlation between adjacent classes (e.g. when the labels are counts of something, you can do regression and then round the results).
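To make the "regress, then round" case concrete, here is a toy sketch (shown in R purely for illustration; the same idea applies with scikit-learn, and `train`/`test`, `y`, `x1`, `x2` are placeholder names):
fit  <- lm(y ~ x1 + x2, data = train)        # y is a count-like, ordered label
pred <- round(predict(fit, newdata = test))  # round back onto the label grid
# If instead the ~80,000 labels are arbitrary IDs with no ordering, this is
# meaningless; use a classifier, no matter how many classes there are.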
I am working with a dataset of ~200 observations and a number of variables. Unfortunately, none of the variables are normally distributed. Is it possible to extract a subset of the data in which at least one desired variable is normally distributed? I want to do some statistics afterwards (at least logistic regression).
Any help will be much appreciated,
Phil
If only a few observations skew the distribution of individual variables, and there are no other reasons speaking against using a particular method (such as logistic regression) on your data, you might want to study the nature of those "weird" observations before deciding which analysis method to use.
I would:
carry out the desired regression analysis (e.g. logistic regression) and, as is always required, carry out residual analysis (Q-Q normal plot, Tukey-Anscombe plot, leverage plot; also see here) to check the model assumptions. See whether the residuals are normally distributed (normality of the model residuals is the actual assumption in linear regression, not normality of each variable; you might well have, e.g., bimodally distributed data if there are differences between groups). See whether there are observations that could be regarded as outliers, study them (see e.g. here), and if justified remove them from the final dataset and re-fit the model without them. A minimal sketch of these diagnostics follows this list.
However, you always have to state which observations were removed, and on what grounds. Maybe the outliers can be explained as errors in data collection?
The issue of whether it's a good idea to remove outliers, or a better idea to use robust methods was discussed here.
As suggested by GuedesBF, you may want to find a test or model that makes no assumption of normality.
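The diagnostics mentioned in the first point, as a minimal sketch (placeholder names: data frame `df`, outcome `y`, predictors `x1` and `x2`):
fit <- lm(y ~ x1 + x2, data = df)
par(mfrow = c(2, 2))
plot(fit)                                  # Tukey-Anscombe, Q-Q normal, scale-location, leverage
which(abs(rstandard(fit)) > 3)             # candidate outliers
which(cooks.distance(fit) > 4 / nrow(df))  # influential points (rule-of-thumb cutoff)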
Before modelling anything or removing any data, I would always plot the data by treatment/outcome groups and inspect the presence of missing values. After a quick look at your dataset, it seems that quite a few variables have high levels of missingness, and your variable 15 has a lot of zeros. This can be quite problematic for, e.g., linear regression.
Understanding and describing your data in a model-free way (with clever plots, e.g. using ggplot2 and multiple aesthetics) is much better than fitting a model and interpreting p-values when violating model assumptions.
A good start to get an overview of all data, their distribution and pairwise correlation (and if you don't have more than around 20 variables) is to use the psych library and pairs.panels.
dat <- read.delim("~/Downloads/dput.txt", header = FALSE)  # tab-delimited file, no header row
library(psych)
psych::pairs.panels(dat[, 1:12])   # scatterplot matrix with histograms and pairwise correlations
psych::pairs.panels(dat[, 13:23])
You can then quickly see the distribution of each variable, and the presence of correlations among each pair of variables. You can tune arguments of that function to use different correlation methods, and different displays. Happy exploratory data analysis :)
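For instance, rank-based correlations are often a better choice for ordinal or skewed variables; a variation on the call above (same data, different pairs.panels arguments):
psych::pairs.panels(dat[, 1:12], method = "spearman", ellipses = FALSE)  # Spearman correlations, no ellipses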
I am interested in optimising predictions for a multinomial regression model with 3 (or more) classes according to various measures.
For two-class models (logistic regression), this can be done in the pROC package using the coords function with best.method="youden" or closest.topleft. This will choose the threshold on the probability of success as predicted by the logistic regression model that either maximises specificity+sensitivity (youden) or gives the point in the ROC curve closest in the Euclidean distance to the point (1,1) (closest.topleft).
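For reference, the two-class machinery looks like this (assuming `y` is the observed binary outcome and `p` the predicted probability of success from the logistic model; both names are placeholders):
library(pROC)
roc_obj <- roc(response = y, predictor = p)
coords(roc_obj, x = "best", best.method = "youden")           # Youden-optimal threshold
coords(roc_obj, x = "best", best.method = "closest.topleft")  # closest-to-top-left threshold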
In the three (or more) class case, it is possible to generalise sensitivity+specificity to the sum over classes of the sensitivity for each class. We can then ask: if we choose a vector of weights on the probabilities of the non-reference classes, which vector of weights will maximise this quantity? This set of weights would give the analogue, in the three (or more) class case, of the Youden index.
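To make this concrete, here is the kind of brute-force search I have in mind (not an existing package function; `probs` and `truth` are placeholders for an n x 3 matrix of predicted class probabilities, with the reference class in column 1 and columns named after the classes, and the factor of observed classes):
sens_sum <- function(w, probs, truth) {
  scaled <- sweep(probs, 2, c(1, w), `*`)          # weight the non-reference columns
  pred   <- factor(colnames(probs)[max.col(scaled)], levels = levels(truth))
  cm     <- table(truth, pred)                     # rows = truth, columns = prediction
  sum(diag(cm) / rowSums(cm))                      # sum of per-class sensitivities
}
grid   <- expand.grid(w2 = seq(0.2, 5, by = 0.2), w3 = seq(0.2, 5, by = 0.2))
scores <- apply(grid, 1, function(w) sens_sum(as.numeric(w), probs, truth))
grid[which.max(scores), ]                          # the Youden-like weight vector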
My questions are:
Is there an R package and command that implements this? If not, I will write one, but I want to make sure I am not duplicating work that has already been done.
If not, what other functionalities would be useful to build into this package? For example, it would be possible to find the best set of weights that ensure one of the sensitivities is at least above some set threshold. It would also be possible to find the analogue of closest.topleft--the set of weights that give sensitivities closest to (1,1,1), and so on. Also, it would be possible to include some plotting capabilities, e.g., for the 3-class situation, a 3D version of an ROC curve that plots the three sensitivities on three axes.
Thanks!
I've a question regarding k-means clustering. We have a dataset with 120,000 observations and need to compute a k-means cluster solution with R. The problem is that k-means usually uses Euclidean distance. Our dataset consists of 3 continuous variables, 11 ordinal (Likert 0-5; I think it would be okay to handle them as continuous) and 5 binary variables. Do you have any suggestion for a distance measure that we can use for our k-means approach, given the "large" dataset? We want to stick to k-means, so I really hope one of you has a good idea.
Cheers,
Martin
One approach would be to normalize the features and then just use the plain Euclidean distance on all of them. Cast the binary values to 0/1 (well, it's R, so it does that anyway) and go from there.
I don't see an immediate problem with this method, other than that k-means in that many dimensions will definitely be hard to interpret. You could try a dimensionality reduction technique and hopefully make the k-means output easier to read, but you know way more about the data set than we ever could, so our ability to help you is limited.
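A minimal sketch of this approach, assuming the variables sit in an all-numeric data frame `dat` (Likert items as 0-5, binary as 0/1; all names here are placeholders):
dat_std <- scale(dat)                            # z-score each column
set.seed(42)
km <- kmeans(dat_std, centers = 4, nstart = 25)  # 4 centers chosen only for illustration
table(km$cluster)                                # cluster sizes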
You can certainly encode the binary variables as 0/1 too.
It is best practice in statistics not to treat Likert-scale variables as numeric, because of their uneven distribution.
But I don't think you will get meaningful k-means clusters. That algorithm is all about computing means, which makes sense on continuous variables. Discrete variables usually lack the "resolution" for this to work well: the mean then degrades to a "frequency", and the data should be handled very differently.
Do not choose the problem to fit the hammer. Maybe your data is not a nail, and even if you'd like to solve it with k-means, that won't solve your problem... Instead, formulate your problem, then choose the right tool. So, given your data, what makes a cluster good? Until you have a criterion that measures this, throwing algorithms at the data won't solve anything.
Encoding the variables as binary will not solve the underlying problem; it will only increase the dimensionality of the data, an added burden. It's best practice in statistics not to convert the original data to another form (continuous to categorical or vice versa). If you do convert, the conversion must be in line with the question you are trying to answer, and you must provide a valid justification.
Continuing further, as others have stated, try to reduce the dimensionality of the dataset first. Check for issues like missing values, outliers and zero-variance variables, and consider principal component analysis (for continuous variables) or correspondence analysis (for categorical variables). This can help you reduce the dimensionality. After all, data preprocessing tasks constitute 80% of an analysis.
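For the continuous part, a minimal PCA sketch (assuming the continuous columns of your data frame `dat` are listed in `cont_cols`; both names are placeholders):
pca <- prcomp(dat[, cont_cols], center = TRUE, scale. = TRUE)
summary(pca)   # proportion of variance explained per component
head(pca$x)    # component scores, usable as lower-dimensional inputs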
Regarding the distance measure for mixed data types: the mean in k-means only makes sense for continuous variables, so I do not see the logic of using k-means on mixed data types.
Consider another algorithm such as k-modes. k-modes is an extension of k-means: instead of distances it uses dissimilarities (that is, a quantification of the total mismatches between two objects; the smaller this number, the more similar the two objects), and instead of means it uses modes. A mode is a vector of elements that minimizes the dissimilarities between the vector itself and each object of the data.
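A minimal k-modes sketch using the klaR package (one possible implementation; `dat` is a placeholder for your data frame, with all columns treated as categorical):
library(klaR)
dat_cat <- data.frame(lapply(dat, as.factor))
set.seed(1)
km <- kmodes(dat_cat, modes = 4, iter.max = 10)  # 4 modes chosen only for illustration
km$size                                          # cluster sizes
km$modes                                         # modal value of each variable per cluster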
Mixture models can be used to cluster mixed data.
You can use the R package VarSelLCM, which models, within each cluster, the continuous variables with Gaussian distributions and the ordinal/binary variables with multinomial distributions.
Moreover, missing values can be managed by the model at hand.
A tutorial is available at: http://varsellcm.r-forge.r-project.org/
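A rough sketch of a call, with argument names as I recall them from that tutorial (treat them as assumptions and check the documentation; `dat` is a placeholder for your mixed data frame):
library(VarSelLCM)
# fit mixtures with 2 to 5 components; variable selection and missing values are handled internally
res <- VarSelCluster(dat, gvals = 2:5, vbleSelec = TRUE, crit.varsel = "BIC")
summary(res)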
I'm analyzing a big gene expression dataset in R, with 100 samples and 50,000 genes.
I already made some very informative PCA projections of inter-sample patterns. Now I want to make some projections of the data maximizing the differences between the labels I have for the samples.
Normally I would do this with the lda() function from the MASS package. However, this is way too slow and memory intensive.
If the goal is to produce a projection of the samples maximizing the difference between known labels, what are some good alternatives to lda()?
Thanks!
Summary of our discussion in the comments to the question
Linear discriminant analysis does not work on data sets with more features than observations, so you need some form of regularization. If you want to do classification but are mainly interested in the predictive patterns, rather than the predictions themselves, you can use partial least squares discriminant analysis (PLSDA).
However, in your case the components of PLSDA might be hard to interpret, since they will contain one coefficient per gene, and it seems unrealistic to believe that all 50,000 genes are relevant to the phenotype you are studying. An alternative approach I prefer is nearest shrunken centroids or the elastic net, which produce sparse models (i.e. they keep only the most informative genes and discard those of little importance).
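A minimal elastic-net sketch with glmnet (one way to do this; it assumes `X` is the 100 x 50,000 expression matrix and `labels` is the factor of sample labels, both placeholder names):
library(glmnet)
set.seed(1)
cvfit <- cv.glmnet(X, labels, family = "multinomial", alpha = 0.5)  # use family = "binomial" for two groups
coef(cvfit, s = "lambda.min")  # sparse coefficients: most genes are exactly zero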
You could run your LDA model on a sample of the data set.
I am comparing various predictive models on a binary classification task using the caret R package, with respect to their predictive performance (lift chart) and prediction accuracy (calibration plot). I ran into the following issues:
1. Sometimes the lift function is very slow when the number of observations is large or there are several competing classifiers.
2. I also wonder whether it is possible to manually define the cuts of the calibration plot. I have a severely imbalanced problem (the average predicted probability is 5%) and the calibration plot function assumes evenly spaced cuts.
The lift plot does the calculation for every unique probability value (much like an ROC curve), which is why it is slow.
Neither of those options is available right now. You can add two issues to the GitHub page. I'm fairly swamped right now, but those shouldn't be a big deal to change (you could always contribute solutions too).
Max