Recently I watched a lot of Stanford's hilarious Open Classroom video lectures. The part about unsupervised machine learning in particular got my attention. Unfortunately, it stops where it might get even more interesting.
Basically, I am looking to classify discrete matrices with an unsupervised algorithm. The matrices just contain discrete values from the same range. Let's say I have thousands of 20x15 matrices with values ranging from 1 to 3. I just started to read through the literature, and I feel that image classification is far more complex (color histograms) and that my case is rather a simplification of what is done there.
I also looked at the Machine Learning and Cluster CRAN Task Views but do not know where to start with a practical example.
So my question is: which package / algorithm would be a good pick to start playing around and working on the problem in R?
EDIT:
I realized that I might have been too imprecise: my matrix contains discrete choice data, so mean-based clustering might(!) not be the right idea. I do understand what you said about vectors and observations, but I am hoping for some function that accepts matrices or data.frames, because I have several observations over time.
EDIT2:
I realize that a package/function introduction that focuses on unsupervised classification of categorical data is what would help me the most right now.
... classify discrete matrices by an unsupervised algorithm
You must mean cluster them. Classification is commonly done by supervised algorithms.
I feel that image classification is way more complex (color histograms) and that my case is rather a simplification of what is done there
Without knowing what your matrices represent, it's hard to tell what kind of algorithm you need. But a starting point might be to flatten your 20x15 matrices to produce length-300 vectors; each element of such a vector would then be a feature (or variable) to base a clustering on. This is the way most ML packages, including the cluster package you link to, work: "In case of a matrix or data frame, each row corresponds to an observation, and each column corresponds to a variable."
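To make that concrete, here is a minimal sketch in R; mats is a made-up stand-in for your list of matrices, and note that treating the codes 1-3 as numeric is exactly the simplification your EDIT warns about:

# stand-in data: 100 random 20x15 matrices with values in 1..3
mats <- replicate(100, matrix(sample(1:3, 300, replace = TRUE), 20, 15),
                  simplify = FALSE)
X <- t(sapply(mats, as.vector))  # one row per matrix, 300 columns
km <- kmeans(X, centers = 4)     # 4 clusters, chosen arbitrarily here
table(km$cluster)                # cluster sizes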
So far I found daisy from the cluster package, specifically the argument "gower", which refers to Gower's similarity coefficient for handling multiple modes of data. Gower's seems to be a fairly old distance metric, but it's what I found for use with categorical data.
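A minimal sketch of what I have in mind, reusing the flattened matrix X from the answer above and treating every column as a factor so Gower handles it as categorical:

library(cluster)
df  <- as.data.frame(lapply(as.data.frame(X), factor))  # all columns categorical
d   <- daisy(df, metric = "gower")                      # Gower dissimilarities
fit <- pam(d, k = 4)                                    # k-medoids on the dissimilarity matrix
table(fit$clustering)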
You might want to start from here: http://cran.r-project.org/web/views/MachineLearning.html
I tried several ways of selecting predictors for a logistic regression in R. I used lasso logistic regression to get rid of irrelevant features, cutting their number from 60 to 24; then I used those 24 variables in my stepAIC logistic regression, after which I cut one more variable with a p-value of approximately 0.1. What other feature selection methods can or even should I use? I tried to look for ANOVA correlation coefficient analysis, but I didn't find any examples for R. And I think I cannot use a correlation heatmap in this situation, since my output is categorical? I have seen some instances recommending lasso and stepAIC, and other instances criticising them, but I didn't find any definitive, comprehensive alternative, which left me confused.
Given the methodological nature of your question you might also get a more detailed answer at Cross Validated: https://stats.stackexchange.com/
From the information provided, 23-24 independent variables seems quite a number to me. If you do not have a large sample, remember that overfitting might be an issue (i.e. a low cases-to-variables ratio). Indications of overfitting are large parameter estimates and standard errors, or failure of convergence, for instance. You have obviously already used stepwise variable selection with stepAIC, which would also have been my first try if I had chosen to let the model do the variable selection.
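For reference, a minimal sketch of that stepwise selection; df with the binary outcome y is a placeholder for your data:

library(MASS)
full <- glm(y ~ ., data = df, family = binomial)          # full model, all candidates
sel  <- stepAIC(full, direction = "both", trace = FALSE)  # stepwise search by AIC
summary(sel)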
If you spot any issues with standard errors or parameter estimates, further options down the road might be to collapse categories of independent variables, or to check whether there is any evidence of multicollinearity, which could also lead to deleting highly correlated variables and narrowing down the number of remaining features.
Apart from a strictly mathematical approach, you might also want to identify features that are likely to be related to your outcome of interest according to your underlying content hypothesis and your previous experience, i.e. look at the model from your point of view as an expert in your field of interest.
If sample size is not an issue and the point is reducing the number of features, you may consider running a principal component analysis (PCA) to find out about highly correlated features and doing the regression with the principal components instead, which are uncorrelated linear combinations of your "old" features. One package for PCA is factoextra, built around the prcomp or princomp functions: http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/118-principal-component-analysis-in-r-prcomp-vs-princomp/
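A minimal sketch with base R's prcomp; X (numeric predictor matrix) and y (binary outcome) are placeholders for your data:

pca <- prcomp(X, center = TRUE, scale. = TRUE)
summary(pca)                           # variance explained per component
scores <- as.data.frame(pca$x[, 1:5])  # keep the first 5 PCs; pick the number via a scree plot
fit <- glm(y ~ ., data = scores, family = binomial)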
I have been following the latest DESeq2 pipeline to perform an RNA-seq analysis. My problem is that the RIN of the experimental samples is quite low compared to the control ones. I read a paper in which they performed an RNA-seq analysis with time-course RNA degradation and concluded that including the RIN value as a covariate can mitigate some of the effects of low RIN in samples.
My question is how I should construct the design in the DESeq2 object:
~conditions+rin
~conditions*rin
~conditions:rin
none of them... :)
I cannot find proper resources that explain how to construct these models (I am new to the field...) and I admit I have crashed into a wall with these kinds of things. I would also appreciate some links to good resources for understanding which one is correct and why.
Thank you so much
This turned out to be quite long for a comment.
It depends on your data.
First of all, counts ~ conditions:rin does not make sense in your case, because conditions is categorical. You cannot fit a model with only an interaction term.
I would go with counts ~ condition + rin. This assumes there is a condition effect and a linear effect from rin, and that the counts' dependence on rin does not itself depend on condition.
As you mentioned, rin in one of the conditions is quite low, but is there any reason to suspect that the relationship between rin and counts differs between the two conditions? If you fit counts ~ condition * rin, you are assuming a condition effect plus a rin effect that differs by condition, i.e. a different slope for the rin effect if you plot counts vs rin. You would need to pull out a few genes and check whether this is true. Also, to fit this model you need quite a lot of samples to estimate the effects accurately. See whether both of these hold.
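In code, the additive model would look roughly like this; cts (count matrix) and coldata (with a factor column condition and a numeric column rin) are placeholders for your objects. By DESeq2 convention the variable of interest goes last in the design:

library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData = cts,
                              colData   = coldata,
                              design    = ~ rin + condition)
dds <- DESeq(dds)
res <- results(dds)  # condition effect, adjusted for rin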
I have 50 predictors and 1 target variable. All my predictors and the target variable are binary (0s and 1s). I am performing my analysis using R.
I will be implementing four algorithms.
1. RF
2. Log Reg
3. SVM
4. LDA
I have the following questions:
I converted them all into factors. How should I treat my variables beforehand, before feeding them into the other algorithms?
I used the caret package to train my model, and it takes a lot of time. I do practice ML regularly, but I don't know how to proceed with all variables being binary.
How to remove collinear variables?
I'm mostly a Python user, not an R user, but I bet there is a common approach:
1. Check your columns. Remove a column if the number of zeroes or ones is > 95% of the total (you can try 2.5% or even 1% later).
2. Run a simple Random Forest with default settings and get the feature importances. Columns that turn out to be unnecessary you can process with LDA.
3. Check the target column. If it's highly unbalanced, try oversampling or downsampling, or use classification methods that can handle an unbalanced target column (like XGBoost).
For linear regression you'll need to calculate the correlation matrix and remove correlated columns. Other methods can live without it.
Please check whether SVM (or SVC) supports all-boolean features; it usually works very well for binary classification.
I would also advise trying a neural network.
PS About collinear variables: I wrote code for this in Python for my project. It's simple; you can do it yourself:
- plot the correlation matrix
- find pairs with correlation above some threshold
- remove the column that has the lower correlation with the target variable (also check that the column you want to remove is not important; otherwise try it the other way around, or perhaps combine the columns)
In my code I ran this algorithm iteratively for different thresholds, from 0.99 down to 0.9. It works well.
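Since the question is about R, here is a rough sketch of the same idea there; X (numeric 0/1 predictor matrix) and y (target) are placeholders, and caret's findCorrelation(cor(X), cutoff = 0.9) implements a similar heuristic in one call:

drop_correlated <- function(X, y, threshold = 0.9) {
  repeat {
    cm <- abs(cor(X))
    diag(cm) <- 0
    pair <- which(cm == max(cm), arr.ind = TRUE)[1, ]  # most correlated pair
    if (cm[pair[1], pair[2]] < threshold) break
    worse <- pair[which.min(abs(cor(X[, pair], y)))]   # keep the one closer to the target
    X <- X[, -worse, drop = FALSE]
  }
  X
}
X_reduced <- drop_correlated(X, y, threshold = 0.9)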
I'm trying to cluster the movies data set that comes with the "ggplot2" package in R. I will be using k-means. The column names that come with this data set are:
[1] "title" "year" "length" "budget" "rating"
[6] "votes" "r1" "r2" "r3" "r4"
[11] "r5" "r6" "r7" "r8" "r9"
[16] "r10" "mpaa" "Action" "Animation" "Comedy"
[21] "Drama" "Documentary" "Romance" "Short"
Do you think it is a good idea to do clustering based on movie genre? I'm kind of lost and don't know where to start. Any advice?
You need to figure out what makes a good cluster.
There are millions of ways to cluster this data set, because you can preprocess the data differently, use different algorithms, distances, and so on.
Without your guidance, the clustering algorithm will just do something, and likely return a completely useless result!
So you need to first get a clear objective: what is a good clustering?
Then you can try to adapt the data such that the clustering algorithm optimizes for this objective. For k-means, you need to do all of this in preprocessing. For hclust you can also choose distance functions that match your goals.
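For example, with hclust the distance choice is explicit; X is a placeholder for your preprocessed feature matrix:

d  <- dist(X, method = "manhattan")  # or "euclidean", or a daisy() dissimilarity
hc <- hclust(d, method = "ward.D2")
cl <- cutree(hc, k = 4)              # cut the tree into 4 clusters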
To answer your first question: Yes, I think that this is an interesting project. Working with this dataset could be a cool way to learn about different data mining techniques.
To answer your second question, here is some advice. Clustering is an unsupervised learning technique. Learning is unsupervised when the target variable (in this case, the target variable might be the genre of the movie) is unknown. However, looking at the columns that you listed, it seems like you do have the genre information. With that in mind, you have two options. First, you could pretend you don't have the genre information; in this case you would apply k-means to the rest of the data, and after the clustering is done, you could evaluate how well the algorithm did by comparing its clusters to the known genres. Second, you could treat this problem as a classification problem, in which case you would use the genre information to learn a model that can predict the genre. You might already know this, but I just wanted to say it.
To give you some advice on the clustering problem specifically, I would first want to know what the 'r1', ..., 'r10' variables represent. Are they numeric or categorical? K-means has two steps: one where you assign each data point to the nearest centroid, and one where you recalculate each centroid as the mean of all data points in its cluster. Does taking the mean of these variables make sense?
With that in mind, I would recommend first choosing the variables that you want to use in the clustering algorithm. Then write the following functions: one that can calculate the distance between two points, one that can assign an observation to the nearest centroid, and one that can recalculate the centroids based on the assignments.
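A bare-bones sketch of those three functions, for learning purposes (in practice you would just call kmeans()):

euclid <- function(a, b) sqrt(sum((a - b)^2))  # distance between two points

assign_clusters <- function(X, centroids) {    # nearest centroid per observation
  apply(X, 1, function(row) which.min(apply(centroids, 1, euclid, b = row)))
}

update_centroids <- function(X, assignment, k) {  # mean of each cluster
  t(sapply(1:k, function(j) colMeans(X[assignment == j, , drop = FALSE])))
}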
I have data on a lot of students who got selected by some colleges based on their marks. I am new to machine learning. Can I have some suggestions on how I can use Azure Machine Learning to predict the colleges that students can get into based on their marks?
Try a multi-class logistic regression - also look at this https://gallery.cortanaanalytics.com/Experiment/da44bcd5dc2d4e059ebbaf94527d3d5b?fromlegacydomain=1
Apart from logistic regression, as @neerajkh suggested, I would also try one-vs-all classifiers. This method tends to work very well in multiclass problems with many inputs (which, I assume, are the marks of the students) and many outputs (the different colleges).
To implement the one-vs-all algorithm I would use Support Vector Machines (SVM). It is one of the most powerful algorithms (until deep learning came onto the scene, but you don't need deep learning here).
If you could consider changing frameworks, I would suggest using Python libraries. In Python it is very straightforward to solve the problem you are facing quickly.
Use random forests and feed this ML algorithm into OneVsRestClassifier, which is a multi-class classifier.
Keeping in line with other posters' suggestions of using multi-class classification, you could use artificial neural networks (ANNs)/multilayer perceptrons to do this. Each output node could be a college, and because you would be using a sigmoid (logistic) transfer function, the output of each node could be viewed directly as the probability of that college accepting a particular student (when making predictions).
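A minimal sketch with the nnet package, using softmax outputs (a close relative of the per-node sigmoid described above); marks (numeric matrix) and college (factor) are placeholders:

library(nnet)
targets <- class.ind(college)                   # one 0/1 column per college
fit <- nnet(x = marks, y = targets, size = 10,  # 10 hidden units, arbitrary
            softmax = TRUE, maxit = 500)
probs <- predict(fit, marks)                    # per-college probabilities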
Why don't you try softmax regression?
In extremely simple terms, softmax takes an input and produces a probability distribution over your classes. In other words, based on some input (grades in this case), your model can output the probability distribution that represents the "chance" a given student has of being accepted to each college.
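In R, softmax (multinomial logistic) regression is available as multinom in the nnet package; df with columns college and marks is a placeholder for your data:

library(nnet)
fit <- multinom(college ~ marks, data = df)
predict(fit, newdata = df, type = "probs")  # one probability per college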
I know this is an old thread but I will go ahead and add my 2 cents too.
I would recommend a multi-class, multi-label classifier. This allows you to find more than one college for a student. It is of course much easier to do this with an ANN, but an ANN is much harder to configure (say, the configuration of the network: the number of nodes/hidden nodes, or even the activation function).
The easiest way to do this, as @Hoap Humanoid suggests, is to use a Support Vector Classifier.
To do any of these methods, it's a given that you have a well-diversified data set. I can't say how many data points you need; you have to experiment. But the accuracy of the model depends on the number of data points and their diversity.
This is very subjective. Just applying any algorithm that classifies into categories won't be a good idea. Without performing exploratory data analysis and checking the following things (apart from missing values), you can't be sure you are doing sound predictive analytics:
Quantitative and qualitative variables.
Univariate, bivariate and multivariate distributions.
Each variable's relationship to your response (college) variable.
Outliers (multivariate and univariate).
Required variable transformations.
Can the Y variable be broken down into chunks, for example by location: e.g. whether a candidate can be part of colleges in California or New York, and if California is more likely, then which college. In this way you could capture linear + non-linear relationships.
For base learners you can fit a softmax regression model or one-vs-all logistic regression (it does not matter much which) and CART for non-linear relationships. I would also run k-NN and k-means to check for different groups within the data, and decide on predictive learners from there.
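A sketch of those quick checks; X (numeric feature matrix) and grades$college are placeholders for your data:

library(class)
table(grades$college)                            # distribution of the response
groups <- kmeans(scale(X), centers = 3)$cluster  # look for natural groups
table(groups, grades$college)                    # do the groups track colleges?
pred <- knn(train = X, test = X, cl = grades$college, k = 5)  # crude k-NN sanity check
mean(pred == grades$college)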
I hope this makes sense!
The least-squares support vector machine (LSSVM) is a powerful algorithm for this application. Visit http://www.esat.kuleuven.be/sista/lssvmlab/ for more information.