I have 50 predictors and 1 target variable. All of my predictors and the target variable are binary (0s and 1s). I am performing my analysis in R.
I will be implementing four algorithms.
1. RF
2. Log Reg
3. SVM
4. LDA
I have the following questions:
I converted them all into factors. How should I treat my variables beforehand, before feeding them into the other algorithms?
I used the caret package to train my model, but it takes a very long time. I practice ML regularly, yet I don't know how to proceed when every variable is binary.
How do I remove collinear variables?
I'm mostly a Python user rather than an R user, but the common approach is the same:
1. Check your columns. Remove a column if zeroes or ones make up more than 95% of its values (you can later tighten this so the minority value only needs 2.5% or even 1%). See the R sketch after this list.
2. Run a simple Random Forest with default settings and get the feature importances. Columns that turn out to be unnecessary you can then process with LDA.
3. Check the target column. If it is highly unbalanced, try oversampling or downsampling, or use classification methods that can handle an unbalanced target (like XGBoost).
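A minimal R sketch of steps 1 and 2, assuming hypothetical objects X (a data frame of factor predictors) and y (the factor target); caret's default freqCut of 95/5 matches the 95% rule of thumb above:

library(caret)
library(randomForest)

# Step 1: drop near-constant columns.
nzv <- nearZeroVar(X, freqCut = 95/5)
if (length(nzv) > 0) X <- X[, -nzv, drop = FALSE]

# Step 2: a default random forest, used only to rank features by importance.
rf  <- randomForest(x = X, y = y, importance = TRUE)
imp <- importance(rf, type = 1)   # mean decrease in accuracy
head(imp[order(imp, decreasing = TRUE), , drop = FALSE], 10)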
For the regression model (logistic regression here) you'll need to calculate the correlation matrix and remove correlated columns; the other methods can live without this.
Check whether your SVM (or SVC) implementation supports all features being boolean. It usually works very well for binary classification.
I also advise trying a neural network.
PS About collinear variables: I wrote some Python code for my project. It's simple, and you can do the same:
- plot the correlation matrix
- find the pairs whose correlation is over some threshold
- remove the column of each pair that has the lower correlation with the target variable (also check that the column you want to remove is not important; otherwise try another way, perhaps combining the columns)
In my code I ran this algorithm iteratively for different thresholds, from 0.99 down to 0.9, and it works well. An R sketch follows.
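A rough R translation of that procedure, assuming a numeric 0/1 matrix X and a numeric 0/1 target y (hypothetical names):

# Iteratively drop one column of each highly correlated pair, keeping the
# member that is more correlated with the target; thresholds go 0.99 -> 0.90.
drop_collinear <- function(X, y, thresholds = seq(0.99, 0.90, by = -0.01)) {
  for (thr in thresholds) {
    repeat {
      cm <- abs(cor(X))
      diag(cm) <- 0
      pairs <- which(cm > thr, arr.ind = TRUE)
      if (nrow(pairs) == 0) break
      i <- pairs[1, 1]; j <- pairs[1, 2]
      # remove the member of the pair with the weaker correlation to the target
      drop <- if (abs(cor(X[, i], y)) < abs(cor(X[, j], y))) i else j
      X <- X[, -drop, drop = FALSE]
    }
  }
  X
}
# X_reduced <- drop_collinear(X, y)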
I tried several ways of selecting predictors for a logistic regression in R. I used lasso logistic regression to get rid of irrelevant features, cutting their number from 60 to 24, then I used those 24 variables in my stepAIC logistic regression, after which I cut one more variable with a p-value of approximately 0.1. What other feature selection methods can or even should I use? I tried to look for an ANOVA correlation coefficient analysis, but I didn't find any examples for R. And I think I cannot use a correlation heatmap in this situation since my output is categorical? I have seen some sources recommending lasso and stepAIC, and others criticising them, but I didn't find any definitive, comprehensive alternative, which left me confused.
Given the methodological nature of your question you might also get a more detailed answer at Cross Validated: https://stats.stackexchange.com/
From the information you provided, 23-24 independent variables seems like quite a number to me. If you do not have a large sample, remember that overfitting might be an issue (i.e. a low cases-to-variables ratio). Indications of overfitting are large parameter estimates and standard errors, or failure of convergence, for instance. You have obviously already used stepwise variable selection with stepAIC, which would also have been my first try if I had chosen to let the model do the variable selection.
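For reference, a minimal stepwise-selection sketch with MASS::stepAIC on a logistic model, assuming a hypothetical data frame dat with a binary outcome y:

library(MASS)

# Full logistic model on all candidate predictors, then selection by AIC.
full <- glm(y ~ ., data = dat, family = binomial)
sel  <- stepAIC(full, direction = "both", trace = FALSE)
summary(sel)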
If you spot any issues with standard errors/parameter estimates further options down the road might be to collapse categories of independent variables, or check whether there is any evidence of multicollinearity which could also result in deleting highly-correlated variables and narrow down the number of remaining features.
Apart from a strictly mathematical approach you might also want to identify features that are likely to be related to your outcome of interest according to your underlying content hypothesis and your previous experience, meaning to look at the model from your point of view as expert in your field of interest.
If sample size is not an issue and the point is to reduce the number of features, you may consider running a principal component analysis (PCA) to find out about highly correlated features and doing the regression on the principal components instead, which are uncorrelated linear combinations of your "old" features. The factoextra package (built on top of base R's prcomp or princomp) can help: http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/118-principal-component-analysis-in-r-prcomp-vs-princomp/
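A minimal sketch of that idea, assuming a hypothetical data frame dat with a binary outcome y and numeric predictors:

# PCA on the predictors, then logistic regression on the leading components.
X   <- dat[, setdiff(names(dat), "y")]
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# keep enough components to explain, say, ~90% of the variance
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- which(var_explained >= 0.90)[1]

pc_dat <- data.frame(y = dat$y, pca$x[, 1:k, drop = FALSE])
fit <- glm(y ~ ., data = pc_dat, family = binomial)
summary(fit)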
I have predictions for my data from different classifiers. I would like to ensemble their results in order to get a better final result. Is that possible in R?
Let's say:
SVMpredict=[1 0 0 1...]
RFpredict=[1 1 0 1 ...]
NNpredict=[0 0 0 1 ...]
Is it possible to combine the results with some ensemble technique in R? How?
Thanks.
Edited:
I run my classifiers on different samples (in my case, DNA chromosomes). In some samples SVM works better than the others, such as RF. I want a technique that ensembles the results while taking into account which classifier works better.
For example, if I simply take the average of the output probabilities and round them, all classifiers are treated as equally effective. But when SVM works better, the SVM results (86% accuracy) should get, say, 60% of the weight, RF (72% accuracy) 25%, and NN (64% accuracy) 15%. (These numbers are just examples for clarification.)
Is there any way I can do that?
It depends on the structure of your classifiers' output.
If it is a {0,1} outcome, as you provided, you can just take the mean of the predictions and round it:
round((SVMpredict+RFpredict+NNpredict)/3)
If you know the performance of the classifiers, a weighted mean is a good idea - favour the ones that perform better. A hardcore approach is to optimise the weights via the optim function.
If you know the class probabilities for each prediction, it is better to average those instead of letting the classifiers just vote (the {0,1} output case above).
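A minimal sketch of an accuracy-weighted ensemble along those lines, using made-up prediction vectors and the example accuracies from the question:

# 0/1 predictions (or class probabilities) from each model.
SVMpredict <- c(1, 0, 0, 1)
RFpredict  <- c(1, 1, 0, 1)
NNpredict  <- c(0, 0, 0, 1)

# Weights proportional to validation accuracy (the fixed 0.60/0.25/0.15 split
# from the question would work the same way).
acc <- c(svm = 0.86, rf = 0.72, nn = 0.64)
w   <- acc / sum(acc)

ensemble <- w["svm"] * SVMpredict + w["rf"] * RFpredict + w["nn"] * NNpredict
final    <- as.integer(ensemble >= 0.5)
final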
This question is more about statistics than R programming, though as I am a beginning user of R, I would especially appreciate any thoughts in the context of R; thanks for considering it:
The outcome variable in one of our linear models (lm) is waist circumference, which is missing in about 20% of our dataset. Last year a model was published which reliably estimates waist circumference from BMI, age, and gender (all of which we do have). I'd like to use this model to impute the missing waist circumferences in our data, but I want to make sure I incorporate the known error in that estimation model. The standard error of the intercept and of each coefficient has been reported.
Could you suggest how I might go about responsibly imputing (or perhaps a better word is estimating) the missing waist circumferences and evaluating any effect on my own waist circumference prediction models?
Thanks again for any coding strategy.
As Frank has indicated, this question has a strong stats flavor to it. But one possible solution does indeed entail some sophisticated programming, so perhaps it's legitimate to put it in an R thread.
In order to "incorporate the known error in that estimation", one standard approach is multiple imputation, and if you want to go this route, R is a good way to do it. It's a little involved, so you'll have to work out the specifics of the code for yourself, but if you understand the basic strategy it's relatively straightforward.
The basic idea is that for every subject in your dataset you impute the waist circumference by first using the published model and the BMI, age, and gender to determine the expected value, and then you add some simulated random noise to that; you'll have to read through the publication to determine the numerical value of that noise. Once you've filled in every missing value, you perform whatever statistical computation you want to run, and save the standard errors.

Now you create a second dataset, derived from your original dataset with missing values, once again using the published model to impute the expected values, along with some random noise -- since the noise is random, the imputed values for this dataset should be different from the imputed values for the first dataset. Now do your statistical computation, and save the standard errors, which will be a little different from those of the first imputed dataset, since the imputed values contain random noise. Repeat a bunch of times.

Finally, average the saved standard errors; this gives you an estimate of the standard error that incorporates the uncertainty due to the imputation.
What you're doing is essentially a two-level simulation: on a low level, for each iteration you are using the published model to create a simulated dataset with noisy imputed values for missing data, which then gives you a simulated standard error, and then on a high level you repeat the process to obtain a sample of such simulated standard errors, which you then average to get your overall estimate.
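A rough sketch of that two-level loop, assuming a hypothetical data frame dat with columns waist (partly missing), bmi, age, male, and predictors x1, x2 for your own model; the coefficients b and residual SD sigma below are placeholders, not the published values:

set.seed(1)
M <- 50                       # number of imputed datasets
saved_se <- numeric(M)

# Placeholder coefficients and residual SD standing in for the published model.
b <- c(intercept = 10, bmi = 2.0, age = 0.1, male = 5.0)
sigma <- 6

miss <- is.na(dat$waist)

for (m in seq_len(M)) {
  imp <- dat
  mu  <- b["intercept"] + b["bmi"] * imp$bmi[miss] +
         b["age"] * imp$age[miss] + b["male"] * imp$male[miss]
  # add simulated random noise so every imputed dataset differs
  imp$waist[miss] <- mu + rnorm(sum(miss), mean = 0, sd = sigma)

  fit <- lm(waist ~ x1 + x2, data = imp)            # your own prediction model
  saved_se[m] <- summary(fit)$coefficients["x1", "Std. Error"]
}

mean(saved_se)   # averaged standard error across the imputations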
This is a pain to do in traditional stats packages such as SAS or Stata, although it IS possible, but it's much easier to do in R because it's based on a proper programming language. So, yes, your question is properly speaking a stats question, but the best solution is probably R-specific.
I am doing a regression task - do I need to normalize (or scale) the data for randomForest (the R package)? And is it necessary to scale the target values as well?
And if so - I wanted to use the scale function from the caret package, but I did not find out how to get the data back (descale, denormalize). Do you know of some other function (in any package) that helps with normalization/denormalization?
Thanks,
Milan
No, scaling is not necessary for random forests.
The nature of RF is such that convergence and numerical precision issues, which can sometimes trip up the algorithms used in logistic and linear regression, as well as neural networks, aren't so important. Because of this, you don't need to transform variables to a common scale like you might with a NN.
You don't get any analogue of a regression coefficient, which measures the relationship between each predictor variable and the response. Because of this, you also don't need to consider how to interpret such coefficients, which is something that is affected by variable measurement scales.
Scaling is done to normalize data so that no single feature is given undue priority.
Scaling matters mostly in algorithms that are distance-based and rely on Euclidean distance.
Random Forest is a tree-based model and hence does not require feature scaling.
This algorithm is based on partitioning, so even if you apply normalization the result will be the same.
I do not see anything in either the help page or the vignette that suggests scaling is necessary for a regression variable in randomForest. This example at Stats Exchange does not use scaling either.
Copy of my comment: The scale function does not belong to pkg:caret. It is part of the "base" R package. There is an unscale function in packages grt and DMwR that will reverse the transformation, or you could simply multiply by the scale attribute and then add the center attribute values.
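A minimal sketch of the base-R route mentioned above: scale a numeric matrix, then undo the transformation using the attributes that scale() stores.

x <- matrix(rnorm(20, mean = 50, sd = 10), ncol = 2)

xs  <- scale(x)                      # centers and scales; the parameters are kept as attributes
ctr <- attr(xs, "scaled:center")
scl <- attr(xs, "scaled:scale")

# reverse the transformation: multiply by the scale, then add the center
x_back <- sweep(sweep(xs, 2, scl, "*"), 2, ctr, "+")
all.equal(as.numeric(x), as.numeric(x_back))   # TRUE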
Your conception of why "normalization" needs to be done may require critical examination. A test of non-normality is only needed after the regressions are done, and may not be needed at all if there are no assumptions of normality in the goodness-of-fit methodology. So: why are you asking? Searching SO and Stats.Exchange might prove useful:
citation #1 ; citation #2 ; citation #3
The boxcox function is a commonly used transformation when one does not have prior knowledge of what a distribution "should" be and when you really need to do a transformation. There are many pitfalls in applying transformations, so the fact that you need to ask the question raises concerns that you may be in need of further consultation or self-study.
Guess what will happen in the following example?
Imagine you have 20 predictive features: 18 of them are in the [0; 10] range and the other 2 are in the [0; 1,000,000] range (taken from a real-life example). Question 1: what feature importances will Random Forest assign? Question 2: what will happen to those importances after scaling the 2 large-range features?
Scaling is important. It is just that Random Forest is less sensitive to scaling than other algorithms and can work with "roughly"-scaled features.
If you are going to add interactions to the dataset - that is, new variables that are some function of other variables (usually a simple multiplication) - and you don't have a feel for what the new variable stands for (can't interpret it), then you should calculate this variable using the scaled variables.
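A small experiment for answering those two questions yourself (a sketch with simulated data; the variable names and the data-generating rule are made up):

library(randomForest)
set.seed(42)

n <- 1000
X <- as.data.frame(matrix(runif(n * 18, 0, 10), ncol = 18))
X$big1 <- runif(n, 0, 1e6)
X$big2 <- runif(n, 0, 1e6)
y <- factor(rbinom(n, 1, plogis(as.numeric(scale(X$V1)) + as.numeric(scale(X$big1)))))

rf_raw    <- randomForest(X, y, importance = TRUE)
rf_scaled <- randomForest(as.data.frame(scale(X)), y, importance = TRUE)

# compare permutation importances with and without scaling the wide-range features
round(cbind(raw    = importance(rf_raw, type = 1)[, 1],
            scaled = importance(rf_scaled, type = 1)[, 1]), 2)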
Random Forest inherently uses information gain / the Gini coefficient, which is not affected by scaling, unlike many other machine learning models (such as k-means clustering, PCA, etc.). However, it might 'arguably' speed up convergence, as hinted in other answers.
Recently I watched a lot of Stanford's hilarious Open Classroom video lectures. The part about unsupervised machine learning particularly got my attention. Unfortunately it stops where it might get even more interesting.
Basically I am looking to classify discrete matrices with an unsupervised algorithm. Those matrices just contain discrete values of the same range. Let's say I have thousands of 20x15 matrices with values ranging from 1-3. I just started to read through the literature and I feel that image classification is way more complex (color histograms) and that my case is rather a simplification of what is done there.
I also looked at the Machine Learning and Cluster Cran Task Views but do not know where to start with a practical example.
So my question is: which package / algorithm would be a good pick to start playing around and working on the problem in R?
EDIT:
I realized that I might have been too imprecise: my matrices contain discrete choice data - so mean-based clustering might(!) not be the right idea. I do understand what you said about vectors and observations, but I am hoping for some function that accepts matrices or data.frames, because I have several observations over time.
EDIT2:
I realize that a package/function or an introduction that focuses on unsupervised classification of categorical data is what would help me the most right now.
... classify discrete matrices by an unsupervised algorithm
You must mean cluster them. Classification is commonly done by supervised algorithms.
I feel that image classification is way more complex (color histograms) and that my case is rather a simplification of what is done there
Without knowing what your matrices represent, it's hard to tell what kind of algorithm you need. But a starting point might be to flatten your 20*15 matrices to produce length-300 vectors; each element of such a vector would then be a feature (or variable) to base a clustering on. This is the way most ML packages, including the Cluster package you link to, work: "In case of a matrix or data frame, each row corresponds to an observation, and each column corresponds to a variable."
So far I have found daisy from the cluster package, specifically its "gower" metric, which refers to Gower's similarity coefficient and can handle mixed types of data. Gower seems to be a fairly old distance metric, but it is still what I found for use with categorical data.
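A minimal sketch of that route, assuming the 20x15 matrices live in a hypothetical list mats: flatten each matrix into one length-300 row, treat every cell as a categorical variable, and cluster on Gower dissimilarities.

library(cluster)

# flatten each 20x15 matrix into one row of 300 values
flat <- t(sapply(mats, as.vector))
df   <- as.data.frame(flat)
df[] <- lapply(df, factor)           # values 1-3 treated as categories, not numbers

d   <- daisy(df, metric = "gower")   # Gower dissimilarity handles factors
fit <- pam(d, k = 4)                 # k = 4 clusters is an arbitrary choice here
table(fit$clustering)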
You might want to start from here : http://cran.r-project.org/web/views/MachineLearning.html