How should I select features for logistic regression in R? [closed] - r

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 1 year ago.
I tried several ways of selecting predictors for a logistic regression in R. I used lasso logistic regression to get rid of irrelevant features, cutting their number from 60 to 24, then I used those 24 variables in my stepAIC logistic regression, after which I further cut 1 variable with a p-value of approximately 0.1. What other feature selection methods can I, or even should I, use? I tried to look for ANOVA correlation coefficient analysis, but I didn't find any examples for R. And I think I cannot use a correlation heatmap in this situation since my output is categorical? I have seen some sources recommending lasso and stepAIC, and others criticising them, but I didn't find any definitive, comprehensive alternative, which left me confused.

Given the methodological nature of your question you might also get a more detailed answer at Cross Validated: https://stats.stackexchange.com/
From the information you provided, 23-24 independent variables seems like quite a lot to me. If you do not have a large sample, remember that overfitting might be an issue (i.e. a low cases-to-variables ratio). Indications of overfitting are large parameter estimates and standard errors, or failure of convergence, for instance. You have evidently already used stepwise variable selection with stepAIC, which would also have been my first try had I chosen to let the model do the variable selection.
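A minimal sketch of the AIC-driven stepwise selection mentioned above, on simulated data. It uses base R's step(), which behaves much like MASS::stepAIC; the data, the variable names x1..x6 and the effect sizes are all made up for illustration. It also computes a rough cases-per-variable ratio for the rarer outcome class.

```r
# Simulated data (hypothetical): 200 cases, 6 candidate predictors,
# only x1 and x2 actually influence the outcome.
set.seed(1)
n <- 200
dat <- as.data.frame(matrix(rnorm(n * 6), ncol = 6))
names(dat) <- paste0("x", 1:6)
dat$y <- rbinom(n, 1, plogis(-0.5 + 1.2 * dat$x1 - 1.0 * dat$x2))

# Full logistic model, then a stepwise search driven by AIC.
full <- glm(y ~ ., data = dat, family = binomial)
sel  <- step(full, direction = "both", trace = 0)

# Cases in the rarer class per candidate predictor (rough overfitting check).
cases_per_variable <- min(table(dat$y)) / (length(coef(full)) - 1)

print(formula(sel))
print(cases_per_variable)
```

The selected model can never have a higher AIC than the starting model, but that alone says nothing about how well it will predict on new data.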
If you spot any issues with standard errors or parameter estimates, further options are to collapse categories of independent variables, or to check for evidence of multicollinearity, which could lead you to delete highly correlated variables and so narrow down the number of remaining features.
Apart from a strictly mathematical approach, you might also want to identify features that are likely to be related to your outcome of interest according to your underlying content hypothesis and your previous experience, that is, to look at the model from your point of view as an expert in your field.
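As a rough illustration of a multicollinearity check in base R: the correlation matrix below involves only the predictors, so a categorical outcome does not get in the way. The variance inflation factor is computed by hand as 1 / (1 - R^2); the data and the names x1, x2, x3 are simulated and hypothetical.

```r
set.seed(2)
n  <- 150
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)   # nearly a copy of x1
x3 <- rnorm(n)                  # unrelated to the others
X  <- data.frame(x1, x2, x3)

# Pairwise correlations among the predictors only.
print(round(cor(X), 2))

# Variance inflation factor: regress each predictor on all the others.
vif <- sapply(names(X), function(v) {
  f  <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif, 1))
```

Large values for x1 and x2 flag them as near-duplicates; which threshold counts as "too large" is a judgment call.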
If sample size is not an issue and the point is to reduce the number of features, you may consider running a principal component analysis (PCA) to find highly correlated features, and then do the regression on the principal components instead, which are uncorrelated linear combinations of your "old" features. The PCA itself is done with the base R functions prcomp or princomp; the factoextra package helps with visualising the results: http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/118-principal-component-analysis-in-r-prcomp-vs-princomp/
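A small sketch of the PCA-then-regression idea, using only base R. The five correlated features, the latent factor z and the choice of keeping two components are all assumptions made for the example.

```r
set.seed(3)
n <- 300
z <- rnorm(n)                                     # shared latent factor
X <- sapply(1:5, function(j) z + rnorm(n, sd = 0.5))
colnames(X) <- paste0("x", 1:5)
y <- rbinom(n, 1, plogis(0.8 * z))

# PCA on centred, standardised features.
pca <- prcomp(X, center = TRUE, scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Use the first two component scores as predictors in the logistic model.
scores   <- as.data.frame(pca$x[, 1:2])
scores$y <- y
fit <- glm(y ~ PC1 + PC2, data = scores, family = binomial)

print(round(var_explained, 2))
print(coef(summary(fit)))
```

The components are uncorrelated by construction, but each one mixes all original features, so the coefficients are harder to interpret than those of the original variables.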

Related

Clarifying the aim of linear regression with multiple predictor variables and how to plot using ggplot2 [closed]

Closed 10 months ago.
I'm trying to learn the intricacies of linear regression for prediction, and I'd like to ask two questions:
I've got one dependent variable (call it X) and, let's say, ten independent variables. I can use lm() to generate a model. But my question is this: is the aim of generating a model (or, more likely, multiple models) to identify the single best predictor of X, or is the aim to discover the best combination of predictors of X? I assumed the latter, but after several hours of reading online I am now unsure.
If the aim is to discover the best combination of predictors of X, then (once I've identified that combination) how is a combination plotted properly? Plotting one line is easy, but for a combination would it be proper to (a) plot ten distinct regression lines (one per independent variable) or (b) plot a single line that somehow represents the combination? I've provided the summary() I'm working with in case it facilitates answering this question.
Is the aim of generating a model (or, more likely, multiple models) to identify the single best predictor of X, or is the aim to discover the best combination of predictors of X?
This depends mainly on the situation/context you are in. If you are always going to have access to these predictors, then yes, you'd like to identify the best model that will (likely) use a combination of these predictors. Obviously you want to keep in mind issues like overfitting and make sure the predictors you include are actually contributing something meaningful to your model, but there's no reason not to include multiple predictors if they make your model meaningfully better.
However, in many real world scenarios predictors are not free. It might cost $10,000 to collect each predictor and the organization you are working for only has the budget to collect one predictor. Thus, you might only be interested in the single best predictor because it is not practical to collect more than one going forward. In this case you'd also just be interested in how well that variable predicts in a simple regression, not a multiple regression, since you won't be controlling for other variables in the future anyway (but looking at the multiple regression results could still provide insight).
how is a combination plotted properly?
Again, this depends on context. However, in most cases you probably don't want to plot 10 regression lines because that's too overwhelming to look at and you will probably never have 10 variables that meaningfully contribute to your model. I'm actually kind of surprised your adjusted R^2 is not lower given you have quite a few variables so close to zero, unless they're just on massive scales.
First, who is viewing this graph? Is it you? If so, what information do you need to see that isn't being conveyed by the beta parameters? If it's someone else, who are they? Are they a stakeholder who knows nothing about statistics? If that's the case, you want a pretty simple graph that drives home your main point. Second, what is the purpose of your predictions and how does the process you are predicting unfold in the real world? Let's say I'm predicting how well people perform on the job given their scores on some different selection measures. The first thing you need to consider is, how is that selection happening? Are candidates screened on their answers to some personality questions and only the top scorers get an interview? In that case, it might be useful to create multiple graphs that show that process. However, candidates might be reviewed holistically and assigned a sum score based on all these predictors. In that case one regression line makes sense because you are interested in how these predictors act in concert.
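For the "one regression line" case, one possible sketch is to plot the outcome against the model's fitted values, which are the weighted combination of all predictors. The data, the names x1, x2, x3 and perf are hypothetical, and the plotting call is left commented out.

```r
set.seed(4)
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$perf <- 2 + 1.5 * d$x1 + 0.7 * d$x2 + rnorm(n)

fit <- lm(perf ~ x1 + x2 + x3, data = d)
d$composite <- fitted(fit)     # weighted combination of all predictors

# Regressing the outcome on the composite gives the single summary line.
single_line <- lm(perf ~ composite, data = d)
print(coef(single_line))

# plot(d$composite, d$perf); abline(single_line)   # uncomment to draw it
```

For a least-squares fit this line has intercept 0 and slope 1, and its R^2 equals that of the multiple regression, so the scatter around it shows how well the predictors do in concert.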
There is no one answer to this question because the answers depend on the reason you're doing a regression in the first place. Once you identify the reason you're trying to predict this thing and the context that the process is happening in you should probably be able to determine what makes most sense. There is no "right" answer you'll find in a textbook because most real life problems are not in textbooks.

How to specify a vector full of means with degrees of freedom for a Lack of Fit F-Test in R

Currently I'm working through Applied Linear Statistical Models, $5^{th}$ ed., by Kutner et al. A question I'm working on asks me to perform an F-test for lack of fit on my linear model. The model is a simple linear regression with one predictor, nothing too troublesome.
To perform the test one has to assess the difference between the full model and the reduced model. At this point the authors say to take the full model as $\hat{\mu}_j = \bar{Y}_j$. Specifically the screenshot below says the following:
The reduced model would be the simple linear model:
I have no problem being able to do this manually within R, by computing the necessary values where need be as I've done for other questions. But I'm trying to improve my R skill set and this is where my problem lies.
I have read other answers related to this, and model comparison can be done directly with the anova() function. But I'm having trouble stating my full model correctly so that I can use anova(). I thought about computing a "vector of means" for the subgroups of data (which I display here just for completeness)
But I'm going to run into the problem of the anova() function most likely not being able to compute the degrees of freedom correctly. My data set is very small and this seems like the sort of situation that would show up all the time. With huge data sets I wouldn't see it being feasible to compute things manually so surely there has to be a way for me to phrase my Full Model properly to allow for the computation of means from the subgroups of replicates. But how do I do so? is the question of the day.
For completeness and posterity an answer was given on a sister site I asked this question on:
https://stats.stackexchange.com/questions/539958/how-to-specify-a-vector-full-of-means-with-degrees-of-freedom-for-a-lack-of-fit
The mods can delete the question if they see fit and it doesn't contribute to the community.
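One common way to phrase the full model for anova(), which may or may not match the linked answer, is to treat the predictor as a factor so that each x level gets its own mean. A minimal sketch with hypothetical replicate data:

```r
# Hypothetical data with two replicate observations at each x level.
x <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
y <- c(2.1, 1.9, 4.2, 3.8, 5.1, 5.5, 8.3, 7.9, 9.6, 10.4)

reduced <- lm(y ~ x)           # simple linear regression
full    <- lm(y ~ factor(x))   # one mean per x level, i.e. mu_j estimated by Ybar_j

lof <- anova(reduced, full)    # F-test for lack of fit
print(lof)

# Manual check of the same F statistic.
sspe     <- sum((y - ave(y, x))^2)     # pure-error sum of squares
sse      <- sum(resid(reduced)^2)      # error sum of squares, reduced model
n_levels <- length(unique(x))
f_manual <- ((sse - sspe) / (n_levels - 2)) / (sspe / (length(y) - n_levels))
print(f_manual)
```

With the factor coding, anova() works out the degrees of freedom itself (here n - 2 for the reduced model and n - c for the full one, with c the number of distinct x levels).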

DESeq2 design matrix including RIN as covariate in the formula [closed]

Closed 2 years ago.
I have been following the latest DESeq2 pipeline to perform an RNA-seq analysis. My problem is that the RIN of the experimental samples is quite low compared to the control ones. I read a paper in which the authors perform RNA-seq analysis with time-course RNA degradation and conclude that including the RIN value as a covariate can mitigate some of the effects of low RIN in samples.
My question is how I should construct the design in the DESeq2 object:
~conditions+rin
~conditions*rin
~conditions:rin
none of them... :)
I cannot find proper resources that explain how to construct these models (I am new to the field...) and I admit I have hit a wall with this kind of thing. I would also appreciate some links to good resources that would help me understand which one is correct and why.
Thank you so much
Turns out to be quite long for typing in a comment.
It depends on your data.
First of all, counts ~ conditions:rin does not make much sense in your case, because conditions is categorical. It rarely makes sense to fit a model with only an interaction term.
I would go with counts ~ conditions + rin. This assumes there is a condition effect and a linear effect of RIN, and that the dependence of the counts on RIN is the same in both conditions.
As you mentioned, RIN in one of the conditions is quite low, but is there any reason to suspect that the relationship between RIN and counts differs between the two conditions? If you fit counts ~ conditions * rin, you are assuming a condition effect and a RIN effect that differs between conditions, meaning a different slope for RIN in each condition if you plot counts against RIN. You need to pull out a few genes and see whether this is true. Also, to fit this model you need quite a lot of samples to estimate the effects accurately. See if both of these hold.
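To see what each candidate design actually estimates, it can help to look at the columns of the model matrix that R builds from the formula. The sketch below uses base R's model.matrix() on a hypothetical sample table (six samples, made-up RIN values); it does not run DESeq2 itself.

```r
# Hypothetical sample table: 3 controls with high RIN, 3 treated with lower RIN.
samples <- data.frame(
  conditions = factor(c("control", "control", "control",
                        "treated", "treated", "treated")),
  rin = c(9.1, 8.7, 9.4, 6.2, 5.8, 7.0)
)

additive    <- model.matrix(~ conditions + rin, data = samples)
interact    <- model.matrix(~ conditions * rin, data = samples)
slopes_only <- model.matrix(~ conditions:rin,  data = samples)

print(colnames(additive))     # intercept, condition effect, common RIN slope
print(colnames(interact))     # as above plus a slope difference between conditions
print(colnames(slopes_only))  # intercept and one RIN slope per condition, no condition effect
```

The last design has no column for the condition effect itself, which is why it is usually not what you want when the condition is the comparison of interest.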

How to estimate missing DV using its own estimation model within a linear model? [closed]

Closed 9 years ago.
This question is more about statistics than R programming, though as I am a beginning user of R, I would especially appreciate any thoughts in the context of R; thanks for considering it:
The outcome variable in one of our linear models (lm) is waist circumference, which is missing in about 20% of our dataset. Last year a model was published which reliably estimates waist circumference from BMI, age, and gender (all of which we do have). I'd like to use this model to impute the missing waist circumferences in our data, but I'm wanting to make sure I incorporate the known error in that estimation model. The standard error of the intercept and of each coefficient has been reported.
Could you suggest how I might go about responsibly imputing (or perhaps a better word is estimating) the missing waist circumferences and evaluating any effect on my own waist circumference prediction models?
Thanks again for any coding strategy.
As Frank has indicated, this question has a strong stats flavor to it. But one possible solution does indeed entail some sophisticated programming, so perhaps it's legitimate to put it in an R thread.
In order to "incorporate the known error in that estimation", one standard approach is multiple imputation, and if you want to go this route, R is a good way to do it. It's a little involved, so you'll have to work out the specifics of the code for yourself, but if you understand the basic strategy it's relatively straightforward.
The basic idea is that for every subject in your dataset you impute the waist circumference by first using the published model and the BMI, age, and gender to determine the expected value, and then you add some simulated random noise to that; you'll have to read through the publication to determine the numerical value of that noise. Once you've filled in every missing value, then you just perform whatever statistical computation you want to run, and save the standard errors. Now, you create a second dataset, derived from your original dataset with missing values, once again using the published model to impute the expected values, along with some random noise -- since the noise is random, the imputed values for this dataset should be different from the imputed values for the first dataset. Now do your statistical computation, and save the standard errors, which will be a little different than those from the first imputed dataset, since the imputed values contain random noise. Repeat for a bunch of times. Finally, average the saved standard errors, and this will give you an estimate for the standard error incorporating the uncertainty due to the imputation.
What you're doing is essentially a two-level simulation: on a low level, for each iteration you are using the published model to create a simulated dataset with noisy imputed values for missing data, which then gives you a simulated standard error, and then on a high level you repeat the process to obtain a sample of such simulated standard errors, which you then average to get your overall estimate.
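A rough sketch of the simulation loop described above, in base R. Everything numeric here is hypothetical: the "published" coefficients, their standard errors, the residual SD, the simulated data and the analysis model with a made-up predictor called exposure. Coefficients are drawn independently because only their standard errors are assumed to be reported. The answer averages the saved standard errors; the usual pooling (Rubin's rules) also adds the between-imputation variance of the estimates, so both numbers are shown.

```r
set.seed(5)
n <- 200
d <- data.frame(bmi = rnorm(n, 27, 4), age = rnorm(n, 50, 10),
                male = rbinom(n, 1, 0.5), exposure = rnorm(n))
d$waist <- 20 + 2.5 * d$bmi + 0.1 * d$age + 5 * d$male +
  1.5 * d$exposure + rnorm(n, sd = 4)
miss <- sample(n, 40)            # about 20% of waist values go missing
d$waist[miss] <- NA

# HYPOTHETICAL published model: coefficients, standard errors, residual SD.
pub_coef <- c(intercept = 20, bmi = 2.5,  age = 0.1,  male = 5)
pub_se   <- c(intercept = 2,  bmi = 0.05, age = 0.02, male = 0.5)
pub_sd   <- 4

m   <- 20                        # number of imputed datasets
est <- numeric(m)
se  <- numeric(m)
for (i in seq_len(m)) {
  b  <- rnorm(4, pub_coef, pub_se)                 # coefficient uncertainty
  mu <- b[1] + b[2] * d$bmi + b[3] * d$age + b[4] * d$male
  imp <- d
  imp$waist[miss] <- mu[miss] + rnorm(length(miss), sd = pub_sd)  # random noise
  fit    <- lm(waist ~ exposure + bmi + age + male, data = imp)
  est[i] <- coef(fit)[["exposure"]]
  se[i]  <- coef(summary(fit))["exposure", "Std. Error"]
}

within    <- mean(se^2)          # average sampling variance
between   <- var(est)            # variance of estimates across imputations
pooled_se <- sqrt(within + (1 + 1 / m) * between)  # Rubin's rules
print(c(estimate = mean(est), mean_se = mean(se), pooled_se = pooled_se))
```

The pooled standard error is never smaller than the plain average, because it also counts how much the estimate moves from one imputed dataset to the next.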
This is a pain to do in traditional stats packages such as SAS or Stata, although it IS possible, but it's much easier to do in R because it's based on a proper programming language. So, yes, your question is properly speaking a stats question, but the best solution is probably R-specific.

Do I need to normalize (or scale) data for randomForest (R package)? [closed]

Closed 1 year ago.
I am working on a regression task. Do I need to normalize (or scale) data for randomForest (R package)? And is it necessary to scale the target values as well?
And if so: I want to use the scale function from the caret package, but I could not find out how to get the data back (descale, denormalize). Do you know of some other function (in any package) that helps with normalization/denormalization?
Thanks,
Milan
No, scaling is not necessary for random forests.
The nature of RF is such that convergence and numerical precision issues, which can sometimes trip up the algorithms used in logistic and linear regression, as well as neural networks, aren't so important. Because of this, you don't need to transform variables to a common scale like you might with a NN.
You also don't get any analogue of a regression coefficient, which measures the relationship between each predictor variable and the response. Because of this, you don't need to consider how to interpret such coefficients, which is something that is affected by variable measurement scales.
Scaling is done to normalize data so that priority is not given to a particular feature.
Scaling matters most in algorithms that are distance-based and rely on Euclidean distance.
Random Forest is a tree-based model and hence does not require feature scaling.
The algorithm works by partitioning the data, so even if you apply normalization the result would be the same.
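The partitioning argument can be checked on a single split. The sketch below is not a random forest, just a hand-written search for the best split of one regression-tree node (the helper best_split and the data are made up for illustration); it compares the rows sent left and right before and after scaling the feature.

```r
# Best single split on x for a regression-tree node:
# choose the cut that minimises the summed squared error of the two children.
best_split <- function(x, y) {
  vals <- sort(unique(x))
  cuts <- (head(vals, -1) + tail(vals, -1)) / 2   # midpoints between values
  sse <- sapply(cuts, function(cut) {
    left  <- y[x <= cut]
    right <- y[x > cut]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  cuts[which.min(sse)]
}

set.seed(6)
x <- runif(100, 0, 1e6)                    # large-range feature
y <- ifelse(x > 4e5, 10, 2) + rnorm(100)

cut_raw    <- best_split(x, y)
x_scaled   <- as.numeric(scale(x))
cut_scaled <- best_split(x_scaled, y)

# The threshold value changes, but the rows on each side do not.
same_partition <- identical(x <= cut_raw, x_scaled <= cut_scaled)
print(c(cut_raw = cut_raw, cut_scaled = cut_scaled))
print(same_partition)
```

Because scaling preserves the ordering of the values, every candidate split groups the rows the same way, so the chosen partition is unchanged.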
I do not see any suggestion in either the help page or the vignette that scaling is necessary for a regression variable in randomForest. This example at Stats Exchange does not use scaling either.
Copy of my comment: The scale function does not belong to pkg:caret. It is part of the "base" R package. There is an unscale function in packages grt and DMwR that will reverse the transformation, or you could simply multiply by the scale attribute and then add the center attribute values.
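The "multiply by the scale attribute and then add the center attribute" route can be sketched in base R like this; the small matrix is made-up example data.

```r
x <- cbind(a = c(1, 5, 9, 13), b = c(100, 250, 300, 950))

z   <- scale(x)                      # centre and scale each column
ctr <- attr(z, "scaled:center")      # column means used for centring
scl <- attr(z, "scaled:scale")       # column standard deviations used for scaling

# Reverse the transformation column by column: multiply by scale, add centre.
back <- sweep(sweep(z, 2, scl, `*`), 2, ctr, `+`)
attr(back, "scaled:center") <- NULL
attr(back, "scaled:scale")  <- NULL

print(all.equal(back, x))
```

When predicting on new data, the same stored centre and scale values should be reused rather than recomputed from the new data.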
Your conception of why "normalization" needs to be done may require critical examination. The test of non-normality is only needed after the regressions are done and may not be needed at all if there are no assumptions of normality in the goodness of fit methodology. So: Why are you asking? Searching in SO and Stats.Exchange might prove useful:
citation #1 ; citation #2 ; citation #3
The boxcox function is a commonly used transformation when one does not have prior knowledge of what a distribution "should" be and when you really need to do a transformation. There are many pitfalls in applying transformations, so the fact that you need to ask the question raises concerns that you may be in need of further consultation or self-study.
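For reference, a minimal sketch of calling boxcox, assuming the function meant here is the one in the MASS package (which ships with standard R installations). The data are simulated so that the response is linear on the log scale.

```r
library(MASS)   # assumed source of boxcox()

set.seed(7)
x <- runif(200, 0, 4)
y <- exp(1 + 0.5 * x + rnorm(200, sd = 0.3))   # relationship is linear in log(y)

# Profile log-likelihood over a grid of lambda values, without plotting.
bc <- boxcox(lm(y ~ x), lambda = seq(-2, 2, by = 0.05), plotit = FALSE)
best_lambda <- bc$x[which.max(bc$y)]
print(best_lambda)   # a value near 0 points towards a log transformation
```

The chosen lambda is only a guide; whether to transform at all still depends on what the model is for.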
Guess, what will happen in the following example?
Imagine, you have 20 predictive features, 18 of them are in [0;10] range and the other 2 in [0;1,000,000] range (taken from a real-life example). Question1: what feature importances will Random Forest assign. Question2: what will happen to the feature importance after scaling the 2 large-range features?
Scaling is important. It is just that Random Forest is less sensitive to scaling than other algorithms and can work with "roughly" scaled features.
If you are going to add interactions to the dataset, that is, new variables that are some function of other variables (usually a simple product), and you don't have a feel for what the new variable stands for (you can't interpret it), then you should calculate it from scaled variables.
Random Forest inherently uses information gain or the Gini coefficient, which are not affected by scaling, unlike many other machine learning methods (such as k-means clustering, PCA, etc.). However, scaling might arguably speed up convergence, as hinted in other answers.
