Multiple Regression in R

I have been trying to do a simple regression in R using the following syntax:
Unfortunately, R keeps giving me warnings and the summary is not possible:
I can't figure out what the problem is. The data includes more than just the 11 predictors mentioned in the syntax.
Thank you!
Melanie

This answer partially consists of comments in the original question.
That is not an error; it is a warning message (warnings and errors are different things). It is generated because you are trying to use lm() with a factor-type response variable. Operations like + and - do not work on factors, hence the message "-" not meaningful for factors.
If the response truly is a categorical variable, lm() is probably not the right tool to model it. Alternatives in this situation include:
glm(): Binary logistic regression, Poisson regression, negative binomial regression
MASS::polr(): Ordinal logistic regression
nnet::multinom(): Multinomial logistic regression
and many others.
Please research the corresponding methods before actually using them.
If the response is actually NOT a categorical variable, you will want to look into why it is coded as a factor and convert it to numeric first.
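A minimal sketch of both cases (the data frame dat and the columns y, x1, x2 are placeholder names, not from the question):

# check how the response is stored
str(dat$y)

# Case 1: the response should be numeric but was read in as a factor;
# convert via character, otherwise you get the underlying level codes
dat$y_num <- as.numeric(as.character(dat$y))
fit_lm <- lm(y_num ~ x1 + x2, data = dat)
summary(fit_lm)

# Case 2: the response really is categorical (here assumed binary),
# so use logistic regression instead of lm()
fit_glm <- glm(y ~ x1 + x2, family = binomial, data = dat)
summary(fit_glm)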

Related

GLMM in R versus SPSS (convergence and singularity problems vanish)

Unfortunately, I had convergence (and singularity) issues when fitting my GLMM models in R. When I tried it in SPSS, I got no such warning message and the results are only slightly different. Does it mean I can interpret the results from SPSS without worries? Or do I have to test for singularity/convergence issues to be sure?
You have two questions. I will answer both.
First Question
Does it mean I can interpret the results from SPSS without worries?
You do not want to do this. The reason is that mixed models have a very specific parameterization. Here is a screenshot of common lme4 syntax from the original lme4 article by the author:
With this come assumptions about what your model is saying. If, for example, you are running a model with random intercepts only, you are assuming that the slopes do not vary by any measure. If you include correlated random slopes and random intercepts, you are assuming that there is a relationship between the slopes and intercepts that may be either positive or negative. If you present this output as-is, without knowing why it produced this summary, you may fail to explain your data accurately.
The reason, as highlighted by one of the comments, is that SPSS runs off defaults, whereas R requires the model to be parameterized explicitly. I'm not surprised that the model failed to converge in R but not in SPSS, given that SPSS assumes no correlation between random slopes and intercepts. That kind of model is more likely to converge than a correlated model, because the constraints needed to fit a correlated model make convergence much more difficult. However, without knowing how you modeled your data, it is impossible to know what the differences actually are. Perhaps if you edit your question it can be answered more directly, but just know that SPSS and R do not fit these models the same way.
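To make the difference concrete, here is roughly what those parameterizations look like in lme4 syntax (the variables y, x, subject and the data frame dat are placeholders):

library(lme4)

# random intercepts only: slopes are assumed not to vary across subjects
m1 <- lmer(y ~ x + (1 | subject), data = dat)

# correlated random intercepts and slopes: an intercept-slope covariance is estimated
m2 <- lmer(y ~ x + (1 + x | subject), data = dat)

# uncorrelated random intercepts and slopes (closer to the SPSS default of
# no intercept-slope correlation); typically easier to converge
m3 <- lmer(y ~ x + (1 + x || subject), data = dat)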
Second Question
Or do I have to test for singularity/convergence issues to be sure?
SPSS and R both run singularity checks by default (check this page as an example). If your model fails to converge, you should drop it and use an alternative model (usually one with a simpler random-effects structure or a better optimizer).
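If you refit the model in R, you can check these issues directly; a rough sketch with lme4 (same placeholder names as above):

library(lme4)

m <- lmer(y ~ x + (1 + x | subject), data = dat)

isSingular(m)   # TRUE if the random-effects estimate is singular
summary(m)      # any convergence warnings are reported with the fit

# if the fit is singular or fails to converge, simplify the random effects
# or try another optimizer
m_simpler <- lmer(y ~ x + (1 | subject), data = dat)
m_bobyqa  <- update(m, control = lmerControl(optimizer = "bobyqa"))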

zero inflated data and lognormal distribution

I am trying to fit a regression model to zero-inflated data with a lognormal distribution using R.
The histogram looks like this:
I searched a little on the net. So far I believe there is no way to fit these conditions with glm. I found the gamlss function as a possible way to fit a lognormal distribution with the LOGNO family. However, I get an error: "family = LOGNO, : response variable out of range" - maybe because of the zero inflation?
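A rough sketch of the kind of call involved (placeholder names; the lognormal distribution only has support on strictly positive values, so exact zeros fall outside the allowed range for LOGNO):

library(gamlss)

# 'ratio' is the zero-inflated response, 'cond' a categorical predictor (placeholders)
fit <- gamlss(ratio ~ cond, family = LOGNO, data = dat)
# fails with "response variable out of range" whenever ratio contains zeros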
To make my question a little clearer:
I am trying to investigate the influence of various amino acid combinations, collected under certain conditions, on a certain ratio. The ratio is my response variable, plotted in the histogram shown above. So in the end I have a continuous response variable and some categorical independent variables.
Does anyone have an idea how I can deal with the above-mentioned problem? I couldn't find a solution so far!
Thank you!

Procedure to identify the most significant predictor variables using R when data has tremendous multicollinearity?

I have a database of around 36 predictor variables which I am using to predict a target variable. The target is a categorical variable with three classes, whereas the predictors include both numeric and categorical variables.
However, the data suffers from severe multicollinearity. I am trying to build a parsimonious logistic regression model, so I need to reduce the number of variables. Based on VIF values, the results become counterintuitive as soon as I reduce the number of variables. On the other hand, I am not very sure that PCR can solve the problem, as I need inferences about the sensitivity to each variable.
What is the best option to deal with such a problem?
Which R packages can I use?
Will factor analysis solve the problem?
Or can we infer everything from PCR?
You first have to run an ANOVA/Kruskal-Wallis test to check which variables are well suited to your problem. With 36 variables I don't think you will need PCA, as that would make your model lose some explainability.
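For example, a quick screen of each numeric predictor against the three-class target could look like this (dat, target and the predictor names are placeholders):

# single predictor
kruskal.test(x1 ~ target, data = dat)

# loop over all numeric predictors and collect the p-values
num_vars <- names(dat)[sapply(dat, is.numeric)]
pvals <- sapply(num_vars, function(v) kruskal.test(dat[[v]] ~ dat$target)$p.value)
sort(pvals)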
Remember that PCA reduces dimensionality and explains only part of the variance in the data. Factor analysis will group variables into factors, in case you want to run a segmented logistic regression for each factor of grouped variables.
If you want to build a parsimonious logistic regression, you can apply some regularization to improve its generalization properties instead of reducing the number of variables.
You can use the following R packages: caret (logistic regression), ROCR (AUC), ggplot2 (plots), DMwR (outliers), mice (missing values).
Also, if you want to apply regularization, you can use the following formula:
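Presumably this refers to the standard L2-penalized logistic regression cost:

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2

where h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}} is the sigmoid and \lambda controls the strength of the regularization.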
In this case, you can implement the regularization from scratch, without a library, to adjust the slope of the sigmoid so that you can correctly separate your classes:
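A minimal from-scratch sketch of that idea in R, using gradient descent on the L2-penalized binary logistic loss (all data and names below are illustrative placeholders; a three-class target would need a one-vs-rest or multinomial extension):

# L2-regularized logistic regression by gradient descent (binary case)
# X: numeric predictor matrix, y: 0/1 response, lambda: regularization strength
sigmoid <- function(z) 1 / (1 + exp(-z))

fit_reg_logistic <- function(X, y, lambda = 1, lr = 0.1, iters = 5000) {
  X <- cbind(1, X)                 # add intercept column
  theta <- rep(0, ncol(X))
  m <- nrow(X)
  for (i in seq_len(iters)) {
    p <- sigmoid(X %*% theta)
    grad <- t(X) %*% (p - y) / m
    grad[-1] <- grad[-1] + (lambda / m) * theta[-1]   # do not penalize the intercept
    theta <- theta - lr * grad
  }
  theta
}

# toy usage with simulated data
set.seed(1)
X <- matrix(rnorm(200 * 3), ncol = 3)
y <- rbinom(200, 1, sigmoid(X %*% c(1, -2, 0.5)))
fit_reg_logistic(X, y, lambda = 0.5)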

can we get probabilities the same way that we get them in logistic regression through random forest?

I have a data structure with a binary 0-1 variable (click & purchase; click & not purchase) against a vector of attributes. I used logistic regression to get the probabilities of purchase. How can I use random forest to get the same kind of probabilities? Is it by using random forest regression, or by using random forest classification with type='prob' in R, which gives the probability for each class?
It won't give you the same result, since the structures of the two methods are different. Logistic regression is given by an explicit linear specification, whereas RF is a collective vote from multiple independent, randomized trees. If the specification and input features are properly tuned for both, they can produce comparable results. Here is the major difference between the two:
RF will give a more robust fit against noise, outliers, overfitting, multicollinearity, etc., which are common pitfalls in regression-type solutions. Basically, if you don't know or don't want to know much about what is going on with the input data, RF is a good start.
Logistic regression will be good if you know the data well and how to specify the equation properly, or if you want to engineer how the fit/prediction works; the explicit form of the GLM specification allows you to do that.
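A minimal comparison of the two in R (assuming a data frame dat with a two-level factor purchase and the randomForest package; all names are placeholders):

library(randomForest)

# logistic regression: fitted probabilities of the second factor level
fit_glm <- glm(purchase ~ ., family = binomial, data = dat)
p_glm   <- predict(fit_glm, type = "response")

# random forest classification: out-of-bag class probabilities (vote proportions)
fit_rf <- randomForest(purchase ~ ., data = dat)
p_rf   <- predict(fit_rf, type = "prob")[, 2]

# comparable, but not identical, probabilities (assuming no missing rows)
plot(p_glm, p_rf)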

R: Estimating model variance

In the demo for ROC, there are models that have a spread when plotted, like hiv.svm$predictions, which contains 10 estimates of the response. Can someone remind me how to calculate N estimates of a model? I'm using rpart and a neural network to estimate a single output (true/false). How can I run 10 different samplings of the training data to get 10 different model responses to the input? I think the technique is called bootstrapping, but I don't know how to implement it.
I need to do this outside of caret, because when I use caret I keep getting the message "Error in tab[1:m, 1:m] : subscript out of bounds". Is there a "simple" bootstrap function?
Obviously this answer is too late, but you could have used caret simply by renaming the levels of your factor, because caret does not work if your binary response is of type logical. For example:
factor(responseWithTrueFalseLevel,
       levels = c(TRUE, FALSE),
       labels = c("myTrueLevel", "myFalseLevel"))
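With the response relabeled this way, a caret call along these lines should then run (a sketch; the rpart model and the 10 bootstrap resamples mirror what the question asks for, and dat is a placeholder data frame):

library(caret)

# relabel the logical outcome as above (placeholder column name)
dat$response <- factor(dat$response,
                       levels = c(TRUE, FALSE),
                       labels = c("myTrueLevel", "myFalseLevel"))

# 10 bootstrap resamples -> 10 refits of the model on resampled training data
ctrl <- trainControl(method = "boot", number = 10, classProbs = TRUE)
fit  <- train(response ~ ., data = dat, method = "rpart", trControl = ctrl)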
