Applicable Regression Models for complex X(predictor) vs Y(response) plot? [closed] - r

http://i43.tinypic.com/8yz893.png
The figure in the link shows the relation between one of my predictors (vms) and the response (responses[i]).
We can distinguish many log-like trends within the same graph.
According to this, a single value of my predictor can be mapped to many values of the response.
Is this acceptable or should I be alarmed that there is a problem with my data?
What regression model would seem more suitable for this picture?

This isn't really an R question, but rather a general statistics question, so you may get downvoted, but I'll try to help you out.
There's nothing wrong with having individual values of the predictor mapping to multiple values of the response. This would be a problem if you were defining and evaluating a function, but you're not technically evaluating a function here; you're evaluating the statistical relationship between two variables. You will then create a functional form to model this relationship.
It seems to me that a conventional OLS model would be very inappropriate here, as one of the assumptions of OLS is that the relationship between the predictor and the outcome variable is linear, which it clearly is not in this case. The relationship actually looks a lot like a 1/x curve, so you may want to try a 1/x transformation and see where that gets you.
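For what it's worth, a minimal sketch of that 1/x transformation in R; the data frame dat and the column names vms and response are placeholders standing in for the poster's data:
dat$inv_vms <- 1 / dat$vms                    # reciprocal of the predictor
fit <- lm(response ~ inv_vms, data = dat)     # OLS on the transformed predictor
summary(fit)
plot(dat$vms, dat$response)
lines(sort(dat$vms), fitted(fit)[order(dat$vms)], col = "red")   # fitted curve on the original scale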

Related

Is there any relationship between logistic regression and decision tree? [closed]

I tried to make a decision tree, and predict the result of the tree with predict function.
predict(c.tree1, D1, type = "prob")
When the type is "prob", it gives back the probability of each class instead of simply classifying. I think this is pretty close to logistic regression.
Could you please tell me whether there is any relationship between them?
If you want to ask a question about concepts, there is a different Stack Exchange site for that.
Logistic regression is used only when the dependent variable is binary (0/1, Yes/No, etc.),
but a decision tree can be used when the dependent variable is discrete or binary (categorical).
In short, logistic regression is used for classification, while a decision tree can be used for both regression and classification.
Hope this answers your question.
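To make the comparison concrete, here is a minimal sketch that fits a classification tree with rpart and a logistic regression on the same built-in data and compares their predicted probabilities (the mtcars data and these predictors are purely illustrative, not from the original question):
library(rpart)
tree_fit <- rpart(factor(am) ~ mpg + wt, data = mtcars, method = "class")
tree_prob <- predict(tree_fit, mtcars, type = "prob")[, "1"]      # class probabilities from the tree
glm_fit <- glm(am ~ mpg + wt, data = mtcars, family = binomial)
glm_prob <- predict(glm_fit, mtcars, type = "response")           # fitted probabilities from logistic regression
head(cbind(tree = tree_prob, logistic = glm_prob))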

R influence variables on y target [closed]

One task of machine learning / data science is making predictions. But I also want to get more insight into the variables of my model.
To get more insights, I tried different methods:
Logistic Regression (the output provides some 'insight' into the influence of the different variables, see: Checking interpretation of GLM summary in R)
The xgb.plot.importance function applied to a boosting tree, see the picture below (applied to the Titanic data set); a minimal sketch of this call is shown after the question.
And I saw a great article on how to explain a boosting tree (but unfortunately, its code is not working): https://medium.com/applied-data-science/new-r-package-the-xgboost-explainer-51dd7d1aa211
My question: are there other methods to give yourself (or even better: the business) more insight into which variables have an influence on the target variable? And of course: is the influence positive or negative, and how big is it?
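For reference, a minimal sketch of the xgb.plot.importance call mentioned in the question; the training data X (numeric predictor matrix) and y (0/1 target) are placeholders, not from the original post:
library(xgboost)
bst <- xgboost(data = as.matrix(X), label = y, nrounds = 20,
               objective = "binary:logistic", verbose = 0)        # X and y are placeholder objects
imp <- xgb.importance(model = bst)                                # gain/cover/frequency per feature
xgb.plot.importance(imp)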
You could also try lasso regression (https://stats.stackexchange.com/questions/17251/what-is-the-lasso-in-regression-analysis), which essentially selects the variables that most strongly influence the response variable.
The glmnet package provides support for this type of regression.
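A minimal sketch of a lasso fit with glmnet; the predictor matrix x and response vector y are placeholders, and cv.glmnet is used to pick the penalty by cross-validation:
library(glmnet)
cv_fit <- cv.glmnet(x, y, alpha = 1)          # alpha = 1 gives the lasso penalty
coef(cv_fit, s = "lambda.min")                # nonzero coefficients = selected variables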

R Linear model allow quadratic and first order but not higher [closed]

My current linear model is: fit <- lm(ES ~ Area + Anear + Dist + DistSC + Elevation)
I have been asked to further this by:
Fit a linear model for ES using the five explanatory variables and
include up to quadratic terms and first order interactions (i.e. allow
Area^2 and Area*Elevation, but don't allow Area^3 or
Area*Elevation*Dist).
From my research I can do +I(Area^2) and +(Area*Elevation), but this would make a huge list.
Assuming I am understanding the question correctly, I would be adding 5 squared terms and 10 interaction terms, giving 20 terms in total (together with the 5 main effects). Or do I not need all of these?
Is that really the most efficient way of going about it?
EDIT:
Note that I am planning on carrying out a stepwise regression between the null model and the full model afterwards. I seem to be having trouble with this when using poly.
Look at ?formula to further your education:
fit <- lm(ES ~ (Area + Anear + Dist + DistSC + Elevation)^2)
Those will not be squared terms, but rather part of what you were asked to provide: all the 2-way interactions (and main effects). Formula "mathematics" is different from the regular use of powers. To add the squared terms in a manner that allows proper statistical interpretation, use poly:
fit <- lm(ES ~ (Area + Anear + Dist + DistSC + Elevation)^2 +
            poly(Area, 2) + poly(Anear, 2) + poly(Dist, 2) +
            poly(DistSC, 2) + poly(Elevation, 2))
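If the stepwise search mentioned in the question's edit has trouble with poly(), one common workaround (a sketch only; the data frame name dat is a placeholder) is to write the squared terms with I() so the linear and quadratic pieces are separate terms that step() can add or drop individually:
full <- lm(ES ~ (Area + Anear + Dist + DistSC + Elevation)^2 +
             I(Area^2) + I(Anear^2) + I(Dist^2) + I(DistSC^2) + I(Elevation^2),
           data = dat)
null <- lm(ES ~ 1, data = dat)
step(null, scope = formula(full), direction = "both")   # stepwise search between null and full model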

How can a SE be above 1000 in a multilevel logistic regression? [closed]

Maybe my question will fail to be specific, but when fitting a generalized linear mixed model (using the lme4 package in R) I get an SE of 1000 for one of the parameters, with the estimated parameter as high as 16. The variable is dichotomous. My question is whether there might be an explanation for such a result, considering that the other parameters have estimates and SEs that seem OK.
That's a sign that you have complete separation. You should re-run the model without that covariate. Since it's an ME model, you may need to do a tabulation of outcome by covariate by levels to see what is happening. More details would allow greater specificity in our answers.
This is a link to a posting by Jarrod Hadfield, one of the guRus on the R mixed model mailing list. It demonstrates how complete separation leads to the Hauck-Donner effect, and it offers some further approaches to attempt dealing with it.
You may be seeing a case of the Hauck-Donner effect. Here is one post that discusses it, you can read the original paper or search the web for additional discussions.
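As a minimal illustration of complete separation and the resulting inflated standard error, here is a small simulated example using an ordinary glm (not a mixed model) just to show the symptom:
x <- rep(0:1, each = 20)
y <- x                                    # outcome perfectly determined by the covariate
sep_fit <- glm(y ~ x, family = binomial)  # warns: fitted probabilities numerically 0 or 1 occurred
summary(sep_fit)                          # huge coefficient and standard error for x
table(y, x)                               # the tabulation suggested above exposes the separation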

Should Categorical predictors within a linear model be normally distributed? [closed]

I am running simple linear models (Y ~ X) in R where my predictor is a categorical variable (0-10). However, this variable is not normally distributed, and none of the available transformation techniques are helpful (e.g. log, square root, etc.), as the data is not negatively/positively skewed but rather all over the place. I am aware that for lm the outcome variable (Y) has to be normally distributed, but is this also required for predictors? If yes, any suggestions on how to do this would be more than welcome.
Also, as the data I am looking at has two groups, patients vs controls (I am interested in group differences, as you can guess), do I have to look at whether the data is normally distributed within the two groups or overall across the two groups?
Thanks.
See @Roman Luštrik's comment above: it does not matter how your predictors are distributed (except for problems with multicollinearity). What is important is that the residuals be normal (and with homogeneous variances).
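A minimal sketch of checking the residuals rather than the predictor's distribution; Y, X, and the data frame dat are placeholder names:
fit <- lm(Y ~ factor(X), data = dat)
plot(fit, which = 1:2)          # residuals vs fitted values, and normal Q-Q plot of residuals
shapiro.test(residuals(fit))    # a formal (if blunt) check of residual normality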
