R Linear model allow quadratic and first order but not higher [closed] - r

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
My current linear model is: fit<-lm(ES~Area+Anear+Dist+DistSC+Elevation)
I have been asked to further this by:
Fit a linear model for ES using the five explanatory variables and
include up to quadratic terms and first order interactions (i.e. allow
Area^2 and Area*Elevation, but don't allow Area^3 or
Area*Elevation*Dist).
From my research I can do +I(Area^2) and +(Area*Elevation) but this would make a huge list.
Assuming I am understanding the question correctly, I would be adding 5 squared terms and 10 interaction (*) terms to the 5 main effects, giving 20 terms in total. Or do I not need all of these?
Is that really the most efficient way of going about it?
EDIT:
Note that I am planning on carrying out a stepwise regression between the null model and the full model afterwards. I seem to be having trouble with this when using poly.

Look at ?formula to further your education:
fit<-lm( ES~ (Area+Anear+Dist+DistSC+Elevation)^2 )
Those are not squared terms. In a formula, ^2 applied to a sum expands to the main effects plus all the 2-way interactions, which is part of what you were asked to provide. Formula "mathematics" is different from the regular use of powers. To add the squared terms in a manner that allows proper statistical interpretation, use poly:
fit<-lm( ES~ (Area+Anear+Dist+DistSC+Elevation)^2 +
poly(Area,2) +poly(Anear,2)+ poly(Dist,2)+ poly(DistSC,2)+ poly(Elevation,2) )
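For comparison, here is a rough sketch of the same model written with I(x^2) instead of poly, followed by a stepwise search between the null and full models. The data frame `dat` and its values are simulated stand-ins, since the real data are not shown in the question. One thing to keep in mind: poly(x, 2) already contains a linear column, so combining it with the main effects tends to leave aliased (NA) coefficients, which may be related to the stepwise trouble mentioned in the edit. I() avoids that, at the cost of the orthogonality that poly provides.

```r
# Simulated stand-in data; the real ES, Area, ... values are not in the post.
set.seed(1)
n <- 200
dat <- data.frame(Area = rnorm(n), Anear = rnorm(n), Dist = rnorm(n),
                  DistSC = rnorm(n), Elevation = rnorm(n))
dat$ES <- 1 + dat$Area + 0.5 * dat$Area^2 + dat$Area * dat$Elevation + rnorm(n)

# (a + b + ...)^2 expands to main effects plus all two-way interactions;
# I(x^2) protects the arithmetic square from formula interpretation.
full <- lm(ES ~ (Area + Anear + Dist + DistSC + Elevation)^2 +
             I(Area^2) + I(Anear^2) + I(Dist^2) + I(DistSC^2) + I(Elevation^2),
           data = dat)
null <- lm(ES ~ 1, data = dat)

# Intercept + 5 main effects + 10 interactions + 5 squared terms
length(coef(full))

# Stepwise search between the null and the full model
sel <- step(null, scope = list(lower = ~ 1, upper = formula(full)),
            direction = "both", trace = 0)
formula(sel)
```

Passing `data = dat` to lm (rather than relying on attached variables) makes it easier for step to refit the candidate models.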

Related

R influence variables on y target [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
One task of Machine Learning / Data Science is making predictions. But, I want to get more insights in the variables of my model.
To get more insights, I tried different methods:
Logistic Regression (The output provides some 'insights' in the influence of the different variables, see: Checking interpretation of GLM summary in R)
The xgb.plot.importance function applied on a Boosting Tree, see picture below (applied on the Titanic Data Set).
And I saw a great article (but unfortunately, it is not working) how to explain a boosting tree (see: https://medium.com/applied-data-science/new-r-package-the-xgboost-explainer-51dd7d1aa211).
My question: are there other methods to give yourself (or even better: the business) more insight into which variables have an influence on the target variable? And of course: is the influence positive or negative, and how big is it?
You could also try lasso regression (https://stats.stackexchange.com/questions/17251/what-is-the-lasso-in-regression-analysis), which shrinks coefficients towards zero and in effect keeps the variables that influence the response variable the most.
The glmnet package provides support for this type of regression.
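A minimal sketch of what that could look like, assuming glmnet is installed. The predictors and response here are simulated for illustration (only x1 and x2 truly matter), so the names and numbers are not from the question. For a binary target such as survival on the Titanic data, cv.glmnet also accepts family = "binomial".

```r
library(glmnet)

# Simulated example: 10 candidate predictors, only x1 and x2 affect y.
set.seed(1)
n <- 200
p <- 10
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- 3 * x[, "x1"] - 2 * x[, "x2"] + rnorm(n)

# alpha = 1 selects the lasso penalty; lambda is chosen by cross-validation.
cvfit <- cv.glmnet(x, y, alpha = 1)

# Coefficients at the more conservative lambda; most should be exactly zero.
b <- as.matrix(coef(cvfit, s = "lambda.1se"))[, 1]
b[b != 0]
```

The sign of each surviving coefficient gives the direction of the influence; comparing sizes is only meaningful when the predictors are on comparable scales.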

Text analysis : What after term-document matrix? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
I am trying to build predictive models from text data. I built document-term matrix from the text data (unigram and bigram) and built different types of models on that (like svm, random forest, nearest neighbor etc). All the techniques gave decent results, but I want to improve the results. I tried tuning the models by changing parameters, but that doesn't seem to improve the performance much. What are the possible next steps for me?
This isn't really a programming question, but anyway:
If your goal is prediction, as opposed to text classification, usual methods are backoff models (Katz Backoff) and interpolation/smoothing, e.g. Kneser-Ney smoothing.
More complicated models like Random Forests are AFAIK not absolutely necessary and may pose problems if you need to make predictions quickly. If you are using an interpolation model, you can still tune the model parameters (lambda) using a held out portion of the data.
Finally, I agree with NEO on the reading part and would recommend "Speech and Language Processing" by Jurafsky and Martin.
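To make the held-out tuning idea concrete, here is a toy sketch of a bigram model with simple linear interpolation (Jelinek-Mercer style), which is deliberately simpler than Katz backoff or Kneser-Ney. The sentences, the helper names and the lambda grid are all made up for illustration.

```r
train <- c("the cat sat", "the cat ran", "the dog sat", "a dog ran")
held  <- c("the dog ran", "a cat sat")

split_words <- function(s) strsplit(s, " ", fixed = TRUE)

# Two-column character matrix of (previous word, next word) pairs
bigram_pairs <- function(s) {
  do.call(rbind, lapply(split_words(s),
                        function(w) cbind(head(w, -1), tail(w, -1))))
}

# Counts as a plain named vector, so unseen keys index to NA
counts <- function(x) { t <- table(x); setNames(as.vector(t), names(t)) }
lookup <- function(cnt, key) { v <- unname(cnt[key]); ifelse(is.na(v), 0, v) }

uni_cnt  <- counts(unlist(split_words(train)))
tr_pairs <- bigram_pairs(train)
bi_cnt   <- counts(paste(tr_pairs[, 1], tr_pairs[, 2]))
prev_cnt <- counts(tr_pairs[, 1])

# Interpolated probability: lambda * P(w | prev) + (1 - lambda) * P(w)
p_interp <- function(prev, w, lambda) {
  p_bi  <- lookup(bi_cnt, paste(prev, w)) / pmax(lookup(prev_cnt, prev), 1)
  p_uni <- lookup(uni_cnt, w) / sum(uni_cnt)
  lambda * p_bi + (1 - lambda) * p_uni
}

# Choose lambda by log-likelihood on the held-out sentences
he_pairs <- bigram_pairs(held)
lambdas  <- seq(0.1, 0.9, by = 0.1)
loglik   <- sapply(lambdas, function(l) {
  sum(log(p_interp(he_pairs[, 1], he_pairs[, 2], l)))
})
best_lambda <- lambdas[which.max(loglik)]
```

The unigram term keeps the probability of an unseen bigram such as "a cat" above zero, which is the whole point of interpolating.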

How to work with numeric probability distribution functions [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
I have to calculate the probability distribution function of a random variable that is composed (by sum, division, product, exponentiation, etc.) of some other simple random variables. It is pretty complex, so I am more than happy to get a numerical solution.
While I thought this was a very standard thing to do, I was unable to find a framework to do it. I'd preferably use R, but any major language will do.
What I would like therefore is a library that allowed me to:
i) create numerical random variables from classic distributions
ii) compose them by simple operations (+,-,*,/, exp,min, max,...)
Of course I could work with vectors and use convolutions and the like, but I wanted something more polished.
I am also aware that it is possible to use simulation to create the variables, then compose them with the operations and finally get the PDF using a histogram, but again, I would prefer a non-simulating approach.
Try the rv package. Note that if X is an exponential random variable with mean 1, then -log(X) has a standard Gumbel distribution.
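As a small non-simulating illustration in base R (this does not use the rv package, and the function names are my own), the Gumbel claim can be checked with a change of variables, and the density of a sum of independent variables can be obtained by numerical convolution:

```r
# Density of Y = -log(X) with X ~ Exp(1), by change of variables:
# f_Y(y) = f_X(exp(-y)) * |d/dy exp(-y)|
f_Y <- function(y) dexp(exp(-y), rate = 1) * exp(-y)

# Standard Gumbel density for comparison
gumbel_pdf <- function(y) exp(-(y + exp(-y)))

y <- seq(-3, 8, by = 0.5)
max(abs(f_Y(y) - gumbel_pdf(y)))   # difference is at rounding level

# Density of A + B for independent A, B ~ Exp(1), by convolution:
# f_{A+B}(z) = integral over t in [0, z] of f_A(t) * f_B(z - t)
f_sum <- function(z) {
  sapply(z, function(zi) {
    integrate(function(t) dexp(t) * dexp(zi - t), lower = 0, upper = zi)$value
  })
}

z <- c(0.5, 1, 2, 5)
cbind(numeric = f_sum(z), exact = dgamma(z, shape = 2, rate = 1))
```

The sum of two Exp(1) variables is Gamma(2, 1), which gives an exact answer to compare the numerical convolution against. Products and quotients can be handled the same way with the corresponding transformation formulas.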

Fitting repeated measures in R [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 9 years ago.
Fitting repeated measures in R, convergence issues. I have the following fit, which is for one of many datasets, and it doesn't converge. Other datasets do converge. This dataset and model work in SAS... Could I get some direction on what to do to make this work in R? Things to look at (matrices, option settings, a reference on this topic for R/S-PLUS...).
fit.gls <- gls(resp~grpnum+grpnum/day,data=long, corr=cormat,na.action=na.omit)
Error in glsEstimate(object, control = control) :
computed "gls" fit is singular, rank 62
I have read the following and am still trying to work through it...
Converting Repeated Measures mixed model formula from SAS to R
The problem is the data. gls needs to invert a matrix to work (see Wikipedia for the formula to estimate the coefficients). For your particular data set, that matrix is not invertible.
You can allow singular fits with the control argument:
fit.gls <- gls(resp~grpnum+grpnum/day,data=long, corr=cormat,na.action=na.omit,
control = list(singular.ok = TRUE))
Be careful with this as you may get bad results! Always check the model fit afterwards.
Look at the help for gls and glsControl for more details about options.
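One way to see where a singularity like this can come from is to check the rank of the design matrix directly. The sketch below uses a made-up data frame (it only borrows the names `long`, `grpnum` and `day` from the question) in which one group is never observed on one day. This is just one common cause, not necessarily the one in your data.

```r
# Toy data: group B has no observations on day 3 (an empty cell).
long <- expand.grid(grpnum = factor(c("A", "B")), day = factor(1:3), rep = 1:4)
long <- subset(long, !(grpnum == "B" & day == "3"))

# Fixed-effects design matrix for the same right-hand side as in the gls call
X <- model.matrix(~ grpnum + grpnum/day, data = long)
rank_X <- qr(X)$rank
c(columns = ncol(X), rank = rank_X)   # rank below the column count means singular

# Columns that are identically zero point at empty group-by-day cells
empty_cols <- colnames(X)[colSums(abs(X)) == 0]
empty_cols
```

If the rank is lower than the number of columns, dropping or recoding the offending terms is usually safer than forcing the fit through with singular.ok.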

Should Categorical predictors within a linear model be normally distributed? [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 10 years ago.
I am running simple linear models (Y~X) in R where my predictor is a categorical variable (0-10). However, this variable is not normally distributed and none of the transformation techniques available are helpful (e.g. log, sq etc.), as the data is not negatively/positively skewed but rather all over the place. I am aware that for lm the outcome variable (Y) has to be normally distributed, but is this also required for predictors? If yes, any suggestions of how to do this would be more than welcome.
Also, as the data I am looking at has two groups, patients vs controls (I am interested in group differences, as you can guess), do I have to look at whether the data is normally distributed within the two groups or overall across the two groups?
Thanks.
See @Roman Luštrik's comment above: it does not matter how your predictors are distributed. (Except for problems with multicollinearity.) What is important is that the residuals be normal (and with homogeneous variances).
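A quick simulated illustration of that point (the data and variable names are invented): the predictor below is clearly non-normal, yet the model is fine because the errors are normal, so the check belongs on the residuals.

```r
set.seed(42)
n <- 100

# A predictor that is far from normal: scores piled up at both ends of 0-10
x <- sample(c(0, 1, 9, 10), n, replace = TRUE)
group <- factor(rep(c("patient", "control"), each = n / 2))

# Outcome with normally distributed errors
y <- 2 + 0.5 * x + 1.5 * (group == "patient") + rnorm(n)

fit <- lm(y ~ x + group)
res <- resid(fit)

# Normality is assessed on the residuals, not on x or on raw y
shapiro.test(x)     # the predictor itself will look non-normal
shapiro.test(res)   # this is the check that matters for lm

# Rough look at variance homogeneity across the two groups
tapply(res, group, var)
```

Because the group effect is part of the model, the residuals are looked at after accounting for group, rather than testing the raw outcome within or across groups.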
