Automatic model creation, for model selection, in polynomial regression in R

Let's imagine that for a target value 'price', I have the predictor variables x, y, z, m, and n.
I have been able to analyse different models that I could fit using the following methods:
Forward, backward, and stepwise selection
Grid and Lasso
KNN (IBk)
For each I got RMSE and MSE for prediction and I can choose the best model.
All these are helpful with linear models.
I'm just wondering if there is any chance to do the same for polynomial regressions (squared, cubic, ...) so I can fit and analyse them as well in the same dataset.

Have you seen the caret package? It's very powerful and groups a lot of machine learning models. It can compare different models and also find the best tuning parameters.
http://topepo.github.io/caret/index.html
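For the polynomial case specifically, here is a minimal base-R sketch of the same idea, not caret's own workflow: fit polynomials of increasing degree with poly() inside lm() and compare them by k-fold cross-validated RMSE. The data are simulated, and the column names, the number of folds, and the helper cv_rmse are assumptions made for illustration only.

```r
# Simulated data (assumption): 'price' depends quadratically on x and linearly on y
set.seed(1)
n <- 200
dat <- data.frame(x = runif(n, -2, 2), y = runif(n, -2, 2))
dat$price <- 1 + 2 * dat$x - 1.5 * dat$x^2 + 0.5 * dat$y + rnorm(n, sd = 0.5)

# Assign every row to one of k folds at random
k <- 5
folds <- sample(rep(seq_len(k), length.out = n))

# Hypothetical helper: cross-validated RMSE for a polynomial of a given degree
cv_rmse <- function(degree) {
  sq_err <- numeric(n)
  for (i in seq_len(k)) {
    test <- folds == i
    # Same polynomial degree for every predictor; orthogonal polynomials via poly()
    fit <- lm(price ~ poly(x, degree) + poly(y, degree), data = dat[!test, ])
    pred <- predict(fit, newdata = dat[test, ])
    sq_err[test] <- (dat$price[test] - pred)^2
  }
  sqrt(mean(sq_err))
}

# Compare linear, squared, cubic and quartic fits on held-out data
degrees <- 1:4
results <- data.frame(degree = degrees, rmse = sapply(degrees, cv_rmse))
print(results)

best_degree <- results$degree[which.min(results$rmse)]
```

The same loop extends to the other predictors (z, m, n) by adding further poly() terms. Allowing a different degree per variable turns it into a grid search over degree combinations.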

Related

comparison of goodness-of-fit under robust circumstances [migrated]

This question was migrated from Stack Overflow because it can be answered on Cross Validated.
I have fitted a zero-knot, a one-knot and a two-knot linear spline to my data, and I need some index of goodness-of-fit for model selection. The crucial point is that the splines are fitted with robust linear regressions (using the function rlm in the R package MASS), specifically with Huber estimation and Tukey's bisquare estimation, which makes the usual estimators of prediction error, such as AIC, inappropriate.
So my problem is:
What criterion should I use to perform model selection on my zero, one and two-knot splines? Can I use SSE?
I also need to compare between a model using Huber estimation and a model using Tukey's bisquare estimation. What criterion should I use?

Does the function multinom() from R's nnet package fit a multinomial logistic regression, or a Poisson regression?

The documentation for the multinom() function from the nnet package in R says that it "[f]its multinomial log-linear models via neural networks" and that "[t]he response should be a factor or a matrix with K columns, which will be interpreted as counts for each of K classes." Even when I go to add a tag for nnet on this question, the description says that it is software for fitting "multinomial log-linear models."
Granting that statistics has wildly inconsistent jargon that is rarely operationally defined by whoever is using it, the documentation for the function even mentions having a count response and so seems to indicate that this function is designed to model count data. Yet virtually every resource I've seen treats it exclusively as if it were fitting a multinomial logistic regression. In short, everyone interprets the results in terms of logged odds relative to the reference (as in logistic regression), not in terms of logged expected count (as in what is typically referred to as a log-linear model).
Can someone clarify what this function is actually doing and what the fitted coefficients actually mean?
nnet::multinom is fitting a multinomial logistic regression as I understand...
If you check the source code of the package, https://github.com/cran/nnet/blob/master/R/multinom.R and https://github.com/cran/nnet/blob/master/R/nnet.R, you will see that the multinom function is indeed using counts (which is a common input for a multinomial regression model; see also e.g. the MGLM or mclogit packages), and that it fits the multinomial regression model using a softmax transform to go from predictions on the additive log-ratio scale to predicted probabilities. The softmax transform is indeed the inverse link function of a multinomial regression model. The way the multinom model predictions are obtained (cf. the question "predictions from nnet::multinom") is also exactly as you would expect for a multinomial regression model (using an additive log-ratio scale parameterization, i.e. using one outcome category as a baseline).
That is, the coefficients predict the logged odds relative to the reference baseline category (i.e. it is doing a logistic regression), not the logged expected counts (like a log-linear model).
This is shown by the fact that model predictions are calculated as
fit <- nnet::multinom(...)
X <- model.matrix(fit) # covariate matrix / design matrix
betahat <- t(rbind(0, coef(fit))) # model coefficients, with explicit zero row added for reference category & transposed
preds <- mclustAddons::softmax(X %*% betahat)
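As a runnable check of the calculation above, here is a minimal sketch that fits nnet::multinom to the built-in iris data and compares the hand-computed softmax probabilities with fitted(fit). The choice of data set and predictors is an assumption for illustration, and softmax_rows is a hypothetical base-R helper standing in for mclustAddons::softmax.

```r
library(nnet)  # recommended package, shipped with standard R installations

# Illustrative fit (assumption): three-class outcome with two predictors
fit <- multinom(Species ~ Sepal.Length + Sepal.Width, data = iris, trace = FALSE)

X <- model.matrix(fit)             # design matrix, n x p
betahat <- t(rbind(0, coef(fit)))  # p x K, zero column for the reference category
eta <- X %*% betahat               # linear predictors on the log-ratio scale

# Hypothetical helper: row-wise softmax, shifted by the row maximum for stability
softmax_rows <- function(eta) {
  e <- exp(eta - apply(eta, 1, max))
  e / rowSums(e)
}

preds <- softmax_rows(eta)

# Largest absolute difference from the probabilities stored in the fitted model
max_diff <- max(abs(preds - fitted(fit)))
print(max_diff)
```

If the coefficients are log-odds relative to the baseline category, as argued above, the difference should be zero up to floating-point error.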
Furthermore, I verified that the vcov matrix returned by nnet::multinom matches the one I obtain from the formula for the vcov matrix of a multinomial regression model; see the question "Faster way to calculate the Hessian / Fisher Information Matrix of a nnet::multinom multinomial regression in R using Rcpp & Kronecker products".
Is it not the case that a multinomial regression model can always be reformulated as a Poisson loglinear model (i.e. as a Poisson GLM) using the Poisson trick (glmnet e.g. uses the Poisson trick to fit multinomial regression models as a Poisson GLM)?

Fixed effects logit lasso model

My data set has a binary dependent variable (0/1) and a lot of continuous independent variables for many individuals and three time periods. Therefore, I am facing a panel data set with a binary dependent variable, which asks for the use of a non-linear panel data model. However, I also have a lot of independent variables, which asks for the use of a variable selection method. Therefore, I want to apply lasso on a fixed effects logit model.
As far as I know, the only option is cv.glmnet, which estimates a logit lasso model via cv.glmnet(x, y, weights, offset, lambda, type.measure, nfolds, foldid, grouped, keep, parallel, ...) when family='binomial' is passed through to glmnet. This estimation procedure pools all individuals, as it is a cross-sectional estimation procedure, and does not take the panel component of my data set into account.
Therefore, I would like to adjust the cv.glmnet function such that I can pass, for example, family='fe binomial' so that it runs a fixed effects logit lasso model.
In conclusion, it is possible to run a fixed effects logit model and a lasso model separately but I want to combine both. How can I do this in R?
(Also, in the attachment I wrote my model down in more detail)

How do you compare a gam model with a gamm model? (mgcv)

I've fit two models, one with gam and another with gamm.
gam(y ~ x, family= betar)
gamm(y ~ x)
So the only difference is the distributional assumption. I use betar with gam and normal with gamm.
I would like to compare these two models, but I am guessing AIC will not work since the two models are based on different methods? Is there then some other suitable estimate I can use for comparison? I know I could just fit the second with gam, but let's ignore that for the sake of this question.
AIC does not depend on the type of model used, as long as y is exactly the same set of observations in both models. It is just minus twice the maximised log-likelihood, penalised by twice the number of fitted parameters.
However, depending on the goal of your model, if you want to be able to use the model for prediction for instance, you should use validation to compare model performance. 10-fold cross-validation would be a good idea for instance.
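A minimal sketch of both comparisons, under stated assumptions: the data are simulated on (0, 1), both models are fitted with mgcv::gam so they differ only in the distributional assumption (beta versus Gaussian), and the helpers fit_beta, fit_gauss and cv_rmse are hypothetical names introduced here. A model fitted with gamm would need its own fitting function in place of fit_gauss.

```r
library(mgcv)  # recommended package, shipped with standard R installations

# Simulated response strictly inside (0, 1) (assumption for illustration)
set.seed(1)
n <- 300
dat <- data.frame(x = runif(n))
mu <- plogis(-1 + 2 * dat$x)
dat$y <- rbeta(n, shape1 = mu * 10, shape2 = (1 - mu) * 10)

# Hypothetical fitting helpers: same formula, different families
fit_beta  <- function(d) gam(y ~ s(x), family = betar(link = "logit"), data = d, method = "REML")
fit_gauss <- function(d) gam(y ~ s(x), family = gaussian(), data = d, method = "REML")

# AIC comparison on the full data; y is identical in both models
m_beta  <- fit_beta(dat)
m_gauss <- fit_gauss(dat)
aic_tab <- AIC(m_beta, m_gauss)
print(aic_tab)

# 10-fold cross-validated RMSE on the response scale
k <- 10
folds <- sample(rep(seq_len(k), length.out = n))
cv_rmse <- function(fit_fun) {
  sq_err <- numeric(n)
  for (i in seq_len(k)) {
    test <- folds == i
    fit <- fit_fun(dat[!test, ])
    pred <- predict(fit, newdata = dat[test, ], type = "response")
    sq_err[test] <- (dat$y[test] - pred)^2
  }
  sqrt(mean(sq_err))
}
cv_tab <- c(betar = cv_rmse(fit_beta), gaussian = cv_rmse(fit_gauss))
print(cv_tab)
```

Using the same fold assignment for both models keeps the cross-validation comparison paired, so differences in RMSE are not driven by different splits.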

Regression evaluation in R

Are there any utilities/packages for showing various performance metrics of a regression model on some labeled test data? Basic stuff I can easily write like RMSE, R-squared, etc., but maybe with some extra utilities for visualization, or reporting the distribution of prediction confidence/variance, or other things I haven't thought of. This is usually reported in most training utilities (like caret's train), but only over the training data (AFAICT). Thanks in advance.
This question is really quite broad and should be focused a bit, but here's a small subset of functions written to work with linear models:
#simulate 100 observations of a response x and a predictor y
x <- rnorm(100)
y <- rnorm(100)
#regress x on y
model <- lm(x ~ y)
#general summary
summary(model)
#Visualize some diagnostics
plot(model)
#Coefficient values
coef(model)
#Confidence intervals
confint(model)
#predict values
predict(model)
#predict new values
predict(model, newdata = data.frame(y = 1:10))
#Residuals
resid(model)
#Standardized residuals
rstandard(model)
#Studentized residuals
rstudent(model)
#AIC
AIC(model)
#BIC
BIC(model)
#Cook's distance
cooks.distance(model)
#DFFITS
dffits(model)
#lots of measures related to model fit
influence.measures(model)
Bootstrap confidence intervals for parameters of models can be computed using the recommended package boot. It is a very general package requiring you to write a simple wrapper function to return the parameter of interest, say fit the model with some supplied data and return one of the model coefficients, whilst it takes care of the rest, doing the sampling and computation of intervals etc.
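A minimal sketch of that wrapper pattern, assuming simulated data and a hypothetical statistic function slope_fun that refits the model on each resample and returns the slope:

```r
library(boot)  # recommended package, shipped with standard R installations

# Simulated data (assumption): x depends linearly on y, as in lm(x ~ y) above
set.seed(42)
dat <- data.frame(y = rnorm(100))
dat$x <- 2 + 0.5 * dat$y + rnorm(100)

# Hypothetical wrapper: boot() passes the data and a vector of resampled row indices
slope_fun <- function(data, idx) {
  fit <- lm(x ~ y, data = data[idx, ])
  coef(fit)[["y"]]
}

# 999 bootstrap resamples of the rows, refitting the model each time
b <- boot(dat, statistic = slope_fun, R = 999)

# Percentile interval for the slope; columns 4 and 5 hold the lower and upper limits
ci <- boot.ci(b, type = "perc")
slope_ci <- ci$percent[4:5]
print(slope_ci)
```

Other interval types (for example "bca" or "basic") can be requested through the type argument of boot.ci in the same way.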
Consider also the caret package, which is a wrapper around a large number of modelling functions, but also provides facilities to compare model performance using a range of metrics using an independent test set or a resampling of the training data (k-fold, bootstrap). caret is well documented and quite easy to use, though to get the best out of it, you do need to be familiar with the modelling function you want to employ.
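For the held-out test set asked about in the question, the basic metrics are short enough to compute directly. The following is a base-R sketch under assumed simulated data and an arbitrary 150/50 train/test split; caret's postResample(pred, obs) reports RMSE, R-squared and MAE in a similar spirit.

```r
# Simulated data (assumption), split into training and labeled test rows
set.seed(123)
n <- 200
dat <- data.frame(y = rnorm(n))
dat$x <- 1 + 2 * dat$y + rnorm(n)

test_idx <- sample(n, size = 50)
train <- dat[-test_idx, ]
test  <- dat[test_idx, ]

# Fit on the training rows only, then predict the held-out rows
model <- lm(x ~ y, data = train)
pred <- predict(model, newdata = test)
err <- test$x - pred

# Out-of-sample performance metrics
metrics <- c(
  RMSE = sqrt(mean(err^2)),
  MAE  = mean(abs(err)),
  R2   = 1 - sum(err^2) / sum((test$x - mean(test$x))^2)
)
print(round(metrics, 3))

# Share of test observations falling inside their 95% prediction intervals
pred_int <- predict(model, newdata = test, interval = "prediction", level = 0.95)
coverage <- mean(test$x >= pred_int[, "lwr"] & test$x <= pred_int[, "upr"])
print(coverage)
```

The coverage figure gives a rough check on the prediction uncertainty: for a well-specified model it should be close to the nominal level.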
