The meaning of the model output number in R for a linear regression model

Do I understand correctly that the value (model output) retrieving from evaluate_model() for the linear regression model is RMSE?

No, the output column, model_output, is not the root mean square error (RMSE); it is the predicted value from your model.
The evaluate_model() function appears to come from the statisticalModeling package.
According to the documentation, its purpose is to "Evaluate a model for specified inputs" and (emphasis added below)
Find the model outputs for specified inputs. This is equivalent to the generic predict() function, except it will choose sensible values by default. This simplifies getting a quick look at model values.
In other words, evaluate_model() takes inputs and shows you the model's "prediction" for them.
In your case, evaluate_model() takes each row of your data, uses the values of the variables (age, sex, coverage) in that row, and returns the model's estimate of the dependent variable.
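A minimal sketch of the point above. The data frame my_data and the outcome column cost are hypothetical stand-ins for the asker's data; only the predictors (age, sex, coverage) come from the question.

```r
# Sketch, assuming the statisticalModeling package is installed and 'my_data'
# is a data frame with columns cost, age, sex, coverage (hypothetical names).
library(statisticalModeling)

my_model <- lm(cost ~ age + sex + coverage, data = my_data)

# The model_output column holds the model's predicted value for each input row
evaluate_model(my_model, data = my_data)

# which is essentially the same as calling predict() yourself:
predict(my_model, newdata = my_data)
```

The difference is convenience: evaluate_model() can also pick sensible input values on its own when you do not pass data, whereas predict() always requires them.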

Related

How to get adjusted means following multiple imputation?

I am trying to find the adjusted means following multiple imputation using the mice package for an ANCOVA model (zScorePALAdj~Grouping+age+sex+education_years).
I had to impute some missing values for the model, and then used the mi.anova function to provide summary output that looks like the below:
mi.anova(mi.res=imputed_Data, formula="zScorePALAdj~Grouping+age+sex+education_years", type = 3)
I would like to get the adjusted means of zScorePALAdj for the different groups of the Grouping variable.
I understand that emmeans would allow this if the data were not pooled imputed data, for instance: emmeans(model, "Grouping"). However, it does not seem to accept the output of mi.anova.
I was wondering if anyone can point me in the right direction, as I am drawing a blank.
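One possible route, sketched under the assumption that recent versions of emmeans can handle mira objects (the multiply-fitted-model class produced by mice::with()) and pool the estimates across imputations: fit the underlying linear model on each imputed dataset rather than going through mi.anova, then pass the mira object to emmeans.

```r
# Sketch: fit the ANCOVA on every imputed dataset, then ask emmeans for the
# adjusted group means. imputed_Data and the formula are from the question.
library(mice)
library(emmeans)

fit <- with(imputed_Data,
            lm(zScorePALAdj ~ Grouping + age + sex + education_years))

# If your emmeans version supports mira objects, this pools across imputations:
emmeans(fit, "Grouping")
```

If your installed emmeans does not accept the mira object, a fallback is to run emmeans() on each element of fit$analyses and pool the results manually with Rubin's rules.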

rpart variable importance shows more variables than decision tree plots

I fitted an rpart model with leave-one-out cross-validation on my data using the caret library in R. Everything is fine, but I want to understand the difference between the model's variable importance and the decision tree plot.
Calling varImp() on the model shows nine variables. Plotting the decision tree with functions such as fancyRpartPlot() or rpart.plot() shows a tree that uses only two variables to classify all subjects.
How can that be? Why does the decision tree plot not show the same nine variables that appear in the variable importance table?
Thank you.
Like rpart(), caret has a useful property: it accounts for surrogate variables, i.e. variables that are not chosen for splits but that came close to winning the competition.
Let me be more precise. Say that at a given split the algorithm decided to split on x1. Suppose there is another variable, say x2, which would have been almost as good as x1 for splitting at that stage. We call x2 a surrogate, and we assign it variable importance just as we do for x1.
This is why the importance ranking can contain variables that are never actually used for splitting. You may even find that such variables rank higher than some that are used!
The rationale for this is explained in the documentation for rpart(): suppose we have two identical covariates, say x3 and x4. Then rpart() is likely to split on one of them only, e.g., x3. How can we say that x4 is not important?
To conclude, variable importance considers the increase in fit for both primary variables (the ones actually chosen for splitting) and surrogate variables. So the importance of x1 includes both the splits at which x1 is chosen as the splitting variable and the splits at which another variable is chosen but x1 is a close competitor.
Hope this clarifies your doubts. For more details, see here. Just a quick quotation:
The following methods for estimating the contribution of each variable to the model are available [speaking of how variable importance is computed]:
[...]
- Recursive Partitioning: The reduction in the loss function (e.g. mean squared error) attributed to each variable at each split is tabulated and the sum is returned. Also, since there may be candidate variables that are important but are not used in a split, the top competing variables are also tabulated at each split. This can be turned off using the maxcompete argument in rpart.control.
I am not very familiar with caret, but from this quote it appears that the package uses rpart() to grow its trees, thus inheriting this treatment of surrogate variables.
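The identical-covariates scenario from the rpart() documentation can be demonstrated directly; this is a small sketch with made-up data, not the asker's model.

```r
# Sketch: a covariate that never appears in the plotted tree can still
# receive importance, because it is a perfect surrogate for the one chosen.
library(rpart)
set.seed(1)

d <- data.frame(x1 = rnorm(200))
d$x2 <- d$x1                                  # identical covariate
d$y  <- factor(ifelse(d$x1 > 0, "A", "B"))

fit <- rpart(y ~ x1 + x2, data = d, method = "class")

# The tree splits on only one of x1/x2, yet variable.importance typically
# credits both, via surrogate splits:
fit$variable.importance
```

This mirrors the question exactly: the plot shows only the variables used for primary splits, while the importance table also reflects surrogates and close competitors.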

plm Package in R - empty model when including only variables without variation over time per individual

I have a dataframe ('math') like this (there are three different methods, although only one is shown):
[screenshot of the dataframe]
I am trying to create a multi-level growth model for MathScore, where VerbalScore is an independent, time invariant, random effect.
I believe the R code should be similar to this -
random <- plm(MathScore ~ VerbalScore + Method, data=math, index=c("id","Semester"),
model="random")
However, running this code results in the following error:
Error in plm.fit(object, data, model = "within", effect = effect) :
empty model
I believe it's an issue with the index, as the code will run if I use:
random <- plm(MathScore ~ VerbalScore + Method + Semester, data=math, index="id",
model="random")
I would be grateful for any advice on how to create a multi-level, random effect model as described.
This is likely a problem with your data:
It seems that the variables VerbalScore and Method do not vary within individuals. Thus the within variance needed for the Swamy-Arora RE model (the default) cannot be computed. The affected variables, which here are all right-hand-side variables, drop out of the model, and you get the (not very specific) error message empty model.
You can check variation per individual with the command pvar().
If my assumption is correct and you still want to estimate a random effects model, you will have to use a random effects estimator that does not rely on the within variance, e.g. the Wallace-Hussain estimator (random.method="walhus").
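Both steps can be sketched with the question's own data frame and formula:

```r
# Sketch: first diagnose, then switch estimators.
library(plm)

# pvar() reports, for each variable, whether it varies within individuals
# and within time periods. Look for VerbalScore and Method here.
pvar(math, index = c("id", "Semester"))

# If they indeed have no within-individual variation, use an estimator that
# does not need the within variance, e.g. Wallace-Hussain:
random <- plm(MathScore ~ VerbalScore + Method, data = math,
              index = c("id", "Semester"),
              model = "random", random.method = "walhus")
summary(random)
```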

Interpretation auto.arima results in R

As a beginner, I am trying to understand the auto.arima function in the R forecast package.
Particularly, I am interested in the selection based on the information criteria.
For instance, I set ic=c("aicc","aic", "bic").
I then obtain the best fitting model with AIC, AICc, and BIC.
I also obtain a certain output value for every tested model, e.g. -18661.23 for (1,1,1), -18451.12 for (1,1,2), etc. If, say, (1,1,1) is selected as the model with the lowest output value, this value is not equal to the reported AIC, AICc, or BIC.
In simple words, how do I interpret the output value of each model? Is it some weighted combination of AIC, AICc, and BIC?
P.S.: I really tried to understand the documentation but it is hard for me to read.
Thank you very much in advance!
As far as I can tell, by "output value" you mean the value printed when you use auto.arima with trace=TRUE.
These values are the AIC (or AICc or BIC) for each model tried. An approximation is used during the search to speed things up, so the value printed may differ slightly from the value returned, which is calculated without the approximation.
The argument ic determines which information criterion will be used. For example, setting ic="bic" means that the BIC is used in selecting the model. By default, ic="aicc".
In a function definition, a default value given as a vector is often shorthand for the set of possible values the argument can take, with the first element being the default. Here the function definition contains ic = c("aicc", "aic", "bic"), meaning ic can take only one of those three values, and defaults to "aicc" if not explicitly passed.
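Both points can be seen in a small sketch (using the built-in lynx series purely as example data):

```r
# Sketch: trace = TRUE prints the (approximated) criterion value for each
# candidate model; ic picks which criterion drives the selection.
library(forecast)

fit <- auto.arima(lynx, ic = "bic", trace = TRUE)

# Exact criterion values of the chosen model, computed without the
# search-time approximation, so they may differ slightly from the trace:
fit$bic
fit$aicc
```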

combining regression model estimates into a single table

I am trying to automate the process by which I output the coefficients and standard errors from a number of regression models into a single table for output via xtable as html.
Some similar questions (like this one) have been directed to a function called "outreg" by Paul Johnson, but the page no longer exists and I can't find the code.
Others (like this one) use solutions that seem to give me errors because my models do not all have the same number of variables.
To clarify my task ...
I have n polr (ordinal logistic) models from which I want to output the coefficients and standard errors.
Each model includes a different number of predictors.
I need one data.frame (?) with a column or two for each model and a row for each predictor
it's not critical how the standard errors are output in relation to the coefficients
each model has output like this with successively more predictors:
>summary(model1)["coefficients"]
$coefficients
Value Std. Error t value
relGPA 0.8683499 0.04185389 20.74717
mcAvgGPA 1.3885515 0.09688030 14.33265
Deny|Waitlist 0.5707912 0.01553476 36.74283
Waitlist|Accept 0.8398921 0.01618358 51.89779
You can cut and paste from here.
FWIW, I have tinkered with his outreg.R to add some features (t-stats, multipage, etc), but it's only on my work computer, so I'll post tomorrow.
Update
Here's my tweaked version, but like the original, it still requires a list of lm or glm objects. It seems too long to cut&paste, so this is a link to my dropbox.com public folder.
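As a starting point that avoids the equal-number-of-variables problem entirely, here is a small sketch: extract the Value and Std. Error columns from each polr summary and merge the tables by coefficient name, so predictors missing from a model simply become NA. The objects model1 and model2 stand in for your fitted polr models.

```r
# Sketch: one column pair (estimate, SE) per model, one row per predictor.
extract <- function(model, label) {
  cf  <- summary(model)$coefficients[, c("Value", "Std. Error"), drop = FALSE]
  out <- data.frame(term = rownames(cf), cf, row.names = NULL)
  names(out)[2:3] <- paste(label, c("est", "se"), sep = ".")
  out
}

tabs <- Map(extract, list(model1, model2), c("m1", "m2"))

# merge by coefficient name; all = TRUE keeps predictors absent from a model
combined <- Reduce(function(a, b) merge(a, b, by = "term", all = TRUE), tabs)
combined
```

The resulting data.frame can then be passed to xtable for HTML output as described in the question.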
