How the predict function works - r

I am using these commands in R to build a decision tree:
> library(party)
> ind = sample(2,nrow(iris),replace=TRUE,prob=c(0.8,0.2))
> myFormula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
> iris_ctree <- ctree(myFormula,data = iris[ind==1,])
> predict(iris_ctree)
What exactly does the predict function compute, and how does it perform the computation?

The example first constructs "ind" by sampling 1's with probability 0.8 and 2's with probability 0.2, one value per row of iris. It then specifies a formula that defines the response and the predictors for the model. It then fits the conditional inference tree to the sampled data, which is just the rows where ind is 1.
It then calls predict without a newdata argument, so the predictions are for the same rows the tree was fitted on, not for the full sample of 1's and 2's.
So basically it trained on the 1's and predicts on the 1's; to predict the held-out 2's, pass them explicitly, e.g. predict(iris_ctree, newdata = iris[ind==2,]).
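To make the difference between the rows the tree was fitted on and the held-out rows concrete, here is a minimal sketch. The names train, test, pred_train, pred_test and nodes, and the set.seed call, are assumptions added for illustration; they are not part of the original commands. As far as I understand party's trees, each observation is passed down the splits to a terminal node and receives the most frequent class among the training rows in that node, which the last two lines let you inspect.

```r
library(party)

set.seed(1)  # assumption: fixed seed so the split is reproducible
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.8, 0.2))
train <- iris[ind == 1, ]  # rows used to fit the tree
test  <- iris[ind == 2, ]  # held-out rows

myFormula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
iris_ctree <- ctree(myFormula, data = train)

# No newdata: one prediction per training row
pred_train <- predict(iris_ctree)
length(pred_train) == nrow(train)

# newdata given: one prediction per held-out row
pred_test <- predict(iris_ctree, newdata = test)
table(predicted = pred_test, observed = test$Species)

# Terminal node that each training row falls into
nodes <- predict(iris_ctree, type = "node")
# Most frequent class per terminal node, for comparison with pred_train
tapply(train$Species, nodes, function(s) names(which.max(table(s))))
```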

Related

Is there a way to derive GVIF in Jamovi?

Apparently the car package vif function and performance package check_collinearity function both calculate generalized variance inflation factor (GVIF) automatically if a categorical variable is entered into a regression. As an example, this can be done with the iris dataset with the following code:
#### Categorical Fit ####
library(performance) # provides check_collinearity()
fit.cat <- lm(
  Petal.Length ~ Species + Petal.Width + Sepal.Width,
  iris
)
check_collinearity(fit.cat)
This gives the expected value of 26.10, which I have already calculated by hand. However, Jamovi doesn't allow one to automatically add factors to a regression, so I dummy coded the same factor and entered the regression like so:
You can see at the arrow that the value doesn't match the one obtained from the R function. I also double-checked in R to see whether it is just calculating VIF instead:
1/(1-summary(lm(as.numeric(Species) ~ 1 + Petal.Width + Sepal.Width,
iris))$r.squared)
But the values don't match, as this gives me a VIF of 12.78. Why is it doing this and is there a solution in Jamovi for hacking this?
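One way to see where the figures might diverge is to compute the GVIF by hand in base R. The sketch below follows the determinant formula of Fox and Monette for a term that spans several columns. The helper names X, term_id, R, gvif and df_term are my own, and whether Jamovi reports a per-dummy VIF instead is an assumption I have not verified.

```r
# Same model as above, fitted with base R only
fit.cat <- lm(Petal.Length ~ Species + Petal.Width + Sepal.Width, data = iris)

mm <- model.matrix(fit.cat)
X <- mm[, -1, drop = FALSE]              # design matrix without the intercept
term_id <- attr(mm, "assign")[-1]        # term index of each remaining column
term_names <- attr(terms(fit.cat), "term.labels")
R <- cor(X)                              # correlation matrix of the regressors

# GVIF for a term: det(R11) * det(R22) / det(R), where R11 holds the
# term's own columns and R22 holds all other columns
gvif <- sapply(seq_along(term_names), function(j) {
  cols <- which(term_id == j)
  det(R[cols, cols, drop = FALSE]) *
    det(R[-cols, -cols, drop = FALSE]) / det(R)
})
names(gvif) <- term_names

df_term <- tabulate(term_id)             # number of columns per term
cbind(GVIF = gvif, Df = df_term, "GVIF^(1/(2*Df))" = gvif^(1 / (2 * df_term)))
```

For a term with a single column the formula reduces to the usual 1/(1 - R^2). For Species it treats both dummy columns as one block, which a separate VIF per dummy column does not do.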

R: glmrob can't predict models with dropped co-linear columns, while glm can?

I'm learning to implement robust glms in R, but can't figure out why I am unable to get glmrob to predict values from my regression models when I have a model where some columns are dropped due to co-linearity. Specifically when I use the predict function to predict values from a glmrob, it always gives NA for all values. I don't observe this when predicting values from the same data & model using glm. It doesn't seem to matter what data I use -- as long as there is a NA coefficient in the fitted model (and the NA isn't the last coefficient in the coefficient vector), the predict does not work.
This behavior holds for all datasets and models I have tried where an internal column is dropped due to co-linearity. I include a fake data set where two columns are dropped from the model, which gives two NAs in the coefficient list. Both glm and glmrob give nearly identical coefficients, yet predict only works with the glm model. So my question is: what don't I understand about robust regression that would prevent my glmrob models from generating predicted values?
library(robustbase)
#Make fake data with two categorical predictors
df <- data.frame("category" = rep(c("A","B","C"),each=6))
df$location <- rep(1:6,each=3)
val <- rep(c(500,50,5000),each=6)+rep(c(50,100,25,200,100,1),each=3)
df$value <- rpois(NROW(df),val)
#note that predict works if we omit the newdata parameter. However I need the newdata param
#so I use the original dataframe here as a stand-in.
mod <- glm(val ~ category + as.factor(location), data=df, family=poisson)
predict(mod, newdata=df) # works fine
mod <- glmrob(val ~ category + as.factor(location), data=df, family=poisson)
predict(mod, newdata=df) #predicts NA for all values
I've been digging into this and have concluded that the problem does not lie in my understanding of robust regression, but rather the problem lies with a bug in the robustbase package. The predict.lmrob function does not correctly pick the necessary coefficients from the model before the prediction. It needs to pick the first x non-NA coefficients (where x=rank of the model matrix). Instead it merely picks the first x coefficients without checking if they are NA. This explains why this problem only surfaces for models where the NA isn't the last coefficient in the coefficient vector.
To fix this, I copied the predict.lmrob source using:
getAnywhere(predict.lmrob)
and created my own replacement function. In this function I made a single modification to the code:
...
p <- object$rank
if (is.null(p)) {
    df <- Inf
    p <- sum(!is.na(coef(object)))
    #piv <- seq_len(p) # old code
    piv <- which(!is.na(coef(object))) # new code
}
else {
    p1 <- seq_len(p)
    piv <- if (p)
        qr(object)$pivot[p1]
}
...
I've run a few hundred datasets using this change and it has worked well.
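If patching the package function is not an option, the same idea (use only the coefficients that are not NA) can be applied by hand. The sketch below demonstrates it with glm, where the result can be checked against predict. Applying the identical steps to a glmrob fit is an assumption on my part and relies on coef() and terms() behaving the same way for that class. The sketch uses value as the response and a fixed seed, both of which differ from the code in the question, and the names X, keep and eta_manual are made up.

```r
set.seed(1)  # assumption: fixed seed for reproducibility
df <- data.frame(category = rep(c("A", "B", "C"), each = 6))
df$location <- rep(1:6, each = 3)
val <- rep(c(500, 50, 5000), each = 6) + rep(c(50, 100, 25, 200, 100, 1), each = 3)
df$value <- rpois(nrow(df), val)

mod <- glm(value ~ category + as.factor(location), data = df, family = poisson)

# Rebuild the design matrix for the new data from the model's terms
X <- model.matrix(delete.response(terms(mod)), data = df)

# Keep only the columns whose coefficient was actually estimated
keep <- !is.na(coef(mod))
eta_manual <- drop(X[, keep, drop = FALSE] %*% coef(mod)[keep])  # link scale
mu_manual <- exp(eta_manual)                                     # response scale

# Compare with predict.glm (which may warn about the rank-deficient fit)
all.equal(unname(eta_manual),
          unname(suppressWarnings(predict(mod, newdata = df))))
```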

Fit a linear mixed model with only 1 group

Is it possible to force an lmer model with a random effect to be fitted on data whose grouping factor has only one level? We want to do this to keep the same model structure in the rare case where our data contain only one grouping level. The following illustrates the error.
library(lme4)
#> Loading required package: Matrix
sleepstudy$Subject <- as.character(sleepstudy$Subject)
ss <- sleepstudy[sleepstudy$Subject == "308", ]
m1 <- lmer(Reaction~Days+(1|Subject), ss)
#> Error: grouping factors must have > 1 sampled level
Note that we are fixing the variance (see the previous question: Fixing variance values in lme4), so we do not need to estimate it.

Residuals and plots in ordered multinomial regression

I need to draw a binned residual plot of fitted versus residual values from an ordered multinomial logit regression.
How can I extract residuals when using polr? Is there any other function that fits an ordered multinomial logit model and from which residuals can be extracted?
This is the code I used:
library(MASS) # for polr
library(arm) # for binnedplot
options(contrasts = c("contr.treatment", "contr.poly"))
mod1 <- polr(as.ordered(y) ~ x1 + x2 + x3, data, method='logistic')
fit <- mod1$fitted.values
res <- residuals(mod1)
binnedplot(fit, res)
The problem is that the object 'res' is NULL.
Thanks
For a start, can you tell us how residuals would be defined in principle for a model with categorical responses? fitted.values is a matrix of probabilities. You could define residuals in terms of correct prediction (defining the most likely outcome as the prediction, as in the default predict method for polr objects) -- or you could compute an n-by-n table of true values and predicted values. Alternatively you could reduce the ordinal data back to an integer scale and compute a mean outcome as the prediction ... but I can't see that there's any unique way to define the residuals in the first place.
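Two of the definitions mentioned above can be sketched as follows. The data are simulated and every name (dat, latent, probs, pred_class, expected_score, score_resid) is made up for illustration; neither quantity is an official residual for polr.

```r
library(MASS)

set.seed(42)  # assumption: simulated data, fixed seed
n <- 300
x1 <- rnorm(n)
latent <- 1.5 * x1 + rlogis(n)  # latent variable behind the ordered outcome
y <- cut(latent, breaks = c(-Inf, -1, 1, Inf),
         labels = c("low", "mid", "high"), ordered_result = TRUE)
dat <- data.frame(y = y, x1 = x1)

mod <- polr(y ~ x1, data = dat, method = "logistic")

# fitted.values is an n-by-3 matrix of category probabilities
probs <- mod$fitted.values

# Definition 1: most likely category versus observed category
pred_class <- predict(mod)  # default type is "class"
table(observed = dat$y, predicted = pred_class)

# Definition 2: integer score of the outcome minus its expected score
expected_score <- drop(probs %*% seq_len(ncol(probs)))
score_resid <- as.integer(dat$y) - expected_score
summary(score_resid)
```

The second definition treats the categories as equally spaced integers, which is a strong assumption for ordinal data.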
polr() has no method that returns residuals. You have to calculate them manually from their definition.
There are actually plenty of ways to get residuals from an ordinal probit/logit model. Although polr does not provide any residuals, vglm provides several. See ?residualsvglm from the VGAM package (see also below).
NOTE: However, for a Control Function/2SRI approach, Wooldridge (2014) suggests using the generalised residuals as described in Vella (1993). As far as I know, these are currently not available in R, although I am working on that, but they are available in Stata (using predict gr, score).
Residuals in VGLM
Surrogate residuals for polr
You can use the sure package to calculate surrogate residuals with resids. The package is based on a paper in the Journal of the American Statistical Association.
library(sure) # for residual function and sample data sets
library(MASS) # for polr function
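# df1, df2 and df3 are sample data sets shipped with sure
df1 <- df1 # copy the package data set into the workspace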
df1$x1 <- df1$x
df1$x <- NULL
df1$y <- df2$y
df1$x2 <- df2$x
df1$x3 <- df3$x
options(contrasts = c("contr.treatment", "contr.poly"))
mod1 <- polr(as.ordered(y) ~ x1 + x2 + x3, data=df1, method='probit')
fit <- mod1$fitted.values
res <- resids(mod1)
EDIT: One big issue is the following (from ?resids):
"Note: Surrogate residuals require sampling from a continuous distribution; consequently, the result will be different with every call to resids. The internal functions used for sampling from truncated distributions when method = "latent" are based on modified versions of rtrunc and qtrunc."
Even when running resids(mod1, nsim=1000, method="latent"), there is no convergence of the outcome.

Boot package in R simple assistance

If I want to use the boot() function from R's boot package to assess the significance of the Pearson correlation coefficient between two vectors, should I do it like this:
boot(re1, cor, R = 1000)
where re1 is a two-column matrix holding these two observation vectors? I can't seem to get this right, because the correlation of these vectors is 0.8, but the above call returns -0.2 as t0.
Just to emphasize the general idea of bootstrapping in R, although @caracal already answered your question in his comment: when using boot, you need a data structure (usually a matrix) that can be sampled by row. The statistic is usually computed in a function that receives this data matrix together with a vector of resampled row indices and returns the statistic of interest for the resampled rows. You then call boot(), which applies this function to R replicates and collects the results in a structured format. Those results can in turn be assessed using boot.ci().
Here are two working examples using the low birth weight data (birthwt) in the MASS package.
library(boot)
library(MASS)
data(birthwt)
# compute CIs for correlation between mother's weight and birth weight
cor.boot <- function(data, k) cor(data[k,])[1,2]
cor.res <- boot(data=with(birthwt, cbind(lwt, bwt)),
                statistic=cor.boot, R=500)
cor.res
boot.ci(cor.res, type="bca")
# compute CI for a particular regression coefficient, e.g. bwt ~ smoke + ht
fm <- bwt ~ smoke + ht
reg.boot <- function(formula, data, k) coef(lm(formula, data[k,]))
reg.res <- boot(data=birthwt, statistic=reg.boot,
                R=500, formula=fm)
boot.ci(reg.res, type="bca", index=2) # smoke
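Applied to the call in the question: boot invokes the statistic as statistic(data, indices), so passing cor directly makes cor treat the row indices as its second argument, and the two columns are never correlated with each other. A minimal sketch of the fix follows, with simulated data standing in for re1. The simulated values, the seed and the name cor_stat are assumptions.

```r
library(boot)

set.seed(123)  # assumption: simulated stand-in for the asker's re1
x <- rnorm(50)
re1 <- cbind(x = x, y = 0.8 * x + rnorm(50, sd = 0.5))

# Statistic: correlation of the two columns on the resampled rows
cor_stat <- function(data, idx) cor(data[idx, 1], data[idx, 2])

res <- boot(re1, statistic = cor_stat, R = 1000)

res$t0                    # statistic on the original rows
cor(re1[, 1], re1[, 2])   # same value, computed directly
boot.ci(res, type = "perc")
```

A percentile interval that excludes zero is one informal way to judge significance; a permutation test would be a more direct answer to that part of the question.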
