train() function and rate model (Poisson regression with offset) with caret in R

I fitted a rate model using glm() (Poisson family with a log link and an offset), like
y ~ offset(log(x1)) + x2 + x3
so the quantity being modelled is effectively the rate y/x1.
Then I wanted to do cross-validation using the caret package, so I used the train() function with k-fold CV control. It turns out the two models I get are very different. It seems that train() can't handle the offset: if I change the variable inside the offset to offset(log(log(x1))) or offset(log(sqrt(x1))), the fitted model stays exactly the same.
Has anyone run into this before, and how did you deal with it?
Thanks!
By the way, I want to save the predictions on each validation set, and so far caret is the only package I know of that can do that, which is why I didn't use cv.glm.

I cannot claim prior experience with this exact process, and I have not done any testing in the absence of a reproducible example and code from you. But I do have experience with moving offsets to the LHS of a Poisson glm() call, so why not change the formula (and family) to:
glm( I(y/x1) ~ x2 + x3, family=quasipoisson, data= , ...)
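If the goal is still cross-validation with caret, one possibility is a sketch along these lines (untested; it assumes your data frame is called df and precomputes the rate so that train() never has to deal with an offset):
library(caret)
df$rate <- df$y / df$x1                      # the quantity I(y/x1) models
ctrl <- trainControl(method = "cv", number = 10, savePredictions = "final")
cv_fit <- train(rate ~ x2 + x3, data = df,
                method = "glm", family = quasipoisson(link = "log"),
                trControl = ctrl)
head(cv_fit$pred)                            # held-out predictions per fold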

Related

GAM with only Categorical/Logical

I'm currently trying to use a GAM to calculate a rough estimation of expected goals model based purely on the commentary data from ESPN. However, all the data is either a categorical variable or a logical vector, so I'm not sure if there's a way to smooth, or if I should just use the factor names.
Here are my variables:
shot_where (factor): shot location (e.g. right side of the box)
assist_class (factor): type of assist (cross, through ball, pass)
follow_corner (logical): whether the shot follows a corner
shot_with (factor): right foot, left foot, header
follow_set_piece (logical): whether the shot follows a set piece
I think I should just write the formula with the plain variable names:
model <- bam(is_goal ~ shot_where + assist_class + follow_set_piece + shot_where + follow_corner + shot_where:shot_with, family = "binomial", method = "REML")
The shot_where:shot_with term would incorporate any interaction between these two variables.
However, I was told I could smooth factor variables as well using the below structure.
model <- bam(is_goal ~ s(shot_where, bs = 'fs') + s(assist_class, bs = 'fs') + as.logical(follow_set_piece) +
as.logical(follow_corner) + s(shot_with, bs = 'fs'), data = model_data, family = "binomial", method = "REML")
This worked for creating a model, but I want to make sure this is a correct method of building the model. I've yet to see any information on using only factor/logical variables in a GAM model, so I thought it was worth asking.
If you only have categorical covariates then you aren't fitting a GAM, whether you fit the model with gam(), bam(), or something else.
What you are doing when you pass factor variables to s() using the fs basis like this
s(f, bs = 'fs')
is creating a random intercept for each level of the factor f.
There's no smoothing going on here at all; the model is simply exploiting the equivalence of the Bayesian view of smoothing with random effects.
Given that none of your covariates could reasonably be considered random in the sense of a mixed effects model, the only justification for doing what you're doing might be as a computational trick.
Your first model is just a simple GLM (note the typo in the formula: shot_where appears twice).
It's not clear to me why you are using bam() to fit this model; you're losing the computational efficiency that bam() provides by using method = 'REML'; it should be 'fREML' for bam() models. But as there is no smoothness selection going on in the first model, you'd likely be better off using glm() to fit it. If the issue is large sample sizes, there are several packages that can fit GLMs to large data, for example biglm and its bigglm() function, as sketched below.
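A rough sketch of that suggestion, using the variable names from the question (model_data and the exact covariate set are assumptions on my part):
# Plain GLM with the covariates as factors/logicals, including the
# shot_where:shot_with interaction the first model seemed to intend
glm_fit <- glm(is_goal ~ shot_where * shot_with + assist_class +
                 follow_set_piece + follow_corner,
               data = model_data, family = binomial())
# The same model via biglm's bigglm() if the data are genuinely large
library(biglm)
big_fit <- bigglm(is_goal ~ shot_where * shot_with + assist_class +
                    follow_set_piece + follow_corner,
                  data = model_data, family = binomial())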
In the second model there is no smoothing going on either, but there is penalisation, which shrinks the estimates of the random intercepts toward zero. You're likely to get better performance on big data using the lme4 package, or TMB via the glmmTMB package, to fit what is effectively a GLMM.
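For what it's worth, here is a rough sketch of that GLMM in glmmTMB, again with the variable names from the question (model_data assumed):
library(glmmTMB)
# Random intercepts for the factors, fixed effects for the logicals (sketch)
glmm_fit <- glmmTMB(is_goal ~ follow_set_piece + follow_corner +
                      (1 | shot_where) + (1 | assist_class) + (1 | shot_with),
                    data = model_data, family = binomial())
summary(glmm_fit)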
This is more of a theoretical question than about R, but let me provide a brief answer. Essentially, the most flexible model you could estimate would be one where you used the variables as factors. It also produces a model that is reasonably easily interpreted - where each coefficient gives you the expected difference in y between the reference level and the level represented by the dummy regressor.
Smoothing splines try to strike the appropriate bias-variance tradeoff. If you've got lots of data and relatively few categories in the categorical variables, there will be no real loss in efficiency for including all of the dummy regressors representing the categories and the bias will also be as small as possible. To the extent that the smoothing spline model is different from the one treating everything as factors, it is likely inducing bias without a corresponding increase in efficiency. If it were me, I would stick with a model that treats all of the categorical variables as factors.

Why do heteroscedasticity-robust standard errors in logistic regression?

I am following a course on R. At the moment, we are working with logistic regression. The basic form we are taught is this one:
model <- glm(
formula = y ~ x1 + x2,
data = df,
family = quasibinomial(link = "logit"),
weights = weight
)
This makes perfect sense to me. However, we are then recommended to use the following to get coefficients and heteroscedasticity-robust inference:
model_rob <- lmtest::coeftest(model, sandwich::vcovHC(model))
This confuses me a bit. Reading about vcovHC, it states that it creates a "heteroskedasticity-consistent estimation". Why would you do this when doing logistic regression? I thought it did not assume homoscedasticity. Also, I am not sure what coeftest does?
Thank you!
You're right: homoscedasticity (residuals at each level of the predictor have the same variance) is not an assumption in logistic regression. However, the binary response in logistic regression is inherently heteroscedastic (0 or 1), which is why a corresponding estimator should be consistent with it. I guess that is what is meant by "heteroscedasticity-consistent". As @MrFlick already pointed out, if you would like more information on that topic, Cross Validated is likely the place to go. coeftest() produces Wald test statistics for the estimated coefficients. These tests give you some information on whether a predictor (independent variable) appears to be associated with the dependent variable according to your data.
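As a short sketch of what that call adds in practice, reusing the model object from the question:
library(lmtest)
library(sandwich)
coeftest(model)                           # Wald tests with the model-based SEs
coeftest(model, vcov. = vcovHC(model))    # same coefficients, sandwich (HC) SEs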

Scenario development with GAM models

I'm working with a mgcv::gam model in R to generate predictions in which the relationship between time (year) and the outcome variable (out) varies. For example, in one scenario, I'd like to force time to affect the outcome variable in a linear manner, in another a marginally decreasing manner, and in another, I'd like to specify specific slopes of the time-outcome interaction. I'm unsure how to force the prediction to treat the interaction between time and the outcome variable in a specific manner:
res <- gam(out ~ s(time) + s(GEOID, bs='re'), data = df, method = "REML")
pred <- predict(res, newdata = ndf, type = "response", se.fit = TRUE)
There isn't an interaction between time and out; here time has a potentially non-linear effect on out.
Are we talking about trying to force certain shapes for the function of time? If so, you will need to estimate different models; use time if you want a linear effect:
res_lin <- gam(out ~ time + s(GEOID, bs='re'), data = df, method = "REML")
and look at shape-constrained P-splines to enforce monotonicity or concave/convex relationships.
The scam package has these sorts of constraints and uses mgcv machinery, with GCV smoothness selection, to fit the shape-constrained models.
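A rough sketch with scam, assuming the random-effect term can be carried over unchanged (the shape-constrained basis names are scam's: "mpi" for monotone increasing, "mpd" for monotone decreasing, "cv"/"cx" for concave/convex):
library(scam)
# Monotone increasing effect of time; swap the basis for other shapes (sketch)
res_mono <- scam(out ~ s(time, bs = "mpi") + s(GEOID, bs = "re"), data = df)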
As for specifying a specific slope for the linear effect of time, I think you'll need to include time as an offset in the model. So, say the slope you want is 0.5: I think you need to do + offset(I(0.5*time)), because an offset by definition has a coefficient of 1. I would double-check this though, as I might have messed up my thinking here.
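In code that would look roughly like this (same caveat applies, so treat it as a sketch):
library(mgcv)
# Force a linear time effect with slope 0.5; the offset enters with coefficient 1
res_fixed <- gam(out ~ offset(I(0.5 * time)) + s(GEOID, bs = "re"),
                 data = df, method = "REML")
pred_fixed <- predict(res_fixed, newdata = ndf, type = "response", se.fit = TRUE)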

R Prediction on a Linear Regression Model

I'm sure this is something that can be done, just not sure how!
I have a dataset of around 500 rows (CSV) showing footballers' match stats (e.g. passes, shots on target, etc.). I have some of their salaries (around 10) and I'm trying to predict the rest using a linear regression equation.
In the below, if y is salary, is there a way in R to essentially autopopulate what the rest of the salaries might be, based on the ten salaries I do have?
lm(y ~ x1 + x2 +x3)
Any help would be much appreciated.
This is what the predict function does.
Note that you don't need to call predict.lm explicitly. Because the result of a call to lm is an object with class "lm", R "knows" to use predict.lm when you call predict on it.
Eg:
lm1 <- lm(y ~ x1 + x2 +x3)
y.fitted <- predict(lm1)
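To fill in the missing salaries specifically, fit on the rows with a known salary and predict for the rest. A sketch, assuming a data frame players with columns y (salary, NA where unknown), x1, x2 and x3:
known   <- subset(players, !is.na(y))    # the ~10 players with a salary
unknown <- subset(players, is.na(y))     # everyone else
lm1 <- lm(y ~ x1 + x2 + x3, data = known)
unknown$y_pred <- predict(lm1, newdata = unknown)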
You should also be able to test the predictive accuracy of your model using cross-validation with the cv.lm function from the DAAG package. With this function the data are split into folds: the model is fitted on the training portion and then tested on the held-out portion.

Heteroscedasticity robust standard errors with the PLM package

I am trying to learn R after using Stata and I must say that I love it. But now I am having some trouble. I am about to do some multiple regressions with Panel Data so I am using the plm package.
Now I want to have the same results with plm in R as when I use the lm function and Stata when I perform a heteroscedasticity robust and entity fixed regression.
Let's say that I have a panel dataset with the variables Y, ENTITY, TIME, V1.
I get the same standard errors in R with this code
lm.model<-lm(Y ~ V1 + factor(ENTITY), data=data)
coeftest(lm.model, vcov. = vcovHC(lm.model, type = "HC1"))
as when I perform this regression in Stata
xi: reg Y V1 i.ENTITY, robust
But when I perform this regression with the plm package I get other standard errors
plm.model <- plm(Y ~ V1, index = c("ENTITY","YEAR"), model = "within", effect = "individual", data = data)
coeftest(plm.model, vcov. = vcovHC(plm.model, type = "HC1"))
Have I missed setting some options?
Does the plm model use some other kind of estimation and if so how?
Can I in some way get the same standard errors with plm as in Stata with the , robust option?
By default the plm package does not use the exact same small-sample correction for panel data as Stata. However in version 1.5 of plm (on CRAN) you have an option that will emulate what Stata is doing.
plm.model <- plm(Y ~ V1, index = c("ENTITY","YEAR"), model = "within",
effect = "individual", data = data)
coeftest(plm.model, vcov.=function(x) vcovHC(x, type="sss"))
This should yield the same clustered by group standard-errors as in Stata (but as mentioned in the comments, without a reproducible example and what results you expect it's harder to answer the question).
For more discussion on this and some benchmarks of R and Stata robust SEs see Fama-MacBeth and Cluster-Robust (by Firm and Time) Standard Errors in R.
See also:
Clustered standard errors in R using plm (with fixed effects)
Is it possible that your Stata code is different from what you are doing with plm?
plm's "within" option with "individual" effects means a model of the form:
y_it = a + X_it*B + e_it + c_i
What plm does is demean the variables so that c_i drops out of the equation:
y_it_bar = X_it_bar*B + e_it_bar
where the "bar" suffix means that each variable has had its mean subtracted. The mean is calculated over time, which is why the effect is an individual one. You could also have a fixed time effect common to all individuals, in which case the demeaning would be across individuals at each point in time (that is irrelevant in this case, though).
I am not sure what the "xi" command does in STATA, but i think it expands an interaction right ? Then it seems to me that you are trying to use a dummy variable per ENTITY as was highlighted by #richardh.
For your Stata and plm codes to match you must be using the same model.
You have two options: (1) xtset your data in Stata and use xtreg with the fe option, or (2) use plm with the pooling model and one dummy per ENTITY.
Matching Stata to R:
xtset entity year
xtreg y v1, fe robust
Matching plm to Stata:
plm(Y ~ V1 + as.factor(ENTITY), index = c("ENTITY","YEAR"), model = "pooling", effect = "individual", data = data)
Then use vcovHC with one of the modifiers. Make sure to check this paper that has a nice review of all the mechanics behind the "HC" options and the way they affect the variance covariance matrix.
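For example, a sketch assuming the pooling fit above is stored in an object (hypothetically pool.model):
library(plm)
library(lmtest)
pool.model <- plm(Y ~ V1 + as.factor(ENTITY), index = c("ENTITY","YEAR"),
                  model = "pooling", data = data)
# White-type covariance clustered by entity with an HC1-style adjustment
coeftest(pool.model, vcov. = vcovHC(pool.model, type = "HC1", cluster = "group"))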
Hope this helps.
