mgcv: Error Model has more coefficients than data, related to the argument by in the gam() - r

In case a, the gam code in mgcv R package is working well.
library(mgcv)
dat <- gamSim(1,n=400,dist="normal",scale=2)
num_knots = nrow(dat)
fit <- gam(y~s(x0, bs = "cr", k = num_knots, m=2),data=dat)
summary(fit)
But after I added the argument by in the gam(), it reported the error "Model has more coefficients than data".
fit <- gam(y~s(x0, bs = "cr", k = num_knots, m=2, by = x1),data=dat)
The error confuses me because I thought adding the by argument to create the interaction between the smoothing term and the parametric term should not increase the number of unknown coefficients, though it turns out that I am wrong. Where was I wrong?

When you pass a continuous variable to by, what you are getting is varying coefficient model where the effect of x1 varies as a smooth function of x0.
What is happening in the first case is that because of identifiability constraints being applied to the basis expansion for x0, you requested num_knots basis functions but actually got num_knots - 1 basis functions. When you add the intercept you get num_knots coefficients which is OK to fit with this model as it is a penalised spline (though you probably want method = 'REML'). The identifiability constraint is applied because there is a basis function (or combination) that is confounded with the model intercept and you can't fit two constant terms in the model and have them be uniquely identified.
In the second case, the varying coefficient model, there isn't the same issue, so when you ask for num_knots basis functions plus an intercept you are trying to fit a model with 401 coefficients with 400 observations which isn't going to work.

Related

Optimizing a GAM for Smoothness

I am currently trying to generate a general additive model in R using a response variable and three predictor variables. One of the predictors is linear, and the dataset consists of 298 observations.
I have run the following code to generate a basic GAM:
GAM <- gam(response~ linearpredictor+ s(predictor2) + s(predictor3), data = data[2:5])
This produces a model with 18 degrees of freedom and seems to substantially overfit the data. I'm wondering how I might generate a GAM that maximizes smoothness and predictive error. I realize that each of these features is going to come at the expense of the other, but is there good a way to find the optimal model that doesn't overfit?
Additionally, I need to perform leave one out cross validation (LOOCV), and I am not sure how to make sure that gam() does this in the MGCV package. Any help on either of these problems uld be greatly appreciated. Thank you.
I've run this to generate a GAM, but it overfits the data.
GAM <- gam(response~ linearpredictor+ s(predictor2) + s(predictor3), data = data[2:5])
I have also generated 1,000,000 GAMs with varying combinations of smoothing parameters and ranged the maximum degrees of freedom allowed from 10 (as shown in the code below) to 19. The variable "combinations2" is a list of all 1,000,000 combinations of smoothers I selected. This code is designed to try and balance degrees of freedom and AIC score. It does function, but I'm not sure that I'm actually going to be able to find the optimal model from this. I also cannot tell how to make sure that it uses LOOCV.
BestGAM <- gam(response~ linearpredictor+ predictor2+ predictor3, data = data[2:5])
for(i in 1:100000){
PotentialGAM <- gam(response~ linearpredictor+ s(predictor2) + s(predictor3), data = data[2:5], sp=c(combinations2[i,]$Var1,combinations2[i,]$Var2))
if (AIC(PotentialGAM,BestGAM)$df[1] <= 10 & AIC(PotentialGAM,BestGAM)$AIC[1] < AIC(PotentialGAM,BestGAM)$AIC[2]){
BestGAM <<- PotentialGAM
listNumber <- i
}
}
You are fitting your GAM using generalised cross validation (GCV) smoothness selection. GCV is a way to get around the invariance problem of ordinary cross validation (OCV; what you also call LOOCV) when estimating GAMs. Note that GCV is the same as OCV on a rotated version of the fitting problem (rotating y - Xβ by Q, any orthogonal matrix), and while when fitting with GCV {mgcv} doesn't actually need to do the rotation and the expected GCV score isn't affected by the rotation, GCV is just OCV (wood 2017, p. 260)
It has been shown that GCV can undersmooth (resulting in more wiggly models) as the objective function (GCV profile) can become flat around the optimum. Instead it is preferred to estimate GAMs (with penalized smooths) using REML or ML smoothness selection; add method = "REML" (or "ML") to your gam() call.
If the REML or ML fit is as wiggly as the GCV one with your data, then I'd be likely to presume gam() is not overfitting, but that there is something about your response data that hasn't been explained here (are the data ordered in time, for example?)
As to your question
how I might generate a GAM that maximizes smoothness and [minimize?] predictive error,
you are already doing that using GCV smoothness selection and for a particular definition of "smoothness" (in this case it is squared second derivatives of the estimated smooths, integrated over the range of the covariates, and summed over smooths).
If you want GCV but smoother models, you can increase the gamma argument above 1; gamma 1.4 is often used for example, which means that each EDF costs 40% more in the GCV criterion.
FWIW, you can get the LOOCV (OCV) score for your model without actually fitting 288 GAMs through the use of the influence matrix A. Here's a reproducible example using my {gratia} package:
library("gratia")
library("mgcv")
df <- data_sim("eg1", seed = 1)
m <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = df, method = "REML")
A <- influence(m)
r <- residuals(m, type = "response")
ocv_score <- mean(r^2 / (1 - A))

LASSO-type regressions with non-negative continuous dependent variable (dependent var)

I am using "glmnet" package (in R) mostly to perform regularized linear regression.
However I am wondering if it can perform LASSO-type regressions with non-negative (integer) continuous (dependent) outcome variable.
I can use family = poisson, but the outcome variable is not specifically "count" variable. It is just a continuous variable with lower limit 0.
I aware of "lower.limits" function, but I guess it is for covariates (independent variables). (Please correct me if my understanding of this function not right.)
I look forward to hearing from you all! Thanks :-)
You are right that setting lower limit in glmnet is meant for covariates. Poisson will set a lower limit to zero because you exponentiate to get back the "counts".
Going along those lines, most likely it will work if you transform your response variable. One quick way is to take the log of your response variable, do the fit and transform it back, this will ensure that it's always positive. you have to deal with zeros
An alternative is a power transformation. There's a lot to think about and I can only try a two parameter box-cox with a dataset since you did not provide yours:
library(glmnet)
library(mlbench)
library(geoR)
data(BostonHousing)
data = BostonHousing
data$chas=as.numeric(data$chas)
# change it to min 0 and max 1
data$medv = (data$medv-min(data$medv))/diff(range(data$medv))
Then here I use a quick approximation via pca (without fitting all the variables) to get the suitable lambda1 and lambda2 :
bcfit = boxcoxfit(object = data[,14],
xmat = prcomp(data[,-14],scale=TRUE,center=TRUE)$x[,1:2],
lambda2=TRUE)
bcfit
Fitted parameters:
lambda lambda2 beta0 beta1 beta2 sigmasq
0.42696313 0.00001000 -0.83074178 -0.09876102 0.08970137 0.05655903
Convergence code returned by optim: 0
Check the lambda2, it is the one thats critical for deciding whether you get a negative value.. It should be rather small.
Create the functions to power transform:
bct = function(y,l1,l2){((y+l2)^l1 -1)/l1}
bctinverse = function(y,l1,l2){(y*l1+1)^(1/l1) -l2}
Now we transform the response:
data$medv_trans = bct(data$medv,bcfit$lambda[1],bcfit$lambda[2])
And fit glmnet:
fit = glmnet(x=as.matrix(data[,1:13]),y=data$medv_trans,nlambda=500)
Get predictions over all lambdas, and you can see there's no negative predictions once you transform back:
pred = predict(fit,as.matrix(data[,1:13]))
range(bctinverse(pred,bcfit$lambda[1],bcfit$lambda[2]))
[1] 0.006690685 0.918473356
And let's say we do a fit with cv:
fit = cv.glmnet(x=as.matrix(data[,1:13]),y=data$medv_trans)
pred = predict(fit,as.matrix(data[,1:13]))
pred_transformed = bctinverse(pred,bcfit$lambda[1],bcfit$lambda[2]
plot(data$medv,pred_transformed,xlab="orig response",ylab="predictions")

What are the differences between directly plotting the fit function and plotting the predicted values(they have same shape but different ranges)?

I am trying to learn gam() in R for a logistic regression using spline on a predictor. The two methods of plotting in my code gives the same shape but different ranges of response in the logit scale, seems like an intercept is missing in one. Both are supposed to be correct but, why the differences in range?
library(ISLR)
attach(Wage)
library(gam)
gam.lr = gam(I(wage >250) ~ s(age), family = binomial(link = "logit"), data = Wage)
agelims = range(age)
age.grid = seq(from = agelims[1], to = agelims[2])
pred=predict(gam.lr, newdata = list(age = age.grid), type = "link")
par(mfrow = c(2,1))
plot(gam.lr)
plot(age.grid, pred)
I expected that both of the methods would give the exact same plot. plot(gam.lr) plots the additive effects of each component and since here there's only one so it is supposed to give the predicted logit function. The predict method is also giving me estimates in the link scale. But the actual outputs are on different ranges. The minimum value of the first method is -4 while that of the second is less than -7.
The first plot is of the estimated smooth function s(age) only. Smooths are subject to identifiability constraints as in the basis expansion used to parametrise the smooth, there is a function or combination of functions that are entirely confounded with the intercept. As such, you can't fit the smooth and an intercept in the same model as you could subtract some value from the intercept and add it back to the smooth and you have the same fit but different coefficients. As you can add and subtract an infinity of values you have an infinite supply of models, which isn't helpful.
Hence identifiability constraints are applied to the basis expansions, and the one that is most useful is to ensure that the smooth sums to zero over the range of the covariate. This involves centering the smooth at 0, with the intercept then representing the overall mean of the response.
So, the first plot is of the smooth, subject to this sum to zero constraint, so it straddles 0. The intercept in this model is:
> coef(gam.lr)[1]
(Intercept)
-4.7175
If you add this to values in this plot, you get the values in the second plot, which is the application of the full model to the data you supplied, intercept + f(age).
This is all also happening on the link scale, the log odds scale, hence all the negative values.

Type measure differences in glmnet package?

What is the difference between using "mse" and "class" in the glmnet package?
log_x <- model.matrix(response~.,train)
log_y <- ifelse(train$response=="good",1,0)
log_cv <- cv.glmnet(log_x,log_y,alpha=1,family="binomial", type.measure = "class")
summary(log_cv)
plot(log_cv)
vs.
log_x <- model.matrix(response~.,train)
log_y <- ifelse(train$response=="good",1,0)
log_cv <- cv.glmnet(log_x,log_y,alpha=1,family="binomial", type.measure = "mse")
summary(log_cv)
plot(log_cv)
I'm noticing that I'm getting a slightly different curve, or smootness in my plot, and a few % difference in accuracy. But for predicting a binnomial class response is one type measure more appropriate than the other?
It depends on your case study and what you want to learn from your model. From the help files
The default is type.measure="deviance", which uses squared-error
for gaussian models (a.k.a type.measure="mse" there) [...]. type.measure="class"
applies to binomial and multinomial logistic regression only, and gives misclassification
error
Therefore, you have to ask yourself whether, in your problem, you want to minimize misclassification error or the mean squared error.
There is no straight forward answer to which is best. They are two different statistics from which the model decides what is the best penalization parameter to go for given the different models generated by the cross validation.

glmer - predict with binomial data (cbind count data)

I am trying to predict values over time (Days in x axis) for a glmer model that was run on my binomial data. Total Alive and Total Dead are count data. This is my model, and the corresponding steps below.
full.model.dredge<-glmer(cbind(Total.Alive,Total.Dead)~(CO2.Treatment+Lime.Treatment+Day)^3+(Day|Container)+(1|index),
data=Survival.data,family="binomial")
We have accounted for overdispersion as you can see in the code (1:index).
We then use the dredge command to determine the best fitted models with the main effects (CO2.Treatment, Lime.Treatment, Day) and their corresponding interactions.
dredge.models<-dredge(full.model.dredge,trace=FALSE,rank="AICc")
Then made a workspace variable for them
my.dredge.models<-get.models(dredge.models)
We then conducted a model average to average the coefficients for the best fit models
silly<-model.avg(my.dredge.models,subset=delta<10)
But now I want to create a graph, with the Total Alive on the Y axis, and Days on the X axis, and a fitted line depending on the output of the model. I understand this is tricky because the model concatenated the Total.Alive and Total.Dead (see cbind(Total.Alive,Total.Dead) in the model.
When I try to run a predict command I get the error
# 9: In UseMethod("predict") :
# no applicable method for 'predict' applied to an object of class "mer"
Most of your problem is that you're using a pre-1.0 version of lme4, which doesn't have the predict method implemented. (Updating would be easiest, but I believe that if you can't for some reason, there's a recipe at http://glmm.wikidot.com/faq for doing the predictions by hand by extracting the fixed-effect design matrix and the coefficients ...)There's actually not a problem with the predictions, which predict the log-odds (by default) or the probability (if type="response"); if you wanted to predict numbers, you'd have to multiply by N appropriately.
You didn't give one, but here's a reproducible (albeit somewhat trivial) example using the built-in cbpp data set (I do get some warning messages -- no non-missing arguments to max; returning -Inf -- but I think this may be due to the fact that there's only one non-trivial fixed-effect parameter in the model?)
library(lme4)
packageVersion("lme4") ## 1.1.4, but this should work as long as >1.0.0
library(MuMIn)
It's convenient for later use (with ggplot) to add a variable for the proportion:
cbpp <- transform(cbpp,prop=incidence/size)
Fit the model (you could also use glmer(prop~..., weights=size, ...))
gm0 <- glmer(cbind(incidence, size - incidence) ~ period+(1|herd),
family = binomial, data = cbpp)
dredge.models<-dredge(gm0,trace=FALSE,rank="AICc")
my.dredge.models<-get.models(dredge.models)
silly<-model.avg(my.dredge.models,subset=delta<10)
Prediction does work:
predict(silly,type="response")
Creating a plot:
library(ggplot2)
theme_set(theme_bw()) ## cosmetic
g0 <- ggplot(cbpp,aes(period,prop))+
geom_point(alpha=0.5,aes(size=size))
Set up a prediction frame:
predframe <- data.frame(period=levels(cbpp$period))
Predict at the population level (ReForm=NA -- this may have to be REForm=NA in lme4 `1.0.5):
predframe$prop <- predict(gm0,newdata=predframe,type="response",ReForm=NA)
Add it to the graph:
g0 + geom_point(data=predframe,colour="red")+
geom_line(data=predframe,colour="red",aes(group=1))

Resources