I would like to fit a cubic spline using the gam function in mgcv package in R. Furthermore, I would like to constrain values outside of the training set (beyond the outer knots) to be equal to the nearest knot value. That is, no model prediction should be done outside the range of the training data. I know I can do this simply by eliminating those points in the predict call and then setting them to the min and max of the training data. However, is there a built in method in gam to do this (just so it's a bit cleaner?)
Example code:
require(mgcv)
x = 10:90
y = x^2
mdl = gam(y ~ s(x, bs="cr"))
needed_x = 1:100
p = predict(mdl, newdata = list(x = needed_x)) # this also returns model predictions for 1:9 and 91:100
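For reference, the manual workaround described above can be sketched as follows. This is only one way to do it, assuming that clamping the new x values to the range of the training data before calling predict is acceptable; the name x_clamped is introduced here for illustration.

```r
library(mgcv)

x <- 10:90
y <- x^2
mdl <- gam(y ~ s(x, bs = "cr"))

needed_x <- 1:100

# Clamp the new values to the training range so that anything outside
# [min(x), max(x)] is predicted at the nearest boundary value.
x_clamped <- pmin(pmax(needed_x, min(x)), max(x))

p <- predict(mdl, newdata = data.frame(x = x_clamped))
```

With this, the predictions for 1:9 equal the prediction at x = 10, and those for 91:100 equal the prediction at x = 90.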
When we plot a GAM model using the mgcv package with isotropic smoothers, we have a contour plot that looks something like this:
x axis for one predictor,
y axis for another predictor,
the main title is the function s(x1, x2) (an isotropic smoother).
Suppose that in this model we have many other isotropic smoothers like:
y ~ s(x1, x2) + s(x3, x4) + s(x5, x6)
My questions are: when interpreting the contour plot for s(x1, x2), what happens to the other isotropic smoothers? Are they "fixed at their medians"? Can we interpret the s(x1, x2) plot separately?
Because this model is additive in the functions you can interpret the functions (the separate s() terms) separately, but not necessarily as separate effects of covariates on the response. In your case there is no overlap between the covariates in each of the bivariate smooths, so you can also interpret them as the effects of the covariates on the response separately from the other smoothers.
All of the smooth functions are typically subject to a sum to zero constraint to allow the model constant term (the intercept) to be an identifiable parameter. As such, the 0 line in each plot is the value of the model constant term (on the scale of the link function or linear predictor).
The plots shown in the output from plot.gam(model) are partial effects plots or partial plots. You can essentially ignore the other terms if you are interested in understanding the effect of that term on the response as a function of the covariates for the term.
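A small sketch of what the sum-to-zero constraint and the partial effects mean in practice. The data here are simulated purely for illustration (dat, x1 to x4 and the true functions are all made up); the point is only that each smooth's contribution averages to zero over the fitting data and that the terms plus the intercept add up to the linear predictor.

```r
library(mgcv)

set.seed(42)
n <- 200
dat <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n), x4 = runif(n))
dat$y <- with(dat, sin(2 * pi * x1) * x2 + (x3 - 0.5)^2 + x4 + rnorm(n, sd = 0.2))

m <- gam(y ~ s(x1, x2) + s(x3, x4), data = dat, method = "REML")

# Contribution of each smooth for every observation (one column per term)
pt <- predict(m, type = "terms")

# Each centred smooth averages to (numerically) zero over the fitting data
term_means <- colMeans(pt)

# Intercept plus the sum of the term contributions gives the linear predictor
lp_manual <- as.numeric(rowSums(pt)) + as.numeric(attr(pt, "constant"))
lp_direct <- as.numeric(predict(m, type = "link"))
all.equal(lp_manual, lp_direct)
```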
If you have other terms in the model that include one or more of the covariates that appear in another term, and you want to look at how the response changes as you vary that term or covariate, then you should predict from the model over the range of the variables you are interested in, whilst holding the other variables at some representative values, say their means or medians.
For example if you had
model <- gam(y ~ s(x, z) + s(x, v), data = foo, method = 'REML')
and you want to know how the response varied as a function of x only, you would fix z and v at representative values and then predict over a range of values for x:
newdf <- with(foo, expand.grid(x = seq(min(x), max(x), length.out = 100),
                               z = median(z),
                               v = median(v)))
newdf <- cbind(newdf, fit = predict(model, newdata = newdf, type = 'response'))
plot(fit ~ x, data = newdf, type = 'l')
Also, see ?vis.gam in the mgcv package as a means of preparing plots like this but where it does the hard work.
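As a rough sketch of the vis.gam route, with simulated data standing in for foo (the data-generating code is made up for illustration), and with v fixed explicitly via the cond argument rather than relying on the default:

```r
library(mgcv)

set.seed(10)
n <- 300
foo <- data.frame(x = runif(n), z = runif(n), v = runif(n))
foo$y <- with(foo, sin(2 * pi * x) * z + x * v + rnorm(n, sd = 0.2))

model <- gam(y ~ s(x, z) + s(x, v), data = foo, method = "REML")

# Contour plot of the fitted response over x and z, holding v at its median
vis.gam(model, view = c("x", "z"), cond = list(v = median(foo$v)),
        plot.type = "contour", type = "response")
```

Unlike plot.gam, this shows predictions from the whole model (intercept and all terms), not the partial effect of a single smooth.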
I fit a Generalized Additive Model in the Negative Binomial family using gam from the mgcv package. I have a data frame containing my dependent variable y, an independent variable x, a factor fac and a random variable ran. I fit the following model
gam1 <- gam(y ~ fac + s(x) + s(ran, bs = 're'), data = dt, family = "nb")
I have read in the Negative Binomial Regression book that it is still possible for the model to be overdispersed. I have found code to check for overdispersion in a glm but I am failing to find it for a gam. I have also encountered suggestions to just check the QQ plot and the plot of standardised residuals vs. predicted values, but I cannot decide from my plots whether the data are still overdispersed. Therefore, I am looking for an equation that would solve my problem.
A good way to check how well the model compares with the observed data (and hence check for overdispersion in the data relative to the conditional distribution implied by the model) is via a rootogram.
I have a blog post showing how to do this for glm() models using the countreg package, but this works for GAMs too.
The salient parts of the post applied to a GAM version of the model are:
library("coenocliner")
library('mgcv')
## parameters for simulating
set.seed(1)
locs <- runif(100, min = 1, max = 10) # environmental locations
A0 <- 90 # maximal abundance
mu <- 3 # position on gradient of optima
alpha <- 1.5 # parameter of beta response
gamma <- 4 # parameter of beta response
r <- 6 # range on gradient species is present
pars <- list(m = mu, r = r, alpha = alpha, gamma = gamma, A0 = A0)
nb.alpha <- 1.5 # overdispersion parameter 1/theta
zprobs <- 0.3 # prob(y == 0) in binomial model
## simulate some negative binomial data from this response model
nb <- coenocline(locs, responseModel = "beta", params = pars,
countModel = "negbin",
countParams = list(alpha = nb.alpha))
df <- setNames(cbind.data.frame(locs, nb), c("x", "yNegBin"))
OK, so we have a sample of data drawn from a negative binomial sampling distribution and we will now fit two models to these data:
A Poisson GAM
m_pois <- gam(yNegBin ~ s(x), data = df, family = poisson())
A negative binomial GAM
m_nb <- gam(yNegBin ~ s(x), data = df, family = nb())
The countreg package is not yet on CRAN but it can be installed from R-Forge:
install.packages("countreg", repos="http://R-Forge.R-project.org")
Then load the packages and plot the rootograms:
library("countreg")
library("ggplot2")
root_pois <- rootogram(m_pois, style = "hanging", plot = FALSE)
root_nb <- rootogram(m_nb, style = "hanging", plot = FALSE)
Now plot the rootograms for each model:
autoplot(root_pois)
autoplot(root_nb)
This is what we get (after plotting both using cowplot::plot_grid() to arrange the two rootograms on the same plot)
We can see that the negative binomial model does a bit better here than the Poisson GAM for these data: the bottoms of the bars are closer to zero throughout the range of the observed counts.
The countreg package has details on how you can add an uncertainty band around the zero line as a form of goodness-of-fit test.
You can also compute the Pearson estimate for the dispersion parameter using the Pearson residuals of each model:
r$> sum(residuals(m_pois, type = "pearson")^2) / df.residual(m_pois)
[1] 28.61546
r$> sum(residuals(m_nb, type = "pearson")^2) / df.residual(m_nb)
[1] 0.5918471
In both cases, these should be close to 1 if the mean-variance relationship assumed by the model is adequate; we see substantial overdispersion in the Poisson GAM, and some underdispersion in the negative binomial GAM.
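If you don't have coenocliner or countreg installed, the Pearson dispersion check can be reproduced with mgcv alone. This is just a sketch: the data are simulated with rnbinom, and pearson_dispersion is a helper name made up here, not a function from any package.

```r
library(mgcv)

set.seed(1)
n <- 300
x <- runif(n, min = 1, max = 10)
mu <- exp(1 + sin(x))
# Negative binomial counts; size is the theta parameter, so variance = mu + mu^2 / size
dat <- data.frame(x = x, y = rnbinom(n, mu = mu, size = 1))

# Pearson estimate of the dispersion: sum of squared Pearson residuals
# divided by the residual degrees of freedom
pearson_dispersion <- function(model) {
  sum(residuals(model, type = "pearson")^2) / df.residual(model)
}

m_pois <- gam(y ~ s(x), data = dat, family = poisson(), method = "REML")
m_nb   <- gam(y ~ s(x), data = dat, family = nb(), method = "REML")

disp <- c(poisson = pearson_dispersion(m_pois),
          negbin  = pearson_dispersion(m_nb))
disp
```

The same helper can be applied to a fitted model such as gam1 from the question.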
I am fitting a regression model with glm, then using predict to get the fitted response on my test data. The problem is that I am getting probabilities and I don't know how to convert those probabilities to real values.
log<- glm(formula=stock_out_duration~lag_2_market_unres_dos+lag_2_percentage_bias_forecast_error + forecast,train_data_final,family = inverse.gaussian(link = "log"),maxit=100)
summary(log)
predict <- predict(log, test_data, type = 'response')
table_mat <- table(test_data$stock_out_duration)
table_mat
As far as I'm aware, there isn't a magic function that does this for you given that you're using glm. As you've noted, what typically gets returned is the probabilities. You can convert the probabilities into predictions for the outcome of the underlying categories by choosing the outcome with the largest probability. I agree a one-line function for this would be nice though.
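For a two-class outcome fitted with a binomial glm (an assumption here; it is not the inverse Gaussian model in the question), choosing the most probable class amounts to thresholding the predicted probability at 0.5. A minimal sketch with made-up data:

```r
set.seed(1)
x <- rnorm(100)
y <- factor(ifelse(x + rnorm(100) > 0, "red", "blue"))  # levels: "blue", "red"

fit <- glm(y ~ x, family = binomial())

# For a factor response, glm models the probability of NOT being the first
# level, so p is P(y == "red") here
p <- predict(fit, type = "response")

# Pick the class with the larger probability
yhat <- factor(ifelse(p > 0.5, levels(y)[2], levels(y)[1]), levels = levels(y))

table(observed = y, predicted = yhat)
```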
You can get this functionality if you use the glmnet package.
library(glmnet)
y = ifelse(rnorm(100) > 0, "red", "blue")
y = factor(y)
x = matrix(rnorm(200), ncol = 2) # glmnet requires x to be a matrix with at least two columns
fit = glmnet(x, y, family="binomial") # use family="multinomial" if there are more than 2 categories in your factor
yhat = predict(fit, newx=x, type="class", s=0)
yhat in the above will contain either "red" or "blue" for each observation.
Note, the type="class" is the bit that gets you the category outcomes returned in yhat. The s=0 means to use a lambda penalty of zero for the coefficients you use to get predictions. You indicated in the question that you were just doing ordinary regression without any ridge or lasso style penalty factors, so s=0 ensures you get that in your predictions.
As far as I understand, cv.glmnet does K-fold cross-validation, which means that in each fold it splits the data into a training set and a validation set. For every fixed lambda, it first uses the training data to estimate a coefficient vector, then uses the fitted model to predict on the validation set to get the error.
Hence, for K-fold CV, there are K coefficient vectors (each generated from a different training set). So what does
coef(cvfit)
get?
Here is an example:
x <- iris[1:100,1:4]
y <- iris[1:100,5]
y <- factor(y)
fit <- cv.glmnet(data.matrix(x), y, family = "binomial", type.measure = "class",alpha=1,nfolds=3,standardize = T)
coef(fit, s=c(fit$lambda.min,fit$lambda.1se))
fit1 <- glmnet(data.matrix(x), y, family = "binomial",
standardize = T,
lambda = c(fit$lambda.1se,fit$lambda.min))
coef(fit1)
In fit1, I use the whole dataset as the training set, and it seems that the coefficients of fit1 and fit are just the same. Why is that?
Thanks in advance.
Although cv.glmnet checks model performance by cross-validation, the actual model coefficients it returns for each lambda value are based on fitting the model with the full dataset.
The help for cv.glmnet (type ?cv.glmnet) includes a Value section that describes the object returned by cv.glmnet. The returned list object (fit in your case) includes an element called glmnet.fit. The help describes it like this:
glmnet.fit a fitted glmnet object for the full data.
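You can check this directly by comparing the coefficients extracted from the cv.glmnet object with those extracted from its glmnet.fit element at the same lambda. A small sketch using the data from the question (the object names cv_coef and full_coef are just for illustration):

```r
library(glmnet)

set.seed(1)
x <- data.matrix(iris[1:100, 1:4])
y <- factor(iris[1:100, 5])  # drops the unused third species level

fit <- cv.glmnet(x, y, family = "binomial", type.measure = "class",
                 alpha = 1, nfolds = 3, standardize = TRUE)

# Coefficients reported by the cross-validation object ...
cv_coef <- coef(fit, s = fit$lambda.min)

# ... and by the full-data glmnet fit stored inside it, at the same lambda
full_coef <- coef(fit$glmnet.fit, s = fit$lambda.min)

all.equal(as.matrix(cv_coef), as.matrix(full_coef))
```

The cross-validation folds are only used to compute the error curve and pick lambda.min and lambda.1se.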
I am fitting GAM models to data using the mgcv package in R. Some of my predictors are circular, so I am using a periodic smoother. I run into an issue in cross-validation where my holdout dataset can contain values outside the range of the training data. Since gam automatically chooses knots for the smooths, this leads to an error (see my related question here -- thanks to @nograpes and @DWin for their explanations of the errors there).
How can I manually specify the outer knots in a periodic smooth?
Example code
The first block generates some data.
library(mgcv)
set.seed(223) # produces error.
# set.seed(123) # no error.
# generate data:
x <- runif(100,min=-pi,max=pi)
linPred <- 2*cos(x) # value of the linear predictor
theta <- 1 / (1 + exp(-linPred)) # inverse logit: probability that y == 1
y <- rbinom(100,1,theta)
plot(x,theta)
df <- data.frame(x=x,y=y)
The next block fits the GAM model with the periodic smooth:
gamFit <- gam(y ~ s(x,bs="cc",k=5),data=df,family=binomial())
summary(gamFit)
plot(gamFit)
It will be somewhere in the specification of the smoother term s(x,bs="cc",k=5) where I'm sure you'll be able to set some knots, but this is not obvious to me from the help of gam or from googling.
This block will fit some holdout data and produce the error if you set the seed as above:
# predict y values for new data:
x.2 <- runif(100,min=-pi,max=pi)
df.2 <- data.frame(x=x.2)
predict(gamFit,newdata=df.2)
Ideally, I would only set the outer knots and let gam pick the rest.
Apologies if this question is better for CrossValidated than SO.
Try this:
gamFit <- gam(y ~ s(x,bs="cc",k=5),
knots=list( x=seq(-pi,pi, len=5) ),
data=df, family=binomial())
You will find a worked example at:
?smooth.construct.cr.smooth.spec
I learned in testing this code that the 'k' argument in s() needs to match the number of knots supplied, i.e. the 'len' argument of the seq() call passed via the knots argument of gam(). I had incorrectly assumed that the knots argument would be passed to s().
You can do this in {mgcv} now, and have been able to for some years (but perhaps not at the time the question was posed and answered). Using the model in @IRTFM's answer, one can just specify the outer knots for a cyclic CRS:
gamFit <- gam(y ~ s(x, bs = "cc"),
knots = list(x = c(-pi, pi)),
data = df, family = binomial())
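As a quick check, here is a sketch that re-creates the data from the question with the seed that triggered the error, fits the model with only the two outer knots supplied, and predicts for the holdout data (the name p2 is just for illustration):

```r
library(mgcv)

set.seed(223)  # the seed that produced the error in the question
x <- runif(100, min = -pi, max = pi)
linPred <- 2 * cos(x)
theta <- 1 / (1 + exp(-linPred))
y <- rbinom(100, 1, theta)
df <- data.frame(x = x, y = y)

# Supply only the end points; the interior knots are placed automatically
gamFit <- gam(y ~ s(x, bs = "cc"),
              knots = list(x = c(-pi, pi)),
              data = df, family = binomial())

# Holdout data over the full circle
x.2 <- runif(100, min = -pi, max = pi)
df.2 <- data.frame(x = x.2)
p2 <- predict(gamFit, newdata = df.2)
```

Because the outer knots now sit at -pi and pi, any holdout value in that interval lies within the range covered by the smooth.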