I am fitting a varying-coefficient GAMM using 'mgcv' in R with a continuous 'by' variable. However, I am having difficulty locating the parameter estimates of the effect of the 'by' variable. In this example we estimate the spatially varying effect of temperature t on sole eggs (i.e. how the linear effect of temperature on sole eggs changes across space):
require(mgcv)
require(gamair)
data(sole)
b = gam(eggs ~ s(la,lo) + s(la,lo, by = t), data = sole)
We can then plot the predicted effects of s(la,lo, by = t) against the predictor t:
pred <- predict(b, type = "terms", se.fit = TRUE)
by.variable.prediction <- pred$fit[, 2]  # column for s(la,lo):t
plot(x = sole$t, y = by.variable.prediction)
However, I can't find a listing or function with the parameter estimates of the 'by' variable t at each sampling location. summary(), coef(), and predict() do not give these parameter estimates directly.
Any help would be appreciated!
The coefficient of t at a given location is the value of the smooth s(la,lo, by = t) evaluated with t equal to 1, conditional on the latitude and longitude. So one way to get the coefficient/parameter estimate of t at each latitude and longitude is to construct your own data frame covering a grid of latitude/longitude combinations with t = 1 and run predict.gam on that (rather than on the data used to fit the model, as you have done), keeping only the s(la,lo):t term so that the intercept and the s(la,lo) smooth are not folded in. So:
preddf <- expand.grid(list(la = seq(min(sole$la), max(sole$la), length.out = 100),
                           lo = seq(min(sole$lo), max(sole$lo), length.out = 100),
                           t = 1))
preddf$parameter <- predict(b, preddf, type = "terms")[, "s(la,lo):t"]
And then if you want to visualize this coefficient over space, you could graph it with ggplot2.
library(ggplot2)
ggplot(preddf) +
  geom_tile(aes(x = lo, y = la, fill = parameter))
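If you also want pointwise uncertainty for this varying coefficient, here is a sketch along the same lines ("s(la,lo):t" is the term label mgcv prints in summary(b)):
pred <- predict(b, preddf, type = "terms", se.fit = TRUE)
preddf$coef_t <- pred$fit[, "s(la,lo):t"]     # varying coefficient of t
preddf$se_t <- pred$se.fit[, "s(la,lo):t"]    # its pointwise standard error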
I am conducting a lasso regression modeling predictors of a count outcome in glmnet.
I am wondering what to make of the predictions from this model.
Here is some toy data. It's not very good because I don't know how to simulate multivariate data, but I'm mainly interested in whether I'm getting the syntax right.
set.seed(123)
df <- data.frame(count = rpois(500, lambda = 3),
                 pred1 = rnorm(500),
                 pred2 = rnorm(500),
                 pred3 = rnorm(500),
                 pred4 = rnorm(500),
                 pred5 = rnorm(500),
                 pred6 = rnorm(500),
                 pred7 = rnorm(500),
                 pred8 = rnorm(500),
                 pred9 = rnorm(500),
                 pred10 = rnorm(500))
Now run the model:
library(glmnet)
x <- model.matrix(count ~ ., df)[, -1]
y <- df$count
cvg <- cv.glmnet(x, y, family = "poisson")
Now when I generate predicted outcomes:
yTest <- predict(cvg, newx = x, family = "poisson", type = "link")
This is the output
# 1 1.094604
# 2 1.094604
# 3 1.094604
# 4 1.094604
# 5 1.094604
# 6 1.094604
# ... ........
Now obviously the model predictions are all the same and all terrible (unsurprising given the absence of any association between the predictors and the outcome), but the thing I am wondering about is why they are not integers (I have the same problem with my real data).
So my questions are:
Am I specifying the correct arguments in the predict() call? The help for the predict function states that type = "link" gives "the linear predictors" for Poisson models, whereas type = "response" gives the "fitted mean" (in the case of my dumb example it generates 500 values of 2.988, i.e. exp(1.094604); see the quick check after these questions).
Shouldn't the predicted outcomes match the form of the data itself, i.e. be integers?
If I am specifying the correct arguments in the predict() function, how do I use the non-integer predictions? Do I round them to the nearest integer, or just leave them alone?
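For reference, here is a quick check of how the two types relate (a sketch using the cvg fit from above):
lin <- predict(cvg, newx = x, type = "link")        # linear predictors (log scale)
resp <- predict(cvg, newx = x, type = "response")   # fitted means
all.equal(exp(as.numeric(lin)), as.numeric(resp))   # TRUE for a Poisson fit: mean = exp(link)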
Shouldn't the predicted outcomes match the form of the data itself, i.e. be integers?
When you use a regression model you are associating a (conditional) probability distribution, indexed by parameters (in the Poisson case, the parameter lambda, which is the mean), with each predictor configuration. A prediction of the response minimizes some expected loss function conditional on the predictor values, so it depends on which loss function you are using.
If you consider 0-1 loss, then yes, the predicted value should be an integer: the mode of the distribution, its most probable value, which for a Poisson distribution is the floor of lambda when lambda is not an integer (https://en.wikipedia.org/wiki/Poisson_distribution).
If you consider squared loss (y - y_prediction)^2, then your prediction is the conditional expectation (see https://en.wikipedia.org/wiki/Minimum_mean_square_error#Properties), which is not necessarily an integer, just like the results you are getting.
glmnet's type = "response" predictions are the conditional means, i.e. the optimal predictions under squared loss, but you can easily obtain an integer prediction (the one that minimizes the 0-1 loss) by applying floor() to them.
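For example (a sketch using the cvg fit from above):
mu <- predict(cvg, newx = x, type = "response")   # conditional means, exp(link)
yhat <- floor(as.numeric(mu))                     # most probable integer count (0-1 loss)
head(cbind(mean = as.numeric(mu), mode = yhat))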
When we plot a GAM fitted with the mgcv package that contains isotropic smoothers, we get a contour plot that looks something like this:
the x axis for one predictor,
the y axis for another predictor,
and contours showing the estimated surface s(x1, x2) (an isotropic smoother).
Suppose that in this model we have many other isotropic smoothers like:
y ~ s(x1, x2) + s(x3, x4) + s(x5, x6)
My doubts are: when interpreting the contour plot for s(x1, x2), what happens to the other isotropic smoothers? Are they "fixed at their medians"? Can we interpret the s(x1, x2) plot separately?
Because this model is additive in the functions, you can interpret the functions (the separate s() terms) separately, but not necessarily as separate effects of covariates on the response. In your case there is no overlap between the covariates of the bivariate smooths, so you can also interpret each smooth as the effect of its covariates on the response, separately from the other smoothers.
All of the smooth functions are typically subject to a sum-to-zero constraint so that the model constant term (the intercept) remains identifiable. As such, the 0 line in each plot corresponds to the value of the model constant term (on the scale of the link function or linear predictor).
The plots shown in the output of plot.gam(model) are partial-effect plots (partial plots). You can essentially ignore the other terms if you are interested in understanding the effect of a given term on the response as a function of its covariates.
If you have terms in the model that share one or more covariates with another term, and you want to look at how the response changes as you vary such a covariate, then you should predict from the model over the range of the variables you are interested in, whilst holding the other variables at some representative values, say their means or medians.
For example if you had
model <- gam(y ~ s(x, z) + s(x, v), data = foo, method = 'REML')
and you want to know how the response varies as a function of x only, you would fix z and v at representative values and then predict over a range of values for x:
newdf <- with(foo, expand.grid(x = seq(min(x), max(x), length = 100),
                               z = median(z),
                               v = median(v)))
newdf <- cbind(newdf, fit = predict(model, newdata = newdf, type = 'response'))
plot(fit ~ x, data = newdf, type = 'l')
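As a quick check of the additivity and sum-to-zero points above (a sketch using the model just fitted), the centred per-term contributions plus the intercept reconstruct the linear predictor:
tm <- predict(model, type = "terms")              # one centred column per smooth
lp <- as.numeric(predict(model, type = "link"))   # linear predictor
all.equal(as.numeric(rowSums(tm)) + as.numeric(coef(model)[1]), lp)  # TRUE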
Also, see ?vis.gam in the mgcv package for a means of preparing plots like this where it does the hard work for you.
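For instance, a minimal sketch with the model above:
# Contour plot of the predicted surface over x and z; covariates not in
# 'view' are held fixed (supply the 'cond' argument to control their values).
vis.gam(model, view = c("x", "z"), plot.type = "contour", type = "response")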
I am running a GLM regression, then using predict() to get fitted responses on my test data, but the problem is that I am getting probabilities and I don't know how to convert those probabilities to real values.
log <- glm(stock_out_duration ~ lag_2_market_unres_dos + lag_2_percentage_bias_forecast_error + forecast,
           data = train_data_final, family = inverse.gaussian(link = "log"), maxit = 100)
summary(log)
predict <- predict(log, test_data, type = 'response')
table_mat <- table(test_data$stock_out_duration)
table_mat
As far as I'm aware, there isn't a magic function that does this for you given that you're using glm. As you've noted, what typically gets returned is the probabilities. You can convert the probabilities into predictions for the underlying categories by choosing the outcome with the largest probability, as sketched below. I agree a one-line function for this would be nice, though.
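A minimal sketch of that idea for the two-class case (this assumes a hypothetical binomial glm called fit, not your inverse-Gaussian model):
probs <- predict(fit, newdata = test_data, type = "response")  # P(outcome = 1)
pred_class <- ifelse(probs > 0.5, 1, 0)                        # pick the more probable class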
You can get this functionality if you use the glmnet package.
library(glmnet)
y = ifelse(rnorm(100) > 0, "red", "blue")
y = factor(y)
x = matrix(rnorm(200), ncol = 2)  # glmnet requires a matrix with at least two columns
fit = glmnet(x, y, family = "binomial")  # use family = "multinomial" if there are more than 2 categories in your factor
yhat = predict(fit, newx = x, type = "class", s = 0)
yhat in the above will be a vector containing either "red" or "blue".
Note, type = "class" is the bit that gets you the category outcomes returned in yhat. s = 0 means a lambda penalty of zero on the coefficients used to get predictions. You indicated in the question that you were just doing ordinary regression without any ridge- or lasso-style penalty, so s = 0 ensures you get that in your predictions.
I'm building a penalized multinomial logistic regression, but I'm having trouble coming up with an easy way to get the prediction accuracy. Here's my code:
fit.ridge.cv <- cv.glmnet(train[,-1], train[,1], type.measure = "mse", alpha = 0,
                          family = "multinomial")
fit.ridge.best <- glmnet(train[,-1], train[,1], family = "multinomial", alpha = 0,
                         lambda = fit.ridge.cv$lambda.min)
fit.ridge.pred <- predict(fit.ridge.best, test[,-1], type = "response")
The first column of my test data is the response variable, which has 4 categories. If I look at the result (fit.ridge.pred), one row looks like this:
           1            2            3            4
0.8743061353 0.0122328811  0.004798154 0.1086628297
From what I understand these are the class probabilities. I want to know if there's an easy way to compute the model's accuracy on the test data. Currently I'm taking the max of each row and comparing it with the original label. Thanks
Something like:
predicted <- colnames(fit.ridge.pred)[apply(fit.ridge.pred, 1, which.max)]
table(predicted, test[, 1])
The first line takes, for each row, the class to which the model assigns the highest probability; the second line then constructs a confusion matrix.
The accuracy is then simply the proportion of observations classified correctly (sum of the diagonal / total), computed below.
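In code, continuing from the two lines above:
conf <- table(predicted, test[, 1])   # confusion matrix
sum(diag(conf)) / sum(conf)           # accuracy: correct / total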
For more details see Glmnet Vignette
fit.ridge.pred <- predict(fit.ridge.best, test[,-1], type = "class")  # predict classes, not probabilities
table(fit.ridge.pred, test[,1])       # confusion matrix
mean(fit.ridge.pred == test[,1])      # accuracy
I am fitting a GAM through the mgcv package with family = cox.ph() and have my data grouped by strata (strata = id). Each stratum corresponds to one used location for an individual animal and 20 random locations associated with that individual that were available for use.
require(mgcv)
require(survival)
require(smoothHR)
gam1 = gam(time1 ~ s(DWL) + strata(id), family = cox.ph(), method = "REML",
           data = dataset, weights = event1)
The model runs smoothly but I am unsure how to plot the relationship with the x-variable. DWL is a continuous variable. I have used the following to graph predictions:
x = seq(0,120) #extent of DWL values
plot(gam1,residuals=T,trans=function(x)exp(x)/(1+exp(x)),shade=T)
I am a bit confused about the use of the trans argument in the plot() call. With cox.ph() as the family argument, is the logit link the proper way to evaluate the predicted y-response against the x-variable DWL?
Thank you,
P Farrell