Changing the Y axis of default plot.gam graphs - r

I have run a GAM in R using the mgcv package with the following form:
shark.gamFINAL <- gam(ln.raw.CPUE...0.1 ~ Year + Month +
s(Mean.Temp, bs = "cr") + s(Mean.Chl.a, bs = "cr") +
s(Mean.Front.density, bs = "cr"), data=r, family=gaussian)
After running this model and calculating the percentage deviance explained by each variable I would like to plot the effect of each variable against the response
However when I use the plot.gam function in R my graphs come out with a y axis that is "s(predictor variable, edf)"
I am not sure what this scale of the y axis represents?
Is there a way that I could change the y axis range to that which represents the response, as has been done in this paper: Walsh and Kleiber (2001), 'Generalized additive model and regression tree analyses of blue shark (Prionace glauca) catch rates by the Hawaii-based commercial longline fishery' in FIGURE 3.
I would have posted some examples of the plots I am describing but as this is my first post I do not have at least 10 reputation, so it won't let me do it!
I have searched many sites and forums to try and find an answer for this, but to no avail, any help would therefore be hugely appreciated!

The axis is the value taken by the centred smooth. It is the contribution (at a value of the covariate) made to the fitted value for that smooth function.
It is easy to change the y axis label - supply the one you want to the ylab argument. This of course means you have to plot each smooth separately if you want a separate y-axis label for each plot. In that case also use the select argument to plot specific smooth functions, for example:
layout(matrix(1:4, ncol = 2, byrow = TRUE)
plot(shark.gamFINAL, select = 1, ylab = "foo")
plot(shark.gamFINAL, select = 2, ylab = "bar")
plot(shark.gamFINAL, select = 3, ylab = "foobar")
layout(1)
The only way I know to change the scale of the y-axis is to build the plot by hand. Those plots are without the contribution from the model constant term, plus the other parametric terms. If your model only had an intercept and one smooth, you could generate new data over the range of that covariate and then predict from the model for these new data values but use type = "terms" to get the contribution for the smooth. You then plot the value returned from predict plus the value of the "constant" attribute returned by predict.
In your case, you need to control for the other variables when predicting. One way to do that is to set all the other covariates to their means or typical value but allow the covariate of interest to vary over its range, as before. You then sum up the values in each row of the matrix returned by predict(shark.gamFINAL, newdata = NEW, type = "terms") (where NEW is the new data frame to predict at, varying one covariate but holding the rest at some typical value), again adding on the constant. You have to redo this for each covariate in turn (i.e. once per plot) as you need to keep the other covariates held at the typical value.
All this does is shift the scale on the axis though - none of the smooths in your model interact with other smooths or terms in the model so perhaps it might be easier to think of the y-axis as the effect on the response for each smooth?

Related

interpretation of a GAM plot with a square rooted response variable

I have a simple GAM where my goal is to understand the variation of the distance to a feature along the year, and I originally ran it with the following formula:
m1 <- gam(dist ~ s(month, bs="cc",k = 12) + s(id, bs="re"), data=db1, method = "REML")
Where "dist" is the distance in meters to a feature, and "id" is the animal id. When plotting the GAM I obtain the following plot:
First question, if I would be interepreting the plot/writting a figure caption, is it correct to say something like:
"GAM plot showing the partial effects of month (x-axis) on the distance to a feature (y-axis). GAM smooths are centered at zero, therefore the zero line reflects the overall mean of the distance the feature. Thus, values below zero on the y-axis reflect higher proximity to the feature, while values above zero reflect longer distances to the feature."
I say that a negative value (below zero) would mean proximity to the feature as that's also the way to interpret distance coefficients in a GLM, but I would also like to make sure that this is correct and that I'm not misinterpreting the plot.
Second question, are the values on the y-axis directly interpretable? If so, what is the scale? Is it a % of change? (Based on a comment here, but I'm not sure if I understood it properly)
Then I transform the response variable to achieve normality (the original scale was a bit left skewed), and I run this model (residuals look better with the transformation):
m2 <- gam(sqrt(dist) ~ s(month, bs="cc",k = 12) + s(id, bs="re"), data=db1, method = "REML")
And I obtain this plot:
Pretty similar to the previous one, and I believe I can interpret it in the same way as described above. But, third question, if I would want to say exactly what the y axis mean, what would be the most correct way to describe it with the transformation?
Any help with this is very appreciated! Many thanks in advance!

Scale LDA decision boundary

I have a rather unconventional problem and having a hard time finding a solution to this. Would really appreciate your help.
I have 4 genes(features) and my classification here is binary(0 and 1). After a lot of back and forth, I have finalized on using LDA to do my classification. I have different studies each comparing the same two classes and I trained my model using these 4 genes on each of these studies.
I want to visualize the LDA scores in the form of points plot. Something like below, where each section represents a different study/dataset. Samples of that dataset on the X axis and the LD1 value I get using -
lda_model = lda(formula = class ~ ., data = train)
predict(lda_model,train) on the Y axis.
Since I trained a different model on each dataset, we can clearly see the the decision boundary (which I assume is the black line) for each dataset is different and on a different scale. However, I want to scale the values on the Y axis is such a way that all my datasets are on the same scale and I can represent this plot with a single decision boundary( again, something I can clearly draw on the plot, like the red line).
The LD1 values here are - a(GeneA) + b(GeneB) + c(GeneC) + d(GeneD) - mean(a(GeneA) + b(GeneB) + c(GeneC) + d(GeneD)). This is done for each dataset individually. However, this is not exactly equal to (a(GeneA) + b(GeneB) + c(GeneC) + d(GeneD) + intercept) which we can get using logistic regression. I am trying to find that value or some method which can scale my Y axis across all the datasets using LDA.
Thanks for your help!
I did a min-max scaling and that seemed to work. It scaled all my data points across all datasets with decision boundary at zero.

How to change the y-axis for a multivariate GAM model from smoothed to actual values?

I am using multivariate GAM models to learn more about fog trends in multiple regions. Fog is determined by visibility going below a certain threshold (< 400 meters). Our GAM model is used to determine the response of visibility to a range of meteorological variables.
However, my challenge right now is that I'd really like the y-axis to be the actual visibility observations rather than the centered smoothed. It is interesting to see how visibility is impacted by the covariates relative to the mean visibility in that location, but it's difficult to compare this for multiple locations where the mean visibility is different (and thus the 0 point in which visibility is enhanced or diminished has little comparable meaning).
In order to compare the results of multiple locations, I'm trying to make the y-axis actual visibility observations, and then I'll put a line at the visibility threshold we're interested in looking at (400 m)
to evaluate what the predictor variables values are like below that threshold (eg what temperatures are associated with visibility below 400 m).
I'm still a beginner when it comes to GAMs and R in general, but I've figured out a few helpful pieces so far.
Helpful things so far:
Attempt 1. how to extract gam fit for each variable in model
Extracting data used to make a smooth plot in mgcv
Attempt 2. how to use predict function to reconstruct a univariable model
http://zevross.com/blog/2014/09/15/recreate-the-gam-partial-regression-smooth-plots-from-r-package-mgcv-with-a-little-style/
Attempt 3. how to get some semblance of a y-axis that looks like visibility observations using "fitted" -- though I don't think this is
the correct approach since I'm not taking the intercept into account
http://gsp.humboldt.edu/OLM/R/05_03_GAM.html
simulated data
install.packages("mgcv") #for gam package
require(mgcv)
install.packages("pspline")
require(pspline)
#simulated GAM data for example
dataSet <- gamSim(eg=1,n=400,dist="normal",scale=2)
visibility <- dataSet[[1]]
temperature <- dataSet[[2]]
dewpoint <- dataSet[[3]]
windspeed <- dataSet[[4]]
#Univariable GAM model
gamobj <- gam(visibility ~ s(dewpoint))
plot(gamobj, scale=0, page=1, shade = TRUE, all.terms=TRUE, cex.axis=1.5, cex.lab=1.5, main="Univariable Model: Dew Point")
summary(gamobj)
AIC(gamobj)
abline(h=0)
Univariable Model of Dew Point
https://imgur.com/1uzP34F
ATTEMPT 2 -- predict function with univariable model, but didn't change y-axis
#dummy var that spans length of original covariate
maxDP <-max(dewpoint)
minDP <-min(dewpoint)
DPtrial.seq <-seq(minDP,maxDP,length=3071)
DPtrial.seq <-data.frame(dewpoint=DPtrial.seq)
#predict only the DP term
preds <- predict(gamobj, type="terms", newdata=DPtrial.seq, se.fit=TRUE)
#determine confidence intervals
DPplot <-DPtrial.seq$dewpoint
fit <-preds$fit
fit.up95 <-fit-1.96*preds$se.fit
fit.low95 <-fit+1.96*preds$se.fit
#plot
plot(DPplot, fit, lwd=3,
main="Reconstructed Dew Point Covariate Plot")
#plot confident intervals
polygon(c(DPplot, rev(DPplot)),
c(fit.low95,rev(fit.up95)), col="grey",
border=NA)
lines(DPplot, fit, lwd=2)
rug(dewpoint)
Reconstructed Dew Point Covariate Plot
https://imgur.com/VS8QEcp
ATTEMPT 3 -- changed y-axis using "fitted" but without taking intercept into account
plot(dewpoint,fitted(gamobj), main="Fitted Response of Y (Visibility) Plotted Against Dew Point")
abline(h=mean(visibility))
rug(dewpoint)
Fitted Response of Y Plotted Against Dew Point https://imgur.com/RO0q6Vw
Ultimately, I want a horizontal line where I can investigate the predictor variable relative to 400 meters, rather than just the mean of the response variable. This way, it will be comparable across multiple sites where the mean visibility is different. Most importantly, it needs to be for multiple covariates!
Gavin Simpson has explained the method in a couple of posts but unfortunately, I really don't understand how I would hold the mean of the other covariates constant as I use the predict function:
Changing the Y axis of default plot.gam graphs
Any deeper explanation into the method for doing this would be super helpful!!!
I'm not sure how helpful this will be as your Q is a little more open ended than we'd typically like on SO, but, here goes.
Firstly, I think it would help to think about modelling the response variable, which I assume is currently visibility. This is going to be a continuous variable, bounded at 0 (perhaps the data never reach zero?) which suggests modelling the data as conditionally distributed either
gamma (family = Gamma(link = 'log')) for visibility that never takes a value of zero.
Tweedie (family = tw()) for data that do have zeroes.
An alternative approach would be to model the occurrence of fog; if this is defined as an event <400m visibility then you could turn all your observations into 0/1 values for being a fog event or otherwise. Then you'd model the data as conditionally distributed Bernoulli, using family = binomial().
Having decided on a modelling approach, we need to model the response. This should be done using a multiple regression type of approach, with a GAM including multiple predictors. This way you get to estimate the effect of each potential predictor variable on the response while controlling for the effects of the other predictors. If you just do this using a single predictor at a time, say dewpoint, that variable could well "explain" variation in the data that might be due to another predictor, windspeed say, and you wouldn't know it.
Furthermore, there may well be interactions between predictors that you'll want to control for if they exist, which can only be done in
Then, to finally get to the crux of your problem, having fitted the multi-predictor model to "explain" visibility, you will need to predict from the model for sets of likely conditions. To look at how the visibility varies with dewpoint in a model where other predictor variables have effects, you need to fix the other variables at some reasonable values; one option is to set them to their mean (or modal value in the case of any factor predictor variables), or some other value indicative of typically values for that variable. You'll have to use your domain knowledge for this.
If you have interactions in the model, then you'll need to vary the two variables in the interaction, whilst holding all other variable fixed at some values.
Let's assume you don't have interactions and are interested in dewpoint but the model also includes windspeed. The mean windspeed for the values used to fit the model can be found from the cmX component of the fitted model. Of you could just calculate this from the observed windpseed values or set it to some known number you want to use. Denote the fitted by m, and the data frame with your data in it by df, then we can create new data to predict at over the range of dewpoint, whilst holding windspeed fixed.
mn.windspd <- m$cmX['windspeed']
## or
mn.windspd <- with(df, mean(windspeed))
## or set it some some value
mn.windspd <- 10 # say
Then you can do
preddata <- with(df,
expand.grid(dewpoint = seq(min(dewpoint),
max(dewpoint),
length = 300),
windspeed = mn.windspd))
Then you use this to predict from the fitted model:
pred <- predict(m, newdata = preddata, type = "link", se.fit = TRUE)
pred <- as.data.frame(pred)
Now we want to put these predictions back on to the response scale, and we want a confidence interval so we have to create that first before back transforming:
ilink <- family(m)$linkinv
pred <- transform(pred,
Fitted = ilink(fit),
Upper = ilink(fit + (2 * se.fit)),
Lower = ilink(fit - (2 * se.fit)),
dewpoint = preddata = dewpoint)
Now you can visualised the effect of dewpoint on the response whilst keeping windspeed fixed.
In your case, you will have to extend this to keeping temperature constant also, but that is done in the same way
mn.windspd <- m$cmX['windspeed']
mn.temp <- m$cmX['temperature']
preddata <- with(df,
expand.grid(dewpoint = seq(min(dewpoint),
max(dewpoint),
length = 300),
windspeed = mn.windspd,
temperature = mn.temp))
and then follow the steps above to do the prediction.
For one or two variables varying I have a function data_slice() in my gratia package which will do the above expand.grid() stuff for you so you don't have to specify the mean values of the other covariates:
preddata <- data_slice(m, 'dewpoint', n = 300)
technically this finds the value in the data closest to the median value (for the covariates not varying). If you want means, then do
fixdf <- data.frame(windspeed = mn.windspd, temperature = mn.temp)
preddata <- data_slice(m, 'dewpoint', data = fixdf, n = 300)
If you have an interaction, say between dewpoint and windspeed then you need to vary two variables. This is pretty easy again with expand.grid():
mn.temp <- m$cmX['temperature']
preddata <- with(df,
expand.grid(dewpoint = seq(min(dewpoint),
max(dewpoint),
length = 100),
windspeed = seq(min(windspeed),
max(windspeed),
length = 300),
temperature = mn.temp))
This will create a 100 x 100 grid of values of the covariates to predict at, whilst holding temperature constant.
For data_slice() you'd need to do:
fixdf <- data.frame(temperature = mn.temp)
preddata <- data_slice(m, 'dewpoint', 'windpseed',
data = fixdf, n = 300)
And extending this on to more covariates you want to vary, is also easy following this pattern with expand.grid(); I have yet to implement more than 2 variables varying in data_slice.

Predict.lm() in R - how to get nonconstant prediction bands around fitted values

So I am currently trying to draw the confidence interval for a linear model. I found out I should use predict.lm() for this, but I have a few problems really understanding the function and I do not like using functions without knowing what's happening. I found several how-to's on this subject, but only with the corresponding R-code, no real explanation.
This is the function itself:
## S3 method for class 'lm'
predict(object, newdata, se.fit = FALSE, scale = NULL, df = Inf,
interval = c("none", "confidence", "prediction"),
level = 0.95, type = c("response", "terms"),
terms = NULL, na.action = na.pass,
pred.var = res.var/weights, weights = 1, ...)
Now, what I've trouble understanding:
1) newdata
An optional data frame in which to look for variables
with which to predict. If omitted, the fitted values are used.
Everyone seems to use newdata for this, but I cannot quite understand why. For calculating the confidence interval I obviously need the data which this interval is for (like the # of observations, mean of x etc), so cannot be what is meant by it. But then: What is does it mean?
2) interval
Type of interval calculation.
okay.. but what is "none" for?
3a) type
Type of prediction (response or model term).
3b) terms
If type="terms", which terms (default is all terms)
3a: Can I by that get the confidence interval for one specific variable in my model? And if so, what is 3b for then? If I can specify the term in 3a, it wouldn't make sense to do it in 3b again.. so I guess I'm wrong again, but I cannot figure out why.
I guess some of you might think: Why don't just try this out? And I would (even if it would maybe not solve everything here), but I right now don't know how to. As I do not now what the newdata is for, I don't know how to use it and if I try, I do not get the right confidence interval. Somehow it is very important how you choose that data, but I just don't understand!
EDIT: I want to add that my intention is to understand how predict.lm works. By that I mean I don't understand if it works the way I think it does. That is it calculates y-hat (predicted values) and than uses adds/subtracts for each the upr/lwr-bounds of the interval to calculate several datapoints(looking like a confidence-line then) ?? Then I would undestand why it is necessary to have the same lenght in the newdata as in the linear model.
Make up some data:
d <- data.frame(x=c(1,4,5,7),
y=c(0.8,4.2,4.7,8))
Fit the model:
lm1 <- lm(y~x,data=d)
Confidence and prediction intervals with the original x values:
p_conf1 <- predict(lm1,interval="confidence")
p_pred1 <- predict(lm1,interval="prediction")
Conf. and pred. intervals with new x values (extrapolation and more finely/evenly spaced than original data):
nd <- data.frame(x=seq(0,8,length=51))
p_conf2 <- predict(lm1,interval="confidence",newdata=nd)
p_pred2 <- predict(lm1,interval="prediction",newdata=nd)
Plotting everything together:
par(las=1,bty="l") ## cosmetics
plot(y~x,data=d,ylim=c(-5,12),xlim=c(0,8)) ## data
abline(lm1) ## fit
matlines(d$x,p_conf1[,c("lwr","upr")],col=2,lty=1,type="b",pch="+")
matlines(d$x,p_pred1[,c("lwr","upr")],col=2,lty=2,type="b",pch=1)
matlines(nd$x,p_conf2[,c("lwr","upr")],col=4,lty=1,type="b",pch="+")
matlines(nd$x,p_pred2[,c("lwr","upr")],col=4,lty=2,type="b",pch=1)
Using new data allows for extrapolation beyond the original data; also, if the original data are sparsely or unevenly spaced, the prediction intervals (which are not straight lines) may not be well approximated by linear interpolation between the original x values ...
I'm not quite sure what you mean by the "confidence interval for one specific variable in my model"; if you want confidence intervals on a parameter, then you should use confint. If you want predictions for the changes based only on some of the parameters changing (ignoring the uncertainty due to the other parameters), then you do indeed want to use type="terms".
interval="none" (the default) just tells R not to bother computing any confidence or prediction intervals, and to return just the predicted values.

finding functions that match dot plots

ggplot(test,aes(x=timepoints,y= mean,ymax = mean + sde, ymin = mean - sde)) +
geom_errorbar(width=2) +
geom_point() +
geom_line() +
stat_smooth(method='loess') +
xlab('Time (min)') +
ylab('Fold Induction') +
opts(title = 'yo')
I can plot the blue 'loess'-ed line. But is there a way to find the mathematical function of the blue 'loess'-ed line?
You can get the predictions for a regular sequence:
fit <-loess( mean ~ timepoints, data=test)
fit.points <- predict(fit, newdata= data.frame(
speed = seq(min(timepoints), max(timepoints), length=100)),
se = FALSE)
fitdf <- dataframe(x = seq(min(timepoints), max(timepoints), length=100)
y = fit.points)
You can then fit to that set of points with splines of an appropriate degree. Cubic spline fits can be described with greater ease than can loess fits.It would be easier to synchronize an answer to variable names it you had offered a data example to work with. The plot does not seem to be created with that code.
Rule Number One: not all distributions have a (closed-form) function which generates them. Yes, you can create a close fit by way of splines, or calculating moments (mean, variance, skew, etc) and building the series, so your choice depends on whether you intend to interpolate, extrapolate, or just "view" the resultant function.
In the scientific world, it's more common to have a theory, or premise, about the behavior behind your data. You can then do standard (e.g. nls) fitting methods to see how well the proposed fit function can be made to match your data.
To understand how the loess line is computed see the loess.demo function in the TeachingDemos package. This is an interactive graphical demonstration that will show how the y-value at each point is computed for each x-value based on the data and bandwidth parameter (it also shows the difference in the raw loess fit and the spline that is often fit to the loess estimates).

Resources