R: GAM with fit on subset of data

I fit a Generalized Additive Model using gam from the mgcv package. I have a data table containing my dependent variable Y, an independent variable X, other independent variables Oth, and a two-level factor Fac. I would like to fit the following model
Y ~ s(X) + Oth
BUT with the additional constraint that the s(X) term is fit only to one of the two levels of the factor, say Fac == 1. The other terms Oth should be fit on the whole data.
I tried exploring s(X, by = Fac), but this biases the fit for Oth. In other words, I would like to express the belief that X relates to Y only when Fac == 1; otherwise it does not make sense to model X.

Cheap trick: use an auxiliary variable that is X if Fac == 1 and 0 elsewhere.
library("mgcv")
library("ggplot2")
# simulate data
N <- 1e3
dat <- data.frame(covariate = runif(N),
predictor = runif(N),
group = factor(sample(0:1, N, TRUE)))
dat$outcome <- rnorm(N,
1 * dat$covariate +
ifelse(dat$group == 1,
.5 * dat$predictor +
1.5 * sin(dat$predictor * pi),
0), .1)
# some plots
ggplot(dat, aes(x = predictor, y = outcome,
col = group, group = group)) +
geom_point()
ggplot(dat, aes(x = covariate, y = outcome,
col = group, group = group)) +
geom_point()
# create auxiliary variable
dat$aux <- ifelse(dat$group == 1,
dat$predictor,
0)
# fit the data
fit1 <- gam(outcome ~ covariate + s(predictor, by = group),
data = dat)
fit2 <- gam(outcome ~ covariate + s(aux, by = group),
data = dat)
# compare fits
summary(fit1)
summary(fit2)
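For a quick visual check of the trick (a sketch using the objects defined above): plot the fitted values against the predictor. In fit2 the group-0 values should show no trend in the predictor, since aux is identically 0 there.
# fitted values against the predictor; the group-0 cloud should be
# trendless in 'predictor' because aux == 0 for that group
dat$fitted2 <- fitted(fit2)
ggplot(dat, aes(x = predictor, y = fitted2, col = group)) +
  geom_point(size = 0.5)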

If I understand it right, you're thinking of a model with an interaction, like this:
Y ~ Oth + (Fac==1)*s(X)
If you want to "express the belief that X relates to Y only if Fac==1", don't treat Fac as a factor but as a numeric variable. In this case you get a numeric interaction and only one set of coefficients (when it's a factor, there were two). This type of model is a varying coefficient model.
# some data
data <- data.frame(th = runif(100),
                   X = runif(100),
                   Y = runif(100),
                   Fac = sample(0:1, 100, TRUE))
data$Fac <- as.numeric(as.character(data$Fac))  # change to numeric
# then run the model
gam(Y ~ s(X, by = Fac) + th, data = data)
See the documentation for the by option in ?s.
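To see the two parameterisations side by side, here is a sketch using the simulated data above (FacF is a helper factor added purely for illustration):
library("mgcv")
# factor 'by': one separate smooth per level of Fac (two sets of coefficients);
# the factor main effect is included because each level's smooth is centred at 0
data$FacF <- factor(data$Fac)
fit_factor  <- gam(Y ~ FacF + s(X, by = FacF) + th, data = data)
# numeric 'by': a single smooth multiplied by Fac, so it drops out where Fac == 0
fit_numeric <- gam(Y ~ s(X, by = Fac) + th, data = data)
summary(fit_factor)
summary(fit_numeric)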

Related

How to force a polynomial regression to be very flexible for big sharp turns while reducing flexibility for small turns

Is it possible to force a polynomial regression to be very flexible for big sharp turns while reducing flexibility for small turns?
The reason is that I have a variable y which depends on x, and I am sure there is a positive correlation, although my dataset contains some noise. This is why I do not want a wiggly line for x between 1 and 75, as in the graph below.
library(ggplot2)
library(dplyr)

x <- 1:100
y <- c(1,3,6,12,22,15,13,11,5,1,3,6,12,22,11,5,1,3,6,12,22,11,5,5,9,10,1,6,12,22,15,13,11,5,-1,-12,-23,6,12,22,11,5,1,3,6,12,22,11,5,-11,-22,-9,12,22,11,5,9,10,18,1,3,6,12,22,15,13,11,5,-5,-9,-12,6,12,22,11,5,1,3,6,12,22,11,5,1,3,6,12,22,11,5,9,10,18,28,37,50,90,120,150,200)
y <- y + x
df <- data.frame(x, y)

model <- lm(y ~ poly(x, 6, raw = TRUE), data = df)
predictions <- model %>% predict(df)
df <- cbind(df, predictions)

ggplot() +
  geom_point(data = df, aes(x = x, y = y), size = 0.1) +
  geom_line(data = df, aes(x = x, y = predictions), colour = "blue", size = 0.1)
I can alter the model to:
model <- lm(y ~ poly(x, 2, raw = TRUE), data = df)
Which gives this graph:
In this case the model has no wiggliness for x between 0 and 90, although it lacks the flexibility to make the turn around x = 90.
I am not looking for a specific solution for this example dataset. I am looking for a way to make a polynomial regression flexible enough to make sharp (big) turns while reducing wiggliness for small turns at the same time. (Maybe this could be solved by limiting the fit to a maximum of n turns?)
I want to apply it automatically to several datasets, so I do not want to specify different ranges of x for different models.
Thank you!
I have also tried using gam from mgcv, although this gives similar results:
mygam <- gam(y ~ s(x, k = 7), data = df)
mygam <- gam(y ~ s(x, k = 3), data = df)
This graph is based on pmax(p1, p2), where p1 and p2 are two polynomials:
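For what it's worth, here is a minimal sketch of that pmax() idea using the df from above; the choice of a straight line for p1 and a degree-6 polynomial for p2 is an illustrative assumption.
# combine a stiff fit (p1) and a flexible fit (p2) by taking the
# pointwise maximum of their predictions
p1 <- predict(lm(y ~ x, data = df))                       # stiff: straight line
p2 <- predict(lm(y ~ poly(x, 6, raw = TRUE), data = df))  # flexible: degree 6
df$pmax_fit <- pmax(p1, p2)
ggplot(df, aes(x = x, y = y)) +
  geom_point(size = 0.1) +
  geom_line(aes(y = pmax_fit), colour = "blue")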

Difference between two geom_smooth() lines

I made a plot for my data, and now I would like to have the difference in y for every x that was estimated by geom_smooth(). There is a similar question which unfortunately has no answer. For example, how to get the differences for the following plot (data below):
EDIT
Two suggestions were made but I still don't know how to calculate the differences.
First suggestion was to access the data from the ggplot object. I did so with
pb <- ggplot_build(p)
pb[["data"]][[1]]
That approach kind of works, but the data doesn't use the same x values for the two groups. For example, the first x value of the first group is -3.21318853, but there is no x of -3.21318853 for the second group; hence, I cannot calculate the difference in y at -3.21318853 between the two groups.
Second suggestion was to see what formula is used in geom_smooth(). The package description says that "loess() is used for less than 1,000 observations; otherwise mgcv::gam() is used with formula = y ~ s(x, bs = "cs")". My N is more than 60,000, hence gam() is used by default. I am not familiar with gam; can anyone give a short answer on how to calculate the difference between the two lines, considering the things just described?
R Code
library("ggplot2") # library ggplot
set.seed(1) # make example reproducible
n <- 5000 # set sample size
df <- data.frame(x= rnorm(n), g= factor(rep(c(0,1), n/2))) # generate data
df$y <- NA # include y in df
df$y[df$g== 0] <- df$x[df$g== 0]**2 + rnorm(sum(df$g== 0))*5 # y for group g= 0
df$y[df$g== 1] <-2 + df$x[df$g== 1]**2 + rnorm(sum(df$g== 1))*5 # y for g= 1 (with intercept 2)
ggplot(df, aes(x, y, col= g)) + geom_smooth() + geom_point(alpha= .1) # make a plot
Hi, and welcome to Stack Overflow.
The first suggestion is good. To make the x-sequences match, you can interpolate the values in between using the approx() function (in the stats package).
library("ggplot2") # library ggplot
set.seed(1) # make example reproducible
n <- 5000 # set sample size
df <- data.frame(x= rnorm(n), g= factor(rep(c(0,1), n/2))) # generate data
df$y <- NA # include y in df
df$y[df$g== 0] <- df$x[df$g== 0]**2 + rnorm(sum(df$g== 0))*5 # y for group g= 0
df$y[df$g== 1] <-2 + df$x[df$g== 1]**2 + rnorm(sum(df$g== 1))*5 # y for g= 1 (with intercept 2)
p <- ggplot(df, aes(x, y, col= g)) + geom_smooth() + geom_point(alpha= .1) # make a plot
pb <- ggplot_build(p) # Get computed data
data.of.g1 <- pb[['data']][[1]][pb[['data']][[1]]$group == 1, ] # Extract info for group 1
data.of.g2 <- pb[['data']][[1]][pb[['data']][[1]]$group == 2, ] # Extract info for group 2
xlimit.inf <- max(min(data.of.g1$x), min(data.of.g2$x)) # Get the minimum X the two smoothed data have in common
xlimit.sup <- min(max(data.of.g1$x), max(data.of.g2$x)) # Get the maximum X
xseq <- seq(xlimit.inf, xlimit.sup, 0.01) # Sequence of X value (you can use bigger/smaller step size)
# Based on data from group 1 and group 2, interpolates linearly for all the values in `xseq`
y.g1 <- approx(x = data.of.g1$x, y = data.of.g1$y, xout = xseq)
y.g2 <- approx(x = data.of.g2$x, y = data.of.g2$y, xout = xseq)
difference <- data.frame(x = xseq, dy = abs(y.g1$y - y.g2$y)) # Compute the difference
ggplot(difference, aes(x = x, y = dy)) + geom_line() # Make the plot
Output:
As I mentioned in the comments above, you really are better off doing this outside of ggplot, instead using a full model of the two smooths from which you can compute uncertainties on the difference, etc.
This is basically a short version of a blog post that I wrote a year or so back.
OP's example data
set.seed(1)  # make example reproducible
n <- 5000    # set sample size
df <- data.frame(x = rnorm(n),
                 g = factor(rep(c(0, 1), n/2)))  # generate data
df$y <- NA   # include y in df
df$y[df$g == 0] <- df$x[df$g == 0]^2 + rnorm(sum(df$g == 0)) * 5      # y for group g = 0
df$y[df$g == 1] <- 2 + df$x[df$g == 1]^2 + rnorm(sum(df$g == 1)) * 5  # y for g = 1 (with intercept 2)
Start by fitting the model for the example data:
library("mgcv")
m <- gam(y ~ g + s(x, by = g), data = df, method = "REML")
Here I'm fitting a GAM with a factor-smooth interaction (the by bit). For this model we also need to include g as a parametric effect: the group-specific smooths are both centred about 0, so the group means have to enter through the parametric part of the model.
Next we need a grid of data along the x variable at which we will estimate the difference between the two estimated smooths:
pdat <- with(df, expand.grid(x = seq(min(x), max(x), length = 200),
                             g = c(0, 1)))
pdat <- transform(pdat, g = factor(g))
Then we use this prediction data to generate the Xp matrix, a matrix that maps values of the covariates to values of the basis expansions for the smooths; we can manipulate this matrix to get the difference smooth that we want:
xp <- predict(m, newdata = pdat, type = "lpmatrix")
Next, some code to identify which rows and columns in xp belong to the smooths for the respective levels of g. With only two levels and a single smooth term in the model this is entirely trivial, but for more complex models it is needed, and it is important to get the smooth component names right for the grepl() bits to work.
## which cols of xp relate to splines of interest?
c1 <- grepl('g0', colnames(xp))
c2 <- grepl('g1', colnames(xp))
## which rows of xp relate to sites of interest?
r1 <- with(pdat, g == 0)
r2 <- with(pdat, g == 1)
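If you are unsure what the relevant names look like, inspect the column names of the prediction matrix before writing the grepl() patterns:
## the smooth columns are named like "s(x):g0.1", "s(x):g1.1", etc.
colnames(xp)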
Now we can difference the rows of xp for the pair of levels we are comparing:
## difference rows of xp for data from comparison
X <- xp[r1, ] - xp[r2, ]
As we focus on the difference, we need to zero out all the columns not associated with the selected pair of smooths, which includes any parametric terms.
## zero out cols of X related to other splines
X[, ! (c1 | c2)] <- 0
## zero out the parametric cols
X[, !grepl('^s\\(', colnames(xp))] <- 0
(In this example, these two lines do exactly the same thing, but in more complex examples both are needed.)
Now we have a matrix X which contains the difference between the two basis expansions for the pair of smooths we're interested in. To get this in terms of fitted values of the response y, we multiply this matrix by the vector of model coefficients:
## difference between smooths
dif <- X %*% coef(m)
Now dif contains the difference between the two smooths.
We can use X again, together with the covariance matrix of the model coefficients, to compute the standard error of this difference and thence a 95% (in this case) confidence interval for the estimated difference.
## se of difference
se <- sqrt(rowSums((X %*% vcov(m)) * X))
## confidence interval on difference
crit <- qt(.975, df.residual(m))
upr <- dif + (crit * se)
lwr <- dif - (crit * se)
Note that here with the vcov() call we're using the empirical Bayesian covariance matrix, not the one corrected for having estimated the smoothness parameters. The function I show shortly allows you to account for this additional uncertainty via the argument unconditional = TRUE.
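If you wanted that correction in the manual computation above as well, you could (I believe) just swap in the corrected covariance matrix:
## standard error using the smoothness-uncertainty-corrected covariance matrix
se <- sqrt(rowSums((X %*% vcov(m, unconditional = TRUE)) * X))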
Finally we gather the results and plot:
res <- data.frame(x = with(df, seq(min(x), max(x), length = 200)),
                  dif = dif, upr = upr, lwr = lwr)
ggplot(res, aes(x = x, y = dif)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr, x = x), alpha = 0.2) +
  geom_line()
This produces
This is consistent with an assessment showing that the model with the group-level smooths doesn't provide a substantially better fit than a model with different group means but only a single common smoother in x:
r$> m0 <- gam(y ~ g + s(x), data = df, method = "REML")
r$> AIC(m0, m)
         df      AIC
m0  9.68355 30277.93
m  14.70675 30285.02
r$> anova(m0, m, test = 'F')
Analysis of Deviance Table

Model 1: y ~ g + s(x)
Model 2: y ~ g + s(x, by = g)
  Resid. Df Resid. Dev     Df Deviance      F Pr(>F)
1    4990.1     124372
2    4983.9     124298 6.1762   73.591 0.4781 0.8301
Wrapping up
The blog post I mentioned has a function which wraps the steps above into a simple function, smooth_diff():
smooth_diff <- function(model, newdata, f1, f2, var, alpha = 0.05,
                        unconditional = FALSE) {
    xp <- predict(model, newdata = newdata, type = 'lpmatrix')
    c1 <- grepl(f1, colnames(xp))
    c2 <- grepl(f2, colnames(xp))
    r1 <- newdata[[var]] == f1
    r2 <- newdata[[var]] == f2
    ## difference rows of xp for data from comparison
    X <- xp[r1, ] - xp[r2, ]
    ## zero out cols of X related to other splines
    X[, ! (c1 | c2)] <- 0
    ## zero out the parametric cols
    X[, !grepl('^s\\(', colnames(xp))] <- 0
    dif <- X %*% coef(model)
    se <- sqrt(rowSums((X %*% vcov(model, unconditional = unconditional)) * X))
    crit <- qt(alpha/2, df.residual(model), lower.tail = FALSE)
    upr <- dif + (crit * se)
    lwr <- dif - (crit * se)
    data.frame(pair = paste(f1, f2, sep = '-'),
               diff = dif,
               se = se,
               upper = upr,
               lower = lwr)
}
Using this function we can repeat the entire analysis and plot the difference with:
out <- smooth_diff(m, pdat, '0', '1', 'g')
out <- cbind(x = with(df, seq(min(x), max(x), length = 200)),
             out)
ggplot(out, aes(x = x, y = diff)) +
  geom_ribbon(aes(ymin = lower, ymax = upper, x = x), alpha = 0.2) +
  geom_line()
I won't show the plot here as it is identical to that shown above except for the axis labels.

brms package in R smoother

I have this data frame in R:
x = rep(seq(-10, 10, 1), each = 5)
y = rep(0, length(x))
weights = sample(seq(1, 20, 1), length(x), replace = TRUE)
weights = weights / sum(weights)
groups = rep(letters[1:5], times = length(x) / 5)
dat = data.frame(x = x, y = y, weights = weights, group = groups)  # assemble the data frame used below
and some data that looks like this:
library(ggplot2)
ggplot(data = dat, aes(x = x, y = y, color = group)) +
  geom_point(aes(size = weights)) +
  ylab("outcome") +
  xlab("predictor x1") +
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = 0)
library(brms)
fit_brms = brm(y ~ s(x) + (1 | group), data = dat)
by_group = marginal_effects(fit_brms,
                            conditions = data.frame(group = dat$group),
                            re_formula = NULL, method = "predict")
plot(by_group, ncol = 5, points = TRUE)
I'd like to make a hierarchical nonlinear model so that there is a different nonlinear fit for each group.
In brms I have the code above, which does a spline fit on the x predictor with random intercepts on group. The fitted line is the same for all groups; the only difference is where the lines cross the y axis. Is there a way to make the nonlinear fit different for each group's data points?
On page 13 here: https://cran.r-project.org/web/packages/brms/vignettes/brms_multilevel.pdf
it states: "As the smooth term itself cannot be modeled as varying by year in a multilevel manner, we add a basic varying intercept in an effort to account for variation between years"
So the spline will be the same for all groups, it appears? The only difference in the plots is where the spline crosses the y axis. That seems very restrictive. Can this be modified to make the spline unique to each group?
Use the formula: y ~ s(x, by = group) + (1|group)
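A minimal sketch of that model, using the question's objects (note that the toy y above is all zeros, so substitute a real outcome; group must be a factor for the by term to give one smooth per level):
library(brms)
# one smooth per group plus a group-varying intercept
fit_by_group <- brm(y ~ s(x, by = group) + (1 | group), data = dat)
by_group <- marginal_effects(fit_by_group,
                             conditions = data.frame(group = unique(dat$group)),
                             re_formula = NULL, method = "predict")
plot(by_group, ncol = 5, points = TRUE)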

estimate h2o glm coefficients by a categorical variable level

I would like to estimate the coefficient for a predictor at each level of a categorical variable in an h2o glm. For example, if my data frame has product price (continuous variable) and product type (categorical variable), then I want to estimate a coefficient for price by product type. In SAS, you can easily accomplish this by specifying the model effect price*type. How can I do the same in h2o or R?
There is an interactions() function, but it cannot handle an interaction between a continuous and a categorical variable. Any tips to get around this problem?
Many thanks,
set.seed(1234)
x1 = rnorm(100, 0, 1)
x2 = as.factor(rep(c("A", "B", "C", "D"), each = 25))
y = as.factor(rep(0:1, each = 50))
data = data.frame(x1 = x1, x2 = x2, y = y)
Interactions can be specified using a ":" in the formula argument:
# glm base example
fit <- glm(data = data, y ~ x1 + x2 + x1:x2, family = "binomial")
print(fit)
Using h2o.glm, pairwise interactions can be specified by passing column indices to the interactions argument:
# h2o.glm example
library("h2o")
h2o.init(nthreads = -1)
data.hex = as.h2o(data)
h2o_fit <- h2o.glm(x = 1:2, y = 3, training_frame = data.hex,
                   family = "binomial", interactions = 1:2)
h2o_fit@model$coefficients_table
h2o.shutdown(prompt = FALSE)

Obtaining predictions from an mgcv::gam fit that contains a matrix "by" variable to a smooth

I just discovered that mgcv::s() permits supplying a matrix to its by argument, which lets one smooth a continuous variable with separate smooths for each of a combination of variables (and their interactions if so desired). However, I'm having trouble getting sensible predictions from such models. For example:
library(mgcv)    # for gam
library(ggplot2) # for plotting

# generate some fake data
set.seed(1) # for replicability of this example
myData = expand.grid(
    var1 = c(-1, 1)
    , var2 = c(-1, 1)
    , z = -10:10
)
myData$y = rnorm(nrow(myData)) + (myData$z^2 + myData$z * 4) * myData$var1 +
    (3 * myData$z^2 + myData$z) * myData$var2
# note additive effects of var1 and var2

# plot the data
ggplot(
    data = myData
    , mapping = aes(
        x = z
        , y = y
        , colour = factor(var1)
        , linetype = factor(var2)
    )
) +
    geom_line(alpha = .5)

# reformat to matrices
zMat = matrix(rep(myData$z, times = 2), ncol = 2)
xMat = matrix(c(myData$var1, myData$var2), ncol = 2)

# get the fit
fit = gam(formula = myData$y ~ s(zMat, by = xMat, k = 5))

# get the predictions and plot them
predicted = myData
predicted$value = predict(fit)
ggplot(
    data = predicted
    , mapping = aes(
        x = z
        , y = value
        , colour = factor(var1)
        , linetype = factor(var2)
    )
) +
    geom_line(alpha = .5)
Yields this plot of the input data:
And this obviously awry plot of the predicted values:
Whereas replacing the gam fit above with:
fit = gam(
    formula = y ~ s(z, by = var1, k = 5) + s(z, by = var2, k = 5)
    , data = myData
)
but otherwise running the same code yields this reasonable plot of predicted values:
What am I doing wrong here?
The use of vector-valued inputs to mgcv smooths is taken up here. It seems to me that you are misunderstanding these model types.
Your first formula
myData$y ~ s(zMat,by=xMat,k=5)
fits the model
y ~ f(z)*x_1 + f(z)*x_2
That is, mgcv estimates a single smooth function f(). This function is evaluated at each covariate, with the weightings supplied to the by argument.
Your second formula
y ~ s(z,by=var1,k=5) + s(z,by=var2,k=5)
fits the model
y ~ f_1(z)*x_1 + f_2(z)*x_2
where f_1() and f_2() are two different smooth functions. Your data model is essentially the second formula, so it is not surprising that it gives a more sensible looking fit.
The first formula is useful when you want an additive model where a single function is evaluated on each variable, with given weightings.
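To make that concrete, here is a sketch of data for which the matrix form is the right model: a single shared f(z), weighted by each covariate (this mirrors the question's setup, but with one common function; the particular f used is an illustrative assumption):
library(mgcv)
set.seed(1)
d <- expand.grid(var1 = c(-1, 1), var2 = c(-1, 1), z = -10:10)
# a single shared function f(z) = z^2 + 4z, weighted by var1 and var2
d$y <- rnorm(nrow(d)) + (d$z^2 + 4 * d$z) * (d$var1 + d$var2)
zMat <- matrix(rep(d$z, times = 2), ncol = 2)  # covariate matrix: z in both columns
xMat <- cbind(d$var1, d$var2)                  # 'by' matrix: the weights
fit <- gam(d$y ~ s(zMat, by = xMat, k = 5))    # fits sum_j f(z[i,j]) * x[i,j]
summary(fit)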
