I would like to visualize the third order interaction fitted with thin-plate regression splines - r

I am new to R, so this may be a simple question.
I am trying to fit four-dimensional data using thin-plate regression splines. One variable is the response and the other three are explanatory variables.
I built a model with a third-order interaction and fitted it to the data.
library(mgcv)
dat <- read.csv('../data//data.csv')
model <- gam(Y ~ s(x1, x2, x3), data=dat)
Given a fixed value of x3, I want to draw a three-dimensional plot of the fitted spline surface, or a contour plot of the estimates. How do I do that?
Any help would be much appreciated.
Thanks.
This is the sample data.
n = 100
x1 <- runif(n, min = 0, max = 100)
x2 <- runif(n, min = 0, max = 100)
x3 <- runif(n, min = 0, max = 100)
# Vectorized, so no loop is needed
Y <- sqrt(x1) * x2^2 * x3 / 10000
dat <- data.frame(Y = Y, x1 = x1, x2 = x2, x3 = x3)
I fit a thin-plate regression spline to this dat.
model <- gam(Y ~ s(x1, x2, x3, k= 50), data=dat)
Then I would like to obtain the fitted thin-plate regression spline surface in three dimensions, or a contour plot of the estimates, when x3 = 25, for example.

To make a contour plot, you can use contour(x, y, z, ...). z is your data matrix (in your case, Y[x1,x2, ]), and x and y are index vectors from 0 to 1 with lengths nrow(Y[x1,x2, ]) and ncol(Y[x1,x2, ]).
You should be able to use it along these lines:
contour( x = seq(0, 1, length.out = length(x1)), y = seq(0, 1, length.out = length(x2)), z = Y[x1,x2, ] )

I found a solution based on d0d0's answer.
n <- 100
const <- 25  # fixed value of x3
x <- y <- seq(0, n, 1)
# Predict Y for pairs of (x1, x2) values with x3 held at const
f <- function(x, y) {
  dtmp <- data.frame(x1 = x, x2 = y, x3 = const)
  as.numeric(predict(model, newdata = dtmp))
}
z <- outer(x, y, f)
contour(x, y, z)
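The same idea can also be written with an explicit prediction grid. The following is a minimal sketch, assuming the simulated data from the question; the seed, the grid size of 41 points, and the names gx, gy and grid are illustrative choices, not part of the original posts. It also shows mgcv's vis.gam(), which can hold x3 fixed through its cond argument.

```r
library(mgcv)

# Simulated data following the recipe in the question (seed is an assumption)
set.seed(1)
n <- 100
dat <- data.frame(x1 = runif(n, 0, 100),
                  x2 = runif(n, 0, 100),
                  x3 = runif(n, 0, 100))
dat$Y <- sqrt(dat$x1) * dat$x2^2 * dat$x3 / 10000

model <- gam(Y ~ s(x1, x2, x3, k = 50), data = dat)

# Prediction grid over x1 and x2 with x3 held at 25
gx <- seq(0, 100, length.out = 41)
gy <- seq(0, 100, length.out = 41)
grid <- expand.grid(x1 = gx, x2 = gy, x3 = 25)

# expand.grid varies x1 fastest, so the predictions fill the matrix column by column
z <- matrix(as.numeric(predict(model, newdata = grid)),
            nrow = length(gx), ncol = length(gy))

contour(gx, gy, z, xlab = "x1", ylab = "x2")         # contour plot at x3 = 25
persp(gx, gy, z, xlab = "x1", ylab = "x2", zlab = "Y",
      theta = 30, phi = 25)                          # 3-D surface at x3 = 25

# mgcv's own helper, conditioning on x3 = 25
vis.gam(model, view = c("x1", "x2"), cond = list(x3 = 25),
        plot.type = "contour")
```

Changing the value passed for x3 (in grid and in cond) gives the slice at any other level.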

Related

Confidence intervals from model coefficients vs whole model

I'm trying to demonstrate that there is an important difference between two ways of making linear model predictions. The first way, which my heart tells me is more correct, uses predict.lm which as I understand preserves the correlations between coefficients. The second approach tries to use the parameters independently.
Is this the correct way to show the difference? The two approaches seem somewhat close.
Also, is the StdErr of the coefficients the same as the standard deviation of their distributions, or have I misunderstood what the model table is saying?
Below is a quick reprex to show what I mean:
# fake dataset
xs <- runif(200, min = -1, max = 1)
true_inter <- -1.3
true_slope <- 3.1
ybar <- true_inter + true_slope*xs
ys <- rnorm(200, ybar, sd = 1)
model <- lm(ys~xs)
# predictions
coef_sterr <- summary(model)$coefficients
inters <- rnorm(500, mean = coef_sterr[1,1], sd = coef_sterr[1,2])
slopes <- rnorm(500, mean = coef_sterr[2,1], sd = coef_sterr[2,2])
newx <- seq(from = -1, to= 1, length.out = 20)
avg_predictions <- cbind(1, newx) %*% rbind(inters, slopes)
conf_predictions <- apply(avg_predictions, 1, quantile, probs = c(.025, .975), simplify = TRUE)
# from confint
conf_interval <- predict(model, newdata=data.frame(xs = newx),
interval="confidence",
level = 0.95)
# plot to visualize
plot(ys~xs)
# averages are exactly the same
abline(model)
abline(a = coef(model)[1], b = coef(model)[2], col = "red")
# from predict, using parameter covariance
matlines(newx, conf_interval[,2:3], col = "blue", lty=1, lwd = 3)
# from simulated lines, ignoring parameter covariance
matlines(newx, t(conf_predictions), col = "orange", lty = 1, lwd = 2)
Created on 2022-04-05 by the reprex package (v2.0.1)
In this case, they would be close because there is very little correlation between the model parameters, so drawing them from two independent normals versus a multivariate normal is not that different:
set.seed(519)
xs <- runif(200, min = -1, max = 1)
true_inter <- -1.3
true_slope <- 3.1
ybar <- true_inter + true_slope*xs
ys <- rnorm(200, ybar, sd = 1)
model <- lm(ys~xs)
cov2cor(vcov(model))
# (Intercept) xs
# (Intercept) 1.00000000 -0.08054106
# xs -0.08054106 1.00000000
Also, it is probably worth calculating both of the intervals the same way, though it shouldn't make that much difference. That said, 500 observations may not be enough to get reliable estimates of the 2.5th and 97.5th percentiles of the distribution. Let's consider a slightly more complex example. Here, the two X variables are correlated - the correlation of the parameters derives in part from the correlation of the columns of the design matrix, X.
set.seed(519)
X <- MASS::mvrnorm(200, c(0,0), matrix(c(1,.65,.65,1), ncol=2))
b <- c(-1.3, 3.1, 2.5)
ytrue <- cbind(1,X) %*% b
y <- ytrue + rnorm(200, 0, .5*sd(ytrue))
dat <- data.frame(y=y, x1=X[,1], x2=X[,2])
model <- lm(y ~ x1 + x2, data=dat)
cov2cor(vcov(model))
# (Intercept) x1 x2
# (Intercept) 1.00000000 0.02417386 -0.01515887
# x1 0.02417386 1.00000000 -0.73228003
# x2 -0.01515887 -0.73228003 1.00000000
In this example, the coefficients for x1 and x2 are correlated around -0.73. As you'll see, this still doesn't result in a huge difference. Let's calculate the relevant statistics.
First, we draw B1 using the multivariate method that you rightly suspect is correct. Then, we'll draw B2 from a bunch of independent normals (actually, I'm using a multivariate normal with a diagonal variance-covariance matrix, which is the same thing).
b_est <- coef(model)
v <- vcov(model)
B1 <- MASS::mvrnorm(2500, b_est, v, empirical=TRUE)
B2 <- MASS::mvrnorm(2500, b_est, diag(diag(v)), empirical = TRUE)
Now, let's make a hypothetical X matrix and generate the relevant predictions:
hypX <- data.frame(x1=seq(-3,3, length=50),
x2 = mean(dat$x2))
yhat1 <- as.matrix(cbind(1, hypX)) %*% t(B1)
yhat2 <- as.matrix(cbind(1, hypX)) %*% t(B2)
Then we can calculate confidence intervals, etc...
yh1_ci <- t(apply(yhat1, 1, function(x)unname(quantile(x, c(.025,.975)))))
yh2_ci <- t(apply(yhat2, 1, function(x)unname(quantile(x, c(.025,.975)))))
yh1_ci <- as.data.frame(yh1_ci)
yh2_ci <- as.data.frame(yh2_ci)
names(yh1_ci) <- names(yh2_ci) <- c("lwr", "upr")
yh1_ci$fit <- c(as.matrix(cbind(1, hypX)) %*% b_est)
yh2_ci$fit <- c(as.matrix(cbind(1, hypX)) %*% b_est)
yh1_ci$method <- factor(1, c(1,2), labels=c("Multivariate", "Independent"))
yh2_ci$method <- factor(2, c(1,2), labels=c("Multivariate", "Independent"))
yh1_ci$x1 <- hypX[,1]
yh2_ci$x1 <- hypX[,1]
yh <- rbind(yh1_ci, yh2_ci)
We could then plot the two confidence intervals as you did.
library(ggplot2)
ggplot(yh, aes(x=x1, y=fit, ymin=lwr, ymax=upr, fill=method)) +
geom_ribbon(colour="transparent", alpha=.25) +
geom_line() +
theme_classic()
Perhaps a better visual would be to compare the widths of the intervals.
w1 <- yh1_ci$upr - yh1_ci$lwr
w2 <- yh2_ci$upr - yh2_ci$lwr
ggplot() +
geom_point(aes(x=hypX[,1], y=w2-w1)) +
theme_classic() +
labs(x="x1", y="Width (Independent) - Width (Multivariate)")
This shows that for small values of x1, the independent confidence intervals are wider than the multivariate ones. For values of x1 above 0, it's a more mixed bag.
This tells you that there is some difference, but you don't need the simulation to know which one is 'right'. That's because the prediction is a linear combination of constants and random variables.
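In this case, the b terms are the random variables and the x values are the constants. We know that the variance of a linear combination can be calculated this way (with x_0 = 1 for the intercept):
$$\operatorname{Var}\Big(\sum_j x_j b_j\Big) = \sum_j x_j^2 \operatorname{Var}(b_j) + 2 \sum_{j<k} x_j x_k \operatorname{Cov}(b_j, b_k)$$
Drawing the coefficients independently drops the covariance terms.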
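In matrix form, the variance of a prediction x'b is x' V x, where V is vcov(model). The sketch below, on simulated data of my own (the seed, sample size and noise level are assumptions), checks that this analytic standard error matches what predict() reports, and computes the version that ignores the off-diagonal covariances for comparison. It also confirms that the Std. Error column of the coefficient table is the square root of the diagonal of V, that is, the estimated standard deviation of each coefficient's sampling distribution.

```r
set.seed(519)
n <- 200
x1 <- rnorm(n)
x2 <- 0.65 * x1 + sqrt(1 - 0.65^2) * rnorm(n)   # correlated with x1
y <- -1.3 + 3.1 * x1 + 2.5 * x2 + rnorm(n, sd = 2)
dat <- data.frame(y = y, x1 = x1, x2 = x2)
model <- lm(y ~ x1 + x2, data = dat)

V <- vcov(model)
hypX <- data.frame(x1 = seq(-3, 3, length.out = 50), x2 = mean(dat$x2))
Xh <- cbind(1, as.matrix(hypX))                  # rows are the x vectors

# diag(Xh %*% V %*% t(Xh)) without forming the full 50 x 50 matrix
se_full <- sqrt(rowSums((Xh %*% V) * Xh))
# Same calculation with the covariances set to zero
se_indep <- sqrt(rowSums((Xh %*% diag(diag(V))) * Xh))
# Standard errors reported by predict()
se_pred <- predict(model, newdata = hypX, se.fit = TRUE)$se.fit

all.equal(unname(se_full), unname(se_pred))
all.equal(unname(sqrt(diag(V))),
          unname(summary(model)$coefficients[, "Std. Error"]))

# How far the independent version deviates at each x1
summary(se_indep - se_full)
```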
All that is to say that your intuition is correct.

How to fit Gaussian distribution with one-sided data?

x <- c(-3,-2.5,-2,-1.5,-1,-0.5)
y <- c(2,2.5,2.6,2.9,3.2,3.3)
The challenge is that all of the data comes from the left slope. How can I generate a two-sided Gaussian distribution?
The question gives incomplete information, so there are several ways to approach it. Note that the data is insufficient; for example, trying to fit this with nls does not work.
Here is one way to tackle it:
f <- function(par, x, y )sum((y - par[3]*dnorm(x,par[1],par[2]))^2)
a <- optim(c(0, 1, 1), f, x = x, y = y)$par
plot(x, y, xlim = c(-3,3.5), ylim = c(2, 3.5))
curve(dnorm(x, a[1], a[2])*a[3], add = TRUE, col = 2)
There is no way to fit a Gaussian distribution with these densities. If correct y-values had been provided this would be one way of solving the problem:
# Define function to be optimized
f <- function(pars, x, y){
mu <- pars[1]
sigma <- pars[2]
y_hat <- dnorm(x, mu, sigma)
se <- (y - y_hat)^2
sum(se)
}
# Define the data
x <- c(-3,-2.5,-2,-1.5,-1,-0.5)
y <- c(2,2.5,2.6,2.9,3.2,3.3)
# Find the best parameters
# (method must be named: the third positional argument of optim() is gr, not method)
opt <- optim(c(-.5, .1), f, method = 'SANN', x = x, y = y)
plot(
seq(-5, 5, length.out = 200),
dnorm(seq(-5, 5, length.out = 200), opt$par[1], opt$par[2]), type = 'l', col = 'red'
)
points(c(-3,-2.5,-2,-1.5,-1,-0.5), c(2,2.5,2.6,2.9,3.2,3.3))
Use nls to get a least squares fit of y to .lin.a * dnorm(x, b, c) where .lin.a, b and c are parameters to be estimated.
fm <- nls(y ~ cbind(a = dnorm(x, b, c)),
start = list(b = mean(x), c = sd(x)), algorithm = "plinear")
fm
giving:
Nonlinear regression model
model: y ~ cbind(a = dnorm(x, b, c))
data: parent.frame()
b c .lin.a
0.2629 3.2513 27.7287
residual sum-of-squares: 0.02822
Number of iterations to convergence: 7
Achieved convergence tolerance: 2.582e-07
The dnorm model (black curve) seems to fit the points although even a straight line (blue line) involving only two parameters (intercept and slope) instead of 3 isn't bad.
plot(y ~ x)
lines(fitted(fm) ~ x)
fm.lin <- lm(y ~ x)
abline(fm.lin, col = "blue")
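Another way to look at the same model, not taken from the answers above: the logarithm of a scaled Gaussian a * dnorm(x, mu, sigma) is a quadratic in x, so regressing log(y) on x and x^2 gives mu and sigma in closed form. This is only a sketch; fit_gauss_logquad is a hypothetical helper name, and because the least squares fit is done on the log scale, the estimates will generally differ from the nls ones.

```r
# Hypothetical helper: fit y ~ a * dnorm(x, mu, sigma) via a quadratic in log(y)
fit_gauss_logquad <- function(x, y) {
  stopifnot(all(y > 0))
  cf <- unname(coef(lm(log(y) ~ x + I(x^2))))
  # log(y) = c0 + c1 * x + c2 * x^2 with c2 = -1 / (2 * sigma^2)
  if (cf[3] >= 0) stop("log(y) is not concave in x, so no Gaussian shape fits")
  sigma <- sqrt(-1 / (2 * cf[3]))
  mu <- -cf[2] / (2 * cf[3])
  a <- exp(cf[1] + mu^2 / (2 * sigma^2)) * sigma * sqrt(2 * pi)
  c(mu = mu, sigma = sigma, a = a)
}

x <- c(-3, -2.5, -2, -1.5, -1, -0.5)
y <- c(2, 2.5, 2.6, 2.9, 3.2, 3.3)
est <- fit_gauss_logquad(x, y)
est

# Extrapolate the fitted curve to the right-hand side
plot(x, y, xlim = c(-3, 6), ylim = c(0, 4))
curve(est[["a"]] * dnorm(x, est[["mu"]], est[["sigma"]]), add = TRUE, col = 2)
```

With only six points on one flank, the right-hand side of the curve is pure extrapolation, so the estimates should be treated with caution.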

Force GAM model fit to be monotonic and go through a fixed point (x0, y0) with R mgcv

I am trying to fit a GAM model to data under two constraints simultaneously: (1) the fit is monotonic (increasing), (2) the fit goes through a fixed point, say, (x0,y0).
So far, I managed to have these two constraints work separately:
For (1), based on mgcv::pcls() documentation examples, by using mgcv::mono.con() to get linear constraints sufficient for monotonicity, and estimate model coefs via mgcv::pcls(), using the constraints.
For (2), based on this post, by setting the value of spline at knot location x0 to 0 + using offset term in the model formula.
However, I struggle to combine these two constraints simultaneously. I guess a way to go is mgcv::pcls(), but I could work out neither (a) doing a similar trick of setting the value of spline at knot location x0 to 0 + using offset nor (b) setting equality constraint(s) (which I think could yield my (2) constraint setup).
I also note that the approach of setting the value of the spline at knot location x0 to 0 for my constraint (2) yields a weirdly wiggly outcome (compared with the unconstrained GAM fit), as shown below.
Attempt so far: fit a smooth function to data under two constraints separately
Simulate some data
library(mgcv)
set.seed(1)
x <- sort(runif(100) * 4 - 1)
f <- exp(4*x)/(1+exp(4*x))
y <- f + rnorm(100) * 0.1
dat <- data.frame(x=x, y=y)
GAM unconstrained (for comparison)
k <- 13
fit0 <- gam(y ~ s(x, k = k, bs = "cr"), data = dat)
# predict from unconstrained GAM fit
newdata <- data.frame(x = seq(-1, 3, length.out = 1000))
newdata$y_pred_fit0 <- predict(fit0, newdata = newdata)
GAM constrained: (1) the fit is monotonic (increasing)
k <- 13
# Show regular spline fit (and save fitted object)
f.ug <- gam(y~s(x,k=k,bs="cr"))
# explicitly construct smooth term's design matrix
sm <- smoothCon(s(x,k=k,bs="cr"),dat,knots=NULL)[[1]]
# find linear constraints sufficient for monotonicity of a cubic regression spline
# it assumes "cr" is the basis and its knots are provided as input
F <- mono.con(sm$xp)
G <- list(
X=sm$X,
C=matrix(0,0,0), # [0 x 0] matrix (no equality constraints)
sp=f.ug$sp, # smoothing parameter estimates (taken from unconstrained model)
p=sm$xp, # array of feasible initial parameter estimates
y=y,
w= dat$y * 0 + 1 # weights for data
)
G$Ain <- F$A # matrix for the inequality constraints
G$bin <- F$b # vector for the inequality constraints
G$S <- sm$S # list of penalty matrices; The first parameter it penalizes is given by off[i]+1
G$off <- 0 # Offset values locating the elements of M$S in the correct location within each penalty coefficient matrix. (Zero offset implies starting in first location)
p <- pcls(G); # fit spline (using smoothing parameter estimates from unconstrained fit)
# predict
newdata$y_pred_fit2 <- Predict.matrix(sm, data.frame(x = newdata$x)) %*% p
# plot
plot(y ~ x, data = dat)
lines(y_pred_fit0 ~ x, data = newdata, col = 2, lwd = 2)
lines(y_pred_fit2 ~ x, data = newdata, col = 4, lwd = 2)
Blue line: constrained; red line: unconstrained
GAM constrained: (2) the fit goes through (x0,y0)=(-1, -0.1)
k <- 13
## Create a spline basis and penalty
## Make sure there is a knot at the constraint point (here: -1)
knots <- data.frame(x = seq(-1,3,length=k))
# explicit construction of a smooth term in a GAM
sm <- smoothCon(s(x,k=k,bs="cr"), dat, knots=knots)[[1]]
## 1st parameter is value of spline at knot location -1, set it to 0 by dropping
knot_which <- which(knots$x == -1)
X <- sm$X[, -knot_which] ## spline basis
S <- sm$S[[1]][-knot_which, -knot_which] ## spline penalty
off <- dat$y * 0 + (-0.1) ## offset term to force curve through (x0, y0)
## fit spline constrained through (x0, y0)
gam_1 <- gam(y ~ X - 1 + offset(off), paraPen = list(X = list(S)))
# predict (add offset of -0.1)
newdata_tmp <- Predict.matrix(sm, data.frame(x = newdata$x))
newdata_tmp <- newdata_tmp[, -knot_which]
newdata$y_pred_fit1 <- (newdata_tmp %*% coef(gam_1))[, 1] + (-0.1)
# plot
plot(y ~ x, data = dat)
lines(y_pred_fit0 ~ x, data = newdata, col = 2, lwd = 2)
lines(y_pred_fit1 ~ x, data = newdata, col = 3, lwd = 2)
# lines whose crossing point the fit should pass through
abline(v=-1, col = 3); abline(h=-0.1, col = 3)
Green line: constrained; red line: unconstrained
I think you could augment the data vectors x and y with (x0, y0) and then put a (really) high weight on the first observation (i.e. add a weight vector to your G list).
As an alternative to the simple weighting strategy, we can write the quadratic programming problem starting from the results of the preliminary smoothing. This is illustrated in the second R-code below (in this case I used P-spline smoothers, see Eilers and Marx 1996).
Hope this helps a bit (a similar problem is discussed here).
Rcode example 1 (weight strategy)
set.seed(123)
N = 100
x <- sort(runif(N) * 4 - 1)
f <- exp(4*x)/(1+exp(4*x))
y <- f + rnorm(N) * 0.1
x = c(-1, x)
y = c(-0.1, y)
dat = data.frame(x = x, y= y)
k <- 13
fit0 <- gam(y ~ s(x, k = k, bs = "cr"), data = dat)
# predict from unconstrained GAM fit
newdata <- data.frame(x = seq(-1, 3, length.out = 1000))
newdata$y_pred_fit0 <- predict(fit0, newdata = newdata)
k <- 13
# Show regular spline fit (and save fitted object)
f.ug <- gam(y~s(x,k=k,bs="cr"))
# explicitly construct smooth term's design matrix
sm <- smoothCon(s(x,k=k,bs="cr"),dat,knots=NULL)[[1]]
# find linear constraints sufficient for monotonicity of a cubic regression spline
# it assumes "cr" is the basis and its knots are provided as input
F <- mono.con(sm$xp)
G <- list(
X=sm$X,
C=matrix(0,0,0), # [0 x 0] matrix (no equality constraints)
sp=f.ug$sp, # smoothing parameter estimates (taken from unconstrained model)
p=sm$xp, # array of feasible initial parameter estimates
y=y,
w= c(1e8, 1:N * 0 + 1) # weights for data
)
G$Ain <- F$A # matrix for the inequality constraints
G$bin <- F$b # vector for the inequality constraints
G$S <- sm$S # list of penalty matrices; The first parameter it penalizes is given by off[i]+1
G$off <- 0 # Offset values locating the elements of M$S in the correct location within each penalty coefficient matrix. (Zero offset implies starting in first location)
p <- pcls(G); # fit spline (using smoothing parameter estimates from unconstrained fit)
# predict
newdata$y_pred_fit2 <- Predict.matrix(sm, data.frame(x = newdata$x)) %*% p
# plot
plot(y ~ x, data = dat)
lines(y_pred_fit0 ~ x, data = newdata, col = 2, lwd = 2)
lines(y_pred_fit2 ~ x, data = newdata, col = 4, lwd = 2)
abline(v = -1)
abline(h = -0.1)
Rcode example 2 (quadratic programming strategy)
rm(list = ls())
library(mgcv)
library(pracma)
set.seed(123)
N = 100
x = sort(runif(N) * 4 - 1)
f = exp(4*x)/(1+exp(4*x))
y = f + rnorm(N) * 0.1
x0 = -1
y0 = -0.1
dat = data.frame(x = x, y= y)
k = 50
# Show regular spline fit (and save fitted object)
f.ug = gam(y~s(x,k=k,bs="ps"))
# explicitly construct smooth term's design matrix
sm = smoothCon(s(x,k=k,bs="ps"), dat,knots=NULL)[[1]]
# Build quadprog to estimate the coefficients
scf = sapply(f.ug$smooth, '[[', 'S.scale')
lam = f.ug$sp / scf
Xp = rbind(sm$X, sqrt(lam) * f.ug$smooth[[1]]$D)
yp = c(dat$y, rep(0, k - 2))
X0 = Predict.matrix(sm, data.frame(x = x0))
sm$deriv = 1
X1 = Predict.matrix(sm, data.frame(x = dat$x))
coef_mono = pracma::lsqlincon(Xp, yp, Aeq = X0, beq = y0, A = -X1, b = rep(0, N))
# fitted values
fit = sm$X %*% coef_mono
sm$deriv = 0
xf = seq(-1, 3, len = 1000)
Xf = Predict.matrix(sm, data.frame(x = xf))
fine_fit = Xf %*% coef_mono
# plot
par(mfrow = c(2, 1), mar = c(3,3,3,3))
plot(dat$x, dat$y, pch = 1, main= 'Data and fit')
lines(dat$x, f.ug$fitted, lwd = 2, col = 2)
lines(dat$x, fit, col = 4, lty = 1, lwd = 2)
lines(xf, fine_fit, col = 3, lwd = 2, lty = 2)
abline(h = -0.1)
abline(v = -1)
plot(dat$x, X1 %*% coef_mono, type = 'l', main = 'Derivative of the fit', lwd = 2)
abline(h = 0.0)
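Besides plotting the derivative, both constraints can be checked numerically on a fine grid of predictions. The following is a small sketch; check_fit is a hypothetical helper, and the toy curves stand in for whatever fitted values (for example fine_fit above) you want to test.

```r
# Hypothetical helper: checks monotonicity and the pass-through condition
check_fit <- function(x, yhat, x0, y0, tol = 1e-6) {
  ord <- order(x)
  x <- x[ord]
  yhat <- yhat[ord]
  y_at_x0 <- approx(x, yhat, xout = x0)$y   # linear interpolation at x0
  list(
    monotone = all(diff(yhat) >= -tol),     # no step decreases by more than tol
    min_step = min(diff(yhat)),
    y_at_x0 = y_at_x0,
    through_point = abs(y_at_x0 - y0) <= tol
  )
}

xf <- seq(-1, 3, length.out = 1000)

# Toy curve that is increasing and equals -0.1 at x = -1
good <- plogis(4 * xf) - plogis(-4) - 0.1
res_good <- check_fit(xf, good, x0 = -1, y0 = -0.1)

# Toy curve that violates both constraints
bad <- sin(xf)
res_bad <- check_fit(xf, bad, x0 = -1, y0 = -0.1)

str(res_good)
str(res_bad)
```

The tolerance tol is an arbitrary choice; a weighted fit such as example 1 only passes through the point approximately, so a looser tolerance may be appropriate there.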
The following package seems to implement what you are looking for:
The proposed shape constrained smoothing has been incorporated into generalized
additive models with a mixture of unconstrained and shape restricted smooth terms
(mono-GAM). [...]
The proposed modelling approach has been implemented in an R package monogam.
The model setup is the same as in mgcv(gam) with the addition of shape constrained
smooths. In order to be consistent with the unconstrained GAM, the package provides
key functions similar to those associated with mgcv(gam).
Additive models with shape constraints

Is it possible to specify lower bound in response variable during smooth with gam?

I am trying to fit a smoothed surface of z against x and y using the formula z ~ s(x, y) with the gam function
in the mgcv package. My goal is to predict the response z from new values of x and y.
In my real situation, z should be a positive number; a negative z would be meaningless. However, the predicted z values
are sometimes negative. It seems that in some regions there are not enough points in the training data to estimate z
accurately.
My question is: is there a way to specify a lower bound on z when smoothing with gam, so that I won't get negative z values from predict later?
Below is a minimal example that reproduces this issue.
library(mgcv)
x <- seq(0.1, 1, by = 0.01)
y <- seq(0.1, 1, by = 0.01)
dtt <- expand.grid(x = x, y = y)
set.seed(123)
dtt$xp <- dtt$x + rnorm(nrow(dtt)) / 100
dtt$yp <- dtt$y + rnorm(nrow(dtt)) / 100
dtt$z <- 1 / (dtt$xp^2 + dtt$yp^2)
m <- sample.int(nrow(dtt), 3000)
dtt.train <- dtt[m, ]
dtt.test <- dtt[!(1:nrow(dtt) %in% m), ]
fit <- gam(z ~ s(x, y), data = dtt.train)
p <- predict(fit, newdata = dtt.test)
plot(dtt.test$z, p, xlab = 'Real', ylab = 'Predicted', pch = 19, col = 1 + (p < 0))
abline(h = 0, v = 0)
As you can see, for the red points the real values are positive but the predicted values are negative.
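One possible approach, which is not from the original post, is to model z on the log scale so that predictions on the response scale are positive by construction. The sketch below assumes a Gamma family with a log link is acceptable for the data; that choice, the coarser grid and the training size of 1500 are my assumptions for illustration. It enforces a lower bound of zero only, not an arbitrary bound.

```r
library(mgcv)

# Smaller version of the example data from the question
set.seed(123)
dtt <- expand.grid(x = seq(0.1, 1, by = 0.02), y = seq(0.1, 1, by = 0.02))
xp <- dtt$x + rnorm(nrow(dtt)) / 100
yp <- dtt$y + rnorm(nrow(dtt)) / 100
dtt$z <- 1 / (xp^2 + yp^2)

m <- sample.int(nrow(dtt), 1500)
train <- dtt[m, ]
test <- dtt[-m, ]

# Log link: the linear predictor is unbounded, the fitted mean is exp(.) > 0
fit_log <- gam(z ~ s(x, y), family = Gamma(link = "log"), data = train)

# type = "response" back-transforms from the log scale
p_log <- predict(fit_log, newdata = test, type = "response")
range(p_log)

plot(test$z, p_log, xlab = "Real", ylab = "Predicted", pch = 19)
abline(0, 1)
```

A related option is to fit log(z) with the default Gaussian family and exponentiate the predictions, keeping in mind that this targets the mean of log(z) rather than the mean of z.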

Quadratic spline

Is there a way to fit a quadratic spline (instead of a cubic one) to some data?
I have this data and I can't seem to find the appropriate function in R to do this.
Expanding just a bit on the comments above, you can use a B-spline basis (implemented in function splines::bs()), setting degree=2 rather than the default degree=3:
library(splines)
## Some example data
set.seed(1)
x <- 1:10
y <- rnorm(10)
## Fit a couple of quadratic splines with different degrees of freedom
f1 <- lm(y ~ bs(x, degree = 2)) # No df or knots given: no interior knots, i.e. a single quadratic polynomial
f9 <- lm(y ~ bs(x, degree = 2, df=9))
## Plot the splines
x0 <- seq(1, 10, by = 0.1)
plot(x, y, pch = 16)
lines(x0, predict(f1, data.frame(x = x0)), col = "blue")
lines(x0, predict(f9, data.frame(x = x0)), col = "red")
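To see how degree, df and knots interact, it can help to inspect the basis matrices directly. A short sketch follows; the knot positions 4 and 7 are arbitrary values chosen for illustration.

```r
library(splines)

x <- 1:10

# Without an intercept column, bs() returns degree + (number of interior knots) columns
B_default <- bs(x, degree = 2)                  # no interior knots
B_df9     <- bs(x, degree = 2, df = 9)          # interior knots chosen from quantiles of x
B_knots   <- bs(x, degree = 2, knots = c(4, 7)) # interior knots placed by hand

ncol(B_default)
ncol(B_df9)
ncol(B_knots)

# Interior knots that bs() picked for the df = 9 basis
attr(B_df9, "knots")
```

Specifying knots directly gives control over where the quadratic pieces join, whereas df lets bs() place them at quantiles of x.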
