Multivariate regression splines in R

Most people are probably familiar with bs from splines:
library(splines)
workingModel <- lm(mpg ~ factor(gear) + bs(wt, knots = 5) + hp, data = mtcars)
bs(mtcars$wt, knots = 4)
This uses a B-spline for the single variable wt (weight), but it seems you can also do multivariate splines:
bs(cbind(mtcars$wt,mtcars$hp), knots = 4)
But this produces a matrix with twice as many rows as mtcars, so when I try:
brokenModel <- lm(mpg ~ bs(cbind(mtcars$wt,mtcars$hp), knots = 4), data = mtcars)
I get an error about differing lengths.
My question is: how do I use the multivariate spline in a model if it has a different number of rows than my outcome variable? Do I stack my outcome variable on top of itself y <- c(y, y)? Why does the multivariate spline produce extra rows?
Thanks.

You can't use splines::bs in this case, as it is strictly for construction of a univariate spline. If you do bs(mat) where mat is a matrix, it is just doing bs(c(mat)). For example,
mat <- matrix(runif(8), 4, 2)
identical(bs(mat), bs(c(mat)))
# [1] TRUE
This explains why you get twice the number of rows when doing bs(cbind(mtcars$wt, mtcars$hp)).
To create a 2D spline, the simplest way is an additive spline:
lm(mpg ~ factor(gear) + bs(wt, knots = 5) + bs(hp, knots = 4), mtcars)
but this may not be what you want. Then consider interaction:
model <- lm(mpg ~ factor(gear) + bs(wt, knots = 5):bs(hp, knots = 4), mtcars)
The term bs(wt, knots = 5):bs(hp, knots = 4) forms the row-wise Kronecker product of the two design matrices. Since bs(wt, knots = 5) and bs(hp, knots = 4) are each matrices with 4 columns (one interior knot plus cubic degree), the interaction has 4 * 4 = 16 columns.
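If you want to see explicitly what that interaction expands to, here is a minimal sketch (my own illustration, not part of the original answer) of the row-wise Kronecker product that the : operator builds; the exact column count depends on your bs() settings:
library(splines)
X1 <- bs(mtcars$wt, knots = 5)  # cubic B-spline basis for wt
X2 <- bs(mtcars$hp, knots = 4)  # cubic B-spline basis for hp
# row-wise Kronecker product: every column of X1 multiplied by every column of X2
XX <- do.call(cbind, lapply(seq_len(ncol(X1)), function(j) X1[, j] * X2))
dim(XX)  # 32 rows, ncol(X1) * ncol(X2) columns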
Alternatively, consider using mgcv package. In mgcv, multivariate splines can be constructed in two ways:
isotropic thin-plate splines;
scale invariant tensor product splines.
Clearly you want the second here, as wt and hp have different units. To construct tensor product splines, we can use:
library(mgcv)
fit <- gam(mpg ~ factor(gear)
                 + s(wt, bs = 'cr', k = 4, fx = TRUE)
                 + s(hp, bs = 'cr', k = 4, fx = TRUE)
                 + ti(wt, hp, bs = 'cr', k = c(4, 4), d = c(1, 1), fx = TRUE),
           data = mtcars)
Here I deliberately set fx = TRUE to disable penalization, so these are regression splines with fixed degrees of freedom.
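If you prefer a single tensor product term instead of separate main effects plus a ti() interaction, a te() smooth spans (roughly) the same function space; this is a sketch under the same fixed-df settings, not part of the original answer:
fit_te <- gam(mpg ~ factor(gear)
                    + te(wt, hp, bs = 'cr', k = c(4, 4), fx = TRUE),
              data = mtcars)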
I don't want to write an extensive answer introducing mgcv here. For how s, ti and gam work, just read the documentation. If you need to bridge the gap in theory, grab Simon Wood's 2006 book, Generalized Additive Models: An Introduction with R.
A practical example of mgcv usage?
I wrote an answer, Cubic spline method for longitudinal series data, which might help you get familiar with mgcv. As an introductory example it only shows how to work with a univariate spline, but that is also the key: a tensor product spline is constructed from univariate splines.
My other answers related to mgcv are more theoretical, and not all of my spline-related answers reference mgcv, so that question and answer is the best pointer I can give you at this stage.
Would the scale invariant tensor product splines be equivalent to radial smoothing, or would that be the isotropic thin-plate splines?
Radial smoothing is equivalent to a thin-plate spline, as the basis functions of a thin-plate spline are radial. That is why it is isotropic and can be used in spatial regression.
A tensor product spline is scale invariant, as it is constructed by (pairwise) multiplication of univariate spline bases.

Related

Maximum likelihood estimation of parameters following polynomial logistic regression

The dataset is gammarus from the frair package (library(frair); data(gammarus)).
I want to estimate the parameters P0, P1, P2 and P3. The formula is
Na/No = exp(P0 + P1*density + P2*density^2 + P3*density^3) / (1 + exp(P0 + P1*density + P2*density^2 + P3*density^3)),
where Na is the number of prey eaten and No is the number of prey offered.
Setup
library(dplyr)
library(frair)
d <- gammarus %>% mutate(y = eaten/(eaten + alive))
Step 1: Regression
You can estimate the coefficients from an equation with the lm (linear model) function:
lm(y ~ density, data = d)
Step 2: Polynomial regression
To have a polynomial functional form instead, you can use the poly function. The first argument is the variable, the second is the degree of the polynomial, and you must then specify whether you want a raw or an orthogonal polynomial. In our case it would be a raw polynomial; check this post for more detail.
You can estimate the four coefficients by replacing density with a third-degree raw polynomial of density:
lm(y ~ poly(density, 3, raw = TRUE), data = d)
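As a quick sanity check (my own addition), the raw polynomial gives the same coefficients as writing the powers out explicitly:
coef(lm(y ~ poly(density, 3, raw = TRUE), data = d))
coef(lm(y ~ density + I(density^2) + I(density^3), data = d))
# the two coefficient vectors agree (only the names differ)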
Step 3: Logistic regression
The final step is to switch from the linear to the logistic specification. For this you need the glm function (generalized linear model), and you must specify that you want a logit link (and not a probit, for instance; cf. this post) with family = binomial(link = "logit").
glm(y ~ poly(density, 3, raw = TRUE), data = d, family = binomial(link = "logit"))
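To read the estimates P0 to P3 off the fitted object, and, if you prefer to weight each proportion by the number of prey offered (an assumption on my part about what you want; it also avoids the non-integer successes warning), a hedged variant:
fit <- glm(y ~ poly(density, 3, raw = TRUE), data = d,
           family = binomial(link = "logit"))
coef(fit)  # intercept = P0, then P1, P2, P3

fit_w <- glm(y ~ poly(density, 3, raw = TRUE), data = d,
             family = binomial(link = "logit"),
             weights = eaten + alive)  # weight each proportion by prey offered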

How to check for overdispersion in a GAM with negative binomial distribution?

I fit a Generalized Additive Model in the Negative Binomial family using gam from the mgcv package. I have a data frame containing my dependent variable y, an independent variable x, a factor fac and a random variable ran. I fit the following model
gam1 <- gam(y ~ fac + s(x) + s(ran, bs = 're'), data = dt, family = "nb")
I have read in the Negative Binomial Regression book that it is still possible for the model to be overdispersed. I have found code to check for overdispersion in a glm, but I am failing to find it for a gam. I have also encountered suggestions to just check the QQ plot and the standardised residuals vs. predicted values, but I cannot decide from my plots whether the data are still overdispersed. Therefore, I am looking for an equation that would solve my problem.
A good way to check how well the model compares with the observed data (and hence check for overdispersion in the data relative to the conditional distribution implied by the model) is via a rootogram.
I have a blog post showing how to do this for glm() models using the countreg package, but this works for GAMs too.
The salient parts of the post applied to a GAM version of the model are:
library("coenocliner")
library('mgcv')
## parameters for simulating
set.seed(1)
locs <- runif(100, min = 1, max = 10) # environmental locations
A0 <- 90 # maximal abundance
mu <- 3 # position on gradient of optima
alpha <- 1.5 # parameter of beta response
gamma <- 4 # parameter of beta response
r <- 6 # range on gradient species is present
pars <- list(m = mu, r = r, alpha = alpha, gamma = gamma, A0 = A0)
nb.alpha <- 1.5 # overdispersion parameter 1/theta
zprobs <- 0.3 # prob(y == 0) in binomial model
## simulate some negative binomial data from this response model
nb <- coenocline(locs, responseModel = "beta", params = pars,
                 countModel = "negbin",
                 countParams = list(alpha = nb.alpha))
df <- setNames(cbind.data.frame(locs, nb), c("x", "yNegBin"))
OK, so we have a sample of data drawn from a negative binomial sampling distribution and we will now fit two models to these data:
A Poisson GAM
m_pois <- gam(yNegBin ~ s(x), data = df, family = poisson())
A negative binomial GAM
m_nb <- gam(yNegBin ~ s(x), data = df, family = nb())
The countreg package is not yet on CRAN but it can be installed from R-Forge:
install.packages("countreg", repos="http://R-Forge.R-project.org")
Then load the packages and compute the rootograms:
library("countreg")
library("ggplot2")
root_pois <- rootogram(m_pois, style = "hanging", plot = FALSE)
root_nb <- rootogram(m_nb, style = "hanging", plot = FALSE)
Now plot the rootograms for each model:
autoplot(root_pois)
autoplot(root_nb)
This is what we get (after plotting both using cowplot::plot_grid() to arrange the two rootograms on the same plot)
We can see that the negative binomial model does a bit better here than the Poisson GAM for these data: the bottoms of the bars are closer to zero throughout the range of the observed counts.
The countreg package has details on how you can add an uncertainty band around the zero line as a form of goodness-of-fit test.
You can also compute the Pearson estimate for the dispersion parameter using the Pearson residuals of each model:
sum(residuals(m_pois, type = "pearson")^2) / df.residual(m_pois)
# [1] 28.61546
sum(residuals(m_nb, type = "pearson")^2) / df.residual(m_nb)
# [1] 0.5918471
In both cases these ratios should be approximately 1; we see substantial overdispersion in the Poisson GAM and some underdispersion in the negative binomial GAM.
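If you want this wrapped up as a single reusable check, here is a minimal sketch (the function name is my own, and the chi-squared p-value is only a rough guide for penalized models):
check_dispersion <- function(model) {
  # Pearson statistic divided by the residual degrees of freedom
  rp   <- residuals(model, type = "pearson")
  rdf  <- df.residual(model)
  stat <- sum(rp^2)
  c(dispersion = stat / rdf,
    p.value    = pchisq(stat, df = rdf, lower.tail = FALSE))
}
check_dispersion(m_pois)  # dispersion >> 1: overdispersed
check_dispersion(m_nb)    # dispersion near 1: adequate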

How can I perform a linear regression on my group variances in R?

Here is my data
Comparing the means seems to yield some interesting results, and it indeed does, as revealed by the linear model:
lm(y ~ factor(x), data = data)
Now, it also looks like the variances are not equal in all groups. Here is a plot of the variance in y for each group in x.
I'd be interested in testing different regressions to see whether a quadratic model is better than a linear one (compared by AIC or BIC).
I could estimate the sampling distribution for the variance and get a confidence interval for the variance in each group, so it should be feasible to perform a regression on the variance of my groups. However, I don't know how to perform a regression on my sample variances in R.
How can I perform a linear regression on my group variances in R?
I wouldn't use a polynomial to model variances. Among the variance functions offered by package nlme is varConstPower. Let's try this:
n <- c(1, 2, 4, 8, 16, 32)
v <- c(5.85, 6.35, 6.55, 6.85, 7.02, 7.15)
plot(v ~ n)
fit_ConstPower <- nls(v ~ n^(2*theta) + c,
                      start = list(theta = 2, c = 4),
                      data = data.frame(n, v))
summary(fit_ConstPower)
lines(npred <- seq(1, 32, length.out = 100),
      predict(fit_ConstPower, newdata = data.frame(n = npred)),
      col = "blue")
However, that doesn't seem to be a good fit. To me it looks more like there might be saturation:
fit_hyper <- nls(v ~ k * n^(theta) + c,
                 start = list(k = -1, theta = -1, c = 7.2),
                 data = data.frame(n, v))
summary(fit_hyper)
lines(npred,
      predict(fit_hyper, newdata = data.frame(n = npred)),
      col = "green")
PS: I'm having trouble understanding your question. The literal answer would be so simple (just use lm; see the sketch below) that I must be missing something. Furthermore, you don't tell us whether x is a covariate or something else (e.g., sample size?). Thus, I can't even guess whether there might be some fundamental relationship behind what you observe.
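For completeness, the "literal answer" mentioned above, regressing the per-group sample variances on x, might look like this (a sketch assuming your data frame is called data with columns x and y):
library(dplyr)
vdat <- data %>%
  group_by(x) %>%
  summarise(v = var(y))          # sample variance of y within each group

fit_lin  <- lm(v ~ x, data = vdat)           # linear in x
fit_quad <- lm(v ~ poly(x, 2), data = vdat)  # quadratic in x
AIC(fit_lin, fit_quad)                       # compare, as the question asks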
There's a package in R called glmnet, which fits generalized linear models, including multivariate linear regression. If you are familiar with lasso or ridge penalization, the glmnet() function fits a regression with a combination of both penalties, and cv.glmnet() cross-validates the model to choose the best penalization parameter(s).
So, you can do the following:
library('glmnet')
x <- as.matrix(yourdata[, predictor_columns])  # predictor_columns is a placeholder for your predictor names; glmnet needs a numeric matrix
cv.model1 <- cv.glmnet(x,y,alpha = your_parameter)
Or, if you want the simplest linear regression model, you can use the lm() function, which doesn't perform any penalization.
simple_model <- lm(y ~ x, data = data.frame(x, y))
For more information on both models, see the documentation of the corresponding functions. Hope this helps.
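Once the cross-validated model is fitted, you can pull out coefficients and predictions at the selected penalty; a short hedged sketch, using the x and y passed to cv.glmnet above:
coef(cv.model1, s = "lambda.min")              # coefficients at the best lambda
predict(cv.model1, newx = x, s = "lambda.min") # fitted values at the best lambda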

How to estimate the best fitting function to a scatter plot in R?

I have a scatterplot of two variables, for instance this:
x<-c(0.108,0.111,0.113,0.116,0.118,0.121,0.123,0.126,0.128,0.131,0.133,0.136)
y<-c(-6.908,-6.620,-5.681,-5.165,-4.690,-4.646,-3.979,-3.755,-3.564,-3.558,-3.272,-3.073)
and I would like to find the function that best fits the relation between these two variables.
To be precise, I would like to compare the fit of three models: linear, exponential and logarithmic.
I was thinking about fitting each function to my values, calculating the likelihoods in each case and comparing the AIC values.
But I don't really know how or where to start. Any help with this would be greatly appreciated.
Thank you very much in advance.
Tina.
I would begin with some exploratory plots, something like this:
x<-c(0.108,0.111,0.113,0.116,0.118,0.121,0.123,0.126,0.128,0.131,0.133,0.136)
y<-c(-6.908,-6.620,-5.681,-5.165,-4.690,-4.646,-3.979,-3.755,-3.564,-3.558,-3.272,-3.073)
dat <- data.frame(y=y,x=x)
library(latticeExtra)
library(grid)
xyplot(y ~ x, data = dat, par.settings = ggplot2like(),
       panel = function(x, y, ...){
         panel.xyplot(x, y, ...)
       }) +
  layer(panel.smoother(y ~ x, method = "lm"), style = 1) +          ## linear
  layer(panel.smoother(y ~ poly(x, 3), method = "lm"), style = 2) + ## cubic
  layer(panel.smoother(y ~ x, span = 0.9), style = 3) +             ## loess
  layer(panel.smoother(y ~ log(x), method = "lm"), style = 4)       ## log
It looks like you need a cubic model.
summary(lm(y~poly(x,3),data=dat))
Residual standard error: 0.1966 on 8 degrees of freedom
Multiple R-squared: 0.9831, Adjusted R-squared: 0.9767
F-statistic: 154.8 on 3 and 8 DF, p-value: 2.013e-07
Here is an example of comparing five models. Due to the form of the first two models we are able to use lm to get good starting values. (Note that models using different transforms of y should not be compared, so we should not use lm1 and lm2 as comparison models but only for starting values.) We then run nls for each of the first two. After these two models we try polynomials of various degrees in x. Fortunately lm and nls use consistent AIC definitions (although it's not necessarily true that other R model-fitting functions have consistent AIC definitions), so we can just use lm for the polynomials. Finally, we plot the data and the fits of the first two models.
The lower the AIC the better, so nls1 is best, followed by lm3.2, followed by nls2.
lm1 <- lm(1/y ~ x)
nls1 <- nls(y ~ 1/(a + b*x), start = setNames(coef(lm1), c("a", "b")))
AIC(nls1) # -2.390924
lm2 <- lm(1/y ~ log(x))
nls2 <- nls(y ~ 1/(a + b*log(x)), start = setNames(coef(lm2), c("a", "b")))
AIC(nls2) # -1.29101
lm3.1 <- lm(y ~ x)
AIC(lm3.1) # 13.43161
lm3.2 <- lm(y ~ poly(x, 2))
AIC(lm3.2) # -1.525982
lm3.3 <- lm(y ~ poly(x, 3))
AIC(lm3.3) # 0.1498972
plot(y ~ x)
lines(fitted(nls1) ~ x, lty = 1) # solid line
lines(fitted(nls2) ~ x, lty = 2) # dashed line
ADDED: a few more models, subsequently fixed up, with changed notation. Also, to follow up on Ben Bolker's comment, we can replace AIC everywhere above with AICc from the AICcmodavg package.
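A minimal sketch of that substitution (assuming AICcmodavg is installed; its AICc() accepts both lm and nls fits):
library(AICcmodavg)
AICc(nls1)
AICc(nls2)
AICc(lm3.2)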
You could start by reading the classic paper by Box and Cox on transformations. They discuss how to compare transformations and how to find meaningful transformations within a set or family of potential transforms. The log transform and linear model are special cases of the Box-Cox family.
And as @agstudy said, always plot the data as well.
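A minimal sketch of that idea with MASS::boxcox, using the dat data frame built above; note that boxcox() needs a strictly positive response, and y in this example is negative throughout, so I flip its sign purely for illustration:
library(MASS)
dat$y_pos <- -dat$y                # make the response positive (illustration only)
boxcox(lm(y_pos ~ x, data = dat))  # profile log-likelihood over the Box-Cox lambda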

Calculating R^2 for a nonlinear least squares fit

Suppose I have x values, y values, and expected y values f (from some nonlinear best fit curve).
How can I compute R^2 in R? Note that this function is not a linear model, but a nonlinear least squares (nls) fit, so not an lm fit.
You just use the lm function to fit a linear model:
x = runif(100)
y = runif(100)
spam = summary(lm(x~y))
> spam$r.squared
[1] 0.0008532386
Note that R-squared is not defined for non-linear models, or is at least very tricky. To quote from R-help:
There is a good reason that an nls model fit in R does not provide
r-squared - r-squared doesn't make sense for a general nls model.
One way of thinking of r-squared is as a comparison of the residual
sum of squares for the fitted model to the residual sum of squares for
a trivial model that consists of a constant only. You cannot
guarantee that this is a comparison of nested models when dealing with
an nls model. If the models aren't nested this comparison is not
terribly meaningful.
So the answer is that you probably don't want to do this in the first
place.
If you want peer-reviewed evidence, see this article for example; it's not that you can't compute the R^2 value, it's just that it may not mean the same thing/have the same desirable properties as in the linear-model case.
Sounds like f are your predicted values, so you want one minus the sum of squared differences between them and the actual values, divided by n times the variance of y.
So something like
1-sum((y-f)^2)/(length(y)*var(y))
should give you a quasi-R-squared value, as long as your model is reasonably close to a linear model and n is reasonably large.
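As a concrete sketch of that formula on an actual nls fit (the model below is just an example of my own):
fit <- nls(mpg ~ a / wt + b, data = mtcars, start = list(a = 40, b = 4))
y <- mtcars$mpg
f <- fitted(fit)
1 - sum((y - f)^2) / (length(y) * var(y))  # quasi R-squared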
As a direct answer to the question asked (rather than arguing that R2/pseudo-R2 aren't useful), the nagelkerke function in the rcompanion package will report various pseudo-R2 values for nonlinear least squares (nls) models, as proposed by McFadden, Cox and Snell, and Nagelkerke, e.g.
library(rcompanion)
data(BrendonSmall)
quadplat <- function(x, a, b, clx) {
  ifelse(x < clx, a + b * x + (-0.5 * b / clx) * x * x,
                  a + b * clx + (-0.5 * b / clx) * clx * clx)
}
model <- nls(Sodium ~ quadplat(Calories, a, b, clx),
             data = BrendonSmall,
             start = list(a = 519, b = 0.359, clx = 2304))

nullfunct <- function(x, m){ m }
null.model <- nls(Sodium ~ nullfunct(Calories, m),
                  data = BrendonSmall,
                  start = list(m = 1346))
nagelkerke(model, null = null.model)
The soilphysics package also reports Efron's pseudo R2 and adjusted pseudo R2 value for nls models as 1 - RSS/TSS:
pred <- predict(model)
n <- length(pred)
res <- resid(model)
w <- weights(model)
if (is.null(w)) w <- rep(1, n)
rss <- sum(w * res ^ 2)
resp <- pred + res
center <- weighted.mean(resp, w)
r.df <- summary(model)$df[2]
int.df <- 1
tss <- sum(w * (resp - center)^2)
r.sq <- 1 - rss/tss
adj.r.sq <- 1 - (1 - r.sq) * (n - int.df) / r.df
out <- list(pseudo.R.squared = r.sq,
            adj.R.squared = adj.r.sq)
which is also the pseudo-R2 as calculated by the accuracy function in the rcompanion package. Basically, this R2 measures how much better your fit is compared to just drawing a flat horizontal line through the data. This can make sense for nls models if your null model is an intercept-only model, and for particular other nonlinear models as well. E.g. for a scam model that uses strictly increasing splines (bs = "mpi" in the spline term), the fitted model for the worst possible scenario (e.g. data that are strictly decreasing) would be a flat line, and hence would give an R2 of zero. The adjusted R2 then also penalizes models with higher numbers of fitted parameters. Using the adjusted R2 value would already address a lot of the criticisms of the paper linked above, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2892436/ (besides, if one swears by using information criteria for model selection, the question becomes which one to use: AIC, BIC, EBIC, AICc, QIC, etc.).
Just using
r.sq <- max(cor(y,yfitted),0)^2
adj.r.sq <- 1 - (1 - r.sq) * (n - int.df) / r.df
I think this would also make sense if you have normal Gaussian errors, i.e. the correlation between the observed and fitted y (clipped at zero, so that a negative relationship implies zero predictive power), squared, and then adjusted for the number of fitted parameters in the adjusted version. If y and yfitted go in the same direction, this would be the R2 and adjusted R2 value as reported for a regular linear model. To me this makes perfect sense, so I don't agree with outright rejecting the usefulness of pseudo-R2 values for nls models, as the answer above seems to imply.
For non-normal error structures (e.g. if you were using a GAM with non-normal errors) the McFadden pseudo R2 is defined analogously as
1 - residual deviance / null deviance
See here and here for some useful discussion.
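A one-line hedged sketch of that ratio for a fitted glm or gam object (both store the deviance and the null deviance):
pseudo_r2 <- function(model) 1 - model$deviance / model$null.deviance
# e.g. pseudo_r2(fitted_gam), where fitted_gam is a placeholder for your fitted object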
Another quasi-R-squared for non-linear models is to square the correlation between the actual y-values and the predicted y-values. For linear models this is the regular R-squared.
As an alternative, I have used the following procedure several times:
compute a fit on the data with the nls function
make predictions with the resulting model
plot the data against the values predicted by the model (if the model is good, the points should lie near the bisectrix, i.e. the y = x line)
compute the R2 of that linear regression (observed vs. predicted).
Best wishes to all. Patrick.
With the modelr package
modelr::rsquare(nls_model, data)
nls_model <- nls(mpg ~ a / wt + b, data = mtcars, start = list(a = 40, b = 4))
modelr::rsquare(nls_model, mtcars)
# 0.794
This gives essentially the same result as the longer way described by Tom from the rcompanion resource.
Longer way with nagelkerke function
nullfunct <- function(x, m){ m }
null_model <- nls(mpg ~ nullfunct(wt, m),
                  data = mtcars,
                  start = list(m = mean(mtcars$mpg)))
nagelkerke(nls_model, null_model)[2]
# 0.794 or 0.796
Lastly, using predicted values
lm(mpg ~ predict(nls_model), data = mtcars) %>% broom::glance()
# 0.795
Like they say, it's only an approximation.
