maximum likelihood estimation of parameters following polynomial logistic regression - r

This is the dataset: library(frair), data(gammarus).
I want to estimate the parameters P0, P1, P2 and P3. The formula is
Na/No = exp(P0 + P1*density + P2*density^2 + P3*density^3) / (1 + exp(P0 + P1*density + P2*density^2 + P3*density^3)),
where Na is the number of prey eaten and No is the number of prey offered.

Setup
library(dplyr)
library(frair)
d <- gammarus %>% mutate(y = eaten/(eaten + alive))
Step 1: Regression
You can estimate the coefficients of a linear equation with the lm (linear model) function:
lm(y ~ density, data = d)
Step 2: Polynomial regression
To get a polynomial functional form instead, you can use the poly function. The first argument is the variable, the second is the degree of the polynomial, and you must then specify whether you want a raw or an orthogonal polynomial. In our case it would be a raw polynomial; see this post for more detail.
You can estimate the four coefficients by replacing density with a third-degree raw polynomial of density:
lm(y ~ poly(density, 3, raw = TRUE), data = d)
Step 3: Logistic regression
The final step is to switch from the linear to the logistic specification. For this you need the glm function (generalized linear model), and you must specify that you want a logit (and not a probit, for instance, cf. this post) link with family = binomial(link = "logit").
glm(y ~ poly(density, 3, raw = TRUE), data = d, family = binomial(link = "logit"))
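The fitted P0, P1, P2 and P3 (on the logit scale) are the coefficients of that fit. Note that with a proportion as the response, glm() will warn about non-integer successes unless you also pass the number of trials as weights; a minimal sketch (my own addition, not part of the original answer):
fit <- glm(y ~ poly(density, 3, raw = TRUE),
           data = d,
           weights = eaten + alive,            # number of prey offered (No)
           family = binomial(link = "logit"))
coef(fit)                                      # estimates of P0, P1, P2, P3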

Related

Estimating risk ratio instead of odds ratio in mixed effect logistic regression in `R`

glmer is used to estimate effects on the logit scale of y when the data are clustered. In the following model
fit1 = glmer(y ~ treat + x + ( 1 | cluster), family = binomial(link = "logit"))
the exponentiated coefficient of treat is the odds ratio for a binary 0-1 treatment variable, x is a covariate, and cluster is a clustering indicator across which we model a random intercept. A standard approach in GLMs for estimating risk ratios is to use a log link instead, i.e. family = binomial(link = "log"). However, using this in glmer I get the error
Error in (function (fr, X, reTrms, family, nAGQ = 1L, verbose = 0L, maxit = 100L, :
(maxstephalfit) PIRLS step-halvings failed to reduce deviance in pwrssUpdate
after calling
fit1 = glmer(y ~ treat + x + ( 1 | cluster), family = binomial(link = "log"))
A web search revealed other people had similar issues with the Gamma family.
This seems to be a general problem as the reproducible example below demonstrates. My question thus is: how can I estimate risk ratios using a mixed effect model like glmer?
Reproducible Example
The following code simulates data that replicates the problem.
library(lme4)
set.seed(42)                        # for reproducibility
n = 1000                            # sample size
m = 50                              # number of clusters
J = sample(1:m, n, replace = TRUE)  # simulate cluster membership
x = rnorm(n)                        # simulate covariate
treat = rbinom(n, 1, 0.5)           # simulate random treatment
u = rnorm(m)                        # simulate random intercepts
lt = x + treat + u[J]               # linear predictor of the logistic mixed effect model
p = 1/(1 + exp(-lt))                # logit link transforms to probabilities
y = rbinom(n, 1, p)                 # draw binary outcomes
d = data.frame(y, x, treat, J)      # keep J in the data so glmer can find it
# First fit the logistic model with glmer
fit1 = glmer(y ~ treat + x + (1 | as.factor(J)),
             family = binomial(link = "logit"), data = d)
summary(fit1)
# Now try the log link
fit2 = glmer(y ~ treat + x + (1 | as.factor(J)),
             family = binomial(link = "log"), data = d)
This error is returned because your model produces fitted values greater than 1:
PIRLS step-halvings failed to reduce deviance in pwrssUpdate
...
When using lme4 to fit GLMMs with link functions that do not automatically constrain the response to the allowable range of the distributional family (e.g. binomial models with a log link, where the estimated probability can be greater than 1, or Gamma models with an inverse link, where the estimated mean can be negative), it is not unusual to get this error. It occurs because lme4 does nothing to constrain the predicted values, so NaN values pop up and are not handled gracefully. If possible, switch the link function to one that constrains the response (e.g. a logit link for binomial or a log link for Gamma).
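To see concretely what goes wrong: under a log link the fitted probability is exp(linear predictor), which exceeds 1 as soon as the linear predictor is positive (a one-line illustration, not from the original answer):
exp(0.3)  # about 1.35: not a valid probability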
Unfortunately, the suggested workaround is to use a different link function.
The following paper surveys a number of alternative model choices for estimating [adjusted] relative risks:
Model choices to obtain adjusted risk difference estimates from a binomial regression model with convergence problems: An assessment of methods of adjusted risk difference estimation (2016)

How to check for overdispersion in a GAM with negative binomial distribution?

I fit a Generalized Additive Model in the Negative Binomial family using gam from the mgcv package. I have a data frame containing my dependent variable y, an independent variable x, a factor fac and a random variable ran. I fit the following model
gam1 <- gam(y ~ fac + s(x) + s(ran, bs = 're'), data = dt, family = "nb")
I have read in the Negative Binomial Regression book that it is still possible for the model to be overdispersed. I have found code to check for overdispersion in a glm, but I am failing to find it for a gam. I have also seen suggestions to just check the QQ plot and the standardised residuals vs. predicted values plot, but I cannot decide from my plots whether the data are still overdispersed. Therefore, I am looking for an equation that would solve my problem.
A good way to check how well the model compares with the observed data (and hence check for overdispersion in the data relative to the conditional distribution implied by the model) is via a rootogram.
I have a blog post showing how to do this for glm() models using the countreg package, but this works for GAMs too.
The salient parts of the post applied to a GAM version of the model are:
library("coenocliner")
library('mgcv')
## parameters for simulating
set.seed(1)
locs <- runif(100, min = 1, max = 10) # environmental locations
A0 <- 90 # maximal abundance
mu <- 3 # position on gradient of optima
alpha <- 1.5 # parameter of beta response
gamma <- 4 # parameter of beta response
r <- 6 # range on gradient species is present
pars <- list(m = mu, r = r, alpha = alpha, gamma = gamma, A0 = A0)
nb.alpha <- 1.5 # overdispersion parameter 1/theta
zprobs <- 0.3 # prob(y == 0) in binomial model
## simulate some negative binomial data from this response model
nb <- coenocline(locs, responseModel = "beta", params = pars,
countModel = "negbin",
countParams = list(alpha = nb.alpha))
df <- setNames(cbind.data.frame(locs, nb), c("x", "yNegBin"))
OK, so we have a sample of data drawn from a negative binomial sampling distribution and we will now fit two models to these data:
A Poisson GAM
m_pois <- gam(yNegBin ~ s(x), data = df, family = poisson())
A negative binomial GAM
m_nb <- gam(yNegBin ~ s(x), data = df, family = nb())
The countreg package is not yet on CRAN but it can be installed from R-Forge:
install.packages("countreg", repos="http://R-Forge.R-project.org")
Then load the packages and compute the rootograms (without plotting them yet):
library("countreg")
library("ggplot2")
root_pois <- rootogram(m_pois, style = "hanging", plot = FALSE)
root_nb <- rootogram(m_nb, style = "hanging", plot = FALSE)
Now plot the rootograms for each model:
autoplot(root_pois)
autoplot(root_nb)
This is what we get (after arranging the two rootograms side by side with cowplot::plot_grid()):
We can see that the negative binomial model does a bit better here than the Poisson GAM for these data: the bottoms of the bars sit closer to zero throughout the range of the observed counts.
The countreg package has details on how you can add an uncertainty band around the zero line as a form of goodness-of-fit test.
You can also compute the Pearson estimate for the dispersion parameter using the Pearson residuals of each model:
> sum(residuals(m_pois, type = "pearson")^2) / df.residual(m_pois)
[1] 28.61546
> sum(residuals(m_nb, type = "pearson")^2) / df.residual(m_nb)
[1] 0.5918471
If the model captures the dispersion in the data, these ratios should be close to 1; here we see substantial overdispersion in the Poisson GAM and mild under-dispersion in the negative binomial GAM.
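If you want this check as a reusable one-liner, it can be wrapped in a small helper (my own convenience function, not part of the original answer):
# Pearson chi-square statistic divided by the residual degrees of freedom;
# values well above 1 suggest overdispersion relative to the fitted family
dispersion_check <- function(model) {
  sum(residuals(model, type = "pearson")^2) / df.residual(model)
}
dispersion_check(m_pois)  # ~28.6: substantial overdispersion under Poisson
dispersion_check(m_nb)    # ~0.59: slight under-dispersion under the negative binomial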

Find R-square value of Weibull fit (Survival model) in R

I have a survival object (S) for which I am doing a weibull fit using the survreg function and weibull distribution in R.
S = Surv(data$ValueX, data$ValueY)
W = survreg(S ~ 1, data = data, dist = "weibull")
How do I extract the R-squared value of the Weibull fit (which is essentially a straight line)? Or is there a function to calculate the correlation coefficient rho?
Basically, I want to calculate the goodness of fit.
Look at pam.censor in the PAmeasures package, which produces an R^2-like statistic. Using the ovarian dataset from the survival package:
library(PAmeasures)
library(survival)
fit.s <- survreg(Surv(futime, fustat) ~ age, data = ovarian, dist="weibull" )
p <- predict(fit.s, type = "response")
with(ovarian, pam.censor(futime, p, fustat))
For the ovarian data with an age regressor we get a value of only 0.0915.
Another idea: for a Weibull model with no covariates we have S(t) = exp(-(lambda * t)^p), so log(-log(S(t))) = p*log(lambda) + p*log(t), which is linear in log(t). Hence we can use the R-squared of the regression of log(-log(S(t))) on log(t) to measure how well the model fits a Weibull.
library(survival)
fit1 <- survfit(Surv(futime, fustat) ~ 1, data = ovarian)
sum1 <- summary(fit1, times = ovarian$futime)
fo <- log(-log(surv)) ~ log(time)
d <- as.data.frame(sum1[c("time", "surv")])
fit.lm <- lm(fo, d)
summary(fit.lm)$r.sq
plot(fo, d)
abline(fit.lm)
For the ovarian data without covariates the R^2 of about 93% is high, but the plot does suggest systematic departures from linearity, so it may not really be Weibull.
Other
Not sure if this is of interest, but the eha package has the check.dist function, which can be used for a visual comparison of a parametric baseline hazard model with a Cox proportional hazards model. See the documentation as well as:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5233524/
Using the ovarian dataset from survival:
library(eha)
library(survival)
fit.c <- coxreg(Surv(futime, fustat) ~ age, data = ovarian)
fit.p <- phreg(Surv(futime, fustat) ~ age, data = ovarian, dist = "weibull")
check.dist(fit.c, fit.p)
The survAUC package has three functions (OXS, Nagelk and XO) that provide R-squared-type statistics for Cox proportional hazards models.
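A rough sketch of how these might be called on the ovarian data (the argument conventions below are my assumption; check ?OXS in survAUC before relying on it):
library(survAUC)
library(survival)
fit.cox <- coxph(Surv(futime, fustat) ~ age, data = ovarian)
# assumed signature: OXS(survival response, model linear predictor, null-model linear predictor)
lp  <- predict(fit.cox, type = "lp")
lp0 <- rep(0, nrow(ovarian))   # null model with no covariates
OXS(Surv(ovarian$futime, ovarian$fustat), lp, lp0)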

Multivariate regression splines in R

Most people are probably familiar with bs from splines:
library(splines)
workingModel <- lm(mpg ~ factor(gear) + bs(wt, knots = 5) + hp, data = mtcars)
bs(mtcars$wt, knots = 4)
This uses a B-spline for the single variable wt (weight), but you can also do multivariate splines:
bs(cbind(mtcars$wt,mtcars$hp), knots = 4)
But this produces a matrix with twice as many rows as mtcars, so when I try:
brokenModel <- lm(mpg ~ bs(cbind(mtcars$wt,mtcars$hp), knots = 4), data = mtcars)
I get an error about differing lengths.
My question is: how do I use the multivariate spline in a model if it has a different number of rows than my outcome variable? Do I stack my outcome variable on top of itself y <- c(y, y)? Why does the multivariate spline produce extra rows?
Thanks.
You can't use splines::bs in this case, as it is strictly for construction of a univariate spline. If you do bs(mat) where mat is a matrix, it is just doing bs(c(mat)). For example,
mat <- matrix(runif(8), 4, 2)
identical(bs(mat), bs(c(mat)))
# [1] TRUE
This explains why you get twice the number of rows when doing bs(cbind(mtcars$wt, mtcars$hp)).
To create a 2D spline, the simplest approach is an additive spline:
lm(mpg ~ factor(gear) + bs(wt, knots = 5) + bs(hp, knots = 4), mtcars)
but this may not be what you want. Then consider interaction:
model <- lm(mpg ~ factor(gear) + bs(wt, knots = 5):bs(hp, knots = 4), mtcars)
The bs(wt, knots = 5):bs(hp, knots = 4) term forms the row-wise Kronecker product of the two design matrices. Since bs(wt, knots = 5) and bs(hp, knots = 4) are each matrices with 4 columns (cubic degree 3 plus one interior knot), the interaction has 4 * 4 = 16 columns.
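You can verify the basis dimensions directly (an illustrative check of the claim above):
library(splines)
ncol(bs(mtcars$wt, knots = 5))  # 4 columns: cubic degree 3 + 1 interior knot
ncol(bs(mtcars$hp, knots = 4))  # likewise 4 columns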
Alternatively, consider using the mgcv package. In mgcv, multivariate splines can be constructed in two ways:
isotropic thin-plate splines;
scale invariant tensor product splines.
Clearly you want the second here, as wt and hp have different units. To construct tensor product splines, we can use:
library(mgcv)
fit <- gam(mpg ~ factor(gear)
+ s(wt, bs = 'cr', k = 4, fx = TRUE)
+ s(hp, bs = 'cr', k = 4, fx = TRUE)
+ ti(wt, hp, bs = 'cr', k = c(4, 4), d = c(1, 1), fx = TRUE),
data = mtcars)
Here, I deliberately set fx = TRUE to disable penalized regression, so these are pure regression splines.
I don't want to write an extensive answer introducing mgcv here. For how s, ti and gam work, just read the documentation. If you need to bridge the gap in theory, grab Simon Wood's 2006 book, Generalized Additive Models: An Introduction with R.
A practical example of mgcv usage?
I have an answer at Cubic spline method for longitudinal series data which might help you get familiar with mgcv. As an introductory example it only shows how to work with a univariate spline, but that is also the key: a tensor product spline is constructed from univariate splines.
My other mgcv-related answers are more theoretical, and not all of my spline-related answers reference mgcv, so that question & answer is the best pointer I can give you at this stage.
Would the scale-invariant tensor product splines be equivalent to radial smoothing, or would that be the isotropic thin-plate splines?
Radial smoothing is equivalent to a thin-plate spline, as the basis functions of a thin-plate spline are radial. That is why it is isotropic and can be used in spatial regression.
A tensor product spline is scale invariant, as it is constructed as the (pairwise) multiplication of univariate spline bases.
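For completeness, here is what the two flavours look like side by side in mgcv (an illustrative sketch; basis dimensions are kept small because mtcars has only 32 rows):
library(mgcv)
# isotropic thin-plate smooth of wt and hp (radial basis)
fit_tp <- gam(mpg ~ factor(gear) + s(wt, hp, k = 20), data = mtcars)
# scale-invariant tensor product smooth of wt and hp
fit_te <- gam(mpg ~ factor(gear) + te(wt, hp, k = c(5, 5)), data = mtcars)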

Statistical significance of a nls model in R

I have some multiple linear models without intercept like below:
Y = a*X1 + b*X2 + c*X3
This is a linear model, but since it does not have an intercept I wrote it as a non-linear model in R:
model1 = nls(Y ~ a*X1 + b*X2 + c*X3, data = trainDat, start = list(a = 1, b = 1, c = 1))
The problem is that summary(model1) does not give model-level statistics, such as an F-statistic, because it is not an lm fit.
How can we report the significance of these models in R?
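As a side note (my own addition, not part of the question): a linear model without an intercept can also be fit directly with lm by suppressing the intercept, and summary then reports the usual F-statistic:
model2 <- lm(Y ~ 0 + X1 + X2 + X3, data = trainDat)  # '0 +' (or '- 1') removes the intercept
summary(model2)  # coefficient t-tests plus an overall F-statistic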
