emtrends and piecewise regression - r

I want to obtain four slopes from a piecewise regression: for each release type, one slope before 365 days and one after 365 days. I also know I should use the emmeans package.
Here is a dummy dataset.
df <- data.frame(
  tsr = c(0,0,9,10,19,20,20,21,30,30,100,101,200,205,350,360,400,401,500,501,600,605,700,710,800,801,900,902,1000,1001,1100,1105,2000,2250,2500,2501),
  release_type = c('S','H','S','H','S','S','H','S','H','S','S','H','S','H','S','S','H','S','H','S','S','H','S','H','S','S','H','S','H','S','S','H','S','H','S','H'),
  cond = c(250,251,250,251,300,301,351,375,250,249,216,257,264,216,250,251,250,251,300,301,351,375,250,249,216,257,264,216,250,251,250,251,300,301,351,375),
  notch = c('A','B','C','D','A','B','C','D','A','B','C','D','E','G','E','G','A','H','J','K','L','Q','W','E','R','Y','U','I','O','P','Y','U','I','O','P','Z')
)
#Load libraries
library(emmeans)
library(lme4)
#Set up break point manually
bp = 365
b1 <- function(x, bp) ifelse(x < bp, bp - x, 0)
b2 <- function(x, bp) ifelse(x < bp, 0, x - bp)
#Fit linear mixed effect model using piecewise regression
m1 <- lmer(cond ~ b1(tsr, bp) + b2(tsr, bp) + b1(tsr, bp):release_type +
             b2(tsr, bp):release_type + release_type + (1 | notch),
           data = df)
#Obtain slopes
emtrends(m1, params = "bp", var = "tsr", pairwise ~ release_type)
I am only getting estimates for one slope of each release type. What am I doing wrong?
Note: I cannot just take the slopes from summary(), because the coefficients it reports are expressed in terms of the b1()/b2() basis functions defined above, so they are not the raw slopes.

You have to add at = list(tsr = c(10, 400)) to the emtrends() call to specify representative times before and after the breakpoint. Otherwise it evaluates the trend only at the average value of tsr, since tsr is a quantitative predictor.
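For example, a sketch of that call (10 and 400 days are just arbitrary representative times on either side of the 365-day breakpoint, and conditioning on tsr in the specs makes emtrends report the two segments separately rather than averaging over them):
# One slope per release type within each segment
emtrends(m1, pairwise ~ release_type | tsr, var = "tsr",
         params = "bp", at = list(tsr = c(10, 400)))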

Related

Constrained Spline Function in r

Hope all is well.
I have been exploring a few options for constraining a spline function so that it not only stays positive but also stays above the lowest value of y in the data frame. I am assuming there is a penalized spline function out there whose shape can readily be adapted, though I have not found or tried one yet. I have also tried nls with an exponential decay function, which works, but the last estimated point is much higher than desired (I would like the curve to pass through, or at least be closer to, the final value of y). See the code below for the options I have tried. The ultimate goal is to fit a spline that passes through all points and never drops below the lowest value of y at any point, while acknowledging that there are only 5 data points. Thanks in advance for the help.
library(tidyverse)
library(broom)
library(gnm)
library(cobs)
library(zoo)
DF <- data.frame(x = seq(1,5,1),y=c(26419753,9511111,3566667,57993,52194))
t=1:5
# option 1a and 1b: preferred method which is fitting a spline function
mod1a <- splinefun(DF$x,DF$y)
curve(mod1a, 1,5)
pred_interval_mod1a <- seq(1,5,length = 40)
mod1a(pred_interval_mod1a) # dips negative near the end, though it should remain larger than y = 52,194
mod1b <- cobs(x = DF$x, y = DF$y, pointwise = rbind(c(0,52194,-1), c(0,26419753,1)))
pred_interval_mod1b <- seq(1, 5, length = 40)
predict(mod1b, pred_interval_mod1b)
# option 2: NLS for exponential decay with starting values
mod2 <- nls(y ~ SSasymp(t, yf, y0, log_alpha), data = DF)
qplot(t, y, data = augment(mod2)) + geom_line(aes(y = .fitted))
# option 3: similar NLS premise but with lower defined
mod3 <- nls(y ~ yf + (y0 - yf) * exp(-alpha * t), data = DF,
start = list(y0 = 26419753, yf = 52194, alpha = 1),
lower= c(-Inf,52194,-Inf),algorithm="port")
# option 4: similar to 2 and 3
a=log(52194)
mod4 <- gnm(y ~ Exp(1 + t) -1, verbose = FALSE, constrain="Exp(.+x).Intercept",
constrainTo=a, start=c(a,-0.05), data=DF)
mod4_df <- data.frame(t = seq(1,5,by=1))
mod4_pred <- predict(mod4,newdata=mod4_df)
mod4_pred
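For reference, a sketch of the pointwise-constraint mechanism in cobs() that option 1b is reaching for. This assumes each row of pointwise is c(kind, x, y), with kind = 1 meaning the fit must stay at or above y at that x (check ?cobs to confirm), combined here with a monotone-decreasing shape constraint:
library(cobs)
# Hypothetical sketch: keep the fitted spline at or above the smallest observed y
floor_y <- min(DF$y)                                  # 52194
grid_x  <- seq(min(DF$x), max(DF$x), length.out = 50)
con     <- cbind(1, grid_x, floor_y)                  # rows assumed to be c(kind, x, y)
mod1c   <- cobs(DF$x, DF$y, constraint = "decrease", pointwise = con)
predict(mod1c, grid_x)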

Syntax for three-piece segmented regression using NLS in R when concave

My goal is to fit a three-piece (i.e., two-breakpoint) regression model to make predictions using propagate's predictNLS function, making sure to define the knots as parameters, but my model formula seems off.
I've used the segmented package to estimate the breakpoint locations (used as starting values in NLS), but would like to keep my models in the NLS format, specifically nlsLM() from minpack.lm, because I am fitting other types of curves to my data using NLS, want to allow NLS to optimize the knot values, am sometimes using variable weights, and need to be able to easily calculate the Monte Carlo confidence intervals from propagate. Though I'm very close to having the right syntax for the formula, I'm not getting the expected/required behaviour near the breakpoint(s). The segments SHOULD meet directly at the breakpoints (without any jumps), but at least on this data I'm getting a weird local minimum at the breakpoint (see plots below).
Below is an example of my data and general process. I believe my issue to be in the NLS formula.
library(minpack.lm)
library(segmented)
y <- c(-3.99448113, -3.82447011, -3.65447803, -3.48447030, -3.31447855, -3.14448753, -2.97447972, -2.80448401, -2.63448380, -2.46448069, -2.29448796, -2.12448912, -1.95448783, -1.78448797, -1.61448563, -1.44448719, -1.27448469, -1.10448651, -0.93448525, -0.76448637, -0.59448626, -0.42448586, -0.25448588, -0.08448548, 0.08551417, 0.25551393, 0.42551411, 0.59551395, 0.76551389, 0.93551398)
x <- c(61586.1711, 60330.5550, 54219.9925, 50927.5381, 48402.8700, 45661.9175, 37375.6023, 33249.1248, 30808.6131, 28378.6508, 22533.3782, 13901.0882, 11716.5669, 11004.7305, 10340.3429, 9587.7994, 8736.3200, 8372.1482, 8074.3709, 7788.1847, 7499.6721, 7204.3168, 6870.8192, 6413.0828, 5523.8097, 3961.6114, 3460.0913, 2907.8614, 2016.1158, 452.8841)
df<- data.frame(x,y)
#Use Segmented to get estimates for parameters with 2 breakpoints
my.seg2 <- segmented(lm(y ~ x, data = df), seg.Z = ~ x, npsi = 2)
#extract knot, intercept, and coefficient values to use as NLS start points
my.knot1 <- my.seg2$psi[1,2]
my.knot2 <- my.seg2$psi[2,2]
my.m_2 <- slope(my.seg2)$x[1,1]
my.b1 <- my.seg2$coefficients[[1]]
my.b2 <- my.seg2$coefficients[[2]]
my.b3 <- my.seg2$coefficients[[3]]
#Fit a NLS model to ~replicate segmented model. Presumably my model formula is where the problem lies
my.model <- nlsLM(y ~ m*x + b + (b2*(ifelse(x >= knot1 & x <= knot2, 1, 0)*(x - knot1)) +
                                   (b3*ifelse(x > knot2, 1, 0)*(x - knot2 - knot1))),
                  data = df,
                  start = c(m = my.m_2, b = my.b1, b2 = my.b2, b3 = my.b3,
                            knot1 = my.knot1, knot2 = my.knot2))
How it should look
plot(my.seg2)
How it does look
plot(x, y)
lines(x=x, y=predict(my.model), col='black', lty = 1, lwd = 1)
I was pretty sure I had it "right", but when the 95% confidence intervals are plotted along with the line and the prediction resolution (i.e., the density of x points) is increased, things look dramatically incorrect.
Thank you all for your help.
Define g to be a grouping vector having the same length as x which takes on values 1, 2, 3 for the 3 sections of the X axis and create an nls model from these. The resulting plot looks ok.
my.knots <- c(my.knot1, my.knot2)
g <- cut(x, c(-Inf, my.knots, Inf), label = FALSE)
fm <- nls(y ~ a[g] + b[g] * x, df, start = list(a = c(1, 1, 1), b = c(1, 1, 1)))
plot(y ~ x, df)
lines(fitted(fm) ~ x, df, col = "red")
Constraints
Although the above looks ok and may be sufficient it does not guarantee that the segments intersect at the knots. To do that we must impose the constraints that both sides are equal at the knots:
a[2] + b[2] * my.knots[1] = a[1] + b[1] * my.knots[1]
a[3] + b[3] * my.knots[2] = a[2] + b[2] * my.knots[2]
so
a[2] = a[1] + (b[1] - b[2]) * my.knots[1]
a[3] = a[2] + (b[2] - b[3]) * my.knots[2]
= a[1] + (b[1] - b[2]) * my.knots[1] + (b[2] - b[3]) * my.knots[2]
giving:
# returns a vector of the three a values
avals <- function(a1, b) unname(cumsum(c(a1, -diff(b) * my.knots)))
fm2 <- nls(y ~ avals(a1, b)[g] + b[g] * x, df, start = list(a1 = 1, b = c(1, 1, 1)))
To get the three a values we can use:
co <- coef(fm2)
avals(co[1], co[-1])
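As a sanity check (a sketch reusing co from above), the two segments that meet at each knot should give identical fitted values there, because that equality was built into avals():
aa <- avals(co[1], co[-1])
bb <- co[-1]
c(aa[1] + bb[1] * my.knots[1], aa[2] + bb[2] * my.knots[1])  # equal at knot 1
c(aa[2] + bb[2] * my.knots[2], aa[3] + bb[3] * my.knots[2])  # equal at knot 2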
To get the residual sum of squares:
deviance(fm2)
## [1] 0.193077
Polynomial
Although it involves a large number of parameters, a polynomial fit could be used in place of the segmented linear regression. A 12th degree polynomial involves 13 parameters but has a lower residual sum of squares than the segmented linear regression. A lower degree could be used with corresponding increase in residual sum of squares. A 7th degree polynomial involves 8 parameters and visually looks not too bad although it has a higher residual sum of squares.
fm12 <- nls(y ~ cbind(1, poly(x, 12)) %*% b, df, start = list(b = rep(1, 13)))
deviance(fm12)
## [1] 0.1899218
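For reference, a sketch of the 7th-degree fit mentioned above (same df and same pattern as fm12):
fm7 <- nls(y ~ cbind(1, poly(x, 7)) %*% b, df, start = list(b = rep(1, 8)))
deviance(fm7)   # higher than fm12, as noted above
plot(y ~ x, df)
lines(fitted(fm7) ~ x, df, col = "blue")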
The odd behaviour near the breakpoint may, in part, reflect a limitation in segmented: segmented returns a single change-point estimate without quantifying the associated uncertainty. Redoing the analysis using mcp, which returns Bayesian posteriors, we see that the second change point is bimodally distributed:
library(mcp)
model = list(
  y ~ 1 + x,  # Intercept + slope in first segment
  ~ 0 + x,    # Only slope changes in the next segments
  ~ 0 + x
)
# Fit it with a large number of samples and plot the change point posteriors
fit = mcp(model, data = data.frame(x, y), iter = 50000, adapt = 10000)
plot_pars(fit, regex_pars = "^cp*", type = "dens_overlay")
FYI, mcp can plot credible intervals as well (the red dashed lines):
plot(fit, q_fit = TRUE)

Logistic Regression's ROC Goes Abnormal

Currently, I'm learning about logistic regression and LDA (Linear Discriminant Analysis) classification. I'm generating data in different ways to study how logistic regression and LDA behave.
Here is the data visualization of 2-dimensional predictors with class plotted as color:
Here is my code:
library(ggplot2)
library(MASS)
set.seed(1)
a <- mvrnorm(n = 1000, mu = c(0,0), Sigma = matrix(c(0.4,0,0,0.4), nrow = 2, ncol = 2))
b <- mvrnorm(n = 1000, mu = c(0,0), Sigma = matrix(c(10,0,0,10), nrow = 2, ncol =2 ))
#I want to make sure b1 separated from a
b1 <- b[sqrt(b[,1]^2 + b[,2]^2) > 4,]
df <- as.data.frame(rbind(a,b1))
names(df) <- c('x','y')
labelA <- rep('A', nrow(a))
labelB <- rep('B', nrow(b1))
#Put the label column to the data frame
df$labs <- c(labelA,labelB)
ggplot(df, aes(x = x, y = y, col = labs)) + geom_point()
prd <- glm(as.factor(labs) ~ x + y, family = binomial('probit'), data = df)
prd_score <- predict(prd, type = 'response')
library(pROC)  # roc() and auc() come from pROC
plot(roc(df$labs, prd_score))
auc(roc(df$labs, prd_score))
And this is the ROC curve plot:
It's really frustrating because I can't find any mistake in my code that would cause this. Can anyone point out a mistake in my code that produces this weird ROC curve, or explain why the ROC could end up looking like that?
NB: Please assume that the generated data set above is the training data and I want to predict the training data again.
There is no mistake in your code.
Your dataset is a typical example of data that cannot be separated with a linear combination of features. Therefore linear classification methods such as logistic regression or LDA won't help you here. This is why your ROC curve looks "weird": it is actually perfectly normal, and it is simply telling you that your model fails to separate the data.
You need to investigate non-linear classification techniques. Given the radial distribution of the data, I can imagine that support vector machines (SVM) with a radial basis kernel could do the trick.
require(e1071)
# We need a numeric label for SVM regression
labelA <- rep(0, nrow(a))
labelB <- rep(1, nrow(b1))
df$labsNum <- c(labelA,labelB)
# We create a radial basis model
svm_prd <- svm(labsNum ~ x + y, data = df, kernel = "radial", type = "eps-regression")
svm_score <- predict(svm_prd)
plot(roc(df$labs, svm_score))
auc(roc(df$labs, svm_score))
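As a quick illustration of the same point (a sketch; r2 is just a hand-made radial feature, not part of the data above): once the model is given the squared distance from the origin as a predictor, even a plain GLM separates the classes almost perfectly:
# Sketch: engineer a radial feature; the true boundary is roughly a circle
df$r2 <- df$x^2 + df$y^2
prd_r <- glm(as.factor(labs) ~ r2, family = binomial, data = df)
# glm may warn about (near-)perfect separation here; that is expected with this data
plot(roc(df$labs, predict(prd_r, type = "response")))
auc(roc(df$labs, predict(prd_r, type = "response")))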

Plotting interaction effects in Bayesian models (using rstanarm)

I'm trying to show how the effect of one variable changes with the values of another variable in a Bayesian linear model fit with rstanarm. I am able to fit the model and take draws from the posterior to look at the estimates for each parameter, but it's not clear how to produce a plot of the effect of one variable in the interaction as the other changes, along with the associated uncertainty (i.e. a marginal effects plot). Below is my attempt:
library(rstanarm)
# Set Seed
set.seed(1)
# Generate fake data
w1 <- rbeta(n = 50, shape1 = 2, shape2 = 1.5)
w2 <- rbeta(n = 50, shape1 = 3, shape2 = 2.5)
dat <- data.frame(y = log(w1 / (1-w1)),
                  x = log(w2 / (1-w2)),
                  z = seq(1:50))
# Fit linear regression without an intercept:
m1 <- rstanarm::stan_glm(y ~ 0 + x*z,
                         data = dat,
                         family = gaussian(),
                         algorithm = "sampling",
                         chains = 4,
                         seed = 123)
# Create data sets with low values and high values of one of the predictors
dat_lowx <- dat
dat_lowx$x <- 0
dat_highx <- dat
dat_highx$x <- 5
out_low <- rstanarm::posterior_predict(object = m1, newdata = dat_lowx)
out_high <- rstanarm::posterior_predict(object = m1, newdata = dat_highx)
# Calculate differences in posterior predictions
mfx <- out_high - out_low
# Somehow get the coefficients for the other predictor?
In this (linear, Gaussian, identity link, no intercept) case,
mu = beta_x * x + beta_z * z + beta_xz * x * z
   = (beta_x + beta_xz * z) * x + beta_z * z
   = (beta_z + beta_xz * x) * z + beta_x * x
i.e. dmu/dx = beta_x + beta_xz * z and dmu/dz = beta_z + beta_xz * x.
So, to plot the marginal effect of x or z, you just need an appropriate range of each and the posterior distribution of the coefficients, which you can obtain via
post <- as.data.frame(m1)
Then
dmu_dx <- post[ , 1] + post[ , 3] %*% t(sort(dat$z))
dmu_dz <- post[ , 2] + post[ , 3] %*% t(sort(dat$x))
You can then estimate a single marginal effect for each observation in your data using something like the code below, which shows the posterior of the effect of x on mu at each observed value of z, and the effect of z on mu at each observed value of x.
colnames(dmu_dx) <- sort(dat$z)
colnames(dmu_dz) <- round(sort(dat$x), digits = 1)
bayesplot::mcmc_intervals(dmu_dz)
bayesplot::mcmc_intervals(dmu_dx)
Note that the column names are simply the observed values of the moderating variable in this case.
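If you would rather show this as a line with an uncertainty ribbon, a minimal sketch (assuming ggplot2, and summarising dmu_dx by its posterior median and a 90% interval) is:
library(ggplot2)
# Posterior median and 90% interval of dmu/dx as a function of z
eff_x <- data.frame(z   = sort(dat$z),
                    est = apply(dmu_dx, 2, median),
                    lwr = apply(dmu_dx, 2, quantile, probs = 0.05),
                    upr = apply(dmu_dx, 2, quantile, probs = 0.95))
ggplot(eff_x, aes(z, est)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.3) +
  geom_line() +
  labs(y = "Marginal effect of x on mu")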
You could also use either the ggeffects-package, especially for marginal effects; or the sjPlot-package for marginal effects and other plot types (for marginal effects, sjPlot simply wraps the functions from ggeffects).
To plot marginal effects of interactions, use sjPlot::plot_model() with type = "int". Use mdrt.values to define which values to plot for continuous moderator variables, and use ppd to let prediction be based on either the posterior distribution of the linear predictor or draws from posterior predictive distribution.
library(sjPlot)
plot_model(m1, type = "int", terms = c("x", "z"), mdrt.values = "meansd")
plot_model(m1, type = "int", terms = c("x", "z"), mdrt.values = "meansd", ppd = TRUE)
or to plot marginal effects at other specific values, use type = "pred" and specify the values in the terms-argument:
plot_model(m1, type = "pred", terms = c("x", "z [10, 20, 30, 40]"))
# same as:
library(ggeffects)
dat <- ggpredict(m1, terms = c("x", "z [10, 20, 30, 40]"))
plot(dat)
There are more options, and also different ways of customizing the plot appearance. See related help files and package vignettes.

Plotting estimated HR from coxph object with time-dependent coefficient and splines

I want to plot the estimated hazard ratio as a function of time in the case of a coxph model with a time-dependent coefficient that is based on a spline term. I created the time-dependent coefficient using function tt, analogous to this example that comes straight from ?coxph:
# Fit a time transform model using current age
cox = coxph(Surv(time, status) ~ ph.ecog + tt(age), data=lung,
tt=function(x,t,...) pspline(x + t/365.25))
Calling survfit(cox) results in an error that survfit does not understand models with a tt term (as described in 2011 by Terry Therneau).
You can extract the linear predictor using cox$linear.predictors, but I would need to somehow extract the ages and, less trivially, the times that go with each value. Because tt splits the dataset on event times, I can't just match up the rows of the input data frame with the coxph output. Additionally, I really would like to plot the estimated function itself, not just the predictions for the observed data points.
There is a related question involving splines here, but it does not involve tt.
Edit (7/7)
I'm still stuck on this. I've been looking in depth at this object:
spline.obj = pspline(lung$age)
str(spline.obj)
# something that looks very useful, but I am not sure what it is
# cbase appears to be the cardinal knots
attr(spline.obj, "printfun")
function (coef, var, var2, df, history, cbase = c(43.3, 47.6,
51.9, 56.2, 60.5, 64.8, 69.1, 73.4, 77.7, 82, 86.3, 90.6))
{
test1 <- coxph.wtest(var, coef)$test
xmat <- cbind(1, cbase)
xsig <- coxph.wtest(var, xmat)$solve
cmat <- coxph.wtest(t(xmat) %*% xsig, t(xsig))$solve[2, ]
linear <- sum(cmat * coef)
lvar1 <- c(cmat %*% var %*% cmat)
lvar2 <- c(cmat %*% var2 %*% cmat)
test2 <- linear^2/lvar1
cmat <- rbind(c(linear, sqrt(lvar1), sqrt(lvar2), test2,
1, 1 - pchisq(test2, 1)), c(NA, NA, NA, test1 - test2,
df - 1, 1 - pchisq(test1 - test2, max(0.5, df - 1))))
dimnames(cmat) <- list(c("linear", "nonlin"), NULL)
nn <- nrow(history$thetas)
if (length(nn))
theta <- history$thetas[nn, 1]
else theta <- history$theta
list(coef = cmat, history = paste("Theta=", format(theta)))
}
So, I have the knots, but I am still not sure how to combine the coxph coefficients with the knots in order to actually plot the function. Any leads much appreciated.
I think what you need can be obtained by generating an input matrix using pspline and matrix-multiplying it by the relevant coefficients from the coxph output. To get the HR, you then take the exponential.
i.e.
output <- data.frame(Age = seq(min(lung$age) + min(lung$time) / 365.25,
                               max(lung$age + lung$time / 365.25),
                               0.01))
output$HR <- exp(pspline(output$Age) %*% cox$coefficients[-1] -
                   sum(cox$means[-1] * cox$coefficients[-1]))
library("ggplot2")
ggplot(output, aes(x = Age, y = HR)) + geom_line()
Note the age here is the age at the time of interest (i.e. the sum of the baseline age and the elapsed time since study entry). It has to use the specified range so that the spline basis matches the parameters in the original model. It could also be calculated using the x matrix returned when fitting with x = TRUE, as shown below:
cox <- coxph(Surv(time, status) ~ ph.ecog + tt(age), data = lung,
             tt = function(x, t, ...) pspline(x + t/365.25), x = TRUE)
index <- as.numeric(unlist(lapply(strsplit(rownames(cox$x), "\\."), "[", 1)))
ages <- lung$age[index]
output2 <- data.frame(Age = ages + cox$y[, 1] / 365.25,
                      HR = exp(cox$x[, -1] %*% cox$coefficients[-1] -
                                 sum(cox$means[-1] * cox$coefficients[-1])))
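To compare the two versions, the point-wise HRs in output2 can be overlaid on the smooth curve (a sketch reusing the ggplot2 call above):
ggplot(output, aes(x = Age, y = HR)) +
  geom_line() +
  geom_point(data = output2, colour = "red", alpha = 0.4)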
