Segmented linear regression with discontinuous data in R

I have a dataset that appears to be piecewise linear, and I would like to perform a segmented linear regression in R. The issue is that there is a discontinuity at the breakpoint. Using some pieces of code from this question I managed to get something, but I am not satisfied with the result.
Dataset
Here is a dummy dataset.
NB = 100
A1 = 2    # slope of the first part
A2 = 1    # slope of the second part
B1 = 0    # intercept of the first part
B2 = 300  # intercept of the second part
set.seed(42)  # added for reproducibility
df = data.frame(n = sample(500, size = NB, replace = TRUE))
df$noise = sample(20, size = NB, replace = TRUE) - 10
my_func <- function(n, noise) {
  if (n < 100) {
    return(A1*n + B1 + noise)
  } else {
    return(A2*n + B2 + noise)
  }
}
df$fn = mapply(my_func, df$n, df$noise)
Using segmented package
This is quite straightforward: we simply perform a classical linear regression and pass it to segmented().
library(segmented)
library(ggplot2)
model_segmented = segmented(lm(fn ~ n, data = df), seg.Z = ~n)
predict_segmented = data.frame(n = df$n, fn = broken.line(model_segmented)$fit)
ggplot(df, aes(x = n, y = fn)) +
  geom_point() +
  geom_line(data = predict_segmented, color = 'blue')
Gives: (plot omitted; the fitted line is continuous at the breakpoint and follows neither segment)
Obviously, segmented assumes the fitted function is continuous at the breakpoint. That is not the case here, so the regression is not correct.
“Manual” method
This method is more tedious. First, we find the breakpoint by trying every candidate breakpoint and keeping the one that yields the lowest residual standard error. Then we add a new factor to the linear regression, indicating whether the predictor is above or below this breakpoint.
# Computation of the breakpoint: try every candidate and keep the best fit
Break <- sort(unique(df$n))
Break <- Break[2:(length(Break) - 1)]  # drop the extremes
d <- numeric(length(Break))
for (i in 1:length(Break)) {
  model_manual <- lm(fn ~ (n < Break[i])*n + (n >= Break[i])*n, data = df)
  d[i] <- summary(model_manual)$sigma  # residual standard error
}
breakpoint = Break[which.min(d)]
# Linear regression using this breakpoint
df$group = df$n >= breakpoint
model_manual <- lm(fn ~ n*group, data = df)
dat_pred = data.frame(n = df$n, fn = predict(model_manual, df))
ggplot(df, aes(x = n, y = fn)) +
  geom_point() +
  geom_line(data = dat_pred[dat_pred$n < breakpoint, ], color = 'blue') +
  geom_line(data = dat_pred[dat_pred$n >= breakpoint, ], color = 'blue')
Gives: (plot omitted; the two fitted segments follow the data closely)
Here, the fit is good.
Question
Is there a better way to achieve this goal? Can the segmented package take discontinuous data, or is there a package that can do this?
My concern is that the second method is a bit long and not very readable.

After spending a tremendous amount of time digging, I believe the chngpt package is the way to go. It can fit both continuous and discontinuous segmented regressions. Vignette: https://cran.r-project.org/web/packages/chngpt/vignettes/chngpt-vignette.pdf
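As a rough sketch of what the call might look like (untested, based on the chngptm() interface shown in the vignette; type = "stegmented" is the "step + segmented" model, which lets both the intercept and the slope change at the threshold):
library(chngpt)
# fn is the outcome, n the threshold variable (names from the dummy dataset above)
model_chngpt <- chngptm(formula.1 = fn ~ 1, formula.2 = ~n,
                        family = "gaussian", data = df,
                        type = "stegmented")
summary(model_chngpt)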

strucchange will detect the breakpoint using statistically valid methods. Then, you can fit each piece with whatever model you want. For example, with a seasonal time series you can apply separate ARIMA models to each segment.
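A hedged sketch of the detection step (breakpoints() treats the rows as ordered observations, so sort by the predictor first):
library(strucchange)
df_sorted <- df[order(df$n), ]
bp <- breakpoints(fn ~ n, data = df_sorted)  # RSS-minimising breakpoint search
df_sorted$n[bp$breakpoints]                  # map break indices back to n
# each side of the break can then be fitted separately, e.g. with lm()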

Related

How to fix a ggplot issue or loop issue in R while superimposing an F distribution on a histogram of simulations?

What am I doing wrong in the following code?
My simulation study module asks me to use simple linear regression, i.e. p = 2. I am supposed to generate B = 10,000 independent simulations from a simple linear regression with N = 30 observations and B_0 = B_1 = 0. For each simulation, one creates a dataset and extracts the F-statistic for the global F-test. One should then verify that the histogram of the F-statistics resembles an F(1, N-2) distribution. I am confused whether the problem is my loop, my ggplot code, or a mix of the two.
My current output looks like this (plot image omitted). My code:
library(tibble)
library(ggplot2)
Nsim = 10000  # B = 10,000 simulations
n = 30        # N = 30 observations
b0 = 0; b1 = 0         # B_0 = B_1 = 0
sigmax = 1; sigma = 1  # sd of x and of the errors (values assumed, not stated above)
F1 = array(NA, dim = Nsim)
for (i in 1:Nsim) {
  X = rnorm(n, 0, sd = sigmax)   # generate x
  eps = rnorm(n, 0, sd = sigma)  # generate the errors
  Y = b0 + b1*X + eps            # generate Y
  mod = lm(Y ~ X)
  F1[i] = summary(mod)$fstatistic[1]  # F statistic
}
df <- tibble(F1 = F1)
x = seq(0.01, 10, 0.01)  # a fine grid, so the density curve is smooth
y = df(x, df1 = 1, df2 = n - 2)  # stats::df is still found, as it is used as a function
df2 = tibble(x = x, y = y)
ggplot() +
  geom_histogram(data = df, aes(x = F1, y = ..density..),
                 binwidth = 0.1, color = "black", fill = "white") +
  xlab("F") +
  xlim(c(NA, 10)) +
  ggtitle("n=30") +
  geom_line(data = df2, aes(x = x, y = y), color = "red")

Constrained spline function in R

Hope all is well.
I have been exploring a few options for constraining a spline function so that it not only stays positive, but stays above the lowest value of y in the data frame. I am assuming there is a penalized spline function out there whose shape one can readily constrain, though I have not found one or tried it yet. I have also tried nls with an exponential decay function, which works, but the last estimated point is much higher than desired (I would like the curve to pass through, or be closer to, the final value of y). See the code below for the options I have tried. The ultimate goal is to fit a spline that passes through all points and never drops below the lowest value of y, while acknowledging that there are only 5 data points. Thanks in advance for the help.
library(tidyverse)
library(broom)
library(gnm)
library(cobs)
library(zoo)
DF <- data.frame(x = seq(1, 5, 1), y = c(26419753, 9511111, 3566667, 57993, 52194))
t = 1:5
# Option 1a and 1b: preferred method, fitting a spline function
mod1a <- splinefun(DF$x, DF$y)
curve(mod1a, 1, 5)
pred_interval_mod1a <- seq(1, 5, length = 40)
mod1a(pred_interval_mod1a)  # dips negative near the end; should stay above y = 52,194
mod1b <- cobs(x = DF$x, y = DF$y,
              pointwise = rbind(c(0, 52194, -1), c(0, 26419753, 1)))
pred_interval_mod1b <- seq(1, 5, length = 40)
predict(mod1b, pred_interval_mod1b)
# Option 2: NLS for exponential decay with a self-starting model
mod2 <- nls(y ~ SSasymp(t, yf, y0, log_alpha), data = DF)
qplot(t, y, data = augment(mod2)) + geom_line(aes(y = .fitted))
# Option 3: similar NLS premise, but with a lower bound on yf
mod3 <- nls(y ~ yf + (y0 - yf) * exp(-alpha * t), data = DF,
            start = list(y0 = 26419753, yf = 52194, alpha = 1),
            lower = c(-Inf, 52194, -Inf), algorithm = "port")
# Option 4: similar to 2 and 3
a = log(52194)
mod4 <- gnm(y ~ Exp(1 + t) - 1, verbose = FALSE, constrain = "Exp(.+x).Intercept",
            constrainTo = a, start = c(a, -0.05), data = DF)
mod4_df <- data.frame(t = seq(1, 5, by = 1))
mod4_pred <- predict(mod4, newdata = mod4_df)
mod4_pred
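One possible route for the stated goal (a sketch I have not verified on this data; it assumes the (kind, x, y) row layout of the cobs pointwise argument, where kind = 1 means "fitted value at x must be >= y"): placing greater-than constraints on a fine grid approximates a global lower bound at min(y).
grid_x <- seq(1, 5, length.out = 25)
con <- cbind(1, grid_x, min(DF$y))  # fit >= 52194 at each grid point
mod5 <- cobs(x = DF$x, y = DF$y, pointwise = con)
predict(mod5, seq(1, 5, length.out = 40))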

Using predictNLS to create confidence intervals around fitted values in R?

I want to build confidence intervals around a large set of fitted values using predictNLS from the propagate package in R. As an example, I will use the dataset referenced in the function description (https://rdrr.io/github/anspiess/propagate/man/predictNLS.html), DNase, building a model that predicts density from conc:
library(propagate)
library(dplyr)
library(modelr)
DNase <- DNase
modeldna <- DNase %>% group_by(Run) %>%
  do(run_model = nls(density ~ a * exp(b * conc),
                     start = list(a = 1, b = 0.5),
                     data = .)) %>% ungroup()
I then want to give each row the model that it is assigned to so that predictions can be added:
DNApredict <- full_join(as_tibble(DNase), modeldna, by = "Run")
Add in the predictions:
DNApredict <- DNApredict %>%
  group_by(Run) %>%
  do(add_predictions(., var = "predicted_density", first(.$run_model)))
And then, I want to add the confidence interval data that predictNLS seems to provide, by giving it that same data and asking it to give a confidence interval for each fitted point in the predicted_density column:
confidence_interval <- predictNLS(model = modeldna, newdata = DNApredict$predicted_density, interval = "confidence")
However, the following error arises:
Error in as.list(object$call$formula) :
argument "object" is missing, with no default
Does anyone know what might be causing this? I know the missing object will likely be obvious to some of you, so I apologize if this is a ridiculous question. I am really hoping to be able to use this function to create confidence intervals around a series of fitted values. Thank you very much in advance.
The error arises because predictNLS expects a single fitted nls object, whereas modeldna is a tibble holding a list column of models. Since you are running an nls on each Run in the sample data set, it is easy to get a list of nls models by splitting each run into its own data frame and running nls on each data frame using lapply:
library(propagate)
library(dplyr)  # for the pipe
DNase <- DNase
modeldna <- DNase %>% split(DNase$Run)
models <- lapply(modeldna, function(d) nls(density ~ a * exp(b * conc),
                                           start = list(a = 1, b = 0.5),
                                           data = d))
Now we can get predictions for each point in each model just as easily by running predictNLS on each model (again inside lapply):
results <- lapply(seq_along(modeldna), function(i) {
  predictNLS(models[[i]], newdata = data.frame(conc = modeldna[[i]]$conc))
})
Because of the output structure of predictNLS, we need to extract the predictions for each row and coerce them into a data frame:
predictions <- lapply(results, function(x) {
  as.data.frame(do.call(rbind, lapply(x$prop, function(y) y$prop)))
})
Finally, we can stick our predictions (including confidence intervals) back onto the original data frame:
all_results <- do.call(rbind, lapply(seq_along(modeldna),
  function(i) cbind(modeldna[[i]], predictions[[i]])))
This now gives us a complete data frame of original data points, and the relevant predictions with confidence intervals.
To show this, we can plot the results in ggplot. Here we show one plot for each run, including its original data, the predicted value as a dotted line, and the 95% confidence limit as a pale blue ribbon:
library(ggplot2)
ggplot(all_results, aes(x = conc, y = density)) +
  geom_ribbon(aes(ymin = `2.5%`, ymax = `97.5%`),
              fill = "deepskyblue4", alpha = 0.2) +
  geom_point() +
  geom_line(aes(y = Mean.1), linetype = 2) +
  facet_wrap(. ~ factor(Run, levels = 1:11)) +
  theme_bw()

Lagged regression in R: determining the optimal lag

I have a variable that is believed to be a good predictor for another variable, but with some lag. I don't know what the lag is and want to estimate it from the data.
Here is an example:
library(tidyverse)
data <- tibble(
  id = 1:100,
  y = dnorm(1:100, 30, 20) * 1000,
  x.shifted = y / 10 + runif(100) / 10,
  x.actual = lag(x.shifted, 30)
)
data %>%
  ggplot(aes(id, x.shifted)) +
  geom_point() +
  geom_point(aes(id, x.actual), color = 'blue') +
  geom_point(aes(id, y), color = 'red')
The model lm(y ~ x.actual, data) would not be a great fit, but the model lm(y ~ x.shifted, data) would be. Here, I know that x must be shifted by -30 days, but imagine I did not and all I knew was that it is between -30 and +30.
The immediate approach that comes to mind is to run 61 regression models, from one that shifts x by -30 to the one that shifts it by +30, and then pick the model with the best AIC or BIC. However, (a) is this the correct approach, and (b) are there R packages that already do this and find the optimal lag?
What you are describing is the cross-correlation of the two variables. You can do this very easily in R with ccf.
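For example, a quick sketch on the data above (mind ccf's lag sign convention, which depends on the argument order):
cc <- ccf(data$x.actual, data$y, lag.max = 30, na.action = na.omit)
cc$lag[which.max(cc$acf)]  # lag with the strongest cross-correlation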
However, to just get the optimum lag, we can simplify to a one-liner by using sapply to feed the candidate lags into the cor function, then using which.max to find the lag with the highest correlation:
which.max(sapply(1:50, function(i)
  cor(data$x.actual, lag(data$y, i), use = "complete.obs")))
#> [1] 30

Fit a GEE-model of type "exchangeable" with gamm

I would like to estimate a smooth effect of some covariate N in a marginal model of type "exchangeable" in R, where the clustering variable is S. From what I could find, this should be possible with:
geeglm(..., id = S, corstr = "exchangeable")
as well as:
gamm(..., correlation = corCompSymm(form = ~1|S))
Below you can find an example where the results look good, in the sense that the two estimates are quite close. However, if I use the real data from our project, the estimated smooth effects tend to be very different. I cannot publish that here, but maybe someone can still spot a problem in the code. For instance (see below), the gamm object says Number of Groups: 1, which worries me, as there clearly is more than one cluster...
(Yes, this is the realisation of a random-effects model by construction, but it should lead to the desired model given the answer here.)
########
## Packages
########
library(ggplot2)
library(mgcv)
library(dplyr)
library(geepack)
library(splines)
########
## Data Simulation
########
f <- function(N) { return((-200 + (N - 25)^2) / 100) }
N <- sort(sample(1:50, 10, replace = TRUE))
S <- as.character(1:10)
S_Effect <- rnorm(length(S), 0, 1)
S_Effect <- rep(S_Effect, N)
S <- rep(S, N)
N <- rep(N, N)
E <- runif(length(N))
data <- data.frame(O = rep(0, length(N)),
                   E = E,
                   N = N,
                   S = as.factor(S),
                   S_Effect = S_Effect)
for (i in 1:length(N)) {
  data$O[i] <- rbinom(1, 1, plogis(f(N[i]) + qlogis(E[i]) + S_Effect[i]))
}
data <- data %>% mutate(E = qlogis(E))
########
## Fitting
########
formula_gamm <- O ~ 1 + offset(E) + s(N, bs = "bs")
model_gamm <- gamm(formula_gamm, family = binomial(),
                   correlation = corCompSymm(form = ~1|S), data = data)
model_gamm
formula_geeglm <- O ~ 1 + offset(E) + bs(N)
model_geeglm <- geeglm(formula_geeglm, family = binomial(),
                       corstr = "exchangeable", id = S, data = data)
########
## Plot
########
pred_gamm <- plot.gam(model_gamm$gam, select = 1)
x <- pred_gamm[[1]]$x
pred_geeglm <- predict(model_geeglm, type = "terms",
                       newdata = data.frame(E = rep(0, length(x)), N = x))
z <- qnorm(0.9)
tmp <- data.frame(x = x,
                  y = pred_gamm[[1]]$fit,
                  group = rep("estimate gamm", length(x)))
tmp2 <- data.frame(x = x,
                   y = as.numeric(pred_geeglm),
                   group = rep("estimate geeglm", length(x)))
tmp3 <- data.frame(x = x,
                   y = f(x),
                   group = rep("actual function", length(x)))
data_pred = bind_rows(tmp, tmp2, tmp3) %>% mutate(group = as.factor(group))
p <- ggplot(data = data_pred, aes(x = x, y = y, color = group)) +
  geom_line(size = 2) +
  xlab("N") +
  ylab("f(N)")
p
An additional question: the gamm object contains enough information to plot a confidence band around the estimated function, but how can I do this for the geeglm estimate? You get something that looks reasonable if you simulate(model_geeglm, ...) and take the pointwise mean and so on, but that does not really satisfy me, as (1) the documentation of simulate does not mention marginal models and (2) it is very primitive...
The GAMM is using penalised splines, so the effective degrees of freedom used by the resulting smoother are likely to be somewhat less than the requested basis dimension, which is 10. The GEE is fitting an unpenalised model. All else being equal, the unpenalised model will be more wiggly than the penalised one.
To compare these approaches on a common footing, you need to make sure that bs() and s(x, bs = 'bs') both produce the same number of basis functions (the s() version can produce one fewer, as it removes the lack of identifiability with the intercept term, whereas you are omitting the intercept in the bs() version).
Having assured yourself that you get the same basis dimension, you can make gamm() fit an unpenalised spline by adding fx = TRUE to the s(...) term in the formula.
Having done that, both models should estimate similar smooth effects.
However, I would suggest that you use penalisation: for the GAMM, use fx = FALSE, and after estimating the model run gam.check(model$gam) (replacing model with your fitted model object) and see if the basis size check passes for the smoother.
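A minimal sketch of both suggestions (the k that makes the two bases match is an assumption to verify, e.g. via ncol(model.matrix(~ bs(N), data = data))):
# unpenalised gamm fit, for a like-for-like comparison with the GEE
model_gamm_fx <- gamm(O ~ 1 + offset(E) + s(N, bs = "bs", k = 4, fx = TRUE),
                      family = binomial(),
                      correlation = corCompSymm(form = ~1|S), data = data)
# basis size check for the penalised fit (fx = FALSE)
gam.check(model_gamm$gam)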
