Lagged regression in R: determining the optimal lag - r

I have a variable that is believed to be a good predictor for another variable, but with some lag. I don't know what the lag is and want to estimate it from the data.
Here is an example:
library(tidyverse)
data <- tibble(
id = 1:100,
y = dnorm(1:100, 30, 20) * 1000,
x.shifted = y / 10 + runif(100) / 10,
x.actual = lag(x.shifted, 30)
)
data %>%
ggplot(aes(id, x.shifted)) +
geom_point() +
geom_point(aes(id, x.actual), color = 'blue') +
geom_point(aes(id, y), color = 'red')
The model lm(y ~ x.actual, data) would not be a great fit, but the model lm(y ~ x.shifted, data) would be. Here, I know that x must be shifted by -30 days, but imagine I did not and all I knew was that it is between -30 and +30.
The immediate approach that comes to mind is to run 61 regression models, from one that shifts x by -30 to the one that shifts it by +30, and then pick the model with the best AIC or BIC. However, (a) is this the correct approach, and (b) are there R packages that already do this and find the optimal lag?

What you are describing is the cross-correlation of the two variables. You can do this very easily in R with ccf.
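As a minimal sketch of what ccf reports, the block below uses simulated data (not the data from the question) in which y is simply x delayed by 3 steps plus noise. The names z, x, y and best_lag are illustrative assumptions. According to the ccf documentation, the lag k value estimates the correlation between x[t+k] and y[t], so a predictor that leads the response shows up as a peak at a negative lag.

```r
set.seed(42)
z <- rnorm(503)
x <- z[4:503]                          # predictor
y <- z[1:500] + rnorm(500, sd = 0.1)   # y[t] = x[t - 3] + noise

# Cross-correlation at lags -20..20, without drawing the plot
cc <- ccf(x, y, lag.max = 20, plot = FALSE)

# Lag with the highest cross-correlation
best_lag <- cc$lag[which.max(cc$acf)]
best_lag
```

With series that contain NA values (such as x.actual in the question), ccf will stop unless a different na.action (for example na.pass) is supplied.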
However, to just get the optimum lags, we can simplify to a one-liner by using sapply to feed the number of required lags into the cor function, then use which.max to find the highest correlation:
which.max(sapply(1:50, function(i) cor(data$x.actual, lag(data$y, i), use = "complete")))
#> [1] 30
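The brute-force search described in the question can also be sketched directly. The block below is an illustration on simulated data, with a hypothetical helper shift() and a true lag of 5 chosen for the example. Every candidate model is fitted on the same rows, because AIC values are not comparable between fits that use different numbers of observations.

```r
set.seed(1)
n <- 300
z <- rnorm(n + 5)
x <- z[6:(n + 5)]                      # predictor
y <- 2 * z[1:n] + rnorm(n, sd = 0.2)   # y[t] = 2 * x[t - 5] + noise

# Hypothetical helper: position t of the result holds v[t - k], padded with NA
shift <- function(v, k) {
  n <- length(v)
  if (k > 0) {
    c(rep(NA, k), v[seq_len(n - k)])
  } else if (k < 0) {
    c(v[(1 - k):n], rep(NA, -k))
  } else {
    v
  }
}

lags <- -10:10
keep <- 11:(n - 10)   # rows where every shifted copy of x is observed

# One regression per candidate lag, all on the same rows
aic <- sapply(lags, function(k) {
  d <- data.frame(y = y, xs = shift(x, k))[keep, ]
  AIC(lm(y ~ xs, data = d))
})

best_lag <- lags[which.min(aic)]
best_lag
```

For a single predictor this should pick the same lag as the correlation-based one-liner; the regression version mainly becomes useful once other covariates are added to the model.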

Related

Constrained Spline Function in r

Hope all is well.
I have been exploring a few options for constraining a spline function so that it not only stays positive but also stays above the lowest value of y in the data frame. I assume there is a penalized spline function whose shape can be readily adapted, though I have not found or tried one yet. I have also tried nls with an exponential decay function, which works; however, the last estimated point is much higher than desired (I would like the curve to pass through, or be closer to, the final value of y). See the code below for the options I have tried. The ultimate goal, however, is to fit a spline that passes through all points and never drops below the lowest value of y at any point, while acknowledging that there are only 5 data points. Thanks in advance for the help.
library(tidyverse)
library(broom)
library(gnm)
library(cobs)
library(zoo)
DF <- data.frame(x = seq(1,5,1),y=c(26419753,9511111,3566667,57993,52194))
t=1:5
# option 1a and 1b: preferred method which is fitting a spline function
mod1a <- splinefun(DF$x,DF$y)
curve(mod1a, 1,5)
pred_interval_mod1a <- seq(1,5,length = 40)
mod1a(pred_interval_mod1a) # has that dip to negative near the end which should remain larger than y= 52,194
mod1b <- cobs(x= DF$x,y = DF$y,pointwise=rbind(c(0,52194,-1),c(0,26419753,1)))
pred_interval_mod1b <- seq(1,5,length = 40)
predict(mod1b, pred_interval_mod1b)
# option 2: NLS for exponential decay with starting values
mod2 <- nls(y ~ SSasymp(t, yf, y0, log_alpha), data = DF)
qplot(t, y, data = augment(mod2)) + geom_line(aes(y = .fitted))
# option 3: similar NLS premise but with lower defined
mod3 <- nls(y ~ yf + (y0 - yf) * exp(-alpha * t), data = DF,
start = list(y0 = 26419753, yf = 52194, alpha = 1),
lower= c(-Inf,52194,-Inf),algorithm="port")
# option 4: similar to 2 and 3
a=log(52194)
mod4 <- gnm(y ~ Exp(1 + t) -1, verbose = FALSE, constrain="Exp(.+x).Intercept",
constrainTo=a, start=c(a,-0.05), data=DF)
mod4_df <- data.frame(t = seq(1,5,by=1))
mod4_pred <- predict(mod4,newdata=mod4_df)
mod4_pred
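One option that may fit the stated goal, sketched here as an assumption rather than a tested recommendation for this data: because y decreases monotonically in x, a monotone interpolating spline passes through every point and cannot dip below the last (lowest) value. Base R's splinefun offers this through method = "monoH.FC". The names mono, grid and fitted_vals are illustrative.

```r
DF <- data.frame(x = 1:5, y = c(26419753, 9511111, 3566667, 57993, 52194))

# Monotone Hermite interpolation (Fritsch-Carlson): interpolates the data
# and preserves monotonicity between the points
mono <- splinefun(DF$x, DF$y, method = "monoH.FC")

grid <- seq(1, 5, length.out = 200)
fitted_vals <- mono(grid)

range(fitted_vals)   # compare with range(DF$y)
curve(mono, 1, 5)
points(DF$x, DF$y)
```

splinefun also accepts method = "hyman", which applies a monotonicity filter to a cubic spline and could be tried the same way.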

Other than burn-in increase and priors, how can I help my multiple change point (mcp package in R) models converge?

I would like to identify changepoints in my data and an associated error term for their estimate. The mcp package seems to do a good job (visually) of identifying changepoints in my data, but the model parameters generally have rhat values > 1.1. From my understanding, I cannot trust any Bayesian parameter estimates unless all rhat values in the model are <= 1.1. Aside from increasing my burn-in period with the adapt argument or using priors (see note below), how else can I improve these models?
Alternatively, can I force mcp to fit a 'best' two- and three-segment model and return those parameter estimates with error? Ideally I would be able to provide changepoint estimates with an error term associated with each estimate, but packages like segmented and strucchange generally fail to identify changepoints in my data.
The code looks like this:
library(mcp)
set.seed(42)
x <- c(227,227,228,228,228,228,228,230,230,230,230,230,230,230,230,230,230,230,230,230,230,230,230,230,232,233,233,233,233,233,233,233,233,236,236,236,236,237,237,237,238,238,238,238,238,238,238,238,239,239,239,239,239,239,243,244,244,244,244,244,244,244,244,244,245,245,245,246,246,246,246,247,250,250,250,250,251,251,251,251,251,251,251,251,253,253,253,257,257,260,260,260,260,260,260,260,264,264,264,265,265,265,265,265,265,265,265,265,265,265,265,265,265,265,265,265,265,267,267,267,267,267,267,267,267,267,267,267,267,267,271,271,271,271,271,271,273,273,273,273,273,273,273,273,273,273,273,273,273,273,273,273,273,273,273,274,274,274,274,274,274,274)
y <- c(8.43,6.9,8.93,7.33,7.28,7.24,6.62,8.36,8.17,8.07,8.07,7.63,7.6,7.54,7.37,7.31,7.21,7.03,6.93,6.88,6.82,6.78,6.7,6.5,8.35,10.97,7.48,7.46,7.28,7.17,6.72,6.68,6.08,7.42,7.14,6.92,6.68,7.49,7.28,6.67,9.4,7.54,7.04,6.89,6.88,6.52,6.45,6.39,8.48,8.04,7.52,7.35,6.9,6.57,6.86,7.46,7.39,7.16,7.08,6.83,6.83,6.7,6.54,6.47,9.75,7.38,5.96,10.49,8.32,7.22,7.05,8.55,10.34,8.23,7.9,7.31,8.18,7.8,7.31,7.18,7.17,7.13,7.02,6.84,10.62,10.09,9.26,10.8,10.37,10.9,10.52,10.23,9.28,9.18,8.85,8.81,11.03,8.84,6.29,11.36,10.91,10.87,10.4,10.17,9.61,9.5,9.36,9.17,9.13,8.88,8.73,8.55,8.37,8.33,8.25,7.82,6.9,9.77,9.53,9.39,9.1,8.93,8.68,8.64,8.47,8.41,8.38,8.28,8.18,7.74,10.67,10.64,10.54,10.36,10.35,7.03,9.51,9.37,9.24,9.22,9.18,8.96,8.95,8.94,8.89,8.82,8.79,8.72,8.35,8.22,8.13,8.07,7.91,7.85,7.79,8.82,8.59,8.44,8.42,8.37,8.06,7.34)
df <- data.frame(x, y)
#Writing the formula for a three-segment line
three_segment_model = list(
y ~ x, # intercept + slope
y ~ 1 ~ 0 + x, #segment 2, specifying a changepoint and joined slope
y ~ 1 ~ 0 + x #segment 3, specifying a changepoint and joined slope
)
#Writing the formula for a two-segment line
two_segment_model = list(
y ~ x, # intercept + slope
y ~ 1 ~ 0 + x #segment 2, specifying a changepoint and joined slope
)
#Disjointing the slopes of two segments
test_segment_model = list(
y ~ x, # intercept + slope
y ~ 1 ~ 1 + x #segment 2, specifying a changepoint and disjoined slope
)
#Disjointing the slopes of the three expected segments
test_three_segment_model = list(
y ~ x, # intercept + slope
y ~ 1 ~ 1 + x, #segment 2, specifying a changepoint and disjoined slope
y ~ 1 ~ 1 + x #segment 3, specifying a changepoint and disjoined slope
)
These are the models I've tried, but none achieve consistently tolerable rhat values. The code I use to test them is as follows:
#An example run, although each of these models fails to provide suitable rhat values
fit1 <- mcp(three_segment_model, df, chains = 4, iter = 10000, cores = 3)
plot(fit1)
fit1_summary <- data.frame(summary(fit1))
Following the literature, I have also tried priors (though admittedly I cannot make heads or tails of the truncation and distribution syntax in mcp), but they do not seem to help with the patterns that this modelling approach identifies poorly.

If your issue is primarily a practical one, increasing the number of iterations and chains reveals that the posterior is quite reproducible across chains for the "problematic" parameters:
fit1 <- mcp(three_segment_model, df, chains = 6, iter = 50000, cores = 6)
plot_pars(fit1, c("cp_1", "cp_2", "x_1"))
You also get good rhat values for the non-three-segment models, which all points to the three-segment model being unidentifiable with this data. Without knowing the underlying process, I did try some fairly informative priors:
prior = list(
x_1 = 0, # fixed horizontal!
x_2 = "dnorm(0, 1) T(0, )" # Positive slope
)
but it did not improve rhat values for the change point parameters.
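For reference, the rhat diagnostic discussed here compares between-chain and within-chain variance. The block below is a minimal base R sketch of the basic (non-split) Gelman-Rubin statistic on simulated draws; rhat_basic, mixed and stuck are hypothetical names, and MCMC packages may apply additional corrections, so their reported values can differ slightly.

```r
# Hypothetical helper: basic Gelman-Rubin statistic for one parameter.
# `draws` is an iterations x chains matrix of posterior samples.
rhat_basic <- function(draws) {
  n <- nrow(draws)
  W <- mean(apply(draws, 2, var))    # mean within-chain variance
  B <- n * var(colMeans(draws))      # between-chain variance
  sqrt(((n - 1) / n * W + B / n) / W)
}

set.seed(1)
mixed <- matrix(rnorm(4000), ncol = 4)              # four chains, same target
stuck <- mixed + rep(c(0, 0, 2, 2), each = 1000)    # two chains offset by 2

rhat_mixed <- rhat_basic(mixed)
rhat_stuck <- rhat_basic(stuck)
c(mixed = rhat_mixed, stuck = rhat_stuck)
```

Chains that settle on different values, as a change point parameter can when two locations fit almost equally well, inflate the between-chain term and push rhat above 1.1 no matter how long each chain runs.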

Identifying lead/lags using multivariate regression analysis

I have three time-series variables (x, y, z) measured in 3 replicates. x and z are the independent variables, y is the dependent variable, and t is the time variable. All three variables follow diel variation: they increase during the day and decrease during the night. An example with a simulated dataset is below.
library(nlme)
library(tidyverse)
n <- 100
t <- seq(0,4*pi,,100)
a <- 3
b <- 2
c.unif <- runif(n)
amp <- 2
datalist = list()
for(i in 1:3){
y <- 3*sin(b*t)+rnorm(n)*2
x <- 2*sin(b*t+2.5)+rnorm(n)*2
z <- 4*sin(b*t-2.5)+rnorm(n)*2
data = as_tibble(cbind(y,x,z))%>%mutate(t = 1:100)%>% mutate(replicate = i)
datalist[[i]] <- data
}
df <- do.call(rbind,datalist)
ggplot(df)+
geom_line(aes(t,x),color='red')+geom_line(aes(t,y),color='blue')+
geom_line(aes(t,z),color = 'green')+facet_wrap(~replicate, nrow = 1)+theme_bw()
I can identify the lead/lag of y with respect to x and z individually. This can be done with the ccf() function in R. For example:
ccf(x,y)
ccf(z,y)
But I would like to do it in a multivariate regression approach. For example, the lme function from the nlme package indicates that x and z are negatively affecting y:
lme = lme(data = df, y~ x+ z , random=~1|replicate, correlation = corCAR1( form = ~ t| replicate))
It is impossible (in actual data) that x and z can negatively affect y.
I need the time-lead/lag and also I would like to get the standardized coefficient (t-value to compare the effect size), both from the same model.
Is there any multivariate model available that can give me the lead/lag and also give me regression coefficient?

We might be considering the "statistical significance of Cramér-Rao estimation of a lower bound". In order to find Xbeta-Xinfinity, taking the expectation of Xbeta and an assumed mean neu will yield a variable, neu^squared, which can replace Xinfinity. Using the F test-likelihood ratio, the degrees of freedom is p2-p1 = n-p2.
Put it this way, the estimates are n=(-2neu^squared/neu^squared+n), phi t = y/Xbeta and Xbeta= (y-betazero)/a.
The point estimate is derived from y=aXbeta + b: , Xbeta. The time lead lag is phi t and the standardized coefficient is n. The regression generates the lower bound Xbeta, where t=beta.
Spectral analysis of the linear distribution indicates a point estimate beta zero = 0.27 which is a significant peak of variability. Scaling Xbeta by Betazero would be an appropriate idea.
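One simple way to obtain both a lag and a standardized coefficient, sketched here under strong assumptions: pick the lag of each predictor from its cross-correlation with y, shift the predictors by those lags, and regress on the shifted, scaled columns. This is a two-step workflow on simulated data, not a single joint model, and it ignores the replicates and the residual autocorrelation that the lme call in the question accounts for. The helpers peak_lag and shift_back are hypothetical.

```r
set.seed(7)
n <- 400
x <- rnorm(n + 4)
z <- rnorm(n + 4)
idx <- 5:(n + 4)
y <- 1.5 * x[idx - 2] + 0.8 * z[idx - 4] + rnorm(n, sd = 0.5)
x <- x[idx]   # after re-indexing, y[t] depends on x[t - 2]
z <- z[idx]   # and on z[t - 4]

# Hypothetical helper: lag at which |ccf| peaks; positive = predictor leads y
peak_lag <- function(pred, resp, lag.max = 10) {
  cc <- ccf(pred, resp, lag.max = lag.max, plot = FALSE)
  -cc$lag[which.max(abs(cc$acf))]
}
lag_x <- peak_lag(x, y)
lag_z <- peak_lag(z, y)

# Hypothetical helper: delay v by k >= 0 steps, padding with NA
shift_back <- function(v, k) c(rep(NA, k), v[seq_len(length(v) - k)])

d <- data.frame(y = y,
                x_lag = shift_back(x, lag_x),
                z_lag = shift_back(z, lag_z))
d <- as.data.frame(scale(na.omit(d)))   # standardize all columns

fit <- lm(y ~ x_lag + z_lag, data = d)
coef(summary(fit))   # standardized coefficients and t-values
```

With real data, the same shifted columns could be passed to lme together with the random effect and correlation structure from the question; whether the coefficient signs then become plausible depends on the data.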

Non linear model with five parameters (w/ nls R)

This is my first question, please let me know if I'm doing anything wrong. We have a df with two variables, and want to model EPR (egg production rate) as a function of temperature.
The relevant packages as per the nls page:
install.packages("tidyverse")
install.packages("nls.multstart")
install.packages("nlstools")
library(tidyverse)
library(nls.multstart)
library(nlstools)
The relevant variables from a larger df:
temp=c(9.2,9.9,12.7,12.8,14.3,14.5,16.3,16.5,18,18,19.6,19.6,19.9,19.9,22,22.4,23.2,23.4,25.3,25.6,27,27.3,28.5,30.3,20.9)
EPR=c(1.5,0,0,0,1.27,0.56,3.08,0.575,2.7,3.09,2,6.3,2,3.76,3.7,1.65,7.1,18.9,7.07,3.77,13.79,0,0,0.47,0)
df<-data.frame(temp,EPR)
Here I write the formula with the five parameters to be estimated (k1,a,b,k2,c), temp will be the x values. So far so good.
formula<-function(k1,a,b,k2,c,temp) {
modelEPR<-k1*1/(1+exp(-a*(temp-b)))-k2*exp(c*temp)
return(modelEPR)
}
This is where I'm stuck; I'm already using quite narrow start_lower and upper, since I now know the parameters by using the excel solver somewhat successfully. The values I get with this method will get me a model, albeit quite an inaccurate one. Yes, I gave the start lower and upper a much greater range in the beginning, but that didn't yield any better results.
fit <- nls_multstart(EPR ~ formula(k1,a,b,k2,c,temp),
data = df,
iter = 100,
start_lower = c(k1 = 14, a = 0.3, b = 20, k2 = 0.02, c = 0.15),
start_upper = c(k1 = 15, a = 0.5, b = 21, k2 = 0.08, c = 0.24),
supp_errors = 'Y',
na.action = na.omit)
fit
As mentioned, I used the Excel solver to fit the model and obtain the parameter estimates, then manually inserted them here in R, which makes for a much better model.
model<-df %>%
mutate(pred=(14.69/(1+exp(-0.41*(temp-20.52)))-0.05*exp(0.19 *temp))) %>%
ggplot()+
xlab("Temperature (°C)")+
ylab("EPR (Eggs per female per day)")+
geom_point(aes(temp,EPR))+
geom_line(aes(temp,pred),col="red")
model
Ultimately, I have two questions;
a) What am I doing wrong? Or is it simply the data being weird? Seems to work better with excel?!
b) How do I code the bridge between fit and model? fit will yield the 5 parameters, but how do I insert them directly into the model function? Can I utilize mutate somehow here?
Would appreciate any help!
A. Starting values and fitting model
To get starting values:
If k1 = 0 then we can rearrange the formula as follows and then use the result of fitting that linear model as a starting value for c.
log(EPR) ~ log(k2) + c * temp
b is a shift in temp and a is a scaling so choose b = mean(temp) and a = 1/sd(temp)
We can use algorithm = "plinear" to avoid having to specify starting values for the linear parameters, i.e. for k1 and k2. When using plinear the right hand side of the formula should be a matrix such that k1 times the first column plus k2 times the second column gives the predicted EPR.
This gives the following. Note that k1 and k2 will be represented by .lin1 and .lin2 in the nls output.
fm1 <- lm(log(EPR) ~ temp, df, subset = EPR > 0)
st2 <- list(c = coef(fm1)[[2]], a = 1/sd(df$temp), b = mean(df$temp))
fo2 <- EPR ~ cbind(1/(1+exp(-a*(temp-b))), -exp(c*temp))
fm2 <- nls(fo2, df, start = st2, algorithm = "plinear",
control = list(maxiter = 200))
deviance(fm2) # residual sum of squares
## [1] 333.6
Note that this represents a lower (better) residual sum of squares than the fit shown in the question:
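pred <- with(df, 14.69/(1+exp(-0.41*(temp-20.52))) - 0.05*exp(0.19*temp)) # predictions at the question's parameter values
sum((df$EPR - pred)^2) # residual sum of squares for fit shown in question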
## [1] 339.7
No packages were used.
We can plot the two fits where the fit from the question is in blue and the fit done here is in red. From the plot there is some question whether the two large EPR values are outliers and whether they should be excluded.
plot(EPR ~ temp, df)
lines(fitted(fm2) ~ temp, df, subset = order(temp), col = "red")
lines(pred ~ temp, df, subset = order(temp), col = "blue")
B. Evaluating model at given parameters
For a given model expressed in formula notation we can evaluate it at given parameters using the nls2 package. nls2 takes arguments similar to those of nls, but if the starting value is a data frame with one row and the algorithm is "brute" then it simply returns the value of the right hand side evaluated at the starting values. See ?nls2 for more information.
library(nls2)
fo <- EPR ~ k1*1/(1+exp(-a*(temp-b)))-k2*exp(c*temp)
st <- list(k1 = 14.69, a = 0.41, b = 20.52, k2 = 0.05, c = 0.19)
fm <- nls2(fo, df, start = data.frame(st), algorithm = "brute")
deviance(fm)
## [1] 339.7
fitted(fm) # predictions at parameter values given in st
or in terms of a function:
rhs <- function(a, b, c, k1, k2, temp) k1*1/(1+exp(-a*(temp-b)))-k2*exp(c*temp)
p <- do.call("rhs", c(st, list(temp = df$temp)))
all.equal(p, pred)
## [1] TRUE
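For the case where the parameters come from a fitted object rather than being typed in, the bridge can be sketched with coef() and predict(). The example below uses a small simulated exponential model with base nls (not the EPR data or the five-parameter formula); dat, rhs and fit are illustrative names, and it assumes the fitted object behaves like an ordinary nls fit.

```r
set.seed(1)
dat <- data.frame(x = seq(0, 2, length.out = 30))
dat$y <- 2 * exp(1.2 * dat$x) + rnorm(30, sd = 0.05)

# Model function, analogous in role to the question's formula()
rhs <- function(a, b, x) a * exp(b * x)

fit <- nls(y ~ rhs(a, b, x), data = dat, start = list(a = 1, b = 1))

# Option 1: let predict() evaluate the model at the estimated parameters
dat$pred <- predict(fit, newdata = dat)

# Option 2: pass the estimated parameters to the model function directly
dat$pred2 <- do.call(rhs, c(as.list(coef(fit)), list(x = dat$x)))

all.equal(as.numeric(dat$pred), dat$pred2)
```

Inside a dplyr pipeline the same idea would read mutate(pred = predict(fit, newdata = df)), which avoids copying parameter values by hand.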

Segmented linear regression with discontinuous data

I have a dataset that looks to be piecewise linear. I would like to perform a segmented linear regression in R. The issue is that there is a discontinuity at the breakpoint. By using some pieces of code from this question I managed to get something, but I am not satisfied.
Dataset
Here is a dummy dataset.
NB = 100
A1 = 2 # coeff for first part
A2 = 1 # coeff for second part
B1 = 0 # intercept for first part
B2 = 300 # intercept for second part
df = data.frame(n=1:NB)
df$n = sample(500, size=NB, replace=TRUE)
df$noise = sample(20, size=NB, replace=TRUE)-10
my_func <- function(n, noise) {
if(n < 100) {
return(A1*n+B1 + noise)
}
else {
return(A2*n+B2 + noise)
}
}
df$fn = mapply(my_func, df$n, df$noise)
Using segmented package
This is quite straightforward, we simply perform a classical linear regression and give it to segmented.
library(segmented)
library(ggplot2)
model_segmented = segmented(lm(fn~n, data=df), seg.Z = ~ n)
predict_segmented = data.frame(n = df$n, fn = broken.line(model_segmented)$fit)
ggplot(df, aes(x = n, y = fn)) +
geom_point() + geom_line(data = predict_segmented, color = 'blue')
Gives:
Obviously, segmented expects the data to be continuous. It is not the case here, so the regression is not correct.
“Manual” method
This method is more tedious. First, we compute the break-point by trying all the possible break points and keeping the one which yields the lowest residual. Then, we add a new factor in the linear regression, which tells if the predictor variable is greater or lower than this breakpoint.
# Computation of the break-point
Break<-sort(unique(df$n))
Break<-Break[2:(length(Break)-1)]
d<-numeric(length(Break))
for (i in seq_along(Break)) {
model_manual<-lm(fn~(n<Break[i])*n + (n>=Break[i])*n, data=df)
d[i]<-summary(model_manual)$sigma # residual standard error
}
breakpoint = Break[which.min(d)]
# Linear regression using this break-point
df$group = df$n >= breakpoint
model_manual<-lm(fn~n*group, data=df)
dat_pred = data.frame(n = df$n, fn = predict(model_manual, df))
ggplot(df, aes(x = n, y = fn)) +
geom_point() +
geom_line(data=dat_pred[dat_pred$n < breakpoint,], color = 'blue') +
geom_line(data=dat_pred[dat_pred$n >= breakpoint,], color = 'blue')
Gives:
Here, the regression is great.
Question
Is there a better way to achieve this goal? Can the segmented package take discontinuous data, or is there a package that can do this?
My concern is that the second method is a bit long and not very readable.

After spending a tremendous amount of time digging, I believe the chngpt package is the way to go. It can do both continuous and discontinuous segmented regressions. Link here: https://cran.r-project.org/web/packages/chngpt/vignettes/chngpt-vignette.pdf

strucchange will detect the breakpoint using statistically valid methods. Then, you can fit each piece with whatever model you want. For example, with a seasonal time series you can apply separate ARIMA models to each segment.
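If adding a package is not an option, the manual method from the question can at least be made shorter and more readable by wrapping the scan in a function. The block below is a base R sketch on simulated data with a known jump at n = 100; find_break and its min_obs argument are hypothetical, and the search uses the residual sum of squares, which ranks candidates in the same order as the residual standard error here because every candidate model has the same number of parameters.

```r
set.seed(123)
n  <- seq(5, 500, by = 5)
fn <- ifelse(n < 100, 2 * n, n + 300) + rnorm(length(n), sd = 5)
dat <- data.frame(n = n, fn = fn)

# Hypothetical helper: try each candidate breakpoint b, fit separate lines
# for n < b and n >= b, and keep the b with the lowest residual sum of squares
find_break <- function(dat, min_obs = 3) {
  cand <- sort(unique(dat$n))
  # leave at least min_obs distinct n values on each side of the break
  cand <- cand[(min_obs + 1):(length(cand) - min_obs + 1)]
  rss <- sapply(cand, function(b) {
    deviance(lm(fn ~ n * I(n >= b), data = dat))
  })
  cand[which.min(rss)]
}

bp <- find_break(dat)

# Final discontinuous fit at the selected breakpoint
dat$group <- dat$n >= bp
fit <- lm(fn ~ n * group, data = dat)
coef(fit)
```

The coefficient on groupTRUE is the change in intercept and n:groupTRUE the change in slope between the two segments.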

Resources