My goal is to fit a three-piece (i.e., two-breakpoint) regression model to make predictions using propagate's predictNLS function, making sure to define the knots as parameters, but my model formula seems off.
I've used the segmented package to estimate the breakpoint locations (used as starting values in NLS), but I'd like to keep my models in the NLS format, specifically nlsLM {minpack.lm}, because I am fitting other types of curves to my data using NLS, want NLS to optimize the knot values, sometimes use variable weights, and need to be able to easily calculate the Monte Carlo confidence intervals from propagate. Though I'm very close to having the right syntax for the formula, I'm not getting the expected/required behaviour near the breakpoint(s). The segments SHOULD meet directly at the breakpoints (without any jumps), but at least on this data I'm getting a weird local minimum at the breakpoint (see plots below).
Below is an example of my data and general process. I believe my issue to be in the NLS formula.
library(minpack.lm)
library(segmented)
y <- c(-3.99448113, -3.82447011, -3.65447803, -3.48447030, -3.31447855, -3.14448753, -2.97447972, -2.80448401, -2.63448380, -2.46448069, -2.29448796, -2.12448912, -1.95448783, -1.78448797, -1.61448563, -1.44448719, -1.27448469, -1.10448651, -0.93448525, -0.76448637, -0.59448626, -0.42448586, -0.25448588, -0.08448548, 0.08551417, 0.25551393, 0.42551411, 0.59551395, 0.76551389, 0.93551398)
x <- c(61586.1711, 60330.5550, 54219.9925, 50927.5381, 48402.8700, 45661.9175, 37375.6023, 33249.1248, 30808.6131, 28378.6508, 22533.3782, 13901.0882, 11716.5669, 11004.7305, 10340.3429, 9587.7994, 8736.3200, 8372.1482, 8074.3709, 7788.1847, 7499.6721, 7204.3168, 6870.8192, 6413.0828, 5523.8097, 3961.6114, 3460.0913, 2907.8614, 2016.1158, 452.8841)
df <- data.frame(x, y)
#Use Segmented to get estimates for parameters with 2 breakpoints
my.seg2 <- segmented(lm(y ~ x, data = df), seg.Z = ~ x, npsi = 2)
#extract knot, intercept, and coefficient values to use as NLS start points
my.knot1 <- my.seg2$psi[1,2]
my.knot2 <- my.seg2$psi[2,2]
my.m_2 <- slope(my.seg2)$x[1,1]
my.b1 <- my.seg2$coefficients[[1]]
my.b2 <- my.seg2$coefficients[[2]]
my.b3 <- my.seg2$coefficients[[3]]
#Fit an NLS model to ~replicate the segmented model. Presumably my model formula is where the problem lies
my.model <- nlsLM(y ~ m * x + b +
                    (b2 * (ifelse(x >= knot1 & x <= knot2, 1, 0) * (x - knot1)) +
                     (b3 * ifelse(x > knot2, 1, 0) * (x - knot2 - knot1))),
                  data = df,
                  start = c(m = my.m_2, b = my.b1, b2 = my.b2, b3 = my.b3,
                            knot1 = my.knot1, knot2 = my.knot2))
How it should look
plot(my.seg2)
How it does look
plot(x, y)
lines(x=x, y=predict(my.model), col='black', lty = 1, lwd = 1)
I was pretty sure I had it "right", but when the 95% confidence intervals are plotted along with the line and the prediction resolution (i.e., the density of x points) is increased, things look dramatically wrong.
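For reference, a minimal sketch of predicting on a denser x grid to see the behaviour near the knots (the grid size of 500 is an arbitrary choice):
newx <- data.frame(x = seq(min(df$x), max(df$x), length.out = 500))
plot(df$x, df$y, xlab = "x", ylab = "y")
lines(newx$x, predict(my.model, newdata = newx), col = "black")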
Thank you all for your help.
Define g to be a grouping vector, the same length as x, that takes the values 1, 2, 3 for the three sections of the x axis, and create an nls model from these. The resulting plot looks OK.
my.knots <- c(my.knot1, my.knot2)
g <- cut(x, c(-Inf, my.knots, Inf), labels = FALSE)
fm <- nls(y ~ a[g] + b[g] * x, df, start = list(a = c(1, 1, 1), b = c(1, 1, 1)))
plot(y ~ x, df)
lines(fitted(fm) ~ x, df, col = "red")
Constraints
Although the above looks OK and may be sufficient, it does not guarantee that the segments intersect at the knots. To do that we must impose the constraints that both sides are equal at the knots:
a[2] + b[2] * my.knots[1] = a[1] + b[1] * my.knots[1]
a[3] + b[3] * my.knots[2] = a[2] + b[2] * my.knots[2]
so
a[2] = a[1] + (b[1] - b[2]) * my.knots[1]
a[3] = a[2] + (b[2] - b[3]) * my.knots[2]
= a[1] + (b[1] - b[2]) * my.knots[1] + (b[2] - b[3]) * my.knots[2]
giving:
# returns a vector of the three a values
avals <- function(a1, b) unname(cumsum(c(a1, -diff(b) * my.knots)))
fm2 <- nls(y ~ avals(a1, b)[g] + b[g] * x, df, start = list(a1 = 1, b = c(1, 1, 1)))
To get the three a values we can use:
co <- coef(fm2)
avals(co[1], co[-1])
To get the residual sum of squares:
deviance(fm2)
## [1] 0.193077
Polynomial
Although it involves a large number of parameters, a polynomial fit could be used in place of the segmented linear regression. A 12th degree polynomial involves 13 parameters but has a lower residual sum of squares than the segmented linear regression. A lower degree could be used, with a corresponding increase in the residual sum of squares; a 7th degree polynomial involves 8 parameters and looks visually acceptable, although it has a higher residual sum of squares.
fm12 <- nls(y ~ cbind(1, poly(x, 12)) %*% b, df, start = list(b = rep(1, 13)))
deviance(fm12)
## [1] 0.1899218
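For comparison, here is a sketch of the 7th degree fit mentioned above (its residual sum of squares will come out somewhat higher than for the 12th degree fit):
fm7 <- nls(y ~ cbind(1, poly(x, 7)) %*% b, df, start = list(b = rep(1, 8)))
deviance(fm7)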
It may, in part, reflect a limitation in segmented. segmented returns a single change point value without quantifying the associated uncertainty. Redoing the analysis using mcp which returns Bayesian posteriors, we see that the second change point is bimodally distributed:
library(mcp)
model = list(
  y ~ 1 + x,  # Intercept + slope in first segment
  ~ 0 + x,    # Only slope changes in the next segments
  ~ 0 + x
)
# Fit it with a large number of samples and plot the change point posteriors
fit = mcp(model, data = data.frame(x, y), iter = 50000, adapt = 10000)
plot_pars(fit, regex_pars = "^cp*", type = "dens_overlay")
FYI, mcp can plot credible intervals as well (the red dashed lines):
plot(fit, q_fit = TRUE)
I am trying to build a dynamic regression model, and so far I have done it with the dynlm package. Basically the model looks like this:
y_t = a*x1_t + b*x2_t + ... + c*y_(t-1)
y_t is to be predicted; x1_t and x2_t are given, and so is y_(t-1).
Building the model with the dynlm package worked fine, but when it came to predicting y_t I got confused...
I found this, which seems to be a very similar problem, but it did not help me solve my own problem.
Here is the problem I am facing (basically, what predict() does seems to be weird; see the comments!):
library(dynlm)
# Create Data
set.seed(1)
y <- arima.sim(model = list(ar = c(.9)), n = 11) #Create AR(1) dependent variable
A <- rnorm(11) #Create independent variables
B <- rnorm(11)
y <- y + .5 * A + .2 * B #Add relationship to independent variables
data = cbind(y, A, B)
# subset used for the fitting of the model
reg <- data[1:10, ]
# Fit dynamic linear model
model <- dynlm(y ~ A + B + L(y, k = 1), data = reg) # dynlm
model
# Time series regression with "zooreg" data:
# Start = 2, End = 11
#
# Call:
# dynlm(formula = y ~ A + B + L(y, k = 1), data = reg)
# Coefficients:
# (Intercept) A B L(y, k = 1)
# 0.8930 -0.2175 0.2892 0.5176
# subset the last two rows:
# the last row (r11), for which y_t shall be predicted and whose A and B values (from the same time) are inputs for the prediction,
# and the second-last row (r10), so that y_(t-1) can be supplied to the model as well
pred <- as.data.frame(data[10:11, ])
# prediction using predict()
predict(model, newdata = pred)
# 1 2
# 1.833134 1.483809
# manual calculation of prediction of y in r11 (how I thought it should be...), taking y_(t-1) as input
predicted_value <- model$coefficients[1] + model$coefficients[2] * pred[2, 2] + model$coefficients[3] * pred[2, 3] + model$coefficients[4] * pred[1, 1]
predicted_value
# (Intercept)
# 1.743334
# and this is what reproduces the value from predict() above: it feeds y_t itself into the model (the value that should be predicted) instead of y_(t-1)
predicted_value <- model$coefficients[1] + model$coefficients[2] * pred[2, 2] + model$coefficients[3] * pred[2, 3] + model$coefficients[4] * pred[2, 1]
predicted_value
# (Intercept)
# 1.483809
Of course I could just use my own prediction function, but the problem is that my real model will have many more variables (which can even vary, as I use the step function to optimize the model according to AIC), and that is why I want to use the predict() function.
Any ideas how to solve this?
Unfortunately, the dynlm package does not provide a predict() method. At the moment the package completely separates the data pre-processing (which knows about functions like d(), L(), trend(), season() etc.) and the model fitting (which itself is not aware of those functions). A predict() method has been on my wishlist, but so far I have not got round to writing one because the flexibility of the interface allows so many models that it is not quite straightforward what to do. In the meantime, I should probably add a method that throws a warning before the lm method is found by inheritance.
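As a stopgap for the example above, a minimal sketch of a manual one-step-ahead prediction from the fitted coefficients (not a dynlm feature; the coefficient names are taken from the dynlm output shown earlier, and the result reproduces the 1.743334 computed by hand in the question):
co <- coef(model)
newvals <- c("(Intercept)" = 1,
             "A"           = pred[2, "A"],
             "B"           = pred[2, "B"],
             "L(y, k = 1)" = pred[1, "y"])  # supply y_(t-1) for the lag term
sum(co * newvals[names(co)])               # match regressor values to coefficients by name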
I'm estimating several ordinary least squares linear regressions in R. I want to constrain the estimated coefficients across the regressions such that they're the same. For example, I have the following:
z1 ~ x + y
z2 ~ x + y
And I would like the estimated coefficient on y in the first regression to be equal to the estimated coefficient on x in the second.
Is there a straightforward way to do this? Thanks in advance.
More detailed edit
I'm trying to estimate a system of linear demand functions, where the corresponding welfare function is quadratic. The welfare function has the form:
W = 0.5*ax*(Qx^2) + 0.5*ay*(Qy^2) + 0.5*bxy*Qx*Qy + 0.5*byx*Qy*Qx + cx*Qx + cy*Qy
Therefore, it follows that the demand functions are:
dW/dQx = Px = 2*0.5*ax*Qx + 0 + 0.5*bxy*Qy + 0.5*byx*Qy + 0 + cx
dW/dQx = Px = ax*Qx + 0.5*(bxy + byx)*Qy + cx
and
dW/dQy = Py = ay*Qy + 0.5*(byx + bxy)*Qx + cy
I would like to constrain the system so that byx = bxy (the cross-product coefficients in the welfare function). If this condition holds, the two demand functions become:
Px = ax*Qx + bxy*Qy + cx
Py = ay*Qy + bxy*Qx + cy
I have price (Px and Py) and quantity (Qx and Qy) data, but what I'm really interested in is the welfare (W) which I have no data for.
I know how to calculate and code all the matrix formulae for constrained least squares (which would take a fair few lines of code to get the coefficients, standard errors, measures of fit etc that come standard with lm()). But I was hoping there might be an existing R function (i.e. something that can be done to the lm() function) so that I wouldn't have to code all of this.
For your specified regressions:
Px = ax*Qx + bxy*Qy + cx
Py = ay*Qy + bxy*Qx + cy
We can introduce a grouping factor:
id <- factor(rep.int(c("Px", "Py"), c(length(Px), length(Py))),
             levels = c("Px", "Py"))
We also need to combine data:
z <- c(Px, Py) ## response
x <- c(Qx, Qy) ## covariate 1
y <- c(Qy, Qx) ## covariate 2 (the cross quantity, whose coefficient is the shared bxy)
Then we can fit a linear model using lm with a formula:
z ~ x + y + x:id
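Putting this together on simulated data (the data-generating values below, and the use of a common intercept, are made up purely for illustration; if you also need group-specific intercepts, add id as a main effect in the formula):
set.seed(1)
n <- 100
Qx <- runif(n, 1, 10)
Qy <- runif(n, 1, 10)
# illustrative true values: ax = 2, ay = 1.2, bxy = 0.8, common intercept 1.5
Px <- 2.0 * Qx + 0.8 * Qy + 1.5 + rnorm(n, sd = 0.1)
Py <- 1.2 * Qy + 0.8 * Qx + 1.5 + rnorm(n, sd = 0.1)
id <- factor(rep.int(c("Px", "Py"), c(length(Px), length(Py))),
             levels = c("Px", "Py"))
z <- c(Px, Py)   ## response
x <- c(Qx, Qy)   ## own-quantity covariate
y <- c(Qy, Qx)   ## cross-quantity covariate
fit <- lm(z ~ x + y + x:id)
coef(fit)  # coefficient on y is the common bxy; x and x:idPy recover ax and ay - ax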
If the x and y values are the same, then you could use this model:
lm(I(z1 + z2) ~ x + y)  # Need to divide coefficients by 2
If they are separate data then you could rbind the two datasets after renaming z2 to z1.
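A minimal sketch of the rbind route, using two tiny made-up data frames just for illustration:
d1 <- data.frame(z1 = c(1, 2, 3), x = c(1, 2, 3), y = c(2, 1, 0))
d2 <- data.frame(z2 = c(2, 3, 5), x = c(1, 3, 4), y = c(1, 1, 2))
names(d2)[names(d2) == "z2"] <- "z1"   # rename so both responses share a name
combined <- rbind(d1, d2)
lm(z1 ~ x + y, data = combined)        # one common set of coefficients for both samples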
We have the diameter of trees as the predictor and tree height as the dependent variable. A number of different equations exist for this kind of data and we try to model some of them and compare the results.
However, we can't figure out how to correctly put one equation into the corresponding R formula format.
The trees data set in R can be used as an example.
data(trees)
df <- trees
df$h <- df$Height * 0.3048 #transform to metric system
df$dbh <- (trees$Girth * 0.3048) / pi #transform tree girth to diameter
First, the example of an equation that seems to work well:
form1 <- h ~ I(dbh ^ -1) + I( dbh ^ 2)
m1 <- lm(form1, data = df)
m1
Call:
lm(formula = form1, data = df)
Coefficients:
(Intercept) I(dbh^-1) I(dbh^2)
27.1147 -5.0553 0.1124
Coefficients a, b and c are estimated, which is what we are interested in.
Now the problematic equation:
h = 1.3 + dbh^2 / (a + b*dbh + c*dbh^2)
Trying to fit it like this:
form2 <- h ~ I(dbh ^ 2) / dbh + I(dbh ^ 2) + 1.3
gives an error:
m1 <- lm(form2, data = df)
Error in terms.formula(formula, data = data) : invalid model formula in ExtractVars
I guess this is because / is interpreted as a nested model and not an arithmetic operator?
This doesn't give an error:
form2 <- h ~ I(I(dbh ^ 2) / dbh + I(dbh ^ 2) + 1.3)
m1 <- lm(form2, data = df)
But the result is not the one we want:
m1
Call:
lm(formula = form2, data = df)
Coefficients:
(Intercept) I(I(dbh^2)/dbh + I(dbh^2) + 1.3)
19.3883 0.8727
Only one coefficient is given for the whole term within the outer I(), which seems logical.
How can we fit the second equation to our data?
Assuming you are using nls, the R formula can call an ordinary R function, H(a, b, c, D), so the formula can be just h ~ H(a, b, c, dbh). This works:
# use lm to get starting values
lm1 <- lm(1/(h - 1.3) ~ I(1/dbh) + I(1/dbh^2), df)
start <- rev(setNames(coef(lm1), c("c", "b", "a")))
# run nls
H <- function(a, b, c, D) 1.3 + D^2 / (a + b * D + c * D^2)
nls1 <- nls(h ~ H(a, b, c, dbh), df, start = start)
nls1 # display result
Graphing the output:
plot(h ~ dbh, df)
lines(fitted(nls1) ~ dbh, df)
You've got a couple of problems: (1) you're missing parentheses for the denominator of form2 (and R has no way to know that you want to add a constant a in the denominator, or where to put any of the parameters, really), and, much more problematic, (2) your second model isn't linear, so lm won't work.
Fixing (1) is easy:
form2 <- h ~ 1.3 + I(dbh^2) / (a + b * dbh + c * I(dbh^2))
For (2), though there are many ways to estimate parameters for a nonlinear model, nls (nonlinear least squares) is a good place to start:
m2 <- nls(form2, data = df, start = list(a = 1, b = 1, c = 1))
You need to provide starting guesses for the parameters in nls. I just picked 1's, but you should use better guesses that ballpark what the parameters might be.
edit: fixed, no longer incorrectly using offset ...
An answer that complements #shujaa's:
You can transform your problem from
H = 1.3 + D^2/(a+b*D+c*D^2)
to
1/(H-1.3) = a/D^2+b/D+c
This would normally mess up the assumptions of the model (i.e., if H were normally distributed with constant variance, then 1/(H-1.3) wouldn't be). However, let's try it anyway:
data(trees)
df <- transform(trees,
                h = Height * 0.3048,        # transform to metric system
                dbh = Girth * 0.3048 / pi   # transform tree girth to diameter
)
lm(1/(h-1.3) ~ poly(I(1/dbh),2,raw=TRUE),data=df)
## Coefficients:
## (Intercept) poly(I(1/dbh), 2, raw = TRUE)1
## 0.043502 -0.006136
## poly(I(1/dbh), 2, raw = TRUE)2
## 0.010792
These results would normally be good enough to get good starting values for the nls fit. However, you can do better than that via glm, which uses a link function to allow for some forms of non-linearity. Specifically,
(fit2 <- glm(h - 1.3 ~ poly(I(1/dbh), 2, raw = TRUE),
             family = gaussian(link = "inverse"), data = df))
## Coefficients:
## (Intercept) poly(I(1/dbh), 2, raw = TRUE)1
## 0.041795 -0.002119
## poly(I(1/dbh), 2, raw = TRUE)2
## 0.008175
##
## Degrees of Freedom: 30 Total (i.e. Null); 28 Residual
## Null Deviance: 113.2
## Residual Deviance: 80.05 AIC: 125.4
##
You can see that the results are approximately the same as the linear fit, but not quite.
pframe <- data.frame(dbh=seq(0.8,2,length=51))
We use predict, but need to correct the prediction to account for the fact that we subtracted a constant from the LHS:
pframe$h <- predict(fit2,newdata=pframe,type="response")+1.3
p2 <- predict(fit2,newdata=pframe,se.fit=TRUE) ## predict on link scale
pframe$h_lwr <- with(p2,1/(fit+1.96*se.fit))+1.3
pframe$h_upr <- with(p2,1/(fit-1.96*se.fit))+1.3
png("dbh_tmp1.png",height=4,width=6,units="in",res=150)
par(las=1,bty="l")
plot(h~dbh,data=df)
with(pframe,lines(dbh,h,col=2))
with(pframe, polygon(c(dbh, rev(dbh)), c(h_lwr, rev(h_upr)),
                     border = NA, col = adjustcolor("black", alpha = 0.3)))
dev.off()
Because we have used the constant on the LHS (this almost, but doesn't quite, fit into the framework of using an offset -- we could only use an offset if our formula were 1/H - 1.3 = a/D^2 + ..., i.e. if the constant adjustment were on the link (inverse) scale rather than the original scale), this doesn't fit perfectly into ggplot's geom_smooth framework:
library("ggplot2")
ggplot(df, aes(dbh, h)) + geom_point() + theme_bw() +
  geom_line(data = pframe, colour = "red") +
  geom_ribbon(data = pframe, colour = NA, alpha = 0.3,
              aes(ymin = h_lwr, ymax = h_upr))
ggsave("dbh_tmp2.png",height=4,width=6)
I was reading the documentation on R formulas and trying to figure out how to work with depmix (from the depmixS4 package).
Now, in the documentation of depmixS4, the sample formulas tend to be something like y ~ 1.
For a simple case like y ~ x, it defines a relationship between input x and output y, so I get that it is similar to y = a * x + b, where a is the slope and b is the intercept.
If we go back to y ~ 1, the formula throws me off. Is it equivalent to y = 1 (a horizontal line at y = 1)?
To add a bit of context, if you look at the depmixS4 documentation, there is this example:
depmix(list(rt ~ 1, corr ~ 1), data = speed, nstates = 2, family = list(gaussian(), multinomial()))
In general, formulas that end with ~ 1 are confusing to me. Can anyone explain what ~ 1 or y ~ 1 means?
Many of the operators used in model formulae in R (asterisk, plus, caret) have a model-specific meaning, and this is one of them: the 'one' symbol indicates an intercept.
In other words, it is the value the dependent variable is expected to have when the independent variables are zero or have no influence. (To use the more common mathematical meaning of model terms, you wrap them in I().) Intercepts are usually assumed, so the 1 is most commonly seen when explicitly stating a model without an intercept.
Here are two ways of specifying the same model for a linear regression model of y on x. The first has an implicit intercept term, and the second an explicit one:
y ~ x
y ~ 1 + x
Here are ways to give a linear regression of y on x through the origin (that is, without an intercept term):
y ~ 0 + x
y ~ -1 + x
y ~ x - 1
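A quick check, with made-up data, that the three specifications above give the same fit:
set.seed(1)
x <- 1:10
y <- 2 * x + rnorm(10)
c(coef(lm(y ~ 0 + x)), coef(lm(y ~ -1 + x)), coef(lm(y ~ x - 1)))  # three identical slopes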
In the specific case you mention (y ~ 1), y is being predicted by no other variable, so the natural prediction is the mean of y, as Paul Hiemstra stated:
> data(city)
> r <- lm(x~1, data=city)
> r
Call:
lm(formula = x ~ 1, data = city)
Coefficients:
(Intercept)
97.3
> mean(city$x)
[1] 97.3
And removing the intercept with a -1 leaves you with nothing:
> r <- lm(x ~ -1, data=city)
> r
Call:
lm(formula = x ~ -1, data = city)
No coefficients
formula() is a function for extracting formulas from objects, and its help file isn't the best place to read about specifying model formulae in R. I suggest you look at this explanation or Chapter 11 of An Introduction to R.
If your model were of the form y ~ x1 + x2, this (roughly speaking) represents:
y = β0 + β1(x1) + β2(x2)
Which is of course the same as
y = β0(1) + β1(x1) + β2(x2)
There is an implicit +1 in the above formula. So really, the formula above is y ~ 1 + x1 + x2
We could have a very simple formula in which y does not depend on any other variable. This is the formula that you are referencing:
y ~ 1, which roughly equates to
y = β0(1) = β0
As #Paul points out, when you solve this simple model, you get β0 = mean(y).
Here is an example
# Let's make a small sample data frame
dat <- data.frame(y= (-2):3, x=3:8)
# Create the linear model as above
simpleModel <- lm(y ~ 1, data=dat)
## COMPARE THE COEFFICIENTS OF THE MODEL TO THE MEAN(y)
simpleModel$coef
# (Intercept)
# 0.5
mean(dat$y)
# [1] 0.5
In general, such a formula describes the relation between dependent and independent variables in the form of a linear model. The left-hand side contains the dependent variables, the right-hand side the independent ones. The independent variables are used to calculate the trend component of the linear model, and the residuals are then assumed to follow some distribution. When the right-hand side is just 1 (~ 1), the trend component is a single value, e.g. the mean of the data, i.e. the linear model only has an intercept.
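A small illustration of that last point, with made-up data (with ~ 1 every fitted value is the same single number, the mean of y):
set.seed(1)
y <- rnorm(20, mean = 10)
f <- lm(y ~ 1)
unique(round(fitted(f), 10))  # one fitted value for every observation
mean(y)                       # the same value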