I have been tasked with estimating the degrees of freedom k of several chi-squared distributions using maximum likelihood estimation.
So I started by drawing 20 random values from a chi-squared distribution and proceeded to use maximum likelihood to estimate the degrees of freedom k in R.
The log-likelihood function of the chi-squared distribution is assumed to be
l(k) = (k/2 - 1) * sum(log(x_i)) - sum(x_i)/2 - n*log(gamma(k/2)) - (n*k/2)*log(2)
which I implemented as follows:
library('maxLik')
library('lmtest')
n <- 20
df <- 3
chi20 <- rchisq(n, df)
X <- chi20
loglikfun <- function(param) {
if (param[1] <= 0) return(NA)
#return(sum(dchisq(chi20, df = param[1], log = TRUE)))
return((param[1]/2-1)*(sum(log(X)))-(1/2*sum(X)-n*log(gamma(param[1]/2)))-(n*param[1]/2*log(2)))
}
mle <- maxLik(loglikfun, start = c(df = 1))
summary(mle)
However, rather than getting an estimate of approximately 3, I get 7496171 along with a number of NaN warnings.
Any suggestions?
I think you have just transcribed the formula incorrectly. The section that you have written as
-(1/2*sum(X)-n*log(gamma(param[1]/2)))
should not have the outer parentheses: they flip the sign of the -n*log(gamma(param[1]/2)) term, so it gets added to the result when the formula says it should be subtracted.
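For reference, the minimal fix is simply to drop those outer parentheses so the gamma term keeps its negative sign; the question's function, otherwise unchanged, then reads:
loglikfun <- function(param) {
  if (param[1] <= 0) return(NA)
  # same expression as before, with only the outer parentheses removed
  return((param[1]/2 - 1) * sum(log(X)) - 1/2 * sum(X) -
           n * log(gamma(param[1]/2)) - n * param[1]/2 * log(2))
}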
A more general observation: when implementing a mathematical formula, it is easier to keep track if you name variables to match the formula and put spaces around operators so the code stays readable. This might seem like a minor point, but in practice it is a frequent source of easily avoided bugs.
Another point is that the function can be made more efficient by letting R's native vectorization evaluate several parameter values at once. You get this essentially for free by replacing any zero or negative parameter values with NA instead of returning early.
One way to correct the implementation would be:
loglikfun <- function(k)
{
k[k <= 0] <- NA
(k / 2 - 1) * sum(log(X)) - sum(X) / 2 - n * log(gamma(k / 2)) - n * k * log(2) / 2
}
We can check that this gives sensible results by plotting its output:
set.seed(100)
df <- 3
n <- 20
X <- rchisq(n, df)
parameter_estimate <- seq(0, 10, 0.1)
log_likelihood <- loglikfun(parameter_estimate)
plot(parameter_estimate, log_likelihood, type = "l")
abline(v = parameter_estimate[which.max(log_likelihood)], lty = 2, col = 2)
abline(h = max(log_likelihood, na.rm = TRUE), lty = 2, col = 2)
Here we can see that the maximum log-likelihood occurs at around 3, which is what we expect from the input.
If we try your maxLik code now, we can see we get the correct answer:
mle <- maxLik(loglikfun, start = c(k = 1))
summary(mle)
#> --------------------------------------------
#> Maximum Likelihood estimation
#> Newton-Raphson maximisation, 5 iterations
#> Return code 8: successive function values within relative tolerance limit (reltol)
#> Log-Likelihood: -35.58475
#> 1 free parameters
#> Estimates:
#> Estimate Std. error t value Pr(> t)
#> k 2.8960 0.4523 6.403 1.52e-10 ***
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#> --------------------------------------------
and
df <- 5
n <- 100
X <- rchisq(n, df)
mle <- maxLik(loglikfun, start = c(k = 1))
summary(mle)
#> --------------------------------------------
#> Maximum Likelihood estimation
#> Newton-Raphson maximisation, 6 iterations
#> Return code 8: successive function values within relative tolerance limit (reltol)
#> Log-Likelihood: -243.5511
#> 1 free parameters
#> Estimates:
#> Estimate Std. error t value Pr(> t)
#> k 4.9735 0.2851 17.45 <2e-16 ***
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#> --------------------------------------------
Here we see that the parameter estimates are now close to the parameters used to generate the random samples.
Created on 2021-11-01 by the reprex package (v2.0.0)
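As an aside, the dchisq-based log-likelihood that was commented out in the question works too and is a handy cross-check. A minimal sketch, assuming X, n and loglikfun are defined as above (the name loglik_dchisq is just illustrative):
# Density-based log-likelihood; should agree with the closed-form loglikfun
loglik_dchisq <- function(k) sum(dchisq(X, df = k, log = TRUE))
loglik_dchisq(3)
loglikfun(3)   # the two values should match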
An equivalent form of the log-likelihood is
F(k) = -n*log(2*gamma(k/2)) + (k/2 - 1) * sum_{i=1}^{n} log(x_i/2) - sum_{i=1}^{n} x_i/2
You can optimize this by precomputing the sums once:
QS = sum_{i=1}^{n} log(x_i/2)
HS = sum_{i=1}^{n} x_i/2
F(k, QS, HS) = -n*log(2*gamma(k/2)) + (k/2 - 1)*QS - HS
Numerical precision of the machine arithmetic is not considered here.
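A minimal sketch of this idea in R, assuming X and n are defined as in the question (the name F_k and the use of optimize() over an arbitrary interval are my choices here):
QS <- sum(log(X / 2))   # computed once
HS <- sum(X) / 2        # computed once
F_k <- function(k) -n * log(2 * gamma(k / 2)) + (k / 2 - 1) * QS - HS
optimize(F_k, interval = c(0.1, 20), maximum = TRUE)$maximum   # MLE of k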
I have a question about how to compare coefficients in a multivariate regression in R.
I conducted a survey in which I measured three different attitudes (scale variables). My goal is to estimate whether some characteristics of the respondents (age, gender, education and ideological position) can explain their (positive or negative) attitudes.
I was advised to conduct a multivariate multiple regression instead of three univariate multiple regression. The code of my multivariate model is:
MMR <- lm(cbind(Attitude_1, Attitude_2, Attitude_3) ~
Age + Gender + Education + Ideological_position,
data = survey)
summary(MMR)
What I am trying to do next is to estimate whether the coefficients of let's say 'Gender' are statistically significant across the three individual models.
I found very clear instructions on how to do this in Stata (https://stats.idre.ucla.edu/stata/dae/multivariate-regression-analysis/), but I don't have a license, so I have to find an alternative in R. I know a similar question has been asked here before (R - Testing equivalence of coefficients in multivariate multiple regression), but the answer was that there is no package (or function) in R that can be used for this purpose. Because that answer was given a few years back, I was wondering whether any new packages or functions have been implemented in the meantime.
More precisely, I was wondering whether I can use the linearHypothesis() function (https://www.rdocumentation.org/packages/car/versions/3.0-11/topics/linearHypothesis)? I already know that this function allows me to test, for instance, whether the coefficient of Gender equals the coefficient of Education:
linearHypothesis(MMR, "GenderFemale = EducationHigh-educated")
Can I also use this function to test whether the coefficient of Gender in the equation modelling Attitude_1 equals the coefficient of Gender in the equation modelling Attitude_2 or Attitude_3?
Any help would be greatly appreciated!
Since the model presented in the question is not reproducible (the input is missing) let us use this model instead.
fm0 <- lm(cbind(cyl, mpg) ~ wt + hp, mtcars)
We will discuss two approaches using as our linear hypothesis that the intercepts of the cyl and mpg groups are the same, that the wt slopes are the same and the hp slopes are the same.
1) Mean/Variance
In this approach we base the entire comparison only on the coefficients and their variance covariance matrix.
library(car)
v <- vcov(fm0)
co <- setNames(c(coef(fm0)), rownames(v))
h1 <- c("cyl:(Intercept) = mpg:(Intercept)", "cyl:wt = mpg:wt", "cyl:hp = mpg:hp")
linearHypothesis(NULL, h1, coef. = co, vcov. = v)
giving:
Linear hypothesis test
Hypothesis:
cyl:(Intercept) - mpg:(Intercept) = 0
cyl:wt - mpg:wt = 0
cyl:hp - mpg:hp = 0
Model 1: restricted model
Model 2: structure(list(), class = "formula", .Environment = <environment>)
Note: Coefficient covariance matrix supplied.
Df Chisq Pr(>Chisq)
1
2 3 878.53 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
To explain what linearHypothesis is doing, note that in this case the hypothesis matrix is L <- t(c(1, -1)) %x% diag(3). Given v, as a large-sample approximation L %*% co is distributed as N(0, L %*% v %*% t(L)) under the null hypothesis; hence t(L %*% co) %*% solve(L %*% v %*% t(L)) %*% L %*% co is distributed as chi-squared with nrow(L) degrees of freedom.
L <- t(c(1, -1)) %x% diag(3)
nrow(L) # degrees of freedom
SSH <- t(L %*% co) %*% solve(L %*% v %*% t(L)) %*% L %*% co # chisq
p <- pchisq(SSH, nrow(L), lower.tail = FALSE) # p value
2) Long form model
With this approach (which is not equivalent to the first one shown above) we convert mtcars from wide to long form, mt2. We show how to do that using reshape or pivot_longer at the end, but for now we just form it explicitly. Define lhs as the 32x2 matrix on the left-hand side of the fm0 formula, i.e. cbind(cyl, mpg); note that its column names are c("cyl", "mpg"). Stringing out lhs column by column into a vector of length 64 (the cyl column followed by the mpg column) gives us our new dependent variable y. We also form a grouping variable g., the same length as y, which indicates which column of lhs the corresponding element of y came from.
With mt2 defined we can form fm1. In forming fm1 we use a weight vector w, based on the fm0 sigma values, to reflect the fact that the two groups, cyl and mpg, have different values of sigma, given by the vector sigma(fm0).
We show below that the fm0 and fm1 models have the same coefficients and then run linearHypothesis.
library(car)
lhs <- fm0$model[[1]]
g. <- colnames(lhs)[col(lhs)]
y <- c(lhs)
mt2 <- with(mtcars, data.frame(wt, hp, g., y))
w <- 1 / sigma(fm0)[g.]^2
fm1 <- lm(y ~ g./(wt + hp) + 0, mt2, weights = w)
# note coefficient names
variable.names(fm1)
## [1] "g.cyl" "g.mpg" "g.cyl:wt" "g.mpg:wt" "g.cyl:hp" "g.mpg:hp"
# check that fm0 and fm1 have same coefs
all.equal(c(t(coef(fm0))), coef(fm1), check.attributes = FALSE)
## [1] TRUE
h2 <- c("g.mpg = g.cyl", "g.mpg:wt = g.cyl:wt", "g.mpg:hp = g.cyl:hp")
linearHypothesis(fm1, h2)
giving:
Linear hypothesis test
Hypothesis:
- g.cyl + g.mpg = 0
- g.cyl:wt + g.mpg:wt = 0
- g.cyl:hp + g.mpg:hp = 0
Model 1: restricted model
Model 2: y ~ g./(wt + hp) + 0
Res.Df RSS Df Sum of Sq F Pr(>F)
1 61 1095.8
2 58 58.0 3 1037.8 345.95 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
If L is the hypothesis matrix (the same as L in (1) except that the columns are reordered), q is its number of rows and n is the number of rows of mt2, then SSH/q is distributed F(q, n - 2q), where n - 2q is the residual degrees of freedom of fm1 (here 64 - 6 = 58, matching the Res.Df shown above), so we have:
n <- nrow(mt2)
L <- diag(3) %x% t(c(1, -1)) # note difference from (1)
q <- nrow(L)
SSH <- t(L %*% coef(fm1)) %*% solve(L %*% vcov(fm1) %*% t(L)) %*% L %*% coef(fm1)
SSH/q # F value
pf(SSH/q, q, n - 2*q, lower.tail = FALSE) # p value
anova
An alternative to linearHypothesis is to define the reduced model and then compare the two models using anova. mt2 and w are from above. No packages are used.
fm2 <- lm(y ~ hp + wt, mt2, weights = w)
anova(fm2, fm1)
giving:
Analysis of Variance Table
Model 1: y ~ hp + wt
Model 2: y ~ g./(wt + hp) + 0
Res.Df RSS Df Sum of Sq F Pr(>F)
1 61 1095.8
2 58 58.0 3 1037.8 345.95 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Alternate wide to long calculation
An alternate way to form mt2 is by reshaping mtcars from wide form to long form using reshape.
mt2a <- mtcars |>
reshape(dir = "long", varying = list(colnames(lhs)), v.names = "y",
timevar = "g.", times = colnames(lhs)) |>
subset(select = c("wt", "hp", "g.", "y"))
or using the tidyverse (which gives the rows in a different order, but that should not matter as long as mt2b is used consistently in forming fm1 and w):
library(dplyr)
library(tidyr)
mt2b <- mtcars %>%
select(mpg, cyl, wt, hp) %>%
pivot_longer(all_of(colnames(lhs)), names_to = "g.", values_to = "y")
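As a quick check (a sketch, reusing fm0 and sigma(fm0) from above; the names w_b and fm1b are just illustrative), refitting the long-form model from mt2b should reproduce the fm0 coefficients despite the different row order:
# Weights built from mt2b's own grouping column, then the same model formula
w_b <- 1 / sigma(fm0)[mt2b$g.]^2
fm1b <- lm(y ~ g./(wt + hp) + 0, mt2b, weights = w_b)
all.equal(sort(coef(fm1b)), sort(c(t(coef(fm0)))), check.attributes = FALSE)  # expect TRUE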
I'm trying to find a solution in R that performs similarly to MATLAB's trust-region-reflective algorithm. This question has been asked before, but the author was asked to provide a reproducible example. I couldn't comment there, so the only option was to post a new question. Here's my example:
x <- c(5000,5000,5000,5000,2500,2500,2500,2500,1250,1250,1250,1250,625,625, 625,625,312,312,312,312,156,156,156,156)
y <- c(0.209065186,0.208338898,0.211886104,0.209638321,0.112064803,0.110535275,0.111748670,0.111208841,0.060416469,0.059098975,0.059274827,0.060859512,0.032178730,0.033190833,0.031621743,0.032345817,0.017983939,0.016632180,0.018468540,0.019513489,0.011490089,0.011076365,0.009282322,0.012309134)
Since initial parameter values are the central issue, I tried the 'nls2' package, which uses a 'brute-force' algorithm to find good starting parameters. Even with that, nls and nls.lm cannot reach convergence. Here's some basic code for this:
library('nls2'); library('minpack.lm')
fo <- y ~ I(A * (x ^ B) + C)
sA <- seq(-2,1,len=10) # range of parameter values
sB <- seq(-1,1,len=10)
sC <- seq(-1,1,len=10)
st1 <- expand.grid(A=sA,B=sB,C=sC)
mod1 <- nls2(fo,start=st1,algorithm="brute-force")
fit_ <- nls(fo,start=coef(mod1)) # basic nls
# or nls.lm
fit_ <- nlsLM(fo, start=coef(mod1),algorithm = "LM")
MATLAB produced:
a = 7.593e-05 (6.451e-05, 8.736e-05)
b = 0.9289 (0.9116, 0.9462)
c = 0.002553 (0.001333, 0.003772)
Goodness of fit:
SSE: 2.173e-05
R-square: 0.9998
Adjusted R-square: 0.9998
RMSE: 0.001017
and yes, using these parameter values as starting values, R also converges to this solution.
Question: how can I obtain this in R without using MATLAB?
After looking at the plotted data, I have no problem guessing suitable starting values:
plot(y ~ x)
The data are almost on a straight line through 0, so good starting values for B and C are 1 and 0, respectively. Then you only need to guesstimate the slope of the straight line. Of course, you could also use lm(y ~ x) to find starting values for A and C.
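For instance, a quick sketch of that idea (start_lm is just an illustrative name): fit a straight line and read off rough values.
start_lm <- lm(y ~ x)
coef(start_lm)   # intercept ~ rough C, slope ~ rough A (taking B = 1)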
fo <- y ~ A * (x ^ B) + C
DF <- data.frame(x, y)
fit <- nls(fo, start = list(A = 0.001, B = 1, C = 0), data = DF)
summary(fit)
#Formula: y ~ A * (x^B) + C
#
#Parameters:
# Estimate Std. Error t value Pr(>|t|)
#A 7.593e-05 5.495e-06 13.820 5.17e-12 ***
#B 9.289e-01 8.317e-03 111.692 < 2e-16 ***
#C 2.552e-03 5.866e-04 4.351 0.000281 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 0.001017 on 21 degrees of freedom
#
#Number of iterations to convergence: 5
#Achieved convergence tolerance: 9.084e-07
lines(seq(min(x), max(x), length.out = 100),
predict(fit, newdata = data.frame(x = seq(min(x), max(x), length.out = 100))),
col = "blue")
I am trying to fit a two-part line to data.
Here's some sample data:
x<-c(0.00101959664756622, 0.001929220749155, 0.00165657261751726,
0.00182514724375389, 0.00161532360585458, 0.00126991061099209,
0.00149545009309177, 0.000816386510029308, 0.00164402569283353,
0.00128029006251656, 0.00206892841921455, 0.00132378793976235,
0.000953143467154676, 0.00272964503695939, 0.00169743839571702,
0.00286411493120396, 0.0016464862337286, 0.00155672067449593,
0.000878271561566836, 0.00195872573138819, 0.00255412836538339,
0.00126212428137799, 0.00106206607962734, 0.00169140916371657,
0.000858015581562961, 0.00191955159274793, 0.00243104345247067,
0.000871042201994687, 0.00229814264111745, 0.00226756341241083)
y<-c(1.31893118849162, 0.105150790530179, 0.412732029152914, 0.25589805483046,
0.467147868109498, 0.983984462069833, 0.640007862668818, 1.51429617241365,
0.439777145282391, 0.925550163462951, -0.0555942758921906, 0.870117027565708,
1.38032147826294, -0.96757052387814, 0.346370836378525, -1.08032147826294,
0.426215616848312, 0.55151485221263, 1.41306889485598, 0.0803478641720901,
-0.86654892295057, 1.00422341998656, 1.26214517662281, 0.359512373951839,
1.4835398594013, 0.154967053938309, -0.680501679226447, 1.44740598234453,
-0.512732029152914, -0.359512373951839)
I am hoping to be able to define the best fitting two part line (hand drawn example shown)
I then define a piecewise function that should describe a two-part linear function. The definition is based on the gradients of the two lines and the point where they intersect, which together completely define the lines.
# A=gradient of first line segment
# B=gradient of second line segment
# Cx=inflection point x coord
# Cy=inflexion point y coord
out_model <- nls(y ~ I(x <= Cx)*Cy-A*(Cx-x)+I(x > Cx)*Cy+B*(x),
data = data.frame(x,y),
start = c(A=-500,B=-500,Cx=0.0001,Cy=-1.5) )
However I get the error:
Error in nls(y ~ I(x <= Cx) * Cy - A * (Cx - x) + I(x > Cx) * Cy + B * :
singular gradient
I got the basic method from Finding a curve to match data
Any ideas where I am going wrong?
I don't have an elegant answer, but I do have an answer.
(SEE THE EDIT BELOW FOR A MORE ELEGANT ANSWER)
If Cx is small enough that there are no data points to fit A and Cy to, or if Cx is big enough that there are no data points to fit B and Cy to, the QR decomposition matrix will be singular because there will be many different values of Cx, A and Cy or Cx, B and Cy respectively that will fit the data equally well.
I tested this by preventing Cx from being fitted. If I fix Cx at (say) Cx = mean(x), nls() solves the problem without difficulty:
nls(y ~ ifelse(x < mean(x),ya+A*x,yb+B*x),
data = data.frame(x,y),
start = c(A=-1000,B=-1000,ya=3,yb=0))
... gives:
Nonlinear regression model
model: y ~ ifelse(x < mean(x), ya + A * x, yb + B * x)
data: data.frame(x, y)
A B ya yb
-1325.537 -1335.918 2.628 2.652
residual sum-of-squares: 0.06614
Number of iterations to convergence: 1
Achieved convergence tolerance: 2.294e-08
That led me to think that if I transformed Cx so that it could never go outside the range [min(x), max(x)], that might solve the problem. In fact, I'd want at least three data points available to fit each of the "A" line and the "B" line, so Cx has to be between the third-lowest and the third-highest values of x. Using the atan() function with the appropriate arithmetic lets me map the range (-Inf, +Inf) onto (0, 1), so I got this code:
trans <- function(x) 0.5+atan(x)/pi
xs <- sort(x)
xlo <- xs[3]
xhi <- xs[length(xs)-2]
nls(y ~ ifelse(x < xlo+(xhi-xlo)*trans(f),ya+A*x,yb+B*x),
data = data.frame(x,y),
start = c(A=-1000,B=-1000,ya=3,yb=0,f=0))
Unfortunately, however, I still get the singular gradient matrix at initial parameters error from this code, so the problem is still over-parameterised. As #Henrik has suggested, the difference between the bilinear and single linear fit is not great for these data.
I can nevertheless get an answer for the bilinear fit. Since nls() solves the problem when Cx is fixed, I can find the value of Cx that minimises the residual sum of squares by doing a one-dimensional minimisation with optimize(). Not a particularly elegant solution, but better than nothing:
xs <- sort(x)
xlo <- xs[3]
xhi <- xs[length(xs)-2]
nn <- function(f) nls(y ~ ifelse(x < xlo+(xhi-xlo)*f,ya+A*x,yb+B*x),
data = data.frame(x,y),
start = c(A=-1000,B=-1000,ya=3,yb=0))
ssr <- function(f) sum(residuals(nn(f))^2)
f = optimize(ssr,interval=c(0,1))
print (f$minimum)
print (nn(f$minimum))
summary(nn(f$minimum))
... gives output of:
[1] 0.8541683
Nonlinear regression model
model: y ~ ifelse(x < xlo + (xhi - xlo) * f, ya + A * x, yb + B * x)
data: data.frame(x, y)
A B ya yb
-1317.215 -872.002 2.620 1.407
residual sum-of-squares: 0.0414
Number of iterations to convergence: 1
Achieved convergence tolerance: 2.913e-08
Formula: y ~ ifelse(x < xlo + (xhi - xlo) * f, ya + A * x, yb + B * x)
Parameters:
Estimate Std. Error t value Pr(>|t|)
A -1.317e+03 1.792e+01 -73.493 < 2e-16 ***
B -8.720e+02 1.207e+02 -7.222 1.14e-07 ***
ya 2.620e+00 2.791e-02 93.854 < 2e-16 ***
yb 1.407e+00 3.200e-01 4.399 0.000164 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.0399 on 26 degrees of freedom
Number of iterations to convergence: 1
There isn't a huge difference between the values of A and B and ya and yb for the optimum value of f, but there is some difference.
(EDIT -- ELEGANT ANSWER)
Having separated the problem into two steps, it isn't necessary to use nls() any more. lm() works fine, as follows:
twopart <- function(x, y)   # assigned to a name (the name is arbitrary) so the function can be called
{
f <- function (Cx)
{
lhs <- function(x) ifelse(x < Cx,Cx-x,0)
rhs <- function(x) ifelse(x < Cx,0,x-Cx)
fit <- lm(y ~ lhs(x) + rhs(x))
c(summary(fit)$r.squared,
summary(fit)$coef[1], summary(fit)$coef[2],
summary(fit)$coef[3])
}
r2 <- function(x) -(f(x)[1])
res <- optimize(r2,interval=c(min(x),max(x)))
res <- c(res$minimum,f(res$minimum))
best_Cx <- res[1]
coef1 <- res[3]
coef2 <- res[4]
coef3 <- res[5]
plot(x,y)
abline(coef1+best_Cx*coef2,-coef2) #lhs
abline(coef1-best_Cx*coef3,coef3) #rhs
}
Running this function on the sample data then gives a plot of the data with the two fitted line segments.
If the breakpoint is known, it is possible to use linear regression: "broken stick" regression, as described in Julian J. Faraway, Practical Regression and Anova using R (December 2000).
k <- 0.0025
lhs <- function(x) ifelse(x < k,k-x,0)
rhs <- function(x) ifelse(x < k,0,x-k)
fit <- lm(y ~ lhs(x) + rhs(x))
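A short follow-up sketch, reusing x, y, lhs, rhs and fit from above, to inspect the broken-stick fit and overlay it on the data (the grid xg is just for plotting):
summary(fit)
xg <- seq(min(x), max(x), length.out = 200)   # grid of x values for the fitted line
plot(x, y)
lines(xg, predict(fit, newdata = data.frame(x = xg)), col = "blue")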
The package segmented was designed for this type of problem.
First, create a regular linear regression with lm:
linmod <- lm(y ~ x)
summary(linmod)
Which gives us:
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-0.108783 -0.025432 -0.006484 0.040092 0.088638
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.630e+00 2.732e-02 96.28 <2e-16 ***
x -1.326e+03 1.567e+01 -84.63 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.04869 on 28 degrees of freedom
Multiple R-squared: 0.9961, Adjusted R-squared: 0.996
F-statistic: 7163 on 1 and 28 DF, p-value: < 2.2e-16
Next, we use the linear model to produce a segmented model with 1 break point:
library(segmented)
segmod <- segmented(linmod, seg.Z = ~x)
summary(segmod)
And the segmented model provides a slightly better r-squared:
***Regression Model with Segmented Relationship(s)***
Call:
segmented.lm(obj = linmod, seg.Z = ~x)
Estimated Break-Point(s):
Est. St.Err
0.003 0.000
Meaningful coefficients of the linear terms:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.659e+00 2.882e-02 92.239 <2e-16 ***
x -1.347e+03 1.756e+01 -76.742 <2e-16 ***
U1.x 5.167e+02 4.822e+02 1.072 NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.04582 on 26 degrees of freedom
Multiple R-Squared: 0.9968, Adjusted R-squared: 0.9964
Convergence attained in 3 iterations with relative change 0
You can check the plot, intercept and slope:
plot(segmod)
intercept(segmod)
slope(segmod)
Thanks to Henrik for putting me on the right path!
Here's a more complete and relatively elegant solution with a simple plot:
range_x <- max(x) - min(x)
intervals <- 1000
coef1 <- c()
coef2 <- c()
coef3 <- c()
r2 <- c()
for (i in 1:intervals)
{
  Cx <- min(x) + (i - 1) * (range_x / intervals)
  lhs <- function(x) ifelse(x < Cx, Cx - x, 0)
  rhs <- function(x) ifelse(x < Cx, 0, x - Cx)
  fit <- lm(y ~ lhs(x) + rhs(x))
  coef1[i] <- summary(fit)$coef[1]
  coef2[i] <- summary(fit)$coef[2]
  coef3[i] <- summary(fit)$coef[3]
  r2[i] <- summary(fit)$r.squared
}
best_r2 <- max(r2)                                     # get best r squared
pos <- which.max(r2)
best_Cx <- min(x) + (pos - 1) * (range_x / intervals)  # get Cx for best r2
plot(x, y)
abline(coef1[pos] + best_Cx * coef2[pos], -coef2[pos]) # lhs
abline(coef1[pos] - best_Cx * coef3[pos], coef3[pos])  # rhs
I am trying to write a basic function to add some lines of best fit to plots using nls.
This works fine unless the data just happens to be defined exactly by the formula passed to nls. I'm aware of the issues and that this is documented behaviour as reported here.
My question, though, is how I can get around this and force a line of best fit to be plotted regardless of whether the data are exactly described by the model. Is there a way to detect that the data match exactly and plot the perfectly fitting curve? My current dodgy solution is:
#test data
x <- 1:10
y <- x^2
plot(x, y, pch=20)
# polynomial line of best fit
f <- function(x,a,b,d) {(a*x^2) + (b*x) + d}
fit <- nls(y ~ f(x,a,b,d), start = c(a=1, b=1, d=1))
co <- coef(fit)
curve(f(x, a=co[1], b=co[2], d=co[3]), add = TRUE, col="red", lwd=2)
Which fails with the error:
Error in nls(y ~ f(x, a, b, d), start = c(a = 1, b = 1, d = 1)) :
singular gradient
The easy fix I apply is to jitter the data slightly, but this seems a bit destructive and hackish.
# the above code works after doing...
y <- jitter(x^2)
Is there a better way?
Use Levenberg-Marquardt.
x <- 1:10
y <- x^2
f <- function(x,a,b,d) {(a*x^2) + (b*x) + d}
fit <- nls(y ~ f(x,a,b,d), start = c(a=1, b=0, d=0))
Error in nls(y ~ f(x, a, b, d), start = c(a = 1, b = 0, d = 0)) :
number of iterations exceeded maximum of 50
library(minpack.lm)
fit <- nlsLM(y ~ f(x,a,b,d), start = c(a=1, b=0, d=0))
summary(fit)
Formula: y ~ f(x, a, b, d)
Parameters:
Estimate Std. Error t value Pr(>|t|)
a 1 0 Inf <2e-16 ***
b 0 0 NA NA
d 0 0 NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0 on 7 degrees of freedom
Number of iterations to convergence: 1
Achieved convergence tolerance: 1.49e-08
Note that I had to adjust the starting values and the result is sensitive to starting values.
fit <- nlsLM(y ~ f(x,a,b,d), start = c(a=1, b=0.1, d=0.1))
Parameters:
Estimate Std. Error t value Pr(>|t|)
a 1.000e+00 2.083e-09 4.800e+08 < 2e-16 ***
b -7.693e-08 1.491e-08 -5.160e+00 0.00131 **
d 1.450e-07 1.412e-08 1.027e+01 1.8e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.191e-08 on 7 degrees of freedom
Number of iterations to convergence: 3
Achieved convergence tolerance: 1.49e-08