Why do I get different values of R-squared for these two models, which should be equivalent (in the second model, the intercept term is replaced by a level of z)? Is this a bug or am I missing something?
set.seed(42)
N=100
# intercepts
iA = 3
iB = 3.5
# slopes
sA = 1.5
sB = 0.5
# xs
xA = runif(0,1, n=N)
xB = runif(0,1, n=N)
# ys
yA = sA*xA + iA + rnorm(n=N)/10
yB = sB*xB + iB + rnorm(n=N)/10
data = data.frame(x=c(xA, xB), y=c(yA, yB), z=c(rep("A", times=N), rep("B", times=N)))
lm1 = lm(data=data, formula = y ~ x + z)
lm2 = lm(data=data, formula = y ~ x + z -1)
coef(lm1)
coef(lm2)
summary(lm1)$r.squared
summary(lm2)$r.squared
Output:
> coef(lm1)
(Intercept) x zB
3.23590275 1.03353472 -0.01435266
> coef(lm2)
x zA zB
1.033535 3.235903 3.221550
>
> summary(lm1)$r.squared
[1] 0.7552991
> summary(lm2)$r.squared
[1] 0.9979477
For models with an intercept, summary.lm calculates an R^2 that compares the model to the intercept-only model. For a model without an intercept that comparison does not make sense, so it compares the model to the zero model instead. Of course, in your example the intercept is actually a linear combination of columns in the model matrix, i.e. all(model.matrix(lm2) %*% c(0, 1, 1) == 1) is TRUE, so it would be possible to write software that checks whether the intercept-only model is a submodel of the full model; but as it stands, summary.lm only looks at whether the model formula specifies an intercept or not.
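To confirm that the two fits themselves are identical and only the R^2 baseline differs, here is a quick sketch using the objects defined above:
all.equal(fitted(lm1), fitted(lm2))         # TRUE: same fitted values
all.equal(resid(lm1), resid(lm2))           # TRUE: same residuals
all(model.matrix(lm2) %*% c(0, 1, 1) == 1)  # TRUE: intercept is recoverable as zA + zB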
In terms of calculations, for models with an intercept summary.lm uses the equivalent of
1 - sum(resid(lm1)^2) / sum((data$y - mean(data$y))^2)
## [1] 0.7552991
1 - sum(resid(lm2)^2) / sum((data$y - mean(data$y))^2)
## [1] 0.7552991
but for models without an intercept summary.lm drops the mean term
1 - sum(resid(lm2)^2) / sum(data$y^2)
## [1] 0.9979477
You can compare these to
summary(lm1)$r.squared
## [1] 0.7552991
summary(lm2)$r.squared
## [1] 0.9979477
See ?summary.lm where this is mentioned.
From help("summary.lm") (emphasis added):
R², the ‘fraction of variance explained by the model’,
R^2 = 1 - \frac{\sum_i R_i^2}{\sum_i (y_i - y^*)^2},
where y^* is the mean of y_i
if there is an intercept and zero otherwise.
If you remove the intercept, R² is defined differently (which is sensible from the perspective of a statistician).
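If you do want the mean-centered R^2 for the no-intercept parameterization anyway, one workaround (a quick sketch, relying on the fact that the intercept lies in the column space of lm2's model matrix) is the squared correlation between observed and fitted values:
cor(data$y, fitted(lm2))^2
## [1] 0.7552991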
I'm trying to figure out how the coefficients of logistic regression with a polynomial term relate to predictions. Specifically, I'm interested in the location on the x-axis where the prediction is highest. Example below:
set.seed(42)
# Setup some dummy data
x <- 1:200
y <- rep(0, length(x))
y[51:150] <- rbinom(100, 1, 0.5)
# Fit a model
family <- binomial()
model <- glm(y ~ poly(x, 2), family = family)
# Illustrate model
plot(x, y)
lines(x, family$linkinv(predict(model)), col = 2)
The model above gives me these coefficients:
coef(model)
#> (Intercept) poly(x, 2)1 poly(x, 2)2
#> -1.990317 -3.867855 -33.299893
Created on 2021-08-03 by the reprex package (v1.0.0)
The manual page for poly() states the following:
The orthogonal polynomial is summarized by the coefficients, which can be used to evaluate it via the three-term recursion given in Kennedy & Gentle (1980, pp. 343–4), and used in the predict part of the code.
However, I don't have access to the book, nor am I able to discern from the predict.glm S3 method how these coefficients are handled. Is there a way to reconstruct the location of the summit (around 100 in the example) from the coefficients alone (i.e. without using predict() to find the maximum)?
Derivation of the location of the predicted maximum from the theoretical expressions of the orthogonal polynomials
I got a copy of the "Statistical Computing" book by Kennedy and Gentle (1980) referenced in the documentation of poly, and I now share my findings about how the orthogonal polynomials are calculated and how we can use them to find the location x of the maximum predicted value.
The orthogonal polynomials presented in the book (pp. 343-4) are monic (i.e. the highest order coefficient is always 1) and are obtained by the following recurrence procedure:
p(-1, x) = 0
p(0, x) = 1
p(j+1, x) = (x - rho(j+1)) * p(j, x) - gamma(j) * p(j-1, x),   j = 0, 1, ..., q-1
where q is the number of orthogonal polynomials considered.
Note the following relationship of the above terminology with the documentation of poly:
The "three-term recursion" appearing in the excerpt included in your question is the RHS of the third expression which has precisely three terms.
The rho(j+1) coefficients in the third expression are called "centering constants".
The gamma(j) coefficients in the third expression do not have a name in the documentation but they are directly related to the "normalization constants", as seen below.
For reference, here I paste the relevant excerpt of the "Value" section of the poly documentation:
A matrix with rows corresponding to points in x and columns corresponding to the degree, with attributes "degree" specifying the degrees of the columns and (unless raw = TRUE) "coefs" which contains the centering and normalization constants used in constructing the orthogonal polynomials
Going back to the recurrence, we can derive the values of parameters rho(j+1) and gamma(j) from the third expression by imposing the orthogonality condition on p(j+1) w.r.t. p(j) and p(j-1).
(It's important to note that the orthogonality condition is not an integral, but a summation on the n observed x points, so the polynomial coefficients depend on the data! --which is not the case for instance for the Tchebyshev orthogonal polynomials).
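This data dependence is easy to check numerically (a quick sketch with the x = 1:200 from the question): the columns of poly(x, 2) are orthonormal, and orthogonal to the constant polynomial, only as sums over these particular x values.
P <- poly(x, 2)
round(colSums(P), 12)    # both ~ 0: orthogonal to the constant polynomial
round(crossprod(P), 12)  # 2 x 2 identity: orthonormal over the observed x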
The expressions for the parameters become:
rho(j+1) = sum_i[ x_i * p(j, x_i)^2 ] / sum_i[ p(j, x_i)^2 ]
gamma(j) = sum_i[ p(j, x_i)^2 ] / sum_i[ p(j-1, x_i)^2 ]
For the polynomials of orders 1 and 2 used in your regression, we get the following expressions, already written in R code:
# First we define the number of observations in the data
n = length(x)
# For p1(x):
# p1(x) = (x - rho1) p0(x) (since p_{-1}(x) = 0)
rho1 = mean(x)
# For p2(x)
# p2(x) = (x - rho2) p1(x) - gamma1
gamma1 = var(x) * (n-1)/n
rho2 = sum( x * (x - mean(x))^2 ) / (n*gamma1)
for which we get:
> c(rho1, rho2, gamma1)
[1] 100.50 100.50 3333.25
Note that coefs attribute of poly(x,2) is:
> attr(poly(x,2), "coefs")
$alpha
[1] 100.5 100.5
$norm2
[1] 1 200 666650 1777555560
where $alpha contains the centering constants, i.e. the rho values (which coincide with ours; incidentally, all centering constants equal the average of x whenever the distribution of x is symmetric, for any q, as observed and proved), and $norm2 contains the normalization constants, in this case for p(-1,x), p(0,x), p(1,x), and p(2,x). These are the constants c(j) that normalize the polynomials in the recurrence formula: dividing p(j,x) by sqrt(c(j)) yields polynomials r(j,x) satisfying sum_over_i{ r(j,x_i)^2 } = 1, and these r(j,x) are the polynomials stored in the object returned by poly().
From the expression already given above, we observe that gamma(j) is precisely the ratio of two consecutive normalization constants, namely: gamma(j) = c(j) / c(j-1).
We can check that our gamma1 value coincides with this ratio by computing:
gamma1 == attr(poly(x,2), "coefs")$norm2[3] / attr(poly(x,2), "coefs")$norm2[2]
which returns TRUE.
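As a further consistency check (a sketch reusing rho1, rho2 and gamma1 from above), normalizing the monic polynomials by the square roots of these norm2 constants should reproduce the columns returned by poly(x, 2):
p1.monic <- x - rho1
p2.monic <- (x - rho2) * p1.monic - gamma1
P <- poly(x, 2)
all.equal(as.numeric(P[, 1]), p1.monic / sqrt(sum(p1.monic^2)))  # should be TRUE
all.equal(as.numeric(P[, 2]), p2.monic / sqrt(sum(p2.monic^2)))  # should be TRUE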
Going back to your problem of finding the maximum of the values predicted by your model, we can:
Express the predicted value as a function of r(1,x) and r(2,x) and the coefficients from the logistic regression, namely:
pred(x) = beta0 + beta1 * r(1,x) + beta2 * r(2,x)
Differentiate the expression w.r.t. x, set the derivative to 0 and solve for x.
In R code:
# Get the scaling constants alpha(j) needed to obtain r(j,x) from p(j,x) as
# r(j,x) = p(j,x) / sqrt( norm(j) ) = p(j,x) / alpha(j)
# (these alpha(j) are not the $alpha centering constants of poly, despite the name)
alpha1 = sqrt( attr(poly(x,2), "coefs")$norm2[3] )
alpha2 = sqrt( attr(poly(x,2), "coefs")$norm2[4] )
# Get the logistic regression coefficients (beta1 and beta2)
coef1 = as.numeric( model$coeff["poly(x, 2)1"] )
coef2 = as.numeric( model$coeff["poly(x, 2)2"] )
# Compute the x at which the maximum occurs, using the expression obtained by
# differentiating the prediction pred(x) = beta0 + beta1*r(1,x) + beta2*r(2,x)
# w.r.t. x and setting the derivative to 0.
xmax = ( alpha2^-1 * coef2 * (rho1 + rho2) - alpha1^-1 * coef1 ) / (2 * alpha2^-1 * coef2)
which gives:
> xmax
[1] 97.501114
i.e. the same value obtained with the other "empirical" method described in my previous answer.
The full code to obtain the location x of the maximum of the predicted values, starting off from the code you provided, is:
# First we define the number of observations in the data
n = length(x)
# Parameters for p1(x):
# p1(x) = (x - rho1) p0(x) (since p_{-1}(x) = 0)
rho1 = mean(x)
# Parameters for p2(x)
# p2(x) = (x - rho2) p1(x) - gamma1
gamma1 = var(x) * (n-1)/n
rho2 = mean( x * (x - mean(x))^2 ) / gamma1
# Get the scaling constants alpha(j) needed to obtain r(j,x) from p(j,x) as
# r(j,x) = p(j,x) / sqrt( norm(j) ) = p(j,x) / alpha(j)
# (these alpha(j) are not the $alpha centering constants of poly, despite the name)
alpha1 = sqrt( attr(poly(x,2), "coefs")$norm2[3] )
alpha2 = sqrt( attr(poly(x,2), "coefs")$norm2[4] )
# Get the logistic regression coefficients (beta1 and beta2)
coef1 = as.numeric( model$coeff["poly(x, 2)1"] )
coef2 = as.numeric( model$coeff["poly(x, 2)2"] )
# Compute the x at which the maximum occurs, using the expression obtained by
# differentiating the prediction pred(x) = beta0 + beta1*r(1,x) + beta2*r(2,x)
# w.r.t. x and setting the derivative to 0.
( xmax = ( alpha2^-1 * coef2 * (rho1 + rho2) - alpha1^-1 * coef1 ) / (2 * alpha2^-1 * coef2) )
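As a quick sanity check (a sketch), scanning the fitted values on the integer grid, as the "empirical" approach does, should point to essentially the same location:
x[which.max(predict(model))]  # should land at 97 or 98, right next to xmax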
Assuming you want to find the maximum of the prediction analytically for this particular case where the orthogonal polynomials are of order 1 and 2, I propose the following approach:
SUMMARY
1) Infer the polynomial coefficients
This can easily be done by fitting a linear model to the respective polynomial values contained in the model matrix.
2) Derive the prediction expression w.r.t. x and set the derivative to 0
Solve for x in the prediction expression inferred from the polynomial fit in (1) and obtain the value of x at which the prediction's maximum occurs.
DETAILS
1) Polynomial coefficients
Following from the line where you fit the GLM model, we estimate the coefficients for the polynomial of order 1, p1(x) = a0 + a1*x, and the coefficients for the polynomial of order 2, p2(x) = b0 + b1*x + b2*x^2:
X = model.matrix(model)
p1 = X[, "poly(x, 2)1"]
p2 = X[, "poly(x, 2)2"]
p1.lm = lm(p1 ~ x)
a0 = p1.lm$coeff["(Intercept)"]
a1 = p1.lm$coeff["x"]
p2.lm = lm(p2 ~ x + I(x^2))
b0 = p2.lm$coeff["(Intercept)"]
b1 = p2.lm$coeff["x"]
b2 = p2.lm$coeff["I(x^2)"]
This gives:
> c(a0, a1, b0, b1, b2)
(Intercept) x (Intercept) x I(x^2)
-1.2308840e-01 1.2247602e-03 1.6050353e-01 -4.7674315e-03 2.3718565e-05
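These auxiliary fits are exact because the poly() columns are, by construction, a linear and a quadratic function of x; a quick sketch to confirm that their residuals are numerically zero:
max(abs(resid(p1.lm)))  # effectively 0 (up to floating point)
max(abs(resid(p2.lm)))  # effectively 0 (up to floating point)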
2) Derivative of the prediction to find the maximum
The expression for the prediction z (before applying the inverse link function) is:
z = Intercept + coef1 * p1(x) + coef2 * p2(x)
We differentiate this expression with respect to x and set the derivative to 0 to obtain:
coef1 * a1 + coef2 * (b1 + 2 * b2 * xmax) = 0
Solving for xmax we get:
xmax = - (coef1 * a1 + coef2 * b1) / (2 * coef2 * b2)
In R code, this is computed as:
coef1 = as.numeric( model$coeff["poly(x, 2)1"] )
coef2 = as.numeric( model$coeff["poly(x, 2)2"] )
(xmax = - ( coef1 * a1 + coef2 * b1 ) / (2 * coef2 * b2))
which gives:
x
97.501114
CHECK
We can verify the maximum by adding it to the prediction's curve as a green cross:
# Prediction curve computed analytically
Intercept = model$coeff["(Intercept)"]
pred.analytical = family$linkinv( Intercept + coef1 * p1 + coef2 * p2 )
# Find the prediction's maximum analytically
pred.max = family$linkinv( Intercept + coef1 * (a0 + a1 * xmax) +
coef2 * (b0 + b1 * xmax + b2 * xmax^2) )
# Plot
plot(x, y)
# The following two lines should coincide!
lines(x, pred.analytical, col = 3)
lines(x, family$linkinv(predict(model)), col = 2)
# Location of the maximum!
points(xmax, pred.max, pch="x", col="green")
which produces a plot in which the analytically computed curve (green) coincides with the predict()-based curve (red), and the green cross marks the location of the maximum.
I've read many similar questions but still couldn't find the answer.
Here is some data that I'm using to calibrate the equation below:
set.seed(100)
i <- sort(rexp(n = 100,rate = 0.01))
Tr <- sort(runif(n = 100,min = 5,max = 100))
k_start <- 3259
u_start <- 0.464
t0_start <- 38
n_start <- -1
i_test <- k_start*Tr^u_start * (5 + t0_start)^n_start
m <- nls(i~(k * Tr^u / (5+t0)^n), start = list(k = k_start, u = u_start,
t0 = t0_start, n = n_start))
I also tried nlsLM, and the same error came up:
Error in nlsModel(formula, mf, start, wts) : singular gradient matrix at initial parameter estimates
For the start values, I also tried the values obtained from a calibration in Python, but the same error still occurs.
There's also another way to write that equation, shown below, but the result is the same error:
d_start <- 43
m <- nls(i ~ (k * Tr^u / d),
start = list(k = k_start, u = u_start,d=d_start))
When I use only the numerator it works, but that's not what I need.
Any help will be very much appreciated.
In the first nls, the right hand side depends on k, t0 and n only through k / (5+t0)^n, so the model is over-parameterized: a single parameter could represent their combined effect. In the second nls the right hand side depends on k and d only through k / d, so again the problem is over-parameterized and one parameter could represent their combined effect.
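To make the over-parameterization concrete, here is a small sketch (with arbitrary alternative values t0 = 10 and n = 2): any combination of k, t0 and n with the same value of k / (5 + t0)^n produces exactly the same right hand side, so the data cannot identify the three parameters separately.
K_eff <- k_start / (5 + t0_start)^n_start                # the only identified combination
rhs1 <- k_start * Tr^u_start / (5 + t0_start)^n_start
rhs2 <- (K_eff * (5 + 10)^2) * Tr^u_start / (5 + 10)^2   # different k, t0 = 10, n = 2
all.equal(rhs1, rhs2)                                    # TRUE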
After getting rid of the excess parameters and obtaining starting values from a linear model fitted on the log scale, nls converges.
fit.lm <- lm(log(i) ~ log(Tr))
co <- coef(fit.lm)
fit <- nls(i ~ k * Tr ^ u, start = list(k = exp(co[[1]]), u = co[[2]]))
fit
## Nonlinear regression model
## model: i ~ k * Tr^u
## data: parent.frame()
## k u
## 0.0002139 3.0941602
## residual sum-of-squares: 79402
##
## Number of iterations to convergence: 43
## Achieved convergence tolerance: 5.354e-06
Reciprocal Model
Below we fit a "reciprocal model" which has the same number of parameters but a better fit, as measured by the deviance (here equal to the residual sum of squares; a lower value means a better fit).
# reciprocal model
fit.recip <- nls(i ~ 1/(a + b * log(Tr)), start = list(a = 1, b = 1))
deviance(fit)
## [1] 79402.17
deviance(fit.recip)
## [1] 25488.1
Graphics
Below we plot both fit (red) and fit.recip (blue) models.
plot(i ~ Tr)
lines(fitted(fit) ~ Tr, col = "red")
lines(fitted(fit.recip) ~ Tr, col = "blue")
legend("topleft", legend = c("fit", "fit.recip"), lty = 1, col = c("red", "blue"))
(continued after plot)
plinear
Note that the plinear algorithm could be used as an alternative way to fit the fit model above, avoiding the need to supply a starting value for k. It also has the benefit that it requires substantially fewer iterations in this case (14 vs. 43). With plinear the formula should omit the linear parameter, k, as it is implied by the algorithm and will be reported as .lin.
nls(i ~ Tr ^ u, start = list(u = co[[2]]), algorithm = "plinear")
## Nonlinear regression model
## model: i ~ Tr^u
## data: parent.frame()
## u .lin
## 3.0941725 0.0002139
## residual sum-of-squares: 79402
##
## Number of iterations to convergence: 14
## Achieved convergence tolerance: 3.848e-06
I have existing code for fitting a sigmoid curve to data in R. How can I use selfStart (or another method) to automatically find starting values for the regression?
sigmoid = function(params, x) {
params[1] / (1 + exp(-params[2] * (x - params[3])))
}
dataset = data.frame("x" = 1:53, "y" =c(0,0,0,0,0,0,0,0,0,0,0,0,0,0.1,0.18,0.18,0.18,0.33,0.33,0.33,0.33,0.41,0.41,0.41,0.41,0.41,0.41,0.5,0.5,0.5,0.5,0.68,0.58,0.58,0.68,0.83,0.83,0.83,0.74,0.74,0.74,0.83,0.83,0.9,0.9,0.9,1,1,1,1,1,1,1) )
x = dataset$x
y = dataset$y
# fitting code
fitmodel <- nls(y~a/(1 + exp(-b * (x-c))), start=list(a=1,b=.5,c=25))
# visualization code
# get the coefficients using the coef function
params=coef(fitmodel)
y2 <- sigmoid(params,x)
plot(y2,type="l")
points(y)
This is a common (and interesting) problem in non-linear curve fitting.
Background
We can find sensible starting values if we take a closer look at the function sigmoid,
f(x) = a / (1 + exp(-b * (x - c)))
We first note that, for b > 0,
exp(-b * (x - c)) -> 0   as x -> Inf
So for large values of x, the function approaches a. In other words, as a starting value for a we may choose the value of y for the largest value of x.
In R language, this translates to y[which.max(x)].
Now that we have a starting value for a, we need to decide on starting values for b and c. To do that, we can make use of the geometric series
1 / (1 + z) = 1 - z + z^2 - z^3 + ...   (for |z| < 1)
and expand f(x) = y by keeping only the first two terms:
y ≈ a * (1 - exp(-b * (x - c)))
We now set a = 1 (our starting value for a), re-arrange the equation and take the logarithm on both sides:
log(1 - y) ≈ -b * x + b * c
We can now fit a linear model of the form log(1 - y) ~ x to obtain estimates for the slope and intercept, which in turn provide the starting values for b and c.
R implementation
Let's define a function that takes as an argument the values x and y and returns a list of parameter starting values
start_val_sigmoid <- function(x, y) {
  # regress log(a - y) on x; the small 1e-6 offset avoids log(0) where y equals a
  fit <- lm(log(y[which.max(x)] - y + 1e-6) ~ x)
  list(
    a = y[which.max(x)],                       # plateau: y at the largest x
    b = unname(-coef(fit)[2]),                 # b = -slope
    c = unname(-coef(fit)[1] / coef(fit)[2]))  # c = -intercept / slope
}
Based on the data for x and y you give, we obtain the following starting values
start_val_sigmoid(x, y)
#$a
#[1] 1
#
#$b
#[1] 0.2027444
#
#$c
#[1] 15.01613
Since start_val_sigmoid returns a list we can use its output directly as the start argument in nls
nls(y ~ a / ( 1 + exp(-b * (x - c))), start = start_val_sigmoid(x, y))
#Nonlinear regression model
# model: y ~ a/(1 + exp(-b * (x - c)))
# data: parent.frame()
# a b c
# 1.0395 0.1254 29.1725
# residual sum-of-squares: 0.2119
#
#Number of iterations to convergence: 9
#Achieved convergence tolerance: 9.373e-06
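Regarding the selfStart part of the question (an aside, not part of the answer above): base R also ships a self-starting logistic model, SSlogis, which fits the same curve parameterized as Asym / (1 + exp((xmid - x)/scal)) and computes its own starting values, so no start argument is needed. A minimal sketch with the data from the question, where Asym corresponds to a, xmid to c and 1/scal to b:
fit_ss <- nls(y ~ SSlogis(x, Asym, xmid, scal), data = dataset)  # should fit without an explicit start list
coef(fit_ss)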
Sample data
dataset = data.frame("x" = 1:53, "y" =c(0,0,0,0,0,0,0,0,0,0,0,0,0,0.1,0.18,0.18,0.18,0.33,0.33,0.33,0.33,0.41,0.41,0.41,0.41,0.41,0.41,0.5,0.5,0.5,0.5,0.68,0.58,0.58,0.68,0.83,0.83,0.83,0.74,0.74,0.74,0.83,0.83,0.9,0.9,0.9,1,1,1,1,1,1,1) )
x = dataset$x
y = dataset$y
I want to estimate a binomial model with the R package MCMCglmm. The model shall include an intercept and a slope, both as fixed and as random effects. How do I specify a valid prior? (Note: here is a similar question, but in a much more complicated setting.)
Assume the data have the following form:
y x cluster
1 0 -0.56047565 1
2 1 -0.23017749 1
3 0 1.55870831 1
4 1 0.07050839 1
5 0 0.12928774 1
6 1 1.71506499 1
In fact, the data have been generated by
set.seed(123)
nj <- 15 # number of individuals per cluster
J <- 30 # number of clusters
n <- nj * J
x <- rnorm(n)
y <- rbinom(n, 1, prob = 0.6)
cluster <- factor(rep(1:J, each = nj))
dat <- data.frame(y = y, x = x, cluster = cluster)
The information about the model in the question suggests specifying fixed = y ~ 1 + x and random = ~ us(1 + x):cluster. With us() you allow the random effects to be correlated (cf. section 3.4 and table 2 in Hadfield's 2010 jstatsoft article).
First of all, as you only have one dependent variable (y), the G part of the prior (cf. equation 4 and section 3.6 in Hadfield's 2010 jstatsoft article) for the random-effects variances only needs one list element, called G1. This list element is not the actual prior distribution; that was specified by Hadfield to be an inverse-Wishart distribution. With G1 you specify the parameters of this inverse-Wishart distribution, namely the scale matrix (Ψ in Wikipedia notation, V in MCMCglmm notation) and the degrees of freedom (ν in Wikipedia notation, nu in MCMCglmm notation). As you have two random effects (the intercept and the slope), V has to be a 2 x 2 matrix. A frequent choice is the two-dimensional identity matrix diag(2). Hadfield often uses nu = 0.002 for the degrees of freedom (cf. his course notes).
Now you also have to specify the R part of the prior, for the residual variance. Here again an inverse-Wishart distribution is assumed, with the user specifying its parameters. As there is only one residual variance, V has to be a scalar (let's say V = 0.5). An optional element of R is fix. With this element you specify whether the residual variance shall be fixed to a certain value (then you write fix = TRUE or fix = 1) or not (then fix = FALSE or fix = 0). Note that you do not fix the residual variance at 0.5 by writing fix = 0.5! So when you find fix = 1 in Hadfield's course notes, read it as fix = TRUE and look at which value of V the residual variance was fixed.
All together, we set up the prior as follows:
prior0 <- list(G = list(G1 = list(V = diag(2), nu = 0.002)),
R = list(V = 0.5, nu = 0.002, fix = FALSE))
With this prior we can run MCMCglmm:
library("MCMCglmm") # for MCMCglmm()
set.seed(123)
mod0 <- MCMCglmm(fixed = y ~ 1 + x,
random = ~ us(1 + x):cluster,
data = dat,
family = "categorical",
prior = prior0)
The draws from the Gibbs-sampler for the fixed effects are found in mod0$Sol, the draws for the variance parameters in mod0$VCV.
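For instance, a quick look at the output (a sketch; with us(1 + x):cluster the VCV part has four (co)variance columns for the random effects plus a final "units" column for the residual variance):
summary(mod0$Sol)           # posterior summaries of the fixed effects (Intercept) and x
colnames(mod0$VCV)          # four G (co)variance components followed by "units"
plot(mod0$VCV[, "units"])   # trace and density plot of the residual variance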
Normally a binomial model requires the residual variance to be fixed, so we now fix the residual variance at 0.5:
set.seed(123)
prior1 <- list(G = list(G1 = list(V = diag(2), nu = 0.002)),
R = list(V = 0.5, nu = 0.002, fix = TRUE))
mod1 <- MCMCglmm(fixed = y ~ 1 + x,
random = ~ us(1 + x):cluster,
data = dat,
family = "categorical",
prior = prior1)
The difference can be seen by comparing mod0$VCV[, 5] to mod1$VCV[, 5]. In the latter case, all entries are 0.5, as specified.
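A minimal way to see this (a sketch):
range(mod1$VCV[, "units"])    # should be 0.5 0.5: the residual variance is fixed
summary(mod0$VCV[, "units"])  # freely estimated in mod0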
I am trying to estimate the constants of Heaps' law.
I have the following dataset, novels_collection:
  Number of novels DistinctWords WordOccurrences
1                1         13575          117795
2                1         34224          947652
3                1         40353         1146953
4                1         55392         1661664
5                1         60656         1968274
Then I built the following function:
# Function for Heaps law
heaps <- function(K, n, B){
K*n^B
}
heaps(2,117795,.7) #Just to test it works
So n = WordOccurrences, and K and B are the constants I need to find in order to predict DistinctWords.
I tried this but it gives me an error:
fitHeaps <- nls(DistinctWords ~ heaps(K,WordOccurrences,B),
data = novels_collection[,2:3],
start = list(K = .1, B = .1), trace = T)
Error in numericDeriv(form[[3L]], names(ind), env) :
  Missing value or an infinity produced when evaluating the model
Any idea in how could I fix this or a method to fit the function and get the values for K and B?
If you take log transform on both sides of y = K * n ^ B, you get log(y) = log(K) + B * log(n). This is a linear relationship between log(y) and log(n), hence you can fit a linear regression model to find log(K) and B.
logy <- log(novels_collection$DistinctWords)
logn <- log(novels_collection$WordOccurrences)
fit <- lm(logy ~ logn)
para <- coef(fit)        ## c(log(K), B)
para[1] <- exp(para[1])  ## now c(K, B)
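As a follow-up sketch (not part of the original answer): these values can also serve as starting values for nls on the original scale, which should avoid the "missing value or infinity" error that poor starting values caused above.
fitHeaps2 <- nls(DistinctWords ~ K * WordOccurrences^B,
                 data = novels_collection,
                 start = list(K = para[[1]], B = para[[2]]))
coef(fitHeaps2)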
With minpack.lm we can fit the non-linear model directly, though I suspect it will be more prone to overfitting than a linear model on the log-transformed variables (as done by Zheyuan). We could compare the residuals of the linear and non-linear models on a held-out dataset to see which does better empirically, which would be interesting.
library(minpack.lm)
fitHeaps = nlsLM(DistinctWords ~ heaps(K, WordOccurrences, B),
data = novels_collection[,2:3],
start = list(K = .01, B = .01))
coef(fitHeaps)
# K B
# 5.0452566 0.6472176
plot(novels_collection$WordOccurrences, novels_collection$DistinctWords, pch=19)
lines(novels_collection$WordOccurrences, predict(fitHeaps, newdata = novels_collection[,2:3]), col='red')