How can you get R's glm() to match polynomial data? I've tried several iterations of 'family=AAA(link="BBB")' but I can't seem to get trivial predictions to match.
For example, here is a trivial case where I'd like the predictions to reproduce the data:
x=seq(-6,6,2)
y=x*x
parabola=data.frame(x,y)
plot(parabola)
model=glm(y~x,dat=parabola)
test=data.frame(x=seq(-5,5,2))
test$y=predict(model,test)
plot(test)
The plot(parabola) looks as expected, but I can't find the incantation of glm() that will make plot(test) look parabolic.
I think you need to step back and think about the model and how you represent it in R. In your example, y is a quadratic function of x, so you need to include x and x^2 in the model formula; i.e. as predictors you need to estimate the effect of x and of x^2 on the response, given the data to hand.
If y is Gaussian, conditional upon the model, then you can do this with lm() and either
y ~ x + I(x^2)
or
y ~ poly(x, 2)
In the first, we wrap the quadratic term in I() because the ^ operator has a special meaning (not its mathematical one) in an R model formula. The latter version gives orthogonal polynomials, so the x and x^2 terms won't be correlated, which can help with fitting; however, interpreting the coefficients is trickier with poly().
Putting it all together we have (note that I add some random error to y so it isn't predicted perfectly; noisy data are more common in reality):
x <- seq(-6, 6, 2)
y <- x^2 + rnorm(length(x), sd = 2)
parabola <- data.frame(x = x, y = y)
mod <- lm(y ~ poly(x, 2), data = parabola)
plot(parabola)
lines(fitted(mod) ~ x, data = parabola, col = "red")
The plot produced shows the data with the fitted quadratic curve overlaid in red.
An additional issue is whether y is Gaussian. If y can't be negative (e.g. a count), and/or is discrete, modelling with lm() is going to be wrong. That's where glm() might come in: you might be able to fit a curve without needing the x^2 term (although if the data really are a parabola, x on its own isn't going to fit the response), because there is an explicit transformation from the linear predictor onto the scale of the response.
It is better to think about the properties of the data and the sort of model you want to fit, and then build up the degree of polynomial within that modelling framework, rather than jumping in and trying various incantations simply to curve-fit the data.
The match is now perfect. A slightly more interesting parabola:
x=seq(-16,16,2)
y= 4*x*x + 10*x + 6
parabola=data.frame(x,y)
plot(parabola)
model=lm(y~poly(x,2),data=parabola)
summary(model)
test=data.frame(x=seq(-15,15,2))
test$y=predict(model,test)
points(test,pch=3)
An amateur (like me) might expect the coefficients of the model to be (4,10,6) to match 4*x*x + 10*x + 6
Call:
lm(formula = y ~ poly(x, 2), data = parabola)
Residuals:
Min 1Q Median 3Q Max
-3.646e-13 -8.748e-14 -3.691e-14 4.929e-14 6.387e-13
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.900e+02 5.192e-14 7.511e+15 <2e-16 ***
poly(x, 2)1 4.040e+02 2.141e-13 1.887e+15 <2e-16 ***
poly(x, 2)2 1.409e+03 2.141e-13 6.581e+15 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.141e-13 on 14 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 2.343e+31 on 2 and 14 DF, p-value: < 2.2e-16
Why would the coefficients be (390,404,1409)?
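This is because poly() uses an orthogonal basis by default: the intercept is then simply the mean of y (here 390), and the other two coefficients are on the orthogonal basis rather than on x and x^2 directly. A small hedged sketch (model_raw is just an illustrative name) of how to recover the familiar coefficients by refitting on the raw basis, using the same parabola data frame as above:
# refit on the raw (non-orthogonal) polynomial basis; the coefficients should
# come back as (Intercept) = 6, x = 10, x^2 = 4, matching y = 4*x*x + 10*x + 6
model_raw <- lm(y ~ poly(x, 2, raw = TRUE), data = parabola)
coef(model_raw)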
I used a linear model (the lm() function) to obtain the best fit to my data.
From the literature I know that the optimal fit would be a linear regression with slope = 1 and intercept = 0. I would like to see how well this equation (y = x) fits my data. How do I proceed in order to find an R^2 as well as a p-value?
This is my data
(y = modelled, x = measured)
measured<-c(67.39369,28.73695,60.18499,49.32405,166.39318,222.29022,271.83573,241.72247, 368.46304,220.27018,169.92343,56.49579,38.18381,49.33753,130.91752,161.63536,294.14740,363.91029,358.32905,239.84112,129.65078,32.76462,30.13952,52.83656,67.35427,132.23034,366.87857,247.40125,273.19316,278.27902,123.24256,45.98363,83.50199,240.99459,266.95707,308.69814,228.34256,220.51319,83.97942,58.32171,57.93815,94.64370,264.78007,274.25863,245.72940,155.41777,77.45236,70.44223,104.22838,294.01645,312.42321,122.80831,41.65770,242.22661,300.07147,291.59902,230.54478,89.42498,55.81760,55.60525,111.64263,305.76432,264.27192,233.28214,192.75603,75.60803,63.75376)
modelled<-c(42.58318,71.64667,111.08853,67.06974,156.47303,240.41188,238.25893,196.42247,404.28974,138.73164,116.73998,55.21672,82.71556,64.27752,145.84891,133.67465,295.01014,335.25432,253.01847,166.69241,68.84971,26.03600,45.04720,75.56405,109.55975,202.57084,288.52887,140.58476,152.20510,153.99427,75.70720,92.56287,144.93923,335.90871,NA,264.25732,141.93407,122.80440,83.23812,42.18676,107.97732,123.96824,270.52620,388.93979,308.35117,100.79047,127.70644,91.23133,162.53323,NA ,276.46554,100.79440,81.10756,272.17680,387.28700,208.29715,152.91548,62.54459,31.98732,74.26625,115.50051,324.91248,210.14204,168.29598,157.30373,45.76027,76.07370)
Now I would like to see how well the equation y = x fits the data presented above (R^2 and p-value).
I would be very grateful if somebody could help me with this (basic) problem, as I found no answers to my question on Stack Overflow.
Best regards, Cyril
Let's be clear what you are asking here. You have an existing model, which is "the modelled values are the expected value of the measured values", or in other words, measured = modelled + e, where e are the normally distributed residuals.
You say that the "optimal fit" should be a straight line with intercept 0 and slope 1, which is another way of saying the same thing.
The thing is, this "optimal fit" is not the optimal fit for your actual data, as we can easily see by doing:
summary(lm(measured ~ modelled))
#>
#> Call:
#> lm(formula = measured ~ modelled)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -103.328 -39.130 -4.881 40.428 114.829
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 23.09461 13.11026 1.762 0.083 .
#> modelled 0.91143 0.07052 12.924 <2e-16 ***
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> Residual standard error: 55.13 on 63 degrees of freedom
#> Multiple R-squared: 0.7261, Adjusted R-squared: 0.7218
#> F-statistic: 167 on 1 and 63 DF, p-value: < 2.2e-16
This shows us the line that would produce the optimal fit to your data in terms of reducing the sum of the squared residuals.
But I guess what you are asking is "How well do my data fit the model measured = modelled + e ?"
Trying to coerce lm into giving you a fixed intercept and slope probably isn't the best way to answer this question. Remember, the p value for the slope only tells you whether the actual slope is significantly different from 0. The above model already confirms that. If you want to know the r-squared of measured = modelled + e, you just need to know the proportion of the variance of measured that is explained by modelled. In other words:
1 - var(measured - modelled, na.rm = TRUE) / var(measured)
#> [1] 0.7192672
This is pretty close to the r squared from the lm call.
I think you have sufficient evidence to say that your data is consistent with the model measured = modelled, in that the slope in the lm model includes the value 1 within its 95% confidence interval, and the intercept contains the value 0 within its 95% confidence interval.
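If you want to check that directly, confint() on the same fit gives the intervals (a quick sketch reusing the lm call from above):
# the interval for the intercept should contain 0 and the interval for the
# 'modelled' slope should contain 1
confint(lm(measured ~ modelled))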
As mentioned in the comments, you can use the lm() function, but this actually estimates the slope and intercept for you, whereas what you want is something different.
If slope = 1 and the intercept = 0, essentially you have a fit and your modelled is already the predicted value. You need the r-square from this fit. R squared is defined as:
R2 = MSS/TSS = (TSS − RSS)/TSS
See this link for definition of RSS and TSS.
We can only work with observations that are complete (non NA). So we calculate each of them:
nonNA = !is.na(modelled) & !is.na(measured)
# residual sum of squares from your prediction
RSS = sum((modelled[nonNA] - measured[nonNA])^2)
# total sum of squares of the data
TSS = sum((measured[nonNA] - mean(measured[nonNA]))^2)
1 - RSS/TSS
[1] 0.7116585
If measured and modelled are supposed to represent the actual and fitted values of an undisclosed model, as discussed in the comments below another answer, then if fm is the lm object for that undisclosed model then
summary(fm)
will show the R^2 and p value of that model.
The R squared value can actually be calculated using only measured and modelled, but the formula is different depending on whether or not there is an intercept in the undisclosed model. The signs point to there being no intercept, since if there were an intercept then sum(modelled - measured, na.rm = TRUE) should be 0, but in fact it is far from it.
In any case, the R^2 and the p value are shown in the output of summary(fm), where fm is the undisclosed linear model, so there is no point in restricting the discussion to measured and modelled if you have the lm object of the undisclosed model.
For example, if the undisclosed model were the following, using the builtin CO2 data frame:
fm <- lm(uptake ~ Type + conc, CO2)
summary(fm)
we have this output, where the last two lines show the R squared and the p value.
Call:
lm(formula = uptake ~ Type + conc, data = CO2)
Residuals:
Min 1Q Median 3Q Max
-18.2145 -4.2549 0.5479 5.3048 12.9968
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.830052 1.579918 16.349 < 2e-16 ***
TypeMississippi -12.659524 1.544261 -8.198 3.06e-12 ***
conc 0.017731 0.002625 6.755 2.00e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.077 on 81 degrees of freedom
Multiple R-squared: 0.5821, Adjusted R-squared: 0.5718
F-statistic: 56.42 on 2 and 81 DF, p-value: 4.498e-16
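To make the point about the intercept concrete, here is a hedged sketch of the two R^2 formulas computed directly from measured and modelled on the complete cases (ok, m and p are just illustrative names):
ok <- complete.cases(measured, modelled)
m <- measured[ok]; p <- modelled[ok]
# R^2 if the undisclosed model has an intercept
1 - sum((m - p)^2) / sum((m - mean(m))^2)
# R^2 if it does not (summary.lm then uses sum(y^2) as the total sum of squares)
1 - sum((m - p)^2) / sum(m^2)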
As a follow-up to the question Creating confidence intervals for regression curve in GLMM using Bootstrapping, I am interested in getting the correct values of a regression curve and the associated confidence interval curves.
Consider a case where in a GLMM, there is one response variable, two continuous fixed effects and one random effect. Here is some fake data:
library (dplyr)
set.seed (1129)
x1 <- runif(100,0,1)
x2 <- rnorm(100,0.5,0.4)
f1 <- gl(n = 5,k = 20)
rnd1<-rnorm(5,0.5,0.1)
my_data <- data.frame(x1=x1, x2=x2, f1=f1)
modmat <- model.matrix(~x1+x2, my_data)
fixed <- c(-0.12,0.35,0.09)
y <- (modmat%*%fixed+rnd1)
my_data$y <- ((y - min (y))/max(y- min (y))) %>% round (digits = 1)
rm (y)
The GLMM that I fit looks like this:
library(lme4)
m1 <- glmer(y ~ x1 + x2 + (1|f1), my_data, family = "binomial")
summary(m1)
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: binomial ( logit )
Formula: y ~ x1 + x2 + (1 | f1)
Data: my_data
AIC BIC logLik deviance df.resid
65.7 76.1 -28.8 57.7 96
Scaled residuals:
Min 1Q Median 3Q Max
-8.4750 -0.7042 -0.0102 1.5904 14.5919
Random effects:
Groups Name Variance Std.Dev.
f1 (Intercept) 1.996e-10 1.413e-05
Number of obs: 100, groups: f1, 5
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.668 2.051 -4.713 2.44e-06 ***
x1 12.855 2.659 4.835 1.33e-06 ***
x2 4.875 1.278 3.816 0.000136 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) x1
x1 -0.970
x2 -0.836 0.734
convergence code: 0
boundary (singular) fit: see ?isSingular
Plotting y vs x1:
plot (y~x1, my_data)
It should be possible to get a regression curve from the summary of m1. I have learned that I need to invert the link function (in this case, "logit"):
y = 1/(1+exp(-(Intercept+b*x1+c*x2)))
In order to plot a regression curve of x1 in a two-dimensional space, I set x2 = mean(x2) in the formula (which also seems important - the red line in the following plots ignores x2, apparently leading to considerable bias). The regression line:
xx <- seq (from = 0, to = 1, length.out = 100)
yy <- 1/(1+exp(-(-9.668+12.855*xx+4.875*mean(x2))))
yyy <- 1/(1+exp(-(-9.668+12.855*xx)))
lines (yy ~ xx, col = "blue")
lines (yyy~ xx, col = "red")
I think the blue line doesn't look so good (and the red line worse, of course). So as a side-question: is y = 1/(1+exp(-(Intercept+b*x1+c*x2))) always the right choice as a back-transformation of the logit link? I am asking because I found this https://sebastiansauer.github.io/convert_logit2prob/, which made me suspicious. Or is there another reason for the model not fitting so well? Maybe my data creation process is somewhat 'bad'.
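For what it's worth, the inverse logit is also available in R as plogis(), which should agree with the manual formula; a small check reusing xx and the coefficients from the summary above:
eta <- -9.668 + 12.855*xx + 4.875*mean(x2)
all.equal(1/(1 + exp(-eta)), plogis(eta))  # TRUE: same back-transformation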
What I need now is to add the 95%-confidence interval to the curve. I think that Bootstrapping using the bootMer function should be a good approach. However, all examples that I found were on models with one single fixed effect. #Jamie Murphy asked a similar question, but he was interested in models containing a continuous and a categorical variable as fixed effects here: Creating confidence intervals for regression curve in GLMM using Bootstrapping
But when it comes to models with more than one continuous variables as fixed effects, I get lost. Perhaps someone can help solve this issue - possibly with a modification of the second part of this tutorial:
https://www.r-bloggers.com/2015/06/confidence-intervals-for-prediction-in-glmms/
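One possible approach, sketched under the assumption that m1 and my_data from above are available (newdat and pred_fun are hypothetical names): predict along a grid of x1 with x2 held at its mean, mirroring the blue line, and bootstrap that prediction function with bootMer.
library(lme4)
# grid over x1 with x2 fixed at its mean
newdat <- data.frame(x1 = seq(0, 1, length.out = 100),
                     x2 = mean(my_data$x2))
# prediction on the response scale, ignoring random effects (re.form = NA)
pred_fun <- function(fit) predict(fit, newdata = newdat, re.form = NA, type = "response")
set.seed(1)
bb <- bootMer(m1, pred_fun, nsim = 200)
# pointwise 95% bootstrap interval at each x1 value
ci <- apply(bb$t, 2, quantile, probs = c(0.025, 0.975))
plot(y ~ x1, my_data)
lines(newdat$x1, pred_fun(m1), col = "blue")
lines(newdat$x1, ci[1, ], lty = 2, col = "blue")
lines(newdat$x1, ci[2, ], lty = 2, col = "blue")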
My R-script produces glm() coeffs below.
What is Poisson's lambda, then? It should be ~3.0 since that's what I used to create the distribution.
Call:
glm(formula = h_counts ~ ., family = poisson(link = log), data = pois_ideal_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-22.726 -12.726 -8.624 6.405 18.515
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 8.222532 0.015100 544.53 <2e-16 ***
h_mids -0.363560 0.004393 -82.75 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 11451.0 on 10 degrees of freedom
Residual deviance: 1975.5 on 9 degrees of freedom
AIC: 2059
Number of Fisher Scoring iterations: 5
random_pois = rpois(10000,3)
h=hist(random_pois, breaks = 10)
mean(random_pois) #verifying that the mean is close to 3.
h_mids = h$mids
h_counts = h$counts
pois_ideal_data <- data.frame(h_mids, h_counts)
pois_ideal_model <- glm(h_counts ~ ., pois_ideal_data, family=poisson(link=log))
summary_ideal=summary(pois_ideal_model)
summary_ideal
What are you doing here???!!! You used a glm to fit a distribution???
Well, it is not impossible to do so, but it is done via this:
set.seed(0)
x <- rpois(10000,3)
fit <- glm(x ~ 1, family = poisson())
i.e., we fit data with an intercept-only regression model.
fit$fitted[1]
# 3.005
This is the same as:
mean(x)
# 3.005
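Equivalently, since the default Poisson link is log, the intercept is log(lambda), so exponentiating it recovers the same estimate:
exp(coef(fit))
# ~ 3.005, the same value as fit$fitted[1] and mean(x)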
It looks like you're trying to do a Poisson fit to aggregated or binned data; that's not what glm does. I took a quick look for canned ways of fitting distributions to binned data but couldn't find one; it looks like earlier versions of the bda package might have offered this, but not now.
At root, what you need to do is set up a negative log-likelihood function that sums (# counts) * log prob(count | lambda) over the bins, negates it, and minimizes it using optim(); the solution given below using the bbmle package is a little more complex up-front, but gives you added benefits like easily computing confidence intervals etc.
Set up data:
set.seed(101)
random_pois <- rpois(10000,3)
tt <- table(random_pois)
dd <- data.frame(counts=unname(c(tt)),
val=as.numeric(names(tt)))
Here I'm using table rather than hist because histograms on discrete data are fussy (having integer cutpoints often makes things confusing because you have to be careful about right- vs left-closure)
Set up density function for binned Poisson data (to work with bbmle's formula interface, the first argument must be called x, and it must have a log argument).
dpoisbin <- function(x,val,lambda,log=FALSE) {
probs <- dpois(val,lambda,log=TRUE)
r <- sum(x*probs)
if (log) r else exp(r)
}
Fit lambda (log link helps prevent numerical problems/warnings from negative lambda values):
library(bbmle)
m1 <- mle2(counts~dpoisbin(val,exp(loglambda)),
data=dd,
start=list(loglambda=0))
all.equal(unname(exp(coef(m1))),mean(random_pois),tol=1e-6) ## TRUE
exp(confint(m1))
## 2.5 % 97.5 %
## 2.972047 3.040009
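As a sanity check, for the Poisson the MLE of lambda from binned counts is simply the count-weighted mean of the bin values, so (using the same dd as above) this should reproduce the mle2 fit:
with(dd, sum(val * counts) / sum(counts))  # same as mean(random_pois)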
Let's say I've got data entered into a data frame like so:
df = data.frame(x = c(1,2,3,4,5,10,15,25,50),
y = c(.57,.75,.82,0.87,.89,.95,.97,.98,.99))
df
and I wish to fit the expression:
y = ((x/a)^b)/(1+(x/a)^b)
where a and b are unknown parameters.
I have plotted the points and drawn a fitted line by guessing the values of a and b:
library(ggplot2)
graph <- ggplot(df, aes(x=x, y=y))
graph <- graph + geom_point()
a = 0.50
b = 1.00
guesstimate <- function(x){((x/a)^b)/(1+(x/a)^b)}
graph <- graph + stat_function(fun = guesstimate)
graph
However, I'd like to find the values of a and b that produce the highest R^2 value, i.e. the best possible mathematical fit to the data.
Question:
Short of guessing through the values of a and b manually and checking with the naked eye which fit is best, is there a way to get R to find the 'best' a and b values along with providing the R-squared value which confirms to me that the chosen a and b values are indeed the best possible fit?
You can use the nls (non-linear least squares) function:
m1 = nls(y ~ (x/a)^b/(1+(x/a)^b), list(a=1, b=1), data=df)
summary(m1)
Formula: y ~ (x/a)^b/(1 + (x/a)^b)
Parameters:
Estimate Std. Error t value Pr(>|t|)
a 0.779291 0.009444 82.51 1.01e-11 ***
b 1.145174 0.012733 89.94 5.53e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.003086 on 7 degrees of freedom
Number of iterations to convergence: 4
Achieved convergence tolerance: 5.949e-08
ggplot(df, aes(x,y)) +
geom_point() +
geom_line(data=data.frame(x=seq(1,50,0.1), y=predict(m1, newdata=list(x=seq(1,50,0.1)))),
aes(x,y), colour="red")
nls does not provide an r-squared value, because, as discussed in this thread on R-help, r-squared is not necessarily meaningful for a non-linear model. nls does, however, find the parameter values that minimize the residual sum-of-squares, so in that sense these parameters provide the best fit for the given data and model. That doesn't mean that there isn't another model specification that gives a better fit, though in this case the model fit is virtually perfect.
Even if it's not obvious, a linear model can be applied here, just using basic algebra. Indeed, starting from 1/y = (1+(x/a)^b)/(x/a)^b and with a little manipulation, you can arrive at:
log(1/y - 1) = -b*log(x) + b*log(a)
which is basically a linear model in the variables Y = log(1/y - 1) and X = log(x). From here, you can use lm:
df2 <- data.frame(Y = log(1/df$y - 1), X = log(df$x))
coeffs <- lm(Y ~ X, data = df2)$coefficients
a <- exp(-coeffs[1]/coeffs[2])
# 0.7491387
b <- -coeffs[2]
# 1.116111
which are similar to those obtained with nls.
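A natural follow-up, sketched under the assumption that coeffs from the lm fit above is available: the linearised estimates make good starting values for nls, which then minimises the residual sum of squares on the original scale.
m2 <- nls(y ~ (x/a)^b / (1 + (x/a)^b), data = df,
          start = list(a = unname(exp(-coeffs[1]/coeffs[2])), b = unname(-coeffs[2])))
coef(m2)  # should land close to the estimates reported in the nls answer above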
I have two variables ENERGY and TEMP
I have created two other variables, temp2 and temp3:
> temp2 <- data$temp^2
> temp3 <- data$temp^3
> data=cbind(data, energy, temp, temp2, temp3)
Now to create a cubic model would it look just like a linear model?
>model<-lm(energy~temp+temp2+temp3)
Edit:
Ok so I did what you suggested and this is the output:
> ?poly
> model<- lm( energy ~ poly(temp, 3) , data=data )
> summary(model)
Call:
lm(formula = energy ~ poly(temp, 3), data = data)
Residuals:
Min 1Q Median 3Q Max
-19.159 -11.257 -2.377 9.784 26.841
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 95.50 3.21 29.752 < 2e-16 ***
poly(temp, 3)1 207.90 15.72 13.221 2.41e-11 ***
poly(temp, 3)2 -50.07 15.72 -3.184 0.00466 **
poly(temp, 3)3 81.59 15.72 5.188 4.47e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 15.73 on 20 degrees of freedom
Multiple R-squared: 0.9137, Adjusted R-squared: 0.9008
F-statistic: 70.62 on 3 and 20 DF, p-value: 8.105e-11
I would assume that I would test for goodness of fit the same way and look at the Pr(>|t|) values. This would lead me to believe that all of the terms are significant.
would I be able to use this fitted regression model to predict the average energy consumption for an average difference in temperature?
Instead of coding up the polynomial terms by hand, you should consider using the poly function:
?poly # Polynomial contrasts
model<- lm( energy ~ poly(temp, 3) , data=data )
If you want to use the same columns as you would have gotten with the hand-coded approach (which is not good for statistical inference purposes), you can use the 'raw' parameter:
model.r<- lm( energy ~ poly(temp, 3, raw=TRUE) , data=data )
Predictions will be the same, but the standard errors will not. This should give you the same estimates as would be returned by #RomanLuštrik's suggestion. The raw terms will not be orthogonal, so they will be highly correlated and you will be unable to make correct inferences about their independent effects.
Added question: "would I be able to use this fitted regression model to predict the average energy consumption for an average difference in temperature?"
No. You would need to specify two particular temperatures, and then predict could give you a difference, but that difference will vary depending on the reference point, even if the magnitude of the temperature difference is the same. That is a consequence of using a non-linear term. Maybe you should describe your goals and use a forum that is more geared to methods questions. SO is for coding when you know what you want to do; http://stats.stackexchange.com may be more appropriate once you have formulated your question with more clarity.
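To illustrate with a hypothetical sketch (assuming the model and data objects from the edit above; the temperatures are made up), the same 5-degree difference gives different predicted changes depending on the starting temperature, because the fitted relationship is cubic:
diff(predict(model, newdata = data.frame(temp = c(10, 15))))
diff(predict(model, newdata = data.frame(temp = c(20, 25))))
# the two differences will generally not be equal for a non-linear fit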
There are two ways to do polynomial regression with lm:
lm( y ~ x + I(x^2) + I(x^3) )
and
lm( y ~ poly(x, 3, raw=TRUE) )
(That's cubic. I'm sure you can generalise to quartic, quintic, etc.)
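A minimal check on hypothetical toy data that the two forms are the same model (identical design matrix up to column names, hence identical coefficients and fitted values):
set.seed(1)
x <- runif(30)
y <- 1 + 2*x - 3*x^2 + 0.5*x^3 + rnorm(30, sd = 0.1)
f1 <- lm(y ~ x + I(x^2) + I(x^3))
f2 <- lm(y ~ poly(x, 3, raw = TRUE))
all.equal(unname(coef(f1)), unname(coef(f2)))  # TRUE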