Here is the code from Dave Tang's blog on curve fitting:
x <- c(32,64,96,118,126,144,152.5,158)
y <- c(99.5,104.8,108.5,100,86,64,35.3,15)
plot(x,y,pch=19)
fit <- lm(y~poly(x,4,raw=TRUE))
summary(fit)
Call:
lm(formula = y ~ poly(x, 4, raw = TRUE))
Residuals:
1 2 3 4 5 6 7 8
0.1242 -0.6912 1.6355 1.4491 -5.1240 4.0360 -0.4692 -0.9604
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.474e+01 5.473e+01 1.366 0.265
poly(x, 4, raw = TRUE)1 1.426e+00 3.095e+00 0.461 0.676
poly(x, 4, raw = TRUE)2 -2.854e-02 5.729e-02 -0.498 0.653
poly(x, 4, raw = TRUE)3 2.878e-04 4.278e-04 0.673 0.549
poly(x, 4, raw = TRUE)4 -1.134e-06 1.113e-06 -1.018 0.384
Residual standard error: 4.04 on 3 degrees of freedom
Multiple R-squared: 0.9943, Adjusted R-squared: 0.9868
F-statistic: 131.5 on 4 and 3 DF, p-value: 0.001064
Given that this appears to be quite a good fit, I would like to know the exact polynomial equation that was fitted. Is there any way to achieve this?
[edit]
One additional question: the p-values suggest that none of the individual terms is significant, yet the overall fit is good. Can someone please explain this?
You can use the polynomial function from the polynom package to write out the equation:
library(polynom)
x <- c(32,64,96,118,126,144,152.5,158)
y <- c(99.5,104.8,108.5,100,86,64,35.3,15)
plot(x,y,pch=19)
fit <- lm(y~poly(x,4,raw=TRUE))
p0 <- polynomial(coef(fit))
p0
# 74.73766 + 1.425813*x - 0.0285437*x^2 + 0.0002877714*x^3 - 1.133744e-06*x^4
Using signif to round the coefficients:
p0 <- polynomial(signif(coef(fit), 3))
p0
# 74.7 + 1.43*x - 0.0285*x^2 + 0.000288*x^3 - 1.13e-06*x^4
You can play a little with p0:
f0 <- as.function(p0)
f0(x)
# [1] 99.37580 105.49117 106.86449 98.55089 91.12402 59.96402 35.76922
# [8] 15.96039
predict(fit)
# 1 2 3 4 5 6 7 8
# 99.37580 105.49117 106.86449 98.55089 91.12402 59.96402 35.76922 15.96039
plot(x, y)
lines(x, f0(x), col = "grey", lwd = 2) # bold grey line
lines(x, predict(fit), col = "red", lty = 2) # dashed red line
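With only eight x values the plotted curve looks angular; a minimal sketch for a smoother curve, evaluating f0 on a finer grid (the grid xx is my own addition, not part of the original answer):
xx <- seq(min(x), max(x), length.out = 200)  # fine grid over the data range
plot(x, y, pch = 19)
lines(xx, f0(xx), col = "grey", lwd = 2)     # smooth fitted polynomial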
You have the coefficients in the summary:
f <- function(x) {
return(7.473766e+01 + 1.425813e+00*x -2.854370e-02*x^2 + 2.877714e-04*x^3 - 1.133744e-06*x^4 )
}
plot(x, y)
lines(x, f(x), col="red")
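If you would rather not hard-code the numbers, a small sketch that builds the same function directly from coef(fit) (f_coef is my own name, not part of the original answer):
f_coef <- function(x) {
  cf <- coef(fit)                          # intercept plus raw polynomial coefficients
  drop(cbind(1, x, x^2, x^3, x^4) %*% cf)  # evaluate the degree-4 polynomial
}
all.equal(f_coef(x), unname(predict(fit)))  # should be TRUE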
The data has a variable charges (the cost of insurance) and a variable age (the age of the individual).
I am trying to predict charges (y) from age and am using R.
This is the scatterplot I am getting:
My code for the plot is:
plot_age<- ggplot( train, aes(x = age, y = charges)) + geom_point()
and the summary is:
I am clearly able to see 3 different groups. Is there a way to make this model better?
I tried grouping nearby ages together (e.g. 19-23, 35-40, and so on), but I realized this wouldn't work because each visible "group" contains all the age groups.
You appear to have data like this (which I quickly simulated).
> str(dat[-3])
'data.frame': 1000 obs. of 3 variables:
$ age : num 61 62 31 57 48 42 53 24 49 51 ...
$ insurance: num 14786 1225 32650 2500 2043 ...
$ charges : num 28788 15386 37445 14885 11375 ...
I'm not sure why you don't add insurance to your model; as you noticed, charges are clearly divided into three groups, and insurers usually place us into different risk groups. Age also appears to have the usual slight polynomial influence.
The AIC indicates which model provides the best prediction; the lower, the better.
with(dat, plot(age, charges, pch=20, panel.first=grid()))
> summary(f1 <- lm(charges ~ age, dat))$coe
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5449.9211 1196.10223 4.556401 5.847361e-06
age 280.2991 27.71355 10.114154 5.845383e-23
> AIC(f1)
[1] 21627.12
> summary(f2 <- lm(charges ~ age + insurance, dat))$coe
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3787.917643 43.898173262 -86.28873 0
age 281.664438 0.987983633 285.09019 0
insurance 0.997996 0.001126929 885.58929 0
> AIC(f2)
[1] 14960.09
> summary(f3 <- lm(charges ~ poly(age, 2, raw=TRUE) + insurance, dat))$coe
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.9137452 2.997296e+01 0.4975734 6.188946e-01
poly(age, 2, raw = TRUE)1 74.0177212 1.558792e+00 47.4840371 4.450037e-258
poly(age, 2, raw = TRUE)2 2.5117835 1.865728e-02 134.6275210 0.000000e+00
insurance 0.9998817 2.577128e-04 3879.8292711 0.000000e+00
> AIC(f3)
[1] 12007.32
If you have already tried insurance and it turned out to be essentially random noise,
> summary(f4 <- lm(charges ~ poly(age, 2, raw=TRUE) + insurance2, dat))$coe
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3335.384722 3682.377581 0.9057693 0.36527701
poly(age, 2, raw = TRUE)1 395.941773 191.300209 2.0697404 0.03873402
poly(age, 2, raw = TRUE)2 -1.400207 2.289572 -0.6115584 0.54096952
insurance2 -338.126051 377.836133 -0.8949013 0.37105605
> AIC(f4)
[1] 21629.93
you could try k-means clustering instead.
set.seed(42)
cl <- as.factor(kmeans(as.matrix(dat$charges), 3)$cl)
with(dat, plot(age, charges, pch=20, panel.first=grid(), col=cl))
legend('topleft', leg=sort(unique(cl)), col=sort(unique(cl)),
title='k-means cluster', pch=20, horiz=TRUE, cex=.9)
> summary(f5 <- lm(charges ~ poly(age, 2, raw=TRUE) + relevel(cl, 2), dat))$coe
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8104.288947 1071.6690371 -7.562306 8.984321e-14
poly(age, 2, raw = TRUE)1 726.056149 55.6745999 13.041066 5.311684e-36
poly(age, 2, raw = TRUE)2 -7.388479 0.6818584 -10.835797 6.122137e-26
relevel(cl, 2)1 31740.604031 292.4633604 108.528480 0.000000e+00
relevel(cl, 2)3 10561.462117 287.9908463 36.672909 5.807241e-187
> AIC(f5)
[1] 19079.8
Cluster 2 was releveled as the reference cluster (see plot).
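A compact way to put the AIC comparison side by side (a small addition of mine, assuming the fits f1 to f5 above are still in the workspace):
sapply(list(f1 = f1, f2 = f2, f3 = f3, f4 = f4, f5 = f5), AIC)
#       f1       f2       f3       f4       f5
# 21627.12 14960.09 12007.32 21629.93 19079.80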
Simulated data:
n <- 1e3
set.seed(42)
dat <- data.frame(age=round(runif(n, 18, 65)),
group=as.factor(sample(3, n, replace=TRUE, prob=c(8/12, 2/12, 2/12)))) ## risk group
dat <- transform(dat,
insurance=ifelse(group == 1, runif(n, 1e2, 2.5e3) + rnorm(n, 100, 200),
ifelse(group == 2, runif(n, 1.25e4, 1.75e4) + rnorm(n, 0, 2e3),
runif(n, 3.25e4, 3.35e4) + rnorm(n, 1, 3e3))),
insurance2=rnorm(n)) ## random noise insurance
dat <- transform(dat, charges=0 + insurance + 75*age + 2.5*age^2 + rnorm(n, 0, 1e2))
dat$group <- NULL ## we don't know the group
I would like to use nls to fit a global parameter and group-specific parameters. The closest I have found to a minimal reproducible example is below (found here: https://stat.ethz.ch/pipermail/r-help/2015-September/432020.html)
#Generate some data
d <- transform(data.frame(x=seq(0,1,len=17),
group=rep(c("A","B","B","C"),len=17)), y =
round(1/(1.4+x^ifelse(group=="A", 2.3, ifelse(group=="B",3.1, 3.5))),2))
#Fit to model using nls
nls(y~1/(b+x^p[group]), data=d, start=list(b=1, p=rep(3,length(levels(d$group)))))
This gives me an error:
Error in numericDeriv(form[[3L]], names(ind), env, central = nDcentral) :
Missing value or an infinity produced when evaluating the model
I have not been able to figure out whether the error comes from bad guesses for the starting values or from the way this code handles the group-specific parameters. It seems p=rep(3,length(levels(d$group))) is just generating c(3,3,3), but replacing this part of the code does not remove the problem (the same error as above is obtained):
#Fit to model using nls
nls(y~1/(b+x^p[group]), data=d, start=list(b=1, p=c(3, 3, 3)))
Switching to nlsLM gives a different error, which leads me to believe I am having an issue with the group-specific parameters:
#Generate some data
library(minpack.lm)
d <- transform(data.frame(x=seq(0,1,len=17),
group=rep(c("A","B","B","C"),len=17)), y =
round(1/(1.4+x^ifelse(group=="A", 2.3, ifelse(group=="B",3.1, 3.5))),2))
#Fit to model using nlsLM
nlsLM(y~1/(b+x^p[group]), data=d, start=list(b=1, p=c(3,3,3)))
Error:
Error in dimnames(x) <- dn :
length of 'dimnames' [2] not equal to array extent
Any ideas?
I think you can do this much more easily with nlme::gnls:
fit2 <- nlme::gnls(y~1/(b+x^p),
params = list(p~group-1, b~1),
data=d,
start = list(b=1, p = rep(3,3)))
Results:
Generalized nonlinear least squares fit
Model: y ~ 1/(b + x^p)
Data: d
Log-likelihood: 62.05887
Coefficients:
p.groupA p.groupB p.groupC b
2.262383 2.895903 3.475324 1.407561
Degrees of freedom: 17 total; 13 residual
Residual standard error: 0.007188101
The params argument allows you to specify fixed-effect submodels for each nonlinear parameter. Using p ~ group - 1 parameterizes the model with a separate estimate for each group, rather than fitting a baseline (intercept) value for the first group and the differences between successive groups. (In R's formula language, -1 or +0 means "fit the model without an intercept", which in this case corresponds to fitting all three groups separately.)
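If you also want standard errors or approximate confidence intervals for the group-specific estimates, nlme provides the usual methods for gnls fits; a quick sketch (my addition, not part of the original answer):
summary(fit2)$tTable   # estimates, standard errors, t- and p-values
intervals(fit2)        # approximate 95% confidence intervals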
I'm quite surprised that gnls and nls don't give identical results (although both give reasonable results); I would like to dig into this further ...
Parameter estimates (code below):
term nls gnls
1 b 1.41 1.40
2 pA 2.28 2.28
3 pB 3.19 3.14
4 pC 3.60 3.51
par(las = 1, bty = "l")
plot(y~x, data = d, col = d$group, pch = 16)
xvec <- seq(0, 1, length = 21)
f <- function(x) factor(x, levels = c("A","B","C"))
## fit1 is nls() fit
ll <- function(g, c = 1) {
lines(xvec, predict(fit1, newdata = data.frame(group=f(g), x = xvec)), col = c)
}
Map(ll, LETTERS[1:3], 1:3)
d2 <- expand.grid(x = xvec, group = f(c("A","B","C")))
pp <- predict(fit2, newdata = d2)
ll2 <- function(g, c = 1) {
lines(xvec, pp[d2$group == g], lty = 2, col = c)
}
Map(ll2, LETTERS[1:3], 1:3)
legend("bottomleft", lty = 1:2, col = 1, legend = c("nls", "gnls"))
library(tidyverse)
library(broom)
library(broom.mixed)
(purrr::map_dfr(list(nls=fit1, gnls=fit2), tidy, .id = "pkg")
%>% select(pkg, term, estimate)
%>% group_by(pkg)
## force common parameter names
%>% mutate(across(term, ~ c("b", paste0("p", LETTERS[1:3]))))
%>% pivot_wider(names_from = pkg, values_from = estimate)
)
I was able to get this to work by changing the class of group from character to factor. Note the addition of factor() when generating the dataset.
> d <- transform(data.frame(
+ x=seq(0,1,len=17),
+ group=rep(factor(c("A","B","B","C")),len=17)),
+ y=round(1/(1.4+x^ifelse(group=="A", 2.3, ifelse(group=="B",3.1, 3.5))),2)
+ )
> str(d)
'data.frame': 17 obs. of 3 variables:
$ x : num 0 0.0625 0.125 0.1875 0.25 ...
$ group: Factor w/ 3 levels "A","B","C": 1 2 2 3 1 2 2 3 1 2 ...
$ y : num 0.71 0.71 0.71 0.71 0.69 0.7 0.69 0.69 0.62 0.64 ...
> nls(y~1/(b+x^p[group]), data=d, start=list(b=1, p=c(3,3,3)))
Nonlinear regression model
model: y ~ 1/(b + x^p[group])
data: d
b p1 p2 p3
1.406 2.276 3.186 3.601
residual sum-of-squares: 9.537e-05
Number of iterations to convergence: 5
Achieved convergence tolerance: 4.536e-06
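Since p is indexed by the factor's integer codes, p1, p2, p3 correspond to levels(d$group) in order; a small sketch to label them (fit_nls is my own name for the stored fit, not part of the original answer):
fit_nls <- nls(y ~ 1/(b + x^p[group]), data = d,
               start = list(b = 1, p = c(3, 3, 3)))
setNames(coef(fit_nls)[c("p1", "p2", "p3")], paste0("p.", levels(d$group)))
#   p.A   p.B   p.C
# 2.276 3.186 3.601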
If you try to run a polynomial regression where x^2 is defined in the lm() function, the polynomial term is dropped due to singularities. However, if we define the polynomial term outside the lm(), the model is fit correctly.
It seems like it should work the same both ways. Why do we need to define the polynomial term outside the lm() function?
x <- round(rnorm(100, mean = 0, sd = 10))
y <- round(x*2.5 + rnorm(100))
# Trying to define x^2 in the model, x^2 is dropped
model_wrong <- lm(y ~ x + x^2)
# Define x^2 as its own object
x2 <- x^2
model_right <- lm(y ~ x + x2)
Within a formula, lm does not treat ^ as arithmetic: in R's formula language x^2 means the interaction x*x, which reduces to just x, so the quadratic term silently disappears. To include an arbitrary calculation literally, wrap it in a function; for this purpose you can use I(...), which tells the formula to use the expression as-is:
set.seed(47)
x <- round(rnorm(100, mean = 0, sd = 10))
y <- round(x*2.5 + rnorm(100))
lm(y ~ x + I(x^2))
#>
#> Call:
#> lm(formula = y ~ x + I(x^2))
#>
#> Coefficients:
#> (Intercept) x I(x^2)
#> 2.563e-01 2.488e+00 -3.660e-06
Really, you can wrap x^2 in almost any function call that returns an evaluated vector usable in the model matrix. In some cases cbind can be very handy, though c, identity, or even {...} will work; I() is purpose-built for this, though.
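For instance, a quick sketch of the cbind and identity variants mentioned above (both fit the same model as the I() version):
lm(y ~ x + cbind(xsq = x^2))   # a one-column matrix works as a term
lm(y ~ x + identity(x^2))      # identity() simply returns its evaluated argument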
Alternatively, you can use the poly function to make both terms for you, which is very useful for higher-degree polynomials. By default, it generates orthogonal polynomials, which will make the coefficients look different:
lm(y ~ poly(x, 2))
#>
#> Call:
#> lm(formula = y ~ poly(x, 2))
#>
#> Coefficients:
#> (Intercept) poly(x, 2)1 poly(x, 2)2
#> 1.500000 243.485357 -0.004319
even though they will evaluate the same:
new <- data.frame(x = seq(-1, 1, .5))
predict(lm(y ~ x + I(x^2)), new)
#> 1 2 3 4 5
#> -2.2317175 -0.9876930 0.2563297 1.5003505 2.7443695
predict(lm(y ~ poly(x, 2)), new)
#> 1 2 3 4 5
#> -2.2317175 -0.9876930 0.2563297 1.5003505 2.7443695
If you really want your coefficients to be the same, add raw = TRUE:
lm(y ~ poly(x, 2, raw = TRUE))
#>
#> Call:
#> lm(formula = y ~ poly(x, 2, raw = TRUE))
#>
#> Coefficients:
#> (Intercept) poly(x, 2, raw = TRUE)1 poly(x, 2, raw = TRUE)2
#> 2.563e-01 2.488e+00 -3.660e-06
I ran a linear regression whose coefficient t-tests (two-tailed) have 178 degrees of freedom. The summary function gives me two p-values for my two t-values.
t value Pr(>|t|)
5.06 1.04e-06 ***
10.09 < 2e-16 ***
...
...
F-statistic: 101.8 on 1 and 178 DF, p-value: < 2.2e-16
I want to manually calculate the p-values for these t-values with this formula:
p = 1 - 2*F(|t|)
p_value_1 <- 1 - 2 * pt(abs(t_1), 178)
p_value_2 <- 1 - 2 * pt(abs(t_2), 178)
I don't get the same p-values as in the model summary. Therefore, I want to know how the summary function Pr(>|t|) is different from my formula, as I can't find the definition of Pr(>|t|).
Can you help me? Thanks a lot!
It is
2 * pt(-abs(t_value), df)
For example:
2 * pt(-5.06, 178)
#[1] 1.038543e-06
2 * pt(-10.09, 178)
#[1] 3.223683e-19
Alternatively, use
2 * pt(abs(t_value), df, lower.tail = FALSE)
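A quick check that the two forms agree, using the first t-value from the question:
2 * pt(-5.06, 178)
#[1] 1.038543e-06
2 * pt(5.06, 178, lower.tail = FALSE)
#[1] 1.038543e-06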
We can compute the p value Pr(>|t|) in the following different ways:
tval <- 5.06
df <- 178
# compute area under the t-pdf
integrate(function(x) dt(x, df), -Inf, -tval)$value + integrate(function(x) dt(x, df), tval, Inf)$value # sum of two areas
# [1] 1.038543e-06
1-integrate(function(x) dt(x, df), -tval, tval)$value
# [1] 1.038543e-06
# 2-sided t-test: Pr_T(|t|>|tval|) = 2*(1 - F_T(|tval|)) = 2*F_T(-|tval|), where T~t(df=178)
2*(1 - pt(tval, df))
# [1] 1.038543e-06
2*pt(tval, df, lower.tail = FALSE)
# [1] 1.038543e-06
1 - (pt(tval, df) - pt(-tval, df))
# [1] 1.038543e-06
2*pt(-tval, df)
# [1] 1.038543e-06
The following illustrates the same thing geometrically, with a different (less extreme) value of the t-statistic. As you can see, there are two symmetric blue regions that together represent the corresponding probability under the two-sided t-test.
df <- 178
x <- seq(-6, 6,0.01)
y <- dt(x, df)
tval <- 1.25
plot(x, y, type='l', main='t-distribution and p-value (5% significance level, 2-sided t-test)')
abline(h=0)
abline(v = c(tval, -tval), col='red')
index1 <- which(x >= -tval)[1]
index2 <- which(x >= tval)[1]
polygon(x = c(x[1:index1], x[index1], x[1]),
y = c(y[1:index1], 0, 0),
col = "blue")
polygon(x = c(x[index2], x[index2], x[index2:length(x)]),
y = c(0, y[index2], y[index2:length(y)]),
col = "blue")
I have a linear model with an R-squared of almost 0. I am writing a function with one parameter, n, which describes the power transformation to be applied.
If n = 3 the model becomes:
y = x1 + x2 + x1^2 + x2^2 + x1^3 + x2^3
How can I enter these terms in the model without having to write them out again and again?
You can use the poly function in the formula like this:
set.seed(123)
dat <- data.frame(y=rnorm(10), x1=rnorm(10), x2=rnorm(10))
n <-3
fm <-lm(y ~ poly(x1, degree=n, raw=TRUE)+poly(x2, degree=n, raw=TRUE), data=dat)
summary(fm)
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.50796 0.81187 0.626 0.576
## poly(x1, degree = n, raw = TRUE)1 -0.54354 0.86195 -0.631 0.573
## poly(x1, degree = n, raw = TRUE)2 -0.66328 0.55169 -1.202 0.315
## poly(x1, degree = n, raw = TRUE)3 0.05989 0.35421 0.169 0.876
## poly(x2, degree = n, raw = TRUE)1 1.06890 1.00518 1.063 0.366
## poly(x2, degree = n, raw = TRUE)2 0.01655 0.76730 0.022 0.984
## poly(x2, degree = n, raw = TRUE)3 -1.18610 0.84214 -1.408 0.254
degree is of course the maximum degree for x1 and x2. With raw=TRUE the fit is equivalent to x1 + I(x1^2) + ...; with raw=FALSE the polynomials are orthogonal.
Note that the number at the end of each coefficient name gives the degree of the associated polynomial term.
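A quick check of the equivalence mentioned above (fm_raw and fm_expl are my own names; the fitted values should match):
fm_raw  <- lm(y ~ poly(x1, degree=n, raw=TRUE) + poly(x2, degree=n, raw=TRUE), data=dat)
fm_expl <- lm(y ~ x1 + I(x1^2) + I(x1^3) + x2 + I(x2^2) + I(x2^3), data=dat)
all.equal(fitted(fm_raw), fitted(fm_expl))
## [1] TRUE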
PS: you can use poly(x1, x2, degree=n, raw=TRUE) to write a similar formula that includes interaction terms.
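For example, a sketch of the interaction variant using the same dat and n as above (fm2 is my own name; with only 10 observations this saturated fit is just for inspecting the term names):
fm2 <- lm(y ~ poly(x1, x2, degree=n, raw=TRUE), data=dat)
names(coef(fm2))
## the suffixes encode the degrees of x1 and x2 in each term,
## e.g. "...1.0" is x1, "...1.1" is the x1*x2 interaction, "...0.3" is x2^3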