When I fit a polynomial regression model in R using the function poly() and then try to get the variance inflation factors with vif(), I get the error shown below.
y = c(0.22200,0.39500,0.42200,0.43700,0.42800,0.46700,0.44400,0.37800,0.49400,
0.45600,0.45200,0.11200,0.43200,0.10100,0.23200,0.30600,0.09230,0.11600,
0.07640,0.43900,0.09440,0.11700,0.07260,0.04120,0.25100,0.00002)
x1 = c(7.3,8.7,8.8,8.1,9.0,8.7,9.3,7.6,10.0,8.4,9.3,7.7,9.8,7.3,8.5,9.5,7.4,7.8,
7.7,10.3,7.8,7.1,7.7,7.4,7.3,7.6)
x2 = c(0.0,0.0,0.7,4.0,0.5,1.5,2.1,5.1,0.0,3.7,3.6,2.8,4.2,2.5,2.0,2.5,2.8,2.8,
3.0,1.7,3.3,3.9,4.3,6.0,2.0,7.8)
x3 = c(0.0,0.3,1.0,0.2,1.0,2.8,1.0,3.4,0.3,4.1,2.0,7.1,2.0,6.8,6.6,5.0,7.8,7.7,
8.0,4.2,8.5,6.6,9.5,10.9,5.2,20.7)
m = lm(y~poly(x1, x2, x3, degree=2, raw=TRUE))
summary(m)
Now call vif()
> vif(m)
Error in vif.default(m) : model contains fewer than 2 terms
The model has 9 terms and an intercept.
> m$rank
[1] 10
It seems to me that the vif() function does not work with poly(). Is this correct? Is there a way around this, or do I need to compute the VIFs from first principles?
I can calculate the variance inflation factors like so:
X = poly(x1, x2, x3, degree=2, raw=TRUE)
# VIF_j = 1/(1 - R_j^2), i.e. the j-th diagonal element of the inverse of the predictor correlation matrix
vifs = diag(solve(cor(X)))
I don't see much point in using poly with raw=TRUE, especially in this case where you want multiple terms and poly delivers a very opaque labeling of its results. It's also unclear which package's vif is being used. I chose to experiment with I() to create the "pure" second-degree terms ( ... is "homogeneous" the correct term here?) and the R formula interface (using (...)^2) for the rest, and I had no trouble understanding the results, in contrast to what I got with poly:
> m = lm(y~(x1+x2+x3)^2 + I(x1^2) + I(x2^2) + I(x3^2))
> summary(m)
Call:
lm(formula = y ~ (x1 + x2 + x3)^2 + I(x1^2) + I(x2^2) + I(x3^2))
Residuals:
Min 1Q Median 3Q Max
-0.063213 -0.037282 -0.001113 0.016738 0.122539
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.769364 1.286976 -1.375 0.1881
x1 0.420799 0.294173 1.430 0.1718
x2 0.222453 0.130742 1.701 0.1082
x3 -0.127995 0.070245 -1.822 0.0872 .
I(x1^2) -0.019325 0.016797 -1.150 0.2668
I(x2^2) -0.007449 0.012048 -0.618 0.5451
I(x3^2) 0.000824 0.001441 0.572 0.5754
x1:x2 -0.019876 0.012037 -1.651 0.1182
x1:x3 0.009151 0.007621 1.201 0.2473
x2:x3 0.002576 0.007039 0.366 0.7192
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.06092 on 16 degrees of freedom
Multiple R-squared: 0.9169, Adjusted R-squared: 0.8702
F-statistic: 19.63 on 9 and 16 DF, p-value: 5.051e-07
> rms::vif(m)
x1 x2 x3 I(x1^2) I(x2^2) I(x3^2) x1:x2 x1:x3 x2:x3
521.01297 401.58833 688.02220 501.50614 173.60055 99.67708 204.43081 456.00750 349.97018
In an effort to figure out which version of vif was "not working", I tried both rms::vif and HH::vif; neither one threw the error you encountered, so I'm not sure where your error came from:
> m = lm(y~poly(x1, x2, x3, degree=2, raw=TRUE))
> HH::vif(m)
poly(x1, x2, x3, degree = 2, raw = TRUE)1.0.0 poly(x1, x2, x3, degree = 2, raw = TRUE)2.0.0
521.01297 501.50614
poly(x1, x2, x3, degree = 2, raw = TRUE)0.1.0 poly(x1, x2, x3, degree = 2, raw = TRUE)1.1.0
401.58833 204.43081
poly(x1, x2, x3, degree = 2, raw = TRUE)0.2.0 poly(x1, x2, x3, degree = 2, raw = TRUE)0.0.1
173.60055 688.02220
poly(x1, x2, x3, degree = 2, raw = TRUE)1.0.1 poly(x1, x2, x3, degree = 2, raw = TRUE)0.1.1
456.00750 349.97018
poly(x1, x2, x3, degree = 2, raw = TRUE)0.0.2
99.67708
> rms::vif(m)
poly(x1, x2, x3, degree = 2, raw = TRUE)1.0.0 poly(x1, x2, x3, degree = 2, raw = TRUE)2.0.0
521.01297 501.50614
poly(x1, x2, x3, degree = 2, raw = TRUE)0.1.0 poly(x1, x2, x3, degree = 2, raw = TRUE)1.1.0
401.58833 204.43081
poly(x1, x2, x3, degree = 2, raw = TRUE)0.2.0 poly(x1, x2, x3, degree = 2, raw = TRUE)0.0.1
173.60055 688.02220
poly(x1, x2, x3, degree = 2, raw = TRUE)1.0.1 poly(x1, x2, x3, degree = 2, raw = TRUE)0.1.1
456.00750 349.97018
poly(x1, x2, x3, degree = 2, raw = TRUE)0.0.2
99.67708
Maybe it was from car?
> car::vif(m)
Error in vif.default(m) : model contains fewer than 2 terms
If so, the proper way to handle this would seem to be filing an issue with John Fox (presumably car::vif works at the level of model terms, and the whole poly(...) call counts as a single term, hence "fewer than 2 terms"). Does he have a GitHub page for issues? You should be able to answer that by looking at the output of packageDescription("car"). Nope. You will need to send him an email. Here's how to get that address:
maintainer("car")
Addendum:
m = lm(y~(x1+x2+x3)^2 + I(x1^2) + I(x2^2) + I(x3^2))
> car::vif(m)
there are higher-order terms (interactions) in this model
consider setting type = 'predictor'; see ?vif
x1 x2 x3 I(x1^2) I(x2^2) I(x3^2) x1:x2 x1:x3 x2:x3
521.01297 401.58833 688.02220 501.50614 173.60055 99.67708 204.43081 456.00750 349.97018
So maybe there's some useful information in the help page? Might be worth seeing what that message is all about.
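The message points at the type argument of car::vif; a minimal sketch of what it suggests (assuming a car version recent enough to have type = "predictor"; output not shown):
car::vif(m, type = "predictor")   # GVIFs computed per predictor rather than per model term, as the warning recommends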
n = 50
set.seed(100)
x = matrix(runif(n, -2, 2), nrow=n)
y = 2 + 0.75*sin(x) - 0.75*cos(x) + rnorm(n, 0, 0.2)
y = γ0 + γ1*x + γ2*x^2 + γ3*x^3 + ε
In R, I want to estimate the above polynomial function by the least squares method.
That is, I want the estimates of γ0, γ1, γ2 and γ3.
I also want to know the MSE of the fitted function.
I used this:
summary(lm(y ~ x+ x^2+ x^3))
But I only get this output:
Call:
lm(formula = y ~ x + (x^2) + (x^3))
Residuals:
Min 1Q Median 3Q Max
-0.66448 -0.22251 -0.07694 0.20647 0.79429
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.54972 0.04761 32.55 <2e-16 ***
x 0.65279 0.04633 14.09 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3319 on 48 degrees of freedom
Multiple R-squared: 0.8053, Adjusted R-squared: 0.8013
F-statistic: 198.6 on 1 and 48 DF, p-value: < 2.2e-16
Please tell me what function or package in R I can use to do this. Thank you.
You need to wrap I(.) around your polynomial terms to indicate to R that ^ should be treated as an arithmetic operator rather than a formula operator.
Let's create some example data to illustrate.
set.seed(42)
x <- runif(100)
y <- x + x^2 + x^3 + rnorm(100)
What you want is this:
lm(y ~ x + I(x^2) + I(x^3))$coe
# (Intercept) x I(x^2) I(x^3)
# -0.4069207 3.4580770 -4.0516060 4.3227674
or:
lm(y ~ poly(x, 3, raw=TRUE))$coe
# (Intercept) poly(x, 3, raw = TRUE)1 poly(x, 3, raw = TRUE)2 poly(x, 3, raw = TRUE)3
# -0.4069207 3.4580770 -4.0516060 4.3227674
Without the I(.) the result is different,
lm(y ~ x + x^2 + x^3)$coe
# (Intercept) x
# -0.6149136 3.3590395
because in a formula ^ is a crossing operator, not exponentiation, so x^2 reduces to just x and R interprets the formula as
lm(y ~ x)$coe
# (Intercept) x
# -0.6149136 3.3590395
all(lm(y ~ x)$coe - lm(y ~ x + x^2 + x^3)$coe == 0)
# [1] TRUE
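To see what ^ means as a formula operator, look at the expanded term labels (w here is just a hypothetical second predictor; terms() works symbolically, so it need not exist):
attr(terms(y ~ (x + w)^2), "term.labels")
# [1] "x"   "w"   "x:w"    -- ^ crosses terms rather than squaring them
attr(terms(y ~ x^2), "term.labels")
# [1] "x"                  -- x crossed with itself is just x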
However, consider:
lm(y ~ I(x + x^2 + x^3))$coe
# (Intercept) I(x + x^2 + x^3)
# -0.191611 1.141912
Creating a variable z
z <- x + x^2 + x^3
gives the same result:
lm(y ~ z)$coe
# (Intercept) z
# -0.191611 1.141912
Or, to be more explicit: say we want the interaction of x with a; then we use * as a formula operator:
a <- runif(100)
lm(y ~ x*a)$coe
# (Intercept) x a x:a
# -0.71920356 3.42008631 0.19180499 -0.07049342
When we use * as an arithmetical operator,
lm(y ~ I(x*a))$coe
# (Intercept) I(x * a)
# 0.5317547 2.7361742
the result is the same as:
xa <- x*a
lm(y ~ xa)$coe
# (Intercept) xa
# 0.5317547 2.7361742
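Regarding the MSE asked about in the question: a minimal sketch of getting it from a fitted cubic (assuming the usual mean-of-squared-residuals definition):
fit <- lm(y ~ x + I(x^2) + I(x^3))
mean(residuals(fit)^2)   # training MSE (divides by n)
sigma(fit)^2             # the usual unbiased residual variance estimate (divides by n - 4)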
If you try to run a polynomial regression where x^2 is defined in the lm() function, the polynomial term is dropped due to singularities. However, if we define the polynomial term outside the lm(), the model is fit correctly.
It seems like it should work the same both ways. Why do we need to define the polynomial term outside the lm() function?
x <- round(rnorm(100, mean = 0, sd = 10))
y <- round(x*2.5 + rnorm(100))
# Trying to define x^2 in the model, x^2 is dropped
model_wrong <- lm(y ~ x + x^2)
# Define x^2 as its own object
x2 <- x^2
model_right <- lm(y ~ x + x2)
lm doesn't treat x^2 as arithmetic, because ^ has a special meaning (crossing) inside a formula, unless you tell it to, usually by wrapping the expression in a function. For arbitrary calculations, you can wrap them in I(...), which tells the formula parser to use the expression as-is:
set.seed(47)
x <- round(rnorm(100, mean = 0, sd = 10))
y <- round(x*2.5 + rnorm(100))
lm(y ~ x + I(x^2))
#>
#> Call:
#> lm(formula = y ~ x + I(x^2))
#>
#> Coefficients:
#> (Intercept) x I(x^2)
#> 2.563e-01 2.488e+00 -3.660e-06
Really, you can wrap x^2 in almost any function call that returns an evaluated vector usable in the model matrix. In some cases cbind can be very handy, though c, identity, or even {...} will work. I is purpose-built for this, though.
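For instance, a quick sketch of those alternatives (all four fit the same quadratic; I() remains the idiomatic choice):
lm(y ~ x + I(x^2))        # purpose-built
lm(y ~ x + cbind(x^2))    # one-column matrix term
lm(y ~ x + identity(x^2)) # any call returning the evaluated vector works
lm(y ~ x + {x^2})         # braces work too, as noted above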
Alternatively, you can use the poly function to make both terms for you, which is very useful for higher-degree polynomials. By default, it generates orthogonal polynomials, which will make the coefficients look different:
lm(y ~ poly(x, 2))
#>
#> Call:
#> lm(formula = y ~ poly(x, 2))
#>
#> Coefficients:
#> (Intercept) poly(x, 2)1 poly(x, 2)2
#> 1.500000 243.485357 -0.004319
even though the two fits give the same predictions:
new <- data.frame(x = seq(-1, 1, .5))
predict(lm(y ~ x + I(x^2)), new)
#> 1 2 3 4 5
#> -2.2317175 -0.9876930 0.2563297 1.5003505 2.7443695
predict(lm(y ~ poly(x, 2)), new)
#> 1 2 3 4 5
#> -2.2317175 -0.9876930 0.2563297 1.5003505 2.7443695
If you really want your coefficients to be the same, add raw = TRUE:
lm(y ~ poly(x, 2, raw = TRUE))
#>
#> Call:
#> lm(formula = y ~ poly(x, 2, raw = TRUE))
#>
#> Coefficients:
#> (Intercept) poly(x, 2, raw = TRUE)1 poly(x, 2, raw = TRUE)2
#> 2.563e-01 2.488e+00 -3.660e-06
Here is the code from Dave Tang's blog on curve fitting:
x <- c(32,64,96,118,126,144,152.5,158)
y <- c(99.5,104.8,108.5,100,86,64,35.3,15)
plot(x,y,pch=19)
fit <- lm(y~poly(x,4,raw=TRUE))
summary(fit)
Call:
lm(formula = y ~ poly(x, 4, raw = TRUE))
Residuals:
1 2 3 4 5 6 7 8
0.1242 -0.6912 1.6355 1.4491 -5.1240 4.0360 -0.4692 -0.9604
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.474e+01 5.473e+01 1.366 0.265
poly(x, 4, raw = TRUE)1 1.426e+00 3.095e+00 0.461 0.676
poly(x, 4, raw = TRUE)2 -2.854e-02 5.729e-02 -0.498 0.653
poly(x, 4, raw = TRUE)3 2.878e-04 4.278e-04 0.673 0.549
poly(x, 4, raw = TRUE)4 -1.134e-06 1.113e-06 -1.018 0.384
Residual standard error: 4.04 on 3 degrees of freedom
Multiple R-squared: 0.9943, Adjusted R-squared: 0.9868
F-statistic: 131.5 on 4 and 3 DF, p-value: 0.001064
Given that this seems to be quite a good fit, I would like to know the exact polynomial equation that was fitted. Is there any way to get it?
[edit]
One additional question: the p-values all suggest that the individual terms are not significant, yet we see a good fit. Can someone please explain?
You can use the function polynomial from the package polynom to write out the fitted equation:
library(polynom)
x <- c(32,64,96,118,126,144,152.5,158)
y <- c(99.5,104.8,108.5,100,86,64,35.3,15)
plot(x,y,pch=19)
fit <- lm(y~poly(x,4,raw=TRUE))
p0 <- polynomial(coef(fit))
p0
# 74.73766 + 1.425813*x - 0.0285437*x^2 + 0.0002877714*x^3 - 1.133744e-06*x^4
Using signif to round the coefficients:
p0 <- polynomial(signif(coef(fit), 3))
p0
# 74.7 + 1.43*x - 0.0285*x^2 + 0.000288*x^3 - 1.13e-06*x^4
You can play a little with p0:
f0 <- as.function(p0)
f0(x)
# [1] 99.37580 105.49117 106.86449 98.55089 91.12402 59.96402 35.76922
# [8] 15.96039
predict(fit)
# 1 2 3 4 5 6 7 8
# 99.37580 105.49117 106.86449 98.55089 91.12402 59.96402 35.76922 15.96039
plot(x, y)
lines(x, f0(x), col = "grey", lwd = 2) # bold grey line
lines(x, predict(fit), col = "red", lty = 2) # dashed red line
You have the coefficients in the summary:
f <- function(x) {
return(7.473766e+01 + 1.425813e+00*x -2.854370e-02*x^2 + 2.877714e-04*x^3 - 1.133744e-06*x^4 )
}
plot(x, y)
lines(x, f(x), col="red")
I would like to know how to constrain certain parameters in lm() to have positive coefficients. There are a few packages or functions (e.g. display) that can force all coefficients, including the intercept, to be positive.
For instance, in this example, I would like to force only x1 and x2 to have positive coefficients.
x1=c(NA,rnorm(99)*10)
x2=c(NA,NA,rnorm(98)*10)
x3=rnorm(100)*10
y=sin(x1)+cos(x2)-x3+rnorm(100)
lm(y~x1+x2+x3)
Call:
lm(formula = y ~ x1 + x2 + x3)
Coefficients:
(Intercept) x1 x2 x3
-0.06278 0.02261 -0.02233 -0.99626
I have tried the function nnnpls() in the package nnls; it can control the coefficient signs easily. Unfortunately I can't use it because my data contain NAs, which that function doesn't allow.
I saw that the function glmc() can be used to apply constraints, but I couldn't get it working.
Could someone let me know what I should do?
You could use the package colf for this. It currently offers two least squares optimizers, based on nls and nlxb:
library(colf)
# DF is assumed to be a data.frame of the question's variables, e.g. DF <- data.frame(x1, x2, x3, y)
colf_nlxb(y ~ x1 + x2 + x3, data = DF, lower = c(-Inf, 0, 0, -Inf))
#nlmrt class object: x
#residual sumsquares = 169.53 on 98 observations
# after 3 Jacobian and 3 function evaluations
# name coeff SEs tstat pval gradient JSingval
#1 param_X.Intercept. -0.0066952 NA NA NA 3.8118 103.3941
#2 param_x1 0.0000000 NA NA NA 103.7644 88.7017
#3 param_x2 0.0000000 NA NA NA 0.0000 9.8032
#4 param_x3 -0.9487088 NA NA NA 330.7776 0.0000
colf_nls(y ~ x1 + x2 + x3, data = DF, lower = c(-Inf, 0, 0, -Inf))
#Nonlinear regression model
# model: y ~ param_X.Intercept. * X.Intercept. + param_x1 * x1 + param_x2 *
# x2 + param_x3 * x3
# data: model_ingredients$model_data
#param_X.Intercept. param_x1 param_x2 param_x3
# -0.0392 0.0000 0.0000 -0.9801
# residual sum-of-squares: 159
#
#Algorithm "port", convergence message: both X-convergence and relative convergence (5)
You can set the lower and/or upper bounds to specify the limits you want for each of the coefficients.
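For example, a sketch that also adds upper bounds (hypothetical: capping the x3 coefficient at 0 while keeping x1 and x2 non-negative):
colf_nls(y ~ x1 + x2 + x3, data = DF,
         lower = c(-Inf, 0, 0, -Inf),
         upper = c(Inf, Inf, Inf, 0))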
You can use package penalized:
set.seed(1)
x1=c(NA,rnorm(99)*10)
x2=c(NA,NA,rnorm(98)*10)
x3=rnorm(100)*10
y=sin(x1)+cos(x2)-x3+rnorm(100)
DF <- data.frame(x1,x2,x3,y)
lm(y~x1+x2+x3, data=DF)
#Call:
#lm(formula = y ~ x1 + x2 + x3, data = DF)
#
#Coefficients:
#(Intercept) x1 x2 x3
# -0.02438 -0.01735 -0.02030 -0.98203
This gives the same:
library(penalized)
mod1 <- penalized(y, ~ x1 + x2 + x3, ~1,
lambda1=0, lambda2=0, positive = FALSE, data=na.omit(DF))
coef(mod1)
#(Intercept) x1 x2 x3
#-0.02438357 -0.01734856 -0.02030120 -0.98202831
If you constrain the coefficients of x1 and x2 to be positive, they become zero (as expected):
mod2 <- penalized(y, ~ x1 + x2 + x3, ~1,
lambda1=0, lambda2=0, positive = c(T, T, F), data=na.omit(DF))
coef(mod2)
#(Intercept) x3
#-0.03922266 -0.98011223
With the ConsReg package (https://cran.r-project.org/web/packages/ConsReg/index.html) you can deal with this kind of problem.
You can set bound limits (lower and upper) and also restrictions between coefficients, such as beta1 > beta2, which in some cases can be very useful.
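A rough, unverified sketch based on my reading of the ConsReg documentation (the ConsReg() call, the constraints string syntax, and the optimizer name are assumptions to check against ?ConsReg):
library(ConsReg)
# Hypothetical usage on the DF from the example above: force x1 and x2 to be positive
fit <- ConsReg(y ~ x1 + x2 + x3, data = na.omit(DF),
               family = "gaussian", optimizer = "solnp",
               constraints = "x1 > 0, x2 > 0")
coef(fit)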