n = 50
set.seed(100)
x = matrix(runif(n, -2, 2), nrow=n)
y = 2 + 0.75*sin(x) - 0.75*cos(x) + rnorm(n, 0, 0.2)
y = γ0 + γ1*x + γ2*x^2 + γ3*x^3 + ε
In R, I want to estimate the polynomial function above by the least squares method.
That is, I want the estimates of γ0, γ1, γ2 and γ3.
I also want to know the MSE under this fitted function.
I used this:
summary(lm(y ~ x + x^2 + x^3))
but I just get this output:
Call:
lm(formula = y ~ x + (x^2) + (x^3))
Residuals:
Min 1Q Median 3Q Max
-0.66448 -0.22251 -0.07694 0.20647 0.79429
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.54972 0.04761 32.55 <2e-16 ***
x 0.65279 0.04633 14.09 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3319 on 48 degrees of freedom
Multiple R-squared: 0.8053, Adjusted R-squared: 0.8013
F-statistic: 198.6 on 1 and 48 DF, p-value: < 2.2e-16
Please tell me which function or package in R I can use to do this.
Thank you.
You need to wrap I(.) around your polynomial terms to tell R that the ^ operator should be treated as an arithmetic operator rather than a formula operator.
Let's create some example data to illustrate.
set.seed(42)
x <- runif(100)
y <- x + x^2 + x^3 + rnorm(100)
What you want is this:
lm(y ~ x + I(x^2) + I(x^3))$coe
# (Intercept) x I(x^2) I(x^3)
# -0.4069207 3.4580770 -4.0516060 4.3227674
or:
lm(y ~ poly(x, 3, raw=TRUE))$coe
# (Intercept) poly(x, 3, raw = TRUE)1 poly(x, 3, raw = TRUE)2 poly(x, 3, raw = TRUE)3
# -0.4069207 3.4580770 -4.0516060 4.3227674
Without the I(.) the result is different,
lm(y ~ x + x^2 + x^3)$coe
# (Intercept) x
# -0.6149136 3.3590395
because within a formula ^ is a crossing (interaction) operator, so x^2 and x^3 each reduce to just x, and R interprets the formula simply as
lm(y ~ x)$coe
# (Intercept) x
# -0.6149136 3.3590395
all(lm(y ~ x)$coe - lm(y ~ x + x^2 + x^3)$coe == 0)
# [1] TRUE
However, consider:
lm(y ~ I(x + x^2 + x^3))$coe
# (Intercept) I(x + x^2 + x^3)
# -0.191611 1.141912
Creating a variable z
z <- x + x^2 + x^3
gives the same result:
lm(y ~ z)$coe
# (Intercept) z
# -0.191611 1.141912
Or, to be more explicit: say we want to model the interaction of x with a; then we use * as a formula operator:
a <- runif(100)
lm(y ~ x*a)$coe
# (Intercept) x a x:a
# -0.71920356 3.42008631 0.19180499 -0.07049342
When we use * as an arithmetic operator instead,
lm(y ~ I(x*a))$coe
# (Intercept) I(x * a)
# 0.5317547 2.7361742
the result is the same as:
xa <- x*a
lm(y ~ xa)$coe
# (Intercept) xa
# 0.5317547 2.7361742
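Applied back to the data in the question, a minimal sketch that also returns the MSE asked about (taken here as the mean of the squared residuals); the object name fit is just illustrative:
set.seed(100)
n <- 50
x <- as.vector(matrix(runif(n, -2, 2), nrow = n))  # same draw as the question, flattened to a plain vector
y <- 2 + 0.75*sin(x) - 0.75*cos(x) + rnorm(n, 0, 0.2)
fit <- lm(y ~ x + I(x^2) + I(x^3))
coef(fit)               # least-squares estimates of γ0, γ1, γ2, γ3
mean(residuals(fit)^2)  # in-sample MSE of the fitted cubic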
I need to export a final multivariate polynomial regression equation from R to another application. I do not understand one portion of the regression output. The regression uses the polym() function. The summary table is posted below.
ploy_lm <- lm(df$SV ~ polym(df$Indy, df$HI, degree = 3, raw = TRUE))
summary(ploy_lm)
In the table below, "(polym input)" stands for polym(df$Indy, df$HI, degree = 3, raw = TRUE).
                   Estimate
Intercept          -8.903
(polym input)1.0    1.189E0
(polym input)2.0   -1.651E-2
(polym input)1.1    8.247E-4
How do I translate the results into a final regression equation? Does the value at the end of the first column (e.g. from the last row: "polym(df$Indy, df$WM_HI, degree = 3, raw = TRUE)1.1") signify the exponent value?
Here is a simple example with a predefined function:
x1<- runif(20, 1, 20)
x2 <- runif(20, 15, 30)
#define a function for y
y <- (1 - 3*x1 + 1/5*x2 - x1*x2 + 0.013*x1^2 + 0.2 *x2^2)
#add some noise to prevent a warning on the fit
y <- y +rnorm(20, 0, 0.01)
ploy_lm <- lm(y ~ polym(x1, x2, degree = 2, raw = TRUE))
summary(ploy_lm)
Call:
lm(formula = y ~ polym(x1, x2, degree = 2, raw = TRUE))
Residuals:
Min 1Q Median 3Q Max
-0.017981 -0.007537 0.001757 0.005833 0.018697
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.588e-01 7.158e-02 13.39 2.25e-09 ***
polym(x1, x2, degree = 2, raw = TRUE)1.0 -3.003e+00 2.820e-03 -1064.88 < 2e-16 ***
polym(x1, x2, degree = 2, raw = TRUE)2.0 1.315e-02 9.659e-05 136.15 < 2e-16 ***
polym(x1, x2, degree = 2, raw = TRUE)0.1 2.059e-01 6.536e-03 31.51 2.12e-14 ***
polym(x1, x2, degree = 2, raw = TRUE)1.1 -1.000e+00 1.059e-04 -9446.87 < 2e-16 ***
polym(x1, x2, degree = 2, raw = TRUE)0.2 1.998e-01 1.511e-04 1322.68 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.01167 on 14 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 6.298e+08 on 5 and 14 DF, p-value: < 2.2e-16
#In summary:
# Term       Model    Fitted
# Intercept   1        0.959
# x1         -3       -3
# x1^2        0.013    0.0132
# x2          0.2      0.206
# x2^2        0.2      0.1998
# x1 * x2    -1       -1
The digit right after the closing ")" is the power of the first variable (x1), and the digit after the "." is the power of the second variable (x2).
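So, as a hedged sketch of turning the fit above into an explicit equation, you can pick the coefficients out by name and apply the powers encoded in those suffixes (x1_new and x2_new are just illustrative values):
cf <- coef(ploy_lm)
nm <- "polym(x1, x2, degree = 2, raw = TRUE)"
x1_new <- 5; x2_new <- 20
y_hat <- cf["(Intercept)"] +
  cf[paste0(nm, "1.0")] * x1_new +            # x1
  cf[paste0(nm, "2.0")] * x1_new^2 +          # x1^2
  cf[paste0(nm, "0.1")] * x2_new +            # x2
  cf[paste0(nm, "1.1")] * x1_new * x2_new +   # x1 * x2
  cf[paste0(nm, "0.2")] * x2_new^2            # x2^2
# should match:
predict(ploy_lm, newdata = data.frame(x1 = x1_new, x2 = x2_new))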
I'd like to estimate the effect of a treatment on two separate groups, so something of the form
Y = β0 + β1*M + β2*T + β3*(T*M) + ε
T being the treatment and M the dummy separating the two groups.
The problem is that the treatment is correlated with other variables that affect Y. Luckily, there exists a variable Z that serves as an instrument for T. What I've been able to implement in R was to "manually" run 2SLS, following the stages
T = α0 + α1*Z + e
and
Y = β0 + β1*M + β2*T_hat + β3*(T_hat*M) + u
To provide a reproducible example, first a simulation
n <- 100
set.seed(271)
Z <- runif(n)
e <- rnorm(n, sd = 0.5)
M <- as.integer(runif(n)) # dummy
u <- rnorm(n)
# Treat = 1 + 2*Z + e
alpha_0 <- 1
alpha_1 <- 2
Treat <- alpha_0 + alpha_1*Z + e
# Y = 3 + M + 2*Treat + 3*Treat*M + e + u (omitted vars that determine Treat also affect Y)
beta_0 <- 3
beta_1 <- 1
beta_2 <- 2
beta_3 <- 3
Y <- beta_0 + beta_1*M + beta_2*Treat + beta_3 * M*Treat + e + u
The first stage regression
fs <- lm(Treat ~ Z)
stargazer::stargazer(fs, type = "text")
===============================================
Dependent variable:
---------------------------
Treat
-----------------------------------------------
Z 2.383***
(0.168)
Constant 0.835***
(0.096)
-----------------------------------------------
Observations 100
R2 0.671
Adjusted R2 0.668
Residual Std. Error 0.445 (df = 98)
F Statistic 200.053*** (df = 1; 98)
===============================================
And second stage
Treat_hat <- fitted(fs)
ss <- lm(Y ~ M + Treat_hat + M:Treat_hat)
stargazer::stargazer(ss, type = "text")
===============================================
Dependent variable:
---------------------------
Y
-----------------------------------------------
M 1.230
(1.717)
Treat_hat 2.243***
(0.570)
M:Treat_hat 2.636***
(0.808)
Constant 2.711**
(1.213)
-----------------------------------------------
Observations 100
R2 0.727
Adjusted R2 0.718
Residual Std. Error 2.539 (df = 96)
F Statistic 85.112*** (df = 3; 96)
===============================================
The problem now is that those standard errors aren't adjusted for the first stage, which looks like quite a lot of work to do manually. As for any other IV regression, I'd prefer to just use AER::ivreg.
But I can't seem to get the same regression going there. Here are several attempts that never quite do the same thing:
AER::ivreg(Y ~ M + Treat + M:Treat | Z)
AER::ivreg(Y ~ M + Treat + M:Treat | M + Z)
Warning message:
In ivreg.fit(X, Y, Z, weights, offset, ...) :
more regressors than instruments
These warnings make sense, I guess.
AER::ivreg(Y ~ M + Treat + M:Treat | M + Z + M:Z)
Call:
AER::ivreg(formula = Y ~ M + Treat + M:Treat | M + Z + M:Z)
Coefficients:
(Intercept) M Treat M:Treat
2.641 1.450 2.229 2.687
Surprisingly close, but not quite.
I couldn't find a way to tell ivreg that Treat and M:Treat aren't two separate endogenous variables, but just the same endogenous variable interacted with an exogenous one.
In conclusion,
i) Is there some way to mess with ivreg and make this work?
ii) Is there some other function for 2SLS that can just manually accept 1st and 2nd stage formulas without this sort of restriction, and that adjusts standard errors?
iii) What's the simplest way to get the correct SEs if there are no other alternatives? I didn't come across any direct R code, just a bunch of matrix multiplication formulas (although I didn't dig too deep for this one).
Thank you
Essentially, if Z is a valid instrument for Treat, then M:Z should be a valid instrument for M:Treat, so to me this makes sense:
AER::ivreg(Y ~ M + Treat + M:Treat | M + Z + M:Z)
I actually managed to back out the correct param values for a modified simulation:
n <- 100
set.seed(271)
Z <- runif(n)
e <- rnorm(n, sd = 0.5)
M <- round(runif(n)) # note: I changed from as.integer() to round() in order to get some 1's in the regression
u <- rnorm(n)
# Treat = 1 + 2*Z + e
alpha_0 <- 1
alpha_1 <- 2
Treat <- alpha_0 + alpha_1*Z + e
beta_0 <- 3
beta_1 <- 1
beta_2 <- 2
beta_3 <- 3
Y <- beta_0 + beta_1*M + beta_2*Treat + beta_3 * M*Treat
Now:
my_ivreg <- AER::ivreg(Y ~ M + Treat + M:Treat | M + Z + M:Z)
> summary(my_ivreg)
Call:
AER::ivreg(formula = Y ~ M + Treat + M:Treat | M + Z + M:Z)
Residuals:
Min 1Q Median 3Q Max
-1.332e-14 -7.105e-15 -3.553e-15 -8.882e-16 3.553e-15
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.000e+00 2.728e-15 1.100e+15 <2e-16 ***
M 1.000e+00 3.810e-15 2.625e+14 <2e-16 ***
Treat 2.000e+00 1.255e-15 1.593e+15 <2e-16 ***
M:Treat 3.000e+00 1.792e-15 1.674e+15 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.633e-15 on 96 degrees of freedom
Multiple R-Squared: 1, Adjusted R-squared: 1
Wald test: 1.794e+31 on 3 and 96 DF, p-value: < 2.2e-16
Which is what we were looking for...
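And going back to the original noisy simulation (the one where Y includes e + u), the same call gives 2SLS standard errors that already account for the first stage, which speaks to question iii); a quick sketch (Y_noisy is just an illustrative name):
Y_noisy <- beta_0 + beta_1*M + beta_2*Treat + beta_3*M*Treat + e + u
iv_fit <- AER::ivreg(Y_noisy ~ M + Treat + M:Treat | M + Z + M:Z)
summary(iv_fit)  # coefficients with the usual 2SLS standard errors,
                 # unlike the manual two-step lm() approach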
If you try to run a polynomial regression where x^2 is defined in the lm() function, the polynomial term is dropped due to singularities. However, if we define the polynomial term outside the lm(), the model is fit correctly.
It seems like it should work the same both ways. Why do we need to define the polynomial term outside the lm() function?
x <- round(rnorm(100, mean = 0, sd = 10))
y <- round(x*2.5 + rnorm(100))
# Trying to define x^2 in the model, x^2 is dropped
model_wrong <- lm(y ~ x + x^2)
# Define x^2 as its own object
x2 <- x^2
model_right <- lm(y ~ x + x2)
Inside a formula, ^ is a formula operator rather than an arithmetic one, so x^2 expands to x crossed with itself, which is just x, and lm never sees a squared term. For arbitrary calculations, you can wrap them in I(...), which tells lm to evaluate the expression as-is:
set.seed(47)
x <- round(rnorm(100, mean = 0, sd = 10))
y <- round(x*2.5 + rnorm(100))
lm(y ~ x + I(x^2))
#>
#> Call:
#> lm(formula = y ~ x + I(x^2))
#>
#> Coefficients:
#> (Intercept) x I(x^2)
#> 2.563e-01 2.488e+00 -3.660e-06
Really, you can wrap x^2 in almost any function call that returns an evaluated vector usable in the model matrix. In some cases cbind can be very handy, though c, identity, or even {...} will work. I is purpose-built for this, though.
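For example, a small sketch with cbind, where a single formula term contributes two derived columns to the model matrix (using the same x and y as above):
lm(y ~ x + cbind(x2 = x^2, x3 = x^3))  # one cbind() term supplies both the x^2 and x^3 columns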
Alternatively, you can use the poly function to make both terms for you, which is very useful for higher-degree polynomials. By default, it generates orthogonal polynomials, which will make the coefficients look different:
lm(y ~ poly(x, 2))
#>
#> Call:
#> lm(formula = y ~ poly(x, 2))
#>
#> Coefficients:
#> (Intercept) poly(x, 2)1 poly(x, 2)2
#> 1.500000 243.485357 -0.004319
even though they will evaluate the same:
new <- data.frame(x = seq(-1, 1, .5))
predict(lm(y ~ x + I(x^2)), new)
#> 1 2 3 4 5
#> -2.2317175 -0.9876930 0.2563297 1.5003505 2.7443695
predict(lm(y ~ poly(x, 2)), new)
#> 1 2 3 4 5
#> -2.2317175 -0.9876930 0.2563297 1.5003505 2.7443695
If you really want your coefficients to be the same, add raw = TRUE:
lm(y ~ poly(x, 2, raw = TRUE))
#>
#> Call:
#> lm(formula = y ~ poly(x, 2, raw = TRUE))
#>
#> Coefficients:
#> (Intercept) poly(x, 2, raw = TRUE)1 poly(x, 2, raw = TRUE)2
#> 2.563e-01 2.488e+00 -3.660e-06
In order to correct for heteroskedasticity in the error terms, I am running the following weighted least squares regression in R:
#Call:
#lm(formula = a ~ q + q2 + b + c, data = mydata, weights = weighting)
#Weighted Residuals:
# Min 1Q Median 3Q Max
#-1.83779 -0.33226 0.02011 0.25135 1.48516
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -3.939440 0.609991 -6.458 1.62e-09 ***
#q 0.175019 0.070101 2.497 0.013696 *
#q2 0.048790 0.005613 8.693 8.49e-15 ***
#b 0.473891 0.134918 3.512 0.000598 ***
#c 0.119551 0.125430 0.953 0.342167
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 0.5096 on 140 degrees of freedom
#Multiple R-squared: 0.9639, Adjusted R-squared: 0.9628
#F-statistic: 933.6 on 4 and 140 DF, p-value: < 2.2e-16
Where "weighting" is a variable (function of the variable q) used for weighting the observations. q2 is simply q^2.
Now, to double-check my results, I manually weight my variables by creating new weighted variables:
mydata$a.wls <- mydata$a * mydata$weighting
mydata$q.wls <- mydata$q * mydata$weighting
mydata$q2.wls <- mydata$q2 * mydata$weighting
mydata$b.wls <- mydata$b * mydata$weighting
mydata$c.wls <- mydata$c * mydata$weighting
And I run the following regression, without the weights option and without a constant, since the constant is also weighted: the column of 1s in the original predictor matrix should now equal the weighting variable:
Call:
lm(formula = a.wls ~ 0 + weighting + q.wls + q2.wls + b.wls + c.wls,
data = mydata)
#Residuals:
# Min 1Q Median 3Q Max
#-2.38404 -0.55784 0.01922 0.49838 2.62911
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#weighting -4.125559 0.579093 -7.124 5.05e-11 ***
#q.wls 0.217722 0.081851 2.660 0.008726 **
#q2.wls 0.045664 0.006229 7.330 1.67e-11 ***
#b.wls 0.466207 0.121429 3.839 0.000186 ***
#c.wls 0.133522 0.112641 1.185 0.237876
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 0.915 on 140 degrees of freedom
#Multiple R-squared: 0.9823, Adjusted R-squared: 0.9817
#F-statistic: 1556 on 5 and 140 DF, p-value: < 2.2e-16
As you can see, the results are similar but not identical. Am I doing something wrong while manually weighting the variables, or does the option "weights" do something more than simply multiplying the variables by the weighting vector?
Provided you do the manual weighting correctly, you won't see any discrepancy. The key is to multiply by the square root of the weights, not the weights themselves.
So the correct way to go is:
X <- model.matrix(~ q + q2 + b + c, mydata) ## non-weighted model matrix (with intercept)
w <- mydata$weighting ## weights
rw <- sqrt(w) ## root weights
y <- mydata$a ## non-weighted response
X_tilde <- rw * X ## weighted model matrix (with intercept)
y_tilde <- rw * y ## weighted response
## remember to drop intercept when using formula
fit_by_wls <- lm(y ~ X - 1, weights = w)
fit_by_ols <- lm(y_tilde ~ X_tilde - 1)
That said, it is generally recommended to use lm.fit and lm.wfit when passing in a matrix directly:
matfit_by_wls <- lm.wfit(X, y, w)
matfit_by_ols <- lm.fit(X_tilde, y_tilde)
But when using these internal subroutines lm.fit and lm.wfit, all inputs must be complete cases without NA, otherwise the underlying C routine stats:::C_Cdqrls will complain.
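If the data might contain NA, one simple sketch is to subset to complete cases before calling the matrix interface:
ok <- complete.cases(X, y, w)  # lm() drops incomplete rows itself; lm.fit()/lm.wfit() do not
matfit_by_wls <- lm.wfit(X[ok, , drop = FALSE], y[ok], w[ok])
matfit_by_ols <- lm.fit(X_tilde[ok, , drop = FALSE], y_tilde[ok])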
If you still want to use the formula interface rather than matrix, you can do the following:
## weight by square root of weights, not weights
mydata$root.weighting <- sqrt(mydata$weighting)
mydata$a.wls <- mydata$a * mydata$root.weighting
mydata$q.wls <- mydata$q * mydata$root.weighting
mydata$q2.wls <- mydata$q2 * mydata$root.weighting
mydata$b.wls <- mydata$b * mydata$root.weighting
mydata$c.wls <- mydata$c * mydata$root.weighting
fit_by_wls <- lm(formula = a ~ q + q2 + b + c, data = mydata, weights = weighting)
fit_by_ols <- lm(formula = a.wls ~ 0 + root.weighting + q.wls + q2.wls + b.wls + c.wls,
data = mydata)
Reproducible Example
Let's use R's built-in data set trees. Use head(trees) to inspect this dataset. There is no NA in this dataset. We aim to fit a model:
Height ~ Girth + Volume
with some random weights between 1 and 2:
set.seed(0); w <- runif(nrow(trees), 1, 2)
We fit this model via weighted regression, either by passing weights to lm, or by manually transforming the data and calling lm with no weights:
X <- model.matrix(~ Girth + Volume, trees) ## non-weighted model matrix (with intercept)
rw <- sqrt(w) ## root weights
y <- trees$Height ## non-weighted response
X_tilde <- rw * X ## weighted model matrix (with intercept)
y_tilde <- rw * y ## weighted response
fit_by_wls <- lm(y ~ X - 1, weights = w)
#Call:
#lm(formula = y ~ X - 1, weights = w)
#Coefficients:
#X(Intercept) XGirth XVolume
# 83.2127 -1.8639 0.5843
fit_by_ols <- lm(y_tilde ~ X_tilde - 1)
#Call:
#lm(formula = y_tilde ~ X_tilde - 1)
#Coefficients:
#X_tilde(Intercept) X_tildeGirth X_tildeVolume
# 83.2127 -1.8639 0.5843
So indeed, we see identical results.
Alternatively, we can use lm.fit and lm.wfit:
matfit_by_wls <- lm.wfit(X, y, w)
matfit_by_ols <- lm.fit(X_tilde, y_tilde)
We can check coefficients by:
matfit_by_wls$coefficients
#(Intercept) Girth Volume
# 83.2127455 -1.8639351 0.5843191
matfit_by_ols$coefficients
#(Intercept) Girth Volume
# 83.2127455 -1.8639351 0.5843191
Again, results are the same.
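The standard errors should agree as well, not just the point estimates; a quick check (unname() only strips the coefficient names, which differ between the two fits):
summary(fit_by_wls)$coefficients
summary(fit_by_ols)$coefficients
all.equal(unname(summary(fit_by_wls)$coefficients),
          unname(summary(fit_by_ols)$coefficients))  # should be TRUE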
I have some data that Excel will fit pretty nicely with a logarithmic trend. I want to pass the same data into R and have it tell me the coefficients and intercept. What form should the data be in, and what function should I call to have it figure out the coefficients? Ultimately, I want to do this thousands of times so that I can project into the future.
Passing Excel these values produces this trendline function: y = -0.099ln(x) + 0.7521
Data:
y <- c(0.7521, 0.683478429, 0.643337383, 0.614856858, 0.592765647, 0.574715813,
0.559454895, 0.546235287, 0.534574767, 0.524144076, 0.514708368)
For context, the data points represent % of our user base that are retained on a given day.
The question omitted the values of x, but working backwards it seems you were using 1, 2, 3, ..., so try the following:
x <- 1:11
y <- c(0.7521, 0.683478429, 0.643337383, 0.614856858, 0.592765647,
0.574715813, 0.559454895, 0.546235287, 0.534574767, 0.524144076,
0.514708368)
fm <- lm(y ~ log(x))
giving:
> coef(fm)
(Intercept) log(x)
0.7521 -0.0990
and
plot(y ~ x, log = "x")
lines(fitted(fm) ~ x, col = "red")
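Since the goal is to project into the future, the same fit can be evaluated at later day numbers with predict(); a small sketch (the 12:30 horizon is just an illustrative choice):
future <- data.frame(x = 12:30)   # days beyond the observed 1:11
predict(fm, newdata = future)     # projected retention for those days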
You can get the same results by:
y <- c(0.7521, 0.683478429, 0.643337383, 0.614856858, 0.592765647, 0.574715813, 0.559454895, 0.546235287, 0.534574767, 0.524144076, 0.514708368)
t <- seq(along=y)
> summary(lm(y~log(t)))
Call:
lm(formula = y ~ log(t))
Residuals:
Min 1Q Median 3Q Max
-3.894e-10 -2.288e-10 -2.891e-11 1.620e-10 4.609e-10
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.521e-01 2.198e-10 3421942411 <2e-16 ***
log(t) -9.900e-02 1.261e-10 -784892428 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.972e-10 on 9 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 6.161e+17 on 1 and 9 DF, p-value: < 2.2e-16
For large projects I recommend encapsulating the data in a data frame, like:
df <- data.frame(y, t)
lm(formula = y ~ log(t), data=df)
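And since the question mentions repeating this thousands of times, one hedged way to scale it is to loop the same fit over many retention curves; retention_wide below is a hypothetical data frame with one cohort per column and one row per day:
fit_one <- function(col) coef(lm(col ~ log(seq_along(col))))  # intercept and log-slope for one cohort
coefs <- t(sapply(retention_wide, fit_one))                   # one row of coefficients per cohort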