I am confused. I have the following model: lm(GAV ~ EMPLOYED). This model has heteroscedasticity, and I believe the error standard deviation of this model can be approximated by a variable called SDL.
I have fitted the corresponding weighted model, resulting after dividing each term by variable SDL, using two forms:
lm(I(GAV/SDL) ~ I(1/SDL) + I(EMPLOYED/SDL)-1)
And
lm(GAV ~EMPLOYED,weights = 1/SDL)
I thought they would yield the same results. However, I get different parameter estimates...
Can anyone show me the error I am making?
Thanks in advance!
Fede
help("lm") clearly explains:
weighted least squares is used with weights weights (that is,
minimizing sum(w*e^2));
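So the weights must be the reciprocal of the error variance, not of the error standard deviation. In your case that is weights = 1/SDL^2. An example, where w is the error variance: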
x <- 1:10
set.seed(42)
w <- sample(10)
y <- 1 + 2 * x + rnorm(10, sd = sqrt(w))
lm(y ~ x, weights = 1/w)
#Call:
# lm(formula = y ~ x, weights = 1/w)
#
#Coefficients:
#(Intercept) x
# 3.715 1.643
lm(I(y/w^0.5) ~ I(1/w^0.5) + I(x/w^0.5) - 1)
#Call:
# lm(formula = I(y/w^0.5) ~ I(1/w^0.5) + I(x/w^0.5) - 1)
#
#Coefficients:
#I(1/w^0.5) I(x/w^0.5)
# 3.715 1.643
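Applied to the variables in the question, a minimal sketch on simulated data suggests the two fits agree once the weights are 1/SDL^2. Only the variable names come from the question; the sample size, the range of SDL and the true coefficients below are made up for illustration.

```r
set.seed(1)
n <- 50
EMPLOYED <- runif(n, 10, 100)
SDL <- runif(n, 1, 5)                          # hypothetical error standard deviations
GAV <- 2 + 3 * EMPLOYED + rnorm(n, sd = SDL)   # hypothetical true model

# Weighted fit: weights are inverse variances, i.e. 1/SDL^2
fit_w <- lm(GAV ~ EMPLOYED, weights = 1 / SDL^2)

# Transformed fit: every term divided by SDL, no extra intercept
fit_t <- lm(I(GAV / SDL) ~ I(1 / SDL) + I(EMPLOYED / SDL) - 1)

# Compare the estimates, ignoring the different coefficient names
all.equal(unname(coef(fit_w)), unname(coef(fit_t)))
```

With weights = 1/SDL instead, lm minimizes sum(e^2/SDL) rather than sum(e^2/SDL^2), which is why the estimates in the question differ.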
Btw., you might be interested in library(nlme); help("gls"). It offers more sophisticated possibilities for modelling heteroscedasticity.
I am able to change the coefficients of my linear model. Then I want to compare the results of my "new" model with the new coefficients, but R is not calculating the results with the new coefficients.
As you can see in the following example, the summaries of my models fit and fit1 are exactly the same, though results like multiple R-squared or fitted values should change.
set.seed(2157010) #forgot set.
x1 <- 1998:2011
x2 <- x1 + rnorm(length(x1))
y <- 3*x2 + rnorm(length(x1)) #you had x, not x1 or x2
fit <- lm( y ~ x1 + x2)
# view original coefficients
coef(fit)
# generate second function for comparing results
fit1 <- fit
# replace coefficients with new values, use whole name which is coefficients:
fit1$coefficients[2:3] <- c(5, 1)
# view new coefficients
coef(fit1)
# Comparing
summary(fit)
summary(fit1)
Thanks in advance
It might be easier to compute the multiple R^2 yourself with the substituted parameters.
mult_r2 <- function(beta, y, X) {
tot_ss <- var(y) * (length(y) - 1)
rss <- sum((y - X %*% beta)^2)
1 - rss/tot_ss
}
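(or, more compactly, following the comments, you could compute p <- X %*% beta; cor(y, p)^2. Note that the squared correlation equals 1 - rss/tot_ss only for the least-squares coefficients of a model with an intercept, so the two can differ for substituted coefficients.)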
mult_r2(coef(fit), y = model.response(model.frame(fit)), X = model.matrix(fit))
## 0.9931179, matches summary()
Now with new coefficients:
new_coef <- coef(fit)
new_coef[2:3] <- c(5,1)
mult_r2(new_coef, y = model.response(model.frame(fit)), X = model.matrix(fit))
## [1] -343917
That last result seems pretty wild, but the substituted coefficients are very different from the true least-squares coeffs, and negative R^2 is possible when the model is bad enough ...
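As for why the R-squared in summary(fit1) does not move: as far as I can tell, the fitted values and residuals are stored in the lm object at fitting time, and overwriting fit1$coefficients does not recompute them. A rough sketch, reusing the question's simulated data, that checks this and rebuilds the fitted values by hand (new_fitted and new_resid are names I made up):

```r
set.seed(2157010)
x1 <- 1998:2011
x2 <- x1 + rnorm(length(x1))
y  <- 3 * x2 + rnorm(length(x1))
fit <- lm(y ~ x1 + x2)

fit1 <- fit
fit1$coefficients[2:3] <- c(5, 1)

# The stored fitted values are untouched by the substitution
identical(fitted(fit1), fitted(fit))

# Rebuild fitted values and residuals from the substituted coefficients
X <- model.matrix(fit)
new_fitted <- drop(X %*% coef(fit1))
new_resid  <- y - new_fitted

# R^2 in the 1 - RSS/TSS sense, using the rebuilt residuals
1 - sum(new_resid^2) / sum((y - mean(y))^2)
```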
I'm new to the R tool and am having some trouble with the glm() function.
I have some data that I have shown below. When the linear predictor is just x, the glm() function works fine, but as soon as I change the linear predictor to x + x^2, it gives me the same results that I got for the first model.
The code is as follows:
model1 <- glm(y ~ x, data=data1, family=poisson (link="log"))
coef(model1)
(Intercept) x
0.3396339 0.2565236
model2 <- glm(y ~ x + x^2, data=data1, family=poisson (link="log"))
coef(model2)
(Intercept) x
0.3396339 0.2565236
As you can see there's no coefficient for x^2 as if it's not even in the model.
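The lm and glm functions have a special interpretation of the formula (see ?formula) which can be confusing if you are not expecting it. Inside a formula, ^ is a crossing operator rather than arithmetic exponentiation: (w + x)^2 expands to w + x + w:x, i.e. the model a*w + b*x + c*w*x + d, and x^2 on its own simply collapses to x. If you want ^ to mean arithmetic squaring, you need to wrap the term in the literal function, I().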
model2 <- glm(gear ~ disp + I(disp^2),
data = mtcars, family = poisson (link = "log"))
coef(model2)
# (Intercept) disp I(disp^2)
# 1.542059e+00 -1.248689e-03 6.578518e-07
Put another way, I allows you to perform transformations in the call to glm. The following is equivalent.
mtcars1 <- mtcars
mtcars1$disp_sq <- mtcars1$disp^2
model2a <- glm(gear ~ disp + disp_sq,
data = mtcars1, family = poisson (link = "log"))
coef(model2a)
# (Intercept) disp disp_sq
# 1.542059e+00 -1.248689e-03 6.578518e-07
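To see what the formula parser does with ^, you can inspect the term labels directly. The sketch below also tries poly(disp, 2, raw = TRUE) as a third spelling of the same quadratic; that alternative is my own addition, not part of the answer above.

```r
# x^2 inside a formula collapses to x; (w + x)^2 expands to main effects plus interaction
attr(terms(y ~ x + x^2), "term.labels")
attr(terms(y ~ (w + x)^2), "term.labels")

# Two spellings of the same quadratic Poisson model on mtcars
m_I    <- glm(gear ~ disp + I(disp^2),
              data = mtcars, family = poisson(link = "log"))
m_poly <- glm(gear ~ poly(disp, 2, raw = TRUE),
              data = mtcars, family = poisson(link = "log"))

# Same design matrix, so the estimates should agree up to naming
all.equal(unname(coef(m_I)), unname(coef(m_poly)))
```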
I am very new to R and I appreciate the help
I have some data that looks like this.
Y is negatively correlated with X, in a nonlinear way. It seems to be approximated by a formula of the following form: y = 1 + a^x, where a < 1.
If I wanted to fit that data in R to find a what function would I use? NLS?
Next time please provide test data. We have done it for you this time. Then we use nls as shown.
set.seed(123)
# generate test data
n <- 35
x <- 1:n
a <- 0.5
y <- 1 + a^x + rnorm(n, 0, .01)
fm <- nls(y ~ 1+a^x, start = list(a = mean((y-1)^(1/x), na.rm = TRUE)))
fm
giving:
Nonlinear regression model
model: y ~ 1 + a^x
data: parent.frame()
a
0.5025
residual sum-of-squares: 0.003031
Number of iterations to convergence: 5
Achieved convergence tolerance: 1.346e-06
Plot
plot(y ~ x)
lines(fitted(fm) ~ x, col = "red")
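If the data live in a data frame, a variation along these lines might be more convenient. This is only a sketch: the data frame name dat and the fixed starting value 0.4 are my own choices, not from the question, and a poor starting value can stop nls from converging on other data.

```r
set.seed(123)
# simulated test data, same recipe as above but stored in a data frame
dat <- data.frame(x = 1:35)
dat$y <- 1 + 0.5^dat$x + rnorm(nrow(dat), 0, 0.01)

# fit y = 1 + a^x, looking up x and y in dat
fm2 <- nls(y ~ 1 + a^x, data = dat, start = list(a = 0.4))

coef(fm2)                   # point estimate of a
summary(fm2)$coefficients   # estimate, standard error, t value, p value
```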
How can I simulate data so that the coefficients recovered by lm are determined to be particular pre-determined values and have normally distributed residuals? For example, could I generate data so that lm(y ~ 1 + x) will yield (Intercept) = 1.500 and x = 4.000? I would like the solution to be versatile enough to work for multiple regression with continuous x (e.g., lm(y ~ 1 + x1 + x2)) but there are bonus points if it works for interactions as well (lm(y ~ 1 + x1 + x2 + x1*x2)). Also, it should work for small N (e.g., N < 200).
I know how to simulate random data which is generated by these parameters (see e.g. here), but that randomness carries over to variation in the estimated coefficients, e.g., Intercept = 1.488 and x = 4.067.
Related: It is possible to generate data that yields pre-determined correlation coefficients (see here and here). So I'm asking if this can be done for multiple regression?
One approach is to use a perfectly symmetrical noise. The noise cancels itself so the estimated parameters are exactly the input parameters, yet the residuals appear normally distributed.
x <- 1:100
y <- cbind(1,x) %*% c(1.5, 4)
eps <- rnorm(100)
x <- c(x, x)
y <- c(y + eps, y - eps)
fit <- lm(y ~ x)
# (Intercept) x
# 1.5 4.0
plot(fit)
Residuals are normally distributed...
... but exhibit an abnormally perfect symmetry!
EDIT by OP: I wrote up general-purpose code exploiting the symmetrical-residuals trick. It scales well with more complex models. This example also shows that it works for categorical predictors and interaction effects.
library(dplyr)
# Data and residuals
df = tibble(
# Predictors
x1 = 1:100, # Continuous
x2 = rep(c(0, 1), each=50), # Dummy-coded categorical
# Generate y from model, including interaction term
y_model = 1.5 + 4 * x1 - 2.1 * x2 + 8.76543 * x1 * x2,
noise = rnorm(100) # Residuals
)
# Do the symmetrical-residuals trick
# This is copy-and-paste ready, no matter model complexity.
df = bind_rows(
df %>% mutate(y = y_model + noise),
df %>% mutate(y = y_model - noise) # Mirrored
)
# Check that it works
fit <- lm(y ~ x1 + x2 + x1*x2, df)
coef(fit)
# (Intercept) x1 x2 x1:x2
# 1.50000 4.00000 -2.10000 8.76543
You could do rejection sampling:
set.seed(42)
tol <- 1e-8
x <- 1:100
continue <- TRUE
while(continue) {
y <- cbind(1,x) %*% c(1.5, 4) + rnorm(length(x))
if (sum((coef(lm(y ~ x)) - c(1.5, 4))^2) < tol) continue <- FALSE
}
coef(lm(y ~ x))
#(Intercept) x
# 1.500013 4.000023
Obviously, this is a brute-force approach and the smaller the tolerance and the more complex the model, the longer this will take. A more efficient approach should be possible by providing residuals as input and then employing some matrix algebra to calculate y values. But that's more of a maths question ...
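One way the matrix-algebra idea could look, as a sketch rather than a worked-out answer: draw normal noise, remove the part of it that lies in the column space of the design matrix, and add what is left to the exact model values. Least squares then has nothing to absorb, so it should return the chosen coefficients up to floating-point error. The predictors, sample size and coefficient values below are made up.

```r
set.seed(42)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
X    <- model.matrix(~ x1 * x2)      # intercept, x1, x2, x1:x2
beta <- c(1.5, 4, -2.1, 0.5)         # target coefficients (illustrative)

noise <- rnorm(n)
# Residuals of regressing the noise on X are orthogonal to every column of X
e <- lm.fit(X, noise)$residuals

y   <- drop(X %*% beta) + e
fit <- lm(y ~ x1 * x2)
coef(fit)
```

The residuals are still a linear transformation of normal noise, but they are no longer independent and their variance is slightly reduced, so this is a different compromise from the mirrored-noise trick.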
I am currently trying to fit a polynomial model to measurement data using lm().
fit_poly4 <- lm(y ~ poly(x, degree = 4, raw = T), weights = w)
with x as independent, y as dependent variable and w = 1/variance of the measurements.
I want to try a polynomial with given coefficients instead of the ones determined by R. Specifically I want my polynomial to be
y = -3.3583*x^4 + 43*x^3 - 191.14*x^2 + 328.2*x - 137.7
I tried to enter it as
fit_poly4 <- lm(y ~ 328.2*x-191.14*I(x^2)+43*I(x^3)-3.3583*I(x^4)-137.3,
weights = w)
but this just returns an error:
Error in terms.formula(formula, data = data) : invalid model formula in ExtractVars
Is there a way to determine the coefficients in lm() and how would one do this?
I'm not sure why you want to do this, but you can use an offset term:
set.seed(101)
dd <- data.frame(x=rnorm(1000),y=rnorm(1000), w = rlnorm(1000))
fit_poly4 <- lm(y ~
-1 + offset(328.2*x-191.14*I(x^2)+43*I(x^3)-3.3583*I(x^4)-137.3),
data=dd,
weights = w)
The -1 suppresses the usual intercept term, so the model has no free coefficients left to estimate.
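Since nothing is being estimated, it may be simpler to evaluate the fixed polynomial directly and compute whatever goodness-of-fit measure you need by hand. A minimal sketch, assuming the constant term -137.3 from the code in the question; poly_fixed, the simulated data and the weighted R^2 definition used here are my own choices:

```r
# The fixed polynomial, coefficients as in the question's code
poly_fixed <- function(x) {
  328.2 * x - 191.14 * x^2 + 43 * x^3 - 3.3583 * x^4 - 137.3
}

set.seed(101)
dd <- data.frame(x = runif(200, 1, 5), w = rlnorm(200))
dd$y <- poly_fixed(dd$x) + rnorm(200)    # made-up measurements

res  <- dd$y - poly_fixed(dd$x)          # residuals from the fixed curve
wrss <- sum(dd$w * res^2)                # weighted residual sum of squares

# One possible weighted R^2: compare against the weighted mean of y
wtss <- sum(dd$w * (dd$y - weighted.mean(dd$y, dd$w))^2)
r2_w <- 1 - wrss / wtss
c(wrss = wrss, r2_w = r2_w)
```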