Here's my problem: Why do the following procedures (classical version and custom function) for planned contrasts lead to different results for the estimate and the 95%CI? Please note that I copied the custom function from this website.
#classical version
data(mtcars)
#set Helmert contrasts
cyl2<-c(-1,1,0)
cyl1<-c(-1,-1,2)
mtcars$cyl<-factor(mtcars$cyl)
contrasts(mtcars$cyl) <-cbind(c1,c2)
classical<-summary.lm(aov(disp~cyl, mtcars))
#custom function (I want to use it because it includes results for equal AND unequal variances --> if the custom function is correct, results for equal variances should be the same as in the classical example):
oneway <- function(dv, group, contrast, alpha = .05) {
# -- arguments --
# dv: vector of measurements (i.e., dependent variable)
# group: vector that identifies which group the dv measurement came from
# contrast: list of named contrasts
# alpha: alpha level for 1 - alpha confidence level
# -- output --
# computes confidence interval and test statistic for a linear contrast of population means in a between-subjects design
# returns a data.frame object
# estimate (est), standard error (se), t-statistic (z), degrees of freedom (df), two-tailed p-value (p), and lower (lwr) and upper (upr) confidence limits at requested 1 - alpha confidence level
# first line reports test statistics that assume variances are equal
# second line reports test statistics that do not assume variances are equal
# means, standard deviations, and sample sizes
ms <- by(dv, group, mean, na.rm = TRUE)
vars <- by(dv, group, var, na.rm = TRUE)
ns <- by(dv, group, function(x) sum(!is.na(x)))
# convert list of contrasts to a matrix of named contrasts by row
contrast <- matrix(unlist(contrast), nrow = length(contrast), byrow = TRUE, dimnames = list(names(contrast), NULL))
# contrast estimate
est <- contrast %*% ms
# welch test statistic
se_welch <- sqrt(contrast^2 %*% (vars / ns))
t_welch <- est / se_welch
# classic test statistic
mse <- anova(lm(dv ~ factor(group)))$"Mean Sq"[2]
se_classic <- sqrt(mse * (contrast^2 %*% (1 / ns)))
t_classic <- est / se_classic
# if dimensions of contrast are NULL, nummer of contrasts = 1, if not, nummer of contrasts = dimensions of contrast
num_contrast <- ifelse(is.null(dim(contrast)), 1, dim(contrast)[1])
df_welch <- rep(0, num_contrast)
df_classic <- rep(0, num_contrast)
# makes rows of contrasts if contrast dimensions aren't NULL
if(is.null(dim(contrast))) contrast <- t(as.matrix(contrast))
# calculating degrees of freedom for welch and classic
for(i in 1:num_contrast) {
df_classic[i] <- sum(ns) - length(ns)
df_welch[i] <- sum(contrast[i, ]^2 * vars / ns)^2 / sum((contrast[i, ]^2 * vars / ns)^2 / (ns - 1))
}
# p-values
p_welch <- 2 * (1 - pt(abs(t_welch), df_welch))
p_classic <- 2 * (1 - pt(abs(t_classic), df_classic))
# 95% confidence intervals
lwr_welch <- est - se_welch * qt(p = 1 - (alpha / 2), df = df_welch)
upr_welch <- est + se_welch * qt(p = 1 - (alpha / 2), df = df_welch)
lwr_classic <- est - se_classic * qt(p = 1 - (alpha / 2), df = df_classic)
upr_classic <- est + se_classic * qt(p = 1 - (alpha / 2), df = df_classic)
# output
data.frame(contrast = rep(rownames(contrast), times = 2),
equal_var = rep(c("Assumed", "Not Assumed"), each = num_contrast),
est = rep(est, times = 2),
se = c(se_classic, se_welch),
t = c(t_classic, t_welch),
df = c(df_classic, df_welch),
p = c(p_classic, p_welch),
lwr = c(lwr_classic, lwr_welch),
upr = c(upr_classic, upr_welch))
}
#results for mtcars with and without Welch correction:
custom<-(with(mtcars,
oneway(dv = disp, group= cyl, contrast = list (cyl1=c(-1,-1,2), cyl2 =c(-1,1,0)))))
Now results are the same for p and t for the classical and the custom version, as expected (at least when equal_var = Assumed). But why are the estimate and the 95%CIs different?
> custom
contrast equal_var est se t df p lwr upr
1 cyl1 Assumed 417.74935 37.20986 11.226845 29.000000 4.487966e-12 341.64664 493.8521
2 cyl2 Assumed 78.17792 24.96113 3.131986 29.000000 3.945539e-03 27.12667 129.2292
3 cyl1 Not Assumed 417.74935 40.30748 10.364066 18.452900 3.985000e-09 333.21522 502.2835
4 cyl2 Not Assumed 78.17792 17.67543 4.422972 9.224964 1.566927e-03 38.34147 118.0144
> classical
Call:
aov(formula = disp ~ cyl, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-77.300 -30.586 -6.568 20.814 118.900
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 213.850 9.507 22.494 < 2e-16 ***
cyl1 69.625 6.202 11.227 4.49e-12 ***
cyl2 39.089 12.481 3.132 0.00395 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 51.63 on 29 degrees of freedom
Multiple R-squared: 0.8377, Adjusted R-squared: 0.8265
F-statistic: 74.83 on 2 and 29 DF, p-value: 3.551e-12
PS: This was my best attempt to solve this problem. Alternatively, I would be happy for any ideas on how to get estimates and 95%CIs for Welch-corrected contrasts in R that would not involve relying on custom functions from blogs.
Related
I am working with an interaction model similar to this one below:
set.seed(1993)
moderating <- sample(c("Yes", "No"),100, replace = T)
x <- sample(c("Yes", "No"), 100, replace = T)
y <- sample(1:100, 100, replace = T)
df <- data.frame(y, x, moderating)
Results <- lm(y ~ x*moderating)
summary(Results)
Call:
lm(formula = y ~ x * moderating)
Residuals:
Min 1Q Median 3Q Max
-57.857 -29.067 3.043 22.960 59.043
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 52.4000 6.1639 8.501 2.44e-13 ***
xYes 8.4571 9.1227 0.927 0.356
moderatingYes -11.4435 8.9045 -1.285 0.202
xYes:moderatingYes -0.1233 12.4563 -0.010 0.992
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 30.82 on 96 degrees of freedom
Multiple R-squared: 0.04685, Adjusted R-squared: 0.01707
F-statistic: 1.573 on 3 and 96 DF, p-value: 0.2009
I'm learning how to calculate the fitted value of a interaction from a regression table. In the example, the base category (or omitted category) is x= No and moderating = No.
Thus far, I know the following fitted values:
#Calulate Fitted Value From a Regression Interaction by hand
#Omitted Variable = X_no.M_no
X_no.M_no <- 52.4000
X_yes.M_no <- 52.4000 + 8.4571
X_no.M_yes <- 52.4000 + -11.4435
X_yes.M_yes #<- ?
I do not understand how the final category, X_yes.M_yes, is calculated. My initial thoughts were X_yes.M_yes <- 52.4000 + -0.1233, (the intercept plus the interaction term) but that is incorrect. I know its incorrect because, using the predict function, the fitted value of X_yes.M_yes = 49.29032, not 52.4000 + -0.1233 = 52.2767.
How do I calculate, by hand, the predicted value of the X_yes.M_yes category?
Here are the predicted values as generated from the predict function in R
#Validated Here Using the Predict Function:
newdat <- NULL
for(m in na.omit(unique(df$moderating))){
for(i in na.omit(unique(df$x))){
moderating <- m
x <- i
newdat<- rbind(newdat, data.frame(x, moderating))
}
}
Prediction.1 <- cbind(newdat, predict(Results, newdat, se.fit = TRUE))
Prediction.1
Your regression looks like this in math:
hat_y = a + b x + c m + d m x
Where x = 1 when "yes" and 0 when "no" and m is similarly defined by moderating.
Then X_yes.M_yes implies x = 1 and m = 1, so your prediction is a + b + c + d.
or in your notation X_yes.M_yes = 52.4000 + 8.4571 - 11.4435 - 0.1233
Does anyone know if it is possible to use lmFit or lm in R to calculate a linear model with categorical variables while including all possible comparisons between the categories? For example in the test data created here:
set.seed(25)
f <- gl(n = 3, k = 20, labels = c("control", "low", "high"))
mat <- model.matrix(~f, data = data.frame(f = f))
beta <- c(12, 3, 6) #these are the simulated regression coefficient
y <- rnorm(n = 60, mean = mat %*% beta, sd = 2)
m <- lm(y ~ f)
I get the summary:
summary(m)
Call:
lm(formula = y ~ f)
Residuals:
Min 1Q Median 3Q Max
-4.3505 -1.6114 0.1608 1.1615 5.2010
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.4976 0.4629 24.840 < 2e-16 ***
flow 3.0370 0.6546 4.639 2.09e-05 ***
fhigh 6.1630 0.6546 9.415 3.27e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.07 on 57 degrees of freedom
Multiple R-squared: 0.6086, Adjusted R-squared: 0.5949
F-statistic: 44.32 on 2 and 57 DF, p-value: 2.446e-12
which is because the contrasts term ("contr.treatment") compares "high" to "control" and "low" to "control".
Is it possible to get also the comparison between "high" and "low"?
If you use aov instead of lm, you can use the TukeyHSD function from the stats package:
fit <- aov(y ~ f)
TukeyHSD(fit)
# Tukey multiple comparisons of means
# 95% family-wise confidence level
# Fit: aov(formula = y ~ f)
# $f
# diff lwr upr p adj
# low-control 3.036957 1.461707 4.612207 6.15e-05
# high-control 6.163009 4.587759 7.738259 0.00e+00
# high-low 3.126052 1.550802 4.701302 3.81e-05
If you want to use an lm object, you can use the TukeyHSD function from the mosaic package:
library(mosaic)
TukeyHSD(m)
Or, as #ben-bolker suggests,
library(emmeans)
e1 <- emmeans(m, specs = "f")
pairs(e1)
# contrast estimate SE df t.ratio p.value
# control - low -3.036957 0.6546036 57 -4.639 0.0001
# control - high -6.163009 0.6546036 57 -9.415 <.0001
# low - high -3.126052 0.6546036 57 -4.775 <.0001
# P value adjustment: tukey method for comparing a family of 3 estimates
With lmFit:
library(limma)
design <- model.matrix(~0 + f)
colnames(design) <- levels(f)
fit <- lmFit(y, design)
contrast.matrix <- makeContrasts(control-low, control-high, low-high,
levels = design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2)
round(t(rbind(fit2$coefficients, fit2$t, fit2$p.value)), 5)
# [,1] [,2] [,3]
# control - low -3.03696 -4.63938 2e-05
# control - high -6.16301 -9.41487 0e+00
# low - high -3.12605 -4.77549 1e-05
Also see Multiple t-test comparisons for more information.
I am having trouble estimating the constant (intercept) of a multivariate linear regression model by using stochastic gradient descent optimization method (which means batch size equals to 1 of mini-batch gradient descent). The function in R that I use is:
StochasticGradientDescent <- function(data, alpha, iteration, epsilon){data <- matrix(unlist(data), ncol=ncol(data), byrow=FALSE)
independent.variable<- data[,1:ncol(data)-1]
dependent.variable<- data[,ncol(data)]
#add column of 1s for constant
independent.variable <- cbind(theta0 = 1, independent.variable)
theta_new <- matrix( 0, ncol = ncol(independent.variable))
theta_old <- matrix( 1, ncol = ncol(independent.variable))
#Cost function
CostFunction <- function (independent.variable, dependent.variable, theta){
1/(2*(NROW(dependent.variable))) * sum(((independent.variable %*% t(theta)) - dependent.variable)^2);
}
thetas <- vector( mode = "list", length = iteration )
thetas[[1]] <- theta_new
J <- numeric( length = iteration )
J[1] <- CostFunction(independent.variable, dependent.variable, theta_old )
derivative <- function(independent.variable, dependent.variable, theta){
idx <- sample.int(NROW(independent.variable), 1)
descent <- (t(independent.variable[idx, , drop = FALSE]) %*% ((independent.variable[idx, , drop = FALSE] %*% t(theta)) - dependent.variable[idx, drop = FALSE]))
return( t(descent) )
}
#stopping criterion
step <- 1
while(any(abs(theta_new - theta_old) > epsilon) & step <= iteration )
{
step <- step + 1
# gradient descent
theta_old <- theta_new
theta_new <- theta_old - alpha * derivative(independent.variable, dependent.variable, theta_old)
# record keeping
thetas[[step]] <- theta_new
J[step] <- CostFunction(independent.variable, dependent.variable, theta_new)
}
costs <- data.frame( costs = J )
theta <- data.frame( do.call( rbind, thetas ), row.names = NULL )
return( list( costs = costs, theta = theta))
}
I simulate an artificial data.
x1 <- runif(1000000,1,100);
x2 <- runif(1000000,1,200);
y <- 5+4*x1+3*x2;
QR decomposition of lm package gives this result:
fit <- lm(y ~ x1+x2);
summary(fit)
#
#Call:
# lm(formula = y ~ x1 + x2)
#
#Residuals:
# Min 1Q Median 3Q Max
#-7.386e-09 0.000e+00 0.000e+00 0.000e+00 9.484e-10
#
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 5.000e+00 2.162e-14 2.313e+14 <2e-16 ***
# x1 4.000e+00 2.821e-16 1.418e+16 <2e-16 ***
# x2 3.000e+00 1.403e-16 2.138e+16 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 8.062e-12 on 999997 degrees of freedom
#Multiple R-squared: 1, Adjusted R-squared: 1
#F-statistic: 3.292e+32 on 2 and 999997 DF, p-value: < 2.2e-16
My initial values for thetas are 0s. Learning rate is chosen to be 0.00005. Number of iterations is 5000. Stopping criteria which is epsilon here (a user-defined small value) is 0.000001. If the trained parameter’s difference between the two iteration is smaller than this value then the algorithm will stop. The result I get is given below:
data<- data.frame(cbind(x1, x2, y))
results <- StochasticGradientDescent( data = data, alpha = 0.00005, iteration = 5000, epsilon = .000001)
results$theta[ nrow(results$theta), ]
# theta0 V2 V3
#5001 0.2219142 4.04408 2.999861
As you can see estimates of coefficients are very close to actual ones. However, the coefficient estimation for theta0 (intercept/constant) is not even close. Besides, I get these values at the end of the cycle of iterations, which is not good. I cannot converge efficiently. I tried but I really could not figure out why this is the case. Could someone help me, please?
I have run the Hosmer Lemeshow statistic in R, but I have obtained an p-value of 1. This seems strange to me. I know that a high p-valvalue means that we do not reject the null hypothesis that observed and expected are the same, but is it possible i have an error somewhere?
How do i interpret such p-value?
Below is the code i have used to run the test. I also attach how my model looks like. Response variable is a count variable, while all regressors are continous. I have run a negative binomial model, due to detected overdispersion in my initial poisson model.
> hosmerlem <- function(y, yhat, g=10)
+ {cutyhat <- cut(yhat, breaks = quantile(yhat, probs=seq(0,1, 1/g)), include.lowest=TRUE)
+ obs <- xtabs(cbind(1 - y, y) ~ cutyhat)
+ expect <- xtabs(cbind(1 - yhat, yhat) ~ cutyhat)
+ chisq <- sum((obs - expect)^2/expect)
+ P <- 1 - pchisq(chisq, g - 2)
+ return(list(chisq=chisq,p.value=P))}
> hosmerlem(y=TOT.N, yhat=fitted(final.model))
$chisq
[1] -2.529054
$p.value
[1] 1
> final.model <-glm.nb(TOT.N ~ D.PARK + OPEN.L + L.WAT.C + sqrt(L.P.ROAD))
> summary(final.model)
Call:
glm.nb(formula = TOT.N ~ D.PARK + OPEN.L + L.WAT.C + sqrt(L.P.ROAD),
init.theta = 4.979895131, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.08218 -0.70494 -0.09268 0.55575 1.67860
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.032e+00 3.363e-01 11.989 < 2e-16 ***
D.PARK -1.154e-04 1.061e-05 -10.878 < 2e-16 ***
OPEN.L -1.085e-02 3.122e-03 -3.475 0.00051 ***
L.WAT.C 1.597e-01 7.852e-02 2.034 0.04195 *
sqrt(L.P.ROAD) 4.924e-01 3.101e-01 1.588 0.11231
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(4.9799) family taken to be 1)
Null deviance: 197.574 on 51 degrees of freedom
Residual deviance: 51.329 on 47 degrees of freedom
AIC: 383.54
Number of Fisher Scoring iterations: 1
Theta: 4.98
Std. Err.: 1.22
2 x log-likelihood: -371.542
As correctly pointed out by #BenBolker, Hosmer-Lemeshow is a test for logistic regression, not for a negative binomial generalized linear model.
If we consider to apply the test to a logistic regression,
the inputs of the function hosmerlem (a copy of the hoslem.test function in the package ResourceSelection) should be:
- y = a numeric vector of observations, binary (0/1)
- yhat = expected values (probabilities)
Here is an illustrative example that shows how to get the correct inputs:
set.seed(123)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.1 + 0.5*x))
logmod <- glm(y ~ x, family=binomial)
# Important: use the type="response" option
yhat <- predict(logmod, type="response")
hosmerlem(y, yhat)
########
$chisq
[1] 4.522719
$p.value
[1] 0.8071559
The same result is given by the function hoslem.test:
library(ResourceSelection)
hoslem.test(y, yhat)
########
Hosmer and Lemeshow goodness of fit (GOF) test
data: y, yhat
X-squared = 4.5227, df = 8, p-value = 0.8072
As already mentioned, HL-test is not appropriate for the specified model. It is also important to know that a large p-value doesn't necessarily mean a good fit. It could also be that there isn't enough evidence to prove it's a poor fit.
Meanwhile, the gofcat package implementation of the HL-test provides for passing model objects directly to the function without necessarily supplying the observed and predicted values. For the simulated data one has:
library(gofcat)
set.seed(123)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.1 + 0.5*x))
logmod <- glm(y ~ x, family=binomial)
hosmerlem(logmod, group = 10)
Hosmer-Lemeshow Test:
Chi-sq df pr(>chi)
binary(Hosmerlem) 4.5227 8 0.8072
H0: No lack of fit dictated
rho: 100%
In order to correct heteroskedasticity in error terms, I am running the following weighted least squares regression in R :
#Call:
#lm(formula = a ~ q + q2 + b + c, data = mydata, weights = weighting)
#Weighted Residuals:
# Min 1Q Median 3Q Max
#-1.83779 -0.33226 0.02011 0.25135 1.48516
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -3.939440 0.609991 -6.458 1.62e-09 ***
#q 0.175019 0.070101 2.497 0.013696 *
#q2 0.048790 0.005613 8.693 8.49e-15 ***
#b 0.473891 0.134918 3.512 0.000598 ***
#c 0.119551 0.125430 0.953 0.342167
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 0.5096 on 140 degrees of freedom
#Multiple R-squared: 0.9639, Adjusted R-squared: 0.9628
#F-statistic: 933.6 on 4 and 140 DF, p-value: < 2.2e-16
Where "weighting" is a variable (function of the variable q) used for weighting the observations. q2 is simply q^2.
Now, to double-check my results, I manually weight my variables by creating new weighted variables :
mydata$a.wls <- mydata$a * mydata$weighting
mydata$q.wls <- mydata$q * mydata$weighting
mydata$q2.wls <- mydata$q2 * mydata$weighting
mydata$b.wls <- mydata$b * mydata$weighting
mydata$c.wls <- mydata$c * mydata$weighting
And run the following regression, without the weights option, and without a constant - since the constant is weighted, the column of 1 in the original predictor matrix should now equal the variable weighting:
Call:
lm(formula = a.wls ~ 0 + weighting + q.wls + q2.wls + b.wls + c.wls,
data = mydata)
#Residuals:
# Min 1Q Median 3Q Max
#-2.38404 -0.55784 0.01922 0.49838 2.62911
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#weighting -4.125559 0.579093 -7.124 5.05e-11 ***
#q.wls 0.217722 0.081851 2.660 0.008726 **
#q2.wls 0.045664 0.006229 7.330 1.67e-11 ***
#b.wls 0.466207 0.121429 3.839 0.000186 ***
#c.wls 0.133522 0.112641 1.185 0.237876
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 0.915 on 140 degrees of freedom
#Multiple R-squared: 0.9823, Adjusted R-squared: 0.9817
#F-statistic: 1556 on 5 and 140 DF, p-value: < 2.2e-16
As you can see, the results are similar but not identical. Am I doing something wrong while manually weighting the variables, or does the option "weights" do something more than simply multiplying the variables by the weighting vector?
Provided you do manual weighting correctly, you won't see discrepancy.
So the correct way to go is:
X <- model.matrix(~ q + q2 + b + c, mydata) ## non-weighted model matrix (with intercept)
w <- mydata$weighting ## weights
rw <- sqrt(w) ## root weights
y <- mydata$a ## non-weighted response
X_tilde <- rw * X ## weighted model matrix (with intercept)
y_tilde <- rw * y ## weighted response
## remember to drop intercept when using formula
fit_by_wls <- lm(y ~ X - 1, weights = w)
fit_by_ols <- lm(y_tilde ~ X_tilde - 1)
Although it is generally recommended to use lm.fit and lm.wfit when passing in matrix directly:
matfit_by_wls <- lm.wfit(X, y, w)
matfit_by_ols <- lm.fit(X_tilde, y_tilde)
But when using these internal subroutines lm.fit and lm.wfit, it is required that all input are complete cases without NA, otherwise the underlying C routine stats:::C_Cdqrls will complain.
If you still want to use the formula interface rather than matrix, you can do the following:
## weight by square root of weights, not weights
mydata$root.weighting <- sqrt(mydata$weighting)
mydata$a.wls <- mydata$a * mydata$root.weighting
mydata$q.wls <- mydata$q * mydata$root.weighting
mydata$q2.wls <- mydata$q2 * mydata$root.weighting
mydata$b.wls <- mydata$b * mydata$root.weighting
mydata$c.wls <- mydata$c * mydata$root.weighting
fit_by_wls <- lm(formula = a ~ q + q2 + b + c, data = mydata, weights = weighting)
fit_by_ols <- lm(formula = a.wls ~ 0 + root.weighting + q.wls + q2.wls + b.wls + c.wls,
data = mydata)
Reproducible Example
Let's use R's built-in data set trees. Use head(trees) to inspect this dataset. There is no NA in this dataset. We aim to fit a model:
Height ~ Girth + Volume
with some random weights between 1 and 2:
set.seed(0); w <- runif(nrow(trees), 1, 2)
We fit this model via weighted regression, either by passing weights to lm, or manually transforming data and calling lm with no weigths:
X <- model.matrix(~ Girth + Volume, trees) ## non-weighted model matrix (with intercept)
rw <- sqrt(w) ## root weights
y <- trees$Height ## non-weighted response
X_tilde <- rw * X ## weighted model matrix (with intercept)
y_tilde <- rw * y ## weighted response
fit_by_wls <- lm(y ~ X - 1, weights = w)
#Call:
#lm(formula = y ~ X - 1, weights = w)
#Coefficients:
#X(Intercept) XGirth XVolume
# 83.2127 -1.8639 0.5843
fit_by_ols <- lm(y_tilde ~ X_tilde - 1)
#Call:
#lm(formula = y_tilde ~ X_tilde - 1)
#Coefficients:
#X_tilde(Intercept) X_tildeGirth X_tildeVolume
# 83.2127 -1.8639 0.5843
So indeed, we see identical results.
Alternatively, we can use lm.fit and lm.wfit:
matfit_by_wls <- lm.wfit(X, y, w)
matfit_by_ols <- lm.fit(X_tilde, y_tilde)
We can check coefficients by:
matfit_by_wls$coefficients
#(Intercept) Girth Volume
# 83.2127455 -1.8639351 0.5843191
matfit_by_ols$coefficients
#(Intercept) Girth Volume
# 83.2127455 -1.8639351 0.5843191
Again, results are the same.