I am working with an interaction model similar to this one below:
set.seed(1993)
moderating <- sample(c("Yes", "No"), 100, replace = T)
x <- sample(c("Yes", "No"), 100, replace = T)
y <- sample(1:100, 100, replace = T)
df <- data.frame(y, x, moderating)
Results <- lm(y ~ x*moderating)
summary(Results)
Call:
lm(formula = y ~ x * moderating)
Residuals:
Min 1Q Median 3Q Max
-57.857 -29.067 3.043 22.960 59.043
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 52.4000 6.1639 8.501 2.44e-13 ***
xYes 8.4571 9.1227 0.927 0.356
moderatingYes -11.4435 8.9045 -1.285 0.202
xYes:moderatingYes -0.1233 12.4563 -0.010 0.992
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 30.82 on 96 degrees of freedom
Multiple R-squared: 0.04685, Adjusted R-squared: 0.01707
F-statistic: 1.573 on 3 and 96 DF, p-value: 0.2009
I'm learning how to calculate the fitted value of an interaction from a regression table. In the example, the base (omitted) category is x = No and moderating = No.
Thus far, I know the following fitted values:
# Calculate fitted values from a regression interaction by hand
# Omitted (reference) category = X_no.M_no
X_no.M_no <- 52.4000
X_yes.M_no <- 52.4000 + 8.4571
X_no.M_yes <- 52.4000 + -11.4435
X_yes.M_yes #<- ?
I do not understand how the final category, X_yes.M_yes, is calculated. My initial thought was X_yes.M_yes <- 52.4000 + -0.1233 (the intercept plus the interaction term), but that is incorrect. I know it's incorrect because, using the predict function, the fitted value of X_yes.M_yes is 49.29032, not 52.4000 + -0.1233 = 52.2767.
How do I calculate, by hand, the predicted value of the X_yes.M_yes category?
Here are the predicted values as generated from the predict function in R
#Validated Here Using the Predict Function:
newdat <- NULL
for(m in na.omit(unique(df$moderating))){
for(i in na.omit(unique(df$x))){
moderating <- m
x <- i
newdat<- rbind(newdat, data.frame(x, moderating))
}
}
Prediction.1 <- cbind(newdat, predict(Results, newdat, se.fit = TRUE))
Prediction.1
Your regression looks like this in math:
hat_y = a + b x + c m + d m x
Where x = 1 when "yes" and 0 when "no" and m is similarly defined by moderating.
Then X_yes.M_yes implies x = 1 and m = 1, so your prediction is a + b + c + d.
Or, in your notation, X_yes.M_yes = 52.4000 + 8.4571 - 11.4435 - 0.1233 = 49.2903, matching the predict() value of 49.29032 (the small difference is just rounding in the displayed coefficients).
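A minimal numerical check (a sketch, reusing the Results object fitted above):
cf <- coef(Results)
# fitted value for x = Yes, moderating = Yes: the sum of all four coefficients
unname(cf["(Intercept)"] + cf["xYes"] + cf["moderatingYes"] + cf["xYes:moderatingYes"])
#> [1] 49.29032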
I would like to write a function that smooths the growth rate down to 0 over 60 days. So far I have managed to write the following code:
corona <- data.frame(Cases = c(3, 16, 79, 157, 229, 322, 400, 650, 888, 1128, 1694, 2036, 2502, 3089, 3858), Date = seq(as.Date("2020/02/20"), as.Date("2020/03/05"), by = "days"))
library(dplyr)
corona_entire <- corona %>% mutate(Growth = (Cases - lag(Cases))/lag(Cases)*100)
mean(corona_entire$Growth[12:15])
ff = function(x) x*(1.2285823)^60
ff(3858)
However, in my function the growth rate (0.2285823) is constant over the 60 periods. I would like to tell R to make that growth rate tend to 0 as we get closer and closer to day 60. Basically, I need to write a convergence function for the growth rate.
Any idea how can I code it?
Thanks!
Further to my comment above, it's not entirely clear to me what you're trying to do. If you want to model the Growth rate, you need to fit some form of model.
For a start, how about an exponential model of the form y = y0 * exp(k * time)?
In that case we can linearise the model (and data) by taking logs, log(y) = log(y0) + k * time, and then use lm to estimate the coefficients log(y0) and k.
df <- corona_entire %>% mutate(Time = as.integer(Date - min(Date)))
fit <- lm(log(Growth) ~ Time, weights = df$Growth, data = df)
Here I have used a weighted least squares approach by weighting every point by its Growth rate.
We can then plot the points and best fit curve:
f <- function(x, fit) exp(coef(fit)[1])*exp(coef(fit)[2] * x)
ggplot(df, aes(Time, Growth)) +
geom_point() +
stat_function(fun = f, args = list(fit = fit)) +
labs(x = sprintf("Days since %s", min(df$Date)))
Not a good fit but this should give you some ideas. You probably want to fit a more suitable non-linear growth-rate model, and estimate parameters using nls.
Update
Since you really want to predict Cases, let's re-formulate our model.
We start again with an exponential model of the form Cases ~ y0 * exp(k * Time)
ggplot(df, aes(Time, Cases)) +
geom_point()
fit1 <- lm(log(Cases) ~ Time, data = df)
f1 <- function(x, fit) exp(coef(fit)[1])*exp(coef(fit)[2] * x)
ggplot(df, aes(Time, Cases)) +
geom_point() +
stat_function(fun = f1, args = list(fit = fit1)) +
labs(x = sprintf("Days since %s", min(df$Date)))
Not a good fit! Results seem to suggest sub-exponential growth. A simple model for sub-exponential growth in epidemiology is a model of the form Cases ~ (r / m * Time + A)^m, see e.g. Chowell et al., Phys. Life Rev. 18, 66 (2016).
So let's fit the model, this time using the non-linear least-squares routine nls.
fit2 <- nls(
Cases ~ (r / m * Time + A)^m,
data = df,
start = list(r = 4, m = 3, A = 1))
f2 <- function(x, r, m, A) (r / m * x + A)^m
ggplot(df, aes(Time, Cases)) +
geom_point() +
stat_function(
fun = f2,
args = list(
r = coef(fit2)[1],
m = coef(fit2)[2],
A = coef(fit2)[3])) +
labs(x = sprintf("Days since %s", min(df$Date)))
Looks like a decent fit. You can inspect the quality of the fit and the non-linear least-squares coefficient estimates with summary(fit2):
summary(fit2)
#
#Formula: Cases ~ (r/m * Time + A)^m
#
#Parameters:
# Estimate Std. Error t value Pr(>|t|)
#r 2.3308 0.6543 3.562 0.00391 **
#m 3.3316 0.4202 7.929 4.12e-06 ***
#A 2.1101 0.3126 6.750 2.04e-05 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 51.41 on 12 degrees of freedom
#
#Number of iterations to convergence: 6
#Achieved convergence tolerance: 6.514e-07
#
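If you then want a 60-day projection of Cases from this nls fit, a sketch (note this extrapolates far beyond the two weeks of observed data, so treat it with caution):
proj <- data.frame(Time = 0:59)
proj$Cases_hat <- predict(fit2, newdata = proj)
tail(proj)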
If you just want a linear fall in the daily growth factor towards 1 (i.e. zero growth) over 60 days, you can do this:
ff = function(initial_n, initial_rate = 1.2285823, days = 60, time_to_stasis = 60)
{
daily_rate <- seq(initial_rate, 1, length.out = time_to_stasis)
result <- numeric(days)
result[1] <- initial_n
for(i in seq(days - 1)) result[i + 1] <- floor(daily_rate[i] * result[i])
return(result)
}
So you get a daily number like this:
ff(3858)
#> [1] 3858 4739 5803 7084 8620 10456 12643 15239 18309 21926
#> [11] 26173 31141 36932 43656 51436 60403 70699 82477 95897 111129
#> [21] 128350 147743 169494 193790 220818 250760 283791 320073 359754 402961
#> [31] 449796 500332 554607 612621 674331 739644 808418 880454 955498 1033237
#> [41] 1113297 1195248 1278600 1362812 1447290 1531398 1614460 1695773 1774611 1850239
#> [51] 1921922 1988936 2050581 2106192 2155151 2196899 2230944 2256873 2274360 2283171
and you can adjust the parameters to whatever you like.
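For example, a lower starting growth factor over the same 60-day horizon (the value is purely illustrative):
ff(3858, initial_rate = 1.15)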
You could use it to plot projections like this:
plot(1:60, ff(3858))
I'm not sure how biologically plausible this is though.
Looking at the data, a quadratic curve seems a better option for modelling Cases as a function of days:
corona$days = as.numeric(corona$Date - corona$Date[1], "days") + 1
mod = lm(Cases ~ poly(days, 2, raw = TRUE), corona)
summary(mod)
#Call:
#lm(formula = Cases ~ poly(days, 2, raw = TRUE), data = corona)
#Residuals:
# Min 1Q Median 3Q Max
#-140.48 -50.63 -24.30 65.89 148.04
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 264.912 84.071 3.151 0.00836 **
#poly(days, 2, raw = TRUE)1 -158.269 24.179 -6.546 2.75e-05 ***
#poly(days, 2, raw = TRUE)2 25.863 1.469 17.600 6.17e-10 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 94.38 on 12 degrees of freedom
#Multiple R-squared: 0.9949, Adjusted R-squared: 0.9941
#F-statistic: 1181 on 2 and 12 DF, p-value: 1.668e-14
plot(corona$days, corona$Cases)
lines(predict(mod, data.frame(days = corona$days)))
# Growth Rate
d = predict(mod, data.frame(days = 59:60))
diff(d)/d[1]
# 2
#0.03606188
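If you want to push this quadratic fit out to a 60-day projection, a sketch (again, this is extrapolation well beyond the 15 observed days):
new_days <- data.frame(days = 1:60)
plot(new_days$days, predict(mod, new_days), type = "l", xlab = "days", ylab = "predicted Cases")
points(corona$days, corona$Cases)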
Let's say I have data:
data = data.frame(xdata = 1:10, ydata = 6:15)
I look at the data
data
xdata ydata
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
6 6 11
7 7 12
8 8 13
9 9 14
10 10 15
Now I want to add a third column to the data, which should be an increment/estimate, and a fourth column, which should be standard errors. To do this, I estimate the increment for each row of the data by fitting a linear model and taking the slope/estimate along with the associated standard error. So I fit model_1:
model_1 = lm(ydata~xdata,data = data)
out = summary(model_1)
out
It gives me:
Call:
lm(formula = ydata ~ xdata, data = data)
Residuals:
Min 1Q Median 3Q Max
-5.661e-16 -1.157e-16 4.273e-17 2.153e-16 4.167e-16
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.000e+00 2.458e-16 2.035e+16 <2e-16 ***
xdata 1.000e+00 3.961e-17 2.525e+16 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.598e-16 on 8 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 6.374e+32 on 1 and 8 DF, p-value: < 2.2e-16
To extract the estimate, I can simply do:
out$coefficients[2,1]
To extract the standard error, I can simply do:
out$coefficients[2,2]
But my interest is in an output that shows estimates and standard errors for each row, so that I end up with 10 estimates and 10 standard errors. Is there a way to do this?
Many thanks!
Basically, your lm model is of the form y = intercept + coefficient * x.
So you can calculate the estimate from the coefficients reported by summary(lm(...)).
So, if you take the following example:
set.seed(123)
vector1 = rnorm(100, mean = 4)
vector2 = rnorm(100, mean = 1)
dat = data.frame(vector1,vector2)
model_dat = lm(vector2 ~ vector1, data = dat)
Model = summary(model_dat)
And now, you can calculate the estimate:
dat$estimate = dat$vector1 * Model$coefficients[2,1] + Model$coefficients[1,1]
And then for the standard error, you can use predict.lm with the argument se.fit = TRUE:
dat$SE = predict.lm(model_dat, se.fit = TRUE, level = 0.95)$se.fit
So, you get the following dataset:
> head(dat)
vector1 vector2 estimate SE
1 3.439524 0.28959344 0.9266060 0.11942447
2 3.769823 1.25688371 0.9092747 0.10294104
3 5.558708 0.75330812 0.8154090 0.18452625
4 4.070508 0.65245740 0.8934973 0.09709476
5 4.129288 0.04838143 0.8904130 0.09716038
6 5.715065 0.95497228 0.8072047 0.19893259
You can check the result by comparing it with the plot obtained using stat_smooth:
library(ggplot2)
ggplot(dat, aes(x = vector1, y = vector2)) + geom_point() + stat_smooth(method = "lm", se = TRUE)
And you get this plot:
And if you now use the estimate and SE columns from your dat:
ggplot(dat, aes(x = vector1, y = vector2)) + geom_point() +
geom_line(aes(x = vector1, y = estimate), color = "red")+
geom_line(aes(x = vector1, y = estimate+SE)) +
geom_line(aes(x = vector1, y = estimate-SE))
You get almost the same plot:
Hope that answers your question.
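As an aside, since the estimates here are just the fitted values at the original data points, both columns can be obtained more compactly from predict on the same model_dat object:
pred <- predict(model_dat, se.fit = TRUE)
dat$estimate <- pred$fit    # same as the manual intercept + slope * vector1
dat$SE <- pred$se.fit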
Does anyone know if it is possible to use lmFit or lm in R to fit a linear model with categorical variables while including all possible comparisons between the categories? For example, in the test data created here:
set.seed(25)
f <- gl(n = 3, k = 20, labels = c("control", "low", "high"))
mat <- model.matrix(~f, data = data.frame(f = f))
beta <- c(12, 3, 6) # these are the simulated regression coefficients
y <- rnorm(n = 60, mean = mat %*% beta, sd = 2)
m <- lm(y ~ f)
I get the summary:
summary(m)
Call:
lm(formula = y ~ f)
Residuals:
Min 1Q Median 3Q Max
-4.3505 -1.6114 0.1608 1.1615 5.2010
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.4976 0.4629 24.840 < 2e-16 ***
flow 3.0370 0.6546 4.639 2.09e-05 ***
fhigh 6.1630 0.6546 9.415 3.27e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.07 on 57 degrees of freedom
Multiple R-squared: 0.6086, Adjusted R-squared: 0.5949
F-statistic: 44.32 on 2 and 57 DF, p-value: 2.446e-12
This is because the default treatment contrasts ("contr.treatment") compare "high" to "control" and "low" to "control".
Is it possible to also get the comparison between "high" and "low"?
If you use aov instead of lm, you can use the TukeyHSD function from the stats package:
fit <- aov(y ~ f)
TukeyHSD(fit)
# Tukey multiple comparisons of means
# 95% family-wise confidence level
# Fit: aov(formula = y ~ f)
# $f
# diff lwr upr p adj
# low-control 3.036957 1.461707 4.612207 6.15e-05
# high-control 6.163009 4.587759 7.738259 0.00e+00
# high-low 3.126052 1.550802 4.701302 3.81e-05
If you want to use an lm object, you can use the TukeyHSD function from the mosaic package:
library(mosaic)
TukeyHSD(m)
Or, as @ben-bolker suggests,
library(emmeans)
e1 <- emmeans(m, specs = "f")
pairs(e1)
# contrast estimate SE df t.ratio p.value
# control - low -3.036957 0.6546036 57 -4.639 0.0001
# control - high -6.163009 0.6546036 57 -9.415 <.0001
# low - high -3.126052 0.6546036 57 -4.775 <.0001
# P value adjustment: tukey method for comparing a family of 3 estimates
With lmFit:
library(limma)
design <- model.matrix(~0 + f)
colnames(design) <- levels(f)
fit <- lmFit(y, design)
contrast.matrix <- makeContrasts(control-low, control-high, low-high,
levels = design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2)
round(t(rbind(fit2$coefficients, fit2$t, fit2$p.value)), 5)
# [,1] [,2] [,3]
# control - low -3.03696 -4.63938 2e-05
# control - high -6.16301 -9.41487 0e+00
# low - high -3.12605 -4.77549 1e-05
Also see Multiple t-test comparisons for more information.
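If you only need that single extra comparison and want to stay with plain lm, another option (a base-R sketch, not shown in the answers above) is to refit with a different reference level, so the "high" coefficient becomes the high-vs-low contrast. Note that this p-value is not adjusted for multiple comparisons, unlike TukeyHSD:
m_low <- lm(y ~ relevel(f, ref = "low"))
summary(m_low)  # the "high" coefficient is now the high - low difference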
I'm evaluating the predictive performance of a deterministic numerical model against observed data. I made a scatter plot of the observed (Vsurface) vs. modeled (Vmod) data, fit a lm (the red line), and added a 1:1 line. I want to find the point where these two lines intersect so I can document where the model shifts from over-predicting to under-predicting. Is there an easy way to do this? Here is the code for the lm:
lm <- lm(Vmod~Vsurface, data = v)
summary(lm)
Call:
lm(formula = Vmod ~ Vsurface, data = v)
Residuals:
Min 1Q Median 3Q Max
-0.63267 -0.11995 -0.03618 0.13816 0.60314
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.20666 0.06087 3.395 0.00185 **
Vsurface 0.43721 0.06415 6.816 1.05e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2232 on 32 degrees of freedom
Multiple R-squared: 0.5921, Adjusted R-squared: 0.5794
F-statistic: 46.45 on 1 and 32 DF, p-value: 1.047e-07
Here is the plot code:
ggplot(data = v, aes(x = Vsurface, y = Vmod)) +
geom_point(col = "slateblue2") +
geom_smooth(method = "lm", col = "red") +
geom_abline(intercept = 0, slope = 1)
I'm working in R markdown.
Just to state G5W's comment explicitly: the model is a list and the coefficients can be extracted directly. The 1:1 line is y = x, so setting intercept + slope * x = x gives x = intercept / (1 - slope):
lmodel <- lm(Vmod~Vsurface, data = v)
x1 <- lmodel$coefficients[1]/(1-lmodel$coefficients[2])
### x1 is the intersection point
Edit: a step-by-step example:
x <- rnorm(100,10,2)
y <- rnorm(100,15,3)
lmodel <- lm(y ~x)
lmodel$coefficients
Intercept = 13.6578378
x = 0.1283835
x1 <- lmodel$coefficients[1]/(1 - lmodel$coefficients[2])
x1
15.66955
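If you also want to mark the crossover on the original plot, a sketch reusing the question's ggplot code (assuming x1 was computed from the Vmod ~ Vsurface fit as above, and ggplot2 is loaded):
ggplot(data = v, aes(x = Vsurface, y = Vmod)) +
  geom_point(col = "slateblue2") +
  geom_smooth(method = "lm", col = "red") +
  geom_abline(intercept = 0, slope = 1) +
  geom_vline(xintercept = x1, linetype = "dashed")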
In order to correct for heteroskedasticity in the error terms, I am running the following weighted least squares regression in R:
#Call:
#lm(formula = a ~ q + q2 + b + c, data = mydata, weights = weighting)
#Weighted Residuals:
# Min 1Q Median 3Q Max
#-1.83779 -0.33226 0.02011 0.25135 1.48516
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -3.939440 0.609991 -6.458 1.62e-09 ***
#q 0.175019 0.070101 2.497 0.013696 *
#q2 0.048790 0.005613 8.693 8.49e-15 ***
#b 0.473891 0.134918 3.512 0.000598 ***
#c 0.119551 0.125430 0.953 0.342167
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 0.5096 on 140 degrees of freedom
#Multiple R-squared: 0.9639, Adjusted R-squared: 0.9628
#F-statistic: 933.6 on 4 and 140 DF, p-value: < 2.2e-16
Where "weighting" is a variable (function of the variable q) used for weighting the observations. q2 is simply q^2.
Now, to double-check my results, I manually weight my variables by creating new weighted variables:
mydata$a.wls <- mydata$a * mydata$weighting
mydata$q.wls <- mydata$q * mydata$weighting
mydata$q2.wls <- mydata$q2 * mydata$weighting
mydata$b.wls <- mydata$b * mydata$weighting
mydata$c.wls <- mydata$c * mydata$weighting
And run the following regression, without the weights option and without a constant - since the constant is weighted, the column of 1s in the original predictor matrix should now equal the variable weighting:
Call:
lm(formula = a.wls ~ 0 + weighting + q.wls + q2.wls + b.wls + c.wls,
data = mydata)
#Residuals:
# Min 1Q Median 3Q Max
#-2.38404 -0.55784 0.01922 0.49838 2.62911
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#weighting -4.125559 0.579093 -7.124 5.05e-11 ***
#q.wls 0.217722 0.081851 2.660 0.008726 **
#q2.wls 0.045664 0.006229 7.330 1.67e-11 ***
#b.wls 0.466207 0.121429 3.839 0.000186 ***
#c.wls 0.133522 0.112641 1.185 0.237876
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 0.915 on 140 degrees of freedom
#Multiple R-squared: 0.9823, Adjusted R-squared: 0.9817
#F-statistic: 1556 on 5 and 140 DF, p-value: < 2.2e-16
As you can see, the results are similar but not identical. Am I doing something wrong while manually weighting the variables, or does the option "weights" do something more than simply multiplying the variables by the weighting vector?
Provided you do the manual weighting correctly, you won't see any discrepancy. lm's weights enter the objective as w * residual^2, so the equivalent manual transformation multiplies each variable (including the intercept column and the response) by sqrt(w), not by w.
So the correct way to go is:
X <- model.matrix(~ q + q2 + b + c, mydata) ## non-weighted model matrix (with intercept)
w <- mydata$weighting ## weights
rw <- sqrt(w) ## root weights
y <- mydata$a ## non-weighted response
X_tilde <- rw * X ## weighted model matrix (with intercept)
y_tilde <- rw * y ## weighted response
## remember to drop intercept when using formula
fit_by_wls <- lm(y ~ X - 1, weights = w)
fit_by_ols <- lm(y_tilde ~ X_tilde - 1)
Although it is generally recommended to use lm.fit and lm.wfit when passing in a matrix directly:
matfit_by_wls <- lm.wfit(X, y, w)
matfit_by_ols <- lm.fit(X_tilde, y_tilde)
But when using these internal subroutines lm.fit and lm.wfit, all inputs must be complete cases without NA, otherwise the underlying C routine stats:::C_Cdqrls will complain.
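One way to guard against that (a sketch, assuming the X, y and w objects defined above) is to subset to complete cases first:
ok <- complete.cases(X, y, w)
matfit_by_wls <- lm.wfit(X[ok, , drop = FALSE], y[ok], w[ok])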
If you still want to use the formula interface rather than matrix, you can do the following:
## weight by square root of weights, not weights
mydata$root.weighting <- sqrt(mydata$weighting)
mydata$a.wls <- mydata$a * mydata$root.weighting
mydata$q.wls <- mydata$q * mydata$root.weighting
mydata$q2.wls <- mydata$q2 * mydata$root.weighting
mydata$b.wls <- mydata$b * mydata$root.weighting
mydata$c.wls <- mydata$c * mydata$root.weighting
fit_by_wls <- lm(formula = a ~ q + q2 + b + c, data = mydata, weights = weighting)
fit_by_ols <- lm(formula = a.wls ~ 0 + root.weighting + q.wls + q2.wls + b.wls + c.wls,
data = mydata)
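A quick side-by-side check (the coefficient names differ between the two fits, but the values should be identical):
cbind(wls = coef(fit_by_wls), ols = coef(fit_by_ols))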
Reproducible Example
Let's use R's built-in data set trees; use head(trees) to inspect it. There are no NAs in this dataset. We aim to fit the model:
Height ~ Girth + Volume
with some random weights between 1 and 2:
set.seed(0); w <- runif(nrow(trees), 1, 2)
We fit this model via weighted regression, either by passing weights to lm, or by manually transforming the data and calling lm with no weights:
X <- model.matrix(~ Girth + Volume, trees) ## non-weighted model matrix (with intercept)
rw <- sqrt(w) ## root weights
y <- trees$Height ## non-weighted response
X_tilde <- rw * X ## weighted model matrix (with intercept)
y_tilde <- rw * y ## weighted response
fit_by_wls <- lm(y ~ X - 1, weights = w)
#Call:
#lm(formula = y ~ X - 1, weights = w)
#Coefficients:
#X(Intercept) XGirth XVolume
# 83.2127 -1.8639 0.5843
fit_by_ols <- lm(y_tilde ~ X_tilde - 1)
#Call:
#lm(formula = y_tilde ~ X_tilde - 1)
#Coefficients:
#X_tilde(Intercept) X_tildeGirth X_tildeVolume
# 83.2127 -1.8639 0.5843
So indeed, we see identical results.
Alternatively, we can use lm.fit and lm.wfit:
matfit_by_wls <- lm.wfit(X, y, w)
matfit_by_ols <- lm.fit(X_tilde, y_tilde)
We can check coefficients by:
matfit_by_wls$coefficients
#(Intercept) Girth Volume
# 83.2127455 -1.8639351 0.5843191
matfit_by_ols$coefficients
#(Intercept) Girth Volume
# 83.2127455 -1.8639351 0.5843191
Again, results are the same.