I have a model formula in the form of
model.all <- lme(Response ~ A + B + C)
I would like to update this model by successively removing a predictor variable from the model, so I would end up with 3 models, specifically:
mod.1 <- lme(Response ~ B + C)
mod.2 <- lme(Response ~ A + C)
mod.3 <- lme(Response ~ A + B)
I am thinking of some kind of loop. I am aware of the update function, but I have too many predictor variables to change the code manually.
Any suggestions would be appreciated.
I would use combn on this occasion; see the example below:
Example Data
Response <- runif(100)
A <- runif(100)
B <- runif(100)
C <- runif(100)
Solution
a <- c('A','B','C') #the names of your variables
b <- as.data.frame(combn(a,2)) #two-way combinations of those using combn
#create the formula for each model
my_forms <- sapply(b, function(x) paste('Response ~', paste(x, collapse = ' + ')))
> my_forms #the formulas that will be used in the model
V1 V2 V3
"Response ~ A + B" "Response ~ A + C" "Response ~ B + C"
#run each model
my_models <- lapply(my_forms, function(x) lm(as.formula(x)))
Output
> summary(my_models[[1]])
Call:
lm(formula = as.formula(x))
Residuals:
Min 1Q Median 3Q Max
-0.48146 -0.20745 -0.00247 0.24263 0.58341
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.32415 0.08232 3.938 0.000155 ***
A 0.25404 0.09890 2.569 0.011733 *
B 0.07955 0.10129 0.785 0.434141
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2828 on 97 degrees of freedom
Multiple R-squared: 0.06507, Adjusted R-squared: 0.04579
F-statistic: 3.375 on 2 and 97 DF, p-value: 0.03827
As you can see, each model is saved as a list element in my_models. I find this approach easy to build and run.
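Since the question mentions update, the same result can also be reached by dropping one term at a time from the full model; a minimal sketch (using lm for illustration, as above):
full <- lm(Response ~ A + B + C)
vars <- c('A', 'B', 'C')
#drop each term in turn from the full model via update()
drop_one <- lapply(vars, function(v)
  update(full, as.formula(paste('. ~ . -', v))))
names(drop_one) <- paste0('minus_', vars)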
I am a beginner in R, so I'm sorry if my question is basic and has been answered somewhere else, but unfortunately I could not find the answer.
One of my predictor variables, nationality, has 8 levels.
I want to create a user-defined function that loops through each level of my variable nationality, taking one level per regression. I created a list of the levels of the variable nationality, as such:
mylist <- list("bangladeshian", "british", "filipino", "indian",
"indonesian", "nigerian", "pakistani", "spanish")
then created a user-defined function:
f1 <- function(x) {
  l <- summary(glm(smoke ~ I(nationality == mylist[x]),
                   data = df.subpop, family = binomial(link = "probit")))
  print(l)
}
f1(2)
f1(2) gives this output:
Call:
glm(formula = smoke ~ I(nationality == mylist[x]),
family = binomial(link = "probit"), data = df.subpop)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.629 -0.629 -0.629 -0.629 1.853
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.9173 0.1659 -5.530 3.21e-08 ***
I(nationality == mylist[x])TRUE -4.2935 376.7536 -0.011 0.991
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 73.809 on 78 degrees of freedom
Residual deviance: 73.416 on 77 degrees of freedom
AIC: 77.416
Number of Fisher Scoring iterations: 14
As you can see, the coefficient for nationality is labeled "I(nationality == mylist[x])TRUE", which is not very informative and requires the user to refer back to the call f1(2), and to mylist, to understand which level that coefficient represents. I believe there should be a cleaner and more straightforward way to accurately run a regression for each level without having to call f1() 8 times.
Consider dynamically building the formula with as.formula or reformulate:
nationality_levels <- levels(df.subpop$nationality)
f1 <- function(x) {
  # BUILD FORMULA (EQUIVALENT CALLS)
  f <- as.formula(paste0("smoke ~ I(nationality == '", x, "')"))
  f <- reformulate(paste0("I(nationality == '", x, "')"), "smoke")
  summary(glm(f, data = df.subpop, family = binomial(link = "probit")))
}
reg_list <- lapply(nationality_levels, f1)
reg_list
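A small usage note: naming reg_list by level keeps each result self-documenting when you inspect it later:
names(reg_list) <- nationality_levels
reg_list[["british"]]   # the summary for the 'british' level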
Being aware of the danger of using dynamic variable names, I am trying to loop over various regression models where different variable specifications are chosen. Usually !!rlang::sym() solves this kind of problem for me just fine, but it somehow fails in regressions. A minimal example would be the following:
y <- runif(1000)
x1 <- runif(1000)
x2 <- runif(1000)
df2 <- data.frame(y, x1, x2)
summary(lm(y ~ x1 + x2, data = df2)) ## works
var <- "x1"
summary(lm(y ~ !!rlang::sym(var) + x2, data = df2)) # gives an error
My understanding was that !!rlang::sym(var) takes the value of var (namely x1) and puts it into the code in a way that R treats it as a variable (not a character string). But I seem to be wrong. Can anyone enlighten me?
Personally, I like to do this with some computing on the language. For me, a combination of bquote with eval is easiest (to remember).
var <- as.symbol(var)
eval(bquote(summary(lm(y ~ .(var) + x2, data = df2))))
#Call:
#lm(formula = y ~ x1 + x2, data = df2)
#
#Residuals:
# Min 1Q Median 3Q Max
#-0.49298 -0.26248 -0.00046 0.24111 0.51988
#
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.50244 0.02480 20.258 <2e-16 ***
#x1 -0.01468 0.03161 -0.464 0.643
#x2 -0.01635 0.03227 -0.507 0.612
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Residual standard error: 0.2878 on 997 degrees of freedom
#Multiple R-squared: 0.0004708, Adjusted R-squared: -0.001534
#F-statistic: 0.2348 on 2 and 997 DF, p-value: 0.7908
I find this superior to any approach that doesn't show the same call as summary(lm(y ~ x1+x2, data=df2)).
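The same pattern extends naturally to a loop; a sketch iterating over both candidate variables from the question:
vars <- c("x1", "x2")
#evaluate a bquote()-built call for each variable name
fits <- lapply(vars, function(v)
  eval(bquote(lm(y ~ .(as.symbol(v)), data = df2))))
names(fits) <- vars
fits$x1$call   # lm(formula = y ~ x1, data = df2)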
The bang-bang operator !! only works with "tidy" functions; it is not part of the core R language. A base R function like lm() has no idea how to expand such operators. Instead, you need to wrap the call in functions that can do the expansion. rlang::expr is one such example:
rlang::expr(summary(lm(y ~ !!rlang::sym(var) + x2, data=df2)))
# summary(lm(y ~ x1 + x2, data = df2))
Then you need to use rlang::eval_tidy to actually evaluate it
rlang::eval_tidy(rlang::expr(summary(lm(y ~ !!rlang::sym(var) + x2, data=df2))))
# Call:
# lm(formula = y ~ x1 + x2, data = df2)
#
# Residuals:
# Min 1Q Median 3Q Max
# -0.49178 -0.25482 0.00027 0.24566 0.50730
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.4953683 0.0242949 20.390 <2e-16 ***
# x1 -0.0006298 0.0314389 -0.020 0.984
# x2 -0.0052848 0.0318073 -0.166 0.868
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.2882 on 997 degrees of freedom
# Multiple R-squared: 2.796e-05, Adjusted R-squared: -0.001978
# F-statistic: 0.01394 on 2 and 997 DF, p-value: 0.9862
You can see this version preserves the expanded formula in the model object.
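Wrapping the pair in a small helper (a sketch, not part of the original answer) makes the pattern reusable:
fit_with <- function(var) {
  # inject the symbol, then evaluate the expanded call
  rlang::eval_tidy(
    rlang::expr(summary(lm(y ~ !!rlang::sym(var) + x2, data = df2)))
  )
}
fit_with("x1")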
1) Just use lm(df2), or, if df2 has additional columns beyond what is shown in the question but we just want to regress on x1 and x2, then:
df3 <- df2[c("y", var, "x2")]
lm(df3)
The following are optional and only apply if it is important that the formula appear in the output as if it had been explicitly given.
Compute the formula fo using the first line below and then run lm as in the second line:
fo <- formula(model.frame(df3))
fm <- do.call("lm", list(fo, quote(df3)))
or just run lm as in the first line below and then write a matching call into the fitted object as in the second line (reusing fo from above):
fm <- lm(df3)
fm$call <- call("lm", formula = fo, data = quote(df3))
Either one gives this:
> fm
Call:
lm(formula = y ~ x1 + x2, data = df3)
Coefficients:
(Intercept) x1 x2
0.44752 0.04278 0.05011
2) Character string. lm accepts a character string for the formula, so this also works. The fn$ prefix causes substitution to occur in the character arguments.
library(gsubfn)
fn$lm("y ~ $var + x2", quote(df2))
or at the expense of more involved code, without gsubfn:
do.call("lm", list(sprintf("y ~ %s + x2", var), quote(df2)))
or if you don't care that the formula displays without var substituted then just:
lm(sprintf("y ~ %s + x2", var), df2)
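For completeness, base R's reformulate() builds the formula without string pasting; combined with do.call the substituted formula again shows up in the call (a sketch):
# builds y ~ x1 + x2 and shows it in the printed call
do.call("lm", list(reformulate(c(var, "x2"), response = "y"), quote(df2)))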
I have to run a simulation with R.
Basically I have to create a set of variables (X) in an n x p matrix where the first variable is constantly 1 and the other 23 variables hold values randomly drawn from a N(0,1).
A vector beta of length 24 whose first 2 values are 1 and the rest are 0.
A vector epsilon of length 25 (one entry per observation) drawn from a N(0,1).
After that I create a variable y, which is: y = X %*% beta + epsilon.
Then I select the variable Xj of X with the largest abs(cor(Xj, y)), where j goes from 3 to 24, and I have to run the model y ~ X1 + X2 + Xj and look at the results.
set.seed(123)

n <- 25
p <- 24
b <- seq(1, 1000)

X <- cbind(matrix(1, nrow = 25, ncol = 1),
           matrix(rnorm(25 * 23), nrow = 25, ncol = 23))
beta <- t(t(c(1, 1, rep(0, 22))))
eps <- t(t(rnorm(25)))

y <- X %*% beta + eps

j <- seq(3, 24)
m <- which.max(abs(cor(X[, j], y)))

newX <- as.data.frame(cbind(y, X[, 1], X[, 2], X[, m + 2]))
anyNA(newX[, 2])
mod <- lm(V1 ~ ., data = newX)
summary(mod)
Call:
lm(formula = V1 ~ ., data = newX)

Residuals:
     Min       1Q   Median       3Q      Max
-1.42575 -0.90957  0.06547  0.38879  2.39707

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.1235     0.2421   4.641 0.000126 ***
V2                NA         NA      NA       NA
V3            0.6803     0.2775   2.452 0.022612 *
V4           -0.5943     0.3036  -1.957 0.063101 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.085 on 22 degrees of freedom
Multiple R-squared: 0.3958, Adjusted R-squared: 0.3408
F-statistic: 7.205 on 2 and 22 DF, p-value: 0.003919
Everything runs fine, but as you can see the estimated coefficient for V2 (the coefficient for the variable X1, which is composed of just 1s) is NA.
I don't understand why I get NA as a result; the values in the variable are numeric and there are no missing values.
If anyone can help me to understand, thanks!
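For what it's worth, the NA is expected here: lm() adds its own intercept (hence the note "1 not defined because of singularities"), so the all-ones column V2 duplicates that intercept and gets aliased. Two minimal sketches of ways around it:
# Option 1: drop the constant column and let lm() supply the intercept
mod <- lm(V1 ~ V3 + V4, data = newX)
# Option 2: keep V2 but suppress the automatic intercept
mod <- lm(V1 ~ 0 + V2 + V3 + V4, data = newX)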
In order to correct heteroskedasticity in error terms, I am running the following weighted least squares regression in R :
#Call:
#lm(formula = a ~ q + q2 + b + c, data = mydata, weights = weighting)
#Weighted Residuals:
# Min 1Q Median 3Q Max
#-1.83779 -0.33226 0.02011 0.25135 1.48516
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -3.939440 0.609991 -6.458 1.62e-09 ***
#q 0.175019 0.070101 2.497 0.013696 *
#q2 0.048790 0.005613 8.693 8.49e-15 ***
#b 0.473891 0.134918 3.512 0.000598 ***
#c 0.119551 0.125430 0.953 0.342167
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 0.5096 on 140 degrees of freedom
#Multiple R-squared: 0.9639, Adjusted R-squared: 0.9628
#F-statistic: 933.6 on 4 and 140 DF, p-value: < 2.2e-16
Where "weighting" is a variable (function of the variable q) used for weighting the observations. q2 is simply q^2.
Now, to double-check my results, I manually weight my variables by creating new weighted variables :
mydata$a.wls <- mydata$a * mydata$weighting
mydata$q.wls <- mydata$q * mydata$weighting
mydata$q2.wls <- mydata$q2 * mydata$weighting
mydata$b.wls <- mydata$b * mydata$weighting
mydata$c.wls <- mydata$c * mydata$weighting
And I run the following regression, without the weights option and without a constant; since the constant is weighted, the column of 1s in the original predictor matrix should now equal the variable weighting:
Call:
lm(formula = a.wls ~ 0 + weighting + q.wls + q2.wls + b.wls + c.wls,
data = mydata)
#Residuals:
# Min 1Q Median 3Q Max
#-2.38404 -0.55784 0.01922 0.49838 2.62911
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#weighting -4.125559 0.579093 -7.124 5.05e-11 ***
#q.wls 0.217722 0.081851 2.660 0.008726 **
#q2.wls 0.045664 0.006229 7.330 1.67e-11 ***
#b.wls 0.466207 0.121429 3.839 0.000186 ***
#c.wls 0.133522 0.112641 1.185 0.237876
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 0.915 on 140 degrees of freedom
#Multiple R-squared: 0.9823, Adjusted R-squared: 0.9817
#F-statistic: 1556 on 5 and 140 DF, p-value: < 2.2e-16
As you can see, the results are similar but not identical. Am I doing something wrong while manually weighting the variables, or does the option "weights" do something more than simply multiplying the variables by the weighting vector?
Provided you do the manual weighting correctly, you won't see any discrepancy.
So the correct way to go is:
X <- model.matrix(~ q + q2 + b + c, mydata) ## non-weighted model matrix (with intercept)
w <- mydata$weighting ## weights
rw <- sqrt(w) ## root weights
y <- mydata$a ## non-weighted response
X_tilde <- rw * X ## weighted model matrix (with intercept)
y_tilde <- rw * y ## weighted response
## remember to drop intercept when using formula
fit_by_wls <- lm(y ~ X - 1, weights = w)
fit_by_ols <- lm(y_tilde ~ X_tilde - 1)
It is generally recommended, though, to use lm.fit and lm.wfit when passing in a matrix directly:
matfit_by_wls <- lm.wfit(X, y, w)
matfit_by_ols <- lm.fit(X_tilde, y_tilde)
But when using these internal subroutines lm.fit and lm.wfit, all inputs must be complete cases without NA; otherwise the underlying C routine stats:::C_Cdqrls will complain.
If you still want to use the formula interface rather than matrix, you can do the following:
## weight by square root of weights, not weights
mydata$root.weighting <- sqrt(mydata$weighting)
mydata$a.wls <- mydata$a * mydata$root.weighting
mydata$q.wls <- mydata$q * mydata$root.weighting
mydata$q2.wls <- mydata$q2 * mydata$root.weighting
mydata$b.wls <- mydata$b * mydata$root.weighting
mydata$c.wls <- mydata$c * mydata$root.weighting
fit_by_wls <- lm(formula = a ~ q + q2 + b + c, data = mydata, weights = weighting)
fit_by_ols <- lm(formula = a.wls ~ 0 + root.weighting + q.wls + q2.wls + b.wls + c.wls,
data = mydata)
Reproducible Example
Let's use R's built-in dataset trees; use head(trees) to inspect it. There are no NAs in this dataset. We aim to fit the model:
Height ~ Girth + Volume
with some random weights between 1 and 2:
set.seed(0); w <- runif(nrow(trees), 1, 2)
We fit this model via weighted regression, either by passing weights to lm, or by manually transforming the data and calling lm with no weights:
X <- model.matrix(~ Girth + Volume, trees) ## non-weighted model matrix (with intercept)
rw <- sqrt(w) ## root weights
y <- trees$Height ## non-weighted response
X_tilde <- rw * X ## weighted model matrix (with intercept)
y_tilde <- rw * y ## weighted response
fit_by_wls <- lm(y ~ X - 1, weights = w)
#Call:
#lm(formula = y ~ X - 1, weights = w)
#Coefficients:
#X(Intercept) XGirth XVolume
# 83.2127 -1.8639 0.5843
fit_by_ols <- lm(y_tilde ~ X_tilde - 1)
#Call:
#lm(formula = y_tilde ~ X_tilde - 1)
#Coefficients:
#X_tilde(Intercept) X_tildeGirth X_tildeVolume
# 83.2127 -1.8639 0.5843
So indeed, we see identical results.
Alternatively, we can use lm.fit and lm.wfit:
matfit_by_wls <- lm.wfit(X, y, w)
matfit_by_ols <- lm.fit(X_tilde, y_tilde)
We can check coefficients by:
matfit_by_wls$coefficients
#(Intercept) Girth Volume
# 83.2127455 -1.8639351 0.5843191
matfit_by_ols$coefficients
#(Intercept) Girth Volume
# 83.2127455 -1.8639351 0.5843191
Again, results are the same.
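As a final check, the coefficient vectors agree to numerical precision (the names differ, so attributes are ignored in the comparison):
all.equal(coef(fit_by_wls), coef(fit_by_ols), check.attributes = FALSE)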
This piece of code will return the coefficients: intercept, slope1, slope2.
set.seed(1)
n <- 10
y <- rnorm(n)
x1 <- rnorm(n)
x2 <- rnorm(n)

lm.ft <- function(y, x1, x2)
  return(lm(y ~ x1 + x2)$coef)

res <- list()
for (i in 1:n) {
  x1.bar <- x1 - x1[i]
  x2.bar <- x2 - x2[i]
  res[[i]] <- lm.ft(y, x1.bar, x2.bar)
}
If I type:
> res[[1]]
I get:
(Intercept) x1 x2
-0.44803887 0.06398476 -0.62798646
How can we return predicted values, residuals, R squared, etc.?
Is there something general I can use to extract whatever I need from the summary?
There are a couple of things going on here.
First, you are better off combining your variables into a data.frame:
df <- data.frame(y=rnorm(10), x1=rnorm(10), x2 = rnorm(10))
fit <- lm(y~x1+x2, data=df)
If you do this, using your model for prediction with a new dataset will be much easier.
Second, some of the statistics of the fit are accessible from the model itself, and some are accessible from summary(fit).
coef <- coefficients(fit) # coefficients
resid <- residuals(fit) # residuals
pred <- predict(fit) # fitted values
rsq <- summary(fit)$r.squared # R-sq for the fit
se <- summary(fit)$sigma # se of the fit
To get the statistics of the coefficients, you need to use summary:
stat.coef <- summary(fit)$coefficients
coef <- stat.coef[,1] # 1st column: coefficients (same as above)
se.coef <- stat.coef[,2] # 2nd column: se for each coef
t.coef <- stat.coef[,3] # 3rd column: t-value for each coef
p.coef <- stat.coef[,4] # 4th column: p-value for each coefficient
In your function, you return just the coefficients. Try returning the whole model:
lm.ft=function(y,x1,x2) lm(y~x1+x2) # You don't need the return statement.
Now try your code, and then run:
summary(res[[1]])
# Call:
# lm(formula = y ~ x1 + x2)
#
# Residuals:
# Min 1Q Median 3Q Max
# -0.88518 -0.25311 0.03868 0.43110 0.61753
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -0.44804 0.32615 -1.374 0.2119
# x1 0.06398 0.24048 0.266 0.7979
# x2 -0.62799 0.26915 -2.333 0.0524 .
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.6149 on 7 degrees of freedom
# Multiple R-squared: 0.5173, Adjusted R-squared: 0.3794
# F-statistic: 3.751 on 2 and 7 DF, p-value: 0.07814
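If you would rather have one general extractor than pick pieces out by hand, the broom package (an addition here, not part of the original answer) tidies all of this into data frames:
library(broom)
tidy(fit)    # coefficients with SEs, t-values, p-values
glance(fit)  # model-level stats: r.squared, sigma, AIC, ...
augment(fit) # per-observation fitted values and residuals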
You need predict:
set.seed(1)
n <- 10
y <- rnorm(n)
x1 <- rnorm(n)
x2 <- rnorm(n)

lm.ft <- function(y, x1, x2)
  # return(lm(y ~ x1 + x2)$coef)
  return(lm(y ~ x1 + x2))

res <- lm.ft(y, x1, x2)
ypredicted <- predict(res)
residuals <- y - ypredicted
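Equivalently, the built-in extractors return the same values straight from the fitted model:
# Same values via the extractor functions (names aside)
all.equal(unname(residuals(res)), unname(y - ypredicted))
all.equal(unname(fitted(res)), unname(ypredicted))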