I'm trying to figure out how to set up purrr to run several multiple regressions like the ones below. As you will notice, this dataset describes an intervention program, and we are analyzing the data using ANCOVA procedures (TIME 2 ~ TIME 1 + CONDITION).
om4g**TIME2**01 ~ om4g**TIME1**01 + CONDITION
example:
om4g201 ~ om4g101 + CONDITION
In case someone wants reproducible code:
dataset <- data.frame(rest201=c(10,20,30,40),
rest101=c(5,10,20,24),
omgt201=c(40,10,20,10),
omgt101=c(10,20,10,05),
CONDITION=c(0,1))
lm(rest201~rest101+CONDITION, data=dataset)
lm(omgt201~omgt101+CONDITION, data=dataset)
I found only one question similar to mine here (Making linear models in a for loop using R programming), but the answer did not work for me.
Thanks!
Similar to @Roman's answer, here is how to do it using map2 from purrr:
library(purrr)
y_var = c("rest201", "omgt201")
x_var = list(c("rest101", "CONDITION"), c("omgt101", "CONDITION"))
map2(x_var, y_var, ~ lm(as.formula(paste(.y, "~", paste(.x, collapse = " + "))), data = dataset))
To get the summary table for each model, you can wrap each lm with summary and extract the coefficients table:
map2(x_var, y_var, ~ {
lm(as.formula(paste(.y, "~", paste(.x, collapse = " + "))), data = dataset) %>%
summary() %>%
`$`("coefficients")
})
Result:
[[1]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.779097 0.76821670 3.617596 0.17169133
rest101 1.377672 0.04750594 29.000000 0.02194371
CONDITION 3.800475 0.72163694 5.266464 0.11945968
[[2]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.000000e+01 16.666667 1.800000e+00 0.3228289
omgt101 -2.445145e-16 1.333333 -1.833859e-16 1.0000000
CONDITION -2.000000e+01 14.529663 -1.376494e+00 0.3999753
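If there are many outcome columns, the variable pairs do not have to be typed by hand. The following is a minimal sketch, assuming every post-test column ends in 201 and its pre-test counterpart ends in 101 (true for the toy data in the question, but only an assumption for the real data). The names `models` and `coef_tables` are illustrative.

```r
library(purrr)

# Toy data from the question (CONDITION is recycled to 0, 1, 0, 1)
dataset <- data.frame(rest201 = c(10, 20, 30, 40),
                      rest101 = c(5, 10, 20, 24),
                      omgt201 = c(40, 10, 20, 10),
                      omgt101 = c(10, 20, 10, 5),
                      CONDITION = c(0, 1))

# Assumption: outcomes end in "201" and baselines end in "101"
y_var <- grep("201$", names(dataset), value = TRUE)
x_var <- sub("201$", "101", y_var)

# Fit one model per outcome; naming y_var makes the result a named list
models <- map2(set_names(y_var), x_var,
               ~ lm(reformulate(c(.y, "CONDITION"), response = .x),
                    data = dataset))

# Coefficient table for each model
coef_tables <- map(models, ~ coef(summary(.x)))
```

Here `reformulate()` builds the formula from character vectors, which avoids the nested `paste()` calls, and the named list makes it easy to look up a model by its outcome, e.g. `models$rest201`.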
You could construct a list of formulas for each model and use that to construct a model.
x <- c(101, 102, 103)
mdls <- sprintf("omg4g%s ~ om4g%s + CONDITION",
as.character(x + 100),
as.character(x)
)
out <- sapply(mdls, FUN = function(x) {
formula(x, data = latino_dataset)
})
$`omg4g201 ~ om4g101 + CONDITION`
omg4g201 ~ om4g101 + CONDITION
<environment: 0x0000000009aff7b8>
$`omg4g202 ~ om4g102 + CONDITION`
omg4g202 ~ om4g102 + CONDITION
<environment: 0x0000000009afda98>
$`omg4g203 ~ om4g103 + CONDITION`
omg4g203 ~ om4g103 + CONDITION
<environment: 0x00000000099b0828>
e.g.
lapply(out, FUN = lm, data = latino_dataset)
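The same idea can be tried end to end on the toy data from the question. This is only a sketch: the `stems` vector is an assumed list of variable prefixes, and the column naming (stem + 201 / stem + 101) follows the reproducible example rather than the real dataset.

```r
# Toy data from the question (CONDITION is recycled to 0, 1, 0, 1)
dataset <- data.frame(rest201 = c(10, 20, 30, 40),
                      rest101 = c(5, 10, 20, 24),
                      omgt201 = c(40, 10, 20, 10),
                      omgt101 = c(10, 20, 10, 5),
                      CONDITION = c(0, 1))

stems <- c("rest", "omgt")  # assumed variable prefixes

# One formula string per stem
mdls <- sprintf("%s201 ~ %s101 + CONDITION", stems, stems)

# Fit each model, passing the data explicitly; the list is named by formula
fits <- lapply(setNames(mdls, mdls),
               function(f) lm(as.formula(f), data = dataset))

# Pull out the estimated CONDITION effect from every model
cond_effect <- sapply(fits, function(fit) coef(fit)[["CONDITION"]])
```

Passing `data` explicitly matters because a formula built from a string only carries an environment, not the data frame the variables live in.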
The code below creates a linear model with R's lm, then a weighted model with a weights column. Finally, I try to pass in the weight column name with a variable weight_col and that fails. I'm pretty sure it's looking for "weight_col" in df, then the caller's environment, finds a variable of length 1, and the lengths don't match.
How do I get it to use weight_col as a name for the weights column in df?
I've tried several combinations of things without success.
> df <- data.frame(
x=c(1,2,3),
y=c(4,5,7),
w=c(1,3,5)
)
> lm(y ~ x, data=df)
Call:
lm(formula = y ~ x, data = df)
Coefficients:
(Intercept) x
2.333 1.500
> lm(y ~ x, data=df, weights=w)
Call:
lm(formula = y ~ x, data = df, weights = w)
Coefficients:
(Intercept) x
1.947 1.658
> weight_col <- 'w'
> lm(y ~ x, data=df, weights=weight_col)
Error in model.frame.default(formula = y ~ x, data = df, weights = weight_col, :
variable lengths differ (found for '(weights)')
> R.version.string
[1] "R version 3.6.3 (2020-02-29)"
You can use the data frame name with the extraction operator [[:
lm(y ~ x, data = df, weights = df[[weight_col]])
Or you can use the get function:
lm(y ~ x, data = df, weights = get(weight_col))
We can use [[ to extract the value of the column
lm(y ~ x, data=df, weights=df[[weight_col]])
Or with tidyverse
library(dplyr)
df %>%
summarise(model = list(lm(y ~ x, weights = .data[[weight_col]])))
Your first example is weights = w, which uses non-standard evaluation to find w in the context of df. So far, this is normal for interactive use.
Your second attempt is weights = weight_col, which resolves to weights = "w", which is very different. There is nothing in R's non-standard (or standard) evaluation in which that makes sense.
As I said in my comment, use the standard-evaluation form with [[.
lm(y ~ x, data=df, weights=df[[weight_col]])
# Call:
# lm(formula = y ~ x, data = df, weights = df[[weight_col]])
# Coefficients:
# (Intercept) x
# 1.947 1.658
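When the column name arrives as a string, for example as a function argument, the `[[` form carries over directly. A small sketch, where `fit_weighted` is a hypothetical helper name and not part of any package:

```r
df <- data.frame(x = c(1, 2, 3),
                 y = c(4, 5, 7),
                 w = c(1, 3, 5))

# Hypothetical helper: fit y ~ x with weights taken from a named column
fit_weighted <- function(data, weight_col) {
  stopifnot(weight_col %in% names(data))
  lm(y ~ x, data = data, weights = data[[weight_col]])
}

fit_named  <- fit_weighted(df, "w")             # column given as a string
fit_direct <- lm(y ~ x, data = df, weights = w) # column given by name

# Both routes should give the same coefficients
all.equal(coef(fit_named), coef(fit_direct))
```

The early `stopifnot()` check is a design choice: a misspelled column name would otherwise make `data[[weight_col]]` return NULL and the model would silently be fitted without weights.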
Let's say I use the mtcars dataset to set up an arbitrary formula:
data(mtcars)
myFormula <- as.formula("mpg ~ cyl + I(disp / hp) + I(wt^2) + I((qsec + vs) / gear)")
I would like to use that formula inside the lm function, but before that, I would like to remove any rows that contain Inf, NaN, or NA. For example, if disp / hp results in any Inf values, I would like to remove the rows that contain them. I know I can do that by generating a new variable first, removing Inf, and then running lm with the formula, but I would like to do it using the formula terms, since this is part of a Shiny application and the formula is user input.
My try:
formulaTerms <- terms(myFormula)
formulaTerms <- gsub("I", "", labels(formulaTerms))
formulaTermsRatio <- formulaTerms[grep("/", formulaTerms)]
mtcarsDT <- setDT(mtcars)
mtcarsDT <- mtcarsDT[, formulaTermsRatio[1] := sym(formulaTermsRatio[1])]
Use drop.terms. Assuming that each term is represented by a single column in the model matrix (i.e. no factors with > 2 levels) we compute the model matrix mm and find the column numbers, wx, of the bad columns. Then use drop.terms to drop those columns from the terms object and extract the formula from the revised terms object.
mtcars[1, 3] <- Inf
# is.na is TRUE for NA or NaN; is.infinite is TRUE for Inf or -Inf
is.bad <- function(x) any(is.na(x) | is.infinite(x))
fo_terms <- terms(myFormula) # myFormula is taken from question
mm <- model.matrix(myFormula, mtcars)
wx <- which(apply(mm[, -1], 2, is.bad))
fo_terms2 <- drop.terms(fo_terms, wx, keep.response = TRUE)
fo2 <- formula(fo_terms2)
myFormula
## mpg ~ cyl + I(disp/hp) + I(wt^2) + I((qsec + vs)/gear)
fo2
## mpg ~ cyl + I(wt^2) + I((qsec + vs)/gear)
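To see what drop.terms does in isolation, here is a tiny sketch on a made-up formula (the variable names y, a, b, c are placeholders and no data is needed):

```r
fo <- y ~ a + b + c
tt <- terms(fo)

# Drop the second right-hand-side term (b) and keep the response
fo_small <- formula(drop.terms(tt, 2, keep.response = TRUE))

deparse(fo_small)
```

The index passed to `drop.terms()` refers to the position of the term in `labels(tt)`, which is why the answer above computes column numbers from the model matrix without the intercept.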
Update
If you want to remove bad rows rather than terms from the formula then:
lm(myFormula, mtcars, subset = !apply(mm, 1, is.bad))
Note that lm will automatically remove rows with NAs and NaNs (depending on the na.action argument), so in this case you could simplify is.bad to only check for Inf and -Inf.
Another approach would be to replace Inf and -Inf with NA.
# is.infinite has no data frame method, so apply it column by column
mtcars[sapply(mtcars, is.infinite)] <- NA
and then perform lm normally.
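Because the formula is user input, the row filter can also be derived from the formula itself, so that infinite values produced by a term such as I(disp / hp) are caught even when the raw columns are finite. A minimal sketch, assuming every term evaluates to a single numeric column. The copy `cars` and the zero written into hp are only there to create an Inf for illustration.

```r
cars <- mtcars
cars[1, "hp"] <- 0  # illustration only: makes disp / hp infinite in row 1

myFormula <- as.formula("mpg ~ cyl + I(disp / hp) + I(wt^2) + I((qsec + vs) / gear)")

# Evaluate every variable in the formula, keeping all rows (na.pass)
mf <- model.frame(myFormula, data = cars, na.action = na.pass)

# Keep a row only if every evaluated variable is finite
# (is.finite is FALSE for NA, NaN, Inf and -Inf)
keep <- Reduce(`&`, lapply(mf, function(col) is.finite(as.numeric(col))))

fit <- lm(myFormula, data = cars[keep, ])
```

Using `na.action = na.pass` keeps `mf` aligned row for row with `cars`, so `keep` can be used directly as a row index.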
You can remove these values from the data you're regressing on. Inf will occur where hp==0 or gear==0.
data(mtcars)
df <- mtcars
myFormula <- as.formula("mpg ~ cyl + I(disp / hp) + I(wt^2) + I((qsec + vs) / gear)")
df <- df[!(df$hp==0 | df$gear==0),]
lm(myFormula,df)
> lm(myFormula,df)
Call:
lm(formula = myFormula, data = df)
Coefficients:
(Intercept) cyl I(disp/hp) I(wt^2) I((qsec + vs)/gear)
35.5847 -1.9639 1.0707 -0.3671 -0.1699
I use a best subset selection package to determine the best independent variables from which to build my model (I do have a specific reason for doing this instead of using the best subset object directly). I want to programmatically extract the feature names and use the resulting string to build my model formula. The result would be something like this:
x <- "x1 + x2 + x3"
y <- "Surv(time, event)"
Because I'm building a coxph model, the formula is as follows:
coxph(Surv(time, event) ~ x1 + x2 + x3)
Using these string fields, I tried to construct the formula like so:
form <- y ~ x
This creates an object of class formula, but when I call coxph it doesn't evaluate based on the references created from the formula object. I get the following error:
Error in model.frame.default(formula = y ~ x) : object is not a matrix
If I call eval on the objects y and x within the coxph call, I get the following:
Error in model.frame.default(formula = eval(y) ~ eval(x), data = df) :
variable lengths differ (found for 'eval(x)')
I'm not really sure how to proceed. Thanks for your input.
Couldn't find a good dupe, so posting comment as an answer.
If you build the full formula as a string, including the ~, you can use as.formula on it, e.g.,
x = "x1 + x2 + x3"
y = "Surv(time, event)"
form = as.formula(paste(y, "~", x))
coxph(form, data = your_data)
For a reproducible example, consider the first example at the bottom of the ?coxph help page:
library(survival)
test1 <- list(time=c(4,3,1,1,2,2,3),
status=c(1,1,1,0,1,1,0),
x=c(0,2,1,1,1,0,0),
sex=c(0,0,0,0,1,1,1))
# Fit a stratified model
coxph(Surv(time, status) ~ x + strata(sex), test1)
# Call:
# coxph(formula = Surv(time, status) ~ x + strata(sex), data = test1)
#
# coef exp(coef) se(coef) z p
# x 0.802 2.231 0.822 0.98 0.33
#
# Likelihood ratio test=1.09 on 1 df, p=0.3
# n= 7, number of events= 5
lhs = "Surv(time, status)"
rhs = "x + strata(sex)"
form = as.formula(paste(lhs, "~", rhs))
form
# Surv(time, status) ~ x + strata(sex)
## formula looks good
coxph(form, test1)
# Call:
# coxph(formula = form, data = test1)
#
# coef exp(coef) se(coef) z p
# x 0.802 2.231 0.822 0.98 0.33
Same results either way.
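If the selected feature names come back as a character vector rather than one pre-pasted string, base R's reformulate can assemble the formula without any manual pasting. A sketch using the same test1 data; the `features` vector stands in for whatever the subset-selection step returns and is an assumption here.

```r
library(survival)

test1 <- list(time   = c(4, 3, 1, 1, 2, 2, 3),
              status = c(1, 1, 1, 0, 1, 1, 0),
              x      = c(0, 2, 1, 1, 1, 0, 0),
              sex    = c(0, 0, 0, 0, 1, 1, 1))

# Assumed output of a feature-selection step
features <- c("x", "strata(sex)")

# Build Surv(time, status) ~ x + strata(sex) from its parts
form <- reformulate(features, response = "Surv(time, status)")

fit    <- coxph(form, data = test1)
direct <- coxph(Surv(time, status) ~ x + strata(sex), data = test1)

# The programmatic and hand-written formulas should agree
all.equal(coef(fit), coef(direct))
```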
I am trying to master building functions in R. Say I have a data frame or data.table,
dummy <- df(y, x, a, b, who)
Where the vector "who" is like so,
who <- c("Joseph", "Kim", "Billy")
I would like to use the character vector to perform various regression models and name the outputs and their summary statistics. So for the entry, "Billy" in the vector above, I would like something like this:
function() {
ols.reg.Billy <- lm(y ~ x + a + b, data = dummy[dummy$who == "Billy"])
dw.Billy <- dwtest(ols.reg.Billy)
output.Billy <- list(ols.reg.Billy, dw.Billy)
return(output.Billy)
}
But for 500 different entries of the who vector above.
Is there some way to do this? What's the most efficient way? I keep getting errors and I feel I am seriously missing something. Is there some way to use paste?
If this doesn't solve it, please provide a reproducible example. It makes it easier to help you.
library(lmtest)
outputs <- lapply(who, function(name) {
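# Subset the rows for this person (the comma keeps all columns)
ols.reg <- lm(y ~ x + a + b, data = dummy[dummy$who == name, ])
dw <- dwtest(ols.reg)
# Return the fitted objects themselves, labelled with the person's name
output <- setNames(list(ols.reg, dw), paste(c("ols.reg", "dw"), name, sep = "_"))
return(output)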
})
1) Map Using the built in CO2 data set suppose we wish to regress uptake on conc separately for each Type. Note that this names the components by the Type.
Map(function(x) lm(uptake ~ conc, CO2, subset = Type == x), levels(CO2$Type))
giving this two component list (one component for each level of Type -- Quebec and Mississippi) -- continued after output.
$Quebec
Call:
lm(formula = uptake ~ conc, data = CO2, subset = Type == x)
Coefficients:
(Intercept) conc
23.50304 0.02308
$Mississippi
Call:
lm(formula = uptake ~ conc, data = CO2, subset = Type == x)
Coefficients:
(Intercept) conc
15.49754 0.01238
2) Map/do.call We may wish to not only name the components using the Type but also have x substituted with the actual Type in the Call: line of the output. In that case use do.call to invoke lm and use quote to ensure that the name of the data frame rather than its value is displayed and use bquote to perform the substitution for x.
reg <- function(x) {
do.call("lm", list(uptake ~ conc, quote(CO2), subset = bquote(Type == .(x))))
}
Map(reg, levels(CO2$Type))
giving:
$Quebec
Call:
lm(formula = uptake ~ conc, data = CO2, subset = Type == "Quebec")
Coefficients:
(Intercept) conc
23.50304 0.02308
$Mississippi
Call:
lm(formula = uptake ~ conc, data = CO2, subset = Type == "Mississippi")
Coefficients:
(Intercept) conc
15.49754 0.01238
3) lmList The nlme package has lmList for doing this:
library(nlme)
lmList(uptake ~ conc | Type, CO2, pool = FALSE)
giving:
Call:
Model: uptake ~ conc | Type
Data: CO2
Coefficients:
(Intercept) conc
Quebec 23.50304 0.02308005
Mississippi 15.49754 0.01238113
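To return more than the fitted model for each group, as the question asks, the per-group work can be wrapped in a function applied to the pieces of split. A sketch on the same CO2 data; the R-squared is used only as a stand-in for whatever extra statistic is wanted (for example a Durbin-Watson test), and `results` is an illustrative name.

```r
# One data frame per level of Type
by_type <- split(CO2, CO2$Type)

# For each group, keep the model together with a summary statistic
results <- lapply(by_type, function(d) {
  fit <- lm(uptake ~ conc, data = d)
  list(model = fit, r.squared = summary(fit)$r.squared)
})

names(results)               # components are named by Type
coef(results$Quebec$model)   # model for one group
```

Keeping everything in one named list avoids creating hundreds of separately named objects and makes later steps such as `sapply(results, function(r) r$r.squared)` straightforward.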
I want to estimate an equation such as:
(where the bar denotes the mean of a variable). That is, I want to automatically have interactions between Z and a demeaned version of X. So far I just demean the variables manually beforehand and estimate:
lm(Y ~ .*Z, data= sdata)
This seems to be working, but I would rather use a solution that does not require manual demeaning beforehand because I would also like to include the means of more complex terms, such as:
Edit:
As requested, a working code sample. Note that in the actual problem I have a large (and varying) number of X variables, so I don't want to use a hard-coded variant:
x1 <- runif(100)
x2 <- runif(100)
Z <- runif(100)
Y <- exp(x1) + exp(x2) + exp(Z)
##current way of estimating the first equation:
sdata <- data.frame(Y=Y,Z=Z,x1=x1-mean(x1),x2=x2-mean(x2))
lm(Y ~ .*Z, data= sdata)
##basically what I want is that the following terms, and their interactions with Z are also used:
# X1^2 - mean(X1^2)
# X2^2 - mean(X2^2)
# X1*X2 - mean(X1*X2)
Edit 2:
Now, what I want to achieve is basically what
lm(Y ~ .^2*Z, data= sdata)
would do. However, given the prior demeaning, an expression such as Z:x1:x2 would involve (x1-mean(x1))*(x2-mean(x2)), while what I want is x1*x2-mean(x1*x2).
To show that scale works inside a formula:
lm(mpg ~ cyl + scale(disp*hp, scale=F), data=mtcars)
Call:
lm(formula = mpg ~ cyl + scale(disp * hp, scale = F), data = mtcars)
Coefficients:
(Intercept) cyl scale(disp * hp, scale = F)
3.312e+01 -2.105e+00 -4.642e-05
Now for comparison let's scale the interaction outside the formula:
mtcars$scaled_interaction <- with(mtcars, scale(disp*hp, scale=F))
lm(mpg ~ cyl + scaled_interaction, data=mtcars)
Call:
lm(formula = mpg ~ cyl + scaled_interaction, data = mtcars)
Coefficients:
(Intercept) cyl scaled_interaction
3.312e+01 -2.105e+00 -4.642e-05
At least in these examples, it seems as if scale inside formulae is working.
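The distinction the question draws, a demeaned product versus a product of demeaned variables, can be checked numerically. A small sketch on simulated data (the seed and sample size are arbitrary choices):

```r
set.seed(1)
x1 <- runif(100)
x2 <- runif(100)

# What scale(..., scale = FALSE) produces for the product term
demeaned_product <- as.numeric(scale(x1 * x2, scale = FALSE))

# What you get by interacting variables that were demeaned beforehand
product_of_demeaned <- (x1 - mean(x1)) * (x2 - mean(x2))

# scale(..., scale = FALSE) is the same as subtracting the mean
all.equal(demeaned_product, x1 * x2 - mean(x1 * x2))

# The two constructions are not the same variable
isTRUE(all.equal(demeaned_product, product_of_demeaned))
```

Only the first construction has mean zero by design; the mean of the second is the (population-style) covariance of x1 and x2.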
To provide a solution to your specific issue:
Alternative 1: Use formulae
# fit without Z
mod <- lm(Y ~ (.)^2, data= sdata[, names(sdata) != "Z" ])
vars <- attr(mod$terms, "term.labels")
vars <- gsub(":", "*", vars) # needed so that scale works later
vars <- paste0("scale(", vars, ", scale=F)")
newf <- as.formula(paste0("Y ~ ", paste0(vars, collapse = "+")))
# now interact with Z
f2 <- update.formula(newf, . ~ .*Z)
# This gives the following formula:
f2
Y ~ scale(x1, scale = F) + scale(x2, scale = F) + scale(x1*x2, scale = F) +
Z + scale(x1, scale = F):Z + scale(x2, scale = F):Z + scale(x1*x2, scale = F):Z
Alternative 2: Use Model Matrices
# again fit without Z and get model matrix
mod <- lm(Y ~ (.)^2, data= sdata[, names(sdata) != "Z" ])
modmat <- apply(model.matrix(mod), 2, function(x) scale(x, scale=F))
Here, all x's and the interactions are demeaned:
> head(modmat)
(Intercept) x1 x2 x1:x2
[1,] 0 0.1042908 -0.08989091 -0.01095459
[2,] 0 0.1611867 -0.32677059 -0.05425087
[3,] 0 0.2206845 0.29820499 0.06422944
[4,] 0 0.3462069 -0.15636463 -0.05571430
[5,] 0 0.3194451 -0.38668844 -0.12510551
[6,] 0 -0.4708222 -0.32502269 0.15144812
> round(colMeans(modmat), 2)
(Intercept) x1 x2 x1:x2
0 0 0 0
You can use the model matrix as follows:
modmat <- modmat[, -1] # remove intercept
lm(sdata$Y ~ modmat*sdata$Z)
It is not beautiful, but it should work with any number of explanatory variables. You can also add Y and Z to the matrix so that the output looks prettier, if this is a concern. Note that you can also create the model matrix directly without fitting the model. I took it from the fitted model since I had already fitted it for the first approach.
As a side note, it may be that this is not implemented in a more straightforward fashion because it is difficult to imagine situations in which demeaning the interaction is more desirable than the interaction of demeaned variables.
Comparing both approaches:
Here is the output of both approaches for comparison. As you can see, apart from the coefficient names, everything is identical.
> lm(sdata$Y ~ modmat*sdata$Z)
Call:
lm(formula = sdata$Y ~ modmat * sdata$Z)
Coefficients:
(Intercept) modmatx1 modmatx2 modmatx1:x2 sdata$Z
4.33105 1.56455 1.43979 -0.09206 1.72901
modmatx1:sdata$Z modmatx2:sdata$Z modmatx1:x2:sdata$Z
0.25332 0.38155 -0.66292
> lm(f2, data=sdata)
Call:
lm(formula = f2, data = sdata)
Coefficients:
(Intercept) scale(x1, scale = F) scale(x2, scale = F)
4.33105 1.56455 1.43979
scale(x1 * x2, scale = F) Z scale(x1, scale = F):Z
-0.09206 1.72901 0.25332
scale(x2, scale = F):Z scale(x1 * x2, scale = F):Z
0.38155 -0.66292