I am trying to master building functions in R. Say I have a data frame or data.table,
dummy <- data.frame(y, x, a, b, who)
Where the vector "who" is like so,
who <- c("Joseph", "Kim", "Billy")
I would like to use the character vector to perform various regression models and name the outputs and their summary statistics. So for the entry, "Billy" in the vector above, I would like something like this:
function() {
ols.reg.Billy <- lm(y ~ x + a + b, data = dummy[dummy$who == "Billy"])
dw.Billy <- dwtest(ols.reg.Billy)
output.Billy <- list(ols.reg.Billy, dw.Billy)
return(output.Billy)
}
But for 500 different entries of the who vector above.
Is there some way to do this? What's the most efficient way? I keep getting errors and I feel I am seriously missing something. Is there some way to use paste?
If this doesn't solve it, please provide a reproducible example. It makes it easier to help you.
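library(lmtest)
outputs <- lapply(who, function(name) {
  # Keep only the rows for this person (note the comma: rows, then all columns)
  ols.reg <- lm(y ~ x + a + b, data = dummy[dummy$who == name, ])
  dw <- dwtest(ols.reg)
  # Return the fitted model and the Durbin-Watson test together
  list(ols.reg = ols.reg, dw = dw)
})
# Name each list element after the person it belongs to
names(outputs) <- who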
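For a version that can be run as-is, here is a minimal sketch on simulated data. The contents of `dummy` and the seed are made up for illustration, and `split()` is used in place of filtering inside the loop, so this is a variation on the answer above rather than the same code. The Durbin-Watson step is left out so that only base R is needed.

```r
# Simulated stand-in for the asker's data (all values are hypothetical)
set.seed(1)
who <- c("Joseph", "Kim", "Billy")
dummy <- data.frame(
  y = rnorm(60), x = rnorm(60), a = rnorm(60), b = rnorm(60),
  who = rep(who, each = 20)
)

# split() returns one data frame per person, named by the value of `who`
by_who <- split(dummy, dummy$who)

# Fit the same model on each piece; the list keeps the names from split()
fits <- lapply(by_who, function(d) lm(y ~ x + a + b, data = d))

# Look up a model by name instead of building variable names with paste()
coef(fits[["Billy"]])
```

Keeping the models in one named list means `fits[["Billy"]]` plays the role of `ols.reg.Billy`, and the same lookup works for all 500 names.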
1) Map. Using the built-in CO2 data set, suppose we wish to regress uptake on conc separately for each Type. Note that this names the components by the Type.
Map(function(x) lm(uptake ~ conc, CO2, subset = Type == x), levels(CO2$Type))
giving this two-component list (one component for each level of Type -- Quebec and Mississippi) -- continued after output.
$Quebec
Call:
lm(formula = uptake ~ conc, data = CO2, subset = Type == x)
Coefficients:
(Intercept) conc
23.50304 0.02308
$Mississippi
Call:
lm(formula = uptake ~ conc, data = CO2, subset = Type == x)
Coefficients:
(Intercept) conc
15.49754 0.01238
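Because the result is a named list, the coefficients can be collected into one table. This is a small sketch that goes slightly beyond the answer; the names `fits` and `coefs` are chosen here for illustration.

```r
# Same Map() call as above, stored so the fits can be reused
fits <- Map(function(x) lm(uptake ~ conc, CO2, subset = Type == x),
            levels(CO2$Type))

# sapply() simplifies the equal-length coefficient vectors into a matrix:
# one column per Type, one row per coefficient
coefs <- sapply(fits, coef)
coefs
```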
2) Map/do.call. We may wish to not only name the components using the Type but also have x substituted with the actual Type in the Call: line of the output. In that case, use do.call to invoke lm, use quote to ensure that the name of the data frame rather than its value is displayed, and use bquote to perform the substitution for x.
reg <- function(x) {
do.call("lm", list(uptake ~ conc, quote(CO2), subset = bquote(Type == .(x))))
}
Map(reg, levels(CO2$Type))
giving:
$Quebec
Call:
lm(formula = uptake ~ conc, data = CO2, subset = Type == "Quebec")
Coefficients:
(Intercept) conc
23.50304 0.02308
$Mississippi
Call:
lm(formula = uptake ~ conc, data = CO2, subset = Type == "Mississippi")
Coefficients:
(Intercept) conc
15.49754 0.01238
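To see what bquote contributes on its own, the following sketch compares it with quote outside of any model fitting. The variable name `cond` is only for illustration.

```r
x <- "Quebec"

# quote() keeps the expression unevaluated, so x stays a symbol
quote(Type == x)

# bquote() also keeps it unevaluated, but replaces .(x) with the value of x
cond <- bquote(Type == .(x))
cond

# The result is an ordinary call that can be evaluated against the data,
# here counting the rows of CO2 that satisfy the condition
sum(eval(cond, CO2))
```

This is why the Call: line shows the actual Type: by the time lm sees the subset argument, the value has already been written into the expression.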
3) lmList. The nlme package has lmList for doing this:
library(nlme)
lmList(uptake ~ conc | Type, CO2, pool = FALSE)
giving:
Call:
Model: uptake ~ conc | Type
Data: CO2
Coefficients:
(Intercept) conc
Quebec 23.50304 0.02308005
Mississippi 15.49754 0.01238113
The code below creates a linear model with R's lm, then a weighted model with a weights column. Finally, I try to pass in the weight column name with a variable weight_col and that fails. I'm pretty sure it's looking for "weight_col" in df, then the caller's environment, finds a variable of length 1, and the lengths don't match.
How do I get it to use weight_col as a name for the weights column in df?
I've tried several combinations of things without success.
> df <- data.frame(
x=c(1,2,3),
y=c(4,5,7),
w=c(1,3,5)
)
> lm(y ~ x, data=df)
Call:
lm(formula = y ~ x, data = df)
Coefficients:
(Intercept) x
2.333 1.500
> lm(y ~ x, data=df, weights=w)
Call:
lm(formula = y ~ x, data = df, weights = w)
Coefficients:
(Intercept) x
1.947 1.658
> weight_col <- 'w'
> lm(y ~ x, data=df, weights=weight_col)
Error in model.frame.default(formula = y ~ x, data = df, weights = weight_col, :
variable lengths differ (found for '(weights)')
> R.version.string
[1] "R version 3.6.3 (2020-02-29)"
You can use the data frame name with the extraction operator [[:
lm(y ~ x, data = df, weights = df[[weight_col]])
Or you can use function get:
lm(y ~ x, data = df, weights = get(weight_col))
We can use [[ to extract the value of the column
lm(y ~ x, data=df, weights=df[[weight_col]])
Or with tidyverse
library(dplyr)
df %>%
summarise(model = list(lm(y ~ x, weights = .data[[weight_col]])))
Your first example is weights = w, which uses non-standard evaluation to find w in the context of df. So far, this is normal for interactive use.
Your second set is weights = weight_col which resolves to weights = "w", which is very different. There is nothing in R's non-standard (or standard) evaluation in which that makes sense.
As I said in my comment, use the standard-evaluation form with [[.
lm(y ~ x, data=df, weights=df[[weight_col]])
# Call:
# lm(formula = y ~ x, data = df, weights = df[[weight_col]])
# Coefficients:
# (Intercept) x
# 1.947 1.658
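As a quick check that the two forms agree, here is a small sketch using the data from the question. The wrapper `fit_weighted` is a hypothetical helper, not part of any package; it only shows how the column name might be passed around as a string.

```r
df <- data.frame(x = c(1, 2, 3), y = c(4, 5, 7), w = c(1, 3, 5))
weight_col <- "w"

# Non-standard evaluation: w is looked up as a column of df
fit_nse <- lm(y ~ x, data = df, weights = w)

# Standard evaluation: the column is extracted first, then passed as a vector
fit_se <- lm(y ~ x, data = df, weights = df[[weight_col]])

# Both fits give the same coefficients
all.equal(coef(fit_nse), coef(fit_se))

# Hypothetical helper that takes the weight column name as a string
fit_weighted <- function(data, weight_col) {
  wts <- data[[weight_col]]  # plain numeric vector of weights
  lm(y ~ x, data = data, weights = wts)
}
coef(fit_weighted(df, "w"))
```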
I want to run linear models (in this case, multivariate models with two response variables) within a for loop in which a new data frame called bc_applied is created at each iteration, as well as the vector targets. In my code, the column names "target1" and "target2" change at every iteration, which means I can't write the variable names explicitly. Instead, I want to take them from the vector targets.
Here is an example:
targets <- c("target1","target2")
bc_applied <- data.frame("dsRNA" = c(rep("gene1",5),rep("gene2",5),rep("gene3",5)),
"target1" = runif(15), "target2" = runif(15))
But when running
lm(bc_applied[,targets] ~ dsRNA, data = bc_applied)
The following error is returned:
Error in model.frame.default(formula = bc_applied[, targets] ~ dsRNA, :
invalid type (list) for variable 'bc_applied[, targets]'
The desired output is given by
lm(cbind(target1, target2) ~ dsRNA, data = bc_applied)
According to ?lm
If response is a matrix a linear model is fitted separately by least-squares to each column of the matrix.
With cbind, it is creating a matrix. So, we need an option that takes a matrix. After subsetting the dataset with the columns, convert it to a matrix with as.matrix and it should work
lm(as.matrix(bc_applied[,targets]) ~ dsRNA, data = bc_applied)
-output
#Call:
#lm(formula = as.matrix(bc_applied[, targets]) ~ dsRNA, data = bc_applied)
#Coefficients:
# target1 target2
#(Intercept) 0.45161 0.47457
#dsRNAgene2 0.36341 0.29226
#dsRNAgene3 -0.07115 -0.03003
Or another option is to create a formula with paste
lm(paste0('cbind(', toString(targets),') ~ dsRNA'), data = bc_applied)
-output
#Call:
#lm(formula = paste0("cbind(", toString(targets), ") ~ dsRNA"),
# data = bc_applied)
#Coefficients:
# target1 target2
#(Intercept) 0.45161 0.47457
#dsRNAgene2 0.36341 0.29226
#dsRNAgene3 -0.07115 -0.03003
or create the formula with glue
lm(glue::glue('cbind({toString(targets)}) ~ dsRNA'), bc_applied)
or another option is
lm(do.call(cbind, asplit(bc_applied[, targets], 2)) ~ dsRNA, bc_applied)
Crosschecking with cbind
lm(cbind(target1, target2)~ dsRNA, data = bc_applied)
-output
#Call:
#lm(formula = cbind(target1, target2) ~ dsRNA, data = bc_applied)
#Coefficients:
# target1 target2
#(Intercept) 0.45161 0.47457
#dsRNAgene2 0.36341 0.29226
#dsRNAgene3 -0.07115 -0.03003
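One more variation, offered as a sketch rather than as part of the answer above: the cbind(...) response can be built as a call from the character vector and handed to reformulate, so the stored formula reads exactly like the hand-written one. The seed is arbitrary, so the numbers will differ from the output shown above.

```r
set.seed(42)  # arbitrary seed, only to make the sketch repeatable
targets <- c("target1", "target2")
bc_applied <- data.frame(
  dsRNA = rep(c("gene1", "gene2", "gene3"), each = 5),
  target1 = runif(15),
  target2 = runif(15)
)

# Build the call cbind(target1, target2) from the character vector
lhs <- as.call(c(list(as.name("cbind")), lapply(targets, as.name)))

# reformulate() joins the response call and the predictor names into a formula
form <- reformulate("dsRNA", response = lhs)
form

fit <- lm(form, data = bc_applied)

# Same coefficients as writing the response out by hand
all.equal(coef(fit), coef(lm(cbind(target1, target2) ~ dsRNA, data = bc_applied)))
```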
I use a best subset selection package to determine the best independent variables from which to build my model (I do have a specific reason for doing this instead of using the best subset object directly). I want to programmatically extract the feature names and use the resulting string to build my model formula. The result would be something like this:
x <- "x1 + x2 + x3"
y <- "Surv(time, event)"
Because I'm building a coxph model, the formula is as follows:
coxph(Surv(time, event) ~ x1 + x2 + x3)
Using these string fields, I tried to construct the formula like so:
form <- y ~ x
This creates an object of class formula, but when I call coxph it doesn't evaluate based on the references created from the formula object. I get the following error:
Error in model.frame.default(formula = y ~ x) : object is not a matrix
If I call eval on the objects y and x within the coxph call, I get the following:
Error in model.frame.default(formula = eval(y) ~ eval(x), data = df) :
variable lengths differ (found for 'eval(x)')
I'm not really sure how to proceed. Thanks for your input.
Couldn't find a good dupe, so posting comment as an answer.
If you build the full formula as a string, including the ~, you can use as.formula on it, e.g.,
x = "x1 + x2 + x3"
y = "Surv(time, event)"
form = as.formula(paste(y, "~", x))
coxph(form, data = your_data)
For a reproducible example, consider the first example at the bottom of the ?coxph help page:
library(survival)
test1 <- list(time=c(4,3,1,1,2,2,3),
status=c(1,1,1,0,1,1,0),
x=c(0,2,1,1,1,0,0),
sex=c(0,0,0,0,1,1,1))
# Fit a stratified model
coxph(Surv(time, status) ~ x + strata(sex), test1)
# Call:
# coxph(formula = Surv(time, status) ~ x + strata(sex), data = test1)
#
# coef exp(coef) se(coef) z p
# x 0.802 2.231 0.822 0.98 0.33
#
# Likelihood ratio test=1.09 on 1 df, p=0.3
# n= 7, number of events= 5
lhs = "Surv(time, status)"
rhs = "x + strata(sex)"
form = as.formula(paste(lhs, "~", rhs))
form
# Surv(time, status) ~ x + strata(sex)
## formula looks good
coxph(form, test1)
# Call:
# coxph(formula = form, data = test1)
#
# coef exp(coef) se(coef) z p
# x 0.802 2.231 0.822 0.98 0.33
Same results either way.
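If the selected features arrive as a character vector rather than a single pasted string, base R's reformulate can assemble the formula directly. This is a sketch under that assumption; the vector `features` is hypothetical, and the block only builds the formula, so it runs without the survival package.

```r
# Hypothetical feature names, e.g. as extracted from a subset-selection result
features <- c("x1", "x2", "x3")

# The response is given as an unevaluated call; Surv() is not run here
form <- reformulate(features, response = quote(Surv(time, event)))
form

# A genuine formula object, with the variables it refers to
class(form)
all.vars(form)

# With survival loaded, it would be used as in the answer above:
# coxph(form, data = your_data)
```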
I'm trying to figure out how I can set up purrr to run several multiple regressions like the image below. As you will notice, this dataset describes an intervention program and we are analyzing this data using ANCOVA procedures (TIME 2 ~ TIME 1 + CONDITION).
om4g**TIME2**01 ~ om4g**TIME1**01 + CONDITION
example:
om4g201 ~ om4g01 + CONDITION
Just in case someone wants reproducible code:
dataset <- data.frame(rest201=c(10,20,30,40),
rest101=c(5,10,20,24),
omgt201=c(40,10,20,10),
omgt101=c(10,20,10,05),
CONDITION=c(0,1))
lm(rest201~rest101+CONDITION, data=dataset)
lm(omgt201~omgt101+CONDITION, data=dataset)
I found just one question similar to mine here (Making linear models in a for loop using R programming), but the answer did not work for me.
Thanks!
Similar to @Roman's answer, here is how to do it using map2 from purrr:
library(purrr)
y_var = c("rest201", "omgt201")
x_var = list(c("rest101", "CONDITION"), c("omgt101", "CONDITION"))
map2(x_var, y_var, ~ lm(as.formula(paste(.y, "~", paste(.x, collapse = " + "))), data = dataset))
To get the summary table for each model, you can wrap each lm with summary and extract the coefficients table:
map2(x_var, y_var, ~ {
lm(as.formula(paste(.y, "~", paste(.x, collapse = " + "))), data = dataset) %>%
summary() %>%
`$`("coefficients")
})
Result:
[[1]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.779097 0.76821670 3.617596 0.17169133
rest101 1.377672 0.04750594 29.000000 0.02194371
CONDITION 3.800475 0.72163694 5.266464 0.11945968
[[2]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.000000e+01 16.666667 1.800000e+00 0.3228289
omgt101 -2.445145e-16 1.333333 -1.833859e-16 1.0000000
CONDITION -2.000000e+01 14.529663 -1.376494e+00 0.3999753
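When the TIME 1 name can be derived from the TIME 2 name, the predictor list does not have to be typed out. The sketch below assumes the naming pattern from the question (a 2 followed by two digits at the end of the name becomes a 1 followed by the same digits) and uses base R's Map and reformulate instead of purrr.

```r
dataset <- data.frame(
  rest201 = c(10, 20, 30, 40),
  rest101 = c(5, 10, 20, 24),
  omgt201 = c(40, 10, 20, 10),
  omgt101 = c(10, 20, 10, 5),
  CONDITION = rep(c(0, 1), 2)
)

y_var <- c("rest201", "omgt201")

# Derive the TIME 1 names: "rest201" -> "rest101", "omgt201" -> "omgt101"
x_var <- sub("2([0-9][0-9])$", "1\\1", y_var)

# Map() names the results after the first vector, i.e. the outcome names
fits <- Map(function(y, x) {
  lm(reformulate(c(x, "CONDITION"), response = y), data = dataset)
}, y_var, x_var)

names(fits)
coef(fits[["rest201"]])
```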
You could construct a list of formulas, one per model, and then fit each of them.
x <- c(101, 102, 103)
mdls <- sprintf("om4g%s ~ om4g%s + CONDITION",
as.character(x + 100),
as.character(x)
)
out <- sapply(mdls, FUN = function(x) {
formula(x, data = latino_dataset)
})
$`om4g201 ~ om4g101 + CONDITION`
om4g201 ~ om4g101 + CONDITION
<environment: 0x0000000009aff7b8>
$`om4g202 ~ om4g102 + CONDITION`
om4g202 ~ om4g102 + CONDITION
<environment: 0x0000000009afda98>
$`om4g203 ~ om4g103 + CONDITION`
om4g203 ~ om4g103 + CONDITION
<environment: 0x00000000099b0828>
e.g.
# pass the data explicitly; lapply() always returns a list of fitted models
lapply(out, FUN = lm, data = latino_dataset)
I'm trying to create a series of models based on subsets of different categories in my data. Instead of creating a bunch of individual model objects, I'm using lapply() to create a list of models based on subsets of every level of my category factor, like so:
test.data <- data.frame(y=rnorm(100), x1=rnorm(100), x2=rnorm(100), category=factor(rep(c("A", "B"), 50)))
run.individual.models <- function(x) {
lm(y ~ x1 + x2, data=test.data, subset=(category==x))
}
individual.models <- lapply(levels(test.data$category), FUN=run.individual.models)
individual.models
# [[1]]
# Call:
# lm(formula = y ~ x1 + x2, data = test.data, subset = (category ==
# x))
# Coefficients:
# (Intercept) x1 x2
# 0.10852 -0.09329 0.11365
# ....
This works fantastically, except the model call shows subset = (category == x) instead of category == "A", etc. This makes it more difficult to use both for diagnostic purposes (it's hard to remember which model in the list corresponds to which category) and for functions like predict().
Is there a way to substitute the actual character value of x into the lm() call so that the model doesn't use the raw x in the call?
Along the lines of Explicit formula used in linear regression
Use bquote to construct the call
run.individual.models <- function(x) {
lmc <- bquote(lm(y ~ x1 + x2, data=test.data, subset=(category==.(x))))
eval(lmc)
}
individual.models <- lapply(levels(test.data$category), FUN=run.individual.models)
individual.models
[[1]]
Call:
lm(formula = y ~ x1 + x2, data = test.data, subset = (category ==
"A"))
Coefficients:
(Intercept) x1 x2
-0.08434 0.05881 0.07695
[[2]]
Call:
lm(formula = y ~ x1 + x2, data = test.data, subset = (category ==
"B"))
Coefficients:
(Intercept) x1 x2
0.1251 -0.1854 -0.1609
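A self-contained sketch of the same idea, with two small variations that are not in the answer above: Map is used so the list is also named by category, and the parentheses around the subset condition are dropped. The seed is arbitrary, so the coefficients will not match the output shown.

```r
set.seed(123)  # arbitrary seed, only to make the sketch repeatable
test.data <- data.frame(
  y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100),
  category = factor(rep(c("A", "B"), 50))
)

run.individual.models <- function(x) {
  # bquote() writes the value of x into the call before it is evaluated
  eval(bquote(lm(y ~ x1 + x2, data = test.data, subset = category == .(x))))
}

# Map() names each model after the category it was fitted on
individual.models <- Map(run.individual.models, levels(test.data$category))
names(individual.models)

# The stored call now carries the literal category value
individual.models$A$call$subset

# predict() works on a model picked out by name
predict(individual.models$B, newdata = data.frame(x1 = 0, x2 = 0))
```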