Suppose I have a data frame in the environment, mydata, with three columns, A, B, C.
mydata = data.frame(A=c(1,2,3),
B=c(4,5,6),
C=c(7,8,9))
I can create a linear model with
lm(C ~ A, data=mydata)
I want a function to generalize this, to regress B or C on A, given just the name of the column, i.e.,
f = function(x){
lm(x ~ A, data=mydata)
}
f(B)
f(C)
or
g = function(x){
lm(mydata$x ~ mydata$A)
}
g(B)
g(C)
These solutions don't work. I know there is something wrong with the evaluation, and I have tried permutations of quo() and enquo() and !!, but no success.
This is a simplified example, but the idea is, when I have dozens of similar models to build, each fairly complicated, with only one variable changing, I want to do so without repeating the entire formula each time.
If we want to pass an unquoted column name, an option is {{}} from the tidyverse. select() accepts both a string and an unquoted name:
library(dplyr)
printcol2 <- function(data, x) {
data %>%
select({{x}})
}
printcol2(mydata, A)
# A
#1 1
#2 2
#3 3
printcol2(mydata, 'A')
# A
#1 1
#2 2
#3 3
If the OP wants to pass an unquoted column name through to lm:
f1 <- function(x){
rsp <- deparse(substitute(x))
fmla <- reformulate("A", response = rsp)
out <- lm(fmla, data=mydata)
out$call <- as.symbol(paste0("lm(", deparse(fmla), ", data = mydata)"))
out
}
f1(B)
#Call:
#lm(B ~ A, data = mydata)
#Coefficients:
#(Intercept) A
# 3 1
f1(C)
#Call:
#lm(C ~ A, data = mydata)
#Coefficients:
#(Intercept) A
# 6 1
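Since the stated goal is dozens of models that differ in one variable, a possible variation is to take the column name as a string and loop over the names. This is only a minimal sketch; the helper name fit_on_A and its data argument are made up for illustration and are not part of the answer above.

```r
mydata <- data.frame(A = c(1, 2, 3),
                     B = c(4, 5, 6),
                     C = c(7, 8, 9))

# Hypothetical helper: regress the named column on A
fit_on_A <- function(response, data) {
  fmla <- reformulate("A", response = response)  # builds e.g. B ~ A
  lm(fmla, data = data)
}

# One model per response column, kept in a named list
fits <- lapply(c(B = "B", C = "C"), fit_on_A, data = mydata)

coef(fits$B)  # intercept 3, slope 1, as in the f1(B) output
coef(fits$C)  # intercept 6, slope 1, as in the f1(C) output
```

Passing strings avoids non-standard evaluation entirely, which tends to be easier to loop over than unquoted names.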
Maybe you are looking for deparse(substitute(.)), which converts an unquoted argument into a string that can be pasted into a formula.
f = function(x, data = mydata){
y <- deparse(substitute(x))
fmla <- paste(y, 'Species', sep = '~')
lm(as.formula(fmla), data = data)
}
mydata <- iris
f(Sepal.Length)
#
#Call:
#lm(formula = as.formula(fmla), data = data)
#
#Coefficients:
# (Intercept) Speciesversicolor Speciesvirginica
# 5.006 0.930 1.582
f(Petal.Width)
#
#Call:
#lm(formula = as.formula(fmla), data = data)
#
#Coefficients:
# (Intercept) Speciesversicolor Speciesvirginica
# 0.246 1.080 1.780
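In the output above the stored call reads as.formula(fmla) rather than the actual formula. If a readable call matters, one option is to splice the formula into the call with bquote() before evaluating it. The sketch below assumes the iris data; the function name f2 is hypothetical.

```r
f2 <- function(x, data = iris) {
  y <- deparse(substitute(x))                       # unquoted name -> string
  fmla <- as.formula(paste(y, "Species", sep = "~"))
  # bquote() substitutes the formula object for .(fmla) before lm() is called,
  # so the recorded call contains the formula itself
  eval(bquote(lm(.(fmla), data = data)))
}

fit <- f2(Sepal.Length)
fit$call
# lm(formula = Sepal.Length ~ Species, data = data)
```

The data argument still appears as data in the call, because only the formula was spliced in.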
I think generally, you might be looking for:
printcol <- function(x){
print(x)
}
printcol(mydata$A)
This doesn't involve any fancy evaluation; you just need to specify the variable you'd like to subset in your function call.
This gives us:
[1] 1 2 3
Note that you're only printing the vector A, and not actually subsetting column A from mydata.
Hi, I'm starting to use R and am stuck on analyzing my data. I have a data frame with 80 columns. Column 1 is the dependent variable and columns 2 to 80 are the independent variables. I want to run 78 multiple linear regressions, keeping the first independent variable (column 2) fixed in every model, and save all the regressions in a list so that I can later compare the models using AIC scores. How can I do it?
Here is my loop
data.frame
for(i in 2:80)
{
Regressions <- lm(data.frame$column1 ~ data.frame$column2 + data.frame [,i])
}
Using the iris dataset as an example you can do:
lapply(seq_along(iris)[-c(1:2)], function(x) lm(data = iris[,c(1:2, x)]))
[[1]]
Call:
lm(data = iris[, c(1:2, x)])
Coefficients:
(Intercept) Sepal.Width Petal.Length
2.2491 0.5955 0.4719
[[2]]
Call:
lm(data = iris[, c(1:2, x)])
Coefficients:
(Intercept) Sepal.Width Petal.Width
3.4573 0.3991 0.9721
[[3]]
Call:
lm(data = iris[, c(1:2, x)])
Coefficients:
(Intercept) Sepal.Width Speciesversicolor Speciesvirginica
2.2514 0.8036 1.4587 1.9468
This works because when you pass a dataframe to lm() without a formula it applies the function DF2formula() under the hood which treats the first column as the response and all other columns as predictors.
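To see the implied formula, formula() can be called on the same data frame subset. This is a small sketch using iris; the variable names cols, implied, fit_implicit and fit_explicit are only for illustration.

```r
# First two columns plus one candidate predictor
cols <- iris[, c(1:2, 3)]

# The formula derived from the data frame: first column ~ all the others
implied <- formula(cols)
implied
# Sepal.Length ~ Sepal.Width + Petal.Length

# The formula-free call and the explicit call give the same coefficients
fit_implicit <- lm(data = cols)
fit_explicit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)
all.equal(coef(fit_implicit), coef(fit_explicit))
```

Writing the formula out explicitly is more verbose, but it keeps the model definition visible in the stored call.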
With the for loop we can initialize a list to store the output
nm1 <- names(df1)[3:80]  # candidate predictors; column2 stays fixed in every model
Regressions <- vector('list', length(nm1))
for(i in seq_along(Regressions)) {
Regressions[[i]] <- lm(reformulate(c("column2", nm1[i]), "column1"), data = df1)
}
Or use paste instead of reformulate
for(i in seq_along(Regressions)) {
Regressions[[i]] <- lm(as.formula(paste0("column1 ~ column2 + ",
nm1[i])), data = df1)
}
Using a reproducible example
nm2 <- names(iris)[3:5]
Regressions2 <- vector('list', length(nm2))
for(i in seq_along(Regressions2)) {
Regressions2[[i]] <- lm(reformulate(c("Sepal.Width", nm2[i]), "Sepal.Length"), data = iris)
}
Regressions2[[1]]
#Call:
#lm(formula = reformulate(c("Sepal.Width", nm2[i]), "Sepal.Length"),
# data = iris)
#Coefficients:
# (Intercept) Sepal.Width Petal.Length
# 2.2491 0.5955 0.4719
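The question also asks how to compare the stored models by AIC. One way, sketched here on iris, is to name the list elements after the varying predictor and apply AIC() to each model. The names predictors, fits and aic_scores are illustrative only.

```r
predictors <- names(iris)[3:5]

# Fit Sepal.Length ~ Sepal.Width + <one extra predictor> for each candidate
fits <- lapply(predictors, function(p) {
  lm(reformulate(c("Sepal.Width", p), "Sepal.Length"), data = iris)
})
names(fits) <- predictors

# One AIC value per model, lowest (preferred) first
aic_scores <- sort(sapply(fits, AIC))
aic_scores
```

Naming the list makes it easy to pull the preferred model back out, for example fits[[names(aic_scores)[1]]].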
I am trying to test a whole bunch of different models easily and compare AIC / R-sq values to select the right one. I am having some trouble saving things how I want to between lists and data frames.
data frame I am going to model:
set.seed(1)
df <- data.frame(response=runif(50,min=50,max=100),
var1 = sample(1:20,50,replace=T),
var2 = sample(40:60,50,replace = T))
list of formulas to test:
formulas <- list( response ~ NULL,
response ~ var1,
response ~ var2,
response ~ var1 + var2,
response ~ var1 * var2)
So, what I want to do is create a loop that will model all of these formulas, extract the formula, AIC, and R-sq values into a table, and let me sort it to find the best one. The problem I'm having is that I can't extract the formula as "response ~ var1"; instead, it keeps coming out as "response" "~" "var1" if I try to extract it as a character object. Or, if I extract it as a list (like below), then it comes out like this:
[[1]]
response ~ NULL
[[2]]
[1] 415.89
[[3]]
[1] 0
And I can't easily plug those list elements into a data frame. Here is what I tried:
selection <- matrix(ncol=3)
colnames(selection) <- c("formula","AIC","R2") # create a df to store results in
for ( i in 1:length(formulas)){
mod <- lm( formula = formulas[[i]], data= df)
mod_vals <- c(extract(formulas[[i]]),
round(AIC(mod),2),
round(summary(mod)$adj.r.squared,2)
)
selection[i,] <- mod_vals[]
}
Any ideas? I don't have to keep it as a for loop either, I just want a way to test a long list of models together.
Thanks.
You could use lapply to loop over each formula, extract the relevant statistics from each model, and bind the resulting data frames together.
do.call(rbind, lapply(formulas, function(x) {
mod <- lm(x, data= df)
data.frame(formula = format(x),
AIC = round(AIC(mod),2),
r_square = round(summary(mod)$adj.r.squared,2))
}))
# formula AIC r_square
#1 response ~ NULL 405.98 0.00
#2 response ~ var1 407.54 -0.01
#3 response ~ var2 407.90 -0.02
#4 response ~ var1 + var2 409.50 -0.03
#5 response ~ var1 * var2 410.36 -0.03
Or with purrr
purrr::map_df(formulas, ~{
mod <- lm(.x, data= df)
data.frame(formula = format(.x),
AIC = round(AIC(mod),2),
r_square = round(summary(mod)$adj.r.squared,2))
})
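The split into "response" "~" "var1" comes from as.character(), which returns the pieces of the formula, whereas format() returns a single string. The sketch below shows the difference and then sorts the table by AIC, which the question asks for. It uses a subset of the formulas from the question, and the names results and adj_r2 are illustrative.

```r
set.seed(1)
df <- data.frame(response = runif(50, min = 50, max = 100),
                 var1 = sample(1:20, 50, replace = TRUE),
                 var2 = sample(40:60, 50, replace = TRUE))

formulas <- list(response ~ var1,
                 response ~ var2,
                 response ~ var1 + var2)

as.character(formulas[[1]])  # three pieces: "~" "response" "var1"
format(formulas[[1]])        # one string: "response ~ var1"

# One row per model, then order by AIC (lowest first)
results <- do.call(rbind, lapply(formulas, function(f) {
  mod <- lm(f, data = df)
  data.frame(formula = format(f),
             AIC = AIC(mod),
             adj_r2 = summary(mod)$adj.r.squared)
}))
results <- results[order(results$AIC), ]
results
```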
I am trying to create a function that passes a parameter in as the dependent variable with the independent variables staying the same.
I have tried to use {{}}; what I am after is something like the code below, if select(contains()) were possible inside a formula.
test_func <- function(dataframe, dependent){
model <- tidy(lm({{ dependent }} ~ . - select(contains("x")), data = dataframe))
return(model)
}
test_func(datasets::anscombe, x1)
The function should pass as function(dataframe, dependent) with a single model.
Use reformulate().
f <- function(d, y) lm(reformulate(names(d)[grep("x", names(d))], response=y), data=d)
f(datasets::anscombe, "y1")
# Call:
# lm(formula = reformulate(names(d)[grep("x", names(d))], response = y),
# data = d)
#
# Coefficients:
# (Intercept) x1 x2 x3 x4
# 4.33291 0.45073 NA NA -0.09873
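If the response should be passed unquoted, as in the question, deparse(substitute()) can be combined with reformulate(). This is a sketch under that assumption; fit_on_x is a hypothetical name. The NA coefficients in the output above are expected with this data set, since x1, x2 and x3 hold identical values in anscombe.

```r
fit_on_x <- function(d, y) {
  response <- deparse(substitute(y))               # unquoted name -> string
  predictors <- grep("x", names(d), value = TRUE)  # every column containing "x"
  lm(reformulate(predictors, response = response), data = d)
}

fit <- fit_on_x(datasets::anscombe, y1)
formula(fit)
# y1 ~ x1 + x2 + x3 + x4

# x1, x2 and x3 are the same column repeated, hence the NA coefficients
identical(datasets::anscombe$x1, datasets::anscombe$x2)
```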
I'm trying to use a string input for a function that then needs to be converted into useable form in R. For example:
I have the following function:
MyFunction <- function(MyDataFrame){
Fit <- aov(VariableA ~ A * B * C, MyDataFrame)
model = lme(VariableA ~ A * B * C, random=~1| Sample, method="REML", MyDataFrame)
return(anova(model))
}
This works fine. However, I sometimes want to use different formulas with a single function so my "Expression" can be "A * B * C" or "A * C". I tried using:
MyFunction <- function(MyDataFrame, Expression = "A * B * C"){
Fit <- aov(VariableA ~ Expression, MyDataFrame)
model = lme(VariableA ~ Expression, random=~1| Sample, method="REML", MyDataFrame)
return(anova(model))
}
This does not work. Any suggestions?
R needs to know that the formula is actually a formula, and you run into issues with evaluating expressions, environments, and so forth when you have a string that you want to use as an expression in a formula. Based on what it looks like you are trying to do, I would probably set up my function like so:
library(nlme)
fun <- function(df, response, predictors){
model_formula <- as.formula(paste0(response, " ~ ", predictors))
fit <- aov(model_formula, df)
model = nlme::lme(model_formula, df)
return(anova(model))
}
fun(Orthodont, "distance", "age")
#> numDF denDF F-value p-value
#> (Intercept) 1 80 3096.4889 <.0001
#> age 1 80 85.8464 <.0001
fun(Orthodont, "distance", "age + Sex")
#> numDF denDF F-value p-value
#> (Intercept) 1 80 4226.931 <.0001
#> age 1 80 111.949 <.0001
#> Sex 1 25 4.429 0.0456
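The same idea maps onto the question's Expression argument with a default value. The sketch below uses aov() on the built-in ToothGrowth data as a stand-in for the asker's data frame; fit_aov and its argument names are assumptions made for illustration.

```r
fit_aov <- function(data, response, expression = "supp * dose") {
  # Paste the pieces together, then convert the string to a real formula
  fmla <- as.formula(paste(response, "~", expression))
  aov(fmla, data = data)
}

fit_full <- fit_aov(ToothGrowth, "len")           # len ~ supp * dose
fit_small <- fit_aov(ToothGrowth, "len", "supp")  # len ~ supp

attr(terms(fit_full), "term.labels")
# "supp" "dose" "supp:dose"
```

The string is converted once, so the same formula object could be reused for several fitting functions inside the helper.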
For your purposes, you don't even need to use strings, you could pass the expression directly and use match.call() with an eval(). A toy example is:
fun <- function(data, expression){
m <- match.call()
lm(hp ~ eval(m$expression), data)
}
fun(mtcars, cyl)
#Call:
#lm(formula = hp ~ eval(m$expression), data = data)
#Coefficients:
# (Intercept) eval(m$expression)
# -51.05 31.96
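In the toy example the coefficient is labelled eval(m$expression). If the real column name should appear instead, one alternative is to capture the argument with substitute() and splice it into the call with bquote(). This is a sketch, with fun2 and predictor as made-up names.

```r
fun2 <- function(data, predictor) {
  pred <- substitute(predictor)  # the unevaluated symbol, e.g. cyl
  # .(pred) is replaced by the symbol, so lm() sees hp ~ cyl
  eval(bquote(lm(hp ~ .(pred), data = data)))
}

fit <- fun2(mtcars, cyl)
names(coef(fit))
# "(Intercept)" "cyl"
```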
I need to store an lm fit object in a data frame for further processing (this is needed as I will have 200+ regressions to store in the data frame). I am not able to store the fit object in the data frame. The following code produces the error message:
x = runif(100)
y = 2*x+runif(100)
fit = lm(y ~x)
df = data.frame()
df = rbind(df, c(id="xx1", fitObj=fit))
Error in rbind(deparse.level, ...) :
invalid list argument: all variables should have the same length
I would like to get the data frame as returned by "do" call of dplyr, example below:
> tacrSECOutput
Source: local data frame [24 x 5]
Groups: <by row>
sector control id1 fit count
1 Chemicals and Chemical Products S tSector <S3:lm> 2515
2 Construation and Real Estate S tSector <S3:lm> 985
Please note that this is a sample output only. I would like to create the data frame (fit column for the lm object) in the above format so that my rest of the code can work on the added models.
What am I doing wrong? Appreciate the help very much.
The list approach:
Clearly based on @Pascal's idea. Not a fan of lists, but in some cases they are extremely helpful.
set.seed(42)
x <- runif(100)
y <- 2*x+runif(100)
fit1 <- lm(y ~x)
set.seed(123)
x <- runif(100)
y <- 2*x+runif(100)
fit2 <- lm(y ~x)
# manually select model names
model_names = c("fit1","fit2")
# create a list based on models names provided
list_models = lapply(model_names, get)
# set names
names(list_models) = model_names
# check the output
list_models
# $fit1
#
# Call:
# lm(formula = y ~ x)
#
# Coefficients:
# (Intercept) x
# 0.5368 1.9678
#
#
# $fit2
#
# Call:
# lm(formula = y ~ x)
#
# Coefficients:
# (Intercept) x
# 0.5545 1.9192
Given that you have lots of models in your workspace, the only "manual" thing you have to do is provide a vector of your models' names (however they are stored). The get function then retrieves the actual model objects with those names so you can save them in a list.
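If the models have not been created yet, another option is to put them into a named list as they are fitted, so no lookup by name is needed afterwards. A minimal sketch, assuming each model can be produced by one function call; make_fit and seeds are hypothetical names.

```r
# Hypothetical helper: simulate data for a given seed and fit the model
make_fit <- function(seed) {
  set.seed(seed)
  x <- runif(100)
  y <- 2 * x + runif(100)
  lm(y ~ x)
}

# lapply() keeps the names of its input, so the list is named fit1, fit2
seeds <- c(fit1 = 42, fit2 = 123)
list_models <- lapply(seeds, make_fit)

names(list_models)
sapply(list_models, function(m) coef(m)[["x"]])  # slope of each model
```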
Store model objects in a dataset when you create them:
The data frame can be created using dplyr and do if you are planning to store the model objects when they are created.
library(dplyr)
set.seed(42)
x1 = runif(100)
y1 = 2*x1+runif(100)
set.seed(123)
x2 <- runif(100)
y2 <- 2*x2+runif(100)
model_formulas = c("y1~x1", "y2~x2")
data.frame(model_formulas, stringsAsFactors = F) %>%
group_by(model_formulas) %>%
do(model = lm(.$model_formulas))
# model_formulas model
# (chr) (chr)
# 1 y1~x1 <S3:lm>
# 2 y2~x2 <S3:lm>
It REALLY depends on how "organised" the process is that allows you to build those 200+ models you mentioned. You can build your models this way if they depend on columns of a specific dataset. It will not work if you want to build models based on various columns of different datasets, maybe from different workspaces, or of different model types (linear/logistic regression).
Store existing model objects in a dataset:
Actually I think you can still use dplyr using the same philosophy as in the list approach. If the models are already built you can use their names like this
library(dplyr)
set.seed(42)
x <- runif(100)
y <- 2*x+runif(100)
fit1 <- lm(y ~x)
set.seed(123)
x <- runif(100)
y <- 2*x+runif(100)
fit2 <- lm(y ~x)
# manually select model names
model_names = c("fit1","fit2")
data.frame(model_names, stringsAsFactors = F) %>%
group_by(model_names) %>%
do(model = get(.$model_names))
# model_names model
# (chr) (chr)
# 1 fit1 <S3:lm>
# 2 fit2 <S3:lm>
This seems to work:
x = runif(100)
y = 2*x+runif(100)
fit = lm(y ~x)
df <- data.frame()
fitvec <- serialize(fit,NULL)
df <- rbind(df, data.frame(id="xx1", fitObj=fitvec))
fit1 <- unserialize( df$fitObj )
print(fit1)
yields:
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
0.529 1.936
Update: Okay, now more complex, so as to get one row per fit.
vdf <- data.frame()
fitlist <- list()
niter <- 5
for (i in 1:niter){
# Create a new model each time
a <- runif(1)
b <- runif(1)
n <- 50*runif(1) + 50
x <- runif(n)
y <- a*x + b + rnorm(n,0.1)
fit <- lm(x~y)
fitlist[[length(fitlist)+1]] <- serialize(fit,NULL)
}
vdf <- data.frame(id=1:niter)
vdf$fitlist <- fitlist
for (i in 1:niter){
print(unserialize(vdf$fitlist[[i]]))
}
yields:
Call:
lm(formula = x ~ y)
Coefficients:
(Intercept) y
0.45689 0.07766
Call:
lm(formula = x ~ y)
Coefficients:
(Intercept) y
0.44922 0.00658
Call:
lm(formula = x ~ y)
Coefficients:
(Intercept) y
0.41036 0.04522
Call:
lm(formula = x ~ y)
Coefficients:
(Intercept) y
0.40823 0.07189
Call:
lm(formula = x ~ y)
Coefficients:
(Intercept) y
0.40818 0.08141
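A base R data frame can also hold the lm objects directly in a list column, without serializing them first. The following is a sketch of that alternative, not a replacement for the approach above; the simulated data and the column names fit and slope are assumptions.

```r
set.seed(1)

# Fit five small models on simulated data
fits <- lapply(1:5, function(i) {
  x <- runif(60)
  y <- 2 * x + runif(60)
  lm(y ~ x)
})

vdf <- data.frame(id = seq_along(fits))
vdf$fit <- fits  # list column: one lm object per row

# Derived columns can be computed from the stored models
vdf$slope <- sapply(vdf$fit, function(m) coef(m)[["x"]])
vdf[, c("id", "slope")]
```

Each model is then available as vdf$fit[[i]], with no unserialize() step.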