Problem using function with "as.formula" in a loop in R - r

I've created a function that returns an ANOVA table, and it uses formula to create the formula of the oneway.testfunction.
A simplified version of the function is:
anova_table <- function(df, dv, group){
dv_t <- deparse(substitute(dv))
group_t <- deparse(substitute(group))
anova <- oneway.test(formula = formula(paste(dv_t, "~", group_t)),
data = df,
var.equal = F)
return(anova)
}
It works fine when I use it outside a loop:
data("mpg")
mpg <- mpg %>% mutate_if(is.character, as.factor)
anova_table(mpg, displ, drv)
However, I'd like it to work also inside a loop.
When I try the following code, I get this error message:
"Error in model.frame.default(formula = formula(paste(dv_t, "~", group_t)), :
object is not a matrix"
I'm not sure what I'm doing wrong.
vars_sel <- mpg %>% select(where(is.numeric)) %>% names()
vars_sel <- dput(vars_sel)
vars_sel <- syms(vars_sel)
for(i in vars_sel){
var <- sym(i)
print(anova_table(mpg, var, drv))
}
Any help would be much appreciated!

Because of how your function works, the var in your loop is being taken literally, so the function is looking for a column called var in mpg which doesn't exist. You can get round this by building and evaluating a call to your function in the loop:
for(i in vars_sel){
a <- eval(as.call(list(anova_table, df = mpg, dv = i, group = quote(drv))))
print(a)
}
#>
#> One-way analysis of means (not assuming equal variances)
#>
#> data: displ and drv
#> F = 143.9, num df = 2.000, denom df = 67.605, p-value < 2.2e-16
#>
#>
#> One-way analysis of means (not assuming equal variances)
#>
#> data: year and drv
#> F = 0.59072, num df = 2.000, denom df = 67.876, p-value = 0.5567
#>
#>
#> One-way analysis of means (not assuming equal variances)
#>
#> data: cyl and drv
#> F = 129.2, num df = 2.000, denom df = 82.862, p-value < 2.2e-16
#>
#>
#> One-way analysis of means (not assuming equal variances)
#>
#> data: cty and drv
#> F = 89.54, num df = 2.000, denom df = 78.879, p-value < 2.2e-16
#>
#>
#> One-way analysis of means (not assuming equal variances)
#>
#> data: hwy and drv
#> F = 127.14, num df = 2.000, denom df = 71.032, p-value < 2.2e-16
Created on 2022-09-25 with reprex v2.0.2

Related

Use lapply with formula to estimate lm with different weights

Related but slightly different to How to use lapply with a formula? and Calling update within a lapply within a function, why isn't it working?:
I am trying to estimate models with replicate weights. For correct standard errors, I need to estimate the same regression model with each version of the replicate weights. Since I need to estimate many different models and do not want to always write a seperate loop, I tried writing a function where I specify both data for the regression, regression formula and the data with the replicate weights. While the function works fine when specifying the formula explicitly inside the lapply() command in the function and not as a function input (function tryout below), as soon as I specify the regression formula as a function input (function tryout2 below), it breaks.
Here is a reproducible example:
library(tidyverse)
set.seed(123)
lm.dat <- data.frame(id=1:500,
x1=sample(1:100, replace=T, size=500),
x2=runif(n=500, min=0, max=20)) %>%
mutate(y=0.2*x1+1.5*x2+rnorm(n=500, mean=0, sd=5))
repweights <- data.frame(id=1:500)
set.seed(123)
for (i in 1:200) {
repweights[,i+1] <- runif(n=500, min=0, max=10)
names(repweights)[i+1] <- paste0("hrwgt", i)
}
The two functions are defined as follows:
trythis <- function(data, weightsdata, weightsN){
rep <- as.list(1:weightsN)
res <- lapply(rep, function(x) lm(data=data, formula=y~x1+x2, weights=weightsdata[,x]))
return(res)
}
results1 <- trythis(data=lm.dat, weightsdata=repweights[-1], weightsN=200)
trythis2 <- function(LMformula, data, weightsdata, weightsN){
rep <- as.list(1:weightsN)
res <- lapply(rep, function(x) lm(data=data, formula=LMformula, weights=weightsdata[,x]))
return(res)
}
While the first function works, applying the second one results in an error:
trythis2(LMformula = y~x1+x2, data=lm.dat, weightsN=200, weightsdata = repweights[-1])
Error in eval(extras, data, env) : object 'weightsdata' not found
Formulas have an associated environment in which the referenced variables can be found. In your case, the formula you are passing has the environment of the calling frame. To access the variables within the function, you need to assign the formula to the local frame so it can find the correct variables:
trythis3 <- function(LMformula, data, weightsdata, weightsN){
rep <- as.list(1:weightsN)
res <- lapply(rep, function(x) {
environment(LMformula) <- sys.frames()[[length(sys.frames())]]
lm(data = data, formula = LMformula, weights = weightsdata[,x])
})
return(res)
}
trythis3(LMformula = y~x1+x2, data = lm.dat, weightsN = 200,
weightsdata = repweights[-1])
Which results in
#> [[1]]
#>
#> Call:
#> lm(formula = LMformula, data = data, weights = weightsdata[,
#> x])
#>
#> Coefficients:
#> (Intercept) x1 x2
#> 1.2932 0.1874 1.4308
#>
#>
#> [[2]]
#>
#> Call:
#> lm(formula = LMformula, data = data, weights = weightsdata[,
#> x])
#>
#> Coefficients:
#> (Intercept) x1 x2
#> 1.2932 0.1874 1.4308
#>
#>
#> [[3]]
#>
#> Call:
#> lm(formula = LMformula, data = data, weights = weightsdata[,
#> x])
#>
#> Coefficients:
#> (Intercept) x1 x2
#> 1.2932 0.1874 1.4308
...etc

How can I extract, label and data.frame values from Console in a loop?

I made a nls loop and get values calculated in console. Now I want to extract those values, specify which values are from which group and put everything in a dataframe to continue working.
my loop so far:
for (i in seq_along(trtlist2)) { loopmm.nls <-
nls(rate ~ (Vmax * conc /(Km + conc)),
data=subset(M3, M3$trtlist==trtlist2[i]),
start=list(Km=200, Vmax=2), trace=TRUE )
summary(loopmm.nls)
print(summary(loopmm.nls))
}
the output in console: (this is what I want to extract and put in a dataframe, I have this same "parameters" thing like 20 times)
Parameters:
Estimate Std. Error t value Pr(>|t|)
Km 23.29820 9.72304 2.396 0.0228 *
Vmax 0.10785 0.01165 9.258 1.95e-10 ***
---
different ways of extracting data from the console that work but not in the loop (so far!)
#####extract data in diff ways from nls#####
## extract coefficients as matrix
Kinall <- summary(mm.nls)$parameters
## extract coefficients save as dataframe
Kin <- as.data.frame(Kinall)
colnames(Kin) <- c("values", "SE", "T", "P")
###create Km Vmax df
Kms <- Kin[1, ]
Vmaxs <- Kin[2, ]
#####extract coefficients each manually
Km <- unname(coef(summary(mm.nls))["Km", "Estimate"])
Vmax <- unname(coef(summary(mm.nls))["Vmax", "Estimate"])
KmSE <- unname(coef(summary(mm.nls))["Km", "Std. Error"])
VmaxSE <- unname(coef(summary(mm.nls))["Vmax", "Std. Error"])
KmP <- unname(coef(summary(mm.nls))["Km", "Pr(>|t|)"])
VmaxP <- unname(coef(summary(mm.nls))["Vmax", "Pr(>|t|)"])
KmT <- unname(coef(summary(mm.nls))["Km", "t value"])
VmaxT <- unname(coef(summary(mm.nls))["Vmax", "t value"])
one thing that works if you extract data through append, but somehow that only works for "estimates" not the rest
Kms <- append(Kms, unname(coef(loopmm.nls)["Km"] ))
Vmaxs <- append(Vmaxs, unname(coef(loopmm.nls)["Vmax"] ))
}
Kindf <- data.frame(trt = trtlist2, Vmax = Vmaxs, Km = Kms)
I would just keep everything in the dataframe for ease. You can nest by the group and then run the regression then pull the coefficients out. Just make sure you have tidyverse and broom installed on your computer.
library(tidyverse)
#example
mtcars |>
nest(data = -cyl) |>
mutate(model = map(data, ~nls(mpg~hp^b,
data = .x,
start = list(b = 1))),
clean_mod = map(model, broom::tidy)) |>
unnest(clean_mod) |>
select(-c(data, model))
#> # A tibble: 3 x 6
#> cyl term estimate std.error statistic p.value
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 6 b 0.618 0.0115 53.6 2.83e- 9
#> 2 4 b 0.731 0.0217 33.7 1.27e-11
#> 3 8 b 0.504 0.0119 42.5 2.46e-15
#what I expect will work for your data
All_M3_models <- M3 |>
nest(data = -trtlist) |>
mutate(model = map(data, ~nls(rate ~ (Vmax * conc /(Km + conc)),
data=.x,
start=list(Km=200, Vmax=2))),
clean_mod = map(model, broom::tidy))|>
unnest(clean_mod) |>
select(-c(data, model))

Run all posible combination of linear regression with 2 independent variables

I want to run every combination possible for every 2 independent variables (OLS regression). I have a csv where I have my data (just one dependent variable and 23 independent variables), and I've tried renaming the variables inside my database from a to z, and I called 'y' to my dependent variable (a column with name "y" which is my dependent variable) to be recognized by the following code:
#all the combinations
all_comb <- combn(letters, 2)
#create the formulas from the combinations above and paste
text_form <- apply(all_comb, 2, function(x) paste('Y ~', paste0(x, collapse = '+')))
lapply(text_form, function(i) lm(i, data= KOFS05.12))
but this error is shown:
Error in eval(predvars, data, env) : object 'y' not found
I need to know the R squared
Any idea to make it work and run every possible regression?
As mentioned in the comments under the question check whether you need y or Y. Having addressed that we can use any of these. There is no need to rename the columns. We use the built in mtcars data set as an example since no test data was provided in the question. (Please always provide that in the future.)
1) ExhaustiveSearch This runs quite fast so you might be able to try combinations higher than 2 as well.
library(ExhaustiveSearch)
ExhaustiveSearch(mpg ~., mtcars, combsUpTo = 2)
2) combn Use the lmfun function defined below with combn.
dep <- "mpg" # name of dependent variable
nms <- setdiff(names(mtcars), dep) # names of indep variables
lmfun <- function(x, dep) do.call("lm", list(reformulate(x, dep), quote(mtcars)))
lms <- combn(nms, 2, lmfun, dep = dep, simplify = FALSE)
names(lms) <- lapply(lms, formula)
3) listcompr Using lmfun from above and listcompr we can use the following. Note that we need version 0.1.1 or later of listcompr which is not yet on CRAN so we get it from github.
# install.github("patrickroocks/listcompr")
library(listcompr)
packageVersion("listcompr") # need version 0.1.1 or later
dep <- "mpg" # name of dependent variable
nms <- setdiff(names(mtcars), dep) # names of indep variables
lms2 <- gen.named.list("{nm1}.{nm2}", lmfun(c(nm1, nm2), dep),
nm1 = nms, nm2 = nms, nm1 < nm2)
You should specify your text_form as formulas:
KOFS05.12 <- data.frame(y = rnorm(10),
a = rnorm(10),
b = rnorm(10),
c = rnorm(10))
all_comb <- combn(letters[1:3], 2)
fmla_form <- apply(all_comb, 2, function(x) as.formula(sprintf("y ~ %s", paste(x, collapse = "+"))))
lapply(fmla_form, function(i) lm(i, KOFS05.12))
#> [[1]]
#>
#> Call:
#> lm(formula = i, data = KOFS05.12)
#>
#> Coefficients:
#> (Intercept) a b
#> 0.19763 -0.15873 0.02854
#>
#>
#> [[2]]
#>
#> Call:
#> lm(formula = i, data = KOFS05.12)
#>
#> Coefficients:
#> (Intercept) a c
#> 0.21395 -0.15967 0.05737
#>
#>
#> [[3]]
#>
#> Call:
#> lm(formula = i, data = KOFS05.12)
#>
#> Coefficients:
#> (Intercept) b c
#> 0.157140 0.002523 0.028088
Created on 2021-02-17 by the reprex package (v1.0.0)

R loop for linear regression lm(y~x) and save model output as a dataset

I would like to make a regression loop lm(y~x) with a dataset with one y and several x, and run the regression for each x, and then also store the results (estimate, p-values) in a data.frame() so I don't have to copy them manually (especially as my real data set it much bigger).
I think this should not be too difficult, but I struggle a lot to make it work and appreciate your help:
Here is my sample data set:
sample_data <- data.frame(
fit = c(0.8971963, 1.4205607, 1.4953271, 0.8971963, 1.1588785, 0.1869159, 1.1588785, 1.142857143, 0.523809524),
Xbeta = c(2.8907744, -0.7680777, -0.7278847, -0.06293916, -0.04047017, 2.3755812, 1.3043990, -0.5698354, -0.5698354),
Xgamma = c( 0.1180758, -0.6275700, 0.3731964, -0.2353454,-0.5761923, -0.5186803, 0.43041835, 3.9111749, -0.5030638),
Xalpha = c(0.2643091, 1.6663923, 0.4041057, -0.2100472, -0.2100472, 7.4874195, -0.2385278, 0.3183102, -0.2385278),
Xdelta = c(0.1498646, -0.6325119, -0.5947564, -0.2530748, 3.8413339, 0.6839322, 0.7401834, 3.8966404, 1.2028175)
)
#yname <- ("fit")
#xnames <- c("Xbeta ","Xgamma", "Xalpha", "Xdelta")
The simple regression with the first independant variable Xbeta would look like this lm(fit~Xbeta, data= sample_data)and I would like to run the regression for each variable starting with an "X" and then store the result (estimate, p-value).
I have found a code that allows me to select variables that start with "X" and then use it for the model, but the code gives me an error from mutate() onwards (indicated by #).
library(tidyverse)
library(tsibble)
sample_data %>%
gather(stock, return, starts_with("X")) %>%
group_nest(stock)
# %>%
# mutate(model = map(data,
# ~lm(formula = "fit~ return",
# data = .x))
# ),
# resid = map(model, residuals)
# ) %>%
# unnest(c(data,resid)) %>%
# summarise(sd_residual = sd(resid))
For then storing the regression results I have also found the following appraoch using the R package "broom": r for loop for regression lm(y~x)
sample_data%>%
group_by(y,x)%>% # get combinations of y and x to regress
do(tidy(lm(fRS_relative~xvalue, data=.)))
But I always get an error for group_by() and do()
I really appreciate your help!
One option would be to use lapply to perform a regression with each of the independent variables. Use tidy from broom library to store the results into a tidy format.
lapply(1:length(xnames),
function(i) broom::tidy(lm(as.formula(paste0('fit ~ ', xnames[i])), data = sample_data))) -> test
and then combine all the results into a single dataframe:
do.call('rbind', test)
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) 1.05 0.133 7.89 0.0000995
# 2 Xbeta -0.156 0.0958 -1.62 0.148
# 3 (Intercept) 0.968 0.147 6.57 0.000313
# 4 Xgamma 0.0712 0.107 0.662 0.529
# 5 (Intercept) 1.09 0.131 8.34 0.0000697
# 6 Xalpha -0.0999 0.0508 -1.96 0.0902
# 7 (Intercept) 0.998 0.175 5.72 0.000723
# 8 Xdelta -0.0114 0.0909 -0.125 0.904
Step one
Your data is messy, let us tidy it up.
sample_data <- data.frame(
fit = c(0.8971963, 1.4205607, 1.4953271, 0.8971963, 1.1588785, 0.1869159, 1.1588785, 1.142857143, 0.523809524),
Xbeta = c(2.8907744, -0.7680777, -0.7278847, -0.06293916, -0.04047017, 2.3755812, 1.3043990, -0.5698354, -0.5698354),
Xgamma = c( 0.1180758, -0.6275700, 0.3731964, -0.2353454,-0.5761923, -0.5186803, 0.43041835, 3.9111749, -0.5030638),
Xalpha = c(0.2643091, 1.6663923, 0.4041057, -0.2100472, -0.2100472, 7.4874195, -0.2385278, 0.3183102, -0.2385278),
Xdelta = c(0.1498646, -0.6325119, -0.5947564, -0.2530748, 3.8413339, 0.6839322, 0.7401834, 3.8966404, 1.2028175)
)
tidyframe = data.frame(fit = sample_data$fit,
X = c(sample_data$Xbeta,sample_data$Xgamma,sample_data$Xalpha,sample_data$Xdelta),
type = c(rep("beta",9),rep("gamma",9),rep("alpha",9),rep("delta",9)))
Created on 2020-07-13 by the reprex package (v0.3.0)
Step two
Iterate over each type, and get the P-value, using this nifty function
# From https://stackoverflow.com/a/5587781/3212698
lmp <- function (modelobject) {
if (class(modelobject) != "lm") stop("Not an object of class 'lm' ")
f <- summary(modelobject)$fstatistic
p <- pf(f[1],f[2],f[3],lower.tail=F)
attributes(p) <- NULL
return(p)
}
Then do some clever piping
tidyframe %>% group_by(type) %>%
summarise(type = type, p = lmp(lm(formula = fit ~ X))) %>%
unique()
#> `summarise()` regrouping output by 'type' (override with `.groups` argument)
#> # A tibble: 4 x 2
#> # Groups: type [4]
#> type p
#> <fct> <dbl>
#> 1 alpha 0.0902
#> 2 beta 0.148
#> 3 delta 0.904
#> 4 gamma 0.529
Created on 2020-07-13 by the reprex package (v0.3.0)

object '...' not found in R Functions with lm -->> (Error in eval(predvars, data, env) : object '...' not found)

I'm using the moderndrive package to calculate a linear regression but using a function. I am trying to create a function where i can just pass in two selected columns(e.g deaths & cases, titles of the columns) from my data frame (Rona_2020). Below is the function...
score_model_Fxn <- function(y, x){
score_mod <- lm(y ~ x, data = Rona_2020)
Reg_Table <- get_regression_table(score_mod)
print(paste('The regression table is', Reg_Table))
}
when I run the function ...
score_model_Fxn(deaths, cases)
I get ...
Error in eval(predvars, data, env) : object 'deaths' not found
What should i do? I have looked several similar issues but to no avail.
What you want to do by passing deaths and cases is called non-standard evaluation. You need to combine this with computing on the language if you want to run a model with the correct formula and scoping. Computing on the language can be done with substitute and bquote.
library(moderndive)
score_model_Fxn <- function(y, x, data){
#get the symbols passed as arguments:
data <- substitute(data)
y <- substitute(y)
x <- substitute(x)
#substitute them into the lm call and evaluate the call:
score_mod <- eval(bquote(lm(.(y) ~ .(x), data = .(data))))
Reg_Table <- get_regression_table(score_mod)
message('The regression table is') #better than your paste solution
print(Reg_Table)
invisible(score_mod) #a function should always return something useful
}
mod <- score_model_Fxn(Sepal.Length, Sepal.Width, iris)
#The regression table is
## A tibble: 2 x 7
# term estimate std_error statistic p_value lower_ci upper_ci
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 intercept 6.53 0.479 13.6 0 5.58 7.47
#2 Sepal.Width -0.223 0.155 -1.44 0.152 -0.53 0.083
print(mod)
#
#Call:
#lm(formula = Sepal.Length ~ Sepal.Width, data = iris)
#
#Coefficients:
#(Intercept) Sepal.Width
# 6.5262 -0.2234
You could have the function return Reg_Table instead if you prefer.
One of the coolest ways of doing this is using the new recipes package to generate the formula for us and then manipulating a tibble to produce or result
library(tidyverse)
library(recipes)
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stringr':
#>
#> fixed
#> The following object is masked from 'package:stats':
#>
#> step
library(moderndive)
score_model_Fxn <- function(df,x, y){
formula_1 <- df %>%
recipe() %>%
update_role({{x}},new_role = "outcome") %>%
update_role({{y}},new_role = "predictor") %>%
formula()
Reg_Table <- mtcars %>%
summarise(score_mod = list(lm(formula_1,data = .))) %>%
rowwise() %>%
mutate(Reg_Table = list(get_regression_table(score_mod))) %>%
pull(Reg_Table)
print(paste('The regression table is', Reg_Table))
Reg_Table
}
k <- mtcars %>%
score_model_Fxn(x = cyl,y = gear)
#> [1] "The regression table is list(term = c(\"intercept\", \"gear\"), estimate = c(10.585, -1.193), std_error = c(1.445, 0.385), statistic = c(7.324, -3.101), p_value = c(0, 0.004), lower_ci = c(7.633, -1.978), upper_ci = c(13.537, -0.407))"
k
#> [[1]]
#> # A tibble: 2 x 7
#> term estimate std_error statistic p_value lower_ci upper_ci
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 intercept 10.6 1.44 7.32 0 7.63 13.5
#> 2 gear -1.19 0.385 -3.10 0.004 -1.98 -0.407
Created on 2020-06-09 by the reprex package (v0.3.0)
For those that might be interested...I modified Bruno's answer.
library(tidyverse); library(recipes); library(moderndive)
score_model_Fxn2 <- function(df,x, y){
formula_1 <- df %>%
recipe() %>%
update_role({{y}},new_role = "outcome") %>%
update_role({{x}},new_role = "predictor") %>%
formula()
Reg_Table <- df %>%
summarise(score_mod = list(lm(formula_1,data = .))) %>%
rowwise() %>%
mutate(Reg_Table = list(get_regression_table(score_mod))) %>%
pull(Reg_Table)
print(Reg_Table)
}
score_model_Fxn2()

Resources