I am trying to write a function that has up to 6 potential conditions. The function creates a linear model that can have up to three independent variables (IVs) and is either scaled or not scaled. I tried using nested ifelse statements, but ran into issues because I have too many conditions (scaled or not x 3 possible IVs = 6 potential conditions). How can I simplify my code so it is easier to read?
This is what I tried to write, but it does not currently work.
test<-data.frame(a=sample.int(20,10,replace=T),b=sample.int(20,10,replace=T),c=sample.int(20,10,replace=T),d=sample.int(20,10,replace=T))
lm3iv<-function(dv,iv1,iv2=NA,iv3=NA,df,scale=F){
dn<-dv;in1<-iv1;in2<-iv2;in3<-iv3 #stores the names of variables/elements specified in function
dv<-eval(parse(text=paste0(df,"$",dv))) #Store output of (df,"$",dv) as variable dv; parse=turn string into an expression; eval=return the values given by an expression that is passed to eval (by parse in this case).
#return(dv)
iv1<-eval(parse(text=paste0(df,"$",iv1)))
if(!is.na(iv2)){iv2<-eval(parse(text=paste0(df,"$",iv2)))}
if(!is.na(iv3)){iv3<-eval(parse(text=paste0(df,"$",iv3)))}
ifelse(scale,
ifelse(!is.na(iv3),
{
x<-lm(scale(dv)~scale(iv1)+scale(iv2)+scale(iv3))
names(x$coefficients)<-c(dn,in1,in2,in3) #set names of coefficients (element of x) from x (object defined in above line); names=pulling specific elements of coefficients
return(summary(x))
},
ifelse(!is.na(iv2),{
x<-lm(scale(dv)~scale(iv1)+scale(iv2))
names(x$coefficients)<-c(dn,in1,in2)
return(summary(x))
},
{
x<-lm(scale(dv)~scale(iv1))
names(x$coefficients)<-c(dn,in1)
return(summary(x))
},
ifelse(!is.na(iv3),
return(summary(lm((dv)~(iv1)+(iv2)+(iv3)))),
ifelse(!is.na(iv2),
return(summary(lm((dv)~(iv1)+(iv2)))),
return(summary(lm((dv)~(iv1)))))))
) #format=ifelse(cond,if_true,if_false)
)
}
#tried adding list() before the first summary to return the model output and the model itself; have to add the model again after the comma after summary() - e.g., lm(scale(dv)~scale(iv1)+scale(iv2)+scale(iv3))) (same model being summarized). Have to create a variable when passing the data into the model. When reading output -> model.output(model.std.el.ns.tw[[1]],model.std.el.ns.tw[[2]]).
lm3iv("a","b","c","d",df="test",scale=F)
This is the error I'm getting, but I am trying to simplify my code, not just solve the error:
Error in ifelse(scale, ifelse(!is.na(iv3), { :
argument "no" is missing, with no default
Here is your function, reorganised. I hope it works: I did not check what the function does, only that each ifelse(test, yes, no) call gets all three arguments.
lm3iv<-function(dv,iv1,iv2=NA,iv3=NA,df,scale=F){
dn<-dv;in1<-iv1;in2<-iv2;in3<-iv3 #stores the names of variables/elements specified in function
dv<-eval(parse(text=paste0(df,"$",dv))) #Store output of (df,"$",dv) as variable dv; parse=turn string into an expression; eval=return the values given by an expression that is passed to eval (by parse in this case).
#return(dv)
iv1<-eval(parse(text=paste0(df,"$",iv1)))
if(!is.na(iv2)){iv2<-eval(parse(text=paste0(df,"$",iv2)))}
if(!is.na(iv3)){iv3<-eval(parse(text=paste0(df,"$",iv3)))}
ifelse(test = scale,
yes = ifelse(test = !is.na(iv3),
yes = {
x<-lm(scale(dv)~scale(iv1)+scale(iv2)+scale(iv3))
names(x$coefficients)<-c(dn,in1,in2,in3) #set names of coefficients (element of x) from x (object defined in above line); names=pulling specific elements of coefficients
return(summary(x))
},
no = ifelse(test = !is.na(iv2),
yes = {
x<-lm(scale(dv)~scale(iv1)+scale(iv2))
names(x$coefficients)<-c(dn,in1,in2)
return(summary(x))
},
no = {
x<-lm(scale(dv)~scale(iv1))
names(x$coefficients)<-c(dn,in1)
return(summary(x))
})),
no = ifelse(test = !is.na(iv3),
yes = return(summary(lm((dv)~(iv1)+(iv2)+(iv3)))),
no = ifelse(test = !is.na(iv2),
yes = return(summary(lm((dv)~(iv1)+(iv2)))),
no = return(summary(lm((dv)~(iv1))))))
) #format=ifelse(cond,if_true,if_false)
}
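For context on the original error: ifelse(test, yes, no) is a vectorised function that needs all three arguments, whereas a single TRUE/FALSE flag such as scale is usually handled with the if/else control-flow construct. A minimal sketch of the difference (the variable names below are illustrative and not taken from the function above):

```r
# ifelse() works element-wise on a logical vector and needs yes AND no.
flags <- c(TRUE, FALSE, TRUE)
labels <- ifelse(flags, "scaled", "raw")

# if/else branches on a single logical value; each branch can be a
# multi-line block, which is what the function above is really doing.
use_scale <- FALSE
choice <- if (use_scale) {
  "scaled"
} else {
  "raw"
}

print(labels)
print(choice)
```
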
Instead of using a bunch of ifelse statements, perhaps you should build a formula from the function arguments and then pass it to lm?
library(tidyverse)
lm2 <- function(dep_var, ind_var, ..., data, scaled = FALSE) {
if (scaled) {
data <- mutate_if(data, is.numeric, scale)
}
form <- paste(dep_var, '~',
paste(ind_var, ..., sep = '+')) %>%
as.formula()
lm(form, data = data) %>% summary()
}
# tests
# 1 independent var
lm(Petal.Width ~ Sepal.Width, data = iris) %>% summary()
lm2('Petal.Width', 'Sepal.Width', data = iris)
# 2 independent var
lm(Petal.Width ~ Sepal.Width + Sepal.Length, data = iris) %>% summary()
lm2('Petal.Width', 'Sepal.Width', 'Sepal.Length', data = iris)
# test scale option
iris_scaled <- mutate(iris,
Petal.Width = scale(Petal.Width),
Sepal.Width = scale(Sepal.Width))
lm(Petal.Width ~ Sepal.Width, data = iris_scaled) %>% summary()
lm2('Petal.Width', 'Sepal.Width', data = iris, scaled = TRUE)
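The same idea can be sketched in base R without the tidyverse. The helper name lm_ivs and its arguments are hypothetical (not from the posts above), and mtcars is only a stand-in data set. Unlike lm2, this sketch scales only the variables that enter the model, assuming they are all numeric:

```r
# Hypothetical helper: takes the predictor names as a character vector,
# so one, two, three or more IVs are handled by the same code path.
lm_ivs <- function(dv, ivs, data, scale = FALSE) {
  vars <- c(dv, ivs)
  if (scale) {
    # scale() returns a one-column matrix; as.numeric() keeps plain vectors
    data[vars] <- lapply(data[vars], function(col) as.numeric(scale(col)))
  }
  # reformulate() builds e.g. mpg ~ wt + hp from the strings
  summary(lm(reformulate(ivs, response = dv), data = data))
}

fit_raw <- lm_ivs("mpg", c("wt", "hp"), data = mtcars)
fit_std <- lm_ivs("mpg", "wt", data = mtcars, scale = TRUE)

print(rownames(coef(fit_raw)))
print(coef(fit_std))
```

Because the coefficient names come from the formula, there is no need to rename them by hand as in the original lm3iv.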
I have seen an example of list apply (lapply) that works nicely to take a list of data objects,
and return a list of regression output, which we can pass to Stargazer for nicely formatted output.
Using stargazer with a list of lm objects created by lapply-ing over a split data.frame
library(MASS)
library(stargazer)
data(Boston)
by.river <- split(Boston, Boston$chas)
class(by.river)
fit <- lapply(by.river, function(dd)lm(crim ~ indus,data=dd))
stargazer(fit, type = "text")
What I would like to do is, instead of passing a list of datasets to run the same regression on each data set (as above),
pass a list of dependent variables to run different regressions on the same data set. In long hand it would look like this:
fit2 <- vector(mode = "list", length = 2)
fit2[[1]] <- lm(nox ~ indus, data = Boston)
fit2[[2]] <- lm(crim ~ indus, data = Boston)
stargazer(fit2, type = "text")
With lapply, I tried this and it doesn't work. Where did I go wrong?
myvarc <- c("nox","crim")
class(myvarc)
myvars <- as.list(myvarc)
class(myvars)
fit <- lapply(myvars, function(dvar)lm(dvar ~ indus,data=Boston))
stargazer(fit, type = "text")
Consider creating dynamic formulas from string:
fit <- lapply(myvars, function(dvar)
lm(as.formula(paste0(dvar, " ~ indus")),data=Boston))
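A possible variant of the same approach, sketched with reformulate() instead of paste0(). The object names dep_vars, fits and formulas are illustrative; naming the list elements is an optional extra that makes it easier to tell the models apart later:

```r
library(MASS)  # provides the Boston data set

dep_vars <- c("nox", "crim")

# Build "nox ~ indus", "crim ~ indus" from strings and fit each model.
fits <- lapply(dep_vars, function(dvar) {
  lm(reformulate("indus", response = dvar), data = Boston)
})
names(fits) <- dep_vars

# Check which formula each fitted model actually used.
formulas <- vapply(fits, function(m) deparse(formula(m)), character(1))
print(formulas)
```
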
This should work:
fit <- lapply(myvars, function(dvar) lm(eval(paste0(dvar,' ~ indus')), data = Boston))
You can also use a dplyr & purrr approach, keep everything in a tibble, pull out what you want, when you need it. No difference in functionality from the lapply methods.
library(dplyr)
library(purrr)
library(MASS)
library(stargazer)
var_tibble <- tibble(vars = c("nox","crim"), data = list(Boston))
analysis <- var_tibble %>%
mutate(models = map2(data, vars, ~lm(as.formula(paste0(.y, " ~ indus")), data = .x))) %>%
mutate(tables = map2(models, vars, ~stargazer(.x, type = "text", dep.var.labels.include = FALSE, column.labels = .y)))
You can also use get():
# make a list of independent variables
list_x <- list("nox","crim")
# create regression function
my_reg <- function(x) { lm(indus ~ get(x), data = Boston) }
# run regression
results <- lapply(list_x, my_reg)
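One thing to be aware of with get(): the term in the fitted model is labelled literally as get(x), not with the variable name. A small sketch comparing it with a formula built by reformulate(); mtcars and the function names reg_get and reg_formula are stand-ins chosen for illustration:

```r
# Same regression, built two ways.
reg_get     <- function(x) lm(mpg ~ get(x), data = mtcars)
reg_formula <- function(x) lm(reformulate(x, response = "mpg"), data = mtcars)

names_get     <- names(coef(reg_get("wt")))
names_formula <- names(coef(reg_formula("wt")))

print(names_get)      # the slope is labelled with the get() call
print(names_formula)  # the slope is labelled with the variable name
```

The estimates are the same either way; only the labels differ, which matters when the models are passed on to a table-making function.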
The logic is similar to the content-based recommender,
content  undesirable  desirable  user_1  ...  user_10
1        3.00         2.77       0.11    ...  NA
...
5000     2.50         2.11       NA      ...  0.12
I need to fit a model with undesirable and desirable as the independent variables and each user column as the dependent variable, so I have to fit the model 10 times and predict each user's NA values.
This is the hard-coded version. I would like to do the same thing with a for loop, but the methods I found did not work for me.
The data frame is called 'test'.
Hard-coded version:
#fit model
fit_1 = lm(user_1 ~ undesirable + desirable, data = test)
...
fit_10 = lm(user_10 ~ undesirable + desirable, data = test)
#prediction
u_1_na = test[is.na(test$user_1), c('user_1', 'undesirable', 'desirable')]
result1 = predict(fit_1, newdata = u_1_na)
which(result1 == max(result1))
max(result1)
...
u_10_na = test[is.na(test$user_10), c('user_10', 'undesirable', 'desirable')]
result10 = predict(fit_10, newdata = u_10_na)
which(result10 == max(result10))
max(result10)
#write to csv file
Write each user's maximum predicted value to a CSV file.
This is what I have tried so far (for loop):
mod_summaries <- list()
for(i in 1:10) {
predictors_i <- colnames(data)[1:10]
mod_summaries[[i - 1]] <- summary(
lm(predictors_i ~ ., test[ , c("undesirable", 'desirable')]))
}
An apply method:
mod_summaries_lapply <-
lapply(
colnames(mtcars),
FUN = function(x)
summary(lm(reformulate(".", response = x), data = mtcars))
)
A for loop method to make linear models for each column. The key is the reformulate() function, which creates the formula from strings. In the question, the formula is built from a character string, which results in the error "invalid term in model formula". The string has to be converted to a formula first, for example with reformulate() or as.formula(). This example uses the mtcars dataset.
mod_summaries <- list()
for(i in 1:11) {
predictors_i <- colnames(mtcars)[i]
mod_summaries[[i]] <- summary(lm(reformulate(".", response = predictors_i), data=mtcars))
#summary(lm(reformulate(". -1", response = predictors_i), data=mtcars)) # -1 to exclude intercept
#summary(lm(as.formula(paste(predictors_i, "~ .")), data=mtcars)) # a "paste as formula" method
}
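To see what reformulate() actually produces, here is a short sketch using the column names from the question (no data is needed just to build the formulas):

```r
# Build a formula from character strings.
f1 <- reformulate(c("undesirable", "desirable"), response = "user_1")

# "." on the right-hand side means "all other columns of the data".
f2 <- reformulate(".", response = "mpg")

# The paste/as.formula route gives the same formula.
f3 <- as.formula(paste("user_1", "~",
                       paste(c("undesirable", "desirable"), collapse = " + ")))

print(class(f1))
print(f1)
print(f2)
```
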
You could use the function as.formula together with the paste function to create your formula. Following is an example
formula_lm <- as.formula(
paste(response_var,
paste(expl_var, collapse = " + "),
sep = " ~ "))
This assumes that you have more than one explanatory variable (joined with + by the inner paste). If you only have one, the inner paste is optional.
With the created formula, you can use the lm function like this:
lm(formula_lm, data)
Edit: the vector expl_var would in your case include the undesirable and desirable variable.
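Putting the pieces together for the question's workflow, here is a rough sketch. The data are simulated (the real 'test' data frame is not available), only three user columns are generated, and the object names users and best are made up for illustration:

```r
set.seed(42)
n <- 50

# Simulated stand-in for the 'test' data frame from the question.
test <- data.frame(content = seq_len(n),
                   undesirable = runif(n, 1, 5),
                   desirable = runif(n, 1, 5))
for (u in paste0("user_", 1:3)) {
  test[[u]] <- 0.1 * test$desirable - 0.05 * test$undesirable + rnorm(n, sd = 0.05)
  test[[u]][sample(n, 10)] <- NA  # ratings to be predicted
}

users <- grep("^user_", names(test), value = TRUE)

best <- lapply(users, function(u) {
  # lm() drops the rows where the response is NA by default
  fit <- lm(reformulate(c("undesirable", "desirable"), response = u), data = test)
  missing_rows <- test[is.na(test[[u]]), ]
  pred <- predict(fit, newdata = missing_rows)
  data.frame(user = u,
             content = missing_rows$content[which.max(pred)],
             prediction = max(pred))
})
best <- do.call(rbind, best)
print(best)

# write.csv(best, "best_predictions.csv", row.names = FALSE)  # hypothetical file name
```

Each row of best holds, for one user, the content item with the highest predicted value among the items that user has not rated.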
Avoid the loop. Make your data tidy. Something like:
library(tidyverse)
test %>%
select(-content) %>%
pivot_longer(
starts_with("user"),
names_to="user",
values_to="value"
) %>%
group_by(user) %>%
group_map(
function(.x, .y) {
summary(lm(value ~ ., data=.x))
}
)
Untested code since your example is not reproducible.
I'm trying to use a custom function for testing in a tbl_summary |> add_difference() call.
Basically, whatever I pass the function via the test.arg argument in add_difference doesn't get to the called function. I made a simplified version of my code below--I'm pretty sure it's far from optimized and probably full of other bugs I can't see because I'm stuck at the call! 😕
library(tidyverse)
library(gtsummary)
library(infer) # provides specify(), generate(), calculate(), get_confidence_interval()
# Most args are necessary for any function to be compatible with
# add_difference()
testMe <- function(data,
variable,
by,
group = NULL,
type = NULL,
conf.level = 0.95,
adj.vars = NULL,
# test_name is the one I got problems with
test_name = NULL,
ci_type = 0.95,
continuous_variable = NULL,
na.rm,
...) {
# Turn strings into actual symbols for variables
variable <- ensym(variable)
by <- ensym(by)
# Use functions from infer to make statistical computations
estimate <- data |> specify(expr(!!variable ~ !!by)) |>
calculate(stat = test_name, na.rm = na.rm)
booted <- data |>
specify(expr(!!variable ~ !!by)) |>
generate(reps = 1000, type = "bootstrap") |>
calculate(stat = test_name, na.rm = na.rm)
result_df <- get_confidence_interval(booted,
estimate,
type = "bias-corrected",
point_estimate = estimate) |>
rename(conf.low = lower_ci,
conf.high = upper_ci)
# returns a df with column names as specified in gtsummary docs
return(result_df)
}
This is the gtsummary code:
# Turn the group factor into character (per add_difference()) and reduce
# groups to 2.
pdata <- PlantGrowth |> filter((group == "trt1") | (group == "trt2")) |>
mutate(group = as.character(group))
table <- pdata |>
tbl_summary(
by = group,
statistic = list(all_continuous() ~ "{median} ({p25}—{p75})"),
) |>
add_difference(
test = list(weight ~ "testMe"),
# Here we pass a string containing the name of the estimate which
# the infer functions will need.
test.args = weight ~ list(test_name = "diff in medians")
# The following selection, which I've seen in documentation and on the
# Internet, doesn't work for me.
# test.args = all_tests("testMe") ~ list(test_name = "diff in medians")
)
In my attempts, the testMe function is called but its test_name argument is always NULL (i.e. default); all subsequent errors are due to its being blank, and if I set a sensible default, the function always returns the subsequent result.
When I change testMe not to have a default test_name, I get an "argument missing" error. In other words, test.args isn't working for me…
Am I getting something wrong? Thanks!
This is an example of what I'm trying to do.
Step 1 :
Create a list of combination of dependent variable and independent variables
a <- list(paste("Sepal.Length ~ Sepal.Width" ) ,
paste("Sepal.Width ~ Sepal.Length" )
)
Step 2 :
Use lapply to run glm for each element of the list from step 1, inside a for loop that tests two different family parameters
param <- c("gaussian" , "Gamma" )
for(i in 1:2) {
print(lapply(a , FUN = function(X) glm(X , data = iris ,family = param[i] )))}
Is there a better way to achieve this without using for loop in the second step? This is what I have tried but it's not working.
a <-
list(
paste("Sepal.Length ~ Sepal.Width , data = iris , family = "Gaussian" " ) ,
paste("Sepal.Length ~ Sepal.Width , data = iris , family = "Gamma" " ) ,
paste("Sepal.Width ~ Sepal.Length , data = iris , family = "Gaussian" " ) ,
paste("Sepal.Width ~ Sepal.Length , data = iris , family = "Gamma" " )
)
lapply(a , FUN = function(X) glm(X))
Your paste does nothing here. Leave it out. Furthermore, the use of strings is also unnecessary here. Leave them out. Same goes for your parameter families: these are functions, no need to quote them.
This already vastly simplifies the code, both in length and conceptually. Now we have this:
models = list(Sepal.Length ~ Sepal.Width, Sepal.Width ~ Sepal.Length)
families = c(gaussian, Gamma)
And we can apply it:
lapply(models,
function (model) lapply(families,
function (family) glm(model, family, iris)))
… which is a nested application. The indentation hints at what belongs together. Since this is a bit odd, we can also use the cartesian product of the different parameters:
params = as.data.frame(t(expand.grid(models, families)))
lapply(params, function (p) glm(formula = p[[1]], data = iris, family = p[[2]]))
The first line is a bit obscure here. expand.grid allows us to create a data frame of all parameter combinations. Here’s an example:
> expand.grid(1 : 3, c('a', 'b'))
Var1 Var2
1 1 a
2 2 a
3 3 a
4 1 b
5 2 b
6 3 b
Unfortunately, this data frame is in the wrong orientation to be used by lapply, because that applies over columns. So we transpose it (and convert it to a data.frame again, since t always returns a matrix).
This piece of code is incredibly useful because it makes writing nested loops via lapply much more readable; unfortunately, it is itself quite unreadable, so we stick it into a function:
combine_parameters = function (...)
as.data.frame(t(expand.grid(...)))
This allows us to write elegant, readable code:
models = list(Sepal.Length ~ Sepal.Width, Sepal.Width ~ Sepal.Length)
families = c(gaussian, Gamma)
params = combine_parameters(models, families)
lapply(params, function (p) glm(formula = p[[1]], family = p[[2]], data = iris))
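If the transposed data frame feels too opaque, an alternative sketch is to build the grid over positions rather than over the objects themselves and walk it with Map(). This is a different technique from combine_parameters above; the names grid and fits are illustrative:

```r
models <- list(Sepal.Length ~ Sepal.Width, Sepal.Width ~ Sepal.Length)
families <- list(gaussian, Gamma)

# One row per (model, family) combination, stored as integer positions.
grid <- expand.grid(model = seq_along(models), family = seq_along(families))

# Map() walks the two index columns in parallel.
fits <- Map(function(m, f) glm(models[[m]], family = families[[f]], data = iris),
            grid$model, grid$family)

print(length(fits))
print(vapply(fits, function(fit) fit$family$family, character(1)))
```

Keeping the grid as plain integers means it can also be attached to the results afterwards to label which combination produced which fit.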
Using lapply:
lapply(c("gaussian", "Gamma"), function(myFamily){
lapply(c("Sepal.Length ~ Sepal.Width" ,
"Sepal.Width ~ Sepal.Length"), function(myFormula){
glm(formula = myFormula, family = myFamily, data = iris)
})
})
EDIT:
As mentioned in @KonradRudolph's answer, we can pass the families as a list of family functions or objects instead of strings, optionally with a link = argument, e.g.:
lapply(list(gaussian(link = "identity"), Gamma), function(myFamily){
lapply(c("Sepal.Length ~ Sepal.Width" ,
"Sepal.Width ~ Sepal.Length"), function(myFormula){
glm(formula = myFormula, family = myFamily, data = iris)
})
})
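As a quick check that the different ways of specifying the family are interchangeable, here is a small sketch fitting the same model three times (the object names are illustrative):

```r
# family given as a string, as a function, and as a family object
by_name   <- glm(Sepal.Length ~ Sepal.Width, family = "gaussian", data = iris)
by_fn     <- glm(Sepal.Length ~ Sepal.Width, family = gaussian, data = iris)
by_object <- glm(Sepal.Length ~ Sepal.Width,
                 family = gaussian(link = "identity"), data = iris)

same <- isTRUE(all.equal(coef(by_name), coef(by_fn))) &&
  isTRUE(all.equal(coef(by_fn), coef(by_object)))
print(same)
```

The family-object form is the only one of the three that lets you choose a non-default link.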