R add multiple different inputs interacted within an lapply function - r

I am trying to run a complex repeated function in R using lapply to create multiple datasets and perform multiple similar analyses. My base function works fine and is essentially the following:
models <- paste0("outcome", 1:10, " ~ explanatoryvariable | fe_variable |0|fe_variable") |> lapply(\(x) felm(as.formula(x), data = df))
I then plot models based on that formula:
plot1 <- plot_model(models[[1]], show.values = TRUE, value.offset = .3, vline.color = "Blue", vline.size = 0.1, p.threshold = 0.05, colors = "Black", digits = 3)
I do this for all plots 1-10. This allows me to perform the same regression analysis for multiple versions of explanatoryvariable, which are all included in my dataset with their corresponding number 1-10.
This works perfectly; however, I now need to repeat this with a more complex formula. I need to add an interaction term to the original function. However, the variable being interacted is not the same in each model. Essentially, I need the following:
models <- paste0("outcome", 1:10, " ~ explanatoryvariable*interactionvariable(x) | fe_variable |0|fe_variable") |> lapply(\(x) felm(as.formula(x), data = df))
Where interactionvariable changes for each model.
So, when I create the plots
plot1 will include explanatoryvariable interacted with interactionvariable1 but plot2 will include explanatoryvariable interacted with interactionvariable2, and so on.
Because of the way my lapply call was written before, I cannot insert this changing interaction variable into the pasted text. Is there a way to design an lapply function with these two changing inputs? How can I change the code to preserve the overall function for multiple models/plots, but adjust it for this more complex interaction term? Is it possible to have two sets of changing variables within the lapply function, especially given that they must be part of an interaction term?
Note: plot_model is from the sjPlot package and felm is from the lfe package
Update with attempts using map function
I tried using the map function and seem closer but am receiving errors:
input_fun<-function(data, input1, input2){
felm(as.formula(paste("outcome",input1," ~ explanatoryvariable*",input2, "| fe_variable |0|fe_variable", collapse = "", sep = "", data = data)))}
model <-map(.x = 1:2, .f = ~ input_fun(data= df, input1 = 1:10, input2 = c(interactionvariable1,interactionvariable2, ...)))
which returns the following error:
Error in str2lang(x) : <text>:69:396: unexpected symbol
I also tried:
input_fun<-function(data, input1, input2){
felm(as.formula(paste("outcome",input1," ~ explanatoryvariable*",input2, "| fe_variable |0|fe_variable |data = df", collapse = "", sep = "")))}
model <-map_chr(.x = 1:2, .f = ~ input_fun(data= df, input1 = 1:10, input2 = c(interactionvariable1,interactionvariable2, ...)))
and received:
Warning: invalid formula c("outcome1 ~ explanatoryvariable*interactionvariable1 | fe_variable |0|fe_variable | data = df", "outcome2 ~ explanatoryvariable*interactionvariable2 | fe_variable |0|fe_variable | data = df")...: assignment is deprecated
Error in class(ff) <- "formula" : cannot set attribute on a symbol
As @allan suggested, map seems to be on the right track, but how can I fix these errors?
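One possible way forward, sketched under assumptions: paste0() is vectorised, so the outcome suffix and the interaction variable can be supplied as two vectors of the same length, and element k of each ends up in formula k. In the map attempts above, input1 = 1:10 hands the whole vector to every call and data = data sits inside paste(), which would explain the parse errors. The sketch below uses simulated data with hypothetical column names, and base lm() in place of lfe::felm() so that it runs without extra packages.

```r
# Simulated stand-in for the asker's data; all column names are assumptions.
set.seed(1)
n <- 200
df <- data.frame(
  explanatoryvariable = rnorm(n),
  fe_variable = factor(sample(letters[1:5], n, replace = TRUE))
)
for (k in 1:3) {
  df[[paste0("interactionvariable", k)]] <- rnorm(n)
  df[[paste0("outcome", k)]] <- rnorm(n)
}

# paste0() is vectorised: element k of each vector is combined into formula k.
outcomes     <- paste0("outcome", 1:3)
interactions <- paste0("interactionvariable", 1:3)
formula_text <- paste0(outcomes, " ~ explanatoryvariable * ", interactions)

# lm() stands in for lfe::felm() here (a swapped-in fitter, not the original).
models <- lapply(formula_text, function(f) lm(as.formula(f), data = df))
names(models) <- outcomes

# Each model carries its own interaction term.
names(coef(models[["outcome2"]]))
```

With felm, the same strings would presumably just need the " | fe_variable | 0 | fe_variable" part appended before as.formula(); that variant is not run here. The resulting list can then be passed to plot_model one element at a time, as before.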

Related

Issue concerning the Y parameter in tbl_uvregression function inside a function

So I am trying to supply the y parameter of the tbl_uvregression function (gtsummary package) via a custom function. The idea is to create multiple tables inside my function and return them merged.
Here an example of the code I am using:
#Loading libraries + example dataset from questionr package
library(haven)
library(tidyverse)
library(finalfit)
library(dplyr)
library(survey)
library(srvyr)
library(gtsummary)
library(glue)
library(gt)
library(knitr)
library(questionr)
data(hdv2003)
Here is the part where I have an issue :
reg_log <- function(dataframew, variables, by) {
##param1 : weighted dataframe
##param2 : vector containing variables we want in our graph
##param3 : the variable or column we want as our Y argument
Table <- tbl_uvregression(data = dataframew, include = variables, exponentiate = TRUE, method.args = list(family = quasibinomial()), y = by, method = survey::svyglm)
return(Table)
}
When I run this function outside of reg_log, I have no issue, but it seems that inside a function, the y parameter of tbl_uvregression does not evaluate the argument, but instead reads it literally. Here's the error I get when calling my function:
hdv2003w <- svydesign(ids = ~1, data = hdv2003, weights = ~hdv2003$poids) #setting the survey.design object
reg_log(hdv2003w, c("age", "sexe", "hard.rock", "sport"), "sport")
x There was an error constructing model survey::svyglm(formula = by ~ age, design = ., family = quasibinomial()) See error below.
Error: Problem with mutate() column model.
i model = map(...).
x Error in svyglm.survey.design(formula = by ~ age, design = structure(list(: all variables must be in design= argument
I am aware that the y parameter requires an unquoted name, but even when I use the substitute() function it does not work. I have resigned myself to covering the possible cases with the switch function, but if anyone knows how to resolve this, that would be awesome.
Thanks.
The tbl_uvregression() function expects an unquoted input for y=, rather than a string with the outcome name. I updated your function to account for the string input.
library(gtsummary)
library(questionr)
data(hdv2003)
reg_log <- function(dataframew, variables, by) {
tbl_uvregression(
data = dataframew,
include = all_of(variables),
exponentiate = TRUE,
method.args = list(family = quasibinomial()),
y = !!rlang::sym(by),
method = survey::svyglm
)
}
hdv2003w <- survey::svydesign(ids = ~1, data = hdv2003, weights = ~hdv2003$poids) #setting the survey.design object
tbl <-
reg_log(hdv2003w, c("age", "sexe", "hard.rock"), "sport")
Created on 2021-11-12 by the reprex package (v2.0.1)
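To illustrate why the string has to become a symbol, here is a small base R sketch. It is not how gtsummary is implemented; bquote() is used only as a loose base R analogue of the !! splice, and the variable names are taken from the question.

```r
by <- "sport"  # outcome name supplied as a string, as in reg_log()

# Writing the argument name directly in a formula keeps the literal symbol
# `by`, which matches the "formula = by ~ age" shown in the error message.
literal <- by ~ age
all.vars(literal)

# Converting the string to a symbol first and splicing it in yields the
# intended outcome variable.
outcome <- as.name(by)
spliced <- eval(bquote(.(outcome) ~ age))
all.vars(spliced)
```

The first formula refers to a column called by, which does not exist in the survey design; the second refers to sport.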

Remove linear dependent variables while using the bife package

Some pre-programmed models in R (e.g. lm()) automatically remove linearly dependent variables from their regression output. With the bife package, this does not seem to be possible. As stated in the package documentation on CRAN, page 5:
If bife does not converge this is usually a sign of linear dependence between one or more regressors
and the fixed effects. In this case, you should carefully inspect your model specification.
Now, suppose the problem at hand involves running many regressions, so that one cannot adequately inspect each regression output and has to rely on some sort of rule of thumb regarding the regressors. What are some alternatives for removing linearly dependent regressors more or less automatically and arriving at an adequate model specification?
Example code is below:
#sample coding
x=10*rnorm(40)
z=100*rnorm(40)
df1=data.frame(a=rep(c(0,1),times=20), x=x, y=x, z=z, ID=c(1:40), date=1, Region=rep(c(1,2, 3, 4),10))
df2=data.frame(a=c(rep(c(1,0),times=15),rep(c(0,1),times=5)), x=1.4*x+4, y=1.4*x+4, z=1.2*z+5, ID=c(1:40), date=2, Region=rep(c(1,2,3,4),10))
df3=rbind(df1,df2)
for(i in 1:4) {
x=df3[df3$Region==i,]
model = bife::bife(a ~ x + y + z | ID, data = x)
results=data.frame(Region=unique(df3$Region))
results$Model = results
if (i==1){
df4=df
next
}
df4=rbind(df4,df)
}
Error: Linear dependent terms detected!
Since you're only looking at linear dependencies, you could simply leverage methods that detect them, like for instance lm.
Here's an example solution using the fixest package:
library(bife)
library(fixest)
x = 10*rnorm(40)
z = 100*rnorm(40)
df1 = data.frame(a=rep(c(0,1),times=20), x=x, y=x, z=z, ID=c(1:40), date=1, Region=rep(c(1,2, 3, 4),10))
df2 = data.frame(a=c(rep(c(1,0),times=15),rep(c(0,1),times=5)), x=1.4*x+4, y=1.4*x+4, z=1.2*z+5, ID=c(1:40), date=2, Region=rep(c(1,2,3,4),10))
df3 = rbind(df1, df2)
vars = c("x", "y", "z")
res_all = list()
for(i in 1:4) {
x = df3[df3$Region == i, ]
coll_vars = feols(a ~ x + y + z | ID, x, notes = FALSE)$collin.var
new_fml = xpd(a ~ ..vars | ID, ..vars = setdiff(vars, coll_vars))
res_all[[i]] = bife::bife(new_fml, data = x)
}
# Display all results
for(i in 1:4) {
cat("\n#\n# Region: ", i, "\n#\n\n")
print(summary(res_all[[i]]))
}
The functions needed here are feols and xpd, both from fixest. Some explanations:
feols, like lm, removes variables on-the-fly when they are found to be collinear. It stores the names of the collinear variables in the slot $collin.var (if none is found, it's NULL).
Contrary to lm, feols also allows fixed-effects, so you can add it when you look for linear dependencies: this way you can spot complex linear dependencies that would also involve the fixed-effects.
I've set notes = FALSE otherwise feols would have prompted a note referring to collinearity.
feols is fast (actually faster than lm for large data sets) so won't be a strain on your analysis.
The function xpd expands the formula and replaces any variable name starting with two dots with the associated argument that the user provides.
When the arguments of xpd are vectors, the behavior is to collapse them with pluses, so if ..vars = c("x", "y") is provided, the formula a ~ ..vars | ID will become a ~ x + y | ID.
Here it replaces ..vars in the formula by setdiff(vars, coll_vars), which is the vector of variables that were not found to be collinear.
So you get an algorithm with automatic variable removal before performing bife estimations.
Finally, just a side comment: in general it's better to store results in lists since it avoids copies.
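For readers without fixest, the same idea can be sketched with base lm(), which reports NA coefficients for regressors it drops as linearly dependent. This is a simplified variant under stated assumptions: the data are simulated, and unlike the feols approach above it ignores the fixed effects, so it only catches dependence among the regressors themselves.

```r
# Simulated data in which y is an exact copy of x (hypothetical example).
set.seed(42)
d <- data.frame(a = rbinom(40, 1, 0.5), x = rnorm(40), z = rnorm(40))
d$y <- d$x

vars <- c("x", "y", "z")

# lm() keeps the first of a set of dependent columns and sets the
# coefficients of the others to NA.
fit <- lm(reformulate(vars, response = "a"), data = d)
coll_vars <- names(coef(fit))[is.na(coef(fit))]

# Rebuild the formula without the dependent regressors.
keep <- setdiff(vars, coll_vars)
new_fml <- reformulate(keep, response = "a")
new_fml
```

The cleaned formula could then be extended with the fixed-effect part before being passed to bife; that step is not shown here.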
Update
I forgot, but if you don't need bias correction (bife::bias_corr), then you can directly use fixest::feglm which automatically removes collinear variables:
res_bife = bife::bife(a ~ x + z | ID, data = df3)
res_feglm = fixest::feglm(a ~ x + y + z | ID, df3, family = binomial)
rbind(coef(res_bife), coef(res_feglm))
#> x z
#> [1,] -0.02221848 0.03045968
#> [2,] -0.02221871 0.03045990

Pass df column names to nested equation in Graph Printing Function

I need some clarification on the primary post on Passing a data.frame column name to a function
I need to create a function that will take a testSet, trainSet, and colName(aka predictor) as inputs to a function that prints a plot of the dataset with a GAM model trend line.
The issue I run into is:
plot.model = function(predictor, train, test) {
mod = gam(Response ~ s(train[[predictor]], spar = 1), data = train)
...
}
#Function Call
plot.model("Predictor1", 1.0, crime.train, crime.test)
I can't simply pass the predictor as a string into the gam function, but I also can't use a string to index the data frame values as shown in the link above. Somehow, I need to pass the colName key to the gam function. This issue occurs in other similar scenarios regarding plotting.
plot <- ggplot(data = test, mapping = aes(x=predictor, y=ViolentCrimesPerPop))
Again, I can't pass a string value for the column name and I can't pass the column values either.
Does anyone have a generic solution for these situations? I apologize if the answer is buried in the above link, but it's not clear to me if it is.
Note: A working gam function call looks like this:
mod = gam(Response ~ s(Predictor1, spar = 1.0), data = train)
Where the train set is a data frame with column names "Response" & "Predictor1".
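Use aes_string instead of aes when you pass a column name as a string. Every aesthetic then has to be given as a string, including y. (Recent ggplot2 releases deprecate aes_string in favor of aes(x = .data[[predictor]], ...).)
plot <- ggplot(data = test, mapping = aes_string(x = predictor, y = "ViolentCrimesPerPop"))
For the gam function, here is an example based on the one in the gam documentation (mgcv package). I have used a vector of column names; a single name is even easier. It just uses paste with the collapse argument.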
library(mgcv)
set.seed(2) ## simulate some data...
dat <- gamSim(1,n=400,dist="normal",scale=2)
# String manipulate for formula
formula <- as.formula(paste("y~s(", paste(colnames(dat)[2:5], collapse = ")+s("), ")", sep =""))
b <- gam(formula, data=dat)
is the same as
b <- gam(y~s(x0)+s(x1)+s(x2)+s(x3),data=dat)
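For the single-predictor case in the question, a minimal sketch of the same string-to-formula idea follows. The helper name make_gam_formula is hypothetical, and the column names Response and Predictor1 are taken from the question; only the formula construction is run here, not the model fit.

```r
# Hypothetical helper: build "Response ~ s(<predictor>, spar = <spar>)"
# from a column name given as a string.
make_gam_formula <- function(predictor, response = "Response", spar = 1) {
  as.formula(sprintf("%s ~ s(%s, spar = %s)", response, predictor, spar))
}

fml <- make_gam_formula("Predictor1")
fml

# The variables the formula will look up in the data frame.
all.vars(fml)
```

Inside plot.model, the result would presumably be used as gam(make_gam_formula(predictor), data = train), so the smoother refers to the column by name rather than to train[[predictor]].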

User-Defined Function for lme model fits: error

I am beginning to write a function that builds linear mixed models with nlme. I am encountering an error: Error in eval(expr, envir, enclos) : object 'value' not found, which I believe is due to R not knowing where to find the data frame variables (e.g., value). If this is, in fact, why the error is occurring, how do I tell the function that value and timepoint belong to the variables in Dat in the (reproducible) code below?
require(nlme)
Dat <- data.frame(
id = sample(10:19),
Time = sample(c("one", "two"), 10, replace = T),
Value = sample(1:10)
)
nlme_rct_lmm <- function (data, value, timepoint,
ID) {
#base_level intercept only model
bl_int_only <- gls(value ~ 1,
data = data,
method = "ML",
na.action="na.omit")
#vary intercept across participants
randomIntercept <- lme(value ~ 1,
data = data,
random = ~1|ID,
method = "ML",
na.action = "na.omit")
#add timepoint as a fixed effect
timeFE <- lme(value ~ timepoint,
data = data,
random = ~1|ID,
method = "ML",
na.action = "na.omit")
}
nlme_rct_lmm(Dat, Value, Time, id)
This isn't (as you and I both expected) a problem with evaluation within different frames; rather, it's an issue of consistency of variable names between the formula and the data. R is case-sensitive, so it matters whether you use value or Value, id or ID, etc. Furthermore, formula interpretation uses non-standard evaluation (NSE), so if you have a variable value equal to the symbol Value, value ~ 1 does not magically get transmuted to Value ~ 1. What I've outlined below works by passing the names of the response, time, and ID variables to the function as strings, because it's the easiest approach. It's a little bit more elegant for the end-user if you use non-standard evaluation, but that's a bit harder to program (and therefore to understand, debug, etc.).
Below the easy/boneheaded approach, I also discuss how to implement the NSE approach (scroll all the way down ...)
Note that your example only returns the last fitted model (invisibly, as the value of the final assignment); the other results are discarded when the function finishes. You might want to return the results as a list (or perhaps your real function will do some other stuff with the fitted models, such as a series of model tests, and return those answers as the results ...)
require(nlme)
Dat <- data.frame(
ID = sample(10:19),
Time = sample(c("one", "two"), 10, replace = T),
Value = sample(1:10)
)
nlme_rct_lmm <- function (data, value, timepoint,
ID) {
nullmodel <- reformulate("1",response=value)
fullmodel <- reformulate(c("1",timepoint),response=value)
remodel <- reformulate(paste("1",ID,sep="|"))
#base_level intercept only model
bl_int_only <- gls(nullmodel,
data = data,
method = "ML",
na.action="na.omit")
#vary intercept across participants
randomIntercept <- lme(nullmodel,
data = data,
random = remodel,
method = "ML",
na.action = "na.omit")
#add timepoint as a fixed effect
timeFE <- lme(fullmodel,
data = data,
random = remodel,
method = "ML",
na.action = "na.omit")
}
nlme_rct_lmm(Dat, "Value", "Time", "ID")
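The three reformulate() calls in the function above can be checked on their own, without fitting anything. This is just an illustration with the argument values from the call above.

```r
# Argument values as passed in nlme_rct_lmm(Dat, "Value", "Time", "ID").
value     <- "Value"
timepoint <- "Time"
ID        <- "ID"

# Fixed-effect formulas: intercept only, then intercept plus time.
nullmodel <- reformulate("1", response = value)                # Value ~ 1
fullmodel <- reformulate(c("1", timepoint), response = value)  # Value ~ 1 + Time

# One-sided random-effects formula: random intercept per ID.
remodel <- reformulate(paste("1", ID, sep = "|"))              # ~1 | ID

all.vars(fullmodel)
all.vars(remodel)
```

Because the formulas are built from the strings, the names in them match the column names in Dat exactly, which is what the original value ~ 1 failed to do.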
If you want something a bit more elegant (but internally obscure) you can substitute the following lines for defining the models. The inner substitute() calls retrieve the symbols that were passed to the function as arguments; the outer substitute() calls insert those symbols into the formula.
nullmodel <- formula(substitute(v~1,list(v=substitute(value))))
fullmodel <- formula(substitute(v~t,list(v=substitute(value),
t=substitute(timepoint))))
remodel <- formula(substitute(~1|i,list(i=substitute(ID))))
Now this would work, without specifying the variables as strings, as you expected: nlme_rct_lmm(Dat, Value, Time, ID)

Pass glm predictors from a list

I have a large set of model specifications to test, which share a dv but have unique IVs. In the following example
foo <- data.frame(dv = sample(c(0,1), 100, replace=T),
x1 = runif(100),
x2 = runif(100))
I want the first model to only include x1, the second x2, the third both, and the fourth their interaction. So I thought a sensible way would be to build a list of formula statements:
bar <- list("x1",
"x2",
"x1+x2",
"x1*x2")
which I would then use in a llply call from the plyr package to obtain a list of model objects.
require(plyr)
res <- llply(bar, function(i) glm(dv ~ i, data = foo, family = binomial()))
Unfortunately I'm told
Error in model.frame.default(formula = dv ~ i, data = foo, drop.unused.levels = TRUE): variable lengths differ (found for 'i')
Obviously I'm mixing up something fundamental--do I need to manipulate the original foo list in some fashion?
Your problem is with how you are specifying the formula, since inside the function i is a variable. This would work:
glm(paste("dv ~", i), data = foo, family = binomial())
The problem is that dv ~ i isn't the formula you intend. Inside the anonymous function, i is simply a symbol naming a variable that holds a character value, and the value is not substituted into the formula.
Try this:
bar <- list("dv~x1",
"dv~x2",
"dv~x1+x2",
"dv~x1*x2")
res <- llply(bar, function(i) glm(i, data = foo, family = binomial()))
But setting statistical issues aside, it might possibly be easier to use something like ?step or ?stepAIC in the MASS package for tasks similar to this?
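As a variant of the two answers above, the right-hand sides can stay as they were in the question and be turned into formulas with reformulate(). This is a sketch using base lapply instead of plyr::llply, with data simulated as in the question.

```r
set.seed(123)
foo <- data.frame(dv = sample(c(0, 1), 100, replace = TRUE),
                  x1 = runif(100),
                  x2 = runif(100))

# Right-hand sides only, as in the original list.
rhs <- c("x1", "x2", "x1 + x2", "x1 * x2")

# reformulate() builds dv ~ <rhs> for each entry before glm() sees it.
res <- lapply(rhs, function(i) {
  glm(reformulate(i, response = "dv"), data = foo, family = binomial())
})
names(res) <- rhs

# Number of estimated coefficients per model (intercept included).
sapply(res, function(m) length(coef(m)))
```

Naming the list by its right-hand side makes it easier to pick out a specific model later, e.g. res[["x1 * x2"]].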
