Is using assign() to deal with a scoping problem a problem? - r

When called inside a custom function, RVAideMemoire::Anova.clm
does not find the 'df1' data object that was passed to ordinal::clm (apparently
because it searches for 'df1' in the global environment):
library(ordinal)
library(RVAideMemoire)
set.seed(1)
df <- data.frame(x = factor(sample(1:2, 100, replace=TRUE)),
                 y = factor(sample(1:5, 100, replace=TRUE), ordered=TRUE))
clm_function <- function(dv, gv, df1){
  model <- ordinal::clm(as.formula(paste0(dv, " ~ ", gv)), data = df1)
  result <- RVAideMemoire::Anova.clm(model, type = "II")
  return(result)
}
clm_function(dv = "y", gv = "x", df1 = df)
Error in is.data.frame(data) : object 'df1' not found
One can sidestep this error by using assign to put 'df1' in the global
environment temporarily:
clm_function_alt <- function(dv, gv, df1){
  assign("df1", df1, envir=globalenv()) # ASSIGN HERE
  model <- ordinal::clm(as.formula(paste0(dv, " ~ ", gv)), data = df1)
  result <- RVAideMemoire::Anova.clm(model, type = "II")
  rm(df1, pos = 1) # REMOVE HERE
  return(result)
}
clm_function_alt(dv = "y", gv = "x", df1 = df)
LyzandeR’s answer in this post implies that assigning ‘df1’ to the global environment outside the function is the way to go.
But I am wondering if there is something potentially problematic with using assign from inside a function, as I have shown?
If so, is there a way to instruct RVAideMemoire::Anova.clm to search for ‘df1’ in the execution environment?
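One possible way to avoid touching the global environment is sketched below with base R's lm(), so that it runs without extra packages. The idea is to build the model call with do.call(), so that the data frame itself, rather than the name 'df1', is stored in the fitted object's call. Code that later re-evaluates that call then no longer has to look up 'df1'. The function names fit_by_name and fit_by_value are made up for illustration, and whether RVAideMemoire::Anova.clm accepts a clm model built this way is an assumption that has not been checked here.

```r
# Fit by name: the stored call refers to the local symbol `df1`
fit_by_name <- function(dv, gv, df1) {
  f <- as.formula(paste0(dv, " ~ ", gv))
  lm(f, data = df1)
}

# Fit by value: do.call() evaluates its arguments first, so the stored
# call contains the formula and the data frame themselves
fit_by_value <- function(dv, gv, df1) {
  f <- as.formula(paste0(dv, " ~ ", gv))
  do.call("lm", list(formula = f, data = df1))
}

m_name  <- fit_by_name("mpg", "wt", mtcars)
m_value <- fit_by_value("mpg", "wt", mtcars)

class(m_name$call$data)    # a name (symbol)
class(m_value$call$data)   # a data.frame

# Re-evaluating the stored call from the global environment only
# succeeds for the model that carries its own data
refit_name  <- tryCatch(eval(m_name$call, globalenv()), error = function(e) e)
refit_value <- eval(m_value$call, globalenv())
```

The trade-off is that the whole data frame is stored inside the call, so printing the model becomes verbose and the object grows with the data.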

Related

For loop to run lmer model for each column in a dataframe gives "variable lengths differ" error (R)

I am trying to run a for loop that plugs every column in a dataframe into a lme4::lmer() model syntax and appends the result to a list e.g.
list_results_univariate <- list()
for (i in names(my_dataset))
{
  resultmodel <- lmer(y_variable ~
                        i + i*timevar1 + i*timevar2 +
                        (1|Date) + (1|Location),
                      data= my_dataset)
  tidy_resultmodel <- tidy_lmer(resultmodel)
  list_results_univariate[[i]] <- tidy_resultmodel
}
but the result is:
Error in model.frame.default(data = my_dataset, drop.unused.levels = TRUE, :
variable lengths differ (found for 'i')
The dataset contains no NAs and no single-level factors as I already removed these. It still returns the same error if I remove timevar1, timevar2, Date and Location from the list of names I iterate over.
How do I get this to run without manually writing the model for each variable?
Your formula includes i directly, which means that lmer expects to find a column called i in your dataset. Your i variable has length 1 (the string column name), but lmer is expecting a variable of length equal to the length of your y_variable, hence the error message.
Inside your loop, you should create a formula that evaluates i to its underlying value, and then use that formula in lmer. For example:
library(lme4)
dat <- data.frame(id = sample(c("a", "b", "c"), 100, replace=TRUE),
                  y = rnorm(100),
                  x = rnorm(100),
                  w = rnorm(100),
                  z = rnorm(100))
# this errors
for (i in c("x", "w", "z")) {
  lmer(y ~ i + (1 | id), data=dat)
}
# this works
models <- list()
for (i in c("x", "w", "z")) {
  f <- formula(paste("y~(1|id)+", i))
  models[[i]] <- lmer(f, data=dat)
}
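The same idea can be carried over to the full formula from the question, including the interaction terms. The following is only a sketch: the predictor names are placeholders, and the other names (y_variable, timevar1, timevar2, Date, Location) are taken from the question. It builds the formulas only, so it runs without lme4.

```r
predictors <- c("x", "w", "z")  # placeholder column names

# Build one formula per predictor; %1$s is replaced by the column name
formulas <- lapply(predictors, function(v) {
  as.formula(sprintf(
    "y_variable ~ %1$s * timevar1 + %1$s * timevar2 + (1 | Date) + (1 | Location)",
    v
  ))
})
names(formulas) <- predictors

formulas[["x"]]
# Each element could then be passed on, e.g. lmer(formulas[["x"]], data = my_dataset)
```

When iterating over names(my_dataset), the response, the time variables, and the grouping columns would have to be excluded from the predictor list first.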

R sometimes fails to evaluate expressions parsed from strings

I have a massive dataframe where I need to create "lagged" variables and compare them with former time points. As this process needs to be variable, I've chosen to write my own functions which create these lagged variables (not included here).
As I use GLMs, I want to use the stepAIC function, and before I start writing tens of "lag01 + lag02..." terms I wanted to create another function (modelfiller) which creates these strings according to my parameters; I then use str2lang to turn them into expressions.
This mostly works but there is one issue which I cannot get my head around.
As you can see in the reprex full.model can be created when I only use y~x+lag01+lag02. If I use modelfiller("y", 2, "x", "lag") at location 1 and 3 it also works. But the moment I put modelfiller("y", 2, "x", "lag") at location 2 in the code (within the stepAIC glm) it creates the following error message:
Error: Problem with `mutate()` input `GLM_AIC`.
x object '.x' not found
i Input `GLM_AIC` is `purrr::map(...)`.
i The error occurred in group 1: group = "a".
I have also tried as.formula with & without eval, but it caused the same issue.
group <- c(rep("a", 10), rep("b", 10), rep("c", 10))
order <- c(seq(1:10), seq(1:10), seq(1:10))
x <- c(runif(30))
y <- c(runif(30))
df <- data.frame(group, order, x, y)
df <- df %>%
  dplyr::group_by(group) %>%
  dplyr::arrange(group, order) %>%
  dplyr::mutate(lag01 = dplyr::lag(x, n=1),
                lag02 = dplyr::lag(x, n=2)) %>%
  tidyr::drop_na()
modelfiller = function(depPar, maxlag, indepPar, str) {
  varnames = list()
  for (i in seq(1:maxlag)) {
    varnames[i] = paste0(str, stringr::str_pad(i, width = 2, pad = "0"))
  }
  varnames = paste0(varnames, collapse="+")
  varnames = paste(indepPar, varnames, sep = "+")
  return(paste(depPar, varnames, sep = "~"))
}
full.model <- df %>%
  tidyr::nest(- group) %>%
  dplyr::mutate(
    # Perform GLM calculation on each group and then a step-wise model selection based on AIC
    GLM = purrr::map(
      data, ~ lm(data = .x,
                 # Location 1 - Working
                 str2lang(modelfiller("y", 2, "x", "lag"))
                 #y~x+lag01+lag02
      )),
    GLM_AIC = purrr::map(
      data, ~ MASS::stepAIC(glm(data = .x,
                                # Location 2 - NOT Working
                                str2lang(modelfiller("y", 2, "x", "lag"))
                                #y~x+lag01+lag02
      )
      ,direction = "both", trace = FALSE, k = 2,
      scope = list(
        lower = lm(data = .x,
                   y ~ 1),
        upper = glm(data = .x,
                    # Location 3 - Working
                    str2lang(modelfiller("y", 2, "x", "lag"))
                    #y~x+lag01+lag02
        )
      )))
  )
The issue is that glm stores the name of the variable used to reference the data, and stepAIC then attempts to retrieve this name and evaluate it to access the data, but gets confused about which environment the variable was defined in. To demonstrate, I'm going to simplify your code to
mdl <- str2lang(modelfiller("y", 2, "x", "lag")) # This is your y~x+lag01+lag02
dfn <- df %>% tidyr::nest( data = c(-group) ) # First step of your %>% chain
glms <- purrr::map( dfn$data, ~glm(data = .x, mdl) ) # Construct the models
# Examine glms to observe that
# Call: glm(formula = mdl, data = .x) <--- glm() remembers that the data is in .x
# but stepAIC is not properly aware of where .x
# is defined and behaves effectively as
MASS::stepAIC( glms[[1]] ) # Error: object '.x' not found
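The stored reference can be inspected directly. The following base-R sketch uses made-up data and an anonymous function whose argument is called .x, mimicking what the purrr lambda does; it only illustrates that the call keeps the name, not the data.

```r
set.seed(1)
d <- data.frame(x = runif(10), y = runif(10))

# Fit inside a function whose argument is named .x, as in a purrr lambda
fit <- (function(.x) glm(y ~ x, data = .x))(d)

data_ref <- fit$call$data
data_ref            # the symbol .x, not the data frame itself
class(data_ref)     # "name"

# At top level there is normally no object called .x, which is why code
# that re-evaluates the call outside the lambda cannot find the data
exists(".x")
```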
Option 1
One workaround is to manually construct the expression that contains the data and then evaluate it:
glm2 <- function(.df, ...) {
  eval(rlang::expr(glm(!!rlang::enexpr(.df), !!!list(...))))
}
glms2 <- purrr::map( dfn$data, ~glm2(data = .x, mdl) ) # Same as above, but with glm2
MASS::stepAIC( glms2[[1]] ) # Now works
Changing glm to glm2 in your problematic spot makes your code work too. The downside is that the Call: then stores the entire data frame, which can be problematic if the data frame is very large.
Option 2
Another alternative is to replace the purrr call with a for loop, which helps maintain the calling frames assumed by stepAIC, thus guiding it to where the data is defined:
# This fails with Error: object '.x' not found
purrr::map( dfn$data, ~MASS::stepAIC(glm(data=.x, mdl), direction="both") )
# This works
for( mydata in dfn$data )
  MASS::stepAIC(glm(data=mydata, mdl), direction="both")
The advantage here is not needing to store the entire data frame inside the call. The disadvantage is that you effectively lose access to what purrr does to streamline the code.

Iterate over list and append in order to do a regression in R

I know that somewhere there will exist this kind of question, but I couldn't find it. I have the variables a, b, c, d and I want to write a loop, such that I regress and append the variables and regress again with the additional variable
lm(Y ~ a, data = data), then
lm(Y ~ a + b, data = data), then
lm(Y ~ a + b + c, data = data) etc.
How would you do this?
Using paste and as.formula, example using mtcars dataset:
myFits <- lapply(2:ncol(mtcars), function(i){
  x <- as.formula(paste("mpg",
                        paste(colnames(mtcars)[2:i], collapse = "+"),
                        sep = "~"))
  lm(formula = x, data = mtcars)
})
Note: looks like a duplicate post, I have seen a better solution for this type of questions, cannot find at the moment.
You could do this with a lapply / reformulate approach.
formulae <- lapply(ivars, function(x) reformulate(x, response="Y"))
lapply(formulae, function(x) summary(do.call("lm", list(x, quote(dat)))))
Data
set.seed(42)
dat <- data.frame(matrix(rnorm(80), 20, 4, dimnames=list(NULL, c("Y", letters[1:3]))))
ivars <- sapply(1:3, function(x) letters[1:x]) # create an example list of indep. variables
vars = c('a', 'b', 'c', 'd')
# might want to use a subset of names(data) instead of
# manually typing the names
reg_list = list()
for (i in seq_along(vars)) {
  my_formula = as.formula(sprintf('Y ~ %s', paste(vars[1:i], collapse = " + ")))
  reg_list[[i]] = lm(my_formula, data = data)
}
You can then inspect an individual result with, e.g., summary(reg_list[[2]]) (for the 2nd one).
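Yet another option, shown here only as a sketch on simulated data, is to fit the first model and then grow it one term at a time with update(), where ". ~ . + b" means "same response, same terms, plus b". The data frame dat and the variable names are stand-ins for the question's data.

```r
set.seed(42)
# Simulated data, mirroring the question's Y and a, b, c, d
dat <- data.frame(matrix(rnorm(100), 20, 5,
                         dimnames = list(NULL, c("Y", letters[1:4]))))
vars <- c("a", "b", "c", "d")

fits <- vector("list", length(vars))
fits[[1]] <- lm(Y ~ a, data = dat)
for (i in seq_along(vars)[-1]) {
  # Add the next variable to the previous model's formula and refit
  fits[[i]] <- update(fits[[i - 1]], as.formula(paste(". ~ . +", vars[i])))
}

# Number of coefficients per model: the intercept plus 1, 2, 3, 4 slopes
sapply(fits, function(m) length(coef(m)))
```

Because update() re-evaluates the stored call, this works as written at top level; inside a function the data would need to be visible from where update() is called.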

R: object y not found in function (x,y) [function to pass through data frames in r]

I am writing a function to build new data frames based on existing data frames. So I essentially have
f1 <- function(x,y) {
  x_adj <- data.frame("DID*"= df.y$`DM`[x], "LDI"= df.y$`DirectorID*`[-(x)], "LDM"= df.y$`DM`[-(x)], "IID*"=y)
}
I have 4,000 data frames df., so I really need to use this and R is returning an error saying that df.y is not found. y is meant to be used through a list of all the 4000 names of the different df. I am very new at R so any help would be really appreciated.
In case more specifics are needed I essentially have something like
df.1 <- data.frame(x = 1:3, b = 5)
And I need the following as a result using a function
df.11 <- data.frame(x = 1, c = 2:3, b = 5)
df.12 <- data.frame(x = 2, c = c(1,3), b = 5)
df.13 <- data.frame(x = 3, c = 1:2, b = 5)
Thanks in advance!
The OP seems to want to access a data.frame by a dynamically constructed name.
One option is to use get:
get(paste("df",y,sep = "."))
With y = 1, the get call above returns df.1.
Hence, the function can be modified as:
f1 <- function(x,y) {
  temp_df <- get(paste("df",y,sep = "."))
  x_adj <- data.frame("DID*"= temp_df$`DM`[x], "LDI"= temp_df$`DirectorID*`[-(x)],
                      "LDM"= temp_df$`DM`[-(x)], "IID*"=y)
}
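With thousands of data frames, an alternative to looking each one up with get is to keep them in a named list and index the list by name. The sketch below applies this to the small df.1 example from the question; the helper split_row and the list dfs are hypothetical names, and the column layout is inferred from the desired df.11, df.12, df.13 output.

```r
df.1 <- data.frame(x = 1:3, b = 5)

# Keep the data frames in a named list instead of separate global objects
dfs <- list(df.1 = df.1)

# For row i: keep x[i] as `x`, put the remaining x values in `c`
split_row <- function(d, i) {
  data.frame(x = d$x[i], c = d$x[-i], b = d$b[-i])
}

d <- dfs[["df.1"]]
pieces <- lapply(seq_len(nrow(d)), function(i) split_row(d, i))
names(pieces) <- paste0("df.1", seq_along(pieces))

pieces[["df.12"]]  # x is 2 in both rows, c holds 1 and 3, b is 5
```

Here `x` is stored as an integer because df.1$x is 1:3; wrap it in as.numeric() if a double column is required.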

user defined variables in a function in r

I am trying to write a generic function that constructs a formula for linear regression. I want the function to create the formula either
using user defined variables or,
using all the variables present in the dataframe.
I can create the formula using all the variables present in the dataframe but my problem is when I try to get the user defined variables, I do not know exactly how to get the variables to later use them to create the formula.
The function that I have until now is this:
lmformula <- function (data, IndepVariable = character, VariableList = TRUE){
  if (VariableList) {
    newlist <- list()
    newlist <- # Here is where I do not know exactly what to do to extract the variables defined by the user
    DependVariables <- newlist
    f <- as.formula(paste(IndepVariable, "~", paste((DependVariables), collapse = '+')))
  }else {
    names(data) <- make.names(colnames(data))
    DependVariables <- names(data)[!colnames(data)%in% IndepVariable]
    f <- as.formula(paste(IndepVariable,"~", paste((DependVariables), collapse = '+')))
    return (f)
  }
}
Please any hint will be deeply appreciated
The only thing that changes is how you get the independent variables.
If the user specifies them, use that character vector directly.
Otherwise, you have to take all the variables other than the dependent variable (which you are already doing).
Note: As Roland mentioned, the formula has the form dependentVariable ~ independentVariable1 + independentVariable2 + independentVariable3
# creating mock data
data <- data.frame(col1 = numeric(0), col2 = numeric(0), col3 = numeric(0), col4 = numeric(0))
# the function
lmformula <- function (data, DepVariable, IndepVariable, VariableList = TRUE) {
  if (!VariableList) {
    IndepVariable <- names(data)[!names(data) %in% DepVariable]
  }
  f <- as.formula(paste(DepVariable,"~", paste(IndepVariable, collapse = '+')))
  return (f)
}
# working examples
lmformula(data = data, DepVariable = "col1", VariableList = FALSE)
lmformula(data = data, DepVariable = "col1", IndepVariable = c("col2", "col3"), VariableList = TRUE)
Hope it helps!
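As a variation, the VariableList flag can be dropped by giving the independent variables a NULL default and letting base R's reformulate() assemble the formula. This is only a sketch; lmformula2 and its argument names are made up for illustration and are not part of the answer above.

```r
# Mock data with the same columns as above
data <- data.frame(col1 = numeric(0), col2 = numeric(0),
                   col3 = numeric(0), col4 = numeric(0))

lmformula2 <- function(data, dep, indep = NULL) {
  # No user-supplied variables: fall back to every column except the response
  if (is.null(indep)) {
    indep <- setdiff(names(data), dep)
  }
  reformulate(indep, response = dep)
}

f_all  <- lmformula2(data, dep = "col1")
f_some <- lmformula2(data, dep = "col1", indep = c("col2", "col3"))
f_all
f_some
```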
