An R function cannot work in local environment of other functions - r

I use Matchit package for propensity score matching. It can generate a matched data after matching using get_matches() function.
However, if I do not run the get_matches() function in the global environment but include it in any other function, the matched data cannot be found in the local environment. (These prove to be misleading information. There is nothing wrong with MatchIt's output. Answer by Noah explains my question better.)
For producing my data
dataGen <- function(b0,b1,n = 2000,cor = 0){
# covariate
sigma <- matrix(rep(cor,9),3,3)
diag(sigma) <- rep(1,3)
cov <- MASS::mvrnorm(n, rep(0,3), sigma)
# error
error <- rnorm(n,0,sqrt(18))
# treatment variable
logit <- b0+b1*cov[,1]+0.3*cov[,2]+cov[,3]
p <- 1/(1+exp(-logit))
treat <- rbinom(n,1,p)
# outcome variable
y <- error+treat+cov[,1]+cov[,2]
data <- as.data.frame(cbind(cov,treat,y))
return(data)
}
set.seed(1)
data <- dataGen(b0=-0.92, b1=0.8, 900)
It is like the following works. The est.m.WLS() can use the m.data.
fm1 <- treat ~ V1+V2+V3
m.out <- MatchIt::matchit(data = data, formula = fm1, link = "logit", m.order = "random", caliper = 0.2)
m.data <- MatchIt::get_matches(m.out,data=data)
est.m.WLS <- function(m.data, fm2){
model.1 <- lm(fm2, data = m.data, weights=(weights))
est <- model.1$coefficients["treat"]
## regular robust standard error ignoring pair membership
model.1.2 <- lmtest::coeftest(model.1,vcov. = sandwich::vcovHC)
CI.r <- confint(model.1.2,"treat",level=0.95)
## cluster robust standard error accounting for pair membership
model.2.2 <- lmtest::coeftest(model.1, vcov. = sandwich::vcovCL, cluster = ~subclass)
CI.cr <- confint(model.2.2,"treat",level=0.95)
return(c(est=est,CI.r,CI.cr))
}
fm2 <- y ~ treat+V1+V2+V3
est.m.WLS(m.data,fm2)
But the next syntax does not work. It will report
"object 'm.data' not found"
rm(m.data)
m.out <- MatchIt::matchit(data = data, formula = fm1, link = "logit", m.order = "random", caliper = 0.2)
est.m.WLS <- function(m.out, fm2){
m.data <- MatchIt::get_matches(m.out,data=data)
model.1 <- lm(fm2, data = m.data, weights=(weights))
est <- model.1$coefficients["treat"]
## regular robust standard error ignoring pair membership
model.1.2 <- lmtest::coeftest(model.1,vcov. = sandwich::vcovHC)
CI.r <- confint(model.1.2,"treat",level=0.95)
## cluster robust standard error accounting for pair membership
model.2.2 <- lmtest::coeftest(model.1, vcov. = sandwich::vcovCL, cluster = ~subclass)
CI.cr <- confint(model.2.2,"treat",level=0.95)
return(c(est=est,CI.r,CI.cr))
}
est.m.WLS(m.out,fm2)
Since I want to run parallel loops using the groundhog library for simulation purpose, the get_matches function also cannot work in foreach()%dopar%{...} environment.
res=foreach(s = 1:7,.combine="rbind")%dopar%{
m.out <- MatchIt::matchit(data = data, formula = fm.p, distance = data$logit, m.order = "random", caliper = 0.2)
m.data <- MatchIt::get_matches(m.out,data=data)
...
}
How should I fix the problem?
Any help would be appreciated. Thank you!
Using for() loop directly will not run into any problem since it just works in the global environment, but it is too slow... I really hope to do the thousand time simulations at once. Help!

This has nothing to do with MatchIt or get_matches(). Run debugonce(est.m.WLS) with your second implementation of est.m.WLS(). You will see that get_matches() works perfectly fine and returns m.data. The problem is when lmtest() runs with a formula argument for cluster.
This is due to a bug in R, outside any package, that I have already requested to be fixed. The problem is that expand.model.matrix(), a function that searches for the dataset that the variables supplied to cluster could be in, only searches the global environment for data, but m.data does not exist in the global environment. To get around this issue, don't supply a formula to cluster; use cluster = m.data["subclass"]. This should hopefully be resolved in an upcoming R release.

Related

How to increase the efficiency of a for loop used to run Stepwise Regressions iteratively

All of the code in this question can be found in my GitHub Repository for this research project on Estimated Exhaustive Regression. Specifically, in the "Both BE & FS script" and "LASSO code" Rscripts, and you may use the significantly truncated file folder of datasets "sample_obs(20)" rather than "spencer" because the former only contains 20 csvs while the latter contains 58.5k!
I am running both a Backward Elimination and a Forward Selection Stepwise Regression on each of N different csv file formatted datasets within a file folder using the following code (once the N datasets have already been loaded):
set.seed(11) # for reproducibility
full_models <- vector("list", length = length(datasets))
BE_fits <- vector("list", length = length(datasets))
head(BE_fits, n = 3) # returns a list with 18 elements, all of which are NULL
set.seed(11) # for reproducibility
for(i in seq_along(datasets)) {
full_models[[i]] <- lm(formula = Y ~ ., data = datasets[[i]])
BE_fits[[i]] <- step(object = full_models[[i]],
scope = formula(full_models[[i]]),
direction = 'backward',
trace = 0) }
And to get the final results I want, I use the following:
BE_Coeffs <- lapply(seq_along(BE_fits), function(i) coef(BE_fits[[i]]))
Models_Selected_by_BE <- lapply(seq_along(BE_fits),
\(i) names(coef(BE_fits[[i]])))
And for FS Stepwise, I used:
set.seed(11) # for reproducibility
FS_fits <- vector("list", length = length(datasets))
head(FS_fits, n = 3) # returns a list with 15 elements, all of which are NULL
set.seed(11) # for reproducibility
for(j in seq_along(datasets)) { null_models[[j]] = lm(formula = Y ~ 1,
data = datasets[[j]])
FS_fits[[j]] = step(object = null_models[[j]],
direction = 'forward',
scope = formula(full_models[[j]]), trace = 0) }
Much of the syntax of this code I got from previous questions I asked here several months ago, but now I am rerunning all of my models on a new file folder filled with new randomly generated synthetic datasets, and I don't want to re-run this using this code because last time, it took WELL OVER 12 or 14 hours for both the BE and the FS stepwise procedures to finish running.
p.s.
I already was able to avoid using a loop when I did the same thing instead for LASSO Regression as my 1st Benchmark Variable Selection Procedure using the following code which employed a function from R's useful apply family (this only takes 2-3 hours):
set.seed(11) # to ensure replicability
LASSO_fits <- lapply(datasets, function(i)
enet(x = as.matrix(select(i, starts_with("X"))),
y = i$Y, lambda = 0, normalize = FALSE))
However, I could not figure out how to replicate something similar for either basic version of Stepwise because of the all important initialization step beforehand.

Caret train function for muliple data frames as function

there has been a similar question to mine 6 years+ ago and it hasn't been solve (R -- Can I apply the train function in caret to a list of data frames?)
This is why I am bringing up this topic again.
I'm writing my own functions for my big R project at the moment and I'm wondering if there is an opportunity to sum up the model training function train() of the pakage caret for different dataframes with different predictors.
My function should look like this:
lda_ex <- function(data, predictor){
model <- train(predictor ~., data,
method = "lda",
trControl = trainControl(method = "none"),
preProc = c("center","scale"))
return(model)
}
Using it afterwards should work like this:
data_iris <- iris
predictor_iris <- "Species"
iris_res <- lda_ex(data = data_iris, predictor = predictor_iris)
Unfortunately the R formula is not able to deal with a variable as input as far as I tried.
Is there something I am missing?
Thank you in advance for helping me out!
Solving this would help me A LOT to keep my function sheet clean and safe work for sure.
By writing predictor_iris <- "Species", you are basically saving a string object in predictor_iris. Thus, when you run lda_ex, I guess you incur in some error concerning the formula object in train(), since you are trying to predict a string using vectors of covariates.
Indeed, I tried the following toy example:
X = rnorm(1000)
Y = runif(1000)
predictor = "Y"
lm(predictor ~ X)
which gives an error about differences in the lengths of variables.
Let me modify your function:
lda_ex <- function(data, formula){
model <- train(formula, data,
method = "lda",
trControl = trainControl(method = "none"),
preProc = c("center","scale"))
return(model)
}
The key difference is that now we must pass in the whole formula, instead of the predictor only. In that way, we avoid the string-related problem.
library(caret) # Recall to specify the packages needed to reproduce your examples!
data_iris <- iris
formula_iris = Species ~ . # Key difference!
iris_res <- lda_ex(data = data_iris, formula = formula_iris)

Pasting object names inside functions

This is a follow-up question to this (see data and previous commands).
Starting with a list of models in mods, i am now able to find the model with the least AIC (corresponds to the best model):
mods <- lapply(methods, function(m)
update(amod.null, correlation = getFunction(m)(1, form = ~ x + y), method="ML"))
names(mods) <- methods
list.AIC <- lapply(mods, function(x) AIC(x))
best.mod <- names(which.min(list.AIC))
Now, i need to do some testing on the model, e.g. Tukey between dates. The syntax is very simple, e.g. for amod.null
library(multcomp)
res <- glht(amod.null, mcp(Date = "Tukey"))
The tricky part is, how can i tell glht to use the model which was put into best.mod (note: this is all happening within a loop). I tried
res <- glht(paste("mods$", as.factor(best.mod),sep = "") , mcp(Date = "Tukey"))
but to no avail, as glht needs to find a model-object in the first argument.
/edit:
Possibly useful:
names(mods)
[1] "corExp" "corGaus" "corLin" "corRatio" "corSpher"
Since the models are stored in the list mods, you can access the "best model" by using the index of which.min(list.AIC):
list.AIC <- sapply(mods, AIC)
best.mod <- mods[which.min(list.AIC)]
best.mod[[1]]

R Harmonic Prediction Failing - newdata structure

I am forecasting a time series using harmonic regression created as such:
(Packages used: tseries, forecast, TSA, plyr)
airp <- AirPassengers
TIME <- 1:length(airp)
SIN <- COS <- matrix(nrow = length(TIME), ncol = 6,0)
for (i in 1:6){
SIN[,i] <- sin(2*pi*i*TIME/12)
COS[,i] <- cos(2*pi*i*TIME/12)
}
SIN <- SIN[,-6]
decomp.seasonal <- decompose(airp)$seasonal
seasonalfit <- lm(airp ~ SIN + COS)
The fitting works just fine. The problem occurs when forecasting.
TIME.NEW <- seq(length(TIME)+1, length(TIME)+12, by=1)
SINNEW <- COSNEW <- matrix(nrow=length(TIME.NEW), ncol = 6, 0)
for (i in 1:6) {
SINNEW[,i] <- sin(2*pi*i*TIME.NEW/12)
COSNEW[,i] <- cos(2*pi*i*TIME.NEW/12)
}
SINNEW <- SINNEW[,-6]
prediction.harmonic.dataframe <- data.frame(TIME = TIME.NEW, SIN = SINNEW, COS = COSNEW)
seasonal.predictions <- predict(seasonalfit, newdata = prediction.harmonic.dataframe)
This causes the warning:
Warning message:
'newdata' had 12 rows but variables found have 144 rows
I went through and found that the names were SIN.1, SIN.2, et cetera, instead of SIN1 and SIN2... So I manually changed those and it still didn't work. I also manually removed the SIN.6 because it, for some reason, was still there.
Help?
Edit: I have gone through the similar posts as well, and the answers in those questions did not fix my problem.
Trying to predict with a data.frame after fitting an lm model with variables not inside a data.frame (especially matrices) is not fun. It's better if you always fit your model from data in a data.frame.
For example if you did
seasonalfit <- lm(airp ~ ., data.frame(airp=airp,SIN=SIN,COS=COS))
Then your predict would work.
Alternatively you can try to cram matrices into data.frames but this is generally a bad idea. You would do
prediction.harmonic.dataframe <- data.frame(TIME = TIME.NEW,
SIN = I(SINNEW), COS = I(COSNEW))
The I() (or AsIs function) will keep them as matrices.

Object not found error when passing model formula to another function

I have a weird problem with R that I can't seem to work out.
I've tried to write a function that performs K-fold cross validation for a model chosen by the stepwise procedure in R. (I'm aware of the issues with stepwise procedures, it's purely for comparison purposes) :)
Now the issue is, that if I define the function parameters (linmod,k,direction) and run the contents of the function, it works flawlessly. BUT, if I run it as a function, I get an error saying the datas.train object can't be found.
I've tried stepping through the function with debug() and the object clearly exists, but R says it doesn't when I actually run the function. If I just fit a model using lm() it works fine, so I believe it's a problem with the step function in the loop, while inside a function. (try commenting out the step command, and set the predictions to those from the ordinary linear model.)
#CREATE A LINEAR MODEL TO TEST FUNCTION
lm.cars <- lm(mpg~.,data=mtcars,x=TRUE,y=TRUE)
#THE FUNCTION
cv.step <- function(linmod,k=10,direction="both"){
response <- linmod$y
dmatrix <- linmod$x
n <- length(response)
datas <- linmod$model
form <- formula(linmod$call)
# generate indices for cross validation
rar <- n/k
xval.idx <- list()
s <- sample(1:n, n) # permutation of 1:n
for (i in 1:k) {
xval.idx[[i]] <- s[(ceiling(rar*(i-1))+1):(ceiling(rar*i))]
}
#error calculation
errors <- R2 <- 0
for (j in 1:k){
datas.test <- datas[xval.idx[[j]],]
datas.train <- datas[-xval.idx[[j]],]
test.idx <- xval.idx[[j]]
#THE MODELS+
lm.1 <- lm(form,data= datas.train)
lm.step <- step(lm.1,direction=direction,trace=0)
step.pred <- predict(lm.step,newdata= datas.test)
step.error <- sum((step.pred-response[test.idx])^2)
errors[j] <- step.error/length(response[test.idx])
SS.tot <- sum((response[test.idx] - mean(response[test.idx]))^2)
R2[j] <- 1 - step.error/SS.tot
}
CVerror <- sum(errors)/k
CV.R2 <- sum(R2)/k
res <- list()
res$CV.error <- CVerror
res$CV.R2 <- CV.R2
return(res)
}
#TESTING OUT THE FUNCTION
cv.step(lm.cars)
Any thoughts?
When you created your formula, lm.cars, in was assigned its own environment. This environment stays with the formula unless you explicitly change it. So when you extract the formula with the formula function, the original environment of the model is included.
I don't know if I'm using the correct terminology here, but I think you need to explicitly change the environment for the formula inside your function:
cv.step <- function(linmod,k=10,direction="both"){
response <- linmod$y
dmatrix <- linmod$x
n <- length(response)
datas <- linmod$model
.env <- environment() ## identify the environment of cv.step
## extract the formula in the environment of cv.step
form <- as.formula(linmod$call, env = .env)
## The rest of your function follows
Another problem that can cause this is if one passes a character (string vector) to lm instead of a formula. vectors have no environment, and so when lm converts the character to a formula, it apparently also has no environment instead of being automatically assigned the local environment. If one then uses an object as weights that is not in the data argument data.frame, but is in the local function argument, one gets a not found error. This behavior is not very easy to understand. It is probably a bug.
Here's a minimal reproducible example. This function takes a data.frame, two variable names and a vector of weights to use.
residualizer = function(data, x, y, wtds) {
#the formula to use
f = "x ~ y"
#residualize
resid(lm(formula = f, data = data, weights = wtds))
}
residualizer2 = function(data, x, y, wtds) {
#the formula to use
f = as.formula("x ~ y")
#residualize
resid(lm(formula = f, data = data, weights = wtds))
}
d_example = data.frame(x = rnorm(10), y = rnorm(10))
weightsvar = runif(10)
And test:
> residualizer(data = d_example, x = "x", y = "y", wtds = weightsvar)
Error in eval(expr, envir, enclos) : object 'wtds' not found
> residualizer2(data = d_example, x = "x", y = "y", wtds = weightsvar)
1 2 3 4 5 6 7 8 9 10
0.8986584 -1.1218003 0.6215950 -0.1106144 0.1042559 0.9997725 -1.1634717 0.4540855 -0.4207622 -0.8774290
It is a very subtle bug. If one goes into the function environment with browser, one can see the weights vector just fine, but it somehow is not found in the lm call!
The bug becomes even harder to debug if one used the name weights for the weights variable. In this case, since lm can't find the weights object, it defaults to the function weights() from base thus throwing an even stranger error:
Error in model.frame.default(formula = f, data = data, weights = weights, :
invalid type (closure) for variable '(weights)'
Don't ask me how many hours it took me to figure this out.

Resources