How to increase the efficiency of a for loop used to run Stepwise Regressions iteratively - r

All of the code in this question can be found in my GitHub Repository for this research project on Estimated Exhaustive Regression, specifically in the "Both BE & FS script" and "LASSO code" R scripts. You may use the significantly truncated folder of datasets "sample_obs(20)" rather than "spencer", because the former contains only 20 csvs while the latter contains 58.5k!
I am running both a Backward Elimination and a Forward Selection Stepwise Regression on each of N different csv-file-formatted datasets within a file folder, using the following code (once the N datasets have already been loaded):
set.seed(11) # for reproducibility
full_models <- vector("list", length = length(datasets))
BE_fits <- vector("list", length = length(datasets))
head(BE_fits, n = 3) # returns a list with 18 elements, all of which are NULL
set.seed(11) # for reproducibility
for(i in seq_along(datasets)) {
  full_models[[i]] <- lm(formula = Y ~ ., data = datasets[[i]])
  BE_fits[[i]] <- step(object = full_models[[i]],
                       scope = formula(full_models[[i]]),
                       direction = 'backward',
                       trace = 0)
}
And to get the final results I want, I use the following:
BE_Coeffs <- lapply(seq_along(BE_fits), function(i) coef(BE_fits[[i]]))
Models_Selected_by_BE <- lapply(seq_along(BE_fits),
                                \(i) names(coef(BE_fits[[i]])))
And for FS Stepwise, I used:
set.seed(11) # for reproducibility
null_models <- vector("list", length = length(datasets))
FS_fits <- vector("list", length = length(datasets))
head(FS_fits, n = 3) # returns a list with 15 elements, all of which are NULL
set.seed(11) # for reproducibility
for(j in seq_along(datasets)) {
  null_models[[j]] <- lm(formula = Y ~ 1, data = datasets[[j]])
  FS_fits[[j]] <- step(object = null_models[[j]],
                       direction = 'forward',
                       scope = formula(full_models[[j]]),
                       trace = 0)
}
Much of the syntax in this code came from previous questions I asked here several months ago. Now I am rerunning all of my models on a new file folder filled with new randomly generated synthetic datasets, and I don't want to reuse this code, because last time it took well over 12-14 hours for both the BE and the FS stepwise procedures to finish running.
p.s. I already managed to avoid using a loop when I did the same thing for LASSO Regression, my 1st benchmark variable selection procedure, using the following code, which employs a function from R's useful apply family (this only takes 2-3 hours):
set.seed(11) # to ensure replicability
LASSO_fits <- lapply(datasets, function(i)
  elasticnet::enet(x = as.matrix(dplyr::select(i, starts_with("X"))),
                   y = i$Y, lambda = 0, normalize = FALSE))
However, I could not figure out how to replicate something similar for either basic version of Stepwise because of the all-important initialization step required beforehand.
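For reference, here is a minimal sketch (an assumption on my part, not code from the repository) of how the initialization step can be folded into the function passed to an apply-family call, which also makes it straightforward to spread the fits across cores with the parallel package:
library(parallel)
n_cores <- max(1, detectCores() - 1) # leave one core free
set.seed(11) # for reproducibility
BE_fits <- mclapply(datasets, function(d) {
  full <- lm(Y ~ ., data = d) # the initialization step, now inside the function
  step(object = full, scope = formula(full),
       direction = 'backward', trace = 0)
}, mc.cores = n_cores)
FS_fits <- mclapply(datasets, function(d) {
  full <- lm(Y ~ ., data = d) # FS still needs the full model for its scope...
  null <- lm(Y ~ 1, data = d) # ...but starts the search from the null model
  step(object = null, scope = formula(full),
       direction = 'forward', trace = 0)
}, mc.cores = n_cores)
(On Windows, mclapply() runs serially; parLapply() over a makeCluster() cluster is the usual substitute there.)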

Related

Inconsistent results trying to reproduce the variables selected by LASSO run using enet(lambda = 0) with lars(type = "lasso")

The scripts referenced here, and from which the included lines of code came, can all be found at my GitHub Repository for this project, along with mini practice datasets (called ten, top 50, and last 50) so that you can replicate my results.
I am attempting to verify that the variables selected by the n LASSO Regressions I have run on n datasets are valid, i.e. reproducible. In my research project, I have run a LASSO on each of 260k synthetic datasets which were randomly generated via Monte Carlo methods, and have calculated several performance metrics for each of their selections. All of these LASSOs were run via the enet() function from the elasticnet package in R.
Now, I am attempting to reproduce my results using a different function from a different package in R (setting the same random seed beforehand, of course). For this task, I have chosen the lars() function from the lars package, as a second choice, because I had already spent much time trying and failing to do this with the glmnet package, which I asked for help with on Stack Overflow here, but no one answered.
This is the code I used to run my original 260k LASSOs; it can be found in the "LASSO Regressions" script, the "LASSO Regressions (practice version)" script, and also in the "LASSO code, but just the regression part" script.
set.seed(11)
system.time(LASSO_fits <- lapply(X = datasets, function(i)
  elasticnet::enet(x = as.matrix(dplyr::select(i, starts_with("X"))),
                   y = i$Y, lambda = 0, normalize = FALSE)))
Note: the "datasets" object is a list of length n where each element is a data.table/data.frame object.
From there, I isolate and save just the candidate variables (there are 30 for each of the n datasets) selected by LASSO using:
## This stores and prints out all of the regression
## equation specifications selected by LASSO when called.
LASSO_Coeffs <- lapply(LASSO_fits, function(i)
  predict(i, x = as.matrix(dplyr::select(i, starts_with("X"))),
          s = 0.1, mode = "fraction",
          type = "coefficients")[["coefficients"]])
### Write my own custom function which will return a new list containing just the
### Independent Variables which are 'selected' or chosen by LASSO for each individual dataset.
IVs_Selected <- lapply(LASSO_Coeffs, function(i) names(i[i > 0]))
After that, the formulae I use to calculate all of the model performance metrics are the same for this version and the replication version, so I'll skip that and jump to the same code sections in either the "LASSO using Lars (regression part only)" script or the "LASSO using Lars" script, starting with the section which fits the n LASSOs:
set.seed(11) # to ensure replicability
system.time(LASSO.Lars.fits <- lapply(datasets, function(i)
  lars(x = as.matrix(select(i, starts_with("X"))),
       y = i$Y, type = "lasso")))
# This stores and prints out all of the regression
# equation specifications selected by LASSO when called
LASSO.Lars.Coeffs <- lapply(LASSO.Lars.fits, function(i)
  predict(i, x = as.matrix(dplyr::select(i, starts_with("X"))),
          s = 0.1, mode = "fraction",
          type = "coefficients")[["coefficients"]])
IVs.Selected.by.Lars <- lapply(LASSO.Lars.Coeffs, function(i) names(i[i > 0]))
Running both versions on the "top 50" folder included in the GitHub Repo linked above, I got the following performance metrics (enet first, then lars):
> performance_metrics # enet
True.Positive.Rate True.Negative.Rate False.Positive.Rate
             0.987                  1                   0
Underspecified.Models.Selected Correctly.Specified.Models.Selected
                             2                                  48
Overspecified.Models.Selected Models.with.at.least.one.Omitted.Variable
                            0                                         2
Models.with.at.least.one.Extra.Variable
                                      0

> performance_metrics # lars
True.Positive.Rate True.Negative.Rate False.Positive.Rate Underspecified.Models.Selected
                 1                  1                   0                              0
Correctly.Specified.Models.Selected Overspecified.Models.Selected
                                 50                             0
Models.with.at.least.one.Omitted.Variable Models.with.at.least.one.Extra.Variable
                                        0                                       0
Due to an unfortunate artifact of how the datasets are constructed, all of the observations in each of them initially load into R's environment as strings. This makes it computationally infeasible to run any of these scripts on more than 5,000-15,000 of the 260,000 total synthetic datasets at a time. So, I will include two screenshots below showing the slight difference in performance across several sets of 5k or 10k LASSOs run using enet (1st screenshot) vs. replications on the same datasets using lars (2nd screenshot):
Looking through the two functions, I believe there remain 2 arguments that differ between the implementations. In particular, lars has a different error tolerance: its eps argument defaults to 1e-12 rather than .Machine$double.eps. Additionally, the argument normalize is set to FALSE in the enet implementation but is not specified in your lars call, where it defaults to TRUE. After a brief correction to the arguments so that the specifications were identical, I managed to produce the same outputs.
Note: in the example below I have used the same top 50 dataset from your GitHub repo.
setwd("C:\\top 50")
library(tidyverse)
ls_obj = list.files(pattern="*.csv")
datasets = map(ls_obj, ~read_csv(., skip = 2L))
set.seed(11)
system.time(LASSO_fits <- lapply(X = datasets, function(i)
  elasticnet::enet(x = as.matrix(dplyr::select(i, starts_with("X"))),
                   y = i$Y, lambda = 0, normalize = FALSE)))
#> user system elapsed
#> 0.69 0.13 0.87
LASSO_Coeffs <- lapply(LASSO_fits, function(i)
  predict(i, x = as.matrix(dplyr::select(i, starts_with("X"))),
          s = 0.1, mode = "fraction",
          type = "coefficients")[["coefficients"]])
IVs_Selected <- lapply(LASSO_Coeffs, function(i) names(i[i > 0]))
set.seed(11) # to ensure replicability
system.time(LASSO.Lars.fits <- lapply(datasets, function(i)
  lars::lars(x = as.matrix(select(i, starts_with("X"))),
             y = i$Y, type = "lasso", normalize = FALSE,
             eps = .Machine$double.eps)))
#> user system elapsed
#> 0.34 0.00 0.39
# This stores and prints out all of the regression
# equation specifications selected by LASSO when called
LASSO.Lars.Coeffs <- lapply(LASSO.Lars.fits, function(i)
  predict(i, x = as.matrix(dplyr::select(i, starts_with("X"))),
          s = 0.1, mode = "fraction",
          type = "coefficients")[["coefficients"]])
IVs.Selected.by.Lars <- lapply(LASSO.Lars.Coeffs, function(i) names(i[i > 0]))
identical(IVs.Selected.by.Lars, IVs_Selected)
#> [1] TRUE
Created on 2023-02-03 with reprex v2.0.2

How to systematically replicate the results of running n LASSOs on n data sets in R using enet() with lars()

My code used to fit k LASSO Regressions on k csv file-formatted data sets via the enet() function from the elasticnet package is the following:
set.seed(150)
system.time(LASSO <- lapply(datasets, function(J)
  elasticnet::enet(x = as.matrix(dplyr::select(J, starts_with("X"))),
                   y = J$Y, lambda = 0, normalize = FALSE)))
The code to extract the coefficients from those k estimates is:
## This stores and prints out the estimates for all of the regression
## equation specifications selected by LASSO when called.
LASSO_Coeffs <- lapply(LASSO, function(i)
  predict(i, x = as.matrix(dplyr::select(i, starts_with("X"))),
          s = 0.1, mode = "fraction",
          type = "coefficients")[["coefficients"]])
The line of code to isolate and store the names of all the variables with positive coefficient estimates only:
IVs_Selected <- lapply(LASSO_Coeffs, function(i) names(i[i > 0]))
What I want is the syntax required to replicate this process exactly using the lars() function from the lars package (or perhaps some other function, from some other R package I have not heard of, which can estimate a LASSO Regression).
p.s. Here is all of the code I used to load/import the n data sets into R and store them in the 'datasets' list, just in case this added context is of any use:
# these 2 lines together create a simple character list of
# all the file names in the file folder of datasets you created
folderpath <- "C:/Users/Spencer/Documents/EER Project/Data/0.5-5-1-1 to 0.5-6-10-500"
paths_list <- list.files(path = folderpath, full.names = T, recursive = T)
# reformat the names of each of the csv file formatted dataset
DS_names_list <- basename(paths_list)
DS_names_list <- tools::file_path_sans_ext(DS_names_list)
# The code below reads the data into the RStudio Workspace from
# each of the n datasets in an iterative manner, in such a way
# that it assigns each of them to the corresponding name of that
# dataset in the file folder they are stored in.
library(data.table) # for fread() and as.data.table()
system.time( datasets <- lapply(paths_list, fread) )
I used fread because I am loading 5, 10, or 15k datasets at a time here, and they all initially load as characters/strings due to a quirk of their construction.
# change the column names of all the columns in each data.table in 'datasets'
datasets <- lapply(datasets, function(dataset_i) {
  colnames(dataset_i) <- c("Y","X1","X2","X3","X4","X5","X6","X7","X8",
                           "X9","X10","X11","X12","X13","X14","X15",
                           "X16","X17","X18","X19","X20","X21","X22",
                           "X23","X24","X25","X26","X27","X28","X29","X30")
  dataset_i })
# row 1 of each dataset (minus the Y column) flags its structural regressors
Structural_IVs <- lapply(datasets, function(j) { j[1, -1] })
Structural_Variables <- lapply(Structural_IVs, function(i) { names(i)[i == 1] })
# drop the 3 header rows, then coerce every column from character to numeric
datasets <- lapply(datasets, function(i) { i[-1:-3, ] })
datasets <- lapply(datasets, \(X) { lapply(X, as.numeric) })
datasets <- lapply(datasets, function(i) { as.data.table(i) })
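As an aside, the last two steps above can be collapsed into a single data.table-idiomatic pass (a sketch, assuming every remaining column should become numeric):
# convert every column to numeric and keep the result as a data.table
datasets <- lapply(datasets, function(i) i[, lapply(.SD, as.numeric)])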
Going off of the syntax of the code snippets you used in this post, something like this ought to do the trick:
library(lars) # for lars()
set.seed(150) # to ensure replicability
LASSO.Lars.fits <- lapply(X = datasets, function(i)
  lars(x = as.matrix(select(i, starts_with("X"))),
       y = i$Y, type = "lasso"))
However, if you have the time, I would recommend also seeing whether you can replicate your set of selected variables using glmnet as well as lars. That way, if you were to get different variables 'selected' by lars than you did with enet, you could check which of those sets is identical to the corresponding set of optimal variables selected by glmnet. Otherwise, you would have to just assume that the sets selected by enet were all valid (or the opposite).
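To make that concrete, here is a minimal sketch (my assumption, not tested against your repo) of the analogous glmnet workflow. Note that glmnet's s argument is a penalty value (lambda), not the fraction of the L1 norm that enet() and lars() use, so the s = 0.1 below is only an illustrative placeholder, not an equivalent setting:
library(glmnet)
set.seed(150) # to ensure replicability
glmnet_fits <- lapply(datasets, function(i)
  glmnet(x = as.matrix(dplyr::select(i, starts_with("X"))),
         y = i$Y, standardize = FALSE))
# extract the coefficients at the chosen penalty and keep the positive ones
glmnet_Coeffs <- lapply(glmnet_fits, function(fit) coef(fit, s = 0.1))
IVs.Selected.by.glmnet <- lapply(glmnet_Coeffs, function(m)
  rownames(m)[as.vector(m) > 0 & rownames(m) != "(Intercept)"])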

An R function cannot work in local environment of other functions

I use the MatchIt package for propensity score matching. It can generate matched data after matching using the get_matches() function.
However, if I do not run get_matches() in the global environment but instead call it inside another function, the matched data cannot be found in the local environment. (This proved to be misleading; there is nothing wrong with MatchIt's output. The answer by Noah explains my question better.)
To produce my data:
dataGen <- function(b0, b1, n = 2000, cor = 0){
  # covariates
  sigma <- matrix(rep(cor, 9), 3, 3)
  diag(sigma) <- rep(1, 3)
  cov <- MASS::mvrnorm(n, rep(0, 3), sigma)
  # error
  error <- rnorm(n, 0, sqrt(18))
  # treatment variable
  logit <- b0 + b1*cov[,1] + 0.3*cov[,2] + cov[,3]
  p <- 1/(1 + exp(-logit))
  treat <- rbinom(n, 1, p)
  # outcome variable
  y <- error + treat + cov[,1] + cov[,2]
  data <- as.data.frame(cbind(cov, treat, y))
  return(data)
}
set.seed(1)
data <- dataGen(b0 = -0.92, b1 = 0.8, 900)
For example, the following works; est.m.WLS() can use m.data:
fm1 <- treat ~ V1+V2+V3
m.out <- MatchIt::matchit(data = data, formula = fm1, link = "logit", m.order = "random", caliper = 0.2)
m.data <- MatchIt::get_matches(m.out,data=data)
est.m.WLS <- function(m.data, fm2){
  model.1 <- lm(fm2, data = m.data, weights = (weights))
  est <- model.1$coefficients["treat"]
  ## regular robust standard error ignoring pair membership
  model.1.2 <- lmtest::coeftest(model.1, vcov. = sandwich::vcovHC)
  CI.r <- confint(model.1.2, "treat", level = 0.95)
  ## cluster robust standard error accounting for pair membership
  model.2.2 <- lmtest::coeftest(model.1, vcov. = sandwich::vcovCL, cluster = ~subclass)
  CI.cr <- confint(model.2.2, "treat", level = 0.95)
  return(c(est = est, CI.r, CI.cr))
}
fm2 <- y ~ treat+V1+V2+V3
est.m.WLS(m.data,fm2)
But the next syntax does not work. It reports:
"object 'm.data' not found"
rm(m.data)
m.out <- MatchIt::matchit(data = data, formula = fm1, link = "logit", m.order = "random", caliper = 0.2)
est.m.WLS <- function(m.out, fm2){
  m.data <- MatchIt::get_matches(m.out, data = data)
  model.1 <- lm(fm2, data = m.data, weights = (weights))
  est <- model.1$coefficients["treat"]
  ## regular robust standard error ignoring pair membership
  model.1.2 <- lmtest::coeftest(model.1, vcov. = sandwich::vcovHC)
  CI.r <- confint(model.1.2, "treat", level = 0.95)
  ## cluster robust standard error accounting for pair membership
  model.2.2 <- lmtest::coeftest(model.1, vcov. = sandwich::vcovCL, cluster = ~subclass)
  CI.cr <- confint(model.2.2, "treat", level = 0.95)
  return(c(est = est, CI.r, CI.cr))
}
est.m.WLS(m.out,fm2)
Since I want to run parallel loops using the groundhog library for simulation purposes, the get_matches() function also cannot work inside a foreach() %dopar% { ... } environment:
res = foreach(s = 1:7, .combine = "rbind") %dopar% {
  m.out <- MatchIt::matchit(data = data, formula = fm.p, distance = data$logit,
                            m.order = "random", caliper = 0.2)
  m.data <- MatchIt::get_matches(m.out, data = data)
  ...
}
How should I fix the problem?
Any help would be appreciated. Thank you!
Using a for() loop directly does not run into this problem, since it just works in the global environment, but it is too slow... I really hope to do the thousands of simulations at once. Help!
This has nothing to do with MatchIt or get_matches(). Run debugonce(est.m.WLS) with your second implementation of est.m.WLS(). You will see that get_matches() works perfectly fine and returns m.data. The problem occurs when coeftest() runs with a formula argument for cluster.
This is due to a bug in R, outside any package, that I have already requested be fixed. The problem is that expand.model.frame(), the function that searches for the dataset that the variables supplied to cluster could be in, only searches the global environment for data, but m.data does not exist in the global environment. To get around this issue, don't supply a formula to cluster; use cluster = m.data["subclass"]. This should hopefully be resolved in an upcoming R release.
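Concretely, a minimal sketch of the second est.m.WLS() above with only the cluster argument changed:
est.m.WLS <- function(m.out, fm2){
  m.data <- MatchIt::get_matches(m.out, data = data)
  model.1 <- lm(fm2, data = m.data, weights = (weights))
  est <- model.1$coefficients["treat"]
  model.1.2 <- lmtest::coeftest(model.1, vcov. = sandwich::vcovHC)
  CI.r <- confint(model.1.2, "treat", level = 0.95)
  ## pass the cluster variable as a column of the local data frame,
  ## not as a formula, so no global-environment lookup is needed
  model.2.2 <- lmtest::coeftest(model.1, vcov. = sandwich::vcovCL,
                                cluster = m.data["subclass"])
  CI.cr <- confint(model.2.2, "treat", level = 0.95)
  return(c(est = est, CI.r, CI.cr))
}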

Is there a way to store all factors selected from running the same Stepwise regression on each of N datasets using lapply(csvs, FUN(i) { step() })?

My file folder with the N datasets in the form of csv files is called sample_obs. The goal is to end up with two lists. The first, which I have already figured out how to obtain, is a list of the names of each individual csv file, matching the format of their actual names in the folder (not their file paths).
The other list I need to create is a list of the factors/Independent Variables chosen by my Backward Elimination Stepwise Regression function only: no R-squared, Cp, AIC, BIC, or any other standard regression diagnostics, and no coefficient estimates either, just the regressors "chosen" for each dataset out of the 30 candidate regressors.
So far, in terms of my code that actually runs (except the last few lines):
# these 2 lines together create a simple character list of
# all the file names in the file folder of datasets you created
directory_path <- "~/DAEN_698/sample_obs"
file_list <- list.files(path = directory_path, full.names = TRUE, recursive = TRUE)
> head(file_list, n = 2)
[1] "C:/Users/Spencer/Documents/DAEN_698/sample_obs2/0-5-1-1.csv"
[2] "C:/Users/Spencer/Documents/DAEN_698/sample_obs2/0-5-1-2.csv"
# Create another list with just the "n-n-n-n" part of the name of each dataset
library(stringi) # for stri_sub()
DS_name_list = stri_sub(file_list, 49, 55)
> head(DS_name_list, n = 3)
[1] "0-5-1-1" "0-5-1-2" "0-5-1-3"
# This command reads all the data in each of the N csv files via their names
# stored in the 'file_list' list of characters.
csvs <- lapply(file_list, read.csv)
### Step 3: Run a Backward Elimination Stepwise Regression function on each of the N csvs.
# Assign the full models (i.e. ones with all 30 candidate regressors included)
# as the initial model that BE starts out with. This is crucial because if the
# initial model had fewer than the number of candidate factors in the datasets,
# e.g. 25 (so, X1:X25), then it could miss 1 or more of the factors X26:X30
# which ought to be 'chosen' in dataset j by Stepwise j.
full_model <- lapply(csvs, function(i) {
  lm(formula = Y ~ ., data = i) })
Finally, this is the part where I get really tripped up. I have tried at least 6 different sets of arguments, different syntax, and different objects when running my BE Stepwise Regression on my N datasets, but I'll just include 2 of them below, which take entirely different approaches and are both wrong:
# attempt 1
set.seed(50) # for reproducibility
BE_fits1 <- map(.x = full_model[-1], .f = function(i) {
  step(object = all_IVs_models2, direction = 'backward',
       scope = formula(full_model), trace = 0) })
# attempt 3
set.seed(50) # for reproducibility
BE_fits3 <- lapply(full_model, function(i) {
  step(object = i[["coefficients"]], direction = 'backward',
       scope = formula(full_model), trace = 0) })
When I hit Ctrl+Enter on attempt 1, I get the following error message:
Error in x$terms %||% attr(x, "terms") %||% stop("no terms component nor attribute") :
no terms component nor attribute
And when I try to run my code for attempt #3, I get the following different error message:
Error in x$terms : $ operator is invalid for atomic vectors
I don't recognize either of these error messages.
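Both error messages stem from handing step() something other than a fitted model object: attempt 1 passes an unrelated name (all_IVs_models2), and attempt 3 passes the numeric coefficient vector, an atomic vector with no terms attribute. A minimal sketch (my assumption about the fix, not the asker's final code) of a working call, where step() receives the fitted lm object itself and the scope comes from that same model:
set.seed(50) # for reproducibility
BE_fits <- lapply(full_model, function(i) {
  step(object = i, direction = 'backward',
       scope = formula(i), trace = 0) })
# just the names of the regressors chosen for each dataset,
# dropping the intercept; no coefficients or diagnostics
Models_Selected <- lapply(BE_fits, function(fit) names(coef(fit))[-1])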
p.s. If anyone looking over this question would like, I can re-ask it with MUCH less minute detail.

Setting random seeds does not affect classification methods C5.0 and ctree

I want to compare two different classification methods, namely ctree and C5.0 from the libraries party and C50 respectively; the comparison is to test their sensitivity to the initial starting points. The test should be carried out 30 times; each time, the number of wrongly classified items is calculated and stored in a vector, and then a t-test is used to see whether the two methods really differ.
library("foreign"); # for read.arff
library("party") # for ctree
library("C50") # for C5.0
trainTestSplit <- function(data, trainPercentage){
  newData <- list();
  all <- nrow(data);
  splitPoint <- floor(all * trainPercentage);
  newData$train <- data[1:splitPoint, ];
  newData$test <- data[splitPoint:all, ];
  return (newData);
}
ctreeErrorCount <- function(st, ss){
  set.seed(ss);
  model <- ctree(Class ~ ., data = st$train);
  class <- st$test$Class;
  st$test$Class <- NULL;
  pre <- predict(model, newdata = st$test, type = "response");
  errors <- length(which(class != pre)); # counting number of misclassified items
  return(errors);
}
C50ErrorCount <- function(st, ss){
  model <- C5.0(Class ~ ., data = st$train, seed = ss);
  class <- st$test$Class;
  pre <- predict(model, newdata = st$test, type = "class");
  errors <- length(which(class != pre)); # counting number of misclassified items
  return(errors);
}
compare <- function(n = 30){
  data <- read.arff(file.choose());
  set.seed(100);
  errors = list(ctree = c(), c50 = c());
  seeds <- floor(abs(rnorm(n) * 10000));
  for(i in 1:n){
    splitData <- trainTestSplit(data, 0.66);
    errors$ctree[i] <- ctreeErrorCount(splitData, seeds[i]);
    errors$c50[i] <- C50ErrorCount(splitData, seeds[i]);
  }
  cat("\n\n");
  cat("============= ctree Vs C5.0 =================\n");
  cat(paste(errors$ctree, " ", errors$c50, "\n"))
  tt <- t.test(errors$ctree, errors$c50);
  print(tt);
}
The program shown is supposed to do the comparison, but because the number of errors does not change in the vectors, the t.test() function produces an error. I used iris from within R (changing class to Class) and the Wisconsin breast cancer data, which can be downloaded here, to test it, but any data can be used as long as it has a Class attribute.
But I ran into the problem that the results of both methods remain constant and do not change as I change the random seed. Theoretically, as described in their documentation, both functions use random seeds: ctree uses set.seed(x) while C5.0 uses an argument called seed. Unfortunately, I cannot observe any effect.
Could you please tell me how to control the initialization of these functions?
ctree only depends on a random seed in the case where you configure it to use a random selection of input variables (i.e. mtry > 0 within ctree_control). See http://cran.r-project.org/web/packages/party/party.pdf (p. 11).
In regards to C5.0 trees, the seed is used this way:
ctrl = C5.0Control(sample=0.5, seed=ss);
model <- C5.0(Class ~ ., data=st$train, control = ctrl);
Notice that the seed is used to select a sample of the data, not within the algorithm itself. See http://cran.r-project.org/web/packages/C50/C50.pdf (p. 5).
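For completeness, a hypothetical rewrite of ctreeErrorCount() (the mtry value is an arbitrary choice for illustration) that makes ctree seed-sensitive by enabling random input selection:
library(party)
ctreeErrorCountRandom <- function(st, ss){
  set.seed(ss);
  # with mtry > 0, each split considers a random subset of inputs,
  # so different seeds can now produce different trees
  model <- ctree(Class ~ ., data = st$train,
                 controls = ctree_control(mtry = 2));
  class <- st$test$Class;
  st$test$Class <- NULL;
  pre <- predict(model, newdata = st$test, type = "response");
  return(length(which(class != pre)));
}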
