Is there a way to store all factors selected from running the same Stepwise Regression on each of N datasets using lapply(csvs, FUN(i) { step() })?

My folder with the N datasets in the form of csv files is called sample_obs. The goal is to end up with two lists. One of them I have already figured out how to obtain: a list of the names of each individual csv file, matching the format of their actual names in the folder they are all in, not their file paths.
The other list I need to create is a list of the factors/independent variables chosen by my Backward Elimination Stepwise Regression function only: no R-squared, Cp, AIC, BIC, or any other standard regression diagnostics, and no coefficient estimates either, just the regressors "chosen" for each dataset out of the 30 candidate regressors.
So far, this is all of my code that actually runs (except the last few lines):
# these 2 lines together create a simple character vector of
# all the file names in the folder of datasets
directory_path <- "~/DAEN_698/sample_obs"
file_list <- list.files(path = directory_path, full.names = TRUE, recursive = TRUE)
> head(file_list, n = 2)
[1] "C:/Users/Spencer/Documents/DAEN_698/sample_obs2/0-5-1-1.csv"
[2] "C:/Users/Spencer/Documents/DAEN_698/sample_obs2/0-5-1-2.csv"
# Create another list with just the "n-n-n-n" part of the name of each dataset
library(stringi) # stri_sub() comes from the stringi package
DS_name_list = stri_sub(file_list, 49, 55)
> head(DS_name_list, n = 3)
[1] "0-5-1-1" "0-5-1-2" "0-5-1-3"
# This command reads the data in each of the N csv files via their
# file paths stored in the 'file_list' character vector.
csvs <- lapply(file_list, read.csv)
### Step 3: Run a Backward Elimination Stepwise Regression function on each of the N csvs.
# Assign the full models (meaning the ones with all 30 candidate regressors
# included in step 1) as the initial models that BE starts out with.
# This is crucial because if the initial model had fewer than the number of
# candidate factors in the datasets, e.g. 25 (so, X1:X25), then it could miss
# 1 or more of the factors X26:X30 which ought to be 'chosen' in dataset j by Stepwise j.
full_model <- lapply(csvs, function(i) {
  lm(formula = Y ~ ., data = i) })
Finally, this is the part where I get really tripped up. I have tried at least 6 different sets of arguments, different syntax, different objects, etc. when running my BE Stepwise Regression on my N datasets, but I'll just include 2 of them below; they take entirely different approaches, and both are wrong:
# attempt 1
set.seed(50) # for reproducibility
BE_fits1 <- map(.x = full_model[-1], .f = function(i) {
  step(object = all_IVs_models2, direction = 'backward',
       scope = formula(full_model), trace = 0) })
# attempt 3
set.seed(50) # for reproducibility
BE_fits3 <- lapply(full_model, function(i) {
  step(object = i[["coefficients"]], direction = 'backward',
       scope = formula(full_model), trace = 0) })
When I hit Ctrl+Enter on attempt 1, I get the following error message:
Error in x$terms %||% attr(x, "terms") %||% stop("no terms component nor attribute") :
no terms component nor attribute
And when I try to run my code for attempt #3, I get the following different error message:
Error in x$terms : $ operator is invalid for atomic vectors
I don't recognize either of these error messages.
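For reference, both messages point at the same underlying problem: step() expects its object argument to be a fitted lm, but attempt 1 hands it all_IVs_models2 (an object with no terms component) and attempt 3 hands it a plain numeric coefficient vector, hence the complaint about the $ operator on an atomic vector. Likewise, the scope should come from each individual model, whereas formula(full_model) applies formula() to the whole list. A minimal lapply sketch of what I believe the corrected call looks like, assuming every element of full_model is a fitted lm:
# A sketch, not a tested solution: pass each fitted lm itself to step()
# and take the scope from that same model.
BE_fits <- lapply(full_model, function(m) {
  step(object = m, scope = formula(m),
       direction = 'backward', trace = 0) })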
p.s. If anyone looking over this question would like, I can re-ask this question with MUCH less minute detail.

Related

How to increase the efficiency of a for loop used to run Stepwise Regressions iteratively

All of the code in this question can be found in my GitHub Repository for this research project on Estimated Exhaustive Regression, specifically in the "Both BE & FS script" and "LASSO code" Rscripts. You may use the significantly truncated folder of datasets "sample_obs(20)" rather than "spencer", because the former contains only 20 csvs while the latter contains 58.5k!
I am running both a Backward Elimination and a Forward Selection Stepwise Regression on each of N different csv file formatted datasets within a file folder using the following code (once the N datasets have already been loaded):
set.seed(11) # for reproducibility
full_models <- vector("list", length = length(datasets))
BE_fits <- vector("list", length = length(datasets))
head(BE_fits, n = 3) # BE_fits is a list with 18 elements, all of which are NULL at this point
set.seed(11) # for reproducibility
for(i in seq_along(datasets)) {
  full_models[[i]] <- lm(formula = Y ~ ., data = datasets[[i]])
  BE_fits[[i]] <- step(object = full_models[[i]],
                       scope = formula(full_models[[i]]),
                       direction = 'backward',
                       trace = 0) }
And to get the final results I want, I use the following:
BE_Coeffs <- lapply(seq_along(BE_fits), function(i) coef(BE_fits[[i]]))
Models_Selected_by_BE <- lapply(seq_along(BE_fits),
\(i) names(coef(BE_fits[[i]])))
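One extra detail if the end product is strictly the list of chosen regressors: names(coef(...)) also includes "(Intercept)". A small addition (using base R's setdiff) drops it:
# Keep only the selected regressors, with the intercept removed
IVs_Selected_by_BE <- lapply(BE_fits, function(fit)
  setdiff(names(coef(fit)), "(Intercept)"))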
And for FS Stepwise, I used:
set.seed(11) # for reproducibility
null_models <- vector("list", length = length(datasets))
FS_fits <- vector("list", length = length(datasets))
head(FS_fits, n = 3) # FS_fits is a list with 15 elements, all of which are NULL at this point
set.seed(11) # for reproducibility
for(j in seq_along(datasets)) {
  null_models[[j]] = lm(formula = Y ~ 1, data = datasets[[j]])
  FS_fits[[j]] = step(object = null_models[[j]],
                      direction = 'forward',
                      scope = formula(full_models[[j]]), trace = 0) }
Much of the syntax of this code comes from previous questions I asked here several months ago, but now I am rerunning all of my models on a new folder filled with new randomly generated synthetic datasets, and I don't want to re-run them with this code because last time it took WELL OVER 12 or 14 hours for both the BE and the FS stepwise procedures to finish running.
p.s.
I already managed to avoid a loop when I did the same thing for LASSO Regression, my 1st Benchmark Variable Selection Procedure, using the following code, which employs a function from R's useful apply family (this only takes 2-3 hours):
set.seed(11) # to ensure replicability
library(elasticnet) # for enet()
library(dplyr)      # for select() and starts_with()
LASSO_fits <- lapply(datasets, function(i)
  enet(x = as.matrix(select(i, starts_with("X"))),
       y = i$Y, lambda = 0, normalize = FALSE))
However, I could not figure out how to replicate something similar for either basic version of Stepwise because of the all-important initialization step beforehand.
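That initialization can simply live inside the function each worker calls, and since the datasets are independent, the work can be spread over cores with the parallel package. This is an untested sketch (the run_BE name is illustrative, and it assumes each dataset has a Y column); step() itself is unchanged, so the best-case speedup is roughly the number of cores. On Windows a PSOCK cluster is required; on Linux/macOS, mclapply() would be simpler.
library(parallel)
# Wrap the BE initialization plus the stepwise search in one function
run_BE <- function(df) {
  full <- lm(Y ~ ., data = df)
  step(full, scope = formula(full), direction = "backward", trace = 0)
}
cl <- makeCluster(detectCores() - 1) # leave one core free
BE_fits <- parLapply(cl, datasets, run_BE)
stopCluster(cl)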

Where to put a sample(, J) function in R (inside or outside?) a regsubsets() which is the FUN in an lapply() so it only runs J possible models?

The end goal here is to run a random sample (without replacement) of J different possible regression models, rather than all 2^k - 1 possible models as in a traditional All Subsets Regression aka Best Subset Regression (also sometimes called Exhaustive Regression), on each of I different csv file formatted datasets all located within the same file folder.
Here is my code (it is in my GitHub Repository for this project, it is called 'EER script'):
# Load all libraries needed for this script.
# The library specifically needed to run a basic ASR is the 'leaps' library.
library(dplyr)
library(tidyverse)
library(stats)
library(ggplot2)
library(lattice)
library(caret)
library(leaps)
library(purrr)
directory_path <- "~/DAEN_698/sample obs"
filepath_list <- list.files(path = directory_path, full.names = TRUE, recursive = TRUE)
# reformat the names of each of the csv file formatted datasets
DS_names_list <- basename(filepath_list)
DS_names_list <- tools::file_path_sans_ext(DS_names_list)
datasets <- lapply(filepath_list, read.csv)
# code to run a normal All Subsets Regression
ASR_fits <- lapply(datasets, function(i)
regsubsets(x = as.matrix(select(i, starts_with("X"))),
y = i$Y, data = i, nvmax = 15,
intercept = TRUE, method = "exhaustive"))
ASR_fits_summary <- summary(ASR_fits)
This is the part where I am completely stuck. I got the above to run, and the ASR_fits_summary object is a list with I elements, all of the class 'regsubsets', which is exactly what I was hoping for. But that is still just a list of the estimates made by a traditional ASR; what I need to figure out is where I should insert a sample(, J) function within this lapply function so that each regsubsets() chooses the optimal model out of just J randomly selected models from the 2^k - 1 possible models, to make it computationally feasible.
I am guessing I will have to either nest another lapply within my current lapply function, or write a simple custom function that takes J random samples without replacement, but I just don't know at which step to put it.
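As far as I know, regsubsets() exposes no argument for evaluating only a random subset of the candidate models, so the sample(, J) call cannot go inside it directly. One workaround sketch, under that assumption (J and the BIC criterion here are arbitrary illustrative choices): draw J random regressor subsets per dataset, fit each with lm(), and keep the best one.
set.seed(11) # for reproducibility
J <- 1000 # number of randomly sampled candidate models (illustrative)
sampled_best <- lapply(datasets, function(df) {
  regressors <- grep("^X", names(df), value = TRUE)
  # J random subsets, each of a random size up to nvmax = 15
  candidates <- replicate(J, sample(regressors, sample(1:15, 1)),
                          simplify = FALSE)
  fits <- lapply(candidates, function(vars)
    lm(reformulate(vars, response = "Y"), data = df))
  fits[[which.min(vapply(fits, BIC, numeric(1)))]] # lowest BIC wins
})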

How to input matrix data into brms formula?

I am trying to input matrix data into the brm() function to run a signal regression. brm is from the brms package, which provides an interface to fit Bayesian models using Stan. Signal regression is when you model one covariate using another within the bigger model, and you use the by parameter like this: model <- brm(response ~ s(matrix1, by = matrix2) + ..., data = Data). The problem is, I cannot input my matrices using the 'data' parameter because it only allows one data.frame object to be inputted.
Here is my code and the errors I obtained while trying to get around that constraint.
First off, my reproducible code leading up to the model-building:
library(brms)
# 100 rows, 4 columns. Each cell contains a number between 1 and 10
Data <- data.frame(runif(100, 1, 10), runif(100, 1, 10),
                   runif(100, 1, 10), runif(100, 1, 10))
# Assign names to the columns
names(Data) <- c("d0_10", "d0_100", "d0_1000", "d0_10000")
Data$Density <- as.matrix(Data) %*% c(-1, 10, 5, 1)
# the coefficients we are modelling
d <- c(-1, 10, 5, 1)
# Make a matrix with 4 columns holding the evaluation points 10, 100, 1000, 10000;
# rows are repeats of the same column values
Bins <- 10^matrix(rep(1:4, times = dim(Data)[1]), ncol = 4, byrow = TRUE)
Bins
As mentioned above, since 'data' only allows one data.frame object to be inputted, I've tried other ways of inputting my matrix data. These methods include:
1) making the matrix within the brm() function using as.matrix()
signalregression.brms <- brm(
  Density ~ s(Bins, by = as.matrix(Data[, c("d0_10", "d0_100", "d0_1000", "d0_10000")])),
  data = Data)
#Error in is(sexpr, "try-error") :
argument "sexpr" is missing, with no default
2) making the matrix outside the formula, storing it in a variable, then calling that variable inside the brm() function
Donuts <- as.matrix(Data[, c("d0_10", "d0_100", "d0_1000", "d0_10000")])
signalregression.brms <- brm(Density ~ s(Bins, by = Donuts), data = Data)
#Error: The following variables can neither be found in 'data' nor in 'data2':
'Bins', 'Donuts'
3) inputting a list containing the matrix using the 'data2' parameter
signalregression.brms <- brm(
  Density ~ s(Bins, by = donuts),
  data = Data,
  data2 = list(Bins = 10^matrix(rep(1:4, times = dim(Data)[1]), ncol = 4, byrow = TRUE),
               donuts = as.matrix(Data[, c("d0_10", "d0_100", "d0_1000", "d0_10000")])))
#Error in names(dat) <- object$term :
'names' attribute [1] must be the same length as the vector [0]
None of the above worked; each had its own errors, and troubleshooting them was difficult because I couldn't find answers or examples online of a similar nature in the context of brms.
I was able to use the above techniques just fine with gam() from the mgcv package: you don't have to define a data.frame via 'data', you can call variables defined outside of the gam() formula, and you can make matrices inside the gam() call itself. See below:
library(mgcv)
signalregression2 <- gam(Data$Density ~ s(Bins,by = as.matrix(Data[,c("d0_10","d0_100","d0_1000","d0_10000")]),k=3))
#Works!
It seems like brms is less flexible... :(
My question: does anyone have any suggestions on how to make my brm() function run?
Thank you very much!
My understanding of signal regression is limited enough that I'm not convinced this is correct, but I think it's at least a step in the right direction. The problem seems to be that brm() expects everything in its formula to be a column in data. So we can get the model to compile by ensuring all the things we want are present in data:
library(tidyverse)
signalregression.brms = brm(Density ~
                              s(cbind(d0_10_bin, d0_100_bin, d0_1000_bin, d0_10000_bin),
                                by = cbind(d0_10, d0_100, d0_1000, d0_10000),
                                k = 3),
                            data = Data %>%
                              mutate(d0_10_bin = 10,
                                     d0_100_bin = 100,
                                     d0_1000_bin = 1000,
                                     d0_10000_bin = 10000))
Writing out each column by hand is a little annoying; I'm sure there are more general solutions.
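One slightly more general way to build those columns, sketched under the assumption that the bin value for each d0_* column is exactly the number embedded in its name:
bin_cols <- c("d0_10", "d0_100", "d0_1000", "d0_10000")
bin_vals <- as.numeric(sub("^d0_", "", bin_cols)) # 10, 100, 1000, 10000
# each scalar is recycled to nrow(Data) when assigned as a new column
Data[paste0(bin_cols, "_bin")] <- as.list(bin_vals)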
For reference, here are my installed package versions:
map_chr(unname(unlist(pacman::p_depends(brms)[c("Depends", "Imports")])), ~ paste(., ": ", pacman::p_version(.), sep = ""))
[1] "Rcpp: 1.0.6" "methods: 4.0.3" "rstan: 2.21.2" "ggplot2: 3.3.3"
[5] "loo: 2.4.1" "Matrix: 1.2.18" "mgcv: 1.8.33" "rstantools: 2.1.1"
[9] "bayesplot: 1.8.0" "shinystan: 2.5.0" "projpred: 2.0.2" "bridgesampling: 1.1.2"
[13] "glue: 1.4.2" "future: 1.21.0" "matrixStats: 0.58.0" "nleqslv: 3.3.2"
[17] "nlme: 3.1.149" "coda: 0.19.4" "abind: 1.4.5" "stats: 4.0.3"
[21] "utils: 4.0.3" "parallel: 4.0.3" "grDevices: 4.0.3" "backports: 1.2.1"

Confused with output differences between two simple loops

I am confused as to why the two pieces of code below return different results.
The only difference between the loops is the use of Wage$age.cut1 vs. age.cut1. What is the significance of that difference?
DATA: ISLR package, Wage data
library(ISLR) # for the Wage data
library(boot) # for cv.glm()
cv.err <- rep(NA, 10)
for (i in 2:10){
  Wage$age.cut1 = cut(Wage$age, i)
  fit = glm(wage ~ age.cut1, data = Wage)
  cv.err[i] = cv.glm(Wage, fit, K = 10)$delta[2]
}
> cv.err
[1] NA 1733.646 1681.587 1636.521 1632.931 1623.112 1611.965 1600.903 1609.973
[10] 1607.234 # these are the expected results
cv.err <- rep(NA, 10)
for (i in 2:10){
  age.cut1 = cut(Wage$age, i)
  fit = glm(wage ~ age.cut1, data = Wage)
  cv.err[i] = cv.glm(Wage, fit, K = 10)$delta[2]
}
> cv.err
[1] NA 1603.255 1608.617 1602.296 1606.265 1606.139 1602.448 1606.063 1605.100
[10] 1606.986
Yes, the difference between those two makes a great difference in your loop logic. In the first loop, age.cut1 is a column in the Wage data frame, as evidenced by the $ qualifier, and that column is what the glm formula uses. In the second loop, age.cut1 is a standalone, separately named vector that goes unused in the glm formula. Whenever a formula is used, the variables derive first from the object referenced in the data argument.
Though I am unfamiliar with the listed packages and data structures, the age.cut1 column most likely already existed in the Wage data frame prior to looping (since its reference in the second loop's glm call raised no error). However, it is only updated with cut(Wage$age, i) in the first loop. Although a similarly named standalone object is assigned in the second loop, the original column data remains untouched, so glm keeps refitting against the same stale column, which is why the second loop's cv errors barely change with i.
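The precedence rule is easy to see on toy data (a hypothetical illustration, not from the question):
df <- data.frame(y = c(1, 2, 3, 4), x = c(1, 2, 3, 4))
x <- c(4, 3, 2, 1)              # standalone vector with the same name
coef(lm(y ~ x, data = df))["x"] #  1: the column in 'data' wins
coef(lm(df$y ~ x))["x"]         # -1: no matching column in 'data', so
                                #     the standalone vector is used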

All NaNs in RMA normalization of GSE31312 using Brainarray custom CDFs

I'm trying to RMA normalize a particular gene expression dataset concerning diffuse large B-cell lymphoma using custom gene-level annotation CDF (chip definition file) files from Brainarray.
Unfortunately, the RMA normalized expression matrix is all NaNs, and I don't understand why.
The dataset (GSE31312) is freely available at the GEO website and uses the Affymetrix HG-U133 Plus 2.0 array platform. I'm using the affy package to perform the RMA normalization.
Since the problem is specific to the dataset, the following R code to reproduce the problem is unfortunately quite cumbersome (2 GB download, 8.8 GB unpacked).
Set the working directory.
setwd("~/Desktop/GEO")
Load the needed packages. Uncomment to install the packages.
#source("http://bioconductor.org/biocLite.R")
#biocLite(pkgs = c("GEOquery", "affy", "AnnotationDbi", "R.utils"))
library("GEOquery") # To automatically download the data
library("affy")
library("R.utils") # For file handling
Download the array data to the work dir.
files <- getGEOSuppFiles("GSE31312")
Untar the data into a dir called CEL.
#Sys.setenv(TAR = '/usr/bin/tar') # For (some) OS X uncommment this line
untar(tarfile = "GSE31312/GSE31312_RAW.tar", exdir = "CEL")
Unzip all the .gz files.
gz.files <- list.files("CEL", pattern = "\\.gz$",
ignore.case = TRUE, full.names = TRUE)
for (file in gz.files)
gunzip(file, skip = TRUE, remove = TRUE)
cel.files <- list.files("CEL", pattern = "\\.cel$",
ignore.case = TRUE, full.names = TRUE)
Download, install, and load the custom Brainarray Ensembl ENSG gene annotation package
download.file(paste0("http://brainarray.mbni.med.umich.edu/Brainarray/",
"Database/CustomCDF/18.0.0/ensg.download/",
"hgu133plus2hsensgcdf_18.0.0.tar.gz"),
destfile = "hgu133plus2hsensgcdf_18.0.0.tar.gz")
install.packages("hgu133plus2hsensgcdf_18.0.0.tar.gz",
repos = NULL, type = "source")
library(hgu133plus2hsensgcdf)
Perform the RMA normalization with and without the custom CDF.
affy.rma <- justRMA(filenames = cel.files, verbose = TRUE)
ensg.rma <- justRMA(filenames = cel.files, verbose = TRUE,
cdfname = "HGU133Plus2_Hs_ENSG")
As can be seen, the function returns without warning an expression set whose expression matrix exprs(ensg.rma) is all NaNs.
sum(is.nan(exprs(ensg.rma))) # == prod(dim(ensg.rma)) == 9964482
Interestingly, there are quite a few NaNs in exprs(affy.rma) when using the standard CDF.
# Show some NaNs
na.rows <- apply(is.na(exprs(affy.rma)), 1, any)
exprs(affy.rma)[which(na.rows)[1:10], 1:4]
# A particularly bad probeset:
exprs(affy.rma)["1553575_at", ]
# There are relatively few NaNs in total (but there really should be none)
sum(is.na(exprs(affy.rma))) # == 12305
# Probesets with all NaNs
sum(apply(is.na(exprs(affy.rma)), 1, all))
There really should be no NaNs at all. I've tried using the expresso function to perform background correction only (with no normalization and summarization), which also yields NaNs, so the problem appears to stem from the background correction. However, one might worry that one or more bad arrays are the cause. Can anybody help me track down the origin of the NaNs and get some useful expression values?
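For concreteness, the background-correction-only run I mean is along these lines (a sketch; the method names are standard options in the affy package):
# Background-correct only, with trivial PM correction and summarization,
# to isolate the step where the NaNs first appear
abatch <- read.affybatch(filenames = cel.files, cdfname = "HGU133Plus2_Hs_ENSG")
bg.only <- expresso(abatch, bg.correct = TRUE, bgcorrect.method = "rma",
                    normalize = FALSE, pmcorrect.method = "pmonly",
                    summary.method = "avgdiff")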
Thanks, and best regards
Anders
Edit: A single file appears to be the issue (but not quite)
I decided to check what happens if each .CEL file is normalized independently. I'm not sure what actually happens under the RMA hood when justRMA is given a single array, but I imagine the quantile normalization step becomes the identity function and the summarization (median polish) simply stops after the first iteration.
Anyway, to perform the one-by-one RMA normalization we run:
ensg <- vector("list", length(cel.files)) # Initialize a list
names(ensg) <- basename(cel.files)
for (file in cel.files) {
ensg[[basename(file)]] <- justRMA(filenames = file, verbose = TRUE,
cdfname = "HGU133Plus2_Hs_ENSG")
cat("File", which(file == cel.files), "done.\n\n")
}
# Extract the expression matrices for each file and combine them
X <- as.data.frame(do.call(cbind, lapply(ensg, exprs)))
By looking at head(X) it appears that GSM776462.CEL is all NaNs. Indeed, it almost is:
sum(!is.nan(X$GSM776462.CEL)) # == 18
Only 18 of 20009 values are not NaN. Next, I count the number of NaNs appearing elsewhere:
sum(is.na(X[, -grep("GSM776462.CEL", colnames(X))])) # == 0
which is 0. Hence GSM776462.CEL appears to be the culprit.
But the regular CDF annotation does not give any problems:
affy <- justRMA(filenames = "CEL/GSM776462.CEL")
any(is.nan(exprs(affy))) # == FALSE
It is also weird that the regular CDF gives seemingly randomly scattered NaNs in the expression matrix. I still don't quite know what to make of this.
Edit2: NaNs vanish when excluding sample but not with standard CDF
For what it is worth, when I exclude the GSM776462.CEL file and RMA normalize with and without the custom CDF files, the NaNs disappear only in the custom CDF case.
# Trying with all other CEL than GSM776462.CEL
cel.files2 <- grep("GSM776462.CEL", cel.files, invert = TRUE, value = TRUE)
affy.rma.no.776462 <- justRMA(filenames = cel.files2)
ensg.rma.no.776462 <- justRMA(filenames = cel.files2, verbose = TRUE,
cdfname = "HGU133Plus2_Hs_ENSG")
sum(is.na(exprs(affy.rma.no.776462))) # == 12275
sum(is.na(exprs(ensg.rma.no.776462))) # == 0
Puzzling.
Edit3: No NAs or NaNs in "raw data"
For what it is worth, I've tried to read in the raw probe-level expression values to check if they contain NAs or NaNs. The following goes through the .CEL-files one-by-one and checks for any missing values.
for (file in cel.files) {
  affybtch <- suppressWarnings(read.affybatch(filenames = file))
  tmp <- exprs(affybtch)
  if (any(is.na(tmp)))
    stop(paste("NAs or NaNs are present in", file))
  cat(file, "done.\n")
}
No NAs or NaNs are found.
