How to write a function in R that will implement the "best subsets" approach to model selection? - r

So I need to write a function that takes a data-frame as input. The columns are my explanatory variables (except for the last column/right most column which is the response variable). I'm trying to fit a linear model and track each model's adjusted r-square as the criterion used to pick the best model.
The model will use all the columns as the explanatory variables (except for the right-most column which will be the response variable).
The function is supposed to create a tibble with a single column for the model number (I have no idea what this is supposed to mean), the subset of explanatory variables along with the response variable, the model formula, the outcome of fitting the linear model, and others as needed.
The function is supposed to output: the model number, the explanatory variables in the model, the value of adjusted r-square, and a graph (I can figure the graph out on my own). I have an image of a table here to help with visualizing what the result should look like.
I figured out that this code will get me the explanatory and response variables:
cols <- colnames(data)
# Get the response variable.
y <- tail(cols, 1)
# Get a list of the explanatory variables.
xs <- head(cols, length(cols) - 1)
I know that I can get a model with something like this (ignore variable names for now):
model <- final_data %>%
group_by(debt) %>%
lm(debt ~ distance, data = .) %>%
glance()
I also know that I'm going to have to somehow map that model to each of the rows in the tibble that I'm trying to create.
What I'm stuck on is figuring out how to put all this together and create the complete function. I wish I could provide more details but I am completely stuck. I've spent about 10 hours working on this today... I asked my professor for help and he just told me to post here.
For reference here is a very early (not working at all) attempt I made:
best_subsets <- function(data) {
cols <- colnames(data)
# Get the response variable.
y <- tail(cols, 1)
# Get a list of the explanatory variables.
xs <- head(cols, length(cols) - 1)
# Create the formula as a string and then later in the lm function
# have it turned into a real formula.
form <- paste(y, "~", xs, sep = " ")
data %>%
lm(as.formula(form), data = .) %>%
glance()
}

I don't fully understand your description but I think I understand your goal. Maybe this can help in some way?:
library(tidyverse)
library(broom)
library(data.table)
lm_func <- function(df){
fit1 <- lm(df[, 1] ~ df[, 2], data = df)
fit2 <- lm(df[, 1] ~ df[, 3], data = df)
fit3 <- lm(df[, 1] ~ df[, 2] + df[, 3], data = df)
results <- list(fit1, fit2, fit3)
names(results) <- paste0("explanatory_variables_", 1:3)
r_sq <- lapply(results, function(x){
glance(x)
})
r_sq_df <- rbindlist(r_sq, idcol = "df_name")
r_sq_df
}
lm_func(iris)
This gives you a data frame of all the important outputs, from which you can select adj.r.squared. It would also be possible to automate this. As a side note, selecting a model purely on R squared is risky because of overfitting: a higher R squared does not necessarily mean a better model. Consider looking into AIC as well.
Let me know if this helps at all or if I can refine the answer a little more towards your goal.
UPDATE:
lm_func <- function(df) {
lst <- c()
for (i in 2:ncol(df)) {
ind <- i
form_df <- df[, 1:ind]
form <- DF2formula(form_df)
fit <- lm(form, data = df)
lst[[i - 1]] <- glance(fit)
}
lst
names(lst) <- paste0("explanatory_variables_", 1:length(lst))
lst <- rbindlist(lst, idcol = "df_name")
lst
}
lm_func(iris)
This assumes your first column is y and you want a model for every additional column.
OK one more UPDATE:
I think this does everything possible but is probably overkill:
library(combinat)
library(data.table)
library(tidyverse)
library(broom)
#First function takes a dataframe containing only the dependent and independent variables. Specify them by variable name or column position.
#The function then returns a list of dataframes of every possible order of independent variables (y ~ x1 + x2...) (y ~ x2 + x1...).
#So you can run your model on every possible sequence of explanatory variables
formula_func <- function(df, dependent = df["Sepal.Length"], independents = df[c("Sepal.Width", "Petal.Length", "Petal.Width", "Species")]) {
independents_df_list <- permn(independents) #length of output should be the factorial of the number of independent variables
df_list <- lapply(independents_df_list, function(x){ #this just binds your dependent variable on as the first column of each df
cbind(dependent, x)
})
df_list
}
permd_df_list <- formula_func(iris) # voila
# This function takes the output from the previous function and runs the lm building in one variable each time (y ~ x1), (y ~ x1 + x2) and so on
# So the result is many lms, each adding one independent variable at a time, in every possible order
# If that is as confusing to you as it is to me then check final output. You will see what model formula is used per row and in what order each explanatory variable was added
lm_func <- function(form_df_list, df) {
mega_lst <- c()
mega_lst <- lapply(form_df_list, function(x) {
lst <- vector(mode = "list", length = length(2:ncol(x)))
for (i in 2:ncol(x)) {
ind <- i
form_df <- x[, 1:ind]
form <- DF2formula(form_df)
fit <- lm(form, data = x)
lst[[i - 1]] <- glance(fit)
names(lst)[[i-1]] <- deparse(form)
}
lst <- rbindlist(lst, idcol = "Model_formula")
return(lst)
})
return(mega_lst)
}
everything_list <- lm_func(permd_df_list, iris) # VOILA!!!
#Remove duplicates and return single df
everything_list_distinct <- everything_list %>%
rbindlist() %>%
distinct()
## You can now subset and select whichever column you want from the final output
I posted this as a coding exercise so let me know if anyone spots any errors. Just one caveat, this code does NOT represent a statistically sound approach just a coding experiment so be sure to understand the stats first!
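For comparison, here is a minimal base R sketch of what the question describes: one model per non-empty subset of explanatory variables, with the response taken from the last column. The function name best_subsets and the output column names are my own choices, not something from the question, and it returns a plain data frame rather than a tibble. It assumes syntactic column names and a numeric response.

```r
# Sketch: fit every non-empty subset of explanatory variables and
# record adjusted R-squared. The last column is assumed to be the response.
best_subsets <- function(data) {
  cols <- colnames(data)
  y  <- tail(cols, 1)    # response variable
  xs <- head(cols, -1)   # explanatory variables

  # All non-empty subsets of xs, as a list of character vectors
  subsets <- unlist(
    lapply(seq_along(xs), function(k) combn(xs, k, simplify = FALSE)),
    recursive = FALSE
  )

  # Fit one linear model per subset and keep its adjusted R-squared
  adj_r2 <- vapply(subsets, function(vars) {
    fit <- lm(reformulate(vars, response = y), data = data)
    summary(fit)$adj.r.squared
  }, numeric(1))

  variables <- vapply(subsets, paste, character(1), collapse = " + ")
  out <- data.frame(
    model         = seq_along(subsets),
    variables     = variables,
    formula       = paste(y, "~", variables),
    adj_r_squared = adj_r2,
    stringsAsFactors = FALSE
  )
  # Best model (highest adjusted R-squared) first
  out[order(-out$adj_r_squared), ]
}

# Example: mpg is moved to the last column so it is treated as the response
res <- best_subsets(mtcars[, c("wt", "hp", "disp", "mpg")])
res
```

With p explanatory variables this fits 2^p - 1 models, so it is only practical for a modest number of columns.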

Related

Combining for loops and ifelse in R

I am trying to write a for loop that will generate a correlation for a fixed column (LPS0) vs. all other columns in the data set. I don't want to use a correlation matrix because I only care about the correlation of LPS0 vs all other columns, not the correlations of the other columns with themselves. I then want to include an if statement to print only the significant correlations (p.value <= 0.05). I ran into some issues where some of the p.values are returned as NA, so I switched to an if_else loop. However, I am now getting an error. My code is as follows:
for(i in 3:ncol(microbiota_lps_0_morm)) {
morm_0 <- cor.test(microbiota_lps_0_morm$LPS0, microbiota_lps_0_morm[[colnames(microbiota_lps_0_morm)[i]]], method = "spearman")
if_else(morm_0$p.value <= 0.05, print(morm_0), print("Not Sig"), print("NA"))
}
The first value is returned, and then the loop stops with the following error:
Error in if_else():
! true must be length 1 (length of condition), not 8.
Backtrace: 1. dplyr::if_else(morm_0$p.value <= 0.05, print(morm_0), print("Not Sig"), print("NA"))
How can I make the loop print morm_0 only when p.value <= 0.05?
Here's a long piece of code which automates the whole thing. It might be overkill, but you can just take the matrix and use whatever you need. It makes use of the tidyverse.
library(tidyverse)
df <- select_if(mtcars,is.numeric)
glimpse(df)
# keeping real names
dict <- cbind(original=names(df),new=paste0("v",1:ncol(df)))
# but changing names for better data viz
colnames(df) <- paste0("v",1:ncol(df))
# correlating between variables + p values
pvals <- list()
corss <- list()
for (coln in colnames(df)) {
pvals[[coln]] <- map(df, ~ cor.test(df[,coln], .)$p.value)
corss[[coln]] <- map(df, ~ cor(df[,coln], .))
}
# Keeping both matrices in a list
matrices <- list(
pvalues = matrix(data=unlist(pvals),
ncol=length(names(pvals)),
nrow=length(names(pvals))),
correlations = matrix(data=unlist(corss),
ncol=length(names(corss)),
nrow=length(names(corss)))
)
rownames(matrices[[1]]) <- colnames(df)
rownames(matrices[[2]]) <- colnames(df)
# Creating a combined data frame
long_cors <- expand.grid(Var1=names(df),Var2=names(df)) %>%
mutate(cor=unlist(matrices["correlations"]),
pval=unlist(matrices["pvalues"]),
same=Var1==Var2,
significant=pval<0.05,
dpcate=duplicated(cor)) %>%
# Leaving no duplicates, non-significant or self-correlation results
filter(same ==F,significant==T,dpcate==F) %>%
select(-c(same,dpcate,significant))
# Plotting correlations
long_cors %>%mutate(negative=cor<0) %>%
ggplot(aes(x=Var1,y=Var2,
color=negative,size=abs(cor),fill=Var2,
label=round(cor,2)))+
geom_label(show.legend = F,alpha=0.2)+
scale_color_manual(values = c("black","darkred"))+
# Sizing each correlation by its magnitude
scale_size_area(seq(1,100,length=length(unique(long_cors$Var1))))+ theme_light()+
theme(axis.text = element_text(face = "bold",size=12))+
labs(title="Correlation between variables",
caption = "p < 0.05")+xlab("")+ylab("")
If you want to correlate a column of a matrix with the remaining columns, you can do so with one function call:
mtx <- matrix(rnorm(800), ncol=8)
cor(mtx[,1], mtx[,-1])
However, you will not get p-values. For getting p-values, I would recommend this approach:
library(tidyverse)
significant <- map_dbl(2:ncol(mtx),
~ cor.test(mtx[,1], mtx[,.], use="p", method="s")$p.value)
Whenever you feel like you need a for loop in R, chances are, you should be using another approach. for is a very un-R construct, and R gives many better ways of handling the same issues. map_* family of functions from tidyverse is but one of them. Another approach, in base R, would be to use apply:
significant <- apply(mtx[,-1], 2,
\(x) cor.test(x, mtx[,1], method="s", use="p")$p.value)
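As for the loop in the question itself: if_else() is vectorised and meant for building vectors, not for control flow, so a plain if with an explicit NA check is the closer fit. The following is only a sketch on made-up data; the data frame dat and its columns a and b are hypothetical stand-ins for microbiota_lps_0_morm.

```r
# Hypothetical stand-in data: 'a' is built to correlate with LPS0, 'b' is noise
set.seed(1)
dat <- data.frame(LPS0 = rnorm(30), b = rnorm(30))
dat$a <- dat$LPS0 + rnorm(30, sd = 0.1)

sig <- character(0)  # names of the columns with a significant correlation
for (nm in setdiff(names(dat), "LPS0")) {
  res <- cor.test(dat$LPS0, dat[[nm]], method = "spearman")
  p <- res$p.value
  # is.na() is checked first, so an NA p-value is skipped instead of
  # raising "missing value where TRUE/FALSE needed"
  if (!is.na(p) && p <= 0.05) {
    print(res)
    sig <- c(sig, nm)
  }
}
sig
```

Collecting the significant column names in sig, rather than only printing, makes the result easy to reuse afterwards.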

Get different test and training sets from the same sample

I have some data for which I want to compare a few different linear models. I can use caTools::sample.split() to get one training/test set.
I would like to see how the model would change if I had used a different training/test set from the same sample. If I do not use set.seed() I should get a different set every time I call sample.split.
I am using lapply to call the function a certain number of times right now:
library(data.table)
library(caTools)
dat <- as.data.table(iris)
dat_list <- lapply(1:20, function(z) {
sample_indices <- sample.split(dat$Sepal.Length, SplitRatio = 3/4)
inter <- dat
inter$typ <- "test"
inter$typ[sample_indices] <- "train"
inter$set_no <- z
return(as.data.table(inter))})
And for comparing the coefficients:
coefs <- sapply(1:20, function(z){
m <- lm(Sepal.Length ~ Sepal.Width, data = dat_list[[z]][typ == "train"])
return(unname(m$coefficients))
})
The last few lines could be edited to return the RMS error when predicting values in the test set (typ=="test").
I'm wondering if there's a better way of doing this?
I'm interested in splitting the data efficiently (my actual data set is quite large)
I'm a big advocate of lists of data frames, but it doesn't make sense to duplicate your data in a list - especially if it's biggish data, you don't need 20 copies of your data to have 20 train-test splits.
Instead, just store the indices of the train and test sets, and give the appropriate subset to the model.
n = 5
train_ind = replicate(n = n, sample(nrow(iris), size = 0.75 * nrow(iris)), simplify = FALSE)
test_ind = lapply(train_ind, function(x) setdiff(1:nrow(iris), x))
# then modify your loop to subset the right rows
coefs <- sapply(seq_len(n), function(z) {
m <- lm(Sepal.Length ~ Sepal.Width, data = iris[train_ind[[z]], ])
return(m$coefficients)
})
It's also good to parameterize anything that is used more than once. If you want to change to 20 replicates, set up your code so you change n = 20 at the top and don't have to go through the whole thing looking for every time you used 5 to change it to 20. It might be nice to pull out the split_ratio = 0.75 and put it on its own line at the top too, even though it's only used once.
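The question also mentions returning the RMS error on the test set. Here is a minimal sketch of that with the index-based approach, assuming the same iris model; the names split_ratio and rmse are mine, and set.seed is only there to make the example reproducible.

```r
set.seed(42)        # only for reproducibility of this example
n <- 5              # number of train/test splits
split_ratio <- 0.75 # share of rows used for training

# Store only the training row indices, one vector per split
train_ind <- replicate(
  n,
  sample(nrow(iris), size = floor(split_ratio * nrow(iris))),
  simplify = FALSE
)

# Fit on the training rows, then compute RMSE on the held-out rows
rmse <- vapply(train_ind, function(idx) {
  fit  <- lm(Sepal.Length ~ Sepal.Width, data = iris[idx, ])
  test <- iris[-idx, ]
  sqrt(mean((test$Sepal.Length - predict(fit, newdata = test))^2))
}, numeric(1))
rmse
```

Negative indexing with iris[-idx, ] gives the test rows directly, so a separate test_ind list is not needed here.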

Save iterations of for loop in R

I'm working on a project where I need to collect the intercept, slope, and R squared of several linear regressions. Since I need at least 200 samples of different sample sizes, I set up the code below, but it only saves the last iteration of the loop. Any suggestions on how I can record each loop so that I have all of the coefficients and r-squares that I require?
for (i in 1:5) {
x <- as.data.frame(mydf[sample(1:1000,25,replace=FALSE),])
mylm <- lm(spd66305~spd66561, data=x)
coefs <- rbind(lman(mylm))
total.coefs <- rbind(coefs)
}
total.coefs
The function used in the loop is below if that is needed.
lman <- function(mylm){
r2 <- summary(mylm)$r.squared
r <- sqrt(r2)
intercept <- coef(mylm)[1]
slope <- coef(mylm)[2]
tbl <- c(intercept,slope,r2,r)
}
Thanks for the help.
Before starting your loop, initialise an empty data frame with total.coefs <- data.frame(). Then, inside the loop, update it with total.coefs <- rbind(total.coefs, coefs). Finally, replace the last line of lman with:
tbl <- data.frame(intercept=intercept, slope=slope, r2=r2, r=r)
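Put together, those changes might look like the sketch below. Since mydf is not available, mtcars with mpg ~ hp stands in for the original data and formula; that substitution is an assumption for illustration only.

```r
# lman now returns a one-row data frame instead of an unnamed vector
lman <- function(mylm) {
  r2 <- summary(mylm)$r.squared
  data.frame(
    intercept = unname(coef(mylm)[1]),
    slope     = unname(coef(mylm)[2]),
    r2        = r2,
    r         = sqrt(r2)
  )
}

set.seed(1)                  # only for reproducibility of this example
total.coefs <- data.frame()  # initialise before the loop
for (i in 1:5) {
  x <- mtcars[sample(nrow(mtcars), 25, replace = FALSE), ]
  mylm <- lm(mpg ~ hp, data = x)
  # append this iteration's row instead of overwriting the result
  total.coefs <- rbind(total.coefs, lman(mylm))
}
total.coefs
```

Growing a data frame with rbind in a loop is fine for a few hundred iterations, but it gets slow for very large numbers of rows.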
Here's how I'd do it, for example on the mtcars data. Note: It's not advisable to use rbind inside the loop if you're building a data structure. You can call rbind after the looping has been done and things are much less stressful. I prefer to do this type of operation with a list.
Here I wrapped my lapply loop in do.call(rbind, ...), which passes all the list elements to a single rbind call. Another thing to note is that I take the samples prior to entering the loop. This makes debugging easier and can be more efficient overall.
reps <- replicate(3, sample(nrow(mtcars), 5), simplify = FALSE)
do.call(rbind, lapply(reps, function(x) {
mod <- lm(mpg ~ hp, mtcars[x,])
c(coef(mod), R = summary(mod)$r.squared)
}))
# (Intercept) hp R
# [1,] 33.29360 -0.08467169 0.5246208
# [2,] 29.97636 -0.06043852 0.4770310
# [3,] 28.33462 -0.05113847 0.8514720
The following transposed vapply loop produces the same result, and is often faster when you know the type of result you expect
t(vapply(reps, function(x) {
mod <- lm(mpg ~ hp, mtcars[x,])
c(coef(mod), R = summary(mod)$r.squared)
}, numeric(3)))
Another way to record each loop would be to make the work reproducible and keep your datasets around in case you have extreme values, missing values, new questions about the datasets, or other surprises that need investigating.
This is a similar case using the iris dataset.
# create sample data
data(iris)
iris <- iris[ ,c('Sepal.Length','Petal.Length')]
# your function with data.frame fix on last line
lman <- function(mylm){
r2 <- summary(mylm)$r.squared
r <- sqrt(r2)
intercept <- coef(mylm)[1]
slope <- coef(mylm)[2]
data.frame(intercept,slope,r2,r)
}
# set seed to make reproducible
set.seed(3)
# create all datasets
alldatasets <- lapply(1:200,function(x,df){
df[sample(1:nrow(df),size = 50,replace = F), ]
},df = iris)
# create all models based on alldatasets
allmodels <- lapply(alldatasets,lm,formula = Sepal.Length ~ Petal.Length)
# run custom function on all models
lmanresult <- lapply(allmodels,lman)
# format results
result <- do.call('rbind',lmanresult)
row.names(result) <- NULL
# inspect the 129th sample, model, and result
alldatasets[[129]]
summary(allmodels[[129]])
result[129, ]

Get predicted values for next period

Please consider the following data:
y<- c(2,2,6,3,2,23,5,6,4,23,3,4,3,87,5,7,4,23,3,4,3,87,5,7)
x1<- c(3,4,6,3,3,23,5,6,4,23,6,5,5,1,5,7,2,23,6,5,5,1,5,7)
x2<- c(7,3,6,3,2,2,5,2,2,2,2,2,6,5,4,3,2,3,2,2,6,5,4,3)
type <- c("a","a","a","a","a","a","a","a","b","b","b","b","b","b","b","b","c","c","c","c","c","c","c","c")
generation<- c(1,1,1,1,2,2,3,3,1,2,2,2,3,3,4,4,1,2,2,2,3,3,4,4)
year<- c(2004,2005,2006,2007,2008,2009,2010,2011,2004,2005,2006,2007,2008,2009,2010,2011,2004,2005,2006,2007,2008,2009,2010,2011)
data <- data.frame(y,x1,x2,type,generation,year)
I would now like to run an analysis that takes only a single year into account and predicts the following one. In essence, this would run several separate analyses, each taking into account only the data up to one point in time and then predicting the next (only the directly following) period.
I tried to set up an example for the three models:
data2004 <- subset(data, year==2004)
data2005 <- subset(data, year==2005)
m1 <- lm(y~x1+x2, data=data2004)
preds <- predict(m1, data2005)
How can I do this automatically? My preferred output would be a predicted value for each type that indicates what the value would have been for each of the values that exist in the following period (the original data has 200 periods).
Thanks in advance, help very much appreciated!
The following may be more like what you want.
uq.year <- sort(unique(dat$year)) ## sorting so that i+1 element is the year after ith element
year <- dat$year
dat$year <- NULL ## we want everything in dat to be either the response or a predictor
model <- rep(c("a", "b", "c"), times = length(year) / 3) ## identifies the separate people per year
predlist <- vector("list", length(uq.year) - 1) ## there is 1 prediction fewer than the number of unique years
for(i in 1:(length(uq.year) - 1))
{
mod <- lm(y ~ ., data = subset(dat, year == uq.year[i]))
predlist[[i]] <- predict(mod, subset(dat, subset = year == uq.year[i + 1], select = -y))
names(predlist[[i]]) <- model[year == uq.year[i + 1]] ## labeling each prediction
}
The reason that we want dat to only have modeling variables (rather than year, for example) is because then we can easily use the y ~ . notation and avoid having to spell out all of the predictors in the lm call.
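Another reading of the question is an expanding window: fit on every year up to a given one, then predict only the next year. The sketch below assumes that reading and the formula y ~ x1 + x2 from the question's own attempt. The names dat, years and pred_df are mine, and type and year are rebuilt with rep() to give the same values as in the question.

```r
y  <- c(2,2,6,3,2,23,5,6,4,23,3,4,3,87,5,7,4,23,3,4,3,87,5,7)
x1 <- c(3,4,6,3,3,23,5,6,4,23,6,5,5,1,5,7,2,23,6,5,5,1,5,7)
x2 <- c(7,3,6,3,2,2,5,2,2,2,2,2,6,5,4,3,2,3,2,2,6,5,4,3)
type <- rep(c("a", "b", "c"), each = 8)
year <- rep(2004:2011, times = 3)
dat <- data.frame(y, x1, x2, type, year)

years <- sort(unique(dat$year))

# For each year except the last: fit on all data up to and including
# that year, then predict the rows of the directly following year
preds <- lapply(seq_len(length(years) - 1), function(i) {
  train <- dat[dat$year <= years[i], ]
  test  <- dat[dat$year == years[i + 1], ]
  fit   <- lm(y ~ x1 + x2, data = train)
  data.frame(
    year      = years[i + 1],
    type      = test$type,
    actual    = test$y,
    predicted = unname(predict(fit, newdata = test))
  )
})
pred_df <- do.call(rbind, preds)
pred_df
```

To use a single year for fitting instead, change dat$year <= years[i] to dat$year == years[i]. With only three rows per year, though, such a fit has no residual degrees of freedom.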

Generating multiple datasets and applying function and output multiple dataset

Here is my problem, just hard for me...
I want to generate multiple datasets, then apply a function to these datasets and output corresponding output in single or multiple dataset (whatever possible)...
My example, although I need to generate a large number of variables and datasets
seed <- round(runif(10)*1000000)
datagen <- function(x){
set.seed(x)
var <- rep(1:3, c(rep(3, 3)))
yvar <- rnorm(length(var), 50, 10)
matrix <- matrix(sample(1:10, c(10*length(var)), replace = TRUE), ncol = 10)
mydata <- data.frame(var, yvar, matrix)
}
gdt <- lapply (seed, datagen)
# resulting list (I believe is correct term) has 10 dataframes:
# gdt[1] .......to gdt[10]
# my function, this will perform anova in every component data frames and
#output probability coefficients...
anovp <- function(x){
ind <- 3:ncol(x)
out <- lm(gdt[x]$yvar ~ gdt[x][, ind[ind]])
pval <- out$coefficients[,4][2]
pval <- do.call(rbind,pval)
}
plist <- lapply (gdt, anovp)
Error in gdt[x] : invalid subscript type 'list'
This is not working, I tried different options. But could not figure out...finally decided to bother experts, sorry for that...
My questions are:
(1) Is this possible to handle such situation in this way or there are other alternatives to handle such multiple datasets created?
(2) If this is right way, how can I do it?
Thank you for attention and I will appreciate your help...
You have the basic idea right, in that you should create a list of data frames and then use lapply to apply the function to each element of the list. Unfortunately, there are several oddities in your code.
There is no point in randomly generating a seed, then setting it. You only need to use set.seed in order to make random numbers reproducible. Cut the lines
seed <- round(runif(10)*1000000)
and maybe
set.seed(x)
rep(1:3, c(rep(3, 3))) is the same as rep(1:3, each = 3).
Don't call your variables var or matrix: they mask the names of those functions, which is confusing.
3:ncol(x) is dangerous. If x has less than 3 columns it doesn't do what you think it does.
... and now, the problem you actually wanted solving.
The problem is in the line out <- lm(gdt[x]$yvar ~ gdt[x][, ind[ind]]).
lapply passes data frames into anovp, not indices, so in gdt[x] the subscript x is a data frame, which throws an error.
One more thing. While you are rewriting that line, note that lm takes a data argument, so you don't need to do things like gdt$some_column; you can just reference some_column directly.
EDIT: Further advice.
You appear to always use the formula yvar ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10. Since it's the same each time, create it before your call to lapply.
independent_vars <- paste(colnames(gdt[[1]])[-1:-2], collapse = " + ")
model_formula <- formula(paste("yvar", independent_vars, sep = " ~ "))
I probably wouldn't bother with the anovp function. Just do
models <- lapply(gdt, function(data) lm(model_formula, data))
Then include a further call to lapply to play with the coefficients if necessary. The next line replicates your anovp code, but won't work because model$coefficients is a vector (so the dimensions aren't right). Adjust to retrieve the bit you actually want.
coeffs <- lapply(models, function(model) do.call(rbind, model$coefficients[,4][2]))
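If the p-values are what you are after, they live in the coefficient table of summary(model), not in model$coefficients. The sketch below takes the p-value of the first predictor from each model. Note one assumption that differs from the question: each dataset gets 30 rows instead of 9, because with 9 rows and 10 predictors there are no residual degrees of freedom and the p-values cannot be computed. The names group, predictors and n_rows are hypothetical.

```r
set.seed(123)  # one seed up front is enough for reproducibility

datagen <- function(n_rows = 30) {
  group <- rep(1:3, each = n_rows / 3)
  yvar  <- rnorm(n_rows, 50, 10)
  predictors <- matrix(sample(1:10, 10 * n_rows, replace = TRUE), ncol = 10)
  # An unnamed matrix becomes columns X1 ... X10 in the data frame
  data.frame(group, yvar, predictors)
}

gdt <- replicate(10, datagen(), simplify = FALSE)  # list of 10 data frames

# Build the formula once, from every column after the first two
independent_vars <- paste(colnames(gdt[[1]])[-(1:2)], collapse = " + ")
model_formula <- formula(paste("yvar", independent_vars, sep = " ~ "))

models <- lapply(gdt, function(d) lm(model_formula, data = d))

# Row 2 of the coefficient table is the first predictor; the column
# "Pr(>|t|)" holds its p-value
pvals <- vapply(models, function(m) {
  summary(m)$coefficients[2, "Pr(>|t|)"]
}, numeric(1))
pvals
```

vapply returns a plain numeric vector here, so no do.call(rbind, ...) step is needed.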
