Get different test and training sets from the same sample - r

I have some data for which I want to compare a few different linear models. I can use caTools::sample.split() to get one training/test set.
I would like to see how the model would change if I had used a different training/test set from the same sample. If I do not use set.seed() I should get a different set every time I call sample.split.
I am using lapply to call the function a certain number of times right now:
library(data.table)
library(caTools)
dat <- as.data.table(iris)
dat_list <- lapply(1:20, function(z) {
sample_indices <- sample.split(dat$Sepal.Length, SplitRatio = 3/4)
inter <- dat
inter$typ <- "test"
inter$typ[sample_indices] <- "train"
inter$set_no <- z
return(as.data.table(inter))})
And for comparing the coefficients:
coefs <- sapply(1:20, function(z){
m <- lm(Sepal.Length ~ Sepal.Width, data = dat_list[[z]][typ == "train"])
return(unname(m$coefficients))
})
The last few lines could be edited to return the RMS error when predicting values in the test set (typ=="test").
I'm wondering if there's a better way of doing this?

I'm interested in splitting the data efficiently (my actual data set is quite large)
I'm a big advocate of lists of data frames, but it doesn't make sense to duplicate your data in a list - especially if it's biggish data, you don't need 20 copies of your data to have 20 train-test splits.
Instead, just store the indices of the train and test sets, and give the appropriate subset to the model.
n = 5
train_ind = replicate(n = n, sample(nrow(iris), size = 0.75 * nrow(iris)), simplify = FALSE)
test_ind = lapply(train_ind, function(x) setdiff(1:nrow(iris), x))
# then modify your loop to subset the right rows
coefs <- sapply(seq_len(n), function(z) {
m <- lm(Sepal.Length ~ Sepal.Width, data = iris[train_ind[[z]], ])
return(m$coefficients)
})
It's also good to parameterize anything that is used more than once. If you want to change to 20 replicates, set up your code so you change n = 20 at the top and don't have to go through the whole thing looking for every time you used 5 to change it to 20. It might be nice to pull out the split_ratio = 0.75 and put it on it's own line at the top too, even though it's only used once.

Related

How to capture the most important variables in Bootstrapped models in R?

I have several models that I would like to compare their choices of important predictors over the same data set, Lasso being one of them. The data set I am using consists of census data with around a thousand variables that have been renamed to "x1", "x2" and so on for convenience sake (The original names are extremely long). I would like to report the top features then rename these variables with a shorter more concise name.
My attempt to solve this is by extracting the top variables in each iterated model, put it into a list, then finding the mean of the top variables in X amount of loops. However, my issue is I still find variability with the top 10 most used predictors and so I cannot manually alter the variable names as each run on the code chunk yields different results. I suspect this is because I have so many variables in my analysis and due to CV causing the creation of new models every bootstrap.
For the sake of a simple example I used mtcars and will look for the top 3 most common predictors due to only having 10 variables in this data set.
library(glmnet)
data("mtcars") # Base R Dataset
df <- mtcars
topvar <- list()
for (i in 1:100) {
# CV and Splitting
ind <- sample(nrow(df), nrow(df), replace = TRUE)
ind <- unique(ind)
train <- df[ind, ]
xtrain <- model.matrix(mpg~., train)[,-1]
ytrain <- df[ind, 1]
test <- df[-ind, ]
xtest <- model.matrix(mpg~., test)[,-1]
ytest <- df[-ind, 1]
# Create Model per Loop
model <- glmnet(xtrain, ytrain, alpha = 1, lambda = 0.2)
# Store Coeffecients per loop
coef_las <- coef(model, s = 0.2)[-1, ] # Remove intercept
# Store all nonzero Coefficients
topvar[[i]] <- coef_las[which(coef_las != 0)]
}
# Unlist
varimp <- unlist(topvar)
# Count all predictors
novar <- table(names(varimp))
# Find the mean of all variables
meanvar <- tapply(varimp, names(varimp), mean)
# Return top 3 repeated Coefs
repvar <- novar[order(novar, decreasing = TRUE)][1:3]
# Return mean of repeated Coefs
repvar.mean <- meanvar[names(repvar)]
repvar
Now if you were to rerun the code chunk above you would notice that the top 3 variables change and so if I had to rename these variables it would be difficult to do if they are not constant and changing every run. Any suggestions on how I could approach this?
You can use function set.seed() to ensure your sample will return the same sample each time. For example
set.seed(123)
When I add this to above code and then run twice, the following is returned both times:
wt carb hp
98 89 86

How to write a function in R that will implement the "best subsets" approach to model selection?

So I need to write a function that takes a data-frame as input. The columns are my explanatory variables (except for the last column/right most column which is the response variable). I'm trying to fit a linear model and track each model's adjusted r-square as the criterion used to pick the best model.
The model will use all the columns as the explanatory variables (except for the right-most column which will be the response variable).
The function is supposed to create a tibble with a single column for the model number (I have no idea what this is supposed to mean), subset of of explanatory variables along with response variable, model formula, outcome of fitting linear model, and others as needed.
The function is supposed to output: the model number, the explanatory variables in the model, the value of adjusted r-square, and a graph (I can figure the graph out on my own). I have a image of a table here to help with visualizing what the result should look like.
I figured out that this code will get me the explanatory and response variables:
cols <- colnames(data)
# Get the response variable.
y <- tail(cols, 1)
# Get a list of the explanatory variables.
xs <- head(cols, length(cols) - 1)
I know that I can get a model with something like this (ignore variable names for now):
model <- final_data %>%
group_by(debt) %>%
lm(debt ~ distance, data = .) %>%
glance()
I also know that I'm going to have to somehow map that model to each of the rows in the tibble that I'm trying to create.
What I'm stuck on is figuring out how to put all this together and create the complete function. I wish I could provide more details but I am completely stuck. I've spent about 10 hours working on this today... I asked my professor for help and he just told me to post here.
For reference here is a very early (not working at all) attempt I made:
best_subsets <- function(data) {
cols <- colnames(data)
# Get the response variable.
y <- tail(cols, 1)
# Get a list of the explanatory variables.
xs <- head(cols, length(cols) - 1)
# Create the formula as a string and then later in the lm function
# have it turned into a real formula.
form <- paste(y, "~", xs, sep = " ")
data %>%
lm(as.formula(form), data = .) %>%
glance()
}
I don't fully understand your description but I think I understand your goal. Maybe this can help in some way?:
library(tidyverse)
library(broom)
library(data.table)
lm_func <- function(df){
fit1 <- lm(df[, 1] ~ df[, 2], data = df)
fit2 <- lm(df[, 1] ~ df[, 3], data = df)
fit3 <- lm(df[, 1] ~ df[, 2], df[, 3], data = df)
results <- list(fit1, fit2, fit3)
names(results) <- paste0("explanitory_variables_", 1:3)
r_sq <- lapply(results, function(x){
glance(x)
})
r_sq_df <- rbindlist(r_sq, idcol = "df_name")
r_sq_df
}
lm_func(iris)
This gives you a dataframe of all the important outputs from which you can select adj.r.squared. Would also be possible to automate. As a side note, selecting a model based on R squared seems very strange, dangers of overfitting? a higher R squared does not necessarily mean a better model, consider looking into AIC as well?
Let me know if this helps at all or if I can refine the answer a little more towards your goal.
UPDATE:
lm_func <- function(df) {
lst <- c()
for (i in 2:ncol(df)) {
ind <- i
form_df <- df[, 1:ind]
form <- DF2formula(form_df)
fit <- lm(form, data = df)
lst[[i - 1]] <- glance(fit)
}
lst
names(lst) <- paste0("explanitory_variables_", 1:length(lst))
lst <- rbindlist(lst, idcol = "df_name")
lst
}
lm_func(iris)
This assumes your first column is y and you want a model for every additional column.
OK one more UPDATE:
I think this does everything possible but is probably overkill:
library(combinat)
library(data.table)
library(tidyverse)
library(broom)
#First function takes a dataframe containing only the dependent and independent variables. Specify them by variable name or column position.
#The function then returns a list of dataframes of every possible order of independent variables (y ~ x1 + x2...) (y ~ x2 + x1...).
#So you can run your model on every possible sequence of explanatory variables
formula_func <- function(df, dependent = df["Sepal.Length"], independents = df[c("Sepal.Width", "Petal.Length", "Petal.Width", "Species")]) {
independents_df_list <- permn(independents) #length of output should be the factorial of the number of independent variables
df_list <- lapply(independents_df_list, function(x){ #this just pastes your independent variable as the first column of each df
cbind(dependent, x)
})
df_list
}
permd_df_list <- formula_func(iris) # voila
# This function takes the output from the previous function and runs the lm building in one variable each time (y ~ x1), (y ~ x1 + x2) and so on
# So the result is many lms building in one one independent variable at a time in every possible order
# If that is as confusing to you as it is to me then check final output. You will see what model formula is used per row and in what order each explanatory variable was added
lm_func <- function(form_df_list, df) {
mega_lst <- c()
mega_lst <- lapply(form_df_list, function(x) {
lst <- vector(mode = "list", length = length(2:ncol(x)))
for (i in 2:ncol(x)) {
ind <- i
form_df <- x[, 1:ind]
form <- DF2formula(form_df)
fit <- lm(form, data = x)
lst[[i - 1]] <- glance(fit)
names(lst)[[i-1]] <- deparse(form)
}
lst <- rbindlist(lst, idcol = "Model_formula")
return(lst)
})
return(mega_lst)
}
everything_list <- lm_func(permd_df_list, iris) # VOILA!!!
#Remove duplicates and return single df
everything_list_distinct <- everything_list %>%
rbindlist() %>%
distinct()
## You can now subset and select whichever column you want from the final output
I posted this as a coding exercise so let me know if anyone spots any errors. Just one caveat, this code does NOT represent a statistically sound approach just a coding experiment so be sure to understand the stats first!

Scaling only some columns of a training set and a test set

I often have to deal with the following issue:
I have a test set and a training set
I want to scale all columns of a training set, except for a few ones which are identified by a character vector
then, based on the sample means and sample standard deviations of the selected columns of the training set, I want to rescale the test set too
Currently, my workflow is kludgy: I use an index vector and then partial assignment to scale only some columns of the train set. I store the means and standard deviations from the scaling operation on the training set, and I use them to scale the test set. I was wondering if there could be a simpler way, without having to install caret (for a series of reasons, I'm not a big fan of caret and I definitely won't start using it just for this problem).
Here is my current workflow:
# define dummy train and test sets
train <- data.frame(letters = LETTERS[1:10], months = month.abb[1:10], numbers = 1:10,
x = rnorm(10, 1), y = runif(10))
test <- train
test$x <- rnorm(10, 1)
test$y <- runif(10)
# names of variables I don't want to scale
varnames <- c("letters", "months", "numbers")
# index vector of columns which must not be scaled
index <- names(train) %in% varnames
# scale only the columns not in index
temp <- scale(train[, !index])
train[, !index] <- temp
# get the means and standard deviations from temp, to scale test too
means <- attr(temp, "scaled:center")
standard_deviations <- attr(temp, "scaled:center")
# scale test
test[, !index] <- scale(test[, !index], center = means, scale = standard_deviations)
Is there a simpler/more idiomatic way to do this?
It is a nice question and I have tried a lot to come up with an answer.
I think this is a bit more elegant code:
train0=train%>%select(-c(letters, months, numbers))%>%as.matrix%>%scale
means <- attr(train0, "scaled:center")
standard_deviations <- attr(train0, "scaled:center")
train0=cbind(select(train,c(letters, months, numbers)),train0)
test0=test%>%select(-c(letters, months, numbers))%>%as.matrix%>%scale(center = means, scale = standard_deviations)
test0=cbind(select(test,c(letters, months, numbers)),test0)
I have tried hard to work with mutate_at in order to avoid cbind extra code but with no lack

Save iterations of for loop in R

I'm working on a project where I need to collect the intercept, slope, and R squared of several linear regressions. Since I need to at least 200 samples of different sample sizes I set-up the code below, but it only saves the last iteration of the loop. Any suggestions on how I can record each loop so that I can have all of the coefficients and r-squares that I require.
for (i in 1:5) {
x <- as.data.frame(mydf[sample(1:1000,25,replace=FALSE),])
mylm <- lm(spd66305~spd66561, data=x)
coefs <- rbind(lman(mylm))
total.coefs <- rbind(coefs)
}
total.coefs
The function used in the loop is below if that is needed.
lman <- function(mylm){
r2 <- summary(mylm)$r.squared
r <- sqrt(r2)
intercept <- coef(mylm)[1]
slope <- coef(mylm)[2]
tbl <- c(intercept,slope,r2,r)
}
Thanks for the help.
Before starting your loop, you can write
total.coefs <- data.frame(), to initialise an empty data.frame. Then in your loop you want to update the total.coefs, as follows: total.coefs <- rbind(total.coefs, coefs). Finally replace the last line in lman by:
tbl <- data.frame(intercept=intercept, slope=slope, r2=r2, r=r).
Here's how I'd do it, for example on the mtcars data. Note: It's not advisable to use rbind inside the loop if you're building a data structure. You can call rbind after the looping has been done and things are much less stressful. I prefer to do this type of operation with a list.
Here I wrapped my lapply loop with rbind, and then do.call binds the list elements together recursively. Another thing to note is that I take the samples prior to entering the loop. This makes debugging easier and can be more efficient overall
reps <- replicate(3, sample(nrow(mtcars), 5), simplify = FALSE)
do.call(rbind, lapply(reps, function(x) {
mod <- lm(mpg ~ hp, mtcars[x,])
c(coef(mod), R = summary(mod)$r.squared)
}))
# (Intercept) hp R
# [1,] 33.29360 -0.08467169 0.5246208
# [2,] 29.97636 -0.06043852 0.4770310
# [3,] 28.33462 -0.05113847 0.8514720
The following transposed vapply loop produces the same result, and is often faster when you know the type of result you expect
t(vapply(reps, function(x) {
mod <- lm(mpg ~ hp, mtcars[x,])
c(coef(mod), R = summary(mod)$r.squared)
}, numeric(3)))
Another way to record each loop would be to make the work reproducible and keep your datasets around in case you have extreme values, missing values, new questions about the datasets, or other surprises that need investigated.
This is a similar case using the iris dataset.
# create sample data
data(iris)
iris <- iris[ ,c('Sepal.Length','Petal.Length')]
# your function with data.frame fix on last line
lman <- function(mylm){
r2 <- summary(mylm)$r.squared
r <- sqrt(r2)
intercept <- coef(mylm)[1]
slope <- coef(mylm)[2]
data.frame(intercept,slope,r2,r)
}
# set seed to make reproducible
set.seed(3)
# create all datasets
alldatasets <- lapply(1:200,function(x,df){
df[sample(1:nrow(df),size = 50,replace = F), ]
},df = iris)
# create all models based on alldatasets
allmodels <- lapply(alldatasets,lm,formula = Sepal.Length ~ Petal.Length)
# run custom function on all models
lmanresult <- lapply(allmodels,lman)
# format results
result <- do.call('rbind',lmanresult)
row.names(result) <- NULL
# inspect the 129th sample, model, and result
alldatasets[[129]]
summary(allmodels[[129]])
result[129, ]

Using a for loop for performing several regressions

I am currently performing a style analysis using the following method: http://www.r-bloggers.com/style-analysis/ . It is a constrained regression of one asset on a number of benchmarks, over a rolling 36 month window.
My problem is that I need to perform this regression for a fairly large number of assets and doing it one by one would take a huge amount of time. To be more precise: Is there a way to tell R to regress columns 1-100 one by one on colums 101-116. Of course this also means printing 100 different plots, one for each asset. I am new to R and have been stuck for several days now.
I hope it doesn't matter that the following excerpt isn't reproducible, since the code works as originally intended.
# Style Regression over Window, constrained
#--------------------------------------------------------------------------
# setup
load.packages('quadprog')
style.weights[] = NA
style.r.squared[] = NA
# Setup constraints
# 0 <= x.i <= 1
constraints = new.constraints(n, lb = 0, ub = 1)
# SUM x.i = 1
constraints = add.constraints(rep(1, n), 1, type = '=', constraints)
# main loop
for( i in window.len:ndates ) {
window.index = (i - window.len + 1) : i
fit = lm.constraint( hist.returns[window.index, -1], hist.returns[window.index, 1], constraints )
style.weights[i,] = fit$coefficients
style.r.squared[i,] = fit$r.squared
}
# plot
aa.style.summary.plot('Style Constrained', style.weights, style.r.squared, window.len)
Thank you very much for any tips!
"Is there a way to tell R to regress columns 1-100 one by one on colums 101-116."
Yes! You can use a for loop, but you there's also a whole family of 'apply' functions which are appropriate. Here's a generalized solution with a random / toy dataset and using lm(), but you can sub in whatever regression function you want
# data frame of 116 cols of 20 rows
set.seed(123)
dat <- as.data.frame(matrix(rnorm(116*20), ncol=116))
# with a for loop
models <- list() # empty list to store models
for (i in 1:100) {
models[[i]] <-
lm(formula=x~., data=data.frame(x=dat[, i], dat[, 101:116]))
}
# with lapply
models2 <-
lapply(1:100,
function(i) lm(formula=x~.,
data=data.frame(x=dat[, i], dat[, 101:116])))
# compare. they give the same results!
all.equal(models, models2)
# to access a single model, use [[#]]
models2[[1]]

Resources