How to capture the most important variables in Bootstrapped models in R?

I have several models, Lasso being one of them, whose choices of important predictors I would like to compare over the same data set. The data set I am using consists of census data with around a thousand variables that have been renamed to "x1", "x2" and so on for convenience's sake (the original names are extremely long). I would like to report the top features and then rename these variables with shorter, more concise names.
My attempt at solving this is to extract the top variables from each iterated model, put them into a list, and then find the mean of the top variables across X loops. However, my issue is that I still see variability in the top 10 most used predictors, so I cannot manually rename the variables, as each run of the code chunk yields different results. I suspect this is because I have so many variables in my analysis and because the resampling creates new models with every bootstrap.
For the sake of a simple example I used mtcars and will look for the top 3 most common predictors, since this data set has only 10 variables.
library(glmnet)
data("mtcars") # Base R Dataset
df <- mtcars
topvar <- list()
for (i in 1:100) {
  # Bootstrap resampling and train/test split
  ind <- sample(nrow(df), nrow(df), replace = TRUE)
  ind <- unique(ind) # keep each sampled row once
  train <- df[ind, ]
  xtrain <- model.matrix(mpg ~ ., train)[, -1]
  ytrain <- df[ind, 1]
  test <- df[-ind, ]
  xtest <- model.matrix(mpg ~ ., test)[, -1]
  ytest <- df[-ind, 1]
  # Create model per loop
  model <- glmnet(xtrain, ytrain, alpha = 1, lambda = 0.2)
  # Store coefficients per loop
  coef_las <- coef(model, s = 0.2)[-1, ] # remove intercept
  # Store all nonzero coefficients
  topvar[[i]] <- coef_las[which(coef_las != 0)]
}
# Unlist
varimp <- unlist(topvar)
# Count all predictors
novar <- table(names(varimp))
# Find the mean of all variables
meanvar <- tapply(varimp, names(varimp), mean)
# Return top 3 repeated Coefs
repvar <- novar[order(novar, decreasing = TRUE)][1:3]
# Return mean of repeated Coefs
repvar.mean <- meanvar[names(repvar)]
repvar
Now if you were to rerun the code chunk above, you would notice that the top 3 variables change. If I had to rename these variables it would be difficult, since they are not constant and change with every run. Any suggestions on how I could approach this?

You can use the function set.seed() to ensure your sampling returns the same sample each time. For example:
set.seed(123)
When I add this to the above code and run it twice, the following is returned both times:
wt carb hp
98 89 86
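The seed only needs to be set once, before the loop begins, so that the entire sequence of bootstrap samples is fixed. A minimal sketch of the placement, reusing the loop from the question:
set.seed(123) # fix the RNG state once, before any sampling
topvar <- list()
for (i in 1:100) {
  ind <- sample(nrow(df), nrow(df), replace = TRUE)
  # ... rest of the loop unchanged ...
}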

Related

Generating n new datasets by randomly sampling existing data, and then applying a function to new datasets

For a paper I'm writing I have subsetted a larger dataset into 3 groups, because I thought the strength of correlations between 2 variables in those groups would differ (they did). I want to see if subsetting my data into random groupings would also significantly affect the strength of correlations (i.e., whether what I'm seeing is just an effect of subsetting, or if those groupings are actually significant).
To this end, I am trying to generate n new data frames by randomly sampling 150 rows from an existing dataset, and then want to calculate correlation coefficients for two variables in those n new data frames, saving the correlation coefficient and significance in a new file.
But, HOW?
I can do it manually, e.g., with dplyr, something like
newdata <- sample_n(Random_sample_data, 150)
output <- cor.test(newdata$x, newdata$y, method="kendall")
I'd obviously like to not type this out 1000 or 100000 times, and have been trying things with loops and lapply (see below) but they've not worked (undoubtedly due to something really obvious that I'm missing!).
Here I have tried to assign each row to a different group, with 10 groups in total, and then to do correlations between x and y by those groups:
Random_sample_data <- select(Range_corrected, x, y)
cat <- sample(1:10, 1229, replace = TRUE)
Random_sample_cats <- cbind(Random_sample_data, cat)
correlation <- function(c) {
  c <- cor.test(x, y, method = "kendall")
  return(c)
}
b <- daply(Random_sample_cats, .(cat), correlation)
Error message:
Error in cor.test(x, y, method = "kendall") :
object 'x' not found
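That error occurs because correlation() never uses its argument: cor.test(x, y, ...) looks for free-standing objects named x and y, which don't exist. A sketch of a fix that uses each group's data and returns one atomic value per group:
correlation <- function(d) {
  cor.test(d$x, d$y, method = "kendall")$p.value
}
b <- daply(Random_sample_cats, .(cat), correlation) # one p-value per group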
Once you have the code for what you want to do once, you can put it in replicate to do it n times. Here's a reproducible example on built-in data:
library(dplyr) # for sample_n()
result = replicate(n = 10, expr = {
  newdata <- sample_n(mtcars, 10)
  output <- cor.test(newdata$wt, newdata$qsec, method = "kendall")
})
replicate will save the result of the last line of what you did (output <- ...) for each replication. It will attempt to simplify the result, in this case cor.test returns a list of length 8, so replicate will simplify the results to a matrix with 8 rows and 10 columns (1 column per replication).
You may want to clean up the results a little bit so that, e.g., you only save the p-value. Here, we store only the p-value, so the result is a vector with one p-value per replication, not a matrix:
result = replicate(n = 10, expr = {
  newdata <- sample_n(mtcars, 10)
  cor.test(newdata$wt, newdata$qsec, method = "kendall")$p.value
})
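If you want to keep more than one number per replication, return a named vector instead; replicate then simplifies to a matrix with one row per statistic. A sketch, again on built-in data:
result = replicate(n = 10, expr = {
  newdata <- sample_n(mtcars, 10)
  ct <- cor.test(newdata$wt, newdata$qsec, method = "kendall")
  c(tau = unname(ct$estimate), p.value = ct$p.value)
})
result["p.value", ] # one p-value per replication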

Get different test and training sets from the same sample

I have some data for which I want to compare a few different linear models. I can use caTools::sample.split() to get one training/test set.
I would like to see how the model would change if I had used a different training/test set from the same sample. If I do not use set.seed() I should get a different set every time I call sample.split.
I am using lapply to call the function a certain number of times right now:
library(data.table)
library(caTools)
dat <- as.data.table(iris)
dat_list <- lapply(1:20, function(z) {
  sample_indices <- sample.split(dat$Sepal.Length, SplitRatio = 3/4)
  inter <- dat
  inter$typ <- "test"
  inter$typ[sample_indices] <- "train"
  inter$set_no <- z
  return(as.data.table(inter))
})
And for comparing the coefficients:
coefs <- sapply(1:20, function(z) {
  m <- lm(Sepal.Length ~ Sepal.Width, data = dat_list[[z]][typ == "train"])
  return(unname(m$coefficients))
})
The last few lines could be edited to return the RMS error when predicting values in the test set (typ=="test").
I'm wondering if there's a better way of doing this?
I'm interested in splitting the data efficiently (my actual data set is quite large)
I'm a big advocate of lists of data frames, but it doesn't make sense to duplicate your data in a list; especially if it's biggish data, you don't need 20 copies of it to have 20 train-test splits.
Instead, just store the indices of the train and test sets, and give the appropriate subset to the model.
n = 5
train_ind = replicate(n = n, sample(nrow(iris), size = 0.75 * nrow(iris)), simplify = FALSE)
test_ind = lapply(train_ind, function(x) setdiff(1:nrow(iris), x))
# then modify your loop to subset the right rows
coefs <- sapply(seq_len(n), function(z) {
  m <- lm(Sepal.Length ~ Sepal.Width, data = iris[train_ind[[z]], ])
  return(m$coefficients)
})
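To return the test-set RMS error mentioned in the question instead, predict on the held-out rows; a sketch reusing train_ind, test_ind, and n from above:
rmse <- sapply(seq_len(n), function(z) {
  m <- lm(Sepal.Length ~ Sepal.Width, data = iris[train_ind[[z]], ])
  pred <- predict(m, newdata = iris[test_ind[[z]], ])
  sqrt(mean((iris$Sepal.Length[test_ind[[z]]] - pred)^2))
})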
It's also good to parameterize anything that is used more than once. If you want to change to 20 replicates, set up your code so you change n = 20 at the top and don't have to go through the whole thing looking for every place you used 5 to change it to 20. It might be nice to pull out the split_ratio = 0.75 and put it on its own line at the top too, even though it's only used once.
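A sketch of that parameterization (same split as above):
n <- 20             # number of train-test splits
split_ratio <- 0.75 # fraction of rows used for training
train_ind <- replicate(n, sample(nrow(iris), size = floor(split_ratio * nrow(iris))),
                       simplify = FALSE)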

How to run many linear regressions/correlations in one data set

I have one data set in an excel/csv form. I wish to run many simple linear regressions/correlations (each with a p-value).
I have several independent variables (x's) and one dependent variable (y).
The variables are all columns of data, not rows. Each column has the name of the data type in the first cell, and all the numerical data in the lower cells.
I want to create a loop instead of manually running each test, but I'm unfamiliar with loops in R. If anyone could help, I would greatly appreciate it. Thanks!
Without more detail it's hard to know for sure, but using dplyr and broom might get you where you need to go.
For example, this runs a linear model for each group:
library(broom)
library(dplyr)
mtcars %>%
  group_by(cyl) %>%
  do(tidy(lm(mpg ~ wt, data = .)))
For more detail, may I suggest: http://r4ds.had.co.nz/many-models.html
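Since the question asks for one simple regression per independent variable rather than per group, you can also loop over the predictor columns. A sketch on mtcars, with mpg standing in for your y and every other column treated as an x (the column names here are illustrative, not from your data):
fits <- lapply(setdiff(names(mtcars), "mpg"), function(v) {
  tidy(lm(reformulate(v, response = "mpg"), data = mtcars))
})
bind_rows(fits) # one tidy row per term, with estimates and p-values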
Here is my attempt to use a simulated data set to demonstrate how to 1) "manually" compute correlations, and 2) iteratively compute correlations with a for loop in R:
First, simulate data with 2 independent variables x1 (normally distributed) and x2 (exponentially distributed), and a dependent variable y (also normally distributed):
set.seed(1) #reproducibility
## The first column is your DEPENDENT variable
## The rest are independent variables
data <- data.frame(y=rnorm(100,0.5,1), x1=rnorm(100,0,1), x2= rexp(100,0.5))
"Manually" compute correlation:
cor_x1_y <- cor.test(data$x1, data$y)
cor_x2_y <- cor.test(data$x2, data$y)
c(cor_x1_y$estimate, cor_x2_y$estimate) #corr. coefficients
## cor cor
## -0.0009943199 -0.0404557828
c(cor_x1_y$p.value, cor_x2_y$p.value) #p values
## [1] 0.9921663 0.6894252
Iteratively compute correlation and store results in a matrix called results:
results <- NULL # placeholder
for (i in 2:ncol(data)) {
  ## Perform the i-th test:
  one_test <- cor.test(data[, i], data$y)
  test_cor <- one_test$estimate
  p_value <- one_test$p.value
  ## Add any other parameters you'd like to include
  ## Update the results matrix
  results <- rbind(results, c(test_cor, p_value))
}
colnames(results) <- c("correlation", "p_value")
results
## correlation p_value
## [1,] -0.0009943199 0.9921663
## [2,] -0.0404557828 0.6894252
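Since the loop walks columns 2 through ncol(data) in order, you can label each row of results with the matching predictor name:
rownames(results) <- colnames(data)[-1]
results
##    correlation   p_value
## x1 -0.0009943199 0.9921663
## x2 -0.0404557828 0.6894252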

Save iterations of for loop in R

I'm working on a project where I need to collect the intercept, slope, and R-squared of several linear regressions. Since I need at least 200 samples of different sample sizes, I set up the code below, but it only saves the last iteration of the loop. Any suggestions on how I can record each loop so that I can have all of the coefficients and R-squares that I require?
for (i in 1:5) {
  x <- as.data.frame(mydf[sample(1:1000, 25, replace = FALSE), ])
  mylm <- lm(spd66305 ~ spd66561, data = x)
  coefs <- rbind(lman(mylm))
  total.coefs <- rbind(coefs)
}
total.coefs
The function used in the loop is below if that is needed.
lman <- function(mylm){
  r2 <- summary(mylm)$r.squared
  r <- sqrt(r2)
  intercept <- coef(mylm)[1]
  slope <- coef(mylm)[2]
  tbl <- c(intercept, slope, r2, r)
}
Thanks for the help.
Before starting your loop, you can write total.coefs <- data.frame() to initialise an empty data frame. Then in your loop you want to update total.coefs as follows: total.coefs <- rbind(total.coefs, coefs). Finally, replace the last line in lman with:
tbl <- data.frame(intercept = intercept, slope = slope, r2 = r2, r = r)
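Putting those changes together, the loop from the question becomes (a sketch; mydf and the column names spd66305/spd66561 are from the question):
total.coefs <- data.frame()
for (i in 1:5) {
  x <- as.data.frame(mydf[sample(1:1000, 25, replace = FALSE), ])
  mylm <- lm(spd66305 ~ spd66561, data = x)
  total.coefs <- rbind(total.coefs, lman(mylm))
}
total.coefs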
Here's how I'd do it, for example on the mtcars data. Note: it's not advisable to use rbind inside the loop when you're building a data structure. You can call rbind after the looping has been done and things are much less stressful. I prefer to do this type of operation with a list.
Here I wrapped my lapply loop in do.call(rbind, ...), so the list elements are bound together row by row after the loop. Another thing to note is that I take the samples prior to entering the loop. This makes debugging easier and can be more efficient overall.
reps <- replicate(3, sample(nrow(mtcars), 5), simplify = FALSE)
do.call(rbind, lapply(reps, function(x) {
  mod <- lm(mpg ~ hp, mtcars[x, ])
  c(coef(mod), R = summary(mod)$r.squared)
}))
# (Intercept) hp R
# [1,] 33.29360 -0.08467169 0.5246208
# [2,] 29.97636 -0.06043852 0.4770310
# [3,] 28.33462 -0.05113847 0.8514720
The following transposed vapply loop produces the same result, and is often faster when you know the type of result you expect
t(vapply(reps, function(x) {
  mod <- lm(mpg ~ hp, mtcars[x, ])
  c(coef(mod), R = summary(mod)$r.squared)
}, numeric(3)))
Another way to record each loop is to make the work reproducible and keep your datasets around, in case you have extreme values, missing values, new questions about the datasets, or other surprises that need to be investigated.
This is a similar case using the iris dataset.
# create sample data
data(iris)
iris <- iris[ ,c('Sepal.Length','Petal.Length')]
# your function with data.frame fix on last line
lman <- function(mylm){
  r2 <- summary(mylm)$r.squared
  r <- sqrt(r2)
  intercept <- coef(mylm)[1]
  slope <- coef(mylm)[2]
  data.frame(intercept, slope, r2, r)
}
# set seed to make reproducible
set.seed(3)
# create all datasets
alldatasets <- lapply(1:200, function(x, df) {
  df[sample(1:nrow(df), size = 50, replace = FALSE), ]
}, df = iris)
# create all models based on alldatasets
allmodels <- lapply(alldatasets, lm, formula = Sepal.Length ~ Petal.Length)
# run custom function on all models
lmanresult <- lapply(allmodels, lman)
# format results
result <- do.call('rbind', lmanresult)
row.names(result) <- NULL
# inspect the 129th sample, model, and result
alldatasets[[129]]
summary(allmodels[[129]])
result[129, ]

Using a for loop for performing several regressions

I am currently performing a style analysis using the following method: http://www.r-bloggers.com/style-analysis/. It is a constrained regression of one asset on a number of benchmarks, over a rolling 36-month window.
My problem is that I need to perform this regression for a fairly large number of assets, and doing it one by one would take a huge amount of time. To be more precise: is there a way to tell R to regress columns 1-100 one by one on columns 101-116? Of course this also means printing 100 different plots, one for each asset. I am new to R and have been stuck for several days now.
I hope it doesn't matter that the following excerpt isn't reproducible, since the code works as originally intended.
# Style Regression over Window, constrained
#--------------------------------------------------------------------------
# setup
load.packages('quadprog')
style.weights[] = NA
style.r.squared[] = NA
# Setup constraints
# 0 <= x.i <= 1
constraints = new.constraints(n, lb = 0, ub = 1)
# SUM x.i = 1
constraints = add.constraints(rep(1, n), 1, type = '=', constraints)
# main loop
for (i in window.len:ndates) {
  window.index = (i - window.len + 1):i
  fit = lm.constraint(hist.returns[window.index, -1], hist.returns[window.index, 1], constraints)
  style.weights[i,] = fit$coefficients
  style.r.squared[i,] = fit$r.squared
}
# plot
aa.style.summary.plot('Style Constrained', style.weights, style.r.squared, window.len)
Thank you very much for any tips!
"Is there a way to tell R to regress columns 1-100 one by one on colums 101-116."
Yes! You can use a for loop, but there's also a whole family of 'apply' functions which are appropriate. Here's a generalized solution with a random/toy dataset and using lm(), but you can sub in whatever regression function you want:
# data frame of 116 cols of 20 rows
set.seed(123)
dat <- as.data.frame(matrix(rnorm(116*20), ncol=116))
# with a for loop
models <- list() # empty list to store models
for (i in 1:100) {
  models[[i]] <-
    lm(formula = x ~ ., data = data.frame(x = dat[, i], dat[, 101:116]))
}
# with lapply
models2 <-
  lapply(1:100,
         function(i) lm(formula = x ~ .,
                        data = data.frame(x = dat[, i], dat[, 101:116])))
# compare. they give the same results!
all.equal(models, models2)
# to access a single model, use [[#]]
models2[[1]]
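Once the models sit in a list, the same apply pattern extracts whatever you need from each fit, for example the R-squared of every regression (a small sketch building on models2 above):
r2 <- sapply(models2, function(m) summary(m)$r.squared)
head(r2)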
