I am trying to build test and train groups for cross-validation. I have a pool of 95 individual IDs and tried to accomplish the task like this:
# create 95 unique IDs as individuals
set.seed(1)
indv <- stringi::stri_rand_strings(95, 4)
# specify Kfold
n.folds <- 5
folds <- cut(1:length(indv), breaks = n.folds, labels = FALSE)
# randomise the folds
folds <- sample(folds, length(folds))
samples.train <- list()
samples.test <- list()
foldSet <- list()
kfold.df <- data.frame("IID" = indv)
for (f in 1:n.folds) {
samples.train[[f]] <- indv[folds != f]
samples.test[[f]] <- indv[folds == f]
# mark an ID as "test" if it falls in this fold's test set, and "train" otherwise
foldSet[[f]] <- ifelse(kfold.df$IID %in%
samples.test[[f]], "test", "train")
# combine foldSet into the data frame.
kfold.df[[f]] <- cbind(kfold.df, foldSet[[f]])
}
The goal is to prepare 5 testing and training sample sets to do the modeling, but I encountered this error message:
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 95, 2
Besides, the foldSet output is not as expected, although samples.train and samples.test are correct. Could you please help me make this loop work?
UPDATE:
Here is the for-loop without using wildcards in creating foldSet:
for (f in 1:n.folds) {
samples.train[[f]] <- indv[folds != f]
samples.test[[f]] <- indv[folds == f]
foldSet <<- ifelse(kfold.df$IID %in% samples.test[[f]], "test", "train")
# combine foldSet into the data frame.
kfold.df <<- cbind(kfold.df, foldSet)
}
By executing the loop you will find kfold.df is a data frame listing all five folds' test/train random sets. For each iteration f, I expect to create the testing and training sets corresponding to that fold, so that after five iterations I have access to each fold's training/testing sets for the next operations inside the loop, like kfold.df[foldSet == "train", "IID"]. I need this access because I want to use it to subset another, bigger matrix based on the train and test indv of each fold, preparing it to be fed into the regression model. That's why I used the wildcards for foldSet, to make the loop create everything by itself, but I failed to manage it.
I think you may be overcomplicating things (which is something I do all the time...)
You don't need to go to great lengths to build what you are trying to build. This answer is broken down into three parts.
Building the data frame you're looking for (I think!)
Why you really don't need this data frame to be built
Why not use what's already out there?
Part 1
If I understand correctly, this is about what you're looking for (less the strings). I also included how you might use it with your actual data.
library(tidyverse)
giveMe <- function(rowCt, nfolds){
# set.seed(235) # seed removed once the function was working,
# to restore the expected randomness
folds <- cut(1:rowCt, breaks = nfolds, labels = F)
# randomise the folds
folds <- sample(folds, length(folds))
# create the folds' sets
kfold.df <- map_dfc(1:nfolds,
~ifelse(folds != .x, T, F)) %>%
setNames(., paste0("foldSet_",1:nfolds)) %>% # name each field
add_column(IID = 1:rowCt, .before = 1) # add indices to the left
return(kfold.df) # return a data frame
}
given <- giveMe(95, 5)
giveMore <- giveMe(nrow(iris), 5) # uses the built-in iris data set
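As an aside, the error in your original loop comes from kfold.df[[f]] <- cbind(kfold.df, foldSet[[f]]): it assigns an entire multi-column data frame into a single column slot, which is what triggers "arguments imply differing number of rows: 95, 2". If you'd rather keep the base-R loop, a minimal sketch of the fix (untested against your real data; the foldSet_1, foldSet_2, ... column names are just ones I chose) is to assign only the character vector:
for (f in 1:n.folds) {
  samples.train[[f]] <- indv[folds != f]
  samples.test[[f]] <- indv[folds == f]
  foldSet[[f]] <- ifelse(kfold.df$IID %in% samples.test[[f]], "test", "train")
  # add the vector as a new named column instead of cbind-ing the whole frame
  kfold.df[[paste0("foldSet_", f)]] <- foldSet[[f]]
}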
Part 2
You could just create your random fold sequence and use it with a model; you don't need to stack the folds in a data frame. You have to loop through the model the same number of times anyway, so why not do both at once?
folds <- sample(cut(1:nrow(iris), 5, # no seed-- random on purpose
labels = F))
tellMe <- map(1:5, # one model per fold, fit on the training rows
~lm(Sepal.Length~.,
iris[ifelse(folds != .x,
T, F),
1:4])) # dropped 'Species' groups' issue
To check out the model performance:
map_dfr(1:5, .f = function(x){
y = tellMe[[x]]
sigma = sigma(y)
rsq = summary(y)$adj.r.squared
c(sigma = sigma, rsq = rsq)
})
# # A tibble: 5 × 2
# sigma rsq
# <dbl> <dbl>
# 1 0.334 0.844
# 2 0.309 0.869
# 3 0.302 0.846
# 4 0.330 0.847
# 5 0.295 0.872
Predict and inspect the testing performance
# create a list of the predicted values from the test data
showMe <- map(1:5,
~predict(tellMe[[.x]],
iris[ifelse(folds == .x,
T, F), 1:4]))
# Grab comparable metrics like those from the models
map_dfr(1:5,
.f = function(x){
A = iris[ifelse(folds == x, T, F), ]$Sepal.Length
P = showMe[[x]]
sigma = sqrt(sum((A - P)^2) / length(A))
rsq = cor(A, P)^2
c(sigma = sigma, rsq = rsq)
})
# # A tibble: 5 × 2
# sigma rsq
# <dbl> <dbl>
# 1 0.232 0.919
# 2 0.342 0.774
# 3 0.366 0.884
# 4 0.250 0.906
# 5 0.384 0.790
Part 3
Here I'm going to use the caret library. However, there are a lot of other options.
library(caret)
set.seed(1)
# split training and testing 70/30%
tr <- createDataPartition(iris$Species, p = .7, list = F)
# set up 5-fold cross-validation
trC <- trainControl(method = "cv", number = 5)
# train the model
fit <- train(Sepal.Length~., iris[tr, ],
method = "lm",
trControl = trC)
summary(fit)
# truncated results for the best model:
# Residual standard error: 0.2754 on 39 degrees of freedom
# Multiple R-squared: 0.9062, Adjusted R-squared: 0.8941
fit.p <- predict(fit, iris[-tr,])
postResample(fit.p, iris[-tr, ]$Sepal.Length)
# RMSE Rsquared MAE
# 0.2795920 0.8925574 0.2302402
If you want to see each of the folds' performance, you can do that, too.
fit$resample
# RMSE Rsquared MAE Resample
# 1 0.3629901 0.7911634 0.2822708 Fold1
# 2 0.3680954 0.8888947 0.2960464 Fold2
# 3 0.3508317 0.8394489 0.2709989 Fold3
# 4 0.2548549 0.8954633 0.1960375 Fold4
# 5 0.3396910 0.8661239 0.3187768 Fold5
I understand how to bootstrap using the "boot" package in R, through the PDF for the package and also from these two examples on Stack, Bootstrapped correlation with more than 2 variables in R and Bootstrapped p-value for a correlation coefficient on R.
However, those are for small datasets (2 variables, or a matrix with 5 variables). I have a very large matrix (1000+ columns), and the code I use to compute the correlation between every metabolite pair (removing duplicates and correlations of a metabolite with itself) is:
x <- colnames(dat)
GetCor = function(x,y) cor(dat[,x], dat[,y], method="spearman")
GetCor = Vectorize(GetCor)
out <- data.frame(t(combn(x,2)), stringsAsFactors = F) %>%
mutate(v = GetCor(X1,X2))
I'm not sure how I can then alter this to be the function I pass to statistic in boot, as in
boot_res<- boot(dat, ?, R=1000)
Or would I just need to obtain a matrix of the bootstrapped p-values or estimates, depending on the function code (colMeans(boot_res$t)), and get rid of the upper or lower triangle?
I was curious to know the most efficient way of going about this problem.
Something like this? It follows more or less the same lines as my answer to the 2nd question you link to in your question.
Note that I have simplified the correlation code: cor accepts a data.frame or a matrix, so pass a two-column one and keep one of the off-diagonal elements of the correlation matrix.
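To see the idea in isolation, a two-column data frame goes into cor and [1, 2] picks out the single pairwise value (this figure reappears as t1* below):
cor(iris[1:2], method = "spearman")[1, 2]
#> [1] -0.1667777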
library(boot)
bootPairwiseCor <- function(data, i) {
d <- data[i,]
combn(d, 2, \(x) cor(x, method="spearman")[1,2])
}
dat <- iris[-5]
nms <- combn(colnames(dat), 2, paste, collapse = "_")
R <- 100L
b <- boot(dat, bootPairwiseCor, R)
b
#>
#> ORDINARY NONPARAMETRIC BOOTSTRAP
#>
#>
#> Call:
#> boot(data = dat, statistic = bootPairwiseCor, R = R)
#>
#>
#> Bootstrap Statistics :
#> original bias std. error
#> t1* -0.1667777 0.0037142908 0.070552718
#> t2* 0.8818981 -0.0002851683 0.017783297
#> t3* 0.8342888 0.0006306610 0.021509280
#> t4* -0.3096351 0.0047809612 0.075976067
#> t5* -0.2890317 0.0045689001 0.069929108
#> t6* 0.9376668 -0.0014838117 0.009632318
data.frame(variables = nms, correlations = colMeans(b$t))
#> variables correlations
#> 1 Sepal.Length_Sepal.Width -0.1630634
#> 2 Sepal.Length_Petal.Length 0.8816130
#> 3 Sepal.Length_Petal.Width 0.8349194
#> 4 Sepal.Width_Petal.Length -0.3048541
#> 5 Sepal.Width_Petal.Width -0.2844628
#> 6 Petal.Length_Petal.Width 0.9361830
Created on 2023-01-28 with reprex v2.0.2
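If you also want interval estimates for a given pair, boot.ci can be applied per statistic; a quick sketch (index selects the pair, and with only R = 100 replicates the percentile interval will be rough):
boot.ci(b, type = "perc", index = 1) # interval for Sepal.Length_Sepal.Width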
You may want to use cor.test to get theoretical t-values. We will use them for comparison with the B bootstrap t-values. (Recall: The p-value is the probability of obtaining test results at least as extreme as the result actually observed, under the assumption that the null hypothesis is correct.)
Here is a similar function to yours, but applying cor.test and extracting statistics.
corr_cmb <- \(X, boot=FALSE) {
stts <- c('estimate', 'statistic', 'p.value')
cmbn <- combn(colnames(X), 2, simplify=FALSE)
a <- lapply(cmbn, \(x) as.data.frame(cor.test(X[, x[1]], X[, x[2]])[stts])) |>
do.call(what=rbind) |>
`rownames<-`(sapply(cmbn, paste, collapse=':'))
if (boot) {
a <- a[, 'statistic']
}
a
}
We run it once on the data to get the theoretical solution.
rhat <- corr_cmb(dat)
head(rhat, 3)
# estimate statistic p.value
# V1:V2 0.06780426 2.1469547 0.03203729
# V1:V3 0.03471587 1.0973752 0.27274212
# V1:V4 0.05301563 1.6771828 0.09381987
Bootstrap
We can assume from the start that the bootstrap with 1000 columns will run for a while (choose(1000, 2) returns 499500 combinations). That's why we think about a multithreaded solution right away.
To bootstrap, we simply apply corr_cmb repeatedly on samples of the data drawn with replacement.
We also measure the elapsed time, to estimate the runtime for 1000 variables.
## setup clusters
library(parallel)
CL <- makeCluster(detectCores() - 1)
clusterExport(CL, c('corr_cmb', 'dat'))
t0 <- Sys.time() ## timestamp before run
B <- 1099L
clusterSetRNGStream(CL, 42)
boot_res <- parSapply(CL, 1:B, \(i) corr_cmb(dat[sample.int(nrow(dat), replace=TRUE), ], boot=TRUE))
t1 <- Sys.time() ## timestamp after run
stopCluster(CL)
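boot_res is now a choose(m, 2)-by-B matrix: one row per variable pair, one column per bootstrap replication.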
After the bootstrap, we calculate the ratios of how many times the absolute bootstrap test statistics exceeded the theoretical ones (Ref.),
boot_p <- rowMeans(abs(boot_res - rowMeans(boot_res)) > abs(rhat$statistic))
and cbind the bootstrap p-values to the theoretical result.
cbind(rhat, boot_p)
# estimate statistic p.value boot_p
# V1:V2 0.06780426 2.1469547 0.03203729 0.03003003
# V1:V3 0.03471587 1.0973752 0.27274212 0.28028028
# V1:V4 0.05301563 1.6771828 0.09381987 0.08208208
# V1:V5 -0.01018682 -0.3218300 0.74764890 0.73473473
# V2:V3 0.03730133 1.1792122 0.23859474 0.23323323
# V2:V4 0.07203911 2.2817257 0.02271539 0.01201201
# V2:V5 0.03098230 0.9792363 0.32770055 0.30530531
# V3:V4 0.02364486 0.7471768 0.45513283 0.47547548
# V3:V5 -0.02864165 -0.9051937 0.36558126 0.38938939
# V4:V5 0.03415689 1.0796851 0.28054328 0.29329329
Note that the data used here is fairly normally distributed. If the data were not normally distributed, the bootstrap p-values would differ more from the theoretical ones.
To conclude, an estimate of the time needed for your 1000 variables.
d <- as.numeric(difftime(t1, t0, units='mins'))
n_est <- 1000
t_est <- d/(choose(m, 2))*choose(n_est, 2) # m = 5 variables in dat (see Data below)
cat(sprintf('est. runtime for %s variables: %s mins\n', n_est, round(t_est, 1)))
# est. runtime for 1000 variables: 1485.8 mins
(Perhaps for the sake of completeness, a single-threaded version for smaller problems:)
## singlethreaded version
# set.seed(42)
# B <- 1099L
# boot_res <- replicate(B, corr_cmb(dat[sample.int(nrow(dat), replace=TRUE), ], boot=TRUE))
Data:
library(MASS)
n <- 1e3; m <- 5
Sigma <- matrix(.5, m, m)
diag(Sigma) <- 1
set.seed(42)
M <- mvrnorm(n, runif(m), Sigma)
M <- M + rnorm(length(M), sd=6)
dat <- as.data.frame(M)
I want to write a function combinations_features(y, x) which goes through all combinations containing three variables and outputs the R squared, adjusted R squared, AIC and BIC for each combination.
My solution
combinations_features <- function(y, x) {
# Define empty vectors to store statistics
feature_vec_1 <- feature_vec_2 <-
feature_vec_3 <- feature_vec_4 <- c()
# Obtaining all combinations containing three variables
comb_names <- utils::combn(colnames(x), 3)
# For each combination obtain wanted statistics
for (i in 1:ncol(comb_names)) {
feature_vec_1 <- append(
feature_vec_1, summary(lm(y ~ ., data = x[, comb_names[, i]]))$adj.r.squared
)
feature_vec_2 <- append(
feature_vec_2, summary(lm(y ~ ., data = x[, comb_names[, i]]))$r.squared
)
feature_vec_3 <- append(
feature_vec_3, AIC(lm(y ~ ., data = x[, comb_names[, i]]))
)
feature_vec_4 <- append(
feature_vec_4, BIC(lm(y ~ ., data = x[, comb_names[, i]]))
)
}
# Assign everything into data frame
data.frame(
"Adj R2" = feature_vec_1, "R2" = feature_vec_2,
"AIC" = feature_vec_3, "BIC" = feature_vec_4
)
}
Let's see how it works - define some artificial data and give it to the function.
set.seed(42)
predictors <- data.frame(rnorm(100), runif(100), rexp(100), rpois(100, 1))
dependent <- rnorm(100)
> combinations_features(dependent, predictors)
Adj.R2 R2 AIC BIC
1 -0.0283756015 0.002787295 276.2726 289.2985
2 0.0000677269 0.030368705 273.4678 286.4937
3 -0.0011990695 0.029140296 273.5944 286.6203
4 0.0015404392 0.031796789 273.3204 286.3463
However, I find this code very inefficient for two reasons:
(1) Loop - I looped over the columns of the matrix comb_names; I wonder if the loop can be avoided somehow.
(2) Length of the code - this code is huge, because I define a feature_vec for each statistic and append to each of them separately. I wonder if the assignments can be done with one command.
Could you please give me a hand with improving my code by telling me whether (1) or (2) is possible?
How about this, which relies on bind_rows() from tidyverse? I don't think there's a way to avoid looping over the combinations, but lapply makes everything a little neater, IMHO.
library(tidyverse) # for bind_rows() and tibble()
combinations_features1 <- function(y, x) {
comb_names <- utils::combn(colnames(x), 3)
bind_rows(
lapply(
1:ncol(comb_names),
function(z) {
m <- lm(y ~ ., data = x[, comb_names[,z]])
s <- summary(m)
tibble(Adj.R2=s$adj.r.squared, R2=s$r.squared, AIC=AIC(m), BIC=BIC(m))
}
)
)
}
combinations_features1(dependent, predictors)
# A tibble: 4 x 4
Adj.R2 R2 AIC BIC
<dbl> <dbl> <dbl> <dbl>
1 -0.0284 0.00279 276. 289.
2 0.0000677 0.0304 273. 286.
3 -0.00120 0.0291 274. 287.
4 0.00154 0.0318 273. 286.
bind_rows(), if given a list, binds the elements of the list into a single data.frame.
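For instance:
bind_rows(list(tibble(x = 1), tibble(x = 2)))
# A tibble: 2 x 1
#       x
#   <dbl>
# 1     1
# 2     2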
Same idea as above, just directly applying lapply to the list of combinations would also work:
combinations_features <- function(y,x){
do.call(rbind, lapply(utils::combn(colnames(x), 3, simplify=FALSE),
function(i){
f1 <- lm(y ~ ., data=x[, i])
data.frame(Adj.R2=summary(f1)$adj.r.squared,
R2=summary(f1)$r.squared,
AIC=AIC(f1), BIC=BIC(f1))
}))
}
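It is called the same way, combinations_features(dependent, predictors), and returns the same four columns as a plain data.frame rather than a tibble.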
I am working with several linear regression models.
I want to run a linear regression model with 30 different set.seed values.
For clarification, I only share the code with two regression models and 10 seeds (in my project I have 12 regression models, and each one should be run with 30 different seeds).
I need a solution where I can run the 30 seeds for one linear regression model automatically, so I can be away from my laptop while it runs, and then do the same for the second regression model.
Is there a way to run the code over the 30 different seeds automatically, so that I get a result for each seed?
I hope everything is clear and I am happy to clarify more.
NOTE
Bear in mind that I have four related blocks with each regression model, so any change to set.seed or createFolds may affect the other blocks.
EDIT1
The dataset used
library(caret) # for createFolds()
wdbc <- read.delim("airfoil_self_noise.dat",header=F)
wdbcc=as.data.frame(scale(wdbc))
#set.seed(1)
#set.seed(2)
#set.seed(3)
#set.seed(4)
...
k = 30
folds <- createFolds(wdbcc$V6, k = k, list = TRUE, returnTrain = TRUE)
## Ordinary Least Square regression ##
#Block A
lm = list()
for (i in 1:k) {
lm[[i]] = lm(V6~ ., data = wdbcc[folds[[i]],])
}
#Block B
lm_coef = list()
lm_coef_var = list()
for(j in 1:(lm[[1]]$coefficients %>% length())){
for(i in 1:k){
lm_coef[[i]] = lm[[i]]$coefficients[j]
lm_coef_var[[j]] = lm_coef %>% unlist() %>% var()
}
}
#Block C
lm_var = unlist(lm_coef_var)
lm_df = cbind(coefficients = lm[[1]]$coefficients%>% names() %>% as.data.frame()
, variance = lm_var %>% as.data.frame())
colnames(lm_df) = c("coefficients", "variance_lm")
#Block D
lm_var_sum = sum(lm_var)
PQSQ-Regression
X=list()
Y=list()
for (i in 1:k) {
n=wdbcc[folds[[i]],-6]
m=wdbcc[folds[[i]],6]
X[[i]]=n
Y[[i]]=m
}
#Block A
lmPQSQ1 = list()
for (i in 1:k) {
lmPQSQ1[[i]] = PQSQRegression(X[[i]],Y[[i]],0.01,data = wdbcc[folds[[i]],])
}
lmmPQSQ1=list()
for (i in 1:k) {
L=list(coefficients = c(lmPQSQ1[[i]][[2]],lmPQSQ1[[i]][[1]]))
lmmPQSQ1[[i]]=L
}
#Block B
lm_coefPQSQ1 = list()
lm_coef_varPQSQ1 = list()
for(j in 1:(lmmPQSQ1[[1]]$coefficients %>% length())){
for(i in 1:k){
lm_coefPQSQ1[[i]] = lmmPQSQ1[[i]]$coefficients[j]
lm_coef_varPQSQ1[[j]] = lm_coefPQSQ1 %>% unlist() %>% var()
}
}
#Block C
lm_varPQSQ1 = unlist(lm_coef_varPQSQ1)
lm_dfPQSQ1 = variance = lm_varPQSQ1 %>% as.data.frame()
#Block D
PQSQ1_var_sum = sum(lm_varPQSQ1)
If I understand you correctly, you want to regress V6 on all the other variables using both OLS and a LAD model. You want to select k=30 random "folds" using createFolds and repeat the process n=30 times. As a result, you want the variances for each repetition and each coefficient.
I would wrap the fitting part into a function FX, generate n=30 seeds with sample, and loop over them with lapply to repeat FX n=30 times.
FX <- function(seed, data, k=30) {
set.seed(seed) ## sets seed for each iteration
folds <- createFolds(data[, "V6"], k=k, list=TRUE, returnTrain=TRUE) ## folds
## OLS
lm1 <- lapply(folds, function(folds) lm(V6 ~ ., data=data[folds, ]))
lm.coefs <- t(sapply(lm1, coef)) ## lm coefficients
## LAD
lad1 <- lapply(folds, function(folds) lad(V6 ~ ., data=data[folds, ], method="BR"))
lad.coefs <- t(sapply(lad1, coef)) ## lad coefficients
## calculate column variances for both coef matrices
## use `attr<-` to add the seed as an attribute if you want
return(`attr<-`(cbind(lm=apply(lm.coefs, 2, var), lad=apply(lad.coefs, 2, var)),
"seed", seed))
}
seeds <- 1:30 ## specific seeds 1, 2, ... 30
## if you want non-consecutive specific seeds, do:
# set.seed(42) ## set some initial seed
# n <- 30 ## n. o. seeds
# seeds <- sample(1:1e6, n) ## sample seeds for `FX`
res <- lapply(seeds, FX, data=wdbcc) ## lapply loop
Result
This results in a list of length 30 with variance matrices for each repetition, each model, and each coefficient.
res[1:2] ## first two lists
# [[1]]
# lm lad
# (Intercept) 9.104280e-06 1.273920e-05
# V1 2.609623e-05 6.992753e-05
# V2 7.082099e-05 2.075875e-05
# V3 1.352299e-05 1.209651e-05
# V4 7.986000e-06 9.273005e-06
# V5 5.545298e-05 1.535849e-05
# attr(,"seed")
# [1] 1
#
# [[2]]
# lm lad
# (Intercept) 4.558722e-06 2.031707e-05
# V1 2.256583e-05 9.291900e-05
# V2 6.519648e-05 2.768443e-05
# V3 1.800889e-05 9.983524e-06
# V4 1.131813e-05 1.174496e-05
# V5 3.866105e-05 1.022452e-05
# attr(,"seed")
# [1] 2
length(res)
# [1] 30
To calculate the sum of variances for each seed you may use colSums in an sapply.
# sum of variances
sov <- t(sapply(res, colSums))
dim(sov)
# [1] 30 2
head(sov)
# lm lad
# [1,] 1.829835e-04 0.0001401535
# [2,] 1.603091e-04 0.0001728735
# [3,] 1.003093e-04 0.0001972869
# [4,] 1.460591e-04 0.0001508251
# [5,] 9.915082e-05 0.0001262106
# [6,] 1.425996e-04 0.0001478449
To understand what one iteration of the lapply does, consider this:
## provide the values of first iteration for arguments of function `FX`
seed <- 1
data <- wdbcc
k <- 30
## first iteration of `lapply`
set.seed(seed)
folds <- createFolds(data[, "V6"], k=k, list=TRUE, returnTrain=TRUE) ## folds
## OLS
lm1 <- lapply(folds, function(folds) lm(V6 ~ ., data=data[folds, ]))
lm.coefs <- t(sapply(lm1, coef)) ## lm coefficients
dim(lm.coefs)
# [1] 30 6
head(lm.coefs)
# (Intercept) V1 V2 V3 V4 V5
# Fold01 -0.0039130125 -0.5806272 -0.3564769 -0.4804492 0.2271908 -0.2805472
# Fold02 0.0013260444 -0.5863764 -0.3533327 -0.4759213 0.2253128 -0.2874691
# Fold03 0.0006791787 -0.5890787 -0.3678586 -0.4832066 0.2220979 -0.2739124
# Fold04 -0.0010721593 -0.5868079 -0.3722466 -0.4895328 0.2227811 -0.2749657
# Fold05 0.0021856620 -0.5850165 -0.3495360 -0.4810657 0.2235410 -0.2936287
# Fold06 0.0001486909 -0.5872607 -0.3677774 -0.4848523 0.2275780 -0.2823764
## LAD (same as OLS)
lad1 <- lapply(folds, function(folds) lad(V6 ~ ., data=data[folds, ], method="BR"))
lad.coefs <- t(sapply(lad1, coef)) ## lad coefficients
## return the variances for each coefficient of each model in a matrix
## the seed is added as an attribute, to be able to identify it later
res.1 <- `attr<-`(cbind(var.lm=apply(lm.coefs, 2, var),
var.lad=apply(lad.coefs, 2, var)),
"seed", seed)
res.1
# var.lm var.lad
# (Intercept) 9.104280e-06 1.273920e-05
# V1 2.609623e-05 6.992753e-05
# V2 7.082099e-05 2.075875e-05
# V3 1.352299e-05 1.209651e-05
# V4 7.986000e-06 9.273005e-06
# V5 5.545298e-05 1.535849e-05
# attr(,"seed")
# [1] 1
Compare res.1 with the first element of list res above.
sov.1 <- colSums(res.1)
sov.1
# var.lm var.lad
# 0.0001829835 0.0001401535
Compare sov.1 with the first row of matrix sov above.
Edit
For regression functions with matrix notation, such as lm.fit, we may use model.matrix and do the subsetting beforehand; see the line lm2.coefs in the function, and compare the lm and lm2 columns in res.2 below, they're equal. (lm.fit is also faster than lm, because it omits unnecessary calculations and you just need the coefficients; hence you may actually replace the lm line with lm.fit. There might also be a way to do this with lad using lsfit, but honestly I'm too unfamiliar with lad to provide that solution.)
Also notice that, for the sake of brevity, I merged the two lines per model into one by using sapply directly on the $coefficients. sapply works like lapply but returns a matrix; note that we need to transpose.
FX2 <- function(seed, data, k=30) {
set.seed(seed) ## sets seed for each iteration
folds <- createFolds(data[, "V6"], k=k, list=TRUE, returnTrain=TRUE) ## draw folds
lm.coefs <- t(sapply(folds, function(f) lm(V6 ~ ., data=data[f, ])$coef))
lm2.coefs <- t(sapply(folds, function(f) {
data2 <- data[f, ]
lm.fit(x=model.matrix(V6 ~ ., data2), y=data2[,"V6"])$coef
}))
lad.coefs <- t(sapply(folds, function(f) lad(V6 ~ ., data=data[f, ], method="BR")$coef))
return(`attr<-`(cbind(lm=apply(lm.coefs, 2, var),
lm2=apply(lm2.coefs, 2, var),
lad=apply(lad.coefs, 2, var)),
"seed", seed))
}
seeds <- 1:30
res.2 <- lapply(seeds, FX2, data=wdbcc) ## lapply loop
res.2[1:2]
# [[1]]
# lm lm2 lad
# (Intercept) 9.104280e-06 9.104280e-06 1.273920e-05
# V1 2.609623e-05 2.609623e-05 6.992753e-05
# V2 7.082099e-05 7.082099e-05 2.075875e-05
# V3 1.352299e-05 1.352299e-05 1.209651e-05
# V4 7.986000e-06 7.986000e-06 9.273005e-06
# V5 5.545298e-05 5.545298e-05 1.535849e-05
# attr(,"seed")
# [1] 1
#
# [[2]]
# lm lm2 lad
# (Intercept) 4.558722e-06 4.558722e-06 2.031707e-05
# V1 2.256583e-05 2.256583e-05 9.291900e-05
# V2 6.519648e-05 6.519648e-05 2.768443e-05
# V3 1.800889e-05 1.800889e-05 9.983524e-06
# V4 1.131813e-05 1.131813e-05 1.174496e-05
# V5 3.866105e-05 3.866105e-05 1.022452e-05
# attr(,"seed")
# [1] 2
Data and libraries:
invisible(lapply(c("caret", "L1pack"), library, character.only=TRUE))
wdbcc <- read.delim("airfoil_self_noise.dat", header=F)
wdbcc[] <- lapply(wdbcc, scale)
I was wondering why lm() says 5 coefficients are not defined because of singularities and then gives all NA in the summary output for those 5 coefficients.
Note that all my predictors are categorical.
Is there anything wrong with my data or code regarding these 5 coefficients? How can I fix this?
d <- read.csv("https://raw.githubusercontent.com/rnorouzian/m/master/v.csv", h = T) # Data
nms <- c("Age","genre","Length","cf.training","error.type","cf.scope","cf.type","cf.revision")
d[nms] <- lapply(d[nms], as.factor) # make factor
vv <- lm(dint~Age+genre+Length+cf.training+error.type+cf.scope+cf.type+cf.revision, data = d)
summary(vv)
First 6 lines of output:
Coefficients: (5 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.17835 0.63573 0.281 0.779330
Age1 -0.04576 0.86803 -0.053 0.958010
Age2 0.46431 0.87686 0.530 0.596990
Age99 -1.64099 1.04830 -1.565 0.118949
genre2 1.57015 0.55699 2.819 0.005263 **
genre4 NA NA NA NA ## for example, this row is all NAs; there are 4 more!
As others noted, one problem is that you seem to have multicollinearity. Another is that there are missing values in your dataset. The missing values should probably just be removed. As for correlated variables, you should inspect your data to identify this collinearity and remove it. Deciding which variables to remove and which to retain is a very domain-specific topic. However, you could, if you wish, decide to use regularisation and fit a model while retaining all variables. This also allows you to fit a model when n (number of samples) is less than p (number of predictors).
I've shown code below that demonstrates how to examine the correlation structure within your data and to identify which variables are most correlated (thanks to this answer). I've also included an example of fitting such a model, using L2 regularisation (commonly known as ridge regression).
d <- read.csv("https://raw.githubusercontent.com/rnorouzian/m/master/v.csv", h = T) # Data
nms <- c("Age","genre","Length","cf.training","error.type","cf.scope","cf.type","cf.revision")
d[nms] <- lapply(d[nms], as.factor) # make factor
vv <- lm(dint~Age+genre+Length+cf.training+error.type+cf.scope+cf.type+cf.revision, data = d)
df <- d
df[] <- lapply(df, as.numeric)
cor_mat <- cor(as.matrix(df), use = "complete.obs")
library("gplots")
heatmap.2(cor_mat, trace = "none")
## https://stackoverflow.com/questions/22282531/how-to-compute-correlations-between-all-columns-in-r-and-detect-highly-correlate
library("tibble")
library("dplyr")
library("tidyr")
d2 <- df %>%
as.matrix() %>%
cor(use = "complete.obs") %>%
## Set diag (a vs a) to NA, then remove
(function(x) {
diag(x) <- NA
x
}) %>%
as.data.frame %>%
rownames_to_column(var = 'var1') %>%
gather(var2, value, -var1) %>%
filter(!is.na(value)) %>%
## Sort by decreasing absolute correlation
arrange(-abs(value))
## 2 pairs of variables are almost exactly correlated!
head(d2)
#> var1 var2 value
#> 1 id study.name 0.9999430
#> 2 study.name id 0.9999430
#> 3 Location timed 0.9994082
#> 4 timed Location 0.9994082
#> 5 Age ed.level 0.7425026
#> 6 ed.level Age 0.7425026
## Remove some variables here, or maybe try regularized regression (see below)
library("glmnet")
## glmnet requires matrix input
X <- d[, c("Age", "genre", "Length", "cf.training", "error.type", "cf.scope", "cf.type", "cf.revision")]
X[] <- lapply(X, as.numeric)
X <- as.matrix(X)
ind_na <- apply(X, 1, function(row) any(is.na(row)))
X <- X[!ind_na, ]
y <- d[!ind_na, "dint"]
glmnet <- glmnet(
x = X,
y = y,
## alpha = 0 is ridge regression
alpha = 0)
plot(glmnet)
Created on 2019-11-08 by the reprex package (v0.3.0)
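If you do settle on ridge regression, cross-validation can choose the penalty for you; a minimal sketch reusing the X and y built above (lambda.min is the penalty with the lowest cross-validated error):
## cross-validated ridge; alpha = 0 keeps the L2 penalty
cv_fit <- cv.glmnet(x = X, y = y, alpha = 0)
cv_fit$lambda.min
coef(cv_fit, s = "lambda.min") # coefficients at that penalty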
In such a situation you can use the "olsrr" package in R for stepwise regression analysis. Here is sample code to do stepwise regression analysis in R:
library("olsrr")
#Load the data
d <- read.csv("https://raw.githubusercontent.com/rnorouzian/m/master/v.csv", h = T)
# stepwise regression
vv <- lm(dint ~ Age + genre + Length + cf.training + error.type + cf.scope + cf.type + cf.revision, data = d)
summary(vv)
k <- ols_step_both_p(vv, pent = 0.05, prem = 0.1)
# stepwise regression plot
plot(k)
# final model
k$model
It will provide you exactly the same output as that of SPSS.
I'm trying to efficiently implement a block bootstrap technique to get the distribution of regression coefficients. The main outline is as follows.
I have a panel data set, and say firm and year are the indices. For each iteration of the bootstrap, I wish to sample n subjects with replacement. From this sample, I need to construct a new data frame that is an rbind() stack of all the observations for each sampled subject, run the regression, and pull out the coefficients. Repeat for a bunch of iterations, say 100.
Each firm can potentially be selected multiple times, so I need to include its data multiple times in each iteration's data set.
Using a loop and subset approach, like below, seems computationally burdensome.
Note that for my real problem, the data frame, n, and the number of iterations are all much larger than in the example below.
My thoughts initially are to break the existing data frame into a list by subject using the split() command. From there, use
sample(unique(df1$subject),n,replace=TRUE)
to get the new list, then perhaps implement quickdf from the plyr package to construct a new data frame.
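Roughly, I imagine something like this (untested):
## split once up front, then index the list by the sampled subjects
df1.split <- split(df1, df1$subject)
samp <- sample(unique(df1$subject), n, replace = TRUE)
newdata <- do.call(rbind, df1.split[as.character(samp)])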
Example slow code:
require(plm)
data("Grunfeld", package="plm")
firms = unique(Grunfeld$firm)
n = 10
iterations = 100
mybootresults=list()
for(j in 1:iterations){
v = sample(length(firms),n,replace=TRUE)
newdata = NULL
for(i in 1:n){
newdata = rbind(newdata,subset(Grunfeld, firm == v[i]))
}
reg1 = lm(value ~ inv + capital, data = newdata)
mybootresults[[j]] = coefficients(reg1)
}
mybootresults = as.data.frame(t(matrix(unlist(mybootresults),ncol=iterations)))
names(mybootresults) = names(reg1$coefficients)
mybootresults
(Intercept) inv capital
1 373.8591 6.981309 -0.9801547
2 370.6743 6.633642 -1.4526338
3 528.8436 6.960226 -1.1597901
4 331.6979 6.239426 -1.0349230
5 507.7339 8.924227 -2.8661479
...
...
How about something like this:
library(boot) # for boot()
myfit <- function(x, i) {
mydata <- do.call("rbind", lapply(i, function(n) subset(Grunfeld, firm==x[n])))
coefficients(lm(value ~ inv + capital, data = mydata))
}
firms <- unique(Grunfeld$firm)
b0 <- boot(firms, myfit, 999)
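Each replicate's coefficients then land in one row of b0$t, so the bootstrap distribution of any single coefficient is just a column of that matrix.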
You can also use the tsboot function in the boot package with a fixed-block resampling scheme.
require(plm)
require(boot)
data(Grunfeld)
### each firm is of length 20
table(Grunfeld$firm)
## 1 2 3 4 5 6 7 8 9 10
## 20 20 20 20 20 20 20 20 20 20
blockboot <- function(data)
{
coefficients(lm(value ~ inv + capital, data = data))
}
### fixed length (every 20 obs, so for each different firm) block bootstrap
set.seed(321)
boot.1 <- tsboot(Grunfeld, blockboot, R = 99, l = 20, sim = "fixed")
boot.1
## Bootstrap Statistics :
## original bias std. error
## t1* 410.81557 -25.785972 174.3766
## t2* 5.75981 0.451810 2.0261
## t3* -0.61527 0.065322 0.6330
dim(boot.1$t)
## [1] 99 3
head(boot.1$t)
## [,1] [,2] [,3]
## [1,] 522.11 7.2342 -1.453204
## [2,] 626.88 4.6283 0.031324
## [3,] 479.74 3.2531 0.637298
## [4,] 557.79 4.5284 0.161462
## [5,] 568.72 5.4613 -0.875126
## [6,] 379.04 7.0707 -1.092860
Here is a method that should typically be faster than the accepted answer, returns the same results and does not rely on additional packages (except boot). The key here is to use which and integer indexing to construct each data.frame replicate rather than split/subset and do.call/rbind.
# get function for boot
myIndex <- function(x, i) {
# select the observations to subset. Likely repeated observations
blockObs <- unlist(lapply(i, function(n) which(x[n] == Grunfeld$firm)))
# run regression for given replicate, return estimated coefficients
coefficients(lm(value~ inv + capital, data=Grunfeld[blockObs,]))
}
Now, bootstrap
# get result
library(boot)
set.seed(1234)
b1 <- boot(firms, myIndex, 200)
Run the accepted answer
set.seed(1234)
b0 <- boot(firms, myfit, 200)
Let's eyeball a comparison
Using indexing
b1
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = firms, statistic = myIndex, R = 200)
Bootstrap Statistics :
original bias std. error
t1* 410.8155650 -6.64885086 197.3147581
t2* 5.7598070 0.37922066 2.4966872
t3* -0.6152727 -0.04468225 0.8351341
Original version
b0
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = firms, statistic = myfit, R = 200)
Bootstrap Statistics :
original bias std. error
t1* 410.8155650 -6.64885086 197.3147581
t2* 5.7598070 0.37922066 2.4966872
t3* -0.6152727 -0.04468225 0.8351341
These look pretty close. Now, a bit more checking
identical(b0$t, b1$t)
[1] TRUE
and
identical(summary(b0), summary(b1))
[1] TRUE
Finally, we'll do a quick benchmark
library(microbenchmark)
microbenchmark(index={b1 <- boot(firms, myIndex, 200)},
rbind={b0 <- boot(firms, myfit, 200)})
On my computer, this returns
Unit: milliseconds
expr min lq mean median uq max neval
index 292.5770 296.3426 303.5444 298.4836 301.1119 395.1866 100
rbind 712.1616 720.0428 729.6644 724.0777 731.0697 833.5759 100
So, direct indexing is more than 2 times faster at every level of the distribution.
Note on missing fixed effects
As with most of the answers, the issue of missing "fixed effects" may emerge. Commonly, fixed effects are used as controls, and the researcher is interested in one or a couple of variables that will be included with every selected observation. In this common case, there is no (or very little) harm in restricting the returned result of the myIndex or myfit function to include only the variables of interest.
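For instance, if only the inv coefficient matters, the statistic can return just that one number; a sketch adapting myIndex above (myIndexInv and b.inv are names I made up):
myIndexInv <- function(x, i) {
  blockObs <- unlist(lapply(i, function(n) which(x[n] == Grunfeld$firm)))
  # keep only the coefficient of interest
  coefficients(lm(value ~ inv + capital, data = Grunfeld[blockObs, ]))["inv"]
}
b.inv <- boot(firms, myIndexInv, 200)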
The solution needs to be modified to manage fixed effects.
library(boot) # for boot
library(plm) # for Grunfeld
library(dplyr) # for left_join
## Get the Grunfeld firm data (10 firms, each for 20 years, 1935-1954)
data("Grunfeld", package="plm")
## Create dataframe with unique firm identifier (one line per firm)
firms <- data.frame(firm=unique(Grunfeld$firm),junk=1)
## for boot(), X is the firms dataframe; i index the sampled firms
myfit <- function(X, i) {
## join the sampled firms to their firm-year data
mydata <- left_join(X[i,], Grunfeld, by="firm")
## Distinguish between multiple resamples of the same firm
## Otherwise they have the same id in the fixed effects regression
## And trouble ensues
mydata <- mutate(group_by(mydata,firm,year),
firm_uniq4boot = paste(firm,"+",row_number())
)
## Run regression with and without firm fixed effects
c(coefficients(lm(value ~ inv + capital, data = mydata)),
coefficients(lm(value ~ inv + capital + factor(firm_uniq4boot), data = mydata)))
}
set.seed(1)
system.time(b <- boot(firms, myfit, 1000))
summary(b)
summary(lm(value ~ inv + capital, data=Grunfeld))
summary(lm(value ~ inv + capital + factor(firm), data=Grunfeld))
I found a method using dplyr::left_join that is a bit more concise, only takes about 60% as long, and gives the same results as in the answer by Sean. Here's a complete self-contained example.
library(boot) # for boot
library(plm) # for Grunfeld
library(dplyr) # for left_join
# First get the data
data("Grunfeld", package="plm")
firms <- unique(Grunfeld$firm)
myfit1 <- function(x, i) {
# x is the vector of firms
# i are the indexes into x
mydata <- do.call("rbind", lapply(i, function(n) subset(Grunfeld, firm==x[n])))
coefficients(lm(value ~ inv + capital, data = mydata))
}
myfit2 <- function(x, i) {
# x is the vector of firms
# i are the indexes into x
mydata <- left_join(data.frame(firm=x[i]), Grunfeld, by="firm")
coefficients(lm(value ~ inv + capital, data = mydata))
}
# rbind method
set.seed(1)
system.time(b1 <- boot(firms, myfit1, 5000))
## user system elapsed
## 13.51 0.01 13.62
# left_join method
set.seed(1)
system.time(b2 <- boot(firms, myfit2, 5000))
## user system elapsed
## 8.16 0.02 8.26
b1
## original bias std. error
## t1* 410.8155650 9.2896499 198.6877889
## t2* 5.7598070 0.5748503 2.5725441
## t3* -0.6152727 -0.1200954 0.7829191
b2
## original bias std. error
## t1* 410.8155650 9.2896499 198.6877889
## t2* 5.7598070 0.5748503 2.5725441
## t3* -0.6152727 -0.1200954 0.7829191