I want to understand how my parallelization is working when there is a for-loop structure inside of the structure that I am parallelizing.
I have a routine called reg_simulation(), which generates 100 estimations (nrep = 100) of a linear regression, each using a different seed (seed <- seed + i).
Additionally, I wrapped the reg_simulation() routine inside par_wrapper() to run it under different configurations of the data generating process, in particular by changing the number of observations (obs) and the error term variance (sigma). Finally, I parallelized this structure using pblapply.
With this setup, I am using a grid of obs = c(250, 500, 750, 1000, 2500) and
sigma = c(0.1, 0.2, 0.5, 0.8, 1), meaning 5 values for each variable, leading to 25 combinations of the two variables. Each of these 25 combinations is run 100 times.
Finally, here is my question:
My code is...
(a) Running in parallel 25 combinations but serially the 100 repetition inside of them.
(b) Running in parallel all the 2500 models.
If the answer is (a), please let me know how you arrived at that conclusion, because I haven't been able to work it out yet, and it probably implies that I should change my code structure.
Some additional comments: (1) The seed declaration on each iteration is important because it allows me to recover each possible combination of the data (e.g., iteration 78 (seed = 78), with sigma=0.1 and obs=1000) (2) I am using pblapply because I want to track my code simulations' progress.
Here are the aforementioned routines:
reg_simulation()
reg_simulation<- function(obs = 1000,
sigma = 0.5,
nrep = 10 ,
seed = 0){
# pre-allocate the list that stores the results
res <- vector("list", nrep)
# Forloop
for ( i in 1:nrep) {
#Changing seed each iteration
seed <- seed + i
#set seed
set.seed(seed)
#DGP
x1 <- rnorm(obs, 0 , sigma)
x2 <- rnorm(obs, 0 , sigma)
y <- 1 + 0.5* x1 + 1.5 * x2 + rnorm(obs, 0 , 1)
#Estimate OLS
ols <- lm(y ~ x1 + x2)
returnlist <- list(intercept = ols$coefficients[1],
beta1 = ols$coefficients[2],
beta2 = ols$coefficients[3],
seed = seed)
#save each iteration
res[[i]] <- returnlist
}
return(res)
}
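A quick check of the seed sequence, offered as a sketch: because seed is overwritten on every pass, seed <- seed + i accumulates (1, 3, 6, 10, ...) instead of giving seed = i. If iteration 78 is meant to use seed 78, calling set.seed(seed + i) without reassigning seed is one option; that reading of the intent is an assumption based on the example in the question.

```r
# Reproduce only the seed update from the loop in reg_simulation()
seed <- 0
seeds <- numeric(5)
for (i in 1:5) {
  seed <- seed + i      # cumulative update, as in the original loop
  seeds[i] <- seed
}
seeds                   # 1 3 6 10 15

# Alternative under the assumption that seed i is wanted at iteration i
seeds_alt <- 0 + 1:5    # base seed 0 plus the iteration index
seeds_alt               # 1 2 3 4 5
```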
par_wrapper()
### parallel wrapper
par_wrapper <- function(obs = c(250,500,750,1000,2500),
sigma = c(0.1, 0.2, 0.5, 0.8 , 1 ) ,
nrep = 10,
nClusters = 4)
{
require(parallel)
require(pbapply)
#grid of searching space
prs <- expand.grid(obs = obs,
sigma = sigma)
nprs <- nrow(prs)
rownames(prs) <- c(1:NROW(prs))
#Print number of combinations
print(prs)
#### ---- PARALLEL INIT ---- ####
## Parallel options
cl <- makeCluster(nClusters)
## Attaching necessary functions for internal computations
parallel::clusterExport(cl= cl,
list("reg_simulation"))
# pblapply
par_simres <- pblapply(cl = cl,
X = 1:nprs,
FUN = function(i){
reg_simulation(
sigma = prs$sigma[i],
obs = prs$obs[i],
nrep = nrep,
seed = 0)})
##exit cluster mode
stopCluster(cl)
return(par_simres)
}
Using the par_wrapper() function over a grid.
#using generated structure.
res_list <- par_wrapper(
obs = c(250,500,750,1000, 2500 ),
sigma = c(0.1, 0.2, 0.5, 0.8 , 1 ) ,
nrep = 100,
nClusters = 4)
Console output.
# obs sigma
# 1 250 0.1
# 2 500 0.1
# 3 750 0.1
# 4 1000 0.1
# 5 2500 0.1
# 6 250 0.2
# 7 500 0.2
# 8 750 0.2
# 9 1000 0.2
# 10 2500 0.2
# 11 250 0.5
# 12 500 0.5
# 13 750 0.5
# 14 1000 0.5
# 15 2500 0.5
# 16 250 0.8
# 17 500 0.8
# 18 750 0.8
# 19 1000 0.8
# 20 2500 0.8
# 21 250 1.0
# 22 500 1.0
# 23 750 1.0
# 24 1000 1.0
# 25 2500 1.0
# |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s
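A note on how the work is split, as a sketch rather than a definitive answer: pblapply(X, FUN, cl = cl) hands elements of X to the workers, and each call of FUN runs from start to finish on a single worker. With X = 1:nprs that corresponds to (a): the 25 combinations are distributed, and the for loop over nrep inside reg_simulation() runs serially within each call. One way to move towards (b) is to make every row of the task table a single model. The helper one_fit() and the table tasks below are hypothetical names, and seed = seed0 + rep is an assumption about the intended seeding.

```r
# Hypothetical helper: fit ONE model for a given (obs, sigma, rep) triple
one_fit <- function(obs, sigma, rep, seed0 = 0) {
  seed <- seed0 + rep                 # assumed seeding rule: seed equals rep
  set.seed(seed)
  x1 <- rnorm(obs, 0, sigma)
  x2 <- rnorm(obs, 0, sigma)
  y  <- 1 + 0.5 * x1 + 1.5 * x2 + rnorm(obs, 0, 1)
  ols <- lm(y ~ x1 + x2)
  c(obs = obs, sigma = sigma, seed = seed, coef(ols))
}

# Small grid for illustration: 2 x 2 combinations x 3 repetitions = 12 tasks
tasks <- expand.grid(obs = c(250, 500), sigma = c(0.1, 0.5), rep = 1:3)

# lapply is used as a serial stand-in here; each element is one model,
# so a parallel apply over seq_len(nrow(tasks)) would distribute single fits
res <- lapply(seq_len(nrow(tasks)), function(k) {
  one_fit(tasks$obs[k], tasks$sigma[k], tasks$rep[k])
})
res_mat <- do.call(rbind, res)        # one row per fitted model
```

With the full grid and rep = 1:100 the table would have 2500 rows. Whether that is faster than 25 larger tasks depends on the per-task overhead, since each individual fit is very cheap.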
Related
I have the following function that uses nested loops, and honestly I'm not sure how to make the code run more efficiently. It runs fine for 100 sims in my opinion, but when I ran it for 2000 sims it took almost 12 seconds.
This code generates any number n of Brownian motion simulations and works well. The issue is that once the number of simulations grows to 500 or more it starts to bog down, and at 2000 it is pretty slow (about 12 seconds).
Here is the function:
ts_brownian_motion <- function(.time = 100, .num_sims = 10, .delta_time = 1,
.initial_value = 0) {
# Coerce inputs ----
T <- as.numeric(.time)
N <- as.numeric(.num_sims)
delta_t <- as.numeric(.delta_time)
initial_value <- as.numeric(.initial_value)
# Checks ----
if (!is.numeric(T) | !is.numeric(N) | !is.numeric(delta_t) | !is.numeric(initial_value)){
rlang::abort(
message = "All parameters must be numeric values.",
use_cli_format = TRUE
)
}
# Initialize empty data.frame to store the simulations
sim_data <- data.frame()
# Generate N simulations
for (i in 1:N) {
# Initialize the current simulation with the starting value (initial_value)
sim <- c(initial_value)
# Generate the brownian motion values for each time step
for (t in 1:(T / delta_t)) {
sim <- c(sim, sim[t] + rnorm(1, mean = 0, sd = sqrt(delta_t)))
}
# Bind the time steps, simulation values, and simulation number together in a data.frame and add it to the result
sim_data <- rbind(
sim_data,
data.frame(
t = seq(0, T, delta_t),
y = sim,
sim_number = i
)
)
}
# Clean up
sim_data <- sim_data %>%
dplyr::as_tibble() %>%
dplyr::mutate(sim_number = forcats::as_factor(sim_number)) %>%
dplyr::select(sim_number, t, y)
# Return ----
attr(sim_data, ".time") <- .time
attr(sim_data, ".num_sims") <- .num_sims
attr(sim_data, ".delta_time") <- .delta_time
attr(sim_data, ".initial_value") <- .initial_value
return(sim_data)
}
Here is some output of the function:
> ts_brownian_motion(.time = 10, .num_sims = 25)
# A tibble: 275 × 3
sim_number t y
<fct> <dbl> <dbl>
1 1 0 0
2 1 1 -2.13
3 1 2 -1.08
4 1 3 0.0728
5 1 4 0.562
6 1 5 0.255
7 1 6 -1.28
8 1 7 -1.76
9 1 8 -0.770
10 1 9 -0.536
# … with 265 more rows
# ℹ Use `print(n = ...)` to see more rows
As suggested in the comments, if you want speed, you should use cumsum. You need to be clear what type of Brownian Motion you want (arithmetic, geometric). For geometric Brownian motion, you'll need to correct the approximation error by adjusting the mean. As an example, the NMOF package (which I maintain), contains a function gbm that implements geometric Brownian Motion through cumsum. Here is an example call for 2000 paths with 100 timesteps each.
library("NMOF")
library("zoo") ## for plotting
timesteps <- 100
system.time(b <- NMOF::gbm(2000, tau = 1, timesteps = 100, r = 0, v = 1))
## user system elapsed
## 0.013 0.000 0.013
dim(b) ## each column is one path, starting at time zero
## [1] 101 2000
plot(zoo(b[, 1:5], 0:timesteps), plot.type = "single")
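For the arithmetic case asked about in the question, the cumsum idea can be sketched in base R as follows. The function name bm_paths and its argument names are hypothetical; the sketch draws all increments at once and accumulates them column by column, instead of growing vectors and data frames inside loops.

```r
# Sketch: arithmetic Brownian motion paths via cumsum (hypothetical helper)
bm_paths <- function(time = 100, num_sims = 10, delta_time = 1,
                     initial_value = 0) {
  steps <- as.integer(round(time / delta_time))
  # One column per simulation, one row per time step
  incr <- matrix(rnorm(steps * num_sims, mean = 0, sd = sqrt(delta_time)),
                 nrow = steps, ncol = num_sims)
  # Prepend the starting value, then accumulate down each column
  paths <- apply(rbind(initial_value, incr), 2, cumsum)
  # Long format, matching the columns of the original function's output
  data.frame(
    sim_number = factor(rep(seq_len(num_sims), each = steps + 1)),
    t = rep(seq(0, by = delta_time, length.out = steps + 1), times = num_sims),
    y = as.vector(paths)
  )
}

out <- bm_paths(time = 10, num_sims = 25)
dim(out)   # 275 rows, 3 columns
```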
Let's assume we want to generate n samples from a multinomial distribution with given probabilities p. This works well with sample or rmultinom. The totals can then be counted with table. Now I wonder if there is a direct way (or another distribution) to get the result of table directly, without generating complete sample vectors.
Here is an example:
set.seed(123)
n <- 10000 # sample size
p <- c(0.1, 0.2, 0.7) # probabilities, sum up to 1.0
## 1) approach with sample
x <- sample(1:3, size = n, prob = p, replace = TRUE)
table(x)
# x
# 1 2 3
# 945 2007 7048
## 2) approach with rmultinom
x <- rmultinom(n, size = 1, prob = p) * 1:3
table(x[x != 0])
# 1 2 3
# 987 1967 7046
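One direct route, shown as a minimal sketch: rmultinom(1, size = n, prob = p) draws a single multinomial vector whose entries are already the category totals, so no sample vector of length n has to be built and tabulated. The variable name counts is only for illustration.

```r
set.seed(123)
n <- 10000               # sample size
p <- c(0.1, 0.2, 0.7)    # probabilities, sum to 1

# One draw with size = n returns the totals per category directly
counts <- rmultinom(1, size = n, prob = p)[, 1]
names(counts) <- seq_along(p)
counts
sum(counts)              # equals n
```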
I ran a simulation in R, but I cannot work out how to build the final table in R.
rep=100
n=20
# define starting values
mu<-100
sig<-3
theta=c(mu,sig) # store starting values
#Tables
#********
LHE=array(0, c(2, rep));
rownames(LHE)= c("MLE_mu", "MLE_sigma")
bias= array(0, c(2, rep));
rownames(bias)= c("bias_mu", "bias_sigma")
#Simulation {FOR LOOP}
#***********************
set.seed(1)
for(i in 1:rep){
myx <- rnorm(100, 100, 3)
loglikenorm<-function(x, myx) # always use x to hold parameter values
{
mu<-x[1]
sig<-x[2]
n<-length(myx)
loglike<- -n*log(sig*sqrt(2*pi))- sum((1/(2*sig^2))*(myx-mu)^2) # note
# use of sum
loglike<- -loglike
}
result<-nlm(loglikenorm, theta , myx=myx, hessian=TRUE, print.level=1) #ML estimation using nlm
mle<-result$estimate #extract and store mles
LHE[,i]= c(mle[1], mle[2])
bias[,i]= c(mle[1]-theta[1], mle[2]-theta[2])
} # end for i
L <-round(apply(LHE, 1, mean), 3) # MLE of all the applied iterations
bs <-round(apply(bias,1, mean),3) # bias of all the applied iterations
row<- c(L, bs); row
This runs the MLE 100 times. I want to compute the MLE for different sample sizes (n = 20, 50, 100) and different sets of parameters: (mu = 100, sigma = 3) and (mu = 80, sigma = 4).
I want two things. First, how do I run the code to compute the MLE for different sample sizes and different sets of parameters? Second, how can I organize the output (like the attached image) in R?
The final table from R
Any help will be appreciated.
I would suggest building a function for the MLE estimation and then applying it to different parameter settings. Here is the function, using your code:
#Function
mymle <- function(n,mu,sig,rep)
{
#Set reps
rep=rep
# define starting values
mu<-mu
sig<-sig
theta=c(mu,sig) # store starting values
#Tables
LHE=array(0, c(2, rep));
rownames(LHE)= c("MLE_mu", "MLE_sigma")
#Bias
bias= array(0, c(2, rep));
rownames(bias)= c("bias_mu", "bias_sigma")
#Simulation
set.seed(1)
#Loop
for(i in 1:rep){
myx <- rnorm(rep, mu, sig)
loglikenorm<-function(x, myx) # always use x to hold parameter values
{
mu<-x[1]
sig<-x[2]
n<-length(myx)
loglike<- -n*log(sig*sqrt(2*pi))- sum((1/(2*sig^2))*(myx-mu)^2) # note
# use of sum
loglike<- -loglike
}
result<-nlm(loglikenorm, theta , myx=myx, hessian=TRUE, print.level=1) #ML estimation using nlm
mle<-result$estimate #extract and store mles
LHE[,i]= c(mle[1], mle[2])
bias[,i]= c(mle[1]-theta[1], mle[2]-theta[2])
} # end for i
#Format results
L <-round(apply(LHE, 1, mean), 3) # MLE of all the applied iterations
bs <-round(apply(bias,1, mean),3) # bias of all the applied iterations
row<- c(L, bs)
#Format a label
lab <- paste0('n= ',n,';',' mu= ',mu,';',' sig= ',sig)
row2 <- c(lab,row)
row2 <- as.data.frame(t(row2))
return(row2)
}
Now we apply for different parameter settings:
#Example 1
ex1 <- mymle(n = 20,mu = 100,sig = 3,rep = 100)
ex2 <- mymle(n = 50,mu = 100,sig = 3,rep = 100)
ex3 <- mymle(n = 100,mu = 100,sig = 3,rep = 100)
#Example 2
ex4 <- mymle(n = 20,mu = 80,sig = 4,rep = 100)
ex5 <- mymle(n = 50,mu = 80,sig = 4,rep = 100)
ex6 <- mymle(n = 100,mu = 80,sig = 4,rep = 100)
Finally, we bind all results for an output close to what you want:
#Bind all
df <- rbind(ex1,ex2,ex3,ex4,ex5,ex6)
Output:
V1 MLE_mu MLE_sigma bias_mu bias_sigma
1 n= 20; mu= 100; sig= 3 99.98 3.015 -0.02 0.015
2 n= 50; mu= 100; sig= 3 99.98 3.015 -0.02 0.015
3 n= 100; mu= 100; sig= 3 99.98 3.015 -0.02 0.015
4 n= 20; mu= 80; sig= 4 79.974 4.02 -0.026 0.02
5 n= 50; mu= 80; sig= 4 79.974 4.02 -0.026 0.02
6 n= 100; mu= 80; sig= 4 79.974 4.02 -0.026 0.02
I tested your loop individually for different combinations of parameters and the results were equal. Just as a reminder, by the law of large numbers, as we add more repetitions the results will tend to the true values, so if you change rep to 1000, the values will change.
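One thing worth checking in the function above: n only enters the row label, because the sample is drawn with rnorm(rep, mu, sig), which would explain why rows with the same mu and sig are identical. The following is a sketch of a variant in which n is passed to rnorm. The names mle_sim, negloglik and settings are hypothetical, and the likelihood is written with dnorm(..., log = TRUE) rather than the explicit formula. No output is shown because the numbers depend on the run.

```r
# Sketch: MLE simulation where the sample size n is actually used
mle_sim <- function(n, mu, sig, rep = 100, seed = 1) {
  # Negative log-likelihood of a normal sample; par = c(mu, sigma)
  negloglik <- function(par, x) {
    -sum(dnorm(x, mean = par[1], sd = par[2], log = TRUE))
  }
  set.seed(seed)
  est <- replicate(rep, {
    x <- rnorm(n, mu, sig)                      # sample of size n
    suppressWarnings(nlm(negloglik, c(mu, sig), x = x)$estimate)
  })
  data.frame(n = n, mu = mu, sig = sig,
             MLE_mu     = round(mean(est[1, ]), 3),
             MLE_sigma  = round(mean(est[2, ]), 3),
             bias_mu    = round(mean(est[1, ]) - mu, 3),
             bias_sigma = round(mean(est[2, ]) - sig, 3))
}

# All requested settings in one table, one row per setting
settings <- data.frame(n   = rep(c(20, 50, 100), times = 2),
                       mu  = rep(c(100, 80), each = 3),
                       sig = rep(c(3, 4), each = 3))
tab <- do.call(rbind, Map(mle_sim, settings$n, settings$mu, settings$sig))
tab
```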
I tried to implement the AdaBoost algorithm of Freund and Schapire as close to the original as possible (see p. 2 here: http://rob.schapire.net/papers/explaining-adaboost.pdf):
library(rpart)
library(OneR)
maxdepth <- 1
T <- 100 # number of rounds
# Given: (x_1, y_1),...,(x_m, y_m) where x_i element of X, y_i element of {-1, +1}
myocarde <- read.table("http://freakonometrics.free.fr/myocarde.csv", head = TRUE, sep = ";")
#myocarde <- read.table("data/myocarde.csv", header = TRUE, sep = ";")
y <- (myocarde[ , "PRONO"] == "SURVIE") * 2 - 1
x <- myocarde[ , 1:7]
m <- nrow(x)
data <- data.frame(x, y)
# Initialize: D_1(i) = 1/m for i = 1,...,m
D <- rep(1/m, m)
H <- replicate(T, list())
a <- vector(mode = "numeric", T)
set.seed(123)
# For t = 1,...,T
for(t in 1:T) {
# Train weak learner using distribution D_t
# Get weak hypothesis h_t: X -> {-1, +1}
data_D_t <- data[sample(m, 10*m, replace = TRUE, prob = D), ]
H[[t]] <- rpart(y ~., data = data_D_t, maxdepth = maxdepth, method = "class")
# Aim: select h_t with low weighted error: e_t = Pr_i~D_t[h_t(x_i) != y_i]
h <- predict(H[[t]], x, type = "class")
e <- sum(h != y) / m
# Choose a_t = 0.5 * log((1-e) / e)
a[t] <- 0.5 * log((1-e) / e)
# Update for i = 1,...,m: D_t+1(i) = (D_t(i) * exp(-a_t * y_i * h_t(x_i))) / Z_t
# where Z_t is a normalization factor (chosen so that Dt+1 will be a distribution)
D <- D * exp(-a[t] * y * as.numeric(h))
D <- D / sum(D)
}
# Output the final hypothesis: H(x) = sign(sum of a_t * h_t(x) for t=1 to T)
newdata <- x
H_x <- sapply(H, function(x) as.numeric(as.character(predict(x, newdata = newdata, type = "class"))))
H_x <- t(a * t(H_x))
pred <- sign(rowSums(H_x))
#H
#a
eval_model(pred, y)
##
## Confusion matrix (absolute):
## Actual
## Prediction -1 1 Sum
## -1 0 1 1
## 1 29 41 70
## Sum 29 42 71
##
## Confusion matrix (relative):
## Actual
## Prediction -1 1 Sum
## -1 0.00 0.01 0.01
## 1 0.41 0.58 0.99
## Sum 0.41 0.59 1.00
##
## Accuracy:
## 0.5775 (41/71)
##
## Error rate:
## 0.4225 (30/71)
##
## Error rate reduction (vs. base rate):
## -0.0345 (p-value = 0.6436)
As can be seen the accuracy of the model is horrible compared to other AdaBoost implementations, e.g.:
library(JOUSBoost)
## JOUSBoost 2.1.0
boost <- adaboost(as.matrix(x), y, tree_depth = maxdepth, n_rounds = T)
pred <- predict(boost, x)
eval_model(pred, y)
##
## Confusion matrix (absolute):
## Actual
## Prediction -1 1 Sum
## -1 29 0 29
## 1 0 42 42
## Sum 29 42 71
##
## Confusion matrix (relative):
## Actual
## Prediction -1 1 Sum
## -1 0.41 0.00 0.41
## 1 0.00 0.59 0.59
## Sum 0.41 0.59 1.00
##
## Accuracy:
## 1 (71/71)
##
## Error rate:
## 0 (0/71)
##
## Error rate reduction (vs. base rate):
## 1 (p-value < 2.2e-16)
My question
Could you please give me a hint what went wrong in my implementation? Thank you
Edit
The final and corrected code can be found in my blog post: Understanding AdaBoost – or how to turn Weakness into Strength
There are quite a few contributing factors as to why your implementation is not working.
You were not using rpart correctly. The AdaBoost description does not mention upsampling with the weights, but rpart itself can accept weights. My example below shows how rpart should be used for this purpose.
Calculation of the weighted error was wrong. You were calculating the error proportion (number of samples classified incorrectly divided by the number of samples). AdaBoost uses the sum of the weights of the incorrectly predicted samples (sum(D[y != yhat])).
The final predictions seemed to be incorrect too; I just ended up doing a simple loop.
Next time I recommend diving into the source code of the other implementations you are comparing against.
https://github.com/cran/JOUSBoost/blob/master/R/adaboost.R uses almost identical code to my below example - and probably would have helped guide you originally.
Additionally, using T as a variable could potentially interfere with the logical TRUE and its shorthand T, so I'd avoid it.
### packages ###
library(rpart)
library(OneR)
### parameters ###
maxdepth <- 1
rounds <- 100
set.seed(123)
### data ###
myocarde <- read.table("http://freakonometrics.free.fr/myocarde.csv", head = TRUE, sep = ";")
y <- (myocarde[ , "PRONO"] == "SURVIE") * 2 - 1
x <- myocarde[ , 1:7]
m <- nrow(x)
dataset <- data.frame(x, y)
### initialisation ###
D <- rep(1/m, m)
H <- list()
a <- vector(mode = "numeric", length = rounds)
for (i in seq.int(rounds)) {
# train weak learner
H[[i]] = rpart(y ~ ., data = dataset, weights = D, maxdepth = maxdepth, method = "class")
# predictions
yhat <- predict(H[[i]], x, type = "class")
yhat <- as.numeric(as.character(yhat))
# weighted error
e <- sum(D[yhat != y])
# alpha coefficient
a[i] <- 0.5 * log((1 - e) / e)
# updating weights (D)
D <- D * exp(-a[i] * y * yhat)
D <- D / sum(D)
}
# predict with each weak learner on dataset
y_hat_final <- vector(mode = "numeric", length = m)
for (i in seq(rounds)) {
pred = predict(H[[i]], dataset, type = "class")
pred = as.numeric(as.character(pred))
y_hat_final = y_hat_final + (a[i] * pred)
}
pred <- sign(y_hat_final)
eval_model(pred, y)
> eval_model(pred, y)
Confusion matrix (absolute):
Actual
Prediction -1 1 Sum
-1 29 0 29
1 0 42 42
Sum 29 42 71
Confusion matrix (relative):
Actual
Prediction -1 1 Sum
-1 0.41 0.00 0.41
1 0.00 0.59 0.59
Sum 0.41 0.59 1.00
Accuracy:
1 (71/71)
Error rate:
0 (0/71)
Error rate reduction (vs. base rate):
1 (p-value < 2.2e-16)
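To make the difference between the error proportion and the weighted error concrete, here is a tiny worked example on made-up values (the four weights, labels and predictions are purely illustrative, not taken from the myocarde data). It also shows the effect of one weight update: after normalisation, the misclassified observation carries half of the total weight.

```r
# Toy values, chosen only for illustration
D    <- c(0.1, 0.2, 0.3, 0.4)   # current weights, sum to 1
y    <- c( 1, -1,  1, -1)       # true labels
yhat <- c( 1,  1,  1, -1)       # weak learner predictions (2nd is wrong)

e_prop     <- mean(yhat != y)   # error proportion: 0.25
e_weighted <- sum(D[yhat != y]) # weighted error used by AdaBoost: 0.2

# Alpha coefficient and weight update, as in the loop above
alpha <- 0.5 * log((1 - e_weighted) / e_weighted)   # 0.5 * log(4)
D_new <- D * exp(-alpha * y * yhat)
D_new <- D_new / sum(D_new)

sum(D_new[yhat != y])           # 0.5
```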
I am trying to use code for fully reproducible parallel models in caret but do not understand how to set the size of the vectors in the seed object. For gbm I have 4 tuning parameters with a total of 11 different levels, and I have 54 rows in my tuning grid. If I specify any value < 18 as the last value in the "for(i in 1:10)" line below, I get an error: "Bad seeds: the seed object should be a list of length 11 with 10 integer vectors of size 18 and the last list element having a single integer." Why 18? Also it runs without errors for values > 18 (e.g., 54) - why? Many thanks for the help. The following is based on http://topepo.github.io/caret/training.html, added some things.
library(mlbench)
data(Sonar)
str(Sonar[, 1:10])
library(caret)
library(doParallel)
set.seed(998)
inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training <- Sonar[ inTraining,]
testing <- Sonar[-inTraining,]
grid <- expand.grid(n.trees = seq(50,150,by=50), interaction.depth = seq(1,3,by=1),
shrinkage = seq(.09,.11,by=.01),n.minobsinnode=seq(8,10,by=2))
# set seed to run fully reproducible model in parallel mode using caret
set.seed(825)
seeds <- vector(mode = "list", length = 11) # length is = (n_repeats*nresampling)+1
for(i in 1:10) seeds[[i]]<- sample.int(n=1000, 11) # ...the number of tuning parameter...
seeds[[11]]<-sample.int(1000, 1) # for the last model
fitControl <- trainControl(method = "cv",number = 10,seeds=seeds)
# run model in parallel
cl <- makeCluster(detectCores())
registerDoParallel(cl)
gbmFit1 <- train(Class ~ ., data = training,method = "gbm",
trControl = fitControl,tuneGrid=grid,verbose = FALSE)
gbmFit1
I will address your question in two parts:
1 - Setting the seeds:
The code to do it, as you stated:
set.seed(825)
seeds <- vector(mode = "list", length = 11)
for(i in 1:10) seeds[[i]]<- sample.int(n=1000, 54)
#for the last model
seeds[[11]]<-sample.int(1000, 1)
The 11 in seeds <- vector(mode = "list", length = 11) is (n_repeats*nresampling)+1, so in your case, you're using 10-fold CV, so 10+1 = 11. If you were using repeatedcv with number=10 and repeats = 5 you would replace the 11 by (5*10)+1 = 51.
The 10 in for(i in 1:10) is (n_repeats*nresampling). In your case it is 10 because you're using 10-fold CV. Similarly, if you were using repeatedcv with number=10 and repeats = 5 it would be for(i in 1:50).
The 54 in sample.int(n=1000, 54) is the number of tuning parameter combinations. In your case, you have 4 parameters with 3, 3, 3 and 2 values, so it is 3*3*3*2 = 54. However, I remember reading somewhere that for gbm the model is fit to max(n.trees) in the grid, and the models with fewer trees are derived from it. This explains why caret calculates the seeds based on interaction.depth * shrinkage * n.minobsinnode, in your case 3 * 3 * 2 = 18 and not 3*3*3*2 = 54, as we will see later.
But if you were using a SVM model with a grid svmGrid <- expand.grid(sigma= 2^c(-25, -20, -15,-10, -5, 0), C= 2^c(0:5)) your value is 6 * 6 = 36
Remember, the goal of using seeds is to allow reproducible research by setting the seeds for the models fit at each resampling iteration.
The seeds[[11]]<-sample.int(1000, 1) is used to set the seed for the last (optimum) model fit to the complete dataset.
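The sizing rules above can be wrapped in a small helper, sketched here in base R. The function make_seeds and its arguments n_resamples and n_models are hypothetical names, not part of caret; the resulting list has the shape described above and would be passed as the seeds argument of trainControl.

```r
# Hypothetical helper that builds a seeds list of the required shape
make_seeds <- function(n_resamples, n_models, seed = 825) {
  set.seed(seed)
  seeds <- vector(mode = "list", length = n_resamples + 1)
  # One integer vector per resampling iteration, one seed per model fit
  for (i in seq_len(n_resamples)) {
    seeds[[i]] <- sample.int(n = 1000, n_models)
  }
  # Single seed for the final model fit to the complete dataset
  seeds[[n_resamples + 1]] <- sample.int(1000, 1)
  seeds
}

# 10-fold CV with 54 tuning combinations
seeds <- make_seeds(n_resamples = 10, n_models = 54)
length(seeds)     # 11
lengths(seeds)    # ten vectors of 54, then a single integer

# repeatedcv with number = 10 and repeats = 5
seeds_rep <- make_seeds(n_resamples = 5 * 10, n_models = 54)
length(seeds_rep) # 51
```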
2 - Why you get an error if you specify a value < 18, but no error with a value >= 18
I was able to reproduce the same error on my machine:
Error in train.default(x, y, weights = w, ...) :
Bad seeds: the seed object should be a list of length 11 with 10 integer vectors of size 18 and the last list element having a single integer
So, by inspecting the train.default I was able to find its source. The error message is triggered by the stop in lines 7 to 10 based on the test badSeed in lines 4 and 5.
else {
if (!(length(trControl$seeds) == 1 && is.na(trControl$seeds))) {
numSeeds <- unlist(lapply(trControl$seeds, length))
4 badSeed <- (length(trControl$seeds) < length(trControl$index) +
5 1) || (any(numSeeds[-length(numSeeds)] < nrow(trainInfo$loop)))
if (badSeed)
7 stop(paste("Bad seeds: the seed object should be a list of length",
8 length(trControl$index) + 1, "with", length(trControl$index),
9 "integer vectors of size", nrow(trainInfo$loop),
10 "and the last list element having a", "single integer"))
}
}
The number 18 is coming from nrow(trainInfo$loop), so we need to find the value of trainInfo$loop. The object trainInfo is assigned a value trainInfo <- models$loop(tuneGrid) in line 3:
if (trControl$method != "none") {
if (is.function(models$loop) && nrow(tuneGrid) > 1) {
3 trainInfo <- models$loop(tuneGrid)
if (!all(c("loop", "submodels") %in% names(trainInfo)))
stop("The 'loop' function should produce a list with elements 'loop' and 'submodels'")
}
Now, we need to find the object models. It is assigned the value of models <- getModelInfo(method, regex = FALSE)[[1]] in line 2:
else {
2 models <- getModelInfo(method, regex = FALSE)[[1]]
if (length(models) == 0)
stop(paste("Model", method, "is not in caret's built-in library"))
}
Since we are using method = "gbm", we can see the value of getModelInfo("gbm", regex = FALSE)[[1]]$loop and inspect the result below:
> getModelInfo("gbm", regex = FALSE)[[1]]$loop
function(grid) {
3 loop <- ddply(grid, c("shrinkage", "interaction.depth", "n.minobsinnode"),
function(x) c(n.trees = max(x$n.trees)))
submodels <- vector(mode = "list", length = nrow(loop))
for(i in seq(along = loop$n.trees)) {
index <- which(grid$interaction.depth == loop$interaction.depth[i] &
grid$shrinkage == loop$shrinkage[i] &
grid$n.minobsinnode == loop$n.minobsinnode[i])
trees <- grid[index, "n.trees"]
submodels[[i]] <- data.frame(n.trees = trees[trees != loop$n.trees[i]])
}
list(loop = loop, submodels = submodels)
}
>
The loop (in line 3 above) is assigned the value:
loop <- ddply(grid, c("shrinkage", "interaction.depth", "n.minobsinnode"),
function(x) c(n.trees = max(x$n.trees)))
Now, let's pass your grid with 54 rows to the line above and inspect the result:
> nrow(grid)
[1] 54
>
> loop <- ddply(grid, c("shrinkage", "interaction.depth", "n.minobsinnode"),
+ function(x) c(n.trees = max(x$n.trees)))
> loop
shrinkage interaction.depth n.minobsinnode n.trees
1 0.09 1 8 150
2 0.09 1 10 150
3 0.09 2 8 150
4 0.09 2 10 150
5 0.09 3 8 150
6 0.09 3 10 150
7 0.10 1 8 150
8 0.10 1 10 150
9 0.10 2 8 150
10 0.10 2 10 150
11 0.10 3 8 150
12 0.10 3 10 150
13 0.11 1 8 150
14 0.11 1 10 150
15 0.11 2 8 150
16 0.11 2 10 150
17 0.11 3 8 150
18 0.11 3 10 150
>
Aha, we found it. The value 18 comes from nrow(trainInfo$loop), which in turn comes from getModelInfo("gbm", regex = FALSE)[[1]]$loop, shown above with just 18 rows.
Now, going back to the test that triggered the error:
badSeed <- (length(trControl$seeds) < length(trControl$index) +
1) || (any(numSeeds[-length(numSeeds)] < nrow(trainInfo$loop)))
The first part of the test (length(trControl$seeds) < length(trControl$index) + 1) is FALSE, but the second part (any(numSeeds[-length(numSeeds)] < nrow(trainInfo$loop))) is TRUE for all values less than 18 [coming from nrow(trainInfo$loop)], and FALSE for all values greater than or equal to 18. That's why the error is triggered for a value < 18 and not for a value >= 18. As I said above, caret calculates the seeds based on interaction.depth * shrinkage * n.minobsinnode, in your case 3 * 3 * 2 = 18 (a model is fit to max(n.trees) and the others are derived from it, so there is no need for 54 integers).
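The same reasoning can be replayed without caret, as a sketch in base R. The function bad_seed below is a hypothetical re-statement of the badSeed test quoted above (it is not caret's own function), and the count of 18 is obtained with unique() on the grid columns other than n.trees instead of ddply.

```r
# Tuning grid from the question
grid <- expand.grid(n.trees = seq(50, 150, by = 50),
                    interaction.depth = seq(1, 3, by = 1),
                    shrinkage = seq(.09, .11, by = .01),
                    n.minobsinnode = seq(8, 10, by = 2))

# Distinct combinations once n.trees is left out: the models actually fit
n_loop <- nrow(unique(grid[, c("shrinkage", "interaction.depth",
                               "n.minobsinnode")]))
n_loop   # 18

# Hypothetical re-statement of the badSeed test for illustration
bad_seed <- function(seeds, n_resamples, n_loop) {
  num_seeds <- lengths(seeds)
  length(seeds) < n_resamples + 1 ||
    any(num_seeds[-length(num_seeds)] < n_loop)
}

# Build seeds lists with k integers per resample and check them
seeds_with <- function(k) {
  c(lapply(1:10, function(i) sample.int(1000, k)), list(sample.int(1000, 1)))
}
bad_seed(seeds_with(11), n_resamples = 10, n_loop = n_loop)  # TRUE
bad_seed(seeds_with(18), n_resamples = 10, n_loop = n_loop)  # FALSE
bad_seed(seeds_with(54), n_resamples = 10, n_loop = n_loop)  # FALSE
```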