I am trying to use code for fully reproducible parallel models in caret but do not understand how to set the size of the vectors in the seed object. For gbm I have 4 tuning parameters with a total of 11 different levels, and I have 54 rows in my tuning grid. If I specify any value < 18 as the size argument of sample.int() in the "for(i in 1:10)" line below, I get an error: "Bad seeds: the seed object should be a list of length 11 with 10 integer vectors of size 18 and the last list element having a single integer." Why 18? Also, it runs without errors for values > 18 (e.g., 54). Why? Many thanks for the help. The following is based on http://topepo.github.io/caret/training.html, with a few additions.
library(mlbench)
data(Sonar)
str(Sonar[, 1:10])
library(caret)
library(doParallel)
set.seed(998)
inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training <- Sonar[ inTraining,]
testing <- Sonar[-inTraining,]
grid <- expand.grid(n.trees = seq(50,150,by=50), interaction.depth = seq(1,3,by=1),
shrinkage = seq(.09,.11,by=.01),n.minobsinnode=seq(8,10,by=2))
# set seed to run fully reproducible model in parallel mode using caret
set.seed(825)
seeds <- vector(mode = "list", length = 11) # length is = (n_repeats*nresampling)+1
for(i in 1:10) seeds[[i]]<- sample.int(n=1000, 11) # ...the number of tuning parameter...
seeds[[11]]<-sample.int(1000, 1) # for the last model
fitControl <- trainControl(method = "cv",number = 10,seeds=seeds)
# run model in parallel
cl <- makeCluster(detectCores())
registerDoParallel(cl)
gbmFit1 <- train(Class ~ ., data = training,method = "gbm",
trControl = fitControl,tuneGrid=grid,verbose = FALSE)
gbmFit1
I will address your question in two parts:
1 - Setting the seeds:
The code to do it as you stated :
set.seed(825)
seeds <- vector(mode = "list", length = 11)
for(i in 1:10) seeds[[i]]<- sample.int(n=1000, 54)
#for the last model
seeds[[11]]<-sample.int(1000, 1)
The 11 in seeds <- vector(mode = "list", length = 11) is (n_repeats*nresampling)+1, so in your case, you're using 10-fold CV, so 10+1 = 11. If you were using repeatedcv with number=10 and repeats = 5 you would replace the 11 by (5*10)+1 = 51.
The 10 in for(i in 1:10) is (n_repeats*nresampling). In your case it is 10 because you're using 10-fold CV. Similarly, if you were using repeatedcv with number=10 and repeats = 5 it would be for(i in 1:50).
The 54 in sample.int(n=1000, 54) is the number of tuning parameter combinations. In your case, you have 4 parameters with 3, 3, 3 and 2 values, so it is 3*3*3*2 = 54. However, I remember reading somewhere that for gbm the model is fit with max(n.trees) in the grid, and the models with fewer trees are derived from it. This explains why caret calculates the number of seeds from interaction.depth * shrinkage * n.minobsinnode, in your case 3 * 3 * 2 = 18 and not 3*3*3*2 = 54, as we will see later.
But if you were using an SVM model with a grid svmGrid <- expand.grid(sigma= 2^c(-25, -20, -15,-10, -5, 0), C= 2^c(0:5)), your value would be 6 * 6 = 36.
Remember, the goal of using seeds is to allow reproducible research by setting the seeds for the models fit at each resampling iteration.
The seeds[[11]]<-sample.int(1000, 1) is used to set the seed for the last (optimum) model fit to the complete dataset.
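The sizing rules above can be collected into a small helper. This is a minimal sketch in base R: make_seeds() is a hypothetical name (it is not part of caret), and the caller is assumed to already know how many models are fitted per resample (for example 18 for the gbm grid discussed here, 36 for the SVM grid).

```r
# Hypothetical helper (not part of caret): build a seeds list for trainControl().
# n_resamples: number of resampling iterations (number * repeats)
# n_models:    number of models fitted within each resample
make_seeds <- function(n_resamples, n_models, seed = 825) {
  set.seed(seed)
  seeds <- vector(mode = "list", length = n_resamples + 1)
  for (i in seq_len(n_resamples)) {
    seeds[[i]] <- sample.int(n = 1000, size = n_models)
  }
  # single seed for the final model fit on the complete data set
  seeds[[n_resamples + 1]] <- sample.int(n = 1000, size = 1)
  seeds
}

# 10-fold CV repeated 5 times, SVM grid with 6 * 6 = 36 combinations
svm_seeds <- make_seeds(n_resamples = 10 * 5, n_models = 36)
length(svm_seeds)  # 51
```

The result would then be passed as trainControl(..., seeds = svm_seeds).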
2 - Why you get an error if you specify a value < 18, but no error with a value >= 18
I was able to reproduce the same error on my machine:
Error in train.default(x, y, weights = w, ...) :
Bad seeds: the seed object should be a list of length 11 with 10 integer vectors of size 18 and the last list element having a single integer
So, by inspecting the train.default I was able to find its source. The error message is triggered by the stop in lines 7 to 10 based on the test badSeed in lines 4 and 5.
else {
if (!(length(trControl$seeds) == 1 && is.na(trControl$seeds))) {
numSeeds <- unlist(lapply(trControl$seeds, length))
4 badSeed <- (length(trControl$seeds) < length(trControl$index) +
5 1) || (any(numSeeds[-length(numSeeds)] < nrow(trainInfo$loop)))
if (badSeed)
7 stop(paste("Bad seeds: the seed object should be a list of length",
8 length(trControl$index) + 1, "with", length(trControl$index),
9 "integer vectors of size", nrow(trainInfo$loop),
10 "and the last list element having a", "single integer"))
}
}
The number 18 is coming from nrow(trainInfo$loop), so we need to find the value of trainInfo$loop. The object trainInfo is assigned a value trainInfo <- models$loop(tuneGrid) in line 3:
if (trControl$method != "none") {
if (is.function(models$loop) && nrow(tuneGrid) > 1) {
3 trainInfo <- models$loop(tuneGrid)
if (!all(c("loop", "submodels") %in% names(trainInfo)))
stop("The 'loop' function should produce a list with elements 'loop' and 'submodels'")
}
Now, we need to find the object models. It is assigned the value of models <- getModelInfo(method, regex = FALSE)[[1]] in line 2:
else {
2 models <- getModelInfo(method, regex = FALSE)[[1]]
if (length(models) == 0)
stop(paste("Model", method, "is not in caret's built-in library"))
}
Since we are using method = "gbm", we can see the value of getModelInfo("gbm", regex = FALSE)[[1]]$loop and inspect the result below:
> getModelInfo("gbm", regex = FALSE)[[1]]$loop
function(grid) {
3 loop <- ddply(grid, c("shrinkage", "interaction.depth", "n.minobsinnode"),
function(x) c(n.trees = max(x$n.trees)))
submodels <- vector(mode = "list", length = nrow(loop))
for(i in seq(along = loop$n.trees)) {
index <- which(grid$interaction.depth == loop$interaction.depth[i] &
grid$shrinkage == loop$shrinkage[i] &
grid$n.minobsinnode == loop$n.minobsinnode[i])
trees <- grid[index, "n.trees"]
submodels[[i]] <- data.frame(n.trees = trees[trees != loop$n.trees[i]])
}
list(loop = loop, submodels = submodels)
}
>
The loop (in line 3 above) is assigned the value:
loop <- ddply(grid, c("shrinkage", "interaction.depth", "n.minobsinnode"),
function(x) c(n.trees = max(x$n.trees)))
Now, let's pass your grid with 54 rows to the line above and inspect the result:
> nrow(grid)
[1] 54
>
> loop <- ddply(grid, c("shrinkage", "interaction.depth", "n.minobsinnode"),
+ function(x) c(n.trees = max(x$n.trees)))
> loop
shrinkage interaction.depth n.minobsinnode n.trees
1 0.09 1 8 150
2 0.09 1 10 150
3 0.09 2 8 150
4 0.09 2 10 150
5 0.09 3 8 150
6 0.09 3 10 150
7 0.10 1 8 150
8 0.10 1 10 150
9 0.10 2 8 150
10 0.10 2 10 150
11 0.10 3 8 150
12 0.10 3 10 150
13 0.11 1 8 150
14 0.11 1 10 150
15 0.11 2 8 150
16 0.11 2 10 150
17 0.11 3 8 150
18 0.11 3 10 150
>
Ahh, we found it. The value 18 comes from nrow(trainInfo$loop), which in turn comes from getModelInfo("gbm", regex = FALSE)[[1]]$loop, shown above with just 18 rows.
Now, going back to the test that triggered the error:
badSeed <- (length(trControl$seeds) < length(trControl$index) +
1) || (any(numSeeds[-length(numSeeds)] < nrow(trainInfo$loop)))
The first part of the test (length(trControl$seeds) < length(trControl$index) + 1) is FALSE, but the second part (any(numSeeds[-length(numSeeds)] < nrow(trainInfo$loop))) is TRUE for all values less than 18 [coming from nrow(trainInfo$loop)], and FALSE for all values greater than or equal to 18. That's why the error is triggered for a value < 18 and not for >= 18. As I said above, caret calculates the number of seeds from interaction.depth * shrinkage * n.minobsinnode, in your case 3 * 3 * 2 = 18 (a model is fit with max(n.trees) and the others are derived from it, so there is no need for 54 integers).
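The count of 18 can also be checked without caret or plyr. The following is a rough base-R sketch: it counts the distinct combinations of every tuning parameter except n.trees, then mimics only the second half of the badSeed test. seeds_too_short() is a hypothetical helper written for illustration, not caret code.

```r
grid <- expand.grid(n.trees = seq(50, 150, by = 50),
                    interaction.depth = seq(1, 3, by = 1),
                    shrinkage = seq(.09, .11, by = .01),
                    n.minobsinnode = seq(8, 10, by = 2))

# Distinct combinations of everything except n.trees
loop_rows <- unique(grid[, c("shrinkage", "interaction.depth", "n.minobsinnode")])
nrow(grid)       # 54
nrow(loop_rows)  # 18

# Hypothetical stand-in for the second half of the badSeed test:
# ten resamples, each supplied with `size` seeds
seeds_too_short <- function(size, needed = nrow(loop_rows)) {
  numSeeds <- rep(size, 10)
  any(numSeeds < needed)
}
sapply(c(11, 18, 54), seeds_too_short)  # TRUE FALSE FALSE
```

Any size of at least 18 passes the check, which is consistent with 54 running without an error.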
Related
I want to understand how my parallelization is working when there is a for-loop structure inside of the structure that I am parallelizing.
I have a routine called reg_simulation(), which generates 100 estimations (nrep=100) of a linear regression, each using a different seed (seed <- seed + i).
Additionally, I wrapped up the reg_simulation() routine inside par_wrapper() to run it using different possible configurations of the data generating process. In particular, changing the number of observations (obs) and the error term variance (sigma). Finally, I parallelized this structure using pblapply.
Using the described setup, I am using a grid of obs = c(250, 500, 750, 1000, 2500) and
sigma = c(0.1, 0.2, 0.5, 0.8 , 1 ), meaning 5 values for each variable, leading to 25 combinations of the two variables. However, I am running each of these 25 combinations 100 times.
Finally, here is my question:
My code is...
(a) Running the 25 combinations in parallel, but the 100 repetitions inside each of them serially.
(b) Running all 2500 models in parallel.
If the answer is (a), please let me know how you arrived at that conclusion, because I haven't been able to sort it out yet, and it probably implies that I should change my code structure.
Some additional comments: (1) The seed declaration on each iteration is important because it allows me to recover each possible combination of the data (e.g., iteration 78 (seed = 78), with sigma=0.1 and obs=1000) (2) I am using pblapply because I want to track my code simulations' progress.
Here are the aforementioned routines:
reg_simulation()
reg_simulation<- function(obs = 1000,
sigma = 0.5,
nrep = 10 ,
seed = 0){
#pre-allocate the list of results
res <- vector("list", nrep)
# Forloop
for ( i in 1:nrep) {
#Changing seed each iteration
seed <- seed + i
#set seed
set.seed(seed)
#DGP
x1 <- rnorm(obs, 0 , sigma)
x2 <- rnorm(obs, 0 , sigma)
y <- 1 + 0.5* x1 + 1.5 * x2 + rnorm(obs, 0 , 1)
#Estimate OLS
ols <- lm(y ~ x1 + x2)
returnlist <- list(intercept = ols$coefficients[1],
beta1 = ols$coefficients[2],
beta2 = ols$coefficients[3],
seed = seed)
#save each iteration
res[[i]] <- returnlist
}
return(res)
}
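A note on the seed update inside the loop: because seed is reassigned on every pass, seed <- seed + i accumulates, so the seeds are running sums rather than 1, 2, 3, and so on. The short sketch below replays only that update (with the default seed = 0) to show which seed each iteration actually receives.

```r
nrep  <- 100
seed  <- 0
seeds <- numeric(nrep)
for (i in 1:nrep) {
  seed <- seed + i   # same update as in reg_simulation()
  seeds[i] <- seed
}
seeds[1:5]  # 1 3 6 10 15
seeds[78]   # 3081, not 78
```

If iteration i is meant to use seed i, one option is to leave seed untouched and call set.seed(seed + i) instead.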
par_wrapper()
### parallel wrapper
par_wrapper <- function(obs = c(250,500,750,1000,2500),
sigma = c(0.1, 0.2, 0.5, 0.8 , 1 ) ,
nrep = 10,
nClusters = 4)
{
require(parallel)
require(pbapply)
#grid of searching space
prs <- expand.grid(obs = obs,
sigma = sigma)
nprs <- nrow(prs)
rownames(prs) <- c(1:NROW(prs))
#Print number of combinations
print(prs)
#### ---- PARALLEL INIT ---- ####
## Parallel options
cl <- makeCluster(nClusters)
## Attaching necessary functions for internal computations
parallel::clusterExport(cl= cl,
list("reg_simulation"))
# pblapply
par_simres <- pblapply(cl = cl,
X = 1:nprs,
FUN = function(i){
reg_simulation(
sigma = prs$sigma[i],
obs = prs$obs[i],
nrep = nrep,
seed = 0)})
##exit cluster mode
stopCluster(cl)
return(par_simres)
}
Using the par_wrapper() function over a grid.
#using generated structure.
res_list <- par_wrapper(
obs = c(250,500,750,1000, 2500 ),
sigma = c(0.1, 0.2, 0.5, 0.8 , 1 ) ,
nrep = 100,
nClusters = 4)
Console output.
# obs sigma
# 1 250 0.1
# 2 500 0.1
# 3 750 0.1
# 4 1000 0.1
# 5 2500 0.1
# 6 250 0.2
# 7 500 0.2
# 8 750 0.2
# 9 1000 0.2
# 10 2500 0.2
# 11 250 0.5
# 12 500 0.5
# 13 750 0.5
# 14 1000 0.5
# 15 2500 0.5
# 16 250 0.8
# 17 500 0.8
# 18 750 0.8
# 19 1000 0.8
# 20 2500 0.8
# 21 250 1.0
# 22 500 1.0
# 23 750 1.0
# 24 1000 1.0
# 25 2500 1.0
# |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s
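Reading par_wrapper() as posted, pblapply() iterates over X = 1:nprs, so the unit of parallel work appears to be one of the 25 parameter combinations, and the for loop inside reg_simulation() runs serially within whichever worker receives that combination, i.e. option (a). One way to make every replication its own task is to flatten the grid, as in the sketch below. This is only an illustration under stated assumptions: run_one() is a hypothetical single-replication worker, the seed is simply the replication number, and lapply() is used so the sketch runs without a cluster.

```r
obs   <- c(250, 500, 750, 1000, 2500)
sigma <- c(0.1, 0.2, 0.5, 0.8, 1)
nrep  <- 100

# One row per (obs, sigma, replication): 5 * 5 * 100 = 2500 independent tasks
tasks <- expand.grid(obs = obs, sigma = sigma, rep = seq_len(nrep))

# Hypothetical single-replication worker: one simulated data set, one OLS fit
run_one <- function(obs, sigma, seed) {
  set.seed(seed)
  x1 <- rnorm(obs, 0, sigma)
  x2 <- rnorm(obs, 0, sigma)
  y  <- 1 + 0.5 * x1 + 1.5 * x2 + rnorm(obs, 0, 1)
  c(coef(lm(y ~ x1 + x2)), seed = seed)
}

# Serial here so the sketch runs anywhere. Passing the same X and FUN to
# pblapply(cl = cl, ...) would spread the 2500 tasks over the workers,
# provided run_one and tasks are exported with parallel::clusterExport().
res <- lapply(seq_len(nrow(tasks)), function(i) {
  run_one(tasks$obs[i], tasks$sigma[i], seed = tasks$rep[i])
})
length(res)  # 2500
```

With 25 coarse tasks on 4 workers the load can be uneven, since the obs = 2500 combinations take longer; 2500 small tasks balance better at the cost of more communication overhead.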
I've got this data processing:
library(text2vec)
##Using perplexity for hold out set
t1 <- Sys.time()
perplex <- c()
for (i in 3:25){
set.seed(17)
lda_model2 <- LDA$new(n_topics = i)
doc_topic_distr2 <- lda_model2$fit_transform(x = dtm, progressbar = F)
set.seed(17)
sample.dtm2 <- itoken(rawsample$Abstract,
preprocessor = prep_fun,
tokenizer = tok_fun,
ids = rawsample$id,
progressbar = F) %>%
create_dtm(vectorizer,vtype = "dgTMatrix", progressbar = FALSE)
set.seed(17)
new_doc_topic_distr2 <- lda_model2$transform(sample.dtm2, n_iter = 1000,
convergence_tol = 0.001, n_check_convergence = 25,
progressbar = FALSE)
perplex[i] <- text2vec::perplexity(sample.dtm2, topic_word_distribution =
lda_model2$topic_word_distribution,
doc_topic_distribution = new_doc_topic_distr2)
}
print(difftime(Sys.time(), t1, units = 'sec'))
I know there are a lot of questions like this, but I haven't been able to find the exact answer for my situation. Above you see the perplexity calculation for topic numbers 3 to 25 for a Latent Dirichlet Allocation model. I want to get the best value among those, meaning that I want to find the elbow or knee of those values, which can be treated as a simple numeric vector whose output looks like this:
1 NA
2 NA
3 222.6229
4 210.3442
5 200.1335
6 190.3143
7 180.4195
8 174.2634
9 166.2670
10 159.7535
11 153.7785
12 148.1623
13 144.1554
14 141.8250
15 138.8301
16 134.4956
17 131.0745
18 128.8941
19 125.8468
20 123.8477
21 120.5155
22 118.4426
23 116.4619
24 113.2401
25 114.1233
plot(perplex)
This is what the plot looks like:
I would say that the elbow would be 13 or 16, but I'm not completely sure and I want the exact number as an outcome. I saw in this paper that f''(x) / (1+f'(x)^2)^1.5 is the knee formula, which I tried like this and says it's 18:
> d1 <- diff(perplex) # first derivative
> d2 <- diff(d1) / diff(perplex[-1]) # second derivative
> knee <- (d2)/((1+(d1)^2)^1.5)
Warning message:
In (d2)/((1 + (d1)^2)^1.5) :
longer object length is not a multiple of shorter object length
> which.min(knee)
[1] 18
I can't fully figure this thing out. Would someone like to share how I could get the exact ideal topics number according to perplexity as an outcome?
Found this: "The LDA model with the optimal coherence score, obtained with an elbow method (the point with maximum absolute second derivative) (...)" in this paper, so this code does the job: d1 <- diff(perplex); k <- which.max(abs(diff(d1) / diff(perplex[-1])))
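The length warning in the question comes from combining d1 and d2, which have different lengths. A minimal sketch of one way to keep them aligned is to evaluate both derivatives by finite differences on the interior points only, assuming the topic numbers are equally spaced with step 1. The variable names below are illustrative, and the perplexity values are copied from the output above.

```r
perplex <- c(NA, NA, 222.6229, 210.3442, 200.1335, 190.3143, 180.4195,
             174.2634, 166.2670, 159.7535, 153.7785, 148.1623, 144.1554,
             141.8250, 138.8301, 134.4956, 131.0745, 128.8941, 125.8468,
             123.8477, 120.5155, 118.4426, 116.4619, 113.2401, 114.1233)

k <- which(!is.na(perplex))   # topic numbers that were evaluated (3..25)
f <- perplex[k]
n <- length(f)

# Finite differences on interior points only, so both vectors line up
d1 <- (f[3:n] - f[1:(n - 2)]) / 2                # central first difference
d2 <- f[3:n] - 2 * f[2:(n - 1)] + f[1:(n - 2)]   # second difference
k_inner <- k[2:(n - 1)]                          # topic numbers 4..24

kappa <- d2 / (1 + d1^2)^1.5                     # curvature, f'' / (1 + f'^2)^1.5

k_inner[which.max(abs(kappa))]  # knee by maximum absolute curvature
k_inner[which.max(abs(d2))]     # elbow by maximum absolute second difference
```

Both criteria are sensitive to noise in the series (for instance the small uptick at 25 topics), and the curvature also depends on how the two axes are scaled, so smoothing or rescaling the values first may change the answer.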
I want to use the function "createDataPartition" to define some training and testing data. But the program always puts the first 2 lines of my table in "training" and all other lines in "testing". I thought changing the percentage from "0.75" to "0.5" would change something, but it didn't.
Table:
t<-read.csv2("test.csv")
> test
Spalte1
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
First run with p=.75:
> inTraining <- createDataPartition(a1, p = .75, list = FALSE)
Warning messages:
1: In createDataPartition(a1, p = 0.75, list = FALSE) :
Some classes have no records ( ) and these will be ignored
2: In createDataPartition(a1, p = 0.75, list = FALSE) :
Some classes have a single record ( ) and these will be selected for the sample
> training <- test[ inTraining,]
> testing <- test[-inTraining,]
> traing
Error: object 'traing' not found
> training
[1] 1 2
> testing
[1] 3 4 5 6 7 8 9 10
Second run with p=.5:
> inTraining <- createDataPartition(a1, p = .5, list = FALSE)
Warning messages:
1: In createDataPartition(a1, p = 0.5, list = FALSE) :
Some classes have no records ( ) and these will be ignored
2: In createDataPartition(a1, p = 0.5, list = FALSE) :
Some classes have a single record ( ) and these will be selected for the sample
> training <- test[ inTraining,]
> testing <- test[-inTraining,]
> training
[1] 1 2
> testing
[1] 3 4 5 6 7 8 9 10
What do I have to change to be able to tell the program how many lines I want for testing and how many lines I want for training?
If you are just looking for a solution to the apparent problem then you could just use base R function sample.int() to the same end:
# Dummy data
data <- data.frame(Spalte1 = 1:10)
# Simple way to partition without any external package:
p <- 0.75
set.seed(1)
inTraining <- sample.int(n = nrow(data),
size = floor(p * nrow(data)),
replace = FALSE)
train <- data[inTraining, ]
test <- data[-inTraining, ]
train
3 4 5 7 2 8 9
test
1 6 10
Note that here the number of lines in the training data is equal to floor(p * nrow(data)).
I would like to see if the SOM algorithm can be used for classification prediction.
I used the code below, but I see that the classification results are far from being right. For example, in the test dataset I get a lot more than just the 3 values that I have in the training target variable. How can I create a prediction model that is aligned with the training target variable?
library(kohonen)
library(HDclassif)
data(wine)
set.seed(7)
training <- sample(nrow(wine), 120)
Xtraining <- scale(wine[training, ])
Xtest <- scale(wine[-training, ],
center = attr(Xtraining, "scaled:center"),
scale = attr(Xtraining, "scaled:scale"))
som.wine <- som(Xtraining, grid = somgrid(5, 5, "hexagonal"))
som.prediction$pred <- predict(som.wine, newdata = Xtest,
trainX = Xtraining,
trainY = factor(Xtraining$class))
And the result:
$unit.classif
[1] 7 7 1 7 1 11 6 2 2 7 7 12 11 11 12 2 7 7 7 1 2 7 2 16 20 24 25 16 13 17 23 22
[33] 24 18 8 22 17 16 22 18 22 22 18 23 22 18 18 13 10 14 15 4 4 14 14 15 15 4
This might help:
SOM is an unsupervised classification algorithm, so you shouldn't expect it to be trained on a dataset that contains a class label (if you do that it will need this information to work, and will be useless with unlabelled datasets).
The idea is that it will kind of "convert" an input numeric vector to a network unit number (try running your code again with a 1 by 3 grid and you'll have the output you expected).
You'll then need to convert those network unit numbers back into the categories you are looking for (that is the key part missing in your code).
The reproducible example below outputs the proportion of correctly classified test rows. It includes one implementation option for the "convert back" part missing in your original post.
Note, though, that for this particular dataset the model overfits pretty quickly: 3 units give the best results.
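library(kohonen)    # som(), somgrid()
library(HDclassif)  # provides the wine data (class label in column 1)
#Set and scale a training set (-1 to drop the classes)
data(wine)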
set.seed(7)
training <- sample(nrow(wine), 120)
Xtraining <- scale(wine[training, -1])
#Scale a test set (-1 to drop the classes)
Xtest <- scale(wine[-training, -1],
center = attr(Xtraining, "scaled:center"),
scale = attr(Xtraining, "scaled:scale"))
#Set 2D grid resolution
#WARNING: it overfits pretty quickly
#Proportion correct: 36% for 1 unit, 63% for 2, 93% for 3, 89% for 4
som_grid <- somgrid(xdim = 1, ydim=3, topo="hexagonal")
#Create a trained model
som_model <- som(Xtraining, som_grid)
#Make a prediction on test data
som.prediction <- predict(som_model, newdata = Xtest)
#Put together original classes and SOM classifications
error.df <- data.frame(real = wine[-training, 1],
predicted = som.prediction$unit.classif)
#Return the category number that has the strongest association with the unit
#number (0 stands for ambiguous). The mapping is built from the training rows
#passed in as df, not from the test rows.
switch <- sapply(unique(som_model$unit.classif), function(x, df){
cat <- as.numeric(names(which.max(table(
df[df$predicted==x,1]))))
if(length(cat)<1){
cat <- 0
}
return(c(x, cat))
}, df = data.frame(real = wine[training, 1], predicted = som_model$unit.classif))
#Translate units numbers into classes
error.df$corrected <- apply(error.df, MARGIN = 1, function(x, switch){
cat <- switch[2, which(switch[1,] == x["predicted"])]
if(length(cat)<1){
cat <- 0
}
return(cat)
}, switch = switch)
#Compute the classification accuracy (proportion of correct predictions)
sum(error.df$corrected == error.df$real)/length(error.df$real)
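The "convert back" step is essentially a majority vote per unit, learned on the training rows and then looked up for new rows. Here is a compact base-R sketch of that idea on toy stand-in vectors. The data are made up for illustration and the names are hypothetical; with a fitted model, train_unit would play the role of som_model$unit.classif and test_unit that of som.prediction$unit.classif.

```r
# Toy stand-ins (assumed data): SOM unit of each training row and its true class
train_unit  <- c(1, 1, 1, 2, 2, 3, 3, 3)
train_class <- c(1, 1, 2, 2, 2, 3, 3, 1)

# Majority class per unit, learned from the training rows only
units    <- sort(unique(train_unit))
majority <- sapply(units, function(u) {
  as.numeric(names(which.max(table(train_class[train_unit == u]))))
})

# Units assigned to new rows; unit 4 never appeared in training
test_unit <- c(3, 1, 2, 4)
predicted <- majority[match(test_unit, units)]
predicted[is.na(predicted)] <- 0   # 0 marks "no class known for this unit"
predicted                          # 3 1 2 0
```

Keeping the lookup table separate from the test labels avoids leaking test information into the unit-to-class mapping.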
Following up from this question (see for reproducible data frame) I want to run MCMCGLMM n times, where n is the number of randomisations. I have tried to construct a loop which runs all the chains, and saves them (to retrieve the posterior distributions of the randomised variable later) but I am encountering problems.
This is what the data frame looks like (when n = 5, hence R1-R5), A = response variable, L and V are random effect variables, B is a fixed effect, R1-R5 are random assignments of L with structure of V maintained:
ID L B V A R1 R2 R3 R4 R5
1 1_1_1 1 1 1 11.1 6 19 21 1 31
2 1_1_1 1 1 1 6.9 6 19 21 1 31
3 1_1_4 1 1 4 7.7 2 24 8 22 22
4 1_1_4 1 1 4 10.5 2 24 8 22 22
5 1_1_5 1 1 5 8.5 11 27 14 17 22
6 1_1_7 1 1 7 11.2 5 24 13 18 25
I can create the names I want to assign to my chains, and the names of the variable that changes with each run of the MCMC chain (R1-Rn):
n = 5
Rs = as.vector(rep(NA,n))
for(i in 1:n){
Rs[i] = paste("R",i, sep = "")
}
Rs
Output:
> Rs
[1] "R1" "R2" "R3" "R4" "R5"
I then tried this loop to produce 5 chains:
for(i in 1:n){
chains[i] = MCMCglmm(A ~1 + B,
random = as.formula(paste0("~" ,Rs[i], " + Vial")),
rcov = ~units,
nitt = 500,
thin = 2,
burnin = 50,
prior = prior2,
family = "gaussian",
start = list(QUASI = FALSE),
data = df)
}
Thanks Roland for helping to get the random effect to call properly, previously I was getting an error Error in buildZ(rmodel.terms[r] ... object Rs[i] not found- fixed by as.formula
But this stores all of the data in chains and seemingly only the $Sol components, but I need to be able to access the values within the VCV, specifically the posterior distributions of the R variables (e.g. summary(chainR1$VCV))
In summary: It seems I am making a mistake in how I assign the chain names, does anyone have a suggestion of how to do this, and save the posterior distributions or even the whole chain?
Using assign was a key point:
n = 10 #Number of chains to run
chainVCVdf = matrix(rep(NA, times = ((nitt-burnin)/thin)*n), ncol = n)
colnames(chainVCVdf)=c(rep("X", times = n))
for(i in 1:n){
assign("chainX",paste0("chain",Rs[i]))
chainX = MCMCglmm(A ~1 + B,
random = as.formula(paste0("~" ,Rs[i], " + V")),
rcov = ~units,
nitt = nitt,
thin = thin,
burnin = burnin,
prior = prior1,
family = "gaussian",
start = list(QUASI = FALSE),
data = df)
assign("chainVCV", chainX$VCV[,1])
chainVCVdf[,i]=(chainVCV)
colnames(chainVCVdf)[i] = colnames(chainX$VCV)[1]
}
It then became possible to build a matrix of the VCV component that I am interested in (namely the randomised L assignment in columns R1-Rn)
It seems as though you want to run a number of different MCMCglmm formulas in a loop. @Roland has helped you find the solution to this (although I personally would create the formulas prior to the loop). @Roland also points out that in order to save the results of each model, you should save them in a list rather than a chain as you are currently doing. You could also save each model as an .RData file, as seen at the end of the question. To formalize an answer to this question I would perform this in the following way:
Rs = paste0("~R", 1:5, " + V") ## Create all model formulae
chainNames = paste0("chainR", 1:5) ## Names for each model
chains = list() ## Initialize list
## Loop over models
for(i in 1:length(Rs)){
chains[[i]] = MCMCglmm(A ~1 + B,
random = formula(Rs[i]),
rcov = ~units,
nitt = 500,
thin = 2,
burnin = 50,
prior = prior2,
family = "gaussian",
start = list(QUASI = FALSE),
data = df)
}
names(chains) = chainNames ## Name each model
save(chains, file = "chainsR1-R5.Rdata") ## Save all model output (file= must be named)
A side note, paste0 is the same as paste, but with the argument sep="" by default
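The same list pattern can be tried without MCMCglmm installed. The sketch below uses lm() on made-up data purely as a stand-in for the model call; the column names mirror the question, but the data and the names rhs and models are assumptions for illustration. The point it shows is that assigning with [[i]] stores the whole fitted object, whereas single brackets with a multi-component object typically keep only its first component.

```r
# Toy data (assumed): response A, fixed effect B, two candidate predictors
set.seed(1)
df <- data.frame(A = rnorm(20), B = rnorm(20), R1 = rnorm(20), R2 = rnorm(20))

rhs    <- paste0("A ~ B + R", 1:2)      # one formula string per model
models <- vector("list", length(rhs))   # pre-allocated list, one slot per fit
for (i in seq_along(rhs)) {
  # [[i]] stores the complete fitted object in slot i
  models[[i]] <- lm(as.formula(rhs[i]), data = df)
}
names(models) <- paste0("modelR", 1:2)

# Every component of every fit stays reachable by name
names(models)
coef(models$modelR1)
```

With the MCMCglmm fits stored this way, a component such as the VCV samples would be reached in the same manner, e.g. chains$chainR1$VCV.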