A critical difference (CD) plot for comparing classifiers over multiple data sets (Demšar2006) can be generated with the mlr package like this:
# THIS WORKS
library(mlr)
lrns = list(makeLearner("classif.knn"), makeLearner("classif.svm"))
tasks = list(iris.task, sonar.task)
rdesc = makeResampleDesc("CV", iters = 2L)
meas = list(acc)
bmr = benchmark(lrns, tasks, rdesc, measures = meas)
cd = generateCritDifferencesData(bmr)
plotCritDifferences(cd)
This requires the evaluation results to reside in a rather complex BenchmarkResult object, although the data is basically a matrix (where M[i, j] holds the score of classifier i for data set j).
I have previously generated such data in a Python workflow and imported in R into a data.frame (as there seems to be no Python package for such plots).
How can I generate a CD plot from this data?
I thought about creating a BenchmarkResult from the data.frame, but didn't know where to start:
# THIS DOES NOT WORK
library(mlr)
# Here I would import results from my experiments instead of using random data
# e.g. scores for 5 classifiers and 30 data sets, each
results = data.frame(replicate(5, runif(30, 0, 1)))
# This is the functionality I'm looking for
bmr = benchmarkResultFromDataFrame(results)
cd = generateCritDifferencesData(bmr)
plotCritDifferences(cd)
I finally managed to create the plot. It is necessary to set only a handful of the BenchmarkResult's attributes:
leaners with id and short.name for each classifier
measures
results with aggr for each dataset/classifier combination
The code may then look like this (smaller example of 5 datasets):
library(mlr)
# Here I would import results from my experiments instead of using random data
# e.g. scores for 5 classifiers and 30 data sets, each
results <- data.frame(replicate(5, runif(30, 0, 1)))
clf <- c('clf1', 'clf2', 'clf3', 'clf4', 'clf5')
clf.short.name <- c('c1', 'c2', 'c3', 'c4', 'c5')
dataset <- c('dataset1', 'dataset2', 'dataset3', 'dataset4', 'dataset5')
score <- list(acc)
# Setting up the learners: id, short.name
bmr <- list()
for (i in 1:5){
bmr$learners[[clf[i]]]$id <- clf[i]
bmr$learners[[clf[i]]]$short.name <- clf.short.name[i]
}
# Setting up the measures
bmr$measures <- list(acc)
# Setting up the results
for (i in 1:5){
bmr$results$`dataset1`[[clf[i]]]$aggr <- list('acc.test.mean' = results[1, i])
}
for (i in 1:5){
bmr$results$`dataset2`[[clf[i]]]$aggr <- list('acc.test.mean' = results[2, i])
}
for (i in 1:5){
bmr$results$`dataset3`[[clf[i]]]$aggr <- list('acc.test.mean' = results[3, i])
}
for (i in 1:5){
bmr$results$`dataset4`[[clf[i]]]$aggr <- list('acc.test.mean' = results[4, i])
}
for (i in 1:5){
bmr$results$`dataset5`[[clf[i]]]$aggr <- list('acc.test.mean' = results[5, i])
}
# Set BenchmarkResult class
class(bmr) <- "BenchmarkResult"
# Statistics and plot
cd = generateCritDifferencesData(bmr)
plotCritDifferences(cd)
Anyone who could teach me better R to avoid these for loops and code duplication would still be very welcome!
Related
I want to calculate the Moran index for several random phylogenetic trees from a data set.
As the result I want a list of: the values of Moran index calculation and the selected tree for that given calculation of Moran index (at each loop).
I need to know which tree was used to run Moran index.
I was able to get Moran values for the random trees, however I was not able to save in any way the random tree that were selected at each loop.
Since my data set is big, I created the example below:
library(ape)
x <- rmtree(20, 20)
names(x) <- paste("t", 1:10, sep = "")
var <- as.data.frame(matrix(rnorm(20),nrow=20))
nams <- data.frame(paste0("t",1:20,sep=""))
var_list <- cbind(nams,var)
colnames(var_list) <- c("names","variable")
var_list1 <- var_list$variable
names(var_list1) <- c(var_list[,1])
# Loop to select a random tree and run the Moran index
tr <- function(x, var_list1){
resulist <- as.list(1:15)
# treelist <- as.list(as.phylo(x[[i]])) #### not working
for(i in 1:15){
tutreer <- sample(x,size=1)[[1]]
inr <- 1/cophenetic(tutreer)
diag(inr) <- 0
mran <- Moran.I(var_list1,inr,scaled = TRUE)
mran
resulist[[i]] <- list(obs=mran$observed,expect=mran$expected,
sd=mran$sd,pval=mran$p.value)
# treelist[[i]] <- list(tutreer) #### Here, not working
}
return(resulist)
# return(treelist)
}
tr(x,var_list1)
Any ideas on how could I save the trees that was selected (and used to calculate the index) at each loop?
Why not just add it to the resulist[[i]] on each iteration?:
resulist[[i]] <- list(obs=mran$observed,expect=mran$expected,
sd=mran$sd,pval=mran$p.value, tree=tutreer)
Or, return the entire sampled tree, like this:
tr <- function(x, var_list1){
resulist <- as.list(1:15)
for(i in 1:15){
sampled_tree <- sample(x,size=1)
inr <- 1/cophenetic(sampled_tree[[1]])
diag(inr) <- 0
mran <- Moran.I(var_list1,inr,scaled = TRUE)
resulist[[i]] <- c(mran, list(tree = sampled_tree))
}
return(resulist)
}
I've recently been interested in trying to develop a for-loop that would be able to run multiple generalized additive models and then produce results in a table that ranks them based on AIC, p-value of each smooth in the model, deviance explained of the overall model, etc.
I found this related question in stack overflow which is basically what I want and was able to run this well for gam() instead of gamm(), however I want to expand this to include multiple independent variables in the model, not just 1.
Ideally, the models would run all possible combinations of independent variables against the dependent variable, and it would test combinations anywhere from 1 independent variable in the model, up to all of the possible covariates in "d_pred" in the model.
I have attempted to do this so far by starting out small and finding all possible combinations of 2 independent variables (df_combinations2), which results in a list of data frames. Then I adjusted the rest of the code to run the for loop such that each iteration will run a different combination of the two variables:
library(mgcv)
## Example data
set.seed(0)
dat <- gamSim(1,n=200,scale=2)
set.seed(1)
dat2 <- gamSim(1,n=200,scale=2)
names(dat2)[1:5] <- c("y1", paste0("x", 4:7))
d <- cbind(dat[, 1:5], dat2[, 1:5])
d_resp <- d[ c("y", "y1")]
d_pred <- d[, !(colnames(d) %in% c("y", "y1"))]
df_combinations2 <- lapply(1:(ncol(combn(1:ncol(d_pred), m = 2))),
function(y) d_pred[, combn(1:ncol(d_pred), m = 2)[,y]])
## create a "matrix" list of dimensions i x j
results_m2 <-lapply(1:length(df_combinations2), matrix, data= NA, nrow=ncol(d_resp), ncol=2)
## for-loop
for(k in 1:length(df_combinations2)){
for(i in 1:ncol(d_resp)){
for(j in 1:ncol(df_combinations2[[k]])){
results_m2[i, j][[1]] <- gam(d_resp[, i] ~ s(df_combinations2[[k]][,1])+s(df_combinations2[[k]][,2]))
}
}}
However, after running the for-loop I get the error "Error in all.vars1(gp$fake.formula[-2]) : can't handle [[ in formula".
Anyone know why I am getting this error/ how to fix it?
Any insight is much appreciated. Thanks!
Personally, I would create a data.table() containing all combinations of target variables and combinations of predictors and loop through all rows. See below.
library(data.table)
library(dplyr)
# Example data
set.seed(0)
dat <- gamSim(1,n=200,scale=2)
set.seed(1)
dat2 <- gamSim(1,n=200,scale=2)
names(dat2)[1:5] <- c("y1", paste0("x", 4:7))
d <- cbind(dat[, 1:5], dat2[, 1:5])
#select names of targets and predictors
targets <- c("y", "y1")
predictors <- colnames(d)[!colnames(d) %in% targets]
#create all combinations of predictors
predictor_combinations <- lapply(1:length(predictors), FUN = function(x){
#create combination
combination <- combn(predictors, m = x) |> as.data.table()
#add s() to all for gam
combination <- sapply(combination, FUN = function(y) paste0("s(", y, ")")) |> as.data.table()
#collapse
combination <- summarize_all(combination, .funs = paste0, collapse = "+")
#unlist
combination <- unlist(combination)
#remove names
names(combination) <- NULL
#return
return(combination)
})
#merge combinations of predictors as vector
predictor_combinations <- do.call(c, predictor_combinations)
#create folder to save results to
if(!dir.exists("dev")){
dir.create("dev")
}
if(!dir.exists("dev/models")){
dir.create("dev/models")
}
#create and save hypergrid (all combinations of targets and predictors combinations)
if(!file.exists("dev/hypergrid.csv")){
#create hypergrid and save to dev
hypergrid <- expand.grid(target = targets, predictors = predictor_combinations) |> as.data.table()
#add identifier
hypergrid[, model := paste0("model", 1:nrow(hypergrid))]
#save to dev
fwrite(hypergrid, file = "dev/hypergrid.csv")
} else{
#if file exists read
hypergrid <- fread("dev/hypergrid.csv")
}
#loop through hypergrid, create GAM models
#progressbar
pb <- txtProgressBar(min = 1, max = nrow(hypergrid), style = 3)
for(i in 1:nrow(hypergrid)){
#update progressbar
setTxtProgressBar(pb, i)
#select target
target <- hypergrid[i,]$target
#select predictors
predictors <- hypergrid[i,]$predictors
#create formula
gam.formula <- as.formula(paste0(target, "~", predictors))
#run gam
gam.model <- gam(gam.formula, data = d)
#save gam model do dev/model
saveRDS(gam.model, file = paste0("dev/models/", hypergrid[i,]$model, ".RDS"))
}
#example where you extract model performances
for(i in 1:nrow(hypergrid)){
#read the right model
rel.model <- readRDS(paste0("dev/models/", hypergrid[i,]$model, ".RDS"))
#extract model performance, add to hypergrid
hypergrid[i, R2 := summary(rel.model)[["r.sq"]]]
}
#arrange hypergrid on target and r2
hypergrid <- dplyr::arrange(hypergrid, hypergrid$target, desc(hypergrid$R2))
Which would give
head(hypergrid)
target predictors model R2
1: y s(x0)+s(x1)+s(x2)+s(x4)+s(x5) model319 0.6957242
2: y s(x0)+s(x1)+s(x2)+s(x3)+s(x4)+s(x5) model423 0.6953753
3: y s(x0)+s(x1)+s(x2)+s(x4)+s(x5)+s(x7) model437 0.6942054
4: y s(x0)+s(x1)+s(x2)+s(x5) model175 0.6941025
5: y s(x0)+s(x1)+s(x2)+s(x4)+s(x5)+s(x6) model435 0.6940569
6: y s(x0)+s(x1)+s(x2)+s(x3)+s(x4)+s(x5)+s(x7) model481 0.6939756
All models are saved to a folder with an identifier (for if you want to use the model or extract more information from the model).
Notably, p-hacking comes to mind using this appraoch and I would be careful by conducting your analysis like this.
I would like to create replicate predictions for one integer independent variable (iv1) given some model and a data frame called training. This is my current approach. I appreciate this is not self containing but hopefully it is self explanatory:
number_of_samples <- 10
results <- NULL
for (row in 1:nrow(training)) {
fake_iv1_values <- sample(1:100, number_of_samples)
case <- training[row,]
for (iv1 in fake_iv1_values) {
case$iv1 <- iv1
case$prediction <- predict(some_model, newdata = case)
results <- rbind(results, case)
}
}
Using loops is very slow. I wonder, if this could be sped up? Thanks!
Try with this.
Reproducible fake data and model:
# create fake data
n_row <- 100
n_xs <- 100
training <- data.frame(y = rnorm(n_row), iv1 = rnorm(n_row))
training[, paste0("x",1:n_xs)] <- replicate(n_xs, list(rnorm(n_row)))
# example model
some_model <- lm(y~., training)
Rewritten code:
number_of_samples <- 10
results <- NULL
# vector of several fake_iv1_values vectors
fake_iv1_values <- as.numeric(replicate(nrow(training), sample(1:100, number_of_samples)))
# replicate each row of the original dataframe
results <- training[rep(seq_len(nrow(training)), each = number_of_samples), ]
# add fake values to the replicated dataframe
results$iv1 <- fake_iv1_values
# get predictions
results$prediction <- predict(some_model, newdata = results)
I am trying to create 10 folds of my data. What I want to have is a data structure of length 10 (number of folds) and each element of the data structure contains an object/data structure that has two attributes/elements; the training set and the test set at that fold. This is my R code.
I wanted to access for example, the training set at fold 8 by View(data_pairs[[8]]$training_set). But it did not work. Any help would be appreciated :)
k <- 10 # number of folds
i <- 1:k
folds <- sample(i, nrow(data), replace = TRUE)
data_pairs <- list()
for (j in i) {
test_ind <- which(folds==j,arr.ind=TRUE)
test <- data[test_ind,]
train <- data[-test_ind,]
data_pair <- list(training_set = list(train), test_set = list(test))
data_pairs <- append(x = data_pairs, values = data_pair)
}
You were very close, you just needed to wrap values in a list call.
k <- 10 # number of folds
i <- 1:k
folds <- sample(i, nrow(mtcars), replace = TRUE)
data_pairs <- list()
for (j in i) {
test_ind <- which(folds==j,arr.ind=TRUE)
test <- mtcars[test_ind,]
train <- mtcars[-test_ind,]
data_pair <- list(training_set = train, test_set = test)
data_pairs <- append(x = data_pairs, values = list(data_pair))
#data_pairs <- c(data_pairs, list(data_pair))
}
If your data is big I would suggest you read these two posts on more efficient ways to grow a list.
Append an object to a list in R in amortized constant time, O(1)?
Here we go again: append an element to a list in R
I would also like to point out that you are not creating "folds" of your data. In your case you are attempting a 10-fold cross validation, which means your data should be separated into 10 "equal" sized chunks. Then you create 10 train/test data sets using each fold as the test data and the rest for training.
It seems like the package modelr could help you here.
In particular I would point you to:
https://modelr.tidyverse.org/reference/resample_partition.html
library(modelr)
ex <- resample_partition(mtcars, c(test = 0.3, train = 0.7))
mod <- lm(mpg ~ wt, data = ex$train)
rmse(mod, ex$test)
#> [1] 3.229756
rmse(mod, ex$train)
#> [1] 2.88216
Alternatively, producing a dataset of these partitions can be done with:
crossv_mc(data, n, test = 0.2, id = ".id")
I am currently working on a For loop in R. If I run the For loop on my own data, it takes ages, and I believe because I did something inefficient in my code. Could you please help me with improving it?
# Loop through the samples, explaining one instance at a time.
shap_values <- vector("list", nrow(X)) # initialize the results list.
system.time({
for (i in seq_along(shap_values)) {
set.seed(224)
shap_values[[i]] <- iml::Shapley$new(predictor, x.interest = X[i, ],sample.size = 30)$results
shap_values[[i]]$predicted_value <- iml::Shapley$new(predictor, x.interest = X[i, ],sample.size = 30)$y.hat.interest
shap_values[[i]]$sample_num <- i # identifier to track our instances.
}
data_shap_values <- dplyr::bind_rows(shap_values) # collapse the list.
})
I believe that my problem is in the
shap_values[[i]]$sample_num
variable, since I am redoing there the calculations of the previous
shap_values[[i]]$predicted_value
variable. The reason why I added that variable, was because I needed the
$y.hat.interest
as part of the new data frame (which is called "shap_values" and later "data_shap_values").
REPRODUCIBLE EXAMPLE: (starts at "This is the important part:)
#Example Shapley
#https://cran.r-project.org/web/packages/iml/vignettes/intro.html
data("Boston", package = "MASS")
head(Boston)
set.seed(42)
#install.packages("iml")
library("iml")
library("randomForest")
data("Boston", package = "MASS")
rf = randomForest(medv ~ ., data = Boston, ntree = 50)
# We create a Predictor object, that holds the model and the data.
# The iml package uses R6 classes: New objects can be created by calling Predictor$new()
X = Boston[which(names(Boston) != "medv")]
predictor = Predictor$new(rf, data = X, y = Boston$medv)
# Feature Importance
## Shifting each future, and measring how much the performance drops ##
imp = FeatureImp$new(predictor, loss = "mae")
plot(imp)
# Shapley value. Assume that for 1 data point, the feature values play a game together, in which
# they get the prediction as payout. Tells us how fairly distibute the payout among the feature values.
View(X)
shapley = Shapley$new(predictor, x.interest = X[1,])
shapley$plot()
# Reuse the object to explain other data points
shapley$explain(x.interest = X[2,])
shapley$plot()
# Results in data.frame form can be extracted like this:
results = shapley$results
head(results)
# THIS IS THE IMPORTANT PART:
# It might make sense for testing, to reduce the data:
X = X[1:10,]
# Loop through the samples, explaining one instance at a time.
shap_values <- vector("list", nrow(X)) # initialize the results list.
system.time({
for (i in seq_along(shap_values)) {
set.seed(224)
shap_values[[i]] <- iml::Shapley$new(predictor, x.interest = X[i, ],sample.size = 30)$results
shap_values[[i]]$predicted_value <- iml::Shapley$new(predictor, x.interest = X[i, ],sample.size = 30)$y.hat.interest
shap_values[[i]]$sample_num <- i # identifier to track our instances.
}
data_shap_values <- dplyr::bind_rows(shap_values) # collapse the list.
})
Update
As requested by #Ralf Stubner a Profiling of the for loop:
You are doubling your run-time by calling imp::Shapely$new twice with identical parameters. As an alternative, you can create the object once and extract the two values:
system.time({
for (i in seq_along(shap_values)) {
set.seed(224)
shapley <- iml::Shapley$new(predictor, x.interest = X[i, ],sample.size = 30)
shap_values[[i]] <- shapley$results
shap_values[[i]]$predicted_value <- shapley$y.hat.interest
shap_values[[i]]$sample_num <- i # identifier to track our instances.
}
data_shap_values <- dplyr::bind_rows(shap_values) # collapse the list.
})
If you have sufficient RAM to store your data multiple times, you might also try parallel processing using parallel, foreach or future.apply.