Performing multiclass PLS-DA with the mlr package in R

I want to use partial least squares discriminant analysis (PLS-DA) to solve a classification problem with multiple classes. I know PLS-DA is not limited to two-class problems, and I believe plsda from the caret package handles the multiclass case, but when I try to build a PLS-DA model in the mlr package, I get an error telling me my task is a "multiclass-problem, but learner 'classif.plsdaCaret' does not support that!"
Is it possible to build a multiclass PLS-DA model using mlr, or am I simply using the wrong learner? Here is a reproducible example:
# LOAD PACKAGES ----
#install.packages("BiocManager")
#BiocManager::install("mixOmics")
library(mlr)
library(tidyverse)
library(mixOmics)
# LOAD IN DATA ----
data(liver.toxicity)
liverTib <- as.tibble(cbind(liver.toxicity$treatment$Treatment.Group,
                            liver.toxicity$gene))
names(liverTib)[1] <- "Treatment"
liverTib
# MAKE TASK, LEARNER AND ATTEMPT TO BUILD MODEL ----
liverTask <- makeClassifTask(data = liverTib, target = "Treatment")
plsda <- makeLearner("classif.plsdaCaret")
liverModel <- train(plsda, liverTask)

In the development version of mlr (v2.14.0.9000), multiclass classification with the plsdaCaret learner is enabled. You can install that version from GitHub with this code:
install.packages("remotes")
remotes::install_github("mlr-org/mlr")
A PLS-DA example with 3 classes:
library(mlr)
#> Loading required package: ParamHelpers
tsk <- makeClassifTask("iris", iris, target = "Species")
lrn1 <- makeLearner("classif.plsdaCaret")
mod1 <- train(lrn1, tsk)
prd <- predict(mod1, tsk)
calculateConfusionMatrix(prd)
#>             predicted
#> true         setosa versicolor virginica -err.-
#>   setosa         50          0         0      0
#>   versicolor      0         31        19     19
#>   virginica       0          8        42      8
#>   -err.-          0          8        19     27
Created on 2019-07-18 by the reprex package (v0.3.0)
(This pull request solved the issue.)

The current implementation does not support multiclass; see the list of integrated learners: https://mlr.mlr-org.com/articles/tutorial/integrated_learners.html
You can change the code of the learner (https://github.com/mlr-org/mlr/blob/master/R/RLearner_classif_plsdaCaret.R) to make multiclass possible (see the instructions here: https://mlr.mlr-org.com/articles/tutorial/create_learner.html).
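Alternatively, if you only need PLS-DA itself rather than the mlr wrapper, caret's plsda() handles more than two classes directly. A minimal sketch on iris (assuming the caret package and its pls dependency are installed):
library(caret)  # plsda() is caret's PLS-DA wrapper around pls
data(iris)
# Fit a PLS-DA model on a three-class outcome
fit <- plsda(x = iris[, 1:4], y = iris$Species, ncomp = 3)
# Class predictions, cross-tabulated against the truth
preds <- predict(fit, iris[, 1:4])
table(truth = iris$Species, predicted = preds)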

Related

Loading {logistf} breaks MCMCglmm()

Loading the package logistf breaks MCMCglmm(). Unloading logistf before running the command doesn't remove the error.
Why is that? Is there a way to solve this?
Works
library(MCMCglmm)
#> Loading required package: Matrix
#> Loading required package: coda
#> Loading required package: ape
data(PlodiaPO)
MCMCglmm(PO ~ plate, data = PlodiaPO)
#>
#> MCMC iteration = 0
#>
#> MCMC iteration = 1000
#>
#> MCMC iteration = 2000
#>
#> MCMC iteration = 3000
#>
[...]
#> attr(,"class")
#> [1] "MCMCglmm"
Created on 2022-06-07 by the reprex package (v2.0.1)
Doesn't work
library(logistf)
library(MCMCglmm)
#> Loading required package: Matrix
#> Loading required package: coda
#> Loading required package: ape
data(PlodiaPO)
MCMCglmm(PO ~ plate, data = PlodiaPO)
#> Error in terms.formula(formula, data = data): invalid term in model formula
unloadNamespace("logistf")
MCMCglmm(PO ~ plate, data = PlodiaPO)
#> Error in terms.formula(formula, data = data): invalid term in model formula
Created on 2022-06-07 by the reprex package (v2.0.1)
After some research, I found that the problem does not come from logistf itself but from the imported package formula.tools. To reproduce the error, try:
library(formula.tools)
#> formula.tools-1.7.1 - Copyright © 2022 Decision Patterns
library(MCMCglmm)
#> Loading required package: Matrix
#> Loading required package: coda
#> Loading required package: ape
data(PlodiaPO)
MCMCglmm(PO ~ plate, data = PlodiaPO)
#> Error in terms.formula(formula, data = data) :
#>   invalid term in model formula
This is a known issue with formula.tools; see "Weird package dependency introduces error".
The solution detailed in that issue is:
fork the formula.tools repo
remove this line: https://github.com/decisionpatterns/formula.tools/blob/45b6654e4d8570cbaf1e2fd527652471202d97ad/NAMESPACE#L3
install_github from your repo
OR
run as.character.formula = function(x) as.character.default(x) right after loading formula.tools. That might break code that relies on as.character.formula, though (I'm not sure).
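A minimal sketch of that second workaround, defining the masking function in the global environment right after loading formula.tools (whether this breaks other code that uses as.character.formula is untested):
library(formula.tools)
# Mask formula.tools' as.character.formula with the default method so that
# model formulas are deparsed the usual way again
as.character.formula <- function(x, ...) as.character.default(x, ...)
library(MCMCglmm)
data(PlodiaPO)
m <- MCMCglmm(PO ~ plate, data = PlodiaPO)  # should now run without the error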
Thanks for this question

De-identifying survival or flexsurvreg objects in R

Please consider the following:
I need to provide some R code syntax to analyse data with the flexsurv package. I am not allowed to receive or analyse the data directly or on-site. I am, however, allowed to receive the analysis results.
Problem
When we run the flexsurvreg() function on some data (here ovarian from the flexsurv package), the created object (here fitw) contains enough information to "re-create" or "back-engineer" the actual data. But then I would technically have access to the data I am not allowed to have.
# Load package
library("flexsurv")
#> Loading required package: survival
# Run flexsurvreg with data = ovarian
fitw <- flexsurvreg(formula = Surv(futime, fustat) ~ factor(rx) + age,
                    data = ovarian, dist = "weibull")
# Look at first observation in ovarian
ovarian[1, ]
#>   futime fustat     age resid.ds rx ecog.ps
#> 1     59      1 72.3315        2  1       1
# With the following from the survival object, the data could be re-created
fitw$data$Y[1, ]
#>   time status  start   stop  time1  time2
#>     59      1      0     59     59    Inf
fitw$data$m[1, ]
#>   Surv(futime, fustat) factor(rx)     age (weights)
#> 1                   59          1 72.3315         1
Potential solution
We could write the code so that it also sets all those data that might be used for this back-engineering to NA as follows:
# Setting all survival object observation to NA
fitw$data$Y <- NA
fitw$data$m <- NA
fitw$data$mml$scale <- NA
fitw$data$mml$rate <- NA
fitw$data$mml$mu <- NA
Created on 2021-08-27 by the reprex package (v2.0.0)
Question
If I proceed as the above and set all these parameters to NA, could I then receive the fitw object (e.g. as an .RDS file) without ever being able to "back-engineer" the original data? Or is there any other way to share fitw without the attached data?
Thanks!
Setting, e.g., fitw$data <- NULL will remove all the individual-level data from the fitted model object. Some of the output functions may not work with objects stripped of data, however. In the current development version on GitHub, printing the model object should work. The summary and predict methods should also work, as long as covariate values are supplied in newdata; omitting them won't work, since the default is to take the covariate values from the observed data.
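A minimal sketch of that suggestion (the file name and covariate values below are illustrative; per the above, summary() on the stripped object needs covariates passed via newdata and, at the time of writing, the development version of flexsurv):
library(flexsurv)
fitw <- flexsurvreg(Surv(futime, fustat) ~ factor(rx) + age,
                    data = ovarian, dist = "weibull")
# Strip all individual-level data before sharing the fitted object
fitw$data <- NULL
saveRDS(fitw, "fitw_stripped.rds")  # illustrative file name
# Downstream users must supply covariate values explicitly
summary(fitw, newdata = data.frame(rx = 1, age = 60),
        type = "survival", t = c(180, 365))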

R won't predict a continuous variable using a bayesian network (Throws Error)

I'm running into an error using bnlearn in R to try and predict a continuous variable:
library(bnlearn) # Load the package in R
data(gaussian.test)
training.set = gaussian.test[1:4000, ] # This is training set to learn the parameters
test.set = gaussian.test[4001:4010, ] # This is test set to give as evidence
res = hc(training.set) # learn BN structure on training set data
fitted = bn.fit(res, training.set) # learning of parameters
pred = predict(fitted$C, test.set) # predicts the value of node C given test set
The error I get reads:
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "bn.fit.gnode"
I could not find anything googling the error. I got the example from another thread, where it seems to have worked.
What am I missing?
I'm grateful for every hint. Thank you!
By default, most Bayesian network algorithms assume that all the variables are binary (yes/no), consisting of ones and zeroes, and not continuous. Thus, you should probably check the data requirements/options of the R package you're using. Bayesian networks that use continuous variables commonly employ kernel density estimation (KDE). You could try searching for a Bayesian network package that uses the Epanechnikov kernel, one of the most common kernels used for KDE.
I looked at your R code, and after tweaking some of the dependencies, you need to use the following code to make a prediction and obtain the accuracy:
library(bnlearn) # Load the package in R
install.packages("Rcpp")
install.packages("forecast", dependencies = TRUE)
library(forecast)
data(gaussian.test)
training.set = gaussian.test[1:4000, ] # This is training set to learn the parameters
test.set = gaussian.test[4001:4010, ] # This is test set to give as evidence
res = hc(training.set) # learn BN structure on training set data
fitted = bn.fit(res, training.set) # learning of parameters
pred = predict(fitted, "C", test.set) # predicts the value of node C given test set
cbind(pred, test.set[, "C"]) # compare the actual and predicted
accuracy(f = pred, x = test.set[, "C"])
             pred
 [1,]  3.5749952  3.952410
 [2,]  0.7434548  1.443177
 [3,]  5.1731669  5.924198
 [4,] 10.0840800 10.296560
 [5,] 12.3966908 12.268170
 [6,]  9.1834888  9.725431
 [7,]  6.8067145  5.625797
 [8,]  9.9246630  9.597326
 [9,]  5.9426798  6.503896
[10,] 16.0056136 16.037176
> accuracy(f = pred, x = test.set[, "C"])
                ME      RMSE       MAE      MPE    MAPE
Test set 0.1538594 0.5804431 0.4812143 6.172352 11.26223

Why is importance affected after parallelization of randomForest?

I am working with the randomForest package in R. To speed up the classification step, I was interested in growing the forest in parallel. For that, I have used the 'foreach' package in a similar way to what is shown in the 'foreach' vignette. This consists of splitting the total number of trees by the number of cores you would like to use, and then combining the resulting forests with the 'combine' function of the 'randomForest' package:
require(randomForest)
require(foreach)
require(doParallel)
registerDoParallel(cores=CPUS)
rf <- foreach::foreach(ntree = rep(ceiling(NTREE/CPUS), CPUS),
                       .combine = randomForest::combine,
                       .packages = "randomForest") %dopar% {
  randomForest::randomForest(x = t(Y), y = A, ntree = ntree, importance = TRUE, ...)
}
I compared the results of the "parallel" forest with the forest generated in one core. The prediction capacity with the test set seems to be similar, but the 'importance' values are considerably reduced, and this affects the following steps of variable selection.
imp <- importance(rf,type=1)
I would like to know why this happens, and if it is correct or there is any mistake. Thanks a lot!
randomForest::combine does not support re-calculation of variable importance. In the randomForest package, importance is only calculated just before the randomForest::randomForest function terminates. Two options are:
Write your own variable importance function, which takes the combined forest and training set as inputs. That is roughly ~50 lines of code (a rough sketch follows this list).
Use an 'lapply'-like parallel computation, where each randomForest object is an element in the output list. Then aggregate variable importance across all forests and simply compute the mean, using do.call(combine, rf.list) outside the foreach loop. This method is an approximation of the total variable importance, but quite a good one.
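A rough sketch of what the custom importance function in option 1 could look like (a hypothetical helper, permuting on the supplied data rather than on out-of-bag samples, so it only approximates randomForest's own measure):
# Hypothetical permutation importance for a combined forest:
# mean drop in accuracy when each predictor is permuted
perm_importance <- function(forest, x, y, nrep = 5) {
  base_acc <- mean(predict(forest, x) == y)
  sapply(names(x), function(v) {
    mean(replicate(nrep, {
      xp <- x
      xp[[v]] <- sample(xp[[v]])                  # permute one predictor
      base_acc - mean(predict(forest, xp) == y)   # accuracy drop
    }))
  })
}
# e.g. on the combined forest built below:
# perm_importance(big.rf, iris[, -5], iris$Species)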
A Windows-compatible code example of option 2:
library(randomForest)
library(doParallel)
CPUS=6; NTREE=5000
cl = makeCluster(CPUS)
registerDoParallel(cl)
data(iris)
rf.list = foreach(ntree = rep(NTREE/CPUS, CPUS),
                  .combine = c,
                  .packages = "randomForest") %dopar% {
  list(randomForest(Species ~ ., data = iris, importance = TRUE, ntree = ntree))
}
stopCluster(cl)
big.rf = do.call(combine,rf.list)
big.rf$importance = rf.list[[1]]$importance
for(i in 2:CPUS) big.rf$importance = big.rf$importance + rf.list[[i]]$importance
big.rf$importance = big.rf$importance / CPUS
varImpPlot(big.rf)
#test number of trees in one forest and combined forest, big.rf
print(big.rf) #5000 trees
rf.list[[1]]$ntree
#training single forest
rf.single = randomForest(Species~.,data=iris,ntree=5000,importance=T)
varImpPlot(big.rf)
varImpPlot(rf.single)
#print unscaled variable importance, no large deviations
print(big.rf$importance)
#                   setosa  versicolor  virginica MeanDecreaseAccuracy MeanDecreaseGini
# Sepal.Length 0.033184860 0.023506673 0.04043017           0.03241500         9.679552
# Sepal.Width  0.008247786 0.002135783 0.00817186           0.00613059         2.358298
# Petal.Length 0.335508637 0.304525644 0.29786704           0.30933142        43.160074
# Petal.Width  0.330610910 0.307016328 0.27129746           0.30023245        44.043737
print(rf.single$importance)
#                   setosa   versicolor  virginica MeanDecreaseAccuracy MeanDecreaseGini
# Sepal.Length 0.031771614 0.0236603417 0.03782824          0.031049531         9.516198
# Sepal.Width  0.008436457 0.0009236979 0.00880401          0.006048261         2.327478
# Petal.Length 0.341879367 0.3090482654 0.29766905          0.312507316        43.786481
# Petal.Width  0.322015885 0.3045458852 0.26885097          0.296227150        43.623370
#but when plotting using varImpPlot, scale=TRUE by default
#either simply turn off scaling to get comparable results
varImpPlot(big.rf,scale=F)
varImpPlot(rf.single,scale=F)
#... or correct scaling to the number of trees
big.rf$importanceSD = CPUS^-.5 * big.rf$importanceSD
#and now there are no large differences for scaled variable importance either
varImpPlot(big.rf,scale=T)
varImpPlot(rf.single,scale=T)

Help fitting a poisson glmer (getting an error message)

I am trying to fit a Poisson glmer model in R, to determine if 4 experimental
treatments affected the rate at which plants developed new branches over time.
New branches were counted after 35, 70 and 83 days and data were organised as follows:
treatment replicate.plant time branches
a         ID4               35        0
a         ID4               70        1
a         ID4               83        1
a         ID12              35        1
a         ID12              70        3
a         ID12              83        8
Loading the package lme4, I ran the following model:
mod <- glmer(branches ~ treatment + (1|time),
             family = poisson,
             data = dataset)
but I obtain the following error message:
Error in get(name, envir = asNamespace(pkg), inherits = FALSE) :
object '.setDummyField' not found
Can anyone please give me an indication of why I am getting this error and what it means?
Any advice on how to make this model run will be greatly appreciated.
This is a known issue; see https://github.com/lme4/lme4/issues/54
The problem seems to be limited to R version 3.0.0. You should update to a more recent version of R.
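If in doubt, a quick sanity check (a minimal sketch; per the linked issue, the actual fix is simply to upgrade R and reinstall lme4):
# The '.setDummyField' error is tied to R 3.0.0
R.version.string
# After upgrading R, reinstall lme4 so it is rebuilt against the new version
install.packages("lme4")
library(lme4)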
