How do I modify an existing function from a package? - r

I'm trying to increase the limit on the trials parameter, which is currently capped at 100 in the C50 package. I tried to do this using fix().
library(C50)
data(churn)
fix(C5.0.default) # in the editor, change maxtrials <- 200
treeModel <- C5.0(x = churnTrain[, -20], y = churnTrain$churn, trials = 150)
Then I get the following error, even when trials is below 200:
could not find function "makeNamesFile"
I restarted R and then tried fixInNamespace, again changing the limit to 200:
fixInNamespace("C5.0.default", pos="package:C50")
treeModel <- C5.0(x = churnTrain[, -20], y = churnTrain$churn, trials = 150)
The model works for trials below 100 but gives the following error for trials above 100. This is the standard error that C5.0 raises when the user requests more than 100 trials:
number of boosting iterations must be between 1 and 100
I want to increase the number of boosting trials for the C5.0 model. How do I do that? This might be an implementation constraint, but since xgboost can handle more than 100 boosting iterations, there may be a way for C5.0 to handle it as well.
I am able to increase the iterations beyond 100 with the fix() call, but only after running all the R scripts from the source version of the C50 package. What can I do to avoid this? I tried installing the C50 package from source and giving that a try, but it didn't work out.

I could get more than 100 trials by tweaking the source code from this link. You need to source the R files; then you can change the default limit to allow more than 100 trials.
# Allow for more than 100 boosting iterations
setwd('Path to R files')                  # folder containing the C50 R sources
files <- list.files(pattern = "\\.R$")    # every .R file in that folder
lapply(files, source)                     # source them so all internal helpers are defined
fix(C5.0.default)                         # then raise the trials limit in the editor
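The "could not find function makeNamesFile" error from the plain fix() route most likely comes from the edited copy living in the global environment, where C50's internal helpers are not visible. A hedged alternative to sourcing every file is to keep the edited function bound to the C50 namespace; this is a sketch, not part of the answer above, and whether the compiled code accepts more than 100 trials is a separate question:
# Hedged sketch: patch the method while keeping the C50 namespace visible
library(C50)
f <- C50:::C5.0.default                  # copy of the default method
fix(f)                                   # relax the "between 1 and 100" check here
environment(f) <- asNamespace("C50")     # so makeNamesFile etc. can still be found

# either call the patched copy directly ...
data(churn)
treeModel <- f(x = churnTrain[, -20], y = churnTrain$churn, trials = 150)

# ... or push it back into the namespace (S3 dispatch of C5.0() may or may not
# pick this up, depending on how the method was registered)
assignInNamespace("C5.0.default", f, ns = "C50")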

Consensus clustering with diceR package

I am supposed to perform combined K-means + Gaussian mixture model clustering to determine a set of consensus clusters for a fixed number of clusters (k = 4). My data consists of 231 cells from 4 different types of tumor, with a total of 19,177 variables (genes in this case).
I have never done this before, so I tried to follow the instructions from this R package: https://search.r-project.org/CRAN/refmans/diceR/html/consensus_cluster.html
However, I must have done something wrong, since when I try to run the code:
cc <- consensus_cluster(data, nk = 4, algorithms = c("gmm", "km"), progress = FALSE)
it takes far too long and eventually fails with this error:
Error: cannot allocate vector of size 11.0 Gb
So clearly the vector being generated is too large, and I must have misunderstood something in the tutorial.
Is anyone familiar with the diceR package who could explain whether there is a way to make this work?
During its execution, consensus_cluster "eats up" the memory of the R session. You have so many variables that the required objects cannot be allocated in memory.
So you have two choices: increase the physical memory, or use a partial sample of the data rather than the full set. Assuming a memory increase is not feasible, you should use the prep.data = "sample" option. However, you'll need to wait: with simulated data of the same size, the GMM step alone estimated about 8 hours.
Please see below:
library(diceR)
observ = 23
variables = 19177
dat <- matrix(rnorm(observ * variables), ncol = variables)
cc <- consensus_cluster(dat, nk = 4, algorithms = c("gmm", "km"), progress = TRUE,
                        prep.data = "sample")
Output (I was not patient enough to wait for it to finish):
Clustering Algorithm 1 of 2: GMM (k = 4) [---------------------------------] 1% eta: 8h
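If even the sampled run is too slow, another pragmatic option, which is not part of the answer above, is to reduce the 19,177 genes to the most variable ones before clustering so that the matrix becomes small enough to allocate. A rough sketch, reusing the call from the question (the cutoff of 2,000 genes is arbitrary):
library(diceR)
gene_var  <- apply(data, 2, var)                         # variance of each gene (column)
top_genes <- order(gene_var, decreasing = TRUE)[1:2000]  # keep the 2,000 most variable
data_small <- data[, top_genes]
cc <- consensus_cluster(data_small, nk = 4,
                        algorithms = c("gmm", "km"), progress = FALSE)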

Octave Error: out of memory or dimension too large for Octave's index type

I am trying to run the following code in Octave. The variable "data" consists of 864 rows and 25333 columns.
clc; clear all; close all;
pkg load statistics
GEO = load("GSE59739.mat");
GEOT = tabulate(GEO.class)
data = GEO.data;
clear GEO
idx = kmeans(data,3,'Distance','cosine');
test1 = silhouette(data, idx, 'cosine');
xlabel('Silhouette Value')
ylabel('Cluster')
This is the error I get when trying to run the silhouette function:
"error: out of memory or dimension too large for Octave's index type". Any idea on how I can fix it?
It appears the problem is not necessarily with your data but with the way Octave's statistics package has implemented pdist. It uses an expansion that results in an array whose dimensions exceed the system limits, just as the error message says.
Running through your example with some dummy data of the same size, on Octave 6.4.0 and statistics 1.4.3, I get:
pkg load statistics
data = rand(864,25333);
idx = kmeans(data,3,'Distance','cosine');
test1 = silhouette(data, idx, 'cosine');
error: out of memory or dimension too large for Octave's index type
error: called from
pdist at line 164 column 14
silhouette at line 125 column 16
pdist is a function that calculates the "distance" between any two rows of a matrix, using one of several methods. silhouette is called with the cosine metric, and the error occurs in that calculation section:
pdist, lines 163-166 cosine block:
case "cosine"
prod = X(:,Xi) .* X(:,Yi);
weights = sumsq (X(:,Xi), 1) .* sumsq (X(:,Yi), 1);
y = 1 - sum (prod, 1) ./ sqrt (weights);
The first line, which calculates prod, causes the error: X = data' is 25333x864, and Xi and Yi are each 372816x1, having been formed by nchoosek(1:rows(data), 2) (producing all 372816 two-element combinations of 1:864).
X(:,Xi) and X(:,Yi) each request creation of a rows(X) x rows(Xi) array, or 25333x372816, or 9,444,547,728 elements, which for double precision data requires 75,556,381,824 Bytes or 75.6GB. Odds are your machine can't handle this.
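To see where that number comes from, here is the same back-of-the-envelope check written out (in R for brevity; the figures are taken directly from the explanation above):
n_rows  <- 25333                  # rows of X = data' (originally the columns of data)
n_pairs <- choose(864, 2)         # 372816 pairs from nchoosek(1:864, 2)
bytes   <- n_rows * n_pairs * 8   # 8 bytes per double-precision element
bytes / 1e9                       # about 75.6 GB, needed for each of X(:,Xi) and X(:,Yi)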
Just checking with Matlab 2022a, it is able to run those lines without any out-of-memory errors in a few seconds, and the test1 output is only 864x1. So it appears this excessive memory overhead is an issue specific to Octave's implementation and not inherent to the technique.
I've filed a bug report regarding this behavior at https://savannah.gnu.org/bugs/index.php?62495, but for now the answer appears to be that the 'cosine' metric, and perhaps others as well, simply cannot be used with input data of this size.
Update: as of 19 JUN 2022, a fix for this pdist memory problem has been pushed to the statistics package repository, and will be included in the next major package release. In the meantime the updated function can be found at https://github.com/gnu-octave/statistics/blob/main/inst/pdist.m

bartMachine in caret train error: incorrect number of dimensions

I encounter a strange problem when trying to train a model in R using caret:
> bart <- train(x = cor_data, y = factor(outcome), method = "bartMachine")
Error in tuneGrid[!duplicated(tuneGrid), , drop = FALSE] :
incorrect number of dimensions
However, when using rf, xgbTree, glmnet, or svmRadial instead of bartMachine, no error is raised.
Moreover, dim(cor_data) and length(outcome) return [1] 3056 134 and [1] 3056 respectively, which indicates that there is indeed no issue with the dimensions of my dataset.
I have tried changing the tuneGrid parameter in train, which resolved the problem but caused this issue instead:
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "pool-89-thread-1"
My dataset includes no NAs, and all variables are either numerical or binary.
My goal is to extract the most important variables in the BART model. For random forests, for example, I use:
rf <- train(x = cor_data, y = factor(outcome), method = "rf")
rfImp <- varImp(rf)
rf_select <- row.names(rfImp$importance[order(- rfImp$importance$Overall)[1:43], , drop = FALSE])
Thank you in advance for your help.
Since your goal is to extract the most important variables from the BART model, I will assume you are willing to bypass the caret wrapper and use the bartMachine package directly, which is the only way I could get this to run successfully.
On my system, solving the memory issue required two further things:
Restart R and, before loading anything, allocate 8 GB of memory to Java like so:
options(java.parameters = "-Xmx8g")
When running bartMachineCV, turn off mem_cache_for_speed:
library(bartMachine)
set_bart_machine_num_cores(16)
bart <- bartMachineCV(X = cor_data, y = factor(outcome), mem_cache_for_speed = FALSE)
This will iterate through 3 values of k (2, 3 and 5) and 2 values of m, the number of trees (50 and 200), running 5-fold cross-validation for each combination, then build a bartMachine using the best hyperparameter combination. You may also have to reduce the number of cores depending on your system, but this took about an hour on a 20,000-observation x 12-variable training set on 16 cores. You could also reduce the number of hyperparameter combinations it tests using the k_cvs and num_tree_cvs arguments, as sketched below.
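For example, a reduced grid might look like the following; k_cvs and num_tree_cvs are the bartMachineCV arguments mentioned above, and the specific values here are only an illustration:
# Hedged sketch: a smaller cross-validation grid to cut runtime
bart_small <- bartMachineCV(X = cor_data, y = factor(outcome),
                            mem_cache_for_speed = FALSE,
                            k_cvs = c(2, 3),       # try only k = 2 and k = 3
                            num_tree_cvs = c(50))  # a single forest size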
Then to get the variable importance:
vi <- investigate_var_importance(bart, num_replicates_for_avg = 20)
print(vi)
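To mirror the rf_select step from the question, the top names can then be pulled out of vi. This assumes the returned list exposes the averaged inclusion proportions as avg_var_props; check str(vi) on your bartMachine version:
# Hypothetical extraction step, analogous to rf_select in the question
bart_props  <- sort(vi$avg_var_props, decreasing = TRUE)  # assumed element name
bart_select <- names(bart_props)[1:43]                    # top 43 variables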
You can also use it as a predictive model with predict(bart, new_data = new), similar to the object normally returned by caret::train(). This worked with R 4.0.5, bartMachine 1.2.6 and rJava 1.0-4.

How to limit the number of iterations the pam function from the cluster library performs?

How can I reduce the number of iterations of the PAM clustering algorithm in the cluster package?
I am trying to produce a couple of plots showing how pam works, so I want to reduce the number of iterations to 2. I have cloned the cluster repo into my working directory and edited the pam.q file (in ./cluster/R) so that nMax equals 2.
# original
nMax <- 65536 # 2^16 (as 1+ n(n-1)/2 must be < max_int = 2^31-1)
# modified
nMax <- 2
However, even with no changes applied to the original file, the pam algorithm fails to run. If I load it with library(cluster) instead, it works as expected, but then I have no way to manipulate the number of iterations.
Sample code of what I'm trying to achieve is displayed below:
# -- Working code --
library(datasets)
data(iris)
library(cluster)
df <- data.frame(iris$Petal.Length, iris$Petal.Width)
pam.res <- pam(df, k = 2)
pam.res
# -- Failing Code --
library(datasets)
data(iris)
source("./cluster/R/pam.q")
df <- data.frame(iris$Petal.Length, iris$Petal.Width)
pam.res <- pam(df, k = 2)
pam.res
This is the error I'm getting, when running the "Failing Code" above:
Error in pam(clust_ex, k = 2) : object 'cl_Pam' not found
I expect the same output as for the working code when I source the pam.q file directly instead of loading the library.
Is there something I'm not doing right in the way I import the .q file? Or is there another way to change the number of iterations the pam algorithm performs?
nMax is the maximum number of objects, not the maximum number of iterations.
It is also not sufficient to just modify the .q file: the 'cl_Pam' object in your error appears to be the registered native routine, which only exists inside the installed package's namespace, so sourcing pam.q on its own cannot work.
It's probably easier to do this with ELKI...
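If the aim is only to illustrate intermediate states of PAM, one workaround that avoids touching pam.q at all is to compare the BUILD-phase result with the full BUILD+SWAP result. This is a hedged sketch relying on the do.swap and trace.lev arguments documented for cluster::pam; check that your installed version supports them:
library(cluster)
library(datasets)
data(iris)
df <- data.frame(iris$Petal.Length, iris$Petal.Width)
pam_build <- pam(df, k = 2, do.swap = FALSE)  # stop after the BUILD phase
pam_full  <- pam(df, k = 2, trace.lev = 2)    # full run, printing the SWAP steps
pam_build$medoids                             # medoids before any swapping
pam_full$medoids                              # medoids after convergence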

R bsts predictions are not consistent

Whenever I run the predict function multiple times on a bsts model using the same prediction data, I get different answers. So my question is: is there a way to get consistent answers, given that I keep my predictor dataset the same?
Example using the iris data set (I know it's not time series but it will illustrate my point)
iris_train <- iris[1:100,1:3]
iris_test <- iris[101:150,1:3]
ss <- AddLocalLinearTrend(list(), y = iris_train$Sepal.Length)
iris_bsts <- bsts(formula = Sepal.Length ~ ., data = iris_train,
                  state.specification = ss,
                  family = 'gaussian', seed = 1, niter = 500)
burn <- SuggestBurn(0.1,iris_bsts)
Now if I run the following lines, say, 10 times, each result is different:
iris_predict <- predict(iris_bsts, newdata = iris_test, burn = burn)
iris_predict$mean
I understand that it is running MCMC simulations, but I require consistent results and have therefore tried:
Setting the seed in bsts and before predict
Setting the state-space standard deviation to near 0, which just creates unstable results
Neither seems to work. Any help would be appreciated!
I encountered the same problem. To fix it, you need to set the random seed in the embedded C code. I forked the package and made the modifications here: BSTS.
To install the package, download bsts_0.7.1.1.tar.gz from the build folder. If you already have bsts installed, replace it with this version via:
remove.packages("bsts")
# assumes the working directory is where the file is located
install.packages("bsts_0.7.1.1.tar.gz", repos = NULL, type = "source")
If you do not have bsts installed, please install it first to ensure all dependencies are there. (This may require installing Rtools, Boom, and BoomSpikeSlab individually.)
This package version only modifies the predict function from bsts, so all other code should work as is. It automatically sets the random seed to 1 each time predict is called. If you want predictions to vary, you'll need to explicitly set the seed parameter of predict to a different value each time.
You can make a function to specify the seed each time (set.seed turned out to be unnecessary):
reproducible_predict <- function(S) {
  iris_bsts <- bsts(formula = Sepal.Length ~ ., data = iris_train,
                    state.specification = ss, seed = S,
                    family = 'gaussian', niter = 500)
  burn <- SuggestBurn(0.1, iris_bsts)
  iris_predict <- predict(iris_bsts, newdata = iris_test, burn = burn)
  return(iris_predict$mean)
}
reproducible_predict(1)
[1] 7.043592 6.212780 6.789205 6.563942 6.746156
reproducible_predict(1)
[1] 7.043592 6.212780 6.789205 6.563942 6.746156
reproducible_predict(200)
[1] 7.013679 6.173846 6.763944 6.567651 6.715257
reproducible_predict(200)
[1] 7.013679 6.173846 6.763944 6.567651 6.715257
I have come across the same issue.
The problem comes from setting the seed within the model definition only.
To solve the problem, you have to set a seed within the predict function as well, such as:
iris_predict <- predict(iris_bsts, newdata = iris_test, burn = burn, seed=X)
Hope this helps.
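Putting the two answers together, a minimal check (assuming the installed bsts version's predict method accepts a seed argument, as described above):
# Hedged sketch: fix the seed at predict time and verify repeated runs agree
p1 <- predict(iris_bsts, newdata = iris_test, burn = burn, seed = 1)
p2 <- predict(iris_bsts, newdata = iris_test, burn = burn, seed = 1)
all.equal(p1$mean, p2$mean)   # should be TRUE when the seed is honoured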
