How to limit the number of iterations the pam function from the cluster library performs? - r

How do I reduce the number of iterations of the PAM clustering algorithm in the cluster package?
I am trying to produce a couple of plots showing how pam works, so I want to reduce the number of iterations to 2. I have cloned the cluster repo to my working directory and edited the pam.q file (in ./cluster/R) so that nMax equals 2.
# original
nMax <- 65536 # 2^16 (as 1+ n(n-1)/2 must be < max_int = 2^31-1)
# modified
nMax <- 2
However, even with no changes applied to the original file, the pam algorithm fails to run. If I instead load it with library(cluster), it works as expected, but then I have no way to manipulate the number of iterations.
Sample code of what I'm trying to achieve is displayed below:
# -- Working code --
library(datasets)
data(iris)
library(cluster)
df <- data.frame(iris$Petal.Length, iris$Petal.Width)
pam.res <- pam(df, k = 2)
pam.res
# -- Failing Code --
library(datasets)
data(iris)
source("./cluster/R/pam.q")
df <- data.frame(iris$Petal.Length, iris$Petal.Width)
pam.res <- pam(df, k = 2)
pam.res
This is the error I'm getting when running the "Failing Code" above:
Error in pam(clust_ex, k = 2) : object 'cl_Pam' not found
I expect the same output as from the working code when I source the pam.q file directly instead of loading the library.
Is there something I'm not doing quite right in the way I import the .q file? Or is there another way to change the number of iterations the pam algorithm performs?

nMax is the maximum number of objects, not the maximum number of iterations.
It is also not sufficient to just modify the .q file.
It's probably easier to do this with ELKI...
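As for why sourcing the file fails: pam.q calls the package's compiled code through the registered native symbol cl_Pam, which only exists once the cluster shared library is loaded, so sourcing the .q file on its own cannot find it. A minimal sketch of a workaround, assuming the modified sources live in ./cluster and a build toolchain (e.g. Rtools on Windows) is available, is to load the whole modified package with devtools rather than a single file:
# Hedged sketch: load the entire modified package source, not just pam.q,
# so the compiled routines (including cl_Pam) are built and registered.
# install.packages("devtools")   # if not already installed
library(devtools)
load_all("./cluster")            # compiles and loads the modified cluster package

df <- data.frame(iris$Petal.Length, iris$Petal.Width)
pam.res <- pam(df, k = 2)
pam.res
Keep in mind that, as noted above, changing nMax would not change the number of iterations in any case.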

Related

Consensus clustering with diceR package

I am supposed to perform a combined K-means + Gaussian Mixture Models analysis to determine a set of consensus clusters for a fixed number of clusters (k = 4). My data is composed of 231 cells from 4 different types of tumor, with a total of 19,177 variables (genes in this case).
I have never tried to perform this before, and I tried to follow the instructions for this R package: https://search.r-project.org/CRAN/refmans/diceR/html/consensus_cluster.html
However, I must have done something wrong, since when I try to run the code:
cc <- consensus_cluster(data, nk = 4, algorithms =c("gmm", "km"), progress = F )
it takes far too long and ends with this error:
Error: cannot allocate vector of size 11.0 Gb
So clearly the vector being generated is too large, and I must have misunderstood something in the tutorial.
Is anyone familiar with the diceR package who could explain whether there is a way to make it work?
consensus_cluster() "eats up" the memory of the R session while it runs. You have so many variables that handling them all cannot be allocated in memory.
So you have two choices: increase physical memory, or use a partial sample of the data instead of the full dataset. Let's assume that increasing physical memory is not feasible. Then you should use the prep.data = "sample" option. However, you'll need to wait: I simulated data of this size, and for GMM the estimated wait was about 8 hours.
Please see below:
library(diceR)
observ <- 23
variables <- 19177
dat <- matrix(rnorm(observ * variables), ncol = variables)
cc <- consensus_cluster(dat, nk = 4, algorithms = c("gmm", "km"),
                        progress = TRUE, prep.data = "sample")
Output (I was not patient enough to wait for it to finish):
Clustering Algorithm 1 of 2: GMM (k = 4) [---------------------------------] 1% eta: 8h
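Another way to shrink the problem, which is my own suggestion rather than anything from the diceR documentation, is to cluster only the most variable genes; the 2,000-gene cutoff below is an arbitrary, hypothetical choice, and data refers to your 231 x 19,177 matrix:
# Hedged sketch: filter to the most variable genes before consensus clustering
library(diceR)

gene_var <- apply(data, 2, var, na.rm = TRUE)          # variance of each gene (column)
keep     <- order(gene_var, decreasing = TRUE)[1:2000] # indices of the top 2000 genes
data_sub <- data[, keep]                               # 231 cells x 2000 genes

cc <- consensus_cluster(data_sub, nk = 4,
                        algorithms = c("gmm", "km"), progress = TRUE)
This keeps the memory footprint far smaller, at the cost of discarding low-variance genes.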

How do I suppress a random number generation warning with future.callr?

I'm using future.callr, which creates a new background R process(?) every time a future is requested, so it is evaluated separately and the main R script can keep moving on.
I'm getting the following warning when my futures come back:
Warning message:
UNRELIABLE VALUE: Future (‘<none>’) unexpectedly generated random numbers without specifying argument 'seed'. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To fix this, specify 'seed=TRUE'. This ensures that proper, parallel-safe random numbers are produced via the L'Ecuyer-CMRG method. To disable this check, use 'seed=NULL', or set option 'future.rng.onMisuse' to "ignore".
In the actual code I'm running, it's just loading some data, and I don't know why it's generating random numbers (or care; EDIT: I do care, see the comments below). How do I stop that warning from being shown (either by fixing the RNG usage or by just ignoring it)?
I've got a lot of lines with futures so I am hoping to be able to just set the option at the beginning somehow and not have to add it to every line.
Here's an example and my attempt to ignore the warning.
library(future.callr)
set.seed(1234567)
future.seed = TRUE
#normal random number - no problem
a<-runif(1)
print(a)
#random number in future, using callr plan
plan(callr, future.rng.onMisuse = 'ignore')
b %<-% runif(1)
print(b)
See help("%seed%", package = "future").
You could either use %seed% like below:
b %<-% runif(1) %seed% TRUE # seed is set by future pkg
print(b)
# or
b %<-% runif(1) %seed% 1234567 # your seed
print(b)
Or, to disable the check entirely:
options(future.rng.onMisuse = "ignore")
b %<-% runif(1)
print(b)
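Putting it together, a minimal sketch of the "set it once at the top" approach the question asks for (the global option is applied before any future is created):
library(future.callr)

plan(callr)                              # evaluate futures in separate callr sessions
options(future.rng.onMisuse = "ignore")  # silence the RNG check for every future
# or, if the random draws actually matter, prefer a parallel-safe seed per future:
# b %<-% runif(1) %seed% TRUE

b %<-% runif(1)
print(b)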

Conditional simulation (with Kriging) in R with parallelization?

I am using the gstat package in R to generate sequential Gaussian simulations. My PC has 4 cores, and I tried to parallelize the krige() function using the parallel package, following the script provided by Guzmán in answer to the question How to achieve parallel Kriging in R to speed up the process?.
The resulting simulations are, however, different from the ones obtained using only one core at a time (no parallelization). It looks like a geometry problem, but I can't figure out how to fix it.
Below I provide an example (using 4 cores) generating 2 simulations. You will see that, after running the code, the simulated maps derived from parallelization show some artifacts (like vertical lines) and differ from the ones produced using only one core at a time.
The code needs the libraries gstat, sp, raster, parallel and spatstat. If any of the library() lines do not work, run install.packages() first.
library(gstat)
library(sp)
library(raster)
library(parallel)
library(spatstat)
# create a regular grid
nx=100 # number of columns
ny=100 # number of rows
srgr <- expand.grid(1:ny, nx:1)
names(srgr) <- c('x','y')
gridded(srgr)<-~x+y
# generate a spatial process (unconditional simulation)
g<-gstat(formula=z~x+y, locations=~x+y, dummy=T, beta=15, model=vgm(psill=3, range=10, nugget=0,model='Exp'), nmax=20)
sim <- predict(g, newdata=srgr, nsim=1)
r<-raster(sim)
# generate sample data (Poisson process)
int<-0.02
rpp<-rpoispp(int,win=owin(c(0,nx),c(0,ny)))
df<-as.data.frame(rpp)
coordinates(df)<-~x+y
# assign raster values to sample data
dfpp <-raster::extract(r,df,df=TRUE)
smp<-cbind(coordinates(df),dfpp)
smp<-smp[complete.cases(smp), ]
coordinates(smp)<-~x+y
# fit variogram to sample data
vs <- variogram(sim1~1, data=smp)
m <- fit.variogram(vs, vgm("Exp"))
plot(vs, model = m)
# generate 2 conditional simulations with one core processor
one <- krige(formula = sim1~1, locations = smp, newdata = srgr, model = m,nmax=12,nsim=2)
# plot simulation 1 and 2: statistics (min, max) are ok, simulations are also ok.
spplot(one["sim1"], main = "conditional simulation")
spplot(one["sim2"], main = "conditional simulation")
# generate 2 conditional with parallel processing
no_cores<-detectCores()
cl<-makeCluster(no_cores)
parts <- split(x = 1:length(srgr), f = 1:no_cores)
clusterExport(cl = cl, varlist = c("smp", "srgr", "parts","m"), envir = .GlobalEnv)
clusterEvalQ(cl = cl, expr = c(library('sp'), library('gstat')))
par <- parLapply(cl = cl, X = 1:no_cores, fun = function(x) krige(formula=sim1~1, locations=smp, model=m, newdata=srgr[parts[[x]],], nmax=12, nsim=2))
stopCluster(cl)
# merge all parts
mergep <- maptools::spRbind(par[[1]], par[[2]])
mergep <- maptools::spRbind(mergep, par[[3]])
mergep <- maptools::spRbind(mergep, par[[4]])
# create SpatialPixelsDataFrame from mergep
mergep <- SpatialPixelsDataFrame(points = mergep, data = mergep@data)
# plot mergep: statistics (min, max) are ok, but the simulated maps show "vertical lines". I don't understand why.
spplot(mergep[1], main = "conditional simulation")
spplot(mergep[2], main = "conditional simulation")
I have tried your code and I think the problem lies with the way you split the work:
parts <- split(x = 1:length(srgr), f = 1:no_cores)
On my dual-core machine that meant that all odd indices in srgr were handled by one process and all even indices were handled by the other process. This is probably the source of the vertical artifacts you are seeing.
A better way should be to split the data into consecutive chunks like this:
parts <- parallel::splitIndices(length(srgr), no_cores)
Using this splitting with the rest of your code I get results that look comparable to the sequential ones. At least to my untrained eyes ...
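To see the difference concretely, here is a toy comparison of the two splitting schemes (just an illustration with 8 cells and 2 cores, not part of the original script):
split(x = 1:8, f = 1:2)
# $`1`
# [1] 1 3 5 7      <- every other cell: interleaved "stripes"
# $`2`
# [1] 2 4 6 8

parallel::splitIndices(8, 2)
# [[1]]
# [1] 1 2 3 4      <- consecutive blocks of the grid
# [[2]]
# [1] 5 6 7 8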
Original answer, which addresses only a minor effect: it still might make sense to fix the seed with set.seed for sequential processing and clusterSetRNGStream for parallel processing.
From what I have read about Kriging it requires you to draw random numbers. These random numbers will be different with parallel processing. See section 6 of the parallel vignette (vignette("parallel")) for more details.
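A short sketch of what that seeding could look like with the objects from the question (this is illustrative, not part of the original script):
set.seed(42)                          # before the sequential krige() call

cl <- makeCluster(no_cores)
clusterSetRNGStream(cl, iseed = 42)   # parallel-safe L'Ecuyer-CMRG streams on each worker
# ... clusterExport()/parLapply() as above ...
stopCluster(cl)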

How do I modify an existing function from a package?

I'm trying to increase the limit of the trials parameter, which is currently set to 100 in the C50 package. I tried to do this using fix.
library(C50)
data(churn)
fix(C5.0.default) # I change maxtrials <- 200
treeModel <- C5.0(x = churnTrain[, -20], y = churnTrain$churn, trials = 150)
Then I get the following error when trials is less than 200.
could not find function "makeNamesFile"
I restarted R and then tried using fixInNamespace, changing the trials limit to 200.
fixInNamespace("C5.0.default", pos="package:C50")
treeModel <- C5.0(x = churnTrain[, -20], y = churnTrain$churn, trials = 150)
The model works for trials below 100 but gives the following error for trials above 100. This is the standard error that C5.0 gives when the user requests more than 100 trials.
number of boosting iterations must be between 1 and 100
I want to increase the number of trials (boosting iterations) for the C5.0 model. How do I do that? This might be an implementation constraint, but since xgboost can handle more than 100 boosting iterations, there might be a way for C5.0 to handle this too.
I am able to increase the limit beyond 100 with the fix call, but then I need to run all the R scripts from the source version of the C50 package. What can I do to avoid this? I tried installing the C50 package from source and gave it a try, but it didn't work out.
I could get more than 100 trials by tweaking the source code from this link. You need to source the R files; then you can change the default number of trials to allow more than 100.
# Allow for more than 100 boosting iterations
setwd('Path to R files')                  # directory containing the package's R source files
files <- list.files(pattern = "\\.R$")    # all .R files of the C50 package
lapply(files, source)                     # source them into the global environment
fix(C5.0.default)                         # edit maxtrials (e.g. set it to 200)
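If the sourcing and edit worked, the call from the question should now accept more than 100 trials; a quick check (assuming churnTrain is loaded via data(churn) as above):
treeModel <- C5.0(x = churnTrain[, -20], y = churnTrain$churn, trials = 150)
summary(treeModel)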

Simulated Annealing in R: GenSA running time

I am using simulated annealing, as implemented in R's GenSA package (function GenSA), to search for values of input variables that result in "good values" (compared to some baseline) of a high-dimensional function. I noticed that setting the maximum number of calls of the objective function has no effect on the running time. Am I doing something wrong, or is this a bug?
Here is a modification of the example given in the GenSA help file.
library(GenSA)
Rastrigin <- local({
  index <- 0
  function(x) {
    index <<- index + 1
    if (index %% 1000 == 0) {
      cat(index, " ")
    }
    sum(x^2 - 10 * cos(2 * pi * x)) + 10 * length(x)
  }
})
set.seed(1234)
dimension <- 1000
lower <- rep(-5.12, dimension)
upper <- rep(5.12, dimension)
out <- GenSA(lower = lower, upper = upper, fn = Rastrigin,
             control = list(max.call = 10^4))
Even though max.call is specified to be 10,000, GenSA calls the objective function more than 46,000 times (note that the objective is defined within a local environment in order to track the number of calls). The same problem arises when trying to limit the maximum running time via max.time.
This is an answer from the package maintainer:
max.call and max.time are soft limits that do not include local searches that are performed before reaching these limits. The algorithm does not stop the local search strategy loop before its end, and this may exceed the limitation that you have set, but it will stop after that last search. We have designed the algorithm that way to make sure that it isn't stopped in the middle of searching a valley. An option to stop anywhere will be implemented in the next release of the package.
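Until such a release exists, one possible workaround (my own sketch, not part of GenSA's API) is to enforce the budget from inside the objective function: keep the best point seen so far in the closure, abort with stop() once the budget is exhausted, and catch that error around the GenSA() call. Whether the error unwinds cleanly may depend on the GenSA version, so treat this as an experiment; Rastrigin, lower and upper are as defined above.
# Hedged sketch: hard call budget enforced by the objective function itself
make_budgeted <- function(fn, max_calls) {
  calls <- 0; best_x <- NULL; best_f <- Inf
  wrapped <- function(x) {
    calls <<- calls + 1
    if (calls > max_calls) stop("call budget exhausted")
    f <- fn(x)
    if (f < best_f) { best_f <<- f; best_x <<- x }
    f
  }
  list(fn = wrapped,
       best = function() list(par = best_x, value = best_f, calls = calls))
}

budget <- make_budgeted(Rastrigin, max_calls = 1e4)
res <- tryCatch(
  GenSA(lower = lower, upper = upper, fn = budget$fn,
        control = list(max.call = 1e4)),
  error = function(e) budget$best()   # fall back to the best point seen so far
)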
