kmeans: Quick-TRANSfer stage steps exceeded maximum in R

I am running k-means clustering in R on a dataset with 636,688 rows and 7 columns using the standard stats package: kmeans(dataset, centers = 100, nstart = 25, iter.max = 20).
I get the following error: Quick-TRANSfer stage steps exceeded maximum (= 31834400). Although the source can be viewed at http://svn.r-project.org/R/trunk/src/library/stats/R/kmeans.R, I am unsure what is going wrong. I assume my problem has to do with the size of my dataset, but I would be grateful if someone could clarify once and for all what I can do to mitigate the issue.

I just had the same issue.
See the documentation of kmeans in R via ?kmeans:
The Hartigan-Wong algorithm generally does a better job than either of those, but trying several random starts (‘nstart’ > 1) is often recommended. In rare cases, when some of the points (rows of ‘x’) are extremely close, the algorithm may not converge in the “Quick-Transfer” stage, signalling a warning (and returning ‘ifault = 4’). Slight rounding of the data may be advisable in that case.
In these cases, you may need to switch to the Lloyd or MacQueen algorithms.
The nasty thing about R here is that it continues with a warning that may go unnoticed. For my benchmark purposes, I consider this to be a failed run, and thus I use:
if (kms$ifault==4) { stop("Failed in Quick-Transfer"); }
Depending on your use case, you may want to do something like
if (kms$ifault==4) { kms = kmeans(X, kms$centers, algorithm="MacQueen"); }
instead, to continue with a different algorithm.
If you are benchmarking k-means, note that R uses iter.max = 10 by default. It may take many more than 10 iterations to converge.
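Putting the pieces above together, a minimal sketch (assuming your data sits in dataset, as in the question; the larger iter.max and the MacQueen fallback are choices to adapt to your own use case):
km <- kmeans(dataset, centers = 100, nstart = 25, iter.max = 100)
# ifault == 4 signals the Quick-TRANSfer failure discussed above
if (km$ifault == 4) {
  warning("Hartigan-Wong failed in Quick-Transfer; retrying with MacQueen")
  km <- kmeans(dataset, centers = km$centers, iter.max = 100, algorithm = "MacQueen")
}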

Had the same problem; it seems to have something to do with available memory.
Running garbage collection before the function worked for me:
gc()
or see this reference:
Increasing (or decreasing) the memory available to R processes
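For completeness, a minimal sketch of that suggestion, using the call from the question (how much gc() helps will depend on how much memory your session is holding):
gc(verbose = TRUE)   # free unused memory and report what is in use
km <- kmeans(dataset, centers = 100, nstart = 25, iter.max = 20)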

@jlhoward's comment:
Try
kmeans(dataset, algorithm = "Lloyd", ...)

I got the same error message, but in my case it helped to increase the maximum number of iterations via iter.max. That contradicts the theory of memory overload.
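For example, a sketch of the call from the question with the iteration cap raised (the exact value is something to tune):
km <- kmeans(dataset, centers = 100, nstart = 25, iter.max = 100)
km$iter     # how many iterations were actually used
km$ifault   # 0 means no problem was flagged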


Julia not solving sparse Linear system

Introduction
I'm doing research in computational contact mechanics, in which I try to solve a PDE using a finite difference method. Long story short, I need to solve a linear system like Ax = b.
The suspects
In the problem, the matrix A is sparse, and so I defined it accordingly. On the other hand, both x and b are dense arrays.
In fact, x is defined as x = A\b, the potential solution of the problem.
So, the least one might expect from this solution is that Ax is close to b in some sense. To my great surprise, I find that
julia> norm(A*x-b) # Frobenius or 2-norm
5018.901093242197
The vector x does not solve the system! I've tried a lot of tricks to discover what is going on, but no clues as of now. My first suspicion is that I've found a bug, but I need more evidence to make this assertion.
The hints
Here are some tests that I've done to try to pinpoint the error
If you convert A to dense, the solution changes completely, and in fact it returns the correct solution.
I have repeated the process above in MATLAB, and it seems to work well with both sparse and dense matrices (that is, the sparse version does not agree with Julia's).
Not all sparse matrices cause a problem. I have tried other initial conditions and the solver seems to work quite well. I am not able to predict what property of the matrix could be causing this discrepancy. However:
A has a condition number of 120848.06, which is quite high, although MATLAB doesn't seem to complain. Also, the absolute error of the computed solution relative to the true solution is huge.
How to reproduce this "bug"
Download the .csv files in the following link
Run the following code in the folder containing the files (installing the packages if necessary):
using DelimitedFiles, LinearAlgebra, SparseArrays;
A = readdlm("A.csv", ',');
b = readdlm("b.csv", ',');
x = readdlm("x.csv", ',');
A_sparse = sparse(A);
println(norm(A_sparse\b - x)); # You should get something close to zero, x is the solution of the sparse implementation
println(norm(A_sparse*x - b)); # You should get something not close to zero, something is not working!
Final words
It might easily be the case that I'm missing something. Are there any other implementations apart from the usual A\b to test against?
To solve a sparse square system, Julia chooses to do a sparse LU decomposition. For the specific matrix A in the question, this decomposition is numerically ill-conditioned, as evidenced by cond(lu(A_sparse).U) == 2.879548971708896e64. This in turn causes the solve routine to make large numerical errors.
A quick solution is to use a QR decomposition instead, by running x = qr(A_sparse)\b.
The solve or LU routines might need to be fixed to handle this case, or at the least the maintainers need to know about this issue, so opening an issue on the GitHub repo might be a good idea.
(This is a rewrite of my comment on the question.)

Error: Required number of iterations = 1087633109 exceeds iterMax = 1e+06 ; either increase iterMax, dx, dt or reduce sigma

I am getting this error, and this post tells me that I should decrease sigma, but here is the thing: this code was working fine a couple of months ago. Nothing has changed in the data or the code. I am wondering why this error appears out of the blue.
A second point: when I lower sigma to, say, 13.1, it appears to run (but I have been waiting for an hour).
sigma <- 203.9057
dimyx1 <- 1024
A22den <- density(Lnetwork, sigma, distance = "path", continuous = TRUE, dimyx = dimyx1)
About Lnetwork
Point pattern on linear network
69436 points
Linear network with 8417 vertices and 8563 lines
Enclosing window: rectangle = [143516.42, 213981.05] x [3353367, 3399153] units
Error: Required number of iterations = 1087633109 exceeds iterMax = 1e+06 ; either increase iterMax, dx, dt or reduce sigma
This is a question about the spatstat package.
The code for handling data on a linear network is still under active development. It has changed in recent public releases of spatstat, and has changed again in the development version. You need to specify exactly which version you are using.
The error report says that the required number of iterations of the algorithm is too large. This occurs because either the smoothing bandwidth sigma is too large, or the spacing dx between sample points along the network is too small. The number of iterations is proportional to (sigma/dx)^2 in most cases.
First, check that the value of sigma is physically reasonable.
Normally you shouldn't have to worry about the algorithm parameter dx because it is determined automatically by default. However, it's possible that your data are causing the code to choose a very small value of dx.
The internal code which automatically determines the spacing dx of sample points along the network has been changed recently, in order to fix several bugs.
I suggest that you specify the algorithm parameters manually. See the help file for densityHeat for information on how to control the spacings. Setting the parameters manually will also ensure greater consistency of the results between different versions of the software.
The quickest solution is to set finespacing=FALSE. This is not the best solution, because it still uses some of the automatic rules that may be causing problems. Please read the help file to understand what it does.
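For illustration, a sketch based on the call from the question; finespacing, dx and iterMax are the parameters named in the answer and in the error message, the manual values shown are placeholders to adapt, and the assumption is that these arguments are passed through to densityHeat() as described in its help file:
# Quickest fix: disable the automatic fine-spacing rule
A22den <- density(Lnetwork, sigma, distance = "path", continuous = TRUE,
                  dimyx = dimyx1, finespacing = FALSE)
# Or control the discretisation manually (placeholder values)
A22den <- density(Lnetwork, sigma, distance = "path", continuous = TRUE,
                  dimyx = dimyx1, dx = 50, iterMax = 1e7)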
Did you update spatstat since this last worked? Probably the internal code for determining spacing on the network etc. changed a bit. The actual computations are done by the function densityHeat(), and you can see how to manually set spacing etc. in its help file.

How to speed up the generation of a latin hypercube (LHS) design

I'm trying to generate an optimized LHS (Latin Hypercube Sampling) design in R, with sample size N = 400 and d = 7 variables, but it's taking forever. My PC is an HP Z820 workstation with 12 cores, 32 GB RAM, Windows 7 64-bit, and I'm running Microsoft R Open, which is a multicore version of R. The code has been running for half an hour, but I still don't see any results:
library(lhs)
lhs_design <- optimumLHS(n = 400, k = 7, verbose = TRUE)
It seems a bit weird. Is there anything I could do to speed it up? I've heard that parallel computing may help with R, but I don't know how to use it, and I have no idea whether it speeds up only code that I write myself or whether it could also speed up an existing package function such as optimumLHS. I don't necessarily have to use the lhs package - my only requirement is that I would like to generate an LHS design which is optimized in terms of the S-optimality criterion, the maximin metric, or some other similar optimality criterion (thus, not just a vanilla LHS). If worse comes to worst, I could even accept a solution in a different environment than R, but it must be either MATLAB or an open-source environment.
Just a little code to check performance.
library(lhs)
library(ggplot2)

performance <- c()
for (i in 1:100) {
  ptm <- proc.time()
  invisible(optimumLHS(n = i, k = 7, verbose = FALSE))
  time <- print(proc.time() - ptm)[[3]]   # elapsed seconds (printed to show progress)
  performance <- rbind(performance, data.frame(time = time, n = i))
}

ggplot(performance, aes(x = n, y = time)) +
  geom_point()
Not looking too good. It seems to me you might be in for a very long wait indeed. Based on the algorithm, I don't think there is a way to speed things up via parallel processing, since to optimize the separation between sample points you need to know the locations of all the sample points. I think your only options for speeding this up will be to take a smaller sample or get access to a faster computer. It strikes me that, since this is something that only really has to be done once, perhaps there is a resource where you could just get a properly sampled and optimized distribution that has already been computed?
So it looks like roughly 650 hours on my machine, which is very comparable to yours, to compute the n = 400 case.
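If strict S-optimality is negotiable, the lhs package also provides cheaper constructions that scale far better than optimumLHS; a sketch (randomLHS and maximinLHS are different functions with different optimality properties, not drop-in replacements):
library(lhs)
X0 <- randomLHS(n = 400, k = 7)     # plain LHS, essentially instant
X1 <- maximinLHS(n = 400, k = 7)    # maximin-based construction, much cheaper than optimumLHS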

Unexpected behaviour of ftol_abs and ftol_rel in NLopt

UPDATE: For anyone else who visits this page, it is worth having a look at this SO question and answer as I suspect the solution there is relevant to the problem I was having here.
This question duplicates one I have asked on the julia-users mailing list, but I haven't gotten a response there (admittedly it has only been 4 days), so I thought I would ask here.
I am calling the NLopt API from Julia, although I think my question is independent of the Julia language.
I am trying to solve an optimisation problem using COBYLA but in many cases I am failing to trigger the stopping criteria. My problem is reasonably complicated, but I can reproduce the problem behaviour with a much simpler example.
Specifically, I try to minimize x1^2 + x2^2 + 1 using COBYLA, and I set both ftol_rel and ftol_abs to 0.5. My objective function includes a statement to print the current value to the console, so I can watch the convergence. The final five values printed to the console during convergence are:
1.161
1.074
1.004
1.017
1.038
My understanding is that any of these steps should have triggered the stopping criteria. All steps are less than 0.5, so that should trigger ftol_abs. Further, each value is approximately 1, and 0.5*1 = 0.5, so all steps should also have triggered ftol_rel. In fact, this behaviour is true of the last 8 steps in the convergence routine.
NLopt has been around for a while, so I'm guessing the problem is with my understanding of how ftol_abs and ftol_rel work, rather than being a bug.
Can anyone shed any light on why the stopping criteria are not being triggered much earlier?
If it is of any use, the following Julia code snippet can be used to reproduce everything I have just stated:
using NLopt
function objective_function(param::Vector{Float64}, grad::Vector{Float64})
    obj_func_value = param[1]^2 + param[2]^2 + 1.0
    println("Objective func value = " * string(obj_func_value))
    println("Parameter value = " * string(param))
    return(obj_func_value)
end
opt1 = Opt(:LN_COBYLA, 2)
lower_bounds!(opt1, [-10.0, -10.0])
upper_bounds!(opt1, [10.0, 10.0])
ftol_rel!(opt1, 0.5)
ftol_abs!(opt1, 0.5)
min_objective!(opt1, objective_function)
(fObjOpt, paramOpt, flag) = optimize(opt1, [9.0, 9.0])
Presumably, ftol_rel and ftol_abs are supposed to supply numerically guaranteed error bounds. The earlier values are close enough, but the algorithm may not yet be able to guarantee it; for example, the gradient or Hessian at the evaluation point might be what supplies such a numerical guarantee. So it continues a bit further.
To be sure, it's best to look at the optimization algorithm source. If I manage this, I will add it to this answer.
Update: The COBYLA algorithm approximates the gradient (vector derivative) numerically using several evaluation points. As mentioned, this is used to model what the error might be. The errors can really be mathematically guaranteed only for functions restricted to some nice family (e.g. polynomials with some bound on degree).
Take home message: It's OK. Not a bug, but the best the algorithm can do. Let it have those extra iterations.
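For what it's worth, NLopt's documented criteria (ftol_abs stops when a step changes the objective by less than the absolute tolerance; ftol_rel when the change is less than the tolerance times |f|) can be checked against the printed values. In R, purely as an illustration of the questioner's arithmetic:
vals  <- c(1.161, 1.074, 1.004, 1.017, 1.038)   # last printed objective values
steps <- abs(diff(vals))
all(steps < 0.5)                  # TRUE: every step is below ftol_abs = 0.5
all(steps < 0.5 * abs(vals[-1]))  # TRUE: and below ftol_rel * |f| as well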

Diagnosing opaque errors and stabilizing / robustifying a simulation in R

Apologies, since this question is somewhat vague and general, and is certainly not reproducible since the code is too complex. However, I suspect it could be answered by equally vague strategies of approaching these issues that are instructive and helpful.
I have coded a simulator with a main, parallelized loop that iterates through parameter values, loads them into the model, and runs it n times.
The issue: while the code generally works well for smaller problem dimensions, it fails at a significant frequency at higher dimensions (particularly higher n); most parameter values execute fine and output is produced, but once in a while no file is produced. The 'post-processing' then fails because of the missing files.
What I know: when rerunning the function, different parameter values are affected, so this is not due to invalid parameter values but seemingly a random failure. There have also been some runs without any problems. There was once an error message about a failure to allocate a vector of size xyz.
What I tried: traceback() seems to focus on the failure at the end of the sim (a symptom) but doesn't find the real cause. I also tried adding a while loop, conditional on the existence of the output file, which would rerun the parameter value if it failed (see below, commented out). This seemed to help a little, but not completely.
The above leads me to suspect that some threads crash somehow and then fail to output any of the parameters assigned to them.
Questions: What strategies would you use to diagnose this issue? What methods can one implement to make such a simulation more robust to errors (diagnosed or otherwise)? What kinds of operations might I be doing that can cause such failures?
Sketch of the Sim. Loop:
library(foreach)
library(doMC)
library(abind)     # needed for abind() below
registerDoMC()     # register the parallel backend for %dopar%

Simulator <- function(params, ...)
{
  [... Pre Processing ...]
  times <- foreach(i = 1:length(params)) %dopar%
  {
    # while (!file.exists(paste0("output", i, ".rds"))) {   # retry attempt, see text
    run      <- list()
    run$par  <- params[[i]]
    run$data <- list()
    foreach(j = 1:n) %do%    # run the simulation n times with these params
    {
      run$data[[j]] <- SimRun(params[[i]], ...)
    }
    # Combine into a single array and label the dimensions
    run$data <- abind(run$data, along = 4)
    dimnames(run$data) <- headers
    # Compute statistics and save them
    run$stats <- Stats(run$data, params[[i]])
    saveRDS(run, paste0("output", i, ".rds"))
    # }
    [... etc ...]
  }
  [... Post Processing ...]
}
Thanks for your patience!
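One common strategy for this kind of silent worker failure is to trap and log errors inside the parallel body rather than letting a worker die quietly; a minimal, hypothetical sketch (the error-log file name is illustrative and not part of the original code):
times <- foreach(i = seq_along(params), .errorhandling = "pass") %dopar%
{
  tryCatch({
    # ... simulation body and saveRDS() as in the sketch above ...
  }, error = function(e) {
    # Record which parameter set failed and why, instead of failing silently
    writeLines(conditionMessage(e), paste0("error_", i, ".log"))
    NULL
  })
}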
