R: SQUAREM vs. turboEM for fixed point convergence?

I have a calculation in R that needs to iteratively call a function for a fixed-point contraction mapping. I've been using the squarem function from the SQUAREM package by Ravi Varadhan. Today, while trying to work around an issue I was having with squarem, I came across the turboEM package, also by Varadhan. At first glance turboEM seems to do the same things as SQUAREM, but with additional functionality in some dimensions.
Does anyone know whether one of these packages is preferred, either in general or for particular applications? Is one more current/updated than the other? turboEM seems to have the ability to customize the convergence criterion, which might get me out of my current bind, but I'm concerned there might be other issues. Obviously I can go off and test the corresponding functions from each package, but if someone out there knows some background on the two packages it might save me a ton of time.

There are four underlying SQUAREM algorithms used by each package. They are effectively identical*. You can see the underlying functions for yourself by using:
SQUAREM:::cyclem1
SQUAREM:::cyclem2
SQUAREM:::squarem1
SQUAREM:::squarem2
turboEM:::bodyCyclem1
turboEM:::bodyCyclem2
turboEM:::bodySquarem1
turboEM:::bodySquarem2
* apart from some differences due to the way in which these are used within the packages, and the fact that the argument called method in SQUAREM is called version in turboEM.
I would say turboEM would probably be preferred in general, for the following reasons:
As you mention, turboEM allows the user to select the convergence criterion, based either on the L2-norm of the change in the parameter vector (convtype = "parameter"), on the L1-norm of the change in the objective function (convtype = "objfn"), or on a user-specified function (convfn.user); see the sketch after this list. SQUAREM only checks convergence using the L2-norm of the change in the parameter vector.
turboEM can also stop the algorithm prior to convergence based on either the number of iterations (stoptype = "maxiter") or the amount of time elapsed (stoptype = "maxtime"). SQUAREM only stops after the specified number of iterations.
The pconstr and project arguments to turboem allow the user to define parameter space constraints and a function that projects estimates back into the parameter space if these are violated. SQUAREM does not have this functionality.
turboEM can easily apply multiple versions of the algorithm to the same data (e.g. with different orders, step sizes, ...), by providing a vector to the method argument and a list to the control.method argument...
... and it can do this in parallel via the foreach package.
turboEM also offers a convenient interface through which to apply a vanilla EM algorithm, as well as EM acceleration schemes other than SQUAREM: parabolic EM (method = "pem"), dynamic ECME ("decme") and quasi-Newton ("qn").
The turboEM package also provides the turboSim function, which allows the user to easily conduct benchmark studies comparing the different acceleration schemes.
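To make the convergence-criterion and multiple-method points concrete, here is a minimal sketch of a turboem call (the toy contraction mapping and all numbers are illustrative, not from the question):

library(turboEM)

# Toy fixed-point problem: F(x) = A x + b is a contraction (spectral
# radius of A is 0.6), and its fixed point solves (I - A) x = b.
A <- matrix(c(0.5, 0.2, 0.1, 0.4), 2, 2)
b <- c(1, 2)
fixptfn <- function(x) c(A %*% x + b)

# Plain fixed-point iteration ("em") and SQUAREM side by side, with a
# parameter-based convergence test.
res <- turboem(par = c(0, 0), fixptfn = fixptfn,
               method = c("em", "squarem"),
               control.run = list(convtype = "parameter", tol = 1e-10))
res  # one row per method: convergence status, iterations, run time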
The one downside that I can see to using turboEM instead of SQUAREM is that, if you are really interested in the particulars of the SQUAREM algorithm, the trace provided by squarem gives more specific information (residual, extrapolation, step length) than that provided by turboem (objective function [if calculated], iteration number, L2-norm of parameter change).
One final aside: The current version of SQUAREM on CRAN (v 2016.8-2) has indeed been updated more recently than the version of turboEM on CRAN (v 2014.8-1). However, the NEWS suggests that the only updates to SQUAREM since December 2010 have been vignettes and demos, while turboEM's first release was in December 2011.

Thanks for your interest in SQUAREM and turboEM. I am the author of both packages. In the future, you may contact me directly with any questions.
The goals of the two packages are different. SQUAREM implements one class of acceleration methods. turboEM, on the other hand, includes a variety of state-of-the-art EM acceleration techniques. The goal of turboEM is to provide a go-to place for all your EM acceleration needs! In particular, turboEM allows you to benchmark the different algorithms on your problem and determine the best one. In my experience, the SQUAREM class of algorithms most often outperforms the other three classes (quasi-Newton, dynamic ECME, and parabolic EM). Hence, you might also directly use the SQUAREM package. However, turboEM has a number of additional features, as pointed out by Mark.
Ravi Varadhan

Related

Convergence Criteria in glmmTMB - what are my options?

When using glmmTMB() of the R package {glmmTMB} (see CRAN with links to manual & vignettes), I am aware that I have certain options when dealing with the convergence of models. More specifically, there is the control = argument, to which I can pass glmmTMBControl() parameters (these have their own section in the manual).
Furthermore, one of the vignettes - i.e. Troubleshooting with glmmTMB - talks explicitly about dealing with convergence problems. My key point, however, is that to my knowledge, any time glmmTMBControl() is mentioned, it is in one of these two ways:
glmmTMBControl(optCtrl=list(iter.max=1e3,eval.max=1e3)) i.e. increase the number of iterations
glmmTMBControl(optimizer=optim, optArgs=list(method="BFGS")) i.e. try a different optimizer
Regarding the second one, I am left with the impression that I have multiple options besides the one shown there, since the manual says "The optimizer argument can be any optimization (minimizing) function [...]".
Yet I was not able to find out what other options I could actually pass as my optimizer=, since it always seems to be exactly the example shown above that is presented, and I would be thankful if someone could provide a list.
P.S.: I am trying to play around with glmmTMB's convergence criteria, because it often seems to estimate slightly smaller variance components than the same model fit via PROC MIXED in SAS.
From ?glmmTMB:
The ‘optimizer’ argument can be any optimization (minimizing) function, provided that:
• the first three arguments, in order, are the starting values, objective function, and gradient function;
• the function also takes a ‘control’ argument;
• the function returns a list with elements (at least) ‘par’, ‘objective’, ‘convergence’ (0 if convergence is successful) and ‘message’ (‘glmmTMB’ automatically handles output from ‘optim()’, by renaming the ‘value’ component to ‘objective’).
The options built into base R that satisfy these requirements (gradient-based minimizers) are nlminb and optim with method="BFGS", "L-BFGS-B", or "CG". You could also look into optimizers provided by optimx or nloptr, but you'd probably have to write a wrapper function to make sure they satisfied the criteria above ...
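To illustrate what such a wrapper might look like, here is a minimal sketch around nloptr's L-BFGS; the wrapper name nloptwrap_tmb is made up, and the status-code mapping is a simplification (NLopt uses positive codes for success):

library(nloptr)

# glmmTMB's contract: the first three arguments are the starting values,
# objective function, and gradient function; a control argument must be
# accepted (ignored in this sketch); the return value needs par,
# objective, convergence (0 = success), and message.
nloptwrap_tmb <- function(start, objective, gradient, control = list(), ...) {
  res <- nloptr(x0 = start, eval_f = objective, eval_grad_f = gradient,
                opts = list(algorithm = "NLOPT_LD_LBFGS",
                            xtol_rel = 1e-10, maxeval = 10000))
  list(par = res$solution,
       objective = res$objective,
       convergence = if (res$status %in% 1:4) 0 else res$status,
       message = res$message)
}

# Hypothetical usage:
# fit <- glmmTMB(y ~ x + (1 | g), data = dat,
#                control = glmmTMBControl(optimizer = nloptwrap_tmb))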

Can I use automatic differentiation for non-differentiable functions?

I am testing the performance of different solvers on minimizing an objective function derived from the simulated method of moments. Given that my objective function is not differentiable, I wonder whether automatic differentiation would work in this case. I tried my best to read some introductions to the method, but I couldn't figure it out.
I am actually trying to use Ipopt+JuMP in Julia for this test. Previously, I tested it using BlackBoxOptim in Julia. I would also appreciate any insights on the optimization of non-differentiable functions in Julia.
It seems that I was not clear on "non-differentiable", so let me give an example. Consider an objective function in which X is the dataset, B are unobserved random errors that will be integrated out, and \theta are the parameters. The term A in it, however, is discrete and therefore not differentiable.
I'm not exactly an expert on optimization, but: it depends on what you mean by "nondifferentiable".
For many mathematical functions in common use, "nondifferentiable" will just mean "not everywhere differentiable" -- but that's still "differentiable almost everywhere, except on countably many points" (e.g., abs, relu). These functions are not a problem at all -- you can just choose any subgradient and apply any normal gradient method. That's what basically all AD systems for machine learning do. The points where the subgradient is not unique are hit with low probability anyway. An alternative for certain forms of convex objectives are proximal gradient methods, which "smooth" the objective in an efficient way that preserves optima (cf. ProximalOperators.jl).
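(For reference, the standard definition in play here: g is a subgradient of a convex function f at x if f(y) ≥ f(x) + gᵀ(y − x) for all y. For f(x) = |x|, the subdifferential at x = 0 is the whole interval [−1, 1], so an AD system may validly report any value in it there; most report 0 or 1.)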
Then there are those functions that seem like they can't be differentiated at all, since they seem "combinatoric" or discrete, but are in fact piecewise differentiable (if seen from the correct point of view). This includes sorting and ranking. But you have to find them, and describing and implementing the derivative is rather complicated. Whether such functions are supported by an AD system depends on how sophisticated its "standard library" is. Some variants of this, like "permute", can just fall out of AD over control structures, while more complex ones require the primitive adjoints to be defined manually.
For certain kinds of problems, though, we just work in an intrinsically discrete space -- like integer parameters of some probability distributions. In these cases, differentiation makes no sense, and hence AD libraries define their primitives not to work on these parameters. Possible alternatives are (mixed) integer programming, approximations, search, and model selection. This case also occurs for problems where the optimized space itself depends on the parameter in question, like the second argument of fill. We also have things like the ℓ0 "norm" or the rank of a matrix, for which there exist well-known continuous relaxations, but those are outside the scope of AD.
(In the specific case of MCMC for discrete or dimensional parameters, there's other ways to deal with that, like combining HMC with other MC methods in a Gibbs sampler, or using a nonparametric model instead. Other tricks are possible for VI.)
That being said, you will rarely encounter complicated nowhere-differentiable continuous functions in optimization. They are complicated even to describe, and are just unlikely to arise in the kind of math we use for modelling.

Numerical method produces platform dependent results

I have a rather complicated issue with my small package. Basically, I'm building a GARCH(1,1) model with rugarch package that is designed exactly for this purpose. It uses a chain of solvers (provided by Rsolnp and nloptr, general-purpose nonlinear optimization) and works fine. I'm testing my method with testthat by providing a benchmark solution, which was obtained previously by manually running the code under Windows (which is the main platform for the package to be used in).
Now, I initially had some issues where the solution was not consistent across several consecutive runs. The difference was within the tolerance I specified for the solver (default solver = 'hybrid', as recommended by the documentation), so my guess was that it uses some sort of randomization. So I removed both the random seed and parallelization ("legitimate" causes) and the issue was solved: I'm getting identical results every time under Windows, so I ran R CMD check and testthat succeeded.
After that I decided to automate a little bit, and now the build process is controlled by Travis. To my surprise, the result under Linux is different from my benchmark; the log states that
read_sequence(file_out) not equal to read_sequence(file_benchmark)
Mean relative difference: 0.00000014688
Rebuilding several times yields the same result, and the difference is always the same, which means that under Linux the solution is also consistent. As a temporary fix, I'm setting a tolerance limit depending on the platform, and the test passes (see latest builds).
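Concretely, the platform-dependent check looks roughly like this (the tolerance values here are illustrative, not the ones actually used):

tol <- if (.Platform$OS.type == "windows") 1e-12 else 1e-6  # illustrative values
testthat::expect_equal(read_sequence(file_out), read_sequence(file_benchmark),
                       tolerance = tol)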
So, to sum up:
A numeric procedure produces identical output on both Windows and Linux platforms separately;
However, these outputs are different and are not caused by random seeds and/or parallelization;
I generally only care about supporting Windows and do not plan to make a public release, so this is not a big deal for my package per se. But I'm bringing this to attention because there may be an issue with one of the solvers, which are used quite widely.
And no, I'm not asking to fix my code: platform dependent tolerance is quite ugly, but it does the job so far. The questions are:
Is there anything else that can "legitimately" (or "naturally") lead to the described difference?
Are low-level numeric routines required to produce identical results on all platforms? Could it be that I'm expecting too much?
Should I care a lot about this? Is this a common situation?

Parallelize Solve() for Ax=b?

Cross-posted on stats.SE, since this problem could straddle both stats.SE and SO:
https://stats.stackexchange.com/questions/17712/parallelize-solve-for-ax-b
I have some extremely large sparse matrices created using the spMatrix function from the Matrix package.
Using the solve() function works for my Ax=b issue, but it takes a very long time. Several days.
I noticed that http://cran.r-project.org/web/packages/RScaLAPACK/RScaLAPACK.pdf
appears to have a function that can parallelize the solve function; however, it can take several weeks to get new packages installed on this particular server.
The server already has the snow package installed.
So
Is there a way of using snow to parallelize this operation?
If not, are there other ways to speed up this type of operation?
Are there other packages like RScaLAPACK? My search on RScaLAPACK seemed to suggest people had a lot of issues with it.
Thanks.
[EDIT] -- Additional details
The matrices are about 370,000 x 370,000.
I'm using it to solve for alpha centrality, http://en.wikipedia.org/wiki/Alpha_centrality. I was originally using the alpha centrality function in the igraph package, but it would crash R.
More details
This is on a single machine with 12 cores and 96 gigs of memory (I believe)
It's a directed graph along the lines of paper citation relationships.
Calculating the condition number and density will take a while. Will post as they become available.
Will crosspost on stats.SE and will add a link back to here.
[Update 1: For those just tuning in: the original question involved parallelizing computations for solving a regression problem; given that the underlying problem is related to alpha centrality, some of the suggestions, such as bagging and regularized regression, may not be as immediately applicable, though that leads down the path of further statistical discussions.]
There are a bundle of issues to address here, from the infrastructural to the statistical.
Infrastructure
[Updated - also see Update #2 below.]
Regarding parallelized linear solvers, you can replace R's BLAS / LAPACK library with one that supports multithreaded computations, such as ATLAS, GotoBLAS, Intel's MKL, or AMD's ACML. Personally, I use AMD's version. ATLAS is irritating, because it fixes the number of cores at compilation time, not at run-time. MKL is commercial. GotoBLAS is not well supported anymore, but it is often the fastest, if only by a slight margin. It's under the BSD license. You can also look at Revolution Analytics' R, which includes, I think, the Intel libraries.
So, you can start using all of the cores right away, with a simple back-end change. This could give you a 12X speedup (b/c of the number of cores) or potentially much more (b/c of better implementation). If that brings down the time to an acceptable range, then you're done. :) But, changing the statistical methods could be even better.
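One quick way to confirm that a swapped-in BLAS is actually multithreaded is to time a large dense operation before and after the change (the size here is illustrative):

n <- 4000L
M <- matrix(rnorm(n * n), n, n)
system.time(crossprod(M))  # should drop sharply with a threaded BLAS
# Recent versions of R also report the BLAS/LAPACK in use in sessionInfo().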
You've not mentioned the amount of RAM available (or its distribution per core or machine), but a sparse solver should be pretty smart about managing RAM accesses and not try to chew on too much data at once. Nonetheless, if it is all on one machine and things are being done naively, then you may encounter a lot of swapping. In that case, take a look at packages like biglm, bigmemory, ff, and others. The first addresses solving linear equations (or GLMs, rather) in limited memory; the latter two address shared memory (i.e. memory mapping and file-based storage), which is handy for very large objects. More packages (e.g. speedglm and others) can be found at the CRAN Task View for HPC.
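For orientation, the underlying computation here is a single sparse linear solve: alpha centrality solves (I - alpha * t(A)) x = e. A minimal sketch in Matrix-package terms (alpha and e are illustrative; alpha must be below the reciprocal of A's largest eigenvalue):

library(Matrix)
n <- nrow(A)                    # A: the sparse adjacency matrix from the question
alpha <- 1e-6                   # illustrative; requires alpha < 1 / lambda_max(A)
e <- rep(1, n)                  # exogenous input vector
M <- Diagonal(n) - alpha * t(A)
x <- solve(M, e)                # dispatches to a sparse factorization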
A semi-statistical, semi-computational issue is to visualize your matrix. Try sorting by the support per row & column (identical if the graph is undirected, else do one then the other, or try a reordering method like reverse Cuthill-McKee), and then use image() to plot the matrix. It would be interesting to see how it is shaped, since that affects which computational and statistical methods one could try.
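A rough sketch of that visualization, assuming A is the question's sparse matrix (for a 370,000-square matrix you would plot a submatrix or a coarsened pattern rather than the whole thing):

library(Matrix)
ord <- order(rowSums(A != 0))        # sort rows/columns by support
image(A[ord, ord][1:5000, 1:5000])   # Matrix supplies image() for sparse classes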
Another suggestion: can you migrate to Amazon's EC2? It is inexpensive, and you can manage your own installation. If nothing else, you can prototype what you need and migrate it in-house once you have tested the speedups. JD Long has a package called segue that apparently makes life easier for distributing jobs on Amazon's Elastic MapReduce infrastructure. That said, there's no need to migrate to EC2 if you have 96 GB of RAM and 12 cores - distributing the work could speed things up, but that's not the issue here; just getting 100% utilization on this machine would be a good improvement.
Statistical
Next up are multiple simple statistical issues:
BAGGING You could consider sampling subsets of your data in order to fit the models and then bag your models. This can give you a speedup. This can allow you to distribute your computations on as many machines & cores as you have available. You can use SNOW, along with foreach.
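A skeleton of that approach (fit_one() and combine_bags() are hypothetical placeholders for your own fitting and aggregation code; X and y stand for your data):

library(foreach)
library(doSNOW)

cl <- snow::makeCluster(12, type = "SOCK")
registerDoSNOW(cl)
# X, y, and fit_one must be visible on the workers (see foreach's .export).
fits <- foreach(i = 1:50) %dopar% {
  idx <- sample(nrow(X), size = floor(0.5 * nrow(X)))  # subsample rows
  fit_one(X[idx, ], y[idx])                            # hypothetical fit
}
snow::stopCluster(cl)
bagged <- combine_bags(fits)                           # hypothetical aggregation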
REGULARIZATION The glmnet package supports sparse matrices and is very fast. You would be wise to test it out. Be careful about ill-conditioned matrices and very small values of lambda.
RANK Your matrices are sparse: are they full rank? If they are not, that could be part of the issue you're facing. When matrices are singular or very nearly so, you're in trouble - check your estimated condition number, or at least look at how your 1st and Nth eigenvalues compare (a steep drop-off, say from eval1 to ev2, ..., ev10, ..., is a warning sign). Again, if you have nearly singular matrices, then you need to go back to something like glmnet to shrink out the variables that are either collinear or have very low support.
BOUNDING Can you reduce the bandwidth of your matrix? If you can block-diagonalize it, that's great, but you'll likely have cliques and members of multiple cliques. If you can trim the most poorly connected members, then you may be able to estimate their alpha centrality as being upper-bounded by the lowest value in the same clique. There are some packages in R that are good for this sort of thing (check out reverse Cuthill-McKee, or simply look at how you'd convert the matrix into rectangular blocks, often relating to cliques or much smaller groups). If you have multiple disconnected components, then, by all means, separate the data into separate matrices.
ALTERNATIVES Are you wedded to the Alpha Centrality? There may be other measures that are monotonically correlated (i.e. have high rank correlation) with the same value that could be calculated more cheaply or at least implemented quite efficiently. If those will work, then your analyses could proceed with a lot less effort. I have a few ideas, but SO isn't really the place to go about that discussion.
For more statistical perspectives, appropriate Q&A should occur on stats.stackexchange.com (Cross Validated).
Update 2: I was a bit too quick in answering and didn't address this from the long-term perspective. If you are planning to do research on such systems for the long-term, you should look at other solvers that may be more applicable to your type of data and computing infrastructure. Here is a very nice directory of the options for both solvers and pre-conditioners. It seems this doesn't include IBM's "Watson" solver suite. Although it may take weeks to get software installed, it's quite possible that one of the packages is already installed if you have a good HPC administrator.
Also, keep in mind that R packages can be installed to the user directory - you need not have a package installed in the general directory. If you need to execute something as a user other than yourself, you could also download a package to the scratch or temporary space (if you're running within just 1 R instance, but using multiple cores, check out tempdir).
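A sketch of the user-directory route (the path ~/Rlibs is arbitrary, and this assumes the package is available from your repository):

dir.create("~/Rlibs", showWarnings = FALSE)
install.packages("RScaLAPACK", lib = "~/Rlibs")
library(RScaLAPACK, lib.loc = "~/Rlibs")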

Implementation of Particle Swarm Optimization Algorithm in R

I'm checking a simple moving-average crossing strategy in R. Instead of running a huge simulation over the 2-dimensional parameter space (length of the short-term moving average, length of the long-term moving average), I'd like to implement the Particle Swarm Optimization algorithm to find the optimal parameter values. I've been browsing the web and read that this algorithm is very effective. Moreover, the way the algorithm works fascinates me...
Do any of you have experience with implementing this algorithm in R? Are there useful packages that can be used?
Thanks a lot for your comments.
Martin
Well, there is a package available on CRAN called pso, and indeed it is a particle swarm optimizer (PSO).
I recommend this package.
It is under active development (last update 22 Sep 2010) and is consistent with the reference implementation of PSO. In addition, the package includes functions for diagnostics and plotting results.
It certainly appears to be a sophisticated package, yet the main function interface (the function psoptim) is straightforward--just pass in a few parameters that describe your problem domain, plus a cost function.
More precisely, the key arguments to pass in when you call psoptim:
the dimensions of the problem, as a vector (par);
lower and upper bounds for each variable (lower, upper); and
a cost function (fn).
There are other parameters in the psoptim method signature; those are generally related to convergence criteria and the like.
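For instance, a minimal sketch of a psoptim call for the two-parameter moving-average problem in the question (the cost function is a dummy stand-in for a real backtest):

library(pso)

# In practice the cost would be something like
# -performance(backtest(short = round(p[1]), long = round(p[2]))).
cost <- function(p) sum((p - c(10, 50))^2)  # dummy objective for illustration

res <- psoptim(par = c(20, 100), fn = cost,
               lower = c(2, 20), upper = c(50, 250),
               control = list(maxit = 200, s = 40))  # s = swarm size
res$par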
Are there any other PSO implementations in R?
There is an R package called ppso for parallel PSO. It is available on R-Forge. I do not know anything about this package; I have downloaded it and skimmed the documentation, but that's it.
Beyond those two, none that I am aware of. About three months ago, I looked for R implementations of the more popular meta-heuristics, and pso is the only PSO implementation I am aware of. The R bindings to the GNU Scientific Library (GSL) have a simulated annealing algorithm, but none of the biologically inspired meta-heuristics.
The other place to look is of course the CRAN Task View for Optimization. I did not find another PSO implementation beyond what I've recited here, though there are quite a few packages listed there, and most of them I did not check beyond the name and one-sentence summary.
