I am pretty new to the Julia language (using Version 0.6.0 (2017-06-19), the official x86_64-w64-mingw32 release from http://julialang.org/, on a Windows 7 machine). I have an R background and found the R implementations of mixed models to be slow for very big data sets (n > 2,000,000, p > 100). Hence, I searched for alternatives, and Julia seems to be lightning fast when it comes to estimation time.
The question I want to raise here is about dmbates' MixedModels.jl package. Despite its incredible speed compared to, e.g., lme4, I was wondering whether there is also a prediction function. Here is a minimal working example that pulls the Dyestuff data from R's lme4 package:
using MixedModels, RCall
R"library(lme4)"
R"data(Dyestuff)"
Dyestuff = rcopy(R"Dyestuff");
mm = fit!(lmm(@formula(Yield ~ 1 + (1 | Batch)), Dyestuff));
So how can I do predictions using something like:
predict(mm, newdata = Dyestuff)
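For reference, the workflow I want to reproduce looks roughly like this in R with lme4 (just to illustrate what I mean by prediction):

library(lme4)
data("Dyestuff", package = "lme4")
fm <- lmer(Yield ~ 1 + (1 | Batch), data = Dyestuff)
head(predict(fm, newdata = Dyestuff))  # predictions including the random effects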
Many thanks in advance.
Please note that Julia has yet to reach v1.0, and there are often big, breaking API changes between releases. Likewise, MixedModels.jl is in active development and has to track those Julia API changes in its own API. The information here is (hopefully) correct at the time of writing.
Looking at the source code for MixedModels.jl at the current revision e566fcf, there is no predict() method, but there is a fitted() method which inherits from / overrides StatsBase.fitted(). It should be easy enough to also write a predict() method overriding StatsBase.predict() and submit it as a pull request. You might want to look at the simulate() method -- instead of generating new data based on the existing data, you would use the data passed as an argument.
I am developing an iterative algorithm that uses quantile regression models at each iteration. For that I use the rq function from the quantreg package in R. So far it has worked fine. However, I have found a dataset where, at one of the iterations, the rq function gets stuck. No error message, no warning. It just keeps running as if still working, but never finishes the computation.
I provide a very small minimal code example here. You can download the problematic data from this link:
https://www.dropbox.com/s/yrlotit1ovk9yzd/r555.RData?dl=0
library(quantreg)
load('~/r555.RData')  # path to the downloaded file
dependent = r$dependent
independent = r$independent
quantreg::rq(dependent ~ -1 + independent, tau=0.1)
If you execute the code above, the rq function gets stuck and never finishes. Be aware that the data provided come from one iteration of the process I am developing, so they have no direct interpretation by themselves. I am asking to understand possible reasons for this behaviour and to find possible solutions.
Don't know if it matters, but I have tested this on two different computers running Windows 10 and with different versions of the quantreg package.
Changing the default method = "br" (Barrodale and Roberts simplex) to method = "fn" (Frisch-Newton interior point) fixes the problem:
quantreg::rq(dependent ~ -1 + independent, tau=0.1, method="fn")
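If you want to keep the default simplex where it is fast but avoid it on the problematic fits inside your iterative algorithm, a small sketch (my own suggestion, not something from the quantreg documentation; the size threshold is arbitrary) is to pick the method by problem size:

library(quantreg)

# Sketch: choose the rq() fitting method by problem size. "fn" tends to be
# more robust than the default "br" simplex on larger or ill-conditioned
# designs; adjust the threshold to taste.
fit_rq <- function(formula, tau, data, n_switch = 1000) {
  method <- if (nrow(data) > n_switch) "fn" else "br"
  rq(formula, tau = tau, data = data, method = method)
}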
I'm trying to fit roughly 70,000 values as a function of two variables using the loess() function, several times. I want to use this fit to de-trend the data. My problem is that once I start the loess function, the R session takes up all available cores on the system, which would be inconsiderate towards other users on the same computing cluster.
The relevant code would be analogous to the following:
# Approximation of the data
df <- data.frame(y = rpois(70000, rnorm(70000, 10, 2)), # y is count data
x = 50000 - rpois(70000, 100),
z = runif(70000))
# The problematic operation
fit <- loess(y ~ x + z, data = df)
When I run this example on my local machine, it only takes up one core, but on the cluster it grabs as many cores as it can get (up to 48). Ideally, I would like loess() to run on only one core.
I've tried to find a multicore parameter in the code of loess, but there is none. I know that loess calls stats:::simpleLoess, which in turn calls C code, which in turn calls Fortran code. I have no experience with C or Fortran and haven't been able to figure out how to restrict the CPU usage of this function.
Does anyone have a suggestion on how I can limit the CPU usage of the loess function?
I am not knowledgeable enough to comment on the specifics of how all of this works, but I know that the C/C++ and Fortran code used by R is usually built with the OpenMP framework for multi-threaded programming. Empirically, I do know that your issue can be resolved by setting the OMP_NUM_THREADS environment variable, either before you launch R or from within an R session.
Let's say you wanted to use 2 threads for the loess function. Before you launch R, you would do this ($ to signify typing this in a shell session):
$ OMP_NUM_THREADS=2 R [whatever other options you use to launch R]
Here's how to do it from within R (> to indicate an interactive R session):
> Sys.setenv("OMP_NUM_THREADS" = 2)
If you ever need to check the variable from within R, you can do the following (this will return a character vector with the number):
> Sys.getenv("OMP_NUM_THREADS")
# The result in our example will be "2"
For completeness, see ?Sys.setenv or ?Sys.getenv if you want more information about those functions, and consult the OpenMP documentation for details about OMP_NUM_THREADS.
Hope that helps!
So McG led me down a path that eventually gave me the ability to control the number of cores, which I'll post as another answer.
There were a few details I foolishly neglected to mention, namely that I was working on an RStudio server. For all other purposes, I indeed think that McG's answer would be excellent.
That answer gave me the right terms to google, and while strolling through the search results I stumbled upon a thread suggesting that the RhpcBLASctl package has a function to set the number of cores, as follows:
blas_set_num_threads(2)
Setting this in an R Markdown document before running loess kept my CPU usage at 200% during the loess call that had been problematic before.
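For completeness, a minimal sketch of what the session setup looks like (assuming the RhpcBLASctl package is installed and df is the data frame from the question):

library(RhpcBLASctl)
blas_set_num_threads(2)  # threads used by the BLAS library
omp_set_num_threads(2)   # threads used by OpenMP regions, if any

# loess() call from the question, now limited to the threads set above
fit <- loess(y ~ x + z, data = df)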
I have a rather complicated issue with my small package. Basically, I'm building a GARCH(1,1) model with the rugarch package, which is designed exactly for this purpose. It uses a chain of solvers (provided by Rsolnp and nloptr, general-purpose nonlinear optimization libraries) and works fine. I'm testing my method with testthat against a benchmark solution, which was obtained earlier by manually running the code under Windows (the main platform the package will be used on).
Now, I initially had some issues where the solution was not consistent across several consecutive runs. The difference was within the tolerance I specified for the solver (default solver = 'hybrid', as recommended by the documentation), so my guess was that it uses some sort of randomization. Once I removed both the random seed and the parallelization ("legitimate" sources of variation), the issue was solved: I get identical results every time under Windows, R CMD CHECK runs, and testthat succeeds.
After that I decided to automate things a little, and the build process is now controlled by Travis. To my surprise, the result under Linux is different from my benchmark; the log states that
read_sequence(file_out) not equal to read_sequence(file_benchmark)
Mean relative difference: 0.00000014688
Rebuilding several times yields the same result, and the difference is always the same, which means that the solution is also consistent under Linux. As a temporary fix, I'm setting the tolerance limit depending on the platform (sketched below), and the test passes (see the latest builds).
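Roughly, the platform-dependent tolerance in the test looks like this (a sketch; the tolerance values are illustrative, and read_sequence() is my own helper, as in the log above):

# Platform-dependent tolerance for the testthat comparison (values illustrative)
tol <- if (.Platform$OS.type == "windows") 1e-8 else 1e-6
testthat::expect_equal(read_sequence(file_out),
                       read_sequence(file_benchmark),
                       tolerance = tol)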
So, to sum up:
A numeric procedure produces identical output on both Windows and Linux platforms separately;
However, these outputs differ between the two platforms, and the difference is not caused by random seeds and/or parallelization;
I generally only care about supporting Windows and do not plan to make a public release, so this is not a big deal for my package per se. But I'm bringing it to attention because there may be an issue with one of the solvers, which are used quite widely.
And no, I'm not asking you to fix my code: a platform-dependent tolerance is quite ugly, but it does the job so far. The questions are:
Is there anything else that can "legitimately" (or "naturally") lead to the described difference?
Are low-level numeric routines required to produce identical results on all platforms? Could it be that I'm expecting too much?
Should I care a lot about this? Is this a common situation?
I'm checking a simple moving-average crossover strategy in R. Instead of running a huge simulation over the two-dimensional parameter space (length of the short-term moving average, length of the long-term moving average), I'd like to implement the Particle Swarm Optimization algorithm to find the optimal parameter values. I've been browsing the web and read that this algorithm is very effective. Moreover, the way the algorithm works fascinates me...
Does anybody have experience with implementing this algorithm in R? Are there useful packages that can be used?
Thanks a lot for your comments.
Martin
Well, there is a package available on CRAN called pso, and indeed it is a particle swarm optimizer (PSO).
I recommend this package.
It is under active development (last update 22 Sep 2010) and is consistent with the reference implementation of PSO. In addition, the package includes functions for diagnostics and plotting results.
It certainly appears to be a sophisticated package, yet the main interface (the function psoptim) is straightforward: just pass in a few parameters that describe your problem domain plus a cost function.
More precisely, the key arguments to pass in when you call psoptim are:
- the dimensions of the problem, as a vector (par);
- the lower and upper bounds for each variable (lower, upper); and
- a cost function (fn).
There are other parameters in the psoptim signature, but those are generally related to convergence criteria and the like. A minimal call is sketched below.
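For illustration, here is a minimal sketch of such a call (the objective below is a toy surface standing in for the negated backtest performance of the moving-average strategy at a given (short, long) pair; bounds and settings are arbitrary):

library(pso)

# Toy objective with a minimum near short = 10, long = 100; infeasible
# combinations (short >= long) are penalized. Replace with your backtest.
obj <- function(p) {
  short <- p[1]; long <- p[2]
  (short - 10)^2 + ((long - 100) / 10)^2 + ifelse(short >= long, 1e6, 0)
}

set.seed(1)
res <- psoptim(par = rep(NA, 2), fn = obj,
               lower = c(2, 20), upper = c(50, 400),
               control = list(maxit = 200))
res$par    # best (short, long) found
res$value  # objective value at that point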
Are there any other PSO implementations in R?
There is an R package called ppso for parallel PSO, available on R-Forge. I do not know much about this package; I have downloaded it and skimmed the documentation, but that's it.
Beyond those two, none that I am aware of. About three months ago I looked for R implementations of the more popular meta-heuristics, and this is the only PSO implementation I found. The R bindings to the GNU Scientific Library (GSL) include a simulated annealing algorithm, but none of the biologically inspired meta-heuristics.
The other place to look is of course the CRAN Task View on Optimization. I did not find another PSO implementation beyond what I've mentioned here, though quite a few packages are listed there, and most of them I only checked by name and one-sentence summary.
I am using the R package glmulti and performing an exhaustive search.
Relevant code:
local1.model <- glmulti(est,            # use the fitted model as a starting point
                        level = 1,      # just look at main effects
                        method = "h",   # exhaustive screening
                        crit = "aicc")  # AICc works better than AIC for small sample sizes
The variable "est" is a fitted GLM that informs glmulti.
If I were a Java-based program that had to do the same thing several hundred thousand times, then I would use more than one core.
My glmulti is not using my cores efficiently.
Is there a way to switch it to make use of more of my system?
Note: when I use h2o, it can max out the CPU and take a heavy toll on memory.
R is single-threaded (unless the function is built on a library that does its own threading). You can manually add parallelization to your code using the parallel package (which is part of core R): http://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf
I would class it as non-trivial to use. It is a bit of a hack on top of R, so it does lots of memory copying, and you need to think about what is going on if you care about efficiency.
glmulti looks like it ought to be parallelizable (i.e. each combination of parameters could be evaluated in parallel, even when using a genetic algorithm). My guess is the authors intended to add that, but development stopped (no updates since Sep 2009). A sketch of the manual route is below.
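If waiting for glmulti to gain parallel support is not an option, here is a rough sketch (my own, not glmulti's API) of screening all main-effect subsets yourself with the core parallel package. It assumes plain, untransformed predictors and that est is the fitted GLM from the question; the AICc formula is the usual small-sample correction.

library(parallel)

terms_all <- attr(terms(est), "term.labels")
response  <- all.vars(formula(est))[1]
# all non-empty subsets of main effects (fine for a moderate number of terms)
subsets <- unlist(lapply(seq_along(terms_all),
                         function(k) combn(terms_all, k, simplify = FALSE)),
                  recursive = FALSE)

cl <- makeCluster(max(1, detectCores() - 1))
clusterExport(cl, c("est", "response"))
aicc <- parSapply(cl, subsets, function(vars) {
  m <- glm(reformulate(vars, response = response),
           family = family(est), data = est$model)
  k <- attr(logLik(m), "df")                     # number of estimated parameters
  AIC(m) + 2 * k * (k + 1) / (nobs(m) - k - 1)   # AICc
})
stopCluster(cl)

subsets[[which.min(aicc)]]  # best main-effects subset by AICc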