On OS X Yosemite 10.10 (including 10.10.0, 10.10.1, and 10.10.2), these two classes (NSTableView and NSOutlineView) perform very badly compared to Mavericks (10.9). I tried converting them to Core Animation (enabling layer-backing): performance is a bit better, but still very bad. When sampling the application, 99% of the time is spent in system functions (CALayer drawInContext, CALayer display, ...).
Is there any workaround for this problem? It seems to affect any application that uses NSOutlineView or NSTableView.
I'm looking for advice, please. After about six months I got back to some code I wrote that back then took around 30 minutes to finish. Now, when I run it, it's way slower; it looks like it could take days. Since then the hardware hasn't changed, I'm using Windows 10, I updated RStudio to the current version (2022.07.2 Build 576), and I didn't update my R version, which is 4.1.2 (2021-11-01).
I noticed that, in contrast to before, RStudio is now not using more than around 400 MB of RAM; before it was much more. I don't run any other software and there is plenty of RAM available.
I had an idea that the antivirus might be causing this, even though I didn't change any of its settings. I added RStudio and R to its exceptions, but it didn't change anything.
I also updated RStudio from the previous version, which didn't help.
Please, does anyone have an idea what could be causing this? Sorry if the description is not optimal; it's my first post here and I'm not a programmer, I just use R for data analysis for my biology-related diploma thesis.
Thanks a lot!
Daniel
I'm trying to push the maximum out of a Ryzen 3950X 16-core machine on Ubuntu 20.04, with Microsoft R 3.5.2 and Intel MKL, and the Rcpp code runs with the Sys.setenv(MKL_DEBUG_CPU_TYPE=5) workaround set.
The following are the main operations that I'd like to optimize for:
Fast multivariate random normal (for which I use the Armadillo version):
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
// [[Rcpp::export]]
arma::mat mvrnormArma(int n, arma::vec mu, arma::mat sigma) {
  int ncols = sigma.n_cols;
  arma::mat Y = arma::randn(n, ncols);                        // n x ncols standard-normal draws
  return arma::repmat(mu, 1, n).t() + Y * arma::chol(sigma);  // shift by mu, scale by chol(sigma)
}
Fast SVD (I found that base::svd performs better than any Rcpp implementation I've tried so far, including arma::svd("dc"), probably due to the different U, S, V dimensions returned).
Fast matrix multiplications for various results (I found code written in C, rewrote all of it in base R, and am seeing vast improvements from multicore execution versus the previous single-core performance; can base R matrix operations be improved further? See the benchmark sketch after this list).
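To compare backends on points 2 and 3, this is the kind of rough benchmark I have in mind (a sketch only; the matrix sizes are arbitrary and not my actual workload):

set.seed(1)
X <- matrix(rnorm(2000 * 1000), 2000, 1000)
system.time(s <- svd(X))        # base::svd, dispatched to the linked LAPACK
system.time(M <- crossprod(X))  # t(X) %*% X, dispatched to the linked BLAS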
I've tried various setups with R 4.0.2 and OpenBLAS (through the Ropenblas package), played with various Intel MKL releases, and researched AMD's BLIS and libflame (which I don't even know how to test with R).
Overall, this setup outperforms a laptop with an i7-8750H and Microsoft R 3.5.1 (with working MKL) by around 2x, while, based on 6 vs 16 cores (and faster RAM), I was expecting at least a 3-3.5x improvement (going by Cinebench and similar benchmarks).
How can this setup be further improved?
My main issues/questions:
First, I've noticed that the current setup, when run with 1 worker, uses around 1000-1200% CPU according to top. Through experimentation, I've found that spawning two parallel workers uses most of the CPU, around 85-95%, and delivers the best performance. Three workers, for example, use the whole 100%, but bottleneck somewhere and drastically reduce performance for some reason.
I'm guessing this is a limitation coming either from R/MKL or from how the Rcpp code is compiled, since 10-12 cores seems oddly specific. Can this be improved by some hints when compiling the Rcpp code?
Secondly, I'm sure I'm not using the optimal BLAS/LAPACK/etc. libraries for the job. My guess is that a properly compiled R 4.0.2 should be significantly faster than Microsoft R 3.5.2, but I have absolutely no idea what I am missing, whether AVX/AVX2 instructions are actually being used, or what else I should try on this machine.
Lastly, I have seen zero guides for calling/working with AMD BLIS/libflame from R. If this is trivial, I would appreciate any hints on what to look into.
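For reference, this is the kind of thing I mean by compile hints: flags in ~/.R/Makevars so that Rcpp code is built for the native instruction set (just an illustration of where such flags would go; I haven't verified that they help here):

# ~/.R/Makevars
CXXFLAGS = -O3 -march=native -mtune=native
CXX11FLAGS = -O3 -march=native -mtune=native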
Until any other (hopefully much better) answer pops up, I'll post my latest findings here, arrived at mostly by guesswork. Hopefully someone with a similar machine will find this useful. I'll try to expand the answer if any additional improvements come up.
A guide for compiling R cleanly. It seems outdated, but hopefully nothing important is missing:
Speed up RcppArmadillo: How to link to OpenBlas in an R package
OpenBLAS and IntelMKL examples + Rstudio
OpenBLAS works terribly on my Ryzen + Ubuntu configuration; the 3.10 release, compiled with zen2 hints, uses all the CPU cores, but terribly: top reports 3200% usage for the R instance, yet total CPU utilisation doesn't rise above 20-30%. As a result, performance is at least 3x slower than with Intel MKL.
Intel MKL. Versions up to 2019 work with the MKL_DEBUG_CPU_TYPE workaround. I can confirm that intel-mkl-64bit-2019.5-075 works.
For later versions, from 2020.0-088 on, a different workaround is needed. With my benchmarks the performance did not see any improvement; however, this may change with future MKL releases.
The 10-12 core hardcoded cap per instance appears to be controlled by several environment variables. I found the following list as per this old guide. These may well change with later versions, but they seem to work with 2019.5-075:
export MKL_NUM_THREADS=2
export OMP_NESTED="TRUE"
export MKL_DOMAIN_NUM_THREADS="MKL_BLAS=1"
export OMP_NUM_THREADS=1
export MKL_DYNAMIC="TRUE"
export OMP_DYNAMIC="FALSE"
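The same caps can also be set from inside R, for example via the RhpcBLASctl package (an alternative to exporting them in the shell; not what I originally used):

library(RhpcBLASctl)
blas_set_num_threads(2)  # caps BLAS threads for the current R process
omp_set_num_threads(1)   # caps OpenMP threads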
Playing around with various configurations, I found that, for the specific benchmark I tested on, reducing the number of threads and spawning more workers increased performance drastically (around 3-4 fold). Even though the reported CPU usage was similar across the multicore variants of the configuration, 2 workers using 16 threads each (totalling ~70% CPU utilisation) are much slower than 16 workers using 2 threads each (at similar CPU utilisation). The results may vary with different tasks, so these seem to be the go-to parameters to play with for every longer task.
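A minimal sketch of this "many workers, few BLAS threads each" setup using the parallel package (heavy_task is a placeholder for the actual workload, not my real code):

library(parallel)

heavy_task <- function(i) {
  X <- matrix(rnorm(1000 * 1000), 1000, 1000)
  sum(diag(crossprod(X)))  # some BLAS-heavy work
}

cl <- makeCluster(16)
# set the per-worker thread caps before the workers touch BLAS
clusterEvalQ(cl, Sys.setenv(MKL_NUM_THREADS = "2", OMP_NUM_THREADS = "1"))
res <- parLapply(cl, 1:64, heavy_task)
stopCluster(cl)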
AMD BLIS. I'm testing this as an alternative to MKL BLAS, currently only experimenting, but the performance seems to be on par with Intel MKL with all the fixes. I checked with perf whether BLIS was actually being used; for my benchmarks the calls made were bli_dgemmsup_rd_haswell_asm_6x8m, bli_daxpyv_zen_int10, and others. I'm not yet sure whether the settings used to compile BLIS were optimal. The takeaway here could be that both MKL and BLIS are actually pushing the maximum from the CPU, given my specific benchmarks... or at least that both libraries are similarly optimized.
Important downside of sticking with AMD BLIS: I only noticed this after months of usage, but there seem to be some unresolved issues with BLIS, or with the LAPACK packed into the AMD libraries, that I don't understand. I've noticed random matrix multiplication issues that are not reproducible (essentially, hitting this problem) and that are solved by switching back to the MKL libraries. I can't say whether the problem is my way of building R or the actual libraries, so consider this just a warning.
I am using a trial font to create some figures in R with ggplot2. (In case anyone is interested, it is this one: https://www.grillitype.com/typefaces/gt-walsheim)
The trial version of this font only provides a limited set of characters, which means that symbols like "=" are not included. I am generating graphs under both Linux and Mac OS using ggplot2 and cairo_pdf. On Linux, characters missing from the font are automatically substituted with a fallback typeface such as Helvetica, but this is not happening on Mac OS. I have searched the web for this question, but so far have found no answer as to why this is happening.
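For reference, the figures are produced roughly like this (a minimal sketch; the family name "GT Walsheim Trial" and the title text are illustrative assumptions, not my exact code):

library(ggplot2)

p <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  ggtitle("Slope = 1.5") +                          # "=" is missing from the trial font
  theme_minimal(base_family = "GT Walsheim Trial")  # assumed font family name

cairo_pdf("figure.pdf", width = 6, height = 4, family = "GT Walsheim Trial")
print(p)
dev.off()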
In Mac OS El Capitan, the generated title does not have the character "=" automatically substituted (it simply renders as missing). On Linux, however, the title looks fine: the missing "=" is automatically substituted with another typeface.
So my question is: how do I make this happen in Mac OS El Capitan as well? Thanks a bunch!
In the 90s computer fonts were expensive objects, often released under restrictive licenses. Windows and OS X got around this by commissioning a restricted font set and using it everywhere (Arial + Times New Roman + Courier for Windows, etc.).
Linux systems had no such possibility: their distribution model required specific licensing conditions most font foundries were not agreeable to, so they had to make do with the few limited fonts that were available.
Improvements in font creation tools lowered font costs drastically, and sustained advocacy efforts convinced font foundries it would not be the end of the world to release some fonts under less restrictive terms (making the Open Font Library and the Google Fonts directory possible). So nowadays Linux systems also ship with extensive font collections.
Also, Linux systems have always supported every possible language in a single release, so they had to work out how to deal with font conflicts. Apple and Microsoft mostly worked around the problem by releasing different OS versions for different regions, with different, non-conflicting font sets.
As a result, the Linux font stack integrates extensive support for working around font problems (autohinting, substitution, etc.). You won't find the same level of support on OS X or Windows: Apple and Microsoft have always considered that the solution to incomplete fonts was to pay for more complete versions.
So I think I don't quite understand how memory works in R. I've been running into problems where the same piece of code gets slower later in the week (using the same R session; sometimes even when I clear the workspace). I've tried to develop a toy problem that I think reproduces the "slowing down effect" I have been observing when working with large objects. Note that the code below is somewhat memory intensive (don't blindly run it without adjusting n and N to match what your setup can handle). Also note that it will likely take about 5-10 minutes before you start to see this slowing-down pattern (possibly even longer).
N=4e7 #number of simulation runs
n=2e5 #number of simulation runs between calculating time elapsed
meanStorer=rep(0,N);
toc=rep(0,N/n);
x=rep(0,50);
for (i in 1:N){
  if(i%%n == 1){tic=proc.time()[3]}
  x[]=runif(50);
  meanStorer[i] = mean(x);
  if(i%%n == 0){toc[i/n]=proc.time()[3]-tic; print(toc[i/n])}
}
plot(toc)
meanStorer is certainly large, but it is pre-allocated, so I am not sure why the loop slows down as time goes on. If I clear my workspace and run this code again, it starts off just as slow as the last few measurements! I am using RStudio (in case that matters). Also, here is some of my system information:
OS: Windows 7
System Type: 64-bit
RAM: 8gb
R version: 2.15.1 ($platform yields "x86_64-pc-mingw32")
Here is a plot of toc, prior to using pre-allocation for x (i.e. using x=runif(50) in the loop)
Here is a plot of toc, after using pre-allocation for x (i.e. using x[]=runif(50) in the loop)
Is ?rm not doing what I think it's doing? What's going on under the hood when I clear the workspace?
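For context, by "clearing the workspace" I mean the usual pattern below; my (possibly wrong) understanding is that rm() only removes the bindings, and the memory is reclaimed at the next garbage collection, which is why gc() is often called afterwards:

rm(list = ls())  # remove all objects from the global environment
gc()             # trigger a full garbage collection and report memory usage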
Update: with the newest version of R (3.1.0), the problem no longer persists, even when increasing N to N=3e8 (note that R doesn't allow vectors too much larger than this).
It is quite unsatisfying that the fix is just updating R to the newest version, because I can't figure out why there were problems in version 2.15. It would still be nice to know what caused them, so I am going to leave this question open.
As you state in your updated question, the high-level answer is that you are using an old version of R with a bug: with the newest version of R (3.1.0), the problem no longer persists.
I just got a Windows box set up with two 64-bit Intel Xeon X5680 3.33 GHz processors (6 cores each) and 12 GB of RAM. I've been using SAS on some large data sets, but it's just too slow, so I want to set up R to do parallel processing. I want to be able to carry out matrix operations, e.g., multiplication and inversion. Most of my data are not huge, in the 3-4 GB range, but one file is around 50 GB.

It's been a while since I used R, so I looked around on the web, including the CRAN HPC task view, to see what was available. I think a foreach loop and the bigmemory package will be applicable. I came across this post: Is there a package for parallel matrix inversion in R, which had some interesting suggestions. I was wondering if anyone has experience with the HiPLAR packages. It looks like HiPLARM adds functionality to the Matrix package and HiPLARb adds new functions altogether. Which of these would be recommended for my application?

Furthermore, there is a reference to the PLASMA library. Is this of any help? My matrices have a lot of zeros, so I think they could be considered sparse. I didn't see any examples of how to pass data from R to PLASMA, and looking at the PLASMA docs, it says it does not support sparse matrices, so I'm thinking I don't need this library. Am I on the right track here? Any suggestions on other approaches?
EDIT: It looks like HiPLAR and the pbdR packages will not be helpful. I'm leaning more toward bigmemory, although it looks like I/O may be a problem: http://files.meetup.com/1781511/bigmemoryRandLinearAlgebra_BryanLewis.pdf. This article talks about a package called vam for virtual associative matrices, but it must be proprietary. Would the ff package be of any help here? My R skills are just not current enough to know which direction to pursue. I'm pretty sure I can read the data using bigmemory, but I'm not sure the processing will be very fast.
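For what it's worth, this is roughly the kind of thing I have in mind with bigmemory (a sketch only; the dimensions and file names are placeholders, not my actual data):

library(bigmemory)

# create a file-backed matrix so the large file never has to fit in RAM at once
X <- filebacked.big.matrix(nrow = 1e6, ncol = 100, type = "double",
                           backingfile = "X.bin", descriptorfile = "X.desc")
X[1:5, 1:5] <- rnorm(25)  # reads and writes go through the memory-mapped file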
If you want to use HiPLAR (MAGMA and PLASMA libraries in R), it is only available for Linux at the moment. For this and many other things, I suggest switching your OS to the penguin.
That being said, Intel MKL optimization can do wonders for these sorts of operations. For most practical uses, it is the way to go. Python built with MKL optimization, for example, can process large matrices about 20x faster than IDL, which was designed specifically for image processing. R has similarly shown vast improvements when built with MKL optimization. You can also install Revolution R Open from Revolution Analytics, which includes MKL optimization, but I am not sure it has quite the same effect as building R yourself using the Intel tools: https://software.intel.com/en-us/articles/build-r-301-with-intel-c-compiler-and-intel-mkl-on-linux
I would definitely consider the types of operations you are looking to perform. GPU processes are the ones that lend themselves well to high parallelism (many of the same small computations running at once, as with matrix algebra), but they are limited by bus speeds. Intel MKL optimization is similar in that it helps use all of your CPU cores, but it is really optimized for Intel CPU architecture, and it should provide basic memory optimization too. I think that is the simplest route. HiPLAR is certainly the future, as it is CPU-GPU by design, especially with highly parallel heterogeneous architectures making their way into consumer systems; most consumer systems today cannot fully utilize this, though, I think.
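As a quick sanity check of whatever BLAS ends up linked to R, timing a large matrix product before and after the switch gives a rough idea of the gain (a sketch; the size is arbitrary):

n <- 4000
A <- matrix(rnorm(n * n), n, n)
system.time(B <- crossprod(A))  # should drop dramatically with an optimized BLAS vs the reference BLAS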
Cheers,
Adam