We have a recent performance benchmark that I am trying to understand. We have a large script whose performance appears to be 50% slower on a Red Hat Linux machine than on a Windows 7 laptop with comparable specs. The Linux machine is virtualized using KVM and has 4 cores assigned to it along with 16 GB of memory. The script is not I/O intensive but has quite a few for loops. Mainly I am wondering if there are any R compile options I can use to optimize, or any kernel compiler options, that might help make this more comparable. Any pointers would be appreciated. I will also try to get another machine and test on bare metal for a better comparison.
These are the configure flags that I am using to compile R on the Linux machine. I have experimented quite a bit, and these seem to cut about 12 seconds off the execution time for larger data sets. For the benchmark script itself, I went from 2.087 to 1.48 seconds with these options.
./configure CFLAGS="-O3 -g -std=gnu99" CXXFLAGS="-O3 -g" FFLAGS="-O2 -g" LDFLAGS="-Bdirect,--hash-style=both,-Wl,-O1" --enable-R-shlib --without-x --with-cairo --with-libpng --with-jpeglib
Update 1
The script has not been optimized yet. Another group is actually working on the script, and we have put in requests to use the apply family of functions, but I am not sure how this explains the disparity in the times.
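To illustrate the kind of change being requested (a hypothetical toy example only, not the actual script, which I cannot post), a row-wise for loop over a data frame can often be replaced by a vectorized expression or an apply call:

# Hypothetical example only -- not the real script.
df <- data.frame(a = runif(1e5), b = runif(1e5))

# for-loop version (similar in spirit to what is currently benchmarked)
res_loop <- numeric(nrow(df))
for (i in seq_len(nrow(df))) {
  res_loop[i] <- df$a[i] + df$b[i]
}

# vectorized and apply versions
res_vec   <- df$a + df$b
res_apply <- apply(df[, c("a", "b")], 1, sum)

stopifnot(all.equal(res_loop, res_vec),
          all.equal(res_loop, unname(res_apply)))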
The top of the profile looks like this. Most of these functions will later be optimized using the apply functions, but right now it is benchmarked apples to apples on both machines.
"eval.with.vis" 8.66 100.00 0.12 1.39
"source" 8.66 100.00 0.00 0.00
"[" 5.38 62.12 0.16 1.85
"GenerateQCC" 5.30 61.20 0.00 0.00
"[.data.frame" 5.20 60.05 2.40 27.71
"ZoneCalculation" 5.12 59.12 0.02 0.23
"cbind" 3.22 37.18 0.34 3.93
"[[" 2.26 26.10 0.38 4.39
"[[.data.frame" 1.88 21.71 0.82 9.47
My first suspicion, which I will be testing shortly and updating with my findings, is that the KVM virtualization is to blame. This script is very memory intensive, and due to the large number of array operations and R's copy-on-modify semantics (which of course means a lot of allocation), this may be causing the problem. Since the VM does not have direct access to the memory controller and must share it with the other VMs, this could very likely be the cause. I will be getting a bare-metal machine later today and will update with my findings.
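As a small illustration of the copy-on-modify behaviour I am suspecting (a toy example only; tracemem() needs an R build with memory profiling enabled, which the CRAN Windows binaries have and source builds get with --enable-memory-profiling):

x <- runif(1e6)
tracemem(x)      # prints a message each time x is duplicated
y <- x           # no copy yet, just another reference to the same vector
y[1] <- 0        # modifying y triggers the actual duplication
untracemem(x)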
Thank you all for the quick updates.
Update 2
We originally thought the performance problem was caused by hyper-threading in the VM, but this turned out to be incorrect: performance was comparable on a bare-metal machine.
We later realized that the Windows laptop was using a 32 bit version of R for the computations. This led us to try the 64 bit version of R, and the result was ~140% slower than 32 bit on the exact same script. This leads me to the question: how is it possible that 64 bit R could be ~140% slower than the 32 bit version?
What we are seeing so far:
Windows 32 bit execution time: 48 seconds
Windows 64 bit execution time: 2.33 seconds
Linux 64 bit execution time: 2.15 seconds
Linux 32 bit execution time: in progress (built a 32 bit version of R on RHEL 6.3 x86_64 but did not see a performance improvement; I am going to reload with the 32 bit version of RHEL 6.3)
I found this link, but it only explains a 15-20% hit on some 64 bit machines.
http://www.hep.by/gnu/r-patched/r-admin/R-admin_51.html
Sorry I cannot legally post the script.
Have a look at the sections on "profiling" in the Writing R Extensions manual.
From 30,000 feet, you can't say much else -- you will need profiling information. "General consensus" (and I am putting this in quotes as you can't really generalize these things) is that Linux tends to do better on memory management and file access, so I am a little astonished by your results.
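If it helps, a minimal profiling run looks roughly like this (file names here are placeholders):

Rprof("profile.out")
source("your_script.R")        # placeholder for the script being benchmarked
Rprof(NULL)
head(summaryRprof("profile.out")$by.total, 10)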
Building R with --enable-R-shlib can cause a performance penalty; this is discussed in R Installation and Administration, Appendix B, Section 1. That alone could explain 10-20% of the variation. Other differences could come from the details hidden behind the "comparable specs".
The issue was resolved: it was caused by a non-optimized BLAS library. Switching to ATLAS made a big difference, and this article was a great help:
http://www.cybaea.net/Blogs/Data/Faster-R-through-better-BLAS.html
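For anyone hitting the same issue, a quick sanity check of which BLAS is actually in use, plus a rough before/after benchmark (on reasonably recent R versions sessionInfo() lists the BLAS/LAPACK libraries; the 2000x2000 matrix below is arbitrary):

sessionInfo()                      # look at the BLAS/LAPACK lines

set.seed(1)
m <- matrix(rnorm(2000 * 2000), nrow = 2000)
system.time(crossprod(m))          # should drop sharply with ATLAS/OpenBLAS/MKL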
Related
I am trying to use Intel's MKL with R and adjust the number of threads using the MKL_NUM_THREADS variable.
It loads correctly, and I can see it using 3200% CPU in htop. However, it isn't actually faster than using only one thread.
I've been adapting Dirk Eddelbuettel's guide for CentOS, but I may have missed some flag or config somewhere.
Here is a simplified version of how I am testing how the number of threads relates to job time. I do get the expected results when using OpenBLAS.
require(callr)
#> Loading required package: callr
f <- function(i) r(function() crossprod(matrix(1:1e9, ncol=1000))[1],
                   env=c(rcmd_safe_env(),
                         R_LD_LIBRARY_PATH=MKL_R_LD_LIBRARY_PATH,
                         MKL_NUM_THREADS=as.character(i),
                         OMP_NUM_THREADS="1")
)
system.time(f(1))
#> user system elapsed
#> 14.675 2.945 17.789
system.time(f(4))
#> user system elapsed
#> 54.528 2.920 19.598
system.time(f(8))
#> user system elapsed
#> 115.628 3.181 20.364
system.time(f(32))
#> user system elapsed
#> 787.188 7.249 36.388
Created on 2020-05-13 by the reprex package (v0.3.0)
EDIT 5/18
Per the suggestion to try MKL_VERBOSE=1, I now see the following on stdout, which shows it properly calling into MKL:
MKL_VERBOSE Intel(R) MKL 2020.0 Product build 20191122 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector Neural Network Instructions enabled processors, Lnx 2.50GHz lp64 intel_thread
MKL_VERBOSE DSYRK(U,T,1000,1000000,0x7fff436222c0,0x7f71024ef040,1000000,0x7fff436222d0,0x7f7101d4d040,1000) 10.64s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
For f(8), it shows NThr:8:
MKL_VERBOSE Intel(R) MKL 2020.0 Product build 20191122 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector Neural Network Instructions enabled processors, Lnx 2.50GHz lp64 intel_thread
MKL_VERBOSE DSYRK(U,T,1000,1000000,0x7ffe6b39ab40,0x7f4bb52eb040,1000000,0x7ffe6b39ab50,0x7f4bb4b49040,1000) 11.98s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8
I am still not getting the expected performance increase from the extra cores.
EDIT 2
I am able to get the expected results using Microsoft's distribution of MKL, but not with Intel's official distribution as in the walkthrough. It appears that MS is using a GNU threading library; could the problem be in the threading library rather than in BLAS/LAPACK itself?
Only seeing this now: did you check the obvious one, i.e. whether R on CentOS actually picks up the MKL?
As I recall, R on CentOS is built in a more, ahem, "restricted" mode with the reference BLAS that ships with R. If that is the case, you simply cannot switch in another one, as we have done within Debian and Ubuntu for 20+ years, because that requires a different choice when R is initially compiled.
Edit: Per subsequent discussions (see comments below), we all re-realized that it is important to have the threading libraries / models aligned. MKL is an Intel product and defaults to using Intel's threading library; on Linux the GNU compiler is closer to the system and has its own, and that latter one needs to be selected. In my writeup / script for the MKL on .deb systems I use
echo "MKL_THREADING_LAYER=GNU" >> /etc/environment
to set this "system-wide" on the machine; one can also add it just to the R environment files.
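For example, with the same callr wrapper as in the question (MKL_R_LD_LIBRARY_PATH is the variable the question already defines), the threading layer can be passed alongside the other environment variables; this is just a sketch of where the setting would go:

f <- function(i) callr::r(
  function() crossprod(matrix(1:1e9, ncol = 1000))[1],
  env = c(callr::rcmd_safe_env(),
          R_LD_LIBRARY_PATH   = MKL_R_LD_LIBRARY_PATH,   # as defined in the question
          MKL_THREADING_LAYER = "GNU",                   # align MKL with the GNU threading model
          MKL_NUM_THREADS     = as.character(i),
          OMP_NUM_THREADS     = "1")
)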
I am not sure exactly how R calls MKL, but if the crossprod function calls MKL's gemm underneath then we should see very good scalability results with such inputs. What are the input problem sizes?
MKL supports a verbose mode. This option can reveal a lot of useful runtime information while gemm is running. Could you try exporting MKL_VERBOSE=1 in the environment and looking at the log output?
Though I am not entirely sure whether R will suppress that output.
If I run my tests using PHP 7.2 or PHP 7.1, they are about 3x slower than if I run them using PHP 7.0. Is there any way to get to the bottom of why this is happening?
Even when I run the test suites (Feature & Unit) separately I still see the slowdown. It's only when I run the tests individually that the speed difference becomes insignificant.
I'm using Laravel 5.5.20 and Laravel Homestead 7.0.1. I have 47 rather simple tests, some hitting the database, others just simple assertions, so there isn't anything that should take ages.
I installed johnkary/phpunit-speedtrap to see which tests take the longest so I could remove those, but there isn't a specific test that takes a long time: if I remove the offending test, the next one takes ages instead (see below).
First Run              Second Run
Test A   0.2 sec       Test A   0.2 sec
Test B   0.3 sec       Test B   0.3 sec
Test C   0.1 sec       Test C   0.1 sec
Test D   0.1 sec       Test D   0.1 sec
Test E   9.3 sec       Test E   REMOVED
Test F   0.3 sec       Test F   9.3 sec  <-- Test F now takes ages
Test G   0.2 sec       Test G   0.2 sec
I am also using an in-memory SQLite3 database, with the Laravel CreatesApplication and RefreshDatabase traits, as I want each test to run independently.
I do not have Xdebug installed or running. Is it a known issue that PHP 7.1 and PHP 7.2 take a long time to run PHPUnit tests? Is there something else I can install (or even run it with Xdebug) to track down what exactly is causing the issue?
Setup
Laravel 5.5.20
Laravel Homestead 7.0.1 (Per-project installation)
PHPUnit 6.4.4
Vagrant 2.0.1
Virtualbox 5.2.4
Results
PHP 7.2 PHPUnit 6.4.4
Time: 12.4 seconds, Memory: 162.00MB
PHP 7.1 PHPUnit 6.4.4
Time: 12.19 seconds, Memory: 162.00MB
PHP 7.0 PHPUnit 6.4.4
Time: 4.88 seconds, Memory: 162.00MB
I had the same issue as you but with Xdebug installed. I found a pretty good hint by a user named Roni on Laracasts (I can't find the link anymore, sorry), which suggests running the tests with the -n flag of the php command, like this: php -n vendor/bin/phpunit.
According to the documentation at php.net (command line options), this runs the command without using what php.ini defines, which means no extensions are loaded.
-n No php.ini file will be used
So for me the tests run in a minute now, not in 15 minutes. The problem is kind of strange, because it began with PHP 7.2 on my machine, but others on my team don't have the issue even though Xdebug is installed. I wonder what is really behind this issue.
Windows 10 64-bit, 32 GB RAM, RStudio 1.1.383 and R 3.4.2 (up to date)
I have several csv files which have at least 1 or 2 lines full of nul values. So I wrote a script that uses read_lines_raw() from the readr package in R, which reads the file in raw form and produces a list where each element is a row. Then I check each row for 00 (the nul value), and when it is found that line gets deleted.
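For reference, here is a minimal sketch of that approach (the helper name and paths are placeholders; it assumes readr is attached):

library(readr)

drop_nul_lines <- function(in_path, out_path) {
  lines   <- read_lines_raw(in_path)                    # list of raw vectors, one per line
  has_nul <- vapply(lines, function(x) any(x == as.raw(0)), logical(1))
  writeLines(vapply(lines[!has_nul], rawToChar, character(1)), out_path)
}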
One of the files is 2.5 GB in size and also has nul values somewhere in it. The problem is that read_lines_raw() is not able to read this file and throws an error:
Error in read_lines_raw_(ds, n_max = n_max, progress = progress) :
  negative length vectors are not allowed
I don't even understand the problem. Some of my research hints at something related to size, but not even half of the RAM is being used. Some other files that it was able to read were 1.5 GB in size. Is this file too big, or is something else causing this?
Update 1:
I tried to read in the whole file using scan(), but that also gave me an error:
could not allocate memory (2048 Mb) in C function 'R_AllocStringBuffer'
So although my PC has 32 GB, the maximum allowed space for a single entity is 2 GB? I checked to make sure it is running 64-bit R, and yes it is.
> version
_
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 4.2
year 2017
month 09
day 28
svn rev 73368
language R
version.string R version 3.4.2 (2017-09-28)
nickname Short Summer
It seems like many people are facing similar issues, but I could not find a solution. How can we increase the memory allocation for individual entities? memory.limit() gives back 32 GB, which is the RAM size, but that isn't helpful. memory.size() gives something close to 2 GB, and since the file is 2.7 GB on disk, I assume this is the reason for the error.
Thank you.
Take Cairo as an example: when I run Pkg.add("Cairo"), nothing is displayed in the console.
Is there a way to let Pkg.add() display more information when it is working?
What steps does Pkg.add() carry out? Download, compile?
Is it possible to speed it up? I kept waiting for 15 minutes with nothing coming out! Maybe it's Julia's problem, or maybe it's the system's problem; how can one tell?
Edit
Julia version: 0.3.9 (Installed using binary from julia-lang.org)
OS: Windows 7 64-bit.
CPU: Core Duo 2.4GHz
RAM: 4G
Hard Disk: SSD
ping github.com passed, 0% loss.
Internet download speedtest: ~30 Mbps.
I don't know whether this is normal: it took me 11 seconds to get the version.
PS C:\Users\Nick> Measure-Command {julia --version}
Days : 0
Hours : 0
Minutes : 0
Seconds : 11
Milliseconds : 257
Ticks : 112574737
TotalDays : 0.000130294834490741
TotalHours : 0.00312707602777778
TotalMinutes : 0.187624561666667
TotalSeconds : 11.2574737
TotalMilliseconds : 11257.4737
And it took nearly 2 minutes to load the Gadfly package:
julia> @time require("Gadfly")
elapsed time: 112.131236102 seconds (442839856 bytes allocated, 0.39% gc time)
Does it run faster on Linux/Mac than on Windows? It is usually not easy to build software on Windows; however, will it improve the performance if I build from source?
Julia is awesome, I really hope it works!
As mentioned by Colin T. Bowers, your particular case is abnormally slow and indicates something wrong with your installation. But Pkg.add (along with other Pkg operations) is known to be slow on Julia 0.4. Luckily, this issue has been fixed.
Pkg operations have seen a substantial increase in performance in Julia v0.5, which is being released today (September 19, 2016). You can go to the downloads page to get v0.5. (It might be a few hours before they're up as of the time of this writing.)
I'm just doing some performance testing on a new laptop. My problem started when I tried to test it with parallel computing.
So, when I run the function detectCores() from parallel, the result is 1. The problem is that the laptop has an i7-4800MQ processor, which has 4 cores.
As a result, when I run my code it thinks it has only one core, and the time to execute the code is exactly the same as without the parallelization.
I've tested the code on a different machine with an i5 processor, also with 4 cores, using the same R version (R 3.0.2, 64-bit), and the code runs perfectly. The only difference is that the new computer has Windows 8.1 installed while the old one has Windows 7.
Also, when I run Sys.getenv("NUMBER_OF_PROCESSORS") I get 1 as the answer as well.
I've searched the internet looking for an answer with no joy. Has anyone come across this problem before?
Many thanks
Make sure you are loading the parallel package before running detectCores(). I also have an i7 processor (Windows 8.1, 64-bit), and I see 8 cores when I run detectCores(logical = TRUE) and 4 when I run detectCores(logical = FALSE). For more, kindly refer to this link. HTH
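A quick check along those lines:

library(parallel)                       # load this before calling detectCores()

detectCores(logical = TRUE)             # logical processors (8 here, with hyper-threading)
detectCores(logical = FALSE)            # physical cores (4 here)
Sys.getenv("NUMBER_OF_PROCESSORS")      # what Windows itself reports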