Intel MKL/Xeon Phi Offload Runtime Issue - Auto Offload not working (R)

I have set up my Xeon Phi 3120A in Windows 10 Pro, with MPSS 3.8.4 and Parallel Studio XE 2017 (Initial Release). I chose this release of Parallel Studio XE because it was the last one to support the x100 series. I have installed the MKL version that is packaged with Parallel Studio XE 2017 (Initial Release).
What have I done / setup:
After setting up MPSS 3.8.4, and following the steps such as flashing and pinging, I have checked that micctrl -s shows “mic0 ready” (with the Linux image containing the appropriate KNC name), miccheck produces all "passes", and micinfo gives me a reading for all the key stats of the co-processor.
Hence, to me it looks like the co-processor is certainly installed and being recognised by my computer. I can also see that mic0 is up and running in the micsmc GUI.
I have then set up my environment variables to enable automatic offload, namely: MKL_MIC_ENABLE=1, OFFLOAD_DEVICES=0, MKL_MIC_MAX_MEMORY=2GB, MIC_ENV_PREFIX=MIC, MIC_OMP_NUM_THREADS=228, MIC_KMP_AFFINITY=balanced.
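For reference, a quick way to confirm from inside R that these variables actually reached the R process (they must be set before R, and hence MKL, starts, e.g. system-wide) is a minimal check like this:
# Sketch: verify the automatic-offload variables are visible to the R process.
vars <- c("MKL_MIC_ENABLE", "OFFLOAD_DEVICES", "MKL_MIC_MAX_MEMORY",
          "MIC_ENV_PREFIX", "MIC_OMP_NUM_THREADS", "MIC_KMP_AFFINITY")
print(Sys.getenv(vars))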
The Problem - Auto Offload is not working
When I go to run some simple code in R-3.4.3 (copied below, designed specifically for automatic offload), it keeps running the code on my host computer rather than running anything through the Xeon Phi.
To support this, I cannot see any activity on the Xeon Phi when I look at the micsmc GUI.
Hence, auto offload is not working.
The R code:
require(Matrix)
sink("output.txt")

N <- 16000
cat("Initialization...\n")
a <- matrix(runif(N * N), ncol = N, nrow = N)
b <- matrix(runif(N * N), ncol = N, nrow = N)

cat("Matrix-matrix multiplication of size ", N, "x", N, ":\n")
for (i in 1:5) {
  dt <- system.time(c <- a %*% b)
  gflops <- 2 * N * N * N * 1e-9 / dt[3]
  cat("Trial: ", i, ", time: ", dt[3], " sec, performance: ", gflops, " GFLOP/s\n")
}
Other steps I have tried:
I then proceeded to set the MKL_MIC_DISABLE_HOST_FALLBACK=1 environment variable and, as expected, when I ran the above code, R terminated.
The article "Using Intel® MKL Automatic Offload on Intel Xeon Phi Coprocessors" says that if host fallback is disabled and offload is attempted but fails (because the "offload runtime cannot find a coprocessor or cannot initialize it properly"), the program will terminate – and that is what is happening here: R terminates completely. For completeness, this problem also occurs on R-3.5.1, Microsoft R Open 3.5.0 and R-3.2.1.
So my questions are:
1. What am I missing to make the R code run on the Xeon Phi? Can you please advise me on what I need to do to make this work?
2. (Linked to 1) Is there a way to check whether the MKL offload runtime can see the Xeon Phi, whether it is correctly set up, or what (if any) problem MKL is having initialising the Xeon Phi? (See the sketch below.)
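On question 2, one rough check – a sketch only, assuming the documented Intel offload-report variable OFFLOAD_REPORT (e.g. OFFLOAD_REPORT=2) is set system-wide before R is launched – is to enable offload reporting and run a matrix multiplication large enough to be an automatic-offload candidate; if nothing is reported on stdout, the offload runtime is most likely never initialising the coprocessor:
# Sketch: with OFFLOAD_REPORT set (before R starts), each offload attempt
# should print a report to stdout; silence suggests the runtime never
# reaches the coprocessor.
print(Sys.getenv(c("MKL_MIC_ENABLE", "OFFLOAD_REPORT")))
N <- 10000
a <- matrix(runif(N * N), N, N)
b <- matrix(runif(N * N), N, N)
invisible(a %*% b)   # DGEMM large enough to be offloaded automatically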
I will sincerely appreciate it if someone out there can help me – I believe that I am missing a fundamental/simple step, and I have been tearing my hair out trying to make this work.
Many thanks in advance,
Rash

Related

Intel Advisor beta offloading analysis: No execution count

I am trying to use the Intel oneAPI Advisor beta to do a GPU offloading analysis (via analyze.py and collect.py). The problem is that all non-offloaded regions show "Cannot be modelled: No Execution Count".
Furthermore, I get the warning
advixe: Warning: A symbol file is not found. The call stack passing through `...../programm.out' module may be incorrect.
I've already tried the troubleshooting described here and here. Moreover, I tried using a program with a longer runtime.
I compiled with the following compiler flags (according to this); note that debugging information is turned on:
-O2 -std=c++11 -fopenmp -g -no-ipo -debug inline-debug-info
I am using Intel(R) Advisor 2021.1 beta07 (build 606302), and Intel(R) C Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 2021.1 Beta Build 202006. The program uses OpenMP.
What could I do to solve this problem?
The problem occurred because the program's workload was too large and the memory of the machine was not sufficient.
Try:
running collect.py with --no-track-heap-objects (might reduce precision)
reducing the runtime and memory usage of the program that is analysed
pausing and resuming only on relevant parts via the libittnotify API

Why is MKL in parallel not faster than serial in R 3.6?

I am trying to use Intel's MKL with R and adjust the number of threads using the MKL_NUM_THREADS variable.
It loads correctly, and I can see it using 3200% CPU in htop. However, it isn't actually faster than using only one thread.
I've been adapting Dirk Eddelbuettel's guide for CentOS, but I may have missed some flag or config somewhere.
Here is a simplified version of how I am testing how the number of threads relates to job time. I do get the expected results when using OpenBLAS.
require(callr)
#> Loading required package: callr

f <- function(i) r(function() crossprod(matrix(1:1e9, ncol = 1000))[1],
                   env = c(rcmd_safe_env(),
                           R_LD_LIBRARY_PATH = MKL_R_LD_LIBRARY_PATH,
                           MKL_NUM_THREADS = as.character(i),
                           OMP_NUM_THREADS = "1"))
system.time(f(1))
#> user system elapsed
#> 14.675 2.945 17.789
system.time(f(4))
#> user system elapsed
#> 54.528 2.920 19.598
system.time(f(8))
#> user system elapsed
#> 115.628 3.181 20.364
system.time(f(32))
#> user system elapsed
#> 787.188 7.249 36.388
Created on 2020-05-13 by the reprex package (v0.3.0)
EDIT 5/18
Per the suggestion to try MKL_VERBOSE=1, I now see the following on stdout, which shows it properly calling into MKL (crossprod ends up in the DSYRK BLAS routine):
MKL_VERBOSE Intel(R) MKL 2020.0 Product build 20191122 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector Neural Network Instructions enabled processors, Lnx 2.50GHz lp64 intel_thread
MKL_VERBOSE DSYRK(U,T,1000,1000000,0x7fff436222c0,0x7f71024ef040,1000000,0x7fff436222d0,0x7f7101d4d040,1000) 10.64s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
For f(8), it shows NThr:8:
MKL_VERBOSE Intel(R) MKL 2020.0 Product build 20191122 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector Neural Network Instructions enabled processors, Lnx 2.50GHz lp64 intel_thread
MKL_VERBOSE DSYRK(U,T,1000,1000000,0x7ffe6b39ab40,0x7f4bb52eb040,1000000,0x7ffe6b39ab50,0x7f4bb4b49040,1000) 11.98s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8
I am still not getting the expected performance increase from extra cores.
EDIT 2
I am able to get the expected results using Microsoft's distribution of MKL, but not with Intel's official distribution as in the walkthrough. It appears that MS is using a GNU threading library; could the problem be in the threading library and not in BLAS/LAPACK itself?
Only seeing this now: did you check the obvious one, i.e. whether R on CentOS actually picks up the MKL?
As I recall, R on CentOS is built in a more, ahem, "restricted" mode with the shipped-with-R reference BLAS. And if that is the case you simply cannot switch in another BLAS, as we have done within Debian and Ubuntu for 20+ years, because that requires a different choice when R is initially compiled.
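One quick way to check which BLAS R actually picked up, from inside R (3.4 or later) – a sketch, not tied to any particular distribution:
# sessionInfo() reports which BLAS/LAPACK shared libraries R loaded; with MKL
# in use the paths should point at an MKL library (e.g. libmkl_rt.so) rather
# than R's bundled reference BLAS.
sessionInfo()

# Functional cross-check: a large crossprod should be dramatically faster
# with a tuned, threaded BLAS than with the reference BLAS.
m <- matrix(rnorm(4000 * 4000), 4000, 4000)
system.time(crossprod(m))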
Edit: Per subsequent discussions (see comments below) we all re-realized that it is important to have the threading libraries / models aligned. MKL is an Intel product and defaults to using Intel's threading library; on Linux the GNU compiler is closer to the system and has its own, and that latter one needs to be selected. In my write-up / script for the MKL on .deb systems I use
echo "MKL_THREADING_LAYER=GNU" >> /etc/environment
to set this "system-wide" on the machine; one can also add it just to the R environment files.
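Applied to the benchmark harness from the question, that amounts to passing the variable through the callr env – a sketch only; MKL_R_LD_LIBRARY_PATH is whatever path the original walkthrough defines:
library(callr)

# Same harness as above, but forcing MKL onto the GNU (GOMP) threading layer.
# OMP_NUM_THREADS is left unset here so MKL_NUM_THREADS governs the thread count.
f_gnu <- function(i) r(function() crossprod(matrix(1:1e9, ncol = 1000))[1],
                       env = c(rcmd_safe_env(),
                               R_LD_LIBRARY_PATH = MKL_R_LD_LIBRARY_PATH,
                               MKL_THREADING_LAYER = "GNU",
                               MKL_NUM_THREADS = as.character(i)))

system.time(f_gnu(1))
system.time(f_gnu(8))   # elapsed time should now drop as threads are added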
I am not sure exactly how R calls MKL, but if the crossprod function calls MKL's gemm underneath, then we should see very good scalability with such inputs. What are the input problem sizes?
MKL supports a verbose mode; this can show a lot of useful runtime information while dgemm is running. Could you try exporting MKL_VERBOSE=1 and looking at the log?
Though, I am not entirely sure whether R will suppress the output.

XGBoost not using max number of cores available under Windows?

I am using XGBoost via the R package, and did not specify an nthread parameter (it should default to the maximum number of available cores, which it does on Ubuntu).
On a Windows PC with an i7-4770 CPU (which has 4 cores = 8 threads), however, only about 50% of the maximum CPU level is reached, even when I manually set nthread = 8 (the exact same code uses 100% of the maximum CPU level under Ubuntu, so I don't think this is an implementation issue). I also tried nthread = 4, which leads to around 30% of maximum CPU usage.
How do I get XGBoost to use all available threads under Windows?
I've found that when installing the Windows XGBoost R package from CRAN via install.packages("xgboost"), it is built without OpenMP support. Without OpenMP you will not get the full benefit of parallel processing and your CPUs will be under-utilised. You can confirm this in your scenario by using software like Dependency Walker on the xgboost.dll file – you will note that it doesn't link against an OpenMP runtime (usually vcomp140.dll on Windows).
The solution in my case was to uninstall the CRAN-supplied R package and build XGBoost and its R package from source, which was an adventure in itself, but it did give me an OpenMP-enabled installation that pushed all 16 cores in my system to 100% utilisation.
(Edited for extra clarity)
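To check whether a given xgboost build actually scales with nthread, here is a rough benchmark sketch (synthetic data and placeholder parameter values; older xgboost releases use objective = "reg:linear" instead of "reg:squarederror"):
library(xgboost)

# Synthetic regression problem large enough for threading to matter.
set.seed(1)
X <- matrix(rnorm(2e5 * 50), ncol = 50)
y <- rnorm(nrow(X))
dtrain <- xgb.DMatrix(X, label = y)

# Elapsed time should shrink with more threads if the build has OpenMP support.
for (nt in c(1, 2, 4, 8)) {
  t <- system.time(
    xgb.train(params = list(nthread = nt, objective = "reg:squarederror",
                            max_depth = 6),
              data = dtrain, nrounds = 20, verbose = 0)
  )
  cat("nthread =", nt, "elapsed =", round(t["elapsed"], 1), "sec\n")
}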

Parallel detectCores() giving incorrect output on a Virtual machine

I'm just doing some performance testing on a new laptop. My problem starts when I try to test it on parallel computing.
So, when I run the function detectCores() from parallel, the result is 1. The problem is that the laptop has an i7-4800MQ processor, which has 4 cores.
As a result, when I run my code it thinks that it has only one core, and the time to execute the code is exactly the same as without the parallelization.
I've tested the code on a different machine with an i5 processor, also with 4 cores, using the same R version (R 3.0.2, 64-bit) and the code runs perfectly. The only difference is that the new computer has Windows 8.1 installed while the old one has Windows 7.
Also, when I run Sys.getenv("NUMBER_OF_PROCESSORS") I also get 1 as the answer.
I've searched the internet looking for an answer with no joy. Has anyone come across this problem before?
Many thanks
Make sure you are loading the parallel package before running detectCores(). I also have an i7 processor (Windows 8.1, 64-bit) and I see 8 cores when I run detectCores(logical = TRUE) and 4 when I run detectCores(logical = FALSE). For more, kindly refer to this link. HTH
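For reference, a minimal check along those lines (a sketch; the logical argument is honoured on Windows builds of the parallel package):
library(parallel)

detectCores(logical = TRUE)    # logical CPUs, e.g. 8 on a 4-core i7 with hyper-threading
detectCores(logical = FALSE)   # physical cores, e.g. 4

# Cross-check against what Windows itself reports to the R process:
Sys.getenv("NUMBER_OF_PROCESSORS")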

R: Cannot allocate memory greater than x MB

I have a main function in R which calls other files to run my program. I call the main file through a .bat file. When I run it line by line it runs without a memory error, but when I call the .bat file to run it, it halts and gives me the following error:
Cannot allocate memory greater than 51 MB.
How can I avoid this?
Memory limitations in R such as this are a recurring nightmare for a lot of us.
Very often the problem is a limit imposed by your OS (which can usually be changed from a Bash or PowerShell command line), your architecture (32- vs. 64-bit), or the availability of contiguous free RAM, regardless of the overall available memory.
It's hard to say why something would not cause a memory issue when run line by line, but would hit the memory limit when run as a .bat.
What version of R are you running? Do you have both 32-bit and 64-bit installed? Is the 32-bit version being called by Rscript when you run your .bat file, whereas you run a 64-bit version line by line? You can check the version of R that's being run with R.Version().
You can test this by running memory.limit() in both your R IDE/terminal and in your .bat file (be sure to print or save the result in the .bat run). You might also try setting memory.limit() in your .bat file, as it may just have a smaller default, perhaps due to differences in the R profile that's loaded in your IDE or terminal versus the .bat file.
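A small diagnostic that can be dropped into both the interactive session and the script the .bat file runs – a sketch; note that memory.limit() is Windows-specific (and removed in R 4.2 and later):
# Print the same diagnostics interactively and from the .bat-invoked script,
# then compare the two outputs.
cat("Architecture:", R.Version()$arch, "\n")     # "x86_64" vs "i386"
cat("Memory limit:", memory.limit(), "MB\n")     # Windows-only
# If the .bat run reports a smaller limit, try raising it there:
# memory.limit(size = 8000)                      # size in MB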
If architecture isn't the cause of your memory error, then you have several more troubleshooting steps to try:
Check memory usage in both environments (in R directly and via your .bat process) using this:
sort(sapply(ls(), function(x) { object.size(get(x)) }))
Run the garbage collector explicitly in your scripts; that's the gc() command.
Check all object sizes to make sure there are no unexpected results in your .bat process: sort(sapply(ls(), function(x) { format(object.size(get(x)), units = "Mb") }))
Try memory profiling:
Rprof(tf <- "rprof.log", memory.profiling = TRUE)
# ... run the code you want to profile here ...
Rprof(NULL)
summaryRprof(tf)
While this is a RAM issue, for good measure you might want to check that the compute power available is both sufficient and not varying between these two ways of running your code: parallel::detectCores()
Examine your performance with Prof. Hadley Wickham's lineprof tool (warning: it requires devtools and doesn't profile lines of code that call into C code)
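A minimal lineprof sketch – the script name and function below are hypothetical placeholders, and lineprof itself is no longer on CRAN, so it has to be installed from GitHub:
# devtools::install_github("hadley/lineprof")   # lineprof is GitHub-only these days
library(lineprof)

source("my_main.R", keep.source = TRUE)   # hypothetical script defining run_main()
prof <- lineprof(run_main())              # line-by-line time and memory profile
shine(prof)                               # interactive viewer (requires shiny)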
References: while I'm pulling these snippets out of my own code, most of them originally came from other, related Stack Overflow posts, such as:
Reaching memory allocation in R
R Memory Allocation "Error: cannot allocate vector of size 75.1 Mb"
R memory limit warning vs "unable to allocate..."
How to compute the size of the allocated memory for a general type
R : Any other solution to "cannot allocate vector size n mb" in R?
Yes, you should be using 64-bit R if you can.
See this question, and this from the R docs.

Resources