OpenACC on Intel built-in graphics cards (Intel Iris Plus Graphics 655)

I would like to find out whether built-in Intel graphics cards (e.g. the Intel Iris Plus Graphics 655) support OpenACC directives. Could anyone direct me to relevant information?

The PGI C compiler does not support Intel GPUs as a target architecture (the target is specified with the -ta option):
pgcc -I../common -acc -ta=nvidia,time -Minfo=accel -o laplace2d_acc laplace2d.c
The compiler issues the following warning:
pgcc-Warning-OpenACC for GPUs no longer supported on macOS, enabling multicore CPU code generation. Use -ta=multicore to avoid this warning
In other words, GPU targets are no longer supported on macOS, but code with OpenACC directives can still be compiled for execution on multiple CPU cores with -ta=multicore:
pgcc -I../common -acc -ta=multicore,time -Minfo=accel -o laplace2d_acc laplace2d.c
The GNU C compiler supports OpenACC starting from version 7 (versions 7 and 8 implement OpenACC 2.0a, version 9 implements OpenACC 2.5); the acc directives are enabled with the -fopenacc option:
gcc -I../common -fopenacc -o laplace2d_acc laplace2d.c
However, I wasn't able to find compiler flags to target the Intel Iris card specifically.
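For reference, the directives in question are plain pragmas on loops, so the same source can be retargeted just by changing the flags above. A minimal sketch (illustrative only, not the actual laplace2d.c; the file name is made up):

#include <stdio.h>

#define N 4096

int main(void)
{
    static float in[N], out[N];
    for (int i = 0; i < N; i++)
        in[i] = (float)i;

    /* A single OpenACC directive; -ta=multicore runs the loop on CPU
       cores, -ta=nvidia would offload it to an NVIDIA GPU instead. */
    #pragma acc parallel loop
    for (int i = 1; i < N - 1; i++)
        out[i] = 0.5f * (in[i - 1] + in[i + 1]);

    printf("%f\n", out[N / 2]);
    return 0;
}

Compiled, for example, with pgcc -acc -ta=multicore -Minfo=accel -o stencil_acc stencil_acc.c.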

Related

Intel Advisor beta offloading analysis: No execution count

I am trying to use the Intel oneAPI Advisor beta to do a GPU offloading analysis (via analyze.py and collect.py). The problem is that all non-offloaded regions show Cannot be modelled: No Execution Count.
Furthermore I get the warning
advixe: Warning: A symbol file is not found. The call stack passing through `...../programm.out' module may be incorrect.
I've already tried the troubleshooting described here and here. Moreover, I tried using a program with a longer runtime.
I compiled with the following compiler flags (according to this; note that debugging information is turned on):
-O2 -std=c++11 -fopenmp -g -no-ipo -debug inline-debug-info
I am using Intel(R) Advisor 2021.1 beta07 (build 606302), and Intel(R) C Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 2021.1 Beta Build 202006. The program uses OpenMP.
What could I do to solve this problem?
The problem occurred because the program's workload was too large / the machine did not have enough memory.
Try
running collect.py with --no-track-heap-objects (this may reduce precision)
reducing the runtime and memory usage of the program being analysed
pausing and resuming the collection only around the relevant parts via the libittnotify API (see the sketch below)
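For the last point, a minimal sketch of the pause/resume calls, assuming the ittnotify.h header and library that ship with Advisor/VTune (work() is a placeholder for the code of interest):

#include <ittnotify.h>

static void work(void)
{
    /* ... the region that should actually be analysed ... */
}

int main(void)
{
    __itt_pause();   /* keep collection off during setup */
    /* ... expensive initialisation that should not be modelled ... */
    __itt_resume();  /* collect only the region of interest */
    work();
    __itt_pause();   /* stop collecting again before teardown */
    return 0;
}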

Why is MKL in parallel not faster than serial in R 3.6?

I am trying to use Intel's MKL with R and adjust the number of threads using the MKL_NUM_THREADS variable.
It loads correctly, and I can see it using 3200% CPU in htop. However, it isn't actually faster than using only one thread.
I've been adapting Dirk Eddelbuettel's guide for CentOS, but I may have missed some flag or configuration somewhere.
Here is a simplified version of how I am testing how the number of threads relates to job time. I do get the expected results when using OpenBLAS.
require(callr)
#> Loading required package: callr
f <- function(i) r(function() crossprod(matrix(1:1e9, ncol=1000))[1],
                   env=c(rcmd_safe_env(),
                         R_LD_LIBRARY_PATH=MKL_R_LD_LIBRARY_PATH,
                         MKL_NUM_THREADS=as.character(i),
                         OMP_NUM_THREADS="1")
                   )
system.time(f(1))
#> user system elapsed
#> 14.675 2.945 17.789
system.time(f(4))
#> user system elapsed
#> 54.528 2.920 19.598
system.time(f(8))
#> user system elapsed
#> 115.628 3.181 20.364
system.time(f(32))
#> user system elapsed
#> 787.188 7.249 36.388
Created on 2020-05-13 by the reprex package (v0.3.0)
EDIT 5/18
Per the suggestion to try MKL_VERBOSE=1, I now see the following on stdout, which shows it properly calling LAPACK:
MKL_VERBOSE Intel(R) MKL 2020.0 Product build 20191122 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector Neural Network Instructions enabled processors, Lnx 2.50GHz lp64 intel_thread
MKL_VERBOSE DSYRK(U,T,1000,1000000,0x7fff436222c0,0x7f71024ef040,1000000,0x7fff436222d0,0x7f7101d4d040,1000) 10.64s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
For f(8), it shows NThr:8:
MKL_VERBOSE Intel(R) MKL 2020.0 Product build 20191122 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector Neural Network Instructions enabled processors, Lnx 2.50GHz lp64 intel_thread
MKL_VERBOSE DSYRK(U,T,1000,1000000,0x7ffe6b39ab40,0x7f4bb52eb040,1000000,0x7ffe6b39ab50,0x7f4bb4b49040,1000) 11.98s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:8
I am still not seeing the expected performance increase from the extra cores.
EDIT 2
I am able to get the expected results using Microsoft's distribution of MKL, but not with Intel's official distribution as in the walkthrough. It appears that MS uses a GNU threading library; could the problem be in the threading library rather than in BLAS/LAPACK itself?
Only seeing this now: did you check the obvious one, i.e. whether R on CentOS actually picks up the MKL?
As I recall, R on CentOS is built in a more, ahem, "restricted" mode with the reference BLAS that ships with R. If that is the case, you simply cannot switch in another BLAS as we have done within Debian and Ubuntu for 20+ years, because that requires a different choice when R is compiled.
Edit: Per subsequent discussions (see comments below) we all re-realized that it is important to have the threading libraries / models aligned. The MKL is an Intel product and defaults to Intel's threading library; on Linux the GNU compiler is closer to the system and has its own, and it is the latter that needs to be selected. In my writeup / script for the MKL on .deb systems I use
echo "MKL_THREADING_LAYER=GNU" >> /etc/environment
to set this "system-wide" on the machine; one can also add it just to the R environment files.
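If one prefers to scope it to R only, the same setting can go into R's standard per-user environment file (~/.Renviron), e.g.:
echo "MKL_THREADING_LAYER=GNU" >> ~/.Renviron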
I am not sure exactly how R calls MKL, but if the crossprod function calls MKL's gemm underneath, then we should see very good scalability with such inputs. What are the input problem sizes?
MKL supports a verbose mode, which prints a lot of useful runtime information when dgemm runs. Could you try exporting the MKL_VERBOSE=1 environment variable and looking at the output?
Though I am not quite sure whether R suppresses that output.
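For instance, a quick way to check from a shell (illustrative input sizes; assumes R is launched from that same shell):
export MKL_VERBOSE=1
R --vanilla -e "crossprod(matrix(1:1e6, ncol=1000))[1]"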

Load SPIR binary with clBuildProgram on Windows

I am trying to load a SPIR binary I created with clang+LLVM 6.0.1.
I created a few different files with:
clang -target spir-unknown-unknown -cl-std=CL1.2 -c -emit-llvm -Xclang -finclude-default-header OCLkernel.cl
clang -target amdgcn-amd-amdhsa -cl-std=CL1.2 -c -emit-llvm -Xclang -finclude-default-header OCLkernel.cl
clang -cc1 -emit-llvm-bc -triple spir-unknown-unknown -cl-std=CL1.2 -include "include\opencl-c.h" OCLkernel.cl
This is all happening on Windows, with the AMD APP SDK 3 and Adrenalin 18.6.1 drivers installed.
After this I try to create a program from the binary:
clCreateProgramWithBinary(context, 1, &device, &programSrcSize, (const unsigned char**)&programSrc, 0 , &status)
This all goes OK and I don't get any errors here, but I do when trying to build it afterwards:
clBuildProgram(program, 1, &device, "-x spir -spir-std=1.2", NULL, NULL);
The error I get is:
Error CL_INVALID_COMPILER_OPTIONS when calling clBuildProgram
I tried without the "-x spir ..." options too, but then I just get:
error: Invalid value (Producer: 'LLVM6.0.1' Reader: 'LLVM 3.9.0svn')
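As an aside, when clBuildProgram fails, the full build log usually says more than the error code alone; a minimal sketch using the standard clGetProgramBuildInfo query (reusing program and device from the snippets above):

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

static void print_build_log(cl_program program, cl_device_id device)
{
    size_t log_size = 0;
    /* First call: query the size of the log. */
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
    char *log = (char *)malloc(log_size);
    /* Second call: fetch the log itself. */
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
    printf("Build log:\n%s\n", log);
    free(log);
}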
EDIT:
CL_DEVICE_NAME: gfx900
CL_DEVICE_VERSION: OpenCL 2.0 AMD-APP (2580.6)
CL_DEVICE_OPENCL_C_VERSION: OpenCL C 2.0
CL_DRIVER_VERSION: 2580.6 (PAL,HSAIL)
CL_DEVICE_SPIR_VERSIONS: 1.2
After running clCreateProgramWithBinary I query the program's build info with clGetProgramBuildInfo and get:
CL_PROGRAM_BINARY_TYPE = [CL_PROGRAM_BINARY_TYPE_INTERMEDIATE]
So that should mean the binary is being recognised; otherwise I guess it would return CL_PROGRAM_BINARY_TYPE_NONE.
EDIT2:
I think clang isn't creating a 'good' binary, but how should I create one then?
Appreciate your help!
Unfortunately, support for SPIR was silently removed from AMD drivers; see dipak's answers in this thread of the AMD community forum:
https://community.amd.com/thread/232093
Regarding your second question: general clang+LLVM (not the secret version tuned by AMD and included in their proprietary drivers) still cannot produce binaries compatible with general-purpose AMD Windows drivers; however, it is possible for Linux: all the new AMD runtimes (ROCm, AMD PAL and Mesa 3D) are covered.
It is a mystery to me why the LLVM AMDGPU backend developers do not prioritize producing binaries for Windows drivers, as there are a couple of GCN assembler projects that provide such functionality through the Windows OpenCL interface, to name a few: CLRadeonExtender, ASM4GCN, HepPas, etc. Moreover, I know of an undocumented fork of clang+LLVM that (as its author states) produces such OpenCL binaries! "There are more things in heaven and earth, Horatio, Than are dreamt of in your philosophy."

clCreateFromGLTexture() returns CL_INVALID_CONTEXT on certain platforms only

After successfully creating a shared context between OpenGL and OpenCL using the following:
cl_context_properties cps[] = {
    CL_GL_CONTEXT_KHR,
    (cl_context_properties)glXGetCurrentContext(),
    CL_GLX_DISPLAY_KHR,
    (cl_context_properties)glXGetCurrentDisplay(),
    CL_CONTEXT_PLATFORM,
    (cl_context_properties)platform_id,
    0
};
// Create an OpenCL context
m_contextCL = clCreateContext(cps, 1, &device_id, NULL, NULL, &err);
I try to create a shared texture:
cl_mem mem = clCreateFromGLTexture(
    m_contextCL,
    CL_MEM_READ_ONLY,
    GL_TEXTURE_2D,
    0,
    qt_fbo->texture(),
    &err
);
Now, the call succeeds only on Xubuntu 16.04 with an NVIDIA Quadro K620, using proprietary driver version 387.26 and the OpenCL implementation delivered with the CUDA package.
However, when trying it on a Toshiba laptop with Intel HD Graphics 520 running Manjaro and the Beignet OpenCL implementation, clCreateFromGLTexture(...) fails by returning CL_INVALID_CONTEXT.
Additionally, I tried another platform with Ubuntu 16.04 and an Intel Iris IGP (integrated graphics processor), using both the Intel SDK and Beignet OpenCL. It fails at the same point, the creation of the shared texture.
I created a minimum working example comparing the two GPU techniques (OpenGL and OpenCL) and their interoperability with Qt:
https://github.com/pietrzakmat/opengl-opencl-qt-interop.
All the steps are derived from two tutorials:
1. https://www.codeproject.com/Articles/685281/OpenGL-OpenCL-Interoperability-A-Case-Study-Using
2. https://software.intel.com/en-us/articles/opencl-and-opengl-interoperability-tutorial
Could anyone point out what I am doing wrong and why the creation of the shared texture fails on the platforms with integrated Intel graphics? Is this a problem with the drivers or the OpenCL implementations? I managed to build and run the samples included with Beignet and intel_ocl_examples, so I think the installation is correct.
1) Is the cl_khr_gl_sharing extension supported? (See the sketch below for a quick check.) Did you try this code on Windows/macOS with an Intel GPU?
2) Did you try using a texture that is not attached to an FBO?
Anyway, I think this is a problem in the OpenCL implementation on the Linux platform.
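For point 1), a minimal way to check programmatically, using the standard clGetDeviceInfo query (device_id as in the snippets above):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <CL/cl.h>

/* Returns 1 if the device advertises cl_khr_gl_sharing, 0 otherwise. */
static int supports_gl_sharing(cl_device_id device_id)
{
    size_t size = 0;
    /* Query the size of the extension string, then fetch it. */
    clGetDeviceInfo(device_id, CL_DEVICE_EXTENSIONS, 0, NULL, &size);
    char *exts = (char *)malloc(size);
    clGetDeviceInfo(device_id, CL_DEVICE_EXTENSIONS, size, exts, NULL);
    int found = strstr(exts, "cl_khr_gl_sharing") != NULL;
    free(exts);
    return found;
}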

"A request was made to bind to that would result in binding more processes than cpus on a resource" mpirun command (for mpi4py)

I am running OpenAI baselines, specifically the Hindsight Experience Replay code. (However, I think this question is independent of that code and is MPI-related, which is why I'm posting on Stack Overflow.)
You can see the README there but the point is, the command to run is:
python -m baselines.her.experiment.train --num_cpu 20
where the number of CPUs can vary and is for MPI.
I am successfully running the HER training script with 1-4 CPUs (i.e., --num_cpu x for x=1,2,3,4) on a single machine with:
Ubuntu 16.04
Python 3.5.2
TensorFlow 1.5.0
One TitanX GPU
The number of CPUs seems to be 8 as I have a quad-core i7 Intel processor with hyperthreading, and Python confirms that it sees 8 CPUs.
(py3-tensorflow) daniel@titan:~/baselines$ ipython
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.2.1 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import os, multiprocessing
In [2]: os.cpu_count()
Out[2]: 8
In [3]: multiprocessing.cpu_count()
Out[3]: 8
Unfortunately, when I run with 5 or more CPUs, I get this message blocking the code from running:
(py3-tensorflow) daniel@titan:~/baselines$ python -m baselines.her.experiment.train --num_cpu 5
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:
Bind to: CORE
Node: titan
#processes: 2
#cpus: 1
You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------
And here's where I got lost. There's no error message or specific line of code to fix, so I am unsure where I would even add overload-allowed.
At a high level, the code takes in this argument and uses the Python subprocess module to run an mpirun command. However, checking mpirun --help on the command line doesn't reveal overload-allowed as a valid argument.
Googling this error message leads to issues in the Open MPI repository, for instance:
https://github.com/open-mpi/ompi/issues/626 (seems to have died out without resolving the issue)
https://github.com/open-mpi/ompi/issues/2158 (not sure how this relates to my issue, didn't get clear resolution)
But I'm not sure whether this is an Open MPI issue or an mpi4py issue.
Here's pip list in my virtual environment if it helps:
(py3.5-mpi-practice) daniel@titan:~$ pip list
decorator (4.2.1)
ipython (6.2.1)
ipython-genutils (0.2.0)
jedi (0.11.1)
line-profiler (2.1.2)
mpi4py (3.0.0)
numpy (1.14.1)
parso (0.1.1)
pexpect (4.4.0)
pickleshare (0.7.4)
pip (9.0.1)
pkg-resources (0.0.0)
pprintpp (0.3.0)
prompt-toolkit (1.0.15)
ptyprocess (0.5.2)
Pygments (2.2.0)
setuptools (20.7.0)
simplegeneric (0.8.1)
six (1.11.0)
traitlets (4.3.2)
wcwidth (0.1.7)
So, TL;DR:
How do I fix this error in my code?
If I add the "overload-allowed" thing, what happens? Is it safe?
Thanks!
overload-allowed is a qualifier that is passed to the --bind-to parameter of mpirun (source).
mpirun ... --bind-to core:overload-allowed
Beware that hyperthreading is more about marketing than about real performance gains.
Your i7 actually has four physical cores and four "logical" ones. The logical ones basically use whatever resources of the physical cores happen to be idle. The problem is that a good HPC program will use 100% of the CPU hardware, leaving hyperthreading no spare resources to operate with.
So it is safe to "overload" the cores, but it is not your first candidate for a performance boost.
Regarding the advice the paper's authors give about reproducing the results: in the best case, fewer CPUs just means slower learning. However, if learning does not converge to the expected value no matter how the hyperparameters are tweaked, that is a reason to look more closely at the proposed algorithm.
While IEEE 754 computations do differ when performed in a different order, this difference should not play a crucial role.
The error message suggests that mpi4py is built on top of Open MPI.
By default, a slot is a core, but if you want a slot to be a hyperthread, then you should run
mpirun --use-hwthread-cpus ...
