Confusion about compiling with AVX-512 (Intel)

I'm reading this document about how to compile C/C++ code using the Intel C++ compiler with AVX-512 support on an Intel Knights Landing system.
However, I'm a little bit confused about this part:
-xMIC-AVX512: use this option to generate AVX-512F, AVX-512CD, AVX-512ER and AVX-512PF.
-xCORE-AVX512: use this option to generate AVX-512F, AVX-512CD, AVX-512BW, AVX-512DQ and AVX-512VL.
For example, to generate Intel AVX-512 instructions for the Intel Xeon Phi processor x200, you should use the option -xMIC-AVX512. For example, on a Linux system
$ icc -xMIC-AVX512 application.c
This compiler option is useful when you want to build a huge binary for the Intel Xeon Phi processor x200. Instead of building it on the coprocessor where it will take more time, build it on an Intel Xeon processor-based machine.
My Xeon Phi KNL is a self-hosted processor, not a coprocessor (no need to ssh into micX or to compile with the -mmic flag). However, I don't understand whether it's better to use -xMIC-AVX512 or -xCORE-AVX512.
My second question is about -ax instead of -x:
This compiler option is useful when you try to build a binary that can run on multiple platforms.
So -ax is used for cross-platform support, but is there any performance difference compared to -x?

For the first question, use -xMIC-AVX512 if you want to compile for the Intel Xeon Phi processor x200 (aka the KNL processor). Note that the sentence in the document you mentioned was mistyped; it should read "This compiler option is useful when you want to build a huge binary for the Intel Xeon Phi processor x200. Instead of building it on the Intel Xeon Phi processor x200 where it will take more time, build it on an Intel Xeon processor-based machine."
For the second question, there should not be a performance difference if you run the binaries on an Intel Xeon Phi processor x200. However, the binary compiled with -ax will be larger than the one compiled with the -x option.
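As a rough sketch of what that means in practice (application.c is just the placeholder file name from the quote above):
$ icc -xMIC-AVX512 application.c      # one code path, KNL-only instructions
$ icc -axMIC-AVX512 application.c     # default baseline code path plus a KNL-optimized path, chosen by CPU dispatch at run time
The second binary carries both code paths, which is why it ends up larger.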

Another option from the link you provided is to build with -xCOMMON-AVX512. This is a tempting option because in my case it has all the instructions that I need, and I can use the same option for both a KNL and a Skylake-AVX512 system. Since I don't build on a KNL system, I cannot use -xHost (or -march=native with GCC).
However, -xCOMMON-AVX512 should NOT be used with KNL. The reason is that it generates the vzeroupper instruction (https://godbolt.org/z/PgFX55), which is not only unnecessary but actually very slow on a KNL system.
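If you want to verify this on your own code, a quick sketch (foo.c standing in for any source file with vectorizable loops):
$ icc -S -O2 -xCOMMON-AVX512 foo.c && grep -c vzeroupper foo.s
A non-zero count means vzeroupper is being emitted; the same check on a -xMIC-AVX512 build should come back empty.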
Agner Fog writes in the KNL section of his microarchitecture manual:
The VZEROALL or VZEROUPPER instructions are not only superfluous here, they are actually
harmful for the performance. A VZEROALL or VZEROUPPER instruction takes 36 clock cycles
in 64 bit mode...
Therefore, for a KNL system you should use -xMIC-AVX512; for other systems with AVX-512 you should use -xCORE-AVX512 (or -xSKYLAKE-AVX512). I use -qopt-zmm-usage=high as well.
I am not aware of a switch for ICC to disable vzeroupper once it is enabled (with GCC you can use -mno-vzeroupper).
Incidentally, by the same logic you should use -march=knl with GCC and not -mavx512f (-mavx512f -mno-vzeroupper may work if you are sure you don't need AVX512ER or AVX512PF).
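For GCC that would look roughly like the following (app.c is a placeholder file name; flags as discussed above):
$ gcc -O3 -march=knl app.c                    # KNL tuning; by the logic above, no vzeroupper
$ gcc -O3 -mavx512f -mno-vzeroupper app.c     # only if you are sure you don't need AVX512ER/PF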

Related

PyCaret methods on GPU/TPU

When I run best_model = compare_models(), there is a huge load on the CPU and memory, while my GPU is unutilized. How do I run setup() or compare_models() on the GPU?
Is there a built-in method in PyCaret?
Only some models can run on a GPU, and they must be properly installed with GPU support. For example, for xgboost you must install it with pip and have CUDA 10+ installed (or install a GPU-enabled xgboost build from Anaconda, etc.). Here is the list of estimators that can use the GPU and their requirements: https://pycaret.readthedocs.io/en/latest/installation.html?highlight=gpu#pycaret-on-gpu
As Yatin said, you need to use use_gpu=True in setup(). Or you can specify it when creating an individual model, like xgboost_gpu = create_model('xgboost', fold=3, tree_method='gpu_hist', gpu_id=0).
For installing CUDA, I like using Anaconda since it makes it easy, e.g. conda install -c anaconda cudatoolkit. It looks like for the non-boosted methods, you need to install cuML to use the GPU.
Also, it looks like PyCaret can't use tune-sklearn with a GPU (see the warnings at the bottom of the tune_model section of the docs).
To use the GPU in PyCaret you simply have to pass use_gpu=True as a parameter to the setup function.
Example:
model = setup(data,target_variable,use_gpu=True)
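For a slightly fuller sketch (assuming a pandas DataFrame df with a column named 'target'; the classification module is just one example):
from pycaret.classification import setup, compare_models
exp = setup(data=df, target='target', use_gpu=True)  # GPU-capable estimators will use the GPU
best_model = compare_models()                        # models without GPU support still run on the CPU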

cvxopt uses just one core, need to run on all / some

I call cvxopt.glpk.ilp in Python 3.6.6, cvxopt==1.2.3 for a boolean optimization problem with about 500k boolean variables. It is solved in 1.5 hours, but it seems to run on just one core! How can I make it run on all or a specific set of cores?
The server with Linux Ubuntu x86_64 has 16 or 32 physical cores. My process affinity is 64 cores (I assume due to hyperthreading).
> grep ^cpu\\scores /proc/cpuinfo | uniq
16
> grep -c ^processor /proc/cpuinfo
64
> taskset -cp <PID>
pid <PID> current affinity list: 0-63
However top shows only 100% CPU for my process, and htop shows that only one core is 100% busy (some others are slightly loaded presumably by other users).
I set OMP_NUM_THREADS=32 and started my program again, but still one core. It's a bit difficult to restart the server itself. I don't have root access to the server.
I installed cvxopt from a company's internal repo which should be a mirror of PyPI. The following libs are installed in /usr/lib: liblapack, liblapack_atlas, libopenblas, libblas, libcblas, libatlas.
Here another SO user writes that GLPK is not multithreaded. This is the solver used by default, as cvxopt has no MIP solver of its own.
As cvxopt only supports GLPK as open-source mixed-integer programming solver, you are out of luck.
Alternatively you can use CoinOR's Cbc, which is usually a much better solver than GLPK while still being open source. It can also be compiled with parallelization enabled. See some benchmarks, which also indicate that GLPK really has no parallel support.
But as there is no support in cvxopt, you will need some alternative access point:
- your own C/C++-based wrapper
- pulp (binary install available)
- python-mip (binary install available; see the sketch below)
- Google's ortools (binary install available)
- cylp
- cvxpy + cylp (binary install available for cvxpy; without cylp-build)
These options:
- have very different modelling styles (from completely low-level cylp to very high-level cvxpy)
- are not necessarily compiled with enable-parallel (which is needed when compiling Cbc); I'm not sure about each of those builds
Furthermore: don't expect too much gain from multithreading. It's usually much worse than a linear speedup (as with all combinatorial-optimization problems that are not based on brute force).
(IMHO the GIL does not matter here, as all of those are C extensions where the GIL is not in the way.)
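If you go the python-mip route, a minimal sketch of a boolean model solved with CBC on multiple threads might look like this (the toy objective and constraint are made up for illustration; the relevant knob is the threads attribute):
from mip import Model, xsum, maximize, BINARY
m = Model(solver_name="CBC")                        # CBC instead of GLPK
x = [m.add_var(var_type=BINARY) for _ in range(1000)]
m.objective = maximize(xsum(x[i] for i in range(1000)))
m += xsum(x[i] for i in range(0, 1000, 2)) <= 100   # arbitrary example constraint
m.threads = -1                                      # -1 = use all available cores
m.optimize(max_seconds=300)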

How to compile an OpenCL kernel file (.cl) to LLVM IR

This question is related to LLVM/clang.
I already know how to compile an OpenCL kernel file (.cl) using the OpenCL API (clBuildProgram() and clGetProgramBuildInfo()).
My question is this:
How can I compile an OpenCL kernel file (.cl) to LLVM IR with OpenCL 1.2 or higher?
In other words, how can I compile an OpenCL kernel file (.cl) to LLVM IR without libclc?
I have tried various methods to get the LLVM IR of an OpenCL kernel file.
I first followed the Clang user manual (https://clang.llvm.org/docs/UsersManual.html#opencl-features), but it did not work.
Secondly, I found a way to use libclc.
The commands are:
clang++ -emit-llvm -c -target nvptx64-nvidia-nvcl -Dcl_clang_storage_class_specifiers -include /usr/local/include/clc/clc.h -fpack-struct=64 -o "$#".bc "$#"
llvm-link "$#".bc /usr/local/lib/clc/nvptx64--nvidiacl.bc -o "$#".linked.bc
llc -mcpu=sm_52 -march=nvptx64 "$#".linked.bc -o "$#".nvptx.s
This method worked fine, but since libclc was built against the OpenCL 1.1 specification, it cannot be used with OpenCL 1.2 or later code, such as code using printf.
Also, this method relies on libclc, which implements the OpenCL built-in functions as ordinary functions. You can observe in the assembly (PTX) of the resulting OpenCL binary that the built-ins remain plain function calls instead of being inlined. I am concerned that this will affect GPU behavior and performance, such as execution time.
So now I am looking for a way to replace compilation using libclc.
As a last resort, I'm considering using libclc with the NVPTX backend and AMDGPU backend of LLVM.
But if there is already another way, I want to use it.
(I expect that there is an OpenCL front end in Clang that I have not found yet.)
My program's scenario is:
1. Start with an OpenCL kernel source file (.cl).
2. Compile the file to LLVM IR.
3. Run IR-level processing on the IR.
4. Compile (using llc) the IR to a binary for each GPU target (nvptx, amdgcn, ...).
5. Using the binary, run the host program (.c or .cpp with the OpenCL library) via clCreateProgramWithBinary().
Currently, when I compile the kernel source file to LLVM IR, I have to include the libclc header (the -include option in the first command above) to compile the built-in functions, and I have to link the libclc libraries before compiling the IR to a binary.
My environment is:
- GTX 960 (the NVIDIA binary appears in PTX format; I'm using sm_52 for my GPU)
- Ubuntu Linux 16.04 LTS
- LLVM/Clang 5.0.0 (if there is another way, I am willing to change the LLVM version)
Thanks in advance!
Clang 9 (and up) can compile OpenCL kernels written in the OpenCL C language. You can tell Clang to emit LLVM-IR by passing the -emit-llvm flag (add -S to output the IR in text rather than in bytecode format), and specify which version of the OpenCL standard using e.g. -cl-std=CL2.0. Clang currently supports up to OpenCL 2.0.
By default, Clang will not add the standard OpenCL headers, so if your kernel uses any of the OpenCL built-in functions you may see an error like the following:
clang-9 -c -x cl -emit-llvm -S -cl-std=CL2.0 my_kernel.cl -o my_kernel.ll
my_kernel.cl:17:12: error: implicit declaration of function 'get_global_id' is invalid in OpenCL
int i = get_global_id(0);
^
1 error generated.
You can tell Clang to include the standard OpenCL headers by passing the -finclude-default-header flag to the Clang frontend, e.g.
clang-9 -c -x cl -emit-llvm -S -cl-std=CL2.0 -Xclang -finclude-default-header my_kernel.cl -o my_kernel.ll
(I expect that there is an OpenCL front end in Clang that I have not found yet.)
There is an OpenCL front end in Clang, and you're already using it; otherwise you couldn't compile a single line of OpenCL with Clang. The front end is the part of Clang that recognizes the OpenCL language. There is no OpenCL backend of any kind in LLVM; that is not LLVM's job. It is the job of the various OpenCL implementations to provide the proper libraries. Clang+LLVM just recognizes the language and compiles it to bitcode and machine binaries; that's all it does.
the built-ins remain plain function calls instead of being inlined in the assembly (PTX) of the resulting OpenCL binary
You could try linking against a different library instead of libclc, if you find one. Perhaps NVIDIA's CUDA has some bitcode libraries somewhere, though then again there may be licensing issues... By the way, are you 100% sure you need LLVM IR? Getting OpenCL binaries via the OpenCL runtime, or using SPIR-V, might get you faster binaries and would certainly be less painful to work with. Even if you manage to get a nice LLVM IR, you will need some runtime that actually accepts it (I could be wrong, but I doubt the proprietary AMD/NVIDIA OpenCL implementations will just accept random LLVM IR as input).
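If you do try the SPIR-V route, one possible sketch (assuming the Khronos SPIRV-LLVM-Translator's llvm-spirv tool is installed; file names are placeholders):
clang -c -x cl -cl-std=CL1.2 -target spir64-unknown-unknown -emit-llvm -Xclang -finclude-default-header my_kernel.cl -o my_kernel.bc
llvm-spirv my_kernel.bc -o my_kernel.spv
The resulting .spv module can then be passed to a driver that supports clCreateProgramWithIL / cl_khr_il_program.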
Clang does not provide a standard CL declaration header file (the analogue of C's stdio.h), which is why you're getting "undefined type float" and the like.
If you get such a header, you can then mark it as an implicit include using "clang -include cl.h -x cl [your filename here]".
One such declaration header can be retrieved from the reference OpenCL compiler implementation at
https://github.com/KhronosGroup/SPIR-Tools/blob/master/headers/opencl_spir.h
And by the way, consider using that compiler, which generates SPIR (albeit 1.0) that can be fed to OpenCL drivers as input.

Is there an offline OpenCL compiler (for NVIDIA graphics cards)?

The normal way to run an OpenCL program is to include the OpenCL kernel source, which is compiled at runtime (online compilation).
But I've seen examples of compiling OpenCL to a binary beforehand, called offline compilation. I'm aware of the disadvantages (reduced compatibility across hardware).
There used to be an offline compiler at http://www.fixstars.com/en/ but it does not seem to be available anymore.
So is there an offline compiler for OpenCL available, in particular for NVIDIA-based cards?
Someone suggested that nvcc.exe in NVIDIA's CUDA SDK could compile .cl files with
nvcc -x cu file.cl -o file.out -lOpenCL
...but it complains about a missing cl.exe, at least on Windows. This might be worth checking out, however: http://clcc.sourceforge.net/
As well as:
https://github.com/HSAFoundation/CLOC (AMD-maintained offline compiler)
https://github.com/Maratyszcza/clcc (which also includes links to the above and more)

OpenCL kernel compiler optimisations

I'm using OpenCL on OS X. I was wondering if someone could tell me which compiler is used to generate the GPU binary from the OpenCL kernel source code. On OS X, is the OpenCL kernel compiled to LLVM IR first, then optimized, and then finally compiled to GPU native code? I was also wondering whether the OpenCL kernel compiler performs optimizations on the kernel, such as loop-invariant code motion.
Yes, on Mac OS X all OpenCL code is compiled to LLVM IR, which is then passed to device-specific optimizations and code generation.
You can generate LLVM bitcode files offline and use the result with clCreateProgramWithBinary. The openclc compiler is inside the OpenCL framework (/System/Library/Frameworks/OpenCL.framework/Libraries/openclc). You need these options (arch can be i386, x86_64, gpu_32):
openclc -c -o foo.bc -arch gpu_32 -emit-llvm foo.cl
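If you prefer to exercise the result from Python instead of C, here is a hedged sketch using pyopencl (a third-party binding not mentioned above; it wraps clCreateProgramWithBinary, and foo.bc is the file produced by the openclc command above):
import pyopencl as cl
# Load the offline-compiled bitcode produced by the openclc command above.
binary = open("foo.bc", "rb").read()
ctx = cl.create_some_context()
# One binary per device; this assumes a single device in the context.
prg = cl.Program(ctx, ctx.devices, [binary]).build()
# After build(), kernels can be invoked as usual, e.g. prg.my_kernel(queue, global_size, local_size, args...)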

Resources