Tesla K80 and OpenCL - opencl

I have an Nvidia Tesla K80 sitting in a Linux box. I know that internally a Tesla K80 has two GPUs. When I ran an OpenCL program on that machine, looping over all the devices, I saw four devices (4 Tesla K80s). Would you know why this could be happening?
Here is the host code:
ret = clGetPlatformIDs(0, NULL, &platformCount); openclCheck(ret);
platforms = (cl_platform_id*) malloc(sizeof(cl_platform_id) * platformCount);
ret = clGetPlatformIDs(platformCount, platforms, NULL); openclCheck(ret);
printf("Detect %u platform available.\n", platformCount);
for (unsigned int i = 0; i < platformCount; i++) {
    // get all devices
    ret = clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_GPU, 0, NULL, &deviceCount); openclCheck(ret);
    devices = (cl_device_id*) malloc(sizeof(cl_device_id) * deviceCount);
    ret = clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_GPU, deviceCount, devices, NULL); openclCheck(ret);
    printf("Platform %u. %u device available.\n", i+1, deviceCount);
    // for each device print critical attributes
    for (unsigned int j = 0; j < deviceCount; j++) {
        // print device name
        ret = clGetDeviceInfo(devices[j], CL_DEVICE_NAME, 0, NULL, &valueSize); openclCheck(ret);
        value = (char*) malloc(valueSize);
        ret = clGetDeviceInfo(devices[j], CL_DEVICE_NAME, valueSize, value, NULL); openclCheck(ret);
        printf("\t%u. Device: %s\n", j+1, value);
        free(value);
        //more code here to print device attributes
    }
}
Here is the output:
Detect 1 platform available.
Platform 1. 4 device available.
1. Device: Tesla K80
1.1 Hardware version: OpenCL 1.2 CUDA
1.2 Software version: 352.79
1.3 OpenCL C version: OpenCL C 1.2
1.4 Parallel compute units: 13
2. Device: Tesla K80
2.1 Hardware version: OpenCL 1.2 CUDA
2.2 Software version: 352.79
2.3 OpenCL C version: OpenCL C 1.2
2.4 Parallel compute units: 13
3. Device: Tesla K80
3.1 Hardware version: OpenCL 1.2 CUDA
3.2 Software version: 352.79
3.3 OpenCL C version: OpenCL C 1.2
3.4 Parallel compute units: 13
4. Device: Tesla K80
4.1 Hardware version: OpenCL 1.2 CUDA
4.2 Software version: 352.79
4.3 OpenCL C version: OpenCL C 1.2
4.4 Parallel compute units: 13

Most probably two are 32-bit implementations and two are 64-bit implementations coming from multiple drivers. Old drivers may need to be cleaned out with a display-driver uninstaller. Please check the bitness of each device implementation.
Alternatively, virtual GPU (GRID?) services may have been left active, causing duplicated devices; deactivating the virtual GPU may solve this.

Related

OpenCL reports half the expected compute units

I use OpenCL (under Ubuntu) to query the available platforms, which yields one platform, with
CL_PLATFORM_PROFILE: FULL_PROFILE
CL_PLATFORM_VERSION: OpenCL 2.1 AMD-APP (3143.9)
CL_PLATFORM_NAME: AMD Accelerated Parallel Processing
CL_PLATFORM_VENDOR: Advanced Micro Devices, Inc.
Which offers one device, which I query with:
cl_device_id device = devices[ j ];
cl_uint units = -1;
cl_device_type type;
size_t lmem = -1;
cl_uint dims = -1;
size_t wisz[ 3 ];
size_t wgsz = -1;
size_t gmsz = -1;
err = clGetDeviceInfo( device, CL_DEVICE_NAME, sizeof(name), name, 0 );
err = clGetDeviceInfo( device, CL_DEVICE_NAME, sizeof(vend), vend, 0 );
err = clGetDeviceInfo( device, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(units), &units, 0 );
err = clGetDeviceInfo( device, CL_DEVICE_TYPE, sizeof(type), &type, 0 );
err = clGetDeviceInfo( device, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(lmem), &lmem, 0 );
err = clGetDeviceInfo( device, CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS, sizeof(dims), &dims, 0 );
err = clGetDeviceInfo( device, CL_DEVICE_MAX_WORK_ITEM_SIZES, sizeof(wisz), &wisz, 0 );
err = clGetDeviceInfo( device, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(wgsz), &wgsz, 0 );
CHECK_CL
err = clGetDeviceInfo( device, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(gmsz), &gmsz, 0 );
CHECK_CL
if ( type == CL_DEVICE_TYPE_GPU )
device_id = device;
printf( " %s %s with [%d units] localmem=%zu globalmem=%zu dims=%d(%zux%zux%zu) max workgrp sz %zu", name, vend, units, lmem, gmsz, dims, wisz[0], wisz[1], wisz[2], wgsz );
Which gives me:
gfx1012 gfx1012 with [11 units] localmem=65536 globalmem=8573157376 dims=3(1024x1024x1024) max workgrp sz 256
The CL_DEVICE_MAX_COMPUTE_UNITS value of 11 worries me.
My system is equipped with a Radeon RX 5500 XT, which according to both AMD's website and Wikipedia is supposed to have 22 Compute Units.
Why does OpenCL report half the expected number, 11 Compute Units, instead of 22?
lspci reports:
19:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 14 [Radeon RX 5500/5500M / Pro 5500M] (rev c5) (prog-if 00 [VGA controller])
Subsystem: XFX Pine Group Inc. Navi 14 [Radeon RX 5500/5500M / Pro 5500M]
Flags: bus master, fast devsel, latency 0, IRQ 83, NUMA node 0
Memory at b0000000 (64-bit, prefetchable) [size=256M]
Memory at c0000000 (64-bit, prefetchable) [size=2M]
I/O ports at 7000 [size=256]
Memory at c5d00000 (32-bit, non-prefetchable) [size=512K]
Expansion ROM at c5d80000 [disabled] [size=128K]
Capabilities: <access denied>
Kernel driver in use: amdgpu
Kernel modules: amdgpu
And the AMD GPU PRO driver was installed.
OpenGL vendor string: Advanced Micro Devices, Inc.
OpenGL renderer string: Radeon RX 5500 XT
OpenGL core profile version string: 4.6.14752 Core Profile Context 20.30
OpenGL core profile shading language version string: 4.60
For AMD RDNA GPUs, OpenCL with CL_DEVICE_MAX_COMPUTE_UNITS reports the number of dual compute units (see the RDNA whitepaper, pages 4-9). Each dual compute unit contains 2 compute units, as the name suggests. So your hardware and driver installation is fine.
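The conversion described above is simple arithmetic. As a minimal sketch, assuming the value was read with CL_DEVICE_MAX_COMPUTE_UNITS on an RDNA GPU (the helper name rdna_compute_units is hypothetical and not part of OpenCL), it could look like this:

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper, not an OpenCL API. On RDNA GPUs the value returned for
 * CL_DEVICE_MAX_COMPUTE_UNITS counts dual compute units, so the number of
 * compute units quoted on spec sheets is twice that value. */
unsigned int rdna_compute_units(unsigned int reported_units)
{
    /* Doubling must not overflow an unsigned int. */
    assert(reported_units <= UINT_MAX / 2u);
    return reported_units * 2u;
}
```

For the RX 5500 XT in the question, rdna_compute_units(11) gives 22, which matches the published figure.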

How to enable etnaviv drivers without using Yocto build?

I have a custom board with kernel 4.14 and Vivante drivers 6.2.4p4.0 (the official Freescale ones).
I want to test my Qt application using the Mesa drivers instead of Freescale's.
I've already downloaded, manually compiled, and installed the Mesa drivers with the kmsro and etnaviv driver options enabled, but these steps don't seem to be enough.
What are the steps to do after installing the mesa drivers to enable them?
I don't have access to a Yocto layer for my board, so rebuilding the image is not an option.
thanks!
To get the etnaviv kernel module enabled, I did the following:
Download and compile kernel source v4.14, enabling the following options:
Device Drivers-> Graphics Support-> [M]ETNAVIV
MXC support drivers-> MXC Vivante GPU support->[*]MXC Vivante GPU support
Then install or compile MESA. If you choose to compile MESA, remember to enable the kmsro and etnaviv options in the meson_options.txt file.
Lastly, to check if etnaviv was successfully loaded, do:
# dmesg | grep etnaviv
Should output something like this:
[ 6.249793] etnaviv gpu-subsystem: bound 134000.gpu (ops gpu_ops [etnaviv])
[ 6.249866] etnaviv gpu-subsystem: bound 130000.gpu (ops gpu_ops [etnaviv])
[ 6.249919] etnaviv gpu-subsystem: bound 2204000.gpu (ops gpu_ops [etnaviv])
[ 6.249934] etnaviv-gpu 134000.gpu: model: GC320, revision: 5007
[ 6.332274] etnaviv-gpu 130000.gpu: model: GC2000, revision: 5108
[ 6.402442] etnaviv-gpu 2204000.gpu: model: GC355, revision: 1215
[ 6.402474] etnaviv-gpu 2204000.gpu: Ignoring GPU with VG and FE2.0
[ 6.416880] [drm] Initialized etnaviv 1.1.0 20151214 for gpu-subsystem on minor 1
Also check your dtb file for proper initialization; mine has the following entries for the GPU:
gpu@00130000 {
compatible = "vivante,gc";
reg = <0x130000 0x4000>;
interrupts = <0x0 0x9 0x4>;
clocks = <0x2 0x1b 0x2 0x7a 0x2 0x4a>;
clock-names = "bus", "core", "shader";
power-domains = <0x9>;
linux,phandle = <0x82>;
phandle = <0x82>;
};
gpu@00134000 {
compatible = "vivante,gc";
reg = <0x134000 0x4000>;
interrupts = <0x0 0xa 0x4>;
clocks = <0x2 0x1a 0x2 0x79>;
clock-names = "bus", "core";
power-domains = <0x9>;
linux,phandle = <0x81>;
phandle = <0x81>;
};
gpu@02204000 {
compatible = "vivante,gc";
reg = <0x2204000 0x4000>;
interrupts = <0x0 0xb 0x4>;
clocks = <0x2 0x8f 0x2 0x79>;
clock-names = "bus", "core";
power-domains = <0x9>;
linux,phandle = <0x83>;
phandle = <0x83>;
};
gpu-subsystem {
compatible = "fsl,imx-gpu-subsystem";
cores = <0x81 0x82 0x83>;
};
Note: If you get a dmesg error like "command buffer outside valid memory window", you may need to increase the amount of CMA memory being reserved. You must do this through a kernel parameter; in my case I had to set the following via U-Boot: cma=256M@2G

Why does a global memory allocation succeed with a size larger than the GPU's limit?

I am experimenting with a Vivante GC2000 series GPU, for which clinfo produced the result below.
CL_DEVICE_GLOBAL_MEM_SIZE: 64 MByte
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 32 MByte
CL_DEVICE_GLOBAL_MEM_CACHE_TYPE: Read/Write
CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE: 64
CL_DEVICE_GLOBAL_MEM_CACHE_SIZE: 4096
CL_DEVICE_LOCAL_MEM_SIZE: 1 KByte
CL_DEVICE_LOCAL_MEM_TYPE: Global
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 4 KByte
CL_DEVICE_MAX_CONSTANT_ARGS: 9
From the above output, it is clear that 64 MByte is the limit for global memory allocation.
Now, when I tried allocating 900 MBytes of global memory, I did not receive any error and the call succeeded.
int noOfBytes = (900 * 1024 * 1024);
memPtr = clCreateBuffer(context, CL_MEM_READ_WRITE, noOfBytes, NULL, &err);
if ( err != CL_SUCCESS) {
printf ("Ooops.. Failed");
}
It seems this experiment disproves what clinfo claims. Am I missing some theory or something else?
Because buffers and images are allocated on an OpenCL context (not an OpenCL device) the actual device allocation is often deferred until the buffer is used on a specific device. So while this allocation seemed to work, if you try to actually use that buffer on your device, you'll get an error.
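Since the creation call may not catch an oversized request, one option is to compare the requested size against the device limits yourself before creating the buffer. The following is only a sketch: the function name request_fits_device is hypothetical, and it assumes the two limits were already read with clGetDeviceInfo (CL_DEVICE_MAX_MEM_ALLOC_SIZE and CL_DEVICE_GLOBAL_MEM_SIZE).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical pre-check, not an OpenCL API. Both limits are assumed to have
 * been queried from the target device beforehand:
 *   max_alloc_bytes  <- CL_DEVICE_MAX_MEM_ALLOC_SIZE
 *   global_mem_bytes <- CL_DEVICE_GLOBAL_MEM_SIZE */
bool request_fits_device(unsigned long long requested_bytes,
                         unsigned long long max_alloc_bytes,
                         unsigned long long global_mem_bytes)
{
    /* A zero limit means the limits were never queried. */
    assert(max_alloc_bytes > 0 && global_mem_bytes > 0);

    return requested_bytes > 0 &&
           requested_bytes <= max_alloc_bytes &&
           requested_bytes <= global_mem_bytes;
}
```

With the clinfo values from the question (32 MByte maximum allocation, 64 MByte global memory), a 900 MByte request fails this check, so the host code can report the problem before the buffer is ever used on the device.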

OpenCL does not detect Xeon Phi

We created a small program to detect Xeon Phi, here is our code snippet
std::vector<cl::Platform> platformList(5);
std::vector<cl::Device> deviceList;
cl::Platform::get(&platformList);
if(platformList.size() == 0){
std::cout << "No Platforms found. Check OpenCL installation!" << std::endl;
exit(1);
}
for(i=0; i<platformList.size(); i++){
// for(i=0; i<1; i++){
std::cout << platformList[i].getInfo<CL_PLATFORM_NAME>()<< std::endl;
platformList[i].getDevices(CL_DEVICE_TYPE_ALL, &deviceList);
if(deviceList.size() == 0){
std::cout << "No Devices found. Check OpenCL installation!" << std::endl;
exit(1);
}
for(j=0; j<deviceList.size(); j++){
// dims = deviceList[j].getInfo<CL_DEVICE_MAX_WORK_ITEM_SIZES>();
// for(k=0; k<dims.size(); k++)
// std::cout << dims[k] << std::endl;
std::cout << deviceList[j].getInfo<CL_DEVICE_NAME>()<< std::endl;
}
}
cl::Device device = deviceList[j-1];
std::cout << "Using device: " << device.getInfo<CL_DEVICE_NAME>() << std::endl;
but it does not detect the Phi; we only get this output:
Intel(R) OpenCL
Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
Using device: Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
Hello World
Do you know what we are doing wrong?
P.S. Below you can find the micinfo output
MicInfo Utility Log
Copyright 2011-2013 Intel Corporation All Rights Reserved.
Created Thu Oct 2 15:04:08 2014
System Info
HOST OS : Linux
OS Version : 2.6.32-431.el6.x86_64
Driver Version : 3.2-1
MPSS Version : 3.2
Host Physical Memory : 16274 MB
Device No: 0, Device Name: mic0
Version
Flash Version : 2.1.02.0390
SMC Firmware Version : 1.16.5078
SMC Boot Loader Version : 1.8.4326
uOS Version : 2.6.38.8+mpss3.2
Device Serial Number : ADKC32800437
Board
Vendor ID : 0x8086
Device ID : 0x225d
Subsystem ID : 0x3608
Coprocessor Stepping ID : 2
PCIe Width : Insufficient Privileges
PCIe Speed : Insufficient Privileges
PCIe Max payload size : Insufficient Privileges
PCIe Max read req size : Insufficient Privileges
Coprocessor Model : 0x01
Coprocessor Model Ext : 0x00
Coprocessor Type : 0x00
Coprocessor Family : 0x0b
Coprocessor Family Ext : 0x00
Coprocessor Stepping : C0
Board SKU : C0PRQ-3120/3140 P/A
ECC Mode : Enabled
SMC HW Revision : Product 300W Active CS
Cores
Total No of Active Cores : 57
Voltage : 0 uV
Frequency : 1100000 kHz
Thermal
Fan Speed Control : On
Fan RPM : 1200
Fan PWM : 20
Die Temp : 45 C
GDDR
GDDR Vendor : Elpida
GDDR Version : 0x1
GDDR Density : 2048 Mb
GDDR Size : 5952 MB
GDDR Technology : GDDR5
GDDR Speed : 5.000000 GT/s
GDDR Frequency : 2500000 kHz
GDDR Voltage : 1501000 uV
You might want to look at https://software.intel.com/en-us/articles/opencl-runtime-release-notes. It is more recent than the page Cicada pointed you to and provides a link to Intel® OpenCL™ Runtime 14.2.
The libmic_device.so is included with the OpenCL runtime and is, by default, in /opt/intel/opencl{version_number}/lib64. You will want to make sure that path is in your LD_LIBRARY_PATH environment variable. You will also want to make sure that /opt/intel/opencl{version_number}/mic is in your MIC_LD_LIBRARY_PATH environment variable.
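To verify the environment variables mentioned above from within a program, a small check like the following may help. It is a sketch only: the helper names path_list_contains and env_path_contains are hypothetical, the comparison is an exact string match (so a trailing slash would count as a different entry), and the directory to look for must be the actual install path on your system.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Returns true if `dir` is exactly one of the colon-separated entries in
 * `list`. A NULL list (variable not set) contains nothing. */
bool path_list_contains(const char *list, const char *dir)
{
    assert(dir != NULL);
    if (list == NULL)
        return false;

    size_t dir_len = strlen(dir);
    const char *p = list;
    while (*p != '\0') {
        const char *end = strchr(p, ':');
        size_t len = end ? (size_t)(end - p) : strlen(p);
        if (len == dir_len && strncmp(p, dir, len) == 0)
            return true;
        if (end == NULL)
            break;
        p = end + 1;  /* skip the ':' and continue with the next entry */
    }
    return false;
}

/* Looks up the environment variable `var` (for example LD_LIBRARY_PATH or
 * MIC_LD_LIBRARY_PATH) and checks whether it lists `dir`. */
bool env_path_contains(const char *var, const char *dir)
{
    return path_list_contains(getenv(var), dir);
}
```

A host program could call env_path_contains("LD_LIBRARY_PATH", dir) and env_path_contains("MIC_LD_LIBRARY_PATH", dir) at startup and print a warning when either returns false.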
You already have the Intel MPSS installed; otherwise micinfo would not work. The libcoi_host.so is included in the MPSS and installs in /usr/lib64, which is already in your library search path.
The version of the MPSS that you are running is 3.2-1. The "What's new" notes for the OpenCL runtime 14.1 on the release notes web page say that version 14.1 is unstable under MPSS 3.2-1. I am trying to find out whether there is a different version of the runtime that is more stable with MPSS 3.2-1, or whether the only recommendation is to install a newer MPSS. You can find the latest MPSS releases at https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss.

OpenCL Bus error

I have a problem with my OpenCL code. I compile and run it on the CPU (Core 2 Duo) under Mac OS X 10.6.7. Here is the code:
#define BUFSIZE (524288) // 512 KB
#define BLOCKBYTES (32) // 32 B
__kernel void test(__global unsigned char *in,
__global unsigned char *out,
unsigned int srcOffset,
unsigned int dstOffset) {
int grId = get_group_id(0);
unsigned char msg[BUFSIZE];
srcOffset = grId * BUFSIZE;
dstOffset = grId * BLOCKBYTES;
// Copy from global to private memory
size_t i;
for (i = 0; i < BUFSIZE; i++)
msg[i] = in[ srcOffset + i ];
// Make some computation here, not complicated logic
// Copy from private to global memory
for (i = 0; i < BLOCKBYTES; i++)
out[ dstOffset + i ] = msg[i];
}
The code gives me a runtime error, "Bus error". When I add debugging printf calls to the loop that copies from global to private memory, I see that the error occurs there, at a different iteration of i each time. When I reduce BUFSIZE to 262144 (256 KB), the code runs fine. I tried running with only one work-item in one work-group. The *in pointer refers to a memory area that holds thousands of KB of data. I suspect a private memory limit, but then I would expect the error to be raised when the memory is allocated, not during the copy.
Here is my OpenCL device query:
---------------------------------
Device Intel(R) Core(TM)2 Duo CPU P7550 @ 2.26GHz
---------------------------------
CL_DEVICE_NAME: Intel(R) Core(TM)2 Duo CPU P7550 @ 2.26GHz
CL_DEVICE_VENDOR: Intel
CL_DRIVER_VERSION: 1.0
CL_DEVICE_VERSION: OpenCL 1.0
CL_DEVICE_TYPE: CL_DEVICE_TYPE_CPU
CL_DEVICE_MAX_COMPUTE_UNITS: 2
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: 1 / 1 / 1
CL_DEVICE_MAX_WORK_GROUP_SIZE: 1
CL_DEVICE_MAX_CLOCK_FREQUENCY: 2260 MHz
CL_DEVICE_ADDRESS_BITS: 32
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 1024 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 1535 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
CL_DEVICE_LOCAL_MEM_TYPE: global
CL_DEVICE_LOCAL_MEM_SIZE: 16 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT: 1
CL_DEVICE_MAX_READ_IMAGE_ARGS: 128
CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8
CL_DEVICE_SINGLE_FP_CONFIG: denorms INF-quietNaNs round-to-nearest
CL_DEVICE_IMAGE <dim> 2D_MAX_WIDTH 8192
2D_MAX_HEIGHT 8192
3D_MAX_WIDTH 2048
3D_MAX_HEIGHT 2048
3D_MAX_DEPTH 2048
CL_DEVICE_EXTENSIONS: cl_khr_fp64
cl_khr_global_int32_base_atomics
cl_khr_global_int32_extended_atomics
cl_khr_local_int32_base_atomics
cl_khr_local_int32_extended_atomics
cl_khr_byte_addressable_store
cl_APPLE_gl_sharing
cl_APPLE_SetMemObjectDestructor
cl_APPLE_ContextLoggingFunctions
CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t> CHAR 16, SHORT 8, INT 4, LONG 2, FLOAT 4, DOUBLE 2
You use a variable msg with a size of 512 KB. This variable should be in private memory, and private memory is not that big. As far as I know, this shouldn't work.
Why do you have the parameters srcOffset and dstOffset? You do not use the values passed in.
I do not see any other issues. Try allocating local memory. Do you have a version of your code running without this optimization? A version which just calculates in global memory?
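To illustrate what "just calculating in global memory" could mean here, the sketch below emulates one work-group of the kernel in plain C. This is an assumption-laden stand-in, not OpenCL C: the function name test_group is hypothetical, the work-group id is passed as an argument instead of coming from get_group_id(0), and the elided computation from the question is left out. The point is only that the 512 KB private array disappears when the input is read directly.

```c
#include <assert.h>
#include <stddef.h>

#define BUFSIZE    (524288) /* 512 KB per work-group, as in the question */
#define BLOCKBYTES (32)     /* 32 B of output per work-group */

/* Plain-C emulation of one work-group. `in` and `out` play the role of the
 * __global buffers; `gr_id` plays the role of get_group_id(0). No BUFSIZE-sized
 * staging array is used: bytes are read straight from `in`. */
void test_group(const unsigned char *in, unsigned char *out, size_t gr_id)
{
    assert(in != NULL && out != NULL);

    size_t src_offset = gr_id * BUFSIZE;
    size_t dst_offset = gr_id * BLOCKBYTES;

    /* The question's computation is elided there and omitted here as well;
     * this loop only reproduces the final copy of BLOCKBYTES bytes. */
    for (size_t i = 0; i < BLOCKBYTES; i++)
        out[dst_offset + i] = in[src_offset + i];
}
```

If the real computation needs scratch space, a small fixed-size tile that is refilled in a loop would keep the per-work-item footprint bounded, at the cost of re-reading global memory.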
