I use OpenCL (under Ubuntu) to query the available platforms, which yields one platform with:
CL_PLATFORM_PROFILE: FULL_PROFILE
CL_PLATFORM_VERSION: OpenCL 2.1 AMD-APP (3143.9)
CL_PLATFORM_NAME: AMD Accelerated Parallel Processing
CL_PLATFORM_VENDOR: Advanced Micro Devices, Inc.
The platform offers one device, which I query with:
cl_device_id device = devices[ j ];
cl_uint units = -1;
cl_device_type type;
size_t lmem = -1;
cl_uint dims = -1;
size_t wisz[ 3 ];
size_t wgsz = -1;
size_t gmsz = -1;
err = clGetDeviceInfo( device, CL_DEVICE_NAME, sizeof(name), name, 0 );
err = clGetDeviceInfo( device, CL_DEVICE_NAME, sizeof(vend), vend, 0 );
err = clGetDeviceInfo( device, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(units), &units, 0 );
err = clGetDeviceInfo( device, CL_DEVICE_TYPE, sizeof(type), &type, 0 );
err = clGetDeviceInfo( device, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(lmem), &lmem, 0 );
err = clGetDeviceInfo( device, CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS, sizeof(dims), &dims, 0 );
err = clGetDeviceInfo( device, CL_DEVICE_MAX_WORK_ITEM_SIZES, sizeof(wisz), &wisz, 0 );
err = clGetDeviceInfo( device, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(wgsz), &wgsz, 0 );
CHECK_CL
err = clGetDeviceInfo( device, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(gmsz), &gmsz, 0 );
CHECK_CL
if ( type == CL_DEVICE_TYPE_GPU )
device_id = device;
printf( " %s %s with [%d units] localmem=%zu globalmem=%zu dims=%d(%zux%zux%zu) max workgrp sz %zu", name, vend, units, lmem, gmsz, dims, wisz[0], wisz[1], wisz[2], wgsz );
Which gives me:
gfx1012 gfx1012 with [11 units] localmem=65536 globalmem=8573157376 dims=3(1024x1024x1024) max workgrp sz 256
The CL_DEVICE_MAX_COMPUTE_UNITS value of 11 worries me.
My system is equipped with the Radeon RX 5500 XT, which according to both AMD's website and Wikipedia is supposed to have 22 Compute Units.
Why does OpenCL report half the expected number, 11 Compute Units, instead of 22?
lspci reports:
19:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 14 [Radeon RX 5500/5500M / Pro 5500M] (rev c5) (prog-if 00 [VGA controller])
Subsystem: XFX Pine Group Inc. Navi 14 [Radeon RX 5500/5500M / Pro 5500M]
Flags: bus master, fast devsel, latency 0, IRQ 83, NUMA node 0
Memory at b0000000 (64-bit, prefetchable) [size=256M]
Memory at c0000000 (64-bit, prefetchable) [size=2M]
I/O ports at 7000 [size=256]
Memory at c5d00000 (32-bit, non-prefetchable) [size=512K]
Expansion ROM at c5d80000 [disabled] [size=128K]
Capabilities: <access denied>
Kernel driver in use: amdgpu
Kernel modules: amdgpu
The AMDGPU-PRO driver is installed:
OpenGL vendor string: Advanced Micro Devices, Inc.
OpenGL renderer string: Radeon RX 5500 XT
OpenGL core profile version string: 4.6.14752 Core Profile Context 20.30
OpenGL core profile shading language version string: 4.60
For AMD RDNA GPUs, CL_DEVICE_MAX_COMPUTE_UNITS reports the number of dual compute units (see the RDNA whitepaper, pages 4-9). Each dual compute unit contains 2 compute units, as the name suggests, so 11 reported units correspond to the 22 compute units you expected. Your hardware and driver installation are fine.
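The arithmetic behind this can be written down as a minimal C sketch. The helper names are hypothetical and the factor of 2 is simply the assumption stated in the answer above. It is not something the OpenCL API exposes.

```c
#include <assert.h>
#include <stdio.h>

/* Assumption taken from the answer above: on RDNA, each unit reported by
   CL_DEVICE_MAX_COMPUTE_UNITS is a dual compute unit holding 2 CUs. */
#define CUS_PER_DUAL_CU 2u

/* Hypothetical helper: convert the reported unit count into the
   compute unit count quoted on spec sheets. */
unsigned marketed_compute_units(unsigned reported_units)
{
    assert(reported_units > 0);
    return reported_units * CUS_PER_DUAL_CU;
}

/* Print the conversion for a value obtained from clGetDeviceInfo. */
void print_cu_summary(unsigned reported_units)
{
    printf("reported=%u dual CUs -> %u CUs\n",
           reported_units, marketed_compute_units(reported_units));
}
```

Calling print_cu_summary(units) with the value queried in the question would relate the reported 11 units to the 22 compute units on the spec sheet.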
I am experimenting with a Vivante GC2000 series GPU, where clinfo produced the result below.
CL_DEVICE_GLOBAL_MEM_SIZE: 64 MByte
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 32 MByte
CL_DEVICE_GLOBAL_MEM_CACHE_TYPE: Read/Write
CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE: 64
CL_DEVICE_GLOBAL_MEM_CACHE_SIZE: 4096
CL_DEVICE_LOCAL_MEM_SIZE: 1 KByte
CL_DEVICE_LOCAL_MEM_TYPE: Global
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 4 KByte
CL_DEVICE_MAX_CONSTANT_ARGS: 9
From the output above, it appears that 64 MByte is the limit for global memory allocation.
Yet when I tried allocating a 900 MByte buffer, I did not receive any error and the call succeeded.
int noOfBytes = (900 * 1024 * 1024);
memPtr = clCreateBuffer(context, CL_MEM_READ_WRITE, noOfBytes, NULL, &err);
if (err != CL_SUCCESS) {
    printf("Ooops.. Failed");
}
This experiment seems to contradict what clinfo reports. Am I missing some theory, or something else?
Because buffers and images are allocated on an OpenCL context (not an OpenCL device), the actual device allocation is often deferred until the buffer is used on a specific device. So while this allocation seemed to work, if you try to actually use that buffer on your device, you'll get an error.
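Since the failure may only show up later, one option is to compare the requested size against the device limits before creating the buffer. The following is a minimal C sketch under that assumption. request_fits_device is a hypothetical helper, not an OpenCL call, and the limits are expected to come from clGetDeviceInfo with CL_DEVICE_MAX_MEM_ALLOC_SIZE and CL_DEVICE_GLOBAL_MEM_SIZE.

```c
#include <assert.h>
#include <stdint.h>

#define MIB (1024ull * 1024ull)

/* Hypothetical helper, not part of the OpenCL API.
   Returns 1 when a buffer of `requested` bytes respects both limits,
   0 otherwise. */
int request_fits_device(uint64_t requested,
                        uint64_t max_alloc_size,
                        uint64_t global_mem_size)
{
    assert(requested > 0);  /* a zero-sized buffer is invalid anyway */
    return requested <= max_alloc_size && requested <= global_mem_size;
}
```

With the clinfo values from the question, request_fits_device(900 * MIB, 32 * MIB, 64 * MIB) returns 0, so the 900 MByte request could be rejected on the host before it ever reaches the device.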
I have an Nvidia Tesla K80 sitting in a Linux box. I know that internally a Tesla K80 has two GPUs. When I ran an OpenCL program on that machine, looping over all the devices, I saw four devices (4 Tesla K80s). Would you know why this could be happening?
Here is the host code:
ret = clGetPlatformIDs(0, NULL, &platformCount); openclCheck(ret);
platforms = (cl_platform_id*) malloc(sizeof(cl_platform_id) * platformCount);
ret = clGetPlatformIDs(platformCount, platforms, NULL); openclCheck(ret);
printf("Detect %d platform available.\n", platformCount);
for (unsigned int i = 0; i < platformCount; i++) {
    // get all devices
    ret = clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_GPU, 0, NULL, &deviceCount); openclCheck(ret);
    devices = (cl_device_id*) malloc(sizeof(cl_device_id) * deviceCount);
    ret = clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_GPU, deviceCount, devices, NULL); openclCheck(ret);
    printf("Platform %d. %d device available.\n", i+1, deviceCount );
    // for each device print critical attributes
    for (unsigned int j = 0; j < deviceCount; j++) {
        // print device name
        ret = clGetDeviceInfo(devices[j], CL_DEVICE_NAME, 0, NULL, &valueSize); openclCheck(ret);
        value = (char*) malloc(valueSize);
        ret = clGetDeviceInfo(devices[j], CL_DEVICE_NAME, valueSize, value, NULL); openclCheck(ret);
        printf("\t%d. Device: %s\n", j+1, value);
        free(value);
        //more code here to print device attributes
Here is the output:
Detect 1 platform available.
Platform 1. 4 device available.
1. Device: Tesla K80
1.1 Hardware version: OpenCL 1.2 CUDA
1.2 Software version: 352.79
1.3 OpenCL C version: OpenCL C 1.2
1.4 Parallel compute units: 13
2. Device: Tesla K80
2.1 Hardware version: OpenCL 1.2 CUDA
2.2 Software version: 352.79
2.3 OpenCL C version: OpenCL C 1.2
2.4 Parallel compute units: 13
3. Device: Tesla K80
3.1 Hardware version: OpenCL 1.2 CUDA
3.2 Software version: 352.79
3.3 OpenCL C version: OpenCL C 1.2
3.4 Parallel compute units: 13
4. Device: Tesla K80
4.1 Hardware version: OpenCL 1.2 CUDA
4.2 Software version: 352.79
4.3 OpenCL C version: OpenCL C 1.2
4.4 Parallel compute units: 13
Most probably two are 32-bit implementations and two are 64-bit implementations, coming from multiple drivers. Old drivers may need to be removed with a display driver uninstaller. Please check the bitness of each device implementation.
Alternatively, a virtual GPU (GRID?) service may have been left active and is causing the duplicated devices, so deactivating virtual GPU might solve this.
I just started to learn OpenCL and set up a Visual Studio project using VS2015. Somehow, the code can find only 1 platform (I guess it should be the CPU), and cannot find the GPU device. Can someone please help? The detailed information is as follows:
GPU: Nvidia Quadro K4000
CUDA Installation
CUDA is at: “C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5”
OpenCL related files are located at "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5\include\CL" and "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5\lib\Win32" (assuming 32bit system)
The installer created two environment variables “CUDA_PATH” and “CUDA_PATH_V7_5”. They both point to the above location.
In Visual Studio, the project is set up as
"Project Properties" -> "C/C++" -> "Additional Include Directories" -> "$(CUDA_PATH)\include"
"Project Properties" -> "Linker" -> "Additional Library Directories" -> "$(CUDA_PATH)\lib\Win32"
"Project Properties" -> "Linker" -> "Input" -> "Additional Dependencies" -> "OpenCL.lib"
The code is very simple:
#include "stdafx.h"
#include <iostream>
#include <CL/cl.h>
using namespace std;
int main()
{
cl_int err;
cl_uint numPlatforms;
err = clGetPlatformIDs(0, NULL, &numPlatforms);
if (CL_SUCCESS == err)
cout << "Detected OpenCL platforms: " << numPlatforms << endl;
else
cout << "Error calling clGetPlatformIDs. Error code:" << err << endl;
cl_device_id device = NULL;
err = clGetDeviceIDs(NULL, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
if (err == CL_SUCCESS)
cout << device << endl;
return 0;
}
The code compiles and runs, but it cannot find the GPU device. Specifically, the returned value of the variable device is device = 0x00000000 <NULL>. What could be the problem? Thanks for the help.
This is not how the OpenCL API is meant to be used.
You need to obtain a valid cl_platform_id object, which is then used to retrieve a cl_device_id. You are always passing NULL; the behavior in that case is implementation-defined, so it cannot be relied on.
The first call to clGetPlatformIDs only obtains the number of platforms in the system. After that, you need to call the function again to retrieve the actual cl_platform_id values:
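cl_uint numPlatforms = 0;  // clGetPlatformIDs expects a cl_uint*, not a size_t*
err = clGetPlatformIDs(0, NULL, &numPlatforms);
assert(err == CL_SUCCESS && numPlatforms > 0);
// std::vector (needs #include <vector>) instead of a variable-length array,
// which MSVC does not support
std::vector<cl_platform_id> platform_ids(numPlatforms);
err = clGetPlatformIDs(numPlatforms, platform_ids.data(), NULL);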
However, if you already know there is going to be only one platform in the system, you can take a shortcut as follows, but make sure to check for errors:
cl_platform_id platform_id;
err = clGetPlatformIDs(1, &platform_id, NULL);
assert(err == CL_SUCCESS);
After you have obtained a platform you need to follow the same procedure to first obtain the number of devices and then retrieve the list of OpenCL devices (which you then will need to build a cl_context, queues...):
// Note: this has to be done for each `cl_platform_id`
// until you find the device you were looking for
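cl_uint numDevices = 0;  // clGetDeviceIDs expects a cl_uint*, not a size_t*
err = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_GPU, 0, NULL, &numDevices);
assert(err == CL_SUCCESS && numDevices > 0);
// std::vector (needs #include <vector>) instead of a variable-length array,
// which MSVC does not support
std::vector<cl_device_id> devices(numDevices);
err = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_GPU, numDevices, devices.data(), NULL);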
The procedure should be clear by now. If, as above, you already know that there is only one GPU device in the system, you can get its cl_device_id directly as follows:
cl_device_id device;
err = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
assert(err == CL_SUCCESS);
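The question's code silently ignores a failing clGetDeviceIDs, which hides the reason the device stays NULL. As a minimal C sketch, the helpers below turn a status code into a readable message. cl_error_name and cl_call_ok are hypothetical names, and the SKETCH_ constants repeat values from CL/cl.h only so that the example compiles without the OpenCL headers.

```c
#include <assert.h>
#include <stdio.h>

/* These values mirror the definitions in CL/cl.h. They are repeated under
   hypothetical SKETCH_ names only so the example compiles without the
   OpenCL headers; real code should use the CL_ constants directly. */
enum {
    SKETCH_CL_SUCCESS             =   0,
    SKETCH_CL_DEVICE_NOT_FOUND    =  -1,
    SKETCH_CL_INVALID_VALUE       = -30,
    SKETCH_CL_INVALID_DEVICE_TYPE = -31,
    SKETCH_CL_INVALID_PLATFORM    = -32
};

/* Hypothetical helper: map a status code to its symbolic name. */
const char *cl_error_name(int err)
{
    switch (err) {
    case SKETCH_CL_SUCCESS:             return "CL_SUCCESS";
    case SKETCH_CL_DEVICE_NOT_FOUND:    return "CL_DEVICE_NOT_FOUND";
    case SKETCH_CL_INVALID_VALUE:       return "CL_INVALID_VALUE";
    case SKETCH_CL_INVALID_DEVICE_TYPE: return "CL_INVALID_DEVICE_TYPE";
    case SKETCH_CL_INVALID_PLATFORM:    return "CL_INVALID_PLATFORM";
    default:                            return "unrecognized error code";
    }
}

/* Hypothetical helper: report a failed call on stderr.
   Returns 1 on success and 0 on failure. */
int cl_call_ok(const char *call, int err)
{
    assert(call != NULL);
    if (err == SKETCH_CL_SUCCESS)
        return 1;
    fprintf(stderr, "%s failed: %s (%d)\n", call, cl_error_name(err), err);
    return 0;
}
```

Wrapping each call, for example cl_call_ok("clGetDeviceIDs", err), makes the failing step visible instead of leaving only a NULL device to inspect.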
We created a small program to detect a Xeon Phi. Here is our code snippet:
std::vector<cl::Platform> platformList(5);
std::vector<cl::Device> deviceList;
cl::Platform::get(&platformList);
if(platformList.size() == 0){
std::cout << "No Platforms found. Check OpenCL installation!" << std::endl;
exit(1);
}
for(i=0; i<platformList.size(); i++){
// for(i=0; i<1; i++){
std::cout << platformList[i].getInfo<CL_PLATFORM_NAME>()<< std::endl;
platformList[i].getDevices(CL_DEVICE_TYPE_ALL, &deviceList);
if(deviceList.size() == 0){
std::cout << "No Devices found. Check OpenCL installation!" << std::endl;
exit(1);
}
for(j=0; j<deviceList.size(); j++){
// dims = deviceList[j].getInfo<CL_DEVICE_MAX_WORK_ITEM_SIZES>();
// for(k=0; k<dims.size(); k++)
// std::cout << dims[k] << std::endl;
std::cout << deviceList[j].getInfo<CL_DEVICE_NAME>()<< std::endl;
}
}
cl::Device device = deviceList[j-1];
std::cout << "Using device: " << device.getInfo<CL_DEVICE_NAME>() << std::endl;
However, it does not detect the Phi. We get only this output:
Intel(R) OpenCL
Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
Using device: Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
Hello World
Do you know what we are doing wrong?
P.S. Below you can find the micinfo output:
MicInfo Utility Log
Copyright 2011-2013 Intel Corporation All Rights Reserved.
Created Thu Oct 2 15:04:08 2014
System Info
HOST OS : Linux
OS Version : 2.6.32-431.el6.x86_64
Driver Version : 3.2-1
MPSS Version : 3.2
Host Physical Memory : 16274 MB
Device No: 0, Device Name: mic0
Version
Flash Version : 2.1.02.0390
SMC Firmware Version : 1.16.5078
SMC Boot Loader Version : 1.8.4326
uOS Version : 2.6.38.8+mpss3.2
Device Serial Number : ADKC32800437
Board
Vendor ID : 0x8086
Device ID : 0x225d
Subsystem ID : 0x3608
Coprocessor Stepping ID : 2
PCIe Width : Insufficient Privileges
PCIe Speed : Insufficient Privileges
PCIe Max payload size : Insufficient Privileges
PCIe Max read req size : Insufficient Privileges
Coprocessor Model : 0x01
Coprocessor Model Ext : 0x00
Coprocessor Type : 0x00
Coprocessor Family : 0x0b
Coprocessor Family Ext : 0x00
Coprocessor Stepping : C0
Board SKU : C0PRQ-3120/3140 P/A
ECC Mode : Enabled
SMC HW Revision : Product 300W Active CS
Cores
Total No of Active Cores : 57
Voltage : 0 uV
Frequency : 1100000 kHz
Thermal
Fan Speed Control : On
Fan RPM : 1200
Fan PWM : 20
Die Temp : 45 C
GDDR
GDDR Vendor : Elpida
GDDR Version : 0x1
GDDR Density : 2048 Mb
GDDR Size : 5952 MB
GDDR Technology : GDDR5
GDDR Speed : 5.000000 GT/s
GDDR Frequency : 2500000 kHz
GDDR Voltage : 1501000 uV
You might want to look at https://software.intel.com/en-us/articles/opencl-runtime-release-notes. It is more recent than the page Cicada pointed you to and provides a link to Intel® OpenCL™ Runtime 14.2.
The libmic_device.so is included with the OpenCL runtime and is, by default, in /opt/intel/opencl{version_number}/lib64. You will want to make sure that path is in your LD_LIBRARY_PATH environment variable. You will also want to make sure that /opt/intel/opencl{version_number}/mic is in your MIC_LD_LIBRARY_PATH environment variable.
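Whether a directory is actually listed in one of these variables can be checked with a few lines of C. This is only a sketch. path_list_contains is a hypothetical helper, and the directory to look for has to be supplied by the caller, since the version-specific part of the path is not known here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if `dir` is one of the ':'-separated
   entries of `path_list`, 0 otherwise. The comparison is exact, so a
   trailing slash or a symlinked path counts as a different entry. */
int path_list_contains(const char *path_list, const char *dir)
{
    assert(dir != NULL);
    if (path_list == NULL)
        return 0;  /* the environment variable is not set */

    size_t dir_len = strlen(dir);
    const char *entry = path_list;
    while (*entry != '\0') {
        const char *end = strchr(entry, ':');
        size_t len = end ? (size_t)(end - entry) : strlen(entry);
        if (len == dir_len && strncmp(entry, dir, len) == 0)
            return 1;
        if (end == NULL)
            break;         /* last entry examined */
        entry = end + 1;   /* move past the ':' separator */
    }
    return 0;
}
```

A call such as path_list_contains(getenv("LD_LIBRARY_PATH"), dir), with getenv from stdlib.h and dir set to the runtime's lib64 directory, would confirm the first setting. The same check applies to MIC_LD_LIBRARY_PATH.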
You already have the Intel MPSS installed; otherwise micinfo would not work. The libcoi_host.so is included in the MPSS and installs in /usr/lib64, which is already in your library search path.
The version of the MPSS that you are running is 3.2-1. The "What's new" notes for the OpenCL runtime 14.1 on the release notes web page say that version 14.1 is unstable under MPSS 3.2-1. I am trying to find out if there is a different version of the runtime you can use with MPSS 3.2-1 that is more stable or if the only recommendation is to install a newer MPSS. You can find the latest MPSS releases at https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss.
I have a problem with my OpenCL code. I compile and run it on a CPU (Core 2 Duo) under Mac OS X 10.6.7. Here is the code:
#define BUFSIZE (524288) // 512 KB
#define BLOCKBYTES (32) // 32 B
__kernel void test(__global unsigned char *in,
__global unsigned char *out,
unsigned int srcOffset,
unsigned int dstOffset) {
int grId = get_group_id(0);
unsigned char msg[BUFSIZE];
srcOffset = grId * BUFSIZE;
dstOffset = grId * BLOCKBYTES;
// Copy from global to private memory
size_t i;
for (i = 0; i < BUFSIZE; i++)
msg[i] = in[ srcOffset + i ];
// Make some computation here, not complicated logic
// Copy from private to global memory
for (i = 0; i < BLOCKBYTES; i++)
out[ dstOffset + i ] = msg[i];
}
The code gives me a runtime error, "Bus error". When I add debugging printf calls to the loop that copies from global to private memory, I can see that the error occurs there, at a different iteration of i each time. When I reduce BUFSIZE to 262144 (256 KB), the code runs fine. I tried using only one work-item in one work-group. The *in pointer refers to a memory area holding thousands of KB of data. I suspect a limit on private memory, but then I would expect the error to be raised when the memory is allocated, not during the copy.
Here is my OpenCL device query:
---------------------------------
Device Intel(R) Core(TM)2 Duo CPU P7550 @ 2.26GHz
---------------------------------
CL_DEVICE_NAME: Intel(R) Core(TM)2 Duo CPU P7550 @ 2.26GHz
CL_DEVICE_VENDOR: Intel
CL_DRIVER_VERSION: 1.0
CL_DEVICE_VERSION: OpenCL 1.0
CL_DEVICE_TYPE: CL_DEVICE_TYPE_CPU
CL_DEVICE_MAX_COMPUTE_UNITS: 2
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: 1 / 1 / 1
CL_DEVICE_MAX_WORK_GROUP_SIZE: 1
CL_DEVICE_MAX_CLOCK_FREQUENCY: 2260 MHz
CL_DEVICE_ADDRESS_BITS: 32
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 1024 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 1535 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
CL_DEVICE_LOCAL_MEM_TYPE: global
CL_DEVICE_LOCAL_MEM_SIZE: 16 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT: 1
CL_DEVICE_MAX_READ_IMAGE_ARGS: 128
CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8
CL_DEVICE_SINGLE_FP_CONFIG: denorms INF-quietNaNs round-to-nearest
CL_DEVICE_IMAGE <dim> 2D_MAX_WIDTH 8192
2D_MAX_HEIGHT 8192
3D_MAX_WIDTH 2048
3D_MAX_HEIGHT 2048
3D_MAX_DEPTH 2048
CL_DEVICE_EXTENSIONS: cl_khr_fp64
cl_khr_global_int32_base_atomics
cl_khr_global_int32_extended_atomics
cl_khr_local_int32_base_atomics
cl_khr_local_int32_extended_atomics
cl_khr_byte_addressable_store
cl_APPLE_gl_sharing
cl_APPLE_SetMemObjectDestructor
cl_APPLE_ContextLoggingFunctions
CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t> CHAR 16, SHORT 8, INT 4, LONG 2, FLOAT 4, DOUBLE 2
You use a variable msg with a size of 512 KB. This variable lives in private memory, which is not that big, so as far as I know this should not work.
Why do you have the parameters srcOffset and dstOffset? The kernel overwrites them before reading them, so the values passed in are never used.
I do not see any other issues. Try allocating local memory instead. Do you have a version of your code running without this optimization, one that just computes in global memory?
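To illustrate what a global-memory-only version might look like, here is a plain C emulation of a single work-group. It is not OpenCL C. test_group is a hypothetical name, the group index is passed in instead of coming from get_group_id(0), and since the actual computation is not shown in the question, only the copy is reproduced.

```c
#include <assert.h>
#include <stddef.h>

#define BUFSIZE    524288u  /* 512 KB of input per work-group */
#define BLOCKBYTES 32u      /* 32 B of output per work-group */

/* Plain-C emulation of one work-group of the kernel, without the
   512 KB private array: bytes go straight from `in` to `out`. */
void test_group(const unsigned char *in, unsigned char *out, size_t group_id)
{
    assert(in != NULL && out != NULL);
    size_t src_offset = group_id * BUFSIZE;     /* start of this group's input */
    size_t dst_offset = group_id * BLOCKBYTES;  /* start of this group's output */
    for (size_t i = 0; i < BLOCKBYTES; i++)
        out[dst_offset + i] = in[src_offset + i];
}
```

If the real computation needs more than the first BLOCKBYTES of each input block, it would have to read in[src_offset + i] directly, or work through the block in smaller chunks, rather than staging all 512 KB at once.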