I have just started working on SYCL and ran ComputeCpp_info on my system; it showed the following data for 3 devices:
ComputeCpp Info (CE 1.1.0)
SYCL 1.2.1 revision 3
Device 1 (GeForce GTX 1050 = NO - Device does not support SPIR)
Device 2 (Intel(R) HD Graphics 630 = UNTESTED - Device not tested on this OS)
Device 3 (Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz = UNTESTED - Device running untested driver)
Now my question is: can I work on these devices, given that 2 are untested and 1 is not supported? Or am I missing some drivers?
Also, I implemented a simple example, but it gives me a CL/cl.h not found error:
#include <CL/sycl.hpp>
#include <array>
#include <numeric>
#include <iostream>

int main() {
  const size_t array_size = 1024 * 512;
  std::array<cl::sycl::cl_int, array_size> in, out;
  std::iota(begin(in), end(in), 0);

  cl::sycl::queue device_queue;
  cl::sycl::range<1> n_items{array_size};

  // Buffers wrapping the host arrays
  cl::sycl::buffer<cl::sycl::cl_int, 1> in_buffer(in.data(), n_items);
  cl::sycl::buffer<cl::sycl::cl_int, 1> out_buffer(out.data(), n_items);

  device_queue.submit([&](cl::sycl::handler &cgh) {
    constexpr auto sycl_read = cl::sycl::access::mode::read;
    constexpr auto sycl_write = cl::sycl::access::mode::write;
    auto in_accessor = in_buffer.get_access<sycl_read>(cgh);
    auto out_accessor = out_buffer.get_access<sycl_write>(cgh);

    // Multiply each element by 2
    cgh.parallel_for<class VecScalMul>(n_items, [=](cl::sycl::id<1> wiID) {
      out_accessor[wiID] = in_accessor[wiID] * 2;
    });
  });
}
The computecpp_info tool shows the devices that are or are not supported by ComputeCpp on your system. Here's an explanation of the outputs:
NO - Device does not support SPIR: This means that the device can be seen, but it does not support SPIR instructions and so cannot be supported by ComputeCpp.
UNTESTED - Device not tested on this OS: This means that the device can be seen and is reporting that it supports SPIR instructions. It should work with ComputeCpp, but this specific device has not been tested by the ComputeCpp team.
The missing CL/cl.h error is because you are missing the OpenCL headers. These can be found here, and you'll need to point your compiler at them when you build your code. I'd suggest working through the Getting Started guide with the sample code and then modifying its hello-world example to test out your own code. The samples come with an existing CMake file that is designed to search for all the dependencies you need.
(This is a ComputeCpp-specific question rather than a SYCL one.) The UNTESTED platforms will probably work, but Codeplay cannot guarantee it. In my experience both should work, though some OpenCL driver bugs may hit you on the Intel GPU depending on your configuration.
You need the OpenCL headers on your system, since the SYCL 1.2.1 specification is built on top of OpenCL.
Disclaimer: I am a Codeplay employee working on ComputeCpp!
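If you want to explicitly target one of the UNTESTED devices (for example the Intel CPU or GPU) rather than the unsupported GTX 1050, a custom device selector is the usual SYCL 1.2.1 mechanism. Here is a minimal sketch; the class name and the scoring based on the device name string are illustrative assumptions only:
#include <CL/sycl.hpp>
#include <iostream>
#include <string>

// Prefers any device whose name contains "Intel"; rejects everything else
class intel_selector : public cl::sycl::device_selector {
public:
  int operator()(const cl::sycl::device &dev) const override {
    const auto name = dev.get_info<cl::sycl::info::device::name>();
    return (name.find("Intel") != std::string::npos) ? 100 : -1;
  }
};

int main() {
  cl::sycl::queue q{intel_selector{}};
  std::cout << "Running on: "
            << q.get_device().get_info<cl::sycl::info::device::name>()
            << std::endl;
}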
I'm failing to compile programs that use the write_imagef() function on Nvidia implementations.
Working with a Tesla K10.G2.8GB using driver version 367.35, on Python 2.7 with PyOpenCL 2016.1,
I'm trying to compile the following program, which fails with a build error:
Host Code:
import pyopencl as cl
platform = cl.get_platforms()[0]
devs = platform.get_devices()
device1 = devs[1]
mf = cl.mem_flags
ctx = cl.Context([device1])
Queue1 = cl.CommandQueue(ctx)
f = open('Minimal.cl', 'r')
fstr = "".join(f.readlines())
prg = cl.Program(ctx, fstr).build()
Kernel (Minimal.cl)
__kernel void test(image2d_t d_output){
write_imagef(d_output,(int2)(1,1),(float4)(1.0f,1.0f,1.0f,1.0f));
}
The error I get is:
pyopencl.cffi_cl.RuntimeError: clbuildprogram failed: BUILD_PROGRAM_FAILURE -
I checked that my device has image support and that it supports reading and writing to
texture buffers in the specified format. I expect that the same approach won't work in the 3D case,
because the extension cl_khr_3d_image_writes is not supported on any of our Nvidia devices,
but I don't understand the problem in the 2D case.
Image arguments must be declared as either read_only or write_only (or read_write with OpenCL 2.x), so your kernel definition should look like this:
__kernel void test(write_only image2d_t d_output){
write_imagef(d_output,(int2)(1,1),(float4)(1.0f,1.0f,1.0f,1.0f));
}
I'm doing a small OpenCL benchmark using Nvidia drivers; my kernel performs 1024 fused multiply-adds and stores the result in an array:
#define FLOPS_MACRO_1(x) { (x) = (x) * 0.99f + 10.f; } // Multiply-add
#define FLOPS_MACRO_2(x) { FLOPS_MACRO_1(x) FLOPS_MACRO_1(x) }
#define FLOPS_MACRO_4(x) { FLOPS_MACRO_2(x) FLOPS_MACRO_2(x) }
#define FLOPS_MACRO_8(x) { FLOPS_MACRO_4(x) FLOPS_MACRO_4(x) }
// more recursive macros ...
#define FLOPS_MACRO_1024(x) { FLOPS_MACRO_512(x) FLOPS_MACRO_512(x) }

__kernel void ocl_Kernel_FLOPS(int iNbElts, __global float *pf)
{
    for (unsigned i = get_global_id(0); i < iNbElts; i += get_global_size(0))
    {
        float f = (float) i;
        FLOPS_MACRO_1024(f)
        pf[i] = f;
    }
}
But when I look in the PTX generated, I see this:
.entry ocl_Kernel_FLOPS(
.param .u32 ocl_Kernel_FLOPS_param_0,
.param .u32 .ptr .global .align 4 ocl_Kernel_FLOPS_param_1
)
{
.reg .f32 %f<1026>; // 1026 float registers !
.reg .pred %p<3>;
.reg .s32 %r<19>;
ld.param.u32 %r1, [ocl_Kernel_FLOPS_param_0];
// some more code unrelated to the problem
// ...
BB1_1:
and.b32 %r13, %r18, 65535;
cvt.rn.f32.u32 %f1, %r13;
fma.rn.f32 %f2, %f1, 0f3F7D70A4, 0f41200000;
fma.rn.f32 %f3, %f2, 0f3F7D70A4, 0f41200000;
fma.rn.f32 %f4, %f3, 0f3F7D70A4, 0f41200000;
fma.rn.f32 %f5, %f4, 0f3F7D70A4, 0f41200000;
// etc
// ...
If I am correct, the PTX uses 1026 float registers to perform the 1024 operations and never reuses a register, even though it could perform all the multiply-add operations using only 2 registers. 1026 is far above the maximum number of registers a thread is allowed to have (according to the specs), so I guess this ends up in memory spilling.
Is this a compiler bug, or am I totally missing something?
I am using nvcc version 6.5 on a Quadro K2000 GPU.
EDIT
Actually I did miss something in the specs:
"Since PTX supports virtual registers, it is quite common for a compiler frontend to generate
a large number of register names. Rather than require explicit declaration of every name,
PTX supports a syntax for creating a set of variables having a common prefix string
appended with integer suffixes. For example, suppose a program uses a large number, say
one hundred, of .b32 variables, named %r0, %r1, ..., %r99"
The PTX file format is intended to describe a virtual machine and instruction set architecture:
PTX defines a virtual machine and ISA for general purpose parallel thread execution. PTX programs are translated at install time to the target hardware instruction set. The PTX-to-GPU translator and driver enable NVIDIA GPUs to be used as programmable parallel computers.
So the PTX output that you are obtaining there is not a form of "GPU assembler". It is only an intermediate representation, intended to be capable of describing virtually any form of parallel computation.
The PTX representation is then compiled into actual binaries for the respective target GPU. This is important in order to be able to abstract from the actual architecture; specifically, regarding your example: it should be possible to use the same PTX representation of a program regardless of the number of registers that are available on a specific target machine. The 1026 "registers" that you see there are "virtual" registers, and in the end they may be mapped to the (few) real hardware registers that are actually available. You may add the --ptxas-options=-v argument to NVCC during compilation to obtain additional information about the actual register usage.
(This is roughly the same idea as that behind LLVM: namely, to have a representation that can be optimized and reasoned about while abstracting from both the original source code and the actual target architecture.)
Can anybody tell me how to measure the RAM consumed by a particular piece of code running on an Arduino Mega or Due?
There are two kinds of numbers to this question:
global static usage and current runtime usage.
The estimated static usage can be determined by adding the following line (if it does not already exist) to
.\arduino-1.5.5\hardware\arduino\avr\boards.txt
uno.upload.maximum_ram_size=2048
This then allows the compiler output to include the additional second line shown in the following example in the IDE's result window:
Binary sketch size: 25,880 bytes (of a 32,256 byte maximum)
Estimated used SRAM memory: 990 bytes (of a 2048 byte maximum)
To see the amount of memory used at any given point, including memory that only exists while inside functions and members (the heap and so on), I use the following MemoryFree library at specific points in the code to reveal the high-water mark. Its readme also explains how to avoid RAM being unnecessarily/unintentionally used by prints.
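As an illustration, here is a minimal sketch of how such a library is typically called; the MemoryFree.h include and the freeMemory() function follow that library's readme, and the print locations are just examples:
#include <MemoryFree.h>

void setup() {
  Serial.begin(9600);
  Serial.print(F("Free RAM after setup: ")); // F() keeps the literal in flash, not SRAM
  Serial.println(freeMemory());
}

void loop() {
  // Call freeMemory() at suspected high-water points to see the worst case
  Serial.print(F("Free RAM in loop: "));
  Serial.println(freeMemory());
  delay(5000);
}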
Note: while the original Arduino IDE 1.0.5's boards.txt file does contain these ram_size entries, it does not actually display the usage. The Arduino IDE 1.5.5 does, as does Arduino ERW 1.0.5 (a non-supported fork).
In my Arduino IDE 2.1.0 I edited the file /usr/share/arduino/hardware/arduino/boards.txt, but the second line did not appear.
After reading:
check-ram-memory-usage-arduino-optimization
measuring-free-memory
I tried enabling "Show verbose output during compilation" and then ran
avr-size /tmp/build4042914391435450796.tmp/XXXXXXX.cpp.elf
which gave me the memory used.
Best Regards!
Another option is a small helper that estimates the free RAM from the gap between the end of the heap and the current stack position:
int freeRam () {
  extern int __heap_start, *__brkval;
  int v;
  // Distance between the current stack position (&v) and the end of the heap
  int fr = (int) &v - (__brkval == 0 ? (int) &__heap_start : (int) __brkval);
  Serial.print("Free ram: ");
  Serial.println(fr);
  return fr; // the function is declared int, so return the value as well
}
I've been browsing the Qt source code trying to find the actual system calls, but it seems Qt doesn't use the Windows API documented on MSDN. For example, grepping the source for "GetClipboardData" returns results in two files:
qclipboard_win.cpp:
#if defined(Q_OS_WINCE)
...
HANDLE clipData = GetClipboardData(CF_TEXT)
qaxserverbase.cpp:
STDMETHOD(GetClipboardData)(DWORD dwReserved, IDataObject** ppDataObject);
...
HRESULT WINAPI QAxServerBase::GetClipboardData(DWORD, IDataObject**)
{
return E_NOTIMPL;
}
and "SetClipboardData":
qclipboard_win.cpp:
#if defined(Q_OS_WINCE)
...
result = SetClipboardData(CF_UNICODETEXT, wcsdup(reinterpret_cast<const wchar_t *> (data->text().utf16()))) != NULL;
Neither of these seems useful, since they are declared only for Windows CE/Mobile.
My Qt (4.8.1) uses OleSetClipboard and OleGetClipboard. The lines you found are never reached on regular Windows: only in the #if defined(Q_OS_WINCE) case does Qt use #define OleSetClipboard QtCeSetClipboard and #define OleGetClipboard QtCeGetClipboard; otherwise it uses the system-provided versions of those functions.
It was a little difficult to see this #if defined though, so you are excused ;)
That is how it is, at least on my Qt version. If you are talking about Qt, and especially about its internals, you should mention the version, right?
Is there a (Qt) way to determine the platform a Qt application is running on at runtime?
Intention: While I hate to bring up a question that is almost 2 years old, I think that a good amended answer is valuable to have on record so that others that end up on this question can do it the right way.
I can't help but notice that most of the answers recommend using the Q_WS set of macros to determine the operating system. This is not a good solution, since Q_WS_* refers to the windowing system and not to the operating system platform (e.g. X11 can be run on Windows or Mac OS X, and then what?), so one should not rely on those macros to determine the platform for which the application has been compiled.
Instead, one should use the Q_OS_* set of macros, which have the precise purpose of identifying the operating system; see the sketch after the macro list and references below.
The set currently consists of the following macros for Qt 4.8:
Q_OS_AIX
Q_OS_BSD4
Q_OS_BSDI
Q_OS_CYGWIN
Q_OS_DARWIN
Q_OS_DGUX
Q_OS_DYNIX
Q_OS_FREEBSD
Q_OS_HPUX
Q_OS_HURD
Q_OS_IRIX
Q_OS_LINUX
Q_OS_LYNX
Q_OS_MAC
Q_OS_MSDOS
Q_OS_NETBSD
Q_OS_OS2
Q_OS_OPENBSD
Q_OS_OS2EMX
Q_OS_OSF
Q_OS_QNX
Q_OS_RELIANT
Q_OS_SCO
Q_OS_SOLARIS
Q_OS_SYMBIAN
Q_OS_ULTRIX
Q_OS_UNIX
Q_OS_UNIXWARE
Q_OS_WIN32
Q_OS_WINCE
References:
Qt 4 https://doc.qt.io/archives/qt-4.8/qtglobal.html
Qt 5 https://doc.qt.io/qt-5/qtglobal.html
Qt 6 https://doc.qt.io/qt-6/qtglobal.html
NB: As mentioned by Wiz in the comments, Qt 5 completely removed the Q_WS_* set of macros, thus now all you can use are Q_OS_* ones.
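As a minimal sketch of the Q_OS_* approach (the function name and the returned strings are illustrative, not part of the original answer):
#include <QtGlobal>
#include <QString>

// Compile-time platform detection using the Q_OS_* macros
QString platformName()
{
#if defined(Q_OS_WIN)
    return QLatin1String("Windows");
#elif defined(Q_OS_MAC)
    return QLatin1String("Mac OS X");
#elif defined(Q_OS_LINUX)
    return QLatin1String("Linux");
#elif defined(Q_OS_UNIX)
    return QLatin1String("Other Unix");
#else
    return QLatin1String("Unknown");
#endif
}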
Note that the Q_WS_* macros are defined at compile time, but QSysInfo gives some run time details.
To extend gs's function to get the specific windows version at runtime, you can do
#ifdef Q_WS_WIN
switch(QSysInfo::windowsVersion())
{
case QSysInfo::WV_2000: return "Windows 2000";
case QSysInfo::WV_XP: return "Windows XP";
case QSysInfo::WV_VISTA: return "Windows Vista";
default: return "Windows";
}
#endif
and similar for Mac.
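For example, a hedged sketch of the Mac branch using QSysInfo::MacintoshVersion and two of the Qt 4.8-era MV_* values (the returned strings are illustrative):
#ifdef Q_OS_MAC
switch(QSysInfo::MacintoshVersion)
{
case QSysInfo::MV_SNOWLEOPARD: return "Mac OS X 10.6";
case QSysInfo::MV_LION: return "Mac OS X 10.7";
default: return "Mac OS X";
}
#endif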
If you are using Qt version 5.9 or above, use the QOperatingSystemVersion class shown below to retrieve the correct OS details; more on this can be found here. There is also a QSysInfo class which provides some additional functionality.
#ifdef Q_OS_WIN // Q_WS_WIN no longer exists in Qt 5; use Q_OS_WIN
#include <QOperatingSystemVersion>
// QOperatingSystemVersion cannot be used in a switch, so compare with >= instead
const auto osVersion = QOperatingSystemVersion::current();
if (osVersion >= QOperatingSystemVersion::Windows10) return "Windows 10";
if (osVersion >= QOperatingSystemVersion::Windows8) return "Windows 8";
if (osVersion >= QOperatingSystemVersion::Windows7) return "Windows 7";
return "Windows";
#endif
For Qt5 I use the following:
logging.info("##### System Information #####")
sysinfo = QtCore.QSysInfo()
logging.info("buildCpuArchitecture: " + sysinfo.buildCpuArchitecture())
logging.info("currentCpuArchitecture: " + sysinfo.currentCpuArchitecture())
logging.info("kernel type and version: " + sysinfo.kernelType() + " " + sysinfo.kernelVersion())
logging.info("product name and version: " + sysinfo.prettyProductName())
logging.info("#####")
Documentation: http://doc.qt.io/qt-5/qsysinfo.html
Here is part of my code to detect Windows or Mac at run time, and the version:
#include <QSysInfo>
#include <QOperatingSystemVersion>

auto OSInfo = QOperatingSystemVersion::current(); // must be defined before it is used
auto OSType = OSInfo.type();
if (OSType != QOperatingSystemVersion::Windows) // not a Windows OS
{
    return 0;
}
if (OSInfo < QOperatingSystemVersion::Windows7) // older than Windows 7
{
    return 0;
}