OpenCL's get_default_queue() always returns 0

I'm trying to get device-side enqueuing working in my OpenCL application, but no matter what I do, get_default_queue() returns 0.
I've set up a second command queue with the on-device, on-device-default and out-of-order execution properties enabled.
I've checked that the device's enqueue capabilities report support for device-side enqueuing.
I've checked that querying the device default queue from the main command queue returns the device command queue.
I'm building with the CL3.0 compile flag.
The kernel functions perfectly fine in every way except for this.
My code can be found here: https://github.com/labiraus/svo-tracer/blob/main/SvoTracer/SvoTracer.Kernel/EnqueueTest.cs
A stripped down version of the kernel:
kernel void test(global int *out) {
    int i = get_global_id(0);
    if (get_default_queue() != 0) {
        out[i] = 1;
    } else {
        out[i] = 2;
    }
}

In OpenCL 3.0 this became an optional feature. Whilst some devices will correctly return that device side enqueuing is not supported, it seems that the card I was testing on may not support it whilst reporting that it did.
Ultimately device-side enqueuing isn't functionality that can be relied upon to be implemented by OpenCL 3.0 compatible devices.
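Since the reported capability and the actual behaviour can disagree, it may help to check both before relying on device-side enqueue. The following is a minimal sketch in Python, not the code from the linked repository. It assumes the host has already queried CL_DEVICE_DEVICE_ENQUEUE_CAPABILITIES and read back the output buffer of the stripped-down kernel above; the function name and the PROBE_* constants are hypothetical.

```python
# Bit values of cl_device_device_enqueue_capabilities (OpenCL 3.0 headers).
CL_DEVICE_QUEUE_SUPPORTED = 1 << 0
CL_DEVICE_QUEUE_REPLACEABLE_DEFAULT = 1 << 1

# Values written by the stripped-down test kernel above (hypothetical names).
PROBE_QUEUE_PRESENT = 1  # get_default_queue() returned a non-zero queue
PROBE_QUEUE_NULL = 2     # get_default_queue() returned 0


def device_enqueue_usable(caps, probe_results):
    """Trust device-side enqueue only if it is both reported and observed."""
    reported = bool(caps & CL_DEVICE_QUEUE_SUPPORTED)
    results = list(probe_results)
    # An empty read-back proves nothing, so it counts as "not observed".
    observed = bool(results) and all(v == PROBE_QUEUE_PRESENT for v in results)
    return reported and observed
```

A device that reports support but whose probe kernel writes 2 everywhere would be treated as not supporting the feature, which matches the behaviour described above.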

Related

Device-side enqueue causes CL_OUT_OF_RESOURCES

I have a program utilizing OpenCL 2.0 because I want to take advantage of device-side enqueue. I have a test program that performs the following tasks on the host side:
1. Allocates 16 kilobytes of floating point memory on the device and zeros it out.
2. Builds the OpenCL program below, and creates a kernel of masterKernel()
3. Sets the first argument of masterKernel() (heap) to the allocated memory in step 1
4. Enqueues that masterKernel() via clEnqueueNDRangeKernel() with a work_dim of 1 and a global work size of 1. (So it only runs once, with get_global_id(0) always being zero)
5. Reads the memory back into the host and displays it.
Here is the OpenCL code:
//This function was stripped down to nothing for testing purposes.
kernel void childKernel(global float* heap)
{
}
//Enqueues the child kernel.
kernel void masterKernel(global float* heap)
{
    ndrange_t ndRange = ndrange_1D(16); //Arbitrary, could be any number.
    if(get_global_id(0) == 0)
    {
        enqueue_kernel(get_default_queue(), 0, ndRange,
                       ^{ childKernel(heap); });
    }
}
The program builds successfully. However, when I try to run masterKernel(), the call to enqueue_kernel() causes the host-side call to clEnqueueNDRangeKernel() to fail with an error code of CL_OUT_OF_RESOURCES. OpenCL's documentation says enqueue_kernel() should return CL_SUCCESS or CL_ENQUEUE_FAILURE depending on whether the block enqueues successfully or not. It does not say that clEnqueueNDRangeKernel() itself should fail. Here are some other things I've tried:
Commenting out the call to enqueue_kernel() causes the program to succeed.
Adding a line that sets heap[0] to any number causes the host-side program to reflect that change, so I know that it's not a problem with how I'm feeding the arguments in.
Modifying the if statement so that it reads something impossible like if(get_global_id(0) == 6000) still causes the error. This tells me that the error is not caused by enqueue_kernel() executing (I verified get_global_size(0) == 1), but merely that it exists in the program at all.
Modifying the if statement to if(0) does make the error not happen.
Making it so childKernel() actually does something does not make the error go away.
I am not really sure what to try next. I know my device supports OpenCL 2.0. My device is an AMD Radeon R9 380 graphics card. I do not have access to any other OpenCL 2.0 capable hardware to test it on.
I ended up figuring this one out. This issue happened because I did not create a device-side queue (one with the flags of CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE | CL_QUEUE_ON_DEVICE | CL_QUEUE_ON_DEVICE_DEFAULT).
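To make the flag combination above easier to check, here is a small Python sketch that validates a queue-properties bitfield before it is passed to the host API. The constant values are the ones from the OpenCL headers; the function name and the wording of the messages are hypothetical. It encodes two rules from the specification (CL_QUEUE_ON_DEVICE requires out-of-order mode, and CL_QUEUE_ON_DEVICE_DEFAULT requires CL_QUEUE_ON_DEVICE) plus the observation from this answer that get_default_queue() needs a default device queue to exist.

```python
# cl_command_queue_properties bits from the OpenCL headers.
CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE = 1 << 0
CL_QUEUE_PROFILING_ENABLE = 1 << 1
CL_QUEUE_ON_DEVICE = 1 << 2
CL_QUEUE_ON_DEVICE_DEFAULT = 1 << 3


def device_queue_flag_problems(props):
    """Return a list of reasons why props is not a default device queue."""
    problems = []
    on_device = bool(props & CL_QUEUE_ON_DEVICE)
    out_of_order = bool(props & CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE)
    is_default = bool(props & CL_QUEUE_ON_DEVICE_DEFAULT)
    if not on_device:
        problems.append("CL_QUEUE_ON_DEVICE is not set: this is a host queue")
    if on_device and not out_of_order:
        problems.append("CL_QUEUE_ON_DEVICE requires out-of-order exec mode")
    if is_default and not on_device:
        problems.append("CL_QUEUE_ON_DEVICE_DEFAULT requires CL_QUEUE_ON_DEVICE")
    if not is_default:
        problems.append("not the default queue for get_default_queue()")
    return problems
```

An empty list means the three flags named in the answer are all present; anything else points at the missing piece.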

How to do a kill(pid, SIGINT) in windows? [duplicate]

I have (in the past) written cross-platform (Windows/Unix) applications which, when started from the command line, handled a user-typed Ctrl-C combination in the same way (i.e. to terminate the application cleanly).
Is it possible on Windows to send a Ctrl-C/SIGINT/equivalent to a process from another (unrelated) process to request that it terminate cleanly (giving it an opportunity to tidy up resources etc.)?
I have done some research around this topic, which turned out to be more popular than I anticipated. KindDragon's reply was one of the pivotal points.
I wrote a longer blog post on the topic and created a working demo program, which demonstrates using this type of system to close a command line application in a couple of nice fashions. That post also lists external links that I used in my research.
In short, those demo programs do the following:
Start a program with a visible window using .Net, hide with pinvoke, run for 6 seconds, show with pinvoke, stop with .Net.
Start a program without a window using .Net, run for 6 seconds, stop by attaching console and issuing ConsoleCtrlEvent
Edit: The amended solution from @KindDragon for those who are interested in the code here and now. If you plan to start other programs after stopping the first one, you should re-enable CTRL+C handling, otherwise the next process will inherit the parent's disabled state and will not respond to CTRL+C.
// Requires: using System.Diagnostics; using System.Runtime.InteropServices;
[DllImport("kernel32.dll", SetLastError = true)]
static extern bool AttachConsole(uint dwProcessId);
[DllImport("kernel32.dll", SetLastError = true, ExactSpelling = true)]
static extern bool FreeConsole();
[DllImport("kernel32.dll")]
static extern bool SetConsoleCtrlHandler(ConsoleCtrlDelegate HandlerRoutine, bool Add);
delegate bool ConsoleCtrlDelegate(CtrlTypes CtrlType);
// Enumerated type for the control messages sent to the handler routine
enum CtrlTypes : uint
{
    CTRL_C_EVENT = 0,
    CTRL_BREAK_EVENT,
    CTRL_CLOSE_EVENT,
    CTRL_LOGOFF_EVENT = 5,
    CTRL_SHUTDOWN_EVENT
}
[DllImport("kernel32.dll")]
[return: MarshalAs(UnmanagedType.Bool)]
private static extern bool GenerateConsoleCtrlEvent(CtrlTypes dwCtrlEvent, uint dwProcessGroupId);
public void StopProgram(Process proc)
{
    //This does not require the console window to be visible.
    if (AttachConsole((uint)proc.Id))
    {
        // Disable Ctrl-C handling for our program
        SetConsoleCtrlHandler(null, true);
        GenerateConsoleCtrlEvent(CtrlTypes.CTRL_C_EVENT, 0);
        //Moved this command up on suggestion from Timothy Jannace (see comments below)
        FreeConsole();
        // Must wait here. If we don't and re-enable Ctrl-C
        // handling below too fast, we might terminate ourselves.
        proc.WaitForExit(2000);
        //Re-enable Ctrl-C handling or any subsequently started
        //programs will inherit the disabled state.
        SetConsoleCtrlHandler(null, false);
    }
}
Also, plan for a contingency solution in case AttachConsole() or the sent signal fails, for instance by sleeping and then doing this:
if (!proc.HasExited)
{
    try
    {
        proc.Kill();
    }
    catch (InvalidOperationException) { /* the process has already exited */ }
}
The closest that I've come to a solution is the SendSignal 3rd party app. The author lists source code and an executable. I've verified that it works under 64-bit windows (running as a 32-bit program, killing another 32-bit program), but I've not figured out how to embed the code into a windows program (either 32-bit or 64-bit).
How it works:
After much digging around in the debugger I discovered that the entry point that actually does the behavior associated with a signal like ctrl-break is kernel32!CtrlRoutine. The function had the same prototype as ThreadProc, so it can be used with CreateRemoteThread directly, without having to inject code. However, that's not an exported symbol! It's at different addresses (and even has different names) on different versions of Windows. What to do?
Here is the solution I finally came up with. I install a console ctrl handler for my app, then generate a ctrl-break signal for my app. When my handler gets called, I look back at the top of the stack to find out the parameters passed to kernel32!BaseThreadStart. I grab the first param, which is the desired start address of the thread, which is the address of kernel32!CtrlRoutine. Then I return from my handler, indicating that I have handled the signal and my app should not be terminated. Back in the main thread, I wait until the address of kernel32!CtrlRoutine has been retrieved. Once I've got it, I create a remote thread in the target process with the discovered start address. This causes the ctrl handlers in the target process to be evaluated as if ctrl-break had been pressed!
The nice thing is that only the target process is affected, and any process (even a windowed process) can be targeted. One downside is that my little app can't be used in a batch file, since it will kill it when it sends the ctrl-break event in order to discover the address of kernel32!CtrlRoutine.
(Precede it with start if running it in a batch file.)
I guess I'm a bit late on this question but I'll write something anyway for anyone having the same problem.
This is the same answer as I gave to this question.
My problem was that I'd like my application to be a GUI application but the processes executed should be run in the background without any interactive console window attached. I think this solution should also work when the parent process is a console process. You may have to remove the "CREATE_NO_WINDOW" flag though.
I managed to solve this using GenerateConsoleCtrlEvent() with a wrapper app. The tricky part is just that the documentation is not really clear on exactly how it can be used and the pitfalls with it.
My solution is based on what is described here. But that didn't really explain all the details either, and it contained an error, so here are the details on how to get it working.
1. Create a new helper application "Helper.exe". This application will sit between your application (parent) and the child process you want to be able to close. It will also create the actual child process. You must have this "middle man" process or GenerateConsoleCtrlEvent() will fail.
2. Use some kind of IPC mechanism to communicate from the parent to the helper process that the helper should close the child process. When the helper gets this event it calls "GenerateConsoleCtrlEvent(CTRL_BREAK_EVENT, 0)", which closes down itself and the child process. I used an event object for this myself, which the parent signals when it wants to cancel the child process.
3. To create your Helper.exe, create it with CREATE_NO_WINDOW and CREATE_NEW_PROCESS_GROUP. When creating the child process, create it with no flags (0), meaning it will inherit the console from its parent. Failing to do this will cause it to ignore the event.
It is very important that each step is done like this. I've been trying all different kinds of combinations but this combination is the only one that works. You can't send a CTRL_C event. It will return success but will be ignored by the process. CTRL_BREAK is the only one that works. This doesn't really matter, since they will both call ExitProcess() in the end.
You also can't call GenerateConsoleCtrlEvent() with the child's process ID as the process group ID, which would allow the helper process to continue living. This will fail as well.
I spent a whole day trying to get this working. This solution works for me but if anyone has anything else to add please do. I went all over the net finding lots of people with similar problems but no definite solution to the problem. How GenerateConsoleCtrlEvent() works is also a bit weird so if anyone knows more details on it please share.
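For comparison, here is a sketch of the process-group idea in Python. It is a different, simpler setup than the Helper.exe arrangement above and does not reproduce it: it assumes the parent and the child share a console, in which case the child can be started in its own process group and signalled directly. The answer above reports that targeting the child's group directly failed in its GUI-parent setup, so treat this as an illustration under that assumption. The names start_child and stop_child are hypothetical; the subprocess and signal calls are from the Python standard library. On non-Windows platforms the sketch falls back to SIGINT, and in both cases it kills the child if it does not exit in time.

```python
import signal
import subprocess
import sys


def start_child(cmd):
    """Start cmd so that it can later be interrupted on its own."""
    flags = 0
    if sys.platform == "win32":
        # A new process group lets a console control event target the child
        # without also hitting the parent.
        flags = subprocess.CREATE_NEW_PROCESS_GROUP
    return subprocess.Popen(cmd, creationflags=flags)


def stop_child(proc, timeout=5.0):
    """Ask proc to exit; kill it if it does not. True if it exited on request."""
    if proc.poll() is not None:
        return True
    if sys.platform == "win32":
        # Ctrl-C is disabled for processes created in a new process group,
        # so CTRL_BREAK_EVENT is the event that gets delivered.
        proc.send_signal(signal.CTRL_BREAK_EVENT)
    else:
        proc.send_signal(signal.SIGINT)
    try:
        proc.wait(timeout=timeout)
        return True
    except subprocess.TimeoutExpired:
        proc.kill()  # contingency: forceful termination
        proc.wait()
        return False
```

A caller would keep the object returned by start_child() and pass it to stop_child() when the child should shut down. The return value says whether the graceful request was enough.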
For some reason GenerateConsoleCtrlEvent() returns an error if you call it for another process, but you can attach to another console application and send the event to all processes sharing its console.
void SendControlC(int pid)
{
    AttachConsole(pid); // attach to process console
    SetConsoleCtrlHandler(NULL, TRUE); // disable Control+C handling for our app
    GenerateConsoleCtrlEvent(CTRL_C_EVENT, 0); // generate Control+C event
}
Edit:
For a GUI App, the "normal" way to handle this in Windows development would be to send a WM_CLOSE message to the process's main window.
For a console app, you need to use SetConsoleCtrlHandler to add a handler for CTRL_C_EVENT.
If the application doesn't honor that, you could call TerminateProcess.
Here is the code I use in my C++ app.
Positive points :
Works from console app
Works from Windows service
No delay required
Does not close the current app
Negative points :
The main console is lost and a new one is created (see FreeConsole)
The console switching gives strange results...
// Inspired from http://stackoverflow.com/a/15281070/1529139
// and http://stackoverflow.com/q/40059902/1529139
#include <windows.h>
bool signalCtrl(DWORD dwProcessId, DWORD dwCtrlEvent)
{
    bool success = false;
    DWORD thisConsoleId = GetCurrentProcessId();
    // Leave current console if it exists
    // (otherwise AttachConsole will return ERROR_ACCESS_DENIED)
    bool consoleDetached = (FreeConsole() != FALSE);
    if (AttachConsole(dwProcessId) != FALSE)
    {
        // Add a fake Ctrl-C handler to avoid being killed instantly in this console
        // WARNING: do not revert it or the current program will also be killed
        SetConsoleCtrlHandler(nullptr, true);
        success = (GenerateConsoleCtrlEvent(dwCtrlEvent, 0) != FALSE);
        FreeConsole();
    }
    if (consoleDetached)
    {
        // Create a new console if previous was deleted by OS
        if (AttachConsole(thisConsoleId) == FALSE)
        {
            int errorCode = GetLastError();
            if (errorCode == 31) // 31=ERROR_GEN_FAILURE
            {
                AllocConsole();
            }
        }
    }
    return success;
}
Usage example :
DWORD dwProcessId = ...;
if (signalCtrl(dwProcessId, CTRL_C_EVENT))
{
    cout << "Signal sent" << endl;
}
A solution that I have found from here is pretty simple if you have python 3.x available in your command line. First, save a file (ctrl_c.py) with the contents:
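import ctypes
import sys
kernel = ctypes.windll.kernel32
pid = int(sys.argv[1])
# Detach from our own console, then attach to the target's console
kernel.FreeConsole()
kernel.AttachConsole(pid)
# Ignore Ctrl-C in this process so the event does not interrupt us too
kernel.SetConsoleCtrlHandler(None, 1)
# 0 = CTRL_C_EVENT, sent to every process sharing the attached console
kernel.GenerateConsoleCtrlEvent(0, 0)
sys.exit(0)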
Then call:
python ctrl_c.py 12345
If that doesn't work, I recommend trying out the windows-kill project:
Choco: https://github.com/ElyDotDev/windows-kill
Node: https://github.com/ElyDotDev/node-windows-kill
void SendSIGINT( HANDLE hProcess )
{
    DWORD pid = GetProcessId(hProcess);
    FreeConsole();
    if (AttachConsole(pid))
    {
        // Disable Ctrl-C handling for our program
        SetConsoleCtrlHandler(NULL, true);
        GenerateConsoleCtrlEvent(CTRL_C_EVENT, 0); // SIGINT
        // Wait for the target to exit before re-enabling Ctrl-C handling;
        // re-enabling it too early can let the event terminate this process too.
        WaitForSingleObject(hProcess, 10000);
        // Re-enable Ctrl-C handling or any subsequently started
        // programs will inherit the disabled state.
        SetConsoleCtrlHandler(NULL, false);
    }
}
Thanks to jimhark's answer and other answers here, I found a way to do it in PowerShell:
$ProcessID = 1234
$MemberDefinition = '
    [DllImport("kernel32.dll")]public static extern bool FreeConsole();
    [DllImport("kernel32.dll")]public static extern bool AttachConsole(uint p);
    [DllImport("kernel32.dll")]public static extern bool GenerateConsoleCtrlEvent(uint e, uint p);
    public static void SendCtrlC(uint p) {
        FreeConsole();
        if (AttachConsole(p)) {
            GenerateConsoleCtrlEvent(0, p);
            FreeConsole();
        }
        AttachConsole(uint.MaxValue);
    }'
Add-Type -Name 'dummyName' -Namespace 'dummyNamespace' -MemberDefinition $MemberDefinition
[dummyNamespace.dummyName]::SendCtrlC($ProcessID)
What made this work was sending the event from GenerateConsoleCtrlEvent to the desired process group, rather than to all processes sharing the calling process's console, and then calling AttachConsole to reattach to the console of the current process's parent.
Yes. The https://github.com/ElyDotDev/windows-kill project does exactly what you want:
windows-kill -SIGINT 1234
It should be made crystal clear because at the moment it isn't.
There is a modified and compiled version of SendSignal to send Ctrl-C (by default it only sends Ctrl+Break). Here are some binaries:
(2014-3-7) : I built both 32-bit and 64-bit version with Ctrl-C, it's called SendSignalCtrlC.exe and you can download it at: https://dl.dropboxusercontent.com/u/49065779/sendsignalctrlc/x86/SendSignalCtrlC.exe https://dl.dropboxusercontent.com/u/49065779/sendsignalctrlc/x86_64/SendSignalCtrlC.exe -- Juraj Michalak
I have also mirrored those files just in case:
32-bit version: https://www.dropbox.com/s/r96jxglhkm4sjz2/SendSignalCtrlC.exe?dl=0
64-bit version: https://www.dropbox.com/s/hhe0io7mcgcle1c/SendSignalCtrlC64.exe?dl=0
Disclaimer: I didn't build those files. No modification was made to the compiled original files. The only platform tested is the 64-bit Windows 7. It is recommended to adapt the source available at http://www.latenighthacking.com/projects/2003/sendSignal/ and compile it yourself.
In Java, using JNA with the Kernel32.dll library, similar to the C++ solution. The CtrlCSender main method runs as a separate process, which simply attaches to the console of the target process and generates the Ctrl+C event. Because it runs separately without a console of its own, Ctrl+C handling does not need to be disabled and re-enabled.
CtrlCSender.java - Based on Nemo1024's and KindDragon's answers.
Given a known process ID, this console-less application attaches to the console of the targeted process and generates a CTRL+C event on it.
import com.sun.jna.platform.win32.Kernel32;
public class CtrlCSender {
    public static void main(String args[]) {
        int processId = Integer.parseInt(args[0]);
        Kernel32.INSTANCE.AttachConsole(processId);
        Kernel32.INSTANCE.GenerateConsoleCtrlEvent(Kernel32.CTRL_C_EVENT, 0);
    }
}
Main Application - Runs CtrlCSender as a separate console-less process
ProcessBuilder pb = new ProcessBuilder();
pb.command("javaw", "-cp", System.getProperty("java.class.path", "."), CtrlCSender.class.getName(), processId);
pb.redirectErrorStream();
pb.redirectOutput(ProcessBuilder.Redirect.INHERIT);
pb.redirectError(ProcessBuilder.Redirect.INHERIT);
Process ctrlCProcess = pb.start();
ctrlCProcess.waitFor();
I found all this too complicated and used SendKeys to send a CTRL-C keystroke to the command line window (i.e. cmd.exe window) as a workaround.
A friend of mine suggested a completely different way of solving the problem, and it worked for me. Use a VBScript like the one below. It starts an application, lets it run for 7 seconds and closes it using Ctrl+C.
'VBScript Example
Set WshShell = WScript.CreateObject("WScript.Shell")
WshShell.Run "notepad.exe"
WshShell.AppActivate "notepad"
WScript.Sleep 7000
WshShell.SendKeys "^C"
// Send [CTRL-C] to interrupt a batch file running in a Command Prompt window, even if the Command Prompt window is not visible,
// without bringing the Command Prompt window into focus.
// [CTRL-C] will have an effect on the batch file, but not on the Command Prompt window itself -- in other words,
// [CTRL-C] will not have the same visible effect on a Command Prompt window that isn't running a batch file at the moment
// as bringing a Command Prompt window that isn't running a batch file into focus and pressing [CTRL-C] on the keyboard.
ulong ulProcessId = 0UL;
// hwC = Find Command Prompt window HWND
GetWindowThreadProcessId (hwC, (LPDWORD) &ulProcessId);
AttachConsole ((DWORD) ulProcessId);
SetConsoleCtrlHandler (NULL, TRUE);
GenerateConsoleCtrlEvent (CTRL_C_EVENT, 0UL);
SetConsoleCtrlHandler (NULL, FALSE);
FreeConsole ();
SIGINT can be sent to a program using windows-kill, with the syntax windows-kill -SIGINT PID, where PID can be obtained with Microsoft's pslist.
Regarding catching SIGINTs, if your program is in Python then you can implement SIGINT processing/catching like in this solution.
Given a process ID or image name, a process can also be terminated from the command line. Note that the /F switch used below terminates the process forcefully rather than gracefully.
List all processes:
C:\>tasklist
To kill the process:
C:\>Taskkill /IM firefox.exe /F
or
C:\>Taskkill /PID 26356 /F
Details:
http://tweaks.com/windows/39559/kill-processes-from-command-prompt/

How to collect binary kernels to distribute proprietary code

I have code containing know-how that I would rather not distribute as source. One solution is to provide a set of pre-compiled kernels and choose the correct binary depending on the user's hardware.
How can I cover most users (AMD and Intel, as Nvidia can use CUDA code) with a minimum number of binaries and a minimum number of machines on which I have to run my offline compiler? Are there families of GPUs that can use the same binaries? The CUDA compiler can compile for different architectures; what about OpenCL? Binary compatibility doesn't seem to be well documented, but maybe someone has collected this data for themselves.
I know there's SPIR, but older hardware doesn't support it.
Here are the details of my implementation, in case someone finds this question and has got less far than I have. I made a tool that compiles the kernel to a file, then collected all these binaries into a C array to be included in the main application:
const char* binaries[] = {
    //kernels/HD Graphics 4000
    "\x62\x70\x6c\x69\x73\x74\x30\x30\xd4\x01\x02\x03"
    "\x04\x05\x06\x07\x08\x5f\x10\x0f\x63\x6c\x42\x69"
    "\x6e\x61\x72\x79\x56\x65\x72\x73\x69\x6f\x6e\x5c"
    ...
    "\x00\x00\x00\x00\x00\x00\x00\x09\x00\x00\x00\x00"
    "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x06\x47\xe0"
    ,
    //here more kernels
};
size_t binaries_sizes[] = {
    204998,
    205907,
    ...
};
And then I use the following code which iterates all the kernels (I didn't invent anything more clever than trial-and-error, choosing the first kernel that builds successfully, probably there's better solution):
int e3 = -1;
int i = 0;
while (e3 != CL_SUCCESS) {
    if (i == lenof(binaries)) {
        throw Error(); // no binary could be loaded and built for this device
    }
    program = clCreateProgramWithBinary(context, 1, &deviceIds[devIdx], &binaries_sizes[i],
                                        (const unsigned char**)&binaries[i],
                                        nullptr, &e3);
    if (e3 != CL_SUCCESS) {
        ++i;
        continue;
    }
    e3 = clBuildProgram(program, 1, &deviceIds[devIdx],
                        "", nullptr, nullptr);
    if (e3 != CL_SUCCESS) {
        clReleaseProgram(program); // discard the program that failed to build
    }
    ++i;
}
Unfortunately, there is no standard solution for your problem. OpenCL is platform-independent, and there is no standard way (apart from SPIR) to deal with this. Each vendor decides on a different compiler toolchain internally, and even this can change across versions of the same driver, or for different devices.
You could add some metadata to each kernel binary to identify which platform you compiled it for, which will save you the trial-and-error part (i.e., instead of just storing binaries and binaries_sizes, you can also store binary_platform and binary_device and then iterate through those arrays to see which binary you should load).
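The metadata lookup suggested above could look roughly like the following Python sketch. The KernelBinary record, its field names and the pick_binary function are hypothetical; the sketch assumes each binary was tagged at compile time with the platform name, device name and driver version reported by the machine that built it. Because a binary can stop loading after a driver update, a same-device match with a different driver is only a best guess, and the caller should still be prepared for the build step to fail.

```python
from typing import NamedTuple, Optional, Sequence


class KernelBinary(NamedTuple):
    platform: str  # platform name recorded when the binary was compiled
    device: str    # device name recorded when the binary was compiled
    driver: str    # driver version recorded when the binary was compiled
    blob: bytes    # the compiled program itself


def pick_binary(binaries: Sequence[KernelBinary], platform: str,
                device: str, driver: str) -> Optional[KernelBinary]:
    """Return the best-matching binary, or None if nothing matches the device."""
    same_device = None
    for entry in binaries:
        if entry.platform != platform or entry.device != device:
            continue
        if entry.driver == driver:
            return entry  # exact match: most likely to load
        if same_device is None:
            same_device = entry  # same hardware, different driver
    return same_device
```

If pick_binary returns None, falling back to the trial-and-error loop from the question remains an option.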
The best solution for you would be SPIR (or the new SPIRV), which are intermediate representations that can be then "re-compiled" by the OpenCL driver to the actual architecture instruction set.
If you store your binaries in SPIRV, and have access to/knowledge of some compiler magic, you can use a translator tool to get back the LLVM-IR and then compile down to other platforms, such as AMD or PTX, using the LLVM infrastructure (see https://github.com/KhronosGroup/SPIRV-LLVM)

heap memory release policy in Arduino

#include <Arduino.h>
#include "include/MainComponent.h"
/*
 Turns on an LED on for one second, then off for one second, repeatedly.
*/
MainComponent* mainComponent;
void setup()
{
   mainComponent = new MainComponent();
   mainComponent->beginComponent();
}
void loop()
{
   mainComponent->runComponent();
}
Is there any callback for releasing memory in Arduino (e.g. to call delete mainComponent)?
Or will this happen automatically when the loop ends?
What is the strategy to ensure that the memory allocated in this code snippet is freed?
Scenario: I want to access the object in both functions, so the object is declared at global scope and then instantiated in setup().
What happens when loop() terminates? Will mainComponent still remain in memory?
On an OS the answer would be no: the process terminates and its resources are deallocated.
So how can I achieve the above scenario on Arduino while ensuring that the memory is deallocated when the controller is switched off?
What is confusing you is that the main() function is hidden by the basic Arduino IDE. Your programs have a main() function just like on any other platform, and have the same lifecycle as when run on a computer with an OS. If you look under arduino___\hardware\cores\arduino, you will find a file main.cpp, which is compiled into your binaries:
int main(void)
{
    init();
    //...
    setup();
    for (;;) {
        loop();
        if (serialEventRun) serialEventRun();
    }
    return 0;
}
Looking at this file, you can see that although loop() returns, it is called again immediately, forever. Your program never exits. In general, your best pattern is to new objects once and never delete them, as you have done here. If you are new'ing and delete'ing objects repeatedly on a microcontroller, you are not thinking about lifecycles and resources wisely.
So:
"Is the new'd object deleted on return from loop()?" No, the program is still running and the object stays on the heap.
"What happens at power off? Is there a way to clean up?" The moment the supply voltage drops too low, the microcontroller stops executing instructions. Power supervisor circuitry should prevent the controller from doing anything erratic as the voltage drops. Once the voltage has drained completely, everything in RAM is lost. Without adding circuitry, you have no way to execute any cleanup at power off.
"Do I need to clean up?" No, at power up everything is reset to a known state. Operation cannot be affected by anything left behind in RAM (presuming you initialize all your variables).

OpenCL Profiling Info Unavailable for Marker Commands

Using AMD's APP OpenCL implementation with JOCL bindings, I'm trying to create a generic bracketing profiler using Java's automatic resource management. The basic idea is:
class Timer implements AutoCloseable {
    ...
    Timer() {
        ...
        clEnqueueMarker( commandQueue, startEvent );
    }
    public void close() {
        cl_event stopEvent = new cl_event();
        clEnqueueMarker( commandQueue, stopEvent );
        clFinish( commandQueue );
        ... calculate and output times ...
    }
}
My problem is that profiling information is not available for the marker command events (stopEvent and startEvent). This is despite a) setting CL_QUEUE_PROFILING_ENABLE on the command queue and b) flushing and waiting on the command queue and verifying that the stop and start events are CL_COMPLETE with no errors.
So my question is, is profiling supported on marker commands in AMD OpenCL? If not, is it explicitly disallowed by the spec (I found nothing to this effect)?
Thanks.
I've rechecked the spec and it seems to me that what you get is normal (though I've never paid much attention to that detail previously). In the section 5.12 about profiling, the standard states:
This section describes profiling of OpenCL functions that are enqueued
as commands to a command-queue. The specific functions being
referred to are: clEnqueue{Read|Write|Map}Buffer,
clEnqueue{Read|Write}BufferRect, clEnqueue{Read|Write|Map}Image,
clEnqueueUnmapMemObject, clEnqueueCopyBuffer, clEnqueueCopyBufferRect,
clEnqueueCopyImage, clEnqueueCopyImageToBuffer,
clEnqueueCopyBufferToImage, clEnqueueNDRangeKernel , clEnqueueTask and
clEnqueueNativeKernel.
So the clEnqueueMarker() function is not in the list, and I guess the CL_PROFILING_INFO_NOT_AVAILABLE value returned makes sense.
I just tried this and it seems to work now. Tested on Windows 10 with an AMD 7870 and on Linux with Nvidia's Titan Black and Titan X cards.
The OpenCL 1.2 spec still contains the paragraph @CaptainObvious quoted. The clEnqueueMarker function is still missing from it, but I can get profiling information without a problem.
The start and end times on marker events are always equal, which makes a lot of sense.
Btw. clEnqueueMarker is deprecated in OpenCL 1.2 and should be replaced with clEnqueueMarkerWithWaitList.
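On implementations where marker events do carry profiling data, the "calculate and output times" step of the bracketing timer reduces to a subtraction. A minimal Python sketch follows, assuming the two arguments are the device timestamps in nanoseconds read from the start and stop marker events (for example their CL_PROFILING_COMMAND_END values, which per the observation above equal the start values for markers). The function name is hypothetical.

```python
def bracket_elapsed_ms(start_marker_ns, stop_marker_ns):
    """Elapsed device time between two marker events, in milliseconds."""
    if stop_marker_ns < start_marker_ns:
        raise ValueError("stop marker precedes start marker")
    # OpenCL profiling counters are expressed in nanoseconds.
    return (stop_marker_ns - start_marker_ns) / 1e6
```

If the implementation returns CL_PROFILING_INFO_NOT_AVAILABLE for markers, as in the original question, there are no timestamps to feed into this calculation.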

Resources