I'm currently inspecting an execution of our software with the Nsight profiler in Visual Studio.
I'm wondering what triggers the "start" and "stop" of the ranges for the CPU and "GPU GPU 0 - G..." rows highlighted in the following image; more precisely, what triggers the switch from CPU frame 53 to CPU frame 54, and likewise from GPU GPU 0 frame 53 to GPU GPU 0 frame 54:
Second question: should I care?
The 'frame' range starts before and ends after I call the 'frame' method of the viewer in OSG.
I guess you could configure Nsight to use different events to detect the start and end of a rendering frame, but the default would be the final buffer swap or glFlush issued by OSG at the end of all rendering traversals.
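To make that concrete, here is a rough sketch of a typical OSG main loop (assumed usage, not Nsight's actual detection logic); the swap at the end of each viewer.frame() call is what closes frame N and opens frame N+1 by default:

// Rough sketch of a typical OSG render loop (assumed usage, not your code):
// each viewer.frame() runs the event/update/rendering traversals and ends
// with the buffer swap, which is the default frame boundary in the profiler.
#include <osgViewer/Viewer>

int main()
{
    osgViewer::Viewer viewer;
    // viewer.setSceneData(...);   // scene setup omitted in this sketch
    while (!viewer.done())
    {
        viewer.frame();   // traversals + final swap = one profiler frame
    }
    return 0;
}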
I have an audio playing app that runs on several distributed devices, each with their own clock. I am using QAudioOutput to play the same audio on each device, and UDP broadcast from a master device to synchronize the other devices, so far so good.
However, I am having a hard time getting an accurate picture of "what is playing now" from QAudioOutput. I am using the QAudioOutput bufferSize() and bytesFree() to estimate what audio frame is currently being fed to the sound system, but the bytesFree() value progresses in a "chunky" fashion, so that (bufferSize() - bytesFree()) / bytesPerFrame doesn't give the number of frames remaining in the buffer, but some smaller number that bounces around relative to it.
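Concretely, the estimate I am computing looks roughly like this (a simplified sketch, not my exact code):

// Simplified sketch of the estimate described above: how many frames are
// still queued in the QAudioOutput buffer. Because bytesFree() only advances
// in period-sized chunks, this is period-accurate at best, not frame-accurate.
#include <QAudioOutput>
#include <QAudioFormat>

qint64 framesQueued(const QAudioOutput &out)
{
    const qint64 bytesBuffered = out.bufferSize() - out.bytesFree();
    return bytesBuffered / out.format().bytesPerFrame();
}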
The result I am getting now is that when my "drift indicator" updates, it will run around 0 for several seconds, then get indications in the -15 to -35 ms range every few seconds for maybe 20 seconds, then a correcting jump of about +120ms. Although I could try to analyze this long term pattern and tease out the true drift rate (maybe a couple of milliseconds per minute), I'd much rather work with more direct information if it's available.
Is there any way to read the true number of frames remaining in the QAudioOutput buffer while it is playing a stream?
I realize I could minimize my problems by radically reducing the buffer size and feeding QAudioOutput with a high priority process, but I'd rather have a solution that uses longer buffers and isn't so fussy about what it runs on - target platforms vary from Windows 10 machines to Raspberry Pi Zero Ws, possibly to Android phones.
For a university project I'm currently working on a slurm supercomputer cluster and have written a number of C programs using MPI.
While profiling one of them I have observed that the time elapsed between an MPI_Send and an accompanying MPI_Recv operation is a mostly linear function of the message length. However, at around 32 MiB there is a sudden jump in latency from around 10ms to around 20ms. This happens both for two processes on the same node and two processes on separate nodes.
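For reference, the measurement is conceptually like the following (a simplified sketch, not the actual benchmark code):

// Simplified ping-pong timing sketch (assumed setup, not the real benchmark):
// rank 0 sends `len` bytes to rank 1 and the elapsed time around the matching
// send/recv pair is measured with MPI_Wtime(). Run with two ranks.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // message sizes from 1 MiB up to 64 MiB, crossing the 32 MiB mark
    for (int len = 1 << 20; len <= (1 << 26); len <<= 1) {
        std::vector<char> buf(len, 0);
        MPI_Barrier(MPI_COMM_WORLD);
        const double t0 = MPI_Wtime();
        if (rank == 0)
            MPI_Send(buf.data(), len, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(buf.data(), len, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        const double t1 = MPI_Wtime();
        if (rank == 1)
            std::printf("%d bytes: %.3f ms\n", len, (t1 - t0) * 1e3);
    }
    MPI_Finalize();
    return 0;
}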
Now I would like to find out why this happens. I'm aware that this is not an MPI intrinsic phenomenon but must be related to the underlying hardware setup, but I'm not sure where to begin looking for an explanation. What are some possible explanations for this and how could I check whether they apply in my case?
I notice a very significant performance difference (in FPS) between the CefSharp.WinForms.Example and CefSharp.Wpf.Example when using http://www.vsynctester.com
With VSync turned off both in my video card control panel and in the settings in CefExample Init():
settings.CefCommandLineArgs.Add("disable-gpu-vsync", "0");
For CefSharp.WinForms.Example I get around 500 FPS (steady)
For CefSharp.Wpf.Example I barely get 30 FPS
I understand that Wpf uses offscreen rendering, but what explains the big performance difference for the same web page?
I'm using a MacBook Pro with Win 8.1 with NVidia GT 750M Graphics.
CefSharp version is 8755a9496ffbd5f21bc6ef062bce687a22d83938 (March 1st 2015) and Cef version 3.2171.1979
The maximum rate in frames per second (fps) that CefRenderHandler::OnPaint will be called for a windowless browser. The actual fps may be lower if the browser cannot generate frames at the requested rate. The minimum value is 1 and the maximum value is 60 (default 30).
Direct quote from the CEF documentation see http://magpcss.org/ceforum/apidocs3/projects/%28default%29/_cef_browser_settings_t.html#windowless_frame_rate
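If the default 30 FPS cap is part of what you are seeing, it can be raised up to the documented maximum when the windowless browser is created. A minimal sketch against CEF's own C++ API, where this field lives (whether your CefSharp build already exposes it depends on the version):

// Sketch (CEF C++ API): raise the off-screen frame-rate cap described above
// from the default 30 to the documented maximum of 60.
#include "include/cef_browser.h"

CefBrowserSettings MakeOsrBrowserSettings()
{
    CefBrowserSettings settings;
    settings.windowless_frame_rate = 60;  // default 30, maximum 60
    // `settings` is then passed to CefBrowserHost::CreateBrowser(...).
    return settings;
}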
The entire process is more CPU bound than it is GPU bound. The slow part is that the bitmap buffer is copied in memory before it is displayed. CEF also supports DirtyRects, which is currently not implemented, so even a small graphical change forces a complete screen redraw.
When the upstream CEF issue 1006 is resolved we can then look at making some more improvements.
https://code.google.com/p/chromiumembedded/issues/detail?id=1006&q=label%3AOSR
I am using an AMD Radeon HD 7700 GPU. I want to use the following kernel to verify that the wavefront size is 64.
__kernel
void kernel__test_warpsize(
        __global T* dataSet,   /* T assumed to be defined via build options, e.g. -D T=uint */
        uint size
    )
{
    size_t idx = get_global_id(0);
    T value = dataSet[idx];
    if (idx < size - 1)
        dataSet[idx+1] = value;
}
In the main program, I pass an array with 128 elements. The initial values are dataSet[i]=i. After the kernel, I expect the following values:
dataSet[0]=0
dataSet[1]=0
dataSet[2]=1
...
dataSet[63]=62
dataSet[64]=63
dataSet[65]=63
dataSet[66]=65
...
dataSet[127]=126
However, I found that dataSet[65] is 64, not 63, which is not what I expected.
My understanding is that the first wavefront (64 threads) should change dataSet[64] to 63. So when the second wavefront is executed, thread #64 should get 63 and write it to dataSet[65]. But I see dataSet[65] is still 64. Why?
You are invoking undefined behaviour. If you wish to access memory that another work-item in the work-group is writing, you must use barriers.
In addition, assume that the GPU is running 2 wavefronts at once. Then the value you observed in dataSet[65] is indeed a valid result: the first wavefront has simply not completed yet.
An output where every item is 0 would also be a valid result according to the spec, because everything could just as well be executed completely serially, each work-item propagating the 0 forward. That's why you need the barriers.
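As a sketch of what that looks like for this particular kernel (with uint substituted for the placeholder T, and assuming the whole 128-element array is handled by a single work-group, since a barrier only synchronizes work-items within one work-group), the kernel source you would hand to clCreateProgramWithSource could be:

// Sketch: the same kernel with a work-group barrier between the read and
// the write, embedded as a C++ raw string for clCreateProgramWithSource.
static const char *kBarrierKernelSource = R"CLC(
__kernel void kernel__test_warpsize(__global uint* dataSet, uint size)
{
    size_t idx   = get_global_id(0);
    uint   value = dataSet[idx];

    /* every work-item in the work-group finishes its read
       before any work-item performs its write */
    barrier(CLK_GLOBAL_MEM_FENCE);

    if (idx < size - 1)
        dataSet[idx + 1] = value;
}
)CLC";

With the barrier in place the result is a deterministic one-element shift (dataSet[i+1] ends up holding the old dataSet[i]), rather than the wavefront-stepped pattern you expected, because no ordering between wavefronts is guaranteed.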
Based on your comments I edited this part:
Install CodeXL: http://developer.amd.com/tools-and-sdks/heterogeneous-computing/codexl/
Read the AMD Accelerated Parallel Processing OpenCL Programming Guide: http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf
Optimizing branching within a certain number of threads is only a small part of optimization. You should read up on how AMD hardware schedules the wavefronts within a workgroup and how it hides memory latency by interleaving the execution of wavefronts (within a workgroup). Branching also affects the execution of the whole workgroup, because the effective time to run it is basically the time to execute its single longest-running wavefront (the hardware cannot free local memory etc. until everything in the group is finished, so it cannot schedule another workgroup). But this also depends on your local memory and register usage, etc. To see what actually happens, just grab CodeXL and do a GPU profiling run. That will show exactly what happens on the device.
And even this applies only to the hardware of the current generation. That's why the concept is not in the OpenCL specification itself: these properties change a lot and depend heavily on the hardware.
But if you really want to know what the AMD wavefront size is, the answer is pretty much always 64 (see http://devgurus.amd.com/thread/159153 for a reference to their OpenCL programming guide). It's 64 for all GCN devices, which make up their whole current lineup. Maybe some older devices have 16 or 32, but right now everything is just 64 (for NVIDIA it is generally 32).
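If you want to check this on your own device rather than rely on the documentation, one option (assuming OpenCL 1.1 or later and a kernel and device you have already created) is to query the preferred work-group size multiple, which on AMD GCN parts reports the wavefront size and on NVIDIA the warp size:

// Sketch: query the preferred work-group size multiple for a built kernel.
// On AMD GCN hardware this reports the wavefront size (64); on NVIDIA it
// reports the warp size (32). Assumes `kernel` and `device` already exist.
#include <CL/cl.h>
#include <cstdio>

void printWavefrontSize(cl_kernel kernel, cl_device_id device)
{
    size_t multiple = 0;
    cl_int err = clGetKernelWorkGroupInfo(
        kernel, device, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
        sizeof(multiple), &multiple, NULL);
    if (err == CL_SUCCESS)
        std::printf("Preferred work-group size multiple: %zu\n", multiple);
}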
CUDA model - what is warp size?
I think this is a good answer which explains warps briefly.
But I am a bit confused about part of what sharpneli said:
"If you set it to 512 it will almost certainly fail, the spec doesn't require implementations to support arbitrary local sizes. In AMD HW the local size is exactly the wavefront size. Same applies to Nvidia. In general you don't really need to care how the implementation will handle it."
I think the local size, which means the work-group size, is set by the programmer. But when the kernel is executed, the work-group is subdivided by the hardware into warps (or wavefronts).
As the title says, when I run my OpenCL kernel the entire screen stops redrawing (the image displayed on the monitor remains the same until my program is done with its calculations; this is true even if I unplug the monitor from my notebook and plug it back in: the same image is always displayed), and the computer does not seem to react to mouse movement either: the cursor stays in the same position.
I am not sure why this happens. Could it be a bug in my program, or is this standard behaviour?
While searching on Google I found this thread on AMD's forum, where some people suggested it is normal, as the GPU cannot refresh the screen while it is busy with computations.
If this is true, is there still any way to work around this?
My kernel computation can take up to several minutes, and having my computer practically unusable for that whole time is really painful.
EDIT1: this is my current setup:
the graphics card is an ATI Mobility Radeon HD 5650 with 512 MB of memory and the latest Catalyst beta driver from AMD's website
the graphics is switchable (Intel integrated / ATI dedicated card), but I have disabled switching in the BIOS, because otherwise I could not get the driver working on Ubuntu
the operating system is Ubuntu 12.10 (64-bit), but this happens on Windows 7 (64-bit) as well
I have my monitor plugged in via HDMI (but the notebook screen freezes too, so this should not be a problem)
EDIT2: so after a day of playing with my code, I took the advice from your responses and changed my algorithm to something like this (in pseudo code):
for (cl_ulong chunk = 0; chunk < num_chunks; chunk += chunk_size)
{
    /* set kernel arguments that are different for each chunk */
    clSetKernelArg(/* ... */);

    /* schedule kernel for next execution */
    clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL, &global_work_size,
                           NULL, 0, NULL, NULL);

    /* read out the results from the kernel and append them to the output array on the host */
    clEnqueueReadBuffer(cmd_queue, of_buf, CL_TRUE, 0, chunk_size,
                        output + chunk, 0, NULL, NULL);
}
So now I split the whole workload on the host and send it to the GPU in chunks. For each chunk of data I enqueue a new kernel, and the results that I get from it are appended to the output array at the correct offset.
Is this how you meant the calculation should be divided?
This seems to be the way to remedy the freeze problem, and as a bonus I am now able to process data much larger than the available GPU memory, but I still have to make some proper performance measurements to see what a good chunk size is...
Whenever a GPU is running an OpenCL kernel it is completely dedicated to OpenCL. Some modern NVIDIA GPUs are an exception (I think from the GeForce GTX 500 series onwards): they can run multiple kernels concurrently if those kernels do not use all available compute units.
Your solutions are to divide your calculation into multiple short kernel calls, which is the best all-round solution since it will work even on single-GPU machines, or to invest in a cheap GPU for driving your display.
If you are going to run long kernels on your GPUs then you must disable timeout detection and recovery for GPUs, or make the timeout delay longer than the maximum kernel runtime (better, as bugs can still be caught); see here for how to do this.
Every time I have had a display freeze or a "Display driver stopped responding and has recovered" error, it has been due to a bug. It can freeze the whole system, and the only thing I can do is reset. Instead, I now develop on the CPU first. This never crashes my whole system, and it is easier to debug this way as well, since I can use printf. Once I get my code working bug-free on the CPU I try it on the GPU.
I am new to OpenCL and encountered a similar problem. I found a short calculation works OK, but a longer one freezes the mouse cursor. For my problem, Windows leaves a yellow triangle in the tray area and puts a message in the event log about "Display driver stopped responding and has recovered". The solution I found is to break up the calculation into small parts that take no more than a couple of seconds each. These run back to back, yet apparently give the video driver enough breathing room to keep it happy. If I set global_work_size to a value high enough to maximize throughput, the video response is painfully slow, but the driver restart/mouse freeze problem never occurs.