Memory Traces from Intel VTune

Is it possible to extract memory trace information along with instruction counts from Intel VTune? If yes, can you please give me an idea of how to perform this operation?
Thanks

It used to be possible to get PEBS information from Precise Events with iPTU, which runs on top of the VTune driver. As far as I know, iPTU support was discontinued for Ivy Bridge and later, but you can still collect the trace on Sandy Bridge and earlier.
http://software.intel.com/en-us/articles/use-intelr-performance-tuning-utility-to-view-result-from-vtunetm-performance-analyzer
I don't remember exactly how to do it, but there are command-line report tools that can output a memory reference report.

Related

#OpenCL# clBuildProgram failed with error code -5

I ran into a problem when using clBuildProgram() on a GTX 750. The kernel failed to build with error code -5 (CL_OUT_OF_RESOURCES) and an empty build log.
One possible workaround is adding '-cl-nv-verbose' as a build option to clBuildProgram(). However, it doesn't work for all kernels.
Based on that, I tried another option, '-cl-opt-disable'. It also works for some kernels but not others.
Then I got confused:
I cannot find the real cause of the error.
Why do different build options matter for some kernels but not others?
The error seems to be architecture-dependent, since the same OpenCL code executes successfully on the GTX 750 but fails on the Tesla P100.
Does anyone have ideas?
Possible reasons I can think of:
Running out of registers. This happens if you have a lot of (private) variables in your kernel code, especially arrays. Each core only has a certain amount of registers available (architecture dependent), and it may not be possible for the compiler to "spill" them to global memory. If this is the problem, you can try to rearrange your code so your variables have more limited scope, or you can try to move some arrays to local memory (bearing in mind this is shared between work items in a group, and also limited in size). A good GPU profiler/code analysis tool should be able to tell you how much register pressure there is, so if you've got the kernel working on some hardware, you should be able to find out register pressure for that, and draw conclusions for other hardware too.
Code size itself. I didn't think this should be much of a problem anymore on modern GPUs, but it might be possible if you have truly gigantic kernels.
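In either case, the first step is to get whatever the compiler will tell you. Here is a minimal sketch of retrieving the build log after a failed clBuildProgram(), passing the '-cl-nv-verbose' option mentioned in the question; the cl_program and cl_device_id are assumed to already exist.

    #include <CL/cl.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Build the program and, on failure, print the device build log to stderr. */
    static void build_with_log(cl_program program, cl_device_id device)
    {
        cl_int err = clBuildProgram(program, 1, &device, "-cl-nv-verbose", NULL, NULL);
        if (err != CL_SUCCESS) {
            size_t log_size = 0;
            clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
            char *log = (char *)malloc(log_size + 1);
            clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
            log[log_size] = '\0';
            fprintf(stderr, "clBuildProgram failed (%d); build log:\n%s\n", err, log);
            free(log);
        }
    }

On NVIDIA implementations the verbose log, when it is produced at all, is where register and local-memory usage shows up.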

Should I release the buffers within IMFSample before I release the IMFSample object?

I can't find any information about a requirement to release the buffers within an IMFSample before releasing the IMFSample itself. If I just release the IMFSample, does that automatically release its buffers? I'm writing a video player app and I'm receiving a sample from IMFSourceReader::ReadSample. While the code runs I see a small increase in memory usage in VS2017, and I'm not sure if it's a leak yet. The code I'm using is based on the sample code in this post:
Media Foundation webcam video H264 encode/decode produces artifacts when played back
I found the IMFSample::RemoveAllBuffers method, which may or may not release the buffers; the documentation doesn't specify. Maybe this needs to be called before releasing the IMFSample? https://msdn.microsoft.com/en-us/library/windows/desktop/ms703108(v=vs.85).aspx
(I also ran across another related post in my research, but I don't think it applies to my question:)
Should I release the returned IMFSample of an internally allocated MFT output buffer?
Thanks!
Regular COM interface pointer management rules apply: you just release the IMFSample pointer and that's it.
In certain cases your release of a sample pointer does not lead to actual freeing of memory because samples might be pooled: the released sample object is returned to its parent pool and is ready for reuse.
Just to echo what Roman said: if you only handle an IMFSample, you just need to release that IMFSample. If you call AddBuffer on an IMFSample, you have to remove the buffer yourself (I agree it's not clear, but if the buffers were removed automatically at release time, that would create problems for the component that handles the IMFSample, because it would no longer find its buffers). That said, this design can cause problems if nobody calls RemoveAllBuffers at the right time.
Different components are not supposed to call AddBuffer on the same IMFSample, but it is possible to do so.
This is bad design for memory management; you have to deal with it.
Last but not least.
According to Microsoft you have to release the buffers (whether before or afterwards doesn't matter).
See: Decode the audio
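To make the pattern from the answers concrete, here is a minimal sketch of reading and releasing one sample from an IMFSourceReader; error handling is trimmed, and pReader is assumed to be an already-configured reader for a video stream.

    #include <mfapi.h>
    #include <mfreadwrite.h>

    void ReadOneSample(IMFSourceReader *pReader)
    {
        DWORD streamIndex = 0, streamFlags = 0;
        LONGLONG timestamp = 0;
        IMFSample *pSample = nullptr;

        HRESULT hr = pReader->ReadSample(MF_SOURCE_READER_FIRST_VIDEO_STREAM, 0,
                                         &streamIndex, &streamFlags, &timestamp, &pSample);
        if (SUCCEEDED(hr) && pSample)
        {
            IMFMediaBuffer *pBuffer = nullptr;
            if (SUCCEEDED(pSample->ConvertToContiguousBuffer(&pBuffer)))
            {
                // ... lock and consume the buffer here ...
                pBuffer->Release();   // release only the reference we obtained ourselves
            }
            // Releasing the sample is enough; per the answers above, RemoveAllBuffers
            // is only needed if you added buffers to the sample yourself.
            pSample->Release();
        }
    }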

Methods to discourage reverse engineering of an OpenCL kernel

I am preparing my OpenCL-accelerated toolkit for release. Currently, I compile my OpenCL kernels into binaries targeted for a particular card.
Are there other ways of discouraging reverse engineering? Right now, I have many OpenCL binaries in my release folder, one for each kernel. Would it be better to splice these binaries into one single binary, or even add them into the host binary and somehow read them in using a special offset?
OpenCL 2.0 and SPIR-V can be used for this, but they are not available on all platforms yet.
Encrypt the binaries. Keep the keys on a server and have clients request them at time of use. Of course the keys should be protected too (using a variable value such as the server time, maybe). Then decrypt on the client and use the result as the binary kernel.
I'm not an encryption pro, but I would apply multiple algorithms multiple times to make it harder, since any single one might be crackable within a few months (roughly the lifetime of one version of your GPGPU software, for example). Even a simple scheme of your own, such as reversing the bit order of all the data (the 1st bit goes to the nth position, the nth goes to the 1st), should look hard enough to deter casual attackers. (A sketch combining this with the question's idea of embedding the binaries in the host executable appears after the suggestions below.)
Warning: some profiling tools can capture the kernel code at run time, so you could add many (maybe hundreds of) trivial kernels, at no performance penalty, to hide the real one in a crowded timeline; or you could disable profiling in the build options; or you could add a deliberate error (some broken events in the queues, say) and then restart, so a profiler cannot attach.
Maybe you could obfuscate the final C99 kernel source so it becomes unreadable to humans. Anyone who can read it anyway doesn't need to hack anything in the first place.
Maybe most effectively, do nothing technical at all: just secure the rights to your original algorithm and state that in a text file, so people can look but won't dare copy it for money.
If the kernel can be rewritten as an "interpreter" without a performance penalty, you can send bytecode from the server to the client. When someone inspects it in a profiler, they see only the interpreter code, not the real working algorithm, because the algorithm arrives from the server as "data" (a buffer). For example, c = a + b becomes a chain of if(...) / else if(...) / switch(case ...) dispatches and has no meaning without the data feed.
On top of all this, you can buy time against would-be reverse engineers by picking variable names that trigger their "selective perception" so they can't focus for a while, and develop a meaner version in the meantime. For example, c = a + b becomes bulletsToDevilsEar = breakDevilsLeg + goodGame.
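Here is the promised sketch of the question's embed-in-the-host-binary idea combined with a simple descramble step. The byte array kKernelBlob, its size, and the XOR key are hypothetical placeholders; in a real build you would generate them offline from the clGetProgramInfo(CL_PROGRAM_BINARIES) output.

    #include <CL/cl.h>
    #include <stdlib.h>
    #include <string.h>

    extern const unsigned char kKernelBlob[];   /* scrambled kernel binary linked into the host */
    extern const size_t        kKernelBlobSize;

    static cl_program load_embedded_kernel(cl_context ctx, cl_device_id dev)
    {
        /* Descramble in memory only; the trivial XOR stands in for whatever scheme you pick. */
        unsigned char *plain = (unsigned char *)malloc(kKernelBlobSize);
        for (size_t i = 0; i < kKernelBlobSize; ++i)
            plain[i] = kKernelBlob[i] ^ (unsigned char)(0x5A + i);

        const unsigned char *bins[] = { plain };
        size_t lens[] = { kKernelBlobSize };
        cl_int binStatus = CL_SUCCESS, err = CL_SUCCESS;
        cl_program prog = clCreateProgramWithBinary(ctx, 1, &dev, lens, bins, &binStatus, &err);

        memset(plain, 0, kKernelBlobSize);      /* don't leave the descrambled copy around */
        free(plain);
        if (err == CL_SUCCESS)
            clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        return prog;
    }

This only raises the bar, of course: the descrambled binary still exists in process memory while the program is being created.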

How to plot graph in Qt using the data received from serial port?

I have successfully communicated with one of my ARM7 boards through a serial port in Qt using qextserialport.
Now I have the data with me, and I want to use it to plot a graph (a real-time plot). Can anybody tell me how to do it?
If possible, please provide a small example.
Qwt offers several optimizations for implementing real-time plots; for example, unlike other Qt plot packages, it offers incremental painting.
You can check the oscilloscope example to see what is possible with almost no CPU usage.
Having to repaint from scratch for every incoming sample is the worst case, but even then there are strategies to speed up the repaints. Which optimizations are possible depends on your specific situation and can't be answered in general.
But if you need support for the Qwt library, it doesn't make much sense to ask anywhere else besides the official Qwt support channels you can find on the Qwt project page!
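For a rough idea of the wiring involved, here is a minimal sketch using Qwt 6, assuming the serial object can be treated as a QIODevice (QextSerialPort derives from it) and that the board sends one numeric value per line; the parsing is a placeholder for whatever your protocol really is.

    #include <QObject>
    #include <QIODevice>
    #include <QString>
    #include <QVector>
    #include <QPointF>
    #include <qwt_plot.h>
    #include <qwt_plot_curve.h>

    class LivePlot : public QObject
    {
        Q_OBJECT
    public:
        LivePlot(QIODevice *port, QwtPlot *plot)
            : m_port(port), m_plot(plot), m_curve(new QwtPlotCurve("serial data"))
        {
            m_curve->attach(m_plot);
            connect(m_port, &QIODevice::readyRead, this, &LivePlot::onData);
        }

    private slots:
        void onData()
        {
            // Append one point per received line, then trigger a repaint.
            while (m_port->canReadLine()) {
                bool ok = false;
                const double y = QString::fromLatin1(m_port->readLine()).trimmed().toDouble(&ok);
                if (ok)
                    m_points.append(QPointF(m_points.size(), y));
            }
            m_curve->setSamples(m_points);  // a real application would cap or scroll this buffer
            m_plot->replot();
        }

    private:
        QIODevice *m_port;
        QwtPlot *m_plot;
        QwtPlotCurve *m_curve;
        QVector<QPointF> m_points;
    };

The Qwt oscilloscope example mentioned above replaces the full setSamples()/replot() cycle with incremental painting, which is what keeps the CPU usage low.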
I think what you are looking for is this:
http://www.workslikeclockwork.com/index.php/components/qt-plotting-widget/

cudaEventSynchronize behavior under different circumstances

I have a question regarding the CUDA call cudaEventSynchronize.
AFAIK, it actively polls the event, thus consuming CPU cycles. If I wanted to make it block so the CPU can be yielded, as I can do with kernel executions, how could I do it?
More specifically, what would be the expected behaviour under:
using CUDA_LAUNCH_BLOCKING=1 env variable.
using cudaDeviceScheduleBlockingSync
using cudaDeviceScheduleYield
I have been experiencing strange behaviour and need some help to elucidate this. NVIDIA is very reluctant to share information on these specific technical aspects... I suppose the implementation details must be kept secret.
Thanks in advance,
Jose.
If you want cudaEventSynchronize to use blocking synchronization, then you will need to create your event using
cudaError_t cudaEventCreateWithFlags(cudaEvent_t *event, unsigned int flags)
and pass cudaEventBlockingSync as the flag.
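A minimal sketch of that approach; the kernel launch is a placeholder, and the commented-out cudaSetDeviceFlags call shows where the cudaDeviceScheduleBlockingSync flag from the question would plug in (it has to be set before the CUDA context is created).

    #include <cuda_runtime.h>

    void wait_without_spinning()
    {
        // Optional, affects all synchronization on this device; must run before the
        // CUDA context is initialized:
        // cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);

        cudaEvent_t done;
        cudaEventCreateWithFlags(&done, cudaEventBlockingSync);

        // kernel<<<grid, block>>>(...);   // enqueue work on the default stream
        cudaEventRecord(done, 0);          // record the event after the work

        // With cudaEventBlockingSync the calling thread sleeps on the event
        // instead of busy-waiting.
        cudaEventSynchronize(done);
        cudaEventDestroy(done);
    }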
