pebble bitmap xor or mask? - pebble-sdk

On the Pebble watch I am trying to overwrite a bitmap layer with text so the text is written white over black areas and black over white areas.
In other environments I would do this with an XOR operation, or create a mask and perform a couple of writes after masking out what I don't want overwritten.
I don't see an XOR graphics operator or a mask operator in the Pebble graphics library.
How can this be done?
I'm using C and CloudPebble.
Lynd

Here is a routine that does it. The text layer must be 'under' the graphic layer; then use layer_set_update_proc() to set the graphic layer's update procedure to the routine below.
static void bitmapLayerUpdate(struct Layer *layer, GContext *ctx) {
  const GBitmap *graphic = bitmap_layer_get_bitmap((BitmapLayer *)layer);
  GBitmap *framebuffer = graphics_capture_frame_buffer(ctx);
  if (framebuffer == NULL) {
    APP_LOG(APP_LOG_LEVEL_DEBUG, "capture frame buffer failed!!");
    return; // nothing to draw into
  }
  int height = graphic->bounds.size.h;
  for (int yindex = 0; yindex < height; yindex++) {
    for (unsigned int xindex = 0; xindex < graphic->row_size_bytes; xindex += 4) {
      // XOR 32 bits of the bitmap into the framebuffer at a time
      uint32_t *bfr = (uint32_t *)((uint8_t *)framebuffer->addr + (yindex * framebuffer->row_size_bytes) + xindex);
      uint32_t *bitmap = (uint32_t *)((uint8_t *)graphic->addr + (yindex * graphic->row_size_bytes) + xindex);
      *bfr = *bitmap ^ *bfr;
    }
  }
  graphics_release_frame_buffer(ctx, framebuffer);
}
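To wire it up, add the text layer first so it sits 'under' the bitmap layer, then install the update procedure (a minimal sketch; window, s_text_layer and s_bitmap_layer are assumed names from your own code):

// In the window load handler: children added later draw on top, so the
// bitmap layer's update proc sees the text already rendered in the
// framebuffer and can XOR the bitmap over it.
Layer *root = window_get_root_layer(window);
layer_add_child(root, text_layer_get_layer(s_text_layer));
layer_add_child(root, bitmap_layer_get_layer(s_bitmap_layer));
layer_set_update_proc(bitmap_layer_get_layer(s_bitmap_layer), bitmapLayerUpdate);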

Related

Manual loop unrolling with known maximum size

Please take a look at this code in an OpenCL kernel:
uint point_color = 4278190080; // 0xFF000000: alpha forced to opaque
float point_percent = 1.0f;
float near_pixel_size = (...);
float far_pixel_size = (...);
float delta_pixel_size = far_pixel_size - near_pixel_size;
float3 near = (...);
float3 far = (...);
float3 direction = normalize(far - near);
size_t point_position = (...) + 10;
for (size_t p = 0; p < point_count; p++, point_position += 4)
{
    float3 point = (float3)(point_list[point_position], point_list[point_position + 1], point_list[point_position + 2]);
    float projection = dot(point - near, direction);
    float3 projected = near + direction * projection;
    float rejection_length = distance(point, projected);
    float percent = projection / segment_length;
    float pixel_size = near_pixel_size + percent * delta_pixel_size;
    bool is_candidate = (pixel_size > rejection_length && point_percent > percent);
    point_color = (is_candidate ? (uint)point_list[point_position + 3] | 4278190080 : point_color);
    point_percent = (is_candidate ? percent : point_percent);
}
This code attempts to find the point in a list that is nearest to the line segment between far and near, assigning its color to point_color and its "percentual distance" to point_percent. (Incidentally, the code seems to be OK.)
The number of elements specified by point_count is variable, so I cannot assume too much about it, save for one thing: point_count will always be less than or equal to 8. That's a fixed fact in my code and data.
I would like to unroll this loop manually, and I'm afraid I will need to use lots of
value = (point_count < constant ? new_value : value)
for all lines in it. In your experience, will such a strategy increase performance in my kernel?
And yes, I know, I should be performing some benchmarking by myself; I just wanted to ask someone with lots of experience in OpenCL before actually attempting this on my own.
Most OpenCL drivers (that I'm familiar with, at least) support the use of #pragma unroll to unroll loops at compile time. Simply use it like so:
#pragma unroll
for (int i = 0; i < 4; i++) {
    /* ... */
}
It's effectively the same as unrolling it manually, with none of the effort. In your case, this would probably look more like:
if (point_count == 1) {
    /* ... */
} else if (point_count == 2) {
    #pragma unroll
    for (int i = 0; i < 2; i++) { /* ... */ }
} else if (point_count == 3) {
    #pragma unroll
    for (int i = 0; i < 3; i++) { /* ... */ }
}
I can't say for certain whether there will be an improvement, but there's one way to find out. If point_count is constant across the local work group, for example, it might improve performance, but if it's completely variable, this might actually make things worse.
You can read more about it here.
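If you do go the fully predicated route the question describes, the pattern looks like this on a reduced, self-contained example (a sketch only; max_of_8 and its inputs are illustrative, not part of the question's kernel):

#include <math.h>
#include <stddef.h>

/* The "value = (cond ? new_value : value)" pattern, manually unrolled
 * for a known maximum of 8 elements. Steps whose index is >= count
 * execute but leave the accumulator unchanged, so no branching on
 * count is needed. */
float max_of_8(const float *v, size_t count) /* count <= 8 */
{
    float best = -INFINITY;
    best = (count > 0 && v[0] > best) ? v[0] : best;
    best = (count > 1 && v[1] > best) ? v[1] : best;
    best = (count > 2 && v[2] > best) ? v[2] : best;
    best = (count > 3 && v[3] > best) ? v[3] : best;
    best = (count > 4 && v[4] > best) ? v[4] : best;
    best = (count > 5 && v[5] > best) ? v[5] : best;
    best = (count > 6 && v[6] > best) ? v[6] : best;
    best = (count > 7 && v[7] > best) ? v[7] : best;
    return best;
}

Applied to the question's loop, the same transformation would predicate the point_color and point_percent updates on p < point_count.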

CUDA streams are blocking despite Async

I'm working on a real-time video stream that I try to process with a GeForce GTX 960M. (Windows 10, VS 2013, CUDA 8.0)
Each frame has to be captured and lightly blurred, and whenever I can, I need to do some hard-work calculations on the 10 latest frames.
So I need to capture ALL the frames at 30 fps, and I expect to get the hard-work result at 5 fps.
My problem is that I cannot keep the capture running at the right pace: it seems that the hard-work calculation slows down the capture of frames, either at the CPU level or at the GPU level. I miss some frames...
I tried many solutions. None worked:
I tried to set up jobs on 2 streams (image below):
the host gets a frame
First stream (called Stream2): cudaMemcpyAsync copies the frame to the Device. Then a first kernel does the basic blurring calculations. (In the attached image, blurring appears as a short slot at 3.07 s and 3.085 s. And then nothing... until the big part has finished.)
the host checks whether the second stream is "available" thanks to a CudaEvent, and launches it if possible. In practice, the stream is available on about half of the tries.
Second stream (called Stream4): starts the hard-work calculations in a kernel (kernelCalcul_W2), outputs the result, and records an Event.
[NSight capture screenshot]
In practice, I wrote:
cudaStream_t sHigh, sLow;
cudaStreamCreateWithPriority(&sHigh, cudaStreamNonBlocking, priority_high);
cudaStreamCreateWithPriority(&sLow, cudaStreamNonBlocking, priority_low);
cudaEvent_t event_1;
cudaEventCreate(&event_1);

if (frame has arrived)
{
    cudaMemcpyAsync(..., sHigh); // HtoD, to upload images to the GPU
    blur_Image <<<..., sHigh>>> (...);
    if (cudaEventQuery(event_1) == cudaSuccess) hard_work(sLow);
    else printf("Event 1 not ready\n");
}

void hard_work(cudaStream_t sLow_)
{
    kernelCalcul_W2<<<..., sLow_>>> (...);
    cudaMemcpyAsync(...the result..., sLow_); // DtoH
    cudaEventRecord(event_1, sLow_);
}
I tried to use only one stream. It's the same code as above, but changing one parameter when launching hard_work.
host gets a frame
Stream: cudaMemcpyAsync copies the frame to the Device. Then the kernel does the basic blurring calculations. Then, if the CudaEvent Event_1 is ok, I launch the hard-work, and I record an Event_1 to get the status on the next round.
In practice, the stream is ALWAYS available: I never fall into the "else" part.
This way, while the hard-work is running, I expected to "buffer" all the frames to copy, and not to lose any. But I do lose some: it turns out that each time I get a frame and copy it, Event_1 seems ok, so I launch the hard-work, and I only get the next frame very late.
I tried to put the two streams in two different threads (in C). Not better (even worse).
So the question is: how to ensure that the first stream captures ALL frames?
I really have the feeling that the different streams block the CPU.
I display the images with OpenGL. Would it interfere?
Any idea of ways to improve this?
Thanks a lot!
EDIT:
As requested, here is an MCVE.
There is a parameter you can tune (#define ADJUST) to see what's happening. Basically, the main procedure sends CUDA requests in Async mode, but it seems to block the main thread. As you will see in the image, I have "memory access" (i.e. images captured) every 30 ms, except when the hard-work is running (then I just don't get images).
Last detail: I'm using CUDA 7.5 to run this. I tried to install 8.0, but apparently the compiler is still 7.5.
#define _USE_MATH_DEFINES 1
#define _CRT_SECURE_NO_WARNINGS 1
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <Windows.h>
#define ADJUST 400
// Adjusting this parameter may make the problem occur.
// Too high => the watchdog will probably stop the kernel.
// Too low  => the kernel will probably run smoothly.
unsigned short * images_as_Unsigned_in_Host;
unsigned short * Images_as_Unsigned_in_Device;
unsigned short * camera;
float * images_as_Output_in_Host;
float * Images_as_Float_in_Device;
float * imageOutput_in_Device;
unsigned short imageWidth, imageHeight, totNbOfImages, imageSlot;
unsigned long imagePixelSize;
unsigned short lastImageFromCamera;
cudaStream_t s1, s2;
cudaEvent_t event_2;
clock_t timeRef;
// Basically, in the middle of the image, I average the values. I removed the logic behind to make it simpler.
// This kernel runs fast, and that's the point.
__global__ void blurImage(unsigned short * Images_as_Unsigned_in_Device_, float * Images_as_Float_in_Device_, unsigned short imageWidth_,
    unsigned long imagePixelSize_, short blur_distance)
{
    // we start from 'blur_distance' from the edge
    // p0 is the point we will calculate. p is a pointer which will move around for the average
    unsigned long p0 = (threadIdx.x + blur_distance) + (blockIdx.x + blur_distance) * imageWidth_;
    unsigned long p = p0;
    unsigned short * us;
    if (p >= imagePixelSize_) return;
    unsigned long tot = 0;
    short a, b, n, k;
    k = 0;
    // p starts from the top edge and will move to the bottom-right
    p -= blur_distance + blur_distance * imageWidth_;
    us = Images_as_Unsigned_in_Device_ + p;
    for (a = 2 * blur_distance; a >= 0; a--)
    {
        for (b = 2 * blur_distance; b >= 0; b--)
        {
            n = *us;
            if (n > 0) { tot += n; k++; }
            us++;
        }
        us += imageWidth_ - 2 * blur_distance - 1;
    }
    if (k > 0) Images_as_Float_in_Device_[p0] = (float)tot / (float)k;
    else Images_as_Float_in_Device_[p0] = 128.f;
}
__global__ void kernelCalcul_W2(float *inputImage, float *outputImage, unsigned long imagePixelSize_, unsigned short imageWidth_, unsigned short slot, unsigned short totImages)
{
    // point at the pixel and crunch it
    unsigned long p = threadIdx.x + blockIdx.x * imageWidth_;
    if (p >= imagePixelSize_) { return; }
    float result = 0.f; // must be initialized before the += below
    long a, n, n0;
    float input;
    // this is not the right algorithm (which is pretty complex).
    // I know this is not optimal in terms of memory management. Still, I want a "long" calculation here so I don't care...
    for (n = 0; n < 10; n++)
    {
        n0 = slot - n;
        if (n0 < 0) n0 += totImages;
        input = inputImage[p + n0 * imagePixelSize_];
        for (a = 0; a < ADJUST; a++)
            result += pow(input, inputImage[a + n0 * imagePixelSize_]) * cos(input);
    }
    outputImage[p] = result;
}
void hard_work(cudaStream_t s){
    cudaError err;
    // launch the hard work
    printf("Hard work is launched after image %d is captured ==> ", imageSlot);
    kernelCalcul_W2<<<340, 500, 0, s>>>(Images_as_Float_in_Device, imageOutput_in_Device, imagePixelSize, imageWidth, imageSlot, totNbOfImages);
    err = cudaPeekAtLastError();
    if (err != cudaSuccess) printf("running error: %s \n", cudaGetErrorString(err));
    else printf("running ok\n");
    // copy the result back to Host
    //printf(" %p %p \n", images_as_Output_in_Host, imageOutput_in_Device);
    cudaMemcpyAsync(images_as_Output_in_Host, imageOutput_in_Device, sizeof(float) * imagePixelSize, cudaMemcpyDeviceToHost, s);
    cudaEventRecord(event_2, s);
}
void createStorageSpace()
{
    imageWidth = 640;
    imageHeight = 480;
    totNbOfImages = 300;
    imageSlot = 0;
    imagePixelSize = 640 * 480;
    lastImageFromCamera = 0;
    camera = (unsigned short *)malloc(imagePixelSize * sizeof(unsigned short));
    for (int i = 0; i < imagePixelSize; i++) camera[i] = rand() % 255;
    // storing the images in Host memory. I know I could optimize with cudaHostAlloc.
    images_as_Unsigned_in_Host = (unsigned short *)malloc(imagePixelSize * sizeof(unsigned short) * totNbOfImages);
    images_as_Output_in_Host = (float *)malloc(imagePixelSize * sizeof(float));
    cudaMalloc(&Images_as_Unsigned_in_Device, imagePixelSize * sizeof(unsigned short) * totNbOfImages);
    cudaMalloc(&Images_as_Float_in_Device, imagePixelSize * sizeof(float) * totNbOfImages);
    cudaMalloc(&imageOutput_in_Device, imagePixelSize * sizeof(float));
    int priority_high, priority_low;
    cudaDeviceGetStreamPriorityRange(&priority_low, &priority_high);
    cudaStreamCreateWithPriority(&s1, cudaStreamNonBlocking, priority_high);
    cudaStreamCreateWithPriority(&s2, cudaStreamNonBlocking, priority_low);
    cudaEventCreate(&event_2);
}
void releaseMapFile()
{
    cudaFree(Images_as_Unsigned_in_Device);
    cudaFree(Images_as_Float_in_Device);
    cudaFree(imageOutput_in_Device);
    free(images_as_Output_in_Host);
    free(camera);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaEventDestroy(event_2);
}
void putImageCUDA(const void * data)
{
    // We put the image in a round-robin. The slot to put the image in is imageSlot
    printf("\nDealing with image %d\n", imageSlot);
    // Copy the image into the round-robin
    cudaMemcpyAsync(Images_as_Unsigned_in_Device + imageSlot * imagePixelSize, data, sizeof(unsigned short) * imagePixelSize, cudaMemcpyHostToDevice, s1);
    // We will blur the image. Let's prepare the memory to get the results as floats
    cudaMemsetAsync(Images_as_Float_in_Device + imageSlot * imagePixelSize, 0, sizeof(float) * imagePixelSize, s1);
    // blur image
    blurImage<<<imageHeight - 140, imageWidth - 140, 0, s1>>>(Images_as_Unsigned_in_Device + imageSlot * imagePixelSize,
                                                              Images_as_Float_in_Device + imageSlot * imagePixelSize,
                                                              imageWidth, imagePixelSize, 3);
    // launches the hard-work
    if (cudaEventQuery(event_2) == cudaSuccess) hard_work(s2);
    else printf("Hard_work still running, so unable to process after image %d\n", imageSlot);
    imageSlot++;
    if (imageSlot >= totNbOfImages) {
        imageSlot = 0;
    }
}
int main()
{
    createStorageSpace();
    printf("The following loop is supposed to push images in the GPU and do calculations in Async mode, and to wait 30 ms before the next image, so we should have the output on the screen in 10 x 30 ms. But it's far slower...\nYou may adjust a #define ADJUST parameter to see what's happening.");
    for (int i = 0; i < 10; i++)
    {
        putImageCUDA(camera); // Puts an image in the GPU, does the blurring, and tries to do the hard-work
        Sleep(30); // to simulate the camera
    }
    releaseMapFile();
    getchar();
}
The primary issue here is that cudaMemcpyAsync is only a properly non-blocking async operation if the host memory involved is pinned, i.e. allocated using cudaHostAlloc. This characteristic is covered in several places, including the API documentation and the relevant programming guide section.
The following modification to your code (to run on Linux, which I prefer) demonstrates the behavioral difference:
$ cat t33.cu
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#define ADJUST 400
// Adjusting this parameter may make the problem occur.
// Too high => the watchdog will probably stop the kernel.
// Too low  => the kernel will probably run smoothly.
unsigned short * images_as_Unsigned_in_Host;
unsigned short * Images_as_Unsigned_in_Device;
unsigned short * camera;
float * images_as_Output_in_Host;
float * Images_as_Float_in_Device;
float * imageOutput_in_Device;
unsigned short imageWidth, imageHeight, totNbOfImages, imageSlot;
unsigned long imagePixelSize;
unsigned short lastImageFromCamera;
cudaStream_t s1, s2;
cudaEvent_t event_2;
clock_t timeRef;
// Basically, in the middle of the image, I average the values. I removed the logic behind to make it simpler.
// This kernel runs fast, and that's the point.
__global__ void blurImage(unsigned short * Images_as_Unsigned_in_Device_, float * Images_as_Float_in_Device_, unsigned short imageWidth_,
    unsigned long imagePixelSize_, short blur_distance)
{
    // we start from 'blur_distance' from the edge
    // p0 is the point we will calculate. p is a pointer which will move around for the average
    unsigned long p0 = (threadIdx.x + blur_distance) + (blockIdx.x + blur_distance) * imageWidth_;
    unsigned long p = p0;
    unsigned short * us;
    if (p >= imagePixelSize_) return;
    unsigned long tot = 0;
    short a, b, n, k;
    k = 0;
    // p starts from the top edge and will move to the bottom-right
    p -= blur_distance + blur_distance * imageWidth_;
    us = Images_as_Unsigned_in_Device_ + p;
    for (a = 2 * blur_distance; a >= 0; a--)
    {
        for (b = 2 * blur_distance; b >= 0; b--)
        {
            n = *us;
            if (n > 0) { tot += n; k++; }
            us++;
        }
        us += imageWidth_ - 2 * blur_distance - 1;
    }
    if (k > 0) Images_as_Float_in_Device_[p0] = (float)tot / (float)k;
    else Images_as_Float_in_Device_[p0] = 128.f;
}
__global__ void kernelCalcul_W2(float *inputImage, float *outputImage, unsigned long imagePixelSize_, unsigned short imageWidth_, unsigned short slot, unsigned short totImages)
{
    // point at the pixel and crunch it
    unsigned long p = threadIdx.x + blockIdx.x * imageWidth_;
    if (p >= imagePixelSize_) { return; }
    float result = 0.f; // must be initialized before the += below
    long a, n, n0;
    float input;
    // this is not the right algorithm (which is pretty complex).
    // I know this is not optimal in terms of memory management. Still, I want a "long" calculation here so I don't care...
    for (n = 0; n < 10; n++)
    {
        n0 = slot - n;
        if (n0 < 0) n0 += totImages;
        input = inputImage[p + n0 * imagePixelSize_];
        for (a = 0; a < ADJUST; a++)
            result += pow(input, inputImage[a + n0 * imagePixelSize_]) * cos(input);
    }
    outputImage[p] = result;
}
void hard_work(cudaStream_t s){
#ifndef QUICK
    cudaError err;
    // launch the hard work
    printf("Hard work is launched after image %d is captured ==> ", imageSlot);
    kernelCalcul_W2<<<340, 500, 0, s>>>(Images_as_Float_in_Device, imageOutput_in_Device, imagePixelSize, imageWidth, imageSlot, totNbOfImages);
    err = cudaPeekAtLastError();
    if (err != cudaSuccess) printf("running error: %s \n", cudaGetErrorString(err));
    else printf("running ok\n");
    // copy the result back to Host
    //printf(" %p %p \n", images_as_Output_in_Host, imageOutput_in_Device);
    cudaMemcpyAsync(images_as_Output_in_Host, imageOutput_in_Device, sizeof(float) * imagePixelSize/2, cudaMemcpyDeviceToHost, s);
    cudaEventRecord(event_2, s);
#endif
}
void createStorageSpace()
{
    imageWidth = 640;
    imageHeight = 480;
    totNbOfImages = 300;
    imageSlot = 0;
    imagePixelSize = 640 * 480;
    lastImageFromCamera = 0;
#ifdef USE_HOST_ALLOC
    cudaHostAlloc(&camera, imagePixelSize*sizeof(unsigned short), cudaHostAllocDefault);
    cudaHostAlloc(&images_as_Unsigned_in_Host, imagePixelSize*sizeof(unsigned short)*totNbOfImages, cudaHostAllocDefault);
    cudaHostAlloc(&images_as_Output_in_Host, imagePixelSize*sizeof(unsigned short), cudaHostAllocDefault);
#else
    camera = (unsigned short *)malloc(imagePixelSize * sizeof(unsigned short));
    images_as_Unsigned_in_Host = (unsigned short *)malloc(imagePixelSize * sizeof(unsigned short) * totNbOfImages);
    images_as_Output_in_Host = (float *)malloc(imagePixelSize * sizeof(float));
#endif
    for (int i = 0; i < imagePixelSize; i++) camera[i] = rand() % 255;
    cudaMalloc(&Images_as_Unsigned_in_Device, imagePixelSize * sizeof(unsigned short) * totNbOfImages);
    cudaMalloc(&Images_as_Float_in_Device, imagePixelSize * sizeof(float) * totNbOfImages);
    cudaMalloc(&imageOutput_in_Device, imagePixelSize * sizeof(float));
    int priority_high, priority_low;
    cudaDeviceGetStreamPriorityRange(&priority_low, &priority_high);
    cudaStreamCreateWithPriority(&s1, cudaStreamNonBlocking, priority_high);
    cudaStreamCreateWithPriority(&s2, cudaStreamNonBlocking, priority_low);
    cudaEventCreate(&event_2);
    cudaEventRecord(event_2, s2);
}
void releaseMapFile()
{
    cudaFree(Images_as_Unsigned_in_Device);
    cudaFree(Images_as_Float_in_Device);
    cudaFree(imageOutput_in_Device);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaEventDestroy(event_2);
}
void putImageCUDA(const void * data)
{
    // We put the image in a round-robin. The slot to put the image in is imageSlot
    printf("\nDealing with image %d\n", imageSlot);
    // Copy the image into the round-robin
    cudaMemcpyAsync(Images_as_Unsigned_in_Device + imageSlot * imagePixelSize, data, sizeof(unsigned short) * imagePixelSize, cudaMemcpyHostToDevice, s1);
    // We will blur the image. Let's prepare the memory to get the results as floats
    cudaMemsetAsync(Images_as_Float_in_Device + imageSlot * imagePixelSize, 0, sizeof(float) * imagePixelSize, s1);
    // blur image
    blurImage<<<imageHeight - 140, imageWidth - 140, 0, s1>>>(Images_as_Unsigned_in_Device + imageSlot * imagePixelSize,
                                                              Images_as_Float_in_Device + imageSlot * imagePixelSize,
                                                              imageWidth, imagePixelSize, 3);
    // launches the hard-work
    if (cudaEventQuery(event_2) == cudaSuccess) hard_work(s2);
    else printf("Hard_work still running, so unable to process after image %d\n", imageSlot);
    imageSlot++;
    if (imageSlot >= totNbOfImages) {
        imageSlot = 0;
    }
}
int main()
{
    createStorageSpace();
    printf("The following loop is supposed to push images in the GPU and do calculations in Async mode, and to wait 30 ms before the next image, so we should have the output on the screen in 10 x 30 ms. But it's far slower...\nYou may adjust a #define ADJUST parameter to see what's happening.");
    for (int i = 0; i < 10; i++)
    {
        putImageCUDA(camera); // Puts an image in the GPU, does the blurring, and tries to do the hard-work
        usleep(30000); // to simulate the camera
    }
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) printf("some CUDA error: %s\n", cudaGetErrorString(err));
    releaseMapFile();
}
$ nvcc -arch=sm_52 -lineinfo -o t33 t33.cu
$ time ./t33
The following loop is supposed to push images in the GPU and do calculations in Async mode, and to wait 30 ms before the next image, so we should have the output on the screen in 10 x 30 ms. But it's far slower...
You may adjust a #define ADJUST parameter to see what's happening.
Dealing with image 0
Hard work is launched after image 0 is captured ==> running ok
Dealing with image 1
Hard work is launched after image 1 is captured ==> running ok
Dealing with image 2
Hard work is launched after image 2 is captured ==> running ok
Dealing with image 3
Hard work is launched after image 3 is captured ==> running ok
Dealing with image 4
Hard work is launched after image 4 is captured ==> running ok
Dealing with image 5
Hard work is launched after image 5 is captured ==> running ok
Dealing with image 6
Hard work is launched after image 6 is captured ==> running ok
Dealing with image 7
Hard work is launched after image 7 is captured ==> running ok
Dealing with image 8
Hard work is launched after image 8 is captured ==> running ok
Dealing with image 9
Hard work is launched after image 9 is captured ==> running ok
real 0m2.790s
user 0m0.688s
sys 0m0.966s
$ nvcc -arch=sm_52 -lineinfo -o t33 t33.cu -DUSE_HOST_ALLOC
$ time ./t33
The following loop is supposed to push images in the GPU and do calculations in Async mode, and to wait 30 ms before the next image, so we should have the output on the screen in 10 x 30 ms. But it's far slower...
You may adjust a #define ADJUST parameter to see what's happening.
Dealing with image 0
Hard work is launched after image 0 is captured ==> running ok
Dealing with image 1
Hard_work still running, so unable to process after image 1
Dealing with image 2
Hard_work still running, so unable to process after image 2
Dealing with image 3
Hard_work still running, so unable to process after image 3
Dealing with image 4
Hard_work still running, so unable to process after image 4
Dealing with image 5
Hard_work still running, so unable to process after image 5
Dealing with image 6
Hard_work still running, so unable to process after image 6
Dealing with image 7
Hard work is launched after image 7 is captured ==> running ok
Dealing with image 8
Hard_work still running, so unable to process after image 8
Dealing with image 9
Hard_work still running, so unable to process after image 9
real 0m1.721s
user 0m0.028s
sys 0m0.629s
$
In the USE_HOST_ALLOC case above, the launch pattern for the low-priority kernel is intermittent, as expected, and the overall run time is considerably shorter.
In short, if you want the expected behavior out of cudaMemcpyAsync, make sure any participating host allocations are page-locked.
A pictorial (profiler) example of the effect that pinning can have on multi-stream behavior can be seen in this answer.
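The decisive change, reduced to its essentials, is the allocation call (a minimal sketch; N is an illustrative buffer size):

// Pageable host memory: cudaMemcpyAsync involving this buffer is not
// properly asynchronous, as described above.
float *h_pageable = (float *)malloc(N * sizeof(float));
free(h_pageable);

// Page-locked (pinned) host memory: allows the copy to be truly async.
float *h_pinned;
cudaHostAlloc(&h_pinned, N * sizeof(float), cudaHostAllocDefault);
cudaFreeHost(h_pinned); // pinned allocations are released with cudaFreeHost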

Combine multiple mono QImage into a color QImage

For a list of colors, I have a corresponding list of QImage, format mono. The mono images have been processed in such a way that any given pixel is black in at most one of the images.
I would like to merge them into a color image.
I had 3 ideas.
1. Use image composition modes. I was unable to get it to work. (Editing to remove it, to clean up post...)
2. Use the mono images as masks for each of the colors when added to the destination.
I have no idea how to implement it.
3. Iterating through pixels - slow, the documentation says that pixel manipulation functions are slow...
This works:
// Creating destination image
// m_colors: list of (n+1) QColor (the last one corresponding to the background)
// m_images: list of n QImage, Format_Mono, all of the same size (as the destination)
// using an indexed image for the destination since I have a limited palette
QImage preview = QImage(rect.size().toSize(), QImage::Format_Indexed8);
int previewWidth = preview.size().width();
int previewHeight = preview.size().height();
int colorsSize = m_colors.size();
for(int k = 0; k < colorsSize; ++k)
{
    preview.setColor(k, m_colors.at(k).rgb());
}
--colorsSize;
// combining images
for(int j = 0; j < previewHeight; ++j)
{
    for(int i = 0; i < previewWidth; ++i)
    {
        // set background color
        preview.setPixel(i, j, colorsSize);
        for(int k = 0; k < colorsSize; ++k)
        {
            QImage im = m_images.at(k);
            if(!im.isNull())
            {
                if(m_images.at(k).pixelIndex(i, j) == 0)
                {
                    preview.setPixel(i, j, k);
                }
            }
        }
    }
}
I should at least improve this using scanLine(), but I don't know how... I can only find examples that use scanLine() with 32-bit images, not 8- or 2-bit.
Is it actually possible to use scanLine() with 8- or 2-bit images?
I don't understand the documentation - does it mean that only 32-bit images can be read/written using scanLine(), or that the function works the same way regardless of image type and I would only use one of the 4 bytes?
Would it be more effective to use 32-bit images instead of 8- or 2-bit?
If I use a 32-bit image for the destination and try to use scanLine() to write data, how can I still improve my reading of the mono images?
Please help me improve my algorithm, either by improving the version that I got to work iterating through all pixels of the images, or perhaps by using some tools like combining images using composition.
Is it actually possible to use scanLine() with 8- or 2-bit images?
Yes it is.
Would it be more effective to use 32-bit images instead of 8- or 2-bit?
You will have to measure, and it will depend on the specific code. I used 8-bit code here for simplicity, and because your code did.
If I use a 32-bit image for the destination and try to use scanLine() to write data, how can I still improve my reading of the mono images?
It is probably not a good idea to copy the image in the inner loop
QImage im = m_images.at(k);
and then not use that copy for the next access:
if(m_images.at(k).pixelIndex(i, j) == 0)
It should also speed up your painting if you iterate over one image at a time, instead of iterating over all the images for each destination pixel in the inner loop.
If the image is monochrome, then the scan line will point to packed color information which will need to be unpacked. It is easier (and maybe faster) to let convertToFormat convert the image and then use scanLine to read the unpacked information. In the example below the images are all 8 bit.
#include <vector>
#include <QtGui/QImage>
#include <QtGui/QColor>

static const char * img1[] = {
    "5 5 2 1", "a c #000000", "b c #ffffff",
    "aabba", "aabba", "aabba", "aabba", "aabba"
};
static const char * img2[] = {
    "5 5 2 1", "a c #000000", "b c #ffffff",
    "aaaaa", "aaaaa", "bbbbb", "bbbbb", "aaaaa"
};

int main( int argc, char* argv[] )
{
    auto images = std::vector<QImage>( 2 );
    images[0] = QImage( img1 );
    images[1] = QImage( img2 );
    auto colors = std::vector<QColor>( 2 );
    colors[0] = QColor( Qt::red );
    colors[1] = QColor( Qt::green );
    QImage combined = QImage( images[0].size(), QImage::Format_Indexed8 );
    combined.setColor( 0, QColor( Qt::black ).rgb() ); // setColor() needs a QRgb, not a Qt::GlobalColor
    combined.fill( 0 );
    for( int k = 1, num = images.size(); k <= num; ++k )
    {
        combined.setColor( k, colors[k-1].rgb() );
        // ensure 8-bit indexed data so scanLine() yields one byte per pixel
        QImage img = images[k-1].convertToFormat( QImage::Format_Indexed8 );
        for( int i = 0, height = img.height(); i < height; ++i )
        {
            const uchar* src = img.constScanLine( i );
            uchar* dst = combined.scanLine( i );
            for( int j = 0, width = img.width(); j < width; ++j )
            {
                if( src[j] != 0 )
                    dst[j] = k;
            }
        }
    }
    return 0;
}
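To sanity-check the result, the indexed image can be converted and written out (a short sketch, using the combined image built above):

// Convert to RGB so any viewer can open it, then save for inspection.
QImage rgb = combined.convertToFormat( QImage::Format_RGB32 );
rgb.save( "combined.png" );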

How to change volume of an audio AVPacket

I have a desktop Qt-based application that fetches a sound stream from the network and plays it using QAudioOutput. I want to provide a volume control to the user so that he can reduce the volume. My code looks like this:
float volume_control = get_user_pref(); // user provided volume level {0.0,1.0}
for (;;) {
AVPacket *retrieved_pkt = get_decoded_packet_stream(); // from network stream
AVPacket *work_pkt
= change_volume(retrieved_pkt, volume_control); // this is what I need
// remaining code to play the work_pkt ...
}
How do I implement change_volume(), or is there an off-the-shelf function I can use?
Edit: Adding codec-related info as requested in the comments
QAudioFormat format;
format.setFrequency(44100);
format.setChannels(2);
format.setSampleSize(16);
format.setCodec("audio/pcm");
format.setByteOrder(QAudioFormat::LittleEndian);
format.setSampleType(QAudioFormat::SignedInt);
The following code works just fine.
// audio_buffer is a byte array of size data_size
// volume_level is a float between 0 (silent) and 1 (original volume)
int16_t *pcm_data = (int16_t *)(audio_buffer);
int32_t pcmval;
for (int ii = 0; ii < (data_size / 2); ii++) { // 16-bit samples, hence divided by 2
    pcmval = pcm_data[ii] * volume_level;
    pcm_data[ii] = pcmval;
}
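Wrapped up as the change_volume() helper from the question, the same loop might look like this (a sketch: it assumes the packet payload is raw signed 16-bit PCM matching the QAudioFormat above, and it clamps in case a gain above 1.0 is ever allowed):

#include <stdint.h>

// Hypothetical helper: scales the packet's samples in place and returns it.
static AVPacket *change_volume(AVPacket *pkt, float volume_level)
{
    int16_t *samples = (int16_t *)pkt->data;
    int count = pkt->size / 2; // 16-bit samples
    for (int i = 0; i < count; i++) {
        int32_t v = (int32_t)(samples[i] * volume_level);
        if (v > 32767) v = 32767;   // clamp to the int16_t range
        if (v < -32768) v = -32768;
        samples[i] = (int16_t)v;
    }
    return pkt;
}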
Edit: I think there is significant scope for optimization here, since my solution is compute-intensive. I guess avcodec_decode_audio() could be used to speed it up.

How to display an image from an array of color data in Qt?

I have a char* data, where every char represents the red/green/blue/alpha value of a pixel.
So the first four values are the red, green, blue and alpha values of the first pixel, the next four are the R, G, B, A values of the pixel to its right, and so on.
It represents a picture (with previously known width and height).
Now I want to somehow take this array and display it in a Qt window. How do I do it?
I know I should somehow use QPixmap and/or QImage, but I cannot find anything helpful in the documentation.
QImage is designed for access to the various pixels (among other things), so you could do something like this:
QImage DataToQImage( int width, int height, int length, char *data )
{
    QImage image( width, height, QImage::Format_ARGB32 );
    assert( length % 4 == 0 );
    for ( int i = 0; i < length / 4; ++i )
    {
        int index = i * 4;
        // data is in R, G, B, A byte order (as described in the question);
        // cast through uchar so values above 127 are not sign-extended
        QRgb argb = qRgba( (uchar)data[index],       // red
                           (uchar)data[index + 1],   // green
                           (uchar)data[index + 2],   // blue
                           (uchar)data[index + 3] ); // alpha
        image.setPixel( i % width, i / width, argb );
    }
    return image;
}
Based on coming across another constructor, you might also be able to do this:
QImage DataToQImage( int width, int height, int length, const uchar *data )
{
    int bytes_per_line = width * 4;
    // Note: Format_ARGB32 expects 0xAARRGGBB words; if your buffer really
    // is in R, G, B, A byte order, QImage::Format_RGBA8888 (Qt 5.2+)
    // matches it directly.
    QImage image( data, width, height, bytes_per_line,
                  QImage::Format_ARGB32 );
    // data is required to be valid throughout the lifetime of the image so
    // constructed, and QImages use shared data to make copying quick. I
    // don't know how those two features interact, so here I chose to force a
    // copy of the image. It could be that the shared data would make a copy
    // also, but due to the shared data, we don't really lose anything by
    // forcing it.
    return image.copy();
}
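To actually get the image on screen, the usual route is a QPixmap set on a widget (a minimal sketch, assuming one of the DataToQImage() versions above; the 2x2 buffer is illustrative):

#include <QApplication>
#include <QLabel>
#include <QPixmap>

int main( int argc, char* argv[] )
{
    QApplication app( argc, argv );
    // illustrative 2x2 RGBA buffer: red, green, blue and white pixels
    char data[] = { (char)255, 0, 0, (char)255,
                    0, (char)255, 0, (char)255,
                    0, 0, (char)255, (char)255,
                    (char)255, (char)255, (char)255, (char)255 };
    QLabel label;
    label.setPixmap( QPixmap::fromImage( DataToQImage( 2, 2, 16, data ) ) );
    label.show();
    return app.exec();
}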
