GPU driver does not respond after increasing NDRangeKernel size - OpenCL

I am new to OpenCL and I want to parallelise this Sieve of Atkin; the C++ code is here: https://www.geeksforgeeks.org/sieve-of-atkin/
Somehow I don't get good results out of it; after comparing, the CPU version is actually much faster. I tried to use enqueueNDRangeKernel to avoid writing the nested loops and hopefully increase the performance, but when I pass a higher limit to the function, the GPU driver stops responding and the program crashes. Maybe my NDRange configuration is wrong; could anyone help with it? I probably don't understand NDRange properly. Here is the info about my GPU:
CL_DEVICE_NAME: GeForce GT 740M
CL_DEVICE_VENDOR: NVIDIA Corporation
CL_DRIVER_VERSION: 397.31
CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPU
CL_DEVICE_MAX_COMPUTE_UNITS: 2
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: 1024 / 1024 / 64
CL_DEVICE_MAX_WORK_GROUP_SIZE: 1024
CL_DEVICE_MAX_CLOCK_FREQUENCY: 1032 MHz
CL_DEVICE_ADDRESS_BITS: 32
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 512 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 2048 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
CL_DEVICE_LOCAL_MEM_TYPE: local
CL_DEVICE_LOCAL_MEM_SIZE: 48 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT: 1
CL_DEVICE_MAX_READ_IMAGE_ARGS: 256
CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 16
here is my NDRange code
queue.enqueueNDRangeKernel(add, cl::NDRange(1,1), cl::NDRange((limit * limit) -1, (limit * limit) -1 ), cl::NullRange,NULL, &event);
and my kernel code:
__kernel void sieveofAktin(const int limit, __global bool* sieve)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    //printf("%d \n", x);

    int n = (4 * x * x) + (y * y);
    if (n <= limit && (n % 12 == 1 || n % 12 == 5))
        sieve[n] ^= true;

    n = (3 * x * x) + (y * y);
    if (n <= limit && n % 12 == 7)
        sieve[n] ^= true;

    n = (3 * x * x) - (y * y);
    if (x > y && n <= limit && n % 12 == 11)
        sieve[n] ^= true;

    for (int r = 5; r * r < limit; r++) {
        if (sieve[r]) {
            for (int i = r * r; i < limit; i += r * r)
                sieve[i] = false;
        }
    }
}

You have lots of branching in that code, and I suspect that's what may be killing your performance on GPUs. Look at chapter 6 of the NVIDIA OpenCL Best Practices Guide for details on why this hurts performance.
I'm not sure how possible it is without looking closely at the algorithm, but ideally you want to rewrite the code to use as little branching as possible. Alternatively, you could look at other algorithms entirely.
As for the lock-up, I'd need to see more of your host code to know what is happening, but it's possible you're exceeding various limits of your platform/device. Are you checking for errors on every OpenCL function you call?
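To make that last point concrete, here is a minimal sketch of checking every call's return code (shown with the C API for brevity; the cl.hpp wrapper methods used in the question return the same cl_int codes when exceptions are disabled). The CL_CHECK macro, run_sieve and the chosen work size are illustrative assumptions, not taken from the asker's host code.

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

#define CL_CHECK(call)                                              \
    do {                                                            \
        cl_int _err = (call);                                       \
        if (_err != CL_SUCCESS) {                                   \
            fprintf(stderr, "%s failed with error %d (%s:%d)\n",    \
                    #call, _err, __FILE__, __LINE__);               \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

void run_sieve(cl_command_queue queue, cl_kernel kernel, size_t limit)
{
    /* keep the global size bounded; (limit * limit - 1) per dimension,
       as in the original enqueue call, grows extremely fast */
    size_t global[2] = { limit, limit };
    CL_CHECK(clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global,
                                    NULL, 0, NULL, NULL));
    CL_CHECK(clFinish(queue));   /* also surfaces errors from asynchronous execution */
}

If you are on the cl.hpp bindings, you can instead define __CL_ENABLE_EXCEPTIONS and catch cl::Error, which carries the same error codes.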

Regardless of how good or bad your algorithm or implementation is - the driver should always respond. Non-response is quite possibly a bug. File a bug report at http://developer.nvidia.com/ .

Related

What is the best practice for memory access in this N-body problem solved on AMD Radeon RX580?

I compute the trajectories of N particles which move in their mutual gravitational field. I wrote the following OpenCL kernel:
#define G 100.0f
#define EPS 1.0f

float2 f (float2 r_me, __constant float *m, __global float2 *r, size_t s, size_t n)
{
    size_t i;
    float2 res = (0.0f, 0.0f);

    for (i=1; i<n; i++) {
        size_t idx = i;
        // size_t idx = (i + s) % n;
        float2 dir = r[idx] - r_me;
        float dist = length (dir);
        res += G*m[idx]/pown(dist + EPS, 3) * dir;
    }
    return res;
}

__kernel void take_step_rk2 (__constant float *m,
                             __global float2 *r,
                             __global float2 *v,
                             float delta)
{
    size_t n = get_global_size(0);
    size_t s = get_global_id(0);

    float2 mv = f(r[s], m, r, s, n);
    float2 mr = v[s];

    float2 vpred1 = v[s] + mv * delta;
    float2 rpred1 = r[s] + mr * delta;

    float2 nv = f(rpred1, m, r, s, n);
    float2 nr = vpred1;

    barrier (CLK_GLOBAL_MEM_FENCE);

    r[s] += (mr + nr) * delta / 2;
    v[s] += (mv + nv) * delta / 2;
}
Then I run this kernel multiple times as a one-dimensional problem with global work size = [number of bodies]:
void take_step (struct cl_state *state)
{
    size_t n = state->nbodies;
    clEnqueueNDRangeKernel (state->queue, state->step, 1, NULL, &n, NULL, 0, NULL, NULL);
    clFinish (state->queue);
}
This is a quote from AMD OpenCL Optimization Guide (year 2015):
Under certain conditions, one unexpected case of a channel conflict is that reading from the same address is a conflict, even on the FastPath.
This does not happen on the read-only memories, such as constant buffers,
textures, or shader resource view (SRV); but it is possible on the read/write UAV
memory or OpenCL global memory.
Work items in my queue all try to get access to the same memory in this loop, so there must be a channel conflict:
for (i=1; i<n; i++) {
    size_t idx = i;
    // size_t idx = (i + s) % n;
    float2 dir = r[idx] - r_me;
    float dist = length (dir);
    res += G*m[idx]/pown(dist + EPS, 3) * dir;
}
I replaced
size_t idx = i;
// size_t idx = (i + s) % n;
with
// size_t idx = i;
size_t idx = (i + s) % n;
so that the first work item (with global id 0) first accesses the first element of the array r, the second work item accesses the second element, and so on.
I expected this change to improve performance, but on the contrary it caused a significant performance degradation (roughly by a factor of 2). What am I missing? Why is the all-to-the-same-address access pattern better in this situation?
If you have other tips on how to improve the performance, please share them with me. The OpenCL optimization guide is very confusing.
The loop in the f function has no barrier to reconverge the work items for coalesced access. Once some work items get their r data they start computing, while the ones that couldn't have to wait for theirs, and the group loses its coalescence. To regroup them, you could add a barrier at least every 10 iterations, or every 2, or maybe even every iteration. But access to global memory has high latency, and barrier plus latency is bad for performance. What you need here is local memory: it has low latency and a broadcast ability, and it only loses coalescedness on grains bigger than the local work size (64?), which is not bad for the global memory accesses either (you fill local memory from global every Kth iteration, dividing N into K-sized patches).
A source from 2013 (
http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf):
Thus, the key to effectively using the LDS is to control the access
pattern, so that accesses generated on the same cycle map to different
banks in the LDS. One notable exception is that accesses to the same
address (even though they have the same bits 6:2) can be broadcast to
all requestors and do not generate a bank conflict.
Using LDS (__local) for this will give good performance. Since LDS is small, you should do it in small patches, say 256 particles at a time; a sketch of this tiling scheme follows below.
Also, using i as idx is very cache friendly, while the modulus version is very cache unfriendly. Once data lives in the cache, it doesn't matter that N requests are made; they are served from the cache. With the modulus version you evict cached data before it is reused, depending on N. For small N it should be faster, as you foresaw; for big N and a small GPU cache it is much worse, something like one global request per cycle versus N-minus-cache-size global requests per cycle.
I guess that with such a strong GPU you used a high N, such as 64k bodies, which needs 2 variables per body and 4 bytes per variable, totaling 512 kB, which cannot fit in L1. Maybe only in L2, which is slower than idx = i served through L1.
Answer:
- all-to-the-same L1-cached address is faster than all-to-global and L2-cached addresses
- use local memory in a "blocking/patching" algorithm to achieve high speed
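To illustrate, here is a minimal sketch of the patch-based force loop, assuming the global size is a multiple of the work-group size and that the host passes a __local buffer of get_local_size(0) float2 elements via clSetKernelArg(kernel, arg_index, local_bytes, NULL). The name f_tiled and the patch handling are illustrative, not taken from the original code; G and EPS are the constants defined in the question.

#define G 100.0f    // same constants as in the question
#define EPS 1.0f

float2 f_tiled (float2 r_me, __constant float *m, __global float2 *r,
                __local float2 *r_tile, size_t n)
{
    size_t lid = get_local_id(0);
    size_t lsz = get_local_size(0);
    float2 res = (float2)(0.0f, 0.0f);

    for (size_t tile = 0; tile < n; tile += lsz) {
        // each work item stages one body of the current patch into LDS
        r_tile[lid] = r[tile + lid];
        barrier(CLK_LOCAL_MEM_FENCE);

        // the whole group then sweeps the patch out of low-latency local memory;
        // identical local addresses are broadcast, so there is no bank conflict,
        // and the barriers keep the group converged for coalesced global loads
        for (size_t j = 0; j < lsz; j++) {
            float2 dir  = r_tile[j] - r_me;
            float  dist = length(dir);
            res += G * m[tile + j] / pown(dist + EPS, 3) * dir;
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    // the work item's own body contributes nothing here: dir is the zero vector
    return res;
}

The original take_step_rk2 kernel could call something like f_tiled in place of f; note that every work item in the group must reach the barriers, so the call must not sit behind a divergent branch.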

Strange behavior with OpenCL - CPU output doesn't match single-threaded GPU output

I am trying to run the following piece of LUDecomposition code in an OpenCL kernel.
'A' below is a single precision floating point array.
for( k = 0; k < N; k++ )
{
    for(j=k+1; j < N; j++) {
        A[k * N + j] = A[k * N + j] / A[k * N + k];
    }
    for(i=k+1; i < N; i++) {
        for (j=k+1; j < N; j++) {
            A[i * N + j] = A[i * N + j] - (A[i * N + k] * A[k * N + j]);
        }
    }
}
I am running this code on the GPU on just a single GPU thread (completely sequential). So I have the global thread and local thread mapping for the kernel as follows.
globalthread[0] = 1;
globalthread[1] = 1;
localthread[0] = 1;
localthread[1] = 1;
But when I compare the GPU output to the output of the same function run on the CPU (directly, not as an OpenCL device), I see that the outputs don't match (the mismatch is beyond the limits of floating-point accuracy).
I found this inexplicable in spite of my best efforts. While trying to narrow down the problem, I found that it arises from the second statement, specifically from the subtraction, and when the value of A[i][j] goes negative.
I have made sure that both CPU and GPU are working on the same inputs, but such strange behavior for such a simple computation seems weird. Can anyone help explain why the outputs might differ?
I also ran it on both NVIDIA and AMD devices and I see the same behavior (to rule out any platform-specific issue).
Here is the sample output of the mismatch:
platform name is NVIDIA CUDA
platform version is OpenCL 1.1 CUDA 4.2.1
number of devices is 2
device name is Tesla C2050 / C2070
GPU Runtime: 0.023669s
CPU Runtime: 0.000123s
Values differ at index (45, 40): arr1=0.946256, arr2=0.963078
Values differ at index (60, 52): arr1=-9.348129, arr2=-9.483719
Values differ at index (61, 52): arr1=11.343384, arr2=11.093756
Non-Matching CPU-GPU Outputs Beyond Error Threshold of 1.05 Percent: 3
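For reference, this is roughly the kind of element-wise relative-error check the output above implies (a hypothetical sketch; the asker's actual test harness is not shown):

#include <math.h>
#include <stdio.h>

/* Flags elements whose relative difference exceeds a percentage threshold. */
static int compare_outputs(const float *cpu, const float *gpu,
                           int N, float threshold_percent)
{
    int mismatches = 0;
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            float ref  = cpu[i * N + j];
            float diff = fabsf(ref - gpu[i * N + j]);
            float rel  = (ref != 0.0f) ? diff / fabsf(ref) : diff;
            if (rel * 100.0f > threshold_percent) {
                printf("Values differ at index (%d, %d): arr1=%f, arr2=%f\n",
                       i, j, cpu[i * N + j], gpu[i * N + j]);
                mismatches++;
            }
        }
    }
    return mismatches;
}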

OpenCL AMD vs NVIDIA performance

I implemented a simple kernel which is some sort of a convolution. I measured it on an NVIDIA GT 240: it took 70 ms when written in CUDA and 100 ms when written in OpenCL. OK, I thought, the NVIDIA compiler is better optimized for CUDA (or I'm doing something wrong).
I need to run it on AMD GPUs, so I migrated to the AMD APP SDK, using exactly the same kernel code.
I made two tests and their results were discouraging for me: 200 ms on an HD 6670 and 70 ms on an HD 5850 (the same time as the GT 240 with CUDA). I am very interested in the reasons for such strange behaviour.
All projects were built in VS2010 using the settings from the sample projects of NVIDIA and AMD respectively.
Please do not consider my post as NVIDIA advertisement; I fully understand that the HD 5850 is more powerful than the GT 240. The only thing I wish to know is why there is such a difference and how to fix the problem.
Update: below is the kernel code, which looks for 6 equally sized template images in the base one. Every pixel of the base image is considered a possible origin of one of the templates and is processed by a separate thread. The kernel compares the R, G, B values of each pixel of the base image and of the template, and if at least one difference exceeds the diff parameter, the corresponding pixel is counted as non-matching. If the number of non-matching pixels is less than maxNonmatchQt, the corresponding template is a hit.
__constant int tOffset = 8196; // one template size in memory (in bytes)

__kernel void matchImage6( __global unsigned char* image,       // pointer to the base image
                           int imgWidth,                         // base image width
                           int imgHeight,                        // base image height
                           int imgPitch,                         // base image pitch (in bytes)
                           int imgBpp,                           // base image bytes (!) per pixel
                           __constant unsigned char* templates,  // pointer to the array of templates
                           int tWidth,                           // templates width (the same for all)
                           int tHeight,                          // templates height (the same for all)
                           int tPitch,                           // templates pitch (in bytes, the same for all)
                           int tBpp,                             // templates bytes (!) per pixel (the same for all)
                           int diff,                             // max allowed difference of intensity
                           int maxNonmatchQt,                    // max number of nonmatched pixels
                           __global int* result )                // results
{
    int x0 = (int)get_global_id(0);
    int y0 = (int)get_global_id(1);
    if( x0 + tWidth > imgWidth || y0 + tHeight > imgHeight)
        return;
    int nonmatchQt[] = {0, 0, 0, 0, 0, 0};
    for( int y = 0; y < tHeight; y++) {
        int ind = y * tPitch;
        int baseImgInd = (y0 + y) * imgPitch + x0 * imgBpp;
        for( int x = 0; x < tWidth; x++) {
            unsigned char c0 = image[baseImgInd];
            unsigned char c1 = image[baseImgInd + 1];
            unsigned char c2 = image[baseImgInd + 2];
            for( int i = 0; i < 6; i++)
                if( abs( c0 - templates[i * tOffset + ind]) > diff ||
                    abs( c1 - templates[i * tOffset + ind + 1]) > diff ||
                    abs( c2 - templates[i * tOffset + ind + 2]) > diff)
                    nonmatchQt[i]++;
            ind += tBpp;
            baseImgInd += imgBpp;
        }
        if( nonmatchQt[0] > maxNonmatchQt && nonmatchQt[1] > maxNonmatchQt && nonmatchQt[2] > maxNonmatchQt &&
            nonmatchQt[3] > maxNonmatchQt && nonmatchQt[4] > maxNonmatchQt && nonmatchQt[5] > maxNonmatchQt)
            return;
    }
    for( int i = 0; i < 6; i++)
        if( nonmatchQt[i] < maxNonmatchQt) {
            unsigned int pos = atom_inc( &result[0]) * 3;
            result[pos + 1] = i;
            result[pos + 2] = x0;
            result[pos + 3] = y0;
        }
}
Kernel run configuration:
Global work size = (1900, 1200)
Local work size = (32, 8) for AMD and (32, 16) for NVIDIA.
Execution time:
HD 5850 - 69 ms,
HD 6670 - 200 ms,
GT 240 - 100 ms.
Any remarks about my code are also highly appreciated.
The difference in execution times is caused by the compilers.
Your code can be easily vectorized. Consider image and templates as arrays of the vector type uchar4 (the fourth coordinate of each uchar4 vector is always 0). Instead of 3 memory reads:
unsigned char c0 = image[baseImgInd];
unsigned char c1 = image[baseImgInd + 1];
unsigned char c2 = image[baseImgInd + 2];
use only one:
uchar4 c = image[baseImgInd];
Instead of bulky if:
if( abs( c0 - templates[i * tOffset + ind]) > diff ||
    abs( c1 - templates[i * tOffset + ind + 1]) > diff ||
    abs( c2 - templates[i * tOffset + ind + 2]) > diff)
    nonmatchQt[i]++;
use fast:
uchar4 t = templates[i * tOffset + ind];
nonmatchQt[i] += any(abs_diff(c, t) > diff);
Thus you can increase the performance of your code by up to 3 times (if the compiler doesn't vectorize the code by itself). I suppose the AMD OpenCL compiler does not do such vectorization and other optimizations. From my experience, OpenCL on an NVIDIA GPU can usually be made faster than CUDA, because it is more low-level.
There can be no exact answer for this. OpenCL performance depends on many parameters: the number of accesses to global memory, the efficiency of the code, and so on. Moreover, it is very difficult to compare two devices, since they may have different local, global and constant memories, different numbers of cores, frequencies, memory bandwidths and, more importantly, different hardware architectures.
Each hardware provides its own performance boosts, for example native_ functions on NVIDIA. So you need to explore the hardware you are working on more; that might actually help. But what I would personally recommend is not to use such hardware-specific optimizations, since they can affect the flexibility of your code.
You can also find published papers showing that CUDA performance is much better than OpenCL performance on the same NVIDIA hardware.
So it is always better to write code which provides good flexibility, rather than device-specific optimizations.

Divide by 10 using bit shifts?

Is it possible to divide an unsigned integer by 10 by using pure bit shifts, addition, subtraction and maybe multiply? Using a processor with very limited resources and slow divide.
Editor's note: this is not actually what compilers do, and gives the wrong answer for large positive integers ending with 9, starting with div10(1073741829) = 107374183 not 107374182. It is exact for smaller inputs, though, which may be sufficient for some uses.
Compilers (including MSVC) do use fixed-point multiplicative inverses for constant divisors, but they use a different magic constant and shift on the high-half result to get an exact result for all possible inputs, matching what the C abstract machine requires. See Granlund & Montgomery's paper on the algorithm.
See Why does GCC use multiplication by a strange number in implementing integer division? for examples of the actual x86 asm gcc, clang, MSVC, ICC, and other modern compilers make.
This is a fast approximation that's inexact for large inputs
It's even faster than the exact division via multiply + right-shift that compilers use.
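For comparison, here is a sketch of what that exact multiply-and-shift form can look like for 32-bit unsigned division by 10; the constant ceil(2^35 / 10) is one valid choice under the Granlund-Montgomery scheme, not necessarily the constant/shift pair your compiler emits:

#include <stdint.h>

/* Exact floor(n / 10) for every 32-bit unsigned n: multiply by ceil(2^35 / 10) =
 * 0xCCCCCCCD and shift the 64-bit product right by 32 + 3. This choice is exact
 * because 0xCCCCCCCD * 10 - 2^35 = 2 <= 2^3. */
uint32_t div10_exact(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);
}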
You can use the high half of a multiply result for divisions by small integral constants. Assume a 32-bit machine (code can be adjusted accordingly):
int32_t div10(int32_t dividend)
{
    int64_t invDivisor = 0x1999999A;
    return (int32_t) ((invDivisor * dividend) >> 32);
}
What's going on here is that we're multiplying by a close approximation of 1/10 * 2^32 and then removing the 2^32. This approach can be adapted to different divisors and different bit widths.
This works great for the ia32 architecture, since its IMUL instruction will put the 64-bit product into edx:eax, and the edx value will be the wanted value. Viz (assuming dividend is passed in eax and quotient returned in eax)
div10 proc
    mov  edx, 1999999Ah   ; load 1/10 * 2^32
    imul eax              ; edx:eax = dividend / 10 * 2^32
    mov  eax, edx         ; eax = dividend / 10
    ret
endp
Even on a machine with a slow multiply instruction, this will be faster than a software or even hardware divide.
Though the answers given so far match the actual question, they do not match the title. So here's a solution heavily inspired by Hacker's Delight that really uses only bit shifts.
unsigned divu10(unsigned n) {
    unsigned q, r;
    q = (n >> 1) + (n >> 2);
    q = q + (q >> 4);
    q = q + (q >> 8);
    q = q + (q >> 16);
    q = q >> 3;
    r = n - (((q << 2) + q) << 1);
    return q + (r > 9);
}
I think that this is the best solution for architectures that lack a multiply instruction.
Of course you can, if you can live with some loss in precision. If you know the value range of your input values, you can come up with a bit shift and a multiplication which is exact.
Here are some examples of how you can divide by 10, 60, ... as described in this blog about formatting time the fastest way possible:
temp = (ms * 205) >> 11; // 205/2048 is nearly the same as /10
To expand Alois's answer a bit, we can extend the suggested y = (ms * 205) >> 11 to a few more multiplier/shift pairs:
y = (ms * 1) >> 3 // first error 8
y = (ms * 2) >> 4 // 8
y = (ms * 4) >> 5 // 8
y = (ms * 7) >> 6 // 19
y = (ms * 13) >> 7 // 69
y = (ms * 26) >> 8 // 69
y = (ms * 52) >> 9 // 69
y = (ms * 103) >> 10 // 179
y = (ms * 205) >> 11 // 1029
y = (ms * 410) >> 12 // 1029
y = (ms * 820) >> 13 // 1029
y = (ms * 1639) >> 14 // 2739
y = (ms * 3277) >> 15 // 16389
y = (ms * 6554) >> 16 // 16389
y = (ms * 13108) >> 17 // 16389
y = (ms * 26215) >> 18 // 43699
y = (ms * 52429) >> 19 // 262149
y = (ms * 104858) >> 20 // 262149
y = (ms * 209716) >> 21 // 262149
y = (ms * 419431) >> 22 // 699059
y = (ms * 838861) >> 23 // 4194309
y = (ms * 1677722) >> 24 // 4194309
y = (ms * 3355444) >> 25 // 4194309
y = (ms * 6710887) >> 26 // 11184819
y = (ms * 13421773) >> 27 // 67108869
Each line is a single, independent calculation, and you'll see your first "error"/incorrect result at the value shown in the comment. You're generally better off taking the smallest shift for a given error value, as this minimises the extra bits needed to store the intermediate value in the calculation; e.g. (x * 13) >> 7 is "better" than (x * 52) >> 9 as it needs two fewer bits of overhead, while both start to give wrong answers above 68.
If you want to calculate more of these, the following (Python) code can be used:
def mul_from_shift(shift):
    mid = 2**shift + 5.
    return int(round(mid / 10.))
And I did the obvious thing to calculate where this approximation starts to go wrong:
def first_err(mul, shift):
    i = 1
    while True:
        y = (i * mul) >> shift
        if y != i // 10:
            return i
        i += 1
(note that // is Python's integer division; it floors the result, which for these positive values is the same as truncating towards zero)
The reason for the "3/1" pattern in the errors (i.e. 8 repeating three times followed by 9) seems to be the change of base, i.e. log2(10) is ~3.32. Plotting the relative error, given by mul_from_shift(shift) / (1 << shift) - 0.1, shows this pattern.
Considering Kuba Ober's response, there is another one in the same vein. It uses iterative approximation of the result, but I wouldn't expect any surprising performance.
Let's say we have to find x where x = v / 10.
We'll use the inverse operation v = x * 10, because it has the nice property that when x = a + b, then x * 10 = a * 10 + b * 10.
Let's use x as the variable holding the best approximation of the result so far. When the search ends, x will hold the result. We set each bit b of x from the most significant to the least significant, one by one, and compare (x + b) * 10 with v. If it is smaller than or equal to v, then the bit b is set in x. To test the next bit, we simply shift b one position to the right (divide by two).
We can avoid the multiplication by 10 by holding x * 10 and b * 10 in other variables.
This yields the following algorithm to divide v by 10:
uint16_t x = 0, x10 = 0, b = 0x1000, b10 = 0xA000;

while (b != 0) {
    uint16_t t = x10 + b10;
    if (t <= v) {
        x10 = t;
        x |= b;
    }
    b10 >>= 1;
    b >>= 1;
}
// x = v / 10
Edit: to get the algorithm of Kuba Ober, which avoids the need for the variable x10, we can subtract b10 directly from v instead; x10 is then no longer needed. The algorithm becomes
uint16_t x = 0, b = 0x1000, b10 = 0xA000;

while (b != 0) {
    if (b10 <= v) {
        v -= b10;
        x |= b;
    }
    b10 >>= 1;
    b >>= 1;
}
// x = v / 10
The loop may be unrolled and the different values of b and b10 may be precomputed as constants.
On architectures that can only shift one place at a time, a series of explicit comparisons against decreasing powers of two multiplied by 10 might work better than the solution from Hacker's Delight. Assuming a 16-bit dividend:
uint16_t div10(uint16_t dividend) {
    uint16_t quotient = 0;
#define div10_step(n) \
    do { if (dividend >= (n*10)) { quotient += n; dividend -= n*10; } } while (0)
    div10_step(0x1000);
    div10_step(0x0800);
    div10_step(0x0400);
    div10_step(0x0200);
    div10_step(0x0100);
    div10_step(0x0080);
    div10_step(0x0040);
    div10_step(0x0020);
    div10_step(0x0010);
    div10_step(0x0008);
    div10_step(0x0004);
    div10_step(0x0002);
    div10_step(0x0001);
#undef div10_step
    if (dividend >= 5) ++quotient; // round the result (optional)
    return quotient;
}
Well, division is subtraction, so yes: shift right by 1 (divide by 2), then subtract 5 from the result repeatedly, counting the number of subtractions until the value is less than 5. The result is the number of subtractions you did. Oh, and an actual divide is probably still going to be faster.
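A throwaway sketch of that repeated-subtraction idea (correct for unsigned values, but it needs n/10 subtractions, so it only makes sense for small inputs):

#include <stdint.h>

uint32_t div10_by_subtraction(uint32_t n)
{
    uint32_t half = n >> 1;   /* n / 2 */
    uint32_t count = 0;
    while (half >= 5) {       /* count how many 5s fit into n / 2 */
        half -= 5;
        count++;
    }
    return count;             /* == n / 10 */
}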
A hybrid strategy of shift right then divide by 5 using the normal division might get you a performance improvement if the logic in the divider doesn't already do this for you.
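For unsigned values that hybrid is a one-liner, since floor(n/10) == floor(floor(n/2)/5); whether it beats a plain n / 10 depends entirely on the divider hardware, as noted above:

#include <stdint.h>

uint32_t div10_hybrid(uint32_t n)
{
    return (n >> 1) / 5;   /* one shift, then a divide by the smaller constant 5 */
}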
I've designed a new method in AVR assembly, with lsr/ror and sub/sbc only. It divides by 8, then subtracts the number divided by 64 and 128, then subtracts the 1,024th and the 2,048th, and so on. It works very reliably (including exact rounding) and quickly (370 microseconds at 1 MHz).
The source code is here for 16-bit-numbers:
http://www.avr-asm-tutorial.net/avr_en/beginner/DIV10/div10_16rd.asm
The page that comments this source code is here:
http://www.avr-asm-tutorial.net/avr_en/beginner/DIV10/DIV10.html
I hope that it helps, even though the question is ten years old.
brgs, gsc
The code from elemakil's comments can be found here: https://doc.lagout.org/security/Hackers%20Delight.pdf, page 233, "Unsigned divide by 10 [and 11]".

How to make ARGB transparency using bitwise operators

I need to apply transparency, given 2 pixels:
pixel1: {A, R, G, B} - foreground pixel
pixel2: {A, R, G, B} - background pixel
A, R, G, B are byte values; each color channel is represented by one byte.
now I'm calculating transparency as:
newR = pixel2_R * alpha / 255 + pixel1_R * (255 - alpha) / 255
newG = pixel2_G * alpha / 255 + pixel1_G * (255 - alpha) / 255
newB = pixel2_B * alpha / 255 + pixel1_B * (255 - alpha) / 255
but it is too slow
I need to do it with bitwise operators (AND, OR, XOR, negation, bit shifts).
I want to do it on Windows Phone 7 XNA
---attached C# code---
public static uint GetPixelForOpacity(uint reduceOpacityLevel, uint pixelBackground, uint pixelForeground, uint pixelCanvasAlpha)
{
    byte surfaceR = (byte)((pixelForeground & 0x00FF0000) >> 16);
    byte surfaceG = (byte)((pixelForeground & 0x0000FF00) >> 8);
    byte surfaceB = (byte)((pixelForeground & 0x000000FF));

    byte sourceR = (byte)((pixelBackground & 0x00FF0000) >> 16);
    byte sourceG = (byte)((pixelBackground & 0x0000FF00) >> 8);
    byte sourceB = (byte)((pixelBackground & 0x000000FF));

    uint newR = sourceR * pixelCanvasAlpha / 256 + surfaceR * (255 - pixelCanvasAlpha) / 256;
    uint newG = sourceG * pixelCanvasAlpha / 256 + surfaceG * (255 - pixelCanvasAlpha) / 256;
    uint newB = sourceB * pixelCanvasAlpha / 256 + surfaceB * (255 - pixelCanvasAlpha) / 256;

    return (uint)255 << 24 | newR << 16 | newG << 8 | newB;
}
You can't do an 8 bit alpha blend using only bitwise operations, unless you basically re-invent multiplication with basic ops (8 shift-adds).
You can do two methods as mentioned in other answers: use 256 instead of 255, or use a lookup table. Both have issues, but you can mitigate them. It really depends on what architecture you're doing this on: the relative speed of multiply, divide, shift, add and memory loads. In any case:
Lookup table: a trivial 256x256 lookup table is 64 KB. This will thrash your data cache and end up being very slow. I wouldn't recommend it unless your CPU has an abysmally slow multiplier but does have low-latency RAM. You can improve things by throwing away some alpha bits, e.g. A >> 3, which gives a 32x256 = 8 KB table with a better chance of fitting in cache.
Use 256 instead of 255: the idea is that dividing by 256 is just a right shift by 8. This will be slightly off and tend to round down, darkening the image slightly; e.g. if R=255 and A=255, then (R*A)/256 = 254. You can cheat a little and compute (R*A+R+A)/256, or just (R*A+R)/256 or (R*A+A)/256, which give 255 in that case. Or scale A to 0..256 first, e.g. A = (256*A)/255; that's just one expensive divide-by-255 instead of 6, and then (R*A)/256 = 255.
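A minimal C sketch of that last variant (scale alpha to 0..256 once, then shift), assuming the 0xAARRGGBB packing used in the question's C# helper; the function and variable names are made up for illustration:

#include <stdint.h>

static uint32_t blend_256(uint32_t fg, uint32_t bg, uint32_t alpha /* 0..255 */)
{
    uint32_t a = (alpha * 256) / 255;          /* one real divide; a is now 0..256 */

    uint32_t fR = (fg >> 16) & 0xFF, fG = (fg >> 8) & 0xFF, fB = fg & 0xFF;
    uint32_t bR = (bg >> 16) & 0xFF, bG = (bg >> 8) & 0xFF, bB = bg & 0xFF;

    /* same formula as in the question, but dividing by 256 is just >> 8 */
    uint32_t newR = (bR * a + fR * (256 - a)) >> 8;
    uint32_t newG = (bG * a + fG * (256 - a)) >> 8;
    uint32_t newB = (bB * a + fB * (256 - a)) >> 8;

    return 0xFF000000u | (newR << 16) | (newG << 8) | newB;
}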
I don't think it can be done with the same precision using only those operators. Your best bet is, I reckon, using a LUT (as long as the LUT can fit in the CPU cache, otherwise it might even be slower)
// allocate the LUT (64KB)
unsigned char lut[256*256] __cacheline_aligned; // __cacheline_aligned is a GCC-ism

// macro to access the LUT
#define LUT(pixel, alpha) (lut[(alpha)*256+(pixel)])

// precompute the LUT
for (int alpha_value = 0; alpha_value < 256; alpha_value++) {
    for (int pixel_value = 0; pixel_value < 256; pixel_value++) {
        LUT(pixel_value, alpha_value) = (unsigned char)((double)(pixel_value) * (double)(alpha_value) / 255.0);
    }
}

// in the loop
unsigned char ialpha = 255 - alpha;
newR = LUT(pixel2_R, alpha) + LUT(pixel1_R, ialpha);
newG = LUT(pixel2_G, alpha) + LUT(pixel1_G, ialpha);
newB = LUT(pixel2_B, alpha) + LUT(pixel1_B, ialpha);
otherwise you should try vectorizing your code. But to do that you should at least provide us with more info on your CPU architecture and compiler. Keep in mind that your compiler might be able to vectorize automatically, if provided with the right options.
