OpenCL Intel Iris Integrated Graphics exits with Abort Trap 6: Timeout Issue - opencl

I am attempting to write a program that executes Monte Carlo simulations using OpenCL. I have run into an issue involving exponentials. When the value of the variable steps becomes large, approximately 20000, the calculation of the exponent fails unexpectedly, and the program quits with "Abort Trap: 6". This seems to be a bizarre error given that steps should not affect memory allocation. I have tried setting normal, alpha, and beta to 0, but this does not resolve the problem. However, commenting out the exponent and replacing it with the constant 1 seems to fix it. I have run my code on an AWS GPU instance, where it does not run into any issues. Does anybody have any ideas as to why this might be a problem on an integrated graphics card?
SOLUTION
Execute the kernel multiple times over smaller ranges to keep each kernel execution under 5 seconds.
Code Snippet
#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif
static uint MWC64X(uint2 *state) {
enum { A = 4294883355U };
uint x = (*state).x, c = (*state).y;
uint res = x ^ c;
uint hi = mul_hi(x, A);
x = x * A + c;
c = hi + (x < c);
*state = (uint2)(x, c);
return res;
}
__kernel void discreteMonteCarloKernel(...) {
float cumulativeWalk = stockPrice;
float currentValue = stockPrice;
...
uint n = get_global_id(0);
uint2 seed2 = (uint2)(n, seed);
uint random1 = MWC64X(&seed2);
uint2 seed3 = (uint2)(random1, seed);
uint random2 = MWC64X(&seed3);
float alpha = (interestRate - 0.5 * sigma * sigma) * dt;
float beta = sigma * sqrt(dt);
float u1;
float u2;
float a;
float b;
float normal;
for (int j = 0; j < steps; j++) {
random1 = MWC64X(&seed2);
if (random1 == 0) {
random1 = MWC64X(&seed2);
}
random2 = MWC64X(&seed3);
u1 = (float)random1 / (float)0xffffffff;
u2 = (float)random2 / (float)0xffffffff;
a = sqrt(-2 * log(u1));
b = 2 * M_PI * u2;
normal = a * sin(b);
exponent = exp(alpha + beta * normal);
currentValue = currentValue * exponent;
cumulativeWalk += currentValue;
...
}
Problem Report
Exception Type: EXC_CRASH (SIGABRT)
Exception Codes: 0x0000000000000000, 0x0000000000000000
Exception Note: EXC_CORPSE_NOTIFY
Application Specific Information:
abort() called
Application Specific Signatures:
Graphics hardware encountered an error and was reset: 0x00000813
Thread 0 Crashed:: Dispatch queue: opencl_runtime
0 libsystem_kernel.dylib 0x00007fffb14bad42 __pthread_kill + 10
1 libsystem_pthread.dylib 0x00007fffb15a85bf pthread_kill + 90
2 libsystem_c.dylib 0x00007fffb1420420 abort + 129
3 libGPUSupportMercury.dylib 0x00007fffa98e6fbf gpusGenerateCrashLog + 158
4 com.apple.driver.AppleIntelHD5000GraphicsGLDriver 0x000000010915f13b gpusKillClientExt + 9
5 libGPUSupportMercury.dylib 0x00007fffa98e7983 gpusQueueSubmitDataBuffers + 168
6 com.apple.driver.AppleIntelHD5000GraphicsGLDriver 0x00000001091aa031 IntelCLCommandBuffer::getNew(GLDQueueRec*) + 31
7 com.apple.driver.AppleIntelHD5000GraphicsGLDriver 0x00000001091a9f99 intelSubmitCLCommands(GLDQueueRec*, unsigned int) + 65
8 com.apple.driver.AppleIntelHD5000GraphicsGLDriver 0x00000001091b00a1 CHAL_INTEL::ChalContext::ChalFlush() + 83
9 com.apple.driver.AppleIntelHD5000GraphicsGLDriver 0x00000001091aa2c3 gldFinishQueue + 43
10 com.apple.opencl 0x00007fff9ffeeb37 0x7fff9ffed000 + 6967
11 com.apple.opencl 0x00007fff9ffef000 0x7fff9ffed000 + 8192
12 com.apple.opencl 0x00007fffa000ccca 0x7fff9ffed000 + 130250
13 com.apple.opencl 0x00007fffa001029d 0x7fff9ffed000 + 144029
14 libdispatch.dylib 0x00007fffb13568fc _dispatch_client_callout + 8
15 libdispatch.dylib 0x00007fffb1357536 _dispatch_barrier_sync_f_invoke + 83
16 com.apple.opencl 0x00007fffa001011d 0x7fff9ffed000 + 143645
17 com.apple.opencl 0x00007fffa000bda6 0x7fff9ffed000 + 126374
18 com.apple.opencl 0x00007fffa00011df clEnqueueReadBuffer + 813
19 simplisticComparison 0x0000000107b953cf BinomialMultiplication::execute(int) + 1791
20 simplisticComparison 0x0000000107b9ec7f main + 767
21 libdyld.dylib 0x00007fffb138c235 start + 1
Thread 1:
0 libsystem_pthread.dylib 0x00007fffb15a50e4 start_wqthread + 0
1 ??? 0x000070000eed6b30 0 + 123145552751408
Thread 2:
0 libsystem_pthread.dylib 0x00007fffb15a50e4 start_wqthread + 0
Thread 3:
0 libsystem_pthread.dylib 0x00007fffb15a50e4 start_wqthread + 0
1 ??? 0x007865646e496d65 0 + 33888479226719589
Thread 0 crashed with X86 Thread State (64-bit):
rax: 0x0000000000000000 rbx: 0x0000000000000006 rcx: 0x00007fff58074078 rdx: 0x0000000000000000
rdi: 0x0000000000000307 rsi: 0x0000000000000006 rbp: 0x00007fff580740a0 rsp: 0x00007fff58074078
r8: 0x0000000000000000 r9: 0x00007fffb140ba50 r10: 0x0000000008000000 r11: 0x0000000000000206
r12: 0x00007f92de80a7e0 r13: 0x00007f92e0008c00 r14: 0x00007fffba29e3c0 r15: 0x00007f92de801a00
rip: 0x00007fffb14bad42 rfl: 0x0000000000000206 cr2: 0x00007fffba280128
Logical CPU: 0
Error Code: 0x02000148
Trap Number: 133

I have a guess. The driver can crash in two ways:
1. We reference a bad buffer address. This is probably not your case.
2. We time out (exceed the TDR, the driver's timeout detection and recovery limit). A kernel has a few seconds to complete.
My money is on #2. If the larger value (steps) makes the GPU run too long, the system will kill things.
I am not familiar with the guts of Apple's Intel driver, but typically there is a way to disable the TDR in extreme cases. E.g. see the Windows documentation on TDRs to get the gist. (Linux drivers have a way to disable this too.)
Normally we want to avoid running things that take super long, and it might be a good idea to decompose the workload in some way so that you naturally don't hit this kill switch. E.g. perhaps split the "steps" into smaller chunks (pass in and save your state for parts you can't recompute).
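The chunking idea can be sketched on the host side. What follows is a minimal pure-Python stand-in, not OpenCL host code: run_chunk is a hypothetical placeholder for one short kernel launch, the per-path state tuple is an assumption about what would need to be saved, and the random-number handling is simplified to a single MWC64X stream. It only illustrates that carrying the state between short launches reproduces the result of one long launch.

```python
import math

MASK32 = 0xFFFFFFFF
A = 4294883355  # multiplier from the MWC64X code in the question


def mwc64x(x, c):
    """One MWC64X step; returns (random 32-bit value, new x, new c)."""
    res = x ^ c
    prod = x * A
    hi = prod >> 32                      # equivalent of mul_hi(x, A)
    x_new = (prod + c) & MASK32          # x * A + c, wrapped to 32 bits
    c_new = (hi + (1 if x_new < c else 0)) & MASK32
    return res, x_new, c_new


def run_chunk(state, n_steps, alpha, beta):
    """Hypothetical stand-in for one short kernel launch over n_steps."""
    x, c, current, cumulative = state
    for _ in range(n_steps):
        r1, x, c = mwc64x(x, c)
        r2, x, c = mwc64x(x, c)
        u1 = (r1 + 1) / 2.0**32          # in (0, 1], so log(u1) is defined
        u2 = r2 / 2.0**32
        normal = math.sqrt(-2.0 * math.log(u1)) * math.sin(2.0 * math.pi * u2)
        current *= math.exp(alpha + beta * normal)
        cumulative += current
    return (x, c, current, cumulative)


def run_chunked(state, steps, chunk, alpha, beta):
    """Split `steps` into launches of at most `chunk` steps, carrying state."""
    done = 0
    while done < steps:
        n = min(chunk, steps - done)
        state = run_chunk(state, n, alpha, beta)
        done += n
    return state
```

In a real host program the state would live in device buffers that each launch reads and writes back, and the chunk size would be tuned so that a single launch stays well under the driver's time limit.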

Related

GPU Driver does not respond after NDRangekernel increase

I am new to OpenCL and want to parallelise this prime sieve (Sieve of Atkin); the C++ code is here: https://www.geeksforgeeks.org/sieve-of-atkin/
I can't get good results out of it: the CPU version is actually much faster when I compare them. I tried to use an NDRange kernel to avoid writing the nested loops and hopefully increase performance, but when I pass a higher limit to the function, the GPU driver stops responding and the program crashes. Maybe my NDRange configuration is wrong; could anyone help with it? I probably don't understand NDRange properly. Here is the info about my GPU:
CL_DEVICE_NAME: GeForce GT 740M
CL_DEVICE_VENDOR: NVIDIA Corporation
CL_DRIVER_VERSION: 397.31
CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPU
CL_DEVICE_MAX_COMPUTE_UNITS: 2
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: 1024 / 1024 / 64
CL_DEVICE_MAX_WORK_GROUP_SIZE: 1024
CL_DEVICE_MAX_CLOCK_FREQUENCY: 1032 MHz
CL_DEVICE_ADDRESS_BITS: 32
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 512 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 2048 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
CL_DEVICE_LOCAL_MEM_TYPE: local
CL_DEVICE_LOCAL_MEM_SIZE: 48 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
CL_DEVICE_QUEUE_PROPERTIES:
-CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT: 1
CL_DEVICE_MAX_READ_IMAGE_ARGS: 256
CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 16
Here is my NDRange code:
queue.enqueueNDRangeKernel(add, cl::NDRange(1,1), cl::NDRange((limit * limit) -1, (limit * limit) -1 ), cl::NullRange,NULL, &event);
And my kernel code:
__kernel void sieveofAktin(const int limit, __global bool* sieve)
{
int x = get_global_id(0);
int y = get_global_id(1);
//printf("%d \n", x);
int n = (4 * x * x) + (y * y);
if (n <= limit && (n % 12 == 1 || n % 12 == 5))
sieve[n] ^= true;
n = (3 * x * x) + (y * y);
if (n <= limit && n % 12 == 7)
sieve[n] ^= true;
n = (3 * x * x) - (y * y);
if (x > y && n <= limit && n % 12 == 11)
sieve[n] ^= true;
for (int r = 5; r * r < limit; r++) {
if (sieve[r]) {
for (int i = r * r; i < limit; i += r * r)
sieve[i] = false;
}
}
}
You have lots of branching in that code, and I suspect that's what may be killing your performance on GPUs. Look at chapter 6 of the NVIDIA OpenCL Best Practices Guide for details on why this hurts performance.
I'm not sure how possible it is without looking closely at the algorithm, but ideally you want to rewrite the code to use as little branching as possible. Alternatively, you could look at other algorithms entirely.
As for the locking, I'd need to see more of your host code to know what is happening, but it's possible you're exceeding various limits of your platform/device. Are you checking for errors on every OpenCL function you call?
Regardless of how good or bad your algorithm or implementation is - the driver should always respond. Non-response is quite possibly a bug. File a bug report at http://developer.nvidia.com/ .
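One thing the host code shown in the question makes easy to check is how much work is being requested. The sketch below only counts work-items. It assumes the kernel is meant to mirror the nested x/y loops of the linked sequential code, which stop once x*x and y*y exceed limit; the function names are made up for illustration.

```python
import math


def requested_work_items(limit):
    """Work-items requested by cl::NDRange(limit*limit - 1, limit*limit - 1)."""
    side = limit * limit - 1
    return side * side


def useful_work_items(limit):
    """Work-items with x*x <= limit and y*y <= limit (x, y >= 1)."""
    side = math.isqrt(limit)
    return side * side


if __name__ == "__main__":
    for limit in (100, 1000):
        print(limit, requested_work_items(limit), useful_work_items(limit))
```

Under that assumption, a limit of 1000 requests roughly 10^12 work-items where fewer than a thousand can ever pass the n <= limit tests, and each of them also runs the final r loop. That amount of work may be enough on its own to trip the driver's watchdog.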

OpenCL oil painting

I want to implement an oil painting filter in OpenCL, but the output image is always black and I cannot figure out why.
Here's the kernel code:
__kernel void oil_painting(__global const char* R,__global const char* G,__global const char* B,
__global char* r,__global char* g,__global char* b)
{
int i=get_global_id(0);
int j=get_global_id(1);
int i1,j1,k;
int avgR[256],avgG[256],avgB[256],intensity_count[256];
int max_pixels=0,max_intensity=0,current_intensity;
for (i1=0;i1<4;i1++) {
for (j1=0;j1<4;j1++) {
current_intensity=(((R[(i+i1)*512+j+j1]+
G[(i+i1)*512+j+j1]+
B[(i+i1)*512+j+j1])/3)*70)/255;
intensity_count[current_intensity]++;
if (intensity_count[current_intensity]>max_pixels) {
max_pixels=intensity_count[current_intensity];
max_intensity=current_intensity;
}
avgR[current_intensity]+=R[(i+i1)*512+j+j1];
avgG[current_intensity]+=G[(i+i1)*512+j+j1];
avgB[current_intensity]+=B[(i+i1)*512+j+j1];
}
}
r[i*512+j]=min(255,max(0,avgR[max_intensity]/max_pixels));
g[i*512+j]=min(255,max(0,avgG[max_intensity]/max_pixels));
b[i*512+j]=min(255,max(0,avgB[max_intensity]/max_pixels));
}
Code snippets like the following are going to get you into a lot of trouble:
current_intensity=(((R[(i+i1)*512+j+j1]+
G[(i+i1)*512+j+j1]+
B[(i+i1)*512+j+j1])/3)*70)/255;
Consider what happens for a pixel of <127,127,127>:
127 + 127 + 127 = 125 (truncated because `char` is only 8 bits...)
125 / 3 = 41
41 * 70 = 54 (truncated because `char` is only 8 bits...)
54 / 255 = 0 (this will always equal 0!)
So intensity_count will only ever have its 0-th index incremented, and nothing else.
Casting everything to int might fix this problem.
current_intensity=((((int)R[(i+i1)*512+j+j1]+
(int)G[(i+i1)*512+j+j1]+
(int)B[(i+i1)*512+j+j1])/3)*70)/255;
New output:
127 + 127 + 127 = 381
381 / 3 = 127
127 * 70 = 8890
8890 / 255 = 34
But you've now got a new problem: what if the values are any higher than 127? Suppose we change this to use <200, 200, 200> instead?
-56 + -56 + -56 = -168 (`char` only has a range in [-128, 127]! You're overflowing!)
-168 / 3 = -56
-56 * 70 = -3920
-3920 / 255 = -15
And now you've crashed your program, because either you're going to attempt to access index -15, which is illegal, or (if the index is converted to an unsigned 64-bit value) index 2^64 - 15, which is still illegal. Either way, you're going to get bad results.
The simplest solution is to change your kernel arguments to global uchar * instead of global char *, and then make sure that all arithmetic is cast upwards to int or long so that overflow doesn't take place.
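To make the arithmetic above concrete, here is a small Python sketch that mimics how the same byte is read as char versus uchar and then runs the intensity formula on int-sized values. The helper names are invented for this example, and c_div exists only because Python's // rounds down where C's / truncates towards zero.

```python
def as_signed_char(byte):
    """Reinterpret a byte value 0..255 as a signed 8-bit char."""
    return byte - 256 if byte > 127 else byte


def c_div(a, b):
    """Integer division that truncates towards zero, like C's / operator."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def intensity(r, g, b):
    """The question's intensity formula, evaluated on int-sized values."""
    return c_div(c_div(r + g + b, 3) * 70, 255)


if __name__ == "__main__":
    s = as_signed_char(200)
    print(intensity(127, 127, 127))   # mid-grey pixel
    print(s, intensity(s, s, s))      # byte 200 read through a char pointer
    print(intensity(200, 200, 200))   # byte 200 read through a uchar pointer
```

With uchar inputs the result always lands in 0..70, so it is a valid index into the 256-entry arrays; with char inputs any channel above 127 can push it negative.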

Profiling Rcpp code on OS X

I am interested in profiling some Rcpp code under OS X (Mountain Lion 10.8.2), but the profiler crashes when being run.
Toy example, using inline, just designed to take enough time for a profiler to notice.
library(Rcpp)
library(inline)
src.cpp <- "
RNGScope scope;
int n = as<int>(n_);
double x = 0.0;
for ( int i = 0; i < n; i++ )
x += (unif_rand()-.5);
return wrap(x);"
src.c <- "
int i, n = INTEGER(n_)[0];
double x = 0.0;
GetRNGstate();
for ( i = 0; i < n; i++ )
x += (unif_rand()-.5);
PutRNGstate();
return ScalarReal(x);"
f.cpp <- cxxfunction(signature(n_="integer"), src.cpp, plugin="Rcpp")
f.c <- cfunction(signature(n_="integer"), src.c)
If I use either the GUI Instruments (in Xcode, version 4.5 (4523)) or the command line sample, both crash: Instruments crashes straight away, while sample completes processing samples before crashing:
# (in R)
set.seed(1)
f.cpp(200000000L)
# (in a separate terminal window)
~ » sample R # this invokes the profiler
Sampling process 81337 for 10 seconds with 1 millisecond of run time between samples
Sampling completed, processing symbols...
[1] 81654 segmentation fault sample 81337
If I do the same process but with the C version (i.e., f.c(200000000L)) both Instruments and sample work fine, and produce output like
Call graph:
1832 Thread_6890779 DispatchQueue_1: com.apple.main-thread (serial)
1832 start (in R) + 52 [0x100000e74]
1832 main (in R) + 27 [0x100000eeb]
1832 run_Rmainloop (in libR.dylib) + 80 [0x1000e4020]
1832 R_ReplConsole (in libR.dylib) + 161 [0x1000e3b11]
1832 Rf_ReplIteration (in libR.dylib) + 514 [0x1000e3822]
1832 Rf_eval (in libR.dylib) + 1010 [0x1000aa402]
1832 Rf_applyClosure (in libR.dylib) + 849 [0x1000af5d1]
1832 Rf_eval (in libR.dylib) + 1672 [0x1000aa698]
1832 do_dotcall (in libR.dylib) + 16315 [0x10007af3b]
1382 file1412f6e212474 (in file1412f6e212474.so) + 53 [0x1007fded5] file1412f6e212474.cpp:16
+ 862 unif_rand (in libR.dylib) + 1127,1099,... [0x10000b057,0x10000b03b,...]
+ 520 fixup (in libR.dylib) + 39,67,... [0x10000aab7,0x10000aad3,...]
356 file1412f6e212474 (in file1412f6e212474.so) + 70,61,... [0x1007fdee6,0x1007fdedd,...] file1412f6e212474.cpp:16
56 unif_rand (in libR.dylib) + 1133 [0x10000b05d]
38 DYLD-STUB$$unif_rand (in file1412f6e212474.so) + 0 [0x1007fdf1c]
I would really appreciate some advice into if there is anything I'm doing wrong, if there is some other preferred way, or if this is just not possible. Given that one of the main uses of Rcpp seems to be in speeding up R code, I'm surprised not to find more information on this, but perhaps I'm looking in the wrong place.
This is on OS X 10.8.2 with R 2.15.1 (x86_64-apple-darwin9.8.0), Rcpp 0.9.15, and g++ --version reports "i686-apple-darwin11-llvm-g++-4.2 (GCC) 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00)".
A solution
Thanks to Dirk's answer below, and his talk here http://dirk.eddelbuettel.com/papers/ismNov2009introHPCwithR.pdf, I have at least a partial solution using Google perftools. First, install from here http://code.google.com/p/gperftools/, and add -lprofiler to PKG_LIBS when compiling the C++ code. Then either
(a) Run R as CPUPROFILE=samples.log R, run all code and quit (or use Rscript)
(b) Use two small utility functions to turn on/off profiling:
RcppExport SEXP start_profiler(SEXP str) {
ProfilerStart(as<const char*>(str));
return R_NilValue;
}
RcppExport SEXP stop_profiler() {
ProfilerStop();
return R_NilValue;
}
Then, within R you can do
.Call("start_profiler", "samples.log")
# code that calls C++ code to be profiled
.Call("stop_profiler")
either way, the file samples.log will contain profiling information. This can be looked at with
pprof --text /Library/Frameworks/R.framework/Resources/bin/exec/x86_64/R samples.log
which produces output like
Using local file /Library/Frameworks/R.framework/Resources/bin/exec/x86_64/R.
Using local file samples.log.
Removing __sigtramp from all stack traces.
Total: 112 samples
64 57.1% 57.1% 64 57.1% _unif_rand
30 26.8% 83.9% 30 26.8% _process_system_Renviron
14 12.5% 96.4% 101 90.2% _for_profile
3 2.7% 99.1% 3 2.7% Rcpp::internal::expr_eval_methods
1 0.9% 100.0% 1 0.9% _Rf_PrintValueRec
0 0.0% 100.0% 1 0.9% 0x0000000102bbc1ff
0 0.0% 100.0% 15 13.4% 0x00007fff5fbfe06f
0 0.0% 100.0% 1 0.9% _Rf_InitFunctionHashing
0 0.0% 100.0% 1 0.9% _Rf_PrintValueEnv
0 0.0% 100.0% 112 100.0% _Rf_ReplIteration
which would probably be more informative on a real example.
I'm confused, your example is incomplete:
you don't show the (trivial) invocation of cfunction() and cxxfunction()
you don't show how you invoke the profiler
you aren't profiling the C or C++ code (!!)
Can you maybe edit the question and make it clearer?
Also, when I run this, the two examples give identical speed results, as they are essentially identical. [ Rcpp would also let you do this in one call with Rcpp sugar's random number functions. ]
R> library(Rcpp)
R> library(inline)
R>
R> src.cpp <- "
+ RNGScope scope;
+ int n = as<int>(n_);
+ double x = 0.0;
+ for ( int i = 0; i < n; i++ )
+ x += (unif_rand()-.5);
+ return wrap(x);"
R>
R> src.c <- "
+ int i, n = INTEGER(n_)[0];
+ double x = 0.0;
+ GetRNGstate();
+ for ( i = 0; i < n; i++ )
+ x += (unif_rand()-.5);
+ PutRNGstate();
+ return Rf_ScalarReal(x);"
R>
R> fc <- cfunction(signature(n_="int"), body=src.c)
R> fcpp <- cxxfunction(signature(n_="int"), body=src.c, plugin="Rcpp")
R>
R> library(rbenchmark)
R>
R> print(benchmark(fc(10000L), fcpp(10000L)))
test replications elapsed relative user.self sys.self user.child sys.child
1 fc(10000) 100 0.013 1 0.012 0 0 0
2 fcpp(10000) 100 0.013 1 0.012 0 0 0
R>

How to find x mod 15 without using any Arithmetic Operations?

We are given an unsigned integer x. Without using any arithmetic operators (i.e. + - / * or %), we are to find x mod 15. We may use bitwise operations.
As far as I could go, I got this based on 2 points.
a = a mod 15 = a mod 16 for a<15
Let a = x mod 15
then a = x - 15k (for some non-negative k).
ie a = x - 16k + k...
ie a mod 16 = ( x mod 16 + k mod 16 ) mod 16
ie a mod 15 = ( x mod 16 + k mod 16 ) mod 16
ie a = ( x mod 16 + k mod 16 ) mod 16
OK. Now to implement this. A mod 16 operation is basically & 0xF, and k is basically x>>4.
So a = ( (x & 0xF) + ((x>>4) & 0xF) ) & 0xF.
It boils down to adding two 4-bit numbers, which can be done with bit expressions.
sum[0] = a[0] ^ b[0]
sum[1] = a[1] ^ b[1] ^ (a[0] & b[0])
...
and so on
This seems like cheating to me. I'm hoping for a more elegant solution
This reminds me of an old trick from base 10 called "casting out the 9s". This was used for checking the result of large sums performed by hand.
In this case 123 mod 9 = (1 + 2 + 3) mod 9 = 6.
This happens because 9 is one less than the base of the digits (10). (Proof omitted ;) )
So, considering the number in base 16 (hex), you should be able to do:
0xABCDE123 mod 0xF = (0xA + 0xB + 0xC + 0xD + 0xE + 0x1 + 0x2 + 0x3) mod 0xF
= 0x42 mod 0xF
= 0x6
Now you'll still need to do some magic to make the additions disappear. But it gives the right answer.
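As a quick check of the digit-sum identity, here is a short Python sketch. It deliberately uses + for the sum, so it is not a solution to the puzzle as posed; it only demonstrates that repeatedly summing hex digits gives x mod 15. The function names are chosen for this example.

```python
def hex_digit_sum(x):
    """Sum of the base-16 digits of a non-negative integer."""
    total = 0
    while x:
        total += x & 0xF
        x >>= 4
    return total


def mod15(x):
    """x mod 15 by repeated hex digit sums ("casting out the Fs")."""
    while x > 15:
        x = hex_digit_sum(x)
    return 0 if x == 15 else x   # a final 15 is congruent to 0


if __name__ == "__main__":
    print(hex(hex_digit_sum(0xABCDE123)), mod15(0xABCDE123))
```

The identity holds because 16 is congruent to 1 mod 15, so every hex digit contributes just its own value regardless of position.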
UPDATE:
Here's a complete implementation in C++. The f lookup table maps each pair of hex digits (one byte) to their sum mod 15, which is the same as the byte mod 15. We then repack these results and reapply the table to half as much data each round.
#include <cstdint>
#include <iostream>
uint8_t f[256]={
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,0,
1,2,3,4,5,6,7,8,9,10,11,12,13,14,0,1,
2,3,4,5,6,7,8,9,10,11,12,13,14,0,1,2,
3,4,5,6,7,8,9,10,11,12,13,14,0,1,2,3,
4,5,6,7,8,9,10,11,12,13,14,0,1,2,3,4,
5,6,7,8,9,10,11,12,13,14,0,1,2,3,4,5,
6,7,8,9,10,11,12,13,14,0,1,2,3,4,5,6,
7,8,9,10,11,12,13,14,0,1,2,3,4,5,6,7,
8,9,10,11,12,13,14,0,1,2,3,4,5,6,7,8,
9,10,11,12,13,14,0,1,2,3,4,5,6,7,8,9,
10,11,12,13,14,0,1,2,3,4,5,6,7,8,9,10,
11,12,13,14,0,1,2,3,4,5,6,7,8,9,10,11,
12,13,14,0,1,2,3,4,5,6,7,8,9,10,11,12,
13,14,0,1,2,3,4,5,6,7,8,9,10,11,12,13,
14,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,0};
uint64_t mod15( uint64_t in_v )
{
uint8_t * in = (uint8_t*)&in_v;
// 12 34 56 78 12 34 56 78 => aa bb cc dd
in[0] = f[in[0]] | (f[in[1]]<<4);
in[1] = f[in[2]] | (f[in[3]]<<4);
in[2] = f[in[4]] | (f[in[5]]<<4);
in[3] = f[in[6]] | (f[in[7]]<<4);
// aa bb cc dd => AA BB
in[0] = f[in[0]] | (f[in[1]]<<4);
in[1] = f[in[2]] | (f[in[3]]<<4);
// AA BB => DD
in[0] = f[in[0]] | (f[in[1]]<<4);
// DD => D
return f[in[0]];
}
int main()
{
uint64_t x = 12313231;
std::cout<< mod15(x)<<" "<< (x%15)<<std::endl;
}
Your logic is flawed somewhere, but I can't put a finger on it. Think about it yourself: your final formula operates on the lowest 8 bits and ignores the rest. That could only be valid if the part you throw away (bits 9 and up) were always a multiple of 15. In reality, in binary, those bits always contribute a multiple of 16, not 15. For example, try putting 1 0000 0000 and 11 0000 0000 into your formula. It gives 0 for both, while the real answers are 1 and 3.
In essence, I'm almost sure that your task cannot be solved without loops. And if you are allowed to use loops, then nothing is easier than implementing a bitwiseAdd function and doing whatever you like with it.
Added:
Found your problem. Here it is:
... a = x - 15k (for some non-negative k).
... and k is basically x>>4
It equals x>>4 only by pure coincidence for some numbers. Take any big example, for instance x=11110000 (binary). By your calculation k = 1111 = 15, while in reality k = 16: 16*15 = 11110000.
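Both counterexamples are easy to reproduce. The sketch below writes the question's final formula with the parentheses it evidently intends (the name question_formula is made up here) and compares it with the true remainder.

```python
def question_formula(x):
    """The question's final formula, parenthesised as intended."""
    return ((x & 0xF) + ((x >> 4) & 0xF)) & 0xF


if __name__ == "__main__":
    for x in (0b100000000, 0b1100000000):
        print(bin(x), question_formula(x), x % 15)

    x = 0b11110000
    print(x >> 4, x // 15)   # the assumed k versus the real k
```

The first loop shows the formula returning 0 where the remainders are 1 and 3, and the last line shows x>>4 giving 15 where the true quotient is 16.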

Divide by 10 using bit shifts?

Is it possible to divide an unsigned integer by 10 by using pure bit shifts, addition, subtraction and maybe multiply? Using a processor with very limited resources and slow divide.
Editor's note: this is not actually what compilers do, and gives the wrong answer for large positive integers ending with 9, starting with div10(1073741829) = 107374183 not 107374182. It is exact for smaller inputs, though, which may be sufficient for some uses.
Compilers (including MSVC) do use fixed-point multiplicative inverses for constant divisors, but they use a different magic constant and shift on the high-half result to get an exact result for all possible inputs, matching what the C abstract machine requires. See Granlund & Montgomery's paper on the algorithm.
See Why does GCC use multiplication by a strange number in implementing integer division? for examples of the actual x86 asm gcc, clang, MSVC, ICC, and other modern compilers make.
This is a fast approximation that's inexact for large inputs
It's even faster than the exact division via multiply + right-shift that compilers use.
You can use the high half of a multiply result for divisions by small integral constants. Assume a 32-bit machine (code can be adjusted accordingly):
int32_t div10(int32_t dividend)
{
int64_t invDivisor = 0x1999999A;
return (int32_t) ((invDivisor * dividend) >> 32);
}
What's going on here is that we're multiplying by a close approximation of 1/10 * 2^32 and then removing the 2^32. This approach can be adapted to different divisors and different bit widths.
This works great for the ia32 architecture, since its IMUL instruction will put the 64-bit product into edx:eax, and the edx value will be the wanted value. Viz (assuming dividend is passed in eax and quotient returned in eax)
div10 proc
mov edx,1999999Ah ; load 1/10 * 2^32
imul eax ; edx:eax = dividend / 10 * 2 ^32
mov eax,edx ; eax = dividend / 10
ret
endp
Even on a machine with a slow multiply instruction, this will be faster than a software or even hardware divide.
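The accuracy limit mentioned in the editor's note can be checked with a few lines of Python. This is only a model of the C function above for non-negative inputs; INV_DIVISOR is the constant from the answer and div10_approx is a name chosen for this sketch.

```python
INV_DIVISOR = 0x1999999A  # constant from the answer, about 2**32 / 10


def div10_approx(dividend):
    """High half of the 64-bit product, for non-negative 32-bit inputs."""
    return (INV_DIVISOR * dividend) >> 32


if __name__ == "__main__":
    for n in (99, 1073741819, 1073741829):
        print(n, div10_approx(n), n // 10)
```

The multiplier is slightly larger than 2^32 / 10, so the excess grows with the dividend and eventually tips an input ending in 9 over to the next quotient; 1073741829 is where that first happens.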
Though the answers given so far match the actual question, they do not match the title. So here's a solution heavily inspired by Hacker's Delight that really uses only bit shifts.
unsigned divu10(unsigned n) {
unsigned q, r;
q = (n >> 1) + (n >> 2);
q = q + (q >> 4);
q = q + (q >> 8);
q = q + (q >> 16);
q = q >> 3;
r = n - (((q << 2) + q) << 1);
return q + (r > 9);
}
I think that this is the best solution for architectures that lack a multiply instruction.
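For anyone who wants to experiment with it, here is a direct Python port of the function above. The explicit masks emulate 32-bit unsigned wrap-around, which is an assumption about the width of unsigned on the target; the structure otherwise follows the C code line by line.

```python
MASK32 = 0xFFFFFFFF


def divu10(n):
    """Python port of the shift-and-add divu10 for 32-bit unsigned n."""
    q = (n >> 1) + (n >> 2)            # about n * 0.75
    q = (q + (q >> 4)) & MASK32
    q = (q + (q >> 8)) & MASK32
    q = (q + (q >> 16)) & MASK32       # about n * 0.8
    q >>= 3                            # about n / 10, possibly one too small
    r = (n - (((q << 2) + q) << 1)) & MASK32   # r = n - q * 10
    return q + (1 if r > 9 else 0)     # correct the estimate if needed


if __name__ == "__main__":
    for n in (9, 10, 99, 1073741829, 0xFFFFFFFF):
        print(n, divu10(n), n // 10)
```

The shifts build an approximation of n * 0.8, the final shift by 3 turns that into roughly n / 10, and the remainder test fixes the cases where the estimate is one too low.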
Of course you can, if you can live with some loss in precision. If you know the value range of your input, you can come up with a bit shift and a multiplication that is exact over that range.
Here are some examples of how you can divide by 10, 60, ..., as described in this blog about formatting time as fast as possible.
temp = (ms * 205) >> 11; // 205/2048 is nearly the same as /10
To expand on Alois's answer a bit, we can extend the suggested y = (x * 205) >> 11 to a few more multipliers/shifts:
y = (ms * 1) >> 3 // first error 8
y = (ms * 2) >> 4 // 8
y = (ms * 4) >> 5 // 8
y = (ms * 7) >> 6 // 19
y = (ms * 13) >> 7 // 69
y = (ms * 26) >> 8 // 69
y = (ms * 52) >> 9 // 69
y = (ms * 103) >> 10 // 179
y = (ms * 205) >> 11 // 1029
y = (ms * 410) >> 12 // 1029
y = (ms * 820) >> 13 // 1029
y = (ms * 1639) >> 14 // 2739
y = (ms * 3277) >> 15 // 16389
y = (ms * 6554) >> 16 // 16389
y = (ms * 13108) >> 17 // 16389
y = (ms * 26215) >> 18 // 43699
y = (ms * 52429) >> 19 // 262149
y = (ms * 104858) >> 20 // 262149
y = (ms * 209716) >> 21 // 262149
y = (ms * 419431) >> 22 // 699059
y = (ms * 838861) >> 23 // 4194309
y = (ms * 1677722) >> 24 // 4194309
y = (ms * 3355444) >> 25 // 4194309
y = (ms * 6710887) >> 26 // 11184819
y = (ms * 13421773) >> 27 // 67108869
Each line is a single, independent calculation, and you'll see your first "error" (incorrect result) at the value shown in the comment. You're generally better off taking the smallest shift for a given error value, as this minimises the extra bits needed to store the intermediate value in the calculation, e.g. (x * 13) >> 7 is "better" than (x * 52) >> 9 as it needs two fewer bits of overhead, while both start to give wrong answers above 68.
If you want to calculate more of these, the following (Python) code can be used:
def mul_from_shift(shift):
    mid = 2**shift + 5.
    return int(round(mid / 10.))
and I did the obvious thing for calculating when this approximation starts to go wrong with:
def first_err(mul, shift):
    i = 1
    while True:
        y = (i * mul) >> shift
        if y != i // 10:
            return i
        i += 1
(note that // is floor division; for the non-negative values used here that is the same as truncating towards zero)
The reason for the "3/1" pattern in the errors (i.e. 8 repeats 3 times, followed by a single 19) seems to be due to the change in bases, i.e. log2(10) is ~3.32. The original answer plotted the errors at this point (plot not reproduced here),
where the relative error is given by: mul_from_shift(shift) / (1<<shift) - 0.1
Considering Kuba Ober’s response, there is another one in the same vein.
It uses iterative approximation of the result, but I wouldn’t expect any surprising performance.
Let’s say we have to find x where x = v / 10.
We’ll use the inverse operation v = x * 10, because it has the nice property that when x = a + b, then x * 10 = a * 10 + b * 10.
Let’s use x as the variable holding the best approximation of the result so far. When the search ends, x will hold the result. We’ll set each bit b of x from the most significant to the least significant, one by one, and compare (x + b) * 10 with v. If it’s smaller than or equal to v, then the bit b is set in x. To test the next bit, we simply shift b one position to the right (divide by two).
We can avoid the multiplication by 10 by holding x * 10 and b * 10 in other variables.
This yields the following algorithm to divide v by 10.
uint16_t x = 0, x10 = 0, b = 0x1000, b10 = 0xA000;
while (b != 0) {
// form the candidate in 32 bits: x10 + b10 can exceed 65535 when v >= 61440
uint32_t t = (uint32_t)x10 + b10;
if (t <= v) {
x10 = (uint16_t)t;
x |= b;
}
b10 >>= 1;
b >>= 1;
}
// x = v / 10
Edit: to get the algorithm of Kuba Ober, which avoids the need for the variable x10, we can subtract b10 from v instead. In this case x10 isn’t needed anymore. The algorithm becomes
uint16_t x = 0, b = 0x1000, b10 = 0xA000;
while (b != 0) {
if (b10 <= v) {
v -= b10;
x |= b;
}
b10 >>= 1;
b >>= 1;
}
// x = (original v) / 10; v now holds the remainder
The loop may be unrolled and the different values of b and b10 precomputed as constants.
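The subtractive variant is small enough to check exhaustively. Here is a Python transcription for 16-bit inputs; the function name and the choice to return the remainder as well are additions for this sketch.

```python
def div10_shift_subtract(v):
    """Return (quotient, remainder) of a 16-bit v divided by 10."""
    x = 0
    b = 0x1000        # highest quotient bit needed: 65535 // 10 < 0x2000
    b10 = 0xA000      # always equal to b * 10
    while b != 0:
        if b10 <= v:  # does this quotient bit fit into what is left of v?
            v -= b10
            x |= b
        b10 >>= 1
        b >>= 1
    return x, v


if __name__ == "__main__":
    print(div10_shift_subtract(65535))
    print(div10_shift_subtract(12345))
```

Because v only ever decreases, nothing in this version can overflow 16 bits, and whatever is left in v at the end is the remainder.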
On architectures that can only shift one place at a time, a series of explicit comparisons against decreasing powers of two multiplied by 10 might work better than the solution from Hacker's Delight. Assuming a 16-bit dividend:
uint16_t div10(uint16_t dividend) {
uint16_t quotient = 0;
#define div10_step(n) \
do { if (dividend >= (n*10)) { quotient += n; dividend -= n*10; } } while (0)
div10_step(0x1000);
div10_step(0x0800);
div10_step(0x0400);
div10_step(0x0200);
div10_step(0x0100);
div10_step(0x0080);
div10_step(0x0040);
div10_step(0x0020);
div10_step(0x0010);
div10_step(0x0008);
div10_step(0x0004);
div10_step(0x0002);
div10_step(0x0001);
#undef div10_step
if (dividend >= 5) ++quotient; // round the result (optional)
return quotient;
}
Well division is subtraction, so yes. Shift right by 1 (divide by 2). Now subtract 5 from the result, counting the number of times you do the subtraction until the value is less than 5. The result is number of subtractions you did. Oh, and dividing is probably going to be faster.
A hybrid strategy of shift right then divide by 5 using the normal division might get you a performance improvement if the logic in the divider doesn't already do this for you.
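The shift-then-subtract idea can be written down in a few lines. This is a sketch of the description above rather than a recommendation, since the loop count grows linearly with the input; the function name is invented for the example.

```python
def div10_by_subtraction(n):
    """Divide a non-negative integer by 10: halve it, then count fives."""
    half = n >> 1          # shift right by 1 (divide by 2)
    count = 0
    while half >= 5:       # subtract 5 until less than 5 remains
        half -= 5
        count += 1
    return count


if __name__ == "__main__":
    for n in (9, 10, 99, 12345):
        print(n, div10_by_subtraction(n))
```

It works because floor(floor(n / 2) / 5) equals floor(n / 10) for non-negative n.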
I've designed a new method in AVR assembly, using lsr/ror and sub/sbc only. It divides by 8, then subtracts the number divided by 64 and 128, then subtracts the 1,024th and the 2,048th, and so on. It works very reliably (including exact rounding) and quickly (370 microseconds at 1 MHz).
The source code is here for 16-bit-numbers:
http://www.avr-asm-tutorial.net/avr_en/beginner/DIV10/div10_16rd.asm
The page that comments this source code is here:
http://www.avr-asm-tutorial.net/avr_en/beginner/DIV10/DIV10.html
I hope that it helps, even though the question is ten years old.
brgs, gsc
elemakil's comments' code can be found here: https://doc.lagout.org/security/Hackers%20Delight.pdf
page 233. "Unsigned divide by 10 [and 11.]"

Resources