Can "swap nybble" and "byte mask" tricks do multi byte logical shift by 4 much faster than the naive method of using bit shift chains - microcontroller

I'm writing a fixed-point (16Q16) algo that does division using the Newton–Raphson method outlined on Wikipedia. (Related SE question HERE.)
The first step requires a logical right shift by anywhere from 1 to 16 bits. My CPU is an 8-bit microcontroller, so it does not have a barrel shifter; the hardware can only shift by 1 bit at a time. (Related SE question HERE.) To shift by n bits, one can use the single-shift instruction n times; however, this has significant worst-case timing, and that bad worst case becomes abysmal when compounded by the multi-byte nature of the shift. If it takes 1 cycle to shift 1 byte by 1 bit, we can quickly imagine the problem when needing to shift 16 bytes 16 times. Remember, this is just step 1.
The obvious optimization is to divide and conquer: handle each power-of-2 shift amount separately. The first, trivial cases are shifts by 16 and 8, since these amount to nothing more than changing the byte index into the fixed-point number. Doing this, it only takes ~4 cycles/instructions to shift 16 bytes by 16 bits, a 64x speed improvement over "single shift chaining."
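To make the byte-granularity case concrete, here is a minimal C sketch (my own illustration, not the PIC code; it assumes the value is stored little-endian in a byte array, least-significant byte first):
#include <stdint.h>
#include <string.h>

/* Logical right shift by whole bytes: slide the higher bytes down by
 * shift/8 positions and zero-fill the vacated top bytes. No per-bit
 * work is needed, which is why the 8- and 16-bit cases are so cheap. */
static void shift_right_whole_bytes(uint8_t *v, size_t len, size_t byte_shift)
{
    if (byte_shift >= len) {
        memset(v, 0, len);
        return;
    }
    memmove(v, v + byte_shift, len - byte_shift); /* move bytes toward the LSB */
    memset(v + len - byte_shift, 0, byte_shift);  /* clear the vacated top bytes */
}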
The problem I'm having is pesky shifts by 4.
My intuition, as well as a few snippets and posts here and there, tells me that there exists an efficient way to combine nybble-swap instructions with bit masks and logical ops to create a sort of "nybble swap zipper effect" that could optimally shift an arbitrary-length array by 4 bits. (Related SE question HERE. Links to answer 3)
I KNOW it can at least fundamentally be done, because I have a proof of concept using only SWAP, XOR, and AND instructions. I also know it can be done FASTER, because what I have is one cycle faster (lol, yes, one) than a simple shift-chain method (see code box below). What I do not know is...
Can this be done with a time complexity much closer to one cycle per byte than to one cycle per bit using any kind of nybble swap byte mask trick?
Note: This is PIC18 ASM, but it's pretty obvious what's going on. Any suggestion on how this can be majorly improved would be an answer to this question. YES! I realize that it may very well be close to optimal already, but realize that shifting by 4 is a recurring hot spot in a few pieces of code. I'm expecting to cut at least one instruction from each block. Cutting out two would be amazing.
; Shift the denominator >> 4 (19 cyc)
;------------------------------------------
SWAPF Denom+3, W, BANKED  ; W = Denom+3 with nybbles swapped
MOVWF Denom+3, BANKED     ; Denom+3 = its own nybble swap
ANDLW 0xF0                ; W = old low nybble of Denom+3, now in the high half
XORWF Denom+3, F, BANKED  ; Denom+3 = old Denom+3 >> 4; W still carries its old low nybble
SWAPF Denom+2, F, BANKED  ; Denom+2 = its own nybble swap
XORWF Denom+2, F, BANKED  ; mix the carried nybble (from Denom+3) into the high half
XORWF Denom+2, W, BANKED  ; W = swapped Denom+2 recovered (undo the mix, result to W)
ANDLW 0xF0                ; W = old low nybble of Denom+2, to carry into Denom+1
XORWF Denom+2, F, BANKED  ; Denom+2 = low(old Denom+3) : high(old Denom+2) -- byte done
SWAPF Denom+1, F, BANKED  ; same zipper step for Denom+1...
XORWF Denom+1, F, BANKED
XORWF Denom+1, W, BANKED
ANDLW 0xF0
XORWF Denom+1, F, BANKED
SWAPF Denom+0, F, BANKED  ; ...and for Denom+0
XORWF Denom+0, F, BANKED
XORWF Denom+0, W, BANKED
ANDLW 0xF0
XORWF Denom+0, F, BANKED
;------------------------------------------

Your solution involves 19 instructions (which each take one cycle) but the simple solution just takes 18:
RRCF Denom+3, F, BANKED   ; 1st single-bit right shift of the 4-byte value
RRCF Denom+2, F, BANKED   ; (the carry chains each byte's LSB into the next byte's MSB)
RRCF Denom+1, F, BANKED
RRCF Denom+0, F, BANKED
RRCF Denom+3, F, BANKED   ; 2nd shift
RRCF Denom+2, F, BANKED
RRCF Denom+1, F, BANKED
RRCF Denom+0, F, BANKED
RRCF Denom+3, F, BANKED   ; 3rd shift
RRCF Denom+2, F, BANKED
RRCF Denom+1, F, BANKED
RRCF Denom+0, F, BANKED
RRCF Denom+3, F, BANKED   ; 4th shift
RRCF Denom+2, F, BANKED
RRCF Denom+1, F, BANKED
RRCF Denom+0, F, BANKED
MOVLW 0x0F                ; clear the four unknown carry bits that rotated
ANDWF Denom+3, F, BANKED  ; into the top of Denom+3, making the shift logical
I can't think of anything faster than that which actually performs this shift.
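For reference, the operation both snippets implement is just this per-byte nybble merge; a hedged C sketch (my illustration, same little-endian byte order assumed above):
#include <stdint.h>
#include <stddef.h>

/* Logical right shift by 4 of a little-endian multi-byte value: each
 * result byte is its own high nybble moved down, OR'ed with the low
 * nybble of the next-higher byte moved up. */
static void shift_right_4(uint8_t *v, size_t len)
{
    for (size_t i = 0; i + 1 < len; i++)
        v[i] = (uint8_t)((v[i] >> 4) | (v[i + 1] << 4));
    v[len - 1] >>= 4; /* most-significant byte is zero-filled from the left */
}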

Related

Calculate the probabilities by simulation in R programming

You have all the black cards from a normal deck, I have all the red cards. We each choose one card from our half decks - the highest card wins. But I have managed to sneak out and throw away your 9 of Clubs! Simulate 1000 games - how many did I win? (Ace is low.)
reds = rep(1:13, 2)
n = 1000
sum(sample(reds[-9], n, rep = T) < sample(reds, n, rep = T))
This is my code. I want to find how many times I win when simulating 1000 games with the 9 of Clubs removed from the black cards.
If I understood correctly, you just need to divide the total number of wins by the total number of simulations:
sum(sample(reds[-9], n, rep = T) < sample(reds, n, rep = T))/n

Solving power equation in Maple

I have a function
f := x -> -5.582656463587253/L^1.877207104415696;
If I try to solve for x with
solve(abs(f(x)) = 3, x);
it takes an awful lot of time to compute, and if I do it multiple times, my computer breaks down.
Shouldn't it be a simple
abs(-5.582656463587253/L^1.877207104415696) = 3
5.582656463587253/L^1.877207104415696 = 3
L^1.877207104415696 = 5.582656463587253/3
L = (5.582656463587253/3)^(1/1.877207104415696)
= 1.392134989
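A quick numeric check of that algebra outside Maple (a throwaway C snippet, purely illustrative):
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* L = (5.582656463587253 / 3)^(1 / 1.877207104415696) */
    double L = pow(5.582656463587253 / 3.0, 1.0 / 1.877207104415696);
    printf("L = %.9f\n", L); /* approximately 1.392134989 */
    return 0;
}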
Computers are, on a basic level, stupid. And literal. Recent advances in so-called artificial intelligence notwithstanding.
Thus you find yourself giving it the task of solving for x in an equation that does not contain the symbol x, only the undeclared symbol L. And obviously, in your example computation, you solve for L, not for x.

Using local memory to speed calculation

Should be an easy one but my OpenCL skills are completely rusty. :)
I have a simple kernel that does the sum of two arrays:
__kernel void sum(__global float* a, __global float* b, __global float* c)
{
    __private size_t gid = get_global_id(0);
    c[gid] = log(sqrt(exp(cos(sin(a[gid]))))) + log(sqrt(exp(cos(sin(b[gid])))));
}
It's working fine.
Now I'm trying to use local memory hoping it could speed things up:
__kernel void sum_with_local_copy(__global float* a, __global float* b, __global float* c,
                                  __local float* tmpa, __local float* tmpb, __local float* tmpc)
{
    __private size_t gid = get_global_id(0);
    __private size_t lid = get_local_id(0);
    __private size_t grid = get_group_id(0);
    __private size_t lsz = get_local_size(0);
    event_t evta = async_work_group_copy(tmpa, a + grid * lsz, lsz, 0);
    wait_group_events(1, &evta);
    event_t evtb = async_work_group_copy(tmpb, b + grid * lsz, lsz, 0);
    wait_group_events(1, &evtb);
    tmpc[lid] = log(sqrt(exp(cos(sin(tmpa[lid]))))) + log(sqrt(exp(cos(sin(tmpb[lid])))));
    event_t evt = async_work_group_copy(c + grid * lsz, tmpc, lsz, 0);
    wait_group_events(1, &evt);
}
But there are two issues with this kernel:
it's something like 3 times slower than the naive implementation
the results are wrong starting at index 64
My local-size is the max workgroup size.
So my questions are:
1) Am I missing something obvious or is there really a subtlety?
2) How to use local memory to speed up the computation?
3) Should I loop inside the kernel so that each work-item does more than one operation?
Thanks in advance.
Your simple kernel is already optimal with respect to work-group performance.
Local memory will only improve performance in cases where multiple work-items in a work-group read from the same address in local memory. As there is no shared data in your kernel, there is no gain to be had from transferring data from global to local memory, hence the slow-down.
As for point 3, you may see a gain by processing multiple values per thread (depending on how expensive your computation is and what hardware you have).
As you probably know you can explicitly set the local work group size (LWS) when executing your kernel using:
clEnqueueNDRangeKernel( ... bunch of args include Local Work Size ...);
as discussed here. But as already mentioned by Kyle, you don't really have to do this, because OpenCL tries to pick the best value for the LWS when you pass NULL for the LWS argument.
Indeed, the specification says: "local_work_size can also be a NULL value in which case the OpenCL implementation will determine how to break the global work-items into appropriate work-group instances."
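For concreteness, letting the runtime choose just means passing NULL for the local size (a sketch; the variable names follow the snippets below and are placeholders):
size_t global_work_size = n;
/* Passing NULL for local_work_size lets the OpenCL implementation
 * pick the work-group size itself. */
cl_int err = clEnqueueNDRangeKernel(cmd_queue, kernel[0], 1,
                                    NULL,               /* global work offset */
                                    &global_work_size,  /* global work size */
                                    NULL,               /* local work size: let the runtime decide */
                                    0, NULL, NULL);     /* no wait list, no event */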
I was curious to see how this played out in your case, so I set up your calculation to verify the performance against the default value chosen by OpenCL on my device.
In case you're interested, I set up some arbitrary data:
int n = powl(2, 20);
float* a = (float*)malloc(sizeof(float)*n);
float* b = (float*)malloc(sizeof(float)*n);
float* results = (float*)malloc(sizeof(float)*n);
for (int i = 0; i < n; i++) {
    a[i] = (float)i;
    b[i] = (float)(n - i);
    results[i] = 0.f;
}
and then, after defining all of the other OpenCL structures, I varied lws = VALUE from 2 to 256 (the max allowed on my device for this kernel) in powers of 2, and measured the wall-clock time (note: one can also use OpenCL events):
struct timeval timer;
int trials = 100;
gettimeofday(&timer, NULL);
double t0 = timer.tv_sec + (timer.tv_usec / 1000000.0);
// ---------- Execution ----------
size_t global_work_size = n;
size_t lws[] = {VALUE}; // VALUE was varied from 2 to 256 in powers of 2.
for (int trial = 0; trial < trials; trial++) {
    clEnqueueNDRangeKernel(cmd_queue, kernel[0], 1, NULL, &global_work_size, lws, 0, NULL, NULL);
}
clFinish(cmd_queue);
gettimeofday(&timer, NULL);
double t1 = timer.tv_sec + (timer.tv_usec / 1000000.0);
double avgTime = (t1 - t0) / trials;
I then plotted the total execution time as a function of the LWS, and as expected the performance varies by quite a bit until the best value, LWS = 256, is reached. For LWS > 256, the memory on my device is exceeded by this kernel.
FYI, for these tests I am running a laptop GPU: AMD ATI Radeon HD 6750M, max compute units = 6 and CL_DEVICE_LOCAL_MEM_SIZE = 32768 (so no big screamer compared to other GPUs).
Here are the raw numbers:
LWS time(sec)
2 14.004
4 6.850
8 3.431
16 1.722
32 0.866
64 0.438
128 0.436
256 0.436
Next, I checked the default value chosen by OpenCL (passing NULL for the LWS) and this corresponds to the best value that I found by profiling, i.e., LWS = 256.
So in the code you set up, you found one of the suboptimal cases, and as mentioned before, it's best to let OpenCL pick the best values for the local work groups, especially when there is no shared data in your kernel between multiple work-items in a work-group.
As to the error you got, you probably violated a constraint (from the spec):
The total number of work-items in the work-group must be less than or equal to the CL_DEVICE_MAX_WORK_GROUP_SIZE
Did you check that in detail, by querying the CL_DEVICE_MAX_WORK_GROUP_SIZE for your device?
Adding to what Kyle has written: it has to be multiple work-items reading from the same address; if it's just each work-item itself reading multiple times from the same address, then again local memory won't help you any; just use the work-item's private memory, i.e. variables you define within your kernel.
Also, some points not related to the use of local memory:
log(sqrt(exp(x))) = log(exp(x)) / 2 = x / 2 ... assuming it's the natural logarithm.
log(sqrt(exp(x))) = log(exp(x)) / 2 = x / (2 ln(2)) ... assuming it's the base-2 logarithm. Compute ln(2) in advance, of course.
If you really did have some complex function-of-a-function-of-a-function, you might be better off using a Taylor series expansion. For example, your function expands to 1/2-x^2/4+(5 x^4)/48+O(x^6) (order 5).
The last term is an error term, which you can bound from above to choose the appropriate order for the expansion; the error term should not be that large for 'well-behaved' functions. The Taylor expansion calculation might even benefit from further parallelization (but then again, it might not).
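A quick way to sanity-check both the simplification and the series outside the kernel, as a hedged plain-C snippet (assuming the natural logarithm, which is what OpenCL's log() computes):
#include <math.h>
#include <stdio.h>

/* Compare the original expression log(sqrt(exp(cos(sin(x))))) with the
 * simplified form cos(sin(x))/2 and with the order-5 Taylor series
 * 1/2 - x^2/4 + 5*x^4/48 for a few small x. */
int main(void)
{
    for (double x = -0.5; x <= 0.5; x += 0.25) {
        double full   = log(sqrt(exp(cos(sin(x)))));
        double exact  = cos(sin(x)) / 2.0;
        double series = 0.5 - x * x / 4.0 + 5.0 * x * x * x * x / 48.0;
        printf("x=%+.2f  full=%.9f  exact=%.9f  series=%.9f\n",
               x, full, exact, series);
    }
    return 0;
}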

ACM 4744 Brute-Force Algorithm

The full context of the problem can be seen here: Details.
You can also try my source code to plot the recursion for small numbers: Pastebin
I'm looking at this problem the math way; it's a nested recursion and looks as follows:
Function Find(integer n, function func)
    If n = 1
        For i = 1 to a do func()
    Elseif n = 2
        For i = 1 to b do func()
    Else Find(n-1, Find(n-2, func))

Function Main
    Find(n, funny)
My implementation in Mathematica, without the modulo operation, is:
$IterationLimit = Infinity
Clear[A]
A[a_, b_, f_, 1] := A[a, b, f, 1] = (f a);
A[a_, b_, f_, 2] := A[a, b, f, 2] = (f b);
A[a_, b_, f_, n_] :=
  A[a, b, f, n] = (A[a, b, A[a, b, f, n - 2], n - 1]);
This reveals some nice output for general a and b:
A[a, b, funny, 1]
a funny
A[a, b, funny, 2]
b funny
A[a, b, funny, 3]
a b funny
A[a, b, funny, 4]
a b^2 funny
A[a, b, funny, 5]
a^2 b^3 funny
A[a, b, funny, 6]
a^3 b^5 funny
So when we look at how often func is called, it seems to be a^(F(n-2)) * b^(F(n-1)),
with F(n) as the n-th Fibonacci number. So my problem is: how do I get very huge Fibonacci numbers modulo p? I did a lot of research on this, read about the cycle lengths of the Fibonacci sequence, and tried some recursion with
F(a+b) = F(a+1) * F(b) + F(a) * F(b-1)
but it seems like the recursion depth (log2(1,000,000,000) ≈ 30) when splitting n into two numbers is way too much, even with a depth-first recursion:
a = floor(n/2)
b = ceiling(n/2)
When I have the Fibonacci numbers, the multiplication and exponentiation should not be a problem, in my view.
Unfortunately not :/
I'm still stuck on the problem. Computing the Fibonacci numbers in the exponent first did not solve the problem correctly; it was a wrong math formula I applied there :/
So I thought of other ways of computing the formula:
(a^(Fibonacci(n-2))*b^(Fibonacci(n-1))) mod p
But as the Fibonacci numbers get really large, I am assuming that there must be an easier way than computing the whole Fibonacci number and then applying the discrete exponential function with BigInteger/BigFloat. Does someone have a hint for me? I see no further progress. Thanks
So this is where I am so far; it might be just a little thing I'm missing, so I'm looking forward to your replies.
Thanks
If it's about calculating Fibonacci numbers, there is a non-recursive, non-iterative formula for them. It's featured prominently on the Dutch Wikipedia page about Fibonacci numbers, but not so much on the English page.
F(n) = ((1 + sqrt(5))^n - (1 - sqrt(5))^n) / (2^n * sqrt(5))
Source: http://nl.wikipedia.org/wiki/Rij_van_Fibonacci
Maybe there's something you can do with this formula.
You might find helpful my ruminations on various ways to compute the Fibonacci and Lucas numbers. In there I show how to do the computation using a recursive scheme that is basically O(log2(n)). It works very nicely for large Fibonacci numbers. And if you do it all modulo some small number, you need not even use a big-integer tool for the computations. This would be blindingly fast for even huge Fibonacci numbers. The one below is only moderately large.
fibonacci(10000)
ans =
33644764876431783266621612005107543310302148460680063906564769974680
081442166662368155595513633734025582065332680836159373734790483865268263
040892463056431887354544369559827491606602099884183933864652731300088830
269235673613135117579297437854413752130520504347701602264758318906527890
855154366159582987279682987510631200575428783453215515103870818298969791
613127856265033195487140214287532698187962046936097879900350962302291026
368131493195275630227837628441540360584402572114334961180023091208287046
088923962328835461505776583271252546093591128203925285393434620904245248
929403901706233888991085841065183173360437470737908552631764325733993712
871937587746897479926305837065742830161637408969178426378624212835258112
820516370298089332099905707920064367426202389783111470054074998459250360
633560933883831923386783056136435351892133279732908133732642652633989763
922723407882928177953580570993691049175470808931841056146322338217465637
321248226383092103297701648054726243842374862411453093812206564914032751
086643394517512161526545361333111314042436854805106765843493523836959653
428071768775328348234345557366719731392746273629108210679280784718035329
131176778924659089938635459327894523777674406192240337638674004021330343
297496902028328145933418826817683893072003634795623117103101291953169794
607632737589253530772552375943788434504067715555779056450443016640119462
580972216729758615026968443146952034614932291105970676243268515992834709
891284706740862008587135016260312071903172086094081298321581077282076353
186624611278245537208532365305775956430072517744315051539600905168603220
349163222640885248852433158051534849622434848299380905070483482449327453
732624567755879089187190803662058009594743150052402532709746995318770724
376825907419939632265984147498193609285223945039707165443156421328157688
908058783183404917434556270520223564846495196112460268313970975069382648
706613264507665074611512677522748621598642530711298441182622661057163515
069260029861704945425047491378115154139941550671256271197133252763631939
606902895650288268608362241082050562430701794976171121233066073310059947
366875
The trick is simple: relate the 2n'th Fibonacci and Lucas numbers to the n'th such numbers. This allows us to work backwards: to compute F(n) and L(n), we need to know F(n/2) and L(n/2). Clearly this works as long as n is even; for odd n, there are similar schemes that allow us to move recursively downwards.
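A minimal sketch of that doubling idea, done modulo m throughout (plain C, my own illustration of the technique rather than the tool referenced above; it uses F(2k) = F(k)*(2*F(k+1) - F(k)), F(2k+1) = F(k)^2 + F(k+1)^2 and L(n) = 2*F(n+1) - F(n)):
#include <stdint.h>
#include <stdio.h>

/* Fast-doubling Fibonacci modulo m: walk the bits of n from the top,
 * keeping (F(k), F(k+1)) mod m. Assumes m < 2^32 so the 64-bit
 * intermediate products cannot overflow. */
static uint64_t fib_mod(uint64_t n, uint64_t m, uint64_t *lucas)
{
    uint64_t a = 0, b = 1;                                 /* F(0), F(1) */
    for (int bit = 63; bit >= 0; bit--) {
        uint64_t c = (a * ((2 * b % m + m - a) % m)) % m;  /* F(2k)   */
        uint64_t d = (a * a % m + b * b % m) % m;          /* F(2k+1) */
        if ((n >> bit) & 1) { a = d; b = (c + d) % m; }    /* odd index: advance one step */
        else               { a = c; b = d; }
    }
    if (lucas)
        *lucas = (2 * b % m + m - a) % m;                  /* L(n) = 2*F(n+1) - F(n) */
    return a;
}

int main(void)
{
    uint64_t Ln, Fn = fib_mod(1000000000000000ULL, 1000000ULL, &Ln);
    printf("F(1e15) mod 1e6 = %llu, L(1e15) mod 1e6 = %llu\n",
           (unsigned long long)Fn, (unsigned long long)Ln);
    return 0;
}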
For kicks, I just modified the above tool to accept a modulus. Computing the last 6 digits of the Fibonacci number with index 1e15 took about 1/6 of a second.
tic,[Fn,Ln] = fibonacci(1e15,1000000),toc
Elapsed time is 0.161468 seconds.
Fn =
546875
Ln =
328127
Note: In my discussion of recursion to compute the Fibonacci numbers, I make a few comments on the number of recursive calls required. You will see that that number is itself related quite nicely to the Fibonacci sequence. This is easily derived.

number of possible combinations in a partitioning

Given is a set S of size n, which is partitioned into classes (s1, ..., sk) of sizes n1, ..., nk. Naturally, n = n1 + ... + nk.
I am interested in finding out the number of ways in which I can combine elements of this partitioning so that each combination contains exactly one element of each class.
Since I can choose one of n1 elements from s1, one of n2 elements from s2, and so on, I am looking for the solution to max(n1*...*nk) over arbitrary n1, ..., nk for which n1 + ... + nk = n.
I have the feeling that this is a linear-optimization problem, but it's been too long since I learned this stuff as an undergrad. I hope that somebody remembers how to compute this.
You're looking for the number of combinations with one element from each partition?
That's simply n1*n2*...*nk.
Edit:
You seem to also be asking a separate question:
Given N, how do I assign n1, n2, ..., nk such that their product is maximized. This is not actually a linear optimization problem, since your variables are multiplied together.
It can be solved with some calculus, i.e. by taking partial derivatives in each of the variables, with the constraint, using Lagrange multipliers.
The result will be that the n1 .. nk should be as close to the same size as possible.
if n is a multiple of k, then n_1 = n_2 = ... = n_k = n/k;
otherwise, n_1 = n_2 = ... = n_j = ceil(n/k)
and n_{j+1} = ... = n_k = floor(n/k)
Basically, we try to distribute the elements as evenly as possible into partitions. If they divide evenly, great. If not, we divide as evenly as possible, and with whatever is left over, we give an extra element each to the first partitions. (Doesn't have to be the first partitions, that choice is fairly arbitrary.) In this way, the difference in the number of elements owned by any two partitions will be at most one.
Gory Details:
This is the product function which we wish to maximize:
P = n1*n2*...*nk
We define a new function using a Lagrange multiplier l:
Lambda = P + l*(N - n1 - n2 - ... - nk)
and take partial derivatives in each of the k variables n_i:
dLambda/dn_i = P/n_i - l
and in l:
dLambda/dl = N - n1 - n2 - ... - nk
Setting all of the partial derivatives to 0, we get a system of k + 1 equations, and when we solve them we'll get that n1 = n2 = ... = nk.
Some useful links:
Lagrange Multipliers
Optimization
floor(n/k)^(k - n mod k)*ceil(n/k)^(n mod k)
-- MarkusQ
P.S. For the example you gave of S = {1,2,3,4}, n = 4, k = 2 this gives:
floor(4/2)^(2 - 4 mod 2)*ceil(4/2)^(4 mod 2)
floor(2)^(2 - 0)*ceil(2)^(0)
2^2 * 2^0
4 * 1
4
...as you wanted.
To clarify, this formula gives the number of permutations generated by the partitioning with the maximum possible number of permutations. There will of course be other, less optimal partitionings.
For a given perimeter, the rectangle with the largest area is the one that is closest to a square (and the same is true in higher dimensions), which means you want the sides to be as close to equal in length as possible (e.g. all either the average length rounded up or rounded down). The formula can then be seen to be:
(length of short sides)^(number of short sides)
times
(length of long sides)^(number of long sides)
which is just the volume of the hyper-rectangle meeting this constraint.
Note that, when viewed this way, it also tells you how to construct a maximal partitioning.
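A small C sketch of MarkusQ's formula (my own illustration; it reuses the n = 4, k = 2 example from above):
#include <math.h>
#include <stdio.h>

/* Maximum number of one-element-per-class combinations over all ways of
 * splitting n elements into k classes:
 *   floor(n/k)^(k - n mod k) * ceil(n/k)^(n mod k)                     */
static double max_combinations(unsigned n, unsigned k)
{
    unsigned r  = n % k;            /* number of classes that get an extra element */
    unsigned lo = n / k;            /* floor(n/k): size of the smaller classes */
    unsigned hi = lo + (r ? 1 : 0); /* ceil(n/k): size of the larger classes */
    return pow((double)lo, (double)(k - r)) * pow((double)hi, (double)r);
}

int main(void)
{
    printf("%.0f\n", max_combinations(4, 2)); /* prints 4, as in the worked example */
    return 0;
}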
