How to multiply each digit in a number efficiently - math

I want to multiply all the digits of a number together.
For example:
515 would become 25 (i.e. 5*1*5)
10 would become 0 (i.e. 1*0)
111111 would become 1 (i.e. 1*1*1*1*1*1)
I used this code to do it:
public static int evaluate(int no)
{
    if(no==0)return 0;
    int temp=1;
    do
    {
        temp=(no%10)*temp;
        no=no/10;
    }while(no>0);
    return temp;
}
The problem is that I want to evaluate about a billion numbers like this:
for(int i=0;i<1000000000;i++)evaluate(i);
This takes about 146 seconds on my processor. I want it to finish within a few seconds.
Is it possible to optimize this code with shift, AND, or OR operators so that it runs faster, without using multiple threads or otherwise parallelizing it?
Thanks

First, figure out how many numbers you can store in memory. For this example, let's say you can store 1000 numbers.
Your first step will be to pre-calculate the products of digits for all numbers from 0 to 999, and store that in memory. So, you'd have an array along the lines of:
multLookup = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
0, 2, 4, 6, 8, 10, 12, 14, 16, 18,
0, 3, 6, 9, 12, 15, 18, 21, 24, 27,
0, 4, 8, 12, 16, 20, 24, 28, 32, 36,
...]
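Now, you'd break your number up into a bunch of 3-digit chunks. For example, if your number is 1739203423, you'd break it up into 1, 739, 203, and 423. Every chunk except the leading one stands for exactly three digits, so a chunk below 100 (say 042) contains a real zero digit and must contribute 0 rather than multLookup[42]. With that in mind, you'd look each chunk up in your multLookup array and multiply the results together, like so: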
solution = multLookup[1] * multLookup[739] * multLookup[203] * multLookup[423];
With this approach, you handle three digits per step instead of one, so you will have sped up your calculations by roughly a factor of 3 (since the table covers every 3-digit chunk). To speed it up by roughly 5, store all 100000 five-digit values in memory and follow the same steps. In your case, speeding it up by 5 means you'll arrive at your solution in about 29.2 seconds.
Note: the gain isn't exactly linear with respect to how many numbers you store in memory. See jogojapan's reasoning in the comments under this answer for why that is.
If you know more about the order in which your numbers show up, or the range of your numbers (say your input is only in the range of [0, 10000]), you can make this algorithm smarter.
In your example, you're using a for loop to iterate from 0 to 1000000000. In this case, this approach will be super efficient because the memory won't page-fault very frequently and there will be fewer cache-misses.
But wait! You can make this even faster (for your specific for-loop iteration example)! How, you ask? Caching! Let's say you're going through 10-digit numbers.
Let's say you start off at 8934236000. With the 1000-entry table in memory, you'd break this down into 8, 934, 236, and 000. Then you'd multiply:
solution = multLookup[8] * multLookup[934] * multLookup[236] * multLookup[0];
Next, you'd take 8934236001, break it down to 8, 934, 236, and 001, and multiply:
solution = multLookup[8] * multLookup[934] * multLookup[236] * multLookup[1];
And so on... But notice that the first three lookups are the same for all 1000 numbers from 8934236000 to 8934236999! So, we cache that.
cache = multLookup[8] * multLookup[934] * multLookup[236];
And then we use the cache as such:
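for (int i = 0; i < 1000; i++) {
    // i is the last 3-digit chunk; for i < 100 it has a leading zero digit, so the true product is 0
    solution = cache * multLookup[i];
}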
And just like that, we've reduced the time by almost another factor of 4. So you take the ~29.2 second solution you had and divide that by 4 to go through all billion numbers in ~7.3 seconds.
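As a rough illustration of the table-plus-cache idea above, here is a minimal C sketch. The names init_tables, digit_product, block_sum, padded and plain are made up for this example and do not come from the answer. It keeps two tables because of the leading-zero issue: inner chunks always stand for three digits, while the leading chunk is taken as written.

```c
#include <stdint.h>

enum { CHUNK = 1000 };          /* assumed table size: all 3-digit chunks */

static uint32_t padded[CHUNK];  /* product of exactly three digits, leading zeros included */
static uint32_t plain[CHUNK];   /* product of the digits as written, no padding */

/* Fill both tables once, before any lookup. */
void init_tables(void) {
    for (int i = 0; i < CHUNK; i++) {
        padded[i] = (uint32_t)((i / 100) * (i / 10 % 10) * (i % 10));
        if (i < 10)
            plain[i] = (uint32_t)i;
        else if (i < 100)
            plain[i] = (uint32_t)((i / 10) * (i % 10));
        else
            plain[i] = padded[i];
    }
}

/* Digit product of n, one table lookup per three digits. */
uint32_t digit_product(uint32_t n) {
    uint32_t product = 1;
    while (n >= CHUNK) {
        product *= padded[n % CHUNK];  /* inner chunks keep their zero digits */
        n /= CHUNK;
    }
    return product * plain[n];         /* leading chunk has no padding */
}

/* Sum of digit products over hi*1000 .. hi*1000+999, for hi >= 1.
 * The product of the high digits is computed once and reused. */
uint64_t block_sum(uint32_t hi) {
    uint64_t cache = digit_product(hi);
    uint64_t sum = 0;
    for (int low = 0; low < CHUNK; low++)
        sum += cache * padded[low];
    return sum;
}
```

Call init_tables() once at startup. With uint32_t results this is only meant for inputs of up to ten digits, where the largest possible product (nine 9s, or ten digits below 4294967296) still fits.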

If you can store the result for every one of your numbers, you can use memoization. That way each call only has to process one new digit.
int prodOf(int num){
    // can be optimized to store 1/10 of the numbers, since the last digit will always be processed
    static std::vector<int> memo(<max number of iterations>, -1);
    if(num < 10) return num; // single digit: the product is the digit itself (returning 0 only for num == 0 would zero out every result)
    if(memo[num] != -1 )return memo[num];
    int prod = (num%10) * prodOf(num/10);
    memo[num] = prod;
    return prod;
}
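Since the loop in the question visits the numbers in increasing order, the same memoization idea can also be written bottom-up with no recursion. The following is only a sketch under that assumption; build_digit_products and its limit parameter are illustrative names, and the table costs 4 bytes per number, so a full billion entries would need about 4 GB.

```c
#include <stdint.h>
#include <stdlib.h>

/* Returns a heap-allocated table t with t[i] = product of the digits of i
 * for 0 <= i < limit, or NULL if allocation fails. The caller frees it. */
uint32_t *build_digit_products(size_t limit) {
    uint32_t *t = malloc(limit * sizeof *t);
    if (t == NULL)
        return NULL;
    for (size_t i = 0; i < limit; i++) {
        if (i < 10)
            t[i] = (uint32_t)i;                       /* single digit */
        else
            t[i] = t[i / 10] * (uint32_t)(i % 10);    /* reuse the shorter prefix */
    }
    return t;
}
```

Each entry costs one multiplication and one lookup of an already computed prefix, which is the "only calculate 1 digit" property described above.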

I ran a test with simple C/C++ code on my PC (Xeon 3.2 GHz) and got:
last no = i = 999999999 ==> 387420489 nb sec 23
#include "stdafx.h"
#include <chrono>
#include <iostream>
#undef _TRACE_
inline int evaluate(int no)
{
#ifdef _TRACE_
std::cout << no;
#endif
if(no==0)return 0;
int temp=1;
do
{
temp=(no%10)*temp;
no=no/10;
}while(no>0);
#ifdef _TRACE_
std::cout << " => " << temp << std::endl;
#endif // _TRACE_
return temp;
}
int _tmain(int argc, _TCHAR* argv[])
{
std::chrono::time_point<std::chrono::system_clock> start(std::chrono::system_clock::now());
int last = 0;
int i = 0;
for(/*int i = 0*/;i<1000000000;++i) {
last = evaluate(i);
}
std::cout << "last no = i = " << (i-1) << " ==> " << last << std::endl;
std::chrono::time_point<std::chrono::system_clock> end(std::chrono::system_clock::now());
std::cout << "nb sec " << std::chrono::duration_cast<std::chrono::seconds>(end - start).count() << std::endl;
return 0;
}
I also tested the loop split over multiple threads with OpenMP, and the result is 0 seconds.
So I would say that, if performance is the problem, it is worth considering a genuinely efficient language.
#pragma omp parallel for
for(int i = 0;i<1000000000;++i) {
    /*last[threadID][i] = */evaluate(i);
}

How to get the first x leading binary digits of 5**x without big integer multiplication

I want to efficiently and elegantly compute, with perfect precision, the first x leading binary digits of 5**x.
For example, 5**20 is 10101101011110001110101111000101101011000110001. Its first 8 leading binary digits are 10101101.
In my use case, x only ranges from 1 to 60. I don't want to create a table. A solution using 64-bit integers would be fine. I just don't want to use big integers.
first x leading binary digits of 5**x without big integer multiplication
efficiently and elegantly compute with perfect precision the first x leading binary digits of 5**x?
"compute with perfect precision" leaves out pow(). Too many implementations will return an imperfect result and FP math might not use 64 bit precision, even with long double.
Form a fixed-point number with a 64-bit whole number part .ms and a 64-bit fraction part .ls. Then loop 60 times, multiplying by 5 and dividing by 2 as needed to keep the leading bits from growing too big.
Note there is some precision lost in the fraction, with N > 42, yet that is not significant enough to affect the whole number part OP is seeking.
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
typedef struct {
uint64_t ms, ls;
} uint128;
// Simplifications possible here, leave for OP
uint128 times5(uint128 x) {
uint128 y = x;
for (int i=1; i<5; i++) {
// y += x
y.ms += x.ms;
y.ls += x.ls;
if (y.ls < x.ls) y.ms++;
}
return y;
}
uint128 div2(uint128 x) {
x.ls = (x.ls >> 1) | (x.ms << 63);
x.ms >>= 1;
return x;
}
int main(void) {
uint128 y = {.ms = 1};
uint64_t pow2 = 2;
for (unsigned x = 1; x <= 60; x++) {
y = times5(y);
while (y.ms >= pow2) {
y = div2(y);
}
printf("%2u %16" PRIX64 ".%016" PRIX64 "\n", x, y.ms, y.ls);
pow2 <<= 1;
}
}
Output
whole part.fraction
1 1.4000000000000000
2 3.2000000000000000
3 7.D000000000000000
4 9.C400000000000000
...
57 14643E5AE44D12B.8F5FEE5AA432560D
58 32FA9BE33AC0AEC.E66FD3E29A7DD720
59 7F7285B812E1B50.401791B6823A99D0
60 9F4F2726179A224.501D762422C94044
^-------------^ This is the part OP is seeking.
The key to solving this task is: divide and conquer. Form an algorithm, (which is simply *5 and /2 as needed), and code a type and functions to do each small step.
Is a loop of 60 efficient? Perhaps not. Another approach would use Exponentiation by squaring. Certainly would be worth it for large N, yet for N == 60, a loop was simple enough for a quick turn.
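The comment in the code above says times5() can be simplified. One possible simplification, offered only as a sketch, is to compute x*5 as (x << 2) + x on the two 64-bit halves. The names times5_shift and pow5_leading_bits are invented here, and the range 1 to 60 is taken from the question.

```c
#include <stdint.h>

typedef struct {
    uint64_t ms, ls;   /* whole part, fraction part */
} uint128;

/* x * 5 computed as (x << 2) + x across both halves. */
uint128 times5_shift(uint128 x) {
    uint128 y;
    y.ms = (x.ms << 2) | (x.ls >> 62);  /* top two fraction bits move into the whole part */
    y.ls = x.ls << 2;
    y.ls += x.ls;
    y.ms += x.ms + (y.ls < x.ls);       /* add the carry out of the fraction */
    return y;
}

/* Same halving step as in the answer. */
uint128 div2(uint128 x) {
    x.ls = (x.ls >> 1) | (x.ms << 63);
    x.ms >>= 1;
    return x;
}

/* First x binary digits of 5**x, intended for 1 <= x <= 60. */
uint64_t pow5_leading_bits(unsigned x) {
    uint128 y = { .ms = 1, .ls = 0 };
    uint64_t limit = 2;
    for (unsigned i = 1; i <= x; i++) {
        y = times5_shift(y);
        while (y.ms >= limit)   /* keep exactly i bits in the whole part */
            y = div2(y);
        limit <<= 1;
    }
    return y.ms;
}
```

For x = 20 this returns 0xAD78E, which is the first 20 digits of the binary string given in the question.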
$5^n = 2^{-n} \cdot 10^n$
Using this identity, we can easily compute the leading N base-2 digits of (the nearest integer to) any given power of 5.
This code example is in C, but it's the same idea in any other language.
Example output: https://wandbox.org/permlink/Fs205DDzQR0gaLSo
#include <assert.h>
#include <float.h>
#include <math.h>
#include <stdint.h>
#define STATIC_ASSERT(CONDITION) ((void)sizeof(int[(CONDITION) ? 1 : -1]))
uint64_t pow5_leading_digits(double power, uint8_t ndigits)
{
STATIC_ASSERT(DBL_MANT_DIG <= 64);
double pow5 = exp2(-power) * pow(10, power);
const double binary_digits = ceil(log2(pow5));
assert(ndigits <= DBL_MANT_DIG);
if (!ndigits || binary_digits < 0)
return 0;
// If pow5 can fit in the number of digits requested, return it
if (binary_digits <= ndigits)
return pow5;
// If pow5 is too big to return, divide by 2 until it fits
if (binary_digits > DBL_MANT_DIG)
pow5 /= exp2(binary_digits - DBL_MANT_DIG + 1);
return (uint64_t)pow5 >> (DBL_MANT_DIG - ndigits);
}
Edit: Now limits the returned value to those exactly representable with double's.

Why is MPI_Send blocking when I try to send a 2D int array?

I'm trying to compute a fractal picture in parallel with MPI.
I've divided my program into 4 parts:
1. Balance the number of rows handled by each rank
2. Perform the calculation on each row assigned to the rank
3. Send the number of rows and the rows themselves to rank 0
4. Process the data in rank 0 (for the test, just print the ints)
Steps 1 and 2 work, but when I try to send the rows to rank 0 the program stops and blocks. I know that MPI_Send can block, but I see no reason for it to do so here.
Here is the code, split into the four steps:
Step 1 :
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/* Include the MPI library for function calls */
#include <mpi.h>
/* Define tags for each MPI_Send()/MPI_Recv() pair so distinct messages can be
* sent */
#define OTHER_N_ROWS_TAG 0
#define OTHER_PIXELS_TAG 1
int main(int argc, char **argv) {
const int nRows = 513;
const int nCols = 513;
const int middleRow = 0.5 * (nRows - 1);
const int middleCol = 0.5 * (nCols - 1);
const double step = 0.00625;
const int depth = 100;
int pixels[nRows][nCols];
int row;
int col;
double xCoord;
double yCoord;
int i;
double x;
double y;
double tmp;
int myRank;
int nRanks;
int evenSplit;
int nRanksWith1Extra;
int myRow0;
int myNRows;
int rank;
int otherNRows;
int otherPixels[nRows][nCols];
/* Each rank sets up MPI */
MPI_Init(&argc, &argv);
/* Each rank determines its ID and the total number of ranks */
MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
MPI_Comm_size(MPI_COMM_WORLD, &nRanks);
printf("My rank is %d \n",myRank);
evenSplit = nRows / nRanks;
nRanksWith1Extra = nRows % nRanks;
/*Each rank determine the number of rows that he will have to perform (well balanced)*/
if (myRank < nRanksWith1Extra) {
myNRows = evenSplit + 1;
myRow0 = myRank * (evenSplit + 1);
}
else {
myNRows = evenSplit;
myRow0 = (nRanksWith1Extra * (evenSplit + 1)) +
((myRank - nRanksWith1Extra) * evenSplit);
}
/*__________________________________________________________________________________*/
Step 2 :
/*_____________________PERFORM CALCUL ON EACH PIXEL________________________________ */
for (row = myRow0; row < myRow0 + myNRows; row++) {
/* Each rank loops over the columns in the given row */
for (col = 0; col < nCols; col++) {
/* Each rank sets the (x,y) coordinate for the pixel in the given row and
* column */
xCoord = (col - middleCol) * step;
yCoord = (row - middleRow) * step;
/* Each rank calculates the number of iterations for the pixel in the
* given row and column */
i = 0;
x = 0;
y = 0;
while ((x*x + y*y < 4) && (i < depth)) {
tmp = x*x - y*y + xCoord;
y = 2*x*y + yCoord;
x = tmp;
i++;
}
/* Each rank stores the number of iterations for the pixel in the given
* row and column. The initial row is subtracted from the current row
* so the array starts at 0 */
pixels[row - myRow0][col] = i;
}
//printf("one row performed by %d \n",myRank);
}
printf("work done by %d \n",myRank);
/*_________________________________________________________________________________*/
Step 3:
/*__________________________SEND DATA TO RANK 0____________________________________*/
/* Each rank (including Rank 0) sends its number of rows to Rank 0 so Rank 0
* can tell how many pixels to receive */
MPI_Send(&myNRows, 1, MPI_INT, 0, OTHER_N_ROWS_TAG, MPI_COMM_WORLD);
printf("test \n");
/* Each rank (including Rank 0) sends its pixels array to Rank 0 so Rank 0
* can print it */
MPI_Send(&pixels, sizeof(int)*myNRows * nCols, MPI_BYTE, 0, OTHER_PIXELS_TAG,
MPI_COMM_WORLD);
printf("enter ranking 0 \n");
/*_________________________________________________________________________________*/
Step 4:
/*________________________TREAT EACH ROW IN RANK 0_________________________________*/
/* Only Rank 0 prints so the output is in order */
if (myRank == 0) {
/* Rank 0 loops over each rank so it can receive that rank's messages */
for (rank = 0; rank < nRanks; rank++){
/* Rank 0 receives the number of rows from the given rank so it knows how
* many pixels to receive in the next message */
MPI_Recv(&otherNRows, 1, MPI_INT, rank, OTHER_N_ROWS_TAG,
MPI_COMM_WORLD, MPI_STATUS_IGNORE);
/* Rank 0 receives the pixels array from each of the other ranks
* (including itself) so it can print the number of iterations for each
* pixel */
MPI_Recv(&otherPixels, otherNRows * nCols, MPI_INT, rank,
OTHER_PIXELS_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
/* Rank 0 loops over the rows for the given rank */
for (row = 0; row < otherNRows; row++) {
/* Rank 0 loops over the columns within the given row */
for (col = 0; col < nCols; col++) {
/* Rank 0 prints the value of the pixel at the given row and column
* followed by a comma */
printf("%d,", otherPixels[row][col]);
}
/* In between rows, Rank 0 prints a newline character */
printf("\n");
}
}
}
/* All processes clean up the MPI environment */
MPI_Finalize();
return 0;
}
I would like to understand why it is blocking. Could you explain it to me?
I'm new to MPI, and I would like to learn it, not just end up with a program that works.
Thank you in advance.
MPI_Send is by definition of the standard a blocking operation.
Note that blocking means:
it does not return until the message data and envelope have been safely stored away so that the sender is free to modify the send buffer. The message might be copied directly into the matching receive buffer, or it might be copied into a temporary system buffer.
Having a rank send messages to itself with MPI_Send followed by MPI_Recv can deadlock: the send may not return until a matching receive is posted, and the receive is never reached.
The idiomatic pattern for your situation is to use the appropriate collective communication operations MPI_Gather and MPI_Gatherv.
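MPI_Gatherv needs, on the root, one receive count and one displacement per rank. As a sketch of how those could be derived from the same row balancing used in the question, the helper below fills both arrays in units of int. gatherv_layout is a hypothetical name, and the MPI_Gatherv call itself is not shown.

```c
/* Fill counts[r] and displs[r] (both measured in ints) for every rank,
 * using the same "first ranks get one extra row" split as the question. */
void gatherv_layout(int nRows, int nCols, int nRanks, int *counts, int *displs) {
    int evenSplit = nRows / nRanks;
    int nRanksWith1Extra = nRows % nRanks;
    int offset = 0;
    for (int r = 0; r < nRanks; r++) {
        int rows = evenSplit + (r < nRanksWith1Extra ? 1 : 0);
        counts[r] = rows * nCols;   /* ints contributed by rank r */
        displs[r] = offset;         /* where rank r's block starts in the gathered buffer */
        offset += counts[r];
    }
}
```

Each rank would then pass myNRows * nCols as its send count with MPI_INT, and rank 0 would pass counts and displs as the recvcounts and displs arguments. Because the blocks are laid out in rank order, the receive buffer ends up holding the full image row by row.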
When you use blocking send/recv constructs when sending to the rank 0 itself, it might cause a deadlock.
From the MPI 3.0 standard, Section 3.2.4:
Source = destination is allowed, that is, a process can send a message to itself. (However, it is unsafe to do so with the blocking send and receive operations described above, since this may lead to deadlock. See Section 3.5.)
Possible solutions:
Use non-blocking send/recv constructs when sending/receiving to/from rank 0 itself. For more information, take a look at the MPI_Isend, MPI_Irecv and MPI_Wait routines.
Eliminate communication with rank 0 itself. Since you are in rank 0, you already have a way to know how many pixels you have to compute.
As explained in a previous answer, MPI_Send() might block.
From a theoretical MPI point of view, your application is incorrect because of a potential deadlock (rank 0 MPI_Send() to itself when no receive is posted).
From a very pragmatic point of view, MPI_Send() generally returns immediately when a small message is sent (such as myNRows), but blocks until a matching receive is posted when a large message is sent (such as pixels). Please keep in mind:
- "small" and "large" depend at least on both the MPI library and the interconnect being used
- it is incorrect from an MPI point of view to assume that MPI_Send() will return immediately for small messages
If you really want to check that your application is deadlock-free, you can simply replace MPI_Send() with MPI_Ssend(), which only completes once the matching receive has started, so a latent deadlock shows up right away.
Back to your question, there are several options here:
- revamp your app so rank 0 does not communicate with itself (all the info is available, so no communication is needed)
- post an MPI_Irecv() before MPI_Send(), and replace MPI_Recv(source=0) with MPI_Wait()
- revamp your app so rank 0 does not MPI_Send() nor MPI_Recv(source=0), but uses MPI_Sendrecv instead. This is my recommended option since you only have to make a small change to the communication pattern (the computation pattern is kept untouched), which is more elegant imho.

How to group all possible integers into three buckets

I want to be able to evenly, reproducibly, and predictably switch an inputted integer value into one of three cases. If it was two cases, it would be obvious.
Pseudo code:
switch (integer) {
if even:
something;
break;
if odd:
something else;
break;
}
I want to do the same thing but for three cases, and I'm kind of stumped as to how I can do that. Probably because I'm not really very good at math.
Any ideas?
How about dividing by 3?
switch (x % 3) { // compute the remainder
case 0: // 0, 3, 6, 9, ...
something;
break;
case 1: // 1, 4, 7, 10, ...
something;
break;
case 2: // 2, 5, 8, 11, ...
something;
break;
}
You need to watch out for the sign - some languages will compute (-5) % 3 as -2 instead of 1, so you might need to use abs(x) % 3 instead of x % 3 or add case statements:
switch (x % 3) { // compute the remainder
case 0: // -6, -3, 0, 3, 6, 9, ...
something;
break;
case 1: // 1, 4, 7, 10, ...
case -2: // ... -5, -2
something;
break;
case 2: // 2, 5, 8, 11, ...
case -1: // ... -4, -1
something;
break;
}
See remainder and modulus operation.
PS in Common Lisp you would use mod:
(ecase (mod x 3)
(0 ...)
(1 ...)
(2 ...))
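For languages like C, where the remainder takes the sign of the dividend, another way to handle the sign caveat is to normalize the remainder once and keep only three case labels. A minimal sketch, with bucket3 as a made-up helper name:

```c
/* Map any int to bucket 0, 1 or 2. Negative inputs wrap upward,
 * so -5 lands in the same bucket as 1, 4, 7, ... */
int bucket3(int x) {
    int r = x % 3;              /* C99 and later: sign follows x, so r is in -2..2 */
    return r < 0 ? r + 3 : r;
}
```

You would then write switch (bucket3(x)) with cases 0, 1 and 2 only. Unlike abs(x) % 3, this keeps the buckets evenly spaced across zero.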

unknown recursive method, must find how it runs

This was a past exam question and I have no idea what it does! Please can someone run through it.
public static int befuddle(int n){
if(n <= 1){
return n;
}else{
return befuddle(n - 1) * befuddle(n - 2) + 1;
}
}
This computes the sequence: 0, 1, 1, 2, 3, 7, 22, 155, ...
It can be expressed with this recurrence:
$a(0) = 0,\quad a(1) = 1,\quad a(n) = a(n-1) \cdot a(n-2) + 1 \text{ for } n \ge 2$
When dealing with numerical sequences, a great resource is The On-Line Encyclopedia of Integer Sequences. A quick search there shows a similar sequence to yours, using the same recurrence but with:
$a(0) = a(1) = 0$
giving the following sequence: 0, 0, 1, 1, 2, 3, 7, 22, 155, ...
You can find more about it in its OEIS entry.
public static is the type of member function it is. I'm assuming this is part of a class? The static keyword allows you to use it without creating an instance of the class.
Plug in a value of 'n' and step through it. For instance, if n = 1, then the function returns 1. If n = 0 -> 0; n = -100 -> -100.
If n = 2, the else branch is triggered and befuddle is called with 1 and 0. So n = 2 returns 1*0 + 1 = 1.
Do the same thing for n = 3, etc. (calls n = 2 -> 1, and n = 1 -> 1, so n=3 -> 1*1+1 = 2.)
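If stepping through by hand gets tedious, the same values can be produced with a loop. This is a C sketch of an iterative equivalent, not the exam's Java method; befuddle_iter is a made-up name, and int64_t is used only to postpone overflow a little.

```c
#include <stdint.h>

/* Iterative equivalent of the recursive method:
 * b(n) = n for n <= 1, otherwise b(n-1) * b(n-2) + 1.
 * Results fit in int64_t only up to n = 11. */
int64_t befuddle_iter(int n) {
    if (n <= 1)
        return n;
    int64_t prev2 = 0;  /* b(i-2) */
    int64_t prev1 = 1;  /* b(i-1) */
    for (int i = 2; i <= n; i++) {
        int64_t cur = prev1 * prev2 + 1;
        prev2 = prev1;
        prev1 = cur;
    }
    return prev1;
}
```

Calling it for n = 0 through 7 gives 0, 1, 1, 2, 3, 7, 22, 155, matching the hand trace.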

Interpreting GDB registers (SSE registers)

I've been using GDB for 1 day and I've accumulated a decent understanding of it.
However when I set a breakpoint at the final semicolon using GDB and print registers I can't fully interpret the meaning of the data stored into the XMM register.
I don't know if the data is in (MSB > LSB) format or vice versa.
__m128i S = _mm_load_si128((__m128i*)Array16Bytes);
}
So this is the result that I'm getting.
(gdb) print $xmm0
$1 = {
v4_float = {1.2593182e-07, -4.1251766e-18, -5.43431603e-31, -2.73406277e-14},
v2_double = {4.6236050467459811e-58, -3.7422963639201271e-245},
v16_int8 = {52, 7, 55, -32, -94, -104, 49, 49, -115, 48, 90, -120, -88, -10, 67, 50},
v8_int16 = {13319, 14304, -23912, 12593, -29392, 23176, -22282, 17202},
v4_int32 = {872888288, -1567084239, -1926210936, -1460255950},
v2_int64 = {3749026652749312305, -8273012972482837710},
uint128 = 0x340737e0a29831318d305a88a8f64332
}
So would someone kindly guide me how to interpret the data.
SSE (XMM) registers can be interpreted in various different ways. The register itself has no knowledge of the implicit data representation, it just holds 128 bits of data. An XMM register can represent:
4 x 32 bit floats __m128
2 x 64 bit doubles __m128d
16 x 8 bit ints __m128i
8 x 16 bit ints __m128i
4 x 32 bit ints __m128i
2 x 64 bit ints __m128i
128 individual bits __m128i
So when gdb displays an XMM register it gives you all possible interpretations, as seen in your example above.
If you want to display a register using a specific interpretation (e.g. 16 x 8 bit ints) then you can do it like this:
(gdb) p $xmm0.v16_int8
$1 = {0, 0, 0, 0, 0, 0, 0, 0, -113, -32, 32, -50, 0, 0, 0, 2}
As for endianness, gdb displays the register contents in natural order, i.e. left-to-right, from MS to LS.
So if you have the following code:
#include <stdio.h>
#include <stdint.h>
#include <xmmintrin.h>
int main(int argc, char *argv[])
{
int8_t buff[16] __attribute__ ((aligned(16))) = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 };
__m128i v = _mm_load_si128((__m128i *)buff);
printf("v = %vd\n", v);
return 0;
}
If you compile and run this you will see:
v = 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
However if you step through the code in gdb and examine v you will see:
v16_int8 = {15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0}
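To check the MS-to-LS reading against the dump in the question, you can join the v16_int8 bytes yourself and compare with the wider views. The sketch below assumes the first-listed byte is the most significant one; join16_ms_first and join32_ms_first are illustrative helpers, not part of gdb or the intrinsics headers.

```c
#include <stdint.h>

/* Join two bytes into a 16-bit lane, first byte most significant. */
int16_t join16_ms_first(int8_t hi, int8_t lo) {
    uint16_t v = (uint16_t)(((unsigned)(uint8_t)hi << 8) | (uint8_t)lo);
    return (int16_t)v;
}

/* Join four bytes into a 32-bit lane, first byte most significant. */
int32_t join32_ms_first(const int8_t b[4]) {
    uint32_t v = 0;
    for (int i = 0; i < 4; i++)
        v = (v << 8) | (uint8_t)b[i];
    return (int32_t)v;
}
```

With the question's bytes, 52 and 7 join to 13319 (0x3407), the first v8_int16 entry, and 52, 7, 55, -32 join to 872888288 (0x340737e0), the first v4_int32 entry and the leading digits of the uint128 line.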
