I was trying the binomial coefficient problem using DP. I know it's not an efficient approach, but I want to understand what this SIGFPE error is telling me.

This is the program I wrote; I ran it on the GfG practice environment:
class Solution {
public:
    int nCr(int n, int r) {
        // code here
        const unsigned int M = 1000000007;
        long long dp[n + 1] = {0}, i, ans;
        if (n < r)
            return 0;
        dp[0] = 1;
        dp[1] = 1;
        for (i = 2; i <= n; i++) {
            dp[i] = i * dp[i - 1];
        }
        ans = (dp[n]) / (dp[n - r] * dp[r]);
        ans = ans % M;
        return ans;
    }
};
I don't really understand what is going on. The division seems to be well defined.

The division seems to be well defined.
You are right to suspect the division as the origin of the SIGFPE. As you know, division is well defined as long as the divisor is not zero, and at first glance one wouldn't expect dp[n-r]*dp[r] to ever become zero. But the elements of dp can only hold a limited range of values. With a 64-bit long long, the maximum representable value is typically 2^63 - 1 = 9223372036854775807. This means that dp[i] has already overflowed for i > 20, although on common processors this overflow is silently ignored. As the factorial computation continues multiplying by ever larger values of i, more and more zero bits are "shifted in" from the right, until eventually all 64 bits are zero; on common processors this happens at i = 66. So whenever n-r or r is equal to or greater than 66, the divisor dp[n-r]*dp[r] is zero and the division raises the exception.
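A minimal sketch of that collapse (not a fix for nCr, just a demonstration; unsigned long long is used so the wraparound is well defined, which is what the dp[] array effectively does on common hardware):

#include <cstdio>

int main() {
    // Keep only the low 64 bits of i!, which is effectively what the
    // question's dp[] array stores once the factorial overflows.
    unsigned long long fact = 1;
    for (int i = 2; i <= 70; ++i) {
        fact *= i;                                            // wraps modulo 2^64
        if (fact == 0) {
            std::printf("i! collapses to 0 at i = %d\n", i);  // prints i = 66
            break;
        }
    }
    // So for n-r >= 66 or r >= 66, dp[n-r] * dp[r] is 0, and the integer
    // division dp[n] / 0 raises SIGFPE.
    return 0;
}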

Related

OpenCL convert_long from NaN Producing Incorrect Result on RTX 2060

OpenCL's convert_T (OpenCL 1.2, others are similar) function for long is producing an odd result.
In particular the function definition states:
Conversions to integer type may opt to convert using the optional
saturated mode by appending the _sat modifier to the conversion
function name. When in saturated mode, values that are outside the
representable range shall clamp to the nearest representable value in
the destination format. (NaN should be converted to 0).
However, on an NVidia RTX 2060 I am getting the most negative integral value for NaN inputs. For instance, consider the following kernel given NaN inputs such as 0x7FC00001 and 0xFFC00001.
kernel void test(
    const global uint *srcF)
{
    uint id = get_global_id(0);
    float x = ((global float *)srcF)[id];
    long x_rte = convert_long_sat_rte(x);
    if (isnan(x) && x_rte == 0x8000000000000000) {
        printf("0x%016llx: oops! 0x%08X fails to generate zero\n", x_rte, srcF[id]);
    }
}
On an NVidia RTX 2060 I see:
0x8000000000000000: oops! 0x7FC00001 fails to generate zero
0x8000000000000000: oops! 0xFFC00001 fails to generate zero
It seems to generate 0x8000000000000000 (the most negative long) instead of the expected value of 0. On Intel HD 630 I get 0's as expected. I also noticed that some double-to-integral conversions with convert_T_sat fail in the same way (returning the most negative integral value).
My question: am I missing something here? Am I misunderstanding the above spec? I know a typical conversion has ill-defined behavior outside the representable range, but this explicit conversion seems to clearly say NaNs must be converted to 0. Still, this seems like an obvious conformance test that the driver must have gone through, so I suspect I'm the one screwing up here.

Inaccurate results with OpenCL Reduction example

I am working with the OpenCL reduction example provided by Apple here
After a few days of dissecting it, I understand the basics; I've converted it to a version that runs more or less reliably in C++ (openFrameworks) and finds the largest number in the input set.
However, in doing so, a few questions have arisen as follows:
Why are multiple passes used? The most I have been able to make the reduction require is two, the latter pass taking only a very small number of elements and so being poorly suited to an OpenCL process (i.e. wouldn't it be better to stick to a single pass and then process its results on the CPU?).
When I set the 'count' number of elements to a very high number (24M and up) and the type to a float4, I get inaccurate (or totally wrong) results. Why is this?
In the OpenCL kernels, can anyone explain what is being done here:
while (i < n) {
    int a = LOAD_GLOBAL_I1(input, i);
    int b = LOAD_GLOBAL_I1(input, i + group_size);
    int s = LOAD_LOCAL_I1(shared, local_id);
    STORE_LOCAL_I1(shared, local_id, (a + b + s));
    i += local_stride;
}
as opposed to what is being done here?
#define ACCUM_LOCAL_I1(s, i, j) \
{ \
    int x = ((__local int*)(s))[(size_t)(i)]; \
    int y = ((__local int*)(s))[(size_t)(j)]; \
    ((__local int*)(s))[(size_t)(i)] = (x + y); \
}
Thanks!
S
To answer the first 2 questions:
why are multiple passes used?
Reducing millions of elements to a few thousand can be done in parallel with a device utilization of almost 100%, but the final step is quite tricky. So, instead of doing everything in one shot and leaving many work items idle, the Apple implementation does a first-pass reduction, then adapts the work items to the new, smaller reduction problem, and finally completes it.
It is a very specific optimization for OpenCL, but it may not be one for C++.
when I set the 'count' number of elements to a very high number (24M
and up) and the type to a float4, I get inaccurate (or totally wrong)
results. Why is this?
A float32 has 23 bits of mantissa precision. Values higher than 24M = 1.43 x 2^24 (in float representation) have a representation error in the range +/- (2^24/2^23)/2 ~= 1.
That means, if you do:
float A=24000000;
float B= A + 1; //~1 error here
The error of a single operation is of the same magnitude as the data being added, therefore... big errors if you repeat that in a loop!
This will not happen on 64-bit CPUs, because there the 32-bit float math is carried out internally with 48 bits of precision, avoiding these errors. However, if you get the float close to 2^48 they will happen as well; but that is not the typical case for normal "counting" integers.
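A plain C++ sketch (host-side, not OpenCL) of the same effect, showing both a single lost increment and how the loss compounds in a summation loop; accumulating in double is one common workaround:

#include <cstdio>

int main() {
    // Above 2^24 the spacing between adjacent 32-bit floats is 2,
    // so adding 1.0f changes nothing.
    float a = 24000000.0f;                                  // ~1.43 * 2^24
    float b = a + 1.0f;
    std::printf("a == b ? %s\n", a == b ? "yes" : "no");    // prints "yes"

    // Repeated in a reduction-style loop, the lost units accumulate:
    float sum_f = 0.0f;
    double sum_d = 0.0;
    for (int i = 0; i < 24000000; ++i) {
        sum_f += 1.0f;
        sum_d += 1.0;
    }
    std::printf("float sum:  %.1f\n", sum_f);   // stalls at 16777216.0
    std::printf("double sum: %.1f\n", sum_d);   // 24000000.0
    return 0;
}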
The problem is with the precision of 32 bit floats. You're not the first person to ask about this either. OpenCL reduction result wrong with large floats

ATMega performance for different operations

Does anyone have experience with replacing floating-point operations on ATmega (2560) based systems? There are a couple of very common situations that come up every day.
For example:
Are comparisons faster than divisions/multiplications?
Are float-to-int type casts followed by integer multiplication/division faster than pure floating-point operations without the cast?
I hope I don't have to write a benchmark just for this.
Example one:
int iPartialRes = (int)fArg1 * (int)fArg2;
iPartialRes *= iFoo;
Is this faster than:
float fPartialRes = fArg1 * fArg2;
fPartialRes *= iFoo;
And example two:
iSign = fVal < 0 ? -1 : 1;
Is this faster than:
iSign = fVal / fabs(fVal);
The questions could be answered just by thinking about them for a moment:
AVRs do not have an FPU, so all floating-point work is done in software --> an fp multiplication involves much more than a simple int multiplication.
Since AVRs also do not have an integer division unit, a simple branch is also much faster than a software division; dividing floating-point values is the worst case of all :)
But please note that your first two examples produce very different results, because the int version truncates fArg1 and fArg2 before multiplying.
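For example (argument values picked purely to illustrate the point; fArg1, fArg2 and iFoo are the names from the question):

#include <cstdio>

int main() {
    float fArg1 = 2.5f, fArg2 = 3.5f;
    int iFoo = 10;

    // Example one: both floats are truncated *before* the multiplication.
    int iPartialRes = (int)fArg1 * (int)fArg2;   // 2 * 3 = 6
    iPartialRes *= iFoo;                         // 60

    // Floating-point version: no truncation at all.
    float fPartialRes = fArg1 * fArg2;           // 8.75
    fPartialRes *= iFoo;                         // 87.5

    std::printf("int path = %d, float path = %g\n", iPartialRes, fPartialRes);
    return 0;
}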
This is an old question, but I will submit this more elaborate answer for the curious.
Just typecasting a float truncates it, i.e. 3.7 becomes 3; there is no rounding.
The fastest math on a 2560 is (+, -, *), with division being the slowest because there is no hardware divide. Typecasting to an unsigned long int after multiplying all operands by a pseudo decimal point that suits the fractional range(1) your floats are expected to see, and tracking the sign as a bool, gives the best range/accuracy compromise.
If your loop needs to be as fast as possible, avoid even integer division; instead multiply by a pseudo fraction and then typecast back into a float with myFloat (defined elsewhere) = float(myPseudoFloat) / myPseudoDecimalConstant;
Not sure if you came across the Show Info page in the playground. It's basically a sketch that runs a benchmark on your (insert Arduino model here) and shows the actual compute times for various operations and boards. The Mega 2560 will be very close to an ATmega 328 as far as FLOPs go, up to about 12.5K/s (80 us per float divide). Typecasting would likely handicap the CPU more, as it introduces extra overhead and might even give erroneous results due to rounding errors and lack of precision.
(1) i.e. 543.509291 * 1000000 = 543,509,291 moves the decimal six places, to roughly the maximum precision of a float on an 8-bit AVR. If you first multiply all values by the same constant, such as 1000 or 100000, the decimal point is preserved; you then cast back to a float by dividing by your decimal constant when you are ready to print or store the value.
float f = 3.1428;
int x;
x = f * 10000;
// x now contains 31428
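A sketch of the scaled-integer idea from this answer (the scale constant 10000 and the helper names are illustrative, not from the original post):

#include <cstdio>

// Pseudo-fixed-point: multiply every value by a decimal constant once,
// do the repeated math in integers, and convert back only when printing.
const long SCALE = 10000;                 // four "decimal places"

long toFixed(float f)   { return (long)(f * SCALE); }
float toFloat(long fix) { return (float)fix / SCALE; }

int main() {
    long a = toFixed(3.1428f);            // 31428
    long b = toFixed(2.5f);               // 25000
    long product = a * b / SCALE;         // 78570, i.e. 7.857 scaled
    std::printf("%g\n", toFloat(product));
    return 0;
}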

Slow execution of string comparison

My problem is why my program takes such a long time to execute. The program is supposed to check the user's password; the approach used is:
take the password from the console into an array, and
compare it with the previously saved password;
comparison is done by the function str_cmp(), which returns zero if the strings are equal and non-zero if they are not.
#include <stdio.h>

char str_cmp(char *, char *);

int main(void)
{
    int i = 0;
    char c, cmp[10], org[10] = "0123456789";
    printf("\nEnter your account password\ntype 0123456789\n");
    for (i = 0; (c = getchar()) != EOF; i++)
        cmp[i] = c;
    if (!str_cmp(org, cmp))
    {
        printf("\nLogin Successful");
    }
    else
        printf("\nIncorrect Password");
    return 0;
}

char str_cmp(char *porg, char *pcmp)
{
    int i = 0, l = 0;
    for (i = 0; *porg + i; i++)
    {
        if (!(*porg + i == *pcmp + i))
        {
            l++;
        }
    }
    return l;
}
There are libraries available to do this much more simply, but I will assume that this is an assignment, and either way it is a good learning experience. I think the problem is in the for loop of your str_cmp function. The condition you are using is *porg+i. This is not really doing a comparison; the loop just runs until the expression evaluates to 0, and that only happens once i has grown so large that *porg+i overflows what an int can store and wraps back around to 0 (this is called overflowing the variable).
Instead, you should pass a size into the str_cmp function corresponding to the length of the strings. In the for loop condition you should make sure that i < str_size.
However, there is a built-in strncmp function (http://www.elook.org/programming/c/strncmp.html) that does exactly this.
You also have a different problem. You are doing pointer addition like so:
*porg+i
This is going to take the value of the first element of the array and add i to it. Instead you want to do:
*(porg+i)
That will add to the pointer and then dereference it to get the value.
To clarify the comparison more fully, because this is a very important concept for pointers: porg is declared as a char*, which means the variable holds the memory address of a char. When you apply the dereference operator to it (*porg), you get the value stored at that memory location. You can also add a number to the memory location to move to a different one: porg + 1 is the memory location after porg. Therefore *porg + 1 takes the value at the memory address and adds 1 to it, whereas *(porg + 1) takes the value at the memory address one past where porg points. This is useful for arrays, because arrays store their values one after another. A more readable notation for the same thing is porg[1], which says "get the value 1 after the beginning of the array", or in other words "get the second element of the array".
All conditions in C simply check whether a value is zero or non-zero: zero means false, and every other value means true. So when you use the expression *porg + i as a condition, it computes (value at porg) + i and checks whether the result is zero.
This leads me to the other very important concept for programming in C: an int can only hold values up to a certain maximum. If the variable is incremented beyond that maximum, it wraps back around to 0. Say the maximum value of an int were 255; adding 1 to an int holding 255 would give 0 instead of 256. In reality the maximum is 32,767 on 16-bit compilers and 2,147,483,647 on most modern ones, which is why the program takes so long: the loop keeps running until *porg + i wraps around and finally becomes zero.
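A tiny standalone demonstration of the three pointer expressions discussed above:

#include <cstdio>

int main() {
    char porg[] = "0123456789";

    // *porg + 1   : value of the first element plus 1 -> '0' + 1 == 49
    // *(porg + 1) : value one element past porg       -> '1'
    // porg[1]     : the same element, nicer notation  -> '1'
    std::printf("%d %c %c\n", *porg + 1, *(porg + 1), porg[1]);
    return 0;
}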
Try including string.h:
#include <string.h>
Then use the built-in strcmp() function. The existing string functions have already been written to be as fast as possible in most situations.
Also, I think your for statement is messed up:
for(i=0;*porg+i;i++)
That's going to dereference the pointer, then add i to it. I'm surprised the for loop ever exits.
If you change it to this, it should work:
for(i=0;porg[i];i++)
Your original string is also one longer than you think it is. You allocate 10 bytes, but it's actually 11 bytes long. A string (in quotes) is always ended with a null character. You need to declare 11 bytes for your char array.
Another issue:
if(!(*porg+i==*pcmp+i))
should be changed to
if(!(porg[i]==pcmp[i]))
For the same reasons listed above.
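Putting those fixes together, here is a minimal corrected sketch of str_cmp (still hand-rolled for the exercise; in real code you would just call strcmp):

#include <cstdio>

// Returns 0 if the strings are equal, non-zero otherwise,
// matching the contract of the original str_cmp.
int str_cmp(const char *porg, const char *pcmp)
{
    for (int i = 0; porg[i] != '\0' || pcmp[i] != '\0'; i++) {
        if (porg[i] != pcmp[i])
            return 1;          // mismatch (or one string ended early)
    }
    return 0;
}

int main(void)
{
    char org[11] = "0123456789";   // 10 characters + terminating '\0'
    char cmp[11] = "0123456789";
    std::printf("%s\n", str_cmp(org, cmp) == 0 ? "Login Successful"
                                               : "Incorrect Password");
    return 0;
}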

Use of qsrand, random method that is not random

I'm having a strange problem here, and I can't manage to find a good explanation for it, so I thought of asking you guys:
Consider the following method :
int MathUtility::randomize(int Min, int Max)
{
    qsrand(QTime::currentTime().msec());
    if (Min > Max)
    {
        int Temp = Min;
        Min = Max;
        Max = Temp;
    }
    return ((rand() % (Max - Min + 1)) + Min);
}
I won't explain to you gurus what this method actually does; I'll explain my problem instead:
I realised that when I call this method in a loop, sometimes, I get the same random number over and over again... For example, this snippet...
for (int i = 0; i < 10; ++i)
{
    int Index = MathUtility::randomize(0, 1000);
    qDebug() << Index;
}
...will produce something like :
567
567
567
567...etc...
I realised too that if I don't call qsrand every time, but only once during my application's lifetime, it works perfectly...
My question : Why ?
Because if you call randomize more than once in a millisecond (which is rather likely at current CPU clock speeds), you are seeding the RNG with the same value. This is guaranteed to produce the same output from the RNG.
Random-number generators are only meant to be seeded once. Seeding them multiple times does not make the output extra random, and in fact (as you found) may make it much less random.
If you make the call fast enough the value of QTime::currentTime().msec() will not change, and you're basically re-seeding qsrand with the same seed, causing the next random number generated to be the same as the prior one.
If you call the Qt qsrand function to initialize the seed, you must call the Qt qrand function to generate a random number, not the rand function from the standard library. The seed initialization for the rand function is srand.
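A minimal sketch of the usual pattern (assuming Qt's qsrand/qrand, which are deprecated in newer Qt releases in favour of QRandomGenerator): seed exactly once at startup, then only draw numbers:

#include <QTime>
#include <QtGlobal>   // qsrand(), qrand(), qSwap()

int randomize(int Min, int Max)
{
    if (Min > Max)
        qSwap(Min, Max);
    // No qsrand() here: this function only draws numbers.
    return (qrand() % (Max - Min + 1)) + Min;
}

int main()
{
    qsrand(QTime::currentTime().msec());   // seed once per program run
    for (int i = 0; i < 10; ++i)
        randomize(0, 1000);                // values now vary from call to call
    return 0;
}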
Sorry for the dig up.
What you see is the effect of pseudo-randomness. You seed it with the time once, and it generates a sequence of numbers. Since you are pulling a series of random numbers very quickly after each other, you are re-seeding the randomizer with the same number until the next millisecond. And while a millisecond seems like a short time, consider the amount of calculations you're doing in that time.
Modern Qt / C++11:
#include <random>
#include <QDateTime>

int getRand(int min, int max) {
    unsigned int ms = static_cast<unsigned>(QDateTime::currentMSecsSinceEpoch());
    std::mt19937 gen(ms);
    std::uniform_int_distribution<> uid(min, max);
    return uid(gen);
}
Two problems:
1. As others have pointed out, the generator is being seeded multiple times.
2. This is not a very good method of generating random numbers within a given range. (In fact it's very, very bad for most generators.)
You are assuming that the low-order bits from the generator are uniformly distributed. This is not the case with most generators; in most generators the randomness occurs in the high-order bits.
By using the remainder after division you are, in effect, throwing away the randomness.
You should scale using multiplication and division, not the modulo operator.
e.g.
my_number = start_required + (generator_output * range_required) / generator_maximum;
If generator_output is in [0, generator_maximum],
my_number will be in [start_required, start_required + range_required].
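A sketch of that scaling applied to the standard rand() (the variable names mirror the formula above; a 64-bit intermediate avoids overflow in the multiplication):

#include <cstdio>
#include <cstdlib>   // std::srand, std::rand, RAND_MAX
#include <ctime>

// Returns a value in [start_required, start_required + range_required],
// scaled from the generator's high-order bits instead of its remainder.
int scaled(int start_required, int range_required)
{
    long long generator_output = std::rand();            // 0 .. RAND_MAX
    return start_required
         + (int)((generator_output * range_required) / RAND_MAX);
}

int main()
{
    std::srand((unsigned)std::time(nullptr));             // seed once
    for (int i = 0; i < 5; ++i)
        std::printf("%d\n", scaled(0, 1000));
    return 0;
}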
I ran into the same behaviour and solved it by calling rand() without the srand() reseeding.
But I only use it for checking my application; it just runs in a loop, so I don't need the seed to update.
If you are going to write some kind of game, though, this isn't a good approach, because your random sequence will be the same every run.
