#include <bits/stdc++.h>
using namespace std;
#define ll long long

ll solve(ll a, ll b, ll i){
    // base case
    if (a == 0) return i;
    if (b > a) return i+1;
    // recursive case
    if (b == 1) {
        return solve(a, b+1, i+1);
    }
    ll n = solve(a, b+1, i+1);
    ll m = solve(a/b, b, i+1);
    return min(n, m);
}

int main(){
    int t;
    cin >> t;
    while(t--){
        ll a, b;
        cin >> a >> b;
        cout << solve(a, b, 0) << endl;
    }
}
Basically, the question is Codeforces problem 1485A. The problem is that when I give some big input like 50000000 for a and 5 for b, this gives me a segmentation fault error, while the code works fine for smaller inputs. Please help me solve it.
Using recursion here is a terrible choice, and you need to make all the obvious algorithmic optimizations.
The key insight is that for any path that divides before increasing b, there is a path at least as good that does not divide before increasing b. Why divide by a smaller number when you could divide by a bigger one, given that you are going to spend the steps increasing b anyway?
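A quick sanity check with a = 8, b = 2: dividing first costs 4 operations (8/2 = 4, increase b to 3, then 4/3 = 1 and 1/3 = 0), while increasing b first costs only 3 (b to 3, then 8/3 = 2 and 2/3 = 0). So an optimal strategy always has the form: increase b some number of times, then divide repeatedly.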
With that insight, and removing recursion, the problem is trivial to solve:
#include <iostream>

unsigned long long divisions(unsigned long long a, unsigned long long b)
{
    // figure out how many divide operations we need
    int ops = 0;
    while (a > 0)
    {
        a /= b;
        ops++;
    }
    return ops;
}

unsigned long long ops(unsigned long long a, unsigned long long b)
{
    // figure out how many divides we need with the smallest possible b
    unsigned long long min_ops = (b == 1) ? (1 + divisions(a, b+1)) : divisions(a, b);
    // try every sensible larger b to see if it takes fewer operations
    for (unsigned long long num_inc = 1; num_inc <= min_ops; ++num_inc)
    {
        unsigned long long ops = num_inc + divisions(a, b + num_inc);
        if (ops < min_ops)
            min_ops = ops;
    }
    return min_ops;
}

int main(void)
{
    int t;
    std::cin >> t;
    while (t--)
    {
        unsigned long long a, b;
        std::cin >> a >> b;
        std::cout << ops(a, b) << std::endl;
    }
}
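A note on cost: divisions() takes O(log_b a) iterations, and since a fits in 64 bits the initial min_ops is at most about 65, so the outer loop tries at most that many candidate increments. Each test case therefore costs at most a few thousand integer divisions.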
Again, the lesson is that you must make algorithmic optimizations before you start coding. No amount of great coding will make a terrible algorithm work well.
By the way, there was a huge hint on the problem page. Something in the problem tags gives the key optimization away.
I was looking at implementing the following computation, where divisor is nonzero and not a power of two
unsigned multiplier(unsigned divisor)
{
    unsigned shift = 31 - __builtin_clz(divisor);
    uint64_t t = 1ull << (32 + shift);
    return t / divisor;
}
in a manner that is efficient for processors that lack 64-bit integer and floating-point instructions but may have 32-bit fused multiply-add (such as GPUs, which also lack hardware division).
This calculation is useful for finding the "magic multipliers" used to optimize division by a divisor known ahead of time into a multiply-high instruction followed by a bitwise shift. Unlike the code used in compilers and the reference code in libdivide, it finds the largest such multiplier.
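As a concrete check: for divisor = 7, shift = 31 - clz(7) = 2, and the multiplier comes out to floor(2^34 / 7) = 2454267026.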
One additional twist is that in the application I was looking at, I anticipated that divisor will almost always be representable in float type. Therefore, it would make sense to have an efficient "fast path" that will handle those divisors, and a size-optimized "slow path" that would handle the rest.
The solution I came up with performs long division with remainder that is specialized for this particular scenario (dividend is a power of two) in 6 or 8 FMA operations on the "fast path", and then performs a binary search with 8 iterations on the "slow path".
The following program performs exhaustive testing of the proposed solution (needs about 1-2 minutes on an FMA-capable CPU).
#include <math.h>
#include <stdint.h>
#include <stdio.h>

struct quomod {
    unsigned long quo;
    unsigned long mod;
};

// Divide 1 << (32 + SHIFT) by DIV, return quotient and modulus
struct quomod
quomod_ref(unsigned div, unsigned shift)
{
    uint64_t t = 1ull << (32 + shift);
    return (struct quomod){t / div, t % div};
}

// Reinterpret given bits as float
static inline float int_as_float(uint32_t bits)
{
    return (union{ unsigned b; float f; }){bits}.f;
}

// F contains integral value in range [-2**32 .. 2**32]. Convert it to integer,
// with wrap-around on overflow. If the GPU implements saturating conversion,
// it also may be used
static inline uint32_t cvt_f32_u32_wrap(float f)
{
    return (uint32_t)(long long)f;
}

struct quomod
quomod_alt(unsigned div, unsigned shift)
{
    // t = float(1ull << (32 + shift))
    float t = int_as_float(0x4f800000 + (shift << 23));
    // mask with max(0, shift - 23) low bits zero
    uint32_t mask = (int)(~0u << shift) >> 23;
    // No roundoff in conversion
    float div_f = div & mask;
    // Caution: on the CPU this is correctly rounded, but on the GPU
    // native reciprocal may be off by a few ULP, in which case a
    // refinement step may be necessary:
    // recip = fmaf(fmaf(recip, -div_f, 1), recip, recip)
    float recip = 1.f / div_f;
    // Higher part of the quotient, integer in range 2^31 .. 2^32
    float quo_hi = t * recip;
    // No roundoff
    float res = fmaf(quo_hi, -div_f, t);
    float quo_lo_approx = res * recip;
    float res2 = fmaf(quo_lo_approx, -div_f, res);
    // Lower part of the quotient, may be negative
    float quo_lo = floorf(fmaf(res2, recip, quo_lo_approx));
    // Remaining part of the dividend
    float mod_f = fmaf(quo_lo, -div_f, res);
    // Quotient as sum of parts
    unsigned quo = cvt_f32_u32_wrap(quo_hi) + (int)quo_lo;
    // Adjust quotient down if remainder is negative
    if (mod_f < 0) {
        quo--;
    }
    if (div & ~mask) {
        // The quotient was computed for a truncated divisor, so
        // it matches or exceeds the true result

        // High part of the dividend
        uint32_t ref_hi = 1u << shift;
        // Unless quotient is zero after wraparound, increment it so
        // it's higher than true quotient (its high bit must be 1)
        quo -= (int)quo >> 31;
        // Binary search for the true quotient; search invariant:
        // quo is higher than true quotient, quo-2*bit is lower
        for (unsigned bit = 256; bit; bit >>= 1) {
            unsigned try = quo - bit;
            // One multiply-high instruction
            uint32_t prod_hi = 1ull * try * div >> 32;
            if (prod_hi >= ref_hi)
                quo = try;
        }
        // quo is zero or exceeds the true quotient, so quo-1 must be it
        quo--;
    }
    // Use the "left-pointing short magic wand" operator
    // to recover the remainder
    return (struct quomod){quo, quo *- div};
}

int main()
{
    fprintf(stderr, "%66c\r[", ']');
    unsigned step = 1;
    for (unsigned div = 3; div; div += step) {
        // Progress bar
        if (!(div & 0x03ffffff)) fprintf(stderr, "=");
        // Skip powers of two
        if (!(div & (div-1))) continue;
        unsigned shift = 31 - __builtin_clz(div);
        struct quomod ref = quomod_ref(div, shift);
        struct quomod alt = quomod_alt(div, shift);
        if (ref.quo != alt.quo || ref.mod != alt.mod) {
            printf("\nerror at %u\n", div);
            return 1;
        }
    }
    fprintf(stderr, "=\nAll ok\n");
    return 0;
}
I am trying to run C code in R using Rcpp, but am unsure how to convert a buffer used to hold data from a file. In the third line of the code below, I allocate an unsigned char buffer, and my problem is that I don't know what Rcpp data type to use. Once the data are read into the buffer, I figured out how to use an Rcpp::NumericMatrix to hold the final result, but not the character buffer.

I have seen several responses by Dirk Eddelbuettel to similar questions where he suggests replacing all 'malloc' calls with Rcpp initialization commands. I tried using an Rcpp::CharacterVector, but then there is a type mismatch in the loop at the end: the Rcpp::CharacterVector cannot be read as an unsigned long long int. The code runs under some C compilers but throws a 'memory corruption' error under others, so I would prefer to do things the way Dirk suggests (use Rcpp data types) so that the code will run regardless of the specific compiler.
FILE *fp = fopen( filename, "r" );
fseek( fp, index_data_offset, SEEK_SET );
unsigned char* buf = (unsigned char *)malloc( 3 * number_of_index_entries * sizeof(unsigned long long int) );
fread( buf, sizeof(unsigned long long int), (long)(3 * number_of_index_entries), fp );
fclose( fp );

// Convert "buf" into a 3-column matrix.
unsigned long long int l;
Rcpp::NumericMatrix ToC(3, number_of_index_entries);
for (int col=0; col<number_of_index_entries; col++ ) {
    l = 0;
    int offset = (col*3 + 0)*sizeof(unsigned long long int);
    for (int i = 0; i < 8; ++i) {
        l = l | ((unsigned long long int)buf[i+offset] << (8 * i));
    }
    ToC(0,col) = l;

    l = 0;
    offset = (col*3 + 1)*sizeof(unsigned long long int);
    for (int i = 0; i < 8; ++i) {
        l = l | ((unsigned long long int)buf[i+offset] << (8 * i));
    }
    ToC(1,col) = l;

    l = 0;
    offset = (col*3 + 2)*sizeof(unsigned long long int);
    for (int i = 0; i < 8; ++i) {
        l = l | ((unsigned long long int)buf[i+offset] << (8 * i));
    }
    ToC(2,col) = l;
}
return( ToC );
C and C++ can be lovely. If you know what you're doing, you have both a very direct line to the underlying hardware and higher-level abstraction for efficient reasoning.
I would suggest simplifying and reducing the problem. Start with a simple and known case, for example an STL vector of double. Let's call it x. Fill it with ten or a hundred elements, then open a FILE and write a blob from
x.data(), x.size() * sizeof(double)
Close the file. Then read it into Rcpp by first allocating a NumericVector v of the same size, then reading the bytes back, and then calling memcpy to &(v[0]).
It should be the same vector.
Then you can generalize to different types. Because vectors are guaranteed to be contiguous in memory, you can use this serialization trick directly.
You can do variations on this with character buffers, or void*, or ... None of that matters as long as you are careful not to mismatch types, i.e. don't assign an int payload to a double and so on.
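A minimal sketch of that round trip (the file name "blob.bin" is made up for illustration, and fread-ing straight into &(v[0]) stands in for the separate read-plus-memcpy step, since the destination is contiguous either way):

#include <Rcpp.h>
#include <cstdio>
#include <vector>

// [[Rcpp::export]]
Rcpp::NumericVector roundtrip() {
    int n = 100;
    std::vector<double> x(n);
    for (int i = 0; i < n; i++) x[i] = 0.5 * i;

    // write the vector's bytes as one blob
    std::FILE *fp = std::fopen("blob.bin", "wb");
    std::fwrite(x.data(), sizeof(double), x.size(), fp);
    std::fclose(fp);

    // allocate a NumericVector of the same size and read the bytes back
    Rcpp::NumericVector v(n);
    fp = std::fopen("blob.bin", "rb");
    std::fread(&(v[0]), sizeof(double), v.size(), fp);
    std::fclose(fp);
    return v;   // should hold exactly the values of x
}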
Now, is any of this recommended? Hell no, unless you are chasing performance and know well enough what you are doing, in which case it is reasonable. Otherwise rely on fantastic existing packages like fst or qs to do it for you.
I hope this helps with your question. I wasn't entirely sure what it was you were asking; maybe clarify (and possibly shorten / focus) it if not.
A typecast did the trick:
Rcpp::NumericVector NumVecBuf( 3 * number_of_index_entries * sizeof(unsigned long long int) );
unsigned char* buf = (unsigned char*) &(NumVecBuf[0]);
Dirk's statement about "contiguous memory" suggested that this would work, so I went ahead and marked his comment as the answer. Thanks, Dirk! And, thanks for developing and maintaining Rcpp!
I'm trying to allocate multidimensional arrays using CUDA UMA on a Power 8 system. However, I'm having an issue as the size gets bigger. The code I'm using is below. When the size is 24 x 24 x 24 x 5, it works fine. When I increase it to 64 x 64 x 64 x 8, I get "out of memory" even though I have enough memory on my device. AFAIK, UMA should let me allocate as much memory as the GPU device physically has, so I would not expect any error. Currently my main configuration is Power 8 and a Tesla K40, where I am getting a seg fault during runtime. However, I tried the code piece I provided on an x86 + K40 machine, and it surprisingly worked.
BTW, if you can tell me another way to do this apart from transforming all my code from 4-D arrays to 1-D arrays, I'd much appreciate it.
Thanks in advance
Driver: Nvidia 361
#include <iostream>
#include <cuda_runtime.h>

void* operator new[] (size_t len) throw(std::bad_alloc) {
    void *ptr;
    cudaMallocManaged(&ptr, len);
    return ptr;
}

template<typename T>
T**** create_4d(int a, int b, int c, int d){
    T**** ary = new T***[a];
    for(int i = 0; i < a; ++i)
    {
        ary[i] = new T**[b];
        for(int j = 0; j < b; ++j){
            ary[i][j] = new T*[c];
            for(int k = 0; k < c; ++k){
                ary[i][j][k] = new T[d];
            }
        }
    }
    return ary;
}

int main() {
    double ****data;
    std::cout << "allocating..." << std::endl;
    data = create_4d<double>(32,65,65,5);
    std::cout << "Hooreey !!!" << std::endl;
    //segfault here
    std::cout << "allocating..." << std::endl;
    data = create_4d<double>(64,65,65,5);
    std::cout << "Hooreey !!!" << std::endl;
    return 0;
}
There's been a considerable amount of dialog on your cross-posting here including an answer to your main question. I'll use this answer to summarize what is there as well as to answer this question specifically:
BTW, if you can tell me another way to do this apart from transforming all my code from 4-D arrays to 1-D arrays, I'd much appreciate it.
One of your claims is that you are doing proper error checking ("I caught the error properly"). You are not. CUDA runtime API calls (including cudaMallocManaged) do not by themselves generate C++-style exceptions, so your throw specification on the new operator definition is meaningless. CUDA runtime API calls return an error code. If you want to do proper error checking, you must collect this error code and process it. If you collect the error code, you can use it to generate an exception if you wish; an example of how you might do that is contained in the canonical proper CUDA error checking question, in one of the answers by Jared Hoberock. As a result of this oversight, when your allocations eventually fail, you ignore it, and then when you attempt to use those (non-)allocated areas for subsequent pointer storage, you generate a seg fault.
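A minimal version of the required check might look like this (a sketch; the worked example later in this answer uses assert for brevity):

cudaError_t err = cudaMallocManaged(&ptr, len);
if (err != cudaSuccess) {
    // handle the failure here: report cudaGetErrorString(err), throw, or abort
}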
The proximal reason for the allocation failure is that you are in fact running out of memory, as discussed in your cross-posting. You can confirm this easily enough with proper error checking. Managed allocations have a granularity, and so when you request allocations of relatively small amounts, you are in fact using more memory than you think - the small allocations you are requesting are each being rounded up to the allocation granularity. The size of the allocation granularity varies by system type, and so the OpenPower system you are operating on has a much larger allocation granularity than the x86 system you compared it to, and as a result you were not running out of memory on the x86 system, but you were on the Power system. As discussed in your cross-posting, this is easy to verify with strategic calls to cudaMemGetInfo.
From a performance perspective, this is a pretty bad approach to multidimensional allocations for several reasons:
1. The allocations you are creating are disjoint, connected by pointers. Therefore, accessing an element by pointer dereferencing requires 3 or 4 such dereferences to go through a 4-subscripted pointer array. Each of these dereferences involves a device memory access. Compared to using simulated 4-D access into a 1-D (flat) allocation, this will be noticeably slower. The arithmetic associated with converting the 4-D simulated access into a single linear index is much faster than traversing through memory via pointer chasing.

2. Since the allocations you are creating are disjoint, the managed memory subsystem cannot coalesce them into a single transfer, and therefore, under the hood, a number of transfers equal to the product of your first 3 dimensions will take place at kernel launch time (and presumably at termination, i.e. at the next cudaDeviceSynchronize() call). This data must all be transferred of course, but you will be doing a large number of very small transfers, compared to a single transfer for a "flat" allocation. The associated overhead of the large number of small transfers can be significant.

3. As we've seen, the allocation granularity can seriously impact the memory usage efficiency of such an allocation scheme. What should be using only a small percentage of system memory ends up using all of system memory.

4. Operations that work on contiguous data from "row" to "row" of such an allocation will fail, because the allocations are disjoint. For example, such a matrix or a subsection of such a matrix could not reliably be passed to a CUBLAS linear algebra routine, as the expectation for that matrix would be contiguity of row storage in memory.
The ideal solution would be to create a single flat allocation, and then use simulated 4-D indexing to create a single linear index. Such an approach would address all 4 concerns above. However it requires perhaps substantial code refactoring.
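For illustration, roughly what that refactoring looks like (a sketch with made-up dimensions A, B, C, D and indices i, j, k, l; the full worked example follows below):

// one flat managed allocation covering all A*B*C*D elements
const int A = 64, B = 63, C = 62, D = 5;
double *data;
cudaError_t err = cudaMallocManaged(&data, (size_t)A*B*C*D*sizeof(double));
assert(err == cudaSuccess);
// simulated 4-D access to element (i, j, k, l):
// data[((i*B + j)*C + k)*D + l]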
We can however come up with an alternate approach, which preserves the 4-subscripted indexing, but otherwise addresses the concerns in items 2, 3, and 4 above by creating a single underlying flat allocation.
What follows is a worked example. We will actually create 2 managed allocations: one underlying flat allocation for data storage, and one underlying flat allocation (regardless of dimensionality) for pointer storage. It would be possible to combine these two into a single allocation with some careful alignment work, but that is not required to achieve any of the proposed benefits.
The basic methodology is covered in various other CUDA questions here on the SO tag, but most of those have host-side usage (only) in view, since they did not have UM in view. However, UM allows us to extend the methodology to host- and device-side usage. We will start by creating a single "base" allocation of the necessary size to store the data. Then we will create an allocation for the pointer array, and we will then work through the pointer array, fixing up each pointer to point to the correct location in the pointer array, or else to the correct location in the "base" data array.
Here's a worked example, demonstrating host and device usage, and including proper error checking:
$ cat t1271.cu
#include <iostream>
#include <assert.h>

template<typename T>
T**** create_4d_flat(int a, int b, int c, int d){
    T *base;
    cudaError_t err = cudaMallocManaged(&base, a*b*c*d*sizeof(T));
    assert(err == cudaSuccess);
    T ****ary;
    err = cudaMallocManaged(&ary, (a+a*b+a*b*c)*sizeof(T*));
    assert(err == cudaSuccess);
    for (int i = 0; i < a; i++){
        ary[i] = (T ***)((ary + a) + i*b);
        for (int j = 0; j < b; j++){
            ary[i][j] = (T **)((ary + a + a*b) + i*b*c + j*c);
            for (int k = 0; k < c; k++)
                ary[i][j][k] = base + ((i*b+j)*c + k)*d;
        }
    }
    return ary;
}

template<typename T>
void free_4d_flat(T**** ary){
    if (ary[0][0][0]) cudaFree(ary[0][0][0]);
    if (ary) cudaFree(ary);
}

template<typename T>
__global__ void fill(T**** data, int a, int b, int c, int d){
    unsigned long long int val = 0;
    for (int i = 0; i < a; i++)
        for (int j = 0; j < b; j++)
            for (int k = 0; k < c; k++)
                for (int l = 0; l < d; l++)
                    data[i][j][k][l] = val++;
}

void report_gpu_mem()
{
    size_t free, total;
    cudaMemGetInfo(&free, &total);
    std::cout << "Free = " << free << " Total = " << total << std::endl;
}

int main() {
    report_gpu_mem();
    unsigned long long int ****data2;
    std::cout << "allocating..." << std::endl;
    data2 = create_4d_flat<unsigned long long int>(64, 63, 62, 5);
    report_gpu_mem();
    fill<<<1,1>>>(data2, 64, 63, 62, 5);
    cudaError_t err = cudaDeviceSynchronize();
    assert(err == cudaSuccess);
    std::cout << "validating..." << std::endl;
    for (int i = 0; i < 64*63*62*5; i++)
        if (*(data2[0][0][0] + i) != i) {
            std::cout << "mismatch at " << i << " was " << *(data2[0][0][0] + i) << std::endl;
            return -1;
        }
    free_4d_flat(data2);
    return 0;
}
$ nvcc -arch=sm_35 -o t1271 t1271.cu
$ cuda-memcheck ./t1271
========= CUDA-MEMCHECK
Free = 5904859136 Total = 5975900160
allocating...
Free = 5892276224 Total = 5975900160
validating...
========= ERROR SUMMARY: 0 errors
$
Notes:
This still involves pointer chasing inefficiency. I don't know of a method to avoid that without removing the multiple subscript arrangement.
I've elected to use 2 different indexing schemes in host and device code. In device code, I am using a normal 4-subscripted index, to demonstrate the utility of that. In host code, I am using a "flat" index, to demonstrate that the underlying storage is contiguous and contiguously addressable.
I've been thinking about it all day and still cannot figure out why this happens. My objective is simple: step 1, generate a function S(h,p); step 2, numerically integrate S(h,p) with respect to p by the trapezoidal rule to obtain a new function SS(h). I wrote the code and sourced it with sourceCpp, and it successfully generated the two functions S(h,p) and SS(h) in R. But when I tried to test it by calculating SS(1), R just kept running and never gave a result, which is weird because the amount of calculation is not that big. Any idea why this would happen?
My code is here:
#include <Rcpp.h>
using namespace Rcpp;

// generate the first function that gives S(h,p)
// [[Rcpp::export]]
double S(double h, double p){
    double out = 2*(h + p + h*p);
    return out;
}

// generate the second function that gives the numerical integration of S(h,p) w.r.t. p
// [[Rcpp::export]]
double SS(double h){
    double out1 = 0;
    double sum = 0;
    for (int i = 0; i < 1; i = i + 0.01){
        sum = sum + S(h, i);
    }
    out1 = 0.01/2*(2*sum - S(h,0) - S(h,1));
    return out1;
}
The problem is that you are treating i as if it were not an int in this statement:
for (int i = 0; i < 1; i = i + 0.01){
    sum = sum + S(h, i);
}
After each iteration you are attempting to add 0.01 to an integer, which is of course immediately truncated towards 0, meaning that i is always equal to zero, and you have an infinite loop. A minimal example highlighting the problem, with a couple of possible solutions:
#include <Rcpp.h>

// [[Rcpp::export]]
void bad_loop() {
    for (int i = 0; i < 1; i += 0.01) {
        std::printf("i = %d\n", i);
        Rcpp::checkUserInterrupt();
    }
}

// [[Rcpp::export]]
void good_loop() {
    for (int i = 0; i < 100; i++) {
        std::printf("i = %d\n", i);
        Rcpp::checkUserInterrupt();
    }
}

// [[Rcpp::export]]
void good_loop2() {
    for (double j = 0.0; j < 1.0; j += 0.01) {
        std::printf("j = %.2f\n", j);
        Rcpp::checkUserInterrupt();
    }
}
The first alternative (good_loop) is to scale your step size appropriately -- looping from 0 through 99 by 1 takes the same number of iterations as looping from 0 to 0.99 by 0.01. Additionally, you could just use a double instead of an int, as in good_loop2. At any rate, the main takeaway here is that you need to be more careful about choosing your variable types in C++. Unlike R, when you declare i to be an int it will be treated like an int, not a floating point number.
As @nrussell pointed out very expertly, there is an issue with treating i as an int when the value it holds is meant to be a double. The goal of posting this answer is to stress the need to avoid using a double or float as a loop increment; I've opted to post it as an answer instead of a comment for readability.
Please note, the loop increment should never be a double or a float due to precision issues: e.g. it is hard to hit i = 0.99 exactly, since the accumulated value may be 0.98111111... and so on.
Instead, I would opt to have the loop processed as an int and convert it to a double / float as soon as possible, e.g.
for (int i = 0; i < 100; i++){
    // Make sure to use double division
    // (e.g. either numerator or denominator is a floating / double)
    sum += S(h, i/100.0);
}
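Plugged back into the original function, the corrected SS might look like this (a sketch; the endpoint arithmetic is kept exactly as the asker wrote it):

// [[Rcpp::export]]
double SS(double h){
    double sum = 0;
    for (int i = 0; i < 100; i++){
        sum += S(h, i/100.0);  // int counter, double argument
    }
    // endpoint terms kept as in the original post
    return 0.01/2*(2*sum - S(h,0) - S(h,1));
}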
Further notes:
RcppArmadillo and C++ division issue
Using float / double as a loop variable
OK, say I have a boolean array called bits, and an int called cursor
I know I can access individual bits by using bits[cursor], and that I can use bit logic to get larger datatypes from bits, for example:
short result = (bits[cursor]   << 3) |
               (bits[cursor+1] << 2) |
               (bits[cursor+2] << 1) |
                bits[cursor+3];
This is going to result in lines and lines of code when reading larger types like int32 and int64 though.
Is it possible to do a cast of some kind and achieve the same result? I'm not concerned about safety at all in this context (these functions will be wrapped into a class that handles that)
Say I wanted to get a uint64_t out of bits, starting at an arbitrary position specified by cursor, where cursor isn't necessarily a multiple of 64; is this possible with a cast? I thought this
uint64_t result = (uint64_t *)(bits + cursor)[0];
would work, but it doesn't want to compile.
Sorry I know this is a dumb question, I'm quite inexperienced with pointer math. I'm not looking just for a short solution, I'm also looking for a breakdown of the syntax if anyone would be kind enough.
Thanks!
You could try something like this and cast the result to your target data size.
#include <stdbool.h>
#include <stdint.h>

uint64_t bitsToUint64(bool *bits, unsigned int bitCount)
{
    uint64_t result = 0;
    uint64_t tempBits = 0;
    if (bitCount > 0 && bitCount <= 64)
    {
        // bits[0] ends up in the most significant position of the result
        for (unsigned int i = 0, j = bitCount - 1; i < bitCount; i++, j--)
        {
            tempBits = (bits[i]) ? 1 : 0;
            result |= (tempBits << j);
        }
    }
    return result;
}
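For example (hypothetical values), reading 8 bits starting at cursor:

bool bits[] = {1, 0, 1, 1, 0, 0, 1, 0};
unsigned int cursor = 0;
// bits[cursor] becomes the most significant of the 8 bits read
uint64_t v = bitsToUint64(bits + cursor, 8);  // 0b10110010 == 178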