Overhead of recursive lambdas

Do recursive lambda functions incur any overhead compared to regular recursive functions (since we have to store them in a std::function)?
What is the difference between this function and a similar one using only regular functions?
int main(int argc, const char *argv[])
{
    std::function<void (int)> helloworld = [&helloworld](int count) {
        std::cout << "Hello world" << std::endl;
        if (count > 1) helloworld(--count);
    };
    helloworld(2);
    return 0;
}

There is overhead when using a lambda recursively by storing it in a std::function, even though lambdas are themselves basically functors. It seems that gcc is not able to optimize this well, which can be seen in a direct comparison.
Implementing the behaviour of the lambda by hand, i.e. writing a functor, enables gcc to optimize again. Your specific example of a lambda could be implemented as
struct HelloWorldFunctor
{
    void operator()(int count) const
    {
        std::cout << "Hello world" << std::endl;
        if ( count > 1 )
        {
            this->operator()(count - 1);
        }
    }
};

int main()
{
    HelloWorldFunctor functor;
    functor(2);
}
For the example I've created, the resulting functor can be seen in this second demo.
Even if one introduces calls to impure functions such as std::rand, the performance without a recursive lambda, or with a custom functor, is still better. Here's a third demo.
Conclusion: With the usage of a std::function, there's overhead, although it might be negligible depending on the use case. Since this usage prevents some compiler optimizations, this shouldn't be used extensively.
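If the std::function is only there so the lambda can call itself, one way to avoid the wrapper entirely (a sketch; requires C++14 generic lambdas) is to pass the lambda to itself as an argument:

#include <iostream>

int main()
{
    // The lambda receives itself as its first argument, so no std::function
    // (and therefore no type erasure or virtual dispatch) is involved.
    auto helloworld = [](auto&& self, int count) -> void {
        std::cout << "Hello world" << std::endl;
        if (count > 1) self(self, count - 1);
    };
    helloworld(helloworld, 2);
    return 0;
}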

So std::function is implemented polymorphically, meaning your code is roughly equivalent to:
struct B
{
    virtual void call(int count) = 0;
};

struct D : B
{
    B* b;
    virtual void call(int count)
    {
        if (count > 1)
            b->call(count - 1);
    }
};

int main()
{
    D d;
    d.b = &d;
    d.call(10);
}
It is rare to have a tight enough recursion such that a virtual method lookup is a significant overhead, but depending on your application area it is certainly possible.

Lambdas in C++ are equivalent to functors, so calling one is the same as calling the operator() of some class created automatically by the compiler. When you capture the environment, what happens behind the scenes is that the captured variables are passed to the constructor of that class and stored as member variables.
So in short, the performance difference should be very close to zero.
Here is some further explanation; jump to the section "How are Lambda Closures Implemented?":
http://www.cprogramming.com/c++11/c++11-lambda-closures.html
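As a rough sketch (the class name is made up; real closure types are unnamed), a lambda such as [x](int y) { return x + y; } corresponds to a compiler-generated class along these lines:

// conceptually generated for: int x = 1; auto f = [x](int y) { return x + y; };
class __compiler_generated_closure
{
    int x;   // captured by value, stored as a member variable
public:
    explicit __compiler_generated_closure(int x_) : x(x_) {}
    int operator()(int y) const { return x + y; }
};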
EDIT:
After some research, and thanks to Stefan's answer and code, it turned out that there is an overhead on recursive lambdas because of the std::function idiom. Since the lambda has to be wrapped in a std::function in order to call itself, invoking it involves a virtual function call, which does add overhead.
The subject is treated on the comments of this answer:
https://stackoverflow.com/a/14591727/1112142

Related

Frama-C fails to prove fact about pointer comparison

Consider the following C code:
#include <assert.h>

//@ requires p < q;
void f(char *p, char *q)
{
    assert(p <= q-1);
}

//@ requires a < b;
void g(int a, int b)
{
    assert(a <= b-1);
}
Using Alt-Ergo, Frama-C successfully proves that the assertion in g() holds, but fails to prove the same for f(). Why?
Formally, pointers and integers are two very different things. In particular, C semantics states that pointer comparison is well defined only for pointers that point into the same allocated block (or one past the end of said allocated block). This is reflected in the model used by the WP plugin of Frama-C in the definition of addr_le and friends (see $(frama-c -print-share-path)/wp/why3/Memory.why), where the pointers are checked to have the same base before the comparison is done on their offsets.
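As an illustration of that same-block restriction (a sketch; the comments spell out the assumptions), comparing pointers into one array is well defined, while comparing pointers to two distinct objects is not:

char buf[8];
char other[8];

/* assumes p and q both point into buf */
int within_one_block(char *p, char *q)
{
    return p <= q - 1;   /* well defined: same allocated block */
}

int across_blocks(void)
{
    return buf < other;  /* undefined: buf and other are distinct blocks */
}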

Array of pointers to functions having different numbers of arguments

I am making a simple scheduler that executes functions contained in a FIFO queue.
Those functions have the same return type int, but take different numbers of int arguments.
I tried to implement it this way, but it does not seem to work. The compiler forbids conversion between int(*)(), int(*)(int), int(*)(int, int), or any of the sort. (Arduino Sketch compiler)
Is there a way to solve this problem, or could you recommend a better way around it? Thanks!
My code:
typedef int (*fnptr)(); // Tried this!

int foo(int var) {
    return 0;
}

int main() {
    fnptr fp = &foo; // error: invalid conversion from
                     // 'int (*)(int)' to 'int (*)()'
                     // [-fpermissive]
    return 0;
}
You can cast:
fnptr fp = reinterpret_cast<fnptr>(foo);
The ()s are the "function call operator"; adding them makes no sense at all in this situation, since it changes the expression from "take the address of this function" to "take the address of this function's return value".
Note that above I don't even include the &; this is because the name of a function acts pretty much like a function pointer, so it's already an address.
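A sketch of how such a table can be used safely: store everything as fnptr, but cast back to the real signature before calling, since calling a function through an incompatible pointer type is undefined behaviour.

typedef int (*fnptr)();

int foo(int var)      { return var; }
int bar(int a, int b) { return a + b; }

int main() {
    // Store heterogeneous function pointers under one common type.
    fnptr queue[2] = { reinterpret_cast<fnptr>(foo),
                       reinterpret_cast<fnptr>(bar) };

    // Cast back to the original signature before the call.
    int r0 = reinterpret_cast<int (*)(int)>(queue[0])(42);
    int r1 = reinterpret_cast<int (*)(int, int)>(queue[1])(1, 2);
    return r0 + r1;
}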

When I invoke an asynchronous CUDA kernel, how are its arguments copied?

Say I want to invoke a CUDA kernel, like this:
struct foo { int a; int b; float c; double d; };
foo arg;
// fill in elements of `arg` here
my_kernel<<<grid_size, block_size, 0, stream>>>(arg);
Assume that stream was previously created using a call to cudaStreamCreate(), so the above will execute asynchronously. I'm concerned about the required lifetime of arg.
Are the arguments to the kernel copied synchronously when I invoke it (so it would be safe for arg to go out of scope immediately), or are they copied asynchronously (so I need to ensure that it stays alive until the kernel runs)?
Arguments are copied synchronously at launch. The API exposes a call stack onto which execution parameters and function arguments are pushed in order; a call then finalises those arguments into a CUDA kernel launch on the driver's internal streams/command queues.
This process isn't documented, but as of CUDA 7.5, a runtime API kernel launch like this:
dot_product<<<1,n>>>(n, d_a, d_b);
becomes this:
(cudaConfigureCall(1, n)) ? (void)0 : (dot_product)(n, d_a, d_b);
where the host stub function dot_product is expanded into this:
void __device_stub__Z11dot_productiPfS_(int __par0, float *__par1, float *__par2)
{
    if (cudaSetupArgument((void *)(char *)&__par0, sizeof(__par0), (size_t)0UL) != cudaSuccess) return;
    if (cudaSetupArgument((void *)(char *)&__par1, sizeof(__par1), (size_t)8UL) != cudaSuccess) return;
    if (cudaSetupArgument((void *)(char *)&__par2, sizeof(__par2), (size_t)16UL) != cudaSuccess) return;
    {
        volatile static char *__f __attribute__((unused));
        __f = ((char *)((void ( *)(int, float *, float *))dot_product));
        (void)cudaLaunch(((char *)((void ( *)(int, float *, float *))dot_product)));
    };
}

void dot_product(int __cuda_0, float *__cuda_1, float *__cuda_2)
{
    __device_stub__Z11dot_productiPfS_(__cuda_0, __cuda_1, __cuda_2);
}
cudaSetupArgument is the API call which pushes arguments onto the call stack. Interestingly, it is actually marked as deprecated in the CUDA 7.5 API documentation, even though the compiler uses it. I would therefore expect this to change in the future, but the idea will stay the same.
The parameters of the kernel call are copied prior to execution, so the scope should be of no concern. Note, however, that the combined size of all kernel parameters may not exceed a documented maximum (see the CUDA Programming Guide). If you want to pass larger structs or blobs of data, you need to allocate that memory on the device using cudaMalloc, copy the contents of the host struct to the device struct using cudaMemcpy, and call the kernel with a pointer to the new device struct.
Your code would look something like this:
struct foo { int a; int b; float c; double d; };
foo arg;
foo *arg_d;
// fill in elements of `arg` here
cudaMalloc(&arg_d, sizeof(foo));
// check the allocation here
cudaMemcpy(arg_d, &arg, sizeof(foo), cudaMemcpyHostToDevice);
my_kernel<<<grid_size, block_size, 0, stream>>>(arg_d);
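If the kernel also writes to the struct, the corresponding copy-back and cleanup might look like this (a sketch, reusing the names from the snippet above):

// Wait for the asynchronous launch on `stream` to finish, then copy the
// (possibly modified) struct back to the host and release the device memory.
cudaStreamSynchronize(stream);
cudaMemcpy(&arg, arg_d, sizeof(foo), cudaMemcpyDeviceToHost);
cudaFree(arg_d);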

Pass double pointer in a struct to CUDA

I've got the following struct:
struct Param
{
    double** K_RP;
};
And I want to perform the following operations on "K_RP" in CUDA:
__global__ void Test(struct Param prop)
{
    int ix = threadIdx.x;
    int iy = threadIdx.y;
    prop.K_RP[ix][iy] = 2.0;
}
If "prop" has the following form, how should I do my "cudaMalloc" and "cudaMemcpy" operations?
int main( )
{
    Param prop;
    Param cuda_prop;
    prop.K_RP = alloc2D(Imax, Jmax);
    // cudaMalloc cuda_prop ?
    // cudaMemcpyH2D prop to cuda_prop ?
    Test<<< dim3(1,1), dim3(Imax,Jmax) >>>(cuda_prop);
    // cudaMemcpyD2H cuda_prop to prop ?
    return (0);
}
Questions like this get asked from time to time. If you search on the cuda tag, you'll find a variety of examples with answers. Here's one example.
In general, dynamically allocated data contained within structures or other objects requires special handling. This question/answer explains why and how to do it for the single pointer (*) case.
Handling double pointers (**) is difficult enough that most people would recommend "flattening" the storage so that it can be handled by reference with a single pointer (*). If you really want to see how the double pointer (**) method works, review this question/answer. It's not trivial.
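A minimal sketch of that flattened alternative (all names and sizes here are hypothetical): the 2D index is computed by hand as ix * Jmax + iy, so only a single cudaMalloc and a single pointer are involved.

struct ParamFlat
{
    double* K_RP;   // Imax * Jmax doubles, stored row-major
};

__global__ void TestFlat(ParamFlat prop, int Jmax)
{
    int ix = threadIdx.x;
    int iy = threadIdx.y;
    prop.K_RP[ix * Jmax + iy] = 2.0;   // manual 2D indexing
}

int main()
{
    const int Imax = 8, Jmax = 8;
    ParamFlat prop;
    cudaMalloc(&prop.K_RP, Imax * Jmax * sizeof(double));   // one allocation, one pointer
    TestFlat<<<1, dim3(Imax, Jmax)>>>(prop, Jmax);          // struct is passed by value
    cudaDeviceSynchronize();
    // copy back with cudaMemcpy(..., cudaMemcpyDeviceToHost) and cudaFree when done
    return 0;
}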

CUDA streams, texture binding and async memcpy

Writing some signal processing in CUDA, I recently made huge progress in optimizing it. By using 1D textures and adjusting my access patterns I managed to get a 10× performance boost. (I previously tried transaction-aligned prefetching from global into shared memory, but the nonuniform access patterns happening later messed up the warp→shared memory bank association, I think.)
So now I'm facing the problem of how CUDA textures and bindings interact with asynchronous memcpy.
Consider the following kernel
texture<...> mytexture;

__global__ void mykernel(float *pOut)
{
    pOut[threadIdx.x] = tex1Dfetch(mytexture, threadIdx.x);
}
The kernel is launched in multiple streams
extern void *sourcedata;

#define N_CUDA_STREAMS ...

cudaStream_t stream[N_CUDA_STREAMS];
void *d_pOut[N_CUDA_STREAMS];
void *d_texData[N_CUDA_STREAMS];

for(int k_stream = 0; k_stream < N_CUDA_STREAMS; k_stream++) {
    cudaStreamCreate(&stream[k_stream]);
    cudaMalloc(&d_pOut[k_stream], ...);
    cudaMalloc(&d_texData[k_stream], ...);
}

/* ... */

for(int i_datablock = 0; i_datablock < n_datablocks; i_datablock++) {
    int const k_stream = i_datablock % N_CUDA_STREAMS;
    cudaMemcpyAsync(d_texData[k_stream], (char*)sourcedata + i_datablock * blocksize, ..., stream[k_stream]);
    cudaBindTexture(0, &mytexture, d_texData[k_stream], ...);
    mykernel<<<..., stream[k_stream]>>>(d_pOut[k_stream]);
}
Now what I wonder about is: since there is only one texture reference, what happens when I bind a buffer to a texture while other streams' kernels access that texture? cudaBindTexture doesn't take a stream parameter, so I'm worried that by binding the texture to another device pointer while running kernels are asynchronously accessing said texture, I'll divert their accesses to the other data.
The CUDA documentation doesn't say anything about this. If I have to disentangle this to allow concurrent access, it seems I'd have to create a number of texture references and use a switch statement to choose between them, based on the stream number passed as a kernel launch parameter.
Unfortunately CUDA doesn't allow putting arrays of texture references on the device side, i.e. the following does not work:
texture<...> texarray[N_CUDA_STREAMS];
Layered textures are not an option, because the amount of data I have only fits within a plain 1D texture not bound to a CUDA array (see table F-2 in the CUDA 4.2 C Programming Guide).
Indeed you cannot unbind the texture while still using it in a different stream.
Since the number of streams doesn't need to be large to hide the asynchronous memcpys (2 would already do), you could use C++ templates to give each stream its own texture:
texture<float, 1, cudaReadModeElementType> mytexture1;
texture<float, 1, cudaReadModeElementType> mytexture2;

template<int TexSel> __device__ float myTex1Dfetch(int x);

template<> __device__ float myTex1Dfetch<1>(int x) { return tex1Dfetch(mytexture1, x); }
template<> __device__ float myTex1Dfetch<2>(int x) { return tex1Dfetch(mytexture2, x); }

template<int TexSel> __global__ void mykernel(float *pOut)
{
    pOut[threadIdx.x] = myTex1Dfetch<TexSel>(threadIdx.x);
}

int main(void)
{
    float *out_d[2];
    // ...
    mykernel<1><<<blocks, threads, 0, stream[0]>>>(out_d[0]);
    mykernel<2><<<blocks, threads, 0, stream[1]>>>(out_d[1]);
    // ...
}
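To plug this back into the per-datablock loop from the question, the template parameter has to be selected at run time, for example with a small host-side switch on the stream index (a sketch reusing the definitions above; the launcher name is made up):

// Hypothetical launcher: picks the instantiation that reads the texture
// associated with the given stream.
void launch_for_stream(int k_stream, dim3 blocks, dim3 threads,
                       cudaStream_t stream, float *pOut)
{
    switch (k_stream) {
        case 0: mykernel<1><<<blocks, threads, 0, stream>>>(pOut); break;
        case 1: mykernel<2><<<blocks, threads, 0, stream>>>(pOut); break;
    }
}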
