I have to design and implement a Fortran routine to determine the size of clusters on a square lattice, and it seemed extremely convenient to code the subroutine recursively. However, whenever my lattice size grows beyond a certain value (around 200/side), the subroutine consistently segfaults. Here's my cluster-detection routine:
    RECURSIVE SUBROUTINE growCluster(lattice, adj, idx, area)
      INTEGER, INTENT(INOUT) :: lattice(:), area
      INTEGER, INTENT(IN) :: adj(:,:), idx
      ! Mark this site as visited and count it
      lattice(idx) = -1
      area = area + 1
      ! Recurse into each occupied, unvisited neighbor
      IF (lattice(adj(1,idx)).GT.0) &
        CALL growCluster(lattice,adj,adj(1,idx),area)
      IF (lattice(adj(2,idx)).GT.0) &
        CALL growCluster(lattice,adj,adj(2,idx),area)
      IF (lattice(adj(3,idx)).GT.0) &
        CALL growCluster(lattice,adj,adj(3,idx),area)
      IF (lattice(adj(4,idx)).GT.0) &
        CALL growCluster(lattice,adj,adj(4,idx),area)
    END SUBROUTINE growCluster
where adj(1,n) represents the north neighbor of site n, adj(2,n) represents the west and so on. What would cause the erratic segfault behavior? Is the cluster just "too huge" for large lattice sizes?
I think you're running into a stack overflow. If your lattice is over 200 units per side, that's 40,000 sites, which means the recursion can go up to 40,000 levels deep. Depending on your stack size and your stack frame size, you could easily be running out of stack space.
You'll have to convert your algorithm into one that uses less stack space in order to handle larger lattices. Wikipedia provides a few implementations (in pseudocode) on how to do a flood fill without blowing your stack.
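One way to do this is to replace the call stack with an explicit stack that lives on the heap. The sketch below uses Python for brevity rather than Fortran. The names cluster_area and periodic_adjacency, the 0-based indexing, and the periodic boundaries are assumptions made for this illustration, not part of the original routine.

```python
def cluster_area(lattice, adj, start):
    """Count the sites in the cluster containing `start`, without recursion.

    lattice: list of ints; occupied, unvisited sites are > 0 (modified in place)
    adj:     adj[k][n] is the k-th neighbor (k = 0..3) of site n
    """
    stack = [start]        # explicit stack, allocated on the heap
    lattice[start] = -1    # mark a site as visited when it is pushed
    area = 0
    while stack:
        idx = stack.pop()
        area += 1
        for k in range(4):
            nb = adj[k][idx]
            if lattice[nb] > 0:
                lattice[nb] = -1
                stack.append(nb)
    return area


def periodic_adjacency(side):
    """Hypothetical helper: N/W/S/E neighbors on a periodic side x side lattice."""
    n = side * side
    adj = [[0] * n for _ in range(4)]
    for i in range(n):
        row, col = divmod(i, side)
        adj[0][i] = ((row - 1) % side) * side + col   # north
        adj[1][i] = row * side + (col - 1) % side     # west
        adj[2][i] = ((row + 1) % side) * side + col   # south
        adj[3][i] = row * side + (col + 1) % side     # east
    return adj


side = 300
lattice = [1] * (side * side)   # fully occupied lattice: one big cluster
print(cluster_area(lattice, periodic_adjacency(side), 0))
```

Because each site is marked before it is pushed, no site enters the stack twice, so the stack never holds more entries than there are sites. The same loop should carry over to Fortran with an integer array used as the stack.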
1) Have you tried compiling and then running with subscript-range checking turned on? Just to make sure that the large size isn't revealing a bug in which the code steps past the array bounds into illegal memory. It's an easy check.
2) Regarding the possibility that your program is running out of stack space: try increasing the per-process stack size limit. How to do this depends on the operating system (in Linux and macOS shells it is usually done with ulimit -s) -- search here, or google, or tell us your operating system.
I have to solve a coding problem on the GPU using CUDA, but I always get the warning: Stack size for "name of the function" cannot be statically determined.
This is for a student project I'm working on. The project is written in C using the CUDA 9.0 libraries, and it runs on an NVIDIA Quadro K5000 GPU.
Every thread must execute one function, and inside this function there are two recursive calls to the same function. I want to use two recursive calls because they keep the code clean and simple for me. If there is only one recursive call, the stack size warning no longer appears.
Here is the error I get every time I compile the code:
CUDA supports recursive function calls but I don't understand why it makes a problem when there are two recursive calls.
__device__ void bitonicMergeGPU(float *arr, int l, int indexT, int order)
int k,p;
if(l > 1)
p = l/2;
//Compare the values.
I simply want to know if it is possible to solve the problem of the recursive calls.
CUDA supports recursion. When you use recursion in CUDA, this warning is expected, and there is no NVIDIA-documented way you can make the warning go away (except by not using recursion).
If you use a function recursively, in most languages it will use more stack space as the recursion depth increases. This is true in CUDA as well. You need to account for this and provide enough stack space for the maximum recursion depth you anticipate. It is common practice to limit recursion depth, so as to prevent stack problems.
The compiler is unable to discover the maximum runtime recursion depth at compile time, and the warning is there to remind you of that.
Regardless of how much you increase the stack size, the warning will not go away. The warning is there to let you know that it is your responsibility to make sure your recursion design along with the stack space allocated will work correctly. The compiler does not verify in any way that the amount of increase in stack size is sufficient.
Be very careful when using recursion in CUDA. Recursion uses per-thread stack memory, which has a limit of 512 KB. The default is usually 1 KB, which is easy to overflow, crashing the program. You can query the per-thread stack size with cudaThreadGetLimit() (deprecated in favor of cudaDeviceGetLimit() with cudaLimitStackSize). There are two options:
Redesign the algorithm/function using a non-recursive approach. The efficiency is usually very similar.
Increase the stack size per thread using cudaThreadSetLimit() (or its replacement, cudaDeviceSetLimit()), without exceeding the limit, e.g. 512 KB.
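As an illustration of the non-recursive option, here is a minimal sketch of a bitonic merge written as two nested loops. It is in Python for readability, not CUDA. The function name and the parameters (lo, n, ascending) are assumptions and do not correspond to the l, indexT and order arguments of the question's bitonicMergeGPU, whose body is not shown in full. It assumes n is a power of two and that arr[lo:lo+n] is already a bitonic sequence.

```python
def bitonic_merge_iterative(arr, lo, n, ascending=True):
    """Sort the bitonic sequence arr[lo:lo+n] in place, without recursion."""
    p = n // 2                      # distance between compared elements
    while p >= 1:
        for i in range(n):
            if i & p == 0:          # i is in the lower half of its 2p-wide block
                a, b = lo + i, lo + i + p
                # Swap when the pair is out of order for the requested direction
                if (arr[a] > arr[b]) == ascending:
                    arr[a], arr[b] = arr[b], arr[a]
        p //= 2                     # halve the stride, as the recursion would


data = [1, 4, 6, 8, 7, 5, 3, 2]     # rises then falls: a bitonic sequence
bitonic_merge_iterative(data, 0, len(data))
print(data)                         # [1, 2, 3, 4, 5, 6, 7, 8]
```

Each pass of the outer loop plays the role of one recursion level, so the stack depth stays constant regardless of n. In a device function the same structure would need no recursive calls, which should also avoid the stack size warning.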
I use MPI_TYPE_CREATE_SUBARRAY to create a type used to communicate portions of 3D arrays between neighboring processes in a Cartesian topology. Specifically, each process communicates with the two processes on the two sides along each of the three directions.
Referring for simplicity to a one-dimensional grid, there are two parameters nL and nR that define how many values each process has to receive from the left and send to the right, and how many each has to receive from the right and send to the left.
Unaware of (or maybe just forgetting) the fact that all elements of the array_of_subsizes parameter of MPI_TYPE_CREATE_SUBARRAY must be positive, I wrote code that can't deal with the case nR = 0 (or nL = 0; either can occur).
(By the way, I see that MPI_TYPE_VECTOR does accept zero count and blocklength arguments and it's sad MPI_TYPE_CREATE_SUBARRAY can't.)
How would you suggest handling this problem? Do I really have to convert each call to MPI_TYPE_CREATE_SUBARRAY into multiple chained MPI_TYPE_VECTOR calls?
The following code is minimal but not runnable on its own (it works in the larger program, and I haven't had time to extract the minimum set of declarations and prints); still, it should give a better idea of what I'm talking about.
INTEGER :: ndims = 3, DBS, ierr, temp, sub3D
INTEGER, DIMENSION(ndims) :: aos, aoss
! doesn't work if ANY(aoss == 0)
! does work if ANY(aoss == 0)
CALL MPI_TYPE_HVECTOR(aoss(2), aoss(1), DBS*aos(1), MPI_DOUBLE_PRECISION, temp, ierr)
CALL MPI_TYPE_HVECTOR(aoss(3), 1, DBS*PRODUCT(aos(1:2)), temp, sub3D, ierr)
In the end it wasn't hard to replace MPI_TYPE_CREATE_SUBARRAY with two MPI_TYPE_HVECTORs. Maybe this is the best solution, after all.
In this sense one side question comes naturally for me: why is MPI_TYPE_CREATE_SUBARRAY so limited? There are a lot of examples in the MPI standard of stuff which correctly falls back on "do nothing" (when a sender or receiver is MPI_PROC_NULL) or "there's nothing in this" (when aoss has a zero dimension in my example). Should I post a feature request somewhere?
The MPI 3.1 standard (chapter 4.1, page 95) makes it crystal clear:
For any dimension i, it is erroneous to specify array_of_subsizes[i] < 1 [...].
You are free to send your comment to the appropriate Mailing List.
Is it possible to get a stack overflow with a function that is not tail call optimized in Erlang? For example, suppose I have a function like this:
    sum_list([],Acc) ->
        Acc;
    sum_list([Head|Tail],Acc) ->
        Head + sum_list(Tail, Acc).
It would seem like if a large enough list was passed in it would eventually run out of stack space and crash. I tried testing this like so:
> L = lists:seq(1, 10000000).
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22, 23,24,25,26,27,28,29|...]
> sum_test:sum_list(L, 0).
But it never crashes! I tried it with a list of 100,000,000 integers and it took a while to finish but it still never crashed! Questions:
Am I testing this correctly?
If so, why am I unable to generate a stack overflow?
Is Erlang doing something that prevents stack overflows from occurring?
You are testing this correctly: your function is indeed not tail-recursive. To find out, you can compile your code using erlc -S <erlang source file>.
{function, sum_list, 2, 2}.
As a comparison the following tail-recursive version of the function:
    tail_sum_list([],Acc) ->
        Acc;
    tail_sum_list([Head|Tail],Acc) ->
        tail_sum_list(Tail, Head + Acc).
compiles as:
{function, tail_sum_list, 2, 5}.
Notice the lack of allocate and the call_only opcode in the tail-recursive version, as opposed to the allocate/call/deallocate/return sequence in the non-tail-recursive function.
You are not getting a stack overflow because the Erlang "stack" is very large. A stack overflow usually means that the processor stack overflowed, i.e. the processor's stack pointer moved past the end of the memory reserved for the stack. Processes traditionally have a limited stack size, which can be tuned by interacting with the operating system. See for example POSIX's setrlimit.
However, the Erlang execution stack is not the processor stack, as the code is interpreted. Each Erlang process has its own stack, which can grow as needed by invoking operating system memory allocation functions (typically malloc on Unix).
As a result, your function will not crash as long as malloc calls succeed.
For the record, the actual list L uses the same amount of memory as the stack needed to process it. Each element in the list takes two words: the integer value itself (which fits in a single word because the values are small) and the pointer to the next element of the list. Likewise, the stack grows by two words at each iteration because of the allocate opcode: one word for the CP, which is saved by allocate itself, and one word as requested (the first parameter of allocate) for the current value.
For 100,000,000 elements on a 64-bit VM, the list takes a minimum of 1.5 GB, and so does the stack (more in practice since, fortunately, the actual stack is not grown two words at a time). Monitoring and garbage collecting this is difficult in the shell, as many values remain live. If you spawn a function, you can see the memory usage:
    spawn(fun() ->
        io:format("~p\n", [erlang:memory()]),
        L = lists:seq(1, 100000000),
        io:format("~p\n", [erlang:memory()]),
        sum_test:sum_list(L, 0),
        io:format("~p\n", [erlang:memory()])
    end).
As you can see, the memory for the recursive call is not released immediately.
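The 1.5 GB estimate given earlier can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch in Python, assuming two 8-byte words per list element on a 64-bit VM as described above. The helper name list_bytes is made up for this illustration.

```python
WORD_BYTES = 8  # word size on a 64-bit VM


def list_bytes(n_elements, words_per_cell=2, word_bytes=WORD_BYTES):
    """Minimum memory for a list: one value word plus one pointer word per element."""
    return n_elements * words_per_cell * word_bytes


n = 100_000_000
size = list_bytes(n)
print(size)                      # 1600000000 bytes
print(round(size / 2**30, 2))    # 1.49 (GiB)
```

The stack needs roughly the same amount again while the non-tail-recursive function runs, since each pending call also holds two words.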
I'm trying to write a histogram kernel in OpenCL to compute 256 bin R, G, and B histograms of an RGBA32F input image. My kernel looks like this:
const sampler_t mSampler = CLK_NORMALIZED_COORDS_FALSE |
__kernel void computeHistogram(read_only image2d_t input, __global int* rOutput,
__global int* gOutput, __global int* bOutput)
int2 coords = {get_global_id(0), get_global_id(1)};
float4 sample = read_imagef(input, mSampler, coords);
uchar rbin = floor(sample.x * 255.0f);
uchar gbin = floor(sample.y * 255.0f);
uchar bbin = floor(sample.z * 255.0f);
When I run it on a 2100 x 894 image (1,877,400 pixels), I tend to see only around 1,870,000 total values recorded when I sum up the histogram values for each channel. It's also a different number each time. I did expect this, since once in a while two work-items probably read the same value from the output array and increment it, effectively cancelling out one increment operation (I'm assuming?).
The 1,870,000 output is for a {1,1} workgroup size (which is what seems to get set by default if I don't specify otherwise). If I force a larger workgroup size like {10,6}, I get a drastically smaller sum in my histogram (proportional to the change in workgroup size). This seemed strange to me, but I'm guessing what happens is that all of the work items in the group increment the output array value at the same time, and so it just counts as a single increment?
Anyways, I've read in the spec that OpenCL has no global memory synchronization, only synchronization within local workgroups using their __local memory. The histogram example by nVidia breaks up the histogram workload into a bunch of subproblems of a specific size, computes their partial histograms, then merges the results into a single histogram afterwards. This doesn't seem like it'll work all that well for images of arbitrary size. I suppose I could pad the image data out with dummy values...
Being new to OpenCL, I guess I'm wondering if there's a more straightforward way to do this (since it seems like it should be a relatively straightforward GPGPU problem).
As stated before, you are writing to shared memory in an unsynchronized, non-atomic way, which leads to errors. If the picture is big enough, I have a suggestion:
Make the work-groups one-dimensional, over columns or rows. Have each work-item build the histogram for its column or row, then add it to the global histogram with atomic operations (atom_add, or atom_inc for single increments). This does most of the summing in private memory, which is much faster, and reduces the number of atomic operations.
If you work in two dimensions you can do it on parts of the picture.
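The idea can be sketched on the CPU as follows. This is a plain Python illustration of the partial-histogram scheme, not OpenCL. The function names and the synthetic image are assumptions, and channel values are assumed to lie in [0, 1] as in the question's kernel.

```python
def row_histogram(row, bins=256):
    """Private histogram for one row; on the GPU this is one work-item's job."""
    hist = [0] * bins
    for v in row:
        b = min(int(v * (bins - 1)), bins - 1)   # same binning as floor(v * 255)
        hist[b] += 1
    return hist


def merge_histograms(partials, bins=256):
    """Combine the partial histograms; on the GPU each += is an atomic add."""
    total = [0] * bins
    for hist in partials:
        for b in range(bins):
            total[b] += hist[b]
    return total


# Synthetic single-channel image: 48 rows of 64 values in [0, 1]
image = [[((x * 7 + y * 13) % 256) / 255.0 for x in range(64)] for y in range(48)]
total = merge_histograms([row_histogram(row) for row in image])
print(sum(total))   # 3072: every pixel is counted exactly once
```

With this layout the number of contended global updates is at most 256 per row instead of one per pixel, and no increments are lost because each private histogram has a single writer.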
I think I have a better answer: ;-)
Have a look at: http://developer.download.nvidia.com/compute/opencl/sdk/website/samples.html#oclHistogram
They have an interesting implementation there...
Yes, you're writing to shared memory from many work-items at the same time, so you will lose updates if you don't perform them in a safe way (or worse; just don't do it). Increasing the group size actually increases the utilization of your compute device, which in turn increases the likelihood of conflicts. So you end up losing more updates.
However, you seem to be confusing synchronization (ordering the execution of threads) with shared memory updates (which typically require either atomic operations, or code synchronization plus memory barriers, to make sure the memory updates are visible to the other synchronized threads).
Synchronization plus barriers is not particularly useful in your case (and, as you noted, is not available for global synchronization anyway; the reason is that two thread-groups may never run concurrently, so trying to synchronize them is nonsensical). It's typically used when all threads start by generating a common data set, and then all consume that data set with a different access pattern.
In your case, you can use atomic operations (e.g. atom_inc, see http://www.cmsoft.com.br/index.php?option=com_content&view=category&layout=blog&id=113&Itemid=168). However, note that updating a highly contended memory address (say, because you have thousands of threads trying all to write to only 256 ints) is likely to yield poor performance. All the hoops typical histogram code goes through are there to reduce the contention on the histogram data.
You can also check:
The histogram example from AMD Accelerated Parallel Processing (APP) SDK.
Chapter 14 - Image Histogram of OpenCL Programming Guide book (ISBN-13: 978-0-321-74964-2).
GPU Histogram - Sample code from Apple
We have a particle detector hard-wired to use 16-bit and 8-bit buffers. Every now and then, there are certain [predicted] peaks of particle fluxes passing through it; that's okay. What is not okay is that these fluxes usually reach magnitudes above the capacity of the buffers to store them; thus, overflows occur. On a chart, they look like the flux suddenly drops and begins growing again. Can you propose a [mostly] accurate method of detecting points of data suffering from an overflow?
P.S. The detector is physically inaccessible, so fixing it the 'right way' by replacing the buffers doesn't seem to be an option.
Update: Some clarifications as requested. We use python at the data processing facility; the technology used in the detector itself is pretty obscure (treat it as if it was developed by a completely unrelated third party), but it is definitely unsophisticated, i.e. not running a 'real' OS, just some low-level stuff to record the detector readings and to respond to remote commands like power cycle. Memory corruption and other problems are not an issue right now. The overflows occur simply because the designer of the detector used 16-bit buffers for counting the particle flux, and sometimes the flux exceeds 65535 particles per second.
Update 2: As several readers have pointed out, the intended solution would have something to do with analyzing the flux profile to detect sharp declines (e.g. by an order of magnitude) in an attempt to separate them from normal fluctuations. Another problem arises: can restorations (points where the original flux drops below the overflowing level) be detected by simply running the correction program against the reverted (by the x axis) flux profile?
int32[] unwrap(int16[] x)
// this is pseudocode
int32[] y = new int32[x.length];
y[0] = x[0];
for (i = 1:x.length-1)
y[i] = y[i-1] + sign_extend(x[i]-x[i-1]);
// works fine as long as the "real" value of x[i] and x[i-1]
// differ by less than 1/2 of the span of allowable values
// of x's storage type (=32768 in the case of int16)
// Otherwise there is ambiguity.
return y;
int32 sign_extend(int16 x)
return (int32)x; // works properly in Java and in most C compilers
// exercise for the reader to write similar code to unwrap 8-bit arrays
// to a 16-bit or 32-bit array
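Since the data processing is done in Python, the pseudocode above might translate roughly as follows. This is a sketch that treats the readings as unsigned counters (0..65535 for 16 bits), which is an assumption based on the description in the question. The function name and the bits parameter are illustrative. As in the pseudocode, it is only correct when consecutive true values differ by less than half the counter span.

```python
def unwrap(samples, bits=16):
    """Reconstruct true counts from a wrapping `bits`-wide counter.

    Assumes consecutive true values differ by less than half the span
    (32768 for 16 bits); otherwise the result is ambiguous.
    """
    if not samples:
        return []
    span = 1 << bits
    half = span >> 1
    out = [samples[0]]
    for prev, cur in zip(samples, samples[1:]):
        delta = (cur - prev) % span   # wrapped difference, in [0, span)
        if delta >= half:             # closer to a negative step than a positive one
            delta -= span
        out.append(out[-1] + delta)
    return out


true_flux = [5000, 10000, 30000, 50000, 70000, 90000, 60000, 30000, 10000]
measured = [t % 65536 for t in true_flux]    # what a 16-bit counter would report
print(unwrap(measured) == true_flux)         # True
```

The 8-bit buffers can be handled by passing bits=8. Python integers do not overflow, so no separate wider output type is needed.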
Of course, ideally you'd fix the detector software to max out at 65535 to prevent wraparound of the sort that is causing your grief. I understand that this isn't always possible, or at least isn't always possible to do quickly.
When the particle flux exceeds 65535, does it do so quickly, or does the flux gradually increase and then gradually decrease? This makes a difference in what algorithm you might use to detect this. For example, if the flux goes up slowly enough:
    true flux    measurement
         5000           5000
        10000          10000
        30000          30000
        50000          50000
        70000           4464
        90000          24464
        60000          60000
        30000          30000
        10000          10000
then you'll tend to have a large negative drop at times when you have overflowed. A much larger negative drop than you'll have at any other time. This can serve as a signal that you've overflowed. To find the end of the overflow time period, you could look for a large jump to a value not too far from 65535.
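A minimal sketch of this drop-and-jump detection in Python is shown below. The function name and the default threshold of half the counter span are assumptions chosen for the illustration. In practice the threshold would have to be tuned to how fast the real flux can change, and the sketch only handles a single wrap (true values below 131072).

```python
SPAN = 1 << 16   # a 16-bit counter wraps modulo 65536


def find_overflow_spans(measured, threshold=SPAN // 2):
    """Return (start, end) pairs; samples start..end-1 are presumed wrapped."""
    spans = []
    start = None
    for i in range(1, len(measured)):
        step = measured[i] - measured[i - 1]
        if start is None and step <= -threshold:
            start = i                    # large drop: readings wrapped from here
        elif start is not None and step >= threshold:
            spans.append((start, i))     # large jump: flux is back below the limit
            start = None
    if start is not None:                # still wrapped at the end of the data
        spans.append((start, len(measured)))
    return spans


true_flux = [5000, 10000, 30000, 50000, 70000, 90000, 60000, 30000, 10000]
measured = [t % SPAN for t in true_flux]
spans = find_overflow_spans(measured)
corrected = list(measured)
for start, end in spans:
    for i in range(start, end):
        corrected[i] += SPAN             # undo one wrap inside each span
print(spans)                             # [(4, 6)]
print(corrected == true_flux)            # True
```

In this example the genuine drop from 60000 to 30000 stays below the threshold, so it is not mistaken for an overflow. With noisier data the two cases can overlap, and no fixed threshold will separate them.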
All of this depends on the maximum true flux that is possible and on how rapidly the flux rises and falls. For example, is it possible to get more than 128k counts in one measurement period? Is it possible for one measurement to be 5000 and the next measurement to be 50000? If the data is not well-behaved enough, you may be able to make only a statistical judgment about when you have overflowed.
Your question needs to provide more information about your implementation - what language/framework are you using?
Data overflows in software (which is what I think you're talking about) are bad practice and should be avoided. What you are seeing (strange data output) is only one possible side effect of data overflows; it is merely the tip of the iceberg of the sorts of issues you can run into.
You could quite easily experience more serious issues like memory corruption, which can cause programs to crash loudly, or worse, obscurely.
Is there any validation you can do to prevent the overflows from occurring in the first place?
I really don't think you can fix it without fixing the underlying buffers. How are you supposed to tell the difference between the sequences of values (0, 1, 2, 1, 0) and (0, 1, 65538, 1, 0)? You can't.
How about using an HMM where the hidden state is whether you are in an overflow and the emissions are observed particle flux?
The tricky part would be coming up with the probability models for the transitions (which will basically encode the time-scale of peaks) and for the emissions (which you can build if you know how the flux behaves and how overflow affects measurement). These are domain-specific questions, so there probably aren't ready-made solutions out there.
But once you have the model, everything else---fitting your data, quantifying uncertainty, simulation, etc.---is routine.
You can only do this if the actual jumps between successive values are much smaller than 65536. Otherwise, an overflow-induced valley artifact is indistinguishable from a real valley, and you can only guess. You can try to match overflows to the corresponding restorations by simultaneously analysing the signal from the right and from the left (assuming that there is a recognizable baseline).
Other than that, all you can do is to adjust your experiment by repeating it with different original particle flows, so that real valleys will not move, but artifact ones move to the point of overflow.