I need to work with generalized polygons whose edges can be not only line segments but also circle segments, and which may contain holes. So I decided to use
CGAL Boolean set operations
with the setup:
typedef CGAL::Exact_predicates_exact_constructions_kernel Kernel;
typedef Kernel::Point_2 Point_2;
typedef Kernel::Circle_2 Circle_2;
typedef Kernel::Line_2 Line_2;
typedef CGAL::Gps_circle_segment_traits_2<Kernel> Traits_2;
typedef CGAL::General_polygon_set_2<Traits_2> Polygon_set_2;
typedef Traits_2::General_polygon_2 Polygon_2;
typedef Traits_2::General_polygon_with_holes_2 Polygon_with_holes_2;
typedef Traits_2::Curve_2 Curve_2;
typedef Traits_2::X_monotone_curve_2 X_monotone_curve_2;
typedef Traits_2::Point_2 Point_2t;
typedef Traits_2::CoordNT coordnt;
typedef CGAL::Arrangement_2<Traits_2> Arrangement_2;
typedef Arrangement_2::Face_handle Face_handle;
I have to compute unions of large numbers of polygons (up to about 200,000) and then subtract from such unions up to 100,000-200,000 further polygons, so speed is an important concern. In the setup above an exact kernel is used, where the number type of Point_2t is of the form a0 + a1*sqrt(root), with a0, a1, root rational numbers. To speed things up, I tried to compile everything with CGAL::Kernel<double> instead, but this led to assertion errors in the CGAL code.
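For orientation, here is a minimal sketch (not from the question) of how the aggregated union interface of General_polygon_set_2 could be used with the typedefs above, assuming the general polygons have already been constructed. The aggregated join() overload is usually much faster than a loop of pairwise unions, while differences are applied one polygon at a time:

#include <vector>
#include <list>
#include <iterator>

std::vector<Polygon_2> add_polys;   // polygons to unite (up to ~200,000)
std::vector<Polygon_2> sub_polys;   // polygons to subtract afterwards
// ... construct the circle/segment polygons ...
Polygon_set_2 gps;
gps.join(add_polys.begin(), add_polys.end());    // aggregated union
for (const Polygon_2& p : sub_polys)
    gps.difference(p);                           // pairwise subtraction
std::list<Polygon_with_holes_2> result;
gps.polygons_with_holes(std::back_inserter(result));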
Is it true that the Boolean operations on polygons require an exact number type kernel and cannot work with an inexact CGAL::Kernel<double>?
If there is no possibility for further speedup in this CGAL library:
Is there another C++ library for working with generalized polygons (also allowing circle segments as edges) and computing Boolean operations on them that is faster, for example because it uses only double values as point coordinates?
I have declared a signed multidimensional array as follows:
typedef logic signed [3:0][31:0] hires_frame_t;
typedef hires_frame_t [3:0] hires_capture_t;
hires_capture_t rndpacket;
I want to randomize this array such that each element has a value between -32768 and 32767, in 32-bit two's complement.
I've tried the following:
assert(std::randomize(rndpacket) with
{foreach (rndpacket[channel])
foreach (rndpacket[channel][subsample])
{rndpacket[channel][subsample] < signed'(32768);
rndpacket[channel][subsample] >= signed'(-32768);}});
This compiles fine, but (Mentor Graphics) ModelSim fails in simulation, claiming:
randomize() failed due to conflicts between the following constraints:
# clscummulativedata.sv(56): (rndpacket[3][3] < 32768);
# cummulativedata.sv(57): (rndpacket[3][3] >= 32'hffff8000);
This is clearly something linked to the use of signed vectors. I had a feeling that everything should be fine, since the array is declared as signed, as are the thresholds in the randomize call, but apparently not. If I replace the range with 0 to 65535, everything works as expected.
What is the correct way to randomize such a signed array?
Your problem is that hires_frame_t is a signed 128-bit 2-dimensional packed array, and a part-select of a packed array is unsigned. A way of keeping the part-select of a packed dimension signed is using a separate typedef for the dimension you want signed:
typedef bit signed [31:0] int32_t;
typedef int32_t [3:0] hires_frame_t;
typedef hires_frame_t [3:0] hires_capture_t;
Another option is putting the signed cast on the LHS of the comparisons. Your signed cast on the RHS is not doing anything, because bare decimal numbers are already treated as signed, and a comparison is unsigned if one or both sides are unsigned.
assert(std::randomize(rndpacket) with {
foreach (rndpacket[channel,subsample])
{signed'(rndpacket[channel][subsample]) < 32768;
signed'(rndpacket[channel][subsample]) >= -32768;}});
BTW, I'm showing the LRM-compliant way of writing a 2-D foreach loop.
I have the following OpenCL kernel, which copies values from one buffer to another, optionally inverting the value (the 'invert' arg can be 1 or -1):
__kernel void extraction(__global const short* src_buff, __global short* dest_buff, const int record_len, const int invert)
{
    int i = get_global_id(0); // Index of record in buffer
    int j = get_global_id(1); // Index of value in record
    dest_buff[(i * record_len) + j] = src_buff[(i * record_len) + j] * invert;
}
The source buffer contains one or more "records", each containing N (record_len) short values. All records in the buffer are of equal length, and record_len is always a multiple of 32.
The global size is 2D (number of records in the buffer, record length), and I chose this as it seemed to make best use of the GPU parallel processing, with each thread being responsible for copying just one value in one record in the buffer.
(The local work size is set to NULL by the way, allowing OpenCL to determine the value itself).
After reading about vectors recently, I was wondering if I could use these to improve on the performance? I understand the concept of vectors but I'm not sure how to use them in practice, partly due to lack of good examples.
I'm sure the kernel's performance is pretty reasonable already, so this is mainly out of curiosity to see what difference it would make using vectors (or other more suitable approaches).
At the risk of being a bit naive here, could I simply change the two buffer arg types to short16, and change the second value in the 2-D global size from "record length" to "record length / 16"? Would this result in each kernel thread copying a block of 16 short values between the buffers?
Your naive assumption is basically correct, though you may want to add a hint to the compiler that this kernel is optimized for the vector type (Section 6.7.2 of the spec). In your case, you would add
__attribute__((vec_type_hint(short16)))
above your kernel function. So in your example, you would have
__attribute__((vec_type_hint(short16)))
__kernel void extraction(__global const short16* src_buff, __global short16* dest_buff, const int record_len, const int invert)
{
    int i = get_global_id(0); // Index of record in buffer
    int j = get_global_id(1); // Index of value in record
    dest_buff[(i * record_len) + j] = src_buff[(i * record_len) + j] * invert;
}
You are correct that your second global dimension should be divided by 16, and your record_len should also be divided by 16. Also, if you were to specify the local size instead of passing NULL, you would want to divide that by 16 as well.
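For concreteness, a hedged host-side sketch (queue, kernel and num_records are assumed names, not from the question) of how the 2-D global size changes when each work-item handles one short16:

size_t global[2];
global[0] = (size_t)num_records;        // one row of work-items per record
global[1] = (size_t)(record_len / 16);  // each work-item copies 16 shorts
// local size left NULL so the runtime picks it, as in the question
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL, 0, NULL, NULL);

The record_len argument passed to the kernel would then also be the per-record count of short16 elements (record_len / 16).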
There are some other things to consider though.
You might think choosing the largest vector size should provide the best performance, especially with such a simple kernel, but in my experience that is rarely the optimal size. You may try asking clGetDeviceInfo for CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT, but in my experience this is rarely accurate (also, it may give you 1, meaning the compiler will try auto-vectorization or the device doesn't have vector hardware). It is best to try different vector sizes and see which is fastest.
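A minimal sketch of that query (device is an assumed, already-obtained cl_device_id):

cl_uint width = 0;
clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT,
                sizeof(width), &width, NULL);
// width == 1 may mean scalar hardware, or that the compiler prefers to auto-vectorize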
If your device supports auto-vectorization, and you want to give it a go, it may help to remove your record_len parameter and replace it with get_global_size(1) so the compiler/driver can take care of dividing record_len by whatever vector size it picks. I would recommend doing this anyway, assuming record_len is equal to the global size you gave that dimension.
Also, you gave NULL to the local size argument so that the implementation picks a size automatically. It is guaranteed to pick a size that works, but it will not necessarily pick the most optimal size.
Lastly, for general OpenCL optimizations, you may want to take a look at the NVIDIA OpenCL Best Practices Guide for NVIDIA hardware, or the AMD APP SDK OpenCL User Guide for AMD GPU hardware. The NVIDIA one is from 2009, and I'm not sure how much their hardware has changed since. Note, though, that it actually says:
The CUDA architecture is a scalar architecture. Therefore, there is no performance benefit from using vector types and instructions. These should only be used for convenience.
Older AMD hardware (pre-GCN) benefited from using vector types, but AMD suggests not using them on GCN devices (see mogu's comment). Also if you are targeting a CPU, it will use AVX hardware if available.
Using MPI and C, I'm looking to distribute (scatter and gather) a 2D array of complex double values (i.e. every element in the 2D array is of type complex double, so it has a creal and a cimag component). If I use a regular declaration of a 2D array of size n-by-n:
double complex grid[n][n];
Everything works just fine, BUT my program will fail depending on the size of n, giving a "segmentation fault" error. Anything above, say, 256 will immediately spit out a "segmentation fault" error. This is the problem that I'm having and am failing miserably to figure out.
After browsing through numerous similar issues, I'm guessing my problem is that I'm overflowing the stack memory (something I honestly don't fully understand), meaning that I need to dynamically allocate my 2D arrays using malloc or calloc.
However, in my understanding, allocating a 2D array that you can index like grid[n][n] won't work, since the allocated memory is not necessarily contiguous, meaning that MPI_Scatter fails.
double complex **alloc_2d_complex(int rows, int cols){
    double complex *data = (complex double*) malloc(rows*cols*sizeof(complex double));
    double complex **array = (complex double**) malloc(rows*sizeof(complex double*));
    int i;
    for (i = 0; i < rows; i++)
        array[i] = &(data[cols*i]);
    return array;
}

int main(int argc, char *argv[]){
    double complex **grid;
    grid = alloc_2d_complex(n, n);
    /* Continue to initialize MPI and attempt Scatter... */
}
I've tried initializing a 2D array by this method, and Scatter does fail for me, giving the error "memcpy argument memory ranges overlap", since something in memory apparently doesn't line up right.
This means I must allocate everything in 1D arrays in row-major order, like:
grid[y][x] ==> grid[y*n + x]
I'm really, really trying to avoid this, because I'm dealing with numerous transposed and untransposed matrices (which are hard enough to keep track of in [y][x] logic) and it's going to make things difficult to keep track of for my purposes, but fine, if it's what I have to do then let's get it over with. But this ALSO doesn't work with MPI_Scatter, giving me once again "memcpy" errors, which I am utterly dumbfounded by. Below is an example of how I'm trying to do everything using 1D arrays. Since I'm getting the same error for this and the 2D allocated array, maybe the 2D allocation will work and I'm just missing something here. I'm only using a number of processors, numProcs, that evenly divides n.
int n = 128;
double complex *grid = malloc(n*n*sizeof(complex double));
/* ... Initialize MPI ... */
stepSize = (int) n/numProcs;
double complex *gridChunk = malloc(stepSize*n*sizeof(complex double));
/* ... Initialize grid[y*n+x] Values... */
MPI_Scatter(&grid, n*stepSize, MPI_C_DOUBLE_COMPLEX,
&gridChunk, n*stepSize, MPI_C_DOUBLE_COMPLEX,
0, MPI_COMM_WORLD);
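For reference, a hedged sketch (not the question's code) of the call shape MPI_Scatter expects with this contiguous 1-D layout: the buffer arguments are the pointers returned by malloc themselves, rather than the addresses of the pointer variables as in the snippet above:

/* grid and gridChunk are the malloc'ed pointers from the snippet above */
MPI_Scatter(grid,      n*stepSize, MPI_C_DOUBLE_COMPLEX,
            gridChunk, n*stepSize, MPI_C_DOUBLE_COMPLEX,
            0, MPI_COMM_WORLD);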
Recently I started using the Armadillo C++ library. Given that my C++ coding skills are not that great, I found it very friendly for linear algebra. I am also using it along with MATLAB to speed up many of my reconstruction algorithms.
I need to create a vector of boolean values, and I would prefer using this library rather than std::vector. However, I could not figure out how to do it. I tried using uvec, but the documentation seems to indicate that it cannot be used with boolean values.
Any help would be appreciated.
Regards,
Dushyant
Consider using a matrix uchar_mat, which is a typedef for Mat<unsigned char>; it should consume the same amount of memory as a matrix of boolean values.
The Armadillo documentation of version 7.8 states that a matrix Mat<type> can be of the following types: float, double, std::complex<float>, std::complex<double>, short, int, long, and unsigned versions of short, int, and long. The code on GitHub, however, contains typedef Mat<unsigned char> uchar_mat; in the file include/armadillo_bits/typedef_mat.hpp, so you should also be able to use uchar_mat.
You will not save any memory by creating a matrix of bool values compared to a matrix of unsigned char values (a bool consumes at least 8 bits). This is because in C++ every data type must be addressable: it must be at least 1 byte long so that it is possible to create a pointer pointing to it.
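A minimal sketch (assuming Armadillo is installed; 0/1 stand in for false/true) of using uchar_mat as a compact boolean container:

#include <armadillo>
#include <iostream>

int main() {
    arma::uchar_mat flags(4, 4, arma::fill::zeros);  // Mat<unsigned char>, 1 byte per element
    flags(1, 2) = 1;                                 // mark one entry as "true"
    std::cout << flags << std::endl;
    return 0;
}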
This is part of the pseudo code I am implementing in CUDA as part of an image reconstruction algorithm:
for each xbin(0->detectorXDim/2-1):
    for each ybin(0->detectorYDim-1):
        rayInit=(xbin*xBinSize+0.5, ybin*xBinSize+0.5, -detectordistance)
        rayEnd=beamFocusCoord
        slopeVector=rayEnd-rayInit
        //knowing that r=rayInit+t*slopeVector;
        //x=rayInit[0]+t*slopeVector[0]
        //y=rayInit[1]+t*slopeVector[1]
        //z=rayInit[2]+t*slopeVector[2]
        //to find ray xx intersections:
        for each xinteger(xbin+1->detectorXDim/2):
            solve t for x=xinteger*xBinSize;
            find corresponding y and z
            add to intersections array
        //find ray yy intersections (analogous to xx intersections)
        //find ray zz intersections (analogous to xx intersections)
So far, this is what I have come up with:
__global__ void sysmat(int xfocus, int yfocus, int zfocus, int xbin, int xbinsize, int ybin, int ybinsize, int zbin, int projecoes){
    int tx=threadIdx.x, ty=threadIdx.y, tz=threadIdx.z, bx=blockIdx.x, by=blockIdx.y, i, x, y, z;
    int idx=ty+by*blocksize;
    int idy=tx+bx*blocksize;
    int slopeVectorx=xfocus-idx*xbinsize+0.5;
    int slopeVectory=yfocus-idy*ybinsize+0.5;
    int slopeVectorz=zfocus-zdetector;
    __syncthreads();
    //points where the ray intersects x axis
    int xint=idx+1;
    int yint=idy+1;
    int*intersectionsx[(detectorXDim/2-xint)+(detectorYDim-yint)+(zfocus)];
    int*intersectionsy[(detectorXDim/2-xint)+(detectorYDim-yint)+(zfocus)];
    int*intersectionsz[(detectorXDim/2-xint)+(detectorYDim-yint)+(zfocus)];
    for(xint=xint; xint<detectorXDim/2; xint++){
        x=xint*xbinsize;
        t=(x-idx)/slopeVectorx;
        y=idy+t*slopeVectory;
        z=z+t*slopeVectorz;
        intersectionsx[xint-1]=x;
        intersectionsy[xint-1]=y;
        intersectionsz[xint-1]=z;
        __syncthreads();
    }
    ...
}
This is just a piece of the code. I know that there might be some errors (you can point them out if they are blatantly wrong), but what I am more concerned about is this:
Each thread (which corresponds to a detector bin) needs three arrays so it can save the points where the ray (which passes through this thread/bin) intersects multiples of the x, y and z axes. Each array's length depends on the place of the thread/bin (its index) in the detector and on beamFocusCoord (which is fixed). In order to do this I wrote this piece of code, which I am certain cannot be done (I confirmed it with a small test kernel, and it returns the error "expression must have constant value"):
int*intersectionsx[(detectorXDim/2-xint)+(detectorXDim-yint)+(zfocus)];
int*intersectionsy[(detectorXDim/2-xint)+(detectorXDim-yint)+(zfocus)];
int*intersectionsz[(detectorXDim/2-xint)+(detectorXDim-yint)+(zfocus)];
So in the end, I want to know if there is an alternative to this piece of code, where a vector's length depends on the index of the thread allocating that vector.
Thank you in advance ;)
EDIT: Given that each thread will have to save an array with the coordinates of the intersections between the ray (which goes from the beam source to the detector) and the xx, yy and zz axes, and that the spatial dimensions are around (I don't have the exact numbers at the moment, but they are very close to the real values) 1400x3600x60, is this problem feasible with CUDA?
For example, the thread (0,0) will have 1400 intersections on the x axis, 3600 on the y axis and 60 on the z axis, meaning that I would have to create an array of size (1400+3600+60)*sizeof(float), which is around 20 KB per thread.
So, given that each thread surpasses the 16 KB of local memory, that is out of the question. The other alternative was to allocate those arrays in global memory but, with some more math, we get (1400+3600+60)*4*numberofthreads (i.e. 1400*3600), which also surpasses the amount of global memory available :(
So I am running out of ideas to deal with this problem and any help is appreciated.
No.
Every piece of memory in CUDA must be known at kernel-launch time. You can't allocate/deallocate/change anything while the kernel is running. This is true for global memory, shared memory and registers.
The common workaround is to allocate the maximum size of memory needed beforehand. This can be as simple as allocating the maximum size needed for one thread, multiplied by the number of threads, or as complex as adding up all the per-thread sizes for a total maximum and calculating appropriate per-thread offsets into that array. That's a tradeoff between memory consumption and offset-computation time.
Go for the simple solution if you can, and for the complex one if you have to because of memory limitations.
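A hedged sketch (illustrative names, not from the question) of the simple variant: one global buffer sized for the worst case per thread, allocated on the host before launch, with each thread indexing its own fixed-size slice:

#include <cuda_runtime.h>

#define MAX_PER_THREAD 5060   // assumed worst case, e.g. 1400 + 3600 + 60

__global__ void intersections_sketch(float* buf, int num_threads)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= num_threads) return;
    float* my_slice = buf + (size_t)tid * MAX_PER_THREAD;  // this thread's slice
    my_slice[0] = 0.0f;   // ... write up to MAX_PER_THREAD values here ...
}

// Host side, before the launch:
// float* d_buf;
// cudaMalloc(&d_buf, (size_t)num_threads * MAX_PER_THREAD * sizeof(float));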
Why are you not using textures? Using a 2D or 3D texture would make this problem much easier. The GPU is designed to do very fast floating-point interpolation, and CUDA includes excellent support for it. The literature has examples of projection reconstruction on the GPU, e.g. Accelerating simultaneous algebraic reconstruction technique with motion compensation using CUDA-enabled GPU, and textures are an integral part of their algorithms. Your own manual coordinate calculations can only be slower and more error-prone than what the GPU provides, unless you need something weird like sinc interpolation.
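As a hedged illustration (variable names and sizes are assumptions, not from the answer), a host-side sketch of setting up a 3-D texture object with hardware trilinear interpolation, which a reconstruction kernel could then sample with tex3D<float>():

#include <cuda_runtime.h>

void create_volume_texture(float* host_data, int nx, int ny, int nz, cudaTextureObject_t* tex_out)
{
    // Copy the row-major float volume into a cudaArray
    cudaExtent extent = make_cudaExtent(nx, ny, nz);
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t volume;
    cudaMalloc3DArray(&volume, &desc, extent);

    cudaMemcpy3DParms copy = {};
    copy.srcPtr = make_cudaPitchedPtr(host_data, nx * sizeof(float), nx, ny);
    copy.dstArray = volume;
    copy.extent = extent;
    copy.kind = cudaMemcpyHostToDevice;
    cudaMemcpy3D(&copy);

    // Bind the array to a texture object with linear filtering
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = volume;
    cudaTextureDesc texDesc = {};
    texDesc.filterMode = cudaFilterModeLinear;   // hardware trilinear interpolation
    texDesc.addressMode[0] = cudaAddressModeClamp;
    texDesc.addressMode[1] = cudaAddressModeClamp;
    texDesc.addressMode[2] = cudaAddressModeClamp;
    texDesc.readMode = cudaReadModeElementType;
    cudaCreateTextureObject(tex_out, &resDesc, &texDesc, NULL);
}
// In a kernel: float v = tex3D<float>(tex, x + 0.5f, y + 0.5f, z + 0.5f);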
1400x3600x60 is a little big for a single 3D texture, but you could break your problem up into 2D slices, 3D sub-volumes, or hierarchical multi-resolution reconstruction. These have all been used by other researchers. Just search PubMed.