Dead kernel when using ray - jupyter-notebook

I tried to use ray for crawling some data.
My original code before using ray is as below, and works well.
def download(n):
    # download nth data
    return downloaded_data
I then used ray, following the ray tutorial, and this makes my kernel die:
@ray.remote
def download(n):
    # download nth data
    return downloaded_data

ray.init(num_cpus=4)
total_data = ray.get([download.remote(x) for x in range(4000)])
I ran the code in both Jupyter and Spyder, and in both cases the kernel died.
I do not think the cause is a lack of memory; it does not use that much memory.
What are the possible causes?

Related

Optimize a sorting while loop

I have been working on a variation of the traveling salesman problem. The solution I am trying to implement is to load my vehicle as close to its maximum as possible, because return trips are expensive.
I have a large data set in the following format:
pkgid Latitude Longitude Weight
42127 8.205561907 34.54574863 37.0660242
42069 7.640153828 34.03634169 31.91148072
96632 7.700233671 33.85385033 24.27309403
93160 7.756960678 35.36007723 22.3526782
39075 6.881522479 34.19903152 19.56993506
62579 7.622385316 33.78590124 16.7793145
93784 7.523606197 35.32735063 16.18484202
81204 7.597161645 33.81316073 11.54433538
My solution is to take the farthest point south and grab nearby neighbours until the vehicle is full. I have a code snippet that works, but it is very slow (seconds per loop). I could use k-means or a similar method, but there is no good way (that I know of) to guarantee a full load or to cut clustering off with a metric. So I wrote my own.
## NN algorithm
library(data.table)   # for fread() and setorder()
library(geosphere)    # for distHaversine()

pkg <- data.frame(fread("muh_data"))
pkg$TripId <- 0
NN <- data.frame(setorder(pkg, Latitude))
loc <- 1
weight <- 0
current_point <- c(NN[1, 3], NN[1, 2])  # (Longitude, Latitude)
TripID <- 1
while (dim(NN)[1] > 0) {
  while ((weight < 1000) & (dim(NN)[1] > 0)) {
    NN <- NN[-c(loc), ]                 # drop the point we just visited
    if (dim(NN)[1] == 0) {
      break
    }
    NN$NN <- distHaversine(current_point, cbind(NN$Longitude, NN$Latitude))
    loc <- which.min(NN$NN)             # nearest remaining neighbour
    current_point <- c(NN[loc, 3], NN[loc, 2])
    whichpkg <- NN[loc, 1]
    if ((weight + NN[loc, 4] > 1000) | (dim(NN)[1]) == 0) {
      break
    }
    weight <- weight + NN[loc, 4]       # index NN, not pkg: loc refers to the sorted copy
    pkg[pkg$pkgid == whichpkg, 5] <- TripID
  }
  print(TripID)  # visual progress check; should end at ~3500
  TripID <- TripID + 1
  weight <- 0
  loc <- 1
}
Any hints for speeding this up?
First, use the profiler (Rprof) to find where the time is being spent. Next, try replacing the data frames with matrices: data frames are very slow for element access. Then you will know where to focus.
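To illustrate the matrix advice, here is the same greedy fill sketched in Python with NumPy arrays instead of data frames. Every name is an assumption, and the Earth radius matches geosphere's default, so treat it as a sketch of the technique rather than a drop-in replacement:

```python
import numpy as np

def haversine(lon1, lat1, lon2, lat2, r=6378137.0):
    # great-circle distance in metres (same default radius as geosphere::distHaversine)
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * r * np.arcsin(np.sqrt(a))

def greedy_trips(lat, lon, weight, capacity=1000.0):
    # assign each package to a trip: start at the southernmost unassigned
    # point and keep taking the nearest neighbour until the capacity is hit
    n = len(lat)
    trip = np.zeros(n, dtype=int)
    unassigned = np.ones(n, dtype=bool)
    t = 0
    while unassigned.any():
        t += 1
        idx = np.flatnonzero(unassigned)
        cur = idx[np.argmin(lat[idx])]      # southernmost remaining point
        load = weight[cur]
        trip[cur] = t
        unassigned[cur] = False
        while unassigned.any():
            idx = np.flatnonzero(unassigned)
            d = haversine(lon[cur], lat[cur], lon[idx], lat[idx])
            nxt = idx[np.argmin(d)]         # nearest remaining neighbour
            if load + weight[nxt] > capacity:
                break
            load += weight[nxt]
            trip[nxt] = t
            unassigned[nxt] = False
            cur = nxt
    return trip
```

Working on flat arrays avoids the per-access overhead of `NN[loc, 3]`-style data-frame indexing, and the distance computation stays fully vectorized.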

What is the right way to duplicate an OpenCL kernel?

It seems that I can duplicate a kernel by getting the program object and kernel name from the existing kernel, and then creating a new one from those.
Is this the right way? It doesn't look so good, though.
EDIT: To answer the question properly: yes, it is the correct way; there is no other way in CL 2.0 or earlier versions.
The compilation (and therefore the slow step) of CL code creation happens at "program" creation (clBuildProgram + clLinkProgram).
When you create a kernel, you are just creating an object that packs:
An entry point to a function in the program code
Parameters for the input + output of that function
Some memory to remember all of the above between calls
It is a simple task that should be almost free.
That is why it is preferable to have multiple kernels with different input parameters, rather than one single kernel whose arguments you change every loop.

Line and intersection based pathfinding

I am trying to develop a pathfinding algorithm that finds the closest path to a point by tracing a line from the initial position to the final position. Then I test for intersections with obstacles ( all of them in my game are rectangular ) and then draw two new lines, one to each of the visible corners of the obstacle. Then, I draw lines from those two corners to the endpoint and repeat the intersection test for each branch of the path.
My question is this: how do I propagate my branching paths in Lua without recursion? It seems like it would be pretty easy with recursion, but I don't want to risk a stack overflow and a crash if my path gets too long.
If there is a technical term for the technique I am using, please tell me. I like to do my own research when I can.
There are two options: recursion and iteration. Lua supports "infinite recursion" via tail calls, so if you can express your recursion in terms of tail calls there is no risk of stack overflow. Example:
function f(x)
    if x == 0 then return 0 end  -- base case ends the recursion
    -- ... do stuff ...
    return f(x - 1)              -- tail call: reuses the current stack frame
end
Section 6.3 of Programming in Lua (available online) discusses this.
If you can't find a way to do this with tail calls, you have to use iteration. For example: start with one path, and use a while loop that tests how many intersections remain (exit when there are none). The loop calls a function that computes two new paths, and each new path is added to a list; the while loop itself sits inside a loop over the paths, since the number of paths grows. Path lengths can be computed as you go. Some paths will be dead ends and get dropped, some will be cyclical and should be dropped, and the rest will reach the destination. You then keep the one with the shortest path length or travel time (not necessarily the same).
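The worklist idea can be sketched without recursion in a few lines. The `branch` callback below stands in for the intersection test against obstacle corners; the whole thing is a hypothetical sketch, not the asker's geometry:

```python
from collections import deque

def length(path):
    # Euclidean length of a polyline given as a list of (x, y) points
    return sum(((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5
               for (x1, y1), (x2, y2) in zip(path, path[1:]))

def shortest_route(start, goal, branch):
    # branch(point, goal) returns [] when the line point->goal is clear,
    # otherwise the visible obstacle corners to detour through (assumed callback)
    best = None
    queue = deque([[start]])          # explicit worklist instead of a call stack
    while queue:
        path = queue.popleft()
        corners = branch(path[-1], goal)
        if not corners:               # clear line of sight: path is complete
            candidate = path + [goal]
            if best is None or length(candidate) < length(best):
                best = candidate
            continue
        for corner in corners:
            if corner in path:        # drop cyclical paths
                continue
            queue.append(path + [corner])
    return best
```

The queue grows and shrinks on the heap rather than the call stack, so path depth is limited only by memory.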

kernel OpenCL not saving results?

I'm taking my first steps in OpenCL, and I have a problem: I think the results aren't being saved correctly. I first made a simple program that works and gives the correct results, so main() is functioning correctly, the arrays are filled the way they are supposed to be, and so on.
Brief explanation of the program:
I have a grid of 4x4 points (the OP array) in a field that is 40x40. A plane flies above this field, and its route is divided into 100 segments (the Segment array). Coord is a typedef struct, currently holding only a double x and a double y value. I know there are simpler solutions, but I have to use this later on, so I do it this way. The kernel is supposed to calculate the distance from a point (an OP) to a segment, and the result has to be saved in the array Distances.
Description of the kernel: it takes its global id, from which it calculates which OP and which segment it has to work on. With that information it gets the x and y of both and calculates the x and y distance between them. It then takes the square root of those distances combined with the flight altitude.
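For reference, the computation described above boils down to the following (a plain-Python sketch with assumed names, not the actual kernel code):

```python
import math

def distance(op, seg, altitude):
    # horizontal offsets between the grid point (OP) and the segment point,
    # combined with the flight altitude into one 3-D distance
    dx = op[0] - seg[0]
    dy = op[1] - seg[1]
    return math.sqrt(dx * dx + dy * dy + altitude * altitude)
```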
The problem is that the Distances table only contains zeros :P
kernel: http://pastebin.com/U9hTWvv2 There is a parameter declared as __global const Coord* Afstanden; this has to be __global double* Afstanden.
OpenCL setup in main: http://pastebin.com/H3mPwuUH (because I'm not 100% sure I'm doing that part right)
I can give you the whole program, but it's mostly Dutch :P
I tried to explain everything as well as possible; if something is not clear, I'll try to clear it up ;)
As I made numerous little mistakes in the first kernel, here is its successor: http://pastebin.com/rs6T1nKs The OpenCL part is also slightly updated: http://pastebin.com/4NbUQ40L because it had one or two faults in it as well.
It should work now, in my opinion... but it doesn't; the problem still stands.
Well, it seems that the "Afstanden" variable is declared as a "const Coord*". Remove the constness (i.e. make it "Coord*" only) and be sure that when you call clCreateBuffer you pass "CL_MEM_WRITE_ONLY" or "CL_MEM_READ_WRITE" as the flags argument.
"The problem is that the Distances table only contains zeros :P"
You are probably initializing Afstanden with zeros on the CPU side. Am I correct? That would explain this behavior.

CUDA kernel's vectors' length based on threadIdx

This is part of the pseudo code I am implementing in CUDA as part of an image reconstruction algorithm:
for each xbin(0 -> detectorXDim/2 - 1):
    for each ybin(0 -> detectorYDim - 1):
        rayInit = (xbin*xBinSize + 0.5, ybin*yBinSize + 0.5, -detectordistance)
        rayEnd = beamFocusCoord
        slopeVector = rayEnd - rayInit
        // knowing that r = rayInit + t*slopeVector:
        //   x = rayInit[0] + t*slopeVector[0]
        //   y = rayInit[1] + t*slopeVector[1]
        //   z = rayInit[2] + t*slopeVector[2]
        // to find the ray's xx intersections:
        for each xinteger(xbin+1 -> detectorXDim/2):
            solve t for x = xinteger*xBinSize
            find the corresponding y and z
            add to the intersections array
        // find the ray's yy intersections (analogous to the xx intersections)
        // find the ray's zz intersections (analogous to the xx intersections)
So far, this is what I have come up with:
__global__ void sysmat(int xfocus, int yfocus, int zfocus, int xbin, int xbinsize, int ybin, int ybinsize, int zbin, int projecoes){
    int tx = threadIdx.x, ty = threadIdx.y, tz = threadIdx.z, bx = blockIdx.x, by = blockIdx.y, i, x, y, z;
    int idx = ty + by*blocksize;
    int idy = tx + bx*blocksize;
    int slopeVectorx = xfocus - idx*xbinsize + 0.5;
    int slopeVectory = yfocus - idy*ybinsize + 0.5;
    int slopeVectorz = zfocus - zdetector;
    __syncthreads();
    //points where the ray intersects the x axis
    int xint = idx + 1;
    int yint = idy + 1;
    int *intersectionsx[(detectorXDim/2 - xint) + (detectorYDim - yint) + (zfocus)];
    int *intersectionsy[(detectorXDim/2 - xint) + (detectorYDim - yint) + (zfocus)];
    int *intersectionsz[(detectorXDim/2 - xint) + (detectorYDim - yint) + (zfocus)];
    for(xint = xint; xint < detectorXDim/2; xint++){
        x = xint*xbinsize;
        t = (x - idx)/slopeVectorx;
        y = idy + t*slopeVectory;
        z = z + t*slopeVectorz;
        intersectionsx[xint - 1] = x;
        intersectionsy[xint - 1] = y;
        intersectionsz[xint - 1] = z;
        __syncthreads();
    }
    ...
}
This is just a piece of the code. I know there might be some errors (you can point them out if they are blatantly wrong), but what I am more concerned about is this:
Each thread (which corresponds to a detector bin) needs three arrays so it can save the points where the ray passing through that thread/bin intersects the multiples of the x, y and z axes. Each array's length depends on the position of the thread/bin (its index) in the detector and on beamFocusCoord (which is fixed). In order to do this I wrote the piece of code below, which I am certain cannot work (I confirmed it with a small test kernel, which returns the error "expression must have constant value"):
int *intersectionsx[(detectorXDim/2 - xint) + (detectorYDim - yint) + (zfocus)];
int *intersectionsy[(detectorXDim/2 - xint) + (detectorYDim - yint) + (zfocus)];
int *intersectionsz[(detectorXDim/2 - xint) + (detectorYDim - yint) + (zfocus)];
So in the end, I want to know whether there is an alternative to this piece of code, in which a vector's length depends on the index of the thread allocating it.
Thank you in advance ;)
EDIT: Given that each thread has to save an array with the coordinates of the intersections between the ray (from the beam source to the detector) and the xx, yy and zz axes, and that the spatial dimensions are around 1400x3600x60 (I don't have the exact numbers at the moment, but they are very close to the real values), is this problem feasible with CUDA?
For example, thread (0,0) will have 1400 intersections on the x axis, 3600 on the y axis and 60 on the z axis, meaning I would have to create an array of size (1400+3600+60)*sizeof(float), which is around 20 kB per thread.
Since each thread would exceed the 16 kB of local memory, that is out of the question. The other alternative was to allocate those arrays in global memory, but with some more math we get (1400+3600+60)*4*numberofthreads (i.e. 1400*3600), which also exceeds the amount of global memory available :(
So I am running out of ideas to deal with this problem, and any help is appreciated.
No.
Every piece of memory used in a CUDA kernel must be known at kernel-launch time. You can't allocate, deallocate or resize anything while the kernel is running. This is true for global memory, shared memory and registers.
The common workaround is to allocate the maximum amount of memory needed beforehand. This can be as simple as allocating the maximum size needed by one thread, multiplied by the number of threads, or as complex as summing all the per-thread sizes into a total maximum and computing appropriate per-thread offsets into that single array. That is a tradeoff between memory consumption and offset-computation time.
Go for the simple solution if you can, and for the complex one if memory limitations force you to.
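The "complex" option amounts to an exclusive prefix sum over the per-thread sizes, computed once on the host before launch. A NumPy sketch with toy dimensions (the size formula comes from the question; everything else here is an assumption):

```python
import numpy as np

detector_x, detector_y, zfocus = 8, 8, 4   # toy sizes; the question's are 1400, 3600, 60

# per-thread array length, as in the question:
# (detectorXDim/2 - xint) + (detectorYDim - yint) + zfocus, with xint = idx+1, yint = idy+1
idx, idy = np.meshgrid(np.arange(detector_x // 2), np.arange(detector_y), indexing="ij")
sizes = (detector_x // 2 - (idx + 1)) + (detector_y - (idy + 1)) + zfocus
sizes = sizes.ravel()

# exclusive prefix sum: each thread's starting offset into one big buffer
offsets = np.concatenate(([0], np.cumsum(sizes)[:-1]))
total = int(sizes.sum())                   # allocate a single device buffer this large
```

On the device, thread tid then writes its k-th intersection to buffer[offsets[tid] + k], so no thread-local variable-length arrays are needed.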
Why are you not using textures? Using a 2D or 3D texture would make this problem much easier. The GPU is designed to do very fast floating-point interpolation, and CUDA includes excellent support for it. The literature has examples of projection reconstruction on the GPU, e.g. "Accelerating simultaneous algebraic reconstruction technique with motion compensation using CUDA-enabled GPU", and textures are an integral part of those algorithms. Your own manual coordinate calculations can only be slower and more error-prone than what the GPU provides, unless you need something exotic like sinc interpolation.
1400x3600x60 is a little big for a single 3D texture, but you could break your problem up into 2D slices, 3D sub-volumes, or a hierarchical multi-resolution reconstruction. These have all been used by other researchers; just search PubMed.
