Vulkan: Why does vkCreateRayTracingPipelinesKHR take so long?

I am working on a Vulkan ray tracing program and find that vkCreateRayTracingPipelinesKHR takes 5-6 seconds. Why does it take so long?
I have tried using a pipeline cache, as this example shows (https://github.com/LunarG/VulkanSamples/blob/master/API-Samples/pipeline_cache/pipeline_cache.cpp), but it doesn't seem to save any time.
Can anybody give me a hint?
Thanks!
PS: I just found that the pipeline creation time is proportional to the length of the shader code…
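For what it's worth, a pipeline cache only saves time if its contents survive between runs, i.e. you serialize the cache data to disk after building the pipelines and feed it back in as pInitialData the next time the program starts. A minimal sketch of that round trip (single pipeline, no error handling; the file I/O and the names cachedBlob/cachedBlobSize, rtPipelineInfo and rtPipeline are placeholders, not code from the question):

VkPipelineCacheCreateInfo cacheInfo = {};
cacheInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_CACHE_CREATE_INFO;
cacheInfo.initialDataSize = cachedBlobSize; // 0 on the very first run
cacheInfo.pInitialData    = cachedBlob;     // NULL on the very first run

VkPipelineCache cache;
vkCreatePipelineCache(device, &cacheInfo, nullptr, &cache);

// Build the ray tracing pipeline(s) through the cache.
vkCreateRayTracingPipelinesKHR(device, VK_NULL_HANDLE, cache,
                               1, &rtPipelineInfo, nullptr, &rtPipeline);

// Afterwards, fetch the (possibly grown) cache contents and write them to disk
// so the next run can pass them back in via pInitialData.
size_t size = 0;
vkGetPipelineCacheData(device, cache, &size, nullptr);
std::vector<char> blob(size);
vkGetPipelineCacheData(device, cache, &size, blob.data());

Even with this in place, the very first run still pays the full compile cost, and whether later runs actually get faster depends on the driver honouring the cache for ray tracing pipelines.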

Related

Strange behavior when implementing Back propagation in DBN

Currently I'm trying to implement a Deep Belief Network, but I've run into a very strange problem. My source code can be found here: https://github.com/mistree/GoDeep/blob/master/GoDeep/
I first implemented the RBM using CD (contrastive divergence) and it works perfectly (by using the concurrency features of Go it's quite fast). Then I started to implement a normal feed-forward network with back propagation, and that's where the strange behavior starts. It seems very unstable: when I run it on an XOR gate test it sometimes fails, and only when I set the hidden layer to 10 or more nodes does it never fail. Below is how I calculate it:
Step 1 : calculate all the activations with bias
Step 2 : calculate the output error
Step 3 : back propagate the error to each node
Step 4 : calculate the delta weight and bias for each node with momentum
For Steps 1 to 4 I do a full batch and sum up these delta weights and biases
Step 5 : apply the averaged delta weights and biases (a rough sketch of Steps 2-4 is shown below)
I followed the tutorial here http://ufldl.stanford.edu/wiki/index.php/Backpropagation_Algorithm
And normally it works if I give it more hidden layer nodes. My test code is here https://github.com/mistree/GoDeep/blob/master/Test.go
So I think it should work, and I started to implement the DBN by combining the RBM and the normal NN. However, then the result becomes really bad: it can't even learn an XOR gate in 1000 iterations, and sometimes goes totally wrong. To debug this, after the PreTrain of the DBN I do a reconstruction. Most times the reconstruction looks good, but the back propagation fails even when the PreTrain result is perfect.
I really don't know what's wrong with the back propagation. I must have misunderstood the algorithm or made some big mistake in the implementation.
If possible, please run the test code and you'll see how weird it is. The code itself is quite readable. Any hint would be a great help. Thanks in advance.
I remember Hinton saying you can't train RBMs on an XOR; something about the vector space doesn't allow a two-layer network to work. Deeper networks have less linear properties that allow it to work.

returning one result from a pyopencl kernel

My pyopencl kernel program is started with a global size of (512,512), so I assume it will run 512x512 = 262,144 times. I want to find the minimum value of a function over my 512x512 image, but I don't want to return 262,144 floats to my CPU to calculate the min. I want to run another kernel (possibly waiting in the queue) to find the min value of all 262,144 pixels and then send just that one float to the CPU. I think this would be faster. Should my waiting kernel's global size be (1,1)? I hope the large buffer of 262,144 floats that I created using mf.COPY_HOST_PTR will not cross the GPU/CPU bus before I call the next kernel.
Thanks
Tim
Andreas is right: reduction is the solution. Here is a nice article from AMD explaining how to implement a simple reduction. It discusses different approaches and the performance gains they bring. The example in the article sums all the elements rather than finding the minimum, but it's fairly trivial to modify the given code.
BTW, maybe I don't understand your first sentence well, but a kernel launched with a global size of (512, 512) will not run 262,144 times; it runs once, with 262,144 work items (threads) scheduled.
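For concreteness, a minimal min-reduction kernel in OpenCL C could look something like the sketch below; the kernel name, argument names and the idea of finishing the last few partial results on the host are assumptions for illustration, not code from the question:

// Each work group reduces its chunk to one partial minimum in local memory.
// Work-group size is assumed to be a power of two.
__kernel void min_reduce(__global const float *values,
                         __global float *partial_mins,
                         __local float *scratch,
                         const int n)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    // Load one element per work item; out-of-range items contribute +infinity.
    scratch[lid] = (gid < n) ? values[gid] : INFINITY;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction in local memory.
    for (int offset = get_local_size(0) / 2; offset > 0; offset /= 2) {
        if (lid < offset)
            scratch[lid] = fmin(scratch[lid], scratch[lid + offset]);
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Work item 0 writes this group's partial minimum.
    if (lid == 0)
        partial_mins[get_group_id(0)] = scratch[0];
}

A first launch reduces the 262,144 values to one partial minimum per work group; you can then either launch the same kernel again over the partial results or read that small array back and take the min on the CPU. Either way only a handful of floats cross the bus, and the follow-up step does not need a global size of (1,1) with a single work item doing all the work.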
Use a reduction kernel to find the minimum.

how to develop a demo application with an FPS counter using kick.js?

I'm very interested in Kick.js. To convince my professor to use this framework, I want to develop an application in which I can load custom 3D models using kick.js and add more objects. I should also be able to print the FPS, to check how it varies as I add more 3D objects to the canvas. I'm new to graphics programming; I have no knowledge of shader programming or OpenGL. Being a newbie, how can I start diving into this framework?
These are the steps I want to implement (correct me if I'm going wrong):
Develop a simple demo using kick.js that loads a single cube, sphere or teapot on the canvas.
Be able to see the FPS as I change the camera angles.
Later, add more models of the same type (e.g. teapots) to the canvas and compare the FPS with the single-teapot case.
Am I approaching this the right way? Suggestions are welcome. None of the provided tutorials has an FPS demo. Please, someone help me. I really like the features stated on the homepage, but I don't know how to implement them in my demo.
Assuming that Kick.js has a "render" callback or something similar that's invoked for each frame you want to render (and you know the time between frames, or the absolute time since program start), it's fairly simple to calculate your frame rate.
The method I've used before is: pick a sample rate (I like 250 ms, so it updates 4 times a second) and count how many frames have executed in each 250 ms window. When you hit 250 ms, update the on-screen frame rate counter variable and start counting again. In JavaScript-flavoured code (drawFPSCounter stands in for however you display the value):
let timeSinceLastFPSUpdate = 0, framesSinceLastFPSUpdate = 0, fps = 0;

// call this once per rendered frame
function onFrame(millisecondsSinceLastFrame) {
    timeSinceLastFPSUpdate += millisecondsSinceLastFrame;
    framesSinceLastFPSUpdate++;
    if (timeSinceLastFPSUpdate > 250) {
        timeSinceLastFPSUpdate = 0;
        fps = framesSinceLastFPSUpdate * (1000 / 250); // convert "frames per 250 ms" to "frames per 1 s"
        framesSinceLastFPSUpdate = 0;
        drawFPSCounter(fps);
    }
}
You can play around with different sample rates or use a different frame-rate calculation method to make the counter more "accurate" (to better catch frame-rate dips), but it sounds like you're looking for something less precise that just gives you a reasonable idea of the overall rendering cost rather than frame-to-frame dips.

OpenCL computation times much longer than CPU alternative

I'm taking my first steps in OpenCL (and CUDA) for my internship. All nice and well, I now have working OpenCL code, but the computation times are way too high, I think. My guess is that I'm doing too much I/O, but I don't know where that could be.
The main is here: http://pastebin.com/i4A6kPfn, and the kernel here: http://pastebin.com/Wefrqifh. I start measuring time after segmentPunten(segmentArray, begin, eind); has returned, and I stop measuring after the last clEnqueueReadBuffer.
Computation time on an Nvidia GT440 is 38.6 seconds, on a GT555M 35.5 seconds, on an Athlon II X4 5.6 seconds, and on an Intel P8600 6 seconds.
Can someone explain this to me? Why are the computation times so high, and what solutions are there for this?
What it is supposed to do (short version): calculate how much noise load is produced by an airplane passing by.
Long version: there are several Observer Points (OPs), which are the points at which the sound of a passing airplane is measured. The flight path is segmented into 10,000 segments; this is done in the function segmentPunten. The double for loop in the main gives the OPs their coordinates. There are two kernels. The first one calculates the distance from a single OP to a single segment; this is saved in the array "afstanden". The second kernel calculates the sound load at an OP from all the segments.
Just eyeballing your kernel, I see this:
kernel void SEL(global const float *afstanden, global double *totaalSEL,
                const int aantalSegmenten)
{
    // ...
    for (i = 0; i < aantalSegmenten; i++) {
        double distance = afstanden[threadID * aantalSegmenten + i];
        // ...
    }
    // ...
}
It looks like aantalSegmenten is being set to 1000. You have a loop in each kernel that accesses global memory 1000 times. Without crawling through the code, I'm guessing that many of these accesses overlap when considering your computation as a whole. Is this the case? Will two work items access the same global memory? If so, you will see a potentially huge win on the GPU from rewriting your algorithm to partition the work such that you read each global memory location only once, saving it in local memory. After that, each work item in the work group that needs that location can read it quickly.
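To make the suggested pattern concrete (it only pays off if work items really do read the same locations), here is a generic sketch of a work group cooperatively staging a shared block of global data in local memory; the kernel and argument names are invented for illustration and are not from the pastebin code:

kernel void tiled_example(global const float *shared_data,
                          global float *out,
                          local float *tile,
                          const int n)
{
    int lid   = get_local_id(0);
    int lsize = get_local_size(0);
    float acc = 0.0f;

    for (int base = 0; base < n; base += lsize) {
        // Cooperative load: each work item fetches one element of the tile.
        if (base + lid < n)
            tile[lid] = shared_data[base + lid];
        barrier(CLK_LOCAL_MEM_FENCE);

        // Every work item in the group now reads the tile from fast local
        // memory instead of re-reading the same global addresses.
        int limit = min(lsize, n - base);
        for (int i = 0; i < limit; i++)
            acc += tile[i];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    out[get_global_id(0)] = acc;
}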
As an aside, the CL specification allows you to omit the leading __ from CL keywords, so you can write global and kernel instead of __global and __kernel. I don't think many newcomers to CL realize that.
Before optimizing further, you should first get an understanding of what is taking all that time. Is it the kernel compiles, data transfer, or actual kernel execution?
As mentioned above, you can get rid of the kernel compiles by caching the results. I believe some OpenCL implementations (the Apple one at least) already do this automatically. With others, you may need to do the caching manually. Here are instructions for the caching.
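If you end up caching manually, the usual shape of it is to pull the compiled binary out of the program object after the first build and recreate the program from that binary on later runs. A rough single-device sketch (program, context and device are assumed to exist already; file I/O and error handling omitted):

// First run: after clBuildProgram succeeds, fetch the binary and save it.
size_t binary_size = 0;
clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(binary_size), &binary_size, NULL);
unsigned char *binary = (unsigned char *)malloc(binary_size);
clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(binary), &binary, NULL);
// ... write binary_size bytes from 'binary' to a cache file ...

// Later runs: skip the source compile and rebuild from the saved binary.
cl_int status, err;
cl_program cached = clCreateProgramWithBinary(context, 1, &device, &binary_size,
                                              (const unsigned char **)&binary,
                                              &status, &err);
clBuildProgram(cached, 1, &device, NULL, NULL, NULL);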
If the performance bottleneck is the kernel itself, you can probably get a major speed-up by organizing the 'afstanden' array lookups differently. Currently, when a block of threads performs a read from memory, the addresses are spread out through memory, which is a real killer for GPU performance. You'd ideally want to index the array with something like afstanden[ndx*NUM_THREADS + threadID], which would make accesses from a work group load a contiguous block of memory. This is much faster than the current, essentially random, memory lookups.
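Side by side, the two indexing schemes look like this (NUM_THREADS stands for the total number of work items, as in the answer; the transposed form also assumes the first kernel writes afstanden in that transposed layout):

// current layout: in one loop iteration, neighbouring work items read addresses
// that are aantalSegmenten floats apart, so the reads are scattered
float scattered = afstanden[threadID * aantalSegmenten + ndx];

// transposed layout: in one loop iteration, neighbouring work items read
// neighbouring floats, so each work group loads one contiguous (coalesced) block
float coalesced = afstanden[ndx * NUM_THREADS + threadID];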
First of all, you are measuring not the computation time but the whole kernel read-in/compile/execute mumbo-jumbo. To make a fair comparison, measure the computation time from the first "non-static" part of your program (for example, from the first clSetKernelArg to the last clEnqueueReadBuffer).
If the execution time is still too high, you can use some kind of profiler (such as the Visual Profiler from NVIDIA) and read the OpenCL Best Practices Guide, which is included in the CUDA Toolkit documentation.
As for the raw kernel execution time: consider (and measure) whether you really need double precision for your calculation, because double precision calculations are artificially slowed down on consumer-grade NVIDIA cards.
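If you'd rather get the numbers without an external profiler, OpenCL's built-in event profiling can separate transfer time from kernel time. A rough sketch (context, device, kernel and the launch sizes are assumed to exist; includes and error handling omitted; the queue must be created with CL_QUEUE_PROFILING_ENABLE):

cl_int err;
cl_command_queue queue = clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &err);

// Attach an event to the enqueue you want to time (works the same way for
// clEnqueueReadBuffer / clEnqueueWriteBuffer).
cl_event evt;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size, 0, NULL, &evt);
clWaitForEvents(1, &evt);

cl_ulong start, end;
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);
printf("kernel time: %.3f ms\n", (end - start) * 1e-6); // timestamps are in nanoseconds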

Strange CUDA behavior in vector multiplication program

I'm having some trouble with a very basic CUDA program. I have a program that multiplies two vectors on the host and on the device and then compares them. This works without a problem. What I'm trying to do is test different numbers of threads and blocks for learning purposes. I have the following kernel:
__global__ void multiplyVectorsCUDA(float *a, float *b, float *c, int N) {
    int idx = threadIdx.x;
    if (idx < N)
        c[idx] = a[idx] * b[idx];
}
which I call like:
multiplyVectorsCUDA<<<nBlocks, nThreads>>>(vector_a_d, vector_b_d, vector_c_d, N);
For the moment I've fixed nBlocks to 1, so I only vary the vector size N and the number of threads nThreads. From what I understand, there will be a thread for each multiplication, so N and nThreads should be equal.
The problem is the following:
I first call the kernel with N=16 and nThreads<16, which doesn't work. (This is ok.)
Then I call it with N=16 and nThreads=16, which works fine. (Again, works as expected.)
But when I then call it with N=16 and nThreads<16, it still works!
I don't understand why the last step doesn't fail like the first one. It only fails again if I restart my PC.
Has anyone run into something like this before or can explain this behavior?
Wait, so are you calling all three in a row? I don't know the rest of your code, but are you sure you're clearing out the graphics memory you allocated between each run? If not, that could explain why it doesn't work the first time but does the third time when you're passing the same values, and why it only works again after rebooting (rebooting clears all the allocated memory).
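One way to make that effect visible, in the spirit of this answer: poison the device output buffer before every launch, so elements the kernel never writes cannot be filled in by results left over from an earlier run. A rough sketch (the device buffer names follow the question's call; vector_c_h is an assumed host-side buffer):

// Fill c with an unmistakable "not computed" pattern before each launch;
// 0xFF in every byte makes each float a NaN.
cudaMemset(vector_c_d, 0xFF, N * sizeof(float));

multiplyVectorsCUDA<<<nBlocks, nThreads>>>(vector_a_d, vector_b_d, vector_c_d, N);
cudaDeviceSynchronize();

// With nBlocks = 1 and nThreads < N, the tail elements now come back as NaN
// instead of silently matching values written by a previous, larger launch.
cudaMemcpy(vector_c_h, vector_c_d, N * sizeof(float), cudaMemcpyDeviceToHost);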
I don't know if it's OK to answer my own question, but I realized I had a bug in my code when comparing the host and device vectors (that part of the code wasn't posted). Sorry for the inconvenience. Could someone please close this post, since it won't let me delete it?
