Are vertex buffers created on another thread than the DXDevice was created on, slower? - directx-10

If I create a bunch of vertex buffers (that is, DX10 Buffers set to hold vertices) using multi-threading, will they be slower (to draw as primitives) than ones created on the same thread as the DXDevice?
I ask because the DirectX10 DXDevice is slower when presenting from a different thread than it was created on, and I couldn't find an answer on Google.

Once created they sit on your GPU, so speed will be exactly the same.

Related

QSGGeometry: Is it fast to upload tons of vertices every frame?

I'm developing a realtime log plotter using Qt Quick 2. It receives log data like every millisecond, and I'd like to plot it incrementally (to a parametric curve) using a custom QQuickItem.
Currently I'm planning to use QSGGeometry and send vertex data to the GPU. However, since QSGGeometry does not support incremental vertex upload, I will have to send all vertices every frame. Since the log can be about a hundred seconds long, I will be sending hundreds of thousands of vertices every frame. It feels silly to do that every sixtieth of a second.
Of course I could prune unnecessary vertices (those that are too close to others) and shrink the vertex buffer to maybe 1/30 of its size, but I noticed that this just moves GPU work onto the CPU. (Or I could just send every 30th sample, but the user can zoom in on the graph and it would look ugly.)
Instead, I could use QQuickPaintedItem and draw on the FrameBufferObject incrementally, but when the user drags the graph and it redraws, that would issue a hundred thousand GL calls in one frame (or do it on the CPU, which would be slow anyway).
Which one is the faster way? Or is there a better way to do this?
EDIT: I think I found a much better solution. I could split the data recursively and adaptively add points until the curve is smooth enough. This way I can reduce the data to about 500 points, which is cheap enough to send to the GPU every frame, while touching only the points needed on the CPU. My only concern is whether g++ can optimize the recursive calls for low overhead.
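A minimal sketch of that kind of reduction, in Python for brevity rather than the C++ a QQuickItem would need. It uses Ramer-Douglas-Peucker simplification as a stand-in, which is not necessarily the exact scheme described in the EDIT. The names `simplify` and `tolerance` and the `(x, y)` tuple layout are assumptions made for illustration. An explicit stack replaces the recursive calls, so the call-overhead concern does not arise.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def _perp_distance(p: Point, a: Point, b: Point) -> float:
    """Distance from p to the line through a and b (or to a if a == b)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length = (dx * dx + dy * dy) ** 0.5
    if length == 0.0:
        return ((px - ax) ** 2 + (py - ay) ** 2) ** 0.5
    return abs(dy * (px - ax) - dx * (py - ay)) / length


def simplify(points: List[Point], tolerance: float) -> List[Point]:
    """Keep only the points needed to stay within `tolerance` of the curve."""
    if len(points) < 3:
        return list(points)
    keep = [False] * len(points)
    keep[0] = keep[-1] = True
    stack = [(0, len(points) - 1)]  # explicit stack instead of recursion
    while stack:
        lo, hi = stack.pop()
        worst, worst_dist = -1, tolerance
        # Find the interior point farthest from the chord lo..hi.
        for i in range(lo + 1, hi):
            d = _perp_distance(points[i], points[lo], points[hi])
            if d > worst_dist:
                worst, worst_dist = i, d
        if worst != -1:
            # The chord is not a good enough approximation: split there.
            keep[worst] = True
            stack.append((lo, worst))
            stack.append((worst, hi))
    return [p for p, k in zip(points, keep) if k]
```

One plausible choice, again an assumption, is to derive `tolerance` from the current zoom level (roughly one pixel expressed in data units) and to run `simplify` only on the visible range.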

Difference in read speed between memory created with the CL_MEM_READ_WRITE and CL_MEM_READ_ONLY flags

In the first stage of my project I generate some vertices; in the second stage I read these vertices and create a connectivity array. For my vertices I have used CL_MEM_READ_WRITE. I wanted to know whether I would get a performance increase if I used a CL_MEM_WRITE_ONLY buffer in the first stage and then copied it into a separate CL_MEM_READ_ONLY buffer for the second stage, since each of them probably has its own optimizations for maximum performance.
The flag passed as the 2nd argument of clCreateBuffer only specifies how the kernel side can access the memory space.
Probably not. I expect the buffer copy to be far more costly than any optimization.
Also, I looked at the AMD APP OpenCL Programming Guide and I didn't find any indication about optimizations when using a READ_ONLY or WRITE_ONLY buffer.
My understanding is that the access flag is only used by the OpenCL runtime to decide when it needs to copy buffer data between the different memory spaces/areas.

How to Implement embarrassingly parallel task (FOR loop) WITHOUT MPI-IO?

Preamble:
I have a very large one-dimensional array and need to solve an evolution equation (a wave-like equation). I need to calculate an integral at each value of this array, store the resulting array of integrals, apply the integration again to this new array, and so on (in simple words: I apply an integral on a grid of values, store the new grid, apply the integration again, and so on).
I used MPI-IO to spread the work over all nodes: there is a shared .dat file on my disk; each MPI process reads this file (as the source for integration), performs the integration, and writes back to the shared file. This procedure repeats again and again. It works fine. The most time-consuming part was the integration; file reading and writing was negligible.
Current problem:
Now I have moved to a 1024-CPU (16x64) HPC cluster and I'm facing the opposite problem: the calculation time is NEGLIGIBLE compared to the read-write process!!!
I tried reducing the number of MPI processes: I use only 16 MPI processes (to spread over the nodes) plus 64 OpenMP threads to parallelize the computation inside each node.
Again, reading and writing is now the most time-consuming part.
Question
How should I modify my program in order to utilize the full power of 1024 CPUs with minimal loss?
The important point is that I cannot move to the next step without completing the entire 1D array.
My thoughts:
Instead of reading and writing, I can ask rank 0 (the master rank) to send and receive the entire array to and from all nodes (MPI_Bcast). So, instead of every node doing I/O, only one node will do it.
Thanks in advance!!!
I would look here and here. FORTRAN code for the second site is here and C code is here.
The idea is that you don't give the entire array to each processor. You give each processor only the piece it works on, with some overlap between processors so they can handle their mutual boundaries.
Also, you are right to save your computation to disk every so often. And I like MPI-IO for that. I think it is the way to go. But the codes in the links will allow you to run without reading every time. And, for my money, writing out the data every single time is overkill.
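To make the decomposition idea concrete, here is a small sketch in plain Python with no MPI calls, so it only simulates what each rank would hold. It assumes the update at a grid point needs only neighbours within `halo` cells; if the integral in the question spans the whole array, this does not apply as written. The names `partition`, `step`, and `halo`, and the averaging stencil, are placeholders and are not taken from the linked codes.

```python
from typing import List, Tuple


def partition(n: int, nranks: int, halo: int) -> List[Tuple[int, int, int, int]]:
    """Return (own_start, own_stop, read_start, read_stop) for each rank."""
    base, extra = divmod(n, nranks)
    bounds = []
    start = 0
    for rank in range(nranks):
        stop = start + base + (1 if rank < extra else 0)
        # The read range extends the owned range by the halo, clipped to the grid.
        bounds.append((start, stop, max(0, start - halo), min(n, stop + halo)))
        start = stop
    return bounds


def step(grid: List[float], nranks: int, halo: int) -> List[float]:
    """One update in which each rank sees only its chunk plus halo cells."""
    new_grid = [0.0] * len(grid)
    for own_start, own_stop, read_start, read_stop in partition(len(grid), nranks, halo):
        local = grid[read_start:read_stop]  # what this rank would hold in memory
        for i in range(own_start, own_stop):
            j = i - read_start  # index into the local copy
            lo, hi = max(0, j - halo), min(len(local), j + halo + 1)
            new_grid[i] = sum(local[lo:hi]) / (hi - lo)  # placeholder stencil
    return new_grid
```

In a real MPI program only the halo cells would be exchanged with neighbouring ranks between steps, and the full array would be gathered only when a checkpoint is written.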

Hadoop suitability for recursive data processing

I have a filtering algorithm that needs to be applied recursively and I am not sure if MapReduce is suitable for this job. Without giving too much away, I can say that each object being filtered is characterized by a collection of ordered lists or queues.
The data is not huge, just about 250MB when I export from SQL to CSV.
The mapping step is simple: the head of the list contains an object that can classify the list as belonging to one of N mapping nodes. The filtration algorithm at each node works on the collection of lists assigned to that node, and at the end of the filtration a list either remains the same as before or has its head removed.
The reduce function is simple too: all the map jobs' lists are brought together and may have to be written back to disk.
When all the N nodes have returned their output, the mapping step is repeated with this new set of data.
Note: N can be as many as 2000 nodes.
Simple, but it may require up to 1000 recursions before the algorithm's termination conditions are met.
My question is: would this job be suitable for Hadoop? If not, what are my options?
The main strength of Hadoop is its ability to transparently distribute work across a large number of machines. In order to fully benefit from Hadoop, your application has to be characterized by at least the following three things:
work with large amounts of data (data which is distributed in the cluster of machines) - which would be impossible to store on one machine
be data-parallelizable (i.e. chunks of the original data can be manipulated independently from other chunks)
the problem which the application is trying to solve lends itself nicely to the MapReduce (scatter - gather) model.
It seems that out of these 3, your application has only the last 2 characteristics (with the observation that you are trying to use a scatter-gather procedure recursively, which means a large number of jobs, equal to the recursion depth; see the last paragraph for why this might not be appropriate for Hadoop).
Given the amount of data you're trying to process, I don't see any reason why you wouldn't do it on a single machine, completely in memory. If you think you can benefit from processing that small amount of data in parallel, I would recommend focusing on multicore processing rather than on distributed data-intensive processing. Of course, using the processing power of a networked cluster is tempting, but this comes at a cost: mainly the time inefficiency caused by network communication (the network being the most contended resource in a Hadoop cluster) and by I/O. In scenarios that are well suited to the Hadoop framework, these inefficiencies can be ignored because of the efficiency gained by distributing the data and the associated work on that data.
As far as I can see, you need 1000 jobs. The setup and cleanup of all those jobs would be unnecessary overhead for your scenario. The overhead of network transfer is also unnecessary, in my opinion.
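As a rough illustration of the single-machine, in-memory alternative, the loop below repeats the group-by-head, filter, and gather rounds until a round changes nothing. The filtering rule and the termination condition are not given in the question, so `drop_non_minimal` and the "stop when no head was removed" rule are pure placeholders, and `run_until_stable` is a hypothetical name.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Hashable, Iterable, List, Tuple


def drop_non_minimal(group: List[Deque[int]]) -> None:
    """Placeholder filter: pop the head of every list whose head is not the
    smallest head in its group. The real rule is not given in the question."""
    smallest = min(lst[0] for lst in group)
    for lst in group:
        if lst[0] != smallest:
            lst.popleft()


def run_until_stable(
    lists: Iterable[Iterable[int]],
    key: Callable[[int], Hashable],
    filter_group: Callable[[List[Deque[int]]], None],
    max_rounds: int = 1000,
) -> Tuple[List[Deque[int]], int]:
    """Repeat map, filter, gather in memory; return (lists, rounds that changed data)."""
    data = [deque(lst) for lst in lists]
    for round_no in range(max_rounds):
        # "Map": assign each non-empty list to a group based on its head.
        groups = defaultdict(list)
        for lst in data:
            if lst:
                groups[key(lst[0])].append(lst)
        size_before = sum(len(lst) for lst in data)
        # Groups are independent, so this loop is what a thread or
        # process pool would parallelize across cores.
        for group in groups.values():
            filter_group(group)
        # Assumed termination rule: stop once a round removes nothing.
        if sum(len(lst) for lst in data) == size_before:
            return data, round_no
    return data, max_rounds
```

With 250MB of data the whole list collection stays in memory between rounds, so there is no per-round job setup or disk write.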
Recursive algorithms are hard in distributed systems since they can quickly lead to starvation. Any middleware that would work for that needs to support distributed continuations, i.e. the ability to make a "recursive" call without holding the resources (like threads) of the calling side.
GridGain is one product that natively supports distributed continuations.
The litmus test for distributed continuations: try to develop a naive Fibonacci implementation in a distributed context using recursive calls. Here's GridGain's example that implements this using continuations.
Hope it helps.
Q&D, but I suggest you read a comparison of MongoDB and Hadoop:
http://www.osintegrators.com/whitepapers/MongoHadoopWP/index.html
Without knowing more, it's hard to tell. You might want to try both. Post your results if you do!

OpenCL - Multiple GPU Buffer Synchronization

I have an OpenCL kernel that calculates total force on a particle exerted by other particles in the system, and then another one that integrates the particle position/velocity. I would like to parallelize these kernels across multiple GPUs, basically assigning some amount of particles to each GPU. However, I have to run this kernel multiple times, and the result from each GPU is used on every other. Let me explain that a little further:
Say you have particle 0 on GPU 0, and particle 1 on GPU 1. The force on particle 0 is changed, as is the force on particle 1, and then their positions and velocities are changed accordingly by the integrator. Then, these new positions need to be placed on each GPU (both GPUs need to know where both particle 0 and particle 1 are) and these new positions are used to calculate the forces on each particle in the next step, which is used by the integrator, whose results are used to calculate forces, etc, etc. Essentially, all the buffers need to contain the same information by the time the force calculations roll around.
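The data flow just described can be sketched in plain Python, with list slices standing in for per-GPU work and no OpenCL calls at all. The 1D positions, the linear attraction in `force_on`, and the names `advance` and `slices` are all made up for illustration; only the two phases (each device updates its own particles from the full position set, then every device receives all new positions) come from the description above.

```python
from typing import List, Tuple


def force_on(i: int, positions: List[float]) -> float:
    """Placeholder force: linear attraction toward every other particle."""
    return sum(positions[j] - positions[i] for j in range(len(positions)) if j != i)


def advance(
    positions: List[float],
    velocities: List[float],
    slices: List[Tuple[int, int]],
    dt: float,
) -> Tuple[List[float], List[float]]:
    """One time step with particles split across devices by index range."""
    # Phase 1: each device reads ALL old positions but updates only its slice.
    partial = []
    for start, stop in slices:
        local_pos, local_vel = [], []
        for i in range(start, stop):
            v = velocities[i] + dt * force_on(i, positions)
            local_vel.append(v)
            local_pos.append(positions[i] + dt * v)
        partial.append((start, local_pos, local_vel))
    # Phase 2: synchronization point. The slices are gathered into full
    # arrays that every device must see before the next force calculation.
    new_positions, new_velocities = list(positions), list(velocities)
    for start, local_pos, local_vel in partial:
        new_positions[start:start + len(local_pos)] = local_pos
        new_velocities[start:start + len(local_vel)] = local_vel
    return new_positions, new_velocities
```

The cost in question is phase 2: per step, each device contributes only its own slice of positions, but has to receive everyone else's.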
So, the question is: What is the best way to synchronize buffers across GPUs, given that each GPU has a different buffer? They cannot have a single shared buffer if I want to keep parallelism, as per my last question (though, if there is a way to create a shared buffer and still keep multiple GPUs, I'm all for that). I suspect that copying the results each step will cause more slowdown than it's worth to parallelize the algorithm across GPUs.
I did find this thread, but the answer was not very definitive and applied only to a single buffer across all GPUs. I would like to know, specifically, for Nvidia GPUs (more specifically, the Tesla M2090).
EDIT: Actually, as per this thread on the Khronos forums, a representative from the OpenCL working group says that a single buffer on a shared context does indeed get spread across multiple GPUs, with each one making sure that it has the latest info in memory. However, I'm not seeing that behavior on Nvidia GPUs; when I use watch -n .5 nvidia-smi while my program is running in the background, I see one GPU's memory usage go up for a while, and then go down while another GPU's memory usage goes up. Is there anyone out there that can point me in the right direction with this? Maybe it's just their implementation?
It sounds like you are having implementation trouble.
There's a great presentation from SIGGRAPH that shows a few different ways to utilize multiple GPUs with shared memory. The slides are here.
I imagine that, in your current setup, you have a single context containing multiple devices with multiple command queues. This is probably the right way to go, for what you're doing.
Appendix A of the OpenCL 1.2 specification says that:
OpenCL memory objects, [...] are created using a context and can be shared across multiple command-queues created using the same context.
Further:
The application needs to implement appropriate synchronization across threads on the host processor to ensure that the changes to the state of a shared object [...] happen in the correct order [...] when multiple command-queues in multiple threads are making changes to the state of a shared object.
So it would seem to me that your kernel that calculates particle position and velocity needs to depend on your kernel that calculates the inter-particle forces. It sounds like you already know that.
To put things more in terms of your question:
What is the best way to synchronize buffers across GPUs, given that each GPU has a different buffer?
... I think the answer is "don't have the buffers be separate." Use the same cl_mem object between two devices by having that cl_mem object come from the same context.
As for where the data actually lives... as you pointed out, that's implementation-defined (at least as far as I can tell from the spec). You probably shouldn't worry about where the data is living, and just access the data from both command queues.
I realize this could create some serious performance concerns. Implementations will likely evolve and get better, so if you write your code according to the spec now, it'll probably run better in the future.
Another thing you could try in order to get better (or at least different) buffer-sharing behavior would be to make the particle data a map.
If it's any help, our setup (a bunch of nodes with dual C2070s) seems to share buffers fairly optimally. Sometimes the data is kept on only one device; other times it might exist in both places.
All in all, I think the answer here is to do it in the best way the spec provides and hope for the best in terms of implementation.
I hope I was helpful,
Ryan
