I have a single OpenCL kernel that needs access to several memory objects at the same time, and those memory objects can't all fit in device memory at once, so at any given iteration I need to keep only the ones that are needed. How can I keep just the needed objects resident without having to destroy, and later transfer again, the objects that are already there and need to stay there?
In an ideal world I could copy each object individually as needed, but given that there's no such thing as an array of pointers on the device, how could I do this if I need 50 of them in a single kernel invocation?
For context: I use one fairly general-purpose kernel that takes care of all my graphics needs by following a list of things to do, but now I need it to access images, tiles from huge images, or smaller mipmap versions, many at once. It might need the exact same objects with no changes for thousands of iterations, or it might regularly need to drop some objects while keeping others.
There are a lot of questions online about allocating, copying, indexing, etc 2d and 3d arrays on CUDA. I'm getting a lot of conflicting answers so I'm attempting to compile past questions to see if I can ask the right ones.
First link: https://devtalk.nvidia.com/default/topic/392370/how-to-cudamalloc-two-dimensional-array-/
Problem: Allocating a 2d array of pointers
User solution: use cudaMallocPitch
"Correct" inefficient solution: Use malloc and memcpy in a for loop for each row (Absurd overhead)
"More correct" solution: Squash it into a 1d array "professional opinion," one comment saying no one with an eye on performance uses 2d pointer structures on the gpu
Second link: https://devtalk.nvidia.com/default/topic/413905/passing-a-multidimensional-array-to-kernel-how-to-allocate-space-in-host-and-pass-to-device-/
Problem: Allocating space on host and passing it to device
Sub link: https://devtalk.nvidia.com/default/topic/398305/cuda-programming-and-performance/dynamically-allocate-array-of-structs/
Sub link solution: Coding pointer based structures on the GPU is a bad experience and highly inefficient, squash it into a 1d array.
Third link: Allocate 2D Array on Device Memory in CUDA
Problem: Allocating and transferring 2d arrays
User solution: use cudaMallocPitch
Other solution: flatten it
Fourth link: How to use 2D Arrays in CUDA?
Problem: Allocate and traverse 2d arrays
Submitted solution: Does not show allocation
Other solution: squash it
There are a lot of other sources mostly saying the same thing but in multiple instances I see warnings about pointer structures on the GPU.
Many people claim the proper way to allocate an array of pointers is with a call to malloc and memcpy for each row, yet the functions cudaMallocPitch and cudaMemcpy2D exist. Are these functions somehow less efficient? Why wouldn't this be the default answer?
The other 'correct' answer for 2d arrays is to squash them into one array. Should I just get used to this as a fact of life? I'm very persnickety about my code and it feels inelegant to me.
Another solution I was considering was to make a matrix class that uses a 1d pointer array, but I can't find a way to implement the double bracket operator.
Also, according to this link (Copy an object to device?) and the sub-link answer (cudaMemcpy segmentation fault), this gets a little iffy.
The classes I want to use CUDA with all have 2d/3d arrays; wouldn't there be a lot of overhead in converting those to 1d arrays for CUDA?
I know I've asked a lot, but in summary: should I get used to squashed arrays as a fact of life, or can I use the 2d allocate and copy functions without incurring the bad overhead of the solution where alloc and cpy are called in a for loop?
Since your question compiles a list of other questions, I'll answer by compiling a list of other answers.
cudaMallocPitch/cudaMemcpy2D:
First, the CUDA runtime API functions like cudaMallocPitch and cudaMemcpy2D do not actually involve either double-pointer allocations or 2D (doubly-subscripted) arrays. This is easy to confirm simply by looking at the documentation and noting the types of the parameters in the function prototypes: the src and dst parameters are single-pointer parameters. They could not be doubly subscripted or doubly dereferenced. For additional example usage, here is one of many questions on this, here is a fully worked example usage, and another example covering various concepts associated with cudaMallocPitch/cudaMemcpy2D usage is here. Instead, the correct way to think about these functions is that they work with pitched allocations. Also, you cannot use cudaMemcpy2D to transfer data when the underlying allocation has been created using a set of malloc (or new, or similar) operations in a loop; that sort of host-side allocation construction is particularly ill-suited to working with the data on the device.
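As a minimal sketch of what that looks like in practice (the dimensions and names here are hypothetical), note that both calls traffic only in single pointers and a flat host buffer:

    #include <cuda_runtime.h>

    int main() {
        const int width = 512, height = 256;        // hypothetical element dimensions
        float *h_data = new float[width * height];  // ordinary flat host buffer

        float *d_data = nullptr;
        size_t pitch = 0;                           // row stride in bytes, chosen by the runtime
        cudaMallocPitch((void **)&d_data, &pitch, width * sizeof(float), height);

        // src and dst are single pointers; no pointer-to-pointer anywhere
        cudaMemcpy2D(d_data, pitch,
                     h_data, width * sizeof(float), // host rows are tightly packed
                     width * sizeof(float), height,
                     cudaMemcpyHostToDevice);

        // in a kernel, row y begins at (char *)d_data + y * pitch
        cudaFree(d_data);
        delete[] h_data;
        return 0;
    }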
general, dynamically allocated 2D case:
If you wish to learn how to use a dynamically allocated 2D array in a CUDA kernel (meaning you can use doubly-subscripted access, e.g. data[x][y]), then the cuda tag info page contains the "canonical" question for this; it is here. The answer given by talonmies there includes the proper mechanics, as well as appropriate caveats (a sketch of those mechanics appears after the list below):
there is additional, non-trivial complexity
the access will generally be less efficient than 1D access, because data access requires dereferencing 2 pointers instead of 1.
(note that allocating an array of objects, where the object(s) has an embedded pointer to a dynamic allocation, is essentially the same as the 2D array concept, and the example you linked in your question is a reasonable demonstration for that)
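A hedged sketch of that general double-pointer method (sizes invented for illustration); notice the host-side staging array and the two dereferences per element access:

    #include <cuda_runtime.h>

    __global__ void touch(float **data) {
        // doubly-subscripted access: two pointer loads per element
        data[blockIdx.x][threadIdx.x] += 1.0f;
    }

    int main() {
        const int rows = 4, cols = 32;

        // 1. allocate one device buffer per row, collecting the row pointers on the host
        float **h_rows = new float *[rows];
        for (int r = 0; r < rows; ++r) {
            cudaMalloc((void **)&h_rows[r], cols * sizeof(float));
            cudaMemset(h_rows[r], 0, cols * sizeof(float));
        }

        // 2. allocate the device-side pointer array and copy the row pointers into it
        float **d_data = nullptr;
        cudaMalloc((void **)&d_data, rows * sizeof(float *));
        cudaMemcpy(d_data, h_rows, rows * sizeof(float *), cudaMemcpyHostToDevice);

        touch<<<rows, cols>>>(d_data);
        cudaDeviceSynchronize();

        // cleanup mirrors the allocation: each row, then the pointer array
        for (int r = 0; r < rows; ++r)
            cudaFree(h_rows[r]);
        cudaFree(d_data);
        delete[] h_rows;
        return 0;
    }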
Also, here is a thrust method for building a general dynamically allocated 2D array.
flattening:
If you think you must use the general 2D method, then go ahead, it's not impossible (although sometimes people struggle with the process!) However, due to the added complexity and reduced efficiency, the canonical "advice" here is to "flatten" your storage method, and use "simulated" 2D access. Here is one of many examples of questions/answers discussing "flattening".
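For comparison, a flattened equivalent of the same array (again with invented sizes) needs one allocation, one launch, and index arithmetic in place of the second pointer:

    #include <cuda_runtime.h>

    __global__ void touch_flat(float *data, int cols) {
        // simulated 2D access: one pointer dereference plus index arithmetic
        data[blockIdx.x * cols + threadIdx.x] += 1.0f;
    }

    int main() {
        const int rows = 4, cols = 32;
        float *d_flat = nullptr;
        cudaMalloc((void **)&d_flat, rows * cols * sizeof(float)); // one allocation
        cudaMemset(d_flat, 0, rows * cols * sizeof(float));
        touch_flat<<<rows, cols>>>(d_flat, cols);                  // no pointer tree at all
        cudaDeviceSynchronize();
        cudaFree(d_flat);
        return 0;
    }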
general, dynamically allocated 3D case:
As we extend this to 3 (or higher!) dimensions, the general case becomes overly complex to handle, IMO. The additional complexity should strongly motivate us to seek alternatives. The triply-subscripted general case involves 3 pointer accesses before the data is actually retrieved, so it is even less efficient. Here is a fully worked example (2nd code example).
special case: array width known at compile time:
Note that it should be considered a special case when the array dimension(s) (the width, in the case of a 2D array, or 2 of the 3 dimensions for a 3D array) is known at compile time. In this case, with an appropriate auxiliary type definition, we can "instruct" the compiler how the indexing should be computed, and we can use doubly-subscripted access with considerably less complexity than the general case and no loss of efficiency due to pointer-chasing. Only one pointer need be dereferenced to retrieve the data (regardless of array dimensionality, if n-1 dimensions are known at compile time for an n-dimensional array). The first code example in the already-mentioned answer here gives a fully worked example of that in the 3D case, and the answer here gives a 2D example of this special case.
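A sketch of that special case (the width of 32 is an invented compile-time constant): the array type carries the width, so the compiler generates the index arithmetic and only one pointer is chased:

    #include <cuda_runtime.h>

    #define COLS 32   // the width must be known at compile time

    __global__ void touch2d(float (*data)[COLS]) {
        // doubly-subscripted access, but a single dereference:
        // the compiler computes blockIdx.x * COLS + threadIdx.x itself
        data[blockIdx.x][threadIdx.x] += 1.0f;
    }

    int main() {
        const int rows = 4;
        float (*d_data)[COLS] = nullptr;
        cudaMalloc((void **)&d_data, rows * COLS * sizeof(float)); // still one flat allocation
        cudaMemset(d_data, 0, rows * COLS * sizeof(float));
        touch2d<<<rows, COLS>>>(d_data);
        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }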
doubly-subscripted host code, singly-subscripted device code:
Finally, another methodology option allows us to easily mix 2D (doubly-subscripted) access in host code while using only 1D (singly-subscripted, perhaps with "simulated 2D" access) in device code. A worked example of that is here. By organizing the underlying allocation as a contiguous allocation, and then building the pointer "tree" on top of it, we can enable doubly-subscripted access on the host and still easily pass the flat allocation to the device. Although the example does not show it, it would be possible to extend this method to create a doubly-subscripted access system on the device based on a flat allocation and a manually-created pointer "tree"; however, this would have approximately the same issues as the 2D general dynamically allocated method above: it would involve double-pointer (double-dereference) access, so it is less efficient, and there is some complexity associated with building the pointer "tree" for use in device code (e.g. it would probably necessitate an additional cudaMemcpy operation).
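A small sketch of that mixed approach (dimensions hypothetical): contiguous storage plus a host-only row-pointer tree, so the host writes data[y][x] while the device receives the flat pointer:

    #include <cuda_runtime.h>

    int main() {
        const int rows = 4, cols = 32;
        float *storage = new float[rows * cols];   // one contiguous block
        float **data = new float *[rows];          // host-only row-pointer "tree"
        for (int r = 0; r < rows; ++r)
            data[r] = storage + r * cols;

        data[2][5] = 3.14f;                        // doubly-subscripted on the host

        float *d_flat = nullptr;
        cudaMalloc((void **)&d_flat, rows * cols * sizeof(float));
        // one copy suffices because the underlying storage is contiguous
        cudaMemcpy(d_flat, storage, rows * cols * sizeof(float),
                   cudaMemcpyHostToDevice);
        // device code indexes d_flat[y * cols + x] ("simulated 2D")

        cudaFree(d_flat);
        delete[] data;
        delete[] storage;
        return 0;
    }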
From the above methods, you'll need to choose one that fits your appetite and needs. There is not one single recommendation that fits every possible case.
I work with large R objects that are sometimes accessed for read-only purposes by multiple people on our local network. For example, a reference class or R6 object might be used to store validation results related to a particular model, and it may have many read-only validation-related methods. I would like to keep using R to maintain workflow homogeneity and avoid moving to a language (like Java or Python) that would be more appropriate for solving the problem I am about to describe.
Rather than instantiating these objects anew or reading them from serialized output (e.g., RDS or redis) every time we need them in a new R session, it would be much more efficient to keep an active R process running on some server that is accessible on the network, and then "memcpy" objects from that server onto local machines: some kind of quasi-object pooling. Note these are sometimes legitimately non-tabular R objects that would be difficult to e.g. translate into a database-backed object (which might still be slower).
I understand that R maintains all information about what is in scope on the heap, so this may be difficult to do without control of the gc, but is it possible to "siphon" objects away from other R sessions on a byte-by-byte level using some sort of underlying C magic? I don't understand enough about how R manages objects in memory to know how to do this, but perhaps there is a package or snippets of existing code that can provide inspiration.
I am also willing to put on the straitjacket and accept restrictions on the aforementioned objects that would make this task easier (e.g., they can only reference certain packages, or the method definitions cannot be weird closures that would make this task impossible, or they can even be S3 objects only).
EDIT: I just realized I haven't looked into RProtoBuf. Could that be appropriate?
The standard way to do this would be to serialize your objects into a stream of bytes that can be safely loaded on another machine at a later time. This is exactly what the base::serialize method is for if everything is in R, or what RProtoBuf is for if the data needs to be shared between applications written in other languages. In either case, you can write the serialized bytes to RDS or redis or any other data store.
Direct memcpy between machines would be problematic for many reasons. Most fundamentally, architectural differences between machines make this error prone if not all of your computers have the same endianness. You would also have to find a way to represent complex data structures as a stream of bytes that could be interpreted on another machine. Objects may be loaded into memory at different addresses in each process, so you can't just do a raw memcpy without fixing up pointers to the new memory locations; but if you are doing that, you are doing serialization, so again, why not use base::serialize or RProtoBuf?
I am in the process of coding a level design tool in Qt with OpenGL (for a relevant example see Valve's Hammer, as Source games are what I'm primarily designing this for) and have currently written a few classes to represent 3D objects (vertices, edges, faces). I plan to implement an "object" class which ties the three together, keeps track of its own vertices, etc.
After having read up on rendering polygons on http://open.gl, I have a couple of questions regarding the most efficient way to render the content. Bear in mind that this is a level editor, so I am anticipating needing to render a large number of objects with arbitrary shapes and numbers of vertices/faces.
Edit: Updated to be less broad.
When would be the best point to create the VBO? The Qt OpenGL example creates a VBO when a viewport is initialized, but I'd expect it to be inefficient to create a clone of it for each viewport.
Regarding the submitted answer, would it be a sensible idea to create one VBO for geometry, another for mesh models, etc? What happens if/when a VBO overflows?
VBOs should be re-/initialized whenever there's a need for it. Think of VBOs as memory pools: you'd not allocate one VBO per object, but group similar objects into a single VBO. When you run out of space in one VBO, you allocate another one.
Today's GPUs are optimized for rendering indexed triangles. So GL_TRIANGLES will suffice in 90% of all cases.
Frankly, modern OpenGL implementations completely ignore the buffer object access mode. So many programs made ill use of that parameter that it became more efficient for drivers to profile the actual usage pattern and adjust their behavior toward that. However, it's still a good idea to use the right mode, and in your case that's GL_STATIC_DRAW.
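A minimal sketch of the pooling idea (the function and parameter names are invented, and an OpenGL function loader such as GLEW is assumed to be initialized): two objects' vertex data share one GL_STATIC_DRAW buffer, and each object is later drawn from its own region of it:

    #include <GL/glew.h>   // or any other OpenGL function loader
    #include <vector>

    // Pool two objects' vertex data into one VBO; callers remember each
    // object's first-vertex offset and count for the eventual glDrawArrays call.
    GLuint makePooledVbo(const std::vector<float> &verticesA,
                         const std::vector<float> &verticesB) {
        std::vector<float> pooled(verticesA);
        pooled.insert(pooled.end(), verticesB.begin(), verticesB.end());

        GLuint vbo = 0;
        glGenBuffers(1, &vbo);
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glBufferData(GL_ARRAY_BUFFER, pooled.size() * sizeof(float),
                     pooled.data(), GL_STATIC_DRAW);  // geometry that rarely changes
        return vbo;
    }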
I'm working on an algorithm which needs very fast random access to video frames in a possibly long video (minimum 30 minutes). I am currently using OpenCV's VideoCapture to read my video, but the seeking functionality is either broken or very slow. The best I found until now is using the MJPEG codec inside a MKV container, but it's not fast enough.
I can choose any video format or even create a new one. Storage space is not a problem (to some extent, of course). The only requirement is to get the fastest possible seek time to any location in the video. Ideally, I would like to be able to access multiple frames simultaneously, taking advantage of my quad-core CPU.
I know that relational databases are very good at storing large volumes of data, they allow simultaneous read access, and they're very fast when using indexes.
Is SQLite a good fit for my specific needs? I plan to store each video frame compressed as JPEG, with an index on the frame number to access them quickly.
EDIT: for me a frame is just an image, not the entire video. A 30 min video @ 25 fps contains 30*60*25 = 45000 frames, and I want to be able to quickly get one of them using its number.
EDIT: For those who might be interested, I finally implemented a custom video container saving each frame in fixed-size blocks (consequently, the position of any frame can be computed directly!). The images are compressed with the turbojpeg library and file accesses are multi-threaded (to be NCQ-friendly). The bottleneck is no longer the HDD and I finally obtained much better performance :)
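For illustration, the fixed-size-block layout reduces any seek to a single offset computation; the header and block sizes below are hypothetical:

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    const uint64_t kHeaderBytes = 64;          // assumed fixed file header
    const uint64_t kBlockBytes  = 256 * 1024;  // assumed fixed per-frame block (JPEG + padding)

    // Reading frame N is one computed seek: no index, no scanning.
    std::vector<uint8_t> readFrameBlock(FILE *f, uint32_t frameIndex) {
        uint64_t offset = kHeaderBytes + frameIndex * kBlockBytes;
        std::vector<uint8_t> block(kBlockBytes);
        fseeko(f, (off_t)offset, SEEK_SET);    // POSIX; use _fseeki64 on Windows
        fread(block.data(), 1, block.size(), f);
        return block;                          // decompress with turbojpeg afterwards
    }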
I don't think using SQLite (or any other database engine) is a good solution for your problem. A database is not a filesystem.
If what you need is very fast random access, then stick to the filesystem; it was designed for exactly this kind of usage and optimized with it in mind. Per your comment, a 5-hour video would require 450k files; that's not a problem in my opinion. Certainly, directory listing will be a bit slow, but you will get the absolute fastest possible random access. And it will certainly be faster than SQLite because you're one level of abstraction lower.
And if you're really worried about directory listing times, you just have to organize your folder structure like a tree. That will get you longer paths, but fast listing.
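For example (a purely illustrative sketch), the frame number itself can be split into path components so that no directory holds more than 100 entries:

    #include <cstdio>
    #include <string>

    // frame 123456 -> "frames/12/34/56.jpg"
    std::string framePath(int frameIndex) {
        char buf[64];
        std::snprintf(buf, sizeof(buf), "frames/%02d/%02d/%02d.jpg",
                      frameIndex / 10000, (frameIndex / 100) % 100, frameIndex % 100);
        return std::string(buf);
    }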
Keep a high-level perspective. The problem is that OpenCV isn't fast enough at seeking in the source video. This could be because:
Codecs are not OpenCV's strength
The source video is not encoded for efficient seeking
Your machine has a lot of dedicated graphics hardware to leverage, but it does not have specialized capabilities for randomly seeking within a 17 GB dataset, be it a file, a database, or a set of files. The disk will take a few milliseconds per seek. It will be better on an SSD, but still not great. Then you wait for the data to load into main memory. And you have to generate all that data in the first place.
Use ffmpeg, which should handle decoding very efficiently, perhaps even using the GPU. Here is a tutorial. (Disclaimer, I haven't used it myself.)
You might preprocess the video to add key frames. In principle this shouldn't require completely re-encoding, at least for MPEG, but I don't know much about specifics. MJPEG essentially turns all frames into keyframes, but you can find a middle ground and maybe seek 1.5x faster at a 2x size cost. But avoid hitting the disk.
As for SQLite, that is a fine solution to the problem of seeking within 17 GB of data. The notion that databases aren't optimized for random access is poppycock. Of course they are. A filesystem is a kind of database. Random access in 17 GB is slow because of hardware, not software.
I would recommend against using the filesystem for this task, because it's a shared resource synchronized with the rest of the machine. Also, creating half a million files (and deleting them when finished) will take a long time. That is not what a filesystem is specialized for. You can get around that, though, by storing several images to each file. But then you need some format to find the desired image, and then why not put them all in one file?
Indeed, (if going the 17 GB route) why not ignore the entire problem and put everything in virtual memory? VM is just as good at making the disk seek as SQLite or the filesystem. As long as the OS knows it's OK for the process to use that much memory, and you're using 64-bit pointers, it should be a fine solution, and the first thing to try.
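A sketch of that virtual-memory approach on a POSIX system (the file name and fixed-size block layout are assumptions): map the whole dataset once and let the OS page in whatever gets touched:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstdint>

    const uint64_t kBlockBytes = 256 * 1024;   // assumed fixed-size frame blocks

    int main() {
        int fd = open("frames.bin", O_RDONLY); // hypothetical container file
        struct stat st;
        fstat(fd, &st);

        // A 64-bit address space comfortably holds a 17 GB mapping;
        // the OS pages in only what is actually accessed.
        const uint8_t *base = (const uint8_t *)mmap(
            nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

        const uint8_t *frame = base + 44999 * kBlockBytes; // random access is pointer math
        (void)frame;

        munmap((void *)base, st.st_size);
        close(fd);
        return 0;
    }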
I was wondering if it is possible to create a programming language without explicit memory allocation/deallocation (like C, C++, ...) AND without garbage collection (like Java, C#, ...) by doing a full analysis at the end of each scope.
The obvious problem is that this would take some time at the end of each scope, but I was wondering if it has become feasible with all the processing power and multiple cores in current CPUs. Do such languages already exist?
I also was wondering if a variant of C++ where smart pointers are the only pointers that can be used, would be exactly such a language (or am I missing some problems with that?).
Edit:
Well after some more research apparently it's this: http://en.wikipedia.org/wiki/Reference_counting
I was wondering why this isn't more popular. The disadvantages listed there don't seem that serious, and the overhead shouldn't be that large, in my opinion. A (non-interpreted, properly written from the ground up) language with C-family syntax and reference counting seems like a good idea to me.
The biggest problem with reference counting is that it is not a complete solution: it is not capable of collecting cyclic structures. The overhead is incurred every time you set a reference; for many kinds of problems this adds up quickly and can be worse than just waiting for a GC later. (Modern GC is quite advanced and awesome - don't count it out like that!)
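A minimal C++ illustration of the cycle problem, using std::shared_ptr as the reference-counting mechanism:

    #include <memory>

    struct Node {
        std::shared_ptr<Node> other;   // every assignment here adjusts a refcount
    };

    int main() {
        auto a = std::make_shared<Node>();
        auto b = std::make_shared<Node>();
        a->other = b;                  // b's count is now 2
        b->other = a;                  // a's count is now 2
        // When a and b leave scope, both counts drop to 1, never 0:
        // the two nodes keep each other alive forever. A tracing GC
        // would reclaim this cycle; pure reference counting cannot.
        return 0;
    }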
What you are talking about is nothing special, and it shows up all the time. The C or C++ variant you are looking for is just plain regular C or C++.
For example, write your program normally but constrain yourself not to use any dynamic memory allocation (no new, delete, malloc, or free, or any of their friends, and make sure your libraries do the same), and you have that kind of system. You figure out in advance how much memory you need for everything you could possibly do, and declare that memory statically (either as function-level static variables or as global variables). The compiler takes care of all the accounting the normal way; nothing special happens at the end of each scope, and no extra computation is necessary.
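A small sketch of that discipline (names and sizes are invented): all storage is declared up front, and running out of room is an explicit, predictable condition rather than an allocation failure:

    #include <cstddef>

    const std::size_t kMaxItems = 1024;            // worst case, decided in advance

    static int itemPool[kMaxItems];                // static storage: no malloc/new anywhere
    static std::size_t itemCount = 0;

    // Capacity exhaustion is checked explicitly, not discovered at runtime
    // through a failed allocation.
    bool pushItem(int value) {
        if (itemCount == kMaxItems)
            return false;
        itemPool[itemCount++] = value;
        return true;
    }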
You can even configure your runtime environment to have a statically allocated stack space (this one isn't really under the compiler's control; it's more a matter of the linker and operating system environment). Just figure out how deep your function call chain goes and how much memory it uses (with a profiler or similar tool), and set it in your link options.
Without dynamic memory allocation (and thus no deallocation through either garbage collection or explicit management), you are limited to the memory you declared when you wrote the program. But that's ok, many programs don't need dynamic memory, and are already written that way. The real need for this shows up in embedded and real-time systems when you absolutely, positively need to know exactly how long an operation will take, how much memory (and other resources) it will use, and that the running time and the use of those resources can't ever change.
The great thing about C and C++ is that the language requires so little from the environment, and gives you the tools to do so much, that smart pointers, statically allocated memory, or even some special scheme that you dream up can all be implemented. Requiring their use, and the constraints you put on yourself, just becomes a policy decision. You can enforce that policy with code auditing (use scripts to scan the source or object files, and don't permit linking to the dynamic memory libraries).