In some dynamic programming problems, I notice that my cache table is very sparse. In other words, if I define a table as DP[i][j], i<=10^6, j<=10^2, only a fraction of the table is used and the rest is -1.
So my question is: is it common practice to store the (i, j) pairs and their DP values in a hashmap, giving average O(1) access, rather than in the sparse table, in order to save memory?
First of all, yes, you can use a hashmap instead of an array for dynamic programming problems, but there are limitations as well as benefits to using one.
When you use a hashmap in this particular case (dynamic programming), it reduces the memory footprint but simultaneously increases the constant factor of your code. That means if you can perform around 10^8 operations per second with an array, you will only manage around 10^7 operations per second with a hashmap, even though the asymptotic complexity of the algorithm is the same.
So if you can afford to declare an array of that size, use the array; otherwise use the hashmap.
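To make that concrete, here is a minimal C++ sketch of the hashmap approach. The recurrence is invented purely for illustration (chosen so it only ever visits a sparse set of i values); the point is the packed (i, j) key and the cache lookup:

    #include <cstdint>
    #include <cstdio>
    #include <unordered_map>

    // Cache keyed on the packed (i, j) pair; only the states actually
    // visited are ever stored, so memory use matches the sparsity.
    std::unordered_map<std::uint64_t, long long> memo;

    // Hypothetical recurrence, used only to illustrate the memoization pattern.
    long long dp(int i, int j) {
        if (i == 0) return j;                            // hypothetical base case
        std::uint64_t key = std::uint64_t(i) * 101 + j;  // pack (i, j); j <= 100
        auto it = memo.find(key);
        if (it != memo.end()) return it->second;         // cache hit: average O(1)
        long long result = dp(i / 2, j) + dp(i / 3, (j + 1) % 101);
        memo.emplace(key, result);
        return result;
    }

    int main() {
        std::printf("%lld\n", dp(1000000, 5));  // touches far fewer than 10^8 states
        return 0;
    }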
Yes, it is definitely common practice to use hashmaps, particularly in the case of sparsity.
It is even possible to go beyond that: for even larger problems, approximate dynamic programming draws on tools such as function approximation.
There are a lot of questions online about allocating, copying, indexing, etc., 2D and 3D arrays in CUDA. I'm getting a lot of conflicting answers, so I'm attempting to compile past questions to see if I can ask the right ones.
First link: https://devtalk.nvidia.com/default/topic/392370/how-to-cudamalloc-two-dimensional-array-/
Problem: Allocating a 2d array of pointers
User solution: use mallocPitch
"Correct" inefficient solution: Use malloc and memcpy in a for loop for each row (Absurd overhead)
"More correct" solution: Squash it into a 1d array "professional opinion," one comment saying no one with an eye on performance uses 2d pointer structures on the gpu
Second link: https://devtalk.nvidia.com/default/topic/413905/passing-a-multidimensional-array-to-kernel-how-to-allocate-space-in-host-and-pass-to-device-/
Problem: Allocating space on host and passing it to device
Sub link: https://devtalk.nvidia.com/default/topic/398305/cuda-programming-and-performance/dynamically-allocate-array-of-structs/
Sub link solution: Coding pointer based structures on the GPU is a bad experience and highly inefficient, squash it into a 1d array.
Third link: Allocate 2D Array on Device Memory in CUDA
Problem: Allocating and transferring 2d arrays
User solution: use mallocPitch
Other solution: flatten it
Fourth link: How to use 2D Arrays in CUDA?
Problem: Allocate and traverse 2d arrays
Submitted solution: Does not show allocation
Other solution: squash it
There are a lot of other sources mostly saying the same thing but in multiple instances I see warnings about pointer structures on the GPU.
Many people claim the proper way to allocate an array of pointers is with a call to malloc and memcpy for each row, yet the functions mallocPitch and memcpy2D exist. Are these functions somehow less efficient? Why wouldn't this be the default answer?
The other 'correct' answer for 2d arrays is to squash them into one array. Should I just get used to this as a fact of life? I'm very persnickety about my code and it feels inelegant to me.
Another solution I was considering was to make a matrix class that uses a 1D pointer array, but I can't find a way to implement the double bracket operator.
Also according to this link: Copy an object to device?
and the sub link answer: cudaMemcpy segmentation fault
This gets a little iffy.
The classes I want to use CUDA with all have 2D/3D arrays, and wouldn't there be a lot of overhead in converting those to 1D arrays for CUDA?
I know I've asked a lot, but in summary: should I get used to squashed arrays as a fact of life, or can I use the 2D allocate-and-copy functions without getting the kind of bad overhead seen in the solution where alloc and cpy are called in a for loop?
Since your question compiles a list of other questions, I'll answer by compiling a list of other answers.
cudaMallocPitch/cudaMemcpy2D:
First, the CUDA runtime API functions like cudaMallocPitch and cudaMemcpy2D do not actually involve either double-pointer allocations or 2D (doubly-subscripted) arrays. This is easy to confirm simply by looking at the documentation and noting the types of parameters in the function prototypes: the src and dst parameters are single-pointer parameters. They could not be doubly subscripted or doubly dereferenced. For additional example usage, here is one of many questions on this, here is a fully worked example usage, and another example covering various concepts associated with cudaMallocPitch/cudaMemcpy2D usage is here. Instead, the correct way to think about these functions is that they work with pitched allocations. Also, you cannot use cudaMemcpy2D to transfer data when the underlying allocation has been created using a set of malloc (or new, or similar) operations in a loop; that sort of host data allocation construction is particularly ill-suited to working with the data on the device.
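For a flavour of correct usage, here is a minimal sketch (sizes and names invented for illustration) that allocates a pitched device buffer and fills it from an ordinary contiguous host array with cudaMemcpy2D; note that both pointers involved are single pointers:

    #include <cuda_runtime.h>

    int main() {
        const int width = 100, height = 50;          // elements per row, rows
        static float h_data[height][width] = {};     // contiguous host array

        float *d_data = nullptr;
        size_t pitch = 0;                            // row stride in bytes, chosen by the runtime
        cudaMallocPitch(&d_data, &pitch, width * sizeof(float), height);

        // src and dst are plain single pointers; no pointer-to-pointer anywhere.
        cudaMemcpy2D(d_data, pitch,
                     h_data, width * sizeof(float),  // host rows are tightly packed
                     width * sizeof(float), height,  // copy extent: bytes wide, rows high
                     cudaMemcpyHostToDevice);

        // In a kernel, row y starts at (float *)((char *)d_data + y * pitch).

        cudaFree(d_data);
        return 0;
    }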
general, dynamically allocated 2D case:
If you wish to learn how to use a dynamically allocated 2D array in a CUDA kernel (meaning you can use doubly-subscripted access, e.g. data[x][y]), then the cuda tag info page contains the "canonical" question for this; it is here. The answer given by talonmies there includes the proper mechanics, as well as appropriate caveats:
there is additional, non-trivial complexity
the access will generally be less efficient than 1D access, because data access requires dereferencing 2 pointers, instead of 1.
(note that allocating an array of objects, where the object(s) has an embedded pointer to a dynamic allocation, is essentially the same as the 2D array concept, and the example you linked in your question is a reasonable demonstration for that)
Also, here is a thrust method for building a general dynamically allocated 2D array.
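For orientation, the mechanics have roughly the following shape; this is a simplified sketch with invented sizes, not the code from the linked answer:

    #include <cuda_runtime.h>

    __global__ void zero(float **data, int width, int height) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            data[y][x] = 0.0f;   // two dereferences: row pointer, then element
    }

    int main() {
        const int width = 100, height = 50;

        // Allocate each row on the device, collecting the device row
        // pointers in a host-side staging array.
        float **h_rows = new float *[height];
        for (int y = 0; y < height; ++y)
            cudaMalloc(&h_rows[y], width * sizeof(float));

        // Copy the row-pointer array itself to the device.
        float **d_data = nullptr;
        cudaMalloc(&d_data, height * sizeof(float *));
        cudaMemcpy(d_data, h_rows, height * sizeof(float *), cudaMemcpyHostToDevice);

        zero<<<dim3(7, 4), dim3(16, 16)>>>(d_data, width, height);
        cudaDeviceSynchronize();

        for (int y = 0; y < height; ++y) cudaFree(h_rows[y]);
        cudaFree(d_data);
        delete[] h_rows;
        return 0;
    }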
flattening:
If you think you must use the general 2D method, then go ahead, it's not impossible (although sometimes people struggle with the process!) However, due to the added complexity and reduced efficiency, the canonical "advice" here is to "flatten" your storage method, and use "simulated" 2D access. Here is one of many examples of questions/answers discussing "flattening".
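Here is a minimal sketch of the flattened approach (dimensions invented for illustration): one contiguous allocation, with the index arithmetic data[y * width + x] standing in for data[y][x]:

    #include <cuda_runtime.h>

    // Flattened storage: a single width*height allocation, "simulated" 2D access.
    __global__ void scale(float *data, int width, int height, float factor) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            data[y * width + x] *= factor;   // index arithmetic replaces data[y][x]
    }

    int main() {
        const int width = 256, height = 128;
        float *d_data = nullptr;
        cudaMalloc(&d_data, width * height * sizeof(float));  // one flat allocation
        dim3 block(16, 16);
        dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
        scale<<<grid, block>>>(d_data, width, height, 2.0f);
        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }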
general, dynamically allocated 3D case:
As we extend this to 3 (or higher!) dimensions, the general case becomes overly complex to handle, IMO. The additional complexity should strongly motivate us to seek alternatives. The triply-subscripted general case involves 3 pointer accesses before the data is actually retrieved, so it is even less efficient. Here is a fully worked example (2nd code example).
special case: array width known at compile time:
Note that it should be considered a special case when the array dimension(s) (the width, in the case of a 2D array, or 2 of the 3 dimensions for a 3D array) is known at compile-time. In this case, with an appropriate auxiliary type definition, we can "instruct" the compiler how the indexing should be computed, and in this case we can use doubly-subscripted access with considerably less complexity than the general case, and there is no loss of efficiency due to pointer-chasing. Only one pointer need be dereferenced to retrieve the data (regardless of array dimensionality, if n-1 dimensions are known at compile time for a n-dimensional array). The first code example in the already-mentioned answer here (first code example) gives a fully worked example of that in the 3D case, and the answer here gives a 2D example of this special case.
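Here is a minimal sketch of that special case (the width W is invented for illustration): the pointer-to-array parameter type tells the compiler how to compute the indexing, so data[y][x] costs a single dereference over a flat allocation:

    #include <cuda_runtime.h>

    const int W = 64;  // array width known at compile time

    // The float (*)[W] type carries the width, so the compiler generates
    // the index arithmetic and only one pointer is ever dereferenced.
    __global__ void increment(float (*data)[W], int height) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < W && y < height)
            data[y][x] += 1.0f;          // doubly-subscripted, no pointer-chasing
    }

    int main() {
        const int height = 128;
        float (*d_data)[W] = nullptr;
        cudaMalloc((void **)&d_data, height * W * sizeof(float));  // one flat allocation
        dim3 block(16, 16);
        dim3 grid((W + block.x - 1) / block.x, (height + block.y - 1) / block.y);
        increment<<<grid, block>>>(d_data, height);
        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }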
doubly-subscripted host code, singly-subscripted device code:
Finally another methodology option allows us to easily mix 2D (doubly-subscripted) access in host code while using only 1D (singly-subscripted, perhaps with "simulated 2D" access) in device code. A worked example of that is here. By organizing the underlying allocation as a contiguous allocation, then building the pointer "tree", we can enable doubly-subscripted access on the host, and still easily pass the flat allocation to the device. Although the example does not show it, it would be possible to extend this method to create a doubly-subscripted access system on the device based off a flat allocation and a manually-created pointer "tree", however this would have approximately the same issues as the 2D general dynamically allocated method given above: it would involve double-pointer (double-dereference) access, so less efficient, and there is some complexity associated with building the pointer "tree", for use in device code (e.g. it would necessitate an additional cudaMemcpy operation, probably).
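A simplified sketch of that organisation (sizes invented for illustration; this is not the linked example, just the same idea): a contiguous allocation plus a host-side pointer "tree" for 2D host access, with the flat pointer passed to the device in one copy:

    #include <cstdlib>
    #include <cuda_runtime.h>

    int main() {
        const int width = 100, height = 50;

        // One contiguous host allocation...
        float *flat = (float *)malloc(width * height * sizeof(float));
        // ...plus a pointer "tree" so host code can write data[y][x].
        float **data = (float **)malloc(height * sizeof(float *));
        for (int y = 0; y < height; ++y)
            data[y] = flat + y * width;

        data[3][7] = 42.0f;              // doubly-subscripted access on the host

        // Because the storage is contiguous, one cudaMemcpy moves it all;
        // device code then uses d_flat[y * width + x].
        float *d_flat = nullptr;
        cudaMalloc(&d_flat, width * height * sizeof(float));
        cudaMemcpy(d_flat, flat, width * height * sizeof(float), cudaMemcpyHostToDevice);

        cudaFree(d_flat);
        free(data);
        free(flat);
        return 0;
    }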
From the above methods, you'll need to choose one that fits your appetite and needs. There is not one single recommendation that fits every possible case.
Can someone explain the difference between dictionaries and hashtables? In Java, I've read that dictionaries are a superset of hashtables, but I always thought it was the other way around. Other languages seem to treat the two as the same. When should one be used over the other, and what's the difference?
The Oxford Dictionary of Computing defines a dictionary as...
Any data structure representing a set of elements that can support the insertion and deletion of elements as well as a test for membership.
As such, dictionaries are an abstract idea that can be reasonably efficiently implemented as, e.g., binary trees, hash tables, tries, or even direct array indexing if the keys are numeric and not too sparse. That said, Python uses a closed-hashing hash table for its dict implementation, and C# seems to use some kind of hash table too (hence the need for a separate SortedDictionary type).
A hash table is a much more specific and concrete data structure: there are several implementation options (closed vs. open hashing being perhaps the most fundamental), but they're all characterised by O(1) amortised insertion, lookup and deletion, and there's no excuse for begin-to-end iteration worse than O(n + #buckets), while implementations may achieve better (e.g. GCC's C++ library has O(n) container iteration). The implementations necessarily depend on a hash function leading to an indexed probe in an array.
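To make the abstract/concrete distinction tangible, here is a small C++ illustration (my choice of language; the question is about Java, but the idea is language-independent). Both containers are "dictionaries" in the Oxford sense, but only one of them is a hash table:

    #include <iostream>
    #include <map>
    #include <string>
    #include <unordered_map>

    int main() {
        // Both support insertion, deletion, and membership tests, so both
        // satisfy the abstract definition of a dictionary.
        std::map<std::string, int> tree_dict;            // balanced tree: O(log n), sorted iteration
        std::unordered_map<std::string, int> hash_dict;  // hash table: O(1) amortised, unordered

        tree_dict["apple"] = 1;
        hash_dict["apple"] = 1;

        std::cout << tree_dict.count("apple") << ' '
                  << hash_dict.count("apple") << '\n';   // prints: 1 1
        return 0;
    }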
The way I see it, a hashtable is one way of implementing a dictionary: it specifies that the key is located via hashfunction(x) and the value can be any Object. The Java Dictionary can use any key as long as .equals(y) has been implemented for that object.
The 'answer' will also change depending on the language (C#? Java? JS?) you're using. In JS the 'dictionary' is implemented as a hashtable and there is no difference. In another language (I believe it's C#), the Dictionary must be strongly typed, with a fixed key type and a fixed value type, while the Hashtable's value can be of any type, and the two are not extended from one another.
I have a code in which a one-dimensional array R is used which has 3N elements. You can think of it as the position vector of N particles, such that R = [r1x, r1y, r1z, r2x, r2y, ...]. Note that the layout has to be defined this way for concise usage of the array elsewhere in the code.
In sections of the code, I need to perform some operations only on the x-coordinates. I am currently using something like this:
Rx => R(1:3*N-2:3)
and Rx is subsequently used in the operations. This makes the access non-contiguous, but I was wondering if there is any hope of vectorizing the operations. Alternatively, one may use OpenMP with a loop over the particles. I'd like an expert's view on this matter, particularly on the best possible practice performance-wise.
You can't have your cake and eat it too. If you want to make strided access to non-contiguous array elements you're going to pay a price in performance. For small arrays, in which all the elements fit into cache, you'll probably never notice the price. For larger arrays you'll do a lot more data movement through cache than if you step through array elements one-by-one in memory-layout order. Using pointers to non-contiguous array sections doesn't magically alter these facts (as you seem to be aware).
So what you do is what Fortran programmers have always done: optimise the memory layout of your arrays for the most common access pattern. In your case many of us would have either a (3,N) rank-2 array or an (N,3) one, depending on whether accessing all the x (or y or z) elements together is more frequent than accessing particle-by-particle.
Sometimes it's worth transposing an array prior to operations on elements in non-memory-layout order. Sometimes it's even worth holding the same data twice, once in one order, once in the other. But you're going to have to figure out which is the best solution for your program, we don't have all the facts necessary to provide a high-quality recommendation. If it matters to you, then it should matter enough for you to conduct some tests and develop a quantified view of the situation.
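As one concrete illustration of the layout trade-off (sketched in C++ rather than Fortran, purely as my choice here; the cache behaviour is the same), compare a stride-3 sweep over interleaved coordinates with a unit-stride sweep over a per-coordinate layout:

    #include <cstdio>
    #include <vector>

    const int N = 1000000;

    // Interleaved layout, like R = [r1x, r1y, r1z, r2x, ...]: the
    // x-coordinates sit at stride 3, so a sweep over them is non-contiguous.
    std::vector<double> R(3 * N, 1.0);

    double sum_x_strided() {
        double s = 0.0;
        for (int i = 0; i < N; ++i)
            s += R[3 * i];           // stride-3 access: extra cache traffic
        return s;
    }

    // Layout optimised for per-coordinate sweeps: each coordinate stored
    // contiguously, so the same loop is unit-stride and vectorises easily.
    std::vector<double> Rx(N, 1.0), Ry(N, 1.0), Rz(N, 1.0);

    double sum_x_contiguous() {
        double s = 0.0;
        for (int i = 0; i < N; ++i)
            s += Rx[i];              // unit-stride: cache- and SIMD-friendly
        return s;
    }

    int main() {
        std::printf("%f %f\n", sum_x_strided(), sum_x_contiguous());
        return 0;
    }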
You pays your money and you makes your choice.
What are the downsides of using Bigarray when interfacing with C is not an issue? Are they slower, in particular for small 2D matrices?
Just based on looking through the implementations, I'd say that bigarrays might be slower if you create large numbers of short-lived arrays. It looks like the memory for them is managed outside the usual OCaml GC, which handles short-lived objects extremely well.
You also might find that accesses to bigarrays aren't inlined, whereas accesses to the built-in arrays would be.
On the other hand, built-in arrays are going to have an extra indirection for two-dimensions.
If the performance really matters, you'll probably have to benchmark your particular application.
The main downside is right there in the type: bigarrays can hold only a small subset of primitive types.
Hi, I'm writing a program that needs high-performance handling of vector elements:
vector<Class_A> object;
1. Which is fastest for accessing the elements?
2. Which is simpler and less complex to deal with?
An index? An iterator? A pointer?
An iterator or pointer will have the same performance on most implementations; usually a vector iterator is a pointer. An index needs to calculate a pointer each time, but the optimizer can sometimes take care of that. Generally, though, as another commenter said, there's no sense in optimizing this for performance.
All of that said, I would probably go with an iterator since it's easier to change to another type of container if need be.
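For reference, here is a small sketch of all three access styles (Class_A and its member are invented; the question doesn't show the class). With optimization on, all three loops typically compile to near-identical code:

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Hypothetical element type; the question only names the type.
    struct Class_A { int value; };

    int sum_all(const std::vector<Class_A> &v) {
        int total = 0;

        // 1. Index: recomputes the element address from the base each
        //    iteration (usually strength-reduced away by the optimizer).
        for (std::size_t i = 0; i < v.size(); ++i)
            total += v[i].value;

        // 2. Iterator: on most implementations this compiles to the same
        //    code as the pointer loop, and ports easily to other containers.
        for (auto it = v.begin(); it != v.end(); ++it)
            total += it->value;

        // 3. Raw pointer: equivalent to the iterator version here.
        for (const Class_A *p = v.data(); p != v.data() + v.size(); ++p)
            total += p->value;

        return total;
    }

    int main() {
        std::vector<Class_A> object{{1}, {2}, {3}};
        std::printf("%d\n", sum_all(object));  // prints 18: each element summed three times
        return 0;
    }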
Assuming you have inlining enabled and aren't doing index range checking, they will probably all be about the same. Besides, micro-optimizing this probably isn't going to gain you anything; you need to optimize globally. Profile your process and target the slowest parts first.