random memory access and bank conflict - opencl

These days I'm trying to program on a mobile GPU (Adreno).
The algorithm I use for image processing has 'randomness' in its memory access.
It refers to some pixels within a 'fixed' range for filtering,
BUT I can't know exactly which pixels will be referred to (it depends on the image).
As far as I understand, if multiple threads access the same local memory bank,
it causes a bank conflict, so in my case it should cause bank conflicts.
MY question: can I eliminate bank conflicts with random memory access?
Or can I at least reduce them?

Assuming that the distances of your randomly accessed pixels are roughly normally distributed, you could think of tiling your image into subimages.
What I mean: instead of working with a (let's say) 1024x1024 image, you might have 4x4 images of size 256x256. Each of them is kept together in memory, so "near" pixel accesses stay within the same image object. Only the far-distance operations need to access different subimages.
A second option: instead of using CLImage objects, try to save your data into an array. The data in the array can be stored in Z-order curve order. This also keeps spatially close pixels closer together in memory (compared to row-major order).
But of course, this depends strongly on your image size.
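To make the Z-order idea concrete, here is a minimal sketch in Python (not OpenCL) of how a row-major pixel array could be reordered along a Z-order (Morton) curve. The function names are made up for illustration, and the sketch assumes a square image whose side is a power of two and at most 65536 pixels.

```python
def part1by1(n: int) -> int:
    """Spread the lower 16 bits of n so that a zero bit sits between each pair of bits."""
    n &= 0x0000FFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n


def morton_index(x: int, y: int) -> int:
    """Interleave the bits of x (even positions) and y (odd positions)."""
    return part1by1(x) | (part1by1(y) << 1)


def to_z_order(pixels, side):
    """Copy a row-major list of side*side pixels into Z-order.

    Assumption: side is a power of two, so the Morton indices fill
    0 .. side*side - 1 without gaps.
    """
    out = [None] * (side * side)
    for y in range(side):
        for x in range(side):
            out[morton_index(x, y)] = pixels[y * side + x]
    return out


# A 4x4 image whose pixel values equal their row-major index.
z = to_z_order(list(range(16)), 4)
```

In the reordered array each 2x2 block of pixels occupies four consecutive slots, which is the locality property the answer is after.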

There are a variety of ways to deal with bank conflicts: changing the size of the elements you are working with, changing the strides between lines, and shifting the coordinates around to different memory addresses. It's never going to be as good as non-random, conflict-free access though, so what you will notice is that, depending on the image, you will see significantly different compute times.
See http://cuda-programming.blogspot.com/2013/02/bank-conflicts-in-shared-memory-in-cuda.html
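As a rough illustration of the "strides between lines" point, the following Python sketch models local memory as 32 banks of one 4-byte word each, with bank = word address modulo 32. That bank layout, and the rule that threads reading the very same word do not conflict, are assumptions borrowed from the usual CUDA description; the actual Adreno hardware may differ. The helper names are hypothetical.

```python
from collections import defaultdict

NUM_BANKS = 32  # assumption: 32 banks, each one 4-byte word wide


def conflict_degree(word_addresses, num_banks=NUM_BANKS):
    """Worst-case number of distinct words that land in the same bank.

    1 means conflict-free; n means the access is serialized n ways.
    Assumption: several threads reading the same word count once (broadcast).
    """
    banks = defaultdict(set)
    for addr in word_addresses:
        banks[addr % num_banks].add(addr)
    return max(len(words) for words in banks.values())


def column_addresses(column, rows, row_stride):
    """Word addresses touched when each thread reads one row of the same column."""
    return [row * row_stride + column for row in range(rows)]


unpadded = conflict_degree(column_addresses(0, 32, 32))  # tile rows are 32 words apart
padded = conflict_degree(column_addresses(0, 32, 33))    # one padding word per row
```

Under this model the unpadded column access hits a single bank 32 times, while padding each row by one word spreads the same access over all 32 banks. With truly random addresses the result lies somewhere in between and changes with the image, which matches the varying compute times mentioned above.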

Related

OpenCL: know local work group size in advance?

I'm working on optimizing a separable image downscaler. My next step is reduction of multiple samplings (nearest) of the same texel by reading all necessary texels into local memory. Here begins the fun...
The downscaler is versatile, so it can downscale anything larger into anything smaller and even take a section of an image and downscale it into a destination image. Thus the final resolution divider is almost never a whole number. Most of the time it will be something around 3.97 or such. This means: I do not know the required size for that local array at compile time.
To me that means: before enqueuing a task, I'll have to create a local mem object of the required size.
How do I know what workgroup sizes OpenCL will select?
If there is no way, is there a "best practice" to overcome this problem?
P.S.: I'm writing for OpenCL 1.1 compatibility.
Since you are using images, the texture cache can be relied upon instead of using shared local memory.

How do I generate a waypoint map in a 2D platformer without expensive jump simulations?

I'm working on a game (using Game Maker: Studio Professional v1.99.355) that needs to have both user-modifiable level geometry and AI pathfinding based on platformer physics. Because of this, I need a way to dynamically figure out which platforms can be reached from which other platforms in order to build a node graph I can feed to A*.
My current approach is, more or less, this:
1. For each platform, consider each other platform in the level.
2. For each of those platforms, if it is obviously unreachable (due to being higher than the maximum jump height, for example), do not form a link and move on to the next platform.
3. If a link seems possible, place an ai_character instance on the starting platform and (within the current step event) simulate a jump attempt.
   3.a Repeat this jump attempt for each possible starting position on the starting platform.
4. If this attempt is successful, record the data necessary to replicate it in real time and move on to the next platform.
5. If not, do not form a link.
6. Repeat for all platforms.
This approach works, more or less, and produces a link structure that when visualised looks like this:
linked platforms (Hyperlink because no rep.)
In this example the mostly-concealed pink ghost in the lower right corner is trying to reach the black and white box. The light blue rectangles are just there to highlight where recognised platforms are, the actual platforms are the rows of grey boxes. Link lines are green at the origin and red at the destination.
The huge, glaring problem with this approach is that for a level of only 17 platforms (as shown above) it takes over a second to generate the node graph. The reason for this is obvious, the yellow text in the screen centre shows us how long it took to build the graph: over 24,000(!) simulated frames, each with attendant collision checks against every block - I literally just run the character's step event in a while loop so everything it would normally do to handle platformer movement in a frame it now does 24,000 times.
This is, clearly, unacceptable. If it scales this badly at a mere 17 platforms then it'll be a joke at the hundreds I need to support. Heck, at this geometric time cost it might take years.
In an effort to speed things up, I've focused on the other important debugging number, the tests counter: 239. If I simply tried every possible combination of starting and destination platforms, I would need to run 17 * 16 = 272 tests. By figuring out various ways to predict whether a jump is impossible I have managed to lower the number of expensive tests run by a whopping 33 (12%!). However the more exceptions and special cases I add to the code the more convinced I am that the actual problem is in the jump simulation code, which brings me at long last to my question:
How would you determine, with complete reliability, whether it is possible for a character to jump from one platform to another, preferably without needing to simulate the whole jump?
My specific platform physics:
Jumps are fixed height, unless you hit a ceiling.
Horizontal movement has no acceleration or inertia.
Horizontal air control is allowed.
Further info:
I found this video, which describes a similar problem but which doesn't provide a good solution. This is literally the only resource I've found.
You could limit the amount of comparisons by only comparing nearby platforms. I would probably only check the horizontal distance between platforms, and if it is wider than the longest jump possible, then don't bother checking for a link between those two. But you might have done this since you checked for the max height of a jump.
I glanced at the video and it gave me an idea. Instead of looking at all platforms to find which jumps are impossible, what if you did the opposite? Try placing an AI character on all platforms and see which other platforms they can reach. That's certainly easier to implement if your enemies can't change direction in midair though. Oh well, brainstorming is the key to finding something.
Several ideas you could try out:
Limit the number of comparisons you need to make by using a spatial data structure, like a quad tree. This would allow you to severely limit how many platforms you're even trying to check. This is mostly the same as what you're currently doing, but a bit more generic.
Try to pre-compute some jump trajectories ahead of time. This will not catch all use cases that you have, as you allow for full horizontal control, but it might allow you to catch some common cases more quickly.
Consider some kind of walkability grid instead of a link generation scheme. When geometry is modified, compute which parts of the level are walkable and which are not, at some resolution (something similar to the dimensions of your agent might be a good starting point). You could also filter them by height, so that grid tiles that are higher than your jump height, and that you can't drop onto from a higher place, are marked as unwalkable. Then, as part of your pathfinding step, you can compute, when you start a jump, whether a path is actually executable ('start a jump, I can go vertically no more than 5 tiles, and after the peak of the jump, I always fall down vertically with some speed').
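One way to make the "pre-compute jump trajectories" and "check the horizontal distance" ideas concrete is a closed-form filter. The Python sketch below is only that, a filter: it assumes constant gravity, a fixed initial jump speed, constant horizontal speed with full air control, a point-sized character, and nothing in the way (no ceilings or blocks). All names and parameters are hypothetical, and a frame-stepped engine will deviate slightly from the continuous formula. A `False` result rules a link out under those assumptions; a `True` result would still need to be confirmed by the simulation.

```python
import math


def max_air_time(jump_speed, gravity, height_diff):
    """Time until the character comes back down to `height_diff` above takeoff.

    height_diff > 0 means the target is higher, < 0 lower.
    Returns None if the jump apex does not reach that height.
    """
    disc = jump_speed ** 2 - 2.0 * gravity * height_diff
    if disc < 0:
        return None
    # Later root of jump_speed*t - gravity*t^2/2 = height_diff (descending branch).
    return (jump_speed + math.sqrt(disc)) / gravity


def horizontal_gap(a_left, a_right, b_left, b_right):
    """Horizontal distance between two platforms' nearest edges (0 if they overlap)."""
    if b_left > a_right:
        return b_left - a_right
    if a_left > b_right:
        return a_left - b_right
    return 0.0


def can_reach(start, target, jump_speed, gravity, run_speed):
    """start/target are (left, right, top) tuples; `top` grows upward here."""
    t = max_air_time(jump_speed, gravity, target[2] - start[2])
    if t is None:
        return False
    gap = horizontal_gap(start[0], start[1], target[0], target[1])
    return gap <= run_speed * t


# Example: apex is 10^2 / (2 * 10) = 5 units, flat jump lasts 2 time units.
link_possible = can_reach((0, 10, 0), (15, 20, 0), 10.0, 10.0, 3.0)
```

Because the check is a handful of arithmetic operations per platform pair, it could be run on every pair first, leaving the expensive simulated jumps for the pairs that survive.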

What is the advantage of using a 1d image over a 1d buffer?

I understand that in 2d, images are cached in x and y directions.
But in 1d, why would you want to use an image? Is the memory used
for images faster than memory used for buffers?
A 1D image is still an image, so it has all the advantages that an image has over a buffer. That is:
Image IO operations are usually well-cached.
Samplers can be used, which gives benefits like computationally cheap interpolation, hardware-resolved out-of-bounds access, etc.
Though, you should remember that an image has some constraints in comparison to a regular buffer:
A single image can be used either for reading or for writing within one kernel, not both.
You can't use vloadN / vstoreN operations, which can handle up to 16 values per call. Your best option is the read_imageX & write_imageX functions, which can load / store up to 4 values per call. That can be a serious issue on a GPU with a vector architecture.
If you are not using a 4-component format, you are usually losing part of the performance, as many functions process samples from all color planes simultaneously. So the useful payload decreases.
If we talk about GPUs, different parts of the hardware are involved in processing images & buffers, so it's difficult to say up front which one is better. Careful benchmarking & algorithm optimization are needed.

Storing pixel based world data

I am making a 2D game with destructible terrain. It will be on iOS but I am looking for ideas or pseudocode, not actual code. I'm wondering how to store a large amount of data. (It will be a large world, approximately 64000 pixels wide and 9600 tall. Each pixel needs a way to store what type of object it is.) I was hoping to use a 2D array but a quick load test showed that this is not feasible (even using a 640x480 grid I dropped below 1 fps).
I also tried the method detailed here: http://gmc.yoyogames.com/index.php?showtopic=315851 (I used to use Game Maker and remembered this method); however, it seems a bit cumbersome, and recombining the objects again is nearly impossible.
So what other methods are there? Does anyone know how Worms worked? What about image editors, how do they store the colour of each pixel?
Thank you,
YM
Run-length encoding can help with your memory issues
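For illustration, a minimal run-length encoding sketch in Python (the question asked for ideas rather than iOS code). It stores one row of the world as (value, count) runs; the function names are made up, and keeping one run list per row is an assumption about how the world might be split up.

```python
def rle_encode(row):
    """Compress a row of pixel types into (value, run_length) pairs."""
    runs = []
    for value in row:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([value, 1])   # start a new run
    return [tuple(run) for run in runs]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into a flat row."""
    row = []
    for value, count in runs:
        row.extend([value] * count)
    return row


def rle_get(runs, x):
    """Look up the pixel type at column x without decoding the whole row."""
    for value, count in runs:
        if x < count:
            return value
        x -= count
    raise IndexError("column outside the encoded row")


# 0 = air, 1 = dirt: six pixels collapse into three runs.
runs = rle_encode([0, 0, 0, 1, 1, 0])
```

Terrain with long stretches of the same material compresses well this way; the trade-off is that a lookup walks the runs of a row instead of indexing directly, and digging a hole means splitting a run.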
I am most likely going to use Polygon based storage.

Problem with huge objects in a quad tree

Let's say I have circular objects. Each object has a diameter of 64 pixels.
The cells of my quad tree are let's say 96x96 pixels.
Everything will be fine and working well when I check collision from the cell a circle is residing in + all its neighbor cells.
BUT what if I have one circle that has a diameter of 512 pixels? It would cover many cells and thus this would be a problem when checking only the neighbor cells. But I can't re-size my quad-tree-grid every time a much larger object is inserted into the tree...
Instead of putting objects into a single cell, put them in all the cells they collide with. That way you can just test each cell individually. Use pointers to the objects so you don't create copies. Also, you only need to do this with leaf nodes, so there is no need to combine data contained in higher nodes with lower ones.
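A small Python sketch of that idea, using a flat dictionary keyed by cell coordinates as a stand-in for the quad tree's leaf nodes. The 96-pixel cell size comes from the question; the function names are hypothetical, and the circle is approximated by its bounding box, so a few corner cells may be included that the circle does not actually touch.

```python
from collections import defaultdict

CELL_SIZE = 96  # leaf cell size from the question


def cells_for_circle(cx, cy, radius, cell_size=CELL_SIZE):
    """All (col, row) cells overlapped by the circle's bounding box."""
    min_col = int((cx - radius) // cell_size)
    max_col = int((cx + radius) // cell_size)
    min_row = int((cy - radius) // cell_size)
    max_row = int((cy + radius) // cell_size)
    return [(col, row)
            for row in range(min_row, max_row + 1)
            for col in range(min_col, max_col + 1)]


def insert(grid, obj, cx, cy, radius):
    """Store a reference to obj (not a copy) in every cell it overlaps."""
    for cell in cells_for_circle(cx, cy, radius):
        grid[cell].append(obj)


grid = defaultdict(list)
insert(grid, "small", 48, 48, 32)    # diameter 64: fits in one cell
insert(grid, "big", 300, 300, 256)   # diameter 512: spans many cells
```

With this layout a collision pass only compares objects that share a cell, regardless of how large any of them is.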
This is an interesting problem. Maybe you can extend the node or the cell with tree-height information? If you have an object bigger than the smallest cell, nest it according to the tree height. That's what map applications like Google Maps or Bing Maps do.
Here is a link to a similar solution: http://www.gamedev.net/topic/588426-2d-quadtree-collision---variety-in-size. I was confusing the screen with the quadtree. You can check collisions with a simple recursion.
Oversearching
During the search, and starting with the largest objects first...
Test Object.Position.X against QuadTreeNode.Centre.X, and also
test Object.Position.Y against QuadTreeNode.Centre.Y;
... Then, by taking the Absolute value of the difference, treat the object as lying within a specific child node whenever the absolute value is NOT more than the radius of the object...
... that is, when some portion of the object intrudes into that quad : )
The same can be done with AABB (Axis Aligned Bounding Boxes)
The only real caveat here is that VERY large objects that cover most of the screen, will force a search of the entire tree. In these cases, a different approach may be called for.
Of course, this only takes care of the object that everything else is being tested against. To ensure that all the other large objects in the world are properly identified, you will need to alter your quadtree slightly...
Use Multiple Appearances
In this variation on the QuadTree we ONLY place objects in the leaf nodes of the QuadTree, as pointers. Larger objects may appear in multiple leaf nodes.
Since some objects have multiple appearances in the tree, we need a way to avoid them once they've already been tested against.
So...
A simple Boolean WasHit flag can avoid testing the same object multiple times in a hit-test pass... and a 'cleanup' can be run on all 'hit' objects so that they are ready for the next test.
Whilst this makes sense, it is wasteful if performing all-vs-all hit-tests
So... Getting a little cleverer, we can avoid having any cleanup at all by using a Pointer 'ptrLastObjectTestedAgainst' inside of each object in the scene. This avoids re-testing the same objects on this run (the pointer is set after the first encounter)
It does not require resetting when testing a new object against the scene (the new object has a different pointer value than the last one). This avoids the need to reset the pointer as you would with a simple Bool flag.
I've used the latter approach in scenes with vastly different object sizes and it worked well.
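Roughly, the pointer trick could look like the Python sketch below. The class and function names are made up; `last_tested_against` plays the role of ptrLastObjectTestedAgainst, and the leaves are plain lists standing in for the quad tree's leaf nodes.

```python
class SceneObject:
    def __init__(self, name):
        self.name = name
        # Plays the role of ptrLastObjectTestedAgainst.
        self.last_tested_against = None


def candidates(query, leaves):
    """Yield each object in `leaves` once per query, even if it sits in several leaves."""
    for leaf in leaves:
        for other in leaf:
            if other is query or other.last_tested_against is query:
                continue  # the query itself, or already seen during this query
            other.last_tested_against = query
            yield other


big, small = SceneObject("big"), SceneObject("small")
q1, q2 = SceneObject("q1"), SceneObject("q2")
leaves = [[big, small], [big], [big, q1]]   # `big` appears in three leaves

first = [o.name for o in candidates(q1, leaves)]
second = [o.name for o in candidates(q2, leaves)]  # no cleanup pass in between
```

One thing to watch in this sketch: if the same object is used as the query twice in a row (for example on the next frame), the old marks are still in place, so something like a per-pass counter or a one-off reset would be needed in that case.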
Elastic QuadTrees
I've also used an 'elastic' QuadTree. Basically, you set a limit on how many items can IDEALLY fit in each QuadTreeNode - But, unlike a standard QuadTree, you allow the code to override this limit in specific cases.
The overriding rule here is that an object may NOT be placed into a Node that cannot hold it ENTIRELY... with the top node catching any objects that are larger than the screen.
Thus, small objects will continue to 'fall through' to form a regular QuadTree but large objects will not always fall all the way through to the leaf node - but will instead expand the node that last fitted them.
Think of the non-leaf nodes as 'sieving' the objects as they fall down the tree
This turns out to be a very efficient choice for many scenarios : )
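A minimal Python sketch of how such an 'elastic' node might behave, under my own assumptions: objects are axis-aligned boxes given as (name, left, top, right, bottom), nodes are square, and a node splits once it exceeds its ideal capacity. Class and parameter names are hypothetical, not taken from any particular library.

```python
class ElasticNode:
    def __init__(self, x, y, size, capacity=4, min_size=16):
        self.x, self.y, self.size = x, y, size
        self.capacity = capacity   # IDEAL item count; may be exceeded
        self.min_size = min_size   # stop splitting below this size
        self.items = []
        self.children = None

    def contains(self, box):
        _, left, top, right, bottom = box
        return (self.x <= left and self.y <= top and
                right <= self.x + self.size and bottom <= self.y + self.size)

    def _child_for(self, box):
        """The child that holds the box ENTIRELY, or None."""
        for child in self.children:
            if child.contains(box):
                return child
        return None

    def insert(self, box):
        if self.children is not None:
            child = self._child_for(box)
            if child is not None:
                child.insert(box)        # small objects keep falling through
            else:
                self.items.append(box)   # too large: this node stretches to keep it
            return
        self.items.append(box)
        if len(self.items) > self.capacity and self.size > self.min_size:
            self._split()

    def _split(self):
        half = self.size / 2
        self.children = [ElasticNode(self.x + dx * half, self.y + dy * half,
                                     half, self.capacity, self.min_size)
                         for dy in (0, 1) for dx in (0, 1)]
        old, self.items = self.items, []
        for box in old:
            self.insert(box)             # re-sieve the existing items


root = ElasticNode(0, 0, 512, capacity=2)
for box in [("a", 10, 10, 20, 20), ("b", 300, 10, 310, 20), ("c", 10, 300, 20, 310),
            ("big", 100, 100, 400, 400), ("huge", -100, -100, 900, 900),
            ("d", 30, 30, 40, 40)]:
    root.insert(box)
```

Here "big" straddles the children and "huge" is larger than the root itself, so both stay in the top node, while the small boxes fall through to the quadrants that fully contain them.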
Conclusion
Remember that these standard algorithms are useful general tools, but they are not a substitute for thinking about your specific problem. Do not fall into the trap of using a specific algorithm or library 'just because it is well known' ... your application is unique, and it may benefit from a slightly different approach.
Therefore, don't just learn to apply algorithms ... learn from those algorithms, and apply the principles themselves in novel and fitting ways. These are NOT the only tools, nor are they necessarily the best fit for your application.
Hope some of those ideas helped.

Resources