"Tower of Hanoi" recursion not as simple as it seems - recursion

I was asked to implement the "Tower of Hanoi" recursion, and my mind worked this logic:
Move 1 disk from start onto buffer
Move the rest from start onto goal
Move 1 disk from buffer onto goal
Every test passed except for "Towers_of_hanoi(3, start, goal)"
Turns out the correct implementation is:
Move n - 1 disks from start onto buffer
Move 1 from start onto goal
Move the n - 1 disk from buffer onto goal
I have learned to think about recursion as a way solving a problem in terms of simpler ones, and in my mind, both implementations do that. What is the fundamental flaw in my thinking vs the correct way of breaking it down?

The thing about Towers of Hanoi is that you can only have 3 stacks, and you can't put larger disks on top of smaller disks, so you need to make sure that the stack you're using as 'buffer' does not contain any smaller disks before you make the recursive call. In your case, you move the smallest disk to the buffer and then attempt to recurse to move the rest of the stack, but that will fail as the buffer is not usable.

Related

How do I generate a waypoint map in a 2D platformer without expensive jump simulations?

I'm working on a game (using Game Maker: Studio Professional v1.99.355) that needs to have both user-modifiable level geometry and AI pathfinding based on platformer physics. Because of this, I need a way to dynamically figure out which platforms can be reached from which other platforms in order to build a node graph I can feed to A*.
My current approach is, more or less, this:
For each platform consider each other platform in the level.
For each of those platforms, if it is obviously unreachable (due to being higher than the maximum jump height, for example) do not form a link and move on to next platform.
If a link seems possible, place an ai_character instance on the starting platform and (within the current step event) simulate a jump attempt.
3.a Repeat this jump attempt for each possible starting position on the starting platform.
If this attempt is successful, record the data necessary to replicate it in real time and move on to the next platform.
If not, do not form a link.
Repeat for all platforms.
This approach works, more or less, and produces a link structure that when visualised looks like this:
linked platforms (Hyperlink because no rep.)
In this example the mostly-concealed pink ghost in the lower right corner is trying to reach the black and white box. The light blue rectangles are just there to highlight where recognised platforms are, the actual platforms are the rows of grey boxes. Link lines are green at the origin and red at the destination.
The huge, glaring problem with this approach is that for a level of only 17 platforms (as shown above) it takes over a second to generate the node graph. The reason for this is obvious, the yellow text in the screen centre shows us how long it took to build the graph: over 24,000(!) simulated frames, each with attendant collision checks against every block - I literally just run the character's step event in a while loop so everything it would normally do to handle platformer movement in a frame it now does 24,000 times.
This is, clearly, unacceptable. If it scales this badly at a mere 17 platforms then it'll be a joke at the hundreds I need to support. Heck, at this geometric time cost it might take years.
In an effort to speed things up, I've focused on the other important debugging number, the tests counter: 239. If I simply tried every possible combination of starting and destination platforms, I would need to run 17 * 16 = 272 tests. By figuring out various ways to predict whether a jump is impossible I have managed to lower the number of expensive tests run by a whopping 33 (12%!). However the more exceptions and special cases I add to the code the more convinced I am that the actual problem is in the jump simulation code, which brings me at long last to my question:
How would you determine, with complete reliability, whether it is possible for a character to jump from one platform to another, preferably without needing to simulate the whole jump?
My specific platform physics:
Jumps are fixed height, unless you hit a ceiling.
Horizontal movement has no acceleration or inertia.
Horizontal air control is allowed.
Further info:
I found this video, which describes a similar problem but which doesn't provide a good solution. This is literally the only resource I've found.
You could limit the amount of comparisons by only comparing nearby platforms. I would probably only check the horizontal distance between platforms, and if it is wider than the longest jump possible, then don't bother checking for a link between those two. But you might have done this since you checked for the max height of a jump.
I glanced at the video and it gave me an idea. Instead of looking at all platforms to find which jumps are impossible, what if you did the opposite? Try placing an AI character on all platforms and see which other platforms they can reach. That's certainly easier to implement if your enemies can't change direction in midair though. Oh well, brainstorming is the key to finding something.
Several ideas you could try out:
Limit the amount of comparisons you need to make by using a spatial data structure, like a quad tree. This would allow you to severely limit how many platforms you're even trying to check. This is mostly the same as what you're currently doing, but a bit more generic.
Try to pre-compute some jump trajectories ahead of time. This will not catch all use cases that you have - as you allow for full horizontal control - but might allow you to catch some common cases more quickly
Consider some kind of walkability grid instead of a link generation scheme. When geometry is modified, compute which parts of the level are walkable and which are not, with some resolution (something similar to the dimensions of your agent might be good starting point). You could also filter them with a height, so that grid tiles that are higher than your jump height, and you can't drop from a higher place on to them, are marked as unwalkable. Then, when you compute your pathfinding, as part of your pathfinding step you can compute when you start a jump, if a path is actually executable ('start a jump, I can go vertically no more than 5 tiles, and after the peak of the jump, i always fall down vertically with some speed).

Memory transfer between work items and global memory in OpenCL?

I have some queries regarding how data transfer happens between work items and global memory. Let us consider the following highly inefficient memory bound kernel.
__kernel void reduceURatios(__global myreal *coef, __global myreal *row, myreal ratio)
{
size_t gid = get_global_id(0);//line no 1
myreal pCoef = coef[gid];//line no 2
myreal pRow = row[gid];//line no 3
pCoef = pCoef - (pRow * ratio);//line no 4
coef[gid] = pCoef;//line no 5
}
Do all work items in a work group begin executing line no 1 at the
same time?
Do all work items in a work group begin executing line no 2 at the
same time?
Suppose different work items in a work group finish executing line
no 4 at different times. Do the early finished ones wait so that,
all work items transfer the data to global memory at the same time
in line no 5?
Do all work items exit the compute unit simultaneously such that
early finished work items have to wait until all work items have
finished executing?
Suppose each kernel has to perform 2 reads from global memory. Is it
better to execute these statements one after the other or is it
better to execute some computation statements between the 2 read
executions?
The above shown kernel is memory bound for GPU. Is there any way by
which performance can be improved?
Are there any general guidelines to avoid memory bounds?
Find my answers below: (thanks sharpneli for the good comment of AMD GPUs and warps)
Normally YES. But depends on the hardware. You can't directly expect that behavior and design your algorithm on this "ordered execution". That's why barriers and mem_fences exists. For example, some GPU execute in order only a sub-set of the WG's WI. In CPU it is even possible that they run completely free of order.
Same as answer 1.
As in the answer 1, they will really unlikely finish at different times, so YES. However you have to bear in mind that this is a good feature, since 1 big write to memory is more efficient than a lot of small writes.
Typically YES (see answer 1 as well)
It is better to intercalate the reads with operations, but the compiler will already account for this and reorder the operation order to hide the latency of reading/writting effects. Of course the compiler will never move around code that can change the result value. Unless you disable manually the compiler optimizations this is a typical behavior of OpenCL compilers.
NO, it can't be improved in any way from the kernel point of view.
The general rule is, each memory cell of the input is used by more than one WI?
NO (1 global->1 private) (this is the case of your kernel in the question)
Then that memory is global->private, and there is no way to improve it, don't use local memory since it will be a waste of time.
YES (1 global-> X private)
Try to move the global memory lo local memory first, then read directly from local to private for each WI. Depending on the reuse amount (maybe only 2 WIs use the same global data) it may not even be worth if the computation amount is already high. You have to consider the tradeoff between extra memory usage and global access gain. For image procesing it is typically a good idea, for other types of processes not so much.
NOTE: The same process applies if you try to write to global memory. It is always better to operate in local memory by many WI before writing to global. But if each WI writes to an unique address in global, then write directly.

How to Implement embarrassingly parallel task (FOR loop) WITHOUT MPI-IO?

Preamble:
I have a very large array (one dim) and need to solve evolution equation (wave-like eq). I I need to calculate integral at each value of this array, to store the resulting array of integral and apply integration again to this array, and so on (in simple words, I apply integral on grid of values, store this new grid, apply integration again and so on).
I used MPI-IO to spread over all nodes: there is a shared .dat file on my disc, each MPI copy reads this file (as a source for integration), performs integration and writes again to this shared file. This procedure repeats again and again. It works fine. The most time consuming part was the integration and file reading-writing was negligible.
Current problem:
Now I moved to 1024 (16x64 CPU) HPC cluster and now I'm facing an opposite problem: a calculation time is NEGLIGIBLE to read-write process!!!
I tried to reduce a number of MPI processes: I use only 16 MPI process (to spread over the nodes) + 64 threads with OpenMP to parallelize my computation inside of each node.
Again, reading and writing processes is the most time consuming part now.
Question
How should I modify my program, in order to utilize the full power of 1024 CPUs with minimal loss?
The important point, is that I cannot move to the next step without completing the entire 1D array.
My thoughts:
Instead of reading-writing, I can ask my rank=0 (master rank) to send-receive the entire array to all nodes (MPI_Bcast). So, instead of each node will I/O, only one node will do it.
Thanks in advance!!!
I would look here and here. FORTRAN code for the second site is here and C code is here.
The idea is that you don't give the entire array to each processor. You give each processor only the piece it works on, with some overlap between processors so they can handle their mutual boundaries.
Also, you are right to save your computation to disk every so often. And I like MPI-IO for that. I think it is the way to go. But the codes in the links will allow you to run without reading every time. And, for my money, writing out the data every single time is overkill.

OpenCL - Multiple GPU Buffer Synchronization

I have an OpenCL kernel that calculates total force on a particle exerted by other particles in the system, and then another one that integrates the particle position/velocity. I would like to parallelize these kernels across multiple GPUs, basically assigning some amount of particles to each GPU. However, I have to run this kernel multiple times, and the result from each GPU is used on every other. Let me explain that a little further:
Say you have particle 0 on GPU 0, and particle 1 on GPU 1. The force on particle 0 is changed, as is the force on particle 1, and then their positions and velocities are changed accordingly by the integrator. Then, these new positions need to be placed on each GPU (both GPUs need to know where both particle 0 and particle 1 are) and these new positions are used to calculate the forces on each particle in the next step, which is used by the integrator, whose results are used to calculate forces, etc, etc. Essentially, all the buffers need to contain the same information by the time the force calculations roll around.
So, the question is: What is the best way to synchronize buffers across GPUs, given that each GPU has a different buffer? They cannot have a single shared buffer if I want to keep parallelism, as per my last question (though, if there is a way to create a shared buffer and still keep multiple GPUs, I'm all for that). I suspect that copying the results each step will cause more slowdown than it's worth to parallelize the algorithm across GPUs.
I did find this thread, but the answer was not very definitive and applied only to a single buffer across all GPUs. I would like to know, specifically, for Nvidia GPUs (more specifically, the Tesla M2090).
EDIT: Actually, as per this thread on the Khronos forums, a representative from the OpenCL working group says that a single buffer on a shared context does indeed get spread across multiple GPUs, with each one making sure that it has the latest info in memory. However, I'm not seeing that behavior on Nvidia GPUs; when I use watch -n .5 nvidia-smi while my program is running in the background, I see one GPU's memory usage go up for a while, and then go down while another GPU's memory usage goes up. Is there anyone out there that can point me in the right direction with this? Maybe it's just their implementation?
It sounds like you are having implementation trouble.
There's a great presentation from SIGGRAPH that shows a few different ways to utilize multiple GPUs with shared memory. The slides are here.
I imagine that, in your current setup, you have a single context containing multiple devices with multiple command queues. This is probably the right way to go, for what you're doing.
Appendix A of the OpenCL 1.2 specification says that:
OpenCL memory objects, [...] are created using a context and can be shared across multiple command-queues created using the same context.
Further:
The application needs to implement appropriate synchronization across threads on the host processor to ensure that the changes to the state of a shared object [...] happen in the correct order [...] when multiple command-queues in multiple threads are making changes to the state of a shared object.
So it would seem to me that your kernel that calculates particle position and velocity needs to depend on your kernel that calculates the inter-particle forces. It sounds like you already know that.
To put things more in terms of your question:
What is the best way to synchronize buffers across GPUs, given that each GPU has a different buffer?
... I think the answer is "don't have the buffers be separate." Use the same cl_mem object between two devices by having that cl_mem object come from the same context.
As for where the data actually lives... as you pointed out, that's implementation-defined (at least as far as I can tell from the spec). You probably shouldn't worry about where the data is living, and just access the data from both command queues.
I realize this could create some serious performance concerns. Implementations will likely evolve and get better, so if you write your code according to the spec now, it'll probably run better in the future.
Another thing you could try in order to get a better (or a least different) buffer-sharing behavior would be to make the particle data a map.
If it's any help, our setup (a bunch of nodes with dual C2070s) seem to share buffers fairly optimally. Sometimes, the data is kept on only one device, other times it might have the data exist in both places.
All in all, I think the answer here is to do it in the best way the spec provides and hope for the best in terms of implementation.
I hope I was helpful,
Ryan

Backtracking maze algorithm doesn't seem to be going all the way back

Basically, I have this: First, a bunch of code generates a maze that is non-traversable. It randomly sets walls in certain spaces of a 2D array based on a few parameters. Then I have a backtracking algorithm go through it to knock out walls until the whole thing is traversable. The thing is, the program doesn't seem to be going all the way back in the stack.
It's pretty standard backtracking code. The algorithm starts at a random location, then proceeds thus in pseudocode:
move(x, y){
if you can go up and haven't been there already:
move (x, y - 1)
if you can go right and haven't been there already:
move (x + 1, y)
...
}
And so on for the other directions. Every time you move, two separate 2D arrays of booleans (one temporary, one permanent) are set at the coordinates to show that you've been in a certain element. Once it can't go any further, it checks the permanent 2D array to see if it has been everywhere. If not, it randomly picks a wall that borders between a visited and non visited space (according to the temporary array) and removes it. This whole thing is invoked in a while loop, so once it's traversed a chunk of the maze, the temporary 2D array is reset while the other is kept and it traverses again at another random location until the permanent 2D array shows that the whole maze has been traversed. The check in the move method is compared against the temporary 2D array, not the permanent one.
This almost works, but I kept finding a few unreachable areas in the final generated maze. Otherwise it's doing a wonderful job of generating a maze just the way I want it to. The thing is, I'm finding that the reason for this is that it's not going all the way back in the stack.
If I change it to check the temporary 2D array for completion instead of the permanent one (thus making it do one full traversal in a single run to mark it complete instead of doing a full run across multiple iterations), it will go on and on and on. I have to set a counter to break it. The result is a "maze" with far, far too many walls removed. Checking the route the algorithm takes, I find that it has not been properly backtracking and has not gone back in the stack nearly far enough in the stack and often just gets stuck on a single element for dozens of recursions before declaring itself finished for no reason at all and removing a wall that had zero need to be removed.
I've tried running the earlier one twice, but it keeps knocking out walls that don't need to be knocked out and making the maze too sparse. I have no idea why the heck this is happening.
I've had a similar problem when trying to make a method for creating a labyrinth.
The important thing when making mazes is to try to NOT create isolated "islands" of connected rooms in the maze. Here's my solution in pseudocode
Room r=randomRoom();
while(r!=null){
recursivelyDigNewDoors(r);
r=null;
for(i=0;i<rooms.count;i++){
if(rooms[i].doors.length == 0 && rooms[i].hasNeighborWithDoor() ){
//if there is a room with no doors and that has a neighbor with doors
Create a door between the doorless room and the one
connected to the rest of your maze
r=rooms[i];
}
}
}
where recursivelyDigNewDoors is a lot like your move() function
In reality, you might like to
describe a "door" as a lack of a wall
and a room without doors as a room with four walls
But the general principle is:
Start your recursive algorithm somewhere
When the algorithm stops: find a place where there's 1 unvisited square and 1 visited.
Link those two together and continue from the previously unvisited square
When no two squares fulfill (2) you're done and all squares are connected

Resources