Detecting if a Program is in an Infinite Loop (Read: Solving the Halting Problem)

Is detecting whether a deterministic program (i.e. state machine) is in an infinite loop equivalent to solving the halting problem?
I came up with a solution, and I'm not sure why it shouldn't work:
Let the program run
When you think it's in an infinite loop, take a snapshot of its memory at regular intervals
If you ever detect the same snapshot, the program is in an infinite loop
As long as you don't get the same snapshot twice, it's either (1) not in an infinite loop, or (2) you need to take snapshots more quickly (perhaps once on every memory access?)
I'm assuming this doesn't work... but why?
It seems like a perfectly reasonable way to detect if a program is in an infinite loop (especially if you store hashes rather than the memory itself, although then it won't be 100% accurate)... what's wrong with it, if anything?

In theory, it is not equivalent to the halting problem, because real computers have a finite number of possible states (even though that number is huge). Turing machines, which the halting problem applies to, have infinite storage.
But let's explore your idea further. You also have to take a snapshot of the "hidden" state - the CPU's program counter and other registers - and you must take a snapshot before every single instruction. (The program is in an infinite loop only if the memory snapshot is the same AND the same instruction is about to be executed. It doesn't help if the memory contents are the same but a different instruction is about to be executed than the last time you saw that snapshot.)
In practice, even a very small computer has such a huge number of potential states that you'd never be able to store all your snapshots (not even hashes of them!). For example, even a machine as small as the ancient Commodore 64 with 64 kB of RAM has 256^65536 potential memory states (not including the 5 CPU registers). Tracking cycles that are potentially that long is absolutely infeasible, both in time and space.
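To make the snapshot idea concrete, here is a minimal sketch in C for a toy deterministic machine whose whole state (including the "hidden" program counter) fits in one small struct. The machine, its update rule, and every name here are invented for illustration; the approach only works because the toy state space is tiny, which is exactly the point above:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* The entire state of a made-up deterministic machine. */
typedef struct {
    uint32_t pc;    /* "program counter" (the hidden state) */
    uint32_t reg;   /* a single "register"/memory cell      */
} State;

/* One deterministic step: the next state depends only on the current one,
   so seeing the same state twice proves the machine loops forever. */
static State step(State s) {
    s.reg = (s.reg * 5 + 1) % 16;   /* arbitrary update rule */
    s.pc  = (s.pc + 1) % 4;
    return s;
}

int main(void) {
    State seen[1024];   /* every snapshot so far: fine for at most 64
                           possible states, hopeless for 256^65536 */
    size_t n = 0;
    State s = { 0, 7 };

    for (;;) {
        for (size_t i = 0; i < n; i++) {
            if (memcmp(&seen[i], &s, sizeof s) == 0) {
                printf("state repeated after %zu steps: infinite loop\n", n);
                return 0;
            }
        }
        seen[n++] = s;
        s = step(s);
    }
}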

The solution wouldn't work even in principle: a Turing machine doesn't ever have to be in precisely the same state (with the tape in the same configuration) to get into an infinite loop.
Your algorithm might work for context-sensitive languages and linear-bounded automata, but if you can't know how much tape a TM is going to need, you'd never know if you had an infinite loop or were about to hit the top. Note that your method would clearly work for real computers for a variety of reasons... chief among them being that your computer is less powerful than a (big) finite automaton.

Related

Why does the memory in my PC decrease when it almost reaches the limit?

I am running a piece of code in R. It's parallelized, running on 8 cores. Interestingly enough, when my memory usage reaches 15-and-something GB, it drops to 10 GB (my max memory is 16 GB). I am curious about what is actually happening in the background. In the end, I get the complete data from all 8 cores, so I assume that no data gets lost. Does the PC store it somewhere on the SSD to free memory?
For more information: I loop over time series data and perform a lot of calculations, which I store in multiple vectors. When the code finishes looping, it stores all the previous vectors in a list.
While the code is running, if I start opening many Chrome tabs, which require a lot of memory, the code may take longer to run but still retrieves all the data (though it sometimes crashes).
I'm very curious about what is happening.
It's impossible to say without the specific code, but most likely it's due to R's garbage collection running only when necessary, i.e. when more memory needs to be allocated. Unlike other languages like Python, R does not immediately garbage-collect objects when they go out of scope, and in particular, if the R objects have an underlying pointer to a C/C++ object, garbage collection can be held off until well after the object is unreachable.
If this variable memory usage is a problem, you can try adding explicit calls to gc() at key points in your code.
Yes, you are right: the PC sometimes uses the hard disk as memory. This is known as swap memory. When your RAM gets overloaded, the system sends some of the data to the hard disk and stores it there temporarily.

Optimization of parallel programming

I want to use MPI to make my program parallel, and I want to send something to other computers. I want to know which is better: sending one huge buffer a single time, or sending smaller messages several times at different points during the execution instead of all at once?
It's almost always going to be faster to send the one big message than many smaller ones. Each time you do a Send/Receive pair, the two processes have to go through the entire process of sending a message to each other, including at least 6 roundtrip messages. If you are just sending one larger message, there is a minimum of 2 roundtrip messages. Each of those messages can be very expensive (compared to doing things locally, like packing all of your data into one buffer).
I'd encourage you to try it out both ways though to be sure that this applies to your application. It could be different if you're doing something unexpected.
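As a tiny illustration of the "pack it into one buffer" point (the function names are made up, and MPI is assumed to be initialized elsewhere):

#include <mpi.h>

/* Instead of three separate Send/Recv pairs, pack the values into one
   buffer locally (cheap) and pay the message overhead only once. */
void send_packed(int a, int b, int c, int dest) {
    int buf[3] = { a, b, c };
    MPI_Send(buf, 3, MPI_INT, dest, 0, MPI_COMM_WORLD);
}

void recv_packed(int *a, int *b, int *c, int src) {
    int buf[3];
    MPI_Recv(buf, 3, MPI_INT, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    *a = buf[0]; *b = buf[1]; *c = buf[2];
}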
Depending on your problem, sending all the data at once may be more efficient, because otherwise the nodes have to be synced every time, and that may cause a delay.
I always try to send as much data as I can in a single MPI call. In my experience, sending many small bits of data greatly increases the overhead and network traffic, and I have even run into problems where I overwhelmed the computers' ability to keep up with the number of requests, because I was sending a large member of a complicated class one integer at a time to many workers. Therefore, when possible, send all the data at once, unless you have some reason to believe it is too large.
Further, I strive to use 100% of all the CPUs my program claims. When you are working on shared resources, if you claim a CPU, you need to actually use it. Otherwise, someone else who wants to use that core, or node, is blocked while your program sits and does nothing. For example, on a Cray I have used, even if you ask for only two 'cores', the manager will reserve a full bank of 24 cores, essentially wasting 22. Or perhaps one worker has nothing to do while another chugs away -- again, wasting time. Hopefully there is a way to balance the load, so to speak, to avoid unintentional waste of resources.
Back to the topic at hand: demonstrate the timing and efficiency of vector sending to yourself -- write a program which breaks up the vector into packets of varying sizes and does the sends/receives. Test it with varying numbers of workers, and on several different configurations of computers, if you can. Before writing production code, do a proof of concept and whatever optimization you can. Test and time it!
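Along those lines, here is a rough sketch of such a timing experiment (it assumes exactly 2 ranks, omits error handling, and the constants are arbitrary):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 20)   /* total ints to transfer */

/* Send the same N ints from rank 0 to rank 1 in chunks of 'chunk' ints,
   and return the elapsed wall-clock time on this rank. */
static double run(int *data, int rank, int chunk) {
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int off = 0; off < N; off += chunk) {
        if (rank == 0)
            MPI_Send(data + off, chunk, MPI_INT, 1, 0, MPI_COMM_WORLD);
        else
            MPI_Recv(data + off, chunk, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }
    return MPI_Wtime() - t0;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int *data = calloc(N, sizeof *data);
    /* One big message, then progressively smaller chunks. */
    for (int chunk = N; chunk >= 4096; chunk /= 16) {
        double t = run(data, rank, chunk);
        if (rank == 0)
            printf("%7d messages of %8d ints: %.4f s\n", N / chunk, chunk, t);
    }

    free(data);
    MPI_Finalize();
    return 0;
}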

What happens when I divide by zero?

Now, I'm not asking about the mathematical answer to dividing by zero; I want to know what my computer does when it tries to divide by zero.
I can see different answers depending on what level we look at.
Looking at a high level, I expect the language specification may just say "hey, you can't do that: throw an error".
Looking at an assembly level, will the CPU try to call the divide instruction when we try to divide by 0?
If it does, that'll take us to the machine code level. What happens now?
Now, if that doesn't happen and we force it to happen, what would the result be?
I think what you're asking is what's going to happen if we run the binary division algorithm with 0 in the denominator.
The algorithm will go into an infinite loop, and the quotient will grow larger and larger until it exhausts all available memory.
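To see why, here is a sketch of the simplest variant, division by repeated subtraction (not the exact shift-and-subtract binary algorithm, but it fails for the same reason): with a zero divisor, the loop condition never becomes false.

#include <stdio.h>

/* Naive division by repeated subtraction, purely for illustration. */
unsigned long divide(unsigned long dividend, unsigned long divisor) {
    unsigned long quotient = 0, remainder = dividend;
    while (remainder >= divisor) {  /* always true when divisor == 0 */
        remainder -= divisor;       /* subtracting 0 changes nothing */
        quotient++;                 /* ...so this just keeps counting */
    }
    return quotient;
}

int main(void) {
    printf("%lu\n", divide(10, 2));  /* prints 5 */
    /* divide(10, 0) never returns; with fixed-width C integers the quotient
       wraps around instead of exhausting memory, but with arbitrary-precision
       arithmetic it really would grow until memory runs out. */
    return 0;
}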
Looking at an assembly level, will the CPU try to call the divide instruction when we try to divide by 0?
Yep, the instruction gets fetched and decoded. What happens then? The divisor is found to be zero, and the processor stops what it's doing: it raises some kind of exception, the pipeline (if there is one) is flushed, and most likely control jumps to some predefined error-handling code - often the OS controls a machine-level jump table called the interrupt vector table (or this could be a table separate from the interrupt vector table).
There are many architectures, however, and things like error handling vary. Intel x86 follows the above procedure, at least.
If it does, that'll take us to the machine code level. What happens now?
I have no idea what that means. From the CPU's perspective, it is all the machine code level.
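For a concrete look at that error-handling path from user space: on x86 under Linux or macOS, the CPU's divide exception (#DE) ends up delivered to the offending process as SIGFPE, which you can observe with a sketch like this (returning from the handler would re-run the faulting instruction, so it just exits):

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void on_fpe(int sig) {
    (void)sig;
    /* printf is not async-signal-safe, so use write() and bail out. */
    const char msg[] = "caught SIGFPE: integer divide by zero\n";
    write(STDERR_FILENO, msg, sizeof msg - 1);
    _exit(1);
}

int main(void) {
    signal(SIGFPE, on_fpe);

    volatile int zero = 0;   /* volatile: force a real divide at run time  */
    int x = 1 / zero;        /* CPU raises #DE; the OS turns it into SIGFPE */
    (void)x;

    puts("this line is never reached");
    return 0;
}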

Barriers in OpenCL

In OpenCL, my understanding is that you can use the barrier() function to synchronize threads in a work group. I do (generally) understand what they are for and when to use them. I'm also aware that all threads in a work group must hit the barrier, otherwise there are problems. However, every time I've tried to use barriers so far, it seems to result in either my video driver crashing, or an error message about accessing invalid memory of some sort. I've seen this on 2 different video cards so far (1 ATI, 1 NVIDIA).
So, my questions are:
Any idea why this would happen?
What is the difference between barrier(CLK_LOCAL_MEM_FENCE) and barrier(CLK_GLOBAL_MEM_FENCE)? I read the documentation, but it wasn't clear to me.
Is there a general rule about when to use barrier(CLK_LOCAL_MEM_FENCE) vs. barrier(CLK_GLOBAL_MEM_FENCE)?
Is there ever a time that calling barrier() with the wrong parameter type could cause an error?
As you have stated, barriers may only synchronize threads in the same workgroup. There is no way to synchronize different workgroups in a kernel.
Now to answer your question, the specification was not clear to me either, but it seems to me that section 6.11.9 contains the answer:
CLK_LOCAL_MEM_FENCE – The barrier function will either flush any variables stored in local memory or queue a memory fence to ensure correct ordering of memory operations to local memory.
CLK_GLOBAL_MEM_FENCE – The barrier function will queue a memory fence to ensure correct ordering of memory operations to global memory. This can be useful when work-items, for example, write to buffer or image memory objects and then want to read the updated data.
So, to my understanding, you should use CLK_LOCAL_MEM_FENCE when writing to and reading from the __local memory space, and CLK_GLOBAL_MEM_FENCE when writing to and reading from the __global memory space.
I have not tested whether this is any slower, but most of the time, when I need a barrier and I have a doubt about which memory space is affected, I simply use a combination of the two, i.e.:
barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
This way you should not have any memory reading/writing ordering problem (as long as you are sure that every thread in the group goes through the barrier, but you are aware of that).
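For what it's worth, here is a small illustrative kernel (all names invented, and it assumes the work-group size is a power of two) showing the typical use of barrier(CLK_LOCAL_MEM_FENCE): every work-item writes to __local memory, the barrier makes those writes visible to the whole group, and only then does anyone read them back.

__kernel void group_sum(__global const float *in,
                        __global float *out,
                        __local float *scratch)
{
    size_t lid = get_local_id(0);
    size_t lsz = get_local_size(0);

    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);           /* all writes to __local done */

    /* Tree reduction within the work-group. */
    for (size_t stride = lsz / 2; stride > 0; stride /= 2) {
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);       /* note: outside the if, so every
                                               work-item reaches it */
    }

    if (lid == 0)
        out[get_group_id(0)] = scratch[0];
}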
Hope it helps.
Reviving an old-ish thread here. I have had a little bit of trouble with barrier() myself.
Regarding your crash problem, one potential cause could be that your barrier is inside a conditional. I read that when you use a barrier, ALL work items in the group must be able to reach that instruction, or it will hang your kernel - usually resulting in a crash. For example:
if (someCondition) {
    // do stuff
    barrier(CLK_LOCAL_MEM_FENCE);
    // more stuff
} else {
    // other stuff
}
My understanding is that if one or more work items satisfies someCondition, ALL work items must satisfy that condition, or there will be some that will skip the barrier. Barriers wait until ALL work items reach that point. To fix the above code, I need to restructure it a bit:
if (someCondition) {
    // do stuff
}
barrier(CLK_LOCAL_MEM_FENCE);
if (someCondition) {
    // more stuff
} else {
    // other stuff
}
Now all work items will reach the barrier.
I don't know to what extent this applies to loops; if a work item breaks from a for loop, does it hit barriers? I am unsure.
UPDATE: I have successfully crashed a few OpenCL programs with a barrier in a for loop. Make sure all work items execute the loop (and hence the barrier) the same number of times - or better yet, put the barrier outside the loop.
(source: Heterogeneous Computing with OpenCL, Chapter 5, pp. 90-91)
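A sketch of the pattern that has worked for me (the kernel and its arguments are made up): the loop bound is the same for every work-item in the group, so each one executes the barrier the same number of times.

__kernel void iterate(__global float *data, const int n)
{
    size_t gid = get_global_id(0);
    for (int i = 0; i < n; i++) {          /* uniform trip count for all work-items */
        data[gid] = data[gid] * 0.5f + 1.0f;
        barrier(CLK_GLOBAL_MEM_FENCE);     /* reached by everyone on every pass */
    }
}

Breaking out of the loop early for only some work-items (for example, based on get_local_id) would leave the others waiting at the barrier and hang the kernel.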

"random" kernel crash after running for minutes.... HEP!? -- [same question posted on Khronos]

I have a thoroughly complex kernel processing audio input data. It will run for a couple of minutes, 60 times a second, and then hang. That's on the GPU; on the CPU it will run for hours. The input data are constantly changing, but each variable is always within prescribed ranges. I have inserted test code before uploading the inputs to the kernel each frame; in this test code, I can force these inputs to be well below their valid input range, but it still will eventually crash. (Say the valid range for a particular input is 0->400; I can force it to 0->1 and it will STILL eventually crash. I can force it to be below 0.1 and it will still ultimately bite the dust.) However, if I force the input variables to zero, the GPU will happily dance for hours. Of course, that input-free dance is not so particularly interesting.
I'm at a loss so far, though I have clues. I can make it crash much faster than 2 minutes if an input variable is high in its approved range. I can make it crash in less than 10 seconds under the right circumstances. BUT, I can't seem to _back_off_of_ those certain circumstances such that they go away. As said above, I can force the input vars into ridiculously small portions of their valid range, and the kernel (let's call him Harlan Sanders) will eventually go belly-up. BUT, if they're forced to actual zero, no problems puppy, we can run all day long.
To repeat, I'm a bit at a loss - although I have things that look like clues, I have not yet figured out what they are hinting at, though I've been trying for a few days. Frankly, I do not expect to find a real solution by asking here; whenever I stumble over a problem in OpenCL, it seems that my fate is to be the first to articulate that particular problem. I guess this is part of the fun of being in on a technology during its infancy! BUT, I want to do some serious, sustainable work with this "baby" (or, maybe, "toddler").
Op details: MacBook Pro 2010, OS 10.6.8, nv 330M GPU, xcode 3.2.5, shorts, teeshirt.
bonus P.S. for those who've read this far, including a related question:
My laptop, soldier that it has proved to be, is not powerful enough for the next stage. I must sell some stocks/bonds and purchase a Mac Pro. I'm looking at the ATI 5870. So, PERHAPS my problem will simply go away when I compile the .cl for the ATI??? Maybe I have run into a bug in the nV implementation. Maybe my kernel is so complex that I'm running into undetected resource limits (it's 1300 lines of code). So, SINCE I run fine on the CPU, perhaps I'll have no bugs, or different bugs, on the ATI card???
Any thoughts?
Thanks, guys & dolls --
Dave
Use "cl_" data types on the CPU side, because maybe you are not coping data the right way, or it is not being understood by the GPU. This could lead to GPU hangs on invalid pointers while handing the data.
You should also try -Werror, and read the error output. You can be doing smt wrong.
Without any code, we can only guess. But I haven't found any bug in the actual OpenCL NV or ATI implementations.
Make sure you release all resources. Events returned by Enqueue functions must be released. This error sometimes occurs after accessing buffers out of range.
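On the "release your events" point, a sketch of what that looks like in the host code (a hypothetical helper; the queue, kernel, and sizes are assumed to be set up elsewhere, and error checks are minimal):

#include <CL/cl.h>   /* <OpenCL/opencl.h> on macOS */

cl_int run_one_frame(cl_command_queue queue, cl_kernel kernel,
                     size_t global_size, size_t local_size)
{
    cl_event ev;
    cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                        &global_size, &local_size,
                                        0, NULL, &ev);
    if (err != CL_SUCCESS)
        return err;

    err = clWaitForEvents(1, &ev);   /* or clFinish(queue) */
    clReleaseEvent(ev);              /* without this, every frame leaks an event */
    return err;
}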
