Asynchronous gradient descent in tensorflow under concurrent access - asynchronous

I have been having trouble implementing an asynchronous gradient descent in a multithreaded environment.
To describe the skeleton of my code, for each thread,
loop
synchronize with global param
< do some work / accumulate gradients on mini-batch >
apply gradient descent to the global network, specifically,
= self.optimizer.apply_gradients(grads_and_vars)
end
where each thread has its own optimizer.
Now the problem is that, in defining the optimizer with 'use_locking=False', it does Not work, evidenced by the rewards generated by my reinforcement learning agent.
However, when I set the 'use_locking=True', it works so the algorithm is correct; it's just that the local gradients are not applied properly to the global param.
So some possible reasons I thought of were the following:
1. While one thread is updating the global param, when another thread accesses the global param, the former thread cancels all remaining updates. And that too many threads access this global param concurrently, threads do all hard work for nothing.
2. Referring to, How does asynchronous training work in distributed Tensorflow?, reading asynchronously is certain fine in the top of the loop. However, it may be that as soon as the thread is done applying the gradient, it goes to synchronizing from the global param too quickly that it does not fetch the updates from other threads.
Can you, hopefully tensorflow developer, help me what is really happening with 'use_locking' for this specific loop instance?
I have been spending days on this simple example. Although setting use_locking = True does solve the issue, it is not asynchronous in nature and it is also very slow.
I appreciate your help.

Related

Asynchronous long running computation in elm

I am new to ELM and I am trying to wrap my head around asynchronous computations.
I am trying to write an ELM program to automate radiology report generation.
This program uses a graph data structure to store the state of a hierarchical tree of toggles, radio, and edit boxes that the user can edit (add, delete, move around) the toggles.
Individual toggles can also have their behavior modified so that toggling one ON, will show or hide other toggles.
As an example, when you toggled Chest ON, the lung would show, and the liver would be hidden. I do this by using the edges of the graph in the update function, which then is interpreted and rendered in the view function.
I then use a look-up table of sorts to compute the English phrase corresponding to a set of specific ON or OFF toggles, like liver, cyst, 3, mm, segment_4.
This works fine, but computing the English phrases is computationally expensive and it blocks my UI thread.
I have been playing with Tasks and Process.swap but I am misunderstanding them somehow.
I thought that I could change my ToggleOnOff message implementation so that I could spawn the long running implementation, return immediately my new state (i.e., (model, Cmd msg) ), and set the spawned processed to call the update again with the English text string when it finished with the computation. I have failed completely in implementing this.
I have a generateReport function which I turned into a Task using Task.succeed.
Than I tried to spawn this generateReportTask and chain it using Task.andThen to the msg UpdateReportFullText. And then I would need to return a new model imediately with the state of the toggles, before the long running generateReportTask runs and calls update on its own to add the english text to the model.
I believe I am still thinking about this imperatively and that I am misunderstanding FP and ELM. My experience with ELM has always shown me that the ELM way is always shorter and clearer than my clumsy imperative-style noob attempts.
Can someone please educate me and help a poor fellow see the light?
Thank you
(Sorry for the dry no code question but my program is hard to reproduce in a few lines...)
Long running computations are a bit tricky on the web platform, especially if you want them not to block the user interface.
In practice you have two options:
Use Web Workers. This is a technology that actually spins up a new process that can run entirely in parallel with your UI process. Unfortunately Web Workers are basically an entirely separate program that communicates with your main program via (mostly) serialised messages. This means that the computation you perform has to be accelerated more than the serialization/deserialization ends up costing.
In Elm, you would use a Platform.worker program, then add ports and wire things together in JavaScript. This article seems to describe this in some detail.
You can chunk up your computation into small bits and perform each small bit per frame. This can work even better in some cases where there are intermediate results that can be rendered, although this isn't necessary. You can see an example I wrote here (notice how it actually takes a few seconds to compute the layout of the graph, but it just looks like a neat little animation).

Is a preemptive multitasking OS possible on the interruptless DCPU-16?

I am looking into various OS designs in the hopes of writing a simple multitasking OS for the DCPU-16. However, everything I read about implementation of preemptive multitasking is centered around interrupts. It sounds like in the era of 16-bit hardware and software, cooperative multitasking was more common, but that requires every program to be written with multitasking in mind.
Is there any way to implement preemptive multitasking on an interruptless architecture? All I can think of is an interpreter which would dynamically switch tasks, but that would have a huge performance hit (possibly on the order of 10-20x+ if it had to parse every operation and didn't let anything run natively, I'm imagining).
Preemptive multitasking is normally implemented by having interrupt routines post status changes/interesting events to a scheduler, which decides which tasks to suspend, and which new tasks to start/continue based on priority. However, other interesting events can occur when a running task makes a call to an OS routine, which may have the same effect.
But all that matters is that some event is noted somewhere, and the scheduler decides who to run. So you can make all such event signalling/scheduling occur only only on OS calls.
You can add egregious calls to the scheduler at "convenient" points in various task application code to make your system switch more often. Whether it just switches, or uses some background information such as elapsed time since the last call is a scheduler detail.
Your system won't be as responsive as one driven by interrupts, but you've already given that up by choosing the CPU you did.
Actually, yes. The most effective method is to simply patch run-times in the loader. Kernel/daemon stuff can have custom patches for better responsiveness. Even better, if you have access to all the source, you can patch in the compiler.
The patch can consist of a distributed scheduler of sorts. Each program can be patched to have a very low-latency timer; on load, it will set the timer, and on each return from the scheduler, it will reset it. A simplistic method would allow code to simply do an
if (timer - start_timer) yield to scheduler;
which doesn't yield too big a performance hit. The main trouble is finding good points to pop them in. In between every function call is a start, and detecting loops and inserting them is primitive but effective if you really need to preempt responsively.
It's not perfect, but it'll work.
The main issue is making sure that the timer return is low latency; that way it is just a comparison and branch. Also, handling exceptions - errors in the code that cause, say, infinite loops - in some way. You can technically use a fairly simple hardware watchdog timer and assert a reset on the CPU without clearing any of the RAM; an in-RAM routine would be where RESET vector points, which would inspect and unwind the stack back to the program call (thus crashing the program but preserving everything else). It's sort of like a brute-force if-all-else-fails crash-the-program. Or you could POTENTIALLY change it to multi-task this way, RESET as an interrupt, but that is much more difficult.
So...yes. It's possible but complicated; using techniques from JIT compilers and dynamic translators (emulators use them).
This is a bit of a muddled explanation, I know, but I am very tired. If it's not clear enough I can come back and clear it up tomorrow.
By the way, asserting reset on a CPU mid-program sounds crazy, but it is a time-honored and proven technique. Early versions of Windows even did it to run compatibility mode on, I think 386's, properly, because there was no way to switch back to 32-bit from 16-bit mode. Other processors and OSes have done it too.
EDIT: So I did some research on what the DCPU is, haha. It's not a real CPU. I have no idea if you can assert reset in Notch's emulator, I would ask him. Handy technique, that is.
I think your assessment is correct. Preemptive multitasking occurs if the scheduler can interrupt (in the non-inflected, dictionary sense) a running task and switch to another autonomously. So there has to be some sort of actor that prompts the scheduler to action. If there are no interrupting devices (in the inflected, technical sense) then there's little you can do in general.
However, rather than switching to a full interpreter, one idea that occurs is just dynamically reprogramming supplied program code. So before entry into a process, the scheduler knows full process state, including what program counter value it's going to enter at. It can then scan forward from there, substituting, say, either the twentieth instruction code or the next jump instruction code that isn't immediately at the program counter with a jump back into the scheduler. When the process returns, the scheduler puts the original instruction back in. If it's a jump (conditional or otherwise) then it also effects the jump appropriately.
Of course, this scheme works only if the program code doesn't dynamically modify itself. And in that case you can preprocess it so that you know in advance where jumps are without a linear search. You could technically allow well-written self-modifying code if it were willing to nominate all addresses that may be modified, allowing you definitely to avoid those in your scheduler's dynamic modifications.
You'd end up sort of running an interpreter, but only for jumps.
another way is to keep to small tasks based on an event queue (like current GUI apps)
this is also cooperative but has the effect of not needing OS calls you just return from the task and then it will go on to the next task
if you then need to continue a task you need to pass the next "function" and a pointer to the data you need to the task queue

How to implement a callback/producer/consumers scheme?

I'm warming up with Clojure and started to write a few simple functions.
I'm realizing how the language is clearly well-suited for parallel computation and this got me thinking. I've got an app (written in Java but whatever) that works the following way:
one thread waits for input to come in (filesystem in my case but it could be network or whatever) and puts that input once it arrives on a queue
several consumers fetch data from that queue and process the data in parallel
The code that puts the input to be parallely processed may look like this (it's just an example):
asynchFetchInput( new MyCallBack() {
public void handle( Input input ) {
queue.put(input)
}
})
Where asynchFetchInput would spawn a Thread and then call the callback.
It's really just an example but if someone could explain how to do something similar using Clojure it would greatly help me understand the "bigger picture".
If you have to transform data, you can make it into a seq, then you can feed it to either map or pmap The latter will process it in parallel. filter and reduce are also both really useful; so you might want to see if you can express your logic in those terms.
You might also want to look into the concurrency utilities in basic Java rather than spawning your own threads.

asp.net - How does parallel.for works

How does parallel.for works. Does it invoke threads for each loop/ divide the loops in parts and execute them in parallel? If it does then can we ensure the same result as normal for loop? I tested for performance and it really uses multi core processors. But i want to know the internal working as to how this works
Parallel.For partitions the work for a number of concurrent iterations. Per default it uses the default task scheduler to schedule the iterations, which essentially uses the current thread as well as a number of thread pool threads. There are overloads that will allow you to change this behavior.
A parallel loop may look very similar to a regular loop, but there are actually a number of important differences. First of all the order of the iterations is not guaranteed. I.e. the code cannot assume any specific order. Doing so will lead to unpredictable results.
Also, since the code may be run on multiple threads exception handling is completely different from a regular for loop. Parallel.For will catch exceptions for the threads and marshal these back to the calling thread as inner exceptions in an instance of AggregateException.
For additional details please check Parallel Programming with Microsoft .NET by Microsoft patterns and practices.
parallel.for loops run iterations of the loop in different processes in parallel. You can only use this if iterations are independent of one another. Only if the iterations are independent, you can assume the same results will be produced by a parallel and non-parallel for loop.

Flex equivalent of ProcessMessages and unresponsive UI during long loops

I find that my Flex application's UI becomes unresponsive during very long processing loops (tens of seconds). For example, while processing very large XML files and doing something per-element...
Is there an equivalent of "ProcessMessages"? That is, a call that would tell Flex to continue responding to UI events, even in the middle of some long loop, so that the UI doesn't become unresponsive?
I'm aware Flex is single threaded by design. That's exactly why I'm looking for something like ProcessMessages() - a function that allows single-threaded reentrant applications (like in VB, or single-threaded message loop based C++ applications) to remain responsive during long operations.
Summary of Answers
There's no built-in function like HandleEvents() or ProcessMessages() in Flex.
Using some sort of callback mechanism to iteratively process chunks of a long computation process, while yielding to the runtime between chunks, thus enabling it to be responsive, is the only way to maintain a responsive UI during long computations.
Ways of accomplishing the above are:
Using the enterFrame event, which is called whenever the Flash "movie" layer below the Flex application refreshes its frame (which is something like 20fps).
Using a timer.
Using UIComponent.callLater() which schedules work to be done "later". (as the docs say: Queues a function to be called later. Before each update of the screen, Flash Player or AIR calls the set of functions that are scheduled for the update.
Using intentionally triggered mouse/keyboard events to create pseudo "worker threads", as in this example.
If there are further suggestions, or if I left out anything, please feel free to edit this (now) wiki piece.
The problem is that Flash is single threaded, i.e. until a part of the code is running, no other processing can be made. You'll somehow need to break up the processing into smaller chunks and execute these chunks, say, on the enterFrame event.
Edit: I'm afraid that downvoting this (or Simon's) answer does not change the fact that this is not doable in AS3. Read this article for more insight on the problem. The article also includes a simple "library" called PseudoThread, which helps in executing long background computations. You still have to break up the problem into smaller pieces yourself, though.
I can tell you definitively that as of Flex 3, there is no built-in construct similar to the ProcessMessages functionality you are describing.
The most common way to work around this is to split whatever work you are processing into a queue of "worker" tasks and a "worker manager" to control the queue. As each "worker" completes its processing, the worker queue manager pops the next worker off the queue and executes it in a callLater() invocation. This will have an effect that is similar to "yielding to the main thread" and allow your UI to receive events and be responsive, while still allowing the processing to continue when control is returned to the worker.
Once you've got this working, you can do some testing and profiling to figure out if your application can get away with executing a run of multiple workers before invoking callLater(), and encapsulate this logic within the worker queue manager. For example, in our implementation of this pattern (with a few hundred workers in the queue), we were able to get the processing done more quickly with a comparable perceived performance by executing 4 workers before "yielding to the main thread" with callLater(), but this is totally dependent on the scope and nature of the work that your workers are doing.
The process model for ActionScript is single threaded, so the answer is no. Best bet is to either defer to an asynchronous task on the server if you can, or pop up a wait cursor while your long loop runs, or break your process into some smaller pieces which are not quite as intrusive to the UI, or perform the long running tasks at a moment when the user is less likely to feel the effect (app startup for instance).
Actionscript is single threaded by design, no amount of downvoting answers will change that.
For compatibility your best bet is to try to split up your processing into smaller chunks, and do your processing iteratively.
If you absolutely need threading it can sort of be done in Flash Player 10 using Pixel Bender filters. These will run on a separate thread and can give you a callback once they are done.
I believe they are well suited for "hardcore" processing tasks, so they might fit your purpose nicely.
However, they will put a whole other set of demands on your code, so you might be better of doing small "buckets" of computing anyways.
There is no equivalent functionality in Flash Player. By design, Flash Player alternates between rendering to the screen and then running all of the code for each frame. Using Event.ENTER_FRAME events on display objects, or Timer objects elsewhere, are the best bet for breaking up very long calculations.
It's worth noting that some events in ActionScript have an updateAfterEvent() function with the following description:
Instructs Flash Player or the AIR runtime to render after processing of this event completes, if the display list has been modified.
In particular, TimerEvent, MouseEvent, and KeyboardEvent support updateAfterEvent(). There may be others, but those are the ones I found with a quick search.

Resources