How does Parallel.For work? Does it spawn a thread for each iteration, or divide the loop into parts and execute those in parallel? If the latter, can we ensure the same result as a normal for loop? I tested for performance and it really does use multiple cores, but I want to know how it works internally.
Parallel.For partitions the work for a number of concurrent iterations. By default it uses the default task scheduler to schedule the iterations, which essentially uses the current thread as well as a number of thread pool threads. There are overloads that allow you to change this behavior.
A parallel loop may look very similar to a regular loop, but there are actually a number of important differences. First of all, the order of the iterations is not guaranteed, i.e. the code cannot assume any specific order. Doing so will lead to unpredictable results.
Also, since the code may run on multiple threads, exception handling is completely different from a regular for loop. Parallel.For will catch exceptions from the worker threads and marshal them back to the calling thread as inner exceptions of an AggregateException.
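As a rough sketch (Python here, not the actual .NET implementation), the partition-and-aggregate behavior described above can be illustrated like this: split the index range into chunks, run the chunks on pool threads, and surface all failures together in the spirit of AggregateException. The `parallel_for` helper is a made-up analogy, not a real API.

```python
# Sketch of a parallel for: partition the index range into chunks,
# run each chunk on a pool thread, collect exceptions rather than
# failing fast, and report them together at the end.
from concurrent.futures import ThreadPoolExecutor

def parallel_for(start, stop, body, workers=4):
    step = max(1, (stop - start) // workers)
    chunks = [range(i, min(i + step, stop)) for i in range(start, stop, step)]

    def run_chunk(chunk):
        for i in chunk:
            body(i)

    errors = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_chunk, c) for c in chunks]
        for f in futures:
            try:
                f.result()
            except Exception as exc:   # collect instead of failing fast
                errors.append(exc)
    if errors:
        # stand-in for AggregateException: report every failure at once
        raise RuntimeError(f"{len(errors)} chunk(s) failed: {errors!r}")

results = []
parallel_for(0, 8, lambda i: results.append(i * i))
print(sorted(results))  # → [0, 1, 4, 9, 16, 25, 36, 49]
```

Note that the caller has to sort the results: the chunks run concurrently, which is exactly why the iteration order cannot be relied on.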
For additional details please check Parallel Programming with Microsoft .NET by Microsoft patterns and practices.
Parallel.For runs iterations of the loop on different threads in parallel. You can only use it if the iterations are independent of one another. Only then can you assume that a parallel and a non-parallel for loop will produce the same results.
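To make "independent" concrete, here is a small sketch in Python (the question is about .NET, so this is only an analogy): each iteration writes only its own output slot and reads only its own input, so the parallel result matches the serial one.

```python
# Independent iterations: iteration i touches only index i, so the
# loop can be parallelized and still match the serial result. A loop
# that carried state across iterations (e.g. a running total updated
# in place) would not have this property.
from concurrent.futures import ThreadPoolExecutor

def square_all(xs):
    out = [0] * len(xs)

    def body(i):
        out[i] = xs[i] * xs[i]   # touches only slot i: independent

    with ThreadPoolExecutor() as pool:
        list(pool.map(body, range(len(xs))))
    return out

print(square_all([1, 2, 3, 4]))  # → [1, 4, 9, 16], same as a serial loop
```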
Related
I am using foreach + %dopar% to achieve parallelism over multiple cores. I know some of the tasks will face exceptions. When an exception occurs:
Will the tasks that were already started in parallel still complete?
Will the tasks that were not yet scheduled (I don't know if that's the correct term) still be scheduled and eventually complete? If so, will they still be able to utilize all the cores?
I tried finding resources on this, but couldn't find any. Looks like I'm using the wrong keywords. If you have any resources, please direct me to them.
foreach has a parameter called .errorhandling, which can take the values stop (the default), remove, or pass. Their behavior is as follows:
stop: execution stops and the error is propagated.
remove: the result of the failing task is not returned.
pass: the error object is included with the results.
So, addressing your specific question: if you have many tasks running in parallel and one of them raises an exception in a worker, that task stops and the worker moves on to the next "scheduled" task (because of the default value stop). The other tasks continue as normal, in parallel.
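Since foreach is R, here is only a rough Python sketch of how the three modes behave; the `gather` helper and the mode names are hypothetical stand-ins for foreach's result-combining step. Note that workers which have already started a task run it to completion regardless of the mode; the mode only controls what happens when results are combined.

```python
# Illustration of the three .errorhandling modes, with Python's
# concurrent.futures standing in for R's foreach + %dopar%.
from concurrent.futures import ThreadPoolExecutor

def gather(tasks, errorhandling="stop"):
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(t) for t in tasks]   # all tasks still run
        results = []
        for f in futures:
            try:
                results.append(f.result())
            except Exception as exc:
                if errorhandling == "stop":      # default: propagate
                    raise
                elif errorhandling == "remove":  # drop the failed result
                    continue
                else:                            # "pass": keep the error
                    results.append(exc)
    return results

tasks = [lambda: 1, lambda: 1 // 0, lambda: 3]
print(gather(tasks, "remove"))  # → [1, 3]
```

Even in "stop" mode, the third task above still runs; it is only its result that never reaches the caller.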
Please see this answer, which explains in more detail how errors are handled in foreach and %dopar%.
I hope this clarifies your problem a little.
I have been having trouble implementing an asynchronous gradient descent in a multithreaded environment.
To describe the skeleton of my code, for each thread,
loop:
    synchronize with the global parameters
    <do some work / accumulate gradients on a mini-batch>
    apply gradient descent to the global network, specifically:
        self.optimizer.apply_gradients(grads_and_vars)
end loop
where each thread has its own optimizer.
Now the problem is that when I define the optimizer with use_locking=False, it does not work, as evidenced by the rewards generated by my reinforcement learning agent.
However, when I set use_locking=True, it works, so the algorithm is correct; without locking, the local gradients are simply not applied properly to the global parameters.
So some possible reasons I thought of were the following:
1. While one thread is updating the global parameters and another thread accesses them, the former thread's remaining updates are cancelled. With too many threads accessing the global parameters concurrently, the threads do all their hard work for nothing.
2. Referring to How does asynchronous training work in distributed Tensorflow?, reading asynchronously at the top of the loop is certainly fine. However, it may be that as soon as a thread finishes applying its gradients, it synchronizes from the global parameters so quickly that it does not pick up the updates from the other threads.
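Reason 1 is essentially the classic "lost update" race, and it can be demonstrated with plain Python threads, independent of TensorFlow. The `run` helper below is a hypothetical stand-in in which `state["param"]` plays the role of the global parameters and the unlocked branch mimics applying gradients with use_locking=False:

```python
# Minimal demonstration of the "lost update" race: several workers do
# a read-modify-write on a shared value. Without a lock, a thread can
# overwrite an update another thread made between its read and write,
# analogous to apply_gradients with use_locking=False.
import threading

def run(n_threads=4, n_updates=1000, use_locking=True):
    state = {"param": 0}
    lock = threading.Lock()

    def worker():
        for _ in range(n_updates):
            if use_locking:
                with lock:
                    state["param"] += 1   # read-modify-write, atomic
            else:
                v = state["param"]        # read ...
                state["param"] = v + 1    # ... then write: racy

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["param"]

print(run(use_locking=True))   # → 4000, every update applied
print(run(use_locking=False))  # may be less: some updates get lost
```

The locked version always reaches the full count; the unlocked version can fall short whenever a thread switch lands between the read and the write, which is the kind of silent corruption that would show up as degraded rewards rather than an error.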
Can anyone, hopefully a TensorFlow developer, explain what is really happening with use_locking for this specific loop?
I have been spending days on this simple example. Although setting use_locking = True does solve the issue, it is not asynchronous in nature and it is also very slow.
I appreciate your help.
While I have tried to dive into both techniques, it is still a bit blurry to me which problems and situations each one is for.
If I simplify this: are CPU-bound problems handled with parallel programming and IO-bound ones with async programming?
Perhaps a better title for this question would be 'to block or not to block?', as going parallel and going asynchronous are not mutually exclusive.
I recommend using multiple threads on a problem either 1) when it is CPU bound and can be split into multiple parts that do not require coordination/sharing to complete, or 2) when the job may stall for a long period of time on IO and we do not want to prevent other work from occurring.
Asynchronous basically means: don't block a thread waiting for something to complete; instead, rely on a callback that notifies you of its completion. As such, you can go asynchronous even when there is only one worker thread.
Asynchronous techniques have been resurfacing recently because they scale better than blocking techniques. This is because we are limited in how many threads we can have on a single system before the overheads of managing those threads dominate.
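As a sketch of that rule of thumb in Python (the names `fetch` and `crunch` are made up for illustration): IO-bound waits can be overlapped asynchronously on a single thread, while CPU-bound work needs real parallel workers to use more than one core.

```python
# IO-bound -> don't block: many waits overlap on one thread (asyncio).
# CPU-bound -> real parallelism: separate processes can use many cores.
import asyncio
from concurrent.futures import ProcessPoolExecutor

async def fetch(i):
    await asyncio.sleep(0.1)   # stand-in for a network/disk wait
    return i

async def io_bound():
    # Ten 0.1s waits overlap on a single thread: ~0.1s total, not ~1s.
    return await asyncio.gather(*(fetch(i) for i in range(10)))

def crunch(n):                 # stand-in for CPU-bound work
    return sum(i * i for i in range(n))

def cpu_bound():
    # Each crunch call runs in its own process, so the work can
    # occupy multiple cores despite Python's GIL.
    with ProcessPoolExecutor() as pool:
        return list(pool.map(crunch, [10_000] * 4))

if __name__ == "__main__":
    print(asyncio.run(io_bound()))  # → [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Calling `cpu_bound()` would spread the four `crunch` jobs across processes; the two halves are independent, which is the sense in which "parallel or async" is not an either/or choice.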
I have C++ code using MPI that executes in a sequential-parallel-sequential pattern, and this pattern is repeated in a time loop.
While validating the code against the serial version, I measured a reduction in time for the parallel part, and in fact the reduction is almost linear in the number of processors.
The problem I am facing is that the time required for the sequential part also increases considerably when using a higher number of processors.
The parallel part takes less time to execute than the total sequential time of the entire program.
Therefore, although the parallel part gets faster with more processors, the saving is largely lost to the increased time spent in the sequential part. The sequential part also includes a large number of computations at each time step, as well as writing the data to an output file at specified times.
All the processors run during the execution of the sequential part; the data is gathered to the root processor after the parallel computation, and only the root processor is allowed to write the file.
So, can anyone suggest an efficient way to handle the serial part (a large number of operations plus writing the file) of a parallel code? I am happy to clarify any point if required.
Thanks in advance.
First of all, do the file writing from a separate thread (or process, in MPI terms), so the other threads/processes can use your cores for computation.
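The dedicated-writer idea can be sketched like this (Python stdlib for brevity; in MPI you would dedicate a rank rather than a thread, and all names below are illustrative):

```python
# Sketch of a dedicated writer: compute threads hand finished data to
# a queue, and a single background thread does the (sequential-by-
# nature) disk writing, so the cores stay busy computing.
import os
import queue
import tempfile
import threading

def start_writer(path, q):
    def writer():
        with open(path, "w") as f:
            while True:
                item = q.get()
                if item is None:      # sentinel: no more data
                    break
                f.write(item + "\n")  # only this thread touches disk
    t = threading.Thread(target=writer)
    t.start()
    return t

q = queue.Queue()
path = os.path.join(tempfile.gettempdir(), "steps.out")
t = start_writer(path, q)
for step in range(3):
    q.put(f"step {step}: result={step * step}")  # computed elsewhere
q.put(None)                                      # signal completion
t.join()
print(open(path).read().splitlines())
# → ['step 0: result=0', 'step 1: result=1', 'step 2: result=4']
```

The queue decouples the producers from the disk: computation never waits on a write, and the file is still written in a single, orderly stream.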
Then, check why your parallel version is much slower than the sequential one. Often this means the tasks you create are too small, so communication between threads (synchronization) eats your performance. Consider whether tasks can be combined into chunks, with complete chunks processed in parallel.
And, of course, use any profiler that is good for multithreading environment.
[EDIT]
By "sequential part" do you mean the part of your logic that cannot be (and is not) parallelized? A sequential part can run a bit slower on multicore, probably because of the OS scheduler or something like that, but it's odd that you see such a noticeable difference.
Disk is sequential by its nature, so writing to disk from many threads doesn't give any benefit; it can instead lead to a situation where many threads try to write simultaneously and wait on each other rather than doing something useful.
BTW, what MPI implementation do you use?
Your problem description is too high-level; please provide some pseudo-code or similar, as that would help us help you.
I have some high performance file transfer code which I wrote in C# using the Async Programming Model (APM) idiom (eg, BeginRead/EndRead). This code reads a file from a local disk and writes it to a socket.
For best performance on modern hardware, it's important to keep more than one outstanding I/O operation in flight whenever possible. Thus, I post several BeginRead operations on the file, then when one completes, I call a BeginSend on the socket, and when that completes I do another BeginRead on the file. The details are a bit more complicated than that but at the high level that's the idea.
I've got the APM-based code working, but it's very hard to follow and probably has subtle concurrency bugs. I'd love to use TPL for this instead. I figured Task.Factory.FromAsync would just about do it, but there's a catch.
All of the I/O samples I've seen (most particularly the StreamExtensions class in the Parallel Extensions Extras) assume one read followed by one write. This won't perform the way I need.
I can't use something simple like Parallel.ForEach or the Extras extension Task.Factory.Iterate because the async I/O tasks don't spend much time on a worker thread, so Parallel just starts up another task, resulting in potentially dozens or hundreds of pending I/O operations; way too much! You can work around that by Waiting on your tasks, but that causes creation of an event handle (a kernel object), and a blocking wait on a task wait handle, which ties up a worker thread. My APM-based implementation avoids both of those things.
I've been playing around with different ways to keep multiple read/write operations in flight, and I've managed to do so using continuations that call a method that creates another task, but it feels awkward, and definitely doesn't feel like idiomatic TPL.
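For comparison, here is the shape of that pipeline in Python's asyncio, where a semaphore bounds the number of in-flight operations; a C#/TPL analogue might use a SemaphoreSlim with FromAsync tasks. All names below (`copy_chunks`, `read_chunk`, `send`) are made-up stand-ins for the file read and socket send:

```python
# Keep several I/O operations in flight, but bound them so we never
# queue hundreds of pending operations: the semaphore admits at most
# max_in_flight read/send pairs at a time, without blocking a thread.
import asyncio

async def copy_chunks(chunks, send, max_in_flight=4):
    sem = asyncio.Semaphore(max_in_flight)
    sent = []

    async def pump(chunk):
        async with sem:                  # bound the outstanding work
            data = await read_chunk(chunk)
            await send(data, sent)

    await asyncio.gather(*(pump(c) for c in chunks))
    return sent

async def read_chunk(chunk):             # stand-in for a file read
    await asyncio.sleep(0)
    return chunk * 2

async def send(data, out):               # stand-in for a socket send
    await asyncio.sleep(0)
    out.append(data)

result = asyncio.run(copy_chunks(range(5), send))
print(sorted(result))  # → [0, 2, 4, 6, 8]
```

The key property is the same one the APM code has: waiting for admission holds no thread and creates no kernel wait handle, yet the number of outstanding operations never exceeds the bound.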
Has anyone else grappled with an issue like this with the TPL? Any suggestions?
If you're worried about too many threads, you can just set ParallelOptions.MaxDegreeOfParallelism to an acceptable number in your call to Parallel.ForEach.