I'm trying to implement my own version of the Asynchronous Advantage Actor-Critic (A3C) method, but it fails to learn the Pong game. My code was mostly inspired by Arthur Juliani's and OpenAI Gym's A3C versions. The method works well for a simple Doom environment (the one used in Arthur Juliani's code), but when I try the Pong game, the method diverges to a policy where it always executes the same action (always move down, always move up, or always execute the no-op action). My code is located in my GitHub repository.
I have already adapted my network to resemble the architecture used by OpenAI Gym's A3C version, which is:
4 convolutional layers with the same specs: 32 filters, 3x3 kernels, 2x2 strides, with padding (padding='same'). The output of the last convolutional layer is flattened and fed to an LSTM layer with an output of size 256. The initial states C and H of the LSTM layer are given as an input. The output of the LSTM layer is then separated into two streams: a fully connected layer with an output size equal to the number of actions (the policy) and another fully connected layer with a single output (the value function) (more details in Network.py of my code);
The loss function is the one given in the original A3C paper. Basically, the policy loss is the log_softmax of the linear policy times the advantage function. The value loss is the square of the difference between the value function and the discounted rewards. The total loss accounts for the value loss, the policy loss, and the entropy. The gradients are clipped to 40 (more details in Network.py of my code; see the sketch after this list);
There is only one global network and several worker networks (one network for each worker). Only the global network is updated, and this update is done with the local gradients of each worker network. Each worker simulates the environment for BATCH_SIZE iterations, saving the state, value function, chosen action, reward received, and the LSTM state. After BATCH_SIZE (I used BATCH_SIZE = 20) iterations, each worker passes that data into the network, calculates the discounted rewards, the advantage function, the total loss, and the local gradients, and then updates the global network with those gradients. Finally, the worker's local network is synchronized with the global network (local_net = global_net). All workers do that asynchronously (for more details on this step, check the work and train methods of the Worker class inside Worker.py);
The LSTM states C and H are reset between episodes. It is also important to note that the current states C and H are kept locally by each worker;
To apply the gradients to the global network, I used the AdamOptimizer with a learning rate of 1e-4.
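To make the batch arithmetic in the list above concrete, here is a rough NumPy-only sketch of the discounted rewards, advantages, and loss terms as I understand them from the description. The names (discount, GAMMA, ENTROPY_BETA) and the toy rollout numbers are mine, not taken from Network.py or Worker.py, so treat it only as a sketch of the computation, not as my actual code.

import numpy as np

GAMMA, ENTROPY_BETA = 0.99, 0.01   # assumed values, not taken from the repository

def discount(x, gamma):
    # Discounted cumulative sum, computed backwards over the rollout.
    out, running = np.zeros(len(x)), 0.0
    for t in reversed(range(len(x))):
        running = x[t] + gamma * running
        out[t] = running
    return out

def log_softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

# One BATCH_SIZE-long rollout with toy numbers: rewards, V(s_t), chosen actions,
# and the policy logits the network produced for each state.
rewards = np.array([0.0, 0.0, 1.0])
values = np.array([0.1, 0.2, 0.4])
actions = np.array([2, 2, 0])
logits = np.random.randn(3, 6)      # Pong exposes 6 discrete actions in Gym
bootstrap = 0.0                     # V(s_{t+N}); 0 if the episode ended here

discounted = discount(np.append(rewards, bootstrap), GAMMA)[:-1]
advantages = discounted - values

logp = log_softmax(logits)
p = np.exp(logp)
logp_a = logp[np.arange(len(actions)), actions]   # log pi(a_t | s_t)

policy_loss = -np.sum(logp_a * advantages)
value_loss = np.sum((discounted - values) ** 2)
entropy = -np.sum(p * logp)                       # entropy bonus encourages exploration
total_loss = policy_loss + value_loss - ENTROPY_BETA * entropy
print(total_loss)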
I have already tried different configurations for the network (several different convolutional layer configurations, including different activation functions), other optimizers (RMSPropOptimizer and AdadeltaOptimizer) with different parameter configurations, and different values of BATCH_SIZE. But it almost always ends up diverging to a policy that always executes only one action. I say "almost" because there are certain configurations where the agent maintains a policy similar to a random policy for several episodes, with no apparent improvement (I waited up to 62k episodes before giving up in those cases).
Therefore, I would like to know if anyone has succeeded in training an agent in the Pong game using A3C with an LSTM layer. If so, what parameters did you use? Any help would be appreciated!
[EDIT] As I said in the comments, I managed to partially solve the problem by feeding the correct LSTM state before calculating the gradients (instead of feeding a freshly initialized LSTM state). This made the method learn reasonably well for the PongDeterministic environment. But the problem persists when I try Breakout-v0: the agent reaches a mean score of 40 in about 65k episodes, but it seems to stop learning after this (it maintained this score for some time). I have checked the OpenAI starter agent several times and I can't find any significant differences between my implementation and theirs. Any help would be extremely appreciated!
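For reference, here is a schematic, pure-Python sketch of the bookkeeping described in this edit. env, step_fn, and train_fn are hypothetical stand-ins for the environment and the TensorFlow calls in Worker.py, so this only illustrates where the LSTM state fed to the gradient pass has to come from, not the actual implementation.

def run_rollout(env, obs, lstm_state, step_fn, train_fn, batch_size=20):
    # step_fn(obs, lstm_state) -> (action, value, new_lstm_state): forward pass
    # of the worker's local network. train_fn(batch, lstm_state): gradient pass.
    batch_start_state = lstm_state          # LSTM state when the rollout begins
    batch = []
    for _ in range(batch_size):
        action, value, lstm_state = step_fn(obs, lstm_state)
        obs, reward, done, _ = env.step(action)
        batch.append((obs, action, reward, value))
        if done:
            break
    # The gradient pass must be fed the state the rollout actually started from,
    # not a freshly zero-initialized LSTM state.
    train_fn(batch, batch_start_state)
    return obs, lstm_state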
How can a topology be defined in Castalia-3.2 for a WBAN?
How can a topology be imported from OMNeT++ into Castalia?
Where is the topology defined in the default WBAN scenario in Castalia?
With regards, thanks.
Topology of a network is an abstraction that shows the structure of the communication links in the network. It's an abstraction because the notion of a link is itself an abstraction. There are no "real" links in a wireless network. The communication happens in a broadcast medium, and many parameters dictate whether a packet is received or not, such as the power of transmission, the path loss between transmitter and receiver, noise and interference, and also just luck. Still, the notion of a link can be useful in some circumstances, and some simulators use it to define simulation scenarios. You might be used to simulators where you can draw nodes and then simply draw lines between them to define their links. This is not how Castalia models a network.
Castalia does not model links between the nodes, it models the channel and radios to get a more realistic communication behaviour.
Topology is often confused with deployment (I confuse them myself sometimes). Deployment is just the placement of nodes on the field. There are multiple ways to define deployment in Castalia, if you wish, but it is not needed in all scenarios (more on this later). People can confuse deployment with topology, because under very simplistic assumptions certain deployments lead to certain topologies. Castalia does not make these assumptions. Study the manual (especially chapter 4) to get a better understanding of Castalia's modeling.
After you have understood the modeling in Castalia, if you still want a specific/custom topology for some reason, you can play with some parameters to achieve that topology, at least in a statistical sense. Assuming all nodes use the same radios and the same transmission power, the path loss between nodes becomes a defining factor of the "quality" of the link between them. In Castalia, you can define the path losses for each and every pair of nodes, using a pathloss map file.
SN.wirelessChannel.pathLossMapFile = "../Parameters/WirelessChannel/BANmodels/pathLossMap.txt"
This tells Castalia to use the specific path losses found in the file instead of computing path losses based on a wireless channel model. The deployment does not matter in this case. At least it does not matter for communication purposes (it might matter for other aspects of the simulation, for example if we are sampling a physical process that depends on location).
In our own simulations with BAN, we have defined a pathloss map based on experimental data, because other available models are not very accurate for BAN. For example, the lognormal shadowing model, which is Castalia's default, is not a good fit for BAN simulations. We did not want to enforce a specific topology, we just wanted a realistic channel model, and defining a pathloss map based on experimental data was the best way.
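If you do want to generate path losses from a model rather than from measurements, the lognormal shadowing model mentioned above is easy to sample. The Python sketch below uses my own parameter names and made-up node positions and only prints the pairwise losses; the exact file syntax Castalia expects for pathLossMapFile is described in the manual, so take this purely as an illustration of the model itself.

import itertools
import math
import random

def lognormal_shadowing_pl(d, pl_d0=55.0, d0=1.0, exponent=2.4, sigma=4.0):
    # Path loss in dB at distance d (metres): PL(d0) + 10*n*log10(d/d0) + X_sigma
    return pl_d0 + 10.0 * exponent * math.log10(d / d0) + random.gauss(0.0, sigma)

nodes = {0: (0.0, 0.0), 1: (0.3, 0.0), 2: (0.0, 0.4)}   # made-up positions in metres

for i, j in itertools.permutations(nodes, 2):
    (xi, yi), (xj, yj) = nodes[i], nodes[j]
    d = math.hypot(xi - xj, yi - yj)
    print(f"path loss {i} -> {j}: {lognormal_shadowing_pl(d):.1f} dB")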
I have the impression, though, that when you say topology you are not only referring to which nodes could communicate with which nodes, but to which nodes do communicate with which nodes. This is also a matter of the layers above the radio (MAC and routing). For example, it's the MAC and routing that allow for relay nodes or not.
Note that in Castalia's current implementations of 802.15.6MAC and 802.15.4MAC, relay nodes are not allowed, so you cannot create a mesh topology with these default implementations. Only a star topology is supported. If you want something more, you'll have to implement it yourself.
I'm new to MPI and I'm trying to understand how MPI (and specifically Open MPI) works, in order to reason about the performance of my system.
I've tried to find resources online to help me understand things a little better, but haven't had much luck. I thought I'd come here.
Right now my question is simple: if I have 3 nodes (1 master, 2 clients) and I issue an MPI_Gather, does the root process handle incoming data sequentially or concurrently? In other words, if process 1 is the first to make a connection with process 0, will process 2 have to wait until process 1 is done sending its data before it can start sending its own?
Thanks!
There are multiple components in Open MPI that implement collective operations and some of them provide multiple algorithms for the implementation of each operation.
What you are most likely interested in is the tuned component of the coll framework as that is what Open MPI uses by default. tuned implements all collectives using point-to-point operations and provides several algorithms for gather:
linear with synchronisation - used when messages are large to mid-size
binomial - used when the number of processes is large or the message size is small
basic linear - used in all other cases
The performance of each algorithm depends strongly on the particular combination of message size and number of ranks, therefore the library comes with a set of heuristics that tries to determine the best algorithm based on the data size and the size of the communicator (as indicated above). There are several mechanisms to override the heuristics and either force a certain algorithm or provide a list of custom algorithm selection rules.
The basic linear algorithm simply has the root loop over all other ranks receiving their messages in sequence. In that case, rank 2 won't be able to send its chunk before rank 1 since the root will first receive the message from rank 1 and only then move on to rank 2.
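To make the ordering visible, here is a sketch of what the basic linear algorithm does, written with mpi4py point-to-point calls; it is not Open MPI's actual code, just an illustration of the receive order at the root.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
chunk = f"data from rank {rank}"

if rank == 0:
    gathered = [chunk]
    for source in range(1, size):
        # The root drains the ranks strictly in order, so rank 2's data is
        # received only after rank 1's message has been fully taken.
        gathered.append(comm.recv(source=source))
    print(gathered)
else:
    comm.send(chunk, dest=0)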
The linear with synchronisation algorithm splits the chunks into two pieces each. The first pieces are collected in sequence just like in the basic linear algorithm. The second pieces are collected asynchronously using non-blocking receives.
The binomial algorithm arranges the ranks as a binomial tree. The processes at the nodes of the tree receive the chunks from the lower levels and aggregate them into larger chunks that then get passed to the upper levels until they reach the root rank.
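Again purely as an illustration (with the root fixed at rank 0, and not Open MPI's actual code), a binomial gather can be sketched like this: each rank collects the contiguous block of chunks from its binomial subtree and then forwards the whole block to its parent.

from mpi4py import MPI

def binomial_gather(comm, chunk):
    rank, size = comm.Get_rank(), comm.Get_size()
    block = [chunk]       # the contiguous block of chunks from my subtree
    mask = 1
    while mask < size:
        if rank & mask:                        # my parent is rank - mask
            comm.send(block, dest=rank - mask)
            return None
        child = rank + mask
        if child < size:
            block.extend(comm.recv(source=child))
        mask <<= 1
    return block                               # only rank 0 reaches this point

comm = MPI.COMM_WORLD
result = binomial_gather(comm, f"data from rank {comm.Get_rank()}")
if result is not None:
    print(result)         # chunks arrive at the root in rank order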
You can find the source code of the tuned module in the ompi/mca/coll/tuned folder of the Open MPI source tree. In the development branch, part of the tuned component got promoted to the base implementation of the collective framework and the code for the gather is to be found in ompi/mca/coll/base instead.
Hristo's answer is of course excellent, but I would like to offer a different point of view.
Contrary to your expectation, the question is not simple. It isn't even possible to specifically answer it without knowing more system specifics, as Hristo pointed out. That doesn't mean the question is invalid, but you should start to reason about performance on a different level.
First, consider the complexity of the gather operation: the total network transfer to the root, as well as the memory requirements, grow linearly with the number of processes in the communicator. This naturally limits scalability.
Second, you may assume that your MPI implementation does implement MPI_Gather in the most efficient way possible - better than you could do it by hand. This assumption may very well be wrong, but it is the best starting point to write your program.
Now when you have your program, you should measure and see where time is spent - or wasted. For that you should use an MPI performance analysis tool. If you have identified that your gather has a significant impact on performance, you can go ahead and try to optimize it. But before doing so, first consider whether you can structure your communication conceptually better, e.g. by somehow removing the communication altogether or using a clever reduction instead. If you still need to stick to the gather, go ahead and tune your MPI implementation. Afterwards, verify that your optimization did indeed improve performance on your specific system.
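A dedicated MPI performance analysis tool will give you far more detail, but as a quick first look you can time the gather yourself. A minimal mpi4py sketch (the message size is arbitrary):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
count = 1_000_000                              # arbitrary per-rank message size

sendbuf = np.full(count, rank, dtype=np.float64)
recvbuf = np.empty(count * comm.Get_size(), dtype=np.float64) if rank == 0 else None

comm.Barrier()                                 # start everyone together
t0 = MPI.Wtime()
comm.Gather(sendbuf, recvbuf, root=0)
print(f"rank {rank}: Gather took {MPI.Wtime() - t0:.6f} s")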
We have a single-threaded application that simulates the interaction of hundreds of thousands of objects over time using the shared-memory model.
Obviously, it suffers from its inability to scale across multi-CPU hardware.
After reading a little about agent-based modeling and functional programming / the actor model, I am considering a rewrite using the message-passing paradigm.
The idea is very simple: each object will be an actor and their interactions will be messages, so that the simulation can happen in parallel. Given a configuration of objects at a certain time, its future consequences can be easily computed.
The question is how to model time:
For example, let's assume the behavior of object X depends on A and B. Since the order in which the actors run and the messages are computed is not guaranteed, it could be that when X is to be computed, A has already sent its message to X but B has not.
How can I make sure the computation happens correctly?
I hope the question is clear.
Thanks in advance.
Your approach of using message passing to parallelize a (discrete-event?) simulation is well known and does not require a functional style per se (although, of course, this does not prevent you from implementing it that way).
The basic problem you describe w.r.t. the timing of events is also known as the local causality constraint (see, for example, this textbook). Basically, you need to use a synchronization protocol to ensure that each object (or agent) processes its messages in the right order. In the domain of parallel discrete-event simulation, such objects are called logical processes, and they communicate via events (i.e. time-stamped messages).
Correctly implementing a synchronization protocol for these events is challenging and the right choice of protocol is highly application-specific. For example, one important factor is the average amount of computation required per event: if there is little computation required, the communication costs dominate the overall execution time and it will be hard to scale the simulation.
I would therefore recommend looking for existing solutions/libraries on top of the actor framework you intend to use before starting from scratch.
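As a minimal illustration of the local causality constraint (my own simplification, not a full synchronization protocol), a logical process can buffer time-stamped messages per sender and only handle an event once every sender is known to have advanced past its timestamp:

import heapq

class LogicalProcess:
    def __init__(self, name, senders):
        self.name = name
        self.pending = []                        # min-heap of (timestamp, sender, payload)
        self.clock = {s: 0.0 for s in senders}   # latest timestamp seen per sender

    def receive(self, timestamp, sender, payload):
        self.clock[sender] = timestamp
        heapq.heappush(self.pending, (timestamp, sender, payload))
        self._process_safe_events()

    def _process_safe_events(self):
        # An event is safe once no sender can still emit something earlier.
        horizon = min(self.clock.values())
        while self.pending and self.pending[0][0] <= horizon:
            t, sender, payload = heapq.heappop(self.pending)
            print(f"{self.name} handles {payload!r} from {sender} at t={t}")

# X depends on A and B, mirroring the question:
x = LogicalProcess("X", senders=["A", "B"])
x.receive(2.0, "A", "update-from-A")   # buffered: B might still send something earlier
x.receive(1.0, "B", "update-from-B")   # B's t=1 event is now safe and is handled;
                                       # A's t=2 event stays buffered until B passes t=2
x.receive(3.0, "B", "later-update")    # now A's t=2 event becomes safe as well

In a real protocol you also need lookahead or null messages so that a silent sender does not block everyone forever, which is exactly the kind of subtlety an existing library will handle for you.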
I'm in the process of building a discrete-event simulator, and need to be able to calculate the theoretical bandwidth available between two systems in a given network topology, so that I can "time" how long a transfer will take to occur and create an event at its expected completion time.
At the moment, for simplicity, I do not consider the switches' backplanes or the likelihood of collisions / congestion occurring within the network. I am simply interested in the maximum transfer rate between all communicating systems.
For instance, consider the following sample network topology:
We assume the following connections:
Source 1, Source 2 -> (sending to) Dest 1
Source 3, Source 4 -> (sending to) Dest 2
Given these connections, what is the maximum effective transfer rate of all sources?
If we visualize this as a graph, I can calculate this manually by starting from the sources and evaluating at each switch level the maximum amount of incoming network traffic vs the switch's uplink.
For instance, Source #1 in this scenario has 50 Mbps of effective bandwidth to Dest 1
1 Gbps * S1(1/2) * S2(1) * S3(1/10) = 50 Mbps
However, I'm curious as to what other methods can be utilized to calculate this, or if there is a more effective approach which I can utilize to "predict" network traffic.
Any feedback is appreciated -- thanks.
This is essentially a max-min fairness problem.
https://en.wikipedia.org/wiki/Max-min_fairness
The progressive filling algorithm (described in the Wiki article) is a simple solution to this problem:
If resources are allocated in advance in the network nodes, max-min fairness can be obtained by using an algorithm of progressive filling. You start with all rates equal to 0 and grow all rates together at the same pace, until one or several link capacity limits are hit. The rates for the sources that use these links are not increased any more, and you continue increasing the rates for other sources. All the sources that are stopped have a bottleneck link. This is because they use a saturated link, and all other sources using the saturated link are stopped at the same time, or were stopped before, thus have a smaller or equal rate. The algorithm continues until it is not possible to increase. Lastly, when the algorithm terminates, all sources have been stopped at some time and thus have a bottleneck link. This allocation is max-min fair.
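A direct Python implementation of progressive filling is short. The flow and link names and the capacities below are made up for illustration, since I don't know your exact topology:

def max_min_rates(flows, capacity):
    # flows: {flow_name: set_of_links_it_crosses}, capacity: {link: capacity}
    rate = {f: 0.0 for f in flows}
    frozen = set()                     # flows that have hit a bottleneck link
    cap = dict(capacity)               # remaining capacity per link
    while len(frozen) < len(flows):
        # Fair share each link can still offer to its active (unfrozen) flows.
        share = {}
        for link, c in cap.items():
            users = [f for f in flows if link in flows[f] and f not in frozen]
            if users:
                share[link] = c / len(users)
        if not share:
            break
        inc = min(share.values())      # grow all active flows by this much
        saturated = {l for l, s in share.items() if s == inc}
        for f in flows:
            if f in frozen:
                continue
            rate[f] += inc
            for link in flows[f]:
                cap[link] -= inc
            if flows[f] & saturated:   # this flow now has a bottleneck link
                frozen.add(f)
    return rate

capacity = {"L1": 100.0, "L2": 1000.0}                  # Mbps, made-up numbers
flows = {"A": {"L1"}, "B": {"L1", "L2"}, "C": {"L2"}}
print(max_min_rates(flows, capacity))
# A and B split L1 (50 each); C then gets the rest of L2 (950).

The resulting rate of each source is exactly what your manual calculation finds: the share it receives on its most constrained (bottleneck) link.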
In R. Kent Dybvig's paper "Three Implementation Models for Scheme" he speaks of "FFP languages" and "FFP machines". Apparently there is some correlation between FFP machines, and string-reduction on multiple processors.
Googling doesn't really uncover much in terms of explanations or examples.
Can anyone shed some light on this topic?
Thanks.
Kent Dybvig's advisor, Gyula A. Mago, published a detailed description in 1987: "The FFP Machine: Technical Report 87-014", by Mago and Stanat.
As of this writing, the PDF is freely available at:
http://www.cs.unc.edu/techreports/87-014.pdf
The FFP Machine is a very fine-grained parallel computer architecture: each processor holds a single symbol / atom / value.
It uses a string reduction model of computation in which innermost function applications are found and replaced by their equivalent result (eager evaluation). Where a result is used in several places, it tends to be re-evaluated instead of incurring the costs of accessing some global store (but see Mago's paper on "Copying Operands vs Copying Results", or better yet Mago's "Data Sharing in an FFP Machine" in the 1982 Functional Programming Languages and Computer Architecture conference).
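As a toy illustration of the string-reduction idea (my own sketch, not Mago's formulation, and sequential where the FFP machine reduces all innermost applications in parallel): repeatedly find an innermost application, i.e. one whose operands are already atomic, and replace it in place with its result.

def innermost_reduce(expr, ops):
    if not isinstance(expr, tuple):
        return expr, False                       # an atom: nothing to reduce
    op, *args = expr
    # First try to reduce inside the operands (innermost first).
    for i, a in enumerate(args):
        reduced, changed = innermost_reduce(a, ops)
        if changed:
            return (op, *args[:i], reduced, *args[i + 1:]), True
    # All operands are atoms: this application is reducible now.
    return ops[op](*args), True

def evaluate(expr, ops):
    changed = True
    while changed:
        expr, changed = innermost_reduce(expr, ops)
    return expr

ops = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}
print(evaluate(("+", ("*", 2, 3), ("*", 4, 5)), ops))   # prints 26

The real machine represents the expression as a flat string of symbols spread across the L cells (one symbol per cell) rather than as a nested structure, and advances all reducible applications simultaneously.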
The L cells holding the FFP expression being reduced communicate through a tree-structured arrangement of T cells. Note that ICs are basically two-dimensional and, with wiring, circuits can move towards being three-dimensional in physical space. Interconnection networks that occupy higher dimensions (such as the Hypercube, Omega, Banyan, Star, etc. networks) will eventually be unable to perform near their theoretical limit.
This communication network is circuit-switched rather than being packet-switched. Data packets contain no addresses and do not need routing. Packets from distinct reductions cannot meet, cannot conflict and cannot experience congestion with each other.
The configuring activity (called "Partitioning") is performed in a single sweep upwards in the tree, using a handful of logic operations on 3-bit messages, leaving "area machines" in its wake, each created to advance at most a single reducible application. While it is technically logarithmic in time, the resulting area machines can begin communicating in a pipelined fashion behind the partitioning wave, practically costing a constant time penalty. (The dismantling of area machines remains a logarithmic cost in time.)
Packets within a single reduction should, and must, meet and thus provide an often-useful synchronization. Sequences of packets are sorted and combined as they rise within an area, to be broadcast from the root of the area machine. Parallel Prefix and Parallel Suffix operations are provided to reduce area traffic, since there remains a potential bottleneck within an individual reducible application.
This is accomplished without the need, exhibited in the Ultracomputer (Jack (Jacob?) Schwartz at NYU), for a separate logarithmic-sized cache memory in each communication node. Each T cell (internal tree node) only needs a FIFO buffer (for efficiency) of size greater than the pipeline path to the top of the tree and back down. (This latter is a conjecture of mine, but it seems reasonable.)
Since the tree maintains the left-to-right order of data (unlike some other combining networks), the system enables cells to rotate their data in logarithmic rather than linear time, avoiding the plausible congestion at the root of the area machine. It's worth noting again that the parallelism within an area machine is independent of the simultaneous parallelism in other area machines, and each area machine has available to it a number of processors proportional to the quantity of data in the operand.
Have you come across this yet: Compiling APL for parallel execution on an FFP machine?
Formal FP, which is similar to FP but with a regular, sugarless syntax intended for machine execution, is all I can offer you.
See Wikipedia's FP page.