Reinforcement Learning Algorithm stuck in infinite loop when choosing an infeasible action - infinite-loop

😊
I wrote a gym environment and i am training this gym environment using the DQN agent from stable-baselines.
What my environment dose:
action : int == 1, ..., n --> Add an order n (containing (1 <= x <= 10) items) to a list (lets call it "item_list").
action == 0 --> Close the item_list (no further items can be added).
So whats the problem?
There is a max number of items allowed in the item_list. This means if the item_list is almost full and the agent chooses action >= 1 --> The capacity of the "item_list is exceeded. Thus it is an infeasable action.
What happens in this case?
In that case the order will not be added to "item_list" and the Agent receaves a negative reward. The problem is that the Observation of my agent wont chainge at all. This will lead to my Agent choosing the same action over and over again.
When does it become a problem?
During training it does not matter that much. The Agent will learn to avoid choosing an action like this and exploration will always get the agent out of this loop after a while. Though, when i want to use the trained agent there will be no exploration. One "bad" action would be enough to send this agent into a infinite loop of choosing an order that does not fit into the item_list.
What kind of answer am I looking for?
Is there a way of dealing with those infinite loops exept form hardcoding a lot of exceptions? If the capacity of item_list ist exeeded it would be easy to hardcode that the list should be closed the way it ist. Though the same infinite loop can appear if the item_list is empty and the Agent chooses action == 0 (closing an empty list). There are many other cases where this problem can occure. Is there a smart solution for this that I am not aware of?

Two general approaches to consider:
First, do not allow illegal actions. When your agent outputs a distribution over actions, multiply it by indicator vector that zeroes out illegal actions. Then sample an action.
Second, after an illegal action terminate the episode with a negative reward. You already assign a negative reward for illegal action, additionally terminate the episode, so that there is no infinite loop.
I would go for the first one.
Edit: see this question as well: https://ai.stackexchange.com/questions/2980/how-should-i-handle-invalid-actions-when-using-reinforce

Related

Handling Race Conditions / Concurrency in Network Protocol Design

I am looking for possible techniques to gracefully handle race conditions in network protocol design. I find that in some cases, it is particularly hard to synchronize two nodes to enter a specific protocol state. Here is an example protocol with such a problem.
Let's say A and B are in an ESTABLISHED state and exchange data. All messages sent by A or B use a monotonically increasing sequence number, such that A can know the order of the messages sent by B, and A can know the order of the messages sent by B. At any time in this state, either A or B can send a ACTION_1 message to the other, in order to enter a different state where a strictly sequential exchange of message needs to happen:
send ACTION_1
recv ACTION_2
send ACTION_3
However, it is possible that both A and B send the ACTION_1 message at the same time, causing both of them to receive an ACTION_1 message, while they would expect to receive an ACTION_2 message as a result of sending ACTION_1.
Here are a few possible ways this could be handled:
1) change state after sending ACTION_1 to ACTION_1_SENT. If we receive ACTION_1 in this state, we detect the race condition, and proceed to arbitrate who gets to start the sequence. However, I have no idea how to fairly arbitrate this. Since both ends are likely going to detect the race condition at about the same time, any action that follows will be prone to other similar race conditions, such as sending ACTION_1 again.
2) Duplicate the entire sequence of messages. If we receive ACTION_1 in the ACTION_1_SENT state, we include the data of the other ACTION_1 message in the ACTION_2 message, etc. This can only work if there is no need to decide who is the "owner" of the action, since both ends will end up doing the same action to each other.
3) Use absolute time stamps, but then, accurate time synchronization is not an easy thing at all.
4) Use lamport clocks, but from what I understood these are only useful for events that are causally related. Since in this case the ACTION_1 messages are not causally related, I don't see how it could help solve the problem of figuring out which one happened first to discard the second one.
5) Use some predefined way of discarding one of the two messages on receipt by both ends. However, I cannot find a way to do this that is unflawed. A naive idea would be to include a random number on both sides, and select the message with the highest number as the "winner", discarding the one with the lowest number. However, we have a tie if both numbers are equal, and then we need another way to recover from this. A possible improvement would be to deal with arbitration once at connection time and repeat similar sequence until one of the two "wins", marking it as favourite. Every time a tie happens, the favourite wins.
Does anybody have further ideas on how to handle this?
EDIT:
Here is the current solution I came up with. Since I couldn't find 100% safe way to prevent ties, I decided to have my protocol elect a "favorite" during the connection sequence. Electing this favorite requires breaking possible ties, but in this case the protocol will allow for trying multiple times to elect the favorite until a consensus is reached. After the favorite is elected, all further ties are resolved by favoring the elected favorite. This isolates the problem of possible ties to a single part of the protocol.
As for fairness in the election process, I wrote something rather simple based on two values sent in each of the client/server packets. In this case, this number is a sequence number starting at a random value, but they could be anything as long as those numbers are fairly random to be fair.
When the client and server have to resolve a conflict, they both call this function with the send (their value) and the recv (the other value) values. The favorite calls this function with the favorite parameter set to TRUE. This function is guaranteed to give the opposite result on both ends, such that it is possible to break the tie without retransmitting a new message.
BOOL ResolveConflict(BOOL favorite, UINT32 sendVal, UINT32 recvVal)
{
BOOL winner;
int sendDiff;
int recvDiff;
UINT32 xorVal;
xorVal = sendVal ^ recvVal;
sendDiff = (xorVal < sendVal) ? sendVal - xorVal : xorVal - sendVal;
recvDiff = (xorVal < recvVal) ? recvVal - xorVal : xorVal - recvVal;
if (sendDiff != recvDiff)
winner = (sendDiff < recvDiff) ? TRUE : FALSE; /* closest value to xorVal wins */
else
winner = favorite; /* break tie, make favorite win */
return winner;
}
Let's say that both ends enter the ACTION_1_SENT state after sending the ACTION_1 message. Both will receive the ACTION_1 message in the ACTION_1_SENT state, but only one will win. The loser accepts the ACTION_1 message and enters the ACTION_1_RCVD state, while the winner discards the incoming ACTION_1 message. The rest of the sequence continues as if the loser had never sent ACTION_1 in a race condition with the winner.
Let me know what you think, and how this could be further improved.
To me, this whole idea that this ACTION_1 - ACTION_2 - ACTION_3 handshake must occur in sequence with no other message intervening is very onerous, and not at all in line with the reality of networks (or distributed systems in general). The complexity of some of your proposed solutions give reason to step back and rethink.
There are all kinds of complicating factors when dealing with systems distributed over a network: packets which don't arrive, arrive late, arrive out of order, arrive duplicated, clocks which are out of sync, clocks which go backwards sometimes, nodes which crash/reboot, etc. etc. You would like your protocol to be robust under any of these adverse conditions, and you would like to know with certainty that it is robust. That means making it simple enough that you can think through all the possible cases that may occur.
It also means abandoning the idea that there will always be "one true state" shared by all nodes, and the idea that you can make things happen in a very controlled, precise, "clockwork" sequence. You want to design for the case where the nodes do not agree on their shared state, and make the system self-healing under that condition. You also must assume that any possible message may occur in any order at all.
In this case, the problem is claiming "ownership" of a shared clipboard. Here's a basic question you need to think through first:
If all the nodes involved cannot communicate at some point in time, should a node which is trying to claim ownership just go ahead and behave as if it is the owner? (This means the system doesn't freeze when the network is down, but it means you will have multiple "owners" at times, and there will be divergent changes to the clipboard which have to be merged or otherwise "fixed up" later.)
Or, should no node ever assume it is the owner unless it receives confirmation from all other nodes? (This means the system will freeze sometimes, or just respond very slowly, but you will never have weird situations with divergent changes.)
If your answer is #1: don't focus so much on the protocol for claiming ownership. Come up with something simple which reduces the chances that two nodes will both become "owner" at the same time, but be very explicit that there can be more than one owner. Put more effort into the procedure for resolving divergence when it does happen. Think that part through extra carefully and make sure that the multiple owners will always converge. There should be no case where they can get stuck in an infinite loop trying to converge but failing.
If your answer is #2: here be dragons! You are trying to do something which buts up against some fundamental limitations.
Be very explicit that there is a state where a node is "seeking ownership", but has not obtained it yet.
When a node is seeking ownership, I would say that it should send a request to all other nodes, at intervals (in case another one misses the first request). Put a unique identifier on each such request, which is repeated in the reply (so delayed replies are not misinterpreted as applying to a request sent later).
To become owner, a node should receive a positive reply from all other nodes within a certain period of time. During that wait period, it should refuse to grant ownership to any other node. On the other hand, if a node has agreed to grant ownership to another node, it should not request ownership for another period of time (which must be somewhat longer).
If a node thinks it is owner, it should notify the others, and repeat the notification periodically.
You need to deal with the situation where two nodes both try to seek ownership at the same time, and both NAK (refuse ownership to) each other. You have to avoid a situation where they keep timing out, retrying, and then NAKing each other again (meaning that nobody would ever get ownership).
You could use exponential backoff, or you could make a simple tie-breaking rule (it doesn't have to be fair, since this should be a rare occurrence). Give each node a priority (you will have to figure out how to derive the priorities), and say that if a node which is seeking ownership receives a request for ownership from a higher-priority node, it will immediately stop seeking ownership and grant it to the high-priority node instead.
This will not result in more than one node becoming owner, because if the high-priority node had previously ACKed the request sent by the low-priority node, it would not send a request of its own until enough time had passed that it was sure its previous ACK was no longer valid.
You also have to consider what happens if a node becomes owner, and then "goes dark" -- stops responding. At what point are other nodes allowed to assume that ownership is "up for grabs" again? This is a very sticky issue, and I suspect you will not find any solution which eliminates the possibility of having multiple owners at the same time.
Probably, all the nodes will need to "ping" each other from time to time. (Not referring to an ICMP echo, but something built in to your own protocol.) If the clipboard owner can't reach the others for some period of time, it must assume that it is no longer owner. And if the others can't reach the owner for a longer period of time, they can assume that ownership is available and can be requested.
Here is a simplified answer for the protocol of interest here.
In this case, there is only a client and a server, communicating over TCP. The goal of the protocol is to two system clipboards. The regular state when outside of a particular sequence is simply "CLIPBOARD_ESTABLISHED".
Whenever one of the two systems pastes something onto its clipboard, it sends a ClipboardFormatListReq message, and transitions to the CLIPBOARD_FORMAT_LIST_REQ_SENT state. This message contains a sequence number that is incremented when sending the ClipboardFormatListReq message. Under normal circumstances, no race condition occurs and a ClipboardFormatListRsp message is sent back to acknowledge the new sequence number and owner. The list contained in the request is used to expose clipboard data formats offered by the owner, and any of these formats can be requested by an application on the remote system.
When an application requests one of the data formats from the clipboard owner, a ClipboardFormatDataReq message is sent with the sequence number, and format id from the list, the state is changed to CLIPBOARD_FORMAT_DATA_REQ_SENT. Under normal circumstances, there is no change of clipboard ownership during that time, and the data is returned in the ClipboardFormatDataRsp message. A timer should be used to timeout if no response is sent fast enough from the other system, and abort the sequence if it takes too long.
Now, for the special cases:
If we receive ClipboardFormatListReq in the CLIPBOARD_FORMAT_LIST_REQ_SENT state, it means both systems are trying to gain ownership at the same time. Only one owner should be selected, in this case, we can keep it simple an elect the client as the default winner. With the client as the default owner, the server should respond to the client with ClipboardFormatListRsp consider the client as the new owner.
If we receive ClipboardFormatDataReq in the CLIPBOARD_FORMAT_LIST_REQ_SENT state, it means we have just received a request for data from the previous list of data formats, since we have just sent a request to become the new owner with a new list of data formats. We can respond with a failure right away, and sequence numbers will not match.
Etc, etc. The main issue I was trying to solve here is fast recovery from such states, with going into a loop of retrying until it works. The main issue with immediate retrial is that it is going to happen with timing likely to cause new race conditions. We can solve the issue by expecting such inconsistent states as long as we can move back to proper protocol states when detecting them. The other part of the problem is with electing a "winner" that will have its request accepted without resending new messages. A default winner can be elected by default, such as the client or the server, or some sort of random voting system can be implemented with a default favorite to break ties.

Arc-Consistency (AC3) and one Challenges?

In Applying Arc-Consistency (AC3) algorithms on one Constraint Satisfaction Problem, if domain of one variable be empty, what is the next step?
1) halt.
2) do backtrack.
3) start from another initial state.
4) it depends on that we are in which step.
Solution (4). I think (1) is true because it mean we cannot find any consistent assignment and halt. anyone can describe why (4) is true?
With the particular algorithm you're using, if the domain of a variable shrinks until it is empty, then it means that the constraint problem has no solutions. Therefore the algorithm should halt in the failure state.

Difficulty satisfying hard and medium constraints simultaneously with Optaplanner

I've implemented a sensor scheduling problem using OptaPlanner 6.2 that has 1 hard constraint, 1 medium constraint, and 1 soft constraint. The trouble I'm having is that either I can satisfy some of the hard constraints after 30 seconds of or so, and then the solver makes very little progress satisfying them constraints with additional minutes of termination. I don't think the problem is over constrained; I also don't know how to help the local search process significantly increase the score.
My problem is a scheduling one, where I precalculate all possible times (intervals) that a sensor can observe objects prior to solving. I've modeled the problem as follows:
Hard constraint - no intervals can overlap
when
$s1: A( interval!=null,$id: id, $doy : interval.doy, $interval: interval, $sensor: interval.getSensor())
exists A( id > $id, interval!=null, $interval2: interval, $interval2.getSensor() == $sensor, $interval2.getDoy() == $doy, $interval.getStartSlot() <= $interval2.getEndSlot(), $interval.getEndSlot() >= $interval2.getStartSlot() )
then
scoreHolder.addHardConstraintMatch(kcontext,-10000);
Medium constraint - every assignment should have an Interval
when
A(interval==null)
then
scoreHolder.addMediumConstraintMatch(kcontext,-100);
Soft constraint - maximize a property/value in the Interval class
when
$s1: A( interval!=null)
then
scoreHolder.addSoftConstraintMatch(kcontext,-1 * $s1.getInterval().getSomeProperty())
A: entity planning class; each instance is an assignment for a particular object (i.e has an member objectid that corresponds with one in the Interval class)
Interval: planning variable class, all possible intervals (start time, stop time) for the sensor and objects
In A, I restrict access to B instances (intervals) to just those for the object associated with that assignment. For my test case, there are 40000 or so Intervals, but only dozens for each object. There are about 1100 instances of A (so dozens of possible Intervals for each one).
#PlanningVariable(valueRangeProviderRefs = {"intervalRange"},strengthComparatorClass = IntervalStrengthComparator.class, nullable=true)
public Interval getInterval() {
return interval;
}
#ValueRangeProvider(id = "intervalRange")
public List<Interval> getPossibleIntervalList() {
return task.getAvailableIntervals();
}
In my solution class:
//have tried commenting this out since the overall interval list does not apply to all A
#ValueRangeProvider (id="intervalRangeAll")
public List getIntervalList() {
return intervals;
}
#PlanningEntityCollectionProperty
public List<A> getAList() {
return AList;
}
I've reviewed the documentation and tried a lot of things. My problem is somewhat of a cross between the nurserostering and course scheduling examples, which I have looked at. I am using the FIRST_FIT_DECREASING construction heuristic.
What I have tried:
Turning on and off nullable in the planning variable annotation for A.getInterval()
Late acceptance, Tabu, both.
Benchmarking. I wasn't seeing any problems and average
Adding an IntervalChangeFactory as a moveListFactory. Restricting the custom ChangeMove to whether the interval can be accepted or not (i.e. enforcing or not the hard constraint in the IntervalChangeMove.isDoable).
Here is an example one, where most of the hard constraints are not satisfied, but the medium ones are:
[main] INFO org.optaplanner.core.impl.solver.DefaultSolver - Solving started: time spent (202), best score (0hard/-112500medium/0soft), environment mode (REPRODUCIBLE), random (WELL44497B with seed 987654321).
[main] INFO org.optaplanner.core.impl.constructionheuristic.DefaultConstructionHeuristicPhase - Construction Heuristic phase (0) ended: step total (1125), time spent (2296), best score (-9100000hard/0medium/-72608soft).
[main] INFO org.optaplanner.core.impl.localsearch.DefaultLocalSearchPhase - Local Search phase (1) ended: step total (92507), time spent (30000), best score (-8850000hard/0medium/-74721soft).
[main] INFO org.optaplanner.core.impl.solver.DefaultSolver - Solving ended: time spent (30000), best score (-8850000hard/0medium/-74721soft), average calculate count per second (5643), environment mode (REPRODUCIBLE).
So I don't understand is why the hard constraints can't be deal with by the search process. Any my calculate count per second has dropped to below 10000 due to all the tinkering I've done.
If it's not due to the Score trap (see docs, this is the first thing to fix), it's probably because it gets stuck in a local optima and there are no moves that go from 1 feasible solution to another feasible solution except those that don't really change much. There are several ways to deal with that:
Add course-grained moves (but still leave the fine-grained moves such as ChangeMove in!). You can add generic course grained moves (such as the pillar moves) or add a custom Move. Don't start making smarter selectors that try to only select feasible moves, that's a bad idea (as it will kill your ACC or will limit your search space). Just mix in course grained moves (= diversification) to complement the fine grained moves (= intensification).
A better domain model might help too. The Project Job Scheduling and Cheap Time scheduling examples have a domain model which naturally leads to a smaller search space while still allowing all feasible solutions.
Increase Tabu list size, LA size or use a hard constraint in the SA starting temperature. But I presume you've tried that with the benchmarker.
Turn on TRACE logging to see optaplanner's decision making. Only look at the part after it reaches the local optimum.
In the future, I 'll also add Ruin&Recreate moves, which will be far less code than custom moves or a better domain model (but it will be less efficient).

Chicken or egg in functions/actors processing WorldState

I have read the series "Purely Functional Retrogames"
http://prog21.dadgum.com/23.html
It discusses some interesting techniques to build a (semi-)pure game world update loop.
However, I have the following side remark, that I cannot seem to get my head around:
Suppose you have a system, where every enemy, and every player are separate actors, or separate pure functions.
Suppose they all get a "WorldState" as input, and output a New WorldState (or, if you're thinking in actor terms, send the new WorldState to the next actor, ending with for instance a "Game Render" actor).
Then, there's two ways to go with this:
Either you start with one actor, (f.i. the player), and feed him the "current world".
Then, you feed the new world, to the next enemy, and so on, until all actors have converted the worlds. Then, the last world is the new world you can feed to the render loop. (or, if you followed the above article, you end up with a list of events that have occurred in the world, which can be processed).
A second way, is to just give all actors the current WorldState at the same time. They generate any changes, which may conflict (for instance, two enemies and the player can take a coin in the same animation frame) -> it is up to the game system to solve these conflicts by processing the events. By processing all events, the Game actor creates the new world, to be used in the next update frame.
I have a feeling I'm just confronted with the exact same "race condition" problem I wished to avoid by using pure functions with immutable data.
Any advice here?
I didn't read the article, but with the coin example you are creating a kind of global variable: you give a copy of the world state to all actors and you suppose that each actor will evaluate the game, take decision and expect that their action will succeed, regardless of the last phase which is the conflict solving. I would not call this a race condition but rather a "blind condition", and yes, this will not work.
I imagine that you came to this solution in order to allow parallelism, not available in solution 1. In my opinion, the problem is about responsibility.
The coin must belong to an actor (a server acting as resource manager) as anything in the application. This actor is the only responsible to decide what will happen to the coin.
All requests (is there something to grab, grab it, drop something...) should be sent to this actor (one actor per cell, or per map, level or any split that make sense for the game).
The way you will manage it is then up to you: serve all requests in the receive order, buffer them until a synchro message comes and make a random decision or priority decision... In any case the server will be able to reply to all actors with success or failure without any risk of race condition since the server process is run (at least in erlang) on a single core and proceed one message at a time.
In addition to Pascal answer, you can solve parallelization by splitting (i assume huge map) to smaller chunks which depend on last state (or part of it, like an edge) of its neighbours. This allows you to distribute this game among many nodes.

Modeling an HTTP transition system in Alloy

I want to model an HTTP interaction, i.e. a sequence of HTTPRequest/HTTPResponse, and I am trying to model this as a transition system.
I defined an ordering on a class State by using:
open util/ordering[State]
where a State is simply a set of Messages:
sig State {
msgSet: set Message
}
Each pair of (HTTPRequest->HTTPResponse) and (HTTPResponse->HTTPRequest) is represented as a rule in my transition system.
The rules are expressed in Alloy as predicates that let one move from one state to another.
E.g., this is a rule generating an HTTPResponse after a particular HTTPRequest is received:
pred rsp1 [s, s': State] {
one msg: Request, msg':Response | (
// Preconditions (previous Request)
msg.method=get &&
msg.address.url=sample_com &&
// Postconditions (next Response)
msg'.status=OK_200 &&
// previous Request has to be in previous state
msg in s.msgSet &&
// Response generated is added to next state
s'.msgSet = s.msgSet + msg'
}
Unfortunately, the model created seems to be too complex: we have a dozen of rules (more complex than the one above but following the same pattern) and the execution is very slow.
EDIT: In particular, the CNF generation is extremely slow, while the solving takes a reasonable amount of time.
Do you have any suggestion on how to model a similar transition system?
Thank you very much!
This is a model with an impressive level of detail; thank you for sharing it!
None of the various forms of honestAction by itself takes more than two or three minutes to find an instance (or in some cases to fail to find any instance), except for rsp8, which takes quite a while by itself (it ran for fifteen minutes or so before I stopped it).
So the long CNF preparation times you are observing are apparently caused by either (a) just predicate rsp8 that's causing your time issues, or (b) the size of the disjunction in the honestAction predicate, or (c) both.
I suspect but have not proved that the time issue is caused by combinatorial explosion in the number of individuals required to populate a model and the number of constraints in the model.
My first instinct (it's not more than that) would be to cut back on the level of detail in the model, in particular the large number of singleton signatures which instantiate your abstract signatures. These seem (I could be wrong) to be present either for bookkeeping purposes (so you can identify which rule licenses the transition from one state to another), or because the modeler doesn't trust Alloy to generate concrete instances of signatures like UserName, Password, Code, etc.
As the model now is, it looks as if you're doing a lot of work to define all the individuals involved in a particular example, instead of defining constraints and letting Alloy do the work of finding examples. (Using Alloy to check the properties a particular concrete example can be useful, but there are other ways to do that.)
Since so many of the concrete signatures in the model are constrained to singleton cardinality, I don't actually know that defining them makes the task of finding models more complex; for all I know, it makes it simpler. But my instinct is to think that it would be more useful to know (as well as possibly easier for Alloy to establish) that state transitions have a particular property in general, no matter what hosts, users, and URIs are involved, than to know that property rsp1 applies in all the cases where the host is named examplecom and the address URI is example_url_https and whatnot.
I conjecture that reducing the number of individuals whose existence and properties are prescribed, and the constraints on which individuals can be involved in which state transitions, will reduce the CNF generation time.
If your long-term goal is to test long sequences of state transitions to test whether from a given starting point it's possible or impossible to arrive at a particular state (or kind of state), you may need to re-think the approach to enable shorter sequences of state transitions to do the job.
A second conjecture would involve less restructuring of the model. For reasons I don't think I understand fully, sometimes quantification with one seems to hurt rather than help performance, as in this example, where explicitly quantifying some variables with some instead of one turned out to make a problem tractable instead of intractable.
That question involves quantification in a predicate, not in the model overall, and the quantification with one wasn't intended in the first place, so it may not be relevant here. But we can test the effect of the one keyword on this model in a simple way: I commented out everything in honestAction except rsp8 and ran the predicate first != last in a scope of 8, once with most of the occurrences of one commented out and once with those keywords intact. With the one keywords commented out, the Analyser ran the problem in 24 seconds or so; with the one keywords in place, it ran for 500 seconds so far before I decided the point was made and terminated it.
So I'd try removing the keyword one from all of the signatures with instance-specific individuals, leaving it only on get, post, OK_200, etc., and appData. I would also try doing without the various subtypes of Key, SessionID, URL, Host, UserName, and Password, or at least constraining their cardinality in the run command.

Resources