OpenvSwitch port missing in large load, long poll interval observed - openstack

ISSUE description
I have a OpenStack system with HA management network (VIP) via ovs (Open vSwitch) port, it's found in this system, with high load (concurrently volume-from-glance-image creation), the VIP port (an ovs port) will be missing.
Analysis
For now, with default log level from log file, the only thing observed is as below the Unreasonably long 62741ms poll interval.
2017-12-29T16:40:38.611Z|00001|timeval(revalidator70)|WARN|Unreasonably long 62741ms poll interval (0ms user, 0ms system)
Idea for now
I will turn debug log on for file and try reproducing the issue:
sudo ovs-appctl vlog/set file:dbg
Question
What else should I do during/after of the issue reproduction please?
Is this issue typical? Caused by what if yes?
I googled OpenvSwitch trouble shoot or other related key words while information was all on data flow/table level instead of this ovs-vswitchd level ( am I right? )
Many thanks!
BR//Wey

This issue was not reproduced and thus I forgot about it, until recently, 2 years afterward, I had a chance to get in touch with this issue in a different environment, and this time I have more ideas on its root cause.
It could be caused by the shift that comes in the bonding, for some reason, the traffic pattern fits the situation of triggering shifts again and again(the condition is quite strong I would say but there is a chance to be hit anyway, right?).
The condition of the shift was quoted as below and please refer to the full doc here: https://docs.openvswitch.org/en/latest/topics/bonding/
Bond Packet Output
When a packet is sent out a bond port, the bond member actually used is selected based on the packet’s source MAC and VLAN tag (see bond_choose_output_member()). In particular, the source MAC and VLAN tag are hashed into one of 256 values, and that value is looked up in a hash table (the “bond hash”) kept in the bond_hash member of struct port. The hash table entry identifies a bond member. If no bond member has yet been chosen for that hash table entry, vswitchd chooses one arbitrarily.
Every 10 seconds, vswitchd rebalances the bond members (see bond_rebalance()). To rebalance, vswitchd examines the statistics for the number of bytes transmitted by each member over approximately the past minute, with data sent more recently weighted more heavily than data sent less recently. It considers each of the members in order from most-loaded to least-loaded. If highly loaded member H is significantly more heavily loaded than the least-loaded member L, and member H carries at least two hashes, then vswitchd shifts one of H’s hashes to L. However, vswitchd will only shift a hash from H to L if it will decrease the ratio of the load between H and L by at least 0.1.
Currently, “significantly more loaded” means that H must carry at least 1 Mbps more traffic, and that traffic must be at least 3% greater than L’s.

Related

Handling Race Conditions / Concurrency in Network Protocol Design

I am looking for possible techniques to gracefully handle race conditions in network protocol design. I find that in some cases, it is particularly hard to synchronize two nodes to enter a specific protocol state. Here is an example protocol with such a problem.
Let's say A and B are in an ESTABLISHED state and exchange data. All messages sent by A or B use a monotonically increasing sequence number, such that A can know the order of the messages sent by B, and A can know the order of the messages sent by B. At any time in this state, either A or B can send a ACTION_1 message to the other, in order to enter a different state where a strictly sequential exchange of message needs to happen:
send ACTION_1
recv ACTION_2
send ACTION_3
However, it is possible that both A and B send the ACTION_1 message at the same time, causing both of them to receive an ACTION_1 message, while they would expect to receive an ACTION_2 message as a result of sending ACTION_1.
Here are a few possible ways this could be handled:
1) change state after sending ACTION_1 to ACTION_1_SENT. If we receive ACTION_1 in this state, we detect the race condition, and proceed to arbitrate who gets to start the sequence. However, I have no idea how to fairly arbitrate this. Since both ends are likely going to detect the race condition at about the same time, any action that follows will be prone to other similar race conditions, such as sending ACTION_1 again.
2) Duplicate the entire sequence of messages. If we receive ACTION_1 in the ACTION_1_SENT state, we include the data of the other ACTION_1 message in the ACTION_2 message, etc. This can only work if there is no need to decide who is the "owner" of the action, since both ends will end up doing the same action to each other.
3) Use absolute time stamps, but then, accurate time synchronization is not an easy thing at all.
4) Use lamport clocks, but from what I understood these are only useful for events that are causally related. Since in this case the ACTION_1 messages are not causally related, I don't see how it could help solve the problem of figuring out which one happened first to discard the second one.
5) Use some predefined way of discarding one of the two messages on receipt by both ends. However, I cannot find a way to do this that is unflawed. A naive idea would be to include a random number on both sides, and select the message with the highest number as the "winner", discarding the one with the lowest number. However, we have a tie if both numbers are equal, and then we need another way to recover from this. A possible improvement would be to deal with arbitration once at connection time and repeat similar sequence until one of the two "wins", marking it as favourite. Every time a tie happens, the favourite wins.
Does anybody have further ideas on how to handle this?
EDIT:
Here is the current solution I came up with. Since I couldn't find 100% safe way to prevent ties, I decided to have my protocol elect a "favorite" during the connection sequence. Electing this favorite requires breaking possible ties, but in this case the protocol will allow for trying multiple times to elect the favorite until a consensus is reached. After the favorite is elected, all further ties are resolved by favoring the elected favorite. This isolates the problem of possible ties to a single part of the protocol.
As for fairness in the election process, I wrote something rather simple based on two values sent in each of the client/server packets. In this case, this number is a sequence number starting at a random value, but they could be anything as long as those numbers are fairly random to be fair.
When the client and server have to resolve a conflict, they both call this function with the send (their value) and the recv (the other value) values. The favorite calls this function with the favorite parameter set to TRUE. This function is guaranteed to give the opposite result on both ends, such that it is possible to break the tie without retransmitting a new message.
BOOL ResolveConflict(BOOL favorite, UINT32 sendVal, UINT32 recvVal)
{
BOOL winner;
int sendDiff;
int recvDiff;
UINT32 xorVal;
xorVal = sendVal ^ recvVal;
sendDiff = (xorVal < sendVal) ? sendVal - xorVal : xorVal - sendVal;
recvDiff = (xorVal < recvVal) ? recvVal - xorVal : xorVal - recvVal;
if (sendDiff != recvDiff)
winner = (sendDiff < recvDiff) ? TRUE : FALSE; /* closest value to xorVal wins */
else
winner = favorite; /* break tie, make favorite win */
return winner;
}
Let's say that both ends enter the ACTION_1_SENT state after sending the ACTION_1 message. Both will receive the ACTION_1 message in the ACTION_1_SENT state, but only one will win. The loser accepts the ACTION_1 message and enters the ACTION_1_RCVD state, while the winner discards the incoming ACTION_1 message. The rest of the sequence continues as if the loser had never sent ACTION_1 in a race condition with the winner.
Let me know what you think, and how this could be further improved.
To me, this whole idea that this ACTION_1 - ACTION_2 - ACTION_3 handshake must occur in sequence with no other message intervening is very onerous, and not at all in line with the reality of networks (or distributed systems in general). The complexity of some of your proposed solutions give reason to step back and rethink.
There are all kinds of complicating factors when dealing with systems distributed over a network: packets which don't arrive, arrive late, arrive out of order, arrive duplicated, clocks which are out of sync, clocks which go backwards sometimes, nodes which crash/reboot, etc. etc. You would like your protocol to be robust under any of these adverse conditions, and you would like to know with certainty that it is robust. That means making it simple enough that you can think through all the possible cases that may occur.
It also means abandoning the idea that there will always be "one true state" shared by all nodes, and the idea that you can make things happen in a very controlled, precise, "clockwork" sequence. You want to design for the case where the nodes do not agree on their shared state, and make the system self-healing under that condition. You also must assume that any possible message may occur in any order at all.
In this case, the problem is claiming "ownership" of a shared clipboard. Here's a basic question you need to think through first:
If all the nodes involved cannot communicate at some point in time, should a node which is trying to claim ownership just go ahead and behave as if it is the owner? (This means the system doesn't freeze when the network is down, but it means you will have multiple "owners" at times, and there will be divergent changes to the clipboard which have to be merged or otherwise "fixed up" later.)
Or, should no node ever assume it is the owner unless it receives confirmation from all other nodes? (This means the system will freeze sometimes, or just respond very slowly, but you will never have weird situations with divergent changes.)
If your answer is #1: don't focus so much on the protocol for claiming ownership. Come up with something simple which reduces the chances that two nodes will both become "owner" at the same time, but be very explicit that there can be more than one owner. Put more effort into the procedure for resolving divergence when it does happen. Think that part through extra carefully and make sure that the multiple owners will always converge. There should be no case where they can get stuck in an infinite loop trying to converge but failing.
If your answer is #2: here be dragons! You are trying to do something which buts up against some fundamental limitations.
Be very explicit that there is a state where a node is "seeking ownership", but has not obtained it yet.
When a node is seeking ownership, I would say that it should send a request to all other nodes, at intervals (in case another one misses the first request). Put a unique identifier on each such request, which is repeated in the reply (so delayed replies are not misinterpreted as applying to a request sent later).
To become owner, a node should receive a positive reply from all other nodes within a certain period of time. During that wait period, it should refuse to grant ownership to any other node. On the other hand, if a node has agreed to grant ownership to another node, it should not request ownership for another period of time (which must be somewhat longer).
If a node thinks it is owner, it should notify the others, and repeat the notification periodically.
You need to deal with the situation where two nodes both try to seek ownership at the same time, and both NAK (refuse ownership to) each other. You have to avoid a situation where they keep timing out, retrying, and then NAKing each other again (meaning that nobody would ever get ownership).
You could use exponential backoff, or you could make a simple tie-breaking rule (it doesn't have to be fair, since this should be a rare occurrence). Give each node a priority (you will have to figure out how to derive the priorities), and say that if a node which is seeking ownership receives a request for ownership from a higher-priority node, it will immediately stop seeking ownership and grant it to the high-priority node instead.
This will not result in more than one node becoming owner, because if the high-priority node had previously ACKed the request sent by the low-priority node, it would not send a request of its own until enough time had passed that it was sure its previous ACK was no longer valid.
You also have to consider what happens if a node becomes owner, and then "goes dark" -- stops responding. At what point are other nodes allowed to assume that ownership is "up for grabs" again? This is a very sticky issue, and I suspect you will not find any solution which eliminates the possibility of having multiple owners at the same time.
Probably, all the nodes will need to "ping" each other from time to time. (Not referring to an ICMP echo, but something built in to your own protocol.) If the clipboard owner can't reach the others for some period of time, it must assume that it is no longer owner. And if the others can't reach the owner for a longer period of time, they can assume that ownership is available and can be requested.
Here is a simplified answer for the protocol of interest here.
In this case, there is only a client and a server, communicating over TCP. The goal of the protocol is to two system clipboards. The regular state when outside of a particular sequence is simply "CLIPBOARD_ESTABLISHED".
Whenever one of the two systems pastes something onto its clipboard, it sends a ClipboardFormatListReq message, and transitions to the CLIPBOARD_FORMAT_LIST_REQ_SENT state. This message contains a sequence number that is incremented when sending the ClipboardFormatListReq message. Under normal circumstances, no race condition occurs and a ClipboardFormatListRsp message is sent back to acknowledge the new sequence number and owner. The list contained in the request is used to expose clipboard data formats offered by the owner, and any of these formats can be requested by an application on the remote system.
When an application requests one of the data formats from the clipboard owner, a ClipboardFormatDataReq message is sent with the sequence number, and format id from the list, the state is changed to CLIPBOARD_FORMAT_DATA_REQ_SENT. Under normal circumstances, there is no change of clipboard ownership during that time, and the data is returned in the ClipboardFormatDataRsp message. A timer should be used to timeout if no response is sent fast enough from the other system, and abort the sequence if it takes too long.
Now, for the special cases:
If we receive ClipboardFormatListReq in the CLIPBOARD_FORMAT_LIST_REQ_SENT state, it means both systems are trying to gain ownership at the same time. Only one owner should be selected, in this case, we can keep it simple an elect the client as the default winner. With the client as the default owner, the server should respond to the client with ClipboardFormatListRsp consider the client as the new owner.
If we receive ClipboardFormatDataReq in the CLIPBOARD_FORMAT_LIST_REQ_SENT state, it means we have just received a request for data from the previous list of data formats, since we have just sent a request to become the new owner with a new list of data formats. We can respond with a failure right away, and sequence numbers will not match.
Etc, etc. The main issue I was trying to solve here is fast recovery from such states, with going into a loop of retrying until it works. The main issue with immediate retrial is that it is going to happen with timing likely to cause new race conditions. We can solve the issue by expecting such inconsistent states as long as we can move back to proper protocol states when detecting them. The other part of the problem is with electing a "winner" that will have its request accepted without resending new messages. A default winner can be elected by default, such as the client or the server, or some sort of random voting system can be implemented with a default favorite to break ties.

Why does the number of links not matter in the transmission time of circuit switching?

I am reading a book about networking, and it says that, in a circuit switching environment, the number of links and switches a signal has to go through before getting to destination does not affect the overall time it takes for all the signal to be received. On the other hand, in a packet switching scenario the number of links and switches does make a difference. I spent quite a bit of time trying to figure it out, but I can't seem to get it. Why is that?
To drastically over-simplify, effectively a circuit-switched environment has a direct line from the transmitter to the transmitted once a connection has been established; imagine an old-fashioned phone-call going through switchboards. Therefore the transmission time is the same regardless of hops (well, ignoring the physical time it takes the signal to move over the wire, which very small since it's moving at the speed of light).
In a packet-switched environment, there is no direct connection. A packet of data is sent from the transmitter to the first hop, which tries to calculate an open route to the destination. It then passes its data onto the next hop, which again has to calculate the next available hop, and so on. This takes time that linearly increases with the number of hops. Think sending a letter through the US postal system. It has to go from your house to a post office, then the post office to a local distribution center, then from the local distribution center to the national one, then from that to the recipient's local distribution center, then to the recipient's post office, then finally to their house.
The difference is that only one connection at a time per circuit can exist on a circuit-switched network; again, think phone line with someone using it to make a call. Whereas in a packet switched network many transmitters and receivers can be sending data at the same time; again, think many people sending/recieving letters.

A peer-to-peer and privacy-aware data mining/aggregation algorithm: is it possible?

Suppose I have a network of N nodes, each with a unique identity (e.g. public key) communicating with a central-server-less protocol (e.g. DHT, Kad). Each node stores a variable V. With reference to e-voting as an easy example, that variable could be the name of a candidate.
Now I want to execute an "aggregation" function on all V variables available in the network. With reference to e-voting example, I want to count votes.
My question is completely theoretical (I have to prove a statement, details at the end of the question), so please don't focus on the e-voting and all of its security aspects. Do I have to say it again? Don't answer me that "a node may have any number identities by generating more keys", "IPs can be traced back" etc. because that's another matter.
Let's see the distributed aggregation only from the privacy point of view.
THE question
Is it possible, in a general case, for a node to compute a function of variables stored at other nodes without getting their value associated to the node's identity? Did researchers design such a privacy-aware distributed algorithm?
I'm only dealing with privacy aspects, not general security!
Current thoughts
My current answer is no, so I say that a central server, obtaining all Vs and processes them without storing, is necessary and there are more legal than technical means to assure that no individual node's data is either stored or retransmitted by the central server. I'm asking to prove that my previous statement is false :)
In the e-voting example, I think it's impossible to count how many people voted for Alice and Bob without asking all the nodes, one by one "Hey, who do you vote for?"
Real case
I'm doing research in the Personal Data Store field. Suppose you store your call log in the PDS and somebody wants to find statistical values about the phone calls (i.e. mean duration, number of calls per day, variance, st-dev) without being revealed neither aggregated nor punctual data about an individual (that is, nobody must know neither whom do I call, nor my own mean call duration).
If a trusted broker exists, and everybody trusts it, that node can expose a double getMeanCallDuration() API that first invokes CallRecord[] getCalls() on every PDS in the network and then operates statistics on all rows. Without the central trusted broker, each PDS exposing double getMyMeanCallDuration() isn't statistically usable (the mean of the means shouldn't be the mean of all...) and most importantly reveals the identity of the single user.
Yes, it is possible. There is work that actually answers your question solving the problem, given some assumptions. Check the following paper: Privacy, efficiency & fault tolerance in aggregate computations on massive star networks.
You can do some computation (for example summing) of a group of nodes at another node without having the participants nodes to reveal any data between themselves and not even the node that is computing. After the computation, everyone learns the result (but no one learns any individual data besides their own which they knew already anyways). The paper describes the protocol and proves its security (and the protocol itself gives you the privacy level I just described).
As for protecting the identity of the nodes to unlink their value from their identity, that would be another problem. You could use anonymous credentials (check this: https://idemix.wordpress.com/2009/08/18/quick-intro-to-credentials/) or something alike to show that you are who you are without revealing your identity (in a distributed scenario).
The catch of this protocol is that you need a semi-trusted node to do the computation. A fully distributed protocol (for example, in a P2P network scenario) is not that easy though. Not because of a lack of a storage (you can have a DHT, for example) but rather you need to replace that trusted or semi-trusted node by the network, and that is when you find your issues, who does it? Why that one and not another one? And what if there is a collusion? Etc...
How about when each node publishes two sets of data x and y, such that
x - y = v
Assuming that I can emit x and y independently, you can correctly compute the overall mean and sum, while every single message is largely worthless.
So for the voting example and candidates X, Y, Z, I might have one identity publishing the vote
+2 -1 +3
and my second identity publishes the vote:
-2 +2 -3
But of course you cannot verify that I didn't vote multiple times anymore.

defining the time it takes to do something (latency, throughput, bandwidth)

I understand latency - the time it takes for a message to go from sender to recipient - and bandwidth - the maximum amount of data that can be transferred over a given time - but I am struggling to find the right term to describe a related thing:
If a protocol is conversation-based - the payload is split up over many to-and-fros between the ends - then latency affects 'throughput'1.
1 What is this called, and is there a nice concise explanation of this?
Surfing the web, trying to optimize the performance of my nas (nas4free) I came across a page that described the answer to this question (imho). Specifically this section caught my eye:
"In data transmission, TCP sends a certain amount of data then pauses. To ensure proper delivery of data, it doesn’t send more until it receives an acknowledgement from the remote host that all data was received. This is called the “TCP Window.” Data travels at the speed of light, and typically, most hosts are fairly close together. This “windowing” happens so fast we don’t even notice it. But as the distance between two hosts increases, the speed of light remains constant. Thus, the further away the two hosts, the longer it takes for the sender to receive the acknowledgement from the remote host, reducing overall throughput. This effect is called “Bandwidth Delay Product,” or BDP."
This sounds like the answer to your question.
BDP as wikipedia describes it
To conclude, it's called Bandwidth Delay Product (BDP) and the shortest explanation I've found is the one above. (Flexo has noted this in his comment too.)
Could goodput be the term you are looking for?
According to wikipedia:
In computer networks, goodput is the application level throughput, i.e. the number of useful bits per unit of time forwarded by the network from a certain source address to a certain destination, excluding protocol overhead, and excluding retransmitted data packets.
Wikipedia Goodput link
The problem you describe arises in communications which are synchronous in nature. If there was no need to acknowledge receipt of information and it was certain to arrive then the sender could send as fast as possible and the throughput would be good regardless of the latency.
When there is a requirement for things to be acknowledged then it is this synchronisation that cause this drop in throughput and the degree to which the communication (i.e. sending of acknowledgments) is allowed to be asynchronous or not controls how much it hurts the throughput.
'Round-trip time' links latency and number of turns.
Or: Network latency is a function of two things:
(i) round-trip time (the time it takes to complete a trip across the network); and
(ii) the number of times the application has to traverse it (aka turns).

Measuring time difference between networked devices

I'm adding networked multiplayer to a game I've made. When the server sends an update packet to the client, I include a timestamp so that the client knows exactly when that information is valid. However, the server computer and the client computer might have their clocks set to different times (maybe even just a few seconds difference), so the timestamp from the server needs to be translated to the client's local time.
So, I'd like to know the best way to calculate the time difference between the server and the client. Currently, the client pings the server for a time stamp during initialization, takes note of when the request was sent and when it was answered, and guesses that the time stamp was generated roughly halfway along the journey. The client also runs 10 of these trials and takes the average.
But, the problem is that I'm getting different results over repeated runs of the program. Within each set of 10, each measurement rarely diverges by more than 400 milliseconds, which might be acceptable. But if I wait a few minutes between each run of the program, the resulting averages might disagree by as much as 2 seconds, which is not acceptable.
Is there a better way to figure out the difference between the clocks of two networked devices? Or is there at least a way to tweak my algorithm to yield more accurate results?
Details that may or may not be relevant: The devices are iPod Touches communicating over Bluetooth. I'm measuring pings to be anywhere from 50-200 milliseconds. I can't ask the users to sync up their clocks. :)
Update: With the help of the below answers, I wrote an objective-c class to handle this. I posted it on my blog: http://scooops.blogspot.com/2010/09/timesync-was-time-sink.html
I recently took a one-hour class on this and it wasn't long enough, but I'll try to boil it down to get you pointed in the right direction. Get ready for a little algebra.
Let s equal the time according to the server. Let c equal the time according to the client. Let d = s - c. d is what is added to the client's time to correct it to the server's time, and is what we need to solve for.
First we send a packet from the server to the client with a timestamp. When that packet is received at the client, it stores the difference between the given timestamp and its own clock as t1.
The client then sends a packet to the server with its own timestamp. The server sends the difference between the timestamp and its own clock back to the client as t2.
Note that t1 and t2 both include the "travel time" t of the packet plus the time difference between the two clocks d. Assuming for the moment that the travel time is the same in both directions, we now have two equations in two unknowns, which can be solved:
t1 = t - d
t2 = t + d
t1 + d = t2 - d
d = (t2 - t1)/2
The trick comes because the travel time is not always constant, as evidenced by your pings between 50 and 200 ms. It turns out to be most accurate to use the timestamps with the minimum ping time. That's because your ping time is the sum of the "bare metal" delay plus any delays spent waiting in router queues. Every once in a while, a lucky packet gets through without any queuing delays, so you use that minimum time as the most repeatable time.
Also keep in mind that clocks run at different rates. For example, I can reset my computer at home to the millisecond and a day later it will be 8 seconds slow. That means you have to continually readjust d. You can use the slope of various values of d computed over time to calculate your drift and compensate for it in between measurements, but that's beyond the scope of an answer here.
Hope that helps point you in the right direction.
Your algorithm will not be much more accurate unless you can use some statistical methods. First of all, 10 is probably not sufficient. The first and simplest change would be to gather 100 transit time samples and toss out the x longest and shortest.
Another thing to add would be that both clients send their own timestamp in each packet. Then you can also calculate how different their clocks are and check the average difference between the clocks.
You can also check up on STNP and NTP implementations specifically, as these protocols do this specifically.

Resources