Cisco IP SLA, making it more resilient - networking

I have recently set up this configuration: http://pastebin.com/9SWpqQnz
It is achieving the goal of having the route fail over to the backup ADSL line when the primary fibre Ethernet line goes down.
It is, however, failing over for a short period of time every few hours or so.
I assume it is looking for a single missed ping?
Can anyone suggest how this might be tightened up a bit and made more reliable?

track 1 rtr 123 reachability
 delay up 60 down 20
ip sla 123
 icmp-echo 37.77.176.177 source-interface FastEthernet0/0
 timeout 2000
 frequency 10
ip sla schedule 123 life forever start-time now
Also, it's probably worth pinging something a little further out on the internet, since your ISP could have issues that leave your local loop up and running while the rest of their network is unable to forward traffic correctly. On the other hand, if you continuously ping some random host on the internet that you don't have authorization for, you could get blocked or rate-limited, which would artificially fail over your default route.
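One way to make the tracking itself more robust is to probe two destinations and fail over only when both are unreachable, using a boolean track list. The below is a sketch only: the second target (8.8.8.8) is a placeholder, and the exact track-list syntax varies with IOS version (it replaces the track 1 definition above):

ip sla 124
 icmp-echo 8.8.8.8 source-interface FastEthernet0/0
 timeout 2000
 frequency 10
ip sla schedule 124 life forever start-time now
track 11 rtr 123 reachability
track 12 rtr 124 reachability
track 1 list boolean or
 object 11
 object 12
 delay up 60 down 20

With boolean or, track 1 stays up as long as at least one probe succeeds, so a blip at a single destination (or that destination rate-limiting your pings) no longer flaps the default route.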

Related

Data cost of keeping a TCP connection open

Let's suppose two computers:
The first is running a netcat server on a TCP port.
The second is running a netcat client, connected to the previous netcat server.
(netcat is an example; you can imagine a basic C program with sockets.)
We can send data between the two computers.
Let's imagine nobody sends data for multiple days.
Is there a timeout in the TCP stack?
Does netcat (or the operating system) send packets to keep the connection open?
What I want to know is how much data is sent if there is no higher-level activity.
Thanks
Is there a timeout in the TCP stack?
There are many different timeouts in the TCP stack, depending on what state the connection is currently in and how it was configured (e.g. with keepalive or not). The idle connection timeout (which is what you are referring to) does not seem to be defined. With keepalive, the timeout is ~2 hours by default. That being said, pretty much every firewall in the world will impose some timeout on idle connections. Based on a reddit thread, 15 minutes looks like a reasonable assumption, maybe even 1 hour. But multiple days? I doubt the connection will stay alive on any network (except your own).
Does netcat (or the operating system) send packets to keep the connection open?
No. You will have to do it yourself by sending data. With the keepalive option for TCP, the OS will do it for you (note: keepalive is disabled by default), but this only works between direct peers, i.e. it may fail when proxies are involved. Sending data yourself is definitely the better approach.
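For completeness, here is a minimal sketch of turning keepalive on from a C program. It assumes a POSIX system and an already-connected socket fd; the TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT knobs are Linux-specific (other systems expose different ones), and the values are illustrative:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

int enable_keepalive(int fd)
{
    int on = 1;
    /* Turn keepalive on; it is off by default. */
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    /* Linux-specific tuning (illustrative values): first probe after
       60s of silence, then every 10s, give up after 5 missed probes. */
    int idle = 60, intvl = 10, cnt = 5;
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,  &idle,  sizeof(idle));
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl));
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,   &cnt,   sizeof(cnt));
    return 0;
}

With values like these, the kernel starts probing after a minute of silence instead of the default two hours, which also keeps stateful firewalls along the path from expiring the connection.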

Extremely high latency when network gets busy, TCP, libevent

In our client/server based online game project, we use TCP for network transmission. We use libevent, with a bufferevent per connection to handle the network I/O automatically.
It worked well before, but a lag problem has recently come to the surface. When I do stress testing to make the network busier, the latency becomes extremely high, several seconds or more. The server sinks into a confusing state:
the average CPU usage decreases (0%-60%-0%-60% repeating, waiting for something?)
the network traffic decreases (nethogs)
the clients connected to the server are still alive (netstat & tcpdump)
It looks like something has magically slowed the whole system down, yet new connections to the server are still answered promptly.
When I change the protocol to UDP, it works well in the same situation: no obvious latency, and the system runs fast. Network traffic is around 3 MB/s.
The project is running on an intranet. I also tested the maximum download speed: nearly 18 MB/s.
I studied part of libevent's header files and documentation, and tried to set up a rate limit on all connections. It brought some improvement, but did not completely resolve the problem, even though I tried several different configurations. Here are my parameters: read_rate 163840, read_burst 163840, write_rate 163840, write_burst 163840, tick_len 500ms.
Thank you for your help!
TCP = Transmission Control Protocol. It responds to packet loss by retransmitting unacknowledged packets after a delay, and in the case of repeated loss it backs off exponentially. Take a look at this network capture of an attempt to open a connection to a host that is not responding:
It sends the initial SYN, and after not getting an ACK for 1s it tries again. It then sends another after ~2s, then ~4s, then ~8s, and so on. So you can see that you can get some serious latency in the face of repeated packet loss.
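As a back-of-the-envelope illustration of how those doubling timeouts add up (the 1s start and pure doubling are simplifications; real stacks vary and cap the retry count):

#include <stdio.h>

/* Cumulative worst-case wait across retransmissions with a doubling
   timeout. Illustrative only. */
int main(void)
{
    double timeout = 1.0, elapsed = 0.0;
    for (int attempt = 1; attempt <= 6; attempt++) {
        elapsed += timeout;
        printf("attempt %d: wait %2.0fs, %2.0fs elapsed\n",
               attempt, timeout, elapsed);
        timeout *= 2.0;
    }
    return 0;
}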
Since you said you were deliberately stressing the network, and that the CPU usage is inconsistent, one possible explanation is that TCP is waiting to retransmit lost packets.
The best way to see what is going on is to get a network capture of what is actually transmitted. If your hosts are connected to a single switch, you can "span" a port of interest to the port of another host where you can make the capture.
If your switch isn't capable of this, or if you don't have administrative control of the switch, then you will have to take the capture on one of the hosts involved in your online game. The disadvantage is that capturing may alter what happens, and you don't see exactly what is on the wire. For example, you might have TCP segmentation offload enabled on your interface, in which case the capture will see large packets that are only broken up later by the network interface.
I would suggest installing Wireshark to analyse the network capture (which you can do in real time by using Wireshark to perform the capture as well). Any time you are working with a networked system, I would recommend using Wireshark so that you have some visibility into what is actually happening on the network. The first display filter I would suggest is tcp.analysis.flags, which will show you packets suggestive of problems.
I would also suggest turning off the rate limiting first while you investigate (rate limiting adds another reason not to send packets, which will probably make it harder to diagnose what is going on). Also, 500ms might be a longish tick_len depending on how your game operates: if your burst configuration allows the rate to be used up in 100ms, you will end up waiting 400ms before you can transmit again. The IO Graph is a very helpful Wireshark feature in this regard. It can help you see transmission rates, although the default tick interval and unit are not very useful for this. Here is an example of a bursty flow being rate limited to 200 Mbit/s:
Note that the tick interval is 1ms and the unit is bits/tick, which makes the top of the chart 1 Gbit/s, the speed of the interface in question.
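For reference, the parameters quoted in the question map onto libevent's ev_token_bucket_cfg_new(); a hedged sketch (libevent 2.0+, bev an already-created bufferevent), including how to clear the limit while diagnosing:

#include <sys/time.h>
#include <event2/bufferevent.h>

/* Apply a per-connection limit like the one in the question, but with
   a 100ms tick instead of 500ms so the bucket refills more smoothly.
   (The cfg must outlive its use and eventually be freed with
   ev_token_bucket_cfg_free(); omitted here for brevity.) */
static void set_rate_limit(struct bufferevent *bev)
{
    struct timeval tick = { 0, 100000 }; /* 100ms */
    struct ev_token_bucket_cfg *cfg = ev_token_bucket_cfg_new(
        163840, 163840,  /* read rate (bytes/s), read burst */
        163840, 163840,  /* write rate (bytes/s), write burst */
        &tick);
    bufferevent_set_rate_limit(bev, cfg);
}

/* Passing NULL removes the limit entirely, handy while diagnosing. */
static void clear_rate_limit(struct bufferevent *bev)
{
    bufferevent_set_rate_limit(bev, NULL);
}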

How to fix the ping between the server and every client?

I have a server for a very competitive game which involves money. In order for the game to be fair, every client must have the same ping. I can't, obviously, make everyone have a short ping. So the only solution is to fix it high: for example, 200ms is acceptable.
The problem is, how do I force a ping to be 200ms? For that to work, I'd have to know how much to delay sending the packets - and, for that, I'd have to know the client's ping. So, if the ping is 60ms, I could just add a 140ms delay before sending the data. The problem is: I can only learn the ping by asking for it, and a client can lie, telling me his ping is higher than it really is and making me send the packets earlier.
How to solve that problem?
You can't fix the ping; it can change all the time. If your client enables, e.g., a torrent during your ping discovery, it will significantly affect the result.
But maybe you don't have to delay sending packets. Maybe it's enough to delay receiving or analyzing them? That might be easier: you just have to know when you sent the corresponding packet. A small sketch of this idea follows below.
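A minimal sketch of that delay-the-processing idea in C; the struct fields and the 200ms target are made up for illustration, and it assumes the server trusts its own clock rather than anything the client reports:

#include <stdio.h>

/* Hypothetical action as the server might queue it. */
struct action {
    double server_sent_ms; /* when the server sent the state the client reacted to */
    double recv_ms;        /* when the client's action arrived back at the server */
};

#define TARGET_RTT_MS 200.0

/* Process every action as if the round trip had taken TARGET_RTT_MS:
   an action that came back after 60ms is held for another 140ms;
   one that took longer than 200ms is processed on arrival. */
static double process_at(const struct action *a)
{
    double due = a->server_sent_ms + TARGET_RTT_MS;
    return (due > a->recv_ms) ? due : a->recv_ms;
}

int main(void)
{
    struct action fast = { 1000.0, 1060.0 }; /* 60ms round trip  */
    struct action slow = { 1000.0, 1250.0 }; /* 250ms round trip */
    printf("fast action processed at t=%.0fms\n", process_at(&fast)); /* 1200 */
    printf("slow action processed at t=%.0fms\n", process_at(&slow)); /* 1250 */
    return 0;
}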
There is also another option: many online shooters use a feature called 'unlagged'. The client sends the coordinates of his shot, and the server, based on the current ping (let's say 80ms), calculates whether the target would have been hit 80ms ago. If so, the target is considered hit. But this of course introduces artifacts, like the victim getting shot through a wall.
By 'ping' I assume you mean 'network latency', which is variable and entirely outside your control. Your question doesn't make sense.

Can someone interpret these apache bench results, is there something that stands out?

Below is an Apache Bench run of 10K requests with 50 concurrent threads.
I need help understanding the results: does anything stand out that might point to something blocking and restricting the requests per second?
I'm looking at the connection time section and see 'waiting' and 'processing'. It shows the mean waiting time is 208, the mean connect time is 0, and processing is 208, yet the total is 208. Can someone explain this to me, as it doesn't make much sense.
Connect time is the time it took ab to establish a connection with your server. You are probably running it on the same server or within a LAN, so your connect time is 0.
Processing time is the total time the server took to process the request and send the complete response.
Wait time is the time between sending the request and receiving the first byte of the response.
Again, since you are running on the same server and the file is small, your processing time == wait time.
For a real benchmark, try ab from multiple points near your target market to get a real idea of the latency. Right now, all the information you have is the wait time.
This question is getting old, but I've run into the same problem, so I might as well contribute an answer.
You might benefit from disabling either Nagle's algorithm on the agent side or delayed ACK on the server side. They can interact badly and cause an unwanted delay. That's probably why your minimum time is exactly 200ms, as it was for me.
I can't confirm it, but my understanding is that the problem is cross-platform, since it's part of the TCP spec. It might only affect quick connections with a small amount of data sent and received, though I've seen reports of issues with larger transfers too. Maybe somebody who knows TCP better can pitch in.
Reference:
http://en.wikipedia.org/wiki/TCP_delayed_acknowledgment#Problems
http://blogs.technet.com/b/nettracer/archive/2013/01/05/tcp-delayed-ack-combined-with-nagle-algorithm-can-badly-impact-communication-performance.aspx
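If you want to try the Nagle half of that workaround, it's a single socket option on the client side; a sketch in C, assuming a POSIX system and an already-created socket fd:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Disable Nagle's algorithm so small writes go out immediately instead
   of being held back while earlier segments wait for (delayed) ACKs. */
static int disable_nagle(int fd)
{
    int on = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}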

Networking/international hosted pings and traceroutes

I am working on a project involving correlations between server distance and pings/traceroutes, so I am collecting the number of hops and the average ping time for different websites. I am using PuTTY and Unix commands to achieve this. When I traceroute the website (traceroute australia.gov.au -m 255), I allow for the maximum number of hops. I get about 18 hops, and then I get 237 lines with three asterisks next to them. At first I assumed that this was a result of the -m 255 option on the command, but it doesn't occur with websites like YouTube or Google. Are these timed-out hops/connections? Also, when I ping this website (ping australia.gov.au -c 25), I get no response for about a minute (near 2000ms), and then the printout says that 25 packets were sent and 0 were received. What is the explanation for this?
Lots of places all over the internet block ICMP, meaning that australia.gov.au might have a valid, routable IP but simply doesn't send back an echo reply.
A line of three asterisks from traceroute, as far as I can remember, means a packet with the given TTL did not get a reply. That again probably indicates that the host/router does not want to be bothered replying to arbitrary ICMP and/or UDP packets.
