Extremely high latency when network gets busy, TCP, libevent - networking

In our C/S based online game project, we use TCP for network transmission. We include Libevent, utilise a bufferevent for each connection to handling with the network I/O automatically.
It works well beforeļ¼Œbut the lagging problem comes to the surface recently. When I do some stress testing to make the network busier, the latency becomes extremely high, several seconds or more. The server sinks into a confusing state:
the average CPU usage decreased (0%-60%-0%-60% repeat, waiting something?)
the net traffic decreased (nethogs)
the clients connected to server still alive (netstat & tcpdump)
It looks like something magically slowed all system down, but new connection to server responded quit in time.
When I changed the protocol to UDP, it works well on the same situation: no obvious latency, the system runs fast. Net traffic is around 3M/S.
The project is running on an Intranet. I also tested the max download speed, nearly 18M/S.
I studied part of Libevent's header files and ducumentations, tried to setup a rate limit to all connections. It did some improvements, but not completely resolved the problem even though I had tried several different configurations. Here is my parameters: read_rate 163840, read_burst 163840, write_rate 163840, write_burst 163840, tick_len 500ms.
Thank you for your help!

TCP = Transmission Control Protocol. It responds to packet loss by retransmitting unacknowledged packets after a delay. In the case of repeated loss, it will exponentially back off. Take a look at this network capture of an attempt to open a connection to host that is not responding:
It sends the initial SYN, and then after not getting an ack for 1s it tries again. After not getting an ack it then sends another after ~2s, then ~4s, then ~8s, and so on. So you can see that you can get some serious latency in the face of repeated packet loss.
Since you said you were deliberately stressing the network, and that the CPU usage is inconsistent, one possible explanation is that TCP is waiting to retransmit lost packets.
The best way to see what is going on is to get a network capture of what is actually transmitted. If your hosts are connected to a single switch, you can "span" a port of interest to the port of another host where you can make the capture.
If your switch isn't capable of this, or if you don't have the administrative control of the switch, then you will have to get the capture from one of hosts involved in your online game. The disadvantage of this is that taking the capture will possibly alter what happens, and it doesn't see what is actually on the wire. For example, you might have TCP segmentation offload enabled for your interface, in which case the capture will see large packets that will be broken up by the network interface.
I would suggest installing wireshark to analyse the network capture (which you can do in real time by using wireshark to do the capture as well). Any time you are working with a networked system I would recommend using wireshark so that you have some visibility into what is actually happening on the network. The first filter I would suggest you use is tcp.analysis.flags which will show you packets suggestive of problems.
I would also suggest turning off the rate limiting first to try to see what is going on (rate limiting is adding another reason to not send packets, which is probably going to make it harder to diagnose what is going on). Also, 500ms might be a longish tick_len depending on how your game operates. If your burst configuration allows the rate to be used up in 100ms, you will end up waiting 400ms before you can transmit again. The IO Graph is a very helpful feature of Wireshark in this regard. It can help you see transmission rates, although the default tick interval and unit are not very helpful in this regard. Here is an example of a bursty flow being rate limited to 200mbit/s:
Note that the tick interval is 1ms and the unit is bits/tick, which makes the top of the chart 1gb/s, the speed of the interface in question.

Related

TCP and UDP behavior

I am experimenting with TCP and UDP protocols for video streaming and I have observed something. I run Wireshark in the background to capture the traffic. The number of packets per sec is consistent. During video streaming, I introduce a network failure. There is no data flowing during a failure. As soon as the connection's back, the number of packets per sec is twice as more than before the failure for about 1-2 secs and then it is consistent. I am not able to figure why. It'd be really helpful if anyone could help me understand it.
It depends where you are introducing the network outage. if it is at client level, wouldn't you expect non-video communication to get started at the same time when network is back up? If you are filtering on video packets only, is there a cache setup where the chunks of video segments are downloaded periodically?

TCP as connection protocol questions

I'm not sure if this is the correct place to ask, so forgive me if it isn't.
I'm writing computer monitoring software that needs to connect to a server. The server may send out relatively urgent messages, such as sound or cancel an alarm, and the client may send out data about the computer, such as screenshots. The data that the client sends isn't too critical on timing, but shouldn't be more than a two minutes late.
It is essential to the software that portforwarding need not be set up, and it is assumed that the internet connection will be done through a wireless router that has NAT almost all the time.
My idea is to have a TCP connection initiated from the client, and use that to transfer data. Ideally, I would have no data being sent when it is not needed, but I believe this to be impossible. Would sending the equivalent of a ping every now and again keep the connection alive, and what sort of bandwidth would it use if this program was running all the time on the computer? In addition, would it be possible to reduce the header size for these keep-alives?
Before I start designing the communication and programming, is this plan for connection flawed? Are there better alternatives?
Thanks!
1) You do not need to send 'ping' data to keep the connection alive, the TCP stack does this automatically; one reason for sending 'ping' data would be to detect a connection close on the client side - typically you only find out something has gone wrong when you try and read/write from the socket. There may be a way to change various time-outs so you can detect this condition faster.
2) In general while TCP provides a stream-oriented error free channel, it makes no guarantees about timeliness, if you are using it on the internet it is even more unpredictable.
3) For applications such as this (I hope you are making it for ethical purposes) - I would tend to use TCP, since you don't want a situation where the client receives a packet to raise an alarm but misses that one that turns it off again.

What are the chances of losing a UDP packet?

Okay, so I am programming for my networking course and I have to implement a project in Java using UDP. We are implementing an HTTP server and client along with a 'gremlin' function that corrupts packets with a specified probability. The HTTP server has to break a large file up into multiple segments at the application layer to be sent to the client over UDP. The client must reassemble the received segments at the application layer. What I am wondering however is, if UDP is by definition unreliable, why am I having to simulate unreliability here?
My first thought is that perhaps it's simply because my instructor is figuring in our case, both the client and the server will be run on the same machine and that the file will be transferred from one process to another 100% reliably even over UDP since it is between two processes on the same computer.
This led me first to question whether or not UDP could ever actually lose a packet, corrupt a packet, or deliver a packet out of order if the server and client were guaranteed to be two processes on the same physical machine, guaranteed to be routed strictly over localhost only such that it won't ever go out over the network.
I would also like to know, in general, for a given packet what is the rough probability that UDP will drop / corrupt / or deliver a packet out of order while being used to facilitate communication over the open internet between two hosts that are fairly geographically distant from one another (say something comparable to the route between the average broadband user in the US to one of Google's CDNs)? I'm mostly just trying to get a general idea of the conditions experienced when communicated over UDP, does it drop / corrupt / misorder something on the order of 25% of packets, or is it more like something on the order of 0.001% of packets?
Much appreciation to anyone who can shed some light on any of these questions for me.
Packet loss happens for multiple reasons. Primarily it is caused by errors on individual links and network congestion.
Packet loss due to errors on the link is very low, when links are working properly. Less than 0.01% is not unusual.
Packet loss due to congestion obviously depends on how busy the link is. If there is spare capacity along the entire path, this number will be 0%. But as the network gets busy, this number will increase. When flow control is done properly, this number will not get very high. A couple of lost packets is usually enough that somebody will reduce their transmission speed enough to stop packets getting lost due to congestion.
If packet loss ever reaches 1% something is wrong. That something could be a bug in how your congestion control algorithm responds to packet loss. If it keeps sending packets at the same rate, when the network is congested and losing packets, the packet loss can be pushed much higher, 99% packet loss is possible if software is misbehaving. But this depends on the types of links involved. Gigabit Ethernet uses backpressure to control the flow, so if the path from source to destination is a single Gigabit Ethernet segment, the sending application may simply be slowed down and never see actual packet loss.
For testing behaviour of software in case of packet loss, I would suggest two different simulations.
On each packet drop it with a probability of 10% and transmit it with a probability of 90%
Transmit up to 100 packets per second or up to 100KB per second, and drop the rest if the application would send more.
if UDP is by definition unreliable, why am I having to simulate unreliability here?
It is very useful to have a controlled mechanism to simulate worst case scenarios and how both your client and server can respond to them. The instructor will likely want you to demonstrate how robust the system can be.
You are also talking about payload validity here and not just packet loss.
This led me to question whether or not UDP, lose a packet, corrupt a packet, or deliver it out of order if the server and client were two processes on the same machine and it wasn't having to go out over the actual network.
It is obviously less likely over the loopback adapter, but this is not impossible.
I found a few forum posts on the topic here and here.
I am also wondering what the chances of actually losing a packet, having it corrupted, or having them delivered out of order in reality would usually be over the internet between two geographically distant hosts.
This question would probably need to be narrowed down a bit. There are several factors both application level (packet size and frequency) as well as limitations/traffic of routers and switches along the path.
I couldn't find any hard numbers on this but it seems to be fairly low... like sub 5%.
You may be interested in The Internet Traffic Report and possibly pages such as this.
I was spamming udp packets over wifi to some nanoleaf panels and my packet loss was roughly 1/7000.
I think it depends on a ton of factors.

TCP vs UDP - Issues that arise from using both

When I'm learning about various technologies, I often try to think of how applications I use regularly implement such things. I've played a few MMOs along with some FPSs. I did some looking around and happened upon this thread:
http://www.gamedev.net/topic/319003-mmorpg-and-the-ol-udp-vs-tcp
I have been seeing that UDP shines when some loss of packets is permissible. There's less overhead involved and updates are done more quickly. After looking around for a bit and reading various articles and threads, I've come to see that character positioning will often be done with UDP. Games like FPSs will often be done with UDP because of all the rapid changes that are occuring.
I've seen multiple times now where someone pointed out issues that can occur when using UDP and TCP simultaneously. What might some of these problems be? Are these issues that would mostly be encountered by novice programmers? It seems to me that it would be ideal to use a combination of UDP and TCP, gaining the advantages of each. However, if using the two together adds a significant amount of complexity to the code to deal with problems caused, it may not be worth it in certain situations.
Many games use UDP and TCP together. Since it is mandatory for every game to deliver the actions of a player to everyone, it has to be done in one way or the other. It now depends on what kind of game you want to make. In a RTS, surely TCP would be much wiser, since you cannot lose information about your opponents movement. In an RPG, it is not that incredibly important to keep track of everything every second of the game.
Bottom line is, if data has to arrive at the client, in any case, (position updates, upgrades aso.), you have to send it via TCP, or, you implement your own reliable protocol based on UDP. I have constructed quite a few network stacks for games, and I have to say, what you use depends on the usecase and what you are trying to accomplish. I mostly did a heartbeat over UDP, so the remote server/client knows that I am still there. The problem with UDP is, that packets get lost and not resent. If a packet drops, it is lost for ever. You have to take that into account. If you send some information over UDP, it has to be information that is allowed to be lost. Everything else goes via TCP.
And so the madness began. If you want to get most out of both, you will have to adapt them. TCP can be slow sometimes, and you have to wait, if packets get fragmented or do not arrive in order, until the OS has reassembled them. In some cases it could be advisable to build your own reliable protocol on top of UDP. That would allow you complete control over your traffic. Most firewalls do not drop UDP or anything else, but as with TCP, any traffic that is not declared to be safe (Opening Ports, packet redirects, aso.), gets dropped. I would suggest you read up the TCP and UDP and UDP-Lite article up at wikipedia and then decide which ones you want to use for what. AFAIK Battle.net uses a combination of the two.
Many services use udp and tcp together, but doing so without adding congestion control to your udp implementation can cause major problems for tcp. Long story short, udp can and often will clog the routers at each endpoint making tcp's congestion control to go haywire and significantly limit the throughout of the tcp connection. The udp based congestion can also cause significant increase in packet loss for tcp, limiting tcp's throughput even more as it will need to have these packets retransmitted. Using them together isn't a bad idea and is even becoming somewhat common, but you'll want to keep this in mind.
The first possible issue I can think of is that, because UDP doesn't have the overhead inherent in the "transmission control" that TCP does, UDP has higher data bandwidth and lower latency. So, it is possible for a UDP datagram that was sent after a TCP message to be available on the remote computer's input buffers before the TCP message is received in full.
This may cause problems if you use TCP to control or monitor UDP transmission; for instance, you might tell the server via TCP that you'll be sending some datagrams via UDP on port X. If the remote computer isn't already listening on port X, it may not receive some of these datagrams, because they arrived before it was told to listen; if it is listening, but not expecting traffic from you, it may discard them because they showed up before it was told to expect them. This may have an adverse effect on your program's flow or your user's experience.
I think that if you like to transfer game data TCP will be the only solution. Imagine that you send a command (e.g gained x item :P) at the server and this packet never reach its destination. (udp has no guaranties).
Also imagine the scenario that two or more UDP packets reach their destination in wrong order.
But if with your game you integrate any VoIP or Video Call capabilities you can use at these UDP.

Assessing/diagnosing time connections are in SYN_RECV before being established

I'm trying to improve the performance of a (virtual) web server with a fairly standard CentOS/Apache setup and one thing I noticed is that new connections seem to "stick" in the SYN_RECV state, sometimes for several seconds, before finally being established and handled by Apache.
My first guess was that Apache could be reaching the limit for the number of connections it's prepared to handle simultaneously, but e.g. with keep-alive off netstat is reporting a few established connections (just those not involving localhost, so discarding "housekeeping" connections e.g. between Apache and Tomcat), whereas with keep-alive on it will happily get up to 100+ established connections (but with no clear difference to the SYN_RECV behaviour either way -- there's typically 10-20 connections sitting in SYN_RECV at any one time).
What are people's recommendations for investigating where the bottleneck is that's preventing the connections from being established quickly?
P.S. Follow-on question: does anybody know what a TYPICAL statistic would be for the time for a connection to be established once first "hitting" the server?
Update in case anyone else encounters this: in the end, I wrote a small Java program to take data from /proc/net/tcp and analyse and it appears that this is happening for a small proportion of connections (although that still means that at any one time there can be a number of connections in this state, because they can stay this way for a number of seconds) and looks like an issue local to those connections. Over 90% of connections are still going through in < 500ms and 81% in < 200ms. So if others get this, there isn't necessarily need for panic immediately.
Try capturing a packet trace and see if SYN ACKs are being retransmitted (and the number of re-tx). This could indicate a routing issue (SYN comes in via path A and SYN-ACK goes via path B which is broken).
Also see if these connections have a specific pattern (such as originating from the same network).

Resources