How to evaluate the performance of DPDK in a complex circumstance - networking

Let's consider a scenario like this:
In an MMORPG, we send a packet to the server, and the server does a lot of computation involving all the players connected by some interaction such as an attack, a heal, or something else. After that, we may receive several packets back.
Since we may receive several packets, this is a little harder. If we only read one, we could just use a timestamp to measure the time cost, but now we cannot do that. So, how do we evaluate the performance of the traditional TCP/IP stack versus the DPDK path in a complex scenario like this?

If we only read one, we could just use a timestamp to measure the time cost, but now we cannot do that.
Answer> You can always register a callback handler on the RX path so that it gets invoked per packet.
How do we evaluate the performance of the traditional TCP/IP stack versus the DPDK path in a complex scenario like this?
Answer> Assuming you are using TLDK, mTCP, or ANS as the userspace stack, your best approach is to register a callback on each successful read.
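For concreteness, here is a minimal sketch of that per-packet RX callback idea on the DPDK side, assuming a DPDK application with an already-configured port; the names g_request_tsc and latency_cb are illustrative, not part of any library:

    #include <rte_ethdev.h>   // rte_eth_add_rx_callback
    #include <rte_cycles.h>   // rte_rdtsc, rte_get_tsc_hz
    #include <rte_mbuf.h>
    #include <cstdio>

    // Illustrative state: TSC value recorded when the request packet was sent.
    static uint64_t g_request_tsc;

    // RX callback: runs for every burst returned by rte_eth_rx_burst(), so each
    // reply packet of a multi-packet response gets its own timestamp.
    static uint16_t latency_cb(uint16_t port, uint16_t queue,
                               struct rte_mbuf *pkts[], uint16_t nb_pkts,
                               uint16_t max_pkts, void *user_param) {
        (void)max_pkts; (void)user_param;
        uint64_t now = rte_rdtsc();
        double us = (double)(now - g_request_tsc) * 1e6 / rte_get_tsc_hz();
        for (uint16_t i = 0; i < nb_pkts; i++) {
            // In a real test you would also parse pkts[i] to match replies to requests.
            printf("port %u queue %u: reply %u arrived %.2f us after request\n",
                   port, queue, i, us);
        }
        return nb_pkts;   // hand every packet on to the application unchanged
    }

    // After rte_eth_dev_configure()/rte_eth_dev_start():
    //   rte_eth_add_rx_callback(port_id, queue_id, latency_cb, nullptr);

Recording g_request_tsc just before the TX burst and timestamping every reply in the callback gives one latency sample per reply packet, which you can then compare against the same measurement taken with the kernel TCP/IP stack.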

Related

QuickFix using Pipes, shared memory, message queues etc

Here's my scenario:
In my application I have several processes which communicate with each other using QuickFIX, which internally uses TCP sockets. The flow is like:
Process 1 sends a QuickFIX message -> process 2 sends a QuickFIX message after processing the message from process 1 -> ... -> process n
Similarly, the acknowledgement messages flow like:
process n -> ... -> process 1
Now, all of these processes except the last one (process n) are on the same machine.
I googled and found that TCP sockets are the slowest of the IPC mechanisms.
So, is there a way to transmit and receive QuickFIX messages (obviously using their APIs) through other IPC mechanisms? If yes, I could then reduce the latency by using that IPC mechanism between all the processes which are on the same machine.
However, if I do so, do those mechanisms guarantee delivery of the complete message the way TCP sockets do?
I think you are doing premature optimization, and I don't think TCP will be your performance bottleneck. Your local LAN latency will be lower than that of your external FIX connection. From experience, I'd expect performance issues to originate in your app's message handling (perhaps due to accidental blocking in OnMessage() callbacks) rather than in the IPC that happens afterward.
Advice: write your communication component with an abstraction-layer interface so that later down the line you can swap out TCP for something else (e.g. ActiveMQ, ZeroMQ, or whatever else you may consider) if you decide you need it.
Aside from that, just focus on making your system work correctly. Once you are sure the behavior is correct (hopefully with tests to confirm it), then you can work on performance. Measure your performance before making any optimizations, and then measure again after you make "improvements". Don't trust your gut; get numbers.
Although it would be good to hear more details about the requirements associated with this question, I'd suggest looking at a shared memory solution. I'm assuming that you are running a server in a colocated facility with the trade matching engine and using high speed, kernel bypass communication for external communications. One of the issues with TCP is the user/kernel space transitions. I'd recommend considering user space shared memory for IPC and use a busy polling technique for synchronization rather than using synchronization mechanisms that might also involve kernel transitions.
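As a rough illustration of that shared-memory/busy-polling suggestion, here is a minimal single-producer/single-consumer sketch using POSIX shared memory and C++ atomics (link with -lrt on older glibc); the Slot layout and function names are hypothetical, and a production version would want a proper ring buffer and back-off:

    #include <atomic>
    #include <cstdint>
    #include <cstring>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    // Hypothetical fixed-size slot carrying one message between two processes.
    // Assumes std::atomic<uint32_t> is lock-free and usable in shared memory.
    struct Slot {
        std::atomic<uint32_t> ready;   // 0 = empty, 1 = full
        uint32_t len;
        char payload[4096];
    };

    // Map (or create) one shared slot; both processes call this with the same name.
    static Slot* map_slot(const char* name) {
        int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
        if (fd < 0) return nullptr;
        if (ftruncate(fd, sizeof(Slot)) != 0) { close(fd); return nullptr; }
        void* p = mmap(nullptr, sizeof(Slot), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        return p == MAP_FAILED ? nullptr : static_cast<Slot*>(p);
    }

    // Producer: busy-wait until the slot is free, then publish the message.
    static void send_msg(Slot* s, const char* data, uint32_t len) {
        while (s->ready.load(std::memory_order_acquire) != 0) { /* spin */ }
        std::memcpy(s->payload, data, len);
        s->len = len;
        s->ready.store(1, std::memory_order_release);   // publish, no kernel transition
    }

    // Consumer: busy-wait until a message is present, copy it out, release the slot.
    static uint32_t recv_msg(Slot* s, char* out) {
        while (s->ready.load(std::memory_order_acquire) != 1) { /* spin */ }
        std::memcpy(out, s->payload, s->len);
        uint32_t len = s->len;
        s->ready.store(0, std::memory_order_release);
        return len;
    }

The point is that publish and consume are just loads and stores on mapped memory, so the hot path never enters the kernel; the cost is a core spent spinning on each side.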

What are some common methods used in game networking?

So I'm writing a fairly simple game with very low networking requirements, I'm using TCP.
I'm unsure where to start in even defining/implementing a protocol for the client and server to use. I've been looking around and I've seen a few examples, for instance Mojang's Minecraft which uses a table of 'commands' the client sends the server and the server sends the client, with numbers of arguments and such.
What's a good way to do this? I've heard complaints about Minecraft's protocol because if you overread by a byte you ruin the entire stream.
Game networking is a broad topic, and the right approach depends on what type of problem you are solving. TCP may not even be the correct choice for you.
For example, character movement is typically sent over UDP. The reason is that character movement isn't critical to the operation of the game, so some loss of movement data is "acceptable". That may be why your character sometimes "jumps": some UDP packets were lost, or arrived severely out of order.
UDP is often argued to be the preferred protocol for networked games. So before you even get started, carefully consider whether you are picking the correct protocol.
Overall, I consider Glenn Fiedler's series on developing a networked game a fantastic read. I'd start here. He covers all of the basics of using UDP for gaming.
If you want to use TCP simply to get a handle on TCP, then Minecraft is a reasonable example. A known list of commands that can be sent back and forth is a simple way to start. However, as you stated, it is prone to some problems; that is more a consequence of using the wrong protocol than of how it was developed.
Google "game networking library" and you'll get a bunch of results. GNE would be a good one to look at.
I guess it depends on what your game is, what its mechanics are, and what information is necessary. In any case, I think this Stack Exchange site, https://gamedev.stackexchange.com/, is better suited to answering your question.
Gamedev.net's networking forum has a great FAQ covering these sorts of questions and many others. However, to make this more than a 'go-there-look-at-that' answer, I'll suggest some small improvements you can make. When using TCP, delivery is guaranteed, but this has a speed cost, which is fine if you're not making an FPS, but it means you need to get more out of the data you do send. A great way to do this is via deltas/differentials, that is, sending only the change in state, not the entire game state. You can also validate incoming packets for corrupt/anomalous data, over and above TCP's own checks, by predicting which possibilities are allowed, and with the same prediction you can cut out even more data. But as others have said, this is a broad question, and not suited to getting truly helpful answers.
As you're coding in Lua, the only library anyone uses is luasocket (though ZMQ is gaining ground).
You're really going to have several protocols going: TCP for data that must be received (e.g. server commands such as changemap or you_got_kicked, conversations and such), and UDP for non-compulsory data, or data that expires quickly (e.g. character positions).

What happened to the TCP Nagle flush?

According to this Socket FAQ article, Nagle's algorithm is one of many algorithms that can cause a bunch of data to sit in the TCP buffer and not hit the wire. The delay from the Nagle algorithm can be up to 200ms.
For some reason, Nagle's algorithm can be turned off completely, but it cannot be flushed just once. This is really puzzling to me. Why is there no way to say, "just this one time, don't wait for any more data; just act as if Nagle's 200 ms are up"?
Wouldn't that make perfect sense, and strike a good balance between no Nagle at all, Nagle all the time, and implementing one's own protocol from scratch?
Good question. I guess nobody ever really needed it, or they worked around it. If I remember correctly, enabling TCP_NODELAY pushes the data immediately. Then you can just disable it again.
Of course, this comes at the high cost of two system calls per "flush". What you could do: send(2) on Unix implementations has a flags argument. You could implement your own flag, something like MSG_JUSTPUSHIT (okay, maybe another name), and handle it in tcp_output.
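As a sketch of that toggle trick, and assuming Linux behaviour (where enabling TCP_NODELAY pushes any data the stack is still holding back), a one-shot "flush" can be faked like this; tcp_flush is a hypothetical helper name:

    #include <netinet/in.h>
    #include <netinet/tcp.h>   // TCP_NODELAY
    #include <sys/socket.h>

    // Emulate a one-shot "Nagle flush": enabling TCP_NODELAY pushes pending data,
    // then we re-enable Nagle. Costs the two setsockopt() system calls noted above.
    static void tcp_flush(int fd) {
        int on = 1, off = 0;
        setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));   // push pending data
        setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &off, sizeof(off)); // restore Nagle
    }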
In performance-sensitive applications where the delays introduced by Nagle's algorithm are an issue, it's often easier to just disable Nagle's algorithm entirely and emulate its batching in software, e.g. by using scatter/gather I/O (writev()) or by buffering in the application where needed. As an added bonus, doing this cuts out some system call overhead.
Alternatively, you can open up two separate sockets and disable Nagling on one of them. Just keep in mind that data sent on one socket won't necessarily be synced up with the other one.
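A minimal sketch of the disable-Nagle-and-batch-in-software approach, using writev() so that small pieces still leave in one send; the helper names are illustrative:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <sys/uio.h>       // writev

    // Call once after connect(): disable Nagle for the whole connection.
    static void disable_nagle(int fd) {
        int on = 1;
        setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
    }

    // With Nagle off, batch header and body into one writev() so they still go
    // out together instead of as two tiny segments (and in one syscall).
    static ssize_t send_message(int fd, const char* header, size_t hlen,
                                const char* body, size_t blen) {
        struct iovec iov[2];
        iov[0].iov_base = const_cast<char*>(header); iov[0].iov_len = hlen;
        iov[1].iov_base = const_cast<char*>(body);   iov[1].iov_len = blen;
        return writev(fd, iov, 2);
    }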

Limiting TCP sends with a "to-be-sent" queue and other design issues

This question is the result of two other questions I've asked in the last few days.
I'm creating a new question because I think it's related to the "next step" in my understanding of how to control the flow of my send/receive, something I didn't get a full answer to yet.
The other related questions are:
An IOCP documentation interpretation question - buffer ownership ambiguity
Non-blocking TCP buffer issues
In summary, I'm using Windows I/O Completion Ports.
I have several threads that process notifications from the completion port.
I believe the question is platform-independent and would have the same answer if I were doing the same thing on a *nix, *BSD, or Solaris system.
So, I need to have my own flow control system. Fine.
So I send and send and send, a lot. How do I know when to start queueing the sends, given that the receiver side is limited to X amount?
Let's take an example (closest thing to my question): FTP protocol.
I have two servers; One is on a 100Mb link and the other is on a 10Mb link.
I tell the 100Mb one to send a 1GB file to the other one (the 10Mb-linked one). It finishes with an average transfer rate of 1.25MB/s.
How did the sender (the 100Mb-linked one) know when to hold back its sending, so the slower one wouldn't be flooded? (In this case the "to-be-sent" queue is the actual file on the hard disk.)
Another way to ask this:
Can I get a "hold-your-sendings" notification from the remote side? Is it built-in in TCP or the so called "reliable network protocol" needs me to do so?
I could of course limit my sendings to a fixed number of bytes but that simply doesn't sound right to me.
Again, I have a loop with many sends to a remote server, and at some point within that loop I'll have to determine whether I should queue that send or pass it on to the transport layer (TCP).
How do I do that? What would you do? Of course, when I get a completion notification from IOCP that a send was done, I'll issue other pending sends; that much is clear.
Another design question related to this:
Since I am using custom buffers with a send queue, and these buffers are freed for reuse (thus not using the "delete" keyword) when a "send-done" notification arrives, I'll have to use mutual exclusion on that buffer pool.
Using a mutex slows things down, so I've been thinking: why not have each thread keep its own buffer pool? Accessing it, at least when getting the buffers required for a send operation, will then require no mutex, because the pool belongs to that thread only.
The buffer pool is located in thread-local storage (TLS).
No shared pool implies no lock needed, which implies faster operations, BUT it also implies more memory used by the app, because even if one thread has already allocated 1000 buffers, another thread that is sending right now and needs 1000 buffers will have to allocate its own.
Another issue:
Say I have buffers A, B, C in the "to-be-sent" queue.
Then I get a completion notification that tells me the receiver got 10 out of 15 bytes. Should I re-send from the relative offset within the buffer, or will TCP handle it for me, i.e. complete the sending? And if I should, can I be assured that this buffer is the "next-to-be-sent" one in the queue, or could it be buffer B, for example?
This is a long question and I hope nobody got hurt (:
I'd love to see someone take the time to answer here. I promise I'll double-vote for them! (:
Thank you all!
Firstly: I'd ask this as separate questions. You're more likely to get answers that way.
I've spoken about most of this on my blog: http://www.lenholgate.com but then since you've already emailed me to say that you read my blog you know that...
The TCP flow control issue is this: since you are posting asynchronous writes, each of them uses resources until it completes (see here). While a write is pending there are various resource usage issues to be aware of, and the use of your data buffer is the least important of them; you'll also use up some non-paged pool, which is a finite resource (though there is much more available in Vista and later than in previous operating systems), and you'll also be locking pages in memory for the duration of the write, and there's a limit to the total number of pages that the OS can lock. Note that neither the non-paged pool usage nor the page-locking issue is documented very well anywhere, but you'll start seeing writes fail with ENOBUFS once you hit them.
Because of these issues it's not wise to have an uncontrolled number of writes pending. If you are sending a large amount of data and you have no application-level flow control, then you need to be aware that if you send data faster than it can be processed by the other end of the connection, or faster than the link speed, you will begin to use up lots and lots of the above resources, as your writes take longer to complete due to TCP flow control and windowing issues. You don't get these problems with blocking socket code, as the write calls simply block when the TCP stack can't write any more due to flow control; with async writes, the writes complete and are then pending. With blocking code the blocking deals with your flow control for you; with async writes you could continue to loop and generate more and more data, all of which just sits there waiting to be sent by the TCP stack...
Anyway, because of this, with async I/O on Windows you should ALWAYS have some form of explicit flow control. So you either add application-level flow control to your protocol, using an ACK perhaps, so that you know when the data has reached the other side and only allow a certain amount to be outstanding at any one time, OR, if you can't add to the application-level protocol, you can drive things using your write completions. The trick is to allow a certain number of outstanding write completions per connection and to queue the data (or just not generate it) once you have reached your limit. Then, as each write completes, you can issue a new write...
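A rough, platform-neutral sketch of that drive-by-write-completions idea follows; BoundedSender and issue_async_write() are hypothetical names, the latter standing in for the WSASend()/IOCP call in the real code, and locking (which a real per-connection implementation would need) is deliberately omitted:

    #include <cstddef>
    #include <deque>
    #include <utility>
    #include <vector>

    // Per-connection sender that caps the number of async writes in flight.
    class BoundedSender {
    public:
        explicit BoundedSender(size_t max_in_flight) : max_in_flight_(max_in_flight) {}

        // Called by the application: send now if under the cap, otherwise queue.
        void send(std::vector<char> data) {
            if (in_flight_ < max_in_flight_) {
                ++in_flight_;
                issue_async_write(std::move(data));
            } else {
                pending_.push_back(std::move(data));
            }
        }

        // Called from the completion handler when one async write finishes:
        // release the slot and drain the queue.
        void on_write_complete() {
            --in_flight_;
            if (!pending_.empty() && in_flight_ < max_in_flight_) {
                ++in_flight_;
                std::vector<char> next = std::move(pending_.front());
                pending_.pop_front();
                issue_async_write(std::move(next));
            }
        }

    private:
        void issue_async_write(std::vector<char> data);

        size_t max_in_flight_;
        size_t in_flight_ = 0;
        std::deque<std::vector<char>> pending_;
    };

    void BoundedSender::issue_async_write(std::vector<char> data) {
        (void)data;   // placeholder: in the real code this posts an overlapped WSASend()
    }

The cap bounds the non-paged pool and locked-page usage described above, because at most max_in_flight writes (and their buffers) are ever pending on a connection.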
Your question about pooling the data buffers is, IMHO, premature optimisation on your part right now. Get to the point where your system is working properly, and once you have profiled it and found that contention on your buffer pool is the most important hot spot, THEN address it. I found that per-thread buffer pools didn't work so well, because the distribution of allocations and frees across threads tends not to be as balanced as you'd need it to be for that to work. I've spoken about this more on my blog: http://www.lenholgate.com/blog/2010/05/performance-comparisons-for-recent-code-changes.html
Your question about partial write completions (you send 100 bytes and the completion comes back saying that only 95 were sent) isn't really a problem in practice, IMHO. If you get into this position and have more than one outstanding write, then there's nothing you can do: the subsequent writes may well work and you'll have bytes missing from what you expected to send. BUT a) I've never seen this happen unless you have already hit the resource problems that I detail above, and b) there's nothing you can do if you have already posted more writes on that connection, so simply abort the connection. Note that this is why I always profile my networking systems on the hardware that they will run on, and I tend to place limits in MY code to prevent the OS resource limits ever being reached (bad drivers on pre-Vista operating systems would often blue-screen the box if they couldn't get non-paged pool, so you can bring a box down if you don't pay careful attention to these details).
Separate questions next time, please.
Q1. Most APIs will give you a "write is possible" event after your last write, once writing is available again (this can happen immediately if you failed to fill the major part of the send buffer with the last send).
With a completion port, it will arrive just like a "new data" event. Think of "new data" as "read OK"; there's also a "write OK" event. The names differ between APIs.
Q2. If a kernel-mode transition for mutex acquisition per chunk of data hurts you, I recommend rethinking what you are doing. It takes 3 microseconds at most, while your thread scheduler's slice may be as big as 60 milliseconds on Windows.
It may hurt in extreme cases. If you think you are programming extreme communications, please ask again, and I promise to tell you all about it.
To address your question about how the sender knew when to slow down: you seem to lack an understanding of TCP's congestion mechanisms. "Slow start" is what you're talking about, though it's not quite how you've worded it. Slow start is exactly that: it starts off slow and speeds up, up to as fast as the other end is willing to go, wire-line speed, whatever.
With respect to the rest of your question, Pavel's answer should suffice.

Boost asio async vs blocking reads, udp speed/quality

I have a quick and dirty proof of concept app that I wrote in C# that reads high data rate multicast UDP packets from the network. For various reasons the full implementation will be written in C++ and I am considering using boost asio. The C# version used a thread to receive the data using blocking reads. I had some problems with dropped packets if the computer was heavily loaded (generally with processing those packets in another thread).
What I would like to know is whether the async read operations in Boost (which use overlapped I/O on Windows) will help ensure that I receive the packets and/or reduce the CPU time needed to receive them. The single thread doing blocking reads is pretty straightforward; using async reads seems like a step up in complexity, but I think it would be worth it if it provided higher performance or dropped fewer packets on a heavily loaded system. Currently the data rate should be no higher than 60Mb/s.
I've written some multicast handling code using boost::asio as well. I would say that, overall, in my experience there is a lot of added complexity in doing things with asio, which may not make the code you end up writing easy for other people you work with to understand.
That said, presumably the argument in favour of moving to asio instead of using lots of different threads to do the work is that you would do less context switching. This would clearly be true on a single-core box, but what about when you go multi-core? Are you planning to offload the packets you receive to worker threads, or just have a single thread doing the processing? If you go for a single-threaded approach, you are going to end up in a situation where you could drop packets while waiting for that thread to finish processing the work.
In the end it's swings and roundabouts. I'd say you want to get some fairly solid figures backing up your arguments for going down this route if you are going to do so, just because of all the added complexity it entails (a whole new paradigm for some people I'm sure).
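For reference, a minimal single-threaded asio multicast receiver with a re-armed asynchronous read looks roughly like this (assuming a reasonably recent Boost with io_context; the group address and port are placeholders):

    #include <array>
    #include <iostream>
    #include <string>
    #include <boost/asio.hpp>

    using boost::asio::ip::udp;

    class McastReceiver {
    public:
        McastReceiver(boost::asio::io_context& io,
                      const std::string& group, unsigned short port)
            : socket_(io) {
            udp::endpoint listen_ep(udp::v4(), port);
            socket_.open(listen_ep.protocol());
            socket_.set_option(udp::socket::reuse_address(true));
            socket_.bind(listen_ep);
            socket_.set_option(boost::asio::ip::multicast::join_group(
                boost::asio::ip::make_address(group)));
            start_receive();
        }

    private:
        void start_receive() {
            socket_.async_receive_from(
                boost::asio::buffer(buf_), sender_,
                [this](boost::system::error_code ec, std::size_t n) {
                    if (!ec) {
                        // Process (or hand off) the datagram, then re-arm the read.
                        std::cout << "got " << n << " bytes\n";
                    }
                    start_receive();
                });
        }

        udp::socket socket_;
        udp::endpoint sender_;
        std::array<char, 2048> buf_;
    };

    int main() {
        boost::asio::io_context io;
        McastReceiver rx(io, "239.255.0.1", 30001);   // placeholder group/port
        io.run();   // a single thread services all completions
    }

Whether this drops fewer packets than the blocking version still comes down to how quickly the handler returns, which is exactly the "measure before committing" point made above.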
