Winsock send multiple packets api? - tcp

I've noticed that I'm losing a lot of server time in calls to send(). I have a service which sends high-frequency small packets to a moderate number of clients (>10, <100).
I suspect that recent OS patches, which have made syscalls slow, might be contributing.
I feel like I want a way to supply a bunch of sockets/packets to a single syscall, and it will do the many send() calls internally, reducing API time.
Does anything like this exist? Are there any hidden or niche API's available?
Linux has batching send functions...

Related

How to acces the data from a website 50-100 times a second using raspberry Pi?

I want to fetch the data of a stock. Since the data changes very fast, is there any way to pull the data like 50-100 times a second from trading websites?
And can we implement that using a raspberry Pi 4 8gig model.
RasPi4 should be more than adequate for this task. Both the ethernet and WiFi hardware is capable of connections at these speeds. (Unless you’re running a bunch of other stuff on it.) Consider where your bottlenecks may be, likely ISP or other network traffic). Consider avoiding WiFi in favor of cat5e or cat6. Consider hanging this device off your router (edge) to keep lan traffic lower and consider QOS settings if you think this traffic may compete with other lan traffic.
This appears to be a general question with no specific platform in mind. For stocks, there are lots of platforms to choose from.
APIs for trading platforms often include a method to open a stream. Instead of a full TCP conversation for each price check, a stream tells the server to just keep on sending data. There are timeout mechanisms of course, but it is good to close that stream gracefully (It’s polite since you’re consuming server resources at a different scale. I’ve seen some financial APIs monitor and throttle stream subscribers who leave sessions open.).
For some APIs/languages you can find solid classes already built on GitHub. Although, if simply pulling and reading a stream then the platform API doc code snippets should be enough to get you going.
Be sure to find out what other overhead may be implicated. For example, if an account or API key is needed to open a stream then either a session must be opened first or the creds must be passed with the stream being opened. The API docs will say. If you’re new to this sort of thing, just be a detective and try to infer what is needed. API docs usually try to be precise and technically correct with the absolute minimum word count.
Simply checking the steam should be easy. Depending on how that steam can be handled by your code/script, it may be harder to perform logic on the stream while it is being updated. That’s usually a thread issue or a variable scope issue depending on the script/code. For what you’re doing I would consider Python or PowerShell depending on your skill-set and other design parameters.

Load Testing Thousands of SLOW Connections

I would like to test an upload service with hundreds, if not thousands,
of slow HTTPS connections simultaneously.
I would like to have lots of, say, 3G-quality connections,
each throttled with low bandwidth and high latency,
each sending a few megabytes of data up to the server,
resulting in lots of concurrent, long-lived requests being handled by the server.
There are many load generation tools that can generate thousands of simultaneous requests.
(I'm currently using Locust, mostly so that I can take
advantage of my existing client library written in Python.)
Such tools typically run each concurrent request as fast as possible
over the shared network link.
There are various ways to adjust the apparent bandwidth and latency of TCP connections,
such as Linux's TC
and handy wrappers like Comcast.
As far as I can tell, TC and the like control the shared link
but they cannot throttle the individual requests.
If you want to throttle a single request, TC works well.
In theory, with many clients sharing the same throttled network link,
each request could be run serially,
subject to the constrained bandwidth,
rather than having lots of requests executing concurrently,
a few packets at a time.
The former would result in much fewer active requests executing
concurrently on the server.
I suspect that the tool I want has to actively manage each individual client's sending
and receiving to throttle them fairly.
Is there such a tool?
You can take a look at Apache JMeter, it can "throttle" connections to the throughput configurable via the following properties:
httpclient.socket.http.cps=0
httpclient.socket.https.cps=0
The properties can be defined either in user.properties file or passed to JMeter via -J command-line argument
cps stands for character per second so you can "slow down" JMeter threads (virtual users) to the given throughput rate, the formula for cps calculation is:
cps = (target bandwidth in kbps * 1024) / 8
Check out How to Simulate Different Network Speeds in Your JMeter Load Test for more information.
Yes, these are network simulators. A very primitive one is in the form of WanEM. It is not going to cover your testing needs. You will need something akin to Shunra Storm, a hardware device which can manage individual connections and impairment with models derived from Ookla (think speedtest.com) related to 3,4,5g connections from the wild. Well, perhaps I should say, "could manage," as this product has been absent since the HP acquisition of Shunra.
There are some other market competitors on the network front from companies such as Ixia, Agilent, PacketStorm, Spirent and the like. None of them are inexpensive, but I see your need. Slow, and particularly dirty connections likes cell phones, have a disproportionate impact on the stack and can result in the server running out of resources with fewer mobile connections than desktop ones.
On a side note, be sure you are including a representative model for think time in your test code. If you collapse the client-server model with no or extremely limited think time & impair the network only bad things can happen. This will play particular havoc with both predictability and repeatability on your tests. You may also wind up chasing dozens of engineering ghosts related to load in your code that will not occur in production because of the natural delays and the release of resources which should occur during those windows of activity between client requests.

QuickFix using Pipes, shared memory, message queues etc

Here's my scenario:
In my application i have several processes which communicate with each other using Quickfix which internally use tcp sockets.the flow is like:
Process1 sends quickfix messaage-> process 2 sends quickfix message after processing message from
process 1 -> .....->process n
Similarly the acknowledgement messages flow like,
process n->....->process 1
Now, All of these processes except the last process( process n ) are on the same machine.
I googled and found that tcp sockets are the slowest of ipc mechanisms.
So, is there a way to transmit and recieve quick fix messages( obviously using their apis)
through other ipc mechanisms. If yes, i can then reduce the latency by using that ipc mechanism between all the processes which are on the same machine.
However if i do so, do those mechanisms guarentee the tranmission of complete message like tcp sockets do?
I think you are doing premature optimization, and I don't think that TCP will be your performance bottleneck. Your local LAN latency will be faster than that of your exterior FIX connection. From experience, I'd expect perf issues to originate in your app's message handling (perhaps due to accidental blocking in OnMessage() callbacks) rather than the IPC stuff going on afterward.
Advice: Write your communication component with an abstraction-layer interface so that later down the line you can swap out TCP for something else (e.g ActiveMQ, ZeroMQ, whatever else you may consider) if you decide you may need it.
Aside from that, just focus on making your system work correctly. Once you are sure teh behavior are correct (hopefully with tests to confirm them), then you can work on performance. Measure your performance before making any optimizations, and then measure again after you make "improvements". Don't trust your gut; get numbers.
Although it would be good to hear more details about the requirements associated with this question, I'd suggest looking at a shared memory solution. I'm assuming that you are running a server in a colocated facility with the trade matching engine and using high speed, kernel bypass communication for external communications. One of the issues with TCP is the user/kernel space transitions. I'd recommend considering user space shared memory for IPC and use a busy polling technique for synchronization rather than using synchronization mechanisms that might also involve kernel transitions.

{ ProcessName, NodeName } ! Message VS rpc:call/4 VS HTTP/1.1 across Erlang Nodes

I have a setup in which two nodes are going to be communicating a lot. On Node A, there are going to be thousands of processes, which are meant to access services on Node B. There is going to be a massive load of requests and responses across the two nodes. The two Nodes, will be running on two different servers, each on its own hardware server.
I have 3 Options: HTTP/1.1 , rpc:call/4 and Directly sending a message to a registered gen_server on Node B. Let me explain each option.
HTTP/1.1 Suppose that on Node A, i have an HTTP Client like Ibrowse, and on Node B, i have a web server like Yaws-1.95, the web server being able to handle unlimited connections, the operating system settings tweaked to allow yaws to handle all connections. And then make my processes on Node A to communicate using HTTP. In this case each method call, would mean a single HTTP request and a reply. I believe there is an overhead here, but we are evaluating options here. The erlang Built in mechanism called webtool, may be built for this kind of purpose.
rpc:call/4 I could simply make direct rpc calls from Node A to Node B. I am not very susre how the underlying rpc mechanism works , but i think that when two erlang nodes connect via net_adm:ping/1, the created connection is not closed but all rpc calls use this pipe to transmit requests and pass responses. Please correct me on this one.Sending a Message from Node A to Node B I could make my processes on Node A to just send message to a registered process, or a group of processes on Node B. This too seems a clean option.
Q1. Which of the above options would you recommend and why, for an application in which two erlang nodes are going to have enormous communications between them all the time. Imagine a messaging system, in which two erlang nodes are the routers :) ? Q2. Which of the above methods is cleaner, less problematic and is more fault tolerant (i mean here that, the method should NOT have single point of failure, that could lead to all processes on Node A blind) ? Q3. The mechanism of your choice: how would you make it even more fault tolerant, or redundant? Assumptions: The Nodes are always alive and will never go down, the network connection between the nodes will always be available and non-congested (dedicated to the two nodes only) , the operating system have allocated maximum resources to these two nodes. Thank you for your evaluations
HTTP is definitely out. Just the round-trip overhead of creating a new connection is a problem.
As for Erlang connections and using Pids, you have the advantage that you can subscribe to node-down messages and handle the case where a node goes down. A single TCP connection should be able to give you very fast speeds, however, be aware that it works like a long pipe: messages are muxed and demuxed on a pipe which can affect latency on the line. It also means that large messages will block small messages from getting through.
How much bandwidth are you aiming for, and at what latency? What is the 95th and 99th percentile of answering messages? It is better to put up some rough numbers and then try to target these than just having "as fast as possible". Set your success criteria first.
Q1: HTTP will add additional overhead and will give you nothing in my opinion. HTTP would be useful if you were designing a REST API. Directly sending messages and rpc:call look about the same as far as overhead is regarded.
Q2: Sending messages is much much clearer. It's the way erlang is designed. With RPC calls you must always track which call is executed where and under which circumstances which can be a huge issue if the two servers have state. Also RPC calls are synchronous.
Q3: I would use UBF if I can afford minor overhead, otherwise I would directly send messages between the erlang nodes. If the bandwidth is an issue other trickery would be needed as well. Like encoding the messages in same way and then using some compression algorithm to reduce the size of the message, alternatively I may ditch the erlang message passing altogether and use UDP sockets.
It is not obvious that ! is the best way to go. Definitely, it is the easiest and the code will be the most elegant.
In terms of scalability, take under consideration that to use rpc/! you have to maintain an erlang cluster. I found it painful having just 10-20 nodes even in private cloud. I would never recommend bigger deployments on e.g. EC2, where io/latency/network is not deterministic.
I recommend to structure the project in a way that will let you exchange communication engine in the future. Also HTTP is pretty heavy, but there are options:
socket-socket (tcp/udp/sctp)
amqp (many benefits connected to load balancing)
zeromq (even nicer than amqp)
Betting on !/rpc and OTP cluster is risky. You will fight with full mesh overhead, master election algos and quorum/partition detection.

How do I maximize HTTP network throughput?

I was running a benchmark on CouchDB when I noticed that even with large bulk inserts, running a few of them in parallel is almost twice as fast. I also know that web browsers use a number of parallel connections to speed up page loading.
What is the reason multiple connections are faster than one? They go over the same wire, or even to localhost.
How do I determine the ideal number of parallel requests? Is there a rule of thumb, like "threadpool size = # cores + 1"?
The gating factor is not the wire itself which, after all, runs pretty quick (ignoring router delays) but the software overhead at each end. Each physical transfer has to be set up, the data sent and stored, and then completely handled before anything can go the other way. So each connection is effectively synchronous, no matter what it claims to be at the socket level: one socket operating asynchronously is still moving data back and forth in a synchronous way because the software demands synchronicity.
A second connection can take advantage of the latency -- the dead time on the wire -- that arises from the software doing its thing for first connection. So, even though each connection is synchronous, multiple connections let things happen much faster. Things seem (but of course only seem) to happen in parallel.
You might want to take a look at RFC 2616, the HTTP spec. It will tell you about the interchanges that happen to get an HTTP connection going.
I can't say anything about optimal number of parallel requests, which is a matter between the browser and the server.
Each connection consume one own thread. Each thread, have a quantum for consume CPU, network and other resources. Mainly, CPU.
When you start a parallel call, thread will dispute CPU time and run things "at the same time".
It's a high level overview of the things. I suggest you to read about asynchronous calls and thread programming to understand it better.
[]'s,
And Past

Resources