I'm using syslog to log data to a file - the data is pretty intensive, on the order of thousands of rows every few seconds. What I observe is that trace amounts of logs are being missed - less than 0.1% most of the time - but they're still missing, and I have no explanation for why this occurs.
It doesn't seem to correlate directly with the volume of data being written, because increasing that volume did not increase the rate of missed logs.
I'm wondering about ways to debug this - how could I understand or confirm whether it is indeed syslog that is dropping data and, if so, why?
If you look at the source code for syslogd, you will see that the syslogd program only uses datagram sockets (type SOCK_DGRAM). These are by definition connectionless, and also not completely reliable in the way that stream sockets are.
This is by design. Using stream sockets would mean that the syslog() call would have to wait for confirmation that the message it sent was received properly, so if syslogd were busy, every application that calls syslog() would block.
Syslogd was simply not designed with the volume of data that you are subjecting it to in mind. You could try enlarging the value of the sysctl variable kern.ipc.maxsockbuf, giving the logging socket a larger buffer.
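To illustrate what a larger buffer means at the socket level, here is a sketch only - syslogd would have to do this on its own receiving socket, the kernel caps the value at kern.ipc.maxsockbuf, and the function name is purely illustrative:

#include <sys/socket.h>

/* Sketch: ask the kernel for a bigger receive buffer on a datagram    */
/* socket, so more messages can queue up before the kernel drops them. */
int enlarge_rcvbuf(int sock, int bytes)
{
    return setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes));
}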
If you want to make sure you capture everything, write to a file instead.
Overview
I have a Go echo http server running with version 1.13.
$ go version
go version go1.13.7 linux/amd64
I'm monitoring a number of different statistics about the server, including the number of goroutines. I periodically see brief spikes of thousands of goroutines, even though high load shouldn't push it beyond maybe a few hundred. These spikes do not correlate with an increase in http requests as logged by the labstack echo middleware.
To better debug this situation, I added a periodic check in the program which sends me a pprof report on the goroutines if the number spikes.
The additional goroutines surprised me, because when the server is in "normal" operating mode I see zero goroutines for the functions listed below.
goroutine profile: total 1946
601 @ 0x4435f0 0x4542e1 0x8f09dc 0x472c61
# 0x8f09db net/http.(*persistConn).readLoop+0xf0b /usr/local/go/src/net/http/transport.go:2027
601 @ 0x4435f0 0x4542e1 0x8f2943 0x472c61
# 0x8f2942 net/http.(*persistConn).writeLoop+0x1c2 /usr/local/go/src/net/http/transport.go:2205
601 @ 0x4435f0 0x4542e1 0x8f705a 0x472c61
# 0x8f7059 net/http.setRequestCancel.func3+0x129 /usr/local/go/src/net/http/client.go:321
What I'm struggling with, however, is where these are coming from, what they indicate, and at what point in an http request I would expect them.
To my untrained eye, it looks as if something is briefly attempting to open a connection and then immediately tries to close it.
But it would be good to have confirmation of this. In what part of an http request do readLoop, writeLoop, and setRequestCancel goroutines get started? What do these goroutines indicate?
Notes
A few things I've looked at:
I tried adding middleware to capture request frequencies per IP address as they come in, and to report on those when the spikes happen. The total request number remains low, in the 30-40 range, even as a spike is happening. No IP address is anomalous.
I've considered executing something like lsof to find open connections but that seems like a tenuous approach at best, and relies on my understanding of what these goroutines mean.
I've tried to cross-correlate the timing of seeing this with other things on the network, but without understanding what could cause this, I can't make much sense of where the potential culprit may lie.
If the number of goroutines exceeds 8192, the program crashes with the error: race: limit on 8192 simultaneously alive goroutines is exceeded, dying. A search for this error gets me to this github issue, which feels relevant because I am, in fact, using gorilla websockets in the program. However, although the binary was compiled with -race, no race condition is reported along with my error, which makes my situation different from the one in that issue.
I currently have a game server with a customizable tick rate, but for this example let's say the server is only going to tick once per second, or 1hz. I'm wondering what the best way is to handle incoming packets if the client send rate is faster than the server's, as my current setup doesn't seem to work.
I have my udp blocking receive with a timeout inside my tick function, and it works; however, if the client tick rate is higher than the server's, not all of the packets are received - only the one being read at that moment. So essentially the server is missing packets being sent by clients. The image below demonstrates my issue.
So my question is, how is this done correctly? Is there a separate thread where packets are read constantly and queued up, with the queue then processed when the server ticks, or is there a better way?
The image was taken from a video (https://www.youtube.com/watch?v=KA43TocEAWs&t=7s) but demonstrates exactly what I'm describing.
There's a bunch going on with the scenario you describe, but here's my guidance.
If you are running a server at 1hz, having a blocking socket prevent your main logic loop from running is not a great idea. There is a chance you won't be receiving messages at the rate you expect (due to packet loss, a network lag spike, or the client app closing).
A) You certainly could create another thread, keep making blocking recv()/recvfrom() calls there, and enqueue the results onto a thread-safe data structure.
B) You could also just use a non-blocking socket and keep reading packets until the read returns -1. The OS will buffer a certain (usually configurable) number of incoming messages, and starts dropping them if you aren't reading.
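As a rough sketch of option B (the function and buffer names are placeholders, and the socket is assumed to have been put into non-blocking mode already):

#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Drain every datagram the OS has buffered since the last tick.        */
/* 'sock' is a bound UDP socket set non-blocking via fcntl(O_NONBLOCK). */
void drain_packets(int sock)
{
    char buf[1500];                                  /* roughly one MTU */
    for (;;) {
        ssize_t n = recvfrom(sock, buf, sizeof(buf), 0, NULL, NULL);
        if (n < 0) {
            if (errno == EWOULDBLOCK || errno == EAGAIN)
                break;                  /* nothing left to read this tick */
            break;                      /* real error: log/handle as needed */
        }
        /* handle_packet(buf, (size_t)n);  -- process or enqueue it here */
    }
}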
Either way is fine; however, for individual game clients I prefer the second, simpler approach when I know the thread servicing the socket runs at a reasonable rate (5hz seems pretty low, but may be appropriate for your game). If there's a chance you are stalling the servicing thread (level loading, etc.), then go with the first approach, so a missed periodic keepalive message doesn't get treated as a disconnection.
On the server side, if I'm planning on a large number of clients/data, I go to great lengths to efficiently read from sockets - using IO Completion Ports on Windows, or epoll() on Linux.
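For reference, a minimal epoll() setup on Linux looks roughly like this - a sketch only; in real code the epoll instance would be created once rather than per call, and the names are placeholders:

#include <sys/epoll.h>
#include <unistd.h>

/* Wait up to timeout_ms for 'sock' to become readable using epoll. */
int wait_for_data(int sock, int timeout_ms)
{
    int ep = epoll_create1(0);
    if (ep < 0)
        return -1;

    struct epoll_event ev = {0};
    ev.events = EPOLLIN;                     /* interested in readability */
    ev.data.fd = sock;
    epoll_ctl(ep, EPOLL_CTL_ADD, sock, &ev);

    struct epoll_event ready[16];
    int n = epoll_wait(ep, ready, 16, timeout_ms);
    /* for each ready[i], read from ready[i].data.fd until it would block */

    close(ep);
    return n;
}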
Your server could have a thread that ticks every 5 seconds, just like the client, to receive all the packets. Anything not received during that tick would be dropped, as the server was not listening for it. You could then pass the data from that thread to the server as one chunk after 5 ticks. The more reliable option, though, is to set the server to 5hz, just like the client, and handle each packet that comes in from the client on its own thread so that it does not lock up the main thread.
For example, if the client update rate is 20, and the server tick rate is 64, the client might as well be playing on a 20 tick server.
I have seen a number of examples of paho clients reading sensor data and then publishing, e.g. https://github.com/jamesmoulding/motion-sensor/blob/master/open.py. None that I have seen have started a network loop as suggested in https://eclipse.org/paho/clients/python/docs/#network-loop. I am wondering if the network loop is unnecessary for publishing - perhaps it is only needed if I am subscribed to something?
To expand a bit on what @hardillb has said, his point 2, "To send the ping packets needed to keep a connection alive", is only strictly necessary if you aren't publishing at a rate sufficient to match the keepalive you set when connecting. In other words, it's entirely possible the client will never need to send a PINGREQ and hence never need to receive a PINGRESP.
However, the more important point is that it is impossible to guarantee that calling publish() will actually complete sending the message without using the network loop. It may work some of the time, but could fail to complete sending a message at any time.
The next version of the client will allow you to do this:
m = mqttc.publish("class", "bar", qos=2)  # returns a handle for the queued message
m.wait_for_publish()                      # blocks until the message has actually been sent
But this will require that the network loop is being processed in a separate thread, as with loop_start().
The network loop is needed for a number of things:
To deal with incoming messages
To send the ping packets needed to keep a connection alive
To handle the extra packets needed for high QOS
To send messages that take up more than one network packet (e.g. bigger than the local MTU)
The ping messages are only needed if you have a low message rate (less than 1 msg per keep alive period).
Given that you can start the network loop in the background on a separate thread these days, I would recommend starting it regardless.
I have code in C++ in which I use recv() from Berkeley sockets to receive data from a remote host. The issue is that I do not know the size of the data (which is variable), so I probably need some kind of timeout option to make this work.
Since I'm new to sockets programming, I was wondering how, for example, a web client handles responses from a server (e.g. when a server sends HTML data to the client). Does it use some kind of timeout, since it doesn't know how big the page is? Same with an FTP client.
When your data is of variable length, typically that data is framed within another container. That is to say, there's a header preceding the actual data block that tells the receiver how much data it should accept.
For example, HTTP uses newline characters to delimit the header lines. If there's a variable-length message body, the header will include a "Content-Length:" field that indicates exactly how many bytes to read once the entire header has been received (the header ends when you read two consecutive newlines).
It is perfectly fine to read 4 bytes from the socket, learn how much data follows, then do another receive and read the rest. Only be careful: when you ask for 4 bytes, the socket might give you anywhere between 1 and 4 bytes, so anything less than 4 means you need to go back and ask for the remaining bytes. This is a very common mistake. In a dev environment you will almost always get 4 bytes when asking for 4, but once you deploy your app, somewhere on some machine you will get random failures because the network behavior there is different.
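To make the "go back and ask for the remaining bytes" part concrete, a typical helper looks something like this (a sketch with minimal error handling; the function name is just for illustration):

#include <sys/socket.h>
#include <sys/types.h>

/* Keep calling recv() until exactly 'len' bytes have arrived.       */
/* Returns 0 on success, -1 on error or if the peer closed early.    */
int recv_exact(int sock, char *buf, size_t len)
{
    size_t got = 0;
    while (got < len) {
        ssize_t n = recv(sock, buf + got, len - got, 0);
        if (n <= 0)                /* 0 = connection closed, <0 = error */
            return -1;
        got += (size_t)n;
    }
    return 0;
}

With that in place, you read the fixed-size length header first, convert it from network byte order, and then call the same helper again for the payload.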
Generally, it is a bad approach to rely on timeouts to determine when you have reached the end of the data. With a timeout, you might get things "reliably" working in a well-controlled dev environment, but it is a very flaky solution. Any CPU/disk/network hiccup might cause your app to stop receiving prematurely. You are also limiting your data throughput and responsiveness, since your app is sleeping for some time interval instead of doing work.
I am designing and testing a client-server program based on TCP sockets (Internet domain). Currently, I am testing it on my local machine and am not able to understand the following about SIGPIPE.
1. SIGPIPE appears quite randomly. Can it be deterministic?
The first tests involved a single small (25 character) send operation from the client and the corresponding receive at the server. The same code, on the same machine, either runs successfully or fails with SIGPIPE, seemingly out of my control. The failure rate is about 45% (quite high). So, can I tune the machine in any way to minimize this?
2. The second round of testing was to send 40000 small (25 character) messages from the client to the server (1 MB of data in total) and then have the server respond with the total size of data it actually received. The client sends data in a tight loop and there is a SINGLE receive call at the server. It works only up to a maximum of 1200 bytes of total data sent, and again there are these non-deterministic SIGPIPEs, about 70% of the time now (really bad).
Can someone suggest an improvement to my design (probably it will be at the server)? The requirement is that the client should be able to send a medium to very high amount of data (again, about 25 characters per message) after a single socket connection has been made to the server.
I have a feeling that multiple sends against a single receive will always be lossy and very inefficient. Should we be combining the messages and sending them in a single send() operation? Is that the only way to go?
SIGPIPE is sent when you try to write to a pipe or socket whose other end has been closed. Ignoring the signal (or installing a handler for it) will make send() return an error instead:
signal(SIGPIPE, SIG_IGN);   /* ignore SIGPIPE process-wide; send() will fail with EPIPE instead */
Alternatively, you can disable SIGPIPE for a socket:
int n = 1;
setsockopt(thesocket, SOL_SOCKET, SO_NOSIGPIPE, &n, sizeof(n));   /* SO_NOSIGPIPE is a BSD/macOS option; on Linux, pass MSG_NOSIGNAL to send() instead */
Also, the data amounts you're mentioning are not very high. Likely there's a bug somewhere that causes your connection to close unexpectedly, giving a SIGPIPE.
SIGPIPE is raised because you are attempting to write to a socket that has been closed. This does indicate a probable bug, so check your application to see why it is occurring and attempt to fix that first.
Attempting to just mask SIGPIPE is not a good idea because you don't really know where the signal is coming from and you may mask other sources of this error. In multi-threaded environments, signals are a horrible solution.
In the rare cases where you cannot avoid this, you can mask the signal on send. If you set the MSG_NOSIGNAL flag on send()/sendto(), it will prevent SIGPIPE from being raised. If you do trigger this error, send() returns -1 and errno will be set to EPIPE. Clean and easy. See man send for details.
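For example, a minimal sketch (the wrapper name is just for illustration):

#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/* send() that never raises SIGPIPE: a peer that has gone away shows */
/* up as a normal -1/EPIPE return instead of a signal.               */
ssize_t send_nosignal(int sock, const void *buf, size_t len)
{
    ssize_t n = send(sock, buf, len, MSG_NOSIGNAL);
    if (n == -1 && errno == EPIPE) {
        /* the other end closed the connection: clean up this socket */
    }
    return n;
}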