Statsd dying silently - graphite

Statsd, started by Chef, is dying. I believe I have isolated the problem away from Chef, as the init script Chef calls is doing what it is supposed to. I have turned debug on for statsd, and the following are the last messages in the log before it dies:
15 Oct 11:17:39 - reading config file: /etc/statsd/config.js
15 Oct 11:17:39 - server is up
15 Oct 11:17:39 - DEBUG: Loading backend: ./backends/graphite
I am absolutely stumped; nothing in /var/log/messages, nothing in the error log. Any idea if statsd requires certain services up and running?

StatsD doesn't require any specific services to run. You can also set dumpMessages to true in the config to have it log all incoming messages and get a picture of what's happening. Does it send any data to Graphite during the time it's up? If you continue having problems, there is also #statsd on Freenode IRC, where a lot of people who know a thing or two about StatsD are idling.
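For reference, a minimal /etc/statsd/config.js with debug and dumpMessages turned on might look roughly like this (the Graphite host and ports are illustrative):
{
  graphitePort: 2003,
  graphiteHost: "graphite.example.com",
  port: 8125,
  backends: ["./backends/graphite"],
  // log extra debug information and every incoming message
  debug: true,
  dumpMessages: true
}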

I had a very similar issue. I found this page very informative. I used only the part about packaging statsd with its backends (I changed the package files a bit) and installed the resulting package to run it as a daemon.
Hope this helps.

Related

MariaDB has stopped responding - [ERROR] mysqld got signal 6

The MariaDB service stopped responding all of a sudden. It had been running continuously for more than 5 months without any issues. When we checked the MariaDB service status at the time of the incident, it showed as active (running) ( service mariadb status ). But we could not log into the MariaDB server; each login attempt just hung without any response. All our web applications also failed to communicate with the MariaDB service. We also checked max_used_connections, and it was below the maximum value (the check we ran is shown after the environment details below).
When going through the logs, we saw the error below (it had been triggered at the time of the incident).
210623 2:00:19 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Server version: 10.2.34-MariaDB-log
key_buffer_size=67108864
read_buffer_size=1048576
max_used_connections=139
max_threads=752
thread_count=72
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1621655 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x7f4c008501e8
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f4c458a7d30 thread_stack 0x49000
2021-06-23 2:04:20 139966788486912 [Warning] InnoDB: A long semaphore wait:
--Thread 139966780094208 has waited at btr0sea.cc line 1145 for 241.00 seconds the semaphore:
S-lock on RW-latch at 0x55e1838d5ab0 created in file btr0sea.cc line 191
a writer (thread id 139966610978560) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file btr0sea.cc line 1145
Last time write locked in file btr0sea.cc line 1218
We could not even stop the MariaDB service using the usual commands ( service mariadb stop ). But we were able to forcefully kill the MariaDB process, and then we could get the MariaDB service back online.
What could be the reason for this failure? If you have already faced similar issues, please share your experience and what actions you took to prevent such failures in the future. Your feedback is much appreciated.
Our environment details are as follows:
Operating system: Red Hat Enterprise Linux 7
Mariadb version: 10.2.34-MariaDB-log MariaDB Server
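For reference, the connection-usage check mentioned above was roughly the following (a sketch; credentials and client flags may differ in your setup):
# compare the peak connection count against the configured limit
mysql -u root -p -e "SHOW GLOBAL STATUS LIKE 'Max_used_connections'; SHOW GLOBAL VARIABLES LIKE 'max_connections';"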
I also face this issue on an AWS instance (c5a.4xlarge) hosting my database.
Server version: 10.5.11-MariaDB-1:10.5.11+maria~focal
It has already happened 3 times, occasionally. Like you, there was no way to stop the service; only rebooting the machine got it working again.
Logs at restart suggest some tables crashed and should be repaired.

Airflow 504 gateway time-out

Many times when I try to open the tree view or task duration page of some DAGs in the UI I get the error: 504 gateway time-out.
Sometimes after that I can't even open the page with the list of DAGs.
Do you know where this problem could come from?
The CPU and memory of the machine running Airflow seem to be fine and I use RDS for the metadata.
Thanks!
I've experienced this before as well. I believe it's caused by an HTTP request that takes longer than expected for the webserver's gunicorn worker to fulfill. For example, if you set the DAG tree view to a high setting like 365 DAG runs for a DAG with a lot of tasks, you may be able to reproduce this consistently.
Can you try bumping up the timeout settings on the webserver to see if it makes a difference?
First, try increasing web_server_worker_timeout (default = 120 seconds) under the [webserver] group.
If that doesn't resolve it, you might also try increasing web_server_master_timeout under the same group.
Another technique to try is switching the webserver worker_class (default = sync) to eventlet or gevent.
Reference: https://github.com/apache/incubator-airflow/blob/c27098b8d31fee7177f37108a6c2fb7c7ad37170/airflow/config_templates/default_airflow.cfg#L225-L229
Note that the alternative worker classes require installing Airflow with the async extras like:
pip install apache-airflow[async]
You can find more info about gunicorn worker timeouts in this question: How to resolve the gunicorn critical worker timeout error?.
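Putting the suggestions above together, the relevant part of airflow.cfg might look roughly like this (the values are illustrative starting points, not recommendations):
[webserver]
# give slow views (e.g. large tree views) more time before gunicorn kills the worker
web_server_worker_timeout = 300
# how long the webserver waits on the gunicorn master before timing out
web_server_master_timeout = 300
# async worker classes require: pip install apache-airflow[async]
worker_class = gevent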

reload nginx with monit

I'm looking to reload, not restart, nginx with monit. The docs say that valid service methods are start, stop and restart but not reload.
Does anyone have a workaround for how to reload nginx rather than restart it?
Edit - I should have pointed out that I still require the ability to restart nginx but I also need, under certain conditions, to reload nginx only.
An example might be that if nginx goes down it needs to be restarted but if it has an uptime > 3 days (for example) it should be reloaded.
I'm trying to achieve this: https://mmonit.com/monit/documentation/monit.html#UPTIME-TESTING
...but with nginx reloading, not restarting.
Thanks.
I solved this issue using the exec command when my conditions are met. For example:
check system localhost
    if memory usage > 95% for 4 cycles
        then exec "/etc/init.d/nginx reload"
I've found that nginx memory issues can be resolved in the short term by reloading rather than restarting.
You can pass the reload signal, which should do the job:
nginx -s reload
"Use the docs. Luke!"
According to the documentation, sending HUP signal will cause nginx to re-read its configuration file(s), to check it, and to apply new configuration.
See for details: http://nginx.org/en/docs/control.html#reconfiguration
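If you prefer to send the signal by hand instead of using nginx -s reload, it would look something like this (the pidfile path is the one used in the config below; yours may differ):
kill -HUP "$(cat /usr/local/var/run/nginx.pid)"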
Here's a config that will achieve what you wanted:
check process nginx with pidfile /usr/local/var/run/nginx.pid
    start program = "/usr/local/bin/nginx"
    stop program = "/usr/local/bin/nginx -s stop"
    if uptime > 3 days then exec "/usr/local/bin/nginx -s reload"
I've tried this with my configuration. The only problem I'm seeing is that Monit assumes you're defining an error condition when you check the uptime like this. The nginx -s reload command, as I see it on my machine, does not reset the process's uptime back to 0. Monit thinks that the uptime being > 3 days is an error condition that is remedied by the command you give it in the config, but since that command doesn't reset the uptime to less than 3 days, Monit will report Uptime failed as the status of the process, and you'll see this in the logs:
error : 'nginx' uptime test failed for /usr/local/var/run/nginx.pid -- current uptime is 792808 seconds
You'll see hundreds of these, actually (my config has Monit run every 30 seconds, so I get one of these every 30 seconds).
One question: I'm not sure what reloading nginx after a long uptime, like 3 days, actually does for it - is that helpful for nginx? If you have a link to info on why that is good for nginx, it might help other readers getting to this page via search. Maybe you accepted the answer you did because you saw that it would only make sense to do this when there is an actual issue, like memory usage being high?
(old post, I know, but I got here via Google and saw that the accepted answer was incomplete, and also don't fully understand the OP's intent).
EDIT: ah, I see you accepted your own answer. My mistake. So it seems that you did in fact see that it was pointless to do what you initially asked, and instead opted for a memory check! I'll leave my post up to give this clarity to any other readers with the same confusion.

LoadRunner - Monitoring linux counters gives RPC error

The Linux distribution is Red Hat. I'm monitoring Linux counters with the LoadRunner Controller's System Resources Graphs - Unix Resources. Monitoring works properly and graphs are plotted in real time. But after a few minutes, errors appear:
Monitor name :UNIX Resources. Internal rpc error (error code:2).
Machine: 31.2.2.63. Hint: Check that RPC on this machine is up and running.
Check that rstat daemon on this machine is up and running
(use rpcinfo utility for this verification).
Details: RPC: RPC call failed.
RPC-TCP: recv()/recvfrom() failed.
RPC-TCP: Timeout reached. (entry point: Factory::CollectData).
[MsgId: MMSG-47197]
I logged on to the Linux server and found that rstatd was still running. After clearing the measurements in the Controller's Unix Resources and adding them again, monitoring started to work, but after a few minutes the same error occurred.
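For reference, the rpcinfo verification suggested in the hint looks roughly like this (using the host from the error message):
# list the RPC programs registered on the monitored host
rpcinfo -p 31.2.2.63 | grep -i rstat
# make a test call to rstatd over UDP
rpcinfo -u 31.2.2.63 rstatd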
What might cause this error? Is it due to network traffic?
Consider using SiteScope, which has been the preferred monitoring foundation for collecting UNIX/Linux status since version 8.0 of LoadRunner. Every LoadRunner license since version 8 has come with a 500-point SiteScope license in the box for this purpose. More points are available upon request for test-exclusive use of the instance.

Biztalk Cluster Servers

We used to have one BizTalk 2006 R2 32-bit server. We recently upgraded it to Enterprise, but because of our traffic volume we didn't have enough power and memory with only one. So we also recently installed a second BizTalk server, a 2006 R2 64-bit, and put them in a shared cluster. Since then a problem has arisen, actually two, but I'm guessing they are probably connected. One of our (19) host instances keeps ending up in the "Stopped" status. This host instance is mainly connected with TCP ports. We have a script that checks whether host instances are in the stopped state and starts them again, but this obviously is of very little use since the instance keeps reverting to the stopped state. There is also an error in our event viewer, namely:
Faulting application btsntsvc.exe, version 3.6.1404.0, stamp 4674b0a4, faulting module kernel32.dll, version 5.2.3790.4480, stamp 49c51f0a, debug? 0, fault address 0x0000bef7.
Does anyone have any idea?
Thanks
Having automated scripts restart the host instance is not a good idea IMO; you need to get to the bottom of the problem. It looks like a known issue, and a hotfix is available. It's worth looking at this KB: http://support.microsoft.com/kb/978059
