How to find the root cause of a server that shuts down frequently - WordPress

I'm on Ubuntu with Apache. Lately the server shuts down frequently and comes back after a restart. I kept analyzing the apache2 error log to find the cause. Previously it was reporting a PHP code error, but after I fixed that, it now throws different errors.
What can I conclude based on these errors? Which of them probably caused the downtime, and how do I fix it?
AH: scoreboard is full, not at MaxRequestWorkers.Increase ServerLimit.
AH00484: server reached MaxRequestWorkers setting, consider raising the MaxRequestWorkers setting
AH00687: Negotiation: discovered file(s) matching request: /opt/bitnami/apps/file_that_doesn't-exist
(70007)The timeout specified has expired: [client 148.251.79.134:60170] AH01075: Error dispatching request to : (polling)
AH00045: child process 5062 still did not exit, sending a SIGTERM
AH00046: child process 5299 still did not exit, sending a SIGKILL
AH01909: localhost:443:0 server certificate does NOT include an ID which matches the server name
I've done enough Google searching to understand each of these errors. Most importantly, I would like to know which of these errors would have caused the server to go down, and what is the way to fix it?

Bitnami Engineer here,
AH: scoreboard is full, not at MaxRequestWorkers.Increase ServerLimit.
AH00484: server reached MaxRequestWorkers setting, consider raising the MaxRequestWorkers setting
It seems the server is reaching its limits, and that can be the reason for the issues you are running into. It can reach the limits either because the instance is really small and you need to increase its memory/CPU, or because you are being attacked.
You can check if you are being attacked by running these commands:
cd /opt/bitnami/apache2/logs/
tail -n 10000 access_log | awk '{print $1}' | sort | uniq -c | sort -nr | head -n 10   # top 10 client IPs by request count
Are those IPs familiar? Is there any IP that is requesting your site too many times?
You can find more information about how to block such an IP in the Bitnami documentation.
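As a quick illustration (the address below is a placeholder from the documentation IP range), an abusive client can be blocked at the firewall:
sudo ufw deny from 203.0.113.45
# or, with iptables directly:
sudo iptables -A INPUT -s 203.0.113.45 -j DROP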
You can also increase the MaxRequestWorkers parameter in Apache by editing the /opt/bitnami/apache2/conf/bitnami/httpd.conf file, or increase the instance type using the AWS console so the server has more resources from now on.
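As an illustrative sketch (assuming the stack uses the prefork MPM; the numbers are placeholders and should be sized to the instance's memory), the relevant directives in that file look like this. Note ServerLimit caps MaxRequestWorkers, which matches the "Increase ServerLimit" hint in your first error:
<IfModule mpm_prefork_module>
    ServerLimit          250
    MaxRequestWorkers    250
</IfModule>
After editing, restart Apache with the Bitnami control script:
sudo /opt/bitnami/ctlscript.sh restart apache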

Related

MariaDB: MySQL server has gone away

In my application, I have an issue where I receive the "MySQL server has gone away" error during a quite long-running transaction. I know this has already been asked a lot, but I tried my best to go through all possible causes.
The one thing that baffles me the most is this error message in the MariaDB server log:
[Warning] Aborted connection 6 to db: 'default' user: 'root' host: '10.0.0.18' (Got timeout reading communication packets)
This would explain why the client reports a broken connection, but this error occurs 10-15 minutes before the client reports the "MySQL server has gone away" error. In the meantime, the client happily keeps running insert statements without an issue. But as soon as the client runs a select statement, the statement fails practically immediately.
I have already checked for these possible causes:
The server was running all the time
wait_timeout is set to 8 hours, which is way longer than the time the transaction needs to fail
max_allowed_packet is set to 512M which should be more than enough since the query is a very short select statement
The server does not run out of memory
I'm pretty sure the issue must be related to the "Got timeout reading communication packets" error from the MariaDB log. But I cannot wrap my head around why the client can still write data, and why this timeout occurs in the first place, since wait_timeout is super high.
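One thing I still want to rule out: as far as I understand, wait_timeout only applies between statements on an idle connection, while a stall in the middle of reading a statement is governed by net_read_timeout (default 30 seconds) and net_write_timeout. A quick way to inspect all the relevant values at once, assuming shell access to the server:
mysql -u root -p -e "SHOW GLOBAL VARIABLES WHERE Variable_name IN ('wait_timeout','interactive_timeout','net_read_timeout','net_write_timeout','max_allowed_packet');"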
Some system information:
I'm running on MariaDB 10.5.1
The client uses Python 3.6, and mysqlclient (which uses libmysql) is used for the database connection.
I hope maybe some of you have an idea what I should look for, because this is really driving me nuts.

Nginx High Open TCP connections

My web page got a 500 error and dropped. Checking my Nginx metrics in GCP, I detected:
Too many open TCP connections
Too many accepted and handled connections
Too many writing connections
A normal number of requests per minute for each distinct IP in access.log (compared with other days and months)
=> The drop in the graphs is because I restarted the server.
So, according to these metrics, I don't see any relation between the number of connections (TCP, accepted, handled, and writing) and the requests (access.log).
Furthermore, is this amount of open TCP connections normal? I don't think so.
I'd appreciate your opinions and possible reasons why this happened.
500 is a server-side error that generally occurs when the server is unable to process the request. The web server throws a 500 Internal Server Error when it encounters an unexpected condition which prevents it from fulfilling the client's request.
The probable causes of a 500 error include:
Permission errors
Incorrect code in the .htaccess file
PHP timeouts
Syntax or code errors in CGI/Perl scripts
The PHP memory limit
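To narrow down which of these applies here, I would start with the nginx error log around the time of the 500s and a socket-state summary (the paths below are the usual defaults and may differ on your setup):
tail -n 100 /var/log/nginx/error.log   # upstream/backend failures are usually logged here
ss -s                                  # per-state socket counts (established, time-wait, ...)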

Timer_ConnectionIdle IIS

I get Timer_ConnectionIdle messages in the error logs of the HTTPERR folder in system32/LogFiles.
And sometimes the web page returns "service unavailable" or "connection refused".
What is the problem?
How can I solve that?
You are talking about two different issues here.
Timer_ConnectionIdle is not something you need to be worried about. It is HTTP.SYS's way of telling you that a client it established a connection with did not disconnect; the connection is kept open because there is always a chance that the client will want to use it again. I think HTTP.SYS usually waits for 2 minutes before terminating the connection, and that is when you get this message in the HTTPERR logs.
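If you want to double-check the idle timeout HTTP.SYS is actually using, I believe you can query it from an elevated command prompt:
netsh http show timeout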
Now, coming to the Service Unavailable and Connection Timeout errors: this is something you need to take note of. Check the event logs for the time of the issue and see if you find anything there.
If you are unable to find anything in the event logs, my next question would be: what do you do to get past the issue? Do you recycle the application pool to get the application up and running? Do you reset IIS? If you do any of the above, then please capture a full user dump of the w3wp process using DebugDiag during the time of the issue (before performing an iisreset or application pool recycle). Analyzing the dump will tell you exactly what's going wrong.
Feel free to follow up with any questions you have.

nginx <=> php-fpm: unix socket gives error, tcp connection is slow

I'm running nginx with php-fpm on a high-traffic site. I let nginx communicate with php-fpm over TCP/IP, with both nginx and the php-fpm pools running on the same server.
When I use TCP/IP for nginx and the php-fpm pools to communicate with each other, loading a page takes a few (5-10) seconds before anything happens at all, and when it finally gets going, the loading finishes in no time. Since the php-fpm status page shows that the listen backlog is full, I assume the request waits in the queue for a while before it is handled.
Netstat shows a lot (20k+) of connections in the TIME_WAIT state; I don't know if this is related, but it seemed relevant.
When I let nginx and php-fpm communicate over a UNIX socket instead, the time before the page actually loads is reduced to almost nothing, and the time before the finished page is in my browser is 1000x less. The only problem with the UNIX socket is that it gives me a LOT of errors in the logs:
*3377 connect() to unix:/dev/shm/.php-fpm1.sock failed (11: Resource temporarily unavailable) while connecting to upstream, client: 122.173.178.150, server: nottherealserver.fake, request: "GET somerandomphpfile HTTP/1.1", upstream: "fastcgi://unix:/dev/shm/.php-fpm1.sock:", host: "nottherealserver.fake", referrer: "nottherealserver.fake"
My two questions are:
does anybody know why the tcp/ip method has such a large wait before it actually seems to connect to the php-fpm backend?
why do the UNIX sockets cause problems when using this instead of tcp/ip?
What I tried:
set net.ipv4.tcp_tw_recycle and net.ipv4.tcp_tw_reuse to 1 when trying to decrease the number of TIME_WAIT connections (went down from 30k+ to 20k+)
increased the net.core.somaxconn value from the default 128 to 1024 (tried higher too but still the same error when using the UNIX sockets)
increased the max number of open files
What is probably also quite relevant: I tried lighttpd + FastCGI and had the same problem with the long time before a connection finally gets handled. MySQL is not too busy, so it shouldn't be the cause of the long waiting times. Disk wait time is 0% (SSD), so a busy disk doesn't seem to be the culprit either.
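For completeness, the "(11: Resource temporarily unavailable)" errors are EAGAIN from connect(), which as far as I know means the socket's accept backlog overflowed. These are the two knobs I understand are involved (I already raised the first; values are illustrative and the pool config path varies by distro):
sudo sysctl -w net.core.somaxconn=1024   # kernel-side cap on any listen backlog
# in the php-fpm pool config (e.g. /etc/php-fpm.d/www.conf):
#   listen.backlog = 1024
# then reload php-fpm so the socket is re-created with the new backlog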
Hope that somebody found a fix for this problem, and is willing to share :)
Answering my own question since the problem is solved (not sure if this is the correct way to do it).
My problem was that APC caching didn't work at all. It was installed, configured and enabled, but did not add anything to its cache. After switching from APC to XCache, there was a huge drop in load and load times. I still don't know why APC did nothing, but at the moment I'm just happy that the problem is solved :)
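If anyone wants to check the same thing on their own box, a rough quick check for whether APC is even enabled is:
php -i | grep -i apc
though note that php -i reflects the CLI configuration; the fpm pools may load different ini files.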
Thanks for all the input from you guys!

HTTP 504 timeout after exactly 120 seconds

I have a server application which runs in the Amazon EC2 cloud. From my client (the browser) I make an HTTP request which uploads a file to the server, which then processes the file. If there is a lot of processing (a large file), the server always times out with a 504 backend continuation error, always exactly after 120 seconds. Though I get this error, the server continues to process the request and completes it (verified by checking the database), but I cannot see the final result on my client because of the timeout.
I am clueless as to why this is happening. Has anyone faced a similar 504 timeout? Is there some intermediate proxy server, not in my control, which is timing out?
I have a similar problem and in my case I believe it is due to the connection between the Elastic Load Balancer (ELB) and the EC2 instance.
For a long-term solution I will go with the 303 Status response + back-end processing suggested by james.garriss below.
For a short-term solution it may be possible for Amazon support to increase the ELB timeout (see their response in https://forums.aws.amazon.com/thread.jspa?messageID=491594&#491594). Unfortunately there doesn't seem to be any way to change the timeout yourself through either the API or the console.
[Update] AWS now allows you to update the idle timeout through the console, the CLI, or an .ebextensions configuration. See http://docs.aws.amazon.com/ElasticLoadBalancing/latest/DeveloperGuide/config-idle-timeout.html (thanks @Daniel Patz for the update)
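For reference, with the classic ELB CLI the idle timeout can be raised like this (the load balancer name is a placeholder; 300 seconds is just an example):
aws elb modify-load-balancer-attributes --load-balancer-name my-load-balancer --load-balancer-attributes "{\"ConnectionSettings\":{\"IdleTimeout\":300}}"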
Assuming that the correct status code is being returned, the problem is that an intermediate proxy is timing out. "The server, while acting as a gateway or proxy, did not receive a timely response from the upstream server specified by the URI." (http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.5.5) It most likely indicates that the origin server is having some sort of issue (i.e., taking a long time to process your request), so it's not responding quickly.
Perhaps the best solution is to re-craft your server app so that it responds with a "303 See Other" status code; your client can then retrieve the data at a later point, once the server is done processing and has created the final result.
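As a rough sketch of that flow (URLs made up): the upload returns immediately with a redirect to a status resource, and the client fetches the result later:
curl -i -F "file=@big-upload.dat" https://example.com/process
#   -> HTTP/1.1 303 See Other
#   -> Location: /results/123
curl -i https://example.com/results/123   # returns the final result once processing is done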
Edit: Another idea is to re-craft your server app so that it responds with a "413 Request Entity Too Large" status code when the request entity is too large. This will get rid of the error, though it may make your app less useful if it can only process "small" files.
Other possible solutions:
Increase timeout value of the proxy (if it's under your control)
Make your request to a different server (if there's another, faster server with the same app)
Make your request differently (if possible) such that you are sending less data at a time
It is also possible that the browser times out during the script execution.
