I have a website using NGINX and PHP-FPM. As you may know, PHP-FPM has a status page for its pools with detailed information about its processes. My problem is that, as time passes, many processes enter the "finishing" state and never leave it until I reload PHP-FPM.
The bad thing is that "finishing" processes count as active processes, and when the number of active processes surpasses pm.max_children, bad things happen on my website.
I know some PHP-FPM pool parameters to kill idle processes, but I can't find a parameter to kill "finishing" processes after a certain amount of time.
How do I deal with the PHP-FPM "finishing" state? Is there a configuration parameter to kill these "finishing" processes after some time? Could this be a misconfiguration between NGINX and PHP-FPM? What are the reasons for the "finishing" state?
Here is an image of the PHP-FPM status page. The red rows are "finishing" states, which is what I'm trying to fix. The request URIs are the different pages of my site.
Thanks for your knowledge.
PS1: Right now I'm reloading PHP-FPM every 15 minutes, and that more or less "fixes" the finishing states... but I think this could become a serious performance problem with more traffic.
PS2: So far the only solution I can think of is to read the PHP-FPM status page, get all processes in the "finishing" state, and kill by PID any process whose request duration exceeds an arbitrary threshold.
Had the same problem. Here's what I used as a temporary solution:
Create a PHP file with the content:
<?php
// Flush all response data to the client and finish the request.
fastcgi_finish_request();
?>
Then edit php.ini:
auto_append_file = /path/to/your/file.php
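If you prefer not to touch the global php.ini, the same directive can also be set per pool via php_admin_value in the PHP-FPM pool configuration (the path is distribution-dependent, e.g. /etc/php/7.0/fpm/pool.d/www.conf):

    ; pool-level alternative to the global php.ini setting
    php_admin_value[auto_append_file] = /path/to/your/file.php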
I had a problem like that, and this was my fix:
It turns out we were using invalid Memcached keys in certain situations. This was causing Memcached to die without error and the PHP process was staying alive.
https://serverfault.com/questions/626904/php-fpm-state-finishing-but-never-completes
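For context, the memcached protocol limits keys to 250 bytes and does not allow spaces or control characters. A hypothetical helper (not from the original answer) to sanitize keys before use might look like this:

    <?php
    // Hypothetical helper: sanitize a key before handing it to Memcached.
    function safe_memcache_key($key) {
        // replace spaces and control characters, which the protocol forbids
        $key = preg_replace('/[\x00-\x20\x7f]/', '_', $key);
        // keep the key under the 250-byte protocol limit
        if (strlen($key) > 250) {
            $key = substr($key, 0, 200) . '_' . md5($key);
        }
        return $key;
    }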
Neither answer solved or explained the root cause of the issue. Sample code implementing the PS2 approach of killing these processes would be something like this:
<?php
// Poll the PHP-FPM status page and kill workers stuck in the "Finishing" state.
while (true) {
    $data_json = file_get_contents("http://localhost/fpmstatus?json&full");
    $data = json_decode($data_json, true);
    foreach ($data['processes'] as $proc) {
        if ($proc['state'] === "Finishing") {
            $pid = $proc['pid'];
            // "request duration" is reported in microseconds; convert to seconds.
            $duration = $proc['request duration'] / 1000000.0;
            echo json_encode(compact("pid", "duration"));
            if ($duration > 60) {
                passthru("kill -9 " . $pid);
                echo " KILLED\n";
            } else {
                echo "\n";
            }
        }
    }
    echo count($data['processes']);
    echo "\n";
    sleep(30);
}
When running this code, I found that errors like this would occur in error.log:
2017/08/06 13:46:42 [error] 20#20: *9161 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 77.88.5.19, server: hostname1, request: "GET /?p=9247 HTTP/1.1", upstream: "fastcgi://unix:/run/php/php7.0-fpm.sock:", host: "hostname2"
The obvious mismatch was that the request was handled by the server block for "hostname1", while a block for "hostname2" didn't exist (anymore). I can't say for sure that this was the reason. There were "finishing" requests even after declaring a server_name _; block, but they were less frequent than before.
My server had the enablereuse=on setting in the Apache proxy configuration. Removing this fixed the "Finishing" problem.
Also listed in the question: https://serverfault.com/questions/626904/php-fpm-state-finishing-but-never-completes
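For context, enablereuse is a key=value option on Apache's ProxyPass/ProxyPassMatch lines. A hypothetical before/after (paths and addresses are made up):

    # before: backend connection reuse enabled
    ProxyPassMatch "^/(.*\.php)$" "fcgi://127.0.0.1:9000/var/www/html/$1" enablereuse=on

    # after: option removed, each request opens a fresh connection to PHP-FPM
    ProxyPassMatch "^/(.*\.php)$" "fcgi://127.0.0.1:9000/var/www/html/$1"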
Related
I have a Flask server with an endpoint that processes some uploaded .csv files and returns a .zip (in a JSON response, as a base64 string).
This process can take up to 90 seconds.
I've been setting it up for production using gunicorn and nginx, and I'm testing the endpoint with smaller .csv files. They get processed fine, and in a couple of seconds I get the "got blob" log. But nginx doesn't return the response to the client, and eventually it times out. I set a longer fail_timeout of 10 minutes, and the client WILL wait 10 minutes and then time out.
The proxy read timeout, commonly offered as the solution for this, is set to 3600s.
The proxy connect timeout is also set, to 75s.
The timeout for the gunicorn workers has been raised as well.
The error log says: "upstream timed out (110: Connection timed out) while reading upstream".
I also see cases where nginx receives an OPTIONS request and, immediately after, the POST request (some CORS weirdness from the client); nginx passes the OPTIONS request through but fails to pass the POST request to gunicorn, despite having received it.
Question:
What am I doing wrong here?
Many thanks
http {
    upstream flask {
        server 127.0.0.1:5050 fail_timeout=600;
    }

    # error log:
    # 2022/08/18 14:49:11 [error] 1028#1028: *39 upstream timed out (110: Connection timed out) while reading upstream, ...
    # ...

    server {
        # ...
        location /api/ {
            proxy_pass http://flask/;
            proxy_read_timeout 3600;
            proxy_connect_timeout 75s;
            # ...
        }
        # ...
    }
}
# wsgi.py
from main import app

if __name__ == '__main__':
    app.run()

# flask endpoint
@app.route("/process-csv", methods=['POST'])
def process_csv():
    def wrapped_run_func():
        # ... does the actual CSV processing ...
        return blob, export_filename
    # ...
    try:
        blob, export_filename = wrapped_run_func()
        b64_file = base64.b64encode(blob.getvalue()).decode()
        ret = jsonify(file=b64_file, filename=export_filename)
        # return Response(response=ret, status=200, mimetype="application/json")
        print("got blob")
        return ret
    except Exception as e:
        app.logger.exception(f"0: Error processing file: {export_filename}")
        return Response("Internal server error", status=500)
Sadly I got no response. See the last lines for the "solution" I finally implemented.
CAUSE OF ERROR: I believe the problem is that I'm hosting the Nginx server on WSL 1.
I tried updating to WSL 2 to see if that fixed it, but I would need to enable some kind of "nested virtualization", since WSL 1 is already running on a VM.
Through conf changes I got it to the point where no error is logged: gunicorn returns the file and then it just stays in the ether; Nginx never gets/sends the response.
"SOLUTION":
I ended up changing the code for the client, the server, and the nginx.conf file:
- The server saves the resulting file and only returns the file name.
- The client inserts the filename into an href that then displays a download link.
- On click, a request is sent to nginx, which in turn just serves the file from a static folder, leaving gunicorn alone (see the sketch below).
I guess this is the optimal way to do it anyway, though it still bugs me that I couldn't find (for sure) the reason for the error.
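A minimal sketch of the nginx side of that approach, assuming the processed files are saved to a hypothetical /var/www/results directory and requested under /results/<filename>:

    # serve generated .zip files directly from disk; gunicorn is never involved
    location /results/ {
        alias /var/www/results/;                      # hypothetical output directory
        default_type application/zip;
        add_header Content-Disposition "attachment";  # force a download prompt
    }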
I've got a Python flask app behind gunicorn behind nginx-ingress and I'm honestly just running out of ideas. What happens after a long-running computation is that there's a RemoteDisconnect error at 60 seconds, with nothing untoward in the logs. Gunicorn is set to have a massive timeout, so it's not that. Nginx is quite happy to terminate at 60 sec without any error:
xxx.xx.xx.xx - [xxx.xx.xx.xx] - - [03/Dec/2019:19:32:08 +0000] "POST /my/url" 499 0 "-" "python-requests/2.22.0" 1516 59.087 [my-k8s-service] [] xxx.xx.xx.xx:port 0 59.088 - c676c3df9a40c1692b1789e677a27268
No error, warning, nothing. Since 60s was so suspect, I figured it was proxy-read-timeout or upstream-keepalive-timeout... nothing; I've set those both in a configmap and in the .yaml files using annotations, and exec'ing into the pod for a cat /etc/nginx/nginx.conf shows the requisite server has the test values in place:
proxy_connect_timeout 72s;
proxy_send_timeout 78s;
proxy_read_timeout 75s;
...funny values set to better identify the result. And yet... still disconnects at 60 sec.
The "right" answer, which we're doing, is to rewrite the thing to have asynchronous calls, but it's really bothering me that I don't know how to fix this. Am I setting the wrong thing? In the background, the Flask app continues running and completes after several minutes, but Nginx just says the POST dies after a minute. I'm totally baffled. I've got an Nginx 499 error which means a client disconnect.
We're on AWS, so I even tried adding the annotation
Annotations: service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: 93
...to the service just in case (from https://kubernetes.io/docs/concepts/services-networking/service/). No dice: still dies at 60s.
What am I missing?
It doesn't seem to be an issue on the AWS load balancer side; it's an NGINX/Gunicorn connection issue. You need to increase the proxy timeout values. Try these annotations in the Ingress rules to fix it:
nginx.ingress.kubernetes.io/proxy-connect-timeout = 300s
nginx.ingress.kubernetes.io/proxy-send-timeout = 300s
nginx.ingress.kubernetes.io/proxy-read-timeout = 300s
If you are using Gunicorn, also set --timeout 300.
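For reference, a minimal sketch of how those annotations might look on the Ingress object (names are hypothetical; recent ingress-nginx releases expect the value as a plain number of seconds):

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: flask-ingress            # hypothetical name
      annotations:
        nginx.ingress.kubernetes.io/proxy-connect-timeout: "300"
        nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
        nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    spec:
      rules:
        - host: example.com          # hypothetical host
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: my-k8s-service
                    port:
                      number: 80

On the Gunicorn side, the equivalent would be starting the workers with --timeout 300 (or -t 300).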
I have NGINX working as a cache engine and can confirm that pages are being cached as well as being served from the cache. But the error logs are getting filled with this error:
2018/01/19 15:47:19 [crit] 107040#107040: *26 chmod() "/etc/nginx/cache/nginx3/c0/1d/61/ddd044c02503927401358a6d72611dc0.0000000007" failed (1: Operation not permitted) while reading upstream, client: xx.xx.xx.xx, server: *.---.com, request: "GET /support/applications/ HTTP/1.1", upstream: "http://xx.xx.xx.xx:80/support/applications/", host: "---.com"
I'm not really sure what the source of this error could be since NGINX is working. Are these errors that can be safely ignored?
It looks like you are using nginx proxy caching, but nginx does not have permission to manipulate files in its cache directory. You will need to get the ownership/permissions on the cache directory right.
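A minimal sketch of the usual fix, assuming the worker processes run as the nginx user (check the user directive in nginx.conf) and using the cache path from the error above:

    # give the nginx worker user ownership of the cache tree
    chown -R nginx:nginx /etc/nginx/cache
    # make sure the owner can read, write and traverse it
    chmod -R u+rwX /etc/nginx/cache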
Not explained in the original question is that the mounted storage is an Azure file share. So in the fstab I had to include the gid= and uid= options for the desired owner. This removed the need for chown, and chmod also became unnecessary. It got rid of the chmod() error but introduced another.
Then I was getting permission errors on rename(). At that point I scrapped what I was doing, moved to a different type of Azure storage (specifically a Disk attached to the VM), and all these problems went away.
So I'm offering this as an answer, but realistically the problem was not solved.
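For reference, a hypothetical /etc/fstab entry along those lines (share name, mount point and credentials file are made up; uid/gid should match the nginx worker user):

    //mystorageacct.file.core.windows.net/nginxcache  /etc/nginx/cache  cifs  credentials=/etc/smbcredentials/mystorageacct.cred,uid=nginx,gid=nginx,dir_mode=0770,file_mode=0770,nofail  0  0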
We noticed the same problem. Following the guide from Microsoft at https://learn.microsoft.com/en-us/azure/aks/azure-files-dynamic-pv#create-a-storage-class seems to have fixed it.
In our case the nginx process was using a different user for the worker threads, so we needed to find that user's uid and gid and use them in the StorageClass definition.
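A rough sketch of what such a StorageClass might look like (the name and the uid/gid of 101 are assumptions; use the actual uid/gid of your nginx worker user):

    kind: StorageClass
    apiVersion: storage.k8s.io/v1
    metadata:
      name: azurefile-nginx            # hypothetical name
    provisioner: kubernetes.io/azure-file
    mountOptions:
      - dir_mode=0770
      - file_mode=0770
      - uid=101                        # assumed uid of the nginx worker user
      - gid=101                        # assumed gid of the nginx worker user
    parameters:
      skuName: Standard_LRS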
I'm having a problem with Memcached pools. I will try to add all the context of the error to see if you guys can help me with this.
Context:
PHP 5.5.17 (cgi-fcgi) (built: Sep 24 2014 20:38:04)
php-pecl-memcache-3.0.8-2.fc17.remi.5.5.x86_64
nginx version:
nginx/1.0.15
My problem:
I am creating a connection with memcached and saving several keys, just on one server first, something like this:
$_memcache = new Memcache;
// arguments: host, port, persistent, weight, timeout, retry_interval
$_memcache->addServer("127.0.0.1", "11211", true, 50, 3600, 45);
So, let's suppose that I add several keys on that server. I can get them without problem; when I load my site and my code calls for the keys, it gets them.
Now the problem: with those keys already saved and working without problem, I added another memcached server to the pool, this way:
$_memcache = new Memcache;
$_memcache->addServer("10.0.0.2", "11211", true, 50, 3600, 45);
$_memcache->addServer("10.0.0.3", "11211", true, 50, 3600, 45);
But before refreshing the site to run my code and get the keys that I had stored on the first server, I stopped memcached on server number 1 (10.0.0.2). After that I refreshed my site and received a 502 error (Bad Gateway).
The error that I am seeing in the log of NGINX is:
[error] 9364#0: *329504 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: XX.XX.XXX.XXX, server: _, request: "GET / HTTP/1.1", upstream: "fastcgi://unix:/run/php-fastcgi/sock:", host: "www.myhost.com"
So, why am I getting that error? The only theory I have is that, because the connection is persistent, it is not closed properly when I stop the memcached server; but that only happens when I have a pool of more than one server. If I use just that one server and stop it, I don't see the error.
Is it possible that PHP 5.5 has a bug with the FastCGI socket that I don't know about?
NOTE: This problem wasn't happening on the previous PHP version, 5.3; it started happening after changing the version.
NOTE: When I don't use a persistent connection it seems to work, but this site has huge traffic and it won't handle the number of open connections; I tested this in a dev environment.
Any help or any suggestions are more than welcome.
Thanks in advance!
I use nginx along with fastcgi. I see a lot of the following errors in the error logs
readv() failed (104: Connection reset by peer) while reading upstream
recv() failed (104: Connection reset by peer) while reading response header from upstream
I don't see any problem using the application. Are these errors serious, and how do I get rid of them?
I was using php-fpm in the background, and slow scripts were getting killed after a set timeout because it was configured that way. Thus, scripts taking longer than the specified time would get killed, and nginx would report a recv() or readv() error as the connection was closed by the php-fpm engine/process.
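The pool setting that usually enforces this kind of hard kill is request_terminate_timeout; a hypothetical example in the pool config (the 30s value is just an illustration):

    ; kill any single request that runs longer than 30 seconds
    request_terminate_timeout = 30s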
Update:
Since nginx version 1.15.3 you can fix this by setting the keepalive_requests option of your upstream to the same number as your php-fpm's pm.max_requests:
upstream name {
...
keepalive_requests number;
...
}
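A hypothetical concrete version of that block, assuming pm.max_requests = 500 in the php-fpm pool and a unix-socket backend:

    upstream php_backend {                        # hypothetical upstream name
        server unix:/run/php/php7.0-fpm.sock;     # adjust to your socket path
        keepalive 8;                              # keep a few idle connections open
        keepalive_requests 500;                   # match php-fpm's pm.max_requests
    }

Note that nginx only reuses upstream connections to a FastCGI backend when fastcgi_keep_conn on; is set in the corresponding location, which is the same setting discussed in the original answer below.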
Original answer:
If you are using nginx to connect to php-fpm, one possible cause can also be having nginx' fastcgi_keep_conn parameter set to on (especially if you have a low pm.max_requests setting in php-fpm):
http|server|location {
...
fastcgi_keep_conn on;
...
}
This may cause the described error every time a child process of php-fpm restarts (due to pm.max_requests being reached) while nginx is still connected to it. To test this, set pm.max_requests to a really low number (like 1) and see if you get even more of the above errors.
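For example, in the pool configuration (path varies by distribution):

    ; temporary test value only: every worker exits after a single request
    pm.max_requests = 1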
The fix is quite simple - just deactivate fastcgi_keep_conn:
fastcgi_keep_conn off;
Or remove the parameter completely (since the default value is off). This does mean your nginx will reconnect to php-fpm on every request, but the performance impact is negligible if you have both nginx and php-fpm on the same machine and connect via unix socket.
Regarding this error:
readv() failed (104: Connection reset by peer) while reading upstream and recv() failed (104: Connection reset by peer) while reading response header from upstream
there was 1 more case where I could still see this.
Quick set up overview:
CentOS 5.5
PHP with PHP-FPM 5.3.8 (compiled from scratch with some 3rd party modules)
Nginx 1.0.5
After looking at the PHP-FPM error logs as well and enabling catch_workers_output = yes in the php-fpm pool config, I found the root cause in this case was actually the amfext module (PHP module for Flash).
There's a known bug in this module, with a fix that can be applied by altering the amf.c file.
After fixing this PHP extension issue, the error above was no longer an issue.
This is a very vague error as it can mean a few things. The key is to look at all possible logs and figure it out.
In my case, which is probably somewhat unique, I had a working nginx + php / fastcgi config. I wanted to compile a new updated version of PHP with PHP-FPM and I did so. The reason was that I was working on a live server that couldn't afford downtime. So I had to upgrade and move to PHP-FPM as seamlessly as possible.
Therefore I had 2 instances of PHP.
- One talking directly to fastcgi, using TCP 127.0.0.1:9000 (PHP 5.3.4)
- One configured with PHP-FPM, using a Unix socket, unix:/dir/to/socket-fpm (PHP 5.3.8)
Once I started up PHP-FPM (PHP 5.3.8) on an nginx vhost using a socket connection instead of TCP, I started getting this upstream error on any fastcgi page taking longer than x minutes, whether it was using FPM or not. Typically it was pages doing large SELECTs in MySQL that took ~2 minutes to load. Bad, I know, but this is because of the back-end DB design.
What I did to fix it was add this in my vhost configuration:
fastcgi_read_timeout 5m;
Now this can be added in the nginx global fastcgi settings as well. It depends on your set up. http://wiki.nginx.org/HttpFcgiModule
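For reference, a hypothetical placement inside a PHP location block (socket path taken from the setup above):

    location ~ \.php$ {
        fastcgi_pass unix:/dir/to/socket-fpm;
        fastcgi_read_timeout 5m;     # allow slow back-end queries to finish
        # ... the usual include fastcgi_params / fastcgi_param lines ...
    }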
Answer # 2.
Interestingly enough fastcgi_read_timeout 5m; fixed one vhost for me.
However I was still getting the error in another vhost, just by running phpinfo();
What fixed this for me was copying over a default production php.ini file and adding the config I needed into it.
What I had was an old copy of my php.ini from the previous PHP install.
Once I put the default php.ini from 'shared' in place and just added in the extensions and config I needed, this solved my problem and I no longer got the nginx readv() and recv() errors.
I hope 1 of these 2 fixes helps someone.
It can also be a very simple problem: an infinite loop somewhere in your code, or an endless attempt to connect to an external host from your page.
Sometimes this problem happens because of a large number of requests. By default, pm.max_requests in php5-fpm may be set to 100 or lower.
To solve it, increase its value depending on your site's traffic, for example to 500:
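For example, in the pool configuration (e.g. /etc/php5/fpm/pool.d/www.conf):

    ; allow each worker to serve more requests before being recycled
    pm.max_requests = 500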
After that, you have to restart the service:
sudo service php5-fpm restart
Others have mentioned the fastcgi_read_timeout parameter, which is located in the nginx.conf file:
http {
    ...
    fastcgi_read_timeout 600s;
    ...
}
In addition to that, I also had to change the setting request_terminate_timeout in the file: /etc/php5/fpm/pool.d/www.conf
request_terminate_timeout = 0
Source of information (there are also a few other recommendations for changing php.ini parameters, which may be relevant in some cases): https://ma.ttias.be/nginx-and-php-fpm-upstream-timed-out-failed-110-connection-timed-out-or-reset-by-peer-while-reading/