How to achieve HA of SeaweedFS volume server? - volumes

I have two volume servers in the same rack and my replication is 001. If one of the volume servers goes down, uploads stop working because the 001 replication can no longer be satisfied. How can I ensure high availability of the volume servers? If I bring the crashed node back up, will the data be synchronized automatically? If so, will incoming requests automatically be routed to the other healthy node while the synchronization is running?
I am running two volume servers in Kubernetes in the same rack. One of the pods keeps restarting, and there are no obvious errors in the log.
System Setup
master:
/usr/bin/weed master -ip=weedfs-master -port=9333 -defaultReplication="001" -mdir=/data -volumePreallocate -volumeSizeLimitMB=1024
ports:
  - containerPort: 9333
args:
  - master
  - -ip=weedfs-master
  - -port=9333
  - -defaultReplication="001"
env:
  - name: TZ
    value: Asia/Hong_Kong
volume server1:
/usr/bin/weed volume -mserver=weedfs-master:9333 -max=500 -publicUrl=https://file-storage-exhibition.ssiid.com -ip=weedfs-volume-1 -port=8080 -rack=rack1 -dir=/data -max=0
ports:
  - containerPort: 8080
args:
  - volume
  - -mserver=weedfs-master:9333
  - -max=500
  - -publicUrl=https://file-storage-exhibition.ssiid.com
  - -ip=weedfs-volume-1
  - -port=8080
  - -rack=rack1
env:
  - name: TZ
volume server2:
/usr/bin/weed volume -mserver=weedfs-master:9333 -max=500 -publicUrl=https://file-storage-exhibition.ssiid.com -ip=weedfs-volume-2 -port=8080 -rack=rack1 -dir=/data -max=0
ports:
  - containerPort: 8080
args:
  - volume
  - -mserver=weedfs-master:9333
  - -max=500
  - -publicUrl=https://file-storage-exhibition.ssiid.com
  - -ip=weedfs-volume-2
  - -port=8080
  - -rack=rack1
env:
  - name: TZ
OS version
version 30GB 1.75 linux amd64
The log of the pod that keeps restarting:
I0513 11:03:04 1 file_util.go:20] Folder /data Permission: -rwxr-xr-x
I0513 11:03:04 1 volume_loading.go:104] loading index /data/5.idx to memory
I0513 11:03:04 1 disk_location.go:81] data file /data/5.idx, replicaPlacement=001 v=3 size=8 ttl=
I0513 11:03:04 1 volume_loading.go:104] loading index /data/6.idx to memory
I0513 11:03:04 1 disk_location.go:81] data file /data/6.idx, replicaPlacement=001 v=3 size=8 ttl=
I0513 11:03:04 1 volume_loading.go:104] loading index /data/7.idx to memory
I0513 11:03:04 1 disk_location.go:81] data file /data/7.idx, replicaPlacement=001 v=3 size=8 ttl=
I0513 11:03:04 1 volume_loading.go:104] loading index /data/2.idx to memory
I0513 11:03:04 1 volume_loading.go:104] loading index /data/3.idx to memory
I0513 11:03:04 1 disk_location.go:81] data file /data/3.idx, replicaPlacement=001 v=3 size=8 ttl=
I0513 11:03:04 1 volume_loading.go:104] loading index /data/4.idx to memory
I0513 11:03:04 1 disk_location.go:81] data file /data/4.idx, replicaPlacement=001 v=3 size=8 ttl=
I0513 11:03:04 1 disk_location.go:81] data file /data/2.idx, replicaPlacement=001 v=3 size=8 ttl=
I0513 11:03:04 1 disk_location.go:117] Store started on dir: /data with 6 volumes max 0
I0513 11:03:04 1 disk_location.go:120] Store started on dir: /data with 0 ec shards
I0513 11:03:04 1 volume.go:279] Start Seaweed volume server 30GB 1.75 at 0.0.0.0:8080
I0513 11:03:04 1 volume_grpc_client_to_master.go:27] Volume server start with seed master nodes: [weedfs-master:9333]
I0513 11:03:04 1 volume_grpc_client_to_master.go:71] Heartbeat to: weedfs-master:9333
I0513 11:03:04 1 disk.go:11] read disk size: dir:"/data" all:527253700608 used:6518046720 free:520735653888 percent_free:98.76377 percent_used:1.2362258
I0513 11:03:04 1 store.go:430] disk /data max 483 unclaimedSpace:490468MB, unused:6143MB volumeSizeLimit:1024MB
I0513 11:05:24 1 volume.go:205] graceful stop cluster http server, elapsed [0]
volume server has be killed
I0513 11:05:24 1 volume.go:210] graceful stop gRPC, elapsed [0]
I0513 11:05:24 1 volume_server.go:104] Shutting down volume server...
I0513 11:05:24 1 volume_server.go:106] Shut down successfully!
I0513 11:05:24 1 volume.go:215] stop volume server, elapsed [0]

How to ensure high availability of volume server?
Add more than two volume servers.
If I bring the crashed node back up, will the data be automatically synchronized?
The writes will fail, but the write request should get a new assignment from the master and go to the other volume servers. Nothing needs to be synchronized.
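To illustrate that flow, here is a minimal sketch (not from the original post; it assumes the master address weedfs-master:9333 from the setup above and the standard /dir/assign and upload endpoints): the client asks the master for a new file id and volume location on every write, so it is automatically steered to whichever volume server is healthy.

import requests

MASTER = "http://weedfs-master:9333"  # master address from the setup above

def upload(path):
    # Ask the master to assign a file id and pick a volume server.
    assignment = requests.get(MASTER + "/dir/assign").json()
    fid, volume_url = assignment["fid"], assignment["url"]
    # Upload the file to the volume server chosen by the master.
    with open(path, "rb") as f:
        resp = requests.post("http://%s/%s" % (volume_url, fid), files={"file": f})
    resp.raise_for_status()
    return fid  # keep the fid to read the file back later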

Related

Error on Starting MySQL Cluster 8.0 Data Node on Ubuntu 22.04 LTS

When I start the data nodeid 1 (10.1.1.103) of MySQL Cluster 8.0 on Ubuntu 22.04 LTS I am getting the following error:
# ndbd
Failed to open /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list: No such file or directory
2023-01-02 17:16:55 [ndbd] INFO -- Angel connected to '10.1.1.102:1186'
2023-01-02 17:16:55 [ndbd] INFO -- Angel allocated nodeid: 2
When I start data nodeid 2 (10.1.1.105) I get the following error:
# ndbd
Failed to open /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list: No such file or directory
2023-01-02 11:10:04 [ndbd] INFO -- Angel connected to '10.1.1.102:1186'
2023-01-02 11:10:04 [ndbd] ERROR -- Failed to allocate nodeid, error: 'Error: Could not alloc node id at 10.1.1.102:1186: Connection done from wrong host ip 10.1.1.105.'
The management node log file reports (on /var/lib/mysql-cluster/ndb_1_cluster.log):
2023-01-02 11:28:47 [MgmtSrvr] INFO -- Node 2: Initial start, waiting for 3 to connect, nodes [ all: 2 and 3 connected: 2 no-wait: ]
What is the relevance of failing to open: /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list: No such file or directory?
Why is data node on 10.1.1.105 unable to allocate a nodeid?
I initially installed a single Management Node on 10.1.1.102:
wget https://dev.mysql.com/get/Downloads/MySQL-Cluster-8.0/mysql-cluster_8.0.31-1ubuntu22.04_amd64.deb-bundle.tar
tar -xf mysql-cluster_8.0.31-1ubuntu22.04_amd64.deb-bundle.tar
dpkg -i mysql-cluster-community-management-server_8.0.31-1ubuntu22.04_amd64.deb
mkdir /var/lib/mysql-cluster
vi /var/lib/mysql-cluster/config.ini
The configuration set up on config.ini:
[ndbd default]
# Options affecting ndbd processes on all data nodes:
NoOfReplicas=2 # Number of replicas
[ndb_mgmd]
# Management process options:
hostname=10.1.1.102 # Hostname of the manager
datadir=/var/lib/mysql-cluster # Directory for the log files
[ndbd]
hostname=10.1.1.103 # Hostname/IP of the first data node
NodeId=2 # Node ID for this data node
datadir=/usr/local/mysql/data # Remote directory for the data files
[ndbd]
hostname=10.1.1.105 # Hostname/IP of the second data node
NodeId=3 # Node ID for this data node
datadir=/usr/local/mysql/data # Remote directory for the data files
[mysqld]
# SQL node options:
hostname=10.1.1.102 # In our case the MySQL server/client is on the same Droplet as the cluster manager
I then started and killed the running server and created a systemd file for Cluster manager:
ndb_mgmd -f /var/lib/mysql-cluster/config.ini
pkill -f ndb_mgmd
vi /etc/systemd/system/ndb_mgmd.service
Adding the following configuration:
[Unit]
Description=MySQL NDB Cluster Management Server
After=network.target auditd.service
[Service]
Type=forking
ExecStart=/usr/sbin/ndb_mgmd -f /var/lib/mysql-cluster/config.ini
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure
[Install]
WantedBy=multi-user.target
I then reloaded the systemd daemon to apply the changes, started and enabled the Cluster Manager and checked its active status:
systemctl daemon-reload
systemctl start ndb_mgmd
systemctl enable ndb_mgmd
Here is the status of the Cluster Manager:
# systemctl status ndb_mgmd
● ndb_mgmd.service - MySQL NDB Cluster Management Server
Loaded: loaded (/etc/systemd/system/ndb_mgmd.service; enabled; vendor preset: enabled)
Active: active (running) since Sun 2023-01-01 08:25:07 CST; 27min ago
Main PID: 320972 (ndb_mgmd)
Tasks: 12 (limit: 9273)
Memory: 2.5M
CPU: 35.467s
CGroup: /system.slice/ndb_mgmd.service
└─320972 /usr/sbin/ndb_mgmd -f /var/lib/mysql-cluster/config.ini
Jan 01 08:25:07 nuc systemd[1]: Starting MySQL NDB Cluster Management Server...
Jan 01 08:25:07 nuc ndb_mgmd[320971]: MySQL Cluster Management Server mysql-8.0.31 ndb-8.0.31
Jan 01 08:25:07 nuc systemd[1]: Started MySQL NDB Cluster Management Server.
I then set up a data node on 10.1.1.103, installing dependencies, downloading the data node and setting up its config:
apt update && apt -y install libclass-methodmaker-perl
wget https://dev.mysql.com/get/Downloads/MySQL-Cluster-8.0/mysql-cluster_8.0.31-1ubuntu22.04_amd64.deb-bundle.tar
tar -xf mysql-cluster_8.0.31-1ubuntu22.04_amd64.deb-bundle.tar
dpkg -i mysql-cluster-community-data-node_8.0.31-1ubuntu22.04_amd64.deb
vi /etc/my.cnf
I entered the address of the Cluster Management Node in the configuration:
[mysql_cluster]
# Options for NDB Cluster processes:
ndb-connectstring=10.1.1.102 # location of cluster manager
I then created a data directory and started the node:
mkdir -p /usr/local/mysql/data
ndbd
This is when I got the "Failed to open" error result on data node 1 (10.1.1.103):
# ndbd
Failed to open /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list: No such file or directory
2023-01-02 17:16:55 [ndbd] INFO -- Angel connected to '10.1.1.102:1186'
2023-01-02 17:16:55 [ndbd] INFO -- Angel allocated nodeid: 2
UPDATED (2023-01-02)
Thank you @MauritzSundell. I corrected the (private) IP addresses above and no longer got:
# ndbd
Failed to open /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list: No such file or directory
ERROR: Unable to connect with connect string: nodeid=0,10.1.1.2:1186
Retrying every 5 seconds. Attempts left: 12 11 10 9 8 7 6 5 4 3 2 1, failed.
2023-01-01 14:41:57 [ndbd] ERROR -- Could not connect to management server, error: ''
Also @MauritzSundell, in order to use the ndbmtd process rather than the ndbd process, does any alteration need to be made to any of the configuration files (e.g. /etc/systemd/system/ndb_mgmd.service)?
What is the appropriate reference/tutorial documentation for MySQL Cluster 8.0? Is it MySQL Cluster "MySQL NDB Cluster 8.0" on:
https://downloads.mysql.com/docs/mysql-cluster-excerpt-8.0-en.pdf
Or is it "MySQL InnoDB Cluster" on:
https://dev.mysql.com/doc/refman/8.0/en/mysql-innodb-cluster-introduction.html
Not sure I understand the difference.

kubernetes autoscale fails to get metrics (heapster already installed in kube-system namespace)

I created a mini cluster with Vagrant and CentOS 7. I managed to install kube-dns and Heapster, but when I try to test autoscaling with the php-apache example it doesn't work.
failed to get CPU consumption and request: failed to unmarshall heapster response: invalid character 'E' looking for beginning of value
That's odd, because I can see the metrics and the limits in Grafana. My kube-dns and Heapster are in the kube-system namespace, so it should work.
I have Kubernetes 1.2; if someone can help, that would be awesome.
Here are the logs of heapster :
[vagrant@localhost ~]$ kubectl logs --namespace=kube-system heapster-xzy31
I0625 11:08:50.041788 1 heapster.go:65] /heapster --source=kubernetes:http://192.168.50.130:8080?inClusterConfig=false&useServiceAccount=true&auth= --sink=influxdb:http://monitoring-influxdb:8086
I0625 11:08:50.042310 1 heapster.go:66] Heapster version 1.1.0
I0625 11:08:50.090679 1 configs.go:60] Using Kubernetes client with master "http://192.168.50.130:8080" and version v1
I0625 11:08:50.090705 1 configs.go:61] Using kubelet port 10255
E0625 11:09:00.097603 1 influxdb.go:209] issues while creating an InfluxDB sink: failed to ping InfluxDB server at "monitoring-influxdb:8086" - Get http://monitoring-influxdb:8086/ping: dial tcp: lookup monitoring-influxdb on 10.254.0.10:53: read udp 172.17.39.2:38757->10.254.0.10:53: read: connection refused, will retry on use
I0625 11:09:00.097624 1 influxdb.go:223] created influxdb sink with options: host:monitoring-influxdb:8086 user:root db:k8s
I0625 11:09:00.097638 1 heapster.go:92] Starting with InfluxDB Sink
I0625 11:09:00.097641 1 heapster.go:92] Starting with Metric Sink
I0625 11:09:00.103486 1 heapster.go:171] Starting heapster on port 8082
I0625 11:10:05.003399 1 manager.go:79] Scraping metrics start: 2016-06-25 11:09:00 +0000 UTC, end: 2016-06-25 11:10:00 +0000 UTC
E0625 11:10:05.003479 1 kubelet.go:279] Node 192.168.50.131 is not ready
I0625 11:10:05.051081 1 manager.go:152] ScrapeMetrics: time: 47.581507ms size: 70
I0625 11:10:05.060501 1 influxdb.go:201] Created database "k8s" on influxDB server at "monitoring-influxdb:8086"
I0625 11:11:05.001120 1 manager.go:79] Scraping metrics start: 2016-06-25 11:10:00 +0000 UTC, end: 2016-06-25 11:11:00 +0000 UTC
I0625 11:11:05.091844 1 manager.go:152] ScrapeMetrics: time: 90.657932ms size: 132
OK, I found the solution: my cluster was flawed. I had to install flannel on the master too, with the option --iface=eth1 because of Vagrant.
I followed this guide http://severalnines.com/blog/installing-kubernetes-cluster-minions-centos7-manage-pods-services but it didn't say to install flannel on the master.
Now everything is working.
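For reference, the change amounts to passing --iface=eth1 to flanneld on the master as well. A rough sketch of what that can look like in the CentOS 7 sysconfig file (variable names vary between flannel package versions; the etcd address and key below are placeholders):

# /etc/sysconfig/flanneld  (sketch; adjust to your flannel package version)
FLANNEL_ETCD="http://192.168.50.130:2379"   # placeholder etcd endpoint
FLANNEL_ETCD_KEY="/atomic.io/network"       # placeholder network prefix key
FLANNEL_OPTIONS="--iface=eth1"              # bind flannel to the Vagrant host-only interface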

Nginx worker_rlimit_nofile

How does one set worker_rlimit_nofile to a higher number, and what's the maximum it can be or is recommended to be?
I'm trying to follow the following advice:
The second biggest limitation that most people run into is also
related to your OS. Open up a shell, su to the user nginx runs as and
then run the command ulimit -a. Those values are all limitations
nginx cannot exceed. In many default systems the open files value is
rather limited, on a system I just checked it was set to 1024. If
nginx runs into a situation where it hits this limit it will log the
error (24: Too many open files) and return an error to the client.
Naturally nginx can handle a lot more than 1024 files and chances are
your OS can as well. You can safely increase this value.
To do this you can either set the limit with ulimit or you can use
worker_rlimit_nofile to define your desired open file descriptor
limit.
From: https://blog.martinfjordvald.com/2011/04/optimizing-nginx-for-high-traffic-loads/
worker_rlimit_nofile = worker_connections * 2 file descriptors
Each worker connection opens 2 file descriptors (1 for upstream, 1 for downstream).
While setting the worker_rlimit_nofile parameter, you should consider both worker_connections and worker_processes. You may want to check your OS's file descriptor limits first using ulimit -Hn and ulimit -Sn, which give you the per-user hard and soft file limits respectively. You can change the OS limit using sysctl as:
sudo sysctl -w fs.file-max=$VAL
where $VAL is the number you would like to set. Then, you can verify using:
cat /proc/sys/fs/file-max
If you are automating the configuration, it is easy to set worker_rlimit_nofile as:
worker_rlimit_nofile = worker_connections*2
worker_processes is set to 1 by default; you can set it to a number less than or equal to the number of cores on your server, which you can count with:
grep -c ^processor /proc/cpuinfo
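As a rough sketch of how these directives fit together in nginx.conf (the numbers below are placeholders to adapt to your own ulimit and core count, not recommendations):

worker_processes 4;               # <= number of cores (or "auto" on recent nginx)
worker_rlimit_nofile 8192;        # >= worker_connections * 2

events {
    worker_connections 4096;      # each connection can use 2 file descriptors
}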
EDIT:
The latest versions of nginx set worker_processes auto by default, which sets it to the number of processors available on the machine. Hence, it's important to know why you would really want to change it.
Normally, setting it to the highest value or to all available processors doesn't improve performance beyond a certain limit: you will likely get the same performance with 24 as with 32 processors. Some kernel/TCP-stack parameters could also help mitigate bottlenecks.
In microservice deployments (Kubernetes), it's also very important to consider the pod resource requests/limits when setting these configurations.
To check how many worker processes nginx has spawned, you can run ps -lfC nginx. For example, on my machine, which has 12 processors, nginx spawned 12 worker processes:
$ ps -lfC nginx
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
5 S root 70488 1 0 80 0 - 14332 - Jan15 ? 00:00:00 nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
5 S www-data 70489 70488 0 80 0 - 14526 - Jan15 ? 00:08:24 nginx: worker process
5 S www-data 70490 70488 0 80 0 - 14525 - Jan15 ? 00:08:41 nginx: worker process
5 S www-data 70491 70488 0 80 0 - 14450 - Jan15 ? 00:08:49 nginx: worker process
5 S www-data 70492 70488 0 80 0 - 14433 - Jan15 ? 00:08:37 nginx: worker process
5 S www-data 70493 70488 0 80 0 - 14447 - Jan15 ? 00:08:44 nginx: worker process
5 S www-data 70494 70488 0 80 0 - 14433 - Jan15 ? 00:08:46 nginx: worker process
5 S www-data 70495 70488 0 80 0 - 14433 - Jan15 ? 00:08:34 nginx: worker process
5 S www-data 70496 70488 0 80 0 - 14433 - Jan15 ? 00:08:31 nginx: worker process
5 S www-data 70498 70488 0 80 0 - 14433 - Jan15 ? 00:08:46 nginx: worker process
5 S www-data 70499 70488 0 80 0 - 14449 - Jan15 ? 00:08:50 nginx: worker process
5 S www-data 70500 70488 0 80 0 - 14433 - Jan15 ? 00:08:39 nginx: worker process
5 S www-data 70501 70488 0 80 0 - 14433 - Jan15 ? 00:08:41 nginx: worker process
To print the exact count, you can filter on the UID (for my setup the UID is www-data, which is configured in nginx.conf as user www-data;):
$ ps -lfC nginx | awk '/nginx:/ && /www-data/{count++} END{print count}'
12
In Kubernetes, nginx spawns worker processes depending on the resource request for the pod by default.
E.g. if you have the following in your deployment:
resources:
  requests:
    memory: 2048Mi
    cpu: 2000m
then nginx will spawn 2 worker processes (2000 millicpu = 2 CPUs).

Docker container http requests limit

I'm new to Docker so, most likely, I'm missing something.
I'm running a container with Elasticsearch, using this image.
I'm able to set everything up correctly. After that I was using a script developed by a colleague to insert some data, basically querying a MySQL database and making HTTP requests.
The problem is that many of those requests get stuck until the script fails. If I do netstat -tn | grep 9200 I get:
tcp6 0 0 ::1:58436 ::1:9200 TIME_WAIT
tcp6 0 0 ::1:59274 ::1:9200 TIME_WAIT
...
tcp6 0 0 ::1:58436 ::1:9200 TIME_WAIT
tcp6 0 0 ::1:59274 ::1:9200 TIME_WAIT
and so on, with a lot of connections. At this point I'm not sure whether it's something related to Elasticsearch or Docker. This does not happen if Elasticsearch is installed directly on my machine.
Some info:
$ docker version
Client version: 1.6.2
Client API version: 1.18
Go version (client): go1.4.2
Git commit (client): 7c8fca2
OS/Arch (client): linux/amd64
Server version: 1.6.2
Server API version: 1.18
Go version (server): go1.4.2
Git commit (server): 7c8fca2
OS/Arch (server): linux/amd64
$ docker info
Containers: 6
Images: 103
Storage Driver: devicemapper
Pool Name: docker-252:1-9188072-pool
Pool Blocksize: 65.54 kB
Backing Filesystem: extfs
Data file: /dev/loop0
Metadata file: /dev/loop1
Data Space Used: 4.255 GB
Data Space Total: 107.4 GB
Data Space Available: 103.1 GB
Metadata Space Used: 6.758 MB
Metadata Space Total: 2.147 GB
Metadata Space Available: 2.141 GB
Udev Sync Supported: false
Data loop file: /var/lib/docker/devicemapper/devicemapper/data
Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
Library Version: 1.02.82-git (2013-10-04)
Execution Driver: native-0.2
Kernel Version: 3.14.22-031422-generic
Operating System: Ubuntu 14.04.2 LTS
CPUs: 4
Total Memory: 15.37 GiB
$ docker logs elasticsearch
[2015-06-15 09:10:33,761][INFO ][node ] [Energizer] version[1.6.0], pid[1], build[cdd3ac4/2015-06-09T13:36:34Z]
[2015-06-15 09:10:33,762][INFO ][node ] [Energizer] initializing ...
[2015-06-15 09:10:33,766][INFO ][plugins ] [Energizer] loaded [], sites []
[2015-06-15 09:10:33,792][INFO ][env ] [Energizer] using [1] data paths, mounts [[/usr/share/elasticsearch/data (/dev/mapper/ubuntu--vg-root)]], net usable_space [145.3gb], net total_space [204.3gb], types [ext4]
[2015-06-15 09:10:35,516][INFO ][node ] [Energizer] initialized
[2015-06-15 09:10:35,516][INFO ][node ] [Energizer] starting ...
[2015-06-15 09:10:35,642][INFO ][transport ] [Energizer] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/172.17.0.5:9300]}
[2015-06-15 09:10:35,657][INFO ][discovery ] [Energizer] elasticsearch/Y1zfiri4QO21zRhcI-bTXA
[2015-06-15 09:10:39,426][INFO ][cluster.service ] [Energizer] new_master [Energizer][Y1zfiri4QO21zRhcI-bTXA][76dea3e6d424][inet[/172.17.0.5:9300]], reason: zen-disco-join (elected_as_master)
[2015-06-15 09:10:39,446][INFO ][http ] [Energizer] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/172.17.0.5:9200]}
[2015-06-15 09:10:39,446][INFO ][node ] [Energizer] started
[2015-06-15 09:10:39,479][INFO ][gateway ] [Energizer] recovered [0] indices into cluster_state
The important part of the script:
for package in c.fetchall():
    id_package, tracking_number, order_number, payment_info, shipment_provider_name, package_status_name = package
    el['tracking_number'] = tracking_number
    el['order_number'] = order_number
    el['payment_info'] = payment_info
    el['shipment_provider_name'] = shipment_provider_name
    el['package_status_name'] = package_status_name
    requests.put("http://localhost:9200/packages/package/%s/_create" % (id_package), json=el)
So, it wasn't a problem with either Docker or Elasticsearch. Just to recap: the same script throwing PUT requests at an Elasticsearch instance installed locally worked, but throwing them at a container running Elasticsearch failed after a few thousand documents (20k). Note that the overall number of documents was roughly 800k.
So, what happened? When you set something up running on localhost and make a request to it (in this case a PUT request), that request goes through the loopback interface. In practice this makes connection handling extremely cheap, so everything runs a lot faster.
When the Docker container was set up, its ports were bound to the host. Although the script still makes requests to localhost on the desired port, a TCP connection gets created between the host and the Docker container through the docker0 interface. This comes at the expense of two things:
the time to set up a TCP connection
TIME_WAIT state
This is actually the more realistic scenario. We set up Elasticsearch on another machine, did the exact same test, and got, as expected, the same result.
The problem was that we were creating a new connection for every request. Due to the way TCP works, connections cannot be closed immediately (they linger in TIME_WAIT), which meant we were using up all the available connections until we had none left, because the rate of creation was higher than the actual close rate.
Three suggestions to fix this:
Pause requests every once in a while, e.g. sleep after every X requests, making it possible for the TIME_WAIT to pass and the connections to close.
Send the Connection: close header: an option for the sender to signal that the connection will be closed after completion of the response.
Reuse connection(s).
I ended up going with option 3 and rewrote my colleague's script to reuse the same TCP connection.
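For completeness, a minimal sketch of that rewrite using requests.Session, which keeps the underlying TCP connection alive and reuses it across requests (c and the field names come from the snippet above; the rest is illustrative):

import requests

session = requests.Session()  # pools and reuses the TCP connection

for package in c.fetchall():
    (id_package, tracking_number, order_number, payment_info,
     shipment_provider_name, package_status_name) = package
    el = {
        'tracking_number': tracking_number,
        'order_number': order_number,
        'payment_info': payment_info,
        'shipment_provider_name': shipment_provider_name,
        'package_status_name': package_status_name,
    }
    # Same endpoint as before, but every PUT goes over the pooled connection.
    session.put("http://localhost:9200/packages/package/%s/_create" % id_package, json=el)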

On graceful signal, nginx throws Connection refused when trying to connect to uWSGI socket

This problem is perplexing me, because I seem to be following everything within the docs that would allow for a graceful restart.
I am running uWSGI in Emperor mode, with a bunch of vassals. When I try to do a graceful restart of one of the vassals, I receive an nginx 502 Bad Gateway response for about half a second. Here's some information:
One of my vassal .ini file:
[uwsgi]
master = true
processes = 2
home = /var/www/.virtualenvs/www.mysite.com
socket = /tmp/uwsgi.sock.myapp
pidfile = /tmp/uwsgi.pid.myapp
module = myapp
pythonpath = /var/www/www.mysite.com/mysite
logto = /var/log/uwsgi/myapp.log
chmod-socket = 666
vacuum = true
gid = www-data
uid = www-data
Then, I want to gracefully restart this process:
kill -HUP `cat /tmp/uwsgi.pid.myapp`
The output from the vassal log file looks alright (I think?)
...gracefully killing workers...
Gracefully killing worker 1 (pid: 29957)...
Gracefully killing worker 2 (pid: 29958)...
binary reloading uWSGI...
chdir() to /var/www/www.mysite.com/vassals
closing all non-uwsgi socket fds > 2 (max_fd = 1024)...
found fd 3 mapped to socket 0 (/tmp/uwsgi.sock.kilroy)
running /var/www/.virtualenvs/www.mysite.com/bin/uwsgi
*** has_emperor mode detected (fd: 15) ***
[uWSGI] getting INI configuration from kilroy.ini
open("/var/log/uwsgi/kilroy.log"): Permission denied [utils.c line 250]
unlink(): Operation not permitted [uwsgi.c line 998]
*** Starting uWSGI 1.2.3 (64bit) on [Fri Jun 8 09:15:10 2012] ***
compiled with version: 4.6.3 on 01 June 2012 09:56:19
detected number of CPU cores: 2
current working directory: /var/www/www.mysite.com/vassals
writing pidfile to /tmp/uwsgi.pid.kilroy
detected binary path: /var/www/.virtualenvs/www.mysite.com/bin/uwsgi
setgid() to 33
setuid() to 33
your memory page size is 4096 bytes
detected max file descriptor number: 1024
lock engine: pthread robust mutexes
uwsgi socket 0 bound to UNIX address /tmp/uwsgi.sock.kilroy fd 3
Python version: 2.7.3 (default, Apr 20 2012, 23:04:22) [GCC 4.6.3]
Set PythonHome to /var/www/.virtualenvs/www.mysite.com
*** Python threads support is disabled. You can enable it with --enable-threads ***
Python main interpreter initialized at 0x19e3e90
your server socket listen backlog is limited to 100 connections
*** Operational MODE: preforking ***
added /var/www/www.mysite.com/gapadventures/ to pythonpath.
WSGI app 0 (mountpoint='') ready in 0 seconds on interpreter 0x19e3e90 pid: 30041 (default app)
*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI master process (pid: 30041)
spawned uWSGI worker 1 (pid: 30042, cores: 1)
spawned uWSGI worker 2 (pid: 30043, cores: 1)
But when I try to access the site quickly after this, my nginx log gets this result:
2012/06/08 09:44:43 [error] 5885#0: *873 connect() to unix:///tmp/uwsgi.sock.kilroy failed (111: Connection refused) while connecting to upstream, client: 10.100.50.137, server: mydomain.com, request: "GET /favicon.ico HTTP/1.1", upstream: "uwsgi://unix:///tmp/uwsgi.sock.kilroy:", host: "mydomain.com"
This happens for about half a second after sending the signal, so this is clearly not very graceful.
Any advice? Thanks so much!
Correct the socket paths in your nginx config and in uWSGI; the sockets have to be identical.
You currently have:
unix:///tmp/uwsgi.sock.kilroy
or
/tmp/uwsgi.sock.myapp
You need, in nginx:
unix:/tmp/uwsgi.sock.myapp
and in uWSGI:
socket = /tmp/uwsgi.sock.myapp
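For example, a minimal sketch of the matching nginx side (server and location details are placeholders; only the socket path has to match the socket = line in the vassal's .ini):

location / {
    include uwsgi_params;
    uwsgi_pass unix:/tmp/uwsgi.sock.myapp;   # must match the vassal's "socket =" setting
}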
