Pacemaker does not fail over when the nginx service is down - nginx

I have set up an HA cluster for nginx, so that when nginx or a node fails, it fails over to the second node.
[root@server1 ~]# pcs status
Cluster name: push_noti_cluster
Stack: corosync
Current DC: push2 (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Tue Jul 31 11:29:16 2018
Last change: Tue Jul 31 09:20:05 2018 by root via cibadmin on push1
2 nodes configured
3 resources configured
Online: [ push1 push2 ]
Full list of resources:
virtual_ip (ocf::heartbeat:IPaddr2): Started push1
Clone Set: Nginx-clone [Nginx]
Started: [ push1 push2 ]
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
Failover works fine when we stop the cluster service with pcs cluster stop on either of the nodes or reboot the servers.
What we want to achieve is resource failover when nginx on host node01 stops running, so that both resources (virtual_ip/webserver) fail over to the second host, node02.
Is it possible to do a service-level failover? That is, when one resource fails on one node (node01), all the configured resources (here virtual_ip/webserver) should fail over to the other node (node02).

From what you write, it looks like you have not configured a constraint that ties the "active" node to the node where nginx (or whatever service you need) is actually running.
Compare your configuration against the examples on this site:
https://wiki.clusterlabs.org/wiki/Example_configurations#Failover_IP_.2B_One_service
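For example, a minimal sketch with pcs, using the resource names from the pcs status output above (treat it as a starting point rather than a drop-in configuration):
# make sure the nginx resource is actually monitored, so Pacemaker notices when it dies
pcs resource update Nginx op monitor interval=10s timeout=20s
# keep the virtual IP only on a node that is running a healthy nginx clone instance
pcs constraint colocation add virtual_ip with Nginx-clone INFINITY
# start nginx before bringing the IP up on a node
pcs constraint order start Nginx-clone then virtual_ip
With the colocation constraint in place, when the nginx instance on push1 fails its monitor, Pacemaker moves virtual_ip to push2, where the other clone instance is still healthy.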

Related

EUCA 4.4.5 VPCMIDO Instances Terminate at Launch

I have built a small test cloud on 3 pieces of hardware. It works fine in EDGE mode, but when I try to configure it for VPCMIDO, new instances begin to launch but then time out and move to a terminated state. I can also see the instances' initial volume and config data appear in the NC and CC data directories. Below is my system layout and network.json.
HOST 1 : CLC/UFS/WALRUS/MIDO CLUSTER/MIDO GATEWAY/MIDOLMAN AGENT:
em1 (All Services including Mido Cluster): 10.0.0.21
em3 (Target VPCMIDO Adapter): 10.0.0.22
HOST 2 : CC/SC
em1 : 10.0.0.23
HOST 3 : NC/MIDOLMAN AGENT
em1 : 10.0.0.24
{
  "Mido": {
    "Gateways": [
      {
        "Ip": "10.0.0.22",
        "ExternalDevice": "em3",
        "ExternalCidr": "192.168.0.0/16",
        "ExternalIp": "192.168.0.2",
        "ExternalRouterIp": "192.168.0.1"
      }
    ]
  },
  "Mode": "VPCMIDO",
  "PublicIps": [
    "10.0.100.1-10.0.100.254"
  ]
}
I may be misunderstanding the intent of reserving an interface just for the MidoNet gateway. All of my eucalyptus/zookeeper/cassandra/midonet configs use the 10.0.0.21 interface and seem to communicate fine, and MidoNet reports both my CLC host and my NC host in the tunnel zone. The only part of my config that references the interface I intend to use for the MidoNet gateway is network.json. No errors were returned at any time during my configuration, so I think I may be missing something conceptual.
You may need to start eucanetd as described here:
https://docs.eucalyptus.cloud/eucalyptus/4.4.5/index.html#install-guide/starting_euca_clc.html
The eucanetd component in VPCMIDO mode runs on the Cloud Controller and is responsible for controlling MidoNet.
When eucanetd is not running, instances will fail to start because the required network resources are not created.
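For example, on the CLC (assuming eucanetd is managed by systemd, as in the 4.4 packages):
systemctl enable eucanetd
systemctl start eucanetd
systemctl status eucanetd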
I configured a bridge on the NC, instances were able to launch, and I no longer got an error in my nc.log. The docs and the comments in eucalyptus.conf tell me I shouldn't need to do this in VPCMIDO networking mode: https://docs.eucalyptus.cloud/eucalyptus/4.4.5/index.html#install-guide/configuring_bridge.html
Despite all that, adding the bridge fixed this issue.
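For anyone hitting the same thing, a bridge on the NC can be set up roughly like this (a sketch assuming a RHEL/CentOS 7 style host and the NC address from the layout above, not the exact config used here):
# /etc/sysconfig/network-scripts/ifcfg-br0
DEVICE=br0
TYPE=Bridge
BOOTPROTO=static
IPADDR=10.0.0.24
NETMASK=255.255.255.0
ONBOOT=yes
# /etc/sysconfig/network-scripts/ifcfg-em1
DEVICE=em1
TYPE=Ethernet
BRIDGE=br0
ONBOOT=yes
If your install still reads VNET_BRIDGE from /etc/eucalyptus/eucalyptus.conf, point it at br0 as well.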

Squid proxy cluster with multiple virtual IP

I configured a clustered Squid proxy server on CentOS 7 using Corosync, Pacemaker and pcs.
I have two servers in the cluster, server01 and server02, each with one IP of its own. They are in the above-mentioned cluster with two virtual IPs, virtual_ip and virtual_ip2. The crm_mon output is as below:
Stack: corosync
Current DC: server02 (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Sun Sep 2 12:43:38 2018
Last change: Thu Aug 30 14:12:24 2018 by root via cibadmin on server01
2 nodes configured
3 resources configured
Online: [ server01 server02 ]
Active resources:
Resource Group: ProxyAndIP
virtual_ip (ocf::heartbeat:IPaddr2): Started server02
squid (ocf::heartbeat:Squid): Started server02
virtual_ip2 (ocf::heartbeat:IPaddr2): Started server01
I want to use both virtual IPs of the cluster in the Squid proxy for better HA and load balancing, and thereby define different access control lists and policies. Is that possible? If so, how can I achieve it?
You have to add virtual_ip2 to the group ProxyAndIP.
That should do it, and you can sequence the resources so that their start and stop order is controlled:
Resource Group: ProxyAndIP
virtual_ip (ocf::heartbeat:IPaddr2): Started server02
squid (ocf::heartbeat:Squid): Started server02
virtual_ip2 (ocf::heartbeat:IPaddr2): Started server02
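For example, a minimal sketch with pcs (assuming the resource names shown above):
pcs resource group add ProxyAndIP virtual_ip2
Resources added to a group are started in the order they appear in the group and stopped in reverse order, and the whole group is colocated, so all three resources will fail over together.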

Corda V3 Network Permissioning self issuance of node certificates

Hi, I have been trying to start up nodes for Corda V3 in non-dev mode.
Currently, after starting the node, I get the following error during restart: java.security.cert.CertPathValidatorException: The issuing certificate for C=UK, L=London, O=NetworkMapAndNotary has role NETWORK_MAP, expected one of [INTERMEDIATE_CA, NODE_CA]
The roles that I followed are described at this link: https://docs.corda.net/head/permissioning.html#certificate-role-extension
The role is obtained from the OID Corda Role extension (1.3.6.1.4.1.50530.1.1).
Any pointers for this issue?
When I followed dev mode and assigned my NetworkMapAndNotary certificate role 4, it fails to start up with the error: java.lang.IllegalArgumentException: Incorrect cert role: NODE_CA at net.corda.nodeapi.internal.network.NetworkMapKt.verifiedNetworkMapCert(NetworkMap.kt:48) ~[corda-node-api-corda-3.0.jar:?]
On a side note: I tried to follow the dev-mode cert creation and noticed that the dev-mode (NetworkMapAndNotary) cert is tagged as a node (role 4). Why is that so?
Certificate[2]:
Owner: O=NetworkMapAndNotary, L=London, C=UK
Issuer: C=UK, L=London, OU=corda, O=R3, CN=Corda Node Intermediate CA
Serial number: 39551bff61207fb6
Valid from: Mon Mar 26 07:00:00 ICT 2018 until: Thu May 20 07:00:00 ICT 2027
Certificate fingerprints:
MD5: D1:8C:4D:83:F2:A7:F4:DA:60:05:E3:69:2C:30:FF:20
SHA1: E5:4D:01:A5:68:01:73:59:8B:7A:3D:0B:28:4E:35:C4:CD:DE:C7:52
SHA256: 3F:D6:24:E5:C8:9F:BE:EE:D4:99:D7:2C:85:50:F0:A8:26:46:84:D7:FB:3A:42:54:F2:12:64:51:48:58:FD:CF
Signature algorithm name: SHA256withECDSA
Version: 3
Extensions:
#1: ObjectId: 1.3.6.1.4.1.50530.1.1 Criticality=false
0000: 02 01 04
I resolved it by issuing two different certificates, following this diagram: https://docs.corda.net/_images/certificate_structure.png
Basically I needed to create two certificates instead of one:
a self-signed certificate for the network map (network map role)
another signed certificate for the node CA (node role)
One issue here is that Corda's networkBootStrapper.kt tool has a hard-coded call inside installNetworkParameters: it always calls createDevNetworkMapCa() to generate a dev key pair, regardless of whether I am in dev mode or not.
I customized the file to use the self-signed certificate for the network map, adding the role extension. The node certificate remains as it is, but the network map key is used only once, just to generate the network-parameters file for each node; the node-role certificate is always used for node startup.
The restart was failing because a certificate with the network map role was acting as a node-role certificate in the network.
The Network Map has been redesigned in Corda V3. Take a look at the following blog post and the docs here.
Try removing the Network Map identity.

Can't establish connection over second NIC (two hops)

We are having trouble with the network routing configuration in Ubuntu Xenial.
We have many servers with both Debian 8.4 (Jessie) and Ubuntu 16.04.2 (Xenial) and the exact same networking setup (or at least as far as we can see).
They all have two NICs attached to two VLANs (say "A" and "B"), both reachable from other VLANs, for example from VLAN "C".
Both /etc/network/interfaces files are of the form:
NOTE: I faked names and IPs for the sake of better readability.
# VLAN A
auto eth0
iface eth0 inet static
address 192.168.111.xxx
netmask 255.255.255.0
broadcast 192.168.111.255
network 192.168.111.0
gateway 192.168.111.254
dns-nameservers 192.168.111.25 192.168.111.26
# VLAN B
auto eth1
iface eth1 inet static
address 192.168.222.xxx
netmask 255.255.255.0
broadcast 192.168.222.255
network 192.168.222.0
gateway 192.168.222.254 # <-- (Commented out in Ubuntu machine)
dns-nameservers 192.168.111.25 192.168.111.26
Say xxx is 100 for the Debian machine and 200 for the Ubuntu machine, and I'm trying to ping from 192.168.1.10 in VLAN "C" to the following addresses:
192.168.111.100: Works fine.
192.168.222.100: Works fine.
192.168.111.200: Works fine.
192.168.222.200: NO Answer!!
The "B" vlan is used mostly for backup and other "background" traffic to
avoid saturation problems in vlan "A".
I know that having two network paths to access same machine is not an usual
setup and I must say that only being able to connect thought one of them from
other networks is not a big problem nowadays. But what stucks to me is why
I can access to Debian Machines and not to Ubuntu ones?
Even, on the other hand, if it were working well in both platforms, we could
consider closing some services (such as ssh, and backend interfaces) from NIC
"A" to improve security (Our firewall only allows access to vlan "B" from our
IT staff vlan).
Of course, as it is commented in previous interfaces snippet, gateway
row is commented out in Ubuntu machines, but that is because, networking
initialization fails in that machines otherwise. That is, in fact, what we are
trying to solve.
But both machines' routing tables are almost identical. The only difference I could see was the onlink flag on the Ubuntu machine:
myUser@debianMachine:~$ sudo ip route
default via 192.168.111.254 dev eth0
192.168.111.0/24 dev eth0 proto kernel scope link src 192.168.111.100
192.168.222.0/24 dev eth1 proto kernel scope link src 192.168.222.100
myUser@ubuntuMachine:~$ sudo ip route
default via 192.168.111.254 dev eth0 onlink
192.168.111.0/24 dev eth0 proto kernel scope link src 192.168.111.200
192.168.222.0/24 dev eth1 proto kernel scope link src 192.168.222.200
...but I was able to remove it by following command:
myUser@ubuntuMachine:~$ sudo ip route replace default via 192.168.111.254 dev eth0
myUser@ubuntuMachine:~$ sudo ip route
default via 192.168.111.254 dev eth0
192.168.111.0/24 dev eth0 proto kernel scope link src 192.168.111.200
192.168.222.0/24 dev eth1 proto kernel scope link src 192.168.222.200
And it didn't fix the problem.
After that, I also tried to uncomment the gateway row of 'VLAN B' which, as I said, was commented out in the /etc/network/interfaces file, and tried to restart networking, but this is what happened:
myUser@ubuntuMachine:~$ sudo /etc/init.d/networking restart
[....] Restarting networking (via systemctl): networking.serviceJob for networking.service failed because the control process exited with error code. See "systemctl status networking.service" and "journalctl -xe" for details.
failed!
...and the onlink flag came back again.
As a note, after commenting that line out again and issuing a new /etc/init.d/networking restart, the output is the same until the machine is rebooted (even though networking, apart from the VLAN B default gateway issue, continues working as usual).
Following is the output of the suggested commands:
myUser@ubuntuMachine:~$ sudo systemctl status networking.service
● networking.service - Raise network interfaces
Loaded: loaded (/lib/systemd/system/networking.service; enabled; vendor preset: enabled)
Drop-In: /run/systemd/generator/networking.service.d
└─50-insserv.conf-$network.conf
Active: failed (Result: exit-code) since jue 2017-12-21 14:55:29 CET; 42s ago
Docs: man:interfaces(5)
Process: 8552 ExecStop=/sbin/ifdown -a --read-environment --exclude=lo (code=exited, status=0/SUCCESS)
Process: 8940 ExecStart=/sbin/ifup -a --read-environment (code=exited, status=1/FAILURE)
Process: 8934 ExecStartPre=/bin/sh -c [ "$CONFIGURE_INTERFACES" != "no" ] && [ -n "$(ifquery --read-envi
Main PID: 8940 (code=exited, status=1/FAILURE)
dic 21 14:55:29 ubuntuMachine systemd[1]: Stopped Raise network interfaces.
dic 21 14:55:29 ubuntuMachine systemd[1]: Starting Raise network interfaces...
dic 21 14:55:29 ubuntuMachine ifup[8940]: RTNETLINK answers: File exists
dic 21 14:55:29 ubuntuMachine ifup[8940]: Failed to bring up eth1.
dic 21 14:55:29 ubuntuMachine systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILUR
dic 21 14:55:29 ubuntuMachine systemd[1]: Failed to start Raise network interfaces.
dic 21 14:55:29 ubuntuMachine systemd[1]: networking.service: Unit entered failed state.
dic 21 14:55:29 ubuntuMachine systemd[1]: networking.service: Failed with result 'exit-code'.
...and the meaningful part of sudo journalctl -xe:
dic 21 14:55:29 ubuntuMachine sudo[8922]: myUser : TTY=pts/0 ; PWD=/home/myUser ; USER=root ; COMMAND=/etc/init.d/networking restart
dic 21 14:55:29 ubuntuMachine sudo[8922]: pam_unix(sudo:session): session opened for user root by myUser(uid=0)
dic 21 14:55:29 ubuntuMachine systemd[1]: Stopped Raise network interfaces.
-- Subject: Unit networking.service has finished shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit networking.service has finished shutting down.
dic 21 14:55:29 ubuntuMachine systemd[1]: Starting Raise network interfaces...
-- Subject: Unit networking.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit networking.service has begun starting up.
dic 21 14:55:29 ubuntuMachine ifup[8940]: RTNETLINK answers: File exists
dic 21 14:55:29 ubuntuMachine ifup[8940]: Failed to bring up eth1.
dic 21 14:55:29 ubuntuMachine systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILURE
dic 21 14:55:29 ubuntuMachine systemd[1]: Failed to start Raise network interfaces.
-- Subject: Unit networking.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit networking.service has failed.
--
-- The result is failed.
dic 21 14:55:29 ubuntuMachine systemd[1]: networking.service: Unit entered failed state.
dic 21 14:55:29 ubuntuMachine systemd[1]: networking.service: Failed with result 'exit-code'.
dic 21 14:55:29 ubuntuMachine sudo[8922]: pam_unix(sudo:session): session closed for user root
I googled a lot and was able to find some related information, but nothing that fully answers my question:
An explanation of the "onlink" flag that seemed to suggest it could be responsible for a "wrong return routing", in the sense that it «tells the kernel that it does not have to check if the gateway is reachable directly by the current machine», so (I figured) the kernel might think it could (or should) route the replies to incoming connections from VLAN C through the default gateway instead of through the same NIC the connection came in on.
But, as I said, removing the "onlink" flag didn't seem to change anything.
This Unix StackExchange answer seems to solve the problem (I haven't tested it yet) by using multiple routing tables and rules (to tell the kernel which table to use). But it doesn't explain why the Debian machines work well (I checked the /etc/iproute2/rt_tables file on both machines, and they are identical too):
myUser@bothMachines:~$ sudo cat /etc/iproute2/rt_tables
#
# reserved values
#
255 local
254 main
253 default
0 unspec
#
# local
#
#1 inr.ruhep
So my final hypothesis is that it could just be an implementation difference between kernel versions and, given that the Ubuntu one is much more recent, this could be the correct behaviour, so on modern kernels I need to use two different routing tables (but I'm not sure, and I don't know why...).
myUser@debianMachine:~$ sudo uname -a
Linux debianMachine 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt25-2 (2016-04-08) x86_64 GNU/Linux
myUser@ubuntuMachine:~$ sudo uname -a
Linux ubuntuMachine 4.4.0-87-generic #110-Ubuntu SMP Tue Jul 18 12:55:35 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
And, hence, the question is:
Are we doing something wrong on the Ubuntu machines (or is there some bug in them)? Or, conversely, is this the correct behaviour, and are we forced to set up a more complex routing schema (either per-VLAN routes or two routing tables) to make the two default gateways work again?
EDIT:
Now I have tried to add a static route to fix the problem:
myUser@ubuntuMachine:~$ sudo ip route add 192.168.1.0/24 via 192.168.222.254 dev eth1
...but that froze my ssh connection (through NIC A), even though I could then connect through NIC B (at 192.168.222.200).
Both routes at the same time don't seem to be possible:
myUser@ubuntuMachine:~$ sudo ip route add 192.168.1/24 via 192.168.111.254 dev eth0
myUser@ubuntuMachine:~$ sudo ip route add 192.168.1/24 via 192.168.222.254 dev eth1
RTNETLINK answers: File exists
EDIT 2:
I finally found the Linux Advanced Routing & Traffic Control HOWTO, which seems to be more accurate than all the other documentation I found; specifically, in its Chapter 4 (Rules - routing policy database) I see the following text:
If you want to use this feature, make sure that your kernel is
compiled with the "IP: advanced router" and "IP: policy routing"
features
...so I think everything points to my previous hypothesis of a kernel implementation difference being right, and that difference is, concretely, whether those two features are compiled in.
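If you want to check a given machine, the build options behind "IP: advanced router" and "IP: policy routing" should be CONFIG_IP_ADVANCED_ROUTER and CONFIG_IP_MULTIPLE_TABLES, so (assuming the distribution ships the kernel config under /boot) something like this shows whether they are enabled:
grep -E 'CONFIG_IP_ADVANCED_ROUTER|CONFIG_IP_MULTIPLE_TABLES' /boot/config-$(uname -r)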
Not an authoritative answer, but my first working attempt (applying what I managed to understand):
sudo ip route add 192.168.1.0/24 via 192.168.222.254 from 192.168.222.200 dev eth1 table 253
sudo ip rule add from 192.168.222.200 table 253
Update: the from and dev arguments in the ip route command aren't required (it works perfectly well without them).
After issuing the first command I couldn't connect yet, but after issuing the second one I could.
The logic behind that comes from this text I found in this document:
Linux-2.x can pack routes into several routing tables identified by a number in the range from 1 to 255 or by name from the file /etc/iproute2/rt_tables By default all normal routes are inserted into the main table (ID 254) and the kernel only uses this table when calculating routes.
Actually, one other table always exists, which is invisible but even more important. It is the local table (ID 255). This table consists of routes for local and broadcast addresses. The kernel maintains this table automatically and the administrator usually need not modify it or even look at it.
In fact, I finally ended up using another routing table, identified by its ID (253) rather than by what I now understand is just an alias (defined in the /etc/iproute2/rt_tables file).
Checking that file again, I now see that there was already an alias ("default") defined for that routing table, next to the "main" one, which is indeed 254, as the text fragment I pasted above says.
What I don't know yet is the logic behind this naming (the "default" alias for table 253, I mean) and whether, for any reason, it is better to use lower routing table numbers (1, 2, 3...) as this solution (already mentioned in the question) does.
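If you prefer a readable name over a bare number, you can also define your own alias in /etc/iproute2/rt_tables and use it in ip rule / ip route (the alias name vlanb here is just a hypothetical example):
echo "1 vlanb" | sudo tee -a /etc/iproute2/rt_tables
sudo ip rule add from 192.168.222.200 table vlanb
sudo ip route add default via 192.168.222.254 dev eth1 table vlanb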
But, for the sake of simplicity, if we aren't going to build complex routing policies and just want to fix this connectivity issue, I guess it could be a good solution to use something like (not yet tested):
gateway 192.168.222.254 table 253
post-up ip rule add from 192.168.222.200 table 253
I still need to test whether I need an additional via 192.168.222.254 in the gateway row, or whether it won't work at all and I need to add it with another post-up command instead.
I will update this answer with the results.
Edit 1: Same works with default routes:
sudo ip route add default from 192.168.222.200 via 192.168.222.254 table 253
sudo ip rule add from 192.168.222.200 table 253
Edit 2: First (now fully¹) working approach
After playing for a while with a testing machine, I think the best solution is to add the following rows to the second NIC's configuration in the /etc/network/interfaces file:
gateway 192.168.222.254 table 1
post-up ip rule add from 192.168.222.200 table 1
pre-down ip rule del from 192.168.222.200 table 1
post-up ip route add 192.168.222.0/24 dev eth1 src 192.168.222.200 table 1
Comments:
Adding table 1 to the gateway keyword worked well, so an additional (less readable) post-up command to add that default route was not necessary.
In fact, using a specific table (other than main) for the first NIC, together with a rule similar to the one we used for the second NIC, would be a bad idea, because that rule would only apply when 192.168.111.200 is used as the source address, so there would not be any "default default gateway". Leaving the first NIC's configuration in the main routing table makes all ("locally generated") outgoing connections to remote LANs go through our first default gateway by default.
The first post-up command adds a rule that packets with the source address of that NIC should be routed using table 1 (otherwise our new default gateway wouldn't be used).
The pre-down command removes that rule. It is not mandatory, but without it every network service restart would add a duplicate of the rule.
I also tried to use dev eth1 instead of from 192.168.222.200 (to avoid having to repeat the address), but it didn't work. I guess the NIC to use for the "response" packets was "not yet decided" at that point.
I used table 1 for eth1 (our second NIC); I could use table 2 for an eventual third one, and so on. There was no need to specify any table/rule for the first NIC because it goes into the main table (not "default": see the note below).
Finally(¹), the second post-up command is what makes everything work, because (as I now realize) only one routing table (the first matching one) is used, so the network route created automatically when the interface is brought up doesn't apply, since it lives in the main table.
I still don't know if there is a way to force it to be created directly in table 1.
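For reference, the whole eth1 stanza with these additions would look something like this (a sketch reusing the faked addresses from the question; adjust them to your own network):
# VLAN B
auto eth1
iface eth1 inet static
address 192.168.222.200
netmask 255.255.255.0
broadcast 192.168.222.255
network 192.168.222.0
gateway 192.168.222.254 table 1
dns-nameservers 192.168.111.25 192.168.111.26
post-up ip rule add from 192.168.222.200 table 1
pre-down ip rule del from 192.168.222.200 table 1
post-up ip route add 192.168.222.0/24 dev eth1 src 192.168.222.200 table 1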
NOTE: With the command sudo ip rule list we can see the current routing rules:
0: from all lookup local
32765: from 192.168.222.200 lookup 1
32766: from all lookup main
32767: from all lookup default
As far as I understand, they are added with decreasing priorities from 32767 down to 0 and tried in increasing order until one matches. The last two and the "0" one were already defined by default; the former because of the logic I cited above from this document, but that document says that rules start from "1", so I guess "0" must also be some predefined "default starting point".
Edit 3:
As I said in Edit 2 of the question, I found this Linux Advanced Routing & Traffic Control HOWTO, which helped me a lot in clarifying things.
Concretely, the Routing for multiple uplinks/providers chapter was very useful for understanding setups with "network loops" (even though in our case we aren't acting as a router to the Internet).

Understanding Docker container resource usage

I have a server running Ubuntu 16.04 with Docker 17.03.0-ce running an Nginx container. That server also has ConfigServer Security & Firewall installed. Shortly after starting the Nginx container I start receiving emails about "Excessive resource usage" with the following details:
Time: Fri Mar 24 00:06:02 2017 -0400
Account: systemd-timesync
Resource: Process Time
Exceeded: 1820 > 1800 (seconds)
Executable: /usr/sbin/nginx
Command Line: nginx: worker process
PID: 2302 (Parent PID:2077)
Killed: No
I fully understand that I can add exe:/usr/sbin/nginx to csf.pignore to stop these email alerts but I would like to understand a few things first.
Why is the "systemd-timesync" account being reported? That does not seem to have anything to do with Docker.
Why does the host machine seem to be reporting the excessive resource usage (the extended process time) when that is something running in the container?
Why do other Docker containers that are not running Nginx not result in excessive resource usage emails?
I'm sure there are other questions but basically, why is this being reported the way it is being reported?
I can at least answer the first two questions:
Unlike real VMs, Docker containers are simply a collection of processes run under the host system's kernel. They just have a different view of certain system resources, including their own file hierarchy, their own PID namespace and their own /etc/passwd file. As a result, they will still show up if you run ps aux on the host machine.
The nginx container's /etc/passwd includes a user 'nginx' with UID 104 that runs the nginx worker process. However, in the host's /etc/passwd, UID 104 might belong to a completely different user, such as systemd-timesync.
As a result, if you run ps aux | grep nginx in the container, you might see
nginx 7 0.0 0.0 32152 2816 ? S 11:20 0:00 nginx: worker process
while on the host, you see
systemd-timesync 22004 0.0 0.0 32152 2816 ? S 13:20 0:00 nginx: worker process
even though both are the same process (also note the different PID namespaces; in containers, PIDs are counted from 1 again).
As a result, container processes will still be subject to ConfigServer's resource monitoring, but they might show up with random, or even non-existent user accounts.
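If you want to see the mapping for yourself, a quick sketch (assuming the container is simply named nginx; adjust the name and UID to your setup) is to compare the numeric UID inside and outside the container:
docker exec nginx id nginx   # numeric UID of the nginx user inside the container (104 in this example)
getent passwd 104            # which host account, if any, owns that UID on the host (here: systemd-timesync)
docker top nginx             # host-side view of the container's processes, shown with host usernames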
As to why nginx triggers the emails and other containers don't, I can only assume that nginx is the only one of your containers that crosses ConfigServer's resource thresholds.
