Infinispan Clustered Cache & JGroups - Servers don't see each other - spring-mvc

I'm using Infinispan to create a distributed cache between two servers and to leverage its failover feature.
I initially tested my webservice on two local instances of tomcat, using the pre-configured JGroups configuration file provided by infinispan-core-7.0.0.Final.jar. I was able to get the distributed cache working between the two Tomcat instances since the pre-configured xml files were using the loopback ip address.
I then moved the webservice onto two separate servers and have been unable to have them join the same Group. I created my own custom JGroups tcp configuration xml because using the loopback ip in the pre-configured one was causing some issues.
I don't have much experience in setting up tcp or udp channel, so I think the problem may lie with my JGroups configuration file (I based it off the pre-configured one).
<config xmlns="urn:org:jgroups"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.4.xsd">
<!-- bind_addr="${jgroups.tcp.address:127.0.0.1}"-->
<TCP
bind_addr="GLOBAL"
bind_port="${jgroups.tcp.port:7800}"
port_range="30"
recv_buf_size="20m"
send_buf_size="640k"
max_bundle_size="31k"
use_send_queues="true"
enable_diagnostics="false"
bundler_type="sender-sends-with-timer"
thread_naming_pattern="pl"
thread_pool.enabled="true"
thread_pool.min_threads="2"
thread_pool.max_threads="30"
thread_pool.keep_alive_time="60000"
thread_pool.queue_enabled="true"
thread_pool.queue_max_size="100"
thread_pool.rejection_policy="Discard"
oob_thread_pool.enabled="true"
oob_thread_pool.min_threads="2"
oob_thread_pool.max_threads="30"
oob_thread_pool.keep_alive_time="60000"
oob_thread_pool.queue_enabled="false"
oob_thread_pool.queue_max_size="100"
oob_thread_pool.rejection_policy="Discard"
internal_thread_pool.enabled="true"
internal_thread_pool.min_threads="2"
internal_thread_pool.max_threads="4"
internal_thread_pool.keep_alive_time="60000"
internal_thread_pool.queue_enabled="true"
internal_thread_pool.queue_max_size="100"
internal_thread_pool.rejection_policy="Discard"
/>
<!-- Ergonomics, new in JGroups 2.11, are disabled by default in TCPPING until JGRP-1253 is resolved -->
<!--
<TCPPING timeout="3000"
initial_hosts="localhost[7800],localhost[7801]"
port_range="5"
num_initial_members="3"
ergonomics="false"
/>
-->
<!-- bind_addr="${jgroups.bind_addr:127.0.0.1}" -->
<!-- ip_ttl="${jgroups.udp.ip_ttl:2}"-->
<MPING bind_addr="GLOBAL" break_on_coord_rsp="true"
mcast_addr="${jgroups.mping.mcast_addr:228.2.4.6}"
mcast_port="${jgroups.mping.mcast_port:43366}"
num_initial_members="3"/>
<MERGE3/>
<FD_SOCK/>
<FD timeout="3000" max_tries="5"/>
<VERIFY_SUSPECT timeout="1500"/>
<pbcast.NAKACK2 use_mcast_xmit="false"
xmit_interval="1000"
xmit_table_num_rows="100"
xmit_table_msgs_per_row="10000"
xmit_table_max_compaction_time="10000"
max_msg_batch_size="100"/>
<UNICAST3 xmit_interval="500"
xmit_table_num_rows="20"
xmit_table_msgs_per_row="10000"
xmit_table_max_compaction_time="10000"
max_msg_batch_size="100"
conn_expiry_timeout="0"/>
<pbcast.STABLE stability_delay="500" desired_avg_gossip="5000" max_bytes="1m"/>
<pbcast.GMS print_local_addr="false" join_timeout="3000" view_bundling="true"/>
<tom.TOA/> <!-- the TOA is only needed for total order transactions-->
<MFC max_credits="2m" min_threshold="0.40"/>
<FRAG2 frag_size="30k"/>
<RSVP timeout="60000" resend_interval="500" ack_on_delivery="false" />
</config>
My initial thought is that the problem may be with the bind_addr in the TCP and MPing elements. The two servers are on the same network and are able to ping each other. Anyone have any tips/insights on the configuration file above?
If it helps I've posted what's in the log file in regards to the Infinispan/JGroups startup below:
SERVER 1:
INFO JGroupsTransport - ISPN000078: Starting JGroups channel esrs
Nov 20, 2014 3:22:43 AM org.jgroups.logging.JDKLogImpl warn
WARNING: JGRP000014: Discovery.num_initial_members has been deprecated: will be ignored
INFO JGroupsTransport - ISPN000094: Received new cluster view for channel esrs: [udmesrs02-61057|0] (1) [udmesrs02-61057]
INFO JGroupsTransport - ISPN000079: Channel esrs local address is udmesrs02-61057
INFO GlobalComponentRegistry - ISPN000128: Infinispan version: Infinispan 'Guinness' 7.0.0.Final
SERVER 2:
INFO JGroupsTransport - ISPN000078: Starting JGroups channel esrs
Nov 20, 2014 3:20:28 AM org.jgroups.logging.JDKLogImpl warn
WARNING: JGRP000014: Discovery.num_initial_members has been deprecated: will be ignored
INFO JGroupsTransport - ISPN000094: Received new cluster view for channel esrs: [udmesrs01-16389|0] (1) [udmesrs01-16389]
INFO JGroupsTransport - ISPN000079: Channel esrs local address is udmesrs01-16389
INFO GlobalComponentRegistry - ISPN000128: Infinispan version: Infinispan 'Guinness' 7.0.0.Final

There are two possible issues: IPv4/IPv6 issues and UDP routing.
First try to set -Djava.net.preferIPv4Stack=true on both machines.
If that does not help, check your UDP firewall and routing settings.
If you don't find anything strange there, you'll have to use tcpdump on udp and port 43366 and tcp 7800 and see if there's any activity - there should be some multicast packet going from each node at least every 15 s.

Related

MPI A process or daemon was unable to complete a TCP connection

Open MPI: 4.0.1a
HostFile:
34bb0519eAAA
a2935f150BBB
I am in machine 34bb0519eAAA. And I could use ssh a2935f150BBB to connect a2935f150BBB successfully. And also ssh 34bb0519eAAA In machine a2935f150BBB to connect 34bb0519eAAA successfully .
But when I mpiexec command . I get error message
****Warning: Permanently added '[XX.XX.XX.XX]:XX' (a2935f150BBB'IP address) to the list of known hosts.**
----------------------**--------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
Local host: a2935f150BBB
Remote host: 34bb0519eAAA
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
I am very confused that.Because I run ssh to each other successfully . How could fail that.
Here is ssh connection
ssh a2935f150BBB
Warning: Permanently added '[XX.XX.XX.XX]:XX to the list of known hosts.
Welcome to Ubuntu 18.04.1 LTS (XXXXXXXXXXXXXXXXXX)
Documentation: https://help.ubuntu.com
Management: https://landscape.canonical.com
Support: https://ubuntu.com/advantage
This system has been minimized by removing packages and content that are
not required on a system that users do not log into.
To restore this content, you can run the 'unminimize' command.
Last login:XXXXXXXXXXXXX from XXXXXXXXXX

libvirt xml: host starting with image in remote server

I am trying to launch a guest vm on an ubuntu host, from a remote machine. The image for the guest is also at the remote machine(http server as image repo).
The following is the domain xml segment for disk section:
<disk type='network' device='disk'>
<driver name='qemu' type="qcow2"/>
<source protocol="http" name="img/guest_1.qcow2">
<host name="192.168.10.16" port="80"/>
</source>
<target dev='vdb' bus='virtio'/>
</disk>
While i am launching the vm i get this error:
virsh -c qemu://hostname/system start guest_vm
error: Failed to start domain guest_vm
error: internal error: process exited while connecting to monitor: 2017-04-07T12:31:24.421836Z qemu-system-x86_64: -drive file=http://192.168.10.16:80/img/guest_1.qcow2,format=qcow2,if=none,id=drive-virtio-disk1: curl block device does not support writes
Any inputs on how to resolve the issue?
From domain xml related documents, i could see other protocols like rbd,nbd,iscsi,etc being used.Is it not possible with http ?
As the error message says, the curl driver in QEMU (which is used for accessing disks via the http,https & ftp network protocols) only supports read-only access. You've configured a disk which requires read-write access, hence it reports an error.
Even if curl did support writes you really wouldn't want to use it. The HTTP protocol is not an efficient way to access guest disks. You should use any of iSCSI, NBD, NFS, RBD or GlusterFS instead.

WSO2 API Manager Custom Domain error

I have configured my wso2 with custom name by setting
-->
secu.helomyl.in
<!--
Host name to be used for the Carbon management console
-->
<MgtHostName>secu.helomyl.in</MgtHostName>
It starts and i can access the url and get wso2.But the below error is in the logs.Can you please help?
[2017-02-17 14:46:32,513] INFO - QpidServiceComponent Successfully connected to AMQP server on port 5673
[2017-02-17 14:46:32,514] WARN - QpidServiceComponent MQTT Transport is disabled as per configuration.
[2017-02-17 14:46:32,514] INFO - QpidServiceComponent WSO2 Message Broker is started.
[2017-02-17 14:46:32,533] WARN - PropertiesFileInitialContextFactory Unable to create factory:Illegal character in query between indicies 66 and 1
amqp://admin:admin#clientid/carbon?brokerlist='tcp://15.100.133.77 :5673'
^
[2017-02-17 14:46:33,044] INFO - PassThroughHttpSSLListener Starting Pass-through HTTPS Listener...
[2017-02-17 14:46:33,047] INFO - PassThroughListeningIOReactorManager Pass-through HTTPS Listener started on 0.0.0.0
Check the api-manager.xml in wso2am-2.0.0/repository/conf location. There is space in the below configuration. That causes the issue.
tcp://15.100.133.77 :5673

Passing hostname to netty

Background: I've got two machines with identical hostnames, I need to set up a local spark cluster for testing, setting up a master and a worker works fine, but trying to run an application with the driver causes problems, netty doesn't seem to be picking the correct host (regardless of what I put in there, it just picks the first host).
Identical hostname:
$ dig +short corehost
192.168.0.100
192.168.0.101
Spark config (used by master and the local worker):
export SPARK_LOCAL_DIRS=/some/dir
export SPARK_LOCAL_IP=corehost // i tried various like 192.168.0.x for
export SPARK_MASTER_IP=corehost // local, master and the driver
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=2g
export SPARK_WORKER_INSTANCES=2
export SPARK_WORKER_DIR=/some/dir
Spark starts up and I can see the worker in the web-ui.
When I run the spark "job" below:
val conf = new SparkConf().setAppName("AaA")
// tried 192.168.0.x and localhost
.setMaster("spark://corehost:7077")
val sc = new SparkContext(conf)
I get this exception:
15/04/02 12:34:04 INFO SparkContext: Running Spark version 1.3.0
15/04/02 12:34:04 WARN Utils: Your hostname, corehost resolves to a loopback address: 127.0.0.1; using 192.168.0.100 instead (on interface en1)
15/04/02 12:34:04 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/04/02 12:34:05 ERROR NettyTransport: failed to bind to corehost.home/192.168.0.101:0, shutting down Netty transport
...
Exception in thread "main" java.net.BindException: Failed to bind to: corehost.home/192.168.0.101:0: Service 'sparkDriver' failed after 16 retries!
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:393)
at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:389)
at scala.util.Success$$anonfun$map$1.apply(Try.scala:206)
at scala.util.Try$.apply(Try.scala:161)
at scala.util.Success.map(Try.scala:206)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
15/04/02 12:34:05 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
15/04/02 12:34:05 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
15/04/02 12:34:05 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
Process finished with exit code 1
Not sure how to proceed... its a whole jungle of ip addresses.
Not sure if this is a netty issue either.
My experience with the identical problem is that it revolves around setting things up locally. Try being more verbose in your spark driver code, add the SPARK_LOCAL_IP and driver host ip to the config:
val conf = new SparkConf().setAppName("AaA")
.setMaster("spark://localhost:7077")
.set("spark.local.ip","192.168.1.100")
.set("spark.driver.host","192.168.1.100")
This should tell netty which of the two identical hosts to use.

tomcat http request aborted but https fine

I have a web application deployed on tomcat 7.0.23, and there are two connectors are set, almost default value.
<Service name="Catalina">
<Connector port="8080" protocol="HTTP/1.1"
connectionTimeout="20000"
compression="on"
compressableMimeType="text/xml"
address="SERVER_HOSTNAME" />
<Connector port="8443" protocol="HTTP/1.1" SSLEnabled="true"
maxThreads="400" scheme="https" secure="true"
address="SERVER_HOSTNAME"
clientAuth="false" SSLProtocol="ALL"
SSLCertificateFile="/PATH/tomcat-server.crt"
SSLCertificateKeyFile="/PATH/tomcat-server.rsa"
SSLCipherSuite="ALL:!ADH:!SSLv2:!EXPORT40:!EXP:!LOW"
compression="on" compressableMimeType="text/xml"/>
After tomcat just restarts, both http:8080 and https:8443 work fine. While after a few days, the 8080 will not work, but the 8443 still works fine. The meaning of "8080 not work" is when using firefox to access the http:8080, some resources like js/css files will unavailable randomly.
In firebug, sometimes the A.js file will be shown as "Aborted", sometime the B.js will be shown as "Aborted". I tried to access one single file, like http://:8080/js/A.js file, the result is also random, sometime the full content can be shown in browser, sometime http request is aborted.
I also tried to increase the connectionTimeout to "60000", the only change thing is in Firebug, the aborted request was 0B but now is actual size. The only way to make 8080 work fine is to restart the tomcat.
Please someone tell me what's the cause or which way I should try? Thanks.
Another process might be taking the port 8080 somehow. And this process does not respond correctly to requests you address to Tomcat.
So, next time you see this issue, before restarting Tomcat, check which process the port 8080 currently belongs to.
On Linux I use the following command for this:
netstat -nlpt | grep 8080
One of the columns (the last one if I remember correctly) will be the ID of the process that consumes the port.
In case you have a Windows setup, use
netstat -ano | find "LISTENING" | find "8080"
Then find this PID in the Task Manager.
FYI: Windows Task Manager – showing the PID

Resources