We have a Graphite full-stack server which receives metrics from different machines. While the other collectd clients are sending data fine, one of the clients is giving the error below:
Jan 29 23:24:44 collectd-client collectd[25489]: write_graphite plugin: send to graphite-server:2003 ((null)) failed with status -1 (Connection refused)
Jan 29 23:24:44 collectd-client collectd[25489]: collectd: Stopping 5 write threads.
The collectd.conf is as below:
LoadPlugin syslog
LoadPlugin cpu
LoadPlugin df
LoadPlugin disk
LoadPlugin interface
LoadPlugin load
LoadPlugin memory
LoadPlugin rrdtool
LoadPlugin write_graphite
<Plugin df>
MountPoint "/"
</Plugin>
<Plugin disk>
Disk "/^[hs]d[a-f][0-9]?$/"
</Plugin>
<Plugin interface>
Interface "eth0"
</Plugin>
<Plugin write_graphite>
<Node "carbon">
Host "sde-graphite"
Port "2003"
Prefix "collectd"
Postfix "collectd"
StoreRates true
AlwaysAppendDS false
EscapeCharacter "_"
</Node>
</Plugin>
Verify whether carbon is running on host sde-graphite at port 2003. You can run netstat and check whether there is a listener on 2003 (write_graphite talks to carbon over TCP by default). My guess is that it is not running.
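For example (the exact netstat flags and the availability of nc are assumptions about your hosts), something like this should tell you quickly:
# on sde-graphite: look for a carbon plaintext listener on 2003
netstat -tlnp | grep 2003
# from the collectd client: check that the port is reachable at all
nc -zv sde-graphite 2003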
SOLVED:
I had the same issue: my metrics were generally working, but randomly some nodes would stop sending metrics, and collectd showed the same error:
Jun 18 15:04:23 node-a collectd[20235]: write_graphite plugin: send to 10.8.0.100:2003 (udp) failed with status -1 (Invalid argument)
Jun 18 15:04:23 node-a collectd[20235]: Filter subsystem: Built-in target `write': Dispatching value to all write plugins failed with status -1.
The daemon is still alive but not sending metrics to graphite.
NOTE: My nodes send data to Graphite through an OpenVPN tunnel.
It seems to be a connection timeout against the Graphite server. I can reproduce the error by stopping/interrupting the VPN service; collectd immediately shows the error above.
Hope it helps
Enjoy!
I am trying to launch a guest VM on an Ubuntu host from a remote machine. The image for the guest is also on the remote machine (an HTTP server acting as an image repo).
The following is the domain xml segment for disk section:
<disk type='network' device='disk'>
<driver name='qemu' type="qcow2"/>
<source protocol="http" name="img/guest_1.qcow2">
<host name="192.168.10.16" port="80"/>
</source>
<target dev='vdb' bus='virtio'/>
</disk>
When I launch the VM I get this error:
virsh -c qemu://hostname/system start guest_vm
error: Failed to start domain guest_vm
error: internal error: process exited while connecting to monitor: 2017-04-07T12:31:24.421836Z qemu-system-x86_64: -drive file=http://192.168.10.16:80/img/guest_1.qcow2,format=qcow2,if=none,id=drive-virtio-disk1: curl block device does not support writes
Any inputs on how to resolve the issue?
From the domain XML documentation, I can see other protocols like rbd, nbd, iscsi, etc. being used. Is it not possible with http?
As the error message says, the curl driver in QEMU (which is used for accessing disks via the http, https and ftp network protocols) only supports read-only access. You've configured a disk which requires read-write access, hence it reports an error.
Even if curl did support writes, you really wouldn't want to use it. The HTTP protocol is not an efficient way to access guest disks. You should use one of iSCSI, NBD, NFS, RBD or GlusterFS instead.
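As a rough sketch of the NBD route (the image path below is an assumption, and 10809 is just the conventional NBD port), you could export the image from the machine that currently serves it over HTTP:
# on the image host (192.168.10.16): serve the qcow2 read-write over NBD
qemu-nbd -t -f qcow2 -p 10809 /srv/img/guest_1.qcow2 &
The disk's <source> element would then use protocol="nbd" with that host and port instead of protocol="http".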
I'm using Infinispan to create a distributed cache between two servers and to leverage its failover feature.
I initially tested my web service on two local instances of Tomcat, using the pre-configured JGroups configuration file provided by infinispan-core-7.0.0.Final.jar. I was able to get the distributed cache working between the two Tomcat instances, since the pre-configured XML files use the loopback IP address.
I then moved the web service onto two separate servers and have been unable to get them to join the same group. I created my own custom JGroups TCP configuration XML, because using the loopback IP in the pre-configured one was causing some issues.
I don't have much experience in setting up a TCP or UDP channel, so I think the problem may lie with my JGroups configuration file (I based it off the pre-configured one).
<config xmlns="urn:org:jgroups"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.4.xsd">
<!-- bind_addr="${jgroups.tcp.address:127.0.0.1}"-->
<TCP
bind_addr="GLOBAL"
bind_port="${jgroups.tcp.port:7800}"
port_range="30"
recv_buf_size="20m"
send_buf_size="640k"
max_bundle_size="31k"
use_send_queues="true"
enable_diagnostics="false"
bundler_type="sender-sends-with-timer"
thread_naming_pattern="pl"
thread_pool.enabled="true"
thread_pool.min_threads="2"
thread_pool.max_threads="30"
thread_pool.keep_alive_time="60000"
thread_pool.queue_enabled="true"
thread_pool.queue_max_size="100"
thread_pool.rejection_policy="Discard"
oob_thread_pool.enabled="true"
oob_thread_pool.min_threads="2"
oob_thread_pool.max_threads="30"
oob_thread_pool.keep_alive_time="60000"
oob_thread_pool.queue_enabled="false"
oob_thread_pool.queue_max_size="100"
oob_thread_pool.rejection_policy="Discard"
internal_thread_pool.enabled="true"
internal_thread_pool.min_threads="2"
internal_thread_pool.max_threads="4"
internal_thread_pool.keep_alive_time="60000"
internal_thread_pool.queue_enabled="true"
internal_thread_pool.queue_max_size="100"
internal_thread_pool.rejection_policy="Discard"
/>
<!-- Ergonomics, new in JGroups 2.11, are disabled by default in TCPPING until JGRP-1253 is resolved -->
<!--
<TCPPING timeout="3000"
initial_hosts="localhost[7800],localhost[7801]"
port_range="5"
num_initial_members="3"
ergonomics="false"
/>
-->
<!-- bind_addr="${jgroups.bind_addr:127.0.0.1}" -->
<!-- ip_ttl="${jgroups.udp.ip_ttl:2}"-->
<MPING bind_addr="GLOBAL" break_on_coord_rsp="true"
mcast_addr="${jgroups.mping.mcast_addr:228.2.4.6}"
mcast_port="${jgroups.mping.mcast_port:43366}"
num_initial_members="3"/>
<MERGE3/>
<FD_SOCK/>
<FD timeout="3000" max_tries="5"/>
<VERIFY_SUSPECT timeout="1500"/>
<pbcast.NAKACK2 use_mcast_xmit="false"
xmit_interval="1000"
xmit_table_num_rows="100"
xmit_table_msgs_per_row="10000"
xmit_table_max_compaction_time="10000"
max_msg_batch_size="100"/>
<UNICAST3 xmit_interval="500"
xmit_table_num_rows="20"
xmit_table_msgs_per_row="10000"
xmit_table_max_compaction_time="10000"
max_msg_batch_size="100"
conn_expiry_timeout="0"/>
<pbcast.STABLE stability_delay="500" desired_avg_gossip="5000" max_bytes="1m"/>
<pbcast.GMS print_local_addr="false" join_timeout="3000" view_bundling="true"/>
<tom.TOA/> <!-- the TOA is only needed for total order transactions-->
<MFC max_credits="2m" min_threshold="0.40"/>
<FRAG2 frag_size="30k"/>
<RSVP timeout="60000" resend_interval="500" ack_on_delivery="false" />
</config>
My initial thought is that the problem may be with the bind_addr in the TCP and MPING elements. The two servers are on the same network and are able to ping each other. Does anyone have any tips/insights on the configuration file above?
If it helps, I've posted below what's in the log files regarding the Infinispan/JGroups startup:
SERVER 1:
INFO JGroupsTransport - ISPN000078: Starting JGroups channel esrs
Nov 20, 2014 3:22:43 AM org.jgroups.logging.JDKLogImpl warn
WARNING: JGRP000014: Discovery.num_initial_members has been deprecated: will be ignored
INFO JGroupsTransport - ISPN000094: Received new cluster view for channel esrs: [udmesrs02-61057|0] (1) [udmesrs02-61057]
INFO JGroupsTransport - ISPN000079: Channel esrs local address is udmesrs02-61057
INFO GlobalComponentRegistry - ISPN000128: Infinispan version: Infinispan 'Guinness' 7.0.0.Final
SERVER 2:
INFO JGroupsTransport - ISPN000078: Starting JGroups channel esrs
Nov 20, 2014 3:20:28 AM org.jgroups.logging.JDKLogImpl warn
WARNING: JGRP000014: Discovery.num_initial_members has been deprecated: will be ignored
INFO JGroupsTransport - ISPN000094: Received new cluster view for channel esrs: [udmesrs01-16389|0] (1) [udmesrs01-16389]
INFO JGroupsTransport - ISPN000079: Channel esrs local address is udmesrs01-16389
INFO GlobalComponentRegistry - ISPN000128: Infinispan version: Infinispan 'Guinness' 7.0.0.Final
There are two possible issues: IPv4/IPv6 issues and UDP routing.
First try to set -Djava.net.preferIPv4Stack=true on both machines.
If that does not help, check your UDP firewall and routing settings.
If you don't find anything strange there, you'll have to run tcpdump on UDP port 43366 and TCP port 7800 and see if there's any activity; there should be a multicast packet going out from each node at least every 15 s.
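For example (the interface name and the Tomcat setup are assumptions), to force IPv4 for the Tomcat JVMs and then watch the discovery and cluster traffic:
# in each Tomcat's setenv.sh (or equivalent startup environment)
export CATALINA_OPTS="$CATALINA_OPTS -Djava.net.preferIPv4Stack=true"
# on either server: watch MPING multicast and JGroups TCP traffic
tcpdump -n -i eth0 'udp port 43366 or tcp port 7800'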
I've set up a small Hadoop cluster for testing. Setup went fairly well with the NameNode (1 machine), SecondaryNameNode (1) and all DataNodes (3). The machines are named "master", "secondary" and "data01", "data02" and "data03". All DNS entries are properly set up, and passwordless SSH was configured from master/secondary to all machines and back.
I formatted the cluster with bin/hadoop namenode -format, and then started all services using bin/start-all.sh. All processes on all nodes were checked to be up and running with jps. My basic configuration files look something like this:
<!-- conf/core-site.xml -->
<configuration>
<property>
<name>fs.default.name</name>
<!--
on the master it's localhost
on the others it's the master's DNS
(ping works from everywhere)
-->
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<!-- I picked /hdfs for the root FS -->
<value>/hdfs/tmp</value>
</property>
</configuration>
<!-- conf/hdfs-site.xml -->
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/hdfs/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/hdfs/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
</configuration>
# conf/masters
secondary
# conf/slaves
data01
data02
data03
I'm just trying to get HDFS running properly now.
I've created a dir for testing with hadoop fs -mkdir testing, then tried to copy some files into it with hadoop fs -copyFromLocal /tmp/*.txt testing. This is when Hadoop crashes, giving me more or less this:
WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/hd/testing/wordcount1.txt could only be replicated to 0 nodes, instead of 1
at ... (such and such)
WARN hdfs.DFSClient: Error Recovery for block null bad datanode[0] nodes == null
at ...
WARN hdfs.DFSClient: Could not get block locations. Source file "/user/hd/testing/wordcount1.txt" - Aborting...
at ...
ERROR hdfs.DFSClient: Exception closing file /user/hd/testing/wordcount1.txt: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/hd/testing/wordcount1.txt could only be replicated to 0 nodes, instead of 1
at ...
And so on. A similar issue occurs when I try to run hadoop fs -lsr . from a DataNode machine, only to get the following:
12/01/02 10:02:11 INFO ipc.Client: Retrying connect to server master/192.162.10.10:9000. Already tried 0 time(s).
12/01/02 10:02:12 INFO ipc.Client: Retrying connect to server master/192.162.10.10:9000. Already tried 1 time(s).
12/01/02 10:02:13 INFO ipc.Client: Retrying connect to server master/192.162.10.10:9000. Already tried 2 time(s).
...
I'm saying it's similar, because I suspect this is a port availability issue. Running telnet master 9000 reveals that the port is closed. I've read somewhere that this might be an IPv6 clash issue, and thus defined the following in conf/hadoop-env.sh:
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
But that didn't do the trick. Running netstat on the master reveals something like this:
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 localhost:9000 localhost:56387 ESTABLISHED
tcp 0 0 localhost:56386 localhost:9000 TIME_WAIT
tcp 0 0 localhost:56387 localhost:9000 ESTABLISHED
tcp 0 0 localhost:56384 localhost:9000 TIME_WAIT
tcp 0 0 localhost:56385 localhost:9000 TIME_WAIT
tcp 0 0 localhost:56383 localhost:9000 TIME_WAIT
At this point I'm pretty sure the problem is with the port (9000), but I'm not sure what I missed as far as configuration goes. Any ideas? Thanks.
update
I found that hard-coding DNS names into /etc/hosts not only helps resolve this but also speeds up the connections. The downside is that you have to do this on all the machines in the cluster, and again whenever you add new nodes. Or you can just set up a DNS server, which I didn't.
Here's a sample from one node in my cluster (nodes are named hadoop01, hadoop02, etc., with the master and secondary being 01 and 02). Note that most of it is generated by the OS:
# this is a sample for a machine with dns hadoop01
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
# --- Start list of nodes
192.168.10.101 hadoop01
192.168.10.102 hadoop02
192.168.10.103 hadoop03
192.168.10.104 hadoop04
192.168.10.105 hadoop05
192.168.10.106 hadoop06
192.168.10.107 hadoop07
192.168.10.108 hadoop08
192.168.10.109 hadoop09
192.168.10.110 hadoop10
# ... and so on
# --- End list of nodes
# Auto-generated hostname. Please do not remove this comment.
127.0.0.1 hadoop01 localhost localhost.localdomain
Hope this helps.
Replace localhost in hdfs://localhost:9000 with the IP address or hostname of the NameNode for the fs.default.name property when there are remote nodes connecting to the NameNode.
All processes on all nodes were checked to be up and running with jps
jps only confirms that a process is running; there might still be errors in the log files.
Correct your /etc/hosts file to include localhost, or correct your core-site.xml to specify the IP or hostname of the node that hosts the HDFS filesystem.
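A quick way to confirm this (the netstat flags assume Linux) is to check which address the NameNode RPC port is actually bound to on the master:
# on the master: the 9000 listener should not be bound to 127.0.0.1
netstat -tlnp | grep 9000
If it shows 127.0.0.1:9000, the DataNodes cannot reach it. After pointing fs.default.name at hdfs://master:9000 (and making sure the master's own hostname does not resolve to 127.0.0.1 in /etc/hosts), it should show the machine's real address.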
I have box A, and it has a consumer on it that listens on a RabbitMQ server.
I have box B that will publish a message to the listener.
As long as all of this is on box A and I start the RabbitMQ server with defaults, it works fine.
The defaults are host=127.0.0.1 on port 5672, but
when I telnet box.a.ip.addy 5672 from box B I get:
Trying box.a.ip.addy...
telnet: connect to address box.a.ip.addy: No route to host
telnet: Unable to connect to remote host: No route to host
telnet on port 22 is fine, I can ssh into Box A from Box B
So I assume I need to change the IP that the RabbitMQ server uses.
I found this: http://www.rabbitmq.com/configure.html. I now have a config file named rabbitmq.config in the location the documentation says to use, and it contains:
[
{rabbit, [{tcp_listeners, {"box.a.ip.addy", 5672}}]}
].
So I stopped the server and started the RabbitMQ server again. It failed. Here are the errors from the error logs; it's a little over my head (in fact, most of this is).
=ERROR REPORT==== 23-Aug-2011::14:49:36 ===
FAILED
Reason: {{case_clause,{{"box.a.ip.addy",5672}}},
[{rabbit_networking,'-boot_tcp/0-lc$^0/1-0-',1},
{rabbit_networking,boot_tcp,0},
{rabbit_networking,boot,0},
{rabbit,'-run_boot_step/1-lc$^1/1-1-',1},
{rabbit,run_boot_step,1},
{rabbit,'-start/2-lc$^0/1-0-',1},
{rabbit,start,2},
{application_master,start_it_old,4}]}
=INFO REPORT==== 23-Aug-2011::14:49:37 ===
application: rabbit
exited: {bad_return,{{rabbit,start,[normal,[]]},
{'EXIT',{rabbit,failure_during_boot}}}}
type: permanent
and here is some more from the start up log:
Erlang has closed
Error: {node_start_failed,normal}
^M
Crash dump was written to: erl_crash.dump^M
Kernel pid terminated (application_controller) ({application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{rabbit,failure_during_boot}}}}})^M
Please help
Did you try adding
RABBITMQ_NODE_IP_ADDRESS=box.a.ip.addy
to the /etc/rabbitmq/rabbitmq.conf file?
Per http://www.rabbitmq.com/configure.html#customise-general-unix-environment
Also, per that documentation, the default is to bind to all interfaces. Perhaps there is a configuration setting or environment variable already set on your system that restricts the server to localhost, overriding anything else you do.
UPDATE: After reading again, I realize that if the port were merely closed, telnet should have returned "Connection refused" rather than "No route to host". I would also check whether you are having a firewall-related issue.
You need to open up the TCP port in your firewall.
On Linux, find the iptables config file:
eric#dev ~$ find / -name "iptables" 2>/dev/null
/etc/sysconfig/iptables
Edit the file:
sudo vi /etc/sysconfig/iptables
Fix the file by adding a rule for the port:
# Generated by iptables-save v1.4.7 on Thu Jan 16 16:43:13 2014
*filter
# 5672 is the AMQP port the broker listens on
-A INPUT -p tcp -m tcp --dport 5672 -j ACCEPT
COMMIT
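After reloading the rules, something along these lines should confirm the fix (the exact service command depends on your distro; this assumes a RHEL/CentOS-style setup that uses /etc/sysconfig/iptables):
# reload the firewall rules
sudo service iptables restart
# on box A: confirm the broker is listening on the AMQP port
netstat -tlnp | grep 5672
# from box B: this should now connect instead of reporting no route
nc -zv box.a.ip.addy 5672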
I've got a remote server on eapps.com that I'm using as my "production" server. I have my own computer at home that I'm using as my "development" server. I'm trying to use JNDI over HTTP to do some batch processing. The following works at home, but not on the eapps machine.
I'm connecting to some EJBs (stateless session), and have my jndi.properties set to this:
(this is for the eapps machine)
java.naming.factory.initial=org.jboss.naming.HttpNamingContextFactory
java.naming.provider.url=http://my.prodhost.com:8080/invoker/JNDIFactory
java.naming.factory.url.pkgs=org.jboss.naming.client:org.jnp.interfaces
# timeout is in milliseconds
jnp.timeout=15000
jnp.sotimeout=15000
jnp.maxRetries=3
(this is for my machine at home)
java.naming.factory.initial=org.jboss.naming.HttpNamingContextFactory
java.naming.provider.url=http://localhost:8080/invoker/JNDIFactory
java.naming.factory.url.pkgs=org.jnp.interfaces
java.naming.factory.url.pkgs=org.jboss.naming.client
# timeout is in milliseconds
jnp.timeout=15000
jnp.sotimeout=15000
jnp.maxRetries=3
As I said, it works at home, but when I try it remotely, I get:
Can not get connection to server. Problem establishing socket connection for InvokerLocator [socket://my.prodhost.com:4446//?dataType=invocation&enableTcpNoDelay=true&marshaller=org.jboss.invocation.unified.marshall.InvocationMarshaller&socketTimeout=600000&unmarshaller=org.jboss.invocation.unified.marshall.InvocationUnMarshaller]
...
Caused by: java.net.ConnectException: Connection timed out: connect
Am I doing something wrong here, or is it possibly a firewall issue? To the best of my knowledge, port 4446 is not blocked.
Are the differences in the jndi.properties intentional (at the java.naming.factory.url.pkgs property level)?
Also, can you run a netstat -a | grep 4446 on both machines and update the question with the output?
Update: If the netstat command didn't return anything for port 4446 (JBoss was running, right?), then the JBoss Remoting Connector for the UnifiedInvoker service is very likely not listening on your eApps host, hence the connection timeout. Maybe this service has been disabled by eApps; you should contact their support and discuss it with them.
Just in case, a sample Connector configuration can be found in the jboss-service.xml under the server node's conf directory. Maybe compare the remote one (if you have access to it) with your local file to confirm this (but if it's disabled, there must be a reason; discuss it with the support).
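For instance (the path is an assumption for a default JBoss layout), you could grep for the invoker port on both installations and compare:
# look for the UnifiedInvoker/Connector port in the service configuration
grep -n 4446 $JBOSS_HOME/server/default/conf/jboss-service.xml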
And by the way, this is what I get when I run the netstat command with JBoss 4.2.3.GA started on my GNU/Linux machine (default configuration):
$ netstat -a | grep 4446
tcp 0 0 localhost:4446 *:* LISTEN
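From the client side, a quick check like the one below (assuming nc is installed on your home machine) can also help tell a filtered port (timeout) from a closed one (connection refused):
# run from your home machine against the eApps host
nc -zv -w 5 my.prodhost.com 4446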