I've set up a small Hadoop cluster for testing. Setup went fairly well with the NameNode (1 machine), the SecondaryNameNode (1) and all DataNodes (3). The machines are named "master", "secondary" and "data01", "data02" and "data03". DNS is properly set up for all of them, and passwordless SSH was configured from master/secondary to all machines and back.
I formatted the cluster with bin/hadoop namenode -format, and then started all services using bin/start-all.sh. All processes on all nodes were checked to be up and running with jps. My basic configuration files look something like this:
<!-- conf/core-site.xml -->
<configuration>
<property>
<name>fs.default.name</name>
<!--
on the master it's localhost
on the others it's the master's DNS
(ping works from everywhere)
-->
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<!-- I picked /hdfs for the root FS -->
<value>/hdfs/tmp</value>
</property>
</configuration>
<!-- conf/hdfs-site.xml -->
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/hdfs/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/hdfs/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
</configuration>
# conf/masters
secondary
# conf/slaves
data01
data02
data03
I'm just trying to get HDFS running properly now.
I've created a directory for testing with hadoop fs -mkdir testing, then tried to copy some files into it with hadoop fs -copyFromLocal /tmp/*.txt testing. This is when Hadoop crashes, giving me more or less this:
WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/hd/testing/wordcount1.txt could only be replicated to 0 nodes, instead of 1
at ... (such and such)
WARN hdfs.DFSClient: Error Recovery for block null bad datanode[0] nodes == null
at ...
WARN hdfs.DFSClient: Could not get block locations. Source file "/user/hd/testing/wordcount1.txt" - Aborting...
at ...
ERROR hdfs.DFSClient: Exception closing file /user/hd/testing/wordcount1.txt: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/hd/testing/wordcount1.txt could only be replicated to 0 nodes, instead of 1
at ...
And so on. A similar issue occurs when I try to run hadoop fs -lsr . from a DataNode machine, only to get the following:
12/01/02 10:02:11 INFO ipc.Client: Retrying connect to server master/192.162.10.10:9000. Already tried 0 time(s).
12/01/02 10:02:12 INFO ipc.Client: Retrying connect to server master/192.162.10.10:9000. Already tried 1 time(s).
12/01/02 10:02:13 INFO ipc.Client: Retrying connect to server master/192.162.10.10:9000. Already tried 2 time(s).
...
I'm saying it's similar, because I suspect this is a port availability issue. Running telnet master 9000 reveals that the port is closed. I've read somewhere that this might be an IPv6 clash issue, and thus defined the following in conf/hadoop-env.sh:
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
But that didn't do the trick. Running netstat on the master reveals something like this:
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 localhost:9000 localhost:56387 ESTABLISHED
tcp 0 0 localhost:56386 localhost:9000 TIME_WAIT
tcp 0 0 localhost:56387 localhost:9000 ESTABLISHED
tcp 0 0 localhost:56384 localhost:9000 TIME_WAIT
tcp 0 0 localhost:56385 localhost:9000 TIME_WAIT
tcp 0 0 localhost:56383 localhost:9000 TIME_WAIT
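Filtering for listening sockets makes the situation clearer. Roughly (a sketch using the hostnames from above, not a paste from my session):
# on the master: 127.0.0.1:9000 means the NameNode is bound to loopback only,
# 0.0.0.0:9000 (or the master's own address) means it is reachable remotely
netstat -tln | grep 9000
# from a DataNode: this should connect once the binding is fixed
telnet master 9000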
At this point I'm pretty sure the problem is with the port (9000), but I'm not sure what I missed as far as configuration goes. Any ideas? Thanks.
update
I found that hard-coding DNS names into /etc/hosts not only helps resolve this, but also speeds up the connections. The downside is that you have to do this on all the machines in the cluster, and again whenever you add new nodes (see the sketch after the hosts file below). Or you can just set up a DNS server, which I didn't.
Here's a sample from one node in my cluster (nodes are named hadoop01, hadoop02, etc., with the master and secondary being 01 and 02). Note that most of it is generated by the OS:
# this is a sample for a machine with dns hadoop01
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
# --- Start list of nodes
192.168.10.101 hadoop01
192.168.10.102 hadoop02
192.168.10.103 hadoop03
192.168.10.104 hadoop04
192.168.10.105 hadoop05
192.168.10.106 hadoop06
192.168.10.107 hadoop07
192.168.10.108 hadoop08
192.168.10.109 hadoop09
192.168.10.110 hadoop10
# ... and so on
# --- End list of nodes
# Auto-generated hostname. Please do not remove this comment.
127.0.0.1 hadoop01 localhost localhost.localdomain
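Since the list has to be kept in sync by hand, a small loop can save some typing. This is only a sketch: nodes.txt is a hypothetical file holding just the "Start/End list of nodes" block, and it assumes root SSH access, so each machine keeps its own auto-generated 127.0.0.1 line:
for i in $(seq -w 1 10); do
    ssh root@hadoop$i "cat >> /etc/hosts" < nodes.txt
done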
Hope this helps.
Replace localhost in hdfs://localhost:9000 with the IP address or hostname of the NameNode in the fs.default.name property when remote nodes connect to the NameNode.
All processes on all nodes were checked to be up and running with jps
There might still be errors in the log files; jps only confirms that the process is running.
Correct your /etc/hosts file to include localhost, or correct your core-site.xml to specify the IP or hostname of the node that hosts the HDFS filesystem.
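For example, on the master (a minimal sketch assuming the master's DNS name is "master" and the conf/ layout from the question; editing the file by hand works just as well):
sed -i 's|hdfs://localhost:9000|hdfs://master:9000|' conf/core-site.xml
bin/stop-all.sh
bin/start-all.sh
# from a DataNode, the port should now be reachable:
telnet master 9000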
I am trying to launch a guest VM on an Ubuntu host from a remote machine. The image for the guest is also on the remote machine (an HTTP server acting as the image repo).
The following is the domain xml segment for disk section:
<disk type='network' device='disk'>
<driver name='qemu' type="qcow2"/>
<source protocol="http" name="img/guest_1.qcow2">
<host name="192.168.10.16" port="80"/>
</source>
<target dev='vdb' bus='virtio'/>
</disk>
When I launch the VM, I get this error:
virsh -c qemu://hostname/system start guest_vm
error: Failed to start domain guest_vm
error: internal error: process exited while connecting to monitor: 2017-04-07T12:31:24.421836Z qemu-system-x86_64: -drive file=http://192.168.10.16:80/img/guest_1.qcow2,format=qcow2,if=none,id=drive-virtio-disk1: curl block device does not support writes
Any input on how to resolve the issue?
From the domain XML documentation, I can see other protocols like rbd, nbd, iscsi, etc. being used. Is it not possible with http?
As the error message says, the curl driver in QEMU (which is used for accessing disks via the http, https & ftp network protocols) only supports read-only access. You've configured a disk which requires read-write access, hence it reports an error.
Even if curl did support writes you really wouldn't want to use it. The HTTP protocol is not an efficient way to access guest disks. You should use any of iSCSI, NBD, NFS, RBD or GlusterFS instead.
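For example, NBD can serve the same qcow2 image read-write. A rough sketch (the path, export name and port below are illustrative, not taken from your setup):
# on the image host (192.168.10.16), export the image over NBD instead of HTTP
qemu-nbd -f qcow2 -x guest_1 -p 10809 -t /srv/img/guest_1.qcow2
# the disk <source> element would then look something like:
#   <source protocol='nbd' name='guest_1'>
#     <host name='192.168.10.16' port='10809'/>
#   </source>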
I used FileZilla to connect to one of my Linux servers via the SFTP protocol, but got the error trace below.
Status: Connecting to <server_ip>...
Response: fzSftp started, protocol_version=5
Command: keyfile "C:\ruifeng_ibm.ppk"
Command: open "root#<server_ip>" 22
Status: Connected to <server_ip>
Error: Connection timed out after 20 seconds of inactivity
Error: Could not connect to server
On the server when I ran lsof -i, I was able to see the established sshd connection.
sshd 12333 root 3u IPv4 109406 0t0 TCP <server_hostname>:ssh-><workstation_ip>:54315 (ESTABLISHED)
How could the directories not be listed when the connection is successful? No idea how to debug either.
Turned out to be a silly problem.
I had put the welcome message below in my .bashrc file.
echo -e "\n\nHello Ruifeng...Welcome to the Arena! \n#>>------>---->>"
Either it contained some characters FileZilla does not handle, or printing output from .bashrc is simply not supported over FileZilla's SFTP; I was too lazy to dig further. After removing this message, the connection worked and the directories were listed.
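For what it's worth, instead of removing the banner you can guard it so it only runs for interactive shells; SFTP/SCP sessions source .bashrc non-interactively on some setups, and any output breaks the protocol. A sketch keeping the original message:
# only print the banner for interactive shells, so sftp/scp stay silent
case $- in
    *i*) echo -e "\n\nHello Ruifeng...Welcome to the Arena! \n#>>------>---->>" ;;
esac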
We have a full-stack Graphite server which receives metrics from different machines. While the other collectd clients are sending data fine, one of the clients is giving the error below:
Jan 29 23:24:44 collectd-client collectd[25489]: write_graphite plugin: send to graphite-server:2003 ((null)) failed with status -1 (Connection refused)
Jan 29 23:24:44 collectd-client collectd[25489]: collectd: Stopping 5 write threads.
collectd.conf as below
LoadPlugin syslog
LoadPlugin cpu
LoadPlugin df
LoadPlugin disk
LoadPlugin interface
LoadPlugin load
LoadPlugin memory
LoadPlugin rrdtool
LoadPlugin write_graphite
<Plugin df>
MountPoint "/"
</Plugin>
<Plugin disk>
Disk "/^[hs]d[a-f][0-9]?$/"
</Plugin>
<Plugin interface>
Interface "eth0"
</Plugin>
<Plugin write_graphite>
<Node "carbon">
Host "sde-graphite"
Port "2003"
Prefix "collectd"
Postfix "collectd"
StoreRates true
AlwaysAppendDS false
EscapeCharacter "_"
</Node>
</Plugin>
Verify whether carbon is running on host sde-graphite and listening on port 2003. You can do a netstat and see whether there is a listener (TCP or UDP, depending on how write_graphite is configured) on 2003. My guess is that it is not running.
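A quick sketch of that check (run the first command on sde-graphite and the second from the failing collectd client; nc is just one way to probe the port):
netstat -tulnp | grep 2003     # any TCP/UDP listener on 2003?
nc -zv sde-graphite 2003       # is the port reachable from the client at all?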
SOLVED:
I had the same issue: my metrics were generally working, but randomly some nodes would stop sending them, and collectd showed the same kind of error:
Jun 18 15:04:23 node-a collectd[20235]: write_graphite plugin: send to 10.8.0.100:2003 (udp) failed with status -1 (Invalid argument)
Jun 18 15:04:23 node-a collectd[20235]: Filter subsystem: Built-in target `write': Dispatching value to all write plugins failed with status -1.
The daemon is still alive but not sending metrics to graphite.
NOTE: My nodes send data to graphite through an OpenVPN tunnel.
It seems to be a connection timeout against the graphite server: I can reproduce the error by stopping/interrupting the VPN service, and collectd immediately shows the error above.
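A rough way to confirm it is the tunnel rather than collectd (the interface name is an example for a typical OpenVPN setup; 10.8.0.100 is the server address from the log above):
ip addr show tun0         # is the OpenVPN interface up at all?
ping -c 3 10.8.0.100      # is the graphite server reachable through the tunnel?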
Hope it helps
Enjoy!
I have box A, and it has a consumer on it that listens on a RabbitMQ server.
I have box B that will publish a message to that listener.
As long as all of this is on box A and I start the RabbitMQ server with defaults, it works fine.
The defaults are host=127.0.0.1 on port 5672, but
when I telnet box.a.ip.addy 5672 from box B I get:
Trying box.a.ip.addy...
telnet: connect to address box.a.ip.addy: No route to host
telnet: Unable to connect to remote host: No route to host
telnet on port 22 is fine, I can ssh into Box A from Box B
So I assume I need to change the IP address that the RabbitMQ server binds to.
I found this: http://www.rabbitmq.com/configure.html and I now have a config file in the location the documentation said to use, with the name rabbitmq.config and it contains:
[
{rabbit, [{tcp_listeners, {"box.a.ip.addy", 5672}}]}
].
So I stopped the server and started the RabbitMQ server again. It failed. Here are the errors from the error logs; it's a little over my head (in fact, most of this is).
=ERROR REPORT==== 23-Aug-2011::14:49:36 ===
FAILED
Reason: {{case_clause,{{"box.a.ip.addy",5672}}},
[{rabbit_networking,'-boot_tcp/0-lc$^0/1-0-',1},
{rabbit_networking,boot_tcp,0},
{rabbit_networking,boot,0},
{rabbit,'-run_boot_step/1-lc$^1/1-1-',1},
{rabbit,run_boot_step,1},
{rabbit,'-start/2-lc$^0/1-0-',1},
{rabbit,start,2},
{application_master,start_it_old,4}]}
=INFO REPORT==== 23-Aug-2011::14:49:37 ===
application: rabbit
exited: {bad_return,{{rabbit,start,[normal,[]]},
{'EXIT',{rabbit,failure_during_boot}}}}
type: permanent
and here is some more from the start up log:
Erlang has closed
Error: {node_start_failed,normal}
^M
Crash dump was written to: erl_crash.dump^M
Kernel pid terminated (application_controller) ({application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{rabbit,failure_during_boot}}}}})^M
Please help
Did you try adding
RABBITMQ_NODE_IP_ADDRESS=box.a.ip.addy
to the /etc/rabbitmq/rabbitmq.conf file?
Per http://www.rabbitmq.com/configure.html#customise-general-unix-environment
Also, the same documentation states that the default is to bind to all interfaces. Perhaps there is a configuration setting or environment variable already set on your system that restricts the server to localhost, overriding anything else you do.
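Something like this, roughly (a sketch assuming the packaging of that era, where the environment file is /etc/rabbitmq/rabbitmq.conf as linked above; box.a.ip.addy stands in for the real address, as in the question):
echo 'RABBITMQ_NODE_IP_ADDRESS=box.a.ip.addy' | sudo tee -a /etc/rabbitmq/rabbitmq.conf
sudo /etc/init.d/rabbitmq-server restart
netstat -tln | grep 5672   # the listener should now show box A's address (or 0.0.0.0), not just 127.0.0.1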
UPDATE: After reading again I realize that the telnet should have returned "Connection Refused" not "No route to host." I would also check to see if you are having a firewall related issue.
You need to open up the tcp port on your firewall
On Linux, find the iptables config file:
eric#dev ~$ find / -name "iptables" 2>/dev/null
/etc/sysconfig/iptables
Edit the file:
sudo vi /etc/sysconfig/iptables
Fix the file by adding a port:
# Generated by iptables-save v1.4.7 on Thu Jan 16 16:43:13 2014
*filter
-A INPUT -p tcp -m tcp --dport 15672 -j ACCEPT
COMMIT
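Then reload the rules and re-test from box B. A sketch for RHEL-style systems, where restarting the service re-reads /etc/sysconfig/iptables; note that 15672 in the rule above is usually the management UI port, while the broker port from the question is 5672:
sudo service iptables restart
telnet box.a.ip.addy 5672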
I've got a remote server on eapps.com that I'm using as my "production" server. I have my own computer at home that I'm using as my "development" server. I'm trying to use JNDI over HTTP to do some batch processing. The following works at home, but not on the eapps machine.
I'm connecting to some EJBs (stateless session), and have my jndi.properties set to this:
(this is for the eapps machine)
java.naming.factory.initial=org.jboss.naming.HttpNamingContextFactory
java.naming.provider.url=http://my.prodhost.com:8080/invoker/JNDIFactory
java.naming.factory.url.pkgs=org.jboss.naming.client:org.jnp.interfaces
# timeout is in milliseconds
jnp.timeout=15000
jnp.sotimeout=15000
jnp.maxRetries=3
(this is for my machine at home)
java.naming.factory.initial=org.jboss.naming.HttpNamingContextFactory
java.naming.provider.url=http://localhost:8080/invoker/JNDIFactory
java.naming.factory.url.pkgs=org.jnp.interfaces
java.naming.factory.url.pkgs=org.jboss.naming.client
# timeout is in milliseconds
jnp.timeout=15000
jnp.sotimeout=15000
jnp.maxRetries=3
As I said, it works at home, but when I try it remotely, I get:
Can not get connection to server. Problem establishing socket connection for InvokerLocator [socket://my.prodhost.com:4446//?dataType=invocation&enableTcpNoDelay=true&marshaller=org.jboss.invocation.unified.marshall.InvocationMarshaller&socketTimeout=600000&unmarshaller=org.jboss.invocation.unified.marshall.InvocationUnMarshaller]
...
Caused by: java.net.ConnectException: Connection timed out: connect
Am I doing something wrong here, or is it possibly a firewall issue? To the best of my knowledge, port 4446 is not blocked.
Are the differences in the jndi.properties intentional (at the java.naming.factory.url.pkgs property level)?
Also, can you run a netstat -a | grep 4446 on both machines and update the question with the output?
Update: If the netstat command didn't return anything for port 4446 (JBoss was running, right?), then the JBoss Remoting Connector for the UnifiedInvoker service is very likely not listening on your eApps host, hence the connection timeout. Maybe this service has been disabled by eApps, you should contact the support and discuss this with them.
Just in case, a sample Connector configuration can be found in the jboss-service.xml under the server node's conf directory. Maybe compare the remote one (if you have access to it) with your local file to confirm this (but if it's disabled, there must be a reason; discuss it with the support).
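For instance, something along these lines on both machines (a sketch; $JBOSS_HOME and the "default" server configuration name are assumptions, adjust them to your install):
grep -n -B 2 -A 4 "4446" $JBOSS_HOME/server/default/conf/jboss-service.xml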
And by the way, this is what I get when I run the netstat command with JBoss 4.2.3.GA started on my GNU/Linux machine (default configuration):
$ netstat -a | grep 4446
tcp 0 0 localhost:4446 *:* LISTEN