Background: I've got two machines with identical hostnames, I need to set up a local spark cluster for testing, setting up a master and a worker works fine, but trying to run an application with the driver causes problems, netty doesn't seem to be picking the correct host (regardless of what I put in there, it just picks the first host).
Identical hostname:
$ dig +short corehost
192.168.0.100
192.168.0.101
Spark config (used by master and the local worker):
export SPARK_LOCAL_DIRS=/some/dir
export SPARK_LOCAL_IP=corehost // i tried various like 192.168.0.x for
export SPARK_MASTER_IP=corehost // local, master and the driver
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=2g
export SPARK_WORKER_INSTANCES=2
export SPARK_WORKER_DIR=/some/dir
Spark starts up and I can see the worker in the web-ui.
When I run the spark "job" below:
val conf = new SparkConf().setAppName("AaA")
// tried 192.168.0.x and localhost
.setMaster("spark://corehost:7077")
val sc = new SparkContext(conf)
I get this exception:
15/04/02 12:34:04 INFO SparkContext: Running Spark version 1.3.0
15/04/02 12:34:04 WARN Utils: Your hostname, corehost resolves to a loopback address: 127.0.0.1; using 192.168.0.100 instead (on interface en1)
15/04/02 12:34:04 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/04/02 12:34:05 ERROR NettyTransport: failed to bind to corehost.home/192.168.0.101:0, shutting down Netty transport
...
Exception in thread "main" java.net.BindException: Failed to bind to: corehost.home/192.168.0.101:0: Service 'sparkDriver' failed after 16 retries!
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:393)
at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:389)
at scala.util.Success$$anonfun$map$1.apply(Try.scala:206)
at scala.util.Try$.apply(Try.scala:161)
at scala.util.Success.map(Try.scala:206)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
15/04/02 12:34:05 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
15/04/02 12:34:05 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
15/04/02 12:34:05 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
Process finished with exit code 1
Not sure how to proceed... its a whole jungle of ip addresses.
Not sure if this is a netty issue either.
My experience with the identical problem is that it revolves around setting things up locally. Try being more verbose in your spark driver code, add the SPARK_LOCAL_IP and driver host ip to the config:
val conf = new SparkConf().setAppName("AaA")
.setMaster("spark://localhost:7077")
.set("spark.local.ip","192.168.1.100")
.set("spark.driver.host","192.168.1.100")
This should tell netty which of the two identical hosts to use.
Related
I am trying to run this docker file https://gitlab.com/snippets/1713665
consoles
I have running iroha container as you can see in right console on 50051 port, but on running the above docker file for web GRPC then you can see in left console it is unable to make connection. as i have also tried with enabling and disabling the firewalls and also with opening the 50051 withudo ufw allow 50051 sudo ufw allow 50051 ...But in the end i have the same results
"Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused". Reconnecting... system=system"
I have also posted this issue month ago but no once gave me any response, Thats why i am reposting with further elaboration
Try running the grpc web proxy, with the backend address as localhost, instead of whatever is default in the gitlab post.
ex. ./grpcwebproxy-v0.13.0-osx-x86_64 --backend_addr=localhost:50051 --run_tls_server=false
From the console logs, it looks like it is trying to connect to dev.localdomain:50051
I am running Airflowv1.9 with Celery Executor. I have 5 Airflow workers running in 5 different machines. Airflow scheduler is also running in one of these machines. I have copied the same airflow.cfg file across these 5 machines.
I have daily workflows setup in different queues like DEV, QA etc. (each worker runs with an individual queue name) which are running fine.
While scheduling a DAG in one of the worker (no other DAG have been setup for this worker/machine previously), I am seeing the error in the 1st task and as a result downstream tasks are failing:
*** Log file isn't local.
*** Fetching here: http://<worker hostname>:8793/log/PDI_Incr_20190407_v2/checkBCWatermarkDt/2019-04-07T17:00:00/1.log
*** Failed to fetch log file from worker. 404 Client Error: NOT FOUND for url: http://<worker hostname>:8793/log/PDI_Incr_20190407_v2/checkBCWatermarkDt/2019-04-07T17:00:00/1.log
I have configured MySQL for storing the DAG metadata. When I checked task_instance table, I see proper hostnames are populated against the task.
I also checked the log location and found that the log is getting created.
airflow.cfg snippet:
base_log_folder = /var/log/airflow
base_url = http://<webserver ip>:8082
worker_log_server_port = 8793
api_client = airflow.api.client.local_client
endpoint_url = http://localhost:8080
What am I missing here? What configurations do I need to check additionally for resolving this issue?
Looks like the worker's hostname is not being correctly resolved.
Add a file hostname_resolver.py:
import os
import socket
import requests
def resolve():
"""
Resolves Airflow external hostname for accessing logs on a worker
"""
if 'AWS_REGION' in os.environ:
# Return EC2 instance hostname:
return requests.get(
'http://169.254.169.254/latest/meta-data/local-ipv4').text
# Use DNS request for finding out what's our external IP:
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.connect(('1.1.1.1', 53))
external_ip = s.getsockname()[0]
s.close()
return external_ip
And export: AIRFLOW__CORE__HOSTNAME_CALLABLE=airflow.hostname_resolver:resolve
The web program of the master needs to go to the worker to fetch the log and display it on the front-end page. This process is to find the host name of the worker. Obviously, the host name cannot be found,Therefore, add the host name to IP mapping on the master's vim /etc/hosts
If this happens as part of a Docker Compose Airflow setup, the hostname resolution needs to be passed to the container hosting the webserver, e.g. through extra_hosts:
# docker-compose.yml
version: "3.9"
services:
webserver:
extra_hosts:
- "worker_hostname_0:192.168.xxx.yyy"
- "worker_hostname_1:192.168.xxx.zzz"
...
...
More details here.
I am writing amqp 1.0 client (using rabbitMQ.Client in .NET) for a broker who provided me the following information:
amqps://brokerRemoteHostName:5671
certificate_openssl.p12
password for certificate as a string "mypassword"
queue name
I developed the following code in Visual Studio which is supposed to work (based on long searches on the web):
var cf = new ConnectionFactory();
cf.Uri = new Uri("amqps://brokerRemoteHostName:5671");
cf.Ssl.Enabled = true;
cf.Ssl.ServerName = "brokerRemoteHostName";
cf.Ssl.CertPath = #"C:\Users\mahmoud\Documents\certificate_openssl.p12";
cf.Ssl.CertPassphrase = "myPassword";
var connection = cf.CreateConnection();
However, the output shows an exception:
RabbitMQ.Client.Exceptions.BrokerUnreachableException:
None of the specified endpoints were reachable ---> System.IO.IOException:
connection.start was never received
likely due to a network timeout) as seen in the image.
Where line 50 corresponds to the line where we create the connection.
I appreciate your kind assistance on the error above.
If you're connecting to a docker container, you need to add the 5672 port in addition to 15672 port when creating the container. For those using ssl, the port would be 5671 instead of 5672.
Example: docker run -d --hostname my-rabbit --name rabbitmq --net customnet -p customport:15672 -p 5672:5672 rabbitmq:3-management.
You would connect from client by calling this: ConnectionFactory factory = new ConnectionFactory() { HostName = "localhost" };.
Feel free to pass in username and password if those were changed.
Official RabbitMq docker image https://hub.docker.com/_/rabbitmq starts RabbitMq broker on port 5672, but .NET RabbitMq library expects to see broker on port 5673 which for sure differs from what we have in fact in docker. The solution is just to remap 5672 to expected 5673 port
docker run -d --hostname my-rabbit --name ds-rabbit -p 8080:15672 -p 5673:5672 rabbitmq:3-management
I have box A and it has a consumer on it that listens on a Rabbit MQ server
I have box B that will publish a message to the listener
So as long as all of this in on box A and I start Rabbit MQ server w/ defaults it works fine.
The defaults are host=127.0.0.1 on port 5672, but
when I telnet box.a.ip.addy 5672 from box B I get:
Trying box.a.ip.addy...
telnet: connect to address box.a.ip.addy: No route to host
telnet: Unable to connect to remote host: No route to host
telnet on port 22 is fine, I can ssh into Box A from Box B
So I assume I need to change the ip that the RabbitMQ server uses
I found this: http://www.rabbitmq.com/configure.html and I now have a config file in the location the documentation said to use, with the name rabbitmq.config and it contains:
[
{rabbit, [{tcp_listeners, {"box.a.ip.addy", 5672}}]}
].
So I stopped the server, and started RabbitMQ server again. It failed. Here are the errors from the error logs. It's a little over my head. (in fact most of this is)
=ERROR REPORT==== 23-Aug-2011::14:49:36 ===
FAILED
Reason: {{case_clause,{{"box.a.ip.addy",5672}}},
[{rabbit_networking,'-boot_tcp/0-lc$^0/1-0-',1},
{rabbit_networking,boot_tcp,0},
{rabbit_networking,boot,0},
{rabbit,'-run_boot_step/1-lc$^1/1-1-',1},
{rabbit,run_boot_step,1},
{rabbit,'-start/2-lc$^0/1-0-',1},
{rabbit,start,2},
{application_master,start_it_old,4}]}
=INFO REPORT==== 23-Aug-2011::14:49:37 ===
application: rabbit
exited: {bad_return,{{rabbit,start,[normal,[]]},
{'EXIT',{rabbit,failure_during_boot}}}}
type: permanent
and here is some more from the start up log:
Erlang has closed
Error: {node_start_failed,normal}
^M
Crash dump was written to: erl_crash.dump^M
Kernel pid terminated (application_controller) ({application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{rabbit,failure_during_boot}}}}})^M
Please help
did you try adding?
RABBITMQ_NODE_IP_ADDRESS=box.a.ip.addy
to the /etc/rabbitmq/rabbitmq.conf file?
Per http://www.rabbitmq.com/configure.html#customise-general-unix-environment
Also per this documentation it states that the default is to bind to all interfaces. Perhaps there is a configuration setting or environment variable already set in your system to restrict the server to localhost overriding anything else you do.
UPDATE: After reading again I realize that the telnet should have returned "Connection Refused" not "No route to host." I would also check to see if you are having a firewall related issue.
You need to open up the tcp port on your firewall
Using Linux, Find the iptables config file:
eric#dev ~$ find / -name "iptables" 2>/dev/null
/etc/sysconfig/iptables
Edit the file:
sudo vi /etc/sysconfig/iptables
Fix the file by adding a port:
# Generated by iptables-save v1.4.7 on Thu Jan 16 16:43:13 2014
*filter
-A INPUT -p tcp -m tcp --dport 15672 -j ACCEPT
COMMIT
I've got a remote server on eapps.com that I'm using as my "production" server. I have my own computer at home that I'm using as my "development" server. I'm trying to use JNDI over HTTP to do some batch processing. The following works at home, but not on the eapps machine.
I'm connecting to some EJBs (stateless session), and have my jndi.properties set to this:
(this is for the eapps machine)
java.naming.factory.initial=org.jboss.naming.HttpNamingContextFactory
java.naming.provider.url=http://my.prodhost.com:8080/invoker/JNDIFactory
java.naming.factory.url.pkgs=org.jboss.naming.client:org.jnp.interfaces
# timeout is in milliseconds
jnp.timeout=15000
jnp.sotimeout=15000
jnp.maxRetries=3
(this is for my machine at home)
java.naming.factory.initial=org.jboss.naming.HttpNamingContextFactory
java.naming.provider.url=http://localhost:8080/invoker/JNDIFactory
java.naming.factory.url.pkgs=org.jnp.interfaces
java.naming.factory.url.pkgs=org.jboss.naming.client
# timeout is in milliseconds
jnp.timeout=15000
jnp.sotimeout=15000
jnp.maxRetries=3
As I said, it works at home, but when I try it remotely, I get:
Can not get connection to server. Problem establishing socket connection for InvokerLocator [socket://my.prodhost.com:4446//?dataType=invocation&enableTcpNoDelay=true&marshaller=org.jboss.invocation.unified.marshall.InvocationMarshaller&socketTimeout=600000&unmarshaller=org.jboss.invocation.unified.marshall.InvocationUnMarshaller]
...
Caused by: java.net.ConnectException: Connection timed out: connect
Am I doing something wrong here, or is it possibly a firewall issue? To the best of my knowledge, port 4446 is not blocked.
Are the differences in the jndi.properties intentional (at the java.naming.factory.url.pkgs property level)?
Also, can you run a netstat -a | grep 4446 on both machines and update the question with the output?
Update: If the netstat command didn't return anything for port 4446 (JBoss was running, right?), then the JBoss Remoting Connector for the UnifiedInvoker service is very likely not listening on your eApps host, hence the connection timeout. Maybe this service has been disabled by eApps, you should contact the support and discuss this with them.
Just in case, a sample Connector configuration can be found in the jboss-service.xml under the server node's conf directory. Maybe compare the remote one (if you have access to it) with your local file to confirm this (but if it's disable, there must be a reason, discuss it with the support).
And by the way, this is what I get when I run the netstat command with JBoss 4.2.3.GA started on my GNU/Linux machine (default configuration):
$ netstat -a | grep 4446
tcp 0 0 localhost:4446 *:* LISTEN