dask bokeh port not reused - bokeh

I'm running some extended tests under dask. I create a LocalCluster and a Client from it, do some processing, and then shut down the Client and close the LocalCluster. On the first test, port 8787 is used for bokeh as expected. However, on subsequent tests the port number is not 8787 but some random number. Here's a script that illustrates the problem:
from distributed import Client
import time

if __name__ == '__main__':
    max_n_workers = 8
    print('Maximum number of workers is ', max_n_workers)
    n_workers = max_n_workers
    while n_workers > 1:
        c = Client(n_workers=n_workers)
        print(c)
        addr = c.scheduler_info()['address']
        services = c.scheduler_info()['services']
        if 'bokeh' in services.keys():
            bokeh_addr = 'http:%s:%s' % (addr.split(':')[1], services['bokeh'])
            print('Diagnostic pages available on port %s' % bokeh_addr)
        c.shutdown()
        n_workers = n_workers // 2
        time.sleep(10)
    exit()
And an example output is
Maximum number of workers is 8
<Client: scheduler='tcp://127.0.0.1:41049' processes=8 cores=16>
Diagnostic pages available on port http://127.0.0.1:8787
distributed.nanny - INFO - Worker closed
distributed.nanny - INFO - Worker closed
distributed.nanny - INFO - Worker closed
distributed.nanny - INFO - Worker closed
distributed.nanny - INFO - Worker closed
distributed.nanny - INFO - Worker closed
distributed.nanny - INFO - Worker closed
distributed.nanny - INFO - Worker closed
<Client: scheduler='tcp://127.0.0.1:34152' processes=4 cores=16>
Diagnostic pages available on port http://127.0.0.1:39621
distributed.nanny - INFO - Worker closed
distributed.nanny - INFO - Worker closed
distributed.nanny - INFO - Worker closed
distributed.nanny - INFO - Worker closed
<Client: scheduler='tcp://127.0.0.1:37583' processes=2 cores=16>
Diagnostic pages available on port http://127.0.0.1:45183
distributed.nanny - INFO - Worker closed
distributed.nanny - INFO - Worker closed

When you close the LocalCluster it shuts down all of its services, which returns all previously used ports to the operating system. When you start up a new LocalCluster it asks the operating system for those ports again. If the bokeh port is not yet available it will choose a new random port rather than fail.
In your case I suspect that the operating system has not yet released the default port. If you give it a little time between clusters, things should clear up.
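Not part of the original answer, but a minimal workaround sketch under that assumption: wait until the bokeh port is actually bindable again before starting the next cluster. wait_for_port below is a hypothetical helper (standard library only), and 8787 is the default bokeh port; depending on the socket options bokeh uses, a short extra sleep after the check may still be needed.

import socket
import time

from distributed import Client

def wait_for_port(port, host='127.0.0.1', timeout=30):
    # Poll until the OS lets us bind the port, i.e. the previous
    # cluster has fully released it.
    deadline = time.time() + timeout
    while time.time() < deadline:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            try:
                sock.bind((host, port))
                return True
            except OSError:
                time.sleep(0.5)
    return False

# Usage sketch: wait for the bokeh port between test iterations.
if wait_for_port(8787):
    c = Client(n_workers=4)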

Related

How do connections recycle in a multiprocess pool serving requests from a single requests.Session object in python?

Below is the complete code simplified for the question.
ids_to_check returns a list of ids. For my testing, I used a list of 13 random strings.
#!/usr/bin/env python3
import time
from multiprocessing.dummy import Pool as ThreadPool, current_process as threadpool_process
import requests

def ids_to_check():
    some_calls()  # placeholder from the question: builds the list of 13 random id strings
    return id_list

def execute_task(task_id):  # named task_id so it does not shadow the built-in id()
    url = f"https://myserver.com/todos/{task_id}"
    json_op = s.get(url, verify=False).json()
    value = json_op['id']
    # print the result, the worker thread, and the id() of the shared session object
    print(str(value) + '-' + str(threadpool_process()) + str(id(s)))

def main():
    pool = ThreadPool(processes=20)
    while True:
        pool.map(execute_task, ids_to_check())
        print("Let's wait for 10 seconds")
        time.sleep(10)

if __name__ == "__main__":
    s = requests.Session()
    s.headers.update({'Accept': 'application/json'})
    main()
Output:
4-<DummyProcess(Thread-2, started daemon 140209222559488)>140209446508360
5-<DummyProcess(Thread-5, started daemon 140209123481344)>140209446508360
7-<DummyProcess(Thread-6, started daemon 140209115088640)>140209446508360
2-<DummyProcess(Thread-11, started daemon 140208527894272)>140209446508360
None-<DummyProcess(Thread-1, started daemon 140209230952192)>140209446508360
10-<DummyProcess(Thread-4, started daemon 140209131874048)>140209446508360
12-<DummyProcess(Thread-7, started daemon 140209106695936)>140209446508360
8-<DummyProcess(Thread-3, started daemon 140209140266752)>140209446508360
6-<DummyProcess(Thread-12, started daemon 140208519501568)>140209446508360
3-<DummyProcess(Thread-13, started daemon 140208511108864)>140209446508360
11-<DummyProcess(Thread-10, started daemon 140208536286976)>140209446508360
9-<DummyProcess(Thread-9, started daemon 140209089910528)>140209446508360
1-<DummyProcess(Thread-8, started daemon 140209098303232)>140209446508360
Let's wait for 10 seconds
None-<DummyProcess(Thread-14, started daemon 140208502716160)>140209446508360
3-<DummyProcess(Thread-20, started daemon 140208108455680)>140209446508360
1-<DummyProcess(Thread-19, started daemon 140208116848384)>140209446508360
7-<DummyProcess(Thread-17, started daemon 140208133633792)>140209446508360
6-<DummyProcess(Thread-6, started daemon 140209115088640)>140209446508360
4-<DummyProcess(Thread-4, started daemon 140209131874048)>140209446508360
9-<DummyProcess(Thread-16, started daemon 140208485930752)>140209446508360
5-<DummyProcess(Thread-15, started daemon 140208494323456)>140209446508360
2-<DummyProcess(Thread-2, started daemon 140209222559488)>140209446508360
8-<DummyProcess(Thread-18, started daemon 140208125241088)>140209446508360
11-<DummyProcess(Thread-1, started daemon 140209230952192)>140209446508360
10-<DummyProcess(Thread-11, started daemon 140208527894272)>140209446508360
12-<DummyProcess(Thread-5, started daemon 140209123481344)>140209446508360
Let's wait for 10 seconds
None-<DummyProcess(Thread-3, started daemon 140209140266752)>140209446508360
2-<DummyProcess(Thread-10, started daemon 140208536286976)>140209446508360
1-<DummyProcess(Thread-12, started daemon 140208519501568)>140209446508360
4-<DummyProcess(Thread-9, started daemon 140209089910528)>140209446508360
5-<DummyProcess(Thread-14, started daemon 140208502716160)>140209446508360
9-<DummyProcess(Thread-6, started daemon 140209115088640)>140209446508360
8-<DummyProcess(Thread-16, started daemon 140208485930752)>140209446508360
7-<DummyProcess(Thread-4, started daemon 140209131874048)>140209446508360
3-<DummyProcess(Thread-20, started daemon 140208108455680)>140209446508360
6-<DummyProcess(Thread-8, started daemon 140209098303232)>140209446508360
12-<DummyProcess(Thread-13, started daemon 140208511108864)>140209446508360
10-<DummyProcess(Thread-7, started daemon 140209106695936)>140209446508360
11-<DummyProcess(Thread-19, started daemon 140208116848384)>140209446508360
Let's wait for 10 seconds
.
.
My observations:
Multiple connections are created (i.e., one connection per process), but the session object is the same throughout the execution of the code (the session object id stays the same).
Connections keep recycling, as seen from the ss output; I couldn't identify any particular pattern or timeout for the recycling.
Connections are not recycled if I reduce the number of processes to a smaller value (e.g. 5).
I do not understand how or why the connections are being recycled, or why they are not when I reduce the process count. I have tried disabling the garbage collector (import gc; gc.disable()) and the connections are still recycled.
I would like the created connections to stay alive until they reach a maximum number of requests. I think this would work without sessions, using a keep-alive connection header.
But I am curious to know what causes the session's connections to keep recycling when the process pool is large.
I can reproduce this issue with any server, so it is probably not server-dependent.
I solved the same issue for myself by creating a session for each process and parallelizing the request execution. At first I used multiprocessing.dummy too, but I faced the same issue as yours and switched to concurrent.futures.thread.ThreadPoolExecutor.
Here is my solution.
from concurrent.futures.thread import ThreadPoolExecutor
from functools import partial
from requests import Session, Response
from requests.adapters import HTTPAdapter

def thread_pool_execute(iterables, method, pool_size=30) -> list:
    """Multiprocess requests, returns list of responses."""
    session = Session()
    session.mount('https://', HTTPAdapter(pool_maxsize=pool_size))  # that's it
    session.mount('http://', HTTPAdapter(pool_maxsize=pool_size))  # that's it
    worker = partial(method, session)
    with ThreadPoolExecutor(pool_size) as pool:
        results = pool.map(worker, iterables)
    session.close()
    return list(results)

def simple_request(session, url) -> Response:
    return session.get(url)

response_list = thread_pool_execute(list_of_urls, simple_request)
I test sitemaps with 200k URLs with it, with pool_size=150, without any problems. It is restricted only by the target host's configuration.
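For context (not part of the original answer): the recycling in the question's script is most likely caused by connection-pool sizing rather than by the Session itself. requests mounts an HTTPAdapter whose urllib3 pool keeps at most pool_maxsize connections (10 by default), so with 20 worker threads the surplus connections get opened, used once, and discarded instead of being returned to the pool; with 5 threads everything fits in the pool and nothing is recycled. A minimal sketch of applying the same fix to the shared session s from the question (the value 20 matches ThreadPool(processes=20)):

import requests
from requests.adapters import HTTPAdapter

s = requests.Session()
# Size the pool to the number of worker threads; the default adapter keeps
# only 10 connections, so extra ones get opened and then thrown away.
adapter = HTTPAdapter(pool_connections=20, pool_maxsize=20)
s.mount('https://', adapter)
s.mount('http://', adapter)
s.headers.update({'Accept': 'application/json'})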

airflow webserver suddenly stopped after long time of no issues, "No response from gunicorn"

Have had an airflow webserver -D daemon process (v1.10.7) running on a machine (CentOS 7) for a long time. Suddenly saw that the webserver could no longer be accessed, and checking the airflow-webserver.log saw...
[airflow#airflowetl airflow]$ cat airflow-webserver.log
2020-10-23 00:57:15,648 ERROR - No response from gunicorn master within 120 seconds
2020-10-23 00:57:15,649 ERROR - Shutting down webserver
(nothing of note in airflow-webserver.err)
[airflow#airflowetl airflow]$ cat airflow-webserver.err
/home/airflow/.local/lib/python3.6/site-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: <http://initd.org/psycopg/docs/install.html#binary-install-from-pypi>.
""")
The airflow.cfg values for the webserver section looks like...
[webserver]
# The base url of your website as airflow cannot guess what domain or
# cname you are using. This is used in automated emails that
# airflow sends to point links to the right web server
#base_url = http://localhost:8080
base_url = http://airflowetl.co.local:8080
# The ip specified when starting the web server
web_server_host = 0.0.0.0
# The port on which to run the web server
web_server_port = 8080
# Paths to the SSL certificate and key for the web server. When both are
# provided SSL will be enabled. This does not change the web server port.
web_server_ssl_cert =
web_server_ssl_key =
# Number of seconds the webserver waits before killing gunicorn master that doesn't respond
web_server_master_timeout = 120
# Number of seconds the gunicorn webserver waits before timing out on a worker
#web_server_worker_timeout = 120
web_server_worker_timeout = 300
# Number of workers to refresh at a time. When set to 0, worker refresh is
# disabled. When nonzero, airflow periodically refreshes webserver workers by
# bringing up new ones and killing old ones.
worker_refresh_batch_size = 1
# Number of seconds to wait before refreshing a batch of workers.
worker_refresh_interval = 30
# Secret key used to run your flask app
secret_key = my_key
# Number of workers to run the Gunicorn web server
workers = 4
# The worker class gunicorn should use. Choices include
# sync (default), eventlet, gevent
worker_class = sync
Ultimately, I just restarted the process as a daemon again (airflow webserver -D; should I have deleted the old airflow-webserver.log and .err files first?), but I'm not sure what would make this happen, since it had run with no problems for months before this.
Could anyone with more experience explain what could have happened after all this time and how I could prevent it in the future? Are there any issues with running DAGs, or anything else I should check for, that this temporary unexpected shutdown of the webserver may have caused?
I am experiencing the same issue, and it only started (very infrequently) when I changed the following two config parameters in the webserver:
worker_refresh_interval = 120
workers = 2
However, my parameters are also set quite differently than yours, will share them here.
rbac = True
web_server_host = 0.0.0.0
web_server_port = 8080
web_server_master_timeout = 600
web_server_worker_timeout = 600
default_ui_timezone = Europe/Amsterdam
reload_on_plugin_change = True
Comparing the two: since your values for the two parameters I changed are still the defaults (the same as mine before I changed them), it seems it is a combination of more parameters.

Jenkins build failure due to closed channel Exception

Jenkins version -2.76
Slave jar version -3.11
Master is running on Linux
And the slave is running in a Docker container on a Windows machine.
Whenever I trigger a job that runs for more than 3 hours, it fails with the closed channel exception. At first it failed in under 30 minutes; after changing some registry parameters it now lasts up to 2 hours, but I need it to run for more than 3 hours.
Please suggest a solution to my problem.

ERROR when sending data to my KAA server

When using the first Kaa server application and sending data to my Kaa server from outside, I get this error: the channel keeps waiting for the CONNACK message + KAASYNC message.
My configuration for kaa server is:
transport host...=localhost=My PUBLIC ip
My Mongo config on the Kaa server is:
host: MY PUBLIC IP, port: 27017
This is what I get when I compile my SDK:
[pool-2-thread-1] INFO org.kaaproject.kaa.client.channel.failover.DefaultFailoverManager - Server [BOOTSTRAP, -1835393002] failed
[pool-2-thread-1] WARN org.kaaproject.kaa.client.channel.impl.DefaultChannelManager - Attempt to connect to the next bootstrap service will be made in 2000 ms, according to failover strategy decision
[pool-1-thread-1] INFO FirstKaaDemo - Sampled Temperature: 34
[pool-4-thread-14] INFO org.kaaproject.kaa.client.logging.strategies.RecordCountLogUploadStrategy - Need to upload logs - current count: 14, threshold: 1
[Thread-2] INFO org.kaaproject.kaa.client.channel.impl.channels.DefaultOperationTcpChannel - Can't sync. Channel [default_operation_tcp_channel] is waiting for CONNACK message + KAASYNC message
[pool-6-thread-1] INFO org.kaaproject.kaa.client.channel.impl.channels.AbstractHttpChannel - Processing sync all for channel default_bootstrap_channel
[pool-1-thread-1] INFO FirstKaaDemo - Sampled Temperature: 25
[pool-4-thread-15] INFO org.kaaproject.kaa.client.logging.strategies.RecordCountLogUploadStrategy - Need to upload logs - current count: 15, threshold: 1
[Thread-2] INFO org.kaaproject.kaa.client.channel.impl.channels.DefaultOperationTcpChannel - Can't sync. Channel [default_operation_tcp_channel] is waiting for CONNACK message + KAASYNC message
Step 1: Change the Kaa host/IP to the VM's IP address. In my case it is 192.168.1.142.
Step 2: Regenerate the SDK, recompile, and run. Done.

How to stop systemd from spamming rsyslog in CentOS 7

I've followed the instructions in this Red Hat bugfix article; that is, I did the following (I realize this is a Red Hat fix):
Created a /etc/systemd/system/systemd-logind.service file
Put the following lines in the file:
.include /lib/systemd/system/systemd-logind.service
[Service]
Environment=SYSTEMD_LOG_LEVEL=warning
Restarted the rsyslog and systemd-logind services
Yet I still see systemd logs spamming /var/log/messages. Is there anything else I need to do on a CentOS box? Here's a sample:
<30>1 2016-08-23T13:09:48-04:00 susralcent09 systemd - - - Starting Flush Journal to Persistent Storage...
<30>1 2016-08-23T12:48:37-04:00 susralcent09 systemd - - - Started System Logging Service.
<30>1 2016-08-23T12:48:36-04:00 susralcent09 systemd - - - Stopping System Logging Service...
<30>1 2016-08-23T12:48:36-04:00 susralcent09 systemd - - - Starting System Logging Service...
...and
<30>1 2016-08-23T13:01:01-04:00 susralcent09 systemd - - - Created slice user-0.slice.
<30>1 2016-08-23T13:01:01-04:00 susralcent09 systemd - - - Started Session 2 of user root.
<30>1 2016-08-23T13:09:48-04:00 susralcent09 systemd - - - Started Flush Journal to Persistent Storage.
<30>1 2016-08-23T13:01:01-04:00 susralcent09 systemd - - - Starting Session 2 of user root.
<30>1 2016-08-23T13:01:01-04:00 susralcent09 systemd - - - Starting user-0.slice.
