openstack: changing scheduler_max_attempts in nova.conf does not affect anything

I want to reduce scheduler_max_attempts in nova.conf. I changed it and restarted the nova-scheduler service, then tried to test the change by triggering multiple VM creations at the same time. I could see more retries happening than the value I set for scheduler_max_attempts. I want to know if any other service needs a restart to apply the scheduler_max_attempts change. Does anyone have an idea about it?

I found that I have to restart the nova-conductor service to apply changes to scheduler_max_attempts.
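For reference, the change itself can be sketched like this (the value 3 is only an example; on older releases the option lives in [DEFAULT], and the exact section can vary by release):
## /etc/nova/nova.conf
[DEFAULT]
scheduler_max_attempts = 3
After editing, restart both services, e.g. systemctl restart nova-scheduler nova-conductor on a systemd-based host (service names vary by distribution), since nova-conductor is the service that drives the reschedule retries.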

Related

Uploading larger files with User-Agent python-requests/2.2.1 results in RemoteDisconnected

When uploading larger files with the Python library requests, I get the error RemoteDisconnected('Remote end closed connection without response').
However, it works if I change the library's default User-Agent to something like "Mozilla/5.0".
Does anybody know the reason for this behaviour?
Edit: this only happens with the property X-Explode-Archive: true
Is there a specific timeout pattern you could highlight in this case?
For example, does it time out after 60 seconds every time (something of that sort)?
I would suggest checking the logs from every layer configured with the Artifactory instance, such as the reverse proxy and the embedded Tomcat. As the issue is specific to large files, correlating the timeout pattern with the timeouts configured on each of these entities should give a hint towards the cause.
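For what it's worth, the workaround described in the question can be sketched in code like this (the upload URL and file name are hypothetical; X-Explode-Archive is the property mentioned in the edit):
import requests

# Workaround from the question: replace the default
# python-requests/<version> User-Agent with a browser-like string.
headers = {
    "User-Agent": "Mozilla/5.0",
    "X-Explode-Archive": "true",  # the property that triggers the issue
}

with open("archive.zip", "rb") as f:  # hypothetical file
    response = requests.put(
        "https://artifactory.example.com/repo/archive.zip",  # hypothetical URL
        data=f,
        headers=headers,
    )
response.raise_for_status()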

pyinfra: how to know when an operation causes a change

Sometimes, you need to run more commands if one command causes a change on the remote system. Good examples would be:
You update a systemd service file. If the file was actually changed, then you need to restart the service.
You update the configuration for a service (like say /etc/dhcp/dhcpd.conf). If that file was changed, you need to restart the service.
Is there a way to do this with files.put? Ideally you could write code like:
changed = files.put(src='files/dhcpd.conf', dest='/etc/dhcp/dhcpd.conf')
if changed:
    systemd.service(service='isc-dhcp-server', running=True, restarted=True)
In pyinfra, every operation returns an object that has a changed property that does what you want:
dhcpconfig = files.put(…)
if dhcpconfig.changed:
    systemd.service(…)
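Putting the pieces together, a complete deploy script could look like this (a minimal sketch using the paths and service name from the question; the inventory setup is omitted):
# deploy.py (run with: pyinfra inventory.py deploy.py)
from pyinfra.operations import files, systemd

# Upload the DHCP config; the returned object exposes a `changed` property.
dhcpconfig = files.put(
    src="files/dhcpd.conf",
    dest="/etc/dhcp/dhcpd.conf",
)

# Restart the service only when the file content actually changed.
if dhcpconfig.changed:
    systemd.service(
        service="isc-dhcp-server",
        running=True,
        restarted=True,
    )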

Google Cloud Composer (Apache Airflow) cannot access log files

I'm running a DAG in Google Cloud Composer (hosted Airflow) which runs fine in Airflow locally. All it does is print "Hello World". However, when I run it through Cloud Composer I receive the error:
*** Log file does not exist: /home/airflow/gcs/logs/matts_custom_dag/main_test/2020-04-20T23:46:53.652833+00:00/2.log
*** Fetching from: http://airflow-worker-d775d7cdd-tmzj9:8793/log/matts_custom_dag/main_test/2020-04-20T23:46:53.652833+00:00/2.log
*** Failed to fetch log file from worker. HTTPConnectionPool(host='airflow-worker-d775d7cdd-tmzj9', port=8793): Max retries exceeded with url: /log/matts_custom_dag/main_test/2020-04-20T23:46:53.652833+00:00/2.log (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8825920160>: Failed to establish a new connection: [Errno -2] Name or service not known',))
I've also tried making the DAG add data into a database, and it actually succeeds 50% of the time. However, it always returns this error message (and no other print statements or logs). Any help on why this might be happening is much appreciated.
We also faced the same issue, raised a support ticket to GCP, and got the following reply:
The message is related to the latency of syncing logs from Airflow workers to the webserver; it takes at least a few minutes (depending on the number of objects and their size).
The total log size does not seem large, but it's enough to noticeably slow down synchronization, hence we recommend cleaning up/archiving the logs.
Basically we recommend relying on Stackdriver logs instead, because of the latency due to the design of this sync.
I hope this helps you solve the problem.
I have the same problem after upgrading Google Composer from 1.10.3 to 1.10.6.
I can see in my logs that Airflow is trying to fetch the logs from a bucket whose name ends with -tenant, while the bucket in my account ends with -bucket.
In the configuration, I can see something weird too.
## airflow.cfg
[core]
remote_base_log_folder = gs://us-east1-dada-airflow-xxxxx-bucket/logs
## also in the running configuration says
core remote_base_log_folder gs://us-east1-dada-airflow-xxxxx-tenant/logs env var
I wrote to google support and they said the team is working on a fix.
EDIT:
I've been accessing my logs with gsutil, replacing the bucket name suffix with -bucket:
gsutil cat gs://us-east1-dada-airflow-xxxxx-bucket/logs/...../5.logs
I faced the same situation on multiple occasions.
When I looked at the log in the Airflow web UI as soon as the job finished, it used to give me the same error, although when I checked the same logs in the UI after a minute or two, I could see them properly.
As per the above answers, it's a sync issue between the webserver and the worker node.
In general, the issue described here is sporadic.
In certain situations, what can help is setting default-task-retries to a value that allows a task to be retried at least once.
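For illustration, such retries can be set through a DAG's default_args (a minimal sketch for Airflow 1.10.x; the DAG and task ids are taken from the question's log paths, the callable is hypothetical):
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# Retry each task at least once so a transient failure does not
# immediately mark the task as failed.
default_args = {
    "retries": 1,
    "retry_delay": timedelta(minutes=1),
}

dag = DAG(
    "matts_custom_dag",
    default_args=default_args,
    start_date=datetime(2020, 4, 20),
    schedule_interval=None,
)

def say_hello():
    print("Hello World")

main_test = PythonOperator(
    task_id="main_test",
    python_callable=say_hello,
    dag=dag,
)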
This issue is resolved at least since Airflow version 1.10.10+composer.

WorkManager: will REPLACE affect an already running instance?

We get a large number of requests in a queue, and we enqueue WorkManager work as and when we get a request. How does ExistingWorkPolicy.REPLACE work?
Document says
If there is existing pending (uncompleted) work with the same unique name, cancel and delete it.
Will it also kill an existing worker that is in the middle of running? We really do not want the existing worker to stop mid-run; it is OK for it to be replaced while it is enqueued, but not while it is running. Can we use the REPLACE option here?
https://developer.android.com/reference/androidx/work/ExistingWorkPolicy
As explained in WorkManager's guide and in your question, when you enqueue a new unique WorkRequest using REPLACE as the existing work policy, this is going to stop a previous worker that is currently running.
What happens to your worker really depends on how you implemented it (Worker, CoroutineWorker, or another ListenableWorker subclass) and how you handle stoppages and cancellations.
What this means is that your Worker needs to "cooperatively" finish and clean up:
In the case of unique work, you explicitly enqueued a new WorkRequest with an ExistingWorkPolicy of REPLACE. The old WorkRequest is immediately considered terminated.
Under these conditions, your worker will receive a call to ListenableWorker.onStopped(). You should perform cleanup and cooperatively finish your worker in case the OS decides to shut down your app.

Is it ever possible to reduce pg_num for a specific pool

Sadly, I found that the ceph CLI does not allow decreasing the value of pg_num for a specific pool.
ceph osd pool set .rgw.root pg_num 32
The following error is shown:
Error EEXIST: specified pg_num 32 <= current 128
The placement-groups tutorial tells me what pg_num is and how to choose the best value for it, but there is hardly any tutorial about how to reduce pg_num without re-installing Ceph or deleting the pool first, apart from ceph-reduce-the-pg-number-on-a-pool.
The existing SO thread ceph-too-many-pgs-per-osd shows how to decide the best value. If I hit this issue, how can I recover from the mess?
If reducing pg_num is not easy, what's the story behind that? Why doesn't Ceph expose an interface to reduce it?
The Nautilus release allows pg_num changes without restrictions (and adds pg_autoscale).
If you want to increase or reduce pg_num/pgp_num values without having to create, copy, and rename pools (as suggested in your link), the best option is to upgrade to Nautilus.
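For example, on Nautilus or later the command from the question should be accepted as-is, and the autoscaler can manage pg_num per pool (sketched here with the pool name from the question):
# Nautilus and later accept pg_num decreases directly
ceph osd pool set .rgw.root pg_num 32
# Optionally, let the autoscaler manage pg_num for the pool
ceph mgr module enable pg_autoscaler
ceph osd pool set .rgw.root pg_autoscale_mode on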
