Multiple airflow schedulers - airflow

I am trying to set up a three-node Airflow cluster. Each node runs an Airflow scheduler, an Airflow worker and an Airflow webserver, along with Celery, a RabbitMQ cluster and a Postgres multi-master cluster (implemented with Bucardo). Software versions:
Airflow 2.0.1
PostgreSQL 13.2
Ubuntu 20.04
Python 3.8.5
Celery 4.4.7
Bucardo 5.6.0
RabbitMQ 3.8.2
I run into a problem when starting the Airflow scheduler.
When I launch the first one (with an empty database), it starts successfully.
But when I then launch another scheduler on another machine (I also tried the same machine), it fails with the following:
sqlalchemy.exc.IntegrityError: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "job_pkey"
DETAIL: Key (id)=(25) already exists.
[SQL: INSERT INTO job (dag_id, state, job_type, start_date, end_date, latest_heartbeat, executor_class, hostname, unixname) VALUES (%(dag_id)s, %(state)s, %(job_type)s, %(start_date)s, %(end_date)s, %(latest_heartbeat)s, %(executor_class)s, %(hostname)s, %(unixname)s) RETURNING job.id]
[parameters: {'dag_id': None, 'state': 'running', 'job_type': 'SchedulerJob', 'start_date': datetime.datetime(2021, 4, 21, 7, 39, 20, 429478, tzinfo=Timezone('UTC')), 'end_date': None, 'latest_heartbeat': datetime.datetime(2021, 4, 21, 7, 39, 20, 429504, tzinfo=Timezone('UTC')), 'executor_class': 'CeleryExecutor', 'hostname': 'hostname', 'unixname': 'root'}]
(Background on this error at: http://sqlalche.me/e/13/gkpj)
After a few launch attempts the scheduler eventually works. I assume the id gets incremented and the row is then inserted successfully:
airflow=> select * from job order by state;
id | dag_id | state | job_type | start_date | end_date | latest_heartbeat | executor_class | hostname | unixname
----+--------+---------+--------------+-------------------------------+-------------------------------+-------------------------------+----------------+------------------------------+----------
26 | | running | SchedulerJob | 2021-04-21 07:39:22.243721+00 | | 2021-04-21 07:39:22.243734+00 | CeleryExecutor | machine name | root
25 | | running | SchedulerJob | 2021-04-21 07:39:14.515009+00 | | 2021-04-21 07:39:19.632811+00 | CeleryExecutor | machine name | root
There is also a warning about the log table (once the second and subsequent schedulers have started successfully):
WARNING - Failed to log action with (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "log_pkey"
DETAIL: Key (id)=(40) already exists.
I understand why the scheduler cannot insert data into the table, but how is this supposed to work? How do I launch multiple schedulers? The official documentation says no additional configuration is required. I hope I explained this clearly. Thanks!

Looks like there is a race condition between the Airflow Schedulers and Bucardo.
Probably the easiest way to fix it is to query all servers sequentially with a connection string like this in your airflow.cfg (the same on all nodes):
[core]
sql_alchemy_conn=postgresql://USER:PASS@/DB?host=node1:port1&host=node2&host=node3
For this to work you'll need sqlalchemy >= 1.3
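If you prefer to generate that value rather than type it, here is a minimal sketch in Python. The helper name multi_host_url and the example credentials are made up for illustration; the output simply follows the multi-host form shown above, with the user and password percent-encoded in case they contain characters such as "@".

```python
from urllib.parse import quote


def multi_host_url(user, password, database, hosts):
    """Build a multi-host PostgreSQL URL for sql_alchemy_conn.

    hosts is a list of (hostname, port) pairs; port may be None to
    leave the port off for that host. Hypothetical helper, not part
    of Airflow or SQLAlchemy.
    """
    params = "&".join(
        f"host={name}:{port}" if port is not None else f"host={name}"
        for name, port in hosts
    )
    # Percent-encode credentials so special characters do not break the URL.
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@/{database}?{params}"
    )


example_url = multi_host_url(
    "airflow", "p@ss", "airflow",
    [("node1", 5432), ("node2", None), ("node3", None)],
)
print(example_url)
```

The host order matters here: every node should list the hosts in the same order so that all schedulers try the same server first.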
Why this happens
There is a race condition between your schedulers and Bucardo, which read and write the table's data on different hosts. Changes do not propagate as quickly as they need to, so writes to the table fail.
Even if you treat all your nodes as "multimaster", making all nodes look first at the same server will remediate this problem. In case of failure, they will use the second one.
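To make the failure mode concrete, here is a toy model in Python. It is only a sketch of the asker's hypothesis, not of how Bucardo actually works: it assumes each master hands out ids from its own local counter and that replication copies rows but not the counter position. The class and function names are invented for this example.

```python
class Node:
    """Toy stand-in for one PostgreSQL master (assumption: local id counter)."""

    def __init__(self, name, next_id):
        self.name = name
        self.next_id = next_id
        self.rows = {}

    def insert(self, payload):
        new_id = self.next_id
        # Like a sequence, the counter advances even if the insert fails.
        self.next_id += 1
        if new_id in self.rows:
            raise KeyError(f"duplicate key: id={new_id} already exists on {self.name}")
        self.rows[new_id] = payload
        return new_id


def replicate(source, target):
    """Copy rows only; the target's counter is left alone (model assumption)."""
    target.rows.update(source.rows)


node1 = Node("node1", next_id=25)
node2 = Node("node2", next_id=25)

first_id = node1.insert("scheduler on node1")
replicate(node1, node2)

try:
    node2.insert("scheduler on node2")
    collided = False
except KeyError:
    collided = True

# A retry draws the next id and goes through, matching the "works after a few attempts" symptom.
retry_id = node2.insert("scheduler on node2")
```

In this model, pointing every scheduler at the same first host means they all draw ids from one counter, which is why the collision goes away.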

I asked the Airflow developers. The problem is in Bucardo, since it does not support 'SELECT ... FOR UPDATE':
I suspect that the problem is with Bucardo, which does not support record locking properly. We have high expectations, because it is a key protection mechanism against running the same task by many schedulers.
http://airflow.apache.org/docs/apache-airflow/stable/scheduler.html#database-requirements
If that doesn't work you will have problems with duplicate keys.
Thanks!

Related

amplify push yields "The AWS Access Key Id you provided does not exist in our records."

Returning to an app from a few months ago, I ran:
amplify push
which returned
Current Environment: dev
| Category | Resource name | Operation | Provider plugin |
| -------- | --------------------- | --------- | ----------------- |
| Api | e9app201907021400api | Update | awscloudformation |
| Auth | eauth201907021400 | No Change | awscloudformation |
? Are you sure you want to continue? Yes
GraphQL schema compiled successfully.
Edit your schema at /Projects/2019/june/e9-app/amp<snip>0api/schema
✖ An error occurred when pushing the resources to the cloud
The AWS Access Key Id you provided does not exist in our records.
So I generated a new set of credentials in the console and installed them with aws configure.
I ran aws configure list
and got
      Name                    Value             Type    Location
      ----                    -----             ----    --------
   profile                  default           manual    --profile
access_key     ****************CAGH shared-credentials-file
secret_key     ****************uU0C shared-credentials-file
    region                eu-west-1      config-file    ~/.aws/config
I then checked:
cat ~/.aws/credentials
which returned:
[default]
aws_access_key_id = ****************CAGH
aws_secret_access_key = ****************uU0C
amplify push continues to return the same message.
When I go back to the console and look at the user it says "access key age Today" - as opposed to 45 days ago (before I requested new credentials).
Any clues as to what else I can check please?
Try to check your configured 'profileName' in /amplify/.config/local-aws-info.json.
In my case, I was trying to run the push command using a different profile and that didn't work. Switching to the correct profile solved the issue.
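A quick way to spot such a mismatch is to compare the profile names Amplify has stored with the profiles in your credentials file. The sketch below assumes local-aws-info.json maps each environment name to an object containing a "profileName" key; that layout and the function names are assumptions for illustration, so check them against your own file.

```python
import configparser
import json
from pathlib import Path


def amplify_profiles(local_aws_info_path):
    """Return {environment: profileName} from local-aws-info.json (assumed layout)."""
    data = json.loads(Path(local_aws_info_path).read_text())
    return {
        env: cfg.get("profileName")
        for env, cfg in data.items()
        if isinstance(cfg, dict)
    }


def credential_profiles(credentials_path):
    """Return the set of profile names defined in an AWS credentials file."""
    parser = configparser.ConfigParser()
    parser.read(credentials_path)
    return set(parser.sections())


def missing_profiles(local_aws_info_path, credentials_path):
    """Return the environments whose profile is not in the credentials file."""
    known = credential_profiles(credentials_path)
    return {
        env: name
        for env, name in amplify_profiles(local_aws_info_path).items()
        if name not in known
    }
```

Call missing_profiles with the path to amplify/.config/local-aws-info.json in your project and the path to ~/.aws/credentials; an empty result means every environment points at a profile that exists.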
It would appear the Inactive key associated with the user account was invalidating the Active key. To test the theory I reactivated the Inactive key. I've since deleted the inactive key.
So it would seem to me that amplify doesn't see the non-primary key.

DSE graph not able to create search index asText for a property

I've just started my journey with DSE Graph (I had a fair bit of understanding of Titan earlier). I've set up DSE Graph with DataStax 5.0.3.
When trying to create a search index for a property, I get the following exception.
schema.vertexLabel('Employee').index('search').search().by('story').asText().add()
org.apache.tinkerpop.gremlin.driver.exception.ResponseException: Cannot create search index with workload: Analytics
I was able to create properties and materialized and secondary indices, but creating the search index fails with this error.
I realized that when bringing up my single-node cluster I had to turn off the -s flag, because with it the DSE server would not start. There was an exception when bringing up the node for the first time, and according to some DataStax developer Q&As I was not supposed to set the -s flag.
entrypoint: ["/usr/local/bin/dse-entrypoint", "-k", "-g"]
Now, when I try to enable the -s flag, my node does not come up and I get the following exception.
dse | WARN 12:54:28,038 CLibrary.java:163 - Unable to lock JVM memory (ENOMEM). This can result in part of the JVM being swapped out, especially with mmapped I/O enabled. Increase RLIMIT_MEMLOCK or run Cassandra as root.
dse | WARN 12:54:28,038 StartupChecks.java:118 - jemalloc shared library could not be preloaded to speed up memory allocations
dse | WARN 12:54:28,039 StartupChecks.java:150 - JMX is not enabled to receive remote connections. Please see cassandra-env.sh for more info.
dse | WARN 12:54:28,047 SigarLibrary.java:174 - Cassandra server running in degraded mode. Is swap disabled? : false, Address space adequate? : true, nofile limit adequate? : true, nproc limit adequate? : true
dse | ERROR 12:54:28,710 CassandraDaemon.java:709 - Cannot start node if snitch's data center (SearchGraphAnalytics) differs from previous data center (GraphAnalytics). Please fix the snitch configuration, decommission and rebootstrap this node or use the flag -Dcassandra.ignore_dc=true.
dse | INFO 12:54:28,717 DseDaemon.java:556 - DSE shutting down...
dse | INFO 12:54:28,718 PluginManager.java:104 - All plugins are stopped.
dse | Oct 27, 2016 12:54:28 PM org.apache.coyote.http11.Http11Protocol pause
dse | INFO: Pausing Coyote HTTP/1.1 on http-172.19.0.3-8983
dse | Oct 27, 2016 12:54:29 PM org.apache.catalina.core.StandardService stop
dse | INFO: Stopping service Solr
dse | INFO 12:54:29,907 SolrHttpAuditLogFilter.java:225 - Shutting down Solr audit logging filter
dse | INFO 12:54:29,924 RepeatablePOSTQueryFilter.java:81 - Shutting down com.datastax.bdp.search.solr.RepeatablePOSTQueryFilter filter
dse | Oct 27, 2016 12:54:29 PM org.apache.catalina.loader.WebappClassLoader clearReferencesThreads
dse | SEVERE: The web application [/solr] appears to have started a thread named [Thread-3] but has failed to stop it. This is very likely to create a memory leak.
dse | Oct 27, 2016 12:54:29 PM org.apache.catalina.loader.WebappClassLoader clearReferencesThreads
dse | SEVERE: The web application [/solr] appears to have started a thread named [NonPeriodicTasks:1] but has failed to stop it. This is very likely to create a memory leak.
dse | Oct 27, 2016 12:54:29 PM org.apache.coyote.http11.Http11Protocol destroy
dse | INFO: Stopping Coyote HTTP/1.1 on http-172.19.0.3-8983
dse | INFO 12:54:33,191 MessageServer.java:129 - internode-messaging message server finished shutting down.
dse | INFO 12:54:37,209 MessageServer.java:129 - internode-messaging message server finished shutting down.
dse | Exception in thread "Daemon shutdown" java.lang.AssertionError
dse | at org.apache.cassandra.gms.Gossiper.addLocalApplicationStateInternal(Gossiper.java:1427)
dse | at org.apache.cassandra.gms.Gossiper.addLocalApplicationStates(Gossiper.java:1451)
dse | at org.apache.cassandra.gms.Gossiper.addLocalApplicationState(Gossiper.java:1441)
dse | at com.datastax.bdp.gms.DseState.setActiveStatusSync(DseState.java:241)
dse | at com.datastax.bdp.server.DseDaemon.preStop(DseDaemon.java:576)
dse | at com.datastax.bdp.server.DseDaemon.safeStop(DseDaemon.java:587)
dse | at com.datastax.bdp.server.DseDaemon.lambda$getShutdownHook$226(DseDaemon.java:905)
dse | at java.lang.Thread.run(Thread.java:745)
Please suggest how I can rectify this situation and add the search index to my properties.
This error says that the node is starting up with a different data center name than the one it recorded on a previous start (the log shows SearchGraphAnalytics versus GraphAnalytics).
By default, unless you override it in your configuration, the name is derived from the workloads you enable, e.g. -s, -t. In your case, you first started the node with an Analytics workload and then restarted it with Search added, so the name now defaults to a new value that doesn't match the old one.
The easiest thing to do here is to wipe your Cassandra commit log, caches, and data directory and restart the node. That removes the old name from the system tables and allows the node to start. Be aware that doing this wipes out any data you had in the cluster.

how to let bosh errand execute on existed vms

All,
I have an issue: how can I run a bosh errand on existing VMs?
I have a VM with MongoDB deployed on it, and I want to run some errand commands in that VM, but I don't know how.
Does anyone know how to do this?
Director task 2121
Task 2121 done
+---------+---------+---------+--------------+
| VM | State | VM Type | IPs |
+---------+---------+---------+--------------+
| app-p/0 | stopped | app | 10.62.90.171 |
+---------+---------+---------+--------------+
How to run an errand:
bosh run errand ERRAND_NAME [--download-logs] [--logs-dir DESTINATION_DIRECTORY]
ref: http://bosh.io/docs/sysadmin-commands.html#errand
This assumes you have already set the deployment job's 'lifecycle' to 'errand' in your manifest.
ref: Jobs Block section of http://bosh.io/docs/deployment-manifest.html
Like Ben Moss mentioned in his comment, a new VM is created for running an errand job. The errand VM does not stay around after the errand completes. An errand is a short-lived job and can be run multiple times.

nova boot baremetal, select specific machine in pool to 'boot'

I am using Ironic to help me deploy bare metal in a data center environment using 1U Dell servers. It works very well, I can use Ironic to marshall dozens of servers in the rack, then when I need a bare metal instance (via nova) I just use the flavor associated with those servers and I get one of them. Is there a way I can get a specific one? For example, my servers are numbered from the top, starting with control0, control1 all the way down to control39. So, first I create all of the baremetal servers, introspect them. Then I create a flavor (like below, please forgive the pseudo code) and associate each baremetal server with that profile.
openstack flavor create --id auto --ram 6144 --disk 40 --vcpus 4 control
openstack flavor set --property "cpu_arch"="x86_64" --property "capabilities:boot_option"="local" --property "capabilities:profile"="control" control
# SERVER_UUIDS is a placeholder: the bare metal node UUIDs, in top-down rack order
i=0
for uuid in $SERVER_UUIDS; do
    ironic node-update "$uuid" add name="control$i"
    ironic node-update "$uuid" add properties/capabilities="profile:control,boot_option:local"
    i=$((i + 1))
done
When I loop through the list I know that the servers are in top down physical order. What I would like to be able to do is get nova to create a boot instance on a specific ironic bare metal (like control3). I could create separate flavors for each one but I think there must be a way to select a specific piece of hardware? Or a strategy that would pick them in the order I specify.
I am pretty new to Ironic. I have done quite a bit of googling on the topic but haven't found anything. Here is how I start nova:
nova boot --flavor control --image rhel-server-7.1-x86_64-dvd.iso --nic 'net-id=723e7b11-3e61-481a-827e-e58b369dd28f' mybootinstance1
Which works fine. What I would like to do is have a nova boot line which uses the flavor control, and also the name (control0) or any other property that I can assign to make that machine unique. Something like:
nova boot --flavor control --ironic-instance-name control0 --image rhel-server-7.1-x86_64-dvd.iso --nic 'net-id=723e7b11-3e61-481a-827e-e58b369dd28f' mybootinstance1
This is actually a simplification of the nova pool selection process. I don't want to use a pool, but rather, a specific piece of hardware.
If that isn't possible, is there a big drawback to using 40 flavors to create individual 'pools'?
I think you can use --hint in nova boot to select a specific machine in the pool.
Preconditions: edit /etc/nova/nova.conf, add 'JsonFilter' to scheduler_default_filters and restart nova-scheduler. Then use a nova boot command like this:
nova boot --flavor <flavor> --image <image_id> --nic net-id=<net_id> --hint reservation=<reservation_id> --hint query='["=","$hypervisor_hostname", "<node_uuid>"]' <instance_name>
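The JSON inside the query hint is easy to get wrong when quoting it by hand, so here is a small Python sketch that builds the same command as an argument list. The function name nova_boot_command and the example values are made up; the reservation hint from the command above is left out because the question does not involve reservations.

```python
import json
import shlex


def nova_boot_command(flavor, image, net_id, node_uuid, instance_name):
    """Build a 'nova boot' argument list pinned to one node via a JsonFilter hint.

    Hypothetical helper: it only assembles the arguments, it does not run nova.
    """
    # JsonFilter expects a JSON list: [operator, variable, value].
    query = json.dumps(["=", "$hypervisor_hostname", node_uuid])
    return [
        "nova", "boot",
        "--flavor", flavor,
        "--image", image,
        "--nic", f"net-id={net_id}",
        "--hint", f"query={query}",
        instance_name,
    ]


example_argv = nova_boot_command(
    "control",
    "rhel-server-7.1-x86_64-dvd.iso",
    "723e7b11-3e61-481a-827e-e58b369dd28f",
    "node-uuid-placeholder",
    "mybootinstance1",
)
# shlex.join (Python 3.8+) quotes each argument so the line can be pasted into a shell.
print(shlex.join(example_argv))
```

Passing the list to subprocess.run instead of pasting the printed line avoids shell quoting altogether.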
I'm not very familiar with this topic, but I'd like to share how to boot an instance on a specific host via an availability zone.
In my devstack (master) development environment, the procedure is:
$ nova availability-zone-list
+---------------------+----------------------------------------+
| Name | Status |
+---------------------+----------------------------------------+
| internal | available |
| |- fcwszq | |
| | |- nova-conductor | enabled :-) 2015-11-23T06:31:46.000000 |
| | |- nova-cert | enabled :-) 2015-11-23T06:31:41.000000 |
| | |- nova-scheduler | enabled :-) 2015-11-23T06:31:43.000000 |
| | |- nova-network | enabled :-) 2015-11-23T06:31:44.000000 |
| nova | available |
| |- fcwszq | |
| | |- nova-compute | enabled :-) 2015-11-23T06:31:41.000000 |
+---------------------+----------------------------------------+
Note that my environment has only one compute node, whose hostname is fcwszq, but it can still be specified like this:
nova boot --availability-zone nova:fcwszq --flavor 1 --image c38f0c7e-8ee0-4b0f-8a56-022040b4696f test02
If I specify a non-existent node, for example, nova:non-existent, the instance cannot be created correctly (state is ERROR).
Hope this can help you.
Another way is to use a host aggregate and flavor metadata to boot an instance on a random server in a group. Reference: http://docs.openstack.org/liberty/config-reference/content/section_compute-scheduler.html#d6e21786

WebLogic OBIEE Scheduler Component Down

I have an OBIEE 11g installation on a Red Hat machine, but I'm having trouble getting it to run. I can start WebLogic and its services, so I’m able to open the WebLogic console and Enterprise Manager, but the problems start when I try to start the OBIEE components with the opmnctl command.
The steps I’m performing are the following:
1) Start WebLogic
cd /home/Oracle/Middleware/user_projects/domains/bifoundation_domain/bin/
./startWebLogic.sh
2) Start NodeManager
cd /home/Oracle/Middleware/wlserver_10.3/server/bin/
./startNodeManager.sh
3) Start Managed WebLogic
cd /home/Oracle/Middleware/user_projects/domains/bifoundation_domain/bin/
./startManagedWebLogic.sh bi_server1
4) Set up OBIEE Components
cd /home/Oracle/Middleware/instances/instance1/bin/
./opmnctl startall
The result is:
opmnctl startall: starting opmn and all managed processes...
================================================================================
opmn id=JustiziaInf.mmmmm.mmmmm.9999
Response: 4 of 5 processes started.
ias-instance id=instance1
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
ias-component/process-type/process-set:
coreapplication_obisch1/OracleBISchedulerComponent/coreapplication_obisch1/
Error
--> Process (index=1,uid=1064189424,pid=4396)
failed to start a managed process after the maximum retry limit
Log:
/home/Oracle/Middleware/instances/instance1/diagnostics/logs/OracleBISchedulerComponent/
coreapplication_obisch1/console~coreapplication_obisch1~1.log
5) Check the status of components
cd /home/Oracle/Middleware/instances/instance1/bin/
./opmnctl status
Processes in Instance: instance1
---------------------------------+--------------------+---------+---------
ias-component | process-type | pid | status
---------------------------------+--------------------+---------+---------
coreapplication_obiccs1 | OracleBIClusterCo~ | 8221 | Alive
coreapplication_obisch1 | OracleBIScheduler~ | N/A | Down
coreapplication_obijh1 | OracleBIJavaHostC~ | 8726 | Alive
coreapplication_obips1 | OracleBIPresentat~ | 6921 | Alive
coreapplication_obis1 | OracleBIServerCom~ | 7348 | Alive
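If you want to script this check, a minimal Python sketch along these lines could pick the non-Alive components out of the status table. It assumes the pipe-separated four-column layout shown above; the function name down_components is made up for this example.

```python
def down_components(status_output):
    """Return the ias-component names whose status is not 'Alive'.

    Assumes the four-column, pipe-separated table printed by 'opmnctl status'.
    """
    down = []
    for line in status_output.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        # Skip the title, the separator rows and the header row.
        if len(cells) != 4 or cells[0] == "ias-component":
            continue
        component, _process_type, _pid, status = cells
        if status != "Alive":
            down.append(component)
    return down


sample_status = """Processes in Instance: instance1
---------------------------------+--------------------+---------+---------
ias-component | process-type | pid | status
---------------------------------+--------------------+---------+---------
coreapplication_obiccs1 | OracleBIClusterCo~ | 8221 | Alive
coreapplication_obisch1 | OracleBIScheduler~ | N/A | Down
coreapplication_obijh1 | OracleBIJavaHostC~ | 8726 | Alive
coreapplication_obips1 | OracleBIPresentat~ | 6921 | Alive
coreapplication_obis1 | OracleBIServerCom~ | 7348 | Alive
"""
print(down_components(sample_status))
```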
Read the log file at /home/Oracle/Middleware/instances/instance1/diagnostics/logs/OracleBISchedulerComponent/coreapplication_obisch1/console~coreapplication_obisch1~1.log.
I would recommend trying the steps in the link below, as this is a common issue when upgrading OBIEE.
http://www.askjohnobiee.com/2012/11/fyi-opmnctl-failed-to-start-managed.html
I'm not sure what your log says, but try the steps below and check whether they work:
Log in as superuser
cd $ORACLE_HOME/Apache/Apache/bin
chmod 6750 .apachectl
Log out and log in as the ORACLE user
opmnctl startproc process-type=OracleBIScheduler