Airflow - Error when configuring User authentication

I am trying to set up the login page for Airflow. I am getting an error when I try to update the password using user.password = 'set_the_password'.
The error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/anaconda3/lib/python3.6/site-packages/sqlalchemy/ext/hybrid.py", line 873, in __set__
raise AttributeError("can't set attribute")
AttributeError: can't set attribute
Could anyone help me with this? Thanks.

Try the following in a Python interpreter:
import airflow
from airflow import models, settings
from airflow.contrib.auth.backends.password_auth import PasswordUser
user = PasswordUser(models.User())
user.username = 'USERNAME'
user.email = 'EMAIL'
user._set_password = 'PASSWORD'.encode('utf8')  # assigning user.password directly raises the AttributeError above
session = settings.Session()
session.add(user)
session.commit()
session.close()
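If you need to do this repeatedly, the same workaround can be wrapped in a small helper. This is a minimal sketch based on the snippet above (the function name is illustrative, not part of Airflow):
import airflow
from airflow import models, settings
from airflow.contrib.auth.backends.password_auth import PasswordUser

def create_password_user(username, email, plaintext_password):
    # Assign via _set_password because 'user.password = ...' raises AttributeError here;
    # the model's setter hashes the value before it is stored.
    user = PasswordUser(models.User())
    user.username = username
    user.email = email
    user._set_password = plaintext_password.encode('utf8')
    session = settings.Session()
    try:
        session.add(user)
        session.commit()
    finally:
        session.close()
    return user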

Related

__call__() got an unexpected keyword argument 'metadata' error for Airflow PubSubHook acknowledge method

I am trying to manually acknowledge each Pub/Sub message in the Python callback method for PubSubPullOperator. I have provided the arguments as per the documentation. However, I am getting errors related to the optional "metadata" argument.
Scenario 1 - when metadata=[]: Getting error -> __call__() got an unexpected keyword argument 'metadata'
PubSubHook().acknowledge(subscription=SNOW_SUBSCRIPTION,project_id=PROJECT_ID, ack_ids=ack_id_list, retry=Retry , timeout=10,metadata=[])
Traceback:
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/operators/pubsub.py", line 785, in execute
ret = handle_messages(pulled_messages, context)
File "/home/airflow/gcs/dags/snow_ticket_creator_1.py", line 70, in print_messages
PubSubHook().acknowledge(subscription=SNOW_SUBSCRIPTION,project_id=PROJECT_ID, ack_ids=ack_id_list, retry=Retry , timeout=10)
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/common/hooks/base_google.py", line 457, in inner_wrapper
return func(self, *args, **kwargs)
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/pubsub.py", line 561, in acknowledge
subscriber.acknowledge(
File "/opt/python3.8/lib/python3.8/site-packages/google/pubsub_v1/services/subscriber/client.py", line 1270, in acknowledge
rpc(
File "/opt/python3.8/lib/python3.8/site-packages/google/api_core/gapic_v1/method.py", line 154, in __call__
return wrapped_func(*args, **kwargs)
TypeError: __call__() got an unexpected keyword argument 'metadata'
Scenario 2 - when metadata = None: Getting error message TypeError: 'NoneType' object is not iterable
PubSubHook().acknowledge(subscription=SNOW_SUBSCRIPTION,project_id=PROJECT_ID, ack_ids=ack_id_list, retry=Retry , timeout=10,metadata=None)
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/operators/pubsub.py", line 785, in execute
ret = handle_messages(pulled_messages, context)
File "/home/airflow/gcs/dags/snow_ticket_creator_1.py", line 70, in print_messages
PubSubHook().acknowledge(subscription=SNOW_SUBSCRIPTION,project_id=PROJECT_ID, ack_ids=ack_id_list, retry=Retry , timeout=10,metadata=None)
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/common/hooks/base_google.py", line 457, in inner_wrapper
return func(self, *args, **kwargs)
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/pubsub.py", line 561, in acknowledge
subscriber.acknowledge(
File "/opt/python3.8/lib/python3.8/site-packages/google/pubsub_v1/services/subscriber/client.py", line 1263, in acknowledge
metadata = tuple(metadata) + (
TypeError: 'NoneType' object is not iterable
Scenario 3 - when metadata is omitted: Getting error -> __call__() got an unexpected keyword argument 'metadata'
PubSubHook().acknowledge(subscription=SNOW_SUBSCRIPTION,project_id=PROJECT_ID, ack_ids=ack_id_list, retry=Retry , timeout=10)
Traceback: Same as Scenario 1
Composer version: composer-1.19.12
Airflow version: airflow-2.3.3
Complete Code:
from __future__ import annotations
import os
from datetime import datetime
import base64
import airflow
from airflow import DAG
import json
from airflow.operators.bash import BashOperator
from airflow.providers.google.cloud.operators.pubsub import (
PubSubCreateSubscriptionOperator,
PubSubPullOperator,
)
from airflow.providers.google.cloud.sensors.pubsub import PubSubPullSensor
from airflow.providers.google.cloud.hooks.pubsub import PubSubHook,Retry
from airflow.utils.trigger_rule import TriggerRule
ENV_ID = "Dev" #os.environ.get("SYSTEM_TESTS_ENV_ID")
PROJECT_ID = "abcdef" #os.environ.get("SYSTEM_TESTS_GCP_PROJECT", "your-project-id")
DAG_ID = "DataPullDag_1"
TOPIC_ID = "alert_topic_jp" #f"topic-{DAG_ID}-{ENV_ID}"
SNOW_SUBSCRIPTION="alert_subscription_jp"
def print_ack_messages(pulled_messages, context):
    for idx, m in enumerate(pulled_messages):
        data = m.message.data.decode('utf-8')
        print(f'################----------{data}')
        data_json_dict = json.loads(data)
        print(f"AckID: { m.ack_id }, incident_id: { data_json_dict['incident']['incident_id'] }"
              f"scoping_project_id: { data_json_dict['incident']['scoping_project_id'] } "
              f"resource_name: { data_json_dict['incident']['resource_name'] } "
              f"summary: { data_json_dict['incident']['summary'] } ")
        # acknowledge message
        ack_id_list = [m.ack_id]
        print(type(ack_id_list))
        if idx == 0:
            PubSubHook().acknowledge(subscription=SNOW_SUBSCRIPTION, project_id=PROJECT_ID, ack_ids=ack_id_list, retry=Retry, timeout=10)
            print(f"Successfully acknowledged incident_id: { data_json_dict['incident']['incident_id'] }")
with DAG(
    DAG_ID,
    schedule_interval='@once',  # Override to match your needs
    start_date=airflow.utils.dates.days_ago(0),
    catchup=False,
) as dag:
    # [START howto_operator_gcp_pubsub_create_subscription]
    subscribe_task = PubSubCreateSubscriptionOperator(
        task_id="subscribe_task", project_id=PROJECT_ID, topic=TOPIC_ID, subscription=SNOW_SUBSCRIPTION
    )
    subscription = subscribe_task.output
    pull_messages_operator = PubSubPullOperator(
        task_id="pull_messages_operator",
        ack_messages=False,
        project_id=PROJECT_ID,
        messages_callback=print_ack_messages,
        subscription=subscription,
        max_messages=50,
    )
    (
        subscribe_task
        >> pull_messages_operator
    )
I did a bit more experimenting with the actual source code for the pull operator (if we provide ack_messages=True in the PubSubPullOperator itself, it acknowledges all the pulled messages by calling hook.acknowledge(project_id=self.project_id, subscription=self.subscription, messages=pulled_messages)) and found out that the retry object in my acknowledge call was causing the issue. So instead of PubSubHook().acknowledge(subscription=SNOW_SUBSCRIPTION, project_id=PROJECT_ID, ack_ids=ack_id_list, retry=Retry, timeout=10) I dropped the retry argument and used PubSubHook().acknowledge(subscription=SNOW_SUBSCRIPTION, project_id=PROJECT_ID, ack_ids=ack_id_list, timeout=10), and it worked.
However, as per the documentation, the retry object has a purpose:
retry: a retry object used to retry requests. If None is specified, requests will not be retried.
Update 31/10/2022
This was due to a mistake in the code: instead of a Retry object I was passing the Retry class itself. Thank you for pointing this out, Taragolis (Airflow collaborator).
In the above code, if we replace the Retry class with a Retry instance it works, as shown below:
retryObj= Retry(initial=10, maximum=10, multiplier=1.0, deadline=600)
PubSubHook().acknowledge(subscription=SNOW_SUBSCRIPTION,project_id=PROJECT_ID, ack_ids=ack_id_list, retry=retryObj, timeout=10)
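Alternatively, as noted above, the pull operator can do the acknowledging itself. A minimal sketch under that approach, reusing the variables from the DAG above (the task id is illustrative); with ack_messages=True the operator calls hook.acknowledge() internally, so the callback can be reduced to just printing:
pull_and_ack = PubSubPullOperator(
    task_id="pull_and_ack",  # hypothetical task id
    project_id=PROJECT_ID,
    subscription=subscription,
    max_messages=50,
    ack_messages=True,  # the operator acknowledges all pulled messages itself
    messages_callback=print_ack_messages,  # callback should then only log, not acknowledge
)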

Airflow PostgresOperator :Task failed with exception while using postgres_conn_id="redshift"

~$ airflow version
2.1.2
python 3.8
I am trying to execute some basic queries on my Redshift cluster using a DAG, but the task is failing with an exception (not shown in the logs).
import datetime
import logging

from airflow import DAG
from airflow.contrib.hooks.aws_hook import AwsHook
from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.postgres_operator import PostgresOperator
from airflow.operators.python_operator import PythonOperator

import sql_statements


def load_data_to_redshift(*args, **kwargs):
    aws_hook = AwsHook("aws_credentials")
    credentials = aws_hook.get_credentials()
    redshift_hook = PostgresHook("redshift")
    sql_stmt = sql_statements.COPY_ALL_data_SQL.format(
        credentials.access_key,
        credentials.secret_key,
    )
    redshift_hook.run(sql_stmt)


dag = DAG(
    'exercise1',
    start_date=datetime.datetime.now()
)

create_t1_table = PostgresOperator(
    task_id="create_t1_table",
    dag=dag,
    postgres_conn_id="redshift_default",
    sql=sql_statements.CREATE_t1_TABLE_SQL
)

create_t2_table = PostgresOperator(
    task_id="create_t2_table",
    dag=dag,
    postgres_conn_id="redshift_default",
    sql=sql_statements.CREATE_t2_TABLE_SQL,
)

create_t1_table >> create_t2_table
The following is the exception:
[2021-09-17 05:23:33,902] {base.py:69} INFO - Using connection to: id: redshift_default. Host: rdscluster.123455.us-west-2.redshift.amazonaws.com, Port: 5439, Schema: udac, Login: ***, Password: ***, extra: {}
[2021-09-17 05:23:33,903] {taskinstance.py:1501} ERROR - Task failed with exception
Traceback (most recent call last):
File "/home/8085/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1157, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/home/8085/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1331, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/home/8085/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1361, in _execute_task
result = task_copy.execute(context=context)
File "/home/8085/.local/lib/python3.8/site-packages/airflow/providers/postgres/operators/postgres.py", line 70, in execute
self.hook.run(self.sql, self.autocommit, parameters=self.parameters)
File "/home/8085/.local/lib/python3.8/site-packages/airflow/hooks/dbapi.py", line 177, in run
with closing(self.get_conn()) as conn:
File "/home/8085/.local/lib/python3.8/site-packages/airflow/providers/postgres/hooks/postgres.py", line 115, in get_conn
self.conn = psycopg2.connect(**conn_args)
File "/home/8085/.local/lib/python3.8/site-packages/psycopg2/__init__.py", line 124, in connect
conn = psycopg2.connect("dbname=airflow user=abc password=ubantu host=127.0.0.1 port=5432")
File "/home/8085/.local/lib/python3.8/site-packages/psycopg2abc/__init__.py", line 124, in connect
conn = psycopg2.connect("dbname=airflow user=abc password=abc host=127.0.0.1 port=5432")
File "/home/8085/.local/lib/python3.8/site-packages/psycopg2/__init__.py", line 124, in connect
conn = psycopg2.connect("dbname=airflow user=abc password=abc host=127.0.0.1 port=5432")
[Previous line repeated 974 more times]
RecursionError: maximum recursion depth exceeded
[2021-09-17 05:23:33,907] {taskinstance.py:1544} INFO - Marking task as FAILED. dag_id=exercise1, task_id=create_t1_table, execution_date=20210917T092331, start_date=20210917T092333, end_date=20210917T092333
[2021-09-17 05:23:33,953] {local_task_job.py:149} INFO - Task exited with return code 1
I can't tell from the logs what is going wrong here. It appears that even after providing the Redshift connection ID, the PostgresOperator is using the default Postgres connection configured while installing the Airflow webserver, but I could be wrong.
Any idea how I can resolve this or get more logs out of Airflow? (Note: I already tried different Airflow log levels from the Airflow config; it didn't help either.)
The redshift connection is defined properly and I can connect to Redshift using another standalone Python utility as well as plsql, so there is no issue with the Redshift cluster.
-Thanks,
Resolved:
Somehow the following file was referring to the Airflow Postgres DB created during the Airflow installation rather than connecting to the local Postgres:
File "/home/8085/.local/lib/python3.8/site-packages/psycopg2/__init__.py", line 124, in connect
conn = psycopg2.connect("dbname=airflow user=abc password=abc host=127.0.0.1 port=5432")
Had to recreate the airflow DB from scratch to resolve the issue.
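For anyone debugging a similar mix-up, one quick check is to print the URI the hook actually resolves before running any SQL. A minimal sketch, assuming the connection id redshift_default used in the DAG above (the URI contains credentials, so don't log it in production):
from airflow.hooks.postgres_hook import PostgresHook

hook = PostgresHook(postgres_conn_id="redshift_default")
print(hook.get_uri())  # should point at the Redshift endpoint, not 127.0.0.1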

Airflow unexpected argument 'mounts'

I'm trying to set up an Airflow ETL pipeline that extracts images from a .bag file. I want to do the extraction inside Docker, so I'm using DockerOperator. The Docker image is pulled from a private GitLab repository. The script I want to run is a Python script inside a Docker container. The .bag file is on my external SSD, so I'm trying to mount it inside Docker. Is there something wrong with the code, or is it a different kind of problem?
Error:
[2021-09-16 10:39:17,010] {docker.py:246} INFO - Starting docker container from image registry.gitlab.com/url/of/gitlab:a24a3f05
[2021-09-16 10:39:17,010] {taskinstance.py:1462} ERROR - Task failed with exception
Traceback (most recent call last):
File "/home/filip/.local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1164, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/home/filip/.local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1282, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/home/filip/.local/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 1312, in _execute_task
result = task_copy.execute(context=context)
File "/home/filip/.local/lib/python3.6/site-packages/airflow/providers/docker/operators/docker.py", line 343, in execute
return self._run_image()
File "/home/filip/.local/lib/python3.6/site-packages/airflow/providers/docker/operators/docker.py", line 265, in _run_image
return self._run_image_with_mounts(self.mounts, add_tmp_variable=False)
File "/home/filip/.local/lib/python3.6/site-packages/airflow/providers/docker/operators/docker.py", line 287, in _run_image_with_mounts
privileged=self.privileged,
File "/usr/lib/python3/dist-packages/docker/api/container.py", line 607, in create_host_config
return HostConfig(*args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'mounts'
[2021-09-16 10:39:17,014] {taskinstance.py:1512} INFO - Marking task as FAILED. dag_id=ETL-test, task_id=docker_extract, execution_date=20210916T083912, start_date=20210916T083915, end_date=20210916T083917
[2021-09-16 10:39:17,062] {local_task_job.py:151} INFO - Task exited with return code 1
[2021-09-16 10:39:17,085] {local_task_job.py:261} INFO - 0 downstream tasks scheduled from follow-on schedule check
This is my code:
from airflow import DAG
from airflow.utils.dates import days_ago
from datetime import datetime, timedelta
from airflow.operators.dummy import DummyOperator
from airflow.providers.docker.operators.docker import DockerOperator
from docker.types import Mount
from airflow.operators.bash_operator import BashOperator

ssd_dir = Mount(source='/media/filip/external-ssd', target='/external-ssd', type='bind')

dag = DAG(
    'ETL-test',
    default_args={
        'owner': 'admin',
        'description': 'Extract data from bag, simple test',
        'depend_on_past': False,
        'start_date': datetime(2021, 9, 13),
    },
)

start_dag = DummyOperator(
    task_id='start_dag',
    dag=dag
)

extract = DockerOperator(
    api_version="auto",
    task_id='docker_extract',
    image='registry.gitlab.com/url/of/gitlab:a24a3f05',
    container_name='extract-test',
    mounts=[ssd_dir],
    auto_remove=True,
    force_pull=False,
    mount_tmp_dir=False,
    command='python3 rgb_image_extraction.py --bagfile /external-ssd/2021-09-01-13-17-10.bag --output_dir /external-ssd/airflow --camera_topic /kirby1/vm0/stereo/left/color/image_rect --every_n_img 20 --timestamp_as_name',
    docker_conn_id='gitlab_registry',
    dag=dag
)

test = BashOperator(
    task_id='print_hello',
    bash_command='echo "hello world"',
    dag=dag
)

start_dag >> extract >> test
I think you have an old docker Python library installed. If you want to make sure Airflow 2.1.0 works, you should always use the constraints mechanism as described in https://airflow.apache.org/docs/apache-airflow/stable/installation.html, otherwise you risk having outdated dependencies.
For example, if you use Python 3.6, the right constraints are https://raw.githubusercontent.com/apache/airflow/constraints-2.1.3/constraints-3.6.txt, and there the docker Python library is pinned to 5.0.0. I bet you have a much older version.
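As a rough sketch of what that looks like in practice (the Airflow version and the docker extra here are assumptions; adjust them and the constraints URL to match your Python and Airflow versions):
pip install --upgrade "apache-airflow[docker]==2.1.3" \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.1.3/constraints-3.6.txt"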

Airflow: Not receiving an error message in the email whenever the DAG/task fails with on_failure_callback

Airflow version 1.10.3
Below is the module code that is called by on_failure_callback.
I have used reason = context.get("exception"), but I get None in the email when the job fails, instead of the error message.
Output in the email:
Reason for Failure: None
alert_email.py
import logging

from airflow.utils.email import send_email
from airflow.models import Variable

logger = logging.getLogger(__name__)


def failure_alert(context, config=None):
    config = {} if config is None else config
    email = config.get('email_id')
    task_id = context.get('task_instance').task_id
    dag_id = context.get("dag").dag_id
    execution_time = context.get("execution_date")
    reason = context.get("exception")
    dag_failure_html_body = f"""<html>
    <header><title>The below DAG has failed!</title></header>
    <body>
    <b>DAG Name</b>: {dag_id}<br/>
    <b>Task Id</b>: {task_id}<br/>
    <b>Execution Time (UTC)</b>: {execution_time}<br/>
    <b>Reason for Failure</b>: {reason}<br/>
    </body>
    </html>
    """
    try:
        send_email(
            to=email,
            subject=f"Airflow alert: <DagInstance: {dag_id} - {execution_time} [failed]",
            html_content=dag_failure_html_body,
        )
    except Exception as e:
        logger.error(f'Error in sending email to address {email}: {e}', exc_info=True)
The issue is with Airflow version 1.10.3. We will be upgrading to Airflow 1.10.10.
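For reference, a minimal sketch of how such a callback is typically attached to a DAG (the DAG id, dates, and import path are illustrative, not from the original post):
from datetime import datetime
from airflow import DAG
from alert_email import failure_alert  # the module shown above

default_args = {
    "on_failure_callback": failure_alert,  # invoked with the task context on failure
}

dag = DAG(
    "example_dag",  # hypothetical DAG id
    default_args=default_args,
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
)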

Why do I get PartitionOwnedError and ConsumerStoppedException when starting a few consumers?

I use pykafka to fetch messages from a Kafka topic, then do some processing and update MongoDB. As pymongo can update only one item at a time, I start 100 processes. But when starting, some processes raise PartitionOwnedError and ConsumerStoppedException errors. I don't know why.
Thank you.
kafka_cfg = conf['kafka']
kafka_client = KafkaClient(kafka_cfg['broker_list'])
topic = kafka_client.topics[topic_name]

balanced_consumer = topic.get_balanced_consumer(
    consumer_group=group,
    auto_commit_enable=kafka_cfg['auto_commit_enable'],
    zookeeper_connect=kafka_cfg['zookeeper_list'],
    zookeeper_connection_timeout_ms=kafka_cfg['zookeeper_conn_timeout_ms'],
    consumer_timeout_ms=kafka_cfg['consumer_timeout_ms'],
)

while True:
    for msg in balanced_consumer:
        if msg is not None:
            try:
                value = eval(msg.value)
                id = long(value.pop("id"))
                value["when_update"] = datetime.datetime.now()
                query = {"_id": id}
                result = collection.update_one(query, {"$set": value}, True)
            except Exception, e:
                log.error("Fail to update: %s, msg: %s", e, msg.value)
Traceback (most recent call last):
File "dump_daily_summary.py", line 182, in <module>
dump_daily_summary.run()
File "dump_daily_summary.py", line 133, in run
for msg in self.balanced_consumer:
File "/data/share/python2.7/lib/python2.7/site-packages/pykafka-2.5.0.dev1-py2.7-linux-x86_64.egg/pykafka/balancedconsumer.py", line 745, in __iter__
message = self.consume(block=True)
File "/data/share/python2.7/lib/python2.7/site-packages/pykafka-2.5.0.dev1-py2.7-linux-x86_64.egg/pykafka/balancedconsumer.py", line 734, in consume
raise ConsumerStoppedException
pykafka.exceptions.ConsumerStoppedException
Traceback (most recent call last):
File "dump_daily_summary.py", line 182, in <module>
dump_daily_summary.run()
File "dump_daily_summary.py", line 133, in run
for msg in self.balanced_consumer:
File "/data/share/python2.7/lib/python2.7/site-packages/pykafka-2.5.0.dev1-py2.7-linux-x86_64.egg/pykafka/balancedconsumer.py", line 745, in __iter__
message = self.consume(block=True)
File "/data/share/python2.7/lib/python2.7/site-packages/pykafka-2.5.0.dev1-py2.7-linux-x86_64.egg/pykafka/balancedconsumer.py", line 726, in consume
self._raise_worker_exceptions()
File "/data/share/python2.7/lib/python2.7/site-packages/pykafka-2.5.0.dev1-py2.7-linux-x86_64.egg/pykafka/balancedconsumer.py", line 271, in _raise_worker_exceptions
raise ex
pykafka.exceptions.PartitionOwnedError
PartitionOwnedError: check whether there are background processes consuming in the same consumer_group; maybe there are not enough available partitions for starting another consumer.
ConsumerStoppedException: you can try upgrading your pykafka version (https://github.com/Parsely/pykafka/issues/574).
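A quick way to check the first point is to compare the topic's partition count with the number of consumer processes you start; a balanced consumer with no partition left to own is a common trigger for this error. A minimal sketch, with the broker list and topic name as placeholders:
from pykafka import KafkaClient

client = KafkaClient(hosts="broker1:9092,broker2:9092")  # hypothetical broker list
topic = client.topics["my_topic"]                        # hypothetical topic name
print "partitions:", len(topic.partitions)
# If you start 100 consumer processes but the topic has fewer than 100 partitions,
# the extra consumers have nothing to own.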
I met the same problem. But I was confused by others' solutions, like adding enough partitions for the consumers or updating the version of pykafka; in fact, my setup already satisfied those conditions.
Here are the versions of the tools:
python 2.7.10
kafka 2.11-0.10.0.0
zookeeper 3.4.8
pykafka 2.5.0
Here is my code:
class KafkaService(object):
    def __init__(self, topic):
        self.client_hosts = get_conf("kafka_conf", "client_host", "string")
        self.topic = topic
        self.con_group = topic
        self.zk_connect = get_conf("kafka_conf", "zk_connect", "string")

    def kafka_consumer(self):
        """kafka-consumer client, using pykafka

        :return: {"id": 1, "url": "www.baidu.com", "sitename": "baidu"}
        """
        from pykafka import KafkaClient

        consumer = ""
        try:
            kafka = KafkaClient(hosts=str(self.client_hosts))
            topic = kafka.topics[self.topic]
            consumer = topic.get_balanced_consumer(
                consumer_group=self.con_group,
                auto_commit_enable=True,
                zookeeper_connect=self.zk_connect,
            )
        except Exception as e:
            logger.error(str(e))

        while True:
            message = consumer.consume(block=False)
            if message:
                print "message:", message.value
                yield message.value
The two exceptions (ConsumerStoppedException and PartitionOwnedError) are raised by the consume(block=True) function of pykafka.balancedconsumer.
Of course, I recommend you read the source code of that function.
There is an argument block=True; after changing it to False, the program no longer falls into those exceptions.
Then the Kafka consumers work fine.
This behavior is affected by a longstanding bug that was recently discovered and is currently being fixed. The workaround we've used in production at Parse.ly is to run our consumers in an environment that handles automatically restarting them when they crash with these errors until all partitions are owned.
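A rough sketch of that restart-on-crash workaround in plain Python (the consumer construction and processing function are assumptions based on the snippets above; in production this is often handled by a process supervisor instead):
import time
from pykafka import KafkaClient
from pykafka.exceptions import ConsumerStoppedException, PartitionOwnedError

def consume_forever(hosts, topic_name, group):
    # Keep rebuilding the consumer when rebalancing errors occur,
    # until every partition is owned and consumption proceeds normally.
    while True:
        try:
            client = KafkaClient(hosts=hosts)
            consumer = client.topics[topic_name].get_balanced_consumer(
                consumer_group=group,
                auto_commit_enable=True,
            )
            for msg in consumer:
                if msg is not None:
                    process(msg.value)  # hypothetical processing function
        except (ConsumerStoppedException, PartitionOwnedError):
            time.sleep(5)  # back off briefly, then recreate the consumer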
