Apache Airflow does not pickle DAGs

I would like to recover DAG objects so that I can better inspect certain dependencies after DAG runs (e.g. what data is consumed by specific operators). I am using postgres:9.6 as metadata database backend.
This seems to be controlled by the donot_pickle configuration option, whose default (False) suggests DAGs should be pickled:
[core]
# Whether to disable pickling dags
donot_pickle = False
I have three test DAGs available, but their corresponding pickle_id is empty:
> select pickle_id from dag;
pickle_id
---------
(3 rows)
Pickles table is also empty:
> select count(*) from dag_pickle;
count
------
0
(1 row)
What might be going wrong here? I was not able to find any reference in the docs.

There are two ways to enable pickling:
donot_pickle = False in the [core] config only affects backfill jobs
-p / --do_pickle on the scheduler command line enables pickling for scheduled jobs (https://airflow.apache.org/cli.html#Named%20Arguments_repeat18)
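For example (assuming the Airflow 1.x CLI the question is using), a scheduler started with pickling enabled would look like:
airflow scheduler --do_pickle
Once the scheduler pickles the DAGs, the pickle_id column on the dag table and the dag_pickle table should start being populated.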

Related

Airflow 2 - debugging why dag is not loading

On Airflow 2 my dag is not showing on the UI, and I'm getting DAG Import Errors (...) for it.
The error message is insufficient for me to debug (it's a custom operator, with a lot of custom logic - so I don't want to get into details of the error itself).
On Airflow 1.x I could use the CLI:
airflow list_dags
to get a more detailed debug message. Is there anything analogous on Airflow 2?
I'm looking for a CLI command or UI option that will give me a more detailed error message than the one shown on the main screen of the webserver.
As described in Airflow's documentation, to test DAG loading you can simply run:
python your-dag-file.py
If there is any problem during the DAG loading phase you will get a stack trace here.
The later sections also describe how to test custom operators.
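If running the file directly is not enough (for example, when the error only appears while the whole dags folder is being parsed), a small script that loads the DagBag and prints its import errors can help; a minimal sketch, where the dags folder path is an assumption:
from airflow.models import DagBag

# Parse the dags folder the same way the scheduler/webserver does
dag_bag = DagBag(dag_folder="/opt/airflow/dags", include_examples=False)
for filename, stacktrace in dag_bag.import_errors.items():
    print(f"--- {filename} ---")
    print(stacktrace)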
As explained in the upgrading manual, airflow list_dags has been changed to airflow dags list.
The full syntax is:
airflow dags list [-h] [-o table, json, yaml] [-S SUBDIR]
For more information, see the docs.

Dag Seems to be missing

I have a DAG which checks at a regular interval whether new workflows need to be generated (dynamic DAGs) and, if any are found, creates them. (Ref: Dynamic dags not getting added by scheduler)
The above DAG is working and the dynamic DAGs are getting created and listed in the webserver. Two issues here:
When clicking on the DAG in the web UI, it says "DAG seems to be missing"
The listed DAGs are not shown by the "airflow list_dags" command
Error:
DAG "app01_user" seems to be missing.
The same happens for all other dynamically generated DAGs. I have compiled the Python script and found no errors.
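For reference, the kind of module-level dynamic DAG generation described above usually looks roughly like the sketch below (the DAG ids, schedule and operator are illustrative, not the actual code): each generated DAG must be re-created on every parse and exposed in globals() so that the scheduler and webserver can both find it.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

for app in ["app01_user", "app02_user", "app03_user", "app04_user"]:
    dag = DAG(dag_id=app, start_date=datetime(2019, 8, 1), schedule_interval=None)
    DummyOperator(task_id="placeholder", dag=dag)
    globals()[app] = dag  # module-level reference so the DagBag can discover it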
Edit1:
I tried clearing all data and running "airflow run". It ran successfully, but no dynamically generated DAGs were added by "airflow list_dags". However, when running the "airflow list_dags" command, it loaded and executed the base DAG (which generates the dynamic DAGs), and the dynamic DAGs were also listed as below:
[root@cmnode dags]# airflow list_dags
sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8\nLANG=en_US.UTF-8)
sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8\nLANG=en_US.UTF-8)
[2019-08-13 00:34:31,692] {settings.py:182} INFO - settings.configure_orm(): Using pool settings. pool_size=15, pool_recycle=1800, pid=25386
[2019-08-13 00:34:31,877] {__init__.py:51} INFO - Using executor LocalExecutor
[2019-08-13 00:34:32,113] {__init__.py:305} INFO - Filling up the DagBag from /root/airflow/dags
/usr/lib/python2.7/site-packages/airflow/operators/bash_operator.py:70: PendingDeprecationWarning: Invalid arguments were passed to BashOperator (task_id: tst_dyn_dag). Support for passing such arguments will be dropped in Airflow 2.0. Invalid arguments were:
*args: ()
**kwargs: {'provide_context': True}
super(BashOperator, self).__init__(*args, **kwargs)
-------------------------------------------------------------------
DAGS
-------------------------------------------------------------------
app01_user
app02_user
app03_user
app04_user
testDynDags
Upon running it again, all four of the generated DAGs above disappeared and only the base DAG, "testDynDags", is displayed.
When I was getting this error, there was an exception showing up in the webserver logs. Once I resolved that error and restarted the webserver, it went through normally.
From what I can see, this is the error thrown when the webserver tries to parse the DAG file and hits an error. In my case it was an error importing a new operator I had added to a plugin.
Usually I check the Airflow UI first; sometimes the reason for the broken DAG appears there. If it is not there, I run the .py file of my DAG directly, and the error (the reason the DAG can't be parsed) will show up.
I never got to work on dynamic DAG generation, but I did face this issue when the DAG was not present on all nodes (scheduler, worker and webserver). If you have an Airflow cluster, please make sure that the DAG file is present on all Airflow nodes.
Same error; the reason was that I had renamed my dag_id to uppercase, something like "import_myclientname" to "import_MYCLIENTNAME".
I am a little late to the party, but I faced the error today:
In short: try executing airflow dags report and/or airflow dags reserialize
Check out my comment here:
https://stackoverflow.com/a/73880927/4437153
I found that airflow fails to recognize a dag defined in a file that does not have from airflow import DAG in it, even if DAG is not explicitly used in that file.
For example, suppose you have two files, a.py and b.py:
# a.py
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

def makedag(dag_id="a"):
    with DAG(dag_id=dag_id) as dag:
        DummyOperator(task_id="nada")
    return dag

dag = makedag()
and
# b.py
from a import makedag
dag = makedag(dag_id="b")
Then airflow will only look at a.py. It won't even look at b.py at all, even to notice if there's a syntax error in it! But if you add from airflow import DAG to it and don't change anything else, it will show up.
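This is most likely Airflow's DAG-discovery "safe mode" at work: roughly speaking, only files whose text contains both the strings "airflow" and "dag" are parsed by default. Adding the otherwise unused import to b.py is enough for the file to be picked up, e.g.:
# b.py
from airflow import DAG  # not used directly, but lets the file pass Airflow's discovery heuristic
from a import makedag

dag = makedag(dag_id="b")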

How to find the number of upstream tasks failed in Airflow?

I am having a tough time figuring out how to find the failed tasks for the same DAG run when it runs twice on the same day (same execution date).
Consider an example where a DAG with dag_id=1 has failed on the first run (for any reason, let's say a connection timeout) and a task failed. The TaskInstance table will contain the entry of the failed task when we query it. GREAT!!
But if I re-run the same DAG (note that dag_id is still 1), then in the last task (it has the trigger rule ALL_DONE, so it will be executed regardless of whether the upstream tasks failed or succeeded) I want to count the number of tasks that failed in the current dag_run, ignoring previous dag_runs. I came across the dag_run id, which could be useful if it could be related to TaskInstance, but I could not do that. Any suggestions/help is appreciated.
In Airflow 1.10.x the same result can be achieved with much simpler code that avoids touching the ORM directly.
from airflow.utils.state import State

def your_python_operator_callable(**context):
    # All task instances belonging to the current DAG run
    tis_dagrun = context['ti'].get_dagrun().get_task_instances()
    failed_count = sum(1 for ti in tis_dagrun if ti.state == State.FAILED)
    print(f"There are {failed_count} failed tasks in this execution")
The one unfortunate problem is that context['ti'].get_dagrun() does not return an instance of DagRun when running a test of a single task from the CLI. As a result, manual testing of that single task will fail, but the standard run will work as expected.
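If you want the callable to survive such a CLI test, a small guard can help; a sketch, assuming get_dagrun() simply comes back as None in that mode:
from airflow.utils.state import State

def your_python_operator_callable(**context):
    dagrun = context["ti"].get_dagrun()
    if dagrun is None:
        # e.g. `airflow test` on 1.10.x: no real DAG run exists, so there is nothing to count
        print("No DAG run available; skipping failed-task count")
        return
    failed_count = sum(1 for ti in dagrun.get_task_instances() if ti.state == State.FAILED)
    print(f"There are {failed_count} failed tasks in this execution")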
You could create a PythonOperator task which queries the Airflow database to find the information you're looking for. This has the added benefit of passing along the information you need to query for the data you want:
from contextlib import closing

from airflow import models, settings
from airflow.utils.state import State

def your_python_operator_callable(**context):
    with closing(settings.Session()) as session:
        print("There are {} failed tasks in this execution".format(
            session.query(models.TaskInstance).filter(
                models.TaskInstance.dag_id == context["dag"].dag_id,
                models.TaskInstance.execution_date == context["execution_date"],
                models.TaskInstance.state == State.FAILED,
            ).count()
        ))
Then add the task to your DAG with a PythonOperator.
(I have not tested the above, but hopefully it will send you down the right path.)
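A possible wiring, as a sketch: it assumes an Airflow 1.10.x-style DAG object named dag already exists, and the task id is illustrative.
from airflow.operators.python_operator import PythonOperator
from airflow.utils.trigger_rule import TriggerRule

count_failed = PythonOperator(
    task_id="count_failed_tasks",
    python_callable=your_python_operator_callable,
    provide_context=True,               # 1.10.x: pass **context into the callable
    trigger_rule=TriggerRule.ALL_DONE,  # run even when upstream tasks have failed
    dag=dag,                            # assumed pre-existing DAG object
)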

MarkLogic 8: Scheduled task does not start and no logs

I scheduled a data extraction with an XQuery query on ML 8.0.6 using "scheduled tasks".
My XQuery query (this query works if I copy/paste it into the ML web console, and I get a file on AWS S3):
xdmp:save("s3://XX.csv",let $nl := "
"
return
document {
for $book in collection("books")/books
return (root($book)/bookId||","||
$optin/updatedDate||$nl
)
})
My scheduled task :
Task enabled : yes
Task path : /home/bob/extraction.xqy
task root : /
task type: hourly
task period : 1
task start time: 8 past the hour
task database : mydatabase
task modules : file system
task user : admin
task host: XX
task priority : higher
Unfortunately, my script does not seem to be executed: no file is generated on AWS S3 (the storage used) and I do not have any logs.
Any idea how to:
1/ debug a job in the scheduled tasks?
2/ see the job running at the expected time?
Thanks,
Romain.
First, I would take a look at ErrorLog.txt, because it will probably show you where to look for the problem.
xdmp:filesystem-file(concat(xdmp:data-directory(),"/","Logs","/","ErrorLog.txt"))
Where is the script located logically: Has it been uploaded to the content database, modules database, or ./MarkLogic/Modules directory?
If this is a cluster, have you specified which host it's running on? If so, and you are using the filesystem for modules, then ensure the script exists in the ./MarkLogic/Modules directory of the specified host. If not, and you are using the filesystem for modules, then ensure the script exists in the ./MarkLogic/Modules directory of all the hosts of the cluster.
As for seeing the job running, you can check http://servername:8002/dashboard/ and look at the Query Execution tab to see the running processes, or you can get a snapshot of the process by looking at the Status page of the task server (Configure > Groups > [group name] > Task Server: Status, then click the "show more" button).

Airflow returns "Backfill done" without running tasks

I'm running Airflow and attempting to iterate on some task we're building from the command line.
When running the airflow webserver, everything works as expected. But when I run airflow backfill dag task '2017-08-12', Airflow returns:
[2017-08-15 02:52:55,639] {__init__.py:57} INFO - Using executor LocalExecutor
[2017-08-15 02:52:56,144] {models.py:168} INFO - Filling up the DagBag from /usr/local/airflow/dags
2017-08-15 02:52:59,055 - airflow.jobs.BackfillJob - INFO - Backfill done. Exiting
...and doesn't actually run the dag.
When using airflow test or airflow run (i.e. commands involving running a task rather than a dag), it works as expected.
Am I making a basic mistake? What can I do to debug from here?
Thanks
Have you run those DAGs on that date range already? You will need to clear the DAG first, then backfill. Based on what Maxime mentioned here: https://groups.google.com/forum/#!topic/airbnb_airflow/gMY-sc0QVh0
If a task has an @monthly schedule and you try to run it with a start_date mid-month, it will merely state Backfill done. Exiting.. A task with a schedule of '30 5 * * *' also prevents backfilling from the command line in the same way.
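For example (Airflow 1.x CLI; the dag id is illustrative), clearing and re-running the same date range could look like:
airflow clear my_dag -s 2017-08-12 -e 2017-08-12
airflow backfill my_dag -s 2017-08-12 -e 2017-08-12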
(Updated to reflect better information, and this discussion)
Two possible reasons:
The execution date specified via the -e option is outside of the DAG's [start_date, end_date) range.
Even if the execution date is between those dates, keep in mind that if your DAG has schedule_interval=None then it won't backfill iteratively: it will only run for a single date (specified as --start_date, or --end_date if the first is omitted).
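A minimal sketch of a DAG definition whose date range covers the backfill date (the dag id, dates and interval are illustrative):
from datetime import datetime

from airflow import DAG

dag = DAG(
    dag_id="my_dag",                  # illustrative id
    start_date=datetime(2017, 8, 1),  # must be on or before the date passed to backfill
    schedule_interval="@daily",       # with None, backfill only runs a single date
)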
