Running puckel/docker-airflow, with the build modified so that both the environment variables and airflow.cfg have:
ENV AIRFLOW__CORE__DEFAULT_TIMEZONE=system
and
default_timezone = system
respectively.
But in the UI, it still shows UTC, even though system time is EAT. Here is some evidence from the container:
airflow@906d2275235d:~$ echo $AIRFLOW__CORE__DEFAULT_TIMEZONE
system
airflow@906d2275235d:~$ cat airflow.cfg | grep default_timez
default_timezone = system
airflow#906d2275235d:~$ date
Thu 01 Aug 2019 04:54:23 PM EAT
Would appreciate any help, or advice on how you handle this in practice.
According to Airflow docs:
Please note that the Web UI currently only runs in UTC.
Although the UI uses UTC, Airflow uses local time to launch DAGs. So if you have, for example, schedule_interval set to 0 3 * * *, Airflow will start the DAG at 3:00 EAT, but in the UI you will see it as 0:00.
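For illustration, here is a minimal DAG sketch of that behaviour (the dag_id, start_date, and task are placeholders, not from the question):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# With default_timezone = system (EAT on this host), the scheduler launches this
# DAG at 03:00 local time, but the web UI renders the run as 00:00 UTC.
dag = DAG(
    dag_id='example_local_time',        # placeholder name
    start_date=datetime(2019, 8, 1),    # placeholder start date
    schedule_interval='0 3 * * *',      # 3 AM in the system timezone
)

BashOperator(task_id='print_date', bash_command='date', dag=dag)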
I'm running a DAG that runs once per day. It starts with 9 concurrently running tasks that all do the same thing: each basically polls S3 to see if that task's single designated file exists. Each task uses the same code and is added to the DAG structure in the same way. One of these tasks, on random days, fails to "begin": it won't enter the running state and just sits as queued. When it does this, here's what its log says:
*** Log file isn't local.
*** Fetching here: http://:8793/log/my.dag.name./my_airflow_task/2020-03-14T07:00:00
*** Failed to fetch log file from worker.
*** Reading remote logs...
Could not read logs from s3://mybucket/airflow/logs/my.dag.name./my_airflow_task/2020-03-14T07:00:00
Why does this only happen on random days? All similar questions I've seen point to this error happening consistently, and once overcome, it no longer recurs. To "trick" this task into running, I manually touch whatever the log file is supposed to be named, and then it changes to running.
So the issue appears to have been the system's ownership rules for the folder that this particular task's logs were written to. I used a CI tool to ship the new task_3 when I deployed my updated Airflow Python code to the production environment, so the task was created that way. When I peeked at the log directory ownership, I noticed this for the tasks:
# inside the airflow log dir:
drwxrwxr-x 2 root root 4096 Mar 25 14:53 task_3 # is the offending task
drwxrwxr-x 2 airflow airflow 20480 Mar 25 00:00 task_2
drwxrwxr-x 2 airflow airflow 20480 Mar 25 15:54 task_1
So I think what was going on was that, randomly, Airflow couldn't get permission to write the log file, and thus it wouldn't start the rest of the task. I applied the appropriate chown command, something like sudo chown -R airflow:airflow task_3. Ever since I changed this, the issue has disappeared.
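For reference, the check and fix looked roughly like this (run inside the DAG's log directory; task_3 is the directory from the listing above):

ls -l                                   # task_3 shows up as root:root
sudo chown -R airflow:airflow task_3    # hand ownership back to the airflow user
ls -l                                   # task_3 should now match task_1 and task_2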
Recently, the Brazilian government abolished daylight saving time, which used to shift the timezone offset from -3 to -2.
My Dokku container still contains the old information, causing my Ruby on Rails application that reads directly from the OS zoneinfo to display times in DST when it shouldn't.
I can check that my host machine has up-to-date timezone information because when I run TZ=":America/Sao_Paulo" date it outputs Fri Nov 8 12:10:xx -03 2019. Running the same command inside my Dokku container outputs Fri Nov 8 13:10:xx -02 2019.
How can I update my Dokku time zone information and make it persistent between deployments?
To solve it, I took the following steps:
run docker system prune -a
run dokku ps:rebuild [app-name]
The first command cleared the Docker image cache for gliderlabs/herokuish:latest, which contained the Heroku stack with the out-of-date timezone information. The second command rebuilt the app from source, downloading the newer herokuish image.
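For reference, the full sequence plus a check that the rebuilt container picked up the new tzdata (my-app is a placeholder for your app name):

docker system prune -a                                   # drops the cached gliderlabs/herokuish:latest image
dokku ps:rebuild my-app                                  # rebuilds the app, pulling the newer herokuish image
dokku run my-app sh -c 'TZ=":America/Sao_Paulo" date'    # should now print a -03 offset instead of -02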
I have Airflow set up with default_timezone = US/Eastern. If I set a DAG with schedule_interval="0 17 * * *", it runs at 12pm instead of the expected 5pm. I understand that Airflow stores all dates as UTC. How can I get something to run in my timezone instead of writing the interval in UTC?
I have also tried setting the tzinfo in the dag's start_date to pendulum.timezone('US/Eastern') with no luck (a sketch of roughly what I tried is below the version info).
airflow=1.10.0
rhel7
python=3.6
server's tz = EST
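For reference, this is roughly the pendulum approach I tried (the dag_id and start_date are placeholders):

from datetime import datetime
import pendulum
from airflow import DAG

local_tz = pendulum.timezone('US/Eastern')

dag = DAG(
    dag_id='eastern_dag',                              # placeholder name
    start_date=datetime(2018, 1, 1, tzinfo=local_tz),  # timezone-aware start_date
    schedule_interval='0 17 * * *',                    # intended to mean 5pm US/Eastern
)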
In my first foray into airflow, I am trying to run one of the example DAGs that come with the installation. This is v1.8.0. Here are my steps:
$ airflow trigger_dag example_bash_operator
[2017-04-19 15:32:38,391] {__init__.py:57} INFO - Using executor SequentialExecutor
[2017-04-19 15:32:38,676] {models.py:167} INFO - Filling up the DagBag from /Users/gbenison/software/kludge/airflow/dags
[2017-04-19 15:32:38,947] {cli.py:185} INFO - Created <DagRun example_bash_operator @ 2017-04-19 15:32:38: manual__2017-04-19T15:32:38, externally triggered: True>
$ airflow dag_state example_bash_operator '2017-04-19 15:32:38'
[2017-04-19 15:33:12,918] {__init__.py:57} INFO - Using executor SequentialExecutor
[2017-04-19 15:33:13,229] {models.py:167} INFO - Filling up the DagBag from /Users/gbenison/software/kludge/airflow/dags
running
The dag state remains "running" for a long time (at least 20 minutes by now), although from a quick inspection of this task it should take a matter of seconds. How can I troubleshoot this? How can I see which step it is stuck on?
To run any DAGs, you need to make sure two processes are running:
airflow webserver
airflow scheduler
If you only have airflow webserver running, the UI will show DAGs as running, but if you click on the DAG, none of its tasks are actually running or scheduled; they sit in a null state.
What this means is that they are waiting to be picked up by airflow scheduler. If airflow scheduler is not running, you'll be stuck in this state forever, as the tasks are never picked up for execution.
Additionally, make sure that the toggle button in the DAGs view is switched to 'ON' for the particular DAG. Otherwise it will not get picked up by the scheduler even if you trigger it manually.
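For example, both processes can be started (here with the daemon flag) and then checked; the grep pattern is just an assumption about how they appear in ps:

airflow webserver -D    # -D runs it in the background
airflow scheduler -D
ps aux | grep -E 'airflow (webserver|scheduler)'    # both should show up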
I too recently started using Airflow, and my dags kept endlessly running. Your dag may be set to 'pause' without you realizing it; the scheduler will then not schedule new task instances, and when you trigger the dag it just looks like it is endlessly running.
There are a few solutions:
1) In the Airflow UI, toggle the button left of the dag from 'Off' to 'On'. Off means that the dag is paused, so On will allow the scheduler to pick it up and complete the dag. (This fixed my initial issue.)
2) In your airflow.cfg file, dags_are_paused_at_creation = True is the default, so all new dags you create are paused from the start. Change this to False and future dags you create will be good to go right away. (I had to restart the webserver and scheduler for changes to airflow.cfg to be recognized.)
3) use the command line $ airflow unpause [dag_id]
documentation: https://airflow.apache.org/cli.html#unpause
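For example (my_dag_id is a placeholder):

# in airflow.cfg: stop pausing new dags at creation
dags_are_paused_at_creation = False

# or unpause a single dag from the command line
airflow unpause my_dag_id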
The below worked for me.
Make sure AIRFLOW_HOME is set
In AIRFLOW_HOME, have the folders dags and plugins. The folders must have read, write, and execute permissions for the airflow user.
Make sure you have at least one dag in the dags/ folder.
pip install celery[redis]==4.1.1
I verified the above solution on Airflow 1.9.0.
I tried the same steps with Airflow 1.10 and it worked there as well.
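A rough sketch of that checklist as shell commands, assuming the services run as an airflow user, AIRFLOW_HOME lives at ~/airflow, and my_dag.py is a placeholder for your dag file:

export AIRFLOW_HOME=~/airflow                          # make sure AIRFLOW_HOME is set
mkdir -p $AIRFLOW_HOME/dags $AIRFLOW_HOME/plugins      # dags and plugins folders
chown -R airflow:airflow $AIRFLOW_HOME                 # owned by (and thus rwx for) the airflow user
cp my_dag.py $AIRFLOW_HOME/dags/                       # at least one dag in dags/
pip install 'celery[redis]==4.1.1'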
For some reason, Airflow doesn't seem to trigger the latest run for a dag with a weekly schedule interval.
Current Date:
$ date
Tue Aug 9 17:09:55 UTC 2016
DAG:
from datetime import datetime
from datetime import timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id='superdag',
    start_date=datetime(2016, 7, 18),
    schedule_interval=timedelta(days=7),
    default_args={
        'owner': 'Jon Doe',
        'depends_on_past': False
    }
)

BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag
)
Run scheduler
$ airflow scheduler -d superdag
You'd expect a total of four DAG Runs as the scheduler should backfill for 7/18, 7/25, 8/1, and 8/8.
However, the last run is not scheduled.
EDIT 1:
I understand what Vineet says, although that doesn't seem to explain my issue.
In my example above, the DAG’s start date is July 18.
First DAG Run: July 18
Second DAG Run: July 25
Third DAG Run: Aug 1
Fourth DAG Run: Aug 8 (not run)
Where each DAG Run processes data from the previous week.
Today being Aug 9, I would expect the fourth DAG Run to have executed with an execution date of Aug 8, processing data for the last week (Aug 1 until Aug 8), but it doesn't.
Airflow always schedules for the previous period. So if you have a dag that is scheduled to run daily, then on Aug 9th it will schedule a run with execution_date Aug 8th. Similarly, if the schedule interval is weekly, then on Aug 9th it will schedule a run for one week back, i.e. Aug 2nd, though that run happens on Aug 9th itself. This is just airflow bookkeeping. You can find this in the airflow wiki (https://cwiki.apache.org/confluence/display/AIRFLOW/Common+Pitfalls):
Understanding the execution date
Airflow was developed as a solution for ETL needs. In the ETL world, you typically summarize data. So, if I want to summarize data for 2016-02-19, I would do it at 2016-02-20 midnight GMT, which would be right after all data for 2016-02-19 becomes available.
This date is available to you in both Jinja and a Python callable's context in many forms, as documented here. As a note, ds refers to the date string, not date start, which may be confusing to some.
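For example, a task can read the execution date through the ds macro in any templated field (this sketch assumes the dag object from the question; the task_id is illustrative):

from airflow.operators.bash_operator import BashOperator

# {{ ds }} is the execution date as YYYY-MM-DD, i.e. the start of the interval
# being processed, not the wall-clock date the task actually runs on.
BashOperator(
    task_id='summarize_week',
    bash_command='echo "summarizing data for the week starting {{ ds }}"',
    dag=dag,
)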
A similar issue happened to me as well.
I solved it by manually running
airflow backfill -s start_date -e end_date DAG_NAME
where start_date and end_date cover the missing execution_date, in your case 2016-08-08.
For example,
airflow backfill -s 2016-08-07 -e 2016-08-09 DAG_NAME
I have also encountered a similar problem these days while learning apache airflow.
I think, as Vineet explained, given the way Airflow works, you should probably use the execution date as the beginning of the DAG execution window, and not as its end, as you said below:
I understand what Vineet says, although that doesn't seem to explain my issue.
In my example above, the DAG’s start date is July 18.
First DAG Run: July 18
Second DAG Run: July 25
Third DAG Run: Aug 1
Fourth DAG Run: Aug 8 (not run)
Where each DAG Run processes data from the previous week.
To make it work, you should probably use, for instance, July 18 as the start of the DAG execution window for the week July 18 to July 22, instead of as the end of the execution window for the week July 11 to July 15.
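A minimal sketch of that idea, treating execution_date as the start of the processing window (the callable and task_id are illustrative, and dag is assumed to be the DAG object from the question):

from datetime import timedelta
from airflow.operators.python_operator import PythonOperator

def process_week(execution_date, **kwargs):
    # execution_date marks the START of the week being processed
    window_start = execution_date
    window_end = execution_date + timedelta(days=7)
    print("processing data from {} to {}".format(window_start, window_end))

PythonOperator(
    task_id='process_week',
    python_callable=process_week,
    provide_context=True,    # passes execution_date (and the rest of the context) into the callable
    dag=dag,
)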