New airflow directory created when I run airflow webserver

I'm having a few problems with airflow. I installed airflow and set the airflow home directory to
my_home/Workspace/airflow_home
But when I start the webserver, a new airflow directory is created:
my_home/airflow
I thought maybe something in the airflow.cfg file needs to be changed, but I'm not really sure. Has anyone had this problem before?

Try doing echo $AIRFLOW_HOME and see if it is the correct path you set.

You need to set AIRFLOW_HOME to the directory where you keep the airflow config file.
If the full path of the airflow.cfg file is /home/test/bigdata/airflow/airflow.cfg,
just run
export AIRFLOW_HOME=/home/test/bigdata/airflow
If AIRFLOW_HOME is not set, ~/airflow is used as the default.
You could also write a shell script to start the airflow webserver.
It might contain the lines below:
source ~/.virtualenvs/airflow/bin/activate # only needed if airflow is installed in a virtualenv
export AIRFLOW_HOME=/home/test/bigdata/airflow # path should be changed according to your environment
airflow webserver -D # start airflow webserver as daemon
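If you want to confirm which home directory a running installation actually resolved, a quick check could look like the sketch below (airflow info only exists on Airflow 2.x; the grep is just to shorten its output):
echo $AIRFLOW_HOME                                # should print the path you exported
airflow info 2>/dev/null | grep -i airflow_home   # Airflow 2.x: prints the resolved home directory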

Related

airflow logs: how to view them live in the server directly

I am using airflow and running a dag.
I have the following in airflow.cfg:
[core]
# The folder where your airflow pipelines live, most likely a
# subfolder in a code repository. This path must be absolute.
dags_folder = /usr/local/airflow/dags
# The folder where airflow should store its log files
# This path must be absolute
base_log_folder = /usr/local/airflow/logs
I have a long running task in airflow.
It's very difficult to use the web interface to check logs of that size.
Instead I want to check them on the airflow server directly.
But I don't see a log file created until the task fails or completes successfully.
While the task is running, I can't see any 1.log file created on the server at the path mentioned in the cfg.
So if I have access to the server, how can I check the logs of the airflow task live?
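For reference, once a task's log file does appear, it can be followed on the server directly; a minimal sketch, assuming Airflow 1.x's default per-try log layout under base_log_folder (the dag id, task id and execution date below are placeholders):
tail -f /usr/local/airflow/logs/<dag_id>/<task_id>/<execution_date>/1.log   # 1.log is the first try's log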

(Dagster) Schedule my_hourly_schedule was started from a location that can no longer be found

I'm getting the following Warning message when trying to start the dagster-daemon:
Schedule my_hourly_schedule was started from a location Scheduler that can no longer be found in the workspace, or has metadata that has changed since the schedule was started. You can turn off this schedule in the Dagit UI from the Status tab.
I'm trying to automate some pipelines with dagster and created a new project using dagster new-project Scheduler where "Scheduler" is my project.
This command, as expected, created a directory with some hello_world files. Inside of it I put the dagster.yaml file with the configuration for a PostgresDB to which I want to write the logs.
However, whenever I run dagster-daemon run from the directory where the workspace.yaml file is located, I get the message above. I tried running the daemon from other folders, but it then complains that it can't find any workspace.yaml files.
I guess I'm running into a "beginner mistake", but could anyone help me with this?
I appreciate any counsel.
One thing to note is that the dagster.yaml file will not do anything unless you've set your DAGSTER_HOME environment variable to point at the directory that this file lives in.
That being said, I think what's going on here is that you don't have the Scheduler package installed into the python environment that you're running your dagster-daemon in.
To fix this, you can run pip install -e . in the Scheduler directory, although the README.md inside that directory has more specific instructions for working with virtualenvs.
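Put together, a minimal sketch of the fix, assuming the project directory is ~/Scheduler and that ~/.venvs/scheduler is the (example) virtualenv the daemon runs in:
export DAGSTER_HOME=~/Scheduler            # must point at the directory containing dagster.yaml
source ~/.venvs/scheduler/bin/activate     # the same environment dagster-daemon will use
cd ~/Scheduler
pip install -e .                           # install the Scheduler package into that environment
dagster-daemon run                         # run from the directory holding workspace.yaml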

You have two airflow.cfg files

I have created a venv project and installed airflow within this venv. I have also exported AIRFLOW_HOME to a directory (airflow_home) within this venv project. The first time, after I ran
$airflow version
this created airflow.cfg and a logs directory under this 'airflow_home' folder. However, when I repeated the same the next day, I now get the error message that I have two airflow.cfg files:
one airflow.cfg under my venv project
another one under /home/username/airflow/airflow.cfg
Why is that? I haven't installed airflow anywhere outside this venv project.
Found the issue. If I don't set the environment variable AIRFLOW_HOME, by default airflow creates a new airflow.cfg under /home/username/airflow. To avoid this, AIRFLOW_HOME should be set before calling airflow each time a terminal starts, or added to the bash profile.
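A minimal sketch of making that permanent, assuming bash and an example project path (adjust the path to your venv project):
echo 'export AIRFLOW_HOME=/path/to/venv_project/airflow_home' >> ~/.bashrc
source ~/.bashrc
airflow version   # should now read/create airflow.cfg under airflow_home, not ~/airflow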

Airflow user issues

We have installed airflow from a service account, say 'ABC', using sudo root, in a virtual environment, but we are facing a few issues.
We call a python script using the bash operator. The python script uses some environment variables from the unix account 'ABC'. While running from airflow, the environment variables are not picked up. To find out which user airflow runs as, we created a dummy dag with the bashoperator command 'whoami'; it returns the ABC user. So airflow is using the same 'ABC' user. Then why are the environment variables not picked up?
We then tried sudo -u ABC python script. The environment variables are not picked up, due to the sudo usage. We did a workaround without the environment variables and it ran well in the development environment without issues. But while moving to a different environment, we got the error below, and we don't have permission to edit the sudoers file (the admin policy doesn't allow it).
sudo: sorry, you must have a tty to run sudo
We then used the 'impersonation=ABC' option in the .cfg file and ran airflow without sudo. This time, the bash command fails on the environment variables, and it requires all the packages used in the script to be in the virtual environment.
My questions:
Airflow is installed through ABC after sudoing to root. Why is ABC not treated as the user while running the script?
Why are ABC's environment variables not picked up?
Why does even the impersonation option not pick up the environment variables?
Can airflow be installed without a virtual environment?
Which is the best approach to install airflow? Using a separate user and sudoing to root? We are using a dedicated user for running the python script. Experts, kindly clarify.
It's always a good idea to use a virtualenv for installing any python packages, so you should prefer installing airflow in a virtualenv.
You can use systemd or supervisord and create programs for airflow webserver and scheduler. Example configuration for supervisor:
[program:airflow-webserver]
command=sh /home/airflow/scripts/start-airflow-webserver.sh
directory=/home/airflow
autostart=true
autorestart=true
startretries=3
stderr_logfile=/home/airflow/supervisor/logs/airflow-webserver.err.log
stdout_logfile=/home/airflow/supervisor/logs/airflow-webserver.log
user=airflow
environment=AIRFLOW_HOME='/home/airflow/'
[program:airflow-scheduler]
command=sh /home/airflow/scripts/start-airflow-scheduler.sh
directory=/home/airflow
autostart=true
autorestart=true
startretries=3
stderr_logfile=/home/airflow/supervisor/logs/airflow-scheduler.err.log
stdout_logfile=/home/airflow/supervisor/logs/airflow-scheduler.log
user=airflow
environment=AIRFLOW_HOME='/home/airflow/'
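The supervisor entries above call small wrapper scripts; a sketch of what /home/airflow/scripts/start-airflow-webserver.sh might contain, assuming a virtualenv at /home/airflow/venv (the path is an example):
#!/bin/sh
# run in the foreground (no -D) so supervisord can manage the process
export AIRFLOW_HOME=/home/airflow/
. /home/airflow/venv/bin/activate
exec airflow webserver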
We got the same issue:
sudo: sorry, you must have a tty to run sudo
The solution we found is:
su ABC python script

Airflow not loading dags in /usr/local/airflow/dags

Airflow seems to be skipping the dags I added to /usr/local/airflow/dags.
When I run
airflow list_dags
The output shows
[2017-08-06 17:03:47,220] {models.py:168} INFO - Filling up the DagBag from /usr/local/airflow/dags
-------------------------------------------------------------------
DAGS
-------------------------------------------------------------------
example_bash_operator
example_branch_dop_operator_v3
example_branch_operator
example_http_operator
example_passing_params_via_test_command
example_python_operator
example_short_circuit_operator
example_skip_dag
example_subdag_operator
example_subdag_operator.section-1
example_subdag_operator.section-2
example_trigger_controller_dag
example_trigger_target_dag
example_xcom
latest_only
latest_only_with_trigger
test_utils
tutorial
But this doesn't include the dags in /usr/local/airflow/dags
ls -la /usr/local/airflow/dags/
total 20
drwxr-xr-x 3 airflow airflow 4096 Aug 6 17:08 .
drwxr-xr-x 4 airflow airflow 4096 Aug 6 16:57 ..
-rw-r--r-- 1 airflow airflow 1645 Aug 6 17:03 custom_example_bash_operator.py
drwxr-xr-x 2 airflow airflow 4096 Aug 6 17:08 __pycache__
Is there some other condition that needs to be satisfied for airflow to identify a DAG and load it?
My dag was being loaded, but I had the name of the DAG wrong. I was expecting the dag to be named after the file, but the name is determined by the first argument to the DAG constructor:
dag = DAG(
'tutorial', default_args=default_args, schedule_interval=timedelta(1))
Try airflow db init before listing the dags. This is because airflow list_dags lists all the dags present in the database (and not in the folder you mentioned). airflow initdb will create entries for these dags in the database.
Make sure you have the environment variable AIRFLOW_HOME set to /usr/local/airflow. If this variable is not set, airflow looks for dags in the default home airflow folder, which might not exist in your case.
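A minimal sketch of those checks, using the old-style CLI from the question (the Airflow 2.x equivalents are noted in the comments):
export AIRFLOW_HOME=/usr/local/airflow
airflow initdb       # "airflow db init" on Airflow 2.x
airflow list_dags    # "airflow dags list" on Airflow 2.x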
The example files are not in /usr/local/airflow/dags. You can simply mute them by editing airflow.cfg (usually in ~/airflow) and setting load_examples = False in the 'core' section.
There are a couple of errors that may keep your DAG from being listed in list_dags.
Your DAG file has a syntax issue. To check this, just run python custom_example_bash_operator.py and see if it raises any error.
Check whether the folder is the default dag loading path. If you're new to this, I suggest just creating a new .py file, copying the sample from https://airflow.incubator.apache.org/tutorial.html, and seeing whether the test dag shows up.
Make sure there is dag = DAG('dag_name', default_args=default_args) in the dag file.
dag = DAG(
dag_id='example_bash_operator',
default_args=args,
schedule_interval='0 0 * * *',
dagrun_timeout=timedelta(minutes=60))
When a DAG is instantiated, it shows up under the name you specify in the dag_id attribute. dag_id serves as a unique identifier for your DAG.
It will be the case if the airflow.cfg config points to an incorrect path.
STEP 1: Go to {basepath}/src/config/
STEP 2: Open airflow.cfg file
STEP 3: Check the path; it should point to the dags folder you have created:
dags_folder = /usr/local/airflow/dags
I find that I have to restart the scheduler for the UI to pick up new dags when I make changes to a dag in my dags folder. When I update the dags, they appear in the list when I run airflow list_dags, just not in the UI until I restart the scheduler.
First try running:
airflow scheduler
There can be two issues:
1. Check the DAG name given at the time of DAG object creation in the DAG python program:
dag = DAG(
dag_id='Name_Of_Your_DAG',
....)
Note that many times the name given may be the same as a name already present in the list of DAGs (for instance, if you copied the DAG code). If this is not the case, then:
2. Check the path set to the DAG folder in Airflow's config file.
You can create the DAG file anywhere on your system, but you need to set the path to that DAG folder/directory in Airflow's config file.
For example, I have created my DAG folder in the home directory, so I have to edit the airflow.cfg file using the following commands in the terminal:
Create a DAG folder in the home or root directory:
$mkdir ~/DAG
Edit airflow.cfg, present in the airflow directory where I have installed airflow:
~/$cd airflow
~/airflow$nano airflow.cfg
In this file, change the dags_folder path to the DAG folder we have created.
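As a quick check after saving, something like the following should show the new path (assuming airflow.cfg lives in ~/airflow and the DAG folder from the step above):
grep '^dags_folder' ~/airflow/airflow.cfg   # expect: dags_folder = /home/<your_user>/DAG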
If you are still facing the problem, then reinstall Airflow and refer to this link for the installation of Apache Airflow.
Does your custom_example_bash_operator.py have a DAG name different from the others?
If yes, try restarting the scheduler, or even resetdb. I used to mistake the filename for the dag name as well, so it's better to name them the same.
Can you share what is in custom_example_bash_operator.py? Airflow scans for certain magic inside a file to determine whether it is a DAG or not. It scans for airflow and for DAG.
In addition, if you are using a duplicate dag_id for a DAG, it will be overwritten. As you seem to be deriving from the example bash operator, did you maybe keep the DAG name example_bash_operator? Try renaming that.
You need to set up airflow first and initialise the db:
export AIRFLOW_HOME=/myfolder
mkdir /myfolder/dags
airflow db init
You need to create a user too
airflow users create \
--username admin \
--firstname FIRST_NAME \
--lastname LAST_NAME \
--role Admin \
--email admin@example.org
If you have done it correctly, you should see airflow.cfg in your folder. There you will find dags_folder, which shows the dags folder.
If you have saved your dag inside this folder, you should see it in the dag list:
airflow dags list
or in the UI, after starting the webserver with
airflow webserver --port 8080
Otherwise, run airflow db init again.
In my case, a print(something) in the dag file prevented the dag list from printing on the command line.
Check whether there is a print line in your dag if the above solutions are not working.
Try restarting the scheduler. The scheduler needs to be restarted when new DAGs need to be added to the DagBag.
