How to run an existing shell script using Airflow?

I want to run some existing bash scripts with Airflow without modifying the code in the scripts themselves. Is that possible without copying the shell commands from the script into a task?

Not entirely sure if I understood your question, but you can load your shell commands into
Variables through the Admin >> Variables menu as a JSON file.
Then, in your DAG, read the variable and pass it as a parameter to the BashOperator (see the sketch after the links below).
Airflow variables in more detail:
https://www.applydatascience.com/airflow/airflow-variables/
Example of variables file:
https://github.com/tuanavu/airflow-tutorial/blob/v0.7/examples/intro-example/dags/config/example_variables.json
How to read the variable:
https://github.com/tuanavu/airflow-tutorial/blob/v0.7/examples/intro-example/dags/example_variables.py
I hope this post helps you.
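A minimal sketch of that approach, assuming an Airflow 2.x setup and a Variable named scripts_config with a key bash_command (both names are made up for illustration):

from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.bash import BashOperator

# Assumes a Variable "scripts_config" was created under Admin >> Variables,
# e.g. {"bash_command": "/home/batcher/test.sh "}
config = Variable.get("scripts_config", deserialize_json=True)

with DAG(
    dag_id="run_existing_script",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    run_script = BashOperator(
        task_id="run_script",
        # The command string comes straight from the Variable,
        # so the script itself stays untouched.
        bash_command=config["bash_command"],
    )

This keeps the shell script completely unchanged; only the Variable needs to be edited to point at a different script.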

As long as the shell script is on the same machine that the Airflow worker is running on, you can just call the shell script using the BashOperator, like the following:
t2 = BashOperator(
    task_id='bash_example',
    # Just call the script. Keep the trailing space: a bash_command ending in ".sh"
    # is otherwise treated by Airflow as a Jinja template file to load.
    bash_command="/home/batcher/test.sh ",
    dag=dag)

You have to 'link' the local folder that contains your shell script with the worker, which means adding a volume to the worker section of your docker-compose file.
So I added a volume line under the worker settings, and the worker now sees this folder from the local machine (a task can then call the script by its in-container path, as in the sketch after the compose snippet):
airflow-worker:
  <<: *airflow-common
  command: celery worker
  healthcheck:
    test:
      - "CMD-SHELL"
      - 'celery --app airflow.executors.celery_executor.app inspect ping -d "celery@$${HOSTNAME}"'
    interval: 10s
    timeout: 10s
    retries: 5
  restart: always
  volumes:
    - /LOCAL_MACHINE_FOLDER/WHERE_SHELL_SCRIPT_IS:/folder_in_root_folder_of_worker
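With the volume in place, a task refers to the script by the path inside the container, i.e. the right-hand side of the volume mapping. A minimal sketch, reusing the placeholder mount target from the snippet above:

from airflow.operators.bash import BashOperator

run_mounted_script = BashOperator(
    task_id="run_mounted_script",
    # Path as seen inside the worker container (the target side of the volume mapping).
    # The trailing space keeps Airflow from treating the ".sh" string as a Jinja template file.
    bash_command="/folder_in_root_folder_of_worker/test.sh ",
    dag=dag,
)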

Related

How to clear failing DAGs using the CLI in Airflow

I have some failing DAGs, let's say from 1st Feb to 20th Feb. From that date onward, all of them succeeded.
I tried to use the CLI (instead of doing it twenty times with the web UI):
airflow clear -f -t * my_dags.my_dag_id
But I got a weird error:
airflow: error: unrecognized arguments: airflow-webserver.pid airflow.cfg airflow_variables.json my_dags.my_dag_id
EDIT 1:
As @tobi6 explained, the * was indeed causing trouble.
Knowing that, I tried this command instead:
airflow clear -u -d -f -t ".*" my_dags.my_dag_id
but it only returns failed task instances (the -f flag). The -d and -u flags don't seem to work, because task instances downstream and upstream of the failed ones are ignored (not returned).
EDIT 2:
As @tobi6 suggested, using -s and -e makes it possible to select all DAG runs within a date range. Here is the command:
airflow clear -s "2018-04-01 00:00:00" -e "2018-04-01 00:00:00" my_dags.my_dag_id
However, adding the -f flag to the command above only returns failed task instances. Is it possible to select all failed task instances of all failed DAG runs within a date range?
If you use an asterisk * in the Linux bash, the shell automatically expands it.
That is, it replaces the asterisk with all files in the current working directory and then executes your command, which is where the unrecognized arguments come from.
Quoting the asterisk avoids the automatic expansion:
airflow clear -f -t "*" my_dags.my_dag_id
One solution I've found so far is to run SQL directly against the metadata database (MySQL in my case):
update task_instance t
left join dag_run d
  on d.dag_id = t.dag_id and d.execution_date = t.execution_date
set t.state = null,
    d.state = 'running'
where t.dag_id = '<your_dag_id>'
  and t.execution_date > '2020-08-07 23:00:00'
  and d.state = 'failed';
It clears the state of all tasks on failed DAG runs, just as if the 'Clear' button had been pressed for the entire DAG run in the web UI.
In Airflow 2.2.4 the airflow clear command was deprecated.
You can now run:
airflow tasks clear -s <your_start_date> -e <end_date> <dag_id>

How to put seed data into SQL Server docker image?

I have a project using ASP.NET Core and SQL Server, and I am trying to put everything in Docker containers. For my app I need some initial data in the database.
I am able to use the SQL Server Docker image from Microsoft (microsoft/mssql-server-linux), but it is (obviously) empty. Here is my docker-compose.yml:
version: "3"
services:
  web:
    build: .\MyProject
    ports:
      - "80:80"
    depends_on:
      - db
  db:
    image: "microsoft/mssql-server-linux"
    environment:
      SA_PASSWORD: "your_password1!"
      ACCEPT_EULA: "Y"
I have an SQL script file that I need to run against the database to insert the initial data. I found an example for MongoDB, but I cannot find which tool I can use instead of mongoimport.
You can achieve this by building a custom image. I'm currently using the following solution; somewhere in your Dockerfile you should have:
RUN mkdir -p /opt/scripts
COPY database.sql /opt/scripts
ENV MSSQL_SA_PASSWORD=Passw#rd
ENV ACCEPT_EULA=Y
RUN /opt/mssql/bin/sqlservr --accept-eula & sleep 30 & /opt/mssql-tools/bin/sqlcmd -S localhost -U SA -P 'Passw#rd' -d master -i /opt/scripts/database.sql
Alternatively, you can wait for certain text to be output, which is useful when iterating on the Dockerfile setup because it proceeds immediately. It's less robust, of course, since it relies on some 'random' log text:
RUN ( /opt/mssql/bin/sqlservr --accept-eula & ) | grep -q "Service Broker manager has started" \
&& /opt/mssql-tools/bin/sqlcmd -S localhost -U SA -P 'Passw#rd' -i /opt/scripts/database.sql
Don't forget to put a database.sql (with your script) next to the Dockerfile, as it is copied into the image.
Roet's answer (https://stackoverflow.com/a/52280924/10446284) didn't work for me.
The trouble was with the bash ampersands firing sqlcmd too early, instead of waiting for sleep 30 to finish.
Our Dockerfile now looks like this:
FROM microsoft/mssql-server-linux:2017-GA
RUN mkdir -p /opt/scripts
COPY db-seed/seed.sql /opt/scripts/
ENV MSSQL_SA_PASSWORD=Passw#rd
ENV ACCEPT_EULA=true
RUN /opt/mssql/bin/sqlservr & sleep 60; /opt/mssql-tools/bin/sqlcmd -S localhost -U SA -P 'Passw#rd' -d master -i /opt/scripts/seed.sql
Footnotes:
Now the bash command works like this:
run-async(sqlservr) & run-async(wait-and-run-sqlcmd)
We chose sleep 60 because the Docker image build happens "offline", before the runtime environment is set up, so those 60 seconds do not occur at container runtime. Giving the sqlservr command more time gives our teammates' machines more time to complete the docker build phase successfully.
One simple option is to just navigate to the container file system, copy the database files in, and then use a script to attach them.
This page:
https://learn.microsoft.com/en-us/sql/linux/quickstart-install-connect-docker
has an example of using sqlcmd in a Docker container, although I'm not sure how you would add that to whatever build process you have.

How to run a shell script on a remote host using salt-ssh

My web server generates a shell script with more than 100 lines of code, based on complex user selections. I need to orchestrate this over several machines using salt-ssh. What I need is to copy this shell script to the remote host and execute it from there, for all devices. How do I achieve this with salt-ssh? I cannot install minions on the remote devices.
Just as with a normal minion, write a state...
add script:
  file.managed:
    - name: file.sh
    - source: /path/to/file.sh

run script:
  cmd.run:
    - name: file.sh
...and apply it:
salt-ssh 'minions*' state.apply mystate

How do you keep your airflow scheduler running in AWS EC2 while exiting ssh?

Hi, I'm using Airflow and put my Airflow project on EC2. However, how would one keep the Airflow scheduler running while my Mac goes to sleep or I exit ssh?
You have a few options, but none will keep it active on a sleeping laptop. On a server:
You can use --daemon to run it as a daemon: airflow scheduler --daemon
Or, run it in the background: airflow scheduler >& log.txt &
Or, run it inside screen, then detach from the screen using ctrl-a d and reattach as needed using screen -r. That works over an ssh connection.
I use nohup to keep the scheduler running and redirect the output to a log file, like so:
nohup airflow scheduler >> ${AIRFLOW_HOME}/logs/scheduler.log 2>&1 &
Note: this assumes you are running the scheduler on your EC2 instance and not on your laptop.
In case you need more details on running it as a daemon, i.e. detaching completely from the terminal and redirecting stdout and stderr, here is an example:
airflow webserver -p 8080 -D --pid /your-path/airflow-webserver.pid --stdout /your-path/airflow-webserver.out --stderr /your-path/airflow-webserver.err
airflow scheduler -D --pid /your-path/airflow-scheduler.pid --stdout /your-path/airflow-scheduler.out --stderr /your-path/airflow-scheduler.err
The most robust solution is to register it as a service on your EC2 instance. Airflow provides systemd and upstart scripts for that (https://github.com/apache/incubator-airflow/tree/master/scripts/systemd and https://github.com/apache/incubator-airflow/tree/master/scripts/upstart).
For Amazon Linux you'd need the upstart scripts, while for e.g. Ubuntu you would use the systemd scripts.
Registering it as a system service is much more robust because Airflow will be started again after a reboot or a crash. This is not the case when you use e.g. nohup, as other people suggest here.

How does deployment work with Airflow?

I'm using the Celery Executor and the setup from this dockerfile.
I'm deploying my DAG into the /usr/local/airflow/dags directory in the scheduler's container.
I'm able to run my dag with the command:
$ docker exec airflow_webserver_1 airflow backfill mydag -s 2016-01-01 -e 2016-02-01
My DAG contains a simple bash operator:
BashOperator(bash_command = "test.sh" ... )
The operator runs the test.sh script. However, if test.sh refers to other files, like callme.sh, then I receive a "cannot find file" error. For example:
$ pwd
/usr/local/airflow/dags/myworkflow.py
$ ls
myworkflow.py
test.sh
callme.sh
$ cat test.sh
echo "test file"
./callme.sh
$ cat callme.sh
echo "got called"
When running myworkflow, the task that calls test.sh is invoked but fails because it cannot find callme.sh.
I find this confusing. Is it my responsibility to share the code resource files with the worker, or Airflow's responsibility? If it's mine, what is the recommended approach for doing so? I'm looking at using EFS mounted on all containers, but that looks very expensive to me.
With the Celery executor, it is your responsibility to make sure that each worker has all the files it needs to run a job. Note also that the BashOperator executes its command from a temporary working directory, so a relative path like ./callme.sh will not resolve even when the file is present on the worker; using absolute paths avoids that (see the sketch below).
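A hedged sketch of that approach, assuming the DAGs folder (with myworkflow.py, test.sh and callme.sh) is synced to every worker, that test.sh also calls callme.sh by full path, and that a dag object is defined in myworkflow.py (Airflow 2 import path shown):

import os

from airflow.operators.bash import BashOperator

# Directory that holds myworkflow.py, test.sh and callme.sh on the worker.
DAG_DIR = os.path.dirname(os.path.abspath(__file__))

run_test = BashOperator(
    task_id="run_test",
    # Absolute path, so the script is found regardless of the temporary working directory;
    # the trailing space keeps Airflow from treating the ".sh" string as a Jinja template file.
    bash_command=f"bash {DAG_DIR}/test.sh ",
    dag=dag,
)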
