Airflow Scheduler fails to execute Windows EXE via WSL

My Windows 10 machine has Airflow 1.10.11 installed within WSL 2 (Ubuntu-20.04).
I have a BashOperator task which calls an .EXE on Windows (via /mnt/c/... or via symlink).
The task fails. Log shows:
[2020-12-16 18:34:11,833] {bash_operator.py:134} INFO - Temporary script location: /tmp/airflowtmp2gz6d79p/download.legacyFilesnihvszli
[2020-12-16 18:34:11,833] {bash_operator.py:146} INFO - Running command: /mnt/c/Windows/py.exe
[2020-12-16 18:34:11,836] {bash_operator.py:153} INFO - Output:
[2020-12-16 18:34:11,840] {bash_operator.py:159} INFO - Command exited with return code 1
[2020-12-16 18:34:11,843] {taskinstance.py:1150} ERROR - Bash command failed
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/airflow/models/taskinstance.py", line 984, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.8/dist-packages/airflow/operators/bash_operator.py", line 165, in execute
raise AirflowException("Bash command failed")
airflow.exceptions.AirflowException: Bash command failed
[2020-12-16 18:34:11,844] {taskinstance.py:1187} INFO - Marking task as FAILED. dag_id=test-dag, task_id=download.files, execution_date=20201216T043701, start_date=20201216T073411, end_date=20201216T073411
And that's it. Return code 1 with no further useful info.
Running the very same EXE directly from a bash shell works perfectly, with no error (I also tried my own program, which prints something to the console: in bash it prints just fine, but via the Airflow scheduler it fails with the same return code 1).
Some more data and things I've done to rule out any other issue:
airflow scheduler runs as root. I also confirmed it's running in a root context by putting a whoami command in my BashOperator, which indeed printed root. (I should also note that all native Linux programs run just fine! Only the Windows programs don't.)
The Windows EXE I'm trying to execute and its directory have full 'Everyone' permissions (on my own program, of course; I wouldn't dare do that to my Windows folder, which was just an example).
The failure happens both when accessing via /mnt/c as well as via symlink. In the case of a symlink, the symlink has 777 permissions.
I tried running airflow test on a BashOperator task - it runs perfectly - emits output to the console and returns 0 (success).
I tried various EXE files, both "native" ones (e.g. those that ship with Windows) and my own C#-built programs. Same behavior in all.
I didn't find any similar issue documented in Airflow's GitHub repo or here on Stack Overflow.
The question is: how is the Python subprocess that airflow scheduler uses to run BashOperator tasks different from a "normal" bash shell, such that it fails with return code 1?
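For context on how the scheduler runs the command: judging by the log above, BashOperator writes the bash_command to a temporary script and runs it through Python's subprocess module. Below is a minimal sketch of a comparable invocation (not Airflow's exact code; the temp-file handling and environment are simplified) that can be used to reproduce the problem outside Airflow, for example by swapping env=None for a stripped-down environment to see whether missing variables trigger the same return code 1.

import os
import subprocess
import tempfile

bash_command = "/mnt/c/Windows/py.exe"  # same command as the failing task

with tempfile.TemporaryDirectory(prefix="airflowtmp") as tmp_dir:
    # Write the command to a throwaway script, as the "Temporary script location" log line suggests
    script = os.path.join(tmp_dir, "cmd.sh")
    with open(script, "w") as f:
        f.write(bash_command)

    proc = subprocess.Popen(
        ["bash", script],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        cwd=tmp_dir,
        env=None,  # try e.g. {"PATH": os.environ["PATH"]} to test a reduced environment
    )
    for line in proc.stdout:
        print(line.decode(errors="replace").rstrip())
    print("return code:", proc.wait())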

You can use Python's subprocess and sys libraries together with PowerShell.
In the Airflow dags folder, create two files: Main.py and Caller.py.
Main.py is the DAG and calls Caller.py; Caller.py then goes to the Windows machine to run the files or routines.
This is the process:
Code for Main.py:
# Importing the libraries we are going to use in this example
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.bash_operator import BashOperator

# Defining some basic arguments
default_args = {
    'owner': 'your_name_here',
    'depends_on_past': False,
    'start_date': datetime(2019, 1, 1),
    'retries': 0,
}

# Naming the DAG and defining when it will run (you can also use a crontab
# expression if, for example, you want the DAG to run every day at 8 am)
with DAG(
    'Main',
    schedule_interval=timedelta(minutes=1),
    catchup=False,
    default_args=default_args
) as dag:

    # Defining the tasks that the DAG will perform, in this case the execution
    # of two Python programs, calling their execution by bash commands
    t1 = BashOperator(
        task_id='caller',
        bash_command="""
        cd /home/[Your_Users_Name]/airflow/dags/
        python3 Caller.py
        """)

    # To add a second task: copy t1, rename it to t2 and point it at file.py

    # Defining the execution order
    t1
    # t1 executes and calls t2:
    # t1 >> t2
Code for Caller.py:
import subprocess, sys

p = subprocess.Popen(
    ["powershell.exe",
     "cd C:\\Users\\[Your_Users_Name]\\Desktop; python file.py"]    # .py file
    #, "cd C:\\Users\\[Your_Users_Name]\\Desktop; .\\file.html"]    # .html file
    #, "cd C:\\Users\\[Your_Users_Name]\\Desktop; .\\file.bat"]     # .bat file
    #, "cd C:\\Users\\[Your_Users_Name]\\Desktop; .\\file.exe"]     # .exe file
    , stdout=sys.stdout
)
p.communicate()
How do you know whether your code will work in Airflow? If it runs this way, it's OK.

Related

Working DAG fails when triggered from another dag in CLI

I have a simple DAG which connects to an Impala DB and runs an SQL script. The DAG runs fine when run independently:
airflow dags test original_dag_name 2022-9-27
However, when I test using TriggerDagRunOperator from another DAG, the DAG fails:
airflow tasks test other_dag_name trigger_task 2022-9-27
Looking at the logs for original_dag_name I see the following:
[Cloudera][ODBC] (11560) Unable to locate SQLGetPrivateProfileString function
...which appears to be driver related, which doesn't make sense as it works fine when I trigger the DAG on its own. Is there some sort of config not getting set correctly when triggering via TriggerDagRunOperator?
Here is the TriggerDagRunOperator task:
task_run_original_dag = TriggerDagRunOperator(
    task_id='run_original_dag',
    trigger_dag_id='original_dag_name',
    execution_date='{{ ds }}',
    reset_dag_run=True,
    wait_for_completion=True,
    poke_interval=60
)

Issues installing airflow locally

I installed Airflow locally because I am testing the SFTP operator in Airflow (2.0.0). When I try running this code:
from airflow.providers.sftp.operators import sftp_operator
from airflow import DAG
import datetime

dag = DAG(
    'test_dag',
    start_date = datetime.datetime(2020, 1, 8, 0, 0, 0),
    schedule_interval = '#daily'
)

get_operation = SFTPOperator(
    task_id="operation",
    ssh_conn_id="ssh_default",
    local_filepath="route_to_local_file",
    remote_filepath="remote_route_to_copy",
    operation="get",
    dag=dag
)

get_operation
When I run this Python code I get this error:
Traceback (most recent call last):
File "test_dags.py", line 1, in <module>
from airflow.providers.sftp.operators import sftp_operator
ModuleNotFoundError: No module named 'airflow.providers.sftp'
Can anyone please tell me if I am missing anything in my installation?
Since you don't specify how you installed Airflow, I'm assuming you did something like pip install apache-airflow>=2.0.0. If you look at the Python dependencies in that environment with pip freeze, you won't see apache-airflow-providers-sftp, because as of version 2, Airflow splits its functionality into provider packages, the vast majority of which need to be installed manually, e.g. pip install apache-airflow-providers-sftp. Now it should work. Supporting documentation: https://airflow.apache.org/docs/apache-airflow-providers/packages-ref.html#apache-airflow-providers-sftp.
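For reference, a minimal sketch of what the corrected DAG might look like once the provider package is installed. The import path below is the one documented for the SFTP provider; the connection ID and file paths are the asker's placeholders, and '@daily' replaces the '#daily' typo in the question.

# First, in the same environment that runs Airflow:
#   pip install apache-airflow-providers-sftp
import datetime

from airflow import DAG
from airflow.providers.sftp.operators.sftp import SFTPOperator

dag = DAG(
    'test_dag',
    start_date=datetime.datetime(2020, 1, 8),
    schedule_interval='@daily',
)

get_operation = SFTPOperator(
    task_id="operation",
    ssh_conn_id="ssh_default",          # SSH/SFTP connection defined in the Airflow UI
    local_filepath="route_to_local_file",
    remote_filepath="remote_route_to_copy",
    operation="get",
    dag=dag,
)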

Using Apache Airflow Tool, Implement a DAG for a batch processing pipeline to get a directory from a remote system

Using the Apache Airflow tool, how can I implement a DAG for the following Python code? The task accomplished in the code is to fetch a directory from a GPU server to the local system. The code works fine in a Jupyter notebook. Please help me implement it in Airflow... I'm very new to this. Thanks.
import pysftp
import os

myHostname = "hostname"
myUsername = "username"
myPassword = "pwd"

with pysftp.Connection(host=myHostname, username=myUsername, password=myPassword) as sftp:
    print("Connection successfully established ... ")
    src = '/path/src/'
    dst = '/home/path/path/destination'
    os.mkdir(dst)
    sftp.get_d(src, dst, preserve_mtime=True)
    print("Fetched source images from GPU server to local directory")
# connection closed automatically at the end of the with-block
For SFTP duties, Airflow provides an SFTPOperator that you can use directly.
Alternatively, its corresponding SFTPHook can be used with a simple PythonOperator.
I acknowledge there aren't many examples, but this might be helpful.
For the SSH connection, see this.
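As a rough illustration of the SFTPHook route (a sketch only, assuming Airflow 2.x with apache-airflow-providers-sftp installed and an SFTP/SSH connection named sftp_default configured in the Airflow UI; in older provider versions the hook's connection argument is ftp_conn_id rather than ssh_conn_id):

import os
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.sftp.hooks.sftp import SFTPHook


def fetch_directory(remote_dir, local_dir):
    """Copy every file in remote_dir to local_dir, mimicking pysftp's get_d."""
    hook = SFTPHook(ssh_conn_id="sftp_default")  # connection defined in the Airflow UI
    os.makedirs(local_dir, exist_ok=True)
    for name in hook.list_directory(remote_dir):
        hook.retrieve_file(
            remote_full_path=os.path.join(remote_dir, name),
            local_full_path=os.path.join(local_dir, name),
        )


with DAG(
    "fetch_gpu_directory",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(
        task_id="fetch_directory",
        python_callable=fetch_directory,
        op_kwargs={"remote_dir": "/path/src/", "local_dir": "/home/path/path/destination"},
    )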

Dag Seems to be missing

I have a DAG which checks for new workflows to be generated (dynamic DAGs) at a regular interval and, if found, creates them. (Ref: Dynamic dags not getting added by scheduler)
The above DAG is working and the dynamic DAGs are getting created and listed in the webserver. Two issues here:
When clicking on the DAG in the web UI, it says "DAG seems to be missing"
The DAGs are not listed by the "airflow list_dags" command
Error:
DAG "app01_user" seems to be missing.
The same happens for all other dynamically generated DAGs. I have compiled the Python script and found no errors.
Edit1:
I tried clearing all data and running "airflow run". It ran successfully, but no dynamically generated DAGs were added by "airflow list_dags". Yet when running the command "airflow list_dags", it loaded and executed the base DAG (which generates the dynamic DAGs), and the dynamic DAGs are also listed, as below:
[root@cmnode dags]# airflow list_dags
sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8\nLANG=en_US.UTF-8)
sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8\nLANG=en_US.UTF-8)
[2019-08-13 00:34:31,692] {settings.py:182} INFO - settings.configure_orm(): Using pool settings. pool_size=15, pool_recycle=1800, pid=25386
[2019-08-13 00:34:31,877] {__init__.py:51} INFO - Using executor LocalExecutor
[2019-08-13 00:34:32,113] {__init__.py:305} INFO - Filling up the DagBag from /root/airflow/dags
/usr/lib/python2.7/site-packages/airflow/operators/bash_operator.py:70: PendingDeprecationWarning: Invalid arguments were passed to BashOperator (task_id: tst_dyn_dag). Support for passing such arguments will be dropped in Airflow 2.0. Invalid arguments were:
*args: ()
**kwargs: {'provide_context': True}
super(BashOperator, self).__init__(*args, **kwargs)
-------------------------------------------------------------------
DAGS
-------------------------------------------------------------------
app01_user
app02_user
app03_user
app04_user
testDynDags
Upon running the command again, all four of the generated DAGs above disappeared and only the base DAG, "testDynDags", is displayed.
When I was getting this error, there was an exception showing up in the webserver logs. Once I resolved that error and restarted the webserver, it went through normally.
From what I can see, this is the error that is thrown when the webserver tries to parse the DAG file and hits an error. In my case it was an error importing a new operator I had added to a plugin.
Usually I check the Airflow UI; sometimes the reason for the broken DAG appears there. If it is not there, I run the .py file of my DAG directly, and the error (the reason the DAG can't be parsed) will appear.
I never got to work on dynamic DAG generation, but I did face this issue when the DAG was not present on all nodes (scheduler, worker, and webserver). If you have an Airflow cluster, please make sure the DAG file is present on all Airflow nodes.
Same error here; the reason was that I renamed my dag_id to uppercase, something like "import_myclientname" to "import_MYCLIENTNAME".
I am a little late to the party, but I faced the error today:
In short: try executing airflow dags report and/or airflow dags reserialize
Check out my comment here:
https://stackoverflow.com/a/73880927/4437153
I found that Airflow fails to recognize a DAG defined in a file that does not have from airflow import DAG in it, even if DAG is not explicitly used in that file.
For example, suppose you have two files, a.py and b.py:
# a.py
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

def makedag(dag_id="a"):
    with DAG(dag_id=dag_id) as dag:
        DummyOperator(task_id="nada")
    return dag

dag = makedag()
and
# b.py
from a import makedag
dag = makedag(dag_id="b")
Then Airflow will only look at a.py. It won't look at b.py at all, not even to notice if there's a syntax error in it! But if you add from airflow import DAG to it and don't change anything else, it will show up.
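In other words, following that answer, a b.py along these lines (the extra import is unused but makes Airflow treat the file as a DAG file) should show up:

# b.py
from airflow import DAG  # unused, but lets the scheduler recognize this file as a DAG file
from a import makedag

dag = makedag(dag_id="b")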

Scheduling R Script

I have written an R script that pulls some data from a database, performs several operations on it and posts the output to a new database.
I would like this script to run every day at a specific time, but I cannot find a way to do this effectively.
Can anyone recommend a resource I could look at to solve this issue? I am running this script on a Windows machine.
Actually under Windows you do not even have to create a batch file first to use the Scheduler.
Open the scheduler: START -> All Programs -> Accessories -> System Tools -> Scheduler
Create a new Task
under tab Action, create a new action
choose Start Program
browse to Rscript.exe which should be placed e.g. here:
"C:\Program Files\R\R-3.0.2\bin\x64\Rscript.exe"
input the name of your file in the parameters field
input the path where the script is to be found in the Start in field
go to the Triggers tab
create new trigger
choose whether the task should be run each day, month, ..., repeated several times, or whatever you like
Supposing your R script is mytest.r, located in D:\mydocuments\, you can create a batch file including the following command:
C:\R\R-2.10.1\bin\Rcmd.exe BATCH D:\mydocuments\mytest.r
Then add it as a new task in the Windows Task Scheduler, setting the triggering conditions there.
You could also omit the batch file. Set C:\R\R-2.10.1\bin\Rcmd.exe in the program/script textbox in task scheduler, and give as Arguments the rest of the initial command: BATCH D:\mydocuments\mytest.r
Scheduling R Tasks via Windows Task Scheduler (Posted on February 11, 2015)
taskscheduleR: R package to schedule R scripts with the Windows task manager (Posted on March 17, 2016)
EDIT
I recently adopted the use of batch files again, because I wanted the cmd window to be minimized (I couldn't find another way).
Specifically, I fill the windows task scheduler Actions tab as follows:
Program/script:
cmd.exe
Add arguments (optional):
/c start /min D:\mydocuments\mytest.bat ^& exit
Contents of mytest.bat:
C:\R\R-3.5.2\bin\x64\Rscript.exe D:\mydocuments\mytest.r params
Now there is a built-in option in RStudio to do this. To run the scheduler, first install the packages below:
install.packages('data.table')
install.packages('knitr')
install.packages('miniUI')
install.packages('shiny')
install.packages("taskscheduleR", repos = "http://www.datatailor.be/rcube", type =
"source")
After installing, go to
**TOOLS -> ADDINS -> BROWSE ADDINS -> taskscheduleR** -> select it and execute it.
Setting up the task scheduler
Step 1) Open the task scheduler (Start > search Task Scheduler)
Step 2) Click "Action" > "Create Task"
Step 3) Select "Run only when the user is logged on", uncheck "Run with highest privileges", name your task,
configure for "Windows Vista/Windows Server 2008"
Step 4) Under the "Triggers" tab, set when you would like the script to run
Step 5) Under the "Actions" tab, put the full location of the Rscript.exe file, i.e.
"C:\Program Files\R\R-3.6.2\bin\Rscript.exe" (include the quotes)
Put the name of your script with -e and source() in the arguments field, wrapping it like this:
-e "source('C:/location_of_my_script/test.R')"
Troubleshooting an Rscript scheduled in the Task Scheduler
When you run a script using the Task Scheduler, it is difficult to troubleshoot any issues because you don't get any error messages.
This can be resolved by using the sink() function in R which will allow you to output all error messages to a file that you specify. Here is how you can do this:
# Set up error log ------------------------------------------------------------
error_log <- file("C:/location_of_my_script/error_log.Rout", open="wt")
sink(error_log, type="message")
try({
# insert your code here
})
The other thing that you will have to change to make your Rscript work is to specify the full path for any file paths in your script.
This will not work in task scheduler:
source("./functions/import_function.R")
You will need to specify the full file path of any scripts you are sourcing within your Rscript:
source("C:/location_of_my_script/functions/import_function.R")
Additionally, I would remove any special characters from any file paths that you are referencing in your R script. For example:
df <- fread("C:/location_of_my_data/file#2342.csv")
may not run. Instead, try:
df <- fread("C:/location_of_my_data/file_2342.csv")
Changing Windows passwords
Beware: changing your Windows password will pause your Task Scheduler script(s). You will need to log back into the Task Scheduler and re-enter your password to get them started again.
I set up my tasks via the SCHTASKS program. For running scripts on startup, you would write something along the lines of
SCHTASKS /Create /SC ONSTART /TN MyProgram /TR "R CMD BATCH --vanilla d:\path\to\script.R"
See this website for more details on SCHTASKS; more details are also available on Microsoft's website.
You can use Windows Task Scheduler.
If, after following any combination of these steps, you receive the "Argument Batch Ignored" error after R.exe runs, try this; it worked for me.
In Windows Task Scheduler:
Replace BATCH "C:\Users\desktop\yourscript.R" in the arguments field
with
CMD BATCH --vanilla --slave "C:\Users\desktop\yourscript.R"
