In Airflow, what could be different between calling "python pppp.py" and in a #task() import pppp? - airflow

So I have an Airflow task that is currently a BashOperator() where its bash_command is just "python pppp.py". There is a "__main __" that the only thing it does is calls mycommand(). I tried to switch it to a #task() where I "import pppp" and then call "pppp.mycommand()". My command connects to an internal messaging system. The BashOperator() works but the task() fails to connect. I even kubectl into the pod, and I can start up a python shell, do "import pppp" and call "pppp.mycommand()" and it works. I have confirmed they both ways are using the same image, have all of the same environment variables set to the same value, and such. Obviously there is a difference between the 2 ways. Can anyone think of what is different? THERE IS NO CODE TO SHARE since the issue is with an internal message system that you wouldnt be able to access. I guess I am interested in happens before airflow executes the my_task()
#task(**args)
def my_task():
import pppp
pppp.mycommand()

Related

Airflow 2 - debugging why dag is not loading

On Airflow 2 my dag is not showing on the UI, and I'm getting DAG Import Errors (...) for it.
The error message is insufficient for me to debug (it's a custom operator, with a lot of custom logic - so I don't want to get into details of the error itself).
On Airflow 1.X I could use cli:
airflow list_dags
to get more elaborated debug message, is there anything analogical on airflow 2 ?
I'm looking for a cli command/UI option that will provide me with more elaborated error message, than the one I'm getting on the main screen of the webserver.
As described in the Airlfow's documentation, to test DAG loading you can simply run:
python your-dag-file.py
If there is any problem during the DAG loading phase you will get a stack trace here.
The later sections also describe how to test custom operators.
As explained in the upgrading manual the
airflow list_dags has been changed to airflow dags list
The full syntax is:
airflow dags list [-h] [-o table, json, yaml] [-S SUBDIR]
for more information see docs

Using Apache Airflow Tool, Implement a DAG for a batch processing pipeline to get a directory from a remote system

Using Apache airflow tool, how can I implement a DAG for the following Python code. The task accomplished in the code is to get a directory from GPU server to local system. Code is working fine in Jupyter notebook. Please help to implement in Airflow...I'm very new to this. Thanks.
import pysftp
import os
myHostname = "hostname"
myUsername = "username"
myPassword = "pwd"
with pysftp.Connection(host=myHostname, username=myUsername, password=myPassword) as sftp:
print("Connection successfully stablished ... ")
src = '/path/src/'
dst = '/home/path/path/destination'
os.mkdir(dst)
sftp.get_d(src, dst, preserve_mtime=True)
print("Fetched source images from GPU server to local directory")
# connection closed automatically at the end of the with-block```
For SFTP duties, Airflow provides SFTOperator that you can use directly.
Alternatively it's corresponding SFTPHook can be used with a simple PythonOperator
I acknowledge there aren't many examples, but this might be helpful
For SSH-connection, see this

Dag Seems to be missing

I have a dag which checks for new workflows to be generated (Dynamic DAG) at a regular interval and if found, creates them. (Ref: Dynamic dags not getting added by scheduler )
The above DAG is working and the dynamic DAGs are getting created and listed in the web-server. Two issues here:
When clicking on the DAG in web url, it says "DAG seems to be missing"
The listed DAGs are not listed using "airflow list_dags" command
Error:
DAG "app01_user" seems to be missing.
The same is for all other dynamically generated DAGs. I have compiled the Python script and found no errors.
Edit1:
I tried clearing all data and running "airflow run". It ran successfully but no Dynamic generated DAGs were added to "airflow list_dags". But when running the command "airflow list_dags", it loaded and executed the DAG, (which generated Dynamic DAGs). The dynamic DAGs are also listed as below:
[root#cmnode dags]# airflow list_dags
sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8\nLANG=en_US.UTF-8)
sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8\nLANG=en_US.UTF-8)
[2019-08-13 00:34:31,692] {settings.py:182} INFO - settings.configure_orm(): Using pool settings. pool_size=15, pool_recycle=1800, pid=25386
[2019-08-13 00:34:31,877] {__init__.py:51} INFO - Using executor LocalExecutor
[2019-08-13 00:34:32,113] {__init__.py:305} INFO - Filling up the DagBag from /root/airflow/dags
/usr/lib/python2.7/site-packages/airflow/operators/bash_operator.py:70: PendingDeprecationWarning: Invalid arguments were passed to BashOperator (task_id: tst_dyn_dag). Support for passing such arguments will be dropped in Airflow 2.0. Invalid arguments were:
*args: ()
**kwargs: {'provide_context': True}
super(BashOperator, self).__init__(*args, **kwargs)
-------------------------------------------------------------------
DAGS
-------------------------------------------------------------------
app01_user
app02_user
app03_user
app04_user
testDynDags
Upon running again, all the above generated 4 dags disappeared and only the base DAG, "testDynDags" is displayed.
When I was getting this error, there was an exception showing up in the webserver logs. Once I resolved that error and I restarted the webserver it went through normally.
From what I can see this is the error that is thrown when the webserver tried to parse the dag file and there is an error. In my case it was an error importing a new operator I added to a plugin.
Usually, I check in Airflow UI, sometimes the reason of broken DAG appear in there. But if it is not there, I usually run the .py file of my DAG, and error (reason of DAG cant be parsed) will appear.
I never got to work on dynamic DAG generation but I did face this issue when DAG was not present on all nodes ( scheduler, worker and webserver ). In case you have airflow cluster, please make sure that DAG is present on all airflow nodes.
Same error, the reason was I renamed my dag_id in uppercase. Something like "import_myclientname" into "import_MYCLIENTNAME".
I am little late to the party but I faced the error today:
In short: try executing airflow dags report and/or airflow dags reserialize
Check out my comment here:
https://stackoverflow.com/a/73880927/4437153
I found that airflow fails to recognize a dag defined in a file that does not have from airflow import DAG in it, even if DAG is not explicitly used in that file.
For example, suppose you have two files, a.py and b.py:
# a.py
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
def makedag(dag_id="a"):
with DAG(dag_id=dag_id) as dag:
DummyOperator(task_id="nada")
dag = makedag()
and
# b.py
from a import makedag
dag = makedag(dag_id="b")
Then airflow will only look at a.py. It won't even look at b.py at all, even to notice if there's a syntax error in it! But if you add from airflow import DAG to it and don't change anything else, it will show up.

What to do when a py.test hangs silently?

While using py.test, I have some tests that run fine with SQLite but hang silently when I switch to Postgresql. How would I go about debugging something like that? Is there a "verbose" mode I can run my tests in, or set a breakpoint ? More generally, what is the standard plan of attack when pytest stalls silently? I've tried using the pytest-timeout, and ran the test with $ py.test --timeout=300, but the tests still hang with no activity on the screen whatsoever
I ran into the same SQLite/Postgres problem with Flask and SQLAlchemy, similar to Gordon Fierce. However, my solution was different. Postgres is strict about table locks and connections, so explicitly closing the session connection on teardown solved the problem for me.
My working code:
#pytest.yield_fixture(scope='function')
def db(app):
# app is an instance of a flask app, _db a SQLAlchemy DB
_db.app = app
with app.app_context():
_db.create_all()
yield _db
# Explicitly close DB connection
_db.session.close()
_db.drop_all()
Reference: SQLAlchemy
To answer the question "How would I go about debugging something like that?"
Run with py.test -m trace --trace to get trace of python calls.
One option (useful for any stuck unix binary) is to attach to process using strace -p <PID>. See what system call it might be stuck on or loop of system calls. e.g. stuck calling gettimeofday
For more verbose py.test output install pytest-sugar. pip install pytest-sugar And run test with pytest.py --verbose . . .
https://pypi.python.org/pypi/pytest-sugar
I had a similar problem with pytest and Postgresql while testing a Flask app that used SQLAlchemy. It seems pytest has a hard time running a teardown using its request.addfinalizer method with Postgresql.
Previously I had:
#pytest.fixture
def db(app, request):
def teardown():
_db.drop_all()
_db.app = app
_db.create_all()
request.addfinalizer(teardown)
return _db
( _db is an instance of SQLAlchemy I import from extensions.py )
But if I drop the database every time the database fixture is called:
#pytest.fixture
def db(app, request):
_db.app = app
_db.drop_all()
_db.create_all()
return _db
Then pytest won't hang after your first test.
Not knowing what is breaking in the code, the best way is to isolate the test that is failing and set a breakpoint in it to have a look. Note: I use pudb instead of pdb, because it's really the best way to debug python if you are not using an IDE.
For example, you can the following in your test file:
import pudb
...
def test_create_product(session):
pudb.set_trace()
# Create the Product instance
# Create a Price instance
# Add the Product instance to the session.
...
Then run it with
py.test -s --capture=no test_my_stuff.py
Now you'll be able to see exactly where the script locks up, and examine the stack and the database at this particular moment of execution. Otherwise it's like looking for a needle in a haystack.
I just ran into this problem for quite some time (though I wasn't using SQLite). The test suite ran fine locally, but failed in CircleCI (Docker).
My problem was ultimately that:
An object's underlying implementation used threading
The object's __del__ normally would end the threads
My test suite wasn't calling __del__ as it should have
I figured I'd add how I figured this out. Other answers suggest these:
Found usage of pytest-timeout didn't help, the test hung after completion
Invoked via pytest --timeout 5
Versions: pytest==6.2.2, pytest-timeout==1.4.2
Running pytest -m trace --trace or pytest --verbose yielded no useful information either
I ended up having to comment literally everything out, including:
All conftest.py code and test code
Slowly uncommented/re-commented regions and identified the root cause
Ultimate solution: using a factory fixture to add a finalizer to call __del__
In my case the Flask application did not check if __name__ == '__main__': so it executed app.start() when that was not my intention.
You can read many more details here.
In my case diff worked very slow on comparing 4 MB data when assert failed.
with open(path, 'rb') as f:
assert f.read() == data
Fixed by:
with open(path, 'rb') as f:
eq = f.read() == data
assert eq

RHadoop Stream Job Fail with Apache Oozie

I'm really just looking to pick the community's brain for some leads in figuring out what is going on with the issue I'm having.
I'm writing a MR job with RHadoop (rmr2, v3.0.0) and things are great -- IO with HDFS, mapping, reducing. No problems. Life is great.
I'm trying to schedule the job with Apache Oozie, and am running into some issues:
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, :
hadoop streaming failed with error code 1
I've read the rmr2 debugging guide, but nothing is really getting to the stderr because the job fails before anything even gets scheduled.
In my head, everything points to a difference in environments. However, Oozie is running the job as the same user that I'm able to run everything with via cli, and all of the R environment variables (fetched with Sys.getenv()) are the same, excepting there's some additional class path stuff set with Oozie.
I can post more of the OS or Hadoop versions and config details, but sleuthing some version-specific bugs seems like a bit of a red herring as everything runs fine at the command line.
Anybody have any thoughts what might be some helpful next steps in hunting this beast down?
UPDATE:
I overwrote the system function in the base package to log the user, the host name of the node, and the command being executed before the internal call to system. So before any system call is actually executed, I get something like the following in the stderr:
user#host.name
/usr/bin/hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-102.jar ...
When ran with Oozie, the command printed in the stderr fails with an exit status of 1. When I run the command on user#host.name, it runs successfully. So essentially the EXACT same command with the SAME user on the SAME node fails with Oozie, but runs successfully from cli.

Resources