Data sharing between two tasks in an Airflow DAG

I want to run a Hive query using the HiveOperator, and the output of that query should be passed to a Python script run with the PythonOperator. Is this possible, and how?

One approach to this kind of problem in general is to use Airflow's XComs - see the documentation.
However, I would use this sparingly and only where strictly necessary. Otherwise, you risk ending up with operators that are quite tangled and interdependent.
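A minimal sketch of the XCom pattern, assuming Airflow 2.x; the Hive query itself is stubbed out with plain Python here, and the DAG and task names are invented for illustration. In practice you might fetch the rows with a hook from inside a PythonOperator and push them, since HiveOperator only executes the HQL:
import pendulum
from airflow import DAG
from airflow.operators.python import PythonOperator

def run_query(ti):
    # Stand-in for the Hive query; in a real DAG these rows might come
    # from a hook that actually runs the HQL and fetches the result.
    rows = [("a", 1), ("b", 2)]
    ti.xcom_push(key="query_result", value=rows)

def process_result(ti):
    # Pull whatever the upstream task pushed into XCom.
    rows = ti.xcom_pull(task_ids="run_query", key="query_result")
    print(f"received {len(rows)} rows from the upstream task")

with DAG(
    dag_id="hive_to_python_xcom",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule_interval=None,
    catchup=False,
) as dag:
    run_query_task = PythonOperator(task_id="run_query", python_callable=run_query)
    process_task = PythonOperator(task_id="process_result", python_callable=process_result)
    run_query_task >> process_task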

Related

Batching operations in gremlin on Neptune

In AWS Neptune documentation (best-practices-gremlin-java-batch-add) they recommend batching operations together.
How can I batch a few operations together when one of them may end the stream?
For example if I want to batch together the following:
g.V(2).drop().addV('test').property(id,1)
The problem is that the addV won't be called.
Is there a way to batch the drop and the addV together while making sure the addV will be called?
I tried putting fold() in between, but it isn't supported natively in Neptune and would probably create performance issues.
sideEffect isn't a good option for performance reasons with Neptune either (see the drop documentation in gremlin-step-support).
You can accomplish this by using the sideEffect() step to perform the drop() as shown in this recipe. In your case this might look something like:
g.addV('test').property(id,1).sideEffect(V(2).drop())
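If you are submitting the traversal from Python with gremlinpython, a rough sketch of that batched traversal could look like the following (the endpoint is a placeholder, and the IDs from the question are written as strings since Neptune uses string vertex IDs):
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __
from gremlin_python.process.traversal import T

# Placeholder endpoint; replace with your Neptune cluster endpoint.
conn = DriverRemoteConnection("wss://your-neptune-endpoint:8182/gremlin", "g")
g = traversal().withRemote(conn)

# One traversal: create the new vertex, dropping the old one as a side effect,
# so the drop ending the stream cannot prevent the addV from running.
g.addV("test").property(T.id, "1").sideEffect(__.V("2").drop()).iterate()

conn.close()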

Schedule tasks at fixed time in multiple timezones

I'm getting started with Airflow and I'm not sure how to approach this problem:
I'm building a data export system that should run at a fixed time daily for different locations. My issue is that the locations span several timezones, and the definition of the day's start/end changes depending on the timezone.
I saw in the documentation that I can make a DAG timezone aware, but I'm not sure that creating hundreds of DAGs with different timezones is the right approach. I also have some common tasks, so multiple DAGs create more complexity or duplication in the tasks performed.
Is there an Airflow-idiomatic way to achieve this? I think building reports that are timezone dependent would be a fairly common use case, but I didn't find any information about it.
Dynamic DAGs are a hot topic in Airflow, but from my point of view, Airflow itself doesn't provide a straightforward way to implement them. You'll need to balance the pros and cons depending on your use case.
As a good starting point, you can check the Astronomer guide for dynamically generating DAGs. It lays out all the available options along with some of their pros and cons. Make sure you check the scalability considerations to see the implications in terms of performance.
For your use case, I think the best approach will be either the Create_DAG method (under Single-File Methods) or the DAG Factory. I personally prefer the first one because it works like a factory (in terms of programming patterns) while giving you the flexibility to create whatever you need for each DAG. With the second, you have less control over what gets created and you take on an external dependency.
Another interesting article about dynamically generating DAGs is "How to build a DAG Factory on Airflow".
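To make the single-file Create_DAG approach concrete, here is a rough sketch (assuming Airflow 2.x; the locations, timezones and schedule are invented for illustration). Because each start_date is timezone-aware, the cron schedule is interpreted in that location's local time:
import pendulum
from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical mapping of location -> timezone.
LOCATIONS = {
    "new_york": "America/New_York",
    "london": "Europe/London",
    "tokyo": "Asia/Tokyo",
}

def create_dag(location, tz_name):
    tz = pendulum.timezone(tz_name)
    dag = DAG(
        dag_id=f"daily_export_{location}",
        schedule_interval="0 6 * * *",  # 06:00 local time in each timezone
        start_date=pendulum.datetime(2023, 1, 1, tz=tz),
        catchup=False,
    )
    with dag:
        PythonOperator(
            task_id="export",
            python_callable=lambda: print(f"exporting {location}"),
        )
    return dag

# Register one DAG per location at module level so the scheduler picks them up.
for location, tz_name in LOCATIONS.items():
    globals()[f"daily_export_{location}"] = create_dag(location, tz_name)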

Why use CustomOperator over PythonOperator in Apache Airflow?

As I'm using Apache Airflow, I can't work out why someone would create a CustomOperator rather than use a PythonOperator. Wouldn't using a Python function inside a PythonOperator instead of a CustomOperator lead to the same results?
If someone knows what the different use cases and best practices are, that would be great!
Thanks a lot for your help
Both operators, while similar, really sit at different abstraction levels, and depending on your use case, one may be a better fit than the other.
Code defined in a CustomOperator can be easily used by multiple DAGs. If you have a lot of DAGs that need to perform the same task it may make more sense to expose this code to the DAGs via a CustomOperator.
PythonOperator is very general and is a better fit for one-off DAG specific tasks.
At the end of the day, the default operators provided in Airflow are just tools. Which tool you end up using (a default operator), or whether it makes sense to create your own custom tool (a custom operator), is a choice determined by a number of factors:
The type of task you are trying to accomplish.
Code organization requirements necessitated by policy or the number of people maintaining the pipeline.
Simple personal taste.
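As an illustration of the reuse argument, a hypothetical custom operator might look like this (the operator name and copy logic are made up; the point is that the shared code lives in one place instead of in a python_callable in every DAG):
from airflow.models.baseoperator import BaseOperator

class CopyTableOperator(BaseOperator):
    """Hypothetical reusable operator shared by many DAGs."""

    def __init__(self, source_table, target_table, **kwargs):
        super().__init__(**kwargs)
        self.source_table = source_table
        self.target_table = target_table

    def execute(self, context):
        # The shared logic lives here once, instead of being copy-pasted
        # into every DAG that needs it.
        self.log.info("Copying %s to %s", self.source_table, self.target_table)
        # ... the actual copy logic would go here ...

# In any DAG file:
# copy = CopyTableOperator(
#     task_id="copy_users",
#     source_table="raw.users",
#     target_table="mart.users",
# )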

How does executemany() work

I have been using C++ and working with SQLite. In Python, the sqlite3 library provides an executemany operation, but the C++ library I am using does not.
I was wondering how the executemany operation optimizes queries to make them faster.
I was looking at the SQLite C/C++ API and saw that there are two functions, sqlite3_reset and sqlite3_clear_bindings, that can be used to clear and reuse prepared statements.
Is this what Python does to batch and speed up executemany queries (at least for inserts)? Thanks for your time.
executemany just binds the parameters, executes the statements, and calls sqlite3_reset, in a loop.
Python does not give you direct access to the statement after it has been prepared, so this is the only way to reuse it.
However, SQLite does not take much time for preparing statements, so this is unlikely to have much of an effect on performance.
The most important thing for performance is to batch statements in a transaction; Python tries to be clever and to do this automatically (independently from executemany).
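A small, self-contained illustration of both points (the prepared statement reused by executemany, and the single surrounding transaction) using Python's sqlite3 module:
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")

rows = [(i, f"item-{i}") for i in range(1000)]

# executemany prepares the INSERT once, then binds, executes and resets it
# for every row; wrapping all rows in one transaction is the bigger win.
with conn:  # the context manager commits the whole batch as one transaction
    conn.executemany("INSERT INTO items (id, name) VALUES (?, ?)", rows)

print(conn.execute("SELECT COUNT(*) FROM items").fetchone()[0])  # prints 1000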
I looked into some of the related posts and found the following, which goes into great detail on ways to improve SQLite batch insert performance. These principles could effectively be used to create an executemany function.
Improve INSERT-per-second performance of SQLite?
The biggest improvement was indeed, as #CL. said, turning it all into one transaction. The author of that post also found significant improvements from reusing prepared statements and playing with some pragma settings.

Why does zumero_sync need to be called multiple times?

According to the documentation for zumero_sync:
If a large amount of information needs to be pulled from the server,
this function may need to be called more than once.
In my Android app that uses Zumero that's no problem; I just keep calling zumero_sync until the return value doesn't start with "0;".
However, now I'm trying to write an admin script that also syncs with my server dbfiles. I'd like to use the sqlite3 shell, and have the script pass the SQL to execute via command line arguments. I need to call zumero_sync in a loop (which SQLite doesn't support) to make sure the db is fully synced. If I had to, I could invoke sqlite3 in a loop (reading its output, looking for "0;"), or even write a C++ app to call the SQLite/Zumero functions natively. But it certainly would be easier if a single zumero_sync was enough.
I guess my real question is: could zumero_sync be changed so it completes the sync before returning? If there are cases where the existing behavior is more useful, maybe there could be a parameter for specifying which mode to use?
I see two basic questions here:
(1) Why does zumero_sync() work the way it does?
(2) Can it work differently?
I'll answer (2) first, since it's easier: yes, it could work differently. Rather, we could (and probably will, soon, now that you've brought this up) implement an additional function, named something like zumero_sync_complete(), which performs [the guts of] zumero_sync() in a loop and returns after the sync is complete.
We didn't implement zumero_sync_complete() because it doesn't add much value. It's a simple loop, so you can darn well write it yourself. :-)
Er, except in scripting environments which don't support loops. Like the sqlite3 shell.
Answer to (1):
The Zumero sync protocol is designed to give the server the flexibility to return partial results if it wants to do so. And for the sake of reducing load on the server (and increasing its scalability) it often does want to do exactly that.
Given that, one reason to expose this to the client is to increase the client's flexibility as well. As long as we're making multiple round trips, we might as well give the client an opportunity to do something (like, maybe, update a progress bar) in between them.
Another thing a client might want to do in between loop iterations is handle an error.
Or, in the case of a multithreaded client, it might want to deal with changes that happened on the client while the sync is going on.
Which raises the question of how locking should be managed. Do we hold the SQLite write lock during the entire loop, or only when absolutely necessary?
Bottom line: A robust app would probably want to implement the loop itself so that it can make its own decisions and retain full control over things.
But, as you observe, the sqlite3 shell doesn't have loops. And it's not an app. And it doesn't have threads. Or progress bars. So it's a use case where a simpler-and-less-powerful form of zumero_sync() would make sense.
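For what it's worth, outside the sqlite3 shell the loop is short. A rough Python sketch, mirroring the "keep calling until the return value doesn't start with '0;'" behaviour described in the question; the extension path and the zumero_sync argument list are placeholders to be filled in from the Zumero documentation:
import sqlite3

# Placeholder: the real zumero_sync() argument list (local dbfile, server URL,
# credentials, ...) comes from the Zumero documentation and is not shown here.
SYNC_SQL = "SELECT zumero_sync(...);"

conn = sqlite3.connect("local.db")
conn.enable_load_extension(True)   # requires a Python build that allows loading extensions
conn.load_extension("zumero")      # placeholder path to the Zumero client library

while True:
    (result,) = conn.execute(SYNC_SQL).fetchone()
    # As described in the question: keep calling zumero_sync until the
    # return value no longer starts with "0;".
    if not result.startswith("0;"):
        break

conn.close()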

Resources