Airflow cannot connect to MySQL server in multiple tasks - airflow

So, I am using Airflow and I made a pretty large DAG with multiple different tasks, all using cursors and hooks to connect to and interact with the database. Let's say, for example, the first 3 tasks work successfully, but then the 4th one says it can't connect to the MySQL server, even though they all use the same connections, which I defined as environment variables in the Airflow interface. However, sometimes if I just re-run it without changing anything, it connects and works. Any ideas?

Related

Apache Airflow Connection hook is instantiating multiple connections

Background: The Apache Airflow documentation reads:
Hooks
Hooks act as an interface to communicate with the external shared resources in a DAG. For example, multiple tasks in a DAG can require access to a MySQL database. Instead of creating a connection per task, you can retrieve a connection from the hook and utilize it.
I have tried spawning 10 tasks using different DBs: MySQL, Postgres, MongoDB. Please note that I am using one DB (e.g. MySQL) in one DAG (consisting of 10 tasks).
But all tasks are instantiating a new connection.
Example of my task:
conn_id = kwargs.get('conn_id')  # name of the Airflow Connection defined in the UI or env vars
pg = PostgresHook(postgres_conn_id=conn_id)
pg_query = "...."
records = pg.get_records(pg_query)
Why is Airflow instantiating a new connection when the Airflow documentation itself reads "... multiple tasks in a DAG can require access to a MySQL database. Instead of creating a connection per task, you can retrieve a connection from the hook and utilize it ..."?
What is being missed here?
I believe what they mean to say with that part of the documentation is that hooks prevent you from redefining the same credentials over and over again. By "connection" they are referring to an Airflow Connection you define in the web interface, not an actual network connection to a host.
If you think about it this way:
A task can be scheduled on any of the 3 Airflow worker nodes.
Your 10 tasks are divided between these 3.
How would they be able to share the same network connection if they run on different hosts? It would be very hard to maintain those network connections across workers.
But don't worry, it also took me ages to understand what they meant there.
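To illustrate, here is a minimal sketch (not from the original question) of what the documentation means: two task callables that reuse one Airflow Connection definition. It assumes a MySQL Connection named "mysql_default" exists and uses the Airflow 1.x import path; the table and query names are purely illustrative.

from airflow.hooks.mysql_hook import MySqlHook  # Airflow 1.x path; newer versions use airflow.providers.mysql.hooks.mysql

def extract(**kwargs):
    # Reuses the credentials stored under conn_id "mysql_default".
    hook = MySqlHook(mysql_conn_id="mysql_default")
    return hook.get_records("SELECT COUNT(*) FROM some_table")  # illustrative query

def load(**kwargs):
    # Same Connection definition, but this task still opens its own database
    # connection, because it may run on a different worker.
    hook = MySqlHook(mysql_conn_id="mysql_default")
    hook.run("INSERT INTO audit_log (loaded_at) VALUES (NOW())")  # illustrative query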

Create and use Connections in Airflow operator at runtime [duplicate]

This question already has answers here: Is there a way to create/modify connections through Airflow API (closed as a duplicate).
Note: This is NOT a duplicate of
Export environment variables at runtime with airflow
Set Airflow Env Vars at Runtime
I have to trigger certain tasks on remote systems from my Airflow DAG. The straightforward way to achieve this is SSHHook.
The problem is that the remote system is an EMR cluster which is itself created at runtime (by an upstream task) using EmrCreateJobFlowOperator. So while I can get hold of the job_flow_id of the launched EMR cluster (using XCom), what I need is an ssh_conn_id to be passed to each downstream task.
Looking at the docs and code, it is evident that Airflow will try to look up this connection (by conn_id) in the db and in environment variables, so now the problem boils down to being able to set either of these two properties at runtime (from within an operator).
This seems a rather common problem because if this isn't achievable then the utility of EmrCreateJobFlowOperator would be severely hampered; but I haven't come across any example demonstrating it.
Is it possible to create (and also destroy) either of these from within an Airflow operator?
Connection (persisted in Airflow's db)
Environment variable (should be accessible to all downstream tasks and not just the current task, as told here)
If not, what are my options?
I'm on
Airflow v1.10
Python 3.6.6
emr-5.15 (can upgrade if required)
Connections come from the ORM
Yes, you can create connections at runtime, even at DAG creation time if you're careful enough. Airflow is completely transparent about its internal models, so you can interact with the underlying SQLAlchemy directly. As exemplified originally in this answer, it's as easy as:
from airflow.models import Connection
from airflow import settings

def create_conn(username, password, host=None):
    # Build the Connection object and persist it in Airflow's metadata DB.
    new_conn = Connection(conn_id=f'{username}_connection',
                          login=username,
                          host=host if host else None)
    new_conn.set_password(password)

    session = settings.Session()
    session.add(new_conn)
    session.commit()
Here you can, of course, set any other extra Connection properties you may require for the EMR connection.
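Since the question also asks about destroying connections, a symmetric sketch could look like the following; as above, this goes straight at Airflow's internal models rather than a supported API, so treat it as an assumption.

from airflow.models import Connection
from airflow import settings

def delete_conn(conn_id):
    # Remove the Connection row from Airflow's metadata DB.
    session = settings.Session()
    session.query(Connection).filter(Connection.conn_id == conn_id).delete()
    session.commit()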
Environments are process-bound
This is not a limitation of Airflow or Python, but (AFAIK for every major OS) environments are bound to the lifetime of a process. When you export a variable in bash, for example, you're simply stating that when you spawn child processes, you want to copy that variable into the child's environment. This means that the parent process can't change the child's environment after its creation, and the child can't change the parent's environment.
In short, only the process itself can change its environment after it's created. And considering that worker processes are Airflow subprocesses, it's hard to control the creation of their environments as well. What you can do is write the environment variables into a file and intentionally update the current environment with overrides from that file on each task start.
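A minimal sketch of that file-based idea, assuming a path visible to every worker; the path and helper names here are illustrative, not part of the original answer.

import json
import os

ENV_FILE = "/mnt/shared/runtime_env.json"  # assumed shared location across workers

def save_overrides(**overrides):
    # Called by the upstream task to persist the values it wants to expose.
    with open(ENV_FILE, "w") as f:
        json.dump(overrides, f)

def apply_overrides():
    # Called at the start of each downstream task's callable to update its own environment.
    if os.path.exists(ENV_FILE):
        with open(ENV_FILE) as f:
            os.environ.update(json.load(f))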
Another way to do this is to create an Airflow task right after EmrCreateJobFlowOperator that uses a BashOperator to call the aws CLI and retrieve the IP address of the virtual machine where you want to run the task, and then, in the same task, runs the Airflow CLI to create an SSH connection using that IP address.
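Putting the pieces together, a hedged sketch of doing the same thing from a PythonOperator callable instead of the CLI; the task_id "create_emr_cluster", the conn_id "emr_ssh", and the key-file path are assumptions, and boto3 must be available on the worker.

import boto3
from airflow import settings
from airflow.models import Connection

def register_emr_ssh_conn(**context):
    # EmrCreateJobFlowOperator pushes the job flow id to XCom as its return value.
    job_flow_id = context["ti"].xcom_pull(task_ids="create_emr_cluster")
    cluster = boto3.client("emr").describe_cluster(ClusterId=job_flow_id)
    master_dns = cluster["Cluster"]["MasterPublicDnsName"]
    # Persist an SSH connection that downstream SSHHook-based tasks can reference by conn_id.
    conn = Connection(conn_id="emr_ssh",
                      conn_type="ssh",
                      host=master_dns,
                      login="hadoop",  # default EMR login; adjust for your cluster
                      extra='{"key_file": "/path/to/key.pem"}')  # assumed key location
    session = settings.Session()
    session.add(conn)
    session.commit()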

Sharing large intermediate state between Airflow tasks

We have an Airflow deployment with Celery executors.
Many of our DAGs require a local processing step of some file in a BashOperator or PythonOperator.
However, in our understanding the tasks of a given DAG may not always be scheduled on the same machine.
The options for state sharing between tasks I've gathered so far:
Use Local Executors - this may suffice for one team, depending on the load, but may not scale to the wider company
Use XCom - does this have a size limit? Probably unsuitable for large files
Write custom Operators for every combination of tasks that need local processing in between. This approach reduces modularity of tasks and requires replicating existing operators' code.
Use Celery queues to route DAGs to the same worker (docs) - This option seems attractive at first, but what would be an appropriate way to set it up in order to avoid routing everything to one executor, or crafting a million queues?
Use a shared network storage in all machines that run executors - Seems like an additional infrastructure burden, but is a possibility.
What is the recommended way to do sharing of large intermediate state, such as files, between tasks in Airflow?
To clarify something: no matter how you set up Airflow, there will only be one executor running.
The executor runs on the same machine as the scheduler.
Currently (Airflow 1.9.0 at the time of writing) there is no safe way to run multiple schedulers, so there will only ever be one executor running.
The Local executor executes tasks on the same machine as the scheduler.
The Celery executor just puts tasks in a queue to be worked on by the Celery workers.
However, the question you are asking does apply to Celery workers: if you use the Celery executor, you will probably have multiple Celery workers.
Using network shared storage solves multiple problems:
Each worker machine sees the same dags because they have the same dags folder
Results of operators can be stored on a shared file system
The scheduler and webserver can also share the dags folder and run on different machines
I would use network storage and write the output file name to XCom. Then, when you need the output from a previous task, you would read the file name from that task's XCom and process that file.
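As a rough sketch of that pattern (the mount point, task_id, and file contents below are illustrative assumptions): only the small file path travels through XCom, while the data itself stays on the shared storage.

import os

SHARED_DIR = "/mnt/shared/airflow"  # assumed network mount visible to all workers

def produce(**context):
    path = os.path.join(SHARED_DIR, f"{context['run_id']}_intermediate.csv")
    with open(path, "w") as f:
        f.write("col_a,col_b\n1,2\n")  # placeholder for the real processing output
    return path  # returned value is pushed to XCom as "return_value"

def consume(**context):
    path = context["ti"].xcom_pull(task_ids="produce_task")  # "produce_task" is illustrative
    with open(path) as f:
        data = f.read()
    # ... continue processing `data` here ...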
You can also change the datatype of the value column in the xcom table of the Airflow metastore.
The default datatype of value is BLOB.
Change it to LONGBLOB. That will let you store up to 4 GB of intermediate data between tasks.

Is there any possibility for one Rserve client to share its workspace with another?

I'd like to replace RExcel with Excelsi-R. Excelsi-R talks to R via Rserve, and Rserve has a feature that makes each client work in an independent workspace.
What I want is to actually share a single workspace between at least 2 simultaneously connected clients. One client would be run by Excelsi-R, and the other by a manually launched interactive R session. That would allow me to interface with the Excelsi-R session in the traditional way (say, in RStudio).
I don't need asynchronous computation; I'm perfectly happy if Excelsi-R would have to wait, until a command issued by the other connection finishes, and vice versa; just like in the RExcel "foreground mode".
Is it possible?
Not currently, since each process has exactly one connection. There are a few hacks - such as "switching" sessions by starting a listener for another connection in an existing session - but that may be a bit too limited.
That said, it is technically possible (Rserve supports looping over multiple connections - it is used in RCloud to support two separate processes on one connection) - the challenge is how to link two independent connections to a single process. The rsio communication was added in Rserve 1.8 specifically to allow passing descriptors between Rserve instances, but it has not been used so far. If there is interest in that kind of functionality, I can see how it could be added.

How to get information about a Zope/Plone instance during execution?

We have a production environment (cluster) with two physical servers and 3 (three) Plone/Zope instances running on each one.
We scheduled a job (with APScheduler) that needs to run in only one instance, but it is being executed by all 6 (six) instances.
To solve this, I think I need to verify whether the job is running on server1 and whether it is the instance that listens on a specific port.
So, how can I programmatically get information about a Zope/Plone instance?
