I'm looking for information about whether it is possible to configure one Airflow scheduler process per DAG folder.
I'm using separate DAG folders depending on the business subject, and I need to know whether it is possible to configure one scheduler process for each of these folders.
Does anyone know if this is possible?
Currently I am using Airflow version 1.10.10.
Inside the airflow/logs folder there are many folders named after your DAGs, but there is also a folder named scheduler which contains folders named by date (e.g. 2020/07/08), going back to the date when I first started using Airflow. After searching through multiple forums, I'm still not sure what these logs are for.
Anyway, the problem is that I keep wondering whether it is okay to delete the contents of the scheduler folder, since it takes up so much space unlike the rest of the folders named after the DAGs (I'm assuming that's where the logs of each DAG run are stored). Will deleting the contents of the scheduler folder cause any errors or loss of DAG logs?
This might be a silly question, but I want to make sure since this Airflow is on a production server. I've tried creating an Airflow instance locally and deleting the scheduler folder contents, and no errors seem to have occurred. Any feedback and shared experience on handling this is welcome.
Thanks in advance.
As far as I know, it contains the logs of the Airflow scheduler itself. I have only used them once, for a problem with SLAs.
I've been deleting old files in it for over a year and have never encountered a problem. This is the command I use to delete old scheduler log files:
find /etc/airflow/logs/scheduler -type f -mtime +45 -delete
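If you want this cleanup to run automatically, a minimal sketch is to put the same command in the airflow user's crontab; the path and the 45-day retention simply mirror the command above and may differ on your setup:

    # Nightly at 03:00: remove scheduler log files older than 45 days
    0 3 * * * find /etc/airflow/logs/scheduler -type f -mtime +45 -delete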
I know Airflow supports logging to S3/GCS/Azure etc.,
but is there a way to log to specific folders inside this storage based on some configuration inside the DAGs?
Airflow does not support this feature yet. There is a centralised log folder to be configured in airflow.cfg where all logs get saved, irrespective of the DAG.
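For reference, a minimal sketch of the relevant airflow.cfg settings in Airflow 1.10.x (the bucket path and connection id below are placeholders); the base log folder is global, so it applies to every DAG:

    [core]
    remote_logging = True
    remote_base_log_folder = s3://my-log-bucket/airflow/logs
    remote_log_conn_id = my_s3_conn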
What do people find is the best way to distribute code (DAGs) to the Airflow webserver / scheduler + workers? I am trying to run Celery on a large cluster of workers, so any manual updates are impractical.
I am deploying Airflow on Docker and using s3fs right now, and it is crashing on me constantly and creating weird core.### files. I am exploring other solutions (i.e. StorageMadeEasy, Dropbox, EFS, a cron job to update from git...) but would love a little feedback as I explore options.
Also, how do people typically make updates to DAGs and distribute that code? If one uses a shared volume like s3fs, do you restart the scheduler every time you update a DAG? Is editing the code in place on something like Dropbox asking for trouble? Any best practices on how to update DAGs and distribute the code would be much appreciated.
I can't really tell you what the "best" way of doing it is, but I can tell you what I've done when I needed to distribute the workload onto another machine.
I simply set up an NFS share on the Airflow master for both the DAGS and the PLUGINS folders and mounted this share on the worker. I had an issue once or twice where the NFS mount point would break for some reason, but after re-mounting, the jobs continued.
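Roughly, that setup looks like the sketch below; the export paths, host names and mount options are placeholders, not the exact values I used:

    # On the master: export the DAGS and PLUGINS folders (/etc/exports), then run exportfs -ra
    /opt/airflow/dags    worker-host(ro,sync,no_subtree_check)
    /opt/airflow/plugins worker-host(ro,sync,no_subtree_check)

    # On the worker: mount the shares at the same locations
    sudo mount -t nfs airflow-master:/opt/airflow/dags    /opt/airflow/dags
    sudo mount -t nfs airflow-master:/opt/airflow/plugins /opt/airflow/plugins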
To distribute the DAG and PLUGIN code to the Airflow cluster, I just deploy it to the master (I do this with a bash script on my local machine that simply SCPs the folders up from my local git branch) and NFS handles the replication to the worker. I always restart everything after a deploy; I also don't deploy while a job is running.
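For illustration, the deploy script amounts to something like the sketch below; the host name, paths and systemd service names are placeholders for whatever your installation uses:

    #!/usr/bin/env bash
    # Copy the local dags/ and plugins/ folders to the Airflow master
    # and restart the services so the new code is picked up.
    set -euo pipefail

    MASTER=airflow-master        # placeholder host
    AIRFLOW_HOME=/opt/airflow    # placeholder Airflow home on the master

    scp -r dags/    "$MASTER:$AIRFLOW_HOME/"
    scp -r plugins/ "$MASTER:$AIRFLOW_HOME/"

    # Assumes the services are managed by systemd under these names.
    ssh "$MASTER" "sudo systemctl restart airflow-scheduler airflow-webserver"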
A better way to deploy would be to have git on the Airflow master server check out a branch from a Git repository (test or master, depending on the Airflow server) and then replace the dags and plugins with the ones from that repository. I'm experimenting with doing deployments like this at the moment with Ansible.
I want to modify the schedule of a task I created in a dags/ folder through the Airflow UI. I can't find a way to modify the schedule through the UI. Can it be done, or can it only be done by modifying the Python script?
The only way to change it is through the code. As it's part of the DAG definition (like tasks and dependencies), it isn't something that can easily be changed through the web interface.
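To make that concrete, the schedule is an argument of the DAG object itself, so changing it means editing the Python file; a minimal sketch for Airflow 1.10.x (the DAG id and cron expression are just examples):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    # The schedule lives in the DAG definition, not in the web UI.
    dag = DAG(
        dag_id="example_dag",
        start_date=datetime(2020, 1, 1),
        schedule_interval="0 6 * * *",  # edit this cron expression to change the schedule
    )

    run_this = DummyOperator(task_id="example_task", dag=dag)

After saving the change, the scheduler picks up the new schedule the next time it parses the file; nothing needs to be changed in the UI.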
I have recently started working with the Airflow scheduler, and I have noticed that it creates multiple log files for each job that is scheduled. How can I restrict it to just one file? I have checked the airflow.cfg file and wasn't able to find any setting related to the number of log files kept.