We have a Google-managed Airflow environment (Cloud Composer) set up for our needs.
I have read a lot on Stack Overflow, but everyone says to restart your webserver, which I don't think I can do, as it is managed by Google.
Every time we deploy a new DAG into the environment, the DAG appears to be missing at first.
It takes a few hours after deployment before everything works fine, but until then it is hard for me to understand what is wrong and how to fix it.
Could you please help me get rid of this issue permanently?
Please let me know if any more information is required. Thanks in advance.
Every dag_dir_list_interval, the DagFileProcessorManager process lists the scripts in the dags folder; then, if a script is new or was last processed more than min_file_process_interval ago, it creates a DagFileProcessorProcess for that file to process it and generate its DAGs.
You can check what values you have for these settings, and reduce them to get the DAG into the metadata database as soon as possible.
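On Cloud Composer you cannot edit airflow.cfg directly, but configuration overrides can be set on the environment. A sketch with gcloud, assuming the environment name and location are placeholders and that Composer does not block these particular overrides in your version:
gcloud composer environments update my-environment --location us-central1 \
    --update-airflow-configs=scheduler-min_file_process_interval=30,scheduler-dag_dir_list_interval=60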
Related
I placed a DAG file in the dags folder based on a tutorial, with slight modifications, but it doesn't show up in the GUI or when I run airflow dags list.
Answering my own question: check the Python file for exceptions by running it directly. It turns out an exception in the DAG's Python script, caused by a missing import, kept the DAG from showing up in the list. I note this in case another new user comes across it. To me the moral of the story is that DAG files should be checked by running them directly with python whenever they are modified, because otherwise there is no obvious error; the DAG may just disappear from the list.
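For example (the path and file name are placeholders), either of these makes the hidden exception visible; the second prints whatever the DagBag collected as import errors:
python /path/to/dags/my_dag.py
python -c "from airflow.models import DagBag; print(DagBag().import_errors)"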
Currently I am using Airflow version 1.10.10.
Inside the airflow/logs folder there are many folders named after your DAGs, but there is also a folder named scheduler which, when opened, contains folders named in date format (e.g. 2020/07/08), going back to the date when I first started using Airflow. After searching through multiple forums I'm still not sure what these logs are for.
Anyway, the problem is that I keep wondering whether it is okay to delete the contents of the scheduler folder, since it takes up so much space, unlike the rest of the folders that are named after DAGs (I'm assuming that's where the logs of each DAG run are stored). Will deleting the contents of scheduler cause any error or loss of DAG logs?
This might be a silly question, but I want to make sure since Airflow is on a production server. I've tried creating an Airflow instance locally and deleting the scheduler folder contents, and no errors seem to have occurred. Any feedback or shared experience on handling this is welcome.
Thanks in advance.
It contains the logs of the Airflow scheduler, as far as I know. I have only used it once, for a problem with SLAs.
I've been deleting old files in it for over a year and never encountered a problem. This is my command to delete old scheduler log files:
find /etc/airflow/logs/scheduler -type f -mtime +45 -delete
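If you want this cleanup to run automatically, a crontab entry along these lines should work (the 45-day retention comes from the command above; the nightly schedule is just an example):
0 3 * * * find /etc/airflow/logs/scheduler -type f -mtime +45 -delete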
I'm having trouble updating a DAG file. Airflow still has an old version of my DAG. I added a task, but it doesn't seem to be updated when I check the log and the UI (DAG -> Code).
I have very simple tasks.
Of course I checked the DAG directory path in airflow.cfg and restarted the Airflow webserver/scheduler.
I have no issue running it (but it runs the old DAG file).
This looks like an Airflow bug. A temporary workaround is to delete the task instances from the Airflow metadata DB:
DELETE FROM task_instance WHERE dag_id = '<dag_name>' AND task_id = '<deleted_task_name>';
This should be simpler and less impactful than the resetdb route, which would delete everything, including variables and connections set before.
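If you'd rather not issue raw SQL against the metadata database, a sketch of the same cleanup through Airflow's own SQLAlchemy session looks like this (my_dag and deleted_task are placeholder names):
python -c "from airflow import settings; from airflow.models import TaskInstance; s = settings.Session(); s.query(TaskInstance).filter(TaskInstance.dag_id == 'my_dag', TaskInstance.task_id == 'deleted_task').delete(synchronize_session=False); s.commit()"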
Use a terminal and run the command below soon after changing the DAG file.
airflow initdb
This worked for me.
You can try removing the old .pyc file for that DAG in the dags folder so that it gets generated again.
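For example, assuming the default layout where the DAGs live under $AIRFLOW_HOME/dags, this removes all compiled files so Python regenerates them on the next parse:
find $AIRFLOW_HOME/dags -name "*.pyc" -delete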
The UI is sometimes not up to date for me, but the code is actually there in the DagBag. You can try to:
Use the refresh button to see if the code has refreshed.
Use a higher version (1.8+); this happened to me when I used 1.7.x, but on 1.8+ it seems much better once you refresh the DAG in the UI.
You can also use airflow test to see if the code is in place (example below), and try the advice from #Him as well.
Same thing happened to me.
In the end the best thing was to run airflow resetdb, add the connections and import the variables again, then run airflow initdb and start the scheduler back up again.
I don't know why this happens. Does anybody know? It doesn't seem easy to add tasks or change names once compiled. Removing *.pyc files or the logs folder did not work for me.
On the DAG page of the Airflow webserver, delete the DAG. It will delete the record in the database. After a while the DAG will appear on the page again, but the old task_id is removed.
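If you are on Airflow 1.10.x and prefer the command line, the same deletion can be done with the delete_dag subcommand (my_dag is a placeholder):
airflow delete_dag my_dag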
Is there any way to reload the jobs without having to restart the server?
In your airflow.cfg, you have these two settings (both in seconds) to control this behavior:
# after how much time a new DAGs should be picked up from the filesystem
min_file_process_interval = 0
dag_dir_list_interval = 60
You might have to reload the webserver, scheduler and workers for your new configuration to take effect.
I had the same question, and didn't see this answer yet. I was able to do it from the command line with the following:
python -c "from airflow.models import DagBag; d = DagBag();"
When the webserver is running, it refreshes DAGs every 30 seconds or so by default, but this will refresh them in between if necessary.
DAGs should be reloaded when you update the associated Python file.
If they are not, first try to manually refresh them in the UI by clicking the button that looks like a recycle symbol.
If that doesn't work, delete all the .pyc files in the dags folder.
Usually though, when I save the python file the dag gets updated within a few moments.
I'm pretty new to Airflow. I had initially used sample code, which got picked up right away, and then edited it to call my own code.
This was ultimately raising an error, but I only found that out once I had deleted the DAG containing the example code in the Airflow webserver's UI (the trash button).
Once it was deleted, the UI showed me the error that was preventing the new DAG from loading.
I am new to Autosys and looking for a way to achieve the reverse of file watching.
I am looking for a job similar to a file watcher that keeps running while the file is present and only succeeds once the file is not present. The dependent job should run only if the file is not present.
I have a few questions:
1) I am not sure if I can achieve this with FileWatcher.
2) Does a FileWatcher job stop running after it finds the file?
3) Is there any way to negate the success condition for a FileWatcher job?
Or if anyone can point me to some good, extensive documentation on FileWatcher, that would help too.
Thanks
You cannot achieve this with a FileWatcher job alone.
A FileWatcher job stops running and goes to the Success state as soon as it finds the file in the defined path. There is no way to negate its Success state.
This is because it is assumed that such functionality can easily be implemented with scripts.
You can achieve what you want with a batch script (Windows) or a shell script (Unix/Linux). The Autosys job triggers a script that checks for the file in the intended location, sleeps for a while (say 20 seconds), checks again, and finally exits with code 0 if it no longer finds the file, or with some other exit code if the file still hasn't moved after a certain number of checks.
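A minimal shell sketch of such a "reverse file watcher"; the path, the 20-second sleep, and the retry limit are placeholder values to adjust:
#!/bin/sh
# Succeed (exit 0) only once the watched file is gone; fail (exit 1) if it never disappears.
FILE=/path/to/watched/file   # placeholder path
SLEEP_SECS=20                # wait between checks
MAX_CHECKS=90                # give up after this many checks
i=0
while [ -e "$FILE" ]; do
  i=$((i + 1))
  if [ "$i" -ge "$MAX_CHECKS" ]; then
    exit 1                   # file never moved - let the Autosys job fail
  fi
  sleep "$SLEEP_SECS"
done
exit 0                       # file is gone - downstream jobs can run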
You can make downstream jobs depend on this Autosys job as required.
Let me know if more clarification is needed on this.