I am trying to schedule my Python script (merchant.py) to run once a week using APScheduler.
For this purpose, I have installed the package, and when I run the following example code with the cron trigger option:
from datetime import datetime
import time
import os
from apscheduler.schedulers.background import BackgroundScheduler

def tick():
    print('Tick! The time is: %s' % datetime.now())

if __name__ == '__main__':
    scheduler = BackgroundScheduler()
    scheduler.add_job(tick, 'cron', day_of_week='mon', hour=11)
    scheduler.start()
    print('Press Ctrl+{0} to exit'.format('Break' if os.name == 'nt' else 'C'))

    try:
        while True:
            time.sleep(2)
    except (KeyboardInterrupt, SystemExit):
        scheduler.shutdown()
I am getting the following error:
KeyError: 'cron'
LookupError: No trigger by the name "cron" was found
Could you please point out what I am doing wrong here?
I found one solution, so I am sharing it for others:
from apscheduler.schedulers.blocking import BlockingScheduler
from apscheduler.triggers.combining import OrTrigger
from apscheduler.triggers.cron import CronTrigger

sched = BlockingScheduler()
trigger = OrTrigger([
    CronTrigger(day_of_week='thu', hour='07', minute='28')
])
sched.add_job(YOUR_JOB, trigger)

try:
    # BlockingScheduler.start() blocks until the scheduler is shut down,
    # so no separate sleep loop is needed; a Ctrl+C lands here.
    sched.start()
except (KeyboardInterrupt, SystemExit):
    sched.shutdown()
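As a side note, wrapping a single CronTrigger in an OrTrigger is equivalent to passing that CronTrigger on its own. A minimal sketch of the same idea, keeping the YOUR_JOB placeholder and reusing the weekly Monday 11:00 schedule from the question:

from apscheduler.schedulers.blocking import BlockingScheduler
from apscheduler.triggers.cron import CronTrigger

sched = BlockingScheduler()
# Pass the CronTrigger instance directly instead of wrapping it in an OrTrigger.
sched.add_job(YOUR_JOB, CronTrigger(day_of_week='mon', hour=11))

try:
    sched.start()  # blocks until the scheduler is shut down
except (KeyboardInterrupt, SystemExit):
    sched.shutdown()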
Related
Is there a way to trigger a DAG every time the Airflow environment is brought up? This would be helpful for running some tests on the environment.
You can monitor the boot time of the system or process, check the last execution datetime of the DAG, and trigger the DAG accordingly.
You can refer to the following code:
import psutil
from datetime import datetime
from dateutil import parser
from airflow.models import Variable

# Record the last system boot time in an Airflow Variable.
last_reboot = psutil.boot_time()
boot_time = datetime.fromtimestamp(last_reboot)
Variable.set('boot_time', boot_time.strftime("%Y-%m-%d %H:%M:%S"))

# Compare the boot time with the last recorded run of the DAG.
last_run_dt = Variable.get('last_run_end_date', "")
if last_run_dt == "":
    last_run_at = parser.parse('1970-01-01 00:00:00')
else:
    last_run_at = parser.parse(last_run_dt)

if boot_time > last_run_at:
    # Your code to trigger the DAG
    pass
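One possible way to fill in that placeholder (my own assumption, not part of the original answer) is to shell out to the Airflow CLI; the dag_id "my_dag" is hypothetical:

import subprocess

if boot_time > last_run_at:
    # Trigger the DAG through the Airflow CLI; "my_dag" is a placeholder dag_id.
    subprocess.run(["airflow", "dags", "trigger", "my_dag"], check=True)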
I followed this example:
I created the example timetable .py file and put it in $HOME/airflow/plugins.
I created the example DAG file and put it in $HOME/airflow/dags.
After restarting the scheduler and webserver, I get a DAG import error. In the web UI, the last line of the detailed error message is:
airflow.exceptions.SerializationError: Failed to serialize DAG 'example_timetable_dag2': Timetable class 'AfterWorkdayTimetable' is not registered
But if I run airflow plugins, I can see the timetable in the name and source list.
How do I fix this error?
Details of plugins/AfterWorkdayTimetable.py:
from datetime import timedelta
from typing import Optional

from pendulum import Date, DateTime, Time, timezone

from airflow.plugins_manager import AirflowPlugin
from airflow.timetables.base import DagRunInfo, DataInterval, TimeRestriction, Timetable

UTC = timezone("UTC")


class AfterWorkdayTimetable(Timetable):

    def infer_data_interval(self, run_after: DateTime) -> DataInterval:
        weekday = run_after.weekday()
        if weekday in (0, 6):  # Monday and Sunday -- interval is last Friday.
            days_since_friday = (run_after.weekday() - 4) % 7
            delta = timedelta(days=days_since_friday)
        else:  # Otherwise the interval is yesterday.
            delta = timedelta(days=1)
        start = DateTime.combine((run_after - delta).date(), Time.min).replace(tzinfo=UTC)
        return DataInterval(start=start, end=(start + timedelta(days=1)))

    def next_dagrun_info(
        self,
        *,
        last_automated_data_interval: Optional[DataInterval],
        restriction: TimeRestriction,
    ) -> Optional[DagRunInfo]:
        if last_automated_data_interval is not None:  # There was a previous run on the regular schedule.
            last_start = last_automated_data_interval.start
            last_start_weekday = last_start.weekday()
            if 0 <= last_start_weekday < 4:  # Last run on Monday through Thursday -- next is tomorrow.
                delta = timedelta(days=1)
            else:  # Last run on Friday -- skip to next Monday.
                delta = timedelta(days=(7 - last_start_weekday))
            next_start = DateTime.combine((last_start + delta).date(), Time.min).replace(tzinfo=UTC)
        else:  # This is the first ever run on the regular schedule.
            next_start = restriction.earliest
            if next_start is None:  # No start_date. Don't schedule.
                return None
            if not restriction.catchup:
                # If the DAG has catchup=False, today is the earliest to consider.
                next_start = max(next_start, DateTime.combine(Date.today(), Time.min).replace(tzinfo=UTC))
            elif next_start.time() != Time.min:
                # If earliest does not fall on midnight, skip to the next day.
                next_day = next_start.date() + timedelta(days=1)
                next_start = DateTime.combine(next_day, Time.min).replace(tzinfo=UTC)
            next_start_weekday = next_start.weekday()
            if next_start_weekday in (5, 6):  # If next start is in the weekend, go to next Monday.
                delta = timedelta(days=(7 - next_start_weekday))
                next_start = next_start + delta
        if restriction.latest is not None and next_start > restriction.latest:
            return None  # Over the DAG's scheduled end; don't schedule.
        return DagRunInfo.interval(start=next_start, end=(next_start + timedelta(days=1)))


class WorkdayTimetablePlugin(AirflowPlugin):
    name = "workday_timetable_plugin"
    timetables = [AfterWorkdayTimetable]
Details of dags/test_afterwork_timetable.py:
import datetime

from airflow import DAG
from AfterWorkdayTimetable import AfterWorkdayTimetable
from airflow.operators.dummy import DummyOperator

with DAG(
    dag_id="example_workday_timetable",
    start_date=datetime.datetime(2021, 1, 1),
    timetable=AfterWorkdayTimetable(),
    tags=["example", "timetable"],
) as dag:
    DummyOperator(task_id="run_this")
If I run airflow plugins:
name | source
==================================+==========================================
workday_timetable_plugin | $PLUGINS_FOLDER/AfterWorkdayTimetable.py
I had a similar issue.
Either you need to add an __init__.py file, or you can try this to debug your issue:
Get all plugin manager objects:
from airflow import plugins_manager
plugins_manager.initialize_timetables_plugins()
plugins_manager.timetable_classes
I got this result: {'quarterly.QuarterlyTimetable': <class 'quarterly.QuarterlyTimetable'>}
Compare your result with the exception message. If the timetable_classes dictionary has a different plugin name, you should change the plugin file path (or the import) so the two match.
You could also try this inside DAG python file:
from AfterWorkdayTimetable import AfterWorkdayTimetable
from airflow import plugins_manager
print(plugins_manager.as_importable_string(AfterWorkdayTimetable))
This would help you find the name that Airflow tries to use when searching through the timetable_classes dictionary.
You need to register the timetable in the "timetables" array via the plugin interface. See:
https://airflow.apache.org/docs/apache-airflow/stable/plugins.html
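For illustration, a minimal sketch of that registration (MyTimetable is a stand-in for your own Timetable subclass, mirroring the WorkdayTimetablePlugin shown in the question):

from airflow.plugins_manager import AirflowPlugin
from airflow.timetables.base import Timetable

class MyTimetable(Timetable):
    ...  # next_dagrun_info / infer_manual_data_interval go here

class MyTimetablePlugin(AirflowPlugin):
    name = "my_timetable_plugin"
    timetables = [MyTimetable]  # the "timetables" array referred to above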
Encountered the same issue.
These are the steps I followed.
Add the timetable file (custom_tt.py) to the plugins folder.
Make sure the plugins folder has an __init__.py file present.
Change lazy_load_plugins in airflow.cfg to False:
lazy_load_plugins = False
Add the import statement in the DAG file as:
from custom_tt import CustomTimeTable
And use it in the DAG as (a full sketch follows after these steps):
DAG(timetable=CustomTimeTable())
Restart the webserver and scheduler.
Problem fixed.
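A minimal sketch pulling those steps together (custom_tt and CustomTimeTable are the names used in the steps above; the dag_id is illustrative):

# dags/my_dag.py
import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from custom_tt import CustomTimeTable  # imported relative to the plugins folder

with DAG(
    dag_id="custom_timetable_dag",
    start_date=datetime.datetime(2021, 1, 1),
    timetable=CustomTimeTable(),
) as dag:
    DummyOperator(task_id="run_this")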
They have found the resolution to this, but it doesn't seem like they have updated the documentation to reflect the fix just yet.
Your function
def infer_data_interval(self, run_after: DateTime) -> DataInterval:
should be
def infer_manual_data_interval(self, *, run_after: DateTime) -> DataInterval:
See reference:
Apache airflow.timetables.base Documentation
After updating the function with the correct name and extra parameter, everything else should work for you as it did for me.
I was running into this as well. airflow plugins reports that the plugin is registered, and running the DAG script on the command line works fine, but the web UI reports that the plugin is not registered. @Bakuchi's answer pointed me in the right direction.
In my case, the problem was how I was importing the Timetable - airflow apparently expects you to import it relative to the $PLUGINS_FOLDER, not from any other directory, even if that other directory is also on the PYTHONPATH.
For a concrete example:
export PYTHONPATH=/path/to/my/code:$PYTHONPATH
# airflow.cfg
plugins_folder = /path/to/my/code/airflow_plugins
# dag.py
import sys
from airflow_plugins.custom_timetable import CustomTimetable as Bad
from custom_timetable import CustomTimetable as Good
from airflow import plugins_manager
plugins_manager.initialize_timetables_plugins()
print(sys.path) # /path/to/my/code:...:/path/to/my/code/airflow_plugins
print(plugins_manager.as_importable_string(Bad)) # airflow_plugins.custom_timetable.CustomTimetable
print(plugins_manager.as_importable_string(Good)) # custom_timetable.CustomTimetable
print(plugins_manager.timetable_classes) # {'custom_timetable.CustomTimetable': <class 'custom_timetable.CustomTimetable'>}
A bad lookup in plugins_manager.timetable_classes is ultimately what ends up raising the _TimetableNotRegistered error, so the fix is to make the keys match by changing how the timetable is imported.
I submitted a bug report: https://github.com/apache/airflow/issues/21259
I was trying out ZenML, which says it can convert my .py pipeline into an Airflow DAG.
I followed every step here: https://docs.zenml.io/guides/low-level-api/chapter-7, and all of them succeeded.
My pipeline runs well locally, but why can't I see this DAG in the Airflow UI? The UI is totally empty.
The problem seems to be that ZenML copies my .py pipeline, written the ZenML way, and expects it to run in Airflow. In my case this doesn't work. Does anyone know how I can get ZenML to run my code through Airflow successfully?
Here's my ZenML .py code:
import pandas as pd
import numpy as np
import os
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import balanced_accuracy_score

from zenml.pipelines import pipeline
from zenml.steps import step
from zenml.steps.step_output import Output
from zenml.steps.base_step_config import BaseStepConfig


class pipeline_config(BaseStepConfig):
    """Params used in the pipeline."""
    label: str = 'species'


@step
def split_data(config: pipeline_config) -> Output(
    X=pd.DataFrame, y=pd.DataFrame
):
    path_to_csv = os.path.join('~/airflow/data', 'leaf.csv')
    df = pd.read_csv(path_to_csv)
    label = config.label
    y = df[[label]]
    X = df.drop(label, axis=1)
    return X, y


@step
def train_evaltor(
    config: pipeline_config,
    X: pd.DataFrame,
    y: pd.DataFrame
) -> float:
    y = y[config.label]
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=10)
    lgbm = lgb.LGBMClassifier(objective='multiclass', random_state=10)
    metrics_lst = []
    for train_idx, val_idx in folds.split(X, y):
        X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
        X_val, y_val = X.iloc[val_idx], y.iloc[val_idx]
        lgbm.fit(X_train, y_train)
        y_pred = lgbm.predict(X_val)
        cv_balanced_accuracy = balanced_accuracy_score(y_val, y_pred)
        metrics_lst.append(cv_balanced_accuracy)
    avg_performance = np.mean(metrics_lst)
    print(f"Avg Performance: {avg_performance}")
    return avg_performance


@pipeline
def super_mini_pipeline(
    data_spliter,
    train_evaltor
):
    X, y = data_spliter()
    train_evaltor(X=X, y=y)


# run the pipeline
pipeline = super_mini_pipeline(data_spliter=split_data(),
                               train_evaltor=train_evaltor())
pipeline.run()
Okay, so it worked!
The reason is that the words airflow and dag have to be present in the file if safe_mode is on (which it is by default). This is Airflow-specific DAG-discovery logic that can be seen in the Airflow codebase.
So all I did was change the last few lines:
# run the pipeline airflow
pipeline = super_mini_pipeline(data_spliter=split_data(),
                               train_evaltor=train_evaltor())
DAG = pipeline.run()
You can also change the airflow.cfg file and turn safe mode off:
In $HOME/.config/zenml/airflow_root/<UUID>/airflow.cfg
# When discovering DAGs, ignore any files that don't contain the strings ``DAG`` and ``airflow``.
dag_discovery_safe_mode = False
Edit:
There might be another reason: Airflow DAG discovery also relies on the DAG object being in globals(), so maybe we needed to capture it with DAG = pipeline.run(). In any case, it works!
I am trying to use AsyncHTTPClient to GET/POST from a local service that is already running on port 6000, but I keep getting this error:
RuntimeError: Task got bad yield: <tornado.concurrent.Future object at 0x03C9B490>
P.S. I'm using Tornado 4.4.2. This error is fixed in the latest version, but how do I do it in 4.4.2? Please help!
import tornado.ioloop
from tornado.httpclient import AsyncHTTPClient
import asyncio
import tornado
import urllib
from datetime import datetime
import time


async def client(url):
    http_client = AsyncHTTPClient()
    response = await http_client.fetch(url)
    return response.body


async def main():
    http_client = AsyncHTTPClient()
    url = "http://localhost:6000/listings"
    result = await client(url)
    print(result)


if __name__ == "__main__":
    result = asyncio.run(main())
    print(result)
    print(int(time.time() * 1e6))
You can't use asyncio with Tornado prior to version 5.0.
Use Tornado's own ioloop to run your program:
from tornado import ioloop

if __name__ == "__main__":
    result = ioloop.IOLoop.current().run_sync(main)
UPDATE: The above solution will work fine, but, if you want, you can use asyncio with Tornado 4.x. See: tornado.platform.asyncio.AsyncIOMainLoop.
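For completeness, a hedged sketch of that asyncio bridge on Tornado 4.x: AsyncIOMainLoop makes Tornado run on top of the asyncio event loop, while run_sync still drives the coroutine (avoiding the bad-yield error). This assumes the main() coroutine defined in the question:

from tornado import ioloop
from tornado.platform.asyncio import AsyncIOMainLoop

if __name__ == "__main__":
    AsyncIOMainLoop().install()                      # the asyncio loop becomes the Tornado IOLoop
    result = ioloop.IOLoop.current().run_sync(main)  # main() is the coroutine from the question
    print(result)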
I want to automatically download new PDF files as they are published on a collection of online libraries. I tried using the following Python code, but found that it doesn't work with urllib 3. Does anyone know how I can replicate it?
import urllib2

def main():
    download_file("http://mensenhandel.nl/files/pdftest2.pdf")

def download_file(download_url):
    response = urllib2.urlopen(download_url)
    file = open("document.pdf", 'wb')
    file.write(response.read())
    file.close()
    print("Completed")

if __name__ == "__main__":
    main()
Your solution is good; I just made a few minor adjustments. The first thing is that you need to use the content attribute of the response. Below is the code:
import requests

def main():
    download_file("http://mensenhandel.nl/files/pdftest2.pdf")

def download_file(download_url):
    response = requests.get(download_url, stream=True)
    with open("document.pdf", 'wb') as file:
        file.write(response.content)  # the with block closes the file automatically
    print("Completed")

if __name__ == "__main__":
    main()
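If you would rather stay with the standard library instead of requests, a minimal Python 3 sketch using urllib.request (the successor of urllib2) could look like this:

from urllib.request import urlopen

def download_file(download_url, target_path="document.pdf"):
    # urlopen returns a response object that works as a context manager in Python 3.
    with urlopen(download_url) as response, open(target_path, "wb") as file:
        file.write(response.read())
    print("Completed")

if __name__ == "__main__":
    download_file("http://mensenhandel.nl/files/pdftest2.pdf")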