How to use apscheduler to trigger job every minute between specific times? - python-3.6

I'm using the Python apscheduler module. Is it possible to trigger a job every minute between 7:30 AM and 11:30 PM every day?
I've tried the following solution, but I don't know how to add a constraint on the minutes.
from apscheduler.schedulers.background import BackgroundScheduler
def job_function():
    print("Hello World")
sched = BackgroundScheduler()
sched.add_job(job_function, 'cron', hour='7-23', minute='*')
sched.start()

You can use the new OrTrigger to combine several CronTriggers to cover the whole time span:
from apscheduler.triggers.combining import OrTrigger
from apscheduler.triggers.cron import CronTrigger
trigger = OrTrigger([
    CronTrigger(hour='7', minute='30-59'),
    CronTrigger(hour='8-22', minute='*'),
    CronTrigger(hour='23', minute='0-30')
])
sched.add_job(job_function, trigger)
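As a quick sanity check that the three cron expressions above cover 7:30 through 23:30 with no gaps, here is a plain-Python sketch. It does not use APScheduler at all; in_window is a hypothetical helper that only mirrors the hour and minute ranges of the three CronTriggers.

```python
from datetime import time

def in_window(t: time) -> bool:
    """Return True if t matches one of the three cron clauses (hypothetical helper)."""
    if t.hour == 7:
        return 30 <= t.minute <= 59      # mirrors CronTrigger(hour='7', minute='30-59')
    if 8 <= t.hour <= 22:
        return True                      # mirrors CronTrigger(hour='8-22', minute='*')
    if t.hour == 23:
        return 0 <= t.minute <= 30       # mirrors CronTrigger(hour='23', minute='0-30')
    return False

# Every minute of the day on which the combined trigger would fire.
firing_minutes = [
    time(h, m) for h in range(24) for m in range(60) if in_window(time(h, m))
]
```

The list should start at 07:30 and end at 23:30, both inclusive, which is the span asked about in the question.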

Related

How do I update values in streamlit on a schedule?

I have a simple streamlit app that is meant to show the latest data. I want it to refresh the data every 5 seconds, but right now the only way I have found to do that is via st.experimental_rerun; this is the core code:
import streamlit as st
import time
current_time = int(time.time())
if 'last_run' not in st.session_state:
    st.session_state['last_run'] = current_time
@st.experimental_singleton
def load_data():
    ...
    return data
data = load_data()
if current_time > st.session_state['last_run']+5: # check every 5 seconds
    load_data.clear() # clear cache
    st.session_state['last_run'] = current_time
    st.experimental_rerun()
However, the st.experimental_rerun() makes the user experience terrible; are there any other ideas?
You can try using schedule and st.empty().
Example:
import time
import streamlit as st
from schedule import every, repeat, run_pending
def load_data():
    ...
    return data
def main():
    data = load_data()
    # Do something with data
with st.empty():
    # register refresh_data to run every 5 seconds
    @repeat(every(5).seconds)
    def refresh_data():
        main()
    # poll the scheduler once per second
    while True:
        run_pending()
        time.sleep(1)

Asyncio and TQDM Python

I am trying to make a progress bar using tqdm for an asyncio function. I've tried following the guidance at:
tqdm for asyncio
and
asyncio with tqdm
and
tqdm and coroutines
Here's my code, which is running in a jupyter notebook:
import pandas as pd
from pandas_datareader import data as pdr
import asyncio
# import tqdm
from tqdm.asyncio import tqdm
async def get_prices(index, row):
    try:
        prices = pdr.get_data_yahoo(row['Symbol'], row['Start'], row['End'])
    except Exception as e:
        print('Error',e,row['Symbol'],row['Date'])
        return
    prices['Symbol'] = row['Symbol']
    prices['Key'] = index
    return prices
async def get_stock_data(df):
    # data = await asyncio.gather(*[get_prices(index, row) for index, row in df.iterrows()])
    # data = [await f for f in tqdm(asyncio.as_completed([get_prices(index, row) for index, row in df.iterrows()]), total=len(df)]
    flist = [get_prices(index, row) for index, row in df.iterrows()]
    data = [await f for f in tqdm.as_completed(flist, total=len(df))]
    return data
stocks = ['IBM', 'AAPL', 'C', 'ACTG', 'ACVA', 'ACWI', 'ACWX', 'ACXP', 'ADAG', 'ADAL', 'IBET', 'IBEX', 'IBKR', 'IBOC', 'IBRX', 'IBTB', 'IBTD', 'IBTE', 'IBTF', 'IBTG', 'IBTH', 'IBTI', 'IBTJ', 'IBTK', 'IBTL', 'IBTM', 'IBTX', 'ICAD', 'ICCC', 'ICCH', 'ICCM', 'ICFI', 'ICHR', 'ICLK', 'ICLN', 'ICLR', 'ICMB', 'ICPT', 'ICUI', 'ICVX', 'IDAI', 'IDBA', 'IDCC', 'IDEX', 'IDLB', 'IDN', 'IDRA', 'IDXX', 'IDYA', 'IEA', 'IEAWW', 'IEF', 'IEI', 'IEP', 'IESC', 'IEUS', 'IFBD', 'IFGL', 'IFRX', 'IFV', 'IGAC', 'IGACU', 'IGACW', 'IGF', 'IGIB', 'IGIC', 'IGICW', 'IGMS', 'IGNY', 'IGNYU', 'IGNYW', 'IGOV', 'IGSB', 'IGTA', 'IGTAR', 'IGTAU', 'IGTAW', 'IHRT', 'IHYF', 'III', 'IIII', 'IIIIU', 'IIIIW', 'IIIV', 'IINN', 'IINNW', 'IIVI', 'IIVIP', 'IJT', 'IKNA', 'IKT', 'ILAG', 'ILMN', 'ILPT', 'IMAB', 'IMAC', 'IMACW', 'IMAQ', 'IMAQR', 'IMAQU', 'IMAQW', 'IMBI', 'IMBIL', 'IMCC', 'IMCR', 'IMCV', 'IMGN', 'IMGO', 'IMKTA', 'IMMP', 'IMMR', 'IMMX', 'IMNM', 'IMOS', 'IMPL', 'IMPP', 'IMPPP', 'IMRA', 'IMRN', 'IMRX', 'IMTE', 'IMTX', 'IMTXW', 'IMUX', 'IMV', 'IMVT', 'IMXI', 'INAB', 'INBK', 'INBKZ', 'INBX', 'INCR', 'INCY', 'INDB', 'INDI', 'INDIW', 'INDP', 'INDT', 'INDY', 'INFI', 'INFN', 'INGN', 'INKA', 'INKAU', 'INKAW', 'INKT', 'INM', 'INMB', 'INMD', 'INNV', 'INO', 'INOD', 'INPX', 'INSE', 'INSG', 'INSM', 'INTA', 'INTC', 'INTE', 'INTEU', 'INTEW', 'INTG', 'INTR', 'INTU', 'INTZ', 'INVA', 'INVE', 'INVO', 'INVZ', 'INVZW', 'INZY', 'IOAC', 'IOACU', 'IOACW', 'IOBT', 'IONM', 'IONR', 'IONS', 'IOSP', 'IOVA', 'IPA', 'IPAR', 'IPAX', 'IPAXU', 'IPAXW', 'IPDN', 'IPGP', 'IPHA', 'IPKW', 'IPSC', 'IPVI', 'IPVIU', 'IPVIW', 'IPW', 'IPWR', 'IPX', 'IQ', 'IQMD', 'IQMDU', 'IQMDW', 'IRAA', 'IRAAU', 'IRAAW', 'IRBT', 'IRDM', 'IREN', 'IRIX', 'IRMD', 'IROQ', 'IRTC', 'IRWD', 'ISAA', 'ISDX', 'ISEE']
df_stocks = pd.DataFrame(stocks, columns=['Symbol'])
df_stocks['Start'] = '9/1/2022'
df_stocks['End'] = '9/11/2022'
data = pd.concat([d for d in await get_stock_data(df_stocks)])
data.dropna(inplace=True)
data.to_csv('../output/stockprices.csv', sep='\t')
data
The code works, but the progress bar does not. I get this output which does not change:
0%| | 0/214 [00:00<?, ?it/s]
I have also tried from tqdm.autonotebook import tqdm, but that gives the same result.
I am sure I'm doing something boneheaded, but am unable to solve this on my own.
TL;DR
The problem is that the async function get_prices() never awaits anything, so it is not really asynchronous: it runs synchronously. You will definitely want to understand what a coroutine is before reading the rest of this.
Since the pandas-datareader library does not define any asynchronous functions, you are better off either offloading the pdr.get_data_xxx call to a thread, or giving up on concurrency and using the synchronous tqdm.
Explanation
In terms of the problem, what you wrote can be simplified as follows:
import asyncio
import random
import time
from tqdm.asyncio import tqdm
async def fake_async_task():
    time.sleep(0.5 + random.random()) # <- notice there is no await & awaitable!
async def main():
    tasks = [fake_async_task() for _ in range(10)]
    _ = [await task_ for task_ in tqdm.as_completed(tasks, total=len(tasks))]
# asyncio.run(main()) --> when not in Jupyter
await main() # --> in Jupyter
If you run this, you'll notice that the progress also seemingly jumps from 0% to 100% just before it ends, and that the total execution takes much longer than 0.5 + α seconds.
0%| | 0/10 [00:00<?, ?it/s]
The reason behind it is fairly involved. Passing a fake async function (an async function with no await) is outside the documented usage, so explaining it requires looking inside the library code.
To oversimplify the flow, at the cost of accuracy and proper terminology:
We feed 10 fake_async_task() coroutines to tqdm.asyncio.tqdm.as_completed.
tqdm.asyncio.tqdm.as_completed is merely a wrapper around asyncio.as_completed, so tqdm passes all the given awaitables to it and then waits for results.
asyncio.as_completed schedules all the given awaitables for execution, then schedules an awaitable named _wait_for_one to collect the results.
The first scheduled fake_async_task() starts running and continues until it hits the next await keyword.
But we haven't put in any await keywords, so the coroutine runs from start to end without pausing.
The same thing happens for the other 9 scheduled fake_async_task() coroutines, while _wait_for_one is still patiently waiting for its turn.
When it's finally _wait_for_one's turn, all the tasks are already done, so the results are yielded faster than the human eye can see, and so are the progress bar's updates.
That's why the total execution time was additive: the code never actually achieved any concurrency.
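The ordering described in the steps above can be observed without tqdm. The following is a minimal sketch using only the standard library; blocking_task, cooperative_task and run_all are hypothetical names chosen for illustration, and asyncio.gather stands in for as_completed since only the start/end ordering matters here.

```python
import asyncio
import time

async def blocking_task(name, log):
    log.append(f"{name} start")
    time.sleep(0.05)              # no await: blocks the event loop until done
    log.append(f"{name} end")

async def cooperative_task(name, log):
    log.append(f"{name} start")
    await asyncio.sleep(0.05)     # await: hands control back to the event loop
    log.append(f"{name} end")

async def run_all(task_fn):
    log = []
    # Schedule two tasks and wait for both; log records the order of events.
    await asyncio.gather(*(task_fn(name, log) for name in "AB"))
    return log

# Outside Jupyter; inside a notebook use `await run_all(...)` instead.
blocking_log = asyncio.run(run_all(blocking_task))
cooperative_log = asyncio.run(run_all(cooperative_task))
```

With the blocking version, task A should finish completely before B even starts; with the cooperative version, both tasks start before either one ends.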
Running functions like fake_async_task() is simply not something either the tqdm authors or the asyncio authors had in mind. You would usually write code like this instead:
import asyncio
import random
from tqdm.asyncio import tqdm
async def task():
    await asyncio.sleep(0.5 + random.random()) # <- await + something that's awaitable
    # adding random val to prevent it finishing altogether,
    # or progress bar will seemingly jump from 0 to 100 again.
async def main():
    tasks = [task() for _ in range(10)]
    _ = [await task_ for task_ in tqdm.as_completed(tasks, total=len(tasks))]
# asyncio.run(main())
await main()
This now prints the progress as we wanted (or rather, slowly enough that we can see it). The total execution time is also 0.5 + α seconds, so the tasks really do run concurrently.
30%|███ | 3/10 [00:00<00:01, 3.98it/s]
Alternatives
If the function you want to use has no async variant but is I/O-bound rather than CPU-bound, you can offload it to another thread to get concurrency while still using the asynchronous APIs. Note that asyncio.to_thread requires Python 3.9 or later.
import asyncio
import random
import time
from tqdm.asyncio import tqdm
def io_intensive_sync_task():
    time.sleep(0.5 + random.random())
async def main():
    tasks = [asyncio.to_thread(io_intensive_sync_task) for _ in range(10)]
    _ = [await task_ for task_ in tqdm.as_completed(tasks, total=len(tasks))]
    # can skip the total param as tqdm internally uses len if not provided
# asyncio.run(main())
await main()
Which will run just as the previous example does.
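On Python versions older than 3.9, where asyncio.to_thread is not available, loop.run_in_executor with a ThreadPoolExecutor should give the same effect. The sketch below assumes a hypothetical fetch_prices function as a stand-in for the blocking pdr.get_data_yahoo call, and prints a plain counter instead of a tqdm bar to stay within the standard library.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_prices(symbol):
    """Hypothetical stand-in for a blocking, I/O-bound call."""
    time.sleep(0.2)
    return symbol, len(symbol)

async def fetch_all(symbols):
    loop = asyncio.get_running_loop()
    results = []
    with ThreadPoolExecutor(max_workers=8) as pool:
        # Each blocking call runs in a worker thread and is exposed as an awaitable.
        futures = [loop.run_in_executor(pool, fetch_prices, s) for s in symbols]
        for done, fut in enumerate(asyncio.as_completed(futures), start=1):
            results.append(await fut)
            print(f"{done}/{len(symbols)} done")
    return results

start = time.perf_counter()
# Outside Jupyter; inside a notebook use `await fetch_all(...)` instead.
results = asyncio.run(fetch_all(["IBM", "AAPL", "C", "INTC"]))
elapsed = time.perf_counter() - start
```

Because the four calls overlap in separate threads, the elapsed time should be close to a single call's duration rather than the sum of all four.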

combine BranchPythonOperator and PythonVirtualenvOperator

I have a PythonVirtualenvOperator which reads some data from a database - if there is no new data, then the DAG should end there, otherwise it should call additional tasks e.g
#dag.py
load_data >>[if_data,if_no_data]>>another_task>>last_task
I understand that it can be done using BranchPythonOperator, but I can't see how I can combine the venv operator and the branch operator.
Is it doable?
This can be solved using XCom.
load_data can push the number of records it processed (the new data).
Your pipeline can be:
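from airflow.operators.python import BranchPythonOperator # Airflow 2.x; on 1.10 use airflow.operators.python_operator
def choose(**context):
    # pull the record count that load_data pushed to XCom
    value = context['ti'].xcom_pull(task_ids='load_data')
    if int(value)>0:
        return 'if_data'
    return 'if_no_data'
branch = BranchPythonOperator(
    task_id='branch_task',
    provide_context=True, # Remove this line if Airflow>=2.0.0
    python_callable=choose)
load_data >> branch >>[if_data,if_no_data]>>another_task>>last_task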
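The branching logic can be tried out locally without Airflow. In this sketch, FakeTaskInstance is a hypothetical stand-in that only imitates xcom_pull, and the None check is an added assumption: it covers the case where load_data pushed nothing, in which int(value) would otherwise raise a TypeError.

```python
class FakeTaskInstance:
    """Hypothetical stand-in for Airflow's TaskInstance, for local testing only."""

    def __init__(self, xcoms):
        self._xcoms = xcoms

    def xcom_pull(self, task_ids):
        # Return the stored value for the task, or None if nothing was pushed.
        return self._xcoms.get(task_ids)

def choose(**context):
    value = context["ti"].xcom_pull(task_ids="load_data")
    # Treat a missing XCom the same as "no new records".
    if value is not None and int(value) > 0:
        return "if_data"
    return "if_no_data"

with_data = choose(ti=FakeTaskInstance({"load_data": 12}))
without_data = choose(ti=FakeTaskInstance({"load_data": 0}))
nothing_pushed = choose(ti=FakeTaskInstance({}))
```

The returned string is the task_id that BranchPythonOperator should follow, so it has to match the task_id of if_data or if_no_data in the DAG.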

Daily_schedule triggered runs and backfill runs have different date partition

I have a @daily_schedule triggered daily at 3 minutes past 12 AM.
When triggered by the scheduled tick at '2021-02-16 00:03:00':
The date input shows '2021-02-15 00:00:00', partition tagged as '2021-02-15'
Whereas if triggered via backfill for partition '2021-02-16':
The date input shows '2021-02-16 00:00:00', partition tagged as '2021-02-16'
Why does the scheduled tick fill the partition for the day before? Is there an option to use the datetime of execution instead (without using a cron @schedule)? This discrepancy is confusing when I run queries using the timestamp for exact dates.
P.S. I have checked that the scheduled run and the backfill run use the same timezone.
from datetime import time
from dagster import daily_schedule, pipeline, solid
@solid()
def test_solid(_, date):
    _.log.info(f"Input date: {date}")
@pipeline()
def test_pipeline():
    test_solid()
@daily_schedule(
    pipeline_name="test_pipeline",
    execution_timezone="Asia/Singapore",
    start_date=START_DATE,
    end_date=END_DATE,
    execution_time=time(0, 3), # 00:03; a literal like 03 is a syntax error in Python 3
    # should_execute=four_hourly_fitler
)
def test_schedule_daily(date):
    timestamp = date.strftime("%Y-%m-%d %X")
    return {
        "solids": {
            "test_solid": {
                "inputs": {
                    "date": {
                        "value": timestamp
                    }
                }
            }
        }
    }
Sorry for the trouble here - the underlying assumption that the system is making here is that for schedules on pipelines that are partitioned by date, you don't fill in the partition for a day until that day has finished (i.e. the job filling in the data for 2/15 wouldn't run until the next day on 2/16). This is a common pattern in scheduled ETL jobs, but you're completely right that it's not a given that all schedules will want this behavior, and this is good feedback that we should make this use case easier.
It is possible to make a schedule for a partition in the way that you want, but it's more cumbersome. It would look something like this:
from dagster import PartitionSetDefinition, date_partition_range, create_offset_partition_selector
def partition_run_config(date):
    timestamp = date.strftime("%Y-%m-%d %X")
    return {
        "solids": {
            "test_solid": {
                "inputs": {
                    "date": {
                        "value": timestamp
                    }
                }
            }
        }
    }
test_partition_set = PartitionSetDefinition(
    name="test_partition_set",
    pipeline_name="test_pipeline",
    partition_fn=date_partition_range(start=START_DATE, end=END_DATE, inclusive=True, timezone="Asia/Singapore"),
    run_config_fn_for_partition=partition_run_config,
)
test_schedule_daily = (
    test_partition_set.create_schedule_definition(
        "test_schedule_daily",
        "3 0 * * *",
        execution_timezone="Asia/Singapore",
        partition_selector=create_offset_partition_selector(lambda d: d.subtract(minutes=3)),
    )
)
This is pretty similar to @daily_schedule's implementation; it just uses a different function for mapping the schedule execution time to a partition (subtracting 3 minutes instead of 3 minutes and 1 day; that's the create_offset_partition_selector part).
I'll file an issue for an option to customize the mapping for the partitioned schedule decorators, but something like that may unblock you in the meantime. Thanks for the feedback!
Just an update on this: We added a 'partition_days_offset' parameter to the 'daily_schedule' decorator (and a similar parameter to the other schedule decorators) that lets you customize this behavior. The default is still to go back 1 day, but setting partition_days_offset=0 will give you the behavior you were hoping for where the execution day is the same as the partition day. This should be live in our next weekly release on 2/18.
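To make the mapping concrete, here is a small sketch of how a schedule tick ends up in a partition under the two settings described above. It uses only the standard library; partition_for is a hypothetical helper that imitates the described behaviour and is not Dagster's actual implementation.

```python
from datetime import datetime, timedelta

def partition_for(execution_time: datetime, partition_days_offset: int = 1) -> str:
    """Hypothetical helper: map a schedule tick to a daily partition key.

    Truncates the tick to midnight (dropping the 00:03 execution time),
    then steps back partition_days_offset days.
    """
    midnight = execution_time.replace(hour=0, minute=0, second=0, microsecond=0)
    return (midnight - timedelta(days=partition_days_offset)).strftime("%Y-%m-%d")

tick = datetime(2021, 2, 16, 0, 3)   # the scheduled tick from the question
default_partition = partition_for(tick)
same_day_partition = partition_for(tick, partition_days_offset=0)
```

With the default offset of 1 the tick at 2021-02-16 00:03 lands in the 2021-02-15 partition, which is what the question observed; with an offset of 0 it lands in 2021-02-16, matching the backfill.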

Creating Dynamic Text of date time that changes every second in Streamlit

I would like to display the output of datetime.now, refreshed every second, in the streamlit web UI.
from datetime import datetime
datetime.now()
# print this output every one second
datetime.datetime(2020, 5, 19, 4, 22, 40, 921985)
What I have already tried
#!/usr/bin/env python3
import streamlit as st
from datetime import datetime
timenow = str(datetime.now())
st.write(timenow)
I suppose it depends on whether you need exactly one second resolution or not, but the solution is approximately:
import time
from datetime import datetime
import streamlit as st
t = st.empty()
while True:
    t.markdown("%s" % str(datetime.now()))
    time.sleep(1)
The while loop keeps the process going forever. By having the st.empty() call outside of the loop, we keep modifying the t variable. On each loop repetition, the value for the markdown string gets overwritten by the datetime.now() argument.
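If you do need the display to change close to each whole second, a fixed time.sleep(1) will slowly drift, because each loop iteration also takes some time. One possible approach, sketched here with a hypothetical helper named seconds_until_next_tick, is to sleep only until the next second boundary.

```python
from datetime import datetime

def seconds_until_next_tick(now: datetime) -> float:
    """Seconds remaining until the next whole-second boundary after `now`."""
    return (1_000_000 - now.microsecond) / 1_000_000

# Example with the timestamp from the question.
delay = seconds_until_next_tick(datetime(2020, 5, 19, 4, 22, 40, 921985))
```

In the loop above you could then call time.sleep(seconds_until_next_tick(datetime.now())) in place of time.sleep(1).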
Display Time With Infinite While Loop AT END OF Streamlit Code
The placeholder code in the while loop is likely not needed. Using t.write or t.markdown as @Randy Zwitch does is fine. I just wanted to show that IF you use this while loop approach, you should put it at the end of your streamlit script. Even then, watch how the time display is interrupted when the button state is updated in the code above the while loop.
import streamlit as st
from datetime import datetime
import time
# Section 1
button = st.button('Button')
button_placeholder = st.empty()
button_placeholder.write(f'button = {button}')
time.sleep(2)
button = False
button_placeholder.write(f'button = {button}')
# Section 2
time_placeholder = st.empty()
while True:
    timenow = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    time_placeholder.write(timenow)
    time.sleep(1)
