Is there a built-in facility or some operator that will run a sensor and negate its status? I am writing a workflow that needs to detect that an object does not exist in order to proceed to eventual success. I have a sensor, but it detects when the object does exist.
For instance, I would like my workflow to detect that an object does not exist. I need almost exactly the S3KeySensor, except that I need its status negated.
The use case you are describing is checking for a key in S3: if it exists, keep waiting; otherwise, continue the workflow. As you mentioned, this is a Sensor use case. The S3Hook has a check_for_key function that checks whether a key exists, so all that is needed is to wrap it in a sensor's poke function.
A simple basic implementation would be:
from typing import TYPE_CHECKING, Optional, Sequence, Union

from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.sensors.base import BaseSensorOperator

if TYPE_CHECKING:
    from airflow.utils.context import Context


class S3KeyNotPresentSensor(BaseSensorOperator):
    """Waits for a key to not be present in S3."""

    template_fields: Sequence[str] = ('bucket_key', 'bucket_name')

    def __init__(
        self,
        *,
        bucket_key: str,
        bucket_name: Optional[str] = None,
        aws_conn_id: str = 'aws_default',
        verify: Optional[Union[str, bool]] = None,
        **kwargs,
    ):
        super().__init__(**kwargs)
        self.bucket_name = bucket_name
        self.bucket_key = [bucket_key] if isinstance(bucket_key, str) else bucket_key
        self.aws_conn_id = aws_conn_id
        self.verify = verify
        self.hook: Optional[S3Hook] = None

    def poke(self, context: 'Context'):
        # Succeed only when none of the keys are present in the bucket.
        return not any(
            self.get_hook().check_for_key(key, self.bucket_name) for key in self.bucket_key
        )

    def get_hook(self) -> S3Hook:
        """Create and return an S3Hook."""
        if self.hook:
            return self.hook
        self.hook = S3Hook(aws_conn_id=self.aws_conn_id, verify=self.verify)
        return self.hook
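A hedged usage sketch of how the sensor might be wired into a DAG (the DAG id, bucket, key, and timings below are made up for illustration):

from datetime import datetime

from airflow import DAG

with DAG(
    dag_id='wait_for_object_absence',    # made-up DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
) as dag:
    wait_for_absence = S3KeyNotPresentSensor(
        task_id='wait_for_absence',
        bucket_name='my-bucket',         # made-up bucket
        bucket_key='path/to/object',     # made-up key
        mode='reschedule',               # free the worker slot between pokes
        poke_interval=60,
        timeout=60 * 60,
    )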
I ended up going another way. I can use the trigger_rule argument of (any) Task -- by setting it to one_failed or all_failed on the next task I can play around with the desired status.
For example,
file_exists = FileSensor(task_id='exists', timeout=3, poke_interval=1, filepath='/tmp/error', mode='reschedule')
sing = SmoothOperator(task_id='sing', trigger_rule='all_failed')
file_exists >> sing
It requires no added code or operator, but has the possible disadvantage of being somewhat surprising.
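For reference, a sketch of how the snippet above might sit in a full DAG (imports shown; SmoothOperator lives in airflow.operators.smooth, and the DAG id and start date are made up):

from datetime import datetime

from airflow import DAG
from airflow.operators.smooth import SmoothOperator
from airflow.sensors.filesystem import FileSensor

with DAG(
    dag_id='negate_sensor_via_trigger_rule',   # made-up DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
) as dag:
    file_exists = FileSensor(task_id='exists', timeout=3, poke_interval=1,
                             filepath='/tmp/error', mode='reschedule')
    # Runs only if the sensor fails (times out), i.e. the file never appeared.
    sing = SmoothOperator(task_id='sing', trigger_rule='all_failed')
    file_exists >> sing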
Replying to myself in the hope that this may be useful to someone else. Thanks!
Typically I send an asynchronous task with the .apply_async method of the defined task, and then use the task id with the AsyncResult method of the same object to get the task status and, eventually, the result.
But this requires me to know the exact task when more than one task is defined in the same deployment. Is there any way to circumvent this, so that I can get the task status and result (if available) without knowing the exact task?
For example, take this example celery master node code.
#!/usr/bin/env python3
# encoding:utf-8
"""Define the tasks in this file."""
from celery import Celery

redis_host: str = 'redis://localhost:6379/0'

celery = Celery(main='test', broker=redis_host,
                backend=redis_host)

celery.conf.CELERY_TASK_SERIALIZER = 'pickle'
celery.conf.CELERY_RESULT_SERIALIZER = 'pickle'
celery.conf.CELERY_ACCEPT_CONTENT = {'json', 'pickle'}


# pylint: disable=unused-argument
@celery.task(bind=True)
def add(self, x: float, y: float) -> float:
    """Add two numbers."""
    return x + y


@celery.task(bind=True)
def multiply(self, x: float, y: float) -> float:
    """Multiply two numbers."""
    return x * y
When I call something like this in a different module
task1 = add.apply_async(args=[2, 3]).id
task2 = multiply.apply_async(args=[2, 3]).id
I get two UUIDs for the tasks. But when checking back on the task status, I need to know which method (add or multiply) is associated with that task id, since I have to call the method on the corresponding object, like this:
status: str = add.AsyncResult(task_id=task1).state
My question is: how can I fetch the state and result armed only with the task id, without knowing whether the task belongs to add, multiply, or any other task defined?
id and state are just properties of AsyncResult objects, and an AsyncResult can be constructed from the task id alone. If you look at the documentation for the AsyncResult class, you will also find the name property, which is exactly what you are asking for.
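A minimal sketch of what that might look like (assuming the celery app object from the question is importable; on newer Celery versions, name is only populated if result_extended = True is set in the configuration):

from celery.result import AsyncResult

# Only the task id is needed; the app knows how to reach the result backend.
res = AsyncResult(task1, app=celery)

print(res.state)    # e.g. 'PENDING', 'SUCCESS', 'FAILURE'
print(res.result)   # the return value, once the task has finished
print(res.name)     # the task name, e.g. 'test.add', if the backend stored it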
In the following code, how can we pass context.args and context to another function, in this case callback_search_msgs?
def search_msgs(update, context):
    print('In TG, args', context.args)
    context.job_queue.run_once(callback_search_msgs, 1, context=context,
                               job_kwargs={'keys': context.args})

def callback_search_msgs(context, keys):
    print('Args', keys)
    chat_id = context.job.context
    print('Chat ID ', chat_id)

def main():
    updater = Updater(token, use_context=True)
    dp = updater.dispatcher
    dp.add_handler(CommandHandler("search_msgs", search_msgs, pass_job_queue=True,
                                  pass_user_data=True))
    updater.start_polling()
    updater.idle()

if __name__ == '__main__':
    main()
A few notes:
job callbacks accept exactly one argument of type CallbackContext. Not two.
the job_kwargs parameter is used to pass keyword arguments to the APScheduler backend, on which JobQueue is built. The way you're trying to use it doesn't work.
if you want to know only the chat_id in the job, you don't have to pass the whole context argument of search_msgs. Just do context.job_queue.run_once(..., context=chat_id,...)
if you want to pass both the chat_id and context.args, you can e.g. pass them as a tuple:
job_queue.run_once(..., context=(chat_id, context.args), ...)
and then retrieve them in the job via chat_id, args = context.job.context (see the sketch after these notes).
Since you're using use_context=True (which is the default in v13+, btw), the pass_* parameters of CommandHandler (or any other handler) have no effect at all.
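Putting the notes together, a rough sketch of how this could look (python-telegram-bot v13-style API assumed; token and the handler names are taken from the question):

from telegram.ext import CallbackContext, CommandHandler, Updater

def search_msgs(update, context: CallbackContext):
    # Pass only what the job needs: the chat id and the command arguments.
    context.job_queue.run_once(
        callback_search_msgs,
        1,
        context=(update.effective_chat.id, context.args),
    )

def callback_search_msgs(context: CallbackContext):
    # Job callbacks receive a single CallbackContext argument.
    chat_id, args = context.job.context
    context.bot.send_message(chat_id, f'Searching messages with args: {args}')

def main():
    updater = Updater(token, use_context=True)
    updater.dispatcher.add_handler(CommandHandler("search_msgs", search_msgs))
    updater.start_polling()
    updater.idle()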
I suggest carefully reading:
The tutorial on JobQueue
The example on JobQueue
The docs of JobQueue
Disclaimer: I'm currently the maintainer of python-telegram-bot
I'm trying to extract vector representations of text using BERT in the transformers library, and have stumbled on the following part of the documentation for the BertModel class:
Can anybody explain this in more detail? A forward pass makes intuitive sense to me (I am trying to get final hidden states, after all), but I can't find any additional information on what "pre and post processing" means in this context.
Thanks up front!
I think this is just general advice about working with PyTorch Modules. The transformers models are nn.Modules, and they require a forward method. However, you should not call model.forward() manually; instead, call model(). The reason is that PyTorch does some things under the hood when you simply call the Module. You can see that in the source code of nn.Module.__call__:
def __call__(self, *input, **kwargs):
    for hook in self._forward_pre_hooks.values():
        result = hook(self, input)
        if result is not None:
            if not isinstance(result, tuple):
                result = (result,)
            input = result
    if torch._C._get_tracing_state():
        result = self._slow_forward(*input, **kwargs)
    else:
        result = self.forward(*input, **kwargs)
    for hook in self._forward_hooks.values():
        hook_result = hook(self, input, result)
        if hook_result is not None:
            result = hook_result
    if len(self._backward_hooks) > 0:
        var = result
        while not isinstance(var, torch.Tensor):
            if isinstance(var, dict):
                var = next((v for v in var.values() if isinstance(v, torch.Tensor)))
            else:
                var = var[0]
        grad_fn = var.grad_fn
        if grad_fn is not None:
            for hook in self._backward_hooks.values():
                wrapper = functools.partial(hook, self)
                functools.update_wrapper(wrapper, hook)
                grad_fn.register_hook(wrapper)
    return result
You'll see that forward is called when necessary.
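For the concrete BERT case, a short sketch of the preferred calling pattern (the checkpoint name is just an example; on older transformers versions the output is a tuple, so use outputs[0] instead of outputs.last_hidden_state):

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

inputs = tokenizer("An example sentence.", return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)              # goes through __call__, so hooks run
    # outputs = model.forward(**inputs)    # works, but bypasses the hook machinery

last_hidden_state = outputs.last_hidden_state  # final hidden states, shape (1, seq_len, hidden_size)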
Just for learning purposes, I tried to use a dictionary as a global variable via an accumulator. The add function works well on its own, but when I run the code and update the dictionary inside the map function, it always comes back empty.
But similar code that sets a list as a global variable works.
class DictParam(AccumulatorParam):
    def zero(self, value=""):
        return dict()

    def addInPlace(self, acc1, acc2):
        acc1.update(acc2)


if __name__ == "__main__":
    sc, sqlContext = init_spark("generate_score_summary", 40)
    rdd = sc.textFile('input')
    # print(rdd.take(5))

    dict1 = sc.accumulator({}, DictParam())

    def file_read(line):
        global dict1
        ls = re.split(',', line)
        dict1 += {ls[0]: ls[1]}
        return line

    rdd = rdd.map(lambda x: file_read(x)).cache()
    print(dict1)
For anyone who arrives at this thread looking for a dict accumulator for PySpark: the accepted solution does not solve the posed problem. The issue is actually in the DictParam defined above: it does not update the original dictionary. This works:
class DictParam(AccumulatorParam):
    def zero(self, value=""):
        return dict()

    def addInPlace(self, value1, value2):
        value1.update(value2)
        return value1
The original code was missing the return value.
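For completeness, a rough end-to-end sketch of the fixed DictParam in use (a local SparkContext and in-memory data are assumed here, in place of init_spark and the input file from the question):

import re

from pyspark import SparkContext
from pyspark.accumulators import AccumulatorParam

class DictParam(AccumulatorParam):
    def zero(self, value=None):
        return dict()

    def addInPlace(self, value1, value2):
        value1.update(value2)
        return value1   # returning the merged dict is the fix

sc = SparkContext("local[*]", "dict_accumulator_demo")
dict1 = sc.accumulator({}, DictParam())

def file_read(line):
    ls = re.split(',', line)
    dict1.add({ls[0]: ls[1]})
    return line

rdd = sc.parallelize(["a,1", "b,2", "c,3"]).map(file_read).cache()
rdd.count()           # an action is required, or the map never runs
print(dict1.value)    # e.g. {'a': '1', 'b': '2', 'c': '3'}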
I believe that print(dict1) simply gets executed before the rdd.map() does.
In Spark, there are 2 types of operations:
transformations, which describe the future computation
and actions, which actually trigger the execution
Accumulators are updated only when some action is executed:
Accumulators do not change the lazy evaluation model of Spark. If they
are being updated within an operation on an RDD, their value is only
updated once that RDD is computed as part of an action.
If you check out the end of this section of the docs, there is an example exactly like yours:
accum = sc.accumulator(0)
def g(x):
    accum.add(x)
    return f(x)
data.map(g)
# Here, accum is still 0 because no actions have caused the `map` to be computed.
So you would need to add some action, for instance:
rdd = rdd.map(lambda x: file_read(x)).cache() # transformation
foo = rdd.count() # action
print(dict1)
Please make sure to check on the details of various RDD functions and accumulator peculiarities because this might affect the correctness of your result. (For instance, rdd.take(n) will by default only scan one partition, not the entire dataset.)
For accumulator updates performed inside actions only, Spark guarantees that each task's update to the accumulator will only be applied once, i.e. restarted tasks will not update the value.
I am having trouble using HashDict within OTP. I would like to use one GenServer process to put and a different one to fetch. When I try to implement this, I can put and fetch items from the HashDict when calling from the same GenServer; it works perfectly (MyServerA in the example below). But when I use one GenServer to put and a different one to fetch, the fetch implementation does not work. Why is this? Presumably it's because I need to pass the HashDict data structure around between the three different processes?
Code example below:
I use a simple call to send some state to MyServerB:
MyServerA.add_update(state)
For MyServerB I have implemented the HashDict as follows:
defmodule MyServerB do
  use GenServer

  def start_link do
    GenServer.start_link(__MODULE__, [], name: __MODULE__)
  end

  def init([]) do
    # Initialise HashDict to store state
    d = HashDict.new
    {:ok, d}
  end

  # Client API
  def add_update(update) do
    GenServer.cast __MODULE__, {:add, update}
  end

  def get_state(window) do
    GenServer.call __MODULE__, {:get, key}
  end

  # Server APIs
  def handle_cast({:add, update}, dict) do
    %{key: key} = update
    dict = HashDict.put(dict, key, some_Value)
    {:noreply, dict}
  end

  def handle_call({:get, some_key}, _from, dict) do
    value = HashDict.fetch!(dict, some_key)
    {:reply, value, dict}
  end
end
So if from another process I use MyServerB.get_state(dict,some_key), I don't seem to be able to return the contents of the HashDict...
UPDATE:
So if I use ETS I have something like this:
def init do
  ets = :ets.new(:my_table, [:ordered_set, :named_table])
  {:ok, ets}
end

def handle_cast({:add, update}, state) do
  update = :ets.insert(:my_table, {key, value})
  {:noreply, ups}
end

def handle_call({:get, some_key}, _from, state) do
  sum = :ets.foldl(fn({{key}, {value}}, acc)
                       when key == some_Key -> value + acc
                     (_, acc) ->
                       acc
                   end, 0, :my_table)
  {:reply, sum, state}
end
So again, the cast works: when I check with Observer I can see the table filling up with my key-value pairs. However, when I try the call it returns nothing again. So I'm wondering if I'm handling the state incorrectly? Any help gratefully received.
Thanks
Your problem is with this statement:
I would like to use one GenServer process to put and a different one to fetch.
In Elixir, processes cannot share state, so you cannot have one process hold data and another process read it directly. You could, for example, store the HashDict in one process and have the other process send it a message asking for data. That would appear to work as you describe, but behind the scenes every transaction would still go through the first process. There are techniques for doing this in a distributed/concurrent fashion so that multiple cores are utilized, but that may be more work than you're looking to do at the moment.
Take a look at ETS, which will allow you to create a public table and access the data from multiple processes.
ETS is the way to go; sharing a HashDict as state between GenServers is not possible.
I really don't know how you are testing your code, but ETS has read and write concurrency set to false by default. If you have no problem with reading or writing concurrently, you can change your init function to:
def init(_args) do
  ets = :ets.new :my_table, [:ordered_set, :named_table,
                             read_concurrency: true,
                             write_concurrency: true]
  {:ok, ets}
end
Hope this helps.