Why should I call a BERT module instance rather than the forward method?

I'm trying to extract vector representations of text using BERT in the transformers library, and have stumbled on a note in the documentation for the BertModel class: it says one should call the module instance rather than forward(), since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Can anybody explain this in more detail? A forward pass makes intuitive sense to me (I am trying to get the final hidden states, after all), but I can't find any additional information on what "pre and post processing" means in this context.
Thanks up front!

I think this is just general advice about working with PyTorch Modules. The transformers models are nn.Module subclasses, so they implement a forward method. However, you should not call model.forward() manually; call model() instead. The reason is that PyTorch does some extra work under the hood in __call__, before and after dispatching to forward. You can see this in the source code:
def __call__(self, *input, **kwargs):
    for hook in self._forward_pre_hooks.values():
        result = hook(self, input)
        if result is not None:
            if not isinstance(result, tuple):
                result = (result,)
            input = result
    if torch._C._get_tracing_state():
        result = self._slow_forward(*input, **kwargs)
    else:
        result = self.forward(*input, **kwargs)
    for hook in self._forward_hooks.values():
        hook_result = hook(self, input, result)
        if hook_result is not None:
            result = hook_result
    if len(self._backward_hooks) > 0:
        var = result
        while not isinstance(var, torch.Tensor):
            if isinstance(var, dict):
                var = next((v for v in var.values() if isinstance(v, torch.Tensor)))
            else:
                var = var[0]
        grad_fn = var.grad_fn
        if grad_fn is not None:
            for hook in self._backward_hooks.values():
                wrapper = functools.partial(hook, self)
                functools.update_wrapper(wrapper, hook)
                grad_fn.register_hook(wrapper)
    return result
You'll see that forward is called there when necessary, surrounded by the handling of the registered pre- and post-forward hooks (the "pre and post processing" the documentation refers to).
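To make the difference concrete, here is a minimal sketch (assuming torch and transformers are installed; bert-base-uncased is used purely for illustration). It registers a forward hook: calling the model instance fires the hook, while calling forward() directly skips it.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def log_hook(module, inputs, output):
    print('forward hook fired')

model.register_forward_hook(log_hook)

inputs = tokenizer("Hello world", return_tensors='pt')
with torch.no_grad():
    out1 = model(**inputs)          # prints 'forward hook fired'
    out2 = model.forward(**inputs)  # runs, but the hook is silently skipped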

Related

How to negate Airflow sensor task result?

Is there a built-in facility or some operator that will run a sensor and negate its status? I am writing a workflow that needs to detect that an object does not exist in order to proceed to eventual success. I have a sensor, but it detects when the object does exist.
For instance, I would like my workflow to detect that an object does not exist. I need almost exactly S3KeySensor, except that I need to negate its status.
The use case you are describing is: check for a key in S3; if it exists, wait; otherwise continue the workflow. As you mentioned, this is a Sensor use case. The S3Hook has a check_for_key function that checks whether a key exists, so all that's needed is to wrap it in a Sensor's poke function.
A simple basic implementation would be:
from typing import Optional, Sequence, Union

from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.sensors.base import BaseSensorOperator
from airflow.utils.context import Context


class S3KeyNotPresentSensor(BaseSensorOperator):
    """Waits for a key to not be present in S3."""

    template_fields: Sequence[str] = ('bucket_key', 'bucket_name')

    def __init__(
        self,
        *,
        bucket_key: str,
        bucket_name: Optional[str] = None,
        aws_conn_id: str = 'aws_default',
        verify: Optional[Union[str, bool]] = None,
        **kwargs,
    ):
        super().__init__(**kwargs)
        self.bucket_name = bucket_name
        self.bucket_key = [bucket_key] if isinstance(bucket_key, str) else bucket_key
        self.aws_conn_id = aws_conn_id
        self.verify = verify
        self.hook: Optional[S3Hook] = None

    def poke(self, context: Context) -> bool:
        # Succeed only once none of the keys are present.
        return not any(
            self.get_hook().check_for_key(key, self.bucket_name) for key in self.bucket_key
        )

    def get_hook(self) -> S3Hook:
        """Create and return an S3Hook."""
        if self.hook:
            return self.hook
        self.hook = S3Hook(aws_conn_id=self.aws_conn_id, verify=self.verify)
        return self.hook
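For completeness, here is a hedged usage sketch wiring the sensor into a DAG (assuming Airflow 2.x; the bucket, key, and task names are made up for illustration):
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator  # DummyOperator on older Airflow

with DAG(dag_id='wait_for_key_absence', start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
    key_absent = S3KeyNotPresentSensor(
        task_id='key_absent',
        bucket_key='incoming/lock.json',
        bucket_name='my-bucket',
        poke_interval=60,
        timeout=60 * 60,
        mode='reschedule',  # frees the worker slot between pokes
    )
    proceed = EmptyOperator(task_id='proceed')

    key_absent >> proceed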
I ended up going another way: I can use the trigger_rule argument of any task. By setting it to one_failed or all_failed on the downstream task, I can invert the sensor's effective status.
For example,
file_exists = FileSensor(task_id='exists', timeout=3, poke_interval=1, filepath='/tmp/error', mode='reschedule')
sing = SmoothOperator(task_id='sing', trigger_rule='all_failed')
file_exists >> sing
It requires no added code or operator, but has the possible disadvantage of being somewhat surprising.
Replying to myself in the hope that this may be useful to someone else. Thanks!

Unable to pass args to context.job_queue.run_once in Python Telegram bot API

In the following code, how can I pass context.args and context to another function, in this case callback_search_msgs?
from telegram.ext import Updater, CommandHandler

def search_msgs(update, context):
    print('In TG, args', context.args)
    context.job_queue.run_once(callback_search_msgs, 1, context=context,
                               job_kwargs={'keys': context.args})

def callback_search_msgs(context, keys):
    print('Args', keys)
    chat_id = context.job.context
    print('Chat ID ', chat_id)

def main():
    updater = Updater(token, use_context=True)
    dp = updater.dispatcher
    dp.add_handler(CommandHandler("search_msgs", search_msgs, pass_job_queue=True,
                                  pass_user_data=True))
    updater.start_polling()
    updater.idle()

if __name__ == '__main__':
    main()
A few notes:
job callbacks accept exactly one argument of type CallbackContext, not two.
the job_kwargs parameter is used to pass keyword arguments to the APScheduler backend that JobQueue is built on. The way you're trying to use it doesn't work.
if you only need the chat_id in the job, you don't have to pass the whole context argument of search_msgs. Just do context.job_queue.run_once(..., context=chat_id, ...)
if you want to pass both the chat_id and context.args, you can e.g. pass them as a tuple:
job_queue.run_once(..., context=(chat_id, context.args), ...)
and then retrieve them in the job via chat_id, args = context.job.context (see the sketch after these notes).
Since you're using use_context=True (which is the default in v13+, btw), the pass_* parameters of CommandHandler (or any other handler) have no effect at all.
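Putting those notes together, a minimal hedged sketch (assuming python-telegram-bot v13.x and that token is defined elsewhere) could look like this:
from telegram.ext import Updater, CommandHandler

def search_msgs(update, context):
    chat_id = update.effective_chat.id
    # pass both the chat_id and the command arguments as a tuple
    context.job_queue.run_once(callback_search_msgs, 1, context=(chat_id, context.args))

def callback_search_msgs(context):
    # job callbacks receive exactly one CallbackContext argument
    chat_id, keys = context.job.context
    print('Chat ID', chat_id, 'Args', keys)

def main():
    updater = Updater(token, use_context=True)
    updater.dispatcher.add_handler(CommandHandler("search_msgs", search_msgs))
    updater.start_polling()
    updater.idle()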
I suggest carefully reading:
The tutorial on JobQueue
The example on JobQueue
The docs of JobQueue
Disclaimer: I'm currently the maintainer of python-telegram-bot

Why does this asynchronous code not break out of the while loop?

I use Tornado's asynchronous HTTP client, but the code below doesn't work: the while loop never exits.
import time

from tornado.concurrent import Future
from tornado.httpclient import AsyncHTTPClient

def async_fetch_future(url):
    http_client = AsyncHTTPClient()
    my_future = Future()
    fetch_future = http_client.fetch(url)
    fetch_future.add_done_callback(
        lambda f: my_future.set_result(f.result()))
    return my_future

future = async_fetch_future(url)
while not future.done():
    print('.....')

print(future.result())
You must run the event loop to allow asynchronous things to happen. You can replace the while loop with print(IOLoop.current().run_sync(lambda: async_fetch_future(url))). (Also note that manually handling Future objects like this is generally unnecessary: async_fetch_future can return the Future from AsyncHTTPClient.fetch directly, and if it needs to do something else it would be more idiomatic to decorate async_fetch_future with @tornado.gen.coroutine and use yield.)
If you want to do something other than just print dots in the while loop, you should probably use a coroutine that periodically does yield tornado.gen.moment:
from tornado import gen
from tornado.ioloop import IOLoop

@gen.coroutine
def main():
    future = async_fetch_future(url)
    while not future.done():
        print('...')
        yield gen.moment
    print((yield future))

IOLoop.current().run_sync(main)
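As a hedged sketch of the simpler route mentioned above (returning the Future from AsyncHTTPClient.fetch directly and letting run_sync drive the loop; the URL is just an example):
from tornado import gen
from tornado.httpclient import AsyncHTTPClient
from tornado.ioloop import IOLoop

def async_fetch_future(url):
    # fetch() already returns a Future, so no manual Future plumbing is needed
    return AsyncHTTPClient().fetch(url)

@gen.coroutine
def main():
    response = yield async_fetch_future('http://example.com')
    raise gen.Return(response.body[:80])

print(IOLoop.current().run_sync(main))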

accumulator in pyspark with dict as global variable

Just for learning purposes, I tried to use a dictionary as a global variable through an accumulator. The add function works well, but when I run the code and update the dictionary inside the map function, it always comes back empty.
Similar code that uses a list as the global variable works fine, though.
import re

from pyspark.accumulators import AccumulatorParam

class DictParam(AccumulatorParam):
    def zero(self, value=""):
        return dict()

    def addInPlace(self, acc1, acc2):
        acc1.update(acc2)

if __name__ == "__main__":
    sc, sqlContext = init_spark("generate_score_summary", 40)
    rdd = sc.textFile('input')
    # print(rdd.take(5))
    dict1 = sc.accumulator({}, DictParam())

    def file_read(line):
        global dict1
        ls = re.split(',', line)
        dict1 += {ls[0]: ls[1]}
        return line

    rdd = rdd.map(lambda x: file_read(x)).cache()
    print(dict1)
For anyone who arrives at this thread looking for a dict accumulator for PySpark: the accepted solution does not solve the posed problem.
The issue is actually in the DictParam you defined: its addInPlace does not return the merged dictionary, so the accumulator never sees the update. This works:
class DictParam(AccumulatorParam):
    def zero(self, value=""):
        return dict()

    def addInPlace(self, value1, value2):
        value1.update(value2)
        return value1
The original code was missing the return value.
I believe that print(dict1) simply gets executed before the rdd.map() does.
In Spark, there are 2 types of operations:
transformations, which describe the future computation
and actions, which actually trigger the execution
Accumulators are updated only when some action is executed:
Accumulators do not change the lazy evaluation model of Spark. If they
are being updated within an operation on an RDD, their value is only
updated once that RDD is computed as part of an action.
If you check out the end of this section of the docs, there is an example exactly like yours:
accum = sc.accumulator(0)

def g(x):
    accum.add(x)
    return f(x)

data.map(g)
# Here, accum is still 0 because no actions have caused the `map` to be computed.
So you would need to add some action, for instance:
rdd = rdd.map(lambda x: file_read(x)).cache() # transformation
foo = rdd.count() # action
print(dict1)
Please make sure to check on the details of various RDD functions and accumulator peculiarities because this might affect the correctness of your result. (For instance, rdd.take(n) will by default only scan one partition, not the entire dataset.)
For accumulator updates performed inside actions only, Spark guarantees that each task's update to the accumulator will only be applied once, i.e. restarted tasks will not update the value.
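Putting both fixes together, a minimal hedged sketch (the app name and the 'input' path are illustrative) would be:
import re

from pyspark import SparkContext
from pyspark.accumulators import AccumulatorParam

class DictParam(AccumulatorParam):
    def zero(self, value=None):
        return {}

    def addInPlace(self, acc1, acc2):
        acc1.update(acc2)
        return acc1  # the missing return was the original bug

sc = SparkContext(appName='dict_accumulator_demo')
dict1 = sc.accumulator({}, DictParam())

def file_read(line):
    ls = re.split(',', line)
    dict1.add({ls[0]: ls[1]})
    return line

rdd = sc.textFile('input').map(file_read).cache()
rdd.count()          # an action forces the map to run
print(dict1.value)   # now populated on the driver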

Combine tornado gen.coroutine and joblib mem.cache decorators

Imagine having a function, which handles a heavy computational job, that we wish to execute asynchronously in a Tornado application context. Moreover, we would like to lazily evaluate the function, by storing its results to the disk, and not rerunning the function twice for the same arguments.
Without caching the result (memoization) one would do the following:
def complex_computation(arguments):
    ...
    return result

@gen.coroutine
def complex_computation_caller(arguments):
    ...
    result = complex_computation(arguments)
    raise gen.Return(result)
Assume that, to achieve function memoization, we choose the Memory class from joblib. Simply decorating the function with @mem.cache makes it memoized:
@mem.cache
def complex_computation(arguments):
    ...
    return result
where mem can be something like mem = Memory(cachedir=get_cache_dir()).
Now consider combining the two, where we execute the computationally complex function on an executor:
from concurrent import futures

from joblib import Memory
from tornado import gen
from tornado.concurrent import run_on_executor
from tornado.ioloop import IOLoop


class TaskRunner(object):
    def __init__(self, loop=None, number_of_workers=1):
        self.executor = futures.ThreadPoolExecutor(number_of_workers)
        self.loop = loop or IOLoop.instance()

    @run_on_executor
    def run(self, func, *args, **kwargs):
        return func(*args, **kwargs)


mem = Memory(cachedir=get_cache_dir())
_runner = TaskRunner(number_of_workers=1)


@mem.cache
def complex_computation(arguments):
    ...
    return result


@gen.coroutine
def complex_computation_caller(arguments):
    result = yield _runner.run(complex_computation, arguments)
    ...
    raise gen.Return(result)
So the first question is whether the aforementioned approach is technically correct?
Now let's consider the following scenario:
@gen.coroutine
def first_coroutine(arguments):
    ...
    result = yield second_coroutine(arguments)
    raise gen.Return(result)

@gen.coroutine
def second_coroutine(arguments):
    ...
    result = yield third_coroutine(arguments)
    raise gen.Return(result)
The second question is how one can memoize second_coroutine? Is it correct to do something like:
@gen.coroutine
def first_coroutine(arguments):
    ...
    mem = Memory(cachedir=get_cache_dir())
    mem_second_coroutine = mem(second_coroutine)
    result = yield mem_second_coroutine(arguments)
    raise gen.Return(result)

@gen.coroutine
def second_coroutine(arguments):
    ...
    result = yield third_coroutine(arguments)
    raise gen.Return(result)
[UPDATE I] Caching and reusing a function result in Tornado discusses using functools.lru_cache or repoze.lru.lru_cache as a solution to the second question.
The Future objects returned by Tornado coroutines are reusable, so it generally works to use in-memory caches such as functools.lru_cache, as explained in this question. Just be sure to put the caching decorator before @gen.coroutine.
On-disk caching (which seems to be implied by the cachedir argument to Memory) is trickier, since Future objects cannot generally be written to disk. Your TaskRunner example should work, but it's doing something fundamentally different from the others because complex_computation is not a coroutine. Your last example will not work, because it tries to put the Future object in the cache.
Instead, if you want to cache things with a decorator, you'll need a decorator that wraps the inner coroutine with a second coroutine. Something like this:
def cached_coroutine(f):
    cache = {}

    @gen.coroutine
    def wrapped(*args):
        if args in cache:
            return cache[args]
        result = yield f(*args)
        cache[args] = result  # cache the result, not the function
        return result

    return wrapped
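A hedged usage sketch for this decorator, with second_coroutine and third_coroutine as in the question (the per-function cache lives in the decorator's closure):
@cached_coroutine
@gen.coroutine
def second_coroutine(arguments):
    result = yield third_coroutine(arguments)
    raise gen.Return(result)

# Later yields with the same arguments resolve from the in-memory cache
# without re-running third_coroutine.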
