How to create many Bokeh figures with multiprocessing? - bokeh

I would like to speed up figure generation in Bokeh by multiprocessing:
jobs = []
for label in list(peakLabels):
args = {'data': rt_proj_data[label],
'label': label,
'tools': tools,
'colors': itertools.cycle(palette),
'files': files,
'highlight': highlight}
jobs.append(args)
pool = Pool(processes=cpu_count())
m = Manager()
q = m.Queue()
plots = pool.map_async(plot_peaks_parallel, jobs)
pool.close()
pool.join()
def plot_peaks_parallel(args):
data = args['data']
label = args['label']
colors = args['colors']
tools = args['tools']
files = args['files']
highlight = args['highlight']
p = figure(title=f'Peak: {label}',
x_axis_label='Retention Time',
y_axis_label='Intensity',
tools=tools)
...
return p
Though I ran into this error:
MaybeEncodingError: Error sending result: '[Figure(id='1078', ...)]'. Reason: 'PicklingError("Can't pickle at 0x7fc7df0c0ea0>: attribute lookup ColumnDataSource. on bokeh.models.sources failed")'
Can I do something to the object p, so that it becomes pickleable?

Individual Bokeh objects are not serializable in isolation, including with pickle. The smallest meaningful unit of serialization in Bokeh is the Document, which is a specific collection of Bokeh objects guaranteed to be complete with respect to following references. However, I would be surprised if pickle works with Document either (AFAIK you are the first person to ask about it since the project started, it's never been a priority, or even looked into that I know of). Instead, I would suggest if you want to do something like this, to use Bokeh's own JSON serialization functions, such as json_item:
# python code
p_serialized = json.dumps(json_item(p))
This will properly serialize p in the context of the Document it is a part of. Then you can pass this to your page templates to display with the Bokeh JS embed API:
# javascript code
p = JSON.parse(p_serialized);
Bokeh.embed.embed_item(p, "mydiv")

Related

BertModel transformers outputs string instead of tensor

I'm following this tutorial that codes a sentiment analysis classifier using BERT with the huggingface library and I'm having a very odd behavior. When trying the BERT model with a sample text I get a string instead of the hidden state. This is the code I'm using:
import transformers
from transformers import BertModel, BertTokenizer
print(transformers.__version__)
PRE_TRAINED_MODEL_NAME = 'bert-base-cased'
PATH_OF_CACHE = "/home/mwon/data-mwon/paperChega/src_classificador/data/hugingface"
tokenizer = BertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME,cache_dir = PATH_OF_CACHE)
sample_txt = 'When was I last outside? I am stuck at home for 2 weeks.'
encoding_sample = tokenizer.encode_plus(
sample_txt,
max_length=32,
add_special_tokens=True, # Add '[CLS]' and '[SEP]'
return_token_type_ids=False,
padding=True,
truncation = True,
return_attention_mask=True,
return_tensors='pt', # Return PyTorch tensors
)
bert_model = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME,cache_dir = PATH_OF_CACHE)
last_hidden_state, pooled_output = bert_model(
encoding_sample['input_ids'],
encoding_sample['attention_mask']
)
print([last_hidden_state,pooled_output])
that outputs:
4.0.0
['last_hidden_state', 'pooler_output']
While the answer from Aakash provides a solution to the problem, it does not explain the issue. Since one of the 3.X releases of the transformers library, the models do not return tuples anymore but specific output objects:
o = bert_model(
encoding_sample['input_ids'],
encoding_sample['attention_mask']
)
print(type(o))
print(o.keys())
Output:
transformers.modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions
odict_keys(['last_hidden_state', 'pooler_output'])
You can return to the previous behavior by adding return_dict=False to get a tuple:
o = bert_model(
encoding_sample['input_ids'],
encoding_sample['attention_mask'],
return_dict=False
)
print(type(o))
Output:
<class 'tuple'>
I do not recommend that, because it is now unambiguous to select a specific part of the output without turning to the documentation as shown in the example below:
o = bert_model(encoding_sample['input_ids'], encoding_sample['attention_mask'], return_dict=False, output_attentions=True, output_hidden_states=True)
print('I am a tuple with {} elements. You do not know what each element presents without checking the documentation'.format(len(o)))
o = bert_model(encoding_sample['input_ids'], encoding_sample['attention_mask'], output_attentions=True, output_hidden_states=True)
print('I am a cool object and you can acces my elements with o.last_hidden_state, o["last_hidden_state"] or even o[0]. My keys are; {} '.format(o.keys()))
Output:
I am a tuple with 4 elements. You do not know what each element presents without checking the documentation
I am a cool object and you can acces my elements with o.last_hidden_state, o["last_hidden_state"] or even o[0]. My keys are; odict_keys(['last_hidden_state', 'pooler_output', 'hidden_states', 'attentions'])
I faced the same issue while learning how to implement Bert. I noticed that using
last_hidden_state, pooled_output = bert_model(encoding_sample['input_ids'], encoding_sample['attention_mask'])
is the issue. Use:
outputs = bert_model(encoding_sample['input_ids'], encoding_sample['attention_mask'])
and extract the last_hidden state using
output[0]
You can refer to the documentation here which tells you what is returned by the BertModel

Google Earth Engine download problems, is this caused by immutable server side objects?

I have a function that will download an image collection as a TFrecord or a geotiff.
Heres the function -
def download_image_collection_to_drive(collection, aois, bands, limit, export_format):
if collection.size().lt(ee.Number(limit)):
bands = [band for band in bands if band not in ['SCL', 'QA60']]
for aoi in aois:
cluster = aoi.get('cluster').getInfo()
geom = aoi.bounds().getInfo()['geometry']['coordinates']
aoi_collection = collection.filterMetadata('cluster', 'equals', cluster)
for ts in range(1, 11):
print(ts)
ts_collection = aoi_collection.filterMetadata('interval', 'equals', ts)
if ts_collection.size().eq(ee.Number(1)):
image = ts_collection.first()
p_id = image.get("PRODUCT_ID").getInfo()
description = f'{cluster}_{ts}_{p_id}'
task_config = {
'fileFormat': export_format,
'image': image.select(bands),
'region': geom,
'description': description,
'scale': 10,
'folder': 'output'
}
if export_format == 'TFRecord':
task_config['formatOptions'] = {'patchDimensions': [256, 256], 'kernelSize': [3, 3]}
task = ee.batch.Export.image.toDrive(**task_config)
task.start()
else:
logger.warning(f'no image for interval {ts}')
else:
logger.warning(f'collection over {limit} aborting drive download')
It seems whenever it gets to the second aoi it fails, Im confused by this as if ts_collection.size().eq(ee.Number(1)) confirms there is an image there so it should manage to get product id from it.
line 24, in download_image_collection_to_drive
p_id = image.get("PRODUCT_ID").getInfo()
File "/lib/python3.7/site-packages/ee/computedobject.py", line 95, in getInfo
return data.computeValue(self)
File "/lib/python3.7/site-packages/ee/data.py", line 717, in computeValue
prettyPrint=False))['result']
File "/lib/python3.7/site-packages/ee/data.py", line 340, in _execute_cloud_call
raise _translate_cloud_exception(e)
ee.ee_exception.EEException: Element.get: Parameter 'object' is required.
am I falling foul of immutable server side objects somewhere?
This is a server-side value, problem, yes, but immutability doesn't have to do with it — your if statement isn't working as you intend.
ts_collection.size().eq(ee.Number(1)) is a server-side value — you've described a comparison that hasn't happened yet. That means that doing any local operation like a Python if statement cannot take the comparison outcome into account, and will just treat it as a true value.
Using getInfo would be a quick fix:
if ts_collection.size().eq(ee.Number(1)).getInfo():
but it would be more efficient to avoid using getInfo more than needed by fetching the entire collection's info just once, which includes the image info.
...
ts_collection_info = ts_collection.getInfo()
if ts_collection['features']: # Are there any images in the collection?
image = ts_collection.first()
image_info = ts_collection['features'][0] # client-side image info already downloaded
p_id = image_info['properties']['PRODUCT_ID'] # get ID from client-side info
...
This way, you only make two requests per ts: one to check for the match, and one to start the export.
Note that I haven't actually run this Python code, and there might be some small mistakes; if it gives you any trouble, print(ts_collection_info) and examine the structure you actually received to figure out how to interpret it.

Pause/Stop AjaxDataSource Bokeh Stream

I am able to set up my graph for streaming just fine. Here's the initialization:
self.data_source = AjaxDataSource(data_url='my_route',
polling_interval=1000, mode='append', max_size=300)
Now I want to 'pause' the polling of the AjaxDataSource. I couldn't find a way to do this in the documentation. I'm NOT running a bokeh server, so bokeh server solutions I am unable to use.
I came up with one possible solution: just return empty data set to the function that is appending the data via AjaxDataSource. So in the example above, the my_route function would look something like this:
def my_route:
if not self.is_paused:
data = normal_data_to_graph
else:
data = []
return data
Once you set the polling_interval = None in Python, it will not request it. In CustomJS, you can start the paused request. Here, the source is an AJaxDataSource instance.
source.polling_interval = 1000; // the interval you want
source.intialized = false;
source.setup();

Losing some z3c relation data on restart

I have the following code which is meant to programmatically assign relation values to a custom content type.
publications = # some data
catalog = getToolByName(context, 'portal_catalog')
for pub in publications:
if pub['custom_id']:
results = catalog(custom_id=pub['custom_id'])
if len(results) == 1:
obj = results[0].getObject()
measures = []
for m in pub['measure']:
if m in context.objectIds():
m_id = intids.getId(context[m])
relation = RelationValue(m_id)
measures.append(relation)
obj.measures = measures
obj.reindexObject()
notify(ObjectModifiedEvent(obj))
Snippet of schema for custom content type
measures = RelationList(
title=_(u'Measure(s)'),
required=False,
value_type=RelationChoice(title=_(u'Measure'),
source=ObjPathSourceBinder(object_provides='foo.bar.interfaces.measure.IMeasure')),
)
When I run my script everything looks good. The problem is when my template for the custom content tries to call "pub/from_object/absolute_url" the value is blank - only after a restart. Interestingly, I can get other attributes of pub/from_object after a restart, just not it's URL.
from_object retrieves the referencing object from the relation catalog, but doesn't put the object back in its proper Acquisition chain. See http://docs.plone.org/external/plone.app.dexterity/docs/advanced/references.html#back-references for a way to do it that should work.

What's wrong with my filter query to figure out if a key is a member of a list(db.key) property?

I'm having trouble retrieving a filtered list from google app engine datastore (using python for server side). My data entity is defined as the following
class Course_Table(db.Model):
course_name = db.StringProperty(required=True, indexed=True)
....
head_tags_1=db.ListProperty(db.Key)
So the head_tags_1 property is a list of keys (which are the keys to a different entity called Headings_1).
I'm in the Handler below to spin through my Course_Table entity to filter the courses that have a particular Headings_1 key as a member of the head_tags_1 property. However, it doesn't seem like it is retrieving anything when I know there is data there to fulfill the request since it never displays the logs below when I go back to iterate through the results of my query (below). Any ideas of what I'm doing wrong?
def get(self,level_num,h_key):
path = []
if level_num == "1":
q = Course_Table.all().filter("head_tags_1 =", h_key)
for each in q:
logging.info('going through courses with this heading name')
logging.info("course name filtered is %s ", each.course_name)
MANY MANY THANK YOUS
I assume h_key is key of headings_1, since head_tags_1 is a list, I believe what you need is IN operator. https://developers.google.com/appengine/docs/python/datastore/queries
Note: your indentation inside the for loop does not seem correct.
My bad apparently '=' for list is already check membership. Using = to check membership is working for me, can you make sure h_key is really a datastore key class?
Here is my example, the first get produces result, where the 2nd one is not
import webapp2 from google.appengine.ext import db
class Greeting(db.Model):
author = db.StringProperty()
x = db.ListProperty(db.Key)
class C(db.Model): name = db.StringProperty()
class MainPage(webapp2.RequestHandler):
def get(self):
ckey = db.Key.from_path('C', 'abc')
dkey = db.Key.from_path('C', 'def')
ekey = db.Key.from_path('C', 'ghi')
Greeting(author='xxx', x=[ckey, dkey]).put()
x = Greeting.all().filter('x =',ckey).get()
self.response.write(x and x.author or 'None')
x = Greeting.all().filter('x =',ekey).get()
self.response.write(x and x.author or 'None')
app = webapp2.WSGIApplication([('/', MainPage)],
debug=True)

Resources