How should we use the setDictionary for the lemmatization annotator in Spark-NLP?

I have a requirement where I have to add a dictionary in the lemmatization step. While trying to use it in a pipeline and calling pipeline.fit() I get an ArrayIndexOutOfBoundsException.
What is the correct way to implement this? Are there any examples?
I am passing token as the input column for lemmatization and lemma as the output column. Here is my code:
// DocumentAssembler annotator
val document = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

// SentenceDetector annotator
val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

// Tokenizer annotator
val token = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

import com.johnsnowlabs.nlp.util.io.ExternalResource

// Lemmatizer annotator
val lemmatizer = new Lemmatizer()
  .setInputCols(Array("token"))
  .setOutputCol("lemma")
  .setDictionary(ExternalResource("C:/data/notebook/lemmas001.txt", "LINE_BY_LINE", Map("keyDelimiter" -> ",", "valueDelimiter" -> "|")))

val pipeline = new Pipeline().setStages(Array(document, sentenceDetector, token, lemmatizer))
val result = pipeline.fit(df).transform(df)
The error message is:
Name: java.lang.ArrayIndexOutOfBoundsException
Message: 1
StackTrace: at com.johnsnowlabs.nlp.util.io.ResourceHelper$$anonfun$flattenRevertValuesAsKeys$1$$anonfun$apply$14.apply(ResourceHelper.scala:315)
at com.johnsnowlabs.nlp.util.io.ResourceHelper$$anonfun$flattenRevertValuesAsKeys$1$$anonfun$apply$14.apply(ResourceHelper.scala:312)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at com.johnsnowlabs.nlp.util.io.ResourceHelper$$anonfun$flattenRevertValuesAsKeys$1.apply(ResourceHelper.scala:312)
at com.johnsnowlabs.nlp.util.io.ResourceHelper$$anonfun$flattenRevertValuesAsKeys$1.apply(ResourceHelper.scala:312)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at com.johnsnowlabs.nlp.util.io.ResourceHelper$.flattenRevertValuesAsKeys(ResourceHelper.scala:312)
at com.johnsnowlabs.nlp.annotators.Lemmatizer.train(Lemmatizer.scala:52)
at com.johnsnowlabs.nlp.annotators.Lemmatizer.train(Lemmatizer.scala:19)
at com.johnsnowlabs.nlp.AnnotatorApproach.fit(AnnotatorApproach.scala:45)
at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:153)
at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:149)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:44)
at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:37)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:149)

Your pipeline looks good to me, so everything depends on what is inside lemmas001.txt and whether you are able to access it on Windows.
NOTE: I have seen users on Windows using this inside Apache Spark:
"C:\\Users\\something\\Desktop\\someDirectory\\somefile.txt"
Training a Lemmatizer in Spark NLP is simple:
val lemmatizer = new Lemmatizer()
  .setInputCols(Array("token"))
  .setOutputCol("lemma")
  .setDictionary("AntBNC_lemmas_ver_001.txt", "->", "\t")
The file must have the following format where the keyDelimiter in this case is -> and the valueDelimiter is \t:
abnormal -> abnormal abnormals
abode -> abode abodes
abolish -> abolishing abolished abolish abolishes
abolitionist -> abolitionist abolitionists
abominate -> abominate abominated abominates
abomination -> abomination abominations
aboriginal -> aboriginal aboriginals
aborigine -> aborigines aborigine
abort -> aborted abort aborts aborting
abortifacient -> abortifacients abortifacient
abortionist -> abortionist abortionists
abortion -> abortion abortions
abo -> abo abos
abotrite -> abotrites abotrite
abound -> abound abounds abounding abounded
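By the way, your stack trace points at ResourceHelper.flattenRevertValuesAsKeys, and an ArrayIndexOutOfBoundsException: 1 there usually means a line could not be split into a key part and a value part. So with your options (keyDelimiter "," and valueDelimiter "|"), each line of lemmas001.txt would presumably have to look like this (my guess at your format, mirroring the sample above):

abolish,abolishing|abolished|abolish|abolishes
abode,abode|abodes

Any line without the "," (an empty line, a header, a stray entry) would trigger exactly that exception.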
Also, if you don't want to train your own Lemmatizer, you can use the pre-trained models as follows:
English
val lemmatizer = LemmatizerModel.pretrained(name = "lemma_antbnc", lang = "en")
  .setInputCols(Array("token"))
  .setOutputCol("lemma")
French
val lemmatizer = LemmatizerModel.pretrained(name = "lemma", lang = "fr")
  .setInputCols(Array("token"))
  .setOutputCol("lemma")
Italian
val lemmatizer = LemmatizerModel.pretrained(name = "lemma", lang = "it")
  .setInputCols(Array("token"))
  .setOutputCol("lemma")
German
val lemmatizer = LemmatizerModel.pretrained(name = "lemma", lang = "de")
  .setInputCols(Array("token"))
  .setOutputCol("lemma")
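Any of these drops into the pipeline from your question in place of the trainable Lemmatizer, e.g. (a sketch reusing your stages; a pretrained model needs no fit-time dictionary):

val lemmatizer = LemmatizerModel.pretrained(name = "lemma_antbnc", lang = "en")
  .setInputCols(Array("token"))
  .setOutputCol("lemma")

val pipeline = new Pipeline().setStages(Array(document, sentenceDetector, token, lemmatizer))
val result = pipeline.fit(df).transform(df)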
The list of all pre-trained models is here:
https://nlp.johnsnowlabs.com/docs/en/models
The list of all pre-trained pipelines is here:
https://nlp.johnsnowlabs.com/docs/en/pipelines
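A pre-trained pipeline bundles all of the stages above into one object, for example (a sketch; explain_document_ml is one of the listed English pipelines):

import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

// Downloads and runs a whole annotation pipeline, including lemmas.
val pipeline = PretrainedPipeline("explain_document_ml", lang = "en")
val annotated = pipeline.transform(df)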
Please let me know in the comment if you have more questions.
Full disclosure: I am one of the contributors to the Spark NLP library.
Update: I found an example in Scala on Databricks for you, in case you are interested (this is actually how the Italian Lemmatizer model was trained).

Related

How to load Hydra parameters from previous jobs (without having to use argparse and the compose API)?

I'm using Hydra for training machine learning models. It's great for doing complex commands like python train.py data=MNIST batch_size=64 loss=l2. However, if I want to then run the trained model with the same parameters, I have to do something like python reconstruct.py --config_file path_to_previous_job/.hydra/config.yaml. I then use argparse to load in the previous yaml and use the compose API to initialize the Hydra environment. The path to the trained model is inferred from the path to Hydra's .yaml file. If I want to modify one of the parameters, I have to add additional argparse parameters and run something like python reconstruct.py --config_file path_to_previous_job/.hydra/config.yaml --batch_size 128. The code then manually overrides any Hydra parameters with those that were specified on the command line.
What's the right way of doing this?
My current code looks something like the following:
train.py:
import hydra


@hydra.main(config_name="config", config_path="conf")
def main(cfg):
    # [training code using cfg.data, cfg.batch_size, cfg.loss etc.]
    # [code outputs model checkpoint to job folder generated by Hydra]
    ...


main()
reconstruct.py:
import argparse
import os

from hydra.experimental import initialize, compose

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('hydra_config')
    parser.add_argument('--batch_size', type=int)
    # [other flags and parameters I may need to override]
    args = parser.parse_args()

    # Create the Hydra environment.
    initialize()
    cfg = compose(config_name=args.hydra_config)

    # Since checkpoints are stored next to the .hydra, we manually generate the path.
    checkpoint_dir = os.path.dirname(os.path.dirname(args.hydra_config))

    # Manually override any parameters which can be changed on the command line.
    batch_size = args.batch_size if args.batch_size else cfg.data.batch_size

    # [code which uses checkpoint_dir to load the model]
    # [code which uses both batch_size and params in cfg to set up the data etc.]
This is my first time posting, so let me know if I should clarify anything.
If you want to load the previous config as is and not change it, use OmegaConf.load(file_path).
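For that first case, a minimal sketch (the job path here is hypothetical):

from omegaconf import OmegaConf

# Load the stored config exactly as it was saved, without re-composing it.
cfg = OmegaConf.load("outputs/2021-04-19/21-23-13/.hydra/config.yaml")
print(cfg.batch_size)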
If you want to re-compose the config (and it sounds like you do, because you said you want to override things), I recommend that you use the Compose API and pass in the parameters from the overrides file in the job output directory (next to the stored config.yaml), concatenated with the current run's parameters.
This script seems to be doing the job:
import os
from dataclasses import dataclass
from os.path import join
from typing import Optional

from omegaconf import OmegaConf

import hydra
from hydra import compose
from hydra.core.config_store import ConfigStore
from hydra.core.hydra_config import HydraConfig
from hydra.utils import to_absolute_path


# You can also use a yaml config file instead of this Structured Config
@dataclass
class Config:
    load_checkpoint: Optional[str] = None
    batch_size: int = 16
    loss: str = "l2"


cs = ConfigStore.instance()
cs.store(name="config", node=Config)


@hydra.main(config_path=".", config_name="config")
def my_app(cfg: Config) -> None:
    if cfg.load_checkpoint is not None:
        output_dir = to_absolute_path(cfg.load_checkpoint)
        original_overrides = OmegaConf.load(join(output_dir, ".hydra/overrides.yaml"))
        current_overrides = HydraConfig.get().overrides.task
        hydra_config = OmegaConf.load(join(output_dir, ".hydra/hydra.yaml"))
        # getting the config name from the previous job.
        config_name = hydra_config.hydra.job.config_name
        # concatenating the original overrides with the current overrides
        overrides = original_overrides + current_overrides
        # compose a new config from scratch
        cfg = compose(config_name, overrides=overrides)

    # train
    print("Running in ", os.getcwd())
    print(OmegaConf.to_yaml(cfg))


if __name__ == "__main__":
    my_app()
~/tmp$ python train.py
Running in /home/omry/tmp/outputs/2021-04-19/21-23-13
load_checkpoint: null
batch_size: 16
loss: l2
~/tmp$ python train.py load_checkpoint=/home/omry/tmp/outputs/2021-04-19/21-23-13
Running in /home/omry/tmp/outputs/2021-04-19/21-23-22
load_checkpoint: /home/omry/tmp/outputs/2021-04-19/21-23-13
batch_size: 16
loss: l2
~/tmp$ python train.py load_checkpoint=/home/omry/tmp/outputs/2021-04-19/21-23-13 batch_size=32
Running in /home/omry/tmp/outputs/2021-04-19/21-23-28
load_checkpoint: /home/omry/tmp/outputs/2021-04-19/21-23-13
batch_size: 32
loss: l2

H2O Driverless AI, Not able to load custom recipe

I am using H2O DAI 1.9.0.6. I am trying to load a custom recipe (a BERT pretrained model, via a custom recipe) in the Expert Settings, uploading from a local file. However, the upload is not happening: no error, no progress, nothing. After that activity I am not able to see this model under the RECIPE tab.
I took the sample recipe from the URL below and modified it for my needs. Thanks to the person who created this recipe.
https://github.com/h2oai/driverlessai-recipes/blob/master/models/nlp/portuguese_bert.py
Custom Recipe
import os
import shutil
from urllib.parse import urlparse

import requests

from h2oaicore.models import TextBERTModel, CustomModel
from h2oaicore.systemutils import make_experiment_logger, temporary_files_path, atomic_move, loggerinfo


def is_url(url):
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc, result.path])
    except:
        return False


def maybe_download_language_model(logger,
                                  save_directory,
                                  model_link,
                                  config_link,
                                  vocab_link):
    model_name = "pytorch_model.bin"
    if isinstance(model_link, str):
        model_name = model_link.split('/')[-1]
        if '.bin' not in model_name:
            model_name = "pytorch_model.bin"

    maybe_download(url=config_link,
                   dest=os.path.join(save_directory, "config.json"),
                   logger=logger)
    maybe_download(url=vocab_link,
                   dest=os.path.join(save_directory, "vocab.txt"),
                   logger=logger)
    maybe_download(url=model_link,
                   dest=os.path.join(save_directory, model_name),
                   logger=logger)


def maybe_download(url, dest, logger=None):
    if not is_url(url):
        loggerinfo(logger, f"{url} is not a valid URL.")
        return

    dest_tmp = dest + ".tmp"
    if os.path.exists(dest):
        loggerinfo(logger, f"already downloaded {url} -> {dest}")
        return
    if os.path.exists(dest_tmp):
        loggerinfo(logger, f"Download has already started {url} -> {dest_tmp}. "
                           f"Delete {dest_tmp} to download the file once more.")
        return

    loggerinfo(logger, f"Downloading {url} -> {dest}")
    url_data = requests.get(url, stream=True)
    if url_data.status_code != requests.codes.ok:
        msg = "Cannot get url %s, code: %s, reason: %s" % (
            str(url), str(url_data.status_code), str(url_data.reason))
        raise requests.exceptions.RequestException(msg)
    url_data.raw.decode_content = True
    if not os.path.isdir(os.path.dirname(dest)):
        os.makedirs(os.path.dirname(dest), exist_ok=True)
    with open(dest_tmp, 'wb') as f:
        shutil.copyfileobj(url_data.raw, f)
    atomic_move(dest_tmp, dest)


def check_correct_name(custom_name):
    allowed_pretrained_models = ['bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', 'xlm-roberta',
                                 'xlm', 'roberta', 'distilbert', 'camembert', 'ctrl', 'albert']
    assert len([model_name for model_name in allowed_pretrained_models
                if model_name in custom_name]), f"{custom_name} needs to contain the name" \
                                                " of the pretrained model architecture (e.g. bert or xlnet) " \
                                                "to be able to process the model correctly."


class CustomBertModel(TextBERTModel, CustomModel):
    """
    Custom model class for using pretrained transformer models.
    The class inherits:
    - CustomModel, which really is just a tag. It's there to make sure DAI knows it's a custom model.
    - TextBERTModel, so that the custom model inherits all the properties and methods.
    Supported model architectures:
    'bert', 'openai-gpt', 'gpt2', 'transfo-xl', 'xlnet', 'xlm-roberta',
    'xlm', 'roberta', 'distilbert', 'camembert', 'ctrl', 'albert'
    How to use:
    - You have already downloaded the weights, the vocab and the config file:
      - Set _model_path as the folder where the weights, the vocab and the config file are stored.
      - Set _model_name according to the pretrained architecture (e.g. bert-base-uncased).
    - You want to download the weights, the vocab and the config file:
      - Set _model_link, _config_link and _vocab_link accordingly.
      - _model_path is the folder where the weights, the vocab and the config file will be saved.
      - Set _model_name according to the pretrained architecture (e.g. bert-base-uncased).
    - Important:
      _model_path needs to contain the name of the pretrained model architecture (e.g. bert or xlnet)
      to be able to load the model correctly.
    - Disable the genetic algorithm in the expert settings.
    """
    # _model_path is the full path to the directory where the weights, vocab and the config will be saved.
    _model_name = NotImplemented  # Will be used to create the MOJO
    _model_path = NotImplemented

    _model_link = NotImplemented
    _config_link = NotImplemented
    _vocab_link = NotImplemented

    _booster_str = "pytorch-custom"
    # Requirements for MOJO creation:
    # _model_name needs to be one of
    # bert-base-uncased, bert-base-multilingual-cased, xlnet-base-cased, roberta-base, distilbert-base-uncased
    # vocab.txt needs to be the same as vocab.txt used in _model_name (no custom vocabulary yet).
    _mojo = False

    @staticmethod
    def is_enabled():
        return False  # Abstract base model should not show up in models.

    def _set_model_name(self, language_detected):
        self.model_path = self.__class__._model_path
        self.model_name = self.__class__._model_name
        check_correct_name(self.model_path)
        check_correct_name(self.model_name)

    def fit(self, X, y, sample_weight=None, eval_set=None, sample_weight_eval_set=None, **kwargs):
        logger = None
        if self.context and self.context.experiment_id:
            logger = make_experiment_logger(experiment_id=self.context.experiment_id,
                                            tmp_dir=self.context.tmp_dir,
                                            experiment_tmp_dir=self.context.experiment_tmp_dir)

        maybe_download_language_model(logger,
                                      save_directory=self.__class__._model_path,
                                      model_link=self.__class__._model_link,
                                      config_link=self.__class__._config_link,
                                      vocab_link=self.__class__._vocab_link)
        super().fit(X, y, sample_weight, eval_set, sample_weight_eval_set, **kwargs)


class GermanBertModel(CustomBertModel):
    _model_name = "bert-base-german-dbmdz-uncased"
    _model_path = os.path.join(temporary_files_path, "german_bert_language_model/")

    _model_link = "https://huggingface.co/bert-base-german-dbmdz-uncased/resolve/main/pytorch_model.bin"
    _config_link = "https://huggingface.co/bert-base-german-dbmdz-uncased/resolve/main/config.json"
    _vocab_link = "https://huggingface.co/bert-base-german-dbmdz-uncased/resolve/main/vocab.txt"

    _mojo = True

    @staticmethod
    def is_enabled():
        return True
Check that your custom recipe has is_enabled() returning True:

@staticmethod
def is_enabled():
    return True

Ejabberd: error in simple module to handle offline messages

I have an Ejabberd 17.01 installation where I need to push a notification in case a recipient is offline. This seems to be a common task, and solutions using a customized Ejabberd module can be found everywhere. However, I just don't get it running. First, here's my script:
-module(mod_offline_push).
-behaviour(gen_mod).

-export([start/2, stop/1]).
-export([push_message/3]).

-include("ejabberd.hrl").
-include("logger.hrl").
-include("jlib.hrl").

start(Host, _Opts) ->
    ?INFO_MSG("mod_offline_push loading", []),
    ejabberd_hooks:add(offline_message_hook, Host, ?MODULE, push_message, 10),
    ok.

stop(Host) ->
    ?INFO_MSG("mod_offline_push stopping", []),
    %% remove the hook again on stop (delete, not add)
    ejabberd_hooks:delete(offline_message_hook, Host, ?MODULE, push_message, 10),
    ok.

push_message(From, To, Packet) ->
    ?INFO_MSG("mod_offline_push -> push_message to ~p", [To]),
    Type = fxml:get_tag_attr_s(<<"type">>, Packet), % Supposedly since 16.04
    %Type = xml:get_tag_attr_s(<<"type">>, Packet), % Supposedly since 13.XX
    %Type = xml:get_tag_attr_s("type", Packet),
    %Type = xml:get_tag_attr_s(list_to_binary("type"), Packet),
    ?INFO_MSG("mod_offline_push -> push_message", []),
    ok.
The problem is the Type = ... line in the push_message function; without that line, the last info message is logged (so the hook definitely works). Browsing online, I can find all kinds of function calls to extract elements from Packet; as far as I understand, this changed over time with new releases. But no luck: all variants lead to some kind of error. The current variant returns:
2017-01-25 20:38:08.701 [error] <0.21678.0>#ejabberd_hooks:run1:332 {function_clause,[{fxml,get_tag_attr_s,[<<"type">>,{message,<<>>,normal,<<>>,{jid,<<"homer">>,<<"xxx.xxx.xxx.xxx">>,<<"conference">>,<<"homer">>,<<"xxx.xxx.xxx.xxx">>,<<"conference">>},{jid,<<"carl">>,<<"xxx.xxx.xxx.xxx">>,<<>>,<<"carl">>,<<"xxx.xxx.xxx.xxx">>,<<>>},[],[{text,<<>>,<<"sfsdfsdf">>}],undefined,[],#{}}],[{file,"src/fxml.erl"},{line,169}]},{mod_offline_push,push_message,3,[{file,"mod_offline_push.erl"},{line,33}]},{ejabberd_hooks,safe_apply,3,[{file,"src/ejabberd_hooks.erl"},{line,382}]},{ejabberd_hooks,run1,3,[{file,"src/ejabberd_hooks.erl"},{line,329}]},{ejabberd_sm,route,3,[{file,"src/ejabberd_sm.erl"},{line,126}]},{ejabberd_local,route,3,[{file,"src/ejabberd_local.erl"},{line,110}]},{ejabberd_router,route,3,[{file,"src/ejabberd_router.erl"},{line,87}]},{ejabberd_c2s,check_privacy_route,5,[{file,"src/ejabberd_c2s.erl"},{line,1886}]}]}
running hook: {offline_message_hook,[{jid,<<"homer">>,<<"xxx.xxx.xxx.xxx">>,<<"conference">>,<<"homer">>,<<"xxx.xxx.xxx.xxx">>,<<"conference">>},{jid,<<"carl">>,<<"xxx.xxx.xxx.xxx">>,<<>>,<<"carl">>,<<"xxx.xxx.xxx.xxx">>,<<>>},{message,<<>>,normal,<<>>,{jid,<<"homer">>,<<"xxx.xxx.xxx.xxx">>,<<"conference">>,<<"homer">>,<<"xxx.xxx.xxx.xxx">>,<<"conference">>},{jid,<<"carl">>,<<"xxx.xxx.xxx.xxx">>,<<>>,<<"carl">>,<<"xxx.xxx.xxx.xxx">>,<<>>},[],[{text,<<>>,<<"sfsdfsdf">>}],undefined,[],#{}}]}
I'm new to Ejabberd and Erlang, so I cannot really interpret the error, but line 33, as mentioned in {mod_offline_push,push_message,3,[{file,"mod_offline_push.erl"},{line,33}]}, is definitely the line calling get_tag_attr_s.
UPDATE 2017/01/27: Since this cost me a lot of headache -- and I'm still not perfectly happy -- I post my current working module here in the hope that it might help others. My setup is Ejabberd 17.01 running on Ubuntu 16.04. Most of the stuff I tried and failed with seems to be for older versions of Ejabberd:
-module(mod_fcm_fork).
-behaviour(gen_mod).

%% public methods for this module
-export([start/2, stop/1]).
-export([push_notification/3]).

%% included for writing to ejabberd log file
-include("ejabberd.hrl").
-include("logger.hrl").
-include("xmpp_codec.hrl").

%% Copied this record definition from jlib.hrl
%% Including "xmpp_codec.hrl" and "jlib.hrl" resulted in errors ("XYZ already defined")
-record(jid, {user = <<"">> :: binary(),
              server = <<"">> :: binary(),
              resource = <<"">> :: binary(),
              luser = <<"">> :: binary(),
              lserver = <<"">> :: binary(),
              lresource = <<"">> :: binary()}).

start(Host, _Opts) ->
    ?INFO_MSG("mod_fcm_fork loading", []),
    %% Providing the most basic API to the clients and servers that are part of the Inets application
    inets:start(),
    %% Add hook to handle messages to users who are offline
    ejabberd_hooks:add(offline_message_hook, Host, ?MODULE, push_notification, 10),
    ok.

stop(Host) ->
    ?INFO_MSG("mod_fcm_fork stopping", []),
    %% remove the hook again on stop (delete, not add)
    ejabberd_hooks:delete(offline_message_hook, Host, ?MODULE, push_notification, 10),
    ok.

push_notification(From, To, Packet) ->
    %% Generate JIDs of sender and receiver
    FromJid = lists:concat([binary_to_list(From#jid.user), "@", binary_to_list(From#jid.server), "/", binary_to_list(From#jid.resource)]),
    ToJid = lists:concat([binary_to_list(To#jid.user), "@", binary_to_list(To#jid.server), "/", binary_to_list(To#jid.resource)]),
    %% Get message body
    MessageBody = Packet#message.body,
    %% Check that MessageBody is not empty
    case MessageBody /= [] of
        true ->
            %% Get first element (no idea when this list can have more elements)
            [First | _] = MessageBody,
            %% Get message data and convert to string
            MessageBodyText = binary_to_list(First#text.data),
            send_post_request(FromJid, ToJid, MessageBodyText);
        false ->
            ?INFO_MSG("mod_fcm_fork -> push_notification: MessageBody is empty", [])
    end,
    ok.

send_post_request(FromJid, ToJid, MessageBodyText) ->
    %?INFO_MSG("mod_fcm_fork -> send_post_request -> MessageBodyText = ~p", [Demo]),
    Method = post,
    PostURL = gen_mod:get_module_opt(global, ?MODULE, post_url, fun(X) -> X end, all),
    %% Add data as query string. Not nice, a request body would be preferable.
    %% Problem: the message body itself can be a JSON string, and I couldn't figure out the correct encoding.
    URL = lists:concat([binary_to_list(PostURL), "?", "fromjid=", FromJid, "&tojid=", ToJid, "&body=", edoc_lib:escape_uri(MessageBodyText)]),
    Header = [],
    ContentType = "application/json",
    Body = [],
    ?INFO_MSG("mod_fcm_fork -> send_post_request -> URL = ~p", [URL]),
    %% ADD SSL CONFIG BELOW!
    %HTTPOptions = [{ssl,[{versions, ['tlsv1.2']}]}],
    HTTPOptions = [],
    Options = [],
    httpc:request(Method, {URL, Header, ContentType, Body}, HTTPOptions, Options),
    ok.
Actually it fails on the second argument, Packet, that you pass to fxml:get_tag_attr_s in the push_message function:
{message,<<>>,normal,<<>>,
{jid,<<"homer">>,<<"xxx.xxx.xxx.xxx">>,<<"conference">>,
<<"homer">>,<<"xxx.xxx.xxx.xxx">>,<<"conference">>},
{jid,<<"carl">>,<<"xxx.xxx.xxx.xxx">>,<<>>,<<"carl">>,
<<"xxx.xxx.xxx.xxx">>,<<>>},
[],
[{text,<<>>,<<"sfsdfsdf">>}],
undefined,[],#{}}
because it is not an xmlel. It looks like it is the record #message defined in tools/xmpp_codec.hrl, with a <<>> id and the type 'normal':
xmpp_codec.hrl
-record(message, {id :: binary(),
                  type = normal :: 'chat' | 'error' | 'groupchat' | 'headline' | 'normal',
                  lang :: binary(),
                  from :: any(),
                  to :: any(),
                  subject = [] :: [#text{}],
                  body = [] :: [#text{}],
                  thread :: binary(),
                  error :: #error{},
                  sub_els = [] :: [any()]}).
Include this file and use just
Type = Packet#message.type
or, if you expect a binary value,
Type = erlang:atom_to_binary(Packet#message.type, utf8)
The newest way to do that seems to be with xmpp:get_type/1:
Type = xmpp:get_type(Packet),
It returns an atom, in this case normal.
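For example, a sketch of dispatching on that atom in the hook (do_push/3 is a hypothetical helper, not from the post):

push_message(From, To, Packet) ->
    case xmpp:get_type(Packet) of
        chat -> do_push(From, To, Packet);  % only push real chat messages
        _ -> ok
    end.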

Create a portal_user_catalog and have it used (Plone)

I'm creating a fork of my Plone site (which has not been forked for a long time). This site has a special catalog object for user profiles (a special Archetypes-based object type) which is called portal_user_catalog:
$ bin/instance debug
>>> portal = app.Plone
>>> print [d for d in portal.objectMap() if d['meta_type'] == 'Plone Catalog Tool']
[{'meta_type': 'Plone Catalog Tool', 'id': 'portal_catalog'},
{'meta_type': 'Plone Catalog Tool', 'id': 'portal_user_catalog'}]
This looks reasonable because the user profiles don't have most of the indexes of the "normal" objects, but have a small set of their own indexes.
Since I found no way to create this object from scratch, I exported it from the old site (as portal_user_catalog.zexp) and imported it into the new site. This seemed to work, but I can't add objects to the imported catalog, not even by explicitly calling the catalog_object method. Instead, the user profiles are added to the standard portal_catalog.
Now I found a module in my product which seems to serve the purpose (Products/myproduct/exportimport/catalog.py):
"""Catalog tool setup handlers.
$Id: catalog.py 77004 2007-06-24 08:57:54Z yuppie $
"""
from Products.GenericSetup.utils import exportObjects
from Products.GenericSetup.utils import importObjects
from Products.CMFCore.utils import getToolByName
from zope.component import queryMultiAdapter
from Products.GenericSetup.interfaces import IBody
def importCatalogTool(context):
"""Import catalog tool.
"""
site = context.getSite()
obj = getToolByName(site, 'portal_user_catalog')
parent_path=''
if obj and not obj():
importer = queryMultiAdapter((obj, context), IBody)
path = '%s%s' % (parent_path, obj.getId().replace(' ', '_'))
__traceback_info__ = path
print [importer]
if importer:
print importer.name
if importer.name:
path = '%s%s' % (parent_path, 'usercatalog')
print path
filename = '%s%s' % (path, importer.suffix)
print filename
body = context.readDataFile(filename)
if body is not None:
importer.filename = filename # for error reporting
importer.body = body
if getattr(obj, 'objectValues', False):
for sub in obj.objectValues():
importObjects(sub, path+'/', context)
def exportCatalogTool(context):
"""Export catalog tool.
"""
site = context.getSite()
obj = getToolByName(site, 'portal_user_catalog', None)
if tool is None:
logger = context.getLogger('catalog')
logger.info('Nothing to export.')
return
parent_path=''
exporter = queryMultiAdapter((obj, context), IBody)
path = '%s%s' % (parent_path, obj.getId().replace(' ', '_'))
if exporter:
if exporter.name:
path = '%s%s' % (parent_path, 'usercatalog')
filename = '%s%s' % (path, exporter.suffix)
body = exporter.body
if body is not None:
context.writeDataFile(filename, body, exporter.mime_type)
if getattr(obj, 'objectValues', False):
for sub in obj.objectValues():
exportObjects(sub, path+'/', context)
I tried to use it, but I have no idea how it is supposed to be done;
I can't call it TTW (should I try to publish the methods?!).
I tried it in a debug session:
$ bin/instance debug
>>> portal = app.Plone
>>> from Products.myproduct.exportimport.catalog import exportCatalogTool
>>> exportCatalogTool(portal)
Traceback (most recent call last):
File "<console>", line 1, in <module>
File ".../Products/myproduct/exportimport/catalog.py", line 58, in exportCatalogTool
site = context.getSite()
AttributeError: getSite
So, if this is the way to go, it looks like I need a "real" context.
Update: To get this context, I tried an External Method:
# -*- coding: utf-8 -*-
from Products.myproduct.exportimport.catalog import exportCatalogTool
from pdb import set_trace


def p(dt, dd):
    print '%-16s%s' % (dt + ':', dd)


def main(self):
    """
    Export the portal_user_catalog
    """
    g = globals()
    print '#' * 79
    for a in ('__package__', '__module__'):
        if a in g:
            p(a, g[a])
    p('self', self)
    set_trace()
    exportCatalogTool(self)
However, when I called it, I got the same <PloneSite at /Plone> object as the argument to the main function, which didn't have the getSite attribute. Perhaps my site doesn't call such External Methods correctly?
Or would I need to mention this module somehow in my configure.zcml, and if so, how? I searched my directory tree (especially below Products/myproduct/profiles) for exportimport, the module name, and several other strings, but I couldn't find anything; perhaps there was an integration once, but it broke ...
So how do I make this portal_user_catalog work?
Thank you!
Update: Another debug session suggests that the source of the problem is some transaction matter:
>>> portal = app.Plone
>>> puc = portal.portal_user_catalog
>>> puc._catalog()
[]
>>> profiles_folder = portal.some_folder_with_profiles
>>> for o in profiles_folder.objectValues():
...     puc.catalog_object(o)
...
>>> puc._catalog()
[<Products.ZCatalog.Catalog.mybrains object at 0x69ff8d8>, ...]
This population of the portal_user_catalog doesn't persist; after termination of the debug session and starting fg, the brains are gone.
It looks like the problem was indeed related to transactions.
I had
import transaction
...

class Browser(BrowserView):
    ...
    def processNewUser(self):
        ...
        transaction.commit()
before, but apparently this was not good enough (and/or perhaps not done correctly).
Now I start the transaction explicitly with transaction.begin(), save intermediate results with transaction.savepoint(), abort the transaction explicitly with transaction.abort() in case of errors (try / except), and have exactly one transaction.commit() at the end, in the case of success. Everything seems to work.
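In sketch form the pattern now looks like this (simplified; obj stands for the profile object being processed):

import transaction

def processNewUser(self):
    transaction.begin()                  # start an explicit transaction
    try:
        puc = self.context.portal_user_catalog
        puc.catalog_object(obj)          # index the new profile
        transaction.savepoint()          # save intermediate results
        # ... more processing ...
    except Exception:
        transaction.abort()              # roll everything back on error
        raise
    transaction.commit()                 # exactly one commit, on success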
Of course, Plone still doesn't take this non-standard catalog into account; when I "clear and rebuild" it, it is empty afterwards. But for my application it works well enough.

Error attempting to decode with wreq

I'm trying really hard to understand how to use lenses and wreq, and it's turning out to really slow me down.
The error seems to be claiming there's some mismatched type here. I'm not sure exactly how to handle that, though. I'm still fairly new to Haskell, and these lenses are pretty confusing. However, wreq seems to be cleaner, which is why I chose to use it. Can anyone help me understand what the error is and how to fix it? I seem to run into a lot of these type errors. I am aware that Maybe TestInfo won't be returned by my code at the moment. That's ok; that error I know how to handle. This error, however, I don't.
Here is my code:
Module TestInformation:
{-# LANGUAGE OverloadedStrings #-}
module TestInformation where
import Auth
import Network.Wreq
import Control.Lens
import Data.Aeson
import Data.Aeson.Lens (_String)
type TestNumber = String
data TestInfo = TestInfo {
    testId   :: Int,
    testName :: String
  }
instance FromJSON TestInfo
getTestInfo :: Key -> TestNumber -> Maybe TestInfo
getTestInfo key test =
    decode (res ^. responseBody . _String)
  where opts = defaults & auth ?~ oauth2Bearer key
        res = getWith opts ("http://testsite.com/v1/tests/" ++ test)
Module Auth:
module Auth where
import qualified Data.ByteString as B
type Key = B.ByteString
The error:
GHCi, version 7.10.1: http://www.haskell.org/ghc/ :? for help
[1 of 2] Compiling Auth ( Auth.hs, interpreted )
[2 of 2] Compiling TestInformation ( TestInformation.hs, interpreted )
TestInformation.hs:36:18:
Couldn't match type ‘Response body10’
with ‘IO (Response Data.ByteString.Lazy.Internal.ByteString)’
Expected type: (body10
-> Const Data.ByteString.Lazy.Internal.ByteString body10)
-> IO (Response Data.ByteString.Lazy.Internal.ByteString)
-> Const
Data.ByteString.Lazy.Internal.ByteString
(IO (Response Data.ByteString.Lazy.Internal.ByteString))
Actual type: (body10
-> Const Data.ByteString.Lazy.Internal.ByteString body10)
-> Response body10
-> Const Data.ByteString.Lazy.Internal.ByteString (Response body10)
In the first argument of ‘(.)’, namely ‘responseBody’
In the second argument of ‘(^.)’, namely ‘responseBody . _String’
TestInformation.hs:36:33:
Couldn't match type ‘Data.ByteString.Lazy.Internal.ByteString’
with ‘Data.Text.Internal.Text’
Expected type: (Data.ByteString.Lazy.Internal.ByteString
-> Const
Data.ByteString.Lazy.Internal.ByteString
Data.ByteString.Lazy.Internal.ByteString)
-> body10 -> Const Data.ByteString.Lazy.Internal.ByteString body10
Actual type: (Data.Text.Internal.Text
-> Const
Data.ByteString.Lazy.Internal.ByteString Data.Text.Internal.Text)
-> body10 -> Const Data.ByteString.Lazy.Internal.ByteString body10
In the second argument of ‘(.)’, namely ‘_String’
In the second argument of ‘(^.)’, namely ‘responseBody . _String’
Failed, modules loaded: Auth.
Leaving GHCi.
This type checks for me:
getTestInfo :: Key -> TestNumber -> IO (Maybe TestInfo)
getTestInfo key test = do
    res <- getWith opts ("http://testsite.com/v1/tests/" ++ test)
    return $ decode (res ^. responseBody)
  where opts = defaults & auth ?~ oauth2Bearer key
getWith is an IO action, so to get its return value you need to use the monadic binding operator <-.
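The caller then has to live in IO as well, for example (a sketch; the key and test number are placeholders, and OverloadedStrings is assumed for the ByteString key):

{-# LANGUAGE OverloadedStrings #-}
main :: IO ()
main = do
  mInfo <- getTestInfo "my-api-key" "42"   -- hypothetical credentials
  case mInfo of
    Just info -> putStrLn ("got test: " ++ testName info)
    Nothing   -> putStrLn "could not decode TestInfo"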
Full program: http://lpaste.net/133443 http://lpaste.net/133498
