Trying to train AllenNLP coreference resolution model on ontonotes: gets CUDA out of memory - bert-language-model

I'm trying to train AllenNLP's coreference model on a 16GB GPU, using this config file: https://github.com/allenai/allennlp-models/blob/main/training_config/coref/coref_spanbert_large.jsonnet
I created train, test, and dev files using this script: https://github.com/allenai/allennlp/blob/master/scripts/compile_coref_data.sh
I got CUDA out of memory almost instantly, so I tried changing "spans_per_word" and "max_antecedents" to lower values. With spans_per_word set to 0.1 instead of 0.4, training ran a bit longer but got nowhere near a full epoch. Is a 16GB GPU not enough? Or are there other parameters I could try changing?
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/allennlp/bin/allennlp", line 8, in
sys.exit(run())
File "/home/ubuntu/anaconda3/envs/allennlp/lib/python3.7/site-packages/allennlp/main.py", line 34, in run
main(prog="allennlp")
File "/home/ubuntu/anaconda3/envs/allennlp/lib/python3.7/site-packages/allennlp/commands/init.py", line 119, in main
args.func(args)
File "/home/ubuntu/anaconda3/envs/allennlp/lib/python3.7/site-packages/allennlp/commands/train.py", line 119, in train_model_from_args
file_friendly_logging=args.file_friendly_logging,
File "/home/ubuntu/anaconda3/envs/allennlp/lib/python3.7/site-packages/allennlp/commands/train.py", line 178, in train_model_from_file
file_friendly_logging=file_friendly_logging,
File "/home/ubuntu/anaconda3/envs/allennlp/lib/python3.7/site-packages/allennlp/commands/train.py", line 242, in train_model
file_friendly_logging=file_friendly_logging,
File "/home/ubuntu/anaconda3/envs/allennlp/lib/python3.7/site-packages/allennlp/commands/train.py", line 466, in _train_worker
metrics = train_loop.run()
File "/home/ubuntu/anaconda3/envs/allennlp/lib/python3.7/site-packages/allennlp/commands/train.py", line 528, in run
return self.trainer.train()
File "/home/ubuntu/anaconda3/envs/allennlp/lib/python3.7/site-packages/allennlp/training/trainer.py", line 740, in train
metrics, epoch = self._try_train()
File "/home/ubuntu/anaconda3/envs/allennlp/lib/python3.7/site-packages/allennlp/training/trainer.py", line 772, in _try_train
train_metrics = self._train_epoch(epoch)
File "/home/ubuntu/anaconda3/envs/allennlp/lib/python3.7/site-packages/allennlp/training/trainer.py", line 523, in _train_epoch
loss.backward()
File "/home/ubuntu/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/tensor.py", line 245, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/ubuntu/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/autograd/init.py", line 147, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 1.33 GiB (GPU 0; 14.76 GiB total capacity; 11.69 GiB already allocated; 639.75 MiB free; 13.09 GiB reserved in total by PyTorch)

16GB is on the low end for that model.
When this model receives a lot of text, it will split the text into multiple shorter sequences of 512 word pieces each, and run them all at the same time. That way you end up with a lot of sequences in memory at the same time even when the batch size is 1.
Try setting max_sentences in the dataset reader to a lower value (the config sets it to 110), and see if that works.
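As a rough sketch of the reasoning above, the snippet below estimates how many 512-word-piece sequences a long document turns into, and builds a JSON string that could be passed to `allennlp train` through its `--overrides` option instead of editing the jsonnet file. The value 50, the 3,000-word-piece document, and `OUTPUT_DIR` are arbitrary placeholders, and it assumes `max_sentences` sits under `dataset_reader` in the config. The segment count ignores special tokens, so treat it as an approximation.

```python
import json
import math


def num_segments(num_wordpieces: int, max_length: int = 512) -> int:
    """Approximate number of max_length-sized sequences a document is split into."""
    return math.ceil(num_wordpieces / max_length)


# A (hypothetical) 3,000-word-piece document is held in memory as several
# sequences at once, even with a batch size of 1.
print(num_segments(3000))

# Override string for the dataset reader; 50 is just a value to try first.
overrides = json.dumps({"dataset_reader": {"max_sentences": 50}})
print(
    "allennlp train coref_spanbert_large.jsonnet -s OUTPUT_DIR "
    f"--overrides '{overrides}'"
)
```

If memory still runs out, the same override can be lowered further; fewer sentences per document means fewer sequences and fewer candidate spans per training instance.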

Related

Using Camembert pre-trained model with DeepPavlov

I'm learning how to use DeepPavlov and can't figure out how to use it for NER with the Camembert (French) pre-trained model. My goal is to tag a short paragraph in French.
The DeepPavlov docs explicitly list the Camembert model as a viable transformer architecture. I tried to follow them as best I could, but I keep getting this error when I try to build the model.
>>> ner_model = build_model('ner_ontonotes_bert_mult')
/home/philippe/.local/lib/python3.10/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
warnings.warn("Initializing zero-element tensors is a no-op")
Some weights of the model checkpoint at camembert-base were not used when initializing CamembertForTokenClassification: ['lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight', 'lm_head.dense.bias']
- This IS expected if you are initializing CamembertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CamembertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of CamembertForTokenClassification were not initialized from the model checkpoint at camembert-base and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
2023-01-29 09:58:13.308 ERROR in 'deeppavlov.core.common.params'['params'] at line 108: Exception in <class 'deeppavlov.models.torch_bert.torch_transformers_sequence_tagger.TorchTransformersSequenceTagger'>
Traceback (most recent call last):
File "/home/philippe/.local/lib/python3.10/site-packages/deeppavlov/core/common/params.py", line 102, in from_params
component = obj(**dict(config_params, **kwargs))
File "/home/philippe/.local/lib/python3.10/site-packages/deeppavlov/models/torch_bert/torch_transformers_sequence_tagger.py", line 182, in __init__
super().__init__(optimizer=optimizer,
File "/home/philippe/.local/lib/python3.10/site-packages/deeppavlov/core/models/torch_model.py", line 98, in __init__
self.load()
File "/home/philippe/.local/lib/python3.10/site-packages/deeppavlov/models/torch_bert/torch_transformers_sequence_tagger.py", line 295, in load
self.crf = CRF(self.n_classes).to(self.device)
File "/home/philippe/.local/lib/python3.10/site-packages/deeppavlov/models/torch_bert/crf.py", line 13, in __init__
super().__init__(num_tags=num_tags, batch_first=batch_first)
File "/home/philippe/.local/lib/python3.10/site-packages/torchcrf/__init__.py", line 40, in __init__
raise ValueError(f'invalid number of tags: {num_tags}')
ValueError: invalid number of tags: 0
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/philippe/.local/lib/python3.10/site-packages/deeppavlov/core/commands/infer.py", line 55, in build_model
component = from_params(component_config, mode=mode)
File "/home/philippe/.local/lib/python3.10/site-packages/deeppavlov/core/common/params.py", line 102, in from_params
component = obj(**dict(config_params, **kwargs))
File "/home/philippe/.local/lib/python3.10/site-packages/deeppavlov/models/torch_bert/torch_transformers_sequence_tagger.py", line 182, in __init__
super().__init__(optimizer=optimizer,
File "/home/philippe/.local/lib/python3.10/site-packages/deeppavlov/core/models/torch_model.py", line 98, in __init__
self.load()
File "/home/philippe/.local/lib/python3.10/site-packages/deeppavlov/models/torch_bert/torch_transformers_sequence_tagger.py", line 295, in load
self.crf = CRF(self.n_classes).to(self.device)
File "/home/philippe/.local/lib/python3.10/site-packages/deeppavlov/models/torch_bert/crf.py", line 13, in __init__
super().__init__(num_tags=num_tags, batch_first=batch_first)
File "/home/philippe/.local/lib/python3.10/site-packages/torchcrf/__init__.py", line 40, in __init__
raise ValueError(f'invalid number of tags: {num_tags}')
ValueError: invalid number of tags: 0
I downloaded the camembert-base model from https://huggingface.co/camembert-base and copied the files into the .deeppavlov/models/camembert-base directory.
Then I figured DeepPavlov's 'ner_ontonotes_bert_mult' model was the best fit for my use case, so I edited the config file and changed these lines in the metadata section at the end. The DeepPavlov docs ask to change the TRANSFORMER value, which I did, and I changed MODEL_PATH so it points to the files I downloaded previously.
"variables": {
"ROOT_PATH": "~/.deeppavlov",
"DOWNLOADS_PATH": "{ROOT_PATH}/downloads",
"MODELS_PATH": "{ROOT_PATH}/models",
"TRANSFORMER": "camembert-base",
"MODEL_PATH": "{MODELS_PATH}/camembert-base"
},
I am aware that I should have copied the config file to a new one with a different name, but this should not be a problem.
Then in Python I did the following:
from deeppavlov import configs, build_model
build_model('ner_ontonotes_bert_mult')
And then I get the error shown above. I am lost and don't know where to look next.
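For the copy-then-edit step mentioned above, here is a minimal sketch that leaves the stock config untouched and applies the two variable changes to a copy. It does not address the `invalid number of tags: 0` error itself. The helper `with_variables` and the `ORIGINAL_...` placeholder values are hypothetical, and the dict only mimics the `metadata.variables` part of a DeepPavlov config.

```python
import copy
import json


def with_variables(config: dict, **overrides: str) -> dict:
    """Return a copy of a config dict with metadata.variables updated."""
    new_config = copy.deepcopy(config)
    variables = new_config.setdefault("metadata", {}).setdefault("variables", {})
    variables.update(overrides)
    return new_config


# Stand-in for the stock config: placeholder values, only the relevant part.
stock_config = {
    "metadata": {
        "variables": {
            "ROOT_PATH": "~/.deeppavlov",
            "MODELS_PATH": "{ROOT_PATH}/models",
            "TRANSFORMER": "ORIGINAL_TRANSFORMER",
            "MODEL_PATH": "{MODELS_PATH}/ORIGINAL_MODEL_DIR",
        }
    }
}

camembert_config = with_variables(
    stock_config,
    TRANSFORMER="camembert-base",
    MODEL_PATH="{MODELS_PATH}/camembert-base",
)

# The stock config is untouched; the edited copy can be saved under a new name.
print(json.dumps(camembert_config["metadata"]["variables"], indent=2))
```

Keeping the original file intact makes it easier to compare the two configs when checking whether the error comes from the edited variables or from something else in the pipeline.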

Failing to train Google Colab: Custom Training StyleGan2-ADA

I've been trying to get the following Colab notebook to work, without success.
I was able to scrape the images needed and (hopefully) resized them all to 256x256.
After connecting my drive and following all the steps, I reached the training part.
There, I am using the default values with: resume_from = "ffhq256"
The process starts but ends after ~20 seconds with the following error:
Constructing networks...
Setting up TensorFlow plugin "fused_bias_act.cu": Compiling... Loading... Done.
Setting up TensorFlow plugin "upfirdn_2d.cu": Compiling... Loading... Done.
Resuming from "https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada/pretrained/transfer-learning-source-nets/ffhq-res256-mirror-paper256-noaug.pkl"
Downloading https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada/pretrained/transfer-learning-source-nets/ffhq-res256-mirror-paper256-noaug.pkl ... done
Traceback (most recent call last):
File "train.py", line 645, in <module>
main()
File "train.py", line 637, in main
run_training(**vars(args))
File "train.py", line 522, in run_training
training_loop.training_loop(**training_options)
File "/content/drive/MyDrive/colab-sg2-ada/stylegan2-ada/training/training_loop.py", line 129, in training_loop
G.copy_vars_from(rG)
File "/content/drive/MyDrive/colab-sg2-ada/stylegan2-ada/dnnlib/tflib/network.py", line 512, in copy_vars_from
self._components[name].copy_vars_from(src_comp)
File "/content/drive/MyDrive/colab-sg2-ada/stylegan2-ada/dnnlib/tflib/network.py", line 509, in copy_vars_from
self.copy_own_vars_from(src_net)
File "/content/drive/MyDrive/colab-sg2-ada/stylegan2-ada/dnnlib/tflib/network.py", line 482, in copy_own_vars_from
tfutil.set_vars({self._get_vars()[name]: value for name, value in value_dict.items() if name in self._get_vars()})
File "/content/drive/MyDrive/colab-sg2-ada/stylegan2-ada/dnnlib/tflib/tfutil.py", line 227, in set_vars
run(ops, feed_dict)
File "/content/drive/MyDrive/colab-sg2-ada/stylegan2-ada/dnnlib/tflib/tfutil.py", line 33, in run
return tf.get_default_session().run(*args, **kwargs)
File "/tensorflow-1.15.2/python3.7/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/tensorflow-1.15.2/python3.7/tensorflow_core/python/client/session.py", line 1156, in _run
(np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (3, 3, 512, 256) for Tensor 'G_synthesis/64x64/Conv0_up/weight/new_value:0', which has shape '(3, 3, 512, 512)'
Any help would be greatly appreciated!

Functions work when in a Sage worksheet directly but not when in a library

I'm taking a class, Intro to Algebraic Cryptology. We're using Sage, through CoCalc, for everything. This class is the first I've heard of either. The instructor has provided many convenience functions for our use. I do not like repeatedly copying them into new Sage worksheets in CoCalc, so I put them in a library.
It took some time but I finally learned that to use them I have to do this in Sage:
load_attach_path('/path/to/the/directory')
%attach elliptic_curve_common.sage
Now, there is a function which she wrote for us to use called HPSonEC. This function is about using the Hellman-Pohlig-Silver exploit for cracking encryption on elliptic curves. What's infuriating is that the function will not work when used as I have above and I get this error:
Error in lines 5-5
Traceback (most recent call last):
File "/cocalc/lib/python3.8/site-packages/smc_sagews/sage_server.py", line 1230, in execute
exec(
File "", line 1, in <module>
File "<string>", line 298, in HPSonEC
File "<string>", line 263, in listptorder
File "<string>", line 151, in ECTimes
File "sage/rings/rational.pyx", line 2401, in sage.rings.rational.Rational.__mul__ (build/cythonized/sage/rings/rational.c:20911)
return coercion_model.bin_op(left, right, operator.mul)
File "sage/structure/coerce.pyx", line 1248, in sage.structure.coerce.CoercionModel.bin_op (build/cythonized/sage/structure/coerce.c:11304)
raise bin_op_exception(op, x, y)
TypeError: unsupported operand parent(s) for *: 'Rational Field' and 'Abelian group of points on Elliptic Curve defined by y^2 = x^3 + 389787687398479 over Finite Field of size 324638246338947256483756487461'
However, if I take that function, and the others on which it depends, and copy them into my Sage worksheet, they work just fine. Literally, no differences in the code at all. What may be the issue?
When reading code from a worksheet or a .sage file, the Sage preparser is applied. When reading code from a .py file, it is not.
See many questions where this came up:
https://stackoverflow.com/search?q=%5Bsage%5D+preparser
https://ask.sagemath.org/questions/scope:all/sort:activity-desc/page:1/query:preparser/
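To illustrate why the preparser matters, here is a small sketch in plain Python, runnable without Sage. As far as I know, the preparser rewrites `^` to `**` and wraps numeric literals in Sage types such as `Integer(...)`, so the same source text can mean different things in a worksheet and in a .py file. `fractions.Fraction` is used below only as a stand-in for Sage's exact rationals.

```python
from fractions import Fraction

# Plain Python semantics (what un-preparsed code gets):
xor_result = 2 ^ 3   # bitwise XOR, not a power
true_div = 1 / 3     # a float, not an exact rational

# Roughly what preparsed code would compute instead:
power_result = 2 ** 3                 # `^` rewritten to `**`
exact = Fraction(1) / Fraction(3)     # literals wrapped in exact number types

print(xor_result, power_result)
print(true_div, exact)

# Inside Sage, the rewriting itself can be inspected with:
#   from sage.repl.preparse import preparse
#   preparse("2^3")
```

When two copies of "the same" code behave differently, comparing the preparsed form of the worksheet version with the text that actually gets loaded is a reasonable first check.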

Unix.error 31 write when using Functory module

I am using the functory module and I am facing a very bizarre issue with the code.
My code works, and I have been able to parallelize one play of my game, but when I try to play again (launching a parallelized function a second time) it raises a really weird error.
Here is the error:
Fatal error: exception Unix.Unix_error(43, "write", "")
Raised by primitive operation at file "unix.ml", line 252, characters 7-34
Called from file "protocol.ml", line 45, characters 10-32
Re-raised at file "network.ml", line 536, characters 10-11
Called from file "network.ml", line 565, characters 47-80
Called from file "list.ml", line 73, characters 12-15
Called from file "network.ml", line 731, characters 4-27
Called from file "map_fold.ml", line 98, characters 4-242
Called from file "game_ia.ml", line 111, characters 10-54
Called from file "gameplay.ml", line 34, characters 12-48
Called from file "gameplay.ml", line 57, characters 22-37
Called from file "gameplay.ml", line 85, characters 5-22
So I looked in the following files to see which primitive operation raised the error:
(unix.ml) external rename : string -> string -> unit = "unix_rename"
(network.ml) Some jid when w.state <> Disconnected -> send w (Protocol.Master.Kill jid)
So for some reason, it seems that my worker disconnects by itself. Has anyone already had this issue, and what can I do to solve it?
You can find my game here. The main files involved are game_ia.ml (best_move_parallelized) and gameplay.ml (at the very bottom).
Thank you in advance for your help.
The error you get is (type the following in the toploop)¹:
# (Obj.magic 43: Unix.error);;
- : Unix.error = Unix.EPROTOTYPE
which means: Protocol wrong type for socket. So you have to examine how you initialize your socket.
¹ You can also count the exceptions in unix.mli, knowing that the first one, E2BIG, is 0. Emacs C-u 43 ↓ helps.
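The counting trick from the footnote can be sketched as a lookup table. The list below is the start of the `Unix.error` constructor list in declaration order as I recall it from unix.mli, so it is an assumption to verify against your installed OCaml version. It is written in Python only to show the index arithmetic; `unix_error_name` is a hypothetical helper.

```python
# Constant constructors of OCaml's Unix.error, in declaration order.
# Recalled from unix.mli: verify against your own installation.
UNIX_ERRORS = """
E2BIG EACCES EAGAIN EBADF EBUSY ECHILD EDEADLK EDOM EEXIST EFAULT
EFBIG EINTR EINVAL EIO EISDIR EMFILE EMLINK ENAMETOOLONG ENFILE ENODEV
ENOENT ENOEXEC ENOLCK ENOMEM ENOSPC ENOSYS ENOTDIR ENOTEMPTY ENOTTY ENXIO
EPERM EPIPE ERANGE EROFS ESPIPE ESRCH EXDEV EWOULDBLOCK EINPROGRESS EALREADY
ENOTSOCK EDESTADDRREQ EMSGSIZE EPROTOTYPE
""".split()


def unix_error_name(index: int) -> str:
    """Map the integer printed in Unix_error(n, ...) to a constructor name."""
    return UNIX_ERRORS[index]


print(unix_error_name(43))
```

If this list is right, index 43 agrees with the toploop result above, and index 31 (the number in the question title) would correspond to EPIPE.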

How do I enable through-the-filesystem diazo editing with plone 4.3

Summary: through-the-filesystem editing not working for my diazo theme. Plone breaks.
Details:
I've created my first live Plone site with 4.3.2 and diazo. You can see the live version at borogreen.org. I would like to keep developing the theme going forward.
My Ubuntu 12.04 LTS test server has only plone432 + diazo + dexterity (not used) + Static resource storage 1.0.2 enabled. For test purposes, I'm using the available sunrain theme.
I've placed the sunrain theme manually inside the /resources folder, as suggested in
http://developer.plone.org/reference_manuals/external/plone.app.theming/userguide.html#deploying-and-testing-themes
Trying to enable that theme in the Site Setup | Theming panel | Advanced, I set the path to the theme rules to
/++theme++sunrain/rules.xml
and the absolute path prefix to
/++theme++sunrain/
Plone does not recognize it: no theme gets enabled. Debug mode prints the following errors:
2014-03-29 00:10:07 ERROR plone.subrequest Error handling subrequest to /++theme++sunrain/rules.xml
Traceback (most recent call last):
File "/home/plone/Plone/buildout-cache/eggs/plone.subrequest-1.6.7-py2.7.egg/plone/subrequest/__init__.py", line 116, in subrequest
traversed = request.traverse(path)
File "/home/plone/Plone/buildout-cache/eggs/Zope2-2.13.21-py2.7.egg/ZPublisher/BaseRequest.py", line 502, in traverse
subobject = self.traverseName(object, entry_name)
File "/home/plone/Plone/buildout-cache/eggs/Zope2-2.13.21-py2.7.egg/ZPublisher/BaseRequest.py", line 326, in traverseName
ob2 = namespaceLookup(ns, nm, ob, self)
File "/home/plone/Plone/buildout-cache/eggs/zope.traversing-3.13.2-py2.7.egg/zope/traversing/namespace.py", line 112, in namespaceLookup
return traverser.traverse(name, ())
File "/home/plone/Plone/buildout-cache/eggs/plone.resource-1.0.2-py2.7.egg/plone/resource/traversal.py", line 27, in traverse
raise NotFound
NotFound
2014-03-29 00:10:07 ERROR plone.transformchain Unexpected error whilst trying to apply transform chain
Traceback (most recent call last):
File "/home/plone/Plone/buildout-cache/eggs/plone.transformchain-1.0.3-py2.7.egg/plone/transformchain/transformer.py", line 48, in __call__
newResult = handler.transformIterable(result, encoding)
File "/home/plone/Plone/buildout-cache/eggs/plone.app.theming-1.1.1-py2.7.egg/plone/app/theming/transform.py", line 170, in transformIterable
transform = self.setupTransform(runtrace=runtrace)
File "/home/plone/Plone/buildout-cache/eggs/plone.app.theming-1.1.1-py2.7.egg/plone/app/theming/transform.py", line 108, in setupTransform
transform = compileThemeTransform(rules, absolutePrefix, readNetwork, parameterExpressions, runtrace=runtrace)
File "/home/plone/Plone/buildout-cache/eggs/plone.app.theming-1.1.1-py2.7.egg/plone/app/theming/utils.py", line 580, in compileThemeTransform
runtrace=runtrace,
File "/home/plone/Plone/buildout-cache/eggs/diazo-1.0.4-py2.7.egg/diazo/compiler.py", line 115, in compile_theme
read_network=read_network,
File "/home/plone/Plone/buildout-cache/eggs/diazo-1.0.4-py2.7.egg/diazo/rules.py", line 195, in process_rules
rules_doc = etree.parse(rules, parser=rules_parser)
File "lxml.etree.pyx", line 2957, in lxml.etree.parse (src/lxml/lxml.etree.c:56299)
File "parser.pxi", line 1526, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:82331)
File "parser.pxi", line 1555, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:82624)
File "parser.pxi", line 1455, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:81663)
File "parser.pxi", line 1002, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:78623)
File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74567)
File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75458)
File "parser.pxi", line 588, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74760)
IOError: Error reading file '/++theme++sunrain/rules.xml': failed to load external entity "/++theme++sunrain/rules.xml"
What's wrong here?
PS: of course I can upload the theme as a zip file and enable it that way, which works fine. I would really like to edit through the filesystem, as I can foresee a lot of development in the future.
An up-to-date, working write-up for plone432 on how to edit diazo themes through the filesystem using the /resources directory would be the answer, but I have not found one outside of the plone.app.theming user guide. Help!

Resources