Use pre-trained model vocabulary in an appropriate way with allennlp - vocabulary

When using a Hugging Face pre-trained model, I pass a tokenizer and an indexer for my TextField in the DatasetReader, and I also want to use the same tokenizer and indexer in my model. What is the appropriate way to do this in AllenNLP? (Using a config file?)
Here is my code. I think this is a bad solution, so please give me some suggestions.
In my DatasetReader:
self._tokenizer = PretrainedTransformerTokenizer(
    "microsoft/DialoGPT-small",
    tokenizer_kwargs={'cls_token': '[CLS]',
                      'sep_token': '[SEP]',
                      'bos_token': '[BOS]'})
self._tokenindexer = {"tokens": PretrainedTransformerIndexer(
    "microsoft/DialoGPT-small",
    tokenizer_kwargs={'cls_token': '[CLS]',
                      'sep_token': '[SEP]',
                      'bos_token': '[BOS]'})}
In my Model:
self.tokenizer = GPT2Tokenizer.from_pretrained("microsoft/DialoGPT-small")
num_added_tokens = self.tokenizer.add_special_tokens({'bos_token':'[BOS]','sep_token': '[SEP]','cls_token':'[CLS]'})
self.emb_dim = len(self.tokenizer)
self.embeded_layer = self.encoder.resize_token_embeddings(self.emb_dim)
I have created two tokenizers, one for the DatasetReader and one for the model, and both have the same vocabulary and special tokens. But when I add the three special tokens in the same order, they end up with different indices. So I switched the order in the model's code to get the same indices (stupid but effective).
Is there a way to pass the tokenizer or vocab from the DatasetReader to the Model?
What is the appropriate way to solve this problem in AllenNLP?
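One way to see why the indices drift, and why a single shared definition helps, is a toy vocabulary in plain Python. This is only a sketch under assumptions: `ToyVocab`, `build_vocab`, and `SPECIAL_TOKENS` are hypothetical names, and the toy simply gives each new token the next free id. It does not reproduce how the Hugging Face or AllenNLP tokenizers order special tokens internally.

```python
from typing import Dict, Iterable, List


class ToyVocab:
    """Toy stand-in for a tokenizer vocabulary: a new token gets the next free id."""

    def __init__(self, base_tokens: Iterable[str]) -> None:
        self.token_to_id: Dict[str, int] = {}
        for token in base_tokens:
            self.add(token)

    def add(self, token: str) -> int:
        # Only unseen tokens receive a new id; known tokens keep theirs.
        if token not in self.token_to_id:
            self.token_to_id[token] = len(self.token_to_id)
        return self.token_to_id[token]


BASE_TOKENS: List[str] = ["hello", "world"]
# Single source of truth for the special tokens (hypothetical constant).
SPECIAL_TOKENS: List[str] = ["[BOS]", "[SEP]", "[CLS]"]


def build_vocab(special_tokens: Iterable[str]) -> ToyVocab:
    """Build a vocabulary from the base tokens plus the given special tokens."""
    vocab = ToyVocab(BASE_TOKENS)
    for token in special_tokens:
        vocab.add(token)
    return vocab


# Reader and model both build from the same definition: ids agree.
reader_vocab = build_vocab(SPECIAL_TOKENS)
model_vocab = build_vocab(SPECIAL_TOKENS)

# Adding the same tokens in another order gives different ids.
mismatched_vocab = build_vocab(reversed(SPECIAL_TOKENS))

same_ids = reader_vocab.token_to_id == model_vocab.token_to_id
different_ids = reader_vocab.token_to_id != mismatched_vocab.token_to_id
```

In an AllenNLP setup, the analogous idea would be to define the model name and `tokenizer_kwargs` once (for example in the config file) and build both the reader's and the model's tokenizer from that single definition, rather than adding the tokens by hand in two places.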

Related

OpenMDAO Optional Error on Unconnected Input

Is there any way to force OpenMDAO to raise an error if a given input is unconnected? I know for many inputs, defaults can be provided such that the input doesn't need to be connected, however is there a way to tell OpenMDAO to automatically raise an error if certain key inputs are unconnected?
This is not built into OpenMDAO as of V3.17. However, it is possible to do it. The only caveat is that I had to use some non-public APIs to make it work (notice the use of the model's _conn_global_abs_in2out attribute), so those APIs are subject to change later.
This code should give you the behavior you want. You could augment things with variable tagging if you wanted a solution that didn't require you to give a list of variable names to the validate function. The list_inputs method can accept tags to filter by instead, if you prefer that.
import openmdao.api as om


def validate_connections(prob, force_connected):
    # make sure it's a set and not a generic iterable (e.g. an array)
    force_connected_set = set(force_connected)

    model_inputs = prob.model.list_inputs(out_stream=None, prom_name=True)

    # filter the inputs into connected and unconnected sets
    # (non-public API: maps each absolute input name to its source)
    connect_dict = prob.model._conn_global_abs_in2out
    unconnected_inputs = set()
    connected_inputs = set()
    for abs_name, in_data in model_inputs:
        if abs_name in connect_dict and 'auto_ivc' not in connect_dict[abs_name]:
            connected_inputs.add(in_data['prom_name'])
        else:
            unconnected_inputs.add(in_data['prom_name'])

    # now we need to check if there are any unconnected inputs
    # in the model that aren't in the spec
    illegal_unconnected = force_connected_set.intersection(unconnected_inputs)
    if len(illegal_unconnected) > 0:
        raise ValueError(f'unconnected inputs {illegal_unconnected} are not allowed')


p = om.Problem()
###############################################################################################
# comment and uncomment these three lines to change the error you get from the validate method
###############################################################################################
# p.model.add_subsystem('c0', om.ExecComp('x=3*a'), promotes_outputs=['x'])
# p.model.add_subsystem('c1', om.ExecComp('b=a+17'))
# p.model.connect('c1.b', 'c2.b')
p.model.add_subsystem('c2', om.ExecComp('y=2*x+b'), promotes_inputs=['x'])
p.model.add_subsystem('c3', om.ExecComp('z=x**2+y'))
p.setup()
p.final_setup()

# example call; the list of required names here is only illustrative
validate_connections(p, ['c2.b'])
If this is a feature you think should be added to OpenMDAO proper, then feel free to submit a POEM proposing how a formal feature and its API might look.
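The tagging variant mentioned above could look roughly like the following. This is a sketch on plain Python data rather than on a real OpenMDAO model: the `must_connect` tag, the function name, and the sample metadata are all hypothetical, and the input list only imitates the (absolute name, metadata) pairs that the code above iterates over. It assumes, as the answer's code does, that an automatically created source has `auto_ivc` in its name.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical tag name; any string would do.
REQUIRED_TAG = "must_connect"


def required_but_unconnected(
    inputs: List[Tuple[str, dict]],
    sources: Dict[str, str],
    tag: str = REQUIRED_TAG,
) -> Set[str]:
    """Return promoted names of tagged inputs that have no real source."""
    missing = set()
    for abs_name, meta in inputs:
        if tag not in meta.get("tags", ()):
            continue  # this input is allowed to stay unconnected
        src = sources.get(abs_name)
        # No entry, or an automatically created source, counts as unconnected.
        if src is None or "auto_ivc" in src:
            missing.add(meta["prom_name"])
    return missing


# Made-up data shaped like the example model (c2 and c3 only).
inputs = [
    ("c2.x", {"prom_name": "x", "tags": set()}),
    ("c2.b", {"prom_name": "c2.b", "tags": {REQUIRED_TAG}}),
    ("c3.x", {"prom_name": "c3.x", "tags": set()}),
    ("c3.y", {"prom_name": "c3.y", "tags": set()}),
]
sources = {name: f"_auto_ivc.v{i}" for i, (name, _) in enumerate(inputs)}

missing_before = required_but_unconnected(inputs, sources)

# Simulate connecting c1.b to c2.b.
missing_after = required_but_unconnected(inputs, {**sources, "c2.b": "c1.b"})
```

With tags, the set of required inputs lives next to the variable declarations instead of in a separate list passed to the validate function.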

How to get a list of Auto-IVC component output names

I'm switching over to using the Auto-IVC component as opposed to the IndepVar component. I'd like to be able to get a list of the promoted output names of the Auto-IVC component, so I can then use them to go and pull the appropriate value out of a configuration file and set the values that way. This will get rid of some boilerplate.
p.model._auto_ivc.list_outputs()
returns an empty list. It seems that p.model.__dict__ has this information encoded in it, but I don't know exactly what is going on there, so I am wondering if there is an easier way to do it.
To avoid confusion for future readers, I assume you meant that you wanted the promoted input names for the variables connected to the auto_ivc outputs.
We don't have a built-in function to do this, but you could do it with a bit of code like this:
seen = set()
for n in p.model._inputs:
    src = p.model.get_source(n)
    if src.startswith('_auto_ivc.') and src not in seen:
        print(src, p.model._var_allprocs_abs2prom['input'][n])
        seen.add(src)
assuming 'p' is the name of your Problem instance.
The code above just prints each auto_ivc output name followed by the promoted input it's connected to.
Here's an example of the output when run on one of our simple test cases:
_auto_ivc.v0 par.x
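To get from that listing to the stated goal (pulling values from a configuration file), the pairs could be collected and matched against the configuration. A minimal sketch, assuming the configuration has already been loaded into a dictionary keyed by promoted name; `values_from_config` and the sample data are hypothetical:

```python
from typing import Dict, Iterable, List, Tuple


def values_from_config(
    pairs: Iterable[Tuple[str, str]],
    config: Dict[str, float],
) -> Tuple[Dict[str, float], List[str]]:
    """Split promoted input names into those found in config and those missing."""
    found: Dict[str, float] = {}
    missing: List[str] = []
    for _src, prom_name in pairs:
        if prom_name in config:
            found[prom_name] = config[prom_name]
        else:
            missing.append(prom_name)
    return found, missing


# (auto_ivc output name, promoted input name), as printed by the loop above.
pairs = [("_auto_ivc.v0", "par.x")]
# Stand-in for values read from a configuration file.
config = {"par.x": 3.0, "unrelated": 1.0}

found, missing = values_from_config(pairs, config)
```

Each entry in `found` could then be applied with `p.set_val(name, value)` after setup, and `missing` shows which inputs the configuration does not cover.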

None of the keys entered are valid keys - R

I am trying to learn how to manipulate microarrays for differential expression analysis. While I am trying to add some annotation I can not find the keytype related to:
select(hugene10sttranscriptcluster.db,
keys = my_keys,
columns = c("GENENAME", "SYMBOL"),
keytype = "PROBEID")
-------------------------------------------------------
Error in .testForValidKeys(x, keys, keytype, fks) :
None of the keys entered are valid keys for 'PROBEID'. Please use the keys method to see a listing of valid arguments.
where the keys are:
my_keys
---------------------------------------------------------------------
[1] "16650045" "16650047" "16650049" "16650051" "16650053" "16650055" "16650057" "16650059"
I tried every possible type from keytypes(hugene10sttranscriptcluster.db) with no successful result:
"16650045" %in% keys(hugene10sttranscriptcluster.db, "GENEID")
------------------------------------------------------------------
[1] FALSE
Is there any documentation or alternative where I can find it? I have been looking through the documentation (ArrayExpress), but it did not help me. I am also not sure: is it possible that I need a different package from hugene10sttranscriptcluster.db?
Indeed, I did have a problem with the package. If anyone has the same problem, look up the annotation of the microarray in the documentation (pd.hugene.2.0.st in my case), then install and use the proper package (hugene20sttranscriptcluster.db).

Is it possible to use the solution template in exams2nops?

When I try to generate the exam solutions with exams2nops(...template="solution"...), I get the following error message:
Error in exams2pdf(file, n = n, nsamp = nsamp, dir = dir, name = name, :
formal argument "template" matched by multiple actual arguments
How can I produce the exam solutions with exams2nops?
You cannot do that in one go, you need two runs after setting the same seed, e.g.,
set.seed(1)
exams2nops(my_exam)
set.seed(1)
exams2pdf(my_exam, template = "my_solution.tex")
You can use the solution.tex provided within the package as a starting point for my_solution.tex. But you may want to translate it to your natural language, use the name of your university, possibly insert a logo, add your actual exam name, possibly some intro text, etc. In exams2pdf() you need to add these things in the template LaTeX file directly.
Wouldn't template = "solution" work in exams2pdf()? Also, can we do something like:
usepackage = "pdfpages", intro = intro2,... ?

Use input variable in assert or specify the data to assert

I have a unit test for a function that adds data (untransformed) to the database. The data to insert is given to the create function.
Do I use the input data in my asserts or is it better to specify the data that I’m asserting?
For example:
$personRequest = [
    'name' => 'John',
    'age' => 21,
];
$id = savePerson($personRequest);
$personFromDb = getPersonById($id);
$this->assertEquals($personRequest['name'], $personFromDb['name']);
$this->assertEquals($personRequest['age'], $personFromDb['age']);
Or
$id = savePerson([
    'name' => 'John',
    'age' => 21,
]);
$personFromDb = getPersonById($id);
$this->assertEquals('John', $personFromDb['name']);
$this->assertEquals(21, $personFromDb['age']);
I think the 1st option is better. Your input data may change in the future, and if you go with the 2nd option, you will have to change the assertion data every time.
The 2nd option is useful when your output is going to be the same irrespective of your input data.
I got an answer from Adam Wathan by e-mail. (I took his Test-Driven Laravel course and noticed he uses the 'specify' option.)
I think it's just personal preference, I like to be able to visually
skim and see "ok this specific string appears here in the output and
here in the input", vs. trying to avoid duplication by storing things
in variables. Nothing wrong with either approach in my opinion!
So I can't choose a correct answer.
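For comparison, here are the same two styles sketched in Python's unittest. This is an illustration only: an in-memory dictionary stands in for the database, and `save_person` and `get_person_by_id` are hypothetical stand-ins for the functions in the question.

```python
import unittest
from typing import Dict

# In-memory stand-in for the database table.
_DB: Dict[int, dict] = {}


def save_person(person: dict) -> int:
    """Store a copy of the record and return its new id."""
    new_id = len(_DB) + 1
    _DB[new_id] = dict(person)
    return new_id


def get_person_by_id(person_id: int) -> dict:
    """Return a copy of the stored record."""
    return dict(_DB[person_id])


class SavePersonTest(unittest.TestCase):
    def test_assert_against_input(self):
        # Style 1: compare the stored record with the input variable.
        request = {"name": "John", "age": 21}
        stored = get_person_by_id(save_person(request))
        self.assertEqual(request["name"], stored["name"])
        self.assertEqual(request["age"], stored["age"])

    def test_assert_against_literals(self):
        # Style 2: spell out the expected values in the assertions.
        stored = get_person_by_id(save_person({"name": "John", "age": 21}))
        self.assertEqual("John", stored["name"])
        self.assertEqual(21, stored["age"])
```

One practical difference: if the save function modified the dictionary it was given, the first style would compare against the modified values, while the second would still compare against the values the test author intended.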
