Transformer/BERT token prediction vocabulary (filtering the special tokens out of the set of possible tokens) - bert-language-model

With the Transformer model, especially with the BERT, does it make sense (and would it be statistically correct) to programmatically forbid the model to result with the special tokens as predictions?
How is that in the original implementations? During convergence the models have to learn not to predict these but would this intervention help (or the opposite)?
I would consider the [MASK], [CLS] tokens mainly
[PAD] token could have some sense as well (but that not in all situations)

If I understand your question, you're asking how BERT (or other transformer based models) handle special characters. This is less relevant to the model architecture (i.e. this answer is relevant to auto-regressive models, or even non-neural models) than it is to the pre-processing steps.
In particular, BERT tokenizers use a Byte-Pair Encoding tokenizer to split the text and tokens into subwords. Should the tokenizer not recognize a sequence of characters, it will replace the sequence of characters with an UNK meta-token, very much like the MASK or CLS tokens. Google will have many answers should you want to see more specifics, for example, from a blog article:
There is an important point to note when we use a pre-trained model. Since the model is pre-trained on a certain corpus, the vocabulary was also fixed. In other words, when we apply a pre-trained model to some other data, it is possible that some tokens in the new data might not appear in the fixed vocabulary of the pre-trained model. This is commonly known as the out-of-vocabulary (OOV) problem.
For tokens not appearing in the original vocabulary, it is designed that they should be replaced with a special token [UNK], which stands for unknown token.

Related

reusable holdout in mlr

How can someone change the cross validation or holdout procedures in mlr so that before testing with the validation set, that same validation set is changed according to a procedure, namely the reusable holdout procedure?
Procedure:
http://insilico.utulsa.edu/wp-content/uploads/2016/10/Dwork_2015_Science.pdf
Short answer: mlr doesn't support that.
Long answer: My experience with differential privacy for machine learning is that in practice it doesn't work as well as advertised. In particular, to apply thresholdout you need a) copious amounts of data and b) the a priori probability that a given classifier will overfit on the given data -- something you can't easily determine in practice. While the paper you reference comes with example code that shows that thresholdout works in this particular case, but the amount of noise added in the code looks like it was determined on an ad-hoc basis; the relationship to the thresholdout algorithm described in the paper isn't clear.
Before differential privacy can be robustly applied in practice in scenarios like that, mlr won't support it.

Is having multiple features for the same data bad practice (e.g. use both ordinal and binarized time series data)?

I'm trying to train a learning model on real estate sale data that includes dates. I've looked into 1-to-K binary encoding, per the advice in this thread, however my initial assessment is that it may have the weakness of not being able to train well on data that is not predictably cyclic. While real estate value crashes are recurring, I'm concerned (maybe wrongfully so, you tell me) that doing 1-to-K encoding will inadvertently overtrain on potentially irrelevant features if the recurrence is not explainable by a combination of year-month-day.
That said, I think there is potentially value in that method. I think that there is also merit to the argument of converting time series data to ordinal, as also recommended in the same thread. Which brings me to the real question: is it bad practice to duplicate the same initial feature (the date data) in two different forms in the same training data? I'm concerned if I use methods that rely on the assumption of feature independence I may be violating this by doing so.
If so, what are suggestions for how to best get the maximal information from this date data?
Edit: Please leave a comment how I can improve this question instead of down-voting.
Is it bad practice?
No, sometimes transformations make your Feature easier accesible for your algorithm. Following this line of thought you converting Features is completely fine.
Does it scew your algorithm?
Concerning runtime it might be better to not have to transform your data everytime. Depending on your algorithm you might get worse interpretability (if that is important for you) depending on the type of transformations.
Also if you want to restrict the amount / set of Features your algorithm should use, you might add Information redundancies by adding transformed Features.
So what should you do?
Transform your data / Features as much as you want and as often as you want.
That's not hurting anyone, but rather helping by increasing the Feature space. But after you did so, do a PCA or something similar in order to find redundancies in your Features and reduce your Feature space again.
Note:
I tried to be General, obviously this is highly dependant on the Kind of algorithm you're using.

Can I choose clustering methods base on internal validation result?

From what I understand, it is usually difficult to select the best possible clustering method for your data priori, and we can use cluster validity to compare the results of different clustering algorithms and choose the one with the best validation scores.
I use an internal validation function from R stats package on my clustering result (for clustering methods I used R igraph fast.greedy and walk.trap).
The outcome is a list of many validation scores.
In the list, almost in every validation Fast greedy method has better scores than Walk trap, except in entropy walk trap method has a better score.
Can I use this validation result list as one of my reasons to explain to others why I choose Fast greedy method rather than walk trap method?
Also, is there any way to validate a disconnected graph?
Short answer: NO!
You can't use a internal index to justify the choose of a algorithm over another. Why?
Because evaluation indexes were designed to evaluate clustering results, i.e., partitions and hierarchies. You can only use them to access the quality of a clustering and therefore justify its choice over the others options. But again, you can't use them to justify choosing a particular algorithm to apply on a different dataset based on a single previous experiment.
For this task, several benchmarks are needed to determine which algorithms are generally better and should be tried first. Here some paper about it: Community detection algorithms: a comparative analysis.
Edit: What I am saying is, your validation indexes may show that the fast.greed's solution is better than the walk.trap's. However, they do not explain why you chose these algorithms instead of any others. Only your data, your assumptions, and your constraints could do that.
Also, is there any way to validate a disconnected graph?
Theoretically, any evaluation index can do this. Technically, some implementations don't handle disconnected components.

How to perform Semantic Similarity in document

I am doing project In which I need to ranked text document according to search query like search engine but I need to rank documents having semantic similarity of the word or sentence,I am unable to start regarding how to find semantic similarity using java. Is there any link or any paper through which I can start finding semantic similarity of words in documents or any idea.
The standard way to represent documents in term-space is to treat the terms as mutually orthogonal or independent of each other, e.g. the terms "atomic" and "nuclear" although being synonymous and hence interchangeable are treated as distinct, whereas the semantic similarity between this pair of words should be fairly high.
Thus, for implementing a semantic similarity based score, you need to know the relation between a pair of words, for which you can use either of the following.
An external resource such as a Wordnet or a semantic similarity library such as DISCO.
A corpus analysis methodology such as Latent Semantic Analysis (LSA) which reduces the dimensionality of the term space by combining semantically similar terms such as "atomic" and "nuclear".
Have a look at this Demo for semantic similarity
It shows the demo for different algorithms. you can see which one works for you and try to go with it. Also the this "semilar" module can be used with the help of Java I think. You can try using it, I didnt tried it yet but the demo is for the same on that page. Thanks :)

How to get started on Information Extraction?

Could you recommend a training path to start and become very good in Information Extraction. I started reading about it to do one of my hobby project and soon realized that I would have to be good at math (Algebra, Stats, Prob). I have read some of the introductory books on different math topics (and its so much fun). Looking for some guidance. Please help.
Update: Just to answer one of the comment. I am more interested in Text Information Extraction.
Just to answer one of the comment. I am more interested in Text Information Extraction.
Depending on the nature of your project, Natural language processing, and Computational linguistics can both come in handy -they provide tools to measure, and extract features from the textual information, and apply training, scoring, or classification.
Good introductory books include OReilly's Programming Collective Intelligence (chapters on "searching, and ranking", Document filtering, and maybe decision trees).
Suggested projects utilizing this knowledge: POS (part-of-speech) tagging, and named entity recognition (ability to recognize names, places, and dates from the plain text). You can use Wikipedia as a training corpus since most of the target information is already extracted in infoboxes -this might provide you with some limited amount of measurement feedback.
The other big hammer in IE is search, a field not to be underestimated. Again, OReilly's book provides some introduction in basic ranking; once you have a large corpus of indexed text, you can do some really IE tasks with it. Check out Peter Norvig: Theorizing from data as a starting point, and a very good motivator -maybe you could reimplement some of their results as a learning exercise.
As a fore-warning, I think I'm obligated to tell you, that information extraction is hard. The first 80% of any given task is usually trivial; however, the difficulty of each additional percentage for IE tasks are usually growing exponentially -in development, and research time. It's also quite underdocumented -most of the high-quality info is currently in obscure white papers (Google Scholar is your friend) -do check them out once you've got your hand burned a couple of times. But most importantly, do not let these obstacles throw you off -there are certainly big opportunities to make progress in this area.
I would recommend the excellent book Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. It covers a broad area of issues which form a great and up-to-date (2008) basis for Information Extraction and is available online in full text (under the given link).
I would suggest you take a look at the Natural Language Toolkit (nltk) and the NLTK Book. Both are available for free and are great learning tools.
You don't need to be good at math to do IE just understand how the algorithm works, experiment on the cases for which you need an optimal result performance, and the scale with which you need to achieve target accuracy level and work with that. You are basically working with algorithms and programming and aspects of CS/AI/Machine learning theory not writing a PhD paper on building a new machine-learning algorithm where you have to convince someone by way of mathematical principles why the algorithm works so I totally disagree with that notion. There is a difference between practical and theory - as we all know mathematicians are stuck more on theory then the practicability of algorithms to produce workable business solutions. You would, however, need to do some background reading both books in NLP as well as journal papers to find out what people found from their results. IE is a very context-specific domain so you would need to define first in what context you are trying to extract information - How would you define this information? What is your structured model? Supposing you are extracting from semi and unstructured data sets. You would then also want to weigh out whether you want to approach your IE from a standard human approach which involves things like regular expressions and pattern matching or would you want to do it using statistical machine learning approaches like Markov Chains. You can even look at hybrid approaches.
A standard process model you can follow to do your extraction is to adapt a data/text mining approach:
pre-processing - define and standardize your data to extraction from various or specific sources cleansing your data
segmentation/classification/clustering/association - your black box where most of your extraction work will be done
post-processing - cleansing your data back to where you want to store it or represent it as information
Also, you need to understand the difference between what is data and what is information. As you can reuse your discovered information as sources of data to build more information maps/trees/graphs. It is all very contextualized.
standard steps for: input->process->output
If you are using Java/C++ there are loads of frameworks and libraries available you can work with.
Perl would be an excellent language to do your NLP extraction work with if you want to do a lot of standard text extraction.
You may want to represent your data as XML or even as RDF graphs (Semantic Web) and for your defined contextual model you can build up relationship and association graphs that most likely will change as you make more and more extractions requests. Deploy it as a restful service as you want to treat it as a resource for documents. You can even link it to taxonomized data sets and faceted searching say using Solr.
Good sources to read are:
Handbook of Computational Linguistics and Natural Language Processing
Foundations of Statistical Natural Language Processing
Information Extraction Applications in Prospect
An Introduction to Language Processing with Perl and Prolog
Speech and Language Processing (Jurafsky)
Text Mining Application Programming
The Text Mining Handbook
Taming Text
Algorithms of Intelligent Web
Building Search Applications
IEEE Journal
Make sure you do a thorough evaluation before deploying such applications/algorithms into production as they can recursively increase your data storage requirements. You could use AWS/Hadoop for clustering, Mahout for large scale classification amongst others. Store your datasets in MongoDB or unstructured dumps into jackrabbit, etc. Try experimenting with prototypes first. There are various archives you can use to base your training on say Reuters corpus, tipster, TREC, etc. You can even check out alchemy API, GATE, UIMA, OpenNLP, etc.
Building extractions from standard text is easier than say a web document so representation at pre-processing step becomes even more crucial to define what exactly it is you are trying to extract from a standardized document representation.
Standard measures include precision, recall, f1 measure amongst others.
I disagree with the people who recommend reading Programming Collective Intelligence. If you want to do anything of even moderate complexity, you need to be good at applied math and PCI gives you a false sense of confidence. For example, when it talks of SVM, it just says that libSVM is a good way of implementing them.
Now, libSVM is definitely a good package but who cares about packages. What you need to know is why SVM gives the terrific results that it gives and how it is fundamentally different from Bayesian way of thinking ( and how Vapnik is a legend).
IMHO, there is no one solution to it. You should have a good grip on Linear Algebra and probability and Bayesian theory. Bayes, I should add, is as important for this as oxygen for human beings ( its a little exaggerated but you get what I mean, right ?). Also, get a good grip on Machine Learning. Just using other people's work is perfectly fine but the moment you want to know why something was done the way it was, you will have to know something about ML.
Check these two for that :
http://pindancing.blogspot.com/2010/01/learning-about-machine-learniing.html
http://measuringmeasures.com/blog/2010/1/15/learning-about-statistical-learning.html
http://measuringmeasures.com/blog/2010/3/12/learning-about-machine-learning-2nd-ed.html
Okay, now that's three of them :) / Cool
The Wikipedia Information Extraction article is a quick introduction.
At a more academic level, you might want to skim a paper like Integrating Probabilistic Extraction Models and Data Mining to Discover Relations and Patterns in Text.
Take a look here if you need enterprise grade NER service. Developing a NER system (and training sets) is a very time consuming and high skilled task.
This is a little off topic, but you might want to read Programming Collective Intelligence from O'Reilly. It deals indirectly with text information extraction, and it doesn't assume much of a math background.

Resources