BERT for spelling correction - bert-language-model

I am a beginner.
Do we have to train BERT on our own data first and then run the spelling correction as a separate step?
Or are these two steps done together?
Can you explain a little about the output of the BERT model?
I'm looking for a way to do spell checking with BERT for a language other than English.

Related

Number is recognized as a noun in spacy portuguese model

Just out of curiosity I would like to ask why the number "4950" has the PoS (part of speech) of "NOUN" in spaCy v3.1.3, using the large model in Portuguese. It is not in the GitHub token exception file (https://github.com/explosion/spaCy/blob/master/spacy/lang/pt/tokenizer_exceptions.py).
import spacy

nlp = spacy.load('pt_core_news_lg')
doc = nlp('4950')
print(doc[0].text, doc[0].pos_)
# 4950 NOUN
Is there any way to know what the other particular cases are?
To be clear, this should normally be a NUM.
This looks like it's just an error, and it doesn't affect most numbers, including similar ones like 4951. It's possible that somewhere in the Portuguese training data 4950 is labelled NOUN for some reason.
It's hard to explain individual predictions by the statistical models, and they make errors sometimes. This one is particularly egregious and may indicate an issue with data preparation, but in general errors like this are always possible. See this thread.
Also note this doesn't seem to be an issue in the small model. I'll look into this internally to see if there's a bug somewhere.
Quick update: If you use this in a sentence, like 4950 maçãs, it's properly labelled as NUM. One-word sentences are not something the models are trained on a lot and might cause more weird results.
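If the mislabelled number matters downstream, one pragmatic workaround is to post-process the tagger output. The sketch below is plain Python and does not call spaCy itself; it assumes you have already collected (text, pos) pairs, for example from token.text and token.pos_, and the helper name relabel_numbers is hypothetical.

```python
def relabel_numbers(tagged):
    """Return a copy of (text, pos) pairs with digit-only tokens forced to NUM.

    `tagged` is assumed to be a list of (token_text, pos_tag) tuples, e.g.
    built with [(t.text, t.pos_) for t in doc] on a spaCy Doc.
    """
    fixed = []
    for text, pos in tagged:
        # str.isdigit() is True only for non-empty strings of digit characters,
        # so "4950" matches but "4950kg" or "4.950" do not.
        if text.isdigit():
            pos = "NUM"
        fixed.append((text, pos))
    return fixed


print(relabel_numbers([("4950", "NOUN"), ("maçãs", "NOUN")]))
# [('4950', 'NUM'), ('maçãs', 'NOUN')]
```

This only patches the symptom for purely numeric tokens; it does not explain or fix the model's prediction.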

how does gensim's word2vec differ from tensorflow vector representation?

I am fairly new to the world of NLP embeddings. I have used gensim's word2vec model and TensorFlow's vector representation tutorial.
When training, gensim's word2vec takes tokenized sentences, while the TensorFlow tutorial takes one long list of words. How does this difference affect training? Is there any impact on quality?
Also, how does TensorFlow then handle skip-gram, given that the data is a flat list of words rather than sentences?
I am referring to the TensorFlow tutorial at https://www.tensorflow.org/tutorials/word2vec
Pardon me if my understanding of this domain is wrong; I would appreciate it if it could be cleared up.
Thank you for your guidance and help.
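To make the sentences-versus-flat-list difference concrete, here is a minimal sketch of skip-gram pair generation. It is not the gensim or TensorFlow implementation; the function name skipgram_pairs, the fixed window of 1 and the toy sentences are assumptions for illustration. It shows which (center, context) pairs appear only when the sentences are concatenated into one stream.

```python
def skipgram_pairs(tokens, window=1):
    """Return (center, context) pairs within `window` positions of each other."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


sentences = [["the", "cat", "sat"], ["dogs", "bark", "loudly"]]

# Sentence-aware: windows never cross a sentence boundary.
per_sentence = [p for s in sentences for p in skipgram_pairs(s)]

# Flat stream: the same tokens concatenated into one long list.
flat = skipgram_pairs([w for s in sentences for w in s])

# Pairs that exist only because a window straddled the boundary.
extra = sorted(set(flat) - set(per_sentence))
print(extra)
# [('dogs', 'sat'), ('sat', 'dogs')]
```

In this toy case the flat list adds only the pairs at the sentence boundary; everything else is identical. On a large corpus such boundary pairs are likely a small fraction of all training pairs, so the effect on quality is plausibly minor, though that is an expectation rather than a measured result.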

How to do Language Modeling using HTK

I am confused about how to use HTK for language modeling.
I followed the tutorial example from the Voxforge site
http://www.voxforge.org/home/dev/acousticmodels/linux/create/htkjulius/tutorial
After training and testing I got around 78% accuracy. I did this for my native language. Now I have to use HTK for language modeling.
Is there any tutorial available for doing the same? Please help me.
Thanks
speech_tri
If I understand your question correctly, you are trying to change from a "grammar" to an "n-gram language model" approach. These two methods are alternative ways of specifying what combinations of words are permissible in the responses that a recognizer will return. Having followed the Voxforge process you will probably have a grammar in place.
A language model comes from the analysis of a corpus of text which defines the probabilities of words appearing together. The text corpus used can be very specialized. There are a number of analysis tools such as SRILM (http://www.speech.sri.com/projects/srilm/) and MITLM (https://github.com/mitlm/mitlm) which will read a corpus and produce a model.
Since you are using words from your native language you will need a unique corpus of text to analyze. One way to get a test corpus would be to artificially generate a number of sentences from your existing grammar and use that as the corpus. Then with the new language model in place, you just point the recognizer at it instead of the grammar and hope for the best.
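To illustrate what "probabilities of words appearing together" means, here is a minimal sketch that estimates bigram probabilities from a toy corpus. It is not what SRILM or MITLM do internally (those tools also apply smoothing); the function name bigram_probabilities and the three example sentences are made up for illustration.

```python
from collections import Counter, defaultdict


def bigram_probabilities(sentences):
    """Maximum-likelihood bigram estimates P(next | previous) from a corpus."""
    pair_counts = defaultdict(Counter)
    for sentence in sentences:
        # <s> and </s> mark the start and end of each sentence.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            pair_counts[prev][nxt] += 1
    return {
        prev: {w: c / sum(followers.values()) for w, c in followers.items()}
        for prev, followers in pair_counts.items()
    }


# Toy corpus, standing in for sentences generated from an existing grammar.
corpus = ["call steve", "call john", "dial john"]
model = bigram_probabilities(corpus)
print(model["call"])
# {'steve': 0.5, 'john': 0.5}
```

Unlike a grammar, which only says whether a word sequence is allowed, these estimates rank the allowed continuations by how often they occurred in the corpus.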

Predicting next word with text2vec in R

I am building a language model in R to predict the next word in a sentence based on the previous words. Currently my model is a simple n-gram model with Kneser-Ney smoothing. It predicts the next word by finding the n-gram with maximum probability (frequency) in the training set, where smoothing offers a way to interpolate lower-order n-grams, which can be advantageous in cases where higher-order n-grams have low frequency and may not offer a reliable prediction. While this method works reasonably well, it fails in cases where the n-gram cannot capture the context. For example, "It is warm and sunny outside, let's go to the..." and "It is cold and raining outside, let's go to the..." will suggest the same prediction, because the context of weather is not captured in the last n-gram (assuming n<5).
I am looking into more advanced methods and I found the text2vec package, which maps words into a vector space where words with similar meaning are represented by similar (close) vectors. I have a feeling that this representation can be helpful for next word prediction, but I cannot figure out how exactly to define the training task. My question is whether text2vec is the right tool for next word prediction and, if so, what prediction algorithm is suitable for this task.
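The n-gram limitation described above can be checked directly. The sketch below is only an illustration with a hypothetical helper, ngram_context, and naive whitespace tokenization; it returns the last n-1 tokens, which is all an n-gram model sees when predicting the next word.

```python
def ngram_context(sentence, n):
    """Return the last n-1 tokens (n >= 2), the history an n-gram model uses."""
    tokens = sentence.lower().replace(",", "").split()
    return tuple(tokens[-(n - 1):])


warm = "It is warm and sunny outside, let's go to the"
cold = "It is cold and raining outside, let's go to the"

for n in (2, 3, 4, 5):
    same = ngram_context(warm, n) == ngram_context(cold, n)
    print(n, ngram_context(warm, n), same)
```

For every n up to 5 the two histories are identical, so the model must make the same prediction for both sentences. With this tokenization the histories only start to differ at n = 7, where "sunny" and "raining" enter the window.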
You can try char-rnn or word-rnn (google a little bit).
For a character-level model implemented in R with mxnet, take a look at the mxnet examples. It is probably possible to extend this code to a word-level model using text2vec GloVe embeddings.
If you have any success, let us know (I mean the text2vec and/or mxnet developers). It would be a very interesting case for the R community. I wanted to build such a model/experiment myself, but still haven't had time for it.
There is one implemented solution, available as a complete example, that uses word embeddings. The paper by Makarenkov et al. (2017), Language Models with Pre-Trained (GloVe) Word Embeddings, presents a step-by-step implementation of training a language model using a Recurrent Neural Network (RNN) and pre-trained GloVe word embeddings.
In the paper the authors provide instructions for running the code:
1. Download pre-trained GloVe vectors.
2. Obtain a text to train the model on.
3. Open and adjust the LM_RNN_GloVe.py file parameters inside the main function.
4. Run the following methods:
(a) tokenize_file_to_vectors(glove_vectors_file_name, file_2_tokenize_name, tokenized_file_name)
(b) run_experiment(tokenized_file_name)
The code in Python is here https://github.com/vicmak/ProofSeer.
I also found that @Dmitriy Selivanov recently published a nice and friendly tutorial using his text2vec package, which can be useful for addressing the problem from the R perspective. (It would be great if he could comment further.)
Your intuition is right that word embedding vectors can be used to improve language models by incorporating long distance dependencies. The algorithm you are looking for is called RNNLM (recurrent neural network language model). http://www.rnnlm.org/
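To show why a recurrent model can keep long-distance context, here is a tiny untrained sketch of the recurrence at the core of an RNN language model. Everything in it is an assumption for illustration: the 2-dimensional toy vectors are invented (not real GloVe values), the recurrent weights are arbitrary, the input weights are fixed to identity, and the softmax output layer of a real RNNLM is omitted.

```python
import math


def rnn_final_state(tokens, embeddings, w_h):
    """Run a tiny Elman-style recurrence and return the final hidden state.

    h_t = tanh(x_t + W_h * h_{t-1}), with the input weights fixed to identity
    to keep the sketch short.
    """
    h = [0.0, 0.0]
    for token in tokens:
        x = embeddings[token]
        # The comprehension reads the previous h before h is reassigned.
        h = [
            math.tanh(x[i] + sum(w_h[i][j] * h[j] for j in range(2)))
            for i in range(2)
        ]
    return h


# Toy 2-dimensional vectors, invented for illustration (not real GloVe values).
toy_vectors = {
    "warm": [1.0, 0.0], "cold": [-1.0, 0.0],
    "go": [0.0, 1.0], "to": [0.5, 0.5], "the": [0.2, -0.3],
}
toy_w_h = [[0.5, 0.1], [-0.1, 0.5]]  # arbitrary recurrent weights

h_warm = rnn_final_state(["warm", "go", "to", "the"], toy_vectors, toy_w_h)
h_cold = rnn_final_state(["cold", "go", "to", "the"], toy_vectors, toy_w_h)
print(h_warm, h_cold)
```

Although both sequences end in the same three words, the final hidden states differ, because the first word is still reflected in the state. A trained RNNLM would feed that state into an output layer to score the next word, so "warm" and "cold" can lead to different predictions.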

Smarter than an Eighth grader? Kaggle AI Challenge. R

I am working on the Allen AI Science Challenge currently up on Kaggle.
The idea behind the challenge is to train a model using the training data provided (a set of eighth-grade-level science questions, each with four answer options, exactly one of which is correct, plus the label of the correct answer) along with any additional knowledge sources (Wikipedia, science textbooks, etc.) so that it can answer science questions as well as an (average?) eighth grader can.
I'm thinking of taking the first crack at the problem in R (I am proficient only in R and C++, and I don't think C++ will be a very useful language for this problem). After exploring the Kaggle forums, I decided to use the tm (text mining), RWeka and topicmodels (Latent Dirichlet Allocation, LDA) packages.
My current approach is to build a text predictor of some sort that, on reading a question, outputs a string of text. I would then compute the cosine similarity between this output and each of the four options given in the test set, and predict the option with the highest cosine similarity as the correct one.
I will train the model using the training data, a Wikipedia corpus and a few science textbooks so that the model does not overfit.
I have two questions here:
Does the overall approach make sense?
What would be a good starting point to build this text predictor? Will converting the corpus (training data, Wikipedia and textbooks) to a term-document/document-term matrix help? I think forming n-grams for all the sources would help, but I don't know what the next step would be, i.e. how exactly the model will predict and output a string of text (of, say, size n) on reading a question.
I have tried implementing part of the approach: finding the optimal number of topics and performing LDA over the training set. Here's the code:
library(topicmodels)
library(RTextTools)

# Read the cleaned training set and make sure the text columns are character.
data <- read.delim("cleanset.txt", header = TRUE)
data$question <- as.character(data$question)
data$answerA <- as.character(data$answerA)
data$answerB <- as.character(data$answerB)
data$answerC <- as.character(data$answerC)
data$answerD <- as.character(data$answerD)

# Build a document-term matrix from the question and the four answer options.
# Named dtm rather than matrix to avoid masking base::matrix.
dtm <- create_matrix(cbind(data$question, data$answerA, data$answerB, data$answerC, data$answerD), language = "english", removeNumbers = FALSE, stemWords = TRUE, weighting = tm::weightTf)

# Fit one LDA model per candidate number of topics (2 to 25).
ks <- 2:25
best.model <- lapply(ks, function(k) LDA(dtm, k))

# Collect the log-likelihood of each model and show the best-scoring row.
best.model.logLik.df <- data.frame(topics = ks, LL = sapply(best.model, function(m) as.numeric(logLik(m))))
best.model.logLik.df[which.max(best.model.logLik.df$LL), ]

# Final model, fitted with 25 topics.
best.model.lda <- LDA(dtm, 25)
Any help will be appreciated!
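The answer-selection step of the approach described above (compare the predicted text with each option by cosine similarity) can be sketched independently of how the predictor itself is built. This is a minimal bag-of-words illustration in Python rather than R; the function names, the predicted text and the four options are all hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words term-frequency vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def pick_answer(predicted_text, options):
    """Return the key of the option most similar to the predicted text."""
    return max(options, key=lambda k: cosine_similarity(predicted_text, options[k]))


# Hypothetical predictor output and answer options, for illustration only.
predicted = "plants make food using sunlight"
options = {
    "A": "animals eat other animals",
    "B": "plants use sunlight to make food",
    "C": "rocks are formed by heat",
    "D": "water freezes at zero degrees",
}
print(pick_answer(predicted, options))
# B
```

Raw term frequencies are used here only to keep the sketch short; the same selection logic would apply to TF-IDF weights or to topic distributions from an LDA model.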
