Fine-tuning BERT with [unusedxxx] tokens

I used BertTokenizer and BertForMaskedLM from HuggingFace transformers to do word prediction. I fine-tuned the model on a custom dataset of old British government archive text records. Perhaps these documents contain some archaic English words that were not in the BERT vocabulary and were therefore mapped to the [unusedxxx] tokens.
I did not explicitly add any new words to the vocabulary; all I did was run the standard tokenizer on my own input.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
with open('./data/myData.txt', 'r') as fp:
    text = fp.read().split('\n')
inputs = tokenizer(text, return_tensors='pt', max_length=512, truncation=True, padding='max_length')
When predicting, this was the output:
INPUT:
Lord [MASK].
OUTPUT:
Lord [PAD].
Lord [unused45].
Lord [unused221].
Lord [unused581].
Lord [unused708].
INPUT:
THE [MASK] OF KIMBERLEY
OUTPUT:
THE [PAD] OF KIMBERLEY
THE [unused45] OF KIMBERLEY
THE [unused581] OF KIMBERLEY
THE [unused221] OF KIMBERLEY
THE [unused270] OF KIMBERLEY
The actual output for the first sentence should look something like
Lord Balfour, Lord Cornwall, etc.
and for the second sentence it would be something like
The Earl of Kimberley
How can I get it to display the actual words instead of the [unusedxxx] tokens?

Related

PDF search for (key)words with Spanish and French letters in R

ISSUE:
I am trying to extract multiple keywords and their surrounding text from a suite of PDF documents in English, Spanish, and French. For English PDF documents it works like a charm, but not for terms that contain accented (non-ASCII) letters used in Spanish and French (e.g., é, ê, ô). Code for reading English PDFs:
library(textreadr)
library(pdftools)
library(pdfsearch)
keyword = c('biology') # define searched keyword
dirct <- "~/Documents/pdfs" # define directory
### keyword search
result <- keyword_directory(dirct,
                            keyword = keyword,
                            surround_lines = 0,
                            full_names = TRUE)
Running the same code for terms with letters specific to French or Spanish (e.g., é, ê, ô) does not yield any results.
WHAT I HAVE TRIED:
I saw that the letters are converted into different Unicode representations:
keyword = c('biología') # keyword
"biolog\303\255a" # how the keyword is listed under Values
"biolog<U+00E1>" # the Unicode the keyword_directory function converts the keyword to
I have tried changing the keyword search to these Unicode representations, but this didn't yield any results.
keyword = c('biolog\303\255a') / keyword = c('biolog<U+00E1>')
I'm stuck with the keyword_directory function because it extracts both the keywords and the surrounding text from the PDFs.
Maybe you can try the following replacements (see the "Hex code point" field at http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%C3%A9&mode=char):
1. é can be replaced by "\U00E9" (if you type "\U00E9" in R, you will get "é");
2. ê can be replaced by "\U00EA";
etc.
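For instance, a minimal, untested sketch along these lines, applying the same idea to the í in 'biología' (hex code point 00ED) and reusing the keyword_directory call from the question:
keyword <- c("biolog\u00EDa")  # 'biología'; \u00ED is the 4-digit escape for í, so the trailing 'a' stays literal
result <- keyword_directory(dirct,
                            keyword = keyword,
                            surround_lines = 0,
                            full_names = TRUE)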
I do not have access to your PDFs, so I can't test it. If you could provide some links to the PDFs you consider for your search, it would be useful.

Statistical Machine Translation from Hindi to English using MOSES

I need to create a Hindi to English translation system using MOSES. I have got a parallel corpus containing about 10000 Hindi sentences and their corresponding English translations. I followed the method described in the Baseline system creation page. But, just in the first stage, when I wanted to tokenise my Hindi corpus and tried to execute
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l hi < ~/corpus/training/hi-en.hi > ~/corpus/hi-en.tok.hi
the tokeniser gave me the following output:
Tokenizer Version 1.1
Language: hi
Number of threads: 1
WARNING: No known abbreviations for language 'hi', attempting fall-back to English version...
I even tried with 'hin', but it still didn't recognise the language. Can anyone tell me the correct way to build the translation system?
Moses does not support Hindi for tokenization; tokenizer.perl relies on the nonbreaking_prefix.* files (see https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl#L516).
The languages available with nonbreaking prefixes from Moses are:
ca: Catalan
cs: Czech
de: German
el: Greek
en: English
es: Spanish
fi: Finnish
fr: French
hu: Hungarian
is: Icelandic
it: Italian
lv: Latvian
nl: Dutch
pl: Polish
pt: Portuguese
ro: Romanian
ru: Russian
sk: Slovak
sl: Slovene
sv: Swedish
ta: Tamil
from https://github.com/moses-smt/mosesdecoder/tree/master/scripts/share/nonbreaking_prefixes
However, all hope is not lost: you can tokenize your text with another tokenizer before training the machine translation model with Moses. Try googling "Hindi tokenizers"; there are tonnes of them around.
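As an illustration only (this is not part of Moses; the example sentence and the use of R's stringi package are my own stand-ins), ICU word-boundary segmentation handles Devanagari script and could serve as a pre-tokenizer:
library(stringi)

# Hypothetical example line; in practice read the lines from ~/corpus/training/hi-en.hi
hi_line <- "यह एक उदाहरण वाक्य है।"
tokens <- stri_split_boundaries(hi_line, type = "word")[[1]]
tokens <- tokens[stri_trim_both(tokens) != ""]   # drop whitespace-only pieces, keep words and punctuation
tokenized_line <- paste(tokens, collapse = " ")  # space-separated tokens
The space-joined lines could then be written back to a file and used in place of the output of tokenizer.perl in the Moses training pipeline.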

Language detection in R with the textcat package: how to restrict to a few languages?

I need to detect the language of many short texts, using R.
I am using the textcat package, which determines which of many (say 30) European languages each text is written in. However, I know my texts are either French or English (or, more generally, a small subset of the languages handled by textcat).
How can I add this knowledge when calling the textcat functions?
Thanks,
This might work. Presumably you wish to restrict the language choices to English or French to reduce the misclassification rate. Without example text for which the desired result is known I cannot test the approach below. However, it does seem to restrict the language choices to English and French.
my.profiles <- TC_byte_profiles[names(TC_byte_profiles) %in% c("english", "french")]
my.profiles
my.text <- c("This is an English sentence.",
"Das ist ein deutscher Satz.",
"Il s'agit d'une phrase française.",
"Esta es una frase en espa~nol.")
textcat(my.text, p = my.profiles)
# [1] "english" "english" "french" "french"
You can also achieve high classification accuracy with the built-in ECIMCI_profiles.
Call
textcat(my.text, p = ECIMCI_profiles)
and optionally combine this with the %in% line of code from Mark Miller's answer.
The ECIMCI_profiles database of the textcat package uses a larger maximal n-gram size of 1000 (compared to 400 for the TC_byte_profiles and TC_char_profiles databases).
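A sketch of that combination (the profile names "en" and "fr" below are an assumption; depending on the textcat version they may instead be full language names, so check names(ECIMCI_profiles) first):
# Check names(ECIMCI_profiles) before subsetting; "en" and "fr" are assumed here
my.ecimci <- ECIMCI_profiles[names(ECIMCI_profiles) %in% c("en", "fr")]
textcat(my.text, p = my.ecimci)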

Filtering out non-English words from a corpus using `textcat`

Similar to this SO member, I've been looking for a simple package in R that filters out words that are non-English. For example, I might have a list of words that looks like this:
Flexivel
eficaz
gut-wrenching
satisfatorio
apropiado
Benutzerfreundlich
interessante
genial
cool
marketing
clients
internet
My end goal is to simply filter out the non-English words from the corpus so that my list is simply:
gut-wrenching
cool
marketing
clients
internet
I've read in the data as a data.frame, although it will subsequently be transformed into a corpus and then a TermDocumentMatrix in order to create a wordcloud using wordcloud and tm.
I am currently using the package textcat to filter by language. The documentation is a bit above my head, but seems to indicate that you can run the command textcat on lists. For example, if the data above was in a data.frame called df with a single column called "words", I'd run the command:
library(textcat)
textcat(c(df$words))
However, this has the effect of reading the entire list of words as a single document, rather than looking at each row and determining its language. Please help!
For a dictionary search you could use aspell:
txt <- c("Flexivel", "eficaz", "gut-wrenching", "satisfatorio", "apropiado",
"Benutzerfreundlich", "interessante", "genial", "cool", "marketing",
"clients", "internet")
fn <- tempfile()
writeLines(txt, fn)
result <- aspell(fn)
result$Original gives the non-matching words. From those you can select the matching words:
> result$Original
[1] "Flexivel" "eficaz" "satisfatorio"
[4] "apropiado" "interessante" "Benutzerfreundlich"
> english <- txt[!(txt %in% result$Original)]
> english
[1] "gut-wrenching" "genial" "cool" "marketing"
[5] "clients" "internet"
However, as Carl Witthoft indicates, you cannot be sure that these are actually English words. 'cool', 'marketing' and 'internet' are also valid Dutch words, for example.
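A small sketch wrapping the approach above into a reusable filter, applied to the data.frame column from the question (the helper name keep_english is made up, and aspell() needs a spell-checking program such as aspell or hunspell installed on the system):
keep_english <- function(words) {
  fn <- tempfile()
  writeLines(words, fn)
  misses <- aspell(fn)$Original  # words not found in the (English) dictionary
  words[!(words %in% misses)]
}
english <- keep_english(as.character(df$words))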

How to count the number of sentences in a text in R?

I read a text into R using the readChar() function. I aim to test the hypothesis that the sentences of the text have as many occurrences of the letter "a" as of the letter "b". I recently discovered the {stringr} package, which helped me a great deal to do useful things with my text, such as counting the number of characters and the total number of occurrences of each letter in the entire text. Now I need to know the number of sentences in the whole text. Does R have any function that can help me do that? Thank you very much!
Thank you @gui11aume for your answer. A very good package I just found that can help do the work is {openNLP}. This is the code to do that:
install.packages("openNLP") ## Installs the required natural language processing (NLP) package
install.packages("openNLPmodels.en") ## Installs the model files for the English language
library(openNLP) ## Loads the package for use in the task
library(openNLPmodels.en) ## Loads the model files for the English language
text = "Dr. Brown and Mrs. Theresa will be away from a very long time!!! I can't wait to see them again." ## This sentence has unusual punctuation as suggested by #gui11aume
x = sentDetect(text, language = "en") ## sentDetect() is the function to use. It detects and seperates sentences in a text. The first argument is the string vector (or text) and the second argument is the language.
x ## Displays the different sentences in the string vector (or text).
[1] "Dr. Brown and Mrs. Theresa will be away from a very long time!!! "
[2] "I can't wait to see them again."
length(x) ## Displays the number of sentences in the string vector (or text).
[1] 2
The {openNLP} package is really great for natural language processing in R and you can find a good and short intro to it here or you can check out the package's documentation here.
Three more languages are supported in the package. You just need to install and load the corresponding model files.
{openNLPmodels.es} for Spanish
{openNLPmodels.ge} for German
{openNLPmodels.th} for Thai
What you are looking for is sentence tokenization, and it is not as straightforward as it seems, even in English (sentences like "I met Dr. Bennett, the ex-husband of Mrs. Johnson." contain full stops that do not end the sentence).
R is definitely not the best choice for natural language processing. If you are Python proficient, I suggest you have a look at the nltk module, which covers this and many other topics. You can also copy the code from this blog post, which does sentence tokenization and word tokenization.
If you want to stick to R, I would suggest you count the end-of-sentence characters (., ?, !), since you are able to count characters. A way of doing it with a regular expression is like so:
text <- 'Hello world!! Here are two sentences for you...'
length(gregexpr('[[:alnum:] ][.!?]', text)[[1]])
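A small wrapper around the same regular expression (my addition, not part of the original answer); it also returns 0 rather than 1 for text with no sentence-ending punctuation, since gregexpr() reports -1 in that case:
count_sentences <- function(text) {
  # Match a letter, digit, or space followed by end-of-sentence punctuation
  m <- gregexpr('[[:alnum:] ][.!?]', text)[[1]]
  if (m[1] == -1) 0L else length(m)
}
count_sentences('Hello world!! Here are two sentences for you...')
# [1] 2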
