Statistical Machine Translation from Hindi to English using MOSES

I need to create a Hindi to English translation system using MOSES. I have a parallel corpus containing about 10,000 Hindi sentences and their corresponding English translations. I followed the method described in the Baseline system creation page. But in the very first stage, when I tried to tokenise my Hindi corpus by executing
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l hi < ~/corpus/training/hi-en.hi > ~/corpus/hi-en.tok.hi
the tokeniser gave me the following output:
Tokenizer Version 1.1
Language: hi
Number of threads: 1
WARNING: No known abbreviations for language 'hi', attempting fall-back to English version...
I even tried 'hin', but it still didn't recognise the language. Can anyone tell me the correct way to build the translation system?

Moses does not support Hindi for tokenization: tokenizer.perl relies on the nonbreaking_prefix.* files (see https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl#L516).
The languages available with nonbreaking prefixes from Moses are:
ca: Catalan
cs: Czech
de: German
el: Greek
en: English
es: Spanish
fi: Finnish
fr: French
hu: Hungarian
is: Icelandic
it: Italian
lv: Latvian
nl: Dutch
pl: Polish
pt: Portuguese
ro: Romanian
ru: Russian
sk: Slovak
sl: Slovene
sv: Swedish
ta: Tamil
from https://github.com/moses-smt/mosesdecoder/tree/master/scripts/share/nonbreaking_prefixes
However, all hope is not lost: you can tokenize your text with another tokenizer before training the machine translation model with Moses. Try searching for "Hindi tokenizers"; there are tonnes of them around.
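If you have nothing else at hand, a very crude fallback is to pad punctuation with spaces yourself. The sketch below does that in R; the function name and the punctuation set (including the Devanagari danda "।") are illustrative assumptions, and any dedicated Hindi tokenizer will handle abbreviations and clitics far better.
tokenize_hi <- function(x) {
  x <- gsub("([,;:?!।])", " \\1 ", x)  # put spaces around common punctuation
  gsub("\\s+", " ", trimws(x))         # collapse runs of whitespace
}
tokenize_hi("यह एक वाक्य है।")
# [1] "यह एक वाक्य है ।"
This produces the one-sentence-per-line, space-separated-token format that Moses training expects.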

Related

Approach to problem solving, library for transliteration from Unicode symbols to Latin

I need to resolve the following task:
I have registration forms which do not allow non-Latin characters, so I need to transliterate non-Latin characters to Latin characters.
Could you please suggest an approach to solving this task, or a library for transliteration from Unicode symbols to Latin?
I need to transliterate Unicode symbols for the following languages:
Arabic, Armenian, Bengali, Tibetan, Myanmar, Khmer, Chinese (simplified), Chinese, Ethiopian, Devanagari, Georgian, Greek, Hebrew, Hiragana, Katakana, Kanji, Thaana, Hangul, Hanja, Tamil,
Sinhala, Thai to the Latin symbols.
Thank you.
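One concrete option (a sketch, not something the question mentions) is ICU transliteration, which the stringi R package exposes through stri_trans_general(). The "Any-Latin; Latin-ASCII" rule chain first converts to Latin script and then strips diacritics; coverage and quality vary considerably by script, so test it against your full list.
library(stringi)
# transliterate any script to Latin, then reduce to plain ASCII
x <- c("Ελληνικά", "हिन्दी", "العربية")
stri_trans_general(x, "Any-Latin; Latin-ASCII")
# ASCII approximations such as "Ellenika" and "hindi"; the exact
# output depends on the ICU build stringi is linked against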

Why does the right encoding to read a file change depending on system language in R?

I'm working with a .dta file that has some Spanish language strings in a table. I have two machines: one uses English as system language and the other uses Spanish. Depending on which machine I'm using, the correct way to read this file changes:
df <- haven::read_stata("data/input/data.dta", encoding = 'latin1') # Works on spanish computer
df <- haven::read_stata("data/input/data.dta") # Works on english computer
If I specify no encoding (default is Windows-1252 according to docs) on my Spanish machine, characters like "ñ", "°" are replaced with question marks.
If I specify latin1 as encoding on my English machine those characters get weird readings, like these:
ARROZ GRANO LARGO NÂ° 2 instead of ARROZ GRANO LARGO N° 2, or LLAVE LAVAMANO BAÃ‘O instead of LLAVE LAVAMANO BAÑO
This goes against what I thought I knew about encodings, so I'm a bit perplexed. I'm sorry I can't provide a reproducible example, but I hope someone has had this kind of issue, understands what's happening, and knows whether something can be done so that I don't have to change the reading code depending on the machine I'm using.
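This does not explain the root cause, but one way to avoid machine-dependent code is to try a few candidate encodings and keep the first read whose strings look clean. The sketch below is an assumption built around haven::read_stata(); the mojibake check (replacement characters or typical "Ã"/"Â" double-encoding artefacts) is deliberately crude.
library(haven)

# Try each candidate encoding in turn; NULL means haven's default
read_dta_portable <- function(path, encodings = list(NULL, "latin1", "UTF-8")) {
  for (enc in encodings) {
    df <- if (is.null(enc)) read_stata(path) else read_stata(path, encoding = enc)
    strings <- unlist(lapply(df, function(col) if (is.character(col)) col))
    # crude mojibake test: replacement characters or double-encoding artefacts
    if (is.null(strings) || !any(grepl("\uFFFD|Ã|Â", strings))) return(df)
  }
  stop("No candidate encoding produced clean strings")
}

df <- read_dta_portable("data/input/data.dta")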

How do I stop bookdown reordering punctuation?

I have forked bookdown-minimal here to produce the minimal reproducible example of my issue.
I want the following sentence to be rendered as is. That is, I want the full stop (period) to remain outside the quotes.
This is the on-line version of "A Book".
I have made a minimal bookdown example here
The line bibliography: [book.bib] causes the sentence, when built with "Build book", to be rendered as
This is the on-line version of "A Book."
I know this is a convention of American English, but other languages (and other variants of English) don't do this, and I don't want it in the real sentences I have. The issue also seems to occur with other punctuation, such as ! and ?, which even American English places outside the quotes.
What is driving this behaviour? (Note that I am not actually including references in my minimal example.) Is there any easy way to stop it?
Pandoc, which bookdown builds on, respects the lang metadata variable. So if you are writing British English, then add this to your YAML metadata:
lang: en-GB
The result should be
This is the on-line version of “A Book”.
whereas
lang: en-US
gives
This is the on-line version of “A Book.”
If all else fails, you can resort to adding a Lua filter which adds an invisible character like a zero-width joiner. This will prevent the reordering from happening as well.
function Quoted (quote)
  -- Append a zero-width joiner after every quoted span so that
  -- punctuation following the quote cannot be moved inside it.
  return {quote, pandoc.Str '\u{200d}'}
end
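To use the filter with bookdown, one approach (the file name here is an assumption) is to save it as keep-punct.lua in the book directory and register it with pandoc in _output.yml:
bookdown::gitbook:
  pandoc_args: ["--lua-filter=keep-punct.lua"]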
Note that you can take advantage of this trick on a per-case basis by using the &zwj; entity in your Markdown input:
This is the on-line version of "A Book"&zwj;.
If you are targeting LaTeX/PDF output only, wrapping the text in \text{} should also do the trick:
This is the on-line version of "\text{A Book}"\text{.}

Language detection in R with the textcat package: how to restrict it to a few languages?

I need to detect the language of many short texts, using R.
I am using the textcat package, which determines which of many (say 30) European languages each text is written in. However, I know my texts are either French or English (or, more generally, a small subset of the languages handled by textcat).
How can I add this knowledge when calling the textcat functions?
Thanks,
This might work. Presumably you wish to restrict the language choices to English or French to reduce the misclassification rate. Without example text for which the desired result is known, I cannot test the approach below, but it does seem to restrict the language choices to English and French.
library(textcat)  # provides textcat() and the profile databases

# keep only the English and French profiles
my.profiles <- TC_byte_profiles[names(TC_byte_profiles) %in% c("english", "french")]
my.profiles
my.text <- c("This is an English sentence.",
             "Das ist ein deutscher Satz.",
             "Il s'agit d'une phrase française.",
             "Esta es una frase en espa~nol.")
textcat(my.text, p = my.profiles)
# [1] "english" "english" "french"  "french"
You can also achieve high classification accuracy with the built-in ECIMCI_profiles.
Call
textcat(my.text, p = ECIMCI_profiles)
and optionally combine this with the %in% subsetting from Mark Miller's answer, as sketched below.
The ECIMCI_profiles database of the textcat package uses a larger maximal n-gram size of 1000 (compared with 400 for the TC_byte_profiles and TC_char_profiles databases).
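A minimal sketch combining the two answers; I have not verified that the ECIMCI profile names match those of TC_byte_profiles, so inspect names(ECIMCI_profiles) first:
library(textcat)
my.ecimci <- ECIMCI_profiles[names(ECIMCI_profiles) %in% c("english", "french")]
textcat(my.text, p = my.ecimci)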

How to count the number of sentences in a text in R?

I read a text into R using the readChar() function. I aim to test the hypothesis that the sentences of the text have as many occurrences of the letter "a" as of the letter "b". I recently discovered the {stringr} package, which helped me a great deal to do useful things with my text, such as counting the number of characters and the total number of occurrences of each letter in the entire text. Now I need to know the number of sentences in the whole text. Does R have a function that can help me do that? Thank you very much!
Thank you @gui11aume for your answer. A very good package I just found that can help do the work is {openNLP}. This is the code to do that:
install.packages("openNLP") ## Installs the required natural language processing (NLP) package
install.packages("openNLPmodels.en") ## Installs the model files for the English language
library(openNLP) ## Loads the package for use in the task
library(openNLPmodels.en) ## Loads the model files for the English language
text = "Dr. Brown and Mrs. Theresa will be away from a very long time!!! I can't wait to see them again." ## This sentence has unusual punctuation as suggested by #gui11aume
x = sentDetect(text, language = "en") ## sentDetect() is the function to use. It detects and seperates sentences in a text. The first argument is the string vector (or text) and the second argument is the language.
x ## Displays the different sentences in the string vector (or text).
[1] "Dr. Brown and Mrs. Theresa will be away from a very long time!!! "
[2] "I can't wait to see them again."
length(x) ## Displays the number of sentences in the string vector (or text).
[1] 2
The {openNLP} package is really great for natural language processing in R; you can find a good, short intro to it here, or check out the package's documentation here.
Three more languages are supported in the package. You just need to install and load the corresponding model files.
{openNLPmodels.es} for Spanish
{openNLPmodels.ge} for German
{openNLPmodels.th} for Thai
What you are looking for is sentence tokenization, and it is not as straightforward as it seems, even in English: a sentence like "I met Dr. Bennett, the ex-husband of Mrs. Johnson." contains internal full stops.
R is definitely not the best choice for natural language processing. If you are Python proficient, I suggest you have a look at the nltk module, which covers this and many other topics. You can also copy the code from this blog post, which does sentence tokenization and word tokenization.
If you want to stick to R, I would suggest you count the end-of-sentence characters (., ?, !), since you are able to count characters. A way of doing it with a regular expression is like so:
text <- 'Hello world!! Here are two sentences for you...'
# count sentence terminators preceded by a letter, digit or space, so that
# runs like "!!" or "..." are counted only once
length(gregexpr('[[:alnum:] ][.!?]', text)[[1]])
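For the sample text the pattern matches "d!" and "u." once each, skipping the repeated terminators, so the call returns:
# [1] 2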
