Google Translate API ignores accented characters

It appears that the Google Translate API (I'm using v3 with the default model) ignores accents, treating accented characters the same as unaccented ones. For example, in Spanish:
habló = I speak (should be "he/she spoke")
hablo = I speak (correct)
This appears to be a fundamental problem, not just a few cases of mistranslation.
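Before attributing this to the API, it may be worth ruling out the client side. The following is a minimal sketch, not a diagnosis of the API: it only checks that the string being sent really contains the accented character, and in which Unicode form. The helper name `describe` is hypothetical.

```python
import unicodedata

def describe(text) -> list:
    """Return one 'U+XXXX NAME' entry per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}" for ch in text]

# "habló" with the accent as a single precomposed code point (NFC form).
precomposed = "habl\u00f3"
# The same word with a plain "o" followed by a combining acute accent (NFD form).
decomposed = unicodedata.normalize("NFD", precomposed)

print(describe(precomposed)[-1])    # U+00F3 LATIN SMALL LETTER O WITH ACUTE
print(describe(decomposed)[-1])     # U+0301 COMBINING ACUTE ACCENT
print(precomposed.encode("utf-8"))  # b'habl\xc3\xb3'
```

If the request body shows a plain "o" where the accent should be, the accent was lost before the text reached the API. Whether the API treats the two normalization forms differently is not something this sketch can tell.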

Related

Forcing "ct" ligature for Google Web Fonts

I'm using the font IM Fell English for my project: https://fonts.google.com/specimen/IM+Fell+English?selection.family=IM+Fell+English
It has support for the "long s" ſ, and other common ligatures such as "ff", "fi", "ft", etc.
However, I can't seem to get the "ct" ligature working, although you can see that the font does have the glyph: here 1) https://www.fontsquirrel.com/fonts/im-fell-english-pro and here 2) https://iginomarini.com/fell/the-revival-fonts/
I have tried the font-variant-ligatures CSS property as directed here, but it does not work regardless of which value I set: https://developer.mozilla.org/en-US/docs/Web/CSS/font-variant-ligatures
My current compromise is to replace the "s" with the "ſ" Unicode character, but as far as I could find, there is no Unicode character for the "ct" ligature (nor a joiner glyph).
Additionally, to get the long ſ to work, I had to @import the font in this way:
@import url('https://fonts.googleapis.com/css?family=IM+Fell+English:400,400i&subset=all&text=+!%22%23$%25%26()*%2B,-.%2F0123456789:;%3C%3D%3E%3F@ABCDEFGHIJKLMNOPQRSTUVWXYZ%5B%5C%5Dabcdefghijklmnopqrstuvwxyz%7B%7D%C2%A2%C2%A3%C2%A5%C2%A9%C2%AE%C3%97%C3%B7%C5%BF%E2%80%98%E2%80%99%E2%80%9C%E2%80%9D%E2%82%AC');
I obtained the value for the text parameter by typing the long ſ into the preview field on the Google Fonts page for IM Fell English (the first link above), watching which request my browser sent, and copying it.
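The value of the text parameter does not have to be copied from the browser; it is just the percent-encoded UTF-8 form of the characters you need. A minimal sketch of building such a URL follows. The function name `fonts_css_url` is hypothetical, and whether a glyph without its own code point (such as the "ct" ligature) is included in the returned subset is not something this sketch verifies.

```python
from urllib.parse import urlencode

def fonts_css_url(family, text):
    """Build a Google Fonts CSS URL that requests only the characters in text."""
    # safe=":," keeps the weight list ("400,400i") readable; spaces become "+".
    query = urlencode({"family": family, "text": text}, safe=":,")
    return "https://fonts.googleapis.com/css?" + query

# "\u017f" is the long s (ſ); it is sent as its UTF-8 bytes, %C5%BF.
url = fonts_css_url("IM Fell English:400,400i", "Octaves \u017f")
print(url)
```

The resulting URL can be used in the same @import rule as the hand-copied one.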
As you can see below, I am much in need of re-creating this! (Notice the "ct" in "Octaves"; I've taken care of the long ſ in the titles, but not the paragraphs)
Recreation:
Original Source:
Edit: I've found a workaround.
I downloaded the IM Fell English font, then used Character Map (available on Windows 10) to search for the glyph and copied it (U+E004, Private Use). The trouble now is that I cannot search (using Ctrl+F) for anything that contains the ligature! So I cannot search for "Octaves" because it is now "Oaves".
I believe the original question in the title still stands. I don't want a workaround, I want to have the browser force the ligature, if that's possible.

Google translate text to speech and apostrophes

I am using google API to translate a sentence. Once translated I use text to speech google API with the result of the translation.
Translation and text to speech work pretty well in general. However, I have a problem with the apostrophes. For example:
1) Translation result: I & # 3 9 ; m tired (Note: I had to separate the characters with spaces because it was shown as "I´m tired" in the preview.)
2) Text to speech result says: "I and hash thirty nine m tired" (or something similar)
What kind of encoding do I need to use in the 1st step to get the output string right (i.e. I´m tired)?
The program is in python. I include an extract here:
def tts_translated_text(self, input_text, input_language):
    from google.cloud import texttospeech
    client = texttospeech.TextToSpeechClient()
    input_text = input_text.encode("utf-8")
    # Set the text input to be synthesized
    synthesis_input = texttospeech.types.SynthesisInput(text=input_text)
    voice = texttospeech.types.VoiceSelectionParams(
        language_code=input_language,
        ssml_gender=texttospeech.enums.SsmlVoiceGender.FEMALE)
    audio_config = texttospeech.types.AudioConfig(
        audio_encoding=texttospeech.enums.AudioEncoding.LINEAR16)
    response = client.synthesize_speech(synthesis_input, voice, audio_config)
    # The response's audio_content is binary.
    with open('output.wav', 'wb') as out:
        # Write the response to the output file.
        out.write(response.audio_content)
Thanks in advance,
Ester
I finally found what was wrong. The Google Translate API returns the string HTML-escaped (the apostrophe comes back as the entity &#39;), whereas Google Text-to-Speech expects plain text, so the entities have to be decoded first.
I am forced to use Python 2.7, so I did the following (after import HTMLParser):
translated_text = HTMLParser.HTMLParser().unescape(translated_text_html)
where translated_text_html is the string returned by the translation API call.
In Python 3 it should be (after import html):
translated_text = html.unescape(translated_text_html)
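As a quick check of the Python 3 variant, here is a minimal, self-contained sketch. The function name `prepare_for_tts` is hypothetical; the only library call is the standard `html.unescape`.

```python
import html

def prepare_for_tts(translated_text_html):
    """Decode HTML entities so the text-to-speech step receives plain text."""
    return html.unescape(translated_text_html)

print(prepare_for_tts("I&#39;m tired"))  # I'm tired
```

The decoded string can then be passed to the synthesis call in place of the raw translation result.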

Incorrect behaviour of Google Translation API with notranslate tags

Google Translate API allows indicating chunks of text that should not be translated with
<span translate='no'>Skip this text while translating</span>
In some cases the non-translate tags trigger incorrect behaviour: the translation API omits one of the words and duplicates the non-translate tag. Input of the translation API:
0c40152c asdasd alsdls3 ec3f297a <span translate="no">AAAAA123AAAA</span> Nov 30 translate
When translating from Italian to English (not sure if the language matters), the following result is returned:
0c40152c asdasd alsdls3 ec3f297a <span translate="no">AAAAA123AAAA</span> Nov 30 <span translate="no">AAAAA123AAAA</span>
Please note that the 'translate' at the end of the text is substituted with the non-translate tag.
The same issue occurs if, instead of <span translate='no'>, I use the alternative syntax <span class='notranslate'>.
Is this a known bug? Does it have a sensible workaround?
Is this a known bug?
Yes: https://issuetracker.google.com/issues/121076288
Translation problem with notranslate class in span tag
Problem you have encountered:
The translation API gives wrong results translating from german to arabic
German text:
QANTARA Migration - Kostenfreie Erstprüfung Ihrer Chancen für die erfolgreiche Immigration nach Deutschland
Arabic translation:
QANTARA Migration - إجراء فحص أولي مجاني لفرص نجاح QANTARA Migration إلى ألمانيا
What you expected to happen:
Correct translation without doubling the span with notranslate - this was doubled in arabic translation as you can see
There are also a few others that seem related, like https://issuetracker.google.com/issues/74168658 and https://issuetracker.google.com/issues/35902695.
Does it have a sensible workaround?
Only hacky ones, I'm afraid.
The easiest workaround is just to replace such sections with a token, like a unique number or url that Translate is smart enough not to touch, translate, then swap the original string back in.
A more general solution is to use something like ModelFront (full-disclosure: I work there) to detect errors, and do something only in those cases.
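The token-swapping workaround can be sketched roughly as follows. This is an illustration under assumptions, not a tested recipe against the live API: the names `protect` and `restore` and the numeric token format are hypothetical, and whether Translate really leaves such tokens untouched has to be checked for your language pair.

```python
import re

# Matches <span translate="no">...</span> with either quote style.
NOTRANSLATE = re.compile(r"<span translate=[\"']no[\"']>(.*?)</span>", re.DOTALL)

# Hypothetical base for the numeric tokens; pick one that cannot occur in your text.
TOKEN_BASE = 982451000

def protect(text):
    """Replace each notranslate span with a numeric token; return (text, originals)."""
    originals = []

    def swap(match):
        originals.append(match.group(1))
        return str(TOKEN_BASE + len(originals) - 1)

    return NOTRANSLATE.sub(swap, text), originals

def restore(translated, originals):
    """Swap the original strings back in place of their tokens."""
    for index, original in enumerate(originals):
        translated = translated.replace(str(TOKEN_BASE + index), original)
    return translated

source = '0c40152c asdasd <span translate="no">AAAAA123AAAA</span> Nov 30 translate'
protected, originals = protect(source)
print(protected)  # send this to the translation API instead of the original
print(restore(protected, originals))
```

Call `protect` before the translation request and `restore` on the response. The restored text contains the protected string itself, without the surrounding span.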
It seems you have specified Italian as the input language, but very few words in the text can be translated (for example "translate"), and they are not recognised as words of the source language.
This can cause issues in the translation algorithm, which seems to be the case here.
A workaround would be to let the API detect the source language automatically and check the confidence value:
The confidence value is an optional floating point value between 0 and 1. The closer this value is to 1, the higher the confidence level for the language detection. This member is not always available.
If the confidence value is high enough for your needs, you can rely on the detected source language for the translation.
Another workaround could be adding more words to the text so the algorithm has more data to work with. I have tested the API with the same input as you describe but with a few more words added. The output is as expected.
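The confidence check could look something like the sketch below. It assumes the detection result is available as a dictionary with a "language" key and an optional "confidence" key; that shape, the function name `pick_source_language`, and the 0.8 threshold are assumptions for illustration, not part of the API description quoted above.

```python
def pick_source_language(detection, declared, min_confidence=0.8):
    """Prefer the detected language only when the detection is confident enough.

    detection: dict with "language" and, optionally, "confidence" (assumed shape).
    declared:  the language code you would otherwise pass as the source.
    """
    confidence = detection.get("confidence")
    # The confidence member is not always available, so treat a missing value
    # as "not confident" and fall back to the declared language.
    if confidence is not None and confidence >= min_confidence:
        return detection["language"]
    return declared

print(pick_source_language({"language": "en", "confidence": 0.95}, "it"))  # en
print(pick_source_language({"language": "en", "confidence": 0.30}, "it"))  # it
print(pick_source_language({"language": "en"}, "it"))                      # it
```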

Using Japanese letters in paste function

I want to download search query data from Google Trends for both Japanese and English search terms. It works perfectly fine when I use English search terms only, but it stops working as soon as I include Japanese characters.
My code is the following (I included the default keyword just for this example to make it easier to use):
URL_GT=function(keyword="Toyota Aygo %2B Toyota Yaris %2B Toyota Vitz %2B トヨタヴィッツ", year=2010, month=1, length=68){
  start="http://www.google.com/trends/trendsReport?hl=en-US&q="
  end="&cmpt=q&content=1&export=1"
  date=""
  queries=keyword[1]
  if(length(keyword)>1) {
    for(i in 2:length(keyword)){
      queries=paste(queries, "%2C ", keyword[i], sep="")
    }
  }
  #Dates
  if(!is.na(year)){
    date="&date="
    date=paste(date, month, "%2F", year, " ", month+length-1, "m", sep="")
  }
  URL=paste(start, queries, date, end, sep="")
  browseURL(URL)
}
When I look at the download URL that gets called in my browser, I can see that the Japanese characters have been transformed into a mix of % signs, numbers and letters, but they are not supposed to change at all.
When I use
Sys.setlocale("LC_CTYPE","japanese_JAPAN")
I get the following paste result
paste("トヨタヴィッツ","Toyota Vitz", sep = "")
[1] "ƒgƒˆƒ^ƒ”ƒBƒbƒcToyota Vitz"
I think this shows pretty well that the paste() function does not work as intended.
Using
Sys.setlocale("LC_CTYPE","german_GERMANY")
I get following error message
unexpected INCOMPLETE_STRING
1: URL_GT=function(keyword="Toyota Aygo %2B Toyota Yaris %2B Toyota Vitz %2B ?
indicating that R cannot interpret the Japanese characters.
I tried finding a solution, but could only find tips which led me to change my locale. As described above, this has not worked for me so far. I also found this tip, but I got the same error as the asker of that question, namely
Warning message: In Sys.setlocale("LC_CTYPE", "UTF-8") : OS reports request
to set locale to "UTF-8" cannot be honored
I am very grateful for any help! Since this is my first post ever I hope that everything concerning structure and detail is alright.
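Two of the symptoms described in the question can be reproduced outside R. The sketch below is an illustration, not a fix for the R session. It shows that percent-encoding of non-ASCII characters in a URL is ordinary (the escapes are the UTF-8 bytes of the text), and that the garbled paste() output quoted above is exactly what Shift-JIS bytes look like when read as Windows-1252. That this is what happened in the asker's session is an inference from the output, not something the post states.

```python
from urllib.parse import quote

term = "トヨタヴィッツ"

# Percent-encoding: each non-ASCII character becomes %XX escapes of its UTF-8 bytes.
encoded = quote(term)
print(encoded[:9])  # %E3%83%88  (the first character, ト)

# Mojibake: encode as Shift-JIS, then misread the bytes as Windows-1252.
garbled = term.encode("shift_jis").decode("cp1252")
print(garbled)      # ƒgƒˆƒ^ƒ”ƒBƒbƒc

# The damage is reversible as long as the bytes themselves were not altered.
recovered = garbled.encode("cp1252").decode("shift_jis")
print(recovered == term)  # True
```

So the string data may well be intact in both cases; what differs is the encoding used to display or transmit it.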
I found a solution that works just fine for me. I had to change the language for Unicode-incompatible programs in order for the Japanese locale to work properly.
On Windows 8.1 you have to go to the control panel, time, region & language, region, administration, and there you can change the language accordingly (in my case Japanese). Restart your PC afterwards.
If you now set your locale to
Sys.setlocale("LC_CTYPE","japanese_JAPAN")
typing in paste should return what you asked for e.g.
paste("It works", "トヨタヴィッツ", sep=" ")
[1] "It works トヨタヴィッツ"
The only thing that still confuses me is that when I open the Excel file after the download, the Japanese characters appear garbled in a new, cryptic way.
I tried downloading the data for the word manually and got the same result in the Excel file, so I guess the data itself is correct. Unfortunately, I did not download a CSV file of the Japanese data before I changed my Unicode language setting, so I cannot tell whether Excel messed it up there as well. But when I restored my settings to German again, the same cryptic characters appeared in the downloaded file.

Issue with regex in ASP.net for german, french & spanish languages

I want to support German, French & Spanish characters on a particular field of my website. I need a regex for this. Presently I am using -
^[\w\s-\+\$\*\.\?\:\;\!\,"'\%\&\/\(\)\#\#«»£°¿¡_ÀÂÆÇÈÉÊËÎÏÔŒÙÛÜàâæçèéêëîïôœùûüÄÖäößÁÍÑÓÚáíñóú\u201E\u201C\u201D\u20AC]{1,255}$
This regex basically uses all the char set from the 3 languages I mentioned.
Is there a neat way to avoid this lengthy regex? I tried the /p{L}/p{Z} regex. However, this didn't work.
My website is in ASP.NET.
/p{L}/p{Z} is wrong; it should be \p{L}\p{Z}.
All the letters, like "ÀÂÆÇÈ", shouldn't be needed; they are all included in \w in .NET!
You don't need most of the escaping in a character class.
You can't write something like &quot; in a character class; the only thing that happens is that every single character of it is added to the class.
This should be quite similar to what you used:
^[-\p{L}\p{N}\p{P}\p{Z}_+$*%&/##«»£°\u201E\u201C\u201D\u20AC]{1,255}$
I haven't checked those Unicode code points at the end of the class; I don't know if they are needed or not.
For an explanation of all the \p{...} items see Unicode Regular Expressions on regular-expressions.info
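The point that \w already covers the accented letters can be tried out quickly in Python. Note the assumptions: this uses Python's standard re module, a different engine from .NET's, which has no \p{...} classes but does treat \w as Unicode-aware for str patterns. The character class is a shortened variant of the one in the question, and the name `is_valid` is hypothetical.

```python
import re

# \w covers letters (including À, ß, ñ, œ ...), digits and "_"; the remaining
# symbols are listed explicitly. Only "-" needs escaping inside the class.
ALLOWED = re.compile(
    r"""[\w\s\-+$*.?:;!,"'%&/()#«»£°¿¡\u201E\u201C\u201D\u20AC]{1,255}"""
)

def is_valid(text):
    """Return True if text is 1 to 255 characters long and uses only allowed characters."""
    return ALLOWED.fullmatch(text) is not None

print(is_valid("Übergröße für 20€!"))  # True
print(is_valid("¿Qué tal, señor?"))    # True
print(is_valid("<b>bold</b>"))         # False ("<" and ">" are not in the class)
```

In .NET the same idea applies with the \p{L}, \p{N}, \p{P} and \p{Z} categories from the answer above.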
