Google Translate text-to-speech and apostrophes

I am using the Google Translate API to translate a sentence, and then pass the translation to the Google Text-to-Speech API. Both work well in general, but I have a problem with apostrophes. For example:
1) Translation result: I & # 3 9 ; m tired (note: I had to separate the characters with spaces because otherwise it shows as "I'm tired" in the preview)
2) Text-to-speech result says: "I and hash thirty nine m tired" (or something similar)
What kind of encoding do I need to use in the first step to get the output string right (i.e. "I'm tired")?
The program is in Python; here is an extract:
def tts_translated_text(self, input_text, input_language):
    from google.cloud import texttospeech

    client = texttospeech.TextToSpeechClient()
    input_text = input_text.encode("utf-8")
    # Set the text input to be synthesized
    synthesis_input = texttospeech.types.SynthesisInput(text=input_text)
    voice = texttospeech.types.VoiceSelectionParams(
        language_code=input_language,
        ssml_gender=texttospeech.enums.SsmlVoiceGender.FEMALE)
    audio_config = texttospeech.types.AudioConfig(
        audio_encoding=texttospeech.enums.AudioEncoding.LINEAR16)
    response = client.synthesize_speech(synthesis_input, voice, audio_config)
    # The response's audio_content is binary.
    with open('output.wav', 'wb') as out:
        # Write the response to the output file.
        out.write(response.audio_content)
Thanks in advance,
Ester

I finally found what was wrong. The Google Translate API returns the string with HTML entities, while Google Text-to-Speech expects plain UTF-8 text. I am forced to use Python 2.7, so I did the following:
translated_text = HTMLParser.HTMLParser().unescape(translated_text_html)
where translated_text_html is the string returned by the translation API invocation. In Python 3 it would be:
translated_text = html.unescape(translated_text_html)
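A minimal Python 3 sketch of the fix (the input string here is a stand-in for a Translate API response, not real API output):

```python
import html

# Stand-in for a Translate API response containing an HTML entity
translated_text_html = "I&#39;m tired"

translated_text = html.unescape(translated_text_html)
print(translated_text)  # → I'm tired
```

The unescaped string can then be passed to the Text-to-Speech client as-is.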

Related

How to send simultaneous keys in RSelenium ALT+S to web driver?

I would like to send two simultaneous keys, such as ALT+S, to the sendKeysToActiveElement() function of the RSelenium webdriver. I only see implementations in Java and C. Can this be done?
If you want to send a single keystroke, use:
cl$sendKeysToActiveElement(sendKeys = list(key = "tab"))
If you want to send more than one keystroke, use:
cl$sendKeysToActiveElement(sendKeys = list(key = "alt", key = "S"))
There are two ways to send key presses in the R version of Selenium. The first, as mentioned, is to pass the desired button in the key argument. The second is to send the raw UTF-8 character codes without the key argument. Generally this is undesirable because it is hard to remember all the codes, but for simultaneous key presses it is the only way I have found that works, since the list option appears to send its inputs sequentially.
In this scenario, the UTF-8 code for alt is \uE00a and the UTF-8 code for s is \u0073. We can combine these into a single value, like so:
remDr$sendKeysToActiveElement(sendKeys = list("\uE00a\u0073"))
I'm unfamiliar with the alt + s shortcut, but this does work with something like shift + tab to navigate through elements in reverse on a browser by sending them simultaneously.
I've also found the following links helpful for looking up the actual UTF-8 codes:
http://unicode.org/charts/PDF/U0000.pdf
https://seleniumhq.github.io/selenium/docs/api/py/_modules/selenium/webdriver/common/keys.html
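For what it's worth, the same trick exists outside R. Here is a minimal Python illustration (no browser required) of how such a chord string is built, assuming Selenium's published key codes from the source linked above:

```python
# Selenium's special keys are Private Use Area code points (see the Keys
# source linked above). Building a "chord" is just string concatenation.
ALT = "\ue00a"   # same value as selenium.webdriver.common.keys.Keys.ALT
chord = ALT + "s"

# The chord is a two-character string: the ALT code point followed by 's'.
print(len(chord), hex(ord(chord[0])), chord[1])
```

Passing such a string to an element's send-keys call delivers both characters in one input, which is what makes the modifier apply.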
In Java you can use a key chord:
String selectAll = Keys.chord(Keys.ALT, "s");
driver.findElement(By.xpath("YOURLOCATOR")).sendKeys(selectAll);
Hope it helps :)

Decode Base64 text from Legiscan API in R

I am trying to analyze bill texts from LegiScan, but am running into problems decoding the text in the API response. It turns out LegiScan encodes the full text of all legislation in base 64 when pulled through its API, and I am having some trouble decoding it.
This downloaded file is an example of the full-text portion of the JSON result I retrieved through the API. However, the usual methods do not seem to work on it.
What I have tried:
LegiScan does not seem to support R directly, so I used the LegiscanR package. I used LegiscanR's BillText function to get the correct JSON link, then parseBillText to try to decode the text from the link into UTF-8. However, it throws a fromJSON error even with the correct API key and document id in the link:
Error in fromJSON(content, handler, default.size, depth, allowComments, :
object 'Strict' not found
Using the base64decode (base64enc package) or base64Decode (RCurl package) function to convert the text from base 64 to raw, and then using the rawToChar function to convert it into characters.
My code:
text <- base64decode("https://www.dropbox.com/s/5ozd0a1zsb6y9pi/Legiscan_fulltext.txt?dl=0")
rawToChar(text)
Nul <- text == as.raw(00)
text[Nul] <- as.raw(20)
text2 <- rawToChar(text)
However, using rawToChar alone gives me an "embedded nul in string" error:
Error in rawToChar(test2) :
embedded nul in string: '%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n<>>>\r\nendobj\r\n2 0 obj\r\n<>\r\nendobj\r\n3 0 obj\r\n<>/ExtGState<>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI] >>/MediaBox[ 0 0 612 792] /Contents 4 0 R/Group<>/Tabs/S/StructParents 0>>\r\nendobj\r\n4 0 obj\r\n<>\r\nstream\r\nx\x9c\xb5ZYs\xdb8\022~w\x95\xff\003*O\u0516M\021ཛJ\x95\xe3ę̵\x99\xb1\xa7f\xb7\x92y\xa0$\xca\xe2\x86"\025\036\xf6\xe6\xdfow\003\x94\bR0sh\x93*\x99G\xa3\001|\xdd\xfdu7\xa4\xf9U\xd5d\xebdٰ\xe7\xcf\xe7WM\x93,7銽\x9f\u07d5\xbb\xbf\xe6w\x9fw\xe9\xfc]r\x9f\025I\x93\x95\xc5\xfc\xb6]4\xf8\xe8\x874Y\xa5Ջ\027\xec\xe5\xabk\xf6\xf2\xee\xfcl~\xc3Yl\xc7\
Substituting these nulls out to represent spaces allows rawToChar to run, but the output is gibberish, or in another form of encoding that is not the expected English text characters.
[1] "\x86\xdbi\xb3\xff\xf0\xc3\ak\xa2\x96\xe8\xc5\xca&\xfe\xcf\xf9\xa37tk\\xeco\xac\xbd\xa6/\xcbz\b\xacq\xa9\u07faYm{\033m\xc6\xd7e"
Any other ideas on what else to try? Thanks.
I have been dealing with the same problem in Python, where the following worked:
import base64

raw = base64.b64decode(bill_text['doc'])
with open(output_file, "wb") as f:
    f.write(raw)
I think the problem in your case is that you are trying to convert the document straight into text; the decoded payload is a PDF, not plain text. In Python I parsed the saved PDF with functions from the PyPDF2 library.
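A self-contained sketch of the decode-and-save step (the payload below is a stand-in, not a real LegiScan response, and the PyPDF2 lines are indicative only):

```python
import base64

# Stand-in for the base64-encoded 'doc' field of a getBillText response
doc_b64 = base64.b64encode(b"%PDF-1.5 stand-in payload").decode("ascii")

raw = base64.b64decode(doc_b64)      # bytes of the original PDF
with open("bill.pdf", "wb") as f:    # save as binary; don't treat it as text
    f.write(raw)

# The saved file can then be parsed with a PDF library, e.g. (assuming PyPDF2):
#   from PyPDF2 import PdfReader
#   text = "\n".join(p.extract_text() for p in PdfReader("bill.pdf").pages)
```

The key point is that the decoded bytes are a binary PDF; converting them to a character string directly (as rawToChar attempts) will always hit embedded nuls and compressed streams.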

Translating a document from Spanish to English & preserve formatting

I have been using https://translate.google.com/ to translate Spanish PDFs and Word documents to English.
Is there a RESTful API (or other method) to translate a document from Spanish to English while preserving the formatting of the document?
I know I can extract the text and then translate it using the Google APIs, but I would lose the formatting.
You could run this code in Python:
import goslate

big_files = ['lenin.txt', 'liga.txt']
gs = goslate.Goslate()
translation = []
for big_file in big_files:
    with open(big_file, 'r') as f:
        translated_lines = []
        for line in f:
            translated_line = gs.translate(line, "en")
            translated_lines.append(translated_line)
        translation.append('\n'.join(translated_lines))
Run it from the directory containing the documents you want to translate, and list each document's name in big_files.
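As a sketch of what to do with the collected translations (the file names and contents below are stand-ins, not real output), you could write each one next to its source document:

```python
# Hypothetical follow-up to the loop above: save each translation alongside
# its source file. The dict stands in for the big_files/translation pair.
translations = {"lenin.txt": "translated text", "liga.txt": "more translated text"}

for name, text in translations.items():
    out_name = name.replace(".txt", "_en.txt")
    with open(out_name, "w", encoding="utf-8") as out:
        out.write(text)
```

Note that this pipeline handles plain text only, so it cannot preserve PDF or Word formatting by itself.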

ASP Readline non-standard Line Endings

I'm using the ASP Classic ReadLine() function of the FileSystemObject.
All has been working great until someone made their import file on a Mac in TextEdit.
The line endings aren't the same, and ReadLine() reads in the entire file, not just 1 line at a time.
Is there a standard way of handling this? Some sort of page directive, or setting on the File System Object?
I guess that I could read in the entire file, and split on vbLF, then for each item, replace vbCR with "", then process the lines, one at a time, but that seems a bit kludgy.
I have searched all over for a solution to this issue, but the solutions are all along the lines of "don't save the file with Mac[sic] line endings."
Anyone have a better way of dealing with this problem?
There is no way to change the behaviour of ReadLine; it will only recognize CRLF as a line terminator. Hence the only simple solution is the one you have already described.
Edit
Actually there is another library that ought to be available out of the box on an ASP server that might offer some help. That is the ADODB library.
The ADODB.Stream object has a LineSeparator property that can be assigned 10 or 13 to override the default CRLF it would normally use. The documentation is patchy because it doesn't describe how this can be used with ReadText. You can get the ReadText method to return the next line from the stream by passing -2 as its parameter.
Take a look at this example:-
Dim sLine
Dim oStreamIn : Set oStreamIn = CreateObject("ADODB.Stream")
oStreamIn.Type = 2 '' # Text
oStreamIn.Open
oStreamIn.CharSet = "Windows-1252"
oStreamIn.LoadFromFile "C:\temp\test.txt"
oStreamIn.LineSeparator = 10 '' # Linefeed
Do Until oStreamIn.EOS
    sLine = oStreamIn.ReadText(-2)
    '' # Do stuff with sLine
Loop
oStreamIn.Close
Note that by default the CharSet is unicode so you will need to assign the correct CharSet being used by the file if its not Unicode. I use the word "Unicode" in the sense that the documentation does which actually means UTF-16. One advantage here is that ADODB Stream can handle UTF-8 unlike the Scripting library.
BTW, I thought Macs used a CR for line endings? It's the Unix file format that uses LFs, isn't it?

How do I encode the ugly string?

I have a string that is:
!"#$%&'()*+,-./0123456789:;?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[]\^_`abcdefghijklmnopqrstuvwxyz{|}~¡¢£¤¥¦§¨©ª« ®¯°±²³´µ¶•¸¹º»¼½¾¿ÀÁÂÃÄÅàáâäèçéêëìíîïôö÷òóõùúý
I posted that to a service using HtmlEncode, and I got this result:
!#$%&'()* ,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~����������� ���������•������������������������������������
That isn't the result I need. How do I get the original string back? Thanks!
Your string is not ASCII, so you are either using a string to represent binary data, or you are not keeping track of a multi-byte encoding. In either case, the simplest way to deal with any Internet-based technology (HTTP, SMTP, POP, IMAP) is to make the data 7-bit clean. One common way is to base64-encode it, send it across the wire, then base64-decode it before processing.
I believe this is what you're looking for:
!"#$%&&apos;()*+,-./0123456789:;?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[]\\^_`abcdefghijklmnopqrstuvwxyz{|}~¡¢£¤¥¦§¨©ª«®¯°±²³´µ¶•¸¹º»¼½¾¿ÀÁÂÃÄÅàáâäèçéêëìíîïôö÷òóõùúý
You just need a better HTML entity encoding library or tool. The one I used to generate this is from Ruby: the HTMLEntities library. The code I wrote follows. I had to put your text in input.txt to preserve the Unicode (there was an EOF character in the string), but it worked great.
require 'rubygems'
require 'htmlentities'
str = File.read('input.txt')
coder = HTMLEntities.new
puts coder.encode(str, :named)
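If you end up doing this in Python rather than Ruby, the standard library alone can produce numeric character references (named entities like the HTMLEntities output above need a third-party library):

```python
# Encoding to ASCII with the xmlcharrefreplace error handler turns every
# non-ASCII character into a numeric HTML character reference.
s = "¡¢£ÀÁÂ"
encoded = s.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(encoded)  # → &#161;&#162;&#163;&#192;&#193;&#194;
```

The result is 7-bit clean, so it survives any transport that mangles non-ASCII bytes, and html.unescape reverses it on the other end.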
