Japanese characters are not decoded properly

I am currently crawling a website in Japanese using BeautifulSoup4.
I have a problem decoding Japanese characters: only '~' and 'ー' are returned as square-shaped characters. Does anyone know how to fix this issue? The website is encoded in UTF-8.
Here is my code for parsing:
import requests
from bs4 import BeautifulSoup
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
Thanks in advance.
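One thing to check (a sketch, not a confirmed fix): requests guesses the text encoding from the HTTP headers and falls back to ISO-8859-1 when the server declares no charset, which garbles pages that are really UTF-8. Forcing the encoding, or handing BeautifulSoup the raw bytes so it can sniff the charset itself, may help:

import requests
from bs4 import BeautifulSoup

response = requests.get(url)  # url as in the question
response.encoding = 'utf-8'   # override requests' header-based guess
soup = BeautifulSoup(response.text, 'html.parser')

# Alternatively, pass the raw bytes and let BeautifulSoup detect the encoding:
soup = BeautifulSoup(response.content, 'html.parser')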

Related

Clean Tweets: What are UTF-8 and non-UTF-8 characters

I am attempting to analyze a corpus of tweets extracted from Twitter. A number of tweets appear to contain non-UTF-8 characters.
For example, one tweet is: "[米国]一人㠮ワクムン未接種㠮å­\ 㠩も㠋ら広㠌㠣㠟麻疹〠㠮教訓。 #ShotbyShotorg: How one unvaccinated child sparked Minnesota measles outbreak \"
I am not familiar with these non-alphanumeric characters, or with how to convert or exclude them. Are these garbage characters, or can they be converted back to something meaningful? Thank you.
I found the original tweet: https://twitter.com/narumita/status/476295179796611072?s=21. From this tweet it’s quite clear that the “garbage” text was supposed to be Japanese.
The original text reads
[米国]一人のワクチン未接種の子どもから広がった麻疹、の教訓。
Somehow, your text has undergone two rounds of mojibake-ification: it was encoded as UTF-8, decoded as Windows Code Page 1252 (CP-1252), encoded as UTF-8 again, and decoded as CP-1252 again. Unfortunately the text is not fully recoverable from what you posted, since CP-1252 leaves several byte values undefined and so cannot decode every UTF-8 byte. However, a quick Python script recovers a couple of characters, enough to confirm this is how it was corrupted:
t = '[米国]一人㠮ワクムン未接種㠮å­\ 㠩も㠋ら広㠌㠣㠟麻疹〠㠮教訓。'
# Undo each mojibake round: re-encode as CP-1252, decode as UTF-8, twice.
once = t.encode('cp1252', errors='replace').decode('utf8', errors='replace')
print(once.encode('cp1252', errors='replace').decode('utf8', errors='replace'))
This outputs:
[米国]一人� �ワク� ン未接種� ��\ � �も� �ら広� �� �� �麻疹� � �教訓。
EDITED: A round-trip analysis (taking the original text and badly encoding it twice) showed that the corruption likely involved CP-1252 rather than ISO-8859-1; the two encodings are identical on most code points. The post has been edited to use CP-1252 throughout.
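For the record, the corruption path can be reproduced from the recovered original; a short sketch of the double round trip described above (lossy, because CP-1252 leaves several byte values undefined):

orig = '[米国]一人のワクチン未接種の子どもから広がった麻疹、の教訓。'
# Encode as UTF-8 but decode as CP-1252, twice over; the bytes CP-1252
# does not define get replaced, which is where the text becomes unrecoverable.
once = orig.encode('utf-8').decode('cp1252', errors='replace')
print(once.encode('utf-8').decode('cp1252', errors='replace'))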

Google translate text to speech and apostrophes

I am using the Google API to translate a sentence. Once translated, I use the Google text-to-speech API on the result of the translation.
Translation and text-to-speech work pretty well in general. However, I have a problem with apostrophes. For example:
1) Translation result: I & # 3 9 ; m tired (note: I had to separate the characters with spaces because it was shown as "I'm tired" in the preview)...
2) Text-to-speech result says: "I and hash thirty nine m tired" (or something similar)
What kind of encoding do I need to use in the first step to get the output string right (i.e. "I'm tired")?
The program is in Python; I include an extract here:
def tts_translated_text(self, input_text, input_language):
    from google.cloud import texttospeech

    client = texttospeech.TextToSpeechClient()
    input_text = input_text.encode("utf-8")
    # Set the text input to be synthesized
    synthesis_input = texttospeech.types.SynthesisInput(text=input_text)
    voice = texttospeech.types.VoiceSelectionParams(
        language_code=input_language,
        ssml_gender=texttospeech.enums.SsmlVoiceGender.FEMALE)
    audio_config = texttospeech.types.AudioConfig(
        audio_encoding=texttospeech.enums.AudioEncoding.LINEAR16)
    response = client.synthesize_speech(synthesis_input, voice, audio_config)
    # The response's audio_content is binary.
    with open('output.wav', 'wb') as out:
        # Write the response to the output file.
        out.write(response.audio_content)
Thanks in advance,
Ester
I finally found what was wrong: the Google Translate API returns the string with HTML entities escaped, while Google Text-to-Speech expects plain UTF-8 text.
I am forced to use Python 2.7, so I did the following:
import HTMLParser
translated_text = HTMLParser.HTMLParser().unescape(translated_text_html)
where translated_text_html is the string returned by the translation API invocation.
In Python 3 it should be:
import html
translated_text = html.unescape(translated_text_html)
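A self-contained Python 3 round trip (the input string here is a made-up example matching the question):

import html

translated_text_html = "I&#39;m tired"  # hypothetical Translate API output
translated_text = html.unescape(translated_text_html)
print(translated_text)  # I'm tired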

Handle UTF-8 characters in Unix

I was trying to find a solution for my problem, and after looking at the forums I couldn't find one, so I'll explain it here.
We receive a CSV file from a client with some special characters, encoded as unknown-8bit. We convert this CSV file to XML using an awk script. With the XML file we make an API call to our system using UTF-8 as the default encoding. The response is an error with the following information:
org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence
The content of the file is as below:
151215901579-109617744500,sandra,sandra,Coesfeld,,Coesfeld,48653,DE,1,2.30,ASTRA 16V CAVALIER CALIBRA TURBO BLUE 10,53.82,GB,,.80,3,ASTRA 16V CAVALIER CALIBRA TURBO BLUE 10MM 4CORE IGNITION HT LEADS WIRES MLR.CR,,sandra#online.de,parcel1,Invalid Request,,%004865315500320004648880276,INTL,%004865315500320004648880276,1,INTL,DPD,180380,INTL,2.30,Send A2B Ltd,4th Floor,200 Gray’s Inn Road,LONDON,,WC1X8XZ,GBR,
I think the problem is in the field "200 Gray’s Inn Road", because the curly apostrophe ("’") is stored as the single byte 0x92, which is not valid UTF-8.
Does anybody know how can I handle this?
Thanks in advance,
Sandra
Find out the actual encoding first; best would be asking the sender.
If you cannot do so, and also for sanity checking, the Unix command file is very useful for that (its man page shows more options).
Next step, convert to UTF-8.
As it is obviously an ASCII-based encoding, you could also just discard all non-ASCII characters, or replace them while converting, if that loss is acceptable.
As an alternative, open it in the editor of your choice and flip the encoding used for interpreting the data until you get something useful. My guess is you'll have either Latin-1 or Windows-1252, but check it for yourself.
Last step, do what you wanted to do, in the comforting knowledge that you now have valid UTF-8.
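For the conversion step, a minimal Python sketch, assuming the file turns out to be Windows-1252 (file names are placeholders):

# Re-decode the client's bytes as Windows-1252 and write them back out
# as valid UTF-8; swap 'cp1252' for whatever encoding you identified.
with open('input.csv', encoding='cp1252') as src, \
        open('input.utf8.csv', 'w', encoding='utf-8') as dst:
    dst.write(src.read())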
Obviously, don't pretend it's UTF-8 if it isn't. Find out what the encoding is, or replace all non-ASCII characters with the UTF-8 REPLACEMENT CHARACTER sequence 0xEF 0xBF 0xBD.
Since you are able to view this particular sample just fine, you apparently already know which encoding it is (even if you don't know that you know -- it would be whatever your current setup is using). I would guess Windows-1252, which uses 0x92 for the curly right single quote.
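That guess is easy to check in Python:

# 0x92 is the Windows-1252 byte for the right single quotation mark (U+2019).
print(b'200 Gray\x92s Inn Road'.decode('cp1252'))  # 200 Gray’s Inn Road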

DataMatrix barcode with non-Latin characters

I need to create DataMatrix barcodes which may contain non-Latin characters. I have code which creates the barcodes correctly when they only consist of Latin characters; when I run the same code with non-Latin (Hebrew or Russian) characters, however, although the code runs to completion and the barcode is created, the non-Latin characters are not deciphered by the barcode reader.
Any assistance or ideas would be greatly appreciated!
Your issue is related to the character encoding used prior to generating the barcode. The encoding used by the generator to encode must match the encoding used by the reader to decode.
Possible encodings are:
Extended Channel Interpretations (ECI) is supported by DataMatrix and other 2D barcode standards. The generator places an ECI identifying code inside the barcode data, so the reader knows to use ECI to correctly convert the data back to text.
UTF-8 encodes pretty much any language.
Code page is an older encoding, but if your generator uses it, you can use code page 1255 for Hebrew or 1251 for Russian. See this SO answer for more info.
To test your encoding, try Inlite's Online Barcode Reader (OBR) which should read the correct text for ECI and UTF-8 encoded barcodes. If it does, the problem is with your barcode reader which is not decoding correctly.
If OBR returns binary data, either your generator uses code page or does not encode correctly at all. Try another generator that supports ECI or UTF-8.
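As an illustration, a sketch using the pylibdmtx generator (an assumption; the question does not name a library), which takes raw bytes, so the UTF-8 encoding step is explicit:

from pylibdmtx.pylibdmtx import encode
from PIL import Image

data = 'שלום עולם'.encode('utf-8')  # Hebrew sample, encoded as UTF-8 bytes
encoded = encode(data)
img = Image.frombytes('RGB', (encoded.width, encoded.height), encoded.pixels)
img.save('datamatrix.png')
# A reader must decode the payload as UTF-8 to recover the Hebrew text.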

HttpUtility.HtmlDecode cannot decode ASCII greater than 127

I have a list of characters that display fine in the WebBrowser control in the form of encoded characters such as &#128; ...
But when posting these characters to the server I realized that HttpUtility.HtmlDecode cannot convert them to characters the way the browser does; they all become spaces.
text = System.Web.HttpUtility.HtmlDecode("&#128;");
I expect it to return € but it returns a space instead. The same thing happens for some other characters as well.
Does anyone know how to fix this or any workaround?
This is commonly a result of using literal values and mixing UTF-8 and ASCII. In UTF-8 the euro sign is encoded as 3 bytes, so there is no ASCII counterpart for it.
Update
Your code is illegal if you are using UTF-8, since it only supports the first 128 characters directly and the rest are encoded as multiple bytes. You need to use the Unicode syntax:
// !!! NOT HtmlDecode!!!
text = System.Web.HttpUtility.UrlDecode("%E2%82%AC");
UPDATE
OK, I have left the code as it was but added the comment that it does not work. It does not work because "%E2%82%AC" is not an HTML encoding at all; it is URL (percent) encoding, and as such you need to use UrlDecode instead.
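The same distinction, shown in Python for comparison:

from urllib.parse import unquote
print(unquote('%E2%82%AC'))  # '€': percent escapes are URL encoding, not HTML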
ASCII is 7-bit; there are no characters 128 through 255. The MSDN article you linked is following the long tradition of pretending ASCII is 8-bit; the article actually shows code page 437.
I'm not sure why you're not simply writing € (compatibility?), but &#x20AC; or &#8364; should do, too.
You typically want to do something like:
string html = "&#128;";
string trash = WebUtility.HtmlDecode(html);
// Convert from the default encoding to UTF-8
byte[] bytes = Encoding.Default.GetBytes(trash);
string proper = Encoding.UTF8.GetString(bytes);
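Worth noting for comparison: Python's html.unescape implements the HTML5 rule that browsers follow, remapping numeric references 128-159 through Windows-1252, which is exactly why a browser shows € where HtmlDecode yields the U+0080 control character:

import html

# HTML5 treats &#128; as Windows-1252 byte 0x80, i.e. the euro sign.
print(html.unescape('&#128;'))  # €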
