How to fix UTF-16 encoded as UTF-8 in R?

I'm dealing with text where I find UTF-16 encoded as UTF-8, and I am unable to translate from one to the other in R.
For example, take the codepoint U+D83D (https://codepoints.net/U+D83D): its representation appears in my text as the UTF-8-style byte string "ED A0 BD", and I want to convert it to the UTF-16 byte string "D8 3D".
How can I achieve this?
More info on what I want to achieve: stackoverflow.com/questions/35670238/emoji-in-r-utf-8-encoding
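What you have is a UTF-8-encoded UTF-16 surrogate (sometimes called CESU-8-style data). As a sketch of the byte-level transformation (shown in Python, since a later thread here also uses Python), assuming the input really is the UTF-8 encoding of a lone surrogate:

```python
# "ED A0 BD" is the UTF-8-style encoding of the lone surrogate U+D83D
raw = bytes.fromhex("eda0bd")

# decode back to the surrogate code point; surrogatepass permits lone surrogates
cp = raw.decode("utf-8", errors="surrogatepass")

# re-encode as UTF-16 big-endian to get the "D8 3D" byte pair
utf16 = cp.encode("utf-16-be", errors="surrogatepass")
print(utf16.hex())  # d83d
```

In practice you would apply this to both halves of a surrogate pair and then decode the combined UTF-16 bytes to get the actual emoji character.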

Related

Encoding problem: Convert bytes to Chinese characters in R

I read from a html file in R which contains Chinese characters. But it shows something like
" <td class=\"forumCell\">\xbbָ\xb4</td>"
It is the "\x" strings that I need to extract. How can I convert them into readable Chinese characters?
By the way, somehow simply copy and pasting the above \x strings would not replicate the problem.
Are you sure they are all Chinese characters? What is the HTML page's encoding? The strings you pasted look to be a mix of hex escapes like \xc4\xe3 and Unicode characters like \u0237.
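If the page turns out to be GBK/GB2312-encoded (an assumption; check the page's meta charset), the \x escapes are just raw bytes in that encoding, and decoding them with the right codec recovers the characters. A minimal Python sketch:

```python
# \xc4\xe3 (mentioned above) followed by \xba\xc3 is a common GBK byte sequence:
# it decodes to the two characters of "ni hao"
raw = b"\xc4\xe3\xba\xc3"
print(raw.decode("gbk"))  # 你好
```

In R the equivalent step would be a conversion from "GBK" to "UTF-8" on the raw bytes.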

Clean Tweets: What are UTF8 and non-UTF8 characters

I am attempting to analyze a corpus of tweets extracted from Twitter. A number of tweets contain non-UTF-8 characters.
For example, one tweet is: "[米国]一人㠮ワクムン未接種㠮å­\ 㠩も㠋ら広㠌㠣㠟麻疹〠㠮教訓。 #ShotbyShotorg: How one unvaccinated child sparked Minnesota measles outbreak \"
I am not familiar with these non-alphanumeric characters or how to convert/exclude these characters. Are these garbage characters or do they need to be converted? Thank you.
I found the original tweet: https://twitter.com/narumita/status/476295179796611072?s=21. From this tweet it’s quite clear that the “garbage” text was supposed to be Japanese.
The original text reads
[米国]一人のワクチン未接種の子どもから広がった麻疹、の教訓。
Somehow, your text has undergone two rounds of mojibake: it was encoded as UTF-8, decoded as Windows Code Page 1252 (CP-1252), encoded as UTF-8 again, and decoded as CP-1252 again. Unfortunately the text is not fully recoverable from what you posted, since CP-1252 leaves some byte values undefined and those bytes were lost along the way. However, a quick Python script recovers a couple of characters, enough to confirm this is how it was corrupted:
t = '[米国]一人㠮ワクムン未接種㠮å­\ 㠩も㠋ら広㠌㠣㠟麻疹〠㠮教訓。'
# reverse the two rounds of mojibake: encode back to CP-1252 bytes, decode as UTF-8, twice
fixed = t.encode('cp1252', errors='replace').decode('utf8', errors='replace')
print(fixed.encode('cp1252', errors='replace').decode('utf8', errors='replace'))
This outputs:
[米国]一人� �ワク� ン未接種� ��\ � �も� �ら広� �� �� �麻疹� � �教訓。
EDITED: A round-trip analysis (taking the original text and badly encoding it twice) revealed that it was likely using CP-1252, rather than ISO-8859-1; the encodings are identical on most codepoints. The post has been edited to use CP-1252 instead.
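The double round-trip described above can be simulated directly. This sketch garbles a known string the same way, to show how the damage arises (the replace error handler stands in for the byte values CP-1252 leaves undefined):

```python
original = "麻疹"  # "measles", taken from the recovered tweet above

# one round of mojibake: UTF-8 bytes misread as CP-1252 text
round1 = original.encode("utf-8").decode("cp1252", errors="replace")

# second round: the garbled text is encoded and misread again
round2 = round1.encode("utf-8").decode("cp1252", errors="replace")
print(round2)  # doubly garbled, matching the pattern in the posted tweet
```

Each round roughly triples the character count, which is why doubly-garbled text looks so much longer than the original.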

Convert encodings to raw characters using R language

I have encoded characters such as "\032\032\032\032\032\032\032\032" and I want to convert them to raw characters in R. How can this be done?
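In R this is exactly what charToRaw() does. A Python sketch of the same byte view, noting that "\032" is an octal escape for byte 0x1A in both languages:

```python
s = "\032" * 8             # "\032" is the octal escape for byte 0x1A (ASCII SUB)
raw = s.encode("latin-1")  # one byte per character, like R's charToRaw()
print(raw.hex())           # 1a1a1a1a1a1a1a1a
```

Byte 0x1A is the ASCII "substitute" control character, which often appears where an earlier conversion step replaced bytes it could not represent.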

What is wrong with the following URL encoding?

https://twitter.com/intent/tweet?source=webclient&text=G%C5
produces the following error:
Invalid Unicode value in one or more parameters
By the way, that is the Å character.
Twitter expects parameters to be encoded as UTF-8.
Å is Unicode U+00C5, which in UTF-8 is the two bytes C3 85.
With URL-escaping, this means the query should be ...&text=G%C3%85
Since I don't know how you are building that query (programming language/environment), I can't tell you exactly how to do it, only that you should convert your string to UTF-8 before percent-escaping it.
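For example, in Python the standard library's quote() percent-encodes via UTF-8 by default, producing exactly the byte sequence Twitter expects:

```python
from urllib.parse import quote

# quote() encodes the string as UTF-8 before percent-escaping
print(quote("GÅ"))  # G%C3%85
```

The broken %C5 in the question is what you get if the string is escaped from a Latin-1 byte view instead, where Å is the single byte 0xC5.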

Printing ASCII value of BB (HEX) in Unix

When I am trying to paste the character » (right double angle quotes) in Unix from my Notepad, it's converting to /273. The corresponding Hex value is BB and the Decimal value is 187.
My actual requirement is to have this character as the file delimiter when I export a .dat file from a database table. So, this character was put in as the delimiter after each column name. But, while copy-pasting, it's getting converted to /273.
Any idea about how to fix this? I am on Solaris (SunOS 5.10).
Thanks,
Visakh
ASCII only defines character codes up to 127 (0x7F); everything above that belongs to some other encoding, such as ISO-8859-1 or UTF-8. Make sure your locale is set to the encoding you are trying to use: the locale command reports your current settings, and the locale(5) and environ(5) man pages cover how to change them. For a more in-depth introduction to the whole character encoding concept, see Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".
The character code 0xBB is shown as » in the ISO-8859-1 character chart, so that is probably the character set you want. The locale would then be something like en_US.ISO8859-1 for that character set with US/English messages, date formats, currency settings, etc.
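To illustrate the difference, here is a small Python check of how the same » character is represented in each encoding:

```python
# in ISO-8859-1, » is the single byte 0xBB (decimal 187, octal 273)
print(bytes([0xBB]).decode("iso-8859-1"))  # »

# in UTF-8 the same character takes two bytes
print("»".encode("utf-8").hex())  # c2bb
```

The /273 the shell displayed is the octal rendering of that same 0xBB byte, which is why a UTF-8 locale refuses to interpret it as a standalone character.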
