Encoding problem: Convert bytes to Chinese characters in R

Encoding problem: Convert bytes to Chinese characters in R - r

I read from a html file in R which contains Chinese characters. But it shows something like
" <td class=\"forumCell\">\xbbָ\xb4</td>"
It is the "\x" strings that I need to extract. How can I convert them into readable Chinese characters?
By the way, somehow simply copy and pasting the above \x strings would not replicate the problem.

are you sure they are all chinese characters? what is the html page encoding? the strings you pasted looks to be a mix of hex \xc4\xe3 and unicode chars \u0237.

Related

Convert encodings to raw characters using R language

I have encoded characters such as "\032\032\032\032\032\032\032\032" and I want to convert them to raw characters in R. How can this be done?

Jasypt Encryption: Possible Characters?

How do I know exactly what characters are used for the encrypted output using jasypt? Can I force that my output does not contain certain characters or are always all ASCII characters used?
Reason I am asking is that the encrypted text is part of a file with delimiters and I would like to avoid that this delimiter is part of the encrypted text. The delimiter should also not be a hidden character, like SOH, because the file can be edited manually.

"Base64 only uses 6 bits (corresponding to 2^6 = 64 characters) to ensure encoded data is printable and humanly readable. None of the special characters available in ASCII are used. The 64 characters (hence the name Base64) are 10 digits, 26 lowercase characters, 26 uppercase characters as well as '+' and '/'."
So, looks like I can use ASCII special characters as a delimiter.

converting Unicode characters from string column in R

Imported a bunch of CSV and one of the columns has what i think are Unicode chars.
something like:
PEÃ<U+0083>â<U+0080><U+0098>A
SOPEÃ<U+0083>â<U+0080><U+0098>A
Not in all rows, just some, but I've tried to convert to "human readable" chars but to no avail.
Tested this solution from SO but no success so far: unicode characters conversion in R
and this brute substitution didn't worked
gsub('Ã<U+0083>â<U+0080><U+0098>', 'Ñ', 'PEÃ<U+0083>â<U+0080><U+0098>A')
[1] "Ã<U+0083>â<U+0080><U+0098>"

How to fix UTF-16 encoded as UTF-8 in R?

I'm dealing with a text where I find UTF-16 encoded as UTF-8 and I am unable to translate from one to the other in the R language.
For example, and looking at this codepoint (https://codepoints.net/U+D83D) representation in UTF-8 as a text string "ED A0 BD" and I want to convert it to also a text string"D8 3D".
How can I achieve this?
More info on what I want to achieve: stackoverflow.com/questions/35670238/emoji-in-r-utf-8-encodi‌ng

Printing ASCII value of BB (HEX) in Unix

When I am trying to paste the character » (right double angle quotes) in Unix from my Notepad, it's converting to /273. The corresponding Hex value is BB and the Decimal value is 187.
My actual requirement is to have this character as the file delimiter when I export a .dat file from a database table. So, this character was put in as the delimiter after each column name. But, while copy-pasting, it's getting converted to /273.
Any idea about how to fix this? I am on Solaris (SunOS 5.10).
Thanks,
Visakh

ASCII only defines the character codes up to 127 (0x7F) - everything after that is another encoding, such as ISO-8859-1 or UTF-8. Make sure your locale is set to the encoding you are trying to use - the locale command will report your current locale settings, the locale(5) and environ(5) man pages cover how to set them. A much more in-depth introduction to the whole character encoding concept can be found in Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
The character code 0xBB is shown as » in the IS0-8859-1 character chart, so that's probably the character set you want, so the locale would be something like en_US.ISO8859-1 for that character set with US/English messages/date formats/currency settings/etc.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Encoding problem: Convert bytes to Chinese characters in R - r

are you sure they are all chinese characters? what is the html page encoding? the strings you pasted looks to be a mix of hex \xc4\xe3 and unicode chars \u0237.

Related

Convert encodings to raw characters using R language

Jasypt Encryption: Possible Characters?

converting Unicode characters from string column in R

How to fix UTF-16 encoded as UTF-8 in R?

Printing ASCII value of BB (HEX) in Unix

Categories

Resources