Can R read HTML-encoded emoji characters?

Question
My question, explained below, is:
How can R be used to read a string that includes HTML emoji codes like &#55358;&#56599;?
I'd like to:
(1) represent the emoji symbol (e.g., as a Unicode symbol: 🤗) in the parsed string, OR (2) convert it into its text equivalent (":hugging face:")
Background
I have an XML dataset of text messages (from the Android/iOS app Signal) that I am reading into R for a text mining project. The data look like this, with each text message represented in an sms node:
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!-- File Created By Signal -->
<smses count="1">
<sms protocol="0" address="+15555555555" contact_name="Jane Doe" date="1483256850399" readable_date="Sat, 31 Dec 2016 23:47:30 PST" type="1" subject="null" body="Hug emoji: &#55358;&#56599;" toa="null" sc_toa="null" service_center="null" read="1" status="-1" locked="0" />
</smses>
Problem
I am currently reading the data using the xml2 package for R. When I use the xml2::read_xml function, however, I get the following error message:
Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html, :
xmlParseCharRef: invalid xmlChar value 55358
Which, as I understand it, indicates that the emoji character is not recognized as valid XML.
Using the xml2::read_html function does work, but drops the emoji character. A small example of this is here:
example_text <- "Hugging emoji: &#55358;&#56599;"
xml2::xml_text(xml2::read_html(paste0("<x>", example_text, "</x>")))
(Output: [1] "Hugging emoji: ")
This code is valid HTML -- Googling &#55358;&#56599; actually converts it in the search bar to the "hugging face" emoji, and brings up results relating to that emoji.
Other information I've found that seems relevant to this question
I've been searching Stack Overflow and have not found any questions relating to this particular issue. I've also not been able to find a table that straightforwardly lists HTML codes next to the emoji they represent, so I'm not able to do an (albeit inefficient) conversion of these HTML codes to their textual equivalents in a big loop before parsing the dataset; for example, neither this list nor its underlying dataset seems to include the string 55358.

tl;dr: the emoji aren't valid HTML entities; UTF-16 surrogate values have been used to build them instead of Unicode code points. I describe an algorithm at the bottom of this answer to convert them so that they are valid XML.
Identifying the Problem
R definitely handles emoji:
In fact, a few packages exist for handling emoji in R. For example, the emojifont and emo packages both let you retrieve emoji based on Slack-style keywords. It's just a question of getting your source characters out of the HTML-escaped format so that you can convert them.
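For instance (a quick sketch; the exact keyword spellings here are assumptions, and both packages ship search helpers if these don't match):
# emojifont is on CRAN; emo lives on GitHub (hadley/emo).
emojifont::search_emoji("smile") # list the keywords matching "smile"
emojifont::emoji("smile")        # the corresponding emoji character
# emo::ji("hug")                 # hugging-face emoji, if that alias exists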
xml2::read_xml seems to do fine with other HTML entities, like an ampersand or double quotes. I looked at this SO answer to see whether there were any XML-specific constraints on HTML entities, and it seemed like they were storing emoji fine. So I tried changing the emoji codes in your reprex to the ones in that answer:
body="Hug emoji: 😀😃"
And, sure enough, they were preserved (though they're obviously not the hug emoji anymore):
> test8 = read_html('Desktop/test.xml')
> test8 %>% xml_child() %>% xml_child() %>% xml_child() %>% xml_attr('body')
[1] "Hug emoji: \U0001f600\U0001f603"
I looked up the hug emoji on this page, and the decimal HTML entity given there is not &#55358;&#56599;. It looks like the UTF-16 decimal codes for the emoji have been wrapped in &# and ; instead.
In conclusion, I think the answer is that your emoji are, in fact, not valid HTML entities. If you can't control the source, you might need to do some pre-processing to account for these errors.
So, why does the browser convert them properly? I'm wondering if the browser is a little more flexible with these things and is making some guesses about what those codes could be. I'm just speculating, though.
Converting UTF-16 to Unicode code points
After some more investigation, it looks like valid emoji HTML entities use the Unicode code point (in decimal, if it's &#...;, or hex, if it's &#x...;). The Unicode code point is different from the UTF-8 or UTF-16 code. (That link explains a lot about how emoji and other characters are variously encoded, BTW! Good read.)
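To make the distinction concrete in R terms (a quick sketch, assuming a UTF-8 locale):
x <- "\U0001F917" # the hugging-face emoji
utf8ToInt(x)      # 129303 -- the Unicode code point (0x1F917)
charToRaw(x)      # f0 9f a4 97 -- its four UTF-8 bytes
# And its two UTF-16 code units (the surrogate halves), computed by hand:
0xD800 + (129303 - 0x10000) %/% 0x400 # 55358 (high surrogate)
0xDC00 + (129303 - 0x10000) %%  0x400 # 56599 (low surrogate)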
So we need to convert the UTF-16 codes used in your source data to Unicode code points. Referring to this Wikipedia article on UTF-16, I've verified how it's done. Each Unicode code point above the BMP (our target), once you subtract 0x10000, is a 20-bit number, or five hex digits. When going from Unicode to UTF-16, you split it up into two 10-bit numbers (the middle hex digit gets cut in half, with two of its bits going to each block), do some maths on them, and get your result.
Going backwards, as you want to, it's done like this:
Your decimal UTF-16 number (which is in two separate blocks for now) is 55358 56599
Converting those blocks to hex (separately) gives 0x0d83e 0x0dd17
You subtract 0xd800 from the first block and 0xdc00 from the second to give 0x3e 0x117
Converting them to binary, padding them out to 10 bits and concatenating them, it's 0b0000 1111 1001 0001 0111
Then we convert that back to hex, which is 0x0f917
Finally, we add 0x10000, giving 0x1f917
Therefore, our (hex) HTML entity is &#x1f917;. Or, in decimal, &#129303;
So, to preprocess this dataset, you'll need to extract the existing numbers, use the algorithm above, then put the result back in (with one &#...;, not two).
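In R, the whole algorithm boils down to a couple of lines of integer arithmetic (a minimal sketch of the conversion step only, using base R's bitwShiftL; extracting the numbers from the text and re-inserting the result is left out):
surrogates_to_codepoint <- function(high, low) {
# high, low: the two decimal numbers from one "&#...;&#...;" pair
bitwShiftL(high - 0xD800, 10) + (low - 0xDC00) + 0x10000
}
cp <- surrogates_to_codepoint(55358, 56599)
cp                    # 129303
sprintf("&#x%x;", cp) # "&#x1f917;" -- a valid XML/HTML entity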
Displaying emoji in R
As far as I'm aware, there's no solution to printing emoji in the R console: they always come out as "\U0001f600" (or what have you). However, the packages I described above can help you plot emoji in some circumstances (I'm hoping to expand ggflags to display arbitrary full-colour emoji at some point). They can also help you search for emoji to get their codes, but they can't get names given the codes AFAIK. But maybe you could try importing the emoji list from emojilib into R and doing a join with your data frame, if you've extracted the emoji codes into a column, to get the English names.
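For what it's worth, here's a sketch of that join idea. The file name and field layout are assumptions (emojilib's data format has changed between versions), so adjust them to whatever the JSON you download actually contains:
library(jsonlite)
emoji_db <- fromJSON("emojis.json") # hypothetical local copy of emojilib's data, keyed by name
lookup <- data.frame(
name = names(emoji_db),
char = vapply(emoji_db, function(e) if (is.null(e$char)) NA_character_ else e$char, character(1)),
stringsAsFactors = FALSE
)
# merge(messages, lookup, by.x = "emoji", by.y = "char") would then attach
# English names to a column of extracted emoji characters.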

JavaScript Solution
I had this exact same problem, but needed the solution in JavaScript, not R. Using rensa's comment above (hugely helpful!), I created the following code to solve this issue, and I just wanted to share it in case anyone else happens across this thread as I did, but needed it in JavaScript.
str.replace(/(&#\d+;){2}/g, function(match) {
match = match.replace(/&#/g,'').split(';');
var binFirst = (parseInt('0x' + parseInt(match[0]).toString(16)) - 0xd800).toString(2);
var binSecond = (parseInt('0x' + parseInt(match[1]).toString(16)) - 0xdc00).toString(2);
binFirst = '0000000000'.substr(binFirst.length) + binFirst;
binSecond = '0000000000'.substr(binSecond.length) + binSecond;
return '&#x' + (('0x' + (parseInt(binFirst + binSecond, 2).toString(16))) - (-0x10000)).toString(16) + ';';
});
And, here's a full snippet of it working if you'd like to run it:
var str = '&#55357;&#56842;&#55357;&#56856;&#55357;&#56832;&#55357;&#56838;&#55357;&#56834;&#55357;&#56833;'
str = str.replace(/(&#\d+;){2}/g, function(match) {
match = match.replace(/&#/g,'').split(';');
var binFirst = (parseInt('0x' + parseInt(match[0]).toString(16)) - 0xd800).toString(2);
var binSecond = (parseInt('0x' + parseInt(match[1]).toString(16)) - 0xdc00).toString(2);
binFirst = '0000000000'.substr(binFirst.length) + binFirst;
binSecond = '0000000000'.substr(binSecond.length) + binSecond;
return '&#x' + (('0x' + (parseInt(binFirst + binSecond, 2).toString(16))) - (-0x10000)).toString(16) + ';';
});
document.getElementById('result').innerHTML = str;
// &#55357;&#56842;&#55357;&#56856;&#55357;&#56832;&#55357;&#56838;&#55357;&#56834;&#55357;&#56833;
// is turned into
// &#x1f60a;&#x1f618;&#x1f600;&#x1f606;&#x1f602;&#x1f601;
// which is rendered by the browser as the emojis
Original:<br>&#55357;&#56842;&#55357;&#56856;&#55357;&#56832;&#55357;&#56838;&#55357;&#56834;&#55357;&#56833;<br><br>
Result:<br>
<div id='result'></div>
My SMS XML Parser application is working great now, but it stalls out on large XML files, so I'm thinking about rewriting it in PHP. If/when I do, I'll post that code as well.

I've implemented the algorithm described by rensa above in R, and am sharing it here. I am happy to release the code snippet below under a CC0 dedication (i.e., putting this implementation into the public domain for free reuse).
This is a quick and unpolished implementation of rensa's algorithm, but it works!
# Requires the stringr, stringi, BMS, and magrittr (for %>%) packages.
utf16_double_dec_code_to_utf8 <- function(utf16_decimal_code){
string_elements <- stringr::str_match_all(utf16_decimal_code, "&#(.*?);")[[1]][,2]
string3a <- string_elements[1]
string3b <- string_elements[2]
string4a <- sprintf("0x0%x", as.numeric(string3a))
string4b <- sprintf("0x0%x", as.numeric(string3b))
string5a <- paste0(
# "0x",
as.hexmode(string4a) - 0xd800
)
string5b <- paste0(
# "0x",
as.hexmode(string4b) - 0xdc00
)
string6 <- paste0(
stringi::stri_pad(
paste0(BMS::hex2bin(string5a), collapse = ""),
10,
pad = "0"
) %>%
stringr::str_trunc(10, side = "left", ellipsis = ""),
stringi::stri_pad(
paste0(BMS::hex2bin(string5b), collapse = ""),
10,
pad = "0"
) %>%
stringr::str_trunc(10, side = "left", ellipsis = "")
)
string7 <- BMS::bin2hex(as.numeric(strsplit(string6, split = "")[[1]]))
string8 <- as.hexmode(string7) + 0x10000
unicode_pattern <- string8
unicode_pattern
}
make_unicode_entity <- function(x) {
paste0("\\U000", utf16_double_dec_code_to_utf8(x))
}
make_html_entity <- function(x) {
paste0("&#x", utf16_double_dec_code_to_utf8(x), ";")
}
# An example string, using the "hug" emoji:
example_string <- "test &#55358;&#56599; test"
output_string <- stringr::str_replace_all(
example_string,
"(&#[0-9]*?;){2}", # Find all two-character "&#...;&#...;" codes.
make_unicode_entity
# make_html_entity
)
cat(output_string)
# To print the Unicode string (it doesn't display in the R console, but can be
# copied and pasted elsewhere; this assumes you've used 'make_unicode_entity'
# above in the str_replace_all call):
stringi::stri_unescape_unicode(output_string)
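A quick spot check against the worked example in rensa's answer (assuming the functions above are loaded):
utf16_double_dec_code_to_utf8("&#55358;&#56599;")
# [1] "1f917"
make_html_entity("&#55358;&#56599;")
# [1] "&#x1f917;"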

Translated Chad's JavaScript answer to Go since I too had the same issue, but needed a solution in Go.
https://play.golang.org/p/h9JBFzqcd90
package main
import (
"fmt"
"html"
"regexp"
"strconv"
"strings"
)
func main() {
emoji := "&#55357;&#56842;&#55357;&#56856;&#55357;&#56832;&#55357;&#56838;&#55357;&#56834;&#55357;&#56833;"
re := regexp.MustCompile(`(&#\d+;){2}`)
matches := re.FindAllString(emoji, -1)
var builder strings.Builder
for _, match := range matches {
s := strings.Replace(match, "&#", "", -1)
parts := strings.Split(s, ";")
a := parts[0]
b := parts[1]
c, err := strconv.Atoi(a)
if err != nil {
panic(err)
}
d, err := strconv.Atoi(b)
if err != nil {
panic(err)
}
c = c - 0xd800
d = d - 0xdc00
e := strconv.FormatInt(int64(c), 2)
f := strconv.FormatInt(int64(d), 2)
g := "0000000000"[2:len(e)] + e
h := "0000000000"[10:len(f)] + f
j, err := strconv.ParseInt(g + h, 2, 64)
if err != nil {
panic(err)
}
k := j + 0x10000
_, err = builder.WriteString("&#x" + strconv.FormatInt(k, 16) + ";")
if err != nil {
panic(err)
}
}
fmt.Println(html.UnescapeString(emoji))
emoji = html.UnescapeString(builder.String())
fmt.Println(emoji)
}

Related

Avoiding JSON error displaying Japanese strings within Plotly (R) / Running a function on one variable at a time

I'm very new to R and beginner level at programming in general, and trying to figure out how to get hovertext in plotly to display a Japanese string from my dataframe. After venturing through character encoding hell, I've got things mostly worked out but am getting stuck on a single point: Getting the Japanese string to display in the final plot.
plot_ly(df, x = ~cost, y = ~grossSales, type = "scatter", mode = "markers",
hoverinfo = "text",
text = ~paste0("Product name: ", productName,
"<br>Gross: ", grossSales, "<br> Cost: ", cost,
)
)
The problem I encounter is that using 'productName' returns the Japanese string from the dataframe, which causes the plot to fail to render. DOM Inspector's console shows JSON encountering issues with the string (even though it's just encoded in UTF-8).
Using toJSON(productName), I am able to render the plot, however this renders the hover textbox with the full contents of the productName column (e.g., ["","Product1","Product2","Product3"...]). I only want the name of that specific product, just as 'grossSales' and 'cost' only return the data specific to that product at each point on the plot.
Is there a way I can execute toJSON() on each specific instance of 'productName' only? (i.e., the output should be "Product1" in a JSON-friendly string format.) Alternatively, is there a way I can have plotly read the list output and select only the correct productName?
Stepping away from the problem to continue studying other things, I found a partial solution in using a for-loop:
productNames <- NULL
for (i in 1:nrow(df))
{
productNames <- c(productNames, toJSON(df[i, "productName"]))
}
df$jsonProductNames <- productNames
Using the jsonProductNames variable within plotly, the graph renders and displays only the name for each product! The sole issue remaining is that it is displayed with the JSON [""] formatting around each product's name.
Update:
I've finally got this working fully the way I want it. I imagine there are more elegant solutions, and I'd still be interested to learn how to achieve what I was originally looking for if possible (running a function on a variable each time it is encountered in a loop), but here is how I have it working:
colToJSON <- function(df, colStr)
{
JSONCol <- NULL
for (i in 1:nrow(df))
{
JSONCol <- c(JSONCol, toJSON(df[i, colStr]))
}
JSONCol <- gsub("\\[\"", "", JSONCol)
JSONCol <- gsub("\"\\]", "", JSONCol)
return(JSONCol)
}
df$jsonProductNames <- colToJSON(df, "productName")
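The same column can also be built without growing a vector in a loop, using vapply (a sketch that keeps the same toJSON call and quote-stripping as above):
colToJSON <- function(df, colStr)
{
jsonCol <- vapply(df[[colStr]], function(x) as.character(toJSON(x)), character(1), USE.NAMES = FALSE)
gsub("^\\[\"|\"\\]$", "", jsonCol)
}
df$jsonProductNames <- colToJSON(df, "productName")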

Nginx: How to convert post data encoding from UTF-8 to TIS-620 before pass them through proxy

I would like to convert the POST data received with a request from UTF-8 to TIS-620 before passing it to the backend via proxy_pass, using the config below, but I am not sure how to do it:
location / {
proxy_pass http://targetwebsite;
}
If I am not wrong, I believe I have to use the Lua module to manipulate the request, but I don't know if it supports any character conversion.
Could anybody help me with sample code that converts POST data from UTF-8 to TIS-620 using Lua, and with a way to validate that the POST data is UTF-8 before converting it? Or is there a better way to manipulate/convert POST data in nginx?
This solution works on Lua 5.1/5.2/5.3
local function utf8_to_unicode(utf8str, pos)
-- pos = starting byte position inside input string
local code, size = utf8str:byte(pos), 1
if code >= 0xC0 and code < 0xFE then
local mask = 64
code = code - 128
repeat
local next_byte = utf8str:byte(pos + size)
if next_byte and next_byte >= 0x80 and next_byte < 0xC0 then
code, size = (code - mask - 2) * 64 + next_byte, size + 1
else
return
end
mask = mask * 32
until code < mask
elseif code >= 0x80 then
return
end
-- returns code, number of bytes in this utf8 char
return code, size
end
function utf8to620(utf8str)
local pos, result_620 = 1, {}
while pos <= #utf8str do
local code, size = utf8_to_unicode(utf8str, pos)
if code then
pos = pos + size
code =
(code < 128 or code == 0xA0) and code
or (code >= 0x0E01 and code <= 0x0E3A or code >= 0x0E3F and code <= 0x0E5B) and code - 0x0E5B + 0xFB
end
if not code then
return utf8str -- wrong UTF-8 symbol, this is not a UTF-8 string, return original string
end
table.insert(result_620, string.char(code))
end
return table.concat(result_620) -- return converted string
end
Usage:
local utf8string = "UTF-8 Thai text here"
local tis620string = utf8to620(utf8string)
I looked up the encoding on Wikipedia and came up with the following solution for converting from UTF-8 to TIS-620. It assumes that all the codepoints in the UTF-8 string have an encoding in TIS-620. It will work if the UTF-8 string only contains ASCII printable characters (codepoints " " to "~") or Thai characters (codepoints "ก" to "๛"). Otherwise, it will give wrong and possibly very strange results.
This assumes you have Lua 5.3's utf8 library or an equivalent. If you're using an earlier version of Lua, one possibility is the pure-Lua version of the ustring library from MediaWiki (used by Wikipedia and Wiktionary, for instance). It provides a function to validate UTF-8, and many of the other functions will validate strings automatically. (That is, they throw an error if the string is invalid UTF-8.) If you use that library, you just have to replace utf8.codepoint with ustring.codepoint in the code below.
-- Add this number to TIS-620 values above 0x80 to get the Unicode codepoint.
-- 0xE00 is the first codepoint of Thai block, 0xA0 is the corresponding byte used in TIS-620.
local difference = 0xE00 - 0xA0
function UTF8_to_TIS620(UTF8_string)
local TIS620_string = UTF8_string:gsub(
'[\194-\244][\128-\191]+',
function (non_ASCII)
return string.char(utf8.codepoint(non_ASCII) - difference)
end)
return TIS620_string
end

How do I decode this string? \xc3\x99\xc3\xa9\xc2\x87-B[x\xc2

This is what I need to decode
\xc3\x99\xc3\x99\xc3\xa9\xc2\x87-B[x\xc2\x99\xc2\xbe\xc3\xa6\x14Ez\xc2\xab
It is generated by String.fromCharCode(arrayPw[i]);
but I don't understand how to decode it :(
Please help.
Python:
# Python 2 (in Python 3 you'd start from a bytes literal and .decode it):
data = "\xc3\x99\xc3\x99\xc3\xa9\xc2\x87-B[x\xc2\x99\xc2\xbe\xc3\xa6\x14Ez\xc2\xab"
udata = data.decode("utf-8")
asciidata = udata.encode("ascii","ignore")
JavaScript:
function decode_utf8(s) {
return decodeURIComponent(escape(s));
}
Otherwise do more research about decoding UTF-8.
https://gist.github.com/chrisveness/bcb00eb717e6382c5608
There's also an online UTF-8 decoder/encoder:
https://mothereff.in/utf-8
HINT: ÙÙé-B[x¾æEz«
Duplicate of this: https://stackoverflow.com/a/70815136/5902698
You load a dataset and you have some strange characters.
Example:
'戴森美å�‘é€\xa0型器完整版套装Dyson Airwrap
HS01(铜金色礼盒版)'
In my case, I know that the strange characters are Chinese. So I can figure out that whoever sent me the data encoded it in UTF-8, but it got decoded as 'ISO-8859-1' somewhere along the way.
So, first step: I encode the string back to 'ISO-8859-1', then decode it with UTF-8.
So my lines are:
_encoding = 'ISO-8859-1'
_my_str.encode(_encoding, 'ignore').decode("utf-8", 'ignore')
Then my output is :
"'森Dyson Airwrap HS01礼'"
This works for me, but I admit I don't fully understand what's happening under the hood. So feel free to tell me if you have further information.
Bonus: I'll try to detect when a string is in the first strange format, because some of my entries are Chinese but others are English.
EDIT: The bonus turned out to be unnecessary. I just apply a lambda to my column to encode and decode without caring about the format, so I changed the encoding after loading the dataframe:
_encoding = 'ISO-8859-1'
_decoding = "utf-8"
df[col] = df[col].apply(lambda x : x.encode(_encoding, 'ignore').decode(_decoding , 'ignore'))

Get last page http request golang

I am doing an HTTP request like this one:
resp, err := http.Get("http://example.com/")
Then I am getting the header link:
link := resp.Header.Get("link")
Which gives me a result like this:
<page=3>; rel="next",<page=1>; rel="prev";<page=5>; rel="last"
Question
How can I parse this into a more legible form? I am specifically trying to get the last page, but the first and next pages would be useful as well.
I tried with Split and regular expressions, without success.
Are you sure that is the format of the output? It looks like one of the ; separators should be a ,.
A single Link http header with multiple values, should be of the format (notice the comma after "prev")
<page=3>; rel="next",<page=1>; rel="prev",<page=5>; rel="last"
The header value should be split on , to get the individual links. Split each link on ; for the values or key-value pairs, and then, if the value matches <(.*=.*)>, discard the angle brackets and use the remaining key and value.
Here's a solution of how to match your page numbers.
http://play.golang.org/p/kzurb38Fwx
text := `<page=3>; rel="next",<page=1>; rel="prev";<page=2>; rel="last"`
re := regexp.MustCompile(`<page=([0-9]+)>; rel="next",<page=([0-9]+)>; rel="prev";<page=([0-9]+)>; rel="last"`)
matches := re.FindStringSubmatch(text)
if matches != nil {
next := matches[1]
prev := matches[2]
last := matches[3]
fmt.Printf("next = %s, prev = %s, last = %s\n", next, prev, last)
}
Later Edit: you can probably also use the xml package to achieve the same result, by parsing that output as an XML, but you would need to transform your output a bit.

QRegExp: individual quantifiers can't be non-greedy, but what good alternatives then?

I'm trying to write code that appends the ending _my_ending to a filename and does not change the file extension.
Examples of what I need to get:
"test.bmp" -> "test_my_ending.bmp"
"test.foo.bar.bmp" -> "test.foo.bar_my_ending.bmp"
"test" -> "test_my_ending"
I have some experience with PCRE, and this is a trivial task using it. Because of my lack of experience with Qt, I initially wrote the following code:
QString new_string = old_string.replace(
QRegExp("^(.+?)(\\.[^.]+)?$"),
"\\1_my_ending\\2"
);
This code does not work (no match at all), and then I found in the docs that
Non-greedy matching cannot be applied to individual quantifiers, but can be applied to all the quantifiers in the pattern
As you see, in my regexp I tried to reduce greediness of the first quantifier + by adding ? after it. This isn't supported in QRegExp.
This is really disappointing to me, and so I have to write the following ugly but working code:
//-- write regexp that matches only filenames with extension
QRegExp r = QRegExp("^(.+)(\\.[^.]+)$");
r.setMinimal(true);
QString new_string;
if (old_string.contains(r)){
//-- filename contains extension, so, insert ending just before it
new_string = old_string.replace(r, "\\1_my_ending\\2");
} else {
//-- filename does not contain extension, so, just append ending
new_string = old_string + "_my_ending";
}
But is there some better solution? I like Qt, but some things that I see in it seem to be discouraging.
How about using QFileInfo? This is shorter than your 'ugly' code:
QFileInfo fi(old_string);
QString new_string = fi.completeBaseName() + "_my_ending"
+ (fi.suffix().isEmpty() ? "" : ".") + fi.suffix();
