First of all, I am sorry if this is a repeated question. I have tried for several hours already, and I see different solutions for PHP or other languages but not for R.
I am retrieving data from the last.fm website using their API.
You do need an API key to retrieve the data I am trying to get but I will make it simpler here and hopefully you can answer my question.
Here is my problem:
At a certain point, when retrieving the data, I encounter an error which stops my request. I skipped it once but it comes back again and again. I always get the same message: PCDATA invalid Char value #
Here is an example:
string = "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<lfm status=\"ok\">\n<results for=\"a\" xmlns:opensearch=\"http://a9.com/-/spec/opensearch/1.1/\">\n<opensearch:Query role=\"request\" searchTerms=\"a\" startPage=\"1382\" />\n<opensearch:totalResults>212588</opensearch:totalResults>\n<opensearch:startIndex>1381</opensearch:startIndex>\n<opensearch:itemsPerPage>1</opensearch:itemsPerPage><artistmatches>\n<artist>\n <name>!B0A \0348E09;>2</name>\n <listeners>1672</listeners>\n <mbid></mbid>\n <url>http://www.last.fm/music/!B0A+%1C8E09;%3E2</url>\n <streamable>0</streamable>\n <image size=\"small\">http://userserve-ak.last.fm/serve/34/88015017.png</image>\n <image size=\"medium\">http://userserve-ak.last.fm/serve/64/88015017.png</image>\n <image size=\"large\">http://userserve-ak.last.fm/serve/126/88015017.png</image>\n <image size=\"extralarge\">http://userserve-ak.last.fm/serve/252/88015017.png</image>\n <image size=\"mega\">http://userserve-ak.last.fm/serve/_/88015017/B0A+8E092+15286997.png</image>\n </artist></artistmatches>\n</results></lfm>\n"
When I try to parse this text I get the error:
doc = xmlParse(string, asText = TRUE)
PCDATA invalid Char value 28
Error: 1: PCDATA invalid Char value 28
I believe the part that is making this happen comes from this part of the string:
<name>!B0A \0348E09;>2</name>\n
But I can't be sure now.
What I am looking for is one of these solutions. The first one would be ideal, but any of the others would make me happy:
1 - Allow R to receive these invalid characters
2 - Eliminate the invalid characters and continue with the parse without stopping.
3 - Skip the string with the invalid characters and continue with the parse
4 - Create a function to find the invalid characters so I can include that when retrieving the data from last.fm
I hope you can understand the question and help me with it.
Thanks in advance
You are right. The artist name contains a character that is illegal in XML.
Try this out:
illegal <- "[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]"
utf8_for_xml <- function(x) {
return(gsub(illegal, "", x))
}
string_formatted <- utf8_for_xml(string)
xmlParse(string_formatted)
<?xml version="1.0" encoding="utf-8"?>
<lfm status="ok">
<results xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/" for="a">
<opensearch:Query role="request" searchTerms="a" startPage="1382"/>
<opensearch:totalResults>212588</opensearch:totalResults>
<opensearch:startIndex>1381</opensearch:startIndex>
<opensearch:itemsPerPage>1</opensearch:itemsPerPage>
<artistmatches>
<artist>
<name>!B0A 8E09;>2</name>
<listeners>1672</listeners>
<mbid/>
<url>http://www.last.fm/music/!B0A+%1C8E09;%3E2</url>
<streamable>0</streamable>
<image size="small">http://userserve-ak.last.fm/serve/34/88015017.png</image>
<image size="medium">http://userserve-ak.last.fm/serve/64/88015017.png</image>
<image size="large">http://userserve-ak.last.fm/serve/126/88015017.png</image>
<image size="extralarge">http://userserve-ak.last.fm/serve/252/88015017.png</image>
<image size="mega">http://userserve-ak.last.fm/serve/_/88015017/B0A+8E092+15286997.png</image>
</artist>
</artistmatches>
</results>
</lfm>
Extra:
Let's find out which character is illegal for XML in your string object.
The function gregexpr finds the character number:
gregexpr(illegal, string)
[1] 403
attr(,"match.length")
[1] 1
Using the "Unicode" package:
require(Unicode)
unicode_string <- as.u_char(utf8ToInt(string))
unicode_string[403]
[1] U+001C
The Unicode U+001C is the "Information Separator Four" and it is illegal for parsing in XML.
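For readers working outside R, the same strip-and-locate idea can be sketched in Python. This is only an illustration: the names `ILLEGAL_XML_CHARS`, `find_illegal_xml_chars` and `strip_illegal_xml_chars` are made up for this example, the character class follows the XML 1.0 definition of a legal character (so it also allows code points above U+FFFF, which the R pattern above does not), and the sample is a shortened version of the `<name>` element from the question.

```python
import re
import xml.etree.ElementTree as ET

# Negated class of the characters XML 1.0 allows: tab, LF, CR and the ranges below.
# Anything else (for example U+001C) is matched.
ILLEGAL_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_illegal_xml_chars(text):
    """Return (0-based index, code point label) for every illegal character."""
    return [
        (m.start(), f"U+{ord(m.group()):04X}")
        for m in ILLEGAL_XML_CHARS.finditer(text)
    ]


def strip_illegal_xml_chars(text):
    """Remove every character that XML 1.0 does not allow."""
    return ILLEGAL_XML_CHARS.sub("", text)


# Shortened sample: the artist name containing the U+001C control character.
sample = "<name>!B0A \x1c8E09;>2</name>"

positions = find_illegal_xml_chars(sample)
cleaned = strip_illegal_xml_chars(sample)
parsed_name = ET.fromstring(cleaned).text  # parses once the character is gone

print(positions)
print(parsed_name)
```

Note that Python indexes from 0 while R's gregexpr counts from 1, so positions reported by the two differ by one for the same string.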
Related
I'm trying to work with a JSON file in R, but unfortunately the JSON file is unreadable by jsonlite in its current state. It's missing commas between the objects(arrays elements?). My objective is to form a data frame from this almost-JSON file. Example JSON file, code, and result below.
[
{"Source":"ADSB","Id":43061,"FlightId":"N668XX","Latitude":44.000083,"Longitude":-96.654788,"Alt":4450}
{"Source":"ADSB","Id":43062,"FlightId":"N683XX","Latitude":44.000083,"Longitude":-96.654788,"Alt":4450}
{"Source":"ADSB","Id":43063,"FlightId":"N652XX","Latitude":44.000083,"Longitude":-96.654788,"Alt":4450}
]
> jsondata = fromJSON("asdf.json")
Error in parse_con(txt, bigint_as_char) :
parse error: after array element, I expect ',' or ']'
"Heading":280,"Speed":124} {"Source":"ADSB","Id":43062,"Fl
(right here) ------^
After inserting commas between the objects in the JSON file, it works with no problem.
[
{"Source":"ADSB","Id":43061,"FlightId":"N668XX","Latitude":44.000083,"Longitude":-96.654788,"Alt":4450},
{"Source":"ADSB","Id":43062,"FlightId":"N683XX","Latitude":44.000083,"Longitude":-96.654788,"Alt":4450},
{"Source":"ADSB","Id":43063,"FlightId":"N652XX","Latitude":44.000083,"Longitude":-96.654788,"Alt":4450}
]
> jsondata = fromJSON("asdf.json")
> names(jsondata)
[1] "Source" "Id" "FlightId" "Latitude" "Longitude" "Alt"
How do I insert commas throughout this JSON file between all of the curly braces (i.e. "}{" --> "},{")?
Or is there another way for R to read my incomplete JSON file?
I'm less than a novice, so any help is much appreciated, thanks!!
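One possible approach, sketched here in Python rather than R, is to skip the comma insertion entirely and decode the objects one after another. The function name `parse_concatenated_objects` and the shortened sample data are assumptions made for this illustration. It relies on the standard library's `json.JSONDecoder.raw_decode`, which parses one value and reports where it stopped.

```python
import json


def parse_concatenated_objects(text):
    """Parse JSON objects separated only by whitespace (or commas),
    optionally wrapped in a single pair of square brackets."""
    body = text.strip()
    if body.startswith("["):
        body = body[1:]
    if body.endswith("]"):
        body = body[:-1]

    decoder = json.JSONDecoder()
    records = []
    pos = 0
    while pos < len(body):
        # raw_decode does not skip leading whitespace, so do it here,
        # and tolerate commas in case some objects are already separated.
        while pos < len(body) and body[pos] in " \t\r\n,":
            pos += 1
        if pos >= len(body):
            break
        obj, pos = decoder.raw_decode(body, pos)
        records.append(obj)
    return records


# Shortened version of the file from the question: no commas between objects.
almost_json = """[
{"Source":"ADSB","Id":43061,"FlightId":"N668XX","Alt":4450}
{"Source":"ADSB","Id":43062,"FlightId":"N683XX","Alt":4450}
]"""

records = parse_concatenated_objects(almost_json)
print(len(records), records[1]["FlightId"])
```

The resulting list of dictionaries can then be turned into a table with whatever data frame tool is in use.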
I'm getting a parse error when I split a text value across multiple lines and display the JSON file on screen with the command "jq . words.json".
The JSON file with the text value on a single line looks like this
{
"words" : "one two three four five"
}
The command "jq . words.json" works fine and shows the JSON file on screen.
But when I split the value "one two three four five" across two lines and run the same command, I get a parse error
{
"words" : "one two
three four five"
^
}
parse error: Invalid string: control characters from U+0000 through
U+001F must be escaped at line 3, column 20
The parse error points to the " at the end of the third line.
How can I solve this?
Tia,
Anthony
That's because the JSON format is invalid. It should look like this:
{
"words" : "one two \nthree four five"
}
You have to escape end of line in JSON:
{
"words" : "one two\nthree four five"
}
To convert the text with the multi-line string to valid JSON, you could use any-json (https://www.npmjs.com/package/any-json), and pipe that into jq:
$ any-json --input-format=cson split-string.txt
{
"words": "one two three four five"
}
$ any-json --input-format=cson split-string.txt | jq length
1
For more on handling almost-JSON texts, see the jq FAQ: https://github.com/stedolan/jq/wiki/FAQ#processing-not-quite-valid-json
The parse error points to the " at the end of the third line.
The way jq flags this error may be counterintuitive, but the error in the JSON precedes the indicated quote-mark.
If the error is non-obvious, it may be that an end-quote is missing on the prior key or value. In this case, the value that matches the criteria U+0000 through U+001F could be U+000A, which is the line feed character in ASCII.
In the case of this question, the line feed was inserted intentionally. But, unescaped, this is invalid JSON.
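The same rule can be observed with any strict JSON parser. The following Python sketch (variable names are illustrative only) feeds the two variants of the document to the standard library's `json` module: the one with a raw line feed inside the string is rejected, the one with the escaped `\n` is accepted. Python's parser also has a `strict=False` switch that tolerates control characters inside strings; jq's behaviour is as described above and is not affected by it.

```python
import json

# A real line feed (U+000A) inside the string value: invalid JSON.
raw = '{\n  "words" : "one two\nthree four five"\n}'

# The same value with the line feed escaped as backslash + n: valid JSON.
escaped = '{\n  "words" : "one two\\nthree four five"\n}'

try:
    json.loads(raw)
    raw_ok = True
except json.JSONDecodeError:
    raw_ok = False

parsed = json.loads(escaped)

# Lenient mode accepts the unescaped control character.
lenient = json.loads(raw, strict=False)

print(raw_ok)
print(repr(parsed["words"]))
```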
In case it helps somebody, I had this error:
E: parse error: Invalid string: control characters from U+0000 through
U+001F must be escaped at line 3, column 5
jq was parsing the file containing this data, with missing " after "someKey
{
"someKey: {
"someData": "someValue"
}
}
Question
My question, explained below, is:
How can R be used to read a string that includes HTML emoji codes like &#55358;&#56599;?
I'd like to:
(1) represent the emoji symbol (e.g., as a Unicode symbol: 🤗) in the parsed string, OR
(2) convert it into its text equivalent (":hugging face:")
Background
I have an XML dataset of text messages (from the Android/iOS app Signal) that I am reading into R for a text mining project. The data look like this, with each text message represented in an sms node:
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!-- File Created By Signal -->
<smses count="1">
<sms protocol="0" address="+15555555555" contact_name="Jane Doe" date="1483256850399" readable_date="Sat, 31 Dec 2016 23:47:30 PST" type="1" subject="null" body="Hug emoji: &#55358;&#56599;" toa="null" sc_toa="null" service_center="null" read="1" status="-1" locked="0" />
</smses>
Problem
I am currently reading the data using the xml2 package for R. When I use the xml2::read_xml function, however, I get the following error message:
Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html, :
xmlParseCharRef: invalid xmlChar value 55358
Which, as I understand, indicates that the emoji character is not recognized as valid XML.
Using the xml2::read_html function does work, but drops the emoji character. A small example of this is here:
example_text <- "Hugging emoji: &#55358;&#56599;"
xml2::xml_text(xml2::read_html(paste0("<x>", example_text, "</x>")))
(Output: [1] "Hugging emoji: ")
This character is valid HTML -- Googling &#55358;&#56599; actually converts it in the search bar to the "hugging face" emoji, and brings up results relating to that emoji.
Other information I've found that seems relevant to this question
I've been searching Stack Overflow, and have not found any questions relating to this particular issue. I've also not been able to find a table that straightforwardly gives HTML codes next to the emoji they represent, and so am not able to do an (albeit inefficient) conversion of these HTML codes to their textual equivalents in a big loop before parsing the dataset; for example, neither this list nor its underlying dataset seem to include the string 55358.
tl;dr: the emoji aren't valid HTML entities; UTF-16 numbers have been used to build them instead of Unicode code points. I describe an algorithm at the bottom of the answer to convert them so that they are valid XML.
Identifying the Problem
R definitely handles emoji.
In fact, a few packages exist for handling emoji in R. For example, the emojifont and emo packages both let you retrieve emoji based on Slack-style keywords. It's just a question of getting your source characters through from the HTML-escaped format so that you can convert them.
xml2::read_xml seems to do fine with other HTML entities, like an ampersand or double quotes. I looked at this SO answer to see whether there were any XML-specific constraints on HTML entities, and it seemed like they were storing emoji fine. So I tried changing the emoji codes in your reprex to the ones in that answer:
body="Hug emoji: &#128512;&#128515;"
And, sure enough, they were preserved (though they're obviously not the hug emoji anymore):
> test8 = read_html('Desktop/test.xml')
> test8 %>% xml_child() %>% xml_child() %>% xml_child() %>% xml_attr('body')
[1] "Hug emoji: \U0001f600\U0001f603"
I looked up the hug emoji on this page, and the decimal HTML entity given there is not &#55358;&#56599;. It looks like the UTF-16 decimal codes for the emoji have been wrapped in &# and ;.
In conclusion, I think the answer is that your emoji are, in fact, not valid HTML entities. If you can't control the source, you might need to do some pre-processing to account for these errors.
So, why does the browser convert them properly? I'm wondering if the browser is a little more flexible with these things and is making some guesses about what those codes could be. I'm just speculating, though.
Converting UTF-16 to Unicode code points
After some more investigation, it looks like valid emoji HTML entities use the Unicode code point (in decimal, if it's &#...;, or hex, if it's &#x...;). The Unicode code point is different from the UTF-8 or UTF-16 code. (That link explains a lot about how emoji and other characters are variously encoded, BTW! Good read.)
So we need to convert the UTF-16 codes used in your source data to Unicode code points. Referring to this Wikipedia article on UTF-16, I've verified how it's done. Each code point above U+FFFF (our target), once 0x10000 has been subtracted from it, is a 20-bit number, or five hex digits. When going from Unicode to UTF-16, you split that number into two 10-bit halves (the middle hex digit gets cut in half, with two of its bits going to each block), add 0xd800 to the first half and 0xdc00 to the second, and get your result.
Going backwards, as you want to, it's done like this:
Your decimal UTF-16 number (which is in two separate blocks for now) is 55358 56599
Converting those blocks to hex (separately) gives 0x0d83e 0x0dd17
You subtract 0xd800 from the first block and 0xdc00 from the second to give 0x3e 0x117
Converting them to binary, padding them out to 10 bits and concatenating them, it's 0b0000 1111 1001 0001 0111
Then we convert that back to hex, which is 0x0f917
Finally, we add 0x10000, giving 0x1f917
Therefore, our (hex) HTML entity is &#x1f917;. Or, in decimal, &#129303;
So, to preprocess this dataset, you'll need to extract the existing numbers, use the algorithm above, then put the result back in (with one &#...;, not two).
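The steps above can be sketched in Python as a small preprocessing pass. This is an illustration under stated assumptions, not the answer's own code: the names `PAIR`, `combine_surrogates` and `fix_surrogate_entities` are hypothetical, and the bit shift replaces the "pad to 10 bits and concatenate" step with equivalent arithmetic.

```python
import html
import re

# Two consecutive decimal character references, e.g. "&#55358;&#56599;".
PAIR = re.compile(r"&#(\d+);&#(\d+);")


def combine_surrogates(high, low):
    """Turn a UTF-16 surrogate pair into the Unicode code point it encodes."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def fix_surrogate_entities(text):
    """Rewrite surrogate-pair references as a single hex reference.

    Limitation of this sketch: matches are consumed two references at a
    time, so a pair that directly follows an unrelated decimal reference
    can be missed.
    """
    def replace(match):
        high, low = int(match.group(1)), int(match.group(2))
        if 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF:
            return f"&#x{combine_surrogates(high, low):x};"
        return match.group(0)  # not a surrogate pair: leave untouched

    return PAIR.sub(replace, text)


fixed = fix_surrogate_entities("Hug emoji: &#55358;&#56599;")
decoded = html.unescape(fixed)

print(fixed)
print(decoded)
```

Running the pass over the raw file text before handing it to an XML parser should leave a document in which each emoji is a single valid character reference.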
Displaying emoji in R
As far as I'm aware, there's no solution to printing emoji in the R console: they always come out as "\U0001f600" (or what have you). However, the packages I described above can help you plot emoji in some circumstances (I'm hoping to expand ggflags to display arbitrary full-colour emoji at some point). They can also help you search for emoji to get their codes, but they can't get names given the codes AFAIK. But maybe you could try importing the emoji list from emojilib into R and doing a join with your data frame, if you've extracted the emoji codes into a column, to get the English names.
JavaScript Solution
I had this exact same problem, but needed the solution in JavaScript, not R. Using rensa's comment above (hugely helpful!), I created the following code to solve this issue, and I just wanted to share it in case anyone else happens across this thread as I did, but needed it in JavaScript.
str.replace(/(&#\d+;){2}/g, function(match) {
match = match.replace(/&#/g,'').split(';');
var binFirst = (parseInt('0x' + parseInt(match[0]).toString(16)) - 0xd800).toString(2);
var binSecond = (parseInt('0x' + parseInt(match[1]).toString(16)) - 0xdc00).toString(2);
binFirst = '0000000000'.substr(binFirst.length) + binFirst;
binSecond = '0000000000'.substr(binSecond.length) + binSecond;
return '&#x' + (('0x' + (parseInt(binFirst + binSecond, 2).toString(16))) - (-0x10000)).toString(16) + ';';
});
And, here's a full snippet of it working if you'd like to run it:
var str = '&#55357;&#56842;&#55357;&#56856;&#55357;&#56832;&#55357;&#56838;&#55357;&#56834;&#55357;&#56833;'
str = str.replace(/(&#\d+;){2}/g, function(match) {
match = match.replace(/&#/g,'').split(';');
var binFirst = (parseInt('0x' + parseInt(match[0]).toString(16)) - 0xd800).toString(2);
var binSecond = (parseInt('0x' + parseInt(match[1]).toString(16)) - 0xdc00).toString(2);
binFirst = '0000000000'.substr(binFirst.length) + binFirst;
binSecond = '0000000000'.substr(binSecond.length) + binSecond;
return '&#x' + (('0x' + (parseInt(binFirst + binSecond, 2).toString(16))) - (-0x10000)).toString(16) + ';';
});
document.getElementById('result').innerHTML = str;
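// &#55357;&#56842;&#55357;&#56856;&#55357;&#56832;&#55357;&#56838;&#55357;&#56834;&#55357;&#56833;
// is turned into
// &#x1f60a;&#x1f618;&#x1f600;&#x1f606;&#x1f602;&#x1f601;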
// which is rendered by the browser as the emojis
Original:<br>😊😘😀😆😂😁<br><br>
Result:<br>
<div id='result'></div>
My SMS XML Parser application is working great now, but it stalls out on large XML files so, I'm thinking about rewriting it in PHP. If/when I do, I'll post that code as well.
I've implemented the algorithm described by rensa above in R, and am sharing it here. I am happy to release the code snippet below under a CC0 dedication (i.e., putting this implementation into the public domain for free reuse).
This is a quick and unpolished implementation of rensa's algorithm, but it works!
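library(stringr)   # str_match_all, str_replace_all, str_trunc
library(magrittr)  # the %>% pipe used below
# The stringi and BMS packages must also be installed (called via ::).
utf16_double_dec_code_to_utf8 <- function(utf16_decimal_code){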
string_elements <- str_match_all(utf16_decimal_code, "&#(.*?);")[[1]][,2]
string3a <- string_elements[1]
string3b <- string_elements[2]
string4a <- sprintf("0x0%x", as.numeric(string3a))
string4b <- sprintf("0x0%x", as.numeric(string3b))
string5a <- paste0(
# "0x",
as.hexmode(string4a) - 0xd800
)
string5b <- paste0(
# "0x",
as.hexmode(string4b) - 0xdc00
)
string6 <- paste0(
stringi::stri_pad(
paste0(BMS::hex2bin(string5a), collapse = ""),
10,
pad = "0"
) %>%
stringr::str_trunc(10, side = "left", ellipsis = ""),
stringi::stri_pad(
paste0(BMS::hex2bin(string5b), collapse = ""),
10,
pad = "0"
) %>%
stringr::str_trunc(10, side = "left", ellipsis = "")
)
string7 <- BMS::bin2hex(as.numeric(strsplit(string6, split = "")[[1]]))
string8 <- as.hexmode(string7) + 0x10000
unicode_pattern <- string8
unicode_pattern
}
make_unicode_entity <- function(x) {
paste0("\\U000", utf16_double_dec_code_to_utf8(x))
}
make_html_entity <- function(x) {
paste0("&#x", utf16_double_dec_code_to_utf8(x), ";")
}
# An example string, using the "hug" emoji:
example_string <- "test &#55358;&#56599; test"
output_string <- stringr::str_replace_all(
example_string,
"(&#[0-9]*?;){2}", # Find all two-character "&#...;&#...;" codes.
make_unicode_entity
# make_html_entity
)
cat(output_string)
# To print the Unicode string (it doesn't display in the R console, but can be
# copied and pasted elsewhere).
# This assumes you've used 'make_unicode_entity' above in the str_replace_all
# call:
stringi::stri_unescape_unicode(output_string)
Translated Chad's JavaScript answer to Go since I too had the same issue, but needed a solution in Go.
https://play.golang.org/p/h9JBFzqcd90
package main
import (
"fmt"
"html"
"regexp"
"strconv"
"strings"
)
func main() {
emoji := "&#55357;&#56842;&#55357;&#56856;&#55357;&#56832;&#55357;&#56838;&#55357;&#56834;&#55357;&#56833;"
regexp := regexp.MustCompile(`(&#\d+;){2}`)
matches := regexp.FindAllString(emoji, -1)
var builder strings.Builder
for _, match := range matches {
s := strings.Replace(match, "&#", "", -1)
parts := strings.Split(s, ";")
a := parts[0]
b := parts[1]
c, err := strconv.Atoi(a)
if err != nil {
panic(err)
}
d, err := strconv.Atoi(b)
if err != nil {
panic(err)
}
c = c - 0xd800
d = d - 0xdc00
// Zero-pad each half to exactly 10 binary digits. The fixed slice bounds
// used previously only worked for inputs of one particular length.
g := fmt.Sprintf("%010b", c)
h := fmt.Sprintf("%010b", d)
j, err := strconv.ParseInt(g + h, 2, 64)
if err != nil {
panic(err)
}
k := j + 0x10000
_, err = builder.WriteString("&#x" + strconv.FormatInt(k, 16) + ";")
if err != nil {
panic(err)
}
}
fmt.Println(html.UnescapeString(emoji))
emoji = html.UnescapeString(builder.String())
fmt.Println(emoji)
}
I've a json file with plone objects and there is one field of the objects giving me an error:
UnicodeDecodeError('ascii', '{"id":"aluminio-prata", "nome":"ALUM\xc3\x8dNIO PRATA", "num_demaos":0, "rendimento": 0.0, "unidade":"litros", "url":"", "particular":[], "profissional":[], "unidades":[]},', 36, 37, 'ordinal not in range(128)') (Also, the following error occurred while attempting to render the standard error message, please see the event log for full details: 'NoneType' object has no attribute 'getMethodAliases')
I already know which field it is: the "title" from title = obj.pretty_title_or_id(). When I remove it from here, it's OK:
json += '{"id":"' + str(id) + '", "nome":"' + title + '", "num_demaos":' + str(num_demaos) + ', "rendimento": ' + str(rendimento) + ', "unidade":"' + str(unidade) + '", "url":"' + link_produto + '", "particular":' + arr_area_particular + ', "profissional":' + arr_area_profissional + ', "unidades":' + json_qtd + '},'
but when I leave it I've got this error.
UnicodeDecodeError('ascii', '{"id":"aluminio-prata", "nome":"ALUM\xc3\x8dNIO PRATA", "num_demaos":0, "rendimento": 0.0, "unidade":"litros", "url":"", "particular":[], "profissional":[], "unidades":[]},', 36, 37, 'ordinal not in range(128)') (Also, the following error occurred while attempting to render the standard error message, please see the event log for full details: 'NoneType' object has no attribute 'getMethodAliases')
I'm going to assume that the error occurs when you're reading the JSON file.
Internally, Plone uses Python Unicode strings for nearly everything. If you read a string from a file, it will need to be decoded into Unicode before Plone can use it. If you give no instructions otherwise, Python will assume that the string was encoded as ASCII, and will attempt its Unicode conversion on that basis. It would be similar to writing:
unicode("ALUM\xc3\x8dNIO PRATA")
which will produce the same kind of error.
In fact, the string you're using was evidently encoded with the UTF-8 character set. That's evident from the "\xc3", and it also makes sense, because that's the character set Plone uses when it sends data to the outside world.
So, how do you fix this? You have to specify the character set that you wish to use when you convert to Unicode:
"ALUM\xc3\x8dNIO PRATA".decode('UTF8')
This gives you a Python Unicode string with no error.
So, after you've read your JSON file into a string (let's call it mystring), you will need to explicitly decode it by using mystring.decode('UTF8'). unicode(mystring, 'UTF8') is another form of the same operation.
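As a minimal sketch of that decoding step: the answer above is written for Python 2, where `str` holds bytes, while the example below assumes Python 3, where the same data is a `bytes` object and the implicit ASCII conversion has to be spelled out to reproduce the failure. The variable names are illustrative.

```python
# The title bytes from the error message, as a Python 3 bytes literal.
raw = b"ALUM\xc3\x8dNIO PRATA"

# Decoding as ASCII fails on the first byte outside the 0-127 range.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError as exc:
    ascii_ok = False
    failing_index = exc.start  # offset of the offending byte within raw

# Naming the real encoding gives the expected text.
title = raw.decode("utf-8")

print(ascii_ok, failing_index)
print(title)
```

In the traceback from the question the reported offset is 36 rather than 4 because the title sits inside a longer JSON string.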
As Steve already wrote, do title.decode('utf8').
An example to illustrate the facts:
>>> u"Ä" == u"\xc4"
True # the native unicode char and escaped versions are the same
>>> "Ä" == u"\xc4"
False # the plain (byte) string is '\xc3\x84', i.e. UTF-8 encoded, not latin1
>>> "Ä".decode('utf8') == u"\xc4"
True # one can decode the string to get unicode
>>> "Ä" == "\xc4"
False # the native character and the escaped string are
# of course not equal ('\xc3\x84' != '\xc4').
I find this thread very helpful for understanding problems with encoding and decoding UTF-8.
This is what I need to decode
\xc3\x99\xc3\x99\xc3\xa9\xc2\x87-B[x\xc2\x99\xc2\xbe\xc3\xa6\x14Ez\xc2\xab
it is generated by String.fromCharCode(arrayPw[i]);
but I don't understand how to decode it :(
Please help.
Python:
data = "\xc3\x99\xc3\x99\xc3\xa9\xc2\x87-B[x\xc2\x99\xc2\xbe\xc3\xa6\x14Ez\xc2\xab"
udata = data.decode("utf-8")
asciidata = udata.encode("ascii","ignore")
JavaScript:
function decode_utf8(s) {
return decodeURIComponent(escape(s));
}
Otherwise do more research about decoding UTF-8.
https://gist.github.com/chrisveness/bcb00eb717e6382c5608
There's also an online UTF-8 decoder/encoder:
https://mothereff.in/utf-8
HINT: ÙÙé-B[x¾æEz«
duplicate of this : https://stackoverflow.com/a/70815136/5902698
You load a dataset and you have some strange characters.
Example:
'戴森美å�‘é€\xa0型器完整版套装Dyson Airwrap
HS01(铜金色礼盒版)'
In my case, I know that the strange characters are Chinese. So I can infer that whoever sent me the data encoded it as UTF-8, but it was then read as 'ISO-8859-1'.
So, as a first step, I encode the string back with ISO-8859-1, and then I decode it with UTF-8.
My lines are:
_encoding = 'ISO-8859-1'
_my_str.encode(_encoding, 'ignore').decode("utf-8", 'ignore')
Then my output is :
"'森Dyson Airwrap HS01礼'"
This works for me, but I don't fully understand what happens under the hood, so feel free to tell me if you have further information.
Bonus: I'll try to detect when the str is in the first strange format, because some of my entries are in Chinese but others are in English.
EDIT: The bonus is unnecessary. I just use a lambda on my column to encode and decode regardless of format. So I changed the encoding after loading the dataframe:
_encoding = 'ISO-8859-1'
_decoding = "utf-8"
df[col] = df[col].apply(lambda x : x.encode(_encoding, 'ignore').decode(_decoding , 'ignore'))
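To show what the encode/decode round trip does under the hood, here is a small sketch with made-up sample text. The helper name `fix_mojibake` is hypothetical. It differs from the lines above in one deliberate way: instead of the 'ignore' error handler, which silently drops anything that cannot be converted (a likely reason the output shown earlier lost most of its Chinese characters), it leaves a string untouched when the round trip fails.

```python
def fix_mojibake(text, wrong="iso-8859-1", right="utf-8"):
    """Undo text that was encoded as `right` but decoded as `wrong`.

    If the round trip is not possible, the input is returned unchanged.
    """
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


original = "戴森 Dyson"

# Simulate the damage: UTF-8 bytes read as ISO-8859-1.
garbled = original.encode("utf-8").decode("iso-8859-1")

repaired = fix_mojibake(garbled)
plain_english = fix_mojibake("Dyson Airwrap")  # ASCII survives the round trip
already_fine = fix_mojibake(original)          # cannot be encoded as ISO-8859-1

print(repaired == original)
```

With pandas, such a helper could be passed to `apply` on the column in place of the lambda.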