I have data in a text file. An example of the file looks like this:
"vLatitude ='23.8145833';
vLongitude ='90.4043056';
vcontents ='LRP: LRPS</br>Start of Road From the End of Banani Rail Crossing Over Pass</br>Division:Gazipur</br>Sub-Division:Tongi';
vLocations = new Array(vcontents, vLatitude, vLongitude);
locations.push(vLocations);"
Can I convert it to something like this in R?
e.g.
latitude longitude contents
23.8145833 90.4043056 LRP: LRPS Start...Tongi
Solution 1
That looks a lot like JavaScript code. Execute the JavaScript (using a web browser) and save the result as JSON, then read the file into R with jsonlite.
With your example, create this file and save it as my_page.html:
<html>
<header>
<script>
// Initialize locations to be able to push more values in it
// probably not required with your full code
var locations = [];
vLatitude ='23.8145833';
vLongitude ='90.4043056';
vcontents ='LRP: LRPS</br>Start of Road From the End of Banani Rail Crossing Over Pass</br>Division:Gazipur</br>Sub-Division:Tongi';
vLocations = new Array(vcontents, vLatitude, vLongitude);
locations.push(vLocations);
// convert locations to json
var jsonData = JSON.stringify(locations);
// actually write the json to file
function download(content, fileName, contentType) {
var a = document.createElement("a");
var file = new Blob([content], {type: contentType});
a.href = URL.createObjectURL(file);
a.download = fileName;
a.click();
}
download(jsonData, 'export_json.txt', 'text/plain');
</script>
</header>
<body>
Download should start automatically. You can look at the web console for errors.
</body>
</html>
When you open it with your web browser it should "download" a file, that you can open with R:
jsonlite::read_json("export_json.txt",simplifyVector = TRUE)
One problem is that the JavaScript code creates an array without names, so the names are not exported. I don't see how you could make JavaScript export them without changing the script.
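If the script can be edited, one possible way around the missing names is to reshape each row into an object with named keys before calling JSON.stringify. The sketch below is only an illustration: the helper name toNamedRecords and the key names are made up here, the field order is assumed from the question's new Array(vcontents, vLatitude, vLongitude), and the contents string is shortened.

```javascript
// Sample data in the same shape as the question's `locations` array.
// The contents value is shortened for brevity.
var locations = [
  ['LRP: LRPS</br>Division:Gazipur</br>Sub-Division:Tongi', '23.8145833', '90.4043056']
];

// Hypothetical helper: turn each [contents, latitude, longitude] row
// into an object so the field names survive the JSON export.
function toNamedRecords(rows) {
  return rows.map(function (row) {
    return { contents: row[0], latitude: row[1], longitude: row[2] };
  });
}

var jsonData = JSON.stringify(toNamedRecords(locations));
console.log(jsonData);
```

With named keys in the JSON, jsonlite::read_json(..., simplifyVector = TRUE) should be able to return a data frame with contents, latitude and longitude columns rather than an unnamed matrix.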
Solution 2
Instead of relying on a browser to execute the JavaScript code, you could run it directly in R with a JavaScript engine (the V8 package provides one). It should give you the same result, and it makes communication between the two easier.
Solution 3
If the file really looks like that all along, you might be able to remove the JavaScript lines that organize the arrays and keep only the lines that define variables. Because the symbols = and ; are technically valid in R, it's not too hard to rewrite the JavaScript into R code. Note that this solution could be very fragile, depending on what else is in your JavaScript code!
library(stringr) # str_split(), str_trim(), str_detect(), str_remove(); also provides %>%
js_script <- "var locations = [];
vLatitude ='23.8145833';
vLongitude ='90.4043056';
vcontents ='LRP: LRPS</br>Start of Road From the End of Banani Rail Crossing Over Pass</br>Division:Gazipur</br>Sub-Division:Tongi';
vLocations = new Array(vcontents, vLatitude, vLongitude);
locations.push(vLocations);
// convert locations to json
var jsonData = JSON.stringify(locations);" %>%
str_split(pattern = "\n", simplify=TRUE) %>%
as.character() %>%
str_trim()
# Find the lines that look like defining variables
js_script <- js_script[str_detect(js_script, pattern = "^\\w+ ?= ?'.*' ?;$")]
# make it into an R expression
r_code <- str_remove(js_script, ";$") %>%
paste(collapse = ",")
r_code <- paste0("c(", r_code, ")")
# Execute
eval(str2expression(r_code))
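For comparison, the same line filter can be sketched in JavaScript, in case the text is easier to pre-process there before handing it to R. This is only an illustration under the same assumptions as Solution 3 (every value is a single-quoted string on one line); the function name extractAssignments is hypothetical and the contents string is shortened.

```javascript
// Sample input in the same shape as the question (contents shortened).
var jsText = [
  "var locations = [];",
  "vLatitude ='23.8145833';",
  "vLongitude ='90.4043056';",
  "vcontents ='LRP: LRPS</br>Division:Gazipur';",
  "vLocations = new Array(vcontents, vLatitude, vLongitude);",
  "locations.push(vLocations);"
].join("\n");

// Keep only lines of the form  name = 'value';  and collect them by name.
// The pattern is the one used in the R code, with capture groups added.
function extractAssignments(text) {
  var record = {};
  text.split("\n").forEach(function (line) {
    var m = line.trim().match(/^(\w+) ?= ?'(.*)' ?;$/);
    if (m) {
      record[m[1]] = m[2];
    }
  });
  return record;
}

var record = extractAssignments(jsText);
console.log(record.vLatitude, record.vLongitude, record.vcontents);
```

Like the R version, this breaks as soon as a value spans several lines or is not a quoted string.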
Question
My question, explained below, is:
How can R be used to read a string that includes HTML emoji codes like &#55358;&#56599;?
I'd like to:
(1) represent the emoji symbol (e.g., as a Unicode symbol: 🤗) in the parsed string, OR
(2) convert it into its text equivalent (":hugging face:")
Background
I have an XML dataset of text messages (from the Android/iOS app Signal) that I am reading into R for a text mining project. The data look like this, with each text message represented in an sms node:
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!-- File Created By Signal -->
<smses count="1">
<sms protocol="0" address="+15555555555" contact_name="Jane Doe" date="1483256850399" readable_date="Sat, 31 Dec 2016 23:47:30 PST" type="1" subject="null" body="Hug emoji: &#55358;&#56599;" toa="null" sc_toa="null" service_center="null" read="1" status="-1" locked="0" />
</smses>
Problem
I am currently reading the data using the xml2 package for R. When I use the xml2::read_xml function, however, I get the following error message:
Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html, :
xmlParseCharRef: invalid xmlChar value 55358
Which, as I understand, indicates that the emoji character is not recognized as valid XML.
Using the xml2::read_html function does work, but drops the emoji character. A small example of this is here:
example_text <- "Hugging emoji: &#55358;&#56599;"
xml2::xml_text(xml2::read_html(paste0("<x>", example_text, "</x>")))
(Output: [1] "Hugging emoji: ")
This character is valid HTML -- Googling &#55358;&#56599; actually converts it in the search bar to the "hugging face" emoji, and brings up results relating to that emoji.
Other information I've found that seems relevant to this question
I've been searching Stack Overflow, and have not found any questions relating to this particular issue. I've also not been able to find a table that straightforwardly gives HTML codes next to the emoji they represent, and so am not able to do an (albeit inefficient) conversion of these HTML codes to their textual equivalents in a big loop before parsing the dataset; for example, neither this list nor its underlying dataset seem to include the string 55358.
tl;dr: the emoji aren't valid HTML entities; UTF-16 numbers have been used to build them instead of Unicode code points. I describe an algorithm at the bottom of the answer to convert them so that they are valid XML.
Identifying the Problem
R definitely handles emoji.
In fact, a few packages exist for handling emoji in R. For example, the emojifont and emo packages both let you retrieve emoji based on Slack-style keywords. It's just a question of getting your source characters through from the HTML-escaped format so that you can convert them.
xml2::read_xml seems to do fine with other HTML entities, like an ampersand or double quotes. I looked at this SO answer to see whether there were any XML-specific constraints on HTML entities, and it seemed like they were storing emoji fine. So I tried changing the emoji codes in your reprex to the ones in that answer:
body="Hug emoji: &#128512;&#128515;"
And, sure enough, they were preserved (though they're obviously not the hug emoji anymore):
> test8 = read_html('Desktop/test.xml')
> test8 %>% xml_child() %>% xml_child() %>% xml_child() %>% xml_attr('body')
[1] "Hug emoji: \U0001f600\U0001f603"
I looked up the hug emoji on this page, and the decimal HTML entity given there is not &#55358;&#56599;. It looks like the UTF-16 decimal codes for the emoji have been wrapped in &# and ;.
In conclusion, I think the answer is that your emoji are, in fact, not valid HTML entities. If you can't control the source, you might need to do some pre-processing to account for these errors.
So, why does the browser convert them properly? I'm wondering if the browser is a little more flexible with these things and is making some guesses about what those codes could be. I'm just speculating, though.
Converting UTF-16 to Unicode code points
After some more investigation, it looks like valid emoji HTML entities use the Unicode code point (in decimal, if it's &#...;, or hex, if it's &#x...;). The Unicode code point is different from the UTF-8 or UTF-16 code. (That link explains a lot about how emoji and other characters are variously encoded, BTW! Good read.)
So we need to convert the UTF-16 codes used in your source data to Unicode code points. Referring to this Wikipedia article on UTF-16, I've verified how it's done. For characters outside the Basic Multilingual Plane (which includes these emoji), the code point (our target) minus 0x10000 is a 20-bit number, or five hex digits. When going from Unicode to UTF-16, you split that number into two 10-bit blocks (the middle hex digit gets cut in half, with two of its bits going to each block), then add 0xd800 to the first block and 0xdc00 to the second to get your result.
Going backwards, as you want to, it's done like this:
Your decimal UTF-16 number (which is in two separate blocks for now) is 55358 56599
Converting those blocks to hex (separately) gives 0x0d83e 0x0dd17
You subtract 0xd800 from the first block and 0xdc00 from the second to give 0x3e 0x117
Converting them to binary, padding them out to 10 bits and concatenating them, it's 0b0000 1111 1001 0001 0111
Then we convert that back to hex, which is 0x0f917
Finally, we add 0x10000, giving 0x1f917
Therefore, our (hex) HTML entity is &#x1f917;. Or, in decimal, &#129303;.
So, to preprocess this dataset, you'll need to extract the existing numbers, use the algorithm above, then put the result back in (with one &#...;, not two).
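The steps above collapse into a shift and two additions, since padding both blocks to 10 bits and concatenating them is the same as shifting the first block left by 10 bits and adding the second. Here is a minimal sketch of that arithmetic; the function names are hypothetical, and the pattern assumes the pairs always appear as two adjacent five-digit decimal entities, as in the sample data.

```javascript
// Combine a UTF-16 surrogate pair (two decimal code units) into one code point.
function surrogatePairToCodePoint(high, low) {
  if (high < 0xd800 || high > 0xdbff || low < 0xdc00 || low > 0xdfff) {
    throw new RangeError('not a surrogate pair: ' + high + ', ' + low);
  }
  // (high - 0xd800) supplies the top 10 bits, (low - 0xdc00) the bottom 10.
  return ((high - 0xd800) << 10) + (low - 0xdc00) + 0x10000;
}

// Rewrite every "&#high;&#low;" pair in a string as a single hex entity.
// The pattern is a rough prefilter (high starts with 55 or 56, low with
// 56 or 57); the exact range check below decides whether to convert.
function fixSurrogateEntities(text) {
  return text.replace(/&#(5[56]\d{3});&#(5[67]\d{3});/g, function (match, hi, lo) {
    var high = parseInt(hi, 10);
    var low = parseInt(lo, 10);
    if (high < 0xd800 || high > 0xdbff || low < 0xdc00 || low > 0xdfff) {
      return match; // not a real surrogate pair, leave it untouched
    }
    return '&#x' + surrogatePairToCodePoint(high, low).toString(16) + ';';
  });
}

console.log(surrogatePairToCodePoint(55358, 56599).toString(16)); // 1f917
console.log(fixSurrogateEntities('Hug emoji: &#55358;&#56599;')); // Hug emoji: &#x1f917;
```

Entities that are already valid code points, such as &#128512;, do not match the pattern and pass through unchanged.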
Displaying emoji in R
As far as I'm aware, there's no solution to printing emoji in the R console: they always come out as "\U0001f600" (or what have you). However, the packages I described above can help you plot emoji in some circumstances (I'm hoping to expand ggflags to display arbitrary full-colour emoji at some point). They can also help you search for emoji to get their codes, but they can't get names given the codes AFAIK. But maybe you could try importing the emoji list from emojilib into R and doing a join with your data frame, if you've extracted the emoji codes into a column, to get the English names.
JavaScript Solution
I had this exact same problem, but needed the solution in JavaScript, not R. Using rensa's comment above (hugely helpful!), I created the following code to solve this issue, and I just wanted to share it in case anyone else happens across this thread as I did, but needed it in JavaScript.
str.replace(/(&#\d+;){2}/g, function(match) {
match = match.replace(/&#/g,'').split(';');
var binFirst = (parseInt('0x' + parseInt(match[0]).toString(16)) - 0xd800).toString(2);
var binSecond = (parseInt('0x' + parseInt(match[1]).toString(16)) - 0xdc00).toString(2);
binFirst = '0000000000'.substr(binFirst.length) + binFirst;
binSecond = '0000000000'.substr(binSecond.length) + binSecond;
return '&#x' + (('0x' + (parseInt(binFirst + binSecond, 2).toString(16))) - (-0x10000)).toString(16) + ';';
});
And, here's a full snippet of it working if you'd like to run it:
var str = '&#55357;&#56842;&#55357;&#56856;&#55357;&#56832;&#55357;&#56838;&#55357;&#56834;&#55357;&#56833;'
str = str.replace(/(&#\d+;){2}/g, function(match) {
match = match.replace(/&#/g,'').split(';');
var binFirst = (parseInt('0x' + parseInt(match[0]).toString(16)) - 0xd800).toString(2);
var binSecond = (parseInt('0x' + parseInt(match[1]).toString(16)) - 0xdc00).toString(2);
binFirst = '0000000000'.substr(binFirst.length) + binFirst;
binSecond = '0000000000'.substr(binSecond.length) + binSecond;
return '&#x' + (('0x' + (parseInt(binFirst + binSecond, 2).toString(16))) - (-0x10000)).toString(16) + ';';
});
document.getElementById('result').innerHTML = str;
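// &#55357;&#56842;&#55357;&#56856;&#55357;&#56832;&#55357;&#56838;&#55357;&#56834;&#55357;&#56833;
// is turned into
// &#x1f60a;&#x1f618;&#x1f600;&#x1f606;&#x1f602;&#x1f601;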
// which is rendered by the browser as the emojis
Original:<br>😊😘😀😆😂😁<br><br>
Result:<br>
<div id='result'></div>
My SMS XML Parser application is working great now, but it stalls out on large XML files, so I'm thinking about rewriting it in PHP. If/when I do, I'll post that code as well.
I've implemented the algorithm described by rensa above in R, and am sharing it here. I am happy to release the code snippet below under a CC0 dedication (i.e., putting this implementation into the public domain for free reuse).
This is a quick and unpolished implementation of rensa's algorithm, but it works!
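# Requires the stringr, stringi and BMS packages.
library(stringr) # str_match_all(), str_replace_all(); also provides %>%
utf16_double_dec_code_to_utf8 <- function(utf16_decimal_code){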
string_elements <- str_match_all(utf16_decimal_code, "&#(.*?);")[[1]][,2]
string3a <- string_elements[1]
string3b <- string_elements[2]
string4a <- sprintf("0x0%x", as.numeric(string3a))
string4b <- sprintf("0x0%x", as.numeric(string3b))
string5a <- paste0(
# "0x",
as.hexmode(string4a) - 0xd800
)
string5b <- paste0(
# "0x",
as.hexmode(string4b) - 0xdc00
)
string6 <- paste0(
stringi::stri_pad(
paste0(BMS::hex2bin(string5a), collapse = ""),
10,
pad = "0"
) %>%
stringr::str_trunc(10, side = "left", ellipsis = ""),
stringi::stri_pad(
paste0(BMS::hex2bin(string5b), collapse = ""),
10,
pad = "0"
) %>%
stringr::str_trunc(10, side = "left", ellipsis = "")
)
string7 <- BMS::bin2hex(as.numeric(strsplit(string6, split = "")[[1]]))
string8 <- as.hexmode(string7) + 0x10000
unicode_pattern <- string8
unicode_pattern
}
make_unicode_entity <- function(x) {
paste0("\\U000", utf16_double_dec_code_to_utf8(x))
}
make_html_entity <- function(x) {
paste0("&#x", utf16_double_dec_code_to_utf8(x), ";")
}
# An example string, using the "hug" emoji:
example_string <- "test &#55358;&#56599; test"
output_string <- stringr::str_replace_all(
example_string,
"(&#[0-9]*?;){2}", # Find all two-character "&#...;&#...;" codes.
make_unicode_entity
# make_html_entity
)
cat(output_string)
# To print the Unicode string (doesn't display in the R console, but can be
# copied and pasted elsewhere):
# (This assumes you've used 'make_unicode_entity' above in the str_replace_all
# call):
stringi::stri_unescape_unicode(output_string)
I translated Chad's JavaScript answer to Go, since I had the same issue but needed a solution in Go.
https://play.golang.org/p/h9JBFzqcd90
package main
import (
"fmt"
"html"
"regexp"
"strconv"
"strings"
)
func main() {
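emoji := "&#55357;&#56842;&#55357;&#56856;&#55357;&#56832;&#55357;&#56838;&#55357;&#56834;&#55357;&#56833;"
// Named re so that it does not shadow the regexp package.
re := regexp.MustCompile(`(&#\d+;){2}`)
matches := re.FindAllString(emoji, -1)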
var builder strings.Builder
for _, match := range matches {
s := strings.Replace(match, "&#", "", -1)
parts := strings.Split(s, ";")
a := parts[0]
b := parts[1]
c, err := strconv.Atoi(a)
if err != nil {
panic(err)
}
d, err := strconv.Atoi(b)
if err != nil {
panic(err)
}
c = c - 0xd800
d = d - 0xdc00
e := strconv.FormatInt(int64(c), 2)
f := strconv.FormatInt(int64(d), 2)
// Left-pad both binary strings to 10 digits (c and d are below 1024 for valid pairs).
g := "0000000000"[len(e):] + e
h := "0000000000"[len(f):] + f
j, err := strconv.ParseInt(g + h, 2, 64)
if err != nil {
panic(err)
}
k := j + 0x10000
_, err = builder.WriteString("&#x" + strconv.FormatInt(k, 16) + ";")
if err != nil {
panic(err)
}
}
fmt.Println(html.UnescapeString(emoji))
emoji = html.UnescapeString(builder.String())
fmt.Println(emoji)
}
When I save my data files, one of the parameters is a float, and I want to keep it formatted as a float in the filename. I don't have rounding errors, because I define the values of the parameter using
parameters = zeros(Float64, 1000)
iijj = 4.8999
for jjj in 1:1000
iijj += 1/10000
iijj = round(iijj, digits=4) # keyword form required in Julia 1.x
parameters[jjj] = iijj
end
and thus every parameters[i] is a float with just four decimal places.
My issue comes when writing the files. I am using
printfile = open("outfile_param$(param).dat" ,"w")
where param=parameters[i]. If I have for example 4.89, I would like to have the name outfile_param4.8900.dat, instead of outfile_param4.89.dat.
I know there are several ways to write in an outputfile, but I would like to keep the format that I have because if not it would be a pain to correct the programs that I work with.
You can use @sprintf (from the Printf standard library) to have more precise control over the formatting:
julia> using Printf
julia> @sprintf("outfile_param%.4f.dat", 4.89)
"outfile_param4.8900.dat"
I am using the jpeg package to read an image into R. This creates an object of class nativeRaster. I take a portion of this image using the [ operator. The resulting object is a matrix of integers. Attempting to save this object returns the error "image must be a matrix or array of raw or real numbers". What should I do to be able to save this new image?
Below is a snippet to reproduce the error
imageFile = 'address of the jpg file'
outputFile = 'new file to write into'
image = jpeg::readJPEG(imageFile, native=TRUE)
output = image[1:10,1:10]
writeJPEG(image = output, target = outputFile)
I think the function writeJPEG takes an image of class nativeRaster. I am not entirely sure about this, but setting the class of output to nativeRaster works for me.
class(output) <- "nativeRaster"
writeJPEG(image = output, target = outputFile)
Even though I was unable to find a solution, I am working around this by using solely imagemagick to crop the image.
system(paste('identify',
imageFile), intern = TRUE) %>%
regmatches(.,regexpr("(?<=[ ])[0-9]*?x[0-9]*?(?=[ ])",.,perl=T)) %>%
strsplit('x') %>% .[[1]] %>% as.double
will return the dimensions of the image and
system(paste0('convert "',imageFile,
'" -crop ',
sizeX,
'x',
sizeY,
'+',
beginningX,
'+',
beginningY,
' "',
outputFile,'"'))
will crop the image and save it to outputFile
Here sizeX and sizeY are desired dimensions of the cropped image and beginningX and beginningY designate the top left corner of the crop site on the image.
I am writing a Matlab program that loads a data file created in another C++ program.
planet = input('What is the name of your planet? ', 's')
data_file = strcat(planet, '.dat')
load(data_file);
data_file;
x = data_file(:,1);
y = data_file(:,2);
plot (x,y,'r*')
The program takes the name of the planet as the user input, then concatenates ".dat" to the end of the planet name. This gives, for example, "earth.dat," which is the name of the file created by the other C++ program.
I have made sure that the data file being loaded is in the correct folder; however, MATLAB still gives an error when I run the program.
What is the correct command for loading this file?
Thank you!
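In your code, load(data_file) is called without an output argument, so data_file is still the filename string, and data_file(:,1) indexes the characters of that name rather than the data. Assign the result of load to a variable instead: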
planet = input('What is the name of your planet? ', 's')
filename=[num2str(planet) '.dat'];
data_file=load(filename);
x = data_file(:,1);
y = data_file(:,2);
plot (x,y,'r*')