Problem with encoding of character strings when loading json files to RStudio under Windows 10

I am trying to extract Tweets from json files and save them as RData under Windows 10, using RStudio version 1.2.5033 and streamR. However, Windows (and subsequently RStudio and streamR) assumes that the input is Windows-1252 although it is UTF-8, which leads to serious encoding issues.
To replicate the problem, please use this json file with two fake Tweets since I could not replicate the structure of the original json files within R. But this structure leads to issues with the only solution I found for the encoding issue (see below).
The code I used is the following:
df <- streamR::parseTweets("test.json")
The output I get with df$text is: '[1] "RT #bkabka:EikÃ¶ tÃ¤mÃ¤" "RT #bkabka:EspaÃ±a"'.
The output should be: '[1] "RT #bkabka:Eikö tämä" "RT #bkabka:España"'.
My question is therefore: (How) can I tell R and streamR to interpret the text as being encoded using UTF-8 rather than Windows-1252?
Since all this happens because the function wrongly assumes that the text is encoded with Windows-1252, one solution would be to go through the whole corpus and replace each of these wrongly interpreted special characters with the correct one, for example using the table I found here. In my case, however, the corpus is very large, which makes this impractical in the long run. Additionally, I would have no way to check whether all special characters were actually replaced correctly.
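To illustrate the mechanism, here is a minimal base R sketch (not part of the original workflow). The garbling arises when UTF-8 bytes are decoded as Windows-1252, and as long as no bytes were dropped along the way, the step can be reversed with iconv. The encoding name "CP1252" is assumed to be known to the local iconv implementation.

```r
# The intended text, written with escapes so the script itself is encoding-safe.
original <- "Eik\u00f6 t\u00e4m\u00e4"

# Simulate the bug: take the raw UTF-8 bytes and decode them as Windows-1252.
utf8_bytes <- rawToChar(charToRaw(original))   # same bytes, no encoding mark
garbled <- iconv(utf8_bytes, from = "CP1252", to = "UTF-8")

# Reverse it: encode back to Windows-1252 bytes, then declare them as UTF-8.
repaired <- iconv(garbled, from = "UTF-8", to = "CP1252")
Encoding(repaired) <- "UTF-8"

print(garbled)    # same kind of garbling as in the question
print(identical(charToRaw(repaired), charToRaw(original)))
```

This round trip only works if every garbled character survived; once a package has deleted characters such as "¶", the original bytes cannot be recovered this way.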
Some additional information:
Using rjson and the following code somehow makes R interpret the encoding correctly, but it has trouble with the structure of the json files, since it only extracts the first line:
lt <- rjson::fromJSON(file="test.json")
I guess it cannot extract the subsequent line because it does not recognise the line break which is an actual line break and not \n or any other character combination. Unfortunately, I do not have the possibility to change the json files.
The json files were created by another person under macOS using streamR - if I am not mistaken.
The same problem appears using simple R instead of RStudio. The problem does not appear on macOS.
The problem is even more serious when using tweet2r, the only other package I am aware of that can extract Tweets from json files in R. tweet2r deletes specific special characters such as "¶", so the wrongly interpreted special characters can no longer be replaced with the correct ones.

Thanks to MrFlick (see his comment), here is a solution using jsonlite that results in a very similar data frame structure and reads the encoding correctly:
df <- jsonlite::stream_in(file("~/../Downloads/test.json"))
Some further information for those who are used to the luxury of streamR handling Tweets and might encounter a similar issue in the future. There are two main differences between the data frames created by parseTweets and stream_in:
parseTweets does not extract data for broken Tweets; stream_in does. Hence, the data frame has more rows when using stream_in but contains the same Tweets.
stream_in creates fewer variables, since some of the columns in the data frame are themselves data frames. This might lead to issues if the data frame created with stream_in is used without further transformation. parseTweets does that transformation for you.
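A minimal sketch of dealing with both differences, assuming the jsonlite package is installed and the input is newline-delimited JSON. The two fake records and their field names are made up for illustration (real Tweets have many more fields), and treating a "broken" Tweet as a row without text is an assumption.

```r
# Two made-up records with a nested "user" object, one JSON document per line.
records <- c(
  '{"text": "RT #bkabka:Eik\u00f6 t\u00e4m\u00e4", "user": {"screen_name": "a", "lang": "fi"}}',
  '{"text": "RT #bkabka:Espa\u00f1a", "user": {"screen_name": "b", "lang": "es"}}'
)
tmp <- tempfile(fileext = ".json")
con_out <- file(tmp, open = "wb")
writeLines(enc2utf8(records), con_out, useBytes = TRUE)  # write UTF-8 bytes as-is
close(con_out)

# stream_in() parses one record per line.
df <- jsonlite::stream_in(file(tmp), verbose = FALSE)

# The nested "user" column is itself a data frame; flatten() expands it
# into ordinary columns such as user.screen_name.
flat <- jsonlite::flatten(df)

# Assumption: rows without text correspond to the "broken" Tweets.
flat <- flat[!is.na(flat$text), ]

print(names(flat))
print(flat$text)
```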

Related

Standardise strings in cells of a dataframe in R that appear bold

Loading an Excel sheet into R, some strings in the cells of the dataframe appear to be bold and in a different format. For example, like so:
𝐇𝐚𝐢𝐝𝐚𝐫𝐚
And when I copy paste this string into the R console, it appears like this:
Anyone know how to fix this (revert these strings into the standard format) in R?
Want to avoid going back into Excel to fix it.
Thanks!

These are actually UTF-8 encoded letters in the Mathematical Alphanumeric Symbols block in Unicode, and they don't map nicely back on to 'standard' ASCII letters in R unless you have a pre-existing mapping function such as utf8_normalize from the utf8 package:
library(utf8)
utf8_normalize('𝐇𝐚𝐢𝐝𝐚𝐫𝐚', map_compat = TRUE)
#> [1] "Haidara"
However, I would strongly recommend that you fix your Excel file before importing to avoid having to do this; it works with the example you have given us here, but there may be unwelcome surprises in converting some of your other strings.
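If only bold letters are affected, the mapping can also be sketched in base R with code point arithmetic. This is an illustrative alternative, not a description of what utf8_normalize does internally. The helper name unbold is made up, and it only covers the Mathematical Bold letters A-Z and a-z (U+1D400 to U+1D433), not digits or the other styles in that block.

```r
unbold <- function(x) {
  vapply(x, function(s) {
    cp <- utf8ToInt(s)
    # Locate bold capitals and bold small letters before changing anything.
    upper <- cp >= 0x1D400 & cp <= 0x1D419
    lower <- cp >= 0x1D41A & cp <= 0x1D433
    # Shift each range back onto the ASCII letters.
    cp[upper] <- cp[upper] - 0x1D400 + utf8ToInt("A")
    cp[lower] <- cp[lower] - 0x1D41A + utf8ToInt("a")
    intToUtf8(as.integer(cp))
  }, character(1), USE.NAMES = FALSE)
}

# The string from the question, written with escapes.
bold <- "\U0001D407\U0001D41A\U0001D422\U0001D41D\U0001D41A\U0001D42B\U0001D41A"
print(unbold(bold))
```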

Attempts to parse bencode / torrent file in R

I would like to parse torrent files automatically in R. I tried to use the R-bencode package:
library('bencode')
test_torrent <- readLines('/home/user/Downloads/some_file.torrent', encoding = "UTF-8")
decoded_torrent <- bencode::bdecode(test_torrent)
but got this error:
Error in bencode::bdecode(test_torrent) :
input string terminated unexpectedly
In addition if I try to parse just part of this file bdecode('\xe7\xc9\xe0\b\xfbD-\xd8\xd6(\xe2\004>\x9c\xda\005Zar\x8c\xdfV\x88\022t\xe4գi]\xcf'), I get
Error in bdecode("\xe7\xc9\xe0\b\xfbD-\xd8\xd6(\xe2\004>\x9c\xda\005Zar\x8c\xdfV\x88\022t\xe4գi]\xcf") :
Wrong encoding '�'. Allowed values are i, l, d or a digit.
Is there another way to do this in R? Or can I call code written in another language from an Rscript?
Thanks in advance!

It might be that the torrent file is somehow corrupted.
A bencode value must begin with the character i (for integers), l (for lists), d (for dictionaries) or a number (for the length of a string).
The example string ('\xe7\xc9...'), doesn't start with any of those characters, and hence it can't be decoded.
See this for more info on the bencode format.
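The rule above can be sketched as a check on the first byte. The helper name bencode_type is hypothetical and not part of the bencode package; it only classifies the leading byte and does not decode anything.

```r
bencode_type <- function(bytes) {
  stopifnot(is.raw(bytes), length(bytes) > 0)
  first <- as.integer(bytes[1])
  if (first == utf8ToInt("i")) {
    "integer"
  } else if (first == utf8ToInt("l")) {
    "list"
  } else if (first == utf8ToInt("d")) {
    "dictionary"
  } else if (first >= utf8ToInt("0") && first <= utf8ToInt("9")) {
    "string"   # a digit starts the length prefix of a string
  } else {
    "invalid"
  }
}

print(bencode_type(charToRaw("d3:cow3:mooe")))  # a small dictionary
print(bencode_type(as.raw(0xe7)))               # first byte of the string in the question
```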

There seem to be several issues here.
Firstly, your code should not treat torrent files as text files in UTF-8 encoding. The data a torrent describes is split into equally sized pieces (except for the last piece ; )), and the torrent file contains a concatenation of the SHA1 hashes of those pieces. SHA1 hashes are raw 20-byte values and are unlikely to be valid UTF-8 strings.
So, you should not read the file into memory using readLines, because that is for text files. Instead, you should use a connection opened in binary mode:
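# Open the file as a binary connection instead of reading it as text
test_torrent <- file("/home/user/Downloads/some_file.torrent")
open(test_torrent, "rb")
decoded_torrent <- bencode::bdecode(test_torrent)
close(test_torrent)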
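As a base R sketch of why a binary read matters: readBin returns every byte unchanged, including bytes that are not valid UTF-8 and bytes that happen to equal a line feed. The payload below is made up and is not a complete torrent file.

```r
# A tiny bencoded dictionary whose value holds arbitrary binary bytes.
payload <- c(
  charToRaw("d6:pieces4:"),
  as.raw(c(0xe7, 0x0a, 0xc9, 0xe0)),  # includes a line feed (0x0a)
  charToRaw("e")
)
tmp <- tempfile(fileext = ".torrent")
writeBin(payload, tmp)

# Read the whole file back as raw bytes.
bytes <- readBin(tmp, what = "raw", n = file.size(tmp))
print(identical(bytes, payload))
```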
Secondly, the library itself seems to suffer from a similar issue: readChar, which it uses internally, also assumes that it is dealing with text.
This might be due to changes in recent R versions, seeing as the library is over 6 years old. I was able to apply a quick hack and get it working by passing useBytes=TRUE to readChar.
https://github.com/UkuLoskit/R-bencode/commit/b97091638ee6839befc5d188d47c02567499ce96
You can install my version as follows:
install.packages("devtools")
library(devtools)
devtools::install_github("UkuLoskit/R-bencode")
Caveat lector! I'm not an R programmer :).

Importing data from Excel to vector in R

I am a novice in R and I have been having some trouble trying to get R and Excel to cooperate.
I have written code that compares two vectors with each other and determines the differences between them:
data.x<-read.csv(file.choose(), header=T)
data.y<-read.csv(file.choose(), header=T)
newdata.x<-grep("DAG36|G379",data.x,value=TRUE,invert=TRUE)
newdata.x
newdata.y<-grep("DAG36|G379",data.y,value=TRUE,invert=TRUE)
newdata.y
setdiff(newdata.x,newdata.y)
setdiff(newdata.y,newdata.x)
The data I want to transfer from Excel to R is a long row of numbers placed as so:
“312334-2056”, “457689-0932”, “857384-9857”,….,
There are about 350 of these numbers placed in their own separate cell along a single row.
I used the command: = """" & A1 & """" To put double quotes around every number in order for R to read it properly.
At first I tried to simply copy/paste the data directly into a vector in R, but it's as if R won’t read it as a single row of data and therefore splits it up.
I also tried to save the excel file as a CSV file but that didn’t work either.
Lastly I tried to open it directly in to R using the command:
data.x<- read.csv(file.choose(), header=T)
But when I type data.x and press Enter, it simply says:
<0 rows> (or 0-length row.names)
I simply can’t figure out what I’m doing wrong. Any help would be greatly appreciated.

It's hard to assess without a reproducible example, but you should be able to transpose the Excel file into a single column. Then import it using read_csv from the readr package. Take a look at the tidyverse package, which contains some great tools to import and work with this type of data.
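One possible explanation for the <0 rows> output, sketched in base R: with header = TRUE, read.csv uses the only row as column names, so no data rows remain. Reading with header = FALSE and unlisting the row gives a plain character vector. The values are the three from the question, and csv_text stands in for the exported file.

```r
csv_text <- '"312334-2056","457689-0932","857384-9857"'

# header = TRUE: the single row is consumed as column names.
as_header <- read.csv(text = csv_text, header = TRUE)
print(nrow(as_header))

# header = FALSE: the row is kept as data; colClasses keeps the IDs as text.
row_df <- read.csv(text = csv_text, header = FALSE, colClasses = "character")
ids <- unlist(row_df[1, ], use.names = FALSE)

# grep() and setdiff() now operate on a character vector.
kept <- grep("DAG36|G379", ids, value = TRUE, invert = TRUE)
print(setdiff(kept, "457689-0932"))
```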

I use https://github.com/tidyverse/readxl/. It makes it easy to maintain formatting from excel into type safe tibbles.
If you can share some sample data a working solution can be generated.

R displaying unicode/utf-8 encoding rather the special characters

I have a dataframe in R which has one row of utf-8 encoded special characters and one integer row.
If I display both rows, or go into the view(), I do not see the characters displayed correctly.
However, if I only select the row with the special characters, it works. Any ideas?
This is the output (if I paste it, the encoding disappears):

This looks like a bug in R. I've worked around a number of these in the corpus package. Try the following
library(corpus)
print.corpus_frame(WW_mapping[1:3,])
Alternatively, do
library(corpus)
class(WW_mapping) <- c("corpus_frame", "data.frame")
WW_mapping[1:3,]
Adding the "corpus_frame" class to the data frame changes the print and format methods; otherwise, it does not change the behavior of the object.
If that doesn't work, please report your sessionInfo() along with dput(WW_mapping). (Actually, even if this fix does work, please report this information so that we can let the R core developers know about the problem.)

Using ggplot2 and special characters

I am reading in data from a web site, with text identifying each row. I simply copied and pasted the data into Excel, and the file is then read by R. One of these rows contains the name of a German city, "Würzburg", which includes a lower case u with an umlaut. I have no problem seeing the special character on the web or on Excel. The problem is, when this word is passed to ggplot2, it is displayed in the plot as "WÃzburg", with tilde over the capital A. RStudio shows both forms depending on the area in which it is displayed. I would assume that ggplot2 uses a different language for interpreting the special characters.
Is there a way to tell ggplot how to read, interpret and display the special characters? I do not want to write specialized code just for this city, but to solve the problem in general. I am likely to encounter other characters as the data expands over time.

I encountered a similar error with ggplot2, when I used a hardcoded data.frame (i.e., I would write Großbritannien (Great Britain) and it would get encoded to some gibberish).
My solution was to include
Sys.setlocale("LC_ALL", "German")
options(encoding = "UTF-8")
in the beginning of the script.

Read the file in as follows:
library('data.table')
fread('path_to_file', ..., encoding = 'UTF-8')
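The same idea in base R, as a sketch: declare the file's encoding at import time so the strings arrive marked as UTF-8. The temporary file and its column names are made up for illustration.

```r
# Write a small UTF-8 encoded CSV file.
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeLines(enc2utf8(c("city,value", "W\u00fcrzburg,1")), con, useBytes = TRUE)
close(con)

# encoding = "UTF-8" tells read.csv how to mark the strings it reads.
cities <- read.csv(tmp, encoding = "UTF-8", stringsAsFactors = FALSE)
print(Encoding(cities$city))
print(identical(charToRaw(cities$city), charToRaw("W\u00fcrzburg")))
```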

My solution to this problem is switching to cairo for PDF plotting. With it, ggplot2 shows all special characters properly. It is enough to put this line of code among the knitr settings:
knitr::opts_chunk$set(dev='cairo_pdf')