Standardise strings in cells of a dataframe in R that appear bold

Loading an Excel sheet into R, some strings in the cells of the dataframe appear to be bold and in a different format. For example, like so:
𝐇𝐚𝐢𝐝𝐚𝐫𝐚
When I copy and paste this string into the R console, it appears like this:
Does anyone know how to fix this (revert these strings to the standard format) in R?
I want to avoid going back into Excel to fix it.
Thanks!

These are actually UTF-8 encoded letters from the Mathematical Alphanumeric Symbols block in Unicode, and they don't map nicely back onto 'standard' ASCII letters in R unless you have a pre-existing mapping function such as utf8_normalize from the utf8 package:
library(utf8)
utf8_normalize('𝐇𝐚𝐢𝐝𝐚𝐫𝐚', map_compat = TRUE)
#> [1] "Haidara"
However, I would strongly recommend that you fix your Excel file before importing to avoid having to do this; it works with the example you have given us here, but there may be unwelcome surprises in converting some of your other strings.

Related

Problem with encoding of character strings when loading json files to RStudio under Windows 10

I am trying to extract Tweets from json files and save them as RData under Windows 10, using RStudio version 1.2.5033 and streamR. However, Windows (and subsequently RStudio and streamR) assumes that the input is Windows-1252 although it is UTF-8, which leads to serious encoding issues.
To replicate the problem, please use this json file with two fake Tweets since I could not replicate the structure of the original json files within R. But this structure leads to issues with the only solution I found for the encoding issue (see below).
The code I used is the following:
df <- streamR::parseTweets("test.json")
The output I get with df$text is: '[1] "RT #bkabka:EikÃ¶ tÃ¤mÃ¤" "RT #bkabka:EspaÃ±a"'.
The output should be: '[1] "RT #bkabka:Eikö tämä" "RT #bkabka:España"'.
My question is therefore: (How) can I tell R and streamR to interpret the text as being encoded using UTF-8 rather than Windows-1252?
Since all this happens because the function wrongly assumes that the text is encoded with Windows-1252, one solution would be to go through the whole corpus and replace all of the wrongly interpreted special characters with the correct ones, for example using the table I found here. In my case, however, the corpus is very large, making this a suboptimal solution in the long run. Additionally, I would have no way to check whether it actually replaced all special characters correctly.
Some additional information:
Using rjson and the following code somehow makes R interpret the encoding correctly, but it has trouble with the structure of the json files since it only extracts the first line:
lt <- rjson::fromJSON(file="test.json")
I guess it cannot extract the subsequent lines because it does not recognise the line break, which is an actual line break and not \n or any other character combination (a possible workaround is sketched after these notes). Unfortunately, I do not have the possibility to change the json files.
The json files were created by another person under macOS using streamR - if I am not mistaken.
The same problem appears using simple R instead of RStudio. The problem does not appear on macOS.
The problem is even more serious when using tweet2r, the only other package I am aware of that allows extracting Tweets from json files using R. tweet2r deletes certain special characters such as "¶", and thus the wrongly interpreted special characters can no longer be replaced with the correct ones.
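One possible workaround for the rjson issue above is to read the file line by line and parse each line separately; this is only a sketch, assuming the file is newline-delimited JSON with one Tweet per line:
lines <- readLines("test.json", encoding = "UTF-8")
# parse each Tweet (one JSON object per line) on its own
lt <- lapply(lines, rjson::fromJSON)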
Thanks to MrFlick (see his comment), here is a solution using jsonlite that results in a very similar data frame structure and reads the encoding correctly:
df <- jsonlite::stream_in(file("~/../Downloads/test.json"))
Just some further information for those used to the luxury of streamR handling Tweets who might encounter a similar issue in the future: there are two main differences between the data frames created by parseTweets and stream_in:
parseTweets does not extract data for broken Tweets; stream_in does. Hence, the data frame has more rows when using stream_in, but it contains the same Tweets.
stream_in creates fewer variables, since some of the columns in its data frame are themselves data frames. This might lead to issues when using the data frame without transforming it further; parseTweets does that for you.
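If the nested data-frame columns from stream_in get in the way, one option (not part of the original answer, just a sketch) is to flatten them with jsonlite::flatten:
df <- jsonlite::stream_in(file("test.json"))
# turn nested data-frame columns into ordinary top-level columns
df_flat <- jsonlite::flatten(df)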

Exporting large numbers to csv from R

I came across a strange problem when trying to export an R dataframe to a csv file.
The dataframe contains some big numbers, but when they are written to the csv file, they "lose" their decimal part and are written without it.
But not in the way one would expect; instead, it looks like this:
Say 3224571816.5649 is the correct value in R. When written to csv, it becomes 32245718165649.
I am using the write.csv2 function to write the csv. The separators are correct, as it works normally for smaller values. Is the problem occurring because the number (with decimals) is too big for 32 bits?
And more importantly, how can I solve this, given that I have a whole dataframe with values as big as (or bigger than) this? Also, it has to be written to a csv.
write.csv2 is intended for a different csv convention (the Western European style; based on your use of "." as a decimal indicator, I am guessing that is not what you want). write.csv2 uses a comma as the decimal indicator and a semicolon as the field delimiter, so if you try to read the result back in as a comma-separated file, it will look strange indeed.
I suggest you use write.csv (or even better, write.table) to output your file. write.csv assumes a comma separator and a period for the decimal marker.
Both write.csv and write.csv2 are just wrappers around write.table, which is the underlying function. In general, I recommend write.table because it does not assume your region and lets you explicitly pass sep = ",", dec = ".", etc. This not only makes it clear what you are using, it also makes your code a lot more readable.
For more, see the write.table entry on rdocumentation.org: https://www.rdocumentation.org/packages/utils/versions/3.5.3/topics/write.table
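As a rough sketch of an explicit call (the example data frame and file name are made up), this keeps the decimal point intact in a comma-separated file:
df <- data.frame(x = 3224571816.5649)
# explicit separators: comma between fields, period for decimals
write.table(df, "big_numbers.csv", sep = ",", dec = ".", row.names = FALSE)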

R displaying unicode/utf-8 encoding rather than the special characters

I have a dataframe in R which has one row of utf-8 encoded special characters and one integer row.
If I display both rows, or open the dataframe in View(), I do not see the characters displayed correctly.
However, if I only select the row with the special characters, it works. Any ideas?
This is the output (if I paste it, the encoding disappears):
This looks like a bug in R. I've worked around a number of these in the corpus package. Try the following
library(corpus)
print.corpus_frame(WW_mapping[1:3,])
Alternatively, do
library(corpus)
class(WW_mapping) <- c("corpus_frame", "data.frame")
WW_mapping[1:3,]
Adding the "corpus_frame" class to the data frame changes the print and format methods; otherwise, it does not change the behavior of the object.
If that doesn't work, please report your sessionInfo() along with dput(WW_mapping). (Actually, even if this fix does work, please report this information so that we can let the R core developers know about the problem.)

read data into R

The World Health Organization dataset is available here: http://www.filedropper.com/who
When the data is read using fread (from the data.table package) or read_csv (from the readr package), some variables are wrapped in "\r" and are shown as character type. Like so:
"\r31.1\r".
I checked the dataset in Notepad and indeed it looks weird, as these values are wrapped in (' '). However, they are numeric, and when the regular read.csv is used there is no such problem.
What's the reason behind this, and how can it be fixed?
'\r' is a special character (the carriage return), used as part of the line delimiter in files created on Windows.
When using readr, setting the argument escape_backslash = TRUE (an argument of read_delim()) might do the trick.
Check this for further reading.
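If the file itself cannot be changed, another option is to strip the stray carriage returns after reading and coerce the affected columns back to numeric. This is only a sketch; the file and column names ("WHO.csv", LifeExpectancy) are hypothetical:
library(data.table)
who <- fread("WHO.csv")
# remove the stray \r characters, then convert the column back to numeric
who[, LifeExpectancy := as.numeric(gsub("\r", "", LifeExpectancy))]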

Using ggplot2 and special characters

I am reading in data from a web site, with text identifying each row. I simply copied and pasted the data into Excel, and the file is then read by R. One of these rows contains the name of a German city, "Würzburg", which includes a lower-case u with an umlaut. I have no problem seeing the special character on the web or in Excel. The problem is that when this word is passed to ggplot2, it is displayed in the plot as "WÃ¼rzburg", with a tilde over a capital A. RStudio shows both forms depending on the area in which it is displayed. I assume that ggplot2 uses a different encoding for interpreting the special characters.
Is there a way to tell ggplot how to read, interpret and display the special characters? I do not want to write specialized code just for this city, but to solve the problem in general. I am likely to encounter other characters as the data expands over time.
I encountered a similar error with ggplot2 when I used a hardcoded data.frame (e.g., I would write Großbritannien (Great Britain) and it would get encoded to some gibberish).
My solution was to include
Sys.setlocale("LC_ALL", "German")
options(encoding = "UTF-8")
at the beginning of the script.
Read the file in as follows
library('data.table')
fread('path_to_file', ..., encoding = 'UTF-8')
My solution to this problem is switching to Cairo for pdf plotting. All special characters are then shown properly by ggplot2. It is enough to put this line of code among the knitr settings:
knitr::opts_chunk$set(dev='cairo_pdf')
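Outside of knitr, the same idea should also work when saving a plot directly; a minimal sketch (the data and file name are made up):
library(ggplot2)
p <- ggplot(data.frame(city = "Würzburg", n = 1), aes(city, n)) + geom_col()
# the Cairo PDF device renders UTF-8 characters such as the umlaut correctly
ggsave("wuerzburg.pdf", p, device = cairo_pdf)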
