RStudio - Weird characters becoming regular characters

I'm working with a messy database, in which I need to format some columns of data. For this I use a lot of gsub() and other regular expressions. My problem is that some of the characters I need to clean are "weird" characters, especially the letter with the curly thing on top (Ñ).
When I copy from the database and then paste on my gsub function:
gsub("CALLÑE", "CALLE", data)
It works fine until I close RStudio and reopen it. Then the characters are different in the R script file. It is as if RStudio didn't support weird characters itself and changes them in the scripts when they are reopened:
gsub("CALLÃ'E", "CALLE", data)
How can I avoid this and keep my weird characters even after closing the file?

In RStudio, go to File -> Save with Encoding...
Select the UTF-8 option.
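If you would rather fix an already-mangled script from code instead of the menu, one possible approach is to re-encode the file yourself. This is only a minimal sketch: the path is hypothetical, and it assumes the script is currently saved as Latin-1 (Windows-1252), the usual default on a Western-European Windows setup.

path <- "C:/scripts/clean_addresses.R"            # hypothetical path
txt <- readLines(path, encoding = "latin1")       # declare the encoding the file currently has
writeLines(enc2utf8(txt), path, useBytes = TRUE)  # rewrite the file as UTF-8

After that, keep saving the script with the UTF-8 encoding so the Ñ survives the next save.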

Related

R version 4.2.0 and Swedish letters (ä ö å) not working in newest R release. Anyone found a solution?

I have updated to the latest R release (R version 4.2.0), but I am now facing the problem that the Swedish special letters cannot be read anymore. I am working with a database that has many Swedish letters in its factor labels, and even if I read them in as strings R doesn't recognise them, with the consequence that all summary tables based on these factors as groups are no longer calculated correctly. The code worked fine under the previous release (but I had issues with knitting R Markdown files, hence the need for updating).
I have set the encoding to ISO 8859-4 (which covers Northern European languages) after UTF-8 did not work. Is there anything else I could try? Or has anyone found a fix other than renaming all labels before reading in the .csv files? (I would really like to avoid that, since I often work with similar data.)
I have used read.csv() and it produces cryptic output, replacing the special letters with, for example, <d6> instead of ö and <c4> instead of ä.
I hope that someone has an idea for a fix. Thanks.
Edit: I use Windows.
Sys.getlocale("LC_CTYPE")
[1] "Swedish_Sweden.utf8"
Use the encoding parameter
I have been able to detect failed loads by attempting to apply toupper() to strings, which gives errors such as
Error in toupper(dataset$column) :
invalid multibyte string 999751
This is resolved and expected outcomes obtained by using
read.csv(..., encoding = 'latin1')
or
data.table::fread(..., encoding = 'Latin-1')
I believe this solution should apply to Swedish characters as they are also covered by the Latin-1 encoding.
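Put together, a minimal sketch (the file name and column are hypothetical):

dat1 <- read.csv("swedish_data.csv", encoding = "latin1")            # base R
dat2 <- data.table::fread("swedish_data.csv", encoding = "Latin-1")  # data.table
toupper(dat1$ort)  # should no longer fail with "invalid multibyte string"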
I had the same problem; what worked for me was what the answer above says, but using the encoding ISO-8859-1 instead. It works both for reading from file and for saving to file with the Swedish characters å, ä, ö, Å, Ä, Ö, i.e.:
read.csv("~/test.csv", fileEncoding = "ISO-8859-1")
and
write.csv2(x, file="test.csv", row.names = FALSE, na = "", fileEncoding = "ISO-8859-1")
It's tedious, but it works for now. Another tip, if you use RStudio, is to go to Global Options -> Code -> Saving, set your default text encoding to ISO-8859-1, and restart RStudio. If I understand correctly, it will then save and read your scripts in that encoding by default. I had the problem that when I opened my scripts with Swedish characters they would display the wrong characters; this solution fixed that.

RStudio - some non-standard special characters in the R Script change by themselves

I occasionally work with data frames where unorthodox special characters are used that look identical to standard characters in RStudio's in-built viewing functionality. I refer to these characters in my scripts, but sometimes when I open the file, these characters have been changed to standard keyboard characters within the script.
For example, in my script, ’ changes to a standard apostrophe ' and – changes to a standard hyphen -.
These scripts are ones I have to run regularly, so having to manually correct this each time is a chore. I also haven't worked out what it is that triggers RStudio to make these changes. I've tried closing and reopening to test if that's the trigger, and the characters have remained correct. It only seems to happen after I've turned off my computer.
Does anyone know of a workaround for this and/or what is causing this? TIA
EDIT: the reason I need these characters is that I export to a CSV which is UTF-8 encoded.
I've found a workaround, although I welcome any feedback on any drawbacks to this.
If you have already written your code (including the special characters):
Click File > Save with Encoding... > Show all encodings > unicodeFFFE
Now when you reopen the file:
Click File > Reopen with Encoding... > Show all encodings > unicodeFFFE
If you haven't already written your code, it should just be a case of saving your file from the start with the unicodeFFFE encoding (instructions above) before you write the code and then using the reopen with encoding option whenever you open the file.
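An alternative (not part of this answer, just a common way to sidestep encoding issues altogether) is to refer to the special characters with Unicode escapes instead of typing them literally; the escapes are plain ASCII, so they survive any re-encoding of the script. A small sketch, where x stands for whatever character vector you are working with:

x <- gsub("\u2019", "'", x)  # \u2019 is the right single quotation mark ’
x <- gsub("\u2013", "-", x)  # \u2013 is the en dash –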

How does R (RGui) parse multiline character strings?

The RGui (Windows; R version 3.5.3) appears to ignore tab characters that occur at the beginning of a line within a character string (press CTRL+R over the lines of code):
# REPLACE "<TAB>" WITH AN ACTUAL TAB CHARACTER TO GET THE CODE INTENDED BELOW.
foo <- 'LINE1
<TAB>LINE2
<TAB>LINE3
'
foo
# [1] "LINE1\nLINE2\nLINE3\n"
longstring <- removetabsatbeginningoflines('
<TAB>Sometimes I have really long strings that I format
<TAB>so that they read nicely (not with too long of a
<TAB>line length). Tabs at the beginning of the lines
<TAB>within a string preserve my code indenting scheme
<TAB>that I use to make the code more readable. If the
<TAB>tabs are not removed automatically by the parser,
<TAB>then I need to wrap the string in a function that
<TAB>removes them.')
The tab characters are preserved when the above code is source'd from a file.
Why doesn't RGui keep the tab characters?
Where is this behavior documented?
What other non-intuitive, related behaviors does RGui have with regard to parsing (multiline) strings?
I can't find anywhere that this is documented, so this is a worthwhile question to publicly answer.
If you are using RGui's built-in 'R Editor', then all Tab characters entered via the Tab key, or already existing in a text file that you have opened into the 'R Editor', will not be respected when submitting using Ctrl-R (This is difficult to represent here in an example given that tabs are stripped from answers).
I imagine the 'R Editor' is not meant to be used for serious code editing, and you may be better off using a dedicated IDE (e.g. RStudio) or more full-featured editor (e.g. Emacs, Notepad++).
You can route around this issue in RGui by manually writing tabs as \t when editing in RGui, but this may not be appropriate if you want to keep actual Tabs in your file. Tabs will also be correctly processed when using source() to directly run the code stored in the text file.
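If you do want indented multiline strings to come out the same no matter how the code is submitted, a helper along the lines of the removetabsatbeginningoflines() function sketched in the question is easy to write. This is just one possible version, not a standard function:

remove_leading_tabs <- function(x) gsub("(?m)^\t+", "", x, perl = TRUE)  # strip tabs at the start of each line

foo <- remove_leading_tabs("LINE1\n\tLINE2\n\tLINE3\n")
foo
# [1] "LINE1\nLINE2\nLINE3\n"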

Reading CSV files with Chinese Characters

There are a number of StackOverflow posts about opening CSV files containing (UTF-8 encoded) Chinese characters into R, in Windows. None of the answers I've found seem to work completely.
If I read.csv with encoding="UTF-8", then the Chinese characters are shown encoded (<U+XXXX>, which I've manually verified are at least correct). However, if I interrogate the data frame to get just one row or a specific cell from a row, then it's printed properly.
One post suggested this is due to strings being typed as factors. However, setting stringsAsFactors=FALSE had no effect.
Other posts say the locale must be set correctly. My system locale is apparently English_United Kingdom.1252; a Windows code page looks decidedly non-Unicode friendly! If I try to change it to any of en.UTF-8, en_GB.UTF-8 or en_US.UTF-8 (or even UTF-8 or Unicode), I get an error saying that my OS cannot honour the request.
If I try Sys.setlocale(category="LC_ALL", locale="Chinese"), the locale does change (albeit to another Windows code page; still no Unicode) but then the CSV files can't be parsed. That said, if I read the files in the English locale and then switch to Chinese afterwards, the data frame is printed out correctly in the console. However, this is kludgy and, regardless, View(myData) now shows mojibake rather than the encoded Unicode code points.
Is there any way to just make it all work? That is, correct Chinese characters are echoed from the data frame to the console and View works, without having to perform secret handshakes when reading the data?
My gut feeling is that the problem is the locale: It should be set to a UTF-8 locale and then everything should [might] just work. However, I don't know how to do that...
The <U+XXXX> notation is fine and means your characters were read in properly. The issue is on R's side with printing to the console, which shouldn't be a big problem unless you are copying and pasting output. Writing out is a bit tricky: you want to open a UTF-8 file connection, then write to that file.
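A minimal sketch of that write-out step (myData and the file name are placeholders):

con <- file("chinese_out.csv", open = "w", encoding = "UTF-8")  # UTF-8 file connection
write.csv(myData, file = con, row.names = FALSE)
close(con)

write.csv(myData, "chinese_out.csv", fileEncoding = "UTF-8") should behave the same way and is a little shorter.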

Track the exact place of a not encoded character in an R script file

More of a tip question that can save lots of time in many cases. I have a script.R file which I try to save, and I get the error:
Not all of the characters in ~/folder/script.R could be encoded using ASCII. To save using a different encoding, choose "File | Save with Encoding..." from the main menu.
I had been working on this file for months, and today I was editing my code like crazy and got this error for the first time, so obviously I inserted a character that cannot be encoded while working today.
My question is: can I track down this specific character and find where exactly in the document it is?
There are about 1000 lines in my code, so it's almost impossible to search it manually.
Use tools::showNonASCIIfile() to spot the non-ASCII characters.
Let me suggest two slight improvements to this.
Process:
Save your file using a different encoding (e.g. UTF-8).
Set a variable f to the name of that file, something like f <- "yourpath\\yourfile.R".
Then use tools::showNonASCIIfile(f) to display the faulty characters.
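Putting those steps together (the path is made up):

f <- "C:/Users/me/folder/script.R"  # hypothetical path to the re-saved file
tools::showNonASCIIfile(f)          # prints each line containing non-ASCII characters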
Something to check:
I have a Markdown file which I run to output to a Word document (not important).
Some of the packages I load at initialisation mask previously loaded functions. I have found that the resulting warning messages sometimes contain non-ASCII characters, and this seems to be what caused the message for me - some fault put all that output at the end of the file, and I had to delete it anyway!
Check whether the offending characters are coming from warnings!
Cheers
Expanding on the accepted answer with this answer to another question: to check for offending characters in the script currently open in RStudio, you can use this:
tools::showNonASCIIfile(rstudioapi::getSourceEditorContext()$path)
