Converting accents to ASCII in R

I'm trying to convert special characters to ASCII in R. I tried using Hadley's advice in this question:
stringi::stri_trans_general('Jos\xe9', 'latin-ascii')
But I get "Jos�". I'm using stringi v1.1.1.
I'm running a Mac. My friends on Windows machines seem to get the desired result, "Jose".
Any idea what is going on?

The default encoding on Windows (Latin-1/Windows-1252) differs from the typical default on other operating systems (UTF-8). The byte string 'Jos\xe9' means something in Latin-1, but is not valid UTF-8. So, on Linux or macOS you need to tell R what the encoding is:
x <- 'Jos\xe9'
Encoding(x) <- 'latin1'
stringi::stri_trans_general(x, 'Latin-ASCII')
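If you'd rather avoid stringi, base R's iconv() can do a similar transliteration (a sketch; the //TRANSLIT modifier depends on the system's iconv implementation, so on some platforms accented letters may come back as ? or with stray apostrophes instead of plain ASCII):

```r
x <- 'Jos\xe9'
Encoding(x) <- 'latin1'                        # declare the bytes as Latin-1
iconv(x, from = 'latin1', to = 'ASCII//TRANSLIT')
```

Check the result on your platform before relying on it in a pipeline.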

Related

How can I use R-scripts containing Umlauts cross-platform?

I use Windows on my desktop PC at work and Linux on my laptop. I frequently work on R scripts in RStudio, alternating between the two machines. Whenever I open a script containing umlauts on one system after working on the other, the umlauts (e.g. ä, ü, ß, ö) are replaced with question marks. Importantly, I'm not talking about data that I am importing, but the text of the script itself. For example, a script file written on Linux:
# This iß an exämple
text <- c("R kann äußerst nervig sein")
will be displayed differently when opened on Windows:
# This i? an ex?mple
text <- c("R kann ?u?erst nervig sein")
Are there any settings that prevent this from happening? I've already tried to set the default encoding to UTF-8 on both machines, but it didn't seem to change anything.
The standard R build on Windows doesn't fully support UTF-8, because Windows itself added that capability only recently. You could download the "WinUCRT" build of R (though I forget the location; Google probably knows), and then things would be fine.
Alternatively, for widest portability you could write your scripts in pure ascii by encoding the accented letters as Unicode escapes. The stringi package can help with this, e.g.
cat(stringi::stri_escape_unicode("R kann äußerst nervig sein"))
#> R kann \u00e4u\u00dferst nervig sein
Created on 2021-11-09 by the reprex package (v2.0.1)
so you'd put this in your code:
text <- "R kann \u00e4u\u00dferst nervig sein"
(There's no need to call c() for one element.) This is inconvenient, but should work on all systems.
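The escaping is reversible, so you can verify nothing is lost in the round trip with stri_unescape_unicode() (a minimal sketch):

```r
orig    <- "R kann äußerst nervig sein"
escaped <- stringi::stri_escape_unicode(orig)      # pure-ASCII representation
back    <- stringi::stri_unescape_unicode(escaped)
identical(orig, back)                              # TRUE
```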

Problem with spell checking packages in R

I'm trying to check spelling some words in Russian using "hunspell" library in R.
bad_words <- hunspell("Язвенная болзень", dict='ru_RU.dic')
I have installed Russian dictionary, from here: https://code.google.com/archive/p/hunspell-ru/
It has encoding UTF-8. However, I have following error:
Failed to convert line 1 to ISO8859-1 encoding. Try spelling with a UTF8 dictionary.
It seems strange: neither the dictionary nor the R file has ISO8859-1 encoding...
What is the problem?
If you are on Windows, my first guess would be that this is related to the lack of native UTF-8 support in R on Windows. This will be resolved when R 4.2 is released; you might wish to try the development release and see whether the problem persists.
Another thing to check is whether your DESCRIPTION file contains the line Encoding: UTF-8, such that your source files are treated as having this encoding.
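One thing worth trying (a sketch; the path to ru_RU.dic is a placeholder for wherever you installed it) is loading the dictionary explicitly with hunspell::dictionary(), which fails early with a clearer error if the .dic/.aff pair can't be read:

```r
library(hunspell)

# Load the Russian dictionary explicitly; hunspell looks for the matching
# ru_RU.aff file next to the .dic file. The path below is a placeholder.
ru <- dictionary("path/to/ru_RU.dic")

# Spell-check against the loaded dictionary object rather than a bare filename.
bad_words <- hunspell("Язвенная болзень", dict = ru)
print(bad_words)
```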

Turkish Language Encoding Problem in R Studio

I am using RStudio to create plots of economic variables. In our language, when you leave out our specific letters "ğ, ş, ı, ü, ç", a word can mean something different, sometimes even a swear word. I can't create graphs with these letters. I tried this command:
Sys.setlocale(category = "LC_ALL", locale = "Turkish")
The output is
OS reports request to set locale to "Turkish" cannot be honored[1] ""
How can I solve this problem? Any ideas?
If you have the same problem and your system is a Mac, first open Terminal, then paste and run this command:
defaults write org.R-project.R force.LANG en_US.UTF-8
That solved it for me; I hope it works on your system, too.
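After restarting R, you can confirm the change took effect (a minimal check):

```r
# The session locale should now report a UTF-8 code set.
Sys.getlocale("LC_CTYPE")   # e.g. "en_US.UTF-8"

# l10n_info() reports whether the console is treated as UTF-8.
l10n_info()                 # the `UTF-8` entry should be TRUE
```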

Issues with OSM encoding

I am having troubles with the encoding of the osm data.
Here is a reproducible example using the osmar package:
osmData <- osmar::get_osm(osmar::center_bbox(23.334360, 42.693180, 100, 100))
osmData$nodes$tags[80:100, ]  # the output is not UTF-8
I have also downloaded a planet extract from https://download.geofabrik.de/europe/
After unzipping it and using it with osmar::get_osm I still have the same issue: the Cyrillic letters are not readable.
Any ideas how I can counter this?
OK, answering my own question:
I ran the above code on Linux and realized that the issue was with the Windows locale. The workaround I found was to use iconv with the from and to parameters both set to "UTF-8", which re-marks the strings as UTF-8:
iconv(osmData$nodes$tags[80:100,3][11], from="UTF-8", to="UTF-8")
This works and could be applied to all columns.
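To apply the same fix to every tag column at once, something like the following should work (a sketch; a small data frame stands in for osmData$nodes$tags, and it assumes the columns are character vectors, so factor columns would need as.character() first):

```r
# Stand-in for osmData$nodes$tags: a data frame with character columns.
tags <- data.frame(id = c(1, 2),
                   k  = c("name", "name:en"),
                   v  = c("София", "Sofia"),
                   stringsAsFactors = FALSE)

# Re-declare every character column as UTF-8 so it prints correctly
# on a non-UTF-8 Windows locale.
char_cols <- vapply(tags, is.character, logical(1))
tags[char_cols] <- lapply(tags[char_cols],
                          iconv, from = "UTF-8", to = "UTF-8")
```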

R doesn't recognize Latin7 characters

I have a really strange problem. I am using a Lithuanian keyboard, but R doesn't recognize letters such as į, š, č.
For example when I write:
žodis <- "žibutė"
in R console I see
þodis <- "þibutë".
I have R on several computers; all work fine except this one. Can you help me with this issue? Is there a function to tell R that I'm using a Lithuanian keyboard? My operating system is Windows 10 and my R version is 3.3.2.
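What the symptoms suggest (a sketch of the diagnosis, not a confirmed fix): ž is byte 0xFE and ė is byte 0xEB in ISO-8859-13 (Latin-7), while those same bytes are þ and ë in ISO-8859-1 (Latin-1), so this machine's locale appears to be interpreting Latin-7 keyboard input as Latin-1. Checking and switching the session locale, and repairing already-mangled text by reinterpreting the bytes, might look like this:

```r
# Check which code page the session is currently using.
Sys.getlocale("LC_CTYPE")

# On Windows, request a Lithuanian locale (the exact name can vary by system).
Sys.setlocale("LC_CTYPE", "Lithuanian")

# Already-mangled text can be repaired by reinterpreting its bytes:
x <- iconv("\u00feibut\u00eb", from = "UTF-8", to = "ISO-8859-1")  # raw Latin-1 bytes
iconv(x, from = "ISO-8859-13", to = "UTF-8")                       # "žibutė"
```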