Encoding discrepancy in RScript

I have been struggling with an encoding problem with a program that needs to run both in RStudio and using RScript. After wasting half a day on this I have a kludgy workaround, but would like to understand why the RScript version marks a string as latin1 when it is in fact UTF-8, and whether there is a better alternative to my solution. Example:
x <- "Ø28"
print(x)
print(paste("Marked as", Encoding(x)))
print(paste("Valid UTF = ", validUTF8(x)))
x <- iconv(x, "UTF-8", "latin1")
print(x)
In RStudio, the output is:
[1] "Ø28"
[1] "Marked as latin1"
[1] "Valid UTF = FALSE"
[1] NA
and when run using RScript from a batch file in Windows the output from the same code is:
[1] "Ã\23028"
[1] "Marked as latin1"
[1] "Valid UTF = TRUE"
[1] "Ø28"
In the latter case, it does not strike me as entirely helpful that a string defined within an R program by a simple assignment is marked as Latin-1 when in fact it is UTF-8. The solution I used in the end was to write a function that tests the actual (rather than declared) encoding of character variables using validUTF8, and if that returns TRUE, then use iconv to convert to latin1. It is still a bit of a pain since I have to call that repeatedly, and it would be better to have a global solution. There is quite a bit out there on encoding problems with R, but nothing that I can find that solves this when running programs with RScript. Any suggestions?
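For reference, a minimal sketch of the per-variable workaround described above (the helper name mark_utf8 is mine, not from the original code):
mark_utf8 <- function(x) {
  # test the actual (not declared) encoding: convert only if the bytes are valid UTF-8
  if (is.character(x) && isTRUE(all(validUTF8(x)))) {
    iconv(x, "UTF-8", "latin1")
  } else {
    x
  }
}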
R 3.5.0, RStudio 1.1.453, Windows 7 / Windows Server 2008 (don't ask...)

Related

R: load warnings

I have a file foo, which based on file utility is:
foo: LaTeX 2e document, UTF-8 Unicode text, with CRLF, LF line terminators
I have the following meta-code
txt <- readLines("foo")
....
save(txt, file="bar")
load("bar")
There were 50 or more warnings (use warnings() to see the first 50)
warnings()
Warning messages:
1: In load("bar") :
input string '...' cannot be translated to UTF-8, is it valid in 'ANSI_X3.4-1968'?
2: ...
The warnings cause some parts of the txt character vector to be cut when I later save it with writeLines.
I was unable to find an explanation for the warnings. They occur for lines in the TeX file with accented letters.
On Linux, R reacts to the value of the LANG environment variable. If it is not set to a UTF-8 locale, R is unable to manage this type of encoding.
So, calling R like this (without UTF-8):
LANG=en_US R
the console gives:
a <- rawToChar(as.raw(c(0xc3, 0xa0)))
a
# [1] "\303\240"
tmp <- file("/tmp/foo", encoding="UTF-8")
writeLines(a, tmp)
# Warning message:
# In writeLines(a, tmp) : invalid char string in output conversion
close(tmp)
readLines("/tmp/foo")
# character(0)
Without a UTF-8 locale, the value of a (an accented a) is printed as raw bytes.
Forcing UTF-8 encoding when writing to a file gives a warning and nothing is written (readLines returns character(0)).
Calling with:
LANG=en_US.UTF-8 R
does not give errors:
a <- rawToChar(as.raw(c(0xc3, 0xa0)))
a
# [1] "à"
tmp <- file("/tmp/foo", encoding="UTF-8")
writeLines(a, tmp)
close(tmp)
readLines("/tmp/foo")
# [1] "à"
The problem here is that it is difficult to create portable scripts, because their results depend on environment variables.
On Windows, R does not react to the LANG variable; the encoding depends on the language of the Windows installation. For Latin-script languages this is ANSI. That makes portability within Windows simple, but R does not properly read Linux data files, which are typically UTF-8.
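One way to reduce the locale dependence is to declare the file's encoding explicitly on the connection instead of relying on LANG. A sketch (this does not remove the problem entirely: characters with no representation in the native encoding can still be mangled on a non-UTF-8 locale, but at least R knows what the bytes are):
con <- file("foo", encoding = "UTF-8")  # declare the encoding of the bytes on disk
txt <- readLines(con)
close(con)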

Strange behaviour of regex in R

I have a simple web scraper that seems to behave strangely:
- in the desktop version of RStudio (running R version 3.3.3 on Windows) it behaves as expected and produces a numeric vector
- in the server version of RStudio (running R version 3.4.1 on Linux) the gsub() (and hence the numeric conversion afterwards) fails, and the code produces a vector of NAs.
Do you have any idea what could cause the difference?
library(rvest)
url <- "http://benzin.impuls.cz/benzin.aspx?strana=3"
impuls <- read_html(url, encoding = "windows-1250")
asdf <- impuls %>%
html_table()
Benzin <- asdf[[1]]$X7
chrBenzin <- gsub("\\sKč","",Benzin) # something is wrong here...
numBenzin <- as.double(chrBenzin)
numBenzin
The whitespace in the values is a hard space, U+00A0. After I ran the code, inspecting the output for Benzin confirmed it.
Then, I was already sure those were hard spaces, but I double-checked the code points to be certain.
When we have hard spaces, there are two alternatives to try.
One is using [[:space:]] in a TRE regex (the default regex engine in base R functions).
The other is using a PCRE regex with the (*UCP) verb at the start to let the regex engine know we are dealing with Unicode.
In your case, on Linux, it seems the PCRE version works, so you should stick with it (it is simply more consistent than TRE):
gsub("(*UCP)\\s+Kč","",Benzin, perl=TRUE)
A quick online test on Linux R:
Benzin <- "29.60 Kč"
gsub("(*UCP)\\s+Kč","",Benzin, perl=TRUE)
## => [1] "29.60"
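For completeness, the TRE alternative from the two options above would look like the following; whether [[:space:]] matches the no-break space can vary by platform and locale, which may be exactly the Windows/Linux difference observed here:
# TRE version; behaviour with U+00A0 is platform/locale dependent
gsub("[[:space:]]+Kč", "", Benzin)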

iconvlist() inconsistency on alpine linux

I have a docker container set up that is based on artemklevtsov/r-alpine:latest. When I run my R scripts I see this error:
Invalid encoding UTF-8: defaulting to UTF-8.
I tracked this down to this code in the httr library:
https://github.com/hadley/httr/blob/master/R/content-parse.r#L5
It looks like iconvlist() on alpine returns encodings that have a trailing comma, e.g.:
iconvlist()
[1] "..." "ISO8859-1," "ISO8859-2," "ISO8859-3," "ISO8859-4,"
[6] "ISO8859-5," "ISO8859-6," "ISO8859-7," "UCS-2BE," "UCS-2LE,"
[11] "US_ASCII," "UTF-16BE," "UTF-16LE," "UTF-32BE," "UTF-8,"
Therefore UTF-8 never matches UTF-8,. Has anyone run into this issue before? The list of encodings I get on my local Mac (OSX) is correct and doesn't have trailing commas. It also doesn't happen on CentOS, so it looks like it's specific to alpine.
Is there a way to get around this? Maybe through a configuration in R or by modifying the iconvlist() output?
I have the same issue, this time from calling readr::read_csv, which uses base::iconvlist and gives the same error message Invalid encoding UTF-8: defaulting to UTF-8. This is on alpine:3.12 using R 3.6.3 provided by apk add R and, based on the details below, I think the issue will be present on any version of alpine and R unless steps have been taken to address it directly.
I found a couple of solutions. TLDR:
Remove the commas from the file at system.file("iconvlist", package = "utils"), or
Recompile R using the gnu-libiconv library for more comprehensive iconv support.
Solution 1
The base::iconvlist() function uses an iconvlist file as a fallback method to get the list of encodings the system supports. On alpine this fallback method will always be used, for reasons outlined below, but the iconvlist file has commas in it, which R is not expecting.
The easiest solution is to remove the commas from the iconvlist file, which can be found with base::system.file().
> system.file("iconvlist", package = "utils")
[1] "/usr/lib/R/library/utils/iconvlist"
One way to remove the commas, from the command line (not R) is:
sed -i 's/,//g' /usr/lib/R/library/utils/iconvlist
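If editing from R itself is preferable, the same fix can be sketched as follows (this assumes the R process has write permission to the library directory):
icfile <- system.file("iconvlist", package = "utils")
# read the file, strip all commas, and write it back in place
writeLines(gsub(",", "", readLines(icfile)), icfile)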
Subsequent calls to base::iconvlist() will read and parse the new file without the commas, and other functions that rely on base::iconvlist() will be able to successfully check for support, e.g. for "UTF-8".
> iconvlist()
[1] "..." "ISO8859-1" "ISO8859-2" "ISO8859-3" "ISO8859-4" "ISO8859-5"
[7] "ISO8859-6" "ISO8859-7" "UCS-2BE" "UCS-2LE" "US_ASCII" "UTF-16BE"
[13] "UTF-16LE" "UTF-32BE" "UTF-8" "UTF32-LE" "WCHAR_T"
> "UTF-8" %in% iconvlist()
[1] TRUE
Why is this necessary?
International conversion (iconv) of character encodings is a feature that R expects to be provided by the operating system, as stipulated in the R Installation and Administration manual. Operating systems provide their own implementations of iconv functionality, sometimes with fewer features. Since alpine is designed to be minimal, it is not surprising that it provides only what is necessary to meet the POSIX standards.
When R is built on a system it first checks the extent of iconv support from the host's C development libraries, before it compiles features into R's internals. Crucially, support for the C function iconvlist is checked for, which is not present on alpine, as shown in the apk build log for R: checking for iconvlist... no, so this C function is not available to R internally.
R's base::iconvlist() function will first try to get encodings using pre-compiled C code via .Internal(iconv(..., which will call iconvlist (in C) if available. As the iconvlist C function is not present on alpine, this .Internal call will always return NULL, and the R function will fall back to reading the info from the iconvlist file:
> iconvlist
function ()
{
int <- .Internal(iconv(NULL, "", "", "", TRUE, FALSE))
if (length(int))
return(sort.int(int))
icfile <- system.file("iconvlist", package = "utils")
# ... (truncated)
Why is the iconvlist file in an unexpected format?
The iconvlist file is created when R is built, from the command iconv -l, which lists the available encodings. This is the utility program at /usr/bin/iconv, not an R or C function. There is no standard for the format of the output of iconv -l. Alpine tries to conform to POSIX standards, and these only require that the -l option writes "values to standard output in an unspecified format".
R is expecting the file format to contain values separated by spaces (base::iconvlist() parses the file with strsplit(ext, "[[:space:]]")), which is true for other Linux variants, e.g. Debian, CentOS, but not for alpine's musl libc version, which has the commas.
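A quick illustration of the mismatch: splitting a comma-suffixed line on whitespace leaves the commas attached to the names, so no entry ever equals "UTF-8".
# space-splitting a line in alpine's format keeps the trailing commas
strsplit("ISO8859-1, ISO8859-2, UTF-8,", "[[:space:]]")[[1]]
# [1] "ISO8859-1," "ISO8859-2," "UTF-8,"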
Solution 2
A more rigorous solution is to build R from source using an alternative iconv C library implementation that provides the iconvlist C function. base::iconvlist() can then fetch the encodings via its .Internal(iconv(... call, and never needs to fall back to the iconvlist file.
An implementation that provides iconvlist is GNU libiconv, which has been packaged for alpine and can be installed with:
apk add gnu-libiconv gnu-libiconv-dev
The package gnu-libiconv-dev provides headers in /usr/include/gnu-libiconv/, so the compiler needs to be pointed here in preference to the existing ones in /usr/include. This is outside my expertise but can be done by adding -I/usr/include/gnu-libiconv to the CFLAGS environment variable.
export CFLAGS="-I/usr/include/gnu-libiconv $CFLAGS"
Running ./configure should yield check results similar to:
... (truncated)
checking for iconv.h... yes
checking for iconv... in libiconv
checking whether iconv accepts "UTF-8", "latin1", "ASCII" and "UCS-*"... yes
checking whether iconv accepts "CP1252"... yes
checking for iconvlist... yes
... (truncated)
After make I can run ./bin/R and, even if the iconvlist file still contains commas, calls to base::iconvlist() yield well-formatted results:
> iconvlist()
[1] "850"
[2] "862"
[3] "866"
[4] "ANSI_X3.4-1968"
[5] "ANSI_X3.4-1986"
... (truncated)
# The unsorted list is coming from the internal C functions, not the file
> .Internal(iconv(NULL, "", "", "", TRUE, FALSE))
[1] "ANSI_X3.4-1968"
[2] "ANSI_X3.4-1986"
[3] "ASCII"
[4] "CP367"
[5] "IBM367"
... (truncated)

Unicode character with subscript

I want to add a Unicode character which has two letters as subscripts to my plot legend in R.
The character is an r with an accent breve (ř) and the two letters are i and j.
I already looked at this question: Unicode character with superscript and tried to adapt the answers to my problem.
Here is what I tried:
plot(1,pch=NA,ylab="",xlab="",axes=F)
legend("top",legend=paste("1-","\u{0159}"),bty ="n",bg = "white",cex=2)
legend("center",legend=paste("1-","\u{0159}","\u{0069}","\u{006A}"),bty="n",bg = "white",cex=2)
legend("bottomleft",legend=expression("1-"*"\u0159"["\u0069"*"\u006A"]),bty="n",bg = "white",cex=2)
legend("bottomright", legend = quote("1-" *"\u0159"["\u0069"*"\u006A"]),bty="n",bg = "white",cex=2)
The resulting plot can be found below
Both the Unicode letter and the subscript work fine by themselves but not together.
paste() with any combination of [ ] returns an error, but I think this is to be expected, as paste() can't handle [ ] for subscripts.
The FAQ site on CRAN might give a hint as I am using Windows but I am not sure how to implement this:
3.6 I don't see characters with accents at the R console, for example in ?text.
You need to specify a font in Rconsole (see Q5.2) that supports the encoding in use. This used to be a problem in earlier versions of Windows, but now it is hard to find a font which does not.
Support for these characters within Rterm depends on the environment (the terminal window and shell, including locale and codepage settings) within which it is run, as well as the font used by the terminal window. Those are usually on legacy DOS settings and need to be altered.
It has to do with the system locale, as seen e.g. if you try
intToUtf8(345)
# [1] "ř"
iconv(intToUtf8(345), "utf-8", localeToCharset())
# [1] "r"
This should fix it (I used Czech but other locales will probably work too):
Sys.setlocale("LC_CTYPE", "czech")
# [1] "Czech_Czech Republic.1250"
text(..., labels = quote("\u{0159}"[ij]))
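Putting this together with the legend calls from the question, a sketch (assuming a Windows system where the Czech locale is available; "czech" is Windows locale naming):
# switch to a locale whose codepage contains ř, then use plotmath for the subscript
Sys.setlocale("LC_CTYPE", "czech")
plot(1, pch = NA, ylab = "", xlab = "", axes = FALSE)
legend("center", legend = quote("1-" * "\u{0159}"[ij]), bty = "n", cex = 2)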

renderMarkdown locally vs. shiny-server

For a shiny application, I have a small issue with renderMarkdown.
Consider a text file with the following simple contents:
Markdown Test File
+ Item 1
+ Item 2
Let's save this file as "Markdown Test.txt". Now, let's read it in and process it, using the following R code:
filename <- "Markdown Test.txt"
text.in <- readLines(filename)
text.out <- renderMarkdown(text=text.in)
When I run this locally - i.e. on my Windows machine - I get:
> text.out
[1] "<p>Markdown Test File</p>\n\n<ul>\n<li>Item 1</li>\n<li>Item 2</li>\n</ul>\n"
This looks good. However, running the same code on the machine that hosts shiny server, I get:
> text.out
[1] "<p>Markdown Test File+ Item 1+ Item 2</p>\n"
As you can see, the Markdown conversion is far from perfect; e.g. the list is not converted.
On the Windows machine I have:
> Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
On the shiny machine, I get:
> Sys.getlocale()
[1] "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"
So, I'm assuming that this has to do with the encoding, but the little I know about encoding I wish I didn't... my experiments with dos2unix and Sys.setlocale() led to nothing but frustration.
Would anyone happen to have a clever "one liner" that can fix this? Any help appreciated!
Thanks, Philipp
I'm not sure if R has a dedicated package to fix line endings, but one way is to use gsub() to replace \r\n with \n (or just strip the \r characters).
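A sketch of that normalization: stripping \r as suggested, plus (my own addition, since the server output looks like the lines were joined without breaks) collapsing the vector with explicit newlines before rendering. Whether the collapse is needed depends on the markdown package version, so treat it as an assumption:
library(markdown)

text.in  <- readLines("Markdown Test.txt")
text.in  <- gsub("\r", "", text.in)   # drop any stray carriage returns
# rejoin the lines with real \n so the renderer sees actual line breaks
text.out <- renderMarkdown(text = paste(text.in, collapse = "\n"))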
