iconvlist() inconsistency on alpine linux - r

I have a docker container set up that is based on artemklevtsov/r-alpine:latest. When I run my R scripts I see this error:
Invalid encoding UTF-8: defaulting to UTF-8.
I tracked this down to this code in the httr library:
https://github.com/hadley/httr/blob/master/R/content-parse.r#L5
It looks like iconvlist() on alpine returns encodings that have a trailing comma at the end, ex:
iconvlist()
[1] "..." "ISO8859-1," "ISO8859-2," "ISO8859-3," "ISO8859-4,"
[6] "ISO8859-5," "ISO8859-6," "ISO8859-7," "UCS-2BE," "UCS-2LE,"
[11] "US_ASCII," "UTF-16BE," "UTF-16LE," "UTF-32BE," "UTF-8,"
Therefore UTF-8 never matches UTF-8,. Has anyone run into this issue before? The list of encodings I get on my local Mac (OSX) is correct and doesn't have trailing commas. It also doesn't happen on CentOS, so it looks like it's specific to alpine.
Is there a way to get around this? Maybe through a configuration in R or by modifying the iconvlist() output?

I have the same issue, this time from calling readr::read_csv, which uses base::iconvlist and gives the same error message: Invalid encoding UTF-8: defaulting to UTF-8. This is on alpine:3.12 using R 3.6.3 as provided by apk add R. Based on the details below, I think the issue will be present on any version of alpine and R unless steps have been taken to address it directly.
I found a couple of solutions. TLDR:
Remove the commas from the file at system.file("iconvlist", package = "utils"), or
Recompile R using the gnu-libiconv library for more comprehensive iconv support.
Solution 1
The base::iconvlist() function uses an iconvlist file as a fallback method to get the list of encodings the system supports. On alpine this fallback method will always be used, for reasons outlined below, but the iconvlist file contains commas, which R is not expecting.
The easiest solution is to remove the commas from the iconvlist file, which can be found with base::system.file().
> system.file("iconvlist", package = "utils")
[1] "/usr/lib/R/library/utils/iconvlist"
One way to remove the commas, from the command line (not R) is:
sed -i 's/,//g' /usr/lib/R/library/utils/iconvlist
Subsequent calls to base::iconvlist() will read and parse the new file without the commas, and other functions that rely on base::iconvlist() will be able to successfully check for support, e.g. for "UTF-8".
> iconvlist()
[1] "..." "ISO8859-1" "ISO8859-2" "ISO8859-3" "ISO8859-4" "ISO8859-5"
[7] "ISO8859-6" "ISO8859-7" "UCS-2BE" "UCS-2LE" "US_ASCII" "UTF-16BE"
[13] "UTF-16LE" "UTF-32BE" "UTF-8" "UTF32-LE" "WCHAR_T"
> "UTF-8" %in% iconvlist()
[1] TRUE
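If you would rather make the same fix from within R (for example in a Docker build step that already runs Rscript), a minimal sketch could look like the following. The helper name strip_iconvlist_commas is made up for illustration, and the demonstration runs on a temporary file with invented content rather than on the real iconvlist file.

```r
# Hypothetical helper (not part of R): remove every comma from an
# iconvlist-style file, rewriting the file in place.
strip_iconvlist_commas <- function(path) {
  lines <- readLines(path, warn = FALSE)
  writeLines(gsub(",", "", lines, fixed = TRUE), path)
  invisible(path)
}

# Demonstrate on a temporary file with made-up content,
# so the real file is left untouched.
demo_file <- tempfile("iconvlist")
writeLines(c("ISO8859-1, ISO8859-2,", "UTF-16BE, UTF-8,"), demo_file)

strip_iconvlist_commas(demo_file)
cleaned <- readLines(demo_file)
print(cleaned)
```

On a real system the call would be strip_iconvlist_commas(system.file("iconvlist", package = "utils")), which needs write permission on the R library directory (typically root during an image build).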
Why is this necessary?
International conversion (iconv) of character encodings is a feature that R expects to be provided by the operating system, stipulated in the R Administration and Installation Manual. Operating systems provide their own implementations of iconv functionality, sometimes with fewer features. Since alpine is designed to be minimal, it is not surprising that it provides only what is necessary to meet the POSIX standards.
When R is built on a system it first checks the extent of iconv support from the host's C development libraries, before it compiles features into R's internals. Crucially, support for the C function iconvlist is checked for, which is not present on alpine, as shown in the apk build log for R: checking for iconvlist... no, so this C function is not available to R internally.
R's base::iconvlist() function will first try to get encodings using pre-compiled C code via .Internal(iconv(..., which will call iconvlist (in C) if available. As the iconvlist C function is not present on alpine, this .Internal call will always return NULL, and the R function will fall back to reading the info from the iconvlist file:
> iconvlist
function ()
{
int <- .Internal(iconv(NULL, "", "", "", TRUE, FALSE))
if (length(int))
return(sort.int(int))
icfile <- system.file("iconvlist", package = "utils")
# ... (truncated)
Why is the iconvlist file in an unexpected format?
The iconvlist file is created when R is built, from the command iconv -l which lists the available encodings. This is the utility program at /usr/bin/iconv not an R or C function. There is no standard for the format of the output of iconv -l. Alpine tries to conform to POSIX standards, and these only require that the -l option writes "values to standard output in an unspecified format".
R is expecting the file format to contain values separated by spaces (base::iconvlist() parses the file with strsplit(ext, "[[:space:]]")), which is true for other Linux variants, e.g. Debian, CentOS, but not for alpine's musl libc version, which has the commas.
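To see why the commas survive, here is a small sketch that applies the same whitespace split to two sample strings. The strings are invented stand-ins for the two file formats, not copies of any real iconvlist file, and split_on_space is a hypothetical helper.

```r
# Made-up samples of the two formats described above.
comma_style <- "ISO8859-1, ISO8859-2, UTF-8,"
space_style <- "ISO8859-1 ISO8859-2 UTF-8"

# Split on whitespace only, the way the fallback branch does,
# and drop any empty strings.
split_on_space <- function(x) {
  parts <- unlist(strsplit(x, "[[:space:]]"))
  parts[nzchar(parts)]
}

from_commas <- split_on_space(comma_style)
from_spaces <- split_on_space(space_style)

print(from_commas)                 # each name keeps its trailing comma
print("UTF-8" %in% from_commas)    # FALSE
print("UTF-8" %in% from_spaces)    # TRUE
```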
Solution 2
A more rigorous solution is to build R from source using an alternative iconv C library implementation that provides the iconvlist C function. base::iconvlist() can then fetch the encodings via its .Internal(iconv(... call, and never needs to fall back to the iconvlist file.
An implementation that provides iconvlist is GNU libiconv, which has been packaged for alpine and can be installed with:
apk add gnu-libiconv gnu-libiconv-dev
The package gnu-libiconv-dev provides headers in /usr/include/gnu-libiconv/, so the compiler needs to be pointed here in preference to the existing ones in /usr/include. This is outside my expertise but can be done by adding -I/usr/include/gnu-libiconv to the CFLAGS environment variable.
export CFLAGS="-I/usr/include/gnu-libiconv $CFLAGS"
Running ./configure should yield check results similar to:
... (truncated)
checking for iconv.h... yes
checking for iconv... in libiconv
checking whether iconv accepts "UTF-8", "latin1", "ASCII" and "UCS-*"... yes
checking whether iconv accepts "CP1252"... yes
checking for iconvlist... yes
... (truncated)
After make I can run ./bin/R and, even if the iconvlist file still contains commas, calls to base::iconvlist() yield well-formatted results:
> iconvlist()
[1] "850"
[2] "862"
[3] "866"
[4] "ANSI_X3.4-1968"
[5] "ANSI_X3.4-1986"
... (truncated)
# The unsorted list is coming from the internal C functions, not the file
> .Internal(iconv(NULL, "", "", "", TRUE, FALSE))
[1] "ANSI_X3.4-1968"
[2] "ANSI_X3.4-1986"
[3] "ASCII"
[4] "CP367"
[5] "IBM367"
... (truncated)

Related

Encoding discrepancy in RScript

I have been struggling with an encoding problem with a program that needs to run both in RStudio and using RScript. After wasting half a day on this I have a kludgy workaround, but would like to understand why the RScript version marks a string as latin1 when it is in fact UTF-8, and whether there is a better alternative to my solution. Example:
x <- "Ø28"
print(x)
print(paste("Marked as", Encoding(x)))
print(paste("Valid UTF = ", validUTF8(x)))
x <- iconv(x, "UTF-8", "latin1")
print(x)
In RStudio, the output is:
[1] "Ø28"
[1] "Marked as latin1"
[1] "Valid UTF = FALSE"
[1] NA
and when run using RScript from a batch file in Windows the output from the same code is:
[1] "Ã\23028"
[1] "Marked as latin1"
[1] "Valid UTF = TRUE"
[1] "Ø28"
In the latter case, it does not strike me as entirely helpful that a string defined within an R program by a simple assignment is marked as Latin-1 when in fact it is UTF-8. The solution I used in the end was to write a function that tests the actual (rather than declared) encoding of character variables using validUTF8, and if that returns TRUE, then use iconv to convert to latin1. It is still a bit of a pain since I have to call that repeatedly, and it would be better to have a global solution. There is quite a bit out there on encoding problems with R, but nothing that I can find that solves this when running programs with RScript. Any suggestions?
R 3.5.0, RStudio 1.1.453, Windows 7 / Windows Server 2008 (don't ask...)
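For reference, the workaround described above might be sketched roughly as follows. The function name utf8_to_latin1 is hypothetical, and the sample strings are built from explicit escapes and raw bytes so that the result does not depend on how the script file itself is encoded.

```r
# Hypothetical helper: convert only the elements whose bytes are valid
# UTF-8, ignoring whatever encoding they are declared to have.
utf8_to_latin1 <- function(x) {
  is_utf8 <- !is.na(x) & validUTF8(x)
  x[is_utf8] <- iconv(x[is_utf8], from = "UTF-8", to = "latin1")
  x
}

utf8_input   <- "\u00d828"                              # "Ø28" as UTF-8 bytes
latin1_input <- rawToChar(as.raw(c(0xd8, 0x32, 0x38)))  # "Ø28" as Latin-1 bytes

converted <- utf8_to_latin1(c(utf8_input, latin1_input))

# Both elements now hold the same Latin-1 byte sequence.
print(lapply(converted, charToRaw))
```

One caveat with this approach: a Latin-1 string can occasionally happen to be valid UTF-8 as well, in which case it would be converted a second time, so this is a heuristic rather than a reliable detector.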

How to ensure english error messages in testthat unit tests

I have a lot of unit tests using the testthat package that expect english error messages.
If other developers run the tests on a computer configured for a non-English locale, the error messages are emitted in a different language and my tests fail.
How can I initialize testthat to change the language settings only during the test run-time without manually or permanently changing the language or test environment from outside of R (like e. g. proposed here: in R how to get error messages in english)?
library(testthat)
# works only in english locales...
expect_error(log("a"), "non-numeric argument to mathematical function", fixed = TRUE)
Edit 1: Changing the locale during run-time does not change the language of the error messages (using Ubuntu and OSX High Sierra):
Sys.setlocale( locale = "en_US.UTF-8")
Sys.getlocale() # en_US is active now but messages are still in another language
Edit 2: It seems that Sys.setenv("LANGUAGE"="EN") changes the error message language immediately (tested using OSX). Where should I put this command for testthat? In the testthat.R file?
The R console is in German language, how can I set R to English?
Edit 3: As a first work-around I have put
Sys.setenv("LANGUAGE"="EN") # work-around to always create english R (error) messages
into my testthat.R file under the tests folder (it seems to work, but I am not sure whether this is the right or best way...).
Setting Sys.setenv("LANGUAGE" = "EN") works for me as well.
However, when testing with devtools::test() (which is what Ctrl + Shift + T in RStudio does), I had to call Sys.setenv() in the test scripts inside the tests/testthat/ directory. The reason is that devtools::test() calls testthat::test_dir(), which bypasses the tests/testthat.R file.
So far, this did not have undesirable side-effects. The environment variable will only be set for that particular R process as described in the help page:
Sys.setenv sets environment variables (for other processes called from within R or future calls to Sys.getenv from this R process).
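The scope described in that help text can be checked with a short sketch like the one below. It assumes that an Rscript executable is available under R.home("bin"); the variable names are made up for illustration.

```r
# Remember the current value so it can be restored afterwards.
previous <- Sys.getenv("LANGUAGE", unset = NA)

Sys.setenv(LANGUAGE = "en")

# The current R process sees the new value immediately.
in_parent <- Sys.getenv("LANGUAGE")

# A child R process started from this session inherits it as well.
rscript  <- file.path(R.home("bin"), "Rscript")
in_child <- system2(
  rscript,
  c("--vanilla", "-e", shQuote('cat(Sys.getenv("LANGUAGE"))')),
  stdout = TRUE
)

print(c(parent = in_parent, child = in_child))

# Put things back the way they were.
if (is.na(previous)) Sys.unsetenv("LANGUAGE") else Sys.setenv(LANGUAGE = previous)
```

Nothing outside this R process and its children is affected, which is why the setting has to be repeated in every session that runs the tests.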
For completeness, you can also unset the variable again on Windows (see comments).
Sys.setenv("LANGUAGE" = "DE")
expect_error(log("a"), "Nicht-numerisches Argument")
Sys.setenv("LANGUAGE" = "FR")
expect_error(log("a"), "argument non numérique ")
Sys.unsetenv("LANGUAGE")
RStudio might also give trouble (I was not able to change the language there interactively), but when executing with devtools::test() it works.
Finally, wrapping it in a helper function.
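expect_error_lang <- function(..., lang = "EN") {
  # Remember the previous value so it can be restored afterwards
  old <- Sys.getenv("LANGUAGE", unset = NA)
  # on.exit() runs even if the expectation itself throws an error
  on.exit(if (is.na(old)) Sys.unsetenv("LANGUAGE") else Sys.setenv(LANGUAGE = old))
  Sys.setenv(LANGUAGE = lang)
  expect_error(...)
}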
#...
expect_error_lang(log("a"), "non-numeric")
expect_error_lang(log("a"), "Nicht-numerisches", lang = "DE")
expect_error_lang(log("a"), "argument non", lang = "FR")

Travis builds fail on dplyr vector calls within vignette

When I push my package to travis the vignette checks fail. I'm aware I can skip it but it's bugging me.
I have identified at least one of the issues belongs to a specific call in my code.
The RMD chunk in question:
get_storm_data("fstadv", link = al.1998.charley %>% .$Link)
al.1998.charley:
al.1998.charley <- structure(list(Year = 1998,
Name = "TROPICAL STORM CHARLEY",
Basin = "AL",
Link = "http://www.nhc.noaa.gov/archive/1998/1998CHARLEYadv.html"),
class = "data.frame",
row.names = c(NA, -1L),
.Names = c("Year", "Name", "Basin", "Link"))
The link parameter is a character string sent to the function get_storm_data which does some web scraping.
The following three values are identical:
a <- al.1998.charley %>% .$Link
b <- al.1998.charley %>% `[[`('Link')
c <- "http://www.nhc.noaa.gov/archive/1998/1998CHARLEYadv.html"
identical(a, b)
[1] TRUE
identical(b, c)
[1] TRUE
However, only the value c passed to the link parameter will pass travis.
The travis error:
Building with: R CMD build
6.09s$ R CMD build .
* checking for file ‘./DESCRIPTION’ ... OK
* preparing ‘Hurricanes’:
* checking DESCRIPTION meta-information ... OK
* installing the package to build vignettes
* creating vignettes ... ERROR
Quitting from lines 86-87 (getting-started.Rmd)
Error: processing vignette 'getting-started.Rmd' failed with diagnostics:
is.character(url) is not TRUE
Execution halted
All of my previous attempts to resolve this can be found here, if necessary.
Edit
This vignette never caused issues before. Build 16 passed. Nothing was different in the vignette. I did git diff between that commit and HEAD to bring the branch to its exact match (since the original branch had since been deleted). Other than cosmetic differences in other files (spaces, newlines, etc.), it still failed.
This led me to believe it was an upgrade or change on travis' end. I read over the blog but saw no related changes between the last successful pass (Build 16) and current.
Final Edit
When I mentioned in the last edit that the vignette had not changed since the last successful build, I should have been clearer. The entire commit that passed in March would not pass today. I'm still not clear why.
I would like to point out for any newcomers: check the package versions on travis to make sure they match your system. R CMD build on my system passed with no issues, but some of my packages were outdated compared to travis'. Thanks to @jimhester (GitHub) for pointing that out.
Session Info
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
_LC_CTYPE=en_US.UTF-8_, _LC_NUMERIC=C_, _LC_TIME=en_US.UTF-8_, _LC_COLLATE=en_US.UTF-8_, _LC_MONETARY=en_US.UTF-8_, _LC_MESSAGES=en_US.UTF-8_, _LC_PAPER=en_US.UTF-8_, _LC_NAME=C_, _LC_ADDRESS=C_, _LC_TELEPHONE=C_, _LC_MEASUREMENT=en_US.UTF-8_ and _LC_IDENTIFICATION=C_
attached base packages:
stats, graphics, grDevices, utils, datasets, methods and base
other attached packages:
Hurricanes(v.0.1.0), dplyr(v.0.5.0), purrr(v.0.2.2), readr(v.1.0.0), tidyr(v.0.6.1), tibble(v.1.2), ggplot2(v.2.2.1), tidyverse(v.1.0.0) and magrittr(v.1.5)
loaded via a namespace (and not attached):
Rcpp(v.0.12.7), digest(v.0.6.10), assertthat(v.0.1), R6(v.2.2.0), grid(v.3.4.0), plyr(v.1.8.4), DBI(v.0.5-1), gtable(v.0.2.0), scales(v.0.4.1), lazyeval(v.0.2.0), data.table(v.1.10.4), tools(v.3.4.0), pander(v.0.6.0), munsell(v.0.4.3), compiler(v.3.4.0) and colorspace(v.1.3-0)
Link in al.1998.charley is stored as a factor on Travis (perhaps you have set the stringsAsFactors option to FALSE somewhere locally?). httr won't coerce a factor vector to character, hence you get an error. Try wrapping as.character() around al.1998.charley %>% .$Link.
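A small sketch of the difference, using the URL from the question. The two data frames are built here only for illustration; how the real al.1998.charley object is created in the package is not shown in the question. Note that before R 4.0.0 the default for stringsAsFactors was TRUE.

```r
url <- "http://www.nhc.noaa.gov/archive/1998/1998CHARLEYadv.html"

# The same column built both ways.
link_as_factor <- data.frame(Link = url, stringsAsFactors = TRUE)
link_as_char   <- data.frame(Link = url, stringsAsFactors = FALSE)

print(is.character(link_as_factor$Link))                # FALSE: it is a factor
print(is.character(link_as_char$Link))                  # TRUE

# Coercing explicitly gives a plain string in either case.
coerced <- as.character(link_as_factor$Link)
print(identical(coerced, url))                          # TRUE
```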

read.csv replaces column-name characters like `?` with `.`, `-` with `...`

I'm using RStudio and my output on the Console gets truncated. I can't find how to stop the truncation (I tried searching ?options as well as googling around for longer than I'd like to admit).
EDIT: My apologies everyone! I originally had the long name as 'ThisIsAReallyReallyReallyReallyReallyLongName', but the issue only came up with the long name of 'Translation Service Info - Which translation service?'. I think I found the issue. The ... wasn't truncating, it was replacing the unknown characters like ? and - with . and ....
Code
# Load File
myfile <- read.csv(file="C:\\Users\\wliu\\Desktop\\myfile.csv",
sep=",", header=TRUE, stringsAsFactors=FALSE, skip=2)
# Get my column names
mycolnames <- colnames(myfile)
# When I request a shorter name, this returns the full name
mycolnames[1] # Assuming first col is a short name
[1] "ThisIsAShortName"
# However, when I request a longer name, this returns a truncated version
mycolnames[2] # Assuming second col is a really long name
[1] "ThisIsA...Long...Name"
I want to get back the non-truncated version of mycolnames[2] (e.g. "ThisIsAReallyReallyReallyReallyReallyLongName")
Setup
I'm on Windows 7 64bit, RStudio Version 0.98.1091, R version 3.0.1 (2013-05-16) -- "Good Sport" with Platform: x86_64-w64-mingw32/x64 (64-bit). I tried with 'Use Git Bash as shell for Git projects' on and off.
myfile.csv
ThisIsAShortName, Translation Service Info - Which translation service?
23143505, Yes
23143614, No
23143324, Yes
This is expected behavior by read.csv, not a truncation problem in R. When you have spaces and special characters in the column names of a file, read.csv replaces each of them with a . unless you specify check.names = FALSE
Here's a glimpse at make.names, which is how read.table produces the column names.
nm <- "Translation Service Info - Which translation service?"
make.names(nm)
# [1] "Translation.Service.Info...Which.translation.service."
And here's the relevant line from read.table
if (check.names)
col.names <- make.names(col.names, unique = TRUE)
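A self-contained sketch of the effect of check.names, reading an inline copy of the sample data through the text argument instead of a file on disk (the variable names are made up for illustration):

```r
csv_text <- paste(
  "ThisIsAShortName,Translation Service Info - Which translation service?",
  "23143505,Yes",
  "23143614,No",
  sep = "\n"
)

# Default behaviour: column names are passed through make.names().
checked <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# check.names = FALSE keeps the header exactly as written.
unchecked <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                      check.names = FALSE)

print(colnames(checked)[2])
print(colnames(unchecked)[2])
```

With the original header kept, the column has to be accessed with backticks or [[ ]], for example unchecked[["Translation Service Info - Which translation service?"]].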
In RStudio, use the menu:
Tools > Global Options > Code > Display
In the Console section, set Limit length of lines displayed in console to: to some number much larger than it currently is set to.
Click OK.
How about:
options(width=300)
Does that solve the issue?

Remove a library from .libPaths() permanently without Rprofile.site

How can I permanently remove a library in R?
.libPaths()
[1] "\\\\per-homedrive1.corp.riotinto.org/homedrive$/Tommy.O'Dell/R/win-library/2.15"
[2] "C:/Program Files/R/R-2.15.2/library"
[3] "C:/Program Files/RStudio/R/library"
The first item is my corporate "My Documents" folder, and the apostrophe in the path from my surname is causing all kinds of grief when using R CMD INSTALL --build on a package I'm making, not to mention issues using packages installed there when I'm offline from the network.
I want to use C:/Program Files/R/R-2.15.2/library as the default instead, but I don't want to have to rely on an Rprofile.site.
What I've tried
> .libPaths(.libPaths()[2:3])
> .libPaths()
[1] "C:/Program Files/R/R-2.15.2/library" "C:/Program Files/RStudio/R/library"
That seems to work, but only until I restart my R session, and then I'm back to the original .libPaths() output...
Restarting R session...
> .libPaths()
[1] "\\\\per-homedrive1.corp.riotinto.org/homedrive$/Tommy.O'Dell/R/win-library/2.15"
[2] "C:/Program Files/R/R-2.15.2/library"
[3] "C:/Program Files/RStudio/R/library"
I thought maybe .libPaths() was using R_LIBS_USER
> Sys.getenv("R_LIBS_USER")
[1] "//per-homedrive1.corp.riotinto.org/homedrive$/Tommy.O'Dell/R/win-library/2.15"
So I've tried to unset it using Sys.unsetenv("R_LIBS_USER") but it doesn't persist between sessions.
Additional Info
If it matters, here are some environment variables that might be relevant...
> Sys.getenv("R_HOME")
[1] "C:/PROGRA~1/R/R-215~1.2"
> Sys.getenv("R_HOME")
[1] "C:/PROGRA~1/R/R-215~1.2"
> Sys.getenv("R_USER")
[1] "//per-homedrive1.corp.riotinto.org/homedrive$/Tommy.O'Dell"
> Sys.getenv("R_LIBS_USER")
[1] "//per-homedrive1.corp.riotinto.org/homedrive$/Tommy.O'Dell/R/win-library/2.15"
> Sys.getenv("R_LIBS_SITE")
[1] ""
I've tried Sys.unsetenv("R_LIBS_USER") but this also doesn't stick between sessions
Just set the environment variable R_LIBS in Windows to something like
R_LIBS=C:/Program Files/R/R-2.15.2/library
Restart R.
This is a rather late response to the question, but it might be useful for others.
In order to set up my own path (and remove one of the original ones) I:
used .libPaths() inside R to check the current library paths;
identified which paths to keep. In my case, I kept R's original library but removed the link to My Documents.
found the R home path using R.home() or Sys.getenv("R_HOME");
R-Home\R-3.2.2\etc\Rprofile.site is read every time R starts. Therefore, any modification to it will persist across every run of R.
Edited Rprofile.site by adding the following,
.libPaths(.libPaths()[2])
.libPaths("d:/tmp/R/win-library/3.2")
How does it work?
The first line removes all but one path (the second from the original list); the second line adds an additional path. We end up with two paths.
Note that I use forward slashes in the path despite being on Windows; R accepts this notation on every operating system.
Restarted R (using Ctrl+Shift+F10).
This will work every time now.
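To check what a call to .libPaths() actually leaves you with, a sketch like the following can be run in any session. It uses a temporary directory as a stand-in library (an assumption for illustration, not a recommended location) and restores the original paths at the end. As far as I understand it, .libPaths(new) puts the new paths first and then re-appends the site and system libraries.

```r
original_paths <- .libPaths()

# A throwaway directory standing in for a custom library location.
demo_lib <- file.path(tempdir(), "demo-library")
dir.create(demo_lib, showWarnings = FALSE)

.libPaths(demo_lib)
current_paths <- .libPaths()

# The new directory comes first; the system library is still present.
print(current_paths)

# Restore the session to its previous state.
.libPaths(original_paths)
```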
Use this function in .Rprofile
set_lib_paths <- function(lib_vec) {
lib_vec <- normalizePath(lib_vec, mustWork = TRUE)
shim_fun <- .libPaths
shim_env <- new.env(parent = environment(shim_fun))
shim_env$.Library <- character()
shim_env$.Library.site <- character()
environment(shim_fun) <- shim_env
shim_fun(lib_vec)
}
set_lib_paths("~/code/library") # where "~/code/library" is your package directory
Original Source: https://milesmcbain.xyz/hacking-r-library-paths/
I have put the Sys.unsetenv("R_LIBS_USER") command in a .Rprofile file in my windows "own documents" folder. Seems to help. My problem was that being in an active directory environment made R upstart and package loading incredibly slow when connected via vpn.
If you want to do this in the Rprofile file (#library/base/R/), you can search for the lines where the R_LIBS_* environment variables are set (e.g. Sys.setenv(R_LIBS_SITE=....) and Sys.setenv(R_LIBS_USER=.....)).
You can also search the code for .libPaths(), which sets the library tree. So you can achieve your goal by a combination of commenting out, unsetting and setting the R_LIBS variables before the .libPaths() call as you wish. For example, something like:
Sys.unsetenv("R_LIBS")
Sys.unsetenv("R_LIBS_USER")
Sys.setenv(R_LIBS_SITE = "D:/R/libs/site")
Sys.setenv(R_LIBS_USER = "D:/R/libs/user")
Sys.setenv(R_LIBS = "D:/R/libs")

Resources