R: load warnings - r

I have a file foo which, according to the file utility, is:
foo: LaTeX 2e document, UTF-8 Unicode text, with CRLF, LF line terminators
I have the following pseudocode:
txt <- readLines("foo")
....
save(txt, file="bar")
load("bar")
There were 50 or more warnings (use warnings() to see the first 50)
warnings()
Warning messages:
1: In load("bar") :
input string '...' cannot be translated to UTF-8, is it valid in 'ANSI_X3.4-1968'?
2: ...
The warnings cause parts of the txt character vector to be cut off when I later save it with writeLines.
I was unable to find an explanation for the warnings. They occur for lines in the TeX file that contain accented letters.

On Linux, R honors the LANG environment variable. If it is not set to a UTF-8 locale, R cannot handle UTF-8 text correctly.
So if R is started like this (no UTF-8):
LANG=en_US R
the console gives:
a <- rawToChar(as.raw(c(0xc3, 0xa0)))
a
# [1] "\303\240"
tmp <- file("/tmp/foo", encoding="UTF-8")
writeLines(a, tmp)
# Warning message:
# In writeLines(a, tmp) : invalid char string in output conversion
close(tmp)
readLines("/tmp/foo")
# character(0)
Without a UTF-8 locale, the value of a (an accented a) is printed as escaped bytes.
Forcing UTF-8 encoding when writing to the file gives a warning, and nothing is written.
Starting R with:
LANG=en_US.UTF-8 R
does not give errors:
a <- rawToChar(as.raw(c(0xc3, 0xa0)))
a
# [1] "à"
tmp <- file("/tmp/foo", encoding="UTF-8")
writeLines(a, tmp)
close(tmp)
readLines("/tmp/foo")
# [1] "à"
The problem is that it is difficult to write portable scripts, because their results depend on the environment variables.
On Windows, R does not react to the LANG variable, and the encoding depends on the language of the Windows installation. For Latin-script languages this is an ANSI code page. That makes scripts portable between Windows machines, but R does not properly read data files created on Linux, which are UTF-8.
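One way to reduce the dependence on the locale is to avoid re-encoding through it altogether. The following is a minimal sketch, assuming the strings already hold UTF-8 bytes; the temporary file stands in for a real data file:

```r
# Build "à" from its UTF-8 bytes and declare the encoding explicitly
a <- rawToChar(as.raw(c(0xc3, 0xa0)))
Encoding(a) <- "UTF-8"

path <- tempfile(fileext = ".txt")

# useBytes = TRUE writes the bytes as they are, without converting
# them to the encoding of the current locale
writeLines(a, path, useBytes = TRUE)

# encoding = "UTF-8" only marks the strings that are read as UTF-8;
# no conversion takes place
b <- readLines(path, encoding = "UTF-8")

# Compare bytes rather than printed output, which still depends on the locale
identical(charToRaw(b), charToRaw(a))
```

How the string is displayed at the console still depends on the locale, but the bytes on disk and in memory should be the same under both LANG settings.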

Related

Encoding discrepancy in RScript

I have been struggling with an encoding problem with a program that needs to run both in RStudio and using RScript. After wasting half a day on this I have a kludgy workaround, but would like to understand why the RScript version marks a string as latin1 when it is in fact UTF-8, and whether there is a better alternative to my solution. Example:
x <- "Ø28"
print(x)
print(paste("Marked as", Encoding(x)))
print(paste("Valid UTF = ", validUTF8(x)))
x <- iconv(x, "UTF-8", "latin1")
print(x)
In RStudio, the output is:
[1] "Ø28"
[1] "Marked as latin1"
[1] "Valid UTF = FALSE"
[1] NA
and when run using RScript from a batch file in Windows the output from the same code is:
[1] "Ã\23028"
[1] "Marked as latin1"
[1] "Valid UTF = TRUE"
[1] "Ø28"
In the latter case, it does not strike me as entirely helpful that a string defined within an R program by a simple assignment is marked as Latin-1 when in fact it is UTF-8. The solution I used in the end was to write a function that tests the actual (rather than declared) encoding of character variables using validUTF8, and if that returns TRUE, then use iconv to convert to latin1. It is still a bit of a pain since I have to call that repeatedly, and it would be better to have a global solution. There is quite a bit out there on encoding problems with R, but nothing that I can find that solves this when running programs with RScript. Any suggestions?
R 3.5.0, RStudio 1.1.453, Windows 7 / Windows Server 2008 (don't ask...)
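The workaround described above could look roughly like this. The function name to_latin1_if_utf8 is hypothetical (the original function was not posted), and the sketch assumes that any string that validates as UTF-8 really is UTF-8:

```r
# Hypothetical helper: convert only those elements whose bytes are valid UTF-8
to_latin1_if_utf8 <- function(x) {
  is_utf8 <- !is.na(x) & validUTF8(x)
  x[is_utf8] <- iconv(x[is_utf8], from = "UTF-8", to = "latin1")
  x
}

# "Ø28" built from raw bytes so the example does not depend on how
# this script file itself is encoded
utf8_bytes   <- rawToChar(as.raw(c(0xc3, 0x98, 0x32, 0x38)))  # UTF-8
latin1_bytes <- rawToChar(as.raw(c(0xd8, 0x32, 0x38)))        # Latin-1

res <- to_latin1_if_utf8(c(utf8_bytes, latin1_bytes))

# Both elements now hold the same Latin-1 bytes
identical(charToRaw(res[1]), charToRaw(res[2]))
```

Pure ASCII strings also pass validUTF8, but converting them is harmless because their bytes are the same in both encodings.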

Lesson containing non-ASCII characters produces an error when I try to run a test in swirlify

It seems that swirlify cannot handle non-ASCII characters (such as accented characters). Testing or running the demo with test_lesson() or demo_lesson() causes a file read error:
1: In readLines(con) :
invalid input found on input connection '..../lesson.yaml'
2: In readLines(con) :
incomplete final line found on '..../lesson.yaml'
The error comes from the line
con <- file(input, encoding = "UTF-8")
in the yaml.load_file function.
The solution is to save the YAML file in RStudio using "Save with Encoding" and choose UTF-8.
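The same re-encoding can be done from R instead of through the RStudio menu. This is a sketch only: it assumes the file was saved as Latin-1 (the post does not say which encoding it had), and it works on a throwaway temporary file rather than a real lesson.yaml:

```r
path <- tempfile(fileext = ".yaml")

# Create a sample file containing "café" encoded as Latin-1 (0xe9 is é)
writeBin(c(charToRaw("Output: caf"), as.raw(0xe9), charToRaw("\n")), path)

lines <- readLines(path, warn = FALSE)                 # bytes are read unchanged
lines <- iconv(lines, from = "latin1", to = "UTF-8")   # convert to UTF-8
writeLines(lines, path, useBytes = TRUE)               # write the UTF-8 bytes as-is

# Read the file back and mark the result as UTF-8
fixed <- readLines(path, encoding = "UTF-8")
charToRaw(fixed)
```

If the source encoding is something other than Latin-1, the from argument needs to change accordingly; iconvlist() shows the names available on a given system.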

Encoding in developing a R package

When I run devtools::check(), one warning appears:
checking data for non-ASCII characters ... WARNING
Warning: found non-ASCII string
'Tanaid<c3><a6>' in object 'data_m'
I did the following checks:
library(stringi)
stri_enc_mark("Tanaid<c3><a6>") # shows [1] "ASCII"
all(stri_enc_isutf8('Tanaid<c3><a6>')) # shows [1] TRUE
UPDATE
I followed up on the comment and attempted to convert the string from the native encoding to ASCII.
I tried iconv("Tanaid\xc3\xa6>", "native", "UTF-8").
However, iconv does not accept "native" as an encoding name and reports Error in iconv("Tanaidæ>", "native", "UTF-8") :
unsupported conversion from 'native' to 'UTF-8'
iconv("Tanaid\xc3\xa6", "latin1", "ASCII") and iconv("Tanaid\xc3\xa6", "latin2", "ASCII") also do not yield the right string.
A better solution is stri_trans_general("Tanaid\xc3\xa6", "latin-ascii"), which does the job.
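If the goal is to keep the accented character rather than transliterate it, another approach is to declare the encoding of the affected strings. As far as I know, the check warns about non-ASCII strings whose encoding is unmarked and only notes strings that are marked as UTF-8, but treat that as an assumption to verify. The helper name mark_utf8 is hypothetical, and data_m itself is not reproduced here:

```r
# Hypothetical helper: mark unmarked non-ASCII strings as UTF-8
mark_utf8 <- function(x) {
  # Strings that contain at least one byte above 0x7f
  non_ascii <- vapply(x, function(s) any(charToRaw(s) > as.raw(0x7f)),
                      logical(1), USE.NAMES = FALSE)
  # Only touch unmarked strings whose bytes are valid UTF-8
  todo <- non_ascii & Encoding(x) == "unknown" & validUTF8(x)
  Encoding(x[todo]) <- "UTF-8"
  x
}

# "Tanaid" followed by the UTF-8 bytes c3 a6 (æ), built from raw bytes
x <- c("Tanais", rawToChar(c(charToRaw("Tanaid"), as.raw(c(0xc3, 0xa6)))))

Encoding(x)             # encodings before marking
Encoding(mark_utf8(x))  # encodings after marking
```

For a data frame, the helper would be applied to each character column before saving the data with the package.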

trouble unzipping file under Windows

I have the following code:
download.file(
"http://www.wikipathways.org//wpi/batchDownload.php?species=Homo%20sapiens&fileType=txt",
destfile="human.zip")
files <- unzip( "human.zip", list=T)
It works on Linux, but throws the following error on Windows:
Error in unzip("human.zip", list = T) :
error -103 with zipfile in unzGetCurrentFileInfo
Do you happen to know what's the problem?
In ?download.file, we read that:
If mode is not supplied and url ends in one of .gz, .bz2, .xz, .tgz,
.zip, .rda or .RData a binary transfer is done. Since Windows (unlike
Unix-alikes) does distinguish between text and binary files, care is
needed that other binary file types are transferred with mode = "wb".
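Note that the URL in the question does not end in .zip (it ends in fileType=txt), so a binary transfer is not selected automatically even though the downloaded file is a zip archive. So you need to pass mode="wb".
I cannot reproduce your example, but mode="wb" solved my identical problem. Here is an example: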
url <- "https://www.bls.gov/cex/pumd/ce_pumd_interview_diary_dictionary.xlsx"
download.file(url, 'file1.xlsx')
download.file(url, 'file2.xlsx', mode="wb") # Try this instead
library(readxl)
read_xlsx('file1.xlsx', sheet='Variables') # Fails
# Error in sheets_fun(path) :
# Evaluation error: error -103 with zipfile in unzGetCurrentFileInfo
read_xlsx('file2.xlsx', sheet='Variables') # Works
# A tibble: 3,580 x 13

"Unrecognized escape in character string" while attempting to read a CSV file

I am trying to import a .csv file, so that I can follow along with this video: R ggplot2 Graphics Histograms.
I installed all proper packages including ggplot and related packages. The first instruction in the video says to type afl.df=read.csv("afl_2003_2007.csv")
So, I downloaded afl_2003_2007.csv file, and I tried all the below, which was basically putting the file in different directories (shared drive, then C drive, etc.). I also tried using setwd, but no luck.
I am using R on Windows.
Here's what I tried, and the errors I got:
> afl.df=read.csv("afl_2003_2007.csv")
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file 'afl_2003_2007.csv': No such file or directory
> afl.df=read.csv("\\the-lab.llnl.gov\llnlusers1\lopez235\Data\Documents\Dashboards,HRBI, Visulizations and Analytics\Math and Statistics and Predictive Modeling1\R Programming\afl_2003_2007.csv")
Error: '\l' is an unrecognized escape in character string starting "\\the-lab.llnl.gov\l"
> afl.df=read.csv("C:\Users\lopez235\Local-NOTBackedUp\R Files Local\afl_2003_2007.csv")
Error: '\U' used without hex digits in character string starting "C:\U"
> setwd("\\the-lab.llnl.gov\llnlusers1\lopez235\Data\Documents\Dashboards,HRBI, Visulizations and Analytics\Math and Statistics and Predictive Modeling1\R Programming\afl_2003_2007.csv")
Error: '\l' is an unrecognized escape in character string starting "\\the-lab.llnl.gov\l"
> setwd("\\the-lab.llnl.gov\llnlusers1\lopez235\Data\Documents\Dashboards,HRBI, Visulizations and Analytics\Math and Statistics and Predictive Modeling1\R Programming")
Error: '\l' is an unrecognized escape in character string starting "\\the-lab.llnl.gov\l"
> setwd("C:\Users\lopez235\Local-NOTBackedUp\R Files Local")
Error: '\U' used without hex digits in character string starting "C:\U"
Use / instead of \ in your path:
afl.df=read.csv("C:/Users/lopez235/Local-NOTBackedUp/R Files Local/afl_2003_2007.csv")
When I run into problems importing datasets, I prefer to use file.choose() and pick the file manually.
For example:
newdataset <- read.csv(file.choose(), header = T)
A window will pop up asking you to select the file. header = T (or TRUE) tells R that the first row contains the variable names. If the first row is data, use header = FALSE.
To confirm that R has picked up the variable names, you can call:
names(newdataset)
You can also use \\ instead of \:
afl.df=read.csv("C:\\Users\\lopez235\\Local-NOTBackedUp\\R Files Local\\afl_2003_2007.csv")
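The spellings suggested in these answers all describe the same path. Here is a small sketch that compares them, using a shortened, hypothetical version of the path from the question; the raw string form r"(...)" is an addition not mentioned above and needs R 4.0.0 or later:

```r
# Escaped backslashes: each \\ in the source is one backslash in the string
p1 <- "C:\\Users\\lopez235\\afl_2003_2007.csv"

# Raw string literal (R >= 4.0.0): backslashes are taken literally
p2 <- r"(C:\Users\lopez235\afl_2003_2007.csv)"

# file.path() joins the components with forward slashes
p3 <- file.path("C:", "Users", "lopez235", "afl_2003_2007.csv")

identical(p1, p2)                                  # same string
identical(gsub("\\", "/", p1, fixed = TRUE), p3)   # same path with / separators
```

The raw string form is convenient when pasting a path copied from Windows Explorer, since nothing has to be edited by hand.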

Resources