R: Encoding of labelled data and knit-to-HTML problems

First of all, sorry for not providing a reproducible example and for posting images; a word of explanation on why is at the end.
I'd really appreciate some help, comments or otherwise; I think I did my best to be as specific and concise as I can.
The problem I'm trying to solve is how (and where) to set up encoding so that Polish letters come out correctly after an .Rmd document is knitted to HTML.
I'm working with a labelled SPSS file imported into R via the haven library, and I'm using sjPlot tools to make tables and graphs.
I have already spent almost a whole day trying to sort this out, and I feel stuck with no idea where to go.
My sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=Polish_Poland.1250 LC_CTYPE=Polish_Poland.1250 LC_MONETARY=Polish_Poland.1250
[4] LC_NUMERIC=C LC_TIME=Polish_Poland.1250
Whenever I run (via console / script)
sjt.frq(df$sex, encoding = "Windows-1250")
I get a nice table with proper encoding in the RStudio viewer pane.
Trying it with no encoding, sjt.frq(df$sex), gives garbled Polish characters instead.
I could live with setting the encoding on every call to sjt.frq, but the problem is that no matter how I set up sjt.frq inside a markdown document, it always gets knitted the wrong way.
Running the chunk inside the .Rmd is OK (for a completely unknown reason, encoding = "UTF-8" worked here as well, although it didn't previously).
Knitting the same document is not OK (note that the HTML header does contain all the Polish characters).
Also, it looks like the problem could be HTML- or sjPlot-specific, because knitr can print Polish letters when they are in a plain vector and are printed as if to the console.
Is there anything I can set up or change in order to make this work?
While testing different options I discovered that manually converting the sex variable to a factor and assigning the labels again works, and RStudio then knits to HTML with proper encoding:
df$sex <- factor(df$sex, labels = c("kobieta", "mężczyzna"))
sjt.frq(df$sex, encoding = "Windows-1250")
Regarding the lack of a reproducible example:
I tried to simulate this example with fake data:
# Get libraries
library(sjPlot)
library(sjlabelled)
x <- rep(1:4, 4)
x<- set_labels(x, labels = c("ąę", "ćŁ", "óŚŚ", "abcd"))
# Run freq table similar to df$sex above
sjt.frq(x)
sjt.frq(x, encoding = "UTF-8")
sjt.frq(x, encoding = "Windows-1250")
The thing is, each sjt.frq call knits the way it should (although only encoding = "Windows-1250" renders properly in the RStudio viewer pane).

If you run sjt.frq(), a complete HTML page is returned and displayed in a viewer.
However, inside markdown/knitr documents only parts of the HTML output are required: you don't need the <head> part, for instance, because the knitr document creates its own header for the HTML page. Thus, there is a separate print() method for knitr documents, which uses a different return value to include in the knitr file.
Compare:
dummy <- sjt.frq(df$sex, encoding = "Windows-1250")
dummy$output.complete # used for default display in viewer
dummy$knitr # used in knitr-documents
Since the encoding is declared in the <meta> tag, which is not included in the $knitr value, the encoding argument of sjt.frq() has no effect in knitr documents.
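A quick way to see this, sketched here as a suggestion rather than part of the original answer, is to grep both return values for the charset declaration:
# Sketch: the charset <meta> tag should show up only in the complete page,
# not in the fragment that knitr includes.
dummy <- sjt.frq(df$sex, encoding = "Windows-1250")
grepl("charset", dummy$output.complete, fixed = TRUE)  # expected TRUE: full page with <head>
grepl("charset", dummy$knitr, fixed = TRUE)            # expected FALSE: body fragment only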
I think this might help you: rmarkdown::render_site(encoding = 'UTF-8'). There may also be other options to set the encoding, or you may need to modify the final HTML file and change the charset declaration there.
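For a standalone .Rmd (rather than a site), a similar route is to render from the console with an explicit encoding; the file name below is a placeholder, so treat this only as a sketch:
library(rmarkdown)
# Knit with an explicit encoding instead of relying on the Knit button;
# "my_report.Rmd" stands in for your own document.
render("my_report.Rmd", encoding = "UTF-8")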

Related

*platform-independent* language/locale settings?

tl;dr
Is it possible to change the LANGUAGE environment setting (or the LC_MESSAGES component of the locale) platform-independently, and if not (which is probably the case) to detect whether the value requested was in fact legal/took effect?
Context
One of my functions tried to detect the occurrence of a particular error by grep()ing the text of the error message. As previously pointed out on an R mailing list (I can't find the reference right now), this approach is fragile: the text of the message can change based on the setting of the LANGUAGE environment variable.
So I'm changing my code to not do that, but I'd like to add a regression test to the package that tests whether the problem occurs. And I'd like it to work (or at least not fail) across platforms.
As described e.g. in this question, I can use Sys.setenv(LANGUAGE=...) or Sys.setlocale("LC_MESSAGES",.) to change the message language. But this will only work if I guess the right LANGUAGE/LC_MESSAGES value, and it's available on the current system. (It seems that LANGUAGE is a language code, and LC_MESSAGES is a locale, and that the locale need not be installed if I just want to switch languages?)
Sys.getenv("LANGUAGE") ## [1] "en_CA:en"
test_msg <- function(msg, silent = FALSE) {
  grepl(msg, try(x, silent = silent))
}
test_msg("not found") ## TRUE ("object 'x' not found")
Sys.setenv(LANGUAGE="fr")
test_msg("introuvable")
## Erreur : objet 'x' introuvable
So far, so good. But resetting the LANGUAGE variable doesn't seem to work ...
Sys.setenv(LANGUAGE="en")
Sys.getenv("LANGUAGE")=="en" ## TRUE
test_msg("not found") ## FALSE ('introuvable')
Sys.setenv(LANGUAGE="en_CA:en")
test_msg("not found") ## FALSE ('introuvable')
Setting LC_MESSAGES to C seems more reliable ...
Sys.setlocale("LC_MESSAGES", "C")
test_msg("not found")
FWIW I've found that this sequence of commands actually seems to be unreliable, i.e. the same attempt to set the environment variable seems to work only sometimes, depending on what fiddling I've done previously ... I have to go back and keep starting with a clean session to get reproducible results ...
In any case, I know that if I were to try this on a computer without a French locale available, or where the locale/language name was different, that it wouldn't work. I'd like to be able to figure out whether the language setting has worked. I can use test_msg("introuvable") as brute force, but I'm hoping there's a more elegant option ...
relevant bits of sessionInfo():
R Under development (unstable) (2018-11-26 r75681)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS
locale:
[1] LC_CTYPE=en_CA.UTF8 LC_NUMERIC=C
[3] LC_TIME=en_CA.UTF8 LC_COLLATE=en_CA.UTF8
[5] LC_MONETARY=en_CA.UTF8 LC_MESSAGES=en_CA.UTF8
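One possible check, added here purely as a hedged sketch (it is not part of the original question, and language_works is a made-up helper name): translate a known base-R message with gettext() and see whether the result differs from the English template. Because message catalogs can be cached, this may be subject to the same unreliability described above.
# Sketch: report whether setting LANGUAGE to 'lang' yields a translated
# version of a known base-R error message.
language_works <- function(lang) {
  old <- Sys.getenv("LANGUAGE", unset = NA)
  on.exit(if (is.na(old)) Sys.unsetenv("LANGUAGE") else Sys.setenv(LANGUAGE = old))
  Sys.setenv(LANGUAGE = lang)
  translated <- gettext("object '%s' not found", domain = "R")
  # If the lookup still returns the English template, the language setting
  # probably did not take effect (or no translation is installed).
  !identical(translated, "object '%s' not found")
}
language_works("fr")  # TRUE if a French catalog was found and used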

encoding error with read_html

I am trying to web scrape a page. I thought of using the package rvest.
However, I'm stuck in the first step, which is to use read_html to read the content.
Here's my code:
library(rvest)
url <- "http://simec.mec.gov.br/painelObras/recurso.php?obra=17956"
obra_caridade <- read_html(url,
                           encoding = "ISO-8895-1")
And I got the following error:
Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html, :
Input is not proper UTF-8, indicate encoding !
Bytes: 0xE3 0x6F 0x20 0x65 [9]
I tried using what similar questions had as answers, but it did not solve my issue:
obra_caridade <- read_html(iconv(url, to = "UTF-8"),
                           encoding = "UTF-8")
obra_caridade <- read_html(iconv(url, to = "ISO-8895-1"),
                           encoding = "ISO-8895-1")
Both attempts returned a similar error.
Does anyone have any suggestions about how to solve this issue?
Here's my session info:
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=Portuguese_Brazil.1252 LC_CTYPE=Portuguese_Brazil.1252
[3] LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C
[5] LC_TIME=Portuguese_Brazil.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rvest_0.3.2 xml2_1.1.1
loaded via a namespace (and not attached):
[1] httr_1.2.1 magrittr_1.5 R6_2.2.1 tools_3.3.1 curl_2.6 Rcpp_0.12.11
What's the issue?
Your issue here is in correctly determining the encoding of the webpage.
The good news
Your approach looks like a good one to me since you looked at the source code and found the Meta charset, given as ISO-8895-1. It is certainly ideal to be told the encoding, rather than have to resort to guess-work.
The bad news
I don't believe that encoding exists. Firstly, when I search for it online the results tend to look like typos. Secondly, R provides you with a list of supported encodings via iconvlist(). ISO-8895-1 is not in the list, so entering it as an argument to read_html isn't useful. I think it'd be nice if entering a non-supported encoding threw a warning, but this doesn't seem to happen.
Quick solution
As suggested by @MrFlick in a comment, using encoding = "latin1" appears to work.
I suspect the Meta charset has a typo and it should read ISO-8859-1 (which is the same thing as latin1).
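A minimal sketch of that quick fix, reusing the url from the question:
library(rvest)
url <- "http://simec.mec.gov.br/painelObras/recurso.php?obra=17956"
# "latin1" is a recognised name for ISO-8859-1, which seems to be the
# page's actual encoding
obra_caridade <- read_html(url, encoding = "latin1")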
Tips on guessing an encoding
What is your browser doing?
When loading the page in a browser, you can see what encoding it is using to read the page. If the page looks right, this seems like a sensible guess. In this instance, Firefox uses Western encoding (i.e. ISO-8859-1).
Guessing with R
rvest::guess_encoding is a nice, user-friendly function which can give a quick estimate. You can provide the function with a url e.g. guess_encoding(url), or copy in phrases with more complex characters e.g. guess_encoding("Situação do Termo/Convênio:").
One thing to note about this function is that it can only detect from among 30 of the more common encodings, but there are many more possibilities.
As mentioned earlier, iconvlist() provides a list of supported encodings. By looping through these encodings and examining some text in the page to see if it's what we expect, we should end up with a shortlist of possible encodings (and rule many encodings out).
Sample code can be found at the bottom of this answer.
Final comments
All the above points towards ISO-8859-1 being a sensible guess for the encoding.
The page url contains a .br extension indicating it's Brazilian, and - according to Wikipedia - this encoding has complete language coverage for Brazilian Portuguese, which suggests it might not be a crazy choice for whoever created the webpage. I believe this is also a reasonably common encoding type.
Code
Sample code for 'Guessing with R' point 2 (using iconvlist()):
library(rvest)

url <- "http://simec.mec.gov.br/painelObras/recurso.php?obra=17956"

# 1. See which encodings don't throw an error
read_page <- lapply(unique(iconvlist()), function(encoding_attempt) {
  # Optional print statement to show progress (fraction of encodings tried),
  # since this can take some time
  print(match(encoding_attempt, iconvlist()) / length(iconvlist()))
  read_attempt <- tryCatch(expr = read_html(url, encoding = encoding_attempt),
                           error = function(condition) NA,
                           warning = function(condition) message(condition))
  return(read_attempt)
})
names(read_page) <- unique(iconvlist())

# 2. See which encodings correctly display some complex characters
read_phrase <- lapply(read_page, function(encoded_page)
  if (inherits(encoded_page, "xml_document"))
    html_text(html_nodes(encoded_page, ".dl-horizontal:nth-child(1) dt")))

# We've ended up with 27 encodings which could be sensible...
encoding_shortlist <- names(read_phrase)[sapply(read_phrase, identical, "Situação:")]

Putting a Cyrillic character in an R plot title: is there a command equivalent to "vfont"?

I would like to add a Cyrillic character to a plot title, but seem to be missing some crucial piece of information. Using "expression(...)" isn't suitable as it doesn't have the characters I require.
I currently have an oldish mac running OS X 10.10.5 with
R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet"
Platform: x86_64-apple-darwin13.4.0 (64-bit)
I can add Cyrillic to a plot, say
x <- runif(10, 0, 1); y <- runif(10, 0, 1); plot(x, y)
# First choose the Hershey font family
par(family = "Hershey")
# The fifth table of demo(Hershey) helps find the three-digit code I need
# for my desired symbol, \366.
# Adding the character to the plot proceeds by using a snippet found
# elsewhere to get an octal code:
code <- 366
ocode <- 64 * (code %/% 100) + 8 * (code %/% 10) %% 10 + code %% 10
string <- rawToChar(as.raw(ocode), multiple = TRUE)
# Adding the Cyrillic character to the plot is achieved by choosing the
# particular face in Hershey I require using vfont
text(x = 0.5, y = 0.5, vfont = c("serif", "cyrillic"),
     labels = paste("Cyrillic sample", string[1]))
However, I cannot seem to find an equivalent of "vfont" for the plot title.
I've read many webpages and haven't seen this addressed. It could be that there's something subtle that I haven't understood. Any tips are welcome.
This is a kind of hacky answer (but still an answer) that isn't very general or elegant. As I don't know much about fonts or Unicode, this is probably the best I can do in the short term. I will be very happy to see a more general solution from someone who knows about such things.
Changing the font encoding solves the problem in this limited case
Sys.setlocale(category = "LC_CTYPE", locale = "ru_RU.KOI8-R")
quartz(); plot.new(); plot(x, y)
title(main = "testing cyrillic \366")
[Example plot with a Cyrillic character in the title]
I'm sure there must be better ways of doing this, that don't require so much researching and playing around. Suggestions welcome!
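One further idea, offered only as a hedged sketch and not as part of the original post: text() does accept vfont, so the title could be drawn by hand in the top margin instead of using title() at all.
# Sketch (assumption): draw the title manually with text(), which supports vfont.
x <- runif(10, 0, 1); y <- runif(10, 0, 1)
cyr <- rawToChar(as.raw(0xF6))                    # the same \366 character as above
par(family = "Hershey", xpd = NA)                 # allow drawing in the margin
plot(x, y, main = "")                             # leave the usual title slot empty
text(x = mean(par("usr")[1:2]),
     y = par("usr")[4] + 0.06 * diff(par("usr")[3:4]),
     labels = paste("Cyrillic sample", cyr),
     vfont = c("serif", "cyrillic"), cex = 1.2)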

read.csv replaces column-name characters like `?` with `.`, `-` with `...`

I'm using RStudio and my output in the Console gets truncated. I can't figure out how to stop the truncation (I tried searching ?options as well as googling around for longer than I'd like to admit).
EDIT: My apologies, everyone! I originally had the long name as 'ThisIsAReallyReallyReallyReallyReallyLongName', but the issue only came up with the long name 'Translation Service Info - Which translation service?'. I think I found the issue: the ... wasn't truncation; the unknown characters like ? and - were being replaced with . and ....
Code
# Load file
myfile <- read.csv(file = "C:\\Users\\wliu\\Desktop\\myfile.csv",
                   sep = ",", header = TRUE, stringsAsFactors = FALSE, skip = 2)
# Get my column names
mycolnames <- colnames(myfile)
# When I request a shorter name, this returns the full name
mycolnames[1] # Assuming first col is a short name
[1] "ThisIsAShortName"
# However, when I request a longer name, this returns a truncated version
mycolnames[2] # Assuming second col is a really long name
[1] "ThisIsA...Long...Name"
I want to get back the non-truncated version of mycolnames[2] (e.g. "ThisIsAReallyReallyReallyReallyReallyLongName")
Setup
I'm on Windows 7 64bit, RStudio Version 0.98.1091, R version 3.0.1 (2013-05-16) -- "Good Sport" with Platform: x86_64-w64-mingw32/x64 (64-bit). I tried with 'Use Git Bash as shell for Git projects' on and off.
myfile.csv
ThisIsAShortName, Translation Service Info - Which translation service?
23143505, Yes
23143614, No
23143324, Yes
This is expected behavior from read.csv, not a truncation problem in R. When the column names in a file contain spaces or special characters, read.csv replaces each of them with a . unless you specify check.names = FALSE.
Here's a glimpse at make.names, which is how read.table produces the column names.
nm <- "Translation Service Info - Which translation service?"
make.names(nm)
# [1] "Translation.Service.Info...Which.translation.service."
And here's the relevant line from read.table
if (check.names)
    col.names <- make.names(col.names, unique = TRUE)
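A minimal sketch of the check.names fix, assuming the same file and skip = 2 as in the question:
myfile <- read.csv(file = "C:\\Users\\wliu\\Desktop\\myfile.csv",
                   header = TRUE, stringsAsFactors = FALSE, skip = 2,
                   check.names = FALSE)
colnames(myfile)[2]
# [1] "Translation Service Info - Which translation service?"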
In RStudio, use the menu:
Tools > Global Options > Code > Display
In the Console section, set "Limit length of lines displayed in console to:" to a number much larger than its current value.
Click OK.
How about:
options(width=300)
Does that solve the issue?

Importing non-English shapefiles

I'm trying to import a shapefile like this:
library(maptools)  # readShapeSpatial() comes from the maptools package
fn <- "Proj1"
my_shp <- readShapeSpatial(fn)
On a Windows computer (32-bit) it works OK, but when I do the same on an Ubuntu machine (64-bit, English OS, R 2.14.0), I get "Error in make.names(onames, unique = TRUE) : invalid multibyte string 9".
I suspect it is because the shapefile has Spanish origins, i.e. the names of polygons in it have accents like in "México" (not "Mexico").
As a quick fix, I did the import in Windows, saved the result as .rda and loaded it in Ubuntu, but then I get, for example, "M\xfexico" as a polygon name.
I'm not so experienced in Linux so I don't know if the fix is in R or in Ubuntu. Your help is highly appreciated.
The solution is to start R on the Ubuntu computer by typing "LC_ALL=C R" in a terminal window. Thanks to Oscar Perpiñán for the solution.
Update: I use RStudio, where as far as I know it is not possible to start R with command-line parameters, but this works from inside RStudio:
Sys.setlocale(category = "LC_ALL", locale = "C")
/Chris
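A small variation on that idea, sketched here as an assumption rather than something from the original answers: switch the locale only around the import and restore it afterwards, so the rest of the session keeps its normal settings. Whether changing LC_CTYPE alone is enough may be platform-dependent; the original fix set LC_ALL.
library(maptools)                                  # provides readShapeSpatial()
old_ctype <- Sys.getlocale("LC_CTYPE")             # remember the current setting
Sys.setlocale(category = "LC_CTYPE", locale = "C")
my_shp <- readShapeSpatial("Proj1")                # the shapefile from the question
Sys.setlocale(category = "LC_CTYPE", locale = old_ctype)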

Resources