Unicode character with subscript - r

I want to add a Unicode character which has two letters as subscripts to my plot legend in R.
The character is an r with a caron (ř, U+0159) and the two letters are i and j.
I already looked at this question: Unicode character with superscript and tried to adapt the answers to my problem.
Here is what I tried:
plot(1,pch=NA,ylab="",xlab="",axes=F)
legend("top",legend=paste("1-","\u{0159}"),bty ="n",bg = "white",cex=2)
legend("center",legend=paste("1-","\u{0159}","\u{0069}","\u{006A}"),bty="n",bg = "white",cex=2)
legend("bottomleft",legend=expression("1-"*"\u0159"["\u0069"*"\u006A"]),bty="n",bg = "white",cex=2)
legend("bottomright", legend = quote("1-" *"\u0159"["\u0069"*"\u006A"]),bty="n",bg = "white",cex=2)
The resulting plot can be found below
Both the Unicode letter and the subscript work fine by themselves but not together.
paste() with any combination of [ ] returns an error, which I think is to be expected, as paste() can't handle [ ] for subscripts.
The FAQ site on CRAN might give a hint as I am using Windows but I am not sure how to implement this:
3.6 I don't see characters with accents at the R console, for example in ?text.
You need to specify a font in Rconsole (see Q5.2) that supports the encoding in use. This used to be a problem in earlier versions of Windows, but now it is hard to find a font which does not.
Support for these characters within Rterm depends on the environment (the terminal window and shell, including locale and codepage settings) within which it is run as well as the font used by the terminal window. Those are usually on legacy DOS settings and need to be altered.

It has to do with the system locale, as seen e.g. if you try
# intToUtf8(345)
# [1] "ř"
# iconv(intToUtf8(345), "utf-8", localeToCharset())
# [1] "r"
This should fix it (I used Czech but other locales will probably work too):
Sys.setlocale("LC_CTYPE", "czech")
# [1] "Czech_Czech Republic.1250"
text(..., labels = quote("\u{0159}"[ij]))
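As a rough sketch of how the pieces fit together once the locale can represent the character: the label can also be built programmatically with bquote(), so the Unicode character and the subscript end up in a single plotmath expression. The names r_caron and label are illustrative and not part of the answer above.

```r
# Build "1-ř" with the subscript "ij" as one plotmath call.
r_caron <- intToUtf8(345)                # U+0159, the same string as "\u0159"
label <- bquote("1-" * .(r_caron)[ij])   # .() splices the string into the call

# Then, on an open plot (whether the glyph renders still depends on the
# locale and the font of the graphics device):
# legend("top", legend = label, bty = "n", cex = 2)
```

Because label is an ordinary call object, it can be passed to legend(), text() or title() in the same way as the quote() version.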

Related

R: read.csv introduced unreadable characters in one column name [duplicate]

I have a text file with Byte order mark (U+FEFF) at the beginning. I am trying to read the file in R. Is it possible to avoid the Byte order mark?
The function fread (from the data.table package) reads the file, but adds ļ»æ at the beginning of the first variable name:
> names(frame_pers)[1]
[1] "ļ»æreg_date"
The same happens with the read.csv function.
Currently I have made a function which removes the BOM from the first column name, but I believe there should be a way to strip the BOM automatically.
remove.BOM <- function(x) setnames(x, 1, substring(names(x)[1], 4))
> names(frame_pers)[1]
[1] "ļ»æreg_date"
> remove.BOM(frame_pers)
> names(frame_pers)[1]
[1] "reg_date"
I am using the native encoding for the R session:
> options("encoding" = "")
> options("encoding")
$encoding
[1] ""
Have you tried read.csv(..., fileEncoding = "UTF-8-BOM")? The ?file help page says:
As from R 3.0.0 the encoding ‘"UTF-8-BOM"’ is accepted and will remove
a Byte Order Mark if present (which it often is for files and webpages
generated by Microsoft applications).
For fread, this was fixed in data.table between versions 1.9.6 and 1.9.8 with this commit; update your data.table installation to fix this.
Once done, you can just use fread:
fread("file_name.csv")
I know it's been 8 years but I just had this problem and came across this so it might help. An important detail (mentioned by hadley above) is that it needs to be fileEncoding="UTF-8-BOM" not just encoding="UTF-8-BOM". "encoding" works for a few options but not UTF-8-BOM. Go figure. Found this out here: https://www.johndcook.com/blog/2019/09/07/excel-r-bom/
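A small self-contained check of the fileEncoding advice, as a sketch: it writes a temporary CSV that starts with the UTF-8 byte order mark and reads it back. The file contents (reg_date, value and the single data row) are made-up sample data.

```r
# Write a small CSV that starts with a UTF-8 byte order mark (EF BB BF).
tmp <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("reg_date,value\n2020-01-01,1\n")), tmp)

# fileEncoding (not encoding) is the argument that strips the BOM on read.
frame_pers <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
names(frame_pers)[1]

unlink(tmp)
```

If the first column name still carries stray characters after this, the file is probably not UTF-8 with a BOM in the first place.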

R: Encoding of labelled data and knit to html problems

First of all, sorry for not providing a reproducible example and for posting images; I explain why at the end.
I'd really appreciate some help, in comments or otherwise. I think I did my best to be as specific and concise as I can.
The problem I'm trying to solve is how (and where) to set up the encoding in order to get Polish letters after a .Rmd document is knitted to HTML.
I'm working with a labelled SPSS file imported into R via the haven library, and using sjPlot tools to make tables and graphs.
I have already spent almost all day trying to sort this out, but I feel stuck with no idea where to go.
My sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=Polish_Poland.1250 LC_CTYPE=Polish_Poland.1250 LC_MONETARY=Polish_Poland.1250
[4] LC_NUMERIC=C LC_TIME=Polish_Poland.1250
Whenever I run (via console / script)
sjt.frq(df$sex, encoding = "Windows-1250")
I get a nice table with proper encoding in the rstudio viewer pane:
Trying with no encoding sjt.frq(df$sex) gives this:
I could live with setting the encoding each time a call to sjt.frq is made, but the problem is that no matter how I set up sjt.frq inside a markdown document, it always gets knitted the wrong way.
Running the chunk inside the .Rmd is OK (for a completely unknown reason encoding = "UTF-8" worked here as well, although it didn't previously):
Knitting the same document is not OK:
(note that the HTML header has all the Polish characters)
Also, it looks like it could be either HTML- or sjPlot-specific, because knitr can print Polish letters when they are in a vector and are passed as if they were printed to the console:
Is there anything I can set up / change in order to make this work?
While testing different options I discovered that manually converting the sex variable to a factor and assigning the labels again works, and RStudio knits to HTML with the proper encoding:
df$sex <- factor(df$sex, label = c("kobieta", "mężczyzna"))
sjt.frq(df$sex, encoding = "Windows-1250")
Regarding no reproducible example:
I tried to simulate this example with fake data:
# Get libraries
library(sjPlot)
library(sjlabelled)
x <- rep(1:4, 4)
x<- set_labels(x, labels = c("ąę", "ćŁ", "óŚŚ", "abcd"))
# Run freq table similar to df$sex above
sjt.frq(x)
sjt.frq(x, encoding = "UTF-8")
sjt.frq(x, encoding = "Windows-1250")
Thing is, each sjt.frq call knits the way it should (although only encoding = "Windows-1250" renders properly in the RStudio viewer pane).
If you run sjt.frq(), a complete HTML-page is returned, which is displayed in a viewer.
However, for use inside markdown/knitr documents, only parts of the HTML output are required: you don't need the <head> part, for instance, as the knitr document creates its own header for the HTML page. Thus, there is a separate print() method for knitr documents, which uses a different return value for inclusion in the knitr file.
Compare:
dummy <- sjt.frq(df$sex, encoding = "Windows-1250")
dummy$output.complete # used for default display in viewer
dummy$knitr # used in knitr-documents
Since the encoding is located in the <meta>-tag, which is not included in the $knitr-value, the encoding-argument in sjt.frq() has no effect on knitr-documents.
I think that this might help you: rmarkdown::render_site(encoding = 'UTF-8'). Maybe there are also other options to encode text, or you need to modify the final HTML-file, changing the charset encoding there.
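One more option that may be worth trying, offered as an assumption rather than something the answer above confirms: re-encode the label text itself to UTF-8 before building the table, so that it no longer depends on a charset declaration in the page. The sketch below only demonstrates the conversion step with iconv(); the byte values spell "mężczyzna" as it would be stored in Windows-1250.

```r
# Bytes of "mężczyzna" as they would be stored in Windows-1250 (CP1250).
cp1250_bytes <- as.raw(c(0x6D, 0xEA, 0xBF, 0x63, 0x7A, 0x79, 0x7A, 0x6E, 0x61))
label_native <- rawToChar(cp1250_bytes)

# Re-encode to UTF-8; the result is marked as UTF-8 by R.
label_utf8 <- iconv(label_native, from = "CP1250", to = "UTF-8")
Encoding(label_utf8)
```

How to write the converted labels back into a labelled vector depends on the labelling package in use, so that step is left out here.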

Strange behaviour of regex in R

I have a simple web scraper that seems to behave strangely:
- in the desktop version of RStudio (running R version 3.3.3 on Windows) it behaves as expected and produces a numeric vector
- in the server version of RStudio (running R version 3.4.1 on Linux) the gsub() (and hence the numeric conversion afterwards) fails, and the code produces a vector of NAs.
Do you have any idea what could cause the difference?
library(rvest)
url <- "http://benzin.impuls.cz/benzin.aspx?strana=3"
impuls <- read_html(url, encoding = "windows-1250")
asdf <- impuls %>%
html_table()
Benzin <- asdf[[1]]$X7
chrBenzin <- gsub("\\sKč","",Benzin) # something is wrong here...
numBenzin <- as.double(chrBenzin)
numBenzin
The whitespace in the values is a hard space, U+00A0. After I ran the code, I got this output for Benzin (copy/pasted at ideone.com):
By then I was already sure those were hard spaces, but I double-checked here.
When dealing with hard spaces, there are two alternatives to try.
One is using [[:space:]] in a TRE (default regex engine in Base R functions).
The other is using a PCRE regex with a (*UCP) verb at the start to let the regex engine know we deal with Unicode.
In your case, on Linux, the PCRE version seems to work, so you should stick to it (PCRE is simply more consistent than TRE):
gsub("(*UCP)\\s+Kč","",Benzin, perl=TRUE)
A quick online test on Linux R:
Benzin <- "29.60 Kč"
gsub("(*UCP)\\s+Kč","",Benzin, perl=TRUE)
## => [1] "29.60"
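For comparison, here is a sketch that writes the hard space as an explicit escape, so the example does not depend on how the source file or the scraped page is encoded. The two price strings are made-up sample values. The second variant, which names U+00A0 directly in a character class, is an alternative that is not part of the answer above.

```r
# "\u00a0" is the hard (no-break) space; "\u010d" is "č".
Benzin <- c("29.60\u00a0K\u010d", "31.10\u00a0K\u010d")   # made-up sample values

# Variant 1: PCRE with (*UCP), as in the answer above.
ucp <- gsub("(*UCP)\\s+K\u010d", "", Benzin, perl = TRUE)

# Variant 2: name the hard space explicitly, so nothing depends on how
# \s is interpreted by the regex engine.
explicit <- gsub("[\\s\u00a0]+K\u010d", "", Benzin, perl = TRUE)

numBenzin <- as.double(explicit)
numBenzin
```

If as.double() still returns NA after this, printing utf8ToInt() of one raw value shows which code points are actually in the string.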

Putting a Cyrillic character in an R plot title: is there a command equivalent to "vfont"?

I would like to add a Cyrillic character to a plot title, but seem to be missing some crucial piece of information. Using "expression(...)" isn't suitable as it doesn't have the characters I require.
I currently have an oldish mac running OS X 10.10.5 with
R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet"
Platform: x86_64-apple-darwin13.4.0 (64-bit)
I can add Cyrillic to a plot, say
x<-runif(10,0,1); y<-runif(10,0,1); plot(x,y)
# first choose the Hershey font family
par(family="Hershey")
# The fifth table of demo(Hershey) helps find the three digit code I need
# for my desired symbol, \366 .
# Adding the character to the plot proceeds by using a snippet found
# elsewhere to get an octal code:
code<-366
ocode <- 64*(code%/%100) + 8*(code%/%10)%%10 + code%%10
string <- rawToChar(as.raw(ocode), multiple=TRUE)
# Adding the Cyrillic character to the plot is achieved by choosing the particular
# face in Hershey I require using vfont
text(x=0.5,y=0.5,vfont=c("serif","cyrillic"),labels=paste("Cyrillic sample", string[1] ) )
However, I cannot seem to find an equivalent of "vfont" for the plot title.
I've read many webpages and haven't seen this addressed. It could be that there's something subtle that I haven't understood. Any tips are welcome.
This is a kind of hacky answer (but still an answer) that isn't very general or elegant. As I don't know much about fonts or Unicode, this is probably as good as I can do in the short term. I will be very happy for a more general solution from someone who knows about such things.
Changing the font encoding solves the problem in this limited case
Sys.setlocale(category="LC_CTYPE", locale="ru_RU.KOI8-R")
quartz(); plot.new(); plot(x,y)
title(main="testing cyrillic \366")
[Figure: an example plot with a Cyrillic character in the title]
I'm sure there must be better ways of doing this, that don't require so much researching and playing around. Suggestions welcome!
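A possible alternative that avoids changing the locale, sketched under the assumption that the graphics device and its font can draw Cyrillic glyphs: convert the KOI8-R byte behind \366 into a UTF-8 string and pass that to title(). This is a different technique from the answer above. The names byte, zhe and main_title are illustrative.

```r
code <- 366                                     # the octal code \366 used above
byte <- strtoi(as.character(code), base = 8L)   # 246, i.e. 0xF6

# Interpret that byte as KOI8-R and convert it to a UTF-8 string.
zhe <- iconv(rawToChar(as.raw(byte)), from = "KOI8-R", to = "UTF-8")
main_title <- paste("testing cyrillic", zhe)

# On an open device whose font has Cyrillic glyphs:
# plot(x, y); title(main = main_title)
```

The same character can be written directly as "\u0416", which skips the conversion step entirely.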

read.csv replaces column-name characters like `?` with `.`, `-` with `...`

I'm using RStudio and my output on the Console gets truncated. I can't find how to stop the truncation (I tried searching ?options as well as googling around for longer than I'd like to admit).
EDIT: My apologies everyone! I originally had the long name as 'ThisIsAReallyReallyReallyReallyReallyLongName', but the issue only came up with the long name of 'Translation Service Info - Which translation service?'. I think I found the issue. The ... wasn't truncating, it was replacing the unknown characters like ? and - with . and ....
Code
# Load File
myfile <- read.csv(file="C:\\Users\\wliu\\Desktop\\myfile.csv",
sep=",", header=TRUE, stringsAsFactors=FALSE, skip=2)
# Get my column names
mycolnames <- colnames(myfile)
# When I request a shorter name, this returns the full name
mycolnames[1] # Assuming first col is a short name
[1] "ThisIsAShortName"
# However, when I request a longer name, this returns a truncated version
mycolnames[2] # Assuming second col is a really long name
[1] "ThisIsA...Long...Name"
I want to get back the non-truncated version of mycolnames[2] (e.g. "ThisIsAReallyReallyReallyReallyReallyLongName")
Setup
I'm on Windows 7 64bit, RStudio Version 0.98.1091, R version 3.0.1 (2013-05-16) -- "Good Sport" with Platform: x86_64-w64-mingw32/x64 (64-bit). I tried with 'Use Git Bash as shell for Git projects' on and off.
myfile.csv
ThisIsAShortName, Translation Service Info - Which translation service?
23143505, Yes
23143614, No
23143324, Yes
This is the expected behavior of read.csv, not a truncation problem in R. When you have spaces and special characters in the column names of a file, read.csv replaces each of them with a . unless you specify check.names = FALSE.
Here's a glimpse at make.names, which is how read.table produces the column names.
nm <- "Translation Service Info - Which translation service?"
make.names(nm)
# [1] "Translation.Service.Info...Which.translation.service."
And here's the relevant line from read.table
if (check.names)
col.names <- make.names(col.names, unique = TRUE)
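To see the effect of check.names side by side, here is a small sketch that reads a simplified inline copy of the sample file (no skipped lines, no spaces after the commas) with and without the check. The object names mangled and intact are illustrative.

```r
csv_text <- paste(
  "ThisIsAShortName,Translation Service Info - Which translation service?",
  "23143505,Yes",
  "23143614,No",
  sep = "\n"
)

# Default: column names are passed through make.names().
mangled <- read.csv(text = csv_text, stringsAsFactors = FALSE)
names(mangled)[2]

# check.names = FALSE keeps the header exactly as written in the file.
intact <- read.csv(text = csv_text, stringsAsFactors = FALSE, check.names = FALSE)
names(intact)[2]

# Non-syntactic names then need [[ ]] or backticks to be accessed.
intact[["Translation Service Info - Which translation service?"]]
```

Keeping the original names is convenient for display, but every later reference to such a column has to quote the name.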
In RStudio, use the menu:
Tools > Global Options > Code > Display
In the Console section, set Limit length of lines displayed in console to: to some number much larger than it currently is set to.
Click OK.
How about:
options(width=300)
Does that solve the issue?
