Strange behaviour of regex in R

Strange behaviour of regex in R - r

I have a simple web scraper that seems to behave strangely:
- in the desktop version of RStudio (running R version 3.3.3 on Windows) it behaves as expected and produces a numeric vector
- in the server version of RStudio (running R version 3.4.1 on Linux) the gsub() (and hence the numeric conversion afterwards) fails, and the code produces a vector of NAs.
Do you have any idea what could cause the difference?
library(rvest)
url <- "http://benzin.impuls.cz/benzin.aspx?strana=3"
impuls <- read_html(url, encoding = "windows-1250")
asdf <- impuls %>%
html_table()
Benzin <- asdf[[1]]$X7
chrBenzin <- gsub("\\sKč","",Benzin) # something is wrong here...
numBenzin <- as.double(chrBenzin)
numBenzin

The whitespace in the values is a hard space, U+00A0. After I ran the code, I got this output for Benzin (copy/pasted at ideone.com):
Then, I was already sure those were hard spaces, but I doubled checked here.
What shall we do when we have hard spaces is to try two alternatives.
One is using [[:space:]] in a TRE (default regex engine in Base R functions).
The other is using a PCRE regex with a (*UCP) verb at the start to let the regex engine know we deal with Unicode.
In your case, on Linux, it seems the PCRE works, thus you should stick to the PCRE version (that is just more consistent than TRE):
gsub("(*UCP)\\s+Kč","",Benzin, perl=TRUE)
A quick online test on Linux R:
Benzin <- "29.60 Kč"
gsub("(*UCP)\\s+Kč","",Benzin, perl=TRUE)
## => [1] "29.60"

Related

str_detect producing vector related errors in R code (which previously worked) since update 1.5.0

I'm trying to do some simple str_detects as follows:
index1 <- str_detect(colnames(DataFrame), paste0("^", name_))
also, name_ is just a character string so paste0("^", name_)) is of length 1.
which yields the following error:
Error in stop_vctrs(): ! Input must be a vector, not an environment.
When I check rlang::last_error() I get:
`Backtrace:
stringr::str_detect(colnames(DataFrame), paste0("^", name_))
vctrs:::stop_scalar_type(<fn>(<env>), "")
vctrs:::stop_vctrs(msg, "vctrs_error_scalar_type", actual = x)`
I know that in this instance I could use the base R alternative:
grep(paste0("^", name_), colanmes(DataFrame))
but the issue is that I have many long scripts which feature str_detect many times...
I'd like to understand the ways around this new error so that I can best fix all these instances in my code, thank you.
I have read the update on Stringr 1.5.0 written by Hadley about the stricter vector definitions which have been implemented in tidyverse but I still pose my question
EDIT: uninstallation and reinstallation of R/studio/tools fixed the issue

Encoding discrepancy in RScript

I have been struggling with an encoding problem with a program that needs to run both in RStudio and using RScript. After wasting half a day on this I have a kludgy workaround, but would like to understand why the RScript version marks a string as latin1 when it is in fact UTF-8, and whether there is a better alternative to my solution. Example:
x <- "Ø28"
print(x)
print(paste("Marked as", Encoding(x)))
print(paste("Valid UTF = ", validUTF8(x)))
x <- iconv(x, "UTF-8", "latin1")
print(x)
In RStudio, the output is:
[1] "Ø28"
[1] "Marked as latin1"
[1] "Valid UTF = FALSE"
[1] NA
and when run using RScript from a batch file in Windows the output from the same code is:
[1] "Ã\23028"
[1] "Marked as latin1"
[1] "Valid UTF = TRUE"
[1] "Ø28"
In the latter case, it does not strike me as entirely helpful that a string defined within an R program by a simple assignment is marked as Latin-1 when in fact it is UTF-8. The solution I used in the end was to write a function that tests the actual (rather than declared) encoding of character variables using validUTF8, and if that returns TRUE, then use iconv to convert to latin1. It is still a bit of a pain since I have to call that repeatedly, and it would be better to have a global solution. There is quite a bit out there on encoding problems with R, but nothing that I can find that solves this when running programs with RScript. Any suggestions?
R 3.5.0, RStudio 1.1.453, Windows 7 / Windows Server 2008 (don't ask...)

Unicode character with subscript

I want to add a Unicode character which has two letters as subscripts to my plot legend in R.
The character is an r with an accent breve (ř) and the two letters are i and j.
I already looked at this question: Unicode character with superscript and tried to adapt the answers to my problem.
Here is what I tried:
plot(1,pch=NA,ylab="",xlab="",axes=F)
legend("top",legend=paste("1-","\u{0159}"),bty ="n",bg = "white",cex=2)
legend("center",legend=paste("1-","\u{0159}","\u{0069}","\u{006A}"),bty="n",bg = "white",cex=2)
legend("bottomleft",legend=expression("1-"*"\u0159"["\u0069"*"\u006A"]),bty="n",bg = "white",cex=2)
legend("bottomright", legend = quote("1-" *"\u0159"["\u0069"*"\u006A"]),bty="n",bg = "white",cex=2)
The resulting plot can be found below
Both the Unicode letter and the subscript work fine by themselves but not together.
paste() with any combination of [ ] does return an error, but I think this is to be expected as paste can't handle [ ] for subscripts.
The FAQ site on CRAN might give a hint as I am using Windows but I am not sure how to implement this:
3.6 I don't see characters with accents at the R console, for example in ?text.
You need to specify a font in Rconsole (see Q5.2) that supports the encoding in use. This used to be a problem in earlier versions of Windows, but now it is hard to find a font which does not.
Support for these characters within Rterm depends on the environment (the terminal window and shell, including locale and codepage settings) within which it is run as well as the font used by the terminal window. Those are usually on legacy DOS settings and need to altered.

It has to do with the system locale, as seen e.g. if you try
# intToUtf8(345)
# [1] "ř"
# iconv(intToUtf8(345), "utf-8", localeToCharset())
# [1] "r"
This should fix it (I used Czech but other locales will probably work too):
Sys.setlocale("LC_CTYPE", "czech")
# [1] "Czech_Czech Republic.1250"
text(..., labels = quote("\u{0159}"[ij]))

read.csv replaces column-name characters like `?` with `.`, `-` with `...`

I'm using RStudio and my output on the Console gets truncated. I can't find how to stop the truncation (I tried searching ?options as well as googling around for longer than I'd like to admit).
EDIT: My apologies everyone! I originally had the long name as 'ThisIsAReallyReallyReallyReallyReallyLongName', but the issue only came up with the long name of 'Translation Service Info - Which translation service?'. I think I found the issue. The ... wasn't truncating, it was replacing the unknown characters like ? and - with . and ....
Code
# Load File
myfile <- read.csv(file="C:\\Users\\wliu\\Desktop\\myfile.csv",
sep=",", header=TRUE, stringsAsFactors=FALSE, skip=2)
# Get my column names
mycolnames <- colnames(myfile)
# When I request a shorter name, this returns the full name
mycolnames[1] # Assuming first col is a short name
[1] "ThisIsAShortName"
# However, when I request a longer name, this returns a truncated version
mycolnames[2] # Assuming second col is a really long name
[1] "ThisIsA...Long...Name"
I want to get back the non-truncated version of mycolnames[2] (e.g. "ThisIsAReallyReallyReallyReallyReallyLongName")
Setup
I'm on Windows 7 64bit, RStudio Version 0.98.1091, R version 3.0.1 (2013-05-16) -- "Good Sport" with Platform: x86_64-w64-mingw32/x64 (64-bit). I tried with 'Use Git Bash as shell for Git projects' on and off.
myfile.csv
ThisIsAShortName, Translation Service Info - Which translation service?
23143505, Yes
23143614, No
23143324, Yes

This is expected behavior by read.csv, not a truncation problem in R. When you have spaces and special characters in the column names of a file, read.csv replaces each of them with a . unless you specify check.names = FALSE
Here's a glimpse at make.names, which is how read.table produces the column names.
nm <- "Translation Service Info - Which translation service?"
make.names(nm)
# [1] "Translation.Service.Info...Which.translation.service."
And here's the relevant line from read.table
if (check.names)
col.names <- make.names(col.names, unique = TRUE)

In RStudio, use the menu:
Tools > Global Options > Code > Display
In the Console section, set Limit length of lines displayed in console to: to some number much larger than it currently is set to.
Click OK.

How about:
options(width=300)
Does that solve the issue?

How can I generate a GUID in R?

How can I generate GUIDs and UUIDs in R?
I would like to be able to generate GUIDs based on the hardware etc. of the machine running the rsession.
As a fallback, however, I would be happy to create UUIDs that comply with rfc4122.
Is there a package that can create GUIDs? Otherwise, does someone have some RFC4122 compatible UUID code lying about?

The optimal choice for this now is the uuid package. It consists of one function (UUIDgenerate) that doesn't rely on R's internal random number generators and so doesn't suffer any consequences from using set.seed in a session as #thelatemail's answer does. You can choose to have the UUID be generated either by the package's internal random number generator or based on time.

If you are using R in Unix environment, you can get UUID in R using system() command.
On Linux (Ubuntu 12.04 LTS):
my_uuid <- system("uuid",intern=T)
my_uuid
[1] 0f62f1de-418d-11e3-8a19-cb0ceccb58ec
On Mac OS X 10.8:
my_uuid <- system("uuidgen", intern=T)
my_uuid
[1] 9A9D64DF-EB01-47E7-B16E-DC0343951883
Far as I know, both uuid and uuidgen follows UUID Version 4 format.

I know nothing about the intricacies of UUID's, but would something like this do?
baseuuid <- paste(sample(c(letters[1:6],0:9),30,replace=TRUE),collapse="")
paste(
substr(baseuuid,1,8),
"-",
substr(baseuuid,9,12),
"-",
"4",
substr(baseuuid,13,15),
"-",
sample(c("8","9","a","b"),1),
substr(baseuuid,16,18),
"-",
substr(baseuuid,19,30),
sep="",
collapse=""
)
# result like: "f7bd11ed-fca9-42e5-8c3e-4464cd02e0fa"
This should be in line with http://en.wikipedia.org/wiki/Uuid#Version_4_.28random.29

Bioconductor had an Ruuid package that doesn't seem to be there anymore, but a google search on "Ruuid" will point to places that you can download it.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Strange behaviour of regex in R - r

Related

str_detect producing vector related errors in R code (which previously worked) since update 1.5.0

Encoding discrepancy in RScript

Unicode character with subscript

read.csv replaces column-name characters like `?` with `.`, `-` with `...`

How can I generate a GUID in R?

Categories

Resources