Source a script file that contains Unicode (Farsi) characters in R

Write the text below in a buffer and save it as a .R script:
letters_fa <- c('الف','ب','پ','ت','ث','ج','چ','ح','خ','ر','ز','د')
Then try these lines to source() it:
library(magrittr)  # for the %>% pipe

script <- "path/to/script.R"

file(script, encoding = "UTF-8") %>%
  readLines()                      # works fine

file(script, encoding = "UTF-8") %>%
  source()                         # works fine

source(script)                     # the Farsi letters in the environment are misrepresented

source(script, encoding = "UTF-8") # gives an error
The last line throws an error. I tried to debug it, and I believe there is a bug in the source() function, in the following lines:
...
loc <- utils::localeToCharset()[1L]
...
The error occurs at the .Internal(parse( line:
...
exprs <- if (!from_file) {
    if (length(lines))
        .Internal(parse(stdin(), n = -1, lines, "?",
            srcfile, encoding))
    else expression()
}
else .Internal(parse(file, n = -1, NULL, "?", srcfile,
    encoding))
...
The exact error is:
Error in source(script, encoding = "UTF-8") :
  script.R:2:17: unexpected INCOMPLETE_STRING
1: #' #export
2: letters_fa <- c('
                   ^

The solution to this problem is to either change the OS locale to a native locale (e.g. Persian in this case) or use the built-in function Sys.setlocale(locale = "Persian") to change the R session's locale.
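For example, a minimal sketch of the second option ("Persian" is a Windows-style locale name; other platforms may need a different string):
Sys.setlocale(locale = "Persian")
source(script, encoding = "UTF-8") # should now parse the Farsi strings correctly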

Alternatively, use source() without specifying the encoding, and then fix the vector's encoding afterwards with Encoding():
source(script)
letters_fa
# [1] "الÙ\u0081" "ب" "Ù¾" "ت" "Ø«"
# [6] "ج" "چ" "ح" "خ" "ر"
# [11] "ز" "د"
Encoding(letters_fa) <- "UTF-8"
letters_fa
# [1] "الف" "ب" "پ" "ت" "ث" "ج" "چ" "ح" "خ" "ر" "ز" "د"

Related

Can I import variables into R from a global file?

I am integrating an R script to produce some graphics into a larger project that is pulled together with a Makefile. In this larger project, I have a file called globals.mk that contains global variables used by many other scripts in the project. For example, the number of simulations I want to run is a global that I want to use in this R script. Can I "import" this as a variable, or is it necessary to manually define every variable within the R script?
Edit: here is a sample of the globals that I would need to read in.
num = 100
path = ./here/is/a/path
file = $(path)/file.csv
And I would like the R script to set the variables num as 100 (or "100"), path as "./here/is/a/path" and file as "./here/is/a/path/file.csv".
If it is OK to replace the parentheses with brace brackets, then readRenviron() will read in such files and perform the substitutions, returning the contents as environment variables.
# write out test file globals2.mk which uses brace brackets
Lines <- "num = 100
path = ./here/is/a/path
file = ${path}/file.csv"
cat(Lines, file = "globals2.mk")
readRenviron("globals2.mk")
Sys.getenv("num")
## [1] "100"
Sys.getenv("path")
## [1] "./here/is/a/path"
Sys.getenv("file")
## [1] "./here/is/a/path/file.csv"
If it is important to use parentheses rather than brace brackets, read in globals.mk, replace the parentheses with brace brackets and then write the file out again.
# write out test file - this one uses parentheses as in question
Lines <- "num = 100
path = ./here/is/a/path
file = $(path)/file.csv"
cat(Lines, file = "globals.mk")
# read globals.mk, perform () to {} substitutions, write out and then re-read
tmp <- tempfile()
L <- readLines("globals.mk")
cat(paste(chartr("()", "{}", L), collapse = "\n"), file = tmp)
readRenviron(tmp)
If the .mk file has anything other than direct variable expansion (such as more complex make rules/tricks/functions), it might be better to trust make to do the expansion for you and then read the result in. There's a post I found that dumps all variable contents (after processing).
TL;DR
expand_mkvars <- function(path, aslist = FALSE) {
  stopifnot(file.exists(mk <- Sys.which("make")))
  tf <- tempfile(fileext = ".mk")
  # needed on my windows system
  tf <- normalizePath(tf, winslash = "/", mustWork = FALSE) # tempfile should suffice
  on.exit(suppressWarnings(file.remove(tf)), add = TRUE)
  writeLines(c(".PHONY: printvars",
               "printvars:",
               "\t@$(foreach V,$(sort $(.VARIABLES)), \\",
               "\t $(if $(filter-out environment% default automatic, \\",
               "\t $(origin $V)),$(warning $V=$($V))))"), con = tf)
  out <- system2(mk, c("-f", shQuote(path), "-f", shQuote(tf), "-n", "printvars"),
                 stdout = TRUE, stderr = TRUE)
  out <- out[grepl(paste0("^", tf), out)]
  out <- gsub(paste0("^", tf, ":[0-9]+:\\s*"), "", out)
  known_noneed <- c(".DEFAULT_GOAL", "CURDIR", "GNUMAKEFLAGS", "MAKEFILE_LIST", "MAKEFLAGS")
  out <- out[!grepl(paste0("^(", paste(known_noneed, collapse = "|"), ")="), out)]
  if (aslist) {
    spl <- strsplit(out, "=")
    nms <- sapply(spl, `[[`, 1)
    rest <- lapply(spl, function(a) paste(a[-1], collapse = "="))
    setNames(rest, nms)
  } else out
}
In action:
expand_mkvars("~/StackOverflow/karthikt.mk")
# [1] "file=./here/is/a/path/file.csv" "num=100"
# [3] "path=./here/is/a/path"
expand_mkvars("~/StackOverflow/karthikt.mk", aslist = TRUE)
# $file
# [1] "./here/is/a/path/file.csv"
# $num
# [1] "100"
# $path
# [1] "./here/is/a/path"
I have not tested on other systems, so you might need to adjust known_noneed to add extra variables that pop up. Depending on your needs, you might be able to filter more intelligently (e.g., if none of your variables lead with a capital letter), but for this example I kept it to the known-not-wanted variables that make is giving us.
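For instance, a hypothetical stricter filter under that assumption (all of your own variables start with a lowercase letter) could replace the known_noneed step inside expand_mkvars():
out <- out[!grepl("^[A-Z.]", out)] # drop anything starting with a capital letter or a dot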
The blog post suggests using a phony target of
.PHONY: printvars
printvars:
	@$(foreach V,$(sort $(.VARIABLES)), \
	   $(if $(filter-out environment% default automatic, \
	   $(origin $V)),$(warning $V=$($V))))
(some are tabs, not all spaces, very important for make)
Unfortunately, it produces more output than you technically need:
$ /c/Rtools/bin/make.exe -f ~/StackOverflow/karthikt.mk printvars
C:/Users/r2/StackOverflow/karthikt.mk:10: .DEFAULT_GOAL=all
C:/Users/r2/StackOverflow/karthikt.mk:10: CURDIR=/Users/r2/Projects/Ford/shiny/shinyobjects/inst
C:/Users/r2/StackOverflow/karthikt.mk:10: GNUMAKEFLAGS=
C:/Users/r2/StackOverflow/karthikt.mk:10: MAKEFILE_LIST= C:/Users/r2/StackOverflow/karthikt.mk
C:/Users/r2/StackOverflow/karthikt.mk:10: MAKEFLAGS=
C:/Users/r2/StackOverflow/karthikt.mk:10: SHELL=sh
C:/Users/r2/StackOverflow/karthikt.mk:10: file=./here/is/a/path/file.csv
C:/Users/r2/StackOverflow/karthikt.mk:10: num=100
C:/Users/r2/StackOverflow/karthikt.mk:10: path=./here/is/a/path
make: Nothing to be done for 'printvars'.
so we need a little filtering, ergo the majority of code in the function.
Edit: if the readRenviron-to-envvar route is the best way for you, it would not be difficult to redirect the output of this make call to another file, parse out the relevant lines, and then call readRenviron() on that new file. It seems more indirect due to the use of two temp files, but they're cleaned up, so that should be nothing to worry about.
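A minimal sketch of that combination, reusing expand_mkvars() from above (file names here are just placeholders):
vars <- expand_mkvars("~/StackOverflow/karthikt.mk") # character vector of "name=value" lines
envfile <- tempfile(fileext = ".Renviron")
writeLines(vars, envfile)
readRenviron(envfile)
Sys.getenv("num")
## [1] "100"
unlink(envfile)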

Error in gsub with Arabic text in R ('character value in \x{} or \o{} is too large')

I am doing text mining in R with Arabic text, and I use the gsub function, but I get an error, as shown here:
Error in gsub("^\\x{0627}\\x{0644}(?=\\p{L})", "", x, perl = TRUE) :
invalid regular expression '^\x{0627}\x{0644}(?=\p{L})'
In addition: Warning message:
In gsub("^\\x{0627}\\x{0644}(?=\\p{L})", "", x, perl = TRUE) :
PCRE pattern compilation error
'character value in \x{} or \o{} is too large'
at '}\x{0644}(?=\p{L})'
Here is my code:
x<-("الوطن")
# Remove leading alef lam with optional leading waw
m <- gsub('^\\x{0627}\\x{0644}(?=\\p{L})', '', x, perl = TRUE)
Can anyone help me?
Finally I solved the problem. The problem was that when I import data in Arabic as a CSV and then apply gsub, I get the error:
Error in gsub("^\\x{0627}\\x{0644}(?=\\p{L})", "", x, perl = TRUE) :
invalid regular expression '^\x{0627}\x{0644}(?=\p{L})'
In addition: Warning message:
In gsub("^\\x{0627}\\x{0644}(?=\\p{L})", "", x, perl = TRUE) :
PCRE pattern compilation error
'character value in \x{} or \o{} is too large'
at '}\x{0644}(?=\p{L})'
I figured out that I need to save the data with UTF-8 encoding, read it back with UTF-8 encoding, and then change the locale, like this:
Sys.setlocale("LC_CTYPE", "arabic")
## [1] "Arabic_Saudi Arabia.1256"
write.csv(x, file = "x.csv", fileEncoding = "UTF-8")
y <- read.csv("C:/Users/Documents/x.csv", encoding = "UTF-8")
It seems to me the only problem is your quotation marks:
> x <- "الوطن"
> gsub('^\\x{0627}\\x{0644}(?=\\p{L})', '', x, perl = TRUE)
[1] "وطن"
Also, check your OS locale, as I've experienced similar issues when trying to process Hebrew text while my Windows locale was set to US.
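A quick way to check (and, if needed, switch) the locale along those lines; the locale names shown are Windows-style and only illustrative:
Sys.getlocale("LC_CTYPE") # what the session currently uses
# Sys.setlocale("LC_CTYPE", "arabic") # switch before processing the text, if needed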

Parsing Unicode string with nulls in R

I am having some trouble parsing a Unicode string from a JSON object pulled from an API. As the string has Encoding() "unknown", I need to parse it so the system knows what it is dealing with. The string represents a decoded .png file in UTF-8 that I then need to decode back to latin1 before writing it to a .png file (I know, very backwards, and it would be much better if the API pushed a base64 string).
I get the string from the API as a chr object and try to let fromJSON do the job, but no dice: it cuts the string at the first null (\u0000).
> library(httr)
> library(jsonlite)
> library(tidyverse)
> m
Response [https://...]
Date: 2018-04-10 11:47
Status: 200
Content-Type: application/json
Size: 24.3 kB
{"artifact": "\u0089PNG\r\n\u001a\n\u0000\u0000\u0000\rIHDR\u0000\u0000\u0000\u0092\u0000\u0000\u0000\u00e3...
> x <- content(m, encoding = "UTF-8", as = "text")
> ## substring of the complete x:
> x <- "{\"artifact\": \"\\u0089PNG\\r\\n\\u001a\\n\\u0000\\u0000\\u0000\\rIHDR\\u0000\\u0000\\u0000\\u0092\\u0000\\u0000\\u0000\\u00e3\\b\\u0006\\u0000\\u0000\\u0000n\\u0005M\\u00ea\\u0000\\u0000\\u0000\\u0006bKGD\\u0000\\u00ff\\u0000\\u00ff\\u0000\\u00ff\\u00a0\\u00bd\\u00a7\\u0093\\u0000\\u0000\\u0016\\u00e7IDATx\\u009c\\u00ed\"}\n"
>
> ## the problem
> "\u0000"
Error: nul character not allowed (line 1)
> ## this works fine
> "\\u0000"
[1] "\\u0000"
>
> y <- fromJSON(txt = x)
> y # note how the string is cut!
$artifact
[1] "\u0089PNG\r\n\032\n"
When I replace the \\u0000 with chr(0), everything works fine. The problem is that the nulls seem to play an important role in the binary representation of the file that I write to in the end, causing the resulting image to be corrupted in the viewer.
> x <- str_replace_all(string = x, pattern = "\\\\u0000", replacement = chr(0))
> y <- fromJSON(txt = x)
> y
$artifact
[1] "\u0089PNG\r\n\032\n\rIHDR\u0092ã\b\006n\005Mê\006bKGDÿÿÿ ½§\u0093\026çIDATx\u009cí"
> str(y$artifact)
chr "<U+0089>PNG\r\n\032\n\rIHDR<U+0092>ã\b\006n\005Mê\006bKGDÿÿÿ ½§<U+0093>\026çIDATx<U+009C>í"
> Encoding(y$artifact)
[1] "UTF-8"
> z <- iconv(y$artifact, from = "UTF-8", to = "latin1")
> writeBin(object = z, con = "test.png", useBytes = TRUE)
I have tried these commands with the original string, to no avail:
> library(stringi)
> stri_unescape_unicode(str = x)
Error in stri_unescape_unicode(str = x) :
embedded nul in string: '{"artifact": "<U+0089>PNG\r\n\032\n'
> ## and
> parse(text = x)
Error in parse(text = x) : nul character not allowed (line 1)
Is there no way for R to handle this nul character?
Any idea on how I can get the complete encoded string and write it to a file?
The same story works just fine in Python, which uses a \x escape convention instead of \u00:
response = r.json()
artifact = response['artifact']
artifact
'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR....'
artifact_encoded = artifact.encode('latin-1')
artifact_encoded # note the binary form!
b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR....'
fh = open("abc.png", "wb")
fh.write(artifact_encoded)
fh.close()
FYI: I have cut most of the actual string out, but kept enough to use for testing purposes. The actual string contained other symbols, and it seemed impossible to copy-paste the string into a script and assign it to a new variable (e.g. y <- "{\"artifact\": \"\\u0089PNG\\..."). So I don't know what I would do if I had to read the string from e.g. a .csv file.
Any pointers on any of my struggles would be appreciated :)

specifying output path for knit2html

I'm having trouble specifying an output path for the HTML generated by knit2html or its dependent functions. I would like to specify the output file in the call to knit2html(), but I get the error,
Error in knit2html(input = "test.Rmd", output = "test-abcd.html") :
object 'outfile' not found
'output' is a parameter of markdownToHTML, which should work, I'd think. I can't find anywhere in the source where 'outfile' is used.
This should reproduce my experience.
library(knitr)
library(markdown)
# a minimal example
writeLines(c("```{r hello-random, echo=TRUE}", "rnorm(5)", "```"),
           "test.Rmd")

# this works and outputs to test.html
knit2html(input = "test.Rmd")

# this generates the above error
knit2html(input = "test.Rmd",
          output = "test-abcd.html")

# breaking it down into two steps works in this simple case,
# but not in my application. trying to diagnose that difference currently
knit("test.Rmd")
markdownToHTML("test.md",
               output = "test-abcd.html")
Relevant version info might be useful:
sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-pc-linux-gnu (64-bit)
other attached packages:
[1] plyr_1.8 knitr_1.2 digest_0.6.3 markdown_0.5.4 xtable_1.7-1 reshape2_1.2.2 scales_0.2.3 ggplot2_0.9.3.1 data.table_1.8.8
First, thanks for the very clear and reproducible question. If you take a look at the knit2html function source code, you can understand what the problem is:
R> knit2html
function (input, ..., envir = parent.frame(), text = NULL, quiet = FALSE,
    encoding = getOption("encoding"))
{
    if (is.null(text)) {
        out = knit(input, envir = envir, encoding = encoding,
            quiet = quiet)
        markdown::markdownToHTML(out, outfile <- sub_ext(out,
            "html"), ...)
        invisible(outfile)
    }
    else {
        out = knit(text = text, envir = envir, encoding = encoding,
            quiet = quiet)
        markdown::markdownToHTML(text = out, ...)
    }
}
<environment: namespace:knitr>
If the text argument is NULL (i.e., if you provide a file as input instead of a character vector), then the given file is passed to the knit function, and the markdownToHTML function is called the following way:
markdown::markdownToHTML(out, outfile <- sub_ext(out, "html"), ...)
So in this case the output file name is generated by substituting the existing file name extension with html, and you can't provide your own output filename as an argument.
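Two workarounds follow from this; a minimal sketch (using markdownToHTML()'s output argument, as in the question, and the fact that knit2html() invisibly returns the generated file name):
# two-step: knit() returns the .md path, markdownToHTML() accepts output =
md <- knit("test.Rmd")
markdownToHTML(md, output = "test-abcd.html")
# or: let knit2html() pick the name, then rename the result
html <- knit2html("test.Rmd")
file.rename(html, "test-abcd.html")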

How to use a non-ASCII symbol (e.g. £) in an R package function?

I have a simple function in one of my R packages, with one of the arguments symbol = "£":
formatPound <- function(x, digits = 2, nsmall = 2, symbol = "£"){
  paste(symbol, format(x, digits = digits, nsmall = nsmall))
}
But when running R CMD check, I get this warning:
* checking R files for non-ASCII characters ... WARNING
Found the following files with non-ASCII characters:
formatters.R
It's definitely that £ symbol that causes the problem. If I replace it with a legitimate ASCII character, like $, the warning disappears.
Question: How can I use £ in my function argument without incurring an R CMD check warning?
It looks like "Writing R Extensions" covers this in Section 1.7.1, "Encoding Issues". One of the recommendations there is to use the Unicode escape \uxxxx. Since £ is Unicode U+00A3, you can use:
formatPound <- function(x, digits=2, nsmall=2, symbol="\u00A3"){
  paste(symbol, format(x, digits=digits, nsmall=nsmall))
}
formatPound(123.45)
[1] "£ 123.45"
As a workaround, you can use the intToUtf8() function:
# this causes errors (non-ASCII chars)
f <- function(symbol = "➛")
# this also causes errors in Rd files (non-ASCII chars)
f <- function(symbol = "\u279B")
# this is ok
f <- function(symbol = intToUtf8(0x279B))
