Error in gsub: "character value in \x{} is too large" with Arabic language in R

I am doing text mining in R on Arabic text and using the gsub function, but I get the error shown here:
Error in gsub("^\\x{0627}\\x{0644}(?=\\p{L})", "", x, perl = TRUE) :
invalid regular expression '^\x{0627}\x{0644}(?=\p{L})'
In addition: Warning message:
In gsub("^\\x{0627}\\x{0644}(?=\\p{L})", "", x, perl = TRUE) :
PCRE pattern compilation error
'character value in \x{} or \o{} is too large'
at '}\x{0644}(?=\p{L})'
Here is my code:
x <- "الوطن"
# Remove leading alef lam with optional leading waw
m <- gsub('^\\x{0627}\\x{0644}(?=\\p{L})', '', x, perl = TRUE)
Can anyone help me?

I finally solved the problem.
The problem is: when I import Arabic-language data as a CSV and then apply gsub, I get the error
Error in gsub("^\\x{0627}\\x{0644}(?=\\p{L})", "", x, perl = TRUE) :
invalid regular expression '^\x{0627}\x{0644}(?=\p{L})'
In addition: Warning message:
In gsub("^\\x{0627}\\x{0644}(?=\\p{L})", "", x, perl = TRUE) :
PCRE pattern compilation error
'character value in \x{} or \o{} is too large'
at '}\x{0644}(?=\p{L})'
I figured out that I need to write the data out with fileEncoding = "UTF-8",
then read it back in with the same encoding, and then change the locale,
like this code:
> Sys.setlocale("LC_CTYPE", "arabic")
[1] "Arabic_Saudi Arabia.1256"
> write.csv(x, file = "x.csv", fileEncoding = "UTF-8")
> y <- read.csv("C:/Users/Documents/x.csv", encoding = "UTF-8")

It seems to me the only problem is your quotation marks:
> x <- "الوطن"
> gsub('^\\x{0627}\\x{0644}(?=\\p{L})', '', x, perl = TRUE)
[1] "وطن"
Also, check your OS locale: I've experienced similar issues when trying to process Hebrew text while my Windows locale was set to US.
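The PCRE compilation error above typically means the pattern string is not marked as UTF-8, so \x{0627} is rejected as a byte value above 0xFF. A minimal sketch of a workaround, assuming the input can be converted with enc2utf8(): write the Arabic code points as \u escapes, which R's parser resolves into real UTF-8 characters before PCRE ever sees them.

```r
# \u escapes are resolved at parse time, so the pattern reaches PCRE
# as genuine UTF-8 characters rather than \x{...} escapes
x <- "\u0627\u0644\u0648\u0637\u0646"   # "الوطن"
x <- enc2utf8(x)                        # make sure the input is UTF-8
gsub("^\u0627\u0644(?=\\p{L})", "", x, perl = TRUE)
# [1] "وطن"
```

With both the string and the pattern genuinely UTF-8, the \x{...} escapes are not needed at all.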

Related

r - source script file that contains unicode (Farsi) character

Write the text below in a buffer and save it as an .R script:
letters_fa <- c('الف','ب','پ','ت','ث','ج','چ','ح','خ','ر','ز','د')
then try these lines to source() it:
library(magrittr)  # needed for the %>% pipe
script <- "path/to/script.R"
file(script, encoding = "UTF-8") %>%
  readLines()  # works fine
file(script, encoding = "UTF-8") %>%
  source()     # works fine
source(script)  # the Farsi letters in the environment are misrepresented
source(script, encoding = "UTF-8")  # gives an error
The last line throws an error. I tried to debug it and I believe there is a bug in the source() function, in the following lines:
...
loc <- utils::localeToCharset()[1L]
...
The error occurs at .Internal(parse( line.
...
exprs <- if (!from_file) {
    if (length(lines))
        .Internal(parse(stdin(), n = -1, lines, "?",
            srcfile, encoding))
    else expression()
}
else .Internal(parse(file, n = -1, NULL, "?", srcfile,
    encoding))
...
The exact error is:
Error in source(script, encoding = "UTF-8") :
script.R:2:17: unexpected INCOMPLETE_STRING
1: #' #export
2: letters_fa <- c('
^
The solution to this problem is either to change the OS locale to a native locale (e.g. Persian in this case) or to use the built-in R function Sys.setlocale(locale = "Persian") to change the R session's locale.
Alternatively, use source() without specifying the encoding, and then fix the vector's declared encoding with Encoding():
source(script)
letters_fa
# [1] "الÙ\u0081" "ب" "Ù¾" "ت" "Ø«"
# [6] "ج" "چ" "ح" "خ" "ر"
# [11] "ز" "د"
Encoding(letters_fa) <- "UTF-8"
letters_fa
# [1] "الف" "ب" "پ" "ت" "ث" "ج" "چ" "ح" "خ" "ر" "ز" "د"
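Since reading the file through a UTF-8 connection was shown to work above, a related workaround (a sketch only; the script path here is a hypothetical tempfile) is to read the script as declared UTF-8 text and parse it explicitly, bypassing source()'s locale-dependent handling:

```r
# Write a tiny Farsi script, then read + parse it as UTF-8 explicitly.
# useBytes = TRUE writes the string's internal UTF-8 bytes as-is.
script <- tempfile(fileext = ".R")
writeLines("letters_fa <- c('\u0627\u0644\u0641', '\u0628')", script, useBytes = TRUE)
eval(parse(text = readLines(script, encoding = "UTF-8")))
letters_fa
# [1] "الف" "ب"
```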

Parsing Unicode string with nulls in R

I am having some trouble parsing a Unicode string in a JSON object pulled from an API. As the string has Encoding() "unknown", I need to parse it so the system knows what it's dealing with. The string represents a decoded .png file in UTF-8 that I then need to decode back to latin1 before writing it to a .png file (I know, very backwards; it would be much better if the API pushed a base64 string).
I get the string from the API as a chr object and try to let fromJSON do the job, but no dice: it cuts the string at the first nul (\u0000).
> library(httr)
> library(jsonlite)
> library(tidyverse)
> m
Response [https://...]
Date: 2018-04-10 11:47
Status: 200
Content-Type: application/json
Size: 24.3 kB
{"artifact": "\u0089PNG\r\n\u001a\n\u0000\u0000\u0000\rIHDR\u0000\u0000\u0000\u0092\u0000\u0000\u0000\u00e3...
> x <- content(m, encoding = "UTF-8", as = "text")
> ## substing of the complete x:
> x <- "{\"artifact\": \"\\u0089PNG\\r\\n\\u001a\\n\\u0000\\u0000\\u0000\\rIHDR\\u0000\\u0000\\u0000\\u0092\\u0000\\u0000\\u0000\\u00e3\\b\\u0006\\u0000\\u0000\\u0000n\\u0005M\\u00ea\\u0000\\u0000\\u0000\\u0006bKGD\\u0000\\u00ff\\u0000\\u00ff\\u0000\\u00ff\\u00a0\\u00bd\\u00a7\\u0093\\u0000\\u0000\\u0016\\u00e7IDATx\\u009c\\u00ed\"}\n"
>
> ## the problem
> "\u0000"
Error: nul character not allowed (line 1)
> ## this works fine
> "\\u0000"
[1] "\\u0000"
>
> y <- fromJSON(txt = x)
> y # note how the string is cut!
$artifact
[1] "\u0089PNG\r\n\032\n"
When I replace the \\u0000 with chr(0), everything works fine. The problem is that the nuls seem to play an important role in the binary representation of the file that I write in the end, so the resulting image is corrupted in the viewer.
> x <- str_replace_all(string = x, pattern = "\\\\u0000", replacement = chr(0))
> y <- fromJSON(txt = x)
> y
$artifact
[1] "\u0089PNG\r\n\032\n\rIHDR\u0092ã\b\006n\005Mê\006bKGDÿÿÿ ½§\u0093\026çIDATx\u009cí"
> str(y$artifact)
chr "<U+0089>PNG\r\n\032\n\rIHDR<U+0092>ã\b\006n\005Mê\006bKGDÿÿÿ ½§<U+0093>\026çIDATx<U+009C>í"
> Encoding(y$artifact)
[1] "UTF-8"
> z <- iconv(y$artifact, from = "UTF-8", to = "latin1")
> writeBin(object = z, con = "test.png", useBytes = TRUE)
I have tried these commands with the original string, to no avail:
> library(stringi)
> stri_unescape_unicode(str = x)
Error in stri_unescape_unicode(str = x) :
embedded nul in string: '{"artifact": "<U+0089>PNG\r\n\032\n'
> ## and
> parse(text = x)
Error in parse(text = x) : nul character not allowed (line 1)
Is there no way for R to handle this nul character?
Any idea how I can get the complete encoded string and write it to a file?
The same story works just fine in Python, which uses a \x convention instead of \u00:
response = r.json()
artifact = response['artifact']
artifact
'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR....'
artifact_encoded = artifact.encode('latin-1')
artifact_encoded # note the binary form!
b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR....'
fh = open("abc.png", "wb")
fh.write(artifact_encoded)
fh.close()
FYI: I have cut most of the actual string out, but kept enough for testing purposes. The actual string contained other symbols, and it seemed impossible to copy-paste the string into a script and assign it to a new variable (e.g. y <- "{\"artifact\": \"\\u0089PNG\\..."). So I don't know what I would do if I had to read the string from e.g. a .csv file.
Any pointers in any of my struggles would be appreciated :)
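One observation that may help, sketched under the assumption that the escaped payload can eventually be decoded byte by byte: an R character vector can never hold an embedded nul, but a raw vector can, and writeBin() accepts raw vectors directly. So the goal would be to decode the \uXXXX escapes into a raw vector rather than a string. The bytes below are just the PNG signature, used for illustration:

```r
# Raw vectors happily carry 0x00 bytes, which character vectors cannot
bytes <- as.raw(c(0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A,
                  0x00, 0x00, 0x00, 0x0D))   # PNG signature + start of IHDR
f <- tempfile(fileext = ".png")
writeBin(bytes, f)                           # all 12 bytes survive, nuls included
file.size(f)
# [1] 12
```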

Ignoring the symbol ° in R devtools function document()

I would like to create a package for internal usage (not to distribute anywhere). One of my functions contains the line
if (data$unit[i] != "°C") {
It works perfectly in the script, but if I want to create the documentation for my package using document() from devtools, I get the error
Error in parse(text = lines, keep.source = TRUE, srcfile = srcfilecopy(file, path_to_my_code: unexpected INCOMPLETE_STRING
279: if (! is.na(data$unit[i]){
280: if (data$unit[i] != "
In addition: Warning message:
In readLines(con, warn = FALSE, n = n, ok = ok, skipNul = skipNul) :
invalid input found on input connection 'path_to_my_code'
If I delete the ° character, document() works. But I need this character there, so that is not an option.
When I use a double backslash in the if-clause, my function no longer detects °C, as shown here:
test <- c("mg/l", "°C")
"\\°C" %in% test
[1] FALSE
If I use tryCatch, the documentation is also not created.
Replacing "°C" via gsub(pattern = '\\\\', replacement = "", x = '\\°C') makes the function crash at the double backslash.
How can I tell document() that everything is fine and it should just create the files?
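One common fix (not from the question; a sketch assuming the comparison only needs the text "°C") is to write the degree sign as its Unicode escape \u00B0, which keeps the source file pure ASCII so parsing and document() have nothing to choke on:

```r
# "\u00B0C" is the string "°C", but the source file itself stays ASCII-only
unit <- "\u00B0C"
test <- c("mg/l", "\u00B0C")
unit %in% test
# [1] TRUE
```

This is the same trick "Writing R Extensions" recommends for non-ASCII characters in package code.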

How to replace "unexpected escaped character" in R

When I try to parse JSON from the character object retrieved from a Facebook URL, I get "Error in fromJSON(data) : unexpected escaped character '\o' at pos 130". Check this out:
library(RCurl)
library(rjson)
data <- getURL("https://graph.facebook.com/search?q=multishow&type=post&limit=1500", cainfo="cacert.perm")
fbData <- fromJSON(data)
Error in fromJSON(data) : unexpected escaped character '\o' at pos 130
# with RJSONIO I also get an error
> fbData <- fromJSON(data)
Error in fromJSON(content, handler, default.size, depth, allowComments, :
invalid JSON input
Is there any way to replace this '\o' character before I try to parse the JSON? I tried gsub, but it didn't work (or I'm doing something wrong).
datafixed <- gsub('\o',' ',data)
Error: '\o' is an unrecognized escape sequence in string starting with "\o"
Can somebody help me with this one? Thanks.
You need to escape \ in your pattern. With the default regex engine, four backslashes in the R literal are needed to match one literal backslash:
Try
gsub('\\\\o', ' ', data)
You could do
fbData <- fromJSON(data,unexpected.escape = "keep")
You will see a warning:
Warning message:
In fromJSON(individual_page, unexpected.escape = "keep") :
unexpected escaped character '\m' at pos 10. Keeping value.
If you want, you can suppress the warning using
suppressWarnings(fromJSON(data,unexpected.escape = "keep"))
unexpected.escape: changes the handling of unexpected escaped characters. The value should be one of "error", "skip", or "keep": on an unexpected character, issue an error, skip the character, or keep the character.
You can find more details here - http://cran.r-project.org/web/packages/rjson/rjson.pdf
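For the gsub route, note that in the default regex engine the R literal '\\o' is the pattern \o, which simply matches a literal o. A hedged sketch (with s standing in for the API payload) that strips the stray backslash safely uses fixed = TRUE instead:

```r
# fixed = TRUE treats the pattern as a literal string, so "\\o" (the two
# characters backslash + o) matches exactly that sequence
s <- "a\\ob"                       # the four characters: a \ o b
gsub("\\o", "o", s, fixed = TRUE)  # drop the stray backslash, keep the o
# [1] "aob"
```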

How to use a non-ASCII symbol (e.g. £) in an R package function?

I have a simple function in one of my R packages, with one of the arguments symbol = "£":
formatPound <- function(x, digits = 2, nsmall = 2, symbol = "£"){
paste(symbol, format(x, digits = digits, nsmall = nsmall))
}
But when running R CMD check, I get this warning:
* checking R files for non-ASCII characters ... WARNING
Found the following files with non-ASCII characters:
formatters.R
It's definitely that £ symbol that causes the problem. If I replace it with a legitimate ASCII character, like $, the warning disappears.
Question: How can I use £ in my function argument, without incurring a R CMD check warning?
It looks like "Writing R Extensions" covers this in Section 1.7.1, "Encoding Issues".
One of the recommendations there is to use the Unicode escape \uxxxx. Since £ is U+00A3, you can use:
formatPound <- function(x, digits=2, nsmall=2, symbol="\u00A3"){
paste(symbol, format(x, digits=digits, nsmall=nsmall))
}
formatPound(123.45)
[1] "£ 123.45"
As a workaround, you can use the intToUtf8() function:
# this causes errors (non-ASCII chars)
f <- function(symbol = "➛")
# this also causes errors in Rd files (non-ASCII chars)
f <- function(symbol = "\u279B")
# this is ok
f <- function(symbol = intToUtf8(0x279B))
