I am doing text mining in R with Arabic text. When I use the gsub function, I get the error shown here:
Error in gsub("^\\x{0627}\\x{0644}(?=\\p{L})", "", x, perl = TRUE) :
invalid regular expression '^\x{0627}\x{0644}(?=\p{L})'
In addition: Warning message:
In gsub("^\\x{0627}\\x{0644}(?=\\p{L})", "", x, perl = TRUE) :
PCRE pattern compilation error
'character value in \x{} or \o{} is too large'
at '}\x{0644}(?=\p{L})'
Here is my code:
x <- "الوطن"
# Remove leading alef lam with optional leading waw
m <- gsub('^\\x{0627}\\x{0644}(?=\\p{L})', '', x, perl = TRUE)
Can anyone help me?
I finally solved the problem. It occurs when I import Arabic-language data from a CSV file and then apply gsub; I get the error shown above.
I figured out that I need to write the data with UTF-8 encoding, read it back with UTF-8 encoding, and then change the locale, like this:
Sys.setlocale("LC_CTYPE", "arabic")
# [1] "Arabic_Saudi Arabia.1256"
write.csv(x, file = "x.csv", fileEncoding = "UTF-8")
y <- read.csv("C:/Users/Documents/x.csv", encoding = "UTF-8")
It seems to me the only problem is your quotation marks:
> x <- "الوطن"
> gsub('^\\x{0627}\\x{0644}(?=\\p{L})', '', x, perl = TRUE)
[1] "وطن"
Also, check your OS locale, as I've experienced similar issues when trying to process Hebrew text while my Windows locale was set to US.
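For reference, a quick way to inspect and switch the session's character-type locale (the locale name "Arabic" here is Windows-style and just an illustration):
Sys.getlocale("LC_CTYPE")            # inspect the current locale
Sys.setlocale("LC_CTYPE", "Arabic")  # switch it before processing Arabic text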
Write the text below in a buffer and save it as an .R script:
letters_fa <- c('الف','ب','پ','ت','ث','ج','چ','ح','خ','ر','ز','د')
Then try these lines to source() it:
script <- "path/to/script.R"
library(magrittr)  # for the %>% pipe used below

file(script, encoding = "UTF-8") %>%
  readLines()  # works fine

file(script, encoding = "UTF-8") %>%
  source()  # works fine

source(script)  # the Farsi letters in the environment are misrepresented

source(script, encoding = "UTF-8")  # gives an error
The last line throws an error. I tried to debug it, and I believe there is a bug in the source function, in the following lines:
...
loc <- utils::localeToCharset()[1L]
...
The error occurs at the .Internal(parse( call.
...
exprs <- if (!from_file) {
if (length(lines))
.Internal(parse(stdin(), n = -1, lines, "?",
srcfile, encoding))
else expression()
}
else .Internal(parse(file, n = -1, NULL, "?", srcfile,
encoding))
...
The exact error is:
Error in source(script, encoding = "UTF-8") :
script.R:2:17: unexpected INCOMPLETE_STRING
1: #' #export
2: letters_fa <- c('
^
The solution to this problem is either to change the OS locale to a native locale (e.g. Persian in this case) or to use the built-in function Sys.setlocale(locale = "Persian") to change the R session's locale.
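A minimal sketch of the second option (the locale name "Persian" is Windows-style; other platforms use different names):
Sys.setlocale(locale = "Persian")    # switch the R session's locale
source(script, encoding = "UTF-8")   # should now parse without the INCOMPLETE_STRING error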
Use source without specifying the encoding, and then modify the vector's encoding with Encoding:
source(script)
letters_fa
# [1] "الÙ\u0081" "ب" "Ù¾" "ت" "Ø«"
# [6] "ج" "Ú†" "Ø" "Ø®" "ر"
# [11] "ز" "د"
Encoding(letters_fa) <- "UTF-8"
letters_fa
# [1] "الف" "ب" "پ" "ت" "ث" "ج" "چ" "ح" "خ" "ر" "ز" "د"
I am having some trouble parsing a Unicode string in a JSON object pulled from an API. As the string has Encoding() "unknown", I need to parse it so that the system knows what it is dealing with. The string represents a decoded .png file in UTF-8 that I then need to decode back to latin1 before writing it to a .png file (I know, very backwards, and it would be much better if the API pushed a base64 string).
I get the string from the API as a chr object and try to let fromJSON do the job, but no dice. It cuts the string at the first nul (\u0000).
> library(httr)
> library(jsonlite)
> library(tidyverse)
> m
Response [https://...]
Date: 2018-04-10 11:47
Status: 200
Content-Type: application/json
Size: 24.3 kB
{"artifact": "\u0089PNG\r\n\u001a\n\u0000\u0000\u0000\rIHDR\u0000\u0000\u0000\u0092\u0000\u0000\u0000\u00e3...
> x <- content(m, encoding = "UTF-8", as = "text")
> ## substring of the complete x:
> x <- "{\"artifact\": \"\\u0089PNG\\r\\n\\u001a\\n\\u0000\\u0000\\u0000\\rIHDR\\u0000\\u0000\\u0000\\u0092\\u0000\\u0000\\u0000\\u00e3\\b\\u0006\\u0000\\u0000\\u0000n\\u0005M\\u00ea\\u0000\\u0000\\u0000\\u0006bKGD\\u0000\\u00ff\\u0000\\u00ff\\u0000\\u00ff\\u00a0\\u00bd\\u00a7\\u0093\\u0000\\u0000\\u0016\\u00e7IDATx\\u009c\\u00ed\"}\n"
>
> ## the problem
> "\u0000"
Error: nul character not allowed (line 1)
> ## this works fine
> "\\u0000"
[1] "\\u0000"
>
> y <- fromJSON(txt = x)
> y # note how the string is cut!
$artifact
[1] "\u0089PNG\r\n\032\n"
When I replace the \\u0000 with chr(0), everything works fine. The problem is that the nuls seem to play an important role in the binary representation of the file that I write in the end, so the resulting image is corrupted in the viewer.
> x <- str_replace_all(string = x, pattern = "\\\\u0000", replacement = chr(0))
> y <- fromJSON(txt = x)
> y
$artifact
[1] "\u0089PNG\r\n\032\n\rIHDR\u0092ã\b\006n\005Mê\006bKGDÿÿÿ ½§\u0093\026çIDATx\u009cí"
> str(y$artifact)
chr "<U+0089>PNG\r\n\032\n\rIHDR<U+0092>ã\b\006n\005Mê\006bKGDÿÿÿ ½§<U+0093>\026çIDATx<U+009C>í"
> Encoding(y$artifact)
[1] "UTF-8"
> z <- iconv(y$artifact, from = "UTF-8", to = "latin1")
> writeBin(object = z, con = "test.png", useBytes = TRUE)
I have tried these commands with the original string, to no avail:
> library(stringi)
> stri_unescape_unicode(str = x)
Error in stri_unescape_unicode(str = x) :
embedded nul in string: '{"artifact": "<U+0089>PNG\r\n\032\n'
> ## and
> parse(text = x)
Error in parse(text = x) : nul character not allowed (line 1)
Is there no way for R to handle this nul character?
Any idea on how I can get the complete encoded string and write it to a file?
The same story works just fine in Python, which uses a \x convention instead of \u00xx:
response = r.json()
artifact = response['artifact']
artifact
'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR....'
artifact_encoded = artifact.encode('latin-1')
artifact_encoded # note the binary form!
b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR....'
fh = open("abc.png", "wb")
fh.write(artifact_encoded)
fh.close()
FYI: I have cut most of the actual string out, but kept enough for testing purposes. The actual string contained other symbols, and it seemed impossible to copy-paste the string into a script and assign it to a new variable (e.g. y <- "{\"artifact\": \"\\u0089PNG\\..."). So I don't know what I would do if I had to read the string from e.g. a .csv file.
Any pointers in any of my struggles would be appreciated :)
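One workaround I can sketch (my own approach; the helper name unescape_to_raw is hypothetical, not an existing API): since R character vectors cannot hold embedded nuls, skip fromJSON for this field and decode the \uXXXX escapes directly into a raw vector, which writeBin() can write byte-for-byte. This assumes every escaped code point in the payload is below 256, as in this latin1-range PNG string:
unescape_to_raw <- function(s) {
  # byte values for the short JSON escapes (\n, \r, \t, \b, \f, \", \\, \/)
  shorts <- c(n = 10L, r = 13L, t = 9L, b = 8L, f = 12L,
              '"' = 34L, "\\" = 92L, "/" = 47L)
  bytes <- raw(0)
  i <- 1L
  n <- nchar(s)
  while (i <= n) {
    ch <- substr(s, i, i)
    if (ch == "\\") {
      esc <- substr(s, i + 1L, i + 1L)
      if (esc == "u") {
        # \uXXXX: treat the four hex digits as one byte (code point < 256)
        bytes <- c(bytes, as.raw(strtoi(substr(s, i + 2L, i + 5L), 16L)))
        i <- i + 6L
      } else {
        bytes <- c(bytes, as.raw(shorts[[esc]]))
        i <- i + 2L
      }
    } else {
      bytes <- c(bytes, as.raw(utf8ToInt(ch)))
      i <- i + 1L
    }
  }
  bytes
}

# pull the still-escaped value out of the raw JSON text, then decode and write
payload <- sub('.*"artifact": "(.*)".*', "\\1", x)
writeBin(unescape_to_raw(payload), "test.png")
This sidesteps the nul problem entirely because the bytes never pass through an R character vector.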
I would like to create a package for internal use (not for distribution). One of my functions contains the line
if (data$unit[i] != "°C") {
It works perfectly in the script, but when I try to create the documentation for my package using document() from devtools, I get the error
Error in parse(text = lines, keep.source = TRUE, srcfile = srcfilecopy(file, path_to_my_code: unexpected INCOMPLETE_STRING
279: if (! is.na(data$unit[i]){
280: if (data$unit[i] != "
In addition: Warning message:
In readLines(con, warn = FALSE, n = n, ok = ok, skipNul = skipNul) :
invalid input found on input connection 'path_to_my_code'
If I delete the °-character, document() works. But I need this character there, so this is not an option.
When I use a double backslash in the if-clause, my function no longer detects °C, as shown here:
test <- c("mg/l", "°C")
"\\°C" %in% test
[1] FALSE
If I use tryCatch, the documentation is also not created.
Replacing "°C" by gsub(pattern = '\\\\', replacement = "", x = '\\°C') causes the function to crash at the double-\ .
How can I tell document() that everything is fine and it should just create the files?
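A workaround I would suggest (my addition, not from the original post): write the degree sign as the Unicode escape \u00b0, which R resolves at parse time, so the source file stays pure ASCII and document() can read it:
# "\u00b0C" parses to the same string as "°C", with no non-ASCII byte on disk
test <- c("mg/l", "\u00b0C")
"\u00b0C" %in% test
# [1] TRUE
The comparison in the package can then be written as data$unit[i] != "\u00b0C".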
When I try to parse JSON from the character object returned by a Facebook URL, I get "Error in fromJSON(data) : unexpected escaped character '\o' at pos 130". Check this out:
library(RCurl)
library(rjson)
data <- getURL("https://graph.facebook.com/search?q=multishow&type=post&limit=1500", cainfo="cacert.perm")
fbData <- fromJSON(data)
Error in fromJSON(data) : unexpected escaped character '\o' at pos 130
# with RJSONIO, also an error
> fbData <- fromJSON(data)
Error in fromJSON(content, handler, default.size, depth, allowComments, :
invalid JSON input
Is there any way to replace this '\o' character before I try to parse the JSON? I tried gsub, but it didn't work (or I'm doing something wrong).
datafixed <- gsub('\o',' ',data)
Error: '\o' is an unrecognized escape sequence in string starting with "\o"
Can somebody help me with this one? Thanks.
You need to escape the backslash itself in your pattern: to match the literal two-character sequence \o, the regex needs \\o, which is written as '\\\\o' in an R string (or use fixed = TRUE).
Try
gsub('\\\\o', ' ', data)
# or, equivalently, without regex escaping:
gsub('\\o', ' ', data, fixed = TRUE)
You could do
fbData <- fromJSON(data, unexpected.escape = "keep")
You will see a warning:
Warning message:
In fromJSON(individual_page, unexpected.escape = "keep") :
unexpected escaped character '\m' at pos 10. Keeping value.
If you want, you can suppress the warning using
suppressWarnings(fromJSON(data, unexpected.escape = "keep"))
unexpected.escape: changes the handling of unexpected escaped characters. The value should be one of "error", "skip", or "keep"; on an unexpected escaped character, issue an error, skip the character, or keep the character.
You can find more details here - http://cran.r-project.org/web/packages/rjson/rjson.pdf
I have a simple function in one of my R packages, with one of the arguments symbol = "£":
formatPound <- function(x, digits = 2, nsmall = 2, symbol = "£"){
paste(symbol, format(x, digits = digits, nsmall = nsmall))
}
But when running R CMD check, I get this warning:
* checking R files for non-ASCII characters ... WARNING
Found the following files with non-ASCII characters:
formatters.R
It's definitely that £ symbol that causes the problem. If I replace it with a legitimate ASCII character, like $, the warning disappears.
Question: How can I use £ in my function argument without incurring an R CMD check warning?
Looks like "Writing R Extensions" covers this in Section 1.7.1 "Encoding Issues".
One of the recommendations on that page is to use the Unicode escape \uxxxx. Since £ is Unicode U+00A3, you can use:
formatPound <- function(x, digits = 2, nsmall = 2, symbol = "\u00A3"){
  paste(symbol, format(x, digits = digits, nsmall = nsmall))
}
formatPound(123.45)
[1] "£ 123.45"
As a workaround, you can use the intToUtf8() function:
# this causes errors (non-ASCII chars)
f <- function(symbol = "➛")
# this also causes errors in Rd files (non-ASCII chars)
f <- function(symbol = "\u279B")
# this is ok
f <- function(symbol = intToUtf8(0x279B))
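For example, the same trick applied to the formatPound function above (my own adaptation):
# intToUtf8(0x00A3) builds the pound sign at run time, so the file stays ASCII-only
formatPound <- function(x, digits = 2, nsmall = 2, symbol = intToUtf8(0x00A3)){
  paste(symbol, format(x, digits = digits, nsmall = nsmall))
}
formatPound(123.45)
# [1] "£ 123.45"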