Encoding when developing an R package

When I run devtools::check(), one warning appears:
checking data for non-ASCII characters ... WARNING
Warning: found non-ASCII string
'Tanaid<c3><a6>' in object 'data_m'
I ran the following check:
library(stringi)
stri_enc_mark("Tanaid<c3><a6>"), which shows [1] "ASCII"
and all(stri_enc_isutf8('Tanaid<c3><a6>')), which shows [1] TRUE
UPDATE
I followed up on the comment and attempted to convert the string from the native encoding to ASCII.
I tried iconv("Tanaid\xc3\xa6>", "native", "UTF-8")
However, iconv does not accept "native" as an encoding name and reports: Error in iconv("Tanaidæ>", "native", "UTF-8") :
unsupported conversion from 'native' to 'UTF-8'
iconv("Tanaid\xc3\xa6", "latin1", "ASCII") and iconv("Tanaid\xc3\xa6", "latin2", "ASCII") do not yield the right string either.
A better solution is stri_trans_general("Tanaid\xc3\xa6", "latin-ascii"), which does the job.
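One thing worth noting about the check in the question: the literal "Tanaid<c3><a6>" typed into stri_enc_mark() is the escaped display form and contains only ASCII characters, so it cannot reveal the real bytes. Below is a minimal sketch of locating the actual non-ASCII values in a data set. The data frame is a made-up stand-in for data_m, and is_ascii() is a hypothetical helper, not part of any package.

```r
# Hypothetical stand-in for the package data set 'data_m'
data_m <- data.frame(
  family = c("Tanaid\u00e6", "Apseudidae"),
  n = c(3L, 5L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when every byte of a string is below 0x80 (plain ASCII)
is_ascii <- function(x) {
  vapply(x, function(s) all(as.integer(charToRaw(s)) < 128L),
         logical(1), USE.NAMES = FALSE)
}

# Collect the offending values, column by column, for the character columns only
non_ascii <- lapply(Filter(is.character, data_m), function(col) col[!is_ascii(col)])
non_ascii <- non_ascii[lengths(non_ascii) > 0]
print(non_ascii)

# Optional: transliterate with stringi if it is installed (as in the answer above)
if (requireNamespace("stringi", quietly = TRUE)) {
  data_m$family <- stringi::stri_trans_general(data_m$family, "Latin-ASCII")
}
```

After fixing the values, the data set would need to be saved again (for example with usethis::use_data() or save()) before re-running the check.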

Related

I'm getting a unicode error that prevents my code from running

I think my problem is a Turkish character error in R.
Warning message:
In normalizePath(path.expand(path), winslash, mustWork) : path[1]="C:/Users/sample/OneDrive - Sa?l?k �r/Belgeler":The filename, directory name, or volume label syntax is incorrect inside batch
How can I fix it?
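You could try checking the encoding using
enc <- Encoding(path)
and then converting it to UTF-8
# Encoding() returns "unknown" for native strings; iconv() expects "" in that case
from <- if (enc == "unknown") "" else enc
enc.path <- iconv(path, from, "UTF-8")
and then using enc.path as the path.
Honestly, I don't know whether this works, but you could try it (it's free!).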
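Encoding() returns "unknown" for strings in the native encoding, and that is not an encoding name iconv() understands. A hedged alternative sketch uses base R's enc2utf8(), which converts from whatever encoding a string is marked with. The path below is hypothetical, and whether normalizePath() then accepts it still depends on the Windows locale and the R version.

```r
# Hypothetical path containing Turkish characters (U+011F and U+0131),
# entered as Unicode escapes so the script itself stays ASCII
path <- "C:/Users/sample/Sa\u011fl\u0131k/Belgeler"

Encoding(path)               # how R has marked the string
utf8_path <- enc2utf8(path)  # convert to (or keep as) UTF-8
validUTF8(utf8_path)         # confirm the bytes are well-formed UTF-8
```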

"invalid multibyte string 8" error popping up for read.csv in R version 4.2.0

I installed the brand-new R version 4.2.0 and tried to run my code written with version 4.1.x.
When reading in data with read.csv this new error popped up:
Error in make.names(col.names, unique = TRUE) :
invalid multibyte string 8
I figure that this has to do with the new native UTF-8 support?
I am running R under Windows 11 with English language support and I am not aware of any special characters in the csv file but I cannot rule it out completely either because it is quite a huge file.
What can I do to switch back to the old encoding which ran without any errors?
The default behaviour for R for versions < 4.2 has been:
If you don't set a default encoding, files will be opened using UTF-8
(on Mac desktop, Linux desktop, and server) or the system's default
encoding (on Windows).
This behaviour has changed in R 4.2:
R 4.2 for Windows will support UTF-8 as native encoding
To find out the default encoding on Windows 10, run the following PowerShell command:
[System.Text.Encoding]::Default
The output for this on my Windows 10 machine is:
IsSingleByte : True
BodyName : iso-8859-1
EncodingName : Western European (Windows)
HeaderName : Windows-1252
WebName : Windows-1252
WindowsCodePage : 1252
IsBrowserDisplay : True
IsBrowserSave : True
IsMailNewsDisplay : True
IsMailNewsSave : True
EncoderFallback : System.Text.InternalEncoderBestFitFallback
DecoderFallback : System.Text.InternalDecoderBestFitFallback
IsReadOnly : True
CodePage : 1252
This can be passed to read.csv as the encoding to use:
read.csv(path_to_file, encoding = "windows-1252")
If you are unsure how to translate the output from PowerShell into the relevant string, you can search the list of all encodings with the stringi package:
# Replace "1252" with the relevant output from the PowerShell command
cat(grep("1252", stringi::stri_enc_list(simplify = FALSE), value = TRUE, ignore.case = TRUE))
You can take your pick from any of the options in the output:
# c("ibm-1252", "ibm-1252_P100-2000", "windows-1252") c("cp1252", "ibm-5348", "ibm-5348_P100-1997", "windows-1252")
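read.csv(path_to_file, fileEncoding = "windows-1252")
did it for me. Note fileEncoding instead of encoding: fileEncoding re-encodes the file as it is read, whereas encoding only marks the resulting strings.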
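Since the question mentions a huge file with no known special characters, it may help to find the offending lines first. The following is a minimal sketch under the assumption that the file is mostly ASCII with a few Windows-1252 bytes; the demo file and its contents are made up for illustration.

```r
# Build a small demo file whose header contains the Windows-1252 byte 0xE9
path_to_file <- tempfile(fileext = ".csv")
writeBin(
  c(charToRaw("caf"), as.raw(0xe9), charToRaw(",value\nx,1\ny,2\n")),
  path_to_file
)

# Read the lines without any re-encoding and flag those that are not valid UTF-8
lines <- readLines(path_to_file, warn = FALSE)
bad <- which(!validUTF8(lines))
bad

# Decode only the flagged lines to see how they read as Windows-1252
iconv(lines[bad], from = "windows-1252", to = "UTF-8")
```

If the flagged lines decode sensibly, reading the whole file with fileEncoding = "windows-1252" is the matching fix.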

R: load warnings

I have a file foo, which according to the file utility is:
foo: LaTeX 2e document, UTF-8 Unicode text, with CRLF, LF line terminators
I have the following meta-code
txt <- readLines("foo")
....
save(txt, file="bar")
load("bar")
There were 50 or more warnings (use warnings() to see the first 50)
warnings()
Warning messages:
1: In load("bar") :
input string '...' cannot be translated to UTF-8, is it valid in 'ANSI_X3.4-1968'?
2: ...
The warnings cause some parts of the txt character vector to be cut off when I later save it with writeLines.
I was unable to find an explanation for the warnings. They occur for lines in the TeX file that contain accented letters.
On Linux, R reacts to the value of the LANG environment variable. If it is not set to a UTF-8 locale, R cannot handle this type of encoding.
So, calling R like this (no UTF-8 locale):
LANG=en_US R
the console gives:
a <- rawToChar(as.raw(c(0xc3, 0xa0)))
a
# [1] "\303\240"
tmp <- file("/tmp/foo", encoding="UTF-8")
writeLines(a, tmp)
# Warning message:
# In writeLines(a, tmp) : invalid char string in output conversion
close(tmp)
readLines("/tmp/foo")
# character(0)
Without a UTF-8 locale, the value of a (an accented a) is printed as raw bytes.
Forcing UTF-8 encoding when writing to a file gives a warning, and nothing is written.
Calling with:
LANG=en_US.UTF-8 R
does not give errors:
a <- rawToChar(as.raw(c(0xc3, 0xa0)))
a
# [1] "à"
tmp <- file("/tmp/foo", encoding="UTF-8")
writeLines(a, tmp)
close(tmp)
readLines("/tmp/foo")
# [1] "à"
The problem here is that it is difficult to write portable scripts, because their results depend on environment variables.
On Windows, R does not react to the LANG variable, and the encoding depends on the language of the installed Windows version. For Latin languages this is an ANSI code page. That makes scripts easy to port between Windows machines, but R does not properly read data files from Linux, which are based on UTF-8.
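One way to reduce the dependence on the locale is to handle the bytes explicitly on both sides. The following is a minimal sketch, not a complete answer to the load() warnings: it writes UTF-8 bytes untranslated and marks them as UTF-8 when reading them back. The temporary file is only for illustration.

```r
# An accented "a", entered as a Unicode escape so the script itself stays ASCII
x <- "\u00e0"

path <- tempfile()
# useBytes = TRUE writes the UTF-8 bytes as they are, skipping locale translation
writeLines(enc2utf8(x), path, useBytes = TRUE)

# encoding = "UTF-8" marks the strings read as UTF-8 instead of assuming the locale
y <- readLines(path, encoding = "UTF-8")

identical(x, y)
l10n_info()[["UTF-8"]]  # TRUE only when the session itself runs in a UTF-8 locale
```

Printing y in a non-UTF-8 session may still show escapes, but the stored bytes survive the round trip.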

lesson containing non-ascii characters produces error when i try to run test in swirlify

It seems that swirlify cannot handle non-ASCII characters (such as accented characters). When trying to test or run the demo with test_lesson() or demo_lesson(), it causes a file read error.
1: In readLines(con) :
invalid input found on input connection '..../lesson.yaml'
2: In readLines(con) :
incomplete final line found on '..../lesson.yaml'
The error comes from the line
con <- file(input, encoding = "UTF-8")
of the yaml.load_file function
The solution is to save the YAML file in RStudio using Save with Encoding and choosing UTF-8.
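The same re-save can be sketched from R itself. This assumes the lesson file was originally saved as latin1, which the thread does not confirm; the file name and contents below are made up for illustration.

```r
# Demo file: latin1 byte 0xE7 (c with cedilla) and no trailing newline
lesson <- tempfile(fileext = ".yaml")
writeBin(
  c(charToRaw("- Class: text\n  Output: Le"), as.raw(0xe7), charToRaw("on")),
  lesson
)

# Read the raw lines and check whether they are already valid UTF-8
raw_lines <- readLines(lesson, warn = FALSE)
if (!all(validUTF8(raw_lines))) {
  # Assumed source encoding: latin1
  fixed <- iconv(raw_lines, from = "latin1", to = "UTF-8")
  # useBytes = TRUE writes the UTF-8 bytes untranslated; writeLines also
  # terminates the last line, which avoids the "incomplete final line" warning
  writeLines(fixed, lesson, useBytes = TRUE)
}

all(validUTF8(readLines(lesson)))
```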

Error in str.default(tweets_text) : invalid multibyte string 53

I'm able to get 500 tweets into R. However, when converting them to character with the code below, I get the error: Error in str.default(tweets_text) : invalid multibyte string 53
library(twitteR)
tweets_b <- searchTwitter('bahubali', lang = "en", n = 500, resultType = "recent")
tweets_txt <- sapply(tweets_b, function(x) x$getText())
str(tweets_txt)
Can someone help me out?
The error is caused by the encoding. Check it with:
Encoding(tweets_txt)
If you see any UTF-8 entries, you can work around the error by marking the strings as latin1:
Encoding(tweets_txt) <- "latin1"
After this, str() should work.
Thank you for your answer. However, I had figured out the resolution a long time ago and implemented it successfully.
FYI:
Here is the code I used.
bahubali_text <- sapply(bahubali_tweets, function(x) x$getText())
# remove the non-ASCII characters in the corpus
b_convert_text <- sapply(bahubali_text,function(row) iconv(row, "latin1", "ASCII", sub=""))
thanks again
cheers
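For reference, the iconv() approach can be tried without Twitter access. This sketch uses made-up strings in place of real tweets and assumes the text arrives as UTF-8, so it converts from "UTF-8" rather than from "latin1" as in the code above.

```r
# Made-up stand-ins for tweet text; the first contains U+2764 (a heart symbol)
tweets_txt <- c("Baahubali \u2764 great movie", "plain text")

# Convert from UTF-8 to ASCII, dropping anything that has no ASCII equivalent
clean_txt <- iconv(enc2utf8(tweets_txt), from = "UTF-8", to = "ASCII", sub = "")

str(clean_txt)
```

Using sub = "" deletes the characters outright; sub = "byte" would keep a visible placeholder for each removed byte instead.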
