R string encoding Cyrillic

It seems that I have some Cyrillic strings stored as UTF-8 in my database, and I need to restore them to Cyrillic using R.
For example, in the database a string is stored as: "õÆ¿ª®Ï". What I need is Москва.
I tried some things with iconv, but I'm not sure whether I need to double-convert the string first:
1. iconv(x, "UTF-8", "CP1251") # I get NA
2. iconv(x, "CP1251", "UTF-8") # I get ûûû \"òƸл°¸»ª¿-õƸƺ±Ð\"
I assumed I needed to restore the string from UTF-8 to Cyrillic first, but I get NA.
Any help is appreciated.

enc2native and enc2utf8 convert elements of character vectors to the native encoding or UTF-8 respectively, taking any marked encoding into account. They are primitive functions, designed to do minimal copying.
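For example, here is a minimal sketch (with a made-up latin1 string) of how the marked encoding is taken into account:
x <- "Caf\xe9"            # raw byte 0xE9, which is "é" in latin1
Encoding(x) <- "latin1"   # declare the encoding so R can interpret the byte
enc2utf8(x)               # "Café", now marked as UTF-8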

Related

How to solve the error **Error in nchar(rownames(m)) : invalid multibyte string, element 1**?

I create the document-term matrix (DocumentTermMatrix is from the tm package):
dtm <- DocumentTermMatrix(docs, control = params)
Error in nchar(rownames(m)) : invalid multibyte string, element 1
Does anyone know how to tackle this error? I'm working in RStudio.
Sys.setlocale("LC_ALL", "C")
Apply this code in RStudio. It resets the locale to the plain C locale, which has worked for me many times.
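If you only want the change to be temporary, a minimal sketch (assuming your platform accepts setting the composite locale string back) is:
old_locale <- Sys.getlocale("LC_ALL")   # remember the current locale
Sys.setlocale("LC_ALL", "C")            # switch to the plain C locale
# ... build the DocumentTermMatrix here ...
Sys.setlocale("LC_ALL", old_locale)     # restore the original locale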
This happens when your input text isn't UTF-8 encoded.
I've found that the best way to handle these issues is to use stringr::str_conv.
mydocs <- c("doc1", "doc2", "doc3")
stringr::str_conv(mydocs, "UTF-8")
Where you have non-UTF-8 characters, you'll get a warning, but the character vector that comes out the other side will be usable.
Do that to your docs vector before calling DocumentTermMatrix.
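A minimal sketch (with a made-up vector in which one element contains an invalid UTF-8 byte), assuming stringr is installed:
docs <- c("plain ascii", "caf\xe9")        # second element is not valid UTF-8
clean <- stringr::str_conv(docs, "UTF-8")  # warns about the invalid byte
clean                                      # but the result is usable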
I encountered this error while trying to write a data frame to a SQL Server table. The function below helped me: I used it to remove all non-UTF-8 characters from a data frame before writing it to the server. It's built off another post, linked in the source comment below.
# Create a function to convert all columns to UTF-8 encoding,
# dropping any characters that can't be converted.
df_convert_utf8 <- function(df_data){
  # Convert all character columns to UTF-8
  # Source: https://stackoverflow.com/questions/54633054/dbidbwritetable-invalid-multibyte-string
  df_data[, sapply(df_data, is.character)] <- sapply(
    df_data[, sapply(df_data, is.character)],
    iconv, "WINDOWS-1252", "UTF-8", sub = "")
  return(df_data)
}
Example usage:
# Convert all character strings to UTF8, removing any characters we can't use
df_chunk <- df_convert_utf8(df_chunk)

Entering and viewing Cyrillic strings in R

How to handle Cyrillic strings in R?
Sys.setlocale("LC_ALL","Polish")
dataset <- data.frame( ProductName = c('ąęćśżźół','тест') )
#Encoding(dataset) <- "UTF-8" #this line does not change anything
View(dataset)
The code above displays the Cyrillic entry as <U+number> escape sequences. But I would like to see what I typed, тест, instead. Is there any way to do that?
This works for me, and I can see the Cyrillic тест in my data frame. I think you should check what your locale is (with sessionInfo()) and whether it supports UTF-8. Also try changing the encoding of your column:
Encoding(dataset$ProductName) <- "UTF-8"
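A minimal sketch of the whole sequence (assuming the script is saved as UTF-8 and the strings stay characters rather than factors):
dataset <- data.frame(ProductName = c("ąęćśżźół", "тест"),
                      stringsAsFactors = FALSE)
Encoding(dataset$ProductName)             # check how the strings are marked
Encoding(dataset$ProductName) <- "UTF-8"  # mark the bytes as UTF-8
View(dataset)                             # should now render тест correctly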

R character Encoding goes wrong (English - Spanish)

I'm trying to load a dataset into R using an API that lets me run a query and returns the data I need (I can't configure anything on the server side).
I know it has something to do with encoding. When I check the string from my data frame in R, it gives me ENC: UTF-8 "Cosmética".
When I copy the source string "Cosmética", it gives me latin1.
How can I get the UTF-8 string properly displayed, like the latin1 one?
I've tried this:
Sys.setlocale("LC_ALL","Spanish")
and tried directly on the string:
Encoding(Description) <- "latin1"
Unfortunately I can't get it to work. Any ideas are welcome! Thanks.
You can use iconv to change to encoding of the string:
iconv(mystring, to = "ISO-8859-1")
# [1] "Cosmética"
ISO 8859-1 is the common character encoding in Western Europe.
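You can confirm what is happening with Encoding(); a minimal sketch (the string here is just illustrative):
mystring <- "Cosm\u00e9tica"
Encoding(mystring)                  # "UTF-8"
iconv(mystring, to = "ISO-8859-1")  # same text, re-encoded as latin1 bytes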

Get R to keep UTF-8 Codepoint representation

This question is related to the utf8 package for R. I have a weird problem: I want emojis in a data set I'm working with to stay in code point representation (i.e. as '\U0001f602'). I want to use the FindReplace function from the DataCombine package to turn UTF-8 encodings into prose descriptions of emojis in a dataset of YouTube comments (using a dictionary I made available here). The only issue is that when I save the output of utf8_encode as an object in R, the nice escape representation, for which I can use my dictionary, disappears...
First I have to adjust the dictionary a bit:
emojis$YouTube <- tolower(emojis$Codepoint)
emojis$YouTube <- gsub("u\\+","\\\\U000", emojis$YouTube)
Convert to character so as to be able to use utf8_encode:
emojimovie$test <- as.character(emojimovie$textOriginal)
utf8_encode(emojimovie$test)
This works great: when it prints in the console, it gives output like \U0001f595 that can be matched with dictionary entries.
BUT, when I do this:
emojimovie$text2 <- utf8_encode(emojimovie$test)
and then:
emoemo <- FindReplace(data = emojimovie, Var = "text2", replaceData = emojis, from = "YouTube", to = "Name", exact = TRUE)
I get all NAs. When I look at the output in $text2 with View, I don't see \U0001f595; I see actual emojis. I think this is why the FindReplace function isn't working: when the output gets saved to an object, it is represented as emojis again and the function can't find any matches. When I try gsub("\U0001f602", "lolface", emojimovie$text2), however, I can actually match and replace things, but I don't want to do this for all ~2,000 or so emojis. I've tried reading as much as I can about UTF-8, but I can't understand why this is happening. I'm stumped! :P
It looks like in the above you are trying to convert the UTF-8 emoji to a text version. I would recommend going the other direction. Something like:
emojis <- read.csv('Emoji Dictionary 2.1.csv', stringsAsFactors = FALSE)
# change U+1F469 U+200D U+1F467 to \U1F469\U200D\U1F467
escapes <- gsub("[[:space:]]*\\U\\+", "\\\\U", emojis$Codepoint)
# convert to UTF-8 using the R parser
codes <- sapply(parse(text = paste0("'", escapes, "'"),
                      keep.source = FALSE), eval)
This will convert the text representations like U+1F469 to UTF-8 strings. Then, you can search for these strings in the original data.
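As a minimal sketch of that search-and-replace step (the comments vector here is hypothetical):
comments <- c("great video \U0001F602", "no emoji here")
# replace each emoji with its dictionary name, matching literally
for (i in seq_along(codes)) {
  comments <- gsub(codes[i], emojis$Name[i], comments, fixed = TRUE)
}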
Note: if you are using Windows, make sure you have the latest release of R; in older versions, the parser gives you the wrong result for strings like "\U1F469".
The utf8::utf8_encode function should really only be used if you have UTF-8 text and are trying to print it to the screen.

Is it possible to write package documentation using non-ASCII characters with roxygen2?

Is it possible to write documentation in R using non-ASCII characters (such as å, ä, ö) with roxygen2? I'm asking because I am writing a package with internal functions documented in Swedish.
I have used the following roxygen code to write documentation:
#' @param data data frame där variablen finns
#' @param x variabeln, måste vara en av typen character
This results in the non-ASCII characters being distorted. I can change the .Rd files manually but I'd rather not.
On Windows, encoding sucks in R and is very complicated, and those developing packages don't always consider it a real issue (see roxygen or devtools). What worked for me:
If you have data in your package with non-ASCII labels, e.g. a color vector c(rød = "#C30000", blå = "#00A9E0"), you have to escape the names/values in code:
c(r\u00f8d = "#C30000", bl\u00e5 = "#00A9E0")
In the documentation (if you use roxygenize or devtools::document()) you have to place @encoding UTF-8 before EVERY function description, but can then type on a regular keyboard.
If you have two functions in the same file (e.g. "palette" and "saturation" in a design package for your organisation), you have to place the tag in every description block, not just once.
Example:
#' @encoding UTF-8
#' data structure to define a company palette with æøå
dummypalett <- structure(.Data = c("#c30000", "#00A9E0"),
                         names = c("r\u00f8d", "bl\u00e5"))
#' @encoding UTF-8
#' next function, described with æøåäö
For good measure, I placed Language: nob in the DESCRIPTION file and changed the encoding option in my Rprofile to "UTF-8".
Non-ASCII characters are tricky to use with R (https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Package-subdirectories).
Only ASCII characters (and the control characters tab, formfeed, LF and CR) should be used in code files. Other characters are accepted in comments, but then the comments may not be readable in e.g. a UTF-8 locale. Non-ASCII characters in object names will normally fail when the package is installed. Any byte will be allowed in a quoted character string but \uxxxx escapes should be used for non-ASCII characters. However, non-ASCII character strings may not be usable in some locales and may display incorrectly in others.
For documentation you have to add the tag @encoding UTF-8 to your roxygen2 code.
You can check whether the \uxxxx escapes were rendered correctly in the generated Rd file using the following:
path <- "path to Rd file"
tools::checkRd(path)
I solved this problem by putting
##' @encoding UTF-8
in the roxygen2 documentation comment and then typing
options(encoding = "UTF-8")
in the R console before roxygenizing. For future sessions, it is helpful to add the line
options(encoding = "UTF-8")
in the R/etc/Rprofile.site file.
