I want to convert character strings to UTF-8. At the moment, I've managed to do this using stringi, like this:
test_string <- c("Fiancé is great.")
stringi::stri_encode(test_string, "UTF-8")
However, how can I do the same using R base or stringr?
Thanks in advance
The iconv function may be a choice.
For example, if the current encoding is latin1:
iconv(test_string, "latin1", "UTF-8")
You can use Encoding and enc2utf8 from base:
test_string <- c("Fiancé is great.")
Encoding(test_string)
# [1] "latin1"
test_string <- enc2utf8(test_string)  # converts the bytes and marks the result as UTF-8
Encoding(test_string)
# [1] "UTF-8"
And you can find more alternatives here.
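For the stringr part of the question: stringr itself leaves encoding conversion to stringi, but it does export str_conv(), which re-encodes a string from a declared source encoding to UTF-8. A minimal sketch, assuming the input really is latin1:
x <- "Fianc\xe9 is great."   # "é" stored as the single latin1 byte 0xe9
stringr::str_conv(x, "latin1")  # returns the string converted to UTF-8
# [1] "Fiancé is great."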
I have around 1000 csv files which contain Hebrew.
I'm trying to import them into R but there is a problem reading Hebrew into the program.
When using this, I get around 80% of the files with correct Hebrew but the other 20% do not:
data_lst <- lapply(files_to_read, function(i) {
  read.csv(i, encoding = "UTF-8")
})
When using this, I get the other 20% right but the 80% that worked before does not work here:
data_lst <- lapply(files_to_read, function(i) {
  read.csv(i, encoding = 'utf-8-sig')
})
I'm unable to use read_csv from library(readr) and have to stay with the format of read.csv.
Thank you for your help!
It sounds like you have two different file encodings, utf-8 and utf-8-sig. The latter has a Byte Order Mark of 0xef, 0xbb, 0xbf at the start indicating the encoding.
I wrote the iris dataset to csv in both encodings - the only difference is the first line.
UTF-8:
sepal.length,sepal.width,petal.length,petal.width,species
UTF-8-SIG:
ï»¿sepal.length,sepal.width,petal.length,petal.width,species
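To verify at the byte level which variant you have, you can peek at a file's first three bytes (the file name here is only an example):
readBin("iris_utf8_sig.csv", what = "raw", n = 3L)
# [1] ef bb bf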
In your case, it sounds like R is not detecting the encodings correctly, but using encoding="utf-8" works for some files and encoding="utf-8-sig" works for the others. The natural course of action is therefore to read in the first line of each file and check whether it starts with that pattern:
BOM_pattern <- "^\ufeff"  # the BOM decodes to a single U+FEFF character at the start of the line
encodings <- vapply(
  files_to_read,
  \(file) {
    line <- readLines(file, n = 1L, encoding = "utf-8")
    ifelse(grepl(BOM_pattern, line), "utf-8-sig", "utf-8")
  },
  character(1)
)
This will return a (named) character vector of c("utf-8", "utf-8-sig") as appropriate. You can then supply the encoding to read.csv:
data_lst <- Map(
  \(file, encoding) read.csv(file, encoding = encoding),
  files_to_read,
  encodings
)
This should read in each data frame with the correct encoding and store them in the list data_lst.
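If the files all share the same columns, a natural next step is to bind the list into a single data frame; a short sketch under that assumption:
all_data <- do.call(rbind, data_lst)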
Does someone know how it is possible to detect and replace "\x" in R?
library(stringr)
x <- "gesh\xfc"
str_detect(x, "\\x")
# Error in stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)) :
# Unrecognized backslash escape sequence in pattern. (U_REGEX_BAD_ESCAPE_SEQUENCE)
nchar(x)
# Error in nchar(x) : invalid multibyte string, element 1
iconv(x, "latin1", "utf-8")
# [1] "geshü"
Encoding(x)
# [1] "unknown"
Session Info:
> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6
...
locale:
[1] fr_CH.UTF-8/fr_CH.UTF-8/fr_CH.UTF-8/C/fr_CH.UTF-8/fr_CH.UTF-8
Context: I read a .csv file with data.table::fread(), but this file has colnames in German with letters such as ä, ö, ü, etc. Once read into R, those letters turn into something starting with "\x". This is simply unusable in R afterwards.
Just to summarize what happened here. The "\x" is NOT part of the string; it is just how R escapes bytes that it can't otherwise print. In the case of "gesh\xfc", the first 4 characters are basic ASCII characters, but the last one is escaped as "\xfc". In the latin1 encoding (which Windows uses by default) the 0xfc byte is the "ü" character. So on my Windows machine, I see
x <- "gesh\xfc"
x
# [1] "geshü"
And you can look at the raw bytes of that string with
charToRaw("gesh\xfc")
# [1] 67 65 73 68 fc
You can see the ASCII hex character codes for the first 4 values, and then you can see that the \x was actually just used to include the "fc" character code in the string. The string itself only has 5 "characters".
But if you are not using latin1, the "fc" byte doesn't map to anything on its own. Basically, that string doesn't make any sense in the utf-8 encoding, which is what the Mac uses by default. You can convert it to utf-8 with
iconv("gesh\xfc", "latin1", "utf-8")
But since you got this string by importing a text file, the problem was that R didn't know the file wasn't encoded in UTF-8, so you wound up with these strange values. You should tell fread that the file came from Windows so it can import the strings properly from the start:
fread(file, encoding = "Latin-1")
You need to know what encoding was used to create a file you are importing, especially when it was made by someone else. It's not really possible for programs to guess correctly.
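That said, if you have to guess, stringi ships a heuristic detector that ranks likely encodings; treat its output as a hint rather than a guarantee. A sketch (the file path is a placeholder):
library(stringi)
bytes <- readBin("mystery.csv", what = "raw", n = 10000L)
stri_enc_detect(bytes)  # ranked candidate encodings with confidence scores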
Since fwrite() does not accept an encoding argument, how can I export a csv file in a specific encoding as fast as fwrite() does? (fwrite() is the fastest such function I know of so far.)
fwrite(DT,"DT.csv",encoding = "UTF-8")
Error in fwrite(DT, "DT.csv", encoding = "UTF-8") :
unused argument (encoding = "UTF-8")
You should post a reproducible example, but I would guess you could do this by making sure the data in DT is in UTF-8 within R, then setting the encoding of each column to "unknown". R will then assume the data is encoded in the native encoding when you write it out.
For example,
DF <- data.frame(text = "á", stringsAsFactors = FALSE)
DF$text <- enc2utf8(DF$text) # Only necessary if Encoding(DF$text) isn't "UTF-8"
Encoding(DF$text) <- "unknown"
data.table::fwrite(DF, "DF.csv", bom = TRUE)
If the columns of DF are factors, you'll need to convert them to character vectors before this will work.
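A minimal sketch of that factor-to-character step (using the DF from above):
is_fct <- vapply(DF, is.factor, logical(1))
DF[is_fct] <- lapply(DF[is_fct], as.character)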
As of writing this, fwrite does not support forcing encoding. There is a workaround that I use, but it's a bit more obtuse than I'd like. For your example:
readr::write_excel_csv(DT[0, ], "DT.csv")               # zero rows: writes only the header line
data.table::fwrite(DT, file = "DT.csv", append = TRUE)  # appends the data without repeating the header
The first line saves only the headers of your data table to the CSV, defaulting to UTF-8 with the Byte Order Mark required to let Excel know that the file is encoded as UTF-8. The fwrite statement then uses the append option to add additional lines to the original CSV. This retains the encoding from write_excel_csv while maximizing the write speed.
If you work within R, try this working approach:
# You have DT
# DT is a data.table / data.frame
# DT$text contains any text data not encoded with 'utf-8'
library(data.table)
DT$text <- enc2utf8(DT$text)  # forces the underlying data to be encoded as 'utf-8'
fwrite(DT, "DT.csv", bom = TRUE)  # then save the file using 'bom = TRUE'
Hope that helps.
I know some people have already answered but I wanted to contribute a more holistic solution using the answer from user2554330.
# Encode data in UTF-8
names(DT) <- enc2utf8(names(DT))        # column names need to be encoded too
for (col in colnames(DT)) {
  DT[[col]] <- as.character(DT[[col]])  # allows for enc2utf8() and Encoding()
  DT[[col]] <- enc2utf8(DT[[col]])      # same as user2554330's answer
  Encoding(DT[[col]]) <- "unknown"
}
fwrite(DT, "DT.csv", bom = TRUE)
# When re-importing your data be sure to use encoding = "UTF-8"
DT2 <- fread("DT.csv", encoding = "UTF-8")
# DT2 should be identical to the original DT
This should work for any and all UTF-8 characters anywhere in a data.table.
I have been using the googlesheets package to upload and download data from a websheet. Previously, it had been downloading strings with non-ASCII symbols as the icon �. Now, for no apparent reason, it has started downloading them as the following string: ï¿½. How can I convert ï¿½ back to the diamond question mark symbol (�)?
It's likely that you have an encoding problem. I suspect that the raw data is encoded in UTF-8, but at some point it is getting treated as Windows-1252.
This is what happens when the encoding is wrongly marked as Windows-1252, and then converted to UTF-8:
x <- "Here is a raw string: � is getting converted to �"
(y <- iconv(x, "WINDOWS-1252", "UTF-8"))
#> [1] "Here is a raw string: � is getting converted to �"
You can fix the encoding error by converting from UTF-8 to Windows-1252, then marking the result as UTF-8:
z <- iconv(y, "UTF-8", "WINDOWS-1252")
Encoding(z) <- "UTF-8"
print(z)
#> [1] "Here is a raw string: � is getting converted to �"
Note: The code will still work on MacOS and Linux if you leave out the Encoding(z) <- "UTF-8" line, but it will break on Windows. If you leave out that line then z will have "unknown" encoding, which gets interpreted as "UTF-8" on Linux and MacOS but not on Windows.
Windows Users
If you're using Windows, then the fix could be much simpler. If your data has "unknown" encoding, then on MacOS and Linux it will (correctly) be interpreted as UTF-8, but on Windows it will be interpreted using your native encoding, usually Windows-1252. If you are on Windows, then something like the following happens:
x <- "Here is a raw string: � is getting converted to �"
y <- x
Encoding(y) <- "unknown"
print(y)
#> [1] "Here is a raw string: � is getting converted to �"
You can fix this as follows:
z <- y
Encoding(z) <- "UTF-8"
print(z)
#> [1] "Here is a raw string: � is getting converted to �"
I am trying to read in data from a csv file and specify the encoding of the characters to be UTF-8. From reading through the ?read.csv() instructions, it seems that setting fileEncoding equal to UTF-8 should accomplish this; however, I am not seeing that when checking. Is there a better way to specify the encoding of character strings to be UTF-8 when importing the data?
Sample Data:
Download Sample Data here
fruit<- read.csv("fruit.csv", header = TRUE, fileEncoding = "UTF-8")
fruit[] <- lapply(fruit, as.character)
Encoding(fruit$Fruit)
The output is "unknown" but I would expect this to be "UTF-8". What is the best way to ensure all imported characters are UTF-8? Thank you.
fruit <- read.csv("fruit.csv", header = TRUE)
fruit[] <- lapply(fruit, as.character)
fruit$Fruit <- paste0(fruit$Fruit, "\xfcmlaut") # Get non-ASCII char and jam it in!
Encoding(fruit$Fruit)
[1] "latin1" "latin1" "latin1"
fruit$Fruit <- enc2utf8(fruit$Fruit)
Encoding(fruit$Fruit)
[1] "UTF-8" "UTF-8" "UTF-8"