Reading Hebrew with read.csv in R (mixed encoding problem)

I have about 1000 CSV files that contain Hebrew.
I'm trying to import them into R, but there is a problem reading the Hebrew into the program.
When using this, I get around 80% of the files with correct Hebrew, but the other 20% not:
data_lst <- lapply(files_to_read, function(i) {
  read.csv(i, encoding = "UTF-8")
})
When using this, I get the other 20% right, but the 80% that worked before do not work here:
data_lst <- lapply(files_to_read, function(i) {
  read.csv(i, encoding = 'utf-8-sig')
})
I'm unable to use read_csv from library(readr) and have to stay with read.csv.
Thank you for your help!

It sounds like you have two different file encodings, utf-8 and utf-8-sig. The latter has a Byte Order Mark of 0xef, 0xbb, 0xbf at the start indicating the encoding.
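If you want to verify which files carry a BOM, a quick way (my addition, not from the original answer; the file name is just a placeholder) is to look at the first three bytes directly:
# Sketch: the first three bytes of a UTF-8-SIG file are the BOM ef bb bf
readBin("some_file.csv", what = "raw", n = 3)
# expected output when a BOM is present: [1] ef bb bf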
I wrote the iris dataset to csv in both encodings - the only difference is the first line.
UTF-8:
sepal.length,sepal.width,petal.length,petal.width,species
UTF-8-SIG (the BOM bytes display as ï»¿ when shown as raw text):
ï»¿sepal.length,sepal.width,petal.length,petal.width,species
In your case, it sounds like R is not detecting the encodings correctly: encoding = "utf-8" works for some files and encoding = "utf-8-sig" works for the others. The natural course of action is therefore to read the first line of each file and check whether it starts with that BOM pattern:
BOM_pattern <- "^\ufeff" # the BOM (U+FEFF) as it appears at the start of the first line
encodings <- vapply(
  files_to_read,
  \(file) {
    line <- readLines(file, n = 1L, encoding = "UTF-8")
    ifelse(grepl(BOM_pattern, line), "utf-8-sig", "utf-8")
  },
  character(1)
)
This will return a (named) character vector of c("utf-8", "utf-8-sig") as appropriate. You can then supply the encoding to read.csv:
data_lst <- Map(
  \(file, encoding) read.csv(file, encoding = encoding),
  files_to_read,
  encodings
)
This should read in each data frame with the correct encoding and store them in the list data_lst.
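As a quick sanity check (my addition, not part of the original answer), you can tabulate the detected encodings and peek at one of the imported frames:
table(encodings)   # how many files were detected as utf-8 vs utf-8-sig
str(data_lst[[1]]) # inspect the first imported data frame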

Related

R import of Stata file has problems with French accented characters

I have a large Stata file that I think has some French accented characters that have been saved poorly.
When I import the file with the encoding set to blank, it won't read in. When I set it to latin1 it will read in, but in one variable (and, I'm certain, in others) French accented characters are not rendered properly. I had a similar problem with another Stata file and I tried to apply the fix from here (which actually did not work in that case, but seems on point).
To be honest, this seems to be the real problem here somehow. A lot of the garbled characters match the "actual" column there and line up with what is "expected", but I have no idea how to go back.
Reproducible code is here:
library(haven)
library(here)
library(tidyverse)
library(labelled)
#Download file
temp <- tempfile()
temp2 <- tempfile()
download.file("https://github.com/sjkiss/Occupation_Recode/raw/main/Data/CES-E-2019-online_F1.dta.zip", temp)
unzip(zipfile = temp, exdir = temp2)
ces19web <- read_dta(file.path(temp2, "CES-E-2019-online_F1.dta"), encoding="latin1")
#Try with encoding set to blank, it won't work.
#ces19web <- read_dta(file.path(temp2, "CES-E-2019-online_F1.dta"), encoding="")
unlink(c(temp, temp2))
#### Diagnostic section for accented characters ####
ces19web$cps19_prov_id
#Note value labels are cut-off at accented characters in Quebec.
#I know this occupation has messed up characters
ces19web %>%
  filter(str_detect(pes19_occ_text, "assembleur-m")) %>%
  select(cps19_ResponseId, pes19_occ_text)
#Check the encodings of the occupation titles and store in a variable encoding
ces19web$encoding<-Encoding(ces19web$pes19_occ_text)
#Check encoding of problematic characters
ces19web %>%
  filter(str_detect(pes19_occ_text, "assembleur-m")) %>%
  select(cps19_ResponseId, pes19_occ_text, encoding)
#Write out messy occupation titles
ces19web %>%
  filter(str_detect(pes19_occ_text, "Ã|©")) %>%
  select(cps19_ResponseId, pes19_occ_text, encoding) %>%
  write_csv(file = here("Data/messy.csv"))
#Try to fix
source("https://github.com/sjkiss/Occupation_Recode/raw/main/fix_encodings.R")
#store the messy variables in messy
messy<-ces19web$pes19_occ_text
library(stringi)
#Try to clean with the function fix_encodings
ces19web$pes19_occ_text_cleaned<-stri_replace_all_fixed(messy, names(fixes), fixes, vectorize_all = F)
#Examine
ces19web %>%
  filter(str_detect(pes19_occ_text_cleaned, "Ã|©")) %>%
  select(cps19_ResponseId, pes19_occ_text, pes19_occ_text_cleaned, encoding) %>%
  head()
Your data file is a dta version 113 file (the first byte in the file is 113). That is, it's a Stata 8 file and, importantly, a pre-Stata 14 file, hence it uses a custom encoding (Stata >= 14 uses UTF-8).
So using the encoding argument of read_dta seems right. But there are a few problems here, as can be seen with a hex editor.
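If you don't have a hex editor handy, a small sketch like this (my addition) reads the format-version byte from R; 113 corresponds to Stata 8:
# Read the first byte of the dta file as an integer (113 = dta format written by Stata 8)
readBin("CES-E-2019-online_F1.dta", what = "integer", n = 1, size = 1)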
First, the truncated labels at accented letters (like Québec → Qu) are actually not caused by haven: they are stored truncated in the dta file.
The pes19_occ_text variable is encoded in UTF-8, as you can check with:
ces19web <- read_dta("CES-E-2019-online_F1.dta", encoding="latin1")
grep("^Producteur", unique(ces19web$pes19_occ_text), value = T)
output: "Producteur tÃ©lÃ©"
This "Ã©" is characteristic of UTF-8 data (here the letter "é") read as latin1.
However, if you try to import with encoding="UTF-8", read_dta will fail: there are likely other non-UTF-8 characters in the file that read_dta can't read as UTF-8. We have to do something after the import.
Here, read_dta is doing something nasty: it imports the UTF-8 bytes of "Producteur télé" as if they were latin1 data and converts them to UTF-8, so the imported string really does contain the UTF-8 characters "Ã" and "©" ("Producteur tÃ©lÃ©").
To fix this, you first have to convert the string back to latin1. It will still read "Producteur tÃ©lÃ©", but it is now encoded in latin1, and its bytes are exactly the UTF-8 encoding of "Producteur télé".
Then, instead of converting again, you simply declare the encoding to be UTF-8, without changing the underlying data.
Here is the code:
ces19web <- read_dta("CES-E-2019-online_F1.dta", encoding="")
ces19web$pes19_occ_text <- iconv(ces19web$pes19_occ_text, from = "UTF-8", to = "latin1")
Encoding(ces19web$pes19_occ_text) <- "UTF-8"
grep("^Producteur", unique(ces19web$pes19_occ_text), value = T)
output: "Producteur télé"
You can do the same on other variables with diacritics.
The use of iconv here may be more understandable if we convert to raw with charToRaw, to see the actual bytes. After importing the data, "tÃ©lÃ©" is the representation of "74 c3 83 c2 a9 6c c3 83 c2 a9" in UTF-8. The first byte 0x74 (in hex) is the letter "t", and 0x6c is the letter "l". In between, we have four bytes each time, instead of only two for the letter "é" in UTF-8 ("c3 a9", i.e. "Ã©" when read as latin1).
Actually, "c3 83" is "Ã" and "c2 a9" is "©".
Therefore, we first have to convert these characters back to latin1, so that they take one byte each. Then "74 c3 a9 6c c3 a9" is the encoding of "tÃ©lÃ©", but this time in latin1. That is, the string has the same bytes as "télé" encoded in UTF-8, and we just need to tell R that the encoding is not latin1 but UTF-8 (this is not a conversion, only a declaration).
See also the help pages of Encoding and iconv.
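To make the byte-level explanation concrete, here is a small self-contained sketch (my addition, assuming a UTF-8 locale) that reproduces the fix on a single string:
bad <- "t\u00c3\u00a9l\u00c3\u00a9"                # "tÃ©lÃ©", as imported by read_dta
charToRaw(bad)                                     # 74 c3 83 c2 a9 6c c3 83 c2 a9
fixed <- iconv(bad, from = "UTF-8", to = "latin1") # back to one byte per character
charToRaw(fixed)                                   # 74 c3 a9 6c c3 a9
Encoding(fixed) <- "UTF-8"                         # declare (not convert) the encoding
fixed                                              # "télé"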
Now a good question may be: how did you end up with such a bad dta file in the first place? It's quite surprising for a Stata 8 file to hold UTF-8 data.
The first idea that comes to mind is a bad use of the saveold command, that allows one to save data in a Stata file for an older version. But according to the reference manual, in Stata 14 saveold can only store files for Stata >=11.
Maybe a third-party tool did this, as well as the bad truncation of labels? It might be SAS or SPSS, for instance. I don't know where your data come from, but it's not uncommon for public providers to use SAS for internal work and to publish converted datasets. For instance, datasets from the European Social Survey are provided in SAS, SPSS and Stata format, but if I remember correctly, initially it was only SAS and SPSS, and Stata came later: the Stata files are probably just converted using another tool.
Answer to the comment: how do you loop over the character variables to do the same? There is a smarter way with dplyr (see the sketch after the loop), but here is a simple loop with base R.
ces19web <- read_dta("CES-E-2019-online_F1.dta")
for (n in names(ces19web)) {
  v <- ces19web[[n]]
  if (is.character(v)) {
    v <- iconv(v, from = "UTF-8", to = "latin1")
    Encoding(v) <- "UTF-8"
  }
  ces19web[[n]] <- v
}
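For the "smarter way with dplyr" mentioned above, a possible sketch (my addition, not the answerer's code) using across() would be:
library(dplyr)

# Same fix as the loop above, applied to every character column
fix_utf8 <- function(x) {
  x <- iconv(x, from = "UTF-8", to = "latin1")
  Encoding(x) <- "UTF-8"
  x
}

ces19web <- ces19web %>%
  mutate(across(where(is.character), fix_utf8))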

Using foreign characters in R data frame

I tried to import some data (a CSV file) into R, but it is in Hebrew and sadly the text is transformed into this, for example: ׳¨׳׳™׳“׳” ׳₪׳¡׳™׳›׳™׳׳˜׳¨׳™׳” ׳׳ ׳¢׳¦׳׳׳™ 43.61
3 ׳™׳¢׳¨׳™ ׳׳‘׳™׳‘ ׳₪׳¡׳™׳›׳™׳׳˜׳¨׳™׳” ׳׳ ׳¢׳¦׳׳׳™ 45.00
4 ׳׳’׳¨׳‘ ׳׳ ׳˜׳•׳ ׳₪׳¡׳™׳›׳™׳׳˜׳¨׳™׳” ׳׳ ׳¢׳¦
What can I do to keep the Hebrew text? Thank you :)
For reading CSV files with Hebrew characters, you can use the readr package, which is part of the tidyverse. This package has a lot of utilities for language encoding and localization, like guess_encoding and locale.
Try the code below:
install.packages("tidyverse")
library(readr)
locale("he")
guess_encoding(file = "path_to_your_file", n_max = 10000, threshold = 0.2) # replace with your file path
df <- read_csv(file = "path_to_your_file", locale = locale(date_names = "he", encoding = "UTF-8")) # replace with your file path
guess_encoding will help you determine which encoding is most likely for your file (for example, UTF-8, ISO 8859-8, Windows-1255, etc.); it estimates the probability that the file is in each of several encodings. You should use the encoding with the highest probability.
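If you want to pick the encoding programmatically rather than by eye, something like this sketch should work (my addition; guess_encoding() returns a table of guesses with their confidence):
enc_guess <- guess_encoding("path_to_your_file") # tibble with columns encoding and confidence
best_enc <- enc_guess$encoding[1]                # take the most likely encoding
df <- read_csv("path_to_your_file", locale = locale(date_names = "he", encoding = best_enc))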

How can I specify the encoding in fwrite() when exporting a csv file in R?

Since fwrite() does not accept an encoding argument, how can I export a csv file in a specific encoding as fast as fwrite()? (fwrite() is the fastest function I know of so far.)
fwrite(DT,"DT.csv",encoding = "UTF-8")
Error in fwrite(DT, "DT.csv", encoding = "UTF-8") :
unused argument (encoding = "UTF-8")
You should post a reproducible example, but I would guess you could do this by making sure the data in DT is in UTF-8 within R, then setting the encoding of each column to "unknown". R will then assume the data is encoded in the native encoding when you write it out.
For example,
DF <- data.frame(text = "á", stringsAsFactors = FALSE)
DF$text <- enc2utf8(DF$text) # Only necessary if Encoding(DF$text) isn't "UTF-8"
Encoding(DF$text) <- "unknown"
data.table::fwrite(DF, "DF.csv", bom = TRUE)
If the columns of DF are factors, you'll need to convert them to character vectors before this will work.
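A minimal sketch of that factor-to-character conversion (my addition):
is_fac <- vapply(DF, is.factor, logical(1))   # which columns are factors
DF[is_fac] <- lapply(DF[is_fac], as.character)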
As of writing this, fwrite does not support forcing encoding. There is a workaround that I use, but it's a bit more obtuse than I'd like. For your example:
readr::write_excel_csv(DT[0, ], "DT.csv") # zero rows, all columns: writes the header only
data.table::fwrite(DT,file = "DT.csv",append = T)
The first line will save only the headers of your data table to the CSV, defaulting to UTF-8 with the Byte order mark required to let Excel know that the file is encoded UTF-8. The fwrite statement then uses the append option to add additional lines to the original CSV. This retains the encoding from write_excel_csv, while maximizing the write speed.
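To confirm the result, you can check that the file starts with the UTF-8 BOM (a quick check I'm adding, not part of the original answer):
readBin("DT.csv", what = "raw", n = 3) # should show ef bb bf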
If you work within R, try this working approach:
# You have DT
# DT is a data.table / data.frame
# DT$text contains text data not encoded as UTF-8
library(data.table)
DT$text <- enc2utf8(DT$text) # force the underlying data to be encoded as UTF-8
fwrite(DT, "DT.csv", bom = TRUE) # then save the file using bom = TRUE
Hope that helps.
I know some people have already answered, but I wanted to contribute a more holistic solution building on the answer from user2554330.
# Encode data in UTF-8
names(DT) <- enc2utf8(names(DT)) # column names need to be encoded too
for (col in colnames(DT)) {
  DT[[col]] <- as.character(DT[[col]]) # allows for enc2utf8() and Encoding()
  DT[[col]] <- enc2utf8(DT[[col]])     # same as user2554330's answer
  Encoding(DT[[col]]) <- "unknown"
}
fwrite(DT, "DT.csv", bom = T)
# When re-importing your data be sure to use encoding = "UTF-8"
DT2 <- fread("DT.csv", encoding = "UTF-8")
# DT2 should be identical to the original DT
This should work for any and all UTF-8 characters anywhere in a data.table.

read.csv() with UTF-8 encoding [duplicate]

I am trying to read in data from a csv file and specify the encoding of the characters to be UTF-8. From reading the ?read.csv documentation, it seems that setting fileEncoding to "UTF-8" should accomplish this; however, that is not what I see when checking. Is there a better way to ensure character strings are UTF-8 when importing the data?
Sample Data:
Download Sample Data here
fruit<- read.csv("fruit.csv", header = TRUE, fileEncoding = "UTF-8")
fruit[] <- lapply(fruit, as.character)
Encoding(fruit$Fruit)
The output is "unknown", but I would expect this to be "UTF-8". What is the best way to ensure all imported characters are UTF-8? Thank you.
fruit <- read.csv("fruit.csv", header = TRUE)
fruit[] <- lapply(fruit, as.character)
fruit$Fruit <- paste0(fruit$Fruit, "\xfcmlaut") # Get non-ASCII char and jam it in!
Encoding(fruit$Fruit)
[1] "latin1" "latin1" "latin1"
fruit$Fruit <- enc2utf8(fruit$Fruit)
Encoding(fruit$Fruit)
[1] "UTF-8" "UTF-8" "UTF-8"

Base64 encoding a .Rda file

All,
I'm trying to figure out how to put a .Rda file into Base64 encoding for it to be shipped to and from an API. I am really struggling with how to do this. Here's what I've got, but I think it's way off target:
cuse <- read.table("http://data.princeton.edu/wws509/datasets/cuse.dat", header=TRUE)
lrfit <- glm( cbind(using, notUsing) ~ age + education + wantsMore , family = binomial, data=cuse)
filename <- "C:/test.Rda"
save(lrfit, file=filename)
library("base64enc")
tst <- base64encode(filename)
save(tst, file="C:/encode.Rda")
base64decode(file="C:/encode.Rda", output = "C:/decode.Rda")
When I try to open the decode.Rda file, it throws a magic number error. Like I said, I think I'm way off base here, and any help would be appreciated. Thank you so much.
Here is a correct sequence of steps that should allow for correct encoding and decoding:
# sample data
dd <- iris
fn <- "test.rda"
fnb4 <- "test.rdab64"
# save rda
save(iris, file = fn)
# write base64 encoded version
library(base64enc)
txt <- base64encode(fn)
ff <- file(fnb4, "wb")
writeBin(txt, ff)
close(ff)
# decode base64 encoded version
base64decode(file = fnb4, output = "decode.rda")
(load("decode.rda"))
# [1] "iris"
The problem was your second save(): it was creating another RDA file with the base64 string stored inside it, rather than writing a base64-encoded version of the RDA file to disk.
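As an optional sanity check (my addition), the decoded file should be byte-identical to the original:
# Compare the raw bytes of the original and round-tripped files
identical(readBin(fn, what = "raw", n = file.size(fn)),
          readBin("decode.rda", what = "raw", n = file.size("decode.rda")))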
