R import of Stata file has problems with French accented characters

I have a large Stata file that I think has some French accented characters that have been saved poorly.
When I import the file with the encoding set to blank, it won't read in. When I set it to latin1 it will read in, but French accented characters are not rendered properly in one variable, and I'm certain in others as well. I had a similar problem with another Stata file and I tried to apply the fix here (which actually did not work in that case, but seems on point).
To be honest, this seems to be the real problem somehow: a lot of the garbled characters are the "actual" values and they match up to what is "expected", but I have no idea how to go back.
Reproducible code is here:
library(haven)
library(here)
library(tidyverse)
library(labelled)
#Download file
temp <- tempfile()
temp2 <- tempfile()
download.file("https://github.com/sjkiss/Occupation_Recode/raw/main/Data/CES-E-2019-online_F1.dta.zip", temp)
unzip(zipfile = temp, exdir = temp2)
ces19web <- read_dta(file.path(temp2, "CES-E-2019-online_F1.dta"), encoding="latin1")
#Try with encoding set to blank, it won't work.
#ces19web <- read_dta(file.path(temp2, "CES-E-2019-online_F1.dta"), encoding="")
unlink(c(temp, temp2))
#### Diagnostic section for accented characters ####
ces19web$cps19_prov_id
#Note value labels are cut-off at accented characters in Quebec.
#I know this occupation has messed up characters
ces19web %>%
  filter(str_detect(pes19_occ_text, "assembleur-m")) %>%
  select(cps19_ResponseId, pes19_occ_text)
#Check the encodings of the occupation titles and store them in a variable encoding
ces19web$encoding <- Encoding(ces19web$pes19_occ_text)
#Check encoding of problematic characters
ces19web %>%
  filter(str_detect(pes19_occ_text, "assembleur-m")) %>%
  select(cps19_ResponseId, pes19_occ_text, encoding)
#Write out messy occupation titles
ces19web %>%
  filter(str_detect(pes19_occ_text, "Ã|©")) %>%
  select(cps19_ResponseId, pes19_occ_text, encoding) %>%
  write_csv(file = here("Data/messy.csv"))
#Try to fix
source("https://github.com/sjkiss/Occupation_Recode/raw/main/fix_encodings.R")
#Store the messy variable in messy
messy <- ces19web$pes19_occ_text
library(stringi)
#Try to clean with the function fix_encodings
ces19web$pes19_occ_text_cleaned <- stri_replace_all_fixed(messy, names(fixes), fixes, vectorize_all = FALSE)
#Examine
ces19web %>%
  filter(str_detect(pes19_occ_text_cleaned, "Ã|©")) %>%
  select(cps19_ResponseId, pes19_occ_text, pes19_occ_text_cleaned, encoding) %>%
  head()

Your data file is a dta version 113 file (the first byte in the file is 113). That is, it's a Stata 8 file, and more importantly a pre-Stata 14 file, hence one using a custom encoding (Stata >= 14 uses UTF-8).
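You can check that version byte from R instead of a hex editor; a minimal sketch, assuming the extracted .dta file sits in the working directory:
# read the first byte of the file as an integer
readBin("CES-E-2019-online_F1.dta", what = "integer", n = 1, size = 1)
# 113, i.e. the dta format written by Stata 8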
So using the encoding argument of read_dta seems right. But there are a few problems here, as can be seen with a hex editor.
First, the truncated labels at accented letters (like Québec → Qu) are actually not caused by haven: they are stored truncated in the dta file.
The pes19_occ_text variable is encoded in UTF-8, as you can check with:
ces19web <- read_dta("CES-E-2019-online_F1.dta", encoding="latin1")
grep("^Producteur", unique(ces19web$pes19_occ_text), value = T)
output: "Producteur tÃ©lÃ©"
This "Ã©" is characteristic of UTF-8 data (here "é") read as latin1.
However, if you try to import with encoding="UTF-8", read_dta will fail: there might be other non-UTF-8 characters in the file that read_dta can't read as UTF-8. We have to do something after the import.
Here, read_dta is doing something nasty: it imports "Producteur tÃ©lÃ©" as if it were latin1 data and converts it to UTF-8, so the resulting string really contains the UTF-8 characters "Ã" and "©".
To fix this, you first have to convert back to latin1. The string will still be "Producteur tÃ©lÃ©", but encoded in latin1.
Then, instead of converting again, you simply have to declare the encoding as UTF-8, without changing the underlying bytes.
Here is the code:
ces19web <- read_dta("CES-E-2019-online_F1.dta", encoding="")
ces19web$pes19_occ_text <- iconv(ces19web$pes19_occ_text, from = "UTF-8", to = "latin1")
Encoding(ces19web$pes19_occ_text) <- "UTF-8"
grep("^Producteur", unique(ces19web$pes19_occ_text), value = T)
output: "Producteur télé"
You can do the same on other variables with diacritics.
The use of iconv here may be easier to understand if we convert to raw with charToRaw, to see the actual bytes. After importing the data, "tÃ©lÃ©" is the representation of "74 c3 83 c2 a9 6c c3 83 c2 a9" in UTF-8. The first byte 0x74 (in hex) is the letter "t", and 0x6c is the letter "l". In between, we have four bytes, instead of only two for the letter "é" in UTF-8 ("c3 a9", i.e. "Ã©" when read as latin1).
Actually, "c3 83" is "Ã" and "c2 a9" is "©".
Therefore, we first have to convert these characters back to latin1, so that they take one byte each. Then "74 c3 a9 6c c3 a9" is the encoding of "tÃ©lÃ©", but this time in latin1. That is, the string now has the same bytes as "télé" encoded in UTF-8, and we just need to tell R that the encoding is not latin1 but UTF-8 (and this is not a conversion).
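If it helps, here is a minimal sketch of that byte-level round trip on a standalone string (the mojibake is written with Unicode escapes so the example is copy-paste safe):
x <- "Producteur t\u00c3\u00a9l\u00c3\u00a9"   # "Producteur tÃ©lÃ©" as imported, encoded in UTF-8
charToRaw(x)                                   # ends in ... 74 c3 83 c2 a9 6c c3 83 c2 a9
x <- iconv(x, from = "UTF-8", to = "latin1")   # back to one byte per character: ... 74 c3 a9 6c c3 a9
Encoding(x) <- "UTF-8"                         # relabel only, no conversion
x                                              # "Producteur télé"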
See also the help pages of Encoding and iconv.
Now a good question may be: how did you end up with such a bad dta file in the first place? It's quite surprising for a Stata 8 file to hold UTF-8 data.
The first idea that comes to mind is a bad use of the saveold command, that allows one to save data in a Stata file for an older version. But according to the reference manual, in Stata 14 saveold can only store files for Stata >=11.
Maybe a third-party tool did this, as well as the bad truncation of labels? It might be SAS or SPSS, for instance. I don't know where your data come from, but it's not uncommon for public providers to use SAS for internal work and to publish converted datasets. For instance, datasets from the European Social Survey are provided in SAS, SPSS and Stata format, but if I remember correctly, initially it was only SAS and SPSS, and Stata came later: the Stata files are probably just converted using another tool.
Answer to the comment: how do you loop over the character variables to do the same thing? There is a smarter way with dplyr (a sketch follows the base R loop below), but here is a simple loop with base R.
ces19web <- read_dta("CES-E-2019-online_F1.dta")
for (n in names(ces19web)) {
v <- ces19web[[n]]
if (is.character(v)) {
v <- iconv(v, from = "UTF-8", to = "latin1")
Encoding(v) <- "UTF-8"
}
ces19web[[n]] <- v
}
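And here is a sketch of the dplyr version mentioned above, assuming dplyr >= 1.0 (for across()): it wraps the same two-step fix in a helper and applies it to every character column.
library(dplyr)
fix_to_utf8 <- function(x) {
  # convert the mis-converted strings back to latin1 bytes ...
  x <- iconv(x, from = "UTF-8", to = "latin1")
  # ... then relabel them as UTF-8 without changing the bytes
  Encoding(x) <- "UTF-8"
  x
}
ces19web <- ces19web %>%
  mutate(across(where(is.character), fix_to_utf8))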

Related

Reading Hebrew with read.csv (mixed encoding problem)

I have around 1000 csv files which contain Hebrew.
I'm trying to import them into R, but there is a problem reading the Hebrew into the program.
When using this, I get around 80% of the files with correct Hebrew, but the other 20% not:
data_lst <- lapply(files_to_read, function(i) {
  read.csv(i, encoding = "UTF-8")
})
When using this, I get the other 20% right but the 80% that worked before does not work here:
data_lst <- lapply(files_to_read, function(i) {
  read.csv(i, encoding = 'utf-8-sig')
})
I'm unable to use read_csv from library(readr) and have to stay with the format of read.csv.
Thank you for your help!
It sounds like you have two different file encodings, utf-8 and utf-8-sig. The latter has a Byte Order Mark of 0xef, 0xbb, 0xbf at the start indicating the encoding.
I wrote the iris dataset to csv in both encodings - the only difference is the first line, where the UTF-8-SIG file starts with the byte order mark.
UTF-8:
sepal.length,sepal.width,petal.length,petal.width,species
UTF-8-SIG:
ï»¿sepal.length,sepal.width,petal.length,petal.width,species
In your case, it sounds like R is not detecting the encodings correctly, but encoding="utf-8" works for some files and encoding="utf-8-sig" works for the other files. The natural course of action seems to be to read in the first line of each file and check whether it starts with that pattern:
BOM_pattern <- "^\ufeff" # the BOM shows up as the single character U+FEFF when read as UTF-8
encodings <- vapply(
  files_to_read,
  \(file) {
    line <- readLines(file, n = 1L, encoding = "utf-8")
    ifelse(grepl(BOM_pattern, line), "utf-8-sig", "utf-8")
  },
  character(1)
)
This will return a (named) character vector of c("utf-8", "utf-8-sig") as appropriate. You can then supply the encoding to read.csv:
data_lst <- Map(
  \(file, encoding) read.csv(file, encoding = encoding),
  files_to_read,
  encodings
)
This should read in each data frame with the correct encoding and store them in the list data_lst.
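As an aside, and assuming all of the files are UTF-8 with or without a BOM, base R can also strip the BOM for you via the fileEncoding argument, which may make the two-step detection unnecessary:
data_lst <- lapply(files_to_read, function(i) {
  # "UTF-8-BOM" reads UTF-8 and drops a leading byte order mark if present
  read.csv(i, fileEncoding = "UTF-8-BOM")
})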

Using foreign characters in R data frame

I tried to import some data (a csv file) into R, but it is in Hebrew and sadly the text is transformed, for example, to this:
׳¨׳׳™׳“׳” ׳₪׳¡׳™׳›׳™׳׳˜׳¨׳™׳” ׳׳ ׳¢׳¦׳׳׳™ 43.61
3 ׳™׳¢׳¨׳™ ׳׳‘׳™׳‘ ׳₪׳¡׳™׳›׳™׳׳˜׳¨׳™׳” ׳׳ ׳¢׳¦׳׳׳™ 45.00
4 ׳׳’׳¨׳‘ ׳׳ ׳˜׳•׳ ׳₪׳¡׳™׳›׳™׳׳˜׳¨׳™׳” ׳׳ ׳¢׳¦
What can I do to keep the Hebrew text? Thank you :)
For reading csv files with Hebrew characters, you can use the readr package, which is part of the tidyverse. It has a lot of utilities for language encoding and localization, like guess_encoding and locale.
Try code below:
install.packages("tidyverse")
library(readr)
locale("he")
guess_encoding(file = "path_to_your_file", n_max = 10000, threshold = 0.2)  # replace with your data
df <- read_csv(file = "path_to_your_file", locale = locale(date_names = "he", encoding = "UTF-8"))  # replace with your data
guess_encoding will help you determine which encoding is most likely for your file (for example, UTF-8, ISO 8859-8, Windows-1255, etc.): it estimates the probability that the file is encoded in each of several candidate encodings, and you should use the encoding with the highest probability.
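If you want to pick the encoding programmatically rather than by eye, you can take the top row of guess_encoding's output (a sketch, using the same hypothetical file path; the guesses are sorted with the most likely first):
guessed  <- guess_encoding(file = "path_to_your_file", n_max = 10000)
best_enc <- guessed$encoding[1]  # encoding with the highest confidence
df <- read_csv(file = "path_to_your_file", locale = locale(date_names = "he", encoding = best_enc))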

Importing xlsx data to R when numbers have a comma as decimal separator

How can I import data from a .xlsx file into R so that numbers are represented as numbers, when their original decimal separator is comma not a dot?
The only package I know of for dealing with Excel files is readxl from the tidyverse.
I'm looking for a solution that doesn't require opening and editing the Excel files in any other software (and that can deal with hundreds of columns to import). If that were an option, I'd export all the Excel files to .csv and import them with tools I know that accept the dec= argument.
So far my best working solution is to import the numbers as characters and then transform them:
library(dplyr)
library(stringr)
var1 <- c("2,1", "3,2", "4,5")
var2 <- c("1,2", "3,33", "5,55")
var3 <- c("3,44", "2,2", "8,88")
df <- data.frame(cbind(var1, var2, var3))
df %>%
  mutate_at(vars(contains("var")),
            str_replace,
            pattern = ",",
            replacement = "\\.") %>%
  mutate_at(vars(contains("var")), funs(as.numeric))
I suspect strongly that there is some other reason these columns are being read as character, most likely that they are the dreaded "Number Stored as Text".
For ordinary numbers (stored as numbers), after switching to comma as decimal separator either for an individual file or in the overall system settings, readxl::read_excel reads in a numeric properly. (This is on my Windows system.) Even when adding a character to one of the cells in that column or setting col_types="text", I get the number read in using a period as decimal, not as comma, giving more evidence that readxl is using the internally stored data type.
The only way I have gotten R to read in a comma as a decimal is when the data is stored in Excel as text instead of as numeric. (You can enter this by prefacing the number with a single quote, like '1,7.) I then get a little green triangle in the corner of the cell, which gives the popup warning "Number Stored as Text". In my exploration, I was surprised to discover that Excel will do calculations on numbers stored as text, so that's not a valid way of checking for this.
It's pretty easy to replace the "," with a "." and recast the column as numeric. Example:
> x <- c('1,00','2,00','3,00')
> df <- data.frame(x)
> df
x
1 1,00
2 2,00
3 3,00
> df$x <- gsub(',','.',df$x)
> df$x <- as.numeric(df$x)
> df
x
1 1
2 2
3 3
> class(df$x)
[1] "numeric"
>
Just using base R and gsub.
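If you would rather not hand-roll the replacement, readr::parse_number understands locale decimal marks and can be applied to columns imported as text; a sketch with hypothetical file and column names, assuming readxl and readr are installed:
library(readxl)
library(readr)
df <- read_excel("data.xlsx", col_types = "text")  # force everything in as text
# interpret "," as the decimal mark and "." as the grouping mark
df$var1 <- parse_number(df$var1, locale = locale(decimal_mark = ",", grouping_mark = "."))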
I just had the same problem while dealing with an Excel spreadsheet I had received from a colleague. After I had tried to import the file using readxl (which failed), I converted the file into a csv file hoping to solve the problem using read_delim and fiddling with the locale and decimal sign options. But the problem was still there, no matter which options I used.
Here is the solution that worked for me: I found out that the characters used in the cells containing missing values (. in my case) were causing trouble. I went back to the Excel file, replaced the . in all cells with missing values with blanks, and just kept the default option for the decimals (,). After that, all columns were imported correctly as numeric using readxl.
If you face this problem with your decimal separator set to ., make sure to tick the box "Match entire cell contents" in Excel before replacing all instances of the missing-value marker ".".

readr::read_csv issue: Chinese characters become messy codes

I'm trying to import a dataset into RStudio; however, I am stuck with the Chinese characters, as they turn into messy codes. Here is the code:
library(tidyverse)
df <- read_csv("中文,英文\n英文,德文")
df
# A tibble: 1 x 2
`\xd6\xd0\xce\xc4` `Ӣ\xce\xc4`
<chr> <chr>
1 "<U+04E2>\xce\xc4" "<U+00B5>\xc2\xce\xc4"
When I use the base function read.csv, it works well. I guess I must be doing something wrong with the encoding, but there is no encoding option in read_csv. How can I do this?
This is because the characters are marked as UTF-8, whereas the actual encoding is the system default (which you can get with stringi::stri_enc_get()).
So, you can do either:
1) Read data with the correct encoding:
df <- read_csv("中文,英文\n英文,德文", locale = locale(encoding = stringi::stri_enc_get()))
2) Read data with the incorrect encoding and mark them with the correct encoding later (note that this does not always work):
df <- read_csv("中文,英文\n英文,德文")
df <- dplyr::mutate_all(df, `Encoding<-`, value = "unknown")
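If the problem is an actual file on disk rather than an inline string, and you know the file's encoding (for example GBK on a simplified-Chinese Windows system), you can also name it explicitly; a sketch with a hypothetical file name:
df <- read_csv("data.csv", locale = locale(encoding = "GBK"))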

Automating a check for non-ASCII characters for a long list of variables in R

I am trying to check several hundred variables in my data frame to figure out which of them contain non-ASCII characters, so that I can convert an SPSS dataset into a .dta dataset using R. The data set comes from SPSS (.sav); I used the foreign package and read.spss(filename, to.data.frame = TRUE) to read it into R. Now I would like to use write.dta to put my data frame back into Stata format, but I get the error:
In abbreviate(ll, 80L) : abbreviate used with non-ASCII chars
Thanks to Josh O'Brien's response to the post "Removing non-ASCII characters from data files", I am able to use his code to check one variable at a time for non-ASCII characters:
## Do any lines contain non-ASCII characters?
any(grepl("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub="I_WAS_NOT_ASCII")))
[1] TRUE
and then, within any variable for which this is TRUE, check for the location of the non-ASCII characters:
## Find which lines (e.g. read in by readLines()) contain non-ASCII characters
grep("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub="I_WAS_NOT_ASCII"))
[1] 1 2 3
Is there a way to use these functions in R to check multiple variables at once and return a list of the variables that contain non-ASCII characters, or can it only be done with a loop? Even more convenient would be a way to just tell R to convert all non-ASCII characters in the data frame into something ASCII-compatible so that I can write it to Stata. So far I envision using hadley's answer to the same post referenced above: I would need to convert each variable individually into an ASCII-compatible string variable, add it to my dataset, and then drop the offending variable.
Expanding on code from Hadley's answer:
library('stringi')
library('dplyr')
# simulating an example
x <- c("Ekstr\u00f8m", "J\u00f6reskog", "bi\u00dfchen Z\u00fcrcher")
df <- data.frame(id = 1:3,
                 logi = c(T, T, F),
                 test = x,
                 test2 = rev(x),
                 test_norm = c('Everything', 'is', 'perfect'))
# added several non-character columns to show that they are not affected
# Now translating every character column to ASCII
df2 <- df %>%
  mutate_if(is.character,
            stri_trans_general,
            id = "latin-ascii")
df2
  id  logi             test            test2  test_norm
1  1  TRUE          Ekstrom bisschen Zurcher Everything
2  2  TRUE         Joreskog         Joreskog         is
3  3 FALSE bisschen Zurcher          Ekstrom    perfect
Of course, this will only work for transliterating Latin characters to ASCII.
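For the detection part of the question (which columns contain non-ASCII characters at all), the same iconv trick can be applied over all columns at once; a base R sketch, assuming the data frame is called df as above:
# flag character columns that contain at least one non-ASCII character
has_non_ascii <- vapply(df, function(x) {
  is.character(x) &&
    any(grepl("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub = "I_WAS_NOT_ASCII")), na.rm = TRUE)
}, logical(1))
names(df)[has_non_ascii]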
