readr::read_csv issue: Chinese characters become garbled

I'm trying to import a dataset into RStudio, but I am stuck on Chinese characters: they come out garbled. Here is the code:
library(tidyverse)
df <- read_csv("中文,英文\n英文,德文")
df
# A tibble: 1 x 2
`\xd6\xd0\xce\xc4` `Ӣ\xce\xc4`
<chr> <chr>
1 "<U+04E2>\xce\xc4" "<U+00B5>\xc2\xce\xc4"
When I use the base function read.csv, it works well. I guess I must be doing something wrong with the encoding, but there is no encoding option in read_csv. How can I do this?

This is because the characters are marked as UTF-8, whereas the actual encoding is the system default (which you can get with stringi::stri_enc_get()).
So, you can do either:
1) Read data with the correct encoding:
df <- read_csv("中文,英文\n英文,德文", locale = locale(encoding = stringi::stri_enc_get()))
2) Read data with the incorrect encoding and mark them with the correct encoding later (note that this does not always work):
df <- read_csv("中文,英文\n英文,德文")
df <- dplyr::mutate_all(df, `Encoding<-`, value = "unknown")
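If the data is in a file rather than an inline string and you are not sure which encoding to pass to locale(), readr also provides guess_encoding(); here is a minimal sketch, where "data.csv" is a hypothetical file name and "GB18030" is only an example value:
library(readr)
# Returns candidate encodings with confidence scores
guess_encoding("data.csv")
# Pass the best guess (or stringi::stri_enc_get()) explicitly
df <- read_csv("data.csv", locale = locale(encoding = "GB18030"))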

Related

Modify encodings of accented characters in value labels

I am having a very hard time with accented characters in a Stata file I have to import into R. I solved one problem over here, but there's another problem.
After import, any time I use the look_for() command from the labelled package I get this error.
remotes::install_github("sjkiss/cesdata")
library(cesdata)
data("ces19web")
library(labelled)
look_for(ces19web, "vote")
invalid multibyte string at '<e9>bec Solidaire'
Now, I can find one value label that contains that text, but it actually displays properly, so I don't know what is going on.
val_labels(ces19web$pes19_provvote)
But, there are other problematic value labels that cause other problems. For example, the value labels for the 13th variable cause this problem.
# This works fine
ces19web %>%
select(1:12) %>%
look_for(., "[a-z]")
# This chokes
ces19web %>%
select(1:13) %>%
look_for(., "[a-z]")
# See the accented character
val_labels(ces19web[,13])
I have come up with this way of replacing the accented characters of the second type.
names(val_labels(ces19web$cps19_imp_iss_party))<-iconv(names(val_labels(ces19web$cps19_imp_iss_party)), from="latin1", to="UTF-8")
And this even solves the problem for look_for()
#This now works!
ces19web %>%
select(1:13) %>%
look_for(., "[a-z]")
But what I need is a way to loop through all of the names of all of the value labels and make this conversion for all the bungled accented characters.
This is so close, but I don't know how to save the results of this as the new names for the value labels:
ces19web %>%
  #map onto all the variables and get the value labels
  map(., val_labels) %>%
  #map onto each set of value labels
  map(., ~{
    #Skip if there are no value labels
    if (!is.null(.x)){
      #If not convert the names as above
      names(.x) <- iconv(names(.x), from = "latin1", to = "UTF-8")
    }
  }) -> out
#Compare the 16th variable's value labels in the original
ces19web[,16]
#With the 16th set of value labels after the conversion function above
out[[16]]
But how do I make that conversion actually stick in the original dataset?
Thank you!
There is a problem with the character variables: all encodings are marked as either "unknown" (i.e. no non-ASCII characters) or UTF-8. However, some strings are really latin1 strings: for instance, 0xe9 is the latin1 encoding of "é".
Assuming all character variables are actually latin1, you can do this:
ces19web_corr <- as.data.frame(lapply(ces19web, function(v) {
  if (is.character(v)) {
    Encoding(v) <- "latin1"
    v <- iconv(v, from = "latin1", to = "UTF-8")
  } else if (is.factor(v)) {
    lev <- levels(v)
    Encoding(lev) <- "latin1"
    lev <- iconv(lev, from = "latin1", to = "UTF-8")
    levels(v) <- lev
  }
  v
}))
Alternatively, if only some of them have the problem, you will have to select which ones to fix, for instance with the check sketched below.
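A minimal sketch of one way to do that check, assuming the columns to fix are exactly those containing strings that are not valid UTF-8 (base R's validUTF8() flags them):
# Which character columns contain strings that are not valid UTF-8?
# Those are the candidates for the latin1 fix above.
suspect <- vapply(ces19web, function(v) {
  is.character(v) && !all(validUTF8(v[!is.na(v)]))
}, logical(1))
names(ces19web)[suspect]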
Side comment: it might be that you applied my fix from the other post to a data file (or some of its columns) which doesn't have the problem described in your other question. Then you accidentally forced the wrong encoding, and the code above is just forcing back the right one.
I don't know if I understand your problem correctly (the explanations are very verbose), but is it just a matter of reassigning the data frame?
library(magrittr)
library(purrr)
ces19web %<>% #### REASSIGN THE RESULT WITH THE %<>% OPERATOR
  #map onto all the variables and get the value labels
  map(., val_labels) %>%
  #map onto each set of value labels
  map(., ~{
    #Skip if there are no value labels
    if (!is.null(.x)){
      #If not convert the names as above
      names(.x) <- iconv(names(.x), from = "latin1", to = "UTF-8")
    }
    #Return the (possibly updated) labels
    .x
  })
#Check the 16th set of value labels after the conversion
ces19web[[16]]

R import of Stata file has problems with French accented characters

I have a large Stata file that I think has some French accented characters that have been saved poorly.
When I import the file with the encoding set to blank, it won't read in. When I set it to latin1 it will read in, but in one variable, and I'm certain in others, French accented characters are not rendered properly. I had a similar problem with another Stata file and I tried to apply the fix (which actually did not work in that case, but seems on point) here.
To be honest, this seems to be the real problem here somehow. A lot of the garbled characters are "actual" and they match up to what is "expected", but I have no idea how to go back.
Reproducible code is here:
library(haven)
library(here)
library(tidyverse)
library(labelled)
#Download file
temp <- tempfile()
temp2 <- tempfile()
download.file("https://github.com/sjkiss/Occupation_Recode/raw/main/Data/CES-E-2019-online_F1.dta.zip", temp)
unzip(zipfile = temp, exdir = temp2)
ces19web <- read_dta(file.path(temp2, "CES-E-2019-online_F1.dta"), encoding="latin1")
#Try with encoding set to blank, it won't work.
#ces19web <- read_dta(file.path(temp2, "CES-E-2019-online_F1.dta"), encoding="")
unlink(c(temp, temp2))
#### Diagnostic section for accented characters ####
ces19web$cps19_prov_id
#Note value labels are cut-off at accented characters in Quebec.
#I know this occupation has messed up characters
ces19web %>%
filter(str_detect(pes19_occ_text,"assembleur-m")) %>%
select(cps19_ResponseId, pes19_occ_text)
#Check the encodings of the occupation titles and store in a variable encoding
ces19web$encoding<-Encoding(ces19web$pes19_occ_text)
#Check encoding of problematic characters
ces19web %>%
filter(str_detect(pes19_occ_text,"assembleur-m")) %>%
select(cps19_ResponseId, pes19_occ_text, encoding)
#Write out messy occupation titles
ces19web %>%
filter(str_detect(pes19_occ_text,"Ã|©")) %>%
select(cps19_ResponseId, pes19_occ_text, encoding) %>%
write_csv(file=here("Data/messy.csv"))
#Try to fix
source("https://github.com/sjkiss/Occupation_Recode/raw/main/fix_encodings.R")
#store the messy variables in messy
messy<-ces19web$pes19_occ_text
library(stringi)
#Try to clean with the function fix_encodings
ces19web$pes19_occ_text_cleaned<-stri_replace_all_fixed(messy, names(fixes), fixes, vectorize_all = F)
#Examine
ces19web %>%
filter(str_detect(pes19_occ_text_cleaned,"Ã|©")) %>%
select(cps19_ResponseId, pes19_occ_text, pes19_occ_text_cleaned, encoding) %>%
head()
Your data file is a dta version 113 file (the first byte in the file is 113). That is, it's a Stata 8 file and, more importantly, a pre-Stata 14 file, hence one using a custom encoding (Stata >= 14 uses UTF-8).
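If you want to check this yourself without a hex editor, a quick way is to read that first byte directly (using the file name from the question):
# Old dta formats store the format version in the first byte
# (113 = Stata 8/9, 114 = Stata 10/11, 115 = Stata 12);
# Stata 13+ files start with an XML-like "<stata_dta>" header instead
readBin("CES-E-2019-online_F1.dta", what = "integer", n = 1, size = 1)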
So using the encoding argument of read_dta seems right. But there are a few problems here, as can be seen with a hex editor.
First, the truncated labels at accented letters (like Québec → Qu) are actually not caused by haven: they are stored truncated in the dta file.
The pes19_occ_text is encoded in UTF-8, as you can check with:
ces19web <- read_dta("CES-E-2019-online_F1.dta", encoding="UTF-8")
grep("^Producteur", unique(ces19web$pes19_occ_text), value = T)
output: "Producteur télé"
This "é" is characteristic of UTF-8 data (here "é") read as latin1.
However, if you try to import with encoding="UTF-8", read_dta will fail: there might be other non-UTF-8 characters in the file that read_dta can't read as UTF-8. We have to do something after the import.
Here, read_dta is doing something nasty: it imports "Producteur télé" as if it were latin1 data, and converts to UTF-8, so the encoding string really has UTF-8 characters "Ã" and "©".
To fix this, you have first to convert back to latin1. The string will still be "Producteur télé", but encoded in latin1.
Then, instead of converting, you have simply to force the encoding as UTF-8, without changing the data.
Here is the code:
ces19web <- read_dta("CES-E-2019-online_F1.dta", encoding="")
ces19web$pes19_occ_text <- iconv(ces19web$pes19_occ_text, from = "UTF-8", to = "latin1")
Encoding(ces19web$pes19_occ_text) <- "UTF-8"
grep("^Producteur", unique(ces19web$pes19_occ_text), value = T)
output: "Producteur télé"
You can do the same on other variables with diacritics.
The use of iconv here may be more understandable if we convert to raw with charToRaw, to see the actual bytes. After importing the data, "télé" is the representation of "74 c3 83 c2 a9 6c c3 83 c2 a9" in UTF-8. The first byte 0x74 (in hex) is the letter "t", and 0x6c is the letter "l". In between, we have four bytes, instead of only two for the letter "é" in UTF-8 ("c3 a9", i.e. "é" when read as latin1).
Actually, "c3 83" is "Ã" and "c2 a9" is "©".
Therefore, we have first to convert these characters back to latin1, so that they take one byte each. Then "74 c3 a9 6c c3 a9" is the encoding of "télé", but this time in latin1. That is, the string has the same bytes as "télé" encoded in UTF-8, and we just need to tell R that the encoding is not latin1 but UTF-8 (and this is not a conversion).
See also the help pages of Encoding and iconv.
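To see this byte-level picture for yourself, here is a small illustration with charToRaw() (the mojibake string is built with \u escapes so the example is self-contained):
charToRaw("\u00e9")                 # c3 a9 : the UTF-8 bytes of "é"
x <- "t\u00c3\u00a9l\u00c3\u00a9"   # the mojibake that "télé" became after import
charToRaw(x)                        # 74 c3 83 c2 a9 6c c3 83 c2 a9
y <- iconv(x, from = "UTF-8", to = "latin1")
charToRaw(y)                        # 74 c3 a9 6c c3 a9 : same bytes as "télé" in UTF-8
Encoding(y) <- "UTF-8"              # declare the encoding, do not convert
y                                   # "télé"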
Now a good question may be: how did you end up with such a bad dta file in the first place? It's quite surprising for a Stata 8 file to hold UTF-8 data.
The first idea that comes to mind is a bad use of the saveold command, that allows one to save data in a Stata file for an older version. But according to the reference manual, in Stata 14 saveold can only store files for Stata >=11.
Maybe a third party tool did this, as well as the bad truncation of labels? It might be SAS or SPSS, for instance. I don't know where your data come from, but it's not uncommon for public providers to use SAS for internal work and to publish converted datasets. For instance, datasets from the European Social Survey are provided in SAS, SPSS and Stata format, but if I remember correctly, initially it was only SAS and SPSS, and Stata came later: the Stata files are probably just converted using another tool.
Answer to the comment: how do you loop over character variables to do the same? There is a smarter way with dplyr (a sketch follows the loop), but here is a simple loop with base R.
ces19web <- read_dta("CES-E-2019-online_F1.dta")
for (n in names(ces19web)) {
v <- ces19web[[n]]
if (is.character(v)) {
v <- iconv(v, from = "UTF-8", to = "latin1")
Encoding(v) <- "UTF-8"
}
ces19web[[n]] <- v
}
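For reference, one way to write the "smarter" dplyr variant mentioned above (a sketch assuming dplyr >= 1.0, where across() and where() are available):
library(dplyr)
ces19web <- ces19web %>%
  mutate(across(where(is.character), ~ {
    out <- iconv(.x, from = "UTF-8", to = "latin1")
    Encoding(out) <- "UTF-8"
    out
  }))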

Read a csv file with quotation marks and regex in R

ne,class,regex,match,event,msg
BOU2-P-2,"tengigabitethernet","tengigabitethernet(?'connector'\d{1,2}\/\d{1,2})","4/2","lineproto-5-updown","%lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down"
These are the first two lines; the first one will serve as the column names. The fields are all separated by commas and the values are in quotation marks, except for the first one, and I think that is what creates the trouble.
I am interested in the columns class and msg, so this output will suffice:
class msg
tengigabitethernet %lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down
but I can also import all the columns and drop the ones I don't want later; that's no problem.
The data comes in a .csv file that was given to me.
If I open this file in Excel, the columns are all in one.
I work in France, but I don't know in which locale or encoding the file was created (by the way, I'm not French, so I am not really familiar with those).
I tried with
df <- read.csv("file.csv", stringsAsFactors = FALSE)
and the data frame has the column names nicely separated, but the values all end up in the first column.
then with
library(readr)
df <- read_delim('file.csv',
delim = ",",
quote = "",
escape_double = FALSE,
escape_backslash = TRUE)
but this way the regex column gets split into two columns, so I lose the msg variable altogether.
With
library(data.table)
df <- fread("file.csv")
I get the msg variable present but empty, as the ne variable contains both ne and class, separated by a comma.
This is the best output so far, as I can manipulate it to get the desired one.
Another option is to load the file as a character vector with readLines and fix it, but I am not an expert with regexes, so I would be clueless.
The file is also 300k lines, so it would be hard to inspect it.
Both read_delim and fread give warning messages; I can include them if they might be useful.
update:
using
library(data.table)
df <- fread("file.csv", quote = "")
gives me output that is easier to manipulate: it splits the regex and msg columns in two, but ne and class are distinct.
I tried the input you provided with read.csv and had no problems; when subsetting, each column is accessible. As for your other options, you're getting the quote option wrong: it needs to be "\"", i.e. the double-quote character needs to be escaped: df <- fread("file.csv", quote = "\"").
When using read.csv with your example I definitely get a data frame with 1 line and 6 columns:
df <- read.csv("file.csv")
nrow(df)
# Output result for number of rows
# > 1
ncol(df)
# Output result for number of columns
# > 6
df$ne
# > "BOU2-P-2"
df$class
# > "tengigabitethernet"
df$regex
# > "tengigabitethernet(?'connector'\\d{1,2}\\/\\d{1,2})"
df$match
# > "4/2"
df$event
# > "lineproto-5-updown"
df$msg
# > "%lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down"

How can I specify the encoding in fwrite() when exporting a csv file in R?

Since fwrite() does not accept an encoding argument, how can I export a csv file in a specific encoding as fast as fwrite() does? (fwrite() is the fastest function I know of so far.)
fwrite(DT,"DT.csv",encoding = "UTF-8")
Error in fwrite(DT, "DT.csv", encoding = "UTF-8") :
unused argument (encoding = "UTF-8")
You should post a reproducible example, but I would guess you could do this by making sure the data in DT is in UTF-8 within R, then setting the encoding of each column to "unknown". R will then assume the data is encoded in the native encoding when you write it out.
For example,
DF <- data.frame(text = "á", stringsAsFactors = FALSE)
DF$text <- enc2utf8(DF$text) # Only necessary if Encoding(DF$text) isn't "UTF-8"
Encoding(DF$text) <- "unknown"
data.table::fwrite(DF, "DF.csv", bom = TRUE)
If the columns of DF are factors, you'll need to convert them to character vectors before this will work.
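A small sketch of that factor-to-character step, assuming DF is the data frame from the example above (or your own):
# Convert any factor columns to character before fixing the encoding
is_fac <- vapply(DF, is.factor, logical(1))
DF[is_fac] <- lapply(DF[is_fac], as.character)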
As of writing this, fwrite does not support forcing encoding. There is a workaround that I use, but it's a bit more obtuse than I'd like. For your example:
readr::write_excel_csv(DT[0, ], "DT.csv")
data.table::fwrite(DT, file = "DT.csv", append = TRUE)
The first line will save only the headers of your data table to the CSV, defaulting to UTF-8 with the Byte order mark required to let Excel know that the file is encoded UTF-8. The fwrite statement then uses the append option to add additional lines to the original CSV. This retains the encoding from write_excel_csv, while maximizing the write speed.
If you work within R, try this working approach:
# You have DT
# DT is a data.table / data.frame
# DT$text contains text data not encoded as 'utf-8'
library(data.table)
DT$text <- enc2utf8(DT$text) # force the underlying data to be encoded as 'utf-8'
fwrite(DT, "DT.csv", bom = TRUE) # then save the file using ' bom = TRUE '
Hope that helps.
I know some people have already answered, but I wanted to contribute a more holistic solution building on the answer from user2554330.
# Encode data in UTF-8
names(DT) <- enc2utf8(names(DT)) # Column names need to be encoded too
for (col in colnames(DT)) {
  DT[[col]] <- as.character(DT[[col]]) # Allows for enc2utf8() and Encoding()
  DT[[col]] <- enc2utf8(DT[[col]])     # same as user2554330's answer
  Encoding(DT[[col]]) <- "unknown"
}
fwrite(DT, "DT.csv", bom = TRUE)
# When re-importing your data, be sure to use encoding = "UTF-8"
DT2 <- fread("DT.csv", encoding = "UTF-8")
# DT2 should be identical to the original DT
This should work for any and all UTF-8 characters anywhere in a data.table.

Automating a check for non-ASCII characters for a long list of variables in R

I am trying to check several hundred variables in my data frame to figure out which of them contain non-ASCII characters, so that I can convert an SPSS dataset into a .dta dataset using R. The dataset comes from SPSS (.sav); I used the foreign package and read.spss(filename, to.data.frame = TRUE) to read it into R. Now I would like to use write.dta to put my data frame back into Stata, but I get this error:
In abbreviate(ll, 80L) : abbreviate used with non-ASCII chars
Thanks to Josh O'Brien's response to the following post: "Removing non-ASCII characters from data files", I am able to use his code to check one variable at a time for non-ASCII characters.
## Do any lines contain non-ASCII characters?
any(grepl("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub="I_WAS_NOT_ASCII")))
[1] TRUE
and then check within any variable for which this is TRUE for the location of the non-ASCII characters.
## Find which lines (e.g. read in by readLines()) contain non-ASCII characters
grep("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub="I_WAS_NOT_ASCII"))
[1] 1 2 3
Is there a way to use these functions in R to check multiple "x"s/variables/character vectors at once and return a list of the variables that contain non-ASCII characters, or can it only be done with a loop? Even more convenient would be a way to just tell R to convert all non-ASCII characters in the data frame into something ASCII-compatible, so that I can write it to Stata. So far, following hadley's answer to the same post referenced above, I envision converting each variable individually into an ASCII-compatible string variable, adding it to my dataset, and then dropping the offending variable.
Expanding on code from Hadley's answer:
library('stringi')
library('dplyr')
# simulating an example
x <- c("Ekstr\u00f8m", "J\u00f6reskog", "bi\u00dfchen Z\u00fcrcher")
df <- data.frame(id = 1:3,
                 logi = c(T, T, F),
                 test = x,
                 test2 = rev(x),
                 test_norm = c('Everything', 'is', 'perfect'))
# added several non-character columns to show that they are not affected
# Now translating every character column to ASCII
df2 <- df %>%
  mutate_if(is.character,
            stri_trans_general,
            id = "latin-ascii")
df2
df2
id logi test test2 test_norm
1 1 TRUE Ekstrom bisschen Zurcher Everything
2 2 TRUE Joreskog Joreskog is
3 3 FALSE bisschen Zurcher Ekstrom perfect
Of course, this will only work for Latin-to-ASCII transliteration.
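The detection part of the question (which variables contain non-ASCII characters?) can also be handled across all columns at once; here is a minimal sketch reusing the iconv() marker trick from the question, with df standing in for your data frame:
# Names of the character columns that contain non-ASCII values
non_ascii_cols <- names(df)[sapply(df, function(x) {
  is.character(x) &&
    any(grepl("I_WAS_NOT_ASCII",
              iconv(x, "latin1", "ASCII", sub = "I_WAS_NOT_ASCII")))
})]
non_ascii_cols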
