I have been using the googlesheets package to upload and download data from a websheet. Previously, it had been downloading strings with non-ASCII symbols as the replacement icon �. Now, for no apparent reason, it has started downloading them with the following string: ï¿½. How can I convert ï¿½ back to the diamond question mark symbol (�)?
It's likely that you have an encoding problem. I suspect that the raw data is encoded in UTF-8, but at some point it is getting treated as Windows-1252.
This is what happens when the encoding is wrongly marked as Windows-1252, and then converted to UTF-8:
x <- "Here is a raw string: � is getting converted to �"
(y <- iconv(x, "WINDOWS-1252", "UTF-8"))
#> [1] "Here is a raw string: � is getting converted to �"
You can fix the encoding error by converting from UTF-8 to Windows-1252, then marking the result as UTF-8:
z <- iconv(y, "UTF-8", "WINDOWS-1252")
Encoding(z) <- "UTF-8"
print(z)
#> [1] "Here is a raw string: � is getting converted to �"
Note: the code will still work on macOS and Linux if you leave out the Encoding(z) <- "UTF-8" line, but it will break on Windows. Without that line, z has "unknown" encoding, which is interpreted as UTF-8 on Linux and macOS but not on Windows.
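If many cells in the downloaded sheet are affected, the same two-step fix can be wrapped in a small helper and applied to whole columns. This is only a sketch: the data frame and column names (sheet_data, title) are hypothetical, not from the question.
# Undo a UTF-8 -> Windows-1252 mojibake round trip (sketch; names are made up)
fix_mojibake <- function(x) {
  z <- iconv(x, from = "UTF-8", to = "WINDOWS-1252")
  Encoding(z) <- "UTF-8"  # re-mark the bytes as UTF-8 without converting again
  z
}
sheet_data$title <- fix_mojibake(sheet_data$title)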
Windows Users
If you're using Windows, then the fix could be much simpler. If your data has "unknown" encoding, then on macOS and Linux it will (correctly) be interpreted as UTF-8, but on Windows it will be interpreted using your native encoding, usually Windows-1252. So on Windows, something like the following happens:
x <- "Here is a raw string: � is getting converted to �"
y <- x
Encoding(y) <- "unknown"
print(y)
#> [1] "Here is a raw string: � is getting converted to �"
You can fix this as follows:
z <- y
Encoding(z) <- "UTF-8"
print(z)
#> [1] "Here is a raw string: � is getting converted to �"
Related
I have a large stata file that I think has some French accented characters that have been saved poorly.
When I import the file with the encoding set to blank, it won't read in. When I set it to latin1 it will read in, but in one variable, and I'm certain in others, French accented characters are not rendered properly. I had a similar problem with another Stata file and I tried to apply the fix here (which actually did not work in that case, but seems on point).
To be honest, this seems to be the real problem here somehow. A lot of the garbled characters are "actual" and they match up to what is "expected", but I have no idea how to go back.
Reproducible code is here:
library(haven)
library(here)
library(tidyverse)
library(labelled)
#Download file
temp <- tempfile()
temp2 <- tempfile()
download.file("https://github.com/sjkiss/Occupation_Recode/raw/main/Data/CES-E-2019-online_F1.dta.zip", temp)
unzip(zipfile = temp, exdir = temp2)
ces19web <- read_dta(file.path(temp2, "CES-E-2019-online_F1.dta"), encoding="latin1")
#Try with encoding set to blank, it won't work.
#ces19web <- read_dta(file.path(temp2, "CES-E-2019-online_F1.dta"), encoding="")
unlink(c(temp, temp2))
#### Diagnostic section for accented characters ####
ces19web$cps19_prov_id
#Note value labels are cut-off at accented characters in Quebec.
#I know this occupation has messed up characters
ces19web %>%
  filter(str_detect(pes19_occ_text, "assembleur-m")) %>%
  select(cps19_ResponseId, pes19_occ_text)
#Check the encodings of the occupation titles and store in a variable encoding
ces19web$encoding <- Encoding(ces19web$pes19_occ_text)
#Check encoding of problematic characters
ces19web %>%
  filter(str_detect(pes19_occ_text, "assembleur-m")) %>%
  select(cps19_ResponseId, pes19_occ_text, encoding)
#Write out messy occupation titles
ces19web %>%
  filter(str_detect(pes19_occ_text, "Ã|©")) %>%
  select(cps19_ResponseId, pes19_occ_text, encoding) %>%
  write_csv(file = here("Data/messy.csv"))
#Try to fix
source("https://github.com/sjkiss/Occupation_Recode/raw/main/fix_encodings.R")
#store the messy variables in messy
messy <- ces19web$pes19_occ_text
library(stringi)
#Try to clean with the function fix_encodings
ces19web$pes19_occ_text_cleaned <- stri_replace_all_fixed(messy, names(fixes), fixes, vectorize_all = FALSE)
#Examine
ces19web %>%
  filter(str_detect(pes19_occ_text_cleaned, "Ã|©")) %>%
  select(cps19_ResponseId, pes19_occ_text, pes19_occ_text_cleaned, encoding) %>%
  head()
Your data file is a dta version 113 file (the first byte in the file is 113). That is, it's a Stata 8 file and, more importantly, a pre-Stata 14 file, hence one using a custom encoding (Stata >= 14 uses UTF-8).
So using the encoding argument of read_dta seems right. But there are a few problems here, as can be seen with a hex editor.
First, the truncated labels at accented letters (like Québec → Qu) are actually not caused by haven: they are stored truncated in the dta file.
The pes19_occ_text is encoded in UTF-8, as you can check with:
ces19web <- read_dta("CES-E-2019-online_F1.dta", encoding="UTF-8")
grep("^Producteur", unique(ces19web$pes19_occ_text), value = T)
output: "Producteur télé"
This "é" is characteristic of UTF-8 data (here "é") read as latin1.
However, if you try to import with encoding="UTF-8", read_dta will fail: there might be other non-UTF-8 characters in the file, that read_dta can't read as UTF-8. We have to do somthing after the import.
Here, read_dta is doing something nasty: it imports "Producteur télé" as if it were latin1 data, and converts to UTF-8, so the encoding string really has UTF-8 characters "Ã" and "©".
To fix this, you have first to convert back to latin1. The string will still be "Producteur télé", but encoded in latin1.
Then, instead of converting, you have simply to force the encoding as UTF-8, without changing the data.
Here is the code:
ces19web <- read_dta("CES-E-2019-online_F1.dta", encoding="")
ces19web$pes19_occ_text <- iconv(ces19web$pes19_occ_text, from = "UTF-8", to = "latin1")
Encoding(ces19web$pes19_occ_text) <- "UTF-8"
grep("^Producteur", unique(ces19web$pes19_occ_text), value = T)
output: "Producteur télé"
You can do the same on other variables with diacritics.
The use of iconv here may be more understandable if we convert to raw with charToRaw, to see the actual bytes. After importing the data, "tÃ©lÃ©" is the representation of "74 c3 83 c2 a9 6c c3 83 c2 a9" in UTF-8. The first byte 0x74 (in hex) is the letter "t", and 0x6c is the letter "l". In between, we have four bytes, instead of only two for the letter "é" in UTF-8 ("c3 a9", i.e. "Ã©" when read as latin1).
Actually, "c3 83" is "Ã" and "c2 a9" is "©".
Therefore, we have first to convert these characters back to latin1, so that they take one byte each. Then "74 c3 a9 6c c3 a9" is the encoding of "tÃ©lÃ©", but this time in latin1. That is, the string has the same bytes as "télé" encoded in UTF-8, and we just need to tell R that the encoding is not latin1 but UTF-8 (and this is not a conversion).
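For illustration, here is a byte-level sketch of that reasoning (it assumes ces19web was imported with encoding = "latin1" as above; the name s is just a throwaway):
s <- grep("^Producteur", unique(ces19web$pes19_occ_text), value = TRUE)[1]
charToRaw(s)            # ... 74 c3 83 c2 a9 6c c3 83 c2 a9 ("tÃ©lÃ©" in UTF-8)
s <- iconv(s, from = "UTF-8", to = "latin1")
charToRaw(s)            # ... 74 c3 a9 6c c3 a9 (same characters, one byte each)
Encoding(s) <- "UTF-8"  # not a conversion: just declare what the bytes already are
s                       # "Producteur télé"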
See also the help pages of Encoding and iconv.
Now a good question may be: how did you end up with such a bad dta file in the first place? It's quite surprising for a Stata 8 file to hold UTF-8 data.
The first idea that comes to mind is a bad use of the saveold command, that allows one to save data in a Stata file for an older version. But according to the reference manual, in Stata 14 saveold can only store files for Stata >=11.
Maybe a third-party tool did this, as well as the bad truncation of labels? It might be SAS or SPSS, for instance. I don't know where your data come from, but it's not uncommon for public providers to use SAS for internal work and to publish converted datasets. For instance, datasets from the European Social Survey are provided in SAS, SPSS and Stata format, but if I remember correctly, initially it was only SAS and SPSS, and Stata came later: the Stata files are probably just converted using another tool.
Answer to the comment: how to loop over character variables to do the same? There is a smarter way with dplyr (a sketch follows the loop below), but here is a simple loop with base R.
ces19web <- read_dta("CES-E-2019-online_F1.dta")
for (n in names(ces19web)) {
  v <- ces19web[[n]]
  if (is.character(v)) {
    v <- iconv(v, from = "UTF-8", to = "latin1")
    Encoding(v) <- "UTF-8"
  }
  ces19web[[n]] <- v
}
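And here is a sketch of the dplyr variant alluded to above, using across(); the helper name fix_utf8 is made up for this example:
library(dplyr)
# Hypothetical helper: convert the mis-read strings back to latin1 bytes, then re-mark as UTF-8
fix_utf8 <- function(x) {
  x <- iconv(x, from = "UTF-8", to = "latin1")
  Encoding(x) <- "UTF-8"
  x
}
ces19web <- ces19web %>%
  mutate(across(where(is.character), fix_utf8))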
I want to convert character strings to UTF-8. At the moment, I've managed to do this using stringi, like this:
test_string <- c("Fiancé is great.")
stringi::stri_encode(test_string, "UTF-8")
However, how can I do the same using R base or stringr?
Thanks in advance
The iconv function may be a choice.
For example, if the current encoding is latin1:
iconv(test_string, "latin1", "UTF-8")
You can use Encoding and enc2utf8 from base:
test_string <- c("Fiancé is great.")
Encoding(test_string)
# [1] "latin1"
test_string <- enc2utf8(test_string)
Encoding(test_string)
# [1] "UTF-8"
And you can find more alternatives here.
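Since the question also mentions stringr: its str_conv() helper re-interprets a string as a given encoding and hands the conversion to stringi under the hood. A hedged sketch, assuming test_string is the original latin1-encoded string from above:
library(stringr)
str_conv(test_string, "latin1")  # interpret the bytes as latin1; returns "Fiancé is great." in UTF-8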
Does someone know how it is possible to detect and replace "\x" in R?
library(stringr)
x <- "gesh\xfc"
str_detect(x, "\\x")
# Error in stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)) :
# Unrecognized backslash escape sequence in pattern. (U_REGEX_BAD_ESCAPE_SEQUENCE)
nchar(x)
# Error in nchar(x) : invalid multibyte string, element 1
iconv(x, "latin1", "utf-8")
# [1] "geshü"
Encoding(x)
# [1] "unknown"
Session Info:
> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6
...
locale:
[1] fr_CH.UTF-8/fr_CH.UTF-8/fr_CH.UTF-8/C/fr_CH.UTF-8/fr_CH.UTF-8
Context: I read a .csv file with data.table::fread(), but this file has column names in German with letters such as ä, ö, ü, etc. Once read into R, those letters transform into something starting with "\x". This is simply unusable in R afterwards.
Just to summarize what happened here. The "\x" is NOT part of the string. This is just how R escapes values that it can't otherwise print. In the case of "gesh\xfc", the first 4 characters are basic ASCII characters, but the last character is encoded as "\xfc". In the latin1 encoding (which Windows uses by default), the fc character is the "ü" character. So on my Windows machine, I see
x <- "gesh\xfc"
x
# [1] "geshü"
And you can look at the raw bytes of that string with
charToRaw("gesh\xfc")
# [1] 67 65 73 68 fc
You can see the ASCII hex character codes for the first 4 values, and then you can see that the \x was actually just used to include the "fc" character code in the string. The string itself only has 5 "characters".
But if you are not using latin1, the "fc" character doesn't map to anything. Basically, that string doesn't make any sense in the UTF-8 encoding, which is what the Mac uses by default. You can convert to UTF-8 with
iconv("gesh\xfc", "latin1", "utf-8")
But since you got this string by importing a text file, the problem was that R didn't know the encoding of the file wasn't UTF-8, so you wound up with these weird values. You should tell fread that the file came from Windows so it can import the strings properly from the start:
fread(file, encoding = "Latin-1")
You need to know what encoding was used to make a file you are importing especially when made by someone else. It's not really possible for programs to guess correctly.
I am trying to read in data from a csv file and specify the encoding of the characters to be UTF-8. From reading through the ?read.csv() documentation, it seems that setting fileEncoding equal to "UTF-8" should accomplish this; however, I am not seeing that when I check. Is there a better way to specify the encoding of character strings to be UTF-8 when importing the data?
Sample Data:
Download Sample Data here
fruit<- read.csv("fruit.csv", header = TRUE, fileEncoding = "UTF-8")
fruit[] <- lapply(fruit, as.character)
Encoding(fruit$Fruit)
The output is "uknown" but I would expect this to be "UTF-8". What is the best way to ensure all imported characters are UTF-8? Thank you.
fruit <- read.csv("fruit.csv", header = TRUE)
fruit[] <- lapply(fruit, as.character)
fruit$Fruit <- paste0(fruit$Fruit, "\xfcmlaut") # Get non-ASCII char and jam it in!
Encoding(fruit$Fruit)
[1] "latin1" "latin1" "latin1"
fruit$Fruit <- enc2utf8(fruit$Fruit)
Encoding(fruit$Fruit)
[1] "UTF-8" "UTF-8" "UTF-8"
I performed LDA on Linux and didn't get characters like "ø" in topic 2. However, when it is run on Windows, they show up. Does anyone know how to deal with this? I used the packages quanteda and topicmodels.
> terms(LDAModel1, 5)
     Topic 1 Topic 2
[1,] "car"   "ø"
[2,] "build" "ù"
[3,] "work"  "network"
[4,] "drive" "ces"
[5,] "musk"  "new"
Edit:
Data: https://www.dropbox.com/s/tdr9yok7tp0pylz/technology201501.csv
The code is something like this:
library(quanteda)
library(topicmodels)
myCorpus <- corpus(textfile("technology201501.csv", textField = "title"))
myDfm <- dfm(myCorpus, ignoredFeatures = stopwords("english"), stem = TRUE, removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE)
myDfm <- removeFeatures(myDfm, c("reddit", "redditors", "redditor", "nsfw", "hey", "vs", "versus", "ur", "they'r", "u'll", "u.", "u", "r", "can", "anyone", "will", "amp", "http", "just"))
sparsityThreshold <- round(ndoc(myDfm) * (1 - 0.9999))
myDfm2 <- trim(myDfm, minDoc = sparsityThreshold)
LDAModel1 <- LDA(quantedaformat2dtm(myDfm2), 25, 'Gibbs', list(iter = 4000, seed = 123))
It's an encoding issue, coupled with the different locales R uses by default on Windows and Linux. (Try: Sys.getlocale()) Windows uses .1252 by default (aka "cp1252", "WINDOWS-1252") while Linux and OS X use UTF-8. My guess is that technology201501.csv is encoded as UTF-8 and is getting converted to 1252 when you read it into R on Windows; these characters are doing something odd to the words and creating apparent tokens consisting of just the character (but without a reproducible example, it's impossible for me to tell). By contrast, on Linux the words containing "ø" etc. are preserved because there is no conversion. The conversion may be mangling the words with extended characters (outside the 7-bit ASCII range), since not every UTF-8-encoded Unicode code point maps to a place in the 8-bit WINDOWS-1252 encoding, even though some of these particular characters do exist in that encoding.
To convert, it should work if you alter your call to:
myCorpus <- corpus(textfile("technology201501.csv", textField = "title", fileEncoding = "UTF-8"))
as the last argument is passed straight to read.csv() by textfile(). (This is only true in the newest version however, 0.9.2.)
You can verify the encoding of your .csv file by running file technology201501.csv at the command line. This tool is included with nearly every Linux distro and OS X, and is also installed with RTools on Windows.
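If the file command isn't handy, a rough alternative from within R is stringi's encoding detector (a heuristic sketch, not a guarantee; stringi is already a dependency of quanteda):
library(stringi)
bytes <- readBin("technology201501.csv", what = "raw", n = 100000)
stri_enc_detect(bytes)[[1]]  # ranked guesses: Encoding / Language / Confidence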