R in Windows cannot handle some characters

I performed LDA in Linux and didn't get characters like "ø" in topic 2. However, when run in Windows, they show. Does anyone know how to deal with this? I used packages quanteda and topicmodels.
> terms(LDAModel1,5)
     Topic 1 Topic 2
[1,] "car"   "ø"
[2,] "build" "ù"
[3,] "work"  "network"
[4,] "drive" "ces"
[5,] "musk"  "new"
Edit:
Data: https://www.dropbox.com/s/tdr9yok7tp0pylz/technology201501.csv
The code is something like this:
library(quanteda)
library(topicmodels)
# build a corpus from the "title" column of the csv
myCorpus <- corpus(textfile("technology201501.csv", textField = "title"))
# document-feature matrix: stem, drop stopwords, numbers, punctuation and separators
myDfm <- dfm(myCorpus, ignoredFeatures = stopwords("english"), stem = TRUE,
             removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE)
# drop reddit-specific and other uninformative tokens
myDfm <- removeFeatures(myDfm, c("reddit", "redditors", "redditor", "nsfw", "hey", "vs", "versus",
                                 "ur", "they'r", "u'll", "u.", "u", "r", "can", "anyone", "will",
                                 "amp", "http", "just"))
# keep only features that appear in at least this many documents
sparsityThreshold <- round(ndoc(myDfm) * (1 - 0.9999))
myDfm2 <- trim(myDfm, minDoc = sparsityThreshold)
# fit a 25-topic model with Gibbs sampling
LDAModel1 <- LDA(quantedaformat2dtm(myDfm2), 25, 'Gibbs', list(iter = 4000, seed = 123))

It's an encoding issue, coupled with the different locales R uses by default on Windows and Linux. (Try: Sys.getlocale()) Windows uses .1252 by default (aka "cp1252", "WINDOWS-1252") while Linux and OS X use UTF-8. My guess is that technology201501.csv is encoded as UTF-8 and is getting converted to 1252 when you read it into R on Windows; the conversion does something odd to the words and creates apparent tokens consisting of just those characters (but without a reproducible example, it's impossible for me to tell). By contrast, on Linux the words containing "ø" etc. are preserved because no conversion takes place. The conversion is probably mangling the words that contain extended characters (those outside the 7-bit "ASCII" range), since there is no reliable mapping of these UTF-8-encoded Unicode code points onto the 8-bit WINDOWS-1252 encoding, even for characters such as "ø" that do have a place in that encoding.
To convert, it should work if you alter your call to:
myCorpus <- corpus(textfile("technology201501.csv", textField = "title", fileEncoding = "UTF-8"))
as the last argument is passed straight to read.csv() by textfile(). (This is only true in the newest version however, 0.9.2.)
You can verify the encoding of your .csv file by running file technology201501.csv at the command line. The file utility ships with nearly every Linux distribution and with OS X, and is also installed with RTools on Windows.
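If you prefer to check from inside R, something along these lines should also work (a rough sketch; validUTF8() requires R >= 3.3.0):
Sys.getlocale("LC_CTYPE")  # e.g. "English_United States.1252" on Windows, "en_US.UTF-8" on Linux
raw <- readLines("technology201501.csv", encoding = "UTF-8", warn = FALSE)
all(validUTF8(raw))        # TRUE is consistent with the file really being UTF-8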

Related

R import of stata file has problems with French accented characters

I have a large stata file that I think has some French accented characters that have been saved poorly.
When I import the file with the encoding set to blank, it won't read in. When I set it to latin1 it will read in, but in one variable, and I'm certain in others, French accented characters are not rendered properly. I had a similar problem with another stata file and I tried to apply the fix (which actually did not work in that case, but seems on point) here.
To be honest, that fix seems to describe the real problem here: a lot of the garbled characters in my data are the "actual" values listed there and they match up to what is "expected", but I have no idea how to go back.
Reproducible code is here:
library(haven)
library(here)
library(tidyverse)
library(labelled)
#Download file
temp <- tempfile()
temp2 <- tempfile()
download.file("https://github.com/sjkiss/Occupation_Recode/raw/main/Data/CES-E-2019-online_F1.dta.zip", temp)
unzip(zipfile = temp, exdir = temp2)
ces19web <- read_dta(file.path(temp2, "CES-E-2019-online_F1.dta"), encoding="latin1")
#Try with encoding set to blank, it won't work.
#ces19web <- read_dta(file.path(temp2, "CES-E-2019-online_F1.dta"), encoding="")
unlink(c(temp, temp2))
#### Diagnostic section for accented characters ####
ces19web$cps19_prov_id
#Note value labels are cut-off at accented characters in Quebec.
#I know this occupation has messed up characters
ces19web %>%
  filter(str_detect(pes19_occ_text, "assembleur-m")) %>%
  select(cps19_ResponseId, pes19_occ_text)
#Check the encodings of the occupation titles and store them in a variable encoding
ces19web$encoding <- Encoding(ces19web$pes19_occ_text)
#Check encoding of problematic characters
ces19web %>%
  filter(str_detect(pes19_occ_text, "assembleur-m")) %>%
  select(cps19_ResponseId, pes19_occ_text, encoding)
#Write out messy occupation titles
ces19web %>%
  filter(str_detect(pes19_occ_text, "Ã|©")) %>%
  select(cps19_ResponseId, pes19_occ_text, encoding) %>%
  write_csv(file = here("Data/messy.csv"))
#Try to fix
source("https://github.com/sjkiss/Occupation_Recode/raw/main/fix_encodings.R")
#Store the messy variable in messy
messy <- ces19web$pes19_occ_text
library(stringi)
#Try to clean with the function fix_encodings
ces19web$pes19_occ_text_cleaned <- stri_replace_all_fixed(messy, names(fixes), fixes, vectorize_all = F)
#Examine
ces19web %>%
  filter(str_detect(pes19_occ_text_cleaned, "Ã|©")) %>%
  select(cps19_ResponseId, pes19_occ_text, pes19_occ_text_cleaned, encoding) %>%
  head()
Your data file is a dta version 113 file (the first byte in the file is 113). That is, it's a Stata 8 file and, more importantly, a pre-Stata 14 file, hence one that uses a custom encoding (Stata >= 14 uses UTF-8).
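(Incidentally, you can check that format byte yourself; a quick sketch, assuming the .dta file sits in the working directory:)
readBin("CES-E-2019-online_F1.dta", what = "raw", n = 1)
# [1] 71   (hex 0x71 = decimal 113, the dta format code used by Stata 8)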
So using the encoding argument of read_dta seems right. But there are a few problems here, as can be seen with a hex editor.
First, the truncated labels at accented letters (like Québec → Qu) are actually not caused by haven: they are stored truncated in the dta file.
The pes19_occ_text is actually encoded in UTF-8, as you can see from the garbled result of a latin1 import:
ces19web <- read_dta("CES-E-2019-online_F1.dta", encoding="latin1")
grep("^Producteur", unique(ces19web$pes19_occ_text), value = T)
output: "Producteur télé"
This "é" is the characteristic result of UTF-8 data (here the two bytes of "é") being read as latin1.
However, you cannot simply import with encoding="UTF-8": read_dta will fail, because there are other non-UTF-8 characters in the file that it can't read as UTF-8. We have to do something after the import.
Here, read_dta is doing something nasty: it imports "Producteur télé" as if it were latin1 data and converts it to UTF-8, so the imported string really does contain the UTF-8 characters "Ã" and "©".
To fix this, you first have to convert back to latin1. The string will still be "Producteur télé", but encoded in latin1.
Then, instead of converting again, you simply have to force the encoding to UTF-8, without changing the underlying bytes.
Here is the code:
ces19web <- read_dta("CES-E-2019-online_F1.dta", encoding="")
ces19web$pes19_occ_text <- iconv(ces19web$pes19_occ_text, from = "UTF-8", to = "latin1")
Encoding(ces19web$pes19_occ_text) <- "UTF-8"
grep("^Producteur", unique(ces19web$pes19_occ_text), value = T)
output: "Producteur télé"
You can do the same on other variables with diacritics.
The use of iconv here may be more understandable if we convert to raw with charToRaw, to see the actual bytes. After importing the data, "télé" is the representation of "74 c3 83 c2 a9 6c c3 83 c2 a9" in UTF-8. The first byte 0x74 (in hex) is the letter "t", and 0x6c is the letter "l". In between, we have four bytes, instead of only two for the letter "é" in UTF-8 ("c3 a9", i.e. "é" when read as latin1).
Actually, "c3 83" is "Ã" and "c2 a9" is "©".
Therefore, we have first to convert these characters back to latin1, so that they take one byte each. Then "74 c3 a9 6c c3 a9" is the encoding of "télé", but this time in latin1. That is, the string has the same bytes as "télé" encoded in UTF-8, and we just need to tell R that the encoding is not latin1 but UTF-8 (and this is not a conversion).
See also the help pages of Encoding and iconv.
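For what it's worth, the whole byte-level story can be reproduced on a small example (a sketch; the mojibake string is hard-coded here rather than taken from the file):
x <- "t\u00c3\u00a9l\u00c3\u00a9"            # "télé", i.e. what the default import produces
charToRaw(x)                                 # 74 c3 83 c2 a9 6c c3 83 c2 a9
y <- iconv(x, from = "UTF-8", to = "latin1") # back to one byte per accented character
charToRaw(y)                                 # 74 c3 a9 6c c3 a9
Encoding(y) <- "UTF-8"                       # declare (not convert) the bytes as UTF-8
y                                            # "télé"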
Now a good question may be: how did you end up with such a bad dta file in the first place? It's quite surprising for a Stata 8 file to hold UTF-8 data.
The first idea that comes to mind is a bad use of the saveold command, that allows one to save data in a Stata file for an older version. But according to the reference manual, in Stata 14 saveold can only store files for Stata >=11.
Maybe a third-party tool did this, along with the bad truncation of labels? It might be SAS or SPSS, for instance. I don't know where your data come from, but it's not uncommon for public providers to use SAS for internal work and to publish converted datasets. For instance, datasets from the European Social Survey are provided in SAS, SPSS and Stata format, but if I remember correctly, initially it was only SAS and SPSS, and Stata came later: the Stata files are probably just converted using another tool.
Answer to the comment: how to loop over character variables to do the same? There is a smarter way with dplyr (sketched after the loop), but here is a simple loop with base R.
ces19web <- read_dta("CES-E-2019-online_F1.dta")
for (n in names(ces19web)) {
  v <- ces19web[[n]]
  if (is.character(v)) {
    v <- iconv(v, from = "UTF-8", to = "latin1")
    Encoding(v) <- "UTF-8"
  }
  ces19web[[n]] <- v
}
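The dplyr version mentioned above might look something like this (a sketch, assuming dplyr >= 1.0 for across(); fix_utf8 is just a helper name introduced here):
library(dplyr)
fix_utf8 <- function(v) {
  v <- iconv(v, from = "UTF-8", to = "latin1")  # undo the spurious latin1-to-UTF-8 conversion
  Encoding(v) <- "UTF-8"                        # then simply declare the bytes as UTF-8
  v
}
ces19web <- ces19web %>% mutate(across(where(is.character), fix_utf8))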

R not detecting \x pattern in string

Does someone know how to detect and replace "\x" in R?
library(stringr)
x <- "gesh\xfc"
str_detect(x, "\\x")
# Error in stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)) :
# Unrecognized backslash escape sequence in pattern. (U_REGEX_BAD_ESCAPE_SEQUENCE)
nchar(x)
# Error in nchar(x) : invalid multibyte string, element 1
iconv(x, "latin1", "utf-8")
# [1] "geshü"
Encoding(x)
# [1] "unknown"
Session Info:
> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6
...
locale:
[1] fr_CH.UTF-8/fr_CH.UTF-8/fr_CH.UTF-8/C/fr_CH.UTF-8/fr_CH.UTF-8
Context: I read a .csv file with data.table::fread(), but this file has column names in German with letters such as ä, ö, ü, etc. Once read into R, those letters turn into something starting with "\x", which is simply unusable afterwards.
Just to summarize what happened here. The "\x" is NOT part of the string. It is just how R escapes values that it can't otherwise print. In the case of "gesh\xfc", the first 4 characters are basic ASCII characters, but the last character is encoded as "\xfc". In the latin1 encoding (which Windows uses by default), the byte fc is the "ü" character. So on my Windows machine, I see
x <- "gesh\xfc"
x
# [1] "geshü"
And you can look at the raw bytes of that string with
charToRaw("gesh\xfc")
# [1] 67 65 73 68 fc
You can see the ASCII hex character codes for the first 4 values, and then you can see that the \x was actually just used to include the "fc" character code in the string. The string itself only has 5 "characters".
But if you are not using latin1, the fc byte doesn't map to anything. Basically, that string doesn't make any sense in the UTF-8 encoding, which is what the Mac uses by default. You can convert to UTF-8 with
iconv("gesh\xfc", "latin1", "utf-8")
But since you got this string by importing a text file, the problem was that R didn't know the encoding of the file (which wasn't UTF-8), so you wound up with these weird values. You should tell fread that the file came from Windows so it can import the strings properly from the start:
fread(file, encoding = "Latin-1")
You need to know what encoding was used to create a file you are importing, especially when it was made by someone else; it's not really possible for programs to guess this correctly.
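Coming back to the original question of detecting the character: once the encoding is declared (or the string converted), the usual string functions work again. A minimal sketch, assuming the stray byte really is latin1:
x <- "gesh\xfc"
Encoding(x) <- "latin1"           # declare the existing byte as latin1 (no conversion)
nchar(x)                          # 5
x_utf8 <- iconv(x, "latin1", "UTF-8")
stringr::str_detect(x_utf8, "ü")  # TRUE; the "\xfc" byte is now a proper "ü"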

How to treat encoding when reading .dta-files into R from Stata-files prior to version 14?

How can one dodge the encoding problems when reading Stata-data into R?
The dataset I wish to read is a .dta in either Stata 12 or Stata 13 format (before Stata introduced support for UTF-8 in version 14). Text variables with Swedish and German letters å, ä, ö, ß, as well as other characters, do not import well.
I have tried these answers: read.dta in foreign, the haven package (with no encoding parameter), and now readstata13, which informs me that it expects pre-14 Stata files to be encoded in CP1252. But alas, the encoding still doesn't work out. Should I give up and use a .csv export as a bridge instead, or is it actually possible to read .dta files in R?
Minimal example:
This code downloads the first few lines of my dataset and illustrates the problem, for example in the variable vocation, which contains Swedish occupation titles.
setwd("~/Downloads/")
system("curl -O http://www.lilljegren.com/stackoverflow/example.stata13.dta", intern=F)
library(haven)  # read_dta() is in haven, not foreign
?read_dta
df1 <- read_dta('example.stata13.dta', encoding="latin1")
df2 <- read_dta('example.stata13.dta', encoding="CP1252")
library(readstata13)
df3 <- read.dta13('example.stata13.dta', fromEncoding="latin1")
df4 <- read.dta13('example.stata13.dta', fromEncoding="CP1252")
df5 <- read.dta13('example.stata13.dta', fromEncoding="utf-8")
vocation <- c("Brandkorpral","Sömmerska","Jungfru","Timmerman","Skomakare","Skräddare","Föreståndare","Platsförsäljare","Sömmerska")
df4$vocation == vocation
# [1] TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
The correct encoding to read files generated by Stata prior to version 14 on Macs is "macroman"
df <- read.dta13('example.stata13.dta', fromEncoding="macroman")
On my Mac, both .dta-files in stata13 and stata12 formats (saved by saveold in Stata 13) imported nicely like this.
Presumably the readstata13 manual is right that "CP1252" is the correct assumption on other platforms; for me, however, "macroman" did the trick (also for the .csv files that Stata 13 generated with export delimited).
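One way to compare candidate encodings against the expected strings is a quick loop (a sketch, reusing the example file and the vocation vector from the question above):
library(readstata13)
candidates <- c("latin1", "CP1252", "macroman")
sapply(candidates, function(enc) {
  d <- read.dta13("example.stata13.dta", fromEncoding = enc)
  sum(d$vocation == vocation, na.rm = TRUE)  # number of rows matching the expected spellings
})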

Error extracting noun in R using KoNLP

I tried to extract nouns in R. When I run the program, an error appears. I wrote the following code:
setwd("C:\\Users\\kyu\\Desktop\\1-1file")
library(KoNLP)
useSejongDic()
txt <- readLines(file("1_2000.csv"))
nouns <- sapply(txt, extractNoun, USE.NAMES = F)
and the error appears like this:
setwd("C:\\Users\\kyu\\Desktop\\1-1file")
library(KoNLP)
useSejongDic()
Backup was just finished!
87007 words were added to dic_user.txt.
txt <- readLines(file("1_2000.csv"))
nouns <- sapply(txt, extractNoun, USE.NAMES = F)
java.lang.ArrayIndexOutOfBoundsException
Error in `Encoding<-`(`*tmp*`, value = "UTF-8") :
  a character vector argument expected
Why is this happening? I load the 1_2000.csv file, which has 2000 lines of data. Is this too much data? How do I extract nouns from a data file this large? I use R 3.2.4 with RStudio, and Excel 2016, on Windows 8.1 x64.
The number of lines shouldn't be a problem.
I think that there might be a problem with the encoding. See this post. Your .csv file is encoded as EUC-KR.
I changed the encoding to UTF-8 using
txtUTF <- read.csv(file.choose(), encoding = 'UTF-8')
nouns <- sapply(txtUTF, extractNoun, USE.NAMES = F)
But that results in the following error:
Warning message:
In preprocessing(sentence) : Input must be legitimate character!
So this might be an error with your input. I can't read Korean so can't help you further.
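For what it's worth, here is a sketch of how the file might be read with its actual encoding declared and a plain character vector passed to extractNoun (I'm assuming the text sits in the first column and that the file has a header row, which I can't check):
library(KoNLP)
useSejongDic()
dat <- read.csv("1_2000.csv", fileEncoding = "EUC-KR", stringsAsFactors = FALSE)
txt <- dat[[1]]                                # extractNoun expects a character vector, not a data frame
nouns <- sapply(txt, extractNoun, USE.NAMES = FALSE)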

R - textcat not executing because of supposed invalid UTF-8 strings

I'm trying to run a seemingly simple task: identifying the languages of a vector of texts (a sample of tweets) using the 'textcat' package. I've cleaned the text data so as to be left with only standard characters. However, when I try to execute the textcat command as follows
text.df$language <- textcat(text.df$text)
I get the following error message:
Error in textcnt(x, n = max(n), split = split, tolower = tolower, marker = marker, :
not a valid UTF-8 string
Despite the fact that the following test
nchar(text.df$text, "c", allowNA=TRUE)
suggested that there are no non-UTF-8 characters in the data.
Does anyone have any ideas? Thanks in advance.
Try iconv on your input text...
text <- "i💙you"
> iconv(text, "UTF8", "ASCII", sub="")
[1] "iyou"
