I'm trying to use R to create the content of a tex file. The content contains many accented letters and I not able to correctly write them to a tex file.
Here is a short minimal example of what I would like to perform:
I have a file texinput.tex, which already exists and is encoded as UTF8 without BOM. When I manually write é in Notepad++ and save this file, it compiles correctly in LaTex and the output is as expected.
Then I tried to do this in R:
str.to.write <- "é"
cat(str.to.write, file = "tex_list.tex", append=TRUE)
As a result, the encoded character xe9 appears in the tex file. LaTex throws this error when trying to compile:
! File ended while scanning use of \UTFviii#three#octets.<inserted text>\par \include{texinput}
I then tried all of the following things before the cat command:
Encoding(str.to.write) <- "latin1"
-> same output error as above
str.to.write <- enc2utf8(str.to.write)
-> same output and error as above
Encoding(str.to.write) <- "UTF-8"
-> this appears in the tex file: \xe9. LaTex throws this error: ! Undefined control sequence. \xe
Encoding(str.to.write) <- "bytes"
-> this appears in the tex file: \\xe9. LaTex compiles without error and the output is xe9
I know that I could replace é by \'{e}, but I would like to have an automatic method, because the real content is very long and contains words from 3 different Latin languages, so it has lots of different accented characters.
However, I would also be happy about a function to automatically sanitize the R output to be used with Latex. I tried using xtable and sanitize.text.function, but it appears that it doesn't accept character vectors as input.
After quite a bit of searching and trial-and-error, I found something that worked for me:
# create output function
writeTex <- function(x) {write.table(x, "tex_list.tex",
append = TRUE, row.names = FALSE,
col.names = FALSE, quote = FALSE,
fileEncoding = "UTF-8")}
writeTex("é")
Output is as expected (é), and it compiles perfectly well in LaTex.
Use TIPA for processing International
Phonetic Alphabet (IPA) symbols in Latex. It has become standard in the linguistics field.
Related
I am attempting to make a boatload of Anki flashcards for Thai, so I did some web-scraping with R to extract transliterated elements from a website (dictionary). Everything looks good when printing in the console, but when I try to write the transliteration to a text file, the encoding changes, and I lose tone marks. Using Encoding() revealed that most entries were "UTF-8", which should be fine, but some entries were labeled as "unknown". You can download the HTML file from my GitHub, and my code is below for importing and extracting the text.
# Install appropriate library
install.packages("rvest")
library(rvest)
# Read in page to local variable
page <- read_html("Thai to English dictionary, translation and transliteration.html")
# Filter for specific tags
translit <- page %>% html_nodes(".tlit-line") %>% html_text()
write(translit, file = 'translit.txt')
library(stringi)
stringi::stri_write_lines(translit, encoding = "UTF-8", "translit.txt")
stri_write_lines (From stringi v1.5.3 by Marek Gagolewski)
Write Text Lines To A Text File. Writes a text file is such a way
that each element of a given character vector becomes a separate text
line.
Usage
stri_write_lines(
str,
con,
encoding = "UTF-8",
sep = ifelse(.Platform$OS.type == "windows", "\r\n", "\n"),
fname = con
)
Arguments
str - character vector with data to write
con - name of the output file or a connection object (opened in the
binary mode)
encoding - output encoding, NULL or '' for the current default one
sep - newline separator
fname - deprecated alias of con
Details
It is a substitute for the R writeLines function, with the ability to
easily re-encode the output.
We suggest using the UTF-8 encoding for all text files: thus, it is
the default one for the output.
Has anyone had experience to read Korean language file using EUC-KR as text encoding?
I used fread function as it can read that file structure perfectly. Below is the sample code:
test <- fread("KoreanTest.txt", encoding = "EUC-KR")
Then I got error, "Error in fread("KoreanTest.txt", encoding = "EUC-KR") : Argument 'encoding' must be 'unknown', 'UTF-8' or 'Latin-1'".
Initially i was using UTF-8 as text encoding but the output characters were not displayed correctly in Korean language. I was looking to another solution but nothing seems to work at this time.
Appreciate if someone could share ideas. Thanks.
It allows an explicit encoding parameter. This common usage works well:
read.table(filesource, header = TRUE, stringsAsFactors = FALSE, encoding = "EUC-KR")
or you can try with Rstudio
File -> Import Dataset -> From text
let me start this by saying that I'm still pretty much a beginner with R.
Currently I am trying out basic text mining techniques for Turkish texts, using the tm package.
I have, however, encountered a problem with the display of Turkish characters in R.
Here's what I did:
docs <- VCorpus(DirSource("DIRECTORY", encoding = "UTF-8"), readerControl = list(language = "tur"))
writeLines(as.character(docs), con="documents.txt")
My thinking being, that setting the language to Turkish and the encoding to UTF-8 (which is the original encoding of the text files) should make the display of the Turkish characters İ, ı, ğ, Ğ, ş and Ş possible. Instead the output converts these charaters to I, i, g, G, s and S respectively and saves it to an ANSI-Encoding, which cannot display these characters.
writeLines(as.character(docs), con="documents.txt", Encoding("UTF-8"))
also saves the file without the characters in ANSI encoding.
This seems to not only be an issue with the output file.
writeLines(as.character(docs[[1]])
for example yields a line that should read "Okul ve cami açılışları umutları artırdı" but instead reads "Okul ve cami açilislari umutlari artirdi"
After reading this: UTF-8 file output in R
I also tried the following code:
writeLines(as.character(docs), con="documents.txt", Encoding("UTF-8"), useBytes=T)
which didn't change the results.
All of this is on Windows 7 with both the most recent version of R and RStudio.
Is there a way to fix this? I am probably missing something obvious, but any help would be appreciated.
Here is how I keep the Turkish characters intact:
Open a new .Rmd file in RStudio. (RStudio -> File -> New File -> R Markdown)
Copy and Paste your text containing Turkish characters.
Save the .Rmd file with encoding. (RStudio -> File -> Save with Encoding.. -> UTF-8)
yourdocument <- readLines("yourdocument.Rmd", encoding = "UTF-8")
yourdocument <- paste(yourdocument, collapse = " ")
After this step you can create your corpus
e.g start from VectorSource() in tm package.
Turkish characters will appear as they should.
This question is related to the utf-8 package for R. I have a weird problem in which I want emojis in a data set I'm working with to stay in code point representation (i.e. as '\U0001f602'). I want to use the 'FindReplace' function from the Data Combine package to turn UTF-8 encodings into prose descriptions of emojis in a dataset of YouTube comments (using a dictionary I made available here). The only issue is that when I 'save' the output as an object in R the nice utf-8 encoding generated by utf8_encode for which I can use my dictionary, it disappears...
First I have to adjust the dictionary a bit:
emojis$YouTube <- tolower(emojis$Codepoint)
emojis$YouTube <- gsub("u\\+","\\\\U000", emojis$YouTube)
Convert to character so as to be able to use utf8_encode:
emojimovie$test <- as.character(emojimovie$textOriginal)
This works great, gives output of \U0001f595 (etc.) that can be matched with dictionary entries when it 'prints' in the console.
utf8_encode(emojimovie$test)
BUT, when I do this:
emojimovie$text2 <- utf8_encode(emojimovie$test)
and then:
emoemo <- FindReplace(data = emojimovie, Var = "text2", replaceData = emojis, from = "YouTube", to = "Name", exact = TRUE)
I get all NAs. When I look at the output in $text2 with View I don't see the \U0001f595, I see actual emojis. I think this is why the FindReplace function isn't working -- when it gets saved to an object it just gets represented as emojis again and the function can't find any matches. When I try gsub("\U0001f602", "lolface", emojimovie$text2), however, I can actually match and replace things, but I don't want to do this for all ~2,000 or so emojis.... I've tried reading as much as I can about utf-8, but I can't understand why this is happening. I'm stumped! :P
It looks like in the above, you are trying to convert the UTF-8 emoji to a text version. I would recommend going the other direction. Something like
emojis <- read.csv('Emoji Dictionary 2.1.csv', stringsAsFactors = FALSE)
# change U+1F469 U+200D U+1F467 to \U1F469\U200D\U1F467
escapes <- gsub("[[:space:]]*\\U\\+", "\\\\U", emojis$Codepoint)
# convert to UTF-8 using the R parser
codes <- sapply(parse(text = paste0("'", escapes, "'"),
keep.source = FALSE), eval)
This will convert the text representations like U+1F469 to UTF-8 strings. Then, you can search for these strings in the original data.
Note: If you are using Windows, make sure you have the latest release of R; in older versions, the parser gives you the wrong result for strings litke "\U1F469".
The utf8::utf8_encode should really only be used if you have UTF-8 and are trying to print it to the screen.
I use the R packages RefManageR and bibtex packages to read in a bibtex file I exported from Mendeley (my reference manager). Sometimes authors are listed with accents in their name (López), but in BibTeX these are escaped to "L{\\'{o}}pez". However, in another reference this name is spelled without accent (Lopez).
How can I parse the "L{\\'{o}}pez" to López or Lopez so I can compare them?
I googled but this only shows how I can escape -while I want to unescape- or to make pdf's from R.
I tried this and it worked for me, but I still think there must be a better solution:
deTeX <- function(x) {
gsub("\\{\\\\.+?\\{([a-z]*)\\}\\}", "\\1", x, fixed = FALSE, perl = TRUE, ignore.case = TRUE)
}