I'd like R to render its whole output (both to the console and to files) in UTF-8. Is there a way to define the encoding for R output for a whole document?
I think what you're after is options(). For example, here's part of the help page for file():
file(description = "", open = "", blocking = TRUE,
encoding = getOption("encoding"), raw = FALSE)
So if you investigate setting options(encoding = "your_choice") you may be all set.
Edit: if you haven't already, also be sure to set your locale to the desired language.
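A minimal sketch of both steps (the locale string and file name are assumptions; the exact locale value varies by OS):

options(encoding = "UTF-8")             # default encoding for connections such as file()
Sys.setlocale("LC_ALL", "en_US.UTF-8")  # example UTF-8 locale on Linux/macOS
con <- file("outfile.txt", open = "w")  # inherits getOption("encoding")
writeLines("héllo", con)                # written out as UTF-8
close(con)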
Has anyone had experience reading a Korean-language file using EUC-KR as the text encoding?
I used the fread function, as it can read that file structure perfectly. Below is the sample code:
test <- fread("KoreanTest.txt", encoding = "EUC-KR")
Then I got error, "Error in fread("KoreanTest.txt", encoding = "EUC-KR") : Argument 'encoding' must be 'unknown', 'UTF-8' or 'Latin-1'".
Initially I was using UTF-8 as the text encoding, but the output characters were not displayed correctly in Korean. I have been looking for another solution, but nothing seems to work so far.
I'd appreciate it if someone could share ideas. Thanks.
read.table() accepts an explicit encoding parameter. This common usage works well:
read.table(filesource, header = TRUE, stringsAsFactors = FALSE, encoding = "EUC-KR")
or you can try it with RStudio:
File -> Import Dataset -> From text
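If you specifically need fread(), one workaround (a hedged sketch; the file names are assumptions) is to decode the file from EUC-KR and re-save it as UTF-8 first, since fread() only accepts 'unknown', 'UTF-8', or 'Latin-1':

con <- file("KoreanTest.txt", encoding = "EUC-KR")  # decode EUC-KR on input
lines <- readLines(con)
close(con)
writeLines(enc2utf8(lines), "KoreanTest_utf8.txt", useBytes = TRUE)  # re-save as UTF-8
test <- data.table::fread("KoreanTest_utf8.txt", encoding = "UTF-8")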
I'm currently building a Shiny application that needs to be translated into different languages. I have the whole structure, but I'm struggling to get values such as "Validació" that contain accents.
The structure I've followed is this:
I have a dictionary, which is simply a CSV with the translations: there's a key and then a column for each language. The structure of this dictionary is the following:
key, cat, en
"selecció", "selecció", "Selection"
"Diferències","Diferències", "Differences"
"Descarregar","Descarregar", "Download"
"Diagnòstics","Diagnòstics", "Diagnoses"
I have a script that, once dictionary.csv is modified, generates a .bin file that is later loaded by the code.
In strings.R I have all the strings that will appear in the code, and I use a function to translate from the current language to the one I want. The function is the following:
Code:
tr <- function(text) {
  sapply(text, function(s) translation[[s]][["cat"]], USE.NAMES = FALSE)
}
When I translate something, since I'm doing it in another file, I assign it to another variable, something like:
str_seleccio <- tr('Selecció')
The problem I'm facing: translating 'Selecció' with this function, tr('Selecció'), gives the correct answer when I execute it in the RStudio console, but when I do it in the Shiny application it comes back as NULL. If the word I translate has no accents, such as "Hello", tr("Hello") gives me the correct answer in the Shiny application, and I can see it through the code.
So tr(word) gets the correct value, but on assignment it "loses the value", and I'm a bit lost as to how to fix it.
I know that you can do something like Encoding(str_seleccio) <- "UTF-8", but in this case it is not working. For plain words it used to work, but since the assignment yields NULL, it does not help here.
Any idea? Any suggestion? What I would like is to add something to the tr function.
The main idea comes from this repository, which is the simplest version you can build if you want to take a look, but it has UTF-8 problems as well:
https://github.com/chrislad/multilingualShinyApp
As suggested in http://shiny.rstudio.com/articles/unicode.html, (re)save all files with UTF-8 encoding.
Additionally, change within updateTranslation.R:
translationContent <- read.delim("dictionary.csv", header = TRUE, sep = "\t", as.is = TRUE)
to:
translationContent <- read.delim("dictionary.csv", header = TRUE, sep = "\t", as.is = TRUE, fileEncoding = "UTF-8")
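It may also help to tell R explicitly to read your script files as UTF-8 when they are sourced (a hedged extra step; adjust file names to your project):

source("strings.R", encoding = "UTF-8")  # read the script itself as UTF-8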
Warning: when you (re)save ui.R, your c-cedilla might get destroyed. Just re-insert it if that happens.
Happy Easter :)
I am trying to read in a file in R, using the following command (in RStudio):
fileRaw <- read.csv(file = "file.csv", header = TRUE, stringsAsFactors = FALSE)
file.csv starts with a header row whose first column is named LOCATION. However, when it's read into R, that column name comes back as ï..LOCATION, for seemingly no reason.
I tried adding check.names = FALSE, but this only made it worse, as LOCATION is now replaced with LOCATION. What gives?
How do I fix this? Why is R/RStudio doing this?
There is a UTF-8 BOM (byte-order mark) at the beginning of the file. Try reading it as UTF-8, or remove the BOM from the file.
The UTF-8 representation of the BOM is the (hexadecimal) byte sequence
0xEF,0xBB,0xBF. A text editor or web browser misinterpreting the text
as ISO-8859-1 or CP1252 will display the characters  for this.
Edit: looks like using fileEncoding = "UTF-8-BOM" fixes the problem in RStudio.
Using fileEncoding = "UTF-8-BOM" fixed my problem and read the file with no issues.
Using fileEncoding = "UTF-8"/encoding = "UTF-8" did not resolve the issue.
I would like to print several items, one after the other, to the same txt file (outfile.txt).
For instance, first I would like to print a data frame u to outfile.txt, then the message 'hello', and finally the summary of a model.
How can I do it? Is sink('outfile.txt') appropriate for this case?
It is generally a very bad idea to mix different kinds of output in the same file, and I advise against it in the strongest terms: it makes the file next to unusable for other programs.
That said, most functions to save data have an append argument. You can set this to TRUE to append to an existing file rather than overwriting its contents. No need for sink.
Where you do need sink (or equivalent) is when you want to write content formatted the same way as it's printed on the console. This, for instance, is the case for summary.
Here’s an example similar to your requirements:
filename = 'test.txt'
write.table(head(cars), filename, quote = FALSE, col.names = NA)
cat('\nHello\n\n', file = filename, append = TRUE)
capture.output(print(summary(cars)), file = filename, append = TRUE)
Rather than sink, this uses capture.output, which is a convenience wrapper around sink.
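If you prefer sink() itself for the summary step, the equivalent looks like this; everything printed between the two calls is redirected to the file:

sink(filename, append = TRUE)  # redirect console output to the file
print(summary(cars))
sink()                         # restore output to the console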
My question is in response to two issues I encountered while reading a published .tsv file containing campaign finance data.
First, the file has a null character that terminates input and throws the error 'embedded nul in string: 'NAVARRO b\0\023 POWERS'' when using data.table::fread(). I understand that there are a number of potential solutions to this problem, but I was hoping to find one within R. Having seen the skipNul option in read.table(), I decided to give it a shot.
That brings me to the second issue: read.table() with reasonable settings (comment.char = "", quote = "", fill = TRUE) does not throw an error, but it also does not detect the same number of rows that data.table::fread() identified (~100k rows with read.table() vs. ~8M rows with fread()). The fread() answer seems more correct: the file is ~1.5 GB, and fread() identifies valid data when reading the rows leading up to where the error seems to occur.
Here is a link to the code and output for the issue.
Any ideas on why read.table() returns such different results? fread() works by guessing characteristics of the input file, but it doesn't seem to be guessing any exotic options that I didn't also use in read.table().
Thanks for your help!
NOTE
I do not know anything about the file in question other than its source (the California Secretary of State, by the way) and what information it contains. At any rate, the file is too large to open in Excel or Notepad, so I haven't been able to examine it visually beyond looking at a handful of rows in R.
I couldn't figure out an R way to deal with the issue, but I was able to use a Python script that relies on pandas:
import pandas as pd
import os

os.chdir(path = "C:/Users/taylor/Dropbox/WildPolicy/Data/Campaign finance/CalAccess/DATA")

receipts_chunked = pd.read_table("RCPT_CD.TSV", sep = "\t", error_bad_lines = False,
                                 low_memory = False, chunksize = 5e5)

chunk_num = 0
for chunk in receipts_chunked:
    chunk_num = chunk_num + 1
    file_name = "receipts_chunked_" + str(chunk_num) + ".csv"
    print("Output file:", file_name)
    chunk.to_csv(file_name, sep = ",", index = False)
The problem with this route is that, with error_bad_lines = False, problem rows are simply skipped instead of raising an error. There are only a handful of error cases (out of ~8 million rows), but this is still suboptimal, obviously.
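For completeness, a possible pure-R workaround (a sketch, untested against this exact file; file names are assumptions) is to read the file as raw bytes, drop the embedded nuls, and write a clean copy for fread():

n <- file.info("RCPT_CD.TSV")$size          # file size in bytes
raw_bytes <- readBin("RCPT_CD.TSV", what = "raw", n = n)  # note: loads the whole ~1.5 GB file into memory
clean <- raw_bytes[raw_bytes != as.raw(0)]  # drop embedded nul (0x00) bytes
writeBin(clean, "RCPT_CD_clean.TSV")
receipts <- data.table::fread("RCPT_CD_clean.TSV", sep = "\t")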