Knitting Rmd treats non-english characters differently

Knitting Rmd treats non-english characters differently - r

I've tried to write reproducable example below. It is a mix of .Rmd and .r . Hopefully you can see why.
The problem I have is that non-english characters are treated differently depending on whether code is run directly in the console or when Knitted to HTML.
In the example below I create a small data.frame with characters ü and ö, write it to csv, then read it back in again.
If the writing and reading both take place inside or outside a chunk, then all is well.
But if the writing and reading take place in different places then a different encoding is used (I think). and characters get mixed up.
This means that when reading in data I need a different encoding when compiling an .Rmd file than when working directly in R.
As far as I can see the locale is always the same, so I don't understand what's going on.
Any ideas?
Write and read csv directly to create new datafile
df2 <- data.frame(Cäl1 = c(1,2), Col2 = c("ü","a"))
write.csv(df2, file="df2.csv")
read.csv("df2.csv")
Sys.getlocale(category = "LC_ALL")
Now try Knitting the whole document (just running the chunk behaves differently)
```{r read_inside}
read.csv("df2.csv")
Sys.getlocale(category = "LC_ALL")
```
this second chunk will work because the data.frame is created inside the chunk
```{r write_read_inside}
df2 <- data.frame(Cäl1 = c(1,2), Col2 = c("ü","a"))
write.csv(df2, file="df2.csv")
read.csv("df2.csv")
Sys.getlocale(category = "LC_ALL")
```
Session Info:
R version 2.15.0 (2012-03-30)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_2.15.0

So the answer is to guarantee UTF8 encoding, e.g. write.csv(..., fileEncoding = 'UTF-8'). The root problem was actually that RStudio uses UTF8 by default, but R uses the native encoding of the OS by default. We can either ask R to use UTF8 in write.csv, or ask RStudio to use native encoding (options(encoding = 'native.enc')).

Related

reading Tamil corpus in R

I am have built a basic word prediction product using R as part of an online course project work. I wanted to extend it for predicting next word from Tamil phases. I had used sample of Tamil language corpora from HC Corpora website. I have read it into R and created a tm() corpus.
testData <- "திருவண்ணாமலை, கொல்லிமலை, சதுரகிரி என அவன் சித்தர்களை பல
இடங்களில், மலைகளில், குகைகளில், இன்னும் பல ரகசிய இடங்களில்
அவன் சித்தர்களை சந்தித்து பல நம்பமுடியாத சக்திகளைப்
பெற்றுவிட்டான் என்று சொல்லிக் கொள்கிறார்கள்"
getUnigrams <- function(x) {NGramTokenizer(x,
Weka_control(min=1, max=1))}
unigrams <- DocumentTermMatrix(VCorpus(VectorSource(testData)),
control=list(tokenize=getUnigrams))
unigramsList <- data.frame(slam::col_sums(unigrams))
head(unigramsList, 3)
> slam..col_sums.unigrams.
அவன் 2
இடங்களில் 2
இன்னும் 1
The actual Tamil words are row names of this data-frame and displayed properly on the screen. However, when I try to add it as column against their respective count, the resulting data frame does not displays the Tamil words correctly in column unigramsList$word1. It displays it as unicode characters of underlying Tamil word.
unigramsList$word1 <- rownames(unigramsList) ## Encoding issues arise from here!!!
head(unigramsList, 3)
slam..col_sums.unigrams.
அவன் 2
இடங்களில் 2
இன்னும் 1
word1
அவன் <U+0B85><U+0BB5><U+0BA9><U+0BCD>
இடங்களில் <U+0B87><U+0B9F><U+0B99><U+0BCD><U+0B95><U+0BB3><U+0BBF><U+0BB2><U+0BCD>
இன்னும் <U+0B87><U+0BA9><U+0BCD><U+0BA9><U+0BC1><U+0BAE><U+0BCD>
>
I tried to continue with these unicode characters and built n-grams for 2, 3 and 4-grams and used it for my prediction. But all subsequent operations on this column are displayed as raw unicode only. I want to be able to view and predict them in their native Tamil characters.
My session information is as below:
> sessionInfo()
R version 3.2.5 (2016-04-14)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RWeka_0.4-29 tm_0.6-2 NLP_0.1-9 stringi_1.0-1 stringr_1.0.0
loaded via a namespace (and not attached):
[1] magrittr_1.5 parallel_3.2.5 tools_3.2.5 slam_0.1-37
[5] grid_3.2.5 rJava_0.9-8 RWekajars_3.9.0-1

I managed to hack a solution to above and hence thought of posting it for anyone interested in this topic.
a) Instead of saving the n-grams as csv files on Windows, I saved them in R binary format (using save() and load() functions). I had saved the generated n-grams using read.csv() with fileEncoding option set to UTF-8, but still it did not help even after deploying it on Shiny.
b) Deployed and tested on Shiny apps, which runs on a Linux platform and hence it was able to display Tamil characters in unicode correctly. Testing it locally on Windows was not effective as characters were displayed as raw unicodes e.g. , etc.
Thanks to Marek Gagolewski, author of stringi, for suggestions regarding shinyio, which helped me deploy and test on shiny's Linux platform.
You can check out the product using the below link if you are interested: https://periasamyr.shinyapps.io/predictwordml/
Regards
Peri

Displaying UTF-8 encoded characters in R

I am using the RODBC package to read data from SQL server. R is reading the Chinese characters as "?????"
I have passed the parameter DBMSencoding = "UTF-8" to the odbcConnect function.
Following is the sample code I am using:
Connection <- odbcConnect("abc", uid = "123", pwd = "123",
DBMSencoding = "UTF-8", readOnlyOptimize=T)
Var1 <- sqlQuery(Connection, query, errors = TRUE, stringsAsFactors=F)
May be I didn't pass the arguments the way I am supposed to?
sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RODBC_1.3-12
loaded via a namespace (and not attached):
[1] tools_3.2.3
odbcGetInfo(mainConnection)
DBMS_Name DBMS_Ver Driver_ODBC_Ver Data_Source_Name Driver_Name
"Microsoft SQL Server" "10.50.4000" "03.52" "SQLSRV32.DLL"
Driver_Ver ODBC_Ver Server_Name
"06.01.7601" "03.80.0000"

Check the database's character encoding:
select userenv('language') from dual;
SIMPLIFIED CHINESE_CHINA.AL32UTF8
Change your Environment Variable NLS_LANG before connecting to the database:
Sys.setenv(NLS_LANG="SIMPLIFIED CHINESE_CHINA.AL32UTF8")
Connection <- odbcConnect("abc", uid = "123", pwd = "123", DBMSencoding = "UTF-8", readOnlyOptimize=T)

R on Windows has a lot of problems displaying characters outside of ASCII, even though it is often faithfully representing them internally. There is a lot of information in this answer about why this is the case, and some simple diagnostics in this answer. First try plotting, like:
# first, make sure plotting Chinese works in general
# (i.e., you have an appropriate font)
hanzi <- "漢字"
plot(1, 1, type="n")
text(1, 1, hanzi)
If that works, replace the hanzi <- "漢字" line with your sql query line to get some Chinese text from your database into a string variable, and try plotting that. If it shows up on the plot, then the characters are being read fine and represented internally fine, and the problem is just displaying them in the console. If plotting worked for the "漢字" string variable but doesn't work for your SQL-extracted string, then at least you know that the problem is actually with the SQL part and not just with display in the console.

I got the same problem and successfully solved it. It was quite simple. Go to Control Panel --> Region and Language --> Administrative --> Change system locale --> Chinese.

Using special characters in Rstudio

I am working with some special characters in Rstudio. It coverts them into plain letters.
print("Safarzyńska2013")
[1] "Safarzynska2013"
x <- "Māori"
x
[1] "Maori"
Is there any way to read in the exact original characters.
Following info might be helpful:
Rstudio default encoding is UTF-8
sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_3.1.1

This not an exclusively RStudio problem.
Typing print("Safarzyńska2013") on the console of RGui also converts them to plain letters. Running this code from an UTF-8 encoded Script in RGui returns [1] "Safarzy?ska2013".
I don't think that it is a good idea to type such special chars on the console. x <- "SomeString"; Encoding(x) returns "unknown" and that is probably the problem: R has no idea what encoding you are using on the console and probably has no chance to get your original encoding.
I put "Safarzyńska2013\nMāori\n" in a text file encoded with UTF-8. Then the following works fine:
tbl <- read.table('c:/test1.txt', encoding = 'UTF-8', stringsAsFactors = FALSE)
tbl[1,1]
tbl[2,1]
Encoding(tbl[1,1]) # returns "UTF-8"
If you really want to use the console, you probably will have to mask the special chars. In ?Encoding we find the following example to create a word with special chars:
x <- "fa\xE7ile"
Encoding(x)
Actually I don't know at the moment how to get these codes for your special chars and ?Encoding has also no hints...

Go to the label File of RStudio, them click on Save with encoding... , Choose Encoding
UTF-8 , Set as default encoding for source file and save.
Hope this helps

how to display and input chinese (and other non-ASCII) character in r console?

My system: win7 ultimate 64 english version + r-3.1(64) .
Here is my sessionInfo.
> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
LC_MONETARY=English_United States.1252 LC_NUMERIC=C
LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
1.can't input chinese character into r console
When I input a chinese character in r console, it turns to garbled character .
2.can't display chinese character on the r console
When I read data in r console, the chinese character turns into a garbled character .
You can download the data, and test it with
read.table("r1.csv",sep=",")
Download Data
Please see the graph to download the data if you don't know how to get the data from my web.
How can I setup my pc to properly display and input chinese characters in r console?
I have updated the chinese language pack ,and enabled it,but problem remains still.

It is probably not very well documented, but you want to use setlocale in order to use Chinese. And the method applies to many other languages as well. The solution is not obvious as the official document of setlocale didn't specifically mentioned it as a method to solve the display issues.
> print('ÊÔÊÔ') #试试, meaning let's give it a shot in Chinese
[1] "ÊÔÊÔ" #won't show up correctly
> Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
> Sys.setlocale(category = "LC_ALL", locale = "chs") #cht for traditional Chinese, etc.
[1] "LC_COLLATE=Chinese_People's Republic of China.936;LC_CTYPE=Chinese_People's Republic of China.936;LC_MONETARY=Chinese_People's Republic of China.936;LC_NUMERIC=C;LC_TIME=Chinese_People's Republic of China.936"
> print('试试')
[1] "试试"
> read.table("c:/CHS.txt",sep=" ") #Chinese: the 1st record/observation
V1 V2 V3 V4 V5 V6
1 122 第一 122 条 122 记录
If you just want to change the display encoding, without changing other aspects of locales, use LC_CTYPE instead of LC_ALL:
> Sys.setlocale(category = "LC_CTYPE", locale = "chs")
[1] "Chinese_People's Republic of China.936"
> print('试试')
[1] "试试"
Now, of course this only applies to the official R console. If you use other IDE's, such as the very popular RStudio, you don't need to do this at all to be able to type and display Chinese, even if you didn't have the Chinese locale loaded.
Migrate some useful stuff from the following comments:
If the data still fails to show up correctly, the we should also look into the issue of the file encoding. If the file is UTF-8 encoded, tither data <- read.table("you_file", sep=',', fileEncoding="UTF-8-BOM", header=TRUE) or fileEncoding="UTF-8" will do, depends on which encoding it really has.
But you may want to stay away from UTF-BOM as it is not recommended: What's different between UTF-8 and UTF-8 without BOM?

how to display đ, ư, ơ, ă in R graphs

I am trying to put Vietnamese labeling in R graphs. I use RStudio and save my code using UTF-8 encoding. It handles the Vietnamese characters I put in the code well, I mean everything shows up in the code properly. However, in the graphs I make, while many characters display OK, several important ones do not show up properly, including
đ - which displays incorrectly as d
ư - which displays incorrectly as u
ơ - which displays incorrectly as o
ă - which displays incorrectly as a
Unfortunately this makes my graphs look unprofessional and untrustworthy.
I would really appreciate it if someone can help me figure this out.
Thanks much!
Trang
#DWin: I am on Windows 7, and here is my sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] MASS_7.3-17 MatchIt_2.4-20 tools_2.15.0
#krlmlr: Here's my code for a simple graph:
knorelative.count <- matrix(nrow=5,ncol=1)
knorelative.count[,1] <- c(1579,638,215,100,120)
par(mar=c(2,4,4,2))
barplot(prop.table(knorelative.count),beside=TRUE,
yaxt="n",ylim=c(0,.6),
legend=c("không ai biết",
"không biết nhiều hơn biết",
"nửa biết, nửa không biết",
"biết nhiều hơn không biết",
"tất cả đều biết"),
main="Người khác trong gia đình, họ hàng biết hay không")
axis(2,at=seq(0,.6,.1),labels=paste(100*seq(0,.6,.1),"%",sep=""),las=1)
When I run this, the đ in the main title and the two ơ's and the đ in the legend turn into d and o.

You are often safer specifying unicode characters by their hex codes:
plot(1:4,rep(1,4),pch=c("\u0111","\u01B0","\u01A1","\u0103"),cex=4)

For any Vietnamese folks out there who run into the same problem, here's an example for the fix using hex codes suggested by James:
print("trường")
[1] "truờng"
print("tr\u01B0ờng")
[1] "trường"
While I can type Vietnamese, in this example the word trường, into my R console ok, any kind of output (e.g. print, graph) fails to display the character ư. Replacing ư with the hex code fixes the output.
(Note: I used the function paste earlier, but then edited this based on James's suggestion to stick the hex code in the character string.)
I am so thankful to learn this way. I will do this for the report I am currently writing.
Trang

You save code in another file. Example: folder R/graph_display.R
After that run this code
eval(parse("R/graph_display.R", encoding = "UTF-8"))

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Knitting Rmd treats non-english characters differently - r

Related

reading Tamil corpus in R

Displaying UTF-8 encoded characters in R

Using special characters in Rstudio

how to display and input chinese (and other non-ASCII) character in r console?

how to display đ, ư, ơ, ă in R graphs

Categories

Resources