Chinese character encoding with different operating systems/languages - R

I am having trouble reading my CSV file containing simplified Chinese characters into R. I have tried encoding = "UTF-8", "GB18030", "GB2312", etc., but the Chinese characters are still not displayed.
I also tried changing the encoding to UTF-8 CSV with Excel; no luck.
I also tried using Chinese Windows and setting the locale to China; no luck.
After I changed to Chinese Windows, Excel can open my CSV (English Windows cannot open it correctly). RStudio can open it in View(), but the R console cannot read my CSV even after I reinstalled R as the Chinese version.
I tried Ubuntu; Ubuntu could not read my CSV at all. At least on Windows, RStudio can read my data well.
I tried Google Sheets, but my file is so big that Google Sheets would not even open it.
I tried Calc in Ubuntu and converted the file to GB* encodings, since GB works fine in RStudio on Windows. No luck, and it takes more than 10 minutes to convert my 200 MB-750 MB data to GB18030.

Ubuntu uses UTF-8 as the default Chinese encoding, so you should encode the file as UTF-8 instead of GB18030 or other GB-family encodings.
(1) Download OpenOffice (free and fast to install, and it handles larger files than Calc in Ubuntu).
(2) Detect your CSV's encoding: simply open the CSV in OpenOffice and choose the encoding that displays your Chinese characters correctly.
(3) Save your CSV in the correct encoding for your operating system. The default Windows encoding for Chinese is GBK; Ubuntu's is UTF-8.
This should solve both your file-size problem and your encoding problem. You do not even have to force the encoding; a plain read.csv would work.
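If you do still need to force the encoding, here is a minimal sketch (the file name is hypothetical):

# Tell read.csv which encoding the file uses; R re-encodes while reading
df <- read.csv("mydata.csv", fileEncoding = "UTF-8")
# On Windows, a GB-encoded file can be read the same way
df <- read.csv("mydata.csv", fileEncoding = "GB18030")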

Related

Plotly tooltip showing unicode characters instead of letters in Shiny Server (R)

I have a Shiny application running on a Shiny server that started giving errors for all accented characters today (it was working with no errors until last night).
I realized that all files seemed to be sourced using an encoding other than UTF-8. I therefore tried forcing it with source('file.R', encoding='utf-8'), but this gives an error and the application doesn't run.
Since I could not find out why this was happening, I changed all characters to their unicode format (\u00xx) and used enc2utf8 where needed.
After doing this, most of the application seemed to be working fine. The only problem I am having now is that the tooltips of my plotly charts show the unicode value instead of the character, as shown in the image below. I tried not using enc2utf8 on those data frames, but then the characters were also shown with errors: for example, it should be Último but it appears as <c3><9a>ltimo when not using enc2utf8.
How do I solve this problem? Is there a better way to force the files to be read as UTF-8 instead of whatever encoding they are currently being read in?
Extra information:
In RStudio I am saving all files as UTF-8 by default, so I assumed the encoding should not be an issue.
On my local machine (Windows) the application runs just fine and the plots are shown perfectly. The errors only occur on the server, which runs on a Linux machine.
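A sketch of the kind of normalization that sometimes helps here, assuming a hypothetical data frame df with a character column label (the column name and the source encoding are assumptions):

# If the bytes are already valid UTF-8 but mislabeled, just declare the encoding
Encoding(df$label) <- "UTF-8"
# If the strings are genuinely in another encoding (e.g. Latin-1), convert them
df$label <- iconv(df$label, from = "latin1", to = "UTF-8")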

How to set the default program to open knitted Word files from RStudio?

I have Windows 7+ PCs with both LibreOffice and MS Office installed.
I'm trying to figure out how to make RStudio open the DOCX file produced via Knit to Word in LibreOffice Writer rather than in MS Word.
Is there an option to choose the default viewer for Word files?
One can easily set the default program for PDF files, but I cannot find a similar option for DOCX files, neither in the GUI nor by looking at the output of Sys.getenv().
P.S. LibreOffice is set system-wide as the default program for opening office formats, but it appears that RStudio doesn't check the system preferences.
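A possible workaround is to launch the viewer yourself from the R console after knitting; this sketch assumes LibreOffice's soffice binary is on the PATH and that the knitted output is called report.docx (both are assumptions):

# Open the knitted document in LibreOffice Writer without blocking the R session
system2("soffice", args = c("--writer", "report.docx"), wait = FALSE)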

Diacritics in a Pascal console app?

I print messages with diacritics in my console application. I have tried setting several encodings commonly used for my language (Czech), but none of them gives the desired result. I tried UTF-8, Windows-1250 (CP1250), ISO 8859-2...
Is there a way to force the console to use a specific encoding?
Or at least, where can I find out which encoding my console uses?
Thanks in advance.
EDIT: Using Windows 7 - the basic command-line console (cmd.exe).
To display the current codepage in cmd.exe:
chcp
To change the current codepage, e.g., to CP-1250:
chcp 1250
By default, the Windows console uses the OEM encoding. There are three encodings for APIs in Windows: OEM, ANSI, and Unicode. cmd.exe, when executed normally, uses OEM.
UTF-8 seems to be possible, but needs:
(1) starting the console with "cmd /u" (create a shortcut);
(2) setting the codepage with chcp 65001;
(3) choosing a Unicode-capable font (e.g. Consolas 20) in the settings of the shortcut.

How to protect special characters in UNIX files

I moved a text file from Windows to Unix. The content of the Windows file had some special characters like ®, ä which I needed. However, after moving it to Linux, all my special characters were prepended by Ã. For example, if the string was äbcd# it was converted to ÃabcdÃ#. Also, some special characters were totally replaced by either - or `. Please let me know how I can protect my special characters from being modified or corrupted.
Update1:
I tried using binary transfer in WinScp. I am still getting the same problem.
Update2:
I tried using dos2unix. It didn't work either.
The problem is caused by the fact that Windows and Unix use different text encodings. Your file on Windows is probably in an ANSI encoding (not ASCII), whereas Unix (Linux?) most likely expects UTF-8.
In Notepad, save your file in UTF-8 format. Then run the file through dos2unix to fix the line breaks.
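For example, from a Unix shell (the file names are hypothetical, and CP1252 is only an assumption about the source ANSI codepage):

# convert from the Windows ANSI codepage (assumed CP1252 here) to UTF-8
iconv -f CP1252 -t UTF-8 windowsfile.txt > unixfile.txt
# then normalize the CRLF line endings in place
dos2unix unixfile.txt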

Ctrl-M chars when transferring files over SFTP

I am sending files from a Windows system to a Unix SFTP server using the JSCAPE FTP client.
However, I am experiencing the following issue:
When uploading a text file from Windows to Unix, each line of the transferred text file contains Control-M characters. I did some searching and found that using the "ASCII" transfer mode should solve the issue, but the Ctrl-M characters still appear in the files.
Can anyone shed some light on this issue?
Thanks in advance.
FTP supports switching between binary and ASCII transfer modes and converting data on the fly, but SFTP does not support that feature and always transfers files unchanged (at least in the most popular version 3 of the protocol).
The utility dos2unix can be used to convert files from DOS to Unix.
That's the newline character from Windows files showing up on the Unix system.
Convert the line endings prior to uploading, or find a different FTP server package that can do the conversion for you.
Some text editors have this functionality built in; for instance, Notepad++.
Do you have Cygwin? You can use the dos2unix utility.
