I have a text file that displays differently when I open it in FreeBSD vs. Windows.
On FreeBSD:
An·lisis e InvestigaciÛn
On Windows:
Análisis e Investigación
The Windows representation is obviously right. Any ideas on how to get that result in FreeBSD?
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
The problem is that it's not ASCII but UTF-8. You have to use another editor that detects the encoding correctly, or convert the file to something your editor on FreeBSD understands.
This is not pure ASCII; it's UTF-8. Try a FreeBSD editor with UTF-8 support, or change your locale.
From the way the characters are being displayed, I would say that file is UTF-8 encoded Unicode. Windows is recognising this and displaying the 'á' and 'ó' characters correctly, while FreeBSD is assuming it's ISO-8859-1, which results in these characters being displayed as two separate characters (because the UTF-8 encoding uses two bytes for them).
You'll have to tell FreeBSD that it is a UTF-8 file, somehow.
How is the file encoded? I would try re-encoding the file as UTF-16.
So after doing a bit more digging: if I 1) open the CSV file in Excel on a Mac and export it as a CSV file, and 2) then open it in TextMate, copy the text, and save it again, it works.
The result of running file file.csv is:
UTF-8 Unicode English text, with very long lines
The original is:
Non-ISO extended-ASCII English text, with very long lines
This workaround isn't really suitable, as this process is supposed to be automated. Thanks for the help so far.
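Since the goal is automation, and since the Excel/TextMate round trip effectively just produces a UTF-8 copy of the file, one possibility is to script that re-encoding, for example in R. This is only a rough sketch; the source encoding (latin1) and the file names are assumptions that would need to match whatever the original file actually uses:

txt  <- readLines("file.csv", encoding = "latin1")     # declare the assumed input encoding
utf8 <- iconv(txt, from = "latin1", to = "UTF-8")      # convert each line to UTF-8
writeLines(utf8, "file_utf8.csv", useBytes = TRUE)     # write the UTF-8 bytes out unchanged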
It doesn't matter which operating system you're using when you open the file. What matters is the application you use to open it. On Windows you're probably using Notepad, which automatically identifies the encoding as UTF-8.
The app you're using on FreeBSD obviously isn't doing that. Maybe it just can't read UTF-8 and you need to use a different app. Or maybe you just have to tell it which encoding to use. Automatic detection of character encodings is far from universal (and much farther from perfect).
I am working with Mitsubishi PLC files that were originally commented in Japanese but were then opened on English-only computers, which converted the Japanese symbols into incomprehensible Latin keyboard symbol combinations such as ‰^“]€”õONŠm”F(‘€ì”Õ1).
Being able to understand these comments would greatly enhance my ability to analyze and modify these files, as I am required to do for my work. If I could translate these back to Japanese symbols (I do have the Japanese language pack installed on my Windows laptop), I could then translate them with Google Translate, which I know is not perfect, but is a lot better than ##$$##&^.
Does anyone have any ideas how this could be done? I figure that Windows must have interpreted the original characters somehow, and there may be a way to interpret them back to the original symbols.
I am thinking of trying to do some kind of character translation using a script in Python or PowerShell or VBA (maybe I can create a map in Excel...).
Any ideas?
I can export these comments into CSV files, so they are easy to get to and manipulate if I can figure out how....
This is an ongoing problem for me so I am willing to put some time into a solution.
I tried re-opening the oldest version of the files on my computer with the Japanese language pack installed, with no luck.
You can run your text through an ASCII-to-hex converter and then through a hex-to-ASCII converter in order to change the encoding without your system settings getting in the way.
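If the garbling happened because Shift-JIS (CP932) bytes were misread as Windows-1252 (only a guess, but a common cause of strings like ‰^“]€”õONŠm”F), the mapping can sometimes be reversed in a script rather than by hand. Here is a rough R sketch; both code page names and the sample string are assumptions:

garbled  <- "‰^“]€”õONŠm”F"                                                   # paste one mangled comment here
garbled  <- enc2utf8(garbled)                                                  # make sure R treats it as UTF-8
bytes    <- iconv(garbled, from = "UTF-8", to = "CP1252", toRaw = TRUE)[[1]]   # recover the underlying bytes
restored <- iconv(list(bytes), from = "CP932", to = "UTF-8")                   # reinterpret those bytes as Shift-JIS
restored

Note that characters falling on byte values that Windows-1252 leaves undefined may already have been lost, so this can only partially recover some comments; if the CSV export still contains the original Shift-JIS bytes, reading it through a connection opened with encoding = "CP932" is more reliable.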
Is there any way to deal with this letter in R: Å?
In some configurations I'm able to read this letter from SQL via RODBC, but I haven't found any solution for saving it to CSV or TXT. It always gets converted to a plain A or to Ĺ.
Also, how can I read this letter correctly from an Excel file?
I understand from your question that the letter displays properly inside R but that you have problems writing it to files.
R's writing functions usually have an encoding parameter (for example, for write.csv and write.table it's called fileEncoding).
When you don't set it explicitly, the function will encode the file using your OS's (or R installation's) native encoding, which can sometimes cause problems with special characters. What exactly goes wrong and how to fix it depends heavily on your system setup, especially if you're also interacting with databases, as you describe.
But very often, an easy fix is writing files in UTF-8 encoding, i.e.
write.csv(your_df, your_path, fileEncoding='UTF-8')
as most external programs (such as Excel) are able to automatically detect and properly read UTF-8 encoded files.
Set the fileEncoding argument on write.table to fit your needs (e.g., if your text is encoded as UTF-8, try write.table(my_tab, file = "my_tab.txt", fileEncoding = "UTF-8")).
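As a small round-trip sketch (the file names, the data frame and the readxl suggestion below are placeholders, not something from the answers above):

df <- data.frame(x = "Å", stringsAsFactors = FALSE)
write.csv(df, "letters.csv", fileEncoding = "UTF-8", row.names = FALSE)              # write as UTF-8
back <- read.csv("letters.csv", fileEncoding = "UTF-8", stringsAsFactors = FALSE)    # read it back the same way

For the Excel part of the question: .xlsx files store text as UTF-8 internally, so a reader such as readxl's read_excel("letters.xlsx") will usually return the letter correctly without any encoding argument.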
There are a number of StackOverflow posts about opening CSV files containing (UTF-8 encoded) Chinese characters into R, in Windows. None of the answers I've found seem to work completely.
If I read.csv with encoding="UTF-8", then the Chinese characters are shown encoded (<U+XXXX>, which I've manually verified are at least correct). However, if I interrogate the data frame to get just one row or a specific cell from a row, then it's printed properly.
One post suggested this is due to strings being typed as factors. However, setting stringsAsFactors=FALSE had no effect.
Other posts say the locale must be set correctly. My system locale is apparently English_United Kingdom.1252; a Windows code page, which looks decidedly non-Unicode-friendly! If I try to change it to any of en.UTF-8, en_GB.UTF-8 or en_US.UTF-8 (or even UTF-8 or Unicode), I get an error saying that my OS cannot honour the request.
If I try Sys.setlocale(category="LC_ALL", locale="Chinese"), the locale does change (albeit to another Windows code page; still no Unicode), but then the CSV files can't be parsed. That said, if I read the files in the English locale and then switch to Chinese afterwards, the data frame is printed out correctly in the console. However, this is kludgy and, regardless, View(myData) now shows mojibake rather than the encoded Unicode code points.
Is there any way to just make it all work? That is, correct Chinese characters are echoed from the data frame to the console and View works, without having to perform secret handshakes when reading the data?
My gut feeling is that the problem is the locale: It should be set to a UTF-8 locale and then everything should [might] just work. However, I don't know how to do that...
The UTF notation is good, and it means your characters were read in properly. The issue is on R's side with printing to the console, which shouldn't be a big problem unless you are copying and pasting output. Writing out is a bit tricky: you want to open a UTF-8 file connection, then write to that file.
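A rough sketch of what that answer describes, with placeholder file names: declare the encoding when reading, and use an explicit UTF-8 connection when writing back out.

df <- read.csv("chinese.csv", encoding = "UTF-8", stringsAsFactors = FALSE)    # read, declaring UTF-8

out <- file("chinese_out.csv", open = "w", encoding = "UTF-8")                 # open a UTF-8 connection
write.csv(df, out, row.names = FALSE)
close(out)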
I have a problem that might be a bit unique, but I think that if it is answered it could answer other questions about encoding too.
In order to expand my R skills, I tried to write a function to manage the VCF files from Android phones. Everything went OK until I tried to upload the file to the phone. An error appeared saying that the first line starts with something other than what a normal VCF version 3 file starts with. But when I check the file on the PC it appears to be fine, without the characters my phone complained about. So I asked about it, and one person here said that it is the byte order mark (BOM) and that I should use a hex editor to see it. And it was there, even though it couldn't be seen in the text editors of Windows and Linux.
Thus, I tried to solve the problem by using the fileEncoding argument in R. The code I use to write the file is:
write.table(cons2,file=paste(filename,".vcf",sep=""),row.names=F,col.names=F,quote=FALSE,fileEncoding="")
I put ASCII as the argument, UTF-8, etc., but no luck. ASCII seems to delete some of the characters, and UTF-8 makes these characters visible in the text file.
I would appreciate if someone could provide a solution to this.
PS: I know that modifying the file in a hex editor solves the problem, but I want a solution in the R code.
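For what it's worth, if the unwanted bytes really are a UTF-8 byte order mark (EF BB BF) at the very start of the file, one way to stay entirely in R is to check the first three bytes after writing and rewrite the file without them. This is only a sketch; cons2 and filename are the objects from the question:

out_file <- paste(filename, ".vcf", sep = "")                      # same file name the question builds
bytes <- readBin(out_file, what = "raw", n = file.info(out_file)$size)
if (length(bytes) >= 3 && identical(bytes[1:3], as.raw(c(0xEF, 0xBB, 0xBF)))) {
  writeBin(bytes[-(1:3)], out_file)                                # rewrite the file without the BOM
}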
I'm trying to paste Chinese text into Terminal but I just get lots of numbers instead. If I paste quickly, as soon as Terminal loads, the paste works that once but not again. It's UTF-8 Unicode I'm using.
I don't think it's the font, as it works in TextEdit; the only place I get the problem is in Terminal, but I need to use it to make my SQLite database.
What would be the best thing to do?
Thanks
Load the Terminal Inspector, make sure the Character Set Encoding is set to Unicode (UTF-8), and check the "Wide glyphs for Japanese/Chinese/etc." setting.
The best thing to do would probably be to write the data into an SQL file and run it with sqlite3 mydatabase.db < mychinesetextfile.sql.
It's not pretty, on the whole; but it'll work.
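An alternative, if the Terminal paste keeps misbehaving, is to skip the shell entirely and load the UTF-8 text into SQLite from R with the RSQLite package. This is a different route from the sqlite3 < file.sql approach above, and the database, file and table names here are placeholders:

library(DBI)
library(RSQLite)

rows <- read.csv("chinese_rows.csv", fileEncoding = "UTF-8", stringsAsFactors = FALSE)

con <- dbConnect(SQLite(), "mydatabase.db")            # open (or create) the database file
dbWriteTable(con, "my_table", rows, overwrite = TRUE)  # write the UTF-8 text into a table
dbDisconnect(con)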