I am facing a problem with encoding while converting a .csv file from ISO-8859-7 to UTF-8 using the iconv command. The conversion reports success, but when I export the converted file and check it in Notepad++, it shows ANSI encoding and the characters look different. When I manually change the encoding to UTF-8 in Notepad++, the correct character strings appear.
For example, after converting with iconv from ISO-8859-7 to UTF-8, the text comes out garbled when Notepad++ reads it as ANSI.
That is not correct; when I change the encoding to UTF-8 in Notepad++, it shows ΒΙΓΛΑ ΟΛΥΜΠΟΥ ΑΕΒΕ, which is correct.
Can anyone suggest how I can get rid of this problem?
Notepad is merely guessing what encoding it should open the file in. The file itself doesn't declare anywhere what encoding it's in. Obviously Notepad gets it wrong. So you need to tell it by hand what encoding the file is actually in.
If your file looks okay when interpreted as UTF-8, then it's actually UTF-8. Nothing else to do here.
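If you want to double-check from the command line what the converted file actually contains, the file utility will report the encoding it detects. Editors that guess the encoding also tend to pick UTF-8 more reliably when a byte-order mark is present, so prepending one is a possible workaround (a sketch; converted.csv is a placeholder for your output file):
# report the encoding the converted file appears to use
file converted.csv
# optionally prepend a UTF-8 BOM (bytes EF BB BF) so guessing editors pick UTF-8
printf '\357\273\277' | cat - converted.csv > converted_bom.csv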
I am analyzing a collection of large (>150 MB) fixed-width data files. I've been slowly reading them in using read.fwf() in 100-line chunks (each row is 7385 characters), then pushing them into a relational database for further manipulation. The problem is that the text files occasionally have a wonky multibyte character, often enough to be annoying: e.g., instead of a "U", the data file has whatever the system assigns to Unicode U+F8FF (in OS X, that's an Apple symbol, but I'm not sure whether that's a cross-platform standard). When that happens, I get an error like this:
invalid multibyte string at 'NTY <20> MAINE
000008 [...]
That should have been the latter part of the word "COUNTY", but the U was, as described above, wonky. (Happy to provide more detailed code & data if anyone thinks they would be useful.)
I'd like to do all the coding in R, and I'm just not sure how to coerce single-byte characters. Hence the subject-line part of my question: is there some easy way to coerce single-byte ASCII out of a text file that has some erroneous multibyte characters in it?
Or maybe there's an even better way to deal with this (should I be calling grep at the system level from R to hunt out the erroneous multi-byte characters)?
Any help much appreciated!
What does the output of the file command say about your data file?
/tmp >file a.txt b.txt
a.txt: UTF-8 Unicode text, with LF, NEL line terminators
b.txt: ASCII text, with LF, NEL line terminators
You can try to convert/transliterate the file's contents using iconv. For example, given a file that uses the Windows 1252 encoding:
# \x{93} and \x{94} are Windows 1252 quotes
/tmp >perl -E'say "He said, \x{93}hello!\x{94}"' > a.txt
/tmp >file a.txt
a.txt: Non-ISO extended-ASCII text
/tmp >cat a.txt
He said, ?hello!?
Now, with iconv you can try to convert it to ascii:
/tmp >iconv -f windows-1252 -t ascii a.txt
He said,
iconv: a.txt:1:9: cannot convert
Since there is no direct conversion for those characters, it fails. Instead, you can tell iconv to do a transliteration:
/tmp >iconv -f windows-1252 -t ascii//TRANSLIT a.txt > converted.txt
/tmp >file converted.txt
converted.txt: ASCII text
/tmp >cat converted.txt
He said, "hello!"
There might be a way to do this using R's IO layer, but I don't know R.
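At the shell level, though, if the aim is simply to force the files down to single-byte ASCII before read.fwf() sees them, a preprocessing pass with iconv is one option. A sketch, assuming the files are UTF-8 and that datafile.txt is a placeholder name; note that dropping or transliterating characters can change the byte width of a record, which matters for fixed-width parsing:
# -c silently drops bytes that cannot be converted; //TRANSLIT approximates
# the rest (accented letters to plain ones, etc.) instead of failing
iconv -c -f UTF-8 -t ASCII//TRANSLIT datafile.txt > datafile_ascii.txt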
Hope that helps.
I have a file here that contains some text, and I want to edit it.
But between the characters there are bytes with the decimal value 00. If I remove them, the file gives errors and nothing appears in the program! But if I edit the file while keeping the 00 values between the letters, it works.
Is there a program that "hides" these values? As it is, it is very difficult for me to edit so many letters one by one in a 13 MB file! Here is a screenshot:
http://img211.imageshack.us/img211/2286/fsfsz.png
What can I do?
Thanks all in advance!
Your file looks like a UTF-16 text file, which means each character is encoded in 16 bits instead of 8.
If you try to edit this file as a standard text file, you see null characters between the letters.
You can use libiconv to convert the file format, or you can write your own converter.
Using iconv:
iconv -f UTF-16 -t UTF-8 yourFile.txt > fileToEdit.txt
iconv -f UTF-8 -t UTF-16 editedFile.txt > programFile.txt
If you're on Windows, you can use the MinGW distribution of libiconv.
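Before converting, you can confirm that the file really is UTF-16 by inspecting its first bytes; a UTF-16 text file usually starts with a byte-order mark (FF FE or FE FF) and has 00 bytes between the Latin letters. A sketch, assuming the file name from above:
# identify the encoding
file yourFile.txt
# dump the first 32 bytes; look for the FF FE / FE FF mark and the 00 padding
xxd -l 32 yourFile.txt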
The file is encoded in Unicode, most likely UTF-16. If you open it in a decent text editor (e.g. Notepad++) it will automatically detect this and allow you to change the encoding. However, 'the program' (whatever that is) probably wants to consume the file with UTF-16 encoding. It's not clear why you're trying to change it, but the answer is probably to keep the 00s.
I am running the IBAMR model (a set of codes for solving immersed boundary problems) on x86_64 GNU/Linux.
The startup configuration file is called input2d.
When I open it with vi, I find:
"input2d" [noeol][dos] 251L, 11689C
If I compile the IBAMR model without saving input2d, it compiles and runs fine.
However, if I save input2d, the compiler crashes, saying:
Warning in input2d at line 251 column 5 : Illegal character token in input
Clearly this has something to do with vi on Unix adding a newline to the end of the file when it saves.
Here's my question:
How do I save this file in dos format AND without a trailing newline in vi on a unix system?
Use vim -b <file> or :set binary to tell vim not to append a newline at the end of the file. From :help binary:
When writing a file the <EOL> for the last line is only written if
there was one in the original file (normally Vim appends an <EOL> to
the last line if there is none; this would make the file longer). See
the 'endofline' option.
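Concretely, one approach is to open the file in binary mode from the start, so the CR characters stay in the buffer (shown as ^M) and no trailing <EOL> is added on write. A sketch; if your vim is new enough to have it, :set nofixendofline is an alternative that keeps normal dos fileformat handling while leaving the missing final newline alone:
# open in binary mode: existing CRLF line endings are preserved verbatim
# and vim will not append an <EOL> to the last line when you :w
vim -b input2d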
There's a script for this that I found on Vim Tips:
http://vim.wikia.com/wiki/Preserve_missing_end-of-line_at_end_of_text_files
It automatically enables "binary" if there was no eol, but ensures the original line-endings are preserved for the rest of the file.
I have a file which is described under Unix as:
$file xxx.csv
xxx.csv: UTF-8 Unicode text, with very long lines
Viewing it in less/vi will render some special chars (ßÄ°...) unreadable (├╝); Windows will also not display it; importing it directly into a db will just change the special characters to some other special characters (+ä, +ñ, ...).
I wanted to convert it now to a "default readable" encoding with iconv.
When I try to convert it with iconv
$iconv -f UTF-8 -t ISO-8859-1 xxx.csv > yyy.csv
iconv: illegal input sequence at position 1234
Using UNICODE as the input encoding and UTF-8 as the output gives the same message.
I am guessing the file is encoded in some other format that I do not know. How can I find out which format it is, so I can convert it to something "universally" readable?
Converting from UTF-8 to ISO-8859-1 only works if your UTF-8 text only has characters that can be represented in ISO-8859-1. If this is not the case, you should specify what needs to happen to these characters, either ignoring (//IGNORE) or approximating (//TRANSLIT) them. Try one of these two:
iconv -f UTF-8 -t ISO-8859-1//IGNORE --output=outfile.csv inputfile.csv
iconv -f UTF-8 -t ISO-8859-1//TRANSLIT --output=outfile.csv inputfile.csv
In most cases, I guess approximation is the best solution, mapping e.g. accented characters to their unaccented counterparts, the euro sign to EUR, etc...
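As a quick illustration of what the transliteration does (the exact output depends on the iconv implementation and the current locale):
# the euro sign and accented letters get approximated in plain ASCII
echo "price: 5 € for the crème brûlée" | iconv -f UTF-8 -t ASCII//TRANSLIT
# possible output: price: 5 EUR for the creme brulee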
The problem was that Windows could not interpret the file as UTF-8 on its own. It reads it as ANSI, and then ä becomes a two-character sequence, Ã¤ (bytes 195 164).
Trying to convert it, I found a solution that works for me:
iconv -f UTF-8 -t WINDOWS-1252//TRANSLIT --output=outfile.csv inputfile.csv
Now I can view the special chars correctly in editors.
For SQL Server compatibility, converting UTF-8 to UTF-16 works even better ... the file size just grows quite a bit.
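A possible command for that; SQL Server generally expects little-endian UTF-16, the file names are placeholders, and depending on the iconv build you may need to prepend a BOM (bytes FF FE) yourself:
# convert to little-endian UTF-16 for consumption by SQL Server tools
iconv -f UTF-8 -t UTF-16LE inputfile.csv > outfile_utf16.csv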
If you are not sure about the file type you are dealing with, you can find it out as follows:
file file_name
The above command will give you the file format. Then iconv can be used accordingly.
For example, if the file format is UTF-16 and you want to convert it to UTF-8, the following can be used:
iconv -f UTF-16 -t UTF-8 file_name >output_file_name
Hope this adds some insight into what you are looking for.
I am using Gina Trapani's excellent todo.sh to organize my todo list.
However, being a Dane, it would be nice if the script accepted special Danish characters like ø and æ.
I am an absolute UNIX-n00b, so it would be a great help if anybody could tell me how to fix this! :)
Slowly, the Unix world is moving from ASCII and other regional encodings to UTF-8. You need to be running a UTF-8 capable terminal, such as a modern xterm or PuTTY.
In your ~/.bash_profile, set your language to one of the UTF-8 variants.
export LANG=C.UTF-8
or
export LANG=en_AU.UTF-8
etc..
You should then be able to write UTF-8 characters in the terminal, and include them in bash scripts.
#!/bin/bash
echo "UTF-8 is græat ☺"
See also: https://serverfault.com/questions/11015/utf-8-and-shell-scripts
What does this command show?
locale
It should show something like this for you:
LC_CTYPE="da_DK.UTF-8"
LC_NUMERIC="da_DK.UTF-8"
LC_TIME="da_DK.UTF-8"
LC_COLLATE="da_DK.UTF-8"
LC_MONETARY="da_DK.UTF-8"
LC_MESSAGES="da_DK.UTF-8"
LC_PAPER="da_DK.UTF-8"
LC_NAME="da_DK.UTF-8"
LC_ADDRESS="da_DK.UTF-8"
LC_TELEPHONE="da_DK.UTF-8"
LC_MEASUREMENT="da_DK.UTF-8"
LC_IDENTIFICATION="da_DK.UTF-8"
LC_ALL=
If not, you might try doing this before you run your script:
LANG=da_DK.UTF-8
You don't say what happens when you run the script and it encounters these characters. Are they in the todo file? Are they entered at a prompt? Is there an error message? Is something output in place of the expected output?
Try this and see what you get:
read -p "Enter some characters" string
echo "$string"