HEX 00 value between characters - hex

I have a file here that contains some text that I want to edit.
But between the characters there is a 00 byte. If I remove it, the file breaks and nothing appears in the program! But if I edit the text while keeping the 00 values between the letters, it works.
Is there a program that "hides" these values? Otherwise it is very difficult for me to edit so many letters one by one in a 13 MB file! Here is a screenshot:
http://img211.imageshack.us/img211/2286/fsfsz.png
What can I do?
Thanks all in advance!

Your file looks like a UTF-16 text file, which means each character is encoded in 16 bits (two bytes) instead of 8 bits.
If you try to edit this file as a standard 8-bit text file, you see a null byte between the letters.
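For illustration, this is what plain ASCII text looks like once it is encoded as UTF-16LE; the 00 bytes are simply the high bytes of each 16-bit code unit (assuming iconv and od are available):
printf 'Hi' | iconv -f UTF-8 -t UTF-16LE | od -An -tx1
 48 00 69 00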
You can use libiconv to convert the file format, or you can write your own converter.
Using iconv:
iconv -f UTF-16 -t UTF-8 yourFile.txt > fileToEdit.txt
iconv -f UTF-8 -t UTF-16 editedFile.txt > programFile.txt
If you're on Windows, you can use the MinGW distribution of libiconv.

The file is encoded in Unicode, most likely UTF-16. If you open it in a decent text editor (e.g. Notepad++), it will automatically detect this and allow you to change the encoding. However, 'the program' (whatever that is) probably expects to consume the file in UTF-16, so it's not clear why you're trying to change it; the answer is probably to keep the 00s.

Related

file in UTF-8 using ICONV showing ANSI encoding in notepad ++

I am facing an encoding problem while converting a .csv file from ISO-8859-7 to UTF-8 using the iconv command. The conversion reports success, but when I export the converted file and check it in Notepad++ it shows ANSI encoding and the characters look wrong; when I manually change the encoding to UTF-8 in Notepad++, the correct character strings appear.
For example, after the iconv conversion from ISO-8859-7 to UTF-8 the text displays incorrectly, but once I switch the encoding to UTF-8 in Notepad++ it shows ΒΙΓΛΑ ΟΛΥΜΠΟΥ ΑΕΒΕ, which is correct.
Can anyone suggest how I can get rid of this problem?
Notepad++ is merely guessing what encoding it should open the file in. The file itself doesn't declare anywhere what encoding it's in, and obviously Notepad++ gets it wrong. So you need to tell it by hand what encoding the file is actually in.
If your file looks okay when interpreted as UTF-8, then it's actually UTF-8. Nothing else to do here.
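If you want to double-check from the command line that the file really is valid UTF-8, one simple trick is to run it through iconv with UTF-8 on both sides and see whether it complains (assuming iconv is available; the file name is a placeholder):
iconv -f UTF-8 -t UTF-8 yourfile.csv > /dev/null && echo "valid UTF-8"
If the file contains byte sequences that are not valid UTF-8, iconv reports the position of the first offending one instead.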

Line entry/count difference between sed and nl on unix vs. mac

I have a simple and annoying problem, and I apologize for not posting an example. The files are big and I haven't been able to recreate the exact issue using smaller files:
These are tab-delimited files (some entries contain ", ;, or a single space character). On Unix, when I look up a unique word via nl file | sed -n '/word/p', I see that my word is on exactly the same line in all my files.
Now I copy the files to my Mac and run the same command on the exact same files, but the line numbers are all different! The total number of lines via wc -l is still identical to the numbers I get on Unix, but when I do nl file | tail -n1 I see a different number. Yet, when I take the number returned by nl on Unix and access that line via sed -n '12345p' file, I get the correct entry!?
My question: I must have something in some of my lines that is interpreted as a line break on my Mac but not on Unix, and only by nl, not sed. Can anyone help me figure out what it is? I already know it's not on every line. The issue persists when I load the data into R, and I'm stumped. Thank you!
"Phantom newlines" can be hidden in text in the form of a multi-byte UTF-8 character called an "overlong sequence".
UTF-8 normally represents ASCII characters as themselves: UTF-8 bytes in the range 0 to 127 are just those character values. However, overlong sequences can be used to (incorrectly) encode ASCII characters using multiple UTF-8 bytes (which are in the range 0x80-0xFF). A properly written UTF-8 decoder must detect overlong sequences and somehow flag them as invalid bytes. A naively written UTF-8 decoder will simply extract the implied character.
Thus, it's possible that your data is being treated as UTF-8, and contains some bytes which look like an overlong sequence for a newline, and that this is fooling some of the software you are working with. A two-byte overlong sequence for a newline would look like C0 8A, and a three-byte overlong sequence would be E0 80 8A.
It is hard to come up with an alternative hypothesis not involving character encodings.
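If you want to test this hypothesis, one option (assuming a bash-style shell, where $'\xC0' produces a raw byte) is to search for those byte sequences directly:
LC_ALL=C grep -ac $'\xC0\x8A' file
LC_ALL=C grep -ac $'\xE0\x80\x8A' file
A non-zero count means the file contains bytes that a lenient decoder could interpret as extra newlines.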

coerce single-byte ascii from a text file

I am analyzing a collection of large (>150 MB) fixed-width data files. I've been slowly reading them in using read.fwf() in 100-line chunks (each row is 7385 characters), then pushing them into a relational database for further manipulation. The problem is that the text files occasionally contain a wonky multibyte character, often enough to be annoying: for example, instead of a "U", the data file has whatever the system assigns to Unicode U+F8FF (in OS X that's an Apple symbol, but I'm not sure whether that's a cross-platform standard). When that happens, I get an error like this:
invalid multibyte string at 'NTY <20> MAINE
000008 [...]
That should have been the latter part of the word "COUNTY", but the U was, as described above, wonky. (Happy to provide more detailed code & data if anyone thinks they would be useful.)
I'd like to do all the coding in R, and I'm just not sure how to coerce the data to single-byte ASCII. Hence the subject-line part of my question: is there some easy way to coerce single-byte ASCII out of a text file that has some erroneous multibyte characters in it?
Or maybe there's an even better way to deal with this (should I be calling grep at the system level from R to hunt out the erroneous multi-byte characters)?
Any help much appreciated!
What does the output of the file command say about your data file?
/tmp >file a.txt b.txt
a.txt: UTF-8 Unicode text, with LF, NEL line terminators
b.txt: ASCII text, with LF, NEL line terminators
You can try to convert/transliterate the file's contents using iconv. For example, given a file that uses the Windows 1252 encoding:
# \x{93} and \x{94} are Windows 1252 quotes
/tmp >perl -E'say "He said, \x{93}hello!\x{94}"' > a.txt
/tmp >file a.txt
a.txt: Non-ISO extended-ASCII text
/tmp >cat a.txt
He said, ?hello!?
Now, with iconv you can try to convert it to ascii:
/tmp >iconv -f windows-1252 -t ascii a.txt
He said,
iconv: a.txt:1:9: cannot convert
Since there is no direct conversion here it fails. Instead, you can tell iconv to do a transliteration:
/tmp >iconv -f windows-1252 -t ascii//TRANSLIT a.txt > converted.txt
/tmp >file converted.txt
converted.txt: ASCII text
/tmp >cat converted.txt
He said, "hello!"
There might be a way to do this using R's IO layer, but I don't know R.
Hope that helps.

iconv unicode unknown input format

I have a file which is described under Unix as:
$file xxx.csv
xxx.csv: UTF-8 Unicode text, with very long lines
Viewing it in less/vi renders some special characters (ßÄ°...) unreadably (e.g. ├╝); Windows will also not display them properly; importing the file directly into a database just turns the special characters into different special characters (+ä, +ñ, ...).
I wanted to convert it now to a "default readable" encoding with iconv.
When I try to convert it with iconv
$iconv -f UTF-8 -t ISO-8859-1 xxx.csv > yyy.csv
iconv: illegal input sequence at position 1234
Using UNICODE as the input encoding and UTF-8 as the output returns the same message.
I am guessing the file is encoded in some other format that I do not know. How can I find out which format it is, so that I can convert it to something "universally" readable?
Converting from UTF-8 to ISO-8859-1 only works if your UTF-8 text only has characters that can be represented in ISO-8859-1. If this is not the case, you should specify what needs to happen to these characters, either ignoring (//IGNORE) or approximating (//TRANSLIT) them. Try one of these two:
iconv -f UTF-8 -t ISO-8859-1//IGNORE --output=outfile.csv inputfile.csv
iconv -f UTF-8 -t ISO-8859-1//TRANSLIT --output=outfile.csv inputfile.csv
In most cases, I guess approximation is the best solution, mapping e.g. accented characters to their unaccented counterparts, the euro sign to EUR, etc...
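As a quick illustration of what transliteration does (run in a UTF-8 terminal; output shown for GNU/glibc iconv, and the exact replacements can vary between iconv implementations and locales):
printf '10 €\n' | iconv -f UTF-8 -t ISO-8859-1//TRANSLIT
10 EUR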
The problem was that Windows could not interpret the file as UTF-8 on its own. It read it as ASCII, so ä became the two-character sequence Ã¤ (bytes 195 164).
Trying to convert it, I found a solution that works for me:
iconv -f UTF-8 -t WINDOWS-1252//TRANSLIT --output=outfile.csv inputfile.csv
Now I can view the special characters correctly in editors.
For SQL Server compatibility, converting UTF-8 to UTF-16 works even better ... the file size just grows quite a bit.
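For reference, that UTF-8 to UTF-16 conversion can be done the same way (UTF-16LE is an assumption here: SQL Server tooling generally expects little-endian UTF-16, and some import paths also want a BOM, so check what yours needs):
iconv -f UTF-8 -t UTF-16LE --output=outfile.csv inputfile.csv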
If you are not sure about the file type you are dealing with, you can find out as follows:
file file_name
The above command will give you the file format. Then iconv can be used accordingly.
For example, if the file format is UTF-16 and you want to convert it to UTF-8, the following can be used:
iconv -f UTF-16 -t UTF-8 file_name >output_file_name
Hope this adds some insight into what you are looking for.

How do I distinguish between 'binary' and 'text' files?

Informally, most of us understand that there are 'binary' files (object files, images, movies, executables, proprietary document formats, etc) and 'text' files (source code, XML files, HTML files, email, etc).
In general, you need to know the contents of a file to be able to do anything useful with it, and from that point of view it doesn't really matter whether the encoding is 'binary' or 'text'. And of course files just store bytes of data, so they are all 'binary', and 'text' doesn't mean anything without knowing the encoding. And yet, it is still useful to talk about 'binary' and 'text' files, but to avoid offending anyone with this imprecise definition, I will continue to use 'scare' quotes.
However, there are various tools that work on a wide range of files, and in practical terms, you want to do something different based on whether the file is 'text' or 'binary'. An example of this is any tool that outputs data on the console. Plain 'text' will look fine, and is useful. 'binary' data messes up your terminal, and is generally not useful to look at. GNU grep at least uses this distinction when determining if it should output matches to the console.
So, the question is, how do you tell whether a file is 'text' or 'binary'? And to restrict it further, how do you tell on a Linux-like filesystem? I am not aware of any filesystem metadata that indicates the 'type' of a file, so the question becomes: by inspecting the content of a file, how do I tell whether it is 'text' or 'binary'? And for simplicity, let's restrict 'text' to mean characters that are printable on the user's console. In particular, how would you implement this? I'm not really after a list of existing programs I can use, although pointers to existing code that does this are helpful.
You can use the file command. It does a bunch of tests on the file (man file) to decide if it's binary or text. You can look at/borrow its source code if you need to do that from C.
file README
README: ASCII English text, with very long lines
file /bin/bash
/bin/bash: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for GNU/Linux 2.2.5, dynamically linked (uses shared libs), stripped
The spreadsheet software my company makes reads a number of binary file formats as well as text files.
We first look at the first few bytes for a magic number which we recognize. If we do not recognize the magic number of any of the binary types we read, then we look at up to the first 2K bytes of the file to see whether it appears to be a UTF-8, UTF-16 or a text file encoded in the current code page of the host operating system. If it passes none of these tests, we assume that it is not a file we can deal with and throw an appropriate exception.
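As a toy illustration of the magic-number part (not the actual code; the file name is made up), the first few bytes identify many binary formats at a glance, e.g. a PNG file always starts with the same eight bytes:
head -c 8 picture.png | od -An -tx1
 89 50 4e 47 0d 0a 1a 0a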
You can determine the MIME type of the file with
file --mime FILENAME
The shorthand is file -i on Linux and file -I (capital i) on macOS.
If it starts with text/, it's text; otherwise it's binary. The only exceptions are XML applications, which you can match by looking for +xml at the end of the MIME type.
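A minimal sketch of how that rule could be scripted (assuming it runs inside a script that takes the file name as its first argument; --brief and --mime-type are GNU file options, so availability may vary):
mime=$(file --brief --mime-type "$1")
case "$mime" in
    text/*|*+xml) echo "text" ;;
    *)            echo "binary" ;;
esac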
To list text file names in current dir/subdirs:
grep -rIl ''
Binaries:
grep -rIL ''
To check for a particular file:
grep -qI '' FILE
Then an exit status of 0 means the file is text, and 1 means it is binary.
To check:
echo $?
Key option is this:
-I Process a binary file as if it did not contain matching data;
Other options:
-r, --recursive
Read all files under each directory, recursively;
-l, --files-with-matches
Suppress normal output; instead print the name of each input file from which output would normally have been printed.
-L, --files-without-match
Suppress normal output; instead print the name of each input file from which no output would normally have been printed.
-q, --quiet, --silent
Quiet; do not write anything to standard output. Exit immediately with zero status if any match is found, even if an error was detected.
Perl has a decent heuristic. Use the -B operator to test for binary (and its opposite, -T, to test for text). Here's a shell one-liner to list text files:
$ find . -type f -print0 | perl -0nE 'say if -f and -s _ and -T _'
(Note that those underscores without a preceding dollar are correct (RTFM).)
Well, if you are just inspecting the entire file, see if every character is printable with isprint(c). It gets a little more complicated for Unicode.
To distinguish a Unicode text file, MSDN offers some great advice as to what to do.
The gist of it is to first inspect up to the first four bytes:
EF BB BF UTF-8
FF FE UTF-16, little endian
FE FF UTF-16, big endian
FF FE 00 00 UTF-32, little endian
00 00 FE FF UTF-32, big-endian
That will tell you the encoding. Then, you'd want to use iswprint(c) for the rest of the characters in the text file. For UTF-8 and UTF-16, you need to parse the data manually since a single character can be represented by a variable number of bytes. Also, if you're really anal, you'll want to use the locale variant of iswprint if that's available on your platform.
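To eyeball those signature bytes from a shell, something like this works (assuming head and od are available; the file name is a placeholder):
head -c 4 somefile.txt | od -An -tx1
For a little-endian UTF-16 file that starts with a BOM, for example, that prints ff fe followed by the first character's bytes.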
It's an old topic, but maybe someone will find this useful.
If you have to decide in a script whether something is a text file, you can simply do it like this:
if file -i "$1" | grep -q text;
then
.
.
fi
This gets the file type, and with a silent grep you can decide whether it's text.
You can use libmagic, which is the library behind the Unix file command-line tool.
There are wrappers for many languages:
Python
.NET
Nodejs
Ruby
Go
Rust
Most programs that try to tell the difference use a heuristic, such as examining the first n bytes of the file and seeing whether those bytes all qualify as 'text' or not (i.e., do they all fall within the range of printable ASCII characters). For a finer distinction there's always the file command on UNIX-like systems.
One simple check is whether it contains \0 (NUL) characters. Text files don't have them.
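If you want to run that check from a shell, one option (assuming GNU grep built with PCRE support, i.e. the -P flag) is:
grep -qP '\x00' somefile && echo "binary (contains NUL bytes)" || echo "text (no NUL bytes)"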
As previously stated, *nix operating systems have this ability within the file command. This command uses a configuration file that defines magic numbers contained within many popular file structures.
This file, called magic, was historically stored in /etc, although it may be in /usr/share on some distributions. The magic file defines offsets of values known to exist within a file, and file can then examine these locations to determine its type.
The structure and description of the magic file can be found by consulting the relevant manual page (man magic).
As for an implementation, that can be found within file.c itself; the relevant portion of the file command that determines whether a file is readable text is the following:
/* Make sure we are dealing with ascii text before looking for tokens */
for (i = 0; i < nbytes - 1; i++) {
        if (!isascii(buf[i]) ||
            (iscntrl(buf[i]) && !isspace(buf[i]) &&
             buf[i] != '\b' && buf[i] != '\032' && buf[i] != '\033'))
                return 0;       /* not all ASCII */
}
