I am analyzing a collection of large (>150 MB) fixed-width data files. I've been slowly reading them in using read.fwf() in 100-line chunks (each row is 7385 characters), then pushing them into a relational database for further manipulation. The problem is that the text files occasionally have a wonky multibyte character, often enough to be annoying: e.g., instead of a "U", the data file has whatever the system assigns to the Unicode code point U+F8FF (in OS X that's an Apple symbol, but I'm not sure that's a cross-platform standard). When that happens, I get an error like this:
invalid multibyte string at 'NTY <20> MAINE
000008 [...]
That should have been the latter part of the word "COUNTY", but the U was, as described above, wonky. (Happy to provide more detailed code & data if anyone thinks they would be useful.)
I'd like to do all the coding in R, and I'm just not sure how to coerce everything down to single-byte characters. Hence the subject-line part of my question: is there some easy way to coerce single-byte ASCII out of a text file that has some erroneous multibyte characters in it?
Or maybe there's an even better way to deal with this (should I be calling grep at the system level from R to hunt out the erroneous multibyte characters)?
Any help much appreciated!
What does the output of the file command say about your data file?
/tmp >file a.txt b.txt
a.txt: UTF-8 Unicode text, with LF, NEL line terminators
b.txt: ASCII text, with LF, NEL line terminators
You can try to convert/transliterate the file's contents using iconv. For example, given a file that uses the Windows 1252 encoding:
# \x{93} and \x{94} are Windows 1252 quotes
/tmp >perl -E'say "He said, \x{93}hello!\x{94}"' > a.txt
/tmp >file a.txt
a.txt: Non-ISO extended-ASCII text
/tmp >cat a.txt
He said, ?hello!?
Now, with iconv you can try to convert it to ascii:
/tmp >iconv -f windows-1252 -t ascii a.txt
He said,
iconv: a.txt:1:9: cannot convert
Since there is no direct conversion here, it fails. Instead, you can tell iconv to do a transliteration:
/tmp >iconv -f windows-1252 -t ascii//TRANSLIT a.txt > converted.txt
/tmp >file converted.txt
converted.txt: ASCII text
/tmp >cat converted.txt
He said, "hello!"
There might be a way to do this using R's IO layer, but I don't know R.
Hope that helps.
I have a file here that contains some text, and I want to edit it.
But between the characters there is a byte with decimal value 00. If I remove it, the file gives errors and nothing appears in the program! But if I edit the file while keeping the 00 values between the letters, it works.
Is there a program that "hides" these values? As it is, it's very difficult for me to edit so many letters one by one in a 13 MB file! Here is a screenshot:
http://img211.imageshack.us/img211/2286/fsfsz.png
What can I do?
Thanks all in advance!
Your file looks like a UTF-16 text file, which means each character is encoded in 16 bits instead of 8.
If you try to edit this file as a standard text file, you get null characters between the letters.
You can use libiconv to convert the file format, or you can write your own converter.
Using iconv:
iconv -f UTF-16 -t UTF-8 yourFile.txt > fileToEdit.txt
iconv -f UTF-8 -t UTF-16 editedFile.txt > programFile.txt
If you're on Windows, you can use the MinGW distribution of libiconv.
The file is encoded in Unicode. UTF-16, most likely. If you open it in a decent text editor (e.g. Notepad++) it will automatically detect this and allow you to change the encoding. However, 'the program' (whatever that is) probably wants to consume the file with UTF-16 encoding. It's not clear why you're trying to change it, but the answer is probably to keep the 00s.
I'm having a little trouble getting the output of diff to write to file. I have a new and old version of a .strings file and I want to be able to write the diff between these two files to a .strings.diff file.
Here's where I am right now:
diff -u -a -B $PROJECT_DIR/new/Localizable.strings $PROJECT_DIR/old/Localizable.strings >> $PROJECT_DIR/diff/Localizable.strings.diff
fgrep + $PROJECT_DIR/diff/Localizable.strings.diff > $PROJECT_DIR/diff/Localizable.txt
The result of the diff command writes to Localizable.strings.diff without any issues, but Localizable.strings.diff appears to be a binary file. Is there any way to output the diff to a UTF-8 encoded file instead?
Note that I'm trying to just get the additions using fgrep in my second command. If there's an easier way to do this, please let me know.
Thanks,
Sean
First, you probably need to identify the encoding of the Localizable.strings files. This might be done in a manner described by How to find encoding of a file in Unix via script(s), for example.
Then you probably need to convert the Localizable.strings files to UTF-8 with a tool like iconv, using commands something like:
iconv -f x -t UTF-8 $PROJECT_DIR/new/Localizable.strings >Localizable.strings.new.utf8
iconv -f x -t UTF-8 $PROJECT_DIR/old/Localizable.strings >Localizable.strings.old.utf8
Where x is the actual encoding in a form recognized by iconv. You can use iconv --list to show all the encodings it knows about.
Then you should be able to diff without having to use -a:
diff -u -B Localizable.strings.old.utf8 Localizable.strings.new.utf8 >Localizable.strings.diff.utf8
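To pull just the additions out of the unified diff afterwards (rather than fgrep +, which matches any line containing a +, including the +++ header), something like this should work:
grep '^+' Localizable.strings.diff.utf8 | grep -v '^+++' > Localizable.txt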
I am using Gina Trapani's excellent todo.sh to organize my todo list.
However, being a Dane, it would be nice if the script accepted special Danish characters like ø and æ.
I am an absolute UNIX-n00b, so it would be a great help if anybody could tell me how to fix this! :)
Slowly, the Unix world is moving from ASCII and other regional encodings to UTF-8. You need to be running a UTF-8 capable terminal, such as a modern xterm or PuTTY.
In your ~/.bash_profile, set your language to one of the UTF-8 variants:
export LANG=C.UTF-8
or
export LANG=en_AU.UTF-8
etc.
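It's worth checking first that the locale you pick is actually installed; this lists the UTF-8 locales available on the system:
locale -a | grep -i utf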
You should then be able to write UTF-8 characters in the terminal, and include them in bash scripts.
#!/bin/bash
echo "UTF-8 is græat ☺"
See also: https://serverfault.com/questions/11015/utf-8-and-shell-scripts
What does this command show?
locale
It should show something like this for you:
LC_CTYPE="da_DK.UTF-8"
LC_NUMERIC="da_DK.UTF-8"
LC_TIME="da_DK.UTF-8"
LC_COLLATE="da_DK.UTF-8"
LC_MONETARY="da_DK.UTF-8"
LC_MESSAGES="da_DK.UTF-8"
LC_PAPER="da_DK.UTF-8"
LC_NAME="da_DK.UTF-8"
LC_ADDRESS="da_DK.UTF-8"
LC_TELEPHONE="da_DK.UTF-8"
LC_MEASUREMENT="da_DK.UTF-8"
LC_IDENTIFICATION="da_DK.UTF-8"
LC_ALL=
If not, you might try doing this before you run your script:
LANG=da_DK.UTF-8
You don't say what happens when you run the script and it encounters these characters. Are they in the todo file? Are they entered at a prompt? Is there an error message? Is something output in place of the expected output?
Try this and see what you get:
read -p "Enter some characters" string
echo "$string"
Informally, most of us understand that there are 'binary' files (object files, images, movies, executables, proprietary document formats, etc) and 'text' files (source code, XML files, HTML files, email, etc).
In general, you need to know the contents of a file to be able to do anything useful with it, and from that point of view it doesn't really matter whether the encoding is 'binary' or 'text'. And of course files just store bytes of data, so they are all 'binary', and 'text' doesn't mean anything without knowing the encoding. And yet it is still useful to talk about 'binary' and 'text' files, but to avoid offending anyone with this imprecise definition, I will continue to use 'scare' quotes.
However, there are various tools that work on a wide range of files, and in practical terms, you want to do something different based on whether the file is 'text' or 'binary'. An example of this is any tool that outputs data on the console. Plain 'text' will look fine, and is useful. 'binary' data messes up your terminal, and is generally not useful to look at. GNU grep at least uses this distinction when determining if it should output matches to the console.
So, the question is, how do you tell if a file is 'text' or 'binary'? And to restrict it further, how do you tell on a Linux-like file system? I am not aware of any filesystem metadata that indicates the 'type' of a file, so the question further becomes: by inspecting the content of a file, how do I tell if it is 'text' or 'binary'? And for simplicity, let's restrict 'text' to mean characters which are printable on the user's console. And in particular, how would you implement this? (I thought this was implied, but I should have specified: I'm not really after which existing programs I can use to do this, though pointers at existing code that does it are helpful.)
You can use the file command. It does a bunch of tests on the file (man file) to decide if it's binary or text. You can look at/borrow its source code if you need to do that from C.
file README
README: ASCII English text, with very long lines
file /bin/bash
/bin/bash: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for GNU/Linux 2.2.5, dynamically linked (uses shared libs), stripped
The spreadsheet software my company makes reads a number of binary file formats as well as text files.
We first look at the first few bytes for a magic number which we recognize. If we do not recognize the magic number of any of the binary types we read, then we look at up to the first 2K bytes of the file to see whether it appears to be a UTF-8, UTF-16 or a text file encoded in the current code page of the host operating system. If it passes none of these tests, we assume that it is not a file we can deal with and throw an appropriate exception.
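If you want to do a quick version of that first check by hand, dumping the first few bytes of a file is enough to spot a known magic number or a Unicode BOM; for example (the z modifier, which prints the ASCII alongside the hex, is a GNU od extension):
head -c 16 FILENAME | od -A x -t x1z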
You can determine the MIME type of the file with
file --mime FILENAME
The shorthand is file -i on Linux and file -I (capital i) on macOS (see comments).
If it starts with text/, it's text; otherwise it's binary. The only exceptions are XML applications. You can match those by looking for +xml at the end of the file type.
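A minimal script sketch of that test (assuming a file binary that supports --brief and --mime-type, as GNU file does; $1 here stands for the file to test):
mime=$(file --brief --mime-type "$1")
case "$mime" in
    text/*|*+xml) echo "text" ;;
    *)            echo "binary" ;;
esac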
To list text file names in current dir/subdirs:
grep -rIl ''
Binaries:
grep -rIL ''
To check for a particular file:
grep -qI '' FILE
Then an exit status of 0 means the file is text; 1 means it is binary.
To check:
echo $?
The key option is this:
-I Process a binary file as if it did not contain matching data;
Other options:
-r, --recursive
Read all files under each directory, recursively;
-l, --files-with-matches
Suppress normal output; instead print the name of each input file from which output would normally have been printed.
-L, --files-without-match
Suppress normal output; instead print the name of each input file from which no output would normally have been printed.
-q, --quiet, --silent
Quiet; do not write anything to standard output. Exit immediately with zero status if any match is found, even if an error was detected.
Perl has a decent heuristic. Use the -B operator to test for binary (and its opposite, -T, to test for text). Here's a shell one-liner to list text files:
$ find . -type f -print0 | perl -0nE 'say if -f and -s _ and -T _'
(Note that those underscores without a preceding dollar are correct (RTFM).)
Well, if you are just inspecting the entire file, see if every character is printable with isprint(c). It gets a little more complicated for Unicode.
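For plain single-byte files, a rough shell equivalent of that isprint() loop is to delete every printable and whitespace byte and see whether anything is left (a sketch assuming GNU tr; a count of 0 means the file contains only printable ASCII):
LC_ALL=C tr -d '[:print:]\t\r\n' < FILE | wc -c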
To distinguish a unicode text file, MSDN offers some great advice as to what to do.
The gist of it is to first inspect up to the first four bytes:
EF BB BF UTF-8
FF FE UTF-16, little endian
FE FF UTF-16, big endian
FF FE 00 00 UTF-32, little endian
00 00 FE FF UTF-32, big-endian
That will tell you the encoding. Then, you'd want to use iswprint(c) for the rest of the characters in the text file. For UTF-8 and UTF-16, you need to parse the data manually since a single character can be represented by a variable number of bytes. Also, if you're really anal, you'll want to use the locale variant of iswprint if that's available on your platform.
It's an old topic, but maybe someone will find this useful.
If you have to decide in a script whether something is a text file, you can simply do something like this:
if file -i "$1" | grep -q text;
then
    # ... commands to run when $1 is a text file ...
fi
This will get the file type, and with a silent grep you can decide whether it's text.
You can use libmagic which is a library version of the Unix file command line (source).
There are wrappers for many languages:
Python
.NET
Nodejs
Ruby
Go
Rust
Most programs that try to tell the difference use a heuristic, such as examining the first n bytes of the file and seeing if those bytes all qualify as 'text' or not (i.e., do they all fall within the range of printable ASCII characters). For finer distinction there's always the file command on UNIX-like systems.
One simple check is if it has \0 characters. Text files don't have them.
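From the shell, a quick way to apply that check is to count the NUL bytes in, say, the first 8 kB (a sketch assuming GNU coreutils; a nonzero count suggests binary):
head -c 8192 FILE | tr -cd '\000' | wc -c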
As previously stated, *nix operating systems have this ability within the file command. This command uses a configuration file that defines magic numbers contained within many popular file structures.
This file, called magic was historically stored in /etc, although this may be in /usr/share on some distributions. The magic file defines offsets of values known to exist within the file and can then examine these locations to determine the type of the file.
The structure and description of the magic file can be found by consulting the relevant manual page (man magic).
As for an implementation, that can be found within file.c itself; the relevant portion of the file command that determines whether a file is readable text or not is the following:
/* Make sure we are dealing with ascii text before looking for tokens */
for (i = 0; i < nbytes - 1; i++) {
        if (!isascii(buf[i]) ||
            (iscntrl(buf[i]) && !isspace(buf[i]) &&
             buf[i] != '\b' && buf[i] != '\032' && buf[i] != '\033'))
                return 0;       /* not all ASCII */
}