I am analyzing a collection of large (>150 MB) fixed-width data files. I've been slowly reading them in with read.fwf() in 100-line chunks (each row is 7385 characters), then pushing them into a relational database for further manipulation. The problem is that the text files occasionally have a wonky multibyte character, often enough to be annoying: instead of a "U", for example, the data file has whatever the system assigns to Unicode U+F8FF (in OS X that's an Apple symbol, but I'm not sure whether that's a cross-platform standard). When that happens, I get an error like this:
invalid multibyte string at 'NTY <20> MAINE
000008 [...]
That should have been the latter part of the word "COUNTY", but the U was, as described above, wonky. (Happy to provide more detailed code & data if anyone thinks they would be useful.)
I'd like to do all the coding in R, and I'm just not sure how to coerce the data down to single-byte characters. Hence the subject-line part of my question: is there some easy way to coerce single-byte ASCII out of a text file that has some erroneous multibyte characters in it?
Or maybe there's an even better way to deal with this (should I be calling grep at the system level from R to hunt out the erroneous multi-byte characters)?
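For concreteness, the sort of system-level hunt I have in mind might look like the sketch below (GNU grep with PCRE support is an assumption, and datafile.txt is just a placeholder name):
# print the line number and just the offending byte(s) for anything outside
# 7-bit ASCII; -o keeps the output short even though each row is 7385 chars
LC_ALL=C grep -noP '[^\x00-\x7F]' datafile.txt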
Any help much appreciated!
What does the output of the file command say about your data file?
/tmp >file a.txt b.txt
a.txt: UTF-8 Unicode text, with LF, NEL line terminators
b.txt: ASCII text, with LF, NEL line terminators
You can try to convert/transliterate the file's contents using iconv. For example, given a file that uses the Windows 1252 encoding:
# \x{93} and \x{94} are Windows 1252 quotes
/tmp >perl -E'say "He said, \x{93}hello!\x{94}"' > a.txt
/tmp >file a.txt
a.txt: Non-ISO extended-ASCII text
/tmp >cat a.txt
He said, ?hello!?
Now, with iconv you can try to convert it to ascii:
/tmp >iconv -f windows-1252 -t ascii a.txt
He said,
iconv: a.txt:1:9: cannot convert
Since there is no direct conversion here it fails. Instead, you can tell iconv to do a transliteration:
/tmp >iconv -f windows-1252 -t ascii//TRANSLIT a.txt > converted.txt
/tmp >file converted.txt
converted.txt: ASCII text
/tmp >cat converted.txt
He said, "hello!"
There might be a way to do this using R's IO layer, but I don't know R.
Hope that helps.
Related
In my file, the character Â is somehow getting added. I am not sure what it is or how it is getting added.
12345AÂ 210Â CBCDEM
I want to remove this character from the file. I tried a basic sed command to remove it, but was unsuccessful:
sed -i -e 's/\Â//g'
I also read that dos2unix would do the job, but unfortunately that didn't work either. Assuming it was a hex character, I also tried to remove it by its hex value with sed -i 's/\xc2//g', but that didn't work.
I really want to understand what this character is and how it is getting added. Moreover, is there a way to delete all such characters in a file?
Adding encoding details:
file test.txt
test.txt: ISO-8859 text
echo $LANG
en_US.UTF-8
OS details:
uname -a
Linux vm-testmachine-001 3.10.0-693.11.1.el7.x86_64 #1 SMP Fri Oct 27 05:39:05 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
Regards.
It looks like you have an encoding mismatch between the program that writes the file (some variant of ISO-8859) and the program reading it (which assumes UTF-8). This is a textbook use case for iconv. In fact, the sample in the man page is almost exactly applicable to your case:
iconv -f iso-8859-1 -t utf-8 test.txt
iconv is a fairly standard program on almost every Unix distribution I have seen, so you should not have any issues here.
Based on the fact that you appear to be writing in English, you are probably looking for iso-8859-1, which is a very common encoding for Western European text.
If that does not fix your issue, you probably need to find the proper encoding for the output of your database. You can run
iconv -l
to get a list of the encodings available to iconv, and use the one that works for you. Keep in mind that file reporting ISO-8859 text is not absolute; in many cases there is no way to distinguish encodings such as pure ASCII and UTF-8 from the bytes alone. If I am not mistaken, file uses heuristics based on the frequencies of character codes in the file to guess the encoding, and it is quite liable to make a mistake if the sample is small and/or ambiguous.
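If file's guess is in doubt, one way to check is to look at the raw bytes of an affected line directly (a sketch, assuming GNU grep; the pattern is just the sample row from your question):
# dump one affected line byte by byte; a stray 0xC2 byte shows up as octal 302
grep -m1 'CBCDEM' test.txt | od -c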
If you want to save the output of iconv and your version supports the -o flag, you can use it. Otherwise, use redirection, but carefully:
TMP=$(mktemp)
iconv -f iso-8859-1 -t utf-8 test.txt > "$TMP" && mv "$TMP" test.txt
I have a simple and annoying problem, and I apologize for not posting an example. The files are big and I haven't been able to recreate the exact issue using smaller files:
These are tab-delimited files (some entries contain characters such as ", ;, or a single space). On UNIX, when I look up a unique word via nl file | sed -n '/word/p', I see that my word is on exactly the same line in all my files.
Now I copy the files to my Mac. I run the same command on the exact same files, but the line numbers are all different! The total number of lines from wc -l is still identical to what I get on UNIX, but when I do nl file | tail -n1 I see a different number. Yet when I take the line number returned by nl on UNIX and access that line via sed -n '12345p' file, I get the correct entry!?
My question: I must have something in some of my lines that is interpreted as a line break on my Mac but not on UNIX, and only by nl, not sed. Can anyone help me figure out what it is? I already know it's not on every line. The issue persists when I load the data into R, and I'm stumped. Thank you!
"Phantom newlines" can be hidden in text in the form of a multi-byte UTF-8 character called an "overlong sequence".
UTF-8 normally represents ASCII characters as themselves: UTF-8 bytes in the range 0 to 127 are just those character values. However, overlong sequences can be used to (incorrectly) encode ASCII characters using multiple UTF-8 bytes (which are in the range 0x80-0xFF). A properly written UTF-8 decoder must detect overlong sequences and somehow flag them as invalid bytes. A naively written UTF-8 decoder will simply extract the implied character.
Thus, it's possible that your data is being treated as UTF-8, and contains some bytes which look like an overlong sequence for a newline, and that this is fooling some of the software you are working with. A two-byte overlong sequence for a newline would look like C0 8A, and a three-byte overlong sequence would be E0 80 8A.
It is hard to come up with an alternative hypothesis not involving character encodings.
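If you want to test that hypothesis, you could search the raw bytes for those overlong newline encodings directly (a sketch, assuming GNU grep with PCRE support; file is a placeholder name):
# print any lines containing a two- or three-byte overlong encoding of LF (0x0A);
# no output means no overlong newlines were found
LC_ALL=C grep -naP '\xC0\x8A|\xE0\x80\x8A' file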
I have a file here that contains some text, and I want to edit it.
But between the characters there are bytes with the decimal value 00. If I remove them, the file gives errors and nothing appears in the program! But if I keep the 00 values between the letters while editing, it works.
Is there a program that "hides" these values? As it is, it is very difficult for me to edit so many letters one by one in a 13 MB file! Here is a screenshot:
http://img211.imageshack.us/img211/2286/fsfsz.png
What can I do?
Thanks all in advance!
Your file looks like a UTF-16 text file, which means each character is encoded in 16 bits instead of 8.
If you try to edit this file as a standard text file, you see null characters between the letters.
You can use libiconv to convert the file format, or you can write your own converter.
Using iconv:
iconv -f UTF-16 -t UTF-8 yourFile.txt > fileToEdit.txt
iconv -f UTF-8 -t UTF-16 editedFile.txt > programFile.txt
If you're on Windows, you can use the MinGW distribution of libiconv.
The file is encoded in Unicode, UTF-16 most likely. If you open it in a decent text editor (e.g. Notepad++) it will automatically detect this and allow you to change the encoding. However, 'the program' (whatever that is) probably wants to consume the file with UTF-16 encoding. It's not clear why you're trying to change it, but the answer is probably to keep the 00s.
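If you want to confirm the UTF-16 guess before deciding, dumping the first few bytes will usually show a byte-order mark plus the interleaved NULs (a sketch using POSIX od; yourFile.txt as named in the answer above):
# a UTF-16 file typically starts with a BOM (377 376 or 376 377 in od's octal
# notation) and shows \0 between the visible characters
od -c yourFile.txt | head -n 2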
I have a file which is described under Unix as:
$file xxx.csv
xxx.csv: UTF-8 Unicode text, with very long lines
Viewing it in less/vi renders some special characters (ß, Ä, °, ...) as unreadable sequences (├╝); Windows will also not display them; importing the file directly into a DB just changes the special characters into other special characters (+ä, +ñ, ...).
I wanted to convert it now to a "default readable" encoding with iconv.
When I try to convert it with iconv
$iconv -f UTF-8 -t ISO-8859-1 xxx.csv > yyy.csv
iconv: illegal input sequence at position 1234
Using UNICODE as the input encoding and UTF-8 as the output returns the same message.
I am guessing the file is encoded in some other format that I do not know. How can I find out which format it is, so I can convert it to something "universally" readable?
Converting from UTF-8 to ISO-8859-1 only works if your UTF-8 text only has characters that can be represented in ISO-8859-1. If this is not the case, you should specify what needs to happen to these characters, either ignoring (//IGNORE) or approximating (//TRANSLIT) them. Try one of these two:
iconv -f UTF-8 -t ISO-8859-1//IGNORE --output=outfile.csv inputfile.csv
iconv -f UTF-8 -t ISO-8859-1//TRANSLIT --output=outfile.csv inputfile.csv
In most cases, I guess approximation is the best solution, mapping e.g. accented characters to their unaccented counterparts, the euro sign to EUR, etc...
The problem was that Windows could not interpret the file as UTF-8 on its own. It reads it as ASCII, and then ä becomes the 2-character sequence Ã¤ (bytes 195 and 164).
Trying to convert it, I found a solution that works for me:
iconv -f UTF-8 -t WINDOWS-1252//TRANSLIT --output=outfile.csv inputfile.csv
now I can view the special chars correctly in editors
For SQL Server compatibility, converting UTF-8 to UTF-16 works even better; the file size just grows quite a bit.
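A sketch of that UTF-8 to UTF-16 conversion, reusing the flags from the command above (whether your SQL Server import wants UTF-16 or specifically UTF-16LE is an assumption you should check; the output file name is a placeholder):
# same idea as above, but targeting UTF-16 for the database import
iconv -f UTF-8 -t UTF-16 --output=outfile_utf16.csv inputfile.csv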
If you are not sure about the file type you are dealing with, you can find out as follows:
file file_name
The above command will give you the file format. Then iconv can be used accordingly.
For example if the file format is UTF-16 and you want to convert it to UTF-8 then following can be used.
iconv -f UTF-16 -t UTF-8 file_name >output_file_name
Hope this adds some insight into what you are looking for.
I want to check whether any DOS files exist in a given directory.
Is there any way to distinguish DOS files from UNIX files, apart from the ^M characters?
I tried using file, but it gives the same output for both.
$ file test_file
test_file: ascii text
And after conversion:
$ unix2dos test_file test_file
$ file test_file.txt
test_file.txt: ascii text
The CRLF (\r\n, ^M) line-ending characters are the only difference between Unix and DOS/Windows ASCII files, so no, there's no other way.
What you might try, if you have the fromdos command, is to compare its output with the original file:
file=test_file
fromdos < $file | cmp $file -
This fails (non-zero $?) if fromdos stripped any \r away.
dos2unix might be used in a similar way, but I don't know its exact syntax.
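A sketch of the same check with dos2unix, assuming a version that acts as a filter when given no file arguments (recent versions do):
file=test_file
# if dos2unix strips any CR bytes, its output differs from the original file
dos2unix < "$file" | cmp -s "$file" - || echo "$file has DOS line endings"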
If you actually put Windows newlines in, you'll see the following output from file:
test_file.txt: ASCII text, with CRLF line terminators