I want to check whether any DOS files exist in a given directory.
Is there any way to distinguish DOS files from UNIX files apart from the ^M characters?
I tried using file, but it gives the same output for both.
$ file test_file
test_file: ascii text
And after conversion:
$ unix2dos -n test_file test_file.txt
$ file test_file.txt
test_file.txt: ascii text
The CRLF (\r\n, ^M) line endings are the only difference between Unix and DOS/Windows ASCII files, so no, there is no other way to tell them apart.
What you might try, if you have the fromdos command, is to compare its output with the original file:
file=test_file
fromdos < "$file" | cmp "$file" -
This fails (non-zero $?) if fromdos stripped any \r away.
dos2unix might be used in a similar way, but I don't know its exact syntax.
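If you have dos2unix rather than fromdos, recent versions act as a filter when no file name is given, so the same comparison trick should work (a sketch, assuming that filter behaviour):
file=test_file
dos2unix < "$file" | cmp -s "$file" - || echo "$file has DOS line endings"
Here cmp -s stays quiet and only the exit status is checked.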
If you actually put Windows newlines in, you'll see the following output from file:
test_file.txt: ASCII text, with CRLF line terminators
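To come back to the original question of checking a whole directory, you can simply look for a carriage return in each regular file; a rough sketch, assuming bash (for the $'\r' syntax) and with /path/to/dir as a placeholder:
for f in /path/to/dir/*; do
    [ -f "$f" ] || continue
    if grep -q $'\r' "$f"; then
        printf '%s: DOS line endings\n' "$f"
    fi
done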
Related
I need a unix command to verify that the file has ASCII printable characters only (between ASCII hex 20 and 7E inclusive).
I found the command below to check whether a file contains non-ASCII characters, but I cannot figure out how to answer my question above.
if LC_ALL=C grep -q '[^[:print:][:space:]]' file; then
    echo "file contains non-ascii characters"
else
    echo "file contains ascii characters only"
fi
Nice to have:
- Stop after the first hit; sometimes one is enough.
To find characters in the hex 20 to 7E range in a file you can use:
grep -P "[\x20-\x7E]" file
Note the use of -P to enable Perl-compatible regular expressions.
But in this case you want to check whether the file contains only these characters, so the best thing to do is to check whether any character falls outside that range, that is, to check for [^range]:
grep -P "[^\x20-\x7E]" file
Altogether, I would say:
grep -qP "[^\x20-\x7E]" file && echo "weird ASCII" || echo "clean one"
This can be done portably using only POSIX grep options:
if LC_ALL=C grep -q '[^ -~]' file; then
    echo "file contains non-ascii characters"
else
    echo "file contains ascii characters only"
fi
where the characters in [ ... ] are ^ (caret), space, - (ASCII minus sign), ~ (tilde).
You could also allow an ASCII tab. The standard refers to these characters as collating elements. It seems that both the \x (hexadecimal) and \0 (octal) forms are shown in the standard's description of bracket expressions (see 7.4.1), so you could use \x09 or \011 for a literal tab.
According to the description, -e accepts a basic regular expression (BRE) by default. If you added -E, you would get an extended regular expression instead (but that is not needed here).
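Putting that together, here is a portable sketch (POSIX shell and grep assumed) that accepts printable ASCII plus literal tabs and rejects everything else:
tab=$(printf '\t')
if LC_ALL=C grep -q "[^${tab} -~]" file; then
    echo "file contains characters outside printable ASCII"
else
    echo "file contains printable ascii characters only"
fi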
I am analyzing a collection of large (>150 MB) fixed-width data files. I've been slowly reading them in using read.fwf() in 100-line chunks (each row is 7385 characters), then pushing them into a relational database for further manipulation. The problem is that the text files occasionally have a wonky multibyte character, often enough to be annoying: for example, instead of a "U", the data file has whatever the system assigns to Unicode U+F8FF (in OS X that's an Apple symbol, but I'm not sure whether that is a cross-platform standard). When that happens, I get an error like this:
invalid multibyte string at 'NTY <20> MAINE
000008 [...]
That should have been the latter part of the word "COUNTY", but the U was, as described above, wonky. (Happy to provide more detailed code & data if anyone thinks they would be useful.)
I'd like to do all the coding in R, and I'm just not sure how to coerce the data down to single bytes. Hence the subject line of my question: is there some easy way to coerce single-byte ASCII out of a text file that has some erroneous multibyte characters in it?
Or maybe there's an even better way to deal with this (should I be calling grep at the system level from R to hunt out the erroneous multi-byte characters)?
Any help much appreciated!
What does the output of the file command say about your data file?
/tmp >file a.txt b.txt
a.txt: UTF-8 Unicode text, with LF, NEL line terminators
b.txt: ASCII text, with LF, NEL line terminators
You can try to convert/transliterate the file's contents using iconv. For example, given a file that uses the Windows 1252 encoding:
# \x{93} and \x{94} are Windows 1252 quotes
/tmp >perl -E'say "He said, \x{93}hello!\x{94}"' > a.txt
/tmp >file a.txt
a.txt: Non-ISO extended-ASCII text
/tmp >cat a.txt
He said, ?hello!?
Now, with iconv you can try to convert it to ascii:
/tmp >iconv -f windows-1252 -t ascii a.txt
He said,
iconv: a.txt:1:9: cannot convert
Since there is no direct ASCII equivalent for those characters, the conversion fails. Instead, you can tell iconv to do a transliteration:
/tmp >iconv -f windows-1252 -t ascii//TRANSLIT a.txt > converted.txt
/tmp >file converted.txt
converted.txt: ASCII text
/tmp >cat converted.txt
He said, "hello!"
There might be a way to do this using R's IO layer, but I don't know R.
Hope that helps.
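If pre-cleaning the files outside R is acceptable, the same iconv idea can be applied to the whole collection before read.fwf() ever sees it. A sketch only, assuming the files are UTF-8 apart from the stray characters and that *.dat is the right glob (both are guesses on my part):
for f in *.dat; do
    # -c keeps a malformed byte sequence from aborting the whole file;
    # //TRANSLIT should fall back to "?" for characters with no ASCII equivalent,
    # so each stray character stays one column wide, which matters for fixed-width data
    iconv -c -f utf-8 -t ascii//TRANSLIT "$f" > "${f%.dat}_ascii.dat"
done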
I am running the IBAMR model (a set of codes for solving immersed boundary problems) on x86_64 GNU/Linux.
The startup configuration file is called input2d.
When I open it with vi, I find:
"input2d" [noeol][dos] 251L, 11689C
If I compile the IBAMR model without saving input2d, it compiles and runs fine.
However, if I save input2d, the compiler crashes, saying:
Warning in input2d at line 251 column 5 : Illegal character token in input
Clearly this has something to do with vi adding a newline to the end of the file when it saves.
Here's my question:
How do I save this file in dos format AND without a trailing newline in vi on a unix system?
Use vim -b <file> or :set binary to tell vim not to append a newline at the end of the file. From :help binary:
When writing a file the <EOL> for the last line is only written if
there was one in the original file (normally Vim appends an <EOL> to
the last line if there is none; this would make the file longer). See
the 'endofline' option.
There's a script for this that I found on Vim Tips:
http://vim.wikia.com/wiki/Preserve_missing_end-of-line_at_end_of_text_files
It automatically enables "binary" if there was no eol, but ensures the original line-endings are preserved for the rest of the file.
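If you want to confirm the result after saving, a quick look at the last byte is enough (a minimal check, assuming tail and od are available):
tail -c 1 input2d | od -c
A file saved without a trailing newline should end in something other than \n.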
I am using PowerShell on Windows to replace a '£' with a '$' in a file generated on Unix. The problem is that the output file has CRLF at the end of each line rather than the LF it originally had. When I look at the original file in Notepad++ the status bar tells me it is Unix ANSI; I want to keep this format and have LF at the end of each line.
I have tried all the encoding options with no success, and I have also tried Set-Content instead of Out-File. My code is:
$old = '£'
$new = '$'
$encoding = 'UTF8'
(Get-Content $fileInfo.FullName) | % {$_ -replace $old, $new} | Out-File -filepath $fileInfo.FullName -Encoding $encoding
Thanks for any help
Jamie
@Keith Hill made a cmdlet for this, ConvertTo-UnixLineEnding; you can find it in the PowerShell Community Extensions.
I realise that this is a very old question now but I stumbled across it when I encountered a similar issue and thought I would share what worked for me. This may help other coders in future without the need for a third party cmdlet.
When reading in the Unix-format file (that is, one with LF line terminators rather than the CRLF Windows-style line terminators), simply add the -Raw parameter after the filename in your Get-Content command, then output with an encoding type of STRING. UTF8 encoding may give the same result, but STRING worked for my requirements.
My specific command that I had issues with was reading in a template file, replacing some variables, then outputting to a new file. The original template is Unix style, but the output was coming out Windows style until I added the -Raw parameter, as follows. Note that this is a PowerShell command called from a batch file, hence its formatting.
powershell -Command "get-content master.template -Raw | %%{$_ -replace \"#MASTERIP#\",\"%MASTERIP%\"} | %%{$_ -replace \"#SLAVEIP#\",\"%SLAVEIP%\"} | set-content %MYFILENAME%-%MASTERIP%.cfg -Encoding STRING"
I use Notepad++ to check the formatting of the resulting file and this did the trick in my case.
I have a text file containing unwanted null characters (ASCII NUL, \0). When I try to view it in vi I see ^# symbols, interleaved in normal text. How can I:
Identify which lines in the file contain null characters? I have tried grepping for \0 and \x0, but this did not work.
Remove the null characters? Running strings on the file cleaned it up, but I'm just wondering if this is the best way?
I’d use tr:
tr < file-with-nulls -d '\000' > file-without-nulls
If you are wondering whether input redirection in the middle of the command arguments works: it does. Most shells will recognize and handle I/O redirection (<, >, …) anywhere in the command line.
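For readability you can of course put the redirections at the end instead; the two forms are equivalent:
tr -d '\000' < file-with-nulls > file-without-nulls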
Use the following sed command for removing the null characters in a file.
sed -i 's/\x0//g' null.txt
This solution edits the file in place, which is important if the file is still being used. Passing -i'ext' creates a backup of the original file with the 'ext' suffix added.
A large number of unwanted NUL characters, say one every other byte, indicates that the file is encoded in UTF-16 and that you should use iconv to convert it to UTF-8.
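In that case, converting the whole file is the right fix rather than stripping bytes; a sketch, assuming the file really is UTF-16 (the byte order is taken from its BOM, or defaults to big-endian without one):
iconv -f UTF-16 -t UTF-8 file-with-nulls > file-without-nulls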
I discovered the following, which prints out which lines, if any, have null characters:
perl -ne '/\000/ and print;' file-with-nulls
Also, an octal dump can tell you if there are nulls; use -c so that a NUL byte always shows up as \0:
od -c file-with-nulls | grep '\\0'
If the lines in the file end with \r\n\000 then what works is to delete the \n\000 then replace the \r with \n.
tr -d '\n\000' <infile | tr '\r' '\n' >outfile
Here is example how to remove NULL characters using ex (in-place):
ex -s +"%s/\%x00//g" -cwq nulls.txt
and for multiple files:
ex -s +'bufdo!%s/\%x00//g' -cxa *.txt
For recursion, you may use the globbing pattern **/*.txt (if it is supported by your shell).
This is useful for scripting, since sed's -i parameter is a non-standard extension whose syntax differs between implementations.
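For example, with bash's globstar option the same ex command can be run over a whole tree (a sketch, assuming bash 4 or later):
shopt -s globstar
ex -s +'bufdo!%s/\%x00//g' -cxa **/*.txt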
See also: How to check if the file is a binary file and read all the files which are not?
I used:
recode UTF-16..UTF-8 <filename>
to get rid of zeroes in file.
I faced the same error with:
import codecs as cd
f=cd.open(filePath,'r','ISO-8859-1')
I solved the problem by changing the encoding to utf-16
f=cd.open(filePath,'r','utf-16')
Remove the trailing null character at the end of a PDF file using PHP. This is independent of the OS.
This script uses PHP to remove a trailing NULL value at the end of a binary file, solving a crashing issue that was triggered by that NULL value. You can edit this script to remove all NULL characters, but seeing it done once will help you understand how it works.
Backstory
We were receiving PDFs from a third party that we needed to upload to our system using a PDF library. In the files being sent to us, a null value was sometimes being appended to the PDF file. When our system processed these files, the ones with the trailing NULL value caused the system to crash.
Originally we were using sed, but sed behaves differently on Macs and Linux machines. We needed a platform-independent method to strip the trailing null value, and PHP was the best option. Also, it was a PHP application, so it made sense :)
This script performs the following operations:
Take the binary file and convert it to hex (binary files don't like being exploded by newlines or carriage returns), explode the string using the CRLF hex sequence as the delimiter, pop the last member of the array if its value is null, implode the array again with the same delimiter, and then process the file.
// In this case we are getting the file as a string from another application.
// We use this line to get a sample bad file.
$fd = file_get_contents($filename);
// We trim leading and trailing whitespace and convert the string into hex
$bin2hex = trim(bin2hex($fd));
// We create an array using the CRLF hex sequence (0d0a) as the delimiter
$bin2hex_ex = explode('0d0a', $bin2hex);
// Look at the last element; if it is equal to 00 we pop it off
$end = end($bin2hex_ex);
if($end === '00') {
array_pop($bin2hex_ex);
}
// We implode the array using the CRLF hex sequence (0d0a) as the glue
$bin2hex = implode('0d0a', $bin2hex_ex);
// The new string no longer has the null character at the EOF
$fd = hex2bin($bin2hex);
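For comparison, the same "check the last byte and drop it" idea can be done in shell where GNU coreutils are available (a sketch only; file.pdf stands in for the real file name):
# if the last byte of the PDF is a NUL, shrink the file by one byte
if [ "$(tail -c 1 file.pdf | od -An -tx1 | tr -d ' ')" = "00" ]; then
    truncate -s -1 file.pdf
fi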