Identifying and removing null characters in UNIX

I have a text file containing unwanted null characters (ASCII NUL, \0). When I try to view it in vi I see ^@ symbols interleaved with the normal text. How can I:
Identify which lines in the file contain null characters? I have tried grepping for \0 and \x0, but this did not work.
Remove the null characters? Running strings on the file cleaned it up, but I'm wondering whether this is the best way.

I’d use tr:
tr < file-with-nulls -d '\000' > file-without-nulls
If you are wondering if input redirection in the middle of the command arguments works, it does. Most shells will recognize and deal with I/O redirection (<, >, …) anywhere in the command line, actually.

Use the following sed command to remove the null characters from a file:
sed -i 's/\x0//g' null.txt
This solution edits the file in place, which is important if the file is still being used. Passing -i'ext' creates a backup of the original file with the 'ext' suffix added.

A large number of unwanted NUL characters, say one every other byte, indicates that the file is encoded in UTF-16 and that you should use iconv to convert it to UTF-8.
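As a rough sketch of that idea in Python: guess UTF-16 from a byte-order mark or from a high share of NUL bytes, then re-encode as UTF-8. The function names and the 40% threshold are my own assumptions, not a standard test.

```python
import codecs


def looks_like_utf16(data: bytes) -> bool:
    """Guess UTF-16 from a BOM or from a high proportion of NUL bytes."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return True
    if len(data) < 2:
        return False
    # Assumption: mostly-Latin UTF-16 text has a NUL in about every other byte.
    return data.count(b"\x00") >= 0.4 * len(data)


def utf16_to_utf8(data: bytes) -> bytes:
    """Decode UTF-16 (the BOM, if present, selects the byte order) and re-encode."""
    return data.decode("utf-16").encode("utf-8")
```

Without a BOM, Python's "utf-16" codec falls back to the platform's native byte order, so name "utf-16-le" or "utf-16-be" explicitly if you know which one the file uses.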

I discovered the following, which prints out which lines, if any, have null characters:
perl -ne '/\000/ and print;' file-with-nulls
Also, a dump of the file can tell you if there are nulls. With -c, od prints each NUL byte as \0:
od -c file-with-nulls | grep -F '\0'
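If Python is available, both halves of the question (find the lines, then strip the NULs) can be sketched in a few lines. The function names are illustrative, and the data is handled as bytes so that no encoding has to be guessed.

```python
def lines_with_nul(data: bytes):
    """Return the 1-based numbers of the lines that contain a NUL byte."""
    return [
        number
        for number, line in enumerate(data.split(b"\n"), start=1)
        if b"\x00" in line
    ]


def strip_nul(data: bytes) -> bytes:
    """Return a copy of the data with every NUL byte removed."""
    return data.replace(b"\x00", b"")


sample = b"first\nsec\x00ond\nthird\x00\n"
print(lines_with_nul(sample))  # [2, 3]
print(strip_nul(sample))       # b'first\nsecond\nthird\n'
```

For a real file, read it with open(path, "rb").read() and write the result back in binary mode.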

If the lines in the file end with \r\n\000 then what works is to delete the \n\000 then replace the \r with \n.
tr -d '\n\000' <infile | tr '\r' '\n' >outfile

Here is an example of how to remove NUL characters in place using ex:
ex -s +"%s/\%x00//g" -cwq nulls.txt
and for multiple files:
ex -s +'bufdo!%s/\%x00//g' -cxa *.txt
To recurse, you can use the globbing pattern **/*.txt (if your shell supports it).
This is useful for scripting, since sed's -i option is a non-standard extension that behaves differently in GNU and BSD sed.
See also: How to check if the file is a binary file and read all the files which are not?

I used:
recode UTF-16..UTF-8 <filename>
to get rid of the zeroes in the file.

I ran into the same problem when opening the file with the wrong encoding:
import codecs as cd
f = cd.open(filePath, 'r', 'ISO-8859-1')
I solved it by changing the encoding to utf-16:
f = cd.open(filePath, 'r', 'utf-16')

Remove a trailing null character at the end of a PDF file using PHP. This is independent of the OS.
This script uses PHP to remove a trailing NULL value at the end of a binary file, solving a crash that was triggered by the NULL value. You can edit this script to remove all NULL characters, but seeing it done once will help you understand how this works.
Backstory
We were receiving PDFs from a third party that we needed to upload to our system using a PDF library. In the files being sent to us, a null value was sometimes appended to the PDF file. When our system processed these files, those with the trailing NULL value caused the system to crash.
Originally we were using sed, but sed behaves differently on Macs and Linux machines. We needed a platform-independent method to remove the trailing null value. PHP was the best option. Also, it was a PHP application, so it made sense :)
This script performs the following operation:
Take the binary file, convert it to hex (binary files don't like being exploded on newlines or carriage returns), explode the string using CRLF (hex 0d0a) as the delimiter, pop the last member of the array if its value is a null byte, implode the array using CRLF as the glue, then process the file.
// In this case we are getting the file as a string from another application.
// We use this line to get a sample bad file.
$fd = file_get_contents($filename);
// Convert the string to hex (trim() is kept from the original; hex output has no whitespace)
$bin2hex = trim(bin2hex($fd));
// Create an array using CRLF (0d0a) as the delimiter
$bin2hex_ex = explode('0d0a', $bin2hex);
// Look at the last element. If it is equal to 00, pop it off
$end = end($bin2hex_ex);
if ($end === '00') {
    array_pop($bin2hex_ex);
}
// Implode the array using CRLF (0d0a) as the glue
$bin2hex = implode('0d0a', $bin2hex_ex);
// The new string no longer has the null character at the EOF
$fd = hex2bin($bin2hex);

Related

How can I delete the last comma in each record of a comma-delimited csv?

Example Input : A,B,"C,D",E,F,G,
Example Output : A,B,"C,D",E,F,G
The issue I face with using the 'cut' command to accomplish this is that my data contains commas as well.
I want to do this in an automated process, so Linux commands would be helpful.

This should work:
sed 's/,$//g' < input_file.csv > output_file.csv
,$ is a regular expression that matches a comma at the end of each line. The s command replaces it with nothing. The g flag is harmless but unnecessary, since $ can match only once per line.
Proof:
$ echo 'A,B,"C,D",E,F,G,' | sed 's/,$//g'
A,B,"C,D",E,F,G
Note that some CSV dialects can also have line endings inside double quotes. If there happens to be a comma right before such a quoted line ending, that comma will also be stripped. If you want to handle this case correctly, you'll need a proper CSV parser.
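For the "proper CSV parser" route, here is a minimal sketch using Python's csv module. The function name is made up for illustration, and it assumes the trailing comma always shows up as an empty last field.

```python
import csv
import io


def drop_trailing_empty_field(text: str) -> str:
    """Remove the empty last field that a trailing comma produces in each record."""
    source = io.StringIO(text, newline="")
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(source):
        # A record ending in "," parses with an empty string as its last field.
        if row and row[-1] == "":
            row = row[:-1]
        writer.writerow(row)
    return out.getvalue()


print(drop_trailing_empty_field('A,B,"C,D",E,F,G,\n'), end="")  # A,B,"C,D",E,F,G
```

One caveat: the writer re-quotes fields with its own minimal-quoting rules, so quotes that were not strictly needed in the input will not survive the round trip.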

sed usage not able to understand

I have come across the following use of the Unix sed command and am not able to understand what it does. Could you please help me understand it? If possible, please share some references for understanding such uses of sed.
sed -i '/^export JAVA_HOME/ s:.*:export JAVA_HOME=/usr/java/default\nexport HADOOP_PREFIX=/usr/local/hadoop\nexport HADOOP_HOME=/usr/local/hadoop\n:' $HADOOP_PREFIX/etc/hadoop/hadoop-env.sh

The command is simple, though it assumes GNU sed because of the way it uses the -i option; for macOS Sierra and related systems, you'd need to use -i '' in place of just -i.
Overall, it corresponds to:
sed -i '/Pattern/ s:.*:Replacement:' file
where:
-i means overwrite each input file with its edited output without creating a backup copy.
/Pattern/ is ^export JAVA_HOME; a line starting with the word export and then JAVA_HOME separated by a single space.
s:.*:Replacement: is a substitute command, using : instead of the more conventional / (often s/.*/Replacement/) as the pattern delimiter. This is done because the replacement text contains slashes. The .* matches the whole line. The rest of the material is written in place of the original export JAVA_HOME line. The \n sequence expands to a newline, so it actually produces a number of lines in the output.
file is $HADOOP_PREFIX/etc/hadoop/hadoop-env.sh

As others have pointed out, this is a sed command invocation. The name is short for "Stream EDitor", and the command is quite useful for modifying files programmatically. Your best bet is to read the man pages (man sed), but I've broken down your particular command here for instructive purposes:
sed # The command
-i # Edit file in place (no backup)
'/^export JAVA_HOME/ # For every line that begins with 'export JAVA_HOME'...
s: # substitute...
.*: # the entire line with...
export JAVA_HOME=/usr/java/default
export HADOOP_PREFIX=/usr/local/hadoop
export HADOOP_HOME=/usr/local/hadoop
:' # End of command
$HADOOP_PREFIX/etc/hadoop/hadoop-env.sh # Run on the following file
Points of interest:
Commands can be limited to a particular address range or scope. Here, the scope was a search.
The substitute command can be delimited by almost any character (usually it is /, but in this case : was chosen to avoid having to escape the / in the file paths).
The sed expression was enclosed in ' to prevent shell expansion of variables. Although no expansions would have taken place in this scenario, it is fairly common to see the expression wrapped in ' to eliminate the possibility.
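To make the address-plus-substitute behaviour concrete, here is roughly the same logic sketched in Python. The names rewrite_java_home and REPLACEMENT are hypothetical, and the function works on a string instead of editing the file in place.

```python
import re

# The replacement text from the sed command, including its trailing "\n".
REPLACEMENT = (
    "export JAVA_HOME=/usr/java/default\n"
    "export HADOOP_PREFIX=/usr/local/hadoop\n"
    "export HADOOP_HOME=/usr/local/hadoop\n"
)


def rewrite_java_home(text: str) -> str:
    """Replace every line starting with 'export JAVA_HOME' by REPLACEMENT."""
    out = []
    for line in text.splitlines():
        # re.match anchors at the start of the line, like /^export JAVA_HOME/.
        if re.match(r"export JAVA_HOME", line):
            # s:.*:Replacement: swaps the whole line for the replacement text.
            line = REPLACEMENT
        out.append(line)
    # Like sed, end every output line with a newline.
    return "\n".join(out) + "\n"
```

Because the replacement itself ends in \n, the result contains an empty line after the three exports, which is also what the sed command produces.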

How to replace string by an escape character plus string in unix

How can I convert a one line like below:
794170|VWSD|AAA|e|h|i|j|STRING1|794170|VWSD|BBB|q|w|e|r|STRING2|794170|VWSD|CCC|z|x|c|v|STRING3|...and so on
to a linefeed-delimited form?
Expected Output:
794170|VWSD|AAA|e|h|i|j|STRING1|
794170|VWSD|BBB|q|w|e|r|STRING2|
794170|VWSD|CCC|z|x|c|v|STRING3|
and so on.
BTW I'm not a Unix expert and just want steps or simple commands to resolve this. I appreciate your help.

I assume you have your string in a file named "x"; then you can do this.
I use the character ":" as a placeholder that sed inserts before each record. Choose something else if ":" occurs in your string. Then tr changes each ":" to a newline. The output is as you desire, except that there is an extra newline at the beginning.
cat x | sed 's/794170/:794170/g' | tr ':' "\n"

You can use the fold command, provided every record is exactly 32 characters wide, as in your sample:
$ fold -w32 file
794170|VWSD|AAA|e|h|i|j|STRING1|
794170|VWSD|BBB|q|w|e|r|STRING2|
794170|VWSD|CCC|z|x|c|v|STRING3|

I don't think you can do it with a simple command. There are several options for creating scripts that can split lines more or less arbitrarily. Any Unix will have the awk utility available. On most systems you will also find Python and Perl. My guess is that a Perl or Python script is the easiest way to split lines like the one you gave.
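This would be one way to do it in Python:
inline = "794170|VWSD|AAA|e|h|i|j|STRING1|794170|VWSD|BBB|q|w|e|r|STRING2|794170|VWSD|CCC|z|x|c|v|STRING3|"
# Split on the record key and re-attach it to each piece
splits = ['794170' + s for s in inline.split('794170')]
# The first piece is the empty text before the first key, so skip it
for s in splits[1:]:
    print(s)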
794170|VWSD|AAA|e|h|i|j|STRING1|
794170|VWSD|BBB|q|w|e|r|STRING2|
794170|VWSD|CCC|z|x|c|v|STRING3|

How to handle non-printable ASCII character parameters?

I'm working on a project where we are dealing with importing/exporting data from database tables using ksh scripts and Perl scripts. We have an existing process to export data from a table to a file and it is then imported into another system.
Here's the catch: the export process dumps out pipe-delimited files, while the system that is doing the import expects files delimited by the ASCII group separator character, which is decimal 29, hexadecimal 1d, or octal 35. It shows up in vi as ^]. Right now, I'm converting the delimiter via a Perl script. What I'd like to do is tell our export process to just use the delimiter we are expecting. Something like:
export_table.ksh -d '\035'
The problem is I can't figure out how to pass this character to the export script.
I've tried all kinds of combinations of single quotes, double quotes, backslashes, and the octal and hex version of this character.
I'm on Solaris 10 using ksh and/or Perl.

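Have you tried ANSI-C quoting? Decimal 29 is hex 1d, so:
$'\x1d'
Note that ^] is how vi displays that single control character; it is not a literal ^ followed by ]. So you just need to do:
export_table.ksh -d $'\x1d'
This needs a shell that understands $'...' (bash or ksh93). If yours does not, "$(printf '\035')" should produce the same character.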

In bash(1), one can prefix a character with ^v to enter that character verbatim. Perhaps ksh(1) does the same?

How do I distinguish between 'binary' and 'text' files?

Informally, most of us understand that there are 'binary' files (object files, images, movies, executables, proprietary document formats, etc) and 'text' files (source code, XML files, HTML files, email, etc).
In general, you need to know the contents of a file to be able to do anything useful with it, and from that point of view it doesn't really matter whether the encoding is 'binary' or 'text'. And of course files just store bytes of data, so they are all 'binary', and 'text' doesn't mean anything without knowing the encoding. And yet it is still useful to talk about 'binary' and 'text' files, but to avoid offending anyone with this imprecise definition, I will continue to use 'scare' quotes.
However, there are various tools that work on a wide range of files, and in practical terms, you want to do something different based on whether the file is 'text' or 'binary'. An example of this is any tool that outputs data on the console. Plain 'text' will look fine, and is useful. 'binary' data messes up your terminal, and is generally not useful to look at. GNU grep at least uses this distinction when determining if it should output matches to the console.
So, the question is, how do you tell if a file is 'text' or 'binary'? And to restrict it further, how do you tell on a Linux-like file system? I am not aware of any filesystem metadata that indicates the 'type' of a file, so the question further becomes: by inspecting the content of a file, how do I tell if it is 'text' or 'binary'? And for simplicity, let's restrict 'text' to mean characters which are printable on the user's console. In particular, how would you implement this? (I thought this was implied on this site, but I should have specified it. It is helpful, in general, to be pointed at existing code that does this, but I'm not really asking which existing programs I can use.)

You can use the file command. It does a bunch of tests on the file (man file) to decide if it's binary or text. You can look at/borrow its source code if you need to do that from C.
file README
README: ASCII English text, with very long lines
file /bin/bash
/bin/bash: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for GNU/Linux 2.2.5, dynamically linked (uses shared libs), stripped

The spreadsheet software my company makes reads a number of binary file formats as well as text files.
We first look at the first few bytes for a magic number which we recognize. If we do not recognize the magic number of any of the binary types we read, then we look at up to the first 2K bytes of the file to see whether it appears to be a UTF-8, UTF-16 or a text file encoded in the current code page of the host operating system. If it passes none of these tests, we assume that it is not a file we can deal with and throw an appropriate exception.
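A much-simplified sketch of that sequence in Python follows. The magic-number table is a tiny illustrative subset, the function name is my own, and the "current code page" step is left out entirely; this is not the actual product code.

```python
import codecs

# Illustrative subset of well-known magic numbers, not an exhaustive table.
MAGIC = {
    b"%PDF-": "pdf",
    b"\x89PNG\r\n\x1a\n": "png",
    b"PK\x03\x04": "zip",
    b"\x7fELF": "elf",
}


def sniff(data: bytes) -> str:
    """Classify data by magic number first, then by a text-encoding check."""
    head = data[:2048]  # only the first 2K bytes are examined
    for magic, name in MAGIC.items():
        if head.startswith(magic):
            return name
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16 text"
    if b"\x00" in head:
        raise ValueError("unrecognized file type")
    try:
        # final=False tolerates a multi-byte character cut off at the 2K mark.
        codecs.getincrementaldecoder("utf-8")().decode(head, final=False)
    except UnicodeDecodeError:
        raise ValueError("unrecognized file type") from None
    return "utf-8 text"
```

Checking the magic numbers first matters: many binary formats start with bytes that would also pass a loose text test.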

You can determine the MIME type of the file with
file --mime FILENAME
The shorthand is file -i on Linux and file -I (capital i) on macOS.
If the type starts with text/, it's text; otherwise it's binary. The only exceptions are XML applications. You can match those by looking for +xml at the end of the file type.

To list text file names in current dir/subdirs:
grep -rIl ''
Binaries:
grep -rIL ''
To check for a particular file:
grep -qI '' FILE
An exit status of 0 then means the file is text; 1 means it is binary.
To check:
echo $?
Key option is this:
-I Process a binary file as if it did not contain matching data;
Other options:
-r, --recursive
Read all files under each directory, recursively;
-l, --files-with-matches
Suppress normal output; instead print the name of each input file from which output would normally have been printed.
-L, --files-without-match
Suppress normal output; instead print the name of each input file from which no output would normally have been printed.
-q, --quiet, --silent
Quiet; do not write anything to standard output. Exit immediately with zero status if any match is found, even if an error was detected.

Perl has a decent heuristic. Use the -B operator to test for binary (and its opposite, -T, to test for text). Here's a shell one-liner to list text files:
$ find . -type f -print0 | perl -0nE 'say if -f and -s _ and -T _'
(Note that those underscores without a preceding dollar are correct (RTFM).)

Well, if you are just inspecting the entire file, see if every character is printable with isprint(c). It gets a little more complicated for Unicode.
To distinguish a unicode text file, MSDN offers some great advice as to what to do.
The gist of it is to first inspect up to the first four bytes:
EF BB BF UTF-8
FF FE UTF-16, little endian
FE FF UTF-16, big endian
FF FE 00 00 UTF-32, little endian
00 00 FE FF UTF-32, big-endian
That will tell you the encoding. Then, you'd want to use iswprint(c) for the rest of the characters in the text file. For UTF-8 and UTF-16, you need to parse the data manually since a single character can be represented by a variable number of bytes. Also, if you're really anal, you'll want to use the locale variant of iswprint if that's available on your platform.
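The BOM table above can be sketched as a lookup in Python, using the constants from the standard codecs module. The function name is an assumption of mine. The one subtlety is the order of the checks: the UTF-32 little-endian BOM begins with the UTF-16 little-endian BOM, so UTF-32 has to be tested first.

```python
import codecs

# UTF-32 comes before UTF-16 because FF FE 00 00 also starts with FF FE.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

A missing BOM proves nothing either way, since plain UTF-8 files usually have none, so None here only means "undetermined".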

It's an old topic, but maybe someone will find this useful.
If you have to decide in a script whether something is a text file, you can simply do this:
if file -i "$1" | grep -q text;
then
.
.
fi
This gets the file type, and with a silent grep you can decide whether it's text.

You can use libmagic, which is a library version of the Unix file command.
There are wrappers for many languages:
Python
.NET
Nodejs
Ruby
Go
Rust

Most programs that try to tell the difference use a heuristic, such as examining the first n bytes of the file and seeing if those bytes all qualify as 'text' or not (i.e., do they all fall within the range of printable ASCII characters). For finer distinction there's always the 'file' command on UNIX-like systems.

One simple check is if it has \0 characters. Text files don't have them.

As previously stated, *nix operating systems have this ability within the file command. This command uses a configuration file that defines magic numbers contained within many popular file structures.
This file, called magic, was historically stored in /etc, although it may be in /usr/share on some distributions. The magic file defines offsets of values known to exist within a file, and the file command examines these locations to determine the type of the file.
The structure and description of the magic file can be found by consulting the relevant manual page (man magic).
As for an implementation, that can be found within file.c itself. The relevant portion of the file command that determines whether a file is readable text or not is the following:
/* Make sure we are dealing with ascii text before looking for tokens */
for (i = 0; i < nbytes - 1; i++) {
    if (!isascii(buf[i]) ||
        (iscntrl(buf[i]) && !isspace(buf[i]) &&
         buf[i] != '\b' && buf[i] != '\032' && buf[i] != '\033'
        )
       )
        return 0; /* not all ASCII */
}
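For comparison, here is an approximate Python port of that loop. It assumes the C locale's definitions of iscntrl and isspace, and unlike the C snippet it checks every byte, including the last one. The names are my own.

```python
# Control bytes the C test lets through: the isspace() set (tab, newline,
# vertical tab, form feed, carriage return) plus backspace, Ctrl-Z and ESC.
ALLOWED_CONTROLS = frozenset(b"\t\n\v\f\r\b\x1a\x1b")


def is_ascii_text(buf: bytes) -> bool:
    """Return True if every byte is ASCII and not a disallowed control character."""
    for byte in buf:
        if byte > 0x7F:  # corresponds to !isascii()
            return False
        is_control = byte < 0x20 or byte == 0x7F  # corresponds to iscntrl()
        if is_control and byte not in ALLOWED_CONTROLS:
            return False
    return True
```

With this test a NUL byte or any UTF-8 multi-byte sequence makes the buffer count as "not text", so it is deliberately stricter than the BOM-based approaches in the other answers.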
