Remove ^# Characters in a Unix File

I have a question about removing invisible characters that can only be seen when viewing the file with the "vi" command. We have a file generated by a DataStage application (the source is a DB2 table, the target is a .txt file). The file has data of different data types. I'm having an issue with just three columns whose data types are defined as CHAR.
If you open the file in TextPad you see spaces, but when you view the same file on Unix via vi, you see ^# characters in blue. My file is a delimited file with ^#^ as the delimiter (I know it sounds kind of weird).
I have tried:
tr -d '[:cntrl:]' <Filename >NewFileName — still no luck (the delimiters are removed, but the spaces remain).
tr -s "^#" <Filename >NewFilename — still no luck; the file shrinks, but the invisible characters stay.
Changing the delimiter — I still see the same invisible characters.
sed "s/^#//g" <Filename (and other sed commands) — still no luck.
Any suggestions are really appreciated. I have searched the posts on this site but couldn't find one that covers this. If it's a simple question, please excuse me and share your thoughts.

In vi, NUL characters are represented as ^#. Using tr, you should be able to remove them as follows:
tr -d '\000' < file-name > new-file-name
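As a quick sanity check (a sketch, assuming od is available), dump the new file and confirm that no NUL bytes remain:
# od -c renders NUL bytes as \0, so no output from grep means the copy is clean
od -c new-file-name | grep '\\0'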

Open the file with vim, type : (without the quote), and enter this substitution, pressing the actual key combination where the key names appear:
%s/Ctrl+2//g (on regular PC keyboards)
%s/Ctrl+Shift+2//g (on Mac keyboards)
Of course, replace the key names with the keystrokes from your keyboard; Ctrl+2 (Ctrl+@) produces a literal NUL, and you may need to press Ctrl+V first so vim inserts it literally.
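If typing the control sequence is fiddly, vim also accepts a hex escape in the search pattern (the same form used in the ex answers further down):
:%s/\%x00//g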

Related

How to remove stray ^M and <96> from a text file before import into SAS

I receive a pipe-delimited text file from a user who populates an Excel spreadsheet using screen scrapes, so the data is a mess. It is full of random ^M (carriage returns) and <96> (Windows en dash) characters throughout, which causes the import to be incomplete.
I tried dos2unix and got an error that there was a problem with the conversion. I removed all the ^M characters using this solution I found on this site:
tr -d '\r' < infile > outfile
The <96> characters remain. What would be the equivalent of '\r' for these dashes? Or perhaps there is a better solution? I would actually like to replace the "bad" dashes with "good" dashes if possible.
Why not just clean up the file using SAS instead? If your lines are shorter than 32,767 characters, it is simple:
data _null_;
  infile 'input-file' termstr=LF; /* read each raw line */
  file 'output-file' termstr=LF;  /* write the cleaned line back out */
  input;
  /* compress() drops the carriage returns ('0D'x); translate() turns '96'x into '-' */
  _infile_ = translate(compress(_infile_, '0D'x), '-', '96'x);
  put _infile_;
run;
If the lines are longer, you can read the data field by field and fix it that way instead.
You can find the octal value of the character with od file.txt (od -c shows it in context) and remove it with tr, just as you did with the ^M characters.
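For example, <96> is the byte 0x96, which is octal 226, so a sketch of the substitution asked for above would be:
# translate every 0x96 byte (octal 226) into a plain ASCII hyphen
tr '\226' '-' < infile > outfile
# or delete those bytes outright
tr -d '\226' < infile > outfile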

Special character removal with 'sed'

I'm facing an issue where I'm getting some special characters at the beginning of my file; a snippet of it is below:
^#<9b>200931350515,test1,910,420032400825443
^#<9a>200931350515,test1,910,420032400825443
^#<9d>200931746996,test2,910,420031390086807
I'm using the following command to remove anything other than numbers at the start of the first column:
sed 's/^[^0-9]*//g' file.dat
No success with that. By the way, the file is created during a FastExport from Teradata; the process adds some special characters by itself during the extract.
Any idea on the command?
If you want to remove NUL bytes and any non-ASCII characters anywhere in a line, you can use tr.
tr -d '\000\200-\377' <file >file.new
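If you want to preview what that command would strip before running it, one sketch is to invert it, keeping only those bytes, and dump them:
# -c complements the set, so only NUL and bytes \200-\377 survive the delete
tr -cd '\000\200-\377' < file | od -c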
Using perl
perl -lne 'print /\d+,.*/g'
200931350515,test1,910,420032400825443
200931350515,test1,910,420032400825443
200931746996,test2,910,420031390086807
The pattern matches the run of digits, the first comma, and everything after it, so the leading junk bytes are dropped.
sed is too big a gun for such a small problem;
use cut to remove the beginning of each line:
cut -b 2- file.dat
Here 2- is the range of bytes you want to retain. I'm not sure how many of those strange characters you have, so experiment with 1-, 2-, 3-, 4-, 5-, etc.
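To count the junk bytes before picking a range, a quick sketch is to dump the first line byte by byte:
# od -c shows each byte; count the leading non-digit bytes, then pick the cut range
head -1 file.dat | od -c | head -2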
It looks like the number of characters to be removed is constant across all lines. To remove a fixed number of characters from the beginning of each line, you could simply do
$ sed 's/^.....//' input >output
Adjust the number of dots to fit your need.

How can I see which characters have been removed from a file after running tr -cd

To remove any non-ascii characters from a file, I tried
tr -cd '\11\12\15\40-\176' < original.csv > clean-copy.csv
I'd like to see the specific characters that were removed from the file; is there any way to print them out? The only thing I can think of is
diff original.csv clean-copy.csv, but I don't think that is sufficient.
tr -d '\11\12\15\40-\176' < original.csv
will give you all the characters you deleted (the same as your original command, but without the complement flag, -c).
I suppose you probably wanted them printed out in some more readable format, though; you could try piping that through hd.
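For example (hd is common on Linux; od -c works where it isn't available):
# delete the printable set instead, leaving exactly the bytes -cd would remove,
# then dump them in readable hex
tr -d '\11\12\15\40-\176' < original.csv | hd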
If you need byte offsets, that's a different question.

Identifying and removing null characters in UNIX

I have a text file containing unwanted null characters (ASCII NUL, \0). When I try to view it in vi I see ^# symbols, interleaved in normal text. How can I:
Identify which lines in the file contain null characters? I have tried grepping for \0 and \x0, but this did not work.
Remove the null characters? Running strings on the file cleaned it up, but I'm just wondering if this is the best way?
I’d use tr:
tr < file-with-nulls -d '\000' > file-without-nulls
If you are wondering whether input redirection in the middle of the command arguments works: it does. In fact, most shells recognize and handle I/O redirections (<, >, …) anywhere in the command line.
Use the following sed command to remove the null characters from a file.
sed -i 's/\x0//g' null.txt
This solution edits the file in place, which is important if the file is still being used. Passing -i'ext' creates a backup of the original file with the 'ext' suffix added.
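For example, to keep a backup of the original while editing (a sketch; BSD sed wants the suffix as a separate argument, -i '.bak'):
# GNU sed: edit in place, saving the original as null.txt.bak
sed -i.bak 's/\x0//g' null.txt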
A large number of unwanted NUL characters, say one every other byte, indicates that the file is encoded in UTF-16 and that you should use iconv to convert it to UTF-8.
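A sketch of that conversion (assuming the file really is UTF-16; iconv honors a BOM if one is present):
iconv -f UTF-16 -t UTF-8 file-with-nulls > file-without-nulls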
I discovered the following, which prints out which lines, if any, have null characters:
perl -ne '/\000/ and print;' file-with-nulls
Also, an octal dump can tell you if there are nulls:
od file-with-nulls | grep ' 000'
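With GNU grep you can also search for the byte directly (a sketch; -P is GNU-specific):
# -a treats the file as text, -P enables the \x00 escape, -n prints line numbers
grep -naP '\x00' file-with-nulls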
If the lines in the file end with \r\n\000, then what works is to delete the \n\000 and then replace the \r with \n:
tr -d '\n\000' <infile | tr '\r' '\n' >outfile
Here is an example of how to remove NUL characters using ex (in place):
ex -s +"%s/\%x00//g" -cwq nulls.txt
and for multiple files:
ex -s +'bufdo!%s/\%x00//g' -cxa *.txt
For recursion, you may use the globbing pattern **/*.txt (if it is supported by your shell).
Useful for scripting, since sed's -i parameter is a non-standard extension whose syntax differs between GNU and BSD sed.
See also: How to check if the file is a binary file and read all the files which are not?
I used:
recode UTF-16..UTF-8 <filename>
to get rid of the zero bytes in the file.
I faced the same error with:
import codecs as cd
f = cd.open(filePath, 'r', 'ISO-8859-1')
I solved the problem by changing the encoding to utf-16:
f = cd.open(filePath, 'r', 'utf-16')
Remove a trailing null character at the end of a PDF file using PHP. This is OS-independent.
This script uses PHP to remove a trailing NULL value at the end of a binary file, solving a crashing issue that was triggered by the NULL value. You could edit this script to remove all NULL characters, but seeing it done once should help you understand how it works.
Backstory
We were receiving PDFs from a third party that we needed to upload to our system using a PDF library. A null value was sometimes appended to the PDF files sent to us, and when our system processed those files, the trailing NULL value caused it to crash.
Originally we used sed, but sed behaves differently on Macs and Linux machines. We needed a platform-independent method to strip the trailing null value, and PHP was the best option. Also, it was a PHP application, so it made sense :)
This script performs the following operations:
Take the binary file and convert it to hex (binary data doesn't explode well on newlines or carriage returns), explode the string using the carriage return/line feed pair as the delimiter, pop the last member of the array if its value is null, implode the array using the same delimiter, and then process the file.
//In this case we get the file as a string from another application.
//We use this line to load a sample bad file.
$fd = file_get_contents($filename);
//Convert the binary string to hex and trim any stray whitespace.
$bin2hex = trim(bin2hex($fd));
//Split the hex string into an array, using CRLF ('0d0a') as the delimiter.
$bin2hex_ex = explode('0d0a', $bin2hex);
//Look at the last element; if it is exactly '00' (a NUL byte), pop it off.
$end = end($bin2hex_ex);
if ($end === '00') {
    array_pop($bin2hex_ex);
}
//Re-join the array using CRLF as the glue.
$bin2hex = implode('0d0a', $bin2hex_ex);
//The new string no longer has the NUL character at the end of the file.
$fd = hex2bin($bin2hex);

Simple shell script in Cygwin

#!/bin/bash
echo 'first line' >foo.xml
echo 'second line' >>foo.xml
I am a total newbie to shell scripting.
I am trying to run the above script in cygwin. I want to be able to write one line after the other to a new file.
However, when I execute the above script, I see the following contents in foo.xml:
second line
The second time I run the script, I see in foo.xml:
second line
second line
and so on.
Also, I see the following error displayed at the command prompt after running the script:
: No such file or directory.xml
I will eventually run this script on a Unix box; I am just developing it in Cygwin. So I would appreciate it if you could point out whether this is a Cygwin oddity, and if so, whether I should avoid using Cygwin to develop such scripts.
Thanks in advance.
Run dos2unix on your shell script. That will fix the problem.
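To confirm that CRLF line endings are the culprit before converting (a sketch, assuming the script is saved as foo.sh; the name is hypothetical):
# cat -v makes carriage returns visible as ^M at the end of each line
cat -v foo.sh
# then convert the line endings in place
dos2unix foo.sh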
I had the same kind of problem as the original poster: A very simple script file was not working in Cygwin.
Thanks to Don Branson for the clue.
The fix for me was built into the text editor I'm using. (Most programmer's editors have a feature like this.) For example, in my case I'm using Notepad++, which has a menu item to convert the file line endings to Unix-style. From the menu: [Edit]->[EOL Conversion]->[Unix (LF)]
Then the script behaved as expected.
But there must be something else that is wrong here. When I try it, it works as expected.
> foo.xml puts the line into foo.xml, replacing any previous contents.
>> foo.xml appends to the file.
It sounds like you may have a typo somewhere. Also keep in mind that while the Windows command prompt can be forgiving about paths with embedded spaces, Cygwin's shells will not be, so if you have a filename that contains embedded spaces, you need to either quote the filename or escape the spaces:
echo 'first line' > 'My File.txt'
echo 'first line' > My\ File.txt
The same goes for certain "special" characters, including quotes, ampersands (&), semicolons (;), and generally most punctuation other than the period/full stop (.).
So if you are seeing these issues with the exact script you are running (i.e., you copied and pasted it, so there is no possibility of transcription errors), then something truly strange may be happening that I can't explain. Otherwise, there may be a misplaced space or an unquoted character somewhere.
I cannot reproduce your results. The script you quote looks correct and indeed works as expected in my installation of Cygwin, producing the file foo.xml containing the lines "first line" and "second line". This implies that what you are actually running differs from what you quoted in some way that causes the problem.
The error message implies some sort of problem with the filename in the first echo line. Do you have nonprintable characters in the script you are running? Have you missed escaping a space in the filename? Are you substituting shell variables and mistyping the name of a variable, or failing to escape the resulting string?
The above should work normally.
However, you can always use a heredoc instead:
#!/bin/bash
cat <<EOF > foo.xml
first line
second line
EOF
