How to get rid of invisible characters after converting - unix

I want to convert a Windows UTF-8 file containing a special apostrophe to a Unix ISO-8859-1 file. This is how I am doing it:
# -- unix file
tr -d '\015' < my_utf8_file.xml > t_my_utf8_file.xml
# -- get rid of special apostrophe
sed "s/’/'/g" t_my_utf8_file.xml > temp_my_utf8_file.xml
# -- change the xml header
sed "s/UTF-8/ISO-8859-1/g" temp_my_utf8_file.xml > my_utf8_file_temp.xml
# -- the actual character set conversion
iconv -c -f UTF-8 -t ISO8859-1 my_utf8_file_temp.xml > my_file.xml
Everything is fine except for one thing in one of my files. It seems there is an invisible character at the beginning of the original file. When I open my_file.xml in Notepad++, I see a SUB at the beginning of the file. In vi on Unix I see ^Z.
What should I add to my Unix script, and where, to delete those kinds of characters?
Thank you

To figure out exactly what character(s) you're dealing with, isolate the line in question (in this case something simple like head -1 <file> should suffice) and pipe the result to od (using the appropriate flag to display the character(s) in the desired format):
head -1 <file> | od -c # view as character
head -1 <file> | od -d # view as decimal
head -1 <file> | od -o # view as octal
head -1 <file> | od -x # view as hex
Once you know the character(s) you're dealing with you can use your favorite command (eg, tr, sed) to remove said character.
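As a hedged illustration of that last step: if od shows the stray byte to be octal 032 (the SUB / ^Z character reported in the question), tr can delete it. The file names and the sample input below are made up purely for the demonstration.

```shell
#!/bin/sh
# Hypothetical sample: a SUB byte (octal 032) in front of an XML header.
tmpdir=$(mktemp -d)
printf '\032<?xml version="1.0"?>\n' > "$tmpdir/my_file.xml"

# Inspect the first line byte by byte; the leading byte is displayed as 032.
head -n 1 "$tmpdir/my_file.xml" | od -c

# Delete every SUB byte and write the result to a new file.
tr -d '\032' < "$tmpdir/my_file.xml" > "$tmpdir/my_file_clean.xml"

cleaned=$(cat "$tmpdir/my_file_clean.xml")
printf '%s\n' "$cleaned"
rm -rf "$tmpdir"
```

Note that tr -d removes the byte everywhere in the file, not only at the beginning, so this is only appropriate when the character is unwanted wherever it occurs.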

Related

Using inverse grep to compare two .txt files

I have two .txt files "test1.txt" and "test2.txt" and I want to use inverse grep (UNIX) to find out all lines in test2.txt that do not contain any of the lines in test1.txt
test1.txt contains only user names, while test2.txt contains longer strings of text. I only want the lines in test2.txt that DO NOT contain the usernames found in test1.txt
Would it be something like?
grep -v test1.txt test2.txt > answer.txt
You were almost there; you just missed one option in your command: -f.
The -f flag makes grep read its patterns from a file. The sample session below demonstrates it.
Demo Session
$ # first file
$ cat a.txt
xxxx yyyy
kkkkkk
zzzzzzzz
$ # second file
$ cat b.txt
line doesnot contain any name
This person is xxxx yyyy good
Another line which doesnot contain any name
Is kkkkkk a good name ?
This name itself is sleeping ...zzzzzzzz
I can't find any other name
Let's try the command now
$ # -i is used to ignore the case while searching
$ # output contains only the lines of the second file that contain none of the lines of the first file
$ grep -v -i -f a.txt b.txt
line doesnot contain any name
Another line which doesnot contain any name
I can't find any other name
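A small variation on the session above, offered as a sketch rather than as part of the original answer: since the first file holds plain user names, adding -F makes grep treat each line as a literal string, so characters such as . or * in a name are not interpreted as regex syntax. The file names and sample lines here are invented for the example.

```shell
#!/bin/sh
tmpdir=$(mktemp -d)
# Hypothetical name list and text file.
printf '%s\n' 'xxxx yyyy' 'kkkkkk' > "$tmpdir/names.txt"
printf '%s\n' 'no name here' 'This person is xxxx yyyy good' \
    'Is KKKKKK a good name ?' > "$tmpdir/text.txt"

# -F  treat every pattern as a fixed string, not a regex
# -f  read the patterns from a file, one per line
# -i  ignore case
# -v  print only the lines that match none of the patterns
result=$(grep -v -i -F -f "$tmpdir/names.txt" "$tmpdir/text.txt")
printf '%s\n' "$result"
rm -rf "$tmpdir"
```

One caveat: an empty line in the pattern file matches every input line, so with -v nothing would be printed; it is worth making sure the name list has no blank lines.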
There are probably better ways to do this, i.e. without grep, but here's a solution which will work:
grep -v -P "($(sed ':a;N;$!ba;s/\n/)|(/g' test1.txt))" test2.txt > answer.txt
To explain this:
$(sed ':a;N;$!ba;s/\n/)|(/g' test1.txt) is an embedded sed command that outputs the contents of test1.txt with each newline replaced by )|(. The output is then inserted into a Perl-style regex (-P) for grep to use, so grep searches test2.txt for every line in test1.txt and, because of the -v option, returns only the lines in test2.txt that don't contain any line from test1.txt.
What flavor of Unix are you using? Knowing that would give us a better understanding of what is available to you on the command line. What you currently have will not work; you're looking for the diff command, which compares two files.
You can do the following on OS X 10.6; I have tested this at home.
diff -i -y FILE1 FILE2
diff compares the files. -i ignores case, if that does not matter to you, so Hi and HI are treated as the same. Finally, -y shows the results side by side. If you want to write the output to a file, you could do diff -i -y FILE1 FILE2 >> /tmp/Results.txt

script to replace all dots in a file with a space but dots used in numbers should not be replaced

How to replace all dots in a file with a space but dots in numbers such as 1.23232 or 4.23232 should not be replaced.
for example
Input:
abc.hello is with cdf.why with 1.9343 and 3.3232 points. What will
Output:
abc_hello is with cdf_why with 1.9343 and 3.3232 points_ What will
$ cat file
abc.hello is with cdf.why with 1.9343 and 3.3232 points. What will
this is 1.234.
here it is ...1.234... a number
.that was a number.
$ sed -e 's/a/aA/g' -e 's/\([[:digit:]]\)\.\([[:digit:]]\)/\1aB\2/g' -e 's/\./_/g' -e 's/aB/./g' -e 's/aA/a/g' file
abc_hello is with cdf_why with 1.9343 and 3.3232 points_ What will
this is 1.234_
here it is ___1.234___ a number
_that was a number_
Try any solution you're considering with that input file as it includes some edge cases (there may be more I haven't included in that file too).
The solution is basically to temporarily convert periods within numbers to some string that cannot exist anywhere else in the file so we can then convert any other periods to underscores and then undo that first temporary conversion.
So first we create a string that can't exist in the file: converting every a to the string aA means that the string aB cannot exist in the file. Then we convert every . within a number to aB, then every remaining . to _, and finally unwind the temporary conversions so each aB returns to . and each aA returns to a:
sed -e 's/a/aA/g' # a -> aA encode #1
-e 's/\([[:digit:]]\)\.\([[:digit:]]\)/\1aB\2/g' # 2.4 -> 2aB4 encode #2
-e 's/\./_/g' # . -> _ convert
-e 's/aB/./g' # 2aB4 -> 2.4 decode #2
-e 's/aA/a/g' # aA -> a decode #1
file
That approach of creating a temporary string that you KNOW can't exist in the file is a common alternative to picking a control character or trying to come up with some string you THINK is highly unlikely to exist in the file when you temporarily need a string that doesn't exist in the file.
I think, that will do what you want:
sed 's/\([^0-9]\)\.\([^0-9]\)/\1_\2/g' filename
This will replace every dot that has a non-digit character on both sides with an underscore (_) (you can exchange the underscore with a space character in the above command to get spaces in the output).
If you want to write the changes back into the file, use sed -i.
Edit:
To cover dots at the beginning or end of the line, or directly before or after a number, the expression becomes a bit uglier:
sed -r 's/(^|[^0-9])\.([^0-9]|$)/\1_\2/g;s/(^|[^0-9])\.([0-9])/\1_\2/g;s/([0-9])\.([^0-9]|$)/\1_\2/g'
or, without -r:
sed 's/\(^\|[^0-9]\)\.\([^0-9]\|$\)/\1_\2/g;s/\(^\|[^0-9]\)\.\([0-9]\)/\1_\2/g;s/\([0-9]\)\.\([^0-9]\|$\)/\1_\2/g'
With gawk:
awk -v RS='[[:space:]]+' '!/^[[:digit:]]+\.[[:digit:]]+$/{gsub("\\.", "_")}; {printf "%s", $0RT}' file.txt
Since you tagged the question with vi, I guess you may have vim too? It would be a very easy task for vim:
:%s/\D\zs\.\ze\D/_/g

Removing Carriage Returns in a column

I have a file in Unix which has an additional Carriage Return character appearing in a particular field, and I want to remove it.
I tried printing the ASCII value for the characters in the field and it appears as follows :
head -1 BVP.csv | cut -d "," -f26 | tr -d "\n" | od -An -t dC
34 78 13
Actual values in the field is: "N[Carriage Return]
So I tried removing the carriage return (ASCII value :13) as follows and tried printing the output to a new file, BVP1.csv:
tr -d '\r' < BVP.csv > BVP1.csv
Then I executed the same command
head -1 BVP1.csv | cut -d "," -f26 | tr -d "\n" | od -An -t dC
34 78
It prints the ASCII values without the Carriage Return.
But when I open the file in any text editor or even from Windows, I can see that the line breaks into a new record in the file, ie, the additional line feed is not removed.
Can anyone please suggest a method to remove this additional Carriage return appearing in the field.
Thanks in Advance,
Tom
I'm willing to bet that there's an LF (ASCII 10) character after the CR (ASCII 13) you're trying to eliminate. You're not seeing it because of your tr -d '\n'. I'm willing to bet you need to remove both the CR and LF from field 26 of your BVP.csv. But, you only want to remove the LFs that appear just after a CR.
You might try this simple perl recipe to strip all CR+LF combos from your csv file, while leaving bare LF intact:
perl -p -e 's/\r\n//g' < BVP.csv > BVP1.csv
This will strip CR+LF while leaving LF alone.
The reason I think this will do what you want is that bare CR without a trailing LF are relatively uncommon. The pre-MacOS X text file formats used CR without LF. Since OS X, those have waned in popularity dramatically.

How to remove blank lines from a Unix file

I need to remove all the blank lines from an input file and write into an output file. Here is my data as below.
11216,33,1032747,64310,1,0,0,1.878,0,0,0,1,1,1.087,5,1,1,18-JAN-13,000603221321
11216,33,1033196,31300,1,0,0,1.5391,0,0,0,1,1,1.054,5,1,1,18-JAN-13,059762153003
11216,33,1033246,31300,1,0,0,1.5391,0,0,0,1,1,1.054,5,1,1,18-JAN-13,000603211032
11216,33,1033280,31118,1,0,0,1.5513,0,0,0,1,1,1.115,5,1,1,18-JAN-13,055111034001
11216,33,1033287,31118,1,0,0,1.5513,0,0,0,1,1,1.115,5,1,1,18-JAN-13,000378689701
11216,33,1033358,31118,1,0,0,1.5513,0,0,0,1,1,1.115,5,1,1,18-JAN-13,000093737301
11216,33,1035476,37340,1,0,0,1.7046,0,0,0,1,1,1.123,5,1,1,18-JAN-13,045802041926
11216,33,1035476,37340,1,0,0,1.7046,0,0,0,1,1,1.123,5,1,1,18-JAN-13,045802041954
11216,33,1035476,37340,1,0,0,1.7046,0,0,0,1,1,1.123,5,1,1,18-JAN-13,045802049326
11216,33,1035476,37340,1,0,0,1.7046,0,0,0,1,1,1.123,5,1,1,18-JAN-13,045802049383
11216,33,1036985,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000093415580
11216,33,1037003,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000781202001
11216,33,1037003,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000781261305
11216,33,1037003,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000781603955
11216,33,1037003,15151,1,0,0,1.4436,0,0,0,1,1,1.065,5,1,1,18-JAN-13,000781615746
sed -i '/^$/d' foo
This tells sed to delete every line matching the regex ^$ i.e. every empty line. The -i flag edits the file in-place, if your sed doesn't support that you can write the output to a temporary file and replace the original:
sed '/^$/d' foo > foo.tmp
mv foo.tmp foo
If you also want to remove lines consisting only of whitespace (not just empty lines) then use:
sed -i '/^[[:space:]]*$/d' foo
Edit: also remove whitespace at the end of lines, because apparently you've decided you need that too:
sed -i '/^[[:space:]]*$/d;s/[[:space:]]*$//' foo
awk 'NF' filename
awk 'NF > 0' filename
sed -i '/^$/d' filename
awk '!/^$/' filename
awk '/./' filename
The NF also removes lines containing only blanks or tabs, the regex /^$/ does not.
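The difference can be checked with a short sketch; the four-line sample (data, empty line, a line of spaces and a tab, data) is invented for the test.

```shell
#!/bin/sh
# Emit: "one", an empty line, a whitespace-only line, "two".
sample() { printf 'one\n\n \t \ntwo\n'; }

# !/^$/ drops only the truly empty line.
by_regex=$(sample | awk '!/^$/' | wc -l | tr -d ' ')

# NF is the number of fields; it is 0 for empty and whitespace-only lines.
by_nf=$(sample | awk 'NF' | wc -l | tr -d ' ')

echo "lines kept by !/^\$/: $by_regex"
echo "lines kept by NF:    $by_nf"
```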
Use grep to match any line that has nothing between the start anchor (^) and the end anchor ($):
grep -v '^$' infile.txt > outfile.txt
If you want to remove lines with only whitespace, you can still use grep. I am using Perl regular expressions in this example, but there are other ways:
grep -P -v '^\s*$' infile.txt > outfile.txt
or, without Perl regular expressions:
grep -v '^[[:space:]]*$' infile.txt > outfile.txt
sed -e '/^ *$/d' input > output
Deletes all lines which consist only of blanks (or are completely empty). You can change the blank to [ \t] where the \t is a representation for tab. Whether your shell or your sed will do the expansion varies, but you can probably type the tab character directly. And if you're using GNU or BSD sed, you can do the edit in-place, if that's what you want, with the -i option.
If I execute the above command still I have blank lines in my output file. What could be the reason?
There could be several reasons. It might be that you don't have blank lines but you have lots of spaces at the end of a line so it looks like you have blank lines when you cat the file to the screen. If that's the problem, then:
sed -e 's/ *$//' -e '/^ *$/d' input > output
The new regex removes repeated blanks at the end of the line; see previous discussion for blanks or tabs.
Another possibility is that your data file came from Windows and has CRLF line endings. Unix sees the carriage return at the end of the line; it isn't a blank, so the line is not removed. There are multiple ways to deal with that. A reliable one is tr to delete (-d) character code octal 15, aka control-M or \r or carriage return:
tr -d '\015' < input | sed -e 's/ *$//' -e '/^ *$/d' > output
If neither of those works, then you need to show a hex dump or octal dump (od -c) of the first two lines of the file, so we can see what we're up against:
head -n 2 input | od -c
Judging from the comments that sed -i does not work for you, you are not working on Linux or Mac OS X or BSD — which platform are you working on? (AIX, Solaris, HP-UX spring to mind as relatively plausible possibilities, but there are plenty of other less plausible ones too.)
You can try the POSIX named character classes such as sed -e '/^[[:space:]]*$/d'; it will probably work, but is not guaranteed. You can try it with:
echo "Hello World" | sed 's/[[:space:]][[:space:]]*/   /'
If it works, there'll be three spaces between the 'Hello' and the 'World'. If not, you'll probably get an error from sed. That might save you grief over getting tabs typed on the command line.
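Putting the pieces of this answer together, a minimal sketch of a filter that copes with CRLF line endings, trailing whitespace and whitespace-only lines might look like this. The function name strip_blank_lines is hypothetical, and the sketch assumes a sed that supports the POSIX [[:space:]] class.

```shell
#!/bin/sh
strip_blank_lines() {
    # 1. delete carriage returns (octal 015)
    # 2. strip trailing whitespace from every line
    # 3. delete the lines that are now empty
    tr -d '\015' | sed -e 's/[[:space:]]*$//' -e '/^$/d'
}

# Sample input: a data line with trailing spaces, an "empty" CRLF line,
# a whitespace-only CRLF line, and another data line.
result=$(printf 'a,b  \r\n\r\n \t\r\nc,d\r\n' | strip_blank_lines)
printf '%s\n' "$result"
```

It would be used as strip_blank_lines < input > output. Because tr deletes every carriage return, including any in the middle of a line, this is only suitable when no CR is wanted anywhere in the data.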
grep . file
grep looks at your file line-by-line; the dot . matches anything except a newline character. The output from grep is therefore all the lines that consist of something other than a single newline.
with awk
awk 'NF > 0' filename
To be thorough and remove lines even if they include spaces or tabs something like this in perl will do it:
cat file.txt | perl -lane "print if /\S/"
Of course there are the awk and sed equivalents. Best not to assume the lines are totally blank as ^$ would do.
Cheers
You can use sed's -i option to edit in-place without using a temporary file:
sed -i '/^$/d' file

Remove carriage return in Unix

What is the simplest way to remove all the carriage returns \r from a file in Unix?
I'm going to assume you mean carriage returns (CR, "\r", 0x0d) at the ends of lines rather than just blindly within a file (you may have them in the middle of strings for all I know). Using this test file with a CR at the end of the first line only:
$ cat infile
hello
goodbye
$ cat infile | od -c
0000000 h e l l o \r \n g o o d b y e \n
0000017
dos2unix is the way to go if it's installed on your system:
$ cat infile | dos2unix -U | od -c
0000000 h e l l o \n g o o d b y e \n
0000016
If for some reason dos2unix is not available to you, then sed will do it:
$ cat infile | sed 's/\r$//' | od -c
0000000 h e l l o \n g o o d b y e \n
0000016
If for some reason sed is not available to you, then ed will do it, in a complicated way:
$ echo ',s/\r\n/\n/
> w !cat
> Q' | ed infile 2>/dev/null | od -c
0000000 h e l l o \n g o o d b y e \n
0000016
If you don't have any of those tools installed on your box, you've got bigger problems than trying to convert files :-)
tr -d '\r' < infile > outfile
See tr(1)
The simplest way on Linux is, in my humble opinion,
sed -i.bak 's/\r$//g' <filename>
-i will edit the file in place, while the .bak will create a backup of the original file by making a copy of your file and adding the extension .bak at the end. (You can specify what ever you want after the -i, or specify only -i to not create a backup.)
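The strong (single) quotes around the substitution operator 's/\r$//g' are essential. Without them the shell interprets \r as an escaped r and reduces it to a plain r, so sed would instead remove a lower case r at the end of each line. That is why an unquoted command such as sed -i s/\r// <filename> doesn't work.
Because the pattern is anchored with $, it can match at most once per line, so the /g modifier is redundant here; it only matters for an unanchored s/\r//g, which removes every \r on a line and not only the first one.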
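The quoting issue is easy to observe without touching any file. This sketch simply prints the argument that sed would receive in each case.

```shell
#!/bin/sh
# Unquoted: the shell removes the backslash before sed ever sees it.
unquoted=$(printf '%s' s/\r$//g)

# Single-quoted: the backslash reaches sed intact.
quoted=$(printf '%s' 's/\r$//g')

printf 'unquoted: %s\n' "$unquoted"
printf 'quoted:   %s\n' "$quoted"
```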
Old School:
tr -d '\r' < filewithcarriagereturns > filewithoutcarriagereturns
There's a utility called dos2unix that exists on many systems, and can be easily installed on most.
sed -i 's/\r//' <filename> or somesuch (the single quotes stop the shell from stripping the backslash); see man sed or the wealth of information available on the web regarding use of sed.
One thing to point out is the precise meaning of "carriage return" in the above; if you truly mean the single control character "carriage return", then the pattern above is correct. If you meant, more generally, CRLF (carriage return and a line feed, which is how line feeds are implemented under Windows), then you probably want to replace \r\n instead. Bare line feeds (newline) in Linux/Unix are \n.
If you are a Vi user, you may open the file and remove the carriage return with:
:%s/\r//g
or with
:1,$ s/^M//
Note that you should type ^M by pressing ctrl-v and then ctrl-m.
Someone else recommended dos2unix and I strongly recommend it as well. I'm just providing more details.
If it is already installed, jump to the next step. If not, I would recommend installing it via yum:
yum install dos2unix
Then you can use it like:
dos2unix fileIWantToRemoveWindowsReturnsFrom.txt
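For scripts that must run on machines where dos2unix may or may not be present, one possible sketch is a wrapper that falls back to sed. The function name to_unix is made up, and the sketch assumes a dos2unix build that acts as a filter when it reads standard input (the commonly packaged version does).

```shell
#!/bin/sh
to_unix() {
    if command -v dos2unix >/dev/null 2>&1; then
        # Filter mode: read stdin, write stdout, leave the file untouched.
        dos2unix < "$1"
    else
        # Fallback: strip a carriage return at the end of each line.
        cr=$(printf '\r')
        sed "s/${cr}\$//" "$1"
    fi
}

tmpdir=$(mktemp -d)
printf 'hello\r\ngoodbye\r\n' > "$tmpdir/in.txt"
to_unix "$tmpdir/in.txt" > "$tmpdir/out.txt"
size=$(wc -c < "$tmpdir/out.txt" | tr -d ' ')
echo "output size in bytes: $size"
rm -rf "$tmpdir"
```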
Once more a solution... Because there's always one more:
perl -i -pe 's/\r//' filename
It's nice because it's in place and works in every flavor of unix/linux I've worked with.
Removing \r on any UNIX® system:
Most existing solutions in this question are GNU-specific, and wouldn't work on OS X or BSD; the solutions below should work on many more UNIX systems, and in any shell, from tcsh to sh, yet still work even on GNU/Linux, too.
Tested on OS X, OpenBSD and NetBSD in tcsh, and on Debian GNU/Linux in bash.
With sed:
In tcsh on OS X, the following sed snippet can be used together with printf, since neither sed nor echo there treats \r specially the way the GNU versions do:
sed `printf 's/\r$//g'` input > output
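For Bourne-style shells, a slightly more defensive variant of the same printf idea is to store the carriage return in a variable and quote the sed expression. This is a sketch; the names cr and strip_cr_eol are chosen for the example.

```shell
#!/bin/sh
# Command substitution strips trailing newlines only, so the CR survives.
cr=$(printf '\r')

# Remove a carriage return only when it is the last character of a line.
strip_cr_eol() {
    sed "s/${cr}\$//"
}

# Three CRLF/LF lines shrink from 8 bytes to 6.
eol_bytes=$(printf 'a\r\nb\r\nc\n' | strip_cr_eol | wc -c | tr -d ' ')

# A CR in the middle of a line is kept; only the final one is removed.
mid_bytes=$(printf 'x\ry\r\n' | strip_cr_eol | wc -c | tr -d ' ')

echo "$eol_bytes $mid_bytes"
```

Quoting the expression keeps it a single argument even if it is later extended with characters the shell would otherwise split or expand.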
With tr:
Another option is tr:
tr -d '\r' < input > output
Difference between sed and tr:
It would appear that tr preserves a lack of a trailing newline from the input file, whereas sed on OS X and NetBSD (but not on OpenBSD or GNU/Linux) inserts a trailing newline at the very end of the file even if the input is missing any trailing \r or \n at the very end of the file.
Testing:
Here's some sample testing that could be used to ensure this works on your system, using printf and hexdump -C; alternatively, od -c could also be used if your system is missing hexdump:
% printf 'a\r\nb\r\nc' | hexdump -C
00000000 61 0d 0a 62 0d 0a 63 |a..b..c|
00000007
% printf 'a\r\nb\r\nc' | ( sed `printf 's/\r$//g'` /dev/stdin > /dev/stdout ) | hexdump -C
00000000 61 0a 62 0a 63 0a |a.b.c.|
00000006
% printf 'a\r\nb\r\nc' | ( tr -d '\r' < /dev/stdin > /dev/stdout ) | hexdump -C
00000000 61 0a 62 0a 63 |a.b.c|
00000005
%
If you're using an OS (like OS X) that doesn't have the dos2unix command but does have a Python interpreter (version 2.5+; note that the 'rU' mode it relies on was removed from fileinput in Python 3.11), this command is equivalent to the dos2unix command:
python -c "import sys; import fileinput; sys.stdout.writelines(line.replace('\r', '\n') for line in fileinput.input(mode='rU'))"
This handles both named files on the command line as well as pipes and redirects, just like dos2unix. If you add this line to your ~/.bashrc file (or equivalent profile file for other shells):
alias dos2unix="python -c \"import sys; import fileinput; sys.stdout.writelines(line.replace('\r', '\n') for line in fileinput.input(mode='rU'))\""
... the next time you log in (or run source ~/.bashrc in the current session) you will be able to use the dos2unix name on the command line in the same manner as in the other examples.
you can simply do this :
$ echo $(cat input) > output
Here is the thing,
%0d is the carriage return character. To make the file compatible with Unix, we need to use the command below.
dos2unix fileName.extension fileName.extension
try this to convert dos file into unix file:
fromdos file
For UNIX... I've noticed dos2unix removed Unicode headers from my UTF-8 file. Under git bash (Windows), the following script seems to work nicely. It uses sed. Note it only removes carriage-returns at the ends of lines, and preserves Unicode headers.
#!/bin/bash
inOutFile="$1"
backupFile="${inOutFile}~"
mv --verbose "$inOutFile" "$backupFile"
sed -e 's/\015$//g' <"$backupFile" >"$inOutFile"
If you are running an X environment and have a proper editor (Visual Studio Code), then I would follow the recommendation in:
Visual Studio Code: How to show line endings
Just go to the bottom right corner of your screen: Visual Studio Code shows both the file encoding and the end-of-line convention used by the file, and with a simple click you can switch it.
Just use Visual Studio Code as your replacement for Notepad++ on a Linux environment and you are set to go.
Using sed
sed $'s/\r//' infile > outfile
Using sed on Git Bash for Windows
sed '' infile > outfile
The first version uses ANSI-C quoting and may require escaping the \ if the command runs from a script. The second version exploits the fact that sed reads the input line by line, stripping the \r and \n characters as it does so, and appends only a \n when it writes each line to the output file. A more general, cross-platform solution can be devised by modifying IFS:
IFS=$'\r\n' # or IFS+=$'\r' if the lines do not contain whitespace
printf "%s\n" $(cat infile) > outfile
IFS=$' \t\n' # not necessary if IFS+=$'\r' is used
Warning: This solution performs filename expansion (*, ?, [...] and more if extglob is set). Use it only if you are sure that the file does not contain special characters or you want the expansion.
Warning: None of the solutions can handle \ in the input file.
cat input.csv | sed 's/\r/\n/g' > output.csv
worked for me
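I've used Python for it; here is my code:
end1 = '/home/.../file1.txt'
end2 = '/home/.../file2.txt'
# Binary mode on both sides, so Python does no newline translation of its own.
with open(end1, "rb") as inf:
    with open(end2, "wb") as fixed:
        for line in inf:
            # Drop the carriage returns; the line feed is kept so lines stay separate.
            line = line.replace(b"\r", b"")
            fixed.write(line)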
Though it's an older post, I recently came across the same problem. I had to rename all the files inside /tmp/blah_dir/, because each file name in this directory had a trailing "\r" character (shown as "?" at the end of the file name), so a script was the only approach I could think of.
I wanted to keep the same file name (without the trailing character).
With sed, the problem was the output file name, for which I had to specify something else (which I didn't want).
I tried the other options suggested here (I did not consider dos2unix because of some limitations), but they didn't work.
Finally I tried awk, which worked: I used "\r" as the delimiter and took the first part.
The trick is:
echo ${filename}|awk -F"\r" '{print $1}'
Below is the script snippet I used to fix my issue (all the files under /tmp/blah_dir/ had "\r" as a trailing character):
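cd /tmp/blah_dir/ || exit 1
for i in *
do
    # Keep only the part of the name before the first carriage return
    new=$(printf '%s\n' "$i" | awk -F"\r" '{print $1}')
    # Rename only when the name actually changed
    [ "$i" != "$new" ] && mv -- "$i" "$new"
done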
Note: This example is not exactly what I ran, though it is close (I mention it here just to give a better idea of what I did).
I made this shell script to remove the \r character. It works on Solaris and Red Hat:
#!/bin/ksh
LOCALPATH=/Any_PATH
for File in `ls ${LOCALPATH}`
do
ARCACT=${LOCALPATH}/${File}
od -bc ${ARCACT}|sed -n 'p;n'|sed 's/015/012/g'|awk '{$1=""; print $0}'|sed 's/ /\\/g'|awk '{printf $0;}'>${ARCACT}.TMP
printf "`cat ${ARCACT}.TMP`"|sed '/^$/d'>${ARCACT}
rm ${ARCACT}.TMP
done
exit 0

Resources