I want to convert a Windows UTF-8 file containing a special apostrophe to a Unix ISO-8859-1 file. This is how I am doing it:
# -- unix file
tr -d '\015' < my_utf8_file.xml > t_my_utf8_file.xml
# -- get rid of special apostrophe
sed "s/’/'/g" t_my_utf8_file.xml > temp_my_utf8_file.xml
# -- change the xml header
sed "s/UTF-8/ISO-8859-1/g" temp_my_utf8_file.xml > my_utf8_file_temp.xml
# -- the actual character set conversion
iconv -c -f UTF-8 -t ISO8859-1 my_utf8_file_temp.xml > my_file.xml
Everything is fine except for one thing in one of my files. It seems there is an invisible character at the beginning of the file: when I open my_file.xml in Notepad++, I see a SUB at the beginning of the file, and in vi on Unix I see ^Z.
What should I add to my Unix script, and where, to delete those kinds of characters?
Thank you
To figure out exactly what character(s) you're dealing with, isolate the line in question (in this case something simple like head -1 <file> should suffice) and pipe the result to od (using the appropriate flag to display the character(s) in the desired format):
head -1 <file> | od -c # view as character
head -1 <file> | od -d # view as decimal
head -1 <file> | od -o # view as octal
head -1 <file> | od -x # view as hex
Once you know the character(s) you're dealing with, you can use your favorite command (e.g., tr, sed) to remove said character.
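For the SUB character from the original question (^Z, hex 0x1A, octal 032), and assuming that really is the stray byte, the existing tr step could simply be extended to delete it along with the carriage returns:
# -- unix file, also dropping any SUB (^Z, octal 032) characters
tr -d '\015\032' < my_utf8_file.xml > t_my_utf8_file.xml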
I need to convert a text file to DOS format (ending each line with 0x0d 0x0a, rather than 0x0a only) if the file is in Unix format (0x0a only at the end of each line).
I know how to convert it (sed 's/$/^M/'), but I don't know how to detect the end-of-line character(s) of a file.
I am using ksh.
Any help would be appreciated.
[Update]:
Kind of figured it out, and here is my ksh script to do the check.
[qiangxu#host:/my/folder]# cat eol_check.ksh
#!/usr/bin/ksh
if ! head -1 "$1" | grep '^M$' >/dev/null 2>&1; then
    echo UNIX
else
    echo DOS
fi
In the above script, ^M should be inserted in vi with Ctrl-V and Ctrl-M.
Want to know if there is any better method.
Simply use the file command.
If the file contains lines ending in CR LF, file reports this in its output:
'ASCII text, with CRLF line terminators'
e.g.
if file myFile | grep "CRLF" > /dev/null 2>&1;
then
    ....
fi
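A complete sketch of such a check (the is_dos function name is just an illustration):
# return success if file(1) reports CRLF line terminators
is_dos() {
    file "$1" | grep -q 'CRLF'
}
if is_dos myFile; then echo DOS; else echo UNIX; fi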
The latest (7.1) version of the dos2unix (and unix2dos) command that ships with Cygwin and some recent Linux distributions has a handy --info option which prints out a count of the different types of newline in each file. The version described here is dos2unix 7.1 (2014-10-06), from http://waterlan.home.xs4all.nl/dos2unix.html
From the man page:
--info[=FLAGS] FILE ...
Display file information. No conversion is done.
The following information is printed, in this order:
number of DOS line breaks, number of Unix line breaks, number of Mac line breaks, byte order mark, text or binary, file name.
Example output:
6 0 0 no_bom text dos.txt
0 6 0 no_bom text unix.txt
0 0 6 no_bom text mac.txt
6 6 6 no_bom text mixed.txt
50 0 0 UTF-16LE text utf16le.txt
0 50 0 no_bom text utf8unix.txt
50 0 0 UTF-8 text utf8dos.txt
2 418 219 no_bom binary dos2unix.exe
Optionally extra flags can be set to change the output. One or more flags can be added.
d Print number of DOS line breaks.
u Print number of Unix line breaks.
m Print number of Mac line breaks.
b Print the byte order mark.
t Print if file is text or binary.
c Print only the files that would be converted.
With the "c" flag dos2unix will print only the files that contain DOS line breaks, unix2dos will print only file names that have Unix line breaks.
Thus:
if [[ -n $(dos2unix --info=c "${filename}") ]] ; then echo DOS; fi
Conversely:
if [[ -n $(unix2dos --info=c "${filename}") ]] ; then echo UNIX; fi
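The two checks can be combined into a single classifier (a sketch, assuming dos2unix 7.1+ is installed; the eol_type function name is just an illustration):
eol_type() {
    if [[ -n $(dos2unix --info=c "$1") ]]; then
        echo DOS    # file contains at least one DOS line break
    else
        echo UNIX   # no DOS line breaks found
    fi
}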
# exit 0 (DOS) if the first line ends in CR, otherwise exit 1 (UNIX)
if awk '/\r$/{exit 0;} 1{exit 1;}' myFile
then
    echo "is DOS"
fi
I can't test on AIX, but try:
if [[ "$(head -1 filename)" == *$'\r' ]]; then echo DOS; fi
You can simply remove any existing carriage returns from all lines, and then add the carriage return to the end of all lines. Then it doesn't matter what format the incoming file is in. The outgoing format will always be DOS format.
sed 's/\r$//;s/$/\r/'
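A minimal usage sketch, assuming GNU sed (other sed implementations may need a literal carriage return, entered with Ctrl-V Ctrl-M, instead of \r); the file names are just illustrations:
# idempotent: accepts unix or dos input, always produces dos output
sed 's/\r$//;s/$/\r/' infile > outfile_dos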
I'm probably late on this one, but I had the same issue and I did not want to put the special ^M character in my script (I'm worried some editors might not display the special character properly, or that some later programmer might replace it with two normal characters: ^ and M...).
The solution I found feeds the special character to grep, by letting the shell convert its hex value:
if head -1 "${filename}" | grep $'[\x0D]' >/dev/null
then
    echo "Win"
else
    echo "Unix"
fi
Unfortunately, I cannot make the $'[\x0D]' construct work in ksh.
In ksh, I found this:
if head -1 "${filename}" | od -x | grep '0d0a$' >/dev/null
then
    echo "Win"
else
    echo "Unix"
fi
od -x displays the text in hex codes.
'0d0a$' is the hex code for CR-LF (the DOS-Win line terminator). The Unix line terminator is '0a00$'
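Note that od -x groups the input into two-byte words, so the exact pattern depends on the machine's byte order and on where the CR falls in the line; printing one byte per column avoids both issues (a sketch, assuming a POSIX od):
# -An suppresses the address column, -tx1 prints single bytes; 0d is the CR byte
if head -1 "${filename}" | od -An -tx1 | grep '0d' >/dev/null
then
    echo "Win"
else
    echo "Unix"
fi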
I have a text file which is over 60 MB in size. It has entries on 5105043 lines, but when I run wc -l it reports only 5105042, which is one less than the actual count. Does anyone have any idea why this is happening?
Is it a common thing when the file size is large?
The last line does not end with a newline.
One trick to get the result you want would be:
sed -n '=' <yourfile> | wc -l
This tells sed just to print the line number of each line in your file, which wc then counts. There are probably better solutions, but this works.
The last line in your file is probably missing a newline ending. IIRC, wc -l merely counts the number of newline characters in the file.
If you try: cat -A file.txt | tail does your last line contain a trailing dollar sign ($)?
EDIT:
Assuming the last line in your file is lacking a newline character, you can append a newline character to correct it like this:
printf "\n" >> file.txt
The results of wc -l should now be consistent.
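If you want to append the newline only when it is actually missing, a small guard helps (a sketch: tail -c 1 prints the file's last byte, and the command substitution strips a trailing newline, leaving an empty string exactly when the newline is already there):
if [ -n "$(tail -c 1 file.txt)" ]; then
    printf '\n' >> file.txt
fi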
60 MB seems a bit big for this, but for smaller files one option could be:
cat -n file.txt
OR
cat -n sample.txt | cut -f1 | tail -1
I have a file with e.g. 9818 lines. When I use wc -l file, I see 9818 lines. When I vi the file, I see 9818 lines. When I :set number, I see 9818 lines. But when I cat file | nl, I see the final line number is 9750 (e.g.). Basically I'm asking why the line numbers from cat file | nl and wc -l file do not match.
wc -l: counts all lines
nl: numbers only nonempty lines (by default)
Try:
nl -ba: numbers all lines
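A quick demonstration of the difference (the blank line is skipped by plain nl but numbered by nl -ba):
printf 'a\n\nb\n' | nl       # numbers only 'a' and 'b'
printf 'a\n\nb\n' | nl -ba   # numbers all three lines, including the blank one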
nl(1) says the default is for header and footer lines not to be numbered (-hn -fn), and those sections are delimited by lines consisting of repeated \: characters. Perhaps your input file includes some of these?
I suggest reading the output of nl line by line against cat -n output and see where things diverge. Or use diff -u if you want to take the fun out of reading 9818 lines. :)
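For example (a sketch, assuming bash for the process substitution):
# show the first places where the two numberings diverge
diff -u <(cat -n file) <(nl file) | head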
nl does not number blank lines, so this is almost certainly the reason. If you can point us to the file, we can confirm that, but I suspect this is the case.
What is the simplest way to remove all the carriage returns \r from a file in Unix?
I'm going to assume you mean carriage returns (CR, "\r", 0x0d) at the ends of lines rather than just blindly within a file (you may have them in the middle of strings for all I know). Using this test file with a CR at the end of the first line only:
$ cat infile
hello
goodbye
$ cat infile | od -c
0000000 h e l l o \r \n g o o d b y e \n
0000017
dos2unix is the way to go if it's installed on your system:
$ cat infile | dos2unix -U | od -c
0000000 h e l l o \n g o o d b y e \n
0000016
If for some reason dos2unix is not available to you, then sed will do it:
$ cat infile | sed 's/\r$//' | od -c
0000000 h e l l o \n g o o d b y e \n
0000016
If for some reason sed is not available to you, then ed will do it, in a complicated way:
$ echo ',s/\r\n/\n/
> w !cat
> Q' | ed infile 2>/dev/null | od -c
0000000 h e l l o \n g o o d b y e \n
0000016
If you don't have any of those tools installed on your box, you've got bigger problems than trying to convert files :-)
tr -d '\r' < infile > outfile
See tr(1)
The simplest way on Linux is, in my humble opinion,
sed -i.bak 's/\r$//g' <filename>
-i will edit the file in place, while the .bak will create a backup of the original file by making a copy of your file and adding the extension .bak at the end. (You can specify whatever you want after the -i, or specify only -i to not create a backup.)
The strong quotes around the substitution operator 's/\r//' are essential. Without them the shell interprets \r as an escaped r and reduces it to a plain r, so the command would instead delete a trailing lowercase r from every line. That's why the answer given above in 2009 by Rob doesn't work.
And adding the /g modifier ensures that even multiple \r will be removed, and not only the first one.
Old School:
tr -d '\r' < filewithcarriagereturns > filewithoutcarriagereturns
There's a utility called dos2unix that exists on many systems, and can be easily installed on most.
sed -i s/\r// <filename> or somesuch; see man sed or the wealth of information available on the web regarding use of sed.
One thing to point out is the precise meaning of "carriage return" in the above; if you truly mean the single control character "carriage return", then the pattern above is correct. If you meant, more generally, CRLF (carriage return and a line feed, which is how line feeds are implemented under Windows), then you probably want to replace \r\n instead. Bare line feeds (newline) in Linux/Unix are \n.
If you are a Vi user, you may open the file and remove the carriage return with:
:%s/\r//g
or with
:1,$ s/^M//
Note that you should type ^M by pressing ctrl-v and then ctrl-m.
Someone else recommend dos2unix and I strongly recommend it as well. I'm just providing more details.
If installed, jump to the next step. If not already installed, I would recommend installing it via yum like:
yum install dos2unix
Then you can use it like:
dos2unix fileIWantToRemoveWindowsReturnsFrom.txt
Once more a solution... Because there's always one more:
perl -i -pe 's/\r//' filename
It's nice because it's in place and works in every flavor of unix/linux I've worked with.
Removing \r on any UNIX® system:
Most existing solutions in this question are GNU-specific, and wouldn't work on OS X or BSD; the solutions below should work on many more UNIX systems, and in any shell, from tcsh to sh, yet still work even on GNU/Linux, too.
Tested on OS X, OpenBSD and NetBSD in tcsh, and on Debian GNU/Linux in bash.
With sed:
In tcsh on OS X, the following sed snippet could be used together with printf, as neither sed nor echo handles \r specially the way the GNU tools do:
sed `printf 's/\r$//g'` input > output
With tr:
Another option is tr:
tr -d '\r' < input > output
Difference between sed and tr:
It would appear that tr preserves a lack of a trailing newline from the input file, whereas sed on OS X and NetBSD (but not on OpenBSD or GNU/Linux) inserts a trailing newline at the very end of the file even if the input is missing any trailing \r or \n at the very end of the file.
Testing:
Here's some sample testing that could be used to ensure this works on your system, using printf and hexdump -C; alternatively, od -c could also be used if your system is missing hexdump:
% printf 'a\r\nb\r\nc' | hexdump -C
00000000 61 0d 0a 62 0d 0a 63 |a..b..c|
00000007
% printf 'a\r\nb\r\nc' | ( sed `printf 's/\r$//g'` /dev/stdin > /dev/stdout ) | hexdump -C
00000000 61 0a 62 0a 63 0a |a.b.c.|
00000006
% printf 'a\r\nb\r\nc' | ( tr -d '\r' < /dev/stdin > /dev/stdout ) | hexdump -C
00000000 61 0a 62 0a 63 |a.b.c|
00000005
%
If you're using an OS (like OS X) that doesn't have the dos2unix command but does have a Python interpreter (version 2.5+), this command is equivalent to the dos2unix command:
python -c "import sys; import fileinput; sys.stdout.writelines(line.replace('\r', '\n') for line in fileinput.input(mode='rU'))"
This handles both named files on the command line as well as pipes and redirects, just like dos2unix. If you add this line to your ~/.bashrc file (or equivalent profile file for other shells):
alias dos2unix="python -c \"import sys; import fileinput; sys.stdout.writelines(line.replace('\r', '\n') for line in fileinput.input(mode='rU'))\""
... the next time you log in (or run source ~/.bashrc in the current session) you will be able to use the dos2unix name on the command line in the same manner as in the other examples.
You can simply do this:
$ echo $(cat input) > output
(Beware that this relies on shell word splitting, so it also collapses runs of whitespace and performs filename expansion.)
Here is the thing:
0x0d is the carriage return character. To make the file compatible with Unix, we need to use the command below.
dos2unix fileName.extension fileName.extension
Try this to convert a DOS file into a Unix file:
fromdos file
For UNIX... I've noticed dos2unix removed Unicode headers from my UTF-8 file. Under Git Bash (Windows), the following script seems to work nicely. It uses sed. Note it only removes carriage returns at the ends of lines, and preserves the Unicode headers.
#!/bin/bash
inOutFile="$1"
backupFile="${inOutFile}~"
mv --verbose "$inOutFile" "$backupFile"
sed -e 's/\015$//g' <"$backupFile" >"$inOutFile"
If you are running an X environment and have a proper editor (Visual Studio Code), then I would follow this recommendation:
Visual Studio Code: How to show line endings
Just go to the bottom right corner of your screen; Visual Studio Code will show you both the file encoding and the end-of-line convention followed by the file, and with a simple click you can switch it around.
Just use VS Code as your replacement for Notepad++ on a Linux environment and you are set to go.
Using sed
sed $'s/\r//' infile > outfile
Using sed on Git Bash for Windows
sed '' infile > outfile
The first version uses ANSI-C quoting and may require escaping \ if the command runs from a script. The second version exploits the fact that sed on Git Bash reads the input file line by line, stripping the \r and \n characters; when writing lines to the output file, however, it only appends a \n character. A more general and cross-platform solution can be devised by simply modifying IFS:
IFS=$'\r\n' # or IFS+=$'\r' if the lines do not contain whitespace
printf "%s\n" $(cat infile) > outfile
IFS=$' \t\n' # not necessary if IFS+=$'\r' is used
Warning: This solution performs filename expansion (*, ?, [...] and more if extglob is set). Use it only if you are sure that the file does not contain special characters or you want the expansion.
Warning: None of the solutions can handle \ in the input file.
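To sidestep the filename expansion mentioned in the warning, globbing can be disabled around the command (a sketch; set -f is POSIX):
set -f                # disable filename expansion
IFS=$'\r\n'
printf "%s\n" $(cat infile) > outfile
IFS=$' \t\n'
set +f                # re-enable filename expansion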
cat input.csv | sed 's/\r/\n/g' > output.csv
Worked for me.
I've used Python for it; here is my code (note the binary modes, so Python does not translate the line endings itself):
end1 = '/home/.../file1.txt'
end2 = '/home/.../file2.txt'
with open(end1, "rb") as inf:
    with open(end2, "wb") as fixed:
        for line in inf:
            fixed.write(line.replace(b"\r", b""))  # drop CRs, keep the \n
Though it's an older post, I recently came across the same problem. I had to rename all the files inside /tmp/blah_dir/, as each file name in that directory had a trailing "\r" character (showing as "?" at the end of the name), so doing it with a script was the only way I could think of.
I wanted to save each file under the same name (without the trailing character).
With sed, the problem was the output filename, for which I needed to use a different name (which I didn't want).
I tried the other options suggested here (I didn't consider dos2unix because of some limitations), but they didn't work.
Finally I tried awk, which worked: I used "\r" as the field delimiter and took the first part.
The trick is:
echo ${filename}|awk -F"\r" '{print $1}'
Below is the script snippet I used to fix my issue (all files under /tmp/blah_dir/ had "\r" as a trailing character):
cd /tmp/blah_dir/
for i in `ls`
do
    mv "$i" "$(echo "$i" | awk -F"\r" '{print $1}')"
done
Note: this example is not exact, but it is close to what I actually ran (mentioned here just to give a better idea of what I did).
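A slightly safer variant of the same idea (a sketch, assuming bash or ksh93 for the $'\r' quoting; it avoids parsing ls output and only renames names that actually end in CR):
cd /tmp/blah_dir/ || exit 1
for i in *; do
    case $i in
        *$'\r') mv "$i" "${i%$'\r'}" ;;  # strip the trailing CR from the name
    esac
done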
I made this shell script to remove the \r character. It works on Solaris and Red Hat:
#!/bin/ksh
LOCALPATH=/Any_PATH
for File in `ls ${LOCALPATH}`
do
    ARCACT=${LOCALPATH}/${File}
    od -bc ${ARCACT}|sed -n 'p;n'|sed 's/015/012/g'|awk '{$1=""; print $0}'|sed 's/ /\\/g'|awk '{printf $0;}'>${ARCACT}.TMP
    printf "`cat ${ARCACT}.TMP`"|sed '/^$/d'>${ARCACT}
    rm ${ARCACT}.TMP
done
exit 0