From my understanding, a \n should be interpreted as a new line in
"a\nb"
$'a\nb'
but not in
'a\nb'
However, I see the following:
(transcript from my zsh session follows)
-0-1- ~ > zsh --version
zsh 5.1.1 (x86_64-unknown-cygwin)
-0-1- ~ > echo 'a\nb'
a
b
-0-1- ~ >
zsh itself does not interpret the \n in 'a\nb', but the builtin echo does. 'a\nb' and "a\nb" are equivalent: neither contains a newline, only the (literal) character sequence \n. $'a\nb', on the other hand, actually does contain a newline.
There are two things at work here:
1. Quoting:
Quoting is used to tell the shell that you want a character itself, not any special meaning it may have in the shell's syntax. zsh has four types of quoting, some of which may retain or add special meaning to a few characters or character sequences:
\: Quoting of single characters by prepending \. For zsh saying \n only means "I want the character n". As n has no special meaning, it is the same as writing just n. This changes with characters like *: Without quoting * is used as a wildcard for globbing, writing \* prevents that (for example: compare outputs of echo /* and echo /\*). If you want to pass the character \ literally, you have to quote it, for example with another \: \\.
'...': Just like with \, any character within '...' is taken literally; this includes other methods of quoting. 'foo* bar?' is essentially equivalent to foo\*\ bar\? (or \f\o\o\*\ \b\a\r\? if one wants to be pedantic). Only ' itself cannot appear inside a string quoted with '...', as it would mark the end of the quote. (This can be changed by setting RC_QUOTES. If set, a pair of single quotes is taken as a single quote: 'foo''bar' → foo\'bar)
"...": Allows for parameter and command substitution. That means words after a $ are taken as parameter names (e.g. "path: $PATH") and strings wrapped in $(...) or `...` are used for command substitution (e.g. echo "Date: $(date)"). Otherwise it behaves like '...' with the exception that it allows the quoting of `, $, \ and " with \.
$'...': Strings inside $'...' are treated like string arguments of the print builtin. Only here do some alphabetic characters indeed have special meaning: \n for newline, \b for backspace, \a for the bell character, etc. \ and ' can be quoted with \\ and \' respectively. The resulting string is considered fully quoted. There is no parameter or command substitution inside $'...'.
2. The zsh-builtin echo:
Unlike the echo binary or the bash-builtin echo, the zsh-builtin echo by default recognizes (some) escape sequences like \n or \t. While this behavior usually needs to be explicitly enabled with /bin/echo or the bash-builtin (usually by passing the -e flag: echo -e "foo\nbar"), for the zsh-builtin it needs to be explicitly disabled, either by passing the -E flag (echo -E 'foo\nbar') or by setting the BSD_ECHO option (setopt bsdecho), in which case the -e flag can be used to re-enable the feature like with the other types of echo.
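The two points above can be checked without relying on echo at all. The following is a minimal sketch, assuming bash or zsh (the $'...' form is not available in every POSIX sh); the variable names single, double and ansi are only for illustration. It compares string lengths to show what the shell stored, and uses printf, whose %s leaves the argument untouched while %b interprets backslash escapes much like the zsh builtin echo does by default.

```shell
# Assumption: run with bash or zsh, since $'...' is not in every POSIX sh.
single='a\nb'   # literal a, backslash, n, b
double="a\nb"   # same four characters: \n is not special inside "..."
ansi=$'a\nb'    # a, a real newline, b

# Lengths of the three strings: prints "4 4 3"
printf '%s\n' "${#single} ${#double} ${#ansi}"

# %s prints the stored characters as they are: a\nb
printf '%s\n' "$single"

# %b interprets backslash escapes in the argument, so this prints
# a and b on two separate lines
printf '%b\n' "$single"
```

Using printf '%s\n' instead of echo is also a common way to get the same output from zsh, bash and /bin/sh regardless of how each echo is configured.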
Conclusion
That means that both 'a\nb' and "a\nb" (and a\\nb for that matter) are passed as a\nb (literally), but the zsh-builtin echo then interprets the \n, leading to an output with a newline. On the other hand $'a\nb' contains a literal newline already before it is passed to echo.
Running
for quoted_string in a\\nb 'a\nb' "a\nb" $'a\nb'; do
echo '>' $quoted_string '<'
/bin/echo -e '>' $quoted_string '<'
echo -E '>' $quoted_string '<'
/bin/echo '>' $quoted_string '<'
echo
done
should get you the following output:
> a
b <
> a
b <
> a\nb <
> a\nb <
> a
b <
> a
b <
> a\nb <
> a\nb <
> a
b <
> a
b <
> a\nb <
> a\nb <
> a
b <
> a
b <
> a
b <
> a
b <
As you can see there is no difference between the first three kinds of quoting, while the fourth always prints with a newline.
BTW: perl (at least version 5; I do not know about Perl 6) behaves in the way you describe the expected behavior. In perl '...' behaves like it does in zsh. On the other hand, "..." in perl behaves like a combination of "..." and $'...' of zsh: variables are replaced by their value and character sequences like \n and \t are treated specially.
You are wrong about "a\nb", which does not replace \n with a newline (well, not portably, anyway). So there is no standard quoting mechanism that allows a newline in a string without using a literal newline, as in
x="a
b"
This is why the $'a\nb' notation was invented to do just that. It's not POSIX as of 2016, but the POSIX folks are thinking about adding it to the shell specification. As always, don't hold your breath :-)
A new line is a character, not a variable, so there is no interpretation (or expansion). If you want to print a \, you need to escape it twice to indicate that you don’t want to escape a variable and that you don’t want to print a special character.
So you have to use echo a\\\\nb or echo 'a\\nb'
I am trying to reformat the text file below; the record order will always be like this:
Dept 0100 Batch Load Errors for 8/16/2016 4:45:56 AM
Case 1111111111
Rectype: ABCD
Key:UMUM_REF_ID=A12345678,UMSV_SEQ_NO=1
UMSV ERROR :UNITS_ALLOW must be > or = UNITS_PAID
Case 2222222222
Rectype: ABCD
Key:UMUM_REF_ID=B87654321,UMSV_SEQ_NO=2
UMSV ERROR :UNITS_ALLOW must be > or = UNITS_PAID
NTNB ERROR :Invalid Value NTNB_MCTR_SUBJ=AMOD
Case 3333333333
Rectype: WXYZ
Key:UMUM_REF_ID=U19817250,UMSV_SEQ_NO=2
UMSV ERROR :UNITS_ALLOW must be > or = UNITS_PAID
The expected output is:
1111111111~ABCD~UMUM_REF_ID=A12345678,UMSV_SEQ_NO=1~UMSV ERROR :UNITS_ALLOW must be > or = UNITS_PAID
2222222222~ABCD~UMUM_REF_ID=B87654321,UMSV_SEQ_NO=2~UMSV ERROR :UNITS_ALLOW must be > or = UNITS_PAID|NTNB ERROR :Invalid Value NTNB_MCTR_SUBJ=AMOD
3333333333~WXYZ~UMUM_REF_ID=U19817250,UMSV_SEQ_NO=2~UMSV ERROR :UNITS_ALLOW must be > or = UNITS_PAID
I tried the following sed command:
sed -r '/^Case/!d;$!N;/\nRectype/!D;s/\s+$/ /;s/(.*)\n(.*)/\2\1\n\1/;P;D' file.txt
but it only works up to the Rectype row; I have not been able to achieve the rest.
Thank you.
It seems to me that you're not really looking for a regular expression. You're looking for text reformatting, and you appear to have selected regular expression matching in sed as the method by which you'll process fields.
Read about XY problems here. Thankfully, you've included raw data and expected output, which is AWESOME for a new StackOverflow member. (Really! Yay you!) So I can recommend an alternative that will likely work better for you.
That alternative is awk, another command-line tool which, like sed, is installed on virtually every unix-like system on the planet.
$ awk -v RS= -v OFS="~" '!/^Case/{next} {sub(/^Key:/,"",$5); key=$5; for (f=6;f<=NF;f++) { if ($f=="NTNB") key=key "|"; else if ($f=="UMSV") key=key OFS; else key=key " "; key=key $f } print $2,$4,key}' inp2
1111111111~ABCD~UMUM_REF_ID=A12345678,UMSV_SEQ_NO=1~UMSV ERROR :UNITS_ALLOW must be > or = UNITS_PAID
2222222222~ABCD~UMUM_REF_ID=B87654321,UMSV_SEQ_NO=2~UMSV ERROR :UNITS_ALLOW must be > or = UNITS_PAID|NTNB ERROR :Invalid Value NTNB_MCTR_SUBJ=AMOD
3333333333~WXYZ~UMUM_REF_ID=U19817250,UMSV_SEQ_NO=2~UMSV ERROR :UNITS_ALLOW must be > or = UNITS_PAID
Here's what's going on.
awk -v RS= - This is important. It sets a "null" record separator, which tells awk that we're dealing with multi-line records. Records are terminated by a blank line, and fields within this record are separated by whitespace. (Space, tab, newline.)
-v OFS="~" - Set an output field separator of tilde, for convenience.
!/^Case/{next} - If the current record doesn't begin with the word "Case", it's not a record we can handle, so skip it.
sub(/^Key:/,"",$5); key=$5; - Trim the word Key from the beginning of the fifth field, save the field to a variable.
for (f=6;f<=NF;f++) { - Step through the remaining fields
if ($f=="NTNB") key=key "|"; - setting the appropriate field separator.
else if ($f=="UMSV") key=key OFS; - ...
else key=key " "; - Or space if the text doesn't look like a new field.
key=key $f } - Finally, add the current field to our running variable,
print $2,$4,key} - and print everything.
NOTE: One thing this doesn't do is maintain spacing as you've shown in your "expected output" in your question. Two or more spaces will always be shrunk to just one space, since within each record, fields are separated by whitespace.
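For use in a script, the same one-liner can be laid out over several lines. This is only a sketch of the program explained above: the function name reformat_errors and the sample input are made up for illustration, and the input is assumed to have a blank line between records, which paragraph mode (RS=) depends on.

```shell
# reformat_errors: hypothetical wrapper around the awk program above.
# Reads blank-line-separated records on stdin, writes one line per Case.
reformat_errors() {
  awk -v RS= -v OFS='~' '
    # Skip records that do not start with "Case" (for example the header).
    !/^Case/ { next }
    {
      sub(/^Key:/, "", $5)          # drop the "Key:" prefix
      key = $5
      for (f = 6; f <= NF; f++) {
        if ($f == "NTNB")      key = key "|"   # second error: pipe
        else if ($f == "UMSV") key = key OFS   # first error: tilde
        else                   key = key " "   # same field: space
        key = key $f
      }
      print $2, $4, key
    }
  '
}

# Sample input, assumed to use blank lines between records.
reformat_errors <<'EOF'
Dept 0100 Batch Load Errors for 8/16/2016 4:45:56 AM

Case 1111111111
Rectype: ABCD
Key:UMUM_REF_ID=A12345678,UMSV_SEQ_NO=1
UMSV ERROR :UNITS_ALLOW must be > or = UNITS_PAID
EOF
```

The awk program sits inside single quotes, so its comments must not contain an apostrophe.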
UPDATE per comments
Windows uses \r\n (CRLF) to end lines, whereas unix/linux use just \n (LF). Since your file is being generated in Windows, the "blank" lines actually contain an invisible CR, and awk never sees a record separator.
To see the "real" contents of your file, you can use tools like hexdump or od. For example:
$ printf 'foo\r\nbar\r\n' | od -c
0000000 f o o \r \n b a r \r \n
0000012
In your case, simply run:
$ od -c filename | less
(Or use more if less isn't available.)
Many systems have a package available called dos2unix which can convert text format.
If you don't have dos2unix available, you should be able to achieve the same thing using other tools. In GNU sed:
sed -i 's/\r$//' filename
Or in other sed variants (read man sed to see if you have a -i option), with a shell (like bash) that supports $'...' quoting:
sed $'s/\r$//' inputfile > outputfile
Or a little less precisely, as it will remove all CRs even if they're not at the end of the line, you could use tr:
tr -d '\015' < inputfile > outputfile
Or if perl is available, you can use a substitution expression that's almost identical to the one for sed (perl documentation is readily available to tell you what the options do):
perl -i -pe 's/\r\n$/\n/g' filename
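If you want to check for CRs before converting, something along these lines may help. It is a minimal sketch using only grep, tr, printf and mktemp; the function names has_cr and strip_cr are made up for illustration, and like the tr command above, strip_cr removes every CR, not only those at the end of a line.

```shell
# has_cr FILE: succeeds if FILE contains at least one carriage return.
has_cr() {
  cr=$(printf '\r')              # a literal CR character
  LC_ALL=C grep -q "$cr" "$1"
}

# strip_cr IN OUT: copy IN to OUT with all carriage returns removed.
strip_cr() {
  tr -d '\r' < "$1" > "$2"
}

# Demonstration on a temporary file with Windows line endings.
tmp_in=$(mktemp)
tmp_out=$(mktemp)
printf 'foo\r\nbar\r\n' > "$tmp_in"

if has_cr "$tmp_in"; then
  echo "input has CRs, converting"
  strip_cr "$tmp_in" "$tmp_out"
fi

if has_cr "$tmp_out"; then
  echo "output still has CRs"
else
  echo "output is clean"
fi

rm -f "$tmp_in" "$tmp_out"
```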
Good luck!
I'm writing strings which contain backslashes (\) to a file:
x1 = "\\str"
x2 = "\\\str"
# Error: '\s' is an unrecognized escape in character string starting "\\\s"
x2="\\\\str"
write(file = 'test', c(x1, x2))
When I open the file named test, I see this:
\str
\\str
If I want to get a string containing 5 backslashes, should I write 10 backslashes, like this?
x = "\\\\\\\\\\str"
[...] If I want to get a string containing 5 \ ,should i write 10 \ [...]
Yes, you should. To write a single \ in a string, you write it as "\\".
This is because the \ is a special character, reserved to escape the character that follows it. (Perhaps you recognize \n as newline.) It's also useful if you want to write a string containing a single ". You write it as "\"".
The reason why \\\str is invalid, is because it's interpreted as \\ (which corresponds to a single \) followed by \s, which is not valid, since "escaped s" has no meaning.
Have a read of this section about character vectors.
In essence, it says that when you enter character string literals you enclose them in a pair of quotes (" or '). Inside those quotes, you can create special characters using \ as an escape character.
For example, \n denotes new line or \" can be used to enter a " without R thinking it's the end of the string. Since \ is an escape character, you need a way to enter an actual \. This is done by using \\. Escaping the escape!
Note that the doubling of backslashes is because you are entering the string at the command line and the string is first parsed by the R parser. You can enter strings in different ways, some of which don't need the doubling. For example:
> tmp <- scan(what='')
1: \\\\\str
2:
Read 1 item
> print(tmp)
[1] "\\\\\\\\\\str"
> cat(tmp, '\n')
\\\\\str
>
Zsh seems to do some weird backslashing when you try to echo a bunch of backslashes. I cannot seem to figure out a very clear pattern to this. Any reasons for this madness? Of course, if I actually wanted to use backslashes properly, then I'd use proper quoting etc, but why does this happen in the first place?
Here's a small example to show the same:
$ echo \\
\
$ echo \\ \\
\ \
$ echo \\ \\ \\
\ \ \
$ echo \\ \\ \\ \\
\ \ \ \
$ echo \\\\ \\ \\
\ \ \
$ echo \\\\\\ \\
\\ \
$ echo \\\\\\\\
\\
I initially independently discovered this a while ago, but was reminded of it by this tweet by Zach Riggle.
In the first step, the echo command is not special. The command line is parsed by rules that are independent of what command is being executed. The overall effect of this step is to convert your command from a series of characters to a series of words.
The two general parsing rules you need to know to understand this example are: the space character separates words, and the backslash character escapes special characters, including itself.
So the command echo \\ becomes a list of 2 words:
echo
\
The first backslash escapes the second one, resulting in a single backslash being in the second word.
echo \\ \\ \\ \\
becomes this list of words:
echo
\
\
\
\
Now command line parsing is done. Only now does the shell look for a command named by the first word. Up until now, the fact that the command is echo has been irrelevant. If you'd said cat \\ \\ \\ \\, cat would be invoked with 4 argument words, each containing a single backslash.
Normally when you run echo you'll be getting the shell builtin command. The zsh builtin echo has configurable behavior. I like to use setopt BSD_ECHO to select BSD-style echo behavior, but from your sample output it appears you are in the default mode, SysV-style.
BSD-style echo doesn't do any backslash processing, it would just print them as it received them.
SysV echo processes backslash escapes like in C strings - \t becomes a tab character, \r becomes a carriage return, etc. Also \c is interpreted as "end the output without a newline".
So if you said echo a\\tb then the shell parsing would result in a single backslash in the argument word given to echo, and echo would interpret a\tb and print a and b separated by a tab. It would be more readable if written as echo 'a\tb', using apostrophes to provide quoting at the shell-command-parsing level. Likewise echo \\\\ is two backslashes after command line parsing, so echo sees \\ and outputs one backslash. If you wanted to print literally a\tb without using another form of quoting, you'd have to say echo a\\\\tb.
So the shell has a simple rule - two backslashes on the command line to make one backslash in the argument word. And echo has a simple rule - two backslashes in the argument word to make one backslash in the output.
But there's a problem... when echo does its thing, a backslash followed by t means output a tab, a backslash followed by a backslash means output a backslash... but there are lots of combinations that don't mean anything. A backslash followed by T for example is not a valid escape sequence. In C it would be a warning or an error. But the echo command tries to be more tolerant.
Try echo \\T or echo '\T' and you will discover that a backslash followed by anything that doesn't have a defined meaning as a backslash escape will just cause echo to output both characters as-is.
Which brings us to the last case: what if the backslash isn't followed by anything at all? What if it's the last character in the argument word? In that case, echo just outputs the backslash.
So in summary, two backslashes in the argument word result in one backslash in the output. But one backslash in the argument word also results in one backslash in the output, if it is the last character in the word or if the backslash together with the next character don't form a valid escape sequence.
The command line echo \\\\ thus becomes the word list
echo
\\
which outputs a single backslash "properly", with quoting applied at all levels.
The command line echo \\ becomes the word list
echo
\
which outputs a single backslash "messily", because echo found a stray backslash at the end of the argument and was generous enough to output it for you even though it wasn't escaped.
The rest of the examples should be clear from these principles.
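To look at the two steps separately, it can help to print the argument words exactly as the shell produced them, before any echo-style processing. The sketch below uses printf for that; the helper name show_words is made up for illustration. The second step is only approximated with printf's %b, which handles \\ and \t the same way; how a stray or trailing backslash is treated is specific to each echo or printf implementation, so it is not shown here.

```shell
# show_words: hypothetical helper that prints each argument word in
# brackets, with no backslash processing at all.
show_words() {
  printf '[%s]' "$@"
  printf '\n'
}

# Step 1 only (shell parsing): two backslashes typed give one in the word.
show_words \\\\ \\ \\      # prints [\\][\][\]
show_words \\\\\\\\        # prints [\\\\]

# Step 2, approximated with %b: two backslashes in the word give one
# backslash in the output.
printf '%b\n' \\\\         # the word is \\ and the output is a single \
printf '%b\n' 'a\tb'       # a and b separated by a tab
```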
I need a unix command to verify the file has ASCII printable characters only (between ASCII Hex 20 and 7E inclusive).
I found the command below, which checks whether a file contains non-ASCII characters, but I cannot work out how to adapt it to my question above.
if LC_ALL=C grep -q '[^[:print:][:space:]]' file; then
echo "file contains non-ascii characters"
else
echo "file contains ascii characters only"
fi
nice to have:
- Stop loading results. Sometimes one is enough
To find 20 to 7E characters in a file you can use:
grep -P "[\x20-\x7E]" file
Note the usage of -P to perform Perl regular expressions.
But in this case you want to check whether the file contains only these kinds of characters. So the best thing to do is to check whether any character falls outside this range, that is, to match against [^range]:
grep -P "[^\x20-\x7E]" file
All together, I would say:
grep -qP "[^\x20-\x7E]" file && echo "weird ASCII" || echo "clean one"
This can be done in unix using the POSIX grep options:
if LC_ALL=C grep -q '[^ -~]' file; then
echo "file contains non-ascii characters"
else
echo "file contains ascii characters only"
fi
where the characters in [ ... ] are ^ (caret), space, - (ASCII minus sign), ~ (tilde).
You could also allow the ASCII tab by adding it to the bracket expression. The standard refers to these as collating elements. Note that \x (hexadecimal) and \0 (octal) escapes are not interpreted inside a POSIX bracket expression, where a backslash is an ordinary character, so the tab has to be entered literally (in shells that support it, for example as $'[^\t -~]').
According to the description, by default -e accepts a basic regular expression (BRE). If you added a -E, you could have an extended regular expression (but that is not needed).
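Wrapped in a function, the POSIX variant could look like the sketch below. The name is_printable_ascii is made up for illustration. Note that a tab counts as non-printable here, while the newline at the end of each line is never seen by the pattern because grep works line by line.

```shell
# is_printable_ascii FILE: succeeds if every character on every line of
# FILE lies between space (0x20) and tilde (0x7E) inclusive.
is_printable_ascii() {
  if LC_ALL=C grep -q '[^ -~]' "$1"; then
    return 1    # at least one character outside the range
  else
    return 0
  fi
}

# Demonstration on a temporary file.
sample=$(mktemp)

printf 'Hello, world! ~\n' > "$sample"
if is_printable_ascii "$sample"; then
  echo "first sample: printable ASCII only"
fi

printf 'tab\there\n' > "$sample"
if ! is_printable_ascii "$sample"; then
  echo "second sample: contains other characters"
fi

rm -f "$sample"
```

Because grep -q stops at the first match, the check does not scan the rest of the file once one offending character is found.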