Zsh backslash madness? - zsh

Zsh seems to do some weird backslashing when you try to echo a bunch of backslashes. I cannot seem to figure out a very clear pattern to this. Any reasons for this madness? Of course, if I actually wanted to use backslashes properly, then I'd use proper quoting etc, but why does this happen in the first place?
Here's a small example to show the same:
$ echo \\
\
$ echo \\ \\
\ \
$ echo \\ \\ \\
\ \ \
$ echo \\ \\ \\ \\
\ \ \ \
$ echo \\\\ \\ \\
\ \ \
$ echo \\\\\\ \\
\\ \
$ echo \\\\\\\\
\\
I initially independently discovered this a while ago, but was reminded of it by this tweet by Zach Riggle.

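Two separate steps process your backslashes: first the shell parses the command line, and then echo does its own processing. In the first step, the echo command is not special. The command line is parsed by rules that are independent of what command is being executed. The overall effect of this step is to convert your command from a series of characters into a series of words.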
The two general parsing rules you need to know to understand this example are: the space character separates words, and the backslash character escapes special characters, including itself.
So the command echo \\ becomes a list of 2 words:
echo
\
The first backslash escapes the second one, resulting in a single backslash being in the second word.
echo \\ \\ \\ \\
becomes this list of words:
echo
\
\
\
\
Now command line parsing is done. Only now does the shell look for a command named by the first word. Up until now, the fact that the command is echo has been irrelevant. If you'd said cat \\ \\ \\ \\, cat would be invoked with 4 argument words, each containing a single backslash.
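To look at the word list by itself, without echo's own processing getting in the way, something like the following sketch can help. The helper name show_args is made up for illustration; it relies only on printf with %s, which copies its arguments verbatim.

```shell
# Print the argument count, then each argument on its own line between < and >.
# printf with %s does no backslash processing, so only shell parsing is visible.
show_args() {
  printf '%d args\n' "$#"
  printf '<%s>\n' "$@"
}

# Unquoted backslashes: the shell turns each pair into one literal backslash.
out=$(show_args \\ \\\\ \\\\\\)
printf '%s\n' "$out"
```

The three arguments arrive as one, two and three backslashes respectively, whatever command they are handed to.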
Normally when you run echo you'll be getting the shell builtin command. The zsh builtin echo has configurable behavior. I like to use setopt BSD_ECHO to select BSD-style echo behavior, but from your sample output it appears you are in the default mode, SysV-style.
BSD-style echo doesn't do any backslash processing, it would just print them as it received them.
SysV echo processes backslash escapes like in C strings - \t becomes a tab character, \r becomes a carriage return, etc. Also \c is interpreted as "end the output without a newline".
So if you said echo a\\tb then the shell parsing would result in a single backslash in the argument word given to echo, and echo would interpret a\tb and print a and b separated by a tab. It would be more readable if written as echo 'a\tb', using apostrophes to provide quoting at the shell-command-parsing level. Likewise echo \\\\ is two backslashes after command line parsing, so echo sees \\ and outputs one backslash. If you wanted to print literally a\tb without using another form of quoting, you'd have to say echo a\\\\tb.
So the shell has a simple rule - two backslashes on the command line to make one backslash in the argument word. And echo has a simple rule - two backslashes in the argument word to make one backslash in the output.
But there's a problem... when echo does its thing, a backslash followed by t means output a tab, a backslash followed by a backslash means output a backslash... but there are lots of combinations that don't mean anything. A backslash followed by T for example is not a valid escape sequence. In C it would be a warning or an error. But the echo command tries to be more tolerant.
Try echo \\T or echo '\T' and you will discover that a backslash followed by anything that doesn't have a defined meaning as a backslash escape will just cause echo to output both characters as-is.
Which brings us to the last case: what if the backslash isn't followed by anything at all? What if it's the last character in the argument word? In that case, echo just outputs the backslash.
So in summary, two backslashes in the argument word result in one backslash in the output. But one backslash in the argument word also results in one backslash in the output, if it is the last character in the word or if the backslash together with the next character don't form a valid escape sequence.
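These three cases can be reproduced outside zsh with printf's %b conversion, which processes backslash escapes in its argument much like a SysV-style echo. This is only a stand-in for the zsh builtin, and the handling of stray or undefined escapes shown here is what bash does; other printf implementations may differ. The variable names are illustrative.

```shell
# Single quotes keep the shell from removing any backslashes first,
# so %b sees exactly what is written between the quotes.
two=$(printf '%b' '\\')   # an escaped backslash
one=$(printf '%b' '\')    # a stray backslash at the end of the word
unk=$(printf '%b' '\T')   # a backslash followed by an undefined escape character
printf '%s %s %s\n' "$two" "$one" "$unk"
```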
The command line echo \\\\ thus becomes the word list
echo
\\
which outputs a single backslash "properly", with quoting applied at all levels.
The command line echo \\ becomes the word list
echo
\
which outputs a single backslash "messily", because echo found a stray backslash at the end of the argument and was generous enough to output it for you even though it wasn't escaped.
The rest of the examples should be clear from these principles.

Related

How to split a string delimited by new lines in /bin/sh

I have a string like FIRSTNAME\nLASTNAME in a file. I want to put the first name and last name in their own variables. I have to use /bin/sh, not bash.
How can I easily do this?
You can use parameter expansion operators to strip everything following the \n as well as everything preceding it.
$ s="Stealth\nRabbi"
$ first=${s%\\n*}
$ last=${s#*\\n}
$ echo "$first"
Stealth
$ echo "$last"
Rabbi
Note that there are no literal newlines involved; you have two characters, \ and n.
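Since the question says the string lives in a file, here is a minimal sketch of the same expansions applied to a line read from a file. The temporary file stands in for the real one, and mktemp is assumed to be available.

```shell
# Create a stand-in file holding the two-character sequence \n between the names.
tmp=$(mktemp)
printf '%s\n' 'Stealth\nRabbi' > "$tmp"

# read -r keeps the backslash instead of treating it as an escape character.
IFS= read -r s < "$tmp"
rm -f "$tmp"

first=${s%\\n*}   # strip the \n and everything after it
last=${s#*\\n}    # strip everything up to and including the \n
printf '%s / %s\n' "$first" "$last"
```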

Zsh: Why is \n interpreted within single quotes?

From my understanding, a \n should be interpreted as a new line in
"a\nb"
$'a\nb'
but not in
'a\nb'
However, I see the following:
(transcript from my zsh session follows)
-0-1- ~ > zsh --version
zsh 5.1.1 (x86_64-unknown-cygwin)
-0-1- ~ > echo 'a\nb'
a
b
-0-1- ~ >
zsh itself does not interpret the \n in 'a\nb', but the builtin echo does. 'a\nb' and "a\nb" are equivalent and do not contain a newline but the (literal) character sequence \n. $'a\nb' on the other hand actually does contain a newline.
There are two things at work here:
1. Quoting:
Quoting is used to tell the shell that you want a character itself, not any special meaning it may have in the shell's syntax. zsh has four types of quoting, some of which may retain or add special meaning to a few characters or character sequences:
\: Quoting of single characters by prepending \. For zsh saying \n only means "I want the character n". As n has no special meaning, it is the same as writing just n. This changes with characters like *: Without quoting * is used as a wildcard for globbing, writing \* prevents that (for example: compare outputs of echo /* and echo /\*). If you want to pass the character \ literally, you have to quote it, for example with another \: \\.
'...': Just like with \ any character within '...' is taken literally, this includes other methods of quoting. 'foo* bar?' is essentially equivalent to foo\*\ bar\? (or \f\o\o\*\ \b\a\r\? if one wants to be pedantic). Only ' itself cannot appear inside a string quoted with '...' as it would mark the end of the quote. (This can be changed by setting RC_QUOTES. If set, a pair of single quotes is taken as a single quote: 'foo''bar' → foo\'bar)
"...": Allows for parameter and command substitution. That means words after a $ are taken as parameter names (e.g. "path: $PATH") and strings wrapped in $(...) or `...` are used for command substitution (e.g. echo "Date: $(date)"). Otherwise it behaves like '...' with the exception that it allows the quoting of `, $, \ and " with \.
$'...': strings inside $'...' are treated like string arguments of the print builtin. Only here do some alphabetic characters indeed have special meaning: \n for newline, \b for backspace, \a for the bell character, etc. \ and ' can be quoted with \\ and \' respectively. The resulting string is considered fully quoted. There is no parameter or command substitution inside $'...'.
2. The zsh-builtin echo:
Unlike the echo binary or the bash-builtin echo, the zsh-builtin by default recognizes (some) escape sequences like \n or \t. While this behavior usually needs to be explicitly enabled with /bin/echo or the bash-builtin (usually by passing the -e flag: echo -e "foo\nbar"), for the zsh-builtin it needs to be explicitly disabled, either by passing the -E flag (echo -E 'foo\nbar') or by setting the BSD_ECHO option (setopt bsdecho), in which case the -e flag can be used to re-enable the feature like with the other types of echo.
Conclusion
That means that both 'a\nb' and "a\nb" (and a\\nb for that matter) are passed as a\nb (literally), but the zsh-builtin echo then interprets the \n, leading to an output with a newline. On the other hand $'a\nb' contains a literal newline already before it is passed to echo.
Running
for quoted_string in a\\nb 'a\nb' "a\nb" $'a\nb'; do
echo '>' $quoted_string '<'
/bin/echo -e '>' $quoted_string '<'
echo -E '>' $quoted_string '<'
/bin/echo '>' $quoted_string '<'
echo
done
should get you the following output:
> a
b <
> a
b <
> a\nb <
> a\nb <
> a
b <
> a
b <
> a\nb <
> a\nb <
> a
b <
> a
b <
> a\nb <
> a\nb <
> a
b <
> a
b <
> a
b <
> a
b <
As you can see there is no difference between the first three kinds of quoting, while the fourth always prints with a newline.
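The same point can be checked without involving echo at all by looking at string lengths. This is a small sketch with made-up variable names; to stay within plain sh syntax it uses a quoted literal newline in place of $'a\nb'.

```shell
s1=a\\nb     # backslash quoting
s2='a\nb'    # single quotes
s3="a\nb"    # double quotes
s4="a
b"           # a real newline between a and b

# The first three hold four characters (a, \, n, b); the last holds three.
printf '%s %s %s %s\n' "${#s1}" "${#s2}" "${#s3}" "${#s4}"
```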
BTW: perl (at least version 5; I do not know about Perl 6) behaves in the way you describe the expected behavior. In perl '...' behaves like it does in zsh. On the other hand, "..." in perl behaves like a combination of "..." and $'...' of zsh: variables are replaced by their value and character sequences like \n and \t are treated specially.
You are wrong about "a\nb", which does not replace \n with a newline (well, not portably, anyway). So there is no standard quoting mechanism that allows a newline in a string without using a literal newline, as in
x="a
b"
This is why the $'a\nb' notation was invented to do just that. It's not POSIX as of 2016, but the POSIX folks are thinking about adding it to the shell specification. As always, don't hold your breath :-)
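Where $'...' is not available, one common workaround is to let printf produce the newline. The sketch below uses illustrative variable names; the trailing x is only there because command substitution strips trailing newlines.

```shell
# Capture a newline followed by a sentinel, then remove the sentinel.
nl=$(printf '\nx')
nl=${nl%x}

# Build the string with an embedded newline, without typing one literally.
x="a${nl}b"
printf '%s\n' "$x"
```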
A new line is a character, not a variable, so there is no interpretation (or expansion) by the shell. If you want to print a \, you need to escape it twice: once so that the shell passes a backslash through, and once so that echo does not treat it as the start of a special character.
So you have to use echo a\\\\nb or echo 'a\\nb'

Check if a file contains certain ASCII characters

I need a unix command to verify the file has ASCII printable characters only (between ASCII Hex 20 and 7E inclusive).
I found the command below, which checks whether a file contains non-ASCII characters, but I cannot figure out how to adapt it to my question above.
if LC_ALL=C grep -q '[^[:print:][:space:]]' file; then
echo "file contains non-ascii characters"
else
echo "file contains ascii characters only"
fi
nice to have:
- Stop loading results. Sometimes one is enough
To find 20 to 7E characters in a file you can use:
grep -P "[\x20-\x7E]" file
Note the usage of -P to perform Perl regular expressions.
But in this case you want to check whether the file contains only these kinds of characters. So the best thing to do is to check whether any character falls outside this range, that is, check [^range]:
grep -P "[^\x20-\x7E]" file
All together, I would say:
grep -qP "[^\x20-\x7E]" file && echo "weird ASCII" || echo "clean one"
This can be done in unix using the POSIX grep options:
if LC_ALL=C grep -q '[^ -~]' file; then
echo "file contains non-ascii characters"
else
echo "file contains ascii characters only"
fi
where the characters in [ ... ] are ^ (caret), space, - (ASCII minus sign), ~ (tilde).
You could also allow ASCII tab. The standard refers to these as collating elements. Note that inside a POSIX bracket expression the backslash is an ordinary character, so \x09 or \011 will not portably match a tab; put a literal tab character between the brackets instead.
According to the description, by default -e accepts a basic regular expression (BRE). If you added a -E, you could have an extended regular expression (but that is not needed).
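Wrapped as a function, the check might look like the sketch below. The name is_printable_ascii and the sample files are invented for illustration, and mktemp is assumed to be available. Because grep works line by line, the newline at the end of each line is not itself tested.

```shell
# Succeed only if grep finds no byte outside the range space..tilde.
# grep exits 0 on a match, 1 on no match, 2 on error; only 1 counts as "clean".
is_printable_ascii() {
  rc=0
  LC_ALL=C grep -q '[^ -~]' "$1" || rc=$?
  [ "$rc" -eq 1 ]
}

clean=$(mktemp)
dirty=$(mktemp)
printf 'plain text, 100%% printable\n' > "$clean"
printf 'tab\there\n' > "$dirty"   # contains a tab (0x09), outside the range

is_printable_ascii "$clean" && clean_result=ok || clean_result=bad
is_printable_ascii "$dirty" && dirty_result=ok || dirty_result=bad
rm -f "$clean" "$dirty"
printf '%s %s\n' "$clean_result" "$dirty_result"
```

The -q option also covers the "one is enough" wish from the question, since grep can stop at the first match.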

Zsh escape backslash

I noticed a while ago already that in zsh, you can get a \ by typing \\ like in bash.
> echo \\
\
But there's a strange phenomenon with 4 backslashes in zsh.
bash$ echo \\\\
\\
zsh> echo \\\\
\
Do you know why ? Is it a bug ?
No, this is not a bug. It's just that the echo implementations in these shells
have different default settings for the interpretation of backslash sequences.
In either shell the command-line parser will remove one layer of backslashes,
converting 4 backslashes to 2. That argument is then passed to the echo
builtin command. When echo interprets backslash sequences, 1 backslash is
output for that pair; if echo isn't doing backslash interpretation,
2 backslashes will be output.
In either shell's implementation of echo the -e or -E option can be used
to respectively enable or disable backslash interpretation. So the following
will produce the same output in either shell:
echo -e \\\\
echo -E \\\\
Both shells also have shell-level options to alter the default behaviour of
their echo command. In zsh the default can be changed with setopt BSD_echo,
to change the default in bash the command is shopt -s xpg_echo.
If you're trying to write portable shell scripts, you'd be best served by
avoiding use of echo altogether; it is one of the least portable commands
around. Use printf instead.
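As a rough illustration of the printf route: with the %s conversion the arguments are printed exactly as the shell delivered them, so there are no echo defaults to account for. The variable name is illustrative.

```shell
# %s prints each argument verbatim; the format string supplies the newline.
# \\\\ reaches printf as two backslashes, 'a\tb' as the four characters a \ t b.
out=$(printf '%s\n' \\\\ 'a\tb')
printf '%s\n' "$out"
```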

grep for special characters in Unix

I have a log file (application.log) which might contain the following string of normal & special characters on multiple lines:
*^%Q&$*&^#$&*!^#$*&^&^*&^&
I want to search for the line number(s) which contains this special character string.
grep '*^%Q&$*&^#$&*!^#$*&^&^*&^&' application.log
The above command doesn't return any results.
What would be the correct syntax to get the line numbers?
Tell grep to treat your input as fixed string using -F option.
grep -F '*^%Q&$*&^#$&*!^#$*&^&^*&^&' application.log
Option -n is required to get the line number,
grep -Fn '*^%Q&$*&^#$&*!^#$*&^&^*&^&' application.log
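A quick way to try this out, sketched with a throwaway file in place of application.log (mktemp is assumed to be available, and the sample lines are made up):

```shell
pattern='*^%Q&$*&^#$&*!^#$*&^&^*&^&'

# Build a small sample log where the string appears on lines 2 and 4.
log=$(mktemp)
printf '%s\n' 'first line' "$pattern" 'third line' "prefix $pattern suffix" > "$log"

# -F: fixed string, -n: prefix matches with line numbers; keep only the numbers.
lines=$(grep -Fn -- "$pattern" "$log" | cut -d: -f1 | tr '\n' ' ')
rm -f "$log"
printf '%s\n' "$lines"
```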
The one that worked for me is:
grep -e '->'
The -e means that the next argument is the pattern, and won't be interpreted as an option.
From: http://www.linuxquestions.org/questions/programming-9/how-to-grep-for-string-769460/
A related note
To grep for carriage return, namely the \r character, or 0x0d, we can do this:
grep -F $'\r' application.log
Alternatively, use printf, or echo, for POSIX compatibility
grep -F "$(printf '\r')" application.log
And we can use hexdump, or less to see the result:
$ printf "a\rb" | grep -F $'\r' | hexdump -c
0000000 a \r b \n
Regarding the use of $'\r' and other supported characters, see Bash Manual > ANSI-C Quoting:
Words of the form $'string' are treated specially. The word expands to string, with backslash-escaped characters replaced as specified by the ANSI C standard
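For shells without $'...', the printf variant can be used to count lines that contain a carriage return, for example to spot Windows line endings. This is a sketch with a made-up sample file; mktemp is assumed to be available.

```shell
# Command substitution only strips trailing newlines, so the \r survives.
cr=$(printf '\r')

f=$(mktemp)
printf 'unix line\nwindows line\r\nanother\r\n' > "$f"

# -c counts matching lines, -F treats the carriage return as a fixed string.
count=$(grep -cF "$cr" "$f")
rm -f "$f"
printf '%s\n' "$count"
```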
grep -n "\*\^\%\Q\&\$\&\^\#\$\&\!\^\#\$\&\^\&\^\&\^\&" test.log
1:*^%Q&$&^#$&!^#$&^&^&^&
8:*^%Q&$&^#$&!^#$&^&^&^&
14:*^%Q&$&^#$&!^#$&^&^&^&
You could try excluding the lines that consist only of alphanumeric characters and spaces; adding -n then gives you the line numbers of what remains. Try the following:
grep -vn "^[a-zA-Z0-9 ]*$" application.log
Try vi with the -b option, this will show special end of line characters
(I typically use it to see windows line endings in a txt file on a unix OS)
But if you want a scripted solution, obviously vi won't work, so you can try the -f or -e options with grep and pipe the result into sed or awk.
From grep man page:
Matcher Selection
-E, --extended-regexp
Interpret PATTERN as an extended regular expression (ERE, see below). (-E is specified by POSIX.)
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched. (-F is specified by POSIX.)
