What does #! Before a path name do in a ksh script - unix

Basically I want to know what this line of scripting code does
function make_expand_query_string_shell {
cat <<DONE | tr '@' '#'
@!/usr/bin/ksh
DONE

cat <<DONE | tr '@' '#'
@!/usr/bin/ksh
DONE
This is a Unix "pipeline" tying together a few useful utilities to create some output.
The shell itself is going to read the first line, and break it down roughly as follows:
cat — is the name of a program, which will be found on the PATH. The cat program is used to concatenate files together.
<< — is used to "redirect" the standard input to the program coming before it. Since there is nothing between cat and <<, the program will be started without any command-line parameters (e.g. filenames), and like many shell utilities, will expect its input from the "standard input" stream.
DONE is a symbol that is, essentially, a parameter to <<.
| instructs the shell to "pipe" the standard output from the program to its left (cat) to the program to its right (tr).
tr is the name of another program. Its purpose is to translate or transpose characters.
'@' '#' are command-line parameters to tr.
The << feature is called a "here-document." Every Unix program starts its life with three standard I/O streams (except under unusual circumstances) — its standard input, output, and error output. Normally, all three are connected to your terminal.
In this case, however, << will essentially link the standard input to the sequence of lines in the script file, itself, until it reads a line that matches the ending symbol given — in this case, DONE. It's called a "here-document" because the document being fed to the input is given "here" — immediately in the script file, itself.
As @KeithThompson recommended, you could have found this in the ksh manual, by searching for "<<":
<<[-]word
The shell input is read up to a line that is the same as
word after any quoting has been removed, or to an end-of-
file. No parameter substitution, command substitution,
arithmetic substitution or file name generation is per-
formed on word. The resulting document, called a here-
document, becomes the standard input. If any character
of word is quoted, then no interpretation is placed upon
the characters of the document; otherwise, parameter
expansion, command substitution, and arithmetic substitu-
tion occur, \new-line is ignored, and \ must be used to
quote the characters \, $, `. If - is appended to <<,
then all leading tabs are stripped from word and from the
document. If # is appended to <<, then leading spaces
and tabs will be stripped off the first line of the docu-
ment and up to an equivalent indentation will be stripped
from the remaining lines and from word. A tab stop is
assumed to occur at every 8 columns for the purposes of
determining the indentation.
Likewise, the | is taking the output from cat and passing it directly to the input of tr.
So, what do these two programs do? Let's check their manuals.
NAME
cat - concatenate files and print on the standard output
SYNOPSIS
cat [OPTION]... [FILE]...
DESCRIPTION
Concatenate FILE(s), or standard input, to standard output.
OK … so, this will concatenate its standard input to its standard output. What about tr?
NAME
tr - translate or delete characters
SYNOPSIS
tr [OPTION]... SET1 [SET2]
DESCRIPTION
Translate, squeeze, and/or delete characters from standard input, writ-
ing to standard output.
…
SETs are specified as strings of characters. Most represent them-
selves.
So tr will translate a character in SET1 to the character in the same position in SET2. Looks like we have two sets with only one member each, so it's easy to see what will happen.
Since cat does nothing to its input except copy it to its output, it is effectively just feeding the here-document to tr. In turn, tr replaces every @ in its input with a #.
This creates, as its output, a typical Unix "shebang" line, of #!/usr/bin/ksh.
The entire sequence is a much more ornate version of
echo '#!/usr/bin/ksh'
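For comparison, the cat | tr step can be imitated in Python with str.translate. This is only a sketch: it assumes the here-document body carries a placeholder character (an @ here, chosen for illustration) that tr maps to #.

```python
# Stand-in for the here-document body; '@' is an assumed placeholder character.
here_document = "@!/usr/bin/ksh\n"

# str.maketrans/translate maps each character of the first string to the
# character at the same position in the second string, much like tr SET1 SET2.
table = str.maketrans("@", "#")
shebang = here_document.translate(table)

print(shebang, end="")
```

Writing the placeholder instead of a literal # is one way to keep the generated line from looking like a comment or shebang inside the generating script itself.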

Related

Can sort command be used to sort file based on multiple columns in a csv file

We have a requirement where we have a csv file with a custom delimiter '||' (double pipes). We have 40 columns in the file and the file size is between 400 and 500 MB.
We need to sort the file based on 2 columns, first on column 4 and then by column 17.
We found this command, with which we can sort on one column, but we are not able to find a command which can sort on both columns.
Since we use a delimiter with 2 characters, we are using awk command for sorting.
Command:
awk -F \|\| '{print $4}' abc.csv | sort > output.csv
Please advise.
If your inputs are not too fancy (no newlines in the middle of a record, for instance), the sort utility can almost do what you want, but it supports only one-character field separators. So || would not work. But wait, if you do not have other | characters in your files, we could just consider | as the field separator and account for the extra empty fields:
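sort -t'|' -k7,7 -k33,33 foo.csv
We sort by fields 7 (instead of 4) and then 33 (instead of 17) because of these extra empty fields. The formula that gives the new field number is simply 2*N-1 where N is the original field number. Each key is written as -kN,N so that it covers only that field; a bare -k7 would compare everything from field 7 to the end of the line.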
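The 2*N-1 formula can be checked with a few lines of Python. The record below is made up for illustration; the only assumption is that no field contains a | of its own.

```python
# A made-up record that uses '||' as its delimiter.
record = "a||b||c||d"

double_fields = record.split("||")   # the fields as the author sees them
single_fields = record.split("|")    # the fields as sort -t'|' sees them


def single_pipe_field(n):
    """Map a 1-based '||' field number to the 1-based '|' field number."""
    return 2 * n - 1


for n, value in enumerate(double_fields, start=1):
    # Every real field reappears at position 2*N-1; the positions in
    # between are the empty fields created by the doubled delimiter.
    assert single_fields[single_pipe_field(n) - 1] == value

print(single_pipe_field(4), single_pipe_field(17))
```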
If you do have | characters inside your fields a simple solution is to substitute them all by one unused character, sort, and restore the original ||. Example with tabs:
sed 's/||/\t/g' foo.csv | sort -t$'\t' -k4,4 -k17,17 | sed 's/\t/||/g'
If tab is also used in your fields choose any unused character instead. Form feed (\f) or the file separator (ASCII code 28, that is, replace the 3 \t with \x1c) are good candidates.
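Using PROCINFO in gnu-awk you can use this solution to sort on a multi-character delimiter. Note that records sharing the same pair of key values overwrite each other in the array, so this only works when the combination of the two columns is unique:
awk -F '\\|\\|' '{a[$4,$17] = $0} END {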
PROCINFO["sorted_in"]="@ind_str_asc"; for (i in a) print a[i]}' file.csv
You could try the following awk code, written along the lines of your attempt. It sets OFS to | (the output field separator; change it if you prefer a comma or something else) and prints the 17th field as well as the 4th. sort then works on fields 1 and 2, because the 4th and 17th fields have become the 1st and 2nd fields of awk's output. Note that only these two columns end up in output.csv, not the whole records.
awk -F'\\|\\|' -v OFS='|' '{print $4,$17}' abc.csv | sort -t'|' -k1,1 -k2,2 > output.csv
The sort command works on physical lines, which may or may not be acceptable. CSV files can contain quoted fields which contain newlines, which will throw off sort (and most other Unix line-oriented utilities; it's hard to write a correct Awk script for this scenario, too).
If you need to be able to manipulate arbitrary CSV files, probably look to a dedicated utility, or use a scripting language with proper CSV support. For example, assume you have a file like this:
Title,Number,Arbitrary text
"He said, ""Hello""",2,"There can be
newlines and
stuff"
No problem,1,Simple undramatic single-line CSV
In case it's not obvious, CSV is fundamentally just a text file, with some restrictions on how it can be formatted. To be valid CSV, every record should be comma-separated; any literal commas or newlines in the data need to be quoted, and any literal quotes need to be doubled. There are many variations; different tools accept slightly different dialects. One common variation is TSV which uses tabs instead of commas as delimiters.
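To see why line-oriented tools struggle with such a file, here is a small sketch that feeds the sample above to Python's csv module from an in-memory string (the variable names are only for illustration). The text spans five physical lines but holds just three records.

```python
import csv
import io

# The sample file from above, held in memory instead of on disk.
sample = (
    'Title,Number,Arbitrary text\n'
    '"He said, ""Hello""",2,"There can be\n'
    'newlines and\n'
    'stuff"\n'
    'No problem,1,Simple undramatic single-line CSV\n'
)

# newline="" leaves embedded line breaks for the csv module to interpret.
rows = list(csv.reader(io.StringIO(sample, newline="")))

physical_lines = len(sample.splitlines())
print(physical_lines, "physical lines,", len(rows), "CSV records")
```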
Here is a simple Python script which sorts the above file on the second field.
import csv
import sys
# newline="" lets the csv module handle newlines inside quoted fields
with open("test.csv", "r", newline="") as csvfile:
    csvdata = csv.reader(csvfile)
    lines = [line for line in csvdata]
titles = lines.pop(0)  # comment out if you don't have a header
writer = csv.writer(sys.stdout)
writer.writerow(titles)  # comment out if you don't have a header
writer.writerows(sorted(lines, key=lambda x: x[1]))
Using sys.stdout for output is slightly unconventional; obviously, adapt to suit your needs. The Python csv library documentation is obviously not designed primarily to be friendly for beginners, but it should not be impossible to figure out, and it's not hard to find examples of working code.
In Python, sorted() returns a sorted copy of a list. There is also list.sort(), which sorts a list in place. Both accept an optional key keyword parameter to specify a custom sort order. To sort on the 4th and 17th fields, use
sorted(lines, key=lambda x: (x[3], x[16]))
(Python's indexing is zero-based, so [3] is the fourth element.)
To use | as a delimiter, specify delimiter='|' in the csv.reader() and csv.writer() calls. Unfortunately, Python doesn't easily let you use a multi-character delimiter, so you might have to preprocess the data to switch to a single-character delimiter which does not occur in the data, or properly quote the fields which contain the character you selected as your delimiter.
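If the data contains no quoted fields and no || inside a field, a plain str.split on the two-character delimiter may be enough and avoids the preprocessing step. The following is a minimal sketch under those assumptions; sort_double_pipe is a hypothetical helper name, and it sorts the lines in memory, so a 400 to 500 MB file needs a correspondingly large amount of RAM.

```python
def sort_double_pipe(lines, first=4, second=17, delimiter="||"):
    """Sort lines on two 1-based columns of a multi-character-delimited file.

    Assumes no quoting: every occurrence of the delimiter separates fields.
    Values are compared as strings, like plain sort without -n.
    """
    def key(line):
        fields = line.rstrip("\n").split(delimiter)
        return (fields[first - 1], fields[second - 1])

    return sorted(lines, key=key)


# Tiny made-up example with three columns, sorted on columns 2 and 3.
demo = ["x||b||2\n", "y||a||9\n", "z||b||1\n"]
print("".join(sort_double_pipe(demo, first=2, second=3)), end="")
```

For the real file, the list of lines would come from something like open("abc.csv").readlines(), with the default column numbers 4 and 17.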

unix SED command to replace part of key value pair

We have a requirement where I need to replace part of a parameter value in our configuration file.
Example
key1=123-456
I need to replace the value after the hyphen with a new value.
I got a command which is being used in other projects, but I am not sure how it works.
Command
[test]$ cat test_sed_key_value.txt
key1=123-456
[test]$ sed -i -e '/key1/ s/-.*$/-789/' test_sed_key_value.txt
[test]$
[test]$ cat test_sed_key_value.txt
key1=123-789
[test]$
It would be helpful if someone could explain how the above command works, or whether there is a simpler way to do this using sed.
Here is a list of parts of that commandline, each followed by a short explanation:
sed
which tool to use
-i
flag: edit the processed file in place (without keeping a backup copy of the input file)
-e
expression parameter: the sed code to apply follows
/key1/
"address": only process lines on which this regex applies, i.e. those containing the text "key1"
s/replacethis/withthis/
command: do a search-and-replace; "replacethis" and "withthis" are placeholders, explained in the next two entries
-.*$
regex: (what is actually in the commandline instead of "replacethis") a regular expression representing a "minus" followed by anything, in any number, until the end of the line
-789
literal: (what is actually in the commandline instead of "withthis") simply that string "-789"
test_sed_key_value.txt
file parameter: process this file
I cannot think of any way to do this simpler. The shown command already uses some assumptions on the formatting of the input file.
I'd add to Yunnosch's answer that here the "replacethis" is a regexp:
-.*$
See the GNU sed manual for an overview of the syntax of sed's regular expressions.
Asterisk means any number of repetitions of the previous thing, dot means any character, so .* means any sequence of characters (including an empty one).
$ is the end of the line.
You might want to be a bit more restrictive, since here you'd lose something in a line like this one for instance:
key1=123-456, key2=abc-def
replacing it by:
key1=123-789
removing completely the key2 part (since the .* takes all characters after the first dash until end of line).
So depending on the format of your values, you might prefer something like
-[0-9]*
(without the $), meaning a sequence of digits after the -
or
-[0-9a-zA-Z_]*
meaning a sequence of digits, letters, or underscores after the -
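The difference between the greedy and the restricted pattern can be tried out with Python's re module. This is only an illustration: Python's regex dialect is not identical to sed's, but these two patterns behave the same way in both.

```python
import re

line = "key1=123-456, key2=abc-def"

# Greedy: '.*' runs from the first dash to the end of the line,
# so the key2 part is swallowed as well.
greedy = re.sub(r"-.*$", "-789", line, count=1)

# Restricted: only the digits right after the first dash are replaced.
restricted = re.sub(r"-[0-9]*", "-789", line, count=1)

print(greedy)      # key1=123-789
print(restricted)  # key1=123-789, key2=abc-def
```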

programmatic grep command output

Is there a way to get XML or equivalent output from the grep command that can be passed on to other programs?
For example, grep can give the file names, line numbers and context of the pattern matched.
Filename and line number extraction can be done using some split command with delimiter ':'. However, if the filename contains ':' character (I know it is weird, but there is a possibility), it would need lot more processing.
With the context (grep -C option), it becomes even more complex. If the context of two matches overlaps, grep optimizes the output and it will be difficult to separate.
So I am wondering if grep command can simply generate an XML or JSON like output that other programs can just load.
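GNU grep has an option -Z (--null) that makes the output unambiguous: it prints a NUL byte after each file name instead of the usual colon, so a program can split on the first NUL even when the file name itself contains ':'.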
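As a rough sketch of how another program could consume that output, the Python below parses text in the shape that grep -nZ is assumed to produce: file name, NUL byte, line number, colon, matched line. The sample bytes are made up rather than captured from a real run, parse_grep_null is a hypothetical helper, and context lines from -C are deliberately not handled.

```python
import json


def parse_grep_null(output: bytes):
    """Turn 'name NUL number:text' records into a list of dictionaries."""
    records = []
    for raw in output.split(b"\n"):
        if b"\0" not in raw:
            continue  # skip empty lines and anything without a file name
        name, rest = raw.split(b"\0", 1)
        number, _, text = rest.partition(b":")
        records.append(
            {"file": name.decode(), "line": int(number), "text": text.decode()}
        )
    return records


# Made-up sample: the first file name contains a colon.
sample = b"odd:name.txt\x0012:first match\nplain.txt\x003:second: match\n"
print(json.dumps(parse_grep_null(sample), indent=2))
```

Because the split happens on the first NUL and then on the first colon after the line number, colons in the file name or in the matched text do not confuse the parser.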

Unix Text Processing - how to remove part of a file name from the results?

I'm searching through text files using grep and sed commands and I also want the file names displayed before my results. However, I'm trying to remove part of the file name when it is displayed.
The file names are formatted like this: aja_EPL_1999_03_01.txt
I want to have only the date without the beginning letters and without the .txt extension.
I've been searching for an answer and it seems like it's possible to do that with a sed or a grep command by using something like this to look forward and back and extract between _ and .txt:
(?<=_)\d+(?=\.)
But I must be doing something wrong, because it hasn't worked for me and I possibly have to add something as well, so that it doesn't extract only the first number, but the whole date. Thanks in advance.
Edit: Adding also the working command I've used just in case. I imagine whatever command is needed would have to go at the beginning?
sed '/^$/d' *.txt | grep -P '(^([A-ZÖÄÜÕŠŽ].*)?[Pp][Aa][Ll]{2}.*[^\.]$)' *.txt --colour -A 1
The results look like this:
aja_EPL_1999_03_02.txt:PALLILENNUD : korraga üritavad ümbermaailmalendu kaks meeskonda
A desired output would be this:
1999_03_02:PALLILENNUD : korraga üritavad ümbermaailmalendu kaks meeskonda
First off, you might want to think about your regular expression. While the one you have you say works, I wonder if it could be simplified. You told us:
(^([A-ZÖÄÜÕŠŽ].*)?[Pp][Aa][Ll]{2}.*[^\.]$)
It looks to me as if this is intended to match lines that start with a case insensitive "PALL", possibly preceded by any number of other characters that start with a capital letter, and that lines must not end in a dot (with -P, the \. inside the bracket expression is simply an escaped dot, so a backslash is not excluded). So valid lines might be any of:
PALLILENNUD : korraga üritavad etc etc
Õlu on kena. Do I have appalling speling?
Peeter Pall is a limnologist at EMU!
If you'd care to narrow down this description a little and perhaps provide some examples of lines that should be matched or skipped, we may be able to do better. For instance, your outer parentheses are probably unnecessary.
Now, let's clarify what your pipe isn't doing.
sed '/^$/d' *.txt
This reads all your .txt files as an input stream, deletes any empty lines, and prints the output to stdout.
grep -P 'regex' *.txt --otheroptions
This reads all your .txt files, and prints any lines that match regex. It does not read stdin.
So .. in the command line you're using right now, your sed command is utterly ignored, as sed's output is not being read by grep. You COULD instruct grep to read from both files and stdin:
$ echo "hello" > x.txt
$ echo "world" | grep "o" x.txt -
x.txt:hello
(standard input):world
But that's not what you're doing.
By default, when grep reads from multiple files, it will precede each match with the name of the file from whence that match originated. That's also what you're seeing in my example above -- two inputs, one x.txt and the other - a.k.a. stdin, separated by a colon from the match they supplied.
While grep does include the most minuscule capability for filtering (with -o, or GNU grep's \K with optional Perl compatible RE), it does NOT provide you with any options for formatting the filename. Since you can't change how grep prints the filename, you're limited to either parsing the output you've got, or using some other tool.
Parsing is easy, if your filenames are predictably structured as they seem to be from the two examples you've provided.
For this, we can ignore that these lines contain a file and data. For the purpose of the filter, they are a stream which follows a pattern. It looks like you want to strip off all characters from the beginning of each line up to and not including the first digit. You can do this by piping through sed:
sed 's/^[^0-9]*//'
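Or you can achieve the same effect by using grep's minimal filtering to return every match starting from the first digit:
grep -o '[0-9].*'
Both of these leave the .txt extension in place. To drop it as well, add a second substitution to the sed command:
sed 's/^[^0-9]*//; s/\.txt//'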
If this kind of pipe-fitting is not to your liking, you may want to replace your entire grep with something in awk that combines functionality:
$ awk '
  /\.$/ {next}                            # skip lines ending in a dot
  /^([A-ZÖÄÜÕŠŽ].*)?[Pp][Aa][Ll][Ll]/ {   # lines to match, "pall" in any case
    f=FILENAME
    sub(/^[^0-9]*/,"",f)                  # strip unwanted part of filename, like sed
    sub(/\.txt$/,"",f)                    # strip the extension as well
    printf "%s:%s\n", f, $0
    if ((getline) > 0)                    # simulate the "-A 1" from grep
      printf "%s:%s\n", f, $0
  }' *.txt
Note that I haven't tested this, because I don't have your data to work with.
Also, awk doesn't include any of the fancy terminal-dependent colourization that GNU grep provides through the --colour option.
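As for the look-around expression from the question, a short Python check shows why it returned only part of the date: \d+ cannot cross the underscores, so the only run of digits that is preceded by _ and followed by a dot is the last one. The second pattern is one possible alternative, assuming every file name contains a date in YYYY_MM_DD form.

```python
import re

name = "aja_EPL_1999_03_01.txt"

# The look-around from the question: digits preceded by "_" and followed by ".".
last_number = re.search(r"(?<=_)\d+(?=\.)", name).group()

# Matching the whole date explicitly (assumes a YYYY_MM_DD date in the name).
date = re.search(r"\d{4}_\d{2}_\d{2}", name).group()

print(last_number)  # 01
print(date)         # 1999_03_01
```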

How do I specify a glob that matches file.txt but not file.x.txt in zsh?

Let's say I have a bunch of files and one of them has .txt as the extension and others have .x.txt where x could be whatever. How do I pull out the file that only has the .txt extension ?
Here's a reproducible example:
touch file.txt file.x.txt
echo *.txt
# file.txt file.x.txt
% setopt extendedglob
% touch file.txt file.x.txt
% echo [^.]#.txt
file.txt
From the FILENAME GENERATION section of man zshexpn:
[...] Matches any of the enclosed characters. Ranges of characters can be specified by separating two characters by a `-'. A `-' or `]' may be matched by including it as the first character in the
list. There are also several named classes of characters, in the form `[:name:]' with the following meanings. The first set use the macros provided by the operating system to test for the given
character combinations, including any modifications due to local language settings, see ctype(3):
...
[^...]
[!...] Like [...], except that it matches any character which is not in the given set.
...
x# (Requires EXTENDED_GLOB to be set.) Matches zero or more occurrences of the pattern x. This operator has high precedence; `12#' is equivalent to `1(2#)', rather than `(12)#'. It is an error for
an unquoted `#' to follow something which cannot be repeated; this includes an empty string, a pattern already followed by `##', or parentheses when part of a KSH_GLOB pattern (for example,
`!(foo)#' is invalid and must be replaced by `*(!(foo))').
So this matches any string ending with .txt that contains no other . characters.
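The same idea can be expressed as a regular expression, which may help when reading the glob: [^.]# corresponds to [^.]* and the literal .txt needs an escaped dot. A small Python sketch, using the two file names from the question:

```python
import re

# Regex counterpart of the zsh glob [^.]#.txt:
# any number of non-dot characters, then a literal ".txt".
pattern = re.compile(r"[^.]*\.txt")

names = ["file.txt", "file.x.txt"]
matches = [name for name in names if pattern.fullmatch(name)]

print(matches)  # ['file.txt']
```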
