Modify awk script to add looping logic - unix

I have an awk script to print the pids appearing in myfilename, where myfilename contains a list of pids, one per line...
ps -eaf | awk -f script.awk myfilename -
And here are the contents of script.awk...
# process the first file on the command line (aka myfilename)
# this is the list of pids
ARGIND == 1 {
    pids[$0] = 1
}
# second and subsequent files ("-"/stdin in the example)
ARGIND > 1 {
    # is column 2 of the ps -eaf output (i.e. the pid) in the list of desired
    # pids? -- if so, print the entire line
    if ($2 in pids)
        printf("%s\n", $0)
}
At the moment the command prints out pids in the order of the ps -eaf output; however, I would like it to print them in the order in which they appear in myfilename.
I tried to modify the script to loop through the pids array and repeat the same logic, but I couldn't quite get it right.
Appreciate it if someone could help me with this.
thanks

Forgive my rusty AWK. Perhaps this is usable?
ARGIND == 1 {
    pids[$0] = NR    # capture the order
}
ARGIND > 1 {
    if ($2 in pids) {
        idx = pids[$2]
        matches[idx] = $0    # capture the line, keyed by its position in myfilename
        if (idx > max)
            max = idx
    }
}
END {
    for (i = 1; i <= max; i++)
        if (i in matches)
            print matches[i]
}
I don't know what the output from ps -eaf looks like or what assumptions might be usefully exploited from it. When I first read the question I thought the OP had more than two inputs to the script. If it's really going to be only two, it probably makes more sense to reverse the inputs; if not, this might be the more general approach.
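For a quick sanity check without a live process table, the same logic can be fed fabricated input (all pids and commands below are made up). Since ARGIND is a gawk extension, this sketch uses the portable NR == FNR test to distinguish the first file from the rest:

```shell
# Fake pid list and fake "ps -eaf" output, purely for illustration.
printf '3\n1\n' > /tmp/myfilename
printf 'root 1 cmd-a\nroot 2 cmd-b\nroot 3 cmd-c\n' |
awk 'NR == FNR { pids[$0] = NR; next }   # first file: remember each pid and its order
     $2 in pids {                        # later input: a matching "ps" line
         idx = pids[$2]
         matches[idx] = $0
         if (idx > max) max = idx
     }
     END { for (i = 1; i <= max; i++)
               if (i in matches) print matches[i] }' /tmp/myfilename -
```

With pid 3 listed first in /tmp/myfilename, its line prints first, regardless of the order the process list arrived in.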

I would instead do this using the time-honoured NR==FNR construct. It goes a little something like this, as a one-liner:
ps -eaf | awk 'NR==FNR{p[$1]++;next} $2 in p' mypidlist -
The idea of NR==FNR is that we look at the current record number (NR) and compare it to the record number within the current file (FNR). They are only equal while we are reading the first file, so there we store a record and move to the next line of input.
If NR==FNR is not true, then we simply check for $2 being in the array.
So the first expression populates the array p[] with the contents of mypidlist, and the second construct is a condition only, which defaults to {print} as its statement.
Of course, the one-liner above does not address your requirement to print results in the order of your pid input file. To do that, you need to keep an index and record the data in an array for some kind of sort. Of course, it doesn't have to be a real sort, just keeping the index itself should be sufficient. The following is a bit long as a one-liner:
ps -eaf | awk 'NR==FNR{p[$1]++;o[++n]=$1;next} $2 in p {c[$2]=$0} END {for(n=1;n<=length(o);n++){print c[o[n]]}}' mypidlist -
Broken out for easier reading, the awk script looks like this:
# Record the pid list...
NR==FNR {
    p[$1]++      # Each pid is an element in this array.
    o[++n] = $1  # This array records the order of the pids.
    next
}
# If the second+ input source has a matching pid...
$2 in p {
    c[$2] = $0   # record the line in a third array, pid as key.
}
END {
    # At the end of our input, step through the ordered pid list...
    for (n = 1; n <= length(o); n++) {
        print c[o[n]]  # and print the collected line, using our pid index as key.
    }
}
Note that in the event a pid from your list is missing from ps output, the result will be to print a blank line, since awk doesn't complain about references to nonexistent array indices.
Note also that the length(arrayname) notation works in GAWK and OneTrueAwk, but may not be universal. If it doesn't work for you, you might be able to add something like this to your awk script:
function alength(arrayname,   i, n) {
    for (i in arrayname)
        n++
    return n
}

If there is only one pid file, you can flip the order of the inputs and use idiomatic awk as follows:
$ awk 'NR==1; NR==FNR{a[$2]=$0;next} $0 in a{print a[$0]}' <(ps -eaf) <(seq 10)
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 02:36 ? 00:00:03 /sbin/init
root 2 0 0 02:36 ? 00:00:00 [kthreadd]
root 3 2 0 02:36 ? 00:00:00 [ksoftirqd/0]
root 4 2 0 02:36 ? 00:00:00 [kworker/0:0]
root 5 2 0 02:36 ? 00:00:00 [kworker/0:0H]
root 6 2 0 02:36 ? 00:00:00 [kworker/u30:0]
root 7 2 0 02:36 ? 00:00:00 [rcu_sched]
root 8 2 0 02:36 ? 00:00:00 [rcuos/0]
root 9 2 0 02:36 ? 00:00:00 [rcuos/1]
root 10 2 0 02:36 ? 00:00:00 [rcuos/2]
Here the list of ids is provided by seq; substitute your file.

Related

Loop and process over blocks of lines between two patterns in awk?

This is actually a continued version of this question:
I have a file
1
2
PAT1
3 - first block
4
PAT2
5
6
PAT1
7 - second block
PAT2
8
9
PAT1
10 - third block
and I use awk '/PAT1/{flag=1; next} /PAT2/{flag=0} flag'
to extract the blocks of lines.
Extracting them works OK, but I'm trying to iterate over these blocks in a block-by-block fashion and do some processing with each block (e.g. save to file, process with other scripts, etc.).
How can I construct such a loop?
The problem is not very clear, but you may do something like this:
awk '/PAT1/ {
    flag = 1
    ++n
    s = ""
    next
}
/PAT2/ {
    flag = 0
    printf "Processing record # %d =>\n%s", n, s
}
flag {
    s = s $0 ORS
}' file
Processing record # 1 =>
3 - first block
4
Processing record # 2 =>
7 - second block
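To do something with each block rather than just label it (the "save to file" case in the question), the same skeleton can redirect each collected block to its own file; the /tmp/block_N.txt naming is only an example:

```shell
# Sample input with two complete PAT1..PAT2 blocks.
cat > /tmp/blocks.txt <<'EOF'
1
2
PAT1
3 - first block
4
PAT2
5
PAT1
7 - second block
PAT2
8
EOF
awk '/PAT1/ { flag=1; ++n; s=""; next }
     /PAT2/ { flag=0; printf "%s", s > ("/tmp/block_" n ".txt") }  # write block n out
     flag   { s = s $0 ORS }' /tmp/blocks.txt
```

Afterwards /tmp/block_1.txt holds the first block's lines and /tmp/block_2.txt the second's; each file could instead be piped to another script.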
This might work for you (GNU sed):
sed -ne '/PAT1/!b;:a;N;/PAT2/!ba;e echo process:' -e 's/.*/echo "&"|wc/pe;p' file
Gather up the lines between PAT1 and PAT2 and process the collection.
In the example above, the literal process: is printed.
The command to print the result of the wc command for the collection is built and printed.
The result of the evaluation of the above command is printed.
N.B. The position of the p flag in the substitution command is critical. If the p is before the e flag, the pattern space is printed before the evaluation; if it is after, the pattern space printed is the result of the evaluation.

Selecting specific rows of a tab-delimited file using bash (linux)

I have a directory with a lot of tab-delimited txt files, each with several rows and columns, e.g.
File1
Id Sample Time ... Variant[Column16] ...
1 s1 t0 c.B481A:p.G861S
2 s2 t2 c.C221C:p.D461W
3 s5 t1 c.G31T:p.G61R
File2
Id Sample Time ... Variant[Column16] ...
1 s1 t0 c.B481A:p.G861S
2 s2 t2 c.C21C:p.D61W
3 s5 t1 c.G1T:p.G1R
and what I am looking for is to create a new file with:
all the different variants (unique)
the number of times each variant is repeated
and the file location
i.e.:
NewFile
Variant Nº of repeated Location
c.B481A:p.G861S 2 File1,File2
c.C221C:p.D461W 1 File1
c.G31T:p.G61R 1 File1
c.C21C:p.D61W 1 File2
c.G1T:p.G1R 1 File2
I think a basic bash script using awk, sort and uniq will work, but I do not know where to start. Or, if RStudio or python(3) is easier, I could try that.
Thanks!!
Pure bash. Requires version 4.0+
# two associative arrays
declare -A files
declare -A count

# use a glob pattern that matches your files
for f in File{1,2}; do
    {
        read -r header
        while read -ra fields; do
            variant=${fields[3]}      # use index "15" for the 16th column
            (( count[$variant] += 1 ))
            files[$variant]+=",$f"
        done
    } < "$f"
done

for variant in "${!count[@]}"; do
    printf "%s\t%d\t%s\n" "$variant" "${count[$variant]}" "${files[$variant]#,}"
done
outputs
c.B481A:p.G861S 2 File1,File2
c.G1T:p.G1R 1 File2
c.C221C:p.D461W 1 File1
c.G31T:p.G61R 1 File1
c.C21C:p.D61W 1 File2
The order of the output lines is indeterminate: associative arrays have no particular ordering.
Pure bash would be hard I think but everyone has some awk lying around :D
awk 'FNR==1 {next}
{
    ++n[$16]
    if ($16 in a) {
        a[$16] = a[$16] "," ARGV[ARGIND]
    } else {
        a[$16] = ARGV[ARGIND]
    }
}
END {
    printf("%-24s %6s %s\n", "Variant", "Nº", "Location")
    for (v in n) printf("%-24s %6d %s\n", v, n[v], a[v])
}' *

How to match a list of strings in two different files using a loop structure?

I have a file processing task that I need a hand with. I have two files (matched_sequences.list and multiple_hits.list).
INPUT FILE 1 (matched_sequences.list):
>P001 ID
ABCD .... (very long string of characters)
>P002 ID
ABCD .... (very long string of characters)
>P003 ID
ABCD ... ( " " " " )
INPUT FILE 2 (multiple_hits.list):
ID1
ID2
ID3
....
What I want to do is match the second column (ID2, ID4, etc.) with a list of IDs stored in multiple_hits.list. Then create a new matched_sequences file similar to the original but which excludes all IDs found in multiple_hits.list (about 60 out of 1000). So far I have:
#!/bin/bash
X=$(cat matched_sequences.list | awk '{print $2}')
Y=$(cat multiple_hits.list | awk '{print $1}')
while read matched_sequences.list
do
[ $X -ne $Y ] && (cat matched_sequences.list | awk '{print $1" "$2}') > new_matched_sequences.list
done
I get the following error raised:
-bash: read: `matched_sequences.list': not a valid identifier
Many thanks in advance!
EXPECTED OUTPUT (new_matched_sequences.list):
Same as INPUT FILE 1 with all IDs in multiple_hits.list excluded
#!/usr/bin/awk -f
function chomp(s) {
sub(/^[ \t]*/, "", s)
sub(/[ \t\r]*$/, "", s)
return s
}
BEGIN {
file = ARGV[--ARGC]
while ((getline line < file) > 0) {
a[chomp(line)]++
}
RS = ""
FS = "\n"
ORS = "\n\n"
}
{
id = chomp($1)
sub(/^.* /, "", id)
}
!(id in a)
Usage:
awk -f script.awk matched_sequences.list multiple_hits.list > new_matched_sequences.list
A shorter awk answer is possible, with a tiny script reading first the file with the IDs to exclude, and then the file containing the sequences. The script would be as follows (comments make it long; it's just three useful lines, in fact):
# grab_flag will be used when we are reading the sequences file
# (not absolutely necessary to set here, though, because we expect the file to start with '>')
BEGIN { grab_flag = 0 }

# Executed for all lines of the first file: record the IDs stored in multiple_hits.list.
FNR == NR { hits[$1] = 1; next }

# Otherwise we are reading the second file, containing the sequences:
# set the flag indicating whether we have to output the sequence or not.
/^>/ { if (hits[$2] == 1) grab_flag = 0; else grab_flag = 1 }

grab_flag == 1 { print }
And if you call this script exclude.awk, you will invoke it this way:
awk -f exclude.awk multiple_hits.list matched_sequences.list
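A toy run with made-up IDs shows the behaviour, with the script inlined (note that the FNR == NR trick misbehaves if the exclusion list is empty, since FNR == NR would then hold for the sequences file too):

```shell
# One ID to exclude, and three short fake sequences.
printf 'ID2\n' > /tmp/multiple_hits.list
printf '>P001 ID1\nAAAA\n>P002 ID2\nCCCC\n>P003 ID3\nGGGG\n' > /tmp/matched_sequences.list
awk 'FNR == NR { hits[$1] = 1; next }                 # first file: IDs to exclude
     /^>/      { grab_flag = (hits[$2] == 1) ? 0 : 1 }  # header line: keep or drop?
     grab_flag == 1' /tmp/multiple_hits.list /tmp/matched_sequences.list
```

The P002 entry (and its sequence line) is dropped; P001 and P003 pass through unchanged.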

how to delete the last line in file starting with string in unix

I have a file in the below format
AB1234 jhon cell number etc
MD 2 0 8 -1
MD4567 Jhon2 cell number etc
MD 2 0 8 -1
I want to find the last line that starts with "MD 2" (not just MD, as MD is embedded in other data) and delete that line.
so my output should be --
AB1234 jhon cell number etc
MD 2 0 8 -1
MD4567 Jhon2 cell number etc
I have tried many regular expressions in sed but it seems they are not working:
sed -e '/^MD *2/p' <file name>
sed '/^(MD 2)/p' <file name>
This might work for you (GNU sed):
sed '/^MD\s\+2/,${//{x;//p;d};H;$!d;x;s/^[^\n]*\n//}' file
This holds a window of lines in the hold space. When it encounters the required pattern, it prints out the current window and starts a new one. At the end of file it prints out all but the first line of the window (as it is the first line that is the required pattern to be deleted).
You can do this in 2 steps:
Find the line number of that line
Delete the line using sed
For example:
n=$(awk '/^MD *2/ { n=NR } END { print n }' filename)
sed "${n}d" filename
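Putting the two steps together on the sample data from the question (this assumes at least one line matches; if none does, n is empty and the sed expression is invalid):

```shell
printf 'AB1234 jhon cell number etc\nMD 2 0 8 -1\nMD4567 Jhon2 cell number etc\nMD 2 0 8 -1\n' > /tmp/data.txt
# n ends up holding the record number of the LAST match (here, line 4).
n=$(awk '/^MD *2/ { n = NR } END { print n }' /tmp/data.txt)
sed "${n}d" /tmp/data.txt
```

Only the final "MD 2" line is removed; the earlier one is kept, as the question requires.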
If you are trying to match exactly 2 in the second column (and not strings that begin with 2), do two passes:
awk 'NR==FNR && $1 == "MD" && $2 == "2"{k=NR} NR!=FNR && FNR!=k' input input
Or, if you have access to tac and want to make 3 passes on the file:
tac input | awk '$1 == "MD" && $2 == "2" && !k{ k=1; next}1' | tac
To match when the second column does not exactly equal the string 2 but merely begins with a 2, replace $2 == "2" in the above with $2 ~ /^2/
Here is one way to do it.
awk '{a[NR]=$0} /^MD *2/ {f=NR} END {for (i=1;i<=NR;i++) if (f!=i) print a[i]}' file
AB1234 jhon cell number etc
MD 2 0 8 -1
MD4567 Jhon2 cell number etc
Store all lines in array a.
Search for the last MD 2 and store its record number in f.
Then print array a, skipping the record whose number equals f.

How to print the last but one record of a file using awk?

How to print the last but one record of a file using awk?
Something like:
awk '{ prev_line=this_line; this_line=$0 } END { print prev_line }' < file
Essentially, keep a record of the line before the current one, until you hit the end of the file, then print the previous line.
edit to respond to comment:
To just extract the second field in the penultimate line:
awk '{ prev_f2=this_f2; this_f2=$2 } END { print prev_f2 }' < file
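For example, on a small made-up input the second field of the penultimate line comes out like this:

```shell
# Three lines; the penultimate line is "b 2", so its second field is 2.
printf 'a 1\nb 2\nc 3\n' |
awk '{ prev_f2 = this_f2; this_f2 = $2 } END { print prev_f2 }'
# prints 2
```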
You can do it with awk but you may find that:
tail -2 inputfile | head -1
will be a quicker solution - it grabs the last two lines of the complete set then the first of those two.
The following transcript shows how this works:
pax$ echo '1
> 2
> 3
> 4
> 5' | tail -2 | head -1
4
If you must use awk, you can use:
pax$ echo '1
2
3
4
5' | awk '{last = this; this = $0} END {print last}'
4
It works by keeping the last and current line in variables last and this and just printing out last when the file is finished.
