Loop and process over blocks of lines between two patterns in awk? - unix

This is actually a continuation of this question:
I have a file
1
2
PAT1
3 - first block
4
PAT2
5
6
PAT1
7 - second block
PAT2
8
9
PAT1
10 - third block
and I use awk '/PAT1/{flag=1; next} /PAT2/{flag=0} flag'
to extract the blocks of lines.
Extracting them works OK, but I'm trying to iterate over these blocks one at a time and do some processing with each block (e.g. save to file, process with other scripts etc.).
How can I construct such a loop?

The problem is not very clear, but you could do something like this:
awk '/PAT1/ {
flag = 1
++n
s = ""
next
}
/PAT2/ {
flag = 0
printf "Processing record # %d =>\n%s", n, s
}
flag {
s = s $0 ORS
}' file
Processing record # 1 =>
3 - first block
4
Processing record # 2 =>
7 - second block
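To cover the "save to file" part of the question, here is a minimal sketch that writes each block to its own file. The function name split_blocks and the block_N.txt naming are made up for illustration. Unlike the script above, it also keeps a final block that has no closing PAT2.

```shell
#!/bin/sh
# Sketch only: split_blocks and the block_N.txt names are hypothetical.
split_blocks() {
    # $1 = input file, $2 = directory that receives block_1.txt, block_2.txt, ...
    awk -v dir="$2" '
        /PAT1/ { flag = 1; out = dir "/block_" (++n) ".txt"; next }
        /PAT2/ { if (flag) close(out); flag = 0 }   # finished with this block
        flag   { print > out }
        END    { print n + 0 }                      # number of blocks started
    ' "$1"
}

# Demo on the sample data from the question.
tmp=$(mktemp -d)
printf '%s\n' 1 2 PAT1 '3 - first block' 4 PAT2 5 6 \
    PAT1 '7 - second block' PAT2 8 9 PAT1 '10 - third block' > "$tmp/file"
split_blocks "$tmp/file" "$tmp"
for f in "$tmp"/block_*.txt; do
    echo "== $f"
    cat "$f"        # replace cat with whatever should process one block
done
```

Once the blocks are separate files, an ordinary shell loop can hand each one to any other script.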

This might work for you (GNU sed):
sed -ne '/PAT1/!b;:a;N;/PAT2/!ba;e echo process:' -e 's/.*/echo "&"|wc/pe;p' file
Gather up the lines between PAT1 and PAT2 and process the collection.
In the example above, the literal process: is printed.
The command to print the result of the wc command for the collection is built and printed.
The result of the evaluation of the above command is printed.
N.B. The position of the p flag in the substitution command is critical. If the p flag comes before the e flag, the pattern space is printed before the evaluation; if the p flag comes after the e flag, the pattern space is printed after the evaluation.
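The same "run a command on each collected block" idea can be sketched in awk by printing the block into a pipe and closing the pipe at PAT2, so the command runs once per block. The names each_block and cmd are hypothetical, and wc -l merely stands in for a real processing script.

```shell
#!/bin/sh
# Sketch only: each_block and cmd are hypothetical names.
each_block() {
    # $1 = input file, $2 = command that reads one block on stdin
    awk -v cmd="$2" '
        /PAT1/ { flag = 1; next }
        /PAT2/ { if (flag) close(cmd); flag = 0 }   # closing the pipe ends this run of cmd
        flag   { print | cmd }
    ' "$1"
}

# Demo: count the lines of every block in the sample data from the question.
tmp=$(mktemp -d)
printf '%s\n' 1 2 PAT1 '3 - first block' 4 PAT2 5 6 \
    PAT1 '7 - second block' PAT2 8 9 PAT1 '10 - third block' > "$tmp/file"
each_block "$tmp/file" 'wc -l'
```

Because close(cmd) waits for the command to finish, the blocks are processed strictly one after another.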

Related

csplit in zsh: splitting file based on pattern

I would like to split the following file based on the pattern ABC:
ABC
4
5
6
ABC
1
2
3
ABC
1
2
3
4
ABC
8
2
3
to get file1:
ABC
4
5
6
file2:
ABC
1
2
3
etc.
Looking at the docs of man csplit: csplit my_file /regex/ {num}.
I can split this file using: csplit my_file '/^ABC$/' {2} but this requires me to put in a number for {num}. When I try to match with {*}, which is supposed to repeat the pattern as many times as possible, I get the error:
csplit: *}: bad repetition count
I am using a zshell.
To split a file on a pattern like this, I would turn to awk:
awk 'BEGIN { i=0; }
/^ABC/ { ++i; }
{ print >> ("file" i) }' < input
This reads lines from the file named input; before reading any lines, the BEGIN section explicitly initializes an "i" variable to zero; variables in awk default to zero, but it never hurts to be explicit. The "i" variable is our index to the serial filenames.
Subsequently, each line that starts with "ABC" will increment this "i" variable.
Any and every line in the file will then be printed (in append mode) to the file name that's generated from the text "file" and the current value of the "i" variable.
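With many sections, some awk implementations can run out of open file descriptors. A minimal sketch that closes each output file before starting the next one follows. The function name split_on_header and the prefix parameter are made up for illustration, and it matches the whole line (^ABC$) as in the question rather than just the prefix.

```shell
#!/bin/sh
# Sketch only: split_on_header and the prefix argument are hypothetical.
split_on_header() {
    # $1 = input file, $2 = output prefix; sections go to <prefix>1, <prefix>2, ...
    awk -v prefix="$2" '
        BEGIN   { out = prefix "0" }                  # lines before the first ABC, if any
        /^ABC$/ { close(out); out = prefix (++i) }    # finish the old file, start a new one
                { print > out }
    ' "$1"
}

# Demo on a shortened version of the sample from the question.
tmp=$(mktemp -d)
printf '%s\n' ABC 4 5 6 ABC 1 2 3 ABC 1 2 3 4 > "$tmp/my_file"
split_on_header "$tmp/my_file" "$tmp/file"
ls "$tmp"
```

Each file is opened only once, so > is enough here and a rerun starts from clean files instead of appending to old ones.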

Performing calculations based on customer ID in comma-separated file [duplicate]

This question already has an answer here:
Use awk to sum or average for each unique ID
(1 answer)
Closed 6 years ago.
I have a file that contains several comma-separated columns, including a customer ID in the first column.
One customer ID may occur on several rows, but always refers to the same real customer.
How do I run basic calculations in a shell script based on this ID column? For example, calculating the sum of the mileages (the 5th field) for the given customer ID.
102,305,Jin,Kerala,40
104,308,Paul,US,45
105,350,Nina,AUS,50
102,390,Jin,Kerala,10
104,395,Paul,US,35
102,399,Jin,Kerala,35
5th field is the mileage, 1st field is the customer ID.
This is a simple awk script that will sum up the mileages and print the customer IDs together with the sums at the end:
#!/usr/bin/awk -f
BEGIN { FS = "," }
{
customer_id = $1;
mileage = $5;
total_mileage[customer_id] += mileage;
}
END {
for (customer_id in total_mileage) {
print customer_id, total_mileage[customer_id];
}
}
To run (after making it executable with chmod +x script.awk):
$ ./script.awk data.in
102 85
104 80
105 50
Alternatively, as a "one-liner":
$ awk -F, '{t[$1]+=$5} END {for (c in t){print c,t[c]}}' data.in
102 85
104 80
105 50
While I agree with @wilx that using a database might be smarter, this sample awk script should get you started:
awk -v FS=',' '{miles[$1] += $5}
END { for (customerid in miles) {
print customerid, miles[customerid]; } }' customers
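The question also mentions "basic calculations" in general, so here is a minimal sketch that extends the same associative-array idea to an average as well as a sum. The function name mileage_stats and the output layout (id, total, average) are assumptions for illustration.

```shell
#!/bin/sh
# Sketch only: mileage_stats and its output layout are hypothetical.
mileage_stats() {
    # $1 = comma-separated file: customer ID in field 1, mileage in field 5
    awk -F, '
        { total[$1] += $5; rows[$1]++ }
        END {
            for (id in total)
                printf "%s %s %.2f\n", id, total[id], total[id] / rows[id]
        }
    ' "$1" | sort -n      # for (id in ...) has no defined order, so sort afterwards
}

# Demo with the sample rows from the question.
tmp=$(mktemp -d)
cat > "$tmp/customers" <<'EOF'
102,305,Jin,Kerala,40
104,308,Paul,US,45
105,350,Nina,AUS,50
102,390,Jin,Kerala,10
104,395,Paul,US,35
102,399,Jin,Kerala,35
EOF
mileage_stats "$tmp/customers"
```

For the sample rows this prints 102 85 28.33, 104 80 40.00 and 105 50 50.00, one per line.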
You can get a list of unique IDs using something like (assuming the first column is the ID):
awk '{print $1}' inputFile | sort -u
This outputs the first field of every single line in the input file inputFile, sorts them and removes duplicates.
You can then use that method with a bash loop to process each of the unique IDs with another awk command to perform some action on them. In the following snippet, I print out the matching lines for each ID:
for id in $(awk '{print $1}' inputFile | sort -u) ; do
echo "${id}:"
awk -vid=${id} '$1==id {print " "$0}' inputFile
done
In that code, for each individual ID, it first outputs the ID then uses awk to only process lines matching that ID. The action carried out is to output the full line with indentation.
Of course, you can do anything you wish with the lines matching each ID. Here is an example that more closely matches your requirements.
First, here's an input file I used for testing - we can assume field 1 is the customer ID and field 2 the mileage:
$ cat inputFile
a 1
b 2
c 3
a 4
b 5
c 6
a 7
b 8
c 9
b 10
c 11
c 12
And here's a command-line transcript of the method proposed (note that $ and + are input prompt and continuation prompt respectively, they are not part of the actual commands):
$ for id in $(awk '{print $1}' inputFile | sort -u) ; do
+ awk -vid=${id} '
+ $1==id {print $0; sum += $2 }
+ END {print "Total: "sum}
+ ' inputFile
+ done
a 1
a 4
a 7
Total: 12
b 2
b 5
b 8
b 10
Total: 25
c 3
c 6
c 9
c 11
c 12
Total: 41
Keep in mind that, for non-huge data sets, it's also possible to do this in a single-pass awk script, using associative arrays to store the totals and then outputting all the data in the END block. I tend to prefer the multi-pass approach myself since it minimises the possibility of running out of memory. The trade-off, of course, is that it will no doubt take longer since you're processing the file more than once.
For a single-pass solution, you can use something like:
$ awk '{sum[$1] += $2} END {for (key in sum) { print key": "sum[key]}}' inputFile
which gives you:
a: 12
b: 25
c: 41
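Note that awk does not define the order in which for (key in sum) visits the keys. If the totals should come out in the order the IDs first appear, a second array can record that order. This is a minimal sketch; the function name sum_in_order and the array name order are made up for illustration.

```shell
#!/bin/sh
# Sketch only: sum_in_order is a hypothetical helper.
sum_in_order() {
    # $1 = whitespace-separated file: ID in field 1, value in field 2
    awk '
        !($1 in sum) { order[++n] = $1 }   # remember each ID the first time it appears
                     { sum[$1] += $2 }
        END { for (i = 1; i <= n; i++) print order[i] ": " sum[order[i]] }
    ' "$1"
}

# Demo with the test input used above.
tmp=$(mktemp -d)
printf '%s\n' 'a 1' 'b 2' 'c 3' 'a 4' 'b 5' 'c 6' 'a 7' 'b 8' 'c 9' \
    'b 10' 'c 11' 'c 12' > "$tmp/inputFile"
sum_in_order "$tmp/inputFile"
```

The ($1 in sum) test only checks for the key and does not create it, which is why it can safely run before the addition.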

Modify awk script to add looping logic

I have an awk script to print pids appearing in myfilename. Where myfilename contains a list of pids each one appearing on a new line...
ps -eaf | awk -f script.awk myfilename -
And here is the contents of script.awk...
# process the first file on the command line (aka myfilename)
# this is the list of pids
ARGIND == 1 {
pids[$0] = 1
}
# second and subsequent files ("-"/stdin in the example)
ARGIND > 1 {
# is column 2 of the ps -eaf output [i.e.] the pid in the list of desired
# pids? -- if so, print the entire line
if ($2 in pids)
printf("%s\n",$0)
}
At the moment the command prints out pids in the order of the ps -eaf output; however, I would like it to print them in the order in which they appear in myfilename.
I tried to modify the script to loop through $pids and repeat the same logic but I couldn't quite get it right.
Appreciate it if someone could help me with this.
thanks
Forgive my rusty AWK. Perhaps this is usable?
ARGIND == 1 {
pids[$0] = NR # capture the order
}
ARGIND > 1 {
if ($2 in pids) {
idx = pids[$2];
matches[idx] = $0; # capture the line and associate it with its position in the pid list
if (idx > max)
max = idx;
}
}
END {
for(i = 1; i <= max; i++)
if (i in matches)
print matches[i];
}
I don't know what the output from ps -eaf looks like or what assumptions might be useful to exploit from its output. When I first read the question I thought OP had more than two inputs to the script. If it's really going to be only two then it probably makes more sense to reverse the inputs, if not then this might be the more general approach.
I would instead do this using the time-honoured NR==FNR construct. It goes a little something like this (one-liner).
ps -eaf | awk 'NR==FNR{p[$1]++;next} $2 in p' mypidlist -
The idea of NR==FNR is we look at the overall record number (NR), and compare it to the record number within the current file (FNR). If they are the same, we are still in the first file, so we store a record and move to the next line of input.
If NR==FNR is not true, then we simply check for $2 being in the array.
So the first expression populates the array p[] with the contents of mypidlist, and the second construct is a condition only, which defaults to {print} as its statement.
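To see the two counters side by side, here is a minimal sketch that prints NR and FNR for every line of two small files. The function name show_counters and the file names one and two are made up for illustration. One caveat: if the first file is empty, NR==FNR is also true while reading the second file.

```shell
#!/bin/sh
# Sketch only: show_counters is a hypothetical helper.
show_counters() {
    # Prints each input line followed by NR, FNR and which branch NR==FNR selects.
    awk '{ print $0, NR, FNR, (NR == FNR ? "first" : "later") }' "$@"
}

tmp=$(mktemp -d)
printf '%s\n' a b > "$tmp/one"
printf '%s\n' c   > "$tmp/two"
show_counters "$tmp/one" "$tmp/two"
```

The two lines of the first file report equal counters (1 1 and 2 2), while the line from the second file reports NR 3 and FNR 1.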
Of course, the one-liner above does not address your requirement to print results in the order of your pid input file. To do that, you need to keep an index and record the data in an array for some kind of sort. Of course, it doesn't have to be a real sort, just keeping the index itself should be sufficient. The following is a bit long as a one-liner:
ps -eaf | awk 'NR==FNR{p[$1]++;o[++n]=$1;next} $2 in p {c[$2]=$0} END {for(n=1;n<=length(o);n++){print c[o[n]]}}' mypidlist -
Broken out for easier reading, the awk script looks like this:
# Record the pid list...
NR==FNR {
p[$1]++ # Each pid is an element in this array.
o[++n]=$1 # This array records the order of the pids.
next
}
# If the second+ input source has a matching pid...
$2 in p {
c[$2]=$0 # record the line in a third array, pid as key.
}
END {
# At the end of our input, step through the ordered pid list...
for (n=1;n<=length(o);n++) {
print c[o[n]] # and print the collected line, using our pid index as key.
}
}
Note that in the event a pid from your list is missing from ps output, the result will be to print a blank line, since awk doesn't complain about references to nonexistent array indices.
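If the blank lines for missing pids are unwanted, the END loop can test for the key before printing. This is a minimal sketch of that variation; the function name filter_in_order and the array names want, order and line are hypothetical, and the demo uses canned ps-style lines rather than real ps output.

```shell
#!/bin/sh
# Sketch only: filter_in_order is a hypothetical helper.
filter_in_order() {
    # $1 = file with one pid per line; ps-style lines (pid in column 2) come from stdin
    awk '
        NR == FNR  { want[$1]; order[++n] = $1; next }   # first file: pids and their order
        $2 in want { line[$2] = $0 }                     # keep the matching ps lines
        END {
            for (i = 1; i <= n; i++)
                if (order[i] in line)                    # skip pids that were not reported
                    print line[order[i]]
        }
    ' "$1" -
}

# Demo: pid 99 is in the list but not in the canned input, so it is skipped.
tmp=$(mktemp -d)
printf '%s\n' 30 10 99 > "$tmp/mypidlist"
printf '%s\n' 'root 10 0 init' 'root 20 1 sh' 'root 30 1 cron' |
    filter_in_order "$tmp/mypidlist"
```

In real use the canned printf would be replaced by ps -eaf, as in the one-liner above.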
Note also that the length(arrayname) notation works in GAWK and OneTrueAwk, but may not be universal. If that doesn't work for you, you might be able to add something like this to your awk script:
function alength(arrayname, i, n) {
for(i in arrayname)
n++
return n
}
If there is one file, you can flip the order of inputs and use idiomatic awk as follows
$ awk 'NR==1; NR==FNR{a[$2]=$0;next} $0 in a{print a[$0]}' <(ps -eaf) <(seq 10)
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 02:36 ? 00:00:03 /sbin/init
root 2 0 0 02:36 ? 00:00:00 [kthreadd]
root 3 2 0 02:36 ? 00:00:00 [ksoftirqd/0]
root 4 2 0 02:36 ? 00:00:00 [kworker/0:0]
root 5 2 0 02:36 ? 00:00:00 [kworker/0:0H]
root 6 2 0 02:36 ? 00:00:00 [kworker/u30:0]
root 7 2 0 02:36 ? 00:00:00 [rcu_sched]
root 8 2 0 02:36 ? 00:00:00 [rcuos/0]
root 9 2 0 02:36 ? 00:00:00 [rcuos/1]
root 10 2 0 02:36 ? 00:00:00 [rcuos/2]
Here the list of ids is provided by seq; substitute your own file.

Extract data before and after matching (BIG FILE )

I have a big file (around 80K lines).
My main goal is to find the pattern and print, for example, 10 lines before and 10 lines after each match.
The pattern occurs multiple times across the file.
Using the grep command:
grep -i <my_pattern>* -B 10 -A 10 <my_file>
I get only some of the data; I think it must be related to the buffer size.
I need a command (grep, sed, awk) that will handle all the matches
and print 10 lines before and after the pattern.
Example :
My pattern hides here:
a
b
c
pattern_234
c
b
a
a
b
c
pattern_567
c
b
a
This happens multiple times across the file.
Running this command:
grep -i pattern_* -B 3 -A 3 <my_file>
gives the right output:
a
b
c
c
b
a
a
b
c
c
b
It works, but not every time:
if I have 80 patterns, not all 80 are shown.
awk to the rescue
awk -v n=4 '                    # n = number of context lines on each side
{
    for (i = 1; i <= n; i++)    # shift the buffer holding the past n lines
        p[i] = p[i+1]
    p[n+1] = $0
}
/pattern/ {                     # if the pattern matched
    c = n + 1                   # count the matching line plus n lines after it
    for (i = 1; i <= n; i++)    # print the previously saved lines
        print p[i]
}
c-- > 0' file                   # print lines until the counter runs out
This prints 4 lines before and 4 lines after each match of pattern; change the value of n as needed.
If the before and after counts differ, you need two variables:
awk -vb=2 -va=3 '{for(i=1;i<=b;i++) p[i]=p[i+1];p[b+1]=$0} /pattern/{c=a+1;for(i=1;i<=b;i++) print p[i]} c-->0'
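When two matches are close together, the scripts above print the shared context lines twice, and a match near the top of the file prints empty lines for the missing history. Here is a minimal sketch of a variation that remembers the last line number it printed and keeps the history in a small ring buffer. The function name context and the variable names pat, buf and last are made up for illustration, and the pattern is passed with -v, so backslashes in it would need doubling.

```shell
#!/bin/sh
# Sketch only: context is a hypothetical helper, not a replacement for grep -B/-A.
context() {
    # $1 = regex, $2 = lines before, $3 = lines after, $4 = file
    awk -v pat="$1" -v b="$2" -v a="$3" '
        $0 ~ pat {
            for (i = NR - b; i < NR; i++)      # buffered lines before the match
                if (i > last)                  # skip ones already printed (and i < 1)
                    print buf[i % (b + 1)]
            after = a + 1                      # the match itself plus a lines after it
        }
        after > 0 { print; last = NR; after-- }
                  { buf[NR % (b + 1)] = $0 }   # ring buffer of the most recent lines
    ' "$4"
}

# Demo on the sample from the question: every line is printed exactly once.
tmp=$(mktemp -d)
printf '%s\n' a b c pattern_234 c b a a b c pattern_567 c b a > "$tmp/my_file"
context 'pattern_' 3 3 "$tmp/my_file"
```

The ring buffer has b + 1 slots so that it also works with b set to 0.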

How to convert multiple lines into fixed column lengths

Converting all the rows into a single tab-delimited line is easy:
cat input.txt | tr "\n" "\t"
But I have a long file with 84046468 lines. I wish to convert this into a file with 1910147 rows and 44 tab-delimited columns. The first column is a text string such as chrXX_12345_+ and the other 43 columns are numerical strings. Is there a way to perform this transformation?
There are NAs present, so I guess sed and substituting "\n" for "\t" if the string preceding is a number doesn't work.
sample input.txt
chr10_1000103_+
0.932203
0.956522
1
0.972973
1
0.941176
1
0.923077
1
1
0.909091
0.9
1
0.916667
0.8
1
1
0.941176
0.904762
1
1
1
0.979592
0.93617
0.934783
1
0.941176
1
1
0.928571
NA
1
1
1
0.941176
1
0.875
0.972973
1
1
NA
0.823529
0.51366
chr10_1000104_-
0.952381
1
1
0.973684
sample output.txt
chr10_1000103_+ 0.932203 (numbers all tab-delimited)
chr10_1000104_- etc
(sorry, a lot of numbers to type manually)
sed '
# use a delimiter
s/^/M/
:Next
# put a counter
s/^/i/
# test counter
/^\(i\)\{44\}/ !{
$ !{
# not 44 line or end of file, add the next line
N
# loop
b Next
}
}
# remove marker and counter
s/^i*M//
# replace newlines by tabs (\t in the replacement needs GNU sed)
s/\n/\t/g' YourFile
Some sed implementations have a limit once the count goes above 255, so 44 is OK.
Here's the right approach using 4 columns instead of 44:
$ cat file
chr10_1000103_+
0.932203
0.956522
1
chr10_1000104_-
0.952381
1
1
$ awk '{printf "%s%s", $0, (NR%4?"\t":"\n")}' file
chr10_1000103_+ 0.932203 0.956522 1
chr10_1000104_- 0.952381 1 1
Just change 4 to 44 for your real input.
If you are seeing control-Ms in your output it's because they are present in your input, so use dos2unix or similar to remove them before running the tool, or with GNU awk you could just set -v RS='\r\n'.
When posting questions it's important to make it as clear, simple, and brief as possible so that as many people as possible will be interested in helping you.
BTW, piping cat input.txt into tr is a UUOC (useless use of cat); redirect the file instead: tr "\n" "\t" < input.txt
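The fixed-count approach relies on every record having exactly the same number of lines. If that is in doubt, the rows can be keyed on the header line instead. This is a minimal sketch, assuming each record starts with a line matching ^chr as in the sample and that the file begins with such a line; the function name rows_by_header is made up for illustration.

```shell
#!/bin/sh
# Sketch only: rows_by_header is a hypothetical helper.
rows_by_header() {
    # $1 = regex matching the first line of each record, $2 = file
    awk -v hdr="$1" '
        $0 ~ hdr && NR > 1 { printf "\n" }                 # a header closes the previous row
        { printf "%s%s", ($0 ~ hdr ? "" : "\t"), $0 }      # tab before every non-header line
        END { if (NR) printf "\n" }                        # terminate the last row
    ' "$2"
}

# Demo with two short records, including an NA value.
tmp=$(mktemp -d)
printf '%s\n' 'chr10_1000103_+' 0.932203 NA 1 'chr10_1000104_-' 0.952381 1 > "$tmp/input.txt"
rows_by_header '^chr' "$tmp/input.txt"
```

Piping the result through awk -F'\t' '{ print NF }' | sort | uniq -c is a quick way to check that every row ended up with the expected 44 columns.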
Not the best solution, but should work:
line="nonempty"; while [ ! -z "$line" ]; do for i in $(seq 44); do read line; echo -n "$line "; done; echo; done < input.txt
If there is an empty line in the file, it will terminate. For a more permanent solution I'd try perl.
edit:
If you are concerned with efficiency, just use awk.
awk '{ printf "%s\t", $1 } NR%44==0{ print "" }' < input.txt
You may want to strip the trailing tab character with | sed 's/\t$//' or make the awk script more complicated.
This might work for you (GNU sed):
sed '/^chr/!{H;$!d};x;s/\n/\t/gp;d' file
If a line does not begin with chr append it to the hold space and then delete it unless it is the last. If the line does start chr or it is the last line, then swap to the hold space and replace all newlines by tabs and print out the result.
N.B. the start of the next line will be left untouched in the pattern space which becomes the new hold space.
