Modify BED with poliregions

I have a somewhat tricky BED-like file that I need to convert to the classic BED format so that I can use it properly in further steps.
The unconventional format looks like this:
1 12349 12398 +
1 23523 23578 -
1 23550;23570;23590 23640;23689;23652 +
1 43533 43569 +
1 56021;56078 56099;56155 +
The rows with multiple positions represent fragmented non-coding regions.
What I would like to get is a canonical BED file such as:
1 12349 12398 +
1 23523 23578 -
1 23550 23640 +
1 23570 23689 +
1 23590 23652 +
1 43533 43569 +
1 56021 56099 +
1 56078 56155 +
where the poliregions that were combined in one row are split onto separate rows, while keeping the chromosome number and strand.

Could you please try the following.
awk '
{
num=split($2,array1,";")
num1=split($3,array2,";")
}
num>1 || num1>1{
for(i=1;i<=num;i++){
print $1,array1[i],array2[i],$NF
}
next
}
1' Input_file | column -t
Output will be as follows.
1 12349 12398 +
1 23523 23578 -
1 23550 23640 +
1 23570 23689 +
1 23590 23652 +
1 43533 43569 +
1 56021 56099 +
1 56078 56155 +

#!/usr/bin/env bash
# ^^^^-- NOT /bin/sh
while read -r a b c d; do
if [[ $b = *';'* ]]; then # if b contains any ';'s
IFS=';' read -r -a ba <<<"$b" # read string b into array ba
IFS=';' read -r -a ca <<<"$c" # read string c into array ca
for idx in "${!ba[@]}"; do # iterate over the indices of array ba
# print a and d with the values for a given index for both ba and ca
printf '%s\t%s\t%s\t%s\n' "$a" "${ba[idx]}" "${ca[idx]}" "$d"
done
else
printf '%s\t%s\t%s\t%s\n' "$a" "$b" "$c" "$d"
fi
done
This combines the answers to existing StackOverflow questions:
bash script loop through two variables in lock step
Reading a delimited string into an array in Bash
...and guidance in the BashFAQ:
How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?
See this running at https://ideone.com/wmrXPE

$ cat tst.awk
BEGIN { FS="[[:space:];]+" }
{
n = (NF - 2) / 2
for (i=1; i<=n; i++) {
print $1, $(i+1), $(i+1+n), $NF
}
}
$ awk -f tst.awk file
1 12349 12398 +
1 23523 23578 -
1 23550 23640 +
1 23570 23689 +
1 23590 23652 +
1 43533 43569 +
1 56021 56099 +
1 56078 56155 +

Try this Perl solution:
perl -lane ' if( /;/ and /(\S{2,})\s+(\S{2,})/ ) {
$i=0;@x=split(";",$1);@y=split(";",$2); while($i++<scalar(@x))
{ print join(" ",$F[0],$x[$i-1],$y[$i-1],$F[-1]) }} else { print } ' emilio.txt| column -t
with the given inputs
$ cat emilio.txt
1 12349 12398 +
1 23523 23578 -
1 23550;23570;23590 23640;23689;23652 +
1 43533 43569 +
1 56021;56078 56099;56155 +
$ perl -lane ' if( /;/ and /(\S{2,})\s+(\S{2,})/ ) {
$i=0;@x=split(";",$1);@y=split(";",$2); while($i++<scalar(@x))
{ print join(" ",$F[0],$x[$i-1],$y[$i-1],$F[-1]) }} else { print } ' emilio.txt| column -t
1 12349 12398 +
1 23523 23578 -
1 23550 23640 +
1 23570 23689 +
1 23590 23652 +
1 43533 43569 +
1 56021 56099 +
1 56078 56155 +
$
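For comparison, the same row expansion can be sketched in Python. This is only an illustration of the idea used by the answers in this thread (split columns 2 and 3 on ";" and pair the pieces by position); the function names expand_bed_line and expand_bed are made up for this example, and it assumes exactly four whitespace-separated columns per row.

```python
def expand_bed_line(line):
    """Split one row whose start/end columns hold ';'-separated lists
    into one (chrom, start, end, strand) tuple per region."""
    chrom, starts, ends, strand = line.split()
    start_list = starts.split(";")
    end_list = ends.split(";")
    if len(start_list) != len(end_list):
        raise ValueError(f"unequal number of starts and ends: {line!r}")
    return [(chrom, s, e, strand) for s, e in zip(start_list, end_list)]


def expand_bed(lines):
    """Yield tab-separated BED rows, skipping blank lines."""
    for line in lines:
        if line.strip():
            for row in expand_bed_line(line):
                yield "\t".join(row)


if __name__ == "__main__":
    sample = [
        "1 12349 12398 +",
        "1 23550;23570;23590 23640;23689;23652 +",
        "1 56021;56078 56099;56155 +",
    ]
    print("\n".join(expand_bed(sample)))
```

Unlike the shell versions, this sketch raises an error when a row has a different number of starts and ends instead of silently printing an empty field.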

Related

Combining big data files with different columns into one big file

I have N tab-separated files. Each file has a header line saying the names of the columns. Some of the columns are common to all of the files, but some are unique.
I want to combine all of the files into one big file containing all of the relevant headers.
Example:
> cat file1.dat
a b c
5 7 2
3 9 1
> cat file2.dat
a b e f
2 9 8 3
2 8 3 3
1 0 3 2
> cat file3.dat
a c d g
1 1 5 2
> merge file*.dat
a b c d e f g
5 7 2 - - - -
3 9 1 - - - -
2 9 - - 8 3 -
2 8 - - 3 3 -
1 0 - - 3 2 -
1 - 1 5 - - 2
The - can be replaced by anything, for example NA.
Caveat: the files are so big that I cannot load all of them into memory simultaneously.
I had a solution in R using
write.table(do.call(plyr:::rbind.fill,
Map(function(filename)
read.table(filename, header=1, check.names=0),
filename=list.files('.'))),
'merged.dat', quote=FALSE, sep='\t', row.names=FALSE)
but this fails with a memory error when the data are too large.
What is the best way to accomplish this?
I think the best route is to first loop through all the files to collect the column names, then loop through the files again to put each one into the right format and write it to disk as it is encountered. However, is there perhaps already some code available that does this?
From an algorithmic point of view, I would take the following steps:
1. Process the headers:
   - read the headers of all input files and extract all column names
   - sort the column names in the order you want
   - create a lookup table that returns the column name for a given field number (h[n] -> "name")
2. Process the files. Once the headers have been collected, reprocess each file:
   - read the header of the file
   - create a lookup table that returns the field number for a given column name; an associative array is useful here (a["name"] -> field_number)
   - process the remainder of the file: loop over all fields of the merged file, get the column name from h, and check whether that name is in a; if not, print -, otherwise print the value of the field whose number a gives
This is easily done with GNU awk, making use of nextfile and the asorti extension. The nextfile statement lets us read only the header and move on to the next file without processing the whole file. Since we need to process each file twice (step 1 reads the header, step 2 reads the data), we ask awk to manipulate its argument list dynamically: every time a file's header is processed, the file is appended to the argument list ARGV so that it can be read again in step 2.
BEGIN { s="-" } # define symbol
BEGIN { f=ARGC-1 } # get total number of files
f { for (i=1;i<=NF;++i) h[$i] # read headers in associative array h[key]
ARGV[ARGC++] = FILENAME # add file at end of argument list
if (--f == 0) { # did we process all headers?
n=asorti(h) # sort header into h[idx] = key
for (i=1;i<=n;++i) # print header
printf "%s%s", h[i], (i==n?ORS:OFS)
}
nextfile # end of processing headers
}
# Start of processing the files
(FNR==1) { delete a; for(i=1;i<=NF;++i) a[$i]=i; next } # read header
{ for(i=1;i<=n;++i) printf "%s%s", (h[i] in a ? $(a[h[i]]) : s), (i==n?ORS:OFS) }
If you store the above in a file merge.awk you can use the command:
awk -f merge.awk f1 f2 f3 f4 ... fx
A similar way, but with less hassle around f:
BEGIN { s="-" } # define symbol
BEGIN { # modify argument list from
c=ARGC; # from: arg1 arg2 ... argx
ARGV[ARGC++]="f=1" # to: arg1 arg2 ... argx f=1 arg1 arg2 ... argx
for(i=1;i<c;++i) ARGV[ARGC++]=ARGV[i]
}
!f { for (i=1;i<=NF;++i) h[$i] # read headers in associative array h[key]
nextfile
}
(f==1) && (FNR==1) { # process merged header
n=asorti(h) # sort header into h[idx] = key
for (i=1;i<=n;++i) # print header
printf "%s%s", h[i], (i==n?ORS:OFS)
f=2
}
# Start of processing the files
(FNR==1) { delete a; for(i=1;i<=NF;++i) a[$i]=i; next } # read header
{ for(i=1;i<=n;++i) printf "%s%s", (h[i] in a ? $(a[h[i]]) : s), (i==n?ORS:OFS) }
This method is slightly different, but allows the processing of files with different field separators as
awk -f merge.awk f1 FS="," f2 f3 FS="|" f4 ... fx
If your argument list becomes too long, you can have awk build it for you:
BEGIN { s="-" } # define symbol
BEGIN { # read argument list from input file:
fname=(ARGC==1 ? "-" : ARGV[1])
ARGC=1 # from: filelist or /dev/stdin
while ((getline < fname) > 0) # to: arg1 arg2 ... argx
ARGV[ARGC++]=$0
}
BEGIN { # modify argument list from
c=ARGC; # from: arg1 arg2 ... argx
ARGV[ARGC++]="f=1" # to: arg1 arg2 ... argx f=1 arg1 arg2 ... argx
for(i=1;i<c;++i) ARGV[ARGC++]=ARGV[i]
}
!f { for (i=1;i<=NF;++i) h[$i] # read headers in associative array h[key]
nextfile
}
(f==1) && (FNR==1) { # process merged header
n=asorti(h) # sort header into h[idx] = key
for (i=1;i<=n;++i) # print header
printf "%s%s", h[i], (i==n?ORS:OFS)
f=2
}
# Start of processing the files
(FNR==1) { delete a; for(i=1;i<=NF;++i) a[$i]=i; next } # read header
{ for(i=1;i<=n;++i) printf "%s%s", (h[i] in a ? $(a[h[i]]) : s), (i==n?ORS:OFS) }
which can be run as:
$ awk -f merge.awk filelist
$ find . | awk -f merge.awk "-"
$ find . | awk -f merge.awk
or any similar command.
As you can see, by adding only a tiny block of code, we were able to adjust the awk code flexibly to support our needs.
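The two-pass approach described at the start of this answer (collect all headers first, then stream every file against the merged header) can also be sketched in Python. This is a rough illustration, not a drop-in replacement for the awk scripts: the names read_header and merge_files are invented for the example, and it assumes tab-separated files with a header on the first line.

```python
import sys


def read_header(path, sep="\t"):
    """Return the column names from the first line of a file."""
    with open(path) as handle:
        return handle.readline().rstrip("\n").split(sep)


def merge_files(paths, out=sys.stdout, sep="\t", fill="-"):
    """Write all rows of all files under one merged, sorted header."""
    # Pass 1: read only the header of every file and build the sorted union.
    names = sorted({name for path in paths for name in read_header(path, sep)})
    out.write(sep.join(names) + "\n")

    # Pass 2: stream each file row by row, so only one row is in memory at a time.
    for path in paths:
        with open(path) as handle:
            header = handle.readline().rstrip("\n").split(sep)
            position = {name: idx for idx, name in enumerate(header)}
            for line in handle:
                fields = line.rstrip("\n").split(sep)
                row = [
                    fields[position[name]]
                    if name in position and position[name] < len(fields)
                    else fill
                    for name in names
                ]
                out.write(sep.join(row) + "\n")
```

Calling merge_files(["file1.dat", "file2.dat", "file3.dat"]) would write the merged table to standard output. Memory use depends on the number of distinct column names, not on the size of the files.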
Miller (johnkerl/miller) is underused for dealing with huge files. It combines features of many common file-processing tools. As the official documentation says:
Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON. You get to work with your data using named fields, without needing to count positional column indices.
For this particular case, it provides the verb unsparsify, which the documentation describes as follows:
Prints records with the union of field names over all input records.
For field names absent in a given record but present in others, fills in
a value. This verb retains all input before producing any output.
You just need to run the following, which also reorders the columns into the positions you want:
mlr --tsvlite --opprint unsparsify then reorder -f a,b,c,d,e,f file{1..3}.dat
which produces the output in one shot:
a b c d e f g
5 7 2 - - - -
3 9 1 - - - -
2 9 - - 8 3 -
2 8 - - 3 3 -
1 0 - - 3 2 -
1 - 1 5 - - 2
You can also customize the string used to fill the empty fields. By default the filler is the empty string, which --opprint displays as -. For a custom filler, use unsparsify --fill-with '#'
A brief explanation of the options used:
--tsvlite reads the input stream as tab-delimited content.
--opprint pretty-prints the tabular output.
unsparsify, as explained above, takes the union of all the field names over the whole input stream.
The reorder verb is needed because the column headers appear in a different order in each file. To define the order explicitly, use the -f option with the column headers in the order you want them to appear in the output.
Installation is straightforward. Miller is written in portable, modern C with zero runtime dependencies (this describes Miller 5; Miller 6 and later are written in Go), and it is available through the major package managers: Homebrew, MacPorts, apt-get, apt and yum.
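To make the quoted description of unsparsify concrete, here is a small Python model of the behaviour it describes. It is only an illustration of the documented semantics, not Miller's implementation; the function name and the choice of dictionaries for records are assumptions made for this example.

```python
def unsparsify(records, fill=""):
    """Give every record the union of all field names, in first-seen order."""
    # All input is held in memory first, matching the help text's remark that
    # the verb "retains all input before producing any output".
    records = list(records)
    names = {}  # a dict used as an insertion-ordered set of field names
    for record in records:
        for key in record:
            names.setdefault(key, None)
    return [{name: record.get(name, fill) for name in names} for record in records]


if __name__ == "__main__":
    rows = [{"a": "5", "b": "7", "c": "2"}, {"a": "1", "c": "1", "d": "5"}]
    for row in unsparsify(rows, fill="-"):
        print(" ".join(row.values()))
```

Because every record is retained before any output is produced, this approach needs enough memory for the whole input, which matters given the caveat in the question.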
Given your updated information in the comments about having about 10^5 input files (and so exceeding the shell's maximum number of arguments for a non-builtin command) and wanting the output columns in the order they're seen rather than alphabetically sorted, the following will work using any awk and any find:
$ cat tst.sh
#!/usr/bin/env bash
find . -maxdepth 1 -type f -name "$1" |
awk '
NR==FNR {
fileName = $0
ARGV[ARGC++] = fileName
if ( (getline fldList < fileName) > 0 ) {
if ( !seenList[fldList]++ ) {
numFlds = split(fldList,fldArr)
for (inFldNr=1; inFldNr<=numFlds; inFldNr++) {
fldName = fldArr[inFldNr]
if ( !seenName[fldName]++ ) {
hdr = (numOutFlds++ ? hdr OFS : "") fldName
outNr2name[numOutFlds] = fldName
}
}
}
}
close(fileName)
next
}
FNR == 1 {
if ( !doneHdr++ ) {
print hdr
}
delete name2inNr
for (inFldNr=1; inFldNr<=NF; inFldNr++) {
fldName = $inFldNr
name2inNr[fldName] = inFldNr
}
next
}
{
for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
fldName = outNr2name[outFldNr]
inFldNr = name2inNr[fldName]
fldValue = (inFldNr ? $inFldNr : "-")
printf "%s%s", fldValue, (outFldNr<numOutFlds ? OFS : ORS)
}
}
' -
.
$ ./tst.sh 'file*.dat'
a b c e f d g
5 7 2 - - - -
3 9 1 - - - -
2 9 - 8 3 - -
2 8 - 3 3 - -
1 0 - 3 2 - -
1 - 1 - - 5 2
Note that input to the script is now the globbing pattern you want find to use to find the files, not the list of files.
Original answer:
If you don't mind a combined shell+awk script then this will work using any awk:
$ cat tst.sh
#!/usr/bin/env bash
awk -v hdrs="$(head -1 -q "$@" | tr ' ' '\n' | sort -u)" '
BEGIN {
numOutFlds = split(hdrs,outNr2name)
for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
fldName = outNr2name[outFldNr]
printf "%s%s", fldName, (outFldNr<numOutFlds ? OFS : ORS)
}
}
FNR == 1 {
delete name2inNr
for (inFldNr=1; inFldNr<=NF; inFldNr++) {
fldName = $inFldNr
name2inNr[fldName] = inFldNr
}
next
}
{
for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
fldName = outNr2name[outFldNr]
inFldNr = name2inNr[fldName]
fldValue = (inFldNr ? $inFldNr : "-")
printf "%s%s", fldValue, (outFldNr<numOutFlds ? OFS : ORS)
}
}
' "$@"
.
$ ./tst.sh file{1..3}.dat
a b c d e f g
5 7 2 - - - -
3 9 1 - - - -
2 9 - - 8 3 -
2 8 - - 3 3 -
1 0 - - 3 2 -
1 - 1 5 - - 2
Otherwise, this is all awk, using GNU awk for arrays of arrays, sorted_in, and ARGIND:
$ cat tst.awk
BEGIN {
for (inFileNr=1; inFileNr<ARGC; inFileNr++) {
inFileName = ARGV[inFileNr]
if ( (getline < inFileName) > 0 ) {
for (inFldNr=1; inFldNr<=NF; inFldNr++) {
fldName = $inFldNr
name2inNr[fldName][inFileNr] = inFldNr
}
}
close(inFileName)
}
PROCINFO["sorted_in"] = "@ind_str_asc"
for (fldName in name2inNr) {
printf "%s%s", (numOutFlds++ ? OFS : ""), fldName
for (inFileNr in name2inNr[fldName]) {
outNr2inNr[numOutFlds][inFileNr] = name2inNr[fldName][inFileNr]
}
}
print ""
}
FNR > 1 {
for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
inFldNr = outNr2inNr[outFldNr][ARGIND]
fldValue = (inFldNr ? $inFldNr : "-")
printf "%s%s", fldValue, (outFldNr<numOutFlds ? OFS : ORS)
}
}
.
$ awk -f tst.awk file{1..3}.dat
a b c d e f g
5 7 2 - - - -
3 9 1 - - - -
2 9 - - 8 3 -
2 8 - - 3 3 -
1 0 - - 3 2 -
1 - 1 5 - - 2
For efficiency the 2nd script above does all the heavy lifting in the BEGIN section so there's as little work left to do as possible in the main body of the script that's evaluated once per input line. In the BEGIN section it creates an associative array (outNr2inNr[]) that maps the outgoing field numbers (alphabetically sorted list of all field names across all input files) to the incoming field numbers so all that's left to do in the body is print the fields in that order.
Here is the solution I (the OP) have come up with so far. It may have some advantage over other approaches in that it processes the files in parallel.
R code:
library(parallel)
library(parallelMap)
# specify the directory containing the files we want to merge
args <- commandArgs(TRUE)
directory <- if (length(args)>0) args[1] else 'sg_grid'
#output_fname <- paste0(directory, '.dat')
# make a tmp directory that will store all the files
tmp_dir <- paste0(directory, '_tmp')
dir.create(tmp_dir)
# list the .dat files we want to merge
filenames <- list.files(directory)
filenames <- filenames[grep('.dat', filenames)]
# a function to read the column names
get_col_names <- function(filename)
colnames(read.table(file.path(directory, filename),
header=T, check.names=0, nrow=1))
# grab all the headers of all the files and merge them
col_names <- get_col_names(filenames[1])
for (simulation in filenames) {
col_names <- union(col_names, get_col_names(simulation))
}
# put those column names into a blank data frame
name_DF <- data.frame(matrix(ncol = length(col_names), nrow = 0))
colnames(name_DF) <- col_names
# save that as the header file
write.table(name_DF, file.path(tmp_dir, '0.dat'),
col.names=TRUE, row.names=F, quote=F, sep='\t')
# now read in every file and merge with the blank data frame
# it will have NAs in any columns it didn't have before
# save it to the tmp directory to be merged later
parallelStartMulticore(max(1,
min(as.numeric(Sys.getenv('OMP_NUM_THREADS')), 62)))
success <- parallelMap(function(filename) {
print(filename)
DF <- read.table(file.path(directory, filename),
header=1, check.names=0)
DF <- plyr:::rbind.fill(name_DF, DF)
write.table(DF, file.path(tmp_dir, filename),
quote=F, col.names=F, row.names=F, sep='\t')
}, filename=filenames)
# and we're done
print(all(unlist(success)))
This creates temporary versions of all the files, which each now have all the headers, which we can then cat together into the result:
ls -1 sg_grid_tmp/* | while read fn ; do cat "$fn" >> sg_grid.dat; done

Unix awk scripting to convert columns to rows

I need help converting rows to columns in a Unix script. My source is a file on the file system.
I tried the script below:
perl -nle '
if($. == 1)
{ (@a)=/([\w - .]+)(?=,|\s*$)/g }
else
{
(@b)=/([\w - .]+)(?=,|\s*$)/g;
print "$a[0]|$b[0]|$b[1]|$b[2]|$a[$_]|$b[$_+3]" foreach (0..$#a)
}
' ip.txt >op.txt
input data from file:
src,FI,QMA,PCG,PCC,PREI,G T
PIM2016.csv,MMR.S T - RED,334,114,120,34,123,725
output with latest script:
SRC|PIM2016.csv|MMRPPS|RED|SRC|334 SRC|PIM2016.csv|MMRPPS|RED|FI|114
SDRC|PIM2016.csv|MMRPPS|RED|QMA|120 SRC|PIM2016.csv|MMRPPS|RED|PCG|34
SRC|PIM2016.csv|MMRPPS|RED|PCC|123 SRC|PIM2016.csv|MMRPPS|RED|PREI|725
SRC|PIM2016.csv|MMRPPS|RED|G T|
Required output:
SRC|PIM2016.csv|MMRPPS|S T -RED|FI|334 SRC|PIM2016.csv|MMRPPS|S T
-RED|QMA|114 SRC|PIM2016.csv|MMRPPS|S T -RED|PCG|120 SRC|PIM2016.csv|MMRPPS|S T -RED|PCC|34 SRC|PIM2016.csv|MMRPPS|S T
-RED|PREI|123 SRC|PIM2016.csv|MMRPPS|S T -RED|G T|725
$ cat ip.txt
HDR :FI,QA,PC,PM,PRE,G T
Detail row: MMRPPS,ST - RED,334,114,120,34,123,725
UP,UPR,0,0,0,0,0,0
Assuming no blank lines between rows:
$ perl -nle '
s/^.*:\s*|^\s*|\s*$//;
if($. == 1)
{ (@a) = /[^,]+/g }
else
{
(@b) = /[^,]+/g;
print "$b[0] $a[$_] $b[1] $b[$_+2]" foreach (0..$#a);
}
' ip.txt
MMRPPS FI ST - RED 334
MMRPPS QA ST - RED 114
MMRPPS PC ST - RED 120
MMRPPS PM ST - RED 34
MMRPPS PRE ST - RED 123
MMRPPS G T ST - RED 725
UP FI UPR 0
UP QA UPR 0
UP PC UPR 0
UP PM UPR 0
UP PRE UPR 0
UP G T UPR 0
Input lines are pre-processed to remove any leading text up to :, and any leading and trailing whitespace
From the first line, extract the comma-separated values into the @a array. The regex looks for strings of non-, characters
For all other lines,
use the same regex to extract the comma-separated values into the @b array
print in the desired order
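The steps listed above can be sketched in Python as well. This is an illustration only: the names strip_label and unpivot are invented for the example, and unlike the Perl one-liner it skips blank lines and strips whitespace from both ends of every line.

```python
def strip_label(line):
    """Drop an optional 'label:' prefix and surrounding whitespace."""
    _, _, rest = line.rpartition(":")  # without a ':' the whole line is kept
    return rest.strip()


def unpivot(lines):
    """Turn a header line plus detail lines into one output row per measure."""
    cleaned = [strip_label(line) for line in lines if line.strip()]
    names = cleaned[0].split(",")  # first line: the column names
    rows = []
    for line in cleaned[1:]:
        values = line.split(",")
        key, label, measures = values[0], values[1], values[2:]
        for name, value in zip(names, measures):
            rows.append(f"{key} {name} {label} {value}")
    return rows


if __name__ == "__main__":
    sample = [
        "HDR :FI,QA,PC,PM,PRE,G T",
        "Detail row: MMRPPS,ST - RED,334,114,120,34,123,725",
        "UP,UPR,0,0,0,0,0,0",
    ]
    print("\n".join(unpivot(sample)))
```

The first two values of each detail line are treated as identifiers and the remaining values are paired with the header names by position, which mirrors the $b[0], $b[1] and $b[$_+2] indexing in the Perl version.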
@sundeep: thanks for your answer. The script below works:
perl -nle '
if($. == 1)
{ (@a)=/([\w -]+)(?=,|\s*$)/g }
else
{
(@b)=/([\w -]+)(?=,|\s*$)/g;
print "$b[0] $a[$_] $b[1] $b[$_+2]" foreach (0..$#a)
}
' ip.txt

UNIX - Find most elements in document

I'm trying to figure out a way to take in a file where there is one word per line and output a log of the most frequently used words in the file and how often they occurred.
Namely, if I were given a file like this (far shorter than what I am looking at, but for clarity's sake...):
dog
dog
cat
bird
cat
horse
dog
I would get an output like:
dog - 3
cat - 2
bird - 1
horse - 1
How about this:
[cnicutar@fresh ~]$ sort < file | uniq -c | sort -rn
3 dog
2 cat
1 horse
1 bird
You can then tweak it to get dog-3 and so on.
using awk & sort :
$ awk '{arr[$1]++}END{for(a in arr){print a" - "arr[a]}}' file.txt | sort -nrk3
A full awk version :
awk '{
arr[$1]++
}
END{
for (i in arr) tmpidx[sprintf("%12s", arr[i]),i] = i
num = asorti(tmpidx)
j = 0
for (i=num; i>=1; i--) {
split(tmpidx[i], tmp, SUBSEP)
indices[++j] = tmp[2]
}
for (i=1; i<=num; i++) print indices[i] " - " arr[indices[i]]
}' file.txt
OUTPUT
dog - 3
cat - 2
horse - 1
bird - 1
Another way using perl (exact output like you asked):
perl -lne '
END{
print "$_ - $h{$_}" for reverse sort {$h{$a} <=> $h{$b}} keys %h
}
$h{$_}++
' file.txt
OUTPUT
dog - 3
cat - 2
bird - 1
horse - 1
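For comparison, a minimal Python sketch of the same count-and-sort idea, using collections.Counter from the standard library. The function name word_counts is made up for this example; ties are broken alphabetically so that the order is deterministic.

```python
from collections import Counter


def word_counts(lines):
    """Return (word, count) pairs, most frequent first, ties in alphabetical order."""
    counts = Counter(line.strip() for line in lines if line.strip())
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    words = ["dog", "dog", "cat", "bird", "cat", "horse", "dog"]
    for word, count in word_counts(words):
        print(f"{word} - {count}")
```

Passing an open file object instead of a list works the same way, since the function only iterates over lines.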

Easiest way to join two files from the unix command line, inserting zero entries for missing keys

I'm trying to join two files each of which contains rows of the form <key> <count>. Each file contains a few lines that are missing from the other, and I would like to have zero inserted for all such values rather than omitting these lines (I've seen -a, but this isn't quite what I'm looking for). Is there a simple way to accomplish this?
Here is some sample input:
a.txt
apple 5
banana 7
b.txt
apple 6
cherry 4
expected output:
apple 5 6
banana 7 0
cherry 0 4
join -o 0,1.2,2.2 -e 0 -a1 -a2 a.txt b.txt
-o 0,1.2,2.2 → output join field, then 2nd field of 1st file, then 2nd field of 2nd file.
-e 0 → Output 0 on empty input fields.
-a1 -a2 → Show all values from file 1 and file 2.
Write a script in whatever language you want. You will parse both files using a map/hashtable/dictionary data structure (let's just say dictionary). Each dictionary will have the first word as the key and the count (or even a string of counts) as the value. Here is some pseudocode of the algorithm:
Dict fileA, fileB; //Already parsed
while(!fileA.isEmpty()) {
string check = fileA.top().key();
int val1 = fileA.top().value();
if(fileB.contains(check)) {
printToFile(check + " " + val1 + " " + fileB.getValue(check));
fileB.remove(check);
}
else {
printToFile(check + " " + val1 + " 0");
}
fileA.pop();
}
while(!fileB.isEmpty()) { //Know key does not exist in FileA
string check = fileB.top().key();
int val1 = fileB.top().value();
printToFile(check + " 0 " + val1);
fileB.pop();
}
You can use any type of iterator to go through the data structure instead of pop and top. Obviously you may need to access the data a different way depending on what language/data structure you need to use.
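As one possible concrete version of the pseudocode above, here is a short Python sketch. It is an adaptation rather than a line-by-line translation: instead of popping entries from the two dictionaries, it iterates over the sorted union of their keys. The names parse_counts and outer_join are invented for this example, and each input line is assumed to hold exactly one key and one count.

```python
def parse_counts(lines):
    """Build a {key: count} dictionary from '<key> <count>' lines."""
    return dict(line.split() for line in lines if line.strip())


def outer_join(a_lines, b_lines, missing="0"):
    """Join two key/count listings, filling in `missing` for absent keys."""
    a = parse_counts(a_lines)
    b = parse_counts(b_lines)
    return [
        f"{key} {a.get(key, missing)} {b.get(key, missing)}"
        for key in sorted(a.keys() | b.keys())
    ]


if __name__ == "__main__":
    print("\n".join(outer_join(["apple 5", "banana 7"], ["apple 6", "cherry 4"])))
```

Unlike join, this does not need the inputs to be sorted, but it does hold both files in memory.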
@ninjalj's answer is much saner, but here's a shell script implementation just for fun:
exec 8< a.txt
exec 9< b.txt
while true; do
if [ -z "$k1" ]; then
read k1 v1 <& 8
fi
if [ -z "$k2" ]; then
read k2 v2 <& 9
fi
if [ -z "$k1$k2" ]; then break; fi
if [ "$k1" == "$k2" ]; then
echo $k1 $v1 $v2
k1=
k2=
elif [ -n "$k1" ] && { [ -z "$k2" ] || [ "$k1" '<' "$k2" ]; }; then
echo $k1 $v1 0
k1=
else
echo $k2 0 $v2
k2=
fi
done

How to print a line with a pattern which is nearest to another line with a specific pattern?

I want to find the line matching one pattern that is nearest to a line matching another specific pattern. For example, I want to print the "bbb=" line that is under "yyyy:" (it is the closest bbb= line to yyyy), which is line 8 here. Line numbers and the order of the lines might change, so it is better not to rely on line numbers.
root# vi a
"a" 15 lines
1 ## xxxx:
2 aaa=3
3 bbb=4
4 ccc=2
5 ddd=1
6 ## yyyy:
7 aaa=1
8 bbb=0
9 ccc=3
10 ddd=3
11 ## zzzz:
12 aaa=1
13 bbb=1
14 ccc=1
15 ddd=1
Do you have an idea using awk or grep for this purpose?
Something like this?
awk '/^## yyyy:/ { i = 1 }; i && /^bbb=/ { print; exit }'
Or can a line above also match? In that case, perhaps:
awk '/^bbb=/ && !i { p=NR; s=$0 }; /^bbb=/ && i { print (NR-i < i-p) ? $0 : s; exit }; /^## yyyy:/ { i=NR }'
Taking into account that there might not be a previous or next entry:
/^bbb=/ && !i { p1 = NR; s1 = $0 }
/^bbb=/ && i { p2 = NR; s2 = $0; exit }
/^## yyyy:/ { i = NR }
END {
if (p1 == 0)
print s2
else if (p2 == 0)
print s1
else
print (i - p1 < p2 - i ? s1 : s2)
}
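The same nearest-match logic can be sketched in Python. This is an illustration under a few assumptions: the function name nearest_match is invented, the whole file is read into a list first, only the first line matching the anchor pattern is used, and a tie between a line above and a line below is resolved in favour of the line below, as in the awk version.

```python
import re


def nearest_match(lines, anchor_re, target_re):
    """Return the line matching target_re that is closest to the first
    line matching anchor_re, or None if either pattern never matches."""
    anchor = next(
        (n for n, line in enumerate(lines) if re.search(anchor_re, line)), None
    )
    if anchor is None:
        return None
    candidates = [
        n for n, line in enumerate(lines)
        if n != anchor and re.search(target_re, line)
    ]
    if not candidates:
        return None
    # Smallest distance wins; on a tie, False (line below) sorts before True (line above).
    best = min(candidates, key=lambda n: (abs(n - anchor), n < anchor))
    return lines[best]


if __name__ == "__main__":
    text = ["## xxxx:", "aaa=3", "bbb=4", "ccc=2", "ddd=1",
            "## yyyy:", "aaa=1", "bbb=0", "ccc=3", "ddd=3"]
    print(nearest_match(text, r"^## yyyy:", r"^bbb="))
```

Because candidates on both sides of the anchor are collected, a missing previous or next entry needs no special handling.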
Quick and dirty using grep:
grep -A 100 '## yyyy:' filename | grep -m 1 'bbb='
