Unix: print pattern between the strings

I have a file with content like the example below. START and STOP mark the boundaries of each block.
START
X | 123
Y | abc
Z | +=-
STOP
START
X | 456
Z | +%$
STOP
START
X | 789
Y | ghi
Z | !##
STOP
I would like to get the values of X and Y printed in the format below for each block:
123 ~~ abc
456 ~~
789 ~~ ghi
If there were only a single occurrence of START/STOP, sed -n '/START/,/STOP/p' would have helped. Since the blocks repeat, I need your help.

Based on my own solution to How to select lines between two marker patterns which may occur multiple times with awk/sed:
awk -v OFS=" ~~ " '
/START/{flag=1;next}
/STOP/{flag=0; print first, second; first=second=""}
flag && $1=="X" {first=$3}
flag && $1=="Y" {second=$3}' file
Test
$ awk -v OFS=" ~~ " '/START/{flag=1;next}/STOP/{flag=0; print first, second; first=second=""} flag && $1=="X" {first=$3} flag && $1=="Y" {second=$3}' a
123 ~~ abc
456 ~~
789 ~~ ghi

Sed is always the wrong choice for any problem that involves processing multiple lines. All of sed's arcane constructs for doing so became obsolete in the mid-1970s when awk was invented.
Whenever you have name-value pairs in your input I find it useful to create an array that maps each name to its value and then access the array by the names. In this case, using GNU awk for multi-char RS and delete array:
$ cat tst.awk
BEGIN {
    RS = "\nSTOP\n"
    OFS = " ~~ "
}
{
    delete n2v
    for (i=2; i<=NF; i+=3) {
        n2v[$i] = $(i+2)
    }
    print n2v["X"], n2v["Y"]
}
$ gawk -f tst.awk file
123 ~~ abc
456 ~~
789 ~~ ghi
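If it helps to see why the loop starts at field 2 and steps by 3: with RS="\nSTOP\n" each block becomes one record whose fields are START followed by name/pipe/value triples, so the name sits at $i and its value at $(i+2). A quick inspection one-liner (a sketch, assuming the same sample file is named file) shows the splitting and should print something like:
$ gawk 'BEGIN{RS="\nSTOP\n"} {printf "record %d, %d fields:", NR, NF; for (i=1;i<=NF;i++) printf " <%s>", $i; print ""}' file
record 1, 10 fields: <START> <X> <|> <123> <Y> <|> <abc> <Z> <|> <+=->
record 2, 7 fields: <START> <X> <|> <456> <Z> <|> <+%$>
record 3, 10 fields: <START> <X> <|> <789> <Y> <|> <ghi> <Z> <|> <!##>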

Because I like brain teasers (not because this sort of thing is practical to do in sed), a possible sed solution is
sed -n '/START/,/STOP/ { //!H; // { g; /^$/! { s/.*\nX | \([^\n]*\).*/\1 ~~/; ta; s/.*/~~/; :a G; s/\n.*Y | \([^\n]*\).*/ \1/; s/\n.*//; p; s/.*//; h } } }'
This works as follows:
/START/,/STOP/ {                      # between two start and stop lines
  //!H                                # assemble the lines in the hold buffer
                                      # note that // repeats the previously
                                      # matched pattern, so // matches the
                                      # start and end lines, //! all others.
  // {                                # At the end
    g                                 # That is: When it is one of the
    /^$/! {                           # boundary lines and the hold buffer
                                      # is not empty
      s/.*\nX | \([^\n]*\).*/\1 ~~/   # isolate the X value, append ~~
      ta                              # if there is no X value, just use ~~
      s/.*/~~/
      :a
      G                               # append the hold buffer to that
      s/\n.*Y | \([^\n]*\).*/ \1/     # and isolate the Y value so that
                                      # the pattern space contains X ~~ Y
      s/\n.*//                        # Cutting off everything after a newline
                                      # is important if there is no Y value
                                      # and the previous substitution did
                                      # nothing
      p                               # print the result
      s/.*//                          # and make sure the hold buffer is
      h                               # empty for the next block.
    }
  }
}

Related

Combining big data files with different columns into one big file

I have N tab-separated files. Each file has a header line saying the names of the columns. Some of the columns are common to all of the files, but some are unique.
I want to combine all of the files into one big file containing all of the relevant headers.
Example:
> cat file1.dat
a b c
5 7 2
3 9 1
> cat file2.dat
a b e f
2 9 8 3
2 8 3 3
1 0 3 2
> cat file3.dat
a c d g
1 1 5 2
> merge file*.dat
a b c d e f g
5 7 2 - - - -
3 9 1 - - - -
2 9 - - 8 3 -
2 8 - - 3 3 -
1 0 - - 3 2 -
1 - 1 5 - - 2
The - can be replaced by anything, for example NA.
Caveat: the files are so big that I can not load all of them into memory simultaneously.
I had a solution in R using
write.table(do.call(plyr:::rbind.fill,
Map(function(filename)
read.table(filename, header=1, check.names=0),
filename=list.files('.'))),
'merged.dat', quote=FALSE, sep='\t', row.names=FALSE)
but this fails with a memory error when the data are too large.
What is the best way to accomplish this?
I am thinking the best route will be to first loop through all the files to collect the column names, then loop through the files to put them into the right format, and write them to disc as they are encountered. However, is there perhaps already some code available that performs this?
From an algorithm point of view I would take the following steps:
Process the headers:
  - read all headers of all input files and extract all column names
  - sort the column names in the order you want
  - create a lookup table which returns the column name when a field number is given (h[n] -> "name")
Process the files: after the headers, reprocess the files:
  - read the header of the file
  - create a lookup table which returns the field number when given a column name; an associative array is useful here (a["name"] -> field_number)
  - process the remainder of the file:
      - loop over all fields of the merged file
      - get the column name with h
      - check if the column name is in a; if not, print -, otherwise print the field corresponding with a.
This is easily done with GNU awk, making use of the extensions nextfile and asorti. The nextfile statement allows us to read the header only and move on to the next file without processing the full file. Since we need to process each file twice (step 1 reading the header and step 2 reading the body), we ask awk to dynamically manipulate its argument list: every time a file's header has been processed, we add the file at the end of the argument list ARGV so it can be used again for step 2.
BEGIN { s="-" }                                # define symbol
BEGIN { f=ARGC-1 }                             # get total number of files
f { for (i=1;i<=NF;++i) h[$i]                  # read headers in associative array h[key]
    ARGV[ARGC++] = FILENAME                    # add file at end of argument list
    if (--f == 0) {                            # did we process all headers?
        n=asorti(h)                            # sort header into h[idx] = key
        for (i=1;i<=n;++i)                     # print header
            printf "%s%s", h[i], (i==n?ORS:OFS)
    }
    nextfile                                   # end of processing headers
}
# Start of processing the files
(FNR==1) { delete a; for(i=1;i<=NF;++i) a[$i]=i; next }   # read header
{ for(i=1;i<=n;++i) printf "%s%s", (h[i] in a ? $(a[h[i]]) : s), (i==n?ORS:OFS) }
If you store the above in a file merge.awk you can use the command:
awk -f merge.awk f1 f2 f3 f4 ... fx
A similar way, but with less hassle handling f:
BEGIN { s="-" }                                # define symbol
BEGIN {                                        # modify argument list
    c=ARGC;                                    #   from: arg1 arg2 ... argx
    ARGV[ARGC++]="f=1"                         #   to:   arg1 arg2 ... argx f=1 arg1 arg2 ... argx
    for(i=1;i<c;++i) ARGV[ARGC++]=ARGV[i]
}
!f { for (i=1;i<=NF;++i) h[$i]                 # read headers in associative array h[key]
    nextfile
}
(f==1) && (FNR==1) {                           # process merged header
    n=asorti(h)                                # sort header into h[idx] = key
    for (i=1;i<=n;++i)                         # print header
        printf "%s%s", h[i], (i==n?ORS:OFS)
    f=2
}
# Start of processing the files
(FNR==1) { delete a; for(i=1;i<=NF;++i) a[$i]=i; next }   # read header
{ for(i=1;i<=n;++i) printf "%s%s", (h[i] in a ? $(a[h[i]]) : s), (i==n?ORS:OFS) }
This method is slightly different, but allows the processing of files with different field separators, as in
awk -f merge.awk f1 FS="," f2 f3 FS="|" f4 ... fx
If your argument list becomes too long, you can use awk to create it for you:
BEGIN { s="-" }                                # define symbol
BEGIN {                                        # read argument list from input file
    fname=(ARGC==1 ? "-" : ARGV[1])            #   from: filelist or /dev/stdin
    ARGC=1                                     #   to:   arg1 arg2 ... argx
    while ((getline < fname) > 0)
        ARGV[ARGC++]=$0
}
BEGIN {                                        # modify argument list
    c=ARGC;                                    #   from: arg1 arg2 ... argx
    ARGV[ARGC++]="f=1"                         #   to:   arg1 arg2 ... argx f=1 arg1 arg2 ... argx
    for(i=1;i<c;++i) ARGV[ARGC++]=ARGV[i]
}
!f { for (i=1;i<=NF;++i) h[$i]                 # read headers in associative array h[key]
    nextfile
}
(f==1) && (FNR==1) {                           # process merged header
    n=asorti(h)                                # sort header into h[idx] = key
    for (i=1;i<=n;++i)                         # print header
        printf "%s%s", h[i], (i==n?ORS:OFS)
    f=2
}
# Start of processing the files
(FNR==1) { delete a; for(i=1;i<=NF;++i) a[$i]=i; next }   # read header
{ for(i=1;i<=n;++i) printf "%s%s", (h[i] in a ? $(a[h[i]]) : s), (i==n?ORS:OFS) }
which can be run as:
$ awk -f merge.awk filelist
$ find . | awk -f merge.awk "-"
$ find . | awk -f merge.awk
or any similar command.
As you see, by adding only a tiny block of code, we were able to flexibly adjust the awk code to support our needs.
Miller (johnkerl/miller) is underused when dealing with huge files. It bundles features from many of the usual file-processing tools. As the official documentation says:
Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON. You get to work with your data using named fields, without needing to count positional column indices.
For this particular case, it supports a verb unsparsify, which the documentation describes as follows:
Prints records with the union of field names over all input records.
For field names absent in a given record but present in others, fills in
a value. This verb retains all input before producing any output.
You just need to do the following and reorder the columns as you desire:
mlr --tsvlite --opprint unsparsify then reorder -f a,b,c,d,e,f file{1..3}.dat
which produces the output in one-shot as
a b c d e f g
5 7 2 - - - -
3 9 1 - - - -
2 9 - - 8 3 -
2 8 - - 3 3 -
1 0 - - 3 2 -
1 - 1 5 - - 2
You can even customize what character is used to fill the empty fields; the default is -. For a custom character, use unsparsify --fill-with '#'.
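For example, the same command with NA as the fill value (a sketch, using the same file names as above):
mlr --tsvlite --opprint unsparsify --fill-with 'NA' then reorder -f a,b,c,d,e,f file{1..3}.dat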
A brief explanation of the options used:
--tsvlite treats the input stream as tab-delimited content
--opprint pretty-prints the tabular output
unsparsify, as explained above, takes the union of all field names over the whole input stream
The reorder verb is needed because the column headers appear in a different order in each file. To define the order explicitly, pass the -f option the column headers you want the output to appear with.
Installation of the package is straightforward as well. Miller is written in portable, modern C, with zero runtime dependencies, and it is available through all the major package managers (Homebrew, MacPorts, apt-get, apt and yum).
Given your updated information in the comments about having about 10^5 input files (and so exceeding the shell's maximum number of arguments for a non-builtin command) and wanting the output columns in the order they're seen rather than alphabetically sorted, the following will work using any awk and any find:
$ cat tst.sh
#!/usr/bin/env bash
find . -maxdepth 1 -type f -name "$1" |
awk '
NR==FNR {
    fileName = $0
    ARGV[ARGC++] = fileName
    if ( (getline fldList < fileName) > 0 ) {
        if ( !seenList[fldList]++ ) {
            numFlds = split(fldList,fldArr)
            for (inFldNr=1; inFldNr<=numFlds; inFldNr++) {
                fldName = fldArr[inFldNr]
                if ( !seenName[fldName]++ ) {
                    hdr = (numOutFlds++ ? hdr OFS : "") fldName
                    outNr2name[numOutFlds] = fldName
                }
            }
        }
    }
    close(fileName)
    next
}
FNR == 1 {
    if ( !doneHdr++ ) {
        print hdr
    }
    delete name2inNr
    for (inFldNr=1; inFldNr<=NF; inFldNr++) {
        fldName = $inFldNr
        name2inNr[fldName] = inFldNr
    }
    next
}
{
    for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
        fldName = outNr2name[outFldNr]
        inFldNr = name2inNr[fldName]
        fldValue = (inFldNr ? $inFldNr : "-")
        printf "%s%s", fldValue, (outFldNr<numOutFlds ? OFS : ORS)
    }
}
' -
$ ./tst.sh 'file*.dat'
a b c e f d g
5 7 2 - - - -
3 9 1 - - - -
2 9 - 8 3 - -
2 8 - 3 3 - -
1 0 - 3 2 - -
1 - 1 - - 5 2
Note that input to the script is now the globbing pattern you want find to use to find the files, not the list of files.
Original answer:
If you don't mind a combined shell+awk script then this will work using any awk:
$ cat tst.sh
#!/usr/bin/env bash
awk -v hdrs="$(head -1 -q "$@" | tr ' ' '\n' | sort -u)" '
BEGIN {
    numOutFlds = split(hdrs,outNr2name)
    for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
        fldName = outNr2name[outFldNr]
        printf "%s%s", fldName, (outFldNr<numOutFlds ? OFS : ORS)
    }
}
FNR == 1 {
    delete name2inNr
    for (inFldNr=1; inFldNr<=NF; inFldNr++) {
        fldName = $inFldNr
        name2inNr[fldName] = inFldNr
    }
    next
}
{
    for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
        fldName = outNr2name[outFldNr]
        inFldNr = name2inNr[fldName]
        fldValue = (inFldNr ? $inFldNr : "-")
        printf "%s%s", fldValue, (outFldNr<numOutFlds ? OFS : ORS)
    }
}
' "$@"
$ ./tst.sh file{1..3}.dat
a b c d e f g
5 7 2 - - - -
3 9 1 - - - -
2 9 - - 8 3 -
2 8 - - 3 3 -
1 0 - - 3 2 -
1 - 1 5 - - 2
otherwise this is all awk using GNU awk for arrays of arrays, sorted_in, and ARGIND:
$ cat tst.awk
BEGIN {
    for (inFileNr=1; inFileNr<ARGC; inFileNr++) {
        inFileName = ARGV[inFileNr]
        if ( (getline < inFileName) > 0 ) {
            for (inFldNr=1; inFldNr<=NF; inFldNr++) {
                fldName = $inFldNr
                name2inNr[fldName][inFileNr] = inFldNr
            }
        }
        close(inFileName)
    }
    PROCINFO["sorted_in"] = "#ind_str_asc"
    for (fldName in name2inNr) {
        printf "%s%s", (numOutFlds++ ? OFS : ""), fldName
        for (inFileNr in name2inNr[fldName]) {
            outNr2inNr[numOutFlds][inFileNr] = name2inNr[fldName][inFileNr]
        }
    }
    print ""
}
FNR > 1 {
    for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
        inFldNr = outNr2inNr[outFldNr][ARGIND]
        fldValue = (inFldNr ? $inFldNr : "-")
        printf "%s%s", fldValue, (outFldNr<numOutFlds ? OFS : ORS)
    }
}
$ awk -f tst.awk file{1..3}.dat
a b c d e f g
5 7 2 - - - -
3 9 1 - - - -
2 9 - - 8 3 -
2 8 - - 3 3 -
1 0 - - 3 2 -
1 - 1 5 - - 2
For efficiency, the 2nd script above does all the heavy lifting in the BEGIN section so there's as little work as possible left in the main body of the script, which is evaluated once per input line. In the BEGIN section it creates an associative array (outNr2inNr[]) that maps the outgoing field numbers (the alphabetically sorted list of all field names across all input files) to the incoming field numbers, so all that's left to do in the body is print the fields in that order.
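As a hand-worked illustration of that mapping for the sample files (not program output):
# output column 1 is "a"; it is input column 1 in all three files:
#   outNr2inNr[1][1] = 1   outNr2inNr[1][2] = 1   outNr2inNr[1][3] = 1
# output column 5 is "e"; it exists only in file2, as its 3rd column:
#   outNr2inNr[5][2] = 3
# so for rows coming from file1 or file3, outNr2inNr[5][ARGIND] is unset and the "-" placeholder is printed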
Here is the solution I (the OP) have come up with so far. It may have some advantage over other approaches in that it processes the files in parallel.
R code:
library(parallel)
library(parallelMap)
# specify the directory containing the files we want to merge
args <- commandArgs(TRUE)
directory <- if (length(args)>0) args[1] else 'sg_grid'
#output_fname <- paste0(directory, '.dat')
# make a tmp directory that will store all the files
tmp_dir <- paste0(directory, '_tmp')
dir.create(tmp_dir)
# list the .dat files we want to merge
filenames <- list.files(directory)
filenames <- filenames[grep('.dat', filenames)]
# a function to read the column names
get_col_names <- function(filename)
colnames(read.table(file.path(directory, filename),
header=T, check.names=0, nrow=1))
# grab all the headers of all the files and merge them
col_names <- get_col_names(filenames[1])
for (simulation in filenames) {
col_names <- union(col_names, get_col_names(simulation))
}
# put those column names into a blank data frame
name_DF <- data.frame(matrix(ncol = length(col_names), nrow = 0))
colnames(name_DF) <- col_names
# save that as the header file
write.table(name_DF, file.path(tmp_dir, '0.dat'),
col.names=TRUE, row.names=F, quote=F, sep='\t')
# now read in every file and merge with the blank data frame
# it will have NAs in any columns it didn't have before
# save it to the tmp directory to be merged later
parallelStartMulticore(max(1,
min(as.numeric(Sys.getenv('OMP_NUM_THREADS')), 62)))
success <- parallelMap(function(filename) {
print(filename)
DF <- read.table(file.path(directory, filename),
header=1, check.names=0)
DF <- plyr:::rbind.fill(name_DF, DF)
write.table(DF, file.path(tmp_dir, filename),
quote=F, col.names=F, row.names=F, sep='\t')
}, filename=filenames)
# and we're done
print(all(unlist(success)))
This creates temporary versions of all the files, each of which now has all the headers; we can then cat them together into the result:
ls -1 sg_grid_tmp/* | while read fn ; do cat "$fn" >> sg_grid.dat; done

Bash script awk

I am new to Bash scripting. I am struggling to understand this particular line of code. Please help.
old_tag = awk -v search="$new_tag" -F" " '$1==search { a[count] = $2; count++; } END { srand();print a[int(rand()*(count-1))+1] }' $tag_dir/$file
[ -z "$new_tag" ] && break
The code seems to be incorrect. With old_tag = awk the code tries to put the result of the awk command in the variable old_tag. An assignment to a variable should be done without spaces around the =, and the command should be enclosed in $(..). It might have been backticks in the original code; these are deprecated, and backticks are also used for formatting on SO.
Your question would have been easier to answer with an example input file, but I will try to explain assuming input lines like
apple x1
car a
rotten apple
tree sf
apple x5
car a4
apple x3
I switched old_tag and new_tag; that seems to make more sense.
new_tag=$(awk -v search="$old_tag" -F" " '
$1==search { a[count] = $2; count++; }
END { srand(); print a[int(rand()*(count-1))+1] }
' $tag_dir/$file)
[ -z "$new_tag" ] && break
This code tries to find a new tag by searching for the old tag in $tag_dir/$file. When the tag occurs more than once, one of the matching lines is picked at random.
The code explained in more detail:
# assign output to variable new_tag
new_tag=$(..)
# use awk program
awk ..
# Assign the value of old_tag to a variable "search" that can be used in awk
-v search="$old_tag"
# Different fields separated by spaces
-F" "
# The awk programming lines
' .. '
# Check first field of line with the variable search
$1==search { .. }
# When true, store second field of line in array and increment index
a[count] = $2; count++;
# Additional commands after processing everything
END {..}
# Print random index from array
srand(); print a[int(rand()*(count-1))+1]
# Use file as input for awk
$tag_dir/$file
# Stop when no new_tag has been found
[ -z "$new_tag" ] && break
# I would have preferred the syntax
test -z "${new_tag}" && break
With the sample input and old_tag="apple", the code will find the lines with apple as the first word:
apple x1
apple x5
apple x3
The words x1 x5 x3 are stored in array a and randomly one of these 3 is assigned to new_tag.
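A minimal, self-contained way to try this out (a sketch: the file name sample.txt is just for illustration, and the awk body is slightly adjusted to store the matches at indices 0..count-1 so the random pick covers every match):
cat > sample.txt <<'EOF'
apple x1
car a
rotten apple
tree sf
apple x5
car a4
apple x3
EOF

old_tag="apple"
new_tag=$(awk -v search="$old_tag" '
    $1==search { a[count++] = $2 }                 # collect the second field of every matching line
    END { srand(); print a[int(rand()*count)] }    # pick one of them at random
' sample.txt)
echo "$new_tag"   # prints x1, x5 or x3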

deleting repetitive columns in unix

I would like to delete multiple repetitive columns from a huge file (about 1 million).
The columns that I want to delete all have the same column name, A; the others have different, unique names. Say:
A B2 A B3
1.1 AA 1.2 AA
2.1 AB 4.3 CT
2.2 AC 6.4 GT
so the column headers are A, B2, A, B3, ... .
How can I delete the columns named A from the data?
Another in awk:
$ awk '
NR==1 {
    split($0,a)
    for(i in a)
        if(a[i]=="A")
            delete a[i]
}
{
    for(i=1;i<=NF;i++)
        printf "%s",(i in a?$i OFS:"")
    printf ORS
}' file
B2 B3
AA AA
AB CT
AC GT
I'm not sure I'm understanding your question correctly, but here is a (GNU) awk solution that deletes all duplicate columns (keeping only the first occurrence):
#!/usr/bin/awk -f
NR==1 {
    seen[$1] = 1
    cols[0] = 1
    for (i=2; i<=NF; i++) {
        if (!($i in seen)) {
            seen[$i] = 1
            cols[length(cols)] = i
        }
    }
}
{
    for (i=0; i<length(cols); i++)
        printf "%s ", $(cols[i])
    printf "\n"
}
For the first line (NR==1), we find all non-duplicate columns (preserving their order); for every line, we then print only the columns (fields) we selected (the cols array holds the column/field indexes we wish to keep).
$ ./filter.awk file
A B2 B3
1.1 AA AA
2.1 AB CT
2.2 AC GT
cut -d' ' -f $(head -1 filename|tr ' ' '\n'|awk '{if(!seen[$0]++) print NR}'|paste -s -d ',') filename
this will work like a charm.
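For readers new to this style, the command substitution can be read step by step (a sketch, assuming the space-separated file is named filename as above):
# take the header line, put one column name per line, print the field number of each
# name's first occurrence, and join those numbers into a comma-separated list (here: 1,2,4):
head -1 filename | tr ' ' '\n' | awk '{if(!seen[$0]++) print NR}' | paste -s -d ','
# cut then uses that list to keep only the first occurrence of each column:
cut -d' ' -f "$(head -1 filename | tr ' ' '\n' | awk '{if(!seen[$0]++) print NR}' | paste -s -d ',')" filename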
The question is solved by the James Brown code. I added
#!/usr/bin/awk -f
to the first line of his code and corrected a tiny typo at the end of the code (an extra ' was deleted).
I am sorry I did not have time to try all the other suggestions.
With my best wishes

Performing calculations based on customer ID in comma-separated file [duplicate]

This question already has an answer here: Use awk to sum or average for each unique ID (1 answer). Closed 6 years ago.
I have a file that contains several comma-separated columns, including a customer ID in the first column.
One customer ID may occur on several rows, but always refers to the same real customer.
How do I run basic calculations in a shell script based on this ID column? For example, calculating the sum of the mileages (the 5th field) for the given customer ID.
102,305,Jin,Kerala,40
104,308,Paul,US,45
105,350,Nina,AUS,50
102,390,Jin,Kerala,10
104,395,Paul,US,35
102,399,Jin,Kerala,35
5th field is the mileage, 1st field is the customer ID.
This is a simple awk script that will sum up the mileages and print the customer IDs together with the sums at the end:
#!/usr/bin/awk -f
BEGIN { FS = "," }
{
    customer_id = $1;
    mileage = $5;
    total_mileage[customer_id] += mileage;
}
END {
    for (customer_id in total_mileage) {
        print customer_id, total_mileage[customer_id];
    }
}
To run (after making it executable with chmod +x script.awk):
$ ./script.awk data.in
102 85
104 80
105 50
Alternatively, as a "one-liner":
$ awk -F, '{t[$1]+=$5} END {for (c in t){print c,t[c]}}' data.in
102 85
104 80
105 50
While I agree with @wilx that using a database might be smarter, this sample awk script should get you started:
awk -v FS=',' '{miles[$1] += $5}
END { for (customerid in miles) {
print customerid, miles[customerid]; } }' customers
You can get a list of unique IDs using something like (assuming the first column is the ID):
awk '{print $1}' inputFile | sort -u
This outputs the first field of every single line in the input file inputFile, sorts them and removes duplicates.
You can then use that method with a bash loop to process each of the unique IDs with another awk command to perform some action on them. In the following snippet, I print out the matching lines for each ID:
for id in $(awk '{print $1}' inputFile | sort -u) ; do
    echo "${id}:"
    awk -v id="${id}" '$1==id {print " "$0}' inputFile
done
In that code, for each individual ID, it first outputs the ID then uses awk to only process lines matching that ID. The action carried out is to output the full line with indentation.
Of course, you can do anything you wish with the lines matching each ID. Below is an example more closely matching your requirements.
First, here's an input file I used for testing - we can assume field 1 is the customer ID and field 2 the mileage:
$ cat inputFile
a 1
b 2
c 3
a 4
b 5
c 6
a 7
b 8
c 9
b 10
c 11
c 12
And here's a command-line transcript of the method proposed (note that $ and + are the input prompt and continuation prompt respectively; they are not part of the actual commands):
$ for id in $(awk '{print $1}' inputFile | sort -u) ; do
+ awk -vid=${id} '
+ $1==id {print $0; sum += $2 }
+ END {print "Total: "sum; print }
+ ' inputFile
+ done
a 1
a 4
a 7
Total: 12
b 2
b 5
b 8
b 10
Total: 25
c 3
c 6
c 9
c 11
c 12
Total: 41
Keep in mind that, for non-huge data sets, it's also possible to do this in a single-pass awk script, using associative arrays to store the totals and then outputting all the data in the END block. I tend to prefer the multi-pass approach since it minimises the possibility of running out of memory. The trade-off, of course, is that it will no doubt take longer since you're processing the file more than once.
For a single-pass solution, you can use something like:
$ awk '{sum[$1] += $2} END {for (key in sum) { print key": "sum[key]}}' inputFile
which gives you:
a: 12
b: 25
c: 41

How to match a list of strings in two different files using a loop structure?

I have a file processing task that I need a hand in. I have two files (matched_sequences.list and multiple_hits.list).
INPUT FILE 1 (matched_sequences.list):
>P001 ID
ABCD .... (very long string of characters)
>P002 ID
ABCD .... (very long string of characters)
>P003 ID
ABCD ... ( " " " " )
INPUT FILE 2 (multiple_hits.list):
ID1
ID2
ID3
....
What I want to do is match the second column (ID2, ID4, etc.) with a list of IDs stored in multiple_hits.list. Then create a new matched_sequences file similar to the original but which excludes all IDs found in multiple_hits.list (about 60 out of 1000). So far I have:
#!/bin/bash
X=$(cat matched_sequences.list | awk '{print $2}')
Y=$(cat multiple_hits.list | awk '{print $1}')
while read matched_sequenes.list
do
[ $X -ne $Y ] && (cat matched_sequences.list | awk '{print $1" "$2}') > new_matched_sequences.list
done
I get the following error raised:
-bash: read: `matched_sequences.list': not a valid identifier
Many thanks in advance!
EXPECTED OUTPUT (new_matched_sequences.list):
Same as INPUT FILE 1 with all IDs in multiple_hits.list excluded
#!/usr/bin/awk -f
function chomp(s) {
    sub(/^[ \t]*/, "", s)
    sub(/[ \t\r]*$/, "", s)
    return s
}
BEGIN {
    file = ARGV[--ARGC]
    while ((getline line < file) > 0) {
        a[chomp(line)]++
    }
    RS = ""
    FS = "\n"
    ORS = "\n\n"
}
{
    id = chomp($1)
    sub(/^.* /, "", id)
}
!(id in a)
Usage:
awk -f script.awk matched_sequences.list multiple_hits.list > new_matched_sequences.list
A shorter awk answer is possible, with a tiny script reading first the file with the IDs to exclude, and then the file containing the sequences. The script would be as follows (the comments make it look long; it is just three useful lines in fact):
BEGIN { grab_flag = 0 }
# grab_flag will be used when we are reading the sequences file
# (not absolutely necessary to set here, though, because we expect the file will start with '>')
FNR == NR { hits[$1] = 1 ; next } # command executed for all lines of the first file: record IDs stored in multiple_hits.list
# otherwise we are reading the second file, containing the sequences:
/^>/ { if (hits[$2] == 1) grab_flag = 0 ; else grab_flag = 1 } # sets the flag indicating whether we have to output the sequence or not
grab_flag == 1 { print }
And if you call this script exclude.awk, you will invoke it this way:
awk -f exclude.awk multiple_hits.list matched_sequences.list
