We have a fixed-width file with the following columns:
Col1 length 10
Col2 length 10
Col3 length 30
Col4 length 40
Sample record
ABC 123 xyz. 5171-5261,51617
ABC. 1234. Xxy. 81651-61761
Col4 can hold any number of comma-separated values (one or more) within its 40 characters. If a record has only one value, it should pass through to the output file unchanged.
If it has more than one value, i.e. comma-separated (5171-5261,51617),
the output file should contain one record per value.
For example, the first record becomes:
ABC. 123. Xyz. 5171-5261
ABC 123. Xyz. 51617
What is the most efficient way to do this?
So far I have tried while and for loops, but execution takes very long because the splitting is done while reading the file record by record.
The output file can be comma-separated or fixed-width.
awk is your friend here.
A single line of awk will achieve what you need:
awk -v FIELDWIDTHS="10 10 30 40" '{ if (match($4,",")) { split($4,array,","); for (i in array) { print $1,$2,$3,array[i]; }; } else { print $1,$2,$3,$4 }; }' samp.dat
For ease of reading the code is:
{
    if (match($4,",")) {
        split($4,array,",");
        for (i in array) {
            print $1,$2,$3,array[i];
        };
    } else {
        print $1,$2,$3,$4
    };
}
Testing with the sample data you supplied gives:
ABC 123 xyz. 5171-5261
ABC 123 xyz. 51617
ABC. 1234. Xxy. 81651-61761
How it works:
awk reads your file one line at a time.
The FIELDWIDTHS variable (a GNU awk extension) lets us reference each fixed-width column as $1, $2, and so on.
Now that we have our columns we can look for a comma in the fourth field with match($4,",").
If we find one we make an array of the values in the fourth field that are separated by commas with split($4,array,",").
Then we loop through this array and print multiple lines of output, one for each element of the array.
If the fourth field has no comma the else clause prints a single line.
This process repeats for each line in your fixed width file.
NOTE:
awk does not guarantee the order in which for (i in array) visits the elements, so the order of your data may not be preserved.
This means that your output might come out as
ABC 123 xyz. 51617
ABC 123 xyz. 5171-5261
ABC. 1234. Xxy. 81651-61761
i.e. 5171-5261,51617 in the input data produced a line from the second value before the first.
If the ordering is important to you then you can use the code below that makes a csv from your input data first, then produces the output preserving the order.
awk -v FIELDWIDTHS="10 10 30 40" '{print $1,$2,$3,$4}' OFS=',' samp.dat > samp.csv
awk -F',' '{ for (i=4; i<=NF; i++) { print $1,$2,$3,$i } }' samp.csv
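As an alternative to the two-step csv approach, here is a minimal single-pass sketch that should also run on awks without FIELDWIDTHS. It cuts the columns with substr() and walks the split values with an indexed loop, which keeps them in input order. The function name split_col4 is only illustrative, and the column offsets are derived from the widths 10, 10, 30 and 40 given in the question.

```shell
# Sketch: one output record per comma-separated value in column 4.
# split_col4 is a hypothetical helper name, not part of the original answer.
split_col4() {
    awk '{
        c1 = substr($0, 1, 10)            # Col1: characters 1-10
        c2 = substr($0, 11, 10)           # Col2: characters 11-20
        c3 = substr($0, 21, 30)           # Col3: characters 21-50
        c4 = substr($0, 51, 40)           # Col4: characters 51-90
        n = split(c4, vals, ",")          # split() returns the number of values
        if (n == 0) { print c1, c2, c3, c4; next }   # empty Col4: keep the record
        for (i = 1; i <= n; i++)          # indexed loop preserves input order
            print c1, c2, c3, vals[i]
    }' "$@"
}
```

Usage would be split_col4 samp.dat > out.txt. The fields keep their fixed-width padding, as they do in the FIELDWIDTHS version.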
I have a very large text file with 2 columns and more than 10 million lines.
In most lines, the number in column 2 equals column 2 of the previous line +1. However, a few thousand lines behave differently (see example below).
Input file:
A 1
A 2
A 3
A 10
A 11
A 12
A 40
A 41
I would like to extract each pair of consecutive lines that does not respect the +1 increment in column 2.
Desired output file:
A 3
A 10
A 12
A 40
Is there (preferably) an awk command that can do that?
I tried several commands comparing column 2 of two consecutive lines, but so far without success (see the code below).
awk 'FNR==1 {print; next} $2==p2+1 {print p $0; p=""; next} {p=$0 ORS; p2=$2}' input.txt > output.txt
Thanks for your help. Best,
Would you please try the following:
awk 'NR>1 {if ($2!=p2+1) print p ORS $0} {p=$0; p2=$2}' input.txt > output.txt
Output:
A 3
A 10
A 12
A 40
The variable names are similar to yours: p holds the previous line and p2 holds the second column of the previous line.
The condition NR>1 skips the comparison on the 1st line, which has no previous line to compare against.
if ($2!=p2+1) print p ORS $0 prints the pairs of lines which meet the condition.
The block {p=$0; p2=$2} preserves the values of the current line for the next iteration.
I like perl for the text processing that needs arithmetic.
$ perl -ane 'print and next if $.<3; print $p and print if $F[3]!=$fp+1; $fp=$F[3]; $p=$_' input.txt
| COLUMN 1 | COLUMN 2 |
| -------- | -------- |
| A | 3 |
| A | 10 |
| A | 12 |
| A | 40 |
This is using -a to autosplit into @F.
Prints first 2 lines: print and next if $.<3
On subsequent lines, prints previous line and current line if the 4th field isn't exactly one more than the prior 4th field: print $p and print if $F[3]!=$fp+1
Saves the 4th field as $fp and the entire line as $p: $fp=$F[3]; $p=$_
Assumptions:
columns are tab-delimited
the 1st column may contain white space (this isn't demonstrated in the sample provided by OP but it also hasn't been ruled out)
lines of interest must have the same value in the 1st column (ie, if the values in the 1st column differ then we don't bother with comparing the values in the 2nd column and instead proceed to the next input line)
if 3 consecutive lines meet the criteria, the 2nd/middle line is only printed once
Setup:
$ cat input.txt
A 1
A 2
A 3 # match
A 10 # match
A 11
A 12 # match
A 23 # match
A 40 # match
A 41
X to Z 101
X to Z 102 # match
X to Z 104 # match
X to Z 105
NOTE: comments only added here to highlight the lines that match the search criteria
One awk idea:
awk -F'\t' '
FNR==1 { prevline=$0 }
FNR>1  { if ($1 == prev1 && $2+0 != prev2+1) {
            if (prevline) print prevline
            print
            prevline=""    # make sure this line is not printed again if next line also meets criteria
         }
         else
            prevline=$0
       }
       { prev1=$1; prev2=$2 }
' input.txt
This generates:
A 3
A 10
A 12
A 23
A 40
X to Z 102
X to Z 104
This might work for you (GNU sed):
sed -nE 'N;h
s/.*\s+(.*)\n.*(\s.*)/echo "$((\1+1))\2"/e;/^(.*)\s\1$/!{x;p;x};x;D' file
Open a two line window throughout the length of the file.
Make a copy of the window and increment the 2nd column of the first line by one. If this amended value is not equal to the 2nd column of the second line then print both unadulterated lines.
Delete the first line and repeat.
N.B. This may print the second of these lines twice if the following line meets the same criteria.
I would like to split the following file based on the pattern ABC:
ABC
4
5
6
ABC
1
2
3
ABC
1
2
3
4
ABC
8
2
3
to get file1:
ABC
4
5
6
file2:
ABC
1
2
3
etc.
Looking at the docs of man csplit: csplit my_file /regex/ {num}.
I can split this file using: csplit my_file '/^ABC$/' {2} but this requires me to put in a number for {num}. When I try to match with {*}, which is supposed to repeat the pattern as many times as possible, I get the error:
csplit: *}: bad repetition count
I am using zsh.
To split a file on a pattern like this, I would turn to awk:
awk 'BEGIN { i=0; }
/^ABC/ { ++i; }
{ print >> ("file" i) }' < input
This reads lines from the file named input; before reading any lines, the BEGIN section explicitly initializes an "i" variable to zero; variables in awk default to zero, but it never hurts to be explicit. The "i" variable is our index to the serial filenames.
Subsequently, each line that starts with "ABC" will increment this "i" variable.
Any and every line in the file will then be printed (in append mode) to the file name that's generated from the text "file" and the current value of the "i" variable.
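For inputs with many sections, a variant that closes each output file before moving on to the next one may be worth considering, since some awk implementations limit the number of simultaneously open files. The sketch below is one way to do that under a few assumptions: split_on_abc and its two arguments (input file, output prefix) are illustrative names, the separator line is exactly ABC, and any lines before the first ABC are skipped rather than written to a file0.

```shell
# Sketch: write each ABC-delimited section to prefix1, prefix2, ...
# split_on_abc is a hypothetical helper name.
split_on_abc() {
    # $1: input file, $2: output file prefix
    awk -v prefix="$2" '
        /^ABC$/ {
            if (out != "") close(out)     # release the previous file handle
            out = prefix (++n)            # parentheses keep the concatenation unambiguous
        }
        out != "" { print > out }         # lines before the first ABC are skipped
    ' "$1"
}
```

Calling split_on_abc my_file file would produce file1, file2 and so on. Each name is opened only once, so using > instead of >> also means a rerun overwrites old output instead of appending to it.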
I have a long list of data organised as below (INPUT).
I want to split the data up so that I get an output as below (desired OUTPUT).
The code below first identifies all the lines containing ">gi" and saves the line numbers of those lines in B.
Then, in a new file, it should replace those lines with a shortened version of the text following the ">gi".
I figured the easiest way would be to split at "|", but this does not work (no separation happens with my code if I replace " " with "|").
My code is below. It splits nicely at " " if I replace the "|" by " " in the INPUT, but I get into trouble when I want the text between the [ ] brackets, which is NOT always there and is not always just 2 words:
B=$( grep -n ">gi" 1VAO_1DII_5fxe_all_hits_combined.txt | cut -d : -f 1)
awk <1VAO_1DII_5fxe_all_hits_combined.txt >seqIDs_1VAO_1DII_5fxe_all_hits_combined.txt -v lines="$B" '
BEGIN {split(lines, a, " "); for (i in a) change[a[i]]=1}
NR in change {$0 = ">" $4}
1
'
Let me know if more explanation is needed!
INPUT:
>gi|9955361|pdb|1E0Y|A:1-560 Chain A, Structure Of The D170sT457E DOUBLE MUTANT OF VANILLYL- Alcohol Oxidase
MSKTQEFRPLTLPPKLSLSDFNEFIQDIIRIVGSENVEVISSKDQIVDGSYMKPTHTHDPHHVMDQDYFLASAIVA
>gi|557721169|dbj|GAD99964.1|:1-560 hypothetical protein NECHADRAFT_63237 [Byssochlamys spectabilis No. 5]
MSETMEFRPMVLPPNLLLSEFNGFIRETIRLVGCENVEVISSKDQIHDGSYMDPRHTHDPHHIMEQDYFLASAIVAPRNV
desired OUTPUT:
>1E0Y
MSKTQEFRPLTLPPKLSLSDFNEFIQDIIRIVGSENVEVISSKDQIVDGSYMKPTHTHDPHHVMDQDYFLASAIVAPRNV
>GAD99964.1 Byssochlamys spectabilis No. 5
MSETMEFRPMVLPPNLLLSEFNGFIRETIRLVGCENVEVISSKDQIHDGSYMDPRHTHDPHHIMEQDYFLASAIVA
This can be done in one step with awk (gnu awk):
awk -F'|' '/^>gi/{a=1;match($NF,/\[([^]]*)]/, b);print ">"$4" "b[1];next}a{print}!$0{a=0}' input > output
In a more readable way:
/^>gi/ {                            # when the line starts with ">gi"
    a=1;                            # set flag "a" to 1
    # extract the optional part between brackets in the last field
    match($NF,"\\[([^]]*)]", b);
    print ">"$4" "b[1];             # display the line
    next                            # jump to the next record
}
a { print }                         # when "a" (allowed block) display the line
!$0 { a=0 }                         # when the line is empty, set "a" to 0 to stop the display
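The three-argument form of match() is specific to GNU awk. If another awk has to be used, a rough equivalent can be sketched with index() and substr(). Note the simplifications: shorten_headers is an illustrative name, the text is taken between the first "[" and the first "]" of the last field, every non-header line is printed (there is no flag that stops at empty lines), and no trailing space is added when the brackets are missing.

```shell
# Sketch: shorten ">gi|...|...|ID|..." headers to ">ID [optional bracket text]".
# shorten_headers is a hypothetical helper name.
shorten_headers() {
    awk -F'|' '
        /^>gi/ {
            hdr = ">" $4                       # 4th |-separated field is the ID
            lb = index($NF, "[")               # position of the opening bracket, 0 if none
            rb = index($NF, "]")               # position of the closing bracket, 0 if none
            if (lb > 0 && rb > lb)
                hdr = hdr " " substr($NF, lb + 1, rb - lb - 1)
            print hdr
            next
        }
        { print }                              # sequence lines pass through unchanged
    ' "$@"
}
```

Usage would be shorten_headers input > output.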
I would like to remove all the lines in my data file that contain a value in column 2 that is repeated in column 2 in other lines.
I've sorted by the value in column 2, but can't figure out how to use uniq for just the values in one field as the values are not necessarily of the same length.
Alternately, I can remove lines with the duplicate using an awk one-liner like
awk -F"[,]" '!_[$2]++'
but this retains the line with the first incidence of the repeated value in col 2.
As an example, if my data is
a,b,c
c,b,a
d,e,f
h,i,j
j,b,h
I would like to remove ALL lines (including the first) where b occurs in the second column.
Like this:
d,e,f
h,i,j
Thanks for any advice!!
If the order is not important then the following should work:
awk -F, '
!seen[$2]++ {
    line[$2] = $0
}
END {
    for(val in seen)
        if(seen[val]==1)
            print line[val]
}' file
Output
h,i,j
d,e,f
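Solution with grep (this hardcodes the repeated value b, so it only helps when you already know which value to remove):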
grep -v -E '\b,b,\b' text.txt
Content of the file:
$ cat text.txt
a,b,c
c,b,a
d,e,f
h,i,j
j,b,h
a,n,b
b,c,f
$ grep -v -E '\b,b,\b' text.txt
d,e,f
h,i,j
a,n,b
b,c,f
Hope it helps
Some different awk:
awk -F, '
BEGIN {f=0}
FNR==NR {_[$2]++;next}
f==0 {
    f=1
    for(j in _)if(_[j]>1)delete _[j]
}
$2 in _
' file file
Explanation
The awk passes through the file twice - that's why it appears twice at the end. On the first pass (when FNR==NR) I count the number of times each column 2 appears in array _[]. At the end of the first pass, I then delete all elements of _[] where that element has been seen more than once. Then, on the second pass, I print lines whose second field appears in _[].
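The same two-pass idea can be sketched a little more compactly by testing the count directly on the second pass instead of deleting entries first. This keeps the lines in their original order. The function name drop_repeated_col2 is illustrative, and the sketch assumes a regular, non-empty file that can be read twice.

```shell
# Sketch: print only lines whose column-2 value occurs exactly once in the file.
# drop_repeated_col2 is a hypothetical helper name.
drop_repeated_col2() {
    # $1: comma-separated input file (read twice, so it cannot be a pipe)
    awk -F, '
        NR == FNR { count[$2]++; next }   # pass 1: count each column-2 value
        count[$2] == 1                    # pass 2: default action prints the line
    ' "$1" "$1"
}
```

Usage would be drop_repeated_col2 file.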
I have a file like:
A 50.40,60.80,56.60,67.80,51.20,78.40,63.80,64.2
B 37.40,37.40,38.40,38.80,58.40,58.80,45.00,44.8
.
.
.
I want to print only those lines in which all values in column 2 are greater than 50.
output:
A 50.40,60.80,56.60,67.80,51.20,78.40,63.80,64.2
I tried:
cat file | tr ',' '\t' | awk '{for (i=2; i<=NF; i++){if($i<50) continue; else print $i}}'
I hope you meant that r tag you added to your question.
tab <- read.table("file")
splt <- strsplit(as.character(tab[[2]]), ",")
rows <- unlist(lapply(splt, function(a) all(as.numeric(a) > 50)))
tab[rows,]
This will read your file as a space-separated table, split the second column into individual values (resulting in a list of character vectors), then compute a logical value for each such row depending on whether or not all values are > 50. These results are combined to a logical vector which is then used to subset your data.
The field separator can be any regular expression, so if you include commas in FS your approach works:
awk '{ for(i=2; i<=NF; i++) if($i<=50) next } 1' FS='[ \t,]+' infile
Output:
A 50.40,60.80,56.60,67.80,51.20,78.40,63.80,64.2
Explanation
The for-loop runs through the comma-separated values in the second column and if any of them is lower than or equal to 50, next is executed, i.e. awk skips to the next line. If the loop finishes without hitting next, the 1 is reached, which evaluates to true and triggers the default action: { print $0 }.
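If the file might later gain extra columns that should not take part in the comparison, one option is to leave FS alone and split only the second field. The following is a sketch under that assumption. The function name all_above and its threshold argument are illustrative, and a line whose second field is empty is printed because there is no value that fails the test.

```shell
# Sketch: print lines where every comma-separated value in column 2 exceeds a threshold.
# all_above is a hypothetical helper name.
all_above() {
    # $1: threshold, remaining arguments: input files (stdin if none)
    limit=$1
    shift
    awk -v limit="$limit" '{
        n = split($2, v, ",")                     # only column 2 is split on commas
        for (i = 1; i <= n; i++)
            if (v[i] + 0 <= limit + 0) next       # +0 forces a numeric comparison
        print                                     # reached only if no value failed
    }' "$@"
}
```

Usage would be all_above 50 file.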