Extract two consecutive lines that have non-consecutive strings - unix

I have a very large text file with 2 columns and more than 10 million lines.
In most lines, the number in column 2 equals the number in column 2 of the previous line plus 1. However, a few thousand lines behave differently (see the example below).
Input file:
A 1
A 2
A 3
A 10
A 11
A 12
A 40
A 41
I would like to extract each pair of consecutive lines that does not respect the +1 increment in column 2.
Desired output file:
A 3
A 10
A 12
A 40
Is there (preferably) an awk command that can do that?
I tried several commands comparing column 2 of two consecutive lines, but so far without success (see the code below).
awk 'FNR==1 {print; next} $2==p2+1 {print p $0; p=""; next} {p=$0 ORS; p2=$2}' input.txt > output.txt
Thanks for your help. Best,

Would you please try the following:
awk 'NR>1 {if ($2!=p2+1) print p ORS $0} {p=$0; p2=$2}' input.txt > output.txt
Output:
A 3
A 10
A 12
A 40
The variable names are similar to yours: p holds the previous line and
p2 holds the second column of the previous line.
The condition NR>1 suppresses printing on the 1st line, which has no previous line to compare against.
if ($2!=p2+1) print p ORS $0 prints each pair of lines which
meets the condition.
The block {p=$0; p2=$2} preserves the values of the current line for the next iteration.
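The same logic can be spelled out as a commented, multi-line script. This is only a sketch of the one-liner above; print_gap_pairs is a hypothetical helper name, and it assumes whitespace-separated input on stdin with a number in column 2:

```shell
# Hypothetical helper: reads two-column lines on stdin and prints each pair
# of consecutive lines whose 2nd column does not increase by exactly 1.
print_gap_pairs() {
  awk '
    NR > 1 && $2 != p2 + 1 {   # skip line 1: it has no predecessor
      print p                  # previous line
      print                    # current line
    }
    { p = $0; p2 = $2 }        # remember this line for the next iteration
  '
}

printf 'A 1\nA 2\nA 3\nA 10\nA 11\n' | print_gap_pairs
# prints:
# A 3
# A 10
```

As with the one-liner, a line that breaks the sequence on both sides is printed twice, once as the second line of one pair and once as the first line of the next.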

I like perl for text processing that needs arithmetic.
$ perl -ane 'print $p, $_ if $. > 1 && $F[1] != $fp + 1; $fp = $F[1]; $p = $_' input.txt
A 3
A 10
A 12
A 40
This is using -a to autosplit into @F.
From the 2nd line on, prints the previous line and the current line if the 2nd field isn't exactly one more than the prior 2nd field: print $p, $_ if $. > 1 && $F[1] != $fp + 1
Saves the 2nd field as $fp and the entire line as $p: $fp = $F[1]; $p = $_

Assumptions:
columns are tab-delimited
the 1st column may contain white space (this isn't demonstrated in the sample provided by OP but it also hasn't been ruled out)
lines of interest must have the same value in the 1st column (ie, if the values in the 1st column differ then we don't bother with comparing the values in the 2nd column and instead proceed to the next input line)
if 3 consecutive lines meet the criteria, the 2nd/middle line is only printed once
Setup:
$ cat input.txt
A 1
A 2
A 3 # match
A 10 # match
A 11
A 12 # match
A 23 # match
A 40 # match
A 41
X to Z 101
X to Z 102 # match
X to Z 104 # match
X to Z 105
NOTE: comments only added here to highlight the lines that match the search criteria
One awk idea:
awk -F'\t' '
FNR==1 { prevline=$0 }
FNR>1  { if ($1 == prev1 && $2+0 != prev2+1) {
             if (prevline) print prevline
             print
             prevline=""    # make sure this line is not printed again if next line also meets criteria
         }
         else
             prevline=$0
       }
       { prev1=$1; prev2=$2 }
' input.txt
This generates:
A 3
A 10
A 12
A 23
A 40
X to Z 102
X to Z 104

This might work for you (GNU sed):
sed -nE 'N;h
s/.*\s+(.*)\n.*(\s.*)/echo "$((\1+1))\2"/e;/^(.*)\s\1$/!{x;p;x};x;D' file
Open a two line window throughout the length of the file.
Make a copy of the window and increment the 2nd column of the first line by one. If this amended value is not equal to the 2nd column of the second line then print both unadulterated lines.
Delete the first line and repeat.
N.B. This may print the second of these lines twice if the following line meets the same criteria.

Related

get the top N counts based on the value of another column

I have a file with two columns that looks like this:
a 3
a 7
b 6
a 6
b 1
b 8
c 1
b 1
For each value in the first column, I'd like to find the top N counts from the second column. Using this example, I'd like to find the top 2 values in column 2 for each string in column 1. The desired output would be:
a 7
a 6
b 8
b 6
c 1
I was trying to do something like this with awk, but I am not that familiar with it. This gives the max, not top N:
awk '$2>max[$1]{max[$1]=$2; row[$1]=$0} END{for (i in row) print row[i]}'
Could you please try the following sort + awk solution:
sort -k2 -s -nr Input_file | awk '++array[$1]<=2' | sort -k1,1 -k2,2nr
OR
sort -k2 -s -nr Input_file | sort -k1,1 -k2,2nr | awk '++array[$1]<=2'
Brief explanation: the sort commands put the data in the required order (by 1st field, with the 2nd field in descending numeric order, as in the OP's sample output), and awk '++array[$1]<=2' keeps only the first 2 occurrences of each 1st-field value.
$ sort -k1,1 -k2,2rn file | awk -v n=2 '(++cnt[$1])<=n'
a 7
a 6
b 8
b 6
c 1
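The sort-then-count approach above can be wrapped so that N is a parameter. A minimal sketch, assuming whitespace-separated input on stdin; top_n is a hypothetical helper name:

```shell
# Hypothetical helper: keep the N largest column-2 values for each column-1 key.
# Usage: top_n N < file
top_n() {
  # Sort by key, then by value in descending numeric order, so the rows to
  # keep are the first N rows of each key group.
  sort -k1,1 -k2,2nr | awk -v n="$1" '++seen[$1] <= n'
}

printf 'a 3\na 7\nb 6\na 6\nb 1\nb 8\nc 1\nb 1\n' | top_n 2
# prints:
# a 7
# a 6
# b 8
# b 6
# c 1
```

The awk condition increments a per-key counter on every row and prints the row only while that counter is at most n, so keys with fewer than N rows (c here) are printed in full.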

Performing calculations based on customer ID in comma-separated file [duplicate]

This question already has an answer here:
Use awk to sum or average for each unique ID
(1 answer)
Closed 6 years ago.
I have a file that contains several comma-separated columns, including a customer ID in the first column.
One customer ID may occur on several rows, but always refers to the same real customer.
How do I run basic calculations in a shell script based on this ID column? For example, calculating the sum of the mileages (the 5th field) for the given customer ID.
102,305,Jin,Kerala,40
104,308,Paul,US,45
105,350,Nina,AUS,50
102,390,Jin,Kerala,10
104,395,Paul,US,35
102,399,Jin,Kerala,35
5th field is the mileage, 1st field is the customer ID.
This is a simple awk script that will sum up the mileages and print the customer IDs together with the sums at the end:
#!/usr/bin/awk -f
BEGIN { FS = "," }
{
customer_id = $1;
mileage = $5;
total_mileage[customer_id] += mileage;
}
END {
for (customer_id in total_mileage) {
print customer_id, total_mileage[customer_id];
}
}
To run (after making it executable with chmod +x script.awk):
$ ./script.awk data.in
102 85
104 80
105 50
Alternatively, as a "one-liner":
$ awk -F, '{t[$1]+=$5} END {for (c in t){print c,t[c]}}' data.in
102 85
104 80
105 50
While I agree with @wilx that using a database might be smarter, this sample awk script should get you started:
awk -v FS=',' '{miles[$1] += $5}
END { for (customerid in miles) {
print customerid, miles[customerid]; } }' customers
You can get a list of unique IDs using something like (assuming the first column is the ID):
awk '{print $1}' inputFile | sort -u
This outputs the first field of every single line in the input file inputFile, sorts them and removes duplicates.
You can then use that method with a bash loop to process each of the unique IDs with another awk command to perform some action on them. In the following snippet, I print out the matching lines for each ID:
for id in $(awk '{print $1}' inputFile | sort -u) ; do
echo "${id}:"
awk -v id="${id}" '$1==id {print " "$0}' inputFile
done
In that code, for each individual ID, it first outputs the ID then uses awk to only process lines matching that ID. The action carried out is to output the full line with indentation.
Of course, you can do anything you wish with the lines matching each ID. Below is an example more closely matching your requirements.
First, here's an input file I used for testing - we can assume field 1 is the customer ID and field 2 the mileage:
$ cat inputFile
a 1
b 2
c 3
a 4
b 5
c 6
a 7
b 8
c 9
b 10
c 11
c 12
And here's a command-line transcript of the method proposed (note that $ and + are input prompt and continuation prompt respectively, they are not part of the actual commands):
$ for id in $(awk '{print $1}' inputFile | sort -u) ; do
+ awk -vid=${id} '
+ $1==id {print $0; sum += $2 }
+ END {print "Total: "sum}
+ ' inputFile
+ done
a 1
a 4
a 7
Total: 12
b 2
b 5
b 8
b 10
Total: 25
c 3
c 6
c 9
c 11
c 12
Total: 41
Keep in mind that, for non-huge data sets, it's also possible to do this in a single-pass awk script, using associative arrays to store the totals and then outputting all the data in the END block. I tend to prefer the multi-pass approach myself since it minimises the possibility of running out of memory. The trade-off, of course, is that it will no doubt take longer since you're processing the file more than once.
For a single-pass solution, you can use something like:
$ awk '{sum[$1] += $2} END {for (key in sum) { print key": "sum[key]}}' inputFile
which gives you:
a: 12
b: 25
c: 41
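awk does not specify the order in which for (key in sum) visits the keys, so the totals may come out in a different order with another awk implementation. A minimal sketch, assuming sorted output is acceptable, pipes the result through sort; sum_by_id is a hypothetical helper name:

```shell
# Hypothetical helper: sums column 2 per ID (column 1) from stdin in one pass.
sum_by_id() {
  awk '
    { sum[$1] += $2 }                                   # accumulate per ID
    END { for (key in sum) print key ": " sum[key] }    # emit totals once, at the end
  ' | sort                                              # make the key order deterministic
}

printf 'a 1\nb 2\nc 3\na 4\nb 5\n' | sum_by_id
# prints:
# a: 5
# b: 7
# c: 3
```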

transpose column to row [duplicate]

This question already has answers here:
awk group by multiple columns and print max value with non-primary key
(3 answers)
Closed 6 years ago.
I have tried to transpose the file below using awk
n counts
1 -0.1520
1 0.0043
1 -0.4903
10 0.0316
10 -0.4076
10 -0.1175
200 0.2720
200 -0.2007
200 0.0559
I need output like this
1 -0.1520 0.0043 -0.4903
10 0.0316 -0.4076 -0.1175
200 0.2720 -0.2007 0.0559
I tried this but it didn't work
awk 'NR==1{print} NR>1{a[$1]=a[$1]" "$2}END{for (i in a){print i " " a[i]}}'
Thank you
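It's working; only the row order differs, because awk does not guarantee the order in which for (i in a) visits the keys. Try below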
awk 'NR==1{print} NR>1{a[$1]=a[$1]" "$2}END{for (i in a){print i " " a[i]}}' file | tac
or you can use sort
awk 'NR==1{print} NR>1{a[$1]=a[$1]" "$2}END{for (i in a){print i " " a[i]}}' file | sort -k1 -n
mmm awk.
$ cat bar.awk
#! /usr/bin/awk -f
BEGIN{getline}
$1 != n {if(row)print row; n=$1; row = $0; next}
$1 == n {row = row FS $2}
END{ print row }
$ ./bar.awk foo
1 -0.1520 0.0043 -0.4903
10 0.0316 -0.4076 -0.1175
200 0.2720 -0.2007 0.0559
the getline eats the header
the $1 != n will notice when the first column changes
n will be 0 to begin with so if the first column (& second row) is also zero there will be a problem and you will have to initialize n to something else
when the first column changes it is time to print the previous row and begin collecting the next row,
(if the row is empty, as it is initially, don't print it)
when the first column is the same as the previous row,
just append the second value to your row.
finally print the last row.
the FS is whatever the current field separator is.
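Both the tac and sort workarounds and the row-by-row script depend on the input order. An alternative sketch records the order in which keys first appear, so it also works when rows for the same key are not adjacent. It assumes whitespace-separated input on stdin with one header line; group_rows is a hypothetical helper name:

```shell
# Hypothetical helper: groups column-2 values by column 1, keeping the order
# in which each key first appears (input on stdin, first line is a header).
group_rows() {
  awk '
    NR == 1      { next }                      # skip the header line
    !($1 in row) { order[++n] = $1 }           # remember a key the first time it is seen
                 { row[$1] = row[$1] " " $2 }  # append the value to the row of its key
    END          { for (i = 1; i <= n; i++) print order[i] row[order[i]] }
  '
}

printf 'n counts\n1 -0.1520\n1 0.0043\n10 0.0316\n10 -0.4076\n' | group_rows
# prints:
# 1 -0.1520 0.0043
# 10 0.0316 -0.4076
```

The trade-off is memory: every row is held in the array until END, whereas the row-by-row script keeps only the current group.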

Comparing consecutive rows within a file

I have two files with 1s and 0s in each column, where the field separator is "," :
1,0,0,1,1,1,0,0,0,0,1,0,0,1,1,0,1,0
0,1,0,1,1,1,0,1,0,1,0,0,0,0,0,0,0,0
1,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,1,0
1,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,1,0
1,0,1,0,0,0,0,1,1,1,1,1,1,1,1,1,0,1
1,0,1,0,0,0,0,1,1,1,1,1,1,1,1,1,0,0
1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0
1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0
1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0
1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0
What I want to do is look at the file in pairs of rows, compare them, and if they are exactly the same output a 1. So for this example rows 1 & 2 are different so they don't get a 1, rows 3 & 4 are exactly the same so they get a 1, rows 5 & 6 differ by 1 column so they don't get a 1, and so on.
So the desired output could be something like :
1
1
1
Because here there are exactly 3 pairs (they are paired by the fact if they are consecutive) of rows that are exactly the same: rows 3&4, 7&8, and 9&10. The comparison should not reuse a row, so if you compare rows 1 & 2, you shouldn't then compare rows 2 & 3.
You can do this with awk like:
awk -F, '!(NR%2) {print $0==p} {p=$0}' data
0
1
0
1
1
where every line that's evenly divisible by two will print a 0 if the current line doesn't match the last value for p or a 1 if it matches.
If you truly only want the 1s, which is throwing away any information about which pairs matched, you could:
awk -F, '!(NR%2)&&$0==p {print 1} {p=$0}' data
1
1
1
Alternatively, you could output matching pair line numbers like:
awk -F, '!(NR%2)&&$0==p {print NR-1 "," NR} {p=$0}' data
3,4
7,8
9,10
Or just the counts of all matched pairs:
awk -F, '!(NR%2)&&$0==p {c++} {p=$0} END{ print c}' data
3
Another useful variant might be just to return the matching lines directly:
awk -F, '!(NR%2)&&$0==p {print} {p=$0}' data
1,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,1,0
1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0
1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0
I would use a shell script like this:
while IFS= read -r line
do
    if test "$prevline" = "$line"
    then
        echo 1
    fi
    prevline=$line
done
I'm not 100% sure about your requirement to "not reuse a row", but I think that could be achieved by changing inner part of the loop to
    if test "$prevline" = "$line"
    then
        echo 1
        line="" # don't reuse a line
    fi
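Another way to honour the "do not reuse a row" requirement in plain shell is to read two lines per loop iteration, so rows are always compared as 1 & 2, 3 & 4, and so on. A minimal sketch, assuming the rows arrive on stdin (a trailing unpaired row is ignored); print_equal_pairs is a hypothetical helper name:

```shell
# Hypothetical helper: prints 1 for every pair of rows (1&2, 3&4, ...) that
# is identical. Reads from stdin.
print_equal_pairs() {
  # Two reads per iteration: no row is ever compared twice.
  while IFS= read -r first && IFS= read -r second; do
    if [ "$first" = "$second" ]; then
      echo 1
    fi
  done
}

printf '1,0\n0,1\n1,1\n1,1\n0,0\n0,0\n' | print_equal_pairs
# prints:
# 1
# 1
```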

how to delete the last line in file starting with string in unix

I have a file in the below format
AB1234 jhon cell number etc
MD 2 0 8 -1
MD4567 Jhon2 cell number etc
MD 2 0 8 -1
I want to find the last line that starts with "MD 2" (not just MD, as MD is embedded in other data) and delete that line.
So my output should be:
AB1234 jhon cell number etc
MD 2 0 8 -1
MD4567 Jhon2 cell number etc
I have tried many regular expressions in sed, but none of them seems to work.
sed -e '/^MD *2/p' <file Name >
sed '/^(MD 2)/p' <file Name>
This might work for you (GNU sed):
sed '/^MD\s\+2/,${//{x;//p;d};H;$!d;x;s/^[^\n]*\n//}' file
This holds a window of lines in the hold space. When it encounters the required pattern, it prints out the current window and starts a new one. At the end of file it prints out all but the first line of the window (as it is the first line that is the required pattern to be deleted).
You can do this in 2 steps:
Find the line number of that line
Delete the line using sed
For example:
n=$(awk '/^MD *2/ { n=NR } END { print n }' filename)
sed "${n}d" filename
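One caveat with the two steps: if no line matches, n is empty and the sed script becomes just "d", which deletes every line. A hedged sketch of the same approach with a guard for that case; delete_last_match is a hypothetical helper name and the pattern is the one used above:

```shell
# Hypothetical helper: print FILE without the last line matching /^MD *2/.
# Usage: delete_last_match FILE
delete_last_match() {
  # Step 1: line number of the last match (0 if there is none).
  last=$(awk '/^MD *2/ { n = NR } END { print n + 0 }' "$1")
  # Step 2: delete that line, or pass the file through unchanged.
  if [ "$last" -gt 0 ]; then
    sed "${last}d" "$1"
  else
    cat "$1"
  fi
}
```

For example, delete_last_match filename > filename.new writes the result to a new file and leaves the original untouched.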
If you are trying to match exactly 2 in the second column (and not strings that begin with 2), do two passes:
awk 'NR==FNR && $1 == "MD" && $2 == "2"{k=NR} NR!=FNR && FNR!=k' input input
Or, if you have access to tac and want to make 3 passes on the file:
tac input | awk '$1 == "MD" && $2 == "2" && !k{ k=1; next}1' | tac
To match when the second column does not exactly equal the string 2 but merely begins with a 2, replace $2 == "2" in the above with $2 ~ /^2/
Here is one way to do it.
awk '{a[NR]=$0} /^MD *2/ {f=NR} END {for (i=1;i<=NR;i++) if (f!=i) print a[i]}' file
AB1234 jhon cell number etc
MD 2 0 8 -1
MD4567 Jhon2 cell number etc
Store all data in array a
Search and find last MD 2 and store record number in f
Then print array a, but only if record number is not equal to value in f
