To count quickly until first match and stop in megastring - unix

I want to count the number of characters before the first occurrence of the pattern 030 in a megarow, without reading any data past that point, so that the whole megarow is never loaded into memory.
It should return 28.
Megastring Data
48000000fe5a1eda480000000d00030001000000cd010000020000000000000000000000000000000000000000000000000000000200000001000000ffffffff57ea5e55ff640c00585e0000fe5a1eda480000000d00030007000000cd010000010000000000000002000000000000800000000000000000000000
My initial idea was to split at the first instance of 030, but I did not succeed with this.
I also do not know whether the split command can read only up to the end of the pattern.
How can you count quickly until the first match?

If your megarow is in a file named megarow_file you could do the following:
#!/bin/bash
INPUT=megarow_file
SEARCH_STRING="030"
search_length=${#SEARCH_STRING}
comp_string=""
char_count=0
found=no
# Read one character at a time so the whole row is never held in memory
while IFS= read -r -n1 char
do
    char_count=$((char_count + 1))
    comp_string="${comp_string}${char}"
    # Keep only the last $search_length characters as a sliding window
    if [ "${#comp_string}" -gt "$search_length" ]; then
        comp_string="${comp_string:1}"
    fi
    if [ "$comp_string" = "$SEARCH_STRING" ]; then
        found=yes
        break
    fi
done < "$INPUT"
if [ "$found" = yes ]; then
    count_up_to_comp_string=$((char_count - search_length))
    echo "Length up to ${SEARCH_STRING} was ${count_up_to_comp_string} characters"
else
    echo "${SEARCH_STRING} was not found" >&2
fi

Comparing GNU awk and BSD awk, prompted by BlueMoon's comment:
$ time cat megaRow | awk '{print index($0, "fafafafa")-1}'
48584
real 1m13.489s
user 1m11.608s
sys 0m4.685s
$ time cat megaRow | gawk '{print index($0, "fafafafa")-1}'
48584
real 1m12.792s
user 1m8.845s
sys 0m4.933s
GNU awk is slightly faster here, but the difference is within measurement uncertainty and so not significant.
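Note that index($0, ...) makes awk read the entire row before it searches. A minimal sketch of an alternative that can stop early: treat the search pattern as the record separator, so the first record is exactly the text before the first match, then exit. This assumes an awk that accepts a multi-character regular expression as RS (gawk and mawk do; POSIX leaves it unspecified). The function name count_until is hypothetical.

```shell
#!/bin/sh
# count_until PATTERN FILE
# Prints the number of characters before the first occurrence of PATTERN.
# Assumption: awk treats a multi-character RS as a regular expression
# (true for gawk and mawk, not guaranteed by POSIX).
count_until() {
    awk -v RS="$1" 'NR == 1 { print length($0); exit }' "$2"
}
```

Usage would be count_until 030 megarow_file. If the pattern never occurs, the first record is the whole input, so the printed number is the total length rather than an offset; check for that case separately if it matters.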

Related

Duplicates in an unix text file based on multiple fields

I have a requirement to find duplicates based on three columns in a .txt file in unix which is delimited by ,.
Input:
a,b,c,d,e,f,gf,h
a,bd,cg,dd,ey,f,g,h
a,b,df,d,e,fd,g,h
a,b,ck,d,eg,f,g,h
Let's say we are finding duplicates based on fields 1, 2 and 5.
Expected output:
a,b,c,d,e,f,gf,h
a,b,df,d,e,fd,g,h
Can anyone help to write a script for this or is there a command already available?
I tried like this:
awk -F, '!x[$1,$2,$3]++' file.txt but did not work
One way using awk:
awk -F, 'FNR==NR { x[$1,$2,$5]++; next } x[$1,$2,$5] > 1' a.txt a.txt
This is simple, but reads the file twice. On the first pass (FNR==NR), it maintains counts based on the key fields. During the second pass, it prints the line if its key was found more than once.
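As a small sketch, the two-pass command can be wrapped in a function and run against the sample input from the question. The function name show_dups is hypothetical, and the input is assumed to be a regular file (a pipe cannot be read twice).

```shell
#!/bin/sh
# show_dups FILE: print every line whose (field 1, field 2, field 5) key
# occurs more than once. The file is passed to awk twice on purpose.
show_dups() {
    awk -F, '
        FNR == NR { count[$1, $2, $5]++; next }   # pass 1: count each key
        count[$1, $2, $5] > 1                     # pass 2: print repeated keys
    ' "$1" "$1"
}
```

Because the second pass walks the file in its original order, the duplicate lines come out in the same order as they appear in the input.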
Another way using awk:
awk -F, '{if (x[$1$2$5]) { y[$1$2$5]++; print $0; if (y[$1$2$5] == 1) { print x[$1$2$5] } } x[$1$2$5] = $0}' a.txt
Explanation:
1 awk -F,
2 '{if (x[$1$2$5])
3 { y[$1$2$5]++; print $0;
4 if (y[$1$2$5] == 1)
5 { print x[$1$2$5] }
6 } x[$1$2$5] = $0
7 }'
Line 2: If x has $1$2$5, this key was seen before, do steps 3-5
Line 3: Increment the count and print the line because it is a dup
Line 4: This means we are seeing this key for the 2nd time, so we need to print the first line with this key. The last time we saw this key we did not know whether it was a dup or not, so we print that first line in step 5.
Line 6: Store the current line against the key so we can use it in step 2
Another way using sort, uniq and awk
Note: uniq command has an option '-f' to skip the specified number of fields before it starts comparison.
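sort -t, -k1,1 -k2,2 -k5,5 a.txt | awk -F, 'BEGIN { OFS = " "} {print $0, $1, $2, $5}' | sed 's/,/ /g' | uniq -f8 -D | sed 's/ /,/g' | cut -d',' -f 1-8
This sorts on fields 1, 2 and 5. awk prints the original line and appends fields 1, 2 and 5. sed changes the delimiter to a space because uniq has no option to specify a delimiter. uniq skips the 8 original fields, compares the rest of the line (the appended key) and prints all duplicate lines; the last two steps restore the commas and cut the appended key off again. This assumes the fields themselves contain no spaces and that uniq supports -D (GNU uniq does).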
I had a similar issue.
I needed to eliminate duplicate detail records while preserving the flat-file record formatting and the sequence of the records.
The duplication was caused by a time expansion of the date field in column 2 of the detail records only.
The receiving system was reporting duplication on columns 4 and 5.
I cobbled together this quick hack to resolve it.
First, read the file data into an array.
Then we can read and manipulate the individual records (crudely, with a counter), as demonstrated in this snippet, which uses a case statement to treat each record type separately.
Cheers!
infile="[input file name]"
readarray -t inrecs < "$infile"
filebase=$(echo "$infile" | cut -d '.' -f1)
i=1
for inrec in "${inrecs[@]}";do
field1=$(echo "${inrecs[$i-1]}" | cut -d',' -f1)
field2=$(echo "${inrecs[$i-1]}" | cut -d',' -f2)
field3=$(echo "${inrecs[$i-1]}" | cut -d',' -f3)
field4=$(echo "${inrecs[$i-1]}" | cut -d',' -f4)
field5=$(echo "${inrecs[$i-1]}" | cut -d',' -f5)
field6=$(echo "${inrecs[$i-1]}" | cut -d',' -f6)
field7=$(echo "${inrecs[$i-1]}" | cut -d',' -f7)
field8=$(echo "${inrecs[$i-1]}" | cut -d',' -f8)
case $field1 in
'H')
echo "$field1,$field2,$field3">${filebase}.new
;;
'D')
dupecount=0
dupecount=`zegrep -c -e "${field4},${field5}" ${infile}`
if [[ "$dupecount" -gt 1 ]];then
writtencount=0
writtencount=`zegrep -c -e "${field4},${field5}" ${filebase}.new`
if [[ "${writtencount}" -eq 0 ]];then
echo "$field1,$field2,$field3,$field4,$field5,$field6,$field7,$field8,">>${filebase}.new
fi
else
echo "$field1,$field2,$field3,$field4,$field5,$field6,$field7,$field8,">>${filebase}.new
fi
;;
'T')
dcount=`zegrep -c '^D' ${filebase}.new`
echo "$field1,$field2,$dcount,$field4">>${filebase}.new
;;
esac
((i++))
done

print duplicate entries without deleting unix/linux

Let's say I have a file like this with 2 columns
56-cde
67-cde
56-cao
67-cgh
78-xyz
456-hhh
456-jjjj
45678-nnmn
45677-abdc
45678-aief
I am trying to get an output like this:
56-cde
56-cao
67-cde
67-cgh
456-hhh
456-jjjj
45678-aief
45678-nnmn
So basically instead of printing out the unique values I need to print the duplicates:
I tried to accomplish this using awk like this :
cat input.txt | awk -F"-" '{print $1,$2}' | sort -n | uniq -w 2 -D
This correctly shows which values in column 1 are duplicated, together with their column 2 values. But because I hardcode the comparison width to 2 bytes, it only works for the 2-digit numbers in column 1. Is there a way to do this using awk?
Thanks in advance.
See if your uniq has a -D option. My cygwin version does:
cat input.txt | sort | uniq -w 2 -D
Another awk solution without arrays (but with a presort). The first line only starts a group, so it must not count as a duplicate yet:
sort -n file | awk -F- '
NR==1{p=$1; a=$0; next}
p==$1{a=a RS $0; c++; next}
c{print a}
{a=$0; p=$1; c=0}
END{if(c) print a}'
This is what I came up with (just an awk program, no external sort, uniq etc.):
BEGIN { FS = "-" }
{ arr[$1] = arr[$1] "-" $2 }
END {
for (i in arr) {
if ((n = split(arr[i], a)) < 3) continue
for (j = 2; j <= n; ++j)
print i"-"a[j]
}
}
It collects, for each number, all the strings attached to it in arr (assuming the strings won't contain dashes).
With gawk, you could use arrays of arrays to avoid the concatenation and splitting on dashes.
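A shorter variant, as a sketch: read the file twice, counting each leading number on the first pass and printing the lines whose number was seen more than once on the second. No comparison width is hardcoded, so numbers of any length work. The function name dups_by_key is hypothetical, and the input must be a regular file.

```shell
#!/bin/sh
# dups_by_key FILE: print lines whose text before the first "-" appears
# on more than one line, keeping the original input order.
dups_by_key() {
    awk -F- '
        FNR == NR { seen[$1]++; next }   # pass 1: count each leading number
        seen[$1] > 1                     # pass 2: keep lines with a repeated number
    ' "$1" "$1"
}
```

The output keeps the input order, so pipe it through sort -n if the duplicates should be grouped together as in the desired output.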
I would handle the varying-number-of-digits case by pre-conditioning the data so that the number field is a fixed large width (and use that width in uniq):
cat input.txt | awk -F- '{printf "%12d-%s\n",$1,$2}'| sort | uniq -w 12 -D
If you need the output left-justified as well, just tack on this post-conditioning step:
| awk '{print $1}'
Using Perl
$ cat two_cols.txt
56-cde
67-cde
56-cao
67-cgh
78-xyz
456-hhh
456-jjjj
45678-nnmn
45677-abdc
45678-aief
$ perl -F"-" -lane ' @t=@{$kv{$F[0]}}; push(@t,$_); $kv{$F[0]}=[@t]; END { while(($x,$y)=each(%kv)){ print join("\n",@{$y}) if scalar @{$y}>1 }} ' two_cols.txt
67-cde
67-cgh
56-cde
56-cao
456-hhh
456-jjjj
45678-nnmn
45678-aief
$

Grep and split in unix

I need to first grep the exact line and then to capture the required value.
eg.
Total logical records skipped: 0
Total logical records read: 500
Total logical records rejected: 3
Total logical records discarded: 0
I need to capture the value 500. How can I do that?
One way using awk:
awk '/Total logical records read:/ { print $NF }' file.txt
If by capture, you mean store as a shell variable:
variable=$(awk '/Total logical records read:/ { print $NF }' file.txt)
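A slightly stricter sketch, in case other lines in the log could also contain the word "read": split each line on the colon and compare the whole label. The function name records_read is hypothetical, and the label is assumed to start at the beginning of the line with no leading whitespace.

```shell
#!/bin/sh
# records_read FILE: print the number from the "Total logical records read" line.
# Splitting on ":" (plus any following spaces) and comparing the full label
# avoids matching unrelated lines that merely contain "read".
records_read() {
    awk -F': *' '$1 == "Total logical records read" { print $2; exit }' "$1"
}
```

It can be captured the same way as above, for example variable=$(records_read file.txt).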
Use grep + awk:
cat log.txt | grep read | awk '{print $ 5;}'
or just awk:
cat log.txt | awk '/read/ {print $5;}'
Look at these examples: Awk Introduction Tutorial – 7 Awk Print Examples

Advanced grep unix

Usually the grep command is used to display the lines containing the specified pattern. Is there any way to display n lines before and after the line that contains the pattern?
Can this be achieved using awk?
Yes, use
grep -B num1 -A num2
to include num1 lines of context before the match, and num2 lines of context after the match.
EDIT:
Seems the OP is using AIX. This has a different set of options which doesn't include -B and -A
this link describes grep on AIX 4.3 (it doesn't look promising)
Matt's perl script might be a better solution.
Here is what I usually do on AIX:
before=2   # number of lines to show before the match
after=2    # number of lines to show after the match
grep -n <pattern> <filename> | cut -d':' -f1 | xargs -n1 -I % awk "NR<=%+$after && NR>=%-$before" <filename>
If you do not want the two extra variables, you can always write it as a one-liner:
grep -n <pattern> <filename> | cut -d':' -f1 | xargs -n1 -I % awk 'NR<=%+<<after>> && NR>=%-<<before>>' <filename>
Suppose I have a pattern 'stack' and the filename is flow.txt
I want 2 lines before and 3 lines after. Then the command will be:
grep -n 'stack' flow.txt | cut -d':' -f1 | xargs -n1 -I % awk 'NR<=%+3 && NR>=%-2' flow.txt
If I want only the 2 lines before, the command will be:
grep -n 'stack' flow.txt | cut -d':' -f1 | xargs -n1 -I % awk 'NR<=% && NR>=%-2' flow.txt
If I want only the 3 lines after, the command will be:
grep -n 'stack' flow.txt | cut -d':' -f1 | xargs -n1 -I % awk 'NR<=%+3 && NR>=%' flow.txt
Multiple files: replace the grep pipeline with a single awk script. For the pattern 'stack' in the files matching flow.*, set before, after and pattern in the BEGIN block (the values below give 1 line before and 3 lines after):
awk 'BEGIN {
before=1; after=3; pattern="stack";
i=0; hold[before]=""; afterprints=0}
{
#Print the lines from the previous Match
if (afterprints > 0)
{
print FILENAME ":" FNR ":" $0
afterprints-- #keep a track of the lines to print after - this can be reset if a match is found
if (afterprints == 0) print "---"
}
#Look for the pattern in current line
if ( match($0, pattern) > 0 )
{
# print the lines in the hold round robin buffer from the current line to line-1
# if (before >0) => user wants lines before avoid divide by 0 in %
# and afterprints => 0 - we have not printed the line already
for(j=i; j < i+before && before > 0 && afterprints == 0 ; j++)
print hold[j%before]
if (afterprints == 0) # print the line if we have not printed the line already
print FILENAME ":" FNR ":" $0
afterprints=after
}
if (before > 0) # Store the lines in the round robin hold buffer
{ hold[i]=FILENAME ":" FNR ":" $0
i=(i+1)%before }
}' flow.*
From the tags, it's likely that the system has a grep that may not support providing context (Solaris is one system that doesn't and I can't remember about AIX). If that is the case, there's a perl script that may help at http://www.sun.com/bigadmin/jsp/descFile.jsp?url=descAll/cgrep__context_grep.
If you have sed you could use this shell script
BEFORE=2
AFTER=3
FILE=file.txt
PATTERN=pattern
for i in $(grep -n $PATTERN $FILE | sed -e 's/\:.*//')
do head -n $(($AFTER+$i)) $FILE | tail -n $(($AFTER+$BEFORE+1))
done
How it works: grep -n prefixes each match with its line number, and sed strips everything except that number. head then outputs the file up to the matching line plus an additional $AFTER lines, and tail keeps only the last $BEFORE + $AFTER + 1 of those (that is, the matching line plus the requested lines before and after it).
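The head/tail loop rereads the file for every match and prints overlapping windows twice. A sketch of an alternative for systems whose grep lacks -A and -B, using only awk: the first pass marks the wanted line numbers, the second prints each marked line once. The function name context is hypothetical; the pattern is treated as an awk regular expression.

```shell
#!/bin/sh
# context BEFORE AFTER PATTERN FILE
# Prints each matching line with BEFORE lines above and AFTER lines below it.
# FILE is read twice, so it must be a regular file.
context() {
    awk -v before="$1" -v after="$2" -v pat="$3" '
        FNR == NR {                       # pass 1: mark wanted line numbers
            if ($0 ~ pat)
                for (i = FNR - before; i <= FNR + after; i++)
                    want[i] = 1
            next
        }
        (FNR in want)                     # pass 2: print marked lines once each
    ' "$4" "$4"
}
```

For example, context 2 3 pattern file.txt. Line numbers below 1 or past the end of the file are simply never reached, so matches near the start or end need no special handling.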
Sure there is (from the grep man page):
-B NUM, --before-context=NUM
Print NUM lines of leading context before matching lines.
Places a line containing a group separator (--) between
contiguous groups of matches. With the -o or --only-matching
option, this has no effect and a warning is given.
-A NUM, --after-context=NUM
Print NUM lines of trailing context after matching lines.
Places a line containing a group separator (--) between
contiguous groups of matches. With the -o or --only-matching
option, this has no effect and a warning is given.
and if you want the same amount of lines before AND after the match, use:
-C NUM, -NUM, --context=NUM
Print NUM lines of output context. Places a line containing a
group separator (--) between contiguous groups of matches. With
the -o or --only-matching option, this has no effect and a
warning is given.
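A quick way to see the context options in action, as a sketch on a made-up five-line log (the file contents are illustrative only, and a grep that supports -C is assumed, such as GNU or BSD grep):

```shell
#!/bin/sh
# Build a small sample log, then show one line of context on each side.
log=$(mktemp)
printf '%s\n' 'step 1 ok' 'step 2 ok' 'step 3 FAIL' 'step 4 ok' 'step 5 ok' > "$log"
around=$(grep -C 1 FAIL "$log")     # same as: grep -B 1 -A 1 FAIL "$log"
printf '%s\n' "$around"
rm -f "$log"
```

This prints the "step 2 ok", "step 3 FAIL" and "step 4 ok" lines.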
you can use awk
awk 'BEGIN{t=4}
c--&&c>=0
/pattern/{ c=t; for(i=NR;i<NR+t;i++)print a[i%t] }
{ a[NR%t]=$0}
' file
output
$ more file
1
2
3
4
5
pattern
6
7
8
9
10
11
$ ./shell.sh
2
3
4
5
6
7
8
9

Get specific lines from a text file

I am working on a UNIX box, and trying to run an application, which gives some debug logs to the standard output. I have redirected this output to a log file, but now wish to get the lines where the error is being shown.
My problem here is that a simple
cat output.log | grep FAIL
does not help out. As this shows only the lines which have FAIL in them. I want some more information along with this. Like the 2-3 lines above this line with FAIL. Is there any way to do this via a simple shell command? I would like to have a single command line (can have pipes) to do the above.
grep -C 3 FAIL output.log
Note that this also gets rid of the useless use of cat (UUOC).
grep -A $NUM
This will print $NUM lines of trailing context after matches.
-B $NUM prints leading context.
man grep is your best friend.
So in your case:
cat log | grep -A 3 -B 3 FAIL
I have two implementations of what I call sgrep, one in Perl, one using just pre-Perl (pre-GNU) standard Unix commands. If you've got GNU grep, you've no particular need of these. It would be more complex to deal with forwards and backwards context searches, but that might be a useful exercise.
Perl solution:
#!/usr/perl/v5.8.8/bin/perl -w
#
# @(#)$Id: sgrep.pl,v 1.6 2007/09/18 22:55:20 jleffler Exp $
#
# Perl-based SGREP (special grep) command
#
# Print lines around the line that matches (by default, 3 before and 3 after).
# By default, include file names if more than one file to search.
#
# Options:
# -b n1 Print n1 lines before match
# -f n2 Print n2 lines following match
# -n Print line numbers
# -h Do not print file names
# -H Do print file names
use strict;
use constant debug => 0;
use Getopt::Std;
my(%opts);
sub usage
{
print STDERR "Usage: $0 [-hnH] [-b n1] [-f n2] pattern [file ...]\n";
exit 1;
}
usage unless getopts('hnf:b:H', \%opts);
usage unless @ARGV >= 1;
if ($opts{h} && $opts{H})
{
print STDERR "$0: mutually exclusive options -h and -H specified\n";
exit 1;
}
my $op = shift;
print "# regex = $op\n" if debug;
# print file names if -h omitted and more than one argument
$opts{F} = (defined $opts{H} || (!defined $opts{h} and scalar @ARGV > 1)) ? 1 : 0;
$opts{n} = 0 unless defined $opts{n};
my $before = (defined $opts{b}) ? $opts{b} + 0 : 3;
my $after = (defined $opts{f}) ? $opts{f} + 0 : 3;
print "# before = $before; after = $after\n" if debug;
my @lines = (); # Accumulated lines
my $tail = 0; # Line number of last line in list
my $tbp_1 = 0; # First line to be printed
my $tbp_2 = 0; # Last line to be printed
# Print lines from @lines in the range $tbp_1 .. $tbp_2,
# leaving $leave lines in the array for future use.
sub print_leaving
{
my ($leave) = @_;
while (scalar(@lines) > $leave)
{
my $line = shift @lines;
my $curr = $tail - scalar(@lines);
if ($tbp_1 <= $curr && $curr <= $tbp_2)
{
print "$ARGV:" if $opts{F};
print "$curr:" if $opts{n};
print $line;
}
}
}
# General logic:
# Accumulate each line at end of @lines.
# ** If current line matches, record range that needs printing
# ** When the line array contains enough lines, pop line off front and,
# if it needs printing, print it.
# At end of file, empty line array, printing requisite accumulated lines.
while (<>)
{
# Add this line to the accumulated lines
push @lines, $_;
$tail = $.;
printf "# array: N = %d, last = $tail: %s", scalar(@lines), $_ if debug > 1;
if (m/$op/o)
{
# This line matches - set range to be printed
my $lo = $. - $before;
$tbp_1 = $lo if ($lo > $tbp_2);
$tbp_2 = $. + $after;
print "# $. MATCH: print range $tbp_1 .. $tbp_2\n" if debug;
}
# Print out any accumulated lines that need printing
# Leave $before lines in array.
print_leaving($before);
}
continue
{
if (eof)
{
# Print out any accumulated lines that need printing
print_leaving(0);
# Reset for next file
close ARGV;
$tbp_1 = 0;
$tbp_2 = 0;
$tail = 0;
@lines = ();
}
}
Pre-Perl Unix solution (using plain ed, sed, and sort - though it uses getopt which was not necessarily available back then):
#!/bin/ksh
#
# @(#)$Id: old.sgrep.sh,v 1.5 2007/09/15 22:15:43 jleffler Exp $
#
# Special grep
# Finds a pattern and prints lines either side of the pattern
# Line numbers are always produced by ed (substitute for grep),
# which allows us to eliminate duplicate lines cleanly. If the
# user did not ask for numbers, these are then stripped out.
#
# BUG: if the pattern occurs in the first line or two and
# the number of lines to go back is larger than the line number,
# it fails dismally.
set -- `getopt "f:b:hn" "$@"`
case $# in
0) echo "Usage: $0 [-hn] [-f x] [-b y] pattern [files]" >&2
exit 1;;
esac
# Tab required - at least with sed (perl would be different)
# But then the whole problem would be different if implemented in Perl.
number="'s/^\\([0-9][0-9]*\\) /\\1:/'"
filename="'s%^%%'" # No-op for sed
f=3
b=3
nflag=no
hflag=no
while [ $# -gt 0 ]
do
case $1 in
-f) f=$2; shift 2;;
-b) b=$2; shift 2;;
-n) nflag=yes; shift;;
-h) hflag=yes; shift;;
--) shift; break;;
*) echo "Unknown option $1" >&2
exit 1;;
esac
done
pattern="${1:?'No pattern'}"
shift
case $# in
0) tmp=${TMPDIR:-/tmp}/`basename $0`.$$
trap "rm -f $tmp ; exit 1" 0
cat - >$tmp
set -- $tmp
sort="sort -t: -u +0n -1"
;;
*) filename="'s%^%'\$file:%"
sort="sort -t: -u +1n -2"
;;
esac
case $nflag in
yes) num_remove='s/[0-9][0-9]*://';;
no) num_remove='s/^//';;
esac
case $hflag in
yes) fileremove='s%^$file:%%';;
no) fileremove='s/^//';;
esac
for file in $*
do
echo "g/$pattern/.-${b},.+${f}n" |
ed - $file |
eval sed -e "$number" -e "$filename" |
$sort |
eval sed -e "$fileremove" -e "$num_remove"
done
rm -f $tmp
trap 0
exit 0
The shell version of sgrep was written in February 1989, and bug fixed in May 1989. It then remained unchanged except for an administrative change (SCCS to RCS transition) in 1997 until 2007, when I added the -h option. I switched to the Perl version in 2007.
http://thedailywtf.com/Articles/The_Complicator_0x27_s_Gloves.aspx
You can use sed to print specific lines. Let's say you want line 20:
sed '20 p' -n FILE_YOU_WANT_THE_LINE_FROM
Done.
-n prevents echoing lines from the file. The part in quotes is a sed rule to apply, it specifies that you want the rule to apply to line 20, and you want to print.
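Two small variations on the same idea, as a sketch (the function names print_line and print_range are hypothetical): quitting right after the wanted line stops sed from reading the rest of a large log, and an address range prints several lines at once.

```shell
#!/bin/sh
# print_line N FILE: print only line N, then quit so the rest is not read.
print_line() {
    sed -n "${1}{p;q;}" "$2"
}

# print_range FIRST LAST FILE: print lines FIRST through LAST inclusive,
# quitting once LAST has been printed.
print_range() {
    sed -n "${1},${2}p;${2}q" "$3"
}
```

For example, print_range 18 22 output.log shows line 20 with two lines of context on each side.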
With GNU grep on Windows:
$ grep --context 3 FAIL output.log
$ grep --help | grep context
-B, --before-context=NUM print NUM lines of leading context
-A, --after-context=NUM print NUM lines of trailing context
-C, --context=NUM print NUM lines of output context
-NUM same as --context=NUM
