How do I figure out the percentage of non-null records in my file in UNIX?
I want to know the number of records and the percentage of non-null records. I've tried a whole lot of grep and cut commands but nothing seems to work. Can anyone help me here, please? My file looks like this:
"name","country","age","place"
"sam","US","30","CA"
"","","",""
"joe","UK","34","BRIS"
,,,,
"jake","US","66","Ohio"
Perl solution:
#!/usr/bin/perl
use warnings;
use strict;
use 5.012;  # say, keys @arr
use Text::CSV_XS qw{ csv };

my ($count_all, @count_nonempty);
csv(in      => shift,
    out     => \ 'skip',
    headers => 'skip',
    on_in   => sub {
        my (undef, $columns) = @_;
        ++$count_all;
        length $columns->[$_] and $count_nonempty[$_]++
            for 0 .. $#$columns;
    },
);

for my $column (keys @count_nonempty) {
    say "Column ", 1 + $column, ": ",
        100 * $count_nonempty[$column] / $count_all, '%';
}
It uses Text::CSV_XS to read the CSV file. It skips the header line, and for each subsequent line, it calls the callback specified in on_in, which increments the count of all lines and also the per-column count of non-empty fields (a field counts as non-empty when its length is non-zero).
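Assuming the script is saved under an arbitrary name such as nonnull.pl, a run over the sample file should report something like this (3 of the 5 data rows have a non-empty value in each column):

$ perl nonnull.pl file.csv
Column 1: 60%
Column 2: 60%
Column 3: 60%
Column 4: 60%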
Along with choroba, I would normally recommend using a CSV parser on CSV data.
But in this case, all we want to look for is that a record contains any character that is not a comma or quote: if a record contains only commas and/or quotes, it is a "null" record.
awk '
/[^",]/ {nonnull++}
END {printf "%d / %d = %.2f\n", nonnull, NR, nonnull/NR}
' file
To handle leading/trailing whitespace:
awk '
{sub(/^[[:blank:]]+/,""); sub(/[[:blank:]]+$/,"")}
/[^",]/ {nonnull++}
END {printf "%d / %d = %.2f\n", nonnull, NR, nonnull/NR}
' file
If fields containing only whitespace, such as
" ","",,," "
should also make a record null, we can simply ignore all whitespace:
awk '
/[^",[:blank:]]/ {nonnull++}
END {printf "%d / %d = %.2f\n", nonnull, NR, nonnull/NR}
' file
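Note that all of these variants count the header line as a non-null record, since it contains characters other than quotes and commas. If the file always starts with a header, a small adjustment (mine) skips it and excludes it from the total:

awk '
NR > 1 && /[^",]/ {nonnull++}
END {printf "%d / %d = %.2f\n", nonnull, NR-1, nonnull/(NR-1)}
' file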
I have two files like this:
# step distance
0 4.48595407961296e+01
2500 4.50383737781376e+01
5000 4.53506757198727e+01
7500 4.51682465277482e+01
10000 4.53410353656445e+01
# step distance
0 4.58854106214881e+01
2500 4.58639266431320e+01
5000 4.60620560167519e+01
7500 4.58990075106227e+01
10000 4.59371359946124e+01
So I want to join the two files together while maintaining the spacing.
In particular, the second file needs to continue from the last step value of the first one and keep counting from there.
output:
# step distance
0 4.48595407961296e+01
2500 4.50383737781376e+01
5000 4.53506757198727e+01
7500 4.51682465277482e+01
10000 4.53410353656445e+01
12500 4.58854106214881e+01
15000 4.58639266431320e+01
17500 4.60620560167519e+01
20000 4.58990075106227e+01
22500 4.59371359946124e+01
With calc it was easy to do; the problem is that the spacing needs to be preserved for everything to work, and in that case calc makes a complete mess.
# start awk and set the *Step* between files to 2500
awk -v 'Step=2500' '
# 1st line of 1st file (NR counts every line, across all files): init and print the header
NR == 1 {LastFile = FILENAME; OFS = "\t"; print}
# when file change (new filename compare to previous line read)
# Set a new index (for incremental absolute step from relative one) and new filename reference
FILENAME != LastFile { StartIndex = LastIndex + Step; LastFile = FILENAME}
# after the first line, and for every line starting with a digit (with optional leading blanks)
# calculate absolute step and replace relative one, print the new content
NR > 1 && /^[[:blank:]]*[0-9]/ { $1 += StartIndex; LastIndex = $1;print }
' YourFiles*
The result will depend on the order of the files.
The output separator is set by the OFS value (a tab here).
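If you would rather keep fixed-width columns than switch to tabs, printf can replace the OFS-based rebuild. A sketch, assuming two whitespace-separated columns and a step value of at most 5 digits as in the sample:

awk -v Step=2500 '
NR == 1 {LastFile = FILENAME; print}
FILENAME != LastFile {StartIndex = LastIndex + Step; LastFile = FILENAME}
NR > 1 && /^[[:blank:]]*[0-9]/ {LastIndex = $1 + StartIndex; printf "%5d %s\n", LastIndex, $2}
' file1 file2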
Perl to the rescue!
#!/usr/bin/perl
use warnings;
use strict;

open my $F1, '<', 'file1' or die $!;
my ($before, $after, $diff);
my $max = 0;
while (<$F1>) {
    print;
    my ($space1, $num, $space2) = /^(\s*) ([0-9]+) (\s*)/x or next;
    ($before, $after) = ($space1, $space2);
    $diff = $num - $max;
    $max  = $num;
}
$before = length "$before$max";  # We'll need it to format the computed numbers.

open my $F2, '<', 'file2' or die $!;
<$F2>;  # Skip the header.
while (<$F2>) {
    my ($step, $distance) = split;
    $step += $max + $diff;
    printf "% ${before}d%s%s\n", $step, $after, $distance;
}
The program remembers the last number in $max. It also keeps the length of the leading whitespace plus $max in $before to format all future numbers to take up the same space (using printf).
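The resulting format string is plain fixed-width padding; the same effect can be seen with the shell's printf, purely as an illustration (not part of the program):

$ printf '%5d\n' 0 2500 12500
    0
 2500
12500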
You didn't show how the distance column is aligned, i.e.
20000 4.58990075106227e+01
22500 11.59371359946124e+01 # dot aligned?
22500 11.34572478912301e+01 # left aligned?
The program would align it the latter way. If you want the former, use a similar trick as for the step column.
I would like to extract all the lines from the first file (gzipped, i.e. Input.csv.gz) whose 4th field falls within a range given by the second file (Slab.csv), where the first field is the start of the range and the second field the end, and then report a slab-wise count of rows together with the sums of the 4th and 5th fields of the first file.
Input.csv.gz (gzipped):
Desc,Date,Zone,Duration,Calls
AB,01-06-2014,XYZ,450,3
AB,01-06-2014,XYZ,642,3
AB,01-06-2014,XYZ,0,0
AB,01-06-2014,XYZ,205,3
AB,01-06-2014,XYZ,98,1
AB,01-06-2014,XYZ,455,1
AB,01-06-2014,XYZ,120,1
AB,01-06-2014,XYZ,0,0
AB,01-06-2014,XYZ,193,1
AB,01-06-2014,XYZ,0,0
AB,01-06-2014,XYZ,161,2
Slab.csv
StartRange,EndRange
0,0
1,10
11,100
101,200
201,300
301,400
401,500
501,10000
Expected Output:
StartRange,EndRange,Count,Sum-4,Sum-5
0,0,3,0,0
1,10,NotFound,NotFound,NotFound
11,100,1,98,1
101,200,3,474,4
201,300,1,205,3
301,400,NotFound,NotFound,NotFound
401,500,2,905,4
501,10000,1,642,3
I am using the two commands below to get the above output, except for the "NotFound" cases.
awk -F, 'NR==FNR{s[NR]=$1;e[NR]=$2;c[NR]=$0;n++;next} {for(i=1;i<=n;i++) if($4>=s[i]&&$4<=e[i]) {print $0,","c[i];break}}' Slab.csv <(gzip -dc Input.csv.gz) >Op_step1.csv
cat Op_step1.csv | awk -F, '{key=$6","$7;++a[key];b[key]=b[key]+$4;c[key]=c[key]+$5} END{for(i in a)print i","a[i]","b[i]","c[i]}' >Op_step2.csv
Op_step2.csv
101,200,3,474,4
501,10000,1,642,3
0,0,3,0,0
401,500,2,905,4
11,100,1,98,1
201,300,1,205,3
Any suggestions to make it a one-liner command that achieves the expected output? I don't have Perl or Python access.
Here is another option using Perl, which takes advantage of multi-dimensional arrays and hashes.
perl -F, -lane'
    BEGIN {
        $x = pop;
        ## Create array of arrays from start and end ranges
        ## @range = ( [0,0] , [1,10] ... )
        (undef, @range) = map { chomp; [split /,/] } <>;
        @ARGV = $x;
    }
    ## Skip the first line
    next if $. == 1;
    ## Create hash of hashes
    ## %line = ( "0 0" => { "count" => counts, "sum4" => sum_of_col4, "sum5" => sum_of_col5 }, ... )
    for (@range) {
        if ($F[3] >= $_->[0] && $F[3] <= $_->[1]) {
            $line{"@$_"}{"count"}++;
            $line{"@$_"}{"sum4"} += $F[3];
            $line{"@$_"}{"sum5"} += $F[4];
        }
    }
}{
    print "StartRange,EndRange,Count,Sum-4,Sum-5";
    print join ",", @$_,
        $line{"@$_"}{"count"} // "NotFound",
        $line{"@$_"}{"sum4"}  // "NotFound",
        $line{"@$_"}{"sum5"}  // "NotFound"
            for @range
' slab input
StartRange,EndRange,Count,Sum-4,Sum-5
0,0,3,0,0
1,10,NotFound,NotFound,NotFound
11,100,1,98,1
101,200,3,474,4
201,300,1,205,3
301,400,NotFound,NotFound,NotFound
401,500,2,905,4
501,10000,1,642,3
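Since the real input is gzipped, it can be fed in directly with bash/ksh process substitution, with '...' standing for the one-liner body above:

perl -F, -lane '...' Slab.csv <(gzip -dc Input.csv.gz)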
Here is one way using awk and sort:
awk '
BEGIN {
    FS = OFS = SUBSEP = ",";
    print "StartRange,EndRange,Count,Sum-4,Sum-5"
}
FNR == 1 { next }
NR == FNR {
    ranges[$1,$2]++;
    next
}
{
    for (range in ranges) {
        split(range, tmp, SUBSEP);
        if ($4 >= tmp[1] && $4 <= tmp[2]) {
            count[range]++;
            sum4[range] += $4;
            sum5[range] += $5;
            next
        }
    }
}
END {
    for (range in ranges)
        print range,
            ((range in count) ? count[range] : "NotFound"),
            ((range in sum4)  ? sum4[range]  : "NotFound"),
            ((range in sum5)  ? sum5[range]  : "NotFound") | "sort -t, -nk1,2"
}' slab input
StartRange,EndRange,Count,Sum-4,Sum-5
0,0,3,0,0
1,10,NotFound,NotFound,NotFound
11,100,1,98,1
101,200,3,474,4
201,300,1,205,3
301,400,NotFound,NotFound,NotFound
401,500,2,905,4
501,10000,1,642,3
Set the Input, Output Field Separators and SUBSEP to ,. Print the Header line.
If it is the first line skip it.
Load the entire slab file into an array called ranges.
For every range in the ranges array, split the field to get start and end range. If the 4th column is in the range, increment the count array and add the value to sum4 and sum5 array appropriately.
In the END block, iterate through the ranges and print them, testing presence with the in operator so that a legitimate sum of 0 is not reported as NotFound.
Pipe the output to sort to get the output in order.
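To run it straight off the gzipped input, you can save the body of the script as, say, slab.awk (name mine) and let awk read standard input as the second file via the POSIX - convention:

gzip -dc Input.csv.gz | awk -f slab.awk slab -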
I have a BLT real-time graph in Tcl/Tk. The graph collects data over time with no time limit, keeps storing all the data points in vectors, and lets the user scroll back and forth through the graph. The problem is that if I let the graph collect points for a long period of time, the CPU and memory consumption increase dramatically; I figure a 24-hour window should be fine. When I try to "unset x(0)" on the graph I get an error saying "invalid command name"; I also tried "x delete 0", with the same result. Any help is much appreciated.
This is how I initialize the graph:
proc startGraph {} {
    global btnColor
    global backColor
    global startTime
    global txtActionLevel
    global txtAlertLevel
    global x y1 y2 flagTime inStart inTime vectorFlag
    global resolution
    global notFirstTime
    global txtActionLevel

    set notFirstTime 0
    set resolution 0

    # Create stripchart widget
    blt::stripchart .s -width 625 -height 330 -background $backColor -plotbackground black -font defaultFont
    scrollbar .scroll -command { ScrollToEnd
        .s axis view x } -orient horizontal -relief flat -background black -troughcolor $backColor -activebackground black -elementborderwidth 5
    .s axis configure x -scrollcommand { .scroll set }

    # Create BLT vectors
    blt::vector create x
    blt::vector create y1
    blt::vector create y2

    set startTime 0
    set flagTime 0
    set inStart -1
    set inTime 0
    set vectorFlag 0

    .s configure -borderwidth 0 \
        -leftmargin 0 \
        -rightmargin 0 \
        -plotborderwidth 0 \
        -plotpadx {0 0} \
        -plotpady {0 0}
    .s legend configure -hide yes
    .s grid configure -color gray \
        -dashes 1 \
        -minor 0 \
        -hide 0

    # X-axis
    .s axis configure x -autorange 60 \
        -shiftby 1 \
        -stepsize 10 \
        -subdivisions 1 \
        -command FormatXLabel

    # Alert txtAlertLevel
    #.s tag create line -mapx 2 -mapy 2

    proc FormatXLabel {widget x} {
        set x [expr round($x)]
        return [clock format $x -format "%I:%M:%S"]
    }

    # Y-axis
    #.s axis configure y -title "C o u n t s"
    image create photo .countsIcon -format PNG -file counts.png
    label .titleGraph -image .countsIcon -background $backColor
    place .titleGraph -in .measureView -x 0 -y 160

    # Particles
    .s element create Particles -symbol {} -color yellow -linewidth 1 \
        -smooth linear -xdata x -ydata y1
    # Bio
    .s element create Bio -symbol {} -color red -linewidth 1 \
        -smooth linear -xdata x -ydata y2

    .s marker create line -name actionLine -coords {-Inf $txtActionLevel Inf $txtActionLevel} -linewidth 1 -outline orange
    .s marker create line -name alertLine -coords {-Inf $txtAlertLevel Inf $txtAlertLevel} -linewidth 1 -outline green

    place .s -in .measureView -x 10 -y 50
    place .scroll -in .measureView -x 60 -y 380 -width 515 -height 35
    #chartTime
}
This is where I add the values to the vectors:
set x(++end) [clock seconds]
set flagTime 0
set vectorFlag 1
set len [y1 length]
if {$len == 0} {
    set startTime $x(end)
    set y1(++end) $particle_sec
    set y2(++end) $x_summary(bio_sec)
    #if {$inStart < 0} {
    #    .s axis configure x -min "" -max ""
    #    set inStart 0
    #}
} else {
    set y1(++end) $particle_sec
    set y2(++end) $x_summary(bio_sec)
}
puts "Vector length [x length]------"
puts "First value $x(0)----------"
# This is where I'm trying to catch whenever it reaches 60 seconds; in this case,
# when the length of the vector reaches 60 it should unset the first value,
# but it doesn't work: it throws an invalid command name error.
if {[x length] > 60} {
    [unset -nocomplain x(0)]
}
#incr everyten
add_Result $particle_sec $bioSec [format "%.2f" $fv_eff]
For some odd reason, when you use blt::vector create, "[unset -nocomplain x(0)]" doesn't seem to work, so I changed it back to "x delete 0" (without the square brackets) and it works now.
When you put unset -nocomplain x(0) in [square brackets] and have it on its own like that, what you get is this:
1. It tries to unset that variable (I don't know if that will have problems; I don't use BLT vectors in my own code). There will be no errors from this. The result will be the empty string (it's documented).
2. It takes the result (an empty string) and uses it as a word of a command. The whole word, as you're not concatenating anything with it. The whole first word, in fact. You're going to try to evaluate the command whose name is the empty string, and pass it no other arguments. This is general Tcl semantics.
Now, creating a command with an empty name is legal, slightly tricky (because rename to an empty string deletes it — fully-qualified names are the workaround), and highly unusual. In your case, you've no such command and what you've really got is a bug. The bug? Those square brackets.
Use unset -nocomplain x(0) instead of [unset -nocomplain x(0)].
Note also that in many places online, when Tclers are putting Tcl code fragments inline in a place without fancy formatting, they'll put square brackets around the code. It's just a convention to make things easier to read. You shouldn't be seeing such things here on Stack Overflow.
I am working on a UNIX box and trying to run an application which writes some debug logs to standard output. I have redirected this output to a log file, but now wish to get the lines where the error is shown.
My problem here is that a simple
cat output.log | grep FAIL
does not help, as it shows only the lines that contain FAIL. I want some more context along with this, like the 2-3 lines above the line with FAIL. Is there any way to do this via a simple shell command? I would like a single command line (pipes are fine) to do the above.
grep -C 3 FAIL output.log
Note that this also gets rid of the useless use of cat (UUOC).
grep -A $NUM
This will print $NUM lines of trailing context after matches.
-B $NUM prints leading context.
man grep is your best friend.
So in your case:
cat log | grep -A 3 -B 3 FAIL
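The cat is unnecessary here too; grep can read the file directly:

grep -A 3 -B 3 FAIL log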
I have two implementations of what I call sgrep, one in Perl, one using just pre-Perl (pre-GNU) standard Unix commands. If you've got GNU grep, you've no particular need of these. It would be more complex to deal with forwards and backwards context searches, but that might be a useful exercise.
Perl solution:
#!/usr/perl/v5.8.8/bin/perl -w
#
# @(#)$Id: sgrep.pl,v 1.6 2007/09/18 22:55:20 jleffler Exp $
#
# Perl-based SGREP (special grep) command
#
# Print lines around the line that matches (by default, 3 before and 3 after).
# By default, include file names if more than one file to search.
#
# Options:
#   -b n1   Print n1 lines before match
#   -f n2   Print n2 lines following match
#   -n      Print line numbers
#   -h      Do not print file names
#   -H      Do print file names

use strict;
use constant debug => 0;
use Getopt::Std;

my (%opts);

sub usage
{
    print STDERR "Usage: $0 [-hnH] [-b n1] [-f n2] pattern [file ...]\n";
    exit 1;
}

usage unless getopts('hnf:b:H', \%opts);
usage unless @ARGV >= 1;
if ($opts{h} && $opts{H})
{
    print STDERR "$0: mutually exclusive options -h and -H specified\n";
    exit 1;
}

my $op = shift;
print "# regex = $op\n" if debug;

# print file names if -h omitted and more than one argument
$opts{F} = (defined $opts{H} || (!defined $opts{h} and scalar @ARGV > 1)) ? 1 : 0;
$opts{n} = 0 unless defined $opts{n};
my $before = (defined $opts{b}) ? $opts{b} + 0 : 3;
my $after  = (defined $opts{f}) ? $opts{f} + 0 : 3;
print "# before = $before; after = $after\n" if debug;

my @lines = ();   # Accumulated lines
my $tail  = 0;    # Line number of last line in list
my $tbp_1 = 0;    # First line to be printed
my $tbp_2 = 0;    # Last line to be printed

# Print lines from @lines in the range $tbp_1 .. $tbp_2,
# leaving $leave lines in the array for future use.
sub print_leaving
{
    my ($leave) = @_;
    while (scalar(@lines) > $leave)
    {
        my $line = shift @lines;
        my $curr = $tail - scalar(@lines);
        if ($tbp_1 <= $curr && $curr <= $tbp_2)
        {
            print "$ARGV:" if $opts{F};
            print "$curr:" if $opts{n};
            print $line;
        }
    }
}

# General logic:
# Accumulate each line at end of @lines.
# ** If current line matches, record range that needs printing
# ** When the line array contains enough lines, pop line off front and,
#    if it needs printing, print it.
# At end of file, empty line array, printing requisite accumulated lines.

while (<>)
{
    # Add this line to the accumulated lines
    push @lines, $_;
    $tail = $.;
    printf "# array: N = %d, last = $tail: %s", scalar(@lines), $_ if debug > 1;

    if (m/$op/o)
    {
        # This line matches - set range to be printed
        my $lo = $. - $before;
        $tbp_1 = $lo if ($lo > $tbp_2);
        $tbp_2 = $. + $after;
        print "# $. MATCH: print range $tbp_1 .. $tbp_2\n" if debug;
    }
    # Print out any accumulated lines that need printing
    # Leave $before lines in array.
    print_leaving($before);
}
continue
{
    if (eof)
    {
        # Print out any accumulated lines that need printing
        print_leaving(0);
        # Reset for next file
        close ARGV;
        $tbp_1 = 0;
        $tbp_2 = 0;
        $tail  = 0;
        @lines = ();
    }
}
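A typical invocation, assuming the script is saved as sgrep.pl: two lines of context each way, with line numbers:

$ perl sgrep.pl -n -b 2 -f 2 FAIL output.log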
Pre-Perl Unix solution (using plain ed, sed, and sort - though it uses getopt which was not necessarily available back then):
#!/bin/ksh
#
# @(#)$Id: old.sgrep.sh,v 1.5 2007/09/15 22:15:43 jleffler Exp $
#
# Special grep
# Finds a pattern and prints lines either side of the pattern
# Line numbers are always produced by ed (substitute for grep),
# which allows us to eliminate duplicate lines cleanly.  If the
# user did not ask for numbers, these are then stripped out.
#
# BUG: if the pattern occurs in the first line or two and
# the number of lines to go back is larger than the line number,
# it fails dismally.

set -- `getopt "f:b:hn" "$@"`
case $# in
0)  echo "Usage: $0 [-hn] [-f x] [-b y] pattern [files]" >&2
    exit 1;;
esac

# Tab required - at least with sed (perl would be different)
# But then the whole problem would be different if implemented in Perl.
number="'s/^\\([0-9][0-9]*\\) /\\1:/'"
filename="'s%^%%'"      # No-op for sed

f=3
b=3
nflag=no
hflag=no

while [ $# -gt 0 ]
do
    case $1 in
    -f) f=$2; shift 2;;
    -b) b=$2; shift 2;;
    -n) nflag=yes; shift;;
    -h) hflag=yes; shift;;
    --) shift; break;;
    *)  echo "Unknown option $1" >&2
        exit 1;;
    esac
done

pattern="${1:?'No pattern'}"
shift

case $# in
0)  tmp=${TMPDIR:-/tmp}/`basename $0`.$$
    trap "rm -f $tmp ; exit 1" 0
    cat - >$tmp
    set -- $tmp
    sort="sort -t: -u +0n -1"
    ;;
*)  filename="'s%^%'\$file:%"
    sort="sort -t: -u +1n -2"
    ;;
esac

case $nflag in
yes) num_remove='s/[0-9][0-9]*://';;
no)  num_remove='s/^//';;
esac

case $hflag in
yes) fileremove='s%^$file:%%';;
no)  fileremove='s/^//';;
esac

for file in $*
do
    echo "g/$pattern/.-${b},.+${f}n" |
    ed - $file |
    eval sed -e "$number" -e "$filename" |
    $sort |
    eval sed -e "$fileremove" -e "$num_remove"
done

rm -f $tmp
trap 0
exit 0
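Usage mirrors the Perl version; a sketch, assuming the script is saved as old.sgrep.sh:

$ ksh old.sgrep.sh -b 2 -f 2 FAIL output.log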
The shell version of sgrep was written in February 1989, and bug fixed in May 1989. It then remained unchanged except for an administrative change (SCCS to RCS transition) in 1997 until 2007, when I added the -h option. I switched to the Perl version in 2007.
http://thedailywtf.com/Articles/The_Complicator_0x27_s_Gloves.aspx
You can use sed to print specific lines; let's say you want line 20:
sed -n '20p' FILE_YOU_WANT_THE_LINE_FROM
Done.
-n prevents sed from echoing every line of the file. The part in quotes is a sed rule to apply: it specifies that you want the rule to apply to line 20, and that the rule is to print.
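The same idea extends to printing context around a line: a range address selects several lines at once (a sketch; the file name is illustrative):

sed -n '18,22p' output.log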
With GNU grep on Windows:
$ grep --context 3 FAIL output.log
$ grep --help | grep context
-B, --before-context=NUM print NUM lines of leading context
-A, --after-context=NUM print NUM lines of trailing context
-C, --context=NUM print NUM lines of output context
-NUM same as --context=NUM