Calculate the percentage of not-null records in a file in Unix

How do I figure out the percentage of not-null records in my file in UNIX?
My file looks like the sample below. I want to know the number of records and the percentage of not-null records. I tried a whole lot of grep and cut commands but nothing seems to work. Can anyone help me here, please?
"name","country","age","place"
"sam","US","30","CA"
"","","",""
"joe","UK","34","BRIS"
,,,,
"jake","US","66","Ohio"

Perl solution:
#!/usr/bin/perl
use warnings;
use strict;
use 5.012; # say, keys @arr
use Text::CSV_XS qw{ csv };

my ($count_all, @count_nonempty);
csv(in      => shift,
    out     => \'skip',
    headers => 'skip',
    on_in   => sub {
        my (undef, $columns) = @_;
        ++$count_all;
        length $columns->[$_] and $count_nonempty[$_]++
            for 0 .. $#$columns;
    },
);
for my $column (keys @count_nonempty) {
    say "Column ", 1 + $column, ": ",
        100 * $count_nonempty[$column] / $count_all, '%';
}
It uses Text::CSV_XS to read the CSV file. It skips the header line, and for each subsequent line it calls the callback given in on_in, which increments the count of all lines and, for every field of non-zero length, the per-column count of non-empty fields.
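Saved as, say, count_nonempty.pl (a name chosen here just for illustration) and run on the sample file, it should report 60% for each of the four columns, since three of the five data records are populated:
$ perl count_nonempty.pl file.csv
Column 1: 60%
Column 2: 60%
Column 3: 60%
Column 4: 60%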

Along with choroba, I would normally recommend using a CSV parser on CSV data.
But in this case, all we want to check is whether a record contains any character that is not a comma or a quote: if a record contains only commas and/or quotes, it is a "null" record.
awk '
/[^",]/ {nonnull++}
END {printf "%d / %d = %.2f\n", nonnull, NR, nonnull/NR}
' file
To handle leading/trailing whitespace:
awk '
{sub(/^[[:blank:]]+/,""); sub(/[[:blank:]]+$/,"")}
/[^",]/ {nonnull++}
END {printf "%d / %d = %.2f\n", nonnull, NR, nonnull/NR}
' file
If a record whose fields contain only whitespace, such as
" ","",,," "
should also count as null, we can simply ignore all whitespace:
awk '
/[^",[:blank:]]/ {nonnull++}
END {printf "%d / %d = %.2f\n", nonnull, NR, nonnull/NR}
' file
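Note that all three variants also count the header line as a non-null record, which skews the percentage slightly. If that matters, a guard on FNR excludes it; a minimal sketch based on the last variant:
awk '
FNR > 1 && /[^",[:blank:]]/ {nonnull++}
END {printf "%d / %d = %.2f\n", nonnull, NR-1, nonnull/(NR-1)}
' file
On the sample data this should print 3 / 5 = 0.60.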

Related

How to replace a CRLF char from a variable length file in the middle of a string in Unix?

My sample file is variable length, without any field delimiters. Lines have a minimum length of 18 characters, and a 'CRLF' potentially (not always) appears between columns 11-15. How do I replace it with a space only when the newline ('CRLF') falls in the middle (columns 11-15)? I still want to keep the true end of record.
Sample data:
Input:
1123xxsdfdsfsfdsfdssa
1234ddfxxyff
frrrdds
1123dfdffdfdxxxxxxxxxas
1234ydfyyyzm
knsaaass
1234asdafxxfrrrfrrrsaa
1123werwetrretttrretertre
Expected output:
1123xxsdfdsfsfdsfdssa
1234ddfxxyfff rrrdds
1123dfdffdfdxxxxxxxxxas
1234ydfyyyzm knsaaass
1234asdafxxfrrrfrrrsaa
1123werwetrretttrretertre
What I tried:
sed '/^.\{15\}$/!N;s/./ /11' filename
But the above code just adds a space; it does not remove the 'CRLF'.
Given your sample data, this seems to produce the desired output:
$ awk 'length($0) < 18 { getline x; $0 = $0 " " x} { print }' data
1123xxsdfdsfsfdsfdssa
1234ddfxxyff frrrdds
1123dfdffdfdxxxxxxxxxas
1234ydfyyyzm knsaaass
1234asdafxxfrrrfrrrsaa
1123werwetrretttrretertre
$
However, if the input contained CRLF line endings, things would not be so happy; it would be best to filter out the CR characters altogether (Unix files don't normally contain CR and certainly do not normally have CRLF line endings).
$ tr -d '\r' < data | awk 'length($0) < 18 { getline x; $0 = $0 " " x} { print }'
1123xxsdfdsfsfdsfdssa
1234ddfxxyff frrrdds
1123dfdffdfdxxxxxxxxxas
1234ydfyyyzm knsaaass
1234asdafxxfrrrfrrrsaa
1123werwetrretttrretertre
$
If you really need DOS-style CRLF input and output, you probably need to use a program such as utod or unix2dos (or some other similar tool) to convert from Unix line endings to DOS.
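For example, a round trip that strips the CRs on the way in and restores them on the way out might look like this (a sketch, assuming a unix2dos that filters stdin to stdout, which recent versions do):
$ tr -d '\r' < data | awk 'length($0) < 18 { getline x; $0 = $0 " " x} { print }' | unix2dos > data.out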

Truncate Leading Zeroes and Padding Equal Number of Spaces to particular fields in UNIX

How do I remove leading zeroes from particular columns and pad with an equal number of spaces?
A001|XYZ|00001|00234|0090
B001|XYZ|00010|00234|0990
C001|XYZ|00321|00234|0345
D001|XYZ|05001|00234|0777
Fields 3 and 5 are integer fields, and the output should be as below:
A001|XYZ|1....|00234|90..
B001|XYZ|10...|00234|990.
C001|XYZ|321..|00234|345.
D001|XYZ|5001.|00234|777.
('.' represent spaces)
Very trivial using awk and printf:
awk '
BEGIN { FS = OFS = "|" }
{
    len = length($3)
    $3 = sprintf("%-*d", len, $3)
    len = length($5)
    $5 = sprintf("%-*d", len, $5)
}1' file
A001|XYZ|1    |00234|90  
B001|XYZ|10   |00234|990 
C001|XYZ|321  |00234|345 
D001|XYZ|5001 |00234|777 
Identify the length of the field, then use sprintf to assign the reformatted value back to the same column. The * takes the field width from an argument, so the value is left-justified and padded with exactly as many spaces as the original field had characters.
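As a throwaway illustration of the * width specifier in isolation (using gawk here):
$ awk 'BEGIN { printf "[%-*d]\n", 5, 42 }'
[42   ]
The 5 is consumed as the field width, so 42 is printed left-justified in a five-character field.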
sed ':again
s/^\(\([^|]*|\)\{2\}\)0\([0-9]*\)/\1\3 /
t again
s/^\(\([^|]*|\)\{4\}\)0\([0-9]*\)/\1\3 /
t again' YourFile
Easier, for any column (if needed):
sed ':again
s/|0\([0-9]*\)/|\1 /g
t again' YourFile
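As a quick check, feeding the first sample row through the column-specific version should migrate each leading zero into a trailing space, one per iteration of the :again loop:
$ printf 'A001|XYZ|00001|00234|0090\n' | sed ':again
s/^\(\([^|]*|\)\{2\}\)0\([0-9]*\)/\1\3 /
t again
s/^\(\([^|]*|\)\{4\}\)0\([0-9]*\)/\1\3 /
t again'
A001|XYZ|1    |00234|90  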

awk count and sum based on slab:

I would like to extract all lines from the first file (gzipped, i.e. Input.csv.gz) whose 4th field falls within the range defined in the second file (Slab.csv) by its first field (StartRange) and second field (EndRange), and then populate, per slab, the count of rows and the sums of the 4th and 5th fields of the first file.
Input.csv.gz (gzipped)
Desc,Date,Zone,Duration,Calls
AB,01-06-2014,XYZ,450,3
AB,01-06-2014,XYZ,642,3
AB,01-06-2014,XYZ,0,0
AB,01-06-2014,XYZ,205,3
AB,01-06-2014,XYZ,98,1
AB,01-06-2014,XYZ,455,1
AB,01-06-2014,XYZ,120,1
AB,01-06-2014,XYZ,0,0
AB,01-06-2014,XYZ,193,1
AB,01-06-2014,XYZ,0,0
AB,01-06-2014,XYZ,161,2
Slab.csv
StartRange,EndRange
0,0
1,10
11,100
101,200
201,300
301,400
401,500
501,10000
Expected Output:
StartRange,EndRange,Count,Sum-4,Sum-5
0,0,3,0,0
1,10,NotFound,NotFound,NotFound
11,100,1,98,1
101,200,3,474,4
201,300,1,205,3
301,400,NotFound,NotFound,NotFound
401,500,2,905,4
501,10000,1,642,3
I am using the two commands below to get the above output, except for the "NotFound" cases.
awk -F, 'NR==FNR{s[NR]=$1;e[NR]=$2;c[NR]=$0;n++;next} {for(i=1;i<=n;i++) if($4>=s[i]&&$4<=e[i]) {print $0,","c[i];break}}' Slab.csv <(gzip -dc Input.csv.gz) >Op_step1.csv
cat Op_step1.csv | awk -F, '{key=$6","$7;++a[key];b[key]=b[key]+$4;c[key]=c[key]+$5} END{for(i in a)print i","a[i]","b[i]","c[i]}' >Op_step2.csv
Op_step2.csv
101,200,3,474,4
501,10000,1,642,3
0,0,3,0,0
401,500,2,905,4
11,100,1,98,1
201,300,1,205,3
Any suggestions to make it a one-liner command that achieves the expected output? I don't have perl or python access.
Here is another option using perl, which takes advantage of multi-dimensional arrays and hashes.
perl -F, -lane'
    BEGIN {
        $x = pop;
        ## Create array of arrays from start and end ranges
        ## @range = ( [0,0] , [1,10] ... )
        (undef, @range) = map { chomp; [split /,/] } <>;
        @ARGV = $x;
    }
    ## Skip the first line
    next if $. == 1;
    ## Create hash of hash
    ## $line = '[0,0]' => { "count" => counts , "sum4" => sum_of_col4 , "sum5" => sum_of_col5 }
    for (@range) {
        if ($F[3] >= $_->[0] && $F[3] <= $_->[1]) {
            $line{"@$_"}{"count"}++;
            $line{"@$_"}{"sum4"} += $F[3];
            $line{"@$_"}{"sum5"} += $F[4];
        }
    }
    }{
    print "StartRange,EndRange,Count,Sum-4,Sum-5";
    print join ",", @$_,
        $line{"@$_"}{"count"} // "NotFound",
        $line{"@$_"}{"sum4"}  // "NotFound",
        $line{"@$_"}{"sum5"}  // "NotFound"
        for @range
' slab input
StartRange,EndRange,Count,Sum-4,Sum-5
0,0,3,0,0
1,10,NotFound,NotFound,NotFound
11,100,1,98,1
101,200,3,474,4
201,300,1,205,3
301,400,NotFound,NotFound,NotFound
401,500,2,905,4
501,10000,1,642,3
Here is one way using awk and sort:
awk '
BEGIN {
FS = OFS = SUBSEP = ",";
print "StartRange,EndRange,Count,Sum-4,Sum-5"
}
FNR == 1 { next }
NR == FNR {
ranges[$1,$2]++;
next
}
{
for (range in ranges) {
split(range, tmp, SUBSEP);
if ($4 >= tmp[1] && $4 <= tmp[2]) {
count[range]++;
sum4[range]+=$4;
sum5[range]+=$5;
next
}
}
}
END {
for(range in ranges)
print range, (count[range]?count[range]:"NotFound"), (sum4[range]?sum4[range]:"NotFound"), (sum5[range]?sum5[range]:"NotFound") | "sort -t, -nk1,2"
}' slab input
StartRange,EndRange,Count,Sum-4,Sum-5
0,0,3,NotFound,NotFound
1,10,NotFound,NotFound,NotFound
11,100,1,98,1
101,200,3,474,4
201,300,1,205,3
301,400,NotFound,NotFound,NotFound
401,500,2,905,4
501,10000,1,642,3
Set the input and output field separators and SUBSEP to ,. Print the header line.
If it is the first line of either file, skip it.
Load the entire slab file into an array called ranges.
For every range in the ranges array, split the key to get the start and end of the range. If the 4th column is in the range, increment the count array and add the values to the sum4 and sum5 arrays appropriately.
In the END block, iterate through the ranges and print them.
Pipe the output to sort to get the output in order.
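One caveat, visible in the 0,0 row of the output above: the ternaries use the values as booleans, so a slab whose sums are legitimately 0 prints NotFound instead of the expected 0,0,3,0,0. A sketch of an END block that tests membership with in instead:
END {
    for (range in ranges)
        print range,
            ((range in count) ? count[range] : "NotFound"),
            ((range in sum4)  ? sum4[range]  : "NotFound"),
            ((range in sum5)  ? sum5[range]  : "NotFound") | "sort -t, -nk1,2"
}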

Extract values of a variable occurring multiple times in a file

I have to extract the values of a variable which occurs multiple times in a file. For example, I have a text file abc.txt with a variable result. Suppose the value of result in the first line is 2, in the third line it is 55, and in the last line it is 66.
Then my desired output should be:
result:2,55,66
I am new to Unix, so I could not figure out how to do this. Please help.
The contents of text file can be as follows:
R$#$#%$W%^BHGF, result=2,
fsdfsdsgf
VSDF$TR$R,result=55
fsdf4r54
result=66
Try this, using awk:
awk -F'(,| |^)result=' '
/result=/{
    gsub(",", "", $2)
    v = $2
    str = (str) ? str "," v : v
}
END{print "result:" str}
' abc.txt
Using perl:
perl -lane '
push @arr, $& if /\bresult=\K\d+/;
END{print "result:" . join ",", @arr}
' abc.txt
Output:
result:2,55,66
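If GNU grep is available, a short pipeline gives the same result without a script (a sketch; grep -o is a GNU extension, so this assumes your toolchain has it):
grep -o 'result=[0-9]*' abc.txt | cut -d= -f2 | paste -sd, - | sed 's/^/result:/'
result:2,55,66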

awk field separator, when the separator shows up in double quotes

I am trying to use awk to read input at field position 3 ($3); field 3 is a string.
awk -F'","' '{print $1}' input.txt
my file input.txt looks like this
field1,field2,field3,field4,field5
The problem is that these fields are separated by commas, and some of them are double quoted while others are not. Field 5 is double quoted and can contain any kind of symbol. Example:
imfield1,imfield2,"imfield3",imfield4,"im"",""fi"",el,""d5"
Can awk handle a situation like this? More generally, how can I get the whole string by typing $5?
You can use Lorance Stinson's Awk CSV parser, in which case it's as simple as:
function parse_csv(..) {
    ..
}
{
    num_fields = parse_csv($0, csv, ",", "\"", "\"", "\\n", 1);
    print csv[2]
}
If you're not hell-bent on Awk, Python also comes with a nice CSV parser:
import csv, sys
for row in csv.reader(sys.stdin):
    print(row[2])
Or from the command line (bit tricky in one line):
python -c 'import csv,sys;[sys.stdout.write(row[2]+"\n") for row in csv.reader(sys.stdin)]' < input.txt
The separator is a simple comma, not a comma between quotes. If the fields do not contain commas, then awk may be up to the task:
awk -F , '
{
    if ($3 ~ /^".*"$/) {
        $3 = substr($3, 2, length($3) - 2);  # strip the outer quotes
        gsub(/""/, "\"", $3);                # turn escaped "" into "
    }
    print $3;
}' input.txt
This is already getting pretty complicated. If there can be commas inside fields, use a proper CSV parser, for example in Perl or Python. See https://unix.stackexchange.com/questions/7425/is-there-a-robust-command-line-tool-for-processing-csv-files
You can parse the line in awk by setting a null field separator, so that every character becomes its own field. Instead of printf("%s", $i) you can collect $i into a variable and print it when inda == 0.
#echo "\"AAA,BBB\",\"CCC\",\"DDD, EEE, FFF\"" > uno
awk 'BEGIN { FS = "" }                  # null FS: every character is a field
{
    for (i = 1; i <= NF; i++) {         # <=, so the last character is kept
        if ($i == "\"")
            inda = (inda == 0) ? 1 : 0  # toggle the "inside quotes" flag
        if ($i == "," && inda == 0)
            $i = "|"                    # rewrite commas outside quotes
        printf("%s", $i)
    }
    printf("\n")
}' uno
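With the loop bound fixed as above (i <= NF, so the final quote is not dropped), running it on uno after creating the file with the commented-out echo should print:
"AAA,BBB"|"CCC"|"DDD, EEE, FFF"
That is, the commas inside quotes survive while the outer ones become pipes.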
