Calculate the percentage of non-null records in a file in Unix
How do I figure out the percentage of non-null records in my file in UNIX?
My file looks like this (below). I want to know the number of records and the percentage of non-null records. I tried a whole lot of grep and cut commands but nothing seems to work. Can anyone help me here please?
"name","country","age","place"
"sam","US","30","CA"
"","","",""
"joe","UK","34","BRIS"
,,,,
"jake","US","66","Ohio"
Perl solution:
#!/usr/bin/perl
use warnings;
use strict;
use 5.012;  # say, keys @arr
use Text::CSV_XS qw{ csv };

my ($count_all, @count_nonempty);
csv(in      => shift,
    out     => \'skip',
    headers => 'skip',
    on_in   => sub {
        my (undef, $columns) = @_;
        ++$count_all;
        length $columns->[$_] and $count_nonempty[$_]++
            for 0 .. $#$columns;
    },
);
for my $column (keys @count_nonempty) {
    say "Column ", 1 + $column, ": ",
        100 * $count_nonempty[$column] / $count_all, '%';
}
It uses Text::CSV_XS to read the CSV file. It skips the header line, and for each subsequent line it calls the callback specified in on_in, which increments the count of all lines and, for every column whose field has non-zero length, increments that column's non-empty count.
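For readers more comfortable with Python, the same per-column tally can be sketched with the standard csv module; the sample rows are inlined here instead of being read from a file:

```python
import csv
import io

data = '''"name","country","age","place"
"sam","US","30","CA"
"","","",""
"joe","UK","34","BRIS"
,,,,
"jake","US","66","Ohio"
'''

reader = csv.reader(io.StringIO(data))
next(reader)                          # skip the header line
count_all = 0
count_nonempty = {}                   # column index -> non-empty count
for row in reader:
    count_all += 1
    for i, field in enumerate(row):
        if field:                     # count only non-empty fields
            count_nonempty[i] = count_nonempty.get(i, 0) + 1

for i in sorted(count_nonempty):
    print("Column %d: %g%%" % (i + 1, 100 * count_nonempty[i] / count_all))
```

Here each of the four columns comes out at 60% (3 non-empty values out of 5 data rows).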
Along with choroba, I would normally recommend using a CSV parser on CSV data.
But in this case, all we want to look for is that a record contains any character that is not a comma or quote: if a record contains only commas and/or quotes, it is a "null" record.
awk '
/[^",]/ {nonnull++}
END {printf "%d / %d = %.2f\n", nonnull, NR, nonnull/NR}
' file
To handle leading/trailing whitespace:
awk '
{sub(/^[[:blank:]]+/,""); sub(/[[:blank:]]+$/,"")}
/[^",]/ {nonnull++}
END {printf "%d / %d = %.2f\n", nonnull, NR, nonnull/NR}
' file
If a record whose fields contain only whitespace, such as
" ","",,," "
should also count as a null record, we can simply ignore all whitespace:
awk '
/[^",[:blank:]]/ {nonnull++}
END {printf "%d / %d = %.2f\n", nonnull, NR, nonnull/NR}
' file
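If Python happens to be handy, the same "only quotes, commas, and whitespace" test is easy to sanity-check there as well; the sample rows from the question are inlined:

```python
import re

lines = [
    '"name","country","age","place"',
    '"sam","US","30","CA"',
    '"","","",""',
    '"joe","UK","34","BRIS"',
    ',,,,',
    '"jake","US","66","Ohio"',
]

# A record is non-null if it contains anything besides quotes, commas, whitespace.
nonnull = sum(1 for line in lines if re.search(r'[^",\s]', line))
total = len(lines)
print("%d / %d = %.2f" % (nonnull, total, nonnull / total))
```

Like the awk versions, this counts the header line as a record.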
Related
How to replace a CRLF char from a variable length file in the middle of a string in Unix?
My sample file is variable length, without any field delimiters. Lines have a minimum length of 18 characters, and a 'CRLF' potentially (not always) falls between columns 11-15. How do I replace it with a space, but only when the new line char ('CRLF') is in the middle (columns 11-15)? I still want to keep the true end of record.

Sample input:

1123xxsdfdsfsfdsfdssa
1234ddfxxyff
frrrdds
1123dfdffdfdxxxxxxxxxas
1234ydfyyyzm
knsaaass
1234asdafxxfrrrfrrrsaa
1123werwetrretttrretertre

Expected output:

1123xxsdfdsfsfdsfdssa
1234ddfxxyff frrrdds
1123dfdffdfdxxxxxxxxxas
1234ydfyyyzm knsaaass
1234asdafxxfrrrfrrrsaa
1123werwetrretttrretertre

What I tried:

sed '/^.\{15\}$/!N;s/./ /11' filename

But the above code just adds a space; it does not remove the 'CRLF'.
Given your sample data, this seems to produce the desired output:

$ awk 'length($0) < 18 { getline x; $0 = $0 " " x } { print }' data
1123xxsdfdsfsfdsfdssa
1234ddfxxyff frrrdds
1123dfdffdfdxxxxxxxxxas
1234ydfyyyzm knsaaass
1234asdafxxfrrrfrrrsaa
1123werwetrretttrretertre
$

However, if the input contained CRLF line endings, things would not be so happy; it would be best to filter out the CR characters altogether (Unix files don't normally contain CR and certainly do not normally have CRLF line endings):

$ tr -d '\r' < data | awk 'length($0) < 18 { getline x; $0 = $0 " " x } { print }'
1123xxsdfdsfsfdsfdssa
1234ddfxxyff frrrdds
1123dfdffdfdxxxxxxxxxas
1234ydfyyyzm knsaaass
1234asdafxxfrrrfrrrsaa
1123werwetrretttrretertre
$

If you really need DOS-style CRLF input and output, you probably need a program such as utod or unix2dos (or some other similar tool) to convert from Unix line endings back to DOS.
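The same join-short-lines idea can be sketched in Python for comparison; the 18-character minimum comes from the question, while the helper name join_short_lines is made up for illustration:

```python
def join_short_lines(lines, min_len=18):
    """Join any line shorter than min_len with the following line."""
    out = []
    it = iter(lines)
    for line in it:
        line = line.rstrip('\r')              # drop stray CR from CRLF input
        if len(line) < min_len:
            line = line + " " + next(it, "").rstrip('\r')
        out.append(line)
    return out

sample = ["1123xxsdfdsfsfdsfdssa", "1234ddfxxyff", "frrrdds"]
print(join_short_lines(sample))
```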
Truncate Leading Zeroes and Padding Equal Number of Spaces to particular fields in UNIX
How do I remove leading zeroes from particular columns, padding with an equal number of spaces?

A001|XYZ|00001|00234|0090
B001|XYZ|00010|00234|0990
C001|XYZ|00321|00234|0345
D001|XYZ|05001|00234|0777

Fields 3 and 5 are integer fields, and the output should be as below ('.' represents spaces):

A001|XYZ|1....|00234|90..
B001|XYZ|10...|00234|990.
C001|XYZ|321..|00234|345.
D001|XYZ|5001.|00234|777.
Very trivial using awk and printf:

awk '
BEGIN { FS = OFS = "|" }
{
    len = length($3)
    $3 = sprintf("%-*d", len, $3)
    len = length($5)
    $5 = sprintf("%-*d", len, $5)
}1' file
A001|XYZ|1    |00234|90
B001|XYZ|10   |00234|990
C001|XYZ|321  |00234|345
D001|XYZ|5001 |00234|777

Identify the length of the field, then use sprintf to assign the formatted value back to the same column. The * in the format lets the field width be taken from an argument (the original length), and the - flag left-justifies the number, so stripped zeroes become trailing spaces.
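The same width-preserving zero strip is easy to express in Python if awk is not available; strip_zeros_pad is a made-up helper name:

```python
def strip_zeros_pad(field):
    """Strip leading zeroes, then pad with spaces back to the original width."""
    return str(int(field)).ljust(len(field))

row = "A001|XYZ|00001|00234|0090".split("|")
row[2] = strip_zeros_pad(row[2])   # "00001" -> "1    "
row[4] = strip_zeros_pad(row[4])   # "0090"  -> "90  "
print("|".join(row))
```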
sed ':again
s/^\(\([^|]*|\)\{2\}\)0\([0-9]*\)/\1\3 /
t again
s/^\(\([^|]*|\)\{4\}\)0\([0-9]*\)/\1\3 /
t again' YourFile

Easier for any column (if needed):

sed ':again
s/|0\([0-9]*\)/|\1 /g
t again' YourFile
awk count and sum based on slab:
Would like to extract all the lines from the first file (gunzipped Input.csv.gz) whose 4th field falls within a range given by the second file (Slab.csv): first field StartRange, second field EndRange. Then populate a slab-wise count of rows and the sums of the first file's 4th and 5th fields.

Input.csv.gz (gunzipped):

Desc,Date,Zone,Duration,Calls
AB,01-06-2014,XYZ,450,3
AB,01-06-2014,XYZ,642,3
AB,01-06-2014,XYZ,0,0
AB,01-06-2014,XYZ,205,3
AB,01-06-2014,XYZ,98,1
AB,01-06-2014,XYZ,455,1
AB,01-06-2014,XYZ,120,1
AB,01-06-2014,XYZ,0,0
AB,01-06-2014,XYZ,193,1
AB,01-06-2014,XYZ,0,0
AB,01-06-2014,XYZ,161,2

Slab.csv:

StartRange,EndRange
0,0
1,10
11,100
101,200
201,300
301,400
401,500
501,10000

Expected output:

StartRange,EndRange,Count,Sum-4,Sum-5
0,0,3,0,0
1,10,NotFound,NotFound,NotFound
11,100,1,98,1
101,200,3,474,4
201,300,1,205,3
301,400,NotFound,NotFound,NotFound
401,500,2,905,4
501,10000,1,642,3

I am using the two commands below to get the above output, except for the "NotFound" cases:

awk -F, 'NR==FNR{s[NR]=$1;e[NR]=$2;c[NR]=$0;n++;next} {for(i=1;i<=n;i++) if($4>=s[i]&&$4<=e[i]) {print $0,","c[i];break}}' Slab.csv <(gzip -dc Input.csv.gz) > Op_step1.csv
cat Op_step1.csv | awk -F, '{key=$6","$7;++a[key];b[key]=b[key]+$4;c[key]=c[key]+$5} END{for(i in a)print i","a[i]","b[i]","c[i]}' > Op_step2.csv

Op_step2.csv:

101,200,3,474,4
501,10000,1,642,3
0,0,3,0,0
401,500,2,905,4
11,100,1,98,1
201,300,1,205,3

Any suggestions to make it a single command that achieves the expected output? I don't have perl or python access.
Here is another option using perl, which takes advantage of multi-dimensional arrays and hashes:

perl -F, -lane'
    BEGIN {
        $x = pop;
        ## Create array of arrays from start and end ranges
        ## @range = ( [0,0] , [1,10] ... )
        (undef, @range) = map { chomp; [split /,/] } <>;
        @ARGV = $x;
    }
    ## Skip the first line
    next if $. == 1;
    ## Create hash of hashes
    ## %line = ( "0 0" => { "count" => count, "sum4" => sum_of_col4, "sum5" => sum_of_col5 } )
    for (@range) {
        if ($F[3] >= $_->[0] && $F[3] <= $_->[1]) {
            $line{"@$_"}{"count"}++;
            $line{"@$_"}{"sum4"} += $F[3];
            $line{"@$_"}{"sum5"} += $F[4];
        }
    }
    }{
    print "StartRange,EndRange,Count,Sum-4,Sum-5";
    print join ",", @$_,
        $line{"@$_"}{"count"} // "NotFound",
        $line{"@$_"}{"sum4"}  // "NotFound",
        $line{"@$_"}{"sum5"}  // "NotFound"
        for @range
' slab input
StartRange,EndRange,Count,Sum-4,Sum-5
0,0,3,0,0
1,10,NotFound,NotFound,NotFound
11,100,1,98,1
101,200,3,474,4
201,300,1,205,3
301,400,NotFound,NotFound,NotFound
401,500,2,905,4
501,10000,1,642,3
Here is one way using awk and sort:

awk '
BEGIN {
    FS = OFS = SUBSEP = ","
    print "StartRange,EndRange,Count,Sum-4,Sum-5"
}
FNR == 1 { next }
NR == FNR { ranges[$1,$2]++; next }
{
    for (range in ranges) {
        split(range, tmp, SUBSEP)
        if ($4 >= tmp[1] && $4 <= tmp[2]) {
            count[range]++; sum4[range] += $4; sum5[range] += $5
            next
        }
    }
}
END {
    for (range in ranges)
        print range,
            (count[range] ? count[range] : "NotFound"),
            (sum4[range]  ? sum4[range]  : "NotFound"),
            (sum5[range]  ? sum5[range]  : "NotFound") | "sort -t, -nk1,2"
}' slab input
StartRange,EndRange,Count,Sum-4,Sum-5
0,0,3,NotFound,NotFound
1,10,NotFound,NotFound,NotFound
11,100,1,98,1
101,200,3,474,4
201,300,1,205,3
301,400,NotFound,NotFound,NotFound
401,500,2,905,4
501,10000,1,642,3

Set the input and output field separators and SUBSEP to ','. Print the header line. Skip the first line of each file. Load the entire slab file into an array called ranges. For every input row, split each range to get the start and end values; if the 4th column is in the range, increment the count array and add the values to the sum4 and sum5 arrays. In the END block, iterate through the ranges, print them, and pipe the output to sort to get it in order.
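Although the asker has no perl or python access, the slab logic itself is easy to cross-check in Python; the duration/calls pairs and slab ranges below are copied from the question:

```python
rows = [(450, 3), (642, 3), (0, 0), (205, 3), (98, 1), (455, 1),
        (120, 1), (0, 0), (193, 1), (0, 0), (161, 2)]
slabs = [(0, 0), (1, 10), (11, 100), (101, 200),
         (201, 300), (301, 400), (401, 500), (501, 10000)]

stats = {}                         # (lo, hi) -> (count, sum4, sum5)
for dur, calls in rows:
    for lo, hi in slabs:
        if lo <= dur <= hi:
            c, s4, s5 = stats.get((lo, hi), (0, 0, 0))
            stats[(lo, hi)] = (c + 1, s4 + dur, s5 + calls)
            break

print("StartRange,EndRange,Count,Sum-4,Sum-5")
for lo, hi in slabs:
    c, s4, s5 = stats.get((lo, hi), ("NotFound",) * 3)
    print("%s,%s,%s,%s,%s" % (lo, hi, c, s4, s5))
```

Because count, sum4 and sum5 are stored together per slab, a slab with a genuine sum of 0 (like 0,0) still prints 0 rather than NotFound.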
Extract values of a variable occurring multiple times in a file
I have to extract the value of a variable which occurs multiple times in a file. For example, in a text file abc.txt there is a variable result. Suppose the value of result in the first line is 2, in the third line it is 55, and in the last line it is 66. Then my desired output should be:

result:2,55,66

I am new to unix so I could not figure out how to do this. Please help. The contents of the text file can be as follows:

R$#$#%$W%^BHGF, result=2, fsdfsdsgf
VSDF$TR$R,result=55
fsdf4r54 result=66
Try this. Using awk:

awk -F'(,| |^)result=' '
/result=/{
    gsub(/,.*/, "", $2)
    v = $2
    str = (str) ? str "," v : v
}
END { print "result:" str }
' abc.txt

Using perl:

perl -lane '
    push @arr, $& if /\bresult=\K\d+/;
    END { print "result:" . join ",", @arr }
' abc.txt

Output:

result:2,55,66
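If Python is available too, the whole job is a single regular expression; the sample text is taken from the question:

```python
import re

text = """R$#$#%$W%^BHGF, result=2, fsdfsdsgf
VSDF$TR$R,result=55
fsdf4r54 result=66"""

# Capture the digits after every "result=".
values = re.findall(r'\bresult=(\d+)', text)
print("result:" + ",".join(values))
```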
awk field separator , when the separator shows up in double quote
I am trying to use awk to read the input at field position 3, $3; field 3 is a string:

awk -F'","' '{print $1}' input.txt

My file input.txt looks like this:

field1,field2,field3,field4,field5

The problem is that these fields are separated by commas; some of them are double quoted while others are not. And field 5 is double quoted and contains every type of symbol. Example:

imfield1,imfield2,"imfield3",imfield4,"im"",""fi"",el,""d5"

Can awk handle a situation like this? More generally, how can I get the whole string by typing $5?
You can use Lorance Stinson's Awk CSV parser, in which case it's as simple as:

function parse_csv(..) {
    ..
}
{
    num_fields = parse_csv($0, csv, ",", "\"", "\"", "\\n", 1)
    print csv[2]
}

If you're not hell-bent on Awk, Python also comes with a nice CSV parser:

import csv, sys
for row in csv.reader(sys.stdin):
    print(row[2])

Or from the command line (a bit tricky in one line):

python -c 'import csv,sys;[sys.stdout.write(row[2]+"\n") for row in csv.reader(sys.stdin)]' < input.txt
The separator is a simple comma, not a comma between quotes. If the fields do not contain commas, then awk may be up to the task:

awk -F, '
{
    if ($3 ~ /^".*"$/) {
        $3 = substr($3, 2, length($3)-2);
        gsub(/""/, "\"", $3);
    }
    print $3;
}' input.txt

This is already getting pretty complicated. If there can be commas inside fields, use a proper CSV parser, for example in Perl or Python. See https://unix.stackexchange.com/questions/7425/is-there-a-robust-command-line-tool-for-processing-csv-files
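To illustrate the "use a proper CSV parser" advice on the asker's exact line, Python's csv module (also suggested in the answer above) undoes the doubled "" quotes automatically:

```python
import csv
import io

# The asker's line: field 5 is quoted and contains commas plus doubled ("") quotes.
line = 'imfield1,imfield2,"imfield3",imfield4,"im"",""fi"",el,""d5"'

row = next(csv.reader(io.StringIO(line)))
print(len(row))   # 5 fields, despite the commas inside field 5
print(row[4])
```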
You can parse the line in awk by setting a null field separator, so each character becomes a field. Instead of printf("%s", $i) you can collect the characters into a variable and print it when inda == 0:

echo "\"AAA,BBB\",\"CCC\",\"DDD, EEE, FFF\"" > uno
awk 'BEGIN { FS = "" }
{
    for (i = 1; i <= NF; i++) {
        if ($i == "\"")
            inda = (inda == 0) ? 1 : 0
        if ($i == ",")
            if (inda == 0) $i = "|"
        printf("%s", $i)
    }
    printf("\n")
}' uno