Append two columns, adding a specific integer to each value, in Unix - unix
I have two files like this:
file1:
# step distance
0 4.48595407961296e+01
2500 4.50383737781376e+01
5000 4.53506757198727e+01
7500 4.51682465277482e+01
10000 4.53410353656445e+01
file2:
# step distance
0 4.58854106214881e+01
2500 4.58639266431320e+01
5000 4.60620560167519e+01
7500 4.58990075106227e+01
10000 4.59371359946124e+01
So I want to join the two files together, while maintaining the spacing. In particular, the second file needs to pick up from the last step value of the first one and continue counting from there.
Desired output:
# step distance
0 4.48595407961296e+01
2500 4.50383737781376e+01
5000 4.53506757198727e+01
7500 4.51682465277482e+01
10000 4.53410353656445e+01
12500 4.58854106214881e+01
15000 4.58639266431320e+01
17500 4.60620560167519e+01
20000 4.58990075106227e+01
22500 4.59371359946124e+01
With calc it was easy to do; the problem is that the spacing has to stay exactly as it is for things to work, and in that case calc makes a complete mess of it.
# start awk and set the *Step* between files to 2500
awk -v 'Step=2500' '
# 1st line of the 1st file (NR counts every line, across all files): init and print the header
NR == 1 {LastFile = FILENAME; OFS = "\t"; print}
# when the file changes (new filename compared to the previous line read),
# set a new start index (to turn the relative steps into absolute ones) and remember the new filename
FILENAME != LastFile { StartIndex = LastIndex + Step; LastFile = FILENAME}
# after the first line, for every line starting with a digit (preceded by blanks, if any),
# calculate the absolute step, replace the relative one, and print the new content
NR > 1 && /^[[:blank:]]*[0-9]/ { $1 += StartIndex; LastIndex = $1; print }
' YourFiles*
The result will depend on the order of the files.
The output separator is set by the OFS value (a tab here).
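For example, assuming the two samples are saved as run1.dat and run2.dat and the program above (everything between the quotes) is stored in append_steps.awk (all hypothetical names), listing the files explicitly instead of using a glob fixes the concatenation order:

awk -v Step=2500 -f append_steps.awk run1.dat run2.dat > combined.dat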
Perl to the rescue!
#!/usr/bin/perl
use warnings;
use strict;

open my $F1, '<', 'file1' or die $!;
my ($before, $after, $diff);
my $max = 0;
while (<$F1>) {
    print;
    my ($space1, $num, $space2) = /^(\s*) ([0-9]+) (\s*)/x or next;
    ($before, $after) = ($space1, $space2);
    $diff = $num - $max;
    $max = $num;
}
$before = length "$before$max";  # We'll need it to format the computed numbers.

open my $F2, '<', 'file2' or die $!;
<$F2>;  # Skip the header.
while (<$F2>) {
    my ($step, $distance) = split;
    $step += $max + $diff;
    printf "% ${before}d%s%s\n", $step, $after, $distance;
}
The program remembers the last number in $max. It also keeps the length of the leading whitespace plus $max in $before to format all future numbers to take up the same space (using printf).
You didn't show how the distance column is aligned, i.e.
20000 4.58990075106227e+01
22500 11.59371359946124e+01 # dot aligned?
22500 11.34572478912301e+01 # left aligned?
The program would align it the latter way. If you want the former, use a similar trick as for the step column.
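A minimal way to run the script, assuming it is saved as append_steps.pl (a hypothetical name) in the directory that holds file1 and file2 (those names are hard-coded in the two open calls):

perl append_steps.pl > combined.dat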
Related
calculate the percentage of not null recs in a file in unix
How do I figure out the percentage of not-null records in my file in UNIX? My file looks like the sample below. I wanted to know the number of records and the percentage of not-null records. I tried a whole lot of grep and cut commands but nothing seems to be working out. Can anyone help me here please?

"name","country","age","place"
"sam","US","30","CA"
"","","",""
"joe","UK","34","BRIS"
,,,,
"jake","US","66","Ohio"
Perl solution:

#!/usr/bin/perl
use warnings;
use strict;
use 5.012;    # say, keys @arr

use Text::CSV_XS qw{ csv };

my ($count_all, @count_nonempty);
csv(in      => shift,
    out     => \'skip',
    headers => 'skip',
    on_in   => sub {
        my (undef, $columns) = @_;
        ++$count_all;
        length $columns->[$_] and $count_nonempty[$_]++
            for 0 .. $#$columns;
    },
);
for my $column (keys @count_nonempty) {
    say "Column ", 1 + $column, ": ",
        100 * $count_nonempty[$column] / $count_all, '%';
}

It uses Text::CSV_XS to read the CSV file. It skips the header line, and for each subsequent line it calls the callback specified in on_in, which increments the count of all lines and also the per-column count of non-empty fields whenever the length of a field is non-zero.
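Assuming the script is saved as count_nonempty.pl and the sample above as records.csv (both hypothetical names), it would be run as:

perl count_nonempty.pl records.csv

and should print one percentage line per column (Column 1, Column 2, and so on).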
Along with choroba, I would normally recommend using a CSV parser on CSV data. But in this case, all we want to look for is whether a record contains any character that is not a comma or a quote: if a record contains only commas and/or quotes, it is a "null" record.

awk '
    /[^",]/ {nonnull++}
    END     {printf "%d / %d = %.2f\n", nonnull, NR, nonnull/NR}
' file

To handle leading/trailing whitespace:

awk '
    {sub(/^[[:blank:]]+/,""); sub(/[[:blank:]]+$/,"")}
    /[^",]/ {nonnull++}
    END     {printf "%d / %d = %.2f\n", nonnull, NR, nonnull/NR}
' file

If a record whose fields contain only whitespace, such as " ","",,," ", should also count as a null record, we can simply ignore all whitespace:

awk '
    /[^",[:blank:]]/ {nonnull++}
    END              {printf "%d / %d = %.2f\n", nonnull, NR, nonnull/NR}
' file
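As a quick sanity check, assuming the sample above is saved as records.csv (a hypothetical name), the first script can be run as:

awk '
    /[^",]/ {nonnull++}
    END     {printf "%d / %d = %.2f\n", nonnull, NR, nonnull/NR}
' records.csv

Here it should print 4 / 6 = 0.67, since the header line and the three populated rows match, out of six lines in total.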
Get a specific column number by column name using awk
I have n files; in each of them a specific column named "thrudate" appears at a different column number. I just want to extract the values of this column from all the files in one go. So I tried awk. Here I'm considering only one file and extracting the values of thrudate:

awk -F, -v header=1,head="" '{for(j=1;j<=2;j++){if($header==1){for(i=1;i<=$NF;i++){if($i=="thrudate"){$head=$i;$header=0;break}}} elif($header==0){print $0}}}' file | head -10

How I have approached it: I used find to locate all the similar files and then executed the second step on every file. I loop over all fields in the first row, checking the column name, with header initialized to 1 (so only the first row is checked); once it matches 'thrudate', I set header to 0 and break out of the loop. Once I have the column number, I print it for every row.
You can use the following awk script, print_col.awk:

# Find the column number in the first line of a file
FNR==1{
    for(n=1;n<=NF;n++) {
        if($n == header) {
            next
        }
    }
}
# Print that column on all other lines
{ print $n }

Then use find to execute this script on every file:

find ... -exec awk -v header="foo" -f print_col.awk {} +

In comments you've asked for a version that could print multiple columns based on their header names. You may use the following script for that, print_cols.awk:

BEGIN {
    # Parse headers into an assoc array h
    split(header, a, ",")
    for(i in a) {
        h[a[i]]=1
    }
}
# Find the column numbers in the first line of a file
FNR==1{
    split("", cols) # This will re-init cols
    for(i=1;i<=NF;i++) {
        if($i in h) {
            cols[i]=1
        }
    }
    next
}
# Print those columns on all other lines
{
    res = ""
    for(i=1;i<=NF;i++) {
        if(i in cols) {
            s = res ? OFS : ""
            res = res "" s "" $i
        }
    }
    if (res) {
        print res
    }
}

Call it like this:

find ... -exec awk -v header="foo,bar,test" -f print_cols.awk {} +
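A quick single-file test of print_col.awk with made-up data (sample.csv is a hypothetical name; -F, matches the comma-separated files in the question):

printf 'id,thrudate,amount\n1,2017-03-31,10\n2,2017-06-30,20\n' > sample.csv
awk -F, -v header="thrudate" -f print_col.awk sample.csv

This should print the two thrudate values, 2017-03-31 and 2017-06-30, one per line.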
Merging the rows in a file using awk
Can somebody explain the meaning of the below script please?

awk -F "," 's != $1 || NR ==1{s=$1;if(p){print p};p=$0;next} {sub($1,"",$0);p=p""$0;} END{print p}' file

The file has the following data:

2,"URL","website","aaa.com"
2,"abc","text","some text"
2,"Password","password","12345678"
3,"URL","website","10.0.10.75"
3,"uname","text","some text"
3,"Password","password","password"
3,"ulang","text","some text"
4,"URL","website","192.168.2.110"
4,"Login","login","admin"
4,"Password","password","blah-blah"

and the output is:

2,"URL","website","aaa.com","abc","text","some text",Password","password","12345678"
3,"URL","website","10.0.10.75","uname","text","some text""Password","password","password","ulang","text","some text"
awk programs have the structure pattern {action}. For your script, let's analyze the elements.

The first pattern:

s != $1 || NR == 1    # if the variable s is not equal to the first field
                      # or we're operating on the first row

The first action:

s = $1         # assign variable s to the first field
if (p) {       # if p is not empty, print it
    print p
}
p = $0         # assign the current line to the variable p
next           # move to the next record, skip the rest

The next pattern is missing, so its action applies to all rows:

sub($1, "", $0)    # same as $1="", will clear the first field
p = p "" $0        # concat the new line onto p

The last pattern is the special reserved word END, applied only once all rows are consumed (its counterpart BEGIN is applied before the file is opened):

END {
    print p    # print the value of p
}
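For readability, here is the same one-liner laid out over several lines (behaviour unchanged):

awk -F "," '
    s != $1 || NR == 1 {     # new key, or the very first line
        s = $1               # remember the key
        if (p) { print p }   # flush the previous merged line
        p = $0               # start a new merged line
        next
    }
    {                        # same key as the previous line
        sub($1, "", $0)      # drop the key from this line
        p = p "" $0          # append the rest to the merged line
    }
    END { print p }          # flush the last merged line
' file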
awk count and sum based on slab:
I would like to extract all the lines from the first file (gzipped, i.e. Input.csv.gz): if the first file's 4th field falls within a range given by the second file's (Slab.csv) first field (StartRange) and second field (EndRange), then populate, slab-wise, the count of rows and the sum of the 4th and 5th fields of the first file.

Input.csv.gz (decompressed):

Desc,Date,Zone,Duration,Calls
AB,01-06-2014,XYZ,450,3
AB,01-06-2014,XYZ,642,3
AB,01-06-2014,XYZ,0,0
AB,01-06-2014,XYZ,205,3
AB,01-06-2014,XYZ,98,1
AB,01-06-2014,XYZ,455,1
AB,01-06-2014,XYZ,120,1
AB,01-06-2014,XYZ,0,0
AB,01-06-2014,XYZ,193,1
AB,01-06-2014,XYZ,0,0
AB,01-06-2014,XYZ,161,2

Slab.csv:

StartRange,EndRange
0,0
1,10
11,100
101,200
201,300
301,400
401,500
501,10000

Expected output:

StartRange,EndRange,Count,Sum-4,Sum-5
0,0,3,0,0
1,10,NotFound,NotFound,NotFound
11,100,1,98,1
101,200,3,474,4
201,300,1,205,3
301,400,NotFound,NotFound,NotFound
401,500,2,905,4
501,10000,1,642,3

I am using the two commands below to get the above output, except for the "NotFound" cases:

awk -F, 'NR==FNR{s[NR]=$1;e[NR]=$2;c[NR]=$0;n++;next} {for(i=1;i<=n;i++) if($4>=s[i]&&$4<=e[i]) {print $0,","c[i];break}}' Slab.csv <(gzip -dc Input.csv.gz) >Op_step1.csv

cat Op_step1.csv | awk -F, '{key=$6","$7;++a[key];b[key]=b[key]+$4;c[key]=c[key]+$5} END{for(i in a)print i","a[i]","b[i]","c[i]}' >Op_step2.csv

Op_step2.csv:

101,200,3,474,4
501,10000,1,642,3
0,0,3,0,0
401,500,2,905,4
11,100,1,98,1
201,300,1,205,3

Any suggestions to make it a one-liner command that achieves the expected output? I don't have perl or python access.
Here is another option using perl, which takes advantage of multi-dimensional arrays and hashes:

perl -F, -lane'
    BEGIN {
        $x = pop;
        ## Create array of arrays from start and end ranges
        ## @range = ( [0,0] , [1,10] ... )
        (undef, @range) = map { chomp; [split /,/] } <>;
        @ARGV = $x;
    }
    ## Skip the first line
    next if $. == 1;
    ## Create hash of hash
    ## $line = '[0,0]' => { "count" => counts , "sum4" => sum_of_col4 , "sum5" => sum_of_col5 }
    for (@range) {
        if ($F[3] >= $_->[0] && $F[3] <= $_->[1]) {
            $line{"@$_"}{"count"}++;
            $line{"@$_"}{"sum4"} += $F[3];
            $line{"@$_"}{"sum5"} += $F[4];
        }
    }
    }{
    print "StartRange,EndRange,Count,Sum-4,Sum-5";
    print join ",", @$_,
        $line{"@$_"}{"count"} // "NotFound",
        $line{"@$_"}{"sum4"}  // "NotFound",
        $line{"@$_"}{"sum5"}  // "NotFound"
        for @range
' slab input

StartRange,EndRange,Count,Sum-4,Sum-5
0,0,3,0,0
1,10,NotFound,NotFound,NotFound
11,100,1,98,1
101,200,3,474,4
201,300,1,205,3
301,400,NotFound,NotFound,NotFound
401,500,2,905,4
501,10000,1,642,3
Here is one way using awk and sort:

awk '
BEGIN {
    FS = OFS = SUBSEP = ","
    print "StartRange,EndRange,Count,Sum-4,Sum-5"
}
FNR == 1 { next }
NR == FNR { ranges[$1,$2]++; next }
{
    for (range in ranges) {
        split(range, tmp, SUBSEP)
        if ($4 >= tmp[1] && $4 <= tmp[2]) {
            count[range]++; sum4[range] += $4; sum5[range] += $5
            next
        }
    }
}
END {
    for (range in ranges)
        print range,
              (count[range] ? count[range] : "NotFound"),
              (sum4[range]  ? sum4[range]  : "NotFound"),
              (sum5[range]  ? sum5[range]  : "NotFound") | "sort -t, -nk1,2"
}' slab input

StartRange,EndRange,Count,Sum-4,Sum-5
0,0,3,NotFound,NotFound
1,10,NotFound,NotFound,NotFound
11,100,1,98,1
101,200,3,474,4
201,300,1,205,3
301,400,NotFound,NotFound,NotFound
401,500,2,905,4
501,10000,1,642,3

Set the Input and Output Field Separators and SUBSEP to ",". Print the header line. Skip the first line of each file. Load the entire slab file into an array called ranges. For every range in the ranges array, split the key to get the start and end of the range; if the 4th column is in the range, increment the count array and add the values to the sum4 and sum5 arrays accordingly. In the END block, iterate through the ranges and print them, piping the output to sort to get it in order.
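One small observation on the output above: the 0,0 slab prints NotFound for its sums because those sums are legitimately zero and the ternary treats 0 as false. A minimal sketch of a possible tweak (only the END block changes), keying the NotFound test on whether the slab matched at all:

END {
    for (range in ranges) {
        # report NotFound only when no row fell into the slab;
        # a matched slab whose sums are 0 (the 0,0 range) keeps its zeros
        print range,
              ((range in count) ? count[range] : "NotFound"),
              ((range in count) ? sum4[range]  : "NotFound"),
              ((range in count) ? sum5[range]  : "NotFound") | "sort -t, -nk1,2"
    }
}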
Printing all lines that have/are duplicates of first field in Unix shell script
I have a list of semicolon-delimited data:

TR=P561;dir=o;day=sa;TI=16:30;stn=south station;Line=worcester

I need to take this file and print out only the lines with TR values that occur more than once. I would like the first occurrence and all duplicates listed. Thanks
If you are willing to make 2 passes on the file:

awk '!/TR=/ { next }                 # Ignore lines that do not set TR
     { t = $0; sub( ".*TR=", "", t ); sub( ";.*", "", t ) }    # Get the TR value
     FNR == NR { a[t] += 1 }         # Count the number of times this value of TR is seen
     FNR != NR && a[t] > 1           # Print those lines whose TR value is seen more than once
' input-file input-file

This uses a common awk idiom of comparing FNR with NR to see which pass we are on. By passing the input file as an argument twice, it becomes a way to run one command on the first pass and a different command on the second.
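A small self-contained test, with made-up sample data (sample.txt is a hypothetical name) in which TR=P561 occurs twice and TR=P562 once:

cat > sample.txt <<'EOF'
TR=P561;dir=o;day=sa;TI=16:30;stn=south station;Line=worcester
TR=P562;dir=i;day=su;TI=09:15;stn=north station;Line=fitchburg
TR=P561;dir=i;day=sa;TI=18:00;stn=south station;Line=worcester
EOF
awk '!/TR=/ { next }
     { t = $0; sub( ".*TR=", "", t ); sub( ";.*", "", t ) }
     FNR == NR { a[t] += 1 }
     FNR != NR && a[t] > 1
' sample.txt sample.txt

Only the two TR=P561 lines should be printed, in their original order.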