Update operator yields empty object in jq

I cannot understand the behaviour of the update operator of jq (version 1.6) shown in the following examples.
Why does example 1 return an updated object, while examples 2 and 3 return an empty object or a wrong result?
The only difference between the examples is where the function that converts a string into a number is called.
#!/bin/bash
#
# strange behaviour of jq
# example 1 - works as expected
jq -n '
def numberify($x): $x | tonumber? // 0;
"1" as $stringValue
| numberify($stringValue) as $intValue
# | { } # version 1: a.b does not exist yet
| { a: { b: 1 } } # version 2: a.b exists already
| .["a"] |= { b: (.b + $intValue) }
'
# result example 1, version 1 - expected
# {
# "a": {
# "b": 1
# }
# }
# result example 1, version 2 - expected
# {
# "a": {
# "b": 2
# }
# }
# example 2 - erroneous result
jq -n '
def numberify($x): $x | tonumber? // 0;
"1" as $stringValue
# | { } # version 1: a.b does not exist yet
| { a: { b: 1 } } # version 2: a.b exists already
| .["a"] |= { b: (.b + numberify($stringValue)) }
'
# result example 2, version 1 - unexpected
# {}
# result example 2, version 2 - unexpected
# {}
# example 3 - erroneous result
jq -n '
def numberify($x): $x | try tonumber catch 0;
"1" as $stringValue
# | { } # version 1: a.b does not exist yet
| { a: { b: 1 } } # version 2: a.b exists already
| .["a"] |= { b: (.b + numberify($stringValue)) }
'
# result example 3, version 1 - unexpected
# {
# "a": {
# "b": 0
# }
# }
# result example 3, version 2 - unexpected
# {
# "a": {
# "b": 1
# }
# }
@oguzismail That's a good idea to use '+=' instead of '|='.
I hadn't thought of it before.
Currently, my code with the workaround for the bug looks like this:
jq -n '
def numberify($x): $x | tonumber? // 0;
"1" as $sumReqSize
| "10" as $sumResSize
| { statistics: { count: 1, sumReqSize: 2, sumResSize: 20 } }
| [numberify($sumReqSize), numberify($sumResSize)] as $sizes # workaround for bug
| .statistics |= {
count: (.count + 1),
sumReqSize: (.sumReqSize + $sizes[0]),
sumResSize: (.sumResSize + $sizes[1])
}
'
Following your suggestion, the filter becomes more concise and doesn't need the ugly workaround:
def numberify($x): $x | tonumber? // 0;
"1" as $sumReqSize
| "10" as $sumResSize
| { statistics: { count: 1, sumReqSize: 2, sumResSize: 20 } }
| .statistics.count += 1
| .statistics.sumReqSize += numberify($sumReqSize)
| .statistics.sumResSize += numberify($sumResSize)
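For reference, here is a runnable version of that filter, wrapped in jq -n like the earlier examples (a sketch; same sample data as above):

```shell
# The += variant of the filter, runnable as-is.
result=$(jq -nc '
def numberify($x): $x | tonumber? // 0;
"1" as $sumReqSize
| "10" as $sumResSize
| { statistics: { count: 1, sumReqSize: 2, sumResSize: 20 } }
| .statistics.count += 1
| .statistics.sumReqSize += numberify($sumReqSize)
| .statistics.sumResSize += numberify($sumResSize)
')
echo "$result"
```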

This is a bug in jq 1.6. One option would be to use an earlier version of jq (e.g. jq 1.5).
Another would be to avoid |= by using = instead, along the lines of:
.a = (.a | ...)
or if the RHS does not actually depend on the LHS (as in your original examples), simply replacing |= by =.
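Applied to example 2, that workaround might look like this (a sketch; with `=`, the right-hand side is evaluated against the whole input once, so `numberify` runs outside the problematic `|=` machinery):

```shell
# Example 2 rewritten with ".a = (.a | ...)" instead of ".a |= ...".
result=$(jq -nc '
def numberify($x): $x | tonumber? // 0;
"1" as $stringValue
| { a: { b: 1 } }
| .a = (.a | { b: (.b + numberify($stringValue)) })
')
echo "$result"   # prints: {"a":{"b":2}}
```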

This is a bug in jq 1.6. In this case you can use try-catch instead.
def numberify($x): $x | try tonumber catch 0;
But I don't know if there is a generic way to work around this issue.

How to use awk to pull fields and make them variables to do data calculations

I am using awk to compute Mean Fractional Bias from a data file. How can I make the data points variables to feed into my equation?
Input.....
col1 col2
row #1 Yavg: 14.87954
row #2 Xavg: 20.83804
row #3 Ystd: 7.886613
row #4 Xstd: 8.628519
I am looking to feed these values into this equation:
MFB = .5 * (Yavg-Xavg)/[(Yavg+Xavg)/2]
output....
col1 col2
row #1 Yavg: 14.87954
row #2 Xavg: 20.83804
row #3 Ystd: 7.886613
row #4 Xstd: 8.628519
row #5 MFB: (computed value)
Currently I am trying to use the following code to do this, but it is not working:
var= 'linear_reg-County119-O3-2004-Winter2013-2018XYstats.out.out'
val1=$(awk -F, OFS=":" "NR==2{print $2; exit}" <$var)
val2=$(awk -F, OFS=":" "NR==1{print $2; exit}" <$var)
#MFB = .5*((val2-val1)/((val2+val1)/2))
awk '{ print "MFB :" .5*((val2-val1)/((val2+val1)/2))}' >> linear_regCounty119-O3-2004-Winter2013-2018XYstats-wMFB.out
Try running: awk -f mfb.awk input.txt where
mfb.awk:
BEGIN { FS = OFS = ": " } # set the separators
{ v[$1] = $2; print } # store each line in an array named "v"
END {
MFB = 0.5 * (v["Yavg"] - v["Xavg"]) / ((v["Yavg"] + v["Xavg"]) / 2)
print "MFB", MFB
}
input.txt:
Yavg: 14.87954
Xavg: 20.83804
Ystd: 7.886613
Xstd: 8.628519
Output:
Yavg: 14.87954
Xavg: 20.83804
Ystd: 7.886613
Xstd: 8.628519
MFB: -0.166823
Alternatively, mfb.awk can be the following, resembling your original code:
BEGIN { FS = OFS = ": " }
{ print }
NR == 1 { Yavg = $2 } NR == 2 { Xavg = $2 }
END {
MFB = 0.5 * (Yavg - Xavg) / ((Yavg + Xavg) / 2)
print "MFB", MFB
}
Note that you don't usually toss variables back and forth between the shell and Awk (at least when you deal with a single input file).
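If you do need to push shell values into an Awk program, the usual mechanism is `-v` (a minimal sketch; the variable names `val1`/`val2` follow the question, and `x`/`y` are made up):

```shell
# Pass shell variables into awk with -v instead of splicing them into the
# program text (inside single quotes, awk would see val1/val2 as its own,
# empty, variables).
val1=20.83804   # Xavg
val2=14.87954   # Yavg
mfb=$(awk -v x="$val1" -v y="$val2" 'BEGIN {
    printf "MFB: %.6f\n", 0.5 * (y - x) / ((y + x) / 2)
}')
echo "$mfb"   # prints: MFB: -0.166823
```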

Data Type Validation in UNIX

I need to validate the file with respect to the data types. I have a file with below data,
data.csv
Col1 | Col2 | Col3 | Col4
100 | XYZ | 200 | 2020-07-11
200 | XYZ | 500 | 2020-07-10
300 | XYZ | 700 | 2020-07-09
I have another file having the configurations,
Config_file.txt
Columns = Col1|Col2|Col3|Col4
Data_type = numeric|string|numeric|date
Delimiter = |
I have to compare the configuration file and data file and return a result.
For example:
In the configuration file, the data type of Col1 is numeric. If any string value appears in Col1 of the data file, the script should return Datatype Mismatch Found in Col1. I have tried this with awk; for a single line item it is easy by defining the positions of the columns, but I am not sure how to loop over the entire file column by column and check the data.
I have also tried providing patterns to achieve this, but I am unable to validate the complete file. Any suggestion would be helpful.
awk -F "|" '$1 ~ "^[+-]?[0-9]+([.][0-9]+)?$" && $4 ~ "^[+-]?[0-9]+([.][0-9]+)?$" && length($5) == 10 {print}' data.csv
The goal is to compare the data file (data.csv) and Data_Type in config file(Config_file.txt) for each column and check if any column is having datatype mismatch.
For example, consider below data
Col1 | Col2 | Col3 | Col4
100 | XYZ | 200 | 2020-07-11
ABC | XYZ | 500 | 2020-07-10 -- This is incorrect data because Col1 is having string value `ABC`, in config file, the data type is numeric
300 | XYZ | 700 | 2020-07-09
300 | XYZ | 700 | 2020-07-09
300 | XYZ | XYZ | 2020-07-09 -- Incorrect Data
300 | 300 | 700 | 2020-07-09
300 | XYZ | 700 | XYX -- Incorrect Data
The data type provided in config table is as below,
Columns = Col1|Col2|Col3|Col4
Data_type = numeric|string|numeric|date
The script should echo the result as Data Type Mismatch Found in Col1
Here is a skeleton solution in GNU awk. In the absence of sample output, I improvised:
awk '
BEGIN {
FS=" *= *"
}
function numeric(p) { # testing for numeric
if(p==(p+0))
return 1
else return 0
}
function string(p) { # can't really fail the string test, right?
return 1
}
function date(p) {
gsub(/-/," ",p)
if(mktime(p " 0 0 0")>=0)
return 1
else return 0
}
NR==FNR{ # process config file
switch($1) {
case "Columns":
a["Columns"]=$NF;
break
case "Data_type":
a["Data_type"]=$NF;
break
case "Delimiter":
a["Delimiter"]=$NF;
}
if(a["Columns"] && a["Data_type"] && a["Delimiter"]) {
split(a["Columns"],c,a["Delimiter"])
split(a["Data_type"],d,a["Delimiter"])
for(i in c) { # b["Col1"]="numeric" etc.
b[c[i]]=d[i]
FS= a["Delimiter"]
}
}
next
}
FNR==1{ # processing headers of data file
for(i=1;i<=NF;i++) {
h[i]=$i # h[1]="Col1" etc.
}
}
{
for(i=1;i<=NF;i++) { # process all fields
f=b[h[i]] # using indirect function calls check
printf "%s%s",(@f($i)?$i:"FAIL"),(i==NF?ORS:FS) # the data
}
}' config <(tr -d \ <data) # deleting space from your data as "|"!=" | "
Sample output:
FAIL|Col2|FAIL|FAIL
100|XYZ|200|2020-07-11
200|XYZ|500|2020-07-10
300|XYZ|700|2020-07-09
FAIL|XYZ|FAIL|FAIL # duplicated previous record and malformed it
$ cat tst.awk
NR == FNR {
gsub(/^[[:space:]]+|[[:space:]]+$/,"")
tag = val = $0
sub(/[[:space:]]*=.*/,"",tag)
sub(/[^=]+=[[:space:]]*/,"",val)
cfg_tag2val[tag] = val
next
}
FNR == 1 {
FS = cfg_tag2val["Delimiter"]
$0 = $0
reqd_NF = split(cfg_tag2val["Columns"],reqd_names)
split(cfg_tag2val["Data_type"],reqd_types)
}
NF != reqd_NF {
printf "%s: Error: line %d NF (%d) != required NF (%d)\n", FILENAME, FNR, NF, reqd_NF | "cat>&2"
got_errors = 1
}
FNR == 1 {
for ( i=1; i<=NF; i++ ) {
reqd_name = reqd_names[i]
name = $i
gsub(/^[[:space:]]+|[[:space:]]+$/,"",name)
if ( name != reqd_name ) {
printf "%s: Error: line %d col %d name (%s) != required col name (%s)\n", FILENAME, FNR, i, name, reqd_name | "cat>&2"
got_errors = 1
}
}
}
FNR > 1 {
for ( i=1; i<=NF; i++ ) {
reqd_type = reqd_types[i]
if ( reqd_type != "string" ) {
value = $i
gsub(/^[[:space:]]+|[[:space:]]+$/,"",value)
type = val2type(value)
if ( type != reqd_type ) {
printf "%s: Error: line %d field %d (%s) type (%s) != required field type (%s)\n", FILENAME, FNR, i, value, type, reqd_type | "cat>&2"
got_errors = 1
}
}
}
}
END { exit got_errors }
function val2type(val, type) {
if ( val == val+0 ) { type = "numeric" }
else if ( val ~ /^[0-9]{4}(-[0-9]{2}){2}$/ ) { type = "date" }
else { type = "string" }
return type
}
$ awk -f tst.awk config.txt data.csv
data.csv: Error: line 3 field 1 (ABC) type (string) != required field type (numeric)
data.csv: Error: line 6 field 3 (XYZ) type (string) != required field type (numeric)
data.csv: Error: line 8 field 4 (XYX) type (string) != required field type (date)
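The classification trick in `val2type` can be exercised on its own (a sketch with made-up sample values; the date interval `{4}` is written out longhand for portability to older awks). `val == val+0` is true only when awk can treat the value as a number:

```shell
# Classify a few sample values using the same tests as val2type above.
types=$(awk 'BEGIN {
    n = split("100 ABC 2020-07-11", vals)
    for (i = 1; i <= n; i++) {
        v = vals[i]
        if (v == v+0) t = "numeric"                          # numeric compare succeeds
        else if (v ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/) t = "date"
        else t = "string"
        print v, t
    }
}')
echo "$types"
```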

Combining big data files with different columns into one big file

I have N tab-separated files. Each file has a header line saying the names of the columns. Some of the columns are common to all of the files, but some are unique.
I want to combine all of the files into one big file containing all of the relevant headers.
Example:
> cat file1.dat
a b c
5 7 2
3 9 1
> cat file2.dat
a b e f
2 9 8 3
2 8 3 3
1 0 3 2
> cat file3.dat
a c d g
1 1 5 2
> merge file*.dat
a b c d e f g
5 7 2 - - - -
3 9 1 - - - -
2 9 - - 8 3 -
2 8 - - 3 3 -
1 0 - - 3 2 -
1 - 1 5 - - 2
The - can be replaced by anything, for example NA.
Caveat: the files are so big that I cannot load all of them into memory simultaneously.
I had a solution in R using
write.table(do.call(plyr:::rbind.fill,
Map(function(filename)
read.table(filename, header=1, check.names=0),
filename=list.files('.'))),
'merged.dat', quote=FALSE, sep='\t', row.names=FALSE)
but this fails with a memory error when the data are too large.
What is the best way to accomplish this?
I am thinking the best route will be to first loop through all the files to collect the column names, then loop through the files again to put them into the right format, writing them to disk as they are processed. However, is there perhaps already some code available that does this?
From an algorithm point of view I would take the following steps:
Process the headers:
read all headers of all input files and extract all column names
sort the column names in the order you want
create a lookup table which returns the column-name when a field number is given (h[n] -> "name")
process the files: after the headers, you can reprocess the files
read the header of the file
create a lookup table which returns the field number when given a column name. An associative array is useful here: (a["name"] -> field_number)
process the remainder of the file
loop over all fields of the merged file
get the column name with h
check if the column name is in a, if not print -, if so print the field number corresponding with a.
This is easily done with GNU awk, making use of the extensions nextfile and asorti. The nextfile statement allows us to read only the header and move to the next file without processing the full file. Since we need to process each file twice (step 1 reading the header and step 2 reading the body), we ask awk to dynamically manipulate its argument list. Every time a file's header is processed, we add the file at the end of the argument list ARGV so it can be used again for step 2.
BEGIN { s="-" } # define symbol
BEGIN { f=ARGC-1 } # get total number of files
f { for (i=1;i<=NF;++i) h[$i] # read headers in associative array h[key]
ARGV[ARGC++] = FILENAME # add file at end of argument list
if (--f == 0) { # did we process all headers?
n=asorti(h) # sort header into h[idx] = key
for (i=1;i<=n;++i) # print header
printf "%s%s", h[i], (i==n?ORS:OFS)
}
nextfile # end of processing headers
}
# Start of processing the files
(FNR==1) { delete a; for(i=1;i<=NF;++i) a[$i]=i; next } # read header
{ for(i=1;i<=n;++i) printf "%s%s", (h[i] in a ? $(a[h[i]]) : s), (i==n?ORS:OFS) }
If you store the above in a file merge.awk you can use the command:
awk -f merge.awk f1 f2 f3 f4 ... fx
A similar approach, but with less hassle around f:
BEGIN { s="-" } # define symbol
BEGIN { # modify argument list from
c=ARGC; # from: arg1 arg2 ... argx
ARGV[ARGC++]="f=1" # to: arg1 arg2 ... argx f=1 arg1 arg2 ... argx
for(i=1;i<c;++i) ARGV[ARGC++]=ARGV[i]
}
!f { for (i=1;i<=NF;++i) h[$i] # read headers in associative array h[key]
nextfile
}
(f==1) && (FNR==1) { # process merged header
n=asorti(h) # sort header into h[idx] = key
for (i=1;i<=n;++i) # print header
printf "%s%s", h[i], (i==n?ORS:OFS)
f=2
}
# Start of processing the files
(FNR==1) { delete a; for(i=1;i<=NF;++i) a[$i]=i; next } # read header
{ for(i=1;i<=n;++i) printf "%s%s", (h[i] in a ? $(a[h[i]]) : s), (i==n?ORS:OFS) }
This method is slightly different, but allows the processing of files with different field separators as
awk -f merge.awk f1 FS="," f2 f3 FS="|" f4 ... fx
If your argument list becomes too long, you can use awk to create it for you :
BEGIN { s="-" } # define symbol
BEGIN { # read argument list from input file:
fname=(ARGC==1 ? "-" : ARGV[1])
ARGC=1 # from: filelist or /dev/stdin
while ((getline < fname) > 0) # to: arg1 arg2 ... argx
ARGV[ARGC++]=$0
}
BEGIN { # modify argument list from
c=ARGC; # from: arg1 arg2 ... argx
ARGV[ARGC++]="f=1" # to: arg1 arg2 ... argx f=1 arg1 arg2 ... argx
for(i=1;i<c;++i) ARGV[ARGC++]=ARGV[i]
}
!f { for (i=1;i<=NF;++i) h[$i] # read headers in associative array h[key]
nextfile
}
(f==1) && (FNR==1) { # process merged header
n=asorti(h) # sort header into h[idx] = key
for (i=1;i<=n;++i) # print header
printf "%s%s", h[i], (i==n?ORS:OFS)
f=2
}
# Start of processing the files
(FNR==1) { delete a; for(i=1;i<=NF;++i) a[$i]=i; next } # read header
{ for(i=1;i<=n;++i) printf "%s%s", (h[i] in a ? $(a[h[i]]) : s), (i==n?ORS:OFS) }
which can be run as:
$ awk -f merge.awk filelist
$ find . | awk -f merge.awk "-"
$ find . | awk -f merge.awk
or any similar command.
As you see, by adding only a tiny block of code, we were able to flexibly adjust the awk code to support our needs.
Miller (johnkerl/miller) is underused when dealing with huge files. It bundles features from all the useful file-processing tools out there. As the official documentation says:
Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON. You get to work with your data using named fields, without needing to count positional column indices.
For this particular case, it supports a verb unsparsify, which the documentation describes as follows:
Prints records with the union of field names over all input records.
For field names absent in a given record but present in others, fills in
a value. This verb retains all input before producing any output.
You just need to run the following, reordering the columns back into the positions you desire:
mlr --tsvlite --opprint unsparsify then reorder -f a,b,c,d,e,f file{1..3}.dat
which produces the output in one-shot as
a b c d e f g
5 7 2 - - - -
3 9 1 - - - -
2 9 - - 8 3 -
2 8 - - 3 3 -
1 0 - - 3 2 -
1 - 1 5 - - 2
You can even customize the character used to fill the empty fields; the default is -. For a custom character, use unsparsify --fill-with '#'.
A brief explanation of the flags used:
--tsvlite treats the input stream as tab-delimited content
--opprint pretty-prints the tabular data
unsparsify, as explained above, takes the union of the field names over the whole input stream
The reorder verb is needed because the column headers appear in a different order in each file. To define the output order explicitly, use the -f option with the column headers you want.
Installation is straightforward: Miller is written in portable, modern C with zero runtime dependencies, and packages are available for all the major package managers (Homebrew, MacPorts, apt-get, apt, and yum).
Given your updated information in comments about having about 10^5 input files (and so exceeding the shell's maximum number of arguments for a non-builtin command) and wanting the output columns in the order they're first seen rather than alphabetically sorted, the following will work using any awk and any find:
$ cat tst.sh
#!/usr/bin/env bash
find . -maxdepth 1 -type f -name "$1" |
awk '
NR==FNR {
fileName = $0
ARGV[ARGC++] = fileName
if ( (getline fldList < fileName) > 0 ) {
if ( !seenList[fldList]++ ) {
numFlds = split(fldList,fldArr)
for (inFldNr=1; inFldNr<=numFlds; inFldNr++) {
fldName = fldArr[inFldNr]
if ( !seenName[fldName]++ ) {
hdr = (numOutFlds++ ? hdr OFS : "") fldName
outNr2name[numOutFlds] = fldName
}
}
}
}
close(fileName)
next
}
FNR == 1 {
if ( !doneHdr++ ) {
print hdr
}
delete name2inNr
for (inFldNr=1; inFldNr<=NF; inFldNr++) {
fldName = $inFldNr
name2inNr[fldName] = inFldNr
}
next
}
{
for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
fldName = outNr2name[outFldNr]
inFldNr = name2inNr[fldName]
fldValue = (inFldNr ? $inFldNr : "-")
printf "%s%s", fldValue, (outFldNr<numOutFlds ? OFS : ORS)
}
}
' -
$ ./tst.sh 'file*.dat'
a b c e f d g
5 7 2 - - - -
3 9 1 - - - -
2 9 - 8 3 - -
2 8 - 3 3 - -
1 0 - 3 2 - -
1 - 1 - - 5 2
Note that input to the script is now the globbing pattern you want find to use to find the files, not the list of files.
Original answer:
If you don't mind a combined shell+awk script then this will work using any awk:
$ cat tst.sh
#!/usr/bin/env bash
awk -v hdrs="$(head -1 -q "$@" | tr ' ' '\n' | sort -u)" '
BEGIN {
numOutFlds = split(hdrs,outNr2name)
for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
fldName = outNr2name[outFldNr]
printf "%s%s", fldName, (outFldNr<numOutFlds ? OFS : ORS)
}
}
FNR == 1 {
delete name2inNr
for (inFldNr=1; inFldNr<=NF; inFldNr++) {
fldName = $inFldNr
name2inNr[fldName] = inFldNr
}
next
}
{
for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
fldName = outNr2name[outFldNr]
inFldNr = name2inNr[fldName]
fldValue = (inFldNr ? $inFldNr : "-")
printf "%s%s", fldValue, (outFldNr<numOutFlds ? OFS : ORS)
}
}
' "$@"
$ ./tst.sh file{1..3}.dat
a b c d e f g
5 7 2 - - - -
3 9 1 - - - -
2 9 - - 8 3 -
2 8 - - 3 3 -
1 0 - - 3 2 -
1 - 1 5 - - 2
Otherwise, this is all awk, using GNU awk for arrays of arrays, sorted_in, and ARGIND:
$ cat tst.awk
BEGIN {
for (inFileNr=1; inFileNr<ARGC; inFileNr++) {
inFileName = ARGV[inFileNr]
if ( (getline < inFileName) > 0 ) {
for (inFldNr=1; inFldNr<=NF; inFldNr++) {
fldName = $inFldNr
name2inNr[fldName][inFileNr] = inFldNr
}
}
close(inFileName)
}
PROCINFO["sorted_in"] = "#ind_str_asc"
for (fldName in name2inNr) {
printf "%s%s", (numOutFlds++ ? OFS : ""), fldName
for (inFileNr in name2inNr[fldName]) {
outNr2inNr[numOutFlds][inFileNr] = name2inNr[fldName][inFileNr]
}
}
print ""
}
FNR > 1 {
for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
inFldNr = outNr2inNr[outFldNr][ARGIND]
fldValue = (inFldNr ? $inFldNr : "-")
printf "%s%s", fldValue, (outFldNr<numOutFlds ? OFS : ORS)
}
}
$ awk -f tst.awk file{1..3}.dat
a b c d e f g
5 7 2 - - - -
3 9 1 - - - -
2 9 - - 8 3 -
2 8 - - 3 3 -
1 0 - - 3 2 -
1 - 1 5 - - 2
For efficiency the 2nd script above does all the heavy lifting in the BEGIN section so there's as little work left to do as possible in the main body of the script that's evaluated once per input line. In the BEGIN section it creates an associative array (outNr2inNr[]) that maps the outgoing field numbers (alphabetically sorted list of all field names across all input files) to the incoming field numbers so all that's left to do in the body is print the fields in that order.
Here is the solution I (the OP) have come up with so far. It may have some advantage over other approaches in that it processes the files in parallel.
R code:
library(parallel)
library(parallelMap)
# specify the directory containing the files we want to merge
args <- commandArgs(TRUE)
directory <- if (length(args)>0) args[1] else 'sg_grid'
#output_fname <- paste0(directory, '.dat')
# make a tmp directory that will store all the files
tmp_dir <- paste0(directory, '_tmp')
dir.create(tmp_dir)
# list the .dat files we want to merge
filenames <- list.files(directory)
filenames <- filenames[grep('.dat', filenames)]
# a function to read the column names
get_col_names <- function(filename)
colnames(read.table(file.path(directory, filename),
header=T, check.names=0, nrow=1))
# grab all the headers of all the files and merge them
col_names <- get_col_names(filenames[1])
for (simulation in filenames) {
col_names <- union(col_names, get_col_names(simulation))
}
# put those column names into a blank data frame
name_DF <- data.frame(matrix(ncol = length(col_names), nrow = 0))
colnames(name_DF) <- col_names
# save that as the header file
write.table(name_DF, file.path(tmp_dir, '0.dat'),
col.names=TRUE, row.names=F, quote=F, sep='\t')
# now read in every file and merge with the blank data frame
# it will have NAs in any columns it didn't have before
# save it to the tmp directory to be merged later
parallelStartMulticore(max(1,
min(as.numeric(Sys.getenv('OMP_NUM_THREADS')), 62)))
success <- parallelMap(function(filename) {
print(filename)
DF <- read.table(file.path(directory, filename),
header=1, check.names=0)
DF <- plyr:::rbind.fill(name_DF, DF)
write.table(DF, file.path(tmp_dir, filename),
quote=F, col.names=F, row.names=F, sep='\t')
}, filename=filenames)
# and we're done
print(all(unlist(success)))
This creates temporary versions of all the files, which each now have all the headers, which we can then cat together into the result:
ls -1 sg_grid_tmp/* | while read fn ; do cat "$fn" >> sg_grid.dat; done
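The final concatenation can also be done with a single glob, which avoids parsing `ls` output and works because `0.dat` (the header file) sorts before the data files. A self-contained sketch with made-up demo files:

```shell
# Sketch of the concatenation step with throwaway demo files.
mkdir -p demo_tmp
printf 'a\tb\tc\n' > demo_tmp/0.dat   # header-only file, sorts first
printf '5\t7\t2\n' > demo_tmp/1.dat
printf '3\t9\t1\n' > demo_tmp/2.dat
cat demo_tmp/* > merged.dat           # glob expands in sorted order
cat merged.dat
```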

How to concatenate values based on another field in Unix

I have a detail.txt file, which contains:
cat >detail.txt
Student ID,Student Name, Percentage
101,A,75
102,B,77
103,C,34
104,D,42
105,E,75
106,F,42
107,G,77
1. I want to print concatenated output based on Percentage (group by Percentage), printing the student names on a single line separated by commas (,).
Expected Output:
75-A,E
77-B,G
42-D,F
34-C
For the above question, I understood how to achieve this for 75 or 77 or 42, but I did not get how to write code that groups on the third field (Percentage).
I tried the code below:
awk -F"," '{OFS=",";if($3=="75") print $2}' detail.txt
2. I want to get output based on the grading system given below.
marks < 45: THIRD
marks >= 45 and marks < 60: SECOND
marks >= 60 and marks <= 75: FIRST
marks > 75: DIST
Expected Output:
DIST:B,G
FIRST:A,E
THIRD:C,D,F
Please help me to get the expected output. Thank you.
awk solution:
awk -F, 'NR>1{
if ($3<45) k="THIRD"; else if ($3>=45 && $3<60) k="SECOND";
else if ($3>=60 && $3<=75) k="FIRST"; else k="DIST";
a[k] = a[k]? a[k]","$2 : $2;
}END{ for(i in a) print i":"a[i] }' detail.txt
k - a variable assigned the "grading system" name according to the if (...) <exp>; else if (...) <exp> ... chain
a[k] - array a indexed by the determined "grading system" name k
a[k] = a[k]? a[k]","$2 : $2 - all "student names" (the 2nd field, $2) are accumulated/grouped under the matching "grading system"
The output:
DIST:B,G
THIRD:C,D,F
FIRST:A,E
With GNU awk for true multi-dimensional arrays:
$ cat tst.awk
BEGIN { FS=OFS="," }
NR>1 {
stud = $2
pct = $3
if ( pct < 45 ) { band = "THIRD" }
else if ( pct < 60 ) { band = "SECOND" }
else if ( pct <= 75 ) { band = "FIRST" }
else { band = "DIST" }
pcts[pct][stud]
bands[band][stud]
}
END {
for (pct in pcts) {
out = ""
for (stud in pcts[pct]) {
out = (out == "" ? pct "-" : out OFS) stud
}
print out
}
print "----"
for (band in bands) {
out = ""
for (stud in bands[band]) {
out = (out == "" ? band ":" : out OFS) stud
}
print out
}
}
$ gawk -f tst.awk file
34-C
42-D,F
75-A,E
77-B,G
----
DIST:B,G
THIRD:C,D,F
FIRST:A,E
For your first question, the following awk one-liner should do:
awk -F, 'NR>1 {a[$3]=a[$3] (a[$3] ? "," : "") $2} END {for(i in a) printf "%s-%s\n", i, a[i]}' detail.txt
The second question can work almost the same way, storing your mark divisions in an array, then stepping through that array to determine the subscript for a new array:
BEGIN { FS=","; m[0]="THIRD"; m[45]="SECOND"; m[60]="FIRST"; m[75]="DIST" } NR>1 { for (i=0;i<=100;i++) if ((i in m) && $3 > i) mdiv=m[i]; marks[mdiv]=marks[mdiv] (marks[mdiv] ? "," : "") $2 } END { for(i in marks) printf "%s:%s\n", i, marks[i] }
But this is unreadable. When you need this level of complexity, you're past the point of a one-liner. :)
So .. combining the two and breaking them out for easier reading (and commenting) we get the following:
BEGIN {
FS=","
m[0]="THIRD"
m[45]="SECOND"
m[60]="FIRST"
m[75]="DIST"
}
NR>1 { # skip the header line
a[$3]=a[$3] (a[$3] ? "," : "") $2 # Build an array with percentage as the index
for (i=0;i<=100;i++) # Walk through the possible marks
if ((i in m) && $3 > i) mdiv=m[i] # selecting the correct divider on the way
marks[mdiv]=marks[mdiv] (marks[mdiv] ? "," : "") $2
# then build another array with divider
# as the index
}
END { # Once we've processed all the input,
for(i in a) # step through the array,
printf "%s-%s\n", i, a[i] # printing the results.
print "----"
for(i in marks) # step through the array,
printf "%s:%s\n", i, marks[i] # printing the results.
}
You may be wondering why we use for (i=0;i<=100;i++) instead of simply for (i in m). This is because awk does not guarantee the order of array elements, and when stepping through the m array it's important that we see the keys in increasing order.
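A quick way to convince yourself the explicit scan works (a sketch using the same m[] table as above, with a made-up score):

```shell
# Scan the thresholds in increasing numeric order; the last threshold the
# score exceeds wins. 77 > 75, so the result is DIST.
band=$(awk 'BEGIN {
    m[0] = "THIRD"; m[45] = "SECOND"; m[60] = "FIRST"; m[75] = "DIST"
    score = 77
    for (i = 0; i <= 100; i++)
        if ((i in m) && score > i) mdiv = m[i]
    print mdiv
}')
echo "$band"   # prints: DIST
```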

How to print a line with a pattern which is nearest to another line with a specific pattern?

I want to find a pattern which is nearest to a specific pattern. For example, I want to print the "bbb=" line which is under "## yyyy:" (the closest bbb= line to yyyy); here that is line 8. The line numbers and the order might change, so it is better not to rely on line numbers.
root# vi a
"a" 15 lines
1 ## xxxx:
2 aaa=3
3 bbb=4
4 ccc=2
5 ddd=1
6 ## yyyy:
7 aaa=1
8 bbb=0
9 ccc=3
10 ddd=3
11 ## zzzz:
12 aaa=1
13 bbb=1
14 ccc=1
15 ddd=1
Do you have an idea using awk or grep for this purpose?
Something like this?
awk '/^## yyyy:/ { i = 1 }; i && /^bbb=/ { print; exit }'
Or should a line above also be able to match? In that case, perhaps:
awk '/^bbb=/ && !i { p=NR; s=$0 }; /^bbb=/ && i { print (NR-i < i-p) ? $0 : s; exit }; /^## yyyy:/ { i=NR }'
Taking into account that there might not be a previous or next entry:
/^bbb=/ && !i { p1 = NR; s1 = $0 }
/^bbb=/ && i { p2 = NR; s2 = $0; exit }
/^## yyyy:/ { i = NR }
END {
if (p1 == 0)
print s2
else if (p2 == 0)
print s1
else
print (i - p1 < p2 - i ? s1 : s2)
}
Quick and dirty using grep:
grep -A 100 '## yyyy' filename | grep 'bbb='
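A variant without the arbitrary -A 100 window: print from the section header to the end of the file, then take the first matching line (a sketch that builds a miniature version of the sample file first):

```shell
# Build a small sample file, then pull the first bbb= line at or after
# the "## yyyy:" header.
cat > sample.txt <<'EOF'
## xxxx:
bbb=4
## yyyy:
aaa=1
bbb=0
## zzzz:
bbb=1
EOF
nearest=$(sed -n '/^## yyyy:/,$p' sample.txt | grep '^bbb=' | head -n 1)
echo "$nearest"   # prints: bbb=0
```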