I need to validate a file with respect to expected data types. I have a file with the following data:
data.csv
Col1 | Col2 | Col3 | Col4
100 | XYZ | 200 | 2020-07-11
200 | XYZ | 500 | 2020-07-10
300 | XYZ | 700 | 2020-07-09
I have another file containing the configuration:
Config_file.txt
Columns = Col1|Col2|Col3|Col4
Data_type = numeric|string|numeric|date
Delimiter = |
I have to compare the configuration file with the data file and return a result.
For example:
In the configuration file, the data type of Col1 is numeric. If any string value appears in Col1 of the data file, the script should report: Datatype Mismatch Found in Col1. I have tried awk; for a single line it is easy by hard-coding the column positions, but I am not sure how to loop over the entire file column by column and check the data.
I have also tried supplying patterns to achieve this, but I am unable to validate the complete file. Any suggestion would be helpful.
awk -F "|" '$1 ~ "^[+-]?[0-9]+([.][0-9]+)?$" && $4 ~ "^[+-]?[0-9]+([.][0-9]+)?$" && length($5) == 10 {print}' data.csv
The goal is to compare the data file (data.csv) against the Data_type row of the config file (Config_file.txt) for each column and check whether any column has a data type mismatch.
For example, consider below data
Col1 | Col2 | Col3 | Col4
100 | XYZ | 200 | 2020-07-11
ABC | XYZ | 500 | 2020-07-10 -- Incorrect data: Col1 holds the string value `ABC`, but the config file declares Col1 numeric
300 | XYZ | 700 | 2020-07-09
300 | XYZ | 700 | 2020-07-09
300 | XYZ | XYZ | 2020-07-09 -- Incorrect Data
300 | 300 | 700 | 2020-07-09
300 | XYZ | 700 | XYX -- Incorrect Data
The data types provided in the config file are as below:
Columns = Col1|Col2|Col3|Col4
Data_type = numeric|string|numeric|date
The script should echo the result as Data Type Mismatch Found in Col1
Here is a skeleton solution in GNU awk. In the absence of sample output, I improvised:
awk '
BEGIN {
FS=" *= *"
}
function numeric(p) { # testing for numeric
if(p==(p+0))
return 1
else return 0
}
function string(p) { # a value can't really fail the string test, right?
return 1
}
function date(p) {
gsub(/-/," ",p)
if(mktime(p " 0 0 0")>=0)
return 1
else return 0
}
NR==FNR{ # process config file
switch($1) {
case "Columns":
a["Columns"]=$NF;
break
case "Data_type":
a["Data_type"]=$NF;
break
case "Delimiter":
a["Delimiter"]=$NF;
}
if(a["Columns"] && a["Data_type"] && a["Delimiter"]) {
split(a["Columns"],c,a["Delimiter"])
split(a["Data_type"],d,a["Delimiter"])
for(i in c) { # b["Col1"]="string" etc.
b[c[i]]=d[i]
FS= a["Delimiter"]
}
}
next
}
FNR==1{ # processing headers of data file
for(i=1;i<=NF;i++) {
h[i]=$i # h[1]="Col1" etc.
}
}
{
for(i=1;i<=NF;i++) { # process all fields
f=b[h[i]] # pick the type-test function by column name
printf "%s%s",(@f($i)?$i:"FAIL"),(i==NF?ORS:FS) # indirect call @f tests the data
}
}' config <(tr -d ' ' <data) # strip the spaces from your data, since "|" != " | "
Sample output:
FAIL|Col2|FAIL|FAIL
100|XYZ|200|2020-07-11
200|XYZ|500|2020-07-10
300|XYZ|700|2020-07-09
FAIL|XYZ|FAIL|FAIL # duplicated previous record and malformed it
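If you want the exact message from the question instead of FAIL markers, the data-testing block can be adapted along these lines (a sketch under the same setup; h[i] holds the column name read from the data file's header and b[h[i]] the expected type):

{
  if(FNR>1)                # skip the header record
    for(i=1;i<=NF;i++) {
      f=b[h[i]]            # the type name doubles as the test function name
      if(!@f($i))
        printf "Data Type Mismatch Found in %s (record %d)\n", h[i], FNR
    }
}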
$ cat tst.awk
NR == FNR {
gsub(/^[[:space:]]+|[[:space:]]+$/,"")
tag = val = $0
sub(/[[:space:]]*=.*/,"",tag)
sub(/[^=]+=[[:space:]]*/,"",val)
cfg_tag2val[tag] = val
next
}
FNR == 1 {
FS = cfg_tag2val["Delimiter"]
$0 = $0 # re-split the current record using the new FS
reqd_NF = split(cfg_tag2val["Columns"],reqd_names)
split(cfg_tag2val["Data_type"],reqd_types)
}
NF != reqd_NF {
printf "%s: Error: line %d NF (%d) != required NF (%d)\n", FILENAME, FNR, NF, reqd_NF | "cat>&2"
got_errors = 1
}
FNR == 1 {
for ( i=1; i<=NF; i++ ) {
reqd_name = reqd_names[i]
name = $i
gsub(/^[[:space:]]+|[[:space:]]+$/,"",name)
if ( name != reqd_name ) {
printf "%s: Error: line %d col %d name (%s) != required col name (%s)\n", FILENAME, FNR, i, name, reqd_name | "cat>&2"
got_errors = 1
}
}
}
FNR > 1 {
for ( i=1; i<=NF; i++ ) {
reqd_type = reqd_types[i]
if ( reqd_type != "string" ) {
value = $i
gsub(/^[[:space:]]+|[[:space:]]+$/,"",value)
type = val2type(value)
if ( type != reqd_type ) {
printf "%s: Error: line %d field %d (%s) type (%s) != required field type (%s)\n", FILENAME, FNR, i, value, type, reqd_type | "cat>&2"
got_errors = 1
}
}
}
}
END { exit got_errors }
function val2type(val, type) {
if ( val == val+0 ) { type = "numeric" }
else if ( val ~ /^[0-9]{4}(-[0-9]{2}){2}$/ ) { type = "date" }
else { type = "string" }
return type
}
$ awk -f tst.awk config.txt data.csv
data.csv: Error: line 3 field 1 (ABC) type (string) != required field type (numeric)
data.csv: Error: line 6 field 3 (XYZ) type (string) != required field type (numeric)
data.csv: Error: line 8 field 4 (XYX) type (string) != required field type (date)
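Since the script ends with END { exit got_errors }, its exit status can drive shell logic directly, for example:

awk -f tst.awk config.txt data.csv || echo "Data Type Mismatch Found" >&2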
I cannot understand the behaviour of jq's update operator (version 1.6) shown in the following examples.
Why does example 1 return an updated object, while examples 2 and 3 return an empty object or a wrong result?
The only difference between the examples is the point at which the string-to-number conversion function is called.
#!/bin/bash
#
# strange behaviour jq
# example 1 - works as expected
jq -n '
def numberify($x): $x | tonumber? // 0;
"1" as $stringValue
| numberify($stringValue) as $intValue
# | { } # version 1: a.b does not exist yet
| { a: { b: 1 } } # version 2: a.b exists already
| .["a"] |= { b: (.b + $intValue) }
'
# result example 1, version 1 - expected
# {
# "a": {
# "b": 1
# }
# }
# result example 1, version 2 - expected
# {
# "a": {
# "b": 2
# }
# }
# example 2 - erroneous result
jq -n '
def numberify($x): $x | tonumber? // 0;
"1" as $stringValue
# | { } # version 1: a.b does not exist yet
| { a: { b: 1 } } # version 2: a.b exists already
| .["a"] |= { b: (.b + numberify($stringValue)) }
'
# result example 2, version 1 - unexpected
# {}
# result example 2, version 2 - unexpected
# {}
# example 3 - erroneous result
jq -n '
def numberify($x): $x | try tonumber catch 0;
"1" as $stringValue
# | { } # version 1: a.b does not exist yet
| { a: { b: 1 } } # version 2: a.b exists already
| .["a"] |= { b: (.b + numberify($stringValue)) }
'
# result example 3, version 1 - unexpected
# {
# "a": {
# "b": 0
# }
# }
# result example 3, version 2 - unexpected
# {
# "a": {
# "b": 1
# }
# }
@oguzismail That's a good idea, using '+=' instead of '|='.
I hadn't thought of it before.
Currently, my code with the workaround for the bug looks like this:
jq -n '
def numberify($x): $x | tonumber? // 0;
"1" as $sumReqSize
| "10" as $sumResSize
| { statistics: { count: 1, sumReqSize: 2, sumResSize: 20 } }
| [numberify($sumReqSize), numberify($sumResSize)] as $sizes # workaround for bug
| .statistics |= {
count: (.count + 1),
sumReqSize: (.sumReqSize + $sizes[0]),
sumResSize: (.sumResSize + $sizes[1])
}
'
Following your suggestion, it becomes more concise and doesn't need the ugly workaround:
def numberify($x): $x | tonumber? // 0;
"1" as $sumReqSize
| "10" as $sumResSize
| { statistics: { count: 1, sumReqSize: 2, sumResSize: 20 } }
| .statistics.count += 1
| .statistics.sumReqSize += numberify($sumReqSize)
| .statistics.sumResSize += numberify($sumResSize)
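For reference, wrapping that filter in jq -n '...' should print the accumulated object below, assuming the += form really does sidestep the bug as suggested (the arithmetic is just 1+1, 2+1 and 20+10):

# {
#   "statistics": {
#     "count": 2,
#     "sumReqSize": 3,
#     "sumResSize": 30
#   }
# }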
This is a bug in jq 1.6. One option would be to use an earlier version of jq (e.g. jq 1.5).
Another would be to avoid |= by using = instead, along the lines of:
.a = (.a | ...)
or if the RHS does not actually depend on the LHS (as in your original examples), simply replacing |= by =.
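For instance, example 2 rewritten in the .a = (.a | ...) form should produce the expected result (a sketch of the suggested rewrite):

jq -n '
def numberify($x): $x | tonumber? // 0;
"1" as $stringValue
| { a: { b: 1 } }
| .a = (.a | { b: (.b + numberify($stringValue)) })
'
# expected result:
# {
#   "a": {
#     "b": 2
#   }
# }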
This is a bug in jq 1.6. In this case you can use try-catch instead.
def numberify($x): $x | try tonumber catch 0;
But I don't know if there is a generic way to work around this issue.
I am using awk to compute the Mean Fractional Bias from a data file. How can I turn the data points into variables to feed into my equation?
Input (two columns, a label and a value):
Yavg: 14.87954
Xavg: 20.83804
Ystd: 7.886613
Xstd: 8.628519
I am looking to feed these values into this equation:
MFB = 0.5 * (Yavg - Xavg) / ((Yavg + Xavg) / 2)
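With the sample values this evaluates to 0.5 * (14.87954 - 20.83804) / ((14.87954 + 20.83804) / 2) ≈ -0.166823.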
Expected output (the same four rows, plus a computed MFB row):
Yavg: 14.87954
Xavg: 20.83804
Ystd: 7.886613
Xstd: 8.628519
MFB: (computed value)
Currently I am trying to use the following code to do this, but it is not working:
var= 'linear_reg-County119-O3-2004-Winter2013-2018XYstats.out.out'
val1=$(awk -F, OFS=":" "NR==2{print $2; exit}" <$var)
val2=$(awk -F, OFS=":" "NR==1{print $2; exit}" <$var)
#MFB = .5*((val2-val1)/((val2+val1)/2))
awk '{ print "MFB :" .5*((val2-val1)/((val2+val1)/2))}' >> linear_regCounty119-O3-2004-Winter2013-2018XYstats-wMFB.out
Try running: awk -f mfb.awk input.txt where
mfb.awk:
BEGIN { FS = OFS = ": " } # set the separators
{ v[$1] = $2; print } # store each line in an array named "v"
END {
MFB = 0.5 * (v["Yavg"] - v["Xavg"]) / ((v["Yavg"] + v["Xavg"]) / 2)
print "MFB", MFB
}
input.txt:
Yavg: 14.87954
Xavg: 20.83804
Ystd: 7.886613
Xstd: 8.628519
Output:
Yavg: 14.87954
Xavg: 20.83804
Ystd: 7.886613
Xstd: 8.628519
MFB: -0.166823
Alternatively, mfb.awk can be the following, resembling your original code:
BEGIN { FS = OFS = ": " }
{ print }
NR == 1 { Yavg = $2 } NR == 2 { Xavg = $2 }
END {
MFB = 0.5 * (Yavg - Xavg) / ((Yavg + Xavg) / 2)
print "MFB", MFB
}
Note that you don't usually toss variables back and forth between the shell and Awk (at least when you deal with a single input file).
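If you do need to hand a value from the shell to Awk, the idiomatic route is the -v option, with the Awk program in single quotes so the shell never expands $2 (a minimal sketch reusing the question's $var file name):

val1=$(awk -F': *' 'NR == 2 { print $2; exit }' "$var")  # extract Xavg
awk -v xavg="$val1" 'BEGIN { print "Xavg is", xavg }'    # pass it back into Awk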
I need to compare two files column by column using a unix shell, and store the differences in a resulting file.
For example, if column 1 of the first record of file 1 matches column 1 of the first record of file 2, then '=' is stored in the resulting file for that column; but where the column values differ, both values should be printed in the resulting file.
Below is the exact requirement.
File 1:
id code name place
123 abc Tom phoenix
345 xyz Harry seattle
675 kyt Romil newyork
File 2:
id code name place
123 pkt Rosy phoenix
345 xyz Harry seattle
421 uty Romil Sanjose
Expected resulting file:
id_1 id_2 code_1 code_2 name_1 name_2 place_1 place_2
= = abc pkt Tom Rosy = =
= = = = = = = =
675 421 kyt uty = = newyork Sanjose
Columns are tab delimited.
This is rather crudely coded, but it shows a way to use awk to emit what you want, and it can handle files of any identical "schema", not just the particular 4-field files you give as tests.
This approach uses pr to do a simple merge of the files: the same line of each input file is concatenated to present one line to the awk script.
The awk script assumes clean input and uses the fact that if a variable n has the value 2, then $n in the script is the same as $2. So the script walks through pairs of fields using the i and j variables. For your test input, fields 1 and 5, then 2 and 6, and so on, are processed.
Only very limited validation of the input is performed: mainly, that the implied schema of the two input files (the names of the columns/fields) is the same.
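A one-liner makes the $n indirection concrete:

echo 'a b c' | awk '{ n = 2; print $n }'   # prints: b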
#!/bin/sh
[ $# -eq 2 ] || { echo "Usage: ${0##*/} <file1> <file2>" 1>&2; exit 1; }
[ -r "$1" -a -r "$2" ] || { echo "$1 or $2: cannot read" 1>&2; exit 1; }
set -e
pr -s -t -m "$@" | \
awk '
{
offset = int(NF/2)
tab = ""
for (i = 1; i <= offset; i++) {
j = i + offset
if (NR == 1) {
if ($i != $j) {
printf "\nColumn name mismatch (%s/%s)\n", $i, $j > "/dev/stderr"
exit
}
printf "%s%s_1\t%s_2", tab, $i, $j
} else if ($i == $j) {
printf "%s=\t=", tab
} else {
printf "%s%s\t%s", tab, $i, $j
}
tab = "\t"
}
printf "\n"
}
'
Tested on Linux: GNU Awk 4.1.0 and pr (GNU coreutils) 8.21.
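Saved as, say, compare.sh (the name is mine), a run over two files would look like:

sh compare.sh File1 File2 > result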
I have a file-processing task that I need a hand with. I have two files (matched_sequences.list and multiple_hits.list).
INPUT FILE 1 (matched_sequences.list):
>P001 ID
ABCD .... (very long string of characters)
>P002 ID
ABCD .... (very long string of characters)
>P003 ID
ABCD ... ( " " " " )
INPUT FILE 2 (multiple_hits.list):
ID1
ID2
ID3
....
What I want to do is match the second column of each header line (the ID) against the list of IDs stored in multiple_hits.list, then create a new matched_sequences file similar to the original but excluding all records whose ID is found in multiple_hits.list (about 60 out of 1000). So far I have:
#!/bin/bash
X=$(cat matched_sequences.list | awk '{print $2}')
Y=$(cat multiple_hits.list | awk '{print $1}')
while read matched_sequenes.list
do
[ $X -ne $Y ] && (cat matched_sequences.list | awk '{print $1" "$2}') > new_matched_sequences.list
done
This raises the following error:
-bash: read: `matched_sequences.list': not a valid identifier
Many thanks in advance!
EXPECTED OUTPUT (new_matched_sequences.list):
Same as INPUT FILE 1 with all IDs in multiple_hits.list excluded
#!/usr/bin/awk -f
function chomp(s) {
sub(/^[ \t]*/, "", s)
sub(/[ \t\r]*$/, "", s)
return s
}
BEGIN {
file = ARGV[--ARGC] # take the last argument (multiple_hits.list) off the arg list so it isn't read as data
while ((getline line < file) > 0) {
a[chomp(line)]++ # remember each ID to exclude
}
RS = "" # paragraph mode: records are separated by blank lines
FS = "\n" # fields are the record's lines
ORS = "\n\n"
}
{
id = chomp($1) # first field = the record's header line
sub(/^.* /, "", id) # keep only the ID: the text after the last space
}
!(id in a) # print the record unless its ID is in the exclude list
Usage:
awk -f script.awk matched_sequences.list multiple_hits.list > new_matched_sequences.list
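The key idea above is awk's paragraph mode: with RS = "" and FS = "\n", each blank-line-separated record becomes a single awk record whose fields are its lines, so $1 is the header line. A minimal illustration with made-up input:

printf '>P001 ID\nABCD\n\n>P002 ID\nEFGH\n' |
awk 'BEGIN { RS = ""; FS = "\n" } { print NR ": " $1 }'
# 1: >P001 ID
# 2: >P002 ID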
A shorter awk answer is possible, with a tiny script that reads first the file with the IDs to exclude and then the file containing the sequences. The script is as follows (the comments make it long; it's just three useful lines in fact):
BEGIN { grab_flag = 0 }
# grab_flag will be used when we are reading the sequences file
# (not absolutely necessary to set here, though, because we expect the file will start with '>')
FNR == NR { hits[$1] = 1 ; next } # command executed for all lines of the first file: record IDs stored in multiple_hits.list
# otherwise we are reading the second file, containing the sequences:
/^>/ { if (hits[$2] == 1) grab_flag = 0 ; else grab_flag = 1 } # sets the flag indicating whether we have to output the sequence or not
grab_flag == 1 { print }
And if you call this script exclude.awk, you will invoke it this way:
awk -f exclude.awk multiple_hits.list matched_sequences.list
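The FNR == NR pattern used above is true only while awk is reading the first file named on the command line, which is what routes each line to the right block. A minimal demonstration with two made-up files:

printf 'x\n' > A; printf 'y\n' > B
awk 'FNR == NR { print "first:", $0; next } { print "second:", $0 }' A B
# first: x
# second: y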
I have 4 files I need to compare, and I want the output to display the numbers in file X that are not present in the other files.
The files consist of one column, with a number on each row.
I can't use the diff command because it is not available.
I was trying to use comm, but the output was wrong every time.
Thanks for your help.
It's not an easy task, actually. You can try this; it uses GNU awk.
gawk 'function report() { printf("File \"%s\" has different number of lines: %d\nIt should be %d\n", FILENAME, FNR, o); exit } NR == FNR { a[NR] = $0; ++o; next } FNR > o { report() } a[FNR] != $0 { printf("Line %d of file \"%s\" is different from line in file \"%s\": %s\nIt should be: %s\n", FNR, FILENAME, ARGV[1], $0, a[FNR]); exit } ENDFILE { if (ARGIND > 1 && FNR != o) report() }' file1 file2 file3
Script version:
#!/usr/bin/gawk -f
function report() {
printf("File \"%s\" has different number of lines: %d\nIt should be %d\n",
FILENAME, FNR, o)
exit 1
}
NR == FNR {
a[NR] = $0
++o
next
}
FNR > o {
report()
}
a[FNR] != $0 {
printf("Line %d of file \"%s\" is different from line in file \"%s\": %s\nIt should be: %s\n",
FNR, FILENAME, ARGV[1], $0, a[FNR])
exit 1
}
ENDFILE {
if (ARGIND > 1 && FNR != o)
report()
}
Usage:
gawk -f script.awk file1 file2 file3 ...
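A quick run with made-up inputs shows the kind of report it produces:

printf '1\n2\n' > file1; printf '1\n3\n' > file2
gawk -f script.awk file1 file2
# Line 2 of file "file2" is different from line in file "file1": 3
# It should be: 2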
Consider that we have 3 files, N1 N2 N3, to compare.
Put all the values from the files into a new one called out and remove duplicates. Then compare each (sorted) file with out and write the missing values of each comparison to a file. Finally, gather the missing values into one file, sort it, and remove duplicates.
cat N1 N2 N3 | sort -u > out
comm -2 -3 out <(sort N1) > Missing1
comm -2 -3 out <(sort N2) > Missing2
comm -2 -3 out <(sort N3) > Missing3
cat Missing1 Missing2 Missing3 | sort -u > Missing.txt
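As a quick sanity check with made-up inputs N1 = {1,2,3}, N2 = {1,2} and N3 = {1,3}: out becomes {1,2,3}, Missing1 is empty, Missing2 = {3}, Missing3 = {2}, and Missing.txt = {2,3}, which are exactly the numbers absent from some file.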
Thanks for the help :)
If you have vim installed, then issue this command:
vim -d file1 file2 file3 ...