Data Formatting from Text File

Data Formatting from Text File - unix

I have data in below format stored in a file.
ABC:9804
{
"count" : 492,
"_shards" : {
"total" : 19,
"successful" : 19,
"failed" : 0
}
}
Bye
ABC:95023
{
"count" : 865,
"_shards" : {
"total" : 19,
"successful" : 19,
"failed" : 0
}
}
Bye
ABCC:128
{
"count" : 479,
"_shards" : {
"total" : 19,
"successful" : 19,
"failed" : 0
}
}
Bye
I am trying to get the output like
ABC:9804 , 492
ABC:95023 , 865
ABCC:128 , 479
I tried using awk to get the 1st line and the 3rd line of each block, but that is not working.

awk solution:
awk '/^ABC.*:/{ abc=$0 }$0~/"count"/{ gsub(/[^0-9]+/,"",$0); print abc" , "$0 }' file
The output:
ABC:9804 , 492
ABC:95023 , 865
ABCC:128 , 479
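The one-liner above may be easier to follow when spread over several lines with comments. The sketch below is the same logic wrapped in a shell function; the function name extract_counts and the variable name label are made up for illustration and are not part of the original answer.

```shell
# extract_counts: print "<label> , <count>" for each label/count pair.
# Reads the named files, or standard input when no file is given.
extract_counts() {
  awk '
    # A label line such as ABC:9804 or ABCC:128: remember it for later.
    /^ABC.*:/ { label = $0 }

    # The "count" line: delete every non-digit so only the number remains,
    # then print it next to the most recent label.
    /"count"/ {
      gsub(/[^0-9]+/, "", $0)
      print label " , " $0
    }
  ' "$@"
}
```

It would be called as extract_counts file. Like the original, it relies on the label line appearing before its "count" line.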

Input
$ cat infile
ABC:9804
{
"count" : 492,
"_shards" : {
"total" : 19,
"successful" : 19,
"failed" : 0
}
}
Bye
ABC:95023
{
"count" : 865,
"_shards" : {
"total" : 19,
"successful" : 19,
"failed" : 0
}
}
Bye
ABCC:128
{
"count" : 479,
"_shards" : {
"total" : 19,
"successful" : 19,
"failed" : 0
}
}
Bye
Output
Gawk
$ awk -F':' -v RS='[{},\n]' '/ABC.*|\"count"/{gsub(/[\n\t ]+/,""); printf /\"/? ", " $2 "\n": $0}' infile
ABC:9804, 492
ABC:95023, 865
ABCC:128, 479
More readable
awk -F':' -v RS='[{},\n]' '
/ABC.*|\"count"/{
gsub(/[\n\t ]+/,"");
printf /\"/? ", " $2 "\n": $0
}
' infile
non-gawk
$ awk -F'[ ,]' -v OFS=", " '/^[ \t]+(ABC.*|\"count\"[ ]?):/{ sub(/^[ \t]+/,""); printf /\"/ ? OFS $(NF-1) RS: $0 }' infile
ABC:9804, 492
ABC:95023, 865
ABCC:128, 479
More readable
awk -F'[ ,]' -v OFS=", " '
/^[ \t]+(ABC.*|\"count\"[ ]?):/{
sub(/^[ \t]+/,"");
printf /\"/ ? OFS $(NF-1) RS: $0
}
' infile

Related

Accessing field with jq that can be string or array

I have a large dump of data in json that looks like:
[{
"recordList" : {
"record" : [{
"Production" : {
"creator" : {
"name" : "A"
}
}
},
{
"Production" : {}
},
{
"Production" : [{
"creator" : {
"name" : "B"
},
"creator" : {
"name" : "C"
}
}]
}]
}
}]
I need to check if there is at least one creator in a record or not. If there is I give a 1 else a 0 for that field in a CSV-file.
My code:
jq -r '.[].recordList.record[]|"\(if ((.Production.creator.name)? // (.Production[]?.creator.name)?) == null or ((.Production.creator.name)?|length // (.Production[]?.creator.name)?|length) == 0 then 0 else 1 end),"' file.json
The problem is that the field 'Production' is only an array when there are multiple creators.
The result I want to get in this case is:
1,
0,
1,

jq solution:
jq -r '.[].recordList.record[].Production
| "\(if ((type == "array" and .[0].creator.name !="")
or (type == "object" and .creator.name and .creator.name !=""))
then 1 else 0 end),"' file.json
The output:
1,
0,
1,

Simplified jq solution:
jq -r '.[].recordList.record[].Production
| ((type == "array" and .[0].creator.name) or .creator.name)
| if . then "1," else "0," end' file.json

combine two jq filters into one

How does jq combine the outputs of two filters? The following jq command does not generate output.json with the respective input arg value ('jack').
input.json
{
"key1": "",
"key2": ""
}
jq --arg input "$username" \
'if .key1 == "<value1>"
then . + {"key1" : ($input) }
else . end' input.json |
'if .key2 == "<value2>"
then . + {"key2" : ($input) }
else . end' > output.json
output.json
{
"key1": "jack",
"key2": "jack"
}

The filter you are evidently trying to write is:
if .key1 == "" then . + {"key1" : $input } else . end
| if .key2 == "" then . + {"key2" : $input } else . end
This can be simplified to:
if .key1 == "" then .key1 = $input else . end
| if .key2 == "" then .key2 = $input else . end
You might also like to consider the following approach:
def update(f): f |= (if . == "" then $input else . end);
update(.key1) | update(.key2)

mapreduce javascript function error

I was unable to get the Erlang function running, so I am trying with JavaScript as below:
curl -XPOST http://localhost:8098/mapred \
-H "Content-Type: application/json" \
-d @- \
<<EOF
{
"inputs":"logs",
"query":[{
"map":{
"language":"javascript",
"source":"function(riakObject, keydata, arg) {
var m = riakObject.values[0].data.match(/^INFO.*cart/);
return [(m ? m.length : 0 )];
}"
},
"reduce":{
"language":"javascript",
"source":"function(values, arg){
return [values.reduce(
function(total, v){ return total + v; }, 0)
];
}"
}
}]
}
EOF
It does not seem to work with JavaScript either. The shell just hangs and never returns. Please suggest.
** UPDATE **
Today when I tried again, I saw the following error:
An error occurred parsing the "query" field.
["Unrecognized format of query phase:\n ",
[123,
[34,<<"map">>,34],
58,
[123,
[34,<<"language">>,34],
58,
[34,<<"javascript">>,34],
44,
[34,<<"source">>,34],
58,
[34,
<<"function(riakObject, keydata, arg) { var m = riakObject.values[0].data.match(/^INFO.*Milk/); return [(m ? m.length : 0 )]; }">>,
34],
125],
44,
[34,<<"reduce">>,34],
58,
[123,
[34,<<"language">>,34],
58,
[34,<<"javascript">>,34],
44,
[34,<<"source">>,34],
58,
[34,
<<"function(values, arg){ return [values.reduce( function(total, v){ return total + v; }, 0) ]; }">>,
34],
125],
125],
"\n\nValid formats are:\n {\"map\":{...spec...}}\n {\"reduce\":{...spec...}}\n {\"link:{...spec}}\n"]

The query element should be a list of objects, with "map" and "reduce" being in discrete objects. Your JSON has them as properties of the same object.
This worked for me:
curl -XPOST http://localhost:8098/mapred -H "Content-Type: application/json" -d '{
"inputs":"logs",
"query":[
{"map":{
"language":"javascript",
"source":"function(riakObject, keydata, arg) {
var m = riakObject.values[0].data.match(/^INFO.*cart/);
return [(m ? m.length : 0 )];
}"
}},
{"reduce":{
"language":"javascript",
"source":"function(values, arg){
return [values.reduce(
function(total, v){ return total + v; }, 0)
];
}"
}}
]
}'

Convert rows into columns for semi structured data

I have a data set which looks like this
<SUBBEGIN
IMSI=xxxxxxxxxxxx;
MSISDN=xxxxxxxxx;
DEFCALL=TS11;
CURRENTNAM=BOTH;
CAT=COMMON;
VOLTE_TAG=NOT_DEFINED;
HLR_INDEX=1;
PS_MSISDNLESS_SUPPORTED=FALSE;
CS_MSISDNLESS_SUPPORTED=FALSE;
CSRATTYPE=NO-NO-NO-NO-NO;
PSRATTYPE=NO-NO-NO-NO-NO;
ICI=NO;
STE=NO;
<SUBEND
<SUBBEGIN
IMSI=xxxxxxxxxxxx;
MSISDN=xxxxxxxxx;
DEFCALL=TS11;
CURRENTNAM=BOTH;
VOLTE_TAG=NOT_DEFINED;
HLR_INDEX=1;
PS_MSISDNLESS_SUPPORTED=FALSE;
CS_MSISDNLESS_SUPPORTED=FALSE;
CSRATTYPE=NO-NO-NO-NO-NO;
<SUBEND
This is essentially one record and this is followed by multiple rows in the same format. I want the output to be in the format as:
IMSI|MSISDN|DEFCALL|CURRENTNAM|CAT...
xxxx|xxxx|TS11|BOTH|COMMON|COMMON
Any help is much appreciated.

$ cat tst.awk
BEGIN {FS="[=;]"; OFS="|" }
/^<SUB/ {
if (/END/) {
print (hdrPrinted++ ? "" : hdr ORS ) rec
hdr = rec = ""
}
next
}
{
sub(/^[[:space:]]+/,"")
hdr = (hdr=="" ? "" : hdr OFS) $1
rec = (rec=="" ? "" : rec OFS) $2
}
$ awk -f tst.awk file
IMSI|MSISDN|DEFCALL|CURRENTNAM|CAT|VOLTE_TAG|HLR_INDEX|PS_MSISDNLESS_SUPPORTED|CS_MSISDNLESS_SUPPORTED|CSRATTYPE|PSRATTYPE|ICI|STE
xxxxxxxxxxxx|xxxxxxxxx|TS11|BOTH|COMMON|NOT_DEFINED|1|FALSE|FALSE|NO-NO-NO-NO-NO|NO-NO-NO-NO-NO|NO|NO
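The second record in the question omits CAT, PSRATTYPE, ICI and STE, so building each row purely by position can put values under the wrong header. A possible variation, sketched below, stores each value under its key name and leaves missing fields empty. It assumes the first record lists every key that can occur; the function name records_to_table and the array names are hypothetical.

```shell
# records_to_table: print one header row, then one row per
# <SUBBEGIN ... <SUBEND block, placing each value under its key name.
# Assumption: the first block lists every key that can occur.
records_to_table() {
  awk '
    BEGIN { FS = "[=;]"; OFS = "|" }

    # Start of a block: forget the values of the previous block.
    /^[ \t]*<SUBBEGIN/ { split("", val); next }

    # End of a block: print the header once, then this block as one row.
    /^[ \t]*<SUBEND/ {
      if (!nkeys) next
      if (!printed++) {
        row = ""
        for (i = 1; i <= nkeys; i++) row = row (i > 1 ? OFS : "") key[i]
        print row
      }
      row = ""
      for (i = 1; i <= nkeys; i++) row = row (i > 1 ? OFS : "") val[key[i]]
      print row
      next
    }

    # A KEY=VALUE; line: strip indentation and store the value by key.
    # Keys are collected only until the header has been printed.
    /=/ {
      sub(/^[ \t]+/, "")
      if (!printed && !($1 in seen)) { seen[$1] = 1; key[++nkeys] = $1 }
      val[$1] = $2
    }
  ' "$@"
}
```

It would be called as records_to_table file. A key that first appears in a later record is ignored, which is the price of printing the header after the first record instead of reading the whole file first.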

$ cat test.awk
/<SUBBEGIN/ {f=1; next} # at start flag up
/<SUBEND/ { # at end
print b ORS c # print
f=0; b=c="" # flag down and reset variables
}
f { # between markers
split($1,a,"[=;]") # gather to 2 variables
b=b a[1] "|"
c=c a[2] "|"
}
Test it:
$ awk -f test.awk test.txt
IMSI|MSISDN|DEFCALL|CURRENTNAM|CAT|VOLTE_TAG|HLR_INDEX|PS_MSISDNLESS_SUPPORTED|CS_MSISDNLESS_SUPPORTED|CSRATTYPE|PSRATTYPE|ICI|STE|
xxxxxxxxxxxx|xxxxxxxxx|TS11|BOTH|COMMON|NOT_DEFINED|1|FALSE|FALSE|NO-NO-NO-NO-NO|NO-NO-NO-NO-NO|NO|NO|

Workaround using tr
tr -s '\n' ' ' < file > tmpfile
This gives me the output in form of
<SUBBEGIN IMSI=xxxxxxxxxxxx; MSISDN=xxxxxxxxx; DEFCALL=TS11; CURRENTNAM=BOTH; CAT=COMMON; VOLTE_TAG=NOT_DEFINED; HLR_INDEX=1; PS_MSISDNLESS_SUPPORTED=FALSE; CS_MSISDNLESS_SUPPORTED=FALSE; CSRATTYPE=NO-NO-NO-NO-NO; PSRATTYPE=NO-NO-NO-NO-NO; ICI=NO; STE=NO; <SUBEND
Then replace each "<SUBBEGIN" string with \n so that every record starts on its own line.

Start with this one:
sed '/<SUBBEGIN/{:a;N;/\<SUBEND/!ba;s/\n[^=]*=/ /g;s/.*SUBBEGIN//;s/;/|/g}' input

Here is another solution:
awk script
#!/bin/awk -f
function print_record( hdr )
{
str = ""
for( i = 1; i <= 13; i++ )
{
if( hdr )
{
value = substr( $i, 1, index( $i, "=" ) - 1 )
}
else
{
value = substr( $i, index( $i, "=" ) + 1 )
}
gsub( /^[ \t]+/, "", value )
if( length(str) > 0 )
str = str OFS
str = str value
}
print str
}
BEGIN {
RS="<SUBBEGIN\n"
FS=";\n"
hdr=1
OFS="|"
}
{
if( index( $0, "=" ) && index( $0, ";" ) )
{
if( hdr )
{
print_record( 1 )
hdr = 0;
}
print_record( 0 )
}
}
# eof #
Input file
<SUBBEGIN
IMSI=xxxxxxxxxxxx;
MSISDN=xxxxxxxxx;
DEFCALL=TS11;
CURRENTNAM=BOTH;
CAT=COMMON;
VOLTE_TAG=NOT_DEFINED;
HLR_INDEX=1;
PS_MSISDNLESS_SUPPORTED=FALSE;
CS_MSISDNLESS_SUPPORTED=FALSE;
CSRATTYPE=NO-NO-NO-NO-NO;
PSRATTYPE=NO-NO-NO-NO-NO;
ICI=NO;
STE=NO;
<SUBEND
<SUBBEGIN
IMSI=yyyyyyyyyy;
MSISDN=yyyyyyyyy;
DEFCALL=TS11;
CURRENTNAM=BOTH;
CAT=COMMON;
VOLTE_TAG=NOT_DEFINED;
HLR_INDEX=2;
PS_MSISDNLESS_SUPPORTED=TRUE;
CS_MSISDNLESS_SUPPORTED=FALSE;
CSRATTYPE=NO-YES-NO-NO-NO;
PSRATTYPE=NO-NO-NO-YES-NO;
ICI=NO;
STE=NO;
<SUBEND
<SUBBEGIN
IMSI=zzzzzzzzzz;
MSISDN=zzzzzzzzzzzzzzz;
DEFCALL=TS11;
CURRENTNAM=BOTH;
CAT=COMMON;
VOLTE_TAG=NOT_DEFINED;
HLR_INDEX=3;
PS_MSISDNLESS_SUPPORTED=FALSE;
CS_MSISDNLESS_SUPPORTED=TRUE;
CSRATTYPE=NO-YES-YES-NO-NO;
PSRATTYPE=NO-NO-YES-YES-NO;
ICI=YES;
STE=YES;
<SUBEND
Output
$ awk -f script.awk -- input.txt
IMSI|MSISDN|DEFCALL|CURRENTNAM|CAT|VOLTE_TAG|HLR_INDEX|PS_MSISDNLESS_SUPPORTED|CS_MSISDNLESS_SUPPORTED|CSRATTYPE|PSRATTYPE|ICI|STE
xxxxxxxxxxxx|xxxxxxxxx|TS11|BOTH|COMMON|NOT_DEFINED|1|FALSE|FALSE|NO-NO-NO-NO-NO|NO-NO-NO-NO-NO|NO|NO
yyyyyyyyyy|yyyyyyyyy|TS11|BOTH|COMMON|NOT_DEFINED|2|TRUE|FALSE|NO-YES-NO-NO-NO|NO-NO-NO-YES-NO|NO|NO
zzzzzzzzzz|zzzzzzzzzzzzzzz|TS11|BOTH|COMMON|NOT_DEFINED|3|FALSE|TRUE|NO-YES-YES-NO-NO|NO-NO-YES-YES-NO|YES|YES
Hope It Helps!

with unix toolchain, perhaps the shortest...
$ sed '/^</d' file | tr '=' '\n' | tr -d ' ;' | pr -13ts'|'
IMSI|MSISDN|DEFCALL|CURRENTNAM|CAT|VOLTE_TAG|HLR_INDEX|PS_MSISDNLESS_SUPPORTED|CS_MSISDNLESS_SUPPORTED|CSRATTYPE|PSRATTYPE|ICI|STE
xxxxxxxxxxxx|xxxxxxxxx|TS11|BOTH|COMMON|NOT_DEFINED|1|FALSE|FALSE|NO-NO-NO-NO-NO|NO-NO-NO-NO-NO|NO|NO

Using a different input file for simplicity
$ cat ip.txt
<SUBBEGIN
i1=abc;
i2=ijk;
i3=xyz;
k1=NO;
t1=YES;
<SUBEND
<SUBBEGIN
i1=foo;
i2=bar;
i3=test;
k1=YES;
t1=NO;
<SUBEND
$ perl -nle '
$s=/<SUBBEGIN/ if /<SUB/;
if($s && !/<SUB/)
{
($k,$v) = /\S+(?==)|=\K[^;]+/g;
push(@key, $k);
push(@val, $v);
}
elsif(@key)
{
print join "|", @key;
print join "|", @val;
@key = ();
@val = ();
}
' ip.txt
i1|i2|i3|k1|t1
abc|ijk|xyz|NO|YES
i1|i2|i3|k1|t1
foo|bar|test|YES|NO
$s is a flag that is set when the input line contains <SUBBEGIN
If the flag is set and the input line doesn't contain <SUB
extract the key, value pair
populate them in two different arrays
Once the input line contains <SUB again
check if one of the arrays (say @key) is not empty
print the key and value arrays with | as the separator
empty the arrays
This will work whether or not there are empty lines between the data structures

awk to extract lines based on Date range:

I would like to extract the line items whose date in the second field ($2) falls between 25-mar-2015 and 05-may-2015.
The date column is not sorted and each file contains millions of records.
Inputs.gz
Des,DateInfo,Amt,Loc,Des2
abc,02-dec-2014,10,def,xyz
abc,20-apr-2015,25,def,xyz
abc,14-apr-2015,40,def,xyz
abc,17-mar-2014,55,def,xyz
abc,24-nov-2011,70,def,xyz
abc,13-may-2015,85,def,xyz
abc,30-sep-2008,100,def,xyz
abc,20-jan-2014,115,def,xyz
abc,04-may-2015,130,def,xyz
abc,25-nov-2013,145,def,xyz
abc,29-mar-2015,55,def,xyz
I have tried the command below, but it is incomplete:
function getDate(date) {
split(date, a, "-");
return mktime(a[3] " " sprintf("%02i",(index("janfebmaraprmayjunjulaugsepoctnovdec", a[2])+2)/3) " " a[1] " 00 00 00")
}
BEGIN {FS=","}
{ if ( getDate($2)>=getDate(25-mar-2015) && getDate($2)<=getDate(05-may-2015) ) print $0 }
Expected Output:
abc,20-apr-2015,25,def,xyz
abc,14-apr-2015,40,def,xyz
abc,04-may-2015,130,def,xyz
abc,29-mar-2015,55,def,xyz
Please suggest ... I don't have Perl or Python access.

$ cat tst.awk
function getDate(date, a) {
split(date, a, /-/)
return mktime(a[3]" "(index("janfebmaraprmayjunjulaugsepoctnovdec",a[2])+2)/3" "a[1]" 0 0 0")
}
BEGIN {
FS=","
beg = getDate("25-mar-2015")
end = getDate("05-may-2015")
}
{ cur = getDate($2) }
NR>1 && cur>=beg && cur<=end
$ awk -f tst.awk file
abc,20-apr-2015,25,def,xyz
abc,14-apr-2015,40,def,xyz
abc,04-may-2015,130,def,xyz
abc,29-mar-2015,55,def,xyz
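mktime is a gawk extension, so the script above may not run under other awks. Since only ordering matters here, one alternative is to turn each date into the number yyyymmdd and compare those numbers. The following is a sketch of that swapped-in technique, not the answer's method; the function names filter_by_date and ymd are invented for illustration.

```shell
# filter_by_date FROM TO [FILE...]: print CSV rows whose second field, a
# date written as dd-mon-yyyy, lies between FROM and TO inclusive.
filter_by_date() {
  from=$1 to=$2
  shift 2
  awk -F, -v from="$from" -v to="$to" '
    # Turn dd-mon-yyyy into the number yyyymmdd, which orders like the date.
    # Returns 0 when the month name is not recognised (e.g. the header row).
    function ymd(date,    p, m) {
      if (split(date, p, "-") != 3) return 0
      m = index("janfebmaraprmayjunjulaugsepoctnovdec", tolower(p[2]))
      if (m == 0 || (m - 1) % 3 != 0 || length(p[2]) != 3) return 0
      return p[3] * 10000 + ((m + 2) / 3) * 100 + p[1]
    }
    BEGIN { lo = ymd(from); hi = ymd(to) }
    { cur = ymd($2) }
    cur && cur >= lo && cur <= hi
  ' "$@"
}
```

For the compressed input in the question it could be fed from a pipe, for example gzip -dc Inputs.gz | filter_by_date 25-mar-2015 05-may-2015. Rows are printed in input order, so the unsorted date column is not a problem.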
