group_by behavior when using --stream - jq

Having (simplified for learning) input file:
{"type":"a","id":"1"}
{"type":"a","id":"2"}
{"type":"b","id":"1"}
{"type":"c","id":"3"}
I'd like to turn it into:
{
"a": [1,2],
"b": [1],
"c": [3]
}
using the --stream option (not actually needed here, just for learning). Or at least it does not seem viable to use group_by or reduce without it on bigger files (even a few GB seems to be rather slow).
I understand that I can write something like:
jq --stream -cn 'reduce (inputs|select(length==2)) as $i([]; . + ..... )' test3
but that would just process the data per line (per processed item in the stream), i.e. I can see either the type or the id, and there is no place where the pairing can be created. I can cram it all into one big array, but that is the opposite of what I need to do.
How do I create such pairings? I don't even know how to create (using --stream):
{"a":1}
{"a":2}
...
I know both (the first target transformation, and the one just above this paragraph) are probably some trivial usage of foreach. I have a working example of one here, but all its .accumulator and .complete keywords (IIUC) are now just magic to me. I understood it once, but ... Sorry for the trivial questions.
UPDATE regarding performance:
@pmf provided two solutions in his answer: a streaming and a non-streaming one. Thanks for that; I was able to write the non-streaming version, but not the streaming one. When I tested it, though, the streaming variant was (I'm not 100% sure now, but ...) 2-4 times slower. That makes sense if the data does not fit into memory, but luckily in my case it does. So I ran the non-streaming version on a ~1 GB file on a laptop with a not-actually-that-slow i7-9850H CPU @ 2.60GHz. To my surprise it wasn't done within 16 hours, so I killed it as not viable for my use case of potentially much bigger input files. Considering the simplicity of the input, I decided to write a pipeline using just bash, grep, sed, paste and tr, and even though it used some regexes and was overall inefficient as hell, with no parallelism at all, the whole file was correctly crunched in 55 seconds. I understand that character manipulation is faster than parsing JSON, but that much of a difference? Isn't there some better approach while still parsing JSON? I don't mind spending more CPU power, but if I'm using jq, I'd like to use its functions and process JSON as JSON, not just as characters, as I did with bash.
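One hybrid I'm considering (a sketch, not benchmarked; it assumes the input stays newline-delimited JSON like the sample above and that the ids are numeric) is to let jq do only the per-line JSON parsing and push the grouping into awk:
# jq parses each JSON line into "type<TAB>id", awk groups the ids per type.
# Key order in the output object is whatever awk's for-in loop yields.
jq -r '[.type, .id] | @tsv' test3 |
awk -F'\t' '
  { if ($1 in ids) ids[$1] = ids[$1] "," $2; else ids[$1] = $2 }
  END {
    printf "{"
    sep = ""
    for (t in ids) { printf "%s\"%s\":[%s]", sep, t, ids[t]; sep = "," }
    print "}"
  }'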

In the "unstreamed" case I`d use
jq -n 'reduce inputs as $i ({}; .[$i.type] += [$i.id | tonumber])'
With the --stream option set, just re-create the streamed items using fromstream:
jq --stream -n 'reduce fromstream(inputs) as $i ({}; .[$i.type] += [$i.id | tonumber])'
{
"a": [1,2],
"b": [1],
"c": [3]
}
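As for the intermediate per-line objects asked about above ({"a":1}, {"a":2}, ...), fromstream alone already reassembles each top-level input, so a small sketch along these lines should do it:
jq --stream -cn 'fromstream(inputs) | {(.type): (.id | tonumber)}' test3
which prints one object per input line: {"a":1}, {"a":2}, {"b":1}, {"c":3}.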

Related

Maths on values of properties that may not exist

Here is a query (gremlin-python; tinkerpop v3.3.3) that inserts a node with properties 'positive' and 'negative', then subtracts one from the other and outputs the result:
g.withSack(0).addV('test').property('positive', 2).property('negative', 3) \
.sack(assign).by('positive') \
.sack(minus).by('negative') \
.sack().next()
Out[13]: -1
Trying the same query except with one of those properties missing produces an error:
g.withSack(0).addV('test').property('positive', 2) \
.sack(assign).by('positive') \
.sack(minus).by('negative') \
.sack().next()
Out[13]: ... GremlinServerError: 500: The property does not exist as the key has no associated value for the provided element: v[36]:negative
I can get around this by coalescing the four possible cases:
g.withSack(0).addV('test').property('positive', 2) \
.coalesce(has('positive').has('negative').sack(assign).by('positive').sack(minus).by('negative').sack(), \
has('positive').values('positive'), \
has('negative').sack(minus).by('negative').sack(), \
sack() \
).next()
It sure is ugly though - is there a neater solution? Ideally there would be a way to supply a default value in the absence of a property. I've tried using the 'math' step as well, but it's only slightly neater and doesn't avoid the problem of non-existent properties. To be clear, in the case of multiple traversers, I want a result for each traverser.
I think if you do math() or sack() to solve this, you should probably consider going with the idea of having "required" properties on these vertices on which you intend to do these calculations. That should make things much easier. I do feel like math() would be neater, though you said otherwise:
g.V().as('a').out('knows').as('b').
math("a - b").
by(coalesce(values('hasSomeValue'), constant(0))).
by(coalesce(values('missingValue'), constant(0)))
That's pretty straightforward though perhaps your examples were meant for simplicity and you have a lot more complexity to consider.
I suppose Gremlin could be changed to allow for a second parameter in the by() as some default if the first traversal did not return anything, thus:
g.V().as('a').out('knows').as('b').
math("a - b").
by(values('hasSomeValue'), constant(0)).
by(values('missingValue'), constant(0))
Saves some typing I'd suppose, but I'm not sure that it is as clear to read as with the use of coalesce(). I think I like the explicit use of coalesce() better.

Pipe file to stdin and keep alive, waiting for new data

A semi-newbie to UNIX piping, so apologies if I'm asking anything obvious here. I'm using a program called CCExtractor to grab the closed captions from a video file. It has the option to receive a file from stdin, and it works great if I do the following:
./ccextractor -stdin < myvideofile.wtv
However, I want to try using it "live" - as a video is recording, it'll transcribe the subtitles. From my understanding, < won't do that, as it'll stop as soon as it reaches the current end of the file. Following this answer on Stack Overflow, it seems like:
tail -c +1 -f myvideofile.wtv | ./ccextractor -stdin
should work - but it doesn't process any part of the video at all (it should, at least, work as well as the previous command and parse the existing data). I figured I'd take a step back and use a simple cat:
cat myvideofile.wtv | ./ccextractor -stdin
and that doesn't work either. I was of the belief that the first and third commands ought to be roughly equivalent, but that's obviously not the case. What are the differences, and how could I get this to work?
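One concrete difference you can observe on Linux (a sketch; whether this is exactly what trips up ccextractor is an assumption on my part) is what stdin actually is in each case:
# With a redirect, fd 0 is the regular (seekable) video file itself.
ls -l /proc/self/fd/0 < myvideofile.wtv
# With a pipe, fd 0 is an anonymous pipe, which is not seekable.
cat myvideofile.wtv | ls -l /proc/self/fd/0
A program that checks the type of its stdin, seeks in it, or reopens it by name will therefore behave differently under the two invocations.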

Bind query resolution time in munin

Is it possible to graph the query resolution time of bind9 in munin?
I know there is a way to graph it for an unbound server; is it already done for bind? If not, how do I start writing a munin plugin for that? I'm getting the stats from http://127.0.0.1:8053/ on the bind9 server.
I don't believe that "query time" is a function of BIND. About the only time that I see that value (with individual lookups) is when using dig. If you're willing to use that, the following might be a good starting point:
#!/bin/sh
case $1 in
config)
cat <<'EOM'
graph_title Red Hat Query Time
graph_vlabel time
time.label msec
EOM
exit 0;;
esac
echo -n "time.value "
dig www.redhat.com|grep Query|cut -d':' -f2|cut -d\  -f2
Note that there are two spaces after the "-d\" in the second cut statement. If you save the above as "querytime" and run it at the command line, the output should look something like:
root@pi1:~# ./querytime
time.value 189
root@pi1:~# ./querytime config
graph_title Red Hat Query Time
graph_vlabel time
time.label msec
I'm not sure of the value in tracking the above, though. The response time can be affected by whether the query is an initial lookup or the answer is cached locally, by server load, by intervening network congestion, etc.
Note: the above may be a bit buggy as I've written it on the fly, but it should give you a good starting point. That it returned the above output is a good sign.
In any case, I recommend reading the following before you write your own: http://munin-monitoring.org/wiki/HowToWritePlugins
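For what it's worth, a variant of the same plugin that pulls the value out of dig's ";; Query time: N msec" line with awk instead of the two cut calls (a sketch, untested on an actual munin node) would be:
#!/bin/sh
# Same config section as above; only the value extraction changes.
case $1 in
config)
cat <<'EOM'
graph_title Red Hat Query Time
graph_vlabel time
time.label msec
EOM
exit 0;;
esac
printf "time.value "
dig www.redhat.com | awk '/Query time:/ { print $4 }'
This avoids the easy-to-miss double space in the cut invocation.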

zsh: using "less -R" as READNULLCMD

Now, I'm pretty sure of the limitation here. But let's step back.
The simple statement
READNULLCMD="less -R"
doesn't work, generating the following error:
$ <basic.tex
zsh: command not found: less -R
OK. Pretty sure this is because, by default, zsh doesn't split string variables at every space. Wherever zsh is using this variable, it's using $READNULLCMD where it should be using ${=READNULLCMD}, to ensure the option argument is properly separated from the command by a normal space. See this discussion from way back in 1996(!):
http://www.zsh.org/mla/users/1996/msg00299.html
So, what's the best way around this, without setting SH_WORD_SPLIT (which I don't want 99% of the time)?
So far, my best idea is assigning READNULLCMD to a simple zsh script which just calls "less -R" on STDIN. e.g.
#!/opt/local/bin/zsh
less -R /dev/stdin
Unfortunately this seems to be a non-starter, as less used in this fashion for some reason misses the first few lines of input from /dev/stdin.
Anybody have any better ideas?
The problem is not that less doesn't read its environment variables (LESS or LESSOPEN). The problem is that the READNULLCMD is not invoked as you might think.
<foo
does not translate into
less $LESS foo
but rather to something like
cat foo | less $LESS
or, perhaps
cat foo $LESSOPEN | less $LESS
I guess that you (like me) want to use -R to obtain syntax coloring (by using src-hilite-lesspipe.sh in LESSOPEN, which in turn uses the "source-highlight" utility). The problem with the latter two styles of invocation is that src-hilite-lesspipe.sh (embedded in $LESSOPEN) will not receive a filename, and hence it will not be able to deduce the file type (via the --infer-lang option to "source-highlight"). Without a filename suffix, "source-highlight" will revert to "no highlighting".
You can obtain syntax coloring in READNULLCMD, but in a rather useless way: by specifying the language explicitly via the --lang-def option. However, you'll have as little of a clue as "source-highlight" does, since there's no file name when the data is passed anonymously through the pipe. Maybe there's a way to build an on-the-fly heuristic parser and deduce the language from the contents, but that is well beyond this little exercise.
export LESS=… may be a good solution if less is the only pager involved and you want that behavior to be the default in all cases, but if you want a more generic solution you can use a function:
function _-readnullcmd()
{
less -R
}
READNULLCMD=_-readnullcmd
(_- and readnullcmd have no special meaning; it's just that the former never appears in any distributed zsh script and the latter indicates the purpose of the function).
Set the $LESS env var to the options you always want to have in less.
So don't touch READNULLCMD and use export LESS="R" (and other options you want) in your zshrc.
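For reference, either suggestion boils down to a couple of lines in ~/.zshrc (a sketch; pick one approach):
# Approach A: leave the null command as plain less and put the flag in $LESS,
# which less reads on every invocation (assumes less is already your
# READNULLCMD/PAGER).
export LESS="-R"
# Approach B: wrap the exact command line in a function and point
# READNULLCMD at the function name, as in the answer above.
function _-readnullcmd() {
    less -R
}
READNULLCMD=_-readnullcmd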

Compress EACH LINE of a file individually and independently of one another? (or preserve newlines)

I have a very large file (~10 GB) that can be compressed to < 1 GB using gzip. I'm interested in using sort FILE | uniq -c | sort to see how often a single line is repeated, however the 10 GB file is too large to sort and my computer runs out of memory.
Is there a way to compress the file while preserving newlines (or an entirely different method all together) that would reduce the file to a small enough size to sort, yet still leave the file in a condition that's sortable?
Or is there any other method of finding out / counting how many times each line is repeated inside a large file (a ~10 GB CSV-like file)?
Thanks for any help!
Are you sure you're running out of memory (RAM) with your sort?
My experience debugging sort problems leads me to believe that you have probably run out of disk space for sort to create its temporary files. Also recall that the disk space used for sorting is usually in /tmp or /var/tmp.
So check your available disk space with:
df -g
(some systems don't support -g; try -m (megabytes) or -k (kilobytes))
If you have an undersized /tmp partition, do you have another partition with 10-20GB free? If yes, then tell your sort to use that dir with
sort -T /alt/dir
Note that for sort version
sort (GNU coreutils) 5.97
the help says:
-T, --temporary-directory=DIR use DIR for temporaries, not $TMPDIR or /tmp;
multiple options specify multiple directories
I'm not sure if this means you can combine a bunch of -T /dir1 -T /dir2 ... to get to your 10GB*sortFactor of space or not. My experience was that it only used the last dir in the list, so try to use one dir that is big enough.
Also, note that you can go to whatever dir you are using for sort, and you'll see the activity of the temporary files used for sorting.
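As an aside, newer GNU sort versions (not the 5.97 quoted above) can also compress their temporary files, which attacks the disk-space problem directly; a sketch, assuming GNU sort and gzip are available:
sort -T /alt/dir --compress-program=gzip FILE | uniq -c | sort -rn > line_counts.txt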
I hope this helps.
As you appear to be a new user here on S.O., allow me to welcome you and remind you of four things we do:
1) Read the FAQs
2) Please accept the answer that best solves your problem, if any, by pressing the checkmark sign. This gives the respondent with the best answer 15 points of reputation. It is not subtracted (as some people seem to think) from your reputation points ;-)
3) When you see good Q&A, vote them up by using the gray triangles, as the credibility of the system is based on the reputation that users gain by sharing their knowledge.
4) As you receive help, try to give it too, answering questions in your area of expertise.
There are some possible solutions:
1 - use any text processing language (perl, awk) to extract each line and save the line number and a hash for that line, and then compare the hashes (see the sketch after this list)
2 - Can / want to remove the duplicate lines, leaving just one occurrence per file? You could use a script (command) like:
awk '!x[$0]++' oldfile > newfile
3 - Why not split the file according to some criteria? Supposing all your lines begin with letters:
- break your original_file into smaller files, one per initial letter: grep "^a" original_file > a_file
- sort each small file: a_file, b_file, and so on
- verify the duplicates, count them, do whatever you want.
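A minimal sketch of the hash idea from item 1, using awk to count each distinct line in a single pass; memory grows with the number of distinct lines rather than the file size, so it can fit in RAM even when a full sort does not:
awk '{ count[$0]++ } END { for (line in count) print count[line], line }' FILE |
sort -rn | head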
