rrd fetch last 4hr and last 24 hr - fetch

Would this be a valid way to get the overall average of the first data source in an RRD database file for the past 4 hours? The parent solution is pnp4nagios, which records a value every 5 minutes (300 seconds), and all of that works great. I just want to learn how to query the RRD file manually and accurately.
I use some awk statements to average the data source column that I am interested in. This isn't pressing or anything; I am just trying to learn a little bit more about rrdtool.
rrdtool fetch rrdfile.rrd AVERAGE -r 300 -s -4hr|awk '{print $2}'|grep -v nan|cut -c 1-4|awk 'NR>2'|awk '{ sum += $1; n++ } END { if (n > 0) print sum / n; }'
Found this solution after some time of reading and testing. If you use pnp4nagios and want to query a rrd database to print the values vs. graphing them, this worked for me.
rrdtool graph test.png --start -4hour 'DEF:data=test.rrd:1:AVERAGE' PRINT:data:AVERAGE:%lf|awk 'NR>1'
Simply replace GPRINT with PRINT. You will then be able to compare the output against the pnp4nagios graphs and see that the printed data source value is the same as the value shown in the graphs.
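For comparison, the averaging step from the question can be collapsed into a single awk pass over the fetch output. This is only a sketch under assumptions: it expects the usual fetch layout (a header line naming the data sources, then rows of the form "timestamp: value ..."), and the helper name avg_first_ds is made up for illustration. It also drops the cut -c 1-4 step, which would chop the exponent off values printed in scientific notation such as 1.2345000000e+01.

```shell
# avg_first_ds: hypothetical helper, not part of rrdtool.
# Reads fetch-style output on stdin and prints the mean of the
# first data source column, ignoring rows whose value is nan.
avg_first_ds() {
  awk '
    # only rows that start with "<epoch>:" are data rows; header and blanks are skipped
    $1 ~ /^[0-9]+:$/ && $2 !~ /nan/ { sum += $2; n++ }
    END { if (n > 0) printf "%.4f\n", sum / n }
  '
}

# Usage, with the file name and options taken from the question:
#   rrdtool fetch rrdfile.rrd AVERAGE -r 300 -s -4hr | avg_first_ds
```

Matching on the timestamp column rather than skipping a fixed number of lines keeps the helper independent of how many header lines appear.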

Count number of responses to a findscu query

What is the best way to get the number of responses of a findscu query?
For now I am thinking of exporting the responses to an .xml file and count the tags. Is there a better way?
The answer to your question depends on what you mean by "best way".
Using one of findscu's --extract-xxx options and then counting the number of created files or, in the case of --extract-xml-single, counting the number of "dataset" elements would be a possible solution. However, it requires creating output files, which might slow down the process.
Alternatively, you could count the number of lines in the log output that contain the text "Find Response:", i.e. something like the following should work with "bash":
findscu dicomserver.co.uk 11112 -P -k 0008,0052=PATIENT -k 0010,0020 2>&1 | fgrep -c "Find Response:"

Get all lines that meet time condition from log file

Here is what my log file looks like:
[BCDID::16 T::LIVE_ANALYZER GCDID::16] {"t":"20:50:05","k":"7322","h":"178","s":-53.134575556764}
[BCDID::16 T::LIVE_ANALYZER GCDID::16] {"t":"20:50:06","k":"2115","h":"178","s":-53.134575556764}
[BCDID::16 T::LIVE_ANALYZER GCDID::16] {"t":"20:50:07","k":"1511","h":"178","s":-53.134575556764}
There are multiple log files with similar entries, and they are updated every second.
Here, "t":"20:50:05" is the time.
What I want to do is get all log entries between specific times from all files, starting from the end of the files.
I tried tail files*.log | grep -e "20:50:07 | 20:50:05", but it does not return anything.
How do I get all log entries between given times, starting from the end of the file, from all log files?
If you're looking for a range of records, and the format of the lines is consistent, the easiest way is probably to isolate the time field, strip out the colons, and compare the result against the range bounds with ordinary comparison operators.
A one-liner awk solution, for example:
tail files*.log | awk -v from="205006" -v to="205007" -F"\"" '{ timeasint=$4; gsub(":","",timeasint); if (timeasint >= from && timeasint <= to) print $0 }'
would get you:
[BCDID::16 T::LIVE_ANALYZER GCDID::16] {"t":"20:50:06","k":"2115","h":"178","s":-53.134575556764}
[BCDID::16 T::LIVE_ANALYZER GCDID::16] {"t":"20:50:07","k":"1511","h":"178","s":-53.134575556764}
Of course, you couldn't span across midnight (i.e., 23:59:59 - 00:00:01), but for that you'd need dates as well as times in your log anyway.
If you had dates, my suggestion would be converting them to epoch stamps (using date -d "string" or some other suitable method) and comparing the epoch stamps as integers.
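The epoch comparison suggested above could look roughly like this. It is a minimal sketch that assumes GNU date (for the -d option) and timestamps in a format date can parse; the helper names to_epoch and in_range are hypothetical.

```shell
# to_epoch: hypothetical helper; converts a date string to seconds since the epoch.
# Assumes GNU date (the -d option); -u pins the conversion to UTC.
to_epoch() {
  date -u -d "$1" +%s
}

# in_range TS FROM TO: succeeds when FROM <= TS <= TO.
in_range() {
  local ts from to
  ts=$(to_epoch "$1") || return 2
  from=$(to_epoch "$2") || return 2
  to=$(to_epoch "$3") || return 2
  [ "$ts" -ge "$from" ] && [ "$ts" -le "$to" ]
}

# A window that crosses midnight compares correctly once dates are included.
if in_range "2024-01-02 00:00:00" "2024-01-01 23:59:59" "2024-01-02 00:00:01"; then
  echo "inside window"
fi
```

Because the comparison is on plain integers, the same test works for windows of any length, not just ones within a single day.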

Bash code behaving vague: huge file processing

This is a bit of a Hadoop and Unix mix issue, and I'm really not sure which one is responsible for the error.
I have a bash script that validates a file as follows:
It checks whether the total row count of the file matches the value mentioned in the footer of the file. (The footer is metadata that contains the total number of actual rows above it, and it is positioned at the end of the data file.)
It checks whether the name of the file matches what it should be.
The function that calculates the row count and extracts the footer count of the file is:
count() {
  rowCount=$(echo $(hdfs dfs -cat ${hdfspath}/${fileName} | wc -l) - 2 | bc -l)
  footer=$(hdfs dfs -tail ${hdfspath}/${fileName} | tail -1)
  footerRecordCount=`echo $footer | sed "s/[^0-9]//g"`
}
The code snippet calling the above function & performing validation test is below:
if [[ ${footerRecordCount} -ne ${rowCount} ]]; then
  echo "error=Number of records in the file doesn't match with record count value mentioned in the footer of the file" >&2
  exit 1
else
  fn_logMessage "Footer Record Count $footerRecordCount matched with rows count $rowCount"
fi
if [ ${fileName} -ne ${actualFileName} ]; then
  echo "error=File name mismatch"
  exit 1
else
  echo "File name matched"
fi
The code looks fairly straightforward and simple; it is, and it works perfectly.
However, the issue came up when I started running this test on a huge file (>400 GB). I receive the output below:
Footer Record Count 00000003370000000002000082238384885577696960005044939533796567041020102349250692990597110000000000000000002222111111111111110200000003440000100013060089448361739204836173971223 matched with rows count 929901602
error=File name mismatch
The footer record count should actually be 929901602, but the number that comes up is some random number that doesn't exist anywhere in the file. Moreover, even though the two numbers obviously don't match, the result is reported as "matched",
while the error from the next if statement is shown.
I'm not sure who the culprit is here, Unix or Hadoop. I performed this test 3 times in a row, and every time the huge number that pops up is completely different from the previous one, so there isn't even a correlation between these large numbers.
Any idea what on earth is going wrong?
PS: Like I said, the code works perfectly for small files like 20 GB, 50 GB.
Thanks in advance.

Different results from awk and nawk

I just ran these two commands on a file having around 250 million records.
awk '{if(substr($0,472,1)=="9") print $0}' < file1.txt >> file2.txt
nawk '{if(substr($0,472,1)=="9") print $0}' < file1.txt >> file2.txt
The record length is 482. The first command gave the correct number of records in file2.txt,
i.e., 60 million, but the nawk command gives only 4.2 million.
I am confused and would like to know if anyone has come across an issue like this. How exactly is this simple command treated differently internally? Is there a buffer that can hold only up to a certain number of bytes when using nawk?
I would appreciate it if someone could shed some light on this.
My OS details are
SunOS <hostname> 5.10 Generic_147148-26 i86pc i386 i86pc
The difference probably lies in nawk's buffer limit. One of the records (lines) in your input file has probably exceeded it.
This crucial line can be found in awk.h:
#define RECSIZE (8 * 1024) /* sets limit on records, fields, etc., etc. */
Your command can be reduced to just this:
awk 'substr($0,472,1)==9'
On Solaris (which you are on), when you run awk you are by default running old, broken awk (/usr/bin/awk), so I suspect that nawk is the one producing the correct result.
Run /usr/xpg4/bin/awk with the same script and arguments and see which of your other results its output agrees with.
Also, check whether your input file was created on Windows by running dos2unix on it and seeing if its size changes; if so, re-run your awk commands on the modified file. If it was created on Windows, it will have some control-Ms in there that could be causing chaos.
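Both suspicions above (an over-long record, or stray control-Ms) can be checked directly before re-running anything. The following is a rough sketch; the helper names max_record_len and count_cr_lines are hypothetical, and on Solaris the awk call should presumably be pointed at /usr/xpg4/bin/awk or nawk rather than the old default awk.

```shell
# max_record_len FILE: hypothetical helper; prints the length of the longest line in FILE.
max_record_len() {
  awk '{ if (length($0) > max) max = length($0) } END { print max + 0 }' "$1"
}

# count_cr_lines FILE: hypothetical helper; prints how many lines contain
# a carriage return (control-M), which hints at a file created on Windows.
count_cr_lines() {
  grep -c "$(printf '\r')" "$1"
}

# Usage, with the file name from the question:
#   max_record_len file1.txt
#   count_cr_lines file1.txt
```

If the longest line is far above the expected 482 characters, a record-size limit becomes a plausible explanation; a non-zero carriage-return count points at line endings instead.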

grep -f maximum number of patterns?

I'd like to use grep on a text file with -f to match a long list (10,000) of patterns. It turns out that grep doesn't like this (who knew?). After a day, it hadn't produced anything. Smaller lists work almost instantaneously.
I was thinking I might split my long list up and do it a few times. Any idea what a good maximum length for the pattern list might be?
Also, I'm rather new with unix. Alternative approaches are welcome. The list of patterns, or search terms, are in a plaintext file, one per line.
Thank you everyone for your guidance.
From comments, it appears that the patterns you are matching are fixed strings. If that is the case, you should definitely use -F. That will increase the speed of the matching considerably. (Using 479,000 strings to match on an input file with 3 lines using -F takes under 1.5 seconds on a moderately powered machine. Not using -F, that same machine is not yet finished after several minutes.)
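To make the effect of -F concrete, here is a small self-contained sketch. The file names and sample contents are invented for illustration; the point is only that -F treats each line of the pattern file as a literal string, which changes what matches as well as how fast grep runs.

```shell
# Build two tiny sample files (names and contents are illustrative only).
workdir=$(mktemp -d)
printf 'alpha\nb.d\n' > "$workdir/patterns.txt"
printf 'alpha one\nbad line\nb.d two\ngamma\n' > "$workdir/text.txt"

# -F treats every pattern as a fixed string; -f reads one pattern per line.
fixed_matches=$(grep -F -f "$workdir/patterns.txt" "$workdir/text.txt")
echo "$fixed_matches"

# Without -F, "b.d" is a regular expression and also matches "bad line".
regex_count=$(grep -c -f "$workdir/patterns.txt" "$workdir/text.txt")
fixed_count=$(grep -c -F -f "$workdir/patterns.txt" "$workdir/text.txt")
echo "regex: $regex_count, fixed: $fixed_count"

rm -r "$workdir"
```

So switching to -F is only a drop-in speedup when the patterns really are meant as literal strings.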
I had the same problem with approx. 4 million patterns to search for in a file with 9 million lines. It seems to be a RAM problem, so I came up with this neat little workaround, which might be slower than splitting and joining but needs just this one line:
while IFS= read -r line; do grep -- "$line" fileToSearchIn; done < patternFile
I needed to use the workaround since the -F flag is no solution for files that large...
EDIT: This seems to be really slow for large files. After some more research I found 'faSomeRecords' and other really awesome tools from Kent NGS-editing-Tools.
I tried it myself by extracting 2 million FASTA records from a file of 5.5 million records. It took approx. 30 sec.
Here is a bash script you can run on your files (or if you would like, a subset of your files). It will split the key file into increasingly large blocks, and for each block attempt the grep operation. The operations are timed - right now I'm timing each grep operation, as well as the total time to process all the sub-expressions.
Output is in seconds - with some effort you can get ms, but with the problem you are having it's unlikely you need that granularity.
Run the script in a terminal window with a command of the form
./timeScript keyFile textFile 100 > outputFile
This will run the script, using keyFile as the file where the search keys are stored, and textFile as the file where you are looking for keys, and 100 as the initial block size. On each loop the block size will be doubled.
In a second terminal, run the command
tail -f outputFile
which will keep track of the output of your other process into the file outputFile
I recommend that you open a third terminal window, and that you run top in that window. You will be able to see how much memory and CPU your process is taking - again, if you see vast amounts of memory consumed it will give you a hint that things are not going well.
This should allow you to find out when things start to slow down - which is the answer to your question. I don't think there's a "magic number" - it probably depends on your machine, and in particular on the file size and the amount of memory you have.
You could take the output of the script and put it through a grep:
grep entire outputFile
You will end up with just the summaries - block size, and time taken, e.g.
Time for processing entire file with blocksize 800: 4 seconds
If you plot these numbers against each other (or simply inspect the numbers), you will see when the algorithm is optimal, and when it slows down.
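To get those summaries into a form that is easy to plot, the block size and the time can be pulled out as two columns. This is a small sketch that assumes the summary lines look exactly like the example above; the helper name summarize_timings is hypothetical.

```shell
# summarize_timings: hypothetical helper; reads the timing output on stdin
# and prints "blocksize seconds" pairs taken from the summary lines.
summarize_timings() {
  awk '/^Time for processing entire file with blocksize/ {
         size = $8
         sub(/:$/, "", size)   # drop the trailing colon after the block size
         print size, $9
       }'
}

# Usage, with the output file name used above:
#   summarize_timings < outputFile
```

The two-column output can be fed straight into a plotting tool or simply read off by eye.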
Here is the code: I did not do extensive error checking but it seemed to work for me. Obviously in your ultimate solution you need to do something with the outputs of grep (instead of piping it to wc -l which I did just to see how many lines were matched)...
#!/bin/bash
# script to look at difference in timing
# when grepping a file with a large number of expressions
# assume first argument = name of file with list of expressions
# second argument = name of file to check
# optional third argument = initial block size (default 100)
# split f1 into chunks that double in size on each pass
# and print out how long it took to process all the lines in f2
if (($# < 2 )); then
  echo "Warning: need at least two parameters."
  echo "Usage: timeScript keyFile searchFile [initial blocksize]"
  exit 1
fi
f1_linecount=$(wc -l < "$1")
echo linecount of file1 is $f1_linecount
f2_linecount=$(wc -l < "$2")
echo linecount of file2 is $f2_linecount
# use the default block size unless a third argument was given
if (($# < 3 )); then
  blockLength=100
else
  blockLength=$3
fi
while ((blockLength < f1_linecount))
do
  echo Using blocks of $blockLength
  # split is a standard utility that splits the file
  # -l tells it to break after $blockLength lines
  # and the block$blockLength parameter is a prefix for the file
  split -l "$blockLength" "$1" "block$blockLength"
  Tstart="$(date +%s)"
  Tbefore=$Tstart
  for fn in block*
  do
    echo "grep -f $fn $2 | wc -l"
    echo number of lines matched: $(grep -f "$fn" "$2" | wc -l)
    Tnow="$(date +%s)"
    echo Time taken: $((Tnow - Tbefore)) s
    Tbefore=$Tnow
  done
  echo Time for processing entire file with blocksize $blockLength: $((Tnow - Tstart)) seconds
  # double the block size for the next pass
  blockLength=$((2 * blockLength))
  # remove the split files - no longer needed
  rm block*
  echo block length is now $blockLength and f1 linecount is $f1_linecount
done
exit 0
You could certainly give sed a try to see whether you get a better result, but it is a lot of work to do either way on a file of any size. You didn't provide any details on your problem, but if you have 10k patterns I would be trying to think about whether there is some way to generalize them into a smaller number of regular expressions.
Here is a perl script "match_many.pl" which addresses a very common subset of the "large number of keys vs. large number of records" problem. Keys are accepted one per line from stdin. The two command line parameters are the name of the file to search and the field (white space delimited) which must match a key. This subset of the original problem can be solved quickly since the location of the match (if any) in the record is known ahead of time and the key always corresponds to an entire field in the record. In one typical case it searched 9400265 records with 42899 keys, matching 42401 of the keys and emitting 1831944 records in 41s. The more general case, where the key may appear as a substring in any part of a record, is a more difficult problem that this script does not address. (If keys never include white space and always correspond to an entire word the script could be modified to handle that case by iterating over all fields per record, instead of just testing the one, at the cost of running M times slower, where M is the average field number where the matches are found.)
#!/usr/bin/perl -w
use strict;
use warnings;
my $kcount = 0;
my ($infile,$test_field) = @ARGV;
if(!defined($infile) || "$infile" eq "" || !defined($test_field) || ($test_field <= 0)){
  die "syntax: match_many.pl infile field\n";
}
my %keys; # hash of keys
$test_field--; # external range (1,N) to internal range (0,N-1)
while(<STDIN>) {
  my $line = $_;
  chomp($line);
  $keys{$line} = 1;
  $kcount++;
}
print STDERR "keys read: $kcount\n";
my $records = 0;
my $emitted = 0;
open(INFILE, $infile ) or die "Could not open $infile";
while(<INFILE>) {
  if(substr($_,0,1) =~ /#/){ #skip comment lines
    next;
  }
  my $line = $_;
  chomp($line);
  $line =~ s/^\s+//;
  my @fields = split(/\s+/, $line);
  $records++;
  if(defined($fields[$test_field]) && exists($keys{$fields[$test_field]})){
    print STDOUT "$line\n";
    $emitted++;
    $keys{$fields[$test_field]}++; # a value > 1 marks a key that matched
  }
}
close(INFILE);
$kcount = 0; # reuse the counter for the number of matched keys
while( my( $key, $value ) = each %keys ){
  if($value > 1){
    $kcount++;
  }
}
print STDERR "records read: $records, emitted: $emitted; keys matched: $kcount\n";
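For readers who would rather stay in the shell, the same "whole field must equal a key" idea can be sketched with awk's associative arrays. This is not the Perl script's interface: the helper name match_field and its argument order are hypothetical, keys are read from a file instead of stdin, and the key file is assumed to be non-empty.

```shell
# match_field KEYFILE FIELD RECORDS: hypothetical helper; prints the records whose
# whitespace-delimited field number FIELD is exactly one of the keys in KEYFILE.
# Assumes KEYFILE is non-empty (the NR == FNR idiom relies on that).
match_field() {
  awk -v f="$2" '
    NR == FNR { keys[$0]; next }        # first file: load keys into a hash
    /^#/      { next }                  # skip comment lines in the records file
    { if ($f in keys) print }           # second file: one hash lookup per record
  ' "$1" "$3"
}

# Usage (file names are placeholders):
#   match_field keys.txt 2 records.txt
```

As in the Perl version, each record costs a single hash lookup, which is why this restricted form of the problem scales so much better than grep -f with a huge pattern list.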
