How do Cron "Steps" Work?

How do Cron "Steps" Work? - unix

I'm running into a situation where a cron job I thought was running every 55 minutes is actually running at 55 minutes after the hour and at the top of the hour. Actually, it's not a cron job, but it's a PHP scheduling application that uses cron syntax.
When I ask this application to schedule a job every 55 minutes, it creates a crontab line like the following.
*/55 * * * *
This crontab line ends up not running a job every 55 minutes. Instead a job runs at 55 minutes after the hours, and at the top of the hour. I do not desire this. I've run this though a cron tester, and it verifies the undesired behavior is correct cron behavior.
This leads me to looking up what the / actually means. When I looked at the cron manual I learned the slash indicated "steps", but the manual itself is a little fuzzy on that that means
Step values can be used in conjunction with ranges. Following a range with "<number>" specifies skips of the number's value through the range. For example, "0-23/2" can be used in the hours field to specify command execution every other hour (the alternative in the V7 standard is "0,2,4,6,8,10,12,14,16,18,20,22"). Steps are also permitted after an asterisk, so if you want to say "every two hours", just use "*/2".
The manual's description ("specifies skips of the number's value through the range") is a little vague, and the "every two hours" example is a little misleading (which is probably what led to the bug in the application)
So, two questions:
How does the unix cron program use the "step" information (the number after a slash) to decide if it should skip running a job? (modular division? If so, on what? With what conditions deciding a "true" run, and which decisions not? Or is it something else?)
Is it possible to configure a unix cron job to run every "N" minutes?

Step values can be used in conjunction with ranges. Following a range
with "<number>" specifies skips of the number's value through the range. For
example, "0-23/2" can be used in the hours field to specify command
execution every other hour (the alternative in the V7 standard is
"0,2,4,6,8,10,12,14,16,18,20,22"). Steps are also permitted after an
asterisk, so if you want to say "every two hours", just use "*/2".
The "range" being referred to here is the range given before the /, which is a subrange of the range of times for the particular field. The first field specifies minutes within an hour, so */... specifies a range from 0 to 59. A first field of */55 specifies all minutes (within the range 0-55) that are multiples of 55 -- i.e., 0 and 55 minutes after each hour.
Similarly, 0-23/2 or */2 in the second (hours) field specifies all hours (within the range 0-23) that are multiples of 2.
If you specify a range starting other than at 0, the number (say N) after the / specifies every Nth minute/hour/etc starting at the lower bound of the range. For example, 3-23/7 in the second field means every 7th hour starting at 03:00 (03:00, 10:00, 17:00).
This works best when the interval you want happens to divide evenly into the next higher unit of time. For example, you can easily specify an event to occur every 1, 2, 3, 4, 5, 6, 10, 12, 15, 20, or 30 minutes, or every 1, 2, 3, 4, 6, or 12 hours. (Thank the Babylonians for choosing time units with so many nice divisors.)
Unfortunately, cron has no concept of "every 55 minutes" within a time range longer than an hour.
If you want to run a job every 55 minutes (say, at 00:00, 00:55, 01:50, 02:45, etc.), you'll have to do it indirectly. One approach is to schedule a script to run every 5 minutes; the script then checks the current time, and does its work only once every 11 times it's called.
Or you can use multiple lines in your crontab file to run the same job at 00:00, 00:55, 01:50, etc. -- except that a day is not a multiple of 55 minutes. If you don't mind having a longer or shorter interval once a day, week, or month, you can write a program to generate a large crontab with as many entries as you need, all running the same command at a specified time.

I came across this website that is helpful with regard to cron jobs.
https://crontab.guru
And specific to your case with * /55
https://crontab.guru/#*/55_*_*_*_*
It helped to get a better understanding of the concept behind it.

There is another tool named at that should be considered. It can be used instead of cron to achieve what the topic starter wants. As far as I remember, it is pre-installed in OS X but it isn't bundled with some Linux distros like Debian (simply apt install at).
It runs a job at a specific time of day and that time can be calculated using a complex specification. In our case the following can be used:
You can also give times like now + count time-units, where the time-units can be minutes, hours, days, or weeks and you
can tell at to run the job today by suffixing the time with today and to run the job tomorrow by suffixing the time with tomorrow.
The script every2min.sh is executed every 2 minutes. It delays next execution every time the instance is running:
#!/bin/sh
at -f ./every2min.sh now + 2 minutes
echo "$(date +'%F %T') running..." >> /tmp/every2min.log
Which outputs
2019-06-27 14:14:23 running...
2019-06-27 14:16:00 running...
2019-06-27 14:18:00 running...
As at does not know about "seconds" unit, the execution time will be rounded to full minute after the first run. But for a given task (with 55 minutes range) it should not be a big problem.
There also might be security considerations
For both at and batch, commands are read from standard input or the file specified with the -f option and executed. The working directory, the environment (except for the variables BASH_VERSINFO, DISPLAY, EUID, GROUPS, SHELLOPTS, TERM, UID, and _) and the umask are retained from the time of invocation.
This is the easiest way to schedule something to be ran every X minutes I've seen so far.

Related

Is it possible to encode date AND time (with some caveats) into 12 bits?

I have at my disposal 16 bits. Of them, 4 bits are the header and cannot be touched. This leaves us with 12 bits. I would like to encode date and time data into them. These are essentially logs being sent over LPWAN.
Obviously, it's impossible to encode proper generic date and time into it. Since the unix timestamp uses 32 bits, and projects like Compact Time Format use 5 bytes.
Let's say we don't really need the year, because this information is available elsewhere. Let's also say the time resolution of seconds doesn't have to be super accurate, so we can split the seconds into 30 second intervals. If we were to simply encode the data as is then:
4 bits month (0-11)
5 bits day (0-31)
5 bits hour (0-23)
6 bits minute (0-59)
1 bit second (0,30)
-----------------------------
21 bits
21 bits is much better than 32. But it's still not 12. I could subtract one bit from the minutes (rounding to the nearest even minute), and remove the seconds but that still leaves us with 19 bits. Which is still far from 12.
Just wondering if it's possible, and if anyone has any ideas.

12 bits can hold 2^12 = 4096 values, which feels pretty tight for a task. Not sure much can be done in terms of compressing a date time into a 4096 number. It is too little space to represent this data.
There are some workarounds, none of them able to achieve what you want, but maybe something you could use anyway:
Split date and time. Alternate with some algorithm between sending date/time, one bit can be used to indicate what data is being sent. This leaves 11 bits to encode either date or time. You could go a bit further and split time like this as well. Receiving side can then reconstruct a full date time having access to the previously received data.
You could have a schema where one date packet is sent as a starting point, and subsequent packets are incremented in N-second intervals from the start of the epoch
Remove date time from data completely, saving 12 bits, but send it periodically as a stand-alone heartbeat/datetime packet.
You could try compressing the whole data packet which could allow using more bits to represent date time and still fit into a smaller overall packet size
If data is sent at reasonable fixed intervals, you could use a circular counter of an interval of N seconds, this may work if you have few devices and you can keep track of when they start transmitting. For example a satellite was launched on XYZ date time, it send counter every 30 seconds, we received counter value of 100, to calculate date we use simple math XYZ + 30*100 seconds

No. Unless you'd be happy with representing less than a span of a day and a half. You can just count 4096 30-second intervals, and find that that will cover 34 hours and eight minutes. 4096 two-minute intervals is just four times that, or five days, 16 hours, and 32 minutes. Still a small fraction of a year.
If you can assure that the difference between successive log entries is small, then you can stuff that in 12 bits. You will need a special entry to give an initial date, and maybe you could insert such an entry when the difference betweem successive entries is too large.
#oleksii has some other good suggestions as well.

Display used CPU hours with slurm

I have a user account on a super computer where jobs are handled with slurm.
I would like to know the total amount of CPU hours that I have consumed on this super computer. I think that's an understandable question, because there is only a limited number of CPU hours available per project. I'm surprised that an answer is not easy to find.
I know that there are all these commands like sacct, sreport, sshare, etc... but it seems that there is no simple command that displays the used CPU hours.
Can someone help me out?

As others have commented, sacct should give you that information. You will need to look at the man page to get information for past jobs. You can specify a --starttime and --endtime to restrict your query to match your allocation as it ends/renews. The -l options should get you more information than you need so you can get a smaller set of options by specifying what you need with --format.
In your instance, the correct answer is to ask the administrators. You have been given an allocation of time to draw from. They likely have a system that will show you your balance and you can reconcile your balance against the output of sacct. Also, if the system you are using has different node types such as high memory, GPU, MIC, or old, they will likely charge you differently for those resources.

You can get an overview of the used CPU hours with the following:
sacct -SYYYY-mm-dd -u username -ojobid,start,end,alloccpu,cputime | column -t
You will could calculate the total accounting units (SBU in our system) multiplying CPUTime by AllocCPU which means multiplying the total (sysem+user) CPU time by the amount of CPU used.
An example:
JobID NodeList State Start End AllocCPUS CPUTime
------------ --------------- ---------- ------------------- ------------------- ---------- ----------
6328552 tcn[595-604] CANCELLED+ 2019-05-21T14:07:57 2019-05-23T16:48:15 240 506-17:12:00
6328552.bat+ tcn595 CANCELLED 2019-05-21T14:07:57 2019-05-23T16:48:16 24 50-16:07:36
6328552.0 tcn[595-604] FAILED 2019-05-21T14:10:37 2019-05-23T16:48:18 240 506-06:44:00
6332520 tcn[384,386,45+ COMPLETED 2019-05-23T16:06:04 2019-05-24T00:26:36 72 25-00:38:24
6332520.bat+ tcn384 COMPLETED 2019-05-23T16:06:04 2019-05-24T00:26:36 24 8-08:12:48
6332520.0 tcn[384,386,45+ COMPLETED 2019-05-23T16:06:09 2019-05-24T00:26:33 60 20-20:24:00
6332530 tcn[37,41,44,4+ FAILED 2019-05-23T17:11:31 2019-05-25T09:13:34 240 400-08:12:00
6332530.bat+ tcn37 FAILED 2019-05-23T17:11:31 2019-05-25T09:13:34 24 40-00:49:12
6332530.0 tcn[37,41,44,4+ CANCELLED+ 2019-05-23T17:11:35 2019-05-25T09:13:34 240 400-07:56:00
The fields are shown in the the manpage. They can be shown as -oOPTION (in lower case or in proper POSIX notation --format='Option,AnotherOption...' (a list is in the man).
So far so good. But there is a big caveat here:
What you see here is perfect to get an idea of what you have run or what to expect in terms of CPU / hours. But this will not necessarily reflect your real budget status, as in many cases each node / partition may have an extra parameter, the weight, which is a parameter set for accounting purposes and not part of SLURM. For instance,the GPU nodes may have a weight value of x3, which means that each GPU/hour is measured as 3 SBU instead of 1 for budgetary purposes. What I mean to say is that you can use sacct to gain insight on the CPU times but this will not necessarily reflect how much SBU credits you still have.

Monitor and alert prometheus anomaly in number of metrics

We have a number of prometheus servers, each one monitors its own region (actually 2 per region), there are also thanos servers that can query multiple regions, and we also use alertmanager for the alerting.
Lately, we had an issue that few metrics stopped to report and we only discovered it when we needed the metrics.
We are trying to find out how to monitor the changes in the number of reported metrics in a scalable system that grow and shrink as required.
I'll be glad about your advice.

You can either count the number of timeseries in the head chunk (last 0-2 hours) or the rate at which you're ingesting samples:
prometheus_tsdb_head_series
or
rate(prometheus_tsdb_head_samples_appended_total[5m])
Then you compare said value with itself a few minutes/hours ago, e.g.
prometheus_tsdb_head_series / prometheus_tsdb_head_series offset 5m
and see whether it fits within an expected range (say 90-110%) and alert otherwise.
Or you can look at the metrics with the highest cardinality only:
topk(100, count({__name__=~".+"}) by (__name__))
Note however that this last expression can be quite costly to compute, so you may want to avoid it. Plus the comparison with 5 minutes ago will not be as straightforward:
label_replace(topk(100, count({__name__=~".+"}) by (__name__)), "metric", "$1", "__name__", "(.*)")
/
label_replace(count({__name__=~".+"} offset 5m) by (__name__), "metric", "$1", "__name__", "(.*)")
You need the label_replace there because the match for the division is done on labels other than __name__. Computing this latest expression takes ~10s on my Prometheus instance with 150k series, so it's anything but fast.
And finally, whichever approach you choose, you're likely to get a lot of false positives (whenever a large job is started or taken down), to the point that it's not going to be all that useful. I would personally not bother trying.

Graphite: sumSeries() does not sum

since this morning at 6 I'm experiencing a strange behavior of graphite.
We have two machine that collects date about calls received, I plot the charts and I also plot the sum of these two charts.
While the charts of single machine are fine, the sum is not working anymore.
This is a screenshot of graphtite and also grafana that shows how 4+5=5 (my math teacher is going to die for this)
This wrong sum happens also for other metrics. And I don't get why.
storage-scheams.conf
# Schema definitions for whisper files. Entries are scanned in order,
# and first match wins.
#
# [name]
# pattern = regex
# retentions = timePerPoint:timeToStore, timePerPoint:timeToStore, ...
[default_1min_for_1day]
pattern = .*
retentions = 60s:1d,1h:7d,1d:1y,7d:5y
storage-aggregations.conf
# Schema definitions for whisper files. Entries are scanned in order,
# and first match wins.
#
# [name]
# pattern = regex
# retentions = timePerPoint:timeToStore, timePerPoint:timeToStore, ...
[time_data]
pattern = ^stats\.timers.*
xFilesFactor = 0.5
aggregationMethod = average
[storage_space]
pattern = \.postgresql\..*
xFilesFactor = 0.1
aggregationMethod = average
[default_1min_for_1day]
pattern = .*
xFilesFactor = 0
aggregationMethod = sum
aggregation-rules.conf This may be the cause, but it was working before 6AM. But anyway i don' see the stats_count.all metric.
stats_counts.all.rest.req (60) = sum stats_counts.srv_*_*.rest.req
stats_counts.all.rest.res (60) = sum stats_counts.srv_*_*.rest.res

It seems that the two series were not alligned by the timestamp, so the sum could not summarize the points. This is visible i the following chart, where selecting a time highliths point in two diffrent minute (charts from grafana).
I don't know why this happened. I resetarted some services (This charts comes from statsd for python and bucky). Maybe was the fault of one of those.
NOTE. Now this works, however, I would like to know if someone knows the reason and how I can solve it.

One thing you need to ensure is that the services sending metrics to Graphite do it at the same granularity as your smallest retention period or the period you will be rendering your graphs in. If the data points in the graph will be every 60 seconds, you need to send metrics every 60 seconds from each service. If the graph will be showing a data point for every hour, you can send your metrics every hour. In your case the smallest period is every 60 seconds.
I encountered a similar problem in our system - graphite was configured with the smallest retention period of 10s:6h, but we had 7 instances of the same service generating lots of metrics and configured them to send data every 20 seconds in order to avoid overloading our monitoring. This caused an almost unavoidable misalignment, where the series from the different instances will have a datapoint every 20 seconds, but some would have it at 10, 30, 50 and others will have it at 0, 20, 40. Depending on how many services were aligned, we would get a very jagged graph, looking similar to yours.
What I did to solve this problem for time periods that were returning data in 10 second increments was to use the keepLastValue function -> keepLastValue(1). I used 1 as parameter, because I only wanted to skip 1 None value, because I knew our service causes this by sending once every 20 seconds rather than every 10. This way the series generated by different services never had gaps, so sums were closer to the real number and the graphs stopped having the jagged look. I guess this introduced a bit of extra lag in the monitoring, but this is acceptable for our use case.

ISO 8601 Repeating Intervals without Date

With ISO8601, is there a way to specify a repeating interval which starts at a given time for any day, and repeats over time in that day?
For example, does the following hold:
R2/T09:00:00Z/PT1H = R/2000-01-01T09:00:00/P1D + R/2000-01-01T10:00:00/P1D?
Or is the former not correct under the standard?
The motivation behind this is to run a task at 9am and 10am every day.

No, Iso 8601 cannot irregular repetitions. You would need evaluate/run both of those expressions.
Cron expressions would be a better option as it is widely supported, especially for running tasks. You can find cron expression builders on the web and a library for every language (and OS support with crontab in Unix systems). This expression would handle your use case 0 0 9,10 ? * * * and would run at 9 and 10 am of every day of every year.
Sorry for the response 2 years later.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex