rrdtool does not keep max

I create an RRD with one DS, the processing time of a script, with a 10-minute step.
It has three archives, with MAX as the aggregation function: every value over a week, hourly over a month, daily over two years.
rrdtool create RRD --start 1411561343 --step 600s \
DS:processtime:GAUGE:1200:0:U \
RRA:MAX:0.5:1:1008 \
RRA:MAX:0.5:6:744 \
RRA:MAX:0.5:144:732
I populate it from a file which contains all records from 2014/09/24 at 14:32:23 (1411561943) to 2016/01/11 at 11:07:25 (1452503245).
The maximum value, 23340, is for 2015/09/11 at 14:18:35 (1441973915).
When I graph or dump the rrd, I get a lot of NaN, and I see neither this maximum nor many other significant values.
The max I have in the rrd is <!-- 2015-08-06 02:00:00 CEST / 1438819200 --> <row><v>8.0004250000e+02</v></row>.
Is it related to the fact that the intervals are not exactly 10 minutes but between 8 and 12?
If so, is there a way to change this behavior?

First, rrdtool resamples the data to the interval given in --step; only then is the data processed further. The intervals are aligned to epoch timestamps (starting 1970-01-01 00:00:00 GMT). If you want to have a maximum for a shorter interval, you have to lower the step and feed data more frequently.

As Tobi reminds you, your samples will first be normalised/resampled into regular 10min intervals, regardless of when they arrived. You cannot disable this behaviour except by inserting the data exactly on the time interval boundaries.
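Roughly, with made-up numbers (the 100 value and the exact update times below are hypothetical; step boundaries fall on epoch-aligned :00/:10/:20 marks):
rrdtool update RRD 1441973315:100      # 2015-09-11 14:08:35
rrdtool update RRD 1441973915:23340    # 2015-09-11 14:18:35
For a GAUGE DS, the 23340 reading is taken to apply from 14:08:35 to 14:18:35, so it is spread, time-weighted, across the 14:00-14:10 and 14:10-14:20 steps. The stored per-step samples are therefore blends of neighbouring readings, and a short peak never lands in a bin verbatim unless it covers an entire step.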
Your MAX values may not be for the intervals that you seem to think they are.
RRA:MAX:0.5:1:1008 \
RRA:MAX:0.5:6:744 \
RRA:MAX:0.5:144:732
Your RRAs are for 1, 6 and 144 samples. Since your interval is 10 min, these RRAs correspond to 10 min, 1 hour, and 1 day respectively, and each will hold the maximum-valued normalised sample within that interval.
Also, you have an XFF of 0.5, meaning that more than 50% of the required samples must be present for the RRA to store a value, and a heartbeat on your DS of 20 min, meaning a sample is unknown if there is a gap that long.
You might like to add an RRA:AVERAGE:0.5:1:1008 for reference so that you can verify what data you are collecting and track down the source of the NaNs.
Note that, when graphing or using xport, rrdtool will calculate a Max on the fly from available data if an appropriate RRA covering the entire requested time window is not available.
You can verify the MAX calculation by using an AVERAGE RRA as above, and fetching this pre-aggregation data to compare with the values stored in your MAX RRA.
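Concretely, the suggested checks might look like this (a sketch: the filename, DS and RRA sizes are taken from the question, while the fetch window is an arbitrary range around the expected peak):
rrdtool create RRD --start 1411561343 --step 600 \
DS:processtime:GAUGE:1200:0:U \
RRA:AVERAGE:0.5:1:1008 \
RRA:MAX:0.5:1:1008 \
RRA:MAX:0.5:6:744 \
RRA:MAX:0.5:144:732
# after re-populating, compare normalised samples against stored maxima
rrdtool fetch RRD AVERAGE --start 1441900000 --end 1442000000
rrdtool fetch RRD MAX --start 1441900000 --end 1442000000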

Related

Fetch data at every 2 minutes in SQLite

Can anyone please help me fetch data from my SQLite table at every 2 minutes between my start time and stop time?
I have two columns, Data and TimeStamp. I am filtering between two timestamps and it works fine, but what I am trying to do is return my data at every 2-minute interval. For example, if my start time is 2016-12-15 10:00:00 and my stop time is 2016-12-15 10:10:00, the result should be at 2016-12-15 10:00:00, 2016-12-15 10:02:00, 2016-12-15 10:04:00 ....
Add, to your where clause, an expression that looks for 2 minute boundaries:
strftime("%s", TimeStamp) % 120 = 0
This assumes you have data on exact, 2-minute boundaries. It will ignore data between those points.
strftime("%s", TimeStamp) converts your time stamp string into a single number representing the number of seconds since Jan 1st, 1970. The % 120 does modulo arithmetic resulting in 0 every 120 seconds. If you want minute boundaries, use 60. If you want hourly, use 3600.
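Put together with the time-range filter from the question (table and column names as given there), the whole query might look like:
SELECT TimeStamp, Data
FROM table
WHERE TimeStamp >= '2016-12-15 10:00:00' AND
      TimeStamp < '2016-12-15 10:10:00' AND
      strftime("%s", TimeStamp) % 120 = 0;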
What's more interesting -- and I've used this -- is to take all the data between boundaries and average them together:
SELECT CAST(strftime("%s", TimeStamp) / 120 AS INTEGER) * 120 as stamp, AVG(Data)
FROM table
WHERE TimeStamp >= '2016-12-15 10:00:00' AND
TimeStamp < '2016-12-15 10:10:00'
GROUP BY stamp;
This averages all data with timestamps in the same 2-minute "bin". The second date comparison is < rather than <=, because with <= the last bin would average only one sample whereas the other bins are averages of multiple values. You could also add MAX(Data) and MIN(Data) columns if you want to know how much the data changed within each bin.

gnuplot, calculating and plotting monthly averages

I have a datafile with several months of minute data with lines like "2016-02-02 13:21(\t)value(\n)".
I need to plot the data (no problem with that) and calculate + plot an average for each month.
Is it possible in gnuplot?
I am able to get an overall average using
fit a "datafile" using 1:3 via a
I am also able to specify some time range for the fit using
fit [now_secs-3600*24*31:now_secs] b "datafile" using 1:3 via b
... and then plot them with
plot a t "Total average",b t "Last 31 days"
But I have no idea how to calculate and plot an average for each month (i.e. one stepped line showing each month's average).
Here is a way to do it purely in gnuplot. This method can be adapted (with no small amount of effort) to work with files that cross a year boundary or span more than one year. It works just fine whether the data starts with January or not. It computes the ordinary average for each month (the arithmetic mean), treating each data point as one value for the month. With somewhat significant modification, it can be made to work with weighted averages as well.
This makes a significant use of the stats function to compute values. It is a little long, partly because I commented it heavily. It uses 5.0 features (NaN for undefined values and in-memory datablocks instead of temporary files), but comments note how to change these for earlier versions.
Note: This script must be run before setting time mode. The stats function will not work in time mode. Time conversions are handled by the script functions.
data_time_format = "%Y-%m-%d %H:%M" #date format in file
date_cols = 2 # Number of columns consumed by date format
# get numeric month value of time - 1=January, 12=December
get_month(x) = 0+strftime("%m",strptime(data_time_format,x))
# get numeric year value of time
get_year(x) = 0+strftime("%Y",strptime(data_time_format,x))
# get internal time representation of day 1 of month x in year y
get_month_first(x,y) = strptime("%Y-%m-%d",sprintf("%d-%d-01",y,x))
# get internal time representation of date
get_date(x) = strptime(data_time_format,x)
# get date string in file format corresponding to day y in month x of year z
get_date_string(x,y,z) = strftime(data_time_format,strptime("%Y-%m-%d",sprintf("%04d-%02d-%02d",z,x,y)))
# determine if date represented by z is in month x of year y
check_valid(x,y,z) = (get_date(z)>=get_month_first(x,y))&(get_date(z)<get_month_first(x+1,y))
# Determine year and month range represented by file
year = 0
stats datafile u (year=get_year(strcol(1)),get_month(strcol(1))) nooutput
month_min = STATS_min
month_max = STATS_max
# list of average values for each month
aves = ""
# fill missing months at beginning of year with 0
do for[i=1:(month_min-1)] {
aves = sprintf("%s %d",aves,0)
}
# compute average of each month and store it at the end of aves
do for[i=month_min:month_max] {
# In versions prior to 5.0, replace NaN with 1/0
stats datafile u (check_valid(i,year,strcol(1))?column(date_cols+1):NaN) nooutput
aves = sprintf("%s %f",aves,STATS_mean)
}
# day on which to plot average
baseday = 15
# In version prior to 5.0, replace $k with a temporary file name
set print $k
# Change this to start at 1 if we want to fill in prior months
do for [i=month_min:month_max] {
print sprintf("%s %s",get_date_string(i,baseday,year),word(aves,i))
}
set print
This script will create either an in-memory datablock or, for earlier versions (with the noted changes), a temporary file, holding data similar to the original file but with one entry per month containing that month's average.
At the beginning we need to define our date format and the number of columns that the date format consumes. From then on it is assumed that the data file is structured as datetime value. Several functions are defined which make extensive use of the strptime function (to convert a date string to an internal integer) and the strftime function (to convert an internal representation to a string). Some of these functions convert both ways in order to extract the necessary values. Note the addition of 0 in the get_month and get_year functions to convert a string value to an integer.
We do several steps with the data in order to build our resulting datablock/file.
Use the stats function to compute the first and last month and the year. We are assuming only one year is present. This step needs to be modified heavily if we need to work with more than one year. In particular months in a second year would need to be numbered 13 - 24 and in a third year 25 - 36 and so on. We would need to modify this line to capture multiple years as well. Probably two passes would be needed.
Build up a string which contains space separated values for the average value for each month. This is done by applying the stats function once for each month. The check_valid function checks if a value is in the month of interest, and a value that isn't is assigned NaN which causes the stats function to ignore it.
Loop over the months of interest and build a datablock/temporary file with one entry for each month with the average value for that month. In this case, the average value is assigned to the start of the 15th day of the month. This can be easily changed to any other desired time. The get_date_string function is used for assigning the value to a time.
Now to demonstrate this, suppose that we have the following data
2016-02-03 15:22 95
2016-02-20 18:03 23
2016-03-10 16:03 200
2016-03-15 03:02 100
2016-03-18 02:02 200
We wish to plot this data along with the average value for each month. We can run the above script, and we will get a datablock $k (make the commented change near the bottom to use a temporary file instead) containing the following
2016-02-15 00:00 59.000000
2016-03-15 00:00 166.666667
These are exactly the average values for each month. Now we can plot with
set xdata time
set timefmt data_time_format
set key outside top right
plot $k u 1:3 w points pt 7 t "Monthly Average",\
datafile u 1:3 with lines t "Original Data"
Here, just for illustration, I used points with the averages. Feel free to use any style that you want. If you choose to use steps, you will very likely want to adjust the day that is assigned† in the datablock/temporary file (probably the first or last day in the month depending on how you want to do it).
It is usually easier with a task like this to do some outside preprocessing, but this demonstrates that it is possible in pure gnuplot.
† Regarding changing the day that is assigned, using any specific day in the month is easy, as long as it is a day that occurs in every month (dates from the 1st to the 28th) - just change baseday. For other values modifications to the get_date_string function need to be made.
For example, to use the last day, the function can be defined as
get_date_string(x,y,z) = strftime(data_time_format,strptime("%Y-%m-%d",sprintf("%04d-%02d-01",z,x+1))-24*60*60)
This version actually computes the first day of the next month, and then subtracts one whole day from that. The second argument is ignored in this version, but preserved to allow it to be used without having to make any additional changes to the script.
With a recent version of gnuplot, you have the stats command and you can do something like this:
stats "datafile" using 1:3 name "m0"
month_sec = 3600*24*30.5
do for [month=1:12] {
    stats [now_secs-(month+1)*month_sec:now_secs-month*month_sec] "datafile" using 1:3 name sprintf("m%d", month)
}
You get m0_mean for the total mean, and variables m1_mean, m2_mean, etc. for the previous months, all defined inside gnuplot.
Finally, to plot them you should do something like:
plot 'datafile', for [month=0:12] value(sprintf("m%d_mean", month))
See help stats, help value and help sprintf for more information on the above commands.

R: subsetting timestamped dataframe periodically

I have a csv file that contains many thousands of timestamped data points. The file includes the following columns: Date, Tag, East, North and DistFromMean.
The data is recorded approximately every 15 minutes for 12 tags over a month. What I want to do is select subsets of the data, starting from the first date entry, e.g. every 3 hours, but because the tags transmit at slightly different rates I need minimum and maximum bounds on each start and end time.
I have found a related previous question but don't understand the answer well enough to implement it.
The solution could first ask for the tag number, then the period required, perhaps in minutes from the start time (i.e. every 3 hrs or 180 minutes), then the minimum and maximum time range, both of which would be constant for whatever period was used. The minimum and maximum would probably need to be plus and minus 6 minutes from the selected period.
As the code below shows, I've managed to read in the file, change the Date format to POSIXlt, and extract data within a specific time frame, but the bit I'm stuck on is extracting the data every nth minute and within a range.
TestData<- read.csv ("TestData.csv", header=TRUE, as.is=TRUE)
TestData$Date <- strptime(TestData$Date, "%d/%m/%Y %H:%M")
TestData[TestData$Date >= as.POSIXlt("2014-02-26 7:10:00") & TestData$Date < as.POSIXlt("2014-02-26 7:18:00"),]
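A minimal sketch of that periodic filter, building on the code above (the helper name, its defaults, and the tag value are hypothetical, chosen to match the 180-minute period and plus/minus 6-minute tolerance described):
# Hypothetical helper: keep rows within +/- tol minutes of every
# period-minute boundary, counted from the tag's first observation.
subset_periodic <- function(df, tag, period = 180, tol = 6) {
  d <- df[df$Tag == tag, ]
  t <- as.POSIXct(d$Date)
  mins <- as.numeric(difftime(t, min(t), units = "mins"))
  off <- mins %% period                    # minutes past the previous boundary
  d[off <= tol | off >= period - tol, ]    # inside the +/- tol window
}
every3h <- subset_periodic(TestData, tag = 1)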

Role of frequency parameter in ts

How does the ts() function use its frequency parameter? What is the effect of assigning wrong values as frequency?
I am trying to use 1.5 years of website usage data to build a time series model so that I can forecast the usage for coming periods. I am using data at daily level. What should be the frequency here - 7 or 365 or 365.25?
The frequency is "the" period at which seasonal cycles repeat. I use "the" in scare quotes since, of course, there are often multiple cycles in time series data. For instance, daily data often exhibit weekly patterns (a frequency of 7) and yearly patterns (a frequency of 365 or 365.25 - the difference often does not matter).
In your case, I would assume that weekly patterns dominate, so I would assign frequency=7. If your data exhibits additional patterns, e.g., holiday effects, you can use specialized methods accounting for multiple seasonalities, or work with dummy coding and a regression-based framework.
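As a quick sketch (with usage standing in for the question's daily series; the numbers are placeholders):
usage <- rnorm(540, mean = 1000, sd = 100)  # ~1.5 years of fake daily data
x <- ts(usage, frequency = 7)               # one seasonal cycle per week
fit <- stl(x, s.window = "periodic")        # trend / weekly seasonal / remainder
plot(fit)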
Here, the frequency parameter is not a frequency that you can observe in the data of your time series. Instead, you have to specify the frequency at which samples of the time series were taken. In your case, this is simply 1 day, or 1.
The value you give here will influence the results you get later when running analysis operations (examples are average requests per time unit, or a Fourier transform to get the (real) frequencies in the data). E.g. if you wanted to get all your results in units of hours instead of days, you would pass 24 instead of 1 as the frequency, because your data samples were taken at 24-hour intervals.

Using R to subset overlapping daily sensor data

I have a data set (3.2 million rows) in R which consists of pairs of time (milliseconds) and volts. The sensor that gathers the data only runs during the day so the time is actually the milliseconds since start-up that day.
For example, if the sensor runs 12 hours per day, then the maximum possible time value for one day is 43,200,000 ms (12h * 60m * 60s * 1000ms).
The data is continually added to a single file, which means there are many overlapping time values:
X: [1,2,3,4,5,1,2,3,4,5,1,2,3,4,5...] // example if range was 1-5 for one day
Y: [voltage readings at each point in time...]
I would like to separate each "run" into its own data frame so that I can clearly see individual days. Currently, when I plot the entire data set, it is incredibly muddy, because all of the days are overlaid in a single plot. Thanks for any help.
If your data.frame df has columns X and Y, you can use diff to find every time X goes down (meaning a new day, it sounds like):
df$Day = cumsum(c(1, diff(df$X) < 0))
Day1 = df[df$Day==1,]
plot(Day1$X, Day1$Y)
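If you want every day at once rather than filtering one at a time, split does the same thing in one step (base R; the index 2 is just an example):
days <- split(df, df$Day)      # list with one data frame per run
plot(days[[2]]$X, days[[2]]$Y) # plot the second day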
