gnuplot, calculating and plotting monthly averages

I have a datafile with several months of minute data with lines like "2016-02-02 13:21(\t)value(\n)".
I need to plot the data (no problem with that) and calculate + plot an average for each month.
Is it possible in gnuplot?
I am able to get an overall average using
fit a "datafile" using 1:3 via a
I am also able to specify some time range for the fit using
fit [now_secs-3600*24*31:now_secs] b "datafile" using 1:3 via b
... and then plot them with
plot a t "Total average",b t "Last 31 days"
But I have no idea how to calculate and plot an average for each month (= one stepped line showing each month's average).

Here is a way to do it purely in gnuplot. This method can be adapted (with a fair amount of effort) to work with files that cross a year boundary or span more than one year. It works whether or not the data starts in January. It computes the ordinary average for each month (the arithmetic mean), treating each data point as one value for the month. With somewhat significant modification, it can be made to work with weighted averages as well.
This makes significant use of the stats command to compute values. It is a little long, partly because I commented it heavily. It uses 5.0 features (NaN for undefined values and in-memory datablocks instead of temporary files), but the comments note how to change these for earlier versions.
Note: This script must be run before setting time mode. The stats command will not work in time mode. Time conversions are handled by the script functions.
datafile = "datafile"               # path to the data file - set this to your file
data_time_format = "%Y-%m-%d %H:%M" # date format in file
date_cols = 2 # Number of columns consumed by date format
# get numeric month value of time - 1=January, 12=December
get_month(x) = 0+strftime("%m",strptime(data_time_format,x))
# get numeric year value of time
get_year(x) = 0+strftime("%Y",strptime(data_time_format,x))
# get internal time representation of day 1 of month x in year y
get_month_first(x,y) = strptime("%Y-%m-%d",sprintf("%d-%d-01",y,x))
# get internal time representation of date
get_date(x) = strptime(data_time_format,x)
# get date string in file format corresponding to day y in month x of year z
get_date_string(x,y,z) = strftime(data_time_format,strptime("%Y-%m-%d",sprintf("%04d-%02d-%02d",z,x,y)))
# determine if date represented by z is in month x of year y
check_valid(x,y,z) = (get_date(z)>=get_month_first(x,y))&(get_date(z)<get_month_first(x+1,y))
# Determine year and month range represented by file
year = 0
stats datafile u (year=get_year(strcol(1)),get_month(strcol(1))) nooutput
month_min = STATS_min
month_max = STATS_max
# list of average values for each month
aves = ""
# fill missing months at beginning of year with 0
do for [i=1:(month_min-1)] {
    aves = sprintf("%s %d", aves, 0)
}
# compute average of each month and store it at the end of aves
do for [i=month_min:month_max] {
    # In versions prior to 5.0, replace NaN with 1/0
    stats datafile u (check_valid(i,year,strcol(1)) ? column(date_cols+1) : NaN) nooutput
    aves = sprintf("%s %f", aves, STATS_mean)
}
# day on which to plot average
baseday = 15
# In versions prior to 5.0, replace $k with a temporary file name
set print $k
# Change this to start at 1 if we want to fill in prior months
do for [i=month_min:month_max] {
    print sprintf("%s %s", get_date_string(i,baseday,year), word(aves,i))
}
set print
This script will create either an in-memory datablock or, with the noted changes for earlier versions, a temporary file. It contains data in the same format as the original file, but with one entry per month holding that month's average value.
At the beginning we need to define our date format and the number of columns that the date format consumes. From then on it is assumed that the data file is structured as datetime value. Several functions are defined which make extensive use of the strptime function (to convert a date string into gnuplot's internal integer representation) and the strftime function (to convert an internal representation back into a string). Some of these functions convert both ways in order to extract the necessary values. Note the addition of 0 in the get_month and get_year functions to convert the string value to an integer.
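For example, the 0+... idiom can be checked interactively (a quick sketch using one of the sample timestamps from the demonstration below):
t = strptime("%Y-%m-%d %H:%M", "2016-02-03 15:22")  # internal time in seconds
print strftime("%m", t)       # prints the string "02"
print 0 + strftime("%m", t)   # prints the integer 2 - adding 0 forces the conversion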
We do several steps with the data in order to build our resulting datablock/file.
1. Use the stats command to determine the first and last month and the year. We are assuming only one year is present. This step needs to be modified heavily to work with more than one year: months in a second year would need to be numbered 13 to 24, in a third year 25 to 36, and so on. We would also need to modify this line to capture multiple years, probably requiring two passes (see the sketch after this list).
2. Build up a string which contains space-separated average values for each month. This is done by applying the stats command once for each month. The check_valid function checks whether a value is in the month of interest; a value that isn't is assigned NaN, which causes stats to ignore it.
3. Loop over the months of interest and build a datablock/temporary file with one entry per month holding the average value for that month. Here the average value is assigned to the start of the 15th day of the month; this can easily be changed to any other desired time. The get_date_string function is used to assign the value to a time.
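As a rough, untested sketch of the year-spanning numbering mentioned in step 1 (year_min and month_index are names introduced here, not part of the script above), an absolute month index could look something like:
stats datafile u (get_year(strcol(1))) nooutput  # extra pass: find the earliest year
year_min = STATS_min
# January of year_min is month 1, January of the following year is month 13, and so on
month_index(x) = (get_year(x) - year_min)*12 + get_month(x)
The month loops and check_valid would then have to be rewritten in terms of this index.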
Now to demonstrate this, suppose that we have the following data
2016-02-03 15:22 95
2016-02-20 18:03 23
2016-03-10 16:03 200
2016-03-15 03:02 100
2016-03-18 02:02 200
We wish to plot this data along with the average value for each month. We can run the above script, and we will get a datablock $k (make the commented change near the bottom to use a temporary file instead) containing the following
2016-02-15 00:00 59.000000
2016-03-15 00:00 166.666667
These are exactly the average values for each month: (95+23)/2 = 59 and (200+100+200)/3 ≈ 166.67. Now we can plot with
set xdata time
set timefmt data_time_format
set key outside top right
plot $k u 1:3 w points pt 7 t "Monthly Average",\
datafile u 1:3 with lines t "Original Data"
Here, just for illustration, I used points with the averages. Feel free to use any style that you want. If you choose to use steps, you will very likely want to adjust the day that is assigned† in the datablock/temporary file (probably the first or last day in the month depending on how you want to do it).
It is usually easier with a task like this to do some outside preprocessing, but this demonstrates that it is possible in pure gnuplot.
† Regarding changing the day that is assigned: using any specific day in the month is easy, as long as it is a day that occurs in every month (the 1st through the 28th) - just change baseday. For other choices, the get_date_string function needs to be modified.
For example, to use the last day, the function can be defined as
get_date_string(x,y,z) = strftime(data_time_format,strptime("%Y-%m-%d",sprintf("%04d-%02d-01",z,x+1))-24*60*60)
This version actually computes the first day of the next month, and then subtracts one whole day from that. The second argument is ignored in this version, but preserved to allow it to be used without having to make any additional changes to the script.
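For example, assuming baseday = 1 (or the last-day variant above) was used when writing $k, a stepped monthly-average line could be drawn much like the plot above:
plot $k u 1:3 w steps lw 2 t "Monthly Average",\
datafile u 1:3 with lines t "Original Data"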

With a recent version of gnuplot, you have the stats command and you can do something like this:
stats "datafile" using 1:3 name m0
month_sec=3600*24*30.5
do for [month=1:12] {
stats [now_secs-(i+1)*month_sec:(i+0)*now_secs-month_sec] "datafile" using 1:3 name sprintf("m%d")
}
You get the m0_mean value for the total mean, and you get m1_mean, m2_mean, ... variables for the previous months, etc., all defined in gnuplot.
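For example, assuming each range contained data, the computed means can be inspected with value() (a small illustration, not part of the original commands):
print m0_mean
do for [month=1:12] { print sprintf("m%d_mean = %g", month, value(sprintf("m%d_mean", month))) }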
Finally, to plot them you should do something like:
plot 'datafile', for [month=0:12] value(sprintf("m%d_mean", month))
See help stats, help value and help sprintf for more information on the above commands.

Related

My data does not convert to time series in R

My data contains several measurements in one day. It is stored in a CSV file and looks like this:
[screenshot of the data frame]
The V1 column is of factor type, so I'm adding an extra column of date-time type: vd$Vdate <- as_datetime(vd$V1):
[screenshot of the data frame with the added Vdate column]
Then I'm trying to convert the vd data into a time series: vd.ts <- ts(vd, frequency = 365)
But then the dates are gone:
[screenshot of the resulting time series without dates]
I just cannot figure out what I am doing wrong! Could someone help me, please?
Your dates are gone because you need to build the ts object from your variables (V1, ..., V7) disregarding the date field; the ts command will tell R how to structure the dates.
Also, I noticed that you seem to have hourly data, so you need to provide the frequency that is appropriate to your time step, not 365. Considering what you posted, your frequency seems a bit odd; I recommend finding a way to establish the frequency correctly. For example, if I have hourly data for 365 days of the year, then I have a frequency of 365.25*24 (the 0.25 accounts for leap years).
So the following is just an example; it still won't work properly with what I see (it is a limited view of your dataset, so I am not 100% sure).
# Build ts data (univariate)
vd.ts <- ts(vd$V1, frequency = 365, start = c(2019, 4))
# check to see if it is structured correctly
print(vd.ts, calendar = T)
Finally my time series is working properly. I used
ts <- zoo(measurements, date_times)
and I found out that date_times had to be converted with as_datetime(), as otherwise they were of character type. The measurements were converted into a data.frame.

Define the week as 6 days in length in R

I want to create a time series with date and quantity as variables. However, I always have zero observations on Sundays. Therefore I want to define the week as 6 days in length in R. Any suggestions?

creating interval object in R using lubridate package [duplicate]

Hi, I have data from Uber about pick-ups in NYC.
I'm trying to add a column to the raw data that indicates, for each row, to which time interval (represented by a single timepoint at the beginning of the time interval) it belongs.
I want to:
1. Create a vector containing all relevant timepoints (i.e. every 15 minutes).
2. Use the int_diff function from the lubridate package on this vector to create an interval object.
3. Run a loop over all the time points in the raw data and, for each data point, indicate to which interval (represented by a single timepoint at the beginning of the time interval) it belongs.
I tried looking for explanations of how to use the int_diff function, but I don't understand how my vector should look or how the syntax of int_diff works.
Thanks for the help :)
Is this what you have in mind?
library(lubridate)                                       # for mdy_hm()

start <- mdy_hm('4/11/2014 0:00')                        # start of the period
end <- mdy_hm('5/12/2015 0:00')                          # end
time_seq <- seq(from = start, to = end, by = '15 mins')  # sequence by 15 minutes
times <- mdy_hm(c('4/11/2014 0:12', '4/11/2014 1:24'))   # times to find intervals for
dat <- data.frame(times)
dat$intervals <- cut(times, breaks = time_seq)           # assign each time to an interval
# turn this into a set of indicator columns, one per interval,
# with a 1 marking the interval that the observation falls into
intervals_cols <- model.matrix(~ -1 + intervals, dat)

gnuplot: incorrect time data (julian date, x-axis) when plotting

I'm trying to plot data as a function of time using gnuplot. I am having an issue with the time data (x-axis) being incorrect. This issue is similar to the one posted here, but that post does not appear to resolve my problem.
To start, here is a subset of file "data.txt" that shows the error
996,1.81014336621038094E+07,1.04721577434964254E+07
997,1.81073887058396861E+07,1.04688883975542113E+07
998,1.81123550412347727E+07,1.04660263576711770E+07
999,1.81165058190760165E+07,1.04628236696091276E+07
1000,1.81200135215993598E+07,1.04593579882744774E+07
1001,1.81230027468293682E+07,1.04556943748914227E+07
1002,1.81256090021481551E+07,1.04518411259850748E+07
1003,1.81280483217409961E+07,1.04478383895292878E+07
1004,1.81311435732491128E+07,1.04439282290004119E+07
The first column corresponds to a Julian date, and columns 2 and 3 contain data. To plot the data, I am using the following interactive gnuplot commands:
set datafile separator ","
set terminal png
set xdata time
set timefmt "%j"
set output "test_figure.png"
plot "data.txt" using 1:2 with lines lw 2 lt 1
This produces the following plot:
Figure with incorrect timeseries
I get the correct figure if I alter the data.txt file to be (the only difference is the leading zeros in the first column for the first 4 lines):
0996,1.81014336621038094E+07,1.04721577434964254E+07
0997,1.81073887058396861E+07,1.04688883975542113E+07
0998,1.81123550412347727E+07,1.04660263576711770E+07
0999,1.81165058190760165E+07,1.04628236696091276E+07
1000,1.81200135215993598E+07,1.04593579882744774E+07
1001,1.81230027468293682E+07,1.04556943748914227E+07
1002,1.81256090021481551E+07,1.04518411259850748E+07
1003,1.81280483217409961E+07,1.04478383895292878E+07
1004,1.81311435732491128E+07,1.04439282290004119E+07
Figure with correct timeseries
Is there a way that I can write the gnuplot code to not require the leading zeros? The actual dataset has Julian dates 1 to 10,000, and if I write the data with leading zeros to fill 5 digits (i.e., 00001), I get an "illegal day of year" error.
I did notice that the x-axis tick labels are different between the 2 plots (probably hints to the source of the issue that I am having), but I can't determine what is going wrong.
Note: This "error" only appears when I go from 999 to 1000. Going from Julian date 9 to 10 does not have this out-of-order issue.
Thanks ahead of time for the help!
I don't know what was causing the original issue where the data was getting reordered when plotting, but I realized that I was interpreting the data incorrectly. The first column wasn't actually a Julian date, but was instead the number of hours since the start date. So, a value of 25 wasn't 25 days into the data but was actually 1 day and 1 hour into the data.
Replacing the first column (counter) with "day-hour":
41-12,1.81014336621038094E+07,1.04721577434964254E+07
41-13,1.81073887058396861E+07,1.04688883975542113E+07
41-14,1.81123550412347727E+07,1.04660263576711770E+07
41-15,1.81165058190760165E+07,1.04628236696091276E+07
41-16,1.81200135215993598E+07,1.04593579882744774E+07
41-17,1.81230027468293682E+07,1.04556943748914227E+07
41-18,1.81256090021481551E+07,1.04518411259850748E+07
41-19,1.81280483217409961E+07,1.04478383895292878E+07
41-20,1.81311435732491128E+07,1.04439282290004119E+07
and then using set timefmt "%j-%H" allowed me to obtain the correct plot.
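Putting it together, the adjusted script would look roughly like this (the same commands as before, with only the time format changed):
set datafile separator ","
set terminal png
set xdata time
set timefmt "%j-%H"
set output "test_figure.png"
plot "data.txt" using 1:2 with lines lw 2 lt 1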
Let's first improve the x-axis labels in order to understand what happens:
set format x "%Y-%m-%d"
Then we increase the resolution of the resulting PNG, and we plot with linespoints instead of lines only. The script now looks like this:
set datafile separator ","
set terminal png size 1200,600
set xdata time
set timefmt "%j"
set output "test_figure.png"
set format x "%Y-%m-%d"
plot "data.txt" using 1:2 with linespoints lw 2 lt 1
This is the result:
There are some points in April 1970 and some points in September 1972. The time format specifier %j means the day of the year. The points in April 1970 correspond to day 100, the points in September 1972 correspond to days around 997, both counted from January 1, 1970, the Unix epoch.
This means gnuplot interprets the values 996 ... 999 as days counted from January 1, 1970. The values 1000 ... 1004 are (incorrectly) read as day 100 counted from January 1, 1970; the fourth digit is ignored (!).
If you add a leading 0 in front of the values 996 ... 999, they are now read as day 99, which makes things worse.
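To see this parsing behaviour directly, the conversions can be checked on the command line (a small illustration consistent with the interpretation above; exact behaviour may vary between gnuplot versions):
print strftime("%Y-%m-%d", strptime("%j", "996"))   # day 996 counted from 1970-01-01, i.e. September 1972
print strftime("%Y-%m-%d", strptime("%j", "1000"))  # only "100" is read, i.e. day 100 of 1970 (April 1970)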
I stop here as you have already figured out how to read the data :)

R: subsetting timestamped dataframe periodically

I have a csv file that contains many thousands of timestamped data points. The file includes the following columns: Date, Tag, East, North & DistFromMean. The following is a sample of the data in the file:
The data is recorded approximately every 15 minutes for 12 tags over a month. What I want to do is select subsets of the data, starting from the first date entry, e.g. every 3 hours; but because the tags transmit at slightly different rates, I need minimum and maximum start and end times.
I have found a related previous question but don't understand the answer well enough to implement it.
The solution could first ask for the Tag number, then the period required, perhaps in minutes from the start time (i.e. every 3 hrs or 180 minutes), then the minimum and maximum time range, both of which would be constant for whatever period was used. The minimum and maximum would probably need to be plus and minus 6 minutes from the selected period.
As the code below shows, I've managed to read in the file, change the Date format to POSIXlt and extract data within a specific time frame, but the bit I'm stuck on is extracting the data every nth minute and within a range.
TestData<- read.csv ("TestData.csv", header=TRUE, as.is=TRUE)
TestData$Date <- strptime(TestData$Date, "%d/%m/%Y %H:%M")
TestData[TestData$Date >= as.POSIXlt("2014-02-26 7:10:00") & TestData$Date < as.POSIXlt("2014-02-26 7:18:00"),]
