R time series with datapoints every 15 min - r

I have a data file containing the raw values of my measurements. The top entries look like this:
495.08
1117.728
872.712
665.632
713.296
1172.44
1302.544
1428.832
1413.536
1361.896
1126.656
644.776
1251.616
1252.824
[...]
The measurements are taken at 15-minute intervals. To put this into an R time series object I use the following code:
data <- scan("./data__2.dat",skip=1)
datats <- ts(data, frequency=24*60/15, start=c(2014,1))
But this leaves me with a plot whose time axis spans far more than one year, although the data is only from one year. Hence, it seems like the frequency is wrong. Any thoughts on how to fix this?

By doing:
library(xts)
library(lubridate)
df <- data.frame(interval = seq(ymd_hms('2014-01-21 00:00:00'),
                                by = '15 min', length.out = (60*24*365/15)),
                 data = rnorm(60*24*365/15))
ts <- xts(df, order.by = df$interval)
You get:
interval data
1 2014-01-21 00:00:00 -1.3975823
2 2014-01-21 00:15:00 -0.4710713
3 2014-01-21 00:30:00 0.9149273
4 2014-01-21 00:45:00 -0.3053136
5 2014-01-21 01:00:00 -1.2459707
6 2014-01-21 01:15:00 0.4749215
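Applied to the asker's own data, the same idea would look roughly like the sketch below. The in-line vector stands in for the `scan("./data__2.dat", skip = 1)` call from the question, and the start date is an assumption; the point is that `ts()` with `frequency = 96` counts time in days, not years, which is why the plot axis ran past one year, while an explicit POSIXct index avoids the issue entirely.

```r
library(xts)

# Stand-in for: data <- scan("./data__2.dat", skip = 1)
data <- c(495.08, 1117.728, 872.712, 665.632)

# Build an explicit 15-minute index (assumed start date) and attach it
idx <- seq(as.POSIXct("2014-01-01 00:00:00", tz = "UTC"),
           by = "15 min", length.out = length(data))
datats <- xts(data, order.by = idx)
```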

Related

Missing time series for UK Summer time change - r

I am seeing missing time series data corresponding to the GMT clock change for summer; I guess this might also happen for winter. I have two parts to my query:
How to generate the missing timestamps from the code below. The table is in xts format.
How to filter the records by time so that the result includes the missing timestamps once generated. This is only a sample dataset. Thanks.
start <- as.POSIXct("2022-03-27 00:58:00")
interval <- 2
end <- as.POSIXct("2022-03-27 03:00:00")
missing_timestamp <- data.frame(TIMESTAMP = seq(from=start, by=interval*60, to=end))
head(missing_timestamp)
TIMESTAMP
1 2022-03-27 00:58:00
2 2022-03-27 02:00:00
3 2022-03-27 02:02:00
4 2022-03-27 02:04:00
5 2022-03-27 02:06:00
6 2022-03-27 02:08:00
Update:
Related to the second query: when the code below is executed for times between 00:00 and 02:06, all rows are returned rather than only the first four records.
a <- missing_timestamp %>% filter(TIMESTAMP > ymd_hms("2022-03-27 00:00:00") & TIMESTAMP < ymd_hms("2022-03-27 02:06:00"))
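A sketch of one likely explanation and fix, assuming the system timezone is Europe/London: on 2022-03-27 the clocks jump from 01:00 to 02:00, so local times between 01:00 and 01:59 simply do not exist and `seq()` skips them. Generating the sequence in UTC keeps every 2-minute step:

```r
# Generate the sequence in UTC, where no DST jump occurs
start <- as.POSIXct("2022-03-27 00:58:00", tz = "UTC")
end   <- as.POSIXct("2022-03-27 03:00:00", tz = "UTC")
full  <- data.frame(TIMESTAMP = seq(from = start, by = 2 * 60, to = end))
head(full, 3)  # 00:58, 01:00, 01:02 - no gap
```

For the filtering part, note that `ymd_hms()` parses in UTC by default; building the filter bounds with the same `tz` as the data (e.g. `ymd_hms("...", tz = "UTC")`) keeps the absolute-time comparison consistent with what the local clock reads.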

Time Series within R (ColumnSorting)

I have a csv of real-time data inputs with timestamps and I am looking to group these data in a time series of 30 mins for analysis.
A sample of the real-time data is
Date:
2019-06-01 08:03:04
2019-06-01 08:20:04
2019-06-01 08:33:04
2019-06-01 08:54:04
...
I am looking to group them in a table with step increments of 30 minutes (i.e. 08:30, 09:00, etc.) to count the number of occurrences during each period. I created a new csv file to access through R, so that I will not corrupt the formatting of the original dataset.
Date:
2019-06-01 08:00
2019-06-01 08:30
2019-06-01 09:00
2019-06-01 09:30
First, I constructed a list of 30-minute intervals with:
sheet_csv$Date <- as.POSIXct(paste(sheet_csv$Date), format = "%Y-%m-%d %H:%M", tz = "GMT") # change to POSIXct
sheet_csv$Date <- timeDate::timeSequence(from = "2019-06-01 08:00", to = "2019-12-03 09:30",
                                         by = 1800, format = "%Y-%m-%d %H:%M", zone = "GMT")
I encountered an error "Error in x[[idx]][[1]] : this S4 class is not subsettable" for this interval.
I am relatively new to R. Please do help out where you can. Greatly Appreciated.
You probably don't need the timeDate package for something like this. One package that is very helpful for manipulating dates and times is lubridate - you may want to consider it going forward.
I used your example and added another date/time for illustration.
To create your 30 minute intervals, you could use cut and seq.POSIXt to create a sequence of date/times with 30 minute breaks. I used your minimum date/time to start with (rounding down to nearest hour) but you can also specify another date/time here.
Use of table will give you frequencies after cut.
sheet_csv <- data.frame(
  Date = c("2019-06-01 08:03:04",
           "2019-06-01 08:20:04",
           "2019-06-01 08:33:04",
           "2019-06-01 08:54:04",
           "2019-06-01 10:21:04")
)

sheet_csv$Date <- as.POSIXct(sheet_csv$Date, format = "%Y-%m-%d %H:%M:%S", tz = "GMT")

as.data.frame(table(cut(sheet_csv$Date,
                        breaks = seq.POSIXt(from = round(min(sheet_csv$Date), "hours"),
                                            to = max(sheet_csv$Date) + .5 * 60 * 60,
                                            by = "30 min"))))
Output
Var1 Freq
1 2019-06-01 08:00:00 2
2 2019-06-01 08:30:00 2
3 2019-06-01 09:00:00 0
4 2019-06-01 09:30:00 0
5 2019-06-01 10:00:00 1
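Since the answer recommends lubridate, here is a hedged sketch of the equivalent binning with `floor_date()`: each timestamp is rounded down to its 30-minute bin and then tabulated. One behavioral difference to be aware of: unlike `cut()` with explicit breaks, empty bins (09:00, 09:30) do not appear in the output.

```r
library(lubridate)

# Same example timestamps as above
dates <- as.POSIXct(c("2019-06-01 08:03:04", "2019-06-01 08:20:04",
                      "2019-06-01 08:33:04", "2019-06-01 08:54:04",
                      "2019-06-01 10:21:04"), tz = "GMT")

# Round each timestamp down to its 30-minute bin, then count per bin
bins <- floor_date(dates, unit = "30 minutes")
table(bins)
```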

Inserting Row in Missing Hourly Data in R using padr package - weird error

I am new to R and I am having some issues with the padr package described here.
I have an hourly data set that is missing hours, and I would like to insert rows with a fill value for the missing data. I am trying to use the pad function and the fill_by_value function from the padr package, but I am getting an error when I use the pad function.
The data called Mendo is presented as:
Date.Local Time.Local Sample.Measurement
2016-01-01 00:00:00 3
2016-01-01 00:01:00 4
2016-01-01 00:02:00 1
2016-01-01 00:04:00 4
2016-01-01 00:05:00 5
I want the final data to look like:
Date.Local Time.Local Sample.Measurement
2016-01-01 00:00:00 3
2016-01-01 00:01:00 4
2016-01-01 00:02:00 1
2016-01-01 00:03:00 999
2016-01-01 00:04:00 4
2016-01-01 00:05:00 5
I am under the impression the padr package wants a datetime POSIXct column, so I use the command
Mendo$Time.Local <- as.POSIXct(paste(Mendo$Date.Local, Mendo$Time.Local), format = '%Y-%m-%d %H:%M')
to get:
Time.Local Sample.Measurement
2016-01-01 00:00:00 3
2016-01-01 00:01:00 4
2016-01-01 00:02:00 1
2016-01-01 00:04:00 4
2016-01-01 00:05:00 5
Now I try to use the pad function as instructed in the link provided above. My line of code is:
Mendo_padded <- Mendo %>% pad
and I get the error:
Error in if (total_invalid == nrow(x)) { :
missing value where TRUE/FALSE needed
In addition: Warning message:
In if (unique(nchar(x_char)) == 10) { :
the condition has length > 1 and only the first element will be used
If this were to work, I would then use the command
Mendo_padded %>% fill_by_value(Sample.Measurement, value = 999)
to get all the missing hours Sample.Measurement value to be 999.
I would love feedback, suggestions or comments on what I may be doing wrong and how I can go about getting this code to work! Thank you!
It seems that pad can automatically detect which column is of Date / POSIXct / POSIXlt type, so you do not need to supply Mendo$Time.Local to pad. The padding will be applied on hour intervals.
library(magrittr)
library(padr)
PM10 <- read.csv(file = "../Downloads/hourly_81102_2016.csv",
                 stringsAsFactors = FALSE)  # don't change the columns to factors
Mendo <- PM10[PM10$County.Name == "Mendocino", ]
Mendo$Time.Local <- as.POSIXct(paste(Mendo$Date.Local, Mendo$Time.Local),
                               format = '%Y-%m-%d %H:%M')
Mendo <- Mendo[, c("Time.Local", "Sample.Measurement")]
# pad() detects the POSIXct column automatically, so it need not be named
Mendo_padded <- Mendo %>% na.omit %>%
  pad(interval = 'hour',
      start_val = NULL, end_val = NULL, group = NULL,
      break_above = 1)
You may also consider using the column Time.GMT and Date.GMT because date and time may depend on where you (your computer) are.
Edit: As suggested by OP, na.omit should be used before pad to avoid NA values in the Date column.
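For readers without the csv file, the full pad-then-fill workflow can be sketched on toy hourly data. This is an illustrative stand-in, not the asker's dataset; 999 is the fill value requested in the question, and the 02:00 hour is deliberately missing.

```r
library(padr)

# Toy hourly data with one missing hour (02:00)
Mendo <- data.frame(
  Time.Local = as.POSIXct(c("2016-01-01 00:00:00",
                            "2016-01-01 01:00:00",
                            "2016-01-01 03:00:00"), tz = "UTC"),
  Sample.Measurement = c(3, 4, 1)
)

# pad() inserts the missing 02:00 row (NA), fill_by_value() sets it to 999
Mendo_padded <- fill_by_value(pad(Mendo, interval = "hour"),
                              Sample.Measurement, value = 999)
```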

How to merge couples of Dates and values contained in a unique csv

We have a csv file with dates in Excel serial-number format and NAVs for Manager A and Manager B, as follows:
Date,Manager A,Date,Manager B
41346.6666666667,100,40932.6666666667,100
41347.6666666667,100,40942.6666666667,99.9999936329992
41348.6666666667,100,40945.6666666667,99.9999936397787
41351.6666666667,100,40946.6666666667,99.9999936714362
41352.6666666667,100,40947.6666666667,100.051441180137
41353.6666666667,100,40948.6666666667,100.04877283951
41354.6666666667,100.000077579585,40949.6666666667,100.068400298752
41355.6666666667,100.00007861475,40952.6666666667,100.070263374822
41358.6666666667,100.000047950872,40953.6666666667,99.9661095940006
41359.6666666667,99.9945012295984,40954.6666666667,99.8578245935173
41360.6666666667,99.9944609274138,40955.6666666667,99.7798031949116
41361.6666666667,99.9944817907402,40956.6666666667,100.029523604978
41366.6666666667,100,40960.6666666667,100.14859511024
41367.6666666667,99.4729804387476,40961.6666666667,99.7956029017769
41368.6666666667,99.4729804387476,40962.6666666667,99.7023420799123
41369.6666666667,99.185046151864,40963.6666666667,99.6124531927299
41372.6666666667,99.1766469096966,40966.6666666667,99.5689030038018
41373.6666666667,98.920738006398,40967.6666666667,99.5701493637685
,,40968.6666666667,99.4543885041996
,,40969.6666666667,99.3424528379521
We want to create a zoo object with the following structure [Dates, Manager A Nav, Manager B Nav].
After reading the csv file with:
data = read.csv("...", header=TRUE, sep=",")
we set an index for splitting the object and use lapply to split
INDEX <- seq(1, by = 2, length = ncol(data) / 2)
data.zoo <- lapply(INDEX, function(i, data) data[i:(i+1)], data = zoo(data))
I'm stuck on the fact that the dates are in Excel format and don't know how to convert them. Is the problem set up correctly?
If all you want to do is convert the dates to proper dates, you can do this easily enough. The thing you need to know is the origin date: your numbers represent the integer and fractional number of days that have passed since that origin. Usually this is day 0 on Jan 0 1900, but be careful, as I don't think this is always the case (Excel's Mac versions historically used a 1904 origin). You can try this:
# Excel origin is day 0 on Jan 0 1900, but treats 1900 as leap year so...
data$Date <- as.Date( data$Date , origin = "1899/12/30")
data$Date.1 <- as.Date( data$Date.1 , origin = "1899/12/30")
# For more info see ?as.Date
If you are interested in keeping the times as well, you can use as.POSIXct, but you must also specify the timezone (UTC by default):
data$Date <- as.POSIXct(data$Date, origin = "1899/12/30" )
head(data)
# Date Manager.A Date.1 Manager.B
# 1 2013-03-13 16:00:00 100 2012-01-24 100.00000
# 2 2013-03-14 16:00:00 100 2012-02-03 99.99999
# 3 2013-03-15 16:00:00 100 2012-02-06 99.99999
# 4 2013-03-18 16:00:00 100 2012-02-07 99.99999
# 5 2013-03-19 16:00:00 100 2012-02-08 100.05144
# 6 2013-03-20 16:00:00 100 2012-02-09 100.04877
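To get from there to the requested [Date, Manager A NAV, Manager B NAV] zoo object, one sketch (using a small stand-in data frame for the `read.csv` result, since each manager carries its own date column) is to build one zoo series per manager and let `merge()` align them on the converted dates:

```r
library(zoo)

# Stand-in for the read.csv() result in the question
data <- data.frame(Date      = c(41346.67, 41347.67),
                   Manager.A = c(100, 100),
                   Date.1    = c(40932.67, 40942.67),
                   Manager.B = c(100, 99.99999))

# One zoo series per manager, indexed by its own converted dates
a <- zoo(data$Manager.A, as.Date(data$Date,   origin = "1899-12-30"))
b <- zoo(data$Manager.B, as.Date(data$Date.1, origin = "1899-12-30"))

# merge() aligns on dates; NA where one manager has no observation
navs <- merge(a, b)
```

On the real file, the trailing `,,` rows in the Manager A columns read in as NA dates, so dropping incomplete rows (e.g. with `na.omit` per pair) before building each series would be needed.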

R/ggplot error: too large for hashing

I'm plotting a time series graph using ggplot; however, whenever the size of the data frame exceeds around 600 rows, ggplot throws the following error:
Error in anyDuplicated.default(breaks) : length 1136073601 is too large for hashing
In fact, it just gave me the same error when I tried to plot 400 items.
The data is melted like so, except there are four variables- speed, dir, temp and pressure:
time variable value
1 2006-07-01 00:00:00 speed 4.180111
2 2006-07-02 00:00:00 speed 5.527226
3 2006-07-09 00:00:00 speed 6.650821
4 2006-07-16 00:00:00 speed 4.380063
5 2006-07-23 00:00:00 speed 5.641709
6 2006-07-30 00:00:00 speed 7.636913
7 2006-08-06 00:00:00 speed 7.128334
8 2006-08-13 00:00:00 speed 4.719046
...
201 2006-07-01 00:00:00 temp 17.140069
202 2006-07-02 00:00:00 temp 17.517480
203 2006-07-09 00:00:00 temp 14.211002
204 2006-07-16 00:00:00 temp 20.121617
205 2006-07-23 00:00:00 temp 17.933492
206 2006-07-30 00:00:00 temp 15.244583
My code to plot these is based on what I found here: http://had.co.nz/ggplot2/scale_date.html
qplot(time, value, data = test3, geom = "line", group = variable) +
  facet_grid(variable ~ ., scale = "free_y")
Any pointers and I'd be very grateful!!
To convert the date column from character to date-time, I'm using:
test$time <- strptime(test$time, format="%Y-%m-%d %H:%M:%S")
test$time <- as.POSIXct(test$time, format="%H:%M:%S")
test3 = melt(test,id="time")
class(test$time) returns "[1] "POSIXt" "POSIXct""
Try setting a timezone explicitly in the call to as.POSIXct(), as in https://gist.github.com/925852
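A minimal sketch of that suggestion, assuming the timestamps are meant as UTC: parse the character column with an explicit `tz` in a single step, so the resulting POSIXct values carry a definite timezone when ggplot computes its date-time breaks.

```r
# Toy stand-in for the melted data frame from the question
test <- data.frame(time  = c("2006-07-01 00:00:00", "2006-07-02 00:00:00"),
                   value = c(4.18, 5.53))

# One-step parse with an explicit timezone (replaces strptime + as.POSIXct)
test$time <- as.POSIXct(test$time, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
```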