I have different sets of data with the following format
Time Value1 Value2 ....
11/04/2015 15:12:22 1 2 ....
11/04/2015 15:13:46 1 2 ....
And I want to group them in intervals of 15 minutes. I can do this with the following code
data$time = cut(data$time, breaks = "15 min")
data.grouped <- aggregate(data[, -1], by = list(time = data$time), median)
The problem is that the time field in the output has the following values
12/04/2015 16:12
12/04/2015 16:27
12/04/2015 16:42
12/04/2015 16:57
And I want the times to be :00 :15 :30 or :45. Is there any way of forcing the intervals to be like this or a different approach to merge the data that allows it?
Sample data from dput:
structure(list(time = structure(list(sec = c(49, 5, 21, 37, 54,
10, 38), min = c(12L, 13L, 13L, 13L, 13L, 14L, 22L), hour = c(15L,
15L, 15L, 15L, 15L, 15L, 16L), mday = c(11L, 11L, 11L, 11L, 11L,
11L, 12L), mon = c(3L, 3L, 3L, 3L, 3L, 3L, 3L), year = c(116L,
116L, 116L, 116L, 116L, 116L, 116L), wday = c(1L, 1L, 1L, 1L,
1L, 1L, 2L), yday = c(101L, 101L, 101L, 101L, 101L, 101L, 102L
), isdst = c(1L, 1L, 1L, 1L, 1L, 1L, 1L), zone = c("CEST", "CEST",
"CEST", "CEST", "CEST", "CEST", "CEST"), gmtoff = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_)), .Names = c("sec", "min", "hour", "mday", "mon",
"year", "wday", "yday", "isdst", "zone", "gmtoff"), class = c("POSIXlt",
"POSIXt")), value1 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("time",
"value1"), row.names = c(NA, -7L), class = "data.frame")
Starting with your dput (call it df), first convert the time column, which is POSIXlt in your dput, to POSIXct, then floor it to the nearest 15 minutes below (use lubridate::round_date instead of floor_date if you want the closest 15 minutes in either direction):
df$time = as.POSIXct(df$time)
df$time15 = lubridate::floor_date(df$time, unit = "15 min")
df
# time value1 time15
# 1 2016-04-11 15:12:49 0 2016-04-11 15:00:00
# 2 2016-04-11 15:13:05 0 2016-04-11 15:00:00
# 3 2016-04-11 15:13:21 0 2016-04-11 15:00:00
# 4 2016-04-11 15:13:37 0 2016-04-11 15:00:00
# 5 2016-04-11 15:13:54 0 2016-04-11 15:00:00
# 6 2016-04-11 15:14:10 0 2016-04-11 15:00:00
# 7 2016-04-12 16:22:38 0 2016-04-12 16:15:00
You can then aggregate using the time15 column as the grouper.
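A minimal sketch of that aggregation step, on made-up data (the values and the UTC time zone are assumptions, not the asker's data). To stay free of package dependencies it floors via epoch seconds rather than lubridate::floor_date, which only gives the same result for time zones whose offset is a multiple of 15 minutes:

```r
# Hypothetical sample data: irregular timestamps with one numeric column
df <- data.frame(
  time = as.POSIXct(c("2016-04-11 15:12:49", "2016-04-11 15:13:05",
                      "2016-04-11 15:16:10", "2016-04-12 16:22:38"),
                    tz = "UTC"),
  value1 = c(1, 3, 5, 7)
)

# Floor each timestamp to the start of its 15-minute (900 s) interval
df$time15 <- as.POSIXct(floor(as.numeric(df$time) / 900) * 900,
                        origin = "1970-01-01", tz = "UTC")

# Median of value1 within each 15-minute interval
grouped <- aggregate(value1 ~ time15, data = df, FUN = median)
grouped
```

The time15 column stays a POSIXct, so the result can be plotted or merged by time directly.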
Here is an example you can replicate with your data frame. First, I create a dummy time series (ts) of POSIXct values at 5-minute intervals, then group them into 15-minute intervals using dplyr.
ts <- seq.POSIXt(as.POSIXct("2017-01-01", tz = "UTC"),
as.POSIXct("2017-02-01", tz = "UTC"),
by = "5 min")
ts <- as.data.frame(ts)
library(dplyr)
ts %>%
group_by(interval = cut(ts, breaks = "15 min")) %>%
summarise(count= n())
Output
# A tibble: 2,977 x 2
interval count
<fct> <int>
1 2017-01-01 00:00:00 3
2 2017-01-01 00:15:00 3
3 2017-01-01 00:30:00 3
4 2017-01-01 00:45:00 3
5 2017-01-01 01:00:00 3
6 2017-01-01 01:15:00 3
7 2017-01-01 01:30:00 3
8 2017-01-01 01:45:00 3
9 2017-01-01 02:00:00 3
10 2017-01-01 02:15:00 3
# ... with 2,967 more rows
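Note that the example above lands on :00, :15, :30 and :45 only because the series starts at midnight. As far as I can tell, cut() with breaks = "15 min" starts its bins at the first timestamp truncated to the minute, which is where offsets like :12 come from. A sketch, on made-up timestamps, of passing explicit quarter-hour breaks instead:

```r
# Hypothetical timestamps that do not start on a quarter hour
t <- as.POSIXct(c("2016-04-11 15:12:49", "2016-04-11 15:20:00",
                  "2016-04-11 15:31:10"), tz = "UTC")

# First break: the earliest timestamp floored to a 15-minute (900 s) boundary
start <- as.POSIXct(floor(as.numeric(min(t)) / 900) * 900,
                    origin = "1970-01-01", tz = "UTC")

# Breaks every 15 minutes, extending past the last observation
brks <- seq(start, max(t) + 900, by = "15 min")

# Each timestamp is assigned to the interval that starts at a break
bins <- cut(t, breaks = brks)
table(bins)
```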
I had a data frame with a column labelled Date_Time_GMT_3 which contained date/times. I used the Date_Time_GMT_3 column to create another data frame with 3 extra columns that have the month, year and day separated. This new data frame looks like so:
df = structure(list(Date_Time_GMT_3 = structure(list(sec = c(0, 0,
0, 0, 0, 0), min = c(0L, 0L, 0L, 0L, 0L, 0L), hour = c(8L, 8L,
8L, 8L, 8L, 8L), mday = c(1L, 1L, 1L, 1L, 1L, 1L), mon = c(5L,
5L, 5L, 5L, 5L, 5L), year = c(121L, 121L, 121L, 121L, 121L, 121L
), wday = c(2L, 2L, 2L, 2L, 2L, 2L), yday = c(151L, 151L, 151L,
151L, 151L, 151L), isdst = c(0L, 0L, 0L, 0L, 0L, 0L), zone = c("EST",
"EST", "EST", "EST", "EST", "EST"), gmtoff = c(NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_)), tzone = "EST", class = c("POSIXlt",
"POSIXt")), name = c("X20676880_X3WR_AIR_Stationary", "X20819740_X3WR_U_Stationary",
"X20819740_X3WR_S_Stationary", "X21092860_X3WR_U_Compare", "X20676883_13WR_U_Stationary",
"X20676883_13WR_S_Stationary"), value = c(11.431, 11.625, NA,
NA, 10.651, NA), month = c(6, 6, 6, 6, 6, 6), year = c(2021,
2021, 2021, 2021, 2021, 2021), day = c(1L, 1L, 1L, 1L, 1L, 1L
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))
The code I used to get the month day and year columns from the Date_Time_GMT_3 column looks like this
mutate(month = lubridate::month(Date_Time_GMT_3),
year = lubridate::year(Date_Time_GMT_3),
day = lubridate::day(Date_Time_GMT_3))
Is there a way to use a lubridate function to get a time column? I've tried this code:
mutate(month = lubridate::month(Date_Time_GMT_3),
year = lubridate::year(Date_Time_GMT_3),
day = lubridate::day(Date_Time_GMT_3),
#New LINE OF CODE
time = lubridate::hms(Date_Time_GMT_3))
When I use that new line of code I get this error
Warning message:
Problem with `mutate()` column `TIME`.
i `TIME = lubridate::hms(Date_Time_GMT_3)`.
i Some strings failed to parse, or all strings are NAs
Any ideas how to make it work?
It doesn't work because hms() expects strings that contain only hour, minute and second triples, whereas your values have a date before the time, so you need to remove that portion before passing them to hms(). I have used substr since all the dates have the same format, in this case YYYY-MM-DD, so keep everything starting from the 11th character.
lubridate::hms(substr(df$Date_Time_GMT_3, 11, nchar(df$Date_Time_GMT_3)))
[1] "8H 0M 0S" "8H 0M 0S" "8H 0M 0S" "8H 0M 0S" "8H 0M 0S" "8H 0M 0S"
In dplyr
df %>%
mutate(hms = lubridate::hms(substr(Date_Time_GMT_3, 11, nchar(Date_Time_GMT_3))))
# A tibble: 6 x 7
Date_Time_GMT_3 name value month year day hms
<dttm> <chr> <dbl> <dbl> <dbl> <int> <Period>
1 2021-06-01 08:00:00 X20676880_X3WR_AIR_Station~ 11.4 6 2021 1 8H 0M 0S
2 2021-06-01 08:00:00 X20819740_X3WR_U_Stationary 11.6 6 2021 1 8H 0M 0S
3 2021-06-01 08:00:00 X20819740_X3WR_S_Stationary NA 6 2021 1 8H 0M 0S
4 2021-06-01 08:00:00 X21092860_X3WR_U_Compare NA 6 2021 1 8H 0M 0S
5 2021-06-01 08:00:00 X20676883_13WR_U_Stationary 10.7 6 2021 1 8H 0M 0S
6 2021-06-01 08:00:00 X20676883_13WR_S_Stationary NA 6 2021 1 8H 0M 0S
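An alternative sketch that does not depend on character positions: format() can extract the time of day directly. The sample vector and the UTC time zone are made up for illustration; the resulting strings have the hour:minute:second shape that hms() parses.

```r
# Hypothetical date-times standing in for the Date_Time_GMT_3 column
x <- as.POSIXct(c("2021-06-01 08:00:00", "2021-06-01 13:45:10"), tz = "UTC")

# Keep only the time-of-day part as a character vector
time_of_day <- format(x, "%H:%M:%S")
time_of_day
```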
I want to find the total sum of running minutes of a battery per month and year. For this I have the following condition:
If Battery.voltage < 50 then "Yes", otherwise "No".
Note: For calculating the total sum of minutes, we can use the time stamp column, which holds day, month, year, hour and minutes.
This is my data:
# Time.stamp Battery.voltage Condition
# 1 01/04/2016 00:00 51 No
# 2 01/04/2016 00:01 52 No
# 3 01/04/2016 00:02 45 Yes
# 4 01/04/2016 00:03 48 Yes
# 5 01/04/2016 00:04 49 Yes
# 6 01/04/2016 00:05 55 No
# 7 01/04/2016 00:06 54 No
# ...
structure(list(
Time.stamp = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 10L, 11L, 12L, 12L, 13L),
.Label = c("01/04/2016 00:00", "01/04/2016 00:01", "01/04/2016 00:02", "01/04/2016 00:03",
"01/04/2016 00:04", "01/04/2016 00:05", "01/04/2016 00:06", "01/04/2016 00:07",
"01/04/2016 00:08", "01/04/2016 00:09", "01/04/2016 00:11", "01/04/2016 00:12",
"01/04/2016 00:13"), class = "factor"),
Battery.voltage = c(51L, 52L, 45L, 48L, 49L, 55L, 54L, 52L, 51L, 49L, 48L, 47L, 45L, 50L, 51L),
Condition = structure(c(1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L),
.Label = c("No", "Yes"), class = "factor")),
.Names = c("Time.stamp", "Battery.voltage", "Condition"),
class = "data.frame", row.names = c(NA, -15L))
My expected output is something like this:
Month year Sum of mins running in battery
Jan 2016 350min
Feb 2016 450min
etc.
Unfortunately, your sample data is not very representative of your problem statement, as it only includes data for one day. It would have been beneficial to provide some code that generates random data for sufficient entries (i.e. dates).
That aside, you could adapt the following solution (here I assume your timestamp format is "DD/MM/YYYY"):
library(dplyr)
df %>%
mutate(
Time.stamp = as.POSIXct(Time.stamp, format = "%d/%m/%Y %H:%M"),
byday = format(Time.stamp, "%d/%m/%Y"),
# month and year, so that group_by(bymonth) groups per calendar month
bymonth = format(Time.stamp, "%m/%Y"),
byyear = format(Time.stamp, "%Y")) %>%
group_by(byday) %>%
summarise(sum.running.in.mins = sum(Condition == "Yes"))
## A tibble: 1 x 2
# byday sum.running.in.mins
# <chr> <int>
#1 01/04/2016 7
Here we create the columns byday, bymonth and byyear, by which you can group entries and calculate the total running time per group. In the example above, I calculate the total running time by day; to get the total running time per month, replace group_by(byday) with group_by(bymonth).
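A base R sketch of the per-month variant, on made-up rows that span two months. It assumes, as the counting approach above does, that each row stands for one minute of operation; the column names on.battery and month are introduced here for illustration.

```r
# Hypothetical readings spanning two months (one row = one minute, by assumption)
df <- data.frame(
  Time.stamp = c("01/04/2016 00:00", "01/04/2016 00:01", "01/04/2016 00:02",
                 "15/05/2016 10:00", "15/05/2016 10:01"),
  Battery.voltage = c(51, 45, 48, 49, 55)
)

# Parse the DD/MM/YYYY HH:MM strings
df$Time.stamp <- as.POSIXct(df$Time.stamp, format = "%d/%m/%Y %H:%M", tz = "UTC")

# TRUE when the voltage is below the threshold from the question
df$on.battery <- df$Battery.voltage < 50

# Year-month key, then count the flagged minutes per month
df$month <- format(df$Time.stamp, "%Y-%m")
monthly <- aggregate(on.battery ~ month, data = df, FUN = sum)
monthly
```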
I have a 9x2 data frame DATS with prices and POSIXct datetime stamps sampled every 15 minutes, and a list of dates FOMCDATES with the dates of recent FOMC events. I then split the POSIXct datetime stamps into separate Date and Time columns. I then add a column FOMCBinary to DATS containing a 1 whenever the date in DATS is contained in FOMCDATES AND the time is 14:30 (EDIT: FOMC is 14:00, used 14:30 by mistake - example still valid).
I would like to record the Close before the event takes place in a separate variable. The name of the variable should be based on the date of the event. In the case at hand, the result should be: PreEvent-2016-01-27 = 1122.7. Please take into account this would actually be run in a large sample with dozens of dates and the time can be other than 14:30 (e.g. if looking at NFP rather than FOMC).
DATS <- structure(list(DateTime = structure(list(sec = c(0, 0, 0, 0,0, 0, 0, 0, 0), min = c(30L, 15L, 0L, 45L, 30L, 15L, 0L, 45L,30L), hour = c(15L, 15L, 15L, 14L, 14L, 14L, 14L, 13L, 13L),mday = c(27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L), mon = c(0L,0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), year = c(116L, 116L, 116L,116L, 116L, 116L, 116L, 116L, 116L), wday = c(3L, 3L, 3L,3L, 3L, 3L, 3L, 3L, 3L), yday = c(26L, 26L, 26L, 26L, 26L,26L, 26L, 26L, 26L), isdst = c(0L, 0L, 0L, 0L, 0L, 0L, 0L,0L, 0L), zone = c("EST", "EST", "EST", "EST", "EST", "EST","EST", "EST", "EST"), gmtoff = c(NA_integer_, NA_integer_,NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,NA_integer_, NA_integer_)), .Names = c("sec", "min", "hour","mday", "mon", "year", "wday", "yday", "isdst", "zone", "gmtoff"), class = c("POSIXlt", "POSIXt")), Close = c(1127.2, 1127.5,1126.9, 1128.3, 1125.4, 1122.7, 1122.8, 1117.3, 1116)), .Names = c("DateTime","Close"), row.names = 2131:2139, class = "data.frame")
FOMCDATES <- structure(c(16785, 16827, 16876), class = "Date")
DATS$Time <- strftime(DATS$DateTime, format="%H:%M:%S")
DATS$Date <- as.Date(DATS$DateTime)
DATS$FOMCBinary <- ifelse( DATS$Time == "14:30:00" & DATS$Date %in% FOMCDATES, 1, 0)
#Output for FOMCDATES:
[1] 2015-12-16 2016-01-27 2016-03-16
#Output for DATS after calculations performed:
DateTime Close Time Date FOMCBinary
2131 2016-01-27 15:30:00 1127.2 15:30:00 2016-01-27 0
2132 2016-01-27 15:15:00 1127.5 15:15:00 2016-01-27 0
2133 2016-01-27 15:00:00 1126.9 15:00:00 2016-01-27 0
2134 2016-01-27 14:45:00 1128.3 14:45:00 2016-01-27 0
2135 2016-01-27 14:30:00 1125.4 14:30:00 2016-01-27 1
2136 2016-01-27 14:15:00 1122.7 14:15:00 2016-01-27 0
2137 2016-01-27 14:00:00 1122.8 14:00:00 2016-01-27 0
2138 2016-01-27 13:45:00 1117.3 13:45:00 2016-01-27 0
2139 2016-01-27 13:30:00 1116.0 13:30:00 2016-01-27 0
My attempt results in a vector rather than a single value, and the variable name is not dynamic.
#My failed attempt
#Define rowShift function
rowShift <- function(x, shiftLen = 1L) {
r <- (1L + shiftLen):(length(x) + shiftLen)
r[r<1] <- NA
return(x[r]) }
PreEventLevel <- ifelse(DATS$FOMCBinary > 0, rowShift(DATS$Close, +1), 0)
How could this be achieved?
Thank you very much!
Creating variables in the global environment with dynamic names is not good practice. I would rather use a list as a container for your values, e.g.:
# get the indexes where FOMCBinary > 0
oneIdxs <- which(DATS$FOMCBinary > 0)
# get the close values using indexes on the shifted vector and put the values in a list
PreEventLevel <- as.list(rowShift(DATS$Close,1)[oneIdxs])
# set the dates as names of the element in the list
names(PreEventLevel) <- DATS$Date[oneIdxs]
> PreEventLevel
$`2016-01-27`
[1] 1122.7
# now you can access the values using:
# PreEventLevel[["2016-01-27"]]
# or
# PreEventLevel$`2016-01-27`
Note that you can also simply create a vector with names instead of a list (just remove as.list), and PreEventLevel will be:
> PreEventLevel
2016-01-27
1122.7
# you can access the values using PreEventLevel["2016-01-27"]
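Since the question mentions that the event time may differ (e.g. NFP rather than FOMC), here is a self-contained sketch of the named-vector variant with the event time as a parameter. The data are a made-up subset of DATS, and event_dates and event_time are names introduced here for illustration; like the code above, it assumes rows are sorted from newest to oldest.

```r
# rowShift as defined in the question: element i becomes x[i + shiftLen]
rowShift <- function(x, shiftLen = 1L) {
  r <- (1L + shiftLen):(length(x) + shiftLen)
  r[r < 1] <- NA
  x[r]
}

# Hypothetical subset of DATS, newest row first
DATS <- data.frame(
  Date  = as.Date(rep("2016-01-27", 4)),
  Time  = c("14:45:00", "14:30:00", "14:15:00", "14:00:00"),
  Close = c(1128.3, 1125.4, 1122.7, 1122.8)
)

event_dates <- as.Date("2016-01-27")  # assumed event calendar
event_time  <- "14:30:00"             # assumed event time, change per event type

# Rows where an event takes place
idx <- which(DATS$Time == event_time & DATS$Date %in% event_dates)

# Close of the bar just before each event, named by event date
pre <- setNames(rowShift(DATS$Close, 1L)[idx], as.character(DATS$Date[idx]))
pre
```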
Thanks in advance for any help that is provided.
Long story short: I am working with hourly time series data from a measurement device (exported from SQL, then imported into R in order to properly format the date time). The time series contains missing data, sometimes in groups, and I need to locate these missing rows/indices and insert a new row for each instance that holds an NA value.
Related Questions that did not solve my problem:
how to insert missing observations on a data frame
Adding row to a data frame with missing values
Problem Data
The dataset that I am working with in this case is fairly large and varies depending on the measurement device I select. As a test case, I have one time series that contains 17469 hourly observations. I located a small section of the dataset that may be used for testing purposes. Here it is:
> snip
date Reading
408 2015-12-15 00:00:00 4.40
409 2015-12-14 23:00:00 4.62
410 2015-12-14 22:00:00 4.61
411 2015-12-14 21:00:00 6.15
412 2015-12-14 20:00:00 6.06
413 2015-12-14 19:00:00 7.04
414 2015-12-14 18:00:00 8.57
415 2015-12-14 11:00:00 4.12
416 2015-12-14 10:00:00 3.73
We can see that observations are missing for 2015-12-14 12:00:00 to 2015-12-14 17:00:00. I would like to first locate then populate the time series with these date times and input NA for the Reading column in these positions. I would also like to return the indices that are missing in an additional vector.
How can this be done?
So far I have tried the following code (as suggested here, how to add a missing dates and remove repeated dates in hourly time series), but all I end up with is NA values when I perform the merge function and still need to identify where the missing indices are located.
Here is the result:
> grid = data.frame(date=seq.POSIXt(min(snip[,1]), to=max(snip[,1]), by="1 hours"));
> dat = merge(grid, snip, by="date", all.x=TRUE)
> dat
date Reading
1 2015-12-14 10:00:00 NA
2 2015-12-14 11:00:00 NA
3 2015-12-14 12:00:00 NA
4 2015-12-14 13:00:00 NA
5 2015-12-14 14:00:00 NA
6 2015-12-14 15:00:00 NA
7 2015-12-14 16:00:00 NA
8 2015-12-14 17:00:00 NA
9 2015-12-14 18:00:00 NA
10 2015-12-14 19:00:00 NA
11 2015-12-14 20:00:00 NA
12 2015-12-14 21:00:00 NA
13 2015-12-14 22:00:00 NA
14 2015-12-14 23:00:00 NA
15 2015-12-15 00:00:00 NA
What am I missing here? Is it because grid and snip$date are in reverse order? For additional information here is what the date time format looks like (in case this is from where my issue stems):
> snip[2,1]
[1] "2015-12-14 23:00:00 GMT"
The result of the dput(snip) command is as follows (thanks for the suggestion #42):
> dput(snip)
structure(list(date = structure(list(sec = c(0, 0, 0, 0, 0, 0,
0, 0, 0), min = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), hour = c(0L,
23L, 22L, 21L, 20L, 19L, 18L, 11L, 10L), mday = c(15L, 14L, 14L,
14L, 14L, 14L, 14L, 14L, 14L), mon = c(11L, 11L, 11L, 11L, 11L,
11L, 11L, 11L, 11L), year = c(115L, 115L, 115L, 115L, 115L, 115L,
115L, 115L, 115L), wday = c(2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), yday = c(348L, 347L, 347L, 347L, 347L, 347L, 347L, 347L, 347L
), isdst = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT"), Reading = c(4.4,
4.62, 4.61, 6.15, 6.06, 7.04, 8.57, 4.12, 3.73)), .Names = c("date",
"Reading"), row.names = 408:416, class = "data.frame")
Here's how I was able to do it with some help from na.locf documentation. Does it help?
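# read the sample back in (the file holds the dput output)
dat <- dget("yoursample")
library(xts)
datxts <- as.xts(dat[, -1], order.by = dat$date, frequency = 24)
tzn <- tzone(datxts)
# complete hourly grid spanning the observed range
g <- seq(start(datxts), end(datxts), "hour")
gxts <- xts(rep(NA, length(g)), order.by = as.POSIXct(g), tzone = tzn)
# the outer merge keeps every grid hour; missing hours get NA
merge(datxts, gxts, all = TRUE)$datxts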
Edit: Your method also works if you add a column of NAs to the generated data frame:
dates <- seq.POSIXt(min(snip[,1]), to = max(snip[,1]), by = "1 hours")
grid <- data.frame(date = dates, dummydata = rep(NA, length(dates)))
dat <- merge(grid, snip, by = "date", all = TRUE)
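One likely explanation for the all-NA merge in the question, judging from the dput, is that snip$date is POSIXlt while the generated grid is POSIXct, so the keys never match. A sketch under that assumption, on a made-up subset of the readings, that converts the column first and then reports which rows of the completed series were missing:

```r
# Hypothetical subset of the readings; the question's date column is POSIXlt
lt <- as.POSIXlt(c("2015-12-14 19:00:00", "2015-12-14 18:00:00",
                   "2015-12-14 11:00:00", "2015-12-14 10:00:00"), tz = "GMT")

# as.POSIXct() is the conversion step that makes the merge keys comparable
snip <- data.frame(date = as.POSIXct(lt),
                   Reading = c(7.04, 8.57, 4.12, 3.73))

# Complete hourly grid over the observed range
grid <- data.frame(date = seq(min(snip$date), max(snip$date), by = "1 hour"))

# Left join: every grid hour is kept, unobserved hours get NA
dat <- merge(grid, snip, by = "date", all.x = TRUE)

# Row indices (in the completed series) of the hours that were missing
missing_idx <- which(is.na(dat$Reading))
missing_idx
```

Note that which(is.na(...)) would also flag hours that were present but recorded as NA in the original data; if that distinction matters, compare dat$date against snip$date instead.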
Now my data frame is like below
dput(head(t.zoo))
structure(c(85.92, 85.85, 85.83, 85.83, 85.85, 85.87, 1300, 1300,
1299.75, 1299.75, 1299.75, 1300), .Dim = c(6L, 2L), .Dimnames = list(
NULL, c("cl", "es")), index = structure(list(sec = c(0.400000095367432,
0.900000095367432, 1.40000009536743, 1.90000009536743, 2.40000009536743,
2.90000009536743), min = c(30L, 30L, 30L, 30L, 30L, 30L), hour = c(10L,
10L, 10L, 10L, 10L, 10L), mday = c(6L, 6L, 6L, 6L, 6L, 6L), mon = c(5L,
5L, 5L, 5L, 5L, 5L), year = c(112L, 112L, 112L, 112L, 112L, 112L
), wday = c(3L, 3L, 3L, 3L, 3L, 3L), yday = c(157L, 157L, 157L,
157L, 157L, 157L), isdst = c(1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = c("", "EST", "EDT"
)), class = "zoo")
I have two questions. First, I would like to add a variable name for the first column. Second, I want to create a categorical variable that indicates 2012-06-06 (since there are 3 separate days).
What should I do for the date data?
I'm not familiar with the zoo class, so the following code is not pretty, but it seems to work.
yourdata<-as.matrix(yourdata)
justdate <- substr(rownames(yourdata), 1, 10)
justtime <- substr(rownames(yourdata), 11, 19)
row.names(yourdata) <- NULL
yourdata<-as.data.frame(yourdata)
yourdata[,"justdate"]<-justdate
yourdata[,"justtime"]<-justtime
yourdata[yourdata$justdate=="2012-06-06","newvariable"]<-1
> yourdata
cl es justdate justtime newvariable
1 85.92 1300.00 2012-06-06 10:30:00 1
2 85.85 1300.00 2012-06-06 10:30:00 1
3 85.83 1299.75 2012-06-06 10:30:01 1
4 85.83 1299.75 2012-06-06 10:30:01 1
5 85.85 1299.75 2012-06-06 10:30:02 1
6 85.87 1300.00 2012-06-06 10:30:02 1
zoo objects are a little bit different to work with from data.frames.
The "first column" (as you referred to it) is actually not a column, but the index of your object. Try index(t.zoo) and see what it returns. This index really should have unique values; in your case, there are duplicated values, which might affect your calculations.
Conversion to a data.frame can be done like the following. I've added separate "Date" and "Time" variables based on the index from t.zoo.
require(zoo) # Load the `zoo` package if you haven't already done so
t.df = data.frame(Date = format(index(t.zoo), "%Y-%m-%d"),
Time = format(index(t.zoo), "%H:%M:%S"),
data.frame(t.zoo))
t.df
# Date Time cl es
# 1 2012-06-06 10:30:00 85.92 1300.00
# 2 2012-06-06 10:30:00 85.85 1300.00
# 3 2012-06-06 10:30:01 85.83 1299.75
# 4 2012-06-06 10:30:01 85.83 1299.75
# 5 2012-06-06 10:30:02 85.85 1299.75
# 6 2012-06-06 10:30:02 85.87 1300.00
Converting back to a zoo object (keeping the new "Date" and "Time" columns, or any other columns that you have added) can be done like:
zoo(t.df, order.by=index(t.zoo))
Note, however, that this will give you a warning because you don't have unique "order.by" values.
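For the second question (flagging a particular day), a base R sketch on a made-up index vector; with the real object, index(t.zoo) would take the place of idx. The names day and is_june6 are introduced here for illustration.

```r
# Hypothetical index values covering three separate days
idx <- as.POSIXct(c("2012-06-06 10:30:00", "2012-06-07 09:15:00",
                    "2012-06-08 11:00:00"), tz = "UTC")

# Categorical variable with one level per calendar day
day <- factor(format(idx, "%Y-%m-%d"))

# 0/1 indicator for a single day of interest
is_june6 <- as.integer(format(idx, "%Y-%m-%d") == "2012-06-06")

data.frame(idx, day, is_june6)
```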