Get mean of previous 6 hours in R - r

I have a dataset with a level of radiation per hour. I need to get the average level of radiation from the previous 6 hours. So for point c I need: mean(data$radiation[(c-6):(c-1)])
This would be a solution to my problem if the dataset were complete (it is not; sometimes a few hours are missing), and I have no idea how to automate it without a for-loop (which I would like to avoid, as there are 199056 entries).
I have the data in a data frame with radiation and time in a POSIXct format:
GLOBAL_radiation POSTIME
1383116 98 2016-06-10 18:00:00
1383118 55 2016-06-10 19:00:00
1383125 26 2016-06-10 20:00:00
1383130 6 2016-06-10 21:00:00
1383137 0 2016-06-10 22:00:00
1383142 0 2016-06-10 23:00:00
I've been cracking my brain on this for a while now, I do hope a function exists for this that I'm unaware of. Thanks in advance.

I'm not quite sure if this meets your needs, but I'll give it a try:
library(dplyr)
# define start value for date, which is assumed to be the
# last value in the time-vector
start <- dat$POSTIME[nrow(dat)]
# compute difference of all time points in relation
# to latest time point in data set
dat$hours <- as.vector(difftime(start, dat$POSTIME, units = "hours"))
# create a "grouping" vector, where all 6-hours-span-timepoints
# are grouped together
dat$grp <- as.integer(dat$hours / 6)
# group by 6-hours-span and compute mean for each
# 6-hours time-period
dat %>% group_by(grp) %>% summarise(mean(GLOBAL_radiation))
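Note that the grouping above gives one mean per fixed 6-hour block. If what is wanted is a trailing window for every row (the mean of the 6 hours before each time point, with gaps in the data), one possible base-R sketch is below. It assumes the readings fall exactly on the hour, fills missing hours with NA on a complete hourly grid, and uses running sums so no loop is needed. The sample data and the column name prev6_mean are made up for illustration.

```r
# Hypothetical sample: hourly radiation with the 21:00 reading missing
dat <- data.frame(
  GLOBAL_radiation = c(98, 55, 26, 0, 0),
  POSTIME = as.POSIXct(c("2016-06-10 18:00:00", "2016-06-10 19:00:00",
                         "2016-06-10 20:00:00", "2016-06-10 22:00:00",
                         "2016-06-10 23:00:00"), tz = "UTC")
)

# Complete hourly grid; hours absent from `dat` become NA after the merge
grid <- data.frame(POSTIME = seq(min(dat$POSTIME), max(dat$POSTIME), by = "hour"))
full <- merge(grid, dat, by = "POSTIME", all.x = TRUE)

# Running sums of the values and of the non-missing counts, with a leading 0
x  <- full$GLOBAL_radiation
ok <- !is.na(x)
cs <- c(0, cumsum(ifelse(ok, x, 0)))
cn <- c(0, cumsum(ok))

# For row i the window is rows (i-6):(i-1), clipped at the start of the data
i  <- seq_along(x)
lo <- pmax(i - 6, 1)
win_sum <- cs[i] - cs[lo]
win_n   <- cn[i] - cn[lo]

# Mean over the readings actually present in the window; NA if there are none
full$prev6_mean <- ifelse(win_n > 0, win_sum / win_n, NA_real_)
full
```

Here a window with missing hours is averaged over the readings that exist. If a window should only count when all 6 hours are present, replace the condition win_n > 0 with win_n == 6.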

Related

In R, How can I create a new date variable adopting the nearest date value right after an index date variable?

My dataframe in R studio is as follows:
StudyID FITDate.1 ScopeDate.1 ScopeDate.2 ScopeDate.3 ScopeDate.4
1 2014-05-15 2010-06-02 2014-05-28 2014-08-01 2015-10-27
2 2017-11-29 2018-02-27
3 2015-10-04 2016-06-24 2017-01-18
I have a variable "FITDate.1" that holds the date of the FIT test, and several variables "ScopeDate.x" that hold the dates of multiple scope tests.
In my research, a person can have only one FIT test date but multiple scope dates. Clinically, if a person has a FIT test, they will be referred for a scope test. However, the same person may also receive scope tests for other reasons.
So if the date of a scope test comes right after the date of a FIT test, we define the two as highly related.
I want to create a variable "FITrelatedscopedate" to hold the dates of FIT-related scopes. For example, in the row with StudyID==1, the date in "FITDate.1" is 2014-05-15, which falls between ScopeDate.1 (2010-06-02) and ScopeDate.2 (2014-05-28). So the value 2014-05-28 of ScopeDate.2 is what I need: I will use 2014-05-28 as the FIT-related scope date and write it to the new variable "FITrelatedscopedate".
I think I have to use a loop, but I have no experience writing one. Has anyone solved a similar problem, or does anyone know code that would do this? Thanks, any help is appreciated.
Here is one approach with tidyverse assuming you start with two long data.frames, one for FIT testing, and the other for endoscopy.
df_fit <- data.frame(
StudyID = 1:3,
FITDate = as.Date(c("2014-05-15", "2017-11-29", "2015-10-04"))
)
df_fit
StudyID FITDate
1 1 2014-05-15
2 2 2017-11-29
3 3 2015-10-04
df_scope <- data.frame(
StudyID = c(1,1,1,1,2,3,3),
ScopeDate = as.Date(c("2010-06-02", "2014-05-28", "2014-08-01", "2015-10-27", "2018-02-27",
"2016-06-24", "2017-01-18"))
)
df_scope
StudyID ScopeDate
1 1 2010-06-02
2 1 2014-05-28
3 1 2014-08-01
4 1 2015-10-27
5 2 2018-02-27
6 3 2016-06-24
7 3 2017-01-18
First, you can do a left_join by the StudyID to add the scope dates to the FIT data. Then, you can filter to only keep scope dates after FIT testing. For each StudyID, use slice to retain only the first row (this assumes dates are in chronological order...if not, add arrange(ScopeDate) first in the pipe - let me know if you need help with this).
Then, you can right_join back to df_fit so that those FIT testing dates without endoscopy will have NA for the ScopeDate. The final statement with mutate will calculate the time duration between endoscopy and FIT testing.
library(tidyverse)
left_join(
df_fit,
df_scope,
by = "StudyID"
) %>%
filter(ScopeDate > FITDate) %>%
group_by(StudyID) %>%
slice(1) %>%
right_join(df_fit) %>%
mutate(Duration = ScopeDate - FITDate)
Output
StudyID FITDate ScopeDate Duration
<dbl> <date> <date> <drtn>
1 1 2014-05-15 2014-05-28 13 days
2 2 2017-11-29 2018-02-27 90 days
3 3 2015-10-04 2016-06-24 264 days
Let me know if this works for you. A data.table approach can be considered if you need something faster and have a very large dataset.
If you need the Duration as a numeric column, you can use as.numeric(ScopeDate - FITDate).
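If the data stay in the wide layout shown in the question, a base-R alternative is to treat the scope columns as a numeric matrix and take the earliest date after the FIT date in each row. This is only a sketch: the data frame name wide is made up, and the blank cells in the question are assumed to be NA.

```r
# Hypothetical reconstruction of the wide data from the question (blanks as NA)
wide <- data.frame(
  StudyID     = 1:3,
  FITDate.1   = as.Date(c("2014-05-15", "2017-11-29", "2015-10-04")),
  ScopeDate.1 = as.Date(c("2010-06-02", "2018-02-27", "2016-06-24")),
  ScopeDate.2 = as.Date(c("2014-05-28", NA, "2017-01-18")),
  ScopeDate.3 = as.Date(c("2014-08-01", NA, NA)),
  ScopeDate.4 = as.Date(c("2015-10-27", NA, NA))
)

# Scope dates as days since 1970-01-01, one row per person
scope_cols <- grep("^ScopeDate\\.", names(wide), value = TRUE)
scope_num  <- sapply(wide[scope_cols], as.numeric)

# Discard scopes on or before the FIT date (the FIT vector recycles down columns)
too_early <- which(scope_num <= as.numeric(wide$FITDate.1))
scope_num[too_early] <- NA

# Earliest remaining scope per row; NA when no scope follows the FIT test
first_after <- apply(scope_num, 1, function(r) {
  if (all(is.na(r))) NA_real_ else min(r, na.rm = TRUE)
})
wide$FITrelatedscopedate <- as.Date(first_after, origin = "1970-01-01")
wide[c("StudyID", "FITDate.1", "FITrelatedscopedate")]
```

Unlike the slice(1) approach, this does not depend on the scope columns being in chronological order, because it takes the minimum explicitly.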

best practices for avoiding roundoff gotchas in date manipulation

I am doing some date/time manipulation and experiencing explicable, but unpleasant, round-tripping problems when converting date -> time -> date . I have temporarily overcome this problem by rounding at appropriate points, but I wonder if there are best practices for date handling that would be cleaner. I'm using a mix of base-R and lubridate functions.
tl;dr is there a good, simple way to convert from decimal date (YYYY.fff) to the Date class (and back) without going through POSIXt and incurring round-off (and potentially time-zone) complications??
Start with a few days from 1918, as separate year/month/day columns (not a critical part of my problem, but it's where my pipeline happens to start):
library(lubridate)
dd <- data.frame(year=1918,month=9,day=1:12)
Convert year/month/day -> date -> time:
dd <- transform(dd,
time=decimal_date(make_date(year, month, day)))
The successive differences in the resulting time vector are not exactly 1 because of roundoff: this is understandable but leads to problems down the road.
table(diff(dd$time)*365)
## 0.999999999985448 1.00000000006844
## 9 2
Now suppose I convert back to a date: the dates are slightly before or after midnight (off by <1 second in either direction):
d2 <- lubridate::date_decimal(dd$time)
# [1] "1918-09-01 00:00:00 UTC" "1918-09-02 00:00:00 UTC"
# [3] "1918-09-03 00:00:00 UTC" "1918-09-03 23:59:59 UTC"
# [5] "1918-09-04 23:59:59 UTC" "1918-09-05 23:59:59 UTC"
# [7] "1918-09-07 00:00:00 UTC" "1918-09-08 00:00:00 UTC"
# [9] "1918-09-09 00:00:00 UTC" "1918-09-09 23:59:59 UTC"
# [11] "1918-09-10 23:59:59 UTC" "1918-09-12 00:00:00 UTC"
If I now want dates (rather than POSIXct objects) I can use as.Date(), but to my dismay as.Date() truncates rather than rounding ...
tt <- as.Date(d2)
## [1] "1918-09-01" "1918-09-02" "1918-09-03" "1918-09-03" "1918-09-04"
## [6] "1918-09-05" "1918-09-07" "1918-09-08" "1918-09-09" "1918-09-09"
##[11] "1918-09-10" "1918-09-12"
So the differences are now 0/1/2 days:
table(diff(tt))
# 0 1 2
# 2 7 2
I can fix this by rounding first:
table(diff(as.Date(round(d2))))
## 1
## 11
but I wonder if there is a better way (e.g. keeping POSIXct out of my pipeline and staying with dates) ...
As suggested by this R-help desk article from 2004 by Grothendieck and Petzoldt:
When considering which class to use, always
choose the least complex class that will support the
application. That is, use Date if possible, otherwise use
chron and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.
The extensive table in this article shows how to translate among Date, chron, and POSIXct, but doesn't include decimal time as one of the candidates ...
It seems like it would be best to avoid converting back from decimal time if at all possible.
When converting from date to decimal date, one also needs to account for time. Since Date does not have a specific time associated with it, decimal_date inherently assumes it to be 00:00:00.
However, if we are concerned only with the date (and not the time), we could assume the time to be anything. Arguably, middle of the day (12:00:00) is as good as the beginning of the day (00:00:00). This would make the conversion back to Date more reliable as we are not at the midnight mark and a few seconds off does not affect the output. One of the ways to do this would be to add 12*60*60/(365*24*60*60) to dd$time
dd$time2 = dd$time + 12*60*60/(365*24*60*60)
data.frame(dd[1:3],
"00:00:00" = as.Date(date_decimal(dd$time)),
"12:00:00" = as.Date(date_decimal(dd$time2)),
check.names = FALSE)
# year month day 00:00:00 12:00:00
#1 1918 9 1 1918-09-01 1918-09-01
#2 1918 9 2 1918-09-02 1918-09-02
#3 1918 9 3 1918-09-03 1918-09-03
#4 1918 9 4 1918-09-03 1918-09-04
#5 1918 9 5 1918-09-04 1918-09-05
#6 1918 9 6 1918-09-05 1918-09-06
#7 1918 9 7 1918-09-07 1918-09-07
#8 1918 9 8 1918-09-08 1918-09-08
#9 1918 9 9 1918-09-09 1918-09-09
#10 1918 9 10 1918-09-09 1918-09-10
#11 1918 9 11 1918-09-10 1918-09-11
#12 1918 9 12 1918-09-12 1918-09-12
It should be noted, however, that the value of decimal time obtained in this way will be different.
lubridate::decimal_date() is returning a numeric. If I understand you correctly, the question is how to convert that numeric into Date and have it round appropriately without bouncing through POSIXct.
as.Date(1L, origin = '1970-01-01') shows us that we can provide as.Date with days since some specified origin and convert immediately to the Date type. Knowing this, we can skip the year part entirely and set it as origin. Then we can convert our decimal dates to days:
as.Date((dd$time-trunc(dd$time)) * 365, origin = "1918-01-01").
So, a function like this might do the trick (at least for years without leap days):
date_decimal2 <- function(decimal_date) {
years <- trunc(decimal_date)
origins <- paste0(years, "-01-01")
# c.f. https://stackoverflow.com/questions/14449166/dates-with-lapply-and-sapply
do.call(c, mapply(as.Date.numeric, x = (decimal_date-years) * 365, origin = origins, SIMPLIFY = FALSE))
}
Side note: I admit I went down a bit of a rabbit hole trying to move origin around to deal with the pre-1970 dates. I found that the further origin shifted from the target date, the weirder the results got (and not in ways that seemed to be easily explained by leap days). Since origin is flexible, I decided to put it right on top of the target values. For leap days, seconds, and whatever other weirdness time has in store for us, on your own head be it. =)
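For what it's worth, the leap-year limitation can be avoided by dividing by the actual number of days in each year instead of a fixed 365. The sketch below stays in the Date class throughout and rounds to whole days on the way back, which absorbs the floating-point error. The function names date_to_decimal and decimal_to_date are made up here, and the result has not been checked against lubridate::decimal_date() beyond the assumption that both measure the fraction of the year elapsed at midnight.

```r
# Date -> decimal year, using the true length of each year (365 or 366 days)
date_to_decimal <- function(d) {
  yr    <- as.integer(format(d, "%Y"))
  start <- as.Date(paste0(yr, "-01-01"))
  end   <- as.Date(paste0(yr + 1, "-01-01"))
  yr + as.numeric(d - start) / as.numeric(end - start)
}

# Decimal year -> Date; round() snaps to the nearest whole day
decimal_to_date <- function(x) {
  yr    <- floor(x)
  start <- as.Date(paste0(yr, "-01-01"))
  end   <- as.Date(paste0(yr + 1, "-01-01"))
  start + round((x - yr) * as.numeric(end - start))
}

d   <- as.Date("1918-09-01") + 0:11
dec <- date_to_decimal(d)
all(decimal_to_date(dec) == d)   # round trip recovers the original dates
table(diff(decimal_to_date(dec)))
```

Rounding to whole days is only appropriate when the decimal values are known to represent dates without a time-of-day component.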

computing and formatting averages and squares of time intervals

I have a model which predicts the duration of certain events, and measures of durations for those events. I then want to compute the difference between Predicted and Measured, the mean difference and the RMSE. I'm able to do it, but the formatting is really awkward and not what I expected:
database <- data.frame(Predicted = c(strptime(c("4:00", "3:35", "3:38"), format = "%H:%M")),
Measured = c(strptime(c("3:39", "3:40", "3:53"), format = "%H:%M")))
database
> Predicted Measured
1 2016-11-28 04:00:00 2016-11-28 03:39:00
2 2016-11-28 03:35:00 2016-11-28 03:40:00
3 2016-11-28 03:38:00 2016-11-28 03:53:00
This is the first weirdness: why does R show me a time and a date, even though I clearly specified a time-only format (%H:%M), and there was no date in my data to start with? It gets weirder:
database$Error <- with(database, Predicted-Measured)
database$Mean_Error <- with(database, mean(Predicted-Measured))
database$RMSE <- with(database, sqrt(mean(as.numeric(Predicted-Measured)^2)))
> database
Predicted Measured Error Mean_Error RMSE
1 2016-11-28 04:00:00 2016-11-28 03:39:00 21 mins 0.3333333 15.17674
2 2016-11-28 03:35:00 2016-11-28 03:40:00 -5 mins 0.3333333 15.17674
3 2016-11-28 03:38:00 2016-11-28 03:53:00 -15 mins 0.3333333 15.17674
Why is the variable Error expressed in minutes? For Error it's not a bad choice, but it becomes quite hard to read for Mean_Error. For RMSE it's even worse, but this could be due to the as.numeric function: if I remove it, R complains that '^' not defined for "difftime" objects. My questions are:
Is it possible to show the first 2 columns (Predicted and Measured) in the %H:%M format?
for the other 3 columns ( Error, Mean_Error and RMSE) I would like to compare a %M:%S format and a format in only seconds, and choose among the two. Is it possible?
EDIT: just to be more clear, my goal is to insert observations of time intervals into a dataframe and compute a vector of time interval differences. Then, compute some statistics for that vector: mean, RMSE, etc.. I know I could just enter the time observations in seconds, but that doesn't look very good: it's difficult to tell that 13200 seconds are 3 hours and 40 minutes. Thus I would like to be able to store the time intervals in the %H:%M, but then be able to manipulate them algebraically and show the results in a format of my choosing. Is that possible?
We can use difftime to specify the units for the difference in time. The output of difftime is an object of class difftime. When this difftime object is coerced to numeric using as.numeric, we can change these units (see the examples in ?difftime):
## Note we don't convert to date-time because we just want %H:%M
database <- data.frame(Predicted = c("4:00", "3:35", "3:38"),
Measured = c("3:39", "3:40", "3:53"))
## We now convert to date-time and use difftime to compute difference in minutes
database$Error <- with(database, difftime(strptime(Predicted,format="%H:%M"),strptime(Measured,format="%H:%M"), units="mins"))
## Use as.numeric to change units to seconds
database$Mean_Error <- with(database, mean(as.numeric(Error,units="secs")))
database$RMSE <- with(database, sqrt(mean(as.numeric(Error,units="secs")^2)))
## Predicted Measured Error Mean_Error RMSE
##1 4:00 3:39 21 mins 20 910.6042
##2 3:35 3:40 -5 mins 20 910.6042
##3 3:38 3:53 -15 mins 20 910.6042
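For the %M:%S display asked about in the question, one option is a small formatting helper applied to the number of seconds. This is a sketch only: fmt_mmss is a made-up name, and it rounds to whole seconds before formatting.

```r
# Format a number of seconds as [-]MM:SS (rounded to the nearest second)
fmt_mmss <- function(secs) {
  sign <- ifelse(secs < 0, "-", "")
  s    <- as.integer(round(abs(secs)))
  sprintf("%s%02d:%02d", sign, s %/% 60L, s %% 60L)
}

# Errors of 21, -5 and -15 minutes, expressed in seconds
err_secs <- c(21, -5, -15) * 60
fmt_mmss(err_secs)                    # "21:00" "-05:00" "-15:00"
fmt_mmss(mean(err_secs))              # "00:20"
fmt_mmss(sqrt(mean(err_secs^2)))      # "15:11"
```

Keeping the numeric seconds in the data frame and formatting only for display means the columns can still be used in arithmetic.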

Deseasonalize a "zoo" object containing intraday data

Using the zoo package (and help from SO) I have created a time series from the following:
z <- read.zoo("D:\\Futures Data\\BNVol3.csv", sep = ",", header = TRUE, index = 1:2,
tz="", format = "%d/%m/%Y %H:%M")
This holds data in the following format:(Intra-day from 07:00 to 20.50)
2012-10-01 14:50:00 2012-10-01 15:00:00 2012-10-01 15:10:00 2012-10-01 15:20:00
8638 9014 9402 9505
I want to "deseasonalize" the intra-day component of this data so that 1 day is considered a complete seasonal cycle. (I am using the day component because not all days will run from 07.00 to 20.50 due to bank holidays etc, but running from 07.00 to 20.50 is usually the standard. I assume that if I used the 84 intra-day points as 1 seasonal cycle, then at some point the deseasonalizing would begin to get thrown off track.)
I have tried to use the decompose method but this has not worked.
x <- decompose(z)
Not sure "zoo" and decompose method are compatible but I thought "zoo" and "ts" were designed to be. Is there another way to do this?
Thanks in advance for any help.

Aggregating, restructuring hourly time series data in R

I have a year's worth of hourly data in a data frame in R:
> str(df.MHwind_load) # compactly displays structure of data frame
'data.frame': 8760 obs. of 6 variables:
$ Date : Factor w/ 365 levels "2010-04-01","2010-04-02",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Time..HRs. : int 1 2 3 4 5 6 7 8 9 10 ...
$ Hour.of.Year : int 1 2 3 4 5 6 7 8 9 10 ...
$ Wind.MW : int 375 492 483 476 486 512 421 396 456 453 ...
$ MSEDCL.Demand: int 13293 13140 12806 12891 13113 13802 14186 14104 14117 14462 ...
$ Net.Load : int 12918 12648 12323 12415 12627 13290 13765 13708 13661 14009 ...
While preserving the hourly structure, I would like to know how to extract
1. a particular month/group of months
2. the first day/first week etc of each month
3. all mondays, all tuesdays etc of the year
I have tried using "cut" without result and after looking online think that "lubridate" might be able to do so but haven't found suitable examples. I'd greatly appreciate help on this issue.
Edit: a sample of data in the data frame is below:
Date Hour.of.Year Wind.MW datetime
1 2010-04-01 1 375 2010-04-01 00:00:00
2 2010-04-01 2 492 2010-04-01 01:00:00
3 2010-04-01 3 483 2010-04-01 02:00:00
4 2010-04-01 4 476 2010-04-01 03:00:00
5 2010-04-01 5 486 2010-04-01 04:00:00
6 2010-04-01 6 512 2010-04-01 05:00:00
7 2010-04-01 7 421 2010-04-01 06:00:00
8 2010-04-01 8 396 2010-04-01 07:00:00
9 2010-04-01 9 456 2010-04-01 08:00:00
10 2010-04-01 10 453 2010-04-01 09:00:00
.. .. ... .......... ........
8758 2011-03-31 8758 302 2011-03-31 21:00:00
8759 2011-03-31 8759 378 2011-03-31 22:00:00
8760 2011-03-31 8760 356 2011-03-31 23:00:00
EDIT: Additional time-based operations I would like to perform on the same dataset
1. Perform hour-by-hour averaging for all data points i.e average of all values in the first hour of each day in the year. The output will be an "hourly profile" of the entire year (24 time points)
2. Perform the same for each week and each month i.e obtain 52 and 12 hourly profiles respectively
3. Do seasonal averages, for example for June to September
Convert the date to the format which lubridate understands and then use the functions month, mday, wday respectively.
Suppose you have a data.frame with the time stored in column Date, then the answer for your questions would be:
library(lubridate)
###dummy data.frame (Date stored as class Date so the lubridate accessors work on it)
df <- data.frame(Date=as.Date(c("2012-01-01","2012-02-15","2012-03-01","2012-04-01")),a=1:4)
##1. Select rows for particular month
subset(df,month(Date)==1)
##2a. Select the first day of each month
subset(df,mday(Date)==1)
##2b. Select the first week of each month
##get the week numbers which have the first day of the month
wkd <- subset(week(df$Date),mday(df$Date)==1)
##select the weeks with particular numbers
subset(df,week(Date) %in% wkd)
##3. Select all mondays (with lubridate's default week start, 1 is Sunday and 2 is Monday)
subset(df,wday(Date)==2)
1. First switch to a Date representation: as.Date(df.MHwind_load$Date)
2. Then call weekdays on the date vector to get a new character vector labelled with day of week
3. Then call months on the date vector to get a new character vector labelled with name of month
4. Optionally create a years variable (see below).
5. Now subset the data frame using the relevant combination of these.
Step 2. gets an answer to your task 3. Steps 3. and 4. get you to task 1. Task 2 might require a line or two of R. Or just select rows corresponding to, say, all the Mondays in a month and call unique, or its alter-ego duplicated on the results.
To get you going...
newdf <- df.MHwind_load ## build an augmented data set
newdf$d <- as.Date(newdf$Date)
newdf$month <- months(newdf$d)
newdf$day <- weekdays(newdf$d)
## for some reason R has no years function. Here's one
years <- function(x){ format(as.Date(x), format = "%Y") }
newdf$year <- years(newdf$d)
# get observations from January to March of every year
subset(newdf, month %in% c('January', 'February', 'March'))
# get all Monday observations
subset(newdf, day == 'Monday')
# get all Mondays in 1999
subset(newdf, day == 'Monday' & year == '1999')
# slightly fancier: _first_ Monday of each month
# get the first weeks
first.week.of.month <- !duplicated(cbind(newdf$month, newdf$day))
# now pull out the mondays
subset(newdf, first.week.of.month & day=='Monday')
Since you're not asking about the time (hourly) part of your data, it is best to then store your data as a Date object. Otherwise, you might be interested in chron, which also has some convenience functions like you'll see below.
With respect to Conjugate Prior's answer, you should store your date data as a Date object. Since your data already follows the default format ('yyyy-mm-dd') you can just call as.Date on it. Otherwise, you would have to specify your string format. I would also use as.character on your factor to make sure you don't get errors inline. I know I've run into problems with factors-into-Dates for that reason (possibly corrected in current version).
df.MHwind_load <- transform(df.MHwind_load, Date = as.Date(as.character(Date)))
Now you would do well to create wrapper functions that extract the information you desire. You could use transform like I did above to simply add those columns that represent months, days, years, etc, and then subset on them logically. Alternatively, you might do something like this:
getMonth <- function(x, mo) { # This function assumes w/in single year vector
isMonth <- month(x) %in% mo # Boolean of matching months (month() is from lubridate)
return(x[which(isMonth)]) # Return vector of matching months
} # end function
Or, in short form
getMonth <- function(x, mo) x[month(x) %in% mo]
This is just a tradeoff between storing that information (transform frame) or having it processed when desired (use accessor methods).
A more complicated process is your need for, say, the first day of a month. This is not entirely difficult, though. Below is a function that will return all of those values, but it is rather simple to just subset a sorted vector of values for a given month and take their first one.
getFirstDay <- function(x, mo) {
isMonth <- months(x) %in% mo
x <- sort(x[isMonth]) # Look at only those in the desired month.
# Sort them by date. We only want the first day.
nFirsts <- rle(as.numeric(x))$lengths[1] # Returns length of 1st days
return(x[seq(nFirsts)])
} # end function
The easier alternative would be
getFirstDayOnly <- function(x, mo) {sort(x[months(x) %in% mo])[1]}
I haven't prototyped these, as you didn't provide any data samples, but this is the sort of approach that can help you get the information you desire. It is up to you to figure out how to put these into your work flow. For instance, say you want to get the first day for each month of a given year (assuming we're only looking at one year; you can create wrappers or pre-process your vector to a single year beforehand).
# Return a vector of first days for each month
df <- transform(df, date = as.Date(as.character(date)))
sapply(unique(months(df$date)), # Iterate through months in Dates
function(month) {getFirstDayOnly(df$date, month)})
The above could also be designed as a separate convenience function that uses the other accessor function. In this way, you create a series of direct but concise methods for getting pieces of the information you want. Then you simply pull them together to create very simple and easy-to-interpret functions that you can use in your scripts to get precisely what you desire in the most efficient manner.
You should be able to use the above examples to figure out how to prototype other wrappers for accessing the date information you require. If you need help on those, feel free to ask in a comment.
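The hour-by-hour averaging from the question's second edit is not covered above. One way to sketch it in base R is to derive an hour-of-day column from the datetime and use aggregate. The sample data here is a made-up two-day stand-in for df.MHwind_load, with only the datetime and Wind.MW columns.

```r
# Hypothetical stand-in: two days of hourly data
df <- data.frame(
  datetime = seq(as.POSIXct("2010-04-01 00:00:00", tz = "UTC"),
                 by = "hour", length.out = 48)
)
df$Wind.MW <- rep(c(100, 200), each = 24) + rep(0:23, times = 2)

# Grouping keys derived from the timestamp
df$hour  <- as.integer(format(df$datetime, "%H"))
df$month <- format(df$datetime, "%Y-%m")

# 1. One mean per hour of day: a 24-point hourly profile
profile <- aggregate(Wind.MW ~ hour, data = df, FUN = mean)

# 2. The same profile computed separately for each month
profile_by_month <- aggregate(Wind.MW ~ hour + month, data = df, FUN = mean)

# 3. For a seasonal average, restrict the rows first, e.g. June to September
is_summer <- as.integer(format(df$datetime, "%m")) %in% 6:9

head(profile)
```

A weekly version would follow the same pattern with a week key such as format(df$datetime, "%Y-%U") in place of month.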

Resources