Mean hour-of-day and imputation...would this be easier with time calculations? - r

I'm working with a data set and am imputing NAs for times. I have a simplified example below in which I create a new column that contains the original data plus imputed values for the NAs (i.e., the mean time of day). The code works fine, but I am so weak with dates that I wondered: is there an easier way to calculate the mean time of day for date/time values?
arrivals <- data.frame(
  ships = c("Glory", "Discover", "Intrepid", "Enchantment", "Summit"),
  times = c("8:00", "10:00", "11:42", NA, "9:20"),
  stringsAsFactors = FALSE)
# Convert the "H:MM" strings to minutes since midnight
sumtime <- sapply(strsplit(as.character(arrivals$times), ":"),
                  function(x) as.numeric(x[1]) * 60 + as.numeric(x[2]))
# Rebuild an "H:MM" string from the mean number of minutes
avgtime <- paste(trunc(mean(sumtime, na.rm = TRUE) / 60), ":",
                 trunc(mean(sumtime, na.rm = TRUE) %% 60), sep = "")
# Copy the original column and fill the NAs with the average
arrivals$times2 <- arrivals$times
arrivals$times2[is.na(arrivals$times)] <- avgtime

You can use the chron package to convert your times column to a numeric representation that you can take the average of:
library(chron)
Arrivals <- arrivals[,c("ships","times")]
# Will give some warnings due to the missing value
Arrivals$times <- chron(times.=paste(Arrivals$times, ":00", sep=""))
Arrivals$times[is.na(Arrivals$times)] <- mean(Arrivals$times,na.rm=TRUE)
Arrivals
        ships    times
1       Glory 08:00:00
2    Discover 10:00:00
3    Intrepid 11:42:00
4 Enchantment 09:45:30
5      Summit 09:20:00
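If you'd rather stay in base R, here is a minimal sketch of the same idea; it assumes the times are "H:MM" strings and parses them onto a single dummy date so they can be averaged directly:
# Parse on a dummy date, average, and format back (rounds down to the minute)
tm <- as.POSIXct(arrivals$times, format = "%H:%M", tz = "UTC")
avg <- mean(tm, na.rm = TRUE)  # mean.POSIXct accepts na.rm
arrivals$times2 <- ifelse(is.na(arrivals$times),
                          format(avg, "%H:%M"),
                          arrivals$times)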


Creating time stamps and repeat in the same column

I want to create a vector of time stamps consisting of 60 monthly dates and repeat the process n times; that is, if n = 2, the vector should contain 120 time stamps.
I am creating a single vector of time stamps this way:
t <- seq(as.Date("2014-01-01"), as.Date("2018-12-31"), by = "month")
To repeat it n times I am doing the following:
n <- 2
X <- data.frame(replicate(n, seq(as.Date("2014-01-01"), as.Date("2018-12-31"), by = "month")))
Y <- stack(X)[,"values", drop=FALSE]
head(Y)
values
1 16071
2 16102
3 16130
4 16161
5 16191
6 16222
As you can see, the values are no longer in date format. My question is: how do I retain the date format in the vector Y? And is there a smarter way to approach this problem?
Take a look at the zoo package; there is an old thread here https://stat.ethz.ch/pipermail/r-help//2010-March/233159.html
where they discuss much the same problem.
Either way, after loading zoo you can do
as.Date(16071)
and it will return the date in Date format (zoo supplies a default origin of "1970-01-01" for as.Date.numeric, which base R requires you to pass explicitly). Hope this makes sense.
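That said, since rep() preserves the Date class, a simpler sketch that avoids the numeric round-trip entirely might be:
n <- 2
t <- seq(as.Date("2014-01-01"), as.Date("2018-12-31"), by = "month")
Y <- data.frame(values = rep(t, times = n))  # rep() keeps the Date class
head(Y)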

Mean Returns in Time Series - Restarting after NA values - rstudio

Has anyone encountered calculating historical mean log returns in time series datasets?
The dataset is ordered by individual security first and by time for each respective security. I am trying to form a historical mean log return, i.e. the mean log return for the security from its first appearance in the dataset to date, for each point in time for each security.
Luckily, the return time series contains NAs between returns for differing securities. My idea is to calculate a historical mean that restarts after each NA that appears.
A simple cumsum() probably will not do it, as the NAs will have to be dropped.
I thought about using rollmean(), if only I knew an efficient way to set the 'width' parameter to the number of consecutive preceding non-NAs.
The current approach I am taking, based on Count how many consecutive values are true, takes significantly too much time, given the size of the data set I am working with.
For any x of the form x : [r(1) r(2) ... r(N)], where r(2) is the log return in period 2:
df <- data.frame(x, zcount = NA)
df[1, 2] <- 0  # df$x[1] = NA by construction of the data set
for (i in 2:nrow(df))
  df$zcount[i] <- ifelse(!is.na(df$x[i]), df$zcount[i - 1] + 1, 0)
Any idea how to speed this up would be highly appreciated!
You will need to reshape the data.frame to apply the cumsum function
over each security. Here's how:
First, I'll generate some data on 100 securities over 100 months, which I think corresponds to your description of the data set.
securities <- 100
months <- 100
time <- seq.Date(as.Date("2010/1/1"), by = "months", length.out = months)
ID <- rep(paste0("sec", 1:securities), each = months)
returns <- rnorm(securities * months, mean = 0.08, sd = 2)
df <- data.frame(time, ID, returns)
head(df)
time ID returns
1 2010-01-01 sec1 -3.0114466
2 2010-02-01 sec1 -1.7566112
3 2010-03-01 sec1 1.6615731
4 2010-04-01 sec1 0.9692533
5 2010-05-01 sec1 1.3075774
6 2010-06-01 sec1 0.6323768
Now, reshape your data so that each column contains one security's returns and each row represents a date.
library(tidyr)
df_wide <- spread(df, ID, returns)
Once this is done, you can use the apply function to sum every column, each of which now represents one security, or use the cumsum function for running totals. Notice the data object df_wide[-1], which drops the time column; this is necessary to keep the sum and cumsum calls from throwing an error.
matrix_sum <- apply(df_wide[-1], 2, FUN = sum)
matrix_cumsum <- apply(df_wide[-1], 2, FUN = cumsum)
Now, add the time column back as a data.frame if you like:
df_final <- data.frame(time = df_wide[,1], matrix_cumsum)
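Since you ultimately want a historical mean rather than a running total, one sketch is to divide each cumulative sum by the number of observations to date:
# Cumulative (historical) mean per security: running sum / count so far
matrix_cummean <- apply(df_wide[-1], 2,
                        function(r) cumsum(r) / seq_along(r))
df_means <- data.frame(time = df_wide[, 1], matrix_cummean)
As an aside, the consecutive-count loop in your question can be vectorized with run-length encoding, which should remove the speed bottleneck:
# Runs of non-NA values count up 1, 2, 3, ...; runs of NAs stay 0
r <- rle(!is.na(df$x))
df$zcount <- sequence(r$lengths) * rep(r$values, r$lengths)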

R - Next highest value in a time series

A relatively simple question, but one for which I can't seem to find any examples.
I have simple forex price data which is in a 2 column xts object called subx1:
Datetime, Price
2016-09-01 00:00:01, 1.11563
2016-09-01 00:00:01, 1.11564
2016-09-01 00:00:02, 1.11564
2016-09-01 00:00:03, 1.11565
... and so forth.
I'm trying to find the first time after 2pm when the price goes higher than the pre-2pm high, which is held in another object's column called daypeakxts$before2.High.
A sample of daypeakxts:
Date, before2.High
2016-09-01, 1.11567
2016-09-02, 1.11987
This is a bad example of what I'm trying to do:
subxresult <- index(subx1, subx1$datetime > daypeakxts$before2.High)
... so I'm looking to discover a datetime for a price using a conditional statement with a day's value in another xts object.
You didn't provide enough data for a reproducible example, so I'm going to use some daily data that comes with the xts package.
library(xts)
data(sample_matrix)
x <- as.xts(sample_matrix, dateFormat = "Date")
# Aggregate and find the high for each week
Week.High <- apply.weekly(x, function(x) max(x$High))
# Finding the pre-2pm high would be something like:
# Pre.2pm.High <- apply.daily(x["T00:00/T14:00"], function(x) max(x$High))
# Merge the period high with the original data, and
# fill NA with the last observation carried forward
y <- merge(x, Week.High, fill = na.locf)
# Lag the period high, so it aligns with the following period
y$Week.High <- lag(y$Week.High)
# Find the first instance where the next period's high
# is higher than the previous period's high
y$First.Higher <- apply.weekly(y, function(x) which(x$High > x$Week.High)[1])
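If you want the actual timestamp rather than the row position within each week, a small follow-up sketch on the merged object is:
# Position and time of the first observation whose high exceeds the
# prior period's high
first_cross <- which(y$High > y$Week.High)[1]
index(y)[first_cross]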

Finding a more elegant way to aggregate hourly data to mean hourly data using zoo

I have a chunk of data logging temperatures from a few dozen devices every hour for over a year. The data are stored as a zoo object. I'd very much like to summarize those data by looking at the average values for each of the 24 hours in a day (1am, 2am, 3am, etc.), so that for each device I can see its average value across all the 1am readings, all the 2am readings, and so on. I can do this with a loop, but I sense that there must be a way to do this in zoo with an artful use of aggregate.zoo. Any help?
require(zoo)
# random hourly data over 30 days for five series
x <- matrix(rnorm(24 * 30 * 5),ncol=5)
# Assign hourly data with a real time and date
x.DateTime <- as.POSIXct("2014-01-01 0100", format = "%Y-%m-%d %H") +
  seq(0, (24 * 30 - 1) * 3600, by = 3600)  # 720 stamps to match nrow(x)
# make a zoo object
x.zoo <- zoo(x, x.DateTime)
#plot(x.zoo)
# what I want:
# the average value for each series at 1am, 2am, 3am, etc. so that
# the dimensions of the output are 24 (hours) by 5 (series)
# If I were just working on x I might do something like:
res <- matrix(NA,ncol=5,nrow=24)
for(i in 1:nrow(res)){
res[i,] <- apply(x[seq(i,nrow(x),by=24),],2,mean)
}
res
# how can I avoid the loop and write an aggregate statement in zoo that
# will get me what I want?
Calculate the hour for each time point and then aggregate by that:
hr <- as.numeric(format(time(x.zoo), "%H"))
ag <- aggregate(x.zoo, hr, mean)
dim(ag)
## [1] 24 5
ADDED
Alternatively, use hours from chron or hour from data.table:
library(chron)
ag <- aggregate(x.zoo, hours, mean)
This is quite similar to the other answer but takes advantage of the fact that the by=... argument to aggregate.zoo(...) can be a function, which will be applied to time(x.zoo):
as.hour <- function(t) as.numeric(format(t,"%H"))
result <- aggregate(x.zoo,as.hour,mean)
identical(result,ag) # ag from G. Grothendieck answer
# [1] TRUE
Note that this produces a result identical to the other answer, but not the same as yours. This is because your dataset starts at 1:00am, not midnight, so your loop produces a matrix whose first row corresponds to 1:00am and whose last row corresponds to midnight. These solutions produce zoo objects whose first row corresponds to midnight.
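If you want the rows lined up with your loop's output (1am first, midnight last), one sketch is to drop the zoo index and reorder the plain matrix:
res2 <- coredata(ag)[c(2:24, 1), ]  # hour 1 first, hour 0 (midnight) last
all.equal(res2, res, check.attributes = FALSE)  # should be TRUE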

Recall in date POSIXct

I couldn't find a solution to my problem with the POSIXct format. I have monthly data. This is a scrap of my code:
Data <- as.POSIXct(as.character(czerwiec$Data), format = "%Y-%m-%d %H:%M:%S")
get.rows <- Data >= as.POSIXct(as.character("2013-06-03 00:00:01")) & Data <= as.POSIXct(as.character("2013-06-09 23:59:59"))
czerwiec <- czerwiec[get.rows,]
Data <- Data[get.rows]
I chose one whole week of June, from the 3rd to the 9th, and wanted to estimate the sum of column X (czerwiec$X) for every hour. As you see, I could narrow the time window step by step, but it would be silly to do it like this
get.rows <- Data >= as.POSIXct(as.character("2013-06-03 00:00:01")) &
Data <= as.POSIXct(as.character("2013-06-03 00:59:59"))
then
get.rows <- Data >= as.POSIXct(as.character("2013-06-04 00:00:01")) &
Data <= as.POSIXct(as.character("2013-06-04 00:59:59"))
And in the end of this operations, I can estimate sum for this hour etc.
Do you have any idea how I can select every row whose date falls between 2013-06-03 and 2013-06-09 and whose time falls between 00:00:01 and 00:59:59?
About the data frame "czerwiec": it has three columns, the first called "ID", the second "Price", and the third "Data" (meaning Date).
Thanks for the help :)
This might help. I've used the lubridate package, which doesn't really do anything you can't do in base R, but it makes handling dates much easier.
# Set up Data as a string vector
Data <- c("2013-06-01 05:05:05", "2013-06-06 05:05:05", "2013-06-06 08:10:05", "2013-07-07 05:05:05")
require(lubridate)
# Set up the data frame with fake data. This makes a reproducible example
set.seed(4) #For reproducibility, always set the seed when using random numbers
# Create a data frame with Data and price
czerwiec <- data.frame(price=runif(4))
# Use lubridate to turn the Data string into a vector of POSIXct objects
czerwiec$Data <- ymd_hms(Data)
# Determine the 'yearday' -i.e. yearday of Jan 1 is 1; yearday of Dec 31 is 365 (or 366 in a leap year)
czerwiec$yday <- yday(czerwiec$Data)
# in.range is true if the date is in the desired date range
czerwiec$in.range <- czerwiec$yday >= yday(ymd("2013-06-03")) &
                     czerwiec$yday <= yday(ymd("2013-06-09"))
# Pick out the dates that have the range that you want
selected_dates <- subset(czerwiec, in.range==TRUE)
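For the hourly sums you mention, a sketch building on this result (using lubridate's hour() and the fake price column standing in for your X) could be:
# Sum price within each hour of day across the selected rows
selected_dates$hour <- hour(selected_dates$Data)
aggregate(price ~ hour, data = selected_dates, FUN = sum)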
