storing lubridate intervals in a dataframe (R)

I want to create a dataframe of 15-minute intervals over 24 hours, starting with a certain interval on several dates. I use a loop for that, but instead of the actual intervals it stores the number of seconds, which is not useful in my case. Is there any way to avoid this? I need the intervals to look at how often timed events happen in them. I found one similar question, but the answer concentrated on using lapply instead of apply, which isn't applicable here.
So here is a basic example:
library(lubridate)

begin <- as.POSIXct(c("2016-03-31 09:00:00", "2016-04-12 09:00:00"))
end   <- as.POSIXct(c("2016-03-31 09:15:00", "2016-04-12 09:15:00"))
int   <- interval(begin, end)
aufl  <- duration(15, "mins")

Intervall <- data.frame()
for (j in 1:length(int)) {
  for (i in 1:96) {
    # assignment into the data frame coerces the interval to seconds
    Intervall[j, i] <- int_shift(int[j], aufl * (i - 1))
  }
}
Intervall

I created an answer; I hope this is what you are looking for. If not, please comment:
library(lubridate)

begin <- as.POSIXct(c("2016-03-31 09:00:00", "2016-04-12 09:00:00"))
# copy the begin time for the loop
begin_new <- begin
# create a duration object
aufl <- duration(15, "mins")
# number of 15-minute steps in 24 hours
times <- 24 * 60 / 15
# create a data frame holding the begin time
Intervall <- data.frame(begin, stringsAsFactors = FALSE)
for (i in 1:times) {
  # save the old time for the interval calculation
  begin_start <- begin_new
  # add 15 minutes to the previous time
  begin_new <- begin_new + aufl
  # create an interval object between the two times
  new_dur <- interval(begin_start, begin_new)
  # bind it to the original data frame
  Intervall <- cbind(Intervall, new_dur)
}
# add column names
vec_names <- paste0("v", 1:(times + 1))
colnames(Intervall) <- vec_names
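For completeness, a loop-free variant is also possible. This is a sketch, not part of the original answer; the final data.frame() conversion assumes a recent lubridate that supports Interval columns, so the list form is the safe fallback:
library(lubridate)
begin <- as.POSIXct(c("2016-03-31 09:00:00", "2016-04-12 09:00:00"))
int   <- interval(begin, begin + minutes(15))
aufl  <- duration(15, "mins")
# one Interval vector per 15-minute shift; a list never coerces to seconds
shifts <- lapply(0:95, function(i) int_shift(int, aufl * i))
names(shifts) <- paste0("v", 1:96)
shifts[["v2"]]  # the second 15-minute window for both dates
# may work with a recent lubridate; with older versions, keep the list
Intervall <- do.call(data.frame, shifts)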

Related

Counting the number of days excluding Sundays between two dates and creating a new column in R DataFrame

I have a data.frame in R which includes two variables, a start date and an end date. I would like to add a new column with the number of days between the two dates, reduced by the number of Sundays in each interval. I tried it like below, but it doesn't work:
Data$Start <- as.Date(Data$Start, "%d.%m.%y")
Data$End <- as.Date(Data$End,"%d.%m.%y")
interval <- difftime(Data$Start, Data$End, units = "days")
sundays <- seq(from = Data$Start, to = Data$End, by = "days")
number.sundays <- length(which(wday(sundays)==1))
Data$DaysAhead <- interval - number.sundays
I get an error message from the seq() function that the arguments must have length 1, but I don't understand how to handle this. Can someone help me out with that?
Here's an example that works:
Data <- data.frame(
  Start = c("01.01.2020", "01.06.2020"),
  End   = c("01.03.2020", "01.09.2020")
)
Data$Start <- as.Date(Data$Start, "%d.%m.%Y")
Data$End   <- as.Date(Data$End, "%d.%m.%Y")
interval <- difftime(Data$End, Data$Start, units = "days")
sundays <- lapply(1:nrow(Data), function(i) seq(from = Data$Start[i], to = Data$End[i], by = "days"))
number.sundays <- sapply(sundays, function(x) length(which(lubridate::wday(x) == 1)))
Data$DaysAhead <- interval - number.sundays
The problem is that seq() isn't vectorized; it assumes a single start and a single end point. If you put it inside a loop (like lapply()) it will work and generate the relevant sequence for each start and end date. Then you can use sapply() to count the Sundays, and since the returned value is a scalar, the return from sapply() will be a vector of the same length as interval.
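The same idea fits in a single call with mapply(), which loops over the start/end pairs in parallel; a sketch equivalent to the lapply()/sapply() pair above:
number.sundays <- mapply(
  function(s, e) sum(lubridate::wday(seq(s, e, by = "days")) == 1),
  Data$Start, Data$End
)
Data$DaysAhead <- difftime(Data$End, Data$Start, units = "days") - number.sundays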
I realized with an updated data set that there's a problem with the solution above when the start and end dates aren't in the same year. I still want to count the days except Sundays, starting on 20.12.2020 until 10.01.2021 for example. The error message in that case says that the sign of the "by" argument is wrong. I just can't manage to get it running. If I turn the dates around, the output makes no sense and the number of days is too high. What do I have to do to get this running over the year-end?
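The follow-up is left open above, but a likely cause, judging by the two snippets, is the year format: the first attempt parses with "%d.%m.%y", which reads only a two-digit year. A quick check of this hypothesis:
# "%d.%m.%y" parses "10.01.2021" as year 20, i.e. 2020-01-10, which lies
# before a December start date, so seq() complains about the sign of 'by'
as.Date("10.01.2021", "%d.%m.%y")  # 2020-01-10
# with the four-digit "%d.%m.%Y" the year boundary is harmless:
s <- as.Date("20.12.2020", "%d.%m.%Y")
e <- as.Date("10.01.2021", "%d.%m.%Y")
sum(lubridate::wday(seq(s, e, by = "days")) != 1)  # days excluding Sundays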

R - More elegant way of writing line of code

OK, so let's take the code below, which calculates a rolling simple moving average over a 2-day period:
# Use TTR package to create rolling SMA n day moving average
new.df$close.sma.n2 <- SMA(new.df[,"Close"], 2)
Let's say I want to calculate this for n-day periods of 2:30.
The inputs here are:
close.sma.n**
and also the numerical value for the SMA calculation.
So my question is:
How can I write one line of code that performs the above calculation for different SMA periods and creates a new column with the corresponding close.sma.n2, 3, 4, 5, 6, etc. value in the data frame?
I understand I can do:
n.sma <- 2:30
and put that variable in:
new.df$close.sma.n2 <- SMA(new.df[,"Close"], n.sma)
Perhaps I can:
name <- "n2:30"
and place that inside:
new.df$close.sma.name <- SMA(new.df[,"Close"], n.sma)
You didn't provide sample data or SMA, so I made up dummy functions to test my code.
df <- data.frame(Close=c(1, 2, 3, 4))
SMA <- function(x, numdays) {numdays}
Then, I wrote a function that takes in the number of days to average, and returns a function that takes a data.frame and takes the SMA over that many days.
getSMA <- function(numdays) {
  function(new.df) {
    SMA(new.df[, "Close"], numdays)
  }
}
Then, create a matrix to put the SMAs in
smas <- matrix(nrow=nrow(df), ncol=0)
and fill it.
for (i in 2:30) {
  smas <- cbind(smas, getSMA(i)(df))
}
Then, set the column names to what you want them to be
colnames(smas) <- sapply(2:30, function(n)paste("close.sma.n", n, sep=""))
and bind it with the starting data frame.
df <- cbind(df, smas)
Now you have your original data.frame, plus "close.sma.n2" through "close.sma.n30".
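With the real TTR::SMA the same result fits in a few lines, because sapply() over the periods returns the whole matrix at once; a sketch, assuming new.df has at least 30 rows of closing prices:
library(TTR)
# one column per period; SMA() pads the start of each column with NAs
smas <- sapply(2:30, function(n) SMA(new.df[, "Close"], n))
colnames(smas) <- paste0("close.sma.n", 2:30)
new.df <- cbind(new.df, smas)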

Creating next Data.Frame by pulling columns from previous sequenced Data.Frame in a for Loop

My goal is to write a loop that creates a unique data.frame for a multivariate time series that is split by a factor called hour. The data is daily and has values for demand and ad spend at each hour of the day. Each data.frame has 1 date column, one demand column, and 8 ad-spend columns representing the ad spend for the current hour and the 7 previous hours.
For example, two loop cycles for i = 3 and i = 9 would produce:
For the 9am-hour data.frame: the columns will be Date, Demand9AM, AdSpend9AM, ..., AdSpend2AM
For the 3am-hour data.frame: the columns will be Date, Demand3AM, AdSpend3AM, ..., AdSpend9PM (yesterday)
The trick is that earlier hours will have to pull some ad spend from the previous day's hours.
A couple of solid coders on this site suggested I read about the "zoo" package. I did! So I have been able to take this problem to a solid place. Here is code for pseudo data that outputs a sequence of data.frames similar to what I need. Because I am a novice, I am not sure this is the most efficient way to create this loop.
So my questions are:
Is there a simpler way to create this loop?
Is there a way to assign names to the variables within the loop?
Is it possible to create the dataframes in a vectorized way?
The first question is far more important.
Thank you
set.seed(31)
library(forecast)
library(lubridate)
library(zoo)
library(reshape)

foo <- function(myHour, myDate) {
  rlnorm(1, meanlog = 0, sdlog = 1) * myHour + 150 * myDate
}
Hour <- 1:24
Day  <- 1:90
dates <- seq(as.Date("2012-01-01"), as.Date("2012-03-30"), by = "day")
myData <- expand.grid(Day, Hour)
names(myData) <- c("Date", "Hour")
myData$Adspend <- apply(myData, 1, function(x) foo(x[2], x[1]))
myData$Date <- dates
myData$Demand <- rnorm(1, mean = 0, sd = 1) + 0.75 * myData$Adspend
## ok, done with the fake data generation.
myData
ADDate  <- myData[, -4]
DemDate <- myData[, -3]
HourAD <- melt(ADDate, id = c("Date", "Hour"), measured = c("Adspend"))
HourAD <- cast(HourAD, ... ~ Hour)
ADHR <- zoo(HourAD, HourAD$Date)
HourDemand <- melt(DemDate, id = c("Date", "Hour"), measured = c("Demand"))
HourDemand <- cast(HourDemand, ... ~ Hour)
DEMHR <- zoo(HourDemand, HourDemand$Date)
DATASET <- vector("list", length(Hour))
for (i in seq_along(Hour)) {
  if (i == 1) {
    DATASET[[i]] <- merge(DEMHR[, 1], ADHR[, 1], lag(ADHR[, 18:24], -1))
  } else {
    DATASET[[i]] <- merge(DEMHR[, i], ADHR[, i], DATASET[[i - 1]][, 2:7])
  }
}

Adding Values In a Stochastic Simulation using R

Here I've managed, with help from other users, to extract time in discrete form such as 1, 2, 3, ..., 50 from each simulation. But since there is no value for the interval 20-21 and beyond, is there any code so that I can add the value in myself? If there is no reading for a time interval, the readings stay the same until the next interval. The code I used is below:
library(GillespieSSA)
parms <- c(beta=0.591,sigma=1/8,gamma=1/7)
x0 <- c(S=50,E=0,I=1,R=0)
a <- c("beta*S*I","sigma*E","gamma*I")
nu <- matrix(c(-1,0,0, 1,-1,0, 0,1,-1, 0,0,1),nrow=4,byrow=TRUE)
set.seed(12345)
out <- lapply(X=1:1,FUN=function(x)
ssa(x0,a,nu,parms,tf=50)$data)
out
a <- as.data.frame(out)
idx <- diff(ceiling(a$V1)) == 1
a[idx, ]
## change ==1 to >0
idx <- diff(ceiling(a$V1)) > 0
## get discrete time series
discrete.data <- a[idx,]
## get the last time step value
end.time <- ceiling(tail(discrete.data$V1,1))
## create an empty data frame with all time steps
new.df <- data.frame(t=0:end.time, S=0, E=0, I=0, R=0)
## replace only those time steps that have valid values
new.df[new.df$t %in% ceiling(discrete.data$V1),2:5] <- discrete.data[,2:5]
If necessary, missing values can be replaced with NA, depending on how you want to handle them.
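Since the question states that readings stay the same until the next event, carrying the last observation forward may fit better than zeros; a sketch using zoo::na.locf on the new.df built above:
library(zoo)
## mark the time steps without an event as NA instead of 0
new.df[!(new.df$t %in% ceiling(discrete.data$V1)), 2:5] <- NA
## fill the gaps with the last observed state (LOCF)
new.df[, 2:5] <- na.locf(new.df[, 2:5], na.rm = FALSE)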

Optimizing search in time series data frame

I have a data frame of 50 columns by 2.5 million rows in R, representing a time series. The time column is of class POSIXct. For analysis, I repeatedly need to find the state of the system for a given class at a particular time.
My current approach is the following (simplified and reproducible):
set.seed(1)
N <- 10000
.time <- sort(sample(1:(100*N),N))
class(.time) <- c("POSIXct", "POSIXt")
df <- data.frame(
  time = .time,
  distance1 = sort(sample(1:(100 * N), N)),
  distance2 = sort(sample(1:(100 * N), N)),
  letter = sample(letters, N, replace = TRUE)
)
# state search function
time.state <- function(df, searchtime, searchclass) {
  # find all rows between the searchtime and a while (here 10k seconds) before it
  rows <- which(findInterval(df$time, c(searchtime - 10000, searchtime)) == 1)
  # find the latest state of the given class within the search interval
  return(rev(rows)[match(TRUE, rev(df[rows, "letter"] == searchclass))])
}
# evaluate the function to retrieve the latest known state of the system
# at time 500,000.
df[time.state(df,500000,"a"),]
However, the call to which() is very costly. Alternatively, I could first filter by class and then find the time, but that doesn't change the evaluation time much. According to Rprof, it's which() and == that cost the majority of the time.
Is there a more efficient solution? The time points are sorted weakly increasing.
Because which(), == and [ are all linear in the size of the data frame, the solution is to generate subset data frames for bulk operations, as follows:
# function that applies time.state to a series of time/class combinations
time.states <- function(df, times, classes, day.length = 24) {
  result <- vector("list", length(times))
  day.end <- 0
  for (i in 1:length(times)) {
    if (times[i] > day.end) {
      # create subset interval from 1h before to 24h after
      day.begin <- times[i] - 60 * 60
      day.end <- times[i] + day.length * 60 * 60
      df.subset <- df[findInterval(df$time, c(day.begin, day.end)) == 1, ]
    }
    # save the resulting row from the data frame
    result[[i]] <- df.subset[time.state(df.subset, times[i], classes[i]), ]
  }
  return(do.call("rbind", result))
}
With dT = diff(range(df$time)) and dT/day.length large, this reduces the evaluation time by a factor of roughly dT/(day.length + 1).
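Because the time points are sorted, the linear which() scan inside time.state() can also be replaced by a binary search. The following is a sketch, not part of the original answer; time.state.bs is a hypothetical name and the 10k-second window is carried over from above:
# binary-search variant: findInterval() locates both window ends in O(log n)
time.state.bs <- function(df, searchtime, searchclass) {
  hi <- findInterval(searchtime, df$time)              # last row with time <= searchtime
  lo <- findInterval(searchtime - 10000, df$time) + 1  # first row inside the window
  if (hi < lo) return(NA_integer_)
  rows <- lo:hi
  # latest matching class within the window
  hit <- match(TRUE, rev(df$letter[rows] == searchclass))
  if (is.na(hit)) NA_integer_ else rev(rows)[hit]
}
df[time.state.bs(df, 500000, "a"), ]
The class comparison is still linear, but only over the rows inside the window, which is what dominates for long series.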
