Subset time series by groups based on cutoff date data frame - R

I have a data frame with time series data for several different groups. I want to apply different start and end cutoff dates to each group within the original data frame.
Here's a sample data frame:
date <- seq(as.POSIXct("2014-07-21 17:00:00", tz = "GMT"),
            as.POSIXct("2014-09-12 00:00:00", tz = "GMT"),  # "24:00:00" is not a time as.POSIXct can parse; write midnight as 00:00:00 of the next day
            by = "hour")
group <- letters[1:4]
datereps <- rep(date, length(group))
attr(datereps, "tzone") <- "GMT"
sitereps <- rep(group, each = length(date))
value <- rnorm(length(datereps))
df <- data.frame(DateTime = datereps, Group = sitereps, Value = value)  # sitereps, not group: each row needs its own group label
and here's the data frame cut of cutoff dates to use (again with midnights written as 00:00:00 of the following day, and with the time zone passed explicitly rather than embedded in the string):
start <- c("2014-08-01 00:00:00", "2014-07-26 00:00:00", "2014-07-21 17:00:00", "2014-08-04 00:00:00")
end <- c("2014-09-12 00:00:00", "2014-09-02 00:00:00", "2014-09-08 00:00:00", "2014-09-12 00:00:00")
cut <- data.frame(Group = group, Start = as.POSIXct(start, tz = "GMT"), End = as.POSIXct(end, tz = "GMT"))
I can do it manually for each group, dropping the rows I don't want at both ends of the time series with negative logical indexing, df[!(...), ]:
df2 <- df[!(df$Group == "a" & df$DateTime > as.POSIXct("2014-08-01 00:00:00", tz = "GMT") & df$DateTime < as.POSIXct("2014-09-12 00:00:00", tz = "GMT")),]
But, how can I automate this?

Just merge the cutoffs into the data frame, then subset on the new columns, as below: df3 contains the removed records and df4 the retained ones.
df2 <- merge(x = df,y = cut,by = "Group")
df3 <- df2[df2$DateTime <= df2$Start | df2$DateTime >= df2$End,]
df4 <- df2[!(df2$DateTime <= df2$Start | df2$DateTime >= df2$End),]
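A base-R alternative, if you'd rather not materialise the merged copy: loop over the rows of the cutoff table and trim each group with its own window. This is a sketch on toy data shaped like the question's df and cut (named cuts here to avoid shadowing base::cut); the two-group frame and the window dates are made up for illustration.

```r
set.seed(1)
# Toy data shaped like the question's df and cut: 2 groups, 48 hourly rows each
hours <- seq(as.POSIXct("2014-07-21 00:00:00", tz = "GMT"), by = "hour", length.out = 48)
df <- data.frame(DateTime = rep(hours, 2),
                 Group    = rep(c("a", "b"), each = 48),
                 Value    = rnorm(96))
cuts <- data.frame(Group = c("a", "b"),
                   Start = as.POSIXct(c("2014-07-21 10:00:00", "2014-07-21 05:00:00"), tz = "GMT"),
                   End   = as.POSIXct(c("2014-07-22 00:00:00", "2014-07-21 20:00:00"), tz = "GMT"))

# For each cutoff row, keep only that group's rows inside [Start, End]
pieces <- lapply(seq_len(nrow(cuts)), function(i) {
  g <- df[df$Group == cuts$Group[i], ]
  g[g$DateTime >= cuts$Start[i] & g$DateTime <= cuts$End[i], ]
})
kept <- do.call(rbind, pieces)
```

The lapply/rbind pattern scales to any number of groups without hand-writing one subset per group.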

Related

How to search and change dates in R using data.table

This is just head() of my dataset; there are millions of rows. This is how it looks:
#before
dt$date
# "2010-05-12" "2010-05-28" "2010-06-29" "2010-06-30" "2010-07-02" "2010-07-02"
I want to convert each of these to the last day of its month, accommodating the fact that months end on the 30th or the 31st. How would I change them accordingly?
#after
# "2010-05-31" "2010-05-31" "2010-06-30" "2010-06-30" "2010-07-31" "2010-07-31"
Cheers
Using the lubridate package:
require(lubridate)
require(data.table)
dt <- data.table(date = as.Date(c("2010-05-12", "2010-05-28", "2010-06-29", "2010-06-30", "2010-07-02", "2010-07-02")))
day(dt$date) <- days_in_month(dt$date)
output:
> dt
date
1: 2010-05-31
2: 2010-05-31
3: 2010-06-30
4: 2010-06-30
5: 2010-07-31
6: 2010-07-31
Base solution (works on data.table objects) with several steps.
# Find the range: date_range => Date vector
date_range <- range(df$date)
# Generate a sequence, having every date in the range:
# date_lkp => Date vector
date_lkp <- seq.Date(date_range[1], date_range[2], by = "days")
# Truncate the date to months: mth_date => Date vector
mth_date <- as.Date(paste0(substr(date_lkp, 1, 8), "01"), "%Y-%m-%d")
# Store every date in the sequence as well as the date for each
# end of month: date_tab => data.frame
date_tab <- data.frame(date_lkp = date_lkp,
                       eom_date = ave(date_lkp, mth_date, FUN = max))
# Perform a lookup for each date in the original data to retrieve
# the last day of each month: new_date => Date vector
df$eom_date <- date_tab$eom_date[match(df$date, date_tab$date_lkp)]
# Data:
date <- c("2010-05-12",
          "2010-05-28",
          "2010-06-29",
          "2010-06-30",
          "2010-07-02",
          "2010-07-02")
library(data.table)
df <- data.table(date = date, stringsAsFactors = FALSE)
df$date <- as.Date(df$date, "%Y-%m-%d")
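Another base-R sketch for the same month-end task, with no lookup table: snap each date to the first of its month, add 32 days (which always lands somewhere in the next month), snap to the first of that month, and subtract one day.

```r
dates <- as.Date(c("2010-05-12", "2010-05-28", "2010-06-29",
                   "2010-06-30", "2010-07-02", "2010-07-02"))
fom <- as.Date(format(dates, "%Y-%m-01"))         # first day of each month
eom <- as.Date(format(fom + 32, "%Y-%m-01")) - 1  # first of next month, minus one day
```

This is fully vectorised and handles 28-, 29-, 30-, and 31-day months without any per-month case analysis.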

How to decompose xts half hourly time-series data

I have the dataset below with half hourly timeseries data.
Date <- c("2018-01-01 08:00:00", "2018-01-01 08:30:00",
          "2018-01-01 08:59:59", "2018-01-01 09:29:59")
Volume <- c(195, 188, 345, 123)
Dataset <- data.frame(Date, Volume)
I convert to Date format with:
Dataset$Date <- as.POSIXct(Dataset$Date)
Create xts object
library(xts)
Dataset.xts <- xts(Dataset$Volume, order.by=Dataset$Date)
When I try to decompose it based on this Q with:
attr(Dataset.xts, 'frequency')<- 48
decompose(ts(Dataset.xts, frequency = 48))
I get:
Error in decompose(ts(Dataset.xts, frequency = 48)) :
time series has no or less than 2 periods
As I mentioned in the comments, you need as.ts instead of ts, and you are specifying a frequency higher than the number of records you have. Both lead to errors.
This code works:
library(xts)
df1 <- data.frame(date = as.POSIXct(c("2018-01-01 08:00:00", "2018-01-01 08:30:00",
                                      "2018-01-01 08:59:59", "2018-01-01 09:29:59")),
                  volume = c(195, 188, 345, 123))
df1_xts <- xts(df1$volume, order.by = df1$date)
attr(df1_xts, 'frequency') <- 2
decompose(as.ts(df1_xts))
This doesn't (frequency higher than number of records):
attr(df1_xts, 'frequency') <- 48
decompose(as.ts(df1_xts))
Error in decompose(as.ts(df1_xts)) :
time series has no or less than 2 periods
Neither does this (ts instead of as.ts):
attr(df1_xts, 'frequency') <- 2
decompose(ts(df1_xts))
Error in decompose(ts(df1_xts)) :
time series has no or less than 2 periods
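To see decompose() succeed at frequency 48, the series needs at least two full days of half-hourly observations (length / frequency >= 2). A sketch with three simulated days; the sinusoid-plus-noise volume is an assumed stand-in for a real daily pattern:

```r
set.seed(1)
n <- 48 * 3  # three days of half-hourly observations
vol <- 100 + 20 * sin(2 * pi * seq_len(n) / 48) + rnorm(n, sd = 2)

# length(vol) / 48 = 3 periods, so decompose() has enough data to work with
dec <- decompose(ts(vol, frequency = 48))
```

With only the four records in the question, no choice of frequency 48 can satisfy the two-period requirement; the fix is more data, not a different call.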

R - How to subset a table between two specific dates?

I have hourly data values spanning eight years, and I would like to subset all the values within a specific year: for example one data set for 2007, another for 2008, and so on. At the moment I have many problems with the date format, because when I specify a time period, I get back a different one.
Here is my table: LValley, and that is what I have tried:
LValley <- read.table("C:/LValley.txt", header=TRUE, dec = ",", sep="\t")
year2007 <- subset(LValley, date > as.Date("01.01.2007 01:00", "%d.%m.%Y %H:%M") & date < as.Date("01.02.2008 01:00", "%d.%m.%Y %H:%M"))
but it returns a different date period, and I would like exactly all the data from 2007.
I have also used the function from this example, with the same result: # Subset a dataframe between 2 dates
mydatefunc <- function(x,y){LValley[LValley$date >= x & LValley$date <= y,]}
DATE1 <- as.Date("01.01.2007 01:00", "%d.%m.%Y %H:%M")
DATE2 <- as.Date("01.01.2008 00:00", "%d.%m.%Y %H:%M")
Test2007 <- mydatefunc(DATE1,DATE2)
I will appreciate your help very much,
Kind regards,
Darwin
You need to convert the date column in the file to date class. For example:
LValley <- read.table("LValley.txt", header=TRUE,dec=",", sep="\t", stringsAsFactors=FALSE)
date1 <- as.Date(LValley$date, "%d.%m.%Y %H:%M")
Test2007 <- subset(LValley, date1>=DATE1 & date1 <=DATE2)
dim(Test2007)
#[1] 6249 4
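If the goal is simply "every row from calendar year 2007", comparing the formatted year sidesteps boundary-time arithmetic entirely. A sketch on made-up timestamps (the column names and values are assumptions, not the asker's LValley file); note that as.POSIXct keeps the time of day, which as.Date would drop:

```r
d <- as.POSIXct(c("2006-12-31 23:00", "2007-01-01 01:00",
                  "2007-06-15 12:00", "2008-01-01 00:00"),
                format = "%Y-%m-%d %H:%M", tz = "GMT")
vals <- data.frame(date = d, value = 1:4)

# Keep exactly the rows whose year is 2007
year2007 <- vals[format(vals$date, "%Y") == "2007", ]
```

The same format() comparison generalises to splitting all eight years at once, e.g. split(vals, format(vals$date, "%Y")).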

subtracting dates with standardised result

I am subtracting dates in xts i.e.
library(xts)
# make data
x <- data.frame(x = 1:4,
                BDate = c("1/1/2000 12:00", "2/1/2000 12:00", "3/1/2000 12:00", "4/1/2000 12:00"),
                CDate = c("2/1/2000 12:00", "3/1/2000 12:00", "4/1/2000 12:00", "9/1/2000 12:00"),
                ADate = c("3/1/2000", "4/1/2000", "5/1/2000", "10/1/2000"),
                stringsAsFactors = FALSE)
x$ADate <- as.POSIXct(x$ADate, format = "%d/%m/%Y")
# object we will use
xxts <- xts(x[, 1:3], order.by= x[, 4] )
#### The subtractions
# answer in days
transform(xxts, lag = as.POSIXct(BDate, format = "%d/%m/%Y %H:%M") - index(xxts))
# answer in hours
transform(xxts, lag = as.POSIXct(CDate, format = "%d/%m/%Y %H:%M") - index(xxts))
Question: how can I standardise the result so that I always get the answer in hours? Not by multiplying the days by 24, as I will not know beforehand whether the subtraction will come out in days or hours... unless I can somehow check whether the result is in days, perhaps using grep and regex, and then multiply within an if clause.
I tried to work through this and went for the grep/regex approach, but it doesn't even keep the negative sign:
p <- transform(xxts, lag = as.POSIXct(BDate, format = "%d/%m/%Y %H:%M") - index(xxts))
library(stringr)
ind <- grep("days", p$lag)
p$lag[ind] <- as.numeric( str_extract_all(p$lag[ind], "\\(?[0-9,.]+\\)?")) * 24
p$lag
#2000-01-03 2000-01-04 2000-01-05 2000-01-10
# 36 36 36 132
I am convinced there is a more elegant solution...
OK, difftime works:
transform(xxts, lag = difftime(as.POSIXct(BDate, format = "%d/%m/%Y %H:%M"), index(xxts), units = "hours"))
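The key detail is that difftime takes an explicit units argument, and wrapping it in as.numeric() gives a plain number for arithmetic; a minimal sketch on two hand-picked timestamps:

```r
t1 <- as.POSIXct("2000-01-03 12:00", format = "%Y-%m-%d %H:%M", tz = "GMT")
t2 <- as.POSIXct("2000-01-02 00:00", format = "%Y-%m-%d %H:%M", tz = "GMT")

# Force the unit instead of letting "-" pick days or hours by magnitude
lag_hours <- as.numeric(difftime(t1, t2, units = "hours"))
```

Because the unit is fixed up front, signs are preserved and no grep/regex post-processing of the printed unit is needed.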

find corresponding dateTime in several time series in R

I am an R newbie and am finding the conversion from Matlab rather tricky, so apologies in advance for what could be a very simple question.
I am analyzing some time series data and the problem outlined below demonstrates the problem I am having in R:
Dat1 <- data.frame(dateTime = as.POSIXct(c("2012-05-03 00:00", "2012-05-03 02:00",
                                           "2012-05-03 02:30", "2012-05-03 05:00",
                                           "2012-05-03 07:00"), tz = "UTC"), x1 = rnorm(5))
Dat2 <- data.frame(dateTime = as.POSIXct(c("2012-05-03 01:00", "2012-05-03 01:30",
                                           "2012-05-03 02:30", "2012-05-03 06:00",
                                           "2012-05-03 07:00"), tz = "UTC"), x1 = rnorm(5))
Dat3 <- data.frame(dateTime = as.POSIXct(c("2012-05-03 00:15", "2012-05-03 02:20",
                                           "2012-05-03 02:40", "2012-05-03 06:25",
                                           "2012-05-03 07:00"), tz = "UTC"), x1 = rnorm(5))
Dat4 <- data.frame(dateTime = as.POSIXct(c("2010-05-03 00:15", "2010-05-03 02:20",
                                           "2010-05-03 02:40", "2010-05-03 06:25",
                                           "2010-05-03 07:00"), tz = "UTC"), x1 = rnorm(5))
So, here I have 5 data frames where all of the data are measured at similar times. I am now trying to ensure that all of the data frames have an identical time step i.e. all measured at the same time. I can do this for two data frames:
idx1 <- (Dat1[,1] %in% Dat2[,1])
which will tell me the index of the consistent times in these two data frames. I can then re-define the data frame by
newDat1 <- Dat1[idx1,]
to get the data desired.
My question now is, how do I apply this to all of the data frames i.e. more than 2. I have tried:
idx1 <- (Dat1[,1] %in% (Dat2[,1] %in% (Dat3[,1] %in% Dat4[,1])))
but I can see that this is completely wrong. Any suggestions? Please keep in mind that I have many data frames (more than five), where each contain different variables.
EDIT:
I may have found one way this can be done:
idx1 <- (Dat1[,1] %in% intersect(intersect(intersect(Dat1[,1],Dat2[,1]),Dat3[,1]),Dat4[,1]))
which will give the index, and can be used to define a new variable:
Dat1 <- Dat1[idx1,]
Dat2 <- Dat2[idx1,]
Dat3 <- Dat3[idx1,]
Dat4 <- Dat4[idx1,]
Although this works for this example, I was hoping to find a way of making it work for n data frames without having to repeat the code n times.
To identify timestamps that are common to all data frames, create a function to return the intersection of multiple vectors
intersectMulti <- function(x = list()) {
  for (i in 2:length(x)) {
    if (i == 2) foo <- x[[i - 1]]
    foo <- intersect(foo, x[[i]])  # intersect the ith vector with the running result
  }
  return(x[[1]][match(foo, x[[1]])])  # index into the original to retain the POSIXct format
}
Note that there are no common timestamps among the four dataframes in the question
> intersectMulti(x=list(Dat1[,1],Dat2[,1],Dat3[,1],Dat4[,1]))
character(0)
But there is one common timestamp in the first three dataframes
> intersectMulti(x=list(Dat1[,1],Dat2[,1],Dat3[,1]))
[1] "2012-05-03 07:00:00 UTC"
Use the result from the function to subset rows of each dataframe with common timestamp:
m <- intersectMulti(x=list(Dat1[,1],Dat2[,1],Dat3[,1]))
Dat1 <- Dat1[Dat1$dateTime %in% m,]
Dat2 <- Dat2[Dat2$dateTime %in% m,]
Dat3 <- Dat3[Dat3$dateTime %in% m,]
Dat4 <- Dat4[Dat4$dateTime %in% m,]
> Dat1
dateTime x1
5 2012-05-03 07:00:00 -0.1607363
> Dat2
dateTime x1
5 2012-05-03 07:00:00 -0.2662494
> Dat3
dateTime x1
5 2012-05-03 07:00:00 -0.1917905
If this works for you:
idx1 <- (Dat1[,1] %in% intersect(intersect(intersect(Dat1[,1],Dat2[,1]),Dat3[,1]),Dat4[,1]))
then try this; it works on lists/vectors and is more elegant:
idx1 <- Dat1[,1] %in% Reduce(intersect, list(Dat1[,1], Dat2[,1], Dat3[,1], Dat4[,1]))
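Going one step further, holding the n data frames in a list lets Reduce() find the common timestamps and a single lapply() do all the subsetting, with no per-frame code at all. A sketch on three small frames built the same way as the question's (the hour offsets are made up so that exactly one timestamp is shared):

```r
set.seed(1)
base_t <- as.POSIXct("2012-05-03 00:00", format = "%Y-%m-%d %H:%M", tz = "UTC")
make_frame <- function(offsets) {
  data.frame(dateTime = base_t + offsets * 3600, x1 = rnorm(length(offsets)))
}
frames <- list(make_frame(c(0, 1, 2)),
               make_frame(c(1, 2, 3)),
               make_frame(c(2, 3, 4)))

# Timestamps present in every frame, then one pass to trim them all
common  <- Reduce(intersect, lapply(frames, `[[`, "dateTime"))
trimmed <- lapply(frames, function(d) d[d$dateTime %in% common, ])
```

Adding a fifth or fiftieth data frame is then just another element in the list.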
