R: How to work with time series of sub-hour data?

I just started with R and have finished some tutorials. However, I am trying to get into time series analysis and am running into big trouble with it. I made a data frame that looks like this:
Date Time T1
1 2014-05-22 15:15:00 21.6
2 2014-05-22 15:20:00 21.2
3 2014-05-22 15:25:00 21.3
4 2014-05-22 15:30:00 21.5
5 2014-05-22 15:35:00 21.1
6 2014-05-22 15:40:00 21.5
Since I didn't want to work with half days, I removed the first and last day from the data frame. Since R recognized the date and time not as such but as "factor", I used the lubridate library to convert them properly. Now it looks like this:
Date Time T1
1 2014-05-23 0S 14.2
2 2014-05-23 5M 0S 14.1
3 2014-05-23 10M 0S 14.6
4 2014-05-23 15M 0S 14.3
5 2014-05-23 20M 0S 14.4
6 2014-05-23 25M 0S 14.5
Now the trouble really starts. Using the ts function changes the date to 16944 and the time to 0. How do I set up a data frame with the correct start date and frequency? A new observation comes in every 5 minutes, so the frequency should be 288. I also tried to set the start date as a vector. Since the 22nd of May was the 142nd day of the year, I tried this:
ts_df <- ts(df, start=c(2014, 142/365), frequency=288)
No error there, but when I call start(ts_df) and end(ts_df) I get:
[1] 2013.998
[1] 2058.994
Can anyone give me a hint on how to work with this kind of data?

"ts" class is typically not a good fit for that type of data. Assuming DF is the data frame shown reproducibly in the Note at the end of this answer we convert it to a "zoo" class object and then perform some manipulations. The related xts package could also be used.
library(zoo)
z <- read.zoo(DF, index = 1:2, tz = "")
window(z, start = "2014-05-22 15:25:00")
head(z, 3) # first 3
head(z, -3) # all but last 3
tail(z, 3) # last 3
tail(z, -3) # all but first 3
z[2:4] # 2nd, 3rd and 4th element of z
coredata(z) # numeric vector of data values
time(z) # vector of datetimes
fortify.zoo(z) # data frame whose 2 cols are (1) datetimes and (2) data values
aggregate(z, as.Date, mean) # aggregate to daily means
ym <- aggregate(z, as.yearmon, mean) # aggregate to monthly means
frequency(ym) <- 12 # only needed because ym has length 1
as.ts(ym) # a year/month series can reasonably be converted to ts
plot(z)
library(ggplot2)
autoplot(z)
read.zoo could also have been used to read the data in from a file.
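For instance, the reading step itself can go straight through read.zoo. This is a sketch, not a tested recipe for your exact file: "temps.txt" is a hypothetical whitespace-separated file created here just for illustration.

```r
library(zoo)

# Hypothetical file with the same three columns as in the question
writeLines(c("Date Time T1",
             "2014-05-22 15:15:00 21.6",
             "2014-05-22 15:20:00 21.2"), "temps.txt")

# index = 1:2 pastes the first two columns into a single datetime index;
# tz = "" makes the index POSIXct in the local time zone
z <- read.zoo("temps.txt", header = TRUE, index = 1:2, tz = "",
              format = "%Y-%m-%d %H:%M:%S")
z
```

Adjust `sep` and `format` to match however your logger actually writes its files.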
Note: DF used above in reproducible form:
DF <- structure(list(Date = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "2014-05-22",
class = "factor"),
Time = structure(1:6, .Label = c("15:15:00", "15:20:00",
"15:25:00", "15:30:00", "15:35:00", "15:40:00"), class = "factor"),
T1 = c(21.6, 21.2, 21.3, 21.5, 21.1, 21.5)), .Names = c("Date",
"Time", "T1"), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6"))

Related

Prophet Date Format R

year_month amount_usd
201501 -390217.24
201502 230944.09
201503 367259.69
201504 15000.00
201505 27000.21
201506 38249.65
df <- structure(list(year_month = 201501:201506, amount_usd = c(-390217.24,
230944.09, 367259.69, 15000, 27000.21, 38249.65)), class = "data.frame", row.names = c(NA,
-6L))
I want to bring it into DD/MM/YYYY format for use with the Prophet forecasting code. This is what I have tried so far:
for (loopitem in loopvec){
  df2 <- subset(df, account_id==loopitem)
  df3 <- df2[,c("year_month","amount_usd")]
  df3$year_month <- as.Date(df3$year_month, format="YYYY-MM", origin="1/1/1970")
  try <- prophet(df3, seasonality.mode = 'multiplicative')
}
Error in fit.prophet(m, df, ...) :
Dataframe must have columns 'ds' and 'y' with the dates and values respectively.
You need to paste a day number (I'm just using the first of the month) onto the year_month values; then you can use the ymd() function from lubridate to convert the column to a Date object.
library(dplyr)
library(lubridate)
mutate_at(df, "year_month", ~ymd(paste(., "01")))
year_month amount_usd
1 2015-01-01 -390217.24
2 2015-02-01 230944.09
3 2015-03-01 367259.69
4 2015-04-01 15000.00
5 2015-05-01 27000.21
6 2015-06-01 38249.65
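For completeness, the same conversion also works in base R without lubridate (a sketch: as.Date needs a full day, so we append "01" and parse with a matching format string). Prophet additionally wants the columns renamed to ds and y, as the error message in the question says:

```r
df <- data.frame(year_month = 201501:201506,
                 amount_usd = c(-390217.24, 230944.09, 367259.69,
                                15000, 27000.21, 38249.65))

# Append a day-of-month so "201501" becomes "20150101", then parse with %Y%m%d
df$year_month <- as.Date(paste0(df$year_month, "01"), format = "%Y%m%d")

# Prophet expects columns named 'ds' (dates) and 'y' (values)
names(df) <- c("ds", "y")
head(df, 2)
```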

convert quarter year to last date of quarter in R

I have an issue when I use as.Date(as.yearqtr(test[,1], format = "%qQ%Y"), frac = 1): it returns an error and the quarter-year is not converted to a date. The error is:
Error in as.yearqtr(as.numeric(x)): (list) object cannot be coerced to type 'double'
This is my dataframe in R.
TIME VALUE
1Q2019 1
2Q2019 2
3Q2019 3
4Q2019 4
The ideal output is
TIME VALUE
2019-03-31 1
2019-06-30 2
2019-09-30 3
2019-12-31 4
We can convert to Date with zoo and get the last date of the quarter with frac = 1. We use a small regex to rearrange the strings into a format that zoo's as.yearqtr can parse:
library(zoo)
df$TIME <- as.Date(as.yearqtr(gsub("(\\d)(Q)(\\d{1,})", "\\3 Q\\1", df$TIME)), frac = 1)
df
TIME VALUE
1 2019-03-31 1
2 2019-06-30 2
3 2019-09-30 3
4 2019-12-31 4
Data:
df <-structure(list(TIME = structure(1:4, .Label = c("1Q2019", "2Q2019",
"3Q2019", "4Q2019"), class = "factor"), VALUE = 1:4), class = "data.frame", row.names = c(NA,
-4L))
Here is a function that returns a vector of dates, given an input vector of strings in the form 1Q2019:
dateStrings <- c("1Q2019","2Q2019","3Q2019","4Q2019","1Q2020")
library(lubridate)

lastDayOfQuarter <- function(x){
  result <- NULL
  months <- c(3, 6, 9, 12)    # last month of each quarter
  days   <- c(31, 30, 30, 31) # last day of that month
  for (i in seq_along(x)) {
    qtr <- as.numeric(substr(x[i], 1, 1))
    result[i] <- mdy(paste(months[qtr], days[qtr], substr(x[i], 3, 6), sep = "-"))
  }
  # result was coerced to numeric inside the loop, so supply the origin
  as.Date(result, origin = "1970-01-01")
}
lastDayOfQuarter(dateStrings)
and the output:
> lastDayOfQuarter(dateStrings)
[1] "2019-03-31" "2019-06-30" "2019-09-30" "2019-12-31" "2020-03-31"
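A vectorized alternative (a sketch using lubridate; it assumes the input always looks like "1Q2019"): rearrange each string so yq() can parse it to the first day of the quarter, then add three months and step back one day.

```r
library(lubridate)

dateStrings <- c("1Q2019", "2Q2019", "3Q2019", "4Q2019", "1Q2020")

# "1Q2019" -> "2019 Q1", which yq() parses to 2019-01-01
firstDay <- yq(sub("(\\d)Q(\\d{4})", "\\2 Q\\1", dateStrings))

# first day of the next quarter minus one day = last day of this quarter
lastDay <- firstDay + months(3) - days(1)
lastDay
```

This avoids the per-element loop entirely, which matters if the vector is long.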

Aggregate hourly data into monthly data starting with the yyyy-mm-dd h:m format in R

I've been actively looking for a solution to my question in R and did not find anything that solves my problem.
I have an R report to submit at the beginning of January, using pepe meme data. I am studying the price of pepe memes over time, and here is my problem: I have dates in the format yyyy-mm-dd h:m, and I want to aggregate those into monthly means. I was thinking about first making a new file with my timestamp in the format yyyy-mm, but I am not able to do this. I was successful translating into the yyyy-mm-dd format, but I really have trouble getting to the yyyy-mm format.
So, more clearly, here are my two questions :
How do I aggregate my yyyy-mm-dd h:m dates into monthly ones, taking the mean of each month's data (so, in the format yyyy-mm)?
If you do not know how to aggregate the dates directly, do any of you know how to go from the yyyy-mm-dd h:m format to the yyyy-mm one?
Here are some rows of my dataset (just an abstract, it contains more than 250 rows):
Timestamp ForwardQuantity TotalPriceUSDPerUnit
------------------------------------------------------------
1 2016-09-26 04:00:00 3 3.44
2 2016-09-26 04:00:00 7 3.44
3 2016-09-26 05:00:00 3 3.39
4 2016-09-26 05:00:00 1 3.39
5 2016-09-26 06:00:00 2 3.39
6 2016-09-26 13:00:00 4 2.84
7 2016-09-28 04:00:00 1 2.88
8 2016-09-28 04:00:00 1 2.92
9 2016-09-28 06:00:00 1 2.92
10 2016-09-28 06:00:00 1 2.92
Many thanks in advance, and a merry Christmas to those celebrating it!
EDIT : Result expected :
Timestamp Average price
------------------------------------
1 2016-09 2.9981
Here the average price has been obtained by weighting each price by its forward quantity (a quantity-weighted mean over the month).
EDIT 2 : The output of dput(head(DatasHAIRPEPE3col, 10)) is the following
structure(list(Timestamp = structure(c(1474862400, 1474862400,
1474866000, 1474866000, 1474869600, 1474894800, 1475035200, 1475035200,
1475042400, 1475042400), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
ForwardQuantity = c(3L, 7L, 3L, 1L, 2L, 4L, 1L, 1L, 1L, 1L
), TotalPriceUSDPerUnit = c(3.445, 3.445, 3.392, 3.392, 3.392,
2.8352, 2.8795, 2.9238, 2.9238, 2.9238)), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
Using the data shown reproducibly in the Note at the end:
1) zoo: Convert the data to a zoo object, aggregating it at the same time to class "yearmon". That gives a zoo object Mean with one mean per year/month. You can either use that directly or convert it to a data.frame using fortify.zoo. This solution is probably more convenient than (2) below, since we directly represent the year/month as a "yearmon" class object, which can be plotted and manipulated in a logical manner.
library(zoo)
Mean <- read.zoo(DF, FUN = as.yearmon, aggregate = mean)
fortify.zoo(Mean) # optional
giving this data frame:
Index Mean
1 Sep 2016 3.406667
You could now further manipulate, e.g. plot it using plot.zoo like this:
plot(Mean)
2) Base R: Alternatively, use the first 7 characters of each timestamp to represent the year/month and aggregate by that.
DF2 <- transform(DF, Timestamp = substring(Timestamp, 1, 7))
aggregate(UsdPricePerUnit ~ Timestamp, DF2, mean)
giving:
Timestamp UsdPricePerUnit
1 2016-09 3.406667
Note
Lines <- "
Timestamp UsdPricePerUnit
2016-09-26 04:00:00 3.44
2016-09-26 04:00:00 3.44
2016-09-26 05:00:00 3.39
2016-09-26 05:00:00 3.39
2016-09-26 05:00:00 3.39
2016-09-26 06:00:00 3.39"
DF <- read.csv(textConnection(gsub(" +", ",", Lines)))
Using the sample data provided in a previous answer (with an additional month added for demonstration), along with dplyr and anytime:
library(tidyverse)
library(anytime)
Lines <- "
Timestamp ForwardQuantity UsdPricePerUnit
2016-09-26 04:00:00 3 3.44
2016-09-26 04:00:00 7 3.44
2016-09-26 05:00:00 3 3.39
2016-10-26 05:00:00 1 3.39
2016-10-26 05:00:00 2 3.39
2016-10-26 06:00:00 4 3.39"
DF <- read.csv(textConnection(gsub(" +", ",", Lines)))
DF %>%
  mutate(month = format(anydate(Timestamp), "%Y-%m")) %>%
  group_by(month) %>%
  mutate(MonthlySpend = ForwardQuantity * UsdPricePerUnit) %>%
  summarise(QuanPerMon = sum(ForwardQuantity),
            SpendPerMon = sum(MonthlySpend)) %>%
  mutate(AveragePrice = SpendPerMon / QuanPerMon) %>%
  select(1, 4)
# A tibble: 2 x 2
month AveragePrice
<chr> <dbl>
1 2016-09 3.43
2 2016-10 3.39
EDIT - New data added to question
This worked for me with your data
df %>%
  mutate(month = format(anydate(Timestamp), "%Y-%m")) %>%
  group_by(month) %>%
  mutate(MonthlySpend = ForwardQuantity * TotalPriceUSDPerUnit) %>%
  summarise(QuanPerMon = sum(ForwardQuantity),
            SpendPerMon = sum(MonthlySpend)) %>%
  mutate(AveragePrice = SpendPerMon / QuanPerMon) %>%
  select(1, 4)
# A tibble: 1 x 2
month AveragePrice
<chr> <dbl>
1 2016-09 3.24

How to subset rows for a specific range of dates in r?

I am new to R and currently working on some rainfall data. I have two data frames named df1 and df2.
df1
Date Duration_sum
5/28/2014 110
5/31/2014 20
5/31/2014 20
6/1/2014 10
6/1/2014 110
6/3/2014 140
6/4/2014 40
6/5/2014 60
6/12/2014 10
6/14/2014 100
df2
Date PercentRemoval
6/2/2014 25.8
6/5/2014 78.58
6/6/2014 15.6
6/13/2014 70.06
I want to look up the dates from df2 in df1. For example, if the 1st date from df2 is available in df1, I want to subset rows in df1 within the range of that specific date and 3 days prior to that. If that date is not available, then just look for the previous 3 days.
In case the data for the previous 3 days are not available, it will extract as many days as are available, up to a maximum of 3 days prior to the specific date in df2. If none of the dates are available in df1, that date is ignored and we look at the next date in df2. Also, for example, the 3 days prior to 6/6/2014 are available in df1, but we have already considered those days for 6/5/2014, so 6/6/2014 is ignored.
The resulting data frame should look something like this:
df3
col_1 Date Duration_sum
5/31/2014 20
5/31/2014 20
6/1/2014 10
6/2/2014 6/1/2014 110
6/3/2014 140
6/4/2014 40
6/5/2014 6/5/2014 60
6/13/2014 6/12/2014 10
I have used this code:
df3 <- df1[df1$Date %in% as.Date(c(df2)),]
This code gives me the results for the specific dates but not for the previous 3 days. I would really appreciate it if someone could help me out with this code or suggest another approach. Thanks in advance.
This may be one way to do the task. If I am reading your question correctly, you want to drop any date in df2 that is less than 3 days after the previous one. This avoids the overlapping issue you mention in your question: it successfully removes the 6th of June, 2014. Once you have filtered the dates in df2, you can subset df1 for each remaining date in the lapply() part. The output is a list, so we assign names to each data frame in the list and finally bind all the data frames together.
library(dplyr)

df1 <- mutate(df1, Date = as.Date(Date, format = "%m/%d/%Y"))
df2 <- mutate(df2, Date = as.Date(Date, format = "%m/%d/%Y")) %>%
  filter(Date - lag(Date, default = as.Date("1900-01-01")) >= 3)

# for each remaining df2 date, take the df1 rows within [date - 3, date]
temp <- lapply(df2$Date, function(x) filter(df1, between(Date, x - 3, x)))
names(temp) <- as.character(df2$Date)
bind_rows(temp, .id = "df2.date")
# df2.date Date Duration_sum
#1 2014-06-02 2014-05-31 20
#2 2014-06-02 2014-05-31 20
#3 2014-06-02 2014-06-01 10
#4 2014-06-02 2014-06-01 110
#5 2014-06-05 2014-06-03 140
#6 2014-06-05 2014-06-04 40
#7 2014-06-05 2014-06-05 60
#8 2014-06-13 2014-06-12 10
DATA
df1 <- structure(list(Date = c("5/28/2014", "5/31/2014", "5/31/2014",
"6/1/2014", "6/1/2014", "6/3/2014", "6/4/2014", "6/5/2014", "6/12/2014",
"6/14/2014"), Duration_sum = c(110L, 20L, 20L, 10L, 110L, 140L,
40L, 60L, 10L, 100L)), .Names = c("Date", "Duration_sum"), class = "data.frame", row.names = c(NA,
-10L))
df2 <- structure(list(Date = c("6/2/2014", "6/5/2014", "6/6/2014", "6/13/2014"
), PercentRemoval = c(25.8, 78.58, 15.6, 70.06)), .Names = c("Date",
"PercentRemoval"), class = "data.frame", row.names = c(NA, -4L
))

Aggregate (count) occurrences of values over arbitrary timeframe

I have a CSV file with timestamps and certain event types which happened at those times.
What I want is to count the number of occurrences of certain event types in 6-minute intervals.
The input-data looks like:
date,type
"Sep 22, 2011 12:54:53.081240000","2"
"Sep 22, 2011 12:54:53.083493000","2"
"Sep 22, 2011 12:54:53.084025000","2"
"Sep 22, 2011 12:54:53.086493000","2"
I load and clean the data with this piece of code:
> raw_data <- read.csv('input.csv')
> cured_dates <- c(strptime(raw_data$date, '%b %d, %Y %H:%M:%S', tz="CEST"))
> cured_data <- data.frame(cured_dates, c(raw_data$type))
> colnames(cured_data) <- c('date', 'type')
After cleaning, the data looks like this:
> head(cured_data)
date type
1 2011-09-22 14:54:53 2
2 2011-09-22 14:54:53 2
3 2011-09-22 14:54:53 2
4 2011-09-22 14:54:53 2
5 2011-09-22 14:54:53 1
6 2011-09-22 14:54:53 1
I have read a lot of examples for xts and zoo, but somehow I can't get the hang of it.
The output data should look something like:
date type count
2011-09-22 14:54:00 CEST 1 11
2011-09-22 14:54:00 CEST 2 19
2011-09-22 15:00:00 CEST 1 9
2011-09-22 15:00:00 CEST 2 12
2011-09-22 15:06:00 CEST 1 23
2011-09-22 15:06:00 CEST 2 18
zoo's aggregate function looks promising; I found this code snippet:
# aggregate POSIXct seconds data every 10 minutes
tt <- seq(10, 2000, 10)
x <- zoo(tt, structure(tt, class = c("POSIXt", "POSIXct")))
aggregate(x, time(x) - as.numeric(time(x)) %% 600, mean)
Now I'm just wondering how I can apply this to my use case.
Naive as I am, I tried:
> zoo_data <- zoo(cured_data$type, structure(cured_data$time, class = c("POSIXt", "POSIXct")))
> aggr_data = aggregate(zoo_data$type, time(zoo_data$time), - as.numeric(time(zoo_data$time)) %% 360, count)
Error in `$.zoo`(zoo_data, type) : not possible for univariate zoo series
I must admit that I'm not really confident in R, but I'm trying. :-)
I'm kind of lost. Could anyone point me in the right direction?
Thanks a lot!
Cheers, Alex.
Here is the output of dput for a small subset of my data. The full data set is around 80 million rows.
structure(list(date = structure(c(1316697885, 1316697885, 1316697885,
1316697885, 1316697885, 1316697885, 1316697885, 1316697885, 1316697885,
1316697885, 1316697885, 1316697885, 1316697885, 1316697885, 1316697885,
1316697885, 1316697885, 1316697885, 1316697885, 1316697885, 1316697885,
1316697885, 1316697885), class = c("POSIXct", "POSIXt"), tzone = ""),
type = c(2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 2L,
1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L)), .Names = c("date",
"type"), row.names = c(NA, -23L), class = "data.frame")
We can read it using read.csv, convert the first column to a date-time binned into 6-minute intervals, and add a dummy column of 1's. Then we re-read it using read.zoo, splitting on the type and aggregating on the dummy column:
# test data
Lines <- 'date,type
"Sep 22, 2011 12:54:53.081240000","2"
"Sep 22, 2011 12:54:53.083493000","2"
"Sep 22, 2011 12:54:53.084025000","2"
"Sep 22, 2011 12:54:53.086493000","2"
"Sep 22, 2011 12:54:53.081240000","3"
"Sep 22, 2011 12:54:53.083493000","3"
"Sep 22, 2011 12:54:53.084025000","3"
"Sep 22, 2011 12:54:53.086493000","4"'
library(zoo)
library(chron)
# convert to chron and bin into 6 minute bins using trunc
# Also add a dummy column of 1's
# and remove any leading space (removing space not needed if there is none)
DF <- read.csv(textConnection(Lines), as.is = TRUE)
fmt <- '%b %d, %Y %H:%M:%S'
DF <- transform(DF, dummy = 1,
date = trunc(as.chron(sub("^ *", "", date), format = fmt), "00:06:00"))
# split and aggregate
z <- read.zoo(DF, split = 2, aggregate = length)
With the above test data the solution looks like this:
> z
2 3 4
(09/22/11 12:54:00) 4 3 1
Note that the above has been done in wide form since that form constitutes a time series whereas the long form does not. There is one column for each type. In our test data we had types 2, 3 and 4 so there are three columns.
(We have used chron here since its trunc method fits well with binning into 6-minute groups. chron does not support time zones, which can be an advantage, since you can't make one of the many possible time-zone errors. If you want POSIXct anyway, convert at the end, e.g. time(z) <- as.POSIXct(paste(as.Date.dates(time(z)), times(time(z)) %% 1)). This expression is shown in a table in one of the R News 4/1 articles, except that we used as.Date.dates instead of just as.Date to work around a bug that seems to have been introduced since then. We could also use time(z) <- as.POSIXct(time(z)), but that would result in a different time zone.)
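As a small illustration of that chron-to-POSIXct conversion (a sketch on a single toy value, using the current session's time zone; here plain as.Date on the chron object suffices in current chron versions):

```r
library(chron)

# A toy chron datetime: Sep 22, 2011 12:54:53
ct <- chron(dates. = "09/22/11", times. = "12:54:53")

# Split into the day part and the fraction-of-day part, then rebuild as POSIXct
pt <- as.POSIXct(paste(as.Date(ct), times(as.numeric(ct) %% 1)))
pt
```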
EDIT:
The original solution binned into dates but I noticed afterwards that you wish to bin into 6 minute periods so the solution was revised.
EDIT:
Revised based on comment.
You are almost all the way there. All you need to do now is create a zoo-ish version of that data and map it onto the aggregate code. Since you want to categorize by both time and type, your second argument to aggregate must be a bit more complex, and since you want counts rather than means you should use length(). I do not think count is a base R or zoo function; the only count function I see in my workspace comes from pkg:plyr, so I don't know how well it would play with aggregate. length works as most people expect for vectors but often surprises people when working with data.frames. If you do not get what you want with length, see whether NROW works instead (with your data layout they both succeed). With the new data object it is necessary to put the type argument first. And it turns out that aggregate only handles single-category classifiers for zoo input, so you need as.vector to strip the zoo-ness:
library(zoo)

# zoo version of the data: values are the types, index is the datetime
x <- zoo(cured_data$type, cured_data$date)

with(cured_data,
  aggregate(as.vector(x), list(type = type,
    interval = as.factor(time(x) - as.numeric(time(x)) %% 360)),
    FUN = NROW)
)
#   type            interval  x
# 1    1 2011-09-22 09:24:00 12
# 2    2 2011-09-22 09:24:00 11
This is an example modified from the answer where you got that code (an example on SO by WizaRd Dirk):
Aggregate (count) occurrences of values over arbitrary timeframe
tt <- seq(10, 2000, 10)
x <- zoo(tt, structure(tt, class = c("POSIXt", "POSIXct")))
aggregate(as.vector(x), by=list(cat=as.factor(x),
tms = as.factor(index(x) - as.numeric(index(x)) %% 600)), length)
cat tms x
1 1 1969-12-31 19:00:00 26
2 2 1969-12-31 19:00:00 22
3 3 1969-12-31 19:00:00 11
4 1 1969-12-31 19:10:00 17
5 2 1969-12-31 19:10:00 28
6 3 1969-12-31 19:10:00 15
7 1 1969-12-31 19:20:00 17
8 2 1969-12-31 19:20:00 16
9 3 1969-12-31 19:20:00 27
10 1 1969-12-31 19:30:00 8
11 2 1969-12-31 19:30:00 4
12 3 1969-12-31 19:30:00 9
