compute mean of last 5 days of each month in R - r

I am finding this to be quite tricky. I have an R time series data frame, consisting of a value for each day for about 50 years of data. I would like to compute the mean of only the last 5 values for each month. This would be simple if each month ended in the same 31st day, in which case I could just subset. However, as we all know some months end in 31, some in 30, and then we have leap years. So, is there a simple way to do this in R without having to write a complex indexing function to take account of all the possibilities including leap years? Perhaps a function that works on zoo type objects? The data frame is as follows:
Date val
1 2014-01-06 1.49
2 2014-01-03 1.38
3 2014-01-02 1.34
4 2013-12-31 1.26
5 2013-12-30 2.11
6 2013-12-26 3.20
7 2013-12-25 3.00
8 2013-12-24 2.89
9 2013-12-23 2.90
10 2013-12-22 4.5

tapply. Try this, where dd is your data frame and the Date column is assumed to be of class "Date". The rows are first put in ascending order of Date so that tail picks the last 5 observations of each month. (If dd is already sorted in descending order of Date, as it appears to be in the question, we can skip the reordering and apply function(x) mean(head(x, 5)) to dd directly.)
> with(dd[order(dd$Date), ], tapply(val, format(Date, "%Y-%m"), function(x) mean(tail(x, 5))))
2013-12 2014-01
2.492000 1.403333
aggregate.zoo. In terms of zoo, we can do the following, which returns another zoo object whose index is of class "yearmon". (With zoo it does not matter whether dd is sorted, since zoo orders the series by its index automatically.)
> library(zoo)
> z <- read.zoo(dd)
> aggregate(z, as.yearmon, function(x) mean(tail(x, 5)))
Dec 2013 Jan 2014
2.492000 1.403333
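For reference, here is a self-contained base R sketch of the same idea using the sample rows from the question. The helper name last_n_mean is made up for illustration; it sorts by Date itself, so the input order does not matter. Note that it averages the last 5 observations recorded in each month, which is not the same as the last 5 calendar days when days are missing.

```r
# Sample rows from the question
dd <- data.frame(
  Date = as.Date(c("2014-01-06", "2014-01-03", "2014-01-02", "2013-12-31",
                   "2013-12-30", "2013-12-26", "2013-12-25", "2013-12-24",
                   "2013-12-23", "2013-12-22")),
  val  = c(1.49, 1.38, 1.34, 1.26, 2.11, 3.20, 3.00, 2.89, 2.90, 4.5)
)

# Hypothetical helper: mean of the last n observations in each month
last_n_mean <- function(df, n = 5) {
  df <- df[order(df$Date), ]            # oldest first, so tail() is the month end
  month <- format(df$Date, "%Y-%m")     # grouping key such as "2013-12"
  tapply(df$val, month, function(x) mean(tail(x, n)))
}

res <- last_n_mean(dd)
print(res)
```

A month with fewer than n observations (January 2014 here) is averaged over whatever values it has.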

Related

How to apply several time series models like ets, auto.arima etc. to groups in the data in R using purrr/tidyverse?

My dataset looks like the following. I am trying to predict 'Amount' for the next 2 months using ets, auto.arima, Prophet or any other model. My issue is that I would like to predict the amount separately for each group, i.e. A, B, C, for the next 2 months. I am not sure how to do that in R.
data = data.frame(Date=c('2017-01-01', '2017-02-01', '2017-03-01', '2017-04-01','2017-05-01','2017-06-01','2017-01-01', '2017-02-01', '2017-03-01', '2017-04-01','2017-05-01','2017-06-01','2017-01-01', '2017-02-01', '2017-03-01', '2017-04-01','2017-05-01','2017-06-01'),
Group=c('A','A','A','A','A','A','B','B','B','B','B','B','C','C','C','C','C','C'),
Amount=c('12.1','13','15','10','12','9.0','12.5','13.3','14.8','11','10','12.1','13','12.2','11','10.9','13.4','11.1'))
data
Date Group Amount
1 2017-01-01 A 12.1
2 2017-02-01 A 13
3 2017-03-01 A 15
4 2017-04-01 A 10
5 2017-05-01 A 12
6 2017-06-01 A 9.0
7 2017-01-01 B 12.5
8 2017-02-01 B 13.3
9 2017-03-01 B 14.8
10 2017-04-01 B 11
11 2017-05-01 B 10
12 2017-06-01 B 12.1
13 2017-01-01 C 13
14 2017-02-01 C 12.2
15 2017-03-01 C 11
16 2017-04-01 C 10.9
17 2017-05-01 C 13.4
18 2017-06-01 C 11.1
I need to fit multiple univariate time series models (ets, auto.arima and prophet) by group (A, B, C). Assume the groups are independent of each other. Also, how can we extract error metrics and point forecasts, say 2 periods ahead, in a data frame, and plot the forecasts, again by group?
Would iterating over the groups with packages such as tidyverse/purrr or sweep be a solution here?
First convert the dates to the yearmon class so that the months are regularly spaced; Dates are not, because months differ in length. yearmon represents dates internally as year + 0 for Jan, year + 1/12 for Feb, ..., year + 11/12 for Dec. If desired, the yearmon values can subsequently be converted to numeric with as.numeric to get this internal representation.
calc represents the function that performs the calculation on a single group. Replace it with your function. Its first argument should be a data frame with Date and Amount columns. Additional arguments are optional and only needed if it is desired to pass fixed parameters that do not vary across groups. In the example below we pass a string, "Hello" to the msg argument. The function can return any sort of object such as a plain vector, list or other object.
In the last line by will call calc, once per group, returning a list of the return values from calc, one component per group.
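library(zoo)
# yearmon dates are regularly spaced; Amount is stored as text in the question
data2 <- transform(data,
  Date = as.yearmon(Date),
  Amount = as.numeric(Amount)
)
# calc does the work for one group; msg is an example of a fixed extra argument
calc <- function(dat, msg) {
  print(msg)
  fm <- lm(Amount ~ Date, dat)
  # predict 2 months past the last observed month
  predict(fm, list(Date = tail(dat$Date, 1) + 2/12))
}
# drop the Group column from the data and split on it instead
by(data2[-2], data2[[2]], calc, msg = "Hello")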
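Since the question also asks for the point forecasts in a data frame, here is a minimal base R sketch of one way to collect them. It uses lm on a within-group month counter as a stand-in for ets or auto.arima; the column t and the function forecast_one are illustrative names, and the rows are assumed to be in date order within each group.

```r
# Data from the question, with Amount entered as numbers
data <- data.frame(
  Date   = rep(seq(as.Date("2017-01-01"), by = "month", length.out = 6), times = 3),
  Group  = rep(c("A", "B", "C"), each = 6),
  Amount = c(12.1, 13, 15, 10, 12, 9.0,
             12.5, 13.3, 14.8, 11, 10, 12.1,
             13, 12.2, 11, 10.9, 13.4, 11.1)
)

# Month counter 1, 2, ... within each group (assumes rows are in date order)
data$t <- ave(seq_along(data$Amount), data$Group, FUN = seq_along)

# Illustrative per-group model: a linear trend fitted with lm
forecast_one <- function(dat, h = 2) {
  fm <- lm(Amount ~ t, data = dat)
  new_t <- max(dat$t) + seq_len(h)      # the h periods after the last one
  data.frame(Group    = dat$Group[1],
             t        = new_t,
             Forecast = as.numeric(predict(fm, data.frame(t = new_t))))
}

# One small data frame per group, stacked into a single result
forecasts <- do.call(rbind, lapply(split(data, data$Group), forecast_one))
print(forecasts)
```

Swapping in another model only requires changing the body of forecast_one, as long as it still returns one row per forecast period.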

How to iterate and make new variables within a function in R? [duplicate]

Is there a way within R to write a function that turns each subset (for example, each date) into its own data frame? For example, I have 30 days' worth of data, and I want to break it down into individual days and output each day as a new data frame. I can't figure out how to do it in a function. Any clues?
Example:
Dataframe: df_of_month
Output desired via a loop function of sorts:
df_of_month_day1
df_of_month_day2
df_of_month_day3
df_of_month_day4
df_of_month_day5
df_of_month_day6
etc. I've been looking at multiple ways and none of them is working.
To give you an answer to your question, you would achieve this with lapply. For instance, consider the following:
Create some sample data:
df <- data.frame(Day = rep(seq.Date(from = as.Date('2010-01-01'), to = as.Date('2010-01-30'), by =1), 5))
df$somevar <- rnorm(nrow(df))
head(df)
Day somevar
1 2010-01-01 -0.946059466
2 2010-01-02 0.005897001
3 2010-01-03 -0.297566286
4 2010-01-04 -0.637562495
5 2010-01-05 -0.549800912
6 2010-01-06 0.287709994
Now, observe that unique can give you a vector with all unique dates:
unique(df$Day)
[1] "2010-01-01" "2010-01-02" "2010-01-03" "2010-01-04" "2010-01-05" "2010-01-06" "2010-01-07" "2010-01-08" "2010-01-09" "2010-01-10"
[11] "2010-01-11" "2010-01-12" "2010-01-13" "2010-01-14" "2010-01-15" "2010-01-16" "2010-01-17" "2010-01-18" "2010-01-19" "2010-01-20"
[21] "2010-01-21" "2010-01-22" "2010-01-23" "2010-01-24" "2010-01-25" "2010-01-26" "2010-01-27" "2010-01-28" "2010-01-29" "2010-01-30"
This you can pass to lapply to be used for subsetting:
lapply(unique(df$Day), function(x) df[df[,"Day"]==x,])
[[1]]
Day somevar
1 2010-01-01 -0.9460595
31 2010-01-01 -0.3434005
61 2010-01-01 -1.5463641
91 2010-01-01 -0.5192375
121 2010-01-01 -1.1780619
[[2]]
Day somevar
2 2010-01-02 0.005897001
32 2010-01-02 -1.346336688
62 2010-01-02 -0.321702391
92 2010-01-02 -0.384277955
122 2010-01-02 0.058906305
... (output omitted)
where the output of lapply is a list with the corresponding dataframes.
Needless to say, you would assign this to a name to capture all data frames in a list, as in mylist <- lapply(...). However, if you want to have them in your global environment, first give each data frame a name, for instance with mylist <- setNames(mylist, paste0("df", format(unique(df$Day), format = "%Y%m%d"))), and then call list2env(mylist, envir = .GlobalEnv) to push each list element into the global environment. (Without the envir argument, list2env puts the objects into a new environment instead.)
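As a side note, base R's split does the subsetting and the naming in one step. The sketch below rebuilds the sample data with a fixed seed (the seed and the object name by_day are illustrative choices) and keeps everything in a named list.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
df <- data.frame(
  Day = rep(seq(as.Date("2010-01-01"), as.Date("2010-01-30"), by = "day"), 5)
)
df$somevar <- rnorm(nrow(df))

# One data frame per day, named by the date
by_day <- split(df, df$Day)

names(by_day)[1:3]            # the first three list names
by_day[["2010-01-02"]]        # all rows recorded on 2 January

# Work on every piece without creating 30 separate objects
daily_means <- sapply(by_day, function(d) mean(d$somevar))
```

Keeping the pieces in one list means later steps can loop over them with lapply or sapply rather than looking up objects by constructed names.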
However, as mentioned in the comments, this is probably not a good idea. If you want to do something to each date, consider the group-by solution with dplyr: For instance, imagine you want to get the mean by date:
library(dplyr)
df %>% group_by(Day) %>% summarize(mean_var = mean(somevar))
# A tibble: 30 x 2
Day mean_var
<date> <dbl>
1 2010-01-01 -0.907
2 2010-01-02 -0.398
3 2010-01-03 0.213
4 2010-01-04 -0.142
5 2010-01-05 -0.377
6 2010-01-06 0.404
7 2010-01-07 -0.634
8 2010-01-08 1.00
9 2010-01-09 0.378
10 2010-01-10 -0.0863
# ... with 20 more rows
where each row corresponds to the group-wise mean. This pattern is called split-apply-combine and is worth reading up on, since it comes up again and again.
Just for reference, in base R, you could achieve this using e.g. by, as in
by(df$somevar, df$Day, FUN = mean)
though either dplyr or data.table are probably more user-friendly.

Replace NA with Median Based on Multiple Conditions [duplicate]

This is my first Stack Overflow post. I researched extensively but have not found a similar post.
I am trying to impute the median for NA values based on two conditions.
Here is my code:
#Create sample of original data for reproducibility
Date<-c("2009-05-01","2009-05-02","2009-05-03","2009-06-01","2009-06-02",
"2009-06-03", "2010-05-01","2010-05-02","2010-05-03","2010-06-01",
"2010-06-02","2010-06-03","2011-05-01","2011-05-02","2011-05-03",
"2011-06-01","2011-06-02","2011-06-03")
Month<- c("May","May","May","June","June","June",
"May","May","May","June","June","June",
"May","May","May","June","June","June")
DayType<- c("Monday","Tuesday","Wednesday","Monday","Tuesday","Wednesday",
"Monday","Tuesday","Wednesday","Monday","Tuesday","Wednesday",
"Monday","Tuesday","Wednesday","Monday","Tuesday","Wednesday")
Qty<- c(NA,NA,NA,NA,NA,NA,
1,2,1,10,15,13,
3,2,5,20,14,16)
#Combine into dataframe
Example<-data.frame(Date,Month,DayType,Qty)
#Test output
Example
# Make a separate dataframe with the median value for each day type within each month
library(plyr) # provides ddply and .()
test1 <- ddply(Example, .(DayType, Month), summarize, median = median(Qty, na.rm = TRUE))
This works as expected. The test1 output looks like this:
DayType   Month median
Monday    June    15.0
Monday    May      2.0
Tuesday   June    14.5
Tuesday   May      2.0
Wednesday June    14.5
Wednesday May      3.0
My second step replaces "NA" values in the original dataset with the medians calculated in test1. This is where my issue comes in.
Example$Qty[is.na(Example$Qty)] <- test1$median[match(Example$DayType,test1$DayType,Example$Month,test1$Month)][is.na(Example$Qty)]
Example
match() only matches on DayType, so each NA gets a median for the day type alone rather than for the day type within its month. The output is the same seven repeating values for the entire set. I have not figured out how to match on both columns simultaneously.
Output:
Date DayType Month GSEvtQty
2009-05-01 Monday May 15.0 *should be 2.0, matching to June
2009-05-02 Tuesday May 14.5 *should be 2.0, matching to June
2009-05-03 Wednesday May 14.5 *should be 3.0, matching to June
2009-06-01 Monday June 15.0 *imputes correctly
2009-06-02 Tuesday June 14.5 *imputes correctly
2009-06-03 Wednesday June 14.5 *imputes correctly
2010-05-01 Monday May 1.0
2010-05-02 Tuesday May 2.0
2010-05-03 Wednesday May 1.0
2010-06-01 Monday June 10.0
2010-06-02 Tuesday June 15.0
2010-06-03 Wednesday June 13.0
I have also tried using %in%:
Example$Qty[is.na(Example$Qty)] <- test1$median[Example$DayType %in% test1$DayType & Example$Month %in% test1$Month][is.na(Example$Qty)]
But that does not match correctly and only outputs a limited number of values rather than over the entire series of NAs.
I also tried na.aggregate from the zoo package, as cleverly suggested by @Jaap:
setDT(Example)[, Value := na.aggregate("Qty", FUN = median), by = c("DayType","Month")]
For some reason this does not fill in the NAs:
Output:
Date Month DayType Qty
2009-05-01 May Monday NA
2009-05-02 May Tuesday NA
2009-05-03 May Wednesday NA
2009-06-01 June Monday NA
Any suggestions would be greatly appreciated! Thanks for sticking with this post for so long and look forward to paying the assistance forward in the future.
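This is what merge was created for. Join the medians onto the data by both keys, then fill the missing values from the joined median column. (info and GSEvtQty are presumably the asker's real data frame and quantity column, i.e. Example and Qty in the sample; note that merge returns the rows sorted by the keys.)
merged <- merge(info, test1, by = c("DayType", "Month"))
merged$GSEvtQty[is.na(merged$GSEvtQty)] <- merged$median[is.na(merged$GSEvtQty)]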
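If the original row order has to be preserved, a different base R route is ave, which returns the group statistic aligned with the input rows. The sketch below is an alternative to merge, not the method from the answer; it uses the question's sample values, and the names group_median, missing and filled are illustrative.

```r
# Sample data from the question, reduced to the columns that matter here
Example <- data.frame(
  Month   = rep(rep(c("May", "June"), each = 3), times = 3),
  DayType = rep(c("Monday", "Tuesday", "Wednesday"), times = 6),
  Qty     = c(NA, NA, NA, NA, NA, NA,
              1, 2, 1, 10, 15, 13,
              3, 2, 5, 20, 14, 16)
)

# Median of each DayType/Month group, repeated on every row of that group
group_median <- ave(Example$Qty, Example$DayType, Example$Month,
                    FUN = function(x) median(x, na.rm = TRUE))

# Fill only the missing entries; the row order is left untouched
missing <- is.na(Example$Qty)
filled <- Example
filled$Qty[missing] <- group_median[missing]
head(filled, 6)
```

Because both grouping columns are passed to ave, a Monday in May is filled from the May Mondays only, which is the behaviour the match attempt in the question was missing.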

FRED data: aggregate quarterly data into annual

I need to convert quarterly data into yearly data by summing over the 4 quarters in each year. Searching stackoverflow.com, I found that using a function to sum over periods seems to work. However, the date format of the result did not match, so I couldn't use the converted annual series together with my other arrays.
For example, annual data in FRED looks as follows:
2009-01-01 12126.078
2010-01-01 12739.542
2011-01-01 13352.255
2012-01-01 14061.878
2013-01-01 14444.823
However, when I changed the data using the following function:
library(quantmod)
library(zoo)
library(mFilter)
library(nleqslv)
fredsym <- c("PROPINC")
getSymbols(fredsym, src = "FRED") # download the quarterly series; creates the object PROPINC
quarter.proprietors_income <- PROPINC
## convert to annual
as.year <- function(x) as.integer(as.yearqtr(x)) # maps each date to its calendar year
annual.proprietors_income <- aggregate(quarter.proprietors_income, as.year, sum) # sum the quarters of each year
it changes from this:
2016-01-01 1327.613
2016-04-01 1339.493
2016-07-01 1346.067
2016-10-01 1354.560
2017-01-01 1380.221
2017-04-01 1378.637
2017-07-01 1381.911
2017-10-01 1403.114
to this:
2011 4574.669
2012 4965.486
2013 5138.968
2014 5263.208
2015 5275.225
2016 5367.733
2017 5543.883
What I need is annual data in the original YYYY-MM-DD format, with each year dated 01-01. Otherwise it doesn't line up with my other annual data.
Is there any way to solve this issue?
Using DF from the Note below, use cut as shown:
aggregate(DF["value"], list(year = as.Date(cut(as.Date(DF$Date), "year"))), sum)
giving:
year value
1 2016-01-01 5367.733
2 2017-01-01 5543.883
Note
Lines <- "Date value
2016-01-01 1327.613
2016-04-01 1339.493
2016-07-01 1346.067
2016-10-01 1354.560
2017-01-01 1380.221
2017-04-01 1378.637
2017-07-01 1381.911
2017-10-01 1403.114"
DF <- read.table(text = Lines, header = TRUE)
I found that aggregate turns the result into a zoo object, so it is no longer an xts time series.
Alternatively, apply.yearly seems to work:
annual.proprietors_income <- apply.yearly(xts(quarter.proprietors_income),sum)
The result stays in xts. The catch is that each year is stamped with the date of its last quarter, YYYY-10-01. How can I change that to YYYY-01-01?
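One possible way to get the 01-01 stamps while staying in xts is to overwrite the index after aggregating. The sketch below is an untested-against-FRED illustration that uses the eight quarterly values shown earlier in place of the downloaded series; the object names q and ann are made up.

```r
library(xts)

# The eight quarterly values from the question stand in for the FRED series
q <- xts(
  c(1327.613, 1339.493, 1346.067, 1354.560,
    1380.221, 1378.637, 1381.911, 1403.114),
  order.by = as.Date(c("2016-01-01", "2016-04-01", "2016-07-01", "2016-10-01",
                       "2017-01-01", "2017-04-01", "2017-07-01", "2017-10-01"))
)

# Sum within each calendar year; each result carries the date of the
# last observation in that year (here YYYY-10-01)
ann <- apply.yearly(q, sum)

# Replace every index value with the first day of its year
index(ann) <- as.Date(format(index(ann), "%Y-01-01"))
ann
```

The values are unchanged by the last step; only the time stamps move, so the series can be merged with other annual data dated 1 January.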

period.apply over an hour with deciding start time

So I have an xts time series covering a year, with time zone "UTC". The time interval between rows is 15 minutes.
x1 x2
2014-12-31 23:15:00 153.0 0.0
2014-12-31 23:30:00 167.1 5.4
2014-12-31 23:45:00 190.3 4.1
2015-01-01 00:00:00 167.1 9.7
As I want data over one hour to allow for comparison with other data sets, I tried to use period.apply:
dat <- period.apply(dat, endpoints(dat,on="hours",k=1), colSums)
The problem is that the first row in my new data set is 2014-12-31 23:45:00 and not 2015-01-01 00:00:00. I tried changing the endpoint vector but somehow it keeps saying that it is out of bounds. I also thought this was my answer: https://stats.stackexchange.com/questions/5305/how-to-re-sample-an-xts-time-series-in-r/19003#19003 but it was not. I don't want to change the names of my columns, I want to sum over a different interval.
Here a reproducible example:
library(xts)
seq<-seq(from=ISOdate(2014,12,31,23,15),length.out = 100, by="15 min", tz="UTC")
xts<-xts(rep(1,100),order.by = seq)
period.apply(xts, endpoints(xts,on="hours",k=1), colSums)
And the result looks like this:
2014-12-31 23:45:00 3
2015-01-01 00:45:00 4
2015-01-01 01:45:00 4
2015-01-01 02:45:00 4
and ends up like this:
2015-01-01 21:45:00 4
2015-01-01 22:45:00 4
2015-01-01 23:45:00 4
2015-01-02 00:00:00 1
Whereas I would like it to always sum over the same interval, meaning I would like only 4s.
(I am using RStudio 0.99.903 with R x64 3.3.2)
The problem is that you're using endpoints, but you want to align by the start of the interval, not the end. I thought you might be able to use this startpoints function, but that produced weird results.
The basic idea of the work-around below is to subtract a small amount from all index values, then use endpoints and period.apply to aggregate. Then call align.time on the result. I'm not sure if this is a general solution, but it seems to work for your example.
library(xts)
seq<-seq(from=ISOdate(2014,12,31,23,15),length.out = 100, by="15 min", tz="UTC")
xts<-xts(rep(1,100),order.by = seq)
# create a temporary object
tmp <- xts
# subtract a small amount of time from each index value
.index(tmp) <- .index(tmp)-0.001
# aggregate to hourly
agg <- period.apply(tmp, endpoints(tmp, "hours"), colSums)
# round index up to next hour
agg_aligned <- align.time(agg, 3600)
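Another way to express the same grouping, shown here only as a sketch, is to compute the endpoints by hand: label each observation with the full hour that closes its interval and cut wherever that label changes. It assumes the index falls on whole seconds, as in the example; the names idx, x, hour_end and ep are illustrative.

```r
library(xts)

# Same toy series as in the question: 100 quarter-hourly values from 23:15 UTC
idx <- seq(ISOdatetime(2014, 12, 31, 23, 15, 0, tz = "UTC"),
           by = "15 min", length.out = 100)
x <- xts(rep(1, 100), order.by = idx)

# Label every observation with the full hour that closes its interval,
# so 23:15, 23:30, 23:45 and 00:00 all get the label 00:00
secs <- as.numeric(index(x))
hour_end <- ceiling(secs / 3600) * 3600

# period.apply expects endpoints that start at 0 and end at nrow(x)
ep <- c(0, which(diff(hour_end) != 0), nrow(x))
agg <- period.apply(x, ep, sum)

# Stamp each sum with its closing hour
index(agg) <- as.POSIXct(hour_end[ep[-1]], origin = "1970-01-01", tz = "UTC")
head(agg)
```

With this labelling an observation stamped exactly on the hour belongs to the hour that just ended, which matches the intervals asked for in the question.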

Resources