The problem is to use apply.monthly or any other similar function to do monthly operations with a dataset. The data I have looks like the following:
> minidata[1:10,]
date Month Year TMIN
1 1948-01-01 Jan 1948 1.1
2 1948-01-02 Jan 1948 7.2
3 1948-01-03 Jan 1948 5.0
4 1948-01-04 Jan 1948 9.4
5 1948-01-05 Jan 1948 4.4
> tail(minidata)
date Month Year TMIN
54 1948-02-23 Feb 1948 2.8
55 1948-02-24 Feb 1948 -0.6
56 1948-02-25 Feb 1948 1.7
57 1948-02-26 Feb 1948 2.8
58 1948-02-27 Feb 1948 4.4
59 1948-02-28 Feb 1948 3.3
Task, use my own function to produce the monthly mean:
mymean <- function(date){
for (j in 1:days_in_month(date)){
avg = (1/(days_in_month(date))
*sum(minidata$TMIN[1:days_in_month(date)])}
return(avg)
}
The result must be the same as the R function in the xts package:
dat.xts <- xts(x= minidata$TMIN,order.by = minidata$date)
> apply.monthly(dat.xts,mean)
[,1]
1948-01-31 2.312903
1948-02-28 2.082143
My function outputs the correct values:
> mymean(minidata$date[1])
Jan
2.312903
> mymean(dat.xts[1])
Jan
2.312903
I wouldn't mind if $apply.monthly$ generated a new column with the means, but I have to use my own function! (This is an example, in reality my function is a lot harder).
I tried:
> apply.monthly(dat.xts,function(dat.xts) mymean(dat.xts))
Error in coredata.xts(x) : currently unsupported data type
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Thanks!
Update: days_in_month can be found in the lubridate package. It calculates the number of days in a given month
Your function is the issue, not apply.monthly. I don't know where the days_in_month function is defined, but it probably doesn't work with xts objects. I assume it expects a date-time class.
And your mymean function references an object that isn't passed to it, which is not good practice because it makes R search for minidata.
Your function should expect an xts object containing a month of data and operate only on that data, not some object outside the function scope. For example:
mymean <- function(Data) {
days <- days_in_month(index(Data)[1])
avg <- (1/days) * sum(Data$Close)
return(avg)
}
require(xts)
data(sample_matrix)
x <- as.xts(sample_matrix)
apply.monthly(x, mymean)
To perform operations within groups of a data frame, you can use the dplyr package. For instance, to get the average TMIN within each group:
library(dplyr)
summarize(group_by(minidata, Month), mean = mean(TMIN))
This is often written as:
minidata %>% group_by(Month) %>%
summarize(mean = mean(TMIN))
Your function works only on data frames, the xts object is different and isn't going to work the way you want. That is why it is giving you errors.
Besides that, you do not want to do this with a loop. It is going to take much longer than many other ways of doing it.
David's answer (use dplyr::group_by and dplyr::summarize) is the best way to handle this. You can use a custom function in the summarize if that is what the problem is. Just define your function and use it there.
Related
I am learning time series analysis with R and came across these 2 functions while learning. I do understand that the output of both of these is a periodic data defined by the frequency of period and the only difference I can see is the OHLC output option in the to.period().
Other than the OHLC when a particular of these functions is to be used?
to.period and all the to.minutes, to.weekly, to.quarterly are indeed meant for OHLC data.
If you take the function to.period it will take the open from the first day of the period, the close of the last day of the period and the highest high / lowest low of the specified period. These functions work very well together with the quantmod / tidyquant / quantstrat packages. See code example 1.
If you give the to.period non-OHLC data, but a timeseries with 1 data column, you still get a sort of OHLC back. See code example 2.
Now period.apply is is more interesting. Here you can supply your own functions to be applied on the data. Especially in combination with endpoints this can be a powerful function in timeseries data if you want to aggregate your function to different time periods. The index is mostly specified with endpoints, since with endpoints you can create the index you need to get to higher time levels (from day to week / etc etc). See code example 3 and 4.
Remember to use matrix functions with period.apply if you have more than 1 column of data since xts is basicly a matrix and an index. See code example 5.
More info on this data.camp course.
library(xts)
data(sample_matrix)
zoo.data <- zoo(rnorm(31)+10,as.Date(13514:13744,origin="1970-01-01"))
# code example 1
to.quarterly(sample_matrix)
sample_matrix.Open sample_matrix.High sample_matrix.Low sample_matrix.Close
2007 Q1 50.03978 51.32342 48.23648 48.97490
2007 Q2 48.94407 50.33781 47.09144 47.76719
# same as to.quarterly
to.period(sample_matrix, period = "quarters")
sample_matrix.Open sample_matrix.High sample_matrix.Low sample_matrix.Close
2007 Q1 50.03978 51.32342 48.23648 48.97490
2007 Q2 48.94407 50.33781 47.09144 47.76719
# code example 2
to.period(zoo.data, period = "quarters")
zoo.data.Open zoo.data.High zoo.data.Low zoo.data.Close
2007-03-31 9.039875 11.31391 7.451139 10.35057
2007-06-30 10.834614 11.31391 7.451139 11.28427
2007-08-19 11.004465 11.31391 7.451139 11.30360
# code example 3 using base standard deviation in the chosen period
period.apply(zoo.data, endpoints(zoo.data, on = "quarters"), sd)
2007-03-31 2007-06-30 2007-08-19
1.026825 1.052786 1.071758
# self defined function of summing x + x for the period
period.apply(zoo.data, endpoints(zoo.data, on = "quarters"), function(x) sum(x + x) )
2007-03-31 2007-06-30 2007-08-19
1798.7240 1812.4736 993.5729
# code example 5
period.apply(sample_matrix, endpoints(sample_matrix, on = "quarters"), colMeans)
Open High Low Close
2007-03-31 50.15493 50.24838 50.05231 50.14677
2007-06-30 48.47278 48.56691 48.36606 48.45318
I intend to perform a time series analysis on my data set. I have imported the data (monthly data from January 2015 till December 2017) from a csv file and my codes in RStudio appear as follows:
library(timetk)
library(tidyquant)
library(timeSeries)
library(tseries)
library(forecast)
mydata1 <- read.csv("mydata.csv", as.is=TRUE, header = TRUE)
mydata1
date pkgrev
1 1/1/2015 39103770
2 2/1/2015 27652952
3 3/1/2015 30324308
4 4/1/2015 35347040
5 5/1/2015 31093119
6 6/1/2015 20670477
7 7/1/2015 24841570
mydata2 <- mydata1 %>%
mutate(date = mdy(date))
mydata2
date pkgrev
1 2015-01-01 39103770
2 2015-02-01 27652952
3 2015-03-01 30324308
4 2015-04-01 35347040
5 2015-05-01 31093119
6 2015-06-01 20670477
7 2015-07-01 24841570
class(mydata2)
[1] "data.frame"
It is when running this piece of code that things get a little weird (for me at least):
mydata2_ts <- ts(mydata2, start=c(2015,1), freq=12)
mydata2_ts
date pkgrev
Jan 2015 16436 39103770
Feb 2015 16467 27652952
Mar 2015 16495 30324308
Apr 2015 16526 35347040
May 2015 16556 31093119
Jun 2015 16587 20670477
Jul 2015 16617 24841570
I don't really understand the values in the date column! It seems the dates have been converted into numeric format.
class(mydata2_ts)
[1] "mts" "ts" "matrix"
Now, running the following codes give me an error:
stlRes <- stl(mydata2_ts, s.window = "periodic")
Error in stl(mydata2_ts, s.window = "periodic") :
only univariate series are allowed
What is wrong with my process?
The reason that you got this error is because you tried to feed a data set with two variables (date + pkgrev) into STL's argument, which only takes a univariate time series as a proper argument.
To solve this problem, you could create a univariate ts object without the date variable. In your case, you need to use mydata2$pkgrev (or mydata2["pkgrev"] after mydata2 is converted into a dataframe) instead of mydata2 in your code mydata2_ts <- ts(mydata2, start=c(2015,1), freq=12). The ts object is already supplied with the temporal information as you specified start date and frequency in the argument.
If you would like to create a new dataframe with both the ts object and its corresponding date variable, I would suggest you to use the following code:
mydata3 = cbind(as.Date(time(mydata2_ts)), mydata2_ts)
mydata3 = as.data.frame(mydata3)
However, for the purpose of STL decompostion, the input of the first argument should be a ts object, i.e., mydata2_ts.
I have created a time series using zoo. It has daily values for a long period of time (40 years). I can easily plot it, but what I want is to create a time series with monthly (mean) values from this original time series and then plot it as monthly values.
I thought the package lubridate could be a good option for this and maybe there is an easy way but I don't see how. I'm a beginner in R. Has somebody a tip here?
You can use apply.monthly() from the xts package.
library(xts)
data(sample_matrix)
x <- as.xts(sample_matrix, dateFormat = "Date")
(m <- apply.monthly(x, mean))
# Open High Low Close
# 2007-01-31 50.21140 50.31528 50.12072 50.22791
# 2007-02-28 50.78427 50.88091 50.69639 50.79533
# 2007-03-31 49.53185 49.61232 49.40435 49.48246
# 2007-04-30 49.62687 49.71287 49.53189 49.62978
# 2007-05-31 48.31942 48.41694 48.18960 48.26699
# 2007-06-30 47.47717 47.57592 47.38255 47.46899
You might also want to convert your index from Date to yearmon, which you can do like this:
index(m) <- as.yearmon(index(m))
m
# Open High Low Close
# Jan 2007 50.21140 50.31528 50.12072 50.22791
# Feb 2007 50.78427 50.88091 50.69639 50.79533
# Mar 2007 49.53185 49.61232 49.40435 49.48246
# Apr 2007 49.62687 49.71287 49.53189 49.62978
# May 2007 48.31942 48.41694 48.18960 48.26699
# Jun 2007 47.47717 47.57592 47.38255 47.46899
You can use aggregate.zoo as shown in the examples:
x2a <- aggregate(x, as.Date(as.yearmon(time(x))), mean)
if you want to stick to zoo.
I'm trying to do a zoo merge between stock prices from selected trading days and observations about those same stocks (we call these "Nx observations") made on the same days. Sometimes do not have Nx observations on stock trading days and sometimes we have Nx observations on non-trading days. We want to place an "NA" where we do not have any Nx observations on trading days but eliminate Nx observations where we have them on non-trading day since without trading data for the same day, Nx observations are useless.
The following SO question is close to mine, but I would characterize that question as REPLACING missing data, whereas my objective is to truly eliminate observations made on non-trading days (if necessary, we can change the process by which Nx observations are taken, but it would be a much less expensive solution to leave it alone).
merge data frames to eliminate missing observations
The script I have prepared to illustrate follows (I'm new to R and SO; all suggestions welcome):
# create Stk_data data.frame for use in the Stack Overflow question
Date_Stk <- c("1/2/13", "1/3/13", "1/4/13", "1/7/13", "1/8/13") # dates for stock prices used in the example
ABC_Stk <- c(65.73, 66.85, 66.92, 66.60, 66.07) # stock prices for tkr ABC for Jan 1 2013 through Jan 8 2013
DEF_Stk <- c(42.98, 42.92, 43.47, 43.16, 43.71) # stock prices for tkr DEF for Jan 1 2013 through Jan 8 2013
GHI_Stk <- c(32.18, 31.73, 32.43, 32.13, 32.18) # stock prices for tkr GHI for Jan 1 2013 through Jan 8 2013
Stk_data <- data.frame(Date_Stk, ABC_Stk, DEF_Stk, GHI_Stk) # create the stock price data.frame
# create Nx_data data.frame for use in the Stack Overflow question
Date_Nx <- c("1/2/13", "1/4/13", "1/5/13", "1/6/13", "1/7/13", "1/8/13") # dates for Nx Observations used in the example
ABC_Nx <- c(51.42857, 51.67565, 57.61905, 57.78349, 58.57143, 58.99564) # Nx scores for stock ABC for Jan 1 2013 through Jan 8 2013
DEF_Nx <- c(35.23809, 36.66667, 28.57142, 28.51778, 27.23150, 26.94331) # Nx scores for stock DEF for Jan 1 2013 through Jan 8 2013
GHI_Nx <- c(7.14256, 8.44573, 6.25344, 6.00423, 5.99239, 6.10034) # Nx scores for stock GHI for Jan 1 2013 through Jan 8 2013
Nx_data <- data.frame(Date_Nx, ABC_Nx, DEF_Nx, GHI_Nx) # create the Nx scores data.frame
# create zoo objects & merge
z.Stk_data <- zoo(Stk_data, as.Date(as.character(Stk_data[, 1]), format = "%m/%d/%Y"))
z.Nx_data <- zoo(Nx_data, as.Date(as.character(Nx_data[, 1]), format = "%m/%d/%Y"))
z.data.outer <- merge(z.Stk_data, z.Nx_data)
The NAs on Jan 3 2013 for the Nx observations are fine (we'll use the na.locf) but we need to eliminate the Nx observations that appear on Jan 5 and 6 as well as the associated NAs in the Stock price section of the zoo objects.
I've read the R Documentation for merge.zoo regarding the use of "all": that its use "allows
intersection, union and left and right joins to be expressed". But trying all combinations of the
following use of "all" yielded the same results (as to why would be a secondary question).
z.data.outer <- zoo(merge(x = Stk_data, y = Nx_data, all.x = FALSE)) # try using "all"
While I would appreciate comments on the secondary question, I'm primarily interested in learning how to eliminate the extraneous Nx observations on days when there is no trading of stocks. Thanks. (And thanks in general to the community for all the great explanations of R!)
The all argument of merge.zoo must be (quoting from the help file):
logical vector having the same length as the number of "zoo" objects to be merged
(otherwise expanded)
and you want to keep all rows from the first argument but not the second so its value should be c(TRUE, FALSE).
merge(z.Stk_data, z.Nx_data, all = c(TRUE, FALSE))
The reason for the change in all syntax for merge.zoo relative to merge.data.frame is that merge.zoo can merge any number of arguments whereas merge.data.frame only handles two so the syntax had to be extended to handle that.
Also note that %Y should have been %y in the question's code.
I hope I have understood your desired output correctly ("NAs on Jan 3 2013 for the Nx observations are fine"; "eliminate [...] observations that appear on Jan 5 and 6"). I don't quite see the need for zoo in the merging step.
merge(Stk_data, Nx_data, by.x = "Date_Stk", by.y = "Date_Nx", all.x = TRUE)
# Date_Stk ABC_Stk DEF_Stk GHI_Stk ABC_Nx DEF_Nx GHI_Nx
# 1 1/2/13 65.73 42.98 32.18 51.42857 35.23809 7.14256
# 2 1/3/13 66.85 42.92 31.73 NA NA NA
# 3 1/4/13 66.92 43.47 32.43 51.67565 36.66667 8.44573
# 4 1/7/13 66.60 43.16 32.13 58.57143 27.23150 5.99239
# 5 1/8/13 66.07 43.71 32.18 58.99564 26.94331 6.10034
Just a little background: I got into programming through statistics, and I don't have much formal programming experience, I just know how to make things work. I'm open to any suggestions to come at this from a differet direction, but I'm currently using multiple sqldf queries to get my desired data. I originally started statistical programming in SAS and one of the things I used on a regular basis was the macro programing ability.
For a simplistic example say that I have my table A as the following:
Name Sex A B DateAdded
John M 72 1476 01/14/12
Sue F 44 3269 02/09/12
Liz F 90 7130 01/01/12
Steve M 21 3161 02/29/12
The select statement that I'm currently using is of the form:
sqldf("SELECT AVG(A), SUM(B) FROM A WHERE DateAdded >= '2012-01-01' AND DateAdded <= '2012-01-31'")
Now I'd like to run this same query on the enteries where DateAdded is in the Month of February. From my experience with SAS, you would create macro variables for the values of DateAdded. I've considered running this as a (very very slow) for loop, but I'm not sure how to pass an R variable into the sqldf, or whether that's even possible. In my table, I'm using the same query over years worth of data--any way to streamline my code would be much appreciated.
Read in the data, convert the DateAdded column to Date class, add a yearmon (year/month) column and then use sqldf or aggregate to aggregate by year/month:
Lines <- "Name Sex A B DateAdded
John M 72 1476 01/14/12
Sue F 44 3269 02/09/12
Liz F 90 7130 01/01/12
Steve M 21 3161 02/29/12"
DF <- read.table(text = Lines, header = TRUE)
# convert DateAdded column to Date class
DF$DateAdded <- as.Date(DF$DateAdded, format = "%m/%d/%y")
# add a year/month column using zoo
library(zoo)
DF$yearmon <- as.yearmon(DF$DateAdded)
Now that we have the data and its in the right form the answer is just one line of code. Here are two ways:
# 1. using sqldf
library(sqldf)
sqldf("select yearmon, avg(A), avg(B) from DF group by yearmon")
# 2. using aggregate
aggregate(cbind(A, B) ~ yearmon, DF, mean)
The result of the last two lines is:
> sqldf("select yearmon, avg(A), avg(B) from DF group by yearmon")
yearmon avg(A) avg(B)
1 Jan 2012 81.0 4303
2 Feb 2012 32.5 3215
>
> # 2. using aggregate
> aggregate(cbind(A, B) ~ yearmon, DF, mean)
yearmon A B
1 Jan 2012 81.0 4303
2 Feb 2012 32.5 3215
EDIT:
Regarding your question of doing it by week see the nextfri function in the zoo quick reference vignette.