R: left-sided moving average for periods (months)

I have a question which might be trivial for most of you guys. I tried a lot but didn't come to a solution, so I would be glad if somebody could give me a hint. The starting point is a weekly xts time series.
Month Week Value Goal
Dec 2011 W50 a a
Dec 2011 W51 b mean(a,b)
Dec 2011 W52 c mean(a,b,c)
Dec 2011 W53 d mean(a,b,c,d)
Jan 2012 W01 e e
Jan 2012 W02 f mean(e,f)
Jan 2012 W03 g mean(e,f,g)
Jan 2012 W04 h mean(e,f,g,h)
Feb 2012 W05 i i
Feb 2012 W06 j mean(i,j)
Please excuse the Excel notation, but I think it makes it pretty clear what I want to do: I want to calculate a left-sided moving average for the column "Value", but only within the respective month, as displayed in the column "Goal". I experimented with apply.monthly() and period.apply(), but they didn't get me what I want. Can somebody give me a hint how to solve the problem? Just a hint which function I should use would already be enough!
Thank you very much!
Best regards,
Andreas

apply.monthly will not work because it only assigns one value to the endpoint of the period, whereas you want to assign many values to each monthly period.
You can do this pretty easily by splitting your xts data by month, applying a cumulative mean function to each, and rbind'ing the list back together.
library(quantmod)
# Sample data
getSymbols("SPY")
spy <- to.weekly(SPY)
# Cumulative mean function
cummean <- function(x) cumsum(x)/seq_along(x)
# Expanding average calculation
spy$EA <- do.call(rbind, lapply(split(Cl(spy),'months'), cummean))
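The same split/apply/rbind pattern can be sketched on a tiny invented weekly series that mirrors the question's table (the dates and values here are made up purely for illustration):

```r
library(xts)

# Six made-up weekly values spanning Dec 2011 and Jan 2012
dates <- as.Date("2011-12-12") + 7 * (0:5)
x <- xts(c(1, 2, 3, 4, 5, 6), order.by = dates)
colnames(x) <- "Value"

# Expanding (left-sided) mean, restarted at each month boundary
cummean <- function(v) cumsum(v) / seq_along(v)
goal <- do.call(rbind, lapply(split(x, "months"), cummean))
```

Here the running mean restarts each month: the December rows give 1, 1.5, 2 and the January rows give 4, 4.5, 5, matching the "Goal" column in the question.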

I hope I got your question right, but is this what you are looking for:
require(plyr)
require(PerformanceAnalytics)
ddply(data, .(Week), summarize, Goal = apply.fromstart(Value, FUN = "mean"))
This should work, though a reproducible example would have been nice.
Here's what it does:
df <- data.frame(Week=rep(1:5, each=5), Value=c(1:25)*runif(25)) #sample data
require(plyr)
require(PerformanceAnalytics)
df$Goal <- ddply(df, .(Week), summarize, Goal=apply.fromstart(Value,FUN="mean"))[,2]
outcome:
Week Value Goal
1 1 0.7528037 0.7528037
2 1 1.9622622 1.3575330
3 1 0.3367802 1.0172820
4 1 2.5177284 1.3923936
of course you may obtain further info via the help: ?ddply or ?apply.fromstart.

Related

How to calculate growth rate in R?

I have a data frame and would like to calculate the growth rate of nominal GDP in R. I know how to do it in Excel with the formula ((gdp of this year - gdp of last year) / (gdp of last year)) * 100. What kind of command could be used in R to calculate it?
year nominal gdp
2003 7696034.9
2004 8690254.3
2005 9424601.9
2006 10520792.8
2007 11399472.2
2008 12256863.6
2009 12072541.6
2010 13266857.9
2011 14527336.9
2012 15599270.7
2013 16078959.8
You can also use the lag() function from dplyr. It gives the previous values of a vector. Here is an example:
data <- data.frame(year = c(2003:2013),
gdp = c(7696034.9, 8690254.3, 9424601.9, 10520792.8,
11399472.2, 12256863.6, 12072541.6, 13266857.9,
14527336.9, 15599270.7, 16078959.8))
library(dplyr)
growth_rate <- function(x)(x/lag(x)-1)*100
data$growth_rate <- growth_rate(data$gdp)
It's probably best for you to get familiar with the data.table package, and do something like this:
library(data.table)
dt_gdp <- data.table(df)
dt_gdp[, growth_rate_of_gdp := 100 * (Producto.interno.bruto..PIB. - shift(Producto.interno.bruto..PIB.)) / shift(Producto.interno.bruto..PIB.)]
A base-R solution:
with(data,
c(NA, ## augment results (growth rate unknown in year 1)
diff(gdp)/ ## this is gdp(t) - gdp(t-1)
head(gdp, -1)) ## gdp(t-1)
*100) ## scale to percentage growth
head(gdp, -1) is perhaps a little too clever. gdp[-length(gdp)] (i.e. "gdp, excluding the last value") would be slightly more idiomatic.
Or
(gdp/c(NA,gdp[-length(gdp)])-1)*100
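As a quick sanity check on invented numbers (a series growing 10% per year), the two base-R variants agree:

```r
# Made-up GDP series with exactly 10% growth each year
gdp <- c(100, 110, 121)

# Variant 1: diff over the lagged level
g1 <- c(NA, diff(gdp) / head(gdp, -1)) * 100

# Variant 2: ratio to the lagged level, minus one
g2 <- (gdp / c(NA, gdp[-length(gdp)]) - 1) * 100
```

Both produce NA for the first year and 10 for the following years.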

Converting zoo time series from daily to monthly means

I have created a time series using zoo. It has daily values for a long period of time (40 years). I can easily plot it, but what I want is to create a time series with monthly (mean) values from this original time series and then plot it as monthly values.
I thought the package lubridate could be a good option for this, and maybe there is an easy way, but I don't see how. I'm a beginner in R. Does somebody have a tip?
You can use apply.monthly() from the xts package.
library(xts)
data(sample_matrix)
x <- as.xts(sample_matrix, dateFormat = "Date")
(m <- apply.monthly(x, mean))
# Open High Low Close
# 2007-01-31 50.21140 50.31528 50.12072 50.22791
# 2007-02-28 50.78427 50.88091 50.69639 50.79533
# 2007-03-31 49.53185 49.61232 49.40435 49.48246
# 2007-04-30 49.62687 49.71287 49.53189 49.62978
# 2007-05-31 48.31942 48.41694 48.18960 48.26699
# 2007-06-30 47.47717 47.57592 47.38255 47.46899
You might also want to convert your index from Date to yearmon, which you can do like this:
index(m) <- as.yearmon(index(m))
m
# Open High Low Close
# Jan 2007 50.21140 50.31528 50.12072 50.22791
# Feb 2007 50.78427 50.88091 50.69639 50.79533
# Mar 2007 49.53185 49.61232 49.40435 49.48246
# Apr 2007 49.62687 49.71287 49.53189 49.62978
# May 2007 48.31942 48.41694 48.18960 48.26699
# Jun 2007 47.47717 47.57592 47.38255 47.46899
You can use aggregate.zoo as shown in the examples:
x2a <- aggregate(x, as.Date(as.yearmon(time(x))), mean)
if you want to stick to zoo.
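A minimal self-contained sketch of the daily-to-monthly case, using an invented daily series (the dates and values are made up for illustration):

```r
library(zoo)

# Invented daily series covering Jan and Feb 2007
d <- seq(as.Date("2007-01-01"), as.Date("2007-02-28"), by = "day")
z <- zoo(seq_along(d), d)

# Monthly means, indexed directly by yearmon
monthly <- aggregate(z, as.yearmon, mean)
```

January holds the values 1..31 (mean 16) and February the values 32..59 (mean 45.5), so monthly has one entry per month.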

R how to remove duplicates elements in the column and get average value

Sorry, I am new to R and this problem is quite hard for me.
Here is the data:
V1 predictions
1 Jeffery Howes 0.0909596345057677
2 Sherilee Waring 0.00434589236424605
3 Rachel Maitland 0.0909596345057677
4 Jan Maitland 0.0909596345057677
5 Jan Maitland 0.0909596345057677
6 Jan Maitland 0.0909596345057677
7 Jan Maitland 0.0909596345057677
8 Sandra McEwen 0.0909596345057677
....
How can I remove the duplicates in the column? (That part is okay for me; I could use unique. The following part is what I find hard.)
For example, the name Jan Maitland is duplicated many times. The duplicates should be removed, but the prediction values should be combined: the final row left for each name should hold the average of its duplicates.
Could someone help me with that? Thanks a lot!!
you can use the dplyr library:
result %>% group_by(V1) %>% summarise(predictions = mean(predictions))
# the 2nd syntax
summarise(group_by(result, V1), predictions = mean(predictions))
hth
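A self-contained toy version of the group-and-average step (the names and prediction values below are invented):

```r
library(dplyr)

# Invented data with one duplicated name
result <- data.frame(
  V1          = c("Jan Maitland", "Jan Maitland", "Rachel Maitland"),
  predictions = c(0.2, 0.4, 0.1)
)

# One row per name, predictions averaged over the duplicates
deduped <- result %>%
  group_by(V1) %>%
  summarise(predictions = mean(predictions))
```

Here deduped has two rows: Jan Maitland with the averaged prediction 0.3, and Rachel Maitland unchanged at 0.1.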

Eliminating Existing Observations in a Zoo Merge

I'm trying to do a zoo merge between stock prices from selected trading days and observations about those same stocks (we call these "Nx observations") made on the same days. Sometimes we do not have Nx observations on stock trading days, and sometimes we have Nx observations on non-trading days. We want to place an NA where we do not have any Nx observations on trading days, but eliminate Nx observations that fall on non-trading days, since without trading data for the same day Nx observations are useless.
The following SO question is close to mine, but I would characterize that question as REPLACING missing data, whereas my objective is to truly eliminate observations made on non-trading days (if necessary, we can change the process by which Nx observations are taken, but it would be a much less expensive solution to leave it alone).
merge data frames to eliminate missing observations
The script I have prepared to illustrate follows (I'm new to R and SO; all suggestions welcome):
# create Stk_data data.frame for use in the Stack Overflow question
Date_Stk <- c("1/2/13", "1/3/13", "1/4/13", "1/7/13", "1/8/13") # dates for stock prices used in the example
ABC_Stk <- c(65.73, 66.85, 66.92, 66.60, 66.07) # stock prices for tkr ABC for Jan 1 2013 through Jan 8 2013
DEF_Stk <- c(42.98, 42.92, 43.47, 43.16, 43.71) # stock prices for tkr DEF for Jan 1 2013 through Jan 8 2013
GHI_Stk <- c(32.18, 31.73, 32.43, 32.13, 32.18) # stock prices for tkr GHI for Jan 1 2013 through Jan 8 2013
Stk_data <- data.frame(Date_Stk, ABC_Stk, DEF_Stk, GHI_Stk) # create the stock price data.frame
# create Nx_data data.frame for use in the Stack Overflow question
Date_Nx <- c("1/2/13", "1/4/13", "1/5/13", "1/6/13", "1/7/13", "1/8/13") # dates for Nx Observations used in the example
ABC_Nx <- c(51.42857, 51.67565, 57.61905, 57.78349, 58.57143, 58.99564) # Nx scores for stock ABC for Jan 1 2013 through Jan 8 2013
DEF_Nx <- c(35.23809, 36.66667, 28.57142, 28.51778, 27.23150, 26.94331) # Nx scores for stock DEF for Jan 1 2013 through Jan 8 2013
GHI_Nx <- c(7.14256, 8.44573, 6.25344, 6.00423, 5.99239, 6.10034) # Nx scores for stock GHI for Jan 1 2013 through Jan 8 2013
Nx_data <- data.frame(Date_Nx, ABC_Nx, DEF_Nx, GHI_Nx) # create the Nx scores data.frame
# create zoo objects & merge
z.Stk_data <- zoo(Stk_data, as.Date(as.character(Stk_data[, 1]), format = "%m/%d/%Y"))
z.Nx_data <- zoo(Nx_data, as.Date(as.character(Nx_data[, 1]), format = "%m/%d/%Y"))
z.data.outer <- merge(z.Stk_data, z.Nx_data)
The NAs on Jan 3 2013 for the Nx observations are fine (we'll use the na.locf) but we need to eliminate the Nx observations that appear on Jan 5 and 6 as well as the associated NAs in the Stock price section of the zoo objects.
I've read the R Documentation for merge.zoo regarding the use of "all": that its use "allows
intersection, union and left and right joins to be expressed". But trying all combinations of the
following use of "all" yielded the same results (as to why would be a secondary question).
z.data.outer <- zoo(merge(x = Stk_data, y = Nx_data, all.x = FALSE)) # try using "all"
While I would appreciate comments on the secondary question, I'm primarily interested in learning how to eliminate the extraneous Nx observations on days when there is no trading of stocks. Thanks. (And thanks in general to the community for all the great explanations of R!)
The all argument of merge.zoo must be (quoting from the help file):
logical vector having the same length as the number of "zoo" objects to be merged
(otherwise expanded)
and you want to keep all rows from the first argument but not the second so its value should be c(TRUE, FALSE).
merge(z.Stk_data, z.Nx_data, all = c(TRUE, FALSE))
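A tiny illustration (made-up dates and values) of the vector-valued all argument acting as a left join: rows are kept only for the first series' dates, with NA filled in where the second series has no value, and the second series' extra dates dropped.

```r
library(zoo)

z1 <- zoo(c(1, 2, 3),   as.Date("2013-01-02") + c(0, 1, 2))  # "trading days"
z2 <- zoo(c(10, 30, 99), as.Date("2013-01-02") + c(0, 2, 3))  # "observations"

# Keep all rows of z1, only matching rows of z2
m <- merge(z1, z2, all = c(TRUE, FALSE))
```

The result has three rows (Jan 2-4); the z2 column reads 10, NA, 30, and the Jan 5 observation (99) is eliminated.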
The reason for the change in all syntax for merge.zoo relative to merge.data.frame is that merge.zoo can merge any number of arguments, whereas merge.data.frame only handles two, so the syntax had to be extended to handle that.
Also note that %Y should have been %y in the question's code.
I hope I have understood your desired output correctly ("NAs on Jan 3 2013 for the Nx observations are fine"; "eliminate [...] observations that appear on Jan 5 and 6"). I don't quite see the need for zoo in the merging step.
merge(Stk_data, Nx_data, by.x = "Date_Stk", by.y = "Date_Nx", all.x = TRUE)
# Date_Stk ABC_Stk DEF_Stk GHI_Stk ABC_Nx DEF_Nx GHI_Nx
# 1 1/2/13 65.73 42.98 32.18 51.42857 35.23809 7.14256
# 2 1/3/13 66.85 42.92 31.73 NA NA NA
# 3 1/4/13 66.92 43.47 32.43 51.67565 36.66667 8.44573
# 4 1/7/13 66.60 43.16 32.13 58.57143 27.23150 5.99239
# 5 1/8/13 66.07 43.71 32.18 58.99564 26.94331 6.10034

Repeat sqldf over different values of a variable

Just a little background: I got into programming through statistics, and I don't have much formal programming experience; I just know how to make things work. I'm open to any suggestions to come at this from a different direction, but I'm currently using multiple sqldf queries to get my desired data. I originally started statistical programming in SAS, and one of the things I used on a regular basis was the macro programming ability.
For a simplistic example say that I have my table A as the following:
Name Sex A B DateAdded
John M 72 1476 01/14/12
Sue F 44 3269 02/09/12
Liz F 90 7130 01/01/12
Steve M 21 3161 02/29/12
The select statement that I'm currently using is of the form:
sqldf("SELECT AVG(A), SUM(B) FROM A WHERE DateAdded >= '2012-01-01' AND DateAdded <= '2012-01-31'")
Now I'd like to run this same query on the entries where DateAdded is in the month of February. From my experience with SAS, you would create macro variables for the values of DateAdded. I've considered running this as a (very, very slow) for loop, but I'm not sure how to pass an R variable into sqldf, or whether that's even possible. In my table, I'm using the same query over years' worth of data; any way to streamline my code would be much appreciated.
Read in the data, convert the DateAdded column to Date class, add a yearmon (year/month) column and then use sqldf or aggregate to aggregate by year/month:
Lines <- "Name Sex A B DateAdded
John M 72 1476 01/14/12
Sue F 44 3269 02/09/12
Liz F 90 7130 01/01/12
Steve M 21 3161 02/29/12"
DF <- read.table(text = Lines, header = TRUE)
# convert DateAdded column to Date class
DF$DateAdded <- as.Date(DF$DateAdded, format = "%m/%d/%y")
# add a year/month column using zoo
library(zoo)
DF$yearmon <- as.yearmon(DF$DateAdded)
Now that we have the data and it's in the right form, the answer is just one line of code. Here are two ways:
# 1. using sqldf
library(sqldf)
sqldf("select yearmon, avg(A), avg(B) from DF group by yearmon")
# 2. using aggregate
aggregate(cbind(A, B) ~ yearmon, DF, mean)
The result of the last two lines is:
> sqldf("select yearmon, avg(A), avg(B) from DF group by yearmon")
yearmon avg(A) avg(B)
1 Jan 2012 81.0 4303
2 Feb 2012 32.5 3215
>
> # 2. using aggregate
> aggregate(cbind(A, B) ~ yearmon, DF, mean)
yearmon A B
1 Jan 2012 81.0 4303
2 Feb 2012 32.5 3215
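On the original question of passing an R variable into sqldf: since sqldf just receives a character string, one simple option is to build the query with sprintf (the date bounds below are illustrative):

```r
# Build the query string from R variables, then hand it to sqldf
month_start <- "2012-02-01"
month_end   <- "2012-02-29"

q <- sprintf(
  "SELECT AVG(A), SUM(B) FROM A WHERE DateAdded >= '%s' AND DateAdded <= '%s'",
  month_start, month_end
)
```

The resulting string q can then be passed straight to sqldf(q), and the bounds can be changed in a loop or function without rewriting the SQL.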
EDIT:
Regarding your question of doing it by week see the nextfri function in the zoo quick reference vignette.
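For reference, the nextfri helper from that vignette maps each date to the Friday on or after it, so weekly aggregation becomes another aggregate call; a sketch on a made-up date:

```r
# nextfri, as given in the zoo quick-reference vignette:
# maps a Date to the next Friday (Fridays map to themselves)
nextfri <- function(x) 7 * ceiling(as.numeric(x - 5 + 4) / 7) + as.Date(5 - 4)

nextfri(as.Date("2012-01-14"))  # a Saturday -> "2012-01-20", the following Friday
```

With a zoo series z, something like aggregate(z, nextfri, mean) would then give week-ending-Friday means, analogous to the monthly case above.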
