I have a data frame that looks like this:
z <- data.frame(ent = c(1, 1, 1, 2, 2, 2, 3, 3, 3), year = c(1995, 2000, 2005, 1995, 2000, 2005, 1995, 2000, 2005), pobtot = c(50, 60, 70, 10, 4, 1, 100, 105, 110))
As you can see, there is a gap of 5 years between observations for every "ent". I want to interpolate data for every missing year (1996, 1997, 1998, 1999, 2001, 2002, 2003, 2004) and also extrapolate to 2006, 2007 and 2008. Is there a way to do this?
Any help would be appreciated.
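We can use complete to expand the data for each 'ent' over the full 'year' range, then group by 'ent' and interpolate the missing values in 'pobtot' with na.approx. Grouping keeps the interpolation from running across entities. na.approx only interpolates, so the years after 2005 stay NA.
library(dplyr)
library(tidyr)
z %>%
  complete(ent, year = 1995:2008) %>%
  group_by(ent) %>%   # interpolate within each entity only
  mutate(pobtot = zoo::na.approx(pobtot, na.rm = FALSE)) %>%
  ungroup()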
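For comparison, here is a minimal base-R sketch of the same interpolation step, assuming the data frame z from the question. The names years and filled are only illustrative. approx() is applied to each 'ent' separately, and with its default rule = 1 the years outside the observed range (2006 to 2008) come back as NA rather than being extrapolated.

```r
# Data from the question
z <- data.frame(ent = rep(1:3, each = 3),
                year = rep(c(1995, 2000, 2005), times = 3),
                pobtot = c(50, 60, 70, 10, 4, 1, 100, 105, 110))

years <- 1995:2008  # target years (illustrative name)

# Interpolate each entity on its own so values never leak across entities
filled <- do.call(rbind, lapply(split(z, z$ent), function(x) {
  data.frame(ent = x$ent[1],
             year = years,
             pobtot = approx(x$year, x$pobtot, xout = years)$y)
}))
```

Each entity ends up with 14 rows, with pobtot filled in for 1995 to 2005 and NA for the three years that would need a forecast.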
Assuming you want linear interpolation, R uses approx() for such things by default, e.g. for drawing lines in a plot, and we can use the same function to interpolate the years. It doesn't extrapolate, though. For that we could use forecast::ets() with default settings, which fits an exponential smoothing state space model. Note, however, that this may also produce negative values, and the OP hasn't stated what is needed in such a case. In a by() approach we could do:
library(forecast)
p <- 3 ## define number of years for prediction
res <- do.call(rbind, by(z, z$ent, function(x) {
yseq <- min(x$year):(max(x$year) + p) ## sequence of years + prediction
a <- approx(x$year, x$pobtot, head(yseq, -p))$y ## linear interpolation
f <- predict(ets(a), p) ## predict `p` years
r <- c(a, f$mean) ## combine interpolation and prediction
data.frame(ent=x$ent[1], year=yseq, pobtot=r) ## output as data frame
}))
Result
res
# ent year pobtot
# 1.1 1 1995 50.0
# 1.2 1 1996 52.0
# 1.3 1 1997 54.0
# 1.4 1 1998 56.0
# 1.5 1 1999 58.0
# 1.6 1 2000 60.0
# 1.7 1 2001 62.0
# 1.8 1 2002 64.0
# 1.9 1 2003 66.0
# 1.10 1 2004 68.0
# 1.11 1 2005 70.0
# 1.12 1 2006 72.0
# 1.13 1 2007 74.0
# 1.14 1 2008 76.0
# 2.1 2 1995 10.0
# 2.2 2 1996 8.8
# 2.3 2 1997 7.6
# 2.4 2 1998 6.4
# 2.5 2 1999 5.2
# 2.6 2 2000 4.0
# 2.7 2 2001 3.4
# 2.8 2 2002 2.8
# 2.9 2 2003 2.2
# 2.10 2 2004 1.6
# 2.11 2 2005 1.0
# 2.12 2 2006 0.4
# 2.13 2 2007 -0.2
# 2.14 2 2008 -0.8
# 3.1 3 1995 100.0
# 3.2 3 1996 101.0
# 3.3 3 1997 102.0
# 3.4 3 1998 103.0
# 3.5 3 1999 104.0
# 3.6 3 2000 105.0
# 3.7 3 2001 106.0
# 3.8 3 2002 107.0
# 3.9 3 2003 108.0
# 3.10 3 2004 109.0
# 3.11 3 2005 110.0
# 3.12 3 2006 111.0
# 3.13 3 2007 112.0
# 3.14 3 2008 113.0
We could quickly check this in a plot, which, apart from the negative values of entity 2, looks quite reasonable.
with(res, plot(year, pobtot, type='n', main='z'))
with(res[res$year < 2006, ], points(year, pobtot, pch=20, col=3))
with(res[res$year > 2005, ], points(year, pobtot, pch=20, col=4))
with(res[res$year %in% z$year, ], points(year, pobtot, pch=20, col=1))
abline(h=0, lty=3)
legend(2005.25, 50, c('measurem.', 'interpol.', 'extrapol.'), pch=20,
col=c(1, 3, 4), cex=.8, bty='n')
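If negative forecasts are not acceptable, one simple alternative is to extrapolate a straight-line trend per entity and clamp the result at zero. The sketch below is not the ets() model used above; it swaps in lm() and pmax() from base R, and the names trend, fit and new_years are hypothetical.

```r
# Data from the question
z <- data.frame(ent = rep(1:3, each = 3),
                year = rep(c(1995, 2000, 2005), times = 3),
                pobtot = c(50, 60, 70, 10, 4, 1, 100, 105, 110))

p <- 3  # number of years to extrapolate

trend <- do.call(rbind, lapply(split(z, z$ent), function(x) {
  fit <- lm(pobtot ~ year, data = x)          # straight-line trend per entity
  new_years <- max(x$year) + seq_len(p)       # 2006, 2007, 2008
  pred <- predict(fit, newdata = data.frame(year = new_years))
  data.frame(ent = x$ent[1],
             year = new_years,
             pobtot = pmax(unname(pred), 0))  # clamp negative forecasts at 0
}))
```

Whether clamping at zero is appropriate depends on what pobtot measures; it is only one possible way to handle the negative values.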
I have 300 stocks (here, for example, I show 5). How can I create an equally weighted portfolio and then backtest it?
Book1
# A tibble: 3,385 x 6
...1 AAA BBB CCC DDD EEE
<dttm> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2007-02-08 00:00:00 100 100 100 100 100
2 2007-02-09 00:00:00 100. 100. 100. 100. 101.
3 2007-02-12 00:00:00 100. 100. 100. 100. 101.
4 2007-02-13 00:00:00 99.9 99.9 100. 99.9 100.
5 2007-02-14 00:00:00 100. 100. 99.9 100. 99.9
6 2007-02-15 00:00:00 100. 100. 99.9 100. 99.5
7 2007-02-16 00:00:00 100. 100. 100. 100. 100.
8 2007-02-20 00:00:00 100. 100. 99.9 100. 100.
9 2007-02-21 00:00:00 101. 100. 100. 100. 101.
10 2007-02-22 00:00:00 101. 101. 100. 100. 102.
# ... with 3,375 more rows
Could you help me? I tried to follow other posts, but creating the portfolio does not seem to work, and as a consequence I cannot do any backtesting.
There are different packages that could help you run a backtest. Which is most appropriate (and whether you will want to use a package at all) will depend on how fine-grained a backtest you want to run.
Here is one example, using the PMwR package (which I maintain).
I start by creating a dataset of five assets, using data from Kenneth French's website.
library("PMwR")
library("NMOF")
P <- French(tempdir(),
"5_Industry_Portfolios_daily_CSV.zip",
frequency = "daily",
price.series = TRUE)
head(P)
## Cnsmr Manuf HiTec Hlth Other
## 1926-06-30 1.000000 1.000000 1.000000 1.000000 1.000000
## 1926-07-01 0.999200 1.002200 0.998900 1.009700 1.002100
## 1926-07-02 1.003796 1.009115 1.001997 1.011013 1.003202
## 1926-07-06 1.006507 1.011941 1.005203 1.013338 1.001296
## 1926-07-07 1.006406 1.013054 1.006409 1.016682 1.002798
## 1926-07-08 1.008821 1.013966 1.010234 1.025934 1.006709
These five series are now stored in a data frame named P.
Running a backtest for an equal-weight portfolio could look as follows:
bt <- btest(prices = list(as.matrix(P)),
timestamp = as.Date(row.names(P)),
signal = function(k) rep(1/k, k),
do.signal = "lastofquarter",
initial.cash = 100,
convert.weights = TRUE,
k = 5)
Results:
journal(bt)
## instrument timestamp amount price
## 1 Cnsmr 1926-09-30 18.13189758568 1.1082127
## 2 Manuf 1926-09-30 19.15734113773 1.0465962
## 3 HiTec 1926-09-30 19.00858248070 1.0538398
## 4 Hlth 1926-09-30 18.63527183032 1.0685114
## 5 Other 1926-09-30 18.75046122697 1.0696270
## 6 Cnsmr 1926-12-31 -0.15078058427 1.1441818
## 7 Manuf 1926-12-31 -0.03046886314 1.0757494
## ....
summary(as.NAVseries(bt))
## ---------------------------------------------------------
## 30 Jun 1926 ==> 29 Jan 2021 (24,916 data points, 0 NAs)
## 100 1528568
## ---------------------------------------------------------
## High 1590130.44 (20 Jan 2021)
## Low 43.43 (08 Jul 1932)
## ---------------------------------------------------------
## Return (%) 10.7 (annualised)
## ---------------------------------------------------------
## Max. drawdown (%) 82.3
## _ peak 245.20 (03 Sep 1929)
## _ trough 43.43 (08 Jul 1932)
## _ recovery (13 Jun 1944)
## _ underwater now (%) 3.9
## ---------------------------------------------------------
## Volatility (%) 18.1 (annualised)
## _ upside 14.4
## _ downside 11.5
## ---------------------------------------------------------
##
## Monthly returns ▁▁▆█▁▁▁
##
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec YTD
## 1926 0.0 0.0 0.0 -2.4 3.4 2.2 3.1
## 1927 1.2 4.0 1.2 1.5 5.5 -1.2 8.0 1.9 5.0 -2.2 6.0 1.7 37.2
## 1928 0.1 -1.7 9.7 3.5 3.4 -4.2 1.0 8.4 1.7 1.2 10.5 0.7 38.8
## 1929 5.5 -0.2 -0.2 1.6 -5.5 9.2 5.1 7.6 -5.7 -18.8 -11.3 1.7 -14.1
## 1930 5.0 3.2 6.7 -2.3 -1.1 -14.6 4.6 2.1 -11.4 -8.0 -2.6 -7.3 -24.9
## 1931 7.2 10.0 -4.7 -8.0 -12.5 13.1 -4.9 0.1 -28.8 8.0 -8.8 -11.2 -39.4
## ....
## 2020 -0.5 -8.0 -12.7 13.4 5.0 1.6 5.4 6.6 -3.2 -2.2 12.3 4.4 20.6
## 2021 0.0 0.0
As I said, there are many different ways, and a number of decisions you'll have to take (transaction costs, how often to rebalance, ...); but I hope the example gets you started.
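To make the idea of an equal-weight portfolio concrete without any package, here is a minimal base-R sketch. It assumes rebalancing back to equal weights every period and no transaction costs, which is simpler than the quarterly rebalancing in the btest() call above. The price matrix is made up purely for illustration.

```r
# Made-up prices for three hypothetical assets over three dates
prices <- matrix(c(100, 100, 100,
                   110, 100,  90,
                   121,  90,  99),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(NULL, c("AAA", "BBB", "CCC")))

# Simple per-period returns of each asset
returns <- prices[-1, ] / prices[-nrow(prices), ] - 1

# Rebalancing to equal weights every period means the portfolio
# return is the plain average of the asset returns
port_ret <- rowMeans(returns)

# Portfolio value over time, starting from 100
nav <- 100 * cumprod(c(1, port_ret))
```

With real data, the same three steps apply to a matrix with one column per stock; less frequent rebalancing or transaction costs need the bookkeeping that a backtesting package provides.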
I have a sales calculation and want to define basic predicted sales as per the formula given.
df1: cutoff_sales
cutoff_sales
1
2
1
3
df2: actual df for data:
Sales
NA
NA
NA
NA
1.2
2.1
1.4
1.1
2.1
1.4
1.1
1.2
2.1
1.4
1.1
1.2
2.1
1.4
1.1
2.3
First 4 quarters are NA. Keep them as they are.
Start with the 5th row by adding the first value of cutoff_sales.
Explanation:
1. cutoff_sales is predefined by the company; 4 values are given, one for each quarter.
2. Add the q1 value of cutoff_sales to 2010q1 = ansq1
3. Add the q2 value of cutoff_sales to 2010q2 = ansq2
4. Do the same for q3 and q4.
The results of these additions are then the input for the 2011 quarters,
so ansq1 + 2011q1 = ans...
ansq2 + 2011q2 = ans...
and so on: the answers for the 2012 quarters will be the input for 2013, and likewise for the rest of the 10 years.
Please help me with this addition.
I was only able to do the first year's addition.
Please help me write a function or a loop that is iterative, as there will be many more years coming up.
Thanks.
For updated question
With the updated question, the following is one way to achieve the task. Since this is quarter data and the first four rows are NA, you can add the values of cut_off in mydf1 to Sales first. Then, you create a grouping variable. 1 indicates first quarter. You can sum up Sales with cumsum() as I suggested in my previous answer. It seems that you want to keep the NAs. So I converted 0 to NA in the end.
library(dplyr)
mydf2$Sales[5:8] <- mydf2$Sales[5:8] + mydf1$cut_off
group_by(mydf2, quarter = rep(1:4, times = n()/4)) %>%
mutate(Sales = cumsum(if_else(is.na(Sales), 0, Sales)),
Sales = na_if(Sales, 0))
Sales quarter
<dbl> <int>
1 NA 1
2 NA 2
3 NA 3
4 NA 4
5 2.20 1
6 4.10 2
7 2.40 3
8 4.10 4
9 4.30 1
10 5.50 2
11 3.50 3
12 5.30 4
13 6.40 1
14 6.90 2
15 4.60 3
16 6.50 4
17 8.50 1
18 8.30 2
19 5.70 3
20 8.80 4
DATA (mydf2 as it stands after the cut_off values have been added to rows 5 to 8)
mydf2 <- structure(list(Sales = c(NA, NA, NA, NA, 2.2, 4.1, 2.4, 4.1,
2.1, 1.4, 1.1, 1.2, 2.1, 1.4, 1.1, 1.2, 2.1, 1.4, 1.1, 2.3)), .Names = "Sales", row.names = c(NA,
-20L), class = "data.frame")
For original question
Here is one approach. I considered cases where you would have NA in your data. First, I added the values of cut_off in mydf1 to the 2010 sales. Then, I created a new variable called quarter and used it to define groups. For each group, I applied cumsum() to sum up the values. If you do not have any NA, the final line would be mutate(sales = cumsum(sales)) in the code below.
library(dplyr)
mydf2 %>%
mutate(sales = if_else(substr(sales_quarter, 1,4) == "2010", sales + mydf1$cut_off, sales)) %>%
group_by(quarter = substr(sales_quarter, 5, 6)) %>%
mutate(sales = cumsum(if_else(is.na(sales), 0, sales)))
sales_quarter sales quarter
<chr> <dbl> <chr>
1 2010Q1 2.20 Q1
2 2010Q2 4.10 Q2
3 2010Q3 2.40 Q3
4 2010Q4 4.10 Q4
5 2011Q1 4.30 Q1
6 2011Q2 5.50 Q2
7 2011Q3 3.50 Q3
8 2011Q4 5.30 Q4
9 2012Q1 6.40 Q1
10 2012Q2 6.90 Q2
11 2012Q3 4.60 Q3
12 2012Q4 6.50 Q4
13 2013Q1 8.50 Q1
14 2013Q2 8.30 Q2
15 2013Q3 5.70 Q3
16 2013Q4 8.80 Q4
DATA
mydf1 <- structure(list(cut_off = c(1, 2, 1, 3)), .Names = "cut_off", row.names = c(NA,
4L), class = "data.frame")
mydf2 <- structure(list(sales_quarter = c("2010Q1", "2010Q2", "2010Q3",
"2010Q4", "2011Q1", "2011Q2", "2011Q3", "2011Q4", "2012Q1", "2012Q2",
"2012Q3", "2012Q4", "2013Q1", "2013Q2", "2013Q3", "2013Q4"),
sales = c(1.2, 2.1, 1.4, 1.1, 2.1, 1.4, 1.1, 1.2, 2.1, 1.4,
1.1, 1.2, 2.1, 1.4, 1.1, 2.3)), .Names = c("sales_quarter",
"sales"), class = "data.frame", row.names = c(NA, -16L))
New sequential answer:
> df
year_quater sales pred_sales
1 2010Q1 1.2 NA
2 2010Q2 2.1 NA
3 2010Q3 1.4 NA
4 2010Q4 1.1 NA
5 2011Q1 2.1 NA
6 2011Q2 1.4 NA
7 2011Q3 1.1 NA
8 2011Q4 1.2 NA
9 2012Q1 2.1 NA
10 2012Q2 1.4 NA
11 2012Q3 1.1 NA
12 2012Q4 1.2 NA
13 2013Q1 2.1 NA
14 2013Q2 1.4 NA
15 2013Q3 1.1 NA
16 2013Q4 2.3 NA
pred <- c(1,2,1,3)
for(i in seq(1, nrow(df), 4)){
df$pred_sales[i:(i+3)] <- df$sales[i:(i+3)] + pred
pred <- df$pred_sales[i:(i+3)]
}
> df
year_quater sales pred_sales
1 2010Q1 1.2 2.2
2 2010Q2 2.1 4.1
3 2010Q3 1.4 2.4
4 2010Q4 1.1 4.1
5 2011Q1 2.1 4.3
6 2011Q2 1.4 5.5
7 2011Q3 1.1 3.5
8 2011Q4 1.2 5.3
9 2012Q1 2.1 6.4
10 2012Q2 1.4 6.9
11 2012Q3 1.1 4.6
12 2012Q4 1.2 6.5
13 2013Q1 2.1 8.5
14 2013Q2 1.4 8.3
15 2013Q3 1.1 5.7
16 2013Q4 2.3 8.8
This answer creates a sequence from the number of rows in your data and loops over it in steps of 4 rows. Each iteration calculates pred_sales for one year and then updates pred with those values for use in the next iteration.
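The same running total can also be written without an explicit loop. The following is a minimal base-R sketch, assuming 16 quarterly values that start in Q1 and the four cut-off values from the question; the vector names sales, cut_off, quarter and pred_sales are only illustrative. It takes a cumulative sum within each quarter via ave() and adds that quarter's cut-off once.

```r
# Sales and cut-off values from the question
sales <- c(1.2, 2.1, 1.4, 1.1, 2.1, 1.4, 1.1, 1.2,
           2.1, 1.4, 1.1, 1.2, 2.1, 1.4, 1.1, 2.3)
cut_off <- c(1, 2, 1, 3)

# Quarter index 1..4 repeated over the years (assumes the data start in Q1)
quarter <- rep(1:4, length.out = length(sales))

# Running total per quarter, plus the cut-off for that quarter
pred_sales <- ave(sales, quarter, FUN = cumsum) + cut_off[quarter]
```

For 2011Q1 this gives 1.2 + 2.1 + 1 = 4.3, which matches row 5 of the loop output above.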
I am trying to read in a time series and do a plot.ts(), however I am getting weird results. Perhaps I did something wrong. I tried including the start and end dates but the output is still wrong.
Any help appreciated. Thank you.
This is the code and output:
sales1 <- read.csv("TimeS.csv",header=TRUE)
sales1
salesT <- ts(sales1)
salesT
plot.ts(salesT)
output:
> sales1 <- read.csv("TimeS.csv",header=TRUE)
> sales1
year q1 q2 q3 q4
1 1991 4.8 4.1 6.0 6.5
2 1992 5.8 5.2 6.8 7.4
3 1993 6.0 5.6 7.5 7.8
4 1994 6.3 5.9 8.0 8.4
> salesT <- ts(sales1)
> salesT
Time Series:
Start = 1
End = 4
Frequency = 1
year q1 q2 q3 q4
1 1991 4.8 4.1 6.0 6.5
2 1992 5.8 5.2 6.8 7.4
3 1993 6.0 5.6 7.5 7.8
4 1994 6.3 5.9 8.0 8.4
> plot.ts(salesT)
It looks like I can't paste the plot. Instead of 1 graph, it has 5 separate plots stacked on top of each other.
Try this
salesT <- ts(as.vector(t(sales1[, -1])), start = c(1991, 1), frequency = 4)
The format of the original data is difficult to use directly for a time series. You could try this instead:
sales1 <- t(sales1[,-1])
sales1 <- as.vector(sales1)
my_ts <- ts(sales1, frequency = 4, start=c(1991,1))
plot.ts(my_ts)
I think you need to format it correctly. Try this:
salesT <- ts(c(t(sales1[, -1])), frequency = 4, start = c(1991, 1), end = c(1994, 4))
ts.plot(salesT)
This line turns the years into one of the series, which is unlikely to be what you want:
> salesT <- ts(sales1)
We need to transpose the data frame so that it is read across the rows rather than down the columns, and we use c to turn the resulting matrix into a vector that forms the data portion of the series. (continued after chart)
# create sales1
Lines <- "year q1 q2 q3 q4
1 1991 4.8 4.1 6.0 6.5
2 1992 5.8 5.2 6.8 7.4
3 1993 6.0 5.6 7.5 7.8
4 1994 6.3 5.9 8.0 8.4"
sales1 <- read.table(text = Lines, header = TRUE)
# convert to ts and plot
salesT <- ts(c(t(sales1[-1])), start = sales1[1, 1], freq = 4)
plot(salesT)
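To see why the transpose matters, the short sketch below compares the two read orders on the same table. It assumes the sales1 layout shown in the question; by_column and by_row are illustrative names. R stores a matrix column by column, so flattening without t() would put all the q1 values first.

```r
# Same table as in the question, built directly
sales1 <- data.frame(year = 1991:1994,
                     q1 = c(4.8, 5.8, 6.0, 6.3),
                     q2 = c(4.1, 5.2, 5.6, 5.9),
                     q3 = c(6.0, 6.8, 7.5, 8.0),
                     q4 = c(6.5, 7.4, 7.8, 8.4))

by_column <- c(as.matrix(sales1[-1]))  # all q1 values, then all q2 values, ...
by_row    <- c(t(sales1[-1]))          # 1991 q1..q4, then 1992 q1..q4, ...

# Quarterly series in chronological order
salesT <- ts(by_row, start = c(sales1$year[1], 1), frequency = 4)

# Pull out one year to check that the quarters line up
sales_1992 <- window(salesT, start = c(1992, 1), end = c(1992, 4))
```

Here sales_1992 should hold the four values from the 1992 row of the table, which is a quick way to confirm the series was built in the right order.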
Regarding the comment, if the data looks like this then it is more straightforward and the lines below will produce the above plot. We have assumed that the data is sorted and starts at the beginning of a year, so we do not need to use the second column:
Lines2 <- "year qtr sales
1 1991 q1 4.8
2 1991 q2 4.1
3 1991 q3 6.0
4 1991 q4 6.5
5 1992 q1 5.8
6 1992 q2 5.2
7 1992 q3 6.8
8 1992 q4 7.4
9 1993 q1 6.0
10 1993 q2 5.6
11 1993 q3 7.5
12 1993 q4 7.8
13 1994 q1 6.3
14 1994 q2 5.9
15 1994 q3 8.0
16 1994 q4 8.4"
sales2 <- read.table(text = Lines2, header = TRUE)
salesT2 <- ts(sales2$sales, start = sales2$year[1], freq = 4)
plot(salesT2)
Update fixed. Added response to comments.
I am trying to calculate diameter growth for a set of trees over a number of years in a dataframe in which each row is a given tree during a given year. Typically, this sort of data has each individual stem as a single row with that stem's diameter for each year given in a separate column, but for various reasons, this dataframe needs to remain such that each row is an individual stem in an individual year. A simplistic model version of the data would be as follows
df<-data.frame("Stem"=c(1:5,1:5,1,2,3,5,1,2,3,5,6),
"Year"=c(rep(1997,5), rep(1998,5), rep(1999,4), rep(2000,5)),
"Diameter"=c(1:5,seq(1.5,5.5,1),2,3,4,6,3,5,7,9,15))
df
Stem Year Diameter
1 1 1997 1.0
2 2 1997 2.0
3 3 1997 3.0
4 4 1997 4.0
5 5 1997 5.0
6 1 1998 1.5
7 2 1998 2.5
8 3 1998 3.5
9 4 1998 4.5
10 5 1998 5.5
11 1 1999 2.0
12 2 1999 3.0
13 3 1999 4.0
14 5 1999 6.0
15 1 2000 3.0
16 2 2000 5.0
17 3 2000 7.0
18 5 2000 9.0
19 6 2000 15.0
What I am trying to accomplish is to make a new column that takes the diameter for a given stem in a given year and subtracts the diameter for that same stem in the previous year. I assume that this will require some set of nested for loops. Something like
for (i in 1:length(unique(df$Stem))){
for (t in 2:length(unique(df$Year))){
.....
}
}
What I'm struggling with is how to write the function that calculates:
Diameter[t]-Diameter[t-1] for each stem. Any suggestions would be greatly appreciated.
Try:
> do.call(rbind, lapply(split(df, df$Stem), function(x) transform(x, diff = c(0,diff(x$Diameter)))))
Stem Year Diameter diff
1.1 1 1997 1.0 0.0
1.6 1 1998 1.5 0.5
1.11 1 1999 2.0 0.5
1.15 1 2000 3.0 1.0
2.2 2 1997 2.0 0.0
2.7 2 1998 2.5 0.5
2.12 2 1999 3.0 0.5
2.16 2 2000 5.0 2.0
3.3 3 1997 3.0 0.0
3.8 3 1998 3.5 0.5
3.13 3 1999 4.0 0.5
3.17 3 2000 7.0 3.0
4.4 4 1997 4.0 0.0
4.9 4 1998 4.5 0.5
5.5 5 1997 5.0 0.0
5.10 5 1998 5.5 0.5
5.14 5 1999 6.0 0.5
5.18 5 2000 9.0 3.0
6 6 2000 15.0 0.0
Rnso's answer works. You could also do the slightly shorter:
> df <- df[order(df$Stem, df$Year), ]
> df$diff <- unlist(tapply(df$Diameter, df$Stem, function(x) c(NA, diff(x))))
Stem Year Diameter diff
1 1 1997 1.0 NA
6 1 1998 1.5 0.5
11 1 1999 2.0 0.5
15 1 2000 3.0 1.0
2 2 1997 2.0 NA
7 2 1998 2.5 0.5
12 2 1999 3.0 0.5
16 2 2000 5.0 2.0
3 3 1997 3.0 NA
8 3 1998 3.5 0.5
13 3 1999 4.0 0.5
17 3 2000 7.0 3.0
4 4 1997 4.0 NA
9 4 1998 4.5 0.5
5 5 1997 5.0 NA
10 5 1998 5.5 0.5
14 5 1999 6.0 0.5
18 5 2000 9.0 3.0
19 6 2000 15.0 NA
Or if you're willing to use the data.table package you can be very succinct:
>require(data.table)
>DT <- data.table(df)
>setkey(DT,Stem)
>DT <- DT[,diff:= c(NA, diff(Diameter)), by = Stem]
>df <- as.data.frame(DT)
Stem Year Diameter diff
1 1 1997 1.0 NA
2 1 1998 1.5 0.5
3 1 1999 2.0 0.5
4 1 2000 3.0 1.0
5 2 1997 2.0 NA
6 2 1998 2.5 0.5
7 2 1999 3.0 0.5
8 2 2000 5.0 2.0
9 3 1997 3.0 NA
10 3 1998 3.5 0.5
11 3 1999 4.0 0.5
12 3 2000 7.0 3.0
13 4 1997 4.0 NA
14 4 1998 4.5 0.5
15 5 1997 5.0 NA
16 5 1998 5.5 0.5
17 5 1999 6.0 0.5
18 5 2000 9.0 3.0
19 6 2000 15.0 NA
If you have a large dataset, data.table has the advantage of being extremely fast.
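Another base-R option is ave(), which returns a vector aligned with the rows of the data frame, so no rbind or unlist step is needed. The sketch below assumes the df from the question; the column names growth, years_elapsed and growth_per_year are hypothetical. Dividing by the number of years elapsed is an added precaution for stems that skip a census year, which does not happen in this sample.

```r
# Data from the question
df <- data.frame(Stem = c(1:5, 1:5, 1, 2, 3, 5, 1, 2, 3, 5, 6),
                 Year = c(rep(1997, 5), rep(1998, 5), rep(1999, 4), rep(2000, 5)),
                 Diameter = c(1:5, seq(1.5, 5.5, 1), 2, 3, 4, 6, 3, 5, 7, 9, 15))

# Sort so that each stem's rows are in chronological order
df <- df[order(df$Stem, df$Year), ]

# Change in diameter since the stem's previous record (NA for its first record)
df$growth <- ave(df$Diameter, df$Stem, FUN = function(x) c(NA, diff(x)))

# Years between consecutive records of the same stem
df$years_elapsed <- ave(df$Year, df$Stem, FUN = function(x) c(NA, diff(x)))

# Annualised growth, which equals growth when records are one year apart
df$growth_per_year <- df$growth / df$years_elapsed
```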
I'd like to calculate monthly temperature anomalies on a time-series with several stations.
I call here "anomaly" the difference of a single value from a mean calculated on a period.
My data frame looks like this (let's call it "data"):
Station Year Month Temp
A 1950 1 15.6
A 1980 1 12.3
A 1990 2 11.4
A 1950 1 15.6
B 1970 1 12.3
B 1977 2 11.4
B 1977 4 18.6
B 1980 1 12.3
B 1990 11 7.4
First, I made a subset restricted to the years 1980 to 1990:
data2 <- subset(data, Year>=1980& Year<=1990)
Second, I used plyr to calculate monthly mean (let's call this "MeanBase") between 1980 and 1990 for each station:
data3 <- ddply(data2, .(Station, Month), summarise,
MeanBase = mean(Temp, na.rm=TRUE))
Now, I'd like to calculate, for each line of data, the difference between the corresponding MeanBase and the value of Temp, but I'm not sure I'm on the right track (I don't see how to use data3).
You can use ave in base R to get this.
transform(data,
Demeaned=Temp - ave(replace(Temp, Year < 1980 | Year > 1990, NA),
Station, Month, FUN=function(t) mean(t, na.rm=TRUE)))
# Station Year Month Temp Demeaned
# 1 A 1950 1 15.6 3.3
# 2 A 1980 1 12.3 0.0
# 3 A 1990 2 11.4 0.0
# 4 A 1950 1 15.6 3.3
# 5 B 1970 1 12.3 0.0
# 6 B 1977 2 11.4 NaN
# 7 B 1977 4 18.6 NaN
# 8 B 1980 1 12.3 0.0
# 9 B 1990 11 7.4 0.0
The result column will have NaNs for Month-Station combinations that have no years in your specified range.
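If you would rather keep the two-step workflow from the question (compute the base-period means first, then use them), a merge does the job. This is a minimal base-R sketch that uses aggregate() in place of ddply; the names base_period, mean_base, anom and Anomaly are hypothetical.

```r
# Data from the question
data <- data.frame(Station = c("A", "A", "A", "A", "B", "B", "B", "B", "B"),
                   Year = c(1950, 1980, 1990, 1950, 1970, 1977, 1977, 1980, 1990),
                   Month = c(1, 1, 2, 1, 1, 2, 4, 1, 11),
                   Temp = c(15.6, 12.3, 11.4, 15.6, 12.3, 11.4, 18.6, 12.3, 7.4))

# Monthly means per station over the base period 1980 to 1990
base_period <- subset(data, Year >= 1980 & Year <= 1990)
mean_base <- aggregate(Temp ~ Station + Month, data = base_period, FUN = mean)
names(mean_base)[names(mean_base) == "Temp"] <- "MeanBase"

# Attach the matching MeanBase to every row; all.x = TRUE keeps rows
# whose Station/Month combination has no base-period data (MeanBase is NA)
anom <- merge(data, mean_base, by = c("Station", "Month"), all.x = TRUE)
anom$Anomaly <- anom$Temp - anom$MeanBase
```

Note that merge() reorders the rows, and that combinations without base-period data end up as NA here rather than NaN.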