I couldn't find a solution to this on the net. The two xts objects match in number of rows and columns, yet I still get the following error from the merge operation: "number of items to replace is not a multiple of replacement length".
Below is the R code along with the printed output at interim steps. I am a bit new to R, so if you notice any steps in the program that could be done better, please advise me on those as well. Thanks.
> # LOAD THE SPY DATA AND CREATE A DATA FRAME WITH RETURN COLUMN
> library(quantmod)
> library(PerformanceAnalytics)
> getSymbols("SPY", src='yahoo', index.class=c("POSIXt","POSIXct"), from='2002-01-01')
> SPY<-to.monthly(SPY)
> SPY.ret<-Return.calculate(SPY$SPY.Close)
> print(head(SPY.ret))
SPY.Close
Jan 2002 NA
Feb 2002 -0.018098831
Mar 2002 0.029868840
Apr 2002 -0.059915390
May 2002 -0.005951292
Jun 2002 -0.080167070
> index(SPY.ret) = as.Date(index(SPY)) # Convert to Date format as xts index is a Date.
> colnames(SPY.ret) <- "SPY"
> print(head(SPY.ret))
SPY
2002-01-01 NA
2002-02-01 -0.018098831
2002-03-01 0.029868840
2002-04-01 -0.059915390
2002-05-01 -0.005951292
2002-06-01 -0.080167070
> #LOAD THE TRADE FILE & CREATE A DATA FRAME WITH PROFIT COLUMN
> trades = as.xts(read.zoo(file="Anvi/CSV/ARS_EW_R2_SPDR.csv", index.column="Exit.time", format="%m/%d/%Y", header=TRUE, sep=","))
Warning message:
In zoo(rval3, ix) :
some methods for “zoo” objects do not work if the index entries in ‘order.by’ are not unique
> df = trades$Profit
> print(head(df))
Profit
2003-09-30 " 0.079734219"
2004-01-31 " 0.116722585"
2004-03-31 " 0.060347888"
2004-04-30 " 0.100379816"
2004-07-31 " 0.084048027"
2004-07-31 " 0.018710103"
> df$Profit = as.numeric(trades$Profit)
> df = df$Profit # Inefficient way to convert the Profit column to numeric?
> print(head(df))
Profit
2003-09-30 0.07973422
2004-01-31 0.11672259
2004-03-31 0.06034789
2004-04-30 0.10037982
2004-07-31 0.08404803
2004-07-31 0.01871010
> df = aggregate(df, by=index(df))
> colnames(df) = "Profit"
> print(head(df))
Profit
2003-09-30 0.07973422
2004-01-31 0.11672259
2004-03-31 0.06034789
2004-04-30 0.10037982
2004-07-31 0.10275813
2004-11-30 0.02533904
>
> #MERGE THE SPY RET AND TRADE RESULTS DATA FRAMES
> temp = head(df)
> temp1 = head(SPY.ret)
> print(temp)
Profit
2003-09-30 0.07973422
2004-01-31 0.11672259
2004-03-31 0.06034789
2004-04-30 0.10037982
2004-07-31 0.10275813
2004-11-30 0.02533904
> print(temp1)
SPY
2002-01-01 NA (Note: I tried replacing the NA with 0, but I still get the same error.)
2002-02-01 -0.018098831
2002-03-01 0.029868840
2002-04-01 -0.059915390
2002-05-01 -0.005951292
2002-06-01 -0.080167070
> mdf = merge(x=temp, y=temp1, all=TRUE)
Error in z[match0(index(a), indexes), ] <- a[match0(indexes, index(a)), :
number of items to replace is not a multiple of replacement length
>
What I am trying to do above is merge the two objects so that the resulting object's index is the UNION of the two indexes and it has two columns, "SPY" and "Profit". The empty cells in each column of the merged object should be filled with 0.
aggregate returns a zoo object, not an xts object. That means the zoo method of merge is being dispatched instead of the xts method. Your code works fine if both objects are xts objects.
temp <-
structure(c(0.07973422, 0.11672259, 0.06034789, 0.10037982, 0.10275813,
0.02533904), .Dim = c(6L, 1L), index = structure(c(12325, 12448,
12508, 12538, 12630, 12752), class = "Date"), class = "zoo",
.Dimnames = list(NULL, "Profit"))
temp1 <-
structure(c(NA, -0.018098831, 0.02986884, -0.05991539, -0.005951292,
-0.08016707), .Dim = c(6L, 1L), index = structure(c(1009864800,
1012543200, 1014962400, 1017640800, 1020229200, 1022907600), tzone = "",
tclass = "Date"), .indexCLASS = "Date", tclass = "Date", .indexTZ = "",
tzone = "", .Dimnames = list(NULL, "SPY"), class = c("xts", "zoo"))
merge(temp, temp1) # error
merge(as.xts(temp), temp1, fill=0) # works, filled with zeros
# Profit SPY
# 2002-01-01 0.00000000 NA
# 2002-02-01 0.00000000 -0.018098831
# 2002-03-01 0.00000000 0.029868840
# 2002-04-01 0.00000000 -0.059915390
# 2002-05-01 0.00000000 -0.005951292
# 2002-06-01 0.00000000 -0.080167070
# 2003-09-30 0.07973422 0.000000000
# 2004-01-31 0.11672259 0.000000000
# 2004-03-31 0.06034789 0.000000000
# 2004-04-30 0.10037982 0.000000000
# 2004-07-31 0.10275813 0.000000000
# 2004-11-30 0.02533904 0.000000000
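If you would rather fix this upstream than at the merge call, a minimal sketch (using the objects from the question) is to coerce the aggregate result back to xts straight away, so every later merge dispatches the xts method:
# aggregate() on an xts object returns zoo; convert it back to xts immediately
df <- aggregate(df, by = index(df), FUN = sum)
df <- as.xts(df)
class(df)                            # "xts" "zoo"
mdf <- merge(df, SPY.ret, fill = 0)  # index is the union of both; gaps filled with 0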
I would like to construct the annualized volatility of returns for a panel data set in R. I have monthly returns (%) per firm (entity) for a large dataset.
I would like to construct the five-year average of the annualized volatility of monthly returns, per year (t+5) and per firm.
Constructing this measure by itself is not difficult, but I would like to do it in R so that it groups by firm and year. I am thankful for any help.
The data looks like this:
library(xts)
library(PerformanceAnalytics)
library(quantmod)
library(lubridate)
library(data.table)
library(stringr)
# let's fetch some real-world panel data in a similar format to that cited by OP
symbols <- c('GOOG', 'AAPL', 'AMZN')
quantmod::getSymbols(symbols,
                     auto.assign = TRUE,
                     from = Sys.time() - years(20),
                     periodicity = 'monthly')

lapply(symbols, function(x) {
  tmp <- get(x, envir = .GlobalEnv)
  tmp$Return <- CalculateReturns(Ad(tmp), method = 'discrete')
  tmp$LogReturn <- CalculateReturns(Ad(tmp), method = 'log')
  assign(x, tmp, envir = .GlobalEnv)
}) |> invisible()

panel_data_df <- lapply(symbols, function(x) {
  tmp <- get(x, envir = .GlobalEnv)
  df <- data.frame(Symbol = x,
                   Date = index(tmp),
                   Return = round(tmp$Return * 1e2, 2) |>
                     sprintf(fmt = '%s%%') |>
                     str_replace_all('NA%', NA_character_),
                   LogReturn = tmp$LogReturn)
  df
}) |>
  rbindlist() |>
  as.data.frame()
head(panel_data_df)
Symbol Date Return LogReturn
1 GOOG 2004-09-01 <NA> NA
2 GOOG 2004-10-01 47.1% 0.38593415
3 GOOG 2004-11-01 -4.54% -0.04649014
4 GOOG 2004-12-01 5.94% 0.05770476
5 GOOG 2005-01-01 1.47% 0.01457253
6 GOOG 2005-02-01 -3.9% -0.03978529
# now let's calculate the 5 year mean of annualized monthly volatility
metrics_df <- split(panel_data_df, panel_data_df$Symbol) |>
  lapply(function(x) {
    df_xts <- xts(x$LogReturn, order.by = as.POSIXct(x$Date))
    stddev_1yr <- period.apply(df_xts,
                               endpoints(df_xts, 'years', 1),
                               StdDev.annualized)
    stddev_1yr_5yr_mean <- period.apply(stddev_1yr,
                                        endpoints(stddev_1yr, 'years', 5),
                                        mean)
    stddev_1yr_5yr_mean_df <- as.data.frame(stddev_1yr_5yr_mean)
    colnames(stddev_1yr_5yr_mean_df) <- 'StDevAnn5YrMean'
    stddev_1yr_5yr_mean_df$Date <- rownames(stddev_1yr_5yr_mean_df) |>
      str_split('\\s') |>
      sapply('[', 1)
    rownames(stddev_1yr_5yr_mean_df) <- NULL
    stddev_1yr_5yr_mean_df$Symbol <- x$Symbol[1]
    stddev_1yr_5yr_mean_df
  }) |> rbindlist() |> as.data.frame()

panel_data_df <- merge(panel_data_df,
                       metrics_df,
                       by = c('Symbol', 'Date'),
                       all = TRUE)
head(panel_data_df, 50)
Symbol Date Return LogReturn StDevAnn5YrMean
1 AAPL 2002-11-01 <NA> NA NA
2 AAPL 2002-12-01 -7.55% -0.078484655 NA
3 AAPL 2003-01-01 0.21% 0.002089444 NA
4 AAPL 2003-02-01 4.53% 0.044272032 NA
5 AAPL 2003-03-01 -5.8% -0.059709353 NA
6 AAPL 2003-04-01 0.57% 0.005642860 NA
7 AAPL 2003-05-01 26.23% 0.232938925 NA
8 AAPL 2003-06-01 6.18% 0.060001124 NA
9 AAPL 2003-07-01 10.6% 0.100732953 NA
[ ... ]
26 AAPL 2004-12-01 -3.95% -0.040325449 NA
27 AAPL 2004-12-31 <NA> NA 0.2947654
28 AAPL 2005-01-01 19.41% 0.177392802 NA
29 AAPL 2005-02-01 16.67% 0.154188206 NA
30 AAPL 2005-03-01 -7.11% -0.073765972 NA
[ ... ]
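For reference (my addition, not part of the answer): StdDev.annualized scales the per-period standard deviation by the square root of the number of periods per year, so with monthly data the per-year value that feeds the five-year mean can be reproduced by hand, e.g. for GOOG in 2005:
# hand-check one firm-year: annualized monthly volatility = sd(log returns) * sqrt(12)
goog_2005 <- subset(panel_data_df, Symbol == 'GOOG' &
                      format(as.Date(Date), '%Y') == '2005')
sd(goog_2005$LogReturn, na.rm = TRUE) * sqrt(12)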
I'm interested in expanding a data frame by inserting rows of missing values, across any number of columns, for the periods where dates are missing, following the data's time unit.
Example
The problem can be easily illustrated with a simple example.
Data
The generated data contains some time series observations with dates missing at random.
# Data generation
# Seed
set.seed(1)
# Size
sizeDf <- 10
# Populate data frame
dta <- data.frame(
  dates = seq(
    from = Sys.Date() - (sizeDf - 1),
    to = Sys.Date(),
    by = 1
  ),
  varA = runif(n = sizeDf),
  varB = runif(n = sizeDf),
  varC = runif(n = sizeDf)
)
# Delete rows
dta <-
  dta[-sample(1:sizeDf, replace = TRUE, size = round(sqrt(sizeDf), 0)), ]
Preview
>> dta
dates varA varB varC
1 2016-07-28 0.26550866 0.2059746 0.93470523
2 2016-07-29 0.37212390 0.1765568 0.21214252
3 2016-07-30 0.57285336 0.6870228 0.65167377
4 2016-07-31 0.90820779 0.3841037 0.12555510
7 2016-08-03 0.94467527 0.7176185 0.01339033
8 2016-08-04 0.66079779 0.9919061 0.38238796
9 2016-08-05 0.62911404 0.3800352 0.86969085
10 2016-08-06 0.06178627 0.7774452 0.34034900
Key characteristics
From the perspective of the proposed analysis, the key characteristics are:
The date unit (days in this case)
Randomly missing dates
Missing dates
seq(
  from = Sys.Date() - (sizeDf - 1),
  to = Sys.Date(),
  by = 1
)[!(seq(
  from = Sys.Date() - (sizeDf - 1),
  to = Sys.Date(),
  by = 1
) %in% dta$dates)]
"2016-08-01" "2016-08-02"
Desired results
The newly created data frame should look like this:
>> dtaNew
dates varA varB varC
1 2016-07-28 0.3337749 0.32535215 0.8762692
2 2016-07-29 0.4763512 0.75708715 0.7789147
3 2016-07-30 0.8921983 0.20269226 0.7973088
4 2016-07-31 0.8643395 0.71112122 0.4552745
5 2016-08-01 NA NA NA
6 2016-08-02 NA NA NA
7 2016-08-03 0.9606180 0.14330438 0.6049333
8 2016-08-04 0.4346595 0.23962942 0.6547239
9 2016-08-05 0.7125147 0.05893438 0.3531973
10 2016-08-06 0.3999944 0.64228826 0.2702601
This is simply obtained with:
dtaNew[dtaNew$dates %in% missDates, 2:4] <- NA
where missDates is taken from the previous seq call.
Attempts
Creating a vector with all the dates is simple:
allDates <- seq(from = min(dta$dates), to = max(dta$dates), by = 1)
but obviously I cannot just push it to the data frame:
>> dta$allDates <- allDates
Error in `$<-.data.frame`(`*tmp*`, "allDates", value = c(17010, 17011, :
replacement has 10 rows, data has 8
A possible solution could use a loop that pushes a row of NA values into the data frame for each of the dates identified as missing, but this is grossly inefficient and messy.
To sum up, I'm interested in achieving the following:
Expanding the data frame with all the dates, following the same unit, i.e. for missing daily data days are added, for missing quarterly data quarters are added.
Then pushing NA values across all the columns of the data frame wherever a missing date was found.
If I understand your question, you can use rbind.fill from the plyr package to get your desired output:
sizeDf <- 10
# Populate data frame
dta <- data.frame(
  dates = seq(
    from = Sys.Date() - (sizeDf - 1),
    to = Sys.Date(),
    by = 1
  ),
  varA = runif(n = sizeDf),
  varB = runif(n = sizeDf),
  varC = runif(n = sizeDf)
)
# Delete rows
dta <- dta[-sample(1:sizeDf, replace = TRUE, size = round(sqrt(sizeDf), 0)), ]
#Get missing dates
missing_dates <- seq(from=min(dta$dates), to=max(dta$dates), by=1)[!(seq(from=min(dta$dates), to=max(dta$dates), by=1) %in% dta$dates)]
#Create the new dataset by using plyr's rbind.fill function
dta_new <- plyr::rbind.fill(dta,data.frame(dates=missing_dates))
#Order the data by the dates column
dta_new <- dta_new[order(dta_new$dates),]
#Print it
print(dta_new, row.names = F, right = F)
dates varA varB varC
2016-07-28 0.837859418 0.2966637 0.61245244
2016-07-29 0.144884547 0.9284294 0.11033990
2016-07-30 NA NA NA
2016-07-31 NA NA NA
2016-08-01 0.003167049 0.9096805 0.29239470
2016-08-02 0.574859760 0.1466993 0.69541969
2016-08-03 NA NA NA
2016-08-04 0.748639215 0.9602836 0.67681826
2016-08-05 0.983939562 0.4867804 0.35270309
2016-08-06 0.383366957 0.2241982 0.09244522
I hope this helps.
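A base-R alternative (my sketch, not part of the answer above) is to merge against the complete date sequence; merge() with all.x = TRUE fills the unmatched rows with NA automatically:
# build the full daily sequence and let merge() fill the gaps with NA
all_dates <- data.frame(dates = seq(from = min(dta$dates), to = max(dta$dates), by = 1))
dta_new   <- merge(all_dates, dta, by = "dates", all.x = TRUE)
print(dta_new, row.names = FALSE, right = FALSE)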
I am trying to summarize dates by ID based on the max() of ExitDate. When I run the following code, however, I receive this message:
In max.default(structure(NA_real_, class = "Date"), na.rm = TRUE) :
no non-missing arguments to max; returning -Inf
I have imported the data and set the date values using setAs. Using setClass eliminated the initial warning message (as noted in another answer) but I don't know how to eliminate these other warning messages.
Any advice would be greatly appreciated!
setClass("myDate")
setAs("character", "myDate", function(from)
as.Date(from, format = "%m/%d/%Y"))
prog <- read.csv("Program.csv",
stringsAsFactors = FALSE,
colClass = c("EntryDate" = "myDate",
"ExitDate" = "myDate",
"DateUpdated"= "myDate")
prog2 <- prog %>%
group_by(id, EntryDate) %>%
summarize(new_exit = as.Date(max(ExitDate, na.rm = TRUE), origin ="1970-01-01")) %>%
right_join(prg, by = c("id", "EntryDate"))
id EntryDate ExitDate
1 5 2014-10-06 <NA>
2 5 2014-02-05 2014-02-21
3 3 2014-02-05 2014-02-28
4 3 2014-09-30 2014-11-25
5 3 2014-11-25 <NA>
6 4 2014-10-03 <NA>
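For context (my addition, not from the original post): the warning comes from groups where every ExitDate is NA, since max(..., na.rm = TRUE) on an all-NA vector has nothing left to compare and returns -Inf. A minimal sketch:
# reproduce the warning with a single all-NA Date value
max(as.Date(NA), na.rm = TRUE)
# Warning: no non-missing arguments to max; returning -Inf

# illustrative guard (safe_max is a made-up helper, not part of the OP's code)
safe_max <- function(x) if (all(is.na(x))) as.Date(NA) else max(x, na.rm = TRUE)
Such a helper could then replace max(ExitDate, na.rm = TRUE) inside summarize().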
Currently I have multiple dataframes in a list with the following format:
datetime precip code
1 2015-04-15 00:00:00 NA M
2 2015-04-15 01:00:00 NA M
3 2015-04-15 02:00:00 NA M
4 2015-04-15 03:00:00 NA M
5 2015-04-15 04:00:00 NA M
6 2015-04-15 05:00:00 NA M
Each dataframe has a different start and end date, but I would like each dataframe to span from 2015-04-01 0:00:00 to 2015-11-30 23:59:59. I would like to generate rows for the missing dates in datetime in each dataframe and fill the precip column with NA, so that I have a continuous time series with nrow = 5856 in each dataframe.
Ignore the code column. If values exist for precip, do not alter them; just fill the additional datetime rows with NA.
My attempt so far yields an error:
library(dplyr)
dates <- seq.POSIXt(as.POSIXlt("2015-04-01 0:00:00"), as.POSIXlt("2015-11-30 23:59:59"), by="hour",tz="GMT")
ts <- format.POSIXct(dates,"%Y/%m/%d %H:%M")
df <- data.frame(datetime=ts)
dat=mylist
final_list <- lapply(dat, function(x) full_join(df,dat$precip))
Error in UseMethod("tbl_vars") :
no applicable method for 'tbl_vars' applied to an object of class "c('double', 'numeric')"
link to sample file in case it is needed
Thanks for your suggestions.
As vitor pointed out above, you can only join two data.frames, not a data.frame and a vector. dplyr also plays nice with POSIXct, but not POSIXlt (Hadley has a preference), so if you store your data as actual time, it will be easier to join usefully.
Further, within lapply, you need to use the variable of the function you create (x here), or you'll just be repeating the same thing. Don't subset the data.frames, either, if you want to join them; you need a column in each with the same name and data type.
All together, you need something like:
library(dplyr)
df$datetime <- as.POSIXct(df$datetime, tz = "GMT")
df <- tbl_df(df) # not necessary, but prints nicely
list_df <- list(df, df) # fake list of data.frames
# make a data.frame of sequence to join on
seq_df <- data_frame(datetime = seq.POSIXt(as.POSIXct("2015-04-01 0:00:00", tz = 'GMT'),
                                           as.POSIXct("2015-11-30 23:59:59", tz = 'GMT'),
                                           by = "hour", tz = "GMT"))
lapply(list_df, function(x){full_join(x, seq_df)})
# Joining by: "datetime"
# Joining by: "datetime"
# [[1]]
# Source: local data frame [5,857 x 3]
#
# datetime precip code
# (POSI) (lgl) (fctr)
# 1 2015-04-15 00:00:00 NA M
# 2 2015-04-15 01:00:00 NA M
# 3 2015-04-15 02:00:00 NA M
# 4 2015-04-15 03:00:00 NA M
# 5 2015-04-15 04:00:00 NA M
# 6 2015-04-15 05:00:00 NA M
# 7 2015-04-01 04:00:00 NA NA
# 8 2015-04-01 05:00:00 NA NA
# 9 2015-04-01 06:00:00 NA NA
# 10 2015-04-01 07:00:00 NA NA
# .. ... ... ...
#
# [[2]]
# Source: local data frame [5,857 x 3]
#
# datetime precip code
# (POSI) (lgl) (fctr)
# 1 2015-04-15 00:00:00 NA M
# 2 2015-04-15 01:00:00 NA M
# 3 2015-04-15 02:00:00 NA M
# 4 2015-04-15 03:00:00 NA M
# 5 2015-04-15 04:00:00 NA M
# 6 2015-04-15 05:00:00 NA M
# 7 2015-04-01 04:00:00 NA NA
# 8 2015-04-01 05:00:00 NA NA
# 9 2015-04-01 06:00:00 NA NA
# 10 2015-04-01 07:00:00 NA NA
# .. ... ... ...
Data:
df <- structure(list(datetime = structure(c(1429056000, 1429059600, 1429063200, 1429066800,
1429070400, 1429074000), class = c("POSIXct", "POSIXt"), tzone = "GMT"), precip = c(NA,
NA, NA, NA, NA, NA), code = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "M",
class = "factor")), .Names = c("datetime", "precip", "code"), row.names = c("1",
"2", "3", "4", "5", "6"), class = c("tbl_df", "tbl", "data.frame"))
Hi, I have a dataframe object called NormalizedPnL. Here it is:
print(head(NormalizedPnL))
print(class(NormalizedPnL))
print(class(NormalizedPnL[,1]))
print(class(NormalizedPnL[,2]))
print(class(NormalizedPnL[,3]))
businessdate Strat A Strat B Strat C
1 2014-01-01 0.000000000 0.00000000 0.000000000
2 2014-01-02 0.016764200 0.04218263 0.011912007
3 2014-01-03 0.001179697 -0.02683310 -0.000897083
4 2014-01-06 -0.033131903 0.01902207 0.021104512
5 2014-01-07 -0.033215587 -0.06347915 -0.018900792
6 2014-01-08 0.045181350 -0.00732205 -0.016600410
[1] "data.frame"
[1] "Date" # you can see here the this column is a date
[1] "numeric"
[1] "numeric"
Now I make it an xts object:
NormalizedPnL_xts<- xts(NormalizedPnL[,-1], order.by= as.Date(NormalizedPnL[,1]))
print(head(NormalizedPnL_xts))
print(class(head(NormalizedPnL_xts)))
print(class(NormalizedPnL_xts[,1]))
print(class(NormalizedPnL_xts[,2]))
print(class(NormalizedPnL_xts[,3]))
Strat A Strat B Strat C
2014-01-01 0.000000000 0.00000000 0.000000000
2014-01-02 0.016764200 0.04218263 0.011912007
2014-01-03 0.001179697 -0.02683310 -0.000897083
2014-01-06 -0.033131903 0.01902207 0.021104512
2014-01-07 -0.033215587 -0.06347915 -0.018900792
2014-01-08 0.045181350 -0.00732205 -0.016600410
[1] "xts" "zoo"
[1] "xts" "zoo"
[1] "xts" "zoo"
[1] "xts" "zoo"
You can see that all the columns are xts/zoo objects.
Now I try to use PerformanceAnalytics:
charts.PerformanceSummary(NormalizedPnL_xts,geometric= FALSE,cex.axis=1.5)
And I get the error:
The data cannot be converted into a time series. If you are trying to pass
in names from a data object with one column, you should use the
form 'data[rows, columns, drop = FALSE]'. Rownames should have standard date formats, such as
'1985-03-15'.
Error in `[<-.data.frame`(`*tmp*`, i, 2, value = c(90600, 60400, 302000 :
replacement has 3 rows, data has 1
Error in `[<-.data.frame`(`*tmp*`, i, 8, value = c(-742200, -494800, -2474000 :
replacement has 3 rows, data has 1
Can anyone tell me what the issue is? The error mentions having 1 column but I have 3.
Works for me:
library(PerformanceAnalytics)
NormalizedPnL <-
structure(list(businessdate = c("2014-01-01", "2014-01-02", "2014-01-03",
"2014-01-06", "2014-01-07", "2014-01-08"), StratA = c(0, 0.0167642,
0.001179697, -0.033131903, -0.033215587, 0.04518135), StratB = c(0,
0.04218263, -0.0268331, 0.01902207, -0.06347915, -0.00732205),
StratC = c(0, 0.011912007, -0.000897083, 0.021104512, -0.018900792,
-0.01660041)), .Names = c("businessdate", "StratA", "StratB",
"StratC"), class = "data.frame", row.names = c("1", "2", "3",
"4", "5", "6"))
NormalizedPnL_xts <- xts(NormalizedPnL[,-1], order.by=as.Date(NormalizedPnL[,1]))
charts.PerformanceSummary(NormalizedPnL_xts,geometric=FALSE,cex.axis=1.5)
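If the call fails on your data but runs on the dput above, a reasonable next step (my guess, not something the answer confirms) is to compare the structures; a character or factor return column, or a date column that did not convert cleanly, are typical reasons the data cannot be coerced to a time series:
# compare the failing object with the working one, column by column
str(NormalizedPnL)                    # every strategy column should be numeric
str(NormalizedPnL_xts)                # should be a numeric matrix with a Date index
sapply(NormalizedPnL[, -1], class)    # spot any character/factor columns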