How to split a data frame by time interval in R

I have two data frames: the first holds the daily returns of 3 securities, the second the weights of those securities, as follows:
daily.return <- data.frame(date = seq.Date(from = as.Date("2015-01-01"),
                                           by = "days",
                                           length.out = 100),
                           a = runif(100, -0.1, 0.1),
                           b = runif(100, -0.1, 0.1),
                           c = runif(100, -0.1, 0.1))
weights <- data.frame(startDate = c(as.Date("2015-01-01"),
                                    as.Date("2015-02-10"),
                                    as.Date("2015-03-15")),
                      endDate = c(as.Date("2015-02-09"),
                                  as.Date("2015-03-14"),
                                  as.Date("2015-04-10")),
                      a = c(0.3, 0.5, 0.2),
                      b = c(0.4, 0.2, 0.1),
                      c = c(0.3, 0.3, 0.7))
I know how to split a data frame by weeks etc. if we convert it to xts, but how do I split this daily.return according to the startDate and endDate in weights?
Suppose a fund holds these three securities: how do I calculate the fund's NAV and daily return?

This should do the job.
daily.return <- data.frame(date = seq.Date(from = as.Date("2015-01-01"),
                                           by = "days",
                                           length.out = 100),
                           a = runif(100, -0.1, 0.1),
                           b = runif(100, -0.1, 0.1),
                           c = runif(100, -0.1, 0.1))
weights <- data.frame(startDate = c(as.Date("2015-01-01"),
                                    as.Date("2015-02-10"),
                                    as.Date("2015-03-15")),
                      endDate = c(as.Date("2015-02-09"),
                                  as.Date("2015-03-14"),
                                  as.Date("2015-04-10")),
                      a = c(0.3, 0.5, 0.2),
                      b = c(0.4, 0.2, 0.1),
                      c = c(0.3, 0.3, 0.7))
library(quantmod)  # loads xts, which provides as.xts() and index()
daily.xts <- as.xts(daily.return[, -1], daily.return[, 1])
# Assuming that the total period is the same in both data frames
weights.xts <- xts(matrix(NA, nrow(daily.xts), 3), order.by = index(daily.xts))
names(weights.xts) <- c("a", "b", "c")
# Fill each startDate/endDate window with that row's weights
for (i in 1:nrow(weights)) {
  temp.inputs <- weights[i, ]
  temp.period <- paste(temp.inputs[, 1], temp.inputs[, 2], sep = "/")
  len <- nrow(weights.xts[temp.period])
  weights.xts[temp.period, 1:3] <- matrix(rep(as.numeric(temp.inputs[, 3:5]), len), len, byrow = TRUE)
}
weighted.returns <- daily.xts * weights.xts
# rowSums() returns a plain numeric vector, so rebuild the xts with xts()
weighted.returns <- xts(rowSums(weighted.returns), order.by = index(weighted.returns))
names(weighted.returns) <- "Weighted Returns"
weighted.returns$Cumulative <- cumsum(weighted.returns)
plot(weighted.returns$Cumulative)
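For the NAV part of the question, here is a minimal sketch, under the assumption (not stated in the answer above) that the fund starts at NAV = 1 and compounds the weighted daily return each day:
# NAV series: start at 1 and compound each day's weighted return
nav <- xts(cumprod(1 + as.numeric(weighted.returns$`Weighted Returns`)),
           order.by = index(weighted.returns))
plot(nav)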

You can split daily.return according to the start and end dates in weights using apply, performing the operation row-wise:
apply(weights, 1, function(x) daily.return[daily.return$date >= x[1]
                                           & daily.return$date <= x[2], ])
This will give a list of 3 data frames split according to the date ranges in weights.
EDIT
If I have understood correctly, you want each value in columns a, b, c of daily.return to be multiplied by the respective columns in weights.
apply(weights, 1, function(x) {
  A <- daily.return[daily.return$date >= x[1] & daily.return$date <= x[2], ]
  t(t(A[, 2:4]) * as.numeric(x[3:5]))
})
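If you then want a single weighted-return matrix rather than a list, one way (a hypothetical follow-up step, not in the original answer) is to stack the pieces row-wise, since the periods do not overlap:
weighted.list <- apply(weights, 1, function(x) {
  A <- daily.return[daily.return$date >= x[1] & daily.return$date <= x[2], ]
  t(t(A[, 2:4]) * as.numeric(x[3:5]))
})
weighted.all <- do.call(rbind, weighted.list)  # one row per day, columns a, b, c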

Related

Avoid for loop using data.table

I have a simulation over time (dev_quarters) that looks like this, stored in a data.table:
simulation <- data.table(`Scenario ID` = 1, dev_quarter = 1:80, brand = 1, proportion = runif(80))
There are n_scenario scenarios and n_brand brands, and each combination has a proportion per dev_quarter.
I am trying to code the following: for each scenario and each brand, compute the difference in proportion between the beginning and the end of each year.
I made the following to recover the corresponding dev_quarters for each year:
x <- 2002:2021
lookup_T <- as.integer(format(Sys.Date(), "%Y"))
lookup_period <- data.table(years = lookup_T-x+1, quarters_t = (lookup_T-x+1)*4, quarters_t1 = (lookup_T-x+2)*4)
With a small example
n_scenario <- 1
n_brand <- 10
Ugly code that uses for loops:
result <- data.table(`Scenario ID` = numeric(), years = numeric(), brand = numeric(), proportion = numeric())
for (i in 1:n_scenario) {
  for (j in 1:n_brand) {
    prop_per_year <- c()
    # for each year
    for (k in 1:length(x)) {
      year <- lookup_period[k, ]
      quarter_start_year <- year[["quarters_t"]]
      quarter_end_year <- year[["quarters_t1"]]
      end_year_prop <- simulation[`Scenario ID` == i & brand == j & dev_quarter == quarter_end_year]
      start_year_prop <- simulation[`Scenario ID` == i & brand == j & dev_quarter == quarter_start_year]
      prop_this_year <- max(end_year_prop[["proportion"]] - start_year_prop[["proportion"]], 0)
      prop_per_year <- append(prop_per_year, prop_this_year)
    }
    result_temp <- data.table(`Scenario ID` = i, years = x, brand = j, proportion = prop_per_year)
    result <- rbind(result, result_temp)
  }
}
I considered filtering my data.table to keep only the rows where dev_quarter is a multiple of 4, but the issue with the for loops remains.
How can I avoid them using data.table ?
Thanks.
The absolute change in proportion between the 4th and 1st quarter can be calculated much more easily.
simulation[, year := 2002 + (dev_quarter - 1) %/% 4]  # easier way to calculate the year
simulation[, .(change = last(proportion) - first(proportion)),
           by = c("Scenario ID", "brand", "year")]
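If you also want the floor at zero from the original loop, a small variation on the same expression (not part of the answer above):
simulation[, .(change = pmax(last(proportion) - first(proportion), 0)),
           by = c("Scenario ID", "brand", "year")]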

Extracting highest value in some unequal periods/time-series

I have two data frames: period_example (with columns Beg and End) and price_example (with columns Date and High). I want the highest value of High for each Beg-End period. How can I do it? Thank you.
Here is the data:
period_example <- data.frame(Beg = as.Date(c("2000-01-01", "2000-01-04", "2000-01-09")),
                             End = as.Date(c("2000-01-03", "2000-01-08", "2000-01-12")))
price_example <- data.frame(Date = seq(as.Date("2000-01-01"), as.Date("2000-01-12"), by = "days"),
                            High = c(100, 105, 104, 103, 102, 106, 107, 108, 109, 110, 115, 114))
The result should be like this:
result <- data.frame(Beg = as.Date(c("2000-01-01", "2000-01-04", "2000-01-09")),
                     End = as.Date(c("2000-01-03", "2000-01-08", "2000-01-12")),
                     High = c(105, 108, 115))
I think I found a solution for this problem: you can apply a function to each row and find the max High between those dates in the other data frame:
period_example <- data.frame(Beg = as.Date(c("2000-01-01", "2000-01-04", "2000-01-09")),
                             End = as.Date(c("2000-01-03", "2000-01-08", "2000-01-12")))
price_example <- data.frame(Date = seq(as.Date("2000-01-01"), as.Date("2000-01-12"), by = "days"),
                            High = c(100, 105, 104, 103, 102, 106, 107, 108, 109, 110, 115, 114))
period_example$High <- apply(period_example, 1, function(x)
  max(price_example[price_example$Date >= x[1] & price_example$Date <= x[2], "High"]))
> period_example
Beg End High
1 2000-01-01 2000-01-03 105
2 2000-01-04 2000-01-08 108
3 2000-01-09 2000-01-12 115
data.table has a fast function for this: foverlaps.
library(data.table)
x = setDT(period_example)
y = setDT(price_example)
y[, `:=` (Beg = Date, End = Date)]  # foverlaps needs an interval (start, end) in both tables
setkey(x, Beg, End)
z = foverlaps(y, x)
z[, .(High = max(High)), by = .(Beg, End)]
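As an alternative sketch, the same overlap can be written as a data.table non-equi join (assuming fresh copies of period_example and price_example as defined in the question):
library(data.table)
setDT(period_example)
setDT(price_example)
# for each Beg/End row, take the max High among dates falling in the window
period_example[, High := price_example[.SD,
                                       on = .(Date >= Beg, Date <= End),
                                       max(High),
                                       by = .EACHI]$V1]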
This should work
period_example <- data.frame(Beg = as.Date(c("2000-01-01", "2000-01-04", "2000-01-09")),
                             End = as.Date(c("2000-01-03", "2000-01-08", "2000-01-12")))
price_example <- data.frame(Date = seq(as.Date("2000-01-01"), as.Date("2000-01-12"), by = "days"),
                            High = c(100, 105, 104, 103, 102, 106, 107, 108, 109, 110, 115, 114))
betweenDates <- function(target, beg, end) {
  beg <- as.Date(beg)
  end <- as.Date(end)
  target <- as.Date(target)
  return(target >= beg & target <= end)
}
# logical matrix: rows = periods, columns = dates
selecteDates <- sapply(price_example$Date, function(x) betweenDates(x, period_example$Beg, period_example$End))
highValues <- sapply(1:nrow(period_example), function(x) max(price_example$High[selecteDates[x, ]]))
result <- data.frame(period_example, High = highValues)

Holt Winters Forecast with Multiple Input Variables

For context, I'm a novice R user, so please forgive any incorrect terminology/processes. I am actively trying to improve my coding ability, but recently have become stumped.
I have the following data set where A * B * C = Output:
Date A B C Output
1/1/2013 177352 0.908329198 0.237047935 38187
1/2/2013 240724 0.852033865 0.237273592 48666
1/3/2013 243932 0.908380204 0.237039845 52524
1/4/2013 221485 0.820543152 0.236356733 42955
1/5/2013 202590 0.818066045 0.240900973 39925
1/6/2013 238038 0.770057722 0.247344561 45339
1/7/2013 271511 0.794258796 0.241252029 52026
1/8/2013 283434 0.807817693 0.233810703 53534
1/9/2013 275016 0.843220031 0.243769917 56530
1/10/2013 255266 0.797791324 0.238562428 48583
1/11/2013 226564 0.815791564 0.236153417 43648
1/12/2013 214366 0.800066242 0.237961133 40812
1/13/2013 256946 0.764845532 0.237640186 46702
1/14/2013 282298 0.816537843 0.234257528 53998
I have a few years' worth of data and I'm trying to forecast Output using A, B, and C. However, when I model A, B, and C individually and combine them, the Output becomes very skewed. If I forecast just Output, then I lose the input factors.
What is the best package/code to accomplish this task? I've tried Googling and searching on here numerous different methods, but haven't found the solution I'm looking for.
Here is some of the code:
DataSet1[,"Date"] <- mdy(DataSet[,"Date"])
DataSet1
TotalSet <- ts(DataSet1, frequency = 365, start =c(2013,1))
DataA <- ts(DataSet1$A, frequency = 365, start = c(2013,1))
DataB <- ts(DataSet1$B, frequency = 365, start = c(2013,1))
DataC <- ts(DataSet1$C, frequency = 365, start = c(2013,1))
OutputData <- ts(DataSet$Output, frequency = 365, start = c(2013,1))
ADecompose <- decompose(DataA)
BDecompose <- decompose(DataB)
CDecompose <- decompose(DataC)
OutputDecompose <- decompose(OutputData)
DataAHW <- HoltWinters(DataA, seasonal = "mult")
DataBHW <- HoltWinters(DataB, seasonal = "mult")
DataCHW <- HoltWinters(DataC, seasonal = "mult")
OutputDataHW <- HoltWinters(OutputData, seasonal = "mult")
FC.A <- forecast.HoltWinters(DataAHW)
FC.B <- forecast.HoltWinters(DataBHW)
FC.C <- forecast.HoltWinters(DataCHW)
FC.Output <- forecast.HoltWinters(OutputDataHW)
plot(ForecastVisits)
plot(ForecastCPV)
plot(ForecastRPC)
plot(ForecastRevenue)
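Since A * B * C = Output, the three forecasts above can also be combined directly in R instead of in Excel. This is only a sketch under that assumption: it multiplies the point forecasts and ignores how the three forecast errors interact:
h <- 30  # hypothetical horizon
FC.A <- forecast(DataAHW, h = h)
FC.B <- forecast(DataBHW, h = h)
FC.C <- forecast(DataCHW, h = h)
FC.Combined <- FC.A$mean * FC.B$mean * FC.C$mean  # combined point forecast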
Here is another model I built for the Output; I've plugged A, B, and C into it individually and then combined them in Excel. I'm sure there is a more appropriate way to handle this, but given my lack of experience I am reaching out for help.
dataset <- testData
## FORECAST
forecastingFunction <- function(dataset, lengthOfForecast) {
  dataset[, "Date"] <- mdy(dataset[, "Date"])
  myts <- ts(dataset[, "Output"], start = c(2013, 1), frequency = 365)  # column assumed to be named Output
  hwModel <- HoltWinters(myts, seasonal = "mult")
  future <- data.frame(predict(hwModel, n.ahead = lengthOfForecast, level = 0.9))
  fittedValues <- data.frame(as.numeric(hwModel$fitted[, "xhat"]))
  names(fittedValues) <- "fit"
  futureDates <- c()
  predictedValues <- rbind(fittedValues, future)
  for (i in 1:lengthOfForecast) {
    futureDateSingle <- data.frame(dataset[nrow(dataset), "Date"] + days(i))
    futureDates <- rbind(futureDates, futureDateSingle)
  }
  names(futureDates) <- "Date"
  dates <- data.frame(dataset[366:nrow(dataset), "Date"])  # fitted values start after the first full year
  names(dates) <- "Date"
  dates <- rbind(dates, futureDates)
  predictedData <- data.frame(predictedValues, dates)
  names(predictedData) <- c("predictedValues", "Date")
  finalData2 <- merge(predictedData, dataset, all.x = TRUE, all.y = FALSE, by = "Date")
  finalData2
}
finalData2 <- forecastingFunction(testData, 612)
rm(list = setdiff(ls(), "finalData2"))
write.csv(finalData2, file = "B2BForecastVisits.csv")
Thanks!

Aggregate data by week or few days

I'm trying to aggregate a data frame to obtain a table with weekly averages of a variable. The TSAgg package provides a nice solution, and I've been using it to aggregate data yearly and monthly. However, its function for aggregating data weekly simply does not work as described. Does anyone have an idea how I can fix this?
For instance, following the manual:
require(TSAgg)
# Load the data:
data(foo)
## Format the data using the timeSeries function.
foo.ts <- timeSeries(foo[, 1], "%d/%m/%Y %H:%M", foo[, 3])
## Aggregate the data into 6-month blocks using mean:
(mean.month <- monthsAgg(foo.ts, mean, 6))
# Aggregate the data into weeks, using 7 days and mean:
(foo.week <- daysAgg(foo.ts, mean, 7))
The last command doesn't work. The function is the following:
daysAgg <- function(data, process, multiple = NULL, na.rm = FALSE) {
  if (is.null(multiple)) {
    multiple = 1
  }
  if (multiple == 1) {
    day <- aggregate(data[, 8:length(data)],
                     list(day = data$day, month = data$month, year = data$year),
                     process, na.rm = na.rm)
    days <- ymd(paste(day$year, day$month, day$day))
    data2 <- data.frame(date = days, data = day[, 4:length(day)])
    names(data2) <- c("Date", names(data[8:length(data)]))
    return(data2)
  }
  temp <- data
  day <- aggregate(list(data[, 8:length(data)], count = 1),
                   list(day = data$day, month = data$month, year = data$year),
                   process, na.rm = na.rm)
  days <- ymd(paste(day$year, day$month, day$day))
  data <- data.frame(date = days, day[, 5:length(day) - 1],
                     count = day[length(day)])
  days = paste(multiple, "days")
  all.dates <- seq.Date(as.Date(data$date[1]),
                        as.Date(data$date[length(data[, 1])]), by = "day")
  dates <- data.frame(date = all.dates)
  aggreGated <- merge(dates, data, by = "date", all.x = TRUE)
  aggreGated$date <- rep(seq.Date(as.Date(data$date[1]),
                                  as.Date(data$date[length(data[, 1])]), by = days),
                         each = multiple, length = length(all.dates))
  results <- aggregate(list(aggreGated[2:length(aggreGated)]),
                       list(date = aggreGated$date), process, na.rm = TRUE)
  results <- subset(results, results$count != 0)
  results <- results[, -length(results)]
  names(results) <- c("Date", names(temp[8:length(temp)]))
  return(results)
}
The problem in the code stems from its use of the function ymd, which attaches " UTC" to the end of all dates it outputs. You can override the function by redefining ymd as
ymd <- function(x) {
  as.Date(x, "%Y %m %d")
}
before you call daysAgg.
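As a side note, if TSAgg keeps misbehaving, weekly means can be computed without it. Here is a sketch using xts::apply.weekly on hypothetical data shaped like foo.ts (timestamps plus one value column):
library(xts)
# hypothetical 6-hourly series standing in for foo.ts
ts.x <- xts(rnorm(60),
            order.by = seq(as.POSIXct("2015-01-01 00:00"), by = "6 hours", length.out = 60))
weekly.means <- apply.weekly(ts.x, mean)  # one value per calendar week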

Find the same times in n data frames in R

Consider the following example:
Date1 <- seq(from = as.POSIXct("2010-05-03 00:00"),
             to = as.POSIXct("2010-06-20 23:00"), by = 120)
Dat1 <- data.frame(DateTime = Date1,
                   x1 = rnorm(length(Date1)))
Date2 <- seq(from = as.POSIXct("2010-05-01 03:30"),
             to = as.POSIXct("2010-07-03 22:00"), by = 120)
Dat2 <- data.frame(DateTime = Date2,
                   x1 = rnorm(length(Date2)))
Date3 <- seq(from = as.POSIXct("2010-06-08 01:30"),
             to = as.POSIXct("2010-07-13 11:00"), by = 120)
Dat3Matrix <- matrix(data = rnorm(length(Date3) * 3), ncol = 3)
Dat3 <- data.frame(DateTime = Date3,
                   x1 = Dat3Matrix)
list1 <- list(Dat1, Dat2, Dat3)
Here I build three data frames as an example and place them all in a list. From here I would like to write a routine that returns the 3 data frames, keeping only the times present in each of the others, i.e. all three data frames reduced to the times common to all of them. How can this be done?
zoo has a multi-way merge. The following lapplys read.zoo over the components of list1, converting each to the zoo class; tz = "" tells it to use POSIXct for the resulting date/times. It then merges the converted components using all = FALSE so that only intersecting times are kept.
library(zoo)
z <- do.call("merge", c(lapply(setNames(list1, 1:3), read.zoo, tz = ""), all = FALSE))
If we later wish to convert z to a data frame, try dd <- data.frame(Time = time(z), coredata(z)) (data.frame rather than cbind, so the times keep their POSIXct class), but it might be better to keep z as a zoo object (or convert it to an xts object) so that further processing is simplified as well.
One approach is to find the respective indices and then subset accordingly:
idx1 <- (Dat1[,1] %in% Dat2[,1]) & (Dat1[,1] %in% Dat3[,1])
idx2 <- (Dat2[,1] %in% Dat1[,1]) & (Dat2[,1] %in% Dat3[,1])
idx3 <- (Dat3[,1] %in% Dat1[,1]) & (Dat3[,1] %in% Dat2[,1])
Now Dat1[idx1,], Dat2[idx2,], Dat3[idx3,] should give the desired result.
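Since the question asks about n data frames, the same idea can be generalized; a sketch using Reduce to intersect the time stamps while preserving the POSIXct class:
# times present in every data frame of list1
common <- Reduce(function(a, b) a[a %in% b],
                 lapply(list1, `[[`, "DateTime"))
# subset each data frame to those common times
result <- lapply(list1, function(d) d[d$DateTime %in% common, ])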
You could use merge:
res <- NULL
for (i in seq_along(list1)) {  # start at the first data frame, not the second
  dat <- list1[[i]]
  names(dat)[2] <- paste0(names(dat)[2], "_", i)
  dat[[paste0("id_", i)]] <- 1:nrow(dat)
  if (is.null(res)) {
    res <- dat
  } else {
    res <- merge(res, dat, by = "DateTime")
  }
}
I added id columns; you can use these to index the records in the original data frames.
